==========
Monitoring
==========

To enable monitoring, every CrateDB Cloud service must provide a ``/metrics``
endpoint. Prometheus_ scrapes this endpoint automatically when the following
annotations are set on the Pod::

    metadata:
      annotations:
        prometheus.io/port: "<svc_port>"
        prometheus.io/scrape: "true"

For implementation details, please refer to the official `Prometheus
documentation`_.

Every service must expose at least the ``svc_info`` and ``svc_status`` metrics.
Depending on the service (e.g. the enrichment service), it should also expose
process-relevant information such as errors or processed messages::

    # HELP svc_info Service information
    # TYPE svc_info gauge
    svc_info{name="enrichment", version="1.0.0"} 1.0

    # HELP svc_status Status 0->GREEN, 1->YELLOW, 2->RED
    # TYPE svc_status gauge
    svc_status{name="crate_connection"} 0
    svc_status{name="rabbitmq_connection"} 0

    # HELP enrichment_messages_processed_total Total number of processed messages
    # TYPE enrichment_messages_processed_total counter
    enrichment_messages_processed_total{topic="topic1"} 98743345
    enrichment_messages_processed_total{topic="topic2"} 5

    # HELP enrichment_messages_failed_total Total number of failed messages
    # TYPE enrichment_messages_failed_total counter
    enrichment_messages_failed_total{topic="topic1"} 5
    enrichment_messages_failed_total{topic="topic2"} 5

All additional metrics should comply with the
`Prometheus metric and label naming conventions`_.
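
As a rough illustration of the exposition format shown above, the following
Python sketch renders the two mandatory metrics as text. The helper names
(``escape_label_value``, ``format_sample``, ``render_metrics``) and the status
constants are hypothetical and not part of any CrateDB Cloud service. A real
service would more likely rely on an existing Prometheus client library.

```python
from typing import Dict

# Status codes as documented in the HELP line of svc_status.
GREEN, YELLOW, RED = 0, 1, 2


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: Dict[str, str], value: float) -> str:
    """Render one sample line in the Prometheus text exposition format."""
    label_str = ",".join(
        '{}="{}"'.format(key, escape_label_value(val))
        for key, val in sorted(labels.items())
    )
    return "{}{{{}}} {}".format(name, label_str, value)


def render_metrics(service: str, version: str, statuses: Dict[str, int]) -> str:
    """Render svc_info and svc_status for a hypothetical service."""
    lines = [
        "# HELP svc_info Service information",
        "# TYPE svc_info gauge",
        format_sample("svc_info", {"name": service, "version": version}, 1),
        "# HELP svc_status Status 0->GREEN, 1->YELLOW, 2->RED",
        "# TYPE svc_status gauge",
    ]
    # One svc_status sample per dependency check, e.g. a database connection.
    for check, status in sorted(statuses.items()):
        lines.append(format_sample("svc_status", {"name": check}, status))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics(
        "enrichment",
        "1.0.0",
        {"crate_connection": GREEN, "rabbitmq_connection": GREEN},
    ))
```

The returned string is what the service would send as the body of a
``/metrics`` response.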


----------------------------------
Monitoring and API-created regions
----------------------------------

Regions created dynamically via the API use the service discovery mechanism
described below to update the monitoring configuration.

When a new region is created via the ``cloud-api``, the API uploads
configuration files to an S3 bucket to register the region with the
``main Prometheus`` service for federation and alerting purposes. The
registration consists of three parts, each with its own configuration file.

#. The API uploads the following files when a new region is created:

   - ``federated/[organization id]-[region name].yml``
     Registers the region with the ``main Prometheus`` edge federation job. This
     is achieved via the `Prometheus file service discovery`_ method. The file
     contains the hostname of the region Prometheus and the labels to be
     associated with it. All regions registered this way have an
     ``edge="true"`` label. The edge federation job ensures that the
     ``main Prometheus`` scrapes specific metrics of registered regions.

   - ``blackbox/[organization id]-[region name].yml``
     Registers the region with the ``main Prometheus`` edge blackbox job. This
     is also achieved via the `Prometheus file service discovery`_ method. The
     file contains the health URLs of the region Prometheus, Alertmanager and
     agent, as well as the labels to be associated with them. All regions
     registered this way have an ``edge="true"`` label. The blackbox job alerts
     on downtime.

   - ``nginx/[organization id]-[region name].conf``
     Creates a new configuration for the ``main Prometheus nginx edge proxy``.
     This allows the ``main Prometheus`` to use the nginx edge proxy (which runs
     on the same server) to reach the edge Prometheus instances. The service
     discovery requires this because each edge Prometheus uses different
     authentication credentials, a setup that Prometheus itself does not
     support. The proxy therefore exists solely to handle basic authentication
     between the ``main Prometheus`` and each region Prometheus.

#. The ``main Prometheus`` server syncs a local copy of the previously
   mentioned S3 bucket via a cron job that is triggered every two minutes.
   Once the sync completes, nginx is reloaded gracefully to make the new
   configuration available.

#. The ``main Prometheus`` service can now scrape the newly added edge regions
   via the nginx proxy, which handles authentication. Downtime alerts are also
   available.
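
The two file service discovery files described above could be produced along
the following lines. This is only a sketch under stated assumptions: the
function names, the example hostnames and the health URL passed in are
hypothetical, and the actual ``cloud-api`` implementation may differ. The
target group layout (a list of ``targets`` plus ``labels``) follows the
`Prometheus file service discovery`_ format. The nginx configuration file is
left out on purpose because its contents are not specified here.

```python
from typing import Dict, List


def render_file_sd(targets: List[str], labels: Dict[str, str]) -> str:
    """Render a single file_sd target group as YAML text."""
    lines = ["- targets:"]
    lines += ['    - "{}"'.format(target) for target in targets]
    lines.append("  labels:")
    lines += ['    {}: "{}"'.format(k, v) for k, v in sorted(labels.items())]
    return "\n".join(lines) + "\n"


def build_region_files(
    organization_id: str,
    region_name: str,
    prometheus_host: str,
    health_urls: List[str],
) -> Dict[str, str]:
    """Map bucket keys to file contents for one newly created region."""
    # Every region registered this way carries the edge="true" label.
    labels = {"edge": "true"}
    stem = "{}-{}.yml".format(organization_id, region_name)
    return {
        "federated/" + stem: render_file_sd([prometheus_host], labels),
        "blackbox/" + stem: render_file_sd(health_urls, labels),
    }


if __name__ == "__main__":
    files = build_region_files(
        "org1",                                   # hypothetical organization id
        "region1",                                # hypothetical region name
        "prometheus.region1.example.com",         # hypothetical hostname
        ["https://prometheus.region1.example.com/-/healthy"],
    )
    for key, content in files.items():
        print(key)
        print(content)
```

Uploading the returned contents under the returned keys corresponds to the
first step above. The ``main Prometheus`` picks the files up after its next
sync.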

.. NOTE::

    Requests from the ``main Prometheus`` to the ``nginx edge proxy`` use the
    ``HTTP`` protocol, while the upstream connection from the
    ``nginx edge proxy`` to the ``edge Prometheus`` services uses ``HTTPS``.
    This is considered secure because the ``main Prometheus`` and the
    ``nginx edge proxy`` run on the same machine. Using ``HTTPS`` between them
    would require ``nginx`` to handle certificates, which we want to avoid to
    keep things simple.


Deleting region monitoring
--------------------------

Deleting the files for a specific region automatically de-registers the region
from nginx and the ``main Prometheus`` the next time the cron job runs.
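
The reason deletion propagates can be sketched as a mirror-style sync,
assuming the cron job removes local files that no longer exist in the bucket.
The ``plan_sync`` function below is hypothetical and compares file names only.
A real sync tool would also detect changed file contents.

```python
from typing import Iterable, Set, Tuple


def plan_sync(bucket_keys: Iterable[str], local_files: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (files to fetch, files to remove) for a mirror-style sync."""
    remote = set(bucket_keys)
    local = set(local_files)
    # Files only in the bucket belong to newly registered regions.
    to_fetch = remote - local
    # Files only on disk belong to regions whose files were deleted.
    to_remove = local - remote
    return to_fetch, to_remove


if __name__ == "__main__":
    fetch, remove = plan_sync(
        ["federated/org1-region1.yml"],
        ["federated/org1-region1.yml", "federated/org1-region2.yml"],
    )
    print(sorted(fetch), sorted(remove))
```

Once the local copies are removed and nginx is reloaded, the region no longer
appears in the federation and blackbox jobs or in the proxy configuration.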

.. _Prometheus: https://prometheus.io/
.. _Prometheus documentation: https://prometheus.io/docs/
.. _Prometheus metric and label naming conventions: https://prometheus.io/docs/practices/naming/
.. _Prometheus file service discovery: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#file_sd_config