==========
Monitoring
==========

To enable monitoring, every CrateDB Cloud service must provide a ``/metrics``
endpoint. Prometheus_ scrapes this endpoint automatically when the following
annotations are set on the Pod::

    metadata:
      annotations:
        prometheus.io/port: ""
        prometheus.io/scrape: "true"

For implementation details, please refer to the official `Prometheus
documentation`_.

Every service must expose at least a ``svc_info`` and a ``svc_status`` metric.
Depending on the service (e.g. the enrichment service), it should also provide
process-relevant information such as errors or processed messages::

    # HELP svc_info Service information
    # TYPE svc_info gauge
    svc_info{name="enrichment", version="1.0.0"} 1.0
    # HELP svc_status Status 0->GREEN, 1->YELLOW, 2->RED
    # TYPE svc_status gauge
    svc_status{name="crate_connection"} 0
    svc_status{name="rabbitmq_connection"} 0
    # HELP enrichment_messages_processed_total Total number of processed messages
    # TYPE enrichment_messages_processed_total counter
    enrichment_messages_processed_total{topic="topic1"} 98743345
    enrichment_messages_processed_total{topic="topic2"} 5
    # HELP enrichment_messages_failed_total Total number of failed messages
    # TYPE enrichment_messages_failed_total counter
    enrichment_messages_failed_total{topic="topic1"} 5
    enrichment_messages_failed_total{topic="topic2"} 5

All additional metrics should comply with the `Prometheus metric and label
naming conventions`_.

----------------------------------
Monitoring and API created regions
----------------------------------

Regions created dynamically via the API use the service discovery mechanism
described below to update the monitoring configuration.

When a new region is created via the ``cloud-api``, the API uploads
configuration files to an S3 bucket to register the region with the
``main Prometheus`` service for federation and alerting purposes.

The mechanism works in three steps:

#. The API uploads the following files when a new region is created:

   - ``federated/[organization id]-[region name].yml``

     Registers the region with the ``main Prometheus`` edge federation job.
     This is achieved via the `Prometheus file service discovery`_ method.
     The file contains the hostname of the region's ``Prometheus`` and the
     labels to be associated with it. All regions registered this way have an
     ``edge="true"`` label. The edge federation job ensures that the
     ``main Prometheus`` scrapes specific metrics of registered regions.

   - ``blackbox/[organization id]-[region name].yml``

     Registers the region with the ``main Prometheus`` edge blackbox job.
     This is also achieved via the `Prometheus file service discovery`_
     method. The file contains the health URLs of the region's
     ``Prometheus``, ``alertmanager`` and ``agent``, as well as the labels to
     be associated with them. All regions registered this way have an
     ``edge="true"`` label. The blackbox job alerts about downtimes.

   - ``nginx/[organization id]-[region name].conf``

     Creates a new configuration for the ``main Prometheus nginx edge proxy``.
     This allows the ``main Prometheus`` to use the nginx edge proxy (which
     runs on the same server) to reach the edge Prometheus instances. The
     proxy is required by the service discovery because each edge Prometheus
     uses different authentication credentials, which Prometheus itself does
     not support. The proxy therefore exists solely to handle the basic
     authentication between the ``main Prometheus`` and each region's
     Prometheus.

#. The ``main Prometheus`` server syncs the S3 bucket mentioned above to its
   local file system via a cron job that is triggered every two minutes. Once
   the sync has completed, nginx is reloaded gracefully to make the new
   configurations available.

#. The ``main Prometheus`` service can now scrape the newly added edge regions
   via the nginx proxy, which handles authentication. Downtime alerts are
   also available.

.. NOTE::

   Requests from the ``main Prometheus`` to the ``nginx edge proxy`` use the
   ``HTTP`` protocol, while the upstream connection from the ``nginx edge
   proxy`` to the ``edge Prometheus`` services uses ``HTTPS``. This is
   considered secure because the ``main Prometheus`` and the ``nginx edge
   proxy`` run on the same machine. Using ``HTTPS`` between them would
   require ``nginx`` to handle certificates, which we want to avoid to keep
   things simple.

Deleting a region monitoring
----------------------------

Deleting the files for a specific region automatically de-registers the region
from nginx and the ``main Prometheus`` the next time the cron job runs.

.. _Prometheus: https://prometheus.io/
.. _Prometheus documentation: https://prometheus.io/docs/
.. _Prometheus metric and label naming conventions: https://prometheus.io/docs/practices/naming/
.. _Prometheus file service discovery: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#file_sd_config
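
As an illustration of the ``federated/[organization id]-[region name].yml``
registration file described above, the following is a minimal sketch of how
such a file could be built. It is not the actual ``cloud-api``
implementation. The function name, the example hostname, and the
``organization_id`` and ``region`` label names are assumptions made for this
example; only the file path pattern and the ``edge="true"`` label come from
this document. The body is serialized as JSON, which is also valid YAML, to
avoid a third-party dependency.

```python
import json
from typing import Tuple


def federated_sd_entry(org_id: str, region: str, prometheus_host: str) -> Tuple[str, str]:
    """Return (object key, file body) for a region's federation target file.

    Hypothetical helper: the real cloud-api may structure this differently.
    """
    # Path pattern taken from the document: federated/[organization id]-[region name].yml
    key = f"federated/{org_id}-{region}.yml"

    # Prometheus file-based service discovery expects a list of target groups,
    # each with a "targets" list and an optional "labels" mapping.
    target_groups = [
        {
            "targets": [prometheus_host],
            "labels": {
                "edge": "true",             # label named in the document
                "organization_id": org_id,  # assumed label name
                "region": region,           # assumed label name
            },
        }
    ]
    return key, json.dumps(target_groups, indent=2)


if __name__ == "__main__":
    # Example values are placeholders, not real identifiers.
    key, body = federated_sd_entry("org-123", "my-region", "prometheus.my-region.example.invalid")
    print(key)
    print(body)
```

Removing the object under the returned key would correspond to the
de-registration described in the section on deleting a region's monitoring.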