Monitoring

To enable monitoring, every CrateDB Cloud service must expose a /metrics endpoint. Prometheus scrapes this endpoint automatically when the following annotations are set on the Pod:

metadata:
  annotations:
    prometheus.io/port: "<svc_port>"
    prometheus.io/scrape: "true"

For implementation details, please refer to the official Prometheus documentation.

Every service must expose at least a svc_info and a svc_status metric. Depending on the service (e.g. the enrichment service), it should also expose process-relevant information such as errors or processed messages:

# HELP svc_info Service information
# TYPE svc_info gauge
svc_info{name="enrichment", version="1.0.0"} 1.0

# HELP svc_status Status 0->GREEN, 1->YELLOW, 2->RED
# TYPE svc_status gauge
svc_status{name="crate_connection"} 0
svc_status{name="rabbitmq_connection"} 0

# HELP enrichment_messages_processed_total Total number of processed messages
# TYPE enrichment_messages_processed_total counter
enrichment_messages_processed_total{topic="topic1"} 98743345
enrichment_messages_processed_total{topic="topic2"} 5

# HELP enrichment_messages_failed_total Total number of failed messages
# TYPE enrichment_messages_failed_total counter
enrichment_messages_failed_total{topic="topic1"} 5
enrichment_messages_failed_total{topic="topic2"} 5
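For illustration, output like the sample above could be produced with a few lines of Python. This is a minimal sketch using only the standard library, not the services' actual implementation; the function names (escape_label_value, format_sample, render_metric) are hypothetical, and a real service would more likely use a Prometheus client library.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline inside a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def format_sample(name, labels, value):
    """Render one sample line, e.g. svc_status{name="crate_connection"} 0."""
    if not labels:
        return "{} {}".format(name, value)
    pairs = ",".join(
        '{}="{}"'.format(key, escape_label_value(val))
        for key, val in labels.items()
    )
    return "{}{{{}}} {}".format(name, pairs, value)


def render_metric(name, help_text, metric_type, samples):
    """Render the HELP and TYPE comments followed by all samples of a metric."""
    lines = [
        "# HELP {} {}".format(name, help_text),
        "# TYPE {} {}".format(name, metric_type),
    ]
    lines += [format_sample(name, labels, value) for labels, value in samples]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical usage: the two connection checks from the sample above.
    print(
        render_metric(
            "svc_status",
            "Status 0->GREEN, 1->YELLOW, 2->RED",
            "gauge",
            [({"name": "crate_connection"}, 0), ({"name": "rabbitmq_connection"}, 0)],
        ),
        end="",
    )
```

Note that label values must be wrapped in plain double quotes; typographic quotes are not valid in the exposition format.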

All additional metrics should comply with the Prometheus metric and label naming conventions.
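A quick self-check against a few of these conventions could look like the sketch below. It assumes the classic Prometheus character set for metric and label names, the _total suffix for counters, and the reservation of label names starting with a double underscore; the helper name check_metric is hypothetical and the checks are not exhaustive.

```python
import re

# Classic character sets for metric and label names.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def check_metric(name, metric_type, label_names=()):
    """Return a list of human-readable naming problems (empty if none found)."""
    problems = []
    if not METRIC_NAME_RE.fullmatch(name):
        problems.append("invalid metric name: {}".format(name))
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counter should end in _total: {}".format(name))
    for label in label_names:
        if not LABEL_NAME_RE.fullmatch(label):
            problems.append("invalid label name: {}".format(label))
        elif label.startswith("__"):
            problems.append("label name is reserved: {}".format(label))
    return problems


if __name__ == "__main__":
    print(check_metric("enrichment_messages_processed_total", "counter", ["topic"]))
    print(check_metric("enrichment-messages", "counter", ["__topic"]))
```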

Monitoring and API-created regions

Regions created dynamically via the API use the service discovery mechanism described below to update the monitoring configuration automatically.

When a new region is created via the cloud-api, the API uploads configuration files to an S3 bucket to register the region with the main Prometheus service for federation and alerting purposes. The process consists of three steps; the first involves three configuration files, one for each part of the setup.

  1. The API uploads the following files when a new region is created:

    • federated/[organization id]-[region name].yml: Registers the region with the edge federation job of the main Prometheus, using the Prometheus file service discovery method. The file contains the hostname of the region's Prometheus and the labels to be associated with it. All regions registered this way have an edge="true" label. The edge federation job ensures that the main Prometheus scrapes specific metrics from the registered regions.

    • blackbox/[organization id]-[region name].yml: Registers the region with the edge blackbox job of the main Prometheus, also using the Prometheus file service discovery method. The file contains the health URLs of the region's Prometheus, Alertmanager and agent, as well as the labels to be associated with them. All regions registered this way have an edge="true" label. The blackbox job alerts on downtime.

    • nginx/[organization id]-[region name].conf: Creates a new configuration for the nginx edge proxy of the main Prometheus. This allows the main Prometheus to reach the edge Prometheus instances through the nginx edge proxy, which runs on the same server. The proxy is required because each edge Prometheus uses different authentication credentials, which Prometheus itself does not support with this service discovery setup. The proxy therefore exists solely to handle basic authentication between the main Prometheus and each region's Prometheus.

  2. The main Prometheus server syncs a local copy of the files from the S3 bucket mentioned above via a cron job that runs every two minutes. Once the sync completes, nginx is reloaded gracefully to make the new configurations available.

  3. The main Prometheus service can now scrape the newly added edge regions via the nginx proxy, which handles authentication. Downtime alerts are also available.
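To make the file service discovery format more concrete, the sketch below renders the kind of content a federated/ or blackbox/ file might hold: a list of target groups, each with targets and labels. Only the edge="true" label comes from the description above; the hostname, the region label and the function name render_file_sd are assumptions for illustration, and the actual files and the upload to S3 may look different.

```python
import json


def render_file_sd(targets, labels):
    """Render a single Prometheus file_sd target group as YAML text.

    json.dumps is used only to produce safely double-quoted scalars.
    """
    lines = ["- targets:"]
    lines += ["    - {}".format(json.dumps(target)) for target in targets]
    lines.append("  labels:")
    lines += [
        "    {}: {}".format(key, json.dumps(str(value)))
        for key, value in sorted(labels.items())
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical region: the hostname and the region label are made up.
    print(
        render_file_sd(
            ["prometheus.my-region.example.com"],
            {"edge": "true", "region": "my-region"},
        ),
        end="",
    )
```

The string "true" is quoted on purpose: label values are strings, and an unquoted true would be read as a YAML boolean.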

Note

Requests from the main Prometheus to the nginx edge proxy use HTTP, while the upstream connections from the nginx edge proxy to the edge Prometheus services use HTTPS. This is considered secure because the main Prometheus and the nginx edge proxy run on the same machine. Using HTTPS between them would require nginx to handle certificates, which we want to avoid to keep things simple.

Deleting a region's monitoring

Deleting the files for a specific region automatically de-registers the region from nginx and the main Prometheus the next time the cron job on the main Prometheus server runs.
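As a rough sketch of which files are involved, the snippet below derives the three file names for a region from the naming scheme described earlier and removes them from a directory tree. The local directory is only a stand-in for wherever the files are stored; the function names region_files and delete_region_files are hypothetical, and the real removal mechanism is not shown here.

```python
import tempfile
from pathlib import Path


def region_files(organization_id, region_name):
    """Return the relative paths of the three configuration files of a region."""
    stem = "{}-{}".format(organization_id, region_name)
    return [
        "federated/{}.yml".format(stem),
        "blackbox/{}.yml".format(stem),
        "nginx/{}.conf".format(stem),
    ]


def delete_region_files(root, organization_id, region_name):
    """Delete the region's files below root and return the paths removed."""
    removed = []
    for relative in region_files(organization_id, region_name):
        path = Path(root) / relative
        if path.exists():
            path.unlink()
            removed.append(relative)
    return removed


if __name__ == "__main__":
    # Demo against a temporary directory with made-up identifiers.
    with tempfile.TemporaryDirectory() as root:
        for relative in region_files("org1", "my-region"):
            target = Path(root) / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text("")
        print(delete_region_files(root, "org1", "my-region"))
```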