Monitoring
To enable monitoring, every CrateDB Cloud service must provide a /metrics
endpoint, which Prometheus scrapes automatically when the corresponding
annotations are set on the Pod:
metadata:
  annotations:
    prometheus.io/port: "<svc_port>"
    prometheus.io/scrape: "true"
For implementation details please refer to the official Prometheus documentation.
Every service must expose at least a svc_info and a svc_status metric.
Depending on the service (e.g. the enrichment service), it should also provide
process-relevant information such as errors or processed messages:
# HELP svc_info Service information
# TYPE svc_info gauge
svc_info{name="enrichment", version="1.0.0"} 1.0
# HELP svc_status Status 0->GREEN, 1->YELLOW, 2->RED
# TYPE svc_status gauge
svc_status{name="crate_connection"} 0
svc_status{name="rabbitmq_connection"} 0
# HELP enrichment_messages_processed_total Total number of processed messages
# TYPE enrichment_messages_processed_total counter
enrichment_messages_processed_total{topic="topic1"} 98743345
enrichment_messages_processed_total{topic="topic2"} 5
# HELP enrichment_messages_failed_total Total number of failed messages
# TYPE enrichment_messages_failed_total counter
enrichment_messages_failed_total{topic="topic1"} 5
enrichment_messages_failed_total{topic="topic2"} 5
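As a minimal sketch of how a service could produce the exposition text above, the following dependency-free Python function renders the required svc_info, svc_status and per-topic counter metrics. In practice a client library such as prometheus_client would do this; the function name and the sample values are illustrative assumptions, not the actual implementation.

```python
# Sketch: render the required metrics in the Prometheus text exposition
# format by hand. A real service would use a client library instead.
def render_metrics(name, version, statuses, processed, failed):
    """Render svc_info, svc_status and per-topic counters as exposition text."""
    lines = [
        "# HELP svc_info Service information",
        "# TYPE svc_info gauge",
        f'svc_info{{name="{name}", version="{version}"}} 1.0',
        "# HELP svc_status Status 0->GREEN, 1->YELLOW, 2->RED",
        "# TYPE svc_status gauge",
    ]
    for target, status in statuses.items():
        lines.append(f'svc_status{{name="{target}"}} {status}')
    lines += [
        f"# HELP {name}_messages_processed_total Total number of processed messages",
        f"# TYPE {name}_messages_processed_total counter",
    ]
    for topic, count in processed.items():
        lines.append(f'{name}_messages_processed_total{{topic="{topic}"}} {count}')
    lines += [
        f"# HELP {name}_messages_failed_total Total number of failed messages",
        f"# TYPE {name}_messages_failed_total counter",
    ]
    for topic, count in failed.items():
        lines.append(f'{name}_messages_failed_total{{topic="{topic}"}} {count}')
    return "\n".join(lines) + "\n"


print(render_metrics(
    "enrichment",
    "1.0.0",
    {"crate_connection": 0, "rabbitmq_connection": 0},
    {"topic1": 98743345, "topic2": 5},
    {"topic1": 5, "topic2": 5},
))
```

Serving this text from the service's /metrics endpoint is then enough for the annotated Pod to be scraped.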
All additional metrics should comply with the Prometheus metric and label naming conventions.
Monitoring API-created regions
Regions created dynamically via the API use the service discovery mechanism described below to update the monitoring configuration dynamically.
When a new region is created via the cloud-api, it uploads configuration
files to an S3 bucket to register itself with the main Prometheus
service for federation and alerting purposes. The registration consists of
three parts, each with its own configuration file.
The API uploads the following files when a new region is created:
federated/[organization id]-[region name].yml
  Registers the region with the main Prometheus edge federation job. This is achieved via the Prometheus file service discovery method. The file contains the region Prometheus hostname and the labels to be associated with it. All regions using this have an edge="true" label. The edge federation job ensures the main Prometheus scrapes specific metrics of registered regions.
blackbox/[organization id]-[region name].yml
  Registers the region with the main Prometheus edge blackbox job. This is also achieved via the Prometheus file service discovery method. The file contains the region Prometheus, Alertmanager and agent health URLs as well as the labels to be associated with them. All regions using this have an edge="true" label. The blackbox job alerts about downtimes.
nginx/[organization id]-[region name].conf
  Creates a new configuration for the main Prometheus nginx edge proxy. This allows the main Prometheus to use the nginx edge proxy (which lives on the same server) to reach the edge Prometheus instances. This is required because each edge Prometheus uses different authentication credentials, which Prometheus itself does not support for file service discovery targets. Therefore a proxy was created just to handle the basic authentication between the main Prometheus and each region Prometheus.
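As a sketch of what one of the federated/ files might contain, the Prometheus file service discovery format is a list of target groups with shared labels. The hostname, organization and region values below are invented placeholders; only the edge="true" label is taken from the description above.

```yaml
# Hypothetical federated/[organization id]-[region name].yml
# (Prometheus file_sd format: targets plus labels attached to them)
- targets:
    - "prometheus.region-1.example-org.cloud"   # region Prometheus hostname (placeholder)
  labels:
    edge: "true"                                # set on all API-created regions
    organization_id: "example-org"              # placeholder
    region: "region-1"                          # placeholder
```

The blackbox/ files follow the same file_sd layout, listing the region's Prometheus, Alertmanager and agent health URLs as targets instead.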
The main Prometheus server syncs locally with the aforementioned S3 bucket
via a cron job that runs every two minutes. Once the sync completes, nginx is
reloaded to make the new configurations available gracefully. The
main Prometheus service can then scrape the newly added edge regions via the
nginx proxy, which handles authentication. Downtime alerts are also available.
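The sync-and-reload step could be sketched roughly as follows; the bucket name, local paths and directory layout are assumptions, not the actual deployment.

```shell
#!/bin/sh
# Hypothetical cron job body: mirror the S3 configuration into the local
# directories Prometheus and nginx read from, then reload nginx gracefully.
set -e
BUCKET="s3://example-monitoring-config"   # placeholder bucket name
aws s3 sync "$BUCKET/federated" /etc/prometheus/file_sd/federated --delete
aws s3 sync "$BUCKET/blackbox"  /etc/prometheus/file_sd/blackbox  --delete
aws s3 sync "$BUCKET/nginx"     /etc/nginx/conf.d/edge            --delete
nginx -s reload   # pick up new proxy configurations without dropping connections
```

Because Prometheus file service discovery watches the file_sd directories, only nginx needs an explicit reload after the sync.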
Note
Please note that main Prometheus requests to the nginx edge proxy use plain
HTTP, while the nginx edge proxy upstream connections to the
edge Prometheus services use HTTPS. This is considered secure
because the main Prometheus and the nginx edge proxy live on the same
machine, and using HTTPS in this scenario would require nginx to
handle certificates, which we want to avoid to keep things simple.
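A sketch of what one of the nginx/ proxy configurations might look like is shown below; the location path, upstream hostname and credentials are invented placeholders, not the real configuration.

```nginx
# Hypothetical nginx/[organization id]-[region name].conf:
# main Prometheus connects over plain HTTP; the proxy injects the region's
# basic auth credentials and forwards over HTTPS.
location /example-org-region-1/ {
    proxy_pass https://prometheus.region-1.example-org.cloud/;
    proxy_set_header Authorization "Basic <base64-encoded credentials>";
}
```

Each region gets its own location block and credentials, which is exactly what Prometheus cannot express per file_sd target on its own.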
Deleting a region's monitoring
Deleting the files for a specific region automatically de-registers the region
from nginx and the main Prometheus the next time the cron job runs.
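For illustration, de-registration could then be as simple as deleting the region's three files; the bucket name is a placeholder.

```shell
BUCKET="s3://example-monitoring-config"   # placeholder bucket name
aws s3 rm "$BUCKET/federated/[organization id]-[region name].yml"
aws s3 rm "$BUCKET/blackbox/[organization id]-[region name].yml"
aws s3 rm "$BUCKET/nginx/[organization id]-[region name].conf"
# the next cron sync removes the local copies and reloads nginx
```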