Situation
I recently finished migrating the monitoring for our Change Data Capture (CDC) service from Datadog to Grafana as part of a company-wide cost-reduction effort.
Before the migration, our CDC connectors sent custom metrics directly to the Datadog agent over UDP. We also used PagerDuty for alerting: when a Datadog monitor detected an issue, it triggered a phone call. The company has since discontinued both Datadog and PagerDuty, so we needed replacements for both.
Task
What needed to be migrated:
- Make the CDC connectors' custom metrics available in Prometheus
- Recreate monitoring dashboards in Grafana
- Set up alerts in Grafana Alerting
Action
I completed the migration in three steps:
- Set up metric collection
  - Deployed the statsd_exporter service in our Kubernetes cluster
  - Switched the CDC connectors to send metrics to statsd_exporter instead of the Datadog agent
  - Configured Prometheus to scrape the metrics from statsd_exporter
- Recreated dashboards
  - Built new monitoring dashboards in Grafana that matched our previous Datadog views
- Configured alerting
  - Set up alert rules in Grafana Alerting for critical conditions
  - Set up a notification channel that posts alerts to a Google Chat channel
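
To illustrate the metric collection step, here is a minimal sketch of what a connector sends once it targets statsd_exporter instead of the Datadog agent. It is not our connector code: the metric names, the tag, and the host are hypothetical, and port 9125 is statsd_exporter's default StatsD listener, assuming the deployment does not override it. The sketch relies on statsd_exporter accepting DogStatsD-style tags, which is what lets the connectors keep their existing metric format.

```python
import socket


def format_statsd_line(name, value, metric_type="c", tags=None):
    """Build one StatsD datagram, with optional DogStatsD-style tags."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # DogStatsD tag extension: "|#key:value,key:value"
        tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        line += f"|#{tag_str}"
    return line


def send_metric(line, host="127.0.0.1", port=9125):
    """Send one datagram over UDP; the sender gets no delivery acknowledgement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric name and tag, for illustration only.
    line = format_statsd_line(
        "cdc.connector.events_processed", 1, "c", {"connector": "orders"}
    )
    send_metric(line)
```

With statsd_exporter's default name mapping, the dots in a StatsD metric name become underscores on the Prometheus side, so dashboards and alert rules would query something like cdc_connector_events_processed, with the tag exposed as a label.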
Result
The migration successfully reduced our observability costs by $XXX, supporting the company's cost-saving goals.
However, the migration came with trade-offs. Without PagerDuty phone calls, we now rely on Google Chat notifications, so we cannot respond to alerts as quickly. We have also noticed that some custom metric data is occasionally lost, because the current pipeline is not as reliable as the Datadog setup was.