A Pragmatic Approach to Importing MongoDB Data into BigQuery

Situation

My company has a stock data problem. The physical inventory in our stores doesn't match what's shown in our warehouse management system (run by a vendor) or in our various inventory services.
Since this issue is critical and is being watched by executives and investors, we need a fast solution. We'll focus on getting the data flowing quickly first, then optimize later.

The problem is that we use multiple platforms to track stock, which makes comparing reports difficult since the data isn't stored in one place. To solve this, we need to bring all the data from these different platforms into our data warehouse.

In this article, I'll focus on integrating data from our vendor-managed warehouse management system. The specific data we want to bring into BigQuery is the stock log collection, which is very large. This collection tracks all stock movements: inbound shipments, outbound transfers, customer purchases on our platform, and other stock-related activities.

We need to fetch data directly from the vendor's MongoDB database because of several integration issues with our internal service. Making things worse, our CDC platform can't help us because it has technical debt and bugs for MongoDB connector, so MongoDB support isn't available right now.

Task

To make this happen, we need to complete two key steps:
  1. Work with the infrastructure team to set up MongoDB replica set in our environment.
  2. Find the right tool for processing the data, since we'll be ingesting a large collection.

Action

The Infrastructure team provisioned a MongoDB replica set in our environment. This setup ensures our data ingestion doesn't impact the warehouse management system's performance. Kudos to the Infra team for always helping us related to the data engineering tasks.

I built the data pipeline using Spark and Scala. I chose Spark because it can handle large amounts of data efficiently, which we need for this massive stock log collection. I then deployed the pipeline on a Dataproc cluster so it can run reliably in our cloud environment.

Result

The stock log data is now available in BigQuery and ready for analysis and monitoring. Our team can now compare inventory across all platforms to find mismatches between physical store counts and what the systems show.

The ingestion cost is also very low at $0.12 per day or $3.56 per month because the pipeline only runs once a day.

CDC Observability Migration: From Datadog to Grafana with $XXX in Savings

Situation

I recently finished moving our monitoring system for the Change Data Capture (CDC) service from Datadog to Grafana as part of a company-wide cost reduction effort.

Before the migration, our CDC connectors sent custom metrics directly to the Datadog agent using UDP. We also used PagerDuty for alerts - when a Datadog monitor detected an issue, it would trigger a phone call. The company has now discontinued both Datadog and PagerDuty, so we needed to find replacement solutions.

Task

What needed to be migrated:
  • Enable CDC connector custom metrics in Prometheus
  • Recreate monitoring dashboards in Grafana
  • Set up alerts in Grafana Alerting

Action

I completed the migration in three steps:
  1. Set up metric collection
  2. Recreated dashboards
    • Built new monitoring dashboards in Grafana that matched our previous Datadog views
  3. Configured alerting
    • Set up alert rules in Grafana Alerting for critical conditions
    • Set up notification channel to Google Chat Channel

Result

The migration successfully reduced our observability costs by $XXX, supporting the company's cost-saving goals.

However, we did encounter some trade-offs. Without PagerDuty phone calls, we can't respond to alerts as quickly since we now rely on Google Chat notifications. We've also noticed that some custom metrics data is occasionally lost due to the current setup not being as reliable as Datadog.