ilyas ahsan: A Pragmatic Approach to Importing MongoDB Data into BigQuery

Situation

My company has a stock data problem. The physical inventory in our stores doesn't match what's shown in our (run by a vendor) or in our various inventory services.

Since this issue is critical and is being watched by executives and investors, we need a fast solution. We'll focus on getting the data flowing quickly first, then optimize later.

The problem is that we use multiple platforms to track stock, which makes comparing reports difficult since the data isn't stored in one place. To solve this, we need to bring all the data from these different platforms into our .

In this article, I'll focus on integrating data from our vendor-managed warehouse management system. The specific data we want to bring into is the stock log collection, which is very large. This collection tracks all stock movements: inbound shipments, outbound transfers, customer purchases on our platform, and other stock-related activities.

We need to fetch data directly from the vendor's database because of several integration issues with our internal service. Making things worse, our can't help us because it has and bugs for MongoDB connector, so MongoDB support isn't available right now.

Task

To make this happen, we need to complete two key steps:

Work with the infrastructure team to set up MongoDB replica set in our environment.
Find the right tool for processing the data, since we'll be ingesting a large collection.

Action

The Infrastructure team provisioned a MongoDB replica set in our environment. This setup ensures our doesn't impact the warehouse management system's performance. Kudos to the Infra team for always helping us related to the .

I built the using and . I chose Spark because it can handle large amounts of data efficiently, which we need for this massive stock log collection. I then deployed the pipeline on a so it can run reliably in our cloud environment.

Result

The stock log data is now available in BigQuery and ready for analysis and monitoring. Our team can now compare inventory across all platforms to find mismatches between physical store counts and what the systems show.

The ingestion cost is also very low at $0.12 per day or $3.56 per month because the pipeline only runs once a day.