Situation
My company has a stock data problem. The physical inventory in our stores doesn't match what's shown in our warehouse management system (run by a vendor) or in our various inventory services.
Since this issue is critical and is being watched by executives and investors, we need a fast solution. We'll focus on getting the data flowing quickly first, then optimize later.
The problem is that we use multiple platforms to track stock, which makes comparing reports difficult since the data isn't stored in one place. To solve this, we need to bring all the data from these different platforms into our data warehouse.
In this article, I'll focus on integrating data from our vendor-managed warehouse management system. The specific data we want to bring into BigQuery is the stock log collection, which is very large. This collection tracks all stock movements: inbound shipments, outbound transfers, customer purchases on our platform, and other stock-related activities.
We need to fetch data directly from the vendor's MongoDB database because of several integration issues with our internal service. Making things worse, our CDC platform can't help us because it has technical debt and bugs for MongoDB connector, so MongoDB support isn't available right now.
Task
To make this happen, we need to complete two key steps:
- Work with the infrastructure team to set up MongoDB replica set in our environment.
- Find the right tool for processing the data, since we'll be ingesting a large collection.
Action
The Infrastructure team provisioned a MongoDB replica set in our environment. This setup ensures our data ingestion doesn't impact the warehouse management system's performance. Kudos to the Infra team for always helping us related to the data engineering tasks.
I built the data pipeline using Spark and Scala. I chose Spark because it can handle large amounts of data efficiently, which we need for this massive stock log collection. I then deployed the pipeline on a Dataproc cluster so it can run reliably in our cloud environment.
Result
The stock log data is now available in BigQuery and ready for analysis and monitoring. Our team can now compare inventory across all platforms to find mismatches between physical store counts and what the systems show.
The ingestion cost is also very low at $0.12 per day or $3.56 per month because the pipeline only runs once a day.