Reduce the annual cost of tracking services by $42,696

Situation


My company is currently using Mixpanel to monitor events within our app. However, to reduce operational expenses, the company has decided to cancel this service subscription in the second quarter of 2026.

Mixpanel has proven to be an invaluable tool for tracking event. It has become the primary source of truth for several analytics dashboards that monitor the performance of our quick-commerce application. From a data engineering perspective, the most significant feature of this service is its seamless integration process. With just a few clicks, We can access the data available in BigQuery in near-real-time, eliminating the need to create an in-house pipeline for data access. 

The company wants to eliminate the Mixpanel cost but doesn’t want to lose our events data, which makes the performance dashboard stale. Therefore, the data engineering team needs to collaborate with software engineering and infrastructure to provide the data.

Task


To achieve this, I need to follow these steps:
  1. Request the software engineering team to cutover the events data from Mixpanel to another storage system.
  2. Develop a pipeline to import the data into our BigQuery.

Action


The software engineering team has published event data to Kafka Topics.

I developed a data pipeline using Spark and Scala and deployed it on Google Dataproc. I opted for this approach because it’s well-suited for handling large volumes of data, such as search user data in our application. Additionally, I opted Google Dataproc rather than Dataflow because it is cheaper because no need to keep the service run 24 hours.

Result


I successfully set up a data pipeline that ingests event data into BigQuery. This enables the company to achieve its cost reduction initiative and maintain an up-to-date performance dashboard.

The pipeline incurs a very low cost of $0.12 per day or $3.56 per month because it runs only once a day.

The only drawback of the new pipeline is that the data has daily freshness, which is acceptable to our data stakeholders.

A Pragmatic Approach to Importing MongoDB Data into BigQuery

Situation

My company has a stock data problem. The physical inventory in our stores doesn't match what's shown in our warehouse management system (run by a vendor) or in our various inventory services.
Since this issue is critical and is being watched by executives and investors, we need a fast solution. We'll focus on getting the data flowing quickly first, then optimize later.

The problem is that we use multiple platforms to track stock, which makes comparing reports difficult since the data isn't stored in one place. To solve this, we need to bring all the data from these different platforms into our data warehouse.

In this article, I'll focus on integrating data from our vendor-managed warehouse management system. The specific data we want to bring into BigQuery is the stock log collection, which is very large. This collection tracks all stock movements: inbound shipments, outbound transfers, customer purchases on our platform, and other stock-related activities.

We need to fetch data directly from the vendor's MongoDB database because of several integration issues with our internal service. Making things worse, our CDC platform can't help us because it has technical debt and bugs for MongoDB connector, so MongoDB support isn't available right now.

Task

To make this happen, we need to complete two key steps:
  1. Work with the infrastructure team to set up MongoDB replica set in our environment.
  2. Find the right tool for processing the data, since we'll be ingesting a large collection.

Action

The Infrastructure team provisioned a MongoDB replica set in our environment. This setup ensures our data ingestion doesn't impact the warehouse management system's performance. Kudos to the Infra team for always helping us related to the data engineering tasks.

I built the data pipeline using Spark and Scala. I chose Spark because it can handle large amounts of data efficiently, which we need for this massive stock log collection. I then deployed the pipeline on a Dataproc cluster so it can run reliably in our cloud environment.

Result

The stock log data is now available in BigQuery and ready for analysis and monitoring. Our team can now compare inventory across all platforms to find mismatches between physical store counts and what the systems show.

The ingestion cost is also very low at $0.12 per day or $3.56 per month because the pipeline only runs once a day.

CDC Observability Migration: From Datadog to Grafana with $XXX in Savings

Situation

I recently finished moving our monitoring system for the Change Data Capture (CDC) service from Datadog to Grafana as part of a company-wide cost reduction effort.

Before the migration, our CDC connectors sent custom metrics directly to the Datadog agent using UDP. We also used PagerDuty for alerts - when a Datadog monitor detected an issue, it would trigger a phone call. The company has now discontinued both Datadog and PagerDuty, so we needed to find replacement solutions.

Task

What needed to be migrated:
  • Enable CDC connector custom metrics in Prometheus
  • Recreate monitoring dashboards in Grafana
  • Set up alerts in Grafana Alerting

Action

I completed the migration in three steps:
  1. Set up metric collection
  2. Recreated dashboards
    • Built new monitoring dashboards in Grafana that matched our previous Datadog views
  3. Configured alerting
    • Set up alert rules in Grafana Alerting for critical conditions
    • Set up notification channel to Google Chat Channel

Result

The migration successfully reduced our observability costs by $XXX, supporting the company's cost-saving goals.

However, we did encounter some trade-offs. Without PagerDuty phone calls, we can't respond to alerts as quickly since we now rely on Google Chat notifications. We've also noticed that some custom metrics data is occasionally lost due to the current setup not being as reliable as Datadog.

2025 in Review: First Mandate Working with Quick-Commerce Company

Situation


I started working at a quick-commerce company a year ago, after leaving the first Indonesian unicorn company in December 2024. The company is a subsidiary of my previous employer, and I joined through an internal transfer.

At first, our data platform was managed by the holding company, and I supported it while on call. Later, the holding company stopped providing support. Because of this, my company asked me to build a new data platform from scratch in a separate Google Cloud Platform (GCP) environment.


Task


The first mandate was clear. The data manager asked me to create a data platform that could seamlessly integrate with the existing reporting platform. At that time, there were only two data analysts, and they were already busy supporting data stakeholders across the company.


Action


I built the data platform using a Lambda Architecture. I ensured that the outputs in our new environment matched the data warehouse tables used by existing reports in the holding company’s environment. This allowed us to migrate smoothly without disrupting reports.


Result


I successfully built the data platform in our environment, and completed the migration smoothly with a single click by switching the Looker data sources from the holding company’s GCP project to our own. The entire migration took less than 30 minutes, and there were no issues with existing reports.

How We Ingest OLTP data into OLAP System

SITUATION

I was working at a holding company that owned more than twenty subsidiaries. The holding company managed the end-to-end data platform for all its subsidiaries.

Unexpectedly, the holding company decided to stop managing the subsidiaries' data platforms. This meant each subsidiary had to host and manage its own data platform. Adding to the surprise, I joined one of the subsidiary companies, an e-commerce business specializing in fresh products like butchery, vegetables, etc.

At that time, the subsidiary had no data engineering team, and we needed to build the data platform from scratch. However, the holding company offered us their data platform's source code, which could be installed in our environment.

TASK

As the newly formed data engineering team, we considered the pros and cons of building our data platform. Should we create a platform from scratch using the latest open-source tools, or should we adopt the holding company's stack, which consisted mostly of in-house applications and no longer maintained.

ACTIONS

Before deciding on the tools to use, we compared the following architectures:

  • Lambda Architecture
    • Layers: 3 Layers
    • Data Processing: Batch and Real-time
    • Complexity: High
    • Latency: High
    • Scalability: Moderate
  • Kappa Architecture
    • Layers: 2 Layers
    • Data Processing: Continuous stream
    • Complexity: Moderate
    • Latency: Low
    • Scalability: High
  • Medallion Architecture
    • Layers: 3 Layers
    • Data Processing: Incremental data transformation
    • Complexity: High
    • Latency: Moderate
    • Scalability: Moderate

RESULTS

We chose the Lambda Architecture for our data platform because it was more cost-effective, familiar to the team, and most of our reports did not require real-time data. Although it was more complex due to the need to maintain code for each layer, it aligned better with our requirements compared to the other architectures. Additionally, We are using the data platform's source code from the holding company which is also compatible with the Lambda Architecture.

Below is the architecture diagram of our data platform using the Lambda Architecture approach.

How We Saved $48,000 by Replacing Looker with Dbt and Superset

SITUATION

Looker was very helpful for our data scientists because its LookML feature made it easy to build data models, even with many dependencies. It also allowed our business users to explore and analyze data on their own, making self-service analytics simple with easy slice-and-dice features.

However, Looker is expensive. For our team, the cost was $48,000 every year based on the number of accounts we had.


Looker Annual Subscription Cost    = Number of Looker accounts × USD 100 × 12 months
                                                    = 40 × 100 × 12
                                                    = USD 48,000

Even though we liked using Looker, our company wanted a cheaper solution. After looking at other tools, we chose dbt to replace LookML and Apache Superset to replace Looker Studio.

TASK

As the data engineering team, my main tasks were:

  1. Create a new dbt project.
  2. Migrate Looker views to dbt models.
  3. Set up and configure Apache Superset as our new data visualization tool.

ACTIONS

Create a new dbt project

To save costs, we chose to use free and open-source tools. For dbt, we used dbt-core (the free version) instead of dbt-cloud.

Dbt Project Architecture

We created a new repository in GitLab and set up GitLab CI/CD pipelines to run dbt-core. This lets us automatically build our data models in BigQuery without extra costs.

We built two CI/CD pipelines:

  • Staging pipeline: For testing and validating new changes.
  • Production pipeline: For deploying changes to production.

You can see the details of our CI/CD setup in the images below.


To automate building all dbt models on a schedule, we set up an Airflow DAG.

For documentation, we serve dbt docs as a Kubernetes deployment and use Istio for ingress, so our team can easily access the latest documentation.

Migrate Looker views to dbt models.

The migration process was quiet easy, and i've created a separate page for it:

https://ilyas-ahsan.blogspot.com/2025/05/how-we-automatically-convert-looker.html

Set up and configure Apache Superset as our new data visualization tool.

We compared several tools to replace Looker Studio. Our top choices were Metabase and Apache Superset. We chose Apache Superset because it offers a free caching feature, which helps reduce the workload on BigQuery, especially since most dashboards only need daily data updates.

For installation, we followed the official guide for setting up Superset on Kubernetes: https://superset.apache.org/docs/installation/kubernetes

RESULTS

As a result, we successfully migrated and managed 99 Looker views into dbt models. These dbt models are made up of 396 models, 297 sources, and 491 macros.

Below is the largest dbt model lineage, showing the dependencies needed to build the fact transaction models.