A deep-dive into our data engineering efforts towards building a more scalable architecture for data processing.

At Meesho, we closely follow a data-driven approach to scaling our backend infrastructure as well as launching new features for our entrepreneurs. Most of our business and product decisions are made using insights derived from the data we collect on multiple fronts.

As the business continues to grow and more entrepreneurs and suppliers transact via Meesho every day, the data we collect keeps increasing in volume and becomes more complex to handle. Our existing data-warehousing model, with its tightly coupled infrastructure, required additional computing resources every time the data grew in size. To overcome this bottleneck, we leveraged the Delta Lake storage layer to streamline our data inflow pipelines over the existing data lake infrastructure.

What is a Data Lake?

A data lake serves as a central repository for all the raw datasets collected from multiple sources during a product life cycle; these datasets can later be transformed and accessed by different teams across the organisation.

Fig 1: Data Lake Concept Explained

Existing infrastructure challenges

At Meesho, we have primarily been using Amazon Redshift as our data-warehouse platform, and it earlier powered all our dashboards and analytics. The following are some of the challenges we faced with this data-warehousing infrastructure that prompted us to adopt Delta Lake:

  • With the increased volume of data collected, we had to create new Redshift instances every time, which increased the overall operating cost.
  • We couldn’t attach different types of resources to a DB, since horizontal scalability required uniform resource allocation across nodes.
  • Tight coupling of compute and storage. We had to create new DB instances to accommodate the increased volume of data; these additional resources introduced unnecessary computing strain on the infrastructure and increased operating costs.
  • Also, due to the tightly coupled nature of the DB instances, the infrastructure was prone to a single point of failure: any instance failure across Redshift could cause disruption on multiple levels.
  • Complex queries that aggregated data across multiple resources at a time required more computing power to run smoothly.

Organisational Data Growth: A Look at Stats

Let’s look at one of the events (Catalog Views) that has witnessed exponential growth in data volume over the years here at Meesho.

Back in December 2019, during one of the peak business hours, a catalog-views report was around 400 MB per hour. This skyrocketed in December 2020, when the same catalog-views report was accumulating data at roughly 20 GB per hour. Having witnessed roughly 50x growth in data generation within a year, we saw Delta Lake as an essential solution for handling this scale in the coming years.

Why did we choose Delta-Lake?

“Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.”

  • ACID compliance ensures reliability for our existing pipelines, and was one of the most important reasons behind choosing this library.
  • On top of the library’s existing capabilities, we were able to build a custom layer that better supports all our use cases and dynamic workloads.
  • Delta Logs — Delta Lake maintains a transaction log, which acts as a single source of truth for all data transactions across the company (a minimal usage sketch follows this list).
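
To give a feel for what these properties look like in practice, here is a minimal PySpark sketch. The bucket path, table schema, and package version are illustrative assumptions, not our production setup; it simply shows an ACID append to a Delta table and a read of an earlier version through the transaction log.

```python
from pyspark.sql import SparkSession

# Spark session configured with the open-source Delta Lake package
# (package version and configs shown here are one common way to enable Delta).
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table_path = "s3://example-bucket/delta/catalog_views"  # hypothetical path

# Initial write -- creates the table and its _delta_log transaction log.
events = spark.createDataFrame(
    [(1, "catalog_view"), (2, "catalog_view")], ["user_id", "event_type"]
)
events.write.format("delta").mode("overwrite").save(table_path)

# Appends are ACID: readers never see a partially written batch.
more = spark.createDataFrame([(3, "catalog_view")], ["user_id", "event_type"])
more.write.format("delta").mode("append").save(table_path)

# The transaction log also enables time travel to an earlier table version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(table_path)
print(v0.count())  # row count as of the first commit
```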

More importantly, we’ve built an in-house architecture that is completely configurable while maintaining a standard across all our data requirements. Let’s take a closer look at our data lake architecture and how data is ingested.

Meesho’s Data Lake Architecture

The idea is to build a decoupled architecture with effectively unlimited storage and compute, capable of handling our data volume as it continues to scale.

“Economy at scale, without compromising on overall performance, is the general rule of thumb behind a data lake.”
Fig 2: Meesho’s Data Lake Architecture Overview

The following are some of the tools and frameworks used in our architecture:

  • Delta Lake — An open-source storage layer that works with Apache Spark; it serves as an intermediary between Spark and the underlying storage (S3), tying together the data lake’s storage and compute resources.
  • Amazon S3 — Replaces the disk-based storage we had with Redshift with a storage solution that is independent of all computing resources. This means that as the organisation’s data volume grows, storage can be increased without adding computing power.
  • Amazon EMR — Provides compute clusters with the Hadoop framework, Apache Spark, Apache Hive and Livy to process the stored data when needed. These clusters enable compute-on-demand: they can be scaled up when needed and scaled down during periods of minimal usage.
  • Zeppelin — A web-based notebook that allows us to maintain workspaces while querying the data lake. This helps analysts organise their queries better and queue them to run asynchronously without having to maintain an active session. Zeppelin supports multiple interpreters and interactive data visualisation, which comes in handy for ad-hoc data analysis.
  • Apache Livy — Enables easy interaction with a Spark cluster through a REST interface. Any application can submit Spark batch jobs or code snippets to a cluster without bundling Spark’s dependent libraries (see the sketch after this list).
  • Presto — A distributed SQL query engine that lets our data analysts run interactive queries directly on the data lake.
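
As a rough illustration of how Livy fits in, the sketch below submits a PySpark script as a batch job over Livy’s REST API and polls for its completion. The endpoint hostname, S3 script path, and job arguments are assumptions for the sake of the example, not our actual configuration.

```python
import json
import time

import requests

LIVY_URL = "http://emr-master.example.internal:8998"  # hypothetical Livy endpoint

# Submit a PySpark script stored on S3 as a batch job; no local Spark install needed.
payload = {
    "file": "s3://example-bucket/jobs/aggregate_catalog_views.py",  # hypothetical job
    "args": ["--date", "2021-01-01"],
    "conf": {"spark.dynamicAllocation.enabled": "true"},
}
resp = requests.post(
    f"{LIVY_URL}/batches",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
batch_id = resp.json()["id"]

# Poll the batch state until the job reaches a terminal state.
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
    if state in ("success", "dead", "killed"):
        print(f"Batch {batch_id} finished with state: {state}")
        break
    time.sleep(10)
```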

Data Ingestion — Tech Explained

At Meesho, every piece of data has significance. Hence we ingest data from multiple analytics solutions as well as other data sources and gather it in a single location. As described in the architecture above, this data flows into our data lake via two methods:

Pull-based approach: Debezium, configured on our databases, captures the fixed datasets via binlogs and pushes them to the data platform using Kafka.
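
For illustration, a Debezium connector for this kind of binlog-based change data capture is typically registered through the Kafka Connect REST API. The sketch below assumes a MySQL source; the hostnames, credentials, table names, and topic names are hypothetical and not our actual setup.

```python
import json

import requests

CONNECT_URL = "http://kafka-connect.example.internal:8083"  # hypothetical endpoint

# Debezium tails the MySQL binlog and publishes change events to Kafka topics.
connector = {
    "name": "orders-cdc",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "orders-db.example.internal",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "********",
        "database.server.id": "5400",
        "database.server.name": "orders",
        "table.include.list": "shop.orders,shop.order_items",
        "database.history.kafka.bootstrap.servers": "kafka.example.internal:9092",
        "database.history.kafka.topic": "schema-changes.orders",
    },
}

resp = requests.post(
    f"{CONNECT_URL}/connectors",
    data=json.dumps(connector),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
```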

Push-based approach: The analytics services used by Meesho, such as CleverTap and Mixpanel, along with our internal Content Ingestion Service, use our in-house Messaging API to upload data. The API wraps the data into a common format and pushes it into Kafka.
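
The exact contract of the internal Messaging API is out of scope here, but conceptually the wrapping step looks something like the following sketch: incoming payloads are enclosed in a common envelope and produced to Kafka. The envelope fields, topic name, and broker address are hypothetical.

```python
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka.example.internal:9092"],  # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def publish(source: str, event_name: str, payload: dict) -> None:
    """Wrap an incoming payload in a common envelope and push it to Kafka."""
    envelope = {
        "source": source,              # e.g. "clevertap", "mixpanel"
        "event_name": event_name,
        "received_at": int(time.time() * 1000),
        "payload": payload,
    }
    producer.send("ingestion.events", value=envelope)  # hypothetical topic


publish("clevertap", "catalog_view", {"user_id": 42, "catalog_id": 1001})
producer.flush()
```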

Now, to move the data from the Kafka messaging layer into the final S3 storage, we utilise the pipeline configuration on our data platform by creating multiple events. These events are responsible for moving the uploaded data through Extractors and Transformers (Extractors convert the data into a readable format, and Transformers apply a schema to the dataset being passed). Once the schema is applied, the data is packaged into file chunks (JSON or Parquet) and stored in Amazon S3, organised according to the events that were used to push the data.
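
A simplified sketch of such an event pipeline, written here as a Spark Structured Streaming job, could look like the following. The topic, schema, S3 paths, and trigger interval are illustrative assumptions rather than our production configuration, and the job assumes the spark-sql-kafka connector is available on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

# "Extractor" step: pull raw bytes off Kafka and turn them into readable JSON strings.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka.example.internal:9092")  # hypothetical
    .option("subscribe", "ingestion.events")                           # hypothetical topic
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
)

# "Transformer" step: apply the event schema to the extracted payload.
schema = StructType([
    StructField("source", StringType()),
    StructField("event_name", StringType()),
    StructField("received_at", LongType()),
])
events = raw.select(from_json(col("json"), schema).alias("e")).select("e.*")

# Package the typed records into Parquet chunks on S3, partitioned by event name.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3://example-bucket/events/")                    # hypothetical path
    .option("checkpointLocation", "s3://example-bucket/checkpoints/events/")
    .partitionBy("event_name")
    .trigger(processingTime="5 minutes")
    .start()
)
query.awaitTermination()
```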

The Impact Metrics

One of the major impacts of implementing a data lake solution is that we can now scale resources virtually without limit and with zero downtime, unlike our older data-warehouse infrastructure, which experienced downtime whenever we had to scale resources.

Fig 3: Data Lake performance metrics comparison

The above graph shows that our data lake systems can handle the same volume of data with less computing power and zero downtime.

From a broader perspective, by storing all data in a unified repository, the data lake allows us to use any analytics and business-intelligence tool to derive more insights, and to cost-effectively scale storage and compute based on dynamic requirements. With our goal of achieving 10x growth in 2021 and further enhancing our tech architecture, the data lake solution will help our engineers build a much more seamless reselling platform for our entrepreneurs.

This blog has been co-authored by Alok Sharma and Ashok Lingadurai.
