By late 2022, we had built something we were truly proud of—a real-time ML serving system with a DAG-based executor, a feature store, and an interaction store powering key ranking and personalization models. It was a major milestone, the culmination of months of effort from data scientists, ML engineers, and backend teams. Our system was live, and we were ready to push the boundaries of experimentation.
And it worked. Mostly.
But soon, cracks appeared. Every new model needed custom feature retrieval logic, DAGs became dense and unmanageable, and scaling turned into a constant firefight. Costs surged, and infra bottlenecks slowed experimentation. Our system worked, but it wasn’t built for scale.
This is the story of how we tackled these challenges—building Model Proxy for seamless feature retrieval, optimizing real-time infra, and cutting costs while scaling to millions of QPS.
The Cost of Success
Every new Ranker model required its own feature set, often pulling from different entities. Each addition meant:
- Adding new DAG nodes in IOP
- Writing custom logic to fetch features from multiple sources (e.g., user, product, user × category)
- Inferring intermediate features (e.g., extracting category from a product to fetch user × category data)
- Optimizing I/O and dealing with the inevitable bugs
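To make the toil concrete, here is a rough sketch of what that per-model glue code looked like. It is written in Go for readability (the first-generation services were JVM-based), and the store clients, types, and feature names (`userStore`, `productStore`, `userCategoryStore`, `featureRow`) are hypothetical stand-ins, not our actual IOP node APIs.

```go
package rankerglue

import "context"

// Hypothetical store clients and types, sketched in Go for readability even
// though the first-generation services were JVM-based.
type featureRow struct {
	Category string
	Vector   []float32
}

type entityStore interface {
	Get(ctx context.Context, keys ...string) (featureRow, error)
}

var userStore, productStore, userCategoryStore entityStore // wired up elsewhere

// fetchRankerFeatures shows the kind of per-model glue every new ranker needed:
// fetch user features, fetch product features, infer the category from each
// product, fetch user x category cross features, and stitch one row per product.
func fetchRankerFeatures(ctx context.Context, userID string, productIDs []string) ([][]float32, error) {
	userFeats, err := userStore.Get(ctx, userID)
	if err != nil {
		return nil, err
	}

	rows := make([][]float32, 0, len(productIDs))
	for _, pid := range productIDs {
		prodFeats, err := productStore.Get(ctx, pid)
		if err != nil {
			return nil, err
		}

		// Intermediate inference: the category comes from the product and is
		// needed only to key the user x category lookup.
		crossFeats, err := userCategoryStore.Get(ctx, userID, prodFeats.Category)
		if err != nil {
			return nil, err
		}

		row := append(append(append([]float32{}, userFeats.Vector...), prodFeats.Vector...), crossFeats.Vector...)
		rows = append(rows, row)
	}
	return rows, nil
}
```

Every new model meant another variant of this function, with its own entities, its own intermediate lookups, and its own bugs.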
What began as clean DAGs soon turned into a tangled web of cross-dependent graphs. Every experimentation cycle meant new nodes, new dependencies, and slower iterations.
Scaling Pains (and Cassandra’s Limits)
By late 2022, our Online Feature Store was under tremendous pressure. At peak, we were serving 250–300K reads/sec and handling ~1M writes/sec, even during lean hours. All of this ran on Cassandra. While Cassandra’s distributed architecture had proven itself in production, operating it at this scale came with a heavy price. What began as a proof-of-concept capable of ~100K ops/sec quickly became an operational burden: ensuring node health, tuning compaction, and rebalancing storage were constant firefights. Under heavy load, we saw latency spikes creep in, and our total cost of ownership ballooned as cluster sizes grew.
The Interaction Store, powered by Redis, wasn’t faring much better. As user traffic scaled, Redis clusters kept growing in both size and cost. Latency spikes became more frequent, and the DMC proxy—responsible for routing requests—sometimes lost shard locality. When that happened, requests spilled across nodes, triggering cross-shard communication and degraded performance. Every incident required manual shard rebalancing to stabilize latency, turning day-to-day operations into an unsustainable grind.
Silver Linings
Despite the chaos, the system was live and delivering value:
- Real-time infrastructure was in production
- Costs dropped by 60–70% compared to offline personalization
- New experiments rolled out faster and more successfully
- User engagement metrics improved
It wasn’t perfect. It was far from easy. But it worked, and that counted for a lot.
Round Two: Solving the Top 2 Bottlenecks
With the first-gen system stretched to its limits, we stepped back. Conversations with data scientists and backend engineers revealed three recurring pain points:
- Coding feature retrieval logic for every new model was becoming unsustainable
- ML scale was exploding—bringing rising infra costs with it
- Real-time embedding search was the next big unlock
We tackled them one by one, starting with the biggest pain point.
Problem 1: No-Code Feature Retrieval for Model Inference
As our personalization models grew, a clear pattern emerged. Every new ranker needed features from multiple entities:
- Product
- User
- User × Category
- Region, cohort, sub-category, etc.
Each addition forced engineers to hand-craft feature retrieval logic, stitch together relationships, and debug brittle DAG nodes. It worked, but it slowed experimentation and made scaling unsustainable.
The breakthrough insight was that all entity relationships ultimately map back to the context entities provided at inference time (e.g., userId, productIds). With that in mind, we designed Model Proxy: a graph-driven feature retrieval system.
- Model Proxy takes a modelId and context IDs (e.g., userId, productIds)
- Loads a pre-defined feature retrieval graph from ZooKeeper
- Executes the graph to resolve entity relationships dynamically
- Outputs a 2D matrix of feature vectors
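A minimal sketch of that flow, assuming hypothetical graph and store types (`RetrievalGraph`, `FeatureStore`, and `resolveKeys` are illustrative names; the real graph schema, ZooKeeper layout, and store APIs differ, and for brevity the context IDs are narrowed to userId and productIds):

```go
package modelproxy

import "context"

// Hypothetical graph and store types sketching the flow described above.
type GraphNode struct {
	Entity   string // e.g., "user", "product", "user_x_category"
	Features []string
}

type RetrievalGraph struct {
	ModelID string
	Nodes   []GraphNode
}

type FeatureStore interface {
	// BatchGet returns one feature vector per key, in key order.
	BatchGet(ctx context.Context, entity string, keys []string, features []string) ([][]float32, error)
}

type ModelProxy struct {
	graphs map[string]RetrievalGraph // loaded from ZooKeeper config, refreshed on change
	store  FeatureStore
}

// GetFeatures resolves the pre-defined graph for modelID against the context IDs
// (userId, productIds) and returns one feature vector per candidate product.
func (mp *ModelProxy) GetFeatures(ctx context.Context, modelID, userID string, productIDs []string) ([][]float32, error) {
	graph := mp.graphs[modelID]
	matrix := make([][]float32, len(productIDs))

	for _, node := range graph.Nodes {
		// Derive this node's lookup keys from the context IDs; cross entities such
		// as user x category would also use intermediate features (e.g., each
		// product's category) resolved earlier in the graph.
		keys := resolveKeys(node, userID, productIDs)
		vecs, err := mp.store.BatchGet(ctx, node.Entity, keys, node.Features)
		if err != nil {
			return nil, err
		}
		if len(vecs) == 0 {
			continue
		}
		for i := range matrix {
			v := vecs[0] // entity keyed per request (e.g., user): broadcast to every row
			if len(vecs) == len(matrix) {
				v = vecs[i] // entity keyed per candidate (e.g., product)
			}
			matrix[i] = append(matrix[i], v...)
		}
	}
	return matrix, nil
}

// resolveKeys is a placeholder for the graph's edge-resolution logic.
func resolveKeys(node GraphNode, userID string, productIDs []string) []string {
	if node.Entity == "user" {
		return []string{userID}
	}
	return productIDs
}
```

The key point is that the per-request logic is generic: adding a new entity or a new model becomes a change to the graph config, not new retrieval code.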
The impact was immediate:
- No more custom feature retrieval code—just graph updates in config
- Feature consistency across experiments
- Faster iteration cycles for ranking, fraud detection, and beyond
Here’s a visual example that shows how this graph plays out during execution.
We extended the graph further to orchestrate multiple model calls when needed, making Model Proxy a flexible building block for real-time inference. The system was built in GoLang, with gRPC for low-latency communication and Proto3 serialization for efficient data transfer at scale.
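The serving shell itself is standard Go gRPC boilerplate. The sketch below assumes a proto3-generated stub; `RegisterModelProxyServer` and `modelProxyServer` are hypothetical names, not the actual generated API.

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":9090")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}

	s := grpc.NewServer()
	// Hypothetical proto3-generated registration; the real service and message
	// definitions are omitted here.
	// pb.RegisterModelProxyServer(s, &modelProxyServer{})

	if err := s.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```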
Problem 2: Scaling Without Breaking the Bank
With more ML use cases coming online, we needed to cut costs without compromising performance. We focused on:
- Online Feature Store
- Interaction Store
Optimizing the Online Feature Store
Our costs were concentrated in:
- Database (Cassandra)
- Cache (Redis)
- Running Pods (Java services)
Replacing Cassandra with ScyllaDB
As we hit the operational limits of large Cassandra clusters, we transitioned to ScyllaDB, a near drop-in replacement that required no major code changes. The switch brought significant benefits:
- Throughput: Matched or exceeded Cassandra's performance under identical workloads, even under high concurrency.
- Latency: Achieved consistently lower P99 latencies due to ScyllaDB's shard-per-core architecture and better I/O utilization.
- Cost Efficiency: Reduced infra footprint by ~70% through better CPU and memory efficiency, eliminating the need for over-provisioned nodes.
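Because ScyllaDB speaks CQL, the same driver code we used for Cassandra kept working; only the contact points changed. A minimal read path with the gocql driver might look like this (keyspace, table, and host names are illustrative):

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Same CQL driver, same queries: only the contact points change when
	// moving from Cassandra to ScyllaDB.
	cluster := gocql.NewCluster("scylla-node-1", "scylla-node-2", "scylla-node-3")
	cluster.Keyspace = "feature_store" // illustrative keyspace name
	cluster.Consistency = gocql.LocalQuorum

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer session.Close()

	var features []byte
	if err := session.Query(
		`SELECT features FROM user_features WHERE user_id = ?`, "user-123",
	).Scan(&features); err != nil {
		log.Fatalf("read: %v", err)
	}
	log.Printf("fetched %d bytes of features", len(features))
}
```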
Finding the Right Cache
To reduce backend load and improve response times, we benchmarked multiple caching solutions—Memcached, KeyDB, and Dragonfly—under real production traffic patterns. Dragonfly stood out due to its robust architecture and operational simplicity:
- Data Skew Handling: Efficiently managed extreme key hotness and uneven access patterns without performance degradation.
- Throughput: Delivered consistently high throughput, even with large object sizes and concurrent access.
- Ease of Adoption: Acted as a drop-in Redis replacement with full protocol compatibility—no changes needed in application code or client libraries.
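In practice, "drop-in" meant the existing Redis client simply pointed at the Dragonfly endpoint. A minimal sketch with go-redis (host and key names are illustrative):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Dragonfly speaks the Redis protocol, so the existing go-redis client
	// just points at the Dragonfly endpoint; no application changes needed.
	client := redis.NewClient(&redis.Options{Addr: "dragonfly.cache.internal:6379"})

	if err := client.Set(ctx, "user:123:recent_clicks", `["p1","p2"]`, 30*time.Minute).Err(); err != nil {
		log.Fatalf("set: %v", err)
	}

	val, err := client.Get(ctx, "user:123:recent_clicks").Result()
	if err != nil {
		log.Fatalf("get: %v", err)
	}
	log.Println("cached value:", val)
}
```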
Moving to GoLang for Cost-Efficient Serving
Our JVM-based services were reliable, but the combination of heavy GC overhead, thread-based concurrency, and large runtime footprint made them costly to operate at scale. To address this, we rewrote the core services in Go (Golang), designed from the ground up for low-latency, high-throughput serving. The move paid off along several dimensions:
- Memory Management: Java’s garbage collector required large heaps, TLABs, and frequent GC cycles, inflating memory usage and adding latency under load. Go’s concurrent mark-and-sweep GC is optimized for smaller heaps and lightweight allocations, yielding an ~80% reduction in memory footprint.
- Per-Thread vs Goroutine Overhead: Java threads reserve ~512 KB–1 MB of stack each (plus TLABs/metadata). At 50k concurrent threads, this can mean tens of GB reserved just for stacks. Go goroutines start with ~2 KB stack and grow as needed. Even at 50k concurrency, stack memory is usually hundreds of MB in total, not tens of GB.
- Concurrency Model: Java uses 1:1 OS threads, incurring expensive context switches and synchronization under high concurrency. Go uses an M:N scheduler that multiplexes millions of goroutines onto a small thread pool, cutting context-switch costs and reducing CPU utilization.
- Binary Size & Startup: Java requires JVM startup, JIT warmup, and classpath initialization, which slows cold starts and increases memory usage. Go compiles to a static binary with no runtime dependency, enabling millisecond startups and much smaller container images.
- CPU Efficiency: JVM adds JIT overhead, large heap management, and lock contention in critical sections. Go’s lean runtime and goroutine scheduling eliminated those hot spots, reducing CPU cycles per request.
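For a rough feel of the goroutine-versus-thread point above, this small, self-contained experiment launches 50k concurrent goroutines and reports the approximate extra memory the Go runtime requested (numbers are indicative and machine-dependent):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	var wg sync.WaitGroup
	for i := 0; i < 50_000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(2 * time.Second) // stand-in for a blocking feature-store call
		}()
	}

	time.Sleep(500 * time.Millisecond) // let the goroutines start
	runtime.ReadMemStats(&after)
	fmt.Printf("approx extra memory for 50k goroutines: %d MB\n",
		(after.Sys-before.Sys)/(1<<20))
	wg.Wait()
}
```

On a typical machine this lands in the low hundreds of MB, consistent with the ~2 KB starting stacks mentioned above, versus the tens of GB that 50k OS threads with ~1 MB stacks would reserve.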
Impact:
- Memory usage dropped by ~80%
- CPU utilization was significantly lower
- Faster, more efficient deployments
Optimizing the Interaction Store
We realized that we only need a user’s interaction data in Redis when they open the app. So, we implemented a tiered storage approach:
- Cold Tier (ScyllaDB)—Stores click, order, wishlist events
- Hot Tier (Redis)—Loads a user’s past interactions only when they open the app
Smart Offloading: We introduced an inactivity tracker to detect when a user session ends. At that point, Redis data was flushed back to Scylla, reducing unnecessary writes.
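A simplified sketch of the tiering logic, assuming hypothetical `ColdTier` and `HotTier` interfaces (the real services use the ScyllaDB and Redis/Dragonfly clients shown earlier, and the TTL is illustrative):

```go
package interactions

import (
	"context"
	"time"
)

// Hypothetical interfaces for the two tiers.
type ColdTier interface { // ScyllaDB: clicks, orders, wishlist events
	LoadInteractions(ctx context.Context, userID string) ([]string, error)
	SaveInteractions(ctx context.Context, userID string, events []string) error
}

type HotTier interface { // Redis/Dragonfly: only active sessions live here
	Put(ctx context.Context, userID string, events []string, ttl time.Duration) error
	Drain(ctx context.Context, userID string) ([]string, error) // read + delete
}

type InteractionStore struct {
	cold ColdTier
	hot  HotTier
}

// OnAppOpen hydrates the hot tier with the user's past interactions so that
// ranking calls during the session are served from Redis.
func (s *InteractionStore) OnAppOpen(ctx context.Context, userID string) error {
	events, err := s.cold.LoadInteractions(ctx, userID)
	if err != nil {
		return err
	}
	return s.hot.Put(ctx, userID, events, 30*time.Minute) // TTL as a safety net
}

// OnSessionEnd is triggered by the inactivity tracker: flush the session's
// interactions back to ScyllaDB and free the Redis memory.
func (s *InteractionStore) OnSessionEnd(ctx context.Context, userID string) error {
	events, err := s.hot.Drain(ctx, userID)
	if err != nil {
		return err
	}
	return s.cold.SaveInteractions(ctx, userID, events)
}
```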
Results
By the time the 2023 Mega Blockbuster Sale arrived, the second-generation system was battle-tested in production and ready for peak load. The improvements we had invested in paid off in very tangible ways:
- Scale with Confidence: Our Online Feature Store sustained 1M+ QPS during the sale without visible performance degradation. Latency stayed flat even under bursty traffic, with predictable P99s thanks to ScyllaDB’s shard-per-core model and Dragonfly’s cache efficiency.
- Massive Cost Savings: Infra costs for the Online Feature Store and Interaction Store dropped by ~60%, largely from replacing Cassandra with ScyllaDB, adopting Dragonfly, and rewriting Java services in Go. Resource headroom improved: we could handle more load with fewer nodes, reducing both direct compute costs and operational overhead.
- Operational Stability: Tiered interaction storage eliminated constant Redis growth and rebalancing fire drills. Feature retrieval logic became no-code, reducing engineering toil and ensuring consistency across experiments.
- Faster Experimentation & Impact: Data scientists could onboard new models without writing custom retrieval code, just graph config updates in Model Proxy. Experimentation cycles sped up, leading to higher iteration velocity in ranking and personalization. Downstream impact was measurable: engagement metrics improved, and real-time personalization became cheaper than ever before.
In short, the second-generation stack not only survived the biggest sale of the year, it did so with lower costs, higher stability, and faster iteration velocity. But this success also revealed the next set of challenges.
The Hidden Bottleneck: Our ML Hosting Hit a Hard Limit
While planning for 2023 MBS, we ran into a critical scalability bottleneck:
- Insufficient compute availability in our region for ML instances
- Couldn’t provision enough nodes to handle real-time inference at scale
This forced us to rethink where and how we hosted our models. The existing setup was great for prototyping—but it wasn’t built to handle the bursty, high-QPS demands of real-world production workloads.
Conclusion: From Firefighting to Future-Proofing
What started as an ambitious experiment turned into a real-time ML infrastructure that powered millions of requests per second. We battled scaling pains, rethought feature retrieval with Model Proxy, and rebuilt our infra stack for efficiency—driving down costs while improving experimentation velocity.
But new challenges emerged. Our infrastructure could now handle scale, but our ML model hosting setup hit a hard limit. With compute availability bottlenecks threatening real-time inference, we faced a critical decision: how do we make model serving as scalable and cost-efficient as the rest of our stack? That’s the next piece of the puzzle—and the story of Part 3.
🌟 BharatMLStack is now Open Source!
Everything we’ve learned while building Meesho’s real-time ML platform has been distilled into BharatMLStack — our end-to-end feature store and ML infra stack, now available to the community.
Check it out, give it a ⭐️, and join us in shaping the future of large-scale, cost-efficient ML systems.