I’m sure you’ve played Jenga.
It’s a difficult game, especially after the first dozen or so blocks. It requires a lot of care and planning 🙇
Our earlier engineering framework was much like a huge Jenga pile. To improve any part of this complex structure, we had to be careful about the entire framework, which was a huge task.
This wasn’t very efficient 🗑
We moved to a microservice-based framework: think of smaller Jenga blocks making up the entire ecosystem.
To solve a problem in one particular block, all we need to do is touch that particular microservice. The rest remains undisturbed 😌
However, this structure brought with itself another problem: a single request passes through multiple services. This made it difficult to trace a request across all the services.
We solved it by implementing distributed tracing.
Meesho’s feed system architecture 🧱
The majority of services in our feed system are Spring framework-based Java microservices. Our feed system has a layered architecture, where each layer has its own responsibility and each domain service operates within its bounded context.
To fulfil feed functionality, each request passes through multiple services:
Our choice for implementing distributed tracing: Sleuth 🕵️
After exploring our options, we chose Sleuth for distributed tracing because of its auto-configuration capability and compatibility with other Spring libraries.
As explained in the following flowchart, the first public service that doesn’t find the Trace ID and Span ID in the request headers generates these IDs and attaches them to ThreadLocal, while SLF4J (a logging facade for Java) reads these ThreadLocal variables to get the IDs for logging.
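To make this concrete, here’s a minimal, illustrative sketch of what that looks like from inside a service: once the IDs sit in the SLF4J MDC, ordinary log statements on the request thread can carry them. The class and method names below are hypothetical, and the exact MDC key names depend on the Sleuth version in use.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Hypothetical service class, for illustration only.
public class FeedViewService {

    private static final Logger log = LoggerFactory.getLogger(FeedViewService.class);

    public void buildFeed(String userId) {
        // Sleuth has already stored the IDs in the MDC for this request thread;
        // the key names ("traceId"/"spanId") vary across Sleuth versions.
        String traceId = MDC.get("traceId");
        String spanId = MDC.get("spanId");

        // This line carries the same Trace ID as every other service the request touches,
        // which is what makes the request traceable end to end.
        log.info("Building feed for user={} traceId={} spanId={}", userId, traceId, spanId);
    }
}
```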
We used RestTemplate as our HTTP client and implemented an interceptor that passes these IDs downstream as headers.
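A minimal sketch of such an interceptor, assuming the IDs are read from the SLF4J MDC and forwarded as B3-style headers (the MDC keys, header names, and class name here are illustrative, not our exact implementation):

```java
import java.io.IOException;

import org.slf4j.MDC;
import org.springframework.http.HttpRequest;
import org.springframework.http.client.ClientHttpRequestExecution;
import org.springframework.http.client.ClientHttpRequestInterceptor;
import org.springframework.http.client.ClientHttpResponse;

// Copies the current thread's trace identifiers onto every outgoing RestTemplate call.
public class TracePropagationInterceptor implements ClientHttpRequestInterceptor {

    @Override
    public ClientHttpResponse intercept(HttpRequest request, byte[] body,
                                        ClientHttpRequestExecution execution) throws IOException {
        String traceId = MDC.get("traceId");
        String spanId = MDC.get("spanId");
        if (traceId != null) {
            request.getHeaders().add("X-B3-TraceId", traceId);
        }
        if (spanId != null) {
            request.getHeaders().add("X-B3-SpanId", spanId);
        }
        return execution.execute(request, body);
    }
}
```

The interceptor is registered on the shared RestTemplate, for example via `restTemplate.getInterceptors().add(new TracePropagationInterceptor())`, so every downstream service sees the same Trace ID in its incoming headers.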
As every software engineer knows, nothing works smoothly on the first go. In our case, we used async processing, where execution switched from thread A to thread B.
However, this meant the newly created thread B never received the ThreadLocals of thread A 🤕
To solve this problem, we had two options: either use LazyTraceExecutor, which wraps the executor and copies the ThreadLocals from the submitting thread to the worker thread, or write custom code to copy them ourselves.
We chose the latter 😎
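A minimal sketch of the copy-it-ourselves approach, assuming the ThreadLocal state we care about is the SLF4J MDC map that holds the trace and span IDs (the wrapper class name is hypothetical):

```java
import java.util.Map;
import java.util.concurrent.Callable;

import org.slf4j.MDC;

// Wraps a task so that thread A's MDC (trace and span IDs) is restored on thread B.
public final class MdcCopyingCallable<V> implements Callable<V> {

    private final Callable<V> delegate;
    private final Map<String, String> parentContext;

    public MdcCopyingCallable(Callable<V> delegate) {
        this.delegate = delegate;
        // Captured on the submitting thread (thread A), at construction time.
        this.parentContext = MDC.getCopyOfContextMap();
    }

    @Override
    public V call() throws Exception {
        // Runs on the worker thread (thread B): install A's context, run, then clean up.
        if (parentContext != null) {
            MDC.setContextMap(parentContext);
        }
        try {
            return delegate.call();
        } finally {
            MDC.clear();
        }
    }
}
```

Every task submitted to the async executor is wrapped this way, for example `executor.submit(new MdcCopyingCallable<>(task))`, so the logs emitted on thread B keep the same Trace ID as thread A.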
Using the above solution, we solved the problem of async execution but faced another hurdle — this time in our aggregator services.
The aggregator service employs a bulkhead pattern for isolation and fault tolerance using the resilience4j library.
Since the resilience4j bulkhead class does not provide a way to override the executor, we couldn’t simply use LazyTraceExecutor or a custom callable.
However, resilience4j does expose context propagators for carrying thread context across threads, so we wrote our own context propagator and passed it as configuration to the bulkhead library.
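A sketch of what such a propagator can look like, assuming the context we want to carry over is the SLF4J MDC map holding the trace and span IDs (our production class may capture Sleuth’s trace context differently):

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Consumer;
import java.util.function.Supplier;

import io.github.resilience4j.core.ContextPropagator;
import org.slf4j.MDC;

// Carries the calling thread's MDC onto the bulkhead's worker threads.
public class MdcContextPropagator implements ContextPropagator<Map<String, String>> {

    @Override
    public Supplier<Optional<Map<String, String>>> retrieve() {
        // Runs on the calling thread: capture its MDC.
        return () -> Optional.ofNullable(MDC.getCopyOfContextMap());
    }

    @Override
    public Consumer<Optional<Map<String, String>>> copy() {
        // Runs on the bulkhead thread before the task executes.
        return context -> context.ifPresent(MDC::setContextMap);
    }

    @Override
    public Consumer<Optional<Map<String, String>>> clear() {
        // Runs on the bulkhead thread after the task completes.
        return context -> MDC.clear();
    }
}
```

The propagator is then passed in when building the bulkhead configuration, for example `ThreadPoolBulkheadConfig.custom().contextPropagator(new MdcContextPropagator()).build()`, so every call isolated by the bulkhead still logs with the original request’s Trace ID.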
Results 💪
This way, we obtained the Trace ID for the affected User ID in one of our view layer services, then queried our centralised logging system. This showed us the whole request trace across services, and we instantly figured out that the A/B service was returning an unexpected response to one of our domain services, which caused the issue.
As a bonus, taking the time and effort to improve the logging and debugging process here is paying dividends already, as we’ve uncovered previously unknown bottlenecks in our code. So we also think of this as an investment for the future.
More to come 🚀
If you like what you’re seeing, stay tuned because we’ll continue to document our journey and share how our technology stack evolves. Or better yet, join us! We’re hiring across multiple tech and non-tech roles, so make sure to check out our job openings!