New users rarely give second chances. If their first experience on the app, especially the homepage product feed, feels irrelevant, they leave quickly. At Meesho, we noticed that our static homepage product feed for new users was causing high bounce rates and a poor first impression.
But with zero browsing history, how do you show the right products from the very first scroll?
In this blog, we share how we reengineered Meesho’s homepage feed to deliver personalized product recommendations right from a user’s very first session. 🚀

The Problem
There are two core challenges in personalizing the feed for new users: first, the lack of behavioral signals early on; second, incorporating a new user's in-app interactions in real time to improve personalization.
Journey of New User Personalization at Meesho

▶️ Demographic Recommender (DemoRec):
Our early approach to solving the cold start problem relied on basic demographic segmentation.
Since we didn’t have behavioral data for new users, we grouped them into broad cohorts based on attributes like gender, location, etc. We then surfaced popular products within each cohort to offer some level of personalization. While this strategy helped avoid completely generic feeds, it had its limitations, especially in capturing the unique and evolving preferences of each user.
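To make this concrete, here is a minimal sketch of a DemoRec-style cohort lookup: count orders per demographic cohort, then serve each new user the most popular products in their cohort. The attribute set (gender, region) and the data shapes are illustrative assumptions, not our production pipeline.

```python
# A minimal sketch of cohort-based popularity recommendation (DemoRec-style).
# Cohort keys and attribute names are illustrative assumptions.
from collections import Counter, defaultdict

def build_cohort_popularity(orders):
    """orders: iterable of (gender, region, product_id) tuples."""
    cohort_counts = defaultdict(Counter)
    for gender, region, product_id in orders:
        cohort_counts[(gender, region)][product_id] += 1
    return cohort_counts

def recommend(cohort_counts, gender, region, k=10):
    """Return the k most popular products for the user's demographic cohort."""
    return [pid for pid, _ in cohort_counts[(gender, region)].most_common(k)]

# Usage: feed for a brand-new user with only signup attributes available.
counts = build_cohort_popularity([("F", "KA", "p1"), ("F", "KA", "p2"), ("F", "KA", "p1")])
print(recommend(counts, "F", "KA"))  # ['p1', 'p2']
```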
▶️ Early Signal Personalization Model (ESPM):
To improve on demographic-only personalization, we began blending in early user behavior.
We combined basic demographic signals with a user's initial interactions on the app, such as product clicks, time spent on listings, and scroll patterns, to estimate their likelihood of purchasing from different categories. Using this predicted purchase probability, we identified the top categories for each user and curated a personalized product feed tailored to their predicted interests. This hybrid approach gave us a more dynamic way to serve relevant content, even in a user's first session.
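A minimal sketch of the ESPM idea, assuming one binary classifier per category over demographic and early-session features; the feature set, category list, and choice of logistic regression are illustrative assumptions rather than the production model.

```python
# A sketch of ESPM-style category scoring: per-category classifiers predict
# purchase probability from demographic + early-session features.
from sklearn.linear_model import LogisticRegression

CATEGORIES = ["sarees", "footwear", "home_decor"]  # illustrative category list

def train_category_models(X, y_by_category):
    """X: (n_users, n_features) demographic + early-session features
    (e.g. clicks, dwell time, scroll depth).
    y_by_category: {category: 0/1 purchase labels}. One classifier per category."""
    return {c: LogisticRegression().fit(X, y_by_category[c]) for c in CATEGORIES}

def top_categories(models, x, k=2):
    """Rank categories by predicted purchase probability for one user x."""
    scores = {c: m.predict_proba(x.reshape(1, -1))[0, 1] for c, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```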
These two approaches addressed the first challenge, but we were still looking for a new architecture that would keep improving the feed as the user generated new interactions.
▶️ Cold-Warm Net (CWN):
The goal of modeling cold-start users is to learn effective user representations from both cold and warm user behaviors and to build models that adapt as users evolve. Cold users are users with no interaction history; warm users are new users with some interaction history.
Cold-Warm Net uses expert towers for the cold state and warm-up state of users, combined via a gate network that adapts based on user behavior. A dynamic teacher selector guides learning through knowledge distillation, ensuring high-quality personalization from the start. We'll discuss this in detail below.
Let’s Deep Dive into the Cold-Warm Net model:


Our model consists of two experts: a cold expert and a warm expert. User demographic features X_demo are passed to the cold expert to get the cold embedding e_cold, and the user's interaction sequence X_seq is passed to the warm expert to get the warm embedding e_warm. A gating network combines e_cold and e_warm to get the final user embedding e_user. We get the item embedding e_item from an item lookup table whose embeddings are randomly initialised.
e_cold = f_cold(X_demo)
e_warm = f_warm(X_seq)
e_item = Lookup(i_item)
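Below is a minimal PyTorch sketch of the two expert towers and the item lookup table. The layer sizes, the embedding dimension, and the mean-pooling of the interaction sequence in the warm tower are our assumptions for illustration; the production towers may differ.

```python
import torch.nn as nn

EMB_DIM = 64  # assumed embedding size

class ColdExpert(nn.Module):
    """e_cold = f_cold(X_demo): an MLP over demographic features."""
    def __init__(self, demo_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(demo_dim, 128), nn.ReLU(), nn.Linear(128, EMB_DIM)
        )

    def forward(self, x_demo):          # x_demo: (batch, demo_dim)
        return self.mlp(x_demo)

class WarmExpert(nn.Module):
    """e_warm = f_warm(X_seq): mean-pool embeddings of the interaction sequence."""
    def __init__(self, num_items):
        super().__init__()
        self.seq_emb = nn.Embedding(num_items, EMB_DIM, padding_idx=0)
        self.proj = nn.Linear(EMB_DIM, EMB_DIM)

    def forward(self, x_seq):           # x_seq: (batch, seq_len) item ids, 0 = padding
        mask = (x_seq != 0).unsqueeze(-1).float()
        pooled = (self.seq_emb(x_seq) * mask).sum(1) / mask.sum(1).clamp(min=1.0)
        return self.proj(pooled)

# e_item = Lookup(i_item): a randomly initialised item embedding table.
item_lookup = nn.Embedding(100_000, EMB_DIM)
```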
Gating network
User state features X_state, such as login state, activity level, and lifecycle stage, are passed to the gate network to get the cold expert weight w_cold and the warm expert weight w_warm.
w_warm, w_cold = f_gate(X_state)
e_user = w_warm · e_warm + w_cold · e_cold
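A minimal sketch of the gate, assuming a single linear layer with a softmax that produces the two expert weights from the user-state features:

```python
import torch
import torch.nn as nn

class GateNetwork(nn.Module):
    """w_warm, w_cold = f_gate(X_state): softmax weights over the two experts."""
    def __init__(self, state_dim):
        super().__init__()
        self.fc = nn.Linear(state_dim, 2)   # one logit per expert

    def forward(self, x_state):             # x_state: (batch, state_dim)
        w = torch.softmax(self.fc(x_state), dim=-1)
        return w[:, :1], w[:, 1:]           # (w_cold, w_warm), each (batch, 1)

def fuse(e_cold, e_warm, w_cold, w_warm):
    """e_user = w_warm * e_warm + w_cold * e_cold."""
    return w_warm * e_warm + w_cold * e_cold
```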
Let's say y and ŷ are the actual and predicted labels for each sample. We then optimise the whole network by minimizing the binary cross-entropy loss L between them:
ŷ = sigmoid(cosine_similarity(e_user, e_item))
L = BinaryCrossEntropy(ŷ, y)
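In code, the scoring and training objective look roughly like this (a sketch using standard PyTorch ops):

```python
import torch
import torch.nn.functional as F

def predict(e_user, e_item):
    """y_hat = sigmoid(cosine_similarity(e_user, e_item))."""
    return torch.sigmoid(F.cosine_similarity(e_user, e_item, dim=-1))

def main_loss(y_hat, y):
    """L = BinaryCrossEntropy(y_hat, y); y holds float 0/1 labels."""
    return F.binary_cross_entropy(y_hat, y)
```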
Dynamic knowledge distillation:
Cold-start experts often underfit due to limited information during the cold state of users, so we use Dynamic Knowledge Distillation (DKD) to transfer knowledge from the warm expert to the cold expert when needed. An auxiliary distillation loss L_d is added to the main loss L to guide learning.
Let ŷ_cold and ŷ_warm denote the predicted labels of the cold expert and warm expert respectively. For each sample, we compare the binary cross-entropy losses of both experts.
If the cold expert performs worse, i.e. L(ŷ_cold, y) > L(ŷ_warm, y), then L_d is added to L so that the cold expert learns from the warm expert. The distillation loss L_d and the overall loss L_o of the network are defined as
L_d = CrossEntropy(ŷ_cold, ŷ_warm)
L_o = L + α · L_d (where α = 0 if L(ŷ_cold, y) ≤ L(ŷ_warm, y))
Here, α determines the strength of distillation from the warm-up expert.
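A sketch of the per-sample distillation term, assuming the binary cross-entropy form of L_d with the warm expert's (detached) predictions as soft targets; the per-sample zeroing of α is our reading of the rule above.

```python
import torch.nn.functional as F

def distillation_loss(y_hat_cold, y_hat_warm, y, alpha=1.0):
    """Return the alpha * L_d term to add to the main loss L, with alpha
    zeroed per sample where the cold expert already matches or beats the warm expert."""
    loss_cold = F.binary_cross_entropy(y_hat_cold, y, reduction="none")
    loss_warm = F.binary_cross_entropy(y_hat_warm, y, reduction="none")
    # L_d: distil with the warm expert's predictions as (detached) soft targets.
    l_d = F.binary_cross_entropy(y_hat_cold, y_hat_warm.detach(), reduction="none")
    gate = (loss_cold > loss_warm).float()   # alpha = 0 where L(cold) <= L(warm)
    return (alpha * gate * l_d).mean()
```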
Why SUB_CATEGORY and PRICE_DECILE prediction auxiliary tasks were added
1. Enforcing Hierarchical Learning: In our use case, there’s a natural hierarchy: e.g., sub_category → price_decile → catalog. Auxiliary tasks help the model capture and align with this structure.
2. Improved Gradient Flow / Optimization: Auxiliary tasks add additional loss signals at intermediate layers, helping stabilize training by improving gradient flow in deep networks.
3. Better Representation Learning: By encouraging the network to solve sub-problems, it learns richer representations that can improve performance on the main task.
4. Faster Convergence: Training with auxiliary tasks can accelerate convergence by guiding early layers to learn useful features more quickly.
Mathematical formulation: We add auxiliary losses L_sub_category and L_price_decile to the overall loss L_o of the network to get L_total:
L_sub_category = CrossEntropy(ŷ_sub_category, y_sub_category)
L_price_decile = CrossEntropy(ŷ_price_decile, y_price_decile)
L_total = L_o + L_price_decile + L_sub_category
where,
ŷ_sub_category and y_sub_category are the predicted and actual sub_category labels respectively,
ŷ_price_decile and y_price_decile denote the predicted and actual price_decile labels respectively, and
L_total is the total loss used to optimize the network.
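A sketch of how the auxiliary heads plug into training, assuming simple linear classification heads on the user embedding; the class counts and head shapes are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

NUM_SUB_CATEGORIES, NUM_PRICE_DECILES, EMB_DIM = 500, 10, 64  # illustrative sizes
sub_category_head = nn.Linear(EMB_DIM, NUM_SUB_CATEGORIES)
price_decile_head = nn.Linear(EMB_DIM, NUM_PRICE_DECILES)

def total_loss(l_o, e_user, y_sub_category, y_price_decile):
    """L_total = L_o + L_sub_category + L_price_decile."""
    l_sub = F.cross_entropy(sub_category_head(e_user), y_sub_category)
    l_price = F.cross_entropy(price_decile_head(e_user), y_price_decile)
    return l_o + l_sub + l_price
```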
Backtesting Results:


🚀 Impact!
We rolled out these model enhancements on the Homepage Product Feed and saw a notable uplift in feed engagement metrics like CTR, CVR, and O/Vi, highlighting stronger engagement and conversion within the feed. Order contribution from FY increased significantly, reinforcing its central role in driving purchases. The FY feed improvements also lifted Search and other RE surfaces through stronger user intent. Additionally, we saw a sharp drop in bounce rate, indicating that users are finding relevant content earlier in the feed, leading to quicker conversions and higher-quality interactions.
Overall, we saw a notable rise in new user activation along with a sharp drop in bounce rate, a game-changing impact!
🎉 Shoutouts
Special thanks to Pukhraj Baraskar, Divay Jindal, Devashish Gupta for working closely on the project and Madhurita Mahapatra, Vinit Rongata, Ravindra Kumar Yadav, Debdoot Mukherjee, Anmol Verma and Milan Partani for their guidance.
🗒️Reference
1. https://arxiv.org/pdf/1808.09781
2. https://arxiv.org/pdf/2106.03819
3. https://arxiv.org/pdf/2205.04507
4. https://arxiv.org/pdf/2309.15646

