Well, let me start by giving you strong motivation to read this long-ish blog till the end.

If you want to be the happy guy, you need to establish a solid foundation for your systems.  🙂

Designing a large real-time data science system is a complex task. While most of the time goes into identifying the right tech stack and making sure the system starts serving real-time requests, people often ignore the concept of a Resiliency Pyramid and overlook the following:

  1. Making sure that the system remains fault-tolerant
  2. Having full visibility into the running system
  3. Ensuring readiness towards mitigating the impact of potential issues or failures

Fault-tolerant systems  🤔

A fault-tolerant system is one that continues to reliably operate even in the face of hardware or software failures. Fortunately, most companies now use cloud-based managed services, hence hardware failures are not much of an issue. But potential software failures demand extreme caution.

Data science services are primarily backend services that rely on inputs from various sources, resulting in multiple touch points.

Figure 1

Here are some best practices to ensure resiliency when operating with multiple touch points.

  • Input standardisation using input interface
    🧠 Why do we need it?
    Let’s understand it with the help of an example. Suppose our input payload looks like:
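
```python
# Simplified, hypothetical payload; the key names (margin, sub_orders, etc.)
# are illustrative, with margin being the field referenced in the discussion below.
payload = {
    "user_id": 12345,
    "app_version": "4.2.1",
    "margin": 20,   # an int today, but the API contract may change
    "sub_orders": [
        {"sub_order_id": "SO-1", "price": 250},
        {"sub_order_id": "SO-2", "price": 400},
    ],
}
```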

and we have used these variables at 10 different places like:
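
```python
# Hypothetical modules accessing the raw keys directly (the anti-pattern):

def compute_discount(payload: dict) -> float:      # pricing logic
    return payload["margin"] * 0.1

def is_high_margin(payload: dict) -> bool:          # eligibility check
    return payload["margin"] > 15

def log_order(payload: dict) -> None:               # analytics
    print(payload["user_id"], payload["margin"], len(payload["sub_orders"]))
```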

The problem with such an implementation is that if there is any change in the input API contract (for example, if the key name changes from margin to seller_margin, or the datatype changes from int to float), we’ll have to modify our codebase in several places, which is certainly not optimal. That’s where an input interface comes into play.

Figure 2

As the above diagram shows, the input interface sits between the input sources and the application. If anything changes in the input, we only need to modify the interface; the application’s core logic remains untouched.

Another thing that needs equal attention is data type validation. It becomes extremely critical if we are using Python, which is a dynamically typed programming language where any variable can take on any data type.
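
Here is a minimal sketch of what such an interface with basic type checks could look like (the class and attribute names are illustrative, not the exact implementation):

```python
class InputInterface:
    """Single place where the raw payload is mapped to typed attributes."""

    def __init__(self, payload: dict):
        try:
            self.user_id = int(payload["user_id"])
            self.app_version = str(payload["app_version"])
            self.margin = int(payload["margin"])           # contract change? edit only here
            self.sub_orders = list(payload["sub_orders"])
        except (KeyError, TypeError, ValueError) as exc:
            # Surface the failure so any change in the input contract is caught immediately.
            raise ValueError(f"Invalid input payload: {exc}") from exc
```

The application core then works with `InputInterface(payload).margin` rather than raw dictionary keys.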

So what are the benefits of having an interface and type checker in between?

1) If any input keys are renamed or their data types change, we just need to modify the field mappings inside the interface, in one place.

2) By using a try/except block, we can monitor data type changes, if any.

  • Code modularisation
    🧠 Why do we need it? A simple answer could be to support maintainability and improve collaboration. But no explanation is complete without a real example, is it?

Let’s consider the above interface. It is certainly better than using vanilla dictionaries, but there is a problem: it does not give us any insight into the structure of the inputs. One might argue that it’s not a big deal, but imagine having hundreds of input variables in place. Let’s analyse the inputs and try to find a pattern. It looks like a user, who has certain attributes like an ID and an app version, has placed an order that potentially has multiple sub-orders.
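
One way to capture this structure is with typed dataclasses. A minimal sketch (field names are illustrative):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SubOrder:
    sub_order_id: str
    margin: float          # illustrative attribute

@dataclass
class Order:
    order_id: str
    sub_orders: List[SubOrder]

@dataclass
class User:
    user_id: int
    app_version: str
    order: Order
```

Now the structure of the inputs is self-documenting: `user.order.sub_orders[0].margin` tells us exactly where a variable lives.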

Once we have such a structured interface in place, we don’t have to repeatedly press Ctrl+F to find a variable.

System latency

No matter how good our system is, it has to respond quickly, in real time. When we try to improve latency, we often tend to focus on the latency of core algorithms which, in most cases, are black box models we have little control over. Let me try to take you through some other alternatives:

  • Response caching: If our application repeatedly gets the same kind of inputs, it’s not necessary to process them every time they are received. Instead, we can store both the inputs and the response in a cache, allowing us to respond within a few milliseconds (as sketched below).

    1) Module-level caching: Cache the entire response for a given request.
    2) Sub-module-level caching: If the model prediction is composed of multiple smaller components, cache these atomic components individually.
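
For module-level caching, a minimal in-memory sketch could look like this (in production this would more likely be a shared cache such as Redis; the helper names are illustrative):

```python
import hashlib
import json

_cache: dict = {}   # illustrative in-memory cache

def cached_predict(payload: dict, predict_fn):
    """Module-level caching: key on the full request, reuse the full response."""
    # Assumes the payload is JSON-serialisable.
    key = hashlib.md5(json.dumps(payload, sort_keys=True).encode()).hexdigest()
    if key in _cache:
        return _cache[key]           # cache hit: respond within milliseconds
    response = predict_fn(payload)   # cache miss: run the (expensive) model
    _cache[key] = response
    return response
```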
  • Server-side batching: Deep learning inference code that relies on heavy matrix computations can benefit from batching multiple requests together. Batch matrix computations are generally faster than processing requests one by one.

    Therefore, if the system requires near real-time performance, it can accumulate requests over a short period of time and then convert them into batches for processing. This allows the system to take advantage of the faster batch matrix computations and respond with the results in a more timely manner.

    However, server-side batching can introduce additional latency, given the time it takes to collect and aggregate the requests before processing them. Therefore, it is important to strike a balance between the batch size and the extra wait it introduces (see the sketch below).
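
A minimal sketch of this accumulate-then-batch loop (the queue handling and names are illustrative; many production model servers offer dynamic batching out of the box):

```python
import time
from queue import Queue, Empty

request_queue: Queue = Queue()   # items are (payload, result_holder) tuples

def batching_worker(batch_predict_fn, max_batch_size=32, max_wait_ms=10):
    while True:
        batch, holders = [], []
        deadline = time.time() + max_wait_ms / 1000
        while len(batch) < max_batch_size:
            remaining = deadline - time.time()
            if remaining <= 0:
                break
            try:
                payload, holder = request_queue.get(timeout=remaining)
            except Empty:
                break
            batch.append(payload)
            holders.append(holder)
        if batch:
            # One batched matrix computation instead of len(batch) individual calls.
            predictions = batch_predict_fn(batch)
            for holder, prediction in zip(holders, predictions):
                holder["result"] = prediction
```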
  • Network Configuration

    It’s important to check the configurations below to ensure that the network calls are optimal.
  • The main application and its components should either be in the same Virtual Private Cloud (VPC) or be connected via VPC peering.
  • If they are in the same Availability Zone (AZ), it might give us an extra edge on latency, but a single AZ carries a higher risk of failure, hence we choose at least two different AZs.

Error handling

The first principle of building a resilient system: it’s okay to get errors as long as we are able to catch them fast. While designing a system, we typically make a laundry list of exceptions we should be prepared to catch, but it’s equally important to monitor how often each of them occurs.

💡 Categorise exceptions and monitor them using CloudWatch Dashboards

Let’s look at Figure 1. We have multiple touch points where errors can occur:

  • Inputs can be of the wrong type: In such cases, the general practice is to fall back on the default value.
  • For a given ID, we might not find features in a feature store: If possible, fall back on the default value.
  • Model serving platform: If possible, fall back on the default prediction.

    Here is a code snippet of catching exceptions and implementing fallback solutions.
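
A minimal sketch, assuming hypothetical feature_store and model_client handles and illustrative default values:

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_FEATURES = {"margin": 0.0}   # illustrative default feature values
DEFAULT_PREDICTION = 0.5             # illustrative default prediction

def get_prediction(user_id: int, feature_store, model_client) -> float:
    try:
        features = feature_store.get(user_id)
    except Exception as exc:
        logger.error("FeatureStoreError for user_id=%s: %s", user_id, exc)
        features = DEFAULT_FEATURES          # fall back on the default value
    try:
        return model_client.predict(features)
    except Exception as exc:
        logger.error("ModelServingError: %s", exc)
        return DEFAULT_PREDICTION            # fall back on the default prediction
```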

Visibility into the running system

We have discussed some best practices for building a fault-tolerant system. However, that alone is insufficient unless we also have visibility into the entire system.

Below are some best practices for shining a spotlight on the running system.

  • Logging and Dashboarding
💡 The best way to remember the decision one has taken while writing code.

As logging often comes with additional latency, we should utilise it mindfully. In my experience, exception handling is one of the most critical parts of the codebase as it prevents an avalanche of failures, but we must also proactively monitor the fallback logic to ensure that it works as intended and does not introduce new errors into the system.

🧠 So how do we monitor exception handling?

Well, we can make use of a logger to log the different types of exceptions being caught:
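
```python
import logging

logger = logging.getLogger("resiliency")
logging.basicConfig(level=logging.INFO)

# The exact log format is an assumption; a consistent "exception_category=..."
# pattern keeps the CloudWatch Logs Insights query further below simple.
def log_exception(category: str, detail: str) -> None:
    logger.error("exception_category=%s detail=%s", category, detail)

# e.g. inside the fallback branches shown earlier:
# log_exception("FeatureStoreError", f"user_id={user_id}")
# log_exception("ModelServingError", "request timed out")
```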

Now, we need to parse the error messages in CloudWatch Logs and create a dashboard out of them.
Go to CloudWatch → Logs Insights → select log groups → write a query
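
For example, an illustrative Logs Insights query (assuming the exception_category=... detail=... format used in the logging helper above):

```
fields @timestamp, @message
| filter @message like /exception_category/
| parse @message "exception_category=* detail=*" as category, detail
| stats count(*) as error_count by category
| sort error_count desc
```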

The resulting dashboard gives us a clear picture of what’s happening under the hood.

  • Alerts

Get a notification on your phone as soon as something breaks. (I know this is not something one would look forward to being subscribed to. 🙃)

There are two kinds of metrics we generally monitor:

  1. Resource health and service-level metrics: These tell us how the backend resources are performing; they include concurrency, latency, average lag (in Kafka), CPU utilisation, memory utilisation, etc. Most of these are available out of the box under CloudWatch’s All metrics.
  2. North star/business-critical metrics: This is purely application-dependent. One way to get such real-time metrics is to first log the raw data (predictions) in CloudWatch Logs and then derive metrics from them in real time, as in the steps below.

    a) Log predictions in CloudWatch Logs (see the sketch after these steps).

    b) Use CloudWatch Logs Insights to query the logs. Keep in mind that some queries can be challenging, so whenever possible, structure the logging so that it doesn’t require complex queries.

    c) Add the output of the query to a CloudWatch dashboard.

    d) Configure periodic alerts on the output of the above query.
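
A minimal sketch of step (a), assuming the predictions are logged as structured JSON lines (field names are illustrative); the comment shows the kind of Logs Insights aggregation step (b) could then use:

```python
import json
import logging

logger = logging.getLogger("predictions")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_prediction(user_id: int, prediction: float) -> None:
    # If the whole log event is valid JSON, Logs Insights discovers the fields,
    # and step (b) stays simple, e.g.:
    #   filter metric = "prediction" | stats avg(value) as avg_prediction by bin(5m)
    logger.info(json.dumps({"metric": "prediction", "user_id": user_id, "value": prediction}))
```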

Readiness towards mitigating the impact of potential issues or failures  🤔


No matter how confident we are in our builds, there is always a possibility of failure, as these are all data-driven models. Hence, we need a control plan to minimise the risk and impact.

There are various measures to implement a control plan:

  1. Rate limiters: Define an upper bound on the model’s predictions. There are various types of rate limiters:

a) Fixed window: For a fixed time window, the metrics of interest cannot go beyond predefined thresholds.

b) Sliding window: At any point in time, we look back over a predefined duration and check whether we still have bandwidth available (see the sketch below).
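
A minimal sliding-window sketch (the notion of a prediction “budget” and all names are illustrative):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Caps how much 'budget' (e.g. total discount predicted by the model)
    can be spent within the last window_s seconds."""

    def __init__(self, max_budget: float, window_s: float = 60.0):
        self.max_budget = max_budget
        self.window_s = window_s
        self.events = deque()   # (timestamp, amount) pairs

    def allow(self, amount: float) -> bool:
        now = time.time()
        # Drop events that have fallen out of the look-back window.
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()
        if sum(a for _, a in self.events) + amount > self.max_budget:
            return False         # no bandwidth left: clip or fall back
        self.events.append((now, amount))
        return True
```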

2. Quick Rollback Mechanism: We need to have an extremely fast rollback mechanism in place, so that when something goes south, we can quickly revert to the previous stable version with just a click.


Anshu Aditya is Data Scientist - III at Meesho