What's the Issue? E-commerce Company & Meesho Context

In a data-driven world, managing vast amounts of information is crucial. However, this comes with significant costs, especially as data keeps expanding. Companies can end up spending millions on data storage, access, and management.


In industries like E-commerce, where Big Data is used for customer experience, marketing, and decision-making, monitoring and controlling costs is vital to prevent excessive spending as the company grows.

In recent years, Meesho has grown exponentially, and so has its data. Keeping in mind the goal of becoming “India’s first profitable horizontal ecommerce company”, managing data size and the associated access costs is a big concern. If ignored, it will inflate storage, access and query expenses, and neglected data will accumulate into a cleanup challenge later. The solution? A smart data strategy.

Addressing This Issue: Strategies and Actions

To address this issue, we introduced the concept of an S3 storage policy: essentially a set of rules for archiving and cleaning up data that isn't frequently used. However, it's essential to be mindful of the costs associated with these actions. AWS S3 offers various storage classes, each with different costs and features, and choosing the right one for your specific use case is crucial. In our scenario, we need archived data to remain inaccessible until it is manually retrieved. Here's a comparison of the storage classes relevant to our needs.

  • Intelligent-Tiering: Moves objects to the most cost-effective access tier based on access patterns, but in our case we require hard rules independent of access.
  • Standard-IA: Supposed to return data just like Standard, but sometimes takes a little more time. Also more expensive than GIR.
  • Glacier Instant Retrieval (GIR): A good option for data that is rarely accessed; retrieval happens in milliseconds. But every GET call on an object counts as a retrieval, and the retrieval cost is very high.
  • Glacier: A good option for infrequently used data where hard rules can be applied; retrieval takes up to 12 hours on request. More expensive than GDA.
  • Glacier Deep Archive (GDA): The cheapest option available; if the data is needed only once or twice a year, it costs very little.

When choosing a storage class, don't just focus on the storage price. Think about how much it costs to read, retrieve and archive data in that class.
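To make this concrete, here is a rough sketch of the kind of back-of-the-envelope model we mean. The function and its price parameters are placeholders to be filled in from the AWS pricing page for your region and storage class; they are not actual AWS prices.

def estimate_monthly_cost(
    storage_gb,               # data stored in the class, in GB
    retrieved_gb,             # data read back during the month, in GB
    requests,                 # number of GET/transition requests
    storage_price_per_gb,     # monthly storage rate for the class
    retrieval_price_per_gb,   # 0 for Standard, non-zero for the Glacier tiers
    request_price_per_1000,   # request pricing for the class
):
    # Rough monthly cost: storage + retrieval + request charges.
    storage_cost = storage_gb * storage_price_per_gb
    retrieval_cost = retrieved_gb * retrieval_price_per_gb
    request_cost = (requests / 1000.0) * request_price_per_1000
    return storage_cost + retrieval_cost + request_cost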

Caveats:

  • S3 Standard-IA, S3 One Zone-IA and S3 Glacier Instant Retrieval have a minimum billable object size of 128 KB, i.e. every object smaller than 128 KB is charged as if it were 128 KB.
  • Storage classes also have a minimum lock-in period: GDA has 180 days, Glacier and GIR have 90 days, and S3 Standard-IA has 30 days. If we change the storage class or delete objects before the lock-in period ends, we pay the remaining duration as an early-delete cost.
  • Glacier and GDA add an extra 40 KB of chargeable metadata overhead to every object.

Now that we have taken the first step, the question is: how? We tried many things, but S3 Lifecycle rules turned out to be the real stars.
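Before diving into Lifecycle rules, here is a minimal illustration of how these caveats affect the billed size of a single object (sizes in KB). The thresholds are the ones listed above; the helper itself is just a sketch.

def billable_size_kb(object_size_kb, storage_class):
    # Minimum billable object size of 128 KB for the IA and GIR classes.
    if storage_class in ("STANDARD_IA", "ONEZONE_IA", "GLACIER_IR"):
        return max(object_size_kb, 128)
    # Glacier and GDA add ~40 KB of chargeable metadata overhead per object.
    if storage_class in ("GLACIER", "DEEP_ARCHIVE"):
        return object_size_kb + 40
    return object_size_kb

# e.g. a 10 KB object is billed as 128 KB in GLACIER_IR but ~50 KB in DEEP_ARCHIVE,
# which is one reason small files need separate handling (see the caveats further below).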

What is an S3 Lifecycle rule?

An S3 Lifecycle configuration is a set of rules that defines the actions S3 applies to a group of objects. The configuration can be applied either from the UI or programmatically using SDKs. Let's deep dive into how to apply Lifecycle rules using the use cases below.

  • Delete all objects that are older than y days and archive objects that are older than x days (y > x). This can be achieved by:
  1. Transition: move objects with a certain prefix or tag that are older than x days to another storage class.
  2. Expiration: expire (delete) objects with a certain prefix or tag that are older than y days.

As we wanted to apply this rule on prefixes that get updated continuously, instead of doing it from the UI we went with the Python SDK (boto3 client) to apply the rules.

Example:

import boto3

client = boto3.client("s3")

# lifecycle_configuration is the dict of rules explained below; this call
# overwrites the bucket's existing Lifecycle configuration in one request.
response = client.put_bucket_lifecycle_configuration(
    Bucket="bucket-name",
    LifecycleConfiguration=lifecycle_configuration,
)


Here, lifecycle_configuration is a dictionary whose Rules key holds the list of rules.
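For illustration, a minimal lifecycle_configuration for the use case above could look like this; the rule ID, prefix, day counts and storage class are placeholder values, not our production settings.

# Transition objects under a prefix to Deep Archive after x days and
# delete them after y days (here x=90 and y=365 purely as examples).
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-delete-table-data",  # hypothetical rule name
            "Filter": {"Prefix": "table_name/"},     # placeholder prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "DEEP_ARCHIVE"}
            ],
            "Expiration": {"Days": 365},
        }
    ]
}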

Note: Sometimes, due to anomalies, we receive future-dated data. Let's deep dive into how we can archive such data. Consider a table backed by an S3 location where the prefix is built from the table's date field, e.g. if the date is 20/08/2023, the location prefix is table_name/year=2023/month=08/day=20/. This case is different because the table sometimes receives future-dated rows.

Suppose today's date is 20/08/2023 and a row arrives carrying a future date, say 25/08/2023: it lands under table_name/year=2023/month=08/day=25/ even though it was created today, while today's rows land under table_name/year=2023/month=08/day=20/.

Our need is that when we archive 20/08/2023, only the prefix table_name/year=2023/month=08/day=20/ should be archived. If we archived based on object creation time, both prefixes would be archived, because the future-dated rows were also created today. So we went with prefix-based archival.
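A minimal sketch of what prefix-based archival can look like for such date partitions; the bucket name, rule ID, day threshold and storage class are placeholders, not our production job.

import datetime
import boto3

client = boto3.client("s3")

def rule_for_partition(partition_date, days=1):
    # Build a Lifecycle rule scoped to exactly one date partition. The Days
    # threshold still counts from object creation, but because the rule is
    # scoped to this prefix, only this partition gets archived.
    prefix = "table_name/year=%04d/month=%02d/day=%02d/" % (
        partition_date.year, partition_date.month, partition_date.day
    )
    return {
        "ID": "archive-" + partition_date.strftime("%Y%m%d"),  # hypothetical ID
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": days, "StorageClass": "DEEP_ARCHIVE"}],
    }

# e.g. archive only the 20/08/2023 partition; in practice, merge this with the
# bucket's existing rules before writing (see the caveats below).
rules = [rule_for_partition(datetime.date(2023, 8, 20))]
client.put_bucket_lifecycle_configuration(
    Bucket="bucket-name",  # placeholder bucket
    LifecycleConfiguration={"Rules": rules},
)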

Caveats:

You can add only 1,000 rules per bucket, so use them wisely. In our case there were a lot of S3 locations, so we sorted them by size and applied rules only to the top N prefixes.

  1. Use prefixes cautiously for archival and deletion. For example, if you want to apply a rule on the prefix prefix/1/ but specify prefix/1 in the config, then every location starting with prefix/1 (e.g. prefix/123, prefix/111, etc.) will be archived or deleted.
  2. When applying Lifecycle rules using the API or SDKs, keep in mind that all rules go in a single request, and that request overrides the previous set of rules, so develop accordingly (see the sketch after this list).
  3. Do not apply rules on small files. For a deep dive, check out the next blog.
  4. When applying two rules on the same prefix, expiration (deletion) takes precedence over transition.
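Because the configuration is replaced wholesale (caveat 2), a safe pattern is to read the existing rules, merge in the new ones, and write everything back in a single call. A minimal sketch, with a placeholder bucket and rule:

import boto3
from botocore.exceptions import ClientError

client = boto3.client("s3")
bucket = "bucket-name"  # placeholder bucket

# Read the current configuration; a bucket with no rules raises an error.
try:
    current_rules = client.get_bucket_lifecycle_configuration(Bucket=bucket)["Rules"]
except ClientError as err:
    if err.response["Error"]["Code"] != "NoSuchLifecycleConfiguration":
        raise
    current_rules = []

new_rule = {
    "ID": "archive-old-table-data",       # hypothetical rule
    "Filter": {"Prefix": "table_name/"},  # placeholder prefix
    "Status": "Enabled",
    "Transitions": [{"Days": 90, "StorageClass": "DEEP_ARCHIVE"}],
}

# Keep every existing rule (replacing any with the same ID) and write them all back.
merged = [r for r in current_rules if r.get("ID") != new_rule["ID"]] + [new_rule]
client.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={"Rules": merged},
)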

Failures - Part of the Process: Big Data Version

In a recent situation, we wanted to unlink files from the Delta log so that archived files are not queried. We attempted several approaches, and none of them worked out:

  1. Delete the partition and archive the data: as a result, the data was deleted at the time of VACUUM.
  2. Assume that archived, inaccessible data would not be deleted at the time of VACUUM: the GDA data was still deleted after the vacuum.
  3. Drop the partition using ALTER TABLE ... DROP PARTITION: Delta Lake tables don't support that.
  4. Manipulate the Delta log to unlink the archived files: the entry that needed to be added was removed, and VACUUM again deleted the data.
  5. Archive the data in one go and run FSCK REPAIR: this also didn't work, because FSCK works off the file listing, and an archived file is still listed as present.
  6. Rely on the fact that folders whose names start with _ are ignored by VACUUM: we tested this and it worked seamlessly. But S3 doesn't have a rename feature; it actually copies the object under another name in the background (see the sketch below).
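For context on that last point: a "rename" in S3 is really a server-side copy followed by a delete, which is why it is neither free nor instant, and why it doesn't help for already-archived objects (they would have to be restored before they can be copied). A minimal sketch with a placeholder bucket and hypothetical keys:

import boto3

client = boto3.client("s3")
bucket = "bucket-name"  # placeholder bucket

def rename_object(old_key, new_key):
    # S3 has no rename primitive: emulate it with a copy plus a delete.
    client.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": old_key},
        Key=new_key,
    )
    client.delete_object(Bucket=bucket, Key=old_key)

# e.g. move a data file under an underscore-prefixed path that VACUUM ignores
# (hypothetical keys).
rename_object(
    "table_name/year=2023/month=08/day=20/part-0000.parquet",
    "_archived/table_name/year=2023/month=08/day=20/part-0000.parquet",
)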

When did we start seeing results?

We started by applying these rules to old data that needed to be stored for the long term. Once we applied the rules, it took 2-4 days for the changes to become visible, depending on how much data there was. We then continued to add more rules gradually as our data grew.

Conclusion:

After applying these Lifecycle rules, we saw a drastic drop in our cost numbers: not only in storage cost but also in read cost, since only relevant data was being queried. Since then our storage cost has kept decreasing, and with multiple iterations of optimisation our spend reduced by more than 50%.

Dynamic Month-by-Month Cost Reduction Graph:


What Next?

In this tech blog, we learned that when it comes to saving money on S3 storage, it’s not just about how much data you store. It’s also about how many times you request or modify that data, which can add to the cost. Additionally, archival costs can unexpectedly increase for various reasons. That’s why it’s crucial to set up S3 Observability to spot potential issues in advance and save on costs. We’ll dive deeper into this topic in our next blog post.