Imagine moving a colossal 20TB database—the size of a small digital galaxy—from one cloud to another without breaking a sweat. Sounds impossible? Not for Meesho! We pulled off this massive migration seamlessly, proving once again that when it comes to tech challenges, we’re always up for the task. Let’s take you behind the scenes of this incredible feat!

At Meesho, our tech ecosystem thrives on a diverse mix of databases—SQL, NoSQL, and everything in between—powering a wide array of application use cases. As we expanded our infrastructure to embrace cross-cloud capabilities by integrating GCP, we faced an ambitious challenge: seamlessly migrating databases, from a few gigabytes to a staggering 25+ terabytes, without disrupting performance or operations.

What is the Problem?

While we had successfully migrated databases under 5 terabytes in the past, the real test began when we ventured into uncharted territory: databases exceeding that threshold. The challenge grew sharply with datasets beyond 10TB, pushing conventional migration methods to their limits. Even with support from GCP, we found ourselves at a standstill as the complexity of such massive datasets defied standard solutions. Although GCP's Database Migration Service (DMS) enabled continuous data replication with minimal downtime, it fell short for datasets over 10TB, forcing us to rethink our approach.

What are the Key Challenges?

  1. Maintaining User Experience: Migrating our 20TB database—integral to a critical e-commerce use case managing supplier and customer data—was a high-stakes operation that demanded zero downtime. Even the slightest disruption could have severely impacted user experience, risking financial losses and operational setbacks.
  2. Preserving Performance and Reliability: Replicating the high performance and responsiveness of our previous cloud environment on CloudSQL was essential to maintain user satisfaction and avoid performance-related issues.
  3. Optimizing Cost-Efficiency: Striking the right balance between cost and performance was crucial to offer an affordable shopping experience without compromising service quality.
  4. Handling Data Complexity: The database handled a variety of data types, including complex product information, user profiles, and order details, which added to the migration complexity.
  5. Addressing Security and Privacy Concerns: Strict regulatory compliance requirements necessitated specific data security and access control measures, further complicating the migration process.

What is the Solution?

Instead of migrating the entire 20TB behemoth in one shot, we opted for a strategic split of data and indexes. This approach allowed us to achieve near-zero downtime and keep performance steady throughout the migration.

Here's the step-by-step process we followed:

  1. Creating Blank Schemas on CloudSQL: We pre-provisioned empty schemas on CloudSQL to receive the data, so table and schema creation didn't have to happen as part of the restore itself. This shaved time off the overall migration.
  2. Restoring Database Dump: We streamlined the data transfer by restoring a database dump directly into the pre-prepared CloudSQL schemas, which significantly cut network bandwidth usage compared to replicating the full dataset over the wire. To optimize further, we separated data and indexes within the tables, migrating the data first and creating the indexes post-migration (a sketch of this data/index split follows this list). Because only the data was restored initially, this reduced the restore time by 25%; indexes were then created based on use-case priority. With this approach, we brought the database online to support critical application flows in just 65% of the time a full 20TB dump and restore would typically require.
  3. Initiating Replication Setup: Once the data was in place, we established asynchronous replication between the previous master and the CloudSQL slave (illustrated in the replication sketch after this list). This kept CloudSQL continuously up to date while minimizing impact on the live production environment.
  4. Creating Secondary Indexes on the Slave: After verifying data consistency between master and slave using in-house scripts (other data-validation techniques would work just as well), we began creating the secondary indexes on the CloudSQL slave. We started with the smaller tables, whose indexes could be built quickly with minimal performance impact, and progressed gradually to the larger ones (see the consistency-and-indexing sketch below).
  5. Final Cut-over: The most critical phase of our migration journey was transitioning active write operations to CloudSQL. To minimize downtime and ensure a seamless user experience, we carefully orchestrated a brief pause in write activity: during a predefined 2-minute window, writes to our essential user databases were temporarily suspended. Our carefully planned and thoroughly tested failover process handled the transition; after verifying that no read or write activity was occurring on the old system and that the replica had fully caught up, we promoted CloudSQL to master and resumed normal database operations (the cut-over gate is sketched below).
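
To make the data/index split in steps 1 and 2 concrete, here is a minimal sketch of how such a dump-and-restore could be scripted. It assumes a MySQL-compatible source, the stock mysqldump/mysql client tools on a migration host, and placeholder hosts, credentials, and schema names (SOURCE_HOST, orders_db, and so on); our actual tooling differed in the details.

```python
"""Sketch of steps 1-2: dump schema and data separately so rows can be
restored before secondary indexes exist. Hosts, users, passwords and the
database name are placeholders, not our real values."""
import subprocess

SRC_ARGS = ["-h", "SOURCE_HOST", "-u", "migrator", "-pSECRET"]
TGT_ARGS = ["-h", "CLOUDSQL_HOST", "-u", "migrator", "-pSECRET"]
DB = "orders_db"  # hypothetical schema name

def dump(extra_args, out_path):
    """Run mysqldump with the given flags and stream the output to a file."""
    with open(out_path, "w") as out:
        subprocess.run(["mysqldump", *SRC_ARGS, "--single-transaction",
                        *extra_args, DB], stdout=out, check=True)

def restore(dump_path):
    """Pipe a dump file into the pre-created CloudSQL schema."""
    with open(dump_path) as src:
        subprocess.run(["mysql", *TGT_ARGS, DB], stdin=src, check=True)

# Step 1: schema only (no rows). Secondary index definitions can be stripped
# from this file so they are created on the replica only after the data load.
dump(["--no-data"], "schema.sql")

# Step 2: data only (no CREATE TABLE statements); --master-data=2 records the
# source binlog coordinates needed to start replication afterwards.
dump(["--no-create-info", "--master-data=2"], "data.sql")

restore("schema.sql")  # blank, index-free tables first ...
restore("data.sql")    # ... then the bulk data load, with no indexes to maintain
```

Loading rows into index-free tables is what buys the 25% faster restore mentioned above: the server has no secondary indexes to maintain during the bulk insert.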
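Step 3 rests on standard MySQL asynchronous replication. On CloudSQL the replica side is configured through GCP's external-server replication workflow rather than by running these statements yourself, so treat the sketch below purely as an illustration of the mechanism; the binlog coordinates come from the --master-data output of the dump, pymysql is an assumed client library, and every host, user, and coordinate is a placeholder.

```python
"""Sketch of step 3: point a replica at the old primary using the binlog
coordinates recorded by the dump. All connection details are placeholders."""
import pymysql

# Connect to the replica with admin privileges (placeholder credentials).
replica = pymysql.connect(host="REPLICA_HOST", user="admin", password="SECRET")

with replica.cursor() as cur:
    # Binlog coordinates come from the "CHANGE MASTER TO" comment that
    # --master-data=2 wrote into data.sql (values below are placeholders).
    cur.execute("""
        CHANGE MASTER TO
            MASTER_HOST = 'OLD_PRIMARY_HOST',
            MASTER_USER = 'repl_user',
            MASTER_PASSWORD = 'REPL_SECRET',
            MASTER_LOG_FILE = 'mysql-bin.000123',
            MASTER_LOG_POS = 4567
    """)
    # Start applying every write made on the old primary since the dump.
    cur.execute("START SLAVE")
```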
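For step 4, both the consistency checks and the size-ordered index builds can be driven from information_schema. The sketch below compares row counts and table checksums between the two sides and then creates the deferred secondary indexes smallest-table-first; our in-house scripts were more elaborate, and the database name, credentials, and the DEFERRED_INDEXES map are hypothetical.

```python
"""Sketch of step 4: verify row counts/checksums match, then build the
deferred secondary indexes starting with the smallest tables."""
import pymysql

def connect(host):
    return pymysql.connect(host=host, user="admin", password="SECRET",
                           database="orders_db")  # placeholder credentials

master, replica = connect("OLD_PRIMARY_HOST"), connect("CLOUDSQL_HOST")

def table_fingerprint(conn, table):
    """Cheap consistency signal: row count plus MySQL's CHECKSUM TABLE value.
    Only meaningful while the table is not receiving writes at that moment."""
    with conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) FROM `{table}`")
        rows = cur.fetchone()[0]
        cur.execute(f"CHECKSUM TABLE `{table}`")
        checksum = cur.fetchone()[1]
    return rows, checksum

# Order tables smallest-first so quick index builds land early and the
# heaviest ones run last, keeping the replica responsive for longer.
with replica.cursor() as cur:
    cur.execute("""
        SELECT table_name FROM information_schema.tables
        WHERE table_schema = 'orders_db'
        ORDER BY data_length ASC
    """)
    tables = [row[0] for row in cur.fetchall()]

for table in tables:
    assert table_fingerprint(master, table) == table_fingerprint(replica, table), table

# Hypothetical map of the index DDL that was stripped out before the restore.
DEFERRED_INDEXES = {
    "orders": ["CREATE INDEX idx_orders_supplier ON orders (supplier_id)"],
}
with replica.cursor() as cur:
    for table in tables:
        for ddl in DEFERRED_INDEXES.get(table, []):
            cur.execute(ddl)
```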
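Finally, the cut-over in step 5 hinges on two checks inside the short write freeze: the old primary must be quiesced and the replica must show zero replication lag before promotion. Here is a hedged sketch of that gate, with the promotion itself left to the CloudSQL console or API (again with placeholder hosts and credentials):

```python
"""Sketch of step 5: freeze writes on the old primary, wait for the replica
to fully catch up, then hand over. The actual promotion is performed through
the CloudSQL tooling and is not shown here."""
import time
import pymysql

old_primary = pymysql.connect(host="OLD_PRIMARY_HOST", user="admin", password="SECRET")
replica = pymysql.connect(host="CLOUDSQL_HOST", user="admin", password="SECRET")

# 1. Start of the ~2-minute window: stop accepting writes on the old primary.
with old_primary.cursor() as cur:
    cur.execute("SET GLOBAL read_only = ON")

# 2. Wait until the replica has applied everything the old primary ever wrote.
def replication_lag(conn):
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
    return status["Seconds_Behind_Master"]

while replication_lag(replica) != 0:
    time.sleep(1)

# 3. Promote the CloudSQL instance (console/API step) and point application
#    writes at it, ending the brief write pause.
print("Replica caught up; safe to promote and flip application endpoints.")
```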

This final step, executed with precision and planning, exemplifies the meticulous approach we took throughout the migration. By prioritizing minimal downtime and user experience, we successfully moved our critical database to a new cloud environment, paving the way for future scalability and continued growth.

Benefits of this Approach:

  1. Reduced Downtime: Splitting the migration allowed us to minimize downtime to a brief window during the final cutover to the CloudSQL slave.
  2. Improved Performance: By offloading index creation to the slave, we maintained peak performance on the master during the migration process.
  3. Data Integrity and Consistency: We carefully monitored the replication process throughout to ensure data integrity and consistency. We implemented robust rollback mechanisms to ensure a seamless recovery in case of any unforeseen issues.

Conclusion: A massive migration made successful by innovation and collaboration.

By meticulously planning, rigorously testing, and adopting innovative strategies, we successfully migrated a massive 20TB database to GCP CloudSQL with near-zero downtime. This journey wasn't without its challenges, but through collaboration, technical ingenuity, and a commitment to user experience, we achieved the seemingly impossible.

Key to our success was the strategic separation of data and indexes, which minimized downtime and maintained peak performance throughout the migration. Asynchronous replication then kept CloudSQL continuously in sync with the source until the final cut-over, covering the ground where GCP's Database Migration Service (DMS) could not scale to a dataset of this size.

The migration not only achieved its technical goals but also delivered significant business value. By migrating to CloudSQL, we optimized costs, unlocked greater scalability, and positioned ourselves for continued growth. Moreover, the experience gained has equipped us with valuable expertise for future cloud migrations, fostering a culture of innovation and agility within our organization.

Glossary of Terms:

  • CloudSQL: Google's fully-managed relational database service.
  • Asynchronous Replication: A method of data replication where data is copied from a primary database to a secondary database with a slight delay.
  • DMS: GCP's Database Migration Service, a managed tool for migrating databases with minimal downtime.