This article is written by Jing Zhongzhi, a senior tech expert at Amap.
Just like Google Maps, Amap was first introduced as a navigation App. As its user base grew, Amap gradually broadened its business scope and started to provide local life services such as ride-hailing, local business search, and reviews. Amap has become a super App with hundreds of millions of DAU (Daily Active Users) and has to process a rapidly growing amount of data. In addition to storing hundreds of petabytes of data, Amap has to handle tens of thousands of QPS (Queries Per Second), which poses a great challenge to its technical architecture. To accommodate its business growth, Amap had to consider a more advanced RDBMS to ensure system availability and stability.
This article shares the thinking and considerations behind Amap’s database selection.
As an App with hundreds of millions of DAU, Amap keeps generating massive data around the clock, which makes storing the hundreds of terabytes of data generated by the App a serious challenge. At the same time, the Amap team has to process this massive amount of data efficiently, reduce data storage costs, and eliminate migration costs as much as possible. As Amap carries its local life service business forward, it needs to consider system availability and stability both now and years ahead when selecting a new RDBMS.
With the above considerations, Amap needs an RDBMS to solve the five problems below:
1. Transparent scalability.
A database with small storage specifications is usually deployed at the beginning of the business. However, how can we scale the database out by adding servers as the business grows, and dynamically scale it in as traffic decreases, thereby reducing the technical risks and costs brought by data migration? In other words, can we make sure that the database capacity always matches the business?
2. Cost reduction without performance compromise.
How do we reduce storage costs while maintaining satisfactory database performance for a business that generates terabytes of data? For example, the low-cost Open Data Processing Service (ODPS) is an option for storing terabytes of data, but its read/write speed is slow, which is a deal-breaker for online businesses.
3. System availability, storage stability, and disaster recovery.
For an application with a DAU above 100 million, service stability is a make-or-break factor because any fault can snowball into a widespread outage. How do we improve service availability, storage stability, and the capability for local and geo-disaster recovery?
4. Strong data consistency.
How do we ensure strong consistency and zero loss of financial settlement data, even after a primary/standby switchover when the primary node crashes?
5. MySQL compatibility.
Amap needs a database that supports the MySQL protocol to enable a smooth storage upgrade and reduce the code migration costs of the business system. Ideally, the new storage system would require no modification to the SQL statements of the business system.
To tackle the above five challenges, the Amap team weighed the possible architectural directions.
Standalone databases obviously fall short of Amap’s expectations. A distributed architecture is the only choice to meet Amap’s scalability requirement. The most natural solution is sharding on MySQL, which is widely adopted in the Alibaba Group.
Sharding on MySQL manages multiple physical MySQL databases by sharding on the application side or at the proxy layer, solving the capacity and performance limits of a single MySQL server. This solution supports the scaling of computing and storage resources. However, the business system needs heavy modification and a gray release before migration, which incurs high costs and risks in keeping the business uninterrupted during data migration.
Another choice is NewSQL databases. Unlike sharding on MySQL, NewSQL databases implement sharding natively and support dynamic scaling that is transparent to upper-layer applications. Obviously, a NewSQL database with native scalability is a better choice for Amap.
Finally, the Amap team wants to embrace the cloud-native era. A cloud-native database features resource pooling and also supports dynamic storage scaling.
With the direction nailed down, the Amap team started researching possible candidates. After a thorough evaluation against its business requirements, the team finally chose OceanBase, a native distributed SQL database, as the target RDBMS.
OceanBase Database is a cloud-native database system that features high availability, high scalability, high performance, and low costs. It adopts a shared-nothing architecture where all nodes are equal. Each node has its own SQL engine and storage engine, and the nodes run in clusters of commodity servers. Its scalability and high availability rely largely on the architecture and services such as OceanBase Proxy (OBProxy), OceanBase Server (OBServer), Global Timestamp Service (GTS), and RootService. Its high performance and low costs are mainly attributed to the storage engine.
OceanBase Architecture
OceanBase Database can be dynamically scaled out, and the data of partitioned tables is automatically distributed to the added nodes. The scaling process is transparent to upper-layer services, which not only saves migration costs but also causes zero interruption to the fast-growing business.
1. Transparent scalability
From the technical perspective, OceanBase Database ensures scalability by evenly distributing the partitions of tables across all OBServers in each zone. To scale out a cluster, the user only needs to add resource nodes to the cluster and allocate the new resources to the target tenant by modifying the tenant’s resource specifications.
As new resources are allocated to new OBServers, RootService of the cluster schedules the migration of partition replicas within the same zone until the load is thoroughly balanced among OBServers.
Accordingly, leader partition replicas in the primary zone are evenly distributed to OBServers to avoid excessive data update load on some OBServers.
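As a concrete sketch, such a scale-out can be driven entirely through the SQL interface of the sys tenant. The snippet below assumes a pymysql connection through OBProxy; the cluster, pool, and unit names are placeholders, and the exact admin syntax varies across OceanBase versions.

```python
import pymysql

# Connect to the sys tenant through OBProxy (default port 2883).
# Host, credentials, and object names below are placeholders.
conn = pymysql.connect(host="obproxy.example.com", port=2883,
                       user="root@sys#obcluster", password="***",
                       autocommit=True)

with conn.cursor() as cur:
    # Place one more resource unit of this tenant in every zone,
    # i.e., on the newly added OBServers.
    cur.execute("ALTER RESOURCE POOL amap_pool UNIT_NUM = 3")
    # Optionally enlarge the per-unit specification as well
    # (4.x-style syntax; older versions use MAX_MEMORY instead).
    cur.execute("ALTER RESOURCE UNIT amap_unit MAX_CPU 8, MEMORY_SIZE '32G'")

conn.close()
```

Once the statements take effect, the replica migration described above proceeds in the background with no application change.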
An issue here is the possible split-brain under the Paxos protocol. How do we avoid split-brain when one of the five zones fails in a cluster deployed in the five IDCs across three regions mode?
For example, as shown in the following figure, SH-1 and SH-2 are two zones in region A, HZ-1 and HZ-2 are two zones in region B, and SZ-1 is a zone in region C. When SH-1 is the primary zone but gets isolated, a new primary zone may be elected among the remaining zones to take over all leader partition replicas, while the old leaders in SH-1 may still believe they hold leadership.
Example of Transparent Scalability
OceanBase Database avoids split-brain by using leases. If each node simply incremented its own lease sequence number, a node might hold the latest sequence number while its logs lag behind, which would eventually lead to service unavailability. Instead, the election service of OceanBase Database obtains the latest timestamp from GTS to ensure that the leases of the old and new leaders do not overlap and that the timestamp of the new leader never rolls back. This guarantees that only one valid leader exists under the same lease, avoiding split-brain.
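The following toy model (illustrative Python, not OceanBase code) shows why leases granted from a single monotonic timestamp source cannot overlap: a candidate is only granted a lease from a timestamp that lies past the old lease's expiry.

```python
class GTS:
    """Toy monotonic timestamp oracle: timestamps never roll back."""
    def __init__(self):
        self._last = 0

    def now(self) -> int:
        self._last += 1
        return self._last


class Lease:
    def __init__(self, holder: str, start: int, duration: int):
        self.holder = holder
        self.expires_at = start + duration


def try_elect(gts: GTS, old_lease, candidate: str, duration: int = 10):
    """Grant a lease only from a timestamp past the old lease's expiry,
    so the leases of the old and new leaders can never overlap."""
    now = gts.now()
    if old_lease is not None and now <= old_lease.expires_at:
        return None  # the old leader may still be serving: refuse
    return Lease(candidate, now, duration)


gts = GTS()
lease_a = try_elect(gts, None, "SH-1")          # granted at t=1, valid to t=11
assert try_elect(gts, lease_a, "HZ-1") is None  # t=2: old lease still valid
```

Because the timestamp never goes backward, the expiry check is decisive, and at most one node can hold a valid lease at any time.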
Note that the high availability of GTS, a crucial service that the election service depends on, affects the high availability of the entire OceanBase cluster.
2. High availability
OceanBase Database adopts a shared-nothing multi-replica architecture to rule out single points of failure and ensure system availability. It allows the user to deploy three IDCs across two regions or five IDCs across three regions, which ensures city-level disaster recovery for settlement businesses that require consistency and partition tolerance (CP).
The high availability of OceanBase Database is guaranteed jointly by OBProxy, OBServer, GTS, and RootService.
An OBProxy receives SQL requests from upper-layer applications and forwards them to the appropriate OBServers. It holds no persistent state and obtains all required database information by accessing database services, so the failure of OBProxies causes no data loss. The addition and removal of OBProxies depend on load balancers, and users of OceanBase Cloud do not need to manage them.
OBServers are the member nodes of an OceanBase cluster. Replicas of table partitions are distributed across the nodes, which are grouped into zones, and each zone holds a complete set of replicas. Replicas of the same partition are distributed across multiple zones. Users can modify data through only one of the replicas, which is referred to as the leader; the other replicas are followers. OceanBase Database uses the Multi-Paxos protocol to guarantee data consistency across replicas and to elect the leader.
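A back-of-the-envelope illustration of the majority rule that Multi-Paxos relies on, using the five-replica layout from the disaster recovery discussion as the example:

```python
def majority(n_replicas: int) -> int:
    """Smallest group size such that any two quorums must intersect."""
    return n_replicas // 2 + 1

def is_committed(acks: int, n_replicas: int) -> bool:
    """In Multi-Paxos, a log entry is durable once a majority acks it."""
    return acks >= majority(n_replicas)

# Five replicas (e.g., five IDCs across three regions) tolerate the loss
# of any two replicas while still forming a quorum of three.
assert majority(5) == 3
assert is_committed(3, 5) and not is_committed(2, 5)
```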
GTS provides global timestamps. This service is also highly available. It keeps three replicas by using special partitions and ensures transaction sequence by providing snapshots and commit versions of all transactions executed in the same tenant. OceanBase Database ensures the reliability and availability of GTS in the same way as that for partition replicas.
RootService is the master controller of an OceanBase cluster. It also uses the Paxos protocol to achieve high availability and support leader election. Users can specify the number of its replicas. RootService determines the status of an OBServer by listening to the heartbeat packets reported from the OBServer. It can also activate or remove an OBServer and control the migration of the partition replica data stored in the OBServer.
3. Strong performance
OceanBase Database relies largely on its storage engine to deliver high performance. The storage engine adopts an LSM-tree-based architecture that stores baseline data on the disk (SSTables) and incremental data in the memory (MemTables), separating read and write operations. All data modifications are written as incremental data: redo logs are written to the disk sequentially while the data itself is written to the memory, so all DML operations are effectively in-memory operations with excellent performance. During reads, the incremental versions in MemTables are merged with the baseline data in persistent storage to return the latest data.
How OceanBase Performs Data Writes
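To make the read/write split tangible, here is a deliberately simplified sketch of an LSM-tree store in Python. It is not OceanBase internals: real SSTables are sorted, tiered, and persisted, and compaction is far more elaborate.

```python
class ToyLSM:
    """Minimal LSM-tree sketch: writes go to an in-memory MemTable;
    reads merge the MemTable over the on-disk baseline (SSTable)."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}  # incremental data, newest version wins
        self.sstable = {}   # baseline data, read-only between compactions
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # All DML lands in memory; durability would come from a redo log
        # appended sequentially to disk (not modeled here).
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.compact()

    def get(self, key):
        # Read path: the newest version in the MemTable shadows the baseline.
        if key in self.memtable:
            return self.memtable[key]
        return self.sstable.get(key)

    def compact(self):
        # Compaction: fold increments into a new read-only baseline.
        self.sstable.update(self.memtable)
        self.memtable.clear()
```

Writes never modify the baseline in place, which is what keeps DML in memory; reads consult the MemTable first so the newest version wins.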
OceanBase Database writes incremental data to the memory, and this LSM-tree design optimizes write performance. However, both MemTables and SSTables need to be accessed during reads, and reading tiered SSTables on the disk increases latency. To address this issue, OceanBase Database caches rows and data blocks in the memory to reduce data reads from the disk. It is therefore recommended to configure a large memory for OceanBase Database to increase the space for MemTables and caches.
OceanBase Database also builds Bloom filters to avoid probing SSTables that cannot contain a requested row, and it caches the filters themselves. Most online transaction processing (OLTP) operations are small queries, and OceanBase Database optimizes them to eliminate the overhead of parsing entire data blocks as traditional databases do. This makes the performance of OceanBase Database comparable to that of an in-memory database.
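The idea behind the Bloom filter fits in a few lines. This standalone sketch is only illustrative; OceanBase's actual filter lives inside its storage engine:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: answers 'definitely absent' or 'maybe present',
    letting point lookups skip SSTables that cannot contain the key."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.bits = bytearray(n_bits // 8)
        self.n_bits, self.n_hashes = n_bits, n_hashes

    def _positions(self, key: str):
        for i in range(self.n_hashes):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.n_bits

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

A point lookup consults the filter first: a negative answer proves the SSTable cannot hold the row, so the disk read is skipped, while a positive answer merely means the probe is worth doing.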
When incremental data in the memory reaches a specified size, it is compacted with the baseline data and persisted to the disk. The system also performs a daily major compaction during idle hours at night. Because the baseline data is read-only and stored contiguously, OceanBase Database can apply different compression algorithms to different types of data. This ensures a high compression ratio and greatly reduces costs without affecting query performance.
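A rough feel for why read-only, contiguously stored baseline data compresses well. Here zlib merely stands in for OceanBase's encoding-plus-compression pipeline, and real ratios depend on the data and the algorithm chosen:

```python
import zlib

# Contiguous, regular baseline rows compress aggressively offline.
rows = "".join(f"order_{i:08d},PAID,2023-01-01\n" for i in range(100_000))
block = rows.encode()
packed = zlib.compress(block, level=6)
print(f"raw={len(block):,} B  compressed={len(packed):,} B  "
      f"ratio={len(packed) / len(block):.1%}")
```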
4. System stability and data reliability
System stability
OceanBase Database makes it possible to keep the business uninterrupted. If a service node fails, it is automatically removed from the cluster; the data on the node has replicas on other nodes, which take over its data service. If an IDC fails, OceanBase Database can switch its service to nodes in other IDCs within a short period of time.
Data reliability
OceanBase Database distributes multiple replicas across multiple IDCs and synchronizes these replicas based on the Paxos protocol to ensure strong data consistency. If a node fails, other nodes still host healthy replicas.
5. Lower storage costs
OceanBase Database compresses data to about one third of its original size, which greatly reduces storage costs for businesses that rely on massive stored data. In addition, OceanBase Database isolates resources between tenants while letting tenants of the same cluster share the spare resources, meaning that the user can provision fewer resources and keep costs under control.
6. Great compatibility with minimal if not zero transformation
OceanBase Database supports MySQL protocols and allows access through MySQL drivers. Basically, users can use it as a distributed MySQL database.
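For instance, a stock Python MySQL driver connects without any OceanBase-specific code. The host, port, credentials, and table below are placeholders; with OBProxy, the user name conventionally encodes the tenant and cluster:

```python
import pymysql

# OceanBase speaks the MySQL protocol, so a standard MySQL driver works.
conn = pymysql.connect(
    host="obproxy.example.com",
    port=2883,                            # default OBProxy listening port
    user="app_user@amap_tenant#obcluster",  # user@tenant#cluster convention
    password="***",
    database="compensation",
    charset="utf8mb4",
)

with conn.cursor() as cur:
    # Plain MySQL-dialect SQL, unchanged from the original business code.
    cur.execute("SELECT id, amount FROM compensation_order WHERE id = %s", (42,))
    print(cur.fetchone())

conn.close()
```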
To get familiar with deploying OceanBase Database on Alibaba Cloud and to gain experience with this new technology, we chose the compensation system, which handles a small business volume, as the pilot.
We applied for a test OceanBase cluster and ran the core DDL and DML operations on it, expecting to identify and resolve possible compatibility issues. Fortunately, all the data create/retrieve/update/delete (CRUD) operations, the table create/drop/alter operations, and the index create/drop operations involved were executed as expected. This test result increased our confidence in the later migration to our production environment.
After that, we started a stress test on the OceanBase cluster.
As we didn’t have a specialized stress testing platform back then, the stress test was performed indirectly through APIs on the application servers, at relatively low queries per second (QPS).
Three application servers, each configured with 4 CPU cores and 8 GB of memory, were used to indirectly test an OceanBase database with 8 CPU cores and 32 GB of memory. At 3,000 QPS for data reads and 1,500 QPS for data writes, OceanBase Database demonstrated great performance with a response time (RT) within 1 ms.
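In spirit, the indirect test resembled the sketch below. The endpoint, thread count, and request mix are hypothetical stand-ins for our application APIs; the real test mixed reads and writes:

```python
import statistics
import threading
import time

import requests

RT_MS = []  # response times in milliseconds, collected across threads

def worker(n_requests: int):
    for _ in range(n_requests):
        t0 = time.perf_counter()
        # Hypothetical application endpoint that reads one record
        # from OceanBase through the service layer.
        requests.get("http://app.example.com/api/order/42", timeout=2)
        RT_MS.append((time.perf_counter() - t0) * 1000)

threads = [threading.Thread(target=worker, args=(500,)) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()

RT_MS.sort()
print(f"p50={statistics.median(RT_MS):.2f} ms  "
      f"p99={RT_MS[int(len(RT_MS) * 0.99)]:.2f} ms")
```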
However, OceanBase Database is a quasi-in-memory database that needs a large memory to build caches and hold incremental data, so hot data tends to stay in memory. In other words, the stress test was actually performed on data in the memory and reflected only the in-memory performance of OceanBase Database, which is certainly far better than that of disks. Therefore, we did not carry the test further.
The test results showed that OceanBase Database delivered high performance most of the time, mainly because it served hotspot data from memory. However, when cold data is queried, OceanBase Database may load data from the disk or probe the tiered SSTables; the RT in that case is not yet determined because of insufficient test data.
Stress test against OceanBase
We were further assured by the experience of Trip.com, a leading travel App in China, with OceanBase Database. Compared with MySQL, OceanBase Database doubles the read performance and triples the write performance on average. Thanks to the artful data encoding and tiered compression technology of its storage engine, OceanBase Database consumes two thirds less storage than MySQL, which greatly reduces hardware costs.
The following performance charts, offered courtesy of Trip.com, clearly show that OceanBase Database processes read and write transactions with RT at around 1 ms, which agrees with our test results.
Trip.com’s stress test against OceanBase
After the functionality and performance verification, we set out to migrate the data of our compensation system to OceanBase Database. To ensure a smooth migration with zero downtime for online services, we first considered the Data Transmission Service (DTS) of Alibaba Cloud, which can fully migrate the existing data in XDB to OceanBase Database and then synchronize the incremental data from XDB to OceanBase Database to ensure data consistency.
However, as DTS supports only data migration between Taobao Distributed Data Layer (TDDL) XDB instances, not between an XDB instance and a heterogeneous database instance, we then turned to OceanBase Migration Service (OMS).
OMS supports data migration between heterogeneous databases, such as MySQL and OceanBase Database. The service synchronizes incremental data by subscribing to the binlogs of the source database, storing the logs in OMS stores, and replaying the subscribed logs to the destination database. As for full data, the service directly queries it from the source database and synchronizes it. OMS supports schema synchronization, full data synchronization, incremental data synchronization, data verification, and reverse synchronization, that is, two-way synchronization between the source and destination databases.
Data migration
OMS also supports data subscription, which allows the user to consume OceanBase Database binlogs through clients for specific purposes, such as synchronizing data to Lindorm and Elasticsearch. The following figure shows the migration process of full and incremental data.
Data migration process
Unfortunately, although OceanBase Database supports data migration from MySQL, a native MySQL database does not provide a built-in component to support TDDL, a database access middleware widely used in Alibaba Group. At that point, we had two options: Plan A, accessing the databases directly with a MySQL username and password so that the migration tooling could take over; or Plan B, keeping access behind TDDL and handling the data synchronization ourselves.
Eventually, we chose the second option. Then, how do we complete full and incremental data migration in this case? Other companies that use OceanBase clusters in Alibaba Cloud achieved smooth data migration by using DataX to synchronize full data and synchronizing incremental data themselves. Based on a trade-off analysis, we decided on the same method to bring the business online quickly and verify the capabilities of OceanBase Database as soon as possible.
In Plan B, the configuration middleware Diamond is deployed to dynamically switch between databases, and the data consistency between XDB and OceanBase Database is verified offline on the MAC platform.
Migration plans comparison
You may wonder: why did we give up Plan A, and why did we go through TDDL at all when OceanBase Database can be accessed directly through a MySQL driver?
The downsides of Plan A lie in the complex workflow and authorization. To access the database through a MySQL username and password, we would need to apply for a migration account with a password in the production environment. Such a migration poses high risks, and we would need to involve many people, even executives, in negotiation and approval. The whole process is difficult and time-consuming. Even if the account were approved, the migration could still be interrupted by other problems, such as the lack of APIs for external calls. For the sake of security and efficiency, we gave up Plan A.
It is true that TDDL is not required for access to OceanBase Database: we could access OceanBase Database with a username and password through a MySQL driver, replacing the TDDL DataSource with a plain MySQL data source. However, since direct access based on a username and password risks leaking the credentials in plaintext, we chose to connect to OceanBase Database indirectly through TDDL.
While this solution was not the most elegant one, it checked the most boxes for us back then. The ideal solution would be for OceanBase Database to work seamlessly with middleware platforms, minimizing our workload on jobs irrelevant to the core business.
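For the full synchronization step, a DataX job is described by a JSON config. The sketch below generates one in Python; the plugin names follow the public DataX documentation, the hosts, tables, and credentials are placeholders, and DataX also ships dedicated reader/writer plugins for other sources such as ODPS and OceanBase.

```python
import json

# Sketch of a DataX job for the full synchronization step: read from the
# source through its MySQL endpoint and write into OceanBase, which also
# speaks the MySQL protocol.
job = {
    "job": {
        "setting": {"speed": {"channel": 4}},
        "content": [{
            "reader": {
                "name": "mysqlreader",
                "parameter": {
                    "username": "reader",
                    "password": "***",
                    "column": ["id", "order_id", "amount", "gmt_create"],
                    "connection": [{
                        "table": ["compensation_order"],
                        "jdbcUrl": ["jdbc:mysql://xdb.example.com:3306/comp"],
                    }],
                },
            },
            "writer": {
                "name": "mysqlwriter",
                "parameter": {
                    "username": "writer",
                    "password": "***",
                    "writeMode": "insert",
                    "column": ["id", "order_id", "amount", "gmt_create"],
                    "connection": [{
                        "table": ["compensation_order"],
                        "jdbcUrl": "jdbc:mysql://obproxy.example.com:2883/comp",
                    }],
                },
            },
        }],
    }
}

with open("full_sync_job.json", "w") as f:
    json.dump(job, f, indent=2)

# Submit with DataX, e.g.: python bin/datax.py full_sync_job.json
```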
The business then went online in double-write mode. In this mode, data is written to both XDB and OceanBase Database, and the data consistency is verified by the MAC platform one day after the data is generated (T+1). In case of inconsistencies, a full data synchronization is performed from ODPS to OceanBase Database, until the data in OceanBase Database stays consistent with that in XDB for several consecutive days.
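A minimal sketch of the double-write path, assuming pymysql-style connections; the helper names are hypothetical and error handling is reduced to the essentials:

```python
def log_for_resync(sql, params, exc):
    # Hypothetical repair hook: record the failed OceanBase write so the
    # T+1 verification and the full re-sync from ODPS can fix it later.
    print(f"OceanBase write failed, queued for repair: {exc}")

def double_write(xdb_conn, ob_conn, sql, params):
    """Write to XDB first (the source of truth), then to OceanBase."""
    with xdb_conn.cursor() as cur:
        cur.execute(sql, params)
    xdb_conn.commit()

    # The OceanBase write is best-effort during the verification window;
    # a failure must not break the online business.
    try:
        with ob_conn.cursor() as cur:
            cur.execute(sql, params)
        ob_conn.commit()
    except Exception as exc:
        log_for_resync(sql, params, exc)
```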
During the migration, it is important to note that:
With the compensation system as a pilot, Amap also plans to migrate the data of other systems, such as the financial system and the clearing and settlement system, to OceanBase Database in the future. Although the current version of OceanBase Cloud is not perfect, the company has gained rich experience from the work with the OceanBase team. After the compensation system started to use OceanBase Database, we made a preliminary assessment of the future benefits.
First, we compared the costs of XDB and OceanBase Database. OceanBase Database achieves a high data compression ratio with its data encoding and compression technology. In general, business data can be compressed to about 30% of its original size, which lowers the overall price of OceanBase Database. According to our calculations, 6 TB of data stored in an XDB instance with 16 CPU cores and 64 GB of memory can be compressed to 1.8 TB in an OceanBase database with 8 CPU cores and 32 GB of memory. At a storage volume of 6 TB, the costs of XDB and OceanBase Database break even; beyond that, OceanBase Database costs less than XDB.
In addition, DTS, including the Data Replication Center (DRC) subscription, incurs implicit subscription costs that are almost equal to the XDB costs. In contrast, OMS is free for now.
As a database capable of hybrid transactional and analytical processing (HTAP), OceanBase Database also supports and optimizes online analytical processing (OLAP). Amap can try migrating data from Lindorm to OceanBase Database, which would let the company manage online data in a single storage system and save the costs of Lindorm.
Second, geo-disaster recovery. OceanBase Database supports the five IDCs across three regions deployment with strong data consistency. This provides city-level disaster recovery without data loss and substantially improves business stability.
Third, the distributed architecture of OceanBase Database supports dynamic scaling. The data storage of the business system can be smoothly scaled out as needed, and the scaling process is transparent to the business with zero downtime. This also saves the costs of database scale-out and data migration in response to business surges, greatly reducing the risk of insufficient database capacity.