Photo by Jeremy Perkins on Unsplash
People love shopping festivals like Black Friday, Cyber Monday, and Double 11 because the large promotions let them buy things at bargain prices. People tend to add things to their wish lists in advance and place orders during the major campaigns.
When many people buy at the same time, the database sees a large number of concurrent transactions. Database administrators (DBAs) need to prepare for such traffic surges so that people can buy whatever they want at any time and at the right price, without waiting.
Ted Bai, a senior solution architect at OceanBase, was once a senior DBA at Alipay, where he went through several Double 11 shopping festivals. He talked to Hui Xingxing, an Oracle ACE, about what DBAs should consider when they prepare for major promotions.
Their chat is covered in a series of three articles. This is the first one, and it focuses on database stability.
During massive online promotions, we must prevent data loss because money and products are involved, which makes the workload as critical as the core business of a financial institution. As a former database administrator (DBA) at Ant Group, I know well how painful a primary/standby switchover can be.
In a conventional primary/standby architecture, DBAs must choose between data consistency and system availability.
Ant Group faced this dilemma in 2015. Back then, some telecom fiber-optic cables were damaged during a municipal construction project, leaving the primary and standby databases disconnected. A forced switchover could have caused immeasurable loss, which was unacceptable because Ant Group’s business involved a large number of financial transactions.
So, how did OceanBase Database help Ant Group solve this problem?
OceanBase Database adopts a shared-nothing distributed architecture and maintains strong data consistency by using the Paxos consensus protocol. Under this protocol, a node becomes the leader and provides I/O services only after a majority of the nodes in the cluster agree on it. As a result, zero data loss is guaranteed even if one out of three or two out of five nodes fail.
OceanBase Database also performs I/O verification and data comparison to catch extreme cases such as silent data corruption and bit flips on disk or in memory. This further ensures that users read consistent data from different replicas.
During off-peak hours, we can also launch background tasks that compare the baseline data across replicas, which adds another safeguard against leader-follower data inconsistency.
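The "one out of three or two out of five" figures follow from majority arithmetic. The snippet below is a minimal sketch of that arithmetic only, not of Paxos itself; the function names are made up for illustration.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can fail while a majority is still reachable."""
    return replicas - majority(replicas)


def can_commit(replicas: int, acks: int) -> bool:
    """A write counts as durable once a majority has acknowledged it."""
    return acks >= majority(replicas)


if __name__ == "__main__":
    for n in (3, 5):
        print(f"{n} replicas: majority={majority(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Because any two majorities overlap in at least one replica, a newly elected leader can always find a replica that holds every acknowledged write, which is why losing a minority of nodes does not lose data.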
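To make the idea of comparing replicas concrete, here is a minimal sketch that checksums the same data block on each replica and flags the ones that disagree with the majority. The source does not describe how OceanBase Database computes or compares its checksums, so SHA-256, the function names, and the replica names are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def block_checksum(data: bytes) -> str:
    """Checksum of one data block (SHA-256 is an assumption, not OceanBase's algorithm)."""
    return hashlib.sha256(data).hexdigest()


def find_divergent_replicas(blocks: dict[str, bytes]) -> list[str]:
    """Return the replicas whose block checksum differs from the majority checksum."""
    sums = {name: block_checksum(data) for name, data in blocks.items()}
    majority_sum, count = Counter(sums.values()).most_common(1)[0]
    if count <= len(blocks) // 2:
        raise ValueError("no majority checksum; manual inspection needed")
    return sorted(name for name, s in sums.items() if s != majority_sum)


if __name__ == "__main__":
    good = b"order=1001;amount=25.00"
    flipped = bytes([good[0] ^ 0x01]) + good[1:]  # simulate a single bit flip
    print(find_divergent_replicas({"zone1": good, "zone2": good, "zone3": flipped}))
```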
Therefore, after migrating the business to OceanBase Database, Ant Group, specifically Alipay, rarely wrestled with data repair and correction after a primary/standby database switchover.
Poor user experience during a massive online promotion can cost dearly. From a holistic perspective, it is extremely challenging to guarantee that the entire path runs smoothly, from applications to the database and its underlying storage, especially during a pronounced traffic surge.
To provide a better user experience, the business architecture must also be well designed, in addition to keeping the database system stable. OceanBase Database adopts a logical data center (LDC) design for all users. In this design, the global requests of user applications are distributed across several LDCs, and each LDC contains a self-contained deployment of the business applications and the database. A benefit of the LDC design is its scalability: the system can transfer 1% of traffic to the Internet data center (IDC) of any city at any time to address exceptions such as insufficient capacity of an LDC or IDC. The traffic transfer capability is therefore crucial. Together, the multitenant architecture and LDC design of OceanBase Database work well in ensuring system stability and a good user experience.
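One way to picture the 1% traffic transfer is a routing table with 100 buckets, where each bucket carries roughly 1% of users and can be reassigned on its own. This is a hedged sketch under that assumption; the `LdcRouter` class, the CRC32 bucketing, and the LDC names are hypothetical and do not describe the actual routing layer.

```python
import zlib

NUM_BUCKETS = 100  # assumption: each bucket carries roughly 1% of users


def bucket_of(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    return zlib.crc32(user_id.encode("utf-8")) % NUM_BUCKETS


class LdcRouter:
    """Hypothetical bucket-to-LDC routing table."""

    def __init__(self, ldcs: list[str]) -> None:
        if not ldcs:
            raise ValueError("at least one LDC is required")
        # Spread the buckets round-robin across the available LDCs.
        self.table = [ldcs[i % len(ldcs)] for i in range(NUM_BUCKETS)]

    def route(self, user_id: str) -> str:
        """Return the LDC that currently serves this user."""
        return self.table[bucket_of(user_id)]

    def shift_bucket(self, bucket: int, target_ldc: str) -> None:
        """Move one bucket (about 1% of traffic) to another LDC."""
        self.table[bucket] = target_ldc


if __name__ == "__main__":
    router = LdcRouter(["ldc-hangzhou", "ldc-shanghai"])
    b = bucket_of("user-42")
    print("before:", router.route("user-42"))
    router.shift_bucket(b, "ldc-shenzhen")
    print("after: ", router.route("user-42"))
```

Because every user maps to exactly one bucket, moving a bucket moves all requests of those users together, which fits the self-contained nature of each LDC.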
To keep the database system stable, OceanBase Database also provides features to optimize SQL execution. For example, we can run inspections to check for abnormal SQL execution plans or statistics before launching massive online promotions. This is of course only part of the work to ensure system stability. Although most online transactions involve only online transaction processing (OLTP) tasks or short SQL statements such as inserts, sporadic aggregate queries are unavoidable due to the nature of our business. When enough of these queries pile up, they affect database stability. To address this issue, OceanBase Database parses SQL queries, identifies the SQL ID of large ones, and puts them in a separate queue the next time it receives them. This separate queue uses dedicated resources, so large queries wait in their own queue without contending for resources with critical small transactions during a sudden traffic increase. This meets the strict response time and stability requirements of OLTP scenarios.
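The large-query isolation described above can be sketched as follows: normalize a statement into an ID, remember which IDs ran long, and send those IDs to a separate queue the next time they arrive. This is an illustrative model only. The `QueryScheduler` class, the normalization rule, and the 5-second threshold are assumptions, not the internal implementation.

```python
import hashlib
import re
from collections import deque

LARGE_QUERY_THRESHOLD_S = 5.0  # hypothetical threshold for "large"


def sql_id(sql: str) -> str:
    """Derive an ID from the statement text with literals replaced by '?'."""
    normalized = re.sub(r"'[^']*'|\b\d+\b", "?", sql)
    normalized = " ".join(normalized.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


class QueryScheduler:
    """Hypothetical scheduler with separate queues for small and large queries."""

    def __init__(self) -> None:
        self.known_large = set()
        self.small_queue = deque()
        self.large_queue = deque()

    def submit(self, sql: str) -> str:
        """Enqueue a statement and return the name of the queue it went to."""
        if sql_id(sql) in self.known_large:
            self.large_queue.append(sql)
            return "large"
        self.small_queue.append(sql)
        return "small"

    def record_execution(self, sql: str, elapsed_s: float) -> None:
        """Remember statements that ran long so later runs are isolated."""
        if elapsed_s >= LARGE_QUERY_THRESHOLD_S:
            self.known_large.add(sql_id(sql))


if __name__ == "__main__":
    s = QueryScheduler()
    report = "SELECT SUM(amount) FROM orders WHERE shop_id = 7"
    print(s.submit(report))             # first sighting goes to the small queue
    s.record_execution(report, 12.0)    # it turned out to be slow
    print(s.submit("select sum(amount) from orders where shop_id = 99"))
```

Keying on a normalized ID rather than the raw text means the same report query with a different shop ID is still recognized as large.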
In addition to SQL execution performance, OceanBase Database also ensures database stability by coping with high business concurrency. For example, in a flash sale where a large number of users want to buy the same product, prompt deduction of the inventory is crucial. This translates to highly concurrent updates of a single row in a data table. For Alipay, the frequent transfer requests from a large number of merchants also require prompt updates of their account balances. If the database cannot sustain enough transactions per second (TPS) in a high-concurrency scenario, the income of merchants and the overall system throughput are directly affected. To overcome this challenge, the OceanBase team identified that single-row update is essentially a row lock issue: transactions queuing on the same row lock are a major cause of poor performance under concurrent updates. To address this issue, OceanBase Database releases the row lock before a transaction is committed, which significantly improves the performance of concurrent single-row updates. This early lock release (ELR) feature can improve TPS by 3 to 5 times in cross-zone disaster recovery deployments, and is therefore a great help for Alipay.
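A back-of-the-envelope model shows why releasing the row lock early helps. Updates to one hot row serialize on its lock, so the row's throughput is bounded by how long each transaction holds the lock. The latencies below are made-up illustrative values, not measurements, and the model assumes the lock is normally held through log synchronization across zones, while with ELR it is held only during execution.

```python
def hot_row_tps(lock_hold_us: int) -> float:
    """Upper bound on updates per second for one row, given the lock hold time."""
    return 1_000_000 / lock_hold_us


EXECUTE_US = 250    # illustrative: time to apply the update in memory
LOG_SYNC_US = 750   # illustrative: time to replicate the commit log across zones

# Without ELR the lock is held until the commit log is synchronized.
without_elr = hot_row_tps(EXECUTE_US + LOG_SYNC_US)
# With ELR the lock is released once the update has been executed.
with_elr = hot_row_tps(EXECUTE_US)

if __name__ == "__main__":
    print(f"without ELR: {without_elr:.0f} TPS")
    print(f"with ELR:    {with_elr:.0f} TPS")
    print(f"speedup:     {with_elr / without_elr:.1f}x")
```

The model suggests the gain grows with the ratio of log synchronization time to execution time, which would explain why the benefit is most visible across zones. It ignores the bookkeeping a real implementation needs, such as handling a later transaction that built on a row whose earlier transaction ultimately fails to commit.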
OceanBase Database copes well with server failure and network jitter. When the leader fails or becomes unavailable due to hardware problems or a power outage, a new leader is elected within 30 seconds through the Paxos protocol. Application requests are then routed to the new leader in a way that is nearly transparent to applications.
When network jitter occurs, OceanBase Database can quickly isolate the abnormal traffic and route the normal traffic to healthy nodes by switching the leader role. This means OceanBase Database is a self-healing system. A decision tree is developed based on the experience of DBAs to facilitate machine learning, allowing this self-healing system to quickly recover from exceptions such as network jitter, server failure, and insufficient capacity.
Simply put, OceanBase Database performs smoothly in massive online promotions because it integrates many proven techniques for improving overall experience and stability. Continuous improvement in the face of harsh challenges from real-world business scenarios has made OceanBase Database reliable and easy to use.
There are four more questions coming along. Stay tuned!