61M QPS Challenge in Alipay: How did we do it

This article is the transcript of a tech talk given by Ted Bai, a senior Solution Architect at OceanBase, at a Bay Area Infra meetup. Ted was a senior DBA at Alipay before he joined OceanBase, so he knows firsthand how Alipay dealt with the technical challenges brought by the huge traffic surges during shopping festivals like Double 11.


Photo by shun idota on Unsplash

As you all know, the highest QPS during the Double 11 Shopping Festival so far was 61 million, recorded in 2019, and the TPS reached 544,000, meaning that many payment transactions were being processed concurrently. This posed great technical challenges. Traditional monolithic database management systems simply cannot do the job anymore. We need something new.

The Technical Challenges behind 61M QPS

Any challenge is amplified once a system's traffic reaches a certain threshold. Generally speaking, there are several challenges in a distributed database management system:

  1. Elastic scaling, especially before large promotion events such as Black Friday or Cyber Monday.
  2. Disaster recovery. There are several levels of disaster recovery, from the node level to the rack, data center, and region (city) level. And we don’t want to lose any data because we are dealing with payment transactions.
  3. Resilience. For example, you have multiple IDCs. And the network among these data centers might not work. What should we do to prevent the database from being affected by the network jitter?
  4. Hot-row concurrent update. This happens a lot in the retail industry. For example, a certain product is on big promotion during the shopping holiday, meaning that a lot of people will buy it and the inventory needs constant updating. So this single row in the Inventory table is updated constantly, which makes it a hot row and its data hot data.
  5. Database throttling. What if a large amount of traffic comes directly to your database when the application cannot do the throttling at the time?
  6. Analytical SQL in OLTP systems. In a payment or trade application, most of the database requests are simple queries. But occasionally there are big and complex analytical queries coming through. What do we do to handle these queries without impacting OLTP performance and system stability?
  7. Blast radius. How do we minimize the impact of a failure on the business to less than maybe 1%?
  8. Autonomous service. I was a DBA before, so I know that one DBA may need to maintain more than 1,000 database instances. How can that be done?

Let’s cover these challenges one by one.

Unitization & Multi-Active Architecture

In Alipay, we adopted what we called unitization architecture to cope with such high concurrency.

What is unitization and what is it for?

The unitization architecture is based on the LDC (logical data center). In a logical data center, there are three types of zones. See the figure below.


The first one is Regional Zone (RZone), which divides business data into 100 units by any parameter you think is suitable, say User ID. Each unit has been assigned resources and has its own business logic to function independently, each representing 1% of the whole service.

The second one is Global Zone (GZone). Systems that cannot be divided into units, such as the CIF and the config center, are placed in the GZone. The data in the GZone is shared by RZones for read and write. There is only one GZone in the whole application system.

The last type is City Zone (CZone). Why is there a CZone? Because the GZone needs to handle read/write requests from applications and the workload can be really high. To ease the pressure on the GZone, we set up mirrors of the GZone in each city or region, which is why we call it a City Zone. In other words, a CZone holds the same data as the GZone.
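To make the routing idea concrete, here is a minimal Python sketch, not Alipay's actual code, of how a request could be mapped to one of the 100 RZone units by User ID, with shared data read from the local CZone mirror. The unit-to-city split and the UID arithmetic are illustrative assumptions.

```python
# Toy illustration of unitized routing: 100 units keyed by the last two
# digits of the User ID, each unit owning 1% of the traffic.
# Names and numbers are illustrative, not Alipay's real implementation.

NUM_UNITS = 100

# Units 00-29 -> City1, 30-59 -> City2, 60-99 -> City3 (the 3:3:4 split).
def city_of_unit(unit_id: int) -> str:
    if unit_id < 30:
        return "City1"
    if unit_id < 60:
        return "City2"
    return "City3"

def route_request(user_id: int) -> dict:
    """Pick the RZone unit that owns this user and the city hosting it."""
    unit_id = user_id % NUM_UNITS          # e.g. UID ending in 37 -> unit 37
    return {
        "unit": f"RZone-{unit_id:02d}",
        "city": city_of_unit(unit_id),
        # Shared, read-mostly data (config, CIF) is read from the local
        # CZone mirror instead of the single global GZone.
        "config_source": f"CZone-{city_of_unit(unit_id)}",
    }

if __name__ == "__main__":
    for uid in (2088000000000137, 2088000000000099, 2088000000000003):
        print(uid, "->", route_request(uid))
```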

So what is the benefit of this architecture?

The first one is that when one unit fails, the impact will be only 1% of the whole system because each unit runs independently.

The second benefit is that we can switch traffic at the 1% granularity from one city to another, one region to another, or one data center to another, as per business requirements, which is very flexible.

The third one is disaster recovery. Let’s take the RZone for example. As I mentioned above, the whole system is divided into 100 units and these units are deployed in three cities with a 3:3:4 ratio. So when any of the three cities crashes, in the worst case only 40% of the system will be impacted. Then with the bottom layer (CZone), we can quickly switch the affected traffic to another city.

The “Five Data Centers in Three Cities” Architecture

The City Zone concept leads us to the new architecture of OceanBase, which is five data centers in three cities.


We can see from the above figure that one OceanBase cluster is deployed in five data centers across three cities. Let’s say we split data into 100 units by User ID, and this cluster holds four of those units, UID00 to UID03. Each unit has five replicas deployed separately in the five data centers. You can see that the deployment is designed for extremely high availability. Even if a city crashes due to an earthquake or a power outage, business service can be recovered within 30 seconds without any data loss.

But city-level disaster recovery is only one of the benefits brought by this architecture.

With multiple replicas in different regions, we use a load balancing module which we call OBProxy to route requests from applications to the closest data center.
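The routing decision can be illustrated with a toy sketch. It is not OBProxy's real algorithm; it only assumes that writes and strongly consistent reads must reach the partition leader, while a relaxed-consistency read may be served by a replica in the client's own region.

```python
# Toy proximity-aware routing sketch. Not OBProxy's real logic; it only
# illustrates sending strong operations to the leader and letting
# relaxed-consistency reads hit the closest replica.

from dataclasses import dataclass

@dataclass
class Replica:
    node: str
    city: str
    is_leader: bool

REPLICAS = [
    Replica("obs-1", "City1", is_leader=True),
    Replica("obs-2", "City1", is_leader=False),
    Replica("obs-3", "City2", is_leader=False),
    Replica("obs-4", "City2", is_leader=False),
    Replica("obs-5", "City3", is_leader=False),
]

def route(client_city: str, weak_read: bool = False) -> Replica:
    if not weak_read:
        # Writes and strongly consistent reads must hit the leader.
        return next(r for r in REPLICAS if r.is_leader)
    # A relaxed-consistency read can be served by a replica in the
    # client's own city, falling back to the leader otherwise.
    local = [r for r in REPLICAS if r.city == client_city]
    return local[0] if local else next(r for r in REPLICAS if r.is_leader)

if __name__ == "__main__":
    print(route("City2"))                  # -> leader in City1
    print(route("City2", weak_read=True))  # -> local replica in City2
```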

Another benefit is scalability. Since we can split business data into 100 units, theoretically we can split it into more units, say 200. The only thing we need to do is add more nodes to the cluster and let it rebalance automatically. This helps businesses handle traffic surges and rapid growth with ease.

Every year in Alipay before the Double 11, we need to scale out our database system to our cloud data center, which we call elastic scaling.

Let me keep it simple with these two slides.


This is before the scaling. We have four UIDs in this cluster, and all the UID leaders are in Data Center 4 in City3. This is fine for daily traffic, but not for Double 11. So we need to make sure the traffic is evenly distributed across the cities by using the cloud data center. OceanBase can do the rebalancing transparently, moving replicas directly to the other two cities and migrating some of the replica leaders to another data center, without the application noticing.


After this elastic scaling process, you can see that the leaders are evenly distributed across every city and every data center. It is worth mentioning that the rebalancing takes only one command on the DBA side. All the hard work, including the consistency check, is done by the OceanBase kernel. The entire rebalancing process takes only a few hours.
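Conceptually, the rebalancing boils down to spreading the partition leaders evenly across the data centers. The toy sketch below shows a round-robin reassignment of the four UID leaders; the data center names and the algorithm are illustrative assumptions, not OceanBase's actual rebalancing logic or command.

```python
# Toy leader rebalancing: before scaling, all four UID leaders sit in one
# data center; after rebalancing they are spread round-robin across the
# available data centers. Illustrative only, not OceanBase's algorithm.

from collections import Counter

def rebalance_leaders(units, data_centers):
    """Assign each unit's leader to a data center in round-robin order."""
    return {unit: data_centers[i % len(data_centers)]
            for i, unit in enumerate(units)}

if __name__ == "__main__":
    units = ["UID00", "UID01", "UID02", "UID03"]

    before = {u: "IDC4-City3" for u in units}        # all leaders in one IDC
    after = rebalance_leaders(
        units, ["IDC1-City1", "IDC2-City2", "IDC3-City2", "IDC4-City3"])

    print("before:", Counter(before.values()))
    print("after: ", Counter(after.values()))
```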

The “One Million Payment TPS” Architecture

As we mentioned earlier, Alipay was dealing with 544,000 TPS during Double 11 in 2019. One of the challenges was that the traffic in one unit outgrew the capacity of the unit. What do we do?


The idea is that if we want to handle 544,000 TPS, the design capacity must be much higher than that. So at that time, we came up with a project called the “one million payment TPS” architecture.

In the traditional sharding-plus-middleware solution, you need to shard data manually, by rows or by columns, with the help of middleware tools. It is a risky and high-cost operation because you have to modify the application code to make the sharding effective.

Since OceanBase is a natively distributed database that supports partitioning by design, the partitioning work can be done automatically inside the database without manual intervention.

Take UID00 as an example. Every table in UID00 is partitioned, and each table can have 100 partitions. One unit can be deployed across several nodes. Currently, in Alipay’s production environment, one unit is deployed on four to six servers, which is pretty distributed. It is actually more like data shuffling: with four nodes, every node hosts 25 of the partitions, and each of those partitions has its leader on that node.

In OceanBase, the leader is a partition-level concept, so every node can serve both reads and writes, which improves the ability to deal with high-concurrency situations.
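Using the numbers from the talk (100 partitions per table, one unit on four nodes, 25 partition leaders per node), the layout can be sketched as follows. The hashing and placement rules are simplified assumptions, not OceanBase's internal logic.

```python
# Toy illustration of partition-level leaders: rows hash into 100
# partitions, and the partition leaders are spread across 4 nodes
# (25 leaders per node), so every node serves reads and writes.
# Numbers follow the talk; the placement rule is illustrative only.

from collections import Counter

NUM_PARTITIONS = 100
NODES = ["node-1", "node-2", "node-3", "node-4"]

def partition_of(row_key: int) -> int:
    """Hash-style partitioning: which of the 100 partitions owns this row."""
    return row_key % NUM_PARTITIONS

def leader_node(partition_id: int) -> str:
    """Spread partition leaders evenly: partitions 0-24 -> node-1, etc."""
    return NODES[partition_id // (NUM_PARTITIONS // len(NODES))]

if __name__ == "__main__":
    leaders = Counter(leader_node(p) for p in range(NUM_PARTITIONS))
    print(leaders)                          # 25 leaders per node
    pid = partition_of(2088000000000137)
    print(pid, "->", leader_node(pid))      # this row's writes go to node-2
```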

High Availability and Resilience

Before we talk about resilience, I want to mention connections, which have always been a headache for DBAs. Especially when the cache system crashes or there is some other failure, all traffic goes to the database, and even DBAs cannot log in because no connections are left.


100,000+ Connections in One Server Node

In OceanBase, there is only one process, the OBServer. Within this process there are multiple threads, including system threads running in the background and user/worker threads. We have decoupled sessions from threads, which means you can create many sessions at a very low cost, and worker threads are assigned to active sessions only.

In OceanBase, we can have over 100,000 connections in one server node, which is a very good benefit.
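The session/thread decoupling can be modeled roughly like this: sessions are cheap objects, and a small fixed pool of worker threads picks up only the sessions that currently have a request to run. This is a conceptual sketch, not the OBServer's actual thread architecture.

```python
# Conceptual sketch: many lightweight sessions, few worker threads.
# Only sessions with a pending request consume a worker; idle sessions
# cost almost nothing. This models the idea, not OBServer internals.

import queue
import threading

NUM_WORKERS = 4                   # small, fixed worker pool
active_sessions = queue.Queue()   # sessions that currently have work

def worker(worker_id: int):
    while True:
        session_id, sql = active_sessions.get()
        if session_id is None:    # shutdown signal
            break
        print(f"worker-{worker_id} runs {sql!r} for session {session_id}")
        active_sessions.task_done()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()

# 100,000 "connections" are just cheap session ids; only a handful are active.
sessions = list(range(100_000))
for sid in (7, 42, 99_999):
    active_sessions.put((sid, "SELECT balance FROM account WHERE uid = ?"))

active_sessions.join()
for _ in threads:
    active_sessions.put((None, None))
for t in threads:
    t.join()
```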

Leader Lease Expired

As you know, OceanBase uses Paxos to ensure strong data consistency. When there is an unexpected failure, such as a data center power outage, a network issue, or a hardware crash, the leader lease will expire. A new leader will then be elected, which is done within 30 seconds with no data loss. This happens automatically, which we call auto recovery.

Another special design is that OceanBase puts the crashed leader on a block list for a configurable period of time, say one hour, so that the unstable node is not elected as leader again even if it recovers quickly, which prevents the cluster from crashing again.
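A simplified model of the lease-plus-block-list behavior is sketched below. The lease length, block-list duration, and election rule are placeholder assumptions, not OceanBase's Paxos implementation.

```python
# Simplified model: a leader holds a time-bounded lease; if the lease
# expires, a new leader is elected from replicas that are alive and not
# on the block list. A crashed leader stays block-listed for a while so
# it is not re-elected immediately even if it comes back quickly.
# Illustrative only; not OceanBase's Paxos election.

import time

LEASE_SECONDS = 10
BLOCKLIST_SECONDS = 3600        # e.g. one hour, configurable in spirit

class ReplicaGroup:
    def __init__(self, members):
        self.members = members              # name -> alive flag
        self.blocklist = {}                 # name -> block-listed-until timestamp
        self.leader = None
        self.lease_expiry = 0.0

    def elect(self, now):
        candidates = [m for m, alive in self.members.items()
                      if alive and self.blocklist.get(m, 0) <= now]
        self.leader = candidates[0] if candidates else None
        self.lease_expiry = now + LEASE_SECONDS
        return self.leader

    def tick(self, now):
        """Check the lease; re-elect if the leader died or the lease expired."""
        leader_alive = self.leader is not None and self.members[self.leader]
        if leader_alive and now < self.lease_expiry:
            return self.leader
        if self.leader is not None and not leader_alive:
            # Keep the crashed leader out of the next elections for a while.
            self.blocklist[self.leader] = now + BLOCKLIST_SECONDS
        return self.elect(now)

if __name__ == "__main__":
    group = ReplicaGroup({"r1": True, "r2": True, "r3": True})
    now = time.time()
    print("initial leader:", group.elect(now))       # r1
    group.members["r1"] = False                       # r1 crashes
    print("after crash:   ", group.tick(now + 11))    # r2 takes over
    group.members["r1"] = True                        # r1 recovers quickly
    print("after recovery:", group.tick(now + 30))    # still r2; r1 block-listed
```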

Proactive Leader Switch

OceanBase allows DBAs to manually switch leaders if they find the current leader unstable for unknown reasons. Sometimes DBAs notice unusual signals based on their experience but are not sure of the root cause. To keep the system running stably, they can manually switch leaders and do the diagnosis later.

Maintaining Stable SQL Performance

For an OLTP system, SQL execution performance is critical to overall system performance. OceanBase has several mechanisms to keep the SQL engine running at high performance.


Plan cache

A plan cache will cache the physical execution plan of a single SQL statement for reuse, which can improve the performance of SQL execution.

SPM (SQL Plan Management)

SPM tries its best to make sure that SQL execution always picks the most efficient plan throughout the statement's life cycle.
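Here is a toy sketch of the two ideas together: caching a compiled plan per SQL pattern and, in the spirit of SPM, keeping the plan with the best observed execution time as the baseline. It illustrates the concepts only, not OceanBase's plan cache or SPM internals.

```python
# Toy plan cache + plan management: plans are cached per SQL pattern, and
# the baseline plan is replaced only when a candidate plan is observed to
# run faster. Conceptual only; not OceanBase's actual implementation.

class PlanManager:
    def __init__(self):
        self.cache = {}   # sql pattern -> {"plan": ..., "avg_ms": ...}

    def get_plan(self, sql_pattern, optimizer):
        """Reuse the cached physical plan; compile only on a cache miss."""
        entry = self.cache.get(sql_pattern)
        if entry is None:
            entry = {"plan": optimizer(sql_pattern), "avg_ms": None}
            self.cache[sql_pattern] = entry
        return entry["plan"]

    def report_execution(self, sql_pattern, candidate_plan, elapsed_ms):
        """SPM-like evolution: adopt a candidate plan only if it is faster."""
        entry = self.cache[sql_pattern]
        if entry["avg_ms"] is None or elapsed_ms < entry["avg_ms"]:
            entry["plan"], entry["avg_ms"] = candidate_plan, elapsed_ms

if __name__ == "__main__":
    pm = PlanManager()
    sql = "SELECT balance FROM account WHERE uid = ?"
    plan = pm.get_plan(sql, optimizer=lambda s: "INDEX SCAN idx_uid")
    pm.report_execution(sql, plan, elapsed_ms=1.2)
    pm.report_execution(sql, "FULL TABLE SCAN", elapsed_ms=85.0)  # rejected
    print(pm.cache[sql])    # baseline stays on the faster index scan
```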

Large Query Queue

A large query queue is designed for large, complex queries to make sure they do not affect OLTP performance. This feature is interesting. It is not very technical, but it was very useful for Alipay at that time. As I mentioned above, in an OLTP system 99% of queries are transactional, but once in a while there are large, complex queries that would otherwise affect system performance and cause production incidents. The large query queue feature creates a separate worker queue to process such queries: they wait in their own queue and are handled by separate workers, without affecting the transactional queries.
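The mechanism can be sketched roughly like this: queries classified as large go into their own queue served by a dedicated worker, so they wait among themselves instead of blocking short transactional queries. The classification heuristic here is a made-up placeholder, not how the engine really decides.

```python
# Conceptual sketch of a large query queue: short OLTP queries run on the
# main workers; queries classified as "large" wait in their own queue and
# are served by a dedicated worker, so they never starve OLTP traffic.

import queue
import threading

oltp_queue = queue.Queue()
large_queue = queue.Queue()

def looks_large(sql: str) -> bool:
    # Placeholder heuristic; a real engine would use cost estimates.
    return "GROUP BY" in sql.upper() or "JOIN" in sql.upper()

def dispatch(sql: str):
    (large_queue if looks_large(sql) else oltp_queue).put(sql)

def serve(q: queue.Queue, label: str):
    while True:
        sql = q.get()
        print(f"[{label}] executing: {sql}")
        q.task_done()

threading.Thread(target=serve, args=(oltp_queue, "oltp-worker"), daemon=True).start()
threading.Thread(target=serve, args=(large_queue, "large-worker"), daemon=True).start()

dispatch("SELECT balance FROM account WHERE uid = 42")
dispatch("SELECT city, COUNT(*) FROM orders GROUP BY city")
oltp_queue.join()
large_queue.join()
```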

Adaptive Throttling

There are three types of adaptive throttling in OceanBase.

  1. SQL concurrent limit

The first one is to directly throttle a single specific SQL statement. For example, the application cannot roll back its release, or there is a badly written SQL statement, say a `SELECT` with `COUNT` or lots of `GROUP BY` clauses, that is not business-critical. We can throttle it to limit its concurrency.

  2. Write throttling

The second one is write throttling. Since OceanBase uses an LSM-Tree, memory is very important. When there is heavy traffic, for example a batch of data being loaded into the database, we need a mechanism to protect OceanBase from an out-of-memory crisis. The write throttling feature automatically throttles requests when memory usage exceeds 80%.

  3. Server request throttling

This is a server-level parameter. If a large number of SQL requests are waiting in line, or the queue time exceeds 100 milliseconds, or the response time exceeds a threshold, server request throttling will be activated. It cannot be used in very critical systems, but in some scenarios it is very useful. It is a self-protecting mechanism. A simplified sketch of all three throttling mechanisms follows below.
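The sketch combines the three ideas in simplified form: a per-SQL concurrency cap, memory-based write throttling above 80% usage, and server-level throttling when queue time exceeds a threshold. The thresholds mirror the numbers mentioned above; the logic itself is illustrative, not OceanBase's implementation.

```python
# Simplified sketch of the three throttling ideas above. Thresholds mirror
# the numbers in the talk; the logic is illustrative, not OceanBase's.

import threading

# 1. SQL concurrent limit: cap concurrency for one specific SQL pattern.
#    (A real system would release the semaphore when the statement finishes.)
sql_limits = {"SELECT COUNT(*) FROM orders GROUP BY city": threading.Semaphore(2)}

def admit_sql(sql: str) -> bool:
    sem = sql_limits.get(sql)
    return sem.acquire(blocking=False) if sem else True

# 2. Write throttling: slow down writes once memory usage passes 80%.
def admit_write(memory_used_pct: float) -> bool:
    return memory_used_pct < 80.0

# 3. Server request throttling: back off when requests queue too long.
def admit_request(queue_time_ms: float, threshold_ms: float = 100.0) -> bool:
    return queue_time_ms <= threshold_ms

if __name__ == "__main__":
    print(admit_sql("SELECT COUNT(*) FROM orders GROUP BY city"))  # True
    print(admit_write(memory_used_pct=92.0))                       # False
    print(admit_request(queue_time_ms=140.0))                      # False
```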

Hot Row Concurrent Updating

As I mentioned earlier, hot row concurrent updating is a major challenge for high-concurrency systems.

So how does OceanBase solve this problem for Alipay? The answer is ELR (Early Lock Release).


The concept of ELR has been around in the industry for a long time. What limits hot-row throughput is the row lock required for ACID compliance: when many transactions want to update the same row, they have to wait in line for the lock to be released.

In the traditional way, with ELR disabled, a transaction holds the lock across four phases: writing the data, serializing the redo log, synchronizing the log to the standby replicas, and flushing the log to disk.

ELR changes the old behavior, in which each row lock is released only after the log buffer is flushed to disk. Instead, the lock is released right after the redo log is written to the log buffer, without waiting for the redo log to be flushed to disk. This reduces the lock-holding time. Once the current transaction releases the lock, subsequent transactions can operate on the same row, which improves system throughput.
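A simplified timeline makes the difference visible: with ELR, the row lock is held only until the redo record reaches the log buffer, while the replica sync and disk flush happen after the lock is released (the commit is acknowledged only once they complete, and, per the Q&A below, dependent transactions roll back together if that never happens). The durations below are made up for illustration; this is not OceanBase's transaction engine.

```python
# Conceptual timeline of a hot-row update, with and without ELR.
# With ELR the row lock is held only until the redo record reaches the
# in-memory log buffer; the slow sync and disk flush happen after the
# lock is already released, so the next transaction can start sooner.
# Durations are made up; this is not OceanBase's transaction engine.

def lock_hold_time_ms(elr_enabled: bool) -> float:
    """Return how long (ms) the row lock is held for one transaction."""
    held = 0.0
    held += 0.1        # write the new row version in memory
    held += 0.1        # serialize the redo record into the log buffer
    if not elr_enabled:
        held += 2.0    # wait for the log to sync to standby replicas
        held += 3.0    # wait for the log buffer to be flushed to disk
    # With ELR, the sync and flush still happen, but after the lock is
    # released; the commit is acknowledged only once they complete.
    return held

if __name__ == "__main__":
    print(f"lock held without ELR: {lock_hold_time_ms(False):.1f} ms")
    print(f"lock held with ELR:    {lock_hold_time_ms(True):.1f} ms")
```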

With ELR enabled, Alipay has improved TPS by 5–6 times.

OceanBase Autonomous Service

In Alipay, there are more than 20,000 nodes in over one hundred clusters in the production environment. The number of DBAs is only 10–15. And I was one of them.

Why can we manage so many clusters and nodes with so few people?

Actually, we can talk about it from two perspectives.

One is that OceanBase is a multi-tenant cluster. We used to manage databases instance by instance. In OceanBase, one tenant is one instance, and there can be 100 instances in a big cluster. Previously DBAs had to manage thousands of instances; now they only have to monitor about 10 clusters. This switch from the instance level to the cluster level removes a lot of work.

Another perspective is OceanBase's autonomous service. We designed a platform separate from the database engine.


The Sensing module defines the workload modeling. The Event Detector module defines what kind of issue or event should be triggered, and it also collects monitoring data for the event hub. The data is then sent to the Decision Making module.

Most of the decision-making rules are based on DBAs’ real-world expert experience. In addition, we use some machine learning frameworks to help us find more patterns. Once the actual cause of a failure or incident is identified, it is sent to the Executor for root cause analysis and follow-up actions, for example, replacing crashed nodes, adding CPUs to heavily loaded tenants, or adding memory for batch loading jobs.

Let’s take a look at two real-world examples.


When we detect high I/O on a database node, the autonomous service automatically stops background jobs and tasks such as backups and migrations to guarantee there are enough I/O resources.

When a tenant is under high load, we need to add more CPU resources to it. The real case is actually more complicated than this, because you don't know whether the SQL is consuming a normal amount of resources or whether something is wrong with it, which is what we have been working to improve.
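As a rough illustration of the Sensing, Event Detector, Decision Making, and Executor flow applied to these two examples, here is a toy rule pipeline. The metric names, thresholds, and actions are assumptions, not the real platform's interfaces.

```python
# Toy version of the autonomous-service flow: sense metrics, detect events,
# decide an action from expert-style rules, execute it. Metric names,
# thresholds, and actions are illustrative, not the real platform.

def sense() -> dict:
    # In reality this comes from monitoring; hard-coded here for the demo.
    return {"node_io_util_pct": 95.0, "tenant_cpu_util_pct": 88.0}

def detect(metrics: dict) -> list[str]:
    events = []
    if metrics["node_io_util_pct"] > 90.0:
        events.append("HIGH_IO")
    if metrics["tenant_cpu_util_pct"] > 85.0:
        events.append("HIGH_TENANT_LOAD")
    return events

def decide(events: list[str]) -> list[str]:
    # Expert-experience rules, in the spirit of the two examples above.
    rules = {
        "HIGH_IO": "pause background backup/migration tasks",
        "HIGH_TENANT_LOAD": "add CPU quota to the hot tenant",
    }
    return [rules[e] for e in events if e in rules]

def execute(actions: list[str]) -> None:
    for action in actions:
        print("executing:", action)

if __name__ == "__main__":
    execute(decide(detect(sense())))
```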

Questions Taken

  1. How can the ELR feature ensure no data loss?

The bottom line of this feature is that data loss is unacceptable. The reason there is no data loss is that OceanBase commits transactions in batches. If there is a failure before the commits finish, all the involved transactions are rolled back.

  2. Could you give me some large query examples in OceanBase or Alipay?

Alipay is designed as a microservice system, where many applications share one database. During a transaction or a trade, sometimes you need to count the number of merchants who are at risk or calculate whether the account balance is enough to pay for an order. These are examples of AP queries.

  3. How do you guarantee that leaders are distributed across regions?

In Alipay’s case, leaders are spread by UID; we don’t randomly send leaders to all regions. For example, with 100 UIDs, we place 30% of the UIDs in City1 and another 30% in City2. Each UID has multiple partitions, and their leaders stay within one city/region.

  4. Are there any ETL tools for OceanBase? Bank data is usually structured, but for Internet-based companies the data is not always structured.

So far, OceanBase has its own CDC API on the log interface, which uses the OceanBase log format. One thing we are working on is a MySQL-compatible log format that will make it easier for other MySQL-compatible systems to consume data from OceanBase. Right now, OceanBase has a Flink CDC connector with which you can sync data from OceanBase to Kafka and other MQ systems.
