This is the last post in the holiday season series brought to you by Ted Bai, Senior Solution Architect at OceanBase, and Xingxing Xi, Technical Consultant at Yunhe Enmo (Beijing).
In the previous two posts, Ted Bai and Xingxing Xi discussed how to avoid data loss when a system fails during major shopping promotion campaigns, and how to tune a database for better performance at lower cost.
In this post, they continue with database disaster recovery and contingency plans that keep the system stable under unexpected circumstances.
OceanBase Database provides a solution for SQL exceptions, which cause most database errors. For example, a SQL statement that normally runs a few hundred times per hour may suddenly run tens of thousands of times per hour, which usually indicates that something has gone wrong. In this case, we expect the database to automatically throttle the abnormal SQL statements.
To address this issue, OceanBase Database throttles the concurrent execution of queries that share the same SQL ID at the kernel layer, reducing executions from thousands per minute to dozens, or even terminating them altogether.
In other words, the database can throttle and terminate queries when the application cannot. Throttling can also be coordinated with the business side. For example, application owners can tag non-critical SQL queries so that they are throttled automatically in the event of an online exception and the DBA can step in as early as possible. When a wrong SQL execution plan or an incorrect table join algorithm is used, OceanBase Database allows you to bind the SQL statement to the best execution plan by using outlines. Together, throttling and outlines have greatly enhanced our emergency handling capabilities.
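To make the idea of per-SQL-ID throttling concrete, here is a minimal Python sketch. It is not OceanBase's kernel implementation. The class name `SqlThrottle`, its methods, and the example SQL IDs are hypothetical and exist only for illustration. The sketch assumes a limit of 0 means "reject every execution", which models terminating a statement.

```python
import threading
from collections import defaultdict


class SqlThrottle:
    """Hypothetical limiter: caps concurrent executions per SQL ID."""

    def __init__(self):
        self._lock = threading.Lock()
        self._limits = {}                  # sql_id -> max concurrent executions
        self._running = defaultdict(int)   # sql_id -> executions in flight

    def set_limit(self, sql_id, max_concurrent):
        """Throttle one SQL ID; a limit of 0 rejects every execution."""
        with self._lock:
            self._limits[sql_id] = max_concurrent

    def try_acquire(self, sql_id):
        """Return True if the query may run now, False if it is throttled."""
        with self._lock:
            limit = self._limits.get(sql_id)
            if limit is not None and self._running[sql_id] >= limit:
                return False
            self._running[sql_id] += 1
            return True

    def release(self, sql_id):
        """Call when a query finishes so another one may start."""
        with self._lock:
            if self._running[sql_id] > 0:
                self._running[sql_id] -= 1
```

A caller would wrap each execution in `try_acquire` and `release`. SQL IDs without a configured limit are never rejected, which mirrors the idea that only statements identified as abnormal (or tagged as non-critical) are throttled.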
OceanBase Database also provides a solution for unusual traffic. Such traffic often consists of important read requests from the application layer and therefore cannot be throttled. As a multi-replica database, OceanBase Database distributes the surging read traffic across all replicas when resources on the current node are insufficient. This method dramatically reduces the system workload in scenarios such as unusual traffic, cache breakdown, or cache avalanche.
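The routing decision described above can be illustrated with a small Python sketch. This is only a toy model under stated assumptions, not how OceanBase routes requests internally: the `Replica` type, the `pick_replica_for_read` function, the 0.8 overload threshold, and the "least-loaded replica" policy are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    is_leader: bool
    load: float  # assumed utilization between 0.0 and 1.0


def pick_replica_for_read(replicas, overload_threshold=0.8):
    """Serve reads from the leader until it is overloaded, then spill over.

    Assumption: once the leader crosses the threshold, the read goes to
    whichever replica currently reports the lowest load.
    """
    leader = next(r for r in replicas if r.is_leader)
    if leader.load < overload_threshold:
        return leader
    return min(replicas, key=lambda r: r.load)
```

With a leader at 50% load, every read stays on the leader. Once the leader reports 90% load, the same call returns the least-loaded replica instead, so the surge is absorbed by the other copies of the data.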
Ant Group runs tens of thousands of physical servers, which host hundreds of thousands of containers and instances. In a system of this size, occasional black swan events, such as firmware problems and rare OS bugs, are unavoidable. For such extreme cases, OceanBase Database supports failover to a heterogeneous database. This solution is also provided for the core clusters of Ant Group.
I am confident that OceanBase Database provides solutions for handling emergencies of any scale.
Online alerting involves two types of tasks: monitoring and alerting. As for monitoring, the monitoring products provided by mature database vendors have served a large number of customers for years. You can trust them rather than develop your own monitoring system from scratch, which saves considerable time and effort. Of course, you can still add custom monitoring metrics as needed.
As for alerts, keep in mind that invalid alerts only get in the way of daily operations and maintenance work. Therefore, we recommend dividing alerts into levels based on their severity or priority and then taking the measures that correspond to each level. For a top-level alert, you can isolate the corresponding error at the business layer to ensure the stability of the core business. As for disaster recovery, of the two types of solutions, those based on the physical layer and those based on the logical layer, I personally prefer the physical-layer solution.
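As a rough illustration of level-based alert handling, the following Python sketch maps a metric value to an alert level and then to a response. The level names, the thresholds, and the actions for the lower levels are assumptions made for the example. Only the top-level response (isolating the error at the business layer) comes from the recommendation above.

```python
from enum import IntEnum


class AlertLevel(IntEnum):
    INFO = 1       # hypothetical lowest level
    WARNING = 2    # hypothetical middle level
    CRITICAL = 3   # top level


# Assumed responses per level; adapt them to your own runbook.
ACTIONS = {
    AlertLevel.INFO: "record only",
    AlertLevel.WARNING: "notify the on-call DBA",
    AlertLevel.CRITICAL: "isolate the error at the business layer",
}


def classify(value, warning_at, critical_at):
    """Map a metric value to an alert level using two thresholds."""
    if value >= critical_at:
        return AlertLevel.CRITICAL
    if value >= warning_at:
        return AlertLevel.WARNING
    return AlertLevel.INFO


def action_for(value, warning_at, critical_at):
    """Return the response that matches the alert level of a metric value."""
    return ACTIONS[classify(value, warning_at, critical_at)]
```

For example, with a warning threshold of 70 and a critical threshold of 90 for CPU utilization, a reading of 95 would be classified as `CRITICAL` and routed to business-layer isolation, while a reading of 50 would only be recorded.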
This is the end of the series of posts. If you have any questions or suggestions, please feel free to leave a comment below!