Emergency response is required to tackle not only faults but all exceptions that exceed the designed capacities and expectations of a cluster. The purpose of emergency response is not to identify the root cause but to eliminate the faults and restore the service of a cluster at the earliest opportunity.
Identify exceptions
You can actively diagnose exceptions or take actions upon notifications. OceanBase Database supports the following exception identification methods:
Business monitoring and alerting, in which applications are monitored.
Routine inspection of databases. For more information, see Monitoring in Monitor and send alerts.
Database monitoring and alerting. For more information, see Alerts in Monitor and send alerts
Emergency event classification
Infrastructure faults
Infrastructure faults are caused by hardware or software exceptions. Most exceptions that involve a single server of OceanBase Database can be automatically recovered. However, you must perform an emergency procedure in some circumstances. In general, infrastructure faults are classified into the following types:
Server failure caused by hardware faults
Server clock offset from the NTP clock
Rack or IDC power failure
Single-server network interface controller (NIC) fault, or network jitter in a rack or IDC
Faults of the load balancer of an ODP cluster
Insufficient capacity due to traffic increase
During a marketing campaign or when a new service is released, traffic surges tend to overwhelm the designed capacity of a database. This results in a series of issues, such as performance downgrade, latency increase, and connection timeout. Insufficient capacity of OceanBase Database often leads to the following exceptions:
High usage of cluster resources for executing SQL queries
High disk I/O on an OBServer node
Overloaded NIC on an OBServer node
Accumulated request queue of a tenant
Full occupation of tenant memory
Full usage of ODP threads
Full usage of the OBServer clog disk
Full usage of the OBServer data disk
Other exceptions of OceanBase clusters
Under the distributed architecture of OceanBase Database, exceptions other than those of the preceding two types may also be triggered in specific scenarios or by user behaviors. These exceptions may be caused by various factors, such as service access methods and software bugs. You can perform a general emergency procedure to stop the loss. These exceptions include:
Blocked tenant minor compaction
Blocked cluster freeze or major compaction
Suspended and long-running transactions
Accumulated request queue of the sys tenant
Unexpected observer process exit
Memory insufficiency or leakage of internal modules
Buffer tables
Some exceptions described in the topic are phenomena, and others are causes. For more information about the logic of exception troubleshooting, see Analysis, diagnosis, and decision-making process.