Identify bottlenecks on the data link|V4.3.0| docs|Distributed Database

Identify bottlenecks on the data link

Last Updated：2024-04-19 08:42:49 Updated

As described in the "High availability" chapter, the data link of OceanBase Database consists of the application server, OceanBase Database Proxy (known as ODP or OBProxy), and OBServer node. Specifically, an application server connects to ODP and uses the database driver to send requests. Due to the distributed architecture of OceanBase Database, user data is distributed across multiple OBServer nodes in multiple partitions with multiple replicas. ODP forwards user requests to the most suitable OBServer node for execution and then returns the results to the user. Additionally, each OBServer node has routing forwarding capabilities. If a request cannot be executed on the current node, it will be forwarded to the appropriate OBServer node.

Data link diagram

When there is an end-to-end performance issue (in a database scenario, end-to-end refers to observing high response time of SQL requests on the application server), it is necessary to first identify which component in the database access link is causing the problem, and then troubleshoot the specific issues within that component.

There are generally two methods for troubleshooting:

Drill-down troubleshooting

This method involves sequentially troubleshooting each component in the link in the order of data access, observing the time consumption of downstream components called by this component, and continuing to troubleshoot the downstream components where the time consumption is significantly abnormal (also known as "time-consuming hotspots"). Essentially, this is a recursive method that gradually approaches the root cause component by drilling down step by step.
Targeted troubleshooting

This method involves exploring components with the most exceptions based on historical experience, observing the core Service Level Agreement (SLA) metrics of this component to determine if it is abnormal, and then deciding on the next component to troubleshoot. This method is essentially based on the principle of exclusion, starting with the troubleshooting of the most frequent component and progressively ruling out others, thus approaching the root cause component.

The diagram above illustrates a simplified version of the database access link. Note that in the diagram, when two components access each other through the network, the network should also be considered a component. If there are suspicions of network issues between the two components, it is necessary to troubleshoot the various levels of switches in the network link. According to the deployment form of ODP, special attention should be paid to network access with longer links, especially network access across VPCs or across clouds.

The two troubleshooting methods mentioned above have their respective strengths. The choice of the more efficient method should be based on the available tools and troubleshooting experience.

For example, if you have deployed a new hybrid cloud system and migrated some business modules from a self-managed IDC to the public cloud for the first time, we recommend that you use the drill-down method for troubleshooting because the access involves cross-cloud links.
On the other hand, if you have recently modified specific components on a mature business link for specific purposes, such as a big sales event or an ODP upgrade, we recommend that you use the targeted troubleshooting method.

The preceding examples illustrate how to select the troubleshooting method based on the scenario. However, experienced engineers can also pinpoint the exact bottleneck component based on the error message.

Locate bottleneck components

The diagram above presents a typical error reporting stack, which consists of three layers: the OBServer node layer, the ODP layer, and the application layer. Typically, applications use data middleware to manage database connection tasks, including access authentication, connection warm-up, and connection pooling. The data middleware is integrated with the application as a package, and the two generate operation logs separately. Therefore, applications and data middleware are collectively referred to as the application layer.

In OceanBase Database, errors occurring within a layer are not only recorded in the operation logs of that layer, but are also thrown to the upper layer and recorded in the operation logs of the components in that upper layer. Consequently, when an error message is received for a specific layer, it is essential to determine the layer from which the error originated. If the error is thrown from a component within the current layer, troubleshooting should begin from that layer. On the other hand, if the error is thrown from a component in a lower layer, troubleshooting should be directed to the respective target layer.

Application jitter

The following sections describe possible errors of each layer. Note that errors reported at a layer are not only recorded in the operation logs of the layer, but also thrown to the upper layer and recorded in the operation logs of the upper layer.

Application errors
- Database connection pool exhausted
  
  This error is one of the most likely errors in an application system. If the database connection pool of the application system is exhausted, new requests for building a connection will fail. Generally, the application code starts a transaction to access the database. In addition to database access, other downstream systems, such as downstream applications and caches, are also called during the transaction. When the transaction starts, a connection is obtained from the connection pool. When the transaction ends, the connection is returned to the connection pool. A connection pool is exhausted usually due to prolonged transaction execution. Possible causes of prolonged transaction execution include long RT of database requests, time-consuming access to downstream systems, and exceptions of the application system.
  - If it takes a long time to call a downstream system by using a remote process call (RPC), the overall execution duration of a transaction can be increased, even when the database access is normal.
  - If an application system is stuck due to a full garbage collection (GC) or CPU exhaustion, the overall execution duration of a transaction can also be increased, even when the database access is normal.
- Database connection failed
  
  This error indicates that the application system failed to connect to an OBServer node. Possible causes include: overloading of the database system due to a large number of requests, exceptions of the application system, such as a full GC, an NIC error of the application server, and incorrect database configuration at the application layer, and a network exception between the application server and the OBServer node.
- Lock contention
  
  A lock contention may occur when the application system tries to lock a database object. Generally, an application system controls concurrent access to the same object by adding a pessimistic or optimistic lock on the object. In addition to a large RT value of the database system that results in serious timeouts and retries, a lock contention is likely caused by exceptions of the application system, such as a large number of concurrent operations on the same object due to an error of the scheduling system or a hacker attack.
ODP errors
- ERROR 1317 (70100): Query execution was interrupted
  
  This error indicates that the client cancels the query. This is often because that a response is not returned for an SQL query sent by the client within the expected timeout period, or that the user initiates the Kill Query action by pressing Ctrl + C. If this error is frequently reported in a production environment, the SQL query execution exceeds the timeout value specified for the client. You need to troubleshoot each component on the link from the client to the OBServer node.
OBServer node errors
- ERROR 4012 (HY000): Timeout, query has reached the maximum query timeout
  
  This error indicates that the processing duration of an SQL query on the OBServer node exceeds the value specified for the ob_query_timeout variable of the server. You can check the OBServer node for internal exceptions.
  
  Here, ob_query_timeout specifies the maximum execution time allowed for an SQL query in microseconds. For more information about this variable, see ob_query_timeout.

Notice

The method described above for identifying components based on error messages may vary depending on differences in component versions, and it heavily relies on engineers' accumulated experience. Therefore, engineers are recommended to develop a set of troubleshooting methodologies tailored to specific versions.