As described in the "High availability" chapter, the access link of OceanBase Database consists of the application server, OceanBase Database Proxy (ODP, also known as OBProxy), and the OBServer node. Specifically, an application server connects to ODP through a database driver and sends requests to it. Because OceanBase Database has a distributed architecture, user data is distributed across multiple OBServer nodes in multiple partitions with multiple replicas. ODP forwards each user request to the most suitable OBServer node for execution and then returns the results to the user. In addition, each OBServer node can also route and forward requests: if a request cannot be executed on the current node, the node forwards it to the appropriate OBServer node.

When there is an end-to-end performance issue (in a database scenario, end-to-end refers to observing high response time of SQL requests on the application server), it is necessary to first identify which component in the database access link is causing the problem, and then troubleshoot the specific issues within that component.
There are generally two methods for troubleshooting:
Drill-down troubleshooting
This method involves sequentially troubleshooting each component in the link in the order of data access, observing the time consumption of downstream components called by this component, and continuing to troubleshoot the downstream components where the time consumption is significantly abnormal (also known as "time-consuming hotspots"). Essentially, this is a recursive method that gradually approaches the root cause component by drilling down step by step.
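The recursive drill-down described above can be sketched as follows. The component names, timings, and the 50 ms "significantly abnormal" threshold are all made up for the example; in practice the timings would come from each component's monitoring data:

```python
def subtree_time(node):
    """Total time of a component plus everything it calls downstream."""
    self_ms, children = node
    return self_ms + sum(subtree_time(c) for c in children.values())

def drill_down(name, node, threshold_ms=50):
    """Starting from the entry component, repeatedly follow the most
    time-consuming downstream call ("time-consuming hotspot") until no
    downstream call is significantly abnormal."""
    _, children = node
    path = [name]
    while children:
        hottest = max(children, key=lambda c: subtree_time(children[c]))
        if subtree_time(children[hottest]) < threshold_ms:
            break  # no abnormal downstream call: the hotspot is here
        path.append(hottest)
        _, children = children[hottest]
    return path

# Hypothetical timing tree: each node is (self_time_ms, {child: subtree}).
timings = (5, {"odp": (3, {"network": (2, {}), "observer": (480, {})})})
print(drill_down("app", timings))  # ['app', 'odp', 'observer']
```

The recursion bottoms out at the OBServer node here, which is where the detailed in-component troubleshooting would then begin.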
Targeted troubleshooting
This method involves first examining the component that historically produces the most exceptions, observing its core Service Level Agreement (SLA) metrics to determine whether it is abnormal, and then deciding which component to troubleshoot next. This method is essentially based on the principle of exclusion: by starting with the most failure-prone component and progressively ruling out healthy ones, it approaches the root cause component.
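The exclusion loop behind targeted troubleshooting can be sketched as below. The component names, p99 latencies, SLA limits, and historical ordering are all illustrative values, not real OceanBase defaults:

```python
def targeted_scan(metrics, limits, order):
    """Check components in descending order of historical failure frequency
    and return the first one whose core SLA metric is out of bounds."""
    for component in order:
        if metrics[component] > limits[component]:
            return component  # abnormal: troubleshoot this component next
        # otherwise the component looks healthy: rule it out and move on
    return None

p99_latency_ms   = {"odp": 12, "network": 3, "observer": 950}
sla_limit_ms     = {"odp": 50, "network": 10, "observer": 100}
order_by_history = ["odp", "observer", "network"]  # most incident-prone first

print(targeted_scan(p99_latency_ms, sla_limit_ms, order_by_history))  # observer
```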

The diagram above illustrates a simplified version of the database access link. Note that when two components in the diagram access each other over the network, the network itself should also be treated as a component. If network issues between two components are suspected, troubleshoot the switches at each level of the network link. Depending on how ODP is deployed, pay special attention to network access over longer links, especially access across VPCs or across clouds.
The two troubleshooting methods mentioned above have their respective strengths. The choice of the more efficient method should be based on the available tools and troubleshooting experience.
For example, if you have deployed a new hybrid cloud system and migrated some business modules from a self-managed IDC to the public cloud for the first time, we recommend that you use the drill-down method for troubleshooting because the access involves cross-cloud links.
On the other hand, if you have recently modified specific components on a mature business link for specific purposes, such as a big sales event or an ODP upgrade, we recommend that you use the targeted troubleshooting method.
The preceding examples illustrate how to select the troubleshooting method based on the scenario. However, experienced engineers can also pinpoint the exact bottleneck component based on the error message.

The diagram above presents a typical error reporting stack, which consists of three layers: the OBServer node layer, the ODP layer, and the application layer. Typically, applications use data middleware to manage database connection tasks, including access authentication, connection warm-up, and connection pooling. The data middleware is integrated with the application as a package, and the two generate operation logs separately. Therefore, applications and data middleware are collectively referred to as the application layer.
In OceanBase Database, errors occurring within a layer are not only recorded in the operation logs of that layer, but are also thrown to the upper layer and recorded in the operation logs of the components in that upper layer. Consequently, when an error message is received for a specific layer, it is essential to determine the layer from which the error originated. If the error is thrown from a component within the current layer, troubleshooting should begin from that layer. On the other hand, if the error is thrown from a component in a lower layer, troubleshooting should be directed to the respective target layer.
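This propagation rule can be illustrated with a small Python sketch. The exception classes and functions are invented for the example and are not real OceanBase or ODP APIs: a lower-layer error is logged, rethrown upward, and the layer it originated from is recovered by walking the cause chain:

```python
class ObServerError(Exception):
    """Error raised inside the OBServer layer (illustrative)."""

class OdpError(Exception):
    """Error recorded and rethrown by the ODP layer (illustrative)."""

def observer_execute(sql):
    # Simulate a server-side timeout originating in the OBServer layer.
    raise ObServerError(
        "ERROR 4012 (HY000): Timeout, query has reached the maximum query timeout")

def odp_forward(sql):
    try:
        return observer_execute(sql)
    except ObServerError as exc:
        # ODP records the error in its own operation log, then rethrows it
        # to the application layer, so the same message appears in both logs.
        raise OdpError(str(exc)) from exc

def origin_layer(exc):
    """Walk the cause chain to find the layer the error was first thrown from."""
    while exc.__cause__ is not None:
        exc = exc.__cause__
    return type(exc).__name__

try:
    odp_forward("SELECT 1")
except OdpError as e:
    print(origin_layer(e))  # ObServerError: start troubleshooting at the OBServer layer
```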

The following sections describe possible errors of each layer. Note that errors reported at a layer are not only recorded in the operation logs of the layer, but also thrown to the upper layer and recorded in the operation logs of the upper layer.
Application errors
Application error "Database connection pool full"
This is one of the most common errors in application systems. It indicates that the database connection pool is full and new requests cannot obtain a connection. Application code typically starts a transaction to access the database. Besides database access, the transaction may also call other downstream systems, such as downstream applications and caches. A connection is obtained from the pool when the transaction starts and returned when it ends. A full connection pool is usually caused by transactions taking too long. Possible causes include high database request RT, time-consuming access to other downstream systems, or issues in the application system itself.
High database request RT: SQL requests take longer than usual on the database side, so each transaction holds its connection longer and the pool is drained.
Other downstream systems taking too long: RPC calls to downstream systems take too long, increasing the overall transaction time. Database access time is normal.
Application system itself: The application is stalled by full GC or CPU exhaustion, increasing the overall transaction time. Database access time is normal.
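To see why slow transactions exhaust the pool, consider the toy pool below. It is a sketch, not a real driver pool: it hands out a fixed number of placeholder connection strings and fails any request that cannot obtain one within its acquire timeout:

```python
import queue

class TinyPool:
    """Toy fixed-size connection pool; 'conn-N' strings stand in for real
    driver connections."""
    def __init__(self, size, acquire_timeout_s):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put("conn-%d" % i)
        self._timeout = acquire_timeout_s

    def acquire(self):
        try:
            return self._free.get(timeout=self._timeout)
        except queue.Empty:
            # Mirrors the application-layer "Database connection pool full" error.
            raise RuntimeError("Database connection pool full")

    def release(self, conn):
        self._free.put(conn)

pool = TinyPool(size=2, acquire_timeout_s=0.05)
held = [pool.acquire(), pool.acquire()]  # two long transactions hold every slot
try:
    pool.acquire()                       # a third request cannot get a connection
except RuntimeError as e:
    print(e)                             # Database connection pool full
```

Whether the slot-holding transactions are slow because of database RT, downstream RPC time, or the application itself is exactly what the cause list above distinguishes.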
Application error "Database connection failed"
This indicates that the application failed to connect to the OBServer node. Possible causes include: the backend database system is overloaded with requests and cannot accept new connections; the application system itself has issues (for example, full GC, a NIC error on the application server, or an incorrect database configuration at the application layer); or there are network issues between the application server and the OBServer node.
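A quick way to separate network-layer problems from database-layer problems is to test plain TCP reachability first. The sketch below assumes ODP's default listen port 2883; replace the host and port with those of your deployment:

```python
import socket

def tcp_reachable(host, port, timeout_s=1.0):
    """Isolate the network layer first: if a plain TCP connection to the
    ODP or OBServer endpoint fails, the problem lies in the network or the
    server process, not in authentication or driver configuration."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # refused, timed out, unreachable, etc.
        return False

# Hypothetical endpoint: ODP's default listen port is 2883.
print(tcp_reachable("127.0.0.1", 2883, timeout_s=0.5))
```

If TCP connects but the database login still fails, shift attention to overload on the backend or to the application-layer configuration.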
Application error "Lock conflict"
This indicates that a lock conflict occurred when the application tried to lock a database object. Applications typically use locking mechanisms (pessimistic or optimistic locks) to control concurrent access to the same object. Lock conflicts have strong application semantics. Besides high database RT causing timeouts and retries, look for causes in the application itself, for example, scheduling system errors leading to many concurrent operations on the same object, or certain types of attacks.
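When conflicts stem from concurrent updates of the same object, applications often apply an optimistic-lock retry pattern. The sketch below simulates it on an in-memory row; the SQL in the comment is illustrative, and real concurrency would come from other sessions rather than this single-threaded example:

```python
import time

def update_with_retry(row, apply_change, max_retries=3):
    """Optimistic-lock pattern: read the version, compute the change, and
    commit only if the version is unchanged; on conflict, back off and
    retry instead of holding a lock for the whole transaction."""
    for attempt in range(max_retries):
        seen = row["version"]
        new_value = apply_change(row["value"])
        # In SQL the commit step would be a conditional update such as:
        #   UPDATE t SET value = ?, version = version + 1
        #   WHERE id = ? AND version = ?
        if row["version"] == seen:                 # no concurrent writer won
            row["value"], row["version"] = new_value, seen + 1
            return True
        time.sleep(0.01 * 2 ** attempt)            # exponential backoff
    return False                                   # give up; surface the conflict

row = {"value": 10, "version": 0}
print(update_with_retry(row, lambda v: v + 1))  # True
print(row)                                       # {'value': 11, 'version': 1}
```

If retries like these fire constantly, check the application-side causes named above (scheduler errors, attack traffic) as well as database RT.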
ODP errors
OBProxy error
ERROR 1317 (70100): Query execution was interrupted
This indicates that the client actively interrupted the query. This usually happens when the SQL request sent by the client does not return within the expected timeout, or when the user presses Ctrl + C to initiate a Kill Query. When this error occurs frequently in production, it indicates that the SQL execution time from the application's perspective exceeds the client timeout (Query Timeout). Troubleshoot each component from the client to the OBServer node.
OBServer errors
OBServer error
ERROR 4012 (HY000): Timeout, query has reached the maximum query timeout
This indicates that the SQL request processing time on the OBServer node exceeded the server timeout (ob_query_timeout). You can directly conclude that there is an issue inside the OBServer node. ob_query_timeout specifies the maximum SQL execution time in microseconds. For more information, see ob_query_timeout.
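Because ob_query_timeout is specified in microseconds, it is easy to mis-set by a factor of a million, and the client-side query timeout normally needs to be larger than the server-side one so that ERROR 4012 (server timeout) rather than ERROR 1317 (client interruption) is observed. A small sketch of both checks; the concrete values are illustrative:

```python
def ob_query_timeout_us(seconds):
    """ob_query_timeout is set in microseconds; convert from seconds."""
    return int(seconds * 1_000_000)

def timeout_budget_ok(client_timeout_s, server_timeout_us):
    """Cross-layer sanity check: if the client budget is smaller than the
    server budget, the client cancels the query (ERROR 1317) before the
    server can ever report its own timeout (ERROR 4012)."""
    return client_timeout_s * 1_000_000 > server_timeout_us

server_us = ob_query_timeout_us(10)      # e.g. SET SESSION ob_query_timeout = 10000000
print(timeout_budget_ok(30, server_us))  # True: 30 s client budget > 10 s server budget
print(timeout_budget_ok(5, server_us))   # False: the client would cancel first
```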
Notice
The method described above for pinpointing components based on error messages may vary depending on component versions and relies heavily on the engineer's experience. Engineers should develop a troubleshooting methodology for each specific version.