End-to-end tracing, as the name suggests, covers the entire process—from the generation of SQL statements by business systems to their execution in the database and the return of results to the client. In a distributed architecture, the longer tracing path adds complexity to analyzing and resolving issues.
This topic explains how to use end-to-end tracing to identify issues and address them after analyzing their root causes.
In OceanBase Database V3.x and earlier, you might want to trace a request by the trace ID attached to the business SQL. However, OceanBase Database Proxy (ODP) and OceanBase Database record their own internal trace IDs rather than the trace ID set by applications or SQL statements, so you have to locate issues by the values of primary or unique keys in the SQL statements instead.
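The following is a minimal sketch of such a key-based lookup, assuming a MySQL tenant in V3.x where the SQL audit view is available as oceanbase.gv$sql_audit; the key value placeholder is hypothetical and should be replaced with an actual primary or unique key value from the problematic statement.
-- Locate the statement by a key value that appears in its text, then note the
-- internal trace_id and svr_ip recorded on the database side for further tracing.
SELECT trace_id, svr_ip, client_ip, ret_code, elapsed_time, query_sql
FROM oceanbase.gv$sql_audit
WHERE query_sql LIKE '%<primary-or-unique-key-value>%'
ORDER BY request_time DESC
LIMIT 10;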
End-to-end tracing in OceanBase Database V2.x and V3.x
Load balancing layer
Troubleshooting SLB issues
A Server Load Balancer (SLB) address typically maps to multiple backend service addresses. To quickly determine whether the issue is caused by the SLB, you can bypass it and connect directly to the database servers.
For example, assume A is the load balancing address, and b, c, and d are the backend server addresses.
Typical connection flow: application --> A --> (b, c, or d).
Troubleshooting process: Test direct connections as app --> b, app --> c, and app --> d. If all direct connections succeed, the issue is likely with the SLB device. A quick sanity check to run over each direct connection is sketched below.
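The following is a minimal probe, assuming you can reach the backend servers directly with the obclient command-line client; the connection parameters are placeholders, not values from this document.
-- Connect directly to one backend, bypassing the SLB address A, for example:
--   obclient -h <backend-ip> -P <port> -u <user>@<tenant> -p
-- Then run a trivial query to confirm the backend accepts and answers requests:
SELECT 1;
-- Repeat for each backend (b, c, d). If all of them respond promptly while
-- connections through A hang or time out, the SLB layer is the likely cause.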
Common SLB issues and solutions
Insufficient SLB configuration
Symptoms: Two types of problems may occur:
The application hangs when trying to connect to the load balancing address.
The application fails to create a connection, even though the number of connections has not reached the maximum limit. This can also happen if SQL execution is too slow, causing connection timeouts.
Cause: The SLB capacity may be insufficient, meaning the device has reached its performance limit.
Solution: Increase the SLB capacity. Specifically, perform the following steps: Log in to the SLB device --> Check the SLB configuration --> Update the parameters to increase capacity.
Example: A customer encountered an issue where the SLB connection usage hit 100%. By monitoring the traffic, they identified the connection count and traffic volume at the time of the issue. Upon reviewing the SLB configuration, it was found that the SLB was set to support a maximum of 5,000 connections. During stress testing, the actual number of connections exceeded 5,000, clearly indicating a resource shortage. The configuration was adjusted to scale up the capacity.
SLB network issues
Symptoms: Connections through the load balancing address hang or the network is unreachable.
Cause: This could be caused by firewall policies, network routing issues, or other hardware-related problems.
Solution: Verify if the firewall policy restricts specific ports or if the network routing is configured correctly. Additionally, inspect the switch module for packet loss or other issues.
Uneven traffic distribution
Symptoms: Traffic is distributed unevenly across different backend services.
Cause: The SLB load balancing algorithm may have issues, or there could be problems with the business model.
Solution: For algorithm-related issues, dynamically adjust the SLB load balancing algorithm (weighted round-robin and hash algorithms are recommended) to find the most suitable option. If the issue persists, contact technical support for assistance.
ODP layer
ODP is just one part of the entire chain. When troubleshooting ODP, it's important to clearly define the problem scenario and determine the direction of the investigation.
ODP logs are the primary source of information for diagnosing issues. Make effective use of trace IDs, session IDs, error codes, and SQL statements to search the logs.
OceanBase Cloud Platform (OCP) also provides a log retrieval feature for ODP, which is especially helpful in environments with multiple ODP nodes.
When performing end-to-end tracing, the first step in troubleshooting ODP is identifying the address of the ODP instance. In private cloud setups, ODP is often placed behind a load balancer. Applications typically do not connect directly to ODP but instead connect to a VIP. To identify the ODP address in such cases, you can use the following methods:
Use the sql_audit feature in OceanBase Database to check the client_ip. The client_ip corresponds to the ODP address.
View the configuration of the SLB to determine the list of ODP instances to which the application connects.
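As a minimal sketch of the first method, assuming a V3.x MySQL tenant where the audit view is oceanbase.gv$sql_audit, you can aggregate recent requests by client_ip to see which ODP addresses the traffic arrives from.
-- Per the description above, client_ip corresponds to the ODP address
-- when the application connects to the database through ODP.
SELECT client_ip, COUNT(*) AS request_cnt
FROM oceanbase.gv$sql_audit
GROUP BY client_ip
ORDER BY request_cnt DESC;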
Slow query diagnostics
The obproxy.log file provides detailed logs for slow queries and slow transactions. By analyzing obproxy.log, you can review SQL execution times and transaction durations from ODP's perspective. Comparing these metrics with the upstream application's latency statistics and the downstream OBServer's SQL and transaction execution times can help pinpoint where the slowdown is occurring.
To make use of the slow query logs, you need to configure the relevant ODP settings, as shown in the example after the following list:
slow_query_time_threshold: The threshold for logging slow queries to obproxy_slow.log. Slow queries are also logged in obproxy.log.
query_digest_time_threshold: The threshold for logging query digests to obproxy_digest.log.
slow_transaction_time_threshold: The threshold for logging slow transactions in obproxy.log.
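The following is a minimal sketch of adjusting these thresholds from an ODP connection (for example, as a user with ODP administration privileges such as root@proxysys); the threshold values are illustrative, not recommendations.
-- Set the thresholds that control what ODP records in its slow-query,
-- digest, and slow-transaction logs.
alter proxyconfig set slow_query_time_threshold='500ms';
alter proxyconfig set query_digest_time_threshold='100ms';
alter proxyconfig set slow_transaction_time_threshold='1s';
-- Check the values that are currently in effect:
show proxyconfig like '%time_threshold%';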
OBServer layer
If an SQL execution failure occurs on the OBServer side, both ODP and the application will receive the corresponding SQL error code. You can analyze and address the issue based on the specific error code.
When diagnosing issues in OceanBase Database, the type of problem determines which diagnostic information to use, as it may be distributed across different sources. For example, you might reference error codes received by the application, alerts from the OCP monitoring platform, OBServer runtime logs, system monitoring views, or even operating system logs.
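For instance, here is a minimal sketch of pulling recent failed requests from the SQL audit view so you know which OBServer's logs to inspect next; the view is oceanbase.gv$sql_audit in V3.x (GV$OB_SQL_AUDIT in V4.x), and the filter is illustrative.
-- Recent requests that returned a non-zero error code, together with the server
-- that executed them and the internal trace ID to search for in observer logs.
SELECT svr_ip, ret_code, trace_id, elapsed_time, query_sql
FROM oceanbase.gv$sql_audit
WHERE ret_code != 0
ORDER BY request_time DESC
LIMIT 20;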
End-to-end tracing in OceanBase Database V4.x
In OceanBase Database V4.x, a trace is generated for each transaction or SQL statement. O&M personnel can analyze all trace logs to obtain the execution time of each transaction or SQL statement across the entire chain, thereby helping them locate the problem in the access chain. For more information, see End-to-end tracing.
Example of using end-to-end tracing to diagnose issues in OceanBase Database V4.x
If an SQL query takes longer than expected to execute, you can use end-to-end tracing to identify and analyze the root cause.
1. Enable the Show Trace feature.
obclient [testdb]> set ob_enable_show_trace = 'on';
2. Execute the following query statement:
obclient [testdb]> select * from tp1 where c1 = 938599;
The return result is as follows:
+--------+--------+
| c1     | c2     |
+--------+--------+
| 938599 | 518633 |
+--------+--------+
1 row in set (0.481 sec)
3. Use the Show Trace statement to view the execution call chain and the duration of each stage in the call chain.
obclient [testdb]> show trace;
The return result is as follows:
+-----------------------------------------------------------+----------------------------+------------+
| Operation | StartTime | ElapseTime |
+-----------------------------------------------------------+----------------------------+------------+
| obclient | 2023-11-05 19:02:40.723072 | 481.285 ms |
| └── com_query_process | 2023-11-05 19:02:40.723219 | 481.069 ms |
| └── mpquery_single_stmt | 2023-11-05 19:02:40.723225 | 481.057 ms |
| ├── sql_compile | 2023-11-05 19:02:40.723233 | 2.904 ms |
| │ ├── pc_get_plan | 2023-11-05 19:02:40.723236 | 0.008 ms |
| │ └── hard_parse | 2023-11-05 19:02:40.723269 | 2.861 ms |
| │ ├── parse | 2023-11-05 19:02:40.723269 | 0.054 ms |
| │ ├── resolve | 2023-11-05 19:02:40.723338 | 0.187 ms |
| │ ├── rewrite | 2023-11-05 19:02:40.723556 | 0.342 ms |
| │ ├── optimize | 2023-11-05 19:02:40.723906 | 2.015 ms |
| │ │ ├── inner_execute_read | 2023-11-05 19:02:40.724391 | 0.126 ms |
| │ │ │ ├── sql_compile | 2023-11-05 19:02:40.724400 | 0.043 ms |
| │ │ │ │ └── pc_get_plan | 2023-11-05 19:02:40.724402 | 0.023 ms |
| │ │ │ ├── open | 2023-11-05 19:02:40.724450 | 0.018 ms |
| │ │ │ └── do_local_das_task | 2023-11-05 19:02:40.724485 | 0.016 ms |
| │ │ └── close | 2023-11-05 19:02:40.724525 | 0.023 ms |
| │ │ ├── close_das_task | 2023-11-05 19:02:40.724526 | 0.005 ms |
| │ └── end_transaction | 2023-11-05 19:02:40.724539 | 0.003 ms |
| │ ├── code_generate | 2023-11-05 19:02:40.725930 | 0.092 ms |
| │ └── pc_add_plan | 2023-11-05 19:02:40.726091 | 0.033 ms |
| └── sql_execute | 2023-11-05 19:02:40.726143 | 478.105 ms |
| ├── open | 2023-11-05 19:02:40.726143 | 0.012 ms |
| ├── response_result | 2023-11-05 19:02:40.726160 | 477.975 ms |
| │ └── do_local_das_task | 2023-11-05 19:02:40.726165 | 0.078 ms |
| └── close | 2023-11-05 19:02:41.204164 | 0.076 ms |
| ├── close_das_task | 2023-11-05 19:02:41.204166 | 0.037 ms |
| └── end_transaction | 2023-11-05 19:02:41.204227 | 0.004 ms |
+-----------------------------------------------------------+----------------------------+------------+
27 rows in set
From the results above, you can determine that the root cause of the slow SQL execution is that the sql_execute phase consumed a significant amount of time, which increased the overall query duration.
4. View and analyze the schema of the tp1 table.
obclient [testdb]> desc tp1;
The return result is as follows:
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| c1    | bigint(20) | YES  | MUL | NULL    |       |
| c2    | bigint(20) | YES  |     | NULL    |       |
+-------+------------+------+-----+---------+-------+
2 rows in set
By examining the structure of table tp1, you can see that there is no index created on the column c1. This results in a full table scan during the query in Step 2, which causes the SQL execution to be slow. To improve performance, you need to create an index on the c1 column.
5. Create an index on the test table.
obclient [testdb]> create index z on tp1(c1) global;
6. Execute the preceding query statement again.
obclient [testdb]> select * from tp1 where c1 = 938599;
The return result is as follows:
+--------+--------+
| c1     | c2     |
+--------+--------+
| 938599 | 518633 |
+--------+--------+
1 row in set (0.006 sec)
7. Use the Show Trace statement to examine the execution call chain and analyze the time spent at each stage. You will notice that the query efficiency has significantly improved.
obclient [testdb]> show trace;
The return result is as follows:
+-----------------------------------------------+----------------------------+------------+
| Operation | StartTime | ElapseTime |
+-----------------------------------------------+----------------------------+------------+
| obclient | 2023-11-05 19:04:02.473941 | 5.719 ms |
| └── com_query_process | 2023-11-05 19:04:02.474078 | 5.517 ms |
| └── mpquery_single_stmt | 2023-11-05 19:04:02.474082 | 5.507 ms |
| ├── sql_compile | 2023-11-05 19:04:02.474092 | 4.549 ms |
| │ ├── pc_get_plan | 2023-11-05 19:04:02.474095 | 0.055 ms |
| │ └── hard_parse | 2023-11-05 19:04:02.474174 | 4.458 ms |
| │ ├── parse | 2023-11-05 19:04:02.474175 | 0.056 ms |
| │ ├── resolve | 2023-11-05 19:04:02.474244 | 0.193 ms |
| │ ├── rewrite | 2023-11-05 19:04:02.474512 | 0.321 ms |
| │ ├── optimize | 2023-11-05 19:04:02.474842 | 3.562 ms |
| │ ├── code_generate | 2023-11-05 19:04:02.478417 | 0.094 ms |
| │ └── pc_add_plan | 2023-11-05 19:04:02.478588 | 0.038 ms |
| └── sql_execute | 2023-11-05 19:04:02.478649 | 0.908 ms |
| ├── open | 2023-11-05 19:04:02.478650 | 0.019 ms |
| ├── response_result | 2023-11-05 19:04:02.478674 | 0.819 ms |
| │ └── do_local_das_task | 2023-11-05 19:04:02.478687 | 0.364 ms |
| └── close | 2023-11-05 19:04:02.479500 | 0.051 ms |
| ├── close_das_task | 2023-11-05 19:04:02.479501 | 0.021 ms |
| └── end_transaction | 2023-11-05 19:04:02.479541 | 0.003 ms |
+-----------------------------------------------+----------------------------+------------+
19 rows in set (0.004 sec)
Compared with the results in Step 3, you can see that the execution time of the sql_execute stage decreases from 478.105 ms to 0.908 ms, greatly improving the query efficiency.
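As an optional follow-up not shown in the walkthrough above, you can also confirm that the optimizer actually picks the new index by checking the execution plan; the statement below assumes the same tp1 table and index z created in Step 5.
-- Display the optimizer's plan for the query; after the index is created, the
-- plan should reference index z rather than a full table scan of tp1.
obclient [testdb]> explain select * from tp1 where c1 = 938599;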