Best practices for troubleshooting OceanBase Database performance issues-OceanBase

Best practices for troubleshooting OceanBase Database performance issues

Last Updated：2025-02-25 08:52:34 Updated

Background information

During routine maintenance, you may encounter complex troubleshooting scenarios where significant effort is required, yet the root cause of the issue remains unclear.

Below are some common scenarios:

Standard benchmark results show significant discrepancies compared with internal performance data.
Performance in low-concurrency scenarios is poor, with execution times not meeting expectations.
Stress test results appear satisfactory, but it is unclear whether they represent OceanBase Database’s maximum capabilities.
High CPU utilization is observed, but queries per second (QPS) remains low, resulting in poor performance.
In high-concurrency stress tests, CPU utilization is low, link latency is minimal, yet QPS remains low.
Uneven load distribution occurs across servers in multi-server deployment scenarios.

Factors affecting the performance

Hardware deployment
- Deploying OceanBase Database Proxy (ODP) and OceanBase Database on the same server can result in high resource overhead, especially during high-concurrency stress tests.
- Deploying the stress test client, load balancer, and ODP across different IDCs increases latency.
- Limited connection counts or small-performance specifications on servers where the load balancer or ODP resides can easily become bottlenecks.
- Sharing disks for data and logs compromises performance during high-concurrency write operations.
Transaction model
- Remote execution and distributed transaction commits degrade performance.
- If partition pruning is not applied to a single SQL statement, it will result in full partition scans, wasting resources.
- Incorrect routing by ODP can cause requests to be randomly distributed, leading to inefficiencies.

Suboptimal plans and missing key optimizations

Execution plan: Multi-level nested loop joins (NLJs) significantly compromise the efficiency of execution plans. Batch rescan is not supported.
Data model: A well-designed data model, including proper use of cardinalities and buffer tables, is critical for optimal performance.
Index strategy: Proper index usage is essential to avoid unnecessary random writes and full-table scans.

System hotspots: Internal implementations of OceanBase Database may create system hotspots, increasing CPU overhead.
Kernel configurations: Misconfigured system kernel parameters can severely impact performance.
I/O performance: Insufficient disk IOPS, read/write bandwidth, or network bandwidth will lead to performance degradation.

Logic of troubleshooting

Diagnostic rules

Cluster deployment
- If possible, avoid deploying ODP on the same server as OceanBase Database. ODP may compromise performance in extreme scenarios.
- Typically, latency benchmarks are as follows: 50 to 170 μs within the same IDC and approximately 1.5 ms between two IDCs in the same region. For cross-region deployment: 5 ms between Region A and Region B, and 25 ms between Region B and Region C. Actual latency depends on the network environment.
- For high-performance stress tests, such as Sysbench benchmarks with Point Select, we recommend removing ODP.
Scalability
- To achieve linear scalability, ensure that the link components, transaction model, and various kernel modules of OceanBase Database fully support linear scaling. If any component becomes a bottleneck, linear scalability cannot be achieved.
  - Transactions must be executed on a single server, and SQL statements must not involve remote procedure calls (RPCs).
  - The software and hardware resources of the physical servers hosting the client and ODP must support linear scaling.
- Use the Top SQL feature in OceanBase Cloud Platform (OCP) to monitor SQL statements. Pay special attention to SQL statements that meet the following conditions: the RPC count is greater than or equal to 1, the partition count is greater than 1, or a non-local plan is used. Sort SQL statements by CPU utilization in descending order and analyze each statement for potential issues.
Single-threaded smoke testing
- For time-consuming SQL statements in simple scenarios, perform a low-concurrency stress test. Measure the latency across the entire chain: Application server -> Load balancer -> ODP -> Leader in OceanBase Database, and analyze key bottlenecks.
System environment data
- In high-concurrency stress tests, ensure that the system CPU utilization does not exceed 15%. If this threshold is exceeded, analyze the causes.
- Ensure that the CPU is running at full load during high-concurrency tests. If the CPU is underutilized, investigate the reasons.
- Verify that the response time of I/O operations on data and log disks, as well as the TCP retransmission rate, remain within reasonable ranges.
Examples of using OCP diagnostics
- Top SQL statements: Go to the top SQL statements page in OceanBase Autonomy Service (OAS). Analyze SQL statements with the following characteristics to assess potential risks:
  - The number of table scans is 1.
  - The number of RPCs exceeds 0.
  - The number of accessed partitions exceeds 1.
  - The execution time of simple SQL statements (such as those returning a small number of rows or based on primary keys or indexes) exceeds 1 ms.
- Slow queries:
  - Go to the slow queries page in OAS. Sort SQL statements by the following dimensions and pay attention to the following two types:
    - Total database time: Identify the top N SQL statements with the longest total database time.
    - Maximum response time: Identify the top N SQL statements with the longest response time.
- Perform a refined analysis of the identified SQL statements as follows:
  - whether the SQL statements meet the four check items for top SQL statements.
  - Check whether the execution plan is correct.
  - Check whether the degree of parallelism (DOP) exceeds 1.

Troubleshooting tools

OCP
- Use the SQL panorama to analyze business SQL statements. Focus on SQL statements that involve RPCs, cross-partition access, disk reads, or abnormal execution plans.
- Monitor key metrics, including CPU utilization, I/O response times for data and clog operations, SQL response times, and workload balancing.
Active session history (ASH) reports
- Starting from OceanBase Database V4.x, more detailed monitoring data is available. You can generate ASH reports to obtain detailed monitoring data for individual SQL statements. For more information, see Generate ASH reports.
top -H -p {PID of the observer process}
- Use this command to view all threads of the observer process. Analyze the usage of worker threads, I/O threads, and minor compaction threads. Check whether OceanBase Database configurations are reasonable or if bottlenecks exist.
- In specific scenarios, analyze the status of each worker thread to help identify and resolve performance issues.