This topic covers performance issues you may encounter during Sysbench testing on OceanBase Database. It explains how to identify bottlenecks, tune configurations, and improve stability under high concurrency, and includes case studies for common problems.
Quick methods for locating Sysbench issues
When Sysbench TPS is lower than expected, fluctuates significantly, drops to zero, or the test exits unexpectedly, start with the quick troubleshooting steps below.
Symptom 1: Sysbench TPS is lower than expected
To troubleshoot this issue, follow these steps:
Analyze the Sysbench parameters.
First, analyze the parameters in the Sysbench command.
If
--rand-type=specialis set or if the--rand-typeparameter is not set, you can set it to--rand-type=uniformto reduce lock contention between data.If the number of tables or their size is too small, you can increase the data set size. For example, set
tables=100andtable_size=1000000.For the
read_writeorwrite_onlyscenarios, check for deadlocks to verify:SELECT * FROM oceanbase.cdb_ob_deadlock_event_history ORDER BY create_time DESC LIMIT 10;
Confirm the connection method.
- If the primary zone is a single zone and you are directly connecting to an OBServer node, confirm whether the connection is to the leader.
- If the primary zone is random, you need to connect to OBProxy for testing.
Query the tenant information.
SELECT * FROM oceanbase.dba_ob_tenants;Query the tenant leader.
SELECT svr_ip, sql_port, count(1) FROM oceanbase.cdb_ob_ls_locations WHERE tenant_id=1 AND role='leader' GROUP BY svr_ip;
Observe CPU overhead.
If the Sysbench parameters and client connection method are correct, you can continue to troubleshoot by checking the CPU usage.
Check the CPU usage of threads.
top -H -p pidof observerIf each business thread is almost fully utilizing a core, check whether the number of threads is close to the number of CPU cores.
- If the number of business threads (such as T1002_L0_G2) is much lower than the number of CPU cores and the total CPU overhead is much lower than the maximum, it is likely that the
cpu_countis set too low. In this case, you can check the command for creating the tenant to confirm whethercpu_countis set too low. - If the total CPU overhead is close to the maximum, the machine's performance has reached its limit, and only a higher-configured machine can achieve higher performance.
- If the number of business threads (such as T1002_L0_G2) is much lower than the number of CPU cores and the total CPU overhead is much lower than the maximum, it is likely that the
Check the tenant specifications.
show create tenant xxx; SELECT * FROM oceanbase.dba_ob_resource_pools; SELECT * FROM oceanbase.dba_ob_units;If none of the business threads are fully utilizing a core (for example, the CPU usage is 20% to 50%), there is a bottleneck elsewhere.
- If a network thread (such as RpcIO) is fully utilizing a core, it may be due to a shortage of network threads, causing network communication to become a bottleneck. In this case, you can increase the number of threads by setting the
net_thread_countparameter. - If the CPU usage of all threads is not high, the bottleneck may be hardware-related. You need to continue to analyze.
- If a network thread (such as RpcIO) is fully utilizing a core, it may be due to a shortage of network threads, causing network communication to become a bottleneck. In this case, you can increase the number of threads by setting the
Analyze hardware-related reasons.
After excluding CPU-related factors, the remaining hardware-related reasons are primarily disk and network issues. The troubleshooting approach for disk and network issues is as follows.
- Confirm the maximum network bandwidth and check whether the network bandwidth becomes a bottleneck during the test. You can deploy Sysbench on the OBServer node for the test.
- Confirm the network latency (by using the ping command). High latency can also cause SQL execution to slow down.
- Confirm the maximum 16K write bandwidth of the log disk (by using the FIO test) and check whether the disk I/O becomes a bottleneck during the test. You can replace the disk with an SSD or NVM for the test.
Commands to check network I/O:
sar -n DEV 1Commands to check the maximum network bandwidth:
## (eth0 is the name of the network interface card) ethtool eth0Other factors.
In addition to the above factors, there may be some special circumstances, such as conflicts between the test period and background tasks of the observer.
A user once reported that during business testing, two benchmark runs produced TPS far below expectations. After checking various configurations, no obvious issues were found. It turned out the tests ran at 2:00 AM—the default daily window for minor and major compactions—which severely affected database performance.
Symptom 2: Sysbench TPS fluctuates or drops to zero
When Sysbench TPS fluctuates or drops to zero, one of the following scenarios may apply.
TPS occasionally fluctuates slightly
Possible cause: OceanBase Database periodically runs minor and major compactions, which consume CPU resources and affect business threads. The impact is smaller when
cpu_countis higher. For example, for a 4c8g tenant, TPS may drop by 80% during minor compaction; for a 32c64g tenant, TPS may drop by 10%.Solution:
Increase the
max_cpuparameter of the tenant resource unit.Increase the CPU concurrency of the tenant.
alter system set cpu_quota_concurrency=4 tenant = xxx;Reduce the number of minor compaction threads (
merge_thread_count) or lower thread priorities (compaction_high_thread_score,compaction_mid_thread_score, andcompaction_low_thread_score). This lowers minor compaction overhead per unit time but extends the overall duration; the total compaction work remains unchanged.
TPS temporarily drops to zero and then recovers intermittently
- Possible cause: A deadlock occurs, and the
--mysql-ignore-errors=1062,1213,6002parameter is set in Sysbench. - Solution:
Check for deadlocks.
SELECT * FROM oceanbase.cdb_ob_deadlock_event_history ORDER BY create_time DESC LIMIT 10;After confirming deadlocks, take the following steps:
- Add the
--rand-type=uniformparameter to Sysbench. - Increase the data set size. For example, set
tables=100andtable_size=1000000(the default value oftable_sizeis 10000).
- Add the
TPS drops to 0 and remains at 0
- Possible cause: A deadlock occurs, and it is not detected, and the timeout period is set to a very long value. As a result, the TPS drops to 0 without any error, exit, or recovery.
- Solution:
Check for deadlocks.
SELECT * FROM oceanbase.cdb_ob_deadlock_event_history ORDER BY create_time DESC LIMIT 10;After confirming deadlocks, take the following steps:
- Add the
--rand-type=uniformparameter to Sysbench. - Increase the data set size. For example, set
tables=100andtable_size=1000000(the default value oftable_sizeis 10000).
- Add the
TPS drops to 0 and then quickly recovers
Possible cause: The Clog disk is full and cannot write data, causing the TPS to drop to 0. After the Clog disk recovers space, the TPS recovers.
Solution: Check the disk space usage of the tenant:
SELECT tenant_id, svr_ip, svr_port, log_disk_in_use, log_disk_size FROM oceanbase.gv$ob_units;Consider allocating more disk space.
If the issue persists after ruling out the above causes, contact technical support for further assistance.
Symptom 3: Sysbench terminates unexpectedly or reports an error
TPS drops to 0 and intermittently recovers, occasionally reporting error 6002 or 1213
- Possible causes: A deadlock occurs, the dataset is too small, or the
rand-typeparameter is not set. The default Special type causes keys to be concentrated. The process does not exit because most SQL statements are still in a deadlock, and threads are waiting for the SQL statements to complete. - Solution:
Check for deadlocks.
SELECT * FROM oceanbase.cdb_ob_deadlock_event_history ORDER BY create_time DESC LIMIT 10;After confirming deadlocks:
- Add the
--rand-type=uniformparameter to Sysbench. - Increase the dataset size. For example, set
tables=100andtable_size=1000000(the default value oftable_sizeis 10000).
- Add the
Sysbench terminates unexpectedly and reports error 1062
- Possible causes: Error 1062 indicates a primary key or unique index conflict. Since Sysbench generates data randomly, primary key conflicts may occur.
- Solution: Add the
--mysql-ignore-errors=1062parameter to Sysbench.
Sysbench terminates unexpectedly and reports error 1213
- Possible causes: A deadlock occurs, and some of the involved transactions are killed.
- Solution: Even with a large dataset and the
--rand-type=uniformparameter, deadlocks may still occur. Therefore, you can add the--mysql-ignore-errors=1062,1213parameter to ignore deadlock errors.
Sysbench terminates unexpectedly and reports error 6002
- Possible causes: Error 6002 may also be caused by a deadlock, but it does not report error 1213.
- Solution: Increase the dataset size and add the
--rand-type=uniformparameter to reduce lock conflicts. Add the--mysql-ignore-errors=1062,1213,6002parameter to Sysbench.
Sysbench terminates unexpectedly and reports error 4012
Possible causes: Severe lock contention or a deadlock occurs. Alternatively, the OBServer node may have poor performance, and the client may be under excessive pressure, leading to a timeout.
Solution: Increase the timeout period (unit: us):
set global ob_query_timeout = 50000000000;
Sysbench terminates unexpectedly and reports error 5930
Possible causes: The SQL precompilation feature is enabled, that is, the
--db-ps-mode=autoparameter is specified. However, OceanBase Database does not fully support SQL precompilation.Solution:
- Increase the value of the tenant parameter:
alter system set open_cursors=65535;. - You can also disable SQL precompilation for testing:
--db-ps-mode=disable;.
- Increase the value of the tenant parameter:
Sysbench terminates unexpectedly and reports other errors
- Possible causes: The error may be caused by a bug in OceanBase Database.
- Solution: Record the error logs and contact technical support for assistance.
General troubleshooting process for Sysbench issues
If quick troubleshooting does not find the cause, follow the full process below. Refer to linked sections where needed. Contact technical support if the issue persists.
General approach:
- Check hardware resources. Insufficient CPU, memory, or disk can hurt performance or drop TPS to zero.
- Review tenant configuration. Under-provisioned tenants cannot use available hardware.
- Review Sysbench parameters. Thread count directly affects TPS; a small dataset or skewed key distribution can cause lock contention or deadlocks.
- Consider the possibility of an OBServer bug.
Note
You can use the commands provided as needed without executing all of them.
Hardware check
If the hardware resources are insufficient, such as the CPU and memory resources are too small (less than 4c8g), the performance will be poor. If the disk is almost full, it will also affect the write speed. High network latency also has an impact.
- CPU:
lscpu - Memory:
top,cat /proc/meminfo - Disk:
df -h - Network latency:
ping ip
OceanBase deployment configuration check
If the hardware resources are sufficient, but the parameters set during OceanBase deployment are too small and cannot fully utilize the resources, it will also lead to performance degradation. Most configurations can be dynamically adjusted, while only a few can be configured only during deployment.
Connect to the SYS tenant and view the global view.
SELECT * FROM GV$OB_SERVERS;View the number of network threads (this configuration cannot be dynamically modified).
show parameters where name = 'net_thread_count';View the deployed clusters.
obd cluster listView the deployed cluster configurations.
obd cluster edit-config [cluster name]
View the OBServer startup parameters on the machine where the OBServer is located.
ps -ef | grep observer
Tenant configuration check
If the resources of the OBServer node are sufficient, you can check whether the resources of the test tenant have become a bottleneck by using the following commands.
View the basic information of the tenant.
SELECT * FROM oceanbase.dba_ob_tenants;View the command for creating the tenant.
show create tenant [tenant name];View the resource pool corresponding to the tenant (you can add
tenant_idto filter).SELECT * FROM oceanbase.dba_ob_resource_pools;View the configuration of the corresponding unit.
SELECT * FROM oceanbase.dba_ob_units;
Sysbench parameter check
The Sysbench parameters are in the Sysbench execution command. You can check them one by one according to the parameter description.
Deadlock troubleshooting
View the deadlock records.
SELECT * FROM oceanbase.cdb_ob_deadlock_event_history ORDER BY create_time DESC LIMIT 10;role=victimindicates the transaction that was killed.View the SQL statements related to the transaction.
SELECT usec_to_time(REQUEST_TIME), TRACE_ID, TX_ID, QUERY_SQL FROM gv$ob_sql_audit WHERE TX_ID=52635 ORDER BY REQUEST_TIME;If no related transactions are found, it means that the SQL Audit feature is not enabled. You can check whether the
enable_sql_auditparameter is set to True:show parameters where name = 'enable_sql_audit';If the query result is False, you will not see the SQL statements corresponding to the transaction. You can filter the transaction ID from the logs to find some information.
Query the SQL corresponding logs.
After finding the SQL statements related to the transaction, use the Trace ID of the conflicting SQL statement to search in the logs:
grep '[trace id]' observer.log*You can see all the logs of this SQL statement. Generally, the last log contains the error code returned to the client:
sending error packet(err=-4101You can also use the Trace ID of the last
commitorbegincommand to search for related logs:sending error packet(err=-6002If no logs are found, it may be due to an incorrect log level or log throttling. You need to set the log level to
WDIAGand increase the log bandwidth limit:alter system set syslog_level='WDIAG'; alter system set syslog_io_bandwidth_limit='2G';Notice
Deadlock information generated before the configuration change cannot be retrieved.
Minor and major compaction analysis
View minor and major compaction tasks that take more than 5 seconds.
SELECT * FROM GV$OB_MERGE_INFO WHERE tenant_id=1002 AND (END_TIME-START_TIME)>5 LIMIT 10;View the major compaction history of a table.
SELECT * FROM GV$OB_TABLET_COMPACTION_HISTORY WHERE TABLET_ID IN (SELECT TABLET_ID FROM oceanbase.CDB_OB_TABLE_LOCATIONS WHERE TABLE_NAME = 'sbtest1') ORDER BY START_TIME DESC;Manually trigger a minor compaction.
alter system minor freeze tenant=xxx;
You can also run top -H on the OBServer node to check the load on the MINI_MERGE and MINOR_EXE threads. These threads are active only briefly.
Slow SQL query
Query the slow SQL statements that take more than 5 seconds and print all the information.
SELECT * FROM gv$ob_sql_audit WHERE elapsed_time > 5000000 LIMIT 10;Print the key fields.
SELECT SVR_IP, SVR_PORT, SQL_EXEC_ID, TRACE_ID, TENANT_ID, SQL_ID, QUERY_SQL, PLAN_ID, PARTITION_CNT, IS_INNER_SQL, ELAPSED_TIME, EXECUTE_TIME, APPLICATION_WAIT_TIME FROM gv$ob_sql_audit WHERE elapsed_time > 5000000 ORDER BY elapsed_time DESC LIMIT 10;If no data is found, it may be because the SQL Audit feature is not enabled:
show parameters where name = 'enable_sql_audit'; alter system set enable_sql_audit = true;
Log issue troubleshooting
View the disk space usage.
SELECT tenant_id, svr_ip, svr_port, LOG_DISK_IN_USE, LOG_DISK_SIZE FROM gv$ob_units;
Problem reproduction process
First, clear the previous
sql_auditrecords.alter system set enable_sql_audit = false; alter system flush sql audit global;Enable the
sql_auditrecords.alter system set enable_sql_audit = true;Modify the log level.
alter system set syslog_level='WDIAG';Reduce the SQL record threshold to record more SQL statements.
alter system set trace_log_slow_query_watermark = '10ms';Increase the system log bandwidth limit to avoid losing critical logs due to throttling.
alter system set syslog_io_bandwidth_limit='2G';If you do not want the transaction to time out, you can also increase the timeout time.
set global ob_query_timeout = 50000000000;
Then, set the parameters to the same values as in the problem scenario and start the Sysbench test.
Sysbench performance issue case analysis
This section analyzes Sysbench TPS dropping to zero with occasional errors.
Symptoms
When you run the following command to perform a Sysbench test on OceanBase Database Community Edition V4.1.0, the TPS drops to 0 and an error occurs.
sysbench --db-driver=mysql --threads=500 --time=300000 --mysql-host=xxxxxxx --mysql-port=2883 --mysql-user='tpcc@t1' --mysql-password='******' --report-interval=2 --tables=100 /usr/share/sysbench/oltp_read_write.lua --db-ps-mode=disable run
Root cause and analysis
The initial analysis suggests that the small table_size and the default special data distribution mode cause data to be concentrated, leading to severe primary key conflicts and ultimately a deadlock, which results in the TPS dropping to 0.
First, increase the timeout period for the test tenant.
set global ob_query_timeout = 50000000000; set global ob_trx_timeout = 50000000000;To quickly reproduce the issue, we set
Tablesto 1, while keeping other parameters unchanged.sysbench oltp_read_write.lua --mysql-host=[ip] --mysql-port=2881 --mysql-user=root@user --mysql-db=test --table_size=10000 --tables=1 --threads=500 --time=3600 --report-interval=1 --db-ps-mode=disable runThe observed phenomenon is that the TPS drops to 0, but occasionally, the TPS and QPS are greater than 0.

We check the deadlock view.
SELECT * FROM oceanbase.CDB_OB_DEADLOCK_EVENT_HISTORY ORDER BY create_time DESC LIMIT 10;
The recent deadlock records are displayed.
role=victimindicates the killed transaction, androle=witnessindicates the transaction that witnessed the conflict.cycle_size=2indicates that only two transactions are conflicting, whilecycle_size>2indicates that multiple transactions are conflicting with one transaction. In this case, we select an event withcycle_size=2for easier observation. We can filter out a group of conflicting transactions by using theevent_id.SELECT * FROM oceanbase.CDB_OB_DEADLOCK_EVENT_HISTORY WHERE cycle_size=2 ORDER BY create_time DESC LIMIT 10;
The transaction IDs are displayed in the Visitor field. We can find all SQL statements related to a transaction by using the transaction ID:
SELECT TRACE_ID, TX_ID, QUERY_SQL FROM gv$ob_sql_audit WHERE TX_ID=5042912 ORDER BY REQUEST_TIME; SELECT TRACE_ID, TX_ID, QUERY_SQL FROM gv$ob_sql_audit WHERE TX_ID=5042591 ORDER BY REQUEST_TIME;
The two transactions are in a cross dependency situation, that is:
- Transaction 1 updates id=4978.
- Transaction 2 updates id=5049.
- Transaction 1 updates id=4978.
- Transaction 2 deletes id=5049.
After the deadlock occurs, transaction 64317 is rolled back, and transaction 64235 is successfully committed. Because transaction 64317 was created later, you can find the first SQL statement of the transaction in the
sql_audittable and use therequest_timeof that SQL statement to determine the transaction order.SELECT REQUEST_TIME, SVR_IP, SVR_PORT, SQL_EXEC_ID, TRACE_ID, TENANT_ID, TX_ID, SQL_ID, QUERY_SQL, PLAN_ID, PARTITION_CNT, IS_INNER_SQL, ELAPSED_TIME, EXECUTE_TIME, APPLICATION_WAIT_TIME FROM gv$ob_sql_audit WHERE TX_ID=64317 ORDER BY REQUEST_TIME;According to the trace ID, you can find the rolled-back transaction in the logs. The error code is 6002 (rollback), and the abort reason is 4101 (deadlock).
Find the SQL statement that caused the conflict in the logs of the killed transaction. The last log entry shows error code 4101. However, Sysbench does not report a deadlock error for the conflicting SQL statement; instead, it reports the rollback error 6002.
[2023-07-11 15:32:40.263794] INFO [SERVER] send_error_packet (obmp_packet_sender.cpp:317) [28473][T1002_L0_G0][T1002][Y5B690B7C0505-00060030956354CB-0-0] [lt=14] sending error packet(err=-4101, extra_err_info=NULL, lbt()="0x1029cb40 0x832fadc 0x82e32f7 0x5b65d4f 0x5960718 0x595c58f 0x595a25a 0x8075f6c 0x595964c 0x80754ea 0x595665a 0x8075b24 0x1061b777 0x10613e9f 0x7f0215489e25 0x7f0214f48f1d")Find the last
BEGINstatement of the transaction in the logs. Because the transaction does not explicitly execute acommitstatement, the next transaction'sbeginstatement implicitly commits the current transaction. Therefore, thebeginstatement is the last SQL statement of the current transaction. You can find the rollback error 6002 in the logs.[2023-07-11 15:32:40.305922] INFO [SERVER] send_error_packet (obmp_packet_sender.cpp:317) [28472][T1002_L0_G0][T1002][Y5B690B7C0505-0006003095136574-0-0] [lt=37] sending error packet(err=-6002, extra_err_info=NULL, lbt()="0x1029cb40 0x832fadc 0x82e32f7 0x5a55e57 0x5960c40 0x595c58f 0x595a25a 0x8075f6c 0x595964c 0x80754ea 0x595665a 0x8075b24 0x1061b777 0x10613e9f 0x7f0215489e25 0x7f0214f48f1d")If no logs are found, log throttling may be the cause. We recommend that you increase the system log bandwidth limit:
alter system set syslog_io_bandwidth_limit='2G';
Resolution
Based on the analysis above, you can draw the following conclusions:
- In the default
rand-type=specialscenario, deadlocks may occur due to excessive key conflicts, causing Sysbench to exit. - In the default
rand-type=specialscenario, iftablesandtable_sizeare also small, deadlocks occur more frequently. A large number of transactions remain blocked, test threads cannot exit in time, and TPS drops to 0 for a period (several minutes). Occasionally, TPS may be greater than 0. - Deadlock transactions are rolled back and may report error code 6002 (instead of 1213).
- If
ob_query_timeoutis set to a long value and deadlock detection does not take effect, transactions cannot be killed and SQL statements remain blocked. In this case, TPS stays at 0 for a long time without any error reports.
To fix Sysbench TPS dropping to zero with occasional errors:
- Consider changing the data distribution type to
--rand-type=uniform. - If you do not want to change the data distribution type, increase the dataset size. For example, set
tables=100andtable_size=1000000. - If you do not want to change the data distribution type or the dataset size, you can ignore the error codes with
--mysql-ignore-errors=1062,1213,6002. However, this does not prevent deadlocks.
