obdiag is a CLI diagnostic tool designed for OceanBase Database. It scans and collects information such as logs, SQL audit records, and process stack information of OceanBase. You may deploy your OceanBase cluster by using OceanBase Cloud Platform (OCP) or OceanBase Deployer (OBD), or manually deploy it based on the OceanBase documentation. Regardless of the deployment mode, you can use obdiag to collect diagnostic information with a few clicks.
obdiag is now officially open source, and a special interest group (SIG) named 'obdiag SIG' has been established.
The obdiag team has compiled relevant experiences regarding OceanBase diagnosis and tuning and has commenced releasing a series of tutorial articles. This article is the first installment of the series, providing an introduction to the diagnostic methods in OceanBase and the OceanBase Diagnostic Tool (obdiag).
In most scenarios, issues with OceanBase can be located and analyzed with a three-step approach: environmental analysis, log analysis, and performance view analysis. This is the secret recipe for diagnosing OceanBase.
Environmental analysis examines information about the network, disk, memory, CPU, and so on. We won't go into detail on it in this article.
Now, let's dive into the log analysis and performance view analysis.
The log files in OceanBase are divided into three main types: observer.log, election.log, and rootservice.log. By default, these log files capture logs at the INFO level and above. Additionally, each type of log generates a corresponding WARNING log file with a .wf suffix (observer.log.wf, election.log.wf, rootservice.log.wf), which exclusively records logs at the WARN level and above.
| Name | Path |
| --- | --- |
| observer.log, observer.log.wf | $work_dir/log under the installation directory of the OBServer node |
| election.log, election.log.wf | $work_dir/log under the installation directory of the OBServer node |
| rootservice.log, rootservice.log.wf | $work_dir/log under the installation directory of the OBServer node |
The logs are divided into the following six levels from low to high: DEBUG, TRACE, INFO, WARN, USER_ERR, and ERROR.
The ERROR level is special: it prints the call stack at the point where the log is generated, and the symbol table is required to parse the stack.
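The log level can also be adjusted at runtime. Below is a minimal sketch, assuming the syslog_level cluster parameter (which controls the lowest level written to the log files) and a connection with administrative privileges in the sys tenant:
-- sketch: raise the log level to WARN to reduce log volume (syslog_level is assumed here)
alter system set syslog_level = 'WARN';
-- restore the default INFO level
alter system set syslog_level = 'INFO';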
The log format is as follows; the actual content varies with the specific circumstances.
[admin@OceanBase000000000.sqa.ztt /home/admin/oceanbase/log]
$tail -f -n 1 observer.log
[2016-07-17 14:18:04.845802] INFO [RPC.OBMYSQL] obsm_handler.cpp:191 [9543][Y0-0] [lt=47] [dc=0] connection close(easy_connection_str(c)="192.168.0.2:56854_-1_0x7fb8a9171b68", version=0, sessid=2147562562, tenant_id=1, server_id=1, is_need_clear_sessid_=true, ret=0)
[admin@OceanBase000000000.sqa.ztt /home/admin/oceanbase/log]
$tail -f -n 1 observer.log.wf
[2016-07-17 14:18:28.431351] WARN [SQL.SESSION] set_conn (ob_basic_session_info.cpp:2568) [8541][YB420AF4005E-52A8CF4E] [lt=16] [dc=0] debug for set_conn(conn=0x7fb8a9171b68, lbt()="0x4efe71 0x818afd 0xe9ea5b 0x721fc8 0x13747bc 0x2636db0 0x2637d68 0x5054e9 0x7fb98705aaa1 0x7fb9852cc93d ", magic_num_=324478056, sessid_=2147562617, version_=0)
[admin@OceanBase000000000.sqa.ztt /home/admin/oceanbase/log]
$tail -f -n 1 rootservice.log
[2016-07-17 14:18:53.701463] INFO [RS] ob_server_table_operator.cpp:345 [8564][Y0-0] [lt=11] [dc=0] svr_status(svr_status="active", display_status=1)
[admin@OceanBase000000000.sqa.ztt /home/admin/oceanbase/log]
$tail -f -n 1 rootservice.log.wf
[2016-07-16 02:02:12.847602] WARN [RS] choose_leader (ob_leader_coordinator.cpp:2067) [8570][YB420AF4005E-4626EDFC] [lt=8] [dc=0] choose leader info with not same candidate num(tenant_id=1005, server="192.168.0.1:2882", info={original_leader_count:0, primary_zone_count:0, cur_leader_count:1, candidate_count:1, in_normal_unit_count:1})
Pay attention to log information, especially the error log information, and establish a log collection and alert mechanism based on OceanBase's log model.
The SQL audit view is a practical feature for SQL performance diagnosis in OceanBase. Prior to OceanBase 4.0, the view is named gv$sql_audit; from OceanBase 4.0 onwards, it is gv$ob_sql_audit. This article is based on the earlier versions.
With the SQL audit, developers and operators can conveniently trace any SQL statement executed on OceanBase, regardless of whether it succeeded, because detailed information about every request is recorded.
Querying gv$sql_audit retrieves the sql_audit records of all servers, while v$sql_audit retrieves only the records of the server that the current session is connected to. When you query v$sql_audit, you cannot control which server the request is routed to. Therefore, to examine the sql_audit records of a specific server, either connect to that server directly, or query gv$sql_audit and filter by the server's IP address and port.
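For example, here is a minimal sketch of inspecting the records of one server through gv$sql_audit, filtering by IP address and port (the address and port below are placeholders):
select svr_ip, svr_port, sql_id, elapsed_time, queue_time, ret_code
from oceanbase.gv$sql_audit
where svr_ip = '192.168.0.2' and svr_port = 2882   -- placeholder server address
order by request_time desc
limit 10;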
The following typical cases show how sql_audit is used in practice:
Case 1: How to check if the SQL request volume is evenly distributed?
Action: Query sql_audit for all user SQL executed within a specific time period, aggregate the records by server IP, and count the SQL statements for each server IP in that period to roughly gauge the traffic on each server.
select /*+ parallel(15) */ t2.zone, t1.svr_ip, count(*) as QPS, avg(t1.elapsed_time), avg(t1.queue_time)
from oceanbase.gv$sql_audit t1, oceanbase.__all_server t2
where t1.svr_ip = t2.svr_ip and t1.IS_EXECUTOR_RPC = 0
and t1.request_time > (time_to_usec(now()) - 1000000)
and t1.request_time < time_to_usec(now())
group by t2.zone, t1.svr_ip
order by t2.zone;
Case 2: How to find the SQL that consumes the most CPU?
Action: First, retrieve all SQL executed within a specific time period. Then, aggregate the records by SQL_ID and calculate the total execution time for each SQL_ID, where:
Execution time = ELAPSED_TIME - QUEUE_TIME
The time spent waiting in the queue does not consume CPU; the actual CPU consumption occurs while the plan is obtained and executed. Performance analysis and tuning can then focus on the SQL statements that consume the most CPU.
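A minimal sketch of such an aggregation on gv$sql_audit (the one-minute window and the limit of 10 are placeholders to adjust as needed):
select sql_id, count(*) as executions,
       sum(elapsed_time - queue_time) as total_exec_time,
       avg(elapsed_time - queue_time) as avg_exec_time
from oceanbase.gv$sql_audit
where request_time > (time_to_usec(now()) - 60 * 1000000)   -- last minute, in microseconds
  and request_time < time_to_usec(now())
group by sql_id
order by total_exec_time desc
limit 10;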
Case 3: How to analyze SQL queries that cause an abrupt RT jitter?
Action: In this scenario, you can immediately disable sql_audit after the jitter occurs to ensure that the SQL generating the jitter can be analyzed using (g)v$sql_audit. Follow the steps below:
If RT jitter occurs online, but the RT is not consistently high, consider disabling sql_audit immediately after the jitter occurs (alter system set ob_enable_sql_audit = 0) to ensure the presence of the target SQL in sql_audit.
Use sql_audit to query the top N requests by RT within the time period of the jitter, and then analyze the exceptional SQL (a sketch query is provided after these steps).
Identify the corresponding abnormal requests, analyze their sql_audit records, and troubleshoot the issue. For example, the number of logical reads of a request can be estimated as follows:
Number of logical reads = 2 * ROW_CACHE_HIT + 2 * BLOOM_FILTER_CACHE_HIT + BLOCK_INDEX_CACHE_HIT + BLOCK_CACHE_HIT + DISK_READS
If the records of the jitter have already been evicted from sql_audit, check whether there are slow-query trace logs for the jitter period in the observer log, and analyze the corresponding trace logs.
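As a sketch of the top-N RT query and the logical-read estimate described above, the following statement lists the slowest requests in a recent window and computes their logical reads from the sql_audit cache-hit columns (the five-minute window and the limit of 10 are placeholders; adjust them to cover the jitter period):
select sql_id, trace_id, elapsed_time, queue_time,
       2 * row_cache_hit + 2 * bloom_filter_cache_hit
         + block_index_cache_hit + block_cache_hit + disk_reads as logical_reads
from oceanbase.gv$sql_audit
where request_time > (time_to_usec(now()) - 5 * 60 * 1000000)   -- last five minutes
  and request_time < time_to_usec(now())
order by elapsed_time desc
limit 10;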
In OceanBase Database, a native distributed database system, root cause analysis for faults is complex because many factors may be involved, such as the server environment, parameters, and running load.
During troubleshooting, experts need to collect and analyze extensive information. Therefore, OceanBase Diagnostic Tool (obdiag) is introduced to help efficiently collect information scattered on various nodes.
The OceanBase Diagnostic Tool, abbreviated as obdiag, has the following features:
Currently, obdiag can scan and collect logs, SQL audit records, and process stack information of OceanBase. You can use it to collect diagnostic information with a few clicks regardless of how your OceanBase cluster is deployed, for example, by using OceanBase Cloud Platform (OCP) or OceanBase Deployer (OBD), or manually based on the relevant documentation.
In the future, obdiag will incorporate diagnostic analysis capabilities. On one hand, it will build on the existing data collection capabilities to extract and analyze useful information and generate diagnostic analysis reports. On the other hand, it will integrate the existing OceanBase SQL audit tool, SQL Reviewer, and OceanBase's agile SQL diagnostic tool, SQL Diagnoser, to expand its diagnostic capabilities at the SQL level.
If the cluster to be diagnosed is deployed by OBD, we recommend upgrading OBD to V2.5.0 or later, so that you can run obdiag commands directly through OBD.
Follow the method below to use obdiag:
Install OBD 2.5.0 or later.
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://mirrors.aliyun.com/oceanbase/OceanBase.repo
sudo yum install -y ob-deploy
source /etc/profile.d/obd.sh
Install obdiag: Run the command below to install obdiag on the local server. If you skip this step, obdiag is installed automatically the first time you run an obdiag command through OBD.
obd obdiag deploy
Gather diagnostic information:
obd obdiag gather <gather type> <deploy name> [options]
For the valid values of the gather type parameter, see the obdiag documentation at the link below.
Download obdiag for free:
https://en.oceanbase.com/softwarecenter
https://en.oceanbase.com/docs/obdiag-en
https://github.com/oceanbase/obdiag
When encountering problems with OceanBase, many engineers may feel unsure of where to start. This article introduces the general diagnostic methods for OceanBase, including environmental analysis, log analysis, and performance view analysis. It also recommends the OceanBase Diagnostic Tool, obdiag, to help users collect fault information efficiently.