Diagnosis and Tuning with OceanBase: Methods and Tools


obdiag is a CLI diagnostic tool designed for OceanBase Database. It scans and collects information such as logs, SQL audit records, and process stack information of OceanBase. You may deploy your OceanBase cluster by using OceanBase Cloud Platform (OCP) or OceanBase Deployer (OBD), or manually deploy it based on the OceanBase documentation. Regardless of the deployment mode, you can use obdiag to collect diagnostic information with just a few commands.

obdiag has now been officially open-sourced, and the establishment of a special interest group (SIG) named 'obdiag SIG' has been announced.

The obdiag team has compiled relevant experiences regarding OceanBase diagnosis and tuning and has commenced releasing a series of tutorial articles. This article is the first installment of the series, providing an introduction to the diagnostic methods in OceanBase and the OceanBase Diagnostic Tool (obdiag).


The Secret Recipe for OceanBase Diagnosis

In most scenarios, issues with OceanBase can be located and analyzed using the following three-step approach: environmental analysis, log analysis, and performance view analysis. The secret recipe for diagnosing OceanBase can be summarized as follows:


[Figure: the three-step diagnostic approach: environmental analysis, log analysis, and performance view analysis]


Environmental analysis involves examining the network, disk, memory, CPU, and other system-level information. We won't go into detail on that in this article, beyond the quick sketch below.
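For reference, a quick environmental pass usually relies on standard Linux tools rather than anything OceanBase-specific. A minimal sketch (commands and healthy thresholds depend on your environment):

# CPU load and the busiest processes
top -b -n 1 | head -n 20
# Memory usage in GB
free -g
# Disk capacity per mount point
df -h
# Disk I/O latency and utilization (requires the sysstat package)
iostat -x 1 3
# TCP retransmissions as a rough network-health signal
netstat -s | grep -i retrans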

Now, let's dive into the log analysis and performance view analysis.


Log Analysis

The log files in OceanBase are divided into three main types: observer.log, election.log, and rootservice.log. By default, these log files capture logs at the INFO level and above. Additionally, each type of log generates a corresponding WARNING log file with a .wf suffix (observer.log.wf, election.log.wf, rootservice.log.wf), which exclusively records logs at the WARN level and above.

  • observer.log, observer.log.wf: $work_dir/log under the installation directory of the OBServer node
  • election.log, election.log.wf: $work_dir/log under the installation directory of the OBServer node
  • rootservice.log, rootservice.log.wf: $work_dir/log under the installation directory of the OBServer node


The logs are divided into the following six levels from low to high: DEBUG, TRACE, INFO, WARN, USER_ERR, and ERROR.

The ERROR log level is special. It prints the stack where the log is generated. The symbol table is required to parse it.
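The threshold for which levels are written to disk is controlled by the cluster-level parameter syslog_level (INFO by default in the versions this article targets). A minimal sketch of raising it at runtime:

-- Only write WARN and above from now on; takes effect without a restart
alter system set syslog_level = 'WARN';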

1. Log Format

The log format is as follows; the exact fields may vary slightly depending on the log type and version.

  • [time] log_level [module_name] (file_name:line_no) [thread_id][Y_trace_id0-trace_id1] [lt=last_log_print_time] [dc=dropped_log_count] log_data_
  • [time] log_level [module_name] function_name (file_name:line_no) [thread_id][Y_trace_id0-trace_id1] [lt=last_log_print_time] [dc=dropped_log_count] log_data_
[admin@OceanBase000000000.sqa.ztt /home/admin/oceanbase/log] 
$tail -f -n 1 observer.log
[2016-07-17 14:18:04.845802] INFO  [RPC.OBMYSQL] obsm_handler.cpp:191 [9543][Y0-0] [lt=47] [dc=0] connection close(easy_connection_str(c)="192.168.0.2:56854_-1_0x7fb8a9171b68", version=0, sessid=2147562562, tenant_id=1, server_id=1, is_need_clear_sessid_=true, ret=0) 
[admin@OceanBase000000000.sqa.ztt /home/admin/oceanbase/log] 
$tail -f -n 1 observer.log.wf
[2016-07-17 14:18:28.431351] WARN  [SQL.SESSION] set_conn (ob_basic_session_info.cpp:2568) [8541][YB420AF4005E-52A8CF4E] [lt=16] [dc=0] debug for set_conn(conn=0x7fb8a9171b68, lbt()="0x4efe71 0x818afd 0xe9ea5b 0x721fc8 0x13747bc 0x2636db0 0x2637d68 0x5054e9 0x7fb98705aaa1 0x7fb9852cc93d ", magic_num_=324478056, sessid_=2147562617, version_=0) 
[admin@OceanBase000000000.sqa.ztt /home/admin/oceanbase/log] 
$tail -f -n 1 rootservice.log
[2016-07-17 14:18:53.701463] INFO  [RS] ob_server_table_operator.cpp:345 [8564][Y0-0] [lt=11] [dc=0] svr_status(svr_status="active", display_status=1) 
[admin@OceanBase000000000.sqa.ztt /home/admin/oceanbase/log] 
$tail -f -n 1 rootservice.log.wf
[2016-07-16 02:02:12.847602] WARN  [RS] choose_leader (ob_leader_coordinator.cpp:2067) [8570][YB420AF4005E-4626EDFC] [lt=8] [dc=0] choose leader info with not same candidate num(tenant_id=1005, server="192.168.0.1:2882", info={original_leader_count:0, primary_zone_count:0, cur_leader_count:1, candidate_count:1, in_normal_unit_count:1})

2. Tips for Log Analysis

Pay attention to the logs, especially error logs, and establish a log collection and alerting mechanism based on OceanBase's logging model.
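One practical technique: every log line carries a trace ID (the Y... field), so you can pull all records for a single request across log files. For example, using the trace ID from the samples above:

grep "YB420AF4005E-52A8CF4E" observer.log observer.log.wf rootservice.log rootservice.log.wf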


Performance View Analysis

1. gv$sql_audit

The SQL audit view is a practical feature for SQL performance diagnosis in OceanBase. Prior to OceanBase 4.0, it was named gv$sql_audit; from OceanBase 4.0 onwards, it is gv$ob_sql_audit. This article is based on the earlier versions.

By utilizing the SQL audit, developers and operators can conveniently trace any SQL statement that has been executed on OceanBase, regardless of whether the statement succeeded, because detailed execution information is recorded for every request.

The gv$sql_audit view retrieves sql_audit records from every server, while v$sql_audit retrieves only the records of the server you are connected to. Because you cannot control which machine a proxied connection is routed to, to inspect the sql_audit records of a specific server you must either connect to that server directly or filter gv$sql_audit by the server's IP address and port number.
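For example, to inspect the records of one specific OBServer node from any connection, filter gv$sql_audit by the server's address. A minimal sketch (substitute your own IP address, port, and time window):

select sql_id, elapsed_time, queue_time, execute_time
from oceanbase.gv$sql_audit
where svr_ip = '192.168.0.1' and svr_port = 2882
  and request_time > (time_to_usec(now()) - 60000000)  -- last 60 seconds
order by elapsed_time desc limit 10;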

During this process, the following points should be noted:

  • The sql_audit is maintained in memory, and its data cannot be stored indefinitely. When the memory reaches a certain limit, it triggers an elimination mechanism to ensure that new records can be written. The sql_audit adopts a first in, first out automatic elimination mechanism, which automatically triggers the elimination when the memory reaches the high water mark (90%), until the memory reaches the low water mark (50%).
  • When the number of sql_audit records reaches 9 million, an elimination process will be triggered, continuously eliminating records until there are 5 million left. This is another elimination strategy of sql_audit.
  • sql_audit provides a cluster-level parameter enable_sql_audit and a tenant-level system variable ob_enable_sql_audit. SQL audit takes effect only when both are enabled; otherwise, it is deactivated. Both switches can be turned on as shown below.
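A minimal sketch of enabling both switches (the cluster-level parameter via ALTER SYSTEM, the tenant-level variable via SET GLOBAL in the target tenant):

-- Cluster-level parameter
alter system set enable_sql_audit = true;
-- Tenant-level system variable; run in the tenant to be audited
set global ob_enable_sql_audit = 1;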

2. Performance View Diagnostic Cases

Case 1: How to check if the SQL request volume is evenly distributed?

Action: Query sql_audit for all user SQL executed within a specific time period, aggregate the records by server IP, and then count the SQL queries per server IP in that period to roughly determine the traffic on each server.

select /*+ parallel(15) */ t2.zone, t1.svr_ip, count(*) as QPS,
       avg(t1.elapsed_time), avg(t1.queue_time)
from oceanbase.gv$sql_audit t1, oceanbase.__all_server t2
where t1.svr_ip = t2.svr_ip
  and is_executor_rpc = 0
  and request_time > (time_to_usec(now()) - 1000000)  -- last 1 second
  and request_time < time_to_usec(now())
group by t2.zone, t1.svr_ip
order by t2.zone;


Case 2: How to find the SQL that consumes the most CPU?

Action: Firstly, retrieve all the SQL executed within a specific time period. Then, aggregate the data by SQL_ID and calculate the total execution time for each SQL_ID.

Execution time = ELAPSED_TIME - QUEUE_TIME

Queue time is subtracted because time spent waiting in the queue does not consume CPU; the actual CPU consumption occurs while the plan is acquired and executed. Performance analysis and tuning can then focus on the SQL statements that consume the most CPU, for example with a query like the one below.
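A sketch of such an aggregation, assuming pre-4.0 column names and a 10-minute window (adjust both to your version and workload):

select sql_id, count(*) as executions,
       sum(elapsed_time - queue_time) as total_exec_time,
       avg(elapsed_time - queue_time) as avg_exec_time
from oceanbase.gv$sql_audit
where request_time > (time_to_usec(now()) - 600000000)  -- last 10 minutes
group by sql_id
order by total_exec_time desc
limit 10;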


Case 3: How to analyze SQL queries that cause an abrupt RT jitter?

Action: In this scenario, disable sql_audit immediately after the jitter occurs so that the SQL that caused the jitter can still be analyzed through (g)v$sql_audit. Follow the steps below:

1. If RT jitter occurs online but the RT is not consistently high, disable sql_audit immediately after the jitter occurs (set global ob_enable_sql_audit = 0) to ensure that the target SQL is still present in sql_audit.

2. Use sql_audit to query the top N requests by RT within the time period of the jitter, and then analyze the exceptional SQL.

3. Identify the corresponding abnormal requests, analyze their records in sql_audit, and troubleshoot the issue:

  • Check the retry count (RETRY_CNT). A high count may indicate lock conflicts or leader switches.
  • Check whether the queue time (QUEUE_TIME) is abnormally high.
  • Check the time taken to acquire the execution plan (GET_PLAN_TIME). If it is long, IS_HIT_PLAN is usually 0, indicating a plan cache miss.
  • Check the value of EXECUTE_TIME. If it is too high, troubleshoot in two steps: first, check whether there are long wait events; second, check whether the number of logical reads is abnormally high, which can happen during sudden spikes in load.


Number of logical reads = 2 * ROW_CACHE_HIT + 2 * BLOOM_FILTER_CACHE_HIT + BLOCK_INDEX_CACHE_HIT + BLOCK_CACHE_HIT + DISK_READS
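Putting the checklist together, a hedged example query over the jitter window (pre-4.0 column names; substitute your own time range for the placeholder timestamps):

select sql_id, retry_cnt, queue_time, get_plan_time, is_hit_plan, execute_time,
       2 * row_cache_hit + 2 * bloom_filter_cache_hit
         + block_index_cache_hit + block_cache_hit + disk_reads as logical_reads
from oceanbase.gv$sql_audit
where request_time between time_to_usec('2016-07-17 14:18:00')
                       and time_to_usec('2016-07-17 14:19:00')
order by elapsed_time desc
limit 10;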


If the records of the jitter period have already been eliminated from sql_audit, check whether the observer printed slow-query trace logs during the jitter, and analyze the corresponding trace logs.


obdiag: OceanBase Diagnostic Toolbox

OceanBase Database is a native distributed database system, so root cause analysis for faults is complex: many factors may be involved, such as the server environment, parameters, and running load.

During troubleshooting, experts need to collect and analyze extensive information. Therefore, OceanBase Diagnostic Tool (obdiag) is introduced to help efficiently collect information scattered on various nodes.

The OceanBase Diagnostic Tool, abbreviated as obdiag, has the following features:

  • Easy deployment: You can deploy obdiag by using the RPM package or OBD in a few steps. You can deploy it on an OBServer node or on any server that can connect to the nodes in the OceanBase cluster.
  • Centralized collection: You need to deploy obdiag only on a single server rather than all servers. Then, you can execute collection or analysis commands on the server where obdiag is deployed.


The current functions of obdiag include scanning and collecting logs, SQL audit records, and process stack information of OceanBase. Whether your cluster was deployed by using OceanBase Cloud Platform (OCP) or OceanBase Deployer (OBD), or manually deployed based on the documentation, you can collect diagnostic information with just a few commands.

In the future, obdiag will incorporate diagnostic analysis capabilities. On one hand, it will build on the existing data collection capabilities to better extract and analyze the relevant information and generate diagnostic analysis reports. On the other hand, it will integrate the existing OceanBase SQL audit tool, SQL Reviewer, and OceanBase's agile SQL diagnostic tool, SQL Diagnoser, to expand its diagnostic capabilities at the SQL level.


Install and use obdiag in combination with OBD

If the cluster to be diagnosed was deployed by OBD, we recommend upgrading OBD to V2.5.0 or later. You can then run obdiag commands directly through OBD.

Follow the steps below to install and use obdiag:

1. Installation

Install OBD 2.5.0 or later.

sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://mirrors.aliyun.com/oceanbase/OceanBase.repo
sudo yum install -y ob-deploy
source /etc/profile.d/obd.sh

Install obdiag: Use the command below to install obdiag on the local server. If you skip this step, obdiag is installed automatically the first time you run an obdiag command through OBD.

obd obdiag deploy

2. Use: Syntax of gather commands

Run this command on OBD to collect OceanBase diagnostic information.

obd obdiag gather <gather type> <deploy name> [options]

Valid values of the gather type parameter are as follows:

  • log: collects logs of the specified OceanBase cluster.
  • sysstat: collects information about nodes in the specified OceanBase cluster.
  • clog: collects the clogs of the specified OceanBase cluster.
  • slog: collects the slogs of the specified OceanBase cluster.
  • plan_monitor: collects the execution details of parallel SQL statements with the specified trace ID in the specified OceanBase cluster.
  • stack: collects the stack information of the specified OceanBase cluster.
  • perf: collects the performance information of the specified OceanBase cluster.
  • obproxy_log: collects the logs of the OceanBase Database Proxy (ODP) on which the specified OceanBase cluster depends.
  • all: collects the diagnostic information of the specified OceanBase cluster, including the logs, node information, stack information, and performance information of the cluster.
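For example, to collect observer logs for a deployment within a time window, a sketch might look like the following, where demo is a hypothetical deployment name and --from/--to are the time-window options described in the obdiag documentation:

obd obdiag gather log demo --from "2024-01-01 00:00:00" --to "2024-01-01 01:00:00"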


Install and use obdiag independently

1. Download obdiag

Download obdiag for free:

https://en.oceanbase.com/softwarecenter

2. obdiag documentation

https://en.oceanbase.com/docs/obdiag-en

3. GitHub

https://github.com/oceanbase/obdiag



Summary

When encountering problems with OceanBase, many engineers may feel unsure of where to start. This article introduces the general diagnostic methods for OceanBase, including environmental analysis, log analysis, and performance view analysis. It also recommends the OceanBase Diagnostic Tool, obdiag, to help users collect fault information efficiently.

