A customer encountered a performance failure in its database system where a large number of transactions were congested. The direct cause was that CPU utilization reaches 100% on some OBServer nodes, which means that transactions cannot be processed normally. Based on the analysis of the GV$OB_SQL_AUDIT view, a problematic SQL query with average execution time of 2s was located. Besides, the SQL query used over 70% of the CPU resources on the OBServer node. Then, the cause was determined based on the EXPLAIN plan (query execution plan) and DDL definitions of related tables and views, and the CPU time consumed by the SQL query was reduced to 4 ms through a series of DDL optimization measures. The performance bottleneck problem was completely solved.
Logic of troubleshooting
The troubleshooting logic in this case is as follows:
Locate the top SQL queries by querying the
GV$OB_SQL_AUDITview.Execute the following SQL statement to query the
GV$OB_SQL_AUDITview and locate the problematic SQL queries. For this case, the information to query is the ratio of the CPU utilization of SQL queries (GROUP BY sql_id) on the specified OBServer node (svr_ip) to the total CPU time in 30 minutes before the problem occurred. The formula that calculates the ratio isSUM(execute_time)/(30*60*1000*1000)AS cpu_cnt.SELECT sql_id, COUNT(*) AS executions, SUM(execute_time) AS tot_cpu_time, AVG(execute_time) AS avg_cpu_time, SUM(execute_time)/(30*60*1000*1000) AS cpu_cnt, query_sql FROM oceanbase.GV$OB_SQL_AUDIT WHERE tenant_id= 'mysql001' AND svr_ip='xxx.xxx.xxx.xxx' AND request_time BETWEEN (TIME_TO_USEC(NOW())-30*60*1000*1000) AND TIME_TO_USEC(NOW()) AND is_executor_rpc =0 GROUP BY sql_id HAVING COUNT(*)>1 ORDER BY cpu_cnt DESC LIMIT 10;The result shows that the execution frequency and CPU utilization of
SELECT ... FROM V_TT01 WHERE (COL001 IN (...))SQL queries are relatively high. Their CPU utilization reaches about 70% (80 CPU cores/108 CPU cores) on the same OBServer node.+----------------------------------+------------+--------------+--------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+ | sql_id | executions | tot_cpu_time | avg_cpu_time | cpu_cnt | query_sql | +----------------------------------+------------+--------------+--------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+ | 2705182A6EAB699CEC8E59DA80710B64 | 23,425 | 33,818,182,526 | 1,443,679 | 56 | SELECT ... FROM V_TT01 WHERE (COL001 IN (20017476)) | | 32AB97A0126F566064F84DDDF4936F82 | 7,112 | 7,926.295,919 | 1,114,496 | 13 | SELECT ... FROM V_TT01 WHERE (COL001 IN (20035891,20035892)) | | A5F514E873BE9D1F9A339D0DA7481D69 | 4,046 | 6,053,872,739 | 1,496,261 | 10 | SELECT ... FROM V_TT01 WHERE (COL001 IN (20017892,20007654,20025634)) | | 31FD78420DB07C11C8E3154F1658D237 | 186 | 646,926,207 | 3,478,098 | 1 | SELECT ... FROM V_TT01 WHERE (COL001 IN (20027596,20035891)) | | C48AEE941D985D8DEB66892228D5E845 | 28 | 569,004,505 | 20,321,589 | 1 | SELECT ... | | 101B7B79DFA9AE801BEE4F1A234AD294 | 155 | 440,190,675 | 2,839,940 | 1 | SELECT ... | | 1D0BA376E273B9D622641124D8C59264 | 423 | 429,140,148 | 1,014,516 | 1 | SELECT ... | | 64CF75576816DB5614F3D5B1F35B1472 | 113 | 326,338,159 | 2,887,948 | 1 | SELECT ... | | 23D1C653347BA469396896AD9B20DCA1 | 381 | 309.367.170 | 811,987 | 1 | SELECT ... | | FA4F493FA5CE2DCC64F51CF3754F96C6 | 7 | 289,191,515 | 41,313,074 | 0 | SELECT ... | +------------+----------------------------+--------------+------------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------+ 10 rows in setQuery the
EXPLAINplan of a problematic SQL query and analyze the key points that affect performance.Execute the
EXPLAINstatement to obtain theEXPLAINplan of the first problematic SQL query.================================================================================= |ID|OPERATOR |NAME |EST. ROWS|COST | --------------------------------------------------------------------------------- |0 |EXCHANGE IN REMOTE | |177 |432961| |1 | EXCHANGE OUT REMOTE | |177 |397632| |2 | SUBPLAN SCAN |V_TT01 |177 |397632| |3 | UNION ALL | |177 |397629| |4 | NESTED-LOOP JOIN CARTESIAN | |0 |151990| |5 | TABLE SCAN |TBL1(IDX_TBL1_COL001)|1 |92 | |6 | MATERIAL | |0 |151899| |7 | NESTED-LOOP ANTI JOIN CARTESIAN| |0 |151899| |8 | TABLE SCAN |TT01(IDX_TT01_COL001)|1 |92 | |9 | MATERIAL | |177 |151781| |10| SUBPLAN SCAN |VIEW1 |177 |151780| |11| TABLE SCAN |TBL3(IDX_TBL3_COL170)|177 |151777| |12| NESTED-LOOP JOIN CARTESIAN | |177 |236327| |13| TABLE GET |TBL2 |1 |46 | |14| TABLE SCAN |TBL3(IDX_TBL3_COL170)|177 |235510| ================================================================================= Outputs & filters: ------------------------------------- 5 - output(......), filter(nil), access(......), partitions(p0), is_index_back=true, range_key([TBL1.COL001(0x7f510ed97860)], [TBL1.COL002(0x7f48eec9baa0)]), range(20017476,MIN ; 20017476,MAX), range_cond([TBL1.COL001(0x7f510ed97860) = 20017476(0x7f510edbd750)]) 8 - output(......), filter(nil), access(......), partitions(p0), is_index_back=true, range_key([TT01.COL001(0x7f510ed97570)], [TT01.COL003(0x7f510ed97b50)]), range(20017476,MIN ; 20017476,MAX), range_cond([TT01.COL001(0x7f510ed97570) = 20017476(0x7f510edbcc50)]) 11 - output(......), filter([20017476 = cast(cast(TBL3.COL170(0x7f3f5b51c090), VARCHAR2(20 BYTE))(0x7f3f5b51d6b0), NUMBER(-1, -85))(0x7f3f5b51ce80)(0x7f3f5b51c380)]), access(......), partitions(p0), is_index_back=false, filter_before_indexback[false], range_key([TBL3.COL170(0x7f3f5b51c090)], [TBL3.COL002(0x7f3f5b528e80)]), range(MIN,MIN ; MAX,MAX)always true 13 - output(......), filter(nil), access(......), partitions(p0), is_index_back=false, range_key([TBL2.COL004(0x7f0cb0ff9f20)]), range[20017476 ; 20017476], range_cond([TBL2.COL004(0x7f0cb0ff9f20) = 20017476(0x7f0cb1001170)]) 14 - output(......), filter([20017476 = cast(cast(TBL3.COL170(0x7f0cb0ffa210), VARCHAR2(20 BYTE))(0x7f0cb1002fa0), NUMBER(-1, -85))(0x7f0cb1002770)(0x7f0cb1001c70)]), access(......), partitions(p0), is_index_back=true, filter_before_indexback[true], range_key([TBL3.COL170(0x7f0cb0ffa210)], [TBL3.COL002(0x7f3f5b582c10)]), range(MIN,MIN ; MAX,MAX)always trueThe preceding
EXPALINplan shows that the total execution cost of the SQL query is estimated to be 432961. The largest portion, which is 397629, is consumed by Operator 3UNION ALL. Both branches of theUNION ALLoperator contain theNESTED-LOOP JOINoperator that is executed at the estimated cost of 151990 and 236327 respectively. In the first branch (operators 4 to 11) of theUNION ALLoperator, the cost is mainly consumed by Operator 11, that is, theTABLE SCANoperation based on theTBL3(IDX_TBL3_COL170)index. In the second branch (operators 12 to 14), the cost is mainly consumed by Operator 14, which is also theTABLE SCANoperation based on theTBL3(IDX_TBL3_COL170)index. Therefore, we can see that the high cost of operators 11 and 14 may be caused by theTBL3index. That is, this index may be a performance bottleneck in this case.Then, we can analyze the
Outputs & filtersinformation of operators 11 and 14.range(MIN,MIN ; MAX,MAX)always trueindicates that index matching was not performed, whilefilter([20017476 = cast(cast(TBL3.COL170(0x7f0cb0ffa210), VARCHAR2(20 BYTE))(0x7f0cb1002fa0), NUMBER(-1, -85))(0x7f0cb1002770)(0x7f0cb1001c70)])indicates that data type conversion occurred during table access. We can infer that the data types of associated fields may be inconsistent and data type conversion has been performed, resulting in an index matching failure onTBL3.COL170. Make further analysis based on this conclusion and DDL definitions.11 - output(......), filter([20017476 = cast(cast(TBL3.COL170(0x7f3f5b51c090), VARCHAR2(20 BYTE))(0x7f3f5b51d6b0), NUMBER(-1, -85))(0x7f3f5b51ce80)(0x7f3f5b51c380)]), access(......), partitions(p0), is_index_back=false, filter_before_indexback[false], range_key([TBL3.COL170(0x7f3f5b51c090)], [TBL3.COL002(0x7f3f5b528e80)]), range(MIN,MIN ; MAX,MAX)always true 14 - output(......), filter([20017476 = cast(cast(TBL3.COL170(0x7f0cb0ffa210), VARCHAR2(20 BYTE))(0x7f0cb1002fa0), NUMBER(-1, -85))(0x7f0cb1002770)(0x7f0cb1001c70)]), access(......), partitions(p0), is_index_back=true, filter_before_indexback[true], range_key([TBL3.COL170(0x7f0cb0ffa210)], [TBL3.COL002(0x7f3f5b582c10)]), range(MIN,MIN ; MAX,MAX)always trueDetermine the cause of the performance issue by analyzing the DDL definitions of related tables and views.
The definition of the
V_TT01view shows thatTBL3in the first branch ofUNION ALLexists in theNOT EXISTSsubquery. According to theEXPLAINplan, the relevant subquery is rewritten asNESTED-LOOP ANTI JOIN. In addition,TO_NUMBERconversion has occurred in theV_TT01view, whereCOL001andCOL004are passed to the(IDX_TBL3_COL170)index ofTBL3throughCOL170.Analysis on the
TBL3table definition shows that theTBL3(IDX_TBL3_COL170)index is defined upon theCOL170field whose data type isVARCHAR2(20). This field is the field that theV_TT01view associates with other tables. The definition of theV_TT01view shows that, in the first branch ofUNION ALL, the association corresponding to Operator 11 is"tbl3"."col170" = "tt01"."col001" = 20017476, and the association corresponding to Operator 14 in the second branch is"tbl3"."col170" = "tbl2"."col004" = 20017476. Due to the inconsistent data types of the associated fields,TO_NUMBERconversion occurs, that is, theVARCHAR2type of theCOL170field is forcibly converted toNUMBER. This is consistent with thefilterinformation of operators 11 and 14 inOutputs & filters.Based on the DDL definition and
EXPALINplan, we can see that during the execution of this type of SQL queries, the unreasonable forcibleTO_NUMBERconversion in theV_TT01view causes a decrease in index performance, resulting in a high cost estimate for operators 11 and 14. TheTO_NUMBERconversion must be removed./* The definition of the problematic SQL query. */ SELECT ...... FROM v_tt01 WHERE COL001 IN (20017476); /* The definition of the V_TT01 view. */ CREATE VIEW "V_TT01_OLD" AS (SELECT To_number("LOCAL_CXZX"."tt01"."col001") AS "COL001", ... ... FROM "LOCAL_CXZX"."tt01", "LOCAL_CXZX"."tbl1" WHERE ( "LOCAL_CXZX"."tt01"."col001" = "LOCAL_CXZX"."tbl1"."col001" ) AND NOT EXISTS((SELECT 1 FROM "LOCAL_CXZX"."tbl3" WHERE ( "LOCAL_CXZX"."tt01"."col001" = "LOCAL_CXZX"."tbl3"."col170" )))) UNION ALL (SELECT To_number("LOCAL_CXZX"."tbl2"."col004") AS "COL001", ... ... FROM "LOCAL_CXZX"."tbl2", "LOCAL_CXZX"."tbl3" "TBL3" WHERE ( "LOCAL_CXZX"."tbl2"."col004" = "tbl3"."col170" )) ; /* The definition of the TBL1 table. */ CREATE TABLE "LOCAL_BDZGBGYY"."TBL1" ( "COL001" NUMBER(16) NOT NULL , // 5: To_number("tt01"."col001") = "tbl1"."col001" = 20017476, To_number(NUMBER(16)) = NUMBER(20) ... ... PRIMARY KEY ("COL002") ) ; CREATE INDEX "LOCAL_BDZGBGYY"."IDX_TBL1_COL001" on "LOCAL_BDZGBGYY"."TBL1" ( "COL001" ) GLOBAL ; /* The definition of the TT01 table. */ CREATE TABLE "EINP_BASICINFO"."TT01" ( ... ... "COL001" NUMBER(20) NOT NULL, // 8: To_number("tt01"."col001") = "tbl1"."col001" = 20017476, To_number(NUMBER(16)) = NUMBER(20) ... ... PRIMARY KEY ("COL003") ) ; CREATE INDEX "EINP_BASICINFO"."IDX_TT01_COL001" on "EINP_BASICINFO"."TT01" ( "COL001" ) GLOBAL ; /* The definition of the TBL3 table.*/ CREATE TABLE "LOCAL_BDZGBGYY"."TBL3" ( "COL170" VARCHAR2(20) NOT NULL, // 11: "tbl3"."col170" = "tt01"."col001" = 20017476. VARCHAR2(20) is changed to NUMBER (20). ... ... // 14: "zb10"."col170" = "tbl2"."col004" = 20017476. VARCHAR2(20) is changed to NUMBER (20). PRIMARY KEY ("COL002") ) ; CREATE INDEX "LOCAL_BDZGBGYY"."IDX_TBL3_COL170" on "LOCAL_BDZGBGYY"."TBL3" ( "COL170" ) GLOBAL ; /* The definition of the TBL2 table. */ CREATE TABLE "SICP4_BASICINFOCENTER_EINP"."TBL2" ( "COL004" NUMBER(20) NOT NULL, // 13: COL004 = 20017476 ... .... PRIMARY KEY ("COL004") ) ;
Logic of optimization
The DDL optimization logic for tables and views in this case is as follows:
- Remove the
TO_NUMBERconversion in the view. The optimizer chooses a better join order, reducing the SQL response time (RT) from 2s to 150 ms. (The input data type of the view isNUMBER, andCOL001in the view is also of theNUMBERtype in the original table. Therefore, the forcibleTO_NUMBERconversion in the view is redundant.) - The data types of all associated fields in the tables and views are defined as
NUMBER(20)to avoid data type conversion. Then, index matching can be performed forTBL3(IDX_TBL3_COL170). The SQL RT is further reduced to 20 ms. (Since the data type ofTBL3.COL170isVARCHAR, when the query conditionCOL001 IN ( value,value,...)is passed toTBL3.COL170, data type conversion fromVARCHARtoNUMBERoccurs, and the index onTBL3.COL170cannot be used for matching.) - Reconstruct the partitions of the four associated tables in the view and add the tables to the same table group. This view serves as the business query entry. Previously, SQL queries accessing this view occupied all computing resources on a small part of OBServer nodes. To prevent this hot view from becoming a bottleneck again, reconstruct the partitions to distribute the query pressure on the view to more OBServer nodes. That is, hash partition the four associated tables in the view by taking the key for association as the partitioning key, and add the tables to the same table group to avoid cross-node queries. After reconstruction, the SQL RT is further reduced to 4 ms.
The optimized SQL query has the following EXPLAIN plan. PX PARTITION ITERATOR indicates that UNION ALL is executed only in a partition. In this case, the cost of indexing on TBL3 is low and the performance is greatly improved.
====================================================================
|ID|OPERATOR |NAME |EST. ROWS|COST|
--------------------------------------------------------------------
|0 |PX COORDINATOR | |3 |477 |
|1 | EXCHANGE OUT DISTR |:EX10000 |3 |405 |
|2 | SUBPLAN SCAN |V_TT01 |3 |405 |
|3 | PX PARTITION ITERATOR| |3 |405 |
|4 | UNION ALL | |3 |405 |
|5 | NESTED-LOOP JOIN | |2 |273 |
|6 | MERGE ANTI JOIN | |2 |183 |
|7 | TABLE SCAN |TT01(INDX_TT01_COL001) |2 |137 |
|8 | SUBPLAN SCAN |VIEW1 |1 |46 |
|9 | TABLE SCAN |TBL3(INDEX_TBL3_COL170)|1 |46 |
|10| TABLE SCAN |TBL1(INDEX_TBL1_COL001)|1 |44 |
|11| NESTED-LOOP JOIN | |1 |132 |
|12| TABLE SCAN |TBL3(INDEX_TBL3_COL170)|1 |92 |
|13| TABLE GET |TBL2 |1 |40 |
====================================================================