A performance issue occurred in a customer's database system, causing significant transaction congestion. The root cause was a surge in CPU usage on some OceanBase Database server nodes, reaching 100%, which led to transaction processing failures. By analyzing the execution audit information in the GV$OB_SQL_AUDIT view, it was identified that one problematic SQL statement had an average execution time of 2 seconds, consuming over 70% of the total CPU resources on the node. By examining the execution plan and DDL definitions of the tables and views, the SQL statement's CPU usage was reduced to 4 ms through DDL optimization, eliminating the performance bottleneck.
Troubleshooting approach
The troubleshooting and optimization approach for this case is as follows:
Use the
GV$OB_SQL_AUDITview to identify the top SQL queries.Execute the following SQL statement to query the SQL (
GROUP BY sql_id) and the total CPU usage (SUM(execute_time)/(30*60*1000*1000) AS cpu_cnt) for a specified OBServer node (svr_ip) within a specified time period (30 minutes) and identify the SQL query that consumes the highest CPU usage:SELECT sql_id, COUNT(*) AS executions, SUM(execute_time) AS tot_cpu_time, AVG(execute_time) AS avg_cpu_time, SUM(execute_time)/(30*60*1000*1000) AS cpu_cnt, query_sql FROM oceanbase.GV$OB_SQL_AUDIT WHERE tenant_id= 'mysql001' AND svr_ip='xxx.xxx.xxx.xxx' AND request_time BETWEEN (TIME_TO_USEC(NOW())-30*60*1000*1000) AND TIME_TO_USEC(NOW()) AND is_executor_rpc =0 GROUP BY sql_id HAVING COUNT(*)>1 ORDER BY cpu_cnt DESC LIMIT 10;Analysis of the result:
- The execution frequency and CPU usage of the
SELECT ... FROM V_TT01 WHERE (COL001 IN (...))SQL query are high, around 70% (80 c/108 c), and it is concentrated on the same OBServer node.
+----------------------------------+------------+--------------+--------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+ | sql_id | executions | tot_cpu_time | avg_cpu_time | cpu_cnt | query_sql | +----------------------------------+------------+--------------+--------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+ | 2705182A6EAB699CEC8E59DA80710B64 | 23,425 | 33,818,182,526 | 1,443,679 | 56 | SELECT ... FROM V_TT01 WHERE (COL001 IN (20017476)) | | 32AB97A0126F566064F84DDDF4936F82 | 7,112 | 7,926.295,919 | 1,114,496 | 13 | SELECT ... FROM V_TT01 WHERE (COL001 IN (20035891,20035892)) | | A5F514E873BE9D1F9A339D0DA7481D69 | 4,046 | 6,053,872,739 | 1,496,261 | 10 | SELECT ... FROM V_TT01 WHERE (COL001 IN (20017892,20007654,20025634)) | | 31FD78420DB07C11C8E3154F1658D237 | 186 | 646,926,207 | 3,478,098 | 1 | SELECT ... FROM V_TT01 WHERE (COL001 IN (20027596,20035891)) | | C48AEE941D985D8DEB66892228D5E845 | 28 | 569,004,505 | 20,321,589 | 1 | SELECT ... | | 101B7B79DFA9AE801BEE4F1A234AD294 | 155 | 440,190,675 | 2,839,940 | 1 | SELECT ... | | 1D0BA376E273B9D622641124D8C59264 | 423 | 429,140,148 | 1,014,516 | 1 | SELECT ... | | 64CF75576816DB5614F3D5B1F35B1472 | 113 | 326,338,159 | 2,887,948 | 1 | SELECT ... | | 23D1C653347BA469396896AD9B20DCA1 | 381 | 309.367.170 | 811,987 | 1 | SELECT ... | | FA4F493FA5CE2DCC64F51CF3754F96C6 | 7 | 289,191,515 | 41,313,074 | 0 | SELECT ... | +------------+----------------------------+--------------+------------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------+ 10 rows in set- The execution frequency and CPU usage of the
Analyze the execution plan (Explain) to locate the performance bottleneck.
Obtain the execution plan for the problematic SQL query using the
EXPLAINcommand:================================================================================= |ID|OPERATOR |NAME |EST. ROWS|COST | --------------------------------------------------------------------------------- |0 |EXCHANGE IN REMOTE | |177 |432961| |1 | EXCHANGE OUT REMOTE | |177 |397632| |2 | SUBPLAN SCAN |V_TT01 |177 |397632| |3 | UNION ALL | |177 |397629| |4 | NESTED-LOOP JOIN CARTESIAN | |0 |151990| |5 | TABLE SCAN |TBL1(IDX_TBL1_COL001)|1 |92 | |6 | MATERIAL | |0 |151899| |7 | NESTED-LOOP ANTI JOIN CARTESIAN| |0 |151899| |8 | TABLE SCAN |TT01(IDX_TT01_COL001)|1 |92 | |9 | MATERIAL | |177 |151781| |10| SUBPLAN SCAN |VIEW1 |177 |151780| |11| TABLE SCAN |TBL3(IDX_TBL3_COL170)|177 |151777| |12| NESTED-LOOP JOIN CARTESIAN | |177 |236327| |13| TABLE GET |TBL2 |1 |46 | |14| TABLE SCAN |TBL3(IDX_TBL3_COL170)|177 |235510| ================================================================================= Outputs & filters: ------------------------------------- 5 - output(......), filter(nil), access(......), partitions(p0), is_index_back=true, range_key([TBL1.COL001(0x7f510ed97860)], [TBL1.COL002(0x7f48eec9baa0)]), range(20017476,MIN ; 20017476,MAX), range_cond([TBL1.COL001(0x7f510ed97860) = 20017476(0x7f510edbd750)]) 8 - output(......), filter(nil), access(......), partitions(p0), is_index_back=true, range_key([TT01.COL001(0x7f510ed97570)], [TT01.COL003(0x7f510ed97b50)]), range(20017476,MIN ; 20017476,MAX), range_cond([TT01.COL001(0x7f510ed97570) = 20017476(0x7f510edbcc50)]) 11 - output(......), filter([20017476 = cast(cast(TBL3.COL170(0x7f3f5b51c090), VARCHAR2(20 BYTE))(0x7f3f5b51d6b0), NUMBER(-1, -85))(0x7f3f5b51ce80)(0x7f3f5b51c380)]), access(......), partitions(p0), is_index_back=false, filter_before_indexback[false], range_key([TBL3.COL170(0x7f3f5b51c090)], [TBL3.COL002(0x7f3f5b528e80)]), range(MIN,MIN ; MAX,MAX)always true 13 - output(......), filter(nil), access(......), partitions(p0), is_index_back=false, range_key([TBL2.COL004(0x7f0cb0ff9f20)]), range[20017476 ; 20017476], range_cond([TBL2.COL004(0x7f0cb0ff9f20) = 20017476(0x7f0cb1001170)]) 14 - output(......), filter([20017476 = cast(cast(TBL3.COL170(0x7f0cb0ffa210), VARCHAR2(20 BYTE))(0x7f0cb1002fa0), NUMBER(-1, -85))(0x7f0cb1002770)(0x7f0cb1001c70)]), access(......), partitions(p0), is_index_back=true, filter_before_indexback[true], range_key([TBL3.COL170(0x7f0cb0ffa210)], [TBL3.COL002(0x7f3f5b582c10)]), range(MIN,MIN ; MAX,MAX)always trueAnalysis of the result:
Based on the
EXPLAINinformation, the total estimated cost for this SQL query is 432,961, with the majority of the cost (397,629) attributed to the 3rd operatorUNION ALL. Both branches of theUNION ALLoperator contain aNESTED-LOOP JOIN, with estimated costs of 151,990 and 236,327 respectively. The first branch (from operator 4 to 11) primarily incurs costs at operator 11, which involves aTABLE SCANusing the indexTBL3(IDX_TBL3_COL170).The second branch (from operator 12 to 14) primarily incurs costs at operator 14, which also involves a
TABLE SCANusing the same indexTBL3(IDX_TBL3_COL170). This suggests that the high costs at operators 11 and 14 may be due to the index onTBL3, indicating a performance bottleneck in this case.Observing the execution plan, analyze the Outputs & filters information for operators 11 and 14. The
range(MIN,MIN ; MAX,MAX)always trueindicates that no index matching occurred during the SQL execution. Thefilter([20017476 = cast(cast(TBL3.COL170(0x7f0cb0ffa210), VARCHAR2(20 BYTE))(0x7f0cb1002fa0), NUMBER(-1, -85))(0x7f0cb1002770)(0x7f0cb1001c70)])indicates that data type conversion occurred during access. This suggests that data type conversion may have been performed due to inconsistent data types in the associated fields, leading to the index onTBL3.COL170not being used. Further analysis based on the DDL definitions is needed.
11 - output(......), filter([20017476 = cast(cast(TBL3.COL170(0x7f3f5b51c090), VARCHAR2(20 BYTE))(0x7f3f5b51d6b0), NUMBER(-1, -85))(0x7f3f5b51ce80)(0x7f3f5b51c380)]), access(......), partitions(p0), is_index_back=false, filter_before_indexback[false], range_key([TBL3.COL170(0x7f3f5b51c090)], [TBL3.COL002(0x7f3f5b528e80)]), range(MIN,MIN ; MAX,MAX)always true 14 - output(......), filter([20017476 = cast(cast(TBL3.COL170(0x7f0cb0ffa210), VARCHAR2(20 BYTE))(0x7f0cb1002fa0), NUMBER(-1, -85))(0x7f0cb1002770)(0x7f0cb1001c70)]), access(......), partitions(p0), is_index_back=true, filter_before_indexback[true], range_key([TBL3.COL170(0x7f0cb0ffa210)], [TBL3.COL002(0x7f3f5b582c10)]), range(MIN,MIN ; MAX,MAX)always trueKey findings:
- Total cost: The cost of the
UNION ALLoperator (operator 3) accounts for 92% of the total execution plan cost (397,629 / 432,961). - Index failure: The index
IDX_TBL3_COL170on theTBL3table did not take effect, leading to a full table scan (TABLE SCAN). - Type conversion issue: The
COL170field is of theVARCHAR2data type, which does not match theNUMBERdata type of the associated fields, leading to forced type conversion and index failure.
Analyze the DDL definitions of the related tables and views to determine the cause of the performance issue.
First, from the definition of the view
V_TT01, it is known that the first branch of theUNION ALLoperator containstbl3in theNOT EXISTSsubquery. Based on theEXPLAINinformation, the subquery has been rewritten to aNESTED-LOOP ANTI JOIN. Additionally, the viewV_TT01has undergone aTO_NUMBERconversion, where thecol001andcol004columns will be passed to the index(IDX_TBL3_COL170)onTBL3throughcol170.Next, analyze the definition of the
TBL3table. The indexTBL3(IDX_TBL3_COL170)is defined on theCOL170field, which has a data type ofVARCHAR2(20). This field is the one used to associate the viewV_TT01with other tables. From the definition of the viewV_TT01, it is known that in the first branch of theUNION ALLoperator, operator 11 corresponds to the association"tbl3"."col170" = "tt01"."col001" = 20017476, and in the second branch, operator 14 corresponds to the association"tbl3"."col170" = "tbl2"."col004" = 20017476. Due to inconsistent data types in the association fields, aTO_NUMBERconversion occurred, forcibly converting theVARCHAR2type of theCOL170field toNUMBER. This aligns with thefilterinformation in the Outputs & filters of operators 11 and 14.
Based on the DDL definitions and
EXPLAINinformation, it is determined that theTO_NUMBERforced conversion in the viewV_TT01during the execution of this type of SQL query is unreasonable, leading to a decline in index performance and resulting in high costs for operators 11 and 14. TheTO_NUMBERconversion should be removed./* Definition of the problematic SQL query */ SELECT ...... FROM v_tt01 WHERE COL001 IN (20017476); /* Definition of the view V_TT01 */ CREATE VIEW "V_TT01_OLD" AS (SELECT To_number("LOCAL_CXZX"."tt01"."col001") AS "COL001", ... ... FROM "LOCAL_CXZX"."tt01", "LOCAL_CXZX"."tbl1" WHERE ( "LOCAL_CXZX"."tt01"."col001" = "LOCAL_CXZX"."tbl1"."col001" ) AND NOT EXISTS((SELECT 1 FROM "LOCAL_CXZX"."tbl3" WHERE ( "LOCAL_CXZX"."tt01"."col001" = "LOCAL_CXZX"."tbl3"."col170" )))) UNION ALL (SELECT To_number("LOCAL_CXZX"."tbl2"."col004") AS "COL001", ... ... FROM "LOCAL_CXZX"."tbl2", "LOCAL_CXZX"."tbl3" "TBL3" WHERE ( "LOCAL_CXZX"."tbl2"."col004" = "tbl3"."col170" )) ; /* Definition of the TBL1 table */ CREATE TABLE "LOCAL_BDZGBGYY"."TBL1" ( "COL001" NUMBER(16) NOT NULL , // 5: To_number("tt01"."col001") = "tbl1"."col001" = 20017476, To_number(NUMBER(16)) = NUMBER(20) ... ... PRIMARY KEY ("COL002") ) ; CREATE INDEX "LOCAL_BDZGBGYY"."IDX_TBL1_COL001" on "LOCAL_BDZGBGYY"."TBL1" ( "COL001" ) GLOBAL ; /* Definition of the TT01 table */ CREATE TABLE "EINP_BASICINFO"."TT01" ( ... ... "COL001" NUMBER(20) NOT NULL, // 8: To_number("tt01"."col001") = "tbl1"."col001" = 20017476, To_number(NUMBER(16)) = NUMBER(20) ... ... PRIMARY KEY ("COL003") ) ; CREATE INDEX "EINP_BASICINFO"."IDX_TT01_COL001" on "EINP_BASICINFO"."TT01" ( "COL001" ) GLOBAL ; /* Definition of the TBL3 table */ CREATE TABLE "LOCAL_BDZGBGYY"."TBL3" ( "COL170" VARCHAR2(20) NOT NULL, // 11: "tbl3"."col170" = "tt01"."col001" = 20017476, VARCHAR2(20) modified to NUMBER (20) ... ... // 14: "zb10"."col170" = "tbl2"."col004" = 20017476, VARCHAR2(20) modified to NUMBER (20) PRIMARY KEY ("COL002") ) ; CREATE INDEX "LOCAL_BDZGBGYY"."IDX_TBL3_COL170" on "LOCAL_BDZGBGYY"."TBL3" ( "COL170" ) GLOBAL ; /* Definition of the TBL2 table */ CREATE TABLE "SICP4_BASICINFOCENTER_EINP"."TBL2" ( "COL004" NUMBER(20) NOT NULL, // 13: COL004 = 20017476 ... .... PRIMARY KEY ("COL004") );Key table field type comparison:
Table/fieldData typeAssociation conditionDescriptionTT01.COL001 NUMBER(20) tt01.COL001 = tbl3.COL170 Forged conversion to To_number in the view causes type mismatch TBL3.COL170 VARCHAR2(20) tbl3.COL170 = tt01.COL001 VARCHAR2 type cannot be directly associated with the NUMBER type TBL2.COL004 NUMBER(20) tbl2.COL004 = tbl3.COL170 Same type mismatch issue as TBL3.COL170
Optimization approach
The DDL optimization approach for tables and views in this case is as follows:
Eliminate unnecessary type conversions.
- Problem: The
TO_NUMBERconversion in the viewV_TT01caused type mismatches in the join fields, leading to index invalidation. - Optimization: Use the native field type and remove
TO_NUMBER:
-- Optimized view definition (without type conversion) CREATE VIEW "V_TT01_NEW" AS ( SELECT "tt01"."col001" AS "COL001", ... FROM "tt01", "tbl1" WHERE "tt01"."col001" = "tbl1"."col001" AND NOT EXISTS ( SELECT 1 FROM "tbl3" WHERE "tt01"."col001" = "tbl3"."col170" ) UNION ALL ( SELECT "tbl2"."col004" AS "COL001", ... FROM "tbl2", "tbl3" WHERE "tbl2"."col004" = "tbl3"."col170" ) );- Result: The SQL execution time was reduced from 2 seconds to 150 ms, and the optimizer chose a better join order.
- Problem: The
Standardize field data types.
- Problem:
TBL3.COL170is of typeVARCHAR2(20), while the associated fields COL001 and COL004 are of typeNUMBER, leading to type conversion and index invalidation. - Optimization:
-- Change the field type of TBL3 to NUMBER(20) ALTER TABLE "TBL3" MODIFY ("COL170" NUMBER(20)); -- Rebuild the index CREATE INDEX "IDX_TBL3_COL170" ON "TBL3" ("COL170");- Result: The data types of the associated fields in the tables and views were standardized to
NUMBER(20), avoiding type conversion. This allowedTBL3(IDX_TBL3_COL170)to perform index matching, and the index became effective. The SQL execution time was further reduced to 20 ms.
- Problem:
Partitioning and table group transformation.
- Problem: The hot view
V_TT01caused excessively high CPU load on a single node. - Optimization plan:
- Hash partitioning: Hash partition the associated tables based on the
COL001/COL004/COL170fields. - Table group binding: Add the associated tables to the same table group to avoid cross-node join queries.
- Hash partitioning: Hash partition the associated tables based on the
-- Example: Hash partition TBL3 ALTER TABLE "TBL3" PARTITION BY HASH("COL170") PARTITIONS 8; -- Add tables to a table group ALTER TABLE "TBL3" SET TABLEGROUP = 'group1';- Result: The query load was distributed across multiple nodes, reducing the SQL execution time to 4 ms.
- Problem: The hot view
Optimized execution plan:
====================================================================
|ID|OPERATOR |NAME |EST. ROWS|COST|
--------------------------------------------------------------------
|0 |PX COORDINATOR | |3 |477 |
|1 | EXCHANGE OUT DISTR |:EX10000 |3 |405 |
|2 | SUBPLAN SCAN |V_TT01 |3 |405 |
|3 | PX PARTITION ITERATOR| |3 |405 |
|4 | UNION ALL | |3 |405 |
|5 | NESTED-LOOP JOIN | |2 |273 |
|6 | MERGE ANTI JOIN | |2 |183 |
|7 | TABLE SCAN |TT01(INDX_TT01_COL001) |2 |137 |
|8 | SUBPLAN SCAN |VIEW1 |1 |46 |
|9 | TABLE SCAN |TBL3(INDEX_TBL3_COL170)|1 |46 |
|10| TABLE SCAN |TBL1(INDEX_TBL1_COL001)|1 |44 |
|11| NESTED-LOOP JOIN | |1 |132 |
|12| TABLE SCAN |TBL3(INDEX_TBL3_COL170)|1 |92 |
|13| TABLE GET |TBL2 |1 |40 |
====================================================================
Key improvements:
- The
PX PARTITION ITERATORindicates that theUNION ALLoperation is executed only within partitions. - The index
TBL3.INDEX_TBL3_COL170effectively matches, avoiding full-table scans. The cost of using the index onTBL3is very low, significantly improving performance.
References
For more information about the views mentioned in this topic, see the corresponding view documentation:
