SQL tuning practice with EXPLAIN

2024-08-12 09:32:41  Updated

A customer encountered a performance failure in its database system where a large number of transactions were congested. The direct cause was that CPU utilization reaches 100% on some OBServer nodes, which means that transactions cannot be processed normally. Based on the analysis of the GV$OB_SQL_AUDIT view, a problematic SQL query with average execution time of 2s was located. Besides, the SQL query used over 70% of the CPU resources on the OBServer node. Then, the cause was determined based on the EXPLAIN plan (query execution plan) and DDL definitions of related tables and views, and the CPU time consumed by the SQL query was reduced to 4 ms through a series of DDL optimization measures. The performance bottleneck problem was completely solved.

Logic of troubleshooting

The troubleshooting logic in this case is as follows:

  1. Locate the top SQL queries by querying the GV$OB_SQL_AUDIT view.

    Execute the following SQL statement to query the GV$OB_SQL_AUDIT view and locate the problematic SQL queries. For this case, the information to query is the ratio of the CPU utilization of SQL queries (GROUP BY sql_id) on the specified OBServer node (svr_ip) to the total CPU time in 30 minutes before the problem occurred. The formula that calculates the ratio is SUM(execute_time)/(30*60*1000*1000)AS cpu_cnt.

    SELECT sql_id,
           COUNT(*)               AS executions,
           SUM(execute_time)      AS tot_cpu_time,
           AVG(execute_time)      AS avg_cpu_time,
           SUM(execute_time)/(30*60*1000*1000)     AS cpu_cnt,
           query_sql
        FROM oceanbase.GV$OB_SQL_AUDIT
        WHERE tenant_id= 'mysql001'
            AND svr_ip='xxx.xxx.xxx.xxx'
            AND request_time BETWEEN (TIME_TO_USEC(NOW())-30*60*1000*1000) AND TIME_TO_USEC(NOW())
            AND is_executor_rpc =0
            GROUP BY sql_id
            HAVING COUNT(*)>1
            ORDER BY cpu_cnt DESC LIMIT 10;
    

    The result shows that the execution frequency and CPU utilization of SELECT ... FROM V_TT01 WHERE (COL001 IN (...)) SQL queries are relatively high. Their CPU utilization reaches about 70% (80 CPU cores/108 CPU cores) on the same OBServer node.

    +----------------------------------+------------+--------------+--------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
    |       sql_id                     | executions | tot_cpu_time | avg_cpu_time | cpu_cnt | query_sql      |
    +----------------------------------+------------+--------------+--------------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
    | 2705182A6EAB699CEC8E59DA80710B64 |   23,425 |   33,818,182,526  |      1,443,679 |      56 | SELECT ... FROM V_TT01 WHERE (COL001 IN (20017476)) |
    | 32AB97A0126F566064F84DDDF4936F82 |    7,112 |     7,926.295,919 |      1,114,496 |      13 | SELECT ... FROM V_TT01 WHERE (COL001 IN (20035891,20035892)) |
    | A5F514E873BE9D1F9A339D0DA7481D69 |    4,046 |     6,053,872,739 |      1,496,261 |      10 | SELECT ... FROM V_TT01 WHERE (COL001 IN (20017892,20007654,20025634)) |
    | 31FD78420DB07C11C8E3154F1658D237 |      186 |       646,926,207 |      3,478,098 |       1 | SELECT ... FROM V_TT01 WHERE (COL001 IN (20027596,20035891))  |
    | C48AEE941D985D8DEB66892228D5E845 |       28 |       569,004,505 |     20,321,589 |       1 | SELECT ... |
    | 101B7B79DFA9AE801BEE4F1A234AD294 |      155 |       440,190,675 |      2,839,940 |       1 | SELECT ... |
    | 1D0BA376E273B9D622641124D8C59264 |      423 |       429,140,148 |      1,014,516 |       1 | SELECT ... |
    | 64CF75576816DB5614F3D5B1F35B1472 |      113 |       326,338,159 |      2,887,948 |       1 | SELECT ... |
    | 23D1C653347BA469396896AD9B20DCA1 |      381 |       309.367.170 |        811,987 |       1 | SELECT ... |
    | FA4F493FA5CE2DCC64F51CF3754F96C6 |        7 |       289,191,515 |     41,313,074 |       0 | SELECT ... |
    +------------+----------------------------+--------------+------------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
    10 rows in set
    
    
  2. Query the EXPLAIN plan of a problematic SQL query and analyze the key points that affect performance.

    Execute the EXPLAIN statement to obtain the EXPLAIN plan of the first problematic SQL query.

    =================================================================================
    |ID|OPERATOR                             |NAME                 |EST. ROWS|COST  |
    ---------------------------------------------------------------------------------
    |0 |EXCHANGE IN REMOTE                   |                     |177      |432961|
    |1 | EXCHANGE OUT REMOTE                 |                     |177      |397632|
    |2 |  SUBPLAN SCAN                       |V_TT01               |177      |397632|
    |3 |   UNION ALL                         |                     |177      |397629|
    |4 |    NESTED-LOOP JOIN CARTESIAN       |                     |0        |151990|
    |5 |     TABLE SCAN                      |TBL1(IDX_TBL1_COL001)|1        |92    |   
    |6 |     MATERIAL                        |                     |0        |151899|
    |7 |      NESTED-LOOP ANTI JOIN CARTESIAN|                     |0        |151899|
    |8 |       TABLE SCAN                    |TT01(IDX_TT01_COL001)|1        |92    |
    |9 |       MATERIAL                      |                     |177      |151781|
    |10|        SUBPLAN SCAN                 |VIEW1                |177      |151780|
    |11|         TABLE SCAN                  |TBL3(IDX_TBL3_COL170)|177      |151777|
    |12|    NESTED-LOOP JOIN CARTESIAN       |                     |177      |236327|
    |13|     TABLE GET                       |TBL2                 |1        |46    |
    |14|     TABLE SCAN                      |TBL3(IDX_TBL3_COL170)|177      |235510|
    =================================================================================
    
    Outputs & filters:
    -------------------------------------
    
      5 - output(......), filter(nil),
          access(......), partitions(p0),
          is_index_back=true,
          range_key([TBL1.COL001(0x7f510ed97860)], [TBL1.COL002(0x7f48eec9baa0)]), range(20017476,MIN ; 20017476,MAX),
          range_cond([TBL1.COL001(0x7f510ed97860) = 20017476(0x7f510edbd750)])
    
      8 - output(......), filter(nil),
          access(......), partitions(p0),
          is_index_back=true,
          range_key([TT01.COL001(0x7f510ed97570)], [TT01.COL003(0x7f510ed97b50)]), range(20017476,MIN ; 20017476,MAX),
          range_cond([TT01.COL001(0x7f510ed97570) = 20017476(0x7f510edbcc50)])
    
     11 - output(......), filter([20017476 = cast(cast(TBL3.COL170(0x7f3f5b51c090), VARCHAR2(20 BYTE))(0x7f3f5b51d6b0), NUMBER(-1, -85))(0x7f3f5b51ce80)(0x7f3f5b51c380)]),
          access(......), partitions(p0),
          is_index_back=false, filter_before_indexback[false],   
          range_key([TBL3.COL170(0x7f3f5b51c090)], [TBL3.COL002(0x7f3f5b528e80)]), range(MIN,MIN ; MAX,MAX)always true  
    
     13 - output(......), filter(nil),
          access(......), partitions(p0),
          is_index_back=false,
          range_key([TBL2.COL004(0x7f0cb0ff9f20)]), range[20017476 ; 20017476],
          range_cond([TBL2.COL004(0x7f0cb0ff9f20) = 20017476(0x7f0cb1001170)])
    
     14 - output(......), filter([20017476 = cast(cast(TBL3.COL170(0x7f0cb0ffa210), VARCHAR2(20 BYTE))(0x7f0cb1002fa0), NUMBER(-1, -85))(0x7f0cb1002770)(0x7f0cb1001c70)]),
          access(......), partitions(p0),
          is_index_back=true, filter_before_indexback[true],
          range_key([TBL3.COL170(0x7f0cb0ffa210)], [TBL3.COL002(0x7f3f5b582c10)]), range(MIN,MIN ; MAX,MAX)always true
    

    The preceding EXPALIN plan shows that the total execution cost of the SQL query is estimated to be 432961. The largest portion, which is 397629, is consumed by Operator 3 UNION ALL. Both branches of the UNION ALL operator contain the NESTED-LOOP JOIN operator that is executed at the estimated cost of 151990 and 236327 respectively. In the first branch (operators 4 to 11) of the UNION ALL operator, the cost is mainly consumed by Operator 11, that is, the TABLE SCAN operation based on the TBL3(IDX_TBL3_COL170) index. In the second branch (operators 12 to 14), the cost is mainly consumed by Operator 14, which is also the TABLE SCAN operation based on the TBL3(IDX_TBL3_COL170) index. Therefore, we can see that the high cost of operators 11 and 14 may be caused by the TBL3 index. That is, this index may be a performance bottleneck in this case.

    Then, we can analyze the Outputs & filters information of operators 11 and 14. range(MIN,MIN ; MAX,MAX)always true indicates that index matching was not performed, while filter([20017476 = cast(cast(TBL3.COL170(0x7f0cb0ffa210), VARCHAR2(20 BYTE))(0x7f0cb1002fa0), NUMBER(-1, -85))(0x7f0cb1002770)(0x7f0cb1001c70)]) indicates that data type conversion occurred during table access. We can infer that the data types of associated fields may be inconsistent and data type conversion has been performed, resulting in an index matching failure on TBL3.COL170. Make further analysis based on this conclusion and DDL definitions.

     11 - output(......), filter([20017476 = cast(cast(TBL3.COL170(0x7f3f5b51c090), VARCHAR2(20 BYTE))(0x7f3f5b51d6b0), NUMBER(-1, -85))(0x7f3f5b51ce80)(0x7f3f5b51c380)]),
          access(......), partitions(p0),
          is_index_back=false, filter_before_indexback[false],   
          range_key([TBL3.COL170(0x7f3f5b51c090)], [TBL3.COL002(0x7f3f5b528e80)]), range(MIN,MIN ; MAX,MAX)always true
    
     14 - output(......), filter([20017476 = cast(cast(TBL3.COL170(0x7f0cb0ffa210), VARCHAR2(20 BYTE))(0x7f0cb1002fa0), NUMBER(-1, -85))(0x7f0cb1002770)(0x7f0cb1001c70)]),
          access(......), partitions(p0),
          is_index_back=true, filter_before_indexback[true],
          range_key([TBL3.COL170(0x7f0cb0ffa210)], [TBL3.COL002(0x7f3f5b582c10)]), range(MIN,MIN ; MAX,MAX)always true
    
  3. Determine the cause of the performance issue by analyzing the DDL definitions of related tables and views.

    The definition of the V_TT01 view shows that TBL3 in the first branch of UNION ALL exists in the NOT EXISTS subquery. According to the EXPLAIN plan, the relevant subquery is rewritten as NESTED-LOOP ANTI JOIN. In addition, TO_NUMBER conversion has occurred in the V_TT01 view, where COL001 and COL004 are passed to the (IDX_TBL3_COL170) index of TBL3 through COL170.

    Analysis on the TBL3 table definition shows that the TBL3(IDX_TBL3_COL170) index is defined upon the COL170 field whose data type is VARCHAR2(20). This field is the field that the V_TT01 view associates with other tables. The definition of the V_TT01 view shows that, in the first branch of UNION ALL, the association corresponding to Operator 11 is "tbl3"."col170" = "tt01"."col001" = 20017476, and the association corresponding to Operator 14 in the second branch is "tbl3"."col170" = "tbl2"."col004" = 20017476. Due to the inconsistent data types of the associated fields, TO_NUMBER conversion occurs, that is, the VARCHAR2 type of the COL170 field is forcibly converted to NUMBER. This is consistent with the filter information of operators 11 and 14 in Outputs & filters.

    Based on the DDL definition and EXPALIN plan, we can see that during the execution of this type of SQL queries, the unreasonable forcible TO_NUMBER conversion in the V_TT01 view causes a decrease in index performance, resulting in a high cost estimate for operators 11 and 14. The TO_NUMBER conversion must be removed.

    /* The definition of the problematic SQL query. */
    SELECT ......
        FROM   v_tt01
        WHERE  COL001 IN (20017476);
    
    /* The definition of the V_TT01 view. */
    CREATE VIEW "V_TT01_OLD"
    AS
      (SELECT  
              To_number("LOCAL_CXZX"."tt01"."col001") AS "COL001",   
              ... ...
       FROM   "LOCAL_CXZX"."tt01",
              "LOCAL_CXZX"."tbl1"
       WHERE  ( "LOCAL_CXZX"."tt01"."col001" = "LOCAL_CXZX"."tbl1"."col001" )
              AND NOT EXISTS((SELECT 1
                              FROM   "LOCAL_CXZX"."tbl3"
                              WHERE  ( "LOCAL_CXZX"."tt01"."col001" =
                                       "LOCAL_CXZX"."tbl3"."col170"
                                     ))))
    
      UNION ALL
    
      (SELECT  
              To_number("LOCAL_CXZX"."tbl2"."col004") AS "COL001",   
              ... ...
       FROM   "LOCAL_CXZX"."tbl2",
              "LOCAL_CXZX"."tbl3" "TBL3"
       WHERE  ( "LOCAL_CXZX"."tbl2"."col004" = "tbl3"."col170" ))
    
       ;
    
    
    /* The definition of the TBL1 table. */
    CREATE TABLE "LOCAL_BDZGBGYY"."TBL1" (
      "COL001" NUMBER(16)  NOT NULL ,    // 5: To_number("tt01"."col001") = "tbl1"."col001" = 20017476,  To_number(NUMBER(16)) = NUMBER(20)
      ... ...
      PRIMARY KEY ("COL002")
    ) ;
    
    CREATE INDEX "LOCAL_BDZGBGYY"."IDX_TBL1_COL001" on "LOCAL_BDZGBGYY"."TBL1" (
     "COL001"
    ) GLOBAL ;  
    
    
    /* The definition of the TT01 table. */
    CREATE TABLE "EINP_BASICINFO"."TT01" (
    ... ...
      "COL001" NUMBER(20)  NOT NULL,     // 8: To_number("tt01"."col001") = "tbl1"."col001" = 20017476,  To_number(NUMBER(16)) = NUMBER(20)  
    ... ...
      PRIMARY KEY ("COL003")  
    ) ;
    
    CREATE INDEX "EINP_BASICINFO"."IDX_TT01_COL001" on "EINP_BASICINFO"."TT01" (
     "COL001"
    ) GLOBAL ;  
    
    
    /* The definition of the TBL3 table.*/
    CREATE TABLE "LOCAL_BDZGBGYY"."TBL3" (
      "COL170" VARCHAR2(20) NOT NULL,      // 11: "tbl3"."col170" = "tt01"."col001" = 20017476. VARCHAR2(20) is changed to NUMBER (20).  
      ... ...                              // 14: "zb10"."col170" = "tbl2"."col004" = 20017476. VARCHAR2(20) is changed to NUMBER (20).
      PRIMARY KEY ("COL002")
    ) ;
    
    CREATE INDEX "LOCAL_BDZGBGYY"."IDX_TBL3_COL170" on "LOCAL_BDZGBGYY"."TBL3" (
     "COL170"
    ) GLOBAL ;
    
    /* The definition of the TBL2 table. */
    CREATE TABLE "SICP4_BASICINFOCENTER_EINP"."TBL2" (
      "COL004" NUMBER(20)  NOT NULL,      //  13: COL004 = 20017476
      ... ....
      PRIMARY KEY ("COL004")
    ) ;
    

Logic of optimization

The DDL optimization logic for tables and views in this case is as follows:

  • Remove the TO_NUMBER conversion in the view. The optimizer chooses a better join order, reducing the SQL response time (RT) from 2s to 150 ms. (The input data type of the view is NUMBER, and COL001 in the view is also of the NUMBER type in the original table. Therefore, the forcible TO_NUMBER conversion in the view is redundant.)
  • The data types of all associated fields in the tables and views are defined as NUMBER(20) to avoid data type conversion. Then, index matching can be performed for TBL3(IDX_TBL3_COL170). The SQL RT is further reduced to 20 ms. (Since the data type of TBL3.COL170 is VARCHAR, when the query condition COL001 IN ( value,value,...) is passed to TBL3.COL170, data type conversion from VARCHAR to NUMBER occurs, and the index on TBL3.COL170 cannot be used for matching.)
  • Reconstruct the partitions of the four associated tables in the view and add the tables to the same table group. This view serves as the business query entry. Previously, SQL queries accessing this view occupied all computing resources on a small part of OBServer nodes. To prevent this hot view from becoming a bottleneck again, reconstruct the partitions to distribute the query pressure on the view to more OBServer nodes. That is, hash partition the four associated tables in the view by taking the key for association as the partitioning key, and add the tables to the same table group to avoid cross-node queries. After reconstruction, the SQL RT is further reduced to 4 ms.

The optimized SQL query has the following EXPLAIN plan. PX PARTITION ITERATOR indicates that UNION ALL is executed only in a partition. In this case, the cost of indexing on TBL3 is low and the performance is greatly improved.

====================================================================
|ID|OPERATOR                |NAME                   |EST. ROWS|COST|
--------------------------------------------------------------------
|0 |PX COORDINATOR          |                       |3        |477 |
|1 | EXCHANGE OUT DISTR     |:EX10000               |3        |405 |
|2 |  SUBPLAN SCAN          |V_TT01                 |3        |405 |
|3 |   PX PARTITION ITERATOR|                       |3        |405 |
|4 |    UNION ALL           |                       |3        |405 |
|5 |     NESTED-LOOP JOIN   |                       |2        |273 |
|6 |      MERGE ANTI JOIN   |                       |2        |183 |
|7 |       TABLE SCAN       |TT01(INDX_TT01_COL001) |2        |137 |
|8 |       SUBPLAN SCAN     |VIEW1                  |1        |46  |
|9 |        TABLE SCAN      |TBL3(INDEX_TBL3_COL170)|1        |46  |
|10|      TABLE SCAN        |TBL1(INDEX_TBL1_COL001)|1        |44  |
|11|     NESTED-LOOP JOIN   |                       |1        |132 |
|12|      TABLE SCAN        |TBL3(INDEX_TBL3_COL170)|1        |92  |
|13|      TABLE GET         |TBL2                   |1        |40  |
====================================================================

Contact Us