You have learned about statistics and row estimation mechanisms in the previous topics and may wonder how you can verify whether statistics are used or not expired based on the execution plan in business scenarios. You may also want to know which method is used for row estimation based on execution plan analysis. This topic gives you the answer by showing you some examples.
Determine the row estimation method used in the execution plan
First, the following example creates the t_part table that is HASH partitioned by the c1 column into four partitions, and then inserts 10,000 rows into the table.
obclient [TEST]> create table t_part(c1 int, c2 int, c3 int) partition by hash(c1) partitions 4;
Query OK, 0 rows affected (0.164 sec)
obclient [TEST]> insert into t_part select mod(level,500),mod(level,1000),level from dual connect by level<=10000;
Query OK, 10000 rows affected (0.186 sec)
Records: 10000 Duplicates: 0 Warnings: 0
obclient [TEST]> commit;
Query OK, 0 rows affected (0.040 sec)
If no statistics are collected, the optimizer attempts to estimate the number of rows based on dynamic sampling. estimation method:[DYNAMIC SAMPLING FULL] in Optimization Info indicates that row estimation for the current plan is performed based on dynamic sampling. set optimizer_dynamic_sampling = 0; indicates that dynamic sampling is disabled, and the optimizer uses the default statistics for row estimation.
-- Perform dynamic sampling when no statistics have been collected.
obclient [TEST]> explain extended select * from t_part where c1 > 1;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Query Plan |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ============================================================= |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| ------------------------------------------------------------- |
| |0 |PX COORDINATOR | |9960 |29434 | |
| |1 |└─EXCHANGE OUT DISTR |:EX10000|9960 |20651 | |
| |2 | └─PX PARTITION ITERATOR| |9960 |907 | |
| |3 | └─TABLE FULL SCAN |T_PART |9960 |907 | |
| ============================================================= |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([INTERNAL_FUNCTION(T_PART.C1(0x7f5fd0c20c90), T_PART.C2(0x7f5fd0c22010), T_PART.C3(0x7f5fd0c22330))(0x7f5fd0cab130)]), filter(nil), rowset=256 |
| 1 - output([INTERNAL_FUNCTION(T_PART.C1(0x7f5fd0c20c90), T_PART.C2(0x7f5fd0c22010), T_PART.C3(0x7f5fd0c22330))(0x7f5fd0cab130)]), filter(nil), rowset=256 |
| dop=1 |
| 2 - output([T_PART.C1(0x7f5fd0c20c90)], [T_PART.C2(0x7f5fd0c22010)], [T_PART.C3(0x7f5fd0c22330)]), filter(nil), rowset=256 |
| force partition granule |
| 3 - output([T_PART.C1(0x7f5fd0c20c90)], [T_PART.C2(0x7f5fd0c22010)], [T_PART.C3(0x7f5fd0c22330)]), filter([T_PART.C1(0x7f5fd0c20c90) > 1(0x7f5fd0c21600)]), rowset=256 |
| access([T_PART.C1(0x7f5fd0c20c90)], [T_PART.C2(0x7f5fd0c22010)], [T_PART.C3(0x7f5fd0c22330)]), partitions(p[0-3]) |
| is_index_back=false, is_global_index=false, filter_before_indexback[false], |
| range_key([T_PART. __pk_increment(0x7f5fd0c22a20)]), range(MIN ; MAX)always true |
| Used Hint: |
| ------------------------------------- |
| /*+ |
| |
| */ |
| Qb name trace: |
| ------------------------------------- |
| stmt_id:0, stmt_type:T_EXPLAIN |
| stmt_id:1, SEL$1 |
| Outline Data: |
| ------------------------------------- |
| /*+ |
| BEGIN_OUTLINE_DATA |
| FULL(@"SEL$1" "TEST"."T_PART"@"SEL$1") |
| PARALLEL( AUTO ) |
| OPTIMIZER_FEATURES_ENABLE('4.2.4.0') |
| END_OUTLINE_DATA |
| */ |
| Optimization Info: |
| ------------------------------------- |
| T_PART: |
| table_rows:10000 |
| physical_range_rows:10000 |
| logical_range_rows:10000 |
| index_back_rows:0 |
| output_rows:9960 |
| table_dop:1 |
| dop_method:Auto DOP |
| avaiable_index_name:[T_PART] |
| stats info:[version=0, is_locked=0, is_expired=0] |
| dynamic sampling level:1 |
| estimation method:[DYNAMIC SAMPLING FULL] |
| Plan Type: |
| DISTRIBUTED |
| Note: |
| Degree of Parallelism is 1 because of Auto DOP |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
55 rows in set
obclient [TEST]> set optimizer_dynamic_sampling = 0;
Query OK, 0 rows affected (0.035 sec)
-- Use default statistics when dynamic sampling is disabled and no statistics have been collected.
obclient [TEST]> explain extended select * from t_part where c1 > 1;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Query Plan |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ============================================================= |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| ------------------------------------------------------------- |
| |0 |PX COORDINATOR | |1 |16 | |
| |1 |└─EXCHANGE OUT DISTR |:EX10000|1 |16 | |
| |2 | └─PX PARTITION ITERATOR| |1 |16 | |
| |3 | └─TABLE FULL SCAN |T_PART |1 |16 | |
| ============================================================= |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([INTERNAL_FUNCTION(T_PART.C1(0x7f60d6420c90), T_PART.C2(0x7f60d6422010), T_PART.C3(0x7f60d6422330))(0x7f60d64ab020)]), filter(nil), rowset=16 |
| 1 - output([INTERNAL_FUNCTION(T_PART.C1(0x7f60d6420c90), T_PART.C2(0x7f60d6422010), T_PART.C3(0x7f60d6422330))(0x7f60d64ab020)]), filter(nil), rowset=16 |
| dop=1 |
| 2 - output([T_PART.C1(0x7f60d6420c90)], [T_PART.C2(0x7f60d6422010)], [T_PART.C3(0x7f60d6422330)]), filter(nil), rowset=16 |
| force partition granule |
| 3 - output([T_PART.C1(0x7f60d6420c90)], [T_PART.C2(0x7f60d6422010)], [T_PART.C3(0x7f60d6422330)]), filter([T_PART.C1(0x7f60d6420c90) > 1(0x7f60d6421600)]), rowset=16 |
| access([T_PART.C1(0x7f60d6420c90)], [T_PART.C2(0x7f60d6422010)], [T_PART.C3(0x7f60d6422330)]), partitions(p[0-3]) |
| is_index_back=false, is_global_index=false, filter_before_indexback[false], |
| range_key([T_PART. __pk_increment(0x7f60d6422a20)]), range(MIN ; MAX)always true |
| Used Hint: |
| ------------------------------------- |
| /*+ |
| |
| */ |
| Qb name trace: |
| ------------------------------------- |
| stmt_id:0, stmt_type:T_EXPLAIN |
| stmt_id:1, SEL$1 |
| Outline Data: |
| ------------------------------------- |
| /*+ |
| BEGIN_OUTLINE_DATA |
| FULL(@"SEL$1" "TEST"."T_PART"@"SEL$1") |
| PARALLEL( AUTO ) |
| OPTIMIZER_FEATURES_ENABLE('4.2.4.0') |
| END_OUTLINE_DATA |
| */ |
| Optimization Info: |
| ------------------------------------- |
| T_PART: |
| table_rows:1 |
| physical_range_rows:1 |
| logical_range_rows:1 |
| index_back_rows:0 |
| output_rows:0 |
| table_dop:1 |
| dop_method:Auto DOP |
| avaiable_index_name:[T_PART] |
| stats info:[version=0, is_locked=0, is_expired=0] |
| dynamic sampling level:0 |
| estimation method:[DEFAULT] |
| Plan Type: |
| DISTRIBUTED |
| Note: |
| Degree of Parallelism is 1 because of Auto DOP |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
55 rows in set
The predicate of the query in the previous example is modified to obtain the query explain extended select * from t_part where c1 = 1. After the modification, one partition is retained after partition pruning, and the execution plan is as follows. Although no statistics have been collected, row estimation provides a more accurate number, which is 24. estimation method:[DEFAULT, STORAGE] in Optimization Info shows that row estimation is performed based on the default statistics and storage layer. How is the number 24 obtained? First, the query range of the plan is (MIN ; MAX), and the number of rows returned at the storage layer based on the query range is 2400. However, the selectivity of the predicate c1 = 1 can be calculated only based on default statistics due to the lack of collected statistics. The default number of distinct values (NDV) of the column is 100, so the estimated selectivity of c1 = 1 is 1/100 and the estimated number of rows is 2400 * 1/100 = 24. In this scenario, you can obtain a relatively accurate number of rows based on the query range extracted from the predicate. In scenarios where no query ranges can be extracted from the predicate, the optimizer can calculate the selectivity only based on default statistics, leading to errors in the final result of row estimation.
obclient [TEST]> explain extended select * from t_part where c1 = 1;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Query Plan |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ================================================= |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| ------------------------------------------------- |
| |0 |TABLE FULL SCAN|T_PART|24 |104 | |
| ================================================= |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([T_PART.C1(0x7f61b0a20c90)], [T_PART.C2(0x7f61b0a22010)], [T_PART.C3(0x7f61b0a22330)]), filter([T_PART.C1(0x7f61b0a20c90) = 1(0x7f61b0a21600)]), rowset=256 |
| access([T_PART.C1(0x7f61b0a20c90)], [T_PART.C2(0x7f61b0a22010)], [T_PART.C3(0x7f61b0a22330)]), partitions(p1) |
| is_index_back=false, is_global_index=false, filter_before_indexback[false], |
| range_key([T_PART. __pk_increment(0x7f61b0a22a20)]), range(MIN ; MAX)always true |
| Used Hint: |
| ------------------------------------- |
| /*+ |
| |
| */ |
| Qb name trace: |
| ------------------------------------- |
| stmt_id:0, stmt_type:T_EXPLAIN |
| stmt_id:1, SEL$1 |
| Outline Data: |
| ------------------------------------- |
| /*+ |
| BEGIN_OUTLINE_DATA |
| FULL(@"SEL$1" "TEST"."T_PART"@"SEL$1") |
| PARALLEL( AUTO ) |
| OPTIMIZER_FEATURES_ENABLE('4.2.4.0') |
| END_OUTLINE_DATA |
| */ |
| Optimization Info: |
| ------------------------------------- |
| T_PART: |
| table_rows:2400 |
| physical_range_rows:2400 |
| logical_range_rows:2400 |
| index_back_rows:0 |
| output_rows:23 |
| table_dop:1 |
| dop_method:Auto DOP |
| avaiable_index_name:[T_PART] |
| stats info:[version=0, is_locked=0, is_expired=0] |
| dynamic sampling level:0 |
| estimation method:[DEFAULT, STORAGE] |
| Plan Type: |
| LOCAL |
| Note: |
| Degree of Parallelism is 1 because of Auto DOP |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
47 rows in set (0.038 sec)| Query Plan |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ================================================= |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| ------------------------------------------------- |
| |0 |TABLE FULL SCAN|T_PART|1 |4 | |
| ================================================= |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([T_PART.C1(0x7f5ffd620c90)], [T_PART.C2(0x7f5ffd622010)], [T_PART.C3(0x7f5ffd622330)]), filter([T_PART.C1(0x7f5ffd620c90) = 1(0x7f5ffd621600)]), rowset=16 |
| access([T_PART.C1(0x7f5ffd620c90)], [T_PART.C2(0x7f5ffd622010)], [T_PART.C3(0x7f5ffd622330)]), partitions(p1) |
| is_index_back=false, is_global_index=false, filter_before_indexback[false], |
| range_key([T_PART. __pk_increment(0x7f5ffd622a20)]), range(MIN ; MAX)always true |
| Used Hint: |
| ------------------------------------- |
| /*+ |
| |
| */ |
| Qb name trace: |
| ------------------------------------- |
| stmt_id:0, stmt_type:T_EXPLAIN |
| stmt_id:1, SEL$1 |
| Outline Data: |
| ------------------------------------- |
| /*+ |
| BEGIN_OUTLINE_DATA |
| FULL(@"SEL$1" "TEST"."T_PART"@"SEL$1") |
| PARALLEL( AUTO ) |
| OPTIMIZER_FEATURES_ENABLE('4.2.4.0') |
| END_OUTLINE_DATA |
| */ |
| Optimization Info: |
| ------------------------------------- |
| T_PART: |
| table_rows:1 |
| physical_range_rows:1 |
| logical_range_rows:1 |
| index_back_rows:0 |
| output_rows:1 |
| table_dop:1 |
| dop_method:Auto DOP |
| avaiable_index_name:[T_PART] |
| stats info:[version=0, is_locked=0, is_expired=0] |
| dynamic sampling level:0 |
| estimation method:[DEFAULT, STORAGE] |
| Plan Type: |
| LOCAL |
| Note: |
| Degree of Parallelism is 1 because of Auto DOP |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
47 rows in set
The following example collects statistics on the T_PART table first, and then obtains the execution plans of explain extended select * from t_part where c1 = 1 and explain extended select * from t_part where c1 > 1. You can see that the estimated number of rows in the execution plan for explain extended select * from t_part where c1 = 1 is accurate, and est_method:[OPTIMIZER STATISTICS] in Optimization Info shows that row estimation for the current plan is performed based on the collected statistics. The estimated number of rows in the execution plan for explain extended select * from t_part where c1 > 1 is also accurate, because the selectivity calculated based on the collected statistics is more accurate and the row count returned by the storage layer also contributes to the accurate final result of row estimation.
obclient [TEST]>call dbms_stats.gather_table_stats('TEST','T_PART');
Query OK, 0 rows affected (0.19 sec)
obclient [TEST]> explain extended select * from t_part where c1 > 1;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Query Plan |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ============================================================= |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| ------------------------------------------------------------- |
| |0 |PX COORDINATOR | |9980 |19368 | |
| |1 |└─EXCHANGE OUT DISTR |:EX10000|9980 |13681 | |
| |2 | └─PX PARTITION ITERATOR| |9980 |908 | |
| |3 | └─TABLE FULL SCAN |T_PART |9980 |908 | |
| ============================================================= |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([INTERNAL_FUNCTION(T_PART.C1(0x7f603da20c90), T_PART.C2(0x7f603da22010), T_PART.C3(0x7f603da22330))(0x7f603daab0a0)]), filter(nil), rowset=256 |
| 1 - output([INTERNAL_FUNCTION(T_PART.C1(0x7f603da20c90), T_PART.C2(0x7f603da22010), T_PART.C3(0x7f603da22330))(0x7f603daab0a0)]), filter(nil), rowset=256 |
| dop=1 |
| 2 - output([T_PART.C1(0x7f603da20c90)], [T_PART.C2(0x7f603da22010)], [T_PART.C3(0x7f603da22330)]), filter(nil), rowset=256 |
| force partition granule |
| 3 - output([T_PART.C1(0x7f603da20c90)], [T_PART.C2(0x7f603da22010)], [T_PART.C3(0x7f603da22330)]), filter([T_PART.C1(0x7f603da20c90) > 1(0x7f603da21600)]), rowset=256 |
| access([T_PART.C1(0x7f603da20c90)], [T_PART.C2(0x7f603da22010)], [T_PART.C3(0x7f603da22330)]), partitions(p[0-3]) |
| is_index_back=false, is_global_index=false, filter_before_indexback[false], |
| range_key([T_PART. __pk_increment(0x7f603da22a20)]), range(MIN ; MAX)always true |
| Used Hint: |
| ------------------------------------- |
| /*+ |
| |
| */ |
| Qb name trace: |
| ------------------------------------- |
| stmt_id:0, stmt_type:T_EXPLAIN |
| stmt_id:1, SEL$1 |
| Outline Data: |
| ------------------------------------- |
| /*+ |
| BEGIN_OUTLINE_DATA |
| FULL(@"SEL$1" "TEST"."T_PART"@"SEL$1") |
| PARALLEL( AUTO ) |
| OPTIMIZER_FEATURES_ENABLE('4.2.4.0') |
| END_OUTLINE_DATA |
| */ |
| Optimization Info: |
| ------------------------------------- |
| T_PART: |
| table_rows:10000 |
| physical_range_rows:10000 |
| logical_range_rows:10000 |
| index_back_rows:0 |
| output_rows:9979 |
| table_dop:1 |
| dop_method:Auto DOP |
| avaiable_index_name:[T_PART] |
| stats info:[version=1720598262361958, is_locked=0, is_expired=0] |
| dynamic sampling level:0 |
| estimation method:[OPTIMIZER STATISTICS] |
| Plan Type: |
| DISTRIBUTED |
| Note: |
| Degree of Parallelism is 1 because of Auto DOP |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
55 rows in set (0.037 sec)
obclient [TEST]> explain extended select * from t_part where c1 = 1;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Query Plan |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ================================================= |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| ------------------------------------------------- |
| |0 |TABLE FULL SCAN|T_PART|20 |104 | |
| ================================================= |
| Outputs & filters: |
| ------------------------------------- |
| 0 - output([T_PART.C1(0x7f5fc6e20c90)], [T_PART.C2(0x7f5fc6e22010)], [T_PART.C3(0x7f5fc6e22330)]), filter([T_PART.C1(0x7f5fc6e20c90) = 1(0x7f5fc6e21600)]), rowset=256 |
| access([T_PART.C1(0x7f5fc6e20c90)], [T_PART.C2(0x7f5fc6e22010)], [T_PART.C3(0x7f5fc6e22330)]), partitions(p1) |
| is_index_back=false, is_global_index=false, filter_before_indexback[false], |
| range_key([T_PART. __pk_increment(0x7f5fc6e22a20)]), range(MIN ; MAX)always true |
| Used Hint: |
| ------------------------------------- |
| /*+ |
| |
| */ |
| Qb name trace: |
| ------------------------------------- |
| stmt_id:0, stmt_type:T_EXPLAIN |
| stmt_id:1, SEL$1 |
| Outline Data: |
| ------------------------------------- |
| /*+ |
| BEGIN_OUTLINE_DATA |
| FULL(@"SEL$1" "TEST"." T_PART"@"SEL$1") |
| PARALLEL( AUTO ) |
| OPTIMIZER_FEATURES_ENABLE('4.2.4.0') |
| END_OUTLINE_DATA |
| */ |
| Optimization Info: |
| ------------------------------------- |
| T_PART: |
| table_rows:2400 |
| physical_range_rows:2400 |
| logical_range_rows:2400 |
| index_back_rows:0 |
| output_rows:20 |
| table_dop:1 |
| dop_method:Auto DOP |
| avaiable_index_name:[T_PART] |
| stats info:[version=1720598262361958, is_locked=0, is_expired=0] |
| dynamic sampling level:0 |
| estimation method:[OPTIMIZER STATISTICS, STORAGE] |
| Plan Type: |
| LOCAL |
| Note: |
| Degree of Parallelism is 1 because of Auto DOP |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
47 rows in set
The preceding examples are simple ones for you to better understand the application of statistics and row estimation in the OceanBase Database optimizer. In real business scenarios, you can check the row estimation result of poor execution plans. If the result is inaccurate, check the statistics by referring to the preceding examples. If the statistics are inaccurate, you can try to recollect the statistics and then check whether there is a change in the plan. Execution plans in real business scenarios are more complex, and selectivity calculation for complex predicates is also more challenging. Therefore, it is necessary to analyze issues based on actual situations.
Control row estimation by using system variables
Cardinality estimation by the optimizer is the key to accurate selection of execution plans. Inaccurate cardinality estimation can lead to selection of an undesirable execution plan. Cardinality estimation relies on a series of assumptions, such as predicate independence and join containment assumptions.
However, your data models may not be exactly consistent with the assumptions. Consequently, the optimizer may generate inaccurate estimation results, resulting in the selection of an undesirable execution plan. To solve this problem, you can specify the cardinality_estimation_model system variable to control specific assumptions. This way, the optimizer can generate multiple cardinality estimation results to optimize the selection of execution plans.