Use statistical information and row estimation|V4.3.1| docs|Distributed Database

Use statistical information and row estimation

Last Updated：2024-06-28 05:30:31 Updated

In the previous topics, we discussed the statistical information and row estimation mechanism of OceanBase Database’s optimizer. In practical scenarios, you may wonder how to determine if the plan is utilizing statistical information, if the statistical information has expired, or how to analyze which method is currently being used for row estimation through the plan. This topic gives you the answer with specific examples.

In the following example, a table named t_part is created and HASH partitioned by the c1 column into four partitions. Then, 10,000 rows of data are inserted.

OceanBase(TEST@TEST)>create table t_part(c1 int, c2 int, c3 int) partition by hash(c1) partitions 4;
Query OK, 0 rows affected (0.12 sec)

OceanBase(TEST@TEST)>insert into t_part select mod(level,500),mod(level,1000),level from dual connect by level<=10000;
Query OK, 10000 rows affected (0.18 sec)
Records: 10000  Duplicates: 0  Warnings: 0

When statistical information is not collected, the plan for query Q3 (select * from t_part where c1 > 1) is as follows. It can be observed that the estimated number of rows for the TABLE SCAN operator is 1, which is obviously incorrect. By observing the Optimization Info with est_method:default_stat, it can be seen that the current plan is using default statistical information, indicating that the statistical information on the current table is not collected. Additionally, due to the inability of this partitioned table to perform partition pruning, only default statistical information can be used for row estimation. In this scenario, it is necessary to collect statistical information.

OceanBase(TEST@TEST)>explain extended select * from t_part where c1 > 1\G
*************************** 1. row ***************************
Query Plan: Plan signature: 13918455839680143507
====================================================
|ID|OPERATOR               |NAME    |EST. ROWS|COST|
----------------------------------------------------
|0 |PX COORDINATOR         |        |1        |8   |
|1 | EXCHANGE OUT DISTR    |:EX10000|1        |7   |
|2 |  PX PARTITION ITERATOR|        |1        |7   |
|3 |   TABLE SCAN          |T_PART  |1        |7   |
====================================================

Optimization Info:
-------------------------------------
T_PART:table_rows:1, physical_range_rows:1, logical_range_rows:1, index_back_rows:0, output_rows:0, est_method:default_stat, optimization_method=cost_based, avaiable_index_name[T_PART]

The following example modifies the predicate in query Q3 to obtain query Q4 (select * from t_part where c1 = 1). After the modification, one partition is retained for scanning after partition pruning, and the execution plan is as follows. Although no statistics are collected, row estimation provides a more accurate number, which is 24. By observing the Optimization Info with est_method:local_storage, it can be seen that row estimation is performed based on the storage layer. How is the number 24 obtained? First, the query range of the plan is (MIN ; MAX), and the number of rows returned at the storage layer based on the query range is 2400. However, the selectivity of the predicate c1 = 1 can only be calculated based on default statistics due to lack of available statistics. The default column NDV is 100, so the estimated selectivity of c1 = 1 is 1/100 and the estimated number of rows is 2400 * 1/100 = 24. In this scenario, you can obtain a relatively accurate number of rows based on the query range extracted from the predicate. In scenarios where no query ranges can be extracted from the predicate, the optimizer can only calculate the selectivity based on default statistics, leading to errors in the final result of row estimation.

OceanBase(TEST@TEST)>explain extended select * from t_part where c1 = 1\G
*************************** 1. row ***************************
Query Plan: Plan signature: 136492450096240440
=====================================
|ID|OPERATOR  |NAME  |EST. ROWS|COST|
-------------------------------------
|0 |TABLE SCAN|T_PART|24       |102 |
=====================================

Optimization Info:
-------------------------------------
T_PART:table_rows:2400, physical_range_rows:2400, logical_range_rows:2400, index_back_rows:0, output_rows:23, est_method:local_storage, optimization_method=cost_based, avaiable_index_name[T_PART], estimation info[table_id:500025, (table_type:0, version:-1--1--1, logical_rc:2400, physical_rc:2400)]

The following example collects statistics on the T_PART table first, and then obtains the execution plans of Q3 and Q4. You can see that the estimated number of rows in the execution plan for Q3 is accurate, and est_method:basic_stat in Optimization Info shows that row estimation for the current plan is performed based on the collected statistics. The estimated number of rows in the execution plan for Q4 is also accurate, because the selectivity calculated based on the collected statistics is more accurate and the row count returned by the storage layer also contributes to the accurate final result of row estimation.

OceanBase(TEST@TEST)>call dbms_stats.gather_table_stats('TEST','T_PART');
Query OK, 0 rows affected (0.19 sec)

OceanBase(TEST@TEST)>explain extended select * from t_part where c1 > 1\G
*************************** 1. row ***************************
Query Plan: Plan signature: 13918455839680143507
=====================================================
|ID|OPERATOR               |NAME    |EST. ROWS|COST |
-----------------------------------------------------
|0 |PX COORDINATOR         |        |9980     |29484|
|1 | EXCHANGE OUT DISTR    |:EX10000|9980     |20683|
|2 |  PX PARTITION ITERATOR|        |9980     |900  |
|3 |   TABLE SCAN          |T_PART  |9980     |900  |
=====================================================

Optimization Info:
-------------------------------------
T_PART:table_rows:10000, physical_range_rows:10000, logical_range_rows:10000, index_back_rows:0, output_rows:9979, est_method:basic_stat, optimization_method=cost_based, avaiable_index_name[T_PART]

OceanBase(TEST@TEST)>explain extended select * from t_part where c1 = 1\G
*************************** 1. row ***************************
Query Plan: Plan signature: 136492450096240440
=====================================
|ID|OPERATOR  |NAME  |EST. ROWS|COST|
-------------------------------------
|0 |TABLE SCAN|T_PART|20       |102 |
=====================================

Optimization Info:
-------------------------------------
T_PART:table_rows:2400, physical_range_rows:2400, logical_range_rows:2400, index_back_rows:0, output_rows:20, est_method:local_storage, optimization_method=cost_based, avaiable_index_name[T_PART], estimation info[table_id:500025, (table_type:0, version:-1--1--1, logical_rc:2400, physical_rc:2400)]

The examples provided above are aimed at helping you better understand how statistical information and row estimation are used in OceanBase Database’s optimizer. In real business scenarios, you can verify the accuracy of the row estimation in suboptimal execution plans. If the result is inaccurate, you can refer to the preceding examples to check the statistics. If the statistics are incorrect, you might consider recollecting the statistics and then examining whether there is a change in the plan. Execution plans in actual business scenarios are more complex, and calculating selectivity for complex predicates is also more challenging. Therefore, a thorough analysis based on real-world situations is essential.