In a database, the optimizer attempts to generate the optimal execution plan for each input SQL query, a process that often requires real-time and accurate statistical information and row estimations. Statistical information specifically refers to optimizer statistics, which is a dataset describing table and column information within the database and is a crucial component for the optimizer's cost model to select the optimal execution plan. The optimizer's cost model relies on statistical information regarding tables, columns, predicates, and other objects involved in the query to make plan and optimization selections. Accurate and effective statistical information assists the optimizer in choosing the best execution plan.
The estimation of rows in the optimizer involves calculating the estimated output rows for the current query. Accurate row estimations enable the optimizer's cost model to calculate more precise costs, thus aiding in the selection of the best execution plan. The optimizer in OceanBase Database currently employs two methods for row estimation. One method involves calculating the relevant selectivity based on statistical information and using it to estimate the number of rows. The other method involves the optimizer performing row estimations in real-time during plan generation by utilizing query ranges on the base table path in the storage layer to obtain real-time and precise row estimations.
In summary, the topics below will explore the statistical information and row estimation mechanisms of OceanBase Database's optimizer from the perspective of SQL tuning.