Why collect statistics?
When generating and selecting an execution plan, the optimizer evaluates and compares the execution costs of all available plans. Accurate cost evaluation is essential for this process. The optimizer relies on a cost model and operator-estimated row counts to assess execution plan costs, with statistics playing a pivotal role. Accurate statistics directly impact cardinality estimation, which in turn influences cost evaluation and execution plan selection. Therefore, maintaining accurate statistics is critical for generating efficient execution plans.
How does it work?
Statistics are stored as regular data in internal tables, and the optimizer maintains a local cache of these statistics to speed up access. Using the statistics, the optimizer evaluates the costs of various query paths and selects the most efficient plan. Regular updates to statistics ensure they reflect changes in the data, helping to maintain optimal query performance.
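You can inspect the stored statistics directly to verify what the optimizer sees. The following is a minimal sketch, assuming your tenant exposes the Oracle-style dictionary views DBA_TAB_STATISTICS and DBA_TAB_COL_STATISTICS (available in recent OceanBase Database versions; the table name T_ORDER is hypothetical):
-- Check when table-level statistics were last collected.
SELECT OWNER, TABLE_NAME, NUM_ROWS, LAST_ANALYZED FROM DBA_TAB_STATISTICS WHERE TABLE_NAME = 'T_ORDER';
-- Check column-level statistics such as the NDV, NULL count, and histogram type.
SELECT COLUMN_NAME, NUM_DISTINCT, NUM_NULLS, HISTOGRAM FROM DBA_TAB_COL_STATISTICS WHERE TABLE_NAME = 'T_ORDER';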
Scenarios
- Complex query optimization: When a complex JSON query or a subquery is executed, accurate statistics enable the optimizer to select a more appropriate table join order and table join method, thereby significantly improving the query performance.
- Index selection: The optimizer determines whether to use a specific index to accelerate access based on statistics.
- Data distribution change: When the data in a table is significantly changed due to operations such as batch insert, delete, or update, it is particularly important to update statistics.
Considerations
- Collection frequency: You need to set an appropriate statistics collection frequency based on the data change pattern.
- Resource usage: Statistics collection consumes system resources. Therefore, we recommend that you collect statistics when the system load is low.
- Accuracy: It is essential to ensure the accuracy of statistics. Otherwise, the optimizer may fail to select the optimal execution plan.
Default statistics collection strategy
By default, the system starts collecting statistics at 10:00 p.m. on each workday and at 6:00 a.m. on weekends. Collection lasts at most 4 hours on workdays and 20 hours on weekends. This period is called the statistics maintenance window. In each maintenance window, the optimizer re-collects all outdated statistics of tables and partitions.
Statistics are re-collected for a table or partition when either of the following conditions is met:
- Statistics have never been successfully collected for the table or partition.
- The number of added, deleted, or modified rows exceeds 10% of the row count of the table or partition at the time statistics were last collected.
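To check how much a table has changed since the last collection, you can query the change tracking view. This is a hedged sketch, assuming the DBA_TAB_MODIFICATIONS view is available as in recent OceanBase Database versions (T_ORDER is a hypothetical table name):
-- Show the rows inserted, updated, and deleted since statistics were last collected.
SELECT TABLE_NAME, INSERTS, UPDATES, DELETES FROM DBA_TAB_MODIFICATIONS WHERE TABLE_NAME = 'T_ORDER';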
The following table describes the default statistics collection strategy.
| Parameter | Description | Default value |
|---|---|---|
| degree | The degree of parallelism (DOP). | 1 (single thread for scanning) |
| method_opt | The strategy for collecting column-level statistics. | Collect the basic statistics of all columns and the histograms of columns with data skew. |
| granularity | The collection granularity. | For a non-partitioned table, directly collect global statistics; for a partitioned table, collect partition-level statistics and deduce the global statistics. |
| estimate_percent | The sampling ratio. | Collect statistics through a full-table scan without sampling. |
| block_sample | Specifies whether to use block sampling. | Use row sampling instead of block sampling. |
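These defaults can be overridden for the whole tenant or for individual tables through the DBMS_STATS preference interface. A short sketch using the parameters from the table above (the database and table names are placeholders, and the values are illustrative only):
-- Raise the default DOP for all statistics collection tasks in the tenant.
call dbms_stats.set_global_prefs('degree', '2');
-- Override the column-level strategy for a single table.
call dbms_stats.set_table_prefs('database_name', 'table_name', 'method_opt', 'for all columns size auto');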
Configure a statistics collection strategy
The default statistics collection strategy can meet the requirements of most tables. In certain scenarios, you may need to adjust the collection strategy based on business characteristics. This section describes some common scenarios and corresponding collection strategies.
Business peaks overlapping with the statistics maintenance window
The default statistics maintenance window may overlap with business peak hours, such as workday nights. During such a period, the statistics collection task may contend for resources with business SQL queries and compromise business performance. In this case, you can adjust the start time of the statistics maintenance window so that it no longer overlaps with business peak hours.
-- For a MySQL business tenant, if it is now 11:00 a.m. on Thursday, March 7, 2024, you need to adjust the start time of the statistics maintenance window to 2:00 a.m. on Friday in the current tenant.
call dbms_scheduler.set_attribute('MONDAY_WINDOW', 'NEXT_DATE', '2024-03-11 02:00:00');
call dbms_scheduler.set_attribute('TUESDAY_WINDOW', 'NEXT_DATE', '2024-03-12 02:00:00');
call dbms_scheduler.set_attribute('WEDNESDAY_WINDOW', 'NEXT_DATE', '2024-03-13 02:00:00');
call dbms_scheduler.set_attribute('THURSDAY_WINDOW', 'NEXT_DATE', '2024-03-14 02:00:00');
call dbms_scheduler.set_attribute('FRIDAY_WINDOW', 'NEXT_DATE', '2024-03-08 02:00:00');
call dbms_scheduler.set_attribute('SATURDAY_WINDOW', 'NEXT_DATE', '2024-03-09 02:00:00');
call dbms_scheduler.set_attribute('SUNDAY_WINDOW', 'NEXT_DATE', '2024-03-10 02:00:00');
-- For an Oracle business tenant, if it is now 11:00 a.m. on Thursday, March 7, 2024, you need to adjust the start time of the statistics maintenance window to 2:00 a.m. on Friday in the corresponding sys tenant.
SET NLS_DATE_FORMAT='YYYY-MM-DD HH24:MI:SS';
SET NLS_TIMESTAMP_FORMAT='YYYY-MM-DD HH24:MI:SS.FF';
SET NLS_TIMESTAMP_TZ_FORMAT='YYYY-MM-DD HH24:MI:SS.FF TZR TZD';
call dbms_scheduler.set_attribute('MONDAY_WINDOW', 'NEXT_DATE', '2024-03-11 02:00:00');
call dbms_scheduler.set_attribute('TUESDAY_WINDOW', 'NEXT_DATE', '2024-03-12 02:00:00');
call dbms_scheduler.set_attribute('WEDNESDAY_WINDOW', 'NEXT_DATE', '2024-03-13 02:00:00');
call dbms_scheduler.set_attribute('THURSDAY_WINDOW', 'NEXT_DATE', '2024-03-14 02:00:00');
call dbms_scheduler.set_attribute('FRIDAY_WINDOW', 'NEXT_DATE', '2024-03-08 02:00:00');
call dbms_scheduler.set_attribute('SATURDAY_WINDOW', 'NEXT_DATE', '2024-03-09 02:00:00');
call dbms_scheduler.set_attribute('SUNDAY_WINDOW', 'NEXT_DATE', '2024-03-10 02:00:00');
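If shifting the start time is not enough, you can also shorten the window itself by setting its DURATION attribute. The following sketch assumes that the attribute accepts a day hh:mm:ss interval string; verify the exact format supported by your version:
-- Shorten the Monday maintenance window to 2 hours (value format is an assumption).
call dbms_scheduler.set_attribute('MONDAY_WINDOW', 'DURATION', '0 02:00:00');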
Incomplete statistics collected for ultra-large tables
Under the default statistics collection strategy, a full-table scan is performed on each table or partition whose statistics are to be collected. If the table or partition contains a large volume of data or occupies much disk space, the collection takes a long time, delays statistics collection for other tables, or even times out. In practice, if a table contains more than 100 million rows or occupies more than 20 GB of disk space, we recommend that you configure the statistics collection strategy as follows:
In OceanBase Database versions earlier than V4.2.1, the estimate_percent parameter specifies the sampling ratio of both basic statistics and histograms. When you use this parameter to adjust the sampling ratio of basic statistics, the sampling ratio of histograms changes accordingly. Therefore, we recommend that you do not use this parameter in combination with a histogram collection command in these versions.
In OceanBase Database V4.2.1 and later, the estimate_percent parameter specifies the sampling ratio only of basic statistics. You can directly adjust this parameter as needed without worrying about affecting the histogram sampling ratio.
Non-partitioned table
Skip large objects (LOBs): Do not collect statistics of LOB columns such as TEXT columns; collect statistics of non-LOB columns only. In MySQL mode, statistics of TEXT columns are collected by default, and if those columns store large values, the collection can be very slow. In the following example, the fourth parameter specifies the columns whose statistics are to be collected. List all columns except the LOB columns.
call dbms_stats.set_table_prefs('database_name', 'table_name', 'method_opt', 'for columns col1,col2,col3,... size auto');
Increase the DOP or configure block sampling: You can increase the DOP so that more threads concurrently collect statistics, trading higher resource consumption for faster collection. Alternatively, you can configure block sampling to reduce the amount of data processed during collection. Both methods improve collection efficiency: the first consumes more resources but preserves accuracy, whereas the second reduces resource consumption at the cost of some statistics accuracy. Select a method as needed.
-- Configure a DOP at the table level.
call dbms_stats.set_table_prefs('database_name', 'table_name', 'degree', '4');
-- Enable block sampling at the table level.
call dbms_stats.set_table_prefs('database_name', 'table_name', 'block_sample', 'True');
-- Configure a sampling ratio based on the order of magnitude of the table. Generally, statistics collected from tens of millions of rows fully reflect the data characteristics of a table.
call dbms_stats.set_table_prefs('database_name', 'table_name', 'estimate_percent', '0.1');
Partitioned table
Skip LOBs: Do not collect statistics of TEXT columns. In MySQL mode, statistics of TEXT columns are collected by default, and if those columns store large values, the collection can be very slow. In the following example, the fourth parameter specifies the columns whose statistics are to be collected. List all columns except the LOB columns.
-- Skip LOBs.
call dbms_stats.set_table_prefs('database_name', 'table_name', 'method_opt', 'for columns col1,col2,col3,... size auto');
Do not collect global statistics: Collect statistics at only a specific level for partitioned or subpartitioned tables, and make sure that existing global statistics are deleted. In the following example, the fourth parameter specifies the level of statistics to be collected. For a partitioned table, you can collect statistics of only partitions; for a subpartitioned table, of only subpartitions. If you choose this strategy, delete global statistics for partitioned and subpartitioned tables, as well as partition-level statistics for subpartitioned tables (a sketch of deleting them follows at the end of this section).
-- Configure not to collect global statistics.
-- For partitioned tables
call dbms_stats.set_table_prefs('database_name', 'table_name', 'granularity', 'PARTITION');
-- For subpartitioned tables
call dbms_stats.set_table_prefs('database_name', 'table_name', 'granularity', 'SUBPARTITION');
Increase the DOP or configure block sampling: Use settings similar to those for non-partitioned tables, and make sure that statistics are collected in off-peak hours. You can increase the DOP so that more threads concurrently collect statistics, trading higher resource consumption for faster collection; if you choose this method, stagger the collection time with business peaks. Alternatively, you can configure block sampling to reduce the amount of data processed during collection and to prevent the high disk I/O usage that results from massive disk reads. Both methods improve collection efficiency: the first consumes more resources but preserves accuracy, whereas the second reduces resource consumption at the cost of some statistics accuracy. Select a method as needed.
-- Configure a DOP at the table level.
call dbms_stats.set_table_prefs('database_name', 'table_name', 'degree', '4');
-- Enable block sampling at the table level.
call dbms_stats.set_table_prefs('database_name', 'table_name', 'block_sample', 'True');
-- Configure a sampling ratio based on the order of magnitude of the table. Generally, statistics collected from tens of millions of rows fully reflect the data characteristics of a table.
call dbms_stats.set_table_prefs('database_name', 'table_name', 'estimate_percent', '0.1');
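As noted above, here is a hedged sketch of deleting statistics that should no longer be maintained. It assumes that the DELETE_TABLE_STATS procedure and its cascade_parts argument behave as in Oracle-compatible DBMS_STATS implementations; verify both in your version before use:
-- Delete only the table-level (global) statistics of a partitioned table; cascade_parts => false (an assumption) keeps the partition-level statistics.
call dbms_stats.delete_table_stats('database_name', 'table_name', cascade_parts => false);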
Configure a histogram strategy for statistics collection
The OceanBase Database optimizer supports three types of histograms: frequency histograms, TopK histograms, and hybrid histograms.
In a frequency histogram, each distinct column value corresponds to a single bucket in the histogram. The number of buckets specified must be no less than the number of distinct values (NDV) of the column.
A TopK histogram is a variant of the frequency histogram. Based on the Lossy Counting algorithm, it describes the overall data distribution by recording only the most frequent values, which must cover at least a 1-(1/bucket_size) ratio of the data.
A hybrid histogram is built from a specified amount of sampled data and supplements frequency and TopK histograms.
For large tables, the OceanBase Database optimizer uses hybrid histograms to build statistics efficiently. This method samples a preset amount of data to build the histogram. Each bucket of a hybrid histogram can contain multiple distinct values: the sampled data is segmented based on the number of buckets, and all data in each segment is placed in the corresponding bucket. This way, fewer buckets describe the data distribution over a larger range.
The histogram type is selected as follows:
- Condition 1: A frequency histogram is selected if the NDV of the specified column is less than or equal to 254.
- Condition 2: A TopK histogram is preferentially selected if the NDV of the specified column is greater than 254. After column statistics are collected, the buckets are ranked in descending order of frequency, and buckets ranked below the top 254 are discarded, provided that the ratio of data records in the top 254 buckets to the total data records collected is at least 1-(1/bucket_size), which is 99.6% by default (with 254 buckets, 1-(1/254) ≈ 0.996).
- A hybrid histogram is selected if neither of the preceding conditions is met.
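To control histogram collection explicitly, you can set the bucket count in method_opt when gathering statistics. A minimal sketch with placeholder names:
-- Collect a histogram with up to 254 buckets for col1.
call dbms_stats.gather_table_stats('database_name', 'table_name', method_opt => 'for columns col1 size 254');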
Use the stored procedure set_column_stats to explicitly set column-level statistics. Here is an example:
-- Call the `SET_COLUMN_STATS` procedure in the `DBMS_STATS` package.
-- Set statistics of the 'col1' column in the 'tbl1' table.
-- 'user1': the user or schema to which the table belongs.
-- 'tbl1': the name of the table for which statistics are to be set.
-- 'col1': the name of the column for which statistics are to be set.
-- distcnt => 254: The NDV of the specified column is 254.
-- nullcnt => 10: The number of NULL values in the specified column is 10.
CALL dbms_stats.set_column_stats('user1', 'tbl1', 'col1', distcnt => 254, nullcnt => 10);
Manually collect statistics
Through manual statistics collection, you can better control the collection time and method to reduce the impact on business. The OceanBase Database optimizer allows you to manually collect statistics by using the DBMS_STATS package or executing the ANALYZE statement. We recommend that you use the DBMS_STATS package because it provides more features.
Example: Use the DBMS_STATS package to collect statistics
-- Collect partition-level statistics on the T_PART1 table under the TEST user, with a DOP of 64. Only histograms of columns with uneven data distribution are collected.
CALL dbms_stats.gather_table_stats('TEST', 'T_PART1', degree => 64, granularity => 'PARTITION', method_opt => 'FOR ALL COLUMNS SIZE SKEWONLY');
Example: Use the ANALYZE statement to collect statistics
-- Collect statistics on partition 'T_PART1' of the T_PART1 table under the TEST user. Only histograms of columns with uneven data distribution are collected.
ANALYZE TABLE T_PART1 PARTITION ('T_PART1') COMPUTE STATISTICS FOR ALL COLUMNS SIZE skewonly;
For ultra-large tables, you can select the segment-based or manual collection strategy based on the data volume and table schema. Segment-based collection can effectively reduce the load of a single statistics collection task. Manual collection is more flexible and controllable.
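A hedged sketch of segment-based collection, gathering one partition at a time so that each task stays small (the partition names P0 and P1 are hypothetical):
-- Collect statistics partition by partition to limit the load of each task.
call dbms_stats.gather_table_stats('TEST', 'T_PART1', partname => 'P0', granularity => 'PARTITION');
call dbms_stats.gather_table_stats('TEST', 'T_PART1', partname => 'P1', granularity => 'PARTITION');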
Query after a batch execution task
By default, statistics are not immediately updated after you batch import data. If business queries need to be performed immediately after batch import, we recommend that you manually initiate a statistics collection after the import.
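For example, after a nightly batch import into a hypothetical T_ORDER table, you might refresh its statistics before opening it to queries:
-- Refresh statistics immediately after the batch import completes.
call dbms_stats.gather_table_stats('database_name', 'T_ORDER', degree => 4);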
Query on tables in which partitions are pre-created by date
For a table in which partitions are pre-created by date, if statistics of some pre-created partitions are not updated in a timely manner after data import, the optimizer may generate a nonoptimal execution plan. To ensure that the optimizer can generate an optimal execution plan, we recommend that you manually collect statistics of related partitions after data import.
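A sketch for refreshing only the newly loaded date partition (all names are placeholders), assuming the partname argument as in the DBMS_STATS examples above:
-- Collect statistics only for the partition that just received data.
call dbms_stats.gather_table_stats('database_name', 'table_name', partname => 'P20240307', granularity => 'PARTITION');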
References
For more information about statistics, see the following topics: