In a database, the optimizer attempts to generate the optimal execution plan for every SQL query it receives, and doing so requires up-to-date, accurate statistics and precise row count estimates. Statistics here refer to optimizer statistics: a set of data that describes the tables and columns in the database. They are a crucial input to the cost model, which relies on the statistics of the objects involved in a query, such as tables, columns, and predicates, to select and optimize execution plans. Accurate statistics therefore help the optimizer choose the optimal execution plan.
OceanBase supports various types of statistics, including table-level and column-level basic statistics as well as histograms, and provides multiple collection strategies, such as scheduled automatic collection, manual collection, online collection, and automatic asynchronous collection when statistics become severely outdated. In most systems, users do not need to worry about the specifics of statistics, because the optimizer regularly runs tasks to collect statistics for tables whose statistics are stale. However, analytical processing (AP) scenarios may involve very large tables, or tables that must be queried immediately after large-scale updates. In these cases, the default collection strategy may not refresh statistics in time, which can affect the generation of execution plans. The following sections introduce targeted approaches for collecting statistics in such AP scenarios.
Overview
Categories of statistics
Statistics can be classified into the following categories:
Table-level statistics (including index tables): This includes the number of rows, the number of macroblocks, the number of microblocks, and the average row length. This information is used to estimate the scan cost of a table.
Column-level statistics:
- Value distribution in a column: maximum value, minimum value, average column length, and number of different values (NDV).
- Data skew: described by a histogram.
- Null value rate: helps the optimizer handle queries that contain NULL values.
Methods for collecting statistics in OceanBase Database
Automatic collection: The optimizer periodically checks whether the statistics of each table need to be updated.
Manual collection: You can run a SQL statement to trigger statistics collection. This method is suitable for very large tables or for optimizing specific queries. When you collect statistics, you can specify the collection strategy, such as the degree of parallelism, the granularity, and the number of histogram buckets. A minimal example is shown after this list.
Online collection: In scenarios such as batch imports, parallel DML (PDML), and CREATE TABLE ... AS, you can use the `GATHER_OPTIMIZER_STATISTICS` hint or the system variable `_optimizer_gather_stats_on_load` (enabled by default) to enable online statistics collection. You can also use the `APPEND` hint of the direct load feature to enable online statistics collection.
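For reference, the following sketch shows one way to trigger manual collection and online collection during a data load. The database, table, and source table names are placeholders, and the named-parameter syntax (degree=>, method_opt=>) is assumed to be supported by your OceanBase version; adjust the parameters to your own workload.
-- Manually collect statistics for a table with a parallelism of 4 and automatic histogram selection
call dbms_stats.gather_table_stats('database_name', 'table_name', degree=>4, method_opt=>'for all columns size auto');
-- Collect statistics online while loading data by adding the GATHER_OPTIMIZER_STATISTICS hint to a bulk insert
insert /*+ GATHER_OPTIMIZER_STATISTICS */ into table_name select * from source_table;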
Statistics update mechanism
Threshold-based update: Statistics are asynchronously updated when the data volume of a table changes by more than 10 times.
Support for partitioned tables: OceanBase Database supports statistics update and management at the partition level.
Optimization strategies for statistics in AP scenarios
Customized collection strategy
- Selective collection: Configure collection tasks separately for the core query tables or key columns in AP scenarios.
- Partition priority: Prioritize updating the partitions with the highest usage frequency or the most significant data changes (see the sketch after this list).
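As a rough sketch of partition-priority collection, the following call gathers statistics for a single hot partition only. The database, table, and partition names (p_hot) are placeholders; verify the parameter names against your OceanBase version.
-- Collect statistics only for the frequently queried partition p_hot, with a parallelism of 4
call dbms_stats.gather_table_stats('database_name', 'table_name', partname=>'p_hot', granularity=>'PARTITION', degree=>4);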
Configure degree of parallelism
- When collecting statistics for ultra-large tables, properly set the collection degree of parallelism.
Dynamically adjust update frequency
- Based on the update pattern of table data, flexibly configure the statistics update frequency to avoid unnecessary overhead (see the sketch below).
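For example, assuming your OceanBase version supports the stale_percent table preference (the change ratio beyond which a table's statistics are considered stale), you can have a rarely changing table re-collected less aggressively. The value below is illustrative only.
-- Consider the statistics of this table stale only after roughly 30% of its rows have changed
call dbms_stats.set_table_prefs('database_name', 'table_name', 'stale_percent', '30');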
Example
Adjust the statistics collection window
By default, the OceanBase optimizer performs automatic statistics collection through daily maintenance windows, ensuring that statistics are updated iteratively. Each window starts at 22:00, recurs weekly, and allows a maximum collection duration of 4 hours, as shown in the following table.
| Maintenance window | Start time / frequency | Maximum collection duration |
|---|---|---|
| MONDAY_WINDOW | 22:00/per week | 4 hours |
| TUESDAY_WINDOW | 22:00/per week | 4 hours |
| WEDNESDAY_WINDOW | 22:00/per week | 4 hours |
| THURSDAY_WINDOW | 22:00/per week | 4 hours |
| FRIDAY_WINDOW | 22:00/per week | 4 hours |
| SATURDAY_WINDOW | 22:00/per week | 4 hours |
| SUNDAY_WINDOW | 22:00/per week | 4 hours |
You can configure the maintenance window based on your business requirements. For example, if the maintenance window coincides with your business peak hours, you can adjust its start time or specify that statistics not be collected on specific dates. If your environment contains a large number of tables, or many ultra-large tables, you can also adjust the maximum duration of the maintenance window.
The following examples show how to configure the maintenance window.
-- Disable automatic statistics collection on Mondays
call dbms_scheduler.disable('MONDAY_WINDOW');
-- Enable automatic statistics collection on Mondays
call dbms_scheduler.enable('MONDAY_WINDOW');
-- Set the start time of the automatic statistics collection task on Mondays to 8 p.m.
call dbms_scheduler.set_attribute('MONDAY_WINDOW', 'NEXT_DATE', '2022-09-12 20:00:00');
-- Set the duration of the automatic statistics collection task on Wednesdays to 6 hours
-- 6 hours <=> 6 * 60 * 60 * 1000 * 1000 <=> 21600000000 us
call dbms_scheduler.set_attribute('WEDNESDAY_WINDOW', 'JOB_ACTION', 'DBMS_STATS.GATHER_DATABASE_STATS_JOB_PROC(21600000000)');
Statistics collection strategies for ultra-large tables
The optimizer's default statistics collection strategy may result in the statistics of an ultra-large table not being collected within a single maintenance window. Therefore, you need to set a reasonable collection strategy for ultra-large tables. The statistics collection for ultra-large tables is time-consuming mainly due to the following three factors:
- The tables contain a large amount of data, and the statistics collection requires a full table scan, which is time-consuming.
- Histogram collection involves complex computations, resulting in additional costs and time consumption.
- By default, the optimizer collects statistics and histograms on subpartitions, partitions, and the entire table of a large partitioned table. The cost is 3 × (cost of full table scan + cost of histogram collection).
Based on the preceding time-consuming factors, you can optimize the statistics collection for tables based on their actual situations and related query requirements. We recommend that you take the following measures:
- Set an appropriate default collection parallelism. Note that after you increase the collection parallelism, you should schedule the related automatic collection tasks during off-peak hours to avoid affecting business traffic. We recommend that you keep the collection parallelism within 8. You can set the collection parallelism by using the following statement:
-- The same for Oracle and MySQL business tenants:
call dbms_stats.set_table_prefs('database_name', 'table_name', 'degree', '8');
- Set the default histogram collection method for columns. We recommend that you do not collect histograms for columns with evenly distributed data.
-- The same for Oracle and MySQL business tenants
-- 1. If the data is evenly distributed in all columns of the table, you can use the following statement to specify that no histograms be collected for all columns:
call dbms_stats.set_table_prefs('database_name', 'table_name', 'method_opt', 'for all columns size 1');
-- 2. If only a few columns have unevenly distributed data that requires histogram collection, you can use the following statement to specify the collection for these columns (c1 and c2) and not for other columns (c3, c4, and c5):
call dbms_stats.set_table_prefs('database_name', 'table_name', 'method_opt', 'for columns c1 size 254, c2 size 254, c3 size 1, c4 size 1, c5 size 1');
- Set the default collection granularity for partitioned tables. For hash-partitioned or key-partitioned tables, for example, you can collect only global statistics, or derive global statistics approximately from partition-level statistics.
-- The same for Oracle and MySQL business tenants
-- 1. Specify to collect only global statistics:
call dbms_stats.set_table_prefs('database_name', 'table_name', 'granularity', 'GLOBAL');
-- 2. Specify to derive global statistics from partitions:
call dbms_stats.set_table_prefs('database_name', 'table_name', 'granularity', 'APPROX_GLOBAL AND PARTITION');
- Use large table sampling with caution. In earlier versions, enabling sampling on a large table can make the number of histogram samples excessively large, which may be counterproductive. Large table sampling is suitable only for collecting basic statistics, not histograms.
-- The same for Oracle and MySQL business tenants.
-- 1. Specify to collect no histograms for all columns:
call dbms_stats.set_table_prefs('database_name', 'table_name', 'method_opt', 'for all columns size 1');
-- 2. Specify to sample 10% of data:
call dbms_stats.set_table_prefs('database_name', 'table_name', 'estimate_percent', '10');
In addition, if you want to clear or delete a default collection strategy that has been set, you only need to specify the attribute to be cleared. You can use the following statement to do so.
-- The same for Oracle and MySQL business tenants. The following example deletes the granularity preference:
call dbms_stats.delete_table_prefs('database_name', 'table_name', 'granularity');
After you set a collection strategy for statistics, you may want to check whether the strategy has been set successfully. You can use the following statement to query the settings.
-- The same for Oracle and MySQL business tenants. The following example queries the configured collection parallelism:
select dbms_stats.get_prefs('degree', 'database_name', 'table_name') from dual;
You can also consider locking the statistics of a large table after you manually collect them. Note that when the statistics of a table are locked, automatic collection will not update them; this is suitable for scenarios where data characteristics change little and queries are not sensitive to those changes. If you want to recollect locked statistics, you must unlock them first. A combined example follows the statements below.
-- The same for Oracle and MySQL business tenants, lock the statistics of a table.
call dbms_stats.lock_table_stats('database_name', 'table_name');
-- The same for Oracle and MySQL business tenants, unlock the statistics of a table.
call dbms_stats.unlock_table_stats('database_name', 'table_name');
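Putting these statements together, a typical recollection flow for a table with locked statistics might look like the following sketch. The database and table names are placeholders, and the parallelism of 8 is only an example.
-- Unlock the statistics so that they can be recollected
call dbms_stats.unlock_table_stats('database_name', 'table_name');
-- Manually recollect the statistics, for example with a parallelism of 8
call dbms_stats.gather_table_stats('database_name', 'table_name', degree=>8);
-- Lock the statistics again so that automatic collection does not overwrite them
call dbms_stats.lock_table_stats('database_name', 'table_name');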
References
For more information about statistics, see the following topics:
Statistics include two types: table-level statistics and column-level statistics. For more information about the types of statistics, see Statistics overview.
The OceanBase optimizer supports both manual and automatic statistics collection. For a detailed introduction and operational guidance on statistics collection, see Overview of statistics collection methods.
For more information about how to manage statistics, see Manage statistics.
You can learn how to use statistics by reading Example.