In a database, the optimizer attempts to generate the optimal execution plan for each input SQL query. Generating the optimal execution plan often requires real-time and effective statistics and accurate row estimates. Statistics, in this context, refer to optimizer statistics, which are a collection of data describing the tables and columns in the database. These statistics are a crucial component of the cost model used to select the optimal execution plan. The optimizer cost model relies on the statistics of the tables, columns, and predicates involved in the query to select and optimize the plan. Accurate and effective statistics help the optimizer select the optimal execution plan.
OceanBase Database supports various types of statistics, including table-level, column-level, and histogram statistics. It also provides collection strategies such as scheduled automatic collection, manual collection, online collection, and asynchronous automatic collection when statistics are severely outdated. In most systems, users typically do not need to worry about statistics issues because the optimizer regularly performs tasks to collect statistics for tables that need updates. However, in AP scenarios, there may be extremely large tables or tables that undergo large-scale updates and require real-time queries. In such cases, the default statistics collection strategy may not be able to collect statistics in a timely manner, which can affect the generation of the execution plan. The following sections will provide targeted guidance on how to collect statistics in AP scenarios.
Overview
Classification of statistics
Statistics can be categorized as follows:
Table-level statistics (including index table statistics): These include the number of rows, macroblocks, microblocks, and average row length. They are used to estimate the cost of scanning a table.
Column-level statistics
- Value distribution: Maximum value, minimum value, average column length, and number of distinct values (NDV).
- Data skew: Described by frequency distribution (Histogram).
- Null value rate: Helps the optimizer handle queries involving NULL values.
Statistics collection methods supported by OceanBase Database
Automatic collection: The optimizer has a scheduled task that analyzes whether the statistics of a table need to be updated on a daily basis.
Manual collection: Users can trigger statistics collection using SQL commands. This method is suitable for large tables or specific query optimization. Additionally, users can specify collection strategies when collecting statistics, such as parallelism, granularity, and the number of histogram buckets.
Online collection: In scenarios such as batch import, PDML, and
CREATE TABLE ... AS, statistics can be collected online using theGATHER_OPTIMIZER_STATISTICShint and the system variable_optimizer_gather_stats_on_load(which is enabled by default). TheAPPENDhint of the direct load feature can also be used for online statistics collection.
Statistics update mechanism
Threshold-triggered update: Statistics are asynchronously updated when the amount of data change in a table exceeds a certain threshold (by default, 10 times).
Partition-level support: OceanBase Database supports statistics update and management at the partition level.
Optimization strategies for statistics in AP scenarios
Customized collection strategy
- Selective collection: Configure separate collection tasks for core query tables or key columns in AP scenarios.
- Partition priority: Prioritize updating the partitions with the highest usage frequency or the most significant changes.
Configure parallelism
- When collecting statistics for very large tables, configure the collection parallelism appropriately.
Dynamically adjust update frequency
- Based on the update pattern of the table data, flexibly configure the statistics update frequency to avoid unnecessary overhead.
Scenario example
Change the statistics collection window
By default, the OceanBase Database optimizer automatically collects statistics on a daily basis by maintaining a window to ensure that statistics are iteratively updated. The default start time for tasks from Monday to Sunday is 22:00, and the maximum collection duration is 4 hours, as shown in the following table.
| Maintenance window | Start time/frequency | Maximum collection duration |
|---|---|---|
| MONDAY_WINDOW | 22:00/per week | 4 hours |
| TUESDAY_WINDOW | 22:00/per week | 4 hours |
| WEDNESDAY_WINDOW | 22:00/per week | 4 hours |
| THURSDAY_WINDOW | 22:00/per week | 4 hours |
| FRIDAY_WINDOW | 22:00/per week | 4 hours |
| SATURDAY_WINDOW | 22:00/per week | 4 hours |
| SUNDAY_WINDOW | 22:00/per week | 4 hours |
Note
- For OceanBase Database V4.3.5, from V4.3.5 BP1, the default start time and maximum collection duration of the
SATURDAY_WINDOWandSUNDAY_WINDOWmaintenance windows are changed from default start time 6:00 and maximum collection duration 20 hours to default start time 22:00 and maximum collection duration 4 hours. - For OceanBase Database V4.2.5, from V4.2.5 BP2, the default start time and maximum collection duration of the
SATURDAY_WINDOWandSUNDAY_WINDOWmaintenance windows are changed from default start time 6:00 and maximum collection duration 20 hours to default start time 22:00 and maximum collection duration 4 hours.
You need to configure the maintenance window based on your business requirements. For example, if the maintenance window coincides with the business peak hours, you can adjust the start time of the maintenance window or skip statistics collection on specific dates. If there are many tables or large tables in the business environment, you can also adjust the collection duration of the maintenance window.
The following examples show how to configure the maintenance window.
-- Disable automatic statistics collection on Mondays.
CALL DBMS_SCHEDULER.DISABLE('MONDAY_WINDOW');
-- Enable automatic statistics collection on Mondays.
CALL DBMS_SCHEDULER.ENABLE('MONDAY_WINDOW');
-- Set the start time of automatic statistics collection on Mondays to 8:00 PM.
CALL DBMS_SCHEDULER.SET_ATTRIBUTE('MONDAY_WINDOW', 'NEXT_DATE', '2022-09-12 20:00:00');
-- Set the collection duration of automatic statistics collection on Mondays to 6 hours.
-- 6 hours <=> 6 * 60 * 60 * 1000 * 1000 <=> 21600000000 us
CALL DBMS_SCHEDULER.SET_ATTRIBUTE('WEDNESDAY_WINDOW', 'JOB_ACTION', 'DBMS_STATS.GATHER_DATABASE_STATS_JOB_PROC(21600000000)');
Statistics collection strategy for large tables
In scenarios where large tables exist, the default statistics collection strategy of the optimizer may not be able to collect the statistics of the table within a single maintenance window. Therefore, a reasonable collection strategy needs to be set for large tables. When collecting statistics for large tables, the main time-consuming steps are as follows:
- Large table data volume: Full table scan is required, which is time-consuming.
- Histogram collection involves complex calculations, which brings additional costs.
- For large partitioned tables, the default strategy collects statistics and histograms for global partitions, local partitions, and the entire table, resulting in a cost of 3 * (cost of full table scan + cost of histogram).
Based on the above time-consuming steps, you can optimize the statistics collection strategy according to the actual situation of the table and the queries. We recommend that you do the following:
Set an appropriate default collection parallelism. After you set the parallelism, you need to adjust the automatic collection tasks to run during off-peak hours to avoid affecting business operations. We recommend that you set the parallelism to 8 or less. You can set the parallelism by using the following method.
-- The same method applies to Oracle and MySQL business tenants. CALL DBMS_STATS.SET_TABLE_PREFS('database_name', 'table_name', 'degree', '8');Set the default histogram collection method for columns. For columns with uniform data distribution, we recommend that you do not collect histograms.
-- The same method applies to Oracle and MySQL business tenants. -- 1. If the data distribution of all columns in the table is uniform, you can use the following method to set the histogram collection method for all columns to not collect histograms. CALL DBMS_STATS.SET_TABLE_PREFS('database_name', 'table_name', 'method_opt', 'for all columns size 1'); -- 2. If the data distribution of only a few columns in the table is not uniform and histograms need to be collected for these columns, you can use the following method to set the histogram collection method for the specified columns to collect histograms and for other columns to not collect histograms. For example, you can set the histogram collection method for columns c1 and c2 to collect histograms and for columns c3, c4, and c5 to not collect histograms. CALL DBMS_STATS.SET_TABLE_PREFS('database_name', 'table_name', 'method_opt', 'for columns c1 size 254, c2 size 254, c3 size 1, c4 size 1, c5 size 1');Set the default collection granularity for partitioned tables. For partitioned tables such as hash-partitioned tables and key-partitioned tables, you can consider collecting only global statistics or setting the partition derivation method to collect global statistics.
-- The same method applies to Oracle and MySQL business tenants. -- 1. Set the collection granularity to collect only global statistics. CALL DBMS_STATS.SET_TABLE_PREFS('database_name', 'table_name', 'granularity', 'GLOBAL'); -- 2. Set the collection granularity to the partition derivation method to collect global statistics. CALL DBMS_STATS.SET_TABLE_PREFS('database_name', 'table_name', 'granularity', 'APPROX_GLOBAL AND PARTITION');We recommend that you do not use the sampling method to collect statistics for large tables. When you use the sampling method to collect statistics for large tables, the number of histogram samples in early versions of the database will also increase significantly, which may have the opposite effect. The sampling method is suitable only for scenarios where only basic statistics need to be collected without histograms.
-- The same method applies to Oracle and MySQL business tenants. For example, if you want to delete the granularity parameter. -- 1. Set the histogram collection method for all columns to not collect histograms. CALL DBMS_STATS.SET_TABLE_PREFS('database_name', 'table_name', 'method_opt', 'for all columns size 1'); -- 2. Set the sampling ratio to 10%. CALL DBMS_STATS.SET_TABLE_PREFS('database_name', 'table_name', 'estimate_percent', '10');
If you want to clear or delete the default collection strategy, you only need to specify the attribute to be cleared, {attribute}, by using the following method.
-- The same method applies to Oracle and MySQL business tenants. For example, if you want to delete the granularity parameter.
CALL DBMS_STATS.delete_table_prefs('database_name', 'table_name', 'granularity');
After you set the collection strategy, you can query whether the strategy is set by using the following method.
-- The same method applies to Oracle and MySQL business tenants. For example, you can query the parallelism degree.
SELECT DBMS_STATS.GET_PREFS('degree', 'database_name','table_name') from dual;
In addition to the above methods, you can also consider whether to manually collect statistics for large tables and lock the statistics. After the statistics of a table are locked, automatic collection will not update them. This method is suitable for scenarios where the data characteristics change slightly and the data values are not sensitive. If you want to re-collect the locked statistics, you need to unlock them first.
-- The same method applies to Oracle and MySQL business tenants. You can lock the statistics of a table.
CALL DBMS_STATS.LOCK_TABLE_PREFS('database_name', 'table_name');
-- The same method applies to Oracle and MySQL business tenants. You can unlock the statistics of a table.
CALL DBMS_STATS.UNLOCK_TABLE_PREFS('database_name', 'table_name');
References
For more information about statistics, see the following topics: