The OceanBase Database optimizer provides two methods for manually collecting statistics: the DBMS_STATS package and the ANALYZE statement.
Note
OceanBase Database does not support estimating the time required to collect statistics. We recommend that you set the ob_query_timeout parameter to a sufficiently large value to ensure that statistics collection is completed. If statistics collection for schema-level statistics is interrupted due to a timeout, the statistics collected so far will be retained.
Considerations and limitations
- We recommend that you set the ob_query_timeout variable to a larger value at the session level. This setting only affects the current session and does not interfere with other operations.
- The system does not support estimating the time required to collect statistics. We recommend that you set ob_query_timeout to a sufficiently large value to ensure that statistics are collected. In particular, if statistics collection for schema-level statistics is interrupted due to a timeout, the statistics collected so far will be retained. This way, you do not need to re-collect all the data after the timeout, saving time.
Collect statistics by using the DBMS_STATS package
The OceanBase Database optimizer supports collecting table-level, schema-level, and index-level statistics by using the DBMS_STATS package. You can call the stored procedures gather_table_stats, gather_schema_stats, and gather_index_stats respectively.
PROCEDURE gather_table_stats (
ownname VARCHAR2,
tabname VARCHAR2,
partname VARCHAR2 DEFAULT NULL,
estimate_percent NUMBER DEFAULT AUTO_SAMPLE_SIZE,
block_sample BOOLEAN DEFAULT FALSE,
method_opt VARCHAR2 DEFAULT DEFAULT_METHOD_OPT,
degree NUMBER DEFAULT NULL,
granularity VARCHAR2 DEFAULT DEFAULT_GRANULARITY,
cascade BOOLEAN DEFAULT NULL,
stattab VARCHAR2 DEFAULT NULL,
statid VARCHAR2 DEFAULT NULL,
statown VARCHAR2 DEFAULT NULL,
no_invalidate BOOLEAN DEFAULT FALSE,
stattype VARCHAR2 DEFAULT 'DATA',
force BOOLEAN DEFAULT FALSE
);
PROCEDURE gather_schema_stats (
ownname VARCHAR2,
estimate_percent NUMBER DEFAULT AUTO_SAMPLE_SIZE,
block_sample BOOLEAN DEFAULT FALSE,
method_opt VARCHAR2 DEFAULT DEFAULT_METHOD_OPT,
degree NUMBER DEFAULT NULL,
granularity VARCHAR2 DEFAULT DEFAULT_GRANULARITY,
cascade BOOLEAN DEFAULT NULL,
stattab VARCHAR2 DEFAULT NULL,
statid VARCHAR2 DEFAULT NULL,
statown VARCHAR2 DEFAULT NULL,
no_invalidate BOOLEAN DEFAULT FALSE,
stattype VARCHAR2 DEFAULT 'DATA',
force BOOLEAN DEFAULT FALSE
);
PROCEDURE gather_index_stats (
ownname VARCHAR2,
indname VARCHAR2,
partname VARCHAR2 DEFAULT NULL,
estimate_percent NUMBER DEFAULT AUTO_SAMPLE_SIZE,
stattab VARCHAR2 DEFAULT NULL,
statid VARCHAR2 DEFAULT NULL,
statown VARCHAR2 DEFAULT NULL,
degree NUMBER DEFAULT NULL,
granularity VARCHAR2 DEFAULT DEFAULT_GRANULARITY,
no_invalidate BOOLEAN DEFAULT FALSE,
force BOOLEAN DEFAULT FALSE,
tabname VARCHAR2 DEFAULT NULL
);
The following table describes the parameters.
| Parameter | Description |
|---|---|
| ownname |
|
| tabname | The name of the table. |
| indname | The name of the index. |
| partname | The name of the partition. The default value is NULL. |
| estimate_percent | The percentage of data to be used to calculate the distribution characteristics. Valid values: [0.000001,100]. If this parameter is set to NULL, all data is used. The default value is AUTO_SAMPLE_SIZE, which means that the optimizer determines the percentage of data to be used. |
| block_sample | Whether to use block sampling instead of row sampling. The default value is FALSE. |
| method_opt | The method for collecting column-level statistics. For more information, see method_opt. |
| degree | The parallelism for collecting statistics. The default value is NULL. |
| granularity | The partition level for collecting statistics. Valid values:
|
| cascade | Whether to collect index statistics of the table at the same time. The default value is TRUE. |
| stattab | This parameter is not implemented and cannot be used. |
| statid | This parameter is not implemented and cannot be used. |
| statown | This parameter is not implemented and cannot be used. |
| no_invalidate | This parameter is not implemented and cannot be used. |
| stattype | This parameter is not implemented and cannot be used. |
| force | Whether to forcibly collect statistics and ignore the lock status. The default value is FALSE. This parameter can be specified only in the BEGIN ... END block of PL/SQL. |
method_opt
method_opt is mainly used with the following syntax:
method_opt:
FOR ALL [INDEXED | HIDDEN] COLUMNS [size_clause]
| FOR COLUMNS [size clause] column [size_clause] [,column [size_clause]...]
size_clause:
SIZE integer
| SIZE REPEAT
| SIZE AUTO
| SIZE SKEWONLY
column:
column_name
| (column_name [, column_name])
The following table describes the parameters.
| Parameter | Description |
|---|---|
| SIZE integer | Specifies the number of histogram buckets to collect for a column. Valid values: [1, 2048]. |
| REPEAT | Collects histograms for only columns that have been previously collected. The number of histogram buckets is the same as the number of histogram buckets used in the previous histogram collection. |
| AUTO | The OceanBase Database optimizer decides whether to collect histograms for a column based on the usage of the column. The number of histogram buckets is the default value, 254. |
| SKEWONLY | Collects histograms for only columns with uneven data distribution. The number of histogram buckets is the default value, 254. |
In addition, the DBMS_STATS package provides GATHER_DATABASE_STATS_JOB_PROC to collect statistics for all tables in the entire database. All collection strategies use the default Prefs settings. For more information about how to set Prefs, see Manage collection strategies (Prefs).
PROCEDURE GATHER_DATABASE_STATS_JOB_PROC();
Examples
The following examples show how to use the DBMS_STATS package to collect statistics:
Collect global statistics for the
tbl1table of useruser.CALL dbms_stats.gather_table_stats('user', 'tbl1', granularity=>'GLOBAL', method_opt=>'FOR ALL COLUMNS SIZE AUTO');Collect partition-level statistics for the
t_part1table of useruser. The parallelism is 64, and histograms are collected for only columns with uneven data distribution.CALL dbms_stats.gather_table_stats('user', 't_part1', degree=>'64', granularity=>'PARTITION', method_opt=>'FOR ALL COLUMNS SIZE SKEWONLY');Collect all statistics for the
t_subpart1table of useruser. The parallelism is 128, and only 50% of the data is collected. The optimizer decides whether to collect histograms for each column.CALL dbms_stats.gather_table_stats('user', 't_subpart1', degree=>'128', estimate_percent=> '50', granularity=>'ALL', method_opt=>'FOR ALL COLUMNS SIZE AUTO');
Relationship between table-level and index-level statistics collection strategies and the GRANULARITY parameter
By default, when you collect table-level statistics, the statistics of all indexes on the table are also collected. Some statistics of an index table can be obtained from the statistics of the base table, such as the number of rows and average row length (calculated based on several columns). Therefore, you can accelerate the statistics collection and avoid rescanning the entire table data. The GRANULARITY parameter is different for local and global indexes, as shown in the following table.
| Partition specified? | GRANULARITY | Global Index | Local Index |
|---|---|---|---|
| No | ALL | GLOBAL | ALL |
| No | AUTO | GLOBAL | AUTO |
| No | DEFAULT | GLOBAL | DEFAULT |
| No | GLOBAL AND PARTITION | GLOBAL | GLOBAL AND PARTITION |
| No | APPROX_GLOBAL AND PARTITION | GLOBAL | APPROX_GLOBAL AND PARTITION |
| No | GLOBAL | GLOBAL | GLOBAL |
| No | PARTITION | GLOBAL | PARTITION |
| No | SUBPARTITION | None | SUBPARTITION |
| Yes | ALL | GLOBAL | ALL |
| Yes | AUTO | GLOBAL | AUTO |
| Yes | DEFAULT | GLOBAL | DEFAULT |
| Yes | GLOBAL AND PARTITION | GLOBAL | GLOBAL AND PARTITION |
| Yes | APPROX_GLOBAL AND PARTITION | None | APPROX_GLOBAL AND PARTITION |
| Yes | GLOBAL | GLOBAL | GLOBAL |
| Yes | PARTITION | None | PARTITION |
| Yes | SUBPARTITION | None | SUBPARTITION |
Collect statistics by using the ANALYZE statement
In Oracle-compatible mode and MySQL-compatible mode, OceanBase Database allows you to collect statistics by using the ANALYZE statement.
Syntax of the ANALYZE statement in Oracle-compatible mode
The syntax of the ANALYZE statement in Oracle-compatible mode is as follows:
analyze_stmt:
ANALYZE TABLE table_name [use_partition] analyze_statistics_clause
use_partition:
PARTITION (partition_name [,partition_name...])
| SUBPARTITION(subpartition_name [,subpartition_name...])
analyze_statistics_clause:
COMPUTE STATISTICS [analyze_for_clause]
| ESTIMATE STATISTICS [analyze_for_clause] [SAMPLE INTNUM {ROWS | PERCENTAGE}]
analyze_for_clause:
FOR TABLE
| FOR ALL [INDEXED | HIDDEN] COLUMNS [size_clause]
| FOR COLUMNS [size_clause] column [size_clause] [,column [size_clause]...]
size_clause:
SIZE integer
| SIZE REPEAT
| SIZE AUTO
| SIZE SKEWONLY
column:
column_name
| (column_name [,column_name])
In the preceding syntax, the histogram collection method for columns is the same as the method_opt syntax of the DBMS_STATS package. For more information, see method_opt.
Compared with the method of collecting statistics by using the DBMS_STATS package, the ANALYZE statement provides fewer strategy settings. To support the collection of only GLOBAL statistics by using the ANALYZE statement, the rule is to set partition_name to the table name in the use_partition syntax.
Here are some examples of collecting statistics by using the ANALYZE statement in Oracle-compatible mode:
Collect statistics for the
tbl1table of theuseruser.ANALYZE TABLE tbl1 COMPUTE STATISTICS FOR ALL COLUMNS SIZE AUTO;Collect
GLOBALstatistics for thet_part1table of theuseruser, and collect histograms for only columns with uneven data distribution.ANALYZE TABLE t_part1 PARTITION (t_part1) COMPUTE STATISTICS FOR ALL COLUMNS SIZE SKEWONLY;Collect statistics for the
p0sp0andp1sp2partitions of thet_subpart1table of theuseruser. The histograms for all columns are determined internally by the optimizer.ANALYZE TABLE t_subpart1 SUBPARTITION(p0sp0,p1sp2) COMPUTE STATISTICS FOR ALL COLUMNS SIZE AUTO;
Syntax of the ANALYZE statement in MySQL-compatible mode
The syntax of the ANALYZE statement in MySQL-compatible mode is as follows:
analyze_stmt:
ANALYZE TABLE table_name UPDATE HISTOGRAM ON column_name_list WITH INTNUM BUCKETS
When the tenant-level parameter enable_sql_extension is set to TRUE, you can use the Oracle-compatible mode syntax, which is an extended syntax provided by OceanBase Database for MySQL-compatible mode.
analyze_stmt:
ANALYZE TABLE table_name [use_partition] analyze_statistics_clause
Here are some examples of collecting statistics by using the ANALYZE statement in MySQL-compatible mode:
Collect statistics for the
tbl1table, and set the number of buckets for each column to 30.ANALYZE TABLE tbl1 UPDATE HISTOGRAM ON a, b, c, d WITH 30 BUCKETS;Use the Oracle-compatible mode syntax to collect statistics for the
tbl1table of thetestuser in MySQL-compatible mode.ALTER SYSTEM SET ENABLE_SQL_EXTENSION = TRUE; ANALYZE TABLE tbl1 COMPUTE STATISTICS FOR ALL COLUMNS SIZE AUTO;
Scenarios for manual statistics collection
Statistics collection for non-partitioned tables
When the product of the number of rows and the number of columns in a table does not exceed 10 million, you can use the following commands to collect statistics for the table. For example, the test.t1 table has 10 columns and 1 million rows.
CREATE TABLE test.t1(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT, c6 INT, c7 INT, c8 INT, c9 INT, c10 INT);
CREATE TABLE t2 (c1 INT PRIMARY KEY);
INSERT INTO t2 VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
INSERT /*+APPEND*/ INTO t1 SELECT * FROM (SELECT s0.c1 c1, s1.c1 c2, s2.c1 c3, s3.c1 c4, s4.c1 c5, s5.c1 c6, s0.c1 * 10 c7, s1.c1 * 10 c8, s2.c1 * 10 c9, s3.c1 * 10 c10 FROM t2 s0, t2 s1, t2 s2, t2 s3, t2 s4, t2 s5);
# Scenario where histograms are not collected
CALL DBMS_STATS.GATHER_TABLE_STATS('test', 't1', method_opt=>'for all columns size 1');
# Scenario where histograms are collected and the default strategy is used
CALL DBMS_STATS.GATHER_TABLE_STATS('test', 't1');
When the product of the number of rows and the number of columns in a table exceeds 10 million, you can increase the parallelism based on the data business scenario and system resources to speed up statistics collection. For example, the test.t1 table has 10 million rows, and the parallelism is set to 8.
# Scenario where histograms are not collected
CALL DBMS_STATS.GATHER_TABLE_STATS('test', 't1', degree=>8, method_opt=>'for all columns size 1');
# Scenario where histograms are collected and the default strategy is used
CALL DBMS_STATS.GATHER_TABLE_STATS('test', 't1', degree=>8);
Statistics collection for partitioned tables
For partitioned tables, statistics collection strategies need to consider partition statistics collection. If system resources allow, it is recommended to increase the parallelism by 100% compared to non-partitioned tables.
For example, the test.t_part table has 10 columns and 1 million rows. Since partition statistics collection is added, the parallelism is increased to 2.
CREATE TABLE t_part(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT, c6 INT, c7 INT, c8 INT, c9 INT, c10 INT) PARTITION BY HASH(c1) PARTITIONS 128;
CREATE TABLE t2 (c1 INT PRIMARY KEY);
INSERT INTO t2 VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
INSERT /*+APPEND*/ INTO t_part SELECT * FROM (SELECT s0.c1 c1, s1.c1 c2, s2.c1 c3, s3.c1 c4, s4.c1 c5, s5.c1 c6, s0.c1 * 10 c7, s1.c1 * 10 c8, s2.c1 * 10 c9, s3.c1 * 10 c10 FROM t2 s0, t2 s1, t2 s2, t2 s3, t2 s4, t2 s5);
In addition to increasing the parallelism, you can also consider collecting partition statistics and deriving global statistics from them to improve the efficiency of statistics collection. For example, in the same scenario, you can adjust the partition collection method without increasing the parallelism. Here is an example:
# Scenario where histograms are not collected
CALL DBMS_STATS.GATHER_TABLE_STATS('test', 't_part', granularity=>'APPROX_GLOBAL AND PARTITION', method_opt=>'for all columns size 1');
# Scenario where histograms are collected and the default strategy is used
CALL DBMS_STATS.GATHER_TABLE_STATS('test', 't_part', granularity=>'APPROX_GLOBAL AND PARTITION');
For more information about statistics collection and update strategies for partitioned tables, and detailed examples, see Strategies for collecting and updating statistics for partitioned tables.
Statistics collection at the schema (database) level
In addition to manually collecting statistics for individual tables, OceanBase Database provides the DBMS_STATS package for collecting statistics for all tables under a user.
Note
This feature is supported only in OceanBase Database V4.x in dual mode and OceanBase Database V3.2.x in Oracle-compatible mode.
Since collecting statistics for all tables under a user can be time-consuming, it is recommended to use this feature during off-peak hours.
If all tables under the user are small tables (with no more than 1 million rows), you can use the following commands to collect statistics for the test user.
# Scenario where histograms are not collected
CALL DBMS_STATS.GATHER_SCHEMA_STATS('test', method_opt=>'for all columns size 1');
# Scenario where histograms are collected and the default strategy is used
CALL DBMS_STATS.GATHER_SCHEMA_STATS('test');
If there are large tables (with tens of millions of rows) under the user, you can increase the parallelism during off-peak hours to speed up statistics collection:
# Scenario where histograms are not collected
CALL DBMS_STATS.GATHER_SCHEMA_STATS('test', degree=>'16', method_opt=>'for all columns size 1');
# Scenario where histograms are collected and the default strategy is used
CALL DBMS_STATS.GATHER_SCHEMA_STATS('test', degree=>'16');
If there are super-large tables (with more than 100 million rows) under the user, you can collect statistics for the super-large tables separately with a higher parallelism, lock the statistics of the super-large tables, and then use the above commands to collect statistics for the entire user. After the collection is complete, unlock the statistics of the super-large tables and collect statistics incrementally in the future. Here is an example:
CALL DBMS_STATS.GATHER_TABLE_STATS('test', 'big_table', degree=>128, method_opt=>'for all columns size 1');
CALL DBMS_STATS.LOCK_TABLE_STATS('test','big_table');
CALL DBMS_STATS.GATHER_SCHEMA_STATS('TEST', degree=>'16', method_opt=>'for all columns size 1');
CALL DBMS_STATS.UNLOCK_TABLE_STATS('test','big_table');