Manually collect statistics

2026-03-06 07:02:42  Updated

The OceanBase Database optimizer provides two methods for manually collecting statistics: the DBMS_STATS package and the ANALYZE statement.

Note

OceanBase Database does not support estimating the time required to collect statistics. We recommend that you set the ob_query_timeout parameter to a sufficiently large value to ensure that statistics collection is completed. If statistics collection for schema-level statistics is interrupted due to a timeout, the statistics collected so far will be retained.

Considerations and limitations

  1. We recommend that you set the ob_query_timeout variable to a larger value at the session level. This setting only affects the current session and does not interfere with other operations.
  2. The system does not support estimating the time required to collect statistics. We recommend that you set ob_query_timeout to a sufficiently large value to ensure that statistics are collected. In particular, if statistics collection for schema-level statistics is interrupted due to a timeout, the statistics collected so far will be retained. This way, you do not need to re-collect all the data after the timeout, saving time.

Collect statistics by using the DBMS_STATS package

The OceanBase Database optimizer supports collecting table-level, schema-level, and index-level statistics by using the DBMS_STATS package. You can call the stored procedures gather_table_stats, gather_schema_stats, and gather_index_stats respectively.

PROCEDURE gather_table_stats (
  ownname            VARCHAR2,
  tabname            VARCHAR2,
  partname           VARCHAR2 DEFAULT NULL,
  estimate_percent   NUMBER DEFAULT AUTO_SAMPLE_SIZE,
  block_sample       BOOLEAN DEFAULT FALSE,
  method_opt         VARCHAR2 DEFAULT DEFAULT_METHOD_OPT,
  degree             NUMBER DEFAULT NULL,
  granularity        VARCHAR2 DEFAULT DEFAULT_GRANULARITY,
  cascade            BOOLEAN DEFAULT NULL,
  stattab            VARCHAR2 DEFAULT NULL,
  statid             VARCHAR2 DEFAULT NULL,
  statown            VARCHAR2 DEFAULT NULL,
  no_invalidate      BOOLEAN DEFAULT FALSE,
  stattype           VARCHAR2 DEFAULT 'DATA',
  force              BOOLEAN DEFAULT FALSE
);

PROCEDURE gather_schema_stats (
  ownname            VARCHAR2,
  estimate_percent   NUMBER DEFAULT AUTO_SAMPLE_SIZE,
  block_sample       BOOLEAN DEFAULT FALSE,
  method_opt         VARCHAR2 DEFAULT DEFAULT_METHOD_OPT,
  degree             NUMBER DEFAULT NULL,
  granularity        VARCHAR2 DEFAULT DEFAULT_GRANULARITY,
  cascade            BOOLEAN DEFAULT NULL,
  stattab            VARCHAR2 DEFAULT NULL,
  statid             VARCHAR2 DEFAULT NULL,
  statown            VARCHAR2 DEFAULT NULL,
  no_invalidate      BOOLEAN DEFAULT FALSE,
  stattype           VARCHAR2 DEFAULT 'DATA',
  force              BOOLEAN DEFAULT FALSE
);

PROCEDURE gather_index_stats (
  ownname            VARCHAR2,
  indname            VARCHAR2,
  partname           VARCHAR2 DEFAULT NULL,
  estimate_percent   NUMBER DEFAULT AUTO_SAMPLE_SIZE,
  stattab            VARCHAR2 DEFAULT NULL,
  statid             VARCHAR2 DEFAULT NULL,
  statown            VARCHAR2 DEFAULT NULL,
  degree             NUMBER DEFAULT NULL,
  granularity        VARCHAR2 DEFAULT DEFAULT_GRANULARITY,
  no_invalidate      BOOLEAN DEFAULT FALSE,
  force              BOOLEAN DEFAULT FALSE,
  tabname            VARCHAR2  DEFAULT NULL
);

The following table describes the parameters.

Parameter Description
ownname
  • In Oracle-compatible mode: the username. If this parameter is set to NULL, the username of the current login user is used by default.
  • In MySQL-compatible mode: the name of the database where the table is located.
tabname The name of the table.
indname The name of the index.
partname The name of the partition. The default value is NULL.
estimate_percent The percentage of data to be used to calculate the distribution characteristics. Valid values: [0.000001,100]. If this parameter is set to NULL, all data is used. The default value is AUTO_SAMPLE_SIZE, which means that the optimizer determines the percentage of data to be used.
block_sample Whether to use block sampling instead of row sampling. The default value is FALSE.
method_opt The method for collecting column-level statistics. For more information, see method_opt.
degree The parallelism for collecting statistics. The default value is NULL.
granularity The partition level for collecting statistics. Valid values:
  • 'GLOBAL': collect global-level statistics.
  • 'PARTITION': collect partition-level statistics.
  • 'SUBPARTITION': collect subpartition-level statistics.
  • 'ALL' : collect statistics at all levels, including GLOBAL, PARTITION, and SUBPARTITION levels.
  • 'AUTO' : collect statistics at all levels, including GLOBAL, PARTITION, and SUBPARTITION levels, by using the default method. If this parameter is not specified, the default value is 'AUTO'.
  • 'DEFAULT': collect statistics at GLOBAL and PARTITION levels.
  • 'GLOBAL AND PARTITION': collect statistics at GLOBAL and PARTITION levels.
  • 'APPROX_GLOBAL AND PARTITION': collect partition-level statistics and infer global-level statistics based on the partition information.
cascade Whether to collect index statistics of the table at the same time. The default value is TRUE.
stattab This parameter is not implemented and cannot be used.
statid This parameter is not implemented and cannot be used.
statown This parameter is not implemented and cannot be used.
no_invalidate This parameter is not implemented and cannot be used.
stattype This parameter is not implemented and cannot be used.
force Whether to forcibly collect statistics and ignore the lock status. The default value is FALSE. This parameter can be specified only in the BEGIN ... END block of PL/SQL.

method_opt

method_opt is mainly used with the following syntax:

method_opt:
  FOR ALL [INDEXED | HIDDEN] COLUMNS [size_clause]
| FOR COLUMNS [size clause] column [size_clause] [,column [size_clause]...]

size_clause:
  SIZE integer 
| SIZE REPEAT
| SIZE AUTO
| SIZE SKEWONLY

column:
   column_name
| (column_name [, column_name])

The following table describes the parameters.

Parameter Description
SIZE integer Specifies the number of histogram buckets to collect for a column. Valid values: [1, 2048].
REPEAT Collects histograms for only columns that have been previously collected. The number of histogram buckets is the same as the number of histogram buckets used in the previous histogram collection.
AUTO The OceanBase Database optimizer decides whether to collect histograms for a column based on the usage of the column. The number of histogram buckets is the default value, 254.
SKEWONLY Collects histograms for only columns with uneven data distribution. The number of histogram buckets is the default value, 254.

In addition, the DBMS_STATS package provides GATHER_DATABASE_STATS_JOB_PROC to collect statistics for all tables in the entire database. All collection strategies use the default Prefs settings. For more information about how to set Prefs, see Manage collection strategies (Prefs).

PROCEDURE GATHER_DATABASE_STATS_JOB_PROC();

Examples

The following examples show how to use the DBMS_STATS package to collect statistics:

  • Collect global statistics for the tbl1 table of user user.

    CALL dbms_stats.gather_table_stats('user', 'tbl1', granularity=>'GLOBAL', method_opt=>'FOR ALL COLUMNS SIZE AUTO');
    
  • Collect partition-level statistics for the t_part1 table of user user. The parallelism is 64, and histograms are collected for only columns with uneven data distribution.

    CALL dbms_stats.gather_table_stats('user', 't_part1', degree=>'64', granularity=>'PARTITION', method_opt=>'FOR ALL COLUMNS SIZE SKEWONLY');
    
  • Collect all statistics for the t_subpart1 table of user user. The parallelism is 128, and only 50% of the data is collected. The optimizer decides whether to collect histograms for each column.

    CALL dbms_stats.gather_table_stats('user', 't_subpart1', degree=>'128', estimate_percent=> '50', granularity=>'ALL', method_opt=>'FOR ALL COLUMNS SIZE AUTO');
    

Relationship between table-level and index-level statistics collection strategies and the GRANULARITY parameter

By default, when you collect table-level statistics, the statistics of all indexes on the table are also collected. Some statistics of an index table can be obtained from the statistics of the base table, such as the number of rows and average row length (calculated based on several columns). Therefore, you can accelerate the statistics collection and avoid rescanning the entire table data. The GRANULARITY parameter is different for local and global indexes, as shown in the following table.

Partition specified? GRANULARITY Global Index Local Index
No ALL GLOBAL ALL
No AUTO GLOBAL AUTO
No DEFAULT GLOBAL DEFAULT
No GLOBAL AND PARTITION GLOBAL GLOBAL AND PARTITION
No APPROX_GLOBAL AND PARTITION GLOBAL APPROX_GLOBAL AND PARTITION
No GLOBAL GLOBAL GLOBAL
No PARTITION GLOBAL PARTITION
No SUBPARTITION None SUBPARTITION
Yes ALL GLOBAL ALL
Yes AUTO GLOBAL AUTO
Yes DEFAULT GLOBAL DEFAULT
Yes GLOBAL AND PARTITION GLOBAL GLOBAL AND PARTITION
Yes APPROX_GLOBAL AND PARTITION None APPROX_GLOBAL AND PARTITION
Yes GLOBAL GLOBAL GLOBAL
Yes PARTITION None PARTITION
Yes SUBPARTITION None SUBPARTITION

Collect statistics by using the ANALYZE statement

In Oracle-compatible mode and MySQL-compatible mode, OceanBase Database allows you to collect statistics by using the ANALYZE statement.

Syntax of the ANALYZE statement in Oracle-compatible mode

The syntax of the ANALYZE statement in Oracle-compatible mode is as follows:

analyze_stmt:
ANALYZE TABLE table_name [use_partition] analyze_statistics_clause

use_partition:
  PARTITION (partition_name [,partition_name...])
| SUBPARTITION(subpartition_name [,subpartition_name...])

analyze_statistics_clause:
  COMPUTE STATISTICS [analyze_for_clause]
| ESTIMATE STATISTICS [analyze_for_clause] [SAMPLE INTNUM {ROWS | PERCENTAGE}]

analyze_for_clause:
  FOR TABLE
| FOR ALL [INDEXED | HIDDEN] COLUMNS [size_clause]
| FOR COLUMNS [size_clause] column [size_clause] [,column [size_clause]...]

size_clause:
  SIZE integer 
| SIZE REPEAT
| SIZE AUTO
| SIZE SKEWONLY

column:
  column_name
| (column_name [,column_name])

In the preceding syntax, the histogram collection method for columns is the same as the method_opt syntax of the DBMS_STATS package. For more information, see method_opt.

Compared with the method of collecting statistics by using the DBMS_STATS package, the ANALYZE statement provides fewer strategy settings. To support the collection of only GLOBAL statistics by using the ANALYZE statement, the rule is to set partition_name to the table name in the use_partition syntax.

Here are some examples of collecting statistics by using the ANALYZE statement in Oracle-compatible mode:

  • Collect statistics for the tbl1 table of the user user.

    ANALYZE TABLE tbl1 COMPUTE STATISTICS FOR ALL COLUMNS SIZE AUTO;
    
  • Collect GLOBAL statistics for the t_part1 table of the user user, and collect histograms for only columns with uneven data distribution.

    ANALYZE TABLE t_part1 PARTITION (t_part1) COMPUTE STATISTICS FOR ALL COLUMNS SIZE SKEWONLY;
    
  • Collect statistics for the p0sp0 and p1sp2 partitions of the t_subpart1 table of the user user. The histograms for all columns are determined internally by the optimizer.

    ANALYZE TABLE t_subpart1 SUBPARTITION(p0sp0,p1sp2) COMPUTE STATISTICS FOR ALL COLUMNS SIZE AUTO;
    

Syntax of the ANALYZE statement in MySQL-compatible mode

The syntax of the ANALYZE statement in MySQL-compatible mode is as follows:

analyze_stmt:
ANALYZE TABLE table_name UPDATE HISTOGRAM ON column_name_list WITH INTNUM BUCKETS

When the tenant-level parameter enable_sql_extension is set to TRUE, you can use the Oracle-compatible mode syntax, which is an extended syntax provided by OceanBase Database for MySQL-compatible mode.

analyze_stmt:
ANALYZE TABLE table_name [use_partition] analyze_statistics_clause

Here are some examples of collecting statistics by using the ANALYZE statement in MySQL-compatible mode:

  • Collect statistics for the tbl1 table, and set the number of buckets for each column to 30.

    ANALYZE TABLE tbl1 UPDATE HISTOGRAM ON a, b, c, d WITH 30 BUCKETS;
    
  • Use the Oracle-compatible mode syntax to collect statistics for the tbl1 table of the test user in MySQL-compatible mode.

    ALTER SYSTEM SET ENABLE_SQL_EXTENSION = TRUE;
    ANALYZE TABLE tbl1 COMPUTE STATISTICS FOR ALL COLUMNS SIZE AUTO;
    

Scenarios for manual statistics collection

Statistics collection for non-partitioned tables

When the product of the number of rows and the number of columns in a table does not exceed 10 million, you can use the following commands to collect statistics for the table. For example, the test.t1 table has 10 columns and 1 million rows.

CREATE TABLE test.t1(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT, c6 INT, c7 INT, c8 INT, c9 INT, c10 INT);
CREATE TABLE t2 (c1 INT PRIMARY KEY);
INSERT INTO t2 VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
INSERT /*+APPEND*/ INTO t1 SELECT * FROM (SELECT s0.c1 c1, s1.c1 c2, s2.c1 c3, s3.c1 c4, s4.c1 c5, s5.c1 c6, s0.c1 * 10 c7, s1.c1 * 10 c8, s2.c1 * 10 c9, s3.c1 * 10 c10 FROM t2 s0, t2 s1, t2 s2, t2 s3, t2 s4, t2 s5);

# Scenario where histograms are not collected
CALL DBMS_STATS.GATHER_TABLE_STATS('test', 't1', method_opt=>'for all columns size 1');

# Scenario where histograms are collected and the default strategy is used
CALL DBMS_STATS.GATHER_TABLE_STATS('test', 't1');

When the product of the number of rows and the number of columns in a table exceeds 10 million, you can increase the parallelism based on the data business scenario and system resources to speed up statistics collection. For example, the test.t1 table has 10 million rows, and the parallelism is set to 8.

# Scenario where histograms are not collected
CALL DBMS_STATS.GATHER_TABLE_STATS('test', 't1', degree=>8, method_opt=>'for all columns size 1');

# Scenario where histograms are collected and the default strategy is used
CALL DBMS_STATS.GATHER_TABLE_STATS('test', 't1', degree=>8);

Statistics collection for partitioned tables

For partitioned tables, statistics collection strategies need to consider partition statistics collection. If system resources allow, it is recommended to increase the parallelism by 100% compared to non-partitioned tables.

For example, the test.t_part table has 10 columns and 1 million rows. Since partition statistics collection is added, the parallelism is increased to 2.

CREATE TABLE t_part(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT, c6 INT, c7 INT, c8 INT, c9 INT, c10 INT) PARTITION BY HASH(c1) PARTITIONS 128;
CREATE TABLE t2 (c1 INT PRIMARY KEY);
INSERT INTO t2 VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
INSERT /*+APPEND*/ INTO t_part SELECT * FROM (SELECT s0.c1 c1, s1.c1 c2, s2.c1 c3, s3.c1 c4, s4.c1 c5, s5.c1 c6, s0.c1 * 10 c7, s1.c1 * 10 c8, s2.c1 * 10 c9, s3.c1 * 10 c10 FROM t2 s0, t2 s1, t2 s2, t2 s3, t2 s4, t2 s5);

In addition to increasing the parallelism, you can also consider collecting partition statistics and deriving global statistics from them to improve the efficiency of statistics collection. For example, in the same scenario, you can adjust the partition collection method without increasing the parallelism. Here is an example:

# Scenario where histograms are not collected
CALL DBMS_STATS.GATHER_TABLE_STATS('test', 't_part', granularity=>'APPROX_GLOBAL AND PARTITION', method_opt=>'for all columns size 1');

# Scenario where histograms are collected and the default strategy is used
CALL DBMS_STATS.GATHER_TABLE_STATS('test', 't_part', granularity=>'APPROX_GLOBAL AND PARTITION');

For more information about statistics collection and update strategies for partitioned tables, and detailed examples, see Strategies for collecting and updating statistics for partitioned tables.

Statistics collection at the schema (database) level

In addition to manually collecting statistics for individual tables, OceanBase Database provides the DBMS_STATS package for collecting statistics for all tables under a user.

Note

This feature is supported only in OceanBase Database V4.x in dual mode and OceanBase Database V3.2.x in Oracle-compatible mode.

Since collecting statistics for all tables under a user can be time-consuming, it is recommended to use this feature during off-peak hours.

If all tables under the user are small tables (with no more than 1 million rows), you can use the following commands to collect statistics for the test user.

# Scenario where histograms are not collected
CALL DBMS_STATS.GATHER_SCHEMA_STATS('test', method_opt=>'for all columns size 1');

# Scenario where histograms are collected and the default strategy is used
CALL DBMS_STATS.GATHER_SCHEMA_STATS('test');

If there are large tables (with tens of millions of rows) under the user, you can increase the parallelism during off-peak hours to speed up statistics collection:

# Scenario where histograms are not collected
CALL DBMS_STATS.GATHER_SCHEMA_STATS('test', degree=>'16', method_opt=>'for all columns size 1');


# Scenario where histograms are collected and the default strategy is used
CALL DBMS_STATS.GATHER_SCHEMA_STATS('test', degree=>'16');

If there are super-large tables (with more than 100 million rows) under the user, you can collect statistics for the super-large tables separately with a higher parallelism, lock the statistics of the super-large tables, and then use the above commands to collect statistics for the entire user. After the collection is complete, unlock the statistics of the super-large tables and collect statistics incrementally in the future. Here is an example:

CALL DBMS_STATS.GATHER_TABLE_STATS('test', 'big_table', degree=>128, method_opt=>'for all columns size 1');

CALL DBMS_STATS.LOCK_TABLE_STATS('test','big_table');

CALL DBMS_STATS.GATHER_SCHEMA_STATS('TEST', degree=>'16', method_opt=>'for all columns size 1');

CALL DBMS_STATS.UNLOCK_TABLE_STATS('test','big_table');

Contact Us