Dynamic sampling|V4.2.0| docs|Distributed Database

Dynamic sampling

Last Updated：2023-10-31 11:17:12 Updated

Starting from OceanBase Database V4.2.0, the optimizer can perform dynamic sampling. This feature allows the optimizer to collect required statistics at SQL runtime to generate better execution plans, thus optimizing query performance. The dynamic sampling feature is automatically applied to SQL query statements in OceanBase Database, making execution plans more accurate and efficient.

Principle

By performing dynamic sampling, the optimizer can not only obtain sufficient statistics, but also sample the target database objects in advance to estimate the number of rows during plan generation. The estimation result is then applied in the cost model to generate a better execution plan.

The dynamic sampling feature of OceanBase Database provides the following benefits:

Dynamic sampling can obtain more accurate statistics based on stale or even insufficient statistics.
Dynamic sampling can provide more accurate statistics in queries that contain complex predicates, associative predicates, and so on.
Dynamic sampling can reduce the time and cost consumed for statistics collection. For example, for wide tables, regular statistics collection can be time-consuming and resource-intensive.
Dynamic sampling can improve the query efficiency by dynamically adjusting the execution plans during a query to adapt to data changes.

Before an SQL query is executed, the OceanBase Database optimizer collects table and index statistics to select the best execution plan. If the statistics are inaccurate or incomplete, the optimizer may select an execution plan other than the optimal one, resulting in poor query performance. Basic statistics are usually collected automatically or manually. However, if the data distribution changes or statistics are not collected, or in some complex SQL query scenarios, the statistics may no longer be accurate.

In the following example, both tables t1 and t2 have 1000 rows of data, and statistics are collected on neither table.

CREATE TABLE t1(c1 INT, c2 INT, c3 INT);
CREATE TABLE t2(c1 INT, c2 INT, c3 INT);
CREATE INDEX  idx_c1 ON t2(c1);
INSERT INTO t1 SELECT level,level,level FROM DUAL CONNECT BY level<=1000;
INSERT INTO t2 SELECT level,level,level FROM DUAL CONNECT BY level<=1000;

The execution plan of the query SELECT * FROM t1, t2 WHERE t1.c1 = t2.c1 AND t1.c2 = 1000 is as follows:

+---------------------------------------------------------------------------------------------+
| Query Plan                                                                                  |
+---------------------------------------------------------------------------------------------+
| ====================================================================                        |
| |ID|OPERATOR                      |NAME      |EST.ROWS|EST.TIME(us)|                        |
| --------------------------------------------------------------------                        |
| |0 |HASH JOIN                     |          |98      |199         |                        |
| |1 |├─TABLE SCAN                  |T1        |10      |44          |                        |
| |2 |└─TABLE SCAN                  |T2        |1000    |61          |                        |
| ====================================================================                        |

You can see that a hash join is used to join the two tables. However, only one row of data meets the t1.c2 = 1000 condition. Therefore, a nested loop join can be used to push the join condition t1.c1 = t2.c1 down to the base table t2, so that the t2 table can also use the idx_c1 index. In this case, the execution performance of the plan is improved. The following example shows an execution plan with better performance.

+---------------------------------------------------------------------------------------------+
| Query Plan                                                                                  |
+---------------------------------------------------------------------------------------------+
| ====================================================================                        |
| |ID|OPERATOR                      |NAME      |EST.ROWS|EST.TIME(us)|                        |
| --------------------------------------------------------------------                        |
| |0 |NESTED-LOOP JOIN              |          |1       |21          |                        |
| |1 |├─TABLE FULL SCAN             |T1        |1       |2           |                        |
| |2 |└─DISTRIBUTED TABLE RANGE SCAN|T2(IDX_C1)|1       |18          |                        |
| ====================================================================

Dynamic sampling, therefore, can provide more accurate statistics and help the optimizer select a better execution plan.

You can better understand the principle of dynamic sampling in OceanBase Database from the following example.

CREATE TABLE t1(c1 INT, c2 INT, c3 INT, c4 INT);
CREATE INDEX idx_c1 ON t1(c1);
CREATE INDEX idx_c2 ON t1(c2);

Q1: SELECT c4 FROM t1 WHERE c1 > 1 AND c2 > 1 AND c3 > 1 GROUP BY c4;
Q2: SELECT /*+PARALLEL(2)*/c4 FROM t1 WHERE c1 > 1 AND c2 > 1 AND c3 > 1 GROUP BY c4;

For the queries in the preceding example, the following three base table paths are generated:

Primary table path
Index table path idx_c1
index table path idx_c2

Dynamic sampling constructs the following sampling SQL statements based on the three paths:

Estimate the number of rows in the base table: count(*).
Obtain the basic statistics on the c4 column: approx_count_distinct(c1) and sum(case when c1 is null then 1 else 0 end).
Estimate the number of rows that meet all predicates: sum(case when c1 > 1 and c2 > 1 and c3 > 1 then 1 else 0 end).
Estimate the number of rows in the query range of the index table path idx_c1: sum(case when c1 > 1 then 1 else 0 end).
Estimate the number of rows in the query range of the index table path idx_c2: sum(case when c2 > 1 then 1 else 0 end).

Calculate the sampling rate based on the number of microblocks. If the sampling rate of the Q1 query is ratio, the sampling rate of the Q2 query is 2 × ratio. Dynamic sampling constructs the following two sampling SQL statements:

DYNAMIC SAMPLING Q1:
SELECT
/*+ NO_REWRITE
    NO_PARALLEL
    DYNAMIC_SAMPLING(0)
    QUERY_TIMEOUT(1000000)
*/
count(*),
approx_count_distinct("C4"),
Sum(CASE WHEN "C4" IS NULL THEN 1 ELSE 0 END),
Sum(CASE WHEN ( "C1" > 1 ) AND ( "C2" > 1 ) AND ( "C3" > 1 ) THEN 1 ELSE 0 END),
Sum(CASE WHEN ( "C1" > 1 ) THEN 1 ELSE 0 END),
Sum(CASE WHEN ( "C2" > 1 ) THEN 1 ELSE 0 END)
FROM "TEST"."T1" SAMPLE BLOCK(ratio) SEED(seed);

DYNAMIC SAMPLING Q2:
SELECT
/*+ NO_REWRITE
    PARALLEL(64)
    DYNAMIC_SAMPLING(0)
    QUERY_TIMEOUT(1000000)
*/
count(*),
approx_count_distinct("C4"),
Sum(CASE WHEN "C4" IS NULL THEN 1 ELSE 0 END),
Sum(CASE WHEN ( "C1" > 1 ) AND ( "C2" > 1 ) AND ( "C3" > 1 ) THEN 1 ELSE 0 END),
Sum(CASE WHEN ( "C1" > 1 ) THEN 1 ELSE 0 END),
Sum(CASE WHEN ( "C2" > 1 ) THEN 1 ELSE 0 END)
FROM "TEST"."T1" SAMPLE BLOCK(2*ratio);

Keywords in the preceding example are described as follows:

NO_REWRITE: specifies not to perform rewriting, but to scan the base table without sampling.
NO_PARALLEL: specifies not to enable parallel sampling. Parallel sampling is not performed by default because the original SQL statement does not explicitly specify the DOP.
PARALLEL(x): specifies to use parallel sampling. If the original SQL statement specifies the DOP, parallel sampling is enabled for the sampling SQL statement.
DYNAMIC_SAMPLING(0): specifies whether to enable dynamic sampling. The value 0 indicates that dynamic sampling is disabled for SQL queries.
QUERY_TIMEOUT: the maximum execution time for the sampling SQL statement.

Application scenarios of dynamic sampling

By default, the dynamic sampling feature takes effect on the SQL statements of the current user. Currently, dynamic sampling is supported only for base tables. When the dynamic sampling feature is not disabled, the optimizer attempts to use dynamic sampling during the plan generation phase in the following scenarios:

Scenarios without statistics.
Scenarios where complex predicates exist in query conditions, such as c1 like '%test%'. In such scenarios, the selectivity formula cannot be used to estimate the number of rows.
Scenarios where the user specifies to use dynamic sampling.

Methods for controlling dynamic sampling

Currently, OceanBase Database allows you to control dynamic sampling by using a system variable, a query hint, or a system parameter. The size of the sampling set for dynamic sampling is restricted by the DOP.

Control dynamic sampling by using a system variable

You can use the system variable optimizer_dynamic_sampling to enable or disable the dynamic sampling feature. The syntaxes are as follows:

/* Enable global dynamic sampling. */
SET GLOBAL optimizer_dynamic_sampling = 1;

/* Enable dynamic sampling at the session level. */
SET SESSION optimizer_dynamic_sampling = 1;
SET optimizer_dynamic_sampling = 1;

/* Disable global dynamic sampling. */
SET GLOBAL optimizer_dynamic_sampling = 0;

/* Disable dynamic sampling at the session level. */
SET SESSION optimizer_dynamic_sampling = 0;
SET optimizer_dynamic_sampling = 0;

Control dynamic sampling by using a hint

You can use the DYNAMIC_SAMPLING hint to control whether to enable dynamic sampling in a query. The syntax is as follows:

/*+DYNAMIC_SAMPLING( [[ qb_name_option ] table_name ] INTNUM1 [, INTNUM2 ] ) */

The parameters in the DYNAMIC_SAMPLING hint are described as follows:

qb_name_option: the name of the query block. This parameter is optional.
table_name: the name of the table for dynamic sampling. This parameter is optional. If you do not specify this parameter, dynamic sampling is used for the entire query.
INTNUM1: the level of sampling. Valid values are 0 and 1. The value 1 specifies to enable dynamic sampling, and the value 0 specifies to disable dynamic sampling.
INTNUM2: the number of microblocks to be sampled. We recommend that you specify more than 32 microblocks.

The following example enables dynamic sampling for table t1 in a query.

SELECT /*+DYNAMIC_SAMPLING(t1 1)*/ c4 FROM t1 WHERE c1 > 1 AND c2 > 1 AND c3 > 1 GROUP BY c4;

Control dynamic sampling by using a system parameter

The maximum query time for dynamic sampling is 1/10 of the maximum query time. For example, if the timeout of a query is 10s, the maximum dynamic sampling time is 1s. To prevent dynamic sampling from taking too long, OceanBase Database allows you to specify the system parameter _optimizer_ads_time_limit to control the maximum time allowed for dynamic sampling. The default maximum time is 10s and the value range is [0s, 300s]. If the system parameter is set to 0, dynamic sampling is disabled.

The following example sets the maximum dynamic sampling time to 100s for the current tenant.

obclient [oceanbase]> ALTER SYSTEM SET _optimizer_ads_time_limit = 100;
Query OK, 0 rows affected

Considerations

Dynamic sampling is designed to provide sufficient statistics for the optimizer. It samples database objects in advance to estimate the number of rows during plan generation. The estimation result is then applied in the cost model to generate a better execution plan.

Dynamic sampling provides a better solution for the optimizer to obtain statistics when no statistics are available or accurate row count estimates cannot be made. Take note of the following considerations when you use the dynamic sampling feature:

Dynamic sampling uses block sampling by default. Therefore, for a better sampling effect, we recommend that you perform minor and major compactions after data is imported.
Dynamic sampling inevitably causes additional overhead in hard parsing. If the overhead is not tolerable for business in some transaction processing (TP) scenarios, you can choose to disable dynamic sampling.
Dynamic sampling can be used only as a supplement to statistics collection methods. Do not rely on this feature in business scenarios. Basic statistics collection still needs to be done.