Reduce row reads|V4.3.3| docs|Distributed Database

Reduce row reads

Last Updated：2024-11-11 07:37:15 Updated

Before you go further into index-based SQL tuning, you need to understand a key metric: the number of rows to read. In transaction processing (TP) business scenarios, high throughput is common, and ideally, the number of rows to read per request should be relatively small. The amount of data to read in a single execution is an important metric for evaluating the outcome of SQL tuning. The significance of this metric is straightforward: with the overall I/O capacity of a cluster being limited, a small I/O footprint from a single request means that the I/O burden from the entire business will also be relatively small, thereby minimizing pressure on the entire cluster. The associations between this metric and SQL execution are as follows:

During SQL execution, some filter conditions are used to determine the data scan range (or query range) on the index. The number of rows in this range is the number of rows to read. These filter conditions are referred to as query range filters or range_cond (for example, T1.C2 = ? in line 14 of the following output).
Other base table filter conditions in the SQL statement are called table filters (for example, T1.C1 = 1 in line 10 of the following output). After the data is read from the index, the kernel filters the data again based on table filters.

In most cases, creating an index is primarily aimed at fully utilizing the predicates in a query, narrowing the Query Range, and thereby reducing the number of rows read. Therefore, when dealing with a slow query, the first step is to analyze the query plan to determine what the query range filter is, whether the filter is selective enough, and how many rows to read in the index query range. Given a query and its execution plan, there are two ways to obtain the number of rows to read for this execution plan.

EXPLAIN SELECT * FROM T1 WHERE C1 = 1 AND C2 = 'A'
=========================================
|ID|OPERATOR  |NAME      |EST. ROWS|COST|
-----------------------------------------
|0 |TABLE SCAN|T1(IDX_C2)|1        |2695|
=========================================

Outputs & filters:
-------------------------------------
  0 - output([T1.C1], [T1.C2]), filter([T1.C1 = 1]),
      access([T1.C1], [T1.C2]), partitions(p0),
      is_index_back=true, filter_before_indexback[false],
      range_key([T1.C2], [T1.__pk_increment]), range(A,MIN ; A,MAX),
      range_cond([T1.C2 = ?])

T1:table_rows:10002, physical_range_rows:99, logical_range_rows:99, index_back_rows:99, output_rows:0, est_method:local_storage, optimization_method=cost_based, avaiable_index_name[IDX_C2], unstable_index_name[T1], estimation info[table_id:1100611139453780, (table_type:1, version:0-1658634108084971-1658634108084971, logical_rc:0, physical_rc:0), (table_type:0, version:1-1-9223372036854775807, logical_rc:99, physical_rc:99)]
Parameters
-------------------------------------
{obj:{"VARCHAR2":"A", collation:"utf8mb4_bin", coercibility:"INVALID"}, accuracy:{length:-1, precision:-1, scale:-1}, flag:0, raw_text_pos:-1, raw_text_len:-1, param_meta:{type:"VARCHAR2", collation:"utf8mb4_bin", coercibility:"INVALID"}}

Method 1: Execute EXPLAIN to obtain the execution plan for the query

In the execution plan, table_rows:10002, physical_range_rows:99, logical_range_rows:99 shows the estimated number of rows to read on the IDX_C2 index. Here, table_rows is the total number of rows in the table, and physical_range_rows and logical_range_rows are the number of rows in the range(A,MIN ; A,MAX) range. In general, these two indicators are close, and either one can be considered. In some special scenarios, physical_range_rows may be significantly greater than logical_range_rows. This query uses IDX_C2, and the estimated number of rows to scan is 99.

Method 2: Obtain the exact number of rows by constructing an auxiliary query

Method 1 obtains an estimated number of rows, which may be inaccurate in some scenarios. You can also construct an auxiliary query to obtain the exact number of rows. For example, you can construct the SELECT COUNT(*) FROM T1 WHERE ... query. In this query, the query range filter is the only filter condition. You can then obtain a more accurate number of rows with this real query.

Before you add an index, you can use Method 2 to estimate the number of rows to read on the new index, and thereby determine whether the corresponding execution plan is appropriate.

SELECT COUNT(*) FROM T1 WHERE C2 = 'A'
+----------+
| COUNT(*) |
+----------+
|       99 |
+----------+

Using indexes to optimize query performance often involves creating a composite index, allowing as many predicates as possible to be utilized as query range filters. Before creating a composite index, you can first construct a query to assess the potential number of rows to read on this composite index.

For example, in the following scenario, STATUS can only be either 'SUCC' or 'FAIL'. The table may have a large amount of data that meets the STATUS = 'SUCC' condition. Therefore, the number of rows to read for this query will be very large regardless of whether the primary table or the index table is scanned.

CREATE TABLE T1 (NAME VARCHAR(10), STATUS VARCHAR(10));
CREATE INDEX IDX_STATUS(STATUS);
SELECT * FROM T1 WHERE NAME = 'OceanBase' AND STATUS = 'SUCC';

In one word, there is no good execution plan for this table.

Solution

You may consider creating the IDX_NAME_STATUS(NAME, STATUS) index, which uses more predicates to narrow down the query range. If the selectivity of the equality condition on the NAME column is sufficient, the optimizer chooses the IDX_NAME_STATUS(NAME, STATUS) index. However, there may be extreme cases where a lot of data still remains after all the filtering is done. Then this query is expected to entail a large number of disk reads. In this case, you need to check with the business side to see whether the query logic and usage are reasonable.

You may now understand that a key point for index optimization is to find a set of highly selective predicates in the query, create a purposeful index, and use these predicates to generate a sound query range. The key question is which predicates can be used to determine the query range. Generally, these predicates can be divided into the following categories:

Equality predicates, for example, C1 = 1.
Range predicates, for example, C1 > 1, C1 >= 1, C1 < 1, C1 <= 1, and C1 between 1 and 3.
LIKE predicates, for example, C1 like 'abc%'. These predicates do not support wildcards at the beginning of a string.
IN predicates, for example, C1 IN (1,2,3).

Considerations

The preceding predicates are used to filter the values of a column. Note that function calculation on a column is not supported. For example, the predicates C1 +1 IN (1, 2, 3) and cast(c1_varchar as signed) IN (1, 2, 3) cannot be used to determine the query range. Note that for the C1 in (1, 2, 3) predicate, if the C1 column is not of the integer type, the database implicitly converts C1 to the integer type and then uses the predicate. In this scenario, the calculation of the CAST() function is actually involved, which means that the predicate cannot be used to determine the query range. Here is a typical example. In the following query, the execution plan selects full table scan rather than using the IDX_C1 index. This is because the original predicate C1 = 1 is actually parsed into table filter: cast(t1.C1, DECIMAL(-1, -1)) = cast(1, DECIMAL(1, 0)).

CREATE TABLE T1 (C1 VARCHAR(10), C2 INT);
INSERT INTO T1 VALUES (1, 1), (2, 2);
CREATE INDEX IDX_C1 ON T1 (C1);
EXPLAIN SELECT * FROM T1 WHERE C1 = 1;

| ===================================
|ID|OPERATOR  |NAME|EST. ROWS|COST|
-----------------------------------
|0 |TABLE SCAN|t1  |1        |2   |
===================================

Outputs & filters:
-------------------------------------
  0 - output([t1.C1], [t1.C2]), filter([cast(t1.C1, DECIMAL(-1, -1)) = cast(1, DECIMAL(1, 0))]), rowset=256,
      access([t1.C1], [t1.C2]), partitions(p0)