Partition pruning helps avoid accessing irrelevant partitions, significantly improving SQL execution efficiency. This topic describes the principles and applications of partition pruning.
When you access a partitioned table, you often need to access only some of its partitions. The process of avoiding irrelevant partitions through the optimizer is called partition pruning. As an important optimization technique for partitioned tables, partition pruning greatly enhances SQL execution efficiency. You can use partition pruning to add conditions that locate specific partitions, thereby avoiding irrelevant data and optimizing query performance.
Partition pruning is a complex process. The optimizer extracts relevant partition information based on the partition details of the user table and the conditions specified in the SQL statement. Due to the complexity of the conditions in the SQL statement, the entire extraction logic becomes more complex. This process is handled by the Query Range sub-module in OceanBase Database.
When you access a partitioned table, all data where col1 is 1 is located in partition p1. Therefore, you only need to access this partition and not partitions 0, 2, 3, or 4. Here is an example:
obclient> CREATE TABLE tbl1(col1 INT,col2 INT) PARTITION BY HASH(col1) PARTITIONS 5;
obclient> SELECT * FROM tbl1 WHERE col1 = 1;
You can use the EXPLAIN statement to view the execution plan and see the result of partition pruning:
obclient> EXPLAIN SELECT * FROM tbl1 WHERE col1 = 1 \G
*************************** 1. row ***************************
Query Plan: ===================================
|ID|OPERATOR |NAME|EST. ROWS|COST|
-----------------------------------
|0 |TABLE SCAN|TBL1|990 |383 |
===================================
Outputs & filters:
-------------------------------------
0 - output([TBL1.COL1], [TBL1.COL2]), filter(nil),
access([TBL1.COL1], [TBL1.COL2]), partitions(p1)
1 row in set
Principles of partition pruning
HASH or LIST partitioning
Partition pruning involves using the conditions in the WHERE clause to calculate the values of the partitioning columns and then determining which partitions need to be accessed based on the results. If the partitioning condition is an expression that appears as a whole in an equality condition, partition pruning can also be performed.
For example, if the partitioning condition is the expression c1 + c2, which appears as a whole in an equality condition, partition pruning can be performed. Here is an example:
obclient> CREATE TABLE t1(c1 INT,c2 INT) PARTITION BY HASH(c1 + c2) PARTITIONS 5;
obclient> EXPLAIN SELECT * FROM t1 WHERE c1 + c2 = 1 \G
*************************** 1. row ***************************
Query Plan: ===================================
|ID|OPERATOR |NAME|EST. ROWS|COST|
-----------------------------------
|0 |TABLE SCAN|t1 |5 |1303|
===================================
Outputs & filters:
-------------------------------------
0 - output([t1.c1], [t1.c2]), filter([t1.c1 + t1.c2 = 1]),
access([t1.c1], [t1.c2]), partitions(p1)
RANGE partitioning
You can determine the partitions to access by finding the intersection between the range of the partitioning key in the WHERE clause and the partition range defined in the table.
For RANGE partitions:
If the partitioning condition is a column, partition pruning is supported whether the query condition is an equality condition or an inequality condition.
If the partitioning condition is an expression and the query condition is an equality condition, partition pruning is supported.
For example, if the partitioning condition is the expression c1 and the query condition is an inequality condition c1 < 150 and c1 > 100, partition pruning can be performed. Here is an example:
obclient> CREATE TABLE t1(c1 INT,c2 INT) PARTITION BY RANGE(c1)
(PARTITION p0 VALUES LESS THAN(100),
PARTITION p1 VALUES LESS THAN(200)
);
Query OK, 0 rows affected
obclient> EXPLAIN SELECT * FROM t1 WHERE c1 < 150 and c1 > 110\G
*************************** 1. row ***************************
Query Plan: ===================================
|ID|OPERATOR |NAME|EST. ROWS|COST|
-----------------------------------
|0 |TABLE SCAN|t1 |1 |46 |
===================================
Outputs & filters:
-------------------------------------
0 - output([t1.c1], [t1.c2]), filter([t1.c1 < 150], [t1.c1 > 110]),
access([t1.c1], [t1.c2]), partitions(p1)
1 row in set
Additionally, if the partitioning condition's expression uses time-related functions such as YEAR(), TO_DAYS(), or TO_SECONDS(), partition pruning is supported based on the range specified in the query condition.
Here is an example:
obclient> CREATE TABLE tbl_r (log_id BIGINT NOT NULL,log_value VARCHAR(50),log_date datetime NOT NULL)
PARTITION BY RANGE(to_days(log_date))
(PARTITION M202001 VALUES LESS THAN(to_days('2020/02/01'))
, PARTITION M202002 VALUES LESS THAN(to_days('2020/03/01'))
, PARTITION M202003 VALUES LESS THAN(to_days('2020/04/01'))
, PARTITION M202004 VALUES LESS THAN(to_days('2020/05/01'))
, PARTITION M202005 VALUES LESS THAN(to_days('2020/06/01'))
, PARTITION M202006 VALUES LESS THAN(to_days('2020/07/01'))
, PARTITION M202007 VALUES LESS THAN(to_days('2020/08/01'))
, PARTITION M202008 VALUES LESS THAN(to_days('2020/09/01'))
, PARTITION M202009 VALUES LESS THAN(to_days('2020/10/01'))
, PARTITION M202010 VALUES LESS THAN(to_days('2020/11/01'))
, PARTITION M202011 VALUES LESS THAN(to_days('2020/12/01'))
, PARTITION M202012 VALUES LESS THAN(to_days('2021/01/01'))
);
Query OK, 0 rows affected
obclient> EXPLAIN SELECT * FROM tbl_r WHERE log_date > '2020/07/15' and log_date <'2020/10/07'\G
*************************** 1. row ***************************
Query Plan: ====================================================
|ID|OPERATOR |NAME |EST. ROWS|COST|
----------------------------------------------------
|0 |PX COORDINATOR | |1 |183 |
|1 | EXCHANGE OUT DISTR |:EX10000|1 |183 |
|2 | PX PARTITION ITERATOR| |1 |183 |
|3 | TABLE SCAN |tbl_r |1 |183 |
====================================================
Outputs & filters:
-------------------------------------
0 - output([INTERNAL_FUNCTION(tbl_r.log_id, tbl_r.log_value, tbl_r.log_date)]), filter(nil)
1 - output([INTERNAL_FUNCTION(tbl_r.log_id, tbl_r.log_value, tbl_r.log_date)]), filter(nil), dop=1
2 - output([tbl_r.log_date], [tbl_r.log_id], [tbl_r.log_value]), filter(nil)
3 - output([tbl_r.log_date], [tbl_r.log_id], [tbl_r.log_value]), filter([tbl_r.log_date > ?], [tbl_r.log_date < ?]),
access([tbl_r.log_date], [tbl_r.log_id], [tbl_r.log_value]), partitions(p[6-9])
1 row in set
Principles of subpartition pruning
For subpartitions, first determine the partitions to access based on the primary partitioning key, and then determine the subpartitions to access based on the subpartitioning key. Finally, determine all the physical subpartitions to access by taking the product of the two.
In the following example, the primary partition pruning result is p0, and the subpartition pruning result is sp0, so the physical partition to access is p0sp0.
Notice
In this example, the physical partition identifier to access is p0sp0, which is not the name of a subpartition. For template-based subpartitioned tables, the naming convention for subpartitions is ($part_name)s($subpart_name), such as p0ssp0.
obclient> CREATE TABLE tbl2_rr(col1 INT,col2 INT)
PARTITION BY RANGE(col1)
SUBPARTITION BY RANGE(col2)
SUBPARTITION TEMPLATE
(SUBPARTITION sp0 VALUES LESS THAN(1000),
SUBPARTITION sp1 VALUES LESS THAN(2000)
)
(PARTITION p0 VALUES LESS THAN(100),
PARTITION p1 VALUES LESS THAN(200)
) ;
Query OK, 0 rows affected
obclient> EXPLAIN SELECT * FROM tbl2_rr
WHERE (col1 = 1 or col1 = 2) and (col2 > 101 and col2 < 150) \G
*************************** 1. row ***************************
Query Plan: ======================================
|ID|OPERATOR |NAME |EST. ROWS|COST|
--------------------------------------
|0 |TABLE SCAN|TBL2_RR|99 |53 |
======================================
Outputs & filters:
-------------------------------------
0 - output([TBL2_RR.COL1], [TBL2_RR.COL2]), filter(nil),
access([TBL2_RR.COL1], [TBL2_RR.COL2]), partitions(p0sp0)
1 row in set
In some scenarios, partition pruning may result in a certain level of amplification, but the optimizer ensures that the result is a superset of the data that needs to be accessed, without any data loss.