SORT assignment and optimization|V4.3.5| docs|Distributed Database

SORT assignment and optimization

Last Updated：2025-06-30 12:43:00 Updated

When a query contains a LIMIT clause, whether the generated execution plan contains a blocking operator can significantly impact performance. SORT, as a common blocking operator, is included in the query when an ORDER BY clause is used or when the generated plan requires the assignment of operators based on sorting algorithms (such as MERGE DISTINCT, MERGE GROUP BY, and MERGE JOIN). In these cases, the system attempts to assign SORT operators to obtain an intermediate result set sorted according to the specified sort key. To avoid unnecessary SORT operators and fully utilize already sorted intermediate result sets, the optimizer maintains certain sequence-related information and performs optimizations when attempting to assign SORT operators.

Order in a plan tree

In an execution plan, the input order of operators may be sorted. For base table scans, this order is determined by the sequence produced by the table’s index structure. For other operators, the input order may come from the output order of their child operators. This is different from the explicit output ordering required by an ORDER BY clause. However, most operators that utilize sorting algorithms allow flexibility in their input order, and this order can often be preserved and passed on as the operator’s output order. By optimizing how sorting operators use input order, unnecessary multiple sorts can be avoided.

When the optimizer decides which order a sorting operator should use, it considers both the existing input order and any order that may be required later. For example, in the query below, when the GROUP BY clause uses a merge group by operator, the order and sorting direction of c1 and c2 in the operator’s input can be adjusted as needed.

In Q2_1_1, the index idx is used to generate the order of c1 and c2, avoiding the need for a sorting operator and the sorting operator required by order by c1, c2.
In Q2_1_2, the order by clause is adjusted relative to Q2_1_1, and the primary key is specified. The merge group by operator directly uses the order required by order by c2, c1 as the input order for aggregation and assigns a sorting operator under group by.
Compared with Q2_1_2, Q2_1_3 does not specify the primary key. It generates the merge group by operator that can directly use the order of c1 and c2 generated by idx, and assigns a sorting operator for order by c2, c1 after aggregation.

create table t1(c1 int, c2 int, c3 int, c4 int, pk int primary key);
create index idx on t1(c1, c2, c3);

Q2_1_1:
select /*+no_use_hash_aggregation*/ c1, c2, sum(c3)
from t1 t
group by c2, c1
order by c1, c2;
==========================================
|ID|OPERATOR      |NAME  |EST. ROWS|COST |
------------------------------------------
|0 |MERGE GROUP BY|      |7072     |10965|
|1 | TABLE SCAN   |t(idx)|78858    |8124 |
==========================================

Outputs & filters:
-------------------------------------
  0 - output([t.c1], [t.c2], [T_FUN_SUM(t.c3)]), filter(nil), rowset=256,
      group([t.c1], [t.c2]), agg_func([T_FUN_SUM(t.c3)])
  1 - output([t.c1], [t.c2], [t.c3]), filter(nil), rowset=256,
      access([t.c1], [t.c2], [t.c3]), partitions(p0)

Q2_1_2:
select /*+no_use_hash_aggregation full(t)*/ c1, c2, sum(c3)
from t1 t
group by c1, c2
order by c2, c1;
========================================
|ID|OPERATOR      |NAME|EST. ROWS|COST |
----------------------------------------
|0 |MERGE GROUP BY|    |7072     |41410|
|1 | SORT         |    |100000   |37988|
|2 |  TABLE SCAN  |t   |100000   |8026 |
========================================

Outputs & filters:
-------------------------------------
  0 - output([t.c1], [t.c2], [T_FUN_SUM(t.c3)]), filter(nil), rowset=256,
      group([t.c2], [t.c1]), agg_func([T_FUN_SUM(t.c3)])
  1 - output([t.c2], [t.c1], [t.c3]), filter(nil), rowset=256,
			sort_keys([t.c2, ASC], [t.c1, ASC])
  2 - output([t.c1], [t.c2], [t.c3]), filter(nil), rowset=256,
      access([t.c1], [t.c2], [t.c3]), partitions(p0)

Q2_1_3:
select /*+no_use_hash_aggregation index(t idx)*/ c1, c2, sum(c3)
from t1 t
group by c1, c2
order by c2, c1;
===========================================
|ID|OPERATOR       |NAME  |EST. ROWS|COST |
-------------------------------------------
|0 |SORT           |      |7072     |12771|
|1 | MERGE GROUP BY|      |7072     |10965|
|2 |  TABLE SCAN   |t(idx)|78858    |8124 |
===========================================

Outputs & filters:
-------------------------------------
  0 - output([t.c1], [t.c2], [T_FUN_SUM(t.c3)]), filter(nil), rowset=256,
			sort_keys([t.c2, ASC], [t.c1, ASC])
  1 - output([t.c2], [t.c1], [T_FUN_SUM(t.c3)]), filter(nil), rowset=256,
      group([t.c1], [t.c2]), agg_func([T_FUN_SUM(t.c3)])
  2 - output([t.c1], [t.c2], [t.c3]), filter(nil), rowset=256,
      access([t.c1], [t.c2], [t.c3]), partitions(p0)

Allocation and optimization of the sort operator

When attempting to assign a sorting operator to achieve the desired order, the output of a child operator may already be sorted. The system first checks the input order and other relevant information to determine whether the input order can satisfy the expected output order. In the three queries below, the presence of an ORDER BY clause requires the final result to be ordered. However, because the order is already provided by a matching index or primary key, there is no need to assign an additional sorting operator in the end.

Q2_2_1: The idx index can provide the sequence (c1, c2, c3), and the output sequence matches order by c1, c2.
Q2_2_2: The idx index can provide the sequence (c1, c2, c3). Due to the c1 = 4 predicate, the output sequence matches order by c2.
Q2_2_3: The primary key provides the sequence (pk). Due to the uniqueness of the primary key, the output sequence matches order by pk, c3, c2, c1.

create table t1(c1 int, c2 int, c3 int, c4 int, pk int primary key);
create index idx on t1(c1, c2, c3);

Q2_2_1: select /*+index(t idx)*/ * from t1 t order by c1, c2;
Q2_2_2: select /*+index(t idx)*/ * from t1 t where c1 = 4 order by c2;
Q2_2_3: select /*+index(t primary)*/ * from t1 t order by pk, c3, c2, c1;

Q2_2_1 execution plan:
=======================================
|ID|OPERATOR  |NAME  |EST. ROWS|COST  |
---------------------------------------
|0 |TABLE SCAN|t(idx)|76557    |216994|
=======================================

Outputs & filters:
-------------------------------------
  0 - output([t.c1], [t.c2], [t.c3], [t.c4], [t.pk]), filter(nil), rowset=256,
      access([t.pk], [t.c1], [t.c2], [t.c3], [t.c4]), partitions(p0)

For the remaining sorting operations that cannot be eliminated, the optimizer optimizes the columns to be sorted before allocating the sort operator.

Prefix sort

When a sort operator needs to be allocated, if the input sequence of the operator can match part of the expected sequence, the prefix sort can be used to optimize the result by leveraging the input sequence.

After using the idx index, the table scan operator outputs the sequence (c1, c2, c3) for the following query. The prefix sort is applied, and for each group of c1, only the pk values are sorted.

Q2_2_4:
select /*+index(t idx)*/ c1 from t1 t order by c1, pk;
=======================================
|ID|OPERATOR   |NAME  |EST. ROWS|COST |
---------------------------------------
|0 |SORT       |      |76557    |18874|
|1 | TABLE SCAN|t(idx)|76557    |5319 |
=======================================

Outputs & filters:
-------------------------------------
  0 - output([t.c1]), filter(nil), rowset=256,
			sort_keys([t.c1, ASC], [t.pk, ASC]), prefix_pos(1)
  1 - output([t.pk], [t.c1]), filter(nil), rowset=256,
      access([t.pk], [t.c1]), partitions(p0)

Simplified sort columns

A certain amount of simplification is performed on the columns to be sorted to reduce the number of columns that need to be sorted. The main methods used for simplifying the sort columns are as follows:

Simplify the sort columns using equality conditions
```
select c1 from t1 t where c3 = 1 and c2 = c1 order by c3, c2, c1;
```
For the above query, the columns are simplified using the conditions c3 = 1 and c2 = c1, and the data is sorted based on the c2 column.
Simplify the sort columns using the primary key or unique index
```
select c1 from t1 t order by c1, pk, c3, c2;
```
For the above query, since the primary key is used, the data is sorted based on the c1 and pk columns.