Join order|V4.3.0| docs|Distributed Database

Join order

Last Updated：2024-04-19 08:42:50 Updated

What is join order

For a multi-table join, since the database can join only one pair of data sources at a time, the optimizer must specify the join order of data source pairs. A query may have multiple join orders. For example, the join query select * from t1, t2, t3 where t1.c1 = t2.c1 and t2.c1 = t3.c1; has multiple alternative join orders. Here are two of the join orders:

Plan 1:
==============================================
|ID|OPERATOR    |NAME|EST. ROWS  |COST       |
----------------------------------------------
|0 |HASH JOIN   |    |96059601000|22771533844|
|1 | HASH JOIN  |    |98010000   |23334667   |
|2 |  TABLE SCAN|T1  |100000     |42605      |
|3 |  TABLE SCAN|T2  |100000     |42605      |
|4 | TABLE SCAN |T3  |100000     |42605      |
==============================================

Plan 2:
==============================================
|ID|OPERATOR    |NAME|EST. ROWS  |COST       |
----------------------------------------------
|0 |HASH JOIN   |    |96059601000|22771533844|
|1 | HASH JOIN  |    |98010000   |23334667   |
|2 |  TABLE SCAN|T3  |100000     |42605      |
|3 |  TABLE SCAN|T2  |100000     |42605      |
|4 | TABLE SCAN |T1  |100000     |42605      |
==============================================

For a more complex query, such as select 1 from t1, t2, t3, t4 where t1.c1 = t2.c1 and t1.c1 = t3.c1 and t1.c1 = t4.c1;, more join orders are available, and their costs may differ greatly. The optimizer needs to choose the optimal join order based on certain strategies.

How the optimizer decides the join order

The optimizer selects two data sources for join at a time until all data sources are covered. The optimizer then obtains all possible join orders, calculates the execution cost of each join order, and selects the optimal join order based on the calculated cost.

Various factors enter into the optimizer decision on the join order, for example, the amount of data to join, the number of columns to access in tables, data distribution, the join algorithm, and indexes on tables. Among them, the factor that affects the optimizer most is the amount of data to join, which depends on the accuracy of statistics and the row estimates of the optimizer.

Control the join order in the optimizer

You can use two hints to control the join order in the optimizer.

The first hint is ORDERED, which instructs the optimizer to choose the join order defined by the query without considering other join orders. For example, for select /*+ORDERED*/ 1 from t1, t2, t3 where t1.c1 = t2.c1 and t1.c1 = t3.c1;, the optimizer will only choose the join order t1 join t2 join t3 when generating an execution plan.

==============================================
|ID|OPERATOR    |NAME|EST. ROWS  |COST       |
----------------------------------------------
|0 |HASH JOIN   |    |96059601000|65343614674|
|1 | HASH JOIN  |    |98010000   |66674090   |
|2 |  TABLE SCAN|T1  |100000     |61860      |
|3 |  TABLE SCAN|T2  |100000     |61860      |
|4 | TABLE SCAN |T3  |100000     |61860      |
==============================================

The second hint is LEADING, which specifies a complete or partial order for the optimizer. For example, in select /*+leading(t1 t3 t4 t2)*/ 1 from t1, t2, t3, t4 where t1.c1 = t2.c1 and t1.c1 = t3.c1 and t1.c1 = t4.c1;, LEADING specifies a complete join order.

=====================================================
|ID|OPERATOR     |NAME|EST. ROWS     |COST          |
-----------------------------------------------------
|0 |HASH JOIN    |    |9.414801e+13  |6.412702e+13  |
|1 | HASH JOIN   |    |96059601000   |65343614674   |
|2 |  HASH JOIN  |    |98010000      |66674090      |
|3 |   TABLE SCAN|T1  |100000        |61860         |
|4 |   TABLE SCAN|T3  |100000        |61860         |
|5 |  TABLE SCAN |T4  |100000        |61860         |
|6 | TABLE SCAN  |T2  |100000        |61860         |
=====================================================

You can also specify a partial join order, and the optimizer determines the join order for the remaining tables. For example, for select /*+leading(t3 t1)*/ 1 from t1, t2, t3, t4 where t1.c1 = t2.c1 and t1.c1 = t3.c1 and t1.c1 = t4.c1;, the optimizer may consider the following join orders:

Query Plan1:
=====================================================
|ID|OPERATOR     |NAME|EST. ROWS     |COST          |
-----------------------------------------------------
|0 |HASH JOIN    |    |9.414801e+13  |6.412702e+13  |
|1 | HASH JOIN   |    |96059601000   |65343614674   |
|2 |  HASH JOIN  |    |98010000      |66674090      |
|3 |   TABLE SCAN|T3  |100000        |61860         |
|4 |   TABLE SCAN|T1  |100000        |61860         |
|5 |  TABLE SCAN |T4  |100000        |61860         |
|6 | TABLE SCAN  |T2  |100000        |61860         |
=====================================================

Query Plan2:
=====================================================
|ID|OPERATOR     |NAME|EST. ROWS     |COST          |
-----------------------------------------------------
|0 |HASH JOIN    |    |9.414801e+13  |6.412702e+13  |
|1 | HASH JOIN   |    |96059601000   |65343614674   |
|2 |  HASH JOIN  |    |98010000      |66674090      |
|3 |   TABLE SCAN|T3  |100000        |61860         |
|4 |   TABLE SCAN|T1  |100000        |61860         |
|5 |  TABLE SCAN |T2  |100000        |61860         |
|6 | TABLE SCAN  |T4  |100000        |61860         |
=====================================================

The optimizer then decides the optimal join order based on the cost.

How a poor join order chosen by the optimizer is identified

In business, a poor join order is basically caused by inaccurate statistics or row estimates. For a slow SQL query, you can first check whether the row estimates of the base table are accurate. If the estimated number of rows deviates greatly from the actual number, the join order chosen by the optimizer is usually not the one with the lowest cost. For example, the plan for select 1 from t1, t2, t3, t4 where t1.c1 = t2.c1 and t1.c1 = t3.c1 and t1.c1 = t4.c1 and t1.c2=2 and t4.c2=1 is executed slowly.

=========================================
|ID|OPERATOR     |NAME|EST. ROWS|COST   |
-----------------------------------------
|0 |HASH JOIN    |    |9686572  |2563762|
|1 | HASH JOIN   |    |9884     |184590 |
|2 |  HASH JOIN  |    |11       |97710  |
|3 |   TABLE SCAN|T1  |10       |48849  |
|4 |   TABLE SCAN|T4  |10       |48849  |
|5 |  TABLE SCAN |T2  |100000   |42605  |
|6 | TABLE SCAN  |T3  |100000   |42605  |
=========================================

You can construct an auxiliary SQL query to verify the number of rows in the base table. For example, you can construct select count(*) from t1 where t1.c2=2; to verify the number of rows in t1, which is actually 1 million. That is, the join order chosen by the optimizer is not the optimal one when the row estimates are incorrect. In this case, you need to add the LEADING hint to specify a join order, for example, leading(t3 (t2 (t4 t1))).