What is a merge join?
A merge join is a variation of a nested loops join. If the two datasets to be joined are not sorted, the database sorts them. For each row in the first dataset, the database searches for matching rows in the second dataset and joins them. The starting position for each search is based on the matches found in the previous iteration. This is different from a nested loops join, where the search always starts from the first row in the second dataset. The pseudocode for a merge join is as follows:
for row_1 IN (select * from T1 where xxx)
loop
for row_2 IN (select * from T2 where row_num > last_row_num and xxx)
loop
if match join condition(row_1, row_2)
then
output (row_1, row_2)
else
last_row_num = row_num
break
end if
end loop
end loop
When does the optimizer choose a merge join?
A merge join is also suitable for joining large amounts of data. If both data sources are unordered or the join condition is a non-equality join, the optimizer does not choose the merge join algorithm but selects other join algorithms instead. Otherwise, the optimizer attempts to generate a merge join plan and selects the appropriate join algorithm based on the plan cost.
How to control the optimizer to use a merge join?
The most direct way is to specify the join algorithm using hints. You can use USE_MERGE to indicate that the optimizer should use the merge join algorithm. It is usually also necessary to use LEADING hint, because after specifying the merge join algorithm, the plan cost may not be the lowest and could be overridden by a plan with a lower cost that has a different join order. The USE_MERGE hint specifies the right table for the join.
Here is an example. By default, the optimizer autonomously selects the hash join algorithm.
explain select 1 from t1, t2 where t1.c1 = t2.c1;
Query Plan:
======================================
|ID|OPERATOR |NAME|EST. ROWS|COST |
--------------------------------------
|0 |HASH JOIN | |99000 |194200|
|1 | TABLE SCAN|t1 |100000 |41911 |
|2 | TABLE SCAN|t2 |100000 |41911 |
======================================
Outputs & filters:
-------------------------------------
0 - output([1]), filter(nil),
equal_conds([t1.c1 = t2.c1]), other_conds(nil)
1 - output([t1.c1]), filter(nil),
access([t1.c1]), partitions(p0)
2 - output([t2.c1]), filter(nil),
access([t2.c1]), partitions(p0)
To control the optimizer to use the merge join algorithm, you can use hints.
explain select /*+leading(t1 t2) use_merge(t2)*/ 1 from t1, t2 where t1.c1 = t2.c1;
Query Plan:
======================================
|ID|OPERATOR |NAME|EST. ROWS|COST |
--------------------------------------
|0 |MERGE JOIN | |99000 |13200 |
|1 | TABLE SCAN|t1 |100000 |41911 |
|2 | TABLE SCAN|t2 |100000 |41911 |
======================================
Outputs & filters:
-------------------------------------
0 - output([1]), filter(nil),
equal_conds([t1.c1 = t2.c1]), other_conds(nil)
1 - output([t1.c1]), filter(nil),
access([t1.c1]), partitions(p0)
2 - output([t2.c1]), filter(nil),
access([t2.c1]), partitions(p0)
To better use the merge join algorithm, take note of the following considerations:
- Use a merge join when the data sources are sorted, if possible.
- Do not use the merge join algorithm for non-equijoins.