Merge joins|V4.3.1| docs|Distributed Database

What is merge join

A merge join is a variation on a nested loop join. If the two data sets to join are not sorted, then the database sorts them. For each row in the first data set, the database probes the second data set for matching rows and performs the join, basing its start position on the match made in the previous iteration. This is different from a nested loop join. A nested loop join traverses the second data set each time from the first row. Here is the pseudocode of merge join execution:

for row_1 IN (select * from T1 where xxx)
loop
	for row_2 IN (select * from T2 where row_num > last_row_num and xxx)
	loop
  	if match join condition(row_1, row_2)
  	then
    	output (row_1, row_2)
  	else
    	last_row_num = row_num
    	break
  	end if
	end loop
end loop

When the optimizer chooses merge joins

The merge join algorithm is suitable for joining large amounts of data. The optimizer may choose another join over a merge join when both data sources to join are unsorted or the join condition is not an equijoin. Otherwise, the optimizer tries to generate a plan that uses the merge join algorithm and decides whether to choose this plan based on the plan cost.

Control the use of merge joins in the optimizer

The most direct control method is to specify the join algorithm by using a hint. You can use the USE_MERGE hint to instruct the optimizer to use the merge join algorithm. The LEADING hint is often used together with the USE_MERGE hint to specify the join order. This is because the plan that uses the merge join algorithm may not have the lowest cost and therefore may be overridden by a plan (that adopts a different join order) with lower cost. The USE_MERGE hint specifies the right table of the join.

In the following example, the optimizer autonomously chooses the hash join algorithm by default.

explain select 1 from t1, t2 where t1.c1 = t2.c1;
Query Plan:
======================================
|ID|OPERATOR   |NAME|EST. ROWS|COST  |
--------------------------------------
|0 |HASH JOIN  |    |99000    |194200|
|1 | TABLE SCAN|t1  |100000   |41911 |
|2 | TABLE SCAN|t2  |100000   |41911 |
======================================

Outputs & filters:
-------------------------------------
  0 - output([1]), filter(nil),
      equal_conds([t1.c1 = t2.c1]), other_conds(nil)
  1 - output([t1.c1]), filter(nil),
      access([t1.c1]), partitions(p0)
  2 - output([t2.c1]), filter(nil),
      access([t2.c1]), partitions(p0)

To instruct the optimizer to use the merge join algorithm, we can use a hint.

explain select /*+leading(t1 t2) use_merge(t2)*/ 1 from t1, t2 where t1.c1 = t2.c1;
Query Plan:
======================================
|ID|OPERATOR   |NAME|EST. ROWS|COST  |
--------------------------------------
|0 |MERGE JOIN |    |99000    |13200 |
|1 | TABLE SCAN|t1  |100000   |41911 |
|2 | TABLE SCAN|t2  |100000   |41911 |
======================================

Outputs & filters:
-------------------------------------
  0 - output([1]), filter(nil),
      equal_conds([t1.c1 = t2.c1]), other_conds(nil)
  1 - output([t1.c1]), filter(nil),
      access([t1.c1]), partitions(p0)
  2 - output([t2.c1]), filter(nil),
      access([t2.c1]), partitions(p0)

To better use the merge join algorithm, take note of the following considerations:

Use a merge join when the data sources are sorted, if possible.
Do not use the merge join algorithm for non-equijoins.