What is hash join
A hash join uses the data of the left table to build a hash table, and then finds the matching rows to join with the right table by probing the hash table. Here is the pseudocode of hash join execution:
step 1 build hash table:
for row_1 in (select * from left_table)
loop
hash_value = HASH(row_1)
insert_hash_table(hash_value, row_1)
end loop
step 2 probe hash table:
for row_2 in (select * from right_table)
loop
hash_value = HASH(row_2)
row = lookup_hash_table(hash_value, row_2)
if match join condition(row, row_2)
then
output (row, row_2)
end if
end loop
Here is an execution plan that uses a hash join:
Query Plan:
======================================
|ID|OPERATOR |NAME|EST. ROWS|COST |
--------------------------------------
|0 |HASH JOIN | |99000 |194200|
|1 | TABLE SCAN|t1 |100000 |41911 |
|2 | TABLE SCAN|t2 |100000 |41911 |
======================================
Outputs & filters:
-------------------------------------
0 - output([1]), filter(nil),
equal_conds([t1.c1 = t2.c1]), other_conds(nil)
1 - output([t1.c1]), filter(nil),
access([t1.c1]), partitions(p0)
2 - output([t2.c1]), filter(nil),
access([t2.c1]), partitions(p0)
When the optimizer chooses hash joins
In general, the optimizer considers a hash join when a relatively large amount of data must be joined (or a large percentage of a small table must be joined), and the join is an equijoin.
A hash join is most cost effective when the smaller data set fits in memory. In this case, the join cost is limited to a single read pass over the two data sets. Since the hash table is in memory, the database can access rows without latching them. This technique reduces logical I/O by avoiding the necessity of repeatedly latching and reading blocks in the database buffer cache.
If the data sets are too large for memory, the database partitions the data sources and performs the join partition by partition. This can use a lot of sort area memory and I/O to the temporary tablespace. This method can still be the most cost effective, especially when the database uses parallel queries.
Control the use of hash joins in the optimizer
The most direct control method is to specify the join algorithm by using a hint. You can use the USE_HASH hint to instruct the optimizer to use the hash join algorithm. The LEADING hint is often used together with the USE_HASH hint to specify the join order. This is because the plan that uses the hash join algorithm may not have the lowest cost and therefore may be overridden by a plan (that adopts a different join order) with lower cost. The USE_HASH hint specifies the right table of the join.
In the following example, the optimizer autonomously chooses the nested loop join algorithm by default.
explain select 1 from t1, t2 where t1.c1 = t2.c1;
Query Plan:
============================================
|ID|OPERATOR |NAME|EST. ROWS|COST |
--------------------------------------------
|0 |NESTED-LOOP JOIN| |99000 |4120845|
|1 | TABLE SCAN |t1 |100000 |41911 |
|2 | TABLE GET |t2 |1 |40 |
============================================
Outputs & filters:
-------------------------------------
0 - output([1]), filter(nil),
conds(nil), nl_params_([t1.c1])
1 - output([t1.c1]), filter(nil),
access([t1.c1]), partitions(p0)
2 - output([1]), filter(nil),
access([t2.c1]), partitions(p0)
To instruct the optimizer to use the hash join algorithm, we can use a hint.
explain select /*+leading(t1 t2) use_hash(t2)*/ 1 from t1, t2 where t1.c1 = t2.c1;
Query Plan:
======================================
|ID|OPERATOR |NAME|EST. ROWS|COST |
--------------------------------------
|0 |HASH JOIN | |99000 |194200|
|1 | TABLE SCAN|t1 |100000 |41911 |
|2 | TABLE SCAN|t2 |100000 |41911 |
======================================
Outputs & filters:
-------------------------------------
0 - output([1]), filter(nil),
equal_conds([t1.c1 = t2.c1]), other_conds(nil)
1 - output([t1.c1]), filter(nil),
access([t1.c1]), partitions(p0)
2 - output([t2.c1]), filter(nil),
access([t2.c1]), partitions(p0)
To better use the hash join algorithm, take note of the following considerations:
- Use a data source with a small amount of data to build a hash table, that is, as the driving table, if possible.
- Do not use the hash join algorithm for non-equijoins.