What is a hash join?
A hash join works by building a hash table using the data from the left table and then performing the join operation by probing the hash table with the data from the right table. The pseudocode for the execution process is as follows:
step 1: build the hash table
for row_1 in (select * from left_table)
loop
  hash_value = HASH(row_1.join_key)
  insert_hash_table(hash_value, row_1)
end loop
step 2: probe the hash table
for row_2 in (select * from right_table)
loop
  hash_value = HASH(row_2.join_key)
  row = lookup_hash_table(hash_value, row_2)
  if match_join_condition(row, row_2)
  then
    output(row, row_2)
  end if
end loop
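To make the build and probe steps concrete, here is a minimal Python sketch of the same flow. The hash_join function, the sample rows, and the column name c1 are illustrative assumptions for this sketch, not part of any database API.

# Minimal hash join sketch: build a hash table on the left input's join key,
# then probe it with each right row (equi-join on a single column).
from collections import defaultdict

def hash_join(left_rows, right_rows, left_key, right_key):
    # Step 1: build. Group the left rows by their join-key value.
    hash_table = defaultdict(list)
    for row_1 in left_rows:
        hash_table[row_1[left_key]].append(row_1)

    # Step 2: probe. For each right row, look up matching left rows
    # and emit the joined pairs.
    result = []
    for row_2 in right_rows:
        for row_1 in hash_table.get(row_2[right_key], []):
            result.append((row_1, row_2))
    return result

# Illustrative data standing in for t1 and t2 joined on c1.
t1 = [{"c1": 1}, {"c1": 2}, {"c1": 3}]
t2 = [{"c1": 2}, {"c1": 3}, {"c1": 4}]
print(hash_join(t1, t2, "c1", "c1"))
# -> [({'c1': 2}, {'c1': 2}), ({'c1': 3}, {'c1': 3})]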
The hash join plan is as follows:
Query Plan:
======================================
|ID|OPERATOR |NAME|EST. ROWS|COST |
--------------------------------------
|0 |HASH JOIN | |99000 |194200|
|1 | TABLE SCAN|t1 |100000 |41911 |
|2 | TABLE SCAN|t2 |100000 |41911 |
======================================
Outputs & filters:
-------------------------------------
0 - output([1]), filter(nil),
equal_conds([t1.c1 = t2.c1]), other_conds(nil)
1 - output([t1.c1]), filter(nil),
access([t1.c1]), partitions(p0)
2 - output([t2.c1]), filter(nil),
access([t2.c1]), partitions(p0)
When does the optimizer choose a hash join?
Typically, the optimizer considers a hash join when a relatively large amount of data must be joined, or when a large fraction of a small table must be joined, and the join must be an equi-join.
Hash joins perform best when the smaller of the two data sets fits in memory, because the join then costs only a single read pass over both data sets. Since the hash table resides in memory, the database can access its rows without latching them, which reduces logical I/O by avoiding repeated latching and reading of blocks in the database buffer cache.
If the data set is too large to fit in memory, the database partitions the data source and performs the join partition by partition. This approach uses a significant amount of sort area memory and temporary tablespace I/O, but it is still efficient, especially when the database uses parallel queries.
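As a rough illustration of the partitioning idea only (not how any particular database implements it), the sketch below splits both inputs into hash partitions on the join key and then joins each partition pair with an in-memory hash join; in a real system the partitions would typically be spilled to temporary storage rather than held in lists. The function name and parameters are assumptions made for this sketch.

# Sketch of a partitioned hash join for inputs that do not fit in memory.
from collections import defaultdict

def partitioned_hash_join(left_rows, right_rows, key, num_partitions=4):
    # Phase 1: partition both inputs on a hash of the join key, so rows
    # with equal key values always land in the same partition pair.
    left_parts = [[] for _ in range(num_partitions)]
    right_parts = [[] for _ in range(num_partitions)]
    for row in left_rows:
        left_parts[hash(row[key]) % num_partitions].append(row)
    for row in right_rows:
        right_parts[hash(row[key]) % num_partitions].append(row)

    # Phase 2: join each partition pair with an ordinary in-memory hash join.
    result = []
    for lp, rp in zip(left_parts, right_parts):
        table = defaultdict(list)
        for row_1 in lp:
            table[row_1[key]].append(row_1)
        for row_2 in rp:
            for row_1 in table.get(row_2[key], []):
                result.append((row_1, row_2))
    return result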
How to control the optimizer's use of hash joins
The most direct way is to specify the join algorithm with a hint. Use USE_HASH to tell the optimizer to use the hash join algorithm. You usually also need the LEADING hint, because the plan produced with only the hash join hint may not have the lowest cost and could be replaced by a cheaper plan with a different join order. The parameter of USE_HASH is the right table of the join.
Example: By default, the optimizer chooses the nested loop join algorithm on its own.
explain select 1 from t1, t2 where t1.c1 = t2.c1;
Query Plan:
============================================
|ID|OPERATOR |NAME|EST. ROWS|COST |
--------------------------------------------
|0 |NESTED-LOOP JOIN| |99000 |4120845|
|1 | TABLE SCAN |t1 |100000 |41911 |
|2 | TABLE GET |t2 |1 |40 |
============================================
Outputs & filters:
-------------------------------------
0 - output([1]), filter(nil),
conds(nil), nl_params_([t1.c1])
1 - output([t1.c1]), filter(nil),
access([t1.c1]), partitions(p0)
2 - output([1]), filter(nil),
access([t2.c1]), partitions(p0)
To control the optimizer's use of the hash join algorithm, you can use hints.
explain select /*+leading(t1 t2) use_hash(t2)*/ 1 from t1, t2 where t1.c1 = t2.c1;
Query Plan:
======================================
|ID|OPERATOR |NAME|EST. ROWS|COST |
--------------------------------------
|0 |HASH JOIN | |99000 |194200|
|1 | TABLE SCAN|t1 |100000 |41911 |
|2 | TABLE SCAN|t2 |100000 |41911 |
======================================
Outputs & filters:
-------------------------------------
0 - output([1]), filter(nil),
equal_conds([t1.c1 = t2.c1]), other_conds(nil)
1 - output([t1.c1]), filter(nil),
access([t1.c1]), partitions(p0)
2 - output([t2.c1]), filter(nil),
access([t2.c1]), partitions(p0)
To better use the hash join algorithm, take note of the following considerations:
- If possible, use the data source with the smaller amount of data to build the hash table, that is, make it the driving table (see the sketch after this list).
- Do not use the hash join algorithm for non-equi-joins.
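To illustrate the first consideration, here is a small sketch of why the smaller input should be the build side: the hash table's memory footprint grows with the build input, while the probe input is only streamed through once, so building on the smaller side keeps the table small. The function and data shapes are assumptions made for this sketch.

# Build the hash table on whichever input is smaller; probe with the other.
from collections import defaultdict

def hash_join_smaller_build(rows_a, rows_b, key):
    build, probe = (rows_a, rows_b) if len(rows_a) <= len(rows_b) else (rows_b, rows_a)
    table = defaultdict(list)
    for row in build:
        table[row[key]].append(row)
    # Emit (build_row, probe_row) pairs for matching join-key values.
    return [(b, p) for p in probe for b in table.get(p[key], [])]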