Generally, the optimizer selects the optimal join method for a query with a join during the plan generation phase based on statistics. If the statistics are inaccurate or outdated, the join method selected by the optimizer is not the optimal one. For example, if the optimizer estimates that the table contains a few rows based on statistics, it may select nested-loop join (NLJ). However, if the table actually contains much more rows than estimated, hash join may be the optimal choice for higher performance. To resolve this issue, OceanBase Database introduces the adaptive join feature, which allows the database to select the most appropriate join method dynamically rather than merely depending on statistics.
Here is an example:
The following plan generated by using the adaptive join method for the SQL query ensures that the optimizer selects the optimal join algorithm for better performance, regardless of whether the queried region is small or large.
obclient> CREATE TABLE users(id int, region int, name varchar(100));
Query OK, 0 rows affected (0.088 sec)
obclient> CREATE TABLE orders(id int, user_id int);
Query OK, 0 rows affected (0.080 sec)
obclient> CREATE INDEX idx_userid ON orders(user_id);
Query OK, 0 rows affected (0.486 sec)
obclient> EXPLAIN SELECT COUNT(1) FROM orders o, users u WHERE u.region = "?" AND u.id = o.user_id;
+-----------------------------------------------------------------------------------------------+
| Query Plan |
+-----------------------------------------------------------------------------------------------+
| ====================================================================================== |
| |ID|OPERATOR |NAME |EST.ROWS|EST.TIME(us)| |
| -------------------------------------------------------------------------------------- |
| |0 |SCALAR GROUP BY | |1 |9833 | |
| |1 |└─HASH JOIN | |4952 |9743 | |
| |2 | ├─NESTED-LOOP JOIN | |512 |9743 | |
| |3 | │ ├─STATISTICS COLLECTOR | |512 |176 | |
| |4 | │ │ └─TABLE FULL SCAN |U |512 |96 | |
| |5 | │ └─DISTRIBUTED TABLE RANGE SCAN|ORDERS_alias(IDX_USERID)|10 |18 | |
| |6 | └─TABLE FULL SCAN |O(IDX_USERID) |9999 |273 | |
| ====================================================================================== |
When generating this plan, the optimizer calculates a threshold for the number of rows. If the number of rows contained in the left-side table is smaller than this threshold, NLJ performs better. Otherwise, hash join provides higher performance.
During execution, Operator 3 statistics collector collects data from the left-side table and determines whether the number of rows exceeds the specified threshold. If the number of rows does not exceed the threshold, Operator 1 hash join is bypassed. If the number of rows exceeds the threshold, Operator 2 nested loop join is bypassed. A bypassed operator outputs the data it receives from the lower-layer operator to the upper-layer operator without performing actual operations.
When the number of rows contained in the left-side table varies significantly and the optimizer fails to generate a plan based on the adaptive join method, you can manually specify to use the adaptive join method for the SQL query by adding the use_adaptive hint. The following example adds the use_adaptive hint to the SQL query:
obclient> EXPLAIN SELECT /*+ leading(u o) use_adaptive(o) */ count(1) FROM orders o, users u WHERE u.region = 1 AND u.id = o.user_id;
Notice
The table specified in use_adaptive must be the right-side table in the join.