JOIN|V4.1.0| docs|Distributed Database

JOIN

Last Updated: 2023-07-28 02:55:42

A JOIN statement combines two or more tables based on join conditions. The result set generated by a JOIN can be saved as a table or used as a table in a query.

A JOIN statement combines the columns of two tables based on the values in those columns. JOIN types in the database generally include inner join, outer join, semi-join, and anti-join. SQL has no dedicated syntax for semi-join or anti-join queries; they are obtained when the database rewrites a subquery.

Join conditions

Join conditions fall into two types: equijoin conditions (such as t1.a = t2.b) and non-equijoin conditions (such as t1.a < t2.b). Unlike non-equijoin conditions, equijoin conditions allow the database to use efficient join algorithms, such as hash join and sort-merge join.
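The difference between the two kinds of condition can be sketched in a few lines of Python. This is only an illustration of the general technique, not the database's implementation: the function names and sample rows are hypothetical, and NULL handling is ignored. The hash join builds a lookup table on one input and probes it with the other, which only works when the condition is an equality; the nested-loop join accepts any condition but compares every pair of rows.

```python
from collections import defaultdict

def nested_loop_join(left, right, predicate):
    """Compare every left row with every right row; works for any condition."""
    return [(l, r) for l in left for r in right if predicate(l, r)]

def hash_join(left, right, left_key, right_key):
    """Equijoin only: build a hash table on the right input, probe with the left."""
    buckets = defaultdict(list)
    for r in right:
        buckets[right_key(r)].append(r)
    return [(l, r) for l in left for r in buckets.get(left_key(l), [])]

# Hypothetical sample rows standing in for t1(a) and t2(b); no NULLs.
t1 = [{"a": 1}, {"a": 2}, {"a": 3}]
t2 = [{"b": 2}, {"b": 3}, {"b": 3}]

# Equijoin condition t1.a = t2.b: the hash join applies.
equi = hash_join(t1, t2, lambda l: l["a"], lambda r: r["b"])

# Non-equijoin condition t1.a < t2.b: fall back to comparing all pairs.
non_equi = nested_loop_join(t1, t2, lambda l, r: l["a"] < r["b"])
```

With these rows, the hash join touches each input once and returns the same pairs that a nested loop over the equality condition would return.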

Self-joins

A self-join is a join of a table to itself. The following example shows a self-join that uses the aliases ta and tb to distinguish the two references to t1.

obclient> CREATE TABLE t1(a INT PRIMARY KEY, b INT, c INT);
Query OK, 0 rows affected

obclient> SELECT * FROM t1 AS ta, t1 AS tb WHERE ta.b = tb.b;
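As a rough illustration of what the self-join above returns, the following Python sketch iterates over the same list of rows twice, once per alias. The sample rows are hypothetical and are not taken from the example, which inserts no data.

```python
# Hypothetical rows for t1(a, b, c).
t1 = [
    {"a": 1, "b": 10, "c": 100},
    {"a": 2, "b": 10, "c": 200},
    {"a": 3, "b": 20, "c": 300},
]

# ta and tb are two aliases over the same rows; keep pairs where ta.b = tb.b.
# Only the primary keys (ta.a, tb.a) are collected to keep the output short.
self_join = [(ta["a"], tb["a"]) for ta in t1 for tb in t1 if ta["b"] == tb["b"]]
```

Every row pairs with itself, and rows 1 and 2 also pair with each other in both orders because they share the same value of b.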

Inner join

An inner join is the most basic join operation in a database. It combines the columns of two tables (such as tables A and B) based on the join conditions to generate a new result table. Logically, the join first generates the Cartesian product of the two tables, pairing each row in table A with each row in table B, and then returns only the combinations that meet the join conditions. Each matching pair of rows is combined column by column into one row of the result set. In practice, the database may choose a more efficient algorithm that produces the same result.

obclient> CREATE TABLE t1(c1 INT,c2 INT);
Query OK, 0 rows affected
obclient> CREATE TABLE t2(c1 INT,c2 INT);
Query OK, 0 rows affected

obclient> SELECT * FROM t1 JOIN t2 USING(c1);
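The logical definition of an inner join, a Cartesian product followed by a filter, can be sketched as follows. The sample rows are hypothetical, and the equality on c1 stands in for USING(c1). This mirrors the definition only and says nothing about how the database actually executes the query.

```python
from itertools import product

# Hypothetical rows for t1(c1, c2) and t2(c1, c2).
t1 = [{"c1": 1, "c2": 10}, {"c1": 2, "c2": 20}]
t2 = [{"c1": 2, "c2": 200}, {"c1": 3, "c2": 300}]

# Step 1: pair every row of t1 with every row of t2.
cartesian = list(product(t1, t2))

# Step 2: keep only the pairs that satisfy the join condition t1.c1 = t2.c1.
inner = [(l, r) for l, r in cartesian if l["c1"] == r["c1"]]
```

Of the four pairs in the Cartesian product, only the pair with c1 = 2 on both sides survives the filter.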

Outer join

An outer join does not require each record in one of the joined tables to have a matching record in the other table. The table whose records are all retained, including records without a match, is called the preserved table.

Outer join operations are further divided into left outer joins, right outer joins, and full outer joins, depending on whether the result table retains all rows from the table on the left side of JOIN, the table on the right side, or both.

In a left outer join, if a row in the left table has no matching row in the right table, the right table's columns in the result row are filled with NULL.
In a right outer join, if a row in the right table has no matching row in the left table, the left table's columns in the result row are filled with NULL.
In a full outer join, both rules apply: a row from either table that has no match in the other table is returned with the other table's columns filled with NULL.

obclient> CREATE TABLE t1(c1 INT,c2 INT);
Query OK, 0 rows affected
obclient> CREATE TABLE t2(c1 INT,c2 INT);
Query OK, 0 rows affected

obclient> SELECT * FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1;
obclient> SELECT * FROM t1 RIGHT JOIN t2 ON t1.c1 = t2.c1;
obclient> SELECT * FROM t1 FULL JOIN t2 ON t1.c1 = t2.c1;
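The NULL-filling behavior of the three outer joins can be sketched in Python, with None standing in for NULL. The helper names and sample rows are hypothetical; the sketch only illustrates the semantics and is not how the database evaluates these queries.

```python
def left_outer_join(left, right, on, right_width):
    """Keep every left row; pad with None when no right row matches."""
    rows = []
    for l in left:
        matches = [r for r in right if on(l, r)]
        if matches:
            rows.extend(l + r for r in matches)
        else:
            rows.append(l + (None,) * right_width)
    return rows

def right_outer_join(left, right, on, left_width):
    """Keep every right row; pad with None when no left row matches."""
    rows = []
    for r in right:
        matches = [l for l in left if on(l, r)]
        if matches:
            rows.extend(l + r for l in matches)
        else:
            rows.append((None,) * left_width + r)
    return rows

def full_outer_join(left, right, on, left_width, right_width):
    """Left outer join plus the right rows that matched nothing."""
    rows = left_outer_join(left, right, on, right_width)
    rows += [(None,) * left_width + r
             for r in right if not any(on(l, r) for l in left)]
    return rows

# Hypothetical rows for t1(c1, c2) and t2(c1, c2), stored as tuples.
t1 = [(1, 10), (2, 20)]
t2 = [(2, 200), (3, 300)]
on_c1 = lambda l, r: l[0] == r[0]  # t1.c1 = t2.c1

left_rows = left_outer_join(t1, t2, on_c1, 2)
right_rows = right_outer_join(t1, t2, on_c1, 2)
full_rows = full_outer_join(t1, t2, on_c1, 2, 2)
```

Here the row with c1 = 1 exists only in t1 and the row with c1 = 3 exists only in t2, so each appears padded with None in the joins that preserve its table.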

Semi-join

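A left semi-join of table A and table B returns only the rows in table A that have at least one matching row in table B. A right semi-join returns only the rows in table B that have at least one matching row in table A. Each qualifying row is returned once, no matter how many rows it matches in the other table.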

You can get a semi-join query only by unnesting and rewriting a subquery. Example:

obclient> CREATE TABLE t1(a INT PRIMARY KEY, b INT, c INT);
Query OK, 0 rows affected

obclient> CREATE TABLE t2(a INT PRIMARY KEY, b INT, c INT);
Query OK, 0 rows affected

obclient> INSERT INTO t1 VALUES (1, 1, 1),(2, 2, 2);
obclient> INSERT INTO t2 VALUES (1, 1, 1),(2, 2, 2);

obclient> SELECT * FROM t1 WHERE t1.a IN (SELECT t2.b FROM t2 WHERE t2.c = t1.c);

When you execute the EXPLAIN statement to view a query plan, the results show that dependent subqueries are unnested and rewritten into semi-joins.

obclient> EXPLAIN SELECT * FROM t1 WHERE t1.a IN (SELECT t2.b FROM t2 WHERE t2.c = t1.c);
| ========================================
|ID|OPERATOR       |NAME|EST. ROWS|COST|
----------------------------------------
|0 |MERGE SEMI JOIN|    |2        |76  |
|1 | TABLE SCAN    |t1  |2        |37  |
|2 | SORT          |    |2        |38  |
|3 |  TABLE SCAN   |t2  |2        |37  |
========================================
...
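The semantics of the semi-join can be sketched in Python using the same rows that the example inserts. The helper name semi_join and the extra row in t2_dup are hypothetical, and NULL handling is ignored; this is an illustration of the semantics, not of the MERGE SEMI JOIN operator in the plan.

```python
def semi_join(left, right, on):
    """Return each left row once if at least one right row matches it."""
    return [l for l in left if any(on(l, r) for r in right)]

# Rows inserted in the example above.
t1 = [{"a": 1, "b": 1, "c": 1}, {"a": 2, "b": 2, "c": 2}]
t2 = [{"a": 1, "b": 1, "c": 1}, {"a": 2, "b": 2, "c": 2}]

# t1.a IN (SELECT t2.b FROM t2 WHERE t2.c = t1.c)
cond = lambda l, r: l["a"] == r["b"] and l["c"] == r["c"]
result = semi_join(t1, t2, cond)

# Hypothetical extra t2 row that also matches t1.a = 1.
t2_dup = t2 + [{"a": 3, "b": 1, "c": 1}]
result_dup = semi_join(t1, t2_dup, cond)
inner_dup = [(l, r) for l in t1 for r in t2_dup if cond(l, r)]
```

Both rows of t1 qualify. Adding a second match for the first row does not duplicate it in the semi-join result, whereas an inner join on the same condition would return three rows.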

Anti-join

A left anti-join of table A and table B returns only the rows in table A that do not match any row in table B. A right anti-join returns only the rows in table B that do not match any row in table A.

Similar to a semi-join, you can get an anti-join query only by unnesting and rewriting a subquery. Example:

obclient> CREATE TABLE t1(a INT PRIMARY KEY, b INT, c INT);
Query OK, 0 rows affected

obclient> CREATE TABLE t2(a INT PRIMARY KEY, b INT, c INT);
Query OK, 0 rows affected

obclient> INSERT INTO t1 VALUES (1, 1, 1),(2, 2, 2);
obclient> INSERT INTO t2 VALUES (1, 1, 1),(2, 2, 2);

obclient> SELECT * FROM t1 WHERE t1.a NOT IN (SELECT t2.b FROM t2 WHERE t2.c = t1.c);

When you execute the EXPLAIN statement to view a query plan, the results show that dependent subqueries are rewritten into anti-joins.

obclient> EXPLAIN SELECT * FROM t1 WHERE t1.a NOT IN (SELECT t2.b FROM t2 WHERE t2.c = t1.c);
| =============================================
|ID|OPERATOR            |NAME|EST. ROWS|COST|
---------------------------------------------
|0 |HASH RIGHT ANTI JOIN|    |0        |77  |
|1 | TABLE SCAN         |t2  |2        |37  |
|2 | TABLE SCAN         |t1  |2        |37  |
=============================================
...
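The anti-join is the complement of the semi-join, which the following Python sketch illustrates with the rows from the example. The helper name anti_join and the extra row in t1_extra are hypothetical. The sketch assumes no NULL values and does not reproduce the special NULL semantics of NOT IN.

```python
def anti_join(left, right, on):
    """Return the left rows for which no right row matches."""
    return [l for l in left if not any(on(l, r) for r in right)]

# Rows inserted in the example above.
t1 = [{"a": 1, "b": 1, "c": 1}, {"a": 2, "b": 2, "c": 2}]
t2 = [{"a": 1, "b": 1, "c": 1}, {"a": 2, "b": 2, "c": 2}]

# t1.a NOT IN (SELECT t2.b FROM t2 WHERE t2.c = t1.c), assuming no NULLs
cond = lambda l, r: l["a"] == r["b"] and l["c"] == r["c"]
result = anti_join(t1, t2, cond)

# Hypothetical extra t1 row with no counterpart in t2.
t1_extra = t1 + [{"a": 3, "b": 3, "c": 3}]
result_extra = anti_join(t1_extra, t2, cond)
```

With the inserted data every row of t1 has a match, so the result is empty, which is consistent with the estimate of 0 rows in the plan. Only the hypothetical unmatched row is returned in the second call.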