Blog

Rewrite SQL Statements: Practice with OceanBase Database

ENGINEERING

Rewrite SQL Statements: Practice with OceanBase Database

OceanBase

2023-03-03

Engineering

Sql Optimization|Newsql Database|Database|Distributed Systems

oceanbase database

Photo by Markus Spiske on Unsplash

OceanBase Database is a distributed database fully developed in-house by Ant Group. It has successfully supported seven Double 11 shopping festivals and is also the only distributed database in the world that has broken both TPC-C and TPC-H records. It has set forth a new standard of city-level disaster recovery solutions with five IDCs across three regions. The source code of the OceanBase Database was opened in June 2021. As one of the core modules of a relational database, the query optimizer has long been a challenge among other priorities in database kernel development. It is also the touchstone to measuring database maturity. To help you better understand the query optimizer of OceanBase Database and write productive SQL statements, we will post a series of articles to walk you through the key points of query rewrite and familiarize you with the equivalent variants to complex SQL statements.

Introduction

This article describes how query rewrite is performed in OceanBase Database. Most database developers tend to use the SELECT … FROM … WHERE syntax for database operation, and may add GROUP BY, ORDER BY, and LIMIT clauses in some complex statements. Features such as analytic functions are not a choice in most cases. Well, that is totally fine because we all want to write semantically correct statements by using the syntax that we are familiar with. A concise and correct SQL statement is good enough for users, you might think. In fact, we can always write different variants to accomplish the same query. For example, some application developers would write the SQL statement on the left side of Figure 1 to update the three rows of the lowest values of the T table with type = 10 and sorted by k. Semantically, the SQL statement is correct. In terms of performance, however, we can hardly say it is a good SQL statement. If no optimization is performed, the kernel will first execute a subquery to get rows whose IDs meet the conditions and then execute the outer primary query to update the corresponding rows. So, the SQL statement is not a good one for the kernel. We can use the SQL statement on the right side of Figure 1 to hit the same goal.

oceanbase database

Figure 1 An example of query rewrite

By using the rewritten statement, the kernel accesses the T table only once to find and update the desired rows. Developers may write inefficient SQL statements because of insufficient knowledge about the syntax or the way the kernel works.

If every database user were an expert at writing SQL statements and had profound knowledge about the working principle of the database kernel, it would be easy for them to write excellent SQL statements. However, that is not what the language was designed for. SQL is designed to enable users to focus only on correctly describing their needs. It is the job of the database kernel to efficiently process tasks. Therefore, the kernel of OceanBase Database provides a query rewrite module, which translates a user-friendly statement into a kernel-friendly statement. This article introduces our practice and experience in rewriting queries in OceanBase Database.

Overview

Philosophy of Query Rewrite

When OceanBase Database receives an SQL statement, it first performs lexical and syntactic parsing to convert the statement into an internal statement structure S1. The statement structure S1 is input to the query rewrite module, where it is rewritten into an equivalent statement structure S2 by using a series of algorithms. The statement structure S2 is output to generate a logical execution plan and a physical execution plan. Query rewrite is essentially a process of format matching. The query rewrite module matches each object in the statement structure against the rewrite condition. If an object matches the rewrite condition, it is equivalently converted. For example, to eliminate an identically true condition in a statement as shown in Figure 2, the query rewrite module checks all the conditional expressions in S1. In this case, 1=1 is an identically true condition, and is removed.

oceanbase database

Figure 2 Input and output of the query rewrite module

Algorithm Classification

The query rewrite module provides many rewrite algorithms. Different rewrite algorithms convert a query into equivalent variants of different formats. In a complete query rewrite cycle, the query rewrite module tries all algorithms to optimize the input statement structure. We hope that an SQL statement becomes better after each rewrite. However, not all rewrite algorithms can produce the result as expected. We classify rewrite algorithms into the following two categories:

Rule-based rewrite algorithms: A rule-based rewrite algorithm always generates a better execution plan. For example, the elimination of identically true or false conditions helps generate better execution plans at all times. In general, such algorithms go through the following process: match the format > execute the rewrite. OceanBase Database supports many rule-based rewrite algorithms, such as view merging, subquery promotion, inner join elimination, and outer join elimination.
Cost-based rewrite algorithms: A cost-based rewrite algorithm generates a better execution plan only in some circumstances. The trigger of rewrites by such algorithms depends on several factors, such as the data distribution and the availability of appropriate indexes. A cost-based rewrite is triggered only after the physical optimizer confirms that the rewritten statement will generate a plan with a lower execution cost. In general, such algorithms go through the following process: match the format > execute the rewrite > and verify the cost. OceanBase Database supports many cost-based rewrite algorithms, such as OR Expansion, JA subquery promotion, Win Magic, and Group By Placement.

The following example explains why some rewrites are triggered based on cost.

Q1:SELECT * FROM T1 WHERE C1 < 20000 OR C2 < 30;=>Q2: SELECT /*SEL_1*/ * FROM T1 WHERE C1 < 20000 UNION ALLSELECT /*SEL_2*/ * FROM T1 WHERE C2 < 30 AND LNNVL (C1 < 20000);

In OceanBase Database, the OR Expansion algorithm rewrites query Q1 to query Q2. Semantically, the two queries are equivalent. The purpose of query Q1 is to obtain a set of records that satisfy C1 < 20000 or C2 < 30. In query Q2, the SEL_1 clause obtains a set of records that satisfy C1 < 20000, the SEL_2 clause obtains a set of records that satisfy C2 < 30 but not C1 < 20000, and the union set of the two sets is the result set. Note that, instead of using the C1 >= 20000 conditions, the SEL_2 clause calls the LNNVL() function to obtain the records that do not satisfy C1 < 20000. In the foregoing example, assume that the value of r1 (C1, C2) is given as (NULL, 40). It is expected that SEL_2 may return r1. However, if the C1 >= 20000 condition is used, r1 will be filtered out.

As for whether to trigger a rewrite by using the OR Expansion algorithm, the following two scenarios are possible:

The T1 table has indexes on columns C1 and C2. Query Q1 cannot use the two indexes and can only perform a primary table scan. The SEL_1 and SEL_2 clauses of query Q2 can use the indexes on columns C1 and C2, respectively. If the two filter conditions of query Q2 are highly selective, the index scan can greatly reduce the overhead of data read. In this case, the OR Expansion rewrite can be triggered to generate a better execution plan.

2. The T1 table has no indexes. A full table scan is required for query Q1 and query Q2. As query Q2 is split into two SELECT clauses, two full table scans are required. This means that the execution cost increases after the rewrite. In this case, the OR Expansion rewrite is not triggered.

Exhaustiveness and Correctness

The working principle of each rewrite algorithm is not that complicated. As long as a developer has learned some rewrite rules, it is easy for him or her to manually rewrite an SQL statement into an equivalent kernel-friendly variant. SQL rewrite is a simple task at the business layer because you need to rewrite only one specific SQL statement. On the contrary, SQL rewrite is extremely challenging at the kernel layer because you have to correctly rewrite any given SQL statement that can be rewritten into a better variant. As a result, a rewrite algorithm must meet the following two requirements:

Correctness: The rewrite does not change the semantics of the original SQL statement.
Exhaustiveness: An SQL statement must be rewritten whenever possible.

The correctness requirement is apparently reasonable because an incorrectly rewritten statement can do nothing but damage to your business.

Exhaustiveness is equally important. It requires a rewrite algorithm to be generally applicable to both simple and complex scenarios. A narrowly applicable rewrite algorithm offers limited benefits for your business. However, it is true that a rewrite algorithm can hardly be exhaustive. This is because you can always run into a particularly complex scenario every now and then, and a forced rewrite may introduce errors. Therefore, we can only try to make a rewrite algorithm as exhaustive as possible with its correctness guaranteed. Outer Join elimination is taken as an example to describe the challenges when we try to correctly implement an exhaustive rewrite.

Outer Join operations include Left Outer Join, Right Outer Join, and Full Outer Join. The order of the Outer Join cannot be changed during the operation so the optimizer has limited options for join order. When the L LEFT JOIN R ON L.ID = R.ID statement is executed to perform Left Outer Join of the L and R tables, if a row in the L table is not joined with a row in the R table, a null cell is added to the result set in the column from the R table, as shown in Figure 3. You can see that the result set of the Left Outer Join statement is a superset of that of the Inner Join statement and includes nulls for missing matches from the R table.

oceanbase database

Figure 3 Difference between Inner Join and Outer Join

If you specify the R.C2 = 'XXX' condition to filter the result of L LEFT JOIN R, the row containing the complemented null is filtered out and the result set of Outer Join is identical to that of Inner Join. In this case, we can rewrite the Outer Join clause into an Inner Join clause. This method allows the optimizer to consider more options of join order and algorithms. In Outer Join elimination, a condition R.C2 = 'XXX' is called a Reject-NULL condition. The most challenging part of Outer Join elimination is to accurately determine whether any given expression is a Reject-NULL condition. Consider the following scenarios:

· R.C2 is followed or preceded by a conditional expression such as >, <, ==, or IS NOT NULL. When NULL is on the other side of the expression, the result will definitely be unknown or false. Therefore, they are Reject-NULL conditions.

· F(R.C2) is followed or preceded by a conditional expression such as >, <, or ==. In this case, it is hard to determine the result of such expressions. This is because it is hard to determine the output of F(R.C2) when the input value of R.C2 is NULL. For example, if F(R.C2) is NVL(R.C2, ?), the output is ?. Therefore, it is hard to infer whether NVL(R.C2, ?) = ‘XXX’ constitutes a Reject-NULL condition.

Apparently, it is easier to rewrite the statement in the first case than in the second case. OceanBase Database provides the Pass-NULL mechanism to infer the result of an expression. Given a nested expression and one or multiple parameters, the system can determine whether the output of the expression is NULL when the input of the parameter is NULL. Based on this mechanism, OceanBase Database supports the rewrite in both of the preceding scenarios.

In the preceding scenarios, the Reject-NULL condition is constituted by a single column. In fact, NULLs are filled in the result of Left Outer Join for all missing matches in the right table. So, we can further determine whether the conditional expression composed of multiple columns of the right table constitutes a Reject-NULL condition. In the following SQL statement, for example, an OR condition is specified. If only R.C2 is considered, R.C2 IS NOT NULL is not a Reject-NULL condition for the SQL statement. However, if both R.C2 and R.C3 are taken into account, the OR condition is a Reject-NULL condition.

SELECT L.ID, L.C1, R.C2 FROM L LEFT JOIN R ON L.ID = R.ID             WHERE R.C2 IS NOT NULL OR R.C3 IS NOT NULL;

As you can see, Outer Join elimination is not that complicated. In a simple scenario such as when R.C2 directly appears on one side of the conditional expression, it is easy to determine a Reject-NULL condition. In a scenario of medium complexity, such as when R.C2 is involved in elementary arithmetic operations (addition, subtraction, multiplication, and division), it is not that hard to reach the goal either. However, in a very complex scenario such as when R.C2 is nested in a function expression like CASE WHEN() or NVL(), it is much more difficult to determine the results. For the purpose of Outer Join elimination, determining the Reject-NULL condition is the first step to correctly implementing the rewrite. To improve the generality of the rewrite algorithm, we also need to determine the Pass-NULL property of expressions. Furthermore, the algorithm must recognize the Reject-NULL condition composed of multiple NULL columns.

The method of how to determine an exhaustive Reject-NULL predicate is described above. It is more important to note that not all Reject-NULL predicates are suitable for rewriting Outer Join to Inner Join. This is because the correctness requirement has a higher priority. If improper rewrite rules are defined, the query rewrite can change the semantics of the original SQL statement and leads to incorrect results.

SELECT L.ID, R.ID, T.ID FROM (L LEFT JOIN R ON L.ID = R.ID)LEFT JOIN T ON (R.ID = T.ID);

In the preceding statement, Outer Join is performed to join the result of the L LEFT JOIN R clause with the T table based on the R.ID = T.ID predicate. The effective scope of the predicate includes the columns of the right table resulting from the L LEFT JOIN R clause. This predicate by itself does reject NULL. However, as an outer join predicate, it does not filter the data on the left side, even if the result of the predicate is not true. Therefore, although the result R.ID = T.ID is not true when the value of R.ID is NULL, this Outer Join statement cannot be rewritten into an Inner Join statement.

Rewrite Module

The query rewrite module of OceanBase Database provides a large number of rewrite algorithms. As described in Section 2, these rewrite algorithms are classified into rule-based and cost-based algorithms.

oceanbase database

Figure 4 The rewrite module

A parsed SQL statement is converted to an internal statement structure, which is then rewritten into an equivalent variant by the query rewrite module. The query rewrite module tries all rewrite algorithms in a specific order, and triggers cost verification when a cost-based rewrite algorithm is used. Note that the query rewrite module will keep iterating the rewrite algorithms until no more changes can be triggered by the internal statement structure.

The query rewrite module triggers rewrite algorithms in a specific order. This is because some rewrites need to be triggered before others. For example, in a statement that involves view merge and predicate move, the view merge algorithm tries to merge multiple query blocks into one, and the predicate move algorithm directly pulls up or pushes down the predicates of multiple query blocks. In this case, the query rewrite module tries the view merge algorithm first. This is because if view merge succeeds, a predicate move is no longer necessary. In a word, the query rewrite module of OceanBase Database determines the trigger order based on the relationships between the rewrite algorithms.

Cost verification is another important task of the query rewrite module. A cost-based rewrite may not necessarily result in a better execution plan. During cost verification, the query rewrite module iterates all rewrite algorithms on the rewritten statement. This is because the rewritten statement may be further rewritten. Then, the optimizer generates an execution plan separately for the original and rewritten statements and determines whether the execution plan for the rewritten statement costs less.

Conclusion

This article briefly describes the query rewrite feature of OceanBase Database in terms of the rewrite methods, types of rewrite algorithms, challenges in implementing a rewrite algorithm, and how the rewrite module triggers multiple rewrite algorithms. Query rewrite is a key feature of the optimizer, and is a demanding but basic skill for SQL performance tuning and database management. This article only provides an overview of the rewrite module. We will publish more articles to introduce the many rewrite algorithms of the OceanBase Database. In the subsequent articles of the series, we will share our experience in optimizing relational algebra expressions and calculation expressions, such as subqueries, grouping, analytic functions, and joins. We believe that this series can be a great source of information for you, and we are more than happy to exchange and discuss ideas about query rewrite with you.

Sql Optimization Newsql Database Database Distributed Systems

OceanBase

Technology Trend | Product Interpretation | User Practice

Content