Query rewrite refers to rewriting an SQL query into another one that is easier to optimize.
OceanBase Database supports rule-based query rewrite and cost-based query rewrite.
Rule-based query rewrite always aims to make an SQL query "better", making it possible to keep optimizing the SQL query. For example, converting a subquery into a join is a typical rule-based query rewrite, which provides the optimizer with more execution plan options, such as hash join and merge join besides nested loop join.
Cost-based query rewrite does not necessarily make an SQL query "better". It determines whether to rewrite an SQL statement based on the cost model. For example, OR-EXPANSION is a cost-driven rewrite strategy.
Specific conditions must be met to apply a rewrite rule in a database. Rewrite rules may be mutually triggered. OceanBase Database splits a rule-based rewrite into multiple sets and iteratively processes each set until the query can no longer be rewritten or the number of iterations reaches the specified value. Cost-based rewrites are implemented in a similar mechanism as rule-based rewrites.
Note that a cost-based rewrite may trigger a rule-based rewrite. Therefore, OceanBase Database actually rewrites a query by alternately using cost-based rewrites and rule-based rewrites.
Query rewrite models
OceanBase Database supports two rewrite models for queries: rule-based rewrite and cost-based rewrite.
During the optimization of queries in a database, the rule model and cost model are two key factors that determine how to rewrite a query and select an execution plan. Both models aim to improve the SQL query performance. The following sections describe some common rule models and cost models.
Rule models (rule-based query rewrite)
Generally, a rule model rewrites queries based on fixed heuristic rules. Here are some typical rule-based rewrite strategies:
Subquery-related rewrite:
- View merge: A nested query in a view is merged with the main query to reduce query layers and improve the execution efficiency.
- Subquery unnesting: The subquery in a
WHEREcondition is promoted to the parent query and unnested to produce a join condition in parallel with the parent query. The rewrite deconstructs the subquery and changes the outer parent query to a multi-table join.
Rewrite
ANY or ALLby usingMAX or MIN:- A subquery that contains an
ANYorALLoperator is rewritten to an equivalent that uses theMAXorMINaggregate function so as to apply indexes or a more efficient execution strategy.
- A subquery that contains an
Outer join elimination:
- An outer join whose result does not affect the final result set is eliminated to convert the query into an inner join or a simple
SELECTquery.
- An outer join whose result does not affect the final result set is eliminated to convert the query into an inner join or a simple
Condition simplification
HAVINGcondition elimination: If no aggregate orGROUP BYoperation exists in the query, theHAVINGcondition can be merged into theWHEREconditions, with theHAVINGcondition eliminated. This way, theHAVINGcondition can be managed with other conditions inWHEREfor further optimization.- Equivalence relation deduction: In the process of equivalence relation deduction, new conditional expressions are deduced based on the transitivity of comparison operators to reduce the number of rows to be processed or to select a more efficient index.
- Identically true/false elimination: Identically true and false conditions are eliminated from the query conditions to simplify the query logic.
Non-select project join (non-SPJ) rewrite:
- Redundant ORDER BY elimination: If the query results do not need to be ordered or the ordering is ineffective, the ORDER BY operation is eliminated.
- LIMIT pushdown to subquery: If
LIMITcan be used in a subquery without affecting the final result set,LIMITis pushed down to the subquery to reduce the number of rows to be processed. - LIMIT pushdown to outer join or cross join:
LIMITis pushed down to an appropriate part of an outer join or cross join to reduce the number of rows to be processed. - DISTINCT elimination: If it can be guaranteed that rows in the result set are unique, the
DISTINCToperation is eliminated. - MIN/MAX rewrite: Where it is deemed appropriate, a query is rewritten to a more efficient equivalent to calculate the
MINorMAXvalues. For example, a query can be rewritten to use indexes for access.
Cost models (cost-based query rewrite)
A cost model selects the optimal execution plan for a query based on cost estimation. Here are some cost-based strategies supported in OceanBase Database:
- OR-EXPANSION: splits a predicate that contains an OR condition into multiple independent queries and merges the result sets of the queries. This can sometimes improve the query performance by using indexes or parallel processing.
OceanBase Database estimates the resource consumption of a query based on a cost model, which evaluates the cost for executing an operator based on a set of formulas and constant parameters. For example, the cost for executing the EXCHANGE IN operator can be evaluated by using the following formula:
cost = rows * row_width * NETWORK_TRANS_PER_BYTE_COST +
rows * row_width * NETWORK_DESER_PER_BYTE_COST
The preceding formula involves two coefficients rows (the number of output rows) and row_width (the data width), and calculates the overall overhead based on NETWORK_TRANS_PER_BYTE_COST (the cost for network transmission) and NETWORK_DESER_PER_BYTE_COST (the cost for data deserialization).
An adaptive cost model adjusts the cost coefficients to match a specific hardware environment. You can use an automatic script to try to calculate new cost coefficients when you install your database. However, the optimizer uses too many cost coefficients, which makes the calculation process extremely complex, leading to a high probability of errors and very poor user experience.
The coefficients involved in cost models are normalized into functions of the following hardware parameters to simplify adaptive coefficient adjustment. You can normalize the required cost coefficients into combinations of the following four basic coefficients.
- CPU frequency (CPU_SPEED)
- Sequential disk read rate
- Random disk read rate
- Network bandwidth
For example, the cost efficient for hash table probe can be converted to the number of CPU instructions (CPU_CYCLES). Therefore, the cost for probing a hash table in the current environment can be calculated based on the CPU frequency by using the following formula:
PROBE_HASH_PER_ROW_COST = CPU_CYCLES * CPU_SPEED;
This way, coefficients of a cost model can be decoupled from the current hardware. You can use different system information to calculate cost coefficients for different hardware environments.
OceanBase Database V4.3.0 and later support adaptive cost models. You can evaluate the overhead of an execution plan based on the real-time hardware performance. Moreover, an API is provided for you to manually adjust cost coefficients to customize an optimization strategy when necessary.