Overview

2024-11-11 07:37:15  Updated

Query rewrite refers to rewriting an SQL query into another one that is easier to optimize.

OceanBase Database supports rule-based query rewrite and cost-based query rewrite.

Rule-based query rewrite always aims to make an SQL query "better", making it possible to keep optimizing the SQL query. For example, converting a subquery into a join is a typical rule-based query rewrite, which provides the optimizer with more execution plan options, such as hash join and merge join besides nested loop join.

Cost-based query rewrite does not necessarily make an SQL query "better". It determines whether to rewrite an SQL statement based on the cost model. For example, OR-EXPANSION is a cost-driven rewrite strategy.

Specific conditions must be met to apply a rewrite rule in a database. Rewrite rules may be mutually triggered. OceanBase Database splits a rule-based rewrite into multiple sets and iteratively processes each set until the query can no longer be rewritten or the number of iterations reaches the specified value. Cost-based rewrites are implemented in a similar mechanism as rule-based rewrites.

Note that a cost-based rewrite may trigger a rule-based rewrite. Therefore, OceanBase Database actually rewrites a query by alternately using cost-based rewrites and rule-based rewrites.

Query rewrite models

OceanBase Database supports two rewrite models for queries: rule-based rewrite and cost-based rewrite.

During the optimization of queries in a database, the rule model and cost model are two key factors that determine how to rewrite a query and select an execution plan. Both models aim to improve the SQL query performance. The following sections describe some common rule models and cost models.

Rule models (rule-based query rewrite)

Generally, a rule model rewrites queries based on fixed heuristic rules. Here are some typical rule-based rewrite strategies:

  1. Subquery-related rewrite:

    • View merge: A nested query in a view is merged with the main query to reduce query layers and improve the execution efficiency.
    • Subquery unnesting: The subquery in a WHERE condition is promoted to the parent query and unnested to produce a join condition in parallel with the parent query. The rewrite deconstructs the subquery and changes the outer parent query to a multi-table join.
  2. Rewrite ANY or ALL by using MAX or MIN:

    • A subquery that contains an ANY or ALL operator is rewritten to an equivalent that uses the MAX or MIN aggregate function so as to apply indexes or a more efficient execution strategy.
  3. Outer join elimination:

    • An outer join whose result does not affect the final result set is eliminated to convert the query into an inner join or a simple SELECT query.
  4. Condition simplification

    • HAVING condition elimination: If no aggregate or GROUP BY operation exists in the query, the HAVING condition can be merged into the WHERE conditions, with the HAVING condition eliminated. This way, the HAVING condition can be managed with other conditions in WHERE for further optimization.
    • Equivalence relation deduction: In the process of equivalence relation deduction, new conditional expressions are deduced based on the transitivity of comparison operators to reduce the number of rows to be processed or to select a more efficient index.
    • Identically true/false elimination: Identically true and false conditions are eliminated from the query conditions to simplify the query logic.
  5. Non-select project join (non-SPJ) rewrite:

    • Redundant ORDER BY elimination: If the query results do not need to be ordered or the ordering is ineffective, the ORDER BY operation is eliminated.
    • LIMIT pushdown to subquery: If LIMIT can be used in a subquery without affecting the final result set, LIMIT is pushed down to the subquery to reduce the number of rows to be processed.
    • LIMIT pushdown to outer join or cross join: LIMIT is pushed down to an appropriate part of an outer join or cross join to reduce the number of rows to be processed.
    • DISTINCT elimination: If it can be guaranteed that rows in the result set are unique, the DISTINCT operation is eliminated.
    • MIN/MAX rewrite: Where it is deemed appropriate, a query is rewritten to a more efficient equivalent to calculate the MIN or MAX values. For example, a query can be rewritten to use indexes for access.

Cost models (cost-based query rewrite)

A cost model selects the optimal execution plan for a query based on cost estimation. Here are some cost-based strategies supported in OceanBase Database:

  • OR-EXPANSION: splits a predicate that contains an OR condition into multiple independent queries and merges the result sets of the queries. This can sometimes improve the query performance by using indexes or parallel processing.

OceanBase Database estimates the resource consumption of a query based on a cost model, which evaluates the cost for executing an operator based on a set of formulas and constant parameters. For example, the cost for executing the EXCHANGE IN operator can be evaluated by using the following formula:

cost = rows * row_width * NETWORK_TRANS_PER_BYTE_COST +
       rows * row_width * NETWORK_DESER_PER_BYTE_COST

The preceding formula involves two coefficients rows (the number of output rows) and row_width (the data width), and calculates the overall overhead based on NETWORK_TRANS_PER_BYTE_COST (the cost for network transmission) and NETWORK_DESER_PER_BYTE_COST (the cost for data deserialization).

An adaptive cost model adjusts the cost coefficients to match a specific hardware environment. You can use an automatic script to try to calculate new cost coefficients when you install your database. However, the optimizer uses too many cost coefficients, which makes the calculation process extremely complex, leading to a high probability of errors and very poor user experience.

The coefficients involved in cost models are normalized into functions of the following hardware parameters to simplify adaptive coefficient adjustment. You can normalize the required cost coefficients into combinations of the following four basic coefficients.

  • CPU frequency (CPU_SPEED)
  • Sequential disk read rate
  • Random disk read rate
  • Network bandwidth

For example, the cost efficient for hash table probe can be converted to the number of CPU instructions (CPU_CYCLES). Therefore, the cost for probing a hash table in the current environment can be calculated based on the CPU frequency by using the following formula:

PROBE_HASH_PER_ROW_COST = CPU_CYCLES * CPU_SPEED;

This way, coefficients of a cost model can be decoupled from the current hardware. You can use different system information to calculate cost coefficients for different hardware environments.

OceanBase Database V4.3.0 and later support adaptive cost models. You can evaluate the overhead of an execution plan based on the real-time hardware performance. Moreover, an API is provided for you to manually adjust cost coefficients to customize an optimization strategy when necessary.

References

Contact Us