Blog

Designing a Distributed SQL Engine: Challenges & Decisions

PRODUCT

ENGINEERING

Designing a Distributed SQL Engine: Challenges & Decisions

OceanBase

2023-04-23

Product|Engineering

Database|Distributed Systems|Distributed Database|Newsql Database

oceanbase database

Photo by Alexandre Perotto on Unsplash

This article is the transcript of a tech talk, SQL Engine Design Challenges & Decisions, given by Guoping Wang, tech leader of the SQL engine team at OceanBase. Guoping Wang got his Ph.D. Degree in Computer Science from the National University of Singapore.

In this tech talk, Guoping Wang covered four topics:

‒ Architecture overview

‒ Plan cache

‒ Query optimizer

‒ Execution engine

So without further ado, let’s dive right into the talk.

SQL Engine Architecture Overview

Below is the overall architecture of the OceanBase SQL engine.

oceanbase database

It mainly includes the following components:

1. Fast Parser is used to parameterize SQL statements, which is the input of the plan cache.

2. Plan Cache does plan caching, matching, and evicting

3. Parser/Resolver translates queries into statements, which are the internal representation of SQL statements

4. Transformer analyzes and rewrites the SQLs into more “easy-to-optimize” statements

5. Optimizer enumerates query plans using the System-R style approach and finds the best plan

6. Code Generator converts logical plans into physical plans

7. SQL Executor executes SQL using vector execution and parallel execution methods

Plan Cache

Plan Cache is an important component that is responsible for plan caching, matching, and evicting in the OceanBase SQL engine. It plays a significant role in OLTP workloads.

Usually, when SQL comes, it will first search the Plan Cache to find whether there is already some plan in the cache. If a cached plan is found for this SQL, it will be directly used for execution. Otherwise, the SQL engine will optimize a plan for the SQL. Eventually, the plan will be added to Plan Cache so that plans are automatically evicted based on memory constraints, schema changes, and statistics updates.

Although Plan Cache can be turned off at the tenant level, nearly all OceanBase customers will use the feature for better performance.

Why Plan Cache?

Normally Optimizer takes 1ms even for simple queries. While Plan Cache takes only 30us for simple queries, which is much faster. Therefore, there will be a significant performance improvement for heavy OLTP workloads. That’s the reason why almost all of our customers use Plan Cache.

Two models: Force and Exact

In OceanBase, there are two Plan Cache models, Force and Exact.

In the Force model, parameters with different values match the same plan. It is usually used in OLTP scenarios.

In the Exact model, parameters with the same values match the same plan. It is always used in OLAP scenarios.

Take two queries, Q1 and Q2, for example.

oceanbase database

Q1 and Q2 have different parameters. If you use the Exact model, the two queries cannot share the same plan. And if you use the Force model, the two queries will share the same plan.

Fast Parser

Fast Parser is an important component in Plan Cache. The main job it does is parameterize SQL statements and extract the constraints, which helps identify whether queries with different parameters are different or not.

Take a look at the following example. The Fast Parser component has parameterized the statements and extracted the constraint which is Order by 2. That being said, if there is a query coming through with a statement of Order by 1, then they should not match the same plan because they have different meanings.

oceanbase database

Plan Cache in OceanBase is essentially an in-memory key-value structure. The key can be parameterized SQLs, and database IDs. system variables (sysvars) and so on. The value is a set of plans with different constraints. What Plan Cache does is find an execution plan for a given SQL by constraints.

Query Optimizer

Query Optimizer is one of the most difficult components in RDBMSs. There are two popular optimizer frameworks: System-R Bottom-up style and Volcano/Cascade Top-down stuff. Both frameworks are widely and successfully used in commercial systems and open-source projects. For example, Oracle, IBM, and PostgreSQL use System-R bottom-up style, while SQL Server, Apache Calcite, and Pivotal Orca use the Volcano/Cascade top-down style.

I have been asked several times to compare the two frameworks. From my point of view, the two frameworks have their own advantages and disadvantages. As an optimizer developer, I think the real challenge lies not in choosing the best optimizer framework but in how to accurately estimate intermediate results sets and how to implement a complex query transformer.

In OceanBase, we use the System R approach. Here is the overall architecture:

oceanbase database

There are mainly the following parts in the Optimizer:

Query transformer
Plan enumerator, including access path selection, join enumeration, and remaining operators enumeration.
Cost model to estimate the cost for each enumerated plan
Statistics to estimate the size of intermediate results. In OceanBase, the execution feedback is also used as input for the statistics which helps estimate the size of the intermediate result more accurately.

Query Transformer

Query Transformer is essentially a set of rules to rewrite a SQL query into easier-to-optimize statements. By “easier to optimize”, I mean that the Optimizer can identify more optimization opportunities after the SQL is rewritten, thus reducing costs for execution.

oceanbase database

As seen from the above figure, query transforming rules are divided into heuristic-based rules and cost-based rules.

Heuristic-based rules can be applied without any cost evaluation, which always improves performance. While cost-based rules cannot be applied until the Optimizer evaluates costs and decides that the cost is acceptable for the execution. That being said, cost-based rules will significantly complicate the design of Query Transformer and most commercial RDBMSs don’t support cost-based rules.

In OceanBase, we will iterate all rules until they converge or reach the maximum iterations.

Heuristic-based Rules

Let’s take a look at an example of heuristic-based rules. The term Outer join elimination is a heuristic rule to transform a left join to an inner join.

oceanbase database

The main idea is to examine whether you have the non-reject predicate to filter the tuples generated by the left join. In this example, we have the non-rigid predicate t2.c2=0, so we can transform the left join into the inner join. Since the left joint is usually more costly to evaluate than the inner join, such a heuristic-based rule can help improve execution performance.

Cost-based Rules

Another example is cost-based rules. The term “Or Expansion” is a cost-based rule to transform the disjunctive predicates.

oceanbase database

The main idea is to transform the disjunctive queries into a set of sub-queries connected by the Union statement. In this example, we transform the predicate OR in the original query into two sub-statements connected by Union All. The lnnvl is used to avoid generating duplicates. Without transformation, the original query will be evaluated by the nested loop join, which is quite inefficient. After transformation, each sub-query can be evaluated by hash joint, which is more efficient than the original query. In this example, the Or Expansion can be done to improve performance. However, it is not always the case because Expansion is a cost-based rule and it cannot always reduce costs, which is why it has to be examined by the Optimizer whether it should be done or not.

Challenges in Implementing Transformer Rules

Over the past years, OceanBase has spent a lot of time implementing the transformer rules. And it is the most time-consuming part of query optimization. Basically, there are two challenges in implementing transformer rules:

1. Correctness

You have to make sure the transformed SQL is equivalent to the original SQL. Sometimes the rule is quite straightforward. Sometimes it is very difficult to examine whether the two queries are equivalent.

Let’s see a previously mentioned example. Is it correct to use Or Expansion here?

oceanbase database

The answer is no. In the original query, if a tuple in t1 doesn’t find any match in t2, it will generate a new tuple in t2. After we do Or Expansion, it turns out that if a tuple doesn’t find a match in t2, it will appear two times in each statement of the Union branch, which will generate another tuple and the results will have duplicate data.

2. Completeness

Another challenge is completeness, which means whether the rule is general enough to identify all rewrite opportunities.

Let’s take a look at an example. Can Or Expansion be used here? The answer is yes. The main idea is to use a window function to filter out tuples generated by the outer joins.

oceanbase database

OceanBase has implemented nearly 35 transformer rules, which cover most of the rules found in research papers and commercial RDBMSs. Among these rules, nearly 15 are implemented only by OceanBase and are widely used by customers.

Over the past several years, OceanBase has built a lot of transforming utilities and algorithms, which help implement new rules very fast and efficiently. For example, there are algorithms to derive a unique key after any complex queries, to judge whether a join is a loss-less join, to judge whether some expansions are unique after certain complex queries, to check query containment/overlap/intersection relationship, and so on.

OceanBase has built a very efficient and effective transformer framework. With this framework, it is very easy and fast to extend new rules.

Optimizer

The Optimizer is usually used to enumerate all the equivalent plans and find the best one. But sometimes Optimizer fails to find the best plan given the large plan space. More often Optimizer just tries to avoid bad plans.

To reduce complexity for Optimizer, most commercial RDBMSs restrict the plan space. Below are some examples of the plan space: Bushy tree, left-deep tree, and deep tree.

oceanbase database

OceanBase supports deep trees by default. The Bushy tree can be turned off with an SQL hint.

Components of Optimizer

Optimizer usually has three components:

1. Enumeration algorithms

Enumeration algorithms can be used to efficiently enumerate all equivalent plans for the optimizer. There are dynamic programming for small-size tables, greedy algorithm, and randomized algorithm for large-size tables.

2. Cost model

The cost model is used to estimate the cost for each enumerated plan. Usually, the cost consists of I/O, CPU, and network costs.

3. Statistics and size estimation

Statistics are used to estimate the size of the intermediate results of the query. Usually, statistics consists of basic row/column statistics, histograms, dynamic sampling, and cardinality feedback. Sized animation is the most difficult problem in query optimizer because it’s really difficult to accurately estimate the size of intermediate results. And this is an unsolved problem in both academic and commercial database systems. In fact, most bad plan cases that OceanBase customers have encountered are caused by size estimation issues.

Enumeration algorithms

OceanBase uses System-R style bottom-up dynamic programming to do join enumeration. It supports join reordering for all join types, including outer, semi, anti, and inner. By default, no cross product and Bushy tree are considered, but they are configurable.

Below is how System-R bottom-up dynamic programming works:

oceanbase database

It will enumerate all the relation subsets by iteration. In the first round of iteration, it will enumerate all the plans of size 1. In the second round, it will enumerate all the plans of size 2, and so on.

For each enumerated set, we will compute the best plan set, which means the plan with the lowest cost and the plan with some interesting order, which means the order of this plan will be useful for the following operators and can be used to reduce execution costs.

Distributed Query Optimization

OceanBase is a distributed relational database, so it is imperative to solve the problem of distributed query optimization.

Challenges

There are four challenges in distributed query optimization.

1. More search space

More search space means more distributed algorithms. For example, in OceanBase, there are different kinds of distributed join algorithms, partition-wise join, partial-partition-wise join, Hash distribution join, and broadcast distribution joins. The more distributed algorithms there are, the larger the plan search space will be, which makes distributed query optimization even more challenging.

2. More physical property

In normal query optimization, only ordering is a physical property. But in a distributed scenario, there is partition information, which mainly includes how the data is partitioned and where the location for this partition is. Because partition information can help decide where to build algorithms more efficiently, the information has to be maintained.

3. More complex cost model

In a distributed scenario, the cost model is more complex because there is network cost and parallel degrees to be considered in the model.

4. Other issues

There are some other issues like Bloom filters and partition pruning. All these things will complicate the design of distributed query optimization.

The Old Way: Two-Phase Approach

In its early versions, OceanBase adopted a two-phase approach for distributed query optimization.

In the first phase, it finds the “best” local plan without considering partition information. In the second phase, it transforms the local “best” plan into a distributed plan.

Let’s look at an example of a two-phase plan. In the first phase, it generates a plan for R1 joining R2 and then followed by R3. In the second phase, it makes this plan distributed by adding the exchange operators in the parallel place.

oceanbase database

This two-phase approach is nice in the sense that it helps reduce the complexity of distributed query optimization.

However, it didn’t work well in real production environments. A lot of bad cases of the two-phase approach occurred because it didn’t consider the distributed information when determining join order and join algorithms, which resulted in bad plans in some situations.

A Better Way: One-Phase Approach

In its later versions, OceanBase brought a better solution to address the above-mentioned problems, which is the one-phase approach.

In the plan enumeration phase, it will enumerate all the distributed algorithms and at the same time maintain the interesting order and the partition info.

The optimizer will mark plans with the least cost, with some interesting order, and with some interesting partitioning info as the best plan set.

The cost model will take a parallel degree into consideration when estimating the execution cost of each plan.

Let’s take two distributed plans, P1 and P2, as an example. The estimated execution time for P1 is 100ms and for P2 is 150ms. Even though P2 has a longer estimated time cost, the optimizer still maintains P2 as the best plan because it has some interesting partitioning info that might be useful for a group operator.

oceanbase database

Distributed query optimization will significantly increase plan space. Therefore, OceanBase has constantly been proposing pruning techniques to make distributed joins more efficient.

Execution Engine

Over the past few years, OceanBase has built three types of execution engines:

Volcano execution engine, with the following features and functions:

Well-designed memory layout to optimize performance
All operators, including join, group by, window function, etc., have an on-disk version of the implementation
Automatic, adaptive, and robust memory management for operators

2. Parallel execution engine, which is capable of:

Implementing a lot of efficient distributed join/group by/window function… algorithms
Being tolerant of different kinds of data skew
Being a fully pipelined parallel execution engine
Having a high-performance data transfer layer

3. Vectorized execution engine, which is featuring:

Cache-aware algorithms for various operators
Hardware (e.g., SIMD, prefetching) speeds up
Particularly suitable for our pax storage engine

Volcano Execution Engine

The Volcano execution engine is well-known and widely adopted by most commercial database systems.

In the Volcano execution engine, each operator has an open/next/close/rescan interface. Each plant tree consists of a series of operators, such as table scan, Hash join, and Hash group by.

If you want to get query results from a Volcano execution engine, you simply invoke the get next interface for the plan route. And you can accurately get the next result until you get all the results.

The advantage of the Volcano execution engine is that it’s very easy to implement and extend to new operators.

Let’s take a look at an example. This is a very simple SQL query of a table scan followed by a Group By. To get all the results of this query, you simply invoke the Get Next interface and the Hash Group By.

oceanbase database

But so in this example, you simply invoke the get next interface, get next interface of the hush grew by, and get all the results.

Vectorized Execution Engine

It is known to all that “one tuple at a time” can cause performance issues because there is a lot of overhead due to too many function calls, which in turn results in a CPU cache miss.

To address the problems, OceanBase performs a batch at a time instead of a tuple at a time with its vectorized execution engine.

Besides the “a batch at a time” mechanism, OceanBase also implements a lot of techniques to speed up the vectorized execution engine. For example, it has implemented cache-aware algorithms for various operators, evaluated filters on encoded data, and implemented hardware speed such as prefetching SIMD instructions for better execution performance.

It is worth noting that OceanBase uses a pax storage engine, which is a hybrid row and column store. OceanBase uses micro blocks to organize tuples. Tuples among the micro blocks are organized by row and tuples within each micro block are organized by column. When a query comes through, the storage engine simply decodes and projects the column data into the resolved buffer of the SQL engine, which is very efficient.

oceanbase database

We have performed a benchmark test against the TPC-H dataset with vectorized execution engine and a non-vectorized execution engine respectively. Below is a comparison of the two:

oceanbase database

It can be seen that the performance gap is quite significant and for simple queries like Q1, the vectorized execution engine is almost one order of magnitude faster than the non-vectorized execution engine.

Parallel Execution Engine

As a distributed relational database, OceanBase has to solve the parallel execution problem, which is why the SQL engine team has implemented a very efficient parallel execution engine.

In OceanBase, a distributed execution plan is divided into multiple DFOs (Data Flow Operator) based on the data transmit operator. A DFO is basically a parallel execution unit and has its own parallel degree. And different DFOs can have different parallel degrees. But it is really difficult to automatically design all the degrees for each DFO. So OceanBase still uses the same parallel degree for all DFOs. The Query Coordinator is in charge of DFO scheduling.

Let’s take a look at an example of a simple join query. The execution plan can be divided into two DFOs, DFO0 and DFO1. The coordinator does the scheduling.

oceanbase database

In OceanBase, the DFOs are scheduled to minimize the materialization cost. Data transfer between DFOs is done by DTL (Data Transfer Layer) .

oceanbase database

This is a brief introduction to the OceanBase SQL engine brought to you by Dr. Guoping Wang. In the next few weeks, there will be more in-depth articles about the important components of the SQL Engine such as the vectorized execution engine. Please stay tuned!

If you have any questions, feel free to leave a comment below! Any technical discussion is warmly welcomed.

Database Distributed Systems Distributed Database Newsql Database

OceanBase

Technology Trend | Product Interpretation | User Practice

Content