This topic describes distributed execution and parallel query.
Distributed execution
For a distributed system based on the Shared-Nothing architecture, the data of a relational table is stored in partitions on various nodes of the system. So, for a cross-zone data query, the execution plan is required to handle data on multiple nodes. Therefore, OceanBase Database is provided with capabilities to generate and execute distributed execution plans.
For distributed execution plans, partitioning improves query performance. You do not need to partition a small relational table. However, for a large relational table, you need to carefully select the partitioning key based on the requirements of the upper-layer application system. Make sure that most queries can use the partitioning key for partition pruning, which reduces the data access workload.
We recommend that you use the join key as the partition key for related tables and apply the same partitioning method to configure the same partitions on the same node by using table groups, to reduce cross-node data interaction.
The optimizer of OceanBase Database automatically generates a distributed execution plan based on the query and the physical distribution of data.
Parallel querying
A parallel query refers to the process of restructuring an execution plan to increase its CPU resources and I/O processing capacity, and thus reduce the system response time to the corresponding query. The parallel query technology applies to both distributed and local execution plans.
When a single query involves access to data stored on different nodes, you need to fetch such data to the same node for computing through data redistribution. The system takes each data redistribution node as the upper and lower boundaries to vertically divide each execution plan of OceanBase Database into multiple data flow objects (DFOs). Each DFO is split into tasks of specified degrees of parallelism (DOPs) to improve execution efficiency by running tasks in parallel.
Generally, a higher DOP leads to a shorter response time to the query, but it requires more CPU, I/O, and memory resources to execute the query. For decision support systems (DSS) or data warehouses that support querying massive amounts of data, the response time is shortened.
Generally, the execution logic of a parallel query is similar to that of a distributed execution plan. After an execution plan is decomposed, each part of the plan is executed by multiple execution threads. DFOs of the execution plan and the tasks within each DFO are executed in parallel through scheduling. Parallel queries are ideal for batch update, index creation, and index maintenance operations in online transaction processing (OLTP) scenarios.
Parallel queries can effectively improve the processing performance of the system when the system has:
Sufficient I/O bandwidth
Low CPU utilization
Adequate memory resources
If the system does not have sufficient resources for parallel processing, you cannot improve the execution performance by using parallel queries or increasing the DOPs of the tasks. On the contrary, when the application system is overloaded, the operating system must run more scheduling tasks, such as context switching and page switching, which may degrade its performance.
Generally, for DSS systems with a large number of partitions and data warehouses, parallel execution can reduce the response time. For simple DML operations, intra-partition queries, or queries across a small number of partitions, parallel queries cannot significantly reduce the response time.