OceanBase Database is a multi-tenant database system. It isolates resources among tenants to prevent resource contention and ensure stable business operation.
In OceanBase Database, resource units are the basic units for allocating resources to tenants. A resource unit is equivalent to a Docker container. You can create multiple resource units on an OBServer. Each resource unit occupies some physical resources such as the CPU and memory of the OBServer. The resource allocation status of the OBServer is recorded in an internal table for database administrators (DBAs) to check.
A tenant can have multiple resource units on multiple OBServers but only one resource unit on one OBServer. Multiple resource units of a tenant are independent of each other. OceanBase Database does not aggregate the resource usage of multiple resource units for global resource control. Even if a tenant does not have sufficient resources on an OBServer, the tenant cannot compete for the resources of other tenants on other OBServers.
Resource isolation is a process in which an OBServer allocates resources among multiple resource units. Similar technologies are Docker containers and VMs. However, OceanBase Database does not rely on Docker containers or VMs but isolates resources within the database.
Advantages of multi-tenant isolation in OceanBase Database
Compared with Docker containers and VMs, OceanBase Database provides more lightweight tenant isolation and is easier to implement advanced features such as priority settings. Docker containers or VMs do not meet the requirements of OceanBase Database in the following aspects:
There are heavy overheads in runtime environments of both Docker containers and VMs, while OceanBase Database needs to support lightweight tenants.
The specification change and migration of Docker containers or VMs are costly, while OceanBase Database is designed to help tenants change specifications and perform the migration at a faster speed.
Tenants cannot share resources such as object pools if Docker containers or VMs are used.
The resource isolation settings in Docker containers or VMs are difficult to customize. For example, Docker containers and VMs do not support priority settings within a tenant.
In addition, unified views cannot be exposed if Docker containers or VMs are used.
Isolation from the perspective of user tenants
From the perspective of user tenants, the following isolation results are achieved:
Memory is completely isolated.
Details:
Operators use separate memory during SQL execution. When the memory of a tenant is exhausted, other tenants are not affected.
For more information about operators used in SQL execution, see User interfaces and query languages.
The block cache and MemTable are separated. When the memory of a tenant is exhausted, the read and write operations of other tenants are not affected.
For more information about the block cache, see Multi-level caches.
For more information about the MemTable, see MemTable.
CPUs are isolated based on user-mode scheduling.
The CPUs available for a tenant are determined by resource unit specifications. However, OBServers support the overselling of CPUs.
Most data structures are separated.
Details:
The SQL plan caches of tenants are separated. If the plan cache of a tenant is eliminated, other tenants are not affected.
The SQL audit tables of tenants are separated. If the queries per second (QPS) of a tenant is very high, the audit records of other tenants are not flushed out.
Transaction-related data structures are separated.
Details:
If the row lock of a tenant is suspended, other tenants are not affected.
If the transaction of a tenant is suspended, other tenants are not affected.
If the replay for a tenant fails, other tenants are not affected.
CLOGs are shared.
Different tenants on an OBServer share the same CLOG file, which facilitates the group commit of transactions.
Resource classification
You can specify the lower and upper limits of CPUs, memory, input/output operations per second (IOPS), disk size, and session number for a resource unit.
obclient> CREATE RESOURCE UNIT box1 MIN_CPU 4, MAX_CPU 4, MIN_MEMORY 34359738368, MAX_MEMORY 34359738368, MIN_IOPS 128, MAX_IOPS 128, MIN_DISK_SIZE '5G', MAX_DISK_SIZE '5G', MIN_SESSION_NUM 64, MAX_SESSION_NUM 64;
The IOPS, disk size, and number of sessions are not managed. Only CPU and memory are considered in resource isolation.
CPU available for an OBServer
When an OBServer starts, it detects the number of online CPUs of physical servers or containers. If the detection result is inaccurate, you can set the cpu_count parameter to specify the number of CPUs. For example, the result may be inaccurate if a containerized environment is used.
The OBServer reserves two CPUs for background threads. Therefore, the number of CPUs available for tenants is equal to the total number of CPUs minus 2.
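The calculation above can be sketched as follows (the reserved count of 2 is the documented default; `cpu_count` stands in for the detected or configured value):

```python
def tenant_cpu_capacity(cpu_count: int, reserved: int = 2) -> int:
    """CPUs available to tenants: total CPUs minus the 2 reserved for background threads."""
    return max(cpu_count - reserved, 0)

# A 16-core server leaves 14 CPUs for tenants.
print(tenant_cpu_capacity(16))
```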
Memory size available for an OBServer
When an OBServer starts, it detects the memory size of physical servers or containers. The OBServer must reserve some memory for other processes. Therefore, the memory size available for the observer process is equal to physical memory size × memory_limit_percentage. You can also set the memory_limit parameter to specify the total memory size available for the observer process.
The memory size for internal shared modules must be subtracted from the memory size available for the observer process. The remaining memory size is available for tenants. The memory size for internal shared modules is specified by the system_memory parameter.
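The two formulas above can be sketched together (the byte values in the example are hypothetical, not defaults):

```python
def observer_memory(physical_bytes: int, memory_limit_percentage: int,
                    memory_limit: int = 0) -> int:
    """Memory for the observer process: memory_limit if set,
    else physical memory scaled by memory_limit_percentage."""
    if memory_limit:
        return memory_limit
    return physical_bytes * memory_limit_percentage // 100

def tenant_memory(observer_bytes: int, system_memory: int) -> int:
    """Memory available to tenants: observer memory minus internal shared modules."""
    return observer_bytes - system_memory

GB = 1 << 30
total = observer_memory(64 * GB, 80)   # hypothetical 64 GB server, 80% limit
print(tenant_memory(total, 30 * GB))   # remainder after a hypothetical system_memory
```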
For more information about memory configurations, see Memory management.
Query resources available for each OBServer
You can query resources available for each OBServer in the oceanbase.__all_virtual_server_stat table. Sample statement:
obclient> SELECT * FROM __all_virtual_server_stat \G
*************************** 1. row ***************************
svr_ip: 10.10.10.1
svr_port: 20900
zone: z1
cpu_total: 14
cpu_assigned: 2.5
cpu_assigned_percent: 17
mem_total: 53687091200
mem_assigned: 12884901888
mem_assigned_percent: 24
disk_total: 10737418240
disk_assigned: 10737418240
disk_assigned_percent: 100
unit_num: 1
migrating_unit_num: 0
merged_version: 1
leader_count: 1283
load: 0.21379327157484151
cpu_weight: 0.4266211604095563
memory_weight: 0.5733788395904437
disk_weight: 0
id: 1
inner_port: 20901
build_version: 3.2.1_1-3c4c42fc31ba322cd7dfd441a8a71f648835eced(Sep 1 2021 16:16:06)
register_time: 0
last_heartbeat_time: 1631074048327156
block_migrate_in_time: 0
start_service_time: 1631073245084853
last_offline_time: 0
stop_time: 0
force_stop_heartbeat: 0
admin_status: NORMAL
heartbeat_status: alive
with_rootserver: 1
with_partition: 1
mem_in_use: 0
disk_in_use: 12582912
clock_deviation: -11
heartbeat_latency: 182
clock_sync_status: SYNC
cpu_capacity: 14
cpu_max_assigned: 5
mem_capacity: 53687091200
mem_max_assigned: 16106127360
ssl_key_expired_time: 0
Resource isolation
CPU and memory overselling
If the sum of the values of the MAX_CPU parameter for all resource units on an OBServer exceeds the total number of CPUs available for the OBServer, CPUs on the OBServer are oversold. Similarly, if the sum of the values of the MAX_MEMORY parameter for all resource units exceeds the allocatable memory size on the OBServer, memory on the OBServer is oversold.
The resource overselling ratio of the OBServer is specified by the resource_hard_limit parameter. If resource_hard_limit is set to 200, the resources can be oversold by 2 times. For example, if 16 CPUs and 16 GB of memory are available, 32 CPUs and 32 GB of memory can be sold.
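The ratio works out as in this sketch:

```python
def oversellable(available: float, resource_hard_limit: int) -> float:
    """Resources that can be sold: available capacity scaled by
    resource_hard_limit, which is expressed as a percentage."""
    return available * resource_hard_limit / 100

# With resource_hard_limit = 200, 16 CPUs can be sold as 32,
# and 16 GB of memory can be sold as 32 GB.
print(oversellable(16, 200))
```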
Overselling improves resource utilization but reduces stability. You must determine whether to enable overselling based on the features and service level agreement (SLA) requirements of applications. For example, overselling is suitable for business R&D and testing scenarios.
Even if resources are not oversold during resource allocation, the number of CPUs available for a resource unit is not strictly limited by the MAX_CPU parameter when isolating CPUs. This means that CPUs are oversold by default. For more information about CPU overselling, see CPU isolation.
Memory isolation
Memory is a rigid resource. Once memory is used, it is difficult to reclaim the memory quickly. Therefore, memory is not suitable for overselling.
The MAX_MEMORY parameter specifies the maximum memory size available for a resource unit. The MIN_MEMORY parameter determines only the timing for the block cache contention.
We recommend that you set the MIN_MEMORY and MAX_MEMORY parameters to the same value in production systems.
CPU isolation
CPUs are more flexible than memory. Therefore, an OBServer allows a resource unit to use physical CPUs that exceed the CPU quota of the resource unit.
In OceanBase Database earlier than V3.1.x, you can limit the number of threads to control CPU utilization. In OceanBase Database V3.1.x and later, you can configure cgroups to control CPU utilization.
Thread classification
An OBServer starts threads that have different features. For more information, see the topics related to OBServer threads. This section describes two types of OBServer threads.
Threads for processing SQL statements and transaction commits, which are collectively referred to as worker threads
Threads for processing network I/O, disk I/O, compactions, and scheduled tasks
Worker threads for different tenants are separated. Non-worker threads are shared among all tenants. In this section, only worker threads are involved in CPU isolation.
Thread-based CPU isolation
CPUs in a resource unit are isolated based on the number of active worker threads for the resource unit.
A single thread often cannot fully use one physical CPU because of I/O or lock waits during SQL execution. Therefore, an OBServer starts four threads for each CPU by default. The multiplier 4 is specified by the cpu_quota_concurrency parameter. In this case, if the MAX_CPU parameter of a resource unit is set to 10, up to 40 active threads can run in parallel for the resource unit, and the maximum CPU utilization is 400%. In effect, 10 CPUs are oversold as 40 CPUs.
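A sketch of the thread budget described above:

```python
def max_active_workers(max_cpu: int, cpu_quota_concurrency: int = 4) -> int:
    """Active worker threads allowed for a resource unit:
    MAX_CPU multiplied by cpu_quota_concurrency."""
    return max_cpu * cpu_quota_concurrency

# MAX_CPU = 10 allows 40 parallel active threads, i.e. up to 400% CPU utilization.
print(max_active_workers(10))
```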
Cgroup-based CPU isolation
After cgroups are enabled, worker threads of different tenants are stored to different cgroup directories, which improves CPU isolation among tenants. Isolation results:
If only one tenant on an OBServer is heavily loaded and other tenants are idle, the number of CPUs used by the heavily loaded tenant can exceed the value of the MAX_CPU parameter. For compatibility, the overselling multiple is still specified by the cpu_quota_concurrency parameter.
In the same scenario, if the loads of multiple idle tenants increase and physical CPUs are insufficient, cgroups allocate time slices by weight.
Large query processing
Quick response to small queries matters more than quick response to large queries. Therefore, large queries have lower priorities than small queries. When large queries and small queries contend for CPUs, the system limits the number of CPUs occupied by large queries.
If a worker thread takes a long time to execute an SQL query, the query is identified as a large query, and the worker thread waits in a pthread condition to release its CPU to other worker threads.
Specifically, an OBServer inserts many checkpoints in code. A running worker thread periodically checks its status at the checkpoints. If it is determined that the thread should be suspended, the thread waits in the pthread condition and will be woken up at the right time.
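The suspend-at-checkpoint mechanism can be sketched with Python's threading primitives (the class and method names are illustrative, not OceanBase internals):

```python
import threading

class WorkerThread:
    """Illustrative sketch: a worker that yields its CPU at checkpoints when flagged."""

    def __init__(self):
        self._cond = threading.Condition()
        self._suspended = False

    def checkpoint(self):
        # Called periodically from inside query execution; blocks while suspended.
        with self._cond:
            while self._suspended:
                self._cond.wait()  # releases the CPU, like waiting in a pthread condition

    def suspend(self):
        with self._cond:
            self._suspended = True

    def wake(self):
        with self._cond:
            self._suspended = False
            self._cond.notify()
```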
If both large queries and small queries exist, the large queries can occupy at most 30% of the worker threads. The percentage 30% is specified by the large_query_worker_percentage parameter.
Note:
If no small queries exist, large queries can occupy 100% of the worker threads. The percentage 30% takes effect only when both large queries and small queries exist.
If a worker thread is suspended while it executes a large query, the system may create a new worker thread to compensate. However, the total number of worker threads cannot exceed 10 times the value of the MAX_CPU parameter. The multiple 10 is specified by the workers_per_cpu_quota parameter.
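The two limits above can be sketched together (default percentages as stated in the text):

```python
def large_query_worker_cap(active_workers: int,
                           large_query_worker_percentage: int = 30) -> int:
    """Workers that large queries may occupy while small queries also exist."""
    return active_workers * large_query_worker_percentage // 100

def worker_thread_ceiling(max_cpu: int, workers_per_cpu_quota: int = 10) -> int:
    """Hard cap on total worker threads, including compensation threads."""
    return max_cpu * workers_per_cpu_quota

# With MAX_CPU = 10: at most 100 worker threads in total; if 40 are active,
# large queries may hold at most 12 of them while small queries are present.
print(worker_thread_ceiling(10), large_query_worker_cap(40))
```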
Identify large queries in advance
An OBServer starts a new worker thread after it suspends a large query thread. However, if many large queries are initiated, new threads created on the OBServer are still used to process large queries, and the number of worker threads soon reaches the upper limit. No threads are available for small queries until all the large queries are processed.
To resolve this issue, an OBServer predicts whether an SQL query is large before it is executed by estimating the execution time of the SQL query. Prediction is based on the following assumptions: If the execution plans of two SQL queries are the same, execution time of the SQL queries is also similar. In this case, you can determine whether an SQL query is large based on the execution time of the latest SQL query with the same execution plan.
If an SQL query is identified as a large query, it is placed in a large query queue, the worker thread executing the query is released, and the system continues to respond to subsequent requests.
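The prediction heuristic can be sketched as a per-plan lookup of the most recent execution time (the class, threshold, and names are illustrative):

```python
class LargeQueryPredictor:
    """Illustrative sketch: flag a query as large if the last run
    of the same execution plan exceeded a time threshold."""

    def __init__(self, threshold_ms: float):
        self.threshold_ms = threshold_ms
        self._last_ms = {}  # plan id -> execution time of the latest run

    def record(self, plan_id: str, elapsed_ms: float) -> None:
        self._last_ms[plan_id] = elapsed_ms

    def is_large(self, plan_id: str) -> bool:
        # Unknown plans are assumed small; they can still be demoted
        # mid-execution by the checkpoint mechanism described earlier.
        return self._last_ms.get(plan_id, 0.0) >= self.threshold_ms

p = LargeQueryPredictor(threshold_ms=100.0)
p.record("plan_a", 250.0)
print(p.is_large("plan_a"), p.is_large("plan_b"))  # True False
```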
Metrics and logs
You can search logs by the keyword dump tenant info to view tenant information, such as resource specifications, threads, queues, and request statistics. The logs are updated for each tenant every 10s.
Sample log:
grep 'dump tenant info.*tenant={id:1002' log/observer.log.*
[2021-05-10 16:56:22.564978] INFO [SERVER.OMT] ob_multi_tenant.cpp:803 [48820][2116][Y0-0000000000000000] [lt=5] dump tenant info(tenant={id:1002, compat_mode:1, unit_min_cpu:"1.000000000000000000e+01", unit_max_cpu:"1.500000000000000000e+01", slice:"0.000000000000000000e+00", slice_remain:"0.000000000000000000e+00", token_cnt:30, ass_token_cnt:30, lq_tokens:3, used_lq_tokens:3, stopped:false, idle_us:4945506, recv_hp_rpc_cnt:2420622, recv_np_rpc_cnt:7523808, recv_lp_rpc_cnt:0, recv_mysql_cnt:4561007, recv_task_cnt:337865, recv_large_req_cnt:1272, tt_large_quries:3648648, actives:35, workers:35, nesting workers:7, lq waiting workers:5, req_queue:total_size=48183 queue[0]=47888 queue[1]=0 queue[2]=242 queue[3]=5 queue[4]=48 queue[5]=0 , large queued:12, multi_level_queue:total_size=0 queue[0]=0 queue[1]=0 queue[2]=0 queue[3]=0 queue[4]=0 queue[5]=0 queue[6]=0 queue[7]=0 , recv_level_rpc_cnt:cnt[0]=0 cnt[1]=0 cnt[2]=0 cnt[3]=0 cnt[4]=0 cnt[5]=165652 cnt[6]=10 cnt[7]=0 })
The following table describes some of the parameters in the sample log.
| Parameter | Description |
|---|---|
| id | The ID of the tenant. |
| unit_min_cpu | The minimum number of CPUs available. |
| unit_max_cpu | The maximum number of CPUs available. |
| slice | This parameter has no practical significance. |
| slice_remain | This parameter has no practical significance. |
| token_cnt | The number of tokens allocated by the scheduler. Each token is converted into a worker thread. |
| ass_token_cnt | The number of tokens acknowledged by the tenant based on the token_cnt parameter. Typically, the values of the two parameters are the same. |
| lq_tokens | The number of tokens for large requests. The value of this parameter is equal to the value of the token_cnt parameter multiplied by the proportion of large requests. |
| used_lq_tokens | The number of worker threads that hold large query tokens. |
| stopped | Indicates whether resource units of the tenant are being deleted. |
| idle_us | The total idle time of the worker threads in one round (10s). The idle time includes only the waiting time in queues. |
| recv_hp/np/lp_rpc_cnt | The cumulative number of RPC requests received by the tenant, including hp (high), np (normal), and lp (low) RPC requests. |
| recv_mysql_cnt | The cumulative number of MySQL requests received by the tenant. |
| recv_task_cnt | The cumulative number of internal tasks received by the tenant. |
| recv_large_req_cnt | The cumulative number of large requests predicted by the tenant. The value of this parameter is only incremented but not cleared. The value is incremented upon retries. |
| tt_large_queries | The cumulative number of large requests processed by the tenant. The value of this parameter is only incremented but not cleared. The value is incremented upon checks. |
| actives | The number of active worker threads, which is often the same as the value of the workers parameter. The difference between the values of the two parameters results from the caches of worker threads and the caches of large requests with worker threads. |
| workers | The number of worker threads held by the tenant. |
| nesting workers | The number of dedicated threads held by the tenant for nested requests. A total of seven threads correspond to seven nesting levels. |
| lq waiting workers | The number of worker threads waiting for scheduling. |
| req_queue | The worker queues with different priorities. A smaller value indicates a higher priority. |
| large queued | The number of predicted large requests. |
| multi_level_queue | The worker queues that accommodate nested requests. queue[1] through queue[7] correspond to the seven nesting levels. queue[0] is invalid. |