OceanBase Database is a multi-tenant database system. It isolates resources among tenants to prevent resource contention and ensure stable business operation.
In OceanBase Database, resource units are the basic units for allocating resources to tenants. A resource unit is equivalent to a Docker container. You can create multiple resource units on an OBServer. Each resource unit occupies some physical resources such as the CPU and memory of the OBServer. The resource allocation status of the OBServer is recorded in an internal table for database administrators (DBAs) to check.
A tenant can have multiple resource units on multiple OBServers but only one resource unit on one OBServer. Multiple resource units of a tenant are independent of each other. OceanBase Database does not aggregate the resource usage of multiple resource units for global resource control. Even if a tenant does not have sufficient resources on an OBServer, the tenant cannot compete for the resources of other tenants on other OBServers.
Resource isolation is a process in which an OBServer allocates resources among multiple resource units. Similar technologies are Docker containers and VMs. However, OceanBase Database does not rely on Docker containers or VMs but isolates resources within the database.
Advantages of multi-tenant isolation in OceanBase Database
Compared with Docker containers and VMs, OceanBase Database provides more lightweight tenant isolation and is easier to implement advanced features such as priority settings. Docker containers or VMs do not meet the requirements of OceanBase Database in the following aspects:
There are heavy overheads in runtime environments of both Docker containers and VMs, while OceanBase Database needs to support lightweight tenants.
The specification change and migration of Docker containers or VMs are costly, while OceanBase Database is designed to help tenants change specifications and perform the migration at a faster speed.
Tenants cannot share resources such as object pools if Docker containers or VMs are used.
The resource isolation settings in Docker containers or VMs are difficult to customize. For example, Docker containers and VMs do not support priority settings within a tenant.
In addition, unified views cannot be exposed if Docker containers or VMs are used.
Isolation from the perspective of user tenants
From the perspective of user tenants, the following isolation results are achieved:
Memory is completely isolated.
Details:
Operators use separate memory during SQL execution. When the memory of a tenant is exhausted, other tenants are not affected.
For more information about operators used in SQL execution, see User interfaces and query languages.
The block cache and MemTable are separated. When the memory of a tenant is exhausted, the read and write operations of other tenants are not affected.
For more information about the block cache, see Multi-level caches.
For more information about the MemTable, see MemTable.
CPUs are isolated based on user-mode scheduling.
The CPUs available for a tenant are determined by resource unit specifications. However, OBServers support the overselling of CPUs.
Most data structures are separated.
Details:
The SQL plan caches of tenants are separated. If the plan cache of a tenant is eliminated, other tenants are not affected.
The SQL audit tables of tenants are separated. If the queries per second (QPS) of a tenant is very high, the audit records of other tenants are not flushed out.
Transaction-related data structures are separated.
Details:
If the row lock of a tenant is suspended, other tenants are not affected.
If the transaction of a tenant is suspended, other tenants are not affected.
If the replay for a tenant fails, other tenants are not affected.
CLOGs are shared.
Different tenants on an OBServer share the same CLOG file, which facilitates the group commit of transactions.
Resource classification
You can specify the lower and upper limits of CPUs, memory, input/output operations per second (IOPS), disk size, and session number for a resource unit.
obclient> CREATE RESOURCE UNIT box1 MIN_CPU 4, MAX_CPU 4, MIN_MEMORY 34359738368, MAX_MEMORY 34359738368, MIN_IOPS 128, MAX_IOPS 128, MIN_DISK_SIZE '5G', MAX_DISK_SIZE '5G', MIN_SESSION_NUM 64, MAX_SESSION_NUM 64;
The IOPS, disk size, and number of sessions are not managed. Only CPU and memory are considered in resource isolation.
CPU available for an OBServer
When an OBServer starts, it detects the number of online CPUs of physical servers or containers. If the detection result is inaccurate, you can set the cpu_count parameter to specify the number of CPUs. For example, the result may be inaccurate if a containerized environment is used.
The OBServer reserves two CPUs for background threads. Therefore, the number of CPUs available for tenants is equal to the total number of CPUs minus 2.
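The calculation above can be sketched as follows (the reserved count of 2 is the documented default; `cpu_count` stands in for the detected or configured value):

```python
def tenant_cpu_capacity(cpu_count: int, reserved: int = 2) -> int:
    """CPUs available to tenants: total CPUs minus the 2 reserved for background threads."""
    return max(cpu_count - reserved, 0)

# A 16-core server leaves 14 CPUs for tenants.
print(tenant_cpu_capacity(16))
```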
Memory size available for an OBServer
When an OBServer starts, it detects the memory size of physical servers or containers. The OBServer must reserve some memory for other processes. Therefore, the memory size available for the observer process is equal to physical memory size × memory_limit_percentage. You can also set the memory_limit parameter to specify the total memory size available for the observer process.
The memory size for internal shared modules must be subtracted from the memory size available for the observer process. The remaining memory size is available for tenants. The memory size for internal shared modules is specified by the system_memory parameter.
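The two formulas above can be sketched together (the byte values in the example are hypothetical, not defaults):

```python
def observer_memory(physical_bytes: int, memory_limit_percentage: int,
                    memory_limit: int = 0) -> int:
    """Memory for the observer process: memory_limit if set,
    else physical memory scaled by memory_limit_percentage."""
    if memory_limit:
        return memory_limit
    return physical_bytes * memory_limit_percentage // 100

def tenant_memory(observer_bytes: int, system_memory: int) -> int:
    """Memory available to tenants: observer memory minus internal shared modules."""
    return observer_bytes - system_memory

GB = 1 << 30
total = observer_memory(64 * GB, 80)   # hypothetical 64 GB server, 80% limit
print(tenant_memory(total, 30 * GB))   # remainder after a hypothetical system_memory
```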
For more information about memory configurations, see Memory management.
Query resources available for each OBServer
You can query resources available for each OBServer in the oceanbase.__all_virtual_server_stat table. Sample statement:
obclient> SELECT * FROM __all_virtual_server_stat \G
*************************** 1. row ***************************
svr_ip: 10.10.10.1
svr_port: 20900
zone: z1
cpu_total: 14
cpu_assigned: 2.5
cpu_assigned_percent: 17
mem_total: 53687091200
mem_assigned: 12884901888
mem_assigned_percent: 24
disk_total: 10737418240
disk_assigned: 10737418240
disk_assigned_percent: 100
unit_num: 1
migrating_unit_num: 0
merged_version: 1
leader_count: 1283
load: 0.21379327157484151
cpu_weight: 0.4266211604095563
memory_weight: 0.5733788395904437
disk_weight: 0
id: 1
inner_port: 20901
build_version: 3.2.1_1-3c4c42fc31ba322cd7dfd441a8a71f648835eced(Sep 1 2021 16:16:06)
register_time: 0
last_heartbeat_time: 1631074048327156
block_migrate_in_time: 0
start_service_time: 1631073245084853
last_offline_time: 0
stop_time: 0
force_stop_heartbeat: 0
admin_status: NORMAL
heartbeat_status: alive
with_rootserver: 1
with_partition: 1
mem_in_use: 0
disk_in_use: 12582912
clock_deviation: -11
heartbeat_latency: 182
clock_sync_status: SYNC
cpu_capacity: 14
cpu_max_assigned: 5
mem_capacity: 53687091200
mem_max_assigned: 16106127360
ssl_key_expired_time: 0
Resource isolation
CPU and memory overselling
If the sum of the values of the MAX_CPU parameter for all resource units on an OBServer exceeds the total number of CPUs available for the OBServer, CPUs on the OBServer are oversold. Similarly, if the sum of the values of the MAX_MEMORY parameter for all resource units exceeds the allocatable memory size on the OBServer, memory on the OBServer is oversold.
The resource overselling ratio of the OBServer is specified by the resource_hard_limit parameter. If resource_hard_limit is set to 200, the resources can be oversold by 2 times. For example, if 16 CPUs and 16 GB of memory are available, 32 CPUs and 32 GB of memory can be sold.
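The ratio works out as in this sketch:

```python
def oversellable(available: float, resource_hard_limit: int) -> float:
    """Resources that can be sold: available capacity scaled by
    resource_hard_limit, which is expressed as a percentage."""
    return available * resource_hard_limit / 100

# With resource_hard_limit = 200, 16 CPUs can be sold as 32,
# and 16 GB of memory can be sold as 32 GB.
print(oversellable(16, 200))
```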
Overselling improves resource utilization but reduces stability. You must determine whether to enable overselling based on the features and service level agreement (SLA) requirements of applications. For example, overselling is suitable for business R&D and testing scenarios.
Even if resources are not oversold during resource allocation, the number of CPUs available for a resource unit is not strictly limited by the MAX_CPU parameter when isolating CPUs. This means that CPUs are oversold by default. For more information about CPU overselling, see CPU isolation.
Memory isolation
Memory is a rigid resource. Once memory is used, it is difficult to reclaim the memory quickly. Therefore, memory is not suitable for overselling.
The MAX_MEMORY parameter specifies the maximum memory size available for a resource unit. The MIN_MEMORY parameter determines only the timing for the block cache contention.
We recommend that you set the MIN_MEMORY and MAX_MEMORY parameters to the same value in production systems.
CPU isolation
CPUs are more flexible than memory. Therefore, an OBServer allows a resource unit to use physical CPUs that exceed the CPU quota of the resource unit.
In OceanBase Database earlier than V3.1.x, you can limit the number of threads to control CPU utilization. In OceanBase Database V3.1.x and later, you can configure cgroups to control CPU utilization.
Thread classification
An OBServer starts threads that have different features. For more information, see the topics related to OBServer threads. This section describes two types of OBServer threads.
Threads for processing SQL statements and transaction commits, which are collectively referred to as worker threads
Threads for processing network I/O, disk I/O, compactions, and scheduled tasks
Worker threads for different tenants are separated. Non-worker threads are shared among all tenants. In this section, only worker threads are involved in CPU isolation.
Thread-based CPU isolation
CPUs in a resource unit are isolated based on the number of active worker threads for the resource unit.
A single thread often cannot fully use one physical CPU because of I/O or lock waits during SQL execution. Therefore, an OBServer starts four threads for each CPU by default. The multiplier 4 is specified by the cpu_quota_concurrency parameter. In this case, if the MAX_CPU parameter of a resource unit is set to 10, up to 40 active threads can run in parallel for the resource unit, and the maximum CPU utilization is 400%. In effect, 10 CPUs are oversold as 40 CPUs.
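A sketch of the thread budget described above:

```python
def max_active_workers(max_cpu: int, cpu_quota_concurrency: int = 4) -> int:
    """Active worker threads allowed for a resource unit:
    MAX_CPU multiplied by cpu_quota_concurrency."""
    return max_cpu * cpu_quota_concurrency

# MAX_CPU = 10 allows 40 parallel active threads, i.e. up to 400% CPU utilization.
print(max_active_workers(10))
```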
Cgroup-based CPU isolation
After cgroups are enabled, worker threads of different tenants are stored to different cgroup directories, which improves CPU isolation among tenants. Isolation results:
If only one tenant on an OBServer is heavily loaded and other tenants are idle, the number of CPUs used by the heavily loaded tenant can exceed the value of the MAX_CPU parameter. For compatibility, the overselling multiple is still specified by the cpu_quota_concurrency parameter.
In the same scenario, if the loads of multiple idle tenants increase and physical CPUs are insufficient, cgroups allocate time slices by weight.
Large query processing
Quick response to small queries matters more than quick response to large queries. Therefore, large queries have lower priorities than small queries. When large queries and small queries contend for CPUs, the system limits the number of CPUs occupied by large queries.
If a worker thread takes a long time to execute an SQL query, the query is identified as a large query, and the worker thread waits in a pthread condition to release its CPU to other worker threads.
Specifically, an OBServer inserts many checkpoints in code. A running worker thread periodically checks its status at the checkpoints. If it is determined that the thread should be suspended, the thread waits in the pthread condition and will be woken up at the right time.
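The suspend-at-checkpoint mechanism can be sketched with Python's threading primitives (the class and method names are illustrative, not OceanBase internals):

```python
import threading

class WorkerThread:
    """Illustrative sketch: a worker that yields its CPU at checkpoints when flagged."""

    def __init__(self):
        self._cond = threading.Condition()
        self._suspended = False

    def checkpoint(self):
        # Called periodically from inside query execution; blocks while suspended.
        with self._cond:
            while self._suspended:
                self._cond.wait()  # releases the CPU, like waiting in a pthread condition

    def suspend(self):
        with self._cond:
            self._suspended = True

    def wake(self):
        with self._cond:
            self._suspended = False
            self._cond.notify()
```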
If both large queries and small queries exist, the large queries can occupy at most 30% of the worker threads. The percentage 30% is specified by the large_query_worker_percentage parameter.
Note:
If no small queries exist, large queries can occupy 100% of the worker threads. The percentage 30% takes effect only when both large queries and small queries exist.
If a worker thread is suspended while it executes a large query, the system may create a new worker thread to compensate. However, the total number of worker threads cannot exceed 10 times the value of the MAX_CPU parameter. The multiple 10 is specified by the workers_per_cpu_quota parameter.
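The two limits above can be sketched together (default percentages as stated in the text):

```python
def large_query_worker_cap(active_workers: int,
                           large_query_worker_percentage: int = 30) -> int:
    """Workers that large queries may occupy while small queries also exist."""
    return active_workers * large_query_worker_percentage // 100

def worker_thread_ceiling(max_cpu: int, workers_per_cpu_quota: int = 10) -> int:
    """Hard cap on total worker threads, including compensation threads."""
    return max_cpu * workers_per_cpu_quota

# With MAX_CPU = 10: at most 100 worker threads in total; if 40 are active,
# large queries may hold at most 12 of them while small queries are present.
print(worker_thread_ceiling(10), large_query_worker_cap(40))
```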
Identify large queries in advance
An OBServer starts a new worker thread after it suspends a large query thread. However, if many large queries are initiated, new threads created on the OBServer are still used to process large queries, and the number of worker threads soon reaches the upper limit. No threads are available for small queries until all the large queries are processed.
To resolve this issue, an OBServer predicts whether an SQL query is large before it is executed by estimating the execution time of the SQL query. Prediction is based on the following assumptions: If the execution plans of two SQL queries are the same, execution time of the SQL queries is also similar. In this case, you can determine whether an SQL query is large based on the execution time of the latest SQL query with the same execution plan.
If an SQL query is identified as a large query, it is placed in a large query queue, the worker thread executing the query is released, and the system continues to respond to subsequent requests.
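The prediction heuristic can be sketched as a per-plan lookup of the most recent execution time (the class, threshold, and names are illustrative):

```python
class LargeQueryPredictor:
    """Illustrative sketch: flag a query as large if the last run
    of the same execution plan exceeded a time threshold."""

    def __init__(self, threshold_ms: float):
        self.threshold_ms = threshold_ms
        self._last_ms = {}  # plan id -> execution time of the latest run

    def record(self, plan_id: str, elapsed_ms: float) -> None:
        self._last_ms[plan_id] = elapsed_ms

    def is_large(self, plan_id: str) -> bool:
        # Unknown plans are assumed small; they can still be demoted
        # mid-execution by the checkpoint mechanism described earlier.
        return self._last_ms.get(plan_id, 0.0) >= self.threshold_ms

p = LargeQueryPredictor(threshold_ms=100.0)
p.record("plan_a", 250.0)
print(p.is_large("plan_a"), p.is_large("plan_b"))  # True False
```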
Metrics and logs
You can search logs by the keyword dump tenant info to view tenant information, such as resource specifications, threads, queues, and request statistics. The logs are updated for each tenant every 10s.
Sample log:
grep 'dump tenant info.*tenant={id:1002' log/observer.log.*
[2021-05-10 16:56:22.564978] INFO [SERVER.OMT] ob_multi_tenant.cpp:803 [48820][2116][Y0-0000000000000000] [lt=5] dump tenant info(tenant={id:1002, compat_mode:1, unit_min_cpu:"1.000000000000000000e+01", unit_max_cpu:"1.500000000000000000e+01", slice:"0.000000000000000000e+00", slice_remain:"0.000000000000000000e+00", token_cnt:30, ass_token_cnt:30, lq_tokens:3, used_lq_tokens:3, stopped:false, idle_us:4945506, recv_hp_rpc_cnt:2420622, recv_np_rpc_cnt:7523808, recv_lp_rpc_cnt:0, recv_mysql_cnt:4561007, recv_task_cnt:337865, recv_large_req_cnt:1272, tt_large_quries:3648648, actives:35, workers:35, nesting workers:7, lq waiting workers:5, req_queue:total_size=48183 queue[0]=47888 queue[1]=0 queue[2]=242 queue[3]=5 queue[4]=48 queue[5]=0 , large queued:12, multi_level_queue:total_size=0 queue[0]=0 queue[1]=0 queue[2]=0 queue[3]=0 queue[4]=0 queue[5]=0 queue[6]=0 queue[7]=0 , recv_level_rpc_cnt:cnt[0]=0 cnt[1]=0 cnt[2]=0 cnt[3]=0 cnt[4]=0 cnt[5]=165652 cnt[6]=10 cnt[7]=0 })
The following table describes some of the parameters in the sample log.
| Parameter | Description |
|---|---|
| id | The ID of the tenant. |
| unit_min_cpu | The minimum number of CPUs available. |
| unit_max_cpu | The maximum number of CPUs available. |
| slice | This parameter has no practical significance. |
| slice_remain | This parameter has no practical significance. |
| token_cnt | The number of tokens allocated by the scheduler. Each token is converted into a worker thread. |
| ass_token_cnt | The number of tokens acknowledged by the tenant based on the token_cnt parameter. Typically, the values of the two parameters are the same. |
| lq_tokens | The number of tokens for large requests. The value of this parameter is equal to the value of the token_cnt parameter multiplied by the proportion of large requests. |
| used_lq_tokens | The number of worker threads that hold large query tokens. |
| stopped | Indicates whether resource units of the tenant are being deleted. |
| idle_us | The total idle time of the worker threads in one round (10s). The idle time includes only the waiting time in queues. |
| recv_hp/np/lp_rpc_cnt | The cumulative number of RPC requests received by the tenant, including hp (high), np (normal), and lp (low) RPC requests. |
| recv_mysql_cnt | The cumulative number of MySQL requests received by the tenant. |
| recv_task_cnt | The cumulative number of internal tasks received by the tenant. |
| recv_large_req_cnt | The cumulative number of large requests predicted by the tenant. The value of this parameter is only incremented but not cleared. The value is incremented upon retries. |
| tt_large_queries | The cumulative number of large requests processed by the tenant. The value of this parameter is only incremented but not cleared. The value is incremented upon checks. |
| actives | The number of active worker threads, which is often the same as the value of the workers parameter. The difference between the values of the two parameters results from the caches of worker threads and the caches of large requests with worker threads. |
| workers | The number of worker threads held by the tenant. |
| nesting workers | The number of dedicated threads held by the tenant for nested requests. A total of seven threads correspond to seven nesting levels. |
| lq waiting workers | The number of worker threads waiting for scheduling. |
| req_queue | The worker queues with different priorities. A smaller value indicates a higher priority. |
| large queued | The number of predicted large requests. |
| multi_level_queue | The worker queues that accommodate nested requests. queue[1] through queue[7] correspond to the seven nesting levels. queue[0] is invalid. |