Multi-tenant resource isolation|V4.3.0| docs|Distributed Database

Multi-tenant resource isolation

Last Updated：2024-04-19 08:42:50 Updated

OceanBase Database is a multi-tenant database system. It isolates resources among tenants to prevent resource contention and ensure stable business operation.

In OceanBase Database, resource units are the basic units for allocating resources to tenants. A resource unit is equivalent to a Docker container. You can create multiple resource units on a node. Each resource unit created on a node occupies a part of the physical resources on the node, such as the CPU and memory resources. The allocation of resources on a node is recorded in an internal table for database administrators.

You can create multiple resource units across nodes for a tenant. However, a tenant can have only one resource unit on a node. The resource units of a tenant are independent of each other. Currently, OceanBase Database does not aggregate the resource usage of multiple resource units for global resource control.

Specifically, if the resources on a node fail the demand of a tenant, OceanBase Database does not allow the tenant to compete for resources of other tenants on another node.

Resource isolation refers to the behavior of controlling resource allocation among multiple resource units on a node, and it is a local behavior of the node. Similar technologies include Docker and virtual machines (VMs), but OceanBase Database does not rely on Docker or VM technologies for resource isolation. It implements resource isolation internally within the database.

Advantages of multi-tenant isolation in OceanBase Database

Compared with Docker and VMs, OceanBase Database offers more lightweight tenant isolation and is easier to implement advanced features such as priority management. From the perspective of OceanBase Database requirements, Docker or VMs have the following disadvantages:

There are heavy overheads in runtime environments of both Docker containers and VMs, while OceanBase Database needs to support lightweight tenants.
The specification change and migration of Docker containers or VMs are costly, while OceanBase Database is designed to help tenants change specifications and perform the migration at a faster speed.
Tenants cannot share resources such as object pools if Docker containers or VMs are used.
Customizing resource isolation within Docker containers or VMs, such as priority support within tenants, is challenging.

In addition, unified views cannot be exposed if Docker containers or VMs are used.

Isolation from the perspective of user tenants

From the perspective of user tenants, the following isolation results are achieved:

Memory is completely isolated.

The details are described as follows:
- Operators use separate memory during SQL execution. When the memory of a tenant is exhausted, other tenants are not affected.
  
  For more information about operators used in SQL execution, see User interfaces and query languages.
- The block cache and MemTable are isolated. When the memory of a tenant is exhausted, the read and write operations of other tenants are not affected.
  
  For more information about the block cache, see Multi-level caches.
  
  For more information about the MemTable, see MemTables.
CPU resources are isolated based on user-mode scheduling.

The CPU resources available for a tenant are determined by the unit specification.

To achieve a better isolation effect, OceanBase Database starts to support the configuration of cgroups to isolate CPU resources in V4.0.0 and later.
SQL modules of different tenants do not affect each other.

The details are described as follows:
- The SQL plan cache of a tenant is isolated from that of another tenant. When the plan cache of a tenant is evicted, the plan caches of other tenants are not affected.
- The SQL audit table of a tenant is isolated from that of another tenant. If the queries per second (QPS) value of a tenant is very high, the audit information about other tenants is not flushed.
Transaction modules of different tenants do not affect each other.

The details are described as follows:
- If the row lock of a tenant is suspended, other tenants are not affected.
- If the transaction of a tenant is suspended, other tenants are not affected.
- If the replay for a tenant fails, other tenants are not affected.

Clog modules of different tenants do not affect each other.

The log stream directories are divided by tenant. The structure of a log stream directory on the disk is as follows:

  tenant_id
├── unit_id_0
│   ├── log
│   └── meta
├── unit_id_1
│   ├── log
│   └── meta
└── unit_id_2
    ├── log
    └── meta

Resource classification

OceanBase Database supports isolation of CPU, memory, and IOPS resources for tenants in V4.0.0 and later. You can specify CPU, memory, IOPS, and log_disk_size resource specifications for each resource unit.

obclient> CREATE RESOURCE UNIT name
    MAX_CPU = 1, [MIN_CPU = 1,]
    MEMORY_SIZE = '5G',
    [MAX_IOPS = 1024, MIN_IOPS = 1024, IOPS_WEIGHT=0,]
    [LOG_DISK_SIZE = '2G'];

Available CPU resources on an OBServer node

When an OBServer node starts, the system detects the number of online CPU cores of the physical machine or container. If the detected number of CPU cores is inaccurate, for example, in a containerized environment, you can use the cpu_count parameter to specify CPU resources.

Each OBServer node reserves two CPU cores for background threads. Therefore, the total CPU cores that can be allocated to tenants are two less than the CPU cores on the OBServer node.

Available memory resources on an OBServer node

When an OBServer node starts, the system detects the memory of the physical machine or container. Each OBServer node reserves a part of the memory for other processes. Therefore, the memory size available for the observer process is calculated based on the Physical memory × memory_limit_percentage formula. You can also set the memory_limit parameter to specify the total memory size available for the observer process.

The memory size for internal shared modules must be subtracted from the memory size available for the observer process. The remaining memory size is available for tenants. The memory size for internal shared modules is specified by the system_memory parameter.

For more information about memory configurations, see Memory management.

Check the available resources on each OBServer node

You can query the oceanbase.GV$OB_SERVERS view to check the available resources on each OBServer node. Here is an example:

obclient> SELECT * FROM oceanbase.GV$OB_SERVERS\G
*************************** 1. row ***************************
                 SVR_IP: xx.xx.xx.xx
               SVR_PORT: 57234
                   ZONE: zone1
               SQL_PORT: 57235
           CPU_CAPACITY: 14
       CPU_CAPACITY_MAX: 14
           CPU_ASSIGNED: 6.5
       CPU_ASSIGNED_MAX: 6.5
           MEM_CAPACITY: 10737418240
           MEM_ASSIGNED: 6442450944
      LOG_DISK_CAPACITY: 316955164672
      LOG_DISK_ASSIGNED: 15569256438
        LOG_DISK_IN_USE: 939524096
     DATA_DISK_CAPACITY: 10737418240
       DATA_DISK_IN_USE: 624951296
DATA_DISK_HEALTH_STATUS: NORMAL
DATA_DISK_ABNORMAL_TIME: NULL
1 row in set

Resource isolation

Memory isolation

OceanBase Database does not support memory over-allocation on OBServer nodes in V4.0.0 and later, because memory over-allocation will cause unstable tenant operation. The MIN_MEMORY and MAX_MEMORY parameters are deprecated, and the MEMORY_SIZE parameter is used instead.

CPU isolation

Before OceanBase Database V3.1.x, you can limit the number of threads to control CPU utilization. In OceanBase Database V3.1.x and later, you can configure cgroups to control CPU utilization.

Thread classification

Many threads with different features start along with an OBServer node. For more information, see OBServer threads. In this section, threads are classified into the following two types:

Threads for processing SQL statements and transaction commits, which are collectively referred to as tenant worker threads.
Threads for processing network I/O, disk I/O, compactions, and scheduled tasks, which are referred to as background threads.

In Oceanbase Database V4.0.0, tenant worker threads and most background threads are tenant-specific, and network threads are shared.

CPU-based isolation of tenant worker threads based on the number of threads

CPU resources in a resource unit are isolated based on the number of active tenant worker threads for the resource unit.

Because the execution of SQL statements may involve I/O waiting and lock waiting, a thread may not fully utilize a physical CPU. Therefore, under default configurations, an OBServer node starts four threads on each CPU. You can set the cpu_quota_concurrency parameter to control the number of threads to be started on a CPU. In this case, if the MAX_CPU parameter of a resource unit is set to 10, 40 parallel active threads are allowed for the resource unit, and the maximum CPU utilization is 400%.

CPU-based isolation of tenant worker threads based on cgroups

After cgroups are configured, worker threads of different tenants are stored in different cgroup directories, which improves CPU isolation between tenants. The isolation results are described as follows:

If a tenant on an OBServer node has a very high load and the rest of the tenants are relatively idle, then the CPU of the high-load tenant will also be limited by MAX_CPU.
If the load of the idle tenants increases, physical CPU resources become insufficient. In this case, cgroups allocate time slices based on weights.

CPU-based isolation of background threads

In OceanBase Database V4.0.0, threads used by a tenant are isolated from those used by another tenant, and each tenant controls a corresponding number of threads.
For better isolation, background threads are also allocated to different cgroup directories to achieve isolation between tenants and between tenants and worker threads.

Large query processing

Compared with quick response to large queries, quick response to small queries is of higher significance. This means that large queries have lower priorities than small queries. When large queries and small queries contend for CPU resources, the system limits the amount of CPU resources occupied by large queries.

If a worker thread takes a long time to execute an SQL query, the query is identified as a large query, and the worker thread waits in a Pthread condition to release its CPU resources to other tenant worker threads.

To support this feature, OBServer nodes insert many checkpoints into the code, making tenant worker threads periodically check their status during the running process. When an OBServer node determines that a worker thread should be suspended, the worker thread waits in a Pthread condition.

If both large queries and small queries exist, the large queries can occupy at most 30% of the tenant worker threads. The percentage 30% is specified by the large_query_worker_percentage parameter.

Take note of the following items:

If no small queries exist, large queries can occupy 100% of the tenant worker threads. The percentage 30% takes effect only when both large queries and small queries exist.
If a tenant worker thread is suspended because it executes a large query, the system may create a tenant worker thread for compensation. However, the total number of tenant worker threads cannot exceed 10 times the value of the MAX_CPU parameter. The multiple 10 is specified by the workers_per_cpu_quota parameter.

Identify large queries in advance

An OBServer node starts a new tenant worker thread after it suspends a large query thread. However, if many large queries are initiated, new threads created on the OBServer node are still used to process large queries, and the number of tenant worker threads soon reaches the upper limit. No threads are available for small queries until all the large queries are processed.

To resolve this issue, an OBServer node predicts whether an SQL query is large before it is executed by estimating the execution time of the SQL query. Prediction is based on the following assumptions: If the execution plans of two SQL queries are the same, the execution time of the SQL queries is also similar. In this case, you can determine whether an SQL query is large based on the execution time of the latest SQL query with the same execution plan.

If an SQL query is identified as a large query, it is placed in a large query queue, the tenant worker thread executing the query is released, and the system continues to respond to subsequent requests.

Metrics and logs

You can search logs by the keyword dump tenant info to view tenant information, such as resource specifications, threads, queues, and request statistics. The logs are updated for each tenant every 30s.

A sample log is as follows:

grep 'dump tenant info.tenant={id:1002' log/observer.log | sed 's/,/,\n/g'

[2022-07-20 14:55:40.774143] INFO  [SERVER.OMT] run1 (ob_multi_tenant.cpp:1993) [80700][MultiTenant][T0][Y0-0000000000000000-0-0] [lt=621]
dump tenant info(tenant={id:1002, // The ID of the tenant.
tenant_meta:{unit:{tenant_id:1002, // The tenant metadata, including the resource unit configuration, superblock, and creation status of the tenant.
unit_id:1001, // The ID of the resource unit.
has_memstore:true, // Indicates whether a MemTable is available.
unit_status:"NORMAL", // The status of the resource unit. Valid values: UNIT_NORMAL, UNIT_MIGRATE_IN, UNIT_MIGRATE_OUT, UNIT_MARK_DELETING, UNIT_WAIT_GC_IN_OBSERVER, UNIT_DELETING_IN_OBSERVER, andUNIT_ERROR_STAT.
config:{unit_config_id:1003, // The ID of the resource unit configuration.
name:"2c2g", // The name of the resource unit configuration.
resource:{min_cpu:2, // The minimum number of CPU cores in the resource unit configuration.
max_cpu:2, // The maximum number of CPU cores in the resource unit configuration.
memory_size:"1.5GB", // The memory size in the resource unit configuration.
log_disk_size:"5.4GB", // The log disk size in the resource unit configuration.
min_iops:20000, // The minimum disk IOPS in the resource unit configuration.
max_iops:20000, // The maximum disk IOPS in the resource unit configuration.
iops_weight:2}}, // The IOPS weight in the resource unit configuration.
mode:1, // The compatibility mode. Valid values: 0:MYSQL. 1:ORACLE.
create_timestamp:1658298418435426, // The time when the resource unit was created.
is_removed:false}, // Indicates whether the resource unit is marked as deleted.
super_block:{tenant_id:1002, // The superblock of the tenant.
replay_start_point:ObLogCursor{file_id=1, // The slog replay point of the tenant.
log_id=440,
offset=343322},
ls_meta_entry:[140](ver=0, // The entry point of the linked table that stores slog checkpoints of the log stream metadata of the tenant.
mode=0,
seq=139),
tablet_meta_entry:[141](ver=0, // The slog checkpoint entry of the tablet metadata of the tenant.
mode=0,
seq=140)},
create_status:1}, // The creation status of the tenant. Valid values: 0: CREATING; 1: CREATE_COMMIT; 2: CREATE_ABORT; 3: DELETING; 4: DELETE_COMMIT.
unit_min_cpu:"2.000000000000000000e+00", // The minimum number of CPU cores guaranteed.
unit_max_cpu:"2.000000000000000000e+00", // The maximum number of CPU cores allowed.
slice:"0.000000000000000000e+00",
slice_remain:"0.000000000000000000e+00",
token_cnt:8, // The number of tokens allocated by the scheduler. Each token is converted into a worker thread.
sug_token_cnt:8,
ass_token_cnt:8, // The number of tokens allocated to the tenant. You can determine the number of tokens based on the token_cnt field. Typically, the values of the two fields are the same.
lq_tokens:2, // The number of tokens for large requests. The value of this field is equal to the value of the token_cnt field multiplied by the proportion of large requests.
used_lq_tokens:0, // The number of worker threads that hold large query tokens.
stopped:false, // Indicates whether resource units of the tenant are being deleted.
idle_us:6076629, // The total idle time of the worker threads in one round (10s). The idle time includes only the wait time in queues.
recv_hp_rpc_cnt:1, // The cumulative number of RPC requests received by the tenant at different levels, including hp(High), np(Normal), and lp(Low).
recv_np_rpc_cnt:5,
recv_lp_rpc_cnt:0,
recv_mysql_cnt:24, // The cumulative number of MySQL requests received by the tenant.
recv_task_cnt:1, // The cumulative number of internal tasks received by the tenant.
recv_large_req_cnt:0, // The cumulative number of large requests predicted by the tenant. The value of this field is only incremented but not cleared. The value is incremented upon a retry.
tt_large_quries:0, // The cumulative number of large requests processed by the tenant. The value of this field is only incremented but not cleared. The value is incremented upon a check.
pop_normal_cnt:1024555,
actives:8, // The number of active worker threads. The value is equal to the number of active worker threads of the tenant plus the number of threads suspended due to large queries, which is often equal to the value of the workers parameter. The difference between the values of the actives and workers parameters is the number of inactive threads of the tenant that are waiting to be collected with a delay.
workers:8, // The number of worker threads held by the tenant, which is equal to the sum of the number of active worker threads of the tenant, the number of threads suspended due to large queries, and the number of inactive threads of the tenant that are waiting to be collected with a delay.
nesting workers:7, // The number of dedicated threads held by the tenant for nested requests. A total of seven threads correspond to seven nesting levels.
lq waiting workers:0, // The number of worker threads intended for processing large queries and waiting to be scheduled.
req_queue:total_size=0 queue[0]=0 queue[1]=0 queue[2]=0 queue[3]=0 queue[4]=0 queue[5]=0 , // The worker queues with different priorities. A smaller value indicates a higher priority.
large queued:0, // The number of large requests predicted by the tenant.
multi_level_queue:total_size=0 queue[0]=0 queue[1]=0 queue[2]=0 queue[3]=0 queue[4]=0 queue[5]=0 queue[6]=0 queue[7]=0 , // The worker queues to accommodate nested requests. The values 1 to 7 correspond to the seven nesting levels. queue[0] is invalid. queue[5] accommodates inner SQL requests.
recv_level_rpc_cnt:cnt[0]=0 cnt[1]=0 cnt[2]=0 cnt[3]=0 cnt[4]=0 cnt[5]=0 cnt[6]=0 cnt[7]=0 ,
group_map:group_id = 1, // The threads of the group that corresponds to each group_id in the group_map and their queuing status.
queue_size = 0, // The queuing status of the group queue.
recv_req_cnt=13526, // The cumulative number of requests in the group that are pushed to the queue.
pop_req_cnt=13526, // The cumulative number of requests popped out by threads in the group.
token_cnt = 2, // The number of tokens allocated to the group. Each token is converted into a worker thread.
min_token_cnt = 2,
max_token_cnt = 2,
ass_token_cnt = 2 , // The number of tokens allocated to the group. You can determine the number of tokens based on the token_cnt field. Typically, the values of the two fields are the same.
rpc_stat_info: pcode=0x150a:cnt=1489 pcode=0x150b:cnt=1091}) // The RPC pcodes that have been received most frequently by the tenant in a period of time. The statistical period is 10 seconds, and the top five RPC pcodes are printed.