Exception symptom
When a load exception occurs in the OceanBase cluster, the average CPU load is abnormal on some OBServer nodes. When the alert threshold specified on OceanBase Cloud Platform (OCP) is exceeded, an alert is triggered on OCP.
If such an alert is triggered, log on to the OCP console, choose Monitoring > OB Data Trend , and then query the performance metrics to check the performance monitoring information in the following sequence:
Target server
Target tenant
Target cluster
Confirm the load on the target server
Log on to the OCP console, and view the performance monitoring data of the cluster. For more information, see Performance monitoring. Pay attention to the QPS, query response time, TPS, and transaction response time metrics in the search results on the Performance Monitoring page. Identify the time when the alert was triggered and check whether the alert is triggered by the following cases. Usually, monitoring metrics variations caused by a load exception may encompass the following cases:
The QPS, query response time, transaction response time, and CPU utilization surge in the specified monitoring time range. The QPS, query response time, and transaction response time surge in the specified monitoring time range. If the TPS slightly varies and the QPS and query response time surge earlier than other indicators, this is usually caused by increased query load or bad SQL statements. In this case, slow queries on the target server in the cluster need to be identified by using the Slow SQL feature provided by OCP.
The query response time, TPS, and transaction response time surge in the specified monitoring time range. If the QPS slightly varies but the TPS and the transaction response time surge earlier than other indicators, this is usually caused by increased write load. In this case, the transactions being executed on the target server in the cluster need to be checked by using the Slow SQL feature provided by OCP.
Confirm the load of the target tenant
To query the tenant load metrics on the tenant monitoring page of OCP, log on to the database as the sys tenant and run the following SQL statement to obtain the ID of the target tenant in the target server. Use the SQL statement that matches the OceanBase Database version.
(1.x) obclient> SELECT zone,tenant_id, COUNT(1) cnt FROM __all_meta_table WHERE svr_ip = 'xxx.xxx.xxx.xxx' GROUP BY tenant_id ORDER BY zone,cnt DESC;
(2.x) obclient> SELECT zone,tenant_id, COUNT(1) cnt FROM __all_virtual_meta_table WHERE svr_ip = 'xxx.xxx.xxx.xxx' GROUP BY tenant_id ORDER BY zone,cnt DESC;
After you obtain the ID of the target tenant, run the following SQL statement to query the name of the target tenant:
obclient> SELECT tenant_name FROM __all_tenant WHERE tenant_id IN (xxx,xxx);
Go to the OCP console, and view the performance monitoring data of the tenant. For more information, see Performance monitoring. Pay attention to the QPS, response time, TPS, and transaction response time metrics on the Performance Monitoring page of the target tenant. Identify the time when the alert was triggered and check whether the alert is triggered by the following cases. Usually, monitoring metrics variations of the target tenant caused by a load exception may encompass the following cases:
The QPS, response time, TPS, and transaction response time metrics increase in the monitoring time range in a variation trend consistent with that of the same monitoring metrics of the target server in the same time range. This indicates that the read and write load of the tenant has increased. In this case, contact the users to confirm their behaviors. If the increase of the read and write load is normal, scale out resources for the affected tenant. For more information, see Tenant management.
The QPS, response time, TPS, and transaction response time metrics increase in the specified monitoring time range in a variation trend much steadier than that of the monitoring metrics of the target server in the same time range. This indicates that the read and write load of the tenant has slightly increased. You need to properly configure the primary zone for the tenant.
Confirm the load on the target cluster
Log on to the OCP console, and view the performance monitoring data of the clusser. For more information, see Performance monitoring. Pay attention to the QPS, response time, TPS, and transaction response time metrics in the search results on the Performance Monitoring page. Identify the time when the alert was triggered and check whether the alert is triggered by the following cases. Usually, monitoring metrics variations of the target cluster caused by a load exception may encompass the following cases:
The QPS, response time, TPS, and transaction response time metrics increase in the specified monitoring time range in a variation trend consistent with that of the same monitoring metrics of the target server in the same time range. This indicates that the read and write load of the cluster has increased. In this case, contact the users to confirm their behaviors. If the increase of the read and write load is normal, scale out OBServers in the cluster. For more information, see OBServer management. Scale out the cluster and then scale out resources for the affected tenant. For more information, see Tenants management.
The QPS, response time, TPS, and transaction response time metrics increase in the specified monitoring time range in a variation trend much steadier than that of the monitoring metrics of the target server node in the same time range. This indicates that the read and write load of the cluster has slightly increased. You need to properly configure the primary zone for the tenant or scale out resources for the target tenant.
Solutions and workarounds
You can locate the cause of the load exception by checking the loads on the target server, tenant, and cluster. Generally, a load exception can be caused by bad SQL statements, improper configuration of the primary zone for tenants, or normal business load growth. For exceptions originates from different causes, the solutions and workarounds vary.
Bad SQL
If the load exception is caused by bad SQL statements, optimize these statements. In the case of an emergency, use the OUTLINE command for throttling or contact the users to restrict the execution of bad SQL statements. To avoid such problems in the long term, a performance review mechanism can be implemented for newly released SQL statements and periodical slow SQL inspection can be performed on OCP.
Improper configuration of the primary zone
If the load exception is caused by improper configuration of the primary zone for the tenant, you can check the distribution of the current leader of the tenant by running the following SQL statement:
(1.x) obclient> SELECT zone,svr_ip,COUNT(1) cnt FROM __all_meta_table WHERE tenant_id = xxxx AND role =1 GROUP BY svr_ip ORDER BY zone,cnt desc;
(2.x) obclient> SELECT zone,svr_ip,COUNT(1) cnt FROM __all_virtual_meta_table WHERE tenant_id = xxxx AND role =1 GROUP BY svr_ip;
obclient> SELECT tenant_id,primary_zone FROM __all_tenant;
Check whether the current partition leader of the tenant is in the expected zone. If a partitioned table is used and leaders can be distributed on different partitions, you can switch the partition leader of the tenant and set all the available zones as primary zones. For more information, see Tenant management.
Before you configure a primary zone for a tenant, run the following statement to check whether auto-balancing is set to True:
obclient> SHOW PARAMETERS LIKE 'enable_rebalance';
-- If True is returned, auto-balancing is enabled. If not, set the parameter to True.
obclient> ALTER SYSTEM SET enable_rebalance = 'True';
Leader distribution means that available zones are set as primary zones. The zones are separated with commas (,). The following shows a sample statement:
obclient> ALTER TENANT tenant_name primary_zone="zone1,zone2,zone3";
Switch-to-leader of a tenant is to switch the primary zone from one zone to another zone. The following shows a sample statement:
-- Assume that the original primary zone is Zone2.
obclient> ALTER TENANT tenant_name primary_zone="zone2";
When multiple tenants share resources in the OceanBase cluster, we recommend that you set the leaders of different tenants to different zones to avoid tenants from being affected by each other.
Normal business load growth
If the load exception is caused by a normal business load growth, scale out resources for the tenant based on the current resource usage in the cluster. For more information, see Tenant management. If the resources in the cluster are used up, scale out the cluster first.
The performance increase is linear with the scale-out only when the business tenant uses the OceanBase partitioned table and all the available zones are set as primary zones. For example, the performance capacity is increased by 50% when a 1-1-1 cluster is scaled out to a 2-2-2 cluster, where a 1-1-1 cluster has three zones and each zone has one server, and a 2-2-2 cluster has three zones and each zone has two servers. However, if a single table is used for a business and only one zone is set as the primary zone for the tenant, the performance increase brought by the scale-out is not linear. The distribution of the cluster load depends on the read and write load on the leaders of different single tables. In this case, you can properly configure the primary zones to distribute leaders of different single tables to different zones, so as to improve the cluster performance to a specific degree.