Monitor the business load of OceanBase Database (V4.3.1)

Last updated: 2024-08-23 10:14:19

This topic describes how to monitor the business load of OceanBase Database by using the monitoring charts in the OceanBase Cloud Platform (OCP) console.

Scenarios

The monitoring service lets you view performance metrics for objects such as clusters, tenants, hosts, and OBProxies in the OCP console. In a production environment, monitoring the business load and resource usage helps you anticipate possible exceptions and troubleshoot them when they occur.

Prerequisites

The ocp_monagent process on the host of OceanBase Database is running.
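One way to confirm this prerequisite from the host itself is to look for the process name in the process list. The following is a minimal sketch, assuming a POSIX host where `ps -A -o args` is available. The helper names `process_listed` and `monagent_running` are hypothetical and are not part of OCP.

```python
import subprocess


def process_listed(ps_output: str, name: str = "ocp_monagent") -> bool:
    """Return True if any line of a process listing mentions the given name."""
    return any(name in line for line in ps_output.splitlines())


def monagent_running() -> bool:
    """Check the live process list of this host (hypothetical helper)."""
    # `ps -A -o args` prints the command line of every process on POSIX systems.
    result = subprocess.run(
        ["ps", "-A", "-o", "args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return process_listed(result.stdout)


if __name__ == "__main__":
    print("ocp_monagent running:", monagent_running())
```

The matching is a plain substring test, so it only shows that some process mentions `ocp_monagent` in its command line. It does not verify that the agent is healthy or reporting data.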

Considerations

In extreme cases, such as when the CPU or memory resources of the host are exhausted, metric collection may fail and monitoring data is lost. To avoid this, make sure that the host has enough resources to keep the system available, for example, by temporarily scaling out resources. For more information about scaling, see Scale out an OceanBase cluster and scale up an OceanBase Database tenant.

Procedure

View the usage of host resources

You can choose Host > Monitoring > Host Performance to view the CPU Utilization chart of a host, or choose Host > Monitoring > Host Resources to view the Memory chart of a host.

If a generated alert indicates that the CPU utilization or memory usage has exceeded the threshold, you can use the preceding methods to first confirm the system resource usage. By viewing the monitoring charts, you can check the following conditions:

Whether the system is short of resources, for example, whether the CPU utilization has exceeded 80%.
- Check whether you need to temporarily scale out system resources.
Whether resource consumption has increased gradually or surged at a certain point in time.
- In the former case, choose OceanBase Cluster > Database Performance, and check whether the value of queries per second (QPS) has been increasing, and whether the increase is due to business growth.
- In the latter case, go to the TopSQL tab, and check whether the surge is caused by the execution of an SQL query.
The point in time when an exception occurred.
- Check the logs for errors recorded around the point in time.
- Check whether business changes were made around the point in time.
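The distinction between a gradual increase and a surge can also be made programmatically on exported metric samples. The sketch below is illustrative only: the function names and the cutoffs `min_rise` and `surge_share` are assumptions, not OCP settings. Only the 80% threshold is taken from the example above.

```python
from typing import Sequence


def short_of_resources(samples: Sequence[float], threshold: float = 80.0) -> bool:
    """True if the latest utilization sample (in percent) exceeds the threshold."""
    return samples[-1] > threshold


def classify_increase(
    samples: Sequence[float],
    min_rise: float = 10.0,    # hypothetical: smaller rises are treated as stable
    surge_share: float = 0.5,  # hypothetical: share of the rise made in one step
) -> str:
    """Label an evenly spaced, time-ordered series as 'stable', 'gradual', or 'surge'.

    The rise is measured from the first sample to the last one, so the window
    should end while utilization is still elevated.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    total_rise = samples[-1] - samples[0]
    if total_rise < min_rise:
        return "stable"
    # Largest jump between two consecutive samples.
    largest_step = max(b - a for a, b in zip(samples, samples[1:]))
    return "surge" if largest_step >= surge_share * total_rise else "gradual"


if __name__ == "__main__":
    gradual = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
    surge = [40, 41, 40, 42, 85, 86, 85, 86]
    print(classify_increase(gradual), short_of_resources(gradual))
    print(classify_increase(surge), short_of_resources(surge))
```

A series labeled gradual points toward checking QPS growth, and a series labeled surge points toward the TopSQL tab, as described in the list above.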

Solutions:

If the resource shortage is caused by gradual business growth, you must scale out system resources.
If the resource shortage is caused by an SQL query, you can first throttle its execution in the OCP console, and then analyze whether the SQL query can be optimized, for example, by creating an index. Alternatively, you can reduce how often your business runs the SQL query.
If the resource shortage is caused by other reasons, you can solve the issue by modifying related database parameters.

View the tenant performance

You can choose Tenants > Performance and SQL to view the performance and resource usage of a tenant and troubleshoot traffic issues and slow SQL queries.

When a tenant alert is generated, you can perform the following steps to troubleshoot the exception:

View the monitoring charts, such as the Request waiting queue time-consuming and Tenant CPU cost charts. If the request queue time stays at the millisecond level and the tenant thread usage remains high, the tenant thread resources are insufficient. In this case, you must assess whether the exception can have serious consequences for your business.
View the SQL execution plan category chart to check for abnormal SQL execution plans.
View the QPS chart to observe changes in the QPS value and the point in time at which a change occurred, and determine whether the issue was caused by a traffic surge.
View the SQL Response Time chart. An increase in the response time is likely caused by a slow SQL query.
View the RPC Package RT chart. An increase in the response time is likely caused by network latency or server overload.

Solutions:

If the exception was caused by a traffic surge, you can temporarily scale out the tenant resources and determine whether the traffic surge is caused by business conditions or abnormal SQL queries.
If a slow SQL query is identified, run SQL Diagnoser for troubleshooting.
If the response time of RPC packages is long, run performance monitoring tools such as Tsar for troubleshooting.
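The checks and solutions above amount to a small decision table. The following sketch encodes that table for values read off the charts by hand. The `TenantSnapshot` type, the `triage` function, and all thresholds are hypothetical and exist only for illustration. OCP does not expose them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TenantSnapshot:
    """Values read from the tenant charts (hypothetical container)."""
    queue_time_ms: float      # request queue waiting time, in milliseconds
    thread_usage_pct: float   # tenant thread usage, 0-100
    qps_growth_pct: float     # QPS change versus a baseline period
    sql_rt_growth_pct: float  # SQL response time change versus the baseline
    rpc_rt_growth_pct: float  # RPC package RT change versus the baseline


def triage(
    s: TenantSnapshot,
    queue_ms: float = 1.0,     # assumption: queue time at the millisecond level
    busy_pct: float = 80.0,    # assumption: what counts as high thread usage
    growth_pct: float = 50.0,  # assumption: what counts as a notable increase
) -> List[str]:
    """Map chart observations to the follow-up actions described in this topic."""
    findings = []
    if s.queue_time_ms >= queue_ms and s.thread_usage_pct >= busy_pct:
        findings.append("threads: tenant thread resources look insufficient, assess the impact")
    if s.qps_growth_pct >= growth_pct:
        findings.append("traffic: temporarily scale out the tenant and find the source of the surge")
    if s.sql_rt_growth_pct >= growth_pct:
        findings.append("sql: look for slow SQL queries and run SQL Diagnoser")
    if s.rpc_rt_growth_pct >= growth_pct:
        findings.append("rpc: check network latency and server load, for example with Tsar")
    return findings


if __name__ == "__main__":
    snapshot = TenantSnapshot(3.0, 95.0, 120.0, 10.0, 5.0)
    for line in triage(snapshot):
        print(line)
```

Several findings can apply at once. For example, a traffic surge often shows up together with long queue times, so the function returns a list rather than a single verdict.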

FAQ

In a drill-down analysis on CPU utilization, the overall CPU utilization of a cluster is sometimes greater than the sum of the CPU utilization of each OBServer node. Why is that?

The CPU utilization of an OceanBase cluster equals 1 - the percentage of free CPU resources. The CPU utilization data of one or more OBServer nodes may be lost during a spike period. As a result, the calculated percentage of free CPU resources of a cluster is lower than expected, hence a higher overall CPU utilization.
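The effect can be illustrated with a small calculation. The sketch below assumes that the cluster-level free percentage is the sum of the free fractions reported by the nodes divided by the total number of nodes, so a node whose sample is missing contributes no free CPU. This aggregation rule is an assumption made for illustration and is not a documented description of how OCP computes the value. The function names are hypothetical.

```python
from typing import List, Optional


def cluster_cpu_utilization(idle_fractions: List[Optional[float]]) -> float:
    """Cluster utilization = 1 - free fraction.

    ASSUMPTION: a missing sample (None) still counts toward the node total,
    but adds nothing to the free CPU that was reported.
    """
    if not idle_fractions:
        raise ValueError("need at least one node")
    reported_idle = sum(v for v in idle_fractions if v is not None)
    return 1.0 - reported_idle / len(idle_fractions)


def mean_reported_utilization(idle_fractions: List[Optional[float]]) -> float:
    """Average utilization over the nodes that actually reported a sample."""
    reported = [1.0 - v for v in idle_fractions if v is not None]
    if not reported:
        raise ValueError("no node reported a sample")
    return sum(reported) / len(reported)


if __name__ == "__main__":
    complete = [0.6, 0.5, 0.7]          # free CPU fraction of three nodes
    with_gap = [0.6, 0.5, None]         # third node lost its sample
    for label, data in (("complete", complete), ("with gap", with_gap)):
        print(
            label,
            round(cluster_cpu_utilization(data), 3),
            round(mean_reported_utilization(data), 3),
        )
```

With complete data the two figures agree. When one sample is missing, the cluster-level figure rises above what the reporting nodes show, which matches the behavior described in the answer above.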