Description
If the OBServer failed to refresh the location cache, the tenant SQL request may time out, which may affect your business.
Principle
OCP-Agent (ocp_monagent) monitors the operational logs of OBServer nodes. This alert is triggered in the following two scenarios:
The following keywords continuously appear in the log in one minute:
LOCATION_STATISTIC SYS_CORE suc_cnt=0 sql_suc_cnt=0The preceding keywords can be expressed by using a regular expression:
.\*LOCATION_STATISTIC.\*SYS_CORE.\*suc_cnt=0.\*sql_suc_cnt=0.\*.The following keyword appears in the log:
fail to check is local server.*ret=-4018
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| OceanBase Database log analysis | Critical | Server |
Alert rule
| Metric | Default threshold | Duration | Detection cycle | Time before clearance |
|---|---|---|---|---|
| N/A | N/A | 1 minute | 1 minute | 5 minutes |
Alert templates
Alert overview: ${alarm_name}] ${alarm_target}
Alert details: Cluster: ${ob_cluster_name}. Host: ${host}. Alert: ${alarm_name}. Log file: ${filename}. Log level: ${log_level}. Keywords = ${keyword}. Error code = ${error_code}. TraceId = ${trace_id}. Log details = ${error_message}.
Details example: Cluster: TEST_CLUSTER. Host: xxx.xxx.xxx.xxx. Alert: OBServer failed to refresh the location cache. Log file: /home/admin/oceanbase/log/observer.log. Log level: info. Keywords = .*LOCATION_STATISTIC.*SYS_CORE.*suc_cnt=0.sql_suc_cnt=0. Error code = -1. TraceId=YB4221BC464C-0005E787BCF9D46A. Log details = [2023-04-26 14:03:35.555318] INFO [SHARE.PT] ob_location_update_task.cpp:88 [18597][734][YB4221BC464C-0005E787BCF9D46A] [lt=35] [dc=0] [LOCATION_STATISTIC] dump location cache statistics(queue_type="SYS_CORE", suc_cnt=0, fail_cnt=0, sql_suc_cnt=0, sql_fail_cnt=1, total_cnt=1, avg_wait_us=0, avg_exec_us=1537290).
Impact on the system
Incorrect location information may cause the following impacts on the system:
- Timeout of business SQL requests.
- Accumulated request queue of the SYS tenant. To check whether requests are accumulated, you can check the value of the
req_queue:total_sizeparameter by using thedump tenant infokeywords.
Possible causes
Known system failure.
Solutions
Connect to the OBServer node from the SYS tenant.
We recommend that you execute the following statement to refresh the location cache when the OBServer fails:
select count(*) from oceanbase.all_virtual_proxy_schema where tenant_name = "sys" and database_name = "oceanbase" and table_name = "all_core_table" and partition_id = 0;If
Step 2 does not work, we recommend that you restart the server.The refresh of the location cache may fail in the following case, which rarely happens: a read-only replica is dropped by changing the locality of the primary tenant in the cluster, the OBServer node or zone where the read-only replica is located is deleted later on, and the
fail to check is local server(ret=-4018)error message continuously appears in the log for a specified period in scenarios such as a leader switchover, regardless of whether the switchover is triggered by an O&M operation or a fault. In this case, you can directly connect to all OBServer nodes in the cluster and execute thealter system flush location cachestatement to refresh the location cache.