Description
OceanBase Cloud Platform Agent (OCP-Agent) monitors the operation logs of OBServers. This alert is triggered when OCP-Agent detects an ERROR-level log or matches the configured keywords in an ERROR-level or WARN-level log.
You can configure log alert keywords by using the ob.logtailer.keyword.alarm.rules parameter. Example:
[{"keyword":"error occurs on data disk or slog disk", "logLevel":"warn", "svrType":"observer", "errorCode":100003, "alarmLevel":2}]
## If the "error occurs on data disk or slog disk" keyword is detected in an OBServer log of the WARN level, a critical alert with error code 100003 is triggered.
Principle
OCP-Agent monitors the following operation logs of OBServers and triggers this alert when ERROR-level logs are detected:
election.log
observer.log
rootservice.log
The alert detail displays all the information related to the ERROR-level logs, including the error code, log details, and TraceID of the logs.
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| OceanBase Database log analysis | Critical | Servers |
Alert rule
| Metric | Default threshold | Duration | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| N/A | N/A | 0 seconds | 0 seconds | 5 minutes |
Alert templates
Overview: [${alarm_name}] ${alarm_target}
Details: Cluster: ${ob_cluster_name}, Host= ${host}, Type: ${server_type}, Error code: ${ob_error_code}, TraceID: ${trace_id}, Log name: ${ob_error_name}, Log level: ${ob_log_level}, Log details=${ob_error_message}
Details example: Cluster: test-ob-cluster, Host=192.168.0.1, Type: OceanBase log, Error code: 4216, TraceID: 71411Y0-0005ABFA67B71CA5lt, Log name: OB_CURL_ERROR, Log level: ERROR, Log details: 2020-09-12 01:00:05.839998 ERROR SHARE fetch_rs_list_from_url (ob_web_service_root_addr.cpp:176) 71411Y0-0005ABFA67B71CA5lt=3 call web service failed(ret=-4216, url="http://www.domain.com:80/services?Action=ObRootServiceInfo&User_ID=alibaba&UID=ocp_master&ObRegion=test-ob-cluster", timeout_ms=2000) BACKTRACE:0x31f2fb9 0x3183c77 0x197e50d 0x197e624 0x193b389 0x846de0 0x847742 0x3475fe2 0x323bd7d 0x3239b1e 0x7f5290057e25 0x7f528e912bad
In this example, the log name is OB_CURL_ERROR. Based on the log details, OBServer failed to call the web service. This failure may be caused by the failure of the OBserver to access the URL of the web service.
In the syntax:
${alarm_name} indicates the alert name. Example: OceanBase log alert.
${obregion} indicates the name of the cluster that generates the alert.
${svr_ip} indicates the IP address of the OBServer in the cluster that generates the alert.
${ob_error_code} indicates the error code. Example: error code = 4013.
${ob_error_name} indicates the log name. Example: log name = OB_ALLOCATE_MEMORY_FAILED.
${ob_error_message} indicates the log details. Example:
[2020-07-16 15:42:12.802975] ERROR [COMMON] alloc_mbhandle (ob_kv_storecache.cpp:1362) [114590][Y0-0000000000000000] [lt=9] Fail to allocate memory, (block_size=2097104, ret=-4013) BACKTRACE:0x2fe2d19 0x2f71427 0x4578a7 0x44e983 0x44f09b 0x4555b5 0x224e4a7 0x224930a 0x224ba6e 0x2ffa8a1 0x2ffaf62 0x2ffb7e1 0x30284c9 0x7f043a853e25 0x7f043910cf1d
Impact on the system
It depends on the specific error. Some errors may interrupt the service.
Possible causes
A serious unrecoverable error of the OBServer causes an ERROR-level log in log files of OceanBase Database.
Troubleshooting procedure
Check whether the error can be ignored.
View the alert details to verify whether the error can be ignored.
If the error can be ignored, choose Alerts > OceanBase Log Filtering in the OCP console, and then set a rule to filter the error.
If the error cannot be ignored, go to the next step.
Locate and solve the problem based on the alert details.
If the alert is triggered along with other alerts, you can clear those alerts first by following the instructions provided in the respective topics.
If no other alerts are triggered, look for the error code in the details.
If the error code is provided, you can solve the problem based on the error code. For more information, see OCP error table.
Otherwise, it is an unknown error. Please collect the information related to the TraceID in the log and provide it to an OBServer technical support engineer for troubleshooting.