Description
OCP-Agent monitors operation logs of OBServers. This alert is triggered when OCP-Agent detects an error in the logs.
Principle
OCP-Agent monitors the following operation logs of OBServers and triggers this alert when ERROR-level logs are detected:
election.log
observer.log
rootservice.log
The alert details include the information related to each ERROR-level log entry, including the error code, error name, and error details.
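As a rough illustration of how an ERROR-level entry can be broken into these fields, the following sketch parses a single log line. The regular expression is inferred only from the sample log lines quoted in this topic, so the layout it assumes (timestamp, level, module, function, source location, thread ID, trace ID, `lt`, message) may not hold for every OceanBase version. The names `LOG_PATTERN`, `RET_PATTERN`, and `parse_error_line` are hypothetical and are not part of OCP-Agent.

```python
import re

# Layout inferred from the sample lines in this topic (an assumption):
# [timestamp] LEVEL [MODULE] function (file:line) [thread][trace_id] [lt=N] message
LOG_PATTERN = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<module>[^\]]+)\]\s+"
    r"(?P<function>\S+)\s+"
    r"\((?P<location>[^)]+)\)\s+"
    r"\[(?P<thread_id>\d+)\]\[(?P<trace_id>[^\]]+)\]\s+"
    r"\[lt=(?P<lt>\d+)\]\s+"
    r"(?P<message>.*)$"
)

# The samples carry the error code as a negative "ret" value, e.g. ret=-4013.
RET_PATTERN = re.compile(r"\bret=-?(\d+)")


def parse_error_line(line):
    """Return a dict of fields for an ERROR-level line, or None otherwise."""
    match = LOG_PATTERN.match(line.strip())
    if match is None or match.group("level") != "ERROR":
        return None
    fields = match.groupdict()
    ret = RET_PATTERN.search(fields["message"])
    # error_code stays None when the message carries no ret= value.
    fields["error_code"] = int(ret.group(1)) if ret else None
    return fields


sample = (
    "[2020-07-16 15:42:12.802975] ERROR [COMMON] alloc_mbhandle "
    "(ob_kv_storecache.cpp:1362) [114590][Y0-0000000000000000] [lt=9] "
    "Fail to allocate memory, (block_size=2097104, ret=-4013)"
)
record = parse_error_line(sample)
print(record["module"], record["error_code"], record["trace_id"])
```

Lines at other log levels, or lines that do not fit the assumed layout, return `None`, so the function can be applied to every line of a log file without prior filtering.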
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| OceanBase log analysis | Critical | Server |
Alert rule
| Metric | Default threshold | Duration | Detection cycle | Time before clearance |
|---|---|---|---|---|
| None | None | 0 seconds | 0 seconds | 5 minutes |
Alert templates
Overview: [${alarm_name}] ${obregion}-${svr_ip} ${ob_error_name}
Details: [${alarm_name}] cluster = ${obregion}, server = ${svr_ip}, error code = ${ob_error_code}, error name = ${ob_error_name}, error details = ${ob_error_message}
Details example: [OceanBase Log Alert] cluster = test-ob-cluster, server = 11.182.84.178, error code = 4216, error name = OB_CURL_ERROR, error details = [2020-09-12 01:00:05.839998] ERROR [SHARE] fetch_rs_list_from_url (ob_web_service_root_addr.cpp:176) [71411][Y0-0005ABFA67B71CA5] [lt=3] call web service failed(ret=-4216, url="http://www.domain.com:80/services?Action=ObRootServiceInfo&User_ID=alibaba&UID=ocp_master&ObRegion=test-ob-cluster", timeout_ms=2000) BACKTRACE:0x31f2fb9 0x3183c77 0x197e50d 0x197e624 0x193b389 0x846de0 0x847742 0x3475fe2 0x323bd7d 0x3239b1e 0x7f5290057e25 0x7f528e912bad
In this example, the error name is OB_CURL_ERROR. The error details show that the OBServer failed to call the web service. A possible cause is that the server hosting the OBServer cannot access the URL of the web service.
${alarm_name} indicates the alert name, for example, ob_log_alarm.
${obregion} indicates the name of the cluster that generates the alert.
${svr_ip} indicates the IP address of the OBServer that generates the alert.
${ob_error_code} indicates the error code, for example, error code = 4013.
${ob_error_name} indicates the error name, for example, error name = OB_ALLOCATE_MEMORY_FAILED.
${ob_error_message} indicates the error details, for example:
[2020-07-16 15:42:12.802975] ERROR [COMMON] alloc_mbhandle (ob_kv_storecache.cpp:1362) [114590][Y0-0000000000000000] [lt=9] Fail to allocate memory, (block_size=2097104, ret=-4013) BACKTRACE:0x2fe2d19 0x2f71427 0x4578a7 0x44e983 0x44f09b 0x4555b5 0x224e4a7 0x224930a 0x224ba6e 0x2ffa8a1 0x2ffaf62 0x2ffb7e1 0x30284c9 0x7f043a853e25 0x7f043910cf1d
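The `${name}` placeholders in the templates above happen to use the same syntax as Python's `string.Template`, so the substitution can be sketched as follows. This only illustrates how the variables map into the overview and details text. It is not a description of how OCP renders alerts, and the error message value is abbreviated from the details example in this topic.

```python
from string import Template

# Template strings copied from the "Alert templates" section of this topic.
OVERVIEW = Template("[${alarm_name}] ${obregion}-${svr_ip} ${ob_error_name}")
DETAILS = Template(
    "[${alarm_name}] cluster = ${obregion}, server = ${svr_ip}, "
    "error code = ${ob_error_code}, error name = ${ob_error_name}, "
    "error details = ${ob_error_message}"
)

# Values taken from the details example; the message is abbreviated here.
values = {
    "alarm_name": "OceanBase Log Alert",
    "obregion": "test-ob-cluster",
    "svr_ip": "11.182.84.178",
    "ob_error_code": 4216,
    "ob_error_name": "OB_CURL_ERROR",
    "ob_error_message": "call web service failed(ret=-4216)",
}

# substitute() raises KeyError if any placeholder has no value,
# which makes a missing variable easy to notice.
overview = OVERVIEW.substitute(values)
details = DETAILS.substitute(values)
print(overview)
print(details)
```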
Impact on the system
The impact depends on the specific error. Some errors may interrupt the service.
Possible cause
A serious, unrecoverable error on the OBServer causes an ERROR-level entry to be written to the log files of OceanBase Database.
Suggested solutions
1. Check whether the error can be ignored.
   - View the alert details to verify the error.
   - If the error can be ignored, choose Alerts > OceanBase Log Filtering in the OCP console, and then set a rule to filter the error.
   - If the error cannot be ignored, go to the next step.
2. Locate and solve the problem based on the alert details.
   - If the alert is triggered along with other alerts, clear those alerts first by following the instructions in their respective topics.
   - If no other alerts are triggered, look for the error code in the alert details.
     - If an error code is provided, solve the problem based on that code. For more information, see OCP error table.
     - Otherwise, the error is unknown. Collect the log entries related to the TraceID in the log and provide them to OBServer Technical Support for troubleshooting.
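For the unknown-error case, collecting the information related to the TraceID can be sketched as a simple filter over log lines. This assumes that the trace ID appears in square brackets directly after the thread ID, as in the sample lines in this topic. The function name `collect_trace_lines` is hypothetical, and the second sample line below is invented purely for illustration.

```python
def collect_trace_lines(lines, trace_id):
    """Return every line that carries the given trace ID in brackets."""
    marker = f"[{trace_id}]"
    return [line.rstrip("\n") for line in lines if marker in line]


sample_log = [
    # Abbreviated from the details example in this topic.
    "[2020-09-12 01:00:05.839998] ERROR [SHARE] fetch_rs_list_from_url "
    "(ob_web_service_root_addr.cpp:176) [71411][Y0-0005ABFA67B71CA5] [lt=3] "
    "call web service failed(ret=-4216)",
    # Hypothetical line with a different trace ID, for illustration only.
    "[2020-09-12 01:00:06.000000] INFO [SHARE] example_function "
    "(example.cpp:1) [71412][Y0-0000000000000001] [lt=2] example message",
]

related = collect_trace_lines(sample_log, "Y0-0005ABFA67B71CA5")
print(len(related))
```

Because the function accepts any iterable of lines, an open file handle for observer.log, election.log, or rootservice.log can be passed in place of the list. The location of these files depends on the deployment.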