Description
OCP-Agent (ocp_monagent) monitors operation logs of OBServer nodes and OBProxies. An alert is triggered when OCP-Agent detects a log that matches the alert configurations.
Notice
For OceanBase Database V4.0.2 and later, you can configure log alerts in the GUI. For versions earlier than V4.0.2, you can configure log alert keywords by using the system parameter ob.logtailer.keyword.alarm.rules. The following example shows how to configure log alert keywords:
[{"keyword":"error occurs on data disk or slog disk", "logLevel":"warn", "svrType":"observer", "errorCode":100003, "alarmLevel":2}]
In this example, when the keyword "error occurs on data disk or slog disk" is found in an OBServer node log at the WARN level, a critical alert with the error code 100003 is triggered. You can set `logLevel` to `error` or `warn`.
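The following Python sketch illustrates how a rule in this format could be evaluated against a log line. It is not OCP-Agent's implementation: the helper `match_rules`, the substring match on `keyword`, and the case-insensitive comparison of `logLevel` are all assumptions made for illustration.

```python
import json

# Rule list in the format shown above for ob.logtailer.keyword.alarm.rules.
RULES_JSON = (
    '[{"keyword":"error occurs on data disk or slog disk", "logLevel":"warn", '
    '"svrType":"observer", "errorCode":100003, "alarmLevel":2}]'
)


def match_rules(log_line, log_level, svr_type, rules):
    """Return the rules whose server type, log level, and keyword all match.

    Hypothetical helper: assumes a plain substring match on the keyword and a
    case-insensitive comparison of the log level.
    """
    return [
        rule
        for rule in rules
        if rule["svrType"] == svr_type
        and rule["logLevel"].lower() == log_level.lower()
        and rule["keyword"] in log_line
    ]


rules = json.loads(RULES_JSON)
```

Calling `match_rules(line, "WARN", "observer", rules)` on a line that contains the keyword returns the rule, from which the `errorCode` and `alarmLevel` of the alert can be read.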
Principle
OCP-Agent monitors the following log files of OBServer nodes and OBProxies. An alert is triggered when OCP-Agent detects a log at the ERROR level in these log files.
election.log
observer.log
rootservice.log
/home/admin/oceanbase/logs/obproxy/log/obproxy_error.log
The details of an alert display all the information of the corresponding ERROR log, including the error code, details, and TraceID of the log.
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| OceanBase Database log analysis | Critical | Server |
Alert rule
| Metric | Default threshold | Duration | Detection cycle | Time before clearance |
|---|---|---|---|---|
| N/A | N/A | 0 seconds | 3 minutes | 5 minutes |
Alert templates
Overview template: [${alarm_name}] ${alarm_target}
Details template: [${alarm_name}]; Cluster: ${ob_cluster_name}; Host: ${host}; Log type: ${server_type}; Log file: ${filename}; Log level: ${log_level}; Keyword=${keyword}; Error code=${error_code}; TraceId=${trace_id}; Log details=${error_message}
Details example: [OBServer program log]; Cluster: obtest; Host: xxx.xxx.xxx.xxx; Log type: observer_log; Log file: /home/admin/oceanbase/log/observer.log.wf; Log level: ERROR; Keyword=; Error code=-1; Log details=2020-09-12 01:00:05.839998 ERROR SHARE fetch_rs_list_from_url (ob_web_service_root_addr.cpp:176) 71411Y0-0005ABFA67B71CA5lt=3 call web service failed(ret=-4216, url="http://www.domain.com:80/services?Action=ObRootServiceInfo&User_ID=alibaba&UID=ocp_master&ObRegion=test-ob-cluster", timeout_ms=2000) BACKTRACE:0x31f2fb9 0x3183c77 0x197e50d 0x197e624 0x193b389 0x846de0 0x847742 0x3475fe2 0x323bd7d 0x3239b1e 0x7f5290057e25 0x7f528e912bad
In this example, the log name is OB_CURL_ERROR. The log details show that the OBServer node failed to call the web service. A possible cause is that the server hosting the OBServer node cannot access the URL of the web service.
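When triaging many alerts, it can help to pull the log level and the `ret` value out of the log details automatically. The sketch below assumes that the details contain a `ret=<number>` field and an uppercase level token, as in the example above; the helper name and the regular expressions are illustrative, not part of OCP.

```python
import re

# Assumed shapes, based on the example above: "ret=-4216" and an uppercase level.
RET_PATTERN = re.compile(r"\bret=(-?\d+)")
LEVEL_PATTERN = re.compile(r"\b(ERROR|WARN)\b")


def summarize_log_details(details):
    """Extract the log level and the ret value from a log details string.

    Hypothetical helper: returns None for a field that is not found.
    """
    ret = RET_PATTERN.search(details)
    level = LEVEL_PATTERN.search(details)
    return {
        "level": level.group(1) if level else None,
        "ret": int(ret.group(1)) if ret else None,
    }


sample = (
    "2020-09-12 01:00:05.839998 ERROR SHARE fetch_rs_list_from_url "
    "(ob_web_service_root_addr.cpp:176) call web service failed"
    "(ret=-4216, timeout_ms=2000)"
)
summary = summarize_log_details(sample)
```

For the sample string, `summary["ret"]` is the integer that appears after `ret=` in the log, which can then be looked up in the error code documentation.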
Notes:
${alarm_name}: the alert name, for example, OBServer node program log.
${ob_cluster_name}: the name of the cluster where the alert is triggered.
${host}: the IP address or host name of the server where the alert is triggered.
${filename}: the log file.
${log_level}: the log level.
${keyword}: the keyword.
${error_code}: the error code.
${error_message}: the log details.
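The `${name}` placeholder syntax of these templates happens to match Python's `string.Template`, so the substitution can be previewed locally. This is only a sketch of how the variables fill the details template, not how OCP renders alerts; the values come from the details example above, and `trace_id` is deliberately left unset to show that `safe_substitute` keeps an unfilled placeholder as-is.

```python
from string import Template

# Details template copied from the section above.
DETAILS_TEMPLATE = Template(
    "[${alarm_name}]; Cluster: ${ob_cluster_name}; Host: ${host}; "
    "Log type: ${server_type}; Log file: ${filename}; "
    "Log level: ${log_level}; Keyword=${keyword}; "
    "Error code=${error_code}; TraceId=${trace_id}; "
    "Log details=${error_message}"
)

# Values taken from the details example; the error message is shortened here.
values = {
    "alarm_name": "OBServer program log",
    "ob_cluster_name": "obtest",
    "host": "xxx.xxx.xxx.xxx",
    "server_type": "observer_log",
    "filename": "/home/admin/oceanbase/log/observer.log.wf",
    "log_level": "ERROR",
    "keyword": "",
    "error_code": "-1",
    "error_message": "call web service failed(ret=-4216)",
}

# safe_substitute leaves placeholders without a value (trace_id) untouched.
rendered = DETAILS_TEMPLATE.safe_substitute(values)
```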
Impact on the system
The impact depends on the specific error. Some errors may interrupt the service. Pay special attention to the following keywords, which indicate conditions that may make the database unavailable:
is_out_of_memstore_mem=true: an error log indicating insufficient MemStore space.
right_to_die_or_duty_to_live: an error log indicating that the process may have exited.
on_fatal_error: a fatal error log indicating that the process may have exited.
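A quick way to prioritize alerts is to check the log details for these keywords first. The sketch below is a hypothetical helper that uses plain substring matching; the keyword list and descriptions are taken from this section.

```python
# Keywords from this section, mapped to what they indicate.
CRITICAL_KEYWORDS = {
    "is_out_of_memstore_mem=true": "insufficient MemStore space",
    "right_to_die_or_duty_to_live": "the process may have exited",
    "on_fatal_error": "fatal error; the process may have exited",
}


def find_critical_keywords(log_details):
    """Return the critical keywords that appear in the log details.

    Hypothetical helper: uses a case-sensitive substring match.
    """
    return [kw for kw in CRITICAL_KEYWORDS if kw in log_details]
```

An alert whose details yield a non-empty list from `find_critical_keywords` is a candidate for immediate attention.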
Possible causes
The OBServer node has encountered an unrecoverable critical error and has generated an error log.
Solutions
Check whether the error can be ignored.
View the alert details to verify whether the error can be ignored.
If the error can be ignored, choose Alerts > OceanBase Log Filtering in the OCP console, and then set a rule to filter the error.
If the error cannot be ignored, go to the next step.
Locate and solve the problem based on the alert details.
If the alert is triggered along with other alerts, you can clear those alerts first by following the instructions provided in the respective topics.
If no other alerts are triggered, look for the error code in the details.
If the error code is provided, you can solve the problem based on the error code. For more information, see OCP error table.
If no error code is provided, you can collect logs related to the TraceID and provide them to OceanBase Technical Support for troubleshooting.
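Collecting the logs related to a TraceID can be scripted. The following sketch scans the log files in a directory and returns every line that contains the given TraceID. The helper name, the `*.log*` file pattern, and the plain substring match are assumptions; adjust them to the log directory and file naming on your host.

```python
from pathlib import Path


def collect_trace_lines(log_dir, trace_id, pattern="*.log*"):
    """Return (file name, line) pairs for every log line containing trace_id.

    Hypothetical helper: scans only the files directly under log_dir that
    match the glob pattern, in sorted order.
    """
    matches = []
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" avoids failures on bytes that are not valid text.
        with path.open(errors="replace") as handle:
            for line in handle:
                if trace_id in line:
                    matches.append((path.name, line.rstrip("\n")))
    return matches
```

For example, `collect_trace_lines("/home/admin/oceanbase/log", trace_id)` gathers the matching lines from the directory shown in the details example, where `trace_id` is the TraceID shown in the alert details. The result can be written to a file and attached to the support request.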