Description
Host log alerts may be triggered by the following three types of logs:
- Message logs of the host, which are stored in the
/var/log/messagesdirectory. - ERROR logs of the OCP monitoring agent, which are stored in the
/home/admin/ocp_agent/log/monagent.logdirectory, in which/home/admin/ocp_agentis the installation directory of OCP-Agent. - ERROR logs of the OCP management agent, which are stored in the
/home/admin/ocp_agent/log/mgragent_logdirectory, in which/home/admin/ocp_agentis the installation directory of OCP-Agent.
Principle
After keywords are configured, the system can generate alerts based on the keywords in logs. Alerts can be generated only for keywords in ERROR and WARNING logs of OCP-Agent. Since the log level is not supported for host logs, alerts are generated as long as any keywords are detected in host logs.
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Log parsing | Critical (customizable) | Server |
Alert rule
| Metric | Default threshold | Duration | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| N/A | N/A | 0 seconds | The same alert is triggered every 3 minutes. | 5 minutes |
Alert templates
- Alert details template: [${alarm_name}] cluster: ${ob_cluster_name}, host: ${host}, log type: ${server_type}, log file: ${filename}, log level: ${log_level}, keyword=${keyword}, error code=${error_code}, log details=${error_message}
- Sample: [ocp_monagent logs] cluster: obcee, host: xxx.xxx.xxx.xxx, log type: monagent_log, log file: /home/admin/ocp_agent/log/monagent.log, log level: ERROR, keyword=, error code=-1, log details=2022-12-13T17:48:28.83751+08:00 ERROR [508443,81399b7cfbe9df19] caller=common/input_cache.go:39:Update: update cache for key ob_sysstat, err: Error 1054: Unknown column 'no_exist_column' in 'field list'
Parameters:
- ${alarm_name}: the alert name, for example, ob_log_alarm.
- ${ob_cluster_name}: the name of the cluster where the alert is triggered.
- ${host}: the IP address or hostname of the server where the alert is triggered.
- ${filename}: the log file.
- ${log_level}: the log level.
- ${keyword}: the keyword.
- ${error_code}: the error code.
- ${error_message}: the log details.
Impact on the system
The impact on the system varies based on the actual situation. Some ERROR-level alerts may affect services.
Possible causes
A critical exception occurs on a host. In this case, the system will identify the exception based on the configured keywords. For example, when the following log alert is triggered, the exception may affect the system performance. In this case, a core dump may occur in OceanBase Database.
kernel: nvme nvme3: I/O 266 QID 54 timeout, completion polledAn unexpected error occurs in OCP-Agent.
Solutions
Provide related information for Technical Support to perform troubleshooting.