Host log alerts

2024-11-06 03:13:28  Updated

Description

Host log alerts may be triggered by the following three types of logs:

  • Message logs of the host, which are stored in the /var/log/messages directory.
  • ERROR logs of the OCP monitoring agent, which are stored in the /home/admin/ocp_agent/log/monagent.log directory, in which /home/admin/ocp_agent is the installation directory of OCP-Agent.
  • ERROR logs of the OCP management agent, which are stored in the /home/admin/ocp_agent/log/mgragent_log directory, in which /home/admin/ocp_agent is the installation directory of OCP-Agent.

Principle

After keywords are configured, the system can generate alerts based on the keywords in logs. Alerts can be generated only for keywords in ERROR and WARNING logs of OCP-Agent. Since the log level is not supported for host logs, alerts are generated as long as any keywords are detected in host logs.

Alert information

Trigger method Alert level Scope
Log parsing Critical (customizable) Server

Alert rule

Metric Default threshold Duration Detection cycle Elimination cycle
N/A N/A 0 seconds The same alert is triggered every 3 minutes. 5 minutes

Alert templates

  • Alert details template: [${alarm_name}] cluster: ${ob_cluster_name}, host: ${host}, log type: ${server_type}, log file: ${filename}, log level: ${log_level}, keyword=${keyword}, error code=${error_code}, log details=${error_message}
  • Sample: [ocp_monagent logs] cluster: obcee, host: xxx.xxx.xxx.xxx, log type: monagent_log, log file: /home/admin/ocp_agent/log/monagent.log, log level: ERROR, keyword=, error code=-1, log details=2022-12-13T17:48:28.83751+08:00 ERROR [508443,81399b7cfbe9df19] caller=common/input_cache.go:39:Update: update cache for key ob_sysstat, err: Error 1054: Unknown column 'no_exist_column' in 'field list'

Parameters:

  • ${alarm_name}: the alert name, for example, ob_log_alarm.
  • ${ob_cluster_name}: the name of the cluster where the alert is triggered.
  • ${host}: the IP address or hostname of the server where the alert is triggered.
  • ${filename}: the log file.
  • ${log_level}: the log level.
  • ${keyword}: the keyword.
  • ${error_code}: the error code.
  • ${error_message}: the log details.

Impact on the system

The impact on the system varies based on the actual situation. Some ERROR-level alerts may affect services.

Possible causes

  • A critical exception occurs on a host. In this case, the system will identify the exception based on the configured keywords. For example, when the following log alert is triggered, the exception may affect the system performance. In this case, a core dump may occur in OceanBase Database.

    kernel: nvme nvme3: I/O 266 QID 54 timeout, completion polled

  • An unexpected error occurs in OCP-Agent.

Solutions

Provide related information for Technical Support to perform troubleshooting.

Contact Us