Alert description
This alert is triggered when the "Hardware Error" log appears due to a hardware failure in the host's memory.
Alert principle
Matches the keyword "Hardware Error" in the /var/log/messages file on the host. If matched, an alert is triggered.
Rule information
Monitoring Metrics |
Default Threshold |
Duration |
Detection Cycle |
Elimination Cycle |
|---|---|---|---|---|
| NA | NA | 0 Seconds | NA | 5 Minutes |
Alert information
Alert Trigger Method |
Alert Level |
Scope |
|---|---|---|
| Log Parsing | Critical | Host |
Alert template
Alert overview
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:ob_cluster=obcluster-10006:host=xx.xx.xx.xx:server_type=kern:error_code=-1:keyword=Hardware Error System Message Log
Alert Details
- Template: [${alarm_name}] Cluster: ${ob_cluster_name}, Host: ${host}, Log type: ${server_type}, Log file: ${filename}, Log level: ${log_level}, Keywords=${keyword}, Error code=${error_code}, Log details=${error_message}.
- Example: [System message log] Cluster: danxue_422, Host: xx.xx.xx.xx, Log type: kern, Log file: /var/log/messages, Log level: ERROR, Keywords=Hardware Error, Error code=-1, Log details=Mar 21 18:07:34 sqaappecsv62s2011162217016.sa128 su[267465]: [Hardware Error] pam_unix(su:session): session closed for user root.
Where:
- ${alarm_name}: the name of the alert, for example, OceanBase Log Alert.
- ${ob_cluster_name}: the name of the cluster where the alert is generated.
- ${host}: the IP address or hostname of the server that generated the alert.
- ${server_type}: the type of the log.
- ${filename}: Log file.
- ${log_level}: Log level.
- ${keyword}: keyword.
- ${error_code}: error code.
- ${error_message}: Log details.
Impact on the system
When a process accesses this faulty memory, a core dump occurs.
Possible causes
Host memory contains recoverable errors.
Solution
We recommend that you replace the memory as soon as possible.
