Alert description
This alert is used to monitor system message logs on the host (typically kernel/system messages in Linux /var/log/messages or the journal). The default rule triggers when NVMe timeout-related kernel messages (such as kernel: nvme.*timeout) are matched in the log, alerting the administrator that the host may have an NVMe disk or driver exception.
Alert principle
Parameter |
Value |
|---|---|
| Monitoring Metrics | host_message_log / os_message_log (log type) |
| Monitoring Expression | {"excludeKeywords":{"relation":"Or","keywords":[]},"includeKeywords":{"relation":"Or","keywords":["kernel: nvme.*timeout"]},"logLevels":[],"metric":"os_message_log"} |
| Metric Collection | System Message Logs (server_type corresponds to kern / os_message_log) |
| Metric Source | OCP-Agent Collects Host System and Kernel Logs |
| Collection Cycle | None. The logs are evaluated as soon as they are generated. |
This is a log-based alert. On the host side, log collection and alert filtering are performed. The process is briefly described as follows:
Log collection: OCP-Agent collects system message logs on the host (such as
/var/log/messagesor via journal), and attaches tags such as server_type (e.g., kern corresponding to os_message_log) and filename to each log.Alert filtering: The alert rules issued by OCP include a log expression for
message_log_alarm. The default configuration ismetric: "os_message_log"``,includeKeywords: ["kernel: nvme.*timeout"], andlogLevels: []` (no filtering by log level).Alert triggered by keyword match: When system log content contains a keyword (which is NVMe timeout by default), an alert is generated and reported to OCP. The alert type is
message_log_alarm, and it includes tags such as cluster, host, log type, file name, keyword, error code, and log content.
Default rule: Matches only logs containing the keyword, with no log level restriction. The default keyword is the regular expression *kernel: nvme.timeout, used to identify NVMe timeout-related messages output by the kernel.
Rule information
Monitoring Metrics |
Default Threshold |
Duration |
Detection Cycle |
Elimination Cycle |
|---|---|---|---|---|
| host_message_log / os_message_log (log type) | N/A | N/A | N/A | 5 Minutes |
Alert information
Alert Trigger Method |
Alert Level |
Scope |
|---|---|---|
| Log Matching Rules | Critical | Host |
Alert template
Alert Overview
- Template: ${alarm_target} ${alarm_name}
- Example: ob_cluster=obcluster1:host=xxx.xxx.xxx.xxx System Message Logs
Alert details
- Template:[${alarm_name}] Cluster: ${ob_cluster_name}, Host: ${host}, Log type: ${server_type}, Log file: ${filename}, Log level: ${log_level}, Keywords=${keyword}, Error code=${error_code}, Log details=${error_message}.
- Example: [System message log] Cluster: obcluster1, Host: xxx.xxx.xxx.xxx, Log type: kern, Log file: /var/log/messages, Log level:, Keywords=kernel: nvme.*timeout, Error code=, Log details=kernel: nvme0n1: I/O 0xffff... timeout.
Alert recovery
- Template: Alert: ${alarm_name}, system message log ${value_shown}
- Example: Alert: System message log, System message log has recovered
Impact on the system
The appearance of NVMe timeout-related logs in system messages typically indicates:
- NVMe device or link exception: Disk or controller response timeout may lead to increased I/O latency, timeout errors, or disk loss.
- Storage stability risk: If this issue persists, it may escalate into disk failure, RAID degradation, or data unavailability.
- Impact on the database: If OceanBase data or logs are located on this NVMe, it may cause instance lag, log write failures, or downtime.
Possible causes
- NVMe disk failure or unstable PCIe link.
- Host load or I/O pressure is too high, causing occasional timeouts.
Solution
Confirm the alert content: Locate the specific host and log line (such as
/var/log/messagesor journal) based on the host, filename, keyword, and error_message provided in the alert.View host system logs: Log in to the alert-host and check the NVMe-related records in the system/kernel logs.
grep -i nvme /var/log/messages # or journalctl -k | grep -i nvme dmesg | grep -i nvmeConfirm the affected device: Identify the device name corresponding to the timeout from the logs (for example, nvme0n1 or nvme1n1), and check the block device and its mount point.
lsblk nvme listHardware and driver troubleshooting: Check the status of the NVMe device.
Isolate business from storage: If a disk is confirmed to be the cause of the issue, and if business operations permit, migrate OceanBase data/logs to another disk, or replace/remove that disk and observe whether the alert disappears.
(Optional) Adjust rules: To monitor the content of messages from other systems, modify the
includeKeywordsorexcludeKeywordsfield in themessage_log_alarmsection of the OCP alert rule. To ignore certain NVMe-related alerts, add exclusion keywords.Verification: After the issue is resolved and no matching logs are generated, the alert will automatically clear within the clearance cycle (5 minutes).
