message_log_alarm System Message Logs|V4.4.2|OceanBase Cloud Platform|OCP docs|Distributed Database

message_log_alarm System Message Logs

Last Updated：2026-06-12 02:01:01 Updated

Alert description

This alert is used to monitor system message logs on the host (typically kernel/system messages in Linux /var/log/messages or the journal). The default rule triggers when NVMe timeout-related kernel messages (such as kernel: nvme.*timeout) are matched in the log, alerting the administrator that the host may have an NVMe disk or driver exception.

Alert principle

Parameter	Value
Monitoring Metrics	host_message_log / os_message_log (log type)
Monitoring Expression	`{"excludeKeywords":{"relation":"Or","keywords":[]},"includeKeywords":{"relation":"Or","keywords":["kernel: nvme.*timeout"]},"logLevels":[],"metric":"os_message_log"}`
Metric Collection	System Message Logs (server_type corresponds to kern / os_message_log)
Metric Source	OCP-Agent Collects Host System and Kernel Logs
Collection Cycle	None. The logs are evaluated as soon as they are generated.

This is a log-based alert. On the host side, log collection and alert filtering are performed. The process is briefly described as follows:

Log collection: OCP-Agent collects system message logs on the host (such as /var/log/messages or via journal), and attaches tags such as server_type (e.g., kern corresponding to os_message_log) and filename to each log.
Alert filtering: The alert rules issued by OCP include a log expression for message_log_alarm. The default configuration is metric: "os_message_log"``,includeKeywords: ["kernel: nvme.*timeout"], andlogLevels: []` (no filtering by log level).
Alert triggered by keyword match: When system log content contains a keyword (which is NVMe timeout by default), an alert is generated and reported to OCP. The alert type is message_log_alarm, and it includes tags such as cluster, host, log type, file name, keyword, error code, and log content.

Default rule: Matches only logs containing the keyword, with no log level restriction. The default keyword is the regular expression *kernel: nvme.timeout, used to identify NVMe timeout-related messages output by the kernel.

Rule information

Monitoring Metrics	Default Threshold	Duration	Detection Cycle	Elimination Cycle
host_message_log / os_message_log (log type)	N/A	N/A	N/A	5 Minutes

Alert information

Alert Trigger Method	Alert Level	Scope
Log Matching Rules	Critical	Host

Alert template

Alert Overview
- Template: ${alarm_target} ${alarm_name}
- Example: ob_cluster=obcluster1:host=xxx.xxx.xxx.xxx System Message Logs
Alert details
- Template:[${alarm_name}] Cluster: ${ob_cluster_name}, Host: ${host}, Log type: ${server_type}, Log file: ${filename}, Log level: ${log_level}, Keywords=${keyword}, Error code=${error_code}, Log details=${error_message}.
- Example: [System message log] Cluster: obcluster1, Host: xxx.xxx.xxx.xxx, Log type: kern, Log file: /var/log/messages, Log level:, Keywords=kernel: nvme.*timeout, Error code=, Log details=kernel: nvme0n1: I/O 0xffff... timeout.
Alert recovery
- Template: Alert: ${alarm_name}, system message log ${value_shown}
- Example: Alert: System message log, System message log has recovered

Impact on the system

The appearance of NVMe timeout-related logs in system messages typically indicates:

NVMe device or link exception: Disk or controller response timeout may lead to increased I/O latency, timeout errors, or disk loss.
Storage stability risk: If this issue persists, it may escalate into disk failure, RAID degradation, or data unavailability.
Impact on the database: If OceanBase data or logs are located on this NVMe, it may cause instance lag, log write failures, or downtime.

Possible causes

NVMe disk failure or unstable PCIe link.
Host load or I/O pressure is too high, causing occasional timeouts.

Solution

Confirm the alert content: Locate the specific host and log line (such as /var/log/messages or journal) based on the host, filename, keyword, and error_message provided in the alert.
View host system logs: Log in to the alert-host and check the NVMe-related records in the system/kernel logs.
```
grep -i nvme /var/log/messages
# or
journalctl -k | grep -i nvme
dmesg | grep -i nvme
```
Confirm the affected device: Identify the device name corresponding to the timeout from the logs (for example, nvme0n1 or nvme1n1), and check the block device and its mount point.
```
lsblk
nvme list
```
Hardware and driver troubleshooting: Check the status of the NVMe device.
Isolate business from storage: If a disk is confirmed to be the cause of the issue, and if business operations permit, migrate OceanBase data/logs to another disk, or replace/remove that disk and observe whether the alert disappears.
(Optional) Adjust rules: To monitor the content of messages from other systems, modify the includeKeywords or excludeKeywords field in the message_log_alarm section of the OCP alert rule. To ignore certain NVMe-related alerts, add exclusion keywords.
Verification: After the issue is resolved and no matching logs are generated, the alert will automatically clear within the clearance cycle (5 minutes).

OceanBase

Customer Stories

Documentation