Alert description
This alert monitors whether the Binlog Server process has stopped and triggers an alert if it does.
The OCP-Agent process monitors the start timestamps of OceanBase-related component processes. If a timestamp changes, it indicates the process has stopped or restarted. This alert only indicates the process has stopped at some point, not that it has been stopped for an extended period. If a process is stopped for a long time, other status-related alerts will be triggered.
The related processes include:
- observer: The alert item is observer_process_stop.
- obproxy: The alert item is obproxy_process_stop.
- obproxyd.sh: The alert item is obproxyd_process_stop.
- ocp_agentd: The alert item is agentd_process_stop.
- ocp_mgragent: The alert item is mgragent_process_stop.
- ocp_monagent: The alert item is monagent_process_stop.
- logproxy: The alert item is binlog_process_stop.
Alert principle
The following table lists the key parameters involved in the monitoring logic of this alert.
| Parameter | Value |
|---|---|
| Monitoring Metrics | binlog_boot_time_delta_seconds: The change in the start timestamp of the Binlog Server process. A value above the threshold indicates that the process has stopped or restarted, and an alert is triggered. |
| Metric Source | The process start time equals the system boot time plus the offset between system boot and process start. The system boot time is the btime value in the output of the cat /proc/stat command. The offset is the value in the 22nd column of the output of the cat /proc/pid/stat command, divided by 100. |
| Metric Collection | process_boot_time_seconds |
| Monitoring Expression | max(delta(process_boot_time_seconds{name="logproxy",@LABELS}[@INTERVAL])) by (@GBLABELS) |
| Collection Cycle | 5 Seconds |
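
The Metric Source calculation above can be sketched in Python. This is only an illustration, not the OCP-Agent implementation. The function names are hypothetical, and the divisor of 100 is the value stated in the table; on a live host the clock tick rate can be read with os.sysconf("SC_CLK_TCK").

```python
import os


def parse_btime(proc_stat_text):
    """Return the system boot time (Unix seconds) from /proc/stat content."""
    for line in proc_stat_text.splitlines():
        if line.startswith("btime "):
            return int(line.split()[1])
    raise ValueError("btime line not found")


def parse_start_ticks(pid_stat_text):
    """Return column 22 (start time in clock ticks) from /proc/<pid>/stat content."""
    # Column 2 (the command name) is wrapped in parentheses and may contain
    # spaces, so only the text after the last ')' is split on whitespace.
    rest = pid_stat_text.rsplit(")", 1)[1].split()
    # rest[0] is column 3, so column 22 is rest[19].
    return int(rest[19])


def process_boot_time_seconds(proc_stat_text, pid_stat_text, ticks_per_second=100):
    """System boot time plus the process start offset, in Unix seconds."""
    offset = parse_start_ticks(pid_stat_text) / ticks_per_second
    return parse_btime(proc_stat_text) + offset


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    # Demonstration on the current Python process (Linux only).
    with open("/proc/stat") as f:
        stat_text = f.read()
    with open("/proc/%d/stat" % os.getpid()) as f:
        pid_text = f.read()
    ticks = os.sysconf("SC_CLK_TCK")
    print(process_boot_time_seconds(stat_text, pid_text, ticks))
```

Because the result depends only on the boot time and the start offset, it stays constant for the lifetime of a process and changes only when the process is started again.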
Rule information
| Monitoring Metrics | Default Threshold (Unit: Seconds) | Duration | Detection Cycle | Elimination Cycle |
|---|---|---|---|---|
| binlog_boot_time_delta_seconds | 10 | 0 Seconds | 20 Seconds | 5 Minutes |
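
As a rough illustration of how the rule evaluates, the following Python sketch mimics the monitoring expression on in-memory samples. It is a simplification under stated assumptions: the delta is taken as the newest sample minus the oldest sample in the window (without the extrapolation a real query engine may apply), and the function names are hypothetical.

```python
def boot_time_delta(samples):
    """Change in process_boot_time_seconds across one window of samples."""
    if len(samples) < 2:
        return 0.0
    return samples[-1] - samples[0]


def should_alert(per_series_samples, threshold=10.0):
    """Return True if the largest delta across all series exceeds the threshold."""
    worst = max((boot_time_delta(s) for s in per_series_samples), default=0.0)
    return worst > threshold


# A process that kept running reports the same start time in every sample.
steady = [1700000000.0, 1700000000.0, 1700000000.0]
# A process that was restarted 42 seconds later reports a new start time.
restarted = [1700000000.0, 1700000000.0, 1700000042.0]

print(should_alert([steady]))             # False
print(should_alert([steady, restarted]))  # True
```

With the default threshold of 10 seconds, a start time that moves forward by more than 10 seconds within the window is treated as a stop or restart.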
Alert information
| Alert Trigger Method | Alert Level | Scope |
|---|---|---|
| Based on monitoring metric expressions | Critical | Host |
Alert template
Alert overview
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:binlog_cluster=binlog02-2000005:svr_ip=xxx.xxx.xxx.xxx Binlog process stopped
Alert details
- Template: Binlog cluster: ${binlog_cluster}, host: ${host}, alert: ${alarm_name}.
- Example: Binlog cluster: binlog02, host: xxx.xxx.xxx.xxx, alert: Binlog process stopped.
Alert recovery
- Template: Alert: ${alarm_name}, Binlog Server process startup timestamp change: ${recover_value}
- Example: Alert: Binlog server process stopped, Binlog server process start timestamp change: 0 seconds
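
The ${...} placeholders in the templates above happen to match the syntax of Python's string.Template, so the substitution can be sketched as follows. This only illustrates how the variables map to the rendered text; it does not describe how OCP renders alerts internally, and render_details is a hypothetical helper.

```python
from string import Template

# The alert details template from this section.
DETAILS_TEMPLATE = Template(
    "Binlog cluster: ${binlog_cluster}, host: ${host}, alert: ${alarm_name}."
)


def render_details(binlog_cluster, host, alarm_name):
    """Fill the alert details template with concrete values."""
    return DETAILS_TEMPLATE.substitute(
        binlog_cluster=binlog_cluster,
        host=host,
        alarm_name=alarm_name,
    )


print(render_details("binlog02", "xxx.xxx.xxx.xxx", "Binlog process stopped"))
```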
Impact on the system
The Binlog Server process is responsible for managing the Binlog instance. If the process stops for an extended period, it will affect the management of the Binlog instance.
Possible causes
The MetaDB of Binlog is unavailable.
Solution
Log in to the host and check whether the logproxy process is running.
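
One way to perform this check is to scan /proc for a process whose command name is logproxy. The Python sketch below assumes a Linux host and assumes that the command name of the process is exactly logproxy (the name used in the monitoring expression); the function name is hypothetical, and the actual name on your host may differ.

```python
import os


def find_pids_by_name(name, proc_root="/proc"):
    """Return the sorted PIDs whose /proc/<pid>/comm equals the given name."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited or is not readable
    return sorted(pids)


if __name__ == "__main__" and os.path.isdir("/proc"):
    pids = find_pids_by_name("logproxy")
    if pids:
        print("logproxy is running, PID(s):", pids)
    else:
        print("logproxy is not running")
```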
If the Binlog Server process stops unexpectedly, you can try to manually restart it.
- Log in to OCP.
- On the overview page of the corresponding Binlog cluster, select the target node in the Binlog Server List, click Stop, and then click Start to restart it.
If the manual restart fails, collect the Binlog Server process logs and contact OCP Technical Support for help with troubleshooting.
