Alert description
Monitors whether the observer service process has stopped.
The OCP-Agent process monitors the start time of each OceanBase component process. A change in this timestamp indicates that the process has stopped or restarted. This alert only indicates that the process stopped briefly; it does not mean that the process has been down for a long time. If the process stays down for a long time, other status-related alerts are triggered.
The relevant processes include:
- observer: alert item is observer_process_stop
- obproxy: alert item is obproxy_process_stop
- obproxyd.sh: alert item is obproxyd_process_stop
- ocp_agentd: alert item is agentd_process_stop
- ocp_mgragent: alert item is mgragent_process_stop
- ocp_monagent: alert item is monagent_process_stop
Alert principle
| Parameter | Value |
|---|---|
| Monitoring metric | observer_boot_time_delta_seconds |
| Metric source | The process start time is the system boot time plus the offset between system boot and process start. The system boot time is the btime value in the output of the command cat /proc/stat. The offset is the value of the 22nd field in the output of the command cat /proc/[pid]/stat, where [pid] is the process ID, divided by 100. The field is expressed in clock ticks, so the divisor assumes 100 ticks per second. |
| Metric collection | process_boot_time_seconds |
| Monitoring expression | max(delta(process_boot_time_seconds{name="observer",@LABELS}[@INTERVAL])) by (@GBLABELS) |
| Metric collection cycle | 5 seconds |
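The metric source calculation above can be sketched in Python. This is a minimal illustration, not the OCP-Agent implementation: the function names are hypothetical, and the default of 100 ticks per second follows the divisor given in the table.

```python
def parse_btime(proc_stat_text: str) -> int:
    """Return the system boot time (epoch seconds) from the text of /proc/stat."""
    for line in proc_stat_text.splitlines():
        if line.startswith("btime "):
            return int(line.split()[1])
    raise ValueError("btime line not found")


def parse_starttime_ticks(pid_stat_text: str) -> int:
    """Return field 22 (starttime, in clock ticks) from the text of /proc/[pid]/stat."""
    # Field 2 (the command name) is wrapped in parentheses and may contain
    # spaces, so split on the last ')' before counting fields.
    rest = pid_stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 22 is rest[19].
    return int(rest[19])


def process_boot_time_seconds(proc_stat_text: str, pid_stat_text: str,
                              ticks_per_second: int = 100) -> float:
    """System boot time plus the process start offset, in epoch seconds.

    ticks_per_second=100 is an assumption taken from the table above.
    """
    offset = parse_starttime_ticks(pid_stat_text) / ticks_per_second
    return parse_btime(proc_stat_text) + offset
```

On a live Linux host, the two inputs would be the contents of /proc/stat and /proc/[pid]/stat, and os.sysconf("SC_CLK_TCK") reports the actual tick rate if it differs from 100.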
Rule information
| Monitoring expression | Description of the monitoring metric | Default threshold | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| observer_boot_time_delta_seconds > 0 | When the monitoring metric is greater than 0, it indicates that the process has stopped. | 0 seconds | 10 seconds | 5 minutes |
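The rule can be illustrated with a simplified Python sketch. It assumes the delta over a window is the last sample minus the first, which ignores any extrapolation a real query engine may apply; the function names and the grouping by host are hypothetical.

```python
from typing import Dict, List

THRESHOLD_SECONDS = 0.0  # default threshold from the rule table


def window_delta(samples: List[float]) -> float:
    """Last minus first sample in the window; 0.0 with fewer than two samples."""
    if len(samples) < 2:
        return 0.0
    return samples[-1] - samples[0]


def max_delta_by_group(windows: Dict[str, List[List[float]]]) -> Dict[str, float]:
    """For each group label, take the largest delta across its series."""
    return {
        group: max((window_delta(s) for s in series), default=0.0)
        for group, series in windows.items()
    }


def firing_groups(windows: Dict[str, List[List[float]]]) -> List[str]:
    """Groups whose start time changed inside the window (delta > threshold)."""
    deltas = max_delta_by_group(windows)
    return sorted(g for g, d in deltas.items() if d > THRESHOLD_SECONDS)
```

A process that keeps running reports the same start time in every sample, so its delta is 0 and no alert fires. A restart moves the start time forward, which yields a positive delta.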
Alert information
| Alert trigger method | Alert level | Scope |
|---|---|---|
| Based on the monitoring metric expression | Warning | OBServer |
Alert template
Alert summary
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:ob_cluster=AdminMETA-12:host=xxx.xxx.xxx.xxx OBServer process stop
Alert details
- Template: Cluster: ${ob_cluster_name}, Host: ${host}, Alert: ${alarm_name}.
- Example: Cluster: AdminMETA, Host: xxx.xxx.xxx.xxx, Alert: OBServer process stop.
Alert recovery
- Template: Alert: ${alarm_name}, OBServer process start time change: ${recover_value}
- Example: Alert: OBServer process stop, OBServer process start time change: 0 seconds
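The ${...} placeholders in these templates can be expanded as in the following sketch. It uses Python's string.Template only because that class happens to accept the same placeholder syntax; how OCP renders templates internally is not described here, and the render_alert helper is hypothetical.

```python
from string import Template

DETAILS_TEMPLATE = "Cluster: ${ob_cluster_name}, Host: ${host}, Alert: ${alarm_name}."


def render_alert(template_text: str, **values: str) -> str:
    """Fill every ${name} placeholder; raises KeyError if a value is missing."""
    return Template(template_text).substitute(**values)


message = render_alert(
    DETAILS_TEMPLATE,
    ob_cluster_name="AdminMETA",
    host="xxx.xxx.xxx.xxx",
    alarm_name="OBServer process stop",
)
print(message)
```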
Impact on the system
If the OBServer process stops, the following situations may occur:
- If the service stopped because of a system failure or active maintenance, the alert is expected.
- For multi-replica tenants, if the server hosting a replica stops, the OBServer attempts to migrate the replica to another server to ensure high availability. Because the migration time depends on the data volume, you can first try to restart the OBServer. If repeated restart attempts fail, stop retrying so that core files and log files do not fill up the disk.
Possible causes
- The process unexpectedly exits and generates a core dump. You can check the core files in the /data/1 directory: ls -l ${observer.coredump.path} --full-time | grep '.*core-observer'
- The memory usage exceeds the limit, and the operating system kills the process.
- Disk failure.
- Search for the following three keywords in the observer.log file to find out other reasons for the process stop: is_out_of_memstore_mem=true, right_to_die_or_duty_to_live, and on_fatal_error.
- Other unexpected situations.
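The keyword search in observer.log can be scripted. The following Python sketch records the line numbers on which each of the three keywords appears; the function name is hypothetical, and the location of observer.log depends on the installation.

```python
from typing import Dict, Iterable, List

# The three keywords named in the list of possible causes.
KEYWORDS = (
    "is_out_of_memstore_mem=true",
    "right_to_die_or_duty_to_live",
    "on_fatal_error",
)


def find_stop_keywords(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each keyword to the 1-based line numbers where it occurs."""
    hits: Dict[str, List[int]] = {keyword: [] for keyword in KEYWORDS}
    for number, line in enumerate(lines, start=1):
        for keyword in KEYWORDS:
            if keyword in line:
                hits[keyword].append(number)
    return hits
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example with open("observer.log", errors="replace") as log: hits = find_stop_keywords(log), where the file path is a placeholder.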
Procedure
- Before starting the OBServer, determine the cause of the issue. If the OBServer process terminated unexpectedly or ran out of memory, try to restart the process immediately. In other cases, confirm the cause before starting the process to avoid further problems.
- Contact OceanBase Technical Support to investigate the cause and assess whether the OBServer can be restarted immediately.