Description
A core dump file is generated when the observer process on an OBServer exits abnormally. The file can be tens or even hundreds of GB in size. You need to handle the emergency and analyze the root cause of the abnormal exit. The system keeps reporting this alert for 20 minutes after it is triggered.
Principle
| Parameter | Value |
|---|---|
| Metric | observer_core_dump_seconds |
| Source | This metric is collected during OBServer monitoring. It indicates the time elapsed, in seconds, since the last core dump file was generated. |
| Collected metric | observer_coredump_time_seconds |
| Metric expression | max(observer_coredump_time_seconds{@LABELS}) by (@GBLABELS) |
| Collection cycle | 1 second |
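The metric expression above takes the maximum sample value within each label group. A minimal Python sketch of that aggregation follows. The label names (`ob_cluster_name`, `svr_ip`) and the sample values are assumptions for illustration only; the actual `@LABELS` and `@GBLABELS` placeholders are filled in by the monitoring system.

```python
from collections import defaultdict


def max_by(samples, group_labels):
    """Return the maximum value per group, keyed by the chosen label values."""
    grouped = defaultdict(lambda: float("-inf"))
    for labels, value in samples:
        # Build the group key from the selected labels only.
        key = tuple(labels.get(name, "") for name in group_labels)
        grouped[key] = max(grouped[key], value)
    return dict(grouped)


# Hypothetical samples of observer_coredump_time_seconds (labels, value).
samples = [
    ({"ob_cluster_name": "obcluster", "svr_ip": "10.0.0.1"}, 600.0),
    ({"ob_cluster_name": "obcluster", "svr_ip": "10.0.0.1"}, 590.0),
    ({"ob_cluster_name": "obcluster", "svr_ip": "10.0.0.2"}, 0.0),
]

result = max_by(samples, ("ob_cluster_name", "svr_ip"))
```

Grouping by host in this way yields one value per OBServer, which matches the Server scope of the alert.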
Alert rule
| Metric expression | Metric description | Default threshold (unit: second) | Detection cycle | Elimination cycle |
|---|---|---|---|---|
| observer_core_dump_seconds < 1200 and observer_core_dump_seconds > 0 | The time elapsed since the last core dump file was generated. | 1200 | 10 seconds | 5 minutes |
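The rule condition can be read as "a core dump exists and is less than 1,200 seconds old". The following Python sketch evaluates that condition for a single metric value. The function name is hypothetical, and the detection and elimination cycles are not modeled.

```python
DEFAULT_THRESHOLD_SECONDS = 1200  # default threshold from the alert rule table


def core_dump_alert_fires(elapsed_seconds: float,
                          threshold: float = DEFAULT_THRESHOLD_SECONDS) -> bool:
    """Mirror the rule: metric > 0 and metric < threshold."""
    return 0 < elapsed_seconds < threshold
```

For example, a value of 600 (a core dump generated 10 minutes ago) satisfies the condition, while 0 or any value of 1200 or more does not.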
Alert information
| Trigger method | Alert level | Scope |
|---|---|---|
| Based on the expression of the metric | Critical | Server |
Alert templates
- Overview
  - Template: ${alarm_target} ${alarm_name}
  - Example: ob_cluster=obcluster-1631964370:svr_ip=xxx.xxx.xxx.xxx os_observer_core_dump
- Details
  - Template: Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: an OBServer core dump file was generated ${value_shown} ago.
  - Example: Cluster: obcluster, host: xxx.xxx.xxx.xxx, alert: an OBServer core dump file was generated 10 minutes ago.
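To illustrate how the placeholders map to the example messages, the sketch below renders both templates with Python's `string.Template`, which happens to use the same `${name}` syntax. This only demonstrates the substitution; it is not how the alert system itself renders messages, and the variable values are copied from the examples above.

```python
from string import Template

overview = Template("${alarm_target} ${alarm_name}")
details = Template(
    "Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: an OBServer "
    "core dump file was generated ${value_shown} ago."
)

# Values taken from the documented examples (IP address left masked).
values = {
    "alarm_target": "ob_cluster=obcluster-1631964370:svr_ip=xxx.xxx.xxx.xxx",
    "alarm_name": "os_observer_core_dump",
    "ob_cluster_name": "obcluster",
    "svr_ip": "xxx.xxx.xxx.xxx",
    "value_shown": "10 minutes",
}

overview_text = overview.substitute(values)
details_text = details.substitute(values)
```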
Impact on the system
When a core dump file is generated, the observer process has exited abnormally.
- For a single-replica system, system services are unavailable while the observer process is not running.
- For a multi-replica OceanBase cluster, the availability of the cluster may decrease while the observer process on an OBServer is not running. For example, the number of zones changes from 3 to 2, or the number of Internet Data Centers (IDCs) in the same region changes from 3 to 2.
Possible causes
Generally, an observer process exits abnormally after receiving signal 11 (SIGSEGV, invalid memory access) or signal 6 (SIGABRT, process abort). The abnormal exit is usually caused by the OBServer itself, the OS, or the underlying hardware.
If the problem is not caused by hardware damage or the OS, you need to restart the OBServer as an emergency measure.
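When checking how a process ended, it can help to translate the signal number into its name. The sketch below uses Python's standard `signal` module and assumes a subprocess-style return code, where a negative value means the process was terminated by a signal. The function name is hypothetical, and the numbers 11 and 6 correspond to SIGSEGV and SIGABRT on Linux.

```python
import signal


def describe_exit(returncode: int) -> str:
    """Describe a subprocess-style return code (negative = killed by a signal)."""
    if returncode >= 0:
        return f"exited with status {returncode}"
    signum = -returncode
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Signal number not known on this platform.
        name = f"signal {signum}"
    return f"terminated by {name}"
```

Using the module constants, for example `describe_exit(-signal.SIGSEGV)`, avoids hard-coding numbers that may differ on other platforms.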
Suggested solutions
- Disconnect the faulty OBServer.
- Contact Technical Support to analyze the root cause.