os_observer_core_dump

2025-01-10 06:15:54  Updated

Description

A core dump file is generated when the observer process on an OBServer exits abnormally. The file can be tens or even hundreds of GB in size. You need to handle the emergency and analyze the root cause of the abnormal exit. The system keeps reporting this alert for 20 minutes after the core dump file is generated.

Principle

Parameter | Value
Metric | observer_core_dump_seconds
Source | This metric is collected during OBServer monitoring. It reports the time elapsed, in seconds, since the last core dump file was generated.
Collected metric | observer_coredump_time_seconds
Metric expression | max(observer_coredump_time_seconds{@LABELS}) by (@GBLABELS)
Collection cycle | 1 second
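
The metric expression takes the maximum collected value within each label group. A minimal sketch of that aggregation follows. It assumes that @GBLABELS expands to the ob_cluster_name and svr_ip labels used in the alert template; the sample values and the max_by helper are hypothetical and are not part of the product.

```python
def max_by(samples, group_labels):
    """Return the maximum value per group, keyed by the tuple of group label values."""
    result = {}
    for labels, value in samples:
        key = tuple(labels.get(name) for name in group_labels)
        if key not in result or value > result[key]:
            result[key] = value
    return result


# Hypothetical samples of observer_coredump_time_seconds: (labels, value) pairs.
samples = [
    ({"ob_cluster_name": "obcluster", "svr_ip": "192.0.2.1"}, 600.0),
    ({"ob_cluster_name": "obcluster", "svr_ip": "192.0.2.1"}, 615.0),
    ({"ob_cluster_name": "obcluster", "svr_ip": "192.0.2.2"}, 0.0),
]

# Assumed grouping labels standing in for @GBLABELS.
grouped = max_by(samples, ["ob_cluster_name", "svr_ip"])
for key, value in grouped.items():
    print(key, value)
```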

Alert rule

Metric expression | Metric description | Default threshold (unit: second) | Detection cycle | Elimination cycle
observer_core_dump_seconds < 1200 and observer_core_dump_seconds > 0 | The time elapsed since the last core dump file was generated. | 1200 | 10 seconds | 5 minutes
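
The rule fires only while the elapsed time is strictly between 0 and the threshold. The sketch below illustrates that condition. The should_alert function is a hypothetical name, and reading a value of 0 as "no recent core dump recorded" is an assumption rather than documented behavior.

```python
DEFAULT_THRESHOLD_SECONDS = 1200  # default threshold from the alert rule


def should_alert(observer_core_dump_seconds, threshold=DEFAULT_THRESHOLD_SECONDS):
    """Mirror the rule: observer_core_dump_seconds < threshold and > 0."""
    return 0 < observer_core_dump_seconds < threshold


# A core dump generated 10 minutes ago is inside the 20-minute window.
print(should_alert(600))
# Both boundary values fall outside the strict inequalities.
print(should_alert(0), should_alert(1200))
```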

Alert information

Trigger method | Alert level | Scope
Based on the expression of the metric | Critical | Server

Alert templates

  • Overview
    • Template: ${alarm_target} ${alarm_name}
    • Example: ob_cluster=obcluster-1631964370:svr_ip=xxx.xxx.xxx.xxx os_observer_core_dump
  • Details
    • Template: Cluster: ${ob_cluster_name}, host: ${svr_ip}, alert: an OBServer core dump file was generated ${value_shown} ago.
    • Example: Cluster: obcluster, host: xxx.xxx.xxx.xxx, alert: an OBServer core dump file was generated 10 minutes ago.
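
The ${...} placeholders in the templates happen to match the syntax of Python's string.Template, so the Details example can be reproduced with the short sketch below. This only illustrates how the placeholders are filled; it is not how the product renders alerts, and the format_elapsed helper is a hypothetical stand-in for whatever produces ${value_shown}.

```python
from string import Template

# Details template copied from the list above.
DETAILS_TEMPLATE = Template(
    "Cluster: ${ob_cluster_name}, host: ${svr_ip}, "
    "alert: an OBServer core dump file was generated ${value_shown} ago."
)


def format_elapsed(seconds):
    """Hypothetical helper: render elapsed seconds as whole minutes."""
    minutes = int(seconds) // 60
    unit = "minute" if minutes == 1 else "minutes"
    return f"{minutes} {unit}"


message = DETAILS_TEMPLATE.substitute(
    ob_cluster_name="obcluster",
    svr_ip="xxx.xxx.xxx.xxx",
    value_shown=format_elapsed(600),
)
print(message)
```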

Impact on the system

When a core dump file is generated, the observer process has exited abnormally.

  • For a single-replica system, system services are unavailable while the observer process is not running.
  • For a multi-replica OceanBase cluster, the availability of the cluster may decrease while the observer process on an OBServer is not running. For example, the number of available zones drops from 3 to 2, or the number of available Internet Data Centers (IDCs) in the same region drops from 3 to 2.

Possible causes

Generally, an observer process exits abnormally because it receives signal 11 (SIGSEGV, invalid memory access) or signal 6 (SIGABRT, process abort). The abnormal exit is usually caused by the OBServer itself, the OS, or the underlying hardware.
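
When reading logs or a core dump report, it can help to map a signal number to its symbolic name. The sketch below uses Python's standard signal module for this. The describe_signal function is a hypothetical name, and signal numbering is platform dependent, so the mapping of 11 and 6 is assumed to be checked on a Linux host.

```python
import signal


def describe_signal(signum):
    """Return the symbolic name of a signal number, or a fallback for unknown numbers."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return f"unknown signal {signum}"


# Signal numbers mentioned above; names depend on the platform's numbering.
for signum in (11, 6):
    print(signum, describe_signal(signum))
```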

If the problem is not caused by hardware damage or the OS, you need to restart the OBServer as an emergency measure.

Suggested solutions

  1. Disconnect the faulty OBServer.
  2. Contact Technical Support to analyze the root cause.
