Monitoring and alerting (V4.2.5)

The binlog service integrates Prometheus metrics, which are exposed on port 2984 by default. You can visit http://{IP address of the binlog server}:2984/metrics to query the current metrics, and use Prometheus Alertmanager to monitor and alert on specific metrics.
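As a minimal sketch of how Prometheus could be pointed at this endpoint, the configuration below scrapes one binlog server on the default port 2984. The target address `10.0.0.1`, the Alertmanager address `127.0.0.1:9093`, the job name `binlog`, and the rule file name `binlog_rules.yml` are placeholders chosen for illustration, not values defined by the binlog service.

```yaml
# prometheus.yml (sketch; all addresses and file names are assumptions)
global:
  scrape_interval: 10s       # how often the /metrics endpoint is pulled
  evaluation_interval: 10s   # how often alert rules are evaluated

rule_files:
  - binlog_rules.yml         # hypothetical file holding the alert rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['127.0.0.1:9093']   # assumed Alertmanager address

scrape_configs:
  - job_name: binlog                    # hypothetical job name
    metrics_path: /metrics
    static_configs:
      - targets: ['10.0.0.1:2984']      # replace with the binlog server IP
```

To monitor several binlog servers, list each `IP:2984` pair under `targets`.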

Host status

Monitoring metrics

Metric	Description	Tag
binlog_cpu_count	The number of CPU cores.	host_name: the unique ID of the node. ip: the IP address of the node.
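binlog_cpu_used_ratio	The CPU utilization, as a ratio from 0 to 1.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_disk_total_size_mb	The total size of the disk space, in MB.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_disk_used_ratio	The disk space usage, as a ratio from 0 to 1.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_mem_total_size_mb	The total size of the memory space, in MB.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_mem_used_ratio	The memory usage, as a ratio from 0 to 1.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_mem_used_size_mb	The size of the memory space that is in use, in MB.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_network_rx_bytes	The number of data bytes received per second.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_network_wx_bytes	The number of data bytes sent per second.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_load1	The 1-minute load average (average number of running processes).	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_load5	The 5-minute load average (average number of running processes).	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_load15	The 15-minute load average (average number of running processes).	host_name: the unique ID of the node. ip: the IP address of the node.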

Alert rules

Alert related to CPU utilization

- alert: HighCpuUsage
  expr: binlog_cpu_used_ratio > 0.9
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "CPU utilization exceeding the threshold"
    description: "The CPU utilization on the {{ $labels.ip }} node exceeds 90%."
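The rules on this page are shown as list items. Prometheus loads rules from a rule file in which such items sit under a named group. A minimal sketch of that wrapping, using the CPU rule above, might look like the following. The group name `binlog_host_status` and the file name `binlog_rules.yml` are assumptions made for illustration.

```yaml
# binlog_rules.yml (sketch; file and group names are assumptions)
groups:
  - name: binlog_host_status
    rules:
      # Each "- alert:" item from this page goes under "rules:".
      - alert: HighCpuUsage
        expr: binlog_cpu_used_ratio > 0.9
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "CPU utilization exceeding the threshold"
          description: "The CPU utilization on the {{ $labels.ip }} node exceeds 90%."
```

The file can be checked with `promtool check rules binlog_rules.yml` before Prometheus is reloaded.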

Alert related to memory usage

- alert: HighMemUsage
  expr: binlog_mem_used_ratio > 0.8
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Memory usage exceeding the threshold"
    description: "The memory usage on the {{ $labels.ip }} node exceeds 80%."

Alert related to the average memory usage in a cluster

- alert: HighAvgMemUsage
  expr: avg(binlog_mem_used_ratio) > 0.65
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Memory usage exceeding the threshold"
    description: "The average memory usage in the cluster exceeds 65%."

Alert related to the server load

- alert: HighLoad1
  expr: binlog_load1 > 16
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Server load exceeding the threshold"
    description: "The server load on the {{ $labels.ip }} node exceeds 16."

Alert related to disk usage

- alert: HighDiskUsage
  expr: binlog_disk_used_ratio > 0.8
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Disk usage exceeding the threshold"
    description: "The disk usage on the {{ $labels.ip }} node exceeds 80%."
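A rule such as the disk usage alert above can be exercised offline with `promtool test rules`. The sketch below assumes the rule has been saved under a rule group in a hypothetical file named `binlog_rules.yml`. The series labels `node-1` and `10.0.0.1` are made-up sample values.

```yaml
# binlog_rules_test.yml (sketch; file names and label values are assumptions)
rule_files:
  - binlog_rules.yml

evaluation_interval: 10s

tests:
  - interval: 10s
    # Synthetic samples at 0s, 10s, 20s, and 30s, all above the 0.8 threshold.
    input_series:
      - series: 'binlog_disk_used_ratio{host_name="node-1", ip="10.0.0.1"}'
        values: '0.85 0.85 0.85 0.85'
    alert_rule_test:
      # By 30s the condition has held longer than "for: 10s", so the alert fires.
      - eval_time: 30s
        alertname: HighDiskUsage
        exp_alerts:
          - exp_labels:
              severity: critical
              host_name: node-1
              ip: 10.0.0.1
            exp_annotations:
              summary: "Disk usage exceeding the threshold"
              description: "The disk usage on the 10.0.0.1 node exceeds 80%."
```

Run it with `promtool test rules binlog_rules_test.yml`. The same pattern applies to the other ratio-based rules if the metric name, threshold, and annotations are changed accordingly.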

OBM status

Monitoring metrics

Metric	Description	Tag
binlog_instance_num	The number of binlog instances.	host_name: the unique ID of the node.
binlog_manager_down_count	The number of times that the OBM process fails.	host_name: the unique ID of the node.
binlog_create	The number of binlog tasks created.	ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_release	The number of binlog tasks released.	ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.

Alert rules

Alert related to binlog task creation

rules:
  - alert: BinlogCreateAlert
    expr: increase(binlog_create[1m]) > 0
    for: 1m
    labels:
      severity: info
    annotations:
      summary: "Binlog service enabled"
      description: "The binlog service is enabled for the {{ $labels.ob_cluster_name }}.{{ $labels.tenant_name }} tenant."

Alert related to binlog task release

rules:
- alert: BinlogReleaseAlert
  expr: increase(binlog_release[1m]) > 0
  for: 1m
  labels:
    severity: info
  annotations:
    summary: "Binlog service disabled"
    description: "The binlog service is disabled for the {{ $labels.ob_cluster_name }}.{{ $labels.tenant_name }} tenant."

Alert related to OBM process failures

- alert: OBMDownAlert
  expr: increase(binlog_manager_down_count[1m]) > 0
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "OBM process failed"
    description: "The OBM process on the {{ $labels.host_name }} node failed."

OBI instance status

Monitoring metrics

Metric	Description	Tag
binlog_allocate_node_fail_count	The number of failures to allocate a service node.	instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
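binlog_instance_gtid_inconsistent_count	The number of times that OBI instances have inconsistent global transaction identifiers (GTIDs).	instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_master_switch_count	The number of times that the primary OBI instance is switched.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_master_switch_failed_count	The number of failures to switch the primary OBI instance.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_no_master_count	The number of times that no primary OBI instance is available.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_down	The number of OBI instance failures.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_failover_fail_count	The number of failures to automatically start an OBI instance after a failover.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.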

Alert rules

Alert related to service node allocation failures

- alert: BinlogAllocateFailedAlert
  expr: increase(binlog_allocate_node_fail_count[1m]) > 0
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Node allocation failure"
    description: "Failed to allocate a binlog service node to the {{ $labels.tenant_name }} tenant in the {{ $labels.ob_cluster_name }} cluster. Resolve the issue immediately."

Alert related to inconsistent GTIDs of OBI instances

- alert: GtidInconsistentFailedAlert
  expr: increase(binlog_instance_gtid_inconsistent_count[1m]) > 0
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Inconsistent GTIDs detected during inspection"
    description: "The GTIDs of OBI instances in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster are inconsistent."

Alert related to primary OBI instance switching

rules:
- alert: MasterSwitchAlert
  expr: increase(binlog_instance_master_switch_count[1m]) > 0
  for: 1m
  labels:
    severity: info
  annotations:
    summary: "Primary OBI instance switching"
    description: "A primary OBI instance switching event occurred in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster."

Alert related to frequent primary OBI instance switching

rules:
- alert: FrequentMasterSwitchAlert
  expr: increase(binlog_instance_master_switch_count[1m]) > 2
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Frequent primary OBI instance switching"
    description: "Primary OBI instance switching occurred frequently in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster."

Alert related to primary OBI instance switching failures

- alert: MasterSwitchFailedAlert
  expr: increase(binlog_instance_master_switch_failed_count[1m]) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Primary OBI instance switching failure"
    description: "A primary OBI instance switching failure occurred in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster. Resolve the issue immediately."

Alert related to the absence of a primary OBI instance

- alert: NoMasterAlert
  expr: increase(binlog_instance_no_master_count[1m]) > 0
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Absence of a primary OBI instance"
    description: "No primary OBI instance exists in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster. Resolve the issue immediately."

Alert related to OBI instance failures

- alert: InstanceDownAlert
  expr: changes(binlog_instance_down[15m]) > 0 or (binlog_instance_convert_delay==0)
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "OBI instance failure"
    description: "The {{ $labels.instance_id }} instance failed."

Alert related to a failure to automatically start an OBI instance after a failover

- alert: FailoverFailedAlert
  expr: increase(binlog_instance_failover_fail_count[1m]) > 0
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Automatic instance startup failure after a failover"
    description: "Failed to automatically start the {{ $labels.instance_id }} instance in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster after a failover. Resolve the issue immediately."

OBI instance performance

Monitoring metrics

Metric	Description	Tag
binlog_instance_convert_checkpoint	The security checkpoint for binlog conversion by the OBI instance, in microseconds.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
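binlog_instance_convert_delay	The delay in binlog conversion by the OBI instance, in milliseconds.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_convert_fetch_rps	The RPS in pulling clogs by the OBI instance.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_convert_iops	The IOPS in binlog conversion by the OBI instance, in bytes.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_convert_storage_rps	The RPS in storing binlogs to the disk by the OBI instance.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_dump_count	The number of subscriptions to the OBI instance.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_dump_error_count	The number of exceptions in subscribing to the OBI instance.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.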
binlog_instance_dump_checkpoint	The security checkpoint in the subscription connection, in microseconds.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant. trace_id: the trace ID of the connection.
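binlog_instance_dump_rps	The RPS in the subscription connection.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant. trace_id: the trace ID of the connection.
binlog_instance_dump_delay	The subscription delay in the subscription connection, in seconds.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant. trace_id: the trace ID of the connection.
binlog_instance_dump_heartbeat_rps	The heartbeat RPS in the subscription connection.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant. trace_id: the trace ID of the connection.
binlog_instance_dump_iops	The heartbeat IOPS in the subscription connection, in bytes.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant. trace_id: the trace ID of the connection.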

Alert rules

Alert related to the binlog conversion delay

- alert: ConversionDelayAlert
  expr: |
    (binlog_instance_convert_delay > 120000) and
    (binlog_instance_convert_fetch_rps < 5) and
    (binlog_instance_convert_storage_rps < 5) or
    (time() - binlog_instance_convert_checkpoint - binlog_instance_convert_delay) > 120000
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Conversion delay exceeding the threshold"
    description: "The binlog conversion by the {{ $labels.instance_id }} instance in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster is delayed. Current value of the alert expression: {{ $value }}."
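In PromQL, `and` binds more tightly than `or`, so the expression above is evaluated as "(delay is high and both RPS values are low) or (the checkpoint lags behind)". The sketch below spells out that grouping with explicit parentheses. It is meant to be equivalent to the original expression, not a different rule, and the annotations are omitted only for brevity.

```yaml
- alert: ConversionDelayAlert
  expr: |
    (
      # Branch 1: the reported delay is high while fetch and storage are nearly idle.
      (binlog_instance_convert_delay > 120000)
      and (binlog_instance_convert_fetch_rps < 5)
      and (binlog_instance_convert_storage_rps < 5)
    )
    or
    (
      # Branch 2: the checkpoint lags behind the current time.
      (time() - binlog_instance_convert_checkpoint - binlog_instance_convert_delay) > 120000
    )
  for: 10s
  labels:
    severity: critical
```

When the first branch matches, `{{ $value }}` in an annotation carries the value of binlog_instance_convert_delay. When only the second branch matches, it carries the value of the subtraction.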

Alert related to the number of subscriptions

- alert: InstanceDumpAlert
  expr: binlog_instance_dump_count > 100
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Binlog instance subscriptions exceeding the threshold"
    description: "The number of subscriptions to the {{ $labels.instance_id }} instance exceeds the threshold of 100. Current value: {{ $value }}."

- alert: InstanceDumpResolved
  expr: binlog_instance_dump_count <= 100
  for: 1m
  labels:
    severity: normal
  annotations:
    summary: "Subscription threshold alert cleared"
    description: "The number of subscriptions to the {{ $labels.instance_id }} instance has fallen to 100 or below."
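The rules on this page attach a `severity` label of critical, warning, info, or normal. One way to act on that label is to route alerts by severity in Alertmanager. The following is a minimal sketch under stated assumptions: the receiver names and webhook URLs are hypothetical placeholders, and the grouping and repeat intervals are illustrative values, not recommendations from the binlog service.

```yaml
# alertmanager.yml (sketch; receiver names and URLs are assumptions)
route:
  receiver: default-receiver
  group_by: ['alertname', 'ob_cluster_name', 'tenant_name']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts go to a dedicated receiver and repeat more often.
    - matchers:
        - severity="critical"
      receiver: oncall-webhook
      repeat_interval: 1h
    # Informational and "normal" alerts stay on the default receiver.
    - matchers:
        - severity=~"info|normal"
      receiver: default-receiver

receivers:
  - name: default-receiver
    webhook_configs:
      - url: 'http://alert-hook.example.internal/notify'   # placeholder URL
  - name: oncall-webhook
    webhook_configs:
      - url: 'http://alert-hook.example.internal/page'     # placeholder URL
```

Alerts with the warning severity match neither child route here and fall through to the default receiver.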