Monitoring and alerting (OceanBase Binlog Service V4.0.1)

The binlog service integrates Prometheus metrics, which are exposed on port 2984 by default. You can visit http://{IP address of the binlog server}:2984/metrics to query the current metrics, and use Prometheus Alertmanager to monitor and alert on specific metrics.
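
As a rough sketch of how the endpoint above could be wired into Prometheus, the following prometheus.yml fragment scrapes the metrics port and loads an alert rule file. The job name, the target address 192.0.2.10, the rule file name, and the Alertmanager address are placeholders chosen for illustration, not values defined by the binlog service.

```yaml
# prometheus.yml (sketch; names, addresses, and paths are assumptions)
global:
  scrape_interval: 15s

rule_files:
  - "binlog_alerts.yml"                  # hypothetical file holding the alert rules from this page

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["127.0.0.1:9093"]    # assumed Alertmanager address

scrape_configs:
  - job_name: "binlog_service"           # hypothetical job name
    metrics_path: /metrics
    static_configs:
      - targets: ["192.0.2.10:2984"]     # replace with the binlog server IP; 2984 is the default port
```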

Host status

Monitoring metrics

Metric	Description	Tag
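binlog_cpu_count	The number of CPU cores.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_cpu_used_ratio	The CPU utilization, as a fraction between 0 and 1.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_disk_total_size_mb	The total size of the disk space, in MB.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_disk_used_ratio	The disk space usage, as a fraction between 0 and 1.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_mem_total_size_mb	The total size of the memory space, in MB.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_mem_used_ratio	The memory usage, as a fraction between 0 and 1.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_mem_used_size_mb	The size of the memory space that has been occupied, in MB.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_network_rx_bytes	The number of data bytes received per second.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_network_wx_bytes	The number of data bytes sent per second.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_load1	The average number of running processes within 1 minute.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_load5	The average number of running processes within 5 minutes.	host_name: the unique ID of the node. ip: the IP address of the node.
binlog_load15	The average number of running processes within 15 minutes.	host_name: the unique ID of the node. ip: the IP address of the node.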

Alert rules

Alert related to CPU utilization

- alert: HighCpuUsage
  expr: binlog_cpu_used_ratio > 0.9
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "CPU utilization exceeding the threshold"
    description: "The CPU utilization on the {{ $labels.ip }} node exceeds 90%."
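
The rule fragments on this page are individual list items, so they need to sit under a rule group before Prometheus can load them. A minimal sketch of a complete rule file built around the HighCpuUsage rule above follows; the group name is a placeholder chosen for illustration.

```yaml
# binlog_alerts.yml (sketch; the group name is an assumption)
groups:
  - name: binlog_host_status
    rules:
      - alert: HighCpuUsage
        expr: binlog_cpu_used_ratio > 0.9
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "CPU utilization exceeding the threshold"
          description: "The CPU utilization on the {{ $labels.ip }} node exceeds 90%."
```

A file in this shape can be checked with `promtool check rules binlog_alerts.yml` before it is loaded. The other rules on this page can be appended to the same `rules:` list or placed in separate groups.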

Alert related to memory usage

- alert: HighMemUsage
  expr: binlog_mem_used_ratio > 0.8
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Memory usage exceeding the threshold"
    description: "The memory usage on the {{ $labels.ip }} node exceeds 80%."

Alert related to the average memory usage in a cluster

- alert: HighAvgMemUsage
  expr: avg(binlog_mem_used_ratio) > 0.65
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Average memory usage exceeding the threshold"
    description: "The average memory usage in the cluster exceeds 65%."

Alert related to the server load

- alert: HighLoad1
  expr: binlog_load1 > 16
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Server load exceeding the threshold"
    description: "The server load on the {{ $labels.ip }} node exceeds 16."

Alert related to disk usage

- alert: HighDiskUsage
  expr: binlog_disk_used_ratio > 0.8
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Disk usage exceeding the threshold"
    description: "The disk usage on the {{ $labels.ip }} node exceeds 80%."

OBM status

Monitoring metrics

Metric	Description	Tag
binlog_instance_num	The number of binlog instances.	host_name: the unique ID of the node.
binlog_manager_down_count	The number of times that the OBM process fails.	host_name: the unique ID of the node.
binlog_create	The number of binlog tasks created.	ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_release	The number of binlog tasks released.	ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.

Alert rules

Alert related to binlog task creation

rules:
  - alert: BinlogCreateAlert
    expr: increase(binlog_create[1m]) > 0
    for: 1m
    labels:
      severity: info
    annotations:
      summary: "Binlog service enabled"
      description: "The binlog service is enabled for the {{ $labels.ob_cluster_name }}.{{ $labels.tenant_name }} tenant."

Alert related to binlog task release

rules:
- alert: BinlogReleaseAlert
  expr: increase(binlog_release[1m]) > 0
  for: 1m
  labels:
    severity: info
  annotations:
    summary: "Binlog service disabled"
    description: "The binlog service is disabled for the {{ $labels.ob_cluster_name }}.{{ $labels.tenant_name }} tenant."

Alert related to OBM process failures

- alert: OBMDownAlert
  expr: increase(binlog_manager_down_count[1m]) > 0
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "OBM process failed"
    description: "The OBM process on the {{ $labels.host_name }} node failed."

OBI instance status

Monitoring metrics

Metric	Description	Tag
binlog_allocate_node_fail_count	The number of failures to allocate a service node.	instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
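binlog_instance_gtid_inconsistent_count	The number of times that OBI instances have inconsistent global transaction identifiers (GTIDs).	instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_master_switch_count	The number of times that the primary OBI instance was switched.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_master_switch_failed_count	The number of failures to switch the primary OBI instance.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_no_master_count	The number of times that no primary OBI instance is available.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_down	The number of OBI instance failures.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_failover_fail_count	The number of failures to automatically start an OBI instance after a failover.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.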

Alert rules

Alert related to service node allocation failures

- alert: BinlogAllocateFailedAlert
  expr: increase(binlog_allocate_node_fail_count[1m]) > 0
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Node allocation failure"
    description: "Failed to allocate a binlog service node to the {{ $labels.tenant_name }} tenant in the {{ $labels.ob_cluster_name }} cluster. Resolve the issue immediately."

Alert related to inconsistent GTIDs of OBI instances

- alert: GtidInconsistentFailedAlert
  expr: increase(binlog_instance_gtid_inconsistent_count[1m]) > 0
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Inconsistent GTIDs detected during inspection"
    description: "The GTIDs of OBI instances in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster are inconsistent."

Alert related to primary OBI instance switching

rules:
- alert: MasterSwitchAlert
  expr: increase(binlog_instance_master_switch_count[1m]) > 0 
  for: 1m
  labels:
    severity: info
  annotations:
    summary: "Primary OBI instance switching"
    description: "A primary OBI instance switching event occurred in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster."

Alert related to frequent primary OBI instance switching

rules:
- alert: FrequentMasterSwitchAlert
  expr: increase(binlog_instance_master_switch_count[1m]) > 2
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Frequent primary OBI instance switching"
    description: "Primary OBI instance switching frequently occurred in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster."

Alert related to primary OBI instance switching failures

- alert: MasterSwitchFailedAlert
  expr: increase(binlog_instance_master_switch_failed_count[1m]) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Primary OBI instance switching failure"
    description: "A primary OBI instance switching failure occurred in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster. Resolve the issue immediately."

Alert related to the absence of a primary OBI instance

- alert: NoMasterAlert
  expr: increase(binlog_instance_no_master_count[1m]) > 0
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Absence of a primary OBI instance"
    description: "No primary OBI instance exists in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster. Resolve the issue immediately."

Alert related to OBI instance failures

- alert: InstanceDownAlert
  expr: changes(binlog_instance_down[15m]) > 0 or (binlog_instance_convert_delay==0)
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "OBI instance failure"
    description: "The {{ $labels.instance_id }} instance failed."

Alert related to a failure to automatically start an OBI instance after a failover

- alert: FailoverFailedAlert
  expr: increase(binlog_instance_failover_fail_count[1m]) > 0
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Automatic instance startup failure after a failover"
    description: "Failed to automatically start the {{ $labels.instance_id }} instance in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster after a failover. Resolve the issue immediately."

OBI instance performance

Monitoring metrics

Metric	Description	Tag
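binlog_instance_convert_checkpoint	The security checkpoint for binlog conversion by the OBI instance, in microseconds.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_convert_delay	The delay in binlog conversion by the OBI instance, in milliseconds.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_convert_fetch_rps	The RPS in pulling clogs by the OBI instance.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_convert_iops	The IOPS in binlog conversion by the OBI instance, in bytes.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_convert_storage_rps	The RPS in storing binlogs to the disk by the OBI instance.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_dump_count	The number of subscriptions to the OBI instance.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_dump_error_count	The number of exceptions in subscribing to the OBI instance.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant.
binlog_instance_dump_checkpoint	The security checkpoint in the subscription connection, in microseconds.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant. trace_id: the trace ID of the connection.
binlog_instance_dump_rps	The RPS in the subscription connection.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant. trace_id: the trace ID of the connection.
binlog_instance_dump_delay	The subscription delay in the subscription connection, in seconds.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant. trace_id: the trace ID of the connection.
binlog_instance_dump_heartbeat_rps	The heartbeat RPS in the subscription connection.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant. trace_id: the trace ID of the connection.
binlog_instance_dump_iops	The heartbeat IOPS in the subscription connection, in bytes.	host_name: the unique ID of the node. instance_id: the ID of the binlog instance. ob_cluster_name: the name of the cluster. tenant_name: the name of the tenant. trace_id: the trace ID of the connection.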

Alert rules

Alert related to the binlog conversion delay

- alert: ConversionDelayAlert
  expr: |
    (binlog_instance_convert_delay > 120000) and
    (binlog_instance_convert_fetch_rps < 5) and
    (binlog_instance_convert_storage_rps < 5) or
    (time() - binlog_instance_convert_checkpoint - binlog_instance_convert_delay) > 120000
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "Conversion delay exceeding the threshold"
    description: "The binlog conversion by the {{ $labels.instance_id }} instance in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster is delayed (current value: {{ $value }})."

Alert related to the number of subscriptions

- alert: InstanceDumpAlert
  expr: binlog_instance_dump_count > 100
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Binlog instance subscriptions exceeding the threshold"
    description: "The number of subscriptions to the {{ $labels.instance_id }} instance is {{ $value }}, which exceeds the threshold of 100."

- alert: InstanceDumpResolved
  expr: binlog_instance_dump_count <= 100
  for: 1m
  labels:
    severity: normal
  annotations:
    summary: "Subscription threshold alert cleared"
    description: "The number of subscriptions to the {{ $labels.instance_id }} instance has fallen to 100 or below."
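
The rules on this page attach a severity label (critical, warning, info, or normal) to each alert. One possible way to act on that label is an Alertmanager routing tree such as the sketch below. The receiver names are hypothetical and have no notification integrations configured, and the grouping labels are only a suggestion based on the tags listed in the metric tables.

```yaml
# alertmanager.yml (sketch; receiver names and grouping are assumptions)
route:
  receiver: default
  group_by: ["alertname", "ob_cluster_name", "tenant_name"]
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall                   # hypothetical receiver for urgent alerts
    - matchers:
        - severity=~"info|normal"
      receiver: "null"                   # receiver without integrations, so nothing is sent

receivers:
  - name: default
  - name: oncall
  - name: "null"
```

Alerts that match neither child route, such as those with severity warning, fall through to the default receiver.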
