The binlog service integrates Prometheus metrics, which are exposed on port 2984 by default. You can visit http://{IP address of the binlog server}:2984/metrics to query the current metrics, and use Prometheus Alertmanager to monitor and alert on specific metrics.
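To have Prometheus collect these metrics, the binlog server can be added as a scrape target. The following is a minimal sketch of a `prometheus.yml` fragment, not a complete configuration. The job name `binlog` and the address `192.0.2.10` are placeholders chosen for illustration; only the port 2984 and the `/metrics` path come from the defaults described above.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: "binlog"              # hypothetical job name
    metrics_path: /metrics          # path exposed by the binlog service
    static_configs:
      - targets:
          - "192.0.2.10:2984"       # placeholder: replace with the binlog server address
```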
Host status
Monitoring metrics
| Metric | Description | Tag |
|---|---|---|
| binlog_cpu_count | The number of CPU cores. | |
| binlog_cpu_used_ratio | The CPU utilization. | |
| binlog_disk_total_size_mb | The total size of the disk space, in MB. | |
| binlog_disk_used_ratio | The usage of the disk space. | |
| binlog_mem_total_size_mb | The total size of the memory space, in MB. | |
| binlog_mem_used_ratio | The memory usage. | |
| binlog_mem_used_size_mb | The size of the memory space that has been occupied, in MB. | |
| binlog_network_rx_bytes | The number of data bytes received per second. | |
| binlog_network_wx_bytes | The number of data bytes sent per second. | |
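| binlog_load1 | The 1-minute load average, that is, the average number of running processes over the last minute. | |
| binlog_load5 | The 5-minute load average, that is, the average number of running processes over the last 5 minutes. | |
| binlog_load15 | The 15-minute load average, that is, the average number of running processes over the last 15 minutes. | |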
Alert rules
Alert related to CPU utilization

    - alert: HighCpuUsage
      expr: binlog_cpu_used_ratio > 0.9
      for: 10s
      labels:
        severity: critical
      annotations:
        summary: "CPU utilization exceeding the threshold"
        description: "The CPU utilization on the {{ $labels.ip }} node exceeds 90%."

Alert related to memory usage

    - alert: HighMemUsage
      expr: binlog_mem_used_ratio > 0.8
      for: 10s
      labels:
        severity: critical
      annotations:
        summary: "Memory usage exceeding the threshold"
        description: "The memory usage on the {{ $labels.ip }} node exceeds 80%."

Alert related to the average memory usage in a cluster

    - alert: HighMemUsage
      expr: avg(binlog_mem_used_ratio) > 0.65
      for: 10s
      labels:
        severity: critical
      annotations:
        summary: "Memory usage exceeding the threshold"
        description: "The average memory usage in the cluster exceeds 65%."

Alert related to the server load

    - alert: HighLoad1
      expr: binlog_load1 > 16
      for: 10s
      labels:
        severity: critical
      annotations:
        summary: "Server load exceeding the threshold"
        description: "The server load on the {{ $labels.ip }} node exceeds 16."

Alert related to disk usage

    - alert: HighDiskUsage
      expr: binlog_disk_used_ratio > 0.8
      for: 10s
      labels:
        severity: critical
      annotations:
        summary: "Disk usage exceeding the threshold"
        description: "The disk usage on the {{ $labels.ip }} node exceeds 80%."

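Prometheus loads alert rules from rule files, in which every rule belongs to a named group. As a rough sketch of how one of the host rules above might be placed in such a file, consider the following. The file name `binlog_host_rules.yml` and the group name `binlog_host` are assumptions made for this example.

```yaml
# binlog_host_rules.yml (hypothetical file name)
groups:
  - name: binlog_host                # hypothetical group name
    rules:
      - alert: HighCpuUsage
        expr: binlog_cpu_used_ratio > 0.9
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "CPU utilization exceeding the threshold"
          description: "The CPU utilization on the {{ $labels.ip }} node exceeds 90%."
```

A file like this is typically listed under `rule_files` in `prometheus.yml`, and its syntax can be checked with `promtool check rules binlog_host_rules.yml` before Prometheus is reloaded.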
OBM status
Monitoring metrics
| Metric | Description | Tag |
|---|---|---|
| binlog_instance_num | The number of binlog instances. | host_name: the unique ID of the node. |
| binlog_manager_down_count | The number of times that the OBM process fails. | |
| binlog_create | The number of binlog tasks created. | |
| binlog_release | The number of binlog tasks released. | |
Alert rules
Alert related to binlog task creation

    rules:
      - alert: BinlogCreateAlert
        expr: increase(binlog_create[1m]) > 0
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "Binlog service enabled"
          description: "The binlog service is enabled for the {{ $labels.ob_cluster_name }}.{{ $labels.tenant_name }} tenant."

Alert related to binlog task release

    rules:
      - alert: BinlogReleaseAlert
        expr: increase(binlog_release[1m]) > 0
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "Binlog service disabled"
          description: "The binlog service is disabled for the {{ $labels.ob_cluster_name }}.{{ $labels.tenant_name }} tenant."

Alert related to OBM process failures

    - alert: OBMDownAlert
      expr: increase(binlog_manager_down_count[1m]) > 0
      for: 10s
      labels:
        severity: critical
      annotations:
        summary: "OBM process failed"
        description: "The OBM process on the {{ $labels.host_name }} node fails."

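The rules in this document use the `severity` label with values such as `info` and `critical`, which Alertmanager can use to decide where a notification goes. The fragment below is a minimal sketch of such routing. The receiver names and webhook URLs are hypothetical placeholders, not part of the binlog service.

```yaml
# alertmanager.yml (fragment)
route:
  receiver: default-receiver          # hypothetical: receives info and other alerts
  group_by: ["alertname"]
  routes:
    - matchers:
        - severity="critical"         # label set by the critical rules above
      receiver: oncall-receiver       # hypothetical: receives critical alerts
receivers:
  - name: default-receiver
    webhook_configs:
      - url: "http://alert-gateway.example.internal/info"       # placeholder URL
  - name: oncall-receiver
    webhook_configs:
      - url: "http://alert-gateway.example.internal/critical"   # placeholder URL
```

With routing of this kind, informational events such as `BinlogCreateAlert` and urgent ones such as `OBMDownAlert` can be sent to different channels without changing the rules themselves.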
OBI instance status
Monitoring metrics
| Metric | Description | Tag |
|---|---|---|
| binlog_allocate_node_fail_count | The number of failures to allocate a service node. | |
| binlog_instance_gtid_inconsistent_count | The number of times that OBI instances have inconsistent global transaction identifiers (GTIDs). | |
| binlog_instance_master_switch_count | The number of times that the primary OBI instance has been switched. | |
| binlog_instance_master_switch_failed_count | The number of failures to switch the primary OBI instance. | |
| binlog_instance_no_master_count | The number of times that no primary OBI instance is available. | |
| binlog_instance_down | The number of OBI instance failures. | |
| binlog_instance_failover_fail_count | The number of failures to automatically start an OBI instance after a failover. | |
Alert rules
Alert related to service node allocation failures

    - alert: BinlogAllocateFailedAlert
      expr: increase(binlog_allocate_node_fail_count[1m]) > 0
      for: 10s
      labels:
        severity: critical
      annotations:
        summary: "Node allocation failure"
        description: "Failed to allocate a binlog service node to the {{ $labels.tenant_name }} tenant in the {{ $labels.ob_cluster_name }} cluster. Resolve the issue immediately."

Alert related to inconsistent GTIDs of OBI instances

    - alert: GtidInconsistentFailedAlert
      expr: increase(binlog_instance_gtid_inconsistent_count[1m]) > 0
      for: 10s
      labels:
        severity: critical
      annotations:
        summary: "Inconsistent GTIDs detected during inspection"
        description: "The GTIDs of OBI instances in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster are inconsistent."

Alert related to primary OBI instance switching

    rules:
      - alert: MasterSwitchAlert
        expr: increase(binlog_instance_master_switch_count[1m]) > 0
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "Primary OBI instance switching"
          description: "A primary OBI instance switching event occurred in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster."

Alert related to frequent primary OBI instance switching

    rules:
      - alert: MasterSwitchAlert
        expr: increase(binlog_instance_master_switch_count[1m]) > 2
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Frequent primary OBI instance switching"
          description: "Primary OBI instance switching occurred frequently in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster."

Alert related to primary OBI instance switching failures

    - alert: MasterSwitchFailedAlert
      expr: increase(binlog_instance_master_switch_failed_count[1m]) > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Primary OBI instance switching failure"
        description: "A primary OBI instance switching failure occurred in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster. Resolve the issue immediately."

Alert related to the absence of a primary OBI instance

    - alert: NoMasterAlert
      expr: increase(binlog_instance_no_master_count[1m]) > 0
      for: 10s
      labels:
        severity: critical
      annotations:
        summary: "Absence of a primary OBI instance"
        description: "No primary OBI instance exists in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster. Resolve the issue immediately."

Alert related to OBI instance failures

    - alert: InstanceDownAlert
      expr: changes(binlog_instance_down[15m]) > 0 or (binlog_instance_convert_delay == 0)
      for: 10s
      labels:
        severity: critical
      annotations:
        summary: "OBI instance failure"
        description: "The {{ $labels.instance_id }} instance failed."

Alert related to a failure to automatically start an OBI instance after a failover

    - alert: FailoverFailedAlert
      expr: increase(binlog_instance_failover_fail_count[1m]) > 0
      for: 10s
      labels:
        severity: critical
      annotations:
        summary: "Automatic instance startup failure after a failover"
        description: "Failed to automatically start the {{ $labels.instance_id }} instance in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster after a failover. Resolve the issue immediately."

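Because the frequent-switching rule (`> 2`) can only fire when the plain switching rule (`> 0`) also matches, both `MasterSwitchAlert` variants may notify at the same time. One possible way to reduce the duplicate is an Alertmanager inhibition rule, sketched below. It assumes that both alerts carry the `ob_cluster_name` and `tenant_name` labels that their descriptions reference.

```yaml
# alertmanager.yml (fragment)
inhibit_rules:
  - source_matchers:                  # when the critical variant is firing...
      - alertname="MasterSwitchAlert"
      - severity="critical"
    target_matchers:                  # ...mute the info variant
      - alertname="MasterSwitchAlert"
      - severity="info"
    # Only inhibit alerts for the same cluster and tenant (assumed labels).
    equal: ["ob_cluster_name", "tenant_name"]
```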
OBI instance performance
Monitoring metrics
| Metric | Description | Tag |
|---|---|---|
| binlog_instance_convert_checkpoint | The security checkpoint for binlog conversion by the OBI instance, in microseconds. | |
| binlog_instance_convert_delay | The delay in binlog conversion by the OBI instance, in milliseconds. | |
| binlog_instance_convert_fetch_rps | The RPS in pulling clogs by the OBI instance. | |
| binlog_instance_convert_iops | The IOPS in binlog conversion by the OBI instance, in bytes. | |
| binlog_instance_convert_storage_rps | The RPS in storing binlogs to the disk by the OBI instance. | |
| binlog_instance_dump_count | The number of subscriptions to the OBI instance. | |
| binlog_instance_dump_error_count | The number of exceptions in subscribing to the OBI instance. | |
| binlog_instance_dump_checkpoint | The security checkpoint in the subscription connection, in microseconds. | |
| binlog_instance_dump_rps | The RPS in the subscription connection. | |
| binlog_instance_dump_delay | The subscription delay in the subscription connection, in seconds. | |
| binlog_instance_dump_heartbeat_rps | The heartbeat RPS in the subscription connection. | |
| binlog_instance_dump_iops | The heartbeat IOPS in the subscription connection, in bytes. | |
Alert rules
Alert related to the binlog conversion delay

    - alert: ConversionDelayAlert
      expr: |
        (binlog_instance_convert_delay > 120000) and (binlog_instance_convert_fetch_rps < 5) and (binlog_instance_convert_storage_rps < 5)
        or (time() - binlog_instance_convert_checkpoint - binlog_instance_convert_delay) > 120000
      for: 10s
      labels:
        severity: critical
      annotations:
        summary: "Conversion delay exceeding the threshold"
        description: "The binlog conversion by the {{ $labels.instance_id }} instance in the {{ $labels.tenant_name }} tenant of the {{ $labels.ob_cluster_name }} cluster is delayed by {{ $value }} ms."

Alert related to the number of subscriptions

    - alert: InstanceDumpAlert
      expr: binlog_instance_dump_count > 100
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Binlog instance subscriptions exceeding the threshold"
        description: "The number of subscriptions to the {{ $labels.instance_id }} instance is {{ $value }}, which exceeds the threshold of 100."
    - alert: InstanceDumpResolved
      expr: binlog_instance_dump_count <= 100
      for: 1m
      labels:
        severity: normal
      annotations:
        summary: "Subscription threshold alert cleared"
        description: "The number of subscriptions to the {{ $labels.instance_id }} instance has fallen to 100 or below."
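Alertmanager can also send a notification on its own when a firing alert stops firing, which in some setups may complement or replace a dedicated recovery rule such as `InstanceDumpResolved`. The fragment below is a minimal sketch of a receiver with that behavior enabled. The receiver name and the URL are hypothetical placeholders.

```yaml
# alertmanager.yml (fragment)
receivers:
  - name: subscription-receiver       # hypothetical receiver name
    webhook_configs:
      - url: "http://alert-gateway.example.internal/binlog"   # placeholder URL
        send_resolved: true           # notify when the alert is resolved
```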