Alert description
Starting from OBProxy V4.3.3.0, read/write splitting based on tenant service names is supported. OBProxy requests the service name information of all tenants from OCP Server and uses it to implement read/write splitting between primary and standby tenants that share a service name. However, when a standby tenant becomes unavailable, OCP cannot detect this and still returns the standby tenant's service name to OBProxy. As a result, OBProxy continues to route SQL read requests to the standby tenant, and business read requests keep failing.
To handle this scenario, OCP needs to introduce a plan that provides the following capabilities:
- Detect that a standby tenant is unavailable.
- When a standby tenant becomes unavailable, remove its service name from the information returned to OBProxy so that OBProxy stops routing requests to that standby tenant.
The plan is triggered by an alert. This alert is designed for the scenario described above: it checks the connectivity of each standby tenant that has the plan configured, and it is triggered when the tenant is deemed unreachable.
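The second capability listed above can be pictured as a filter over the service name entries returned to OBProxy. The following is a minimal sketch, not OCP's implementation: the `ServiceNameEntry` type, its fields, and the `filter_service_names` function are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceNameEntry:
    """Hypothetical record for one tenant registered under a service name."""
    service_name: str
    tenant_name: str
    is_standby: bool
    connectable: bool  # result of the connectivity check (assumed field)


def filter_service_names(entries):
    """Drop standby tenants that failed the connectivity check.

    Primary tenants are always kept, because the scenario described here
    only concerns unavailable standby tenants.
    """
    return [e for e in entries if not (e.is_standby and not e.connectable)]


entries = [
    ServiceNameEntry("svc_read", "primary_tenant", False, True),
    ServiceNameEntry("svc_read", "standby_a", True, True),
    ServiceNameEntry("svc_read", "standby_b", True, False),
]
visible = filter_service_names(entries)
print([e.tenant_name for e in visible])
```

In this sketch, `standby_b` is left out of the returned list, so a consumer of the list would no longer route read requests to it.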
Alert principle
The prerequisites for this alert are as follows:
- The tenant is a standby tenant.
- The tenant has the Delete Failed Probe Tenant from Tenant Service Name Information preset configured.
- The service name of the tenant is in the VALID state.
The following table lists the key parameters involved in the monitoring logic of this alert.
| Parameter | Value |
|---|---|
| Monitoring Metrics | ob_tenant_any_unit_connectable: Indicates whether the OceanBase tenant is connectable. Valid values: 1 indicates that the tenant is connectable; 0 indicates that the tenant is not connectable and triggers an alert. |
| Monitoring Expression | max(ob_tenant_any_unit_connectable{@LABELS}) by (@GBLABELS) |
| Metric Collection | ob_tenant_any_unit_connectable |
| Data Source | OCP checks the status of a tenant's unit servers through scheduled tasks. If all unit server ports fail to connect, the tenant is deemed unreachable and ob_tenant_any_unit_connectable = 0 is recorded; otherwise, ob_tenant_any_unit_connectable = 1. |
| Collection Cycle | 5 Seconds |
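The data source rule in the table (0 only when every unit server port fails, 1 otherwise) can be sketched as follows. This is an illustration under stated assumptions, not OCP's actual probe: the use of a plain TCP connect, the function names `port_open` and `any_unit_connectable`, and the sample addresses and port are all hypothetical.

```python
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def any_unit_connectable(units, probe=port_open):
    """Return 1 if at least one unit server port is reachable, else 0.

    `units` is an iterable of (host, port) pairs. The probe is injectable
    so the rule can be exercised without touching the network.
    """
    return 1 if any(probe(host, port) for host, port in units) else 0


# Usage with a stub probe: only the first (made-up) unit server is reachable.
reachable = {("10.0.0.1", 2881)}
stub_probe = lambda host, port: (host, port) in reachable

units = [("10.0.0.1", 2881), ("10.0.0.2", 2881)]
print(any_unit_connectable(units, probe=stub_probe))
print(any_unit_connectable([("10.0.0.2", 2881)], probe=stub_probe))
```

Taking the maximum over per-unit results, as the monitoring expression does with `max(...)`, yields the same value: a single reachable unit is enough for the metric to be 1.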
Rule information
| Monitoring Metrics | Default Threshold | Duration | Detection Cycle | Elimination Cycle |
|---|---|---|---|---|
| ob_tenant_any_unit_connectable | 0 | 0 Seconds | 60 Seconds | 5 Minutes |
Alert information
| Alert Trigger Method | Alert Level | Scope |
|---|---|---|
| Based on monitoring metric expressions | Critical | OCP |
Alert template
Alert overview
- Template: ${alarm_target} ${alarm_name}
- Example: alarm_template_id=0:ob_cluster=ob4353-22:tenant_name=standby_tenant OceanBase tenant cannot be connected
Alert details
- Template: Cluster: ${ob_cluster_name}, Tenant: ${tenant_name}, Alert: ${alarm_name}. All UNIT node SQL ports of the tenant failed to connect.
- Example: Cluster: ob4353, Tenant: standby_tenant, Alert: OceanBase tenant cannot be connected. All UNIT node SQL ports of the tenant failed to connect.
Alert recovery
- Template: Alert: ${alarm_name}, OceanBase tenant UNIT connectivity: ${recover_value}
- Example: Alert: OceanBase tenant cannot be connected, OceanBase tenant UNIT connectivity: 1
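To see how the `${...}` placeholders in these templates expand, the sketch below renders the details and recovery templates with the example values. The document does not specify OCP's template engine; Python's `string.Template` is used here only because it happens to share the `${name}` syntax, and the variable values are taken from the examples above.

```python
from string import Template

# Template strings copied from the alert details and alert recovery templates.
DETAILS = Template(
    "Cluster: ${ob_cluster_name}, Tenant: ${tenant_name}, Alert: ${alarm_name}. "
    "All UNIT node SQL ports of the tenant failed to connect."
)
RECOVERY = Template(
    "Alert: ${alarm_name}, OceanBase tenant UNIT connectivity: ${recover_value}"
)

# Example values; recover_value is the metric value after recovery.
values = {
    "ob_cluster_name": "ob4353",
    "tenant_name": "standby_tenant",
    "alarm_name": "OceanBase tenant cannot be connected",
    "recover_value": 1,
}

details_text = DETAILS.substitute(values)
recovery_text = RECOVERY.substitute(values)
print(details_text)
print(recovery_text)
```

`substitute` raises `KeyError` if a placeholder has no value, which makes a missing variable visible instead of leaving a literal `${...}` in the message.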
Impact on the system
- Impact on OCP: OCP initiates a plan to remove the tenant from the service name list, making it invisible to OBProxy.
- Impact on OBProxy: OBProxy no longer sees this tenant and redirects its read traffic to other tenants with the same service name.
- Impact on applications: The tenant cannot be connected, so applications may fail to connect to it directly. Because OBProxy switches traffic, application read traffic is automatically redirected to another tenant with the same service name.
Possible causes
A data center failure causes all hosts of a tenant to go down.
Solution
- Troubleshoot the tenant failure and restore the tenant.
- After restoring the tenant, modify its service name in OCP to change its status back to VALID, making it visible to OBProxy.
