Alert description
During a restore, OCP obtains the capacity of the backup storage. If the network between OCP and the backup server is abnormal, OCP retries the operation. This alert is triggered if the operation takes too long or the thread is interrupted.
Alert principle
The following table describes the key parameters involved in the alert monitoring logic.
| Parameter | Value |
|---|---|
| Monitoring metric | storage_capacity_monitor_error_code |
| Data source | OCP-Server monitors the storage capacity during backup and restore. When an error occurs, the corresponding error code is recorded and assigned to the collection metric. |
| Collection metric | storage_capacity_monitor_error_code |
| Monitoring expression | max(storage_capacity_monitor_error_code{@LABELS}) by (@GBLABELS) |
| Collection interval | 1 second |
Note
The data source for this alert is special. For more information, see the Data source description in the preceding table.
The value of the monitoring metric is the error code collected when OCP-Server monitors the backup and restore storage capacity. This alert is triggered when the monitoring metric value is 1002.
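As a rough illustration of the trigger condition, the sketch below takes the largest sample in a label group (mirroring the `max(...) by (...)` expression) and compares it with 1002. The function names `max_of` and `should_alert` are hypothetical, and treating the comparison as a strict equality check is an assumption based on the note above; this is not OCP's own evaluation code.

```shell
# Hypothetical helpers; not part of OCP.
# 1002 is the trigger value stated in the note above.
TRIGGER_CODE=1002

# Return the largest integer among the arguments, like max(...) over one label group.
max_of() {
  local max="$1"
  shift
  local v
  for v in "$@"; do
    if [ "$v" -gt "$max" ]; then
      max="$v"
    fi
  done
  echo "$max"
}

# Assumption: the alert fires only when the aggregated value equals the trigger code.
should_alert() {
  if [ "$1" = "$TRIGGER_CODE" ]; then
    echo "alert"
  else
    echo "ok"
  fi
}

should_alert "$(max_of 0 1002 0)"   # prints: alert
```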
Rule Information
| Monitoring metric | Default threshold | Duration | Detection interval | Elimination interval |
|---|---|---|---|---|
| storage_capacity_monitor_error_code | NA | 0 seconds | 30 minutes | 40 minutes |
Alert Information
| Alert Trigger Method | Alert Level | Scope |
|---|---|---|
| OCP Reminder | Warning | Service |
Alert Template
Alert Overview
- Template: ${alarm_target} ${alarm_name}
- Example: storage_url=file:///obbackup/yc225_214/inc_data/ Backup restore directory capacity timeout or thread interrupted
Alert Details
- Template: Cluster: ${ob_cluster_name}, Alert: ${storage_url} ${alarm_name}. Capacity timeout or interrupted, error code: ${value}.
- Example: Cluster: obcluster-1, Alert: file:///obbackup/yc225_214/inc_data/ Backup restore directory capacity timeout or thread interrupted. Capacity timeout or interrupted, error code: 1002.
Clear Alert
- Template: Alert: ${alarm_name}, Storage capacity collection error code: ${value}
- Example: Alert: Backup restore directory capacity timeout or thread interrupted, Storage capacity collection error code: -
In these templates, ${alarm_target} has the format storage_url=file:///obbackup/yc225_214/inc_data/.
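To make the variable substitution concrete, the following sketch fills the overview and details templates with the values from the examples. The functions `render_overview` and `render_details` are hypothetical names used only for illustration; OCP renders these messages internally, and this is not its implementation.

```shell
# Sample values taken from the examples in this section.
alarm_name="Backup restore directory capacity timeout or thread interrupted"
storage_url="file:///obbackup/yc225_214/inc_data/"
alarm_target="storage_url=${storage_url}"
ob_cluster_name="obcluster-1"
value=1002

# Overview template: ${alarm_target} ${alarm_name}
render_overview() {
  printf '%s %s\n' "$alarm_target" "$alarm_name"
}

# Details template: Cluster: ${ob_cluster_name}, Alert: ${storage_url} ${alarm_name}. ...
render_details() {
  printf 'Cluster: %s, Alert: %s %s. Capacity timeout or interrupted, error code: %s.\n' \
    "$ob_cluster_name" "$storage_url" "$alarm_name" "$value"
}

render_overview
render_details
```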
Impact on the system
The storage capacity trend data on the backup restore page is unavailable, or its data points are more than 30 minutes apart.
Possible causes
This issue is commonly caused by one of the following reasons:
- Network anomalies between OCP and the backup server, or between the OceanBase cluster and the backup storage server.
- The backup storage server is unavailable or abnormal.
Solution
Check whether the network between the OCP console and the backup storage server is faulty.
- When the storage type is File or COS, check whether the network between the OceanBase cluster and the backup storage server is faulty. Run the following commands to check whether the network is faulty.

      # Ping the backup storage server from the OCP console server. xxx.xxx.xxx.1 is the IP address of the backup storage host.
      ping xxx.xxx.xxx.1
      # Ping the OCP console server from the backup storage server. xxx.xxx.xxx.2 is the IP address of the OCP host.
      ping xxx.xxx.xxx.2

- When the storage type is OSS, check whether the network between the OCP console and the backup storage server is faulty. Run the following commands to check whether the network is faulty.

      # Ping the backup storage server from the OCP console server. xxx.xxx.xxx.1 is the IP address of the backup storage host.
      ping xxx.xxx.xxx.1
      # Ping the OCP console server from the backup storage server. xxx.xxx.xxx.2 is the IP address of the OCP host.
      ping xxx.xxx.xxx.2

If the network is normal, ping keeps printing replies from the target host. In this case, the alert has another cause.
If the network is faulty, ping does not print continuous replies. Contact the network administrator to troubleshoot the network issue. If no network administrator is available, troubleshoot and repair the network issue by referring to Network troubleshooting.
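The ping checks in this step can also be wrapped in a small helper that sends a fixed number of probes and reports the result from the exit status, so the command does not run until interrupted. This is a minimal sketch: the function `check_host` and the `PING_CMD` override are hypothetical, and the `-c` and `-W` options assume the Linux iputils version of ping.

```shell
# PING_CMD can be overridden, for example to dry-run the helper; it defaults to the system ping.
PING_CMD="${PING_CMD:-ping}"

# Print "reachable" if the host answers, "unreachable" otherwise.
check_host() {
  local host="$1"
  # -c 4: send four probes and exit. -W 2: wait up to 2 seconds for each reply.
  if "$PING_CMD" -c 4 -W 2 "$host" >/dev/null 2>&1; then
    echo "reachable"
  else
    echo "unreachable"
  fi
}

# Usage (replace the placeholder with the real IP address of the backup storage host):
#   check_host xxx.xxx.xxx.1
```

Run the helper once from the OCP console server against the backup storage host, and once from the backup storage server against the OCP host, to cover both directions.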
Check whether the backup storage directory is accessible.
Log in to the backup storage server.
Run the following commands to check whether the admin user has the read permission on the backup storage directory.
    # /obbackup is the default backup storage directory.
    cd / && ls -l | grep obbackup
    # The following information is returned.
    drwxrwxrwx 16 root root 4096 Aug 10 14:34 obbackup

rwxrwxrwx indicates that all users can read, write, and execute the backup storage directory.
The two root values indicate the user and group to which the backup file storage directory belongs.
If the permission is not rwxrwxrwx, the admin user may not be able to access the backup storage directory. To learn more about Linux permissions, visit www.baidu.com.
- If the permission is insufficient, run the following command to modify the permission of the backup storage directory to ensure that it is accessible.
    chmod -R 777 /obbackup
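As a complement to reading the permission string by eye, the sketch below tests whether the user running it can actually read and enter the directory. The function name `check_dir_access` is hypothetical, and the script checks only the current user, so it is assumed to be run as the admin user.

```shell
# Report whether the current user can read and traverse a directory.
check_dir_access() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing"
  elif [ -r "$dir" ] && [ -x "$dir" ]; then
    # -r: directory entries can be listed. -x: the directory can be entered.
    echo "accessible"
  else
    echo "denied"
  fi
}

# Usage (run as the admin user on the backup storage server):
#   check_dir_access /obbackup
```

A result of "denied" suggests the permission is insufficient for that user; "missing" suggests the path does not exist or is not mounted on this host.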