Alert description
When OCP needs storage capacity information during backup and restore, it obtains the capacity from the backup server. If the network between OCP and the backup server is abnormal, the capacity cannot be obtained and OCP retries the request. This alert is triggered when the number of retries exceeds a specified threshold.
Alert principle
The following table describes the key parameters involved in the monitoring logic of this alert.
| Parameter | Value |
|---|---|
| Monitoring metric | storage_capacity_monitor_error_code |
| Source of the metric | OCP-Server monitors the storage capacity during backup restore. When an error occurs, the corresponding error code is recorded and assigned to the collection metric. |
| Collection metric | storage_capacity_monitor_error_code |
| Monitoring expression | max(storage_capacity_monitor_error_code{@LABELS}) by (@GBLABELS) |
| Collection interval | 1 second |
Note
The source of the metric is special. For more information, see the description of Source of the metric in the preceding table.
The value of the monitoring metric is the error code collected when OCP-Server monitors the storage capacity during backup restore. This alert is triggered when the monitoring metric value is 1001.
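The trigger condition described above can be sketched as a small shell function. This is only an illustration under stated assumptions, not OCP's implementation: the function name `alert_state` and the variable `ALERT_ERROR_CODE` are hypothetical, the input is assumed to be one numeric error-code sample per line, and the comparison is assumed to be an exact match against 1001 after taking the maximum, mirroring the `max(...)` monitoring expression.

```shell
#!/bin/sh
# Error code that triggers the alert, as stated in this document.
ALERT_ERROR_CODE=1001

# Hypothetical helper: reads one error-code sample per line on stdin,
# takes the maximum (like max(...) in the monitoring expression) and
# prints "firing" when that maximum equals ALERT_ERROR_CODE, else "ok".
alert_state() {
  awk -v code="$ALERT_ERROR_CODE" '
    NR == 1 || $1 + 0 > max { max = $1 + 0 }
    END { print ((NR > 0 && max == code + 0) ? "firing" : "ok") }
  '
}
```

Empty input is reported as "ok", because no sample means no recorded error code.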
Rule Information
| Monitoring metric | Default threshold | Duration | Detection interval | Elimination interval |
|---|---|---|---|---|
| storage_capacity_monitor_error_code | NA | 0 seconds | 30 minutes | 40 minutes |
Alert Information
| Alert Trigger Method | Alert Level | Scope |
|---|---|---|
| Triggered by Monitoring Expression | Service Unavailability | Service |
Alert Template
Alert Overview
- Template: ${alarm_target} ${alarm_name}
- Example: storage_url=file:///obbackup/yc225_214/ Backup and restore capacity exceeded retry limit
Alert Details
- Template: Cluster: ${ob_cluster_name}, Alert: ${storage_url} ${alarm_name}. The capacity retry limit has been exceeded, error code: ${value}.
- Example: Cluster: obcluster-1, Alert: file:///obbackup/yc225_214/ Backup and restore capacity exceeded retry limit, error code: 1001.
Clear Alert
- Template: Alert: ${alarm_name}, Storage capacity collection error code: ${value}
- Example: Alert: Backup and restore capacity exceeded retry limit, Storage capacity collection error code: -
In the preceding templates, ${alarm_target} is in the format of storage_url=xxxxxxx.
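To make the placeholder substitution concrete, the following sketch renders the overview and details templates with `printf`. It only illustrates how the variables map to the final text; the function names `render_overview` and `render_details` are hypothetical, and OCP's own rendering logic may differ.

```shell
#!/bin/sh
# Hypothetical helper for the overview template: ${alarm_target} ${alarm_name}
# $1: alarm_target (for example, storage_url=...), $2: alarm_name
render_overview() {
  printf '%s %s\n' "$1" "$2"
}

# Hypothetical helper for the details template.
# $1: ob_cluster_name, $2: storage_url, $3: alarm_name, $4: value (error code)
render_details() {
  printf 'Cluster: %s, Alert: %s %s. The capacity retry limit has been exceeded, error code: %s.\n' \
    "$1" "$2" "$3" "$4"
}

# Usage with the values from the examples in this section:
render_overview 'storage_url=file:///obbackup/yc225_214/' \
  'Backup and restore capacity exceeded retry limit'
```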
Impact on the system
The storage capacity trend data on the backup and restore page is unavailable or is displayed at intervals longer than 30 minutes.
Possible causes
Abnormal network connectivity between OCP and the backup storage server.
The backup storage server is unavailable or experiencing issues.
Backup files have expired and been removed by OBServer.
Solution
Check whether the network between OCP and the backup storage server is normal.
Run the following commands to check whether the network is normal.
# Ping the backup storage server from the OCP server. xxx.xxx.xxx.1 is the IP address of the backup storage host.
ping xxx.xxx.xxx.1
# Ping the OCP server from the backup storage server. xxx.xxx.xxx.2 is the IP address of the OCP host.
ping xxx.xxx.xxx.2
If the network is normal, you can see continuous data transmission information. In this case, the alert may be caused by other reasons.
If you do not see continuous data transmission information, the network is faulty. Contact the network administrator to troubleshoot the network issue. If no network administrator is available, perform troubleshooting and repair according to Network troubleshooting.
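The connectivity check above can also be scripted so that it ends on its own and reports a clear result. The sketch below assumes the Linux `ping` utility, where `-c` sets the packet count and `-W` the reply timeout in seconds. The function name `check_reachable` and the `PING_CMD` override variable are hypothetical and exist only so that the function can be exercised without a real network.

```shell
#!/bin/sh
# Hypothetical helper: sends three echo requests to the given host and
# reports whether it answered. PING_CMD is an illustrative override hook;
# when it is unset, the system ping command is used.
check_reachable() {
  if "${PING_CMD:-ping}" -c 3 -W 2 "$1" >/dev/null 2>&1; then
    echo "reachable: $1"
  else
    echo "unreachable: $1"
    return 1
  fi
}

# Usage (replace the placeholder with the IP address of the backup storage host):
# check_reachable xxx.xxx.xxx.1
```

Run it once on the OCP host against the backup storage host and once in the opposite direction, as in the manual steps.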
Check whether the backup storage directory is accessible.
Log in to the backup storage server.
Go to the parent directory of the backup storage directory and run the ll command to check whether the admin user has read permissions on the directory.
# /obbackup is the default backup storage directory.
cd / && ll | grep obbackup
# The following information is returned.
drwxrwxrwx 16 root root 4096 Aug 10 14:34 obbackup
rwxrwxrwx indicates that all users can read, write, and execute the directory.
The two root values indicate the user and group to which the backup file storage directory belongs.
If the permissions are not rwxrwxrwx, the admin user may not be able to access the directory. If you are unfamiliar with Linux file permissions, look them up (for example, through www.baidu.com) before you analyze the problem.
If the permissions are insufficient, you can run the following command to modify the permissions of the backup storage directory and ensure that it is accessible.
chmod -R 777 /obbackup
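Instead of reading the permission string by eye, you can test access directly as the user that needs it. The following sketch uses only the standard shell `test` operators. The function name `check_backup_dir` is hypothetical, and the result reflects the user who runs it, so it is assumed that you run it while logged in as the admin user.

```shell
#!/bin/sh
# Hypothetical helper: reports whether the current user can list ("r")
# and enter ("x") the given directory.
check_backup_dir() {
  if [ ! -d "$1" ]; then
    echo "missing: $1"
    return 1
  fi
  if [ -r "$1" ] && [ -x "$1" ]; then
    echo "accessible: $1"
  else
    echo "no access: $1"
    return 1
  fi
}

# Usage with the default backup storage directory from this document:
# check_backup_dir /obbackup
```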
Check whether the backup files have expired and been cleared by OBServer.
Log in to the OBServer node as the sys tenant.
obclient -hxxx.xxx.xxx.xxx -P2883 -uroot@sys#**** -p****** -Doceanbase
Query the backup information based on the cdb_ob_backup_set_files view.
select * from cdb_ob_backup_set_files where tenant_id = xx;
Based on the query result, if the file_status of the corresponding backup record is deleted, the backup file has been cleared.
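When the view returns many rows, a small filter can count the records whose file_status is deleted. The sketch below is an illustration under assumptions: the function name `count_deleted_backups` is hypothetical, and the input is assumed to be tab-separated with the column names on the first line, which is how MySQL-compatible clients usually print results in batch mode. It is also an assumption that obclient behaves this way when the query is passed with `-e`. The column is located by name, and the status is compared without regard to case.

```shell
#!/bin/sh
# Hypothetical helper: reads tab-separated query output on stdin (first line
# holds the column names) and prints how many rows have "deleted" in the
# file_status column. Prints 0 if that column is not present.
count_deleted_backups() {
  awk -F '\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) if (tolower($i) == "file_status") col = i
      next
    }
    col && tolower($col) == "deleted" { n++ }
    END { print n + 0 }
  '
}

# Usage (assumes batch-mode output; reuse the connection options and the
# query from the steps above):
# obclient <connection options> -e "<query>" | count_deleted_backups
```

A result greater than 0 means at least one backup record has been cleared.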