Data backup failures|V4.3.0| docs|Distributed Database

Data backup failures

Last Updated：2024-04-19 08:42:49 Updated

This topic describes how to locate and troubleshoot data backup failures during physical backup.

This topic uses /home/admin/oceanbase as an example of the installation directory of OceanBase Database. The log storage path involved in this topic varies based on the actual environment.

Issue 1: A data backup job fails to be initiated

If a data backup job fails to be initiated, perform the following steps to troubleshoot the issue:

Log on to the sys tenant of the cluster as the root user.

Execute the following statement to identify the cause of the failure:

SELECT * FROM oceanbase.DBA_OB_ROOTSERVICE_EVENT_HISTORY WHERE module='backup_data' AND event ='start_backup_data';

A sample query result is as follows:

+----------------------------+-------------+-------------------+-----------+--------+--------+--------+-------+--------+-------+--------+-------+--------+-------+--------+------------+----------------+-------------+
| TIMESTAMP                  | MODULE      | EVENT             | NAME1     | VALUE1 | NAME2  | VALUE2 | NAME3 | VALUE3 | NAME4 | VALUE4 | NAME5 | VALUE5 | NAME6 | VALUE6 | EXTRA_INFO | RS_SVR_IP      | RS_SVR_PORT |
+----------------------------+-------------+-------------------+-----------+--------+--------+--------+-------+--------+-------+--------+-------+--------+-------+--------+------------+----------------+-------------+
| 2023-08-17 15:56:02.023265 | backup_data | start_backup_data | tenant_id | 1002   | result | -9040  |       |        |       |        |       |        |       |        |            | 11.xx.xx.xx    |        4000 |
| 2023-08-17 15:56:13.116237 | backup_data | start_backup_data | tenant_id | 1002   | result | -9040  |       |        |       |        |       |        |       |        |            | 11.xx.xx.xx    |        4000 |
| 2023-08-17 15:56:38.050299 | backup_data | start_backup_data | tenant_id | 1002   | result | -9040  |       |        |       |        |       |        |       |        |            | 11.xx.xx.xx    |        4000 |
| 2023-08-17 15:57:11.236784 | backup_data | start_backup_data | tenant_id | 1002   | result | -9040  |       |        |       |        |       |        |       |        |            | 11.xx.xx.xx    |        4000 |
4 rows in set

Pay attention to values in the following columns in the query result:

NAME1 and VALUE1: respectively indicate the tenant_id field and its value.
NAME2 and VALUE2: respectively indicate the result field and its value.
RS_SVR_IP: the IP address of the server where RootService resides.

If the VALUE2 value of a result in the NAME2 column is not 0, proceed to the next step to find the related error information.

Log on to the server where RootService resides based on the RS_SVR_IP value obtained in the previous step. Search the rootservice.log file to find the related error information.
1. Log on to the server where RootService resides.
2. Go to the directory where logs are stored.
```
cd /home/admin/oceanbase/log
```
3. Run the following command to search for log records that meet the ret=\result condition and find the related error information.
```
grep "ret=\result" rootservice.log* | grep "ob_backup_data_scheduler"  
```
  Replace result with the VALUE2 value of the result field in the previous step. In the following sample command, it is assumed that the VALUE2 value of the result field is -9040:
```
grep "ret=\-9040" rootservice.log* | grep "ob_backup_data_scheduler"  
```
  If the log information indicates any of the following causes, failure to initialize the backup job is expected:
  - data_backup_dest is not configured to specify the data backup destination.
    
    For information about how to configure a backup destination, see Preparations.
  - Log archiving is disabled.
    
    For information about how to enable log archiving, see Enable log archiving.
  - The tenant is being backed up.
  - The tenant is not in the Normal state.
  If the failure to initialize the backup job is caused by other reasons, obtain the error information from the log file and contact OceanBase Technical Support for assistance.

Issue 2: Data backup is stuck at a specific state

If data backup is stuck at a specific state, perform the following steps to locate the issue:

Log on to the sys tenant of the cluster as the root user.
Execute the following statement to confirm the state at which the data backup task is stuck:
```
SELECT * FROM oceanbase.CDB_OB_BACKUP_TASKS WHERE tenant_id = xxx;
```
Pay attention to the value of the STATUS column, which indicates the backup status. Generally, if data backup is stuck, a state such as BACKUP_META or BACKUP_DATA is displayed.

If the STATUS value of the task is BACKUP_META or BACKUP_DATA, proceed to the following step.

Execute the following statement to obtain task execution information:

SELECT * FROM oceanbase.__all_virtual_backup_schedule_task WHERE tenant_id = xxx;

A sample query result is as follows:

+-----------+--------+---------+-------+------+-----------------------------------+-------------------+-------------+------------------+------------------+------------------+
| tenant_id | job_id | task_id | ls_id | type | trace_id                          | destination       | is_schedule | generate_ts      | schedule_ts      | executor_ts      |
+-----------+--------+---------+-------+------+-----------------------------------+-------------------+-------------+------------------+------------------+------------------+
|      1002 |    339 |       3 |  1001 |    0 | YB42C0A80C44-0005F637014C8DE6-0-0 | 192.xx.xx.xx:2882 |      True   | 1679378679020065 | 1679378679205546 | 1679378679302303 |
+-----------+--------+---------+-------+------+-----------------------------------+-------------------+-------------+------------------+------------------+------------------+
1 row in set

Pay attention to values of the destination and trace_id columns in the query result.

trace_id: the trace ID of the data backup task.``
destination: the IP address and port number of the server that executes the data backup task.

Log on to the server indicated by the destination information and run the following command to search the observer.log file:
```
grep 'YB42C0A80C44-0005F637014C8DE6-0-0' observer.log
```
In the preceding command, YB42C0A80C44-0005F637014C8DE6-0-0 is the value of trace_id queried in the previous step.

Notice

If no related log record is found in the observer.log file by using the grep command, a new observer.log file may have been generated. In this case, you can run grep 'YB42C0A80C44-0005F637014C8DE6-0-0' observer.log.*.
After you obtain the error information from the log file, contact OceanBase Technical Support for assistance.

Issue 3: The backup job failed

After you initiate a data backup job, you can query the CDB_OB_BACKUP_JOB_HISTORY or DBA_OB_BACKUP_JOB_HISTORY view for the job status. A sample query result is as follows. If the STATUS value of the backup job is FAILED, the backup job failed.

*************************** 1. row ***************************
              TENANT_ID: 1004
                 JOB_ID: 8
            INCARNATION: 1
          BACKUP_SET_ID: 8
    INITIATOR_TENANT_ID: 1
       INITIATOR_JOB_ID: 10
     EXECUTOR_TENANT_ID: 1004
        PLUS_ARCHIVELOG: OFF
            BACKUP_TYPE: INC
              JOB_LEVEL: USER_TENANT
        ENCRYPTION_MODE: NONE
                 PASSWD:
        START_TIMESTAMP: 2023-04-27 12:41:27.451141
          END_TIMESTAMP: 2022-04-27 12:42:24.352493
                 STATUS: FAILED
                 RESULT: 4016
                COMMENT:(server)ls_id: 1002,addr: 11.xx.xx.xx:4004,module:BACKUP_DATA,result: -4016(Internal error),trace_id: YFA40BA2D90B-005FA487348F363-0-0
            DESCRIPTION:
                   PATH:file:///data/nfs/backup/backup_1682565410121
1 row in set

The COMMENT column in the view displays the following information about the backup job:

(server): the type of the server that executed the current backup job. server indicates that the current backup job was executed by an OBServer node. rootservice indicates that the current backup job was executed by the server where RootService resides.
addr: 11.xx.xx.xx:4004: the IP address and port number of the server that executed the current backup job.
result: -4016(Internal error): the error code returned for the current backup job.
trace_id: YFA40BA2D90B-005FA487348F363-0-0: the trace ID corresponding to the current backup job.``

You can search the observer.log file based on the information indicated in the COMMENT column to further locate the issue. The procedure is as follows:

Log on to the server corresponding to addr: 11.xx.xx.xx:4004.
Go to the directory where logs are stored.
```
cd /home/admin/oceanbase/log
```
Run the following command to search the 'observer.log' file.
- If the backup job was executed by an OBServer node, run the following command to search for log records generated near the point in time when the backup job failed.
```
grep 'YFA40BA2D90B-005FA487348F363-0-0' observer.log
```
  Notice
  
  If no related log record is found in the observer.log file by using the grep command, a new observer.log file may have been generated. In this case, you can run grep 'YFA40BA2D90B-005FA487348F363-0-0' observer.log.*.
- If the backup job was executed by the server where RootService resides, run the following command to search for log records generated near the point in time when the backup job failed.
```
grep 'YFA40BA2D90B-005FA487348F363-0-0' rootservice.log
```
  Notice
  
  If no related log record is found in the rootservice.log file by using the grep command, a new rootservice.log file may have been generated. In this case, you can run grep 'YFA40BA2D90B-005FA487348F363-0-0' rootservice.log.*.
After you obtain the error information from the log file, contact OceanBase Technical Support for assistance.

Common error codes

`-4012` and `-9086`

If the error code -4012 or -9086 is returned, view the error log and perform troubleshooting based on the troubleshooting method described in this topic. Assume that the following log record is found:

[2023-05-19 13:23:25.238790] WDIAG [STORAGE] advance_checkpoint_by_flush (ob_backup_task.cpp:141) [8999][T1004_BACKUP][T1004][Y509E645869DC-0005FC029426456E-0-0] [lt=26][errcode=0] clog checkpoint scn has not passed start scn, retry again(i=0, tenant_id=1004, ls_id={id:1}, clog_checkpoint_scn={val:1684465461355302830}, start_scn={val:1684465461355303000}, need_advance_checkpoint=true, advance_checkpoint_interval=2000000, cur_ts=1684473805238742, last_advance_checkpoint_ts=1684472010015535)

The error information shows that this error is usually related to slow advancement of checkpoints. You must verify whether freezes and minor compactions are performed properly. If the data backup is performed on a physical standby database, you must also verify whether log synchronization is stopped in the standby database.

`-4554`

The error code -4554 is returned when a timed-out task is detected by RootService.

A task may time out due to the following reasons:

The corresponding OBServer node cannot be accessed. In this case, backup dest server is not exist or server status may not active or in service can be found in the error log.
RootService does not receive the backup result from the OBServer node. In this case, the preceding log records do not exist in the error log. You can log on to the corresponding OBServer node based on the trace_id value recorded in the ls_task field of the log and search the log file on the OBServer node to identify the cause.

Data backup failures

Issue 1: A data backup job fails to be initiated

Issue 2: Data backup is stuck at a specific state

Notice

Issue 3: The backup job failed

Notice

Notice

Common error codes

-4012 and -9086

-4554

`-4012` and `-9086`

`-4554`