This topic describes how to locate and troubleshoot data backup failures during physical backup.
This topic uses /home/admin/oceanbase as an example of the installation directory of OceanBase Database. The log storage path involved in this topic varies based on the actual environment.
Issue 1: A data backup job fails to be initiated
If a data backup job fails to be initiated, perform the following steps to troubleshoot the issue:
Log on to the
systenant of the cluster as therootuser.Execute the following statement to identify the cause of the failure:
SELECT * FROM oceanbase.DBA_OB_ROOTSERVICE_EVENT_HISTORY WHERE module='backup_data' AND event ='start_backup_data';A sample query result is as follows:
+----------------------------+-------------+-------------------+-----------+--------+--------+--------+-------+--------+-------+--------+-------+--------+-------+--------+------------+----------------+-------------+ | TIMESTAMP | MODULE | EVENT | NAME1 | VALUE1 | NAME2 | VALUE2 | NAME3 | VALUE3 | NAME4 | VALUE4 | NAME5 | VALUE5 | NAME6 | VALUE6 | EXTRA_INFO | RS_SVR_IP | RS_SVR_PORT | +----------------------------+-------------+-------------------+-----------+--------+--------+--------+-------+--------+-------+--------+-------+--------+-------+--------+------------+----------------+-------------+ | 2023-08-17 15:56:02.023265 | backup_data | start_backup_data | tenant_id | 1002 | result | -9040 | | | | | | | | | | 11.xx.xx.xx | 4000 | | 2023-08-17 15:56:13.116237 | backup_data | start_backup_data | tenant_id | 1002 | result | -9040 | | | | | | | | | | 11.xx.xx.xx | 4000 | | 2023-08-17 15:56:38.050299 | backup_data | start_backup_data | tenant_id | 1002 | result | -9040 | | | | | | | | | | 11.xx.xx.xx | 4000 | | 2023-08-17 15:57:11.236784 | backup_data | start_backup_data | tenant_id | 1002 | result | -9040 | | | | | | | | | | 11.xx.xx.xx | 4000 | 4 rows in setPay attention to values in the following columns in the query result:
NAME1andVALUE1: respectively indicate thetenant_idfield and its value.NAME2andVALUE2: respectively indicate theresultfield and its value.RS_SVR_IP: the IP address of the server where RootService resides.
If the
VALUE2value of aresultin theNAME2column is not0, proceed to the next step to find the related error information.Log on to the server where RootService resides based on the
RS_SVR_IPvalue obtained in the previous step. Search therootservice.logfile to find the related error information.Log on to the server where RootService resides.
Go to the directory where logs are stored.
cd /home/admin/oceanbase/logRun the following command to search for log records that meet the
ret=\resultcondition and find the related error information.grep "ret=\result" rootservice.log* | grep "ob_backup_data_scheduler"Replace
resultwith theVALUE2value of theresultfield in the previous step. In the following sample command, it is assumed that theVALUE2value of theresultfield is-9040:grep "ret=\-9040" rootservice.log* | grep "ob_backup_data_scheduler"If the log information indicates any of the following causes, failure to initialize the backup job is expected:
data_backup_destis not configured to specify the data backup destination.For information about how to configure a backup destination, see Preparations.
Log archiving is disabled.
For information about how to enable log archiving, see Enable log archiving.
The tenant is being backed up.
The tenant is not in the Normal state.
If the failure to initialize the backup job is caused by other reasons, obtain the error information from the log file and contact OceanBase Technical Support for assistance.
Issue 2: Data backup is stuck at a specific state
If data backup is stuck at a specific state, perform the following steps to locate the issue:
Log on to the
systenant of the cluster as therootuser.Execute the following statement to confirm the state at which the data backup task is stuck:
SELECT * FROM oceanbase.CDB_OB_BACKUP_TASKS WHERE tenant_id = xxx;Pay attention to the value of the
STATUScolumn, which indicates the backup status. Generally, if data backup is stuck, a state such asBACKUP_METAorBACKUP_DATAis displayed.If the
STATUSvalue of the task isBACKUP_METAorBACKUP_DATA, proceed to the following step.Execute the following statement to obtain task execution information:
SELECT * FROM oceanbase.__all_virtual_backup_schedule_task WHERE tenant_id = xxx;A sample query result is as follows:
+-----------+--------+---------+-------+------+-----------------------------------+-------------------+-------------+------------------+------------------+------------------+ | tenant_id | job_id | task_id | ls_id | type | trace_id | destination | is_schedule | generate_ts | schedule_ts | executor_ts | +-----------+--------+---------+-------+------+-----------------------------------+-------------------+-------------+------------------+------------------+------------------+ | 1002 | 339 | 3 | 1001 | 0 | YB42C0A80C44-0005F637014C8DE6-0-0 | 192.xx.xx.xx:2882 | True | 1679378679020065 | 1679378679205546 | 1679378679302303 | +-----------+--------+---------+-------+------+-----------------------------------+-------------------+-------------+------------------+------------------+------------------+ 1 row in setPay attention to values of the
destinationandtrace_idcolumns in the query result.trace_id: the trace ID of the data backup task.``destination: the IP address and port number of the server that executes the data backup task.
Log on to the server indicated by the
destinationinformation and run the following command to search theobserver.logfile:grep 'YB42C0A80C44-0005F637014C8DE6-0-0' observer.logIn the preceding command,
YB42C0A80C44-0005F637014C8DE6-0-0is the value oftrace_idqueried in the previous step.Notice
If no related log record is found in the
observer.logfile by using thegrepcommand, a newobserver.logfile may have been generated. In this case, you can rungrep 'YB42C0A80C44-0005F637014C8DE6-0-0' observer.log.*.After you obtain the error information from the log file, contact OceanBase Technical Support for assistance.
Issue 3: The backup job failed
After you initiate a data backup job, you can query the CDB_OB_BACKUP_JOB_HISTORY or DBA_OB_BACKUP_JOB_HISTORY view for the job status. A sample query result is as follows. If the STATUS value of the backup job is FAILED, the backup job failed.
*************************** 1. row ***************************
TENANT_ID: 1004
JOB_ID: 8
INCARNATION: 1
BACKUP_SET_ID: 8
INITIATOR_TENANT_ID: 1
INITIATOR_JOB_ID: 10
EXECUTOR_TENANT_ID: 1004
PLUS_ARCHIVELOG: OFF
BACKUP_TYPE: INC
JOB_LEVEL: USER_TENANT
ENCRYPTION_MODE: NONE
PASSWD:
START_TIMESTAMP: 2023-04-27 12:41:27.451141
END_TIMESTAMP: 2022-04-27 12:42:24.352493
STATUS: FAILED
RESULT: 4016
COMMENT:(server)ls_id: 1002,addr: 11.xx.xx.xx:4004,module:BACKUP_DATA,result: -4016(Internal error),trace_id: YFA40BA2D90B-005FA487348F363-0-0
DESCRIPTION:
PATH:file:///data/nfs/backup/backup_1682565410121
1 row in set
The COMMENT column in the view displays the following information about the backup job:
(server): the type of the server that executed the current backup job.serverindicates that the current backup job was executed by an OBServer node.rootserviceindicates that the current backup job was executed by the server where RootService resides.addr: 11.xx.xx.xx:4004: the IP address and port number of the server that executed the current backup job.result: -4016(Internal error): the error code returned for the current backup job.trace_id: YFA40BA2D90B-005FA487348F363-0-0: the trace ID corresponding to the current backup job.``
You can search the observer.log file based on the information indicated in the COMMENT column to further locate the issue. The procedure is as follows:
Log on to the server corresponding to
addr: 11.xx.xx.xx:4004.Go to the directory where logs are stored.
cd /home/admin/oceanbase/logRun the following command to search the 'observer.log' file.
If the backup job was executed by an OBServer node, run the following command to search for log records generated near the point in time when the backup job failed.
grep 'YFA40BA2D90B-005FA487348F363-0-0' observer.logNotice
If no related log record is found in the
observer.logfile by using thegrepcommand, a newobserver.logfile may have been generated. In this case, you can rungrep 'YFA40BA2D90B-005FA487348F363-0-0' observer.log.*.If the backup job was executed by the server where RootService resides, run the following command to search for log records generated near the point in time when the backup job failed.
grep 'YFA40BA2D90B-005FA487348F363-0-0' rootservice.logNotice
If no related log record is found in the
rootservice.logfile by using thegrepcommand, a newrootservice.logfile may have been generated. In this case, you can rungrep 'YFA40BA2D90B-005FA487348F363-0-0' rootservice.log.*.
After you obtain the error information from the log file, contact OceanBase Technical Support for assistance.
Common error codes
-4012 and -9086
If the error code -4012 or -9086 is returned, view the error log and perform troubleshooting based on the troubleshooting method described in this topic. Assume that the following log record is found:
[2023-05-19 13:23:25.238790] WDIAG [STORAGE] advance_checkpoint_by_flush (ob_backup_task.cpp:141) [8999][T1004_BACKUP][T1004][Y509E645869DC-0005FC029426456E-0-0] [lt=26][errcode=0] clog checkpoint scn has not passed start scn, retry again(i=0, tenant_id=1004, ls_id={id:1}, clog_checkpoint_scn={val:1684465461355302830}, start_scn={val:1684465461355303000}, need_advance_checkpoint=true, advance_checkpoint_interval=2000000, cur_ts=1684473805238742, last_advance_checkpoint_ts=1684472010015535)
The error information shows that this error is usually related to slow advancement of checkpoints. You must verify whether freezes and minor compactions are performed properly. If the data backup is performed on a physical standby database, you must also verify whether log synchronization is stopped in the standby database.
-4554
The error code -4554 is returned when a timed-out task is detected by RootService.
A task may time out due to the following reasons:
The corresponding OBServer node cannot be accessed. In this case,
backup dest server is not existorserver status may not active or in servicecan be found in the error log.RootService does not receive the backup result from the OBServer node. In this case, the preceding log records do not exist in the error log. You can log on to the corresponding OBServer node based on the
trace_idvalue recorded in thels_taskfield of the log and search the log file on the OBServer node to identify the cause.