This topic describes how to locate and troubleshoot physical restore failures.
The physical restore feature is strongly dependent on the data backup and log archiving features. In other words, before you initiate a physical restore, make sure that at least one backup set is available and archive logs are continuous.
Notice
- At present, OceanBase Database allows you to restore backup data only to OceanBase Database of the same version or a later version. You cannot restore backup data from V3.x or V2.x to V4.x.
- Backup data in OceanBase Database of a version earlier than V4.1.0 cannot be restored to OceanBase Database V4.1.0. For example, backup data in OceanBase Database V4.0.x cannot be restored to OceanBase Database V4.1.0.
This topic uses /home/admin/oceanbase as an example of the installation directory of OceanBase Database. The log storage path involved in this topic varies based on the actual environment.
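Before you troubleshoot a failed restore, you can confirm the prerequisites from the `sys` tenant. The following queries are a sketch: the view names `CDB_OB_BACKUP_SET_FILES` and `CDB_OB_ARCHIVELOG` and the columns to inspect are assumptions based on the OceanBase Database V4.x system views, and `xxxx` is a placeholder for the source tenant ID:

```sql
-- Available backup sets for the source tenant (check STATUS and the restorable SCN range)
SELECT * FROM oceanbase.CDB_OB_BACKUP_SET_FILES WHERE tenant_id = xxxx;

-- Archive log progress for the source tenant (check STATUS, START_SCN, and CHECKPOINT_SCN)
SELECT * FROM oceanbase.CDB_OB_ARCHIVELOG WHERE tenant_id = xxxx;
```

If the first query returns no completed backup set, or the archive SCN range does not cover the intended restore end point, fix the backup or archiving setup before retrying the restore.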
Issue 1: The restore command fails to be executed
When you execute the `ALTER SYSTEM RESTORE` statement to initiate a physical restore and the statement fails, perform the following steps to locate and troubleshoot the issue:

1. Log on to the `sys` tenant of the cluster as the `root` user.

2. Execute the following statement to check the error code:

   ```sql
   SELECT * FROM oceanbase.DBA_OB_ROOTSERVICE_EVENT_HISTORY WHERE module='physical_restore';
   ```

   Pay attention to the following information in the query result:

   - The value of `result` indicates the error code.
   - The value of `RS_SVR_IP` indicates the IP address of the server where RootService resides.

3. Log on to the server where RootService resides based on the `RS_SVR_IP` value obtained in the previous step, and search the `rootservice.log` file for the related error information.

   1. Log on to the server where RootService resides.

   2. Go to the directory where logs are stored.

      ```shell
      cd /home/admin/oceanbase/log
      ```

   3. Run the following command to search the log records for the queried error code and find the related error information.

      Assume that the error code queried in the previous step is `-4016`. The sample command is as follows:

      ```shell
      grep "ob_restore_util" rootservice.log | grep "ret=\-4016"
      ```

      Notice: If no related log record is found in the `rootservice.log` file by using the `grep` command, a new `rootservice.log` file may have been generated. In this case, you can run `grep "ob_restore_util" rootservice.log.* | grep "ret=\-4016"`.

   A common error message for a failed restore command is `-4018 no enough log for restore`. This issue is usually caused by an incorrect restore end point specified when you execute the `ALTER SYSTEM RESTORE` statement. In other words, the specified restore end point may not be within the restorable window. Confirm that a backup set and archive logs are available before the restore end point. For more information about the restorable window, see the Constraint for specifying the timestamp and SCN section in Parameters related to physical restore.

4. After you obtain the error information from the log file, contact OceanBase Technical Support for assistance.
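The rotation-aware search in step 3 can be wrapped in a small helper. The sketch below is self-contained for illustration only: it fabricates two sample `rootservice.log` files in a temporary directory instead of reading `/home/admin/oceanbase/log`, and the log line format is simplified.

```shell
#!/bin/sh
# Demo fixture: fabricated log files standing in for the real
# /home/admin/oceanbase/log directory (line format is simplified).
dir=$(mktemp -d)
printf '[2024-01-01 10:00:00] WARN ob_restore_util fail(ret=-4016)\n' > "$dir/rootservice.log.20240101100500"
printf '[2024-01-01 10:06:00] INFO ob_restore_util unrelated(ret=0)\n' > "$dir/rootservice.log"

code="-4016"
# Search the current file first; fall back to rotated files if nothing matches,
# because a new rootservice.log may have been generated since the failure.
hits=$(grep "ob_restore_util" "$dir/rootservice.log" | grep "ret=$code")
if [ -z "$hits" ]; then
  hits=$(grep -h "ob_restore_util" "$dir"/rootservice.log.* | grep "ret=$code")
fi
echo "$hits"
rm -rf "$dir"
```

On a real RootService server you would point the two `grep` commands at `/home/admin/oceanbase/log` instead of the fixture directory.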
Issue 2: The restore of a tenant is stuck in the RESTORE_WAIT_LS state
After you execute the `ALTER SYSTEM RESTORE` statement to initiate a physical restore, you can query the `CDB_OB_RESTORE_PROGRESS` view for the status of the restore job.
If the restore job remains in the `RESTORING` state, perform the following steps to locate and troubleshoot the issue:
1. Log on to the `sys` tenant of the cluster as the `root` user.

2. Execute the following statement to query the status of the restore job:

   ```sql
   SELECT * FROM oceanbase.__all_virtual_restore_job WHERE name = 'status' AND tenant_id = xxxx;
   ```

   - If the `status` value in the query result is `RESTORE_WAIT_LS`, proceed to the next step to check the restore status of the log stream replicas of the tenant to be restored.
   - If the `status` value in the query result is not `RESTORE_WAIT_LS`, perform further troubleshooting by referring to the Issue 3: The schema of the tenant to be restored is not refreshed section in this topic.

3. Execute the following statement to check the restore status of the log stream replicas of the tenant to be restored:

   ```sql
   SELECT ls_id,svr_ip,svr_port,role,restore_status,zone FROM oceanbase.__all_virtual_ls_meta_table WHERE tenant_id = xxxx;
   ```

   A sample query result is as follows:

   ```
   +-------+--------------+----------+------+----------------+------+
   | ls_id | svr_ip       | svr_port | role | restore_status | zone |
   +-------+--------------+----------+------+----------------+------+
   |     1 | 100.xx.xx.xx |     5003 |    1 |              0 | z1   |
   |  1001 | 100.xx.xx.xx |     5003 |    1 |              0 | z1   |
   |  1002 | 100.xx.xx.xx |     5003 |    1 |              6 | z1   |
   +-------+--------------+----------+------+----------------+------+
   3 rows in set
   ```

   Pay attention to the following values in the query result:

   - `role`: the role of the replica. The value `1` indicates the leader, which restores data from the external media. The value `2` indicates a follower, which pulls data from the leader.
   - `restore_status`: the restore status of the log stream replica. The value `0` indicates that the log stream replica is properly restored.

   If the `restore_status` value of a log stream replica is `6` or `8`, the log stream is being used for log restore or for the restore of minor-compacted data. In this case, check whether the issue is caused by log restore.

   Notice: Since OceanBase Database V4.1.0, logs are restored in a synchronized manner across log streams. If any log stream is stuck in a state earlier than `6` or `8`, all other log streams will be stuck in the `6` state.

4. Execute the following statement to check whether logs have been restored from the archive directory to the target tenant:

   ```sql
   SELECT count(1) FROM oceanbase.GV$OB_LOG_STAT WHERE tenant_id = xxxx AND end_scn < (SELECT recovery_until_scn FROM oceanbase.__all_virtual_tenant_info WHERE tenant_id = xxxx);
   ```

   Parameters in the statement are described as follows:

   - `end_scn`: the maximum consumption checkpoint.
   - `recovery_until_scn`: the end checkpoint for the restore of the tenant.

   If the returned count is not `0`, some logs have not been restored from the archive directory to the target tenant. In this case, continue to observe the `end_scn` value in the `GV$OB_LOG_STAT` view. If the `end_scn` value still advances, keep waiting. If the `end_scn` value stops advancing, contact OceanBase Technical Support for assistance.

   Note: The process of restoring logs from the archive directory to the target tenant may take a long time. The time taken depends on factors such as the log volume, the read performance of the archive media, and the workload of OceanBase Database.

   If the returned count is `0`, the `end_scn` value of every log stream is greater than or equal to the `recovery_until_scn` value, indicating that logs are successfully restored from the archive directory to the target tenant. In this case, you must further verify whether log replay is completed.

5. Verify whether log replay is completed.

   1. Query the `__all_virtual_replay_stat` virtual table for pending log replay tasks:

      ```sql
      SELECT * FROM oceanbase.__all_virtual_replay_stat WHERE tenant_id = xxxx;
      ```

      Pay attention to the values in the following columns in the query result:

      - `pending_cnt`: the number of pending log replay tasks. If the value is not `0`, pending log replay tasks exist.
      - `unsubmitted_log_scn`: the SCN of the next log entry waiting to be submitted to the replay engine. If log replay is completed, the value is the SCN of the last replayed log entry plus 1.

   2. Query the `__all_virtual_tenant_info` virtual table for the `recovery_until_scn` value:

      ```sql
      SELECT recovery_until_scn FROM oceanbase.__all_virtual_tenant_info WHERE tenant_id = xxxx;
      ```

   If the `pending_cnt` value is `0` and the `unsubmitted_log_scn` value is greater than the `recovery_until_scn` value, log replay is completed. In this case, you must further verify whether log stream 1 has been restored.

   If either of the foregoing conditions is not met, log replay is not completed. Continue to observe whether the `unsubmitted_log_scn` value advances. If the `unsubmitted_log_scn` value does not advance after a long period of time, contact OceanBase Technical Support for assistance.

6. Verify whether log stream 1 has been restored.

   ```sql
   SELECT * FROM oceanbase.__all_ls_recovery_stat WHERE tenant_id = xxxx;
   ```

   Compare the `sync_scn` and `recovery_until_scn` values in the query result. If the `sync_scn` value is equal to the `recovery_until_scn` value, log stream 1 has been restored. If the two values are different, log stream 1 has not been restored. In this case, contact OceanBase Technical Support for assistance.

7. If the `restore_status` value is still `6` or `8` after the entire log restore process is completed, contact OceanBase Technical Support for assistance.
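To observe whether `end_scn` advances, you can narrow the query from step 4 and rerun it a few times. This is a sketch that assumes the `ls_id`, `role`, and `end_scn` columns of the V4.x `GV$OB_LOG_STAT` view:

```sql
SELECT ls_id, role, end_scn
FROM oceanbase.GV$OB_LOG_STAT
WHERE tenant_id = xxxx
ORDER BY ls_id;
```

If the `end_scn` values grow between runs, log restore is still making progress and you can keep waiting.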
Issue 3: The schema of the tenant to be restored is not refreshed
After you execute the `ALTER SYSTEM RESTORE` statement to initiate a physical restore, you can query the `__all_virtual_restore_job` virtual table to check whether residual records of the tenant to be restored exist, and check whether the restore of the tenant is stuck in a state other than `RESTORE_WAIT_LS` by referring to the Issue 2: The restore of a tenant is stuck in the RESTORE_WAIT_LS state section in this topic. If so, the schema of the tenant to be restored is not refreshed, and the system fails to change the tenant status to Normal.
Perform the following steps to locate and troubleshoot the issue:
1. Log on to the `sys` tenant of the cluster as the `root` user.

2. Query the `GV$OB_SERVER_SCHEMA_INFO` view for the schema refresh progress of the tenant:

   ```sql
   SELECT * FROM oceanbase.GV$OB_SERVER_SCHEMA_INFO WHERE tenant_id=xxxx;
   ```

   A sample query result is as follows:

   ```
   +----------------+----------+-----------+--------------------------+-------------------------+--------------+-------------+----------------------------+
   | SVR_IP         | SVR_PORT | TENANT_ID | REFRESHED_SCHEMA_VERSION | RECEIVED_SCHEMA_VERSION | SCHEMA_COUNT | SCHEMA_SIZE | MIN_SSTABLE_SCHEMA_VERSION |
   +----------------+----------+-----------+--------------------------+-------------------------+--------------+-------------+----------------------------+
   | xx.xx.xx.5     |     4000 |      1002 |                        1 |                       1 |            4 |        1086 |                         -1 |
   | xx.xx.xx.9     |     4002 |      1002 |                        1 |                       1 |            4 |        1086 |                         -1 |
   | xx.xx.xx.9     |     4001 |      1002 |                        1 |                       1 |            4 |        1086 |                         -1 |
   | xx.xx.xx.5     |     4005 |      1002 |                        1 |                       1 |            4 |        1086 |                         -1 |
   | xx.xx.xx.11    |     4004 |      1002 |                        1 |                       1 |            4 |        1086 |                         -1 |
   | xx.xx.xx.11    |     4003 |      1002 |                        1 |                       1 |            4 |        1086 |                         -1 |
   +----------------+----------+-----------+--------------------------+-------------------------+--------------+-------------+----------------------------+
   6 rows in set
   ```

   The schema refresh is considered successful when all of the following conditions are met:

   - `REFRESHED_SCHEMA_VERSION` = `RECEIVED_SCHEMA_VERSION`.
   - The value of `REFRESHED_SCHEMA_VERSION` is greater than 8.
   - The value of `REFRESHED_SCHEMA_VERSION` is divisible by 8.

   If any of the preceding conditions is not met, the schema is not refreshed and you must search the `observer.log` file to continue with the troubleshooting.

3. Log on to any server for which the schema is not refreshed based on the information obtained from the view in the previous step. Then, search the `observer.log` file.

   1. Go to the directory where logs are stored.

      ```shell
      cd /home/admin/oceanbase/log
      ```

   2. Search the logs for related information.

      Search the `observer.log` file for the thread name. Schema refresh is performed by background threads and the system keeps retrying, so you need to view only the latest log records.

      ```shell
      grep "SerScheQueue" observer.log
      ```

      Here are some sample log records:

      ```
      observer.log.20220811114045:[2022-08-11 11:39:54.382533] WARN [RPC.OBRPC] rpc_call (ob_rpc_proxy.ipp:361) [192069][SerScheQueue0][T0][YFA00BA2D905-0005E5DEE6A2294E-0-0] [lt=8] execute rpc fail(ret=-4012, dst="11.xx.xx.9:4001")
      observer.log.20220811114045:[2022-08-11 11:39:54.382552] WARN log_user_error_and_warn (ob_rpc_proxy.cpp:315) [192069][SerScheQueue0][T0][YFA00BA2D905-0005E5DEE6A2294E-0-0] [lt=20]
      ```

      Find the corresponding trace information and the server that failed to execute the job. In this example, the trace ID is `YFA00BA2D905-0005E5DEE6A2294E-0-0`, and the IP address of the server that failed to execute the job is `xx.xx.xx.9`.

4. Log on to the server that failed to execute the job, go to the directory where logs are stored, and then search the `observer.log` file for the obtained trace information to confirm the error information.

   ```shell
   grep "YFA00BA2D905-0005E5DEE6A2294E-0-0" observer.log
   ```

   Notice: If no related log record is found in the `observer.log` file by using the `grep` command, a new `observer.log` file may have been generated. In this case, you can run `grep "YFA00BA2D905-0005E5DEE6A2294E-0-0" observer.log.*`.

5. After you obtain the error information from the log file, contact OceanBase Technical Support for assistance.
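Pulling the trace ID and the failing server address out of a WARN line like the one above can be scripted. This is a sketch against a fabricated copy of the sample line; the `sed` patterns assume the bracketed-field layout shown in the sample log records:

```shell
#!/bin/sh
# Fabricated sample line copied from the observer.log output above.
line='[2022-08-11 11:39:54.382533] WARN [RPC.OBRPC] rpc_call (ob_rpc_proxy.ipp:361) [192069][SerScheQueue0][T0][YFA00BA2D905-0005E5DEE6A2294E-0-0] [lt=8] execute rpc fail(ret=-4012, dst="11.xx.xx.9:4001")'

# The trace ID is the last bracketed field before "[lt=...]".
trace=$(printf '%s\n' "$line" | sed -n 's/.*\]\[\([A-Z0-9-]*\)\] \[lt=.*/\1/p')
# The failing server is the dst="..." address in the rpc error.
dst=$(printf '%s\n' "$line" | sed -n 's/.*dst="\([^"]*\)".*/\1/p')

echo "trace_id: $trace"
echo "failing server: $dst"
```

In practice you would feed real `grep "SerScheQueue" observer.log` output into the same `sed` expressions instead of a hard-coded line.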
Issue 4: The status of the restore job is FAILED in the view
After you execute the `ALTER SYSTEM RESTORE` statement to initiate a physical restore, you can query the `CDB_OB_RESTORE_HISTORY` view for the status of the restore job.
If the status of the restore job is displayed as `FAILED`, perform the following steps to locate and troubleshoot the issue:
1. Log on to the `sys` tenant of the cluster as the `root` user.

2. Query the `CDB_OB_RESTORE_HISTORY` view for information in the `comment` column:

   ```sql
   SELECT * FROM oceanbase.CDB_OB_RESTORE_HISTORY WHERE TENANT_ID=xxxx;
   ```

   The `comment` column displays information about the restore job, including the IP address of the OBServer node, the ID of the log stream, the type of the faulty module, and the corresponding `trace_id` value.

   For more information about the `CDB_OB_RESTORE_HISTORY` view, see CDB_OB_RESTORE_HISTORY.

3. Search for the corresponding log records based on the obtained error code and `trace_id` value.

   1. Log on to the server indicated in the `comment` column.

   2. Go to the directory where logs are stored.

      ```shell
      cd /home/admin/oceanbase/log
      ```

   3. Run the following command to search for log records generated near the point in time when the restore job failed.

      - If the job was executed by an OBServer node, which is indicated by `(server)` in the `comment` column, run the following command:

        ```shell
        grep "trace_id" observer.log | grep "WARN\|ERROR"
        ```

        Replace `trace_id` in the command with the `trace_id` value in the `comment` column.

        Notice: If no related log record is found in the `observer.log` file by using the `grep` command, a new `observer.log` file may have been generated. In this case, you can run `grep "trace_id" observer.log.* | grep "WARN\|ERROR"`.

      - If the job was executed by the server where RootService resides, which is indicated by `(rootservice)` in the `comment` column, run the following command:

        ```shell
        grep "ob_restore_scheduler" rootservice.log | grep "WARN\|ERROR"
        ```

        Notice: If no related log record is found in the `rootservice.log` file by using the `grep` command, a new `rootservice.log` file may have been generated. In this case, you can run `grep "ob_restore_scheduler" rootservice.log.* | grep "WARN\|ERROR"`.

4. After you obtain the error information from the log file, contact OceanBase Technical Support for assistance.
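The branch in step 3 (search `observer.log` for `(server)`, `rootservice.log` for `(rootservice)`) can be sketched as a small dispatcher. The `comment` value below is hypothetical: the real layout of the `comment` column in `CDB_OB_RESTORE_HISTORY` may differ, so treat the string and the `trace_id:` field as assumptions.

```shell
#!/bin/sh
# Hypothetical comment value; the real layout of the comment column
# in CDB_OB_RESTORE_HISTORY may differ.
comment='ls_id: 1001, (server) addr: "xx.xx.xx.5:4000", trace_id: YFA00BA2D905-0005E5DEE6A2294E-0-0'

case "$comment" in
  *"(server)"*)
    logfile="observer.log"
    # Use the trace ID from the comment as the search pattern.
    pattern=$(expr "$comment" : '.*trace_id: \(.*\)')
    ;;
  *"(rootservice)"*)
    logfile="rootservice.log"
    pattern="ob_restore_scheduler"
    ;;
esac

# Print the grep command you would run on the indicated server.
echo "grep \"$pattern\" $logfile | grep \"WARN\\|ERROR\""
```

The printed command matches the manual steps above; run it in `/home/admin/oceanbase/log` on the server named in the `comment` column.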