This topic describes how to locate and troubleshoot physical restore failures.
The physical restore feature is strongly dependent on the data backup and log archiving features. In other words, before you initiate a physical restore, make sure that at least one backup set is available and archive logs are continuous.
Notice
- At present, OceanBase Database allows you to restore backup data only to OceanBase Database of the same version or a later version. You cannot restore backup data from V3.x or V2.x to V4.x.
- Backup data in OceanBase Database of a version earlier than V4.1.0 cannot be restored to OceanBase Database V4.1.0. For example, backup data in OceanBase Database V4.0.x cannot be restored to OceanBase Database V4.1.0.
This topic uses /home/admin/oceanbase as an example of the installation directory of OceanBase Database. The log storage path involved in this topic varies based on the actual environment.
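Before you troubleshoot a failed restore, you can confirm the prerequisites from the `sys` tenant. The following queries are a sketch: the view names `CDB_OB_BACKUP_SET_FILES` and `CDB_OB_ARCHIVELOG` and the columns to inspect are assumptions based on the OceanBase Database V4.x system views, and `xxxx` is a placeholder for the source tenant ID:

```sql
-- Available backup sets for the source tenant (check STATUS and the restorable SCN range)
SELECT * FROM oceanbase.CDB_OB_BACKUP_SET_FILES WHERE tenant_id = xxxx;

-- Archive log progress for the source tenant (check STATUS, START_SCN, and CHECKPOINT_SCN)
SELECT * FROM oceanbase.CDB_OB_ARCHIVELOG WHERE tenant_id = xxxx;
```

If the first query returns no completed backup set, or the archive SCN range does not cover the intended restore end point, fix the backup or archiving setup before retrying the restore.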
Issue 1: The restore command fails to be executed
When you execute the `ALTER SYSTEM RESTORE` statement to initiate a physical restore and the statement fails, perform the following steps to locate and troubleshoot the issue:

1. Log on to the `sys` tenant of the cluster as the `root` user.

2. Execute the following statement to check the error code:

   ```sql
   SELECT * FROM oceanbase.DBA_OB_ROOTSERVICE_EVENT_HISTORY WHERE module='physical_restore';
   ```

   Pay attention to the following information in the query result:

   - The value of `result` indicates the error code.
   - The value of `RS_SVR_IP` indicates the IP address of the server where RootService resides.

3. Log on to the server where RootService resides based on the `RS_SVR_IP` value obtained in the previous step, and search the `rootservice.log` file for the related error information.

   1. Log on to the server where RootService resides.

   2. Go to the directory where logs are stored.

      ```shell
      cd /home/admin/oceanbase/log
      ```

   3. Run the following command to search the log records for the queried error code and find the related error information.

      Assume that the error code queried in the previous step is `-4016`. The sample command is as follows:

      ```shell
      grep "ob_restore_util" rootservice.log | grep "ret=\-4016"
      ```

      Notice: If no related log record is found in the `rootservice.log` file by using the `grep` command, a new `rootservice.log` file may have been generated. In this case, you can run `grep "ob_restore_util" rootservice.log.* | grep "ret=\-4016"`.

   A common error message for a failed restore command is `-4018 no enough log for restore`. This issue is usually caused by an incorrect restore end point specified when you execute the `ALTER SYSTEM RESTORE` statement. In other words, the specified restore end point may not be within the restorable window. Confirm that a backup set and archive logs are available before the restore end point. For more information about the restorable window, see the Constraint for specifying the timestamp and SCN section in Parameters related to physical restore.

4. After you obtain the error information from the log file, contact OceanBase Technical Support for assistance.
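The rotation-aware search in step 3 can be wrapped in a small helper. The sketch below is self-contained for illustration only: it fabricates two sample `rootservice.log` files in a temporary directory instead of reading `/home/admin/oceanbase/log`, and the log line format is simplified.

```shell
#!/bin/sh
# Demo fixture: fabricated log files standing in for the real
# /home/admin/oceanbase/log directory (line format is simplified).
dir=$(mktemp -d)
printf '[2024-01-01 10:00:00] WARN ob_restore_util fail(ret=-4016)\n' > "$dir/rootservice.log.20240101100500"
printf '[2024-01-01 10:06:00] INFO ob_restore_util unrelated(ret=0)\n' > "$dir/rootservice.log"

code="-4016"
# Search the current file first; fall back to rotated files if nothing matches,
# because a new rootservice.log may have been generated since the failure.
hits=$(grep "ob_restore_util" "$dir/rootservice.log" | grep "ret=$code")
if [ -z "$hits" ]; then
  hits=$(grep -h "ob_restore_util" "$dir"/rootservice.log.* | grep "ret=$code")
fi
echo "$hits"
rm -rf "$dir"
```

On a real RootService server you would point the two `grep` commands at `/home/admin/oceanbase/log` instead of the fixture directory.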
Issue 2: The restore of a tenant is stuck in the RESTORE_WAIT_LS state
After you execute the `ALTER SYSTEM RESTORE` statement to initiate a physical restore, you can query the `CDB_OB_RESTORE_PROGRESS` view for the status of the restore job.
If the restore job remains in the `RESTORING` state, perform the following steps to locate and troubleshoot the issue:
1. Log on to the `sys` tenant of the cluster as the `root` user.

2. Execute the following statement to query the status of the restore job:

   ```sql
   SELECT * FROM oceanbase.__all_virtual_restore_job WHERE name = 'status' AND tenant_id = xxxx;
   ```

   - If the `status` value in the query result is `RESTORE_WAIT_LS`, proceed to the next step to check the restore status of the log stream replicas of the tenant to be restored.
   - If the `status` value in the query result is not `RESTORE_WAIT_LS`, perform further troubleshooting by referring to the Issue 3: The schema of the tenant to be restored is not refreshed section in this topic.

3. Execute the following statement to check the restore status of the log stream replicas of the tenant to be restored:

   ```sql
   SELECT ls_id,svr_ip,svr_port,role,restore_status,zone FROM oceanbase.__all_virtual_ls_meta_table WHERE tenant_id = xxxx;
   ```

   A sample query result is as follows:

   ```
   +-------+--------------+----------+------+----------------+------+
   | ls_id | svr_ip       | svr_port | role | restore_status | zone |
   +-------+--------------+----------+------+----------------+------+
   |     1 | 100.xx.xx.xx |     5003 |    1 |              0 | z1   |
   |  1001 | 100.xx.xx.xx |     5003 |    1 |              0 | z1   |
   |  1002 | 100.xx.xx.xx |     5003 |    1 |              6 | z1   |
   +-------+--------------+----------+------+----------------+------+
   3 rows in set
   ```

   Pay attention to the following values in the query result:

   - `role`: the role of the replica. The value `1` indicates the leader, which restores data from the external media. The value `2` indicates a follower, which pulls data from the leader.
   - `restore_status`: the restore status of the log stream replica. The value `0` indicates that the log stream replica is properly restored.

   If the `restore_status` value of a log stream replica is `6` or `8`, the log stream is being used for log restore or for the restore of minor-compacted data. In this case, check whether the issue is caused by log restore.

   Notice: Since OceanBase Database V4.1.0, logs are restored in a synchronized manner across log streams. If any log stream is stuck in a state earlier than `6` or `8`, all other log streams will be stuck in the `6` state.

4. Execute the following statement to check whether logs have been restored from the archive directory to the target tenant:

   ```sql
   SELECT count(1) FROM oceanbase.GV$OB_LOG_STAT WHERE tenant_id = xxxx AND end_scn < (SELECT recovery_until_scn FROM oceanbase.__all_virtual_tenant_info WHERE tenant_id = xxxx);
   ```

   Parameters in the statement are described as follows:

   - `end_scn`: the maximum consumption checkpoint.
   - `recovery_until_scn`: the end checkpoint for the restore of the tenant.

   If the returned count is not `0`, some logs have not been restored from the archive directory to the target tenant. In this case, continue to observe the `end_scn` value in the `GV$OB_LOG_STAT` view. If the `end_scn` value still advances, keep waiting. If the `end_scn` value stops advancing, contact OceanBase Technical Support for assistance.

   Note: The process of restoring logs from the archive directory to the target tenant may take a long time. The time taken depends on factors such as the log volume, the read performance of the archive media, and the workload of OceanBase Database.

   If the returned count is `0`, the `end_scn` value of every log stream is greater than or equal to the `recovery_until_scn` value, indicating that logs are successfully restored from the archive directory to the target tenant. In this case, you must further verify whether log replay is completed.

5. Verify whether log replay is completed.

   1. Query the `__all_virtual_replay_stat` virtual table for pending log replay tasks:

      ```sql
      SELECT * FROM oceanbase.__all_virtual_replay_stat WHERE tenant_id = xxxx;
      ```

      Pay attention to the values in the following columns in the query result:

      - `pending_cnt`: the number of pending log replay tasks. If the value is not `0`, pending log replay tasks exist.
      - `unsubmitted_log_scn`: the SCN of the next log entry waiting to be submitted to the replay engine. If log replay is completed, the value is the SCN of the last replayed log entry plus 1.

   2. Query the `__all_virtual_tenant_info` virtual table for the `recovery_until_scn` value:

      ```sql
      SELECT recovery_until_scn FROM oceanbase.__all_virtual_tenant_info WHERE tenant_id = xxxx;
      ```

   If the `pending_cnt` value is `0` and the `unsubmitted_log_scn` value is greater than the `recovery_until_scn` value, log replay is completed. In this case, you must further verify whether log stream 1 has been restored.

   If either of the foregoing conditions is not met, log replay is not completed. Continue to observe whether the `unsubmitted_log_scn` value advances. If the `unsubmitted_log_scn` value does not advance after a long period of time, contact OceanBase Technical Support for assistance.

6. Verify whether log stream 1 has been restored.

   ```sql
   SELECT * FROM oceanbase.__all_ls_recovery_stat WHERE tenant_id = xxxx;
   ```

   Compare the `sync_scn` and `recovery_until_scn` values in the query result. If the `sync_scn` value is equal to the `recovery_until_scn` value, log stream 1 has been restored. If the two values are different, log stream 1 has not been restored. In this case, contact OceanBase Technical Support for assistance.

7. If the `restore_status` value is still `6` or `8` after the entire log restore process is completed, contact OceanBase Technical Support for assistance.
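To observe whether `end_scn` advances, you can narrow the query from step 4 and rerun it a few times. This is a sketch that assumes the `ls_id`, `role`, and `end_scn` columns of the V4.x `GV$OB_LOG_STAT` view:

```sql
SELECT ls_id, role, end_scn
FROM oceanbase.GV$OB_LOG_STAT
WHERE tenant_id = xxxx
ORDER BY ls_id;
```

If the `end_scn` values grow between runs, log restore is still making progress and you can keep waiting.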
Issue 3: The schema of the tenant to be restored is not refreshed
After you execute the `ALTER SYSTEM RESTORE` statement to initiate a physical restore, you can query the `__all_virtual_restore_job` virtual table to check whether residual records of the tenant to be restored exist, and check whether the restore of the tenant is stuck in a state other than `RESTORE_WAIT_LS` by referring to the Issue 2: The restore of a tenant is stuck in the RESTORE_WAIT_LS state section in this topic. If so, the schema of the tenant to be restored is not refreshed, and the system fails to change the tenant status to Normal.
Perform the following steps to locate and troubleshoot the issue:
1. Log on to the `sys` tenant of the cluster as the `root` user.

2. Query the `GV$OB_SERVER_SCHEMA_INFO` view for the schema refresh progress of the tenant:

   ```sql
   SELECT * FROM oceanbase.GV$OB_SERVER_SCHEMA_INFO WHERE tenant_id=xxxx;
   ```

   A sample query result is as follows:

   ```
   +----------------+----------+-----------+--------------------------+-------------------------+--------------+-------------+----------------------------+
   | SVR_IP         | SVR_PORT | TENANT_ID | REFRESHED_SCHEMA_VERSION | RECEIVED_SCHEMA_VERSION | SCHEMA_COUNT | SCHEMA_SIZE | MIN_SSTABLE_SCHEMA_VERSION |
   +----------------+----------+-----------+--------------------------+-------------------------+--------------+-------------+----------------------------+
   | xx.xx.xx.5     |     4000 |      1002 |                        1 |                       1 |            4 |        1086 |                         -1 |
   | xx.xx.xx.9     |     4002 |      1002 |                        1 |                       1 |            4 |        1086 |                         -1 |
   | xx.xx.xx.9     |     4001 |      1002 |                        1 |                       1 |            4 |        1086 |                         -1 |
   | xx.xx.xx.5     |     4005 |      1002 |                        1 |                       1 |            4 |        1086 |                         -1 |
   | xx.xx.xx.11    |     4004 |      1002 |                        1 |                       1 |            4 |        1086 |                         -1 |
   | xx.xx.xx.11    |     4003 |      1002 |                        1 |                       1 |            4 |        1086 |                         -1 |
   +----------------+----------+-----------+--------------------------+-------------------------+--------------+-------------+----------------------------+
   6 rows in set
   ```

   The schema refresh is considered successful when all of the following conditions are met:

   - `REFRESHED_SCHEMA_VERSION` = `RECEIVED_SCHEMA_VERSION`.
   - The value of `REFRESHED_SCHEMA_VERSION` is greater than 8.
   - The value of `REFRESHED_SCHEMA_VERSION` is divisible by 8.

   If any of the preceding conditions is not met, the schema is not refreshed and you must search the `observer.log` file to continue with the troubleshooting.

3. Log on to any server for which the schema is not refreshed based on the information obtained from the view in the previous step. Then, search the `observer.log` file.

   1. Go to the directory where logs are stored.

      ```shell
      cd /home/admin/oceanbase/log
      ```

   2. Search the logs for related information.

      Search the `observer.log` file for the thread name. Schema refresh is performed by background threads and the system keeps retrying, so you need to view only the latest log records.

      ```shell
      grep "SerScheQueue" observer.log
      ```

      Here are some sample log records:

      ```
      observer.log.20220811114045:[2022-08-11 11:39:54.382533] WARN [RPC.OBRPC] rpc_call (ob_rpc_proxy.ipp:361) [192069][SerScheQueue0][T0][YFA00BA2D905-0005E5DEE6A2294E-0-0] [lt=8] execute rpc fail(ret=-4012, dst="11.xx.xx.9:4001")
      observer.log.20220811114045:[2022-08-11 11:39:54.382552] WARN log_user_error_and_warn (ob_rpc_proxy.cpp:315) [192069][SerScheQueue0][T0][YFA00BA2D905-0005E5DEE6A2294E-0-0] [lt=20]
      ```

      Find the corresponding trace information and the server that failed to execute the job. In this example, the trace ID is `YFA00BA2D905-0005E5DEE6A2294E-0-0`, and the IP address of the server that failed to execute the job is `xx.xx.xx.9`.

4. Log on to the server that failed to execute the job, go to the directory where logs are stored, and then search the `observer.log` file for the obtained trace information to confirm the error information.

   ```shell
   grep "YFA00BA2D905-0005E5DEE6A2294E-0-0" observer.log
   ```

   Notice: If no related log record is found in the `observer.log` file by using the `grep` command, a new `observer.log` file may have been generated. In this case, you can run `grep "YFA00BA2D905-0005E5DEE6A2294E-0-0" observer.log.*`.

5. After you obtain the error information from the log file, contact OceanBase Technical Support for assistance.
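Pulling the trace ID and the failing server address out of a WARN line like the one above can be scripted. This is a sketch against a fabricated copy of the sample line; the `sed` patterns assume the bracketed-field layout shown in the sample log records:

```shell
#!/bin/sh
# Fabricated sample line copied from the observer.log output above.
line='[2022-08-11 11:39:54.382533] WARN [RPC.OBRPC] rpc_call (ob_rpc_proxy.ipp:361) [192069][SerScheQueue0][T0][YFA00BA2D905-0005E5DEE6A2294E-0-0] [lt=8] execute rpc fail(ret=-4012, dst="11.xx.xx.9:4001")'

# The trace ID is the last bracketed field before "[lt=...]".
trace=$(printf '%s\n' "$line" | sed -n 's/.*\]\[\([A-Z0-9-]*\)\] \[lt=.*/\1/p')
# The failing server is the dst="..." address in the rpc error.
dst=$(printf '%s\n' "$line" | sed -n 's/.*dst="\([^"]*\)".*/\1/p')

echo "trace_id: $trace"
echo "failing server: $dst"
```

In practice you would feed real `grep "SerScheQueue" observer.log` output into the same `sed` expressions instead of a hard-coded line.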
Issue 4: The status of the restore job is FAILED in the view
After you execute the `ALTER SYSTEM RESTORE` statement to initiate a physical restore, you can query the `CDB_OB_RESTORE_HISTORY` view for the status of the restore job.
If the status of the restore job is displayed as `FAILED`, perform the following steps to locate and troubleshoot the issue:
1. Log on to the `sys` tenant of the cluster as the `root` user.

2. Query the `CDB_OB_RESTORE_HISTORY` view for information in the `comment` column:

   ```sql
   SELECT * FROM oceanbase.CDB_OB_RESTORE_HISTORY WHERE TENANT_ID=xxxx;
   ```

   The `comment` column displays information about the restore job, including the IP address of the OBServer node, the ID of the log stream, the type of the faulty module, and the corresponding `trace_id` value.

   For more information about the `CDB_OB_RESTORE_HISTORY` view, see CDB_OB_RESTORE_HISTORY.

3. Search for the corresponding log records based on the obtained error code and `trace_id` value.

   1. Log on to the server indicated in the `comment` column.

   2. Go to the directory where logs are stored.

      ```shell
      cd /home/admin/oceanbase/log
      ```

   3. Run the following command to search for log records generated near the point in time when the restore job failed.

      - If the job was executed by an OBServer node, which is indicated by `(server)` in the `comment` column, run the following command:

        ```shell
        grep "trace_id" observer.log | grep "WARN\|ERROR"
        ```

        Replace `trace_id` in the command with the `trace_id` value in the `comment` column.

        Notice: If no related log record is found in the `observer.log` file by using the `grep` command, a new `observer.log` file may have been generated. In this case, you can run `grep "trace_id" observer.log.* | grep "WARN\|ERROR"`.

      - If the job was executed by the server where RootService resides, which is indicated by `(rootservice)` in the `comment` column, run the following command:

        ```shell
        grep "ob_restore_scheduler" rootservice.log | grep "WARN\|ERROR"
        ```

        Notice: If no related log record is found in the `rootservice.log` file by using the `grep` command, a new `rootservice.log` file may have been generated. In this case, you can run `grep "ob_restore_scheduler" rootservice.log.* | grep "WARN\|ERROR"`.

4. After you obtain the error information from the log file, contact OceanBase Technical Support for assistance.
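The branch in step 3 (search `observer.log` for `(server)`, `rootservice.log` for `(rootservice)`) can be sketched as a small dispatcher. The `comment` value below is hypothetical: the real layout of the `comment` column in `CDB_OB_RESTORE_HISTORY` may differ, so treat the string and the `trace_id:` field as assumptions.

```shell
#!/bin/sh
# Hypothetical comment value; the real layout of the comment column
# in CDB_OB_RESTORE_HISTORY may differ.
comment='ls_id: 1001, (server) addr: "xx.xx.xx.5:4000", trace_id: YFA00BA2D905-0005E5DEE6A2294E-0-0'

case "$comment" in
  *"(server)"*)
    logfile="observer.log"
    # Use the trace ID from the comment as the search pattern.
    pattern=$(expr "$comment" : '.*trace_id: \(.*\)')
    ;;
  *"(rootservice)"*)
    logfile="rootservice.log"
    pattern="ob_restore_scheduler"
    ;;
esac

# Print the grep command you would run on the indicated server.
echo "grep \"$pattern\" $logfile | grep \"WARN\\|ERROR\""
```

The printed command matches the manual steps above; run it in `/home/admin/oceanbase/log` on the server named in the `comment` column.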