The error code for no leader is -4038. When this error occurs, there are two possible scenarios:
- A leader exists, but it is not the current node. This could happen if the leader has changed and the location cache has not been refreshed.
- No leader exists. In this case, check the network connectivity for the log stream.
Current leader exists
If the leader is switched (whether abnormally or gracefully), the location cache may not be refreshed in time. A remote execution request is then sent to a node that is no longer the leader, which returns the -4038 error code, and the statement is retried. You can confirm this by following these steps:
- Check the DBA_OB_SERVER_EVENT_HISTORY view to confirm whether the leader switch actually occurred. See the example query below.
- Check the system logs to confirm whether the destination node indeed returned a -4038 error code, as shown in the log excerpt further below.
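For the first check, the following is a minimal sketch of the view query. It reuses the columns queried later in this article; the exact EVENT values recorded for a leader switch vary by version, so start broad and narrow down.
obclient(root@sys)[oceanbase]>
SELECT value4, svr_ip, svr_port, module, event, value1, value2
FROM DBA_OB_SERVER_EVENT_HISTORY
WHERE module = "ELECTION"
ORDER BY value4 DESC LIMIT 10;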
# grep 4038 observer.log
[2021-01-21 09:46:31.305260] WARN setup_next_scanner (ob_direct_receive.cpp:292) [15256][YB420B4043DA-0005A535D6FAE816] [lt=2] , ret=-4038
[2021-01-21 09:46:31.305295] WARN [SQL.EXE] setup_next_scanner (ob_direct_receive.cpp:295) [15256][YB420B4043DA-0005A535D6FAE816] [lt=31] while fetching first scanner, the remote rcode is not OB_SUCCESS(ret=-4038, err_msg="", dst_addr="11.xx.xx.xx:2882")
[2021-01-21 09:46:31.305302] WARN [SQL.EXE] inner_open (ob_direct_receive.cpp:123) [15256][YB420B4043DA-0005A535D6FAE816] [lt=6] failed to setup first scanner(ret=-4038)
[2021-01-21 09:46:31.305308] WARN [SQL.ENG] open (ob_phy_operator.cpp:138) [15256][YB420B4043DA-0005A535D6FAE816] [lt=4] Open this operator failed(ret=-4038, op_type="PHY_DIRECT_RECEIVE")
[2021-01-21 09:46:31.305314] WARN [SQL.ENG] open (ob_phy_operator.cpp:128) [15256][YB420B4043DA-0005A535D6FAE816] [lt=5] Open child operator failed(ret=-4038, op_type="PHY_ROOT_TRANSMIT")
[2021-01-21 09:46:31.305366] WARN [SERVER] test_and_save_retry_state (ob_query_retry_ctrl.cpp:242) [15256][YB420B4043DA-0005A535D6FAE816] [lt=2] partition change or not master or no response, reutrn it to packet queue to retry(client_ret=-4038, err=-4038, retry_type_=2)
Current leader does not exist
The presence of a leader in a log stream depends on three modules, from bottom to top: the election module, the CLOG module, and the RoleChangeService module. Therefore, when a log stream has no leader, you can troubleshoot from bottom to top:
- Confirm whether an election leader exists.
- Confirm whether a CLOG leader exists.
- Confirm whether the RoleChangeService module is working properly.
Confirm whether an election leader exists
You can search the system logs on any node that hosts the log stream to check the current election leader. For example, the following command queries the election leader of log stream 1 in tenant 1. To query another log stream, replace T1 and id:1 with the corresponding tenant ID and log stream ID.
grep 'T1_.*{id:1}' ./election.log | grep 'dump acceptor info'
In the output of the preceding command, the lease:{owner:"xx.xx.xx.xx:xxxxx"} field indicates the election leader of the log stream, where xx.xx.xx.xx is the IP address and xxxxx is the port number.
If no election leader exists, the network connectivity of the log stream is likely faulty. Examples include:
- Unidirectional network connectivity due to the tenant not having enough memory to receive messages but still being able to send messages.
- Inability to receive messages due to work queue backlog.
- Inability to receive messages due to the node being in the Stop state.
- Other network connectivity issues.
You can further analyze the network connectivity at the log stream level by using the following command:
grep 'T1_.*{id:1}' ./election.log | grep 'dump message count'
This log records the messages exchanged between this replica and the other replicas, classified by IP address and message type, along with the timestamp of the last interaction.
Based on the network connectivity statistics, you can preliminarily determine which node is abnormal and whether the issue is on the sending or receiving end, and then further troubleshoot the corresponding node.
In addition to checking the system logs, you can also query internal views for election-related events. For example, query the DBA_OB_SERVER_EVENT_HISTORY view under the sys tenant.
obclient(root@sys)[oceanbase]>
SELECT value4, svr_ip, svr_port, event, name3, value3
FROM DBA_OB_SERVER_EVENT_HISTORY
WHERE module = "ELECTION" AND value1 = $tenant_id AND value2 = $ls_id
ORDER BY value4 LIMIT 10;
If an election leader exists, continue to confirm whether a CLOG leader exists.
Confirm whether a CLOG leader exists
If an election leader exists, you can further check whether a CLOG leader exists.
If the cluster is available (SQL client can connect), you can query the GV$OB_LOG_STAT view under the sys tenant.
obclient >
SELECT * FROM GV$OB_LOG_STAT
WHERE tenant_id = $tenant_id AND ls_id = $ls_id AND role = "LEADER";
Notice
If the number of nodes in the query result is less than the number of replicas in the memberlist, it indicates that the status of some log stream replicas is abnormal. Further troubleshooting is required.
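For the comparison described in this notice, a hedged sketch is to drop the role filter and compare the replicas returned with the paxos member list (the paxos_member_list column name follows recent 4.x versions; verify that it exists in yours):
SELECT svr_ip, svr_port, role, paxos_member_list
FROM GV$OB_LOG_STAT
WHERE tenant_id = $tenant_id AND ls_id = $ls_id;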
If the cluster is unavailable (SQL client cannot connect), you can troubleshoot by checking the system logs on the election leader node obtained in the previous step. Execute the following command:
grep 'Txxx_.*PALF_DUMP.*palf_id=xxxx,' observer.log.* | less
# In the preceding command, Txxx is the tenant ID. For example, T1001 indicates the logs of tenant 1001.
# palf_id is the log stream ID. For example, palf_id=1001 indicates the logs of log stream 1001.
If the role in the log is Follower, it indicates that no CLOG leader exists, and further troubleshooting is required.
If a CLOG leader exists, continue to confirm whether the RoleChangeService module is working properly.
Confirm whether the RoleChangeService module is working properly
When a CLOG role change occurs, an asynchronous task is submitted to the task queue of the RoleChangeService module. The background thread of the RoleChangeService module is responsible for executing the leader switch. You can troubleshoot the status of the RoleChangeService module by checking the system logs. The command is as follows:
grep 'Txxx_RCS' observer.log rootservice.log.xxxx -h | sort | less
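# In the preceding command, Txxx is the tenant ID. For example, T1001_RCS matches the RoleChangeService logs of tenant 1001.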
- If the last log is not end handle_role_change_event_, it indicates that the leader switch is stuck. You can further check the stack trace in the last log to confirm the issue.
- If the last log is end handle_role_change_event_ and a CLOG leader exists, it indicates that the RoleChangeService module is abnormal. Further troubleshooting is required.
Common causes of a leaderless state
Clock drift
The Election module depends on the clock drift between any two OBServer nodes being no more than 2 seconds. If the drift exceeds this threshold, a leaderless situation may occur. You can run the following command to check whether the clocks of OBServer nodes are synchronized.
sudo clockdiff $IP
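# In the preceding command, $IP is the IP address of another OBServer node to compare against.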
If the result shows that the clock drift exceeds 2 seconds, you need to check whether the clock service is working properly. Common clock services include NTP and Chrony.
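If the drift is large, the following is a hedged sketch for checking the clock service; which commands apply depends on whether the node runs NTP or Chrony.
# If NTP (ntpd) is in use, check peer offsets:
ntpq -p
# If Chrony is in use, check the tracking status:
chronyc tracking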
Tenant/table/partition deletion
If a tenant, table, or partition is deleted, the leader replica is the last to be removed. During this window the leader cannot be re-elected, so its lease may expire, leading to a leaderless situation.
Majority of replicas are down
If the majority of replicas are down, no leader can be elected.
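A hedged way to check node liveness is to query the DBA_OB_SERVERS view under the sys tenant (the view and column names follow OceanBase 4.x; verify them in your version):
SELECT svr_ip, svr_port, status, stop_time, start_service_time
FROM DBA_OB_SERVERS;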
Network issues
If the majority of replicas are running and no replicas are down, but the system log of the leader replica shows leader lease is expired, it indicates that the leader failed to be re-elected (failed to extend its lease). This may be caused by network issues such as one-way or two-way network isolation, RPC request backlog, or RPC latency. You need to further investigate the cause.
Here is an example of the error message.
# grep 'leader lease is expired' election.log
[2018-09-27 23:09:04.950617] ERROR [ELECT] run_gt1_task (ob_election.cpp:1425) [38589][Y0-0000000000000000] [log=25]leader lease is expired(election={partition:{tid:1100611139458321, partition_id:710, part_cnt:0}, is_running:true, is_offline:false, is_changing_leader:false, self:"11.xx.xx.xx:2882", proposal_leader:"0.0.0.0", cur_leader:"0.0.0.0", curr_candidates:3{server:"100.xx.xx.xx:2882", timestamp:1538050146803352, flag:0}{server:"11.xx.xx.xx:2882", timestamp:1538050146803352, flag:0}{server:"100.xx.xx.xx:2882", timestamp:1538050146803352, flag:0}, curr_membership_version:0, leader_lease:[0, 0], election_time_offset:350000, active_timestamp:1538050146456812, T1_timestamp:1538069944600000, leader_epoch:1538050145800000, state:0, role:0, stage:1, type:-1, replica_num:3, unconfirmed_leader:"11.xx.xx.xx:2882", takeover_t1_timestamp:1538069933400000, is_need_query:false, valid_candidates:0, cluster_version:4295229511, change_leader_timestamp:0, ignore_log:false, leader_revoke_timestamp:1538069944600553, vote_period:4, lease_time:9800000}, old_leader="11.xx.xx.xx:2882")
High system load
If the system load is very high, a leaderless situation may occur. You can run system commands to check whether the load is normal at the time when the leaderless situation occurs.
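For example, the following is a hedged sketch; use whatever monitoring tools are available on the node.
# Snapshot of CPU usage and load average
top -b -n 1 | head -20
# Historical load around the incident, if sysstat is installed
sar -q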
CLOG disk is full
If the CLOG disk usage exceeds the threshold (controlled by the tenant parameter log_disk_utilization_limit_threshold, with a default value of 95%), write operations are forcibly stopped. In this case, the replicas on this node cannot be elected as the leader. If the majority of replicas are on nodes where the CLOG disk is full, the cluster becomes leaderless.
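A hedged sketch for checking log disk usage from the sys tenant is shown below; the GV$OB_UNITS column names follow OceanBase 4.x, so verify them in your version. The threshold itself can be checked with SHOW PARAMETERS from within the target tenant.
SELECT svr_ip, svr_port, tenant_id, log_disk_in_use, log_disk_size
FROM GV$OB_UNITS;
SHOW PARAMETERS LIKE 'log_disk_utilization_limit_threshold';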
Tenant memory is full
When a tenant's MemTable memory is full, log replay cannot continue, and logs cannot be recycled (recycling depends on replay and dump). To keep the CLOG disk from filling up, such replicas stop receiving logs from the leader. If the majority of follower replicas have full tenant memory, the leader's log synchronization is blocked, leading to a leaderless cluster.
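A hedged sketch for checking tenant MemTable usage from the sys tenant (the GV$OB_MEMSTORE column names follow OceanBase 4.x):
SELECT svr_ip, svr_port, tenant_id, memstore_used, memstore_limit
FROM GV$OB_MEMSTORE
WHERE tenant_id = $tenant_id;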
CLOG reconfirm failure
If the Election module is working properly and a leader is elected, but the CLOG module fails to execute reconfirm, the leader will voluntarily step down. You need to further investigate the cause of the reconfirm failure.
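As a hedged starting point, search the leader's system logs for reconfirm-related messages; the keyword below is an assumption based on the module's name, and the exact log text varies by version.
grep -i 'reconfirm' observer.log | less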
Takeover failure
After CLOG reconfirm succeeds, the replica enters the takeover state, and the RoleChangeService module executes the specific takeover logic. If the RoleChangeService module works abnormally, the cluster becomes leaderless. You need to further investigate the cause of the takeover failure.
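You can reuse the RoleChangeService log search shown earlier to see where the takeover stops:
grep 'Txxx_RCS' observer.log rootservice.log.xxxx -h | sort | less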
Transaction callback failure
If the CLOG logs at the left boundary of the leader replica's transaction sliding window fail to reach a majority and slide out within a certain threshold (10 seconds), the leader may abnormally step down, leading to a leaderless cluster.
Here is an example of the error message.
# grep 'check_leader_sliding_window_not_slide_' observer.log
[2022-11-29 11:19:12.239777] ERROR [CLOG] check_leader_sliding_window_not_slide_ (ob_log_state_mgr.cpp:2243) [7393][0][Y0-0000000000000000-0-0] [lt=66] leader_active_need_switch_(partition_key={tid:1099511627898, partition_id:2, part_cnt:0}, now=1669691952239687, last_check_start_id_time_=1669691939215423, sw max_log_id=16282, start_id=16282) BACKTRACE:0xfbdfc9f 0xfbc93dd 0x55391ef 0x9c0a1f9 0x9af215d 0x532d801 0x532c0ae 0x532bc69 0x532bb4f 0x529b6a6 0x9adee54 0x5522e04 0xfa8d123 0xfa8cf7f 0xfd6284f
You can further investigate the following aspects:
- Slow CLOG writing on the local node. Check the load and await metrics of the CLOG disk, as shown in the sketch after this list.
- Slow CLOG writing on the follower replicas. The investigation method is the same as for the local node.
- Slow network between the leader and follower replicas. For example, the packet fly cost too much time error message indicates that network packets take a long time to be transmitted.
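For the disk checks in the first two items, a hedged sketch (watch the device that hosts the CLOG directory):
# Report extended disk statistics (await, %util) every second, five times
iostat -x 1 5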
