The error code -4038 indicates that the leader is absent. When you encounter this error, there are two possibilities:
- The leader exists but is not the current node. This may be because the location cache is not refreshed after a leader switch (including abnormal leader switch and graceful leader switch).
- The leader does not exist. In this case, you need to check the network connectivity of the log stream.
The leader exists
If the location cache is not refreshed after a leader switch (including abnormal leader switch and graceful leader switch), the destination node is no longer the leader during remote execution. In this case, the error code -4038 is returned and a statement-level retry is performed. You can confirm this issue by using the following methods:
- Query the DBA_OB_SERVER_EVENT_HISTORY view to confirm whether a leader switch has occurred.
- Query the system logs to confirm whether the destination node has returned the error code 4038.
# grep 4038 observer.log
[2021-01-21 09:46:31.305260] WARN setup_next_scanner (ob_direct_receive.cpp:292) [15256][YB420B4043DA-0005A535D6FAE816] [lt=2] , ret=-4038
[2021-01-21 09:46:31.305295] WARN [SQL.EXE] setup_next_scanner (ob_direct_receive.cpp:295) [15256][YB420B4043DA-0005A535D6FAE816] [lt=31] while fetching first scanner, the remote rcode is not OB_SUCCESS(ret=-4038, err_msg="", dst_addr=""11.xx.xx.xx:2882"")
[2021-01-21 09:46:31.305302] WARN [SQL.EXE] inner_open (ob_direct_receive.cpp:123) [15256][YB420B4043DA-0005A535D6FAE816] [lt=6] failed to setup first scanner(ret=-4038)
[2021-01-21 09:46:31.305308] WARN [SQL.ENG] open (ob_phy_operator.cpp:138) [15256][YB420B4043DA-0005A535D6FAE816] [lt=4] Open this operator failed(ret=-4038, op_type="PHY_DIRECT_RECEIVE")
[2021-01-21 09:46:31.305314] WARN [SQL.ENG] open (ob_phy_operator.cpp:128) [15256][YB420B4043DA-0005A535D6FAE816] [lt=5] Open child operator failed(ret=-4038, op_type="PHY_ROOT_TRANSMIT")
[2021-01-21 09:46:31.305366] WARN [SERVER] test_and_save_retry_state (ob_query_retry_ctrl.cpp:242) [15256][YB420B4043DA-0005A535D6FAE816] [lt=2] partition change or not master or no response, reutrn it to packet queue to retry(client_ret=-4038, err=-4038, retry_type_=2)
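The two grep checks above can be combined into a quick triage script. A minimal sketch, assuming the observer.log line format shown in the sample output above (the sample line below is an abridged, illustrative copy; in practice point the commands at a real observer.log):

```shell
# Count -4038 hits and list which remote nodes returned the error.
# The sample line is illustrative; replace it with real observer.log input.
sample='[2021-01-21 09:46:31.305295] WARN [SQL.EXE] setup_next_scanner ... the remote rcode is not OB_SUCCESS(ret=-4038, err_msg="", dst_addr=""11.xx.xx.xx:2882"")'

# How many statements hit -4038?
count=$(printf '%s\n' "$sample" | grep -c 'ret=-4038')

# Which destination nodes returned it? (dst_addr appears in the log sample above)
addrs=$(printf '%s\n' "$sample" | grep -o 'dst_addr=""[^"]*""' | sort -u)

echo "hits=$count"
echo "$addrs"
```

If the destination addresses reported here match nodes that recently lost leadership in DBA_OB_SERVER_EVENT_HISTORY, the stale-location-cache explanation is confirmed.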
The leader does not exist
Whether a log stream has a leader depends on three modules, from bottom to top: election module, clog module, and RoleChangeService module. Therefore, when a log stream has no leader, you can troubleshoot from bottom to top:
- Confirm whether the election leader exists.
- Confirm whether the clog leader exists.
- Confirm whether the RoleChangeService module is working properly.
Confirm whether the election leader exists
You can query the system logs on any node that has the log stream to check the current election leader. For example, run the following command to query the election leader of tenant 1, log stream 1. To query another log stream, replace T1 and id:1 with the IDs of the actual tenant and log stream.
grep 'T1_.*{id:1}' ./election.log | grep 'dump acceptor info'
In the output, the field lease:{owner:"xx.xx.xx.xx:xxxxx" indicates the election leader of the log stream, where xx.xx.xx.xx is the IP address and xxxxx is the port number.
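Extracting the owner can be scripted instead of read by eye. A minimal sketch, assuming the lease:{owner:"ip:port" field format described above (the sample line is illustrative; feed in real election.log output in practice):

```shell
# Pull the election leader (lease owner) out of a "dump acceptor info" line.
# The sample line is illustrative; only the lease:{owner:"..."} field format
# is taken from the documented log output.
line='[2023-01-01 00:00:00] ... dump acceptor info ... lease:{owner:"11.xx.xx.xx:2882", ...}'

leader=$(printf '%s\n' "$line" | sed -n 's/.*lease:{owner:"\([^"]*\)".*/\1/p')
echo "election leader: $leader"
```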
If no election leader exists, the most likely cause is a network connectivity issue affecting the log stream, such as:
- The tenant has run out of memory and cannot receive messages, but can still send them, so only one-way network communication is available.
- Messages are accumulated in the worker queue and no more messages can be received.
- The node is stopped and cannot receive messages.
- The network connection is abnormal due to other causes.
To further analyze the network connectivity at the log stream level based on the network interaction statistics, run the following command:
grep 'T1_.*{id:1}' ./election.log | grep 'dump message count'
This log records the statistics of message sending and receiving between the local replica and other replicas, classified by IP and message type, and records the timestamp of the last interaction.
Based on the network connectivity statistics, you can preliminarily determine which node is abnormal and whether the issue is on the sender or receiver side, and then perform further troubleshooting on the corresponding node.
In addition to system logs, you can also analyze election events through internal views. Query the DBA_OB_SERVER_EVENT_HISTORY view under the sys tenant:
obclient >
SELECT value4, svr_ip, svr_port, event, name3, value3
FROM DBA_OB_SERVER_EVENT_HISTORY
WHERE module = "ELECTION" AND value1 = $tenant_id AND value2 = $ls_id
ORDER BY value4 LIMIT 10;
If the election leader exists, continue to confirm whether the clog leader exists.
Confirm whether the clog leader exists
If the election leader is confirmed to exist, you can further troubleshoot whether the clog leader exists.
If the cluster is available (SQL client can connect), you can query the GV$OB_LOG_STAT view under the sys tenant.
obclient >
SELECT * FROM GV$OB_LOG_STAT
WHERE tenant_id = $tenant_id AND ls_id = $ls_id AND role = "LEADER";
Notice
If the number of nodes in the query result is less than the number of replicas in the member list, it means that some log stream replicas are in an abnormal state and further troubleshooting is required.
If the cluster is unavailable (SQL client cannot connect), you can troubleshoot by using the system logs. On the election leader node obtained in the previous step, run the following command:
grep 'Txxx_.*PALF_DUMP.*palf_id=xxxx,' observer.log.* | less
# Txxx specifies the tenant_id. For example, T1001 searches for logs of tenant 1001.
# palf_id specifies the log stream ID. For example, palf_id=1001 searches for logs of log stream 1001.
If the role in the log is follower, the clog leader does not exist. Further troubleshoot the cause of the clog leader absence.
If the clog leader exists, continue to confirm whether the RoleChangeService module is working properly.
Confirm whether the RoleChangeService module is working properly
When a clog role switch occurs, an asynchronous task is delivered to the task queue of RoleChangeService. The background thread of RoleChangeService is responsible for executing the leader switchover. You can troubleshoot the RoleChangeService status based on the system logs. Run the following command:
grep 'Txxx_RCS' observer.log rootservice.log.xxxx -h | sort | less
If the last log is not end handle_role_change_event_, the execution of the leader switchover is stuck. You can check the stack in the last log to confirm the issue. If the last log is end handle_role_change_event_ and the clog leader exists, the RoleChangeService module is abnormal and further troubleshooting is required.
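The "is the last RCS log the end marker" check can be scripted. A minimal sketch, assuming tenant 1001; the sample log lines are hypothetical, and only the end handle_role_change_event_ marker comes from the text above:

```shell
# Decide whether RoleChangeService finished its last task by checking
# whether the most recent RCS log line contains the "end" marker.
# Sample lines are hypothetical; in practice use:
#   grep 'T1001_RCS' observer.log rootservice.log.* -h | sort | tail -1
logs='[2022-01-01 00:00:01] T1001_RCS handling role change event
[2022-01-01 00:00:02] T1001_RCS end handle_role_change_event_'

last=$(printf '%s\n' "$logs" | sort | tail -1)
case "$last" in
  *'end handle_role_change_event_'*) status="idle" ;;   # last task finished
  *) status="stuck" ;;                                  # check the stack in this line
esac
echo "RCS status: $status"
```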
Common causes of no leader in a tenant
Clock offset
The current election module requires that the clock offset between any two OBServer nodes does not exceed 2s. Otherwise, no leader may be elected. You can run the following command to check whether the clocks of OBServer nodes are synchronized:
sudo clockdiff $IP
If the result shows that the offset exceeds 2s, check whether the clock service is working properly. Common clock services include NTP and Chrony.
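The 2 s bound can also be checked numerically once you have each node's offset. A minimal sketch, assuming you have already obtained per-node offsets in milliseconds (for example by parsing clockdiff output); the IPs and offset values below are hypothetical samples:

```shell
# Flag any node whose clock offset from the local node exceeds the
# 2 s (2000 ms) bound required by the election module.
# Offsets are in milliseconds; IPs and values are hypothetical.
check_offset() {
  ip=$1; offset_ms=$2
  abs=${offset_ms#-}                     # strip sign for the comparison
  if [ "$abs" -gt 2000 ]; then
    echo "$ip: offset ${offset_ms}ms exceeds 2s, check NTP/Chrony"
  else
    echo "$ip: offset ${offset_ms}ms within bound"
  fi
}

check_offset 10.0.0.1 150
check_offset 10.0.0.2 -3200
```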
Tenant, table, or partition is deleted
If a tenant, table, or partition is deleted, the deletion timing may be inconsistent across replicas. If the leader is the last to be deleted, it may fail to extend its lease because it cannot receive votes. When the lease expires, no leader will be available.
Majority of replicas fail
If replicas fail and the remaining normal replicas cannot constitute a majority, no leader will be available.
Network issues
If the majority of replicas are running without failure, and the system log of the leader replica contains the leader lease is expired error, it means the leader replica has failed to extend its lease. This may be caused by network issues (one-way or two-way network isolation, RPC request backlog, RPC delay). Further troubleshooting is required.
Sample error log:
# grep 'leader lease is expired' election.log
[2018-09-27 23:09:04.950617] ERROR [ELECT] run_gt1_task (ob_election.cpp:1425) [38589][Y0-0000000000000000] [log=25]leader lease is expired(election={partition:{tid:1100611139458321, partition_id:710, part_cnt:0}, is_running:true, is_offline:false, is_changing_leader:false, self:"11.xx.xx.xx:2882", proposal_leader:"0.0.0.0", cur_leader:"0.0.0.0", curr_candidates:3{server:"100.xx.xx.xx:2882", timestamp:1538050146803352, flag:0}{server:"11.xx.xx.xx:2882", timestamp:1538050146803352, flag:0}{server:"100.xx.xx.xx:2882", timestamp:1538050146803352, flag:0}, curr_membership_version:0, leader_lease:[0, 0], election_time_offset:350000, active_timestamp:1538050146456812, T1_timestamp:1538069944600000, leader_epoch:1538050145800000, state:0, role:0, stage:1, type:-1, replica_num:3, unconfirmed_leader:"11.xx.xx.xx:2882", takeover_t1_timestamp:1538069933400000, is_need_query:false, valid_candidates:0, cluster_version:4295229511, change_leader_timestamp:0, ignore_log:false, leader_revoke_timestamp:1538069944600553, vote_period:4, lease_time:9800000}, old_leader="11.xx.xx.xx:2882")
High system load
If the system load is very high, it may also result in no leader. You can use system commands such as top or sar to check whether the load at the corresponding time was normal.
Clog disk is full
If the clog disk usage exceeds the threshold (controlled by the tenant parameter log_disk_utilization_limit_threshold, default 95%), log writing is forcibly stopped. In this case, the replica on the node cannot be elected as the leader. If this happens to the majority of nodes, the cluster will have no leader.
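The threshold comparison itself is simple arithmetic. A minimal sketch against the 95% default of log_disk_utilization_limit_threshold; the byte counts below are hypothetical, and in practice you would take them from your deployment's clog disk:

```shell
# Check whether clog disk usage has crossed the stop-writing threshold
# (log_disk_utilization_limit_threshold, default 95%).
# Sizes are in bytes; the sample values are hypothetical.
used=98000000000
total=100000000000
threshold=95

pct=$(( used * 100 / total ))
if [ "$pct" -ge "$threshold" ]; then
  echo "clog disk ${pct}% used: log writing stops, node cannot be leader"
else
  echo "clog disk ${pct}% used: below threshold"
fi
```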
Tenant memory is full
When the MemTable memory of a tenant is full, logs cannot be replayed, which further prevents log recycling (recycling depends on replay and dump). Therefore, to avoid the clog disk becoming full, when tenant memory is full, a replica is not allowed to continue receiving logs from the leader. If the majority of Followers have full tenant memory, the log synchronization of the leader will be stuck, and the majority cannot be reached. As a result, the cluster will have no leader.
Clog reconfirm fails
If the election module works properly and elects a leader, but the clog module fails during reconfirm, the leader will voluntarily step down. Further troubleshoot the cause of the reconfirm failure.
Takeover fails
After clog reconfirm succeeds, the system enters the takeover state. The RoleChangeService module continues to execute the specific takeover logic. If the RoleChangeService module is abnormal, the cluster will have no leader. Further troubleshoot the cause of the takeover failure.
Transaction callback execution exception
If the log at the left boundary of the leader's transaction sliding window fails to reach the majority and slide out within a certain threshold (10 seconds), the leader will voluntarily step down due to the exception, and the cluster will have no leader.
Sample error log:
# grep 'check_leader_sliding_window_not_slide_' observer.log
[2022-11-29 11:19:12.239777] ERROR [CLOG] check_leader_sliding_window_not_slide_ (ob_log_state_mgr.cpp:2243) [7393][0][Y0-0000000000000000-0-0] [lt=66] leader_active_need_switch_(partition_key={tid:1099511627898, partition_id:2, part_cnt:0}, now=1669691952239687, last_check_start_id_time_=1669691939215423, sw max_log_id=16282, start_id=16282) BACKTRACE:0xfbdfc9f 0xfbc93dd 0x55391ef 0x9c0a1f9 0x9af215d 0x532d801 0x532c0ae 0x532bc69 0x532bb4f 0x529b6a6 0x9adee54 0x5522e04 0xfa8d123 0xfa8cf7f 0xfd6284f
You can further troubleshoot as follows:
- If local clog disk write is slow, check the clog disk load, await, and other metrics.
- If follower replica disk write is slow, use the same troubleshooting method as above.
- If the network between the leader and a follower is slow: for example, the log message packet fly cost too much time indicates that a network packet took too long to arrive.
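The network-latency symptom can be checked with a grep over observer.log. A minimal self-contained sketch; the sample line is illustrative, and only the packet fly cost too much time message text comes from the section above:

```shell
# Count occurrences of the slow-packet warning; a growing count points
# to network delay between leader and follower.
# The sample line is illustrative; in practice run:
#   grep 'packet fly cost too much time' observer.log
sample='[2022-01-01 00:00:00] WARN packet fly cost too much time(...)'

hits=$(printf '%s\n' "$sample" | grep -c 'packet fly cost too much time')
echo "slow-packet warnings: $hits"
```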