Absence of the leader|V4.3.1| docs|Distributed Database

Absence of the leader

Last Updated: 2024-06-28 05:30:29

Error 4038 indicates that the leader is absent. There are two possible causes:

The leader exists but is not the current node. This can happen when the location cache has not been refreshed after a leader switchover.
The leader does not exist. In this case, you need to check the network connectivity of the log stream.

The leader exists

After a leader switchover, whether or not exceptions occurred, a location cache that has not been refreshed can route a remote execution to a destination node that is no longer the leader. In this case, the error code 4038 is returned and the statement is retried. To confirm this issue, perform the following operations:

Query the DBA_OB_SERVER_EVENT_HISTORY view and check whether the leader switchover has been performed.
Query the system logs to check whether the destination node has returned the error code 4038.

# grep 4038 observer.log
[2021-01-21 09:46:31.305260] WARN  setup_next_scanner (ob_direct_receive.cpp:292) [15256][YB420B4043DA-0005A535D6FAE816] [lt=2] , ret=-4038
[2021-01-21 09:46:31.305295] WARN  [SQL.EXE] setup_next_scanner (ob_direct_receive.cpp:295) [15256][YB420B4043DA-0005A535D6FAE816] [lt=31] while fetching first scanner, the remote rcode is not OB_SUCCESS(ret=-4038, err_msg="", dst_addr="11.xx.xx.xx:2882")
[2021-01-21 09:46:31.305302] WARN  [SQL.EXE] inner_open (ob_direct_receive.cpp:123) [15256][YB420B4043DA-0005A535D6FAE816] [lt=6] failed to setup first scanner(ret=-4038)
[2021-01-21 09:46:31.305308] WARN  [SQL.ENG] open (ob_phy_operator.cpp:138) [15256][YB420B4043DA-0005A535D6FAE816] [lt=4] Open this operator failed(ret=-4038, op_type="PHY_DIRECT_RECEIVE")
[2021-01-21 09:46:31.305314] WARN  [SQL.ENG] open (ob_phy_operator.cpp:128) [15256][YB420B4043DA-0005A535D6FAE816] [lt=5] Open child operator failed(ret=-4038, op_type="PHY_ROOT_TRANSMIT")
[2021-01-21 09:46:31.305366] WARN  [SERVER] test_and_save_retry_state (ob_query_retry_ctrl.cpp:242) [15256][YB420B4043DA-0005A535D6FAE816] [lt=2] partition change or not master or no response, reutrn it to packet queue to retry(client_ret=-4038, err=-4038, retry_type_=2)
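
The log check above can be scripted when there are many files to scan. The following is a minimal Python sketch, not an official tool: the function name find_4038 and both regular expressions are assumptions derived only from the sample lines shown above, so adjust them if your log format differs.

```python
import re
from collections import defaultdict

# Assumed patterns, inferred from the sample log lines above.
TRACE_RE = re.compile(r"\[\d+\]\[(Y[0-9A-F]+-[0-9A-F]+)\]")
DST_RE = re.compile(r'dst_addr="+([^"]+)"+')  # tolerates repeated quotes


def find_4038(lines):
    """Map each trace ID to the destination addresses seen with ret=-4038."""
    hits = defaultdict(set)
    for line in lines:
        if "ret=-4038" not in line:
            continue
        trace = TRACE_RE.search(line)
        if trace is None:
            continue
        addrs = hits[trace.group(1)]  # record the trace even without dst_addr
        dst = DST_RE.search(line)
        if dst:
            addrs.add(dst.group(1))
    return dict(hits)
```

Grouping by trace ID keeps all lines of one failed request together, so the destination address reported by the remote node can be matched against the switchover events in DBA_OB_SERVER_EVENT_HISTORY.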

No leader exists

The leader election of log streams depends on the following three modules in sequence: election module, clog module, and RoleChangeService module. Therefore, you can check whether the log stream leader exists by using the following procedure:

Check whether the election leader exists.
Check whether the clog leader exists.
Check whether the RoleChangeService module is working properly.

Check whether the election leader exists

You can query the system logs on any node that has a log stream to check the current election leader. For example, you can run the following command to query the election leader of log stream 1 of tenant T1. To query the election leader for another log stream, replace T1 and id:1 with the IDs of the actual tenant and log stream as needed.

grep 'T1_.*{id:1}' ./election.log | grep 'dump acceptor info'

In the output, the lease:{owner:"xx.xx.xx.xx:xxxxx" field identifies the election leader of the log stream, where xx.xx.xx.xx is the IP address and xxxxx is the port number.
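
If you need to pull the owner out of many log lines, a small parser can help. This is a sketch under stated assumptions: the regular expression relies only on the lease:{owner:"ip:port" fragment quoted above, handles IPv4-style addresses only, and the function name election_leader is hypothetical.

```python
import re

# Assumed pattern, based only on the field format quoted in the text.
OWNER_RE = re.compile(r'lease:\{owner:"([^":]+):(\d+)"')


def election_leader(line):
    """Return (ip, port) of the lease owner found in a log line, or None."""
    match = OWNER_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

Feed it the lines returned by the grep command above. A result of None only means that the fragment was not found in that line, not necessarily that no leader exists.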

If no election leader exists, the likely cause is a connectivity problem at the log stream level, for example:

The tenant runs out of memory and can no longer receive messages, although it can still send them. The network connection is then effectively one-way.
Messages are accumulated in the worker queue and no more messages can be received.
The node is stopped and cannot receive messages.
The network connection is abnormal due to other causes.

To further analyze the network connectivity at the log stream level based on the network interaction statistics, run the following command:

grep 'T1_.*{id:1}' ./election.log | grep 'dump message count'

The queried log provides the statistics on messaging between the local replica and other replicas and records the timestamps of the last interactions between replicas based on IP addresses and message types.

Based on the network connectivity statistics, you can determine which node causes the issue and perform further troubleshooting on that node.

In addition to system logs, you can also view and analyze internal events by querying the DBA_OB_SERVER_EVENT_HISTORY view under the sys tenant.

obclient >
SELECT value4, svr_ip, svr_port, event, name3, value3
FROM DBA_OB_SERVER_EVENT_HISTORY
WHERE module = 'ELECTION' AND value1 = $tenant_id AND value2 = $ls_id
ORDER BY value4 LIMIT 10;

If the election leader exists, you can check whether the clog leader exists.

Check whether the clog leader exists

Once you have confirmed that the election leader exists, you can check whether the clog leader exists.

If the SQL client can be connected, which indicates that the cluster is available, you can check for the clog leader by querying the GV$OB_LOG_STAT view under the sys tenant.

obclient >
SELECT * FROM GV$OB_LOG_STAT
WHERE tenant_id = $tenant_id AND ls_id = $ls_id AND role = 'LEADER';

Notice

If the number of nodes in the query result is less than the number of replicas in the member list, it means that some log stream replicas are in an abnormal state and further troubleshooting is required.
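
The comparison described in the notice can be sketched as a set difference. Both inputs are hypothetical: member_list stands for the replica addresses in the member list and reporting_nodes for the addresses that appear in the query result, each written as an "ip:port" string that you extract yourself.

```python
def missing_replicas(member_list, reporting_nodes):
    """Return member-list replicas that do not appear in the query result."""
    return sorted(set(member_list) - set(reporting_nodes))
```

A non-empty result names the replicas to troubleshoot first.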

If the SQL client cannot be connected, which indicates that the cluster is unavailable, you can check for the clog leader by using the system logs. To do so, run the following command on the election leader node confirmed in the previous step:

grep 'Txxx_.*PALF_DUMP.*palf_id=xxxx,' observer.log.* | less
# Txxx specifies the tenant ID. For example, if T1001 is used, the logs of tenant 1001 are retrieved.
# palf_id specifies the ID of the log stream. For example, if palf_id is set to 1001, the logs of log stream 1001 are retrieved.

If the value of the role field in the log is Follower, the clog leader does not exist. In this case, further troubleshooting is required to find the causes.

If the clog leader exists, you can check whether the RoleChangeService module is working properly.

Check whether the RoleChangeService module is working properly

To switch the clog leader, the clog module sends an asynchronous task to the task queue of the RoleChangeService module. Then, the background thread of the RoleChangeService module performs the leader switchover. You can check the status of the RoleChangeService module by running the following command to query the system logs:

grep 'Txxx_RCS' observer.log rootservice.log.xxxx -h | sort | less

If the last log is not "end handle_role_change_event_", the leader switchover is stuck. Analyze the stack in the last log to find the cause.
If the last log is "end handle_role_change_event_" and the clog leader exists, you need to further troubleshoot exceptions of the RoleChangeService module.
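
The "last log" check can be sketched in Python as follows. It assumes, as the sort step in the grep command does, that every line starts with a timestamp so that plain string sorting orders the lines by time. The function name role_change_stuck is hypothetical.

```python
END_MARKER = "end handle_role_change_event_"


def role_change_stuck(rcs_lines):
    """Return True if the newest RCS log line lacks the end marker.

    Returns None when there are no lines, because nothing can be concluded.
    """
    ordered = sorted(line for line in rcs_lines if line.strip())
    if not ordered:
        return None
    return END_MARKER not in ordered[-1]
```

A True result corresponds to the first case above (the switchover is stuck). A False result still requires the clog leader check before you conclude that the module is healthy.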

Common causes of absence of the leader in a tenant

A clock offset has occurred

The current election module requires that the clock offset between any two OBServer nodes does not exceed 200 ms. Otherwise, the leader cannot be elected. You can run the following command to check whether the clocks of OBServer nodes are synchronized:

sudo clockdiff $IP

If the result shows that the clock offset exceeds 200 ms, you need to check whether the clock services are working properly. Common clock services include Network Time Protocol (NTP) and Chrony.
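
Because the limit applies to any pair of nodes, offsets measured from a single node should be compared with each other, not only with the local clock. The following Python sketch shows the idea. The input format is an assumption: a mapping from node address to the offset in milliseconds measured from the local node, which you collect yourself, for example from clockdiff runs.

```python
def max_pairwise_offset_ms(offsets_ms):
    """Largest clock difference between any two nodes, in milliseconds.

    offsets_ms maps a node to its offset relative to the local node.
    The local node itself counts as offset 0.
    """
    values = list(offsets_ms.values()) + [0]
    return max(values) - min(values)


def clocks_in_sync(offsets_ms, limit_ms=200):
    """True if no two nodes differ by more than limit_ms."""
    return max_pairwise_offset_ms(offsets_ms) <= limit_ms
```

For example, two remote nodes at +150 ms and -150 ms are each within 200 ms of the local node but are 300 ms apart from each other, so the check fails.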

A tenant, table, or partition is deleted

Replicas delete data at different points in time. If the replica on the original leader is the last to delete a tenant, table, or partition, the original leader may fail re-election when its lease expires because it receives no votes. As a result, no leader is available.

The majority of replicas fail

If one or more replicas fail and the remaining normal replicas cannot constitute the majority of replicas, no leader will be available.
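
The majority rule can be written down in a few lines. This is an illustrative sketch with a hypothetical function name, where a majority means strictly more than half of the replicas.

```python
def has_majority(replica_num, healthy):
    """True if the healthy replicas form a strict majority of replica_num."""
    return healthy >= replica_num // 2 + 1
```

With three replicas, two healthy replicas are enough and one is not. With five replicas, three are required.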

A network error has occurred

If the majority of replicas are working properly and the system log file of the leader contains the message "leader lease is expired", the leader has failed to extend its lease. This may be caused by network errors such as one-way or two-way network isolation, remote procedure call (RPC) request backlogs, or RPC delays. In this case, further troubleshooting is required.

Here is a sample system log with an error message:

# grep 'leader lease is expired' election.log
[2018-09-27 23:09:04.950617] ERROR [ELECT] run_gt1_task (ob_election.cpp:1425) [38589][Y0-0000000000000000] [log=25]leader lease is expired(election={partition:{tid:1100611139458321, partition_id:710, part_cnt:0}, is_running:true, is_offline:false, is_changing_leader:false, self:"11.xx.xx.xx:2882", proposal_leader:"0.0.0.0", cur_leader:"0.0.0.0", curr_candidates:3{server:"100.xx.xx.xx:2882", timestamp:1538050146803352, flag:0}{server:"11.xx.xx.xx:2882", timestamp:1538050146803352, flag:0}{server:"100.xx.xx.xx:2882", timestamp:1538050146803352, flag:0}, curr_membership_version:0, leader_lease:[0, 0], election_time_offset:350000, active_timestamp:1538050146456812, T1_timestamp:1538069944600000, leader_epoch:1538050145800000, state:0, role:0, stage:1, type:-1, replica_num:3, unconfirmed_leader:"11.xx.xx.xx:2882", takeover_t1_timestamp:1538069933400000, is_need_query:false, valid_candidates:0, cluster_version:4295229511, change_leader_timestamp:0, ignore_log:false, leader_revoke_timestamp:1538069944600553, vote_period:4, lease_time:9800000}, old_leader="11.xx.xx.xx:2882")

The system load is high

A high system load may also result in leader absence. Use system monitoring commands to check the load on the affected nodes around the time the leader was lost.

The clog disk is full

If the usage of the clog disk exceeds the threshold specified by the tenant-level parameter log_disk_utilization_limit_threshold, whose default value is 95%, log writing is forcibly stopped. In this case, the replica on the node cannot be elected as the leader. If this happens to the majority of the nodes, no leader can be elected for the cluster.

The tenant memory is fully occupied

When the memory for MemTables of a tenant is fully occupied, logs cannot be replayed. As a result, logs cannot be recycled. This is because log recycling depends on log replays and minor compactions. To avoid the full occupation of the clog disk, a replica stops receiving logs from the leader when the tenant memory is fully used. If the majority of followers run out of tenant memory, the log synchronization with the leader is stuck. As a result, the logs cannot be stored on the majority of followers and no leader can be elected for the cluster.

Clog reconfirmation fails

If the election module works properly and elects the leader, but the clog module fails to reconfirm the leader, the leader stops acting as a leader. Further troubleshooting is required to find the cause of the reconfirmation failure.

Takeover fails

After the clog module reconfirms the leader, the leader performs a takeover. Then, the RoleChangeService module executes the specific switchover logic. If an error occurs in the RoleChangeService module, no leader can be elected for the cluster. Further troubleshooting is required to find the cause of the takeover failure.

An error occurs during the transaction callback

In the transaction sliding window of a leader replica, if the clog at the left boundary of the window is not stored on the majority of followers within the specified time threshold, and therefore does not slide out, the leader stops acting as a leader and no leader is available in the cluster. The default threshold is 10 seconds.

Here is a sample system log with an error message:

# grep 'check_leader_sliding_window_not_slide_' observer.log
[2022-11-29 11:19:12.239777] ERROR [CLOG] check_leader_sliding_window_not_slide_ (ob_log_state_mgr.cpp:2243) [7393][0][Y0-0000000000000000-0-0] [lt=66] leader_active_need_switch_(partition_key={tid:1099511627898, partition_id:2, part_cnt:0}, now=1669691952239687, last_check_start_id_time_=1669691939215423, sw max_log_id=16282, start_id=16282) BACKTRACE:0xfbdfc9f 0xfbc93dd 0x55391ef 0x9c0a1f9 0x9af215d 0x532d801 0x532c0ae 0x532bc69 0x532bb4f 0x529b6a6 0x9adee54 0x5522e04 0xfa8d123 0xfa8cf7f 0xfd6284f
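
To see how long the window has been stalled, you can compare the two timestamps in the log line. The sketch below assumes that now and last_check_start_id_time_ are microsecond timestamps and that their difference is what is compared with the threshold. This reading is consistent with the sample line but is not confirmed by this document, and the function name stall_seconds is hypothetical.

```python
import re

THRESHOLD_SECONDS = 10  # default threshold stated in the text


def stall_seconds(line):
    """Seconds between last_check_start_id_time_ and now, or None."""
    now = re.search(r"\bnow=(\d+)", line)
    last = re.search(r"last_check_start_id_time_=(\d+)", line)
    if not (now and last):
        return None
    # Both values are assumed to be microseconds since the epoch.
    return (int(now.group(1)) - int(last.group(1))) / 1_000_000
```

For the sample line above, the difference is about 13.02 seconds, which exceeds the 10-second default.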

You can further perform the following troubleshooting operations:

If log writing to the local clog disk is slow, check the clog disk metrics, such as load and await.
If log writing to the clog disk of a follower is slow, check the clog disk metrics, such as load and await.
Check whether network transmission between the leader and followers is slow. For example, a "packet fly cost too much time" message in the log indicates that packet transmission is taking too long.