Network failure is often caused by various factors such as a single node’s network card or network cable failure, a network equipment failure in an IDC (Internet Data Center), or inter-IDC transmission jitter. To address clear network issues, OceanBase recommends isolating the faulty component before conducting analysis.
Scenario
In the event of network failure or jitter, an election without a leader will be repeatedly initiated among the partitions in the cluster. This is manifested at the application layer as application request timeouts and failed writes.
Emergency handling procedure
First, check whether there are any network monitoring metrics showing packet loss, timeout retransmission, etc.
Run the tsar --tcp -i 1 -d20190902 command. If retran >= 0.05 is observed for 10 consecutive seconds, it indicates a network transmission failure.
Notice
In the previous command, replace 20190902 with the specific date you want to check, and -i 1 specifies a record interval of 1 minute.
Occasional network jitter can be temporarily ignored. However, if network jitter leads to node or IDC failures, follow the instructions below.
Single node network failure
If it is a single node’s a network failure, such as a high retransmission rate, run the stop server command to isolate the node. Once the single node’s network issue is addressed, run the start server command to reactivate the node. If it is further determined that the network card of the node is faulty, replace the faulty node.
IDC network jitter
For network jitter affecting the entire IDC, run the –stop zone command to temporarily isolate the replicas in the faulty IDC. Once the network is restored, run the –start zone command to reactivate the IDC.