Network failures are often caused by a faulty network interface controller (NIC) on a server, faulty network cables, faulty network devices in an IDC, or jitter in inter-IDC transmission. After a network exception is confirmed, OceanBase Database allows you to isolate the fault first and analyze it afterward.
Scenario
If a network failure or network jitter occurs, the partitions in the cluster repeatedly initiate leaderless elections. This results in application request timeouts and write failures.
Emergency procedure
View network monitoring metrics to check whether packet loss or timeout retransmission is occurring.
Run the tsar --tcp -i 1 -d20190902 command. If the output shows "retran >= 0.05" and the condition persists for 10 seconds, network transmission is abnormal.
Notice
Replace 20190902 with the date that you want to check. The -i 1 field indicates a record interval of 1 minute.
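The retransmission check above can be sketched as a small helper. This is a minimal illustration, not part of tsar or OceanBase Database: it assumes the retran values have already been extracted from the tsar output into a list of numbers, and the function names and the min_samples parameter are hypothetical. How many consecutive samples count as a sustained problem depends on the sampling interval you use, so choose min_samples accordingly.

```python
from typing import Iterable


def longest_run_at_or_above(values: Iterable[float], threshold: float = 0.05) -> int:
    """Return the length of the longest run of consecutive samples >= threshold."""
    longest = 0
    current = 0
    for value in values:
        if value >= threshold:
            current += 1
            longest = max(longest, current)
        else:
            # A sample below the threshold breaks the run.
            current = 0
    return longest


def is_sustained(values: Iterable[float], threshold: float = 0.05, min_samples: int = 2) -> bool:
    """Report whether the retransmission rate stayed at or above the threshold
    for at least min_samples consecutive samples (min_samples is an assumption;
    derive it from your sampling interval)."""
    return longest_run_at_or_above(values, threshold) >= min_samples


if __name__ == "__main__":
    # Hypothetical retran samples, already parsed from monitoring output.
    samples = [0.01, 0.06, 0.07, 0.02, 0.05]
    print(is_sustained(samples, threshold=0.05, min_samples=2))
```

A single sample above the threshold is treated as occasional jitter; only a run of consecutive samples triggers the isolation steps.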
You can ignore occasional network jitter. However, if the jitter causes errors on a server or in an IDC, perform the following steps:
Network errors of a single server
If a network error, such as a high retransmission rate, occurs on a single server, run the --stop server command to isolate the server. After the error is eliminated, run the --start server command to enable the server again. If you confirm that the error is caused by a faulty NIC on the server, or if the network cannot be restored within a short period, replace the server. For more information, see Replace an OBServer node.
Network jitter of an IDC
If network jitter affects an entire IDC, run the --stop zone command to isolate the replicas in the faulty IDC. After the network is restored, run the --start zone command to enable the IDC again. For more information, see Isolate a failed zone.