High availability of ODP services means that ODP keeps providing services continuously, which minimizes the impact of faults on the proxy service. Service availability depends on the following factors:
Factor 1: Whether the ODP process runs stably and serves requests normally, and whether the process is restarted promptly after it is accidentally killed.
Factor 2: Whether the third-party modules on which ODP depends, such as OceanBase Cloud Platform (OCP) and OBServers, properly provide services, for example, whether OCP can properly provide HTTP services.
The following sections describe the high availability of ODP services in consideration of these two factors.
ODP process exceptions
In addition to server failures, ODP process exceptions also have a significant impact on service availability. ODP process exceptions occur for many reasons, for example, core dumps caused by program bugs, out-of-memory (OOM) exceptions caused by excessive memory usage, and processes accidentally killed during O&M operations. While the causes vary, the result is the same: the service port stops serving requests.
You can handle this issue by using the obproxyd.sh script. The script runs the nc command against the port that ODP listens on. If the TCP connection is established, the ODP process exists. If the process does not exist, the obproxyd.sh script restarts it. The ODP process can start and begin providing services in about 1s.
obproxyd.sh restores services on a best-effort basis. If a program bug causes a core dump every time the process runs, restarting the ODP process does not resolve the issue. In this case, follow the server exception troubleshooting process so that the LB component handles the exception. For bugs that rarely occur, a restart recovers services promptly and has little impact on the business.
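The probe-and-restart check described above can be sketched in Python. This is a minimal illustration of the idea, not the actual logic of obproxyd.sh: the function names, the timeout value, and the restart callback are all assumptions introduced for this example.

```python
import socket
from typing import Callable


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established.

    This mirrors what an nc-style probe checks: only that the port
    accepts connections, not that the service behind it is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_and_restart(
    host: str,
    port: int,
    restart: Callable[[], None],
    timeout: float = 1.0,
) -> bool:
    """Probe the port once and call restart() if the probe fails.

    Returns True if a restart was triggered. The restart callback is
    hypothetical; a real deployment would launch the ODP process here.
    """
    if port_is_open(host, port, timeout=timeout):
        return False
    restart()
    return True
```

A supervisor could call check_and_restart("127.0.0.1", port, restart_fn) in a loop with a short sleep between probes, where port is the ODP listening port and restart_fn is whatever command starts the process in your environment. Because the probe only tests TCP connectivity, it cannot detect a process that accepts connections but is otherwise unhealthy.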
OCP exceptions
Even if ODP itself is operating properly, it cannot provide services if a third-party module that it depends on is abnormal. This section describes OCP exceptions.
OCP provides centralized configuration services for ODP and stores key configuration information, such as the mapping between a cluster name and the RS list of that cluster, that is, the list of servers on which RootService resides. After ODP pulls the RS list information from OCP, ODP saves the information in the obproxy_rslist_info.json file in the local etc directory.
If OCP is unavailable, ODP can use the locally cached RS list file to provide services. To check whether OCP is available, you can use the curl command to access the URL of OCP. The configuration item ignore_local_config specifies whether ODP ignores the local cached files. The default value is true, which indicates that the local cached files are not used.
Using local cached files mitigates the problem that clusters cannot be accessed when OCP is abnormal. However, this approach has the following disadvantages:
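The curl-style availability check can also be scripted. The following is a minimal sketch that treats any 2xx HTTP response as "available"; the function name, the timeout, and the success criterion are assumptions for illustration, and the URL to probe is whatever OCP address your deployment uses.

```python
import urllib.error
import urllib.request


def ocp_is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET to `url` succeeds with a 2xx status.

    This is a rough equivalent of probing the URL with curl. It says
    nothing about whether the returned configuration is correct.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # URLError/OSError: connection refused, DNS failure, timeout,
        # or a non-2xx status (HTTPError is a subclass of URLError).
        # ValueError: the string is not a usable URL.
        return False
```

A monitoring job could call this function periodically and raise an alert when it returns False, so that operators know ODP may be relying on cached RS list data.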
Only the RS list information of accessed clusters is stored in the cache. Services remain unavailable for a cluster that has not been accessed.
The cached information is time-sensitive and becomes invalid if the RS list changes.
Therefore, OCP itself must be designed for high availability and must be able to recover from faults promptly to ensure service continuity.
In general, the LB component and the obproxyd.sh script can handle process exceptions and machine exceptions in a timely manner. The ODP startup process has been significantly optimized: ODP can complete initialization and configuration loading and start to provide external services within 1s.
The dependency of ODP on external modules has been minimized. However, ODP cannot avoid depending on OCP. Therefore, ODP uses a local cache to reduce this dependency and maximize service availability.
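The first cache limitation above can be illustrated with a small lookup sketch. The JSON layout used here, a mapping from cluster name to a list of server addresses, is a hypothetical simplification and not the real format of obproxy_rslist_info.json; the function name is likewise an assumption.

```python
import json
from pathlib import Path
from typing import List, Optional


def load_cached_rs_list(cache_file: Path, cluster: str) -> Optional[List[str]]:
    """Look up a cluster's RS list in a local cache file.

    Assumes a hypothetical layout: {"cluster_name": ["ip:port", ...]}.
    Returns None when the file is missing or unreadable, or when the
    cluster was never cached.
    """
    try:
        data = json.loads(cache_file.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    if not isinstance(data, dict):
        return None
    servers = data.get(cluster)
    if isinstance(servers, list) and servers:
        return servers
    # The cluster has not been accessed before, so nothing was cached.
    return None
```

A lookup for a cluster that has never been accessed returns None, which corresponds to the case where services stay unavailable while OCP is down. The sketch also cannot tell whether a cached RS list is stale, which is the second limitation.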