The intra-tenant routing feature allows ODP to select an appropriate node to execute SQL statements after obtaining the server list of the tenant.
During intra-tenant routing, ODP functions like a common proxy such as HAProxy, and selects a server from the tenant server list to implement a one-to-one correspondence between the client connection and the server connection. This achieves simple implementation, but cannot meet performance and high availability requirements. Therefore, routing affects connection management.
Intra-tenant routing is the most complex part of the routing feature for the consideration of better performance and higher availability. This topic describes leader routing, follower routing, and tenant server routing, information caching, routing strategies, transaction routing, and common issues.
Leader routing
A distributed system usually adopts the multi-replica mechanism for disaster recovery and high availability. To ensure data consistency between replicas, the Paxos or Raft algorithm is often used. In practice, a special replica is present, which usually has the latest data and controls the synchronization of data among the replicas. This replica is called the leader, and other replicas are collectively called followers.
OceanBase Database has only one leader. Therefore, the leader routing strategy is to send SQL statements to this replica. In this example, the select c1 from t1 statement is used to describe two conditions that must be met for leader routing:
The SQL statement operates on (for example, queries, inserts into, updates, or deletes from) an entity table, such as the t1 table in the preceding example.
The request reads the latest data. In other words, the request is a strong read.
In ODP logs, the keyword for leader routing is ROUTE_TYPE_LEADER. To implement leader routing, ODP needs to know the ID and location of the partition to be accessed. In OceanBase Database, a table can have one or multiple partitions.
Single-partition table
The table has only one partition. In this case, ODP can obtain the replica location information from the table name.
Multiple-partition table
The table has multiple partitions. ODP needs to calculate the partition ID based on the table name and partitioning key in the SQL statement, and then obtain the replica location information. Replica information includes both the leader information and follower information. Leader routing uses only the leader information.
Multi-partition routing involves partitioning methods (such as Hash, Range, and List partitioning), partitioning key types (such as number and varchar), partitioning algorithms (such as Hash), and type conversion (if the value type in the SQL statement differs from the type of the partitioning key). Therefore, the implementation is relatively complex. For a subpartitioned table, routing involves the following 10 steps:
Parse the SQL statement to get the table name.
Access an internal table of OceanBase Database based on the table name and check whether the table is a partitioned table.
Parse the column expression, such as c1=1, in the SQL statement.
Access an internal table of OceanBase Database to obtain the information about the partitioned table.
Access an internal table of OceanBase Database to obtain the partition information.
Calculate the
partition_idvalue of the partition based on the column expression.Access an internal table of OceanBase Database to obtain the subpartition information.
Calculate the
partition_idvalue of the subpartition based on the column expression.Calculate the final
partition_idvalue.Access an internal table of OceanBase Database to get location information corresponding to the
partition_idvalue.
Follower routing
Similar to leader routing, follower routing also must meet two conditions:
The SQL statement queries an entity table, such as the t1 table in
select c1 from t1.The request does require the latest data. In other words, the request is a weak read.
These two conditions are different from the conditions for leader routing.
For the first condition, follower routing supports only query statements and does not support other statements. This is also the implementation requirement of the Paxos algorithm.
For the second condition, the weak-read setting
ob_read_consistency=weakmust be explicitly specified by using a hint, in the session, or in another way.
For follower routing, the SQL statement can be sent either to the leader or a follower. Therefore, follower routing has more choices. For more information about the selection from multiple replicas during routing, see Routing strategies.
Read/Write splitting is an important feature related with follower routing. Read/Write splitting on requests can reduce the pressure on the leader. ODP also supports read/write splitting. We are constantly improving the feature and have helped customers solve performance issues in various scenarios, for example, OceanBase Cloud.
Tenant server routing
Sometimes, ODP cannot obtain the leader or follower. In this case, ODP can select a tenant server, which is tenant server routing. Typical scenarios of tenant server routing are as follows:
The SQL statement itself does not contain a table name, for example,
select 1.The server where the leader or follower is located is faulty.
Due to its limitations, ODP fails to obtain the table name from a complex SQL statement, or cannot take a replica route.
Through tenant server routing, ODP sends the SQL statement to a server where the tenant is located, thereby ensuring proper functioning. As with follower routing, tenant server routing also involves selection from multiple replicas, and therefore requires routing strategies.
Information caching
For leader routing, follower routing, and tenant server routing, ODP needs to query replica routing information from the sys tenant. To improve performance and reduce pressure on the sys tenant, ODP caches routing information.
Data freshness is crucial for cached information. Cached information of the sys tenant can be created and updated regularly by accessing internal tables. However, ODP does not adopt this mechanism to cache the information for replica routing. Otherwise, ODP will have to pull replica routing information for too many SQL statements, incurring great pressure on the sys tenant.
Number of SQL statements = Number of replicas × Number of ODPs.
OceanBase Database supports up to hundreds of thousands and even millions of partitions. Therefore, data freshness of cached information is a big challenge for ODP. If expired cached information is used, the route is inaccurate. The following describes how ODP ensures data freshness of cached information.
To view the cached information of a table, log on to the ODP with the root@proxysys account and run the show proxyroute command:
MySQL [(none)]> show proxyroute like 'ob1.hudson tt1 test sbtest1'\G
*************************** 1. row ***************************
cluster_name: ob1.hudson
tenant_name: tt1
database_name: test
table_name: sbtest1
state: AVAIL
partition_num: 1
replica_num: 3
table_id: 1101710651081698
cluster_version: 2
schema_version: 1649196335597728
from_rslist: N
create_time: 2022-04-07 12:41:16
last_valid_time: 2022-04-07 12:41:16
last_access_time: 2022-04-07 12:41:16
last_update_time: 1970-01-01 08:00:00
expire_time: 2022-04-12 12:48:42
relative_expire_time: 2022-04-07 12:40:41
server addr: server[0]=xx.xx.xx.xx:xx,leader,FULL; server[1]=xx.xx.xx.xx:xx,follower,FULL; server[2]=xx.xx.xx.xx:xx,follower,FULL;
This example shows the important information in the cache, including the cluster name, tenant name, database name, table name, number of partitions, number of replicas, time information, address information, and cache status. The cache status information is important for the refresh mechanism, because ODP implements caching strategies by modifying the status information. The cached information can be in one of the following states:
BUILDING: The cached information is being created. ODP needs to wait until the cached information is created before using it.AVAIL: The cached information is in a normal state and can be directly used.DIRTY: The cached information is invalid and inaccurate.UPDATING: The invalid cached information is being updated.DELETED: The cached information has been deleted and cannot be used. It will be cleared later.
ODP refreshes the cache status by modifying the cache status, to ensure data freshness. The following describes the cache refresh mechanism in terms of creation, eviction, and refresh.
Cache creation: When ODP accesses a partition for the first time, it obtains the routing information by querying the
__all_virtual_proxy_schematable of the sys tenant. The specified table name is the actual table name. The table cache is different from the tenant routing information. The cache status changes toAVAILafter the cache is created.Cache eviction: ODP changes the cache status to
DIRTYwhen the OBServer returns the routing inaccuracy error, which is indicated in theis_partition_hitfield carried in the OK packet.Cache refresh: When the status of the cached information changes to
DIRTY, the expired cached information is evicted and the cache is re-created or the cached information is updated.
Note
At present, cache eviction is implemented based on the feedback packet of the OBServer. Therefore, the cache status is not perceived in real time. The cached information can only be refreshed upon an incorrect route feedback. This may cause issues.
Routing strategies
Overview
Routing strategies help ODP select a suitable replica for SQL statements. The multiple replicas can be the multiple replicas selected during follower routing or during tenant server routing for statements such as select 1 from dual.
ODP implements three types of routing strategies:
Primary zone-based routing, which takes the top priority
LDC-based routing, which takes the second priority
Random routing, which takes the third priority
Primary zone-based routing
ODP sends an SQL request preferentially to the server where the primary zone is located. The zone where the leader is located is called the primary zone. Adopt this routing strategy in the following scenarios:
In the high-performance deployment architecture commonly used for OceanBase Database, the primary zone of a tenant is deployed on one server. This mechanism reduces network overheads for a distributed system.
During leader routing, ODP may fail to get the table name or calculate the partition ID. In this case, ODP can adopt this routing strategy to send the requests to the leader as far as possible.
LDC-based routing
LDC-based routing is a routing strategy based on location information, which includes the following attributes:
IDC: indicates a logical data center.
Region: indicates a city.
Both ODP and OBServers can set the LDC information. Based on the LDC information, ODP can determine the location relationship with an OBServer. After the LDC information is specified, ODP adopts LDC-based routing by default.
LDC settings for an OBServer
You can set the Region or IDC attribute for each zone of an OBServer. The Region attribute is usually set to the city name, which is case-sensitive. The IDC attribute specifies the data center where the zone is located. Generally, the data center name in lowercase is specified for the IDC attribute. The SQL statement is as follows:
alter system modify zone "z1" set region = "SHANGHAI";
alter system modify zone "z1" set idc = "zue";
Return result:
mysql> select * from DBA_OB_ZONES;
+-------+----------------------------+----------------------------+--------+-----+----------------+-----------+
| ZONE | CREATE_TIME | MODIFY_TIME | STATUS | IDC | REGION | TYPE |
+-------+----------------------------+----------------------------+--------+-----+----------------+-----------+
| z1 | 2022-10-31 11:48:29.040552 | 2022-10-31 11:48:29.041609 | ACTIVE | zue | SHANGHAI | ReadWrite |
+-------+----------------------------+----------------------------+--------+-----+----------------+-----------+
| z2 | 2022-10-31 11:48:29.040552 | 2022-10-31 11:48:29.041609 | ACTIVE | zue | SHANGHAI | ReadWrite |
+-------+----------------------------+----------------------------+--------+-----+----------------+-----------+
| z3 | 2022-10-31 11:48:29.040552 | 2022-10-31 11:48:29.041609 | ACTIVE | ztg | HANGZHOU | ReadOnly |
+-------+----------------------------+----------------------------+--------+-----+----------------+-----------+
3 rows in set
LDC settings for ODP
You can configure LDC settings for ODP in one of the following ways. In this example, the sample zue IDC is used.
Set the LDC parameters when you start the obproxy process. This is the recommended method. Sample code:
cd /opt/taobao/install/obproxy ./bin/obproxy -o proxy_idc_name=zueExecute the ALTER statement on a support client tool to modify the ODP settings. Sample statement:
mysql> alter proxyconfig set proxy_idc_name='zue';
Run the internal command show proxyinfo idc; of ODP to check the LDC configuration for ODP.
mysql> show proxyinfo idc;
+-----------------+--------------+-----------------+-----------------+--------------------------------------------------------+-------------+--------------------+
| global_idc_name | cluster_name | match_type | regions_name | same_idc | same_region | other_region |
+-----------------+--------------+-----------------+-----------------+--------------------------------------------------------+-------------+--------------------+
| zue | obcluster | MATCHED_BY_NONE | [] | [[0]"z1", [1]"z1", [2]"z2", [3]"z2", [4]"z3", [5]"z3"] | [] | [] |
+-----------------+--------------+-----------------+-----------------+--------------------------------------------------------+-------------+--------------------+
| zue | obcluster | MATCHED_BY_IDC | [[0]"SHANGHAI"] | [[0]"z1", [1]"z1", [2]"z2", [3]"z2"] | [] | [[0]"z3", [1]"z3"] |
+-----------------+--------------+-----------------+-----------------+--------------------------------------------------------+-------------+--------------------+
| zue | obcluster | MATCHED_BY_IDC | [[0]"SHANGHAI"] | [[0]"z1", [1]"z1", [2]"z2", [3]"z2"] | [] | [[0]"z3", [1]"z3"] |
+-----------------+--------------+-----------------+-----------------+--------------------------------------------------------+-------------+--------------------+
3 rows in set
Note
If OceanBase Database is deployed in only one IDC, the LDC information is of little use, because the latency among servers in the same IDC is the same by default. If you still want to adopt the LDC-based routing strategy in this case, plan the LDC architecture, and set the LDC attributes for both the OBServer and ODP. If OceanBase Database is deployed across multiple IDCs, you can plan the LDCs based on the IDCs and city.
In some special cases, you can specify LDC settings by using the trick method to enable LDC-based routing. However, we do not recommend you do that.
Random routing
ODP adopts the random routing strategy when multiple replicas are still available for selection after the primary zone-based routing and LDC-based routing. If primary zone-based routing is not enabled or LDC-based routing is not configured, ODP directly uses random routing.
Configure and view routing strategies
Data routing is complex because different routing strategies are available. By default, ODP first perform leader routing and follower routing, and performs tenant server routing if no replica is available. If only one replica is selected, ODP directly routes the data to the replica. Otherwise, ODP routes the data based on routing strategies.
Primary zone-based routing and LDC-based routing are controlled by the following parameters:
enable_primary_zone: specifies whether to use the primary zone-based routing strategy. The valuetrueindicates to use primary zone-based routing.proxy_idc_name: specifies the name of the IDC. LDC-based routing is used when this parameter is specified.
If you want to use routing strategies other than the existing ones, you can modify the proxy_route_policy parameter to set the new strategy with the top priority. The following two routing strategies are frequently used, which are both related to weak reading.
FOLLOWER_FIRST: Requests are first routed to a follower, and are routed to the leader only when no follower is available.FOLLOWER_ONLY: Requests are routed only to a follower. If no follower is available, an error is returned.
You can query the route_type field in ODP logs for the routing strategies used by ODP. For example, ROUTE_TYPE_LEADER indicates that leader routing is used. ROUTE_TYPE_NONPARTITION_UNMERGE_LOCAL indicates a more complex situation. The keywords are described as follows:
PARTITON: indicates to select a server with replica data, regardless of the replica type.NONPARTITION: indicates to select a tenant server regardless of the table data distribution.FOLLOWER: indicates to send the request to a follower.LEADER: indicates to send the request to the leader.UNMERGE: indicates to send the request to a server not in the progress of major compaction.MERGE: indicates to send the request to a server in the progress of major compaction.LOCAL: indicates to send the request to a server in the same IDC.REMOTE: indicates to send the request to a server in a different IDC in the same city.REGION: indicates to send the request to a server in another city.READONLY: indicates to send the request to a server in a READONLY zone.READWRITE: indicates to send the request to a server in a READWRITE zone.DUP: indicates to send the request to a server where the replicated table is located.
Transaction routing
The preceding section describes the routing strategies for a single SQL statement. Some features, such as the transaction feature, involve one or more SQL statements. During transaction routing, the first statement of the transaction is routed based on the preceding routing strategies. Subsequent SQL statements are directly sent to the destination node of the first statement.
For more information about why subsequent SQL statements can be sent only to the destination node of the first statement, see the Functionality section in Factors affecting data routing. At present, ODP does not support transaction state migration, and therefore is subject to this limitation.
Common issues
When ODP implements intra-tenant routing, the following issues may occur:
Failed to get the table name for leader routing.
The SQL statement is too complex. Currently, ODP cannot identify all SQL statements.
The SQL statement is too long. The ODP buffer for SQL statements is 4 KB in size. An excessively long SQL statement may not be fully parsed.
Partition calculation failed for leader routing.
ODP does not support the calculation of multiple partitioning keys, such as range(c1,c2).
The SQL statement does not contain a partitioning key expression, or ODP failed to extract the partitioning key expression.
ODP cannot handle the partitioning key expression, such as c1=now(). ODP does not support the now() function.
Expired cached information is used.
ODP does not have an active refresh mechanism for cached information .
The OBServer does not return feedback for routing.
A configuration error occurred.
If the routing strategy is not set to
FOLLOWER_FIRST, weak data reading requests are sent to the leader in follower routing.The LDC routing information is not configured or the information is incorrectly configured, resulting in routing across IDCs or cities.