OceanBase Database provides two consistency levels: STRONG and WEAK. In a strong consistency read operation, the latest data must be read, and requests are routed to the leader. In a weak consistency read operation, the reading of the latest data is not required, and requests are preferentially routed to the followers. In OceanBase Database, write operations always feature strong consistency, which means that the leader provides services. By default, read operations in OceanBase Database feature strong consistency, which means that the leader provides services. However, you can specify weak consistency reads. In this case, the followers preferentially provide services.
Specify a consistency level
You can specify a consistency level in the following ways:
Specify the
ob_read_consistencyvariableSet a session variable to affect the current session.
obclient> SET ob_read_consistency = WEAK; obclient> SELECT * FROM t1; -- Weak consistency readSet a global variable to affect all sessions subsequently created.
obclient> SET GLOBAL ob_read_consistency = STRONG;
Specify a hint
Set the read consistency level to WEAK. This setting takes precedence over
ob_read_consistency.obclient> SELECT /*+READ_CONSISTENCY(WEAK) */ * FROM t1;Set the read consistency level to STRONG.
obclient> SELECT /*+READ_CONSISTENCY(STRONG) */ * FROM t1;
Consistency levels of SQL statements
DML write statements (
INSERT,DELETE, andUPDATE): These statements forcibly apply strong consistency and make modifications based on the latest data.SELECT FOR UPDATE(SFU): Similar to write statements, this statement forcibly applies strong consistency.Read-only statement (
SELECT): This statement allows you to set different consistency levels to meet different read requirements.
Consistency levels of transactions
The best practice of weak consistency reads is to set the consistency level for the SELECT statement that is not in a transaction to WEAK. The semantics of the statement is certain. In scenarios where transactions are explicitly started, OceanBase Database allows you to configure different consistency levels for different statements in syntax. This can be confusing, and inappropriate configurations can cause errors during SQL execution.
The principles are described as follows:
The consistency level applies to the entire transaction. All statements in a transaction follow the same consistency level.
The consistency level of a transaction is determined by the first statement in the transaction. If different consistency levels are specified for subsequent
SELECTstatements, they are forcibly changed to the consistency level of the transaction.The write statements and SFU statement support only strong consistency. If the consistency level of the transaction that contains the statements WEAK,
OB_NOT_SUPPORTEDis reported.
Here are some examples:
BEGIN;
-- This statement is used to modify data, and its consistency level must be STRONG. Therefore, the consistency level of the entire transaction is STRONG.
insert into t1 values (1);
-- The consistency level of this statement is WEAK.
-- However, because the consistency level of the first statement is STRONG, the consistency level of this statement is forcibly set to STRONG.
select /*+READ_CONSISTENCY(WEAK) */ * from t1;
COMMIT;
BEGIN;
-- The SFU statement is used to modify data, and its consistency level must be STRONG. Therefore, the consistency level of the entire transaction is STRONG.
select * from t1 for update;
-- The consistency level of this statement is WEAK.
-- However, because the consistency level of the first statement is STRONG, the consistency level of this statement is forcibly set to STRONG.
select /*+READ_CONSISTENCY(WEAK) */ * from t1;
COMMIT;
BEGIN;
-- The consistency level of the first statement is WEAK.
select /*+READ_CONSISTENCY(WEAK) */ * from t1;
-- Although the consistency level of this statement is STRONG, it inherits the consistency level of the first statement. Therefore, the consistency level is forcibly set to WEAK.
select * from t1;
COMMIT;
BEGIN;
-- The consistency level of the first statement is WEAK.
select /*+READ_CONSISTENCY(WEAK) */ * from t1;
-- This statement is used to modify data, and its consistency level must be STRONG. Because the consistency level of the first statement is WEAK, NOT SUPPORTED is reported.
insert into t1 values (1);
-- The SFU statement is used to modify data, and its consistency level must be STRONG. Therefore, NOT SUPPORTED is also reported.
select * from t1 for update;
COMMIT;
The consistency level of a single SQL statement can be determined based on the following priorities in descending order:
Consistency level determined based on the statement type: For example, the consistency level of DML and SFU statements must be STRONG.
Consistency level of the transaction: If a statement is in a transaction and is not the first statement in the transaction, the consistency level of the transaction applies to the statement.
Consistency level specified by a hint.
Consistency level specified by the
ob_read_consistencyvariable.Default consistency level: STRONG.
Relationship with isolation levels
The STRONG consistency level supports all isolation levels.
The WEAK consistency level supports only the
read committedisolation level. If the WEAK consistency level is applied to other isolation levels,OB_NOT_SUPPORTEDis reported.
Parameters of weak consistency reads
Versions earlier than V2.2.x
| Name | Scope | Semantics |
|---|---|---|
| enbale_causal_order_read | Cluster level | Specifies whether to enable monotonic reads. Default value: False. |
| max_stale_time_for_weak_consistency | Cluster level | Specifies the maximum lag time of weak consistency reads. Default value: 5, in seconds. |
Versions earlier than V2.2 support only proxy-level monotonic reads, which means that monotonic reads can be ensured when a client always accesses the same proxy. To support this feature, OceanBase Database maintains a monotonically increasing weak consistency read version on proxies. However, monotonic read cannot be ensured if a client accesses different proxies. If you want to enable cluster-level monotonic reads, use OceanBase Database V2.2.x or later.
Versions from V2.2.x to V3.x
| Name | Scope | Semantics |
|---|---|---|
| enable_monotonic_weak_read | Tenant level | Specifies whether to enable monotonic reads. Default value: True. |
| max_stale_time_for_weak_consistency | Tenant level | Specifies the maximum lag time of weak consistency reads. Default value: 5, in seconds. |
| weak_read_version_refresh_interval | Cluster level | Specifies the interval for refreshing the weak consistency read version. Default value: 50, in milliseconds. |
The detailed description of the parameters is as follows:
enable_monotonic_weak_read: specifies whether to enable monotonic reads for a tenant.Weak consistency read requests are routed to different replicas. Versions of data read from different replicas are not guaranteed. After monotonic reads are enabled, OceanBase Database ensures that the read data is not rolled back to earlier versions. A typical scenario of monotonic reads is to ensure the causal sequence. For example, Transaction T2 is committed after Transaction T1 is committed. If a client has read the modifications of Transaction T2, it can definitely read the modifications of Transaction T1 later.
max_stale_time_for_weak_consistency: specifies the maximum lag time of weak consistency reads.OceanBase Database supports bounded staleness for weak consistency reads. Bounded staleness ensures that the read data lags behind the latest data within the time specified by the
max_stale_time_for_weak_consistencyparameter. The default value is 5s. You can specify this parameter for a tenant.Generally, the follower in each partition lags behind the latest data by 100 ms to 200 ms. A weak consistency read is often complete in hundreds of milliseconds. However, a weak consistency read can take more time if network jitters occur or no leader is available. If a replica lags behind the latest data by more than the time specified by the
max_stale_time_for_weak_consistencyparameter, the replica is unreadable, and the retry mechanism retries other valid replicas. If all replicas are unreadable, the retry mechanism continues to retry the replicas until the statement times out.After monotonic reads are enabled, OceanBase Database maintains a weak consistency read version at the cluster level for each tenant. The version is also constrained by the
max_stale_time_for_weak_consistencyparameter. The version is generated based on the minimum replay progress of replicas in all partitions under the tenant. If a replica lags behind the latest data by more than the time specified by themax_stale_time_for_weak_consistencyparameter, the replica is not counted. In the current mechanism, a lagging replica affects the overall monotonic read version. For example, two replicas are respectively generated for two partitions under a tenant, one replica lags behind by 100 ms, and the other lags behind by 1s. In this case, the overall monotonic read version is 1s. Replicas do not lag behind for a long period of time. The monotonic read version is usually within hundreds of milliseconds.weak_read_version_refresh_interval: specifies the refresh interval for the weak consistency read version.This parameter affects the staleness of read data. The value of this parameter cannot exceed that of the
max_stale_time_for_weak_consistencyparameter. If this parameter is set to 0, monotonic weak consistency reads are disabled, and OceanBase Database no longer maintains weak consistency read versions at the cluster level. In addition, you can set this parameter only for a cluster, not for a tenant.
Version V4.x
In V4.x, the purposes of the max_stale_time_for_weak_consistency and weak_read_version_refresh_interval parameters are the same as those in earlier versions.
Weak consistency read timestamp
Weak consistency reads usually apply to query statements for followers and standby databases. In a weak consistency read operation, OceanBase Database still returns all committed data of a transaction to ensure a transaction snapshot, without returning data of uncommitted transactions.
A weak consistency read timestamp provides the following features:
Before a weak consistency read operation is performed, a read snapshot must be determined.
A partition maintains a maximum, secure and readable timestamp in real time. Any read request whose read snapshot is later than the timestamp is not allowed to read the replica.
Generate a timestamp
OceanBase Database supports monotonic and non-monotonic weak-consistency reads. Timestamps are generated in different ways based on the read types.
Monotonic reads
In monotonic reads, read requests are sorted by absolute initiation time, and the read snapshot used for a request initiated later is no earlier than that used for a request initiated earlier. This ensures that read data is not rolled back to earlier versions. The monotony here is defined with respect to read snapshots for statements. Monotonic reads rely on cluster-level monotonic weak consistency read versions, and the global version must be retained without rollback.
The global version is generated based on the earliest and latest secure versions maintained by each OBServer node. The generation of the global version relies on the progress of log synchronization on the local server. In OceanBase Database, transaction logs are maintained by three modules: apply service, replay_service, and transaction.
The
apply servicemodule writes transaction logs to the local disk, sends logs to the standby server, and receives logs synchronized from the leader. The standby server maintains a sliding window for each log stream to manage received logs in ascending order of log IDs. Logs are extracted in sequential order from the sliding window and submitted to thereplay_servicemodule for replay.The
replay_servicemodule replays logs. It hashes multiple logs of a transaction to the same worker thread for sequential replay. However, logs of multiple transactions in the same partition are replayed in a concurrent manner. Logs of a partition are sequenced into a linked list by commit timestamp and replayed by multiple threads in a concurrent manner.The
transactionmodule manages transaction status. It replays redo logs to MemStores and records transaction status.
To summarize, the apply service module receives logs and sends the logs to the replay_service module for replay, and the transaction module replays transaction logs. To prevent the standby server from reading only a part of a transaction in a single partition, you must obtain the secure version for the read in the partition and make sure that all transactions earlier than the version have been replayed. The secure version is calculated by using the following formula:
slave_read_ts = min(replay_service_ts, apply_service_ts, trans_service_ts) - 1
Take note of the following description:
apply_service_tsindicates the timestamp of the next log to be extracted or generated in theapply servicesliding window.replay_service_tsindicates the minimum value of the timestamps of logs that are not replayed in the log stream.trans_service_tsindicates the minimum value ofprepare_log_tsof all transactions being replayed.
Here is an example: The sliding window of the apply service module contains two logs. The log IDs and commit timestamps of the two logs are (10, 100) and (11, 200), respectively. The replay_service module has not replayed the logs whose log IDs and commit timestamps are (7, 70) and (8, 80) in the log stream. The transaction module contains only one transaction that is being replayed. The redo and prepare logs are replayed. The log ID of the prepare log is 9, and the value of the prepare_log_ts parameter is 90. In this case, the secure version of the partition is calculated by using the following formula: Secure version = min(100, 70, 90) – 1 = 69.
Non-monotonic reads
Compared with monotonic reads, non-monotonic reads do not ensure a sequential increase of read snapshots for read requests that are initiated in sequence. The synchronization progress of different replicas for the same data varies. Therefore, if two requests read different replicas, the data may be rolled back.
Atomicity
Both monotonic and non-monotonic reads can ensure the atomicity of modified data of the same transaction in a single partition. In any scenario, the read data is definitely atomic provided that the read snapshot is earlier than the maximum secure and readable version maintained by a single partition. This is a principle that is ensured by atomicity.
Therefore, OceanBase Database is designed to check whether partitions are readable. If a partition is unreadable, the SQL execution engine retries other replicas.