Cluster management|V4.1.0| docs|Distributed Database

Election

What are the election types in OceanBase Database? Who initiates the elections?

An OceanBase cluster involves three types of elections: election without a leader, re-election with a leader, and leader switch. An election without a leader is initiated only when the cluster is started or the original leader fails the re-election. Except for manual leader switch and election without a leader, other automatic elections are all initiated by the leader of the Paxos group. In the Paxos group of three replicas, the leader switch is controlled by the leader. When the leader notices a follower with a higher priority in the renewal process, the leader initiates the leader switch process to enable this follower to take the leader role.

What replicas can take part in elections?

An OceanBase cluster involves three types of replicas: full-featured replica, log-only replica, and read-only replica. The log-only replicas and full-featured replicas take part in elections. Only a full-featured replica can be a candidate of the leader in an election to take the leader role to provide database services. A log-only replica can provide log services but cannot be a candidate of the leader in an election. Therefore, it cannot take the leader role to provide data services.

What are the requirements of the election module of an OceanBase cluster on the infrastructure?

OceanBase Database has imposed no requirements on the clock offset since V4.0.0. An OceanBase cluster adopts the strong leader mode, in which the lease duration is limited. The default lease duration is 4s. This requires that the maximum one-way message latency in the application layer not exceed 1s between any two replicas in the environment. However, because of the network layers and the implementation of the operating system and upper-layer applications, the message latency in the IP layer is amplified in the application layer, and no reference metric is available for the amplification factor. Therefore, make sure that the ping latency in your environment is as low as possible.

How does the election module of an OceanBase cluster ensure an RTO of no more than 8s?

The default lease duration for electing a leader in an OceanBase cluster is 4s. When the leader fails, at most 4s elapse before the lease expires. After the lease expires, the majority of replicas initiate an election as soon as possible to elect a new leader based on the priorities of replicas in the cluster. The election module optimizes the election without a leader that is initiated after the leader fails, so the average no-leader time approximates the lease duration.

After the election succeeds, additional work is needed before the cluster services resume. For example, the new leader needs to synchronize logs in the reconfirm phase of Paxos, the status of transactions uncommitted on the old leader needs to be recovered on the new leader, and the location information of the new leader needs to be reported to Root Service. If an OceanBase Database Proxy (ODP) is configured, SQL statements can be routed to the correct leader only after the leader location information is updated on the ODP. The time consumed in the reconfirm phase depends on the amount of log data to be synchronized. The election module ensures that a replica with a large amount of log data to be synchronized is not elected as the new leader.

According to the working mechanism of the election module, if the original leader is re-elected, it continues to provide database services. If the original leader fails the re-election, an election without a leader is initiated and the new leader is elected within 8s. Within the RTO of 8s, the election layer consumes 4s and the remaining time is left for the recovery of other phases. Although the recovery time of other phases is positively related to the amount of log data to be synchronized and the scale of uncommitted transactions, the recovery of each phase does not take more than 100 ms. In this way, the overall service recovery can be completed within 8s.

If the majority of replicas in an OceanBase cluster fail, can the election module elect a new leader to ensure high availability?

If the majority of replicas in an OceanBase cluster fail, no new leader can be elected to provide data services.

Does the election module of an OceanBase cluster need to be maintained?

OceanBase Database does not provide any method or API for users to intervene in the election module, such as to control the election rules of the majority of replicas. During the normal running of the cluster, the DBA or administrator does not need to perform any manual operations or intervention to control or affect the election module. Any cluster management operations, such as stopping an OBServer node, launching a new OBServer node, or changing the primary zone of the tenant, may cause the election module to initiate an election to ensure that a leader is available to provide data services. There are no circumstances so far that require the administrator to perform O&M intervention on the election module.

Which replica will be elected as the leader in an election? How does OceanBase Database ensure that the replica being elected as the leader is a better choice?

During an election in an OceanBase cluster, the election module compares the priorities of the replicas to ensure that a replica with better health conditions is elected as the leader. Specifically, the following factors of each replica are considered during an election: whether the replica is a legitimate candidate, the health score of the replica, the version number of the replica member list, and the log synchronization status of the replica. In this way, an election automatically triggered by the OceanBase cluster elects a replica that is healthier and on the latest member list as the leader. In OceanBase Database V4.0.0, the priorities are consistent in election without a leader and leader switch. The version number corresponding to member changes has the highest priority, and it is also the only priority that is built into the election protocol. A replica with a larger version number will not vote for a replica with a smaller version number.

When will automatic leader switch occur?

When the original leader renews its lease, it will periodically interact with the followers. If it finds a follower with a higher priority, it will trigger a leader switch action to enable this follower to take the leader role. A leader switch action is triggered in the following cases:

The primary zone of the tenant changes.
A server or zone is stopped.
An OBServer node is stopped by using the kill -15 command.
Log streams are migrated.

How do I view the causes that trigger leader switch?

You can query the DBA_OB_SERVER_EVENT_HISTORY view for the events that occurred on the election module of a log stream.

How do I view information about election without a leader?

You can query the DBA_OB_SERVER_EVENT_HISTORY view for all election events of all log streams, including election without a leader, leader switch, and member change.

What does the error code 4038 returned by an OceanBase cluster mean?

The definition of the error code 4038 is "not master". This error code may be returned in the following cases:

A leader exists but is not on the server where the running process is located. In this case, you need to troubleshoot upper-layer modules.
No leader exists. The network connection status in the log stream level is abnormal. For example:
- The tenant runs out of memory and therefore cannot receive messages but it can still send messages. In this case, only one-way network connection is available.
- Messages are accumulated in the worker queue and no more messages can be received.
- The server is stopped and cannot receive messages.
- The network connection is abnormal due to other causes.

Does the clog module affect leader elections? How?

In OceanBase Database V4.0.0, the abnormal status of the clog module is a priority considered during elections. Specifically, if the clog module of the leader is abnormal, the election priority of the leader is lowered. When a follower has a higher priority than the leader, leader switch occurs.

After automatic leader switch, will the clog module re-select a leader?

The clog module no longer initiates automatic leader switch since OceanBase Database V4.0.0. If the clog module hangs while writing clogs to the disk, a failure event is recorded and the priority of the leader is lowered. After that, the clog module automatically selects and enables the optimal replica to take the leader role. The leader switch event is recorded in the DBA_OB_SERVER_EVENT_HISTORY view.

Clogs

Does OceanBase Database support clog compression?

In OceanBase Database, you can specify the enable_clog_persistence_compress parameter to manually enable compression for transaction logs to be stored on the disk. The clog_persistence_compress_func parameter specifies the compression algorithm for transaction logs. After log compression is enabled, the clog module will compress and persist logs based on the specified algorithm. The default compression algorithm is lz4_1.0. In OceanBase Database, you can specify the clog_transport_compress_all parameter to manually enable compression for logs to be synchronized. The clog_transport_compress_func parameter specifies the compression algorithm for transaction logs to be synchronized. The default compression algorithm is lz4_1.0.

What are the differences among the compression algorithms for logs to be synchronized in OceanBase Database? How do I choose a compression algorithm?

In OceanBase Database, the clog_transport_compress_func parameter specifies the compression algorithm for transaction logs to be synchronized. The default compression algorithm is lz4_1.0. After log compression is enabled, the clog module will compress and transmit logs based on the specified algorithm. Currently, the clog module supports the lz4_1.0, zstd_1.0, and zstd_1.3.8 compression algorithms. Different compression algorithms compress clogs at different ratios to reduce the pressure on the bandwidth. All clogs need to be compressed before processing, which affects the performance. In different scenarios, the compression benefits and the impact on performance vary with the compression algorithm. The default compression algorithm lz4_1.0 achieves a large compression ratio while keeping the impact on performance controllable. Therefore, the default value of the clog_transport_compress_func parameter is lz4_1.0 in OceanBase Database. If you need a higher compression ratio and can tolerate a more severe impact on performance, you can select a different compression algorithm.

How do I check whether the clogs are synchronized between the followers and the leader?

The GV$OB_LOG_STAT view records the status information about the clogs of different partitions and the Paxos group. The value in the is_in_sync column indicates whether a replica is synchronous with the leader. Its valid values are TRUE and FALSE. By default, if the clog lag between a follower and the leader is within 5s, the value of the is_in_sync column is TRUE, indicating that this follower is synchronous with the leader. If the clog lag between a follower and the leader exceeds 5s, the follower is out of synchronization with the leader.
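The synchronization rule above can be sketched as a small shell helper. This is only an illustration of the 5s rule, not how OceanBase Database computes the column: the function name `is_in_sync` and the `SYNC_THRESHOLD_SECONDS` variable are hypothetical, the lag is assumed to be a whole number of seconds, and a lag of exactly 5s is assumed to count as "within 5s".

```shell
# Hypothetical helper: classify a follower by its clog lag (whole seconds)
# against the default 5s threshold described in the text.
SYNC_THRESHOLD_SECONDS=5

is_in_sync() {
    lag_seconds=$1
    # Assumption: a lag equal to the threshold still counts as in sync.
    if [ "$lag_seconds" -le "$SYNC_THRESHOLD_SECONDS" ]; then
        echo TRUE
    else
        echo FALSE
    fi
}
```

For example, `is_in_sync 3` prints TRUE and `is_in_sync 6` prints FALSE.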

During the process from transaction execution to commit, how are data modifications committed to the MemStore and persisted and saved in clogs?

If a database transaction contains a DML request, the data modifications will be updated to the MemTable of the leader. After the transaction is committed, OceanBase Database can ensure that the data modifications are stored in the clogs of the majority of replicas and replayed to the MemStore of the followers. When the usage of the MemStore reaches the specified threshold, a minor compaction is triggered to persist the data in the MemStore to an SSTable. Key actions performed in the storage layer (MemTable and SSTable) and clog layer:

An application initiates a transaction named Tnx1.
Assume that DML requests DML1 and DML2 will be executed in order in the Tnx1 transaction.
DML1 and DML2 update the transaction context and update the data modifications to the MemTable through the SQL layer.
After DML1 and DML2 are executed in the SQL layer, the application initiates a COMMIT action. The Tnx1 transaction enters the commit phase.
The leader stores the clogs on the disk and synchronizes the clogs to the followers.
When the clogs are stored in the majority of replicas, the transaction is successfully committed.
After a follower receives and stores the clogs on the disk, the follower asynchronously replays the clogs to the MemStore to provide data services for weak-consistency reads only after it receives a confirmation message "clog is written to the majority disks" from the leader.
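The majority rule in the steps above can be sketched as follows. This is a minimal illustration, not OceanBase code: `majority_count` and `is_committed` are hypothetical names, and the acknowledgment count is assumed to include the leader's own disk write.

```shell
# majority_count N: smallest number of replicas that forms a majority of N.
majority_count() {
    echo $(( $1 / 2 + 1 ))
}

# is_committed ACKS N: prints "committed" once ACKS replicas (leader included)
# have stored the clogs on disk, otherwise "pending".
is_committed() {
    acks=$1
    replicas=$2
    if [ "$acks" -ge "$(majority_count "$replicas")" ]; then
        echo committed
    else
        echo pending
    fi
}
```

With three replicas, `is_committed 2 3` prints committed: the leader plus one follower is enough, and the remaining follower can catch up asynchronously.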

What is the size of memory space used by the clog sliding window?

The size of the sliding window is defined based on the unit of partitions. In the window, each clog consumes about 100 B of memory. The memory space is allocated on demand, and reused or recycled when it is no longer needed. Assume that:

clog_max_unconfirmed_log_count is set to 10000.
A tenant contains 1,000 partitions.
In the case of high load, the sliding window of every partition is full.

Assume that the sliding window of every partition is full at the same time. For an OBServer node where all followers of all partitions are located, 2 GB (100 B × 10000 × 2 × 1,000) of extra memory will be consumed at this time. This is the maximum memory consumption. However, this case seldom happens because it is hardly possible that the sliding windows of 1,000 partitions are all used up at the same time.
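The worst-case estimate above can be reproduced with shell arithmetic. The function name `window_memory_bytes` is hypothetical; its four arguments are the factors from the formula in the text (bytes per clog, clog_max_unconfirmed_log_count, the factor of 2, and the number of partitions).

```shell
# window_memory_bytes BYTES_PER_LOG MAX_UNCONFIRMED FACTOR PARTITIONS
# Multiplies the four factors of the worst-case formula from the text.
window_memory_bytes() {
    echo $(( $1 * $2 * $3 * $4 ))
}

# Values from the example: 100 B per clog, 10000 unconfirmed logs,
# a factor of 2, and 1,000 partitions.
window_memory_bytes 100 10000 2 1000   # prints 2000000000 (about 2 GB)
```

Changing the second argument shows how raising clog_max_unconfirmed_log_count scales the upper bound linearly.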

If the sliding window of an OBServer node is stuck or full, which error will be returned? What are the consequences? What should I do?

The following table describes the two cases of a full clog sliding window.

Scenario of a full clog sliding window	Consequences and key logs
The clog sliding window of the leader is full	The leader can no longer allocate log IDs to the transaction layer and will print the `submit log error.*-4023` message in the `observer.log` file. If the clog sliding window of the leader is full and no clog slides out of the window within a period of time, requests in transactions can no longer be placed in the clog sliding window. This increases the overall response time of transactions.
The clog sliding window of a follower is full	The follower stops receiving clogs from the leader until a clog slides out of the window, and prints the `check_can_receive_log, now can not receive log` message in the `observer.log` file. If the sliding window of a follower is full for a long period of time and no clog slides out of the window, the clog commit of the majority of replicas is affected, which in turn affects transaction commit. If the clog commit of the minority of replicas is affected, the clogs on the followers fall behind those on the leader. If the clogs on a follower fall far behind those on the leader, this follower cannot be elected as the leader when the other minority of replicas fail.

Generally, a full clog sliding window affects the persistence of clogs. In this case, the application experiences issues such as slow transactions or lock waits (06005 and lock_for_write conflict.*\-4023 are printed in the log), and performance degrades noticeably. Troubleshooting suggestions:

Check whether the number of clogs submitted for writing grows drastically. For example, a newly launched high-concurrency batch processing scenario puts heavy pressure on the clog sliding window. In this case, check the business logic to determine whether the high-concurrency batch processing is necessary for the business system. If the business system allows it, try to lower the concurrency to resolve the issue. This method is safe and takes effect in the shortest time.
If the business concurrency does not increase sharply, check for bottlenecks regarding the slide-out of the clog sliding window. For example, if the memory of the follower is insufficient due to slow minor compactions, logs cannot be replayed in the memory. As a result, the clog sliding window is full. In this case, contact OceanBase Technical Support for analysis.
If the increase of the business concurrency is necessary, you can specify the clog_max_unconfirmed_log_count parameter to increase the size of the clog sliding window. However, be aware that if you increase the size of the clog sliding window, more memory resources will be allocated to the sliding window. Moreover, if the slide-out of clogs is stuck or too many clogs slide out concurrently, increasing the value of the clog_max_unconfirmed_log_count parameter may only postpone the issue and does not eliminate the bottleneck.
Enable clog aggregation. You can specify the _clog_aggregation_buffer_amount parameter to enable clog aggregation. This parameter specifies the number of buffers for clog aggregation. The size of each buffer is 64 KB. After clog aggregation is enabled, multiple clogs will be aggregated based on the rules to optimize clog processing.
Split hotspot partitions. A sliding window is set for each partition. The more centralized the hotspot data is, the more easily the sliding window becomes full. Therefore, you can distribute the hotspot data to multiple tables or partitions so that the sliding window does not easily become full. Besides the hotspot partitions in tables, you also need to pay attention to hotspot global indexes, including:
- Global non-partitioned index, where index data is centralized on a single partition
- Global partitioned index with an improper partitioning key. For example, if the number of distinct values (NDV) of the index key is too small or severe data skew exists (some key values have too many records), the index data will be centralized in a few partitions or even in a single partition.

What is the clog reconfirm failure scenario? What is the impact? How do I analyze the clog reconfirm failure scenario?

OceanBase Database is a distributed database. A leader must be elected to provide read and write services. The leader election process is as follows:

The election module of OceanBase Database elects a leader.
The clog module learns the current leader from the election module. Before the elected leader takes the leader role, it executes the clog reconfirm process to ensure that it has the latest data.
The leader performs operations such as takeover to take the leader role.

If the new leader is elected by the election module of OceanBase Database but fails the clog reconfirm process, the partition corresponding to this leader cannot provide data services. The clog reconfirm process aims to reconfirm the residual unconfirmed logs in the sliding window and synchronize them to the followers. In OceanBase Database, a transaction is successfully committed once its clogs have been synchronized to the majority of replicas. If the leader fails at this time, the clogs on a replica that is still online may not yet have been confirmed and replayed to the MemTable. The clog reconfirm process ensures that the clogs of all replicas are confirmed and that the clogs of the majority of replicas are synchronized with those of the leader, so that the commit of subsequent transactions is not affected. The following table describes the phases in the clog reconfirm process.

Phase	Phase name	Action in the phase
1	INITED	The leader commits confirmed logs for replay and locally writes the prepare log and the label information about the elected leader.
2	FLUSHING_PREPARE_LOG	The leader sends the label information about the elected leader to all followers. After the leader receives a reply from the majority of followers, it proceeds to the next phase.
3	FETCH_MAX_LSN	The leader records the maximum `max_flushed_id` value and makes some preparations when the majority of replicas (including the leader itself) reach the `max_flushed_log_id` value.
4	RECONFIRMING	The leader enables the confirmed logs in the left pane of the sliding window to slide out of the sliding window and commits these logs for replay. Then, the leader reconfirms all unconfirmed logs. Specifically, it collects the content of an unconfirmed log from the majority of followers, confirms the log content, and synchronizes the log to all followers. Finally, the leader enters the START_WORKING state and writes a start_working log.
5	START_WORKING	The leader waits for the start_working log to be synchronized to the majority of followers.
6	FINISHED	End

In the key phases, a log is printed in the observer.log file. For example, if the following log is printed in phase 2, the leader has received a reply from the majority of followers. If the following log is not printed, this phase is stuck.

[20XX-XX-XX 20:28:06.142985] INFO  [CLOG] ob_log_reconfirm.cpp:598 [127240][Y0-0000000000000000] [lt=25] max_log_ack_list majority(partition_key={tid:1099511627777, partition_id:0, part_cnt:1}, majority_cnt=2, max_log_ack_list=1{server:"10.101.XXX.XXX:261XXX", timestamp:-1, flag:0})

The timeout period of the reconfirm process of a single clog is 10s. In other words, if the reconfirm process of a clog is not finished within 10s, a timeout error log is printed, with the keywords is_reconfirm_role_change_or_sync_timeout. The clog reconfirm process is usually stuck in the preceding phases due to commit failure, network failure, or full log disk. The following table describes the main causes and key logs.

Main cause	Description	Key log and diagnosis method
Clog replay is stuck	During the clog reconfirm process, the confirmed logs in the left pane of the sliding window are all committed for replay. If the clog replay is stuck due to various exceptions such as a memory allocation failure or replay error, clogs will be accumulated in the sliding window. When the number of accumulated clogs reaches the specified threshold (which is 20,000), the sliding window is full and can no longer receive new logs. As a result, the reconfirm process cannot proceed.	If the sliding window contains confirmed logs to be committed for replay, the clog module reports the [ERROR] level code -4023 in the observer.log file. The error keywords are: `there are confirmed logs in sw, try again(partition_key={tid:1106108697592670, partition_id:0, part_cnt:1}, ret=-4023, new_start_id=371801833, start_id=371802703)` Clog replay may be stuck due to many reasons such as a memory allocation failure or replay error. In this case, you must check the observer.log file by running the following commands: `grep ERROR observer.log \| grep STORAGE \| grep 'alloc memory'` and `grep ERROR observer.log \| grep STORAGE \| grep replay`.
The clog disk is full	During the clog reconfirm process, logs such as prepare and start_working are written and the confirmation by the majority of replicas is required. If these logs cannot be written on a server because the clog disk of the server is full, and the remaining normal servers are not the majority, the reconfirm process fails.	If the clog disk is full, the clog module reports the [ERROR] level code -4264 in the observer.log file. The error keywords are: `log out of disk space(partition_key={tid:1102810162709414, partition_id:46, part_cnt:0}, server="xx.xx.xx.xx:29432", ret=-4264, free_quota=-3381817344)`
The clog synchronization network is disconnected	During the clog reconfirm process, the candidate leader needs to communicate with the followers in various phases. Therefore, if packets cannot be sent or received or the latency in packet sending and receiving is too high due to network issues, the reconfirm process may fail.	If the network is disconnected, multiple phases of the reconfirm process will fail. For example, if no reply from the followers is received after the prepare log is sent, a log with the following keywords will be constantly printed on the server of the candidate leader until this phase times out: `max_log_ack_list not majority` After the candidate leader writes the start_working log, it may not receive an acknowledgment from the majority of followers due to the network exception. Then, this phase will time out and fail. The network disconnection may be caused by many reasons. Perform the following steps to troubleshoot the issue: Run the ping command. If the ping operation succeeds, the network is normal. Run the sudo iptables -L command to check whether network isolation exists. View the queue logs of the other tenant to check whether the queue is full and no more requests can be processed.
Requests are accumulated in the queue of the SYS500 tenant	If requests are accumulated in the queue of the SYS500 tenant, new RPC requests can no longer be processed, messages related to the reconfirm process are no longer processed, and the logs on the leader show that the followers did not reply to any request. In this case, you need to analyze the situation and reasons of the accumulation of requests in the queue.	View the situation of accumulation of requests in the queue. Search by the keywords `dump tenant info(tenant={id:500,` in the observer.log file of the leader or a follower. The content behind the `req_queue` keyword is the situation of the queue of the SYS500 tenant. If you notice that requests are accumulated in the queue of the SYS500 tenant, you need to further analyze the causes. In this case, you can use OBStack to print the stack information about the current OBServer node and then send the information to OceanBase Technical Support for troubleshooting.
Worker threads are insufficient	-	Check whether the following log indicating packet processing timeout exists: `[20XX-XX-XX 11:12:12.103] easy_request.c:420(tid:7fe2cdc9f700)[57136] rpc time, packet_id :2425446521, pcode :4096, client_start_time :1517541127590356, start_send_diff :8, send_connect_diff: 0, connect_write_diff: 5, request_fly_ts: 188, arrival_push_diff: 2, wait_queue_diff: 4508603, pop_process_start_diff :2, process_handler_diff: 4184, process_end_response_diff: 1, response_fly_ts: 447, read_end_diff: 1, client_end_time :1517541132103797, dst: 10.101.X.X:XXXXX` The value of `wait_queue_diff` reaches several seconds, which indicates that the processing by the worker threads is slow. In this case, check whether the value of `cpu_quota` (`cpu_quota_concurrency`, `workers_per_cpu_quota`, and `system_cpu_quota`) is too small.

If the issue persists, contact OceanBase Technical Support.
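To check for slow worker threads as described in the last table row, the `wait_queue_diff` values can be pulled out of the RPC timing lines with standard tools. The sketch below is an assumption-laden helper, not an OceanBase utility: the function names are hypothetical, the field is assumed to appear as `wait_queue_diff: <number>` as in the sample log line, and the unit is assumed to be microseconds because the sample value 4508603 is described as several seconds.

```shell
# wait_queue_diffs FILE: print every wait_queue_diff value found in FILE,
# one per line. Relies on grep -o, which GNU and BSD grep both provide.
wait_queue_diffs() {
    grep -o 'wait_queue_diff: *[0-9]*' "$1" | sed 's/.*: *//'
}

# slow_wait_queue_count FILE THRESHOLD: count the values above THRESHOLD
# (assumed to be in microseconds, so 1000000 means one second).
slow_wait_queue_count() {
    wait_queue_diffs "$1" |
        awk -v limit="$2" '$1 > limit { n++ } END { print n + 0 }'
}
```

For example, `slow_wait_queue_count observer.log 1000000` counts the RPC requests that waited more than about one second in the queue. Which log file actually carries these lines depends on the deployment, so the file name here is only a placeholder.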

If the clog disk is full, how can I analyze the issue and take emergency actions?

In an OceanBase cluster with a large number of partitions and write requests, clogs of some replicas may not be recycled in a timely manner due to slow minor compactions or heterogeneous resource unit specifications of tenants. When the clog disk usage reaches 95%, the observer process automatically stops writing clogs. As a result, a large number of replicas on the OBServer node are out of synchronization. Normally, the clog disk usage remains at about 80%. If the clog disk usage exceeds 80%, the recycling of clog files is slow or stuck. A clog file can be recycled after a minor compaction has stored the data corresponding to all partition logs in the file to an SSTable. The cause of a full clog disk is usually that the minor compaction of some partitions is stuck. When an alert indicating insufficient clog disk space is triggered, perform the following steps to identify the partitions that hinder recycling:

View the output of the following command:
```
grep  can_skip_base  observer.log
```
- In the search result, find the records with the need_record value being True. partition_key indicates the partition whose logs cannot be recycled.
- Search the records near the time of the alert log. The partition in the search result is the one that hinders recycling. Obtain the partition_key value.
Search the logs of this partition based on the obtained partition_key.
```
grep  partition_key  observer.log
```
Analyze the logs to identify the reason why the timestamp of minor compaction does not advance.
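The two usage thresholds mentioned above (about 80% in normal operation and 95% as the point where writing stops) can be watched with a small helper. This is a sketch under assumptions: both function names and the three status labels are hypothetical, the clog directory path is whatever your deployment uses, and 95 is the default limit described in the text rather than a value read from the cluster.

```shell
# clog_disk_usage_percent DIR: used percentage (digits only) of the file
# system that holds DIR, taken from the Capacity column of df -P.
clog_disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# classify_clog_usage PCT: map a usage percentage to the states in the text.
classify_clog_usage() {
    pct=$1
    if [ "$pct" -ge 95 ]; then
        echo "write-stopped"      # clog writing stops at the default limit
    elif [ "$pct" -gt 80 ]; then
        echo "recycling-slow"     # recycling of clog files is slow or stuck
    else
        echo "normal"
    fi
}
```

A typical call would be `classify_clog_usage "$(clog_disk_usage_percent /path/to/clog)"`, where the path is a placeholder for the clog directory of the OBServer node.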

Emergency measures

When the clog disk is full, if the majority of nodes are affected, the commit of transactions is affected. In other words, the entire system is hung and provides no response. Meanwhile, a large number of concurrent DML transactions exist in the business system. In this case, we recommend that you pause the business load to recover the clog service. If only the minority of nodes are affected, quickly analyze whether the node has major bottlenecks in the I/O layer and network layer (or infrastructure layer). If such major bottlenecks exist, isolate the minority of nodes and resolve these bottlenecks. If no such bottlenecks exist, you can perform the following operations on the minority of nodes:

When the clog disk usage reaches 95%, writing is automatically stopped and no more logs can be received. This threshold is specified by a parameter in OceanBase Database. You can increase the value of this parameter to 98% for the server whose clog disk is full.
```
alter system set clog_disk_usage_limit_percentage = 98 server ='xxx:2882';
```
Then, the corresponding follower will start to synchronize logs from the leader. If a large number of logs need to be synchronized, a rebuild is triggered to copy data and clogs from the leader to the follower. You can observe the recovery process as follows:
Execute the following SQL statement to check whether the number of partitions that are out of clog synchronization decreases:
```
select svr_ip, count(*) from  gv$ob_log_stat  where is_offline = 0 and is_in_sync = 0 group by 1;    
```
If not, some replicas have triggered a rebuild to copy the baseline data and incremental data. Execute the following SQL statement to check whether any replica is being rebuilt:
```
select svr_ip, count(*) from __all_virtual_partition_migration_status  where action != 'END' group by 1;
```
If the query result is not 0, check the parameters related to concurrent rebuild tasks.
```
show parameters like "%data_copy%";
```
If the default value 2 is retained for both server_data_copy_in_concurrency and server_data_copy_out_concurrency, change their values to 10 to accelerate the concurrent rebuilding of multiple replicas.
```
alter system set server_data_copy_in_concurrency = 10;
alter system set server_data_copy_out_concurrency = 10;
```
If the issue persists after you increase the value of clog_disk_usage_limit_percentage to 98, shut down the OBServer node, copy the clog files to another directory with more storage space, and create a symbolic link from the original clog directory to the new location. Then, restart the OBServer node.

Notice

This step is risky. You can perform this step to recover the system only when the cluster fails. When you copy log files, be extremely careful to avoid improper operations.

If the clog disk usage decreases to a value less than 95%, the OBServer node is recovered. Wait until all replicas reach the sync state.

After the number of partitions for which is_in_sync is 0 drops to 0, restore the modified parameters to their original values.
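
The triage rule and the final check in this procedure can be sketched as follows. This is a minimal illustration, not an OceanBase tool: the function names are hypothetical, the usage figures are assumed to come from your own monitoring, and each node is assumed to host one replica of the same Paxos group.

```python
from typing import Dict, List

# Default stop-write threshold quoted in this answer (percent of clog disk used).
CLOG_STOP_WRITE_PERCENT = 95


def affected_nodes(usage_percent: Dict[str, float],
                   limit: float = CLOG_STOP_WRITE_PERCENT) -> List[str]:
    """Return the nodes whose clog disk usage has reached the stop-write limit."""
    return sorted(node for node, pct in usage_percent.items() if pct >= limit)


def majority_affected(usage_percent: Dict[str, float],
                      limit: float = CLOG_STOP_WRITE_PERCENT) -> bool:
    """True when more than half of the nodes can no longer receive logs."""
    return len(affected_nodes(usage_percent, limit)) > len(usage_percent) / 2


def sync_recovered(out_of_sync_counts: Dict[str, int]) -> bool:
    """True when no server still reports partitions with is_in_sync = 0.

    The input mirrors the (svr_ip, count) rows of the gv$ob_log_stat query above.
    """
    return all(count == 0 for count in out_of_sync_counts.values())
```

If `majority_affected` returns True, pause the business load first. If it returns False, the per-node steps above apply, and `sync_recovered` indicates when the modified parameters can be restored.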

Possible causes

O&M issues occur. The space for the clog disk is occupied by files of other users. As a result, the available space is insufficient.
The most common cause of a full clog disk is consecutive minor compaction failures. A clog file can be recycled only after minor compactions have stored the partition data recorded in that clog file to an SSTable and the partition metadata has been persisted. Therefore, if minor compactions keep failing, the timestamp from which clogs must be replayed when the OBServer node restarts after a failure does not advance, and clog files cannot be recycled. The causes and logic of consecutive minor compaction failures are complex. If you encounter consecutive minor compaction failures, contact OceanBase Technical Support for diagnostics.
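
The relationship between minor compactions and clog recycling can be illustrated with a simplified model. The sketch below is an assumption for illustration only and does not reproduce the actual recycling logic of OceanBase Database: it treats each clog file as a pair of a file ID and the largest log timestamp it contains, and each partition as having a checkpoint timestamp up to which its data has been persisted.

```python
from typing import Dict, List, Tuple


def recyclable_clog_files(files: List[Tuple[int, int]],
                          checkpoint_ts: Dict[str, int]) -> List[int]:
    """Return IDs of clog files that are no longer needed for replay.

    files: (file_id, max_log_ts) pairs (hypothetical representation).
    checkpoint_ts: partition name -> timestamp up to which data is persisted.
    """
    if not checkpoint_ts:
        return []
    # Replay after a restart must start at the slowest partition's checkpoint.
    replay_start = min(checkpoint_ts.values())
    return [file_id for file_id, max_ts in files if max_ts <= replay_start]
```

In this model, one partition whose minor compactions keep failing holds `replay_start` back, so newer clog files accumulate until the disk fills.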

What error is returned if clog initialization fails after the OBServer node restarts? If the OBServer node fails the initialization at startup, how do I recover the system?

Error description	Error log	Recovery method
clog_shm error: the shared memory file records a valid position (flush_pos_) that lies beyond the last clog file.	The clog module reports the [ERROR] level code -4016 in the observer.log file. The error keywords are: `The clog start pos is unexpected, (ret=-4016, file_id=1018, offset=51658630`	Delete the clog_shm file from the store directory and then restart the OBServer node.
Errors in an iteration clog file are usually caused by the clog file itself.	The clog module reports the [ERROR] level code -4016 in the observer.log file. The error keywords are: `invalid scan_confirmed_log_cnt(ret=-4016, ret="OB_ERR_UNEXPECTED",` `notify_scan_finished_ finished(ret=-4016, cost_time=154)` `notify_scan_finished_ failed(ret=-4016)` `log scan runnable exit error(ret=-4016)`	These errors are returned because errors occur in the clog file or iteration clog file. If the preceding issue occurs in the minority of replicas, delete these replicas. Note: This troubleshooting method is applicable only when the issue occurs in the minority of replicas. If the issue occurs in the majority of replicas, contact OceanBase Technical Support and do not take any measures, because improper operations will cause data loss in this case. Assume that in a three-replica environment, one of the replicas fails. Recovery method 1: Find a similar OBServer node and quickly add it to the OceanBase cluster to replace the failed replica. Make sure that `enable_rereplication` is set to True. The failed OBServer node is permanently removed. Recovery method 2: Find the failed OBServer node, permanently remove it, securely clear the data on this OBServer node, and add a new OBServer node to the cluster. Execute the following statement to permanently remove the failed OBServer node. Then, query the `DBA_OB_SERVERS` view. If no record of the failed OBServer node is found, this OBServer node is permanently removed. `alter system delete server '$obs0_ip:$obs0_port';` Clear the data on the failed OBServer node. Add a new OBServer node to the cluster. `alter system add server '$obs0_ip:$obs0_port';` Then, the cluster initiates data synchronization to the new replica. Make sure that `enable_rereplication` is set to True.
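
To tell the two cases in the table apart when reading observer.log, a small classifier along the following lines may help. It is a sketch only: the function name and return labels are hypothetical, and the keywords are the ones listed in the table.

```python
from typing import Optional

# Keyword for the clog_shm case, as listed in the table.
CLOG_SHM_KEYWORD = "The clog start pos is unexpected"

# Keywords for errors in the clog file or iteration clog file, as listed in the table.
CLOG_FILE_KEYWORDS = (
    "invalid scan_confirmed_log_cnt",
    "notify_scan_finished_ finished",
    "notify_scan_finished_ failed",
    "log scan runnable exit error",
)


def classify_clog_init_error(line: str) -> Optional[str]:
    """Map one observer.log line to 'clog_shm', 'clog_file', or None."""
    if "ret=-4016" not in line:
        return None
    if CLOG_SHM_KEYWORD in line:
        return "clog_shm"
    if any(keyword in line for keyword in CLOG_FILE_KEYWORDS):
        return "clog_file"
    return None
```

A result of `clog_shm` corresponds to deleting the clog_shm file and restarting the node. A result of `clog_file` corresponds to the replica replacement procedure, which applies only when a minority of replicas are affected.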

Storage

What are the differences between the storage engine of OceanBase Database and that of conventional databases?

OceanBase Database adopts a read/write separation architecture. Data is divided into baseline data and incremental data. Incremental data is stored in memory (MemTables), and baseline data is stored on SSDs (SSTables). All data modifications are incremental data and are written only to MemTables. Because all DML operations are performed on MemTables in memory, they provide excellent performance. During data reading, the versioned data in MemTables and the baseline data in persistent storage are merged to produce the up-to-date result. OceanBase Database implements both a block cache and a row cache in memory to avoid random reads of the baseline data. When the incremental data in a MemTable reaches a specified size, it is compacted with the baseline data and then persisted to the disk. The system also performs a major compaction every night during idle hours.
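
The read and write paths described above can be pictured with a toy model. The following sketch is not the OceanBase storage engine: the class and its size limit are hypothetical, caches and multi-version data are omitted, and the compaction trigger is reduced to a simple entry count.

```python
from typing import Dict, Optional


class ToyLsmStore:
    """Toy model: writes go to a MemTable; reads overlay it on the baseline."""

    def __init__(self, memtable_limit: int = 3) -> None:
        # Incremental data held in memory. None marks a deleted key.
        self.memtable: Dict[str, Optional[str]] = {}
        # Baseline data, standing in for SSTables on disk.
        self.sstable: Dict[str, str] = {}
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: Optional[str]) -> None:
        """All modifications touch only the MemTable."""
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.compact()

    def delete(self, key: str) -> None:
        self.write(key, None)

    def read(self, key: str) -> Optional[str]:
        """Incremental data takes precedence over baseline data."""
        if key in self.memtable:
            return self.memtable[key]
        return self.sstable.get(key)

    def compact(self) -> None:
        """Fold the incremental data into the baseline and empty the MemTable."""
        for key, value in self.memtable.items():
            if value is None:
                self.sstable.pop(key, None)
            else:
                self.sstable[key] = value
        self.memtable.clear()
```

In this model, an update is visible to `read` immediately, while the baseline copy changes only when `compact` runs.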

How is data stored in OceanBase Database?

Data files in OceanBase Database are stored in macroblocks. The size of each macroblock is 2 MB. A macroblock is divided into microblocks (16 KB before compression). Each microblock contains two or more rows. A microblock is the smallest I/O unit of OceanBase Database.
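
As a rough worked example of these sizes, the sketch below computes how many microblocks fit in one macroblock and how many blocks a given amount of uncompressed data would need. It assumes 1 MB = 1024 KB and ignores block headers and compression, so the results are upper-level estimates only. The function names are hypothetical.

```python
MACROBLOCK_BYTES = 2 * 1024 * 1024   # 2 MB macroblock
MICROBLOCK_BYTES = 16 * 1024         # 16 KB microblock before compression


def microblocks_per_macroblock() -> int:
    """Number of uncompressed microblocks that fit in one macroblock."""
    return MACROBLOCK_BYTES // MICROBLOCK_BYTES


def blocks_needed(data_bytes: int) -> tuple:
    """Return (macroblocks, microblocks) needed for data_bytes, rounded up."""
    macro = -(-data_bytes // MACROBLOCK_BYTES)  # ceiling division
    micro = -(-data_bytes // MICROBLOCK_BYTES)
    return macro, micro
```

Under these assumptions, one macroblock holds 128 microblocks, and 5 MB of uncompressed data spans 3 macroblocks and 320 microblocks.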

Do I need to use a partitioned table?

You can refer to the following information to decide whether to use a partitioned table:

If the data volume of your table may exceed 200 GB in the foreseeable future, we recommend that you use a partitioned table.
If data in your table is evenly distributed along the timeline, we recommend that you use a partitioned table.

What are the differences between a local index and a global index in OceanBase Database in terms of implementation?

Differences between a local index and a global index:

Local index: The index is bound to a single table or to the smallest subpartition of a partitioned table. No additional partitioning record is created in the system tenant of OceanBase Database. All local index-based SQL statements are executed locally.
Global index: An additional partitioning record must be created in the system tenant of OceanBase Database. You can think of a global index as a separate table, which occupies an additional partition quota. The global index has its own partitioning strategy and is not bound to the primary table. Even if a global index is created on a single table, it can be stored on a different node from that of the primary table. This easily makes an SQL database a distributed database.
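
The difference in binding can be pictured with a toy model. The sketch below is an illustration under stated assumptions, not OceanBase code: the partitioning functions and all names are hypothetical, the table is partitioned by primary key, and the indexed column is not the partitioning key.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (primary key, indexed column value)


def value_partition(value: str, count: int) -> int:
    """Hypothetical deterministic partitioning of index values."""
    return sum(value.encode()) % count


def build_local_indexes(rows: List[Row], table_parts: int) -> List[Dict[str, List[int]]]:
    """One index per table partition, covering only that partition's rows."""
    local: List[Dict[str, List[int]]] = [dict() for _ in range(table_parts)]
    for pk, value in rows:
        local[pk % table_parts].setdefault(value, []).append(pk)
    return local


def build_global_index(rows: List[Row], index_parts: int) -> List[Dict[str, List[int]]]:
    """Index partitioned by its own key, independent of the primary table."""
    index: List[Dict[str, List[int]]] = [dict() for _ in range(index_parts)]
    for pk, value in rows:
        index[value_partition(value, index_parts)].setdefault(value, []).append(pk)
    return index


def lookup_local(local: List[Dict[str, List[int]]], value: str) -> Tuple[List[int], int]:
    """Return matching keys and the number of index partitions probed."""
    keys: List[int] = []
    for part in local:
        keys.extend(part.get(value, []))
    return sorted(keys), len(local)


def lookup_global(index: List[Dict[str, List[int]]], value: str) -> Tuple[List[int], int]:
    """Return matching keys; only one index partition is probed."""
    part = index[value_partition(value, len(index))]
    return sorted(part.get(value, [])), 1
```

In this model, each local index lives next to its table partition, while the global index is laid out by its own key and may therefore sit on a different node than the rows it points to.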