Exceptions
When an exception occurs during a major compaction of OceanBase clusters, a major compaction timeout or major compaction error alert is triggered on OCP, as shown in the following figure.
The compaction-related error also appears in the OceanBase monitoring log. The following figure shows the alert information of this error.
When the administrator receives the alert, the administrator can log on to the sys tenant of the OceanBase cluster that generates the alert for troubleshooting.
Locate the server with the major compaction exception
Log on to the sys tenant. Run the following command to query the __all_zone table for the major compaction status and locate the zone on which the major compaction fails:
SELECT * FROM __all_zone WHERE name IN ('merge_status','all_merged_version');
If the major compaction status is normal, the following output is displayed:
mysql> SELECT * FROM __all_zone WHERE name IN ('merge_status','all_merged_version');
+----------------------------+----------------------------+---------+--------------------+-------+---------+
| gmt_create | gmt_modified | zone | name | value | info |
+----------------------------+----------------------------+---------+--------------------+-------+---------+
| 2019-06-24 11:19:00.559234 | 2020-02-18 14:02:28.997661 | | merge_status | 1 | MERGING |
| 2019-08-28 13:34:31.182559 | 2020-02-18 08:48:28.447141 | ET15_10 | all_merged_version | 773 | |
| 2019-08-28 13:34:31.182986 | 2020-02-18 14:02:31.640651 | ET15_10 | merge_status | 1 | MERGING |
| 2019-08-28 14:37:25.685870 | 2020-02-18 08:40:52.009764 | ET15_11 | all_merged_version | 773 | |
| 2019-08-28 14:37:25.686322 | 2020-02-18 14:02:31.643109 | ET15_11 | merge_status | 1 | MERGING |
| 2019-08-28 13:33:17.273600 | 2020-02-18 08:30:00.087549 | ET15_9 | all_merged_version | 773 | |
| 2019-08-28 13:33:17.273964 | 2020-02-18 14:02:31.645466 | ET15_9 | merge_status | 1 | MERGING |
+----------------------------+----------------------------+---------+--------------------+-------+---------+
When a compaction error occurs in a zone, the info column shows TIMEOUT as the merge status of that zone. If the value column shows 773 in the all_merged_version row of that zone, the version being compacted is 773.
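The TIMEOUT check can be scripted. The following is a minimal sketch, assuming the query result has been exported as tab-separated rows without a header line (for example, by running the mysql client with the -N and -B options) and that the columns appear in the order shown in the sample output. The function name timeout_zones is hypothetical.

```shell
# Hypothetical helper: print the zones whose merge_status row reports TIMEOUT.
# Assumes tab-separated input on stdin with the columns
# gmt_create, gmt_modified, zone, name, value, info (same order as the sample output).
timeout_zones() {
  # $3 = zone, $4 = name, $6 = info; the cluster-level row has an empty zone and is skipped.
  awk -F '\t' '$4 == "merge_status" && $6 == "TIMEOUT" && $3 != "" { print $3 }'
}
```

To use it, pipe the tab-separated output of the __all_zone query into timeout_zones. Each printed line is a zone to investigate further.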
In the WHERE clause of the following SQL statement, set major_version to 773 to locate the faulty server. The svr_ip value in the output is the IP address of the faulty server. Sample statement:
SELECT zone, svr_ip, major_version, macro_block_count, use_old_macro_block_count
, merge_start_time, merge_finish_time, merge_process, merge_finish_time - merge_start_time AS cost_time
, macro_block_count - use_old_macro_block_count AS merge_macro_block_count
, (macro_block_count - use_old_macro_block_count) / (merge_finish_time - merge_start_time) AS avg_per_sec
FROM __all_virtual_partition_sstable_image_info
WHERE major_version = 773
AND merge_process <> 100
ORDER BY zone, svr_ip, major_version;
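Because the version number changes with every major compaction, it can help to generate the statement from a parameter. The following is a minimal sketch of a reduced form of the statement above. The function name unfinished_merge_sql is hypothetical, and the column list is shortened for readability.

```shell
# Hypothetical helper: print a lookup statement for the given major version.
# Rejects anything that is not a non-negative integer before it is placed
# into the WHERE clause.
unfinished_merge_sql() {
  case "$1" in
    ''|*[!0-9]*)
      echo "usage: unfinished_merge_sql <major_version>" >&2
      return 1
      ;;
  esac
  printf 'SELECT zone, svr_ip, major_version, merge_process FROM __all_virtual_partition_sstable_image_info WHERE major_version = %s AND merge_process <> 100 ORDER BY zone, svr_ip;\n' "$1"
}
```

For example, unfinished_merge_sql 773 prints the statement for version 773, which can then be passed to the mysql client connected to the sys tenant.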
Locate the exception cause
Locate the cause of the server exception
After you obtain the IP address of the server with the major compaction exception, log on to the server over SSH and check whether a hardware or software exception has occurred. Generally, check for the following problems:
- Low I/O performance. The compaction failure may be caused by an I/O timeout error. Run the iostat -x -k 1 command to check whether the I/O usage is 100%. If so, the compaction operation was affected.
- Full disk usage. If the disk is full, data cannot be written to the disk. For example, you cannot write data to the /home/admin/oceanbase/log directory, the Clog directory, or the Docker directory when the disk is full. Run the df -lh command to check the disk space.
- Hardware error. Run the following command to check for hardware exceptions. If a value is returned, a hardware exception has occurred.
dmesg | grep -E "Failed status, reset controller|Controller encountered a fatal error and was reset"
Run the following command for servers with RAID cards to check for RAID card exceptions. If a value is returned, a RAID card exception has occurred.
tbraid log | grep -E "Read Medium ERR|Error"
tbraid log | grep -E 'host IOs were blocked|I2C 4 cannot find idle bus'
- Kernel bug. Run the following command to check for a kernel bug. If a message in the format "task xxx blocked for more than 120 seconds" is returned, an I/O exception was caused by a kernel bug.
dmesg |grep blocked
- Network problem. Check the message log or run the TSAR command to check for a network interface card (NIC) error. If the following messages appear, the compaction failure was caused by a network problem.
May 26 04:34:23 db142151114.na62 lldpd[67883]: iface_eth_recv: error while receiving frame on eno1: Network is down
May 26 04:34:23 db142151114.na62 lldpd[67883]: iface_eth_recv: error while receiving frame on eno2: Network is down
May 26 04:34:23 db142151114.na62 lldpd[67883]: iface_eth_recv: error while receiving frame on enp7s0f0: Network is down
May 26 04:34:23 db142151114.na62 lldpd[67883]: iface_eth_recv: error while receiving frame on enp7s0f1: Network is down
If one or more of the preceding problems are found, contact Technical Support to solve the problems. If the faulty server must be replaced, replace the OBServer node.
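Two of the preceding checks, disk usage and blocked kernel tasks, are easy to automate. The following is a minimal sketch, assuming POSIX-format df output (df -P) and plain dmesg text on standard input. The function names full_filesystems and count_blocked_tasks and the default threshold of 95 are hypothetical choices, not values from OceanBase.

```shell
# Hypothetical helper: list mount points whose usage is at or above a threshold.
# Reads `df -P` output on stdin: column 5 is the use percentage, column 6 the
# mount point (mount points that contain spaces are not handled).
full_filesystems() {
  threshold="${1:-95}"
  awk -v limit="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the percent sign
    if (use + 0 >= limit + 0) print $6
  }'
}

# Hypothetical helper: count kernel messages that report a blocked task.
# grep -c exits non-zero when nothing matches, so the status is normalized.
count_blocked_tasks() {
  grep -c 'blocked for more than' || true
}
```

Typical use would be df -lP | full_filesystems 90 and dmesg | count_blocked_tasks. A non-empty list or a non-zero count points to the corresponding item in the checklist above.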
Locate the cause of an internal exception
An internal exception, such as network jitter, an I/O timeout, or invalid parameter settings, can also cause a major compaction exception. Check for the following problems:
- Internal macro block error. To locate an internal macro block error, first find the partition where the error occurred. In the following SQL statement, set data_version to 773, the compaction version identified in the preceding section. Execute the statement applicable to your OceanBase version.
(1.x) select * from __all_meta_table where data_version != 773;
(2.x) select * from __all_virtual_meta_table where data_version != 773;
The statement returns information about the partition where the major compaction error occurred, including the IP address and zone of the server where the partition resides.
- Parameter configuration error. OceanBase provides several parameters that control cluster compaction. If these parameters are not configured correctly, the compaction may fail. Run the following statement to check whether the value of enable_manual_merge is false. If the value is not false, major compactions can be performed only by running manual major compaction commands.
show parameters like '%enable_manual_merge%';
Run the following statement to check whether the value of enable_upgrade is false. If the value is not false, all major compactions will be blocked.
show parameters like '%enable_upgrade%';
Solution
The solution varies with the exception cause.
Solutions to server exceptions
For an intermittent server exception, confirm that the exception no longer occurs, run STOP SERVER to stop the faulty server, and then restart the server. For more information, see the "OBServer management" topic of the OceanBase Database Administrator Guide.
For a persistent server exception, bring the abnormal server offline for maintenance by O&M engineers. Before you bring the server offline, run STOP SERVER to stop the faulty server, and then replace it with a new server. For more information, see the "OBServer management" topic of the OceanBase Database Administrator Guide.
Solutions to internal exceptions
The solution to an internal macro block exception is similar to that for an intermittent server exception. Run STOP SERVER to stop the faulty server and then restart the server. For more information, see the "OBServer management" topic of the OceanBase Database Administrator Guide.
To rectify parameter setting errors, run the ALTER SYSTEM command to set the parameters to valid values. For example:
ALTER system SET `parameter_name`= 'True';