This article explains the sources of some obdiag inspection indicators.
Description
The issue #xx mentioned in this section refers to the issue number of the obdiag project.
OBProxy check items
| Check item name | Check item description |
|---|---|
| version.bad_version | Check if the OBProxy version is a deprecated version. Some versions of OBProxy are buggy and their use is not recommended. |
| version.old_version | Check if the OBProxy version is an old version. Some older versions of OBProxy are no longer supported and are not recommended for use. The source can be found on GitHub issue #1103. |
| parameter.request_buffer_length | Check whether the OBProxy parameter request_buffer_length is the default value. |
| parameter.work_thread_num | Check the value of the OBProxy parameter work_thread_num to prevent thread exhaustion problems. The source can be found on GitHub issue #1019. |
| parameter.enable_ob_protocol_v2_with_client | Check the OBProxy parameter enable_ob_protocol_v2_with_client, and alert if it is enabled. |
OceanBase database check items
Log stream check items
| Check item name | Check item description |
|---|---|
| clog.clog_hang | Check for disk failure issues that may cause the log stream to hang. The source can be found on GitHub issue #963. |
| clog.clog_disk_full | Check whether there is a log stream disk full problem. |
| ls.paxo_members | Check whether the log stream members are consistent with the Paxos member list. If they are inconsistent, a server deletion operation cannot complete successfully. |
Version check items
| Check item name | Check item description |
|---|---|
| version.bad_version | Check whether the OceanBase database is a deprecated version. Some versions of the OceanBase database have bugs and are not recommended for use. |
| version.old_version | Check whether the OceanBase database is an old version. Some older versions of OceanBase databases are no longer supported and are not recommended for use. |
Index check items
| Check item name | Check item description |
|---|---|
| index.global_index_unpartitioned | Check for unpartitioned global indexes that may cause hotspot issues during batch operations. The source can be found on GitHub issue #957. |
Tenant check items
| Check item name | Check item description |
|---|---|
| tenant.ddl_operation_table_size | Check the size of tenant internal table __all_ddl_operation. When the number of records exceeds 10 million, the user is prompted to pay attention. The source can be found on GitHub issue #1061. |
| tenant.tenant_threshold | Check tenant thread utilization and alert when it exceeds the 95% threshold. The source can be found on GitHub issue #963. |
| tenant.tenant_locality_consistency_check | Check the consistency of the tenant's locality and the number of log stream members to ensure tenant availability. The source can be found on GitHub issue #1048. |
| tenant.max_stale_time_for_weak_consistency | Check whether the max_stale_time_for_weak_consistency parameter is the default value. The source can be found on GitHub issue #850. |
| tenant.tenant_min_resource | Check the tenant resource pool configuration and report if it provides less than 2 CPU cores or less than 4 GB of memory (2C4G). |
| tenant.parameters_default | Check whether all parameters are set to their default values. The source can be found on GitHub issue #850. |
| tenant.macroblock_utilization_rate_tenant | Check whether the ratio of actual data volume to actual disk usage for all tenants in the OceanBase cluster is within a reasonable range. The OceanBase database stores data in macroblocks, and each macroblock may not be fully utilized to improve efficiency. If the ratio of actual data volume to actual disk usage is too low, a full merge should be performed to improve disk utilization. The source can be found on GitHub issue #847. |
| tenant.writing_throttling_trigger_percentage | Check whether writing_throttling_trigger_percentage is set to 100. A value of 100 turns off write throttling, which can cause MemStore memory to be exhausted. The source can be found on GitHub issue #758. |
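
Several of the tenant items above are simple threshold comparisons. The following is a minimal sketch of how such rules could be evaluated, not the obdiag implementation. The thresholds come from the table; the input dictionary and its field names (`ddl_operation_rows`, `thread_utilization`, `cpu_cores`, `memory_gb`) are hypothetical and stand in for values that would be collected from the cluster.

```python
from typing import Dict, List

# Thresholds taken from the table above; the input field names are hypothetical.
DDL_OPERATION_MAX_ROWS = 10_000_000
THREAD_UTILIZATION_MAX = 0.95
MIN_CPU_CORES = 2
MIN_MEMORY_GB = 4


def check_tenant(stats: Dict[str, float]) -> List[str]:
    """Return the names of the tenant check items that would raise a warning."""
    warnings = []
    if stats["ddl_operation_rows"] > DDL_OPERATION_MAX_ROWS:
        warnings.append("tenant.ddl_operation_table_size")
    if stats["thread_utilization"] > THREAD_UTILIZATION_MAX:
        warnings.append("tenant.tenant_threshold")
    if stats["cpu_cores"] < MIN_CPU_CORES or stats["memory_gb"] < MIN_MEMORY_GB:
        warnings.append("tenant.tenant_min_resource")
    if stats["writing_throttling_trigger_percentage"] == 100:
        warnings.append("tenant.writing_throttling_trigger_percentage")
    return warnings


# Made-up sample values for illustration only.
example = {
    "ddl_operation_rows": 12_500_000,
    "thread_utilization": 0.40,
    "cpu_cores": 4,
    "memory_gb": 16,
    "writing_throttling_trigger_percentage": 100,
}
print(check_tenant(example))
```

With the sample values, the sketch reports the DDL operation table size and the write throttling setting, because only those two rules are violated.
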
Sysbench check items
| Check item name | Check item description |
|---|---|
| sysbench.sysbench_free_test_memory_limit | Check cluster memory limit information when sysbench is idle. |
| sysbench.sysbench_free_test_network_speed | Check cluster network speed information when sysbench is idle. |
| sysbench.sysbench_test_cluster_parameters | Check cluster parameters when running sysbench. |
| sysbench.sysbench_test_cluster_datafile_size | Check cluster data file size and log disk size information when sysbench is idle. |
| sysbench.sysbench_test_cluster_log_disk_size | Check the cluster log disk size parameter. |
| sysbench.sysbench_test_log_level | Check the cluster system log level information when running sysbench. |
| sysbench.sysbench_test_tenant_primary_zone | Check the cluster tenant primary availability zone information when running sysbench. |
| sysbench.sysbench_run_test_tenant_memory_used | Check cluster memory usage information when sysbench is idle. |
| sysbench.sysbench_test_cpu_quota_concurrency | Check cluster CPU quota concurrency information when running sysbench. |
| sysbench.sysbench_free_test_cpu_count | Check cluster CPU count information when sysbench is idle. |
| sysbench.sysbench_test_tenant_log_disk_size | Check the tenant log disk size parameter. |
| sysbench.sysbench_run_test_tenant_cpu_used | Check sysbench runtime cluster CPU information. |
| sysbench.sysbench_test_sql_net_thread_count | Check cluster SQL network thread count information when running sysbench. |
| sysbench.sysbench_test_tenant_cpu_parameters | Check tenant CPU parameters. |
Log check items
| Check item name | Check item description |
|---|---|
| log.log_size_with_ocp | Check whether the free space of the log directory exceeds the size of 100 files. |
| log.log_size | Check the cluster max_syslog_file_count parameter value, and alarm when it is not set to 0 or the setting exceeds 100. The source can be found on GitHub issue #963. |
Archive check items
| Check item name | Check item description |
|---|---|
| archive.archive_continuous_error | Check the OceanBase database log for the "pay ATTENTION!! archive continuous encounter error more than 15" error. This error means that archiving has failed more than 15 times in a row. The source can be found on GitHub issue #991. |
Error code check items
| Check item name | Check item description |
|---|---|
| err_code.find_err_4016 | Check whether error 4016 is reported when enable_sql_audit is set to True. |
| err_code.find_err_4012 | Check whether error 4012 is reported when enable_sql_audit is set to True. |
| err_code.find_err_4001 | Check whether error 4001 is reported when enable_sql_audit is set to True. |
| err_code.find_err_4377 | Check whether error 4377 is reported when enable_sql_audit is set to True. |
| err_code.find_err_4108 | Check whether error 4108 is reported when enable_sql_audit is set to True. |
| err_code.find_err_4013 | Check whether error 4013 is reported when enable_sql_audit is set to True. |
| err_code.find_err_4015 | Check whether error 4015 is reported when enable_sql_audit is set to True. |
| err_code.find_err_4000 | Check whether error 4000 is reported when enable_sql_audit is set to True. |
| err_code.find_err_4105 | Check whether error 4105 is reported when enable_sql_audit is set to True. |
| err_code.find_err_4103 | Check whether error 4103 is reported when enable_sql_audit is set to True. |
Network check items
| Check item name | Check item description |
|---|---|
| network.network_offset | Check cluster network clock offset information. |
| network.network_drop | Check cluster network packet loss information. |
| network.network_speed_diff | Checks if all OBServer nodes have consistent network card speed via dynamic network card name lookup. The source can be found on GitHub issue #763. |
| network.network_write_cond_wakeup | Check the OceanBase cluster log for network write condition wakeup issues. |
| network.local_ip_check | Verify that local_ip in observer.config.bin matches the actual network card IP on the configured network interface. The source can be found on GitHub issue #878. |
| network.log_easy_slow | Check for network latency issues by searching for EASY SLOW in the OceanBase cluster logs. |
| network.TCP_retransmission | Check for TCP retransmissions. The source can be found on GitHub issue #348. |
| network.network_speed | Check cluster network speed information. |
System check items
| Check item name | Check item description |
|---|---|
| system.parameter_tcp_wmem | Detect kernel parameters. |
| system.core_pattern | Check the kernel core_pattern. |
| system.cgroup_version | Check the cgroup version. OceanBase database currently uses cgroup v1. If the customer's operating system is cgroup v2, resource isolation will not take effect. The source can be found on GitHub issue #1101. |
| system.getenforce | Check SELinux via getenforce. |
| system.check_command | Confirm whether dependent components exist. |
| system.dependent_software | Detect dependent software. For details, please refer to the official website "OceanBase Cloud Platform" document Host Standardization Check Items. |
| system.instruction_set_avx | Check whether the CPU supports the AVX instruction set to be compatible with the OceanBase database. The source can be found on GitHub issue #1024. |
| system.dependent_software_swapon | Detect dependent software. For details, please refer to the official website "OceanBase Cloud Platform" document Host Standardization Check Items. |
| system.kernel_bad_version | Check whether the operating system kernel version is 3.10. Deploying the OceanBase database with cgroup on kernel version 3.10 carries a risk of system downtime. The source can be found on GitHub issue #910. |
| system.clock_source | Check whether the clock source type is tsc. |
| system.aio | Detect aio. For details, please refer to the official website "OceanBase Cloud Platform" document Host Standardization Check Items. |
| system.parameter_ip_local_port_range | Detect kernel parameters. For details, please refer to the official website "OceanBase Cloud Platform" document Host Standardization Check Items. |
| system.ulimit_parameter | Detect the ulimit parameter. For details, please refer to the official website "OceanBase Cloud Platform" document Host Standardization Check Items. |
| system.tcp_tw_reuse | Check whether sockets in the TIME-WAIT state (TIME-WAIT ports) can be reused for new TCP connections. The parameter should be set to 1 to ensure system performance. The source can be found on GitHub issue #737. |
| system.mount_options | When mounting NFS, you need to ensure that the parameters of the backup mount environment include nfsvers=4.1, sync, lookupcache=positive and hard. The sources can be found on GitHub issue #611 and issue #852. |
| system.parameter_tcp_rmem | Detect kernel parameters. For details, please refer to the official website "OceanBase Cloud Platform" document Host Standardization Check Items. |
| system.arm_smmu | If the node is an arm architecture, check whether smmu needs to be turned off. The source can be found on GitHub issue #784. |
| system.python_version | Check whether the Python version installed on the host is 2.7.x and ensure that the relevant OceanBase database scripts can run normally. The source can be found on GitHub issue #869. |
| system.clock_source_check | Check whether the server IPs in the clock source configuration files of the OBServer nodes are consistent. The sources can be found on GitHub issue #781 and issue #873. |
| system.check_system_language | Check whether $LANG is en_US.UTF-8. |
| system.parameter | Detect kernel parameters. For details, please refer to the official website "OceanBase Cloud Platform" document Host Standardization Check Items. |
| system.dmesg_log | Confirm if Hardware Error exists in dmesg. The source can be found on GitHub issue #885. |
| system.docker0_interface_check | Check whether docker0 or docker0-like network interface exists in the deployment environment. When deploying OBProxy through OCP, if the docker0 interface exists, the displayed IP may correspond to the docker0 address rather than the actual physical host address. It is recommended to remove the docker0 interface after confirming that it is not in use. The source can be found on GitHub issue #1198. |
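
The system.mount_options item lists four options that an NFS backup mount should carry. The following is a minimal sketch of that comparison, assuming the mount options are available as a comma-separated string; it is not the obdiag implementation, and the function name `missing_nfs_options` is hypothetical. It compares literal strings only and does not normalize equivalent spellings that the operating system may report (for example `vers=4.1`).

```python
from typing import List

# Options named in the system.mount_options description.
REQUIRED_NFS_OPTIONS = ("nfsvers=4.1", "sync", "lookupcache=positive", "hard")


def missing_nfs_options(option_string: str) -> List[str]:
    """Return the required options absent from a comma-separated option string."""
    present = {opt.strip() for opt in option_string.split(",") if opt.strip()}
    return [opt for opt in REQUIRED_NFS_OPTIONS if opt not in present]


# A made-up option string for illustration.
print(missing_nfs_options("rw,nfsvers=4.1,hard,timeo=600"))
```

An empty result means that all four options are present.
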
Table check items
| Check item name | Check item description |
|---|---|
| table.macroblock_utilization_rate_table | Check whether the ratio of the actual data volume to the actual disk usage of all tables in the OceanBase cluster is within a reasonable range. The OceanBase database stores data in macroblocks, and each macroblock may not be fully utilized to improve efficiency. If the ratio of actual data volume to actual disk usage is too low, a full merge should be performed to improve disk utilization. This task includes query timeout protection to prevent hangs. The sources can be found on GitHub issue #848 and issue #1067. |
| table.information_schema_tables_two_data | Check whether there is a table with two records in information_schema.tables. The source can be found on GitHub issue #390. |
| table.auto_split_error_table | Check tables that have auto-split enabled for partitions that should trigger auto-split but do not. This check identifies tables where some partitions have reached the auto_part_size threshold without auto-split being triggered, which may indicate a problem with the auto-split mechanism. |
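
The macroblock utilization item compares the actual data volume with the actual disk usage. The following is a minimal sketch of that ratio, not the obdiag implementation. The source does not state what counts as a reasonable range, so the lower bound is a required argument, and the value 0.5 in the usage line is an arbitrary placeholder.

```python
def macroblock_utilization(data_bytes: int, disk_bytes: int) -> float:
    """Ratio of actual data volume to actual disk usage."""
    if disk_bytes <= 0:
        raise ValueError("disk_bytes must be positive")
    return data_bytes / disk_bytes


def needs_full_merge(data_bytes: int, disk_bytes: int, min_ratio: float) -> bool:
    """True when utilization falls below the caller-supplied lower bound."""
    return macroblock_utilization(data_bytes, disk_bytes) < min_ratio


GIB = 1024 ** 3
# Made-up sizes: 30 GiB of data occupying 100 GiB on disk.
print(macroblock_utilization(30 * GIB, 100 * GIB))
print(needs_full_merge(30 * GIB, 100 * GIB, min_ratio=0.5))  # 0.5 is a placeholder
```
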
CPU check items
| Check item name | Check item description |
|---|---|
| cpu.oversold | Check whether CPU is oversold on any OBServer node. |
Defect check items
| Check item name | Check item description |
|---|---|
| bugs.bug_469 | Check the glibc version of the OBServer node (obtained through ldd). The glibc version must be less than 2.34, otherwise it may cause the OBServer node to crash. The source can be found on GitHub issue #469. |
| bugs.bug_182 | Check for an OceanBase database bug: after the OceanBase database is upgraded to version 4.2.1, executing DDL on some partitioned tables returns error code -4109 with the error message Server state or role not the same as expected. The source can be found on GitHub issue #182. |
| bugs.cgroup_kernel_bad_version | Check whether the operating system kernel version is 3.10. Using the cgroup method to deploy the OceanBase database on the operating system kernel version 3.10 may cause system downtime. The source can be found on GitHub issue #910. |
| bugs.bug_385 | Check for an OceanBase database bug: when the OceanBase database version is in the range [4.2.1.0, 4.2.1.3], multiple root users may exist under a tenant. If this bug occurs, consider upgrading the OceanBase database or deleting the redundant users. The source can be found on GitHub issue #385. |
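
Two of the defect items above depend on version comparisons: glibc must be lower than 2.34, and bug_385 applies to OceanBase database versions in [4.2.1.0, 4.2.1.3]. The following is a minimal sketch of those comparisons, not the obdiag implementation. The function names are hypothetical, and the sketch assumes purely numeric, dot-separated version strings (four parts for the OceanBase database version).

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '4.2.1.3' into (4, 2, 1, 3) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def glibc_is_safe(glibc_version: str) -> bool:
    """bugs.bug_469: the glibc version must be lower than 2.34."""
    return parse_version(glibc_version) < (2, 34)


def affected_by_bug_385(ob_version: str) -> bool:
    """bugs.bug_385: affected range is [4.2.1.0, 4.2.1.3], both ends inclusive."""
    return (4, 2, 1, 0) <= parse_version(ob_version) <= (4, 2, 1, 3)


print(glibc_is_safe("2.17"), glibc_is_safe("2.34"))
print(affected_by_bug_385("4.2.1.2"), affected_by_bug_385("4.2.1.4"))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "2.9" as greater than "2.34".
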
Disk check items
| Check item name | Check item description |
|---|---|
| disk.clog_abnormal_file | Check the clog folder for files that do not belong to the OceanBase database. |
| disk.data_disk_full | Check the data disk usage and alert when the usage exceeds the 85% threshold. The source can be found on GitHub issue #963. |
| disk.disk_hole | Check whether there is a disk hole problem. |
| disk.mount_disk_full | Check the disk usage of each mount point on the host. The source can be found on GitHub issue #611. |
| disk.xfs_repair | Check the xfs_repair log in dmesg. The source can be found on GitHub issue #451. |
| disk.sstable_abnormal_file | Check the data folder for files that do not belong to the OceanBase database. |
| disk.disk_full | Check whether the disk usage reaches the threshold. |
| disk.disk_iops | Check disk IOPS. |
Column storage check items
| Check item name | Check item description |
|---|---|
| column_storage.tenant_parameters | Check the tenant parameters for column storage proof-of-concept (POC) scenarios. |
Cluster check items
| Check item name | Check item description |
|---|---|
| cluster.mod_too_large | Check if any module is using more than 10GB of memory. |
| cluster.core_file_find | Check if the core file exists. |
| cluster.optimizer_better_inlist_costing_parmmeter | Check if the tag parameter is enabled for a specific version. |
| cluster.memory_limit_vs_phy_mem | Check if memory_limit is larger than the physical memory size. Memory_limit larger than physical memory will cause serious problems. The source can be found on GitHub issue #1066. |
| cluster.memory_chunk_cache_size | Check the memory block capacity cached by the memory allocator. It is recommended to set it to 0. The source can be found on GitHub issue #843. |
| cluster.memstore_limit_percentage | Check the percentage of a tenant's total available memory that MemStore can use. It is recommended to keep the default value of 50. The source can be found on GitHub issue #871. |
| cluster.trace_log_slow_query_watermark | Check the query execution time threshold. It is recommended to be no less than 1s and no more than 2s. If the query execution time exceeds this threshold, it is considered a slow query, and the tracking log of the slow query will be printed to the system log. The source can be found on GitHub issue #842. |
| cluster.memstore_usage | Check the MemStore usage and alert when the utilization exceeds 50%. The source can be found on GitHub issue #963. |
| cluster.cpu_quota_concurrency | Check the maximum number of concurrencies allowed by each CPU quota of the tenant. The recommended value is in the range [2,4]. The source can be found on GitHub issue #738. |
| cluster.observer_not_active | Check if any OBServer node is not in ACTIVE state. |
| cluster.clog_sync_time_warn_threshold | Check the clog synchronization time warning threshold. It is recommended to set it to 100ms. If the synchronization time exceeds this threshold, a WARN log is generated. The source can be found on GitHub issue #793. |
| cluster.autoinc_cache_refresh_interval | Check the refresh interval of the auto-increment column cache. It is recommended to set it to more than 1 hour. Frequent refreshes can affect system performance. The source can be found on GitHub issue #817. |
| cluster.syslog_io_bandwidth_limit | Check the disk IO bandwidth limit that the system log can occupy. It is recommended not to exceed 30M. System logs that exceed the bandwidth limit will be discarded. The source can be found on GitHub issue #841. |
| cluster.ls_number | Check whether any log stream is in the not_enough_replica state. |
| cluster.tenant_number | Check the number of tenants. |
| cluster.part_trans_action_max | Check if there are more than 200 transaction participants. |
| cluster.major_suspended | Check if there is a manually suspended major compaction in the OceanBase cluster. The source can be found on GitHub issue #1015. |
| cluster.resource_limit_max_session_num | Check whether the hidden parameter _resource_limit_max_session_num has been changed. Parameter modification may cause Too many connections errors. |
| cluster.task_opt_stat_gather_fail | Check whether the history collection task has failed execution results. |
| cluster.logons_check | Check whether the cumulative user login value is close to the 2147483647 threshold. This check applies only to OceanBase database versions before V4.2.1.4. The source can be found on GitHub issue #972. |
| cluster.zone_not_active | Check if any availability zone is not in ACTIVE state. |
| cluster.deadlocks | Check for deadlocks. |
| cluster.ob_query_timeout | Check the ob_query_timeout global variable for thread hang issues. The source can be found on GitHub issue #978. |
| cluster.datafile_next | Check the node parameters datafile_maxsize and datafile_next. When datafile_maxsize is set and is greater than datafile_size, check whether datafile_next is 0. If datafile_next is 0, the data file will not grow. The source can be found on GitHub issue #573. |
| cluster.ob_enable_prepared_statement | Check whether prepared statements are enabled. It is recommended to enable them, especially if the front end is a Java application. The source can be found on GitHub issue #844. |
| cluster.data_path_settings | Check if data_dir and log_dir_disk are on the same disk. |
| cluster.sys_obcon_health | Check whether the cluster is reachable by connecting to the sys tenant. The source can be found on GitHub issue #872. |
| cluster.tenant_memory_tablet_count | Check whether the tenant memory specifications and the number of tablets per OBServer node exceed the 90% health check threshold. The source can be found on GitHub issue #1104. |
| cluster.tenant_500_memory_analysis | Analyze the memory usage of tenant 500 (internal system tenant) and identify memory anomalies. Check the total memory, top consuming modules, known problem modules and memory ratio. The source can be found on GitHub issue #99. |
| cluster.major | Check if there are any suspended major compaction processes. |
| cluster.tenant_locks | Check the waiting number of tenant locks and alarm when it exceeds the 5000 threshold. The source can be found on GitHub issue #963. |
| cluster.server_permanent_offline_time | Check the server_permanent_offline_time parameter and alert when it is not set to 3600s. The source can be found on GitHub issue #816. |
| cluster.no_leader | Check the cluster tenant log stream leader. |
| cluster.cgroup | Check whether tenant isolation is enabled when the OceanBase database is version 4.x or later. It should be enabled by default to ensure performance. The source can be found on GitHub issue #849. |
| cluster.freeze_trigger_percentage | Check the freeze_trigger_percentage parameter. It is recommended that the server maintain the default configuration 20. The source can be found on GitHub issue #795. |
| cluster.observer_port | Check whether the necessary ports between OceanBase cluster nodes are connected. The source can be found on GitHub issue #845. |
| cluster.memory_limit_percentage | Check the percentage of the total memory size that the system is allowed to use. It is recommended to keep the default value of 80. The source can be found on GitHub issue #750. |
| cluster.upper_trans_version | Check OceanBase database version. When the OceanBase database version is V4.0.0.0 or above, if executing the relevant SQL query in the sys tenant returns a non-empty result (that is, upper_trans_version cannot be calculated for a long time), the user is prompted to upgrade to OceanBase database V4.2.5.3 or above to fix this problem. The source can be found on GitHub issue #838. |
| cluster.task_opt_stat | Check the task optimization statistics collection history. |
| cluster.ob_enable_plan_cache_bad_version | Check the ob_enable_plan_cache variable. When the OceanBase database version is V4.1.0 or V4.1.0 BP1, it is recommended to turn off ob_enable_plan_cache. |
| cluster.session_limit | Check the number of tenant sessions and alert when it exceeds the 5000 threshold. The source can be found on GitHub issue #963. |
| cluster.table_history_too_many | Check the table history of tenants in the cluster. If there are too many table histories of tenants in the cluster, schema refresh will continue to report -4013 when the machine is restarted, causing the specific machine to be unable to refresh the schema of the corresponding tenant. |
| cluster.auto_increment_cache_size | Check the globally available cache of auto-increment columns for all tenants in the cluster. The source can be found on GitHub issue #870. |
| cluster.sys_log_level | Check the sys_log_level parameter. |
| cluster.global_indexes_too_much | Check if there are tables with more than 20 global indexes. |
| cluster.large_query_threshold | Check the query execution time threshold. It is recommended to set it to 5s. Requests that exceed this time limit may be suspended. A suspended request is automatically classified as a large query, and the large query scheduling policy is applied to it. The source can be found on GitHub issue #859. |
| cluster.enable_lock_priority | Check whether the enable_lock_priority parameter is enabled. Activation of the enable_lock_priority parameter will affect the performance of DDL/DML in daily use. It is not recommended to turn it on unless lock-free structural changes are required. The source can be found on GitHub issue #890. |
| cluster.upgrade_finished | Check whether the OceanBase cluster upgrade is completed and verify version consistency. The source can be found on GitHub issue #759. |
| cluster.default_compress_func | Check the default compression algorithm for table data. It is recommended to use the default value that matches ob_version to improve the compression ratio and reduce storage costs. For scenarios with stricter query response time (RT) requirements, consider using lz4_1.0 or turning off compression. The source can be found on GitHub issue #792. |
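
Many cluster items above compare one parameter against a recommended value or range. The following is a minimal sketch of such a comparison, not the obdiag implementation. The ranges are taken from the table. The dictionary keys are hypothetical: in particular, the `_s` suffix marks time values that are assumed to have been converted to seconds before the check.

```python
from typing import Dict, List, Tuple

# Inclusive (min, max) ranges derived from the recommendations in the table above.
RECOMMENDED: Dict[str, Tuple[float, float]] = {
    "memory_chunk_cache_size": (0, 0),
    "memstore_limit_percentage": (50, 50),
    "cpu_quota_concurrency": (2, 4),
    "freeze_trigger_percentage": (20, 20),
    "memory_limit_percentage": (80, 80),
    "server_permanent_offline_time_s": (3600, 3600),
    "trace_log_slow_query_watermark_s": (1, 2),
}


def out_of_range(params: Dict[str, float]) -> List[str]:
    """Return the names of supplied parameters that fall outside the recommendation."""
    findings = []
    for name, (low, high) in RECOMMENDED.items():
        if name in params and not (low <= params[name] <= high):
            findings.append(name)
    return findings


# Made-up parameter values for illustration.
sample = {
    "cpu_quota_concurrency": 10,
    "memstore_limit_percentage": 50,
    "trace_log_slow_query_watermark_s": 0.1,
}
print(out_of_range(sample))
```

Parameters that are not supplied are skipped rather than reported, so the same function can be used with partial input.
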
