During database operation, I/O bottlenecks are a common issue.
To help you quickly identify the root cause and resolve issues efficiently, this topic summarizes a clear and practical I/O troubleshooting process. It provides concrete steps that improve issue resolution efficiency, minimize the impact on business operations, and support daily maintenance.
I/O bottlenecks can be categorized into the following two scenarios:
SQL-level issue: Some SQL statements have significant I/O overhead during execution. In this case, you need to analyze the associated database objects or operators.
Cluster-level issue: The OCP monitoring indicates high I/O pressure on the database. In this case, you need to determine the source of the I/O pressure.
Process
Check for I/O-related issues.
You can use the following methods to check for I/O-related issues:
Log in to the OCP console and view the Physical I/O Count and Physical I/O Throughput performance metrics. If the I/O bandwidth of a node or tenant is high, it indicates an I/O-related issue.
View the Top Foreground DB Time and Top Background DB Time sections in the ASH report. If I/O-related wait events, such as db file data read, db file compact read, db file compact write, and row store disk write (corresponding to Wait Class values of USER_IO and SYSTEM_IO), account for a significant proportion, it indicates an I/O-related issue.
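If you want to cross-check the report with a live query, you can approximate the same proportion by sampling the active-session view directly. The following is a minimal sketch, assuming an OceanBase 4.x tenant in which GV$ACTIVE_SESSION_HISTORY exposes SAMPLE_TIME, EVENT, and WAIT_CLASS columns; verify the view and column names in your release before relying on it.

```sql
-- Approximate the proportion of DB time spent in each wait class and
-- event over the last 10 minutes (assumed columns: SAMPLE_TIME,
-- EVENT, WAIT_CLASS).
SELECT
  wait_class,
  event,
  COUNT(*) AS samples,
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS pct_of_db_time
FROM oceanbase.GV$ACTIVE_SESSION_HISTORY
WHERE sample_time > NOW() - INTERVAL 10 MINUTE
GROUP BY wait_class, event
ORDER BY samples DESC;
```

A high combined percentage for rows whose wait class is USER_IO or SYSTEM_IO points to the same conclusion as the report sections above.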
Troubleshoot the cluster-level issue.
Obtain the ASH report for the corresponding node and tenant during the issue period.
View the Top Foreground DB Time and Top Background DB Time sections to identify the source of the I/O.
If the I/O-related wait events in the foreground account for a significant proportion, the I/O load is mainly concentrated in the foreground sessions. In this case, proceed to step 3 to troubleshoot the SQL-level issue.
Otherwise, continue to further identify the source of the I/O.
Further identify the source of the I/O.
View the Top IO Bandwidth section. A larger IO Size(MB) value indicates a larger total I/O size issued by the corresponding module during the issue period.
Select the Program Module Action/SQL ID field with a larger IO Size(MB) value to obtain the corresponding information. If the information is related to PX/DAS execution, proceed to step 3 to troubleshoot the SQL-level issue. Otherwise, refer to the relevant documentation or contact technical support for assistance.
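When the report is unavailable, a similar module-level breakdown can be sketched against the active-session view. This example assumes the view carries MODULE, ACTION, and SQL_ID columns alongside WAIT_CLASS, which may differ across releases; treat it as an illustration rather than a fixed recipe.

```sql
-- Break down I/O-class wait samples by module to see which component
-- is issuing the I/O (assumed columns: MODULE, ACTION, SQL_ID,
-- WAIT_CLASS, SAMPLE_TIME).
SELECT
  module,
  action,
  sql_id,
  COUNT(*) AS io_wait_samples
FROM oceanbase.GV$ACTIVE_SESSION_HISTORY
WHERE wait_class IN ('USER_IO', 'SYSTEM_IO')
  AND sample_time > NOW() - INTERVAL 10 MINUTE
GROUP BY module, action, sql_id
ORDER BY io_wait_samples DESC
LIMIT 20;
```

Rows with a non-empty sql_id generally correspond to foreground or PX/DAS work, while rows attributed only to a module suggest background tasks such as compaction.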
Troubleshoot the SQL-level issue.
If the I/O-related wait events in the foreground account for a significant proportion, you can further troubleshoot the SQL-level issue based on the ASH report for the corresponding node and tenant during the issue period:
View the Top SQL with Top Event and Top SQL with Top Operator sections.
Top SQL with Top Event: Displays the SQL statements with the highest execution overhead in the ASH report, along with each statement's highest-proportion wait events in the Top Event column.
Top SQL with Top Operator: Displays the SQL statements with the highest execution overhead in the ASH report, along with the operator IDs and event IDs with the highest overhead for each statement in the Top Operator and Top Event columns.
By viewing these sections, you can obtain the SQL statements and operator IDs associated with the I/O-related wait events.
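The same SQL-to-operator attribution can be sketched as a direct query, under the assumption that the active-session view records the executing plan operator in a PLAN_LINE_ID column (as in Oracle-style ASH); confirm the column name in your release, since it may be absent or named differently.

```sql
-- Attribute I/O-class waits to specific SQL statements and plan
-- operators (assumed columns: SQL_ID, PLAN_LINE_ID, EVENT,
-- WAIT_CLASS).
SELECT
  sql_id,
  plan_line_id,
  event,
  COUNT(*) AS io_wait_samples
FROM oceanbase.GV$ACTIVE_SESSION_HISTORY
WHERE wait_class IN ('USER_IO', 'SYSTEM_IO')
GROUP BY sql_id, plan_line_id, event
ORDER BY io_wait_samples DESC
LIMIT 10;
```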
To identify which table partitions are being accessed and generating the I/O, view the Top DB Object section, which lists the tablet IDs most frequently accessed by user SQL statements.
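Once you have a hot tablet ID from the Top DB Object section, you can map it back to a table and partition. A sketch, assuming the CDB_OB_TABLE_LOCATIONS dictionary view (queried from the sys tenant in OceanBase 4.x) exposes a TABLET_ID column; the literal 200042 is a hypothetical placeholder for the tablet ID taken from your report.

```sql
-- Map a tablet ID from the Top DB Object section back to its table
-- and partition. 200042 is a hypothetical tablet ID; substitute the
-- value from your ASH report. Assumption: CDB_OB_TABLE_LOCATIONS
-- exists with these columns in your release.
SELECT
  tenant_id,
  database_name,
  table_name,
  partition_name,
  tablet_id
FROM oceanbase.CDB_OB_TABLE_LOCATIONS
WHERE tablet_id = 200042;
```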
Based on the obtained information, refer to the relevant documentation or contact technical support for assistance.