During database operation, I/O issues are a common problem and can be divided into the following two scenarios:
One is that the main overhead of a certain SQL statement is disk reading. You need to find out which database object or SQL operator is associated with the disk reading.
Another is that the disk read/write pressure is very high in the OCP monitoring. You need to find out the source of the disk pressure.
To help you quickly locate the root cause and efficiently solve the problem, this topic summarizes a clear and practical I/O troubleshooting process. This process provides clear steps to improve the efficiency of problem handling, minimize the impact on business operations, and provide strong support for daily maintenance work.
I/O issues can be divided into the following two scenarios:
SQL-level issues: Some SQL statements have high disk reading overhead. You need to associate the database object or operator with the disk reading.
Cluster-level issues: The OCP monitoring shows high disk read/write pressure. You need to find out the source of the disk pressure.
Process introduction
Confirm whether there are I/O-related issues.
You can use the following methods to confirm:
Log in to the OCP console and check the Physical I/O Count and Physical I/O Throughput performance metrics. View the I/O bandwidth at the node level (OBServer level) and tenant level (tenant level). If the I/O bandwidth of a node or tenant is high, it indicates that there may be I/O-related issues.
Check the Top Foreground DB Time and Top Background DB Time sections in the ASH report. If the waiting events related to I/O, such as
db file data read,db file compact read,db file compact write, androw store disk write, account for a high proportion (corresponding toWait Classvalues ofUSER_IOandSYSTEM_IO), it indicates that there may be I/O-related issues.
Troubleshoot cluster-level issues.
Obtain the ASH report for the corresponding node and tenant during the issue period.
View the Top Foreground DB Time and Top Background DB Time sections to identify the source of I/O.
If the proportion of I/O-related waiting events in the foreground is high, it indicates that the I/O load is mainly concentrated on foreground sessions. In this case, proceed to step 3 to troubleshoot SQL-level issues.
Otherwise, continue to further identify the source of I/O.
Further identify the source of I/O.
View the Top IO Bandwidth section. A larger
IO Size(MB)value indicates that the total size of I/O operations issued by the corresponding module during the issue period is larger.Select the
Program Module Action/SQL IDfield with a largeIO Size(MB)value to obtain the corresponding information. If the information relates to PX/DAS execution, proceed to step 3 to troubleshoot SQL-level issues. Otherwise, refer to the relevant documentation or contact technical support for assistance.
Troubleshoot SQL-level issues.
Based on the ASH report for the corresponding node and tenant during the issue period, if the proportion of I/O-related waiting events in the foreground is high, you can further troubleshoot:
View the Top SQL with Top Event and Top SQL with Top Operator sections.
Top SQL with Top Event: Displays the SQL statements with the highest execution overhead in the obtained ASH report. The
Top Eventcolumn shows the waiting event with the highest proportion for each SQL statement.Top SQL with Top Operator: Displays the SQL statements with the highest execution overhead in the obtained ASH report. The
Top OperatorandTop Eventcolumns show the operator ID and event_id with the highest overhead for each SQL statement.By viewing these two tables, you can obtain the SQL statements and operator IDs associated with I/O-related waiting events.
If you want to know which database table partition caused the I/O, view the Top DB Object section. It lists the tablet_id accessed most frequently by user SQL statements.
Based on the obtained information, refer to the relevant documentation or contact technical support for assistance.
