Purpose
This is a linear regression function that fits a regression line using the ordinary least squares method to a set of pairs of numbers. It can be used as either an aggregate or an analytic function.
Note
- When used as an analytic function, the
OVERclause must be used to define the window for the calculation. It performs the calculation on a set of rows and returns multiple values. - When used as an aggregate function, it performs the calculation on a set of rows and returns a single value. In this case, the
OVERclause is not needed.
Syntax
{ REGR_SLOPE
| REGR_INTERCEPT
| REGR_COUNT
| REGR_R2
| REGR_AVGX
| REGR_AVGY
| REGR_SXX
| REGR_SYY
| REGR_SXY
}(expr1 , expr2)
[ OVER (analytic_clause) ]
Parameters
Parameter |
Description |
|---|---|
| REGR_SLOPE | Returns the slope of the line. The return value is a numeric data type and can be NULL. After ignoring the null value pairs ( expr1, expr2 ), it performs the following calculation: sql COVAR_POP(expr1, expr2) / VAR_POP(expr2) |
| REGR_INTERCEPT | Returns the y intercept of the regression line. The return value is a numeric data type and can be NULL. After ignoring the null value pairs ( expr1, expr2 ), it performs the following calculation: sql AVG(expr1) - REGR_SLOPE(expr1, expr2) * AVG(expr2) |
| REGR_COUNT | Returns the number of non-null value pairs used to fill the regression line, which is an integer. |
| REGR_R2 | Returns the coefficient of determination (also known as R squared or goodness of fit) of the regression. The return value is a numeric data type and can be NULL. After ignoring the null values of VAR_POP(expr1) and VAR_POP(expr2), the return value is: sql NULL if VAR_POP(expr2) = 0 1 if VAR_POP(expr1) = 0 and VAR_POP(expr2) != 0 POWER(CORR(expr1,expr),2) if VAR_POP(expr1) > 0 and VAR_POP(expr2 != 0 |
| REGR_AVGX | Returns the average of the independent variable ( expr2 ) of the regression line. After ignoring the null value pairs ( expr1, expr2 ), it performs the following calculation: sql AVG(expr2) |
| REGR_AVGY | Returns the average of the dependent variable ( expr1 ) of the regression line. After ignoring the null value pairs ( expr1, expr2 ), it performs the following calculation: sql AVG(expr1) |
| REGR_SXX | Returns the value equal to the total variance of the independent variable ( expr2 ) multiplied by the number of value pairs after ignoring the null value pairs ( expr1, expr2 ). The calculation is as follows: sql REGR_COUNT(expr1, expr2) * VAR_POP(expr2) |
| REGR_SYY | Returns the value equal to the total variance of the dependent variable ( expr1 ) multiplied by the number of value pairs after ignoring the null value pairs ( expr1, expr2 ). The calculation is as follows: sql REGR_COUNT(expr1, expr2) * VAR_POP(expr1) |
| REGR_SXY | Returns the value equal to the total covariance multiplied by the number of value pairs after ignoring the null value pairs ( expr1, expr2 ). The calculation is as follows: sql REGR_COUNT(expr1, expr2) * COVAR_POP(expr1, expr2) |
| expr1 | The value of the dependent variable (y value). |
| expr2 | The value of the independent variable (x value). |
| OVER | Use the OVER clause to define the window for calculation. For more information, see Window Function Description. |
Note
REGR_SXX, REGR_SYY, and REGR_SXY are auxiliary functions used to calculate various diagnostic statistics.
Return type
The return value is of the numeric data type or NULL.
Examples
A table named tbl1 has been created.
obclient> SELECT * FROM tbl1;
+------+------+------+------+
| COL1 | COL2 | COL3 | COL4 |
+------+------+------+------+
| 1 | A1 | 8 | 12 |
| 1 | A2 | 10 | 15 |
| 1 | A3 | 11 | 16 |
| 2 | B1 | 9 | 14 |
| 2 | B2 | 10 | 15 |
| 2 | B3 | 8 | 13 |
| 2 | B4 | 11 | 16 |
| 3 | C1 | 8 | 18 |
| 3 | C2 | 9 | 16 |
| 3 | C3 | 10 | 15 |
| 3 | C4 | 11 | 12 |
| 3 | C5 | 12 | 10 |
| 4 | d1 | 8 | NULL |
| 4 | d1 | 9 | 10 |
| 4 | d1 | 10 | NULL |
+------+------+------+------+
15 rows in set
Example of an aggregate function
Calculate three types of diagnostic statistics for the linear regression of the numeric pair of columns col3 and col4.
obclient> SELECT col1,REGR_SXX(col3,col4) "REGR_SXX",
REGR_SYY(col3,col4) "REGR_SYY",
REGR_SXY(col3,col4) "REGR_SXY"
FROM tbl1
GROUP BY col1;
+------+----------------------------------------+------------------------------------------+----------------------------------------+
| COL1 | REGR_SXX | REGR_SYY | REGR_SXY |
+------+----------------------------------------+------------------------------------------+----------------------------------------+
| 1 | 8.666666666666666666666666666666666667 | 4.66666666666666666666666666666666666701 | 6.333333333333333333333333333333333333 |
| 2 | 5 | 5 | 5 |
| 3 | 40.8 | 10 | -20 |
| 4 | 0 | 0 | 0 |
+------+----------------------------------------+------------------------------------------+----------------------------------------+
4 rows in set
Example of an analytic function
Group by column col1 and sort column col2 in ascending order, then calculate the cumulative slope of the numeric pair of columns col3 and col4.
obclient> SELECT col1,col2,col3,col4,
REGR_SLOPE(col3,col4) OVER(PARTITION BY col1 ORDER BY col2) "REGR_SLOPE"
FROM tbl1;
+------+------+------+------+--------------------------------------------+
| COL1 | COL2 | COL3 | COL4 | REGR_SLOPE |
+------+------+------+------+--------------------------------------------+
| 1 | A1 | 8 | 12 | NULL |
| 1 | A2 | 10 | 15 | .6666666666666666666666666666666666666667 |
| 1 | A3 | 11 | 16 | .7307692307692307692307692307692307691642 |
| 2 | B1 | 9 | 14 | NULL |
| 2 | B2 | 10 | 15 | 1 |
| 2 | B3 | 8 | 13 | 1 |
| 2 | B4 | 11 | 16 | 1 |
| 3 | C1 | 8 | 18 | NULL |
| 3 | C2 | 9 | 16 | -.5 |
| 3 | C3 | 10 | 15 | -.6428571428571428571428571428571428570956 |
| 3 | C4 | 11 | 12 | -.5066666666666666666666666666666666666667 |
| 3 | C5 | 12 | 10 | -.4901960784313725490196078431372549019608 |
| 4 | d1 | 8 | NULL | NULL |
| 4 | d1 | 9 | 10 | NULL |
| 4 | d1 | 10 | NULL | NULL |
+------+------+------+------+--------------------------------------------+
15 rows in set
