Purpose
This function is a linear regression function that fits a regression line using the ordinary least squares method to a set of number pairs. It can be used as either an aggregate or analytic function.
Note
- When used as an analytic function, the
OVERclause must be used to define the window for computation. It computes over a set of rows and returns multiple values. - When used as an aggregate function, it aggregates over a set of rows and returns a single value. In this case, the
OVERclause is not required.
Syntax
{ REGR_SLOPE
| REGR_INTERCEPT
| REGR_COUNT
| REGR_R2
| REGR_AVGX
| REGR_AVGY
| REGR_SXX
| REGR_SYY
| REGR_SXY
}(expr1 , expr2)
[ OVER (analytic_clause) ]
Parameters
Parameter |
Description |
|---|---|
| REGR_SLOPE | Returns the slope of the regression line. The return value is a numeric data type and can be NULL. After ignoring the null pairs ( expr1, expr2 ), it performs the following calculation: sql COVAR_POP(expr1, expr2) / VAR_POP(expr2) |
| REGR_INTERCEPT | Returns the y intercept of the regression line. The return value is a numeric data type and can be NULL. After ignoring the null pairs ( expr1, expr2 ), it performs the following calculation: sql AVG(expr1) - REGR_SLOPE(expr1, expr2) * AVG(expr2) |
| REGR_COUNT | Returns the number of non-null pairs used to build the regression line. It is an integer. |
| REGR_R2 | Returns the coefficient of determination (also known as R squared or goodness of fit). The return value is a numeric data type and can be NULL. After ignoring the null values of VAR_POP(expr1) and VAR_POP(expr2), it performs the following calculation: sql NULL if VAR_POP(expr2) = 0 1 if VAR_POP(expr1) = 0 and VAR_POP(expr2) != 0 POWER(CORR(expr1,expr),2) if VAR_POP(expr1) >; 0 and VAR_POP(expr2 != 0 |
| REGR_AVGX | Returns the average of the independent variable ( expr2 ) of the regression line. After ignoring the null pairs ( expr1, expr2 ), it performs the following calculation: sql AVG(expr2) |
| REGR_AVGY | Returns the average of the dependent variable ( expr1 ) of the regression line. After ignoring the null pairs ( expr1, expr2 ), it performs the following calculation: sql AVG(expr1) |
| REGR_SXX | Returns the value equal to the total variance of the independent variable ( expr2 ) multiplied by the number of non-null pairs, after ignoring the null pairs ( expr1, expr2 ). The calculation is as follows: sql REGR_COUNT(expr1, expr2) * VAR_POP(expr2) |
| REGR_SYY | Returns the value equal to the total variance of the dependent variable ( expr1 ) multiplied by the number of non-null pairs, after ignoring the null pairs ( expr1, expr2 ). The calculation is as follows: sql REGR_COUNT(expr1, expr2) * VAR_POP(expr1) |
| REGR_SXY | Returns the value equal to the total covariance multiplied by the number of non-null pairs, after ignoring the null pairs ( expr1, expr2 ). The calculation is as follows: sql REGR_COUNT(expr1, expr2) * COVAR_POP(expr1, expr2) |
| expr1 | Specifies the value of the dependent variable (y value). |
| expr2 | Specifies the value of the independent variable (x value). |
| OVER | Use the OVER clause to define the window for the calculation. For more information, see Window Function Description. |
Note
REGR_SXX, REGR_SYY, and REGR_SXY are auxiliary functions for calculating various diagnostic statistics.
Return type
The return value is a numeric data type or NULL.
Examples
A table named tbl1 has been created.
obclient> SELECT * FROM tbl1;
+------+------+------+------+
| COL1 | COL2 | COL3 | COL4 |
+------+------+------+------+
| 1 | A1 | 8 | 12 |
| 1 | A2 | 10 | 15 |
| 1 | A3 | 11 | 16 |
| 2 | B1 | 9 | 14 |
| 2 | B2 | 10 | 15 |
| 2 | B3 | 8 | 13 |
| 2 | B4 | 11 | 16 |
| 3 | C1 | 8 | 18 |
| 3 | C2 | 9 | 16 |
| 3 | C3 | 10 | 15 |
| 3 | C4 | 11 | 12 |
| 3 | C5 | 12 | 10 |
| 4 | d1 | 8 | NULL |
| 4 | d1 | 9 | 10 |
| 4 | d1 | 10 | NULL |
+------+------+------+------+
15 rows in set
Aggregate function example
Calculate three types of diagnostic statistics for the linear regression of the numeric pair col3 and col4.
obclient> SELECT col1,REGR_SXX(col3,col4) "REGR_SXX",
REGR_SYY(col3,col4) "REGR_SYY",
REGR_SXY(col3,col4) "REGR_SXY"
FROM tbl1
GROUP BY col1;
+------+----------------------------------------+------------------------------------------+----------------------------------------+
| COL1 | REGR_SXX | REGR_SYY | REGR_SXY |
+------+----------------------------------------+------------------------------------------+----------------------------------------+
| 1 | 8.666666666666666666666666666666666667 | 4.66666666666666666666666666666666666701 | 6.333333333333333333333333333333333333 |
| 2 | 5 | 5 | 5 |
| 3 | 40.8 | 10 | -20 |
| 4 | 0 | 0 | 0 |
+------+----------------------------------------+------------------------------------------+----------------------------------------+
4 rows in set
Analytic function example
Group by col1 and sort col2 in ascending order, then calculate the cumulative slope for the numeric pair col3 and col4.
obclient> SELECT col1,col2,col3,col4,
REGR_SLOPE(col3,col4) OVER(PARTITION BY col1 ORDER BY col2) "REGR_SLOPE"
FROM tbl1;
+------+------+------+------+--------------------------------------------+
| COL1 | COL2 | COL3 | COL4 | REGR_SLOPE |
+------+------+------+------+--------------------------------------------+
| 1 | A1 | 8 | 12 | NULL |
| 1 | A2 | 10 | 15 | .6666666666666666666666666666666666666667 |
| 1 | A3 | 11 | 16 | .7307692307692307692307692307692307691642 |
| 2 | B1 | 9 | 14 | NULL |
| 2 | B2 | 10 | 15 | 1 |
| 2 | B3 | 8 | 13 | 1 |
| 2 | B4 | 11 | 16 | 1 |
| 3 | C1 | 8 | 18 | NULL |
| 3 | C2 | 9 | 16 | -.5 |
| 3 | C3 | 10 | 15 | -.6428571428571428571428571428571428570956 |
| 3 | C4 | 11 | 12 | -.5066666666666666666666666666666666666667 |
| 3 | C5 | 12 | 10 | -.4901960784313725490196078431372549019608 |
| 4 | d1 | 8 | NULL | NULL |
| 4 | d1 | 9 | 10 | NULL |
| 4 | d1 | 10 | NULL | NULL |
+------+------+------+------+--------------------------------------------+
15 rows in set
