Analytic functions |V2.2.77|OceanBase Database| docs|Distributed Database

Analytic functions

Last Updated：2023-08-18 09:26:34 Updated

Introduction

Analytic functions (also known as window functions in some databases) are similar to aggregate functions in that the computation is always based on a set of rows. The difference is that an aggregate function returns only one row per group, while an analytic function can return multiple rows per group, and each row in the group is the result of the window-based logical computation. Analytic functions can significantly optimize queries that require self-joins.

Syntax

A window is also called a frame. OceanBase Database supports two types of frame semantics: ROWS and RANGE. The former is based on a physical row offset, and the latter is based on a logical value offset.

Syntax:

analytic_function:
  analytic_function([ arguments ]) OVER (analytic_clause)

analytic_clause:
  [ query_partition_clause ] [ order_by_clause [ windowing_clause ] ]

query_partition_clause:
  PARTITION BY { expr[, expr ]... | ( expr[, expr ]... ) }

order_by_clause:
  ORDER [ SIBLINGS ] BY{ expr | position | c_alias } [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ] [, { expr | position | c_alias } [ ASC | DESC ][ NULLS FIRST | NULLS LAST ]]...

windowing_clause:
  { ROWS | RANGE } { BETWEEN { UNBOUNDED PRECEDING | CURRENT ROW | value_expr {
  PRECEDING | FOLLOWING } } AND{ UNBOUNDED FOLLOWING | CURRENT ROW | value_expr { 
  PRECEDING | FOLLOWING } } | { UNBOUNDED PRECEDING | CURRENT ROW| value_expr 
  PRECEDING}}

SUM/MIN/MAX/COUNT/AVG

Declaration

Syntax of SUM: SUM([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]

Syntax of MIN: MIN([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]

Synatx of MAX: MAX([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]

Syntax of COUNT: COUNT({ * | [ DISTINCT | ALL ] expr }) [ OVER (analytic_clause) ]

Syntax of AVG: AVG([ DISTINCT | ALL ] expr) [ OVER(analytic_clause) ]

Description

All the preceding analytic functions have the corresponding aggregate functions. SUM returns the sum of expr, MIN/MAX returns the minimum/maximum value of expr, COUNT returns the number of query rows in the window, and AVG returns the average value of expr.

If expr is specified for the COUNT function, the number of expr values excluding NULL is returned. If COUNT(*) is specified, the number of all rows is returned.

Example

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.17 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.03 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient>select last_name, sum(salary) over(partition by job_id) totol_s, min(salary) over(partition by job_id) min_s, max(salary) over(partition by job_id) max_s, count(*) over(partition by job_id) count_s from exployees;
+-----------+---------+-------+-------+---------+
| last_name | totol_s | min_s | max_s | count_s |
+-----------+---------+-------+-------+---------+
| jim       |    2000 |  2000 |  2000 |       1 |
| mike      |   36000 | 11000 | 13000 |       3 |
| lily      |   36000 | 11000 | 13000 |       3 |
| tom       |   36000 | 11000 | 13000 |       3 |
+-----------+---------+-------+-------+---------+
4 rows in set (0.01 sec)

NTH_VALUE/FIRST_VALUE/LAST_VALUE

Declaration

Syntax of NTH_VALUE: NTH_VALUE (measure_expr, n) [ FROM { FIRST | LAST } ] [ { RESPECT | IGNORE } NULLS ] OVER (analytic_clause)

Syntax of FIRST_VALUE: FIRST_VALUE { (expr) [ {RESPECT | IGNORE} NULLS ] | (expr [ {RESPECT | IGNORE} NULLS ])} OVER (analytic_clause)

Syntax of LAST_VALUE: LAST_VALUE { (expr) [ {RESPECT | IGNORE} NULLS ] | (expr [ {RESPECT | IGNORE} NULLS ])} OVER (analytic_clause)

Description

The NTH_VALUE function represents the ordinal number of a value in a specific direction defined by [FROM { FIRST | LAST } ]. The default direction is FROM FIRST, including a flag indicating whether to ignore the NULL values. The windows are a unified analytic_clause. Here, n must be positive. If n is NULL, an error message is returned. If n is greater than the total number of rows in the window, the function returns NULL.

FIRST_VALUE specifies to count from the first value, while LAST_VALUE specifies to count from the last.

Example

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> select last_name, first_value(salary) over(partition by job_id) totol_s, last_value(salary) over(partition by job_id) min_s, max(salary) over(partition by job_id) max_s from exployees;
+-----------+---------+-------+-------+
| last_name | totol_s | min_s | max_s |
+-----------+---------+-------+-------+
| jim       |    2000 |  2000 |  2000 |
| mike      |   12000 | 11000 | 13000 |
| lily      |   12000 | 11000 | 13000 |
| tom       |   12000 | 11000 | 13000 |
+-----------+---------+-------+-------+
4 rows in set (0.01 sec)

LEAD/LAG

Declaration

Syntax of LEAD: LEAD { ( value_expr [, offset [, default]]) [ { RESPECT | IGNORE } NULLS ] | ( value_expr [ { RESPECT | IGNORE } NULLS ] [, offset [, default]] )} OVER ([ query_partition_clause ] order_by_clause)

Syntax of LAG: LAG { ( value_expr [, offset [, default]]) [ { RESPECT | IGNORE } NULLS ] | ( value_expr [ { RESPECT | IGNORE } NULLS ] [, offset [, default]] )} OVER ([ query_partition_clause ] order_by_clause)

Description

LEAD and LAG are used to retrieve the data of the Nth row before or after the same field on the current row from a query. This operation can be implemented by using the self-join of the same table. However, the LEAD/LAG window functions are more efficient.

value_expr is the field used for comparison. offset is the offset of value_expr. The default value of the default parameter is NULL, which means if no default value is set for LEAD/LAG, NULL is returned. For example, if the current row is 4 and the offset value is 6 for LAG, Row -2 is the row to be retrieved. If this row does not exist, the value of default is returned.

[ { RESPECT | IGNORE } NULLS] specifies whether to include NULL values in the result. The default value is RESPECT, which means to include null values in the result.

Note that the LEAD and LAG functions must be followed by order_by_clause. The data must be sorted on a column. query_partition_clause is optional. If query_partition_clause is not specified, data of all partitions that meets the conditions is retrieved.

Example

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, lead(salary) over(order by salary) lead, lag(salary) over(order by salary) lag from exployees;
+-----------+-------+-------+
| last_name | lead  | lag   |
+-----------+-------+-------+
| jim       | 11000 |  NULL |
| tom       | 12000 |  2000 |
| mike      | 13000 | 11000 |
| lily      |  NULL | 12000 |
+-----------+-------+-------+
4 rows in set (0.01 sec)

STDDEV/VARIANCE/STDDEV_SAMP/STDDEV_POP

Declaration

Syntax of VARIANCE: VARIANCE([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]

Syntax of STDDEV: STDDEV([ DISTINCT | ALL ] expr) [ OVER (analytic_clause) ]

Syntax of STDDEV_SAMP: STDDEV_SAMP(expr) [ OVER (analytic_clause) ]

Syntax of STDDEV_POP: STDDEV_POP(expr) [ OVER (analytic_clause) ]

Description

VARIANCE returns the variance of expr. expr may be a numeric type or a type that can be converted to a numeric type. The data type of variance is the same as that of the input value.

STDDEV returns the standard deviation of expr and the parameter type is the same as that of VARIANCE.

STDDEV_SAMP returns the sample standard deviation.

STDDEV_POP returns the cumulative standard deviation of the population.

Example

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, stddev(salary) over(order by salary) std, variance(salary) over(order by salary) var, stddev_pop(salary) over() std_pop, stddev_samp(salary) over() from exployees;
+-----------+-------------------+--------------------+-------------------+----------------------------+
| last_name | std               | var                | std_pop           | stddev_samp(salary) over() |
+-----------+-------------------+--------------------+-------------------+----------------------------+
| jim       |                 0 |                  0 | 4387.482193696061 |          5066.228051190222 |
| tom       |              4500 |           20250000 | 4387.482193696061 |          5066.228051190222 |
| mike      | 4496.912521077347 | 20222222.222222224 | 4387.482193696061 |          5066.228051190222 |
| lily      | 4387.482193696061 |           19250000 | 4387.482193696061 |          5066.228051190222 |
+-----------+-------------------+--------------------+-------------------+----------------------------+
4 rows in set (0.00 sec)

NTILE

Declaration

NTILE(expr) OVER ([ query_partition_clause ] order_by_clause)

Description

The NTILE function divides sorted rows of a partition into a specified number of groups that have the same size and returns the group number to each row. If expr is NULL, NULL is returned.

Example

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, ntile(10) over(partition by job_id order by salary) ntl from exployees;
+-----------+------+
| last_name | ntl  |
+-----------+------+
| jim       |    1 |
| tom       |    1 |
| mike      |    2 |
| lily      |    3 |
+-----------+------+
4 rows in set (0.01 sec)

ROW_NUMBER

Declaration

ROW_NUMBER( ) OVER ([ query_partition_clause ] order_by_clause)

Description

The ROW_NUMBER function assigns a number to each row based on the row sequence specified in the order_by_clause clause.

Example

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.08 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.01 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, row_number() over(partition by job_id order by salary) ntl from exployees;
+-----------+------+
| last_name | ntl  |
+-----------+------+
| jim       |    1 |
| tom       |    1 |
| mike      |    2 |
| lily      |    3 |
+-----------+------+
4 rows in set (0.00 sec)

RANK/DENSE_RANK/PERCENT_RANK

Declaration

Syntax of RANK: RANK( ) OVER ([ query_partition_clause ] order_by_clause)

Syntax of DENSE_RANK: DENSE_RANK( ) OVER([ query_partition_clause ] order_by_clause)

Syntax of PERCENT_RANK: PERCENT_RANK( ) OVER ([ query_partition_clause ] order_by_clause)

Description

RANK returns the order of each row of data in a column that is sorted by the column specified in order_by_clause. For example, you can sort by salary to view the ranking of the income of employees.

The semantics of the DENSE_RANK function are the same as those of the RANK function. The difference is that the RANK function may involve a skip during sorting but the DENSE_RANK function does not.

The semantics of the PERCENT_RANK function are the same as those of the RANK function. The difference is that the sorting result of PERCENT_RANK is a percentage, and it returns the percentage of the given rows.

Example

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.10 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, rank() over(partition by job_id order by salary) rank, dense_rank() over(partition by job_id order by salary) dense_rank, percent_rank() over(partition by job_id order by salary) percent_rank from exployees;
+-----------+------+------------+----------------------------------+
| last_name | rank | dense_rank | percent_rank                     |
+-----------+------+------------+----------------------------------+
| jim       |    1 |          1 | 0.000000000000000000000000000000 |
| tom       |    1 |          1 | 0.000000000000000000000000000000 |
| mike      |    2 |          2 | 0.500000000000000000000000000000 |
| lily      |    3 |          3 | 1.000000000000000000000000000000 |
+-----------+------+------------+----------------------------------+
4 rows in set (0.01 sec)

CUME_DIST

Declaration

CUME_DIST() OVER ([ query_partition_clause ] order_by_clause)

Description

This function calculates the distribution of a value. The return value is greater than 0 but less than or equal to 1. As an analytic function, CUME_DIST calculates the proportion of data smaller than a specific column of the current row in ascending order. In the following example, the function returns the proportion of each row of data on the sorting column of the window, given that the data is grouped by job_id and sorted by salary.

Example

obclient> create table exployees(last_name char(10), salary decimal, job_id char(32));
Query OK, 0 rows affected (0.10 sec)

obclient> insert into exployees values('jim', 2000, 'cleaner');
Query OK, 1 row affected (0.11 sec)

obclient> insert into exployees values('mike', 12000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('lily', 13000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> insert into exployees values('tom', 11000, 'engineering');
Query OK, 1 row affected (0.00 sec)

obclient> select last_name, cume_dist() over(partition by job_id order by salary) cume_dist from exployees;
+-----------+----------------------------------+
| last_name | cume_dist                        |
+-----------+----------------------------------+
| jim       | 1.000000000000000000000000000000 |
| tom       | 0.333333333333333333333333333333 |
| mike      | 0.666666666666666666666666666667 |
| lily      | 1.000000000000000000000000000000 |
+-----------+----------------------------------+
4 rows in set (0.01 sec)