Version: 2.3.x

Window Functions

A window function performs a calculation across a set of table rows that are somehow related to the current row.This is comparable to the type of calculation that can be done with an aggregate function.However, window functions do not cause rows to become grouped into a single output row like non-window aggregate calls would.Instead, the rows retain their separate identities.Behind the scenes, the window function is able to access more than just the current row of the query result

The following example demonstrates how to compare each temperature value with the average temperature of the region:

SELECT station,temperature, avg(temperature) OVER (partition by station) from air;
+-------------+-------------+----------------------+
| station     | temperature | AVG(air.temperature) |
+-------------+-------------+----------------------+
| XiaoMaiDao  | 80.0        | 64.93894989583701    |
| XiaoMaiDao  | 64.0        | 64.93894989583701    |
| XiaoMaiDao  | 79.0        | 64.93894989583701    |
| XiaoMaiDao  | 55.0        | 64.93894989583701    |
| ... ...     |
| LianYunGang | 67.0        | 65.12753786942551    |
| LianYunGang | 76.0        | 65.12753786942551    |
| LianYunGang | 50.0        | 65.12753786942551    |
| LianYunGang | 54.0        | 65.12753786942551    |
| ... ...     |
+------------+-------------+----------------------+

A window function call always contains an OVER clause directly following the window function’s name and argument(s).This is what syntactically distinguishes it from a normal function or non-window aggregate.The OVER clause determines exactly how the rows of the query are split up for processing by the window function.The PARTITION BY clause within OVER divides the rows into groups, or partitions, that share the same values of the PARTITION BY expression(s).For each row, the window function is computed across the rows that fall into the same partition as the current row.The previous example showed how to count the average of a column per partition.

You can also control the order in which rows are processed by window functions using ORDER BY within OVER.(The window ORDER BY does not even have to match the order in which the rows are output.) Here is an example:

SELECT station, visibility, temperature,
       rank() OVER (PARTITION BY visibility ORDER BY temperature DESC)
FROM air limit 5;
+-------------+------------+-------------+--------+
| station     | visibility | temperature | RANK() |
+-------------+------------+-------------+--------+
| XiaoMaiDao  | 50.0       | 80.0        | 1      |
| XiaoMaiDao  | 50.0       | 80.0        | 1      |
| LianYunGang | 50.0       | 80.0        | 1      |
| LianYunGang | 50.0       | 80.0        | 1      |
| XiaoMaiDao  | 50.0       | 80.0        | 1      |
+-------------+------------+-------------+--------+

There is another important concept associated with window functions: for each row, there is a set of rows within its partition called its window frame.Some window functions act only on the rows of the window frame, rather than of the whole partition.Here is an example of using window frames in queries:

SELECT station, visibility, temperature,
    avg(temperature) OVER(ORDER BY temperature ASC ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS avg,
    min(temperature) OVER(ORDER BY visibility ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_min
FROM air
ORDER BY visibility ASC;
+-------------+------------+-------------+------+---------+
| station     | visibility | temperature | avg  | cum_min |
+-------------+------------+-------------+------+---------+
| LianYunGang | 50.0       | 53.0        | 53.0 | 53.0    |
| XiaoMaiDao  | 50.0       | 51.0        | 51.0 | 51.0    |
| XiaoMaiDao  | 50.0       | 51.0        | 51.0 | 51.0    |
| XiaoMaiDao  | 50.0       | 52.0        | 52.0 | 51.0    |
| ... ...    |
+-------------+------------+-------------+------+---------+

When a query involves multiple window functions, it is possible to write out each one with a separate OVER clause, but this is duplicative and error-prone if the same windowing behavior is wanted for several functions.Instead, each windowing behavior can be named in a WINDOW clause and then referenced in OVER.For example:

SELECT sum(temperature) OVER w, avg(temperature) OVER w
FROM air
WINDOW w AS (PARTITION BY station ORDER BY temperature DESC);
+----------------------+----------------------+
| SUM(air.temperature) | AVG(air.temperature) |
+----------------------+----------------------+
| 77200.0              | 80.0                 |
| 77200.0              | 80.0                 |
| 77200.0              | 80.0                 |
| 77200.0              | 80.0                 |
| ... ...                 |
+----------------------+----------------------+

Syntax

The syntax of the OVER clause is

function([expression])
  OVER(
    [PARTITION BY expression[, …]]
    [ORDER BY expression [ ASC | DESC ][, …]]
    [ frame_clause ]
    )

Where frame_clause is one of the following:

  { RANGE | ROWS | GROUPS } frame_start
  { RANGE | ROWS | GROUPS } BETWEEN frame_start AND frame_end

And frame_start and frame_end can be one of them

UNBOUNDED PRECEDING
offset PRECEDING
CURRENT ROW
offset FOLLOWING
UNBOUNDED FOLLOWING

Where offset is a non-negative integer.

RANGE and GROUPS modes require an ORDER BY clause (for RANGE, ORDER BY must specify exactly one column).

Aggregate Functions

All aggregate functions can be used as window functions.

Ranking Functions

row_number

Number of the current row within its partition, counting from 1.

row_number()

rank

Rank of the current row with gaps; same as row_number of its first peer.

rank()

dense_rank

Rank of the current row without gaps; this function counts peer groups.

dense_rank()

ntile

Integer ranging from 1 to the argument value, dividing the partition as equally as possible.

ntile(expression)

Analytical functions

cume_dist

Return cumulative distribution, that is, (the number of rows before or equal to the current row)/(total number of rows in the partition).The value range is from 1/N to 1.

cume_dist()

percent_rank

Returns the relative ranking of the current row, i.e. (rank - 1) / (total number of partition rows - 1).Therefore, the value range is from 0 to 1 (inclusive).

percent_rank()

lag

Return the value of the offset rows before the current row in the partition; if there is no such row, return default (must be compatible with the value type). Both offset and default are evaluated against the current row. If omitted, offset defaults to 1, default to NULL.

lag(expression, offset, default)

lead

Return the value of the offset rows after the current row in the partition; if there is no such row, return default (must be compatible with the value type). Both offset and default are evaluated against the current row. If omitted, offset defaults to 1, default to NULL.

lead(expression, offset, default)

first_value

Returns the value obtained in the first row of the window frame.

first_value(expression)

last_value

Returns the value obtained in the last row of the window frame.

last_value(expression)

nth_value

Returns the value obtained in the nth row in the window frame (counting from 1); if there is no such row, returns NULL.

nth_value(expression, n)

Time Windows

time_window

Used for analyzing, aggregating, or processing data in a continuous data stream.The time_window function defines a fixed-size window, then slides this window along the data stream, processing the data within the window sequentially.

time_window(time_expression, window_duration [, slide_duration])

Options	Description
`time_expression`	The time expression to operate.Can be a constant, column, or function, and any combination of arithmetic operators.
`window_duration`	Set the length of the time window.Please refer to Duration Units.
`slide_duration`	Set the length of the window slide. If this parameter is not specified, it will become a scrolling window.Please refer to Duration Units.

tip

The rule generated by the window is:

start, end
time, time_column + window_duration
time - slide_duration, time + window_duration - slide_duration
time - 2 * slide_duration, time + window_duration - 2 * slide_duration
...
time - n * slide_duration, time + window_duration - n * slide_duration

And meet the condition start <= time > end

View example

time_window_gapfill

Similar to time_window, fill in missing values for time series data based on sliding windows.

time_window_gapfill(<time column>, <window_duration>[, <sliding_duration>[, <start time>]]): <time window struct>

tip

time_window_gapfill must be used as a top-level expression in a query or subquery.For example, time_window_gapfill cannot be nested within another function, such as sum( time_window_gapfill(...)).

time_window_gapfill supports two types of missing value filling strategies.

interpolate
locf

Sample Data

View example

interpolate

The core idea of linear interpolation is to assume that the relationship between known data points is linear, and then estimate the value of unknown data points based on the linear relationship between known data points.Specifically, linear interpolation infers the ordinate of unknown data points by using the linear rate of change between the ordinate of known data points.

Linear interpolation is suitable for estimating continuous variables, such as filling missing values in time series or interpolating in spatial data.However, the accuracy and applicability of linear interpolation depend on the characteristics of the data and the actual situation.In some cases, the data may have a non-linear relationship, or there may be other interpolation methods that are more suitable.Therefore, before applying linear interpolation, it is necessary to carefully consider the nature of the data and the purpose of interpolation to ensure that the interpolation results are reasonable and accurate.

interpolate(<expression>)

View example

locf

This function is used to perform missing value imputation within a time window, and fills missing values using the "Last Observation Carried Forward" (LOCF) operation.

"Last Observation Carried Forward" (LOCF) is a method used to fill in missing values, which uses the most recent observable value for filling.The specific processing method is as follows:

Find the nearest non-missing value before the missing value.
Copy the value of the non-missing value to the location of the missing value.
Continue to traverse backwards until the next non-missing value is encountered.
If the next non-missing value is encountered, repeat steps 1 and 2, copying the value of this non-missing value to the missing value position.
If there are still missing values at the end of the data sequence, the last non-missing value will be continuously copied until all missing values are filled.

In short, the LOCF method fills in missing values by copying the most recent observable value to the missing value location, ensuring data continuity over time.This method assumes that the data after missing values are the same as or very close to the last observed value.

It should be noted that the LOCF method may introduce some bias, especially when the data after missing values changes dramatically.Therefore, when using LOCF to fill in missing values, it is necessary to carefully consider the characteristics of the data and the purpose of the analysis to ensure that the filled values can reasonably reflect the actual situation.

locf(<expression>)

View example

Window Functions

Syntax​

Aggregate Functions​

Ranking Functions​

row_number​

rank​

dense_rank​

ntile​

Analytical functions​

cume_dist​

percent_rank​

lag​

lead​

first_value​

last_value​

nth_value​

Time Windows​

time_window​

time_window_gapfill​

Sample Data​

interpolate​

locf​

Syntax

Aggregate Functions

Ranking Functions

row_number

rank

dense_rank

ntile

Analytical functions

cume_dist

percent_rank

lag

lead

first_value

last_value

nth_value

Time Windows

time_window

time_window_gapfill

Sample Data

interpolate

locf