February 17th, 2010 by depesz | Tags: , , , , | No comments »
Did it help? If yes - maybe you can help me?

On 12th of February Tom Lane committed patch by Hitoshi Harada:

Log Message:
-----------
Extend the set of frame options supported for window functions.
 
This patch allows the frame to start from CURRENT ROW (in either RANGE or
ROWS mode), and it also adds support for ROWS n PRECEDING and ROWS n FOLLOWING
start and end points. (RANGE value PRECEDING/FOLLOWING isn't there yet ---
the grammar works, but that's all.)
 
Hitoshi Harada, reviewed by Pavel Stehule

The description is pretty clear, but let's look at what exactly it is.

First, let's create simple test table:

create table test (
id serial primary key,
some_grouping int4,
some_value int4
);

Now, let's insert some random data:

insert into test (some_grouping, some_value)
select i, random() * 100 + 50 as j
from generate_series(1,2) i, generate_series(1,10) k;

The data in it doesn't look spectacular, but it serves the purpose:

select * from test order by id asc;
id | some_grouping | some_value
----+---------------+------------
1 | 1 | 94
2 | 1 | 86
3 | 1 | 77
4 | 1 | 86
5 | 1 | 146
6 | 1 | 145
7 | 1 | 68
8 | 1 | 71
9 | 1 | 84
10 | 1 | 105
11 | 2 | 126
12 | 2 | 88
13 | 2 | 90
14 | 2 | 50
15 | 2 | 104
16 | 2 | 68
17 | 2 | 61
18 | 2 | 107
19 | 2 | 127
20 | 2 | 112
(20 rows)

Now. If you're not familiar with window functions, and specyfically with framing, let me show you simple example:

SELECT
id,
some_grouping,
some_value,
sum(some_value) OVER (
PARTITION BY some_grouping
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) as sum_per_group,
sum(some_value) OVER (
PARTITION BY some_grouping
ORDER BY id
RANGE UNBOUNDED PRECEDING
) as cumulative_sum_per_group
FROM
test
ORDER BY some_grouping, id;
id | some_grouping | some_value | sum_per_group | cumulative_sum_per_group
----+---------------+------------+---------------+--------------------------
1 | 1 | 94 | 962 | 94
2 | 1 | 86 | 962 | 180
3 | 1 | 77 | 962 | 257
4 | 1 | 86 | 962 | 343
5 | 1 | 146 | 962 | 489
6 | 1 | 145 | 962 | 634
7 | 1 | 68 | 962 | 702
8 | 1 | 71 | 962 | 773
9 | 1 | 84 | 962 | 857
10 | 1 | 105 | 962 | 962
11 | 2 | 126 | 933 | 126
12 | 2 | 88 | 933 | 214
13 | 2 | 90 | 933 | 304
14 | 2 | 50 | 933 | 354
15 | 2 | 104 | 933 | 458
16 | 2 | 68 | 933 | 526
17 | 2 | 61 | 933 | 587
18 | 2 | 107 | 933 | 694
19 | 2 | 127 | 933 | 821
20 | 2 | 112 | 933 | 933
(20 rows)

Now. Previously it would be difficult to calculate rolling average. For example something like this:

SELECT
id,
some_grouping,
some_value,
(
lag(some_value, 1) over (grouping)
+
some_value
+
lead(some_value, 1) over (grouping)
) / 3 as rolling_average
FROM
test
WINDOW grouping AS (
PARTITION BY some_grouping
ORDER BY id
)
ORDER BY some_grouping, id;

It returns, more or less as expected:

id | some_grouping | some_value | rolling_average
----+---------------+------------+-----------------
1 | 1 | 94 | [null]
2 | 1 | 86 | 85
3 | 1 | 77 | 83
4 | 1 | 86 | 103
5 | 1 | 146 | 125
6 | 1 | 145 | 119
7 | 1 | 68 | 94
8 | 1 | 71 | 74
9 | 1 | 84 | 86
10 | 1 | 105 | [null]
11 | 2 | 126 | [null]
12 | 2 | 88 | 101
13 | 2 | 90 | 76
14 | 2 | 50 | 81
15 | 2 | 104 | 74
16 | 2 | 68 | 77
17 | 2 | 61 | 78
18 | 2 | 107 | 98
19 | 2 | 127 | 115
20 | 2 | 112 | [null]
(20 rows)

As you can see first and last row for any given window don't have rolling average – i.e. they have nulls. That's because for first row lag(some_value) is null, and for last row – lead(some_value) is null.

Now, let's see how it works with this new patch by Hitoshi Harada:

SELECT
id,
some_grouping,
some_value,
cast(
avg(some_value) over (
grouping
rows between 1 preceding and 1 following
)
as int4
) as rolling_average
FROM
test
WINDOW grouping AS (
PARTITION BY some_grouping
ORDER BY id
)
ORDER BY some_grouping, id;

and result is:

id | some_grouping | some_value | rolling_average
----+---------------+------------+-----------------
1 | 1 | 94 | 90
2 | 1 | 86 | 86
3 | 1 | 77 | 83
4 | 1 | 86 | 103
5 | 1 | 146 | 126
6 | 1 | 145 | 120
7 | 1 | 68 | 95
8 | 1 | 71 | 74
9 | 1 | 84 | 87
10 | 1 | 105 | 95
11 | 2 | 126 | 107
12 | 2 | 88 | 101
13 | 2 | 90 | 76
14 | 2 | 50 | 81
15 | 2 | 104 | 74
16 | 2 | 68 | 78
17 | 2 | 61 | 79
18 | 2 | 107 | 98
19 | 2 | 127 | 115
20 | 2 | 112 | 120
(20 rows)

This is very cool.

While I can write query without using this new frames that will calculate correct values for first and last rows:

SELECT
id,
some_grouping,
some_value,
(
coalesce( lag(some_value, 1) over (grouping), 0 )
+
some_value
+
coalesce( lead(some_value, 1) over (grouping), 0 )
) / (
1
+
case when lag(some_value, 1) over (grouping) is null then 0 else 1 end
+
case when lead(some_value, 1) over (grouping) is null then 0 else 1 end
) as rolling_average
FROM
test
WINDOW grouping AS (
PARTITION BY some_grouping
ORDER BY id
)
ORDER BY some_grouping, id;

It definitely isn't nice, and would very quickly become nightmarish for ranges larger than 3 rows.

Of course the patch also added other options besides “BETWEEN n PRECEDING AND n FOLLOWING" – you can check full list of possible framings in dev docs.

Leave a comment