Waiting for 9.4 – Provide moving-aggregate support for a bunch of aggregates.

On 13th of April, Tom Lane committed patch:

Provide moving-aggregate support for a bunch of numerical aggregates.
 
First installment of the promised moving-aggregate support in built-in
aggregates: count(), sum(), avg(), stddev() and variance() for
assorted datatypes, though not for float4/float8.
 
In passing, remove a 2001-vintage kluge in interval_accum(): interval
array elements have been properly aligned since around 2003, but
nobody remembered to take out this workaround.  Also, fix a thinko
in the opr_sanity tests for moving-aggregate catalog entries.
 
David Rowley and Florian Pflug, reviewed by Dean Rasheed

On the same day he also committed:

Provide moving-aggregate support for boolean aggregates.
 
David Rowley and Florian Pflug, reviewed by Dean Rasheed

A bit before these two happened, there was also other commit, that provided infrastructure for these aggregates.

So, what is it about?

As you know we have so called Window Functions. These can be used, for example, to calculate moving sum:

$ SELECT i, SUM(i) OVER (ORDER BY i) FROM generate_series(1,10) i ORDER BY i;
 i  | SUM 
----+-----
  1 |   1
  2 |   3
  3 |   6
  4 |  10
  5 |  15
  6 |  21
  7 |  28
  8 |  36
  9 |  45
 10 |  55
(10 ROWS)

What is less known/used, is that you can limit the “window" that the aggregate is using, for example – to 5 rows “around" current:

SELECT i, SUM(i) OVER (ORDER BY i ROWS BETWEEN 2 preceding AND 2 following ) FROM generate_series(1,10) i ORDER BY i;
 i  | SUM 
----+-----
  1 |   6
  2 |  10
  3 |  15
  4 |  20
  5 |  25
  6 |  30
  7 |  35
  8 |  40
  9 |  34
 10 |  27
(10 ROWS)

For example – sum for i = 3 sums only 1, 2, 3, 4, 5.

Up until now, when doing this kind of aggregates, PostgreSQL had to recalculate the aggregate for each row.

So, for row:

  • i=1, it did 1 + 2 + 3
  • i=2, it did 1 + 2 + 3 + 4
  • i=3, it did 1 + 2 + 3 + 4 + 5
  • i=4, it did 2 + 3 + 4 + 5 + 6
  • i=10, id did 8 + 9 + 10

Despite “obvious" optimization of reusing calculation from previous row, and some “logic".

Now – the aggregates get much more parameters (compare old and new versions of CREATE AGGREGATE docs), and they can “know" how to handle reuse of previous values.

Of course – it's not magical – aggregate has to be taught what it should do to “unaggregate" values that shouldn't be uses anymore.

Let's see how that would work. I'll make a string aggregate that can work with moving windows.

First, let's make some very simple test data:

$ CREATE TABLE test (id serial PRIMARY KEY, payload text);
CREATE TABLE
$ copy test(payload) FROM stdin;
a
b
c
d
e
\.
COPY 5

Now. Let's write simple string concatenating aggregate, using custom functions so I can add there some notices:

CREATE FUNCTION custom_concat(text, text) RETURNS text AS $$
BEGIN
    raise notice 'custom_concat( %, % )', $1, $2;
    RETURN COALESCE($1, '') || COALESCE($2, '');
END;
$$ LANGUAGE plpgsql;

Now, I can create test aggregate:

CREATE aggregate dt(text) (
    sfunc = custom_concat,
    stype = text
);
CREATE AGGREGATE

And quick check if it works:

SELECT id, dt(payload) OVER (ORDER BY id)
FROM test
ORDER BY id;
NOTICE:  custom_concat( <NULL>, a )
NOTICE:  custom_concat( a, b )
NOTICE:  custom_concat( ab, c )
NOTICE:  custom_concat( abc, d )
NOTICE:  custom_concat( abcd, e )
 id |  dt   
----+-------
  1 | a
  2 | ab
  3 | abc
  4 | abcd
  5 | abcde
(5 ROWS)

Nice. The function was called 5 times, all is well. But what will happen if I'll introduce moving frame?

SELECT id, dt(payload) OVER (ORDER BY id ROWS BETWEEN 1 preceding AND 1 following )
FROM test
ORDER BY id;
NOTICE:  custom_concat( <NULL>, a )
NOTICE:  custom_concat( a, b )
NOTICE:  custom_concat( ab, c )
NOTICE:  custom_concat( <NULL>, b )
NOTICE:  custom_concat( b, c )
NOTICE:  custom_concat( bc, d )
NOTICE:  custom_concat( <NULL>, c )
NOTICE:  custom_concat( c, d )
NOTICE:  custom_concat( cd, e )
NOTICE:  custom_concat( <NULL>, d )
NOTICE:  custom_concat( d, e )
 id | dt  
----+-----
  1 | ab
  2 | abc
  3 | bcd
  4 | cde
  5 | de
(5 ROWS)

Here we see some interesting things:

  • even without any added logic, Pg is smart enough not to repeat concatenation of a and b for the 2nd row (abc)
  • these NULLs show when aggregate function was called from scratch – so it happened 4 times
  • all together, we had 11 concat calls

Now, let's drop the aggregate, and add moving “sfunc":

DROP aggregate dt(text);
DROP AGGREGATE
 
CREATE FUNCTION ms_custom_concat(text, text) RETURNS text AS $$
BEGIN
    raise notice 'ms_custom_concat( %, % )', $1, $2;
    IF POSITION( $2 IN $1 ) = 1 THEN
        RETURN substr($1, 1 + LENGTH($2));
    END IF;
    RETURN $1;
END;
$$ LANGUAGE plpgsql;
CREATE FUNCTION
 
CREATE aggregate dt(text) (
    sfunc = custom_concat,
    stype = text,
    msfunc = custom_concat,
    mstype = text,
    minvfunc = ms_custom_concat
);
CREATE AGGREGATE

Now aggregate dt has second set of parameters – used only for “moving frame" type of queries – these are the msfunc/mstype/minvfunc.
The important one is minvfunc which is used to reverse aggregate.

With this defined, I get:

SELECT id, dt(payload) OVER (ORDER BY id ROWS BETWEEN 1 preceding AND 1 following )
FROM test
ORDER BY id;
NOTICE:  custom_concat( <NULL>, a )
NOTICE:  custom_concat( a, b )
NOTICE:  custom_concat( ab, c )
NOTICE:  ms_custom_concat( abc, a )
NOTICE:  custom_concat( bc, d )
NOTICE:  ms_custom_concat( bcd, b )
NOTICE:  custom_concat( cd, e )
NOTICE:  ms_custom_concat( cde, c )
 id | dt  
----+-----
  1 | ab
  2 | abc
  3 | bcd
  4 | cde
  5 | de
(5 ROWS)

5 calls to custom_concat + 3 to ms_custom_concat. That's nice. Generally – the larger the “frame" the better these new aggregates will be. Of course there is non-zero cost of calling inversion function, but, if written correctly – it shouldn't be a big problem.

One additional thing to note – moving aggregate can use different “storage type" (that's why I had both stype and mstype). So you can, for example, choose different algorithm and temporary storage for moving aggregates. And that's definitely cool. Thanks guys.

2 thoughts on “Waiting for 9.4 – Provide moving-aggregate support for a bunch of aggregates.”

  1. There’s a type here:
    “Pg is smart enough not to repeat concatenation of and b for the 2nd row.”
    It should be “concatenation of a and b”.

    Btw, nicely explained!

Comments are closed.