Waiting for 9.4 – Support ordered-set (WITHIN GROUP) aggregates.

On 23rd of December, Tom Lane committed patch:

Support ordered-set (WITHIN GROUP) aggregates.
 
This patch introduces generic support for ordered-set and hypothetical-set
aggregate functions, as well as implementations of the instances defined in
SQL:2008 (percentile_cont(), percentile_disc(), rank(), dense_rank(),
percent_rank(), cume_dist()).  We also added mode() though it is not in the
spec, as well as versions of percentile_cont() and percentile_disc() that
can compute multiple percentile values in one pass over the data.
 
Unlike the original submission, this patch puts full control of the sorting
process in the hands of the aggregate's support functions.  To allow the
support functions to find out how they're supposed to sort, a new API
function AggGetAggref() is added to nodeAgg.c.  This allows retrieval of
the aggregate call's Aggref node, which may have other uses beyond the
immediate need.  There is also support for ordered-set aggregates to
install cleanup callback functions, so that they can be sure that
infrastructure such as tuplesort objects gets cleaned up.
 
In passing, make some fixes in the recently-added support for variadic
aggregates, and make some editorial adjustments in the recent FILTER
additions for aggregates.  Also, simplify use of IsBinaryCoercible() by
allowing it to succeed whenever the target type is ANY or ANYELEMENT.
It was inconsistent that it dealt with other polymorphic target types
but not these.
 
Atri Sharma and Andrew Gierth; reviewed by Pavel Stehule and Vik Fearing,
and rather heavily editorialized upon by Tom Lane

Now, did you read the commit message? Did you understand it? Well, to be honest – I didn't. But. That's why we have access to sources, including the new changes in the documentation.

So I've read the changes in docs. And still didn't understand it. Well. I've always learned best by trying, so that's what I did.

The example in the docs references a households table with an income column. So let's test it:

CREATE TABLE households AS
    SELECT CAST(random() * 200000 + 20000 AS int4) AS income FROM generate_series(1,10) i;
SELECT 10

Data is:

SELECT ROW_NUMBER() OVER (ORDER BY income), * FROM households ORDER BY income;
 ROW_NUMBER | income 
------------+--------
          1 |  21118
          2 |  27346
          3 |  40818
          4 |  64789
          5 |  75681
          6 |  99033
          7 | 132923
          8 | 140213
          9 | 174634
         10 | 204843
(10 ROWS)

OK. And now, what does the query from the docs do:

SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY income) FROM households;
 percentile_disc 
-----------------
           75681
(1 ROW)

It gave me the 5th row (the 50% row). But what if 50% of the row count fell between two rows?
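
If I read the docs right, percentile_disc() returns the first input value whose position in the ordering reaches the requested fraction, so a fractional position simply lands on the next existing row. For example, on the same households table, 0.45 of 10 rows is position 4.5, which should land on the 5th row:

SELECT percentile_disc(0.45) WITHIN GROUP (ORDER BY income) FROM households;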

Please note that I didn't get the median. That's because I used percentile_disc. There is another aggregate that can provide an interpolated value:

SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY income) FROM households;
 percentile_cont 
-----------------
           87357
(1 ROW)

The value I got is simply the midpoint between the 5th and 6th rows, so it's the “right” median. Please note that while you can use percentile_disc() on basically anything you can order, you can't use percentile_cont() on some types, because it's not possible to interpolate them.

Interestingly, we can check, using the same query, more than one percentile:

SELECT percentile_cont(array[0.25, 0.33, 0.5, 0.66, 0.75, 0.95]) WITHIN GROUP (ORDER BY income) FROM households;
                    percentile_cont                    
-------------------------------------------------------
 {46810.75,75354.24,87357,130889.6,138390.5,191248.95}
(1 ROW)

Now, that's pretty cool.

There is also the “mode” aggregate, which returns the most common value:

SELECT mode() WITHIN GROUP (ORDER BY added_by) FROM plans;
  mode  
--------
 depesz
(1 ROW)
 
SELECT added_by, COUNT(*) FROM plans GROUP BY 1 ORDER BY 2 DESC LIMIT 2;
 added_by | COUNT  
----------+--------
 [NULL]   | 105421
 depesz   |     35
(2 ROWS)

NULLs are ignored by aggregates, so it returned the most common value that is not null.

Since percentile_disc() works on anything you can order, it even works on the text id column in the plans table:

SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY id) FROM plans;
 percentile_disc 
-----------------
 Kl
(1 ROW)

So far it looks like you can use “ordered-set aggregates” to extract certain values that you would otherwise have to get with window functions, or with some weird constructions based on aggregating values into an array and then picking elements from it.
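
For comparison, here is the kind of construction I mean: a sketch that gets a similar median from the households table by aggregating all sorted incomes into an array and picking the middle element.

WITH x AS (SELECT array_agg(income ORDER BY income) AS arr FROM households)
SELECT arr[ array_upper(arr, 1) / 2 ] FROM x;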

The cool thing is that it works with normal grouping too.

For example, in the plans table (the one from explain.depesz.com) I have an is_deleted column, which tells whether a given plan was deleted (I don't actually delete rows, so as to avoid reusing ids; instead the plan text is removed from the row and replaced with an empty string).

Deletion is a relatively recent feature, so deleted and non-deleted plans should show a different distribution of values. Let's see:

SELECT
    is_deleted,
    MIN(entered_on),
    MAX(entered_on),
    COUNT(*),
    percentile_cont( 0.5 ) WITHIN GROUP (ORDER BY entered_on)
FROM
    plans
GROUP BY
    is_deleted;
ERROR:  FUNCTION percentile_cont(NUMERIC, TIMESTAMP WITH TIME zone) does NOT exist
LINE 6:     percentile_cont( 0.5 ) WITHIN GROUP (ORDER BY entered_on...
            ^
HINT:  No FUNCTION matches the given name AND argument types. You might need TO ADD explicit TYPE casts.

Interestingly – I can't use percentile_cont (though it should be possible to interpolate there). OK. Let's just pick percentile_disc:

SELECT
    is_deleted,
    MIN(entered_on),
    MAX(entered_on),
    COUNT(*),
    percentile_disc( 0.5 ) WITHIN GROUP (ORDER BY entered_on)
FROM
    plans
GROUP BY
    is_deleted;
 is_deleted |              MIN              |              MAX              | COUNT |        percentile_disc        
------------+-------------------------------+-------------------------------+-------+-------------------------------
 f          | 2008-12-04 13:20:43+01        | 2013-12-16 22:56:49.951565+01 | 98687 | 2012-10-22 15:38:51.41854+02
 t          | 2013-03-30 20:20:40.102483+01 | 2013-12-16 22:51:34.011972+01 |  6971 | 2013-08-20 07:47:46.051709+02
(2 ROWS)

The cool thing here is that we can also easily add more than one ordered-set aggregate:

SELECT
    is_deleted,
    COUNT(*),
    percentile_disc( 0.5 ) WITHIN GROUP (ORDER BY entered_on),
    percentile_cont( 0.5 ) WITHIN GROUP (ORDER BY LENGTH(plan) )
FROM
    plans
GROUP BY
    is_deleted;
 is_deleted | COUNT |        percentile_disc        | percentile_cont 
------------+-------+-------------------------------+-----------------
 f          | 98687 | 2012-10-22 15:38:51.41854+02  |            2728
 t          |  6971 | 2013-08-20 07:47:46.051709+02 |               0
(2 ROWS)

This is of course a contrived example, since all deleted plans have plan length 0, but it still shows nicely what can be done.

There is another part of this – “Hypothetical-Set” aggregate functions.

The example I made for this is probably even more contrived, but, let's see it anyway.

The id strings for plans are generated randomly, starting with a random 2-letter string; if it's already taken, the length is increased and a new string is generated. After each such “hit”, the length of the id is increased. This means that while in the beginning explain was generating mostly 2-character ids, sometimes you would get a 3-letter id. And sometimes, rarely, you'd get a 4-letter one.
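
One way to read that logic, as a rough sketch (hypothetical function name, not the actual explain.depesz.com code; a 62-character alphabet would match the namespace sizes shown below):

CREATE FUNCTION pick_new_id() RETURNS text AS $$
DECLARE
    -- assumed 62-character alphabet (a-z, A-Z, 0-9)
    chars CONSTANT text := 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    len   int  := 2;
    cand  text;
BEGIN
    LOOP
        -- build a random candidate id of the current length
        SELECT string_agg(substr(chars, floor(random() * 62)::int + 1, 1), '')
          INTO cand
          FROM generate_series(1, len);
        -- keep it if it's free; on a "hit" (collision) retry with a longer id
        EXIT WHEN NOT EXISTS (SELECT 1 FROM plans WHERE id = cand);
        len := len + 1;
    END LOOP;
    RETURN cand;
END;
$$ LANGUAGE plpgsql;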

Currently (well, as of the 16th of December) the stats looked like this:

 LENGTH | COUNT | filled_percent | namespace_size 
--------+-------+----------------+----------------
      2 |  3844 |  100.000000000 |           3844
      3 | 82458 |   34.598536471 |         238328
      4 | 19332 |    0.130830809 |       14776336
      5 |    24 |    0.000002620 |      916132832
(4 ROWS)
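
For reference, stats like these can be obtained with a query along these lines (a sketch; the filled_percent and namespace_size columns assume the 62-character alphabet mentioned above):

SELECT LENGTH(id),
       COUNT(*),
       round(100 * COUNT(*) / (62::numeric ^ LENGTH(id)), 9) AS filled_percent,
       (62::numeric ^ LENGTH(id))::int8 AS namespace_size
FROM plans
GROUP BY LENGTH(id)
ORDER BY LENGTH(id);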

Now, I might want to know how many rows (or even better, what percentage of all rows) with a given id length were already added at a given point in time.

So I can:

SELECT LENGTH(id),
       percent_rank( '2012-01-01 00:00:00 UTC' ) WITHIN GROUP (ORDER BY entered_on)
FROM plans
GROUP BY LENGTH(id)
ORDER BY LENGTH(id);
 LENGTH |    percent_rank    
--------+--------------------
      2 |  0.997918834547347
      3 |  0.277705013461399
      4 | 0.0611421477343265
      5 |                  0
(4 ROWS)

So apparently at the given moment (UTC midnight on the 1st of January 2012) I already had 99.79…% of the 2-character ids, 27.77…% of the 3-character ids, ~6% of the 4-character ids, and none (0%) of the 5-character ids.

100% is the current state.

Please note that the “5, 0” row is interesting, as it correctly shows 0% even though there were no rows in the group (length(id) = 5) older than the cutoff value. If I had simply run:

SELECT LENGTH(id), COUNT(*) FROM plans WHERE entered_on <= '2012-01-01 00:00:00 UTC' GROUP BY 1;

I would see only 3 rows, and the row with length(id) = 5 would be missing!

Other hypothetical-set aggregate functions are (a quick illustration follows the list):

  • rank – the rank the given value would have, within the given group/ordering, if it were actually there (even though it isn't)
  • dense_rank – similar, but duplicate values in the ORDER BY clause are counted only once, so there are no gaps in the ranks
  • percent_rank – the relative rank (percentile) that the value would get
  • cume_dist – the number of rows preceding (or equal to) the given value, divided by the total number of rows
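
A quick sketch using the households table from earlier: where would a hypothetical income of 100000 land?

SELECT rank(100000) WITHIN GROUP (ORDER BY income),
       dense_rank(100000) WITHIN GROUP (ORDER BY income),
       percent_rank(100000) WITHIN GROUP (ORDER BY income),
       cume_dist(100000) WITHIN GROUP (ORDER BY income)
FROM households;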

All in all, this is a great addition. It will definitely take some time to get used to the syntax and the functionality, but it definitely looks worth it. If I miss anything, it's support in percentile_cont() for more datatypes (timestamps should be simple to add, I think).

The last bit is performance.

For example, normally, when getting median value, you would have to either:

  • get all values into an array and pick the middle element, or
  • count the number of values, then order all values and get the middle one with LIMIT/OFFSET

So, I made a 10 million row table, with random() values:

$ CREATE TABLE test AS SELECT random() AS i FROM generate_series(1,10000000) q;
SELECT 10000000

And let's get median.

First approach – get all values, sorted, into an array, and then get the middle element – used ~ 60MB of RAM, and took ~ 15 seconds:

WITH x AS (SELECT array_agg(i ORDER BY i) AS arr FROM test) SELECT arr[ array_upper(arr, 1) / 2 ] FROM x;

Second approach – get the count, and use LIMIT/OFFSET to find the middle row – used ~ 25MB of RAM and took ~ 16.5 seconds:

SELECT * FROM test ORDER BY i LIMIT 1 offset ( SELECT COUNT(*) / 2 FROM test);

Using the ordered sets: ~ 30MB and 13 seconds:

SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY i) FROM test;

Definitely great stuff, and one that I hope to use more.

2 thoughts on “Waiting for 9.4 – Support ordered-set (WITHIN GROUP) aggregates.”

  1. Incredibly helpful overview of the new functions within 9.4, especially the piece explaining the retrieval of median values – thank you so much

  2. How come the builtin function mode() is so very slow? An implementation written in SQL is actually faster.

    select public.mode(val) from numbers;
    takes around 1s (1 million rows)

    select mode() within group (order by val) from numbers;
    takes around 2.5s (1 million rows)

    I would have expected a native function to be much faster. If wanted, I could post the code.
