Waiting for 8.5 – ordered aggregates

On 15th of December Tom Lane committed patch by Andrew Gierth (aka RhodiumToad), which adds interesting capability:

Log Message:
-----------
Support ORDER BY within aggregate function calls, at long last providing a
non-kluge method for controlling the order in which values are fed to an
aggregate function.  At the same time eliminate the old implementation
restriction that DISTINCT was only supported for single-argument aggregates.
 
Possibly release-notable behavioral change: formerly, agg(DISTINCT x)
dropped null values of x unconditionally.  Now, it does so only if the
agg transition function is strict; otherwise nulls are treated as DISTINCT
normally would, ie, you get one copy.
 
Andrew Gierth, reviewed by Hitoshi Harada

Description in log message is pretty clear, so let's just test it.

Let's assume we have orders table with following structure:

# \d orders
                          Table "public.orders"
 Column  |  Type   |                      Modifiers
---------+---------+-----------------------------------------------------
 id      | integer | not null default nextval('orders_id_seq'::regclass)
 buyer   | text    | not null
 ordered | date    | not null
 total   | integer |
Indexes:
    "orders_pkey" PRIMARY KEY, btree (id)

Within these table we have these data:

# SELECT * FROM orders;
 id | buyer |  ordered   | total
----+-------+------------+-------
  1 | Alice | 2009-05-12 |    57
  2 | Eve   | 2009-07-07 |   131
  3 | Alice | 2009-08-15 |   177
  4 | Carol | 2009-04-28 |   232
...
 38 | Eve   | 2009-02-01 |   225
(38 ROWS)

Data was generated using this query:

INSERT INTO orders (buyer, ordered, total)
    SELECT n[j], now() - '1 year'::INTERVAL * random(), 50 + random() * 200
    FROM
        generate_series(1,10) i,
        generate_series(1,5) j,
        (SELECT '{Alice,Bob,Carol,Dave,Eve}'::text[] AS n) x
    WHERE random() < 0.6;

Now. We'd like to get, for each buyer, sum of order values, and list of dates when they ordered. This is simple:

# SELECT buyer, SUM(total), array_agg( ordered)
FROM orders
GROUP BY buyer
ORDER BY buyer;
 buyer | SUM  |                                              array_agg
-------+------+------------------------------------------------------------------------------------------------------
 Alice | 1057 | {2009-05-12,2009-08-15,2009-11-22,2009-05-08,2009-02-14,2009-03-25,2009-03-06,2009-08-16}
 Bob   |  905 | {2009-01-29,2009-08-22,2009-05-28,2009-05-12,2009-02-10,2009-08-17}
 Carol | 1118 | {2009-04-28,2009-03-12,2009-01-10,2009-03-30,2009-06-27,2009-09-19,2009-12-14,2009-09-06,2009-09-01}
 Dave  | 1239 | {2009-07-15,2009-07-27,2009-02-07,2009-12-19,2009-01-13,2009-05-28,2009-08-27}
 Eve   | 1222 | {2009-07-07,2009-04-07,2009-10-02,2009-02-05,2009-09-26,2009-04-18,2009-08-14,2009-02-01}
(5 ROWS)

The problem in here is that dates in array_agg are not ordered. Previously, we had to do stuff like this:

# SELECT buyer, SUM(total), array_agg( ordered)
FROM ( SELECT * FROM orders ORDER BY buyer, ordered ) x
GROUP BY buyer
ORDER BY buyer;
 buyer | SUM  |                                              array_agg
-------+------+------------------------------------------------------------------------------------------------------
 Alice | 1057 | {2009-02-14,2009-03-06,2009-03-25,2009-05-08,2009-05-12,2009-08-15,2009-08-16,2009-11-22}
 Bob   |  905 | {2009-01-29,2009-02-10,2009-05-12,2009-05-28,2009-08-17,2009-08-22}
 Carol | 1118 | {2009-01-10,2009-03-12,2009-03-30,2009-04-28,2009-06-27,2009-09-01,2009-09-06,2009-09-19,2009-12-14}
 Dave  | 1239 | {2009-01-13,2009-02-07,2009-05-28,2009-07-15,2009-07-27,2009-08-27,2009-12-19}
 Eve   | 1222 | {2009-02-01,2009-02-05,2009-04-07,2009-04-18,2009-07-07,2009-08-14,2009-09-26,2009-10-02}
(5 ROWS)

Which works, but the problem is that the order is not guaranteed – it is ordered because that's how it works with this plan. Should PostgreSQL choose another plan to run the query – it might become unordered again.

Now, thanks to Andrew, we can force the ordering:

# SELECT buyer, SUM(total), array_agg( ordered ORDER BY ordered )
FROM orders
GROUP BY buyer;
 buyer | SUM  |                                              array_agg
-------+------+------------------------------------------------------------------------------------------------------
 Alice | 1057 | {2009-02-14,2009-03-06,2009-03-25,2009-05-08,2009-05-12,2009-08-15,2009-08-16,2009-11-22}
 Bob   |  905 | {2009-01-29,2009-02-10,2009-05-12,2009-05-28,2009-08-17,2009-08-22}
 Carol | 1118 | {2009-01-10,2009-03-12,2009-03-30,2009-04-28,2009-06-27,2009-09-01,2009-09-06,2009-09-19,2009-12-14}
 Dave  | 1239 | {2009-01-13,2009-02-07,2009-05-28,2009-07-15,2009-07-27,2009-08-27,2009-12-19}
 Eve   | 1222 | {2009-02-01,2009-02-05,2009-04-07,2009-04-18,2009-07-07,2009-08-14,2009-09-26,2009-10-02}
(5 ROWS)

What's more – we can sort by value in some other field – for example – sort orders by their value:

# SELECT buyer, SUM(total), array_agg( ordered ORDER BY total DESC )
FROM orders
GROUP BY buyer;
 buyer | SUM  |                                              array_agg
-------+------+------------------------------------------------------------------------------------------------------
 Alice | 1057 | {2009-05-08,2009-08-15,2009-03-25,2009-08-16,2009-02-14,2009-11-22,2009-03-06,2009-05-12}
 Bob   |  905 | {2009-02-10,2009-01-29,2009-08-17,2009-05-12,2009-08-22,2009-05-28}
 Carol | 1118 | {2009-04-28,2009-09-01,2009-03-30,2009-06-27,2009-12-14,2009-09-19,2009-03-12,2009-01-10,2009-09-06}
 Dave  | 1239 | {2009-05-28,2009-07-27,2009-02-07,2009-07-15,2009-08-27,2009-01-13,2009-12-19}
 Eve   | 1222 | {2009-02-01,2009-08-14,2009-09-26,2009-04-07,2009-07-07,2009-02-05,2009-04-18,2009-10-02}
(5 ROWS)

So, we can check:

# SELECT *
FROM orders
WHERE buyer = 'Bob'
ORDER BY total DESC;
 id | buyer |  ordered   | total
----+-------+------------+-------
 24 | Bob   | 2009-02-10 |   249
 11 | Bob   | 2009-01-29 |   215
 26 | Bob   | 2009-08-17 |   156
 19 | Bob   | 2009-05-12 |   105
 12 | Bob   | 2009-08-22 |    95
 17 | Bob   | 2009-05-28 |    85
(6 ROWS)

Great. Works as described 🙂 Of course ordering doesn't make sense for all aggregates – count(*) for example 🙂 But the sole fact that we have it, means we can calculate median in plain SQL:

# WITH aggregated AS (
    SELECT
        buyer,
        array_agg(total ORDER BY total) AS list,
        COUNT(*),
        COUNT(*) / 2 AS half
    FROM orders
    GROUP BY buyer
)
SELECT
    buyer,
    list,
    CASE
        WHEN COUNT % 2 = 1 THEN list[ half + 1 ]
        ELSE ( list[ half ] + list[ half + 1 ] ) / 2
    END AS median
FROM
    aggregated;
 buyer |               list                | median
-------+-----------------------------------+--------
 Alice | {57,70,83,119,156,170,177,225}    |    137
 Bob   | {85,95,105,156,215,249}           |    130
 Carol | {52,77,86,99,105,107,159,201,232} |    105
 Dave  | {98,131,154,155,219,241,241}      |    155
 Eve   | {66,85,121,131,182,202,210,225}   |    156
(5 ROWS)

How sweet it that?

3 thoughts on “Waiting for 8.5 – ordered aggregates”

  1. Note that one psql block in your post is not monospaced:
    select buyer, sum(total), array_agg( ordered order by total desc ) from orders group by buyer …

  2. Great! Then ordered aggregates is another feature of Muldis D that will now be implementable over Postgres without much circumlocution. The PostgreSQL developers continue to make my job of implementing Muldis D easier with every release.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.