Getting first and last values per group

Every so often someone needs solution to getting first (or couple of first) values for given column. Or last.

For some cases (when there is not many groups) you can use recursive queries. But it's not always the best choice.

Let's try to implement first() and last() aggregates, so these could be easily used by anybody.

Let's start with simplest possible version – returning just one value.

First, we need a function. Technically functions require parameter datatypes, which would make it pretty long, but luckily, we can use “ANYELEMENT" pseudo datatype:

$ CREATE OR REPLACE FUNCTION agg_first(
    IN p_state       anyelement,
    IN p_new_element anyelement
    )
    RETURNS anyelement
    LANGUAGE SQL
    AS $$
SELECT COALESCE( p_state, p_new_element );
$$;
CREATE FUNCTION

This functions looks really simple, but that's because it should be simple. Now, let's make the aggregate:

$ CREATE AGGREGATE FIRST(anyelement) (
    sfunc = agg_first,
    stype = anyelement
);
CREATE AGGREGATE

Now, we'll need some sample data, so let's create simple table with random data:

$ CREATE TABLE test (
    group_id int4,
    random_int int4,
    random_ts timestamptz,
    random_text text
    );
CREATE TABLE
 
$ INSERT INTO test (group_id, random_int, random_ts, random_text)
    SELECT
        FLOOR(random() * 3),
        FLOOR(random() * 1000000),
        now() - '2 years'::INTERVAL * random(),
        md5(random()::text)
    FROM generate_series(1,20);
INSERT 0 20

Data is pretty simple:

$ SELECT * FROM test ORDER BY group_id;
 group_id | random_int |           random_ts           |           random_text            
----------+------------+-------------------------------+----------------------------------
        0 |     493798 | 2015-12-19 17:41:12.110158+01 | 4b9ac4c4ad6c2a806c0db801e9a03373
        0 |     424242 | 2015-10-07 21:35:20.318158+02 | 58d3a010b6fcd2200dd078b445132281
        0 |     202836 | 2015-06-11 05:07:16.526158+02 | 6e99221b0cce929ba0ca41bc8dfaaf27
        0 |     635961 | 2016-12-05 06:18:10.257358+01 | f18d9774a1b74f4422116ec1ceec8b6a
        0 |     722122 | 2015-09-13 20:07:48.580558+02 | 17192f91bf1e94f2e8af7c9bd2c15d07
        0 |     484963 | 2015-07-03 13:17:44.366158+02 | d17fcf48649fe6be0bf639c888bd9c5b
        0 |     510350 | 2016-09-21 13:37:46.708558+02 | 7560516244c289dc84bcb30f04ed35a8
        0 |     895509 | 2016-08-08 07:07:43.713358+02 | f56fbaeb6119963a4bdb6758608d2b8f
        1 |      58672 | 2017-02-02 08:53:27.978958+01 | f58ddb16c746175c57ed7859274ba120
        1 |     780823 | 2015-05-04 07:26:56.807758+02 | 2e6bc5ce1c1d2965f00be96373638e59
        1 |     485881 | 2015-11-26 03:23:53.092558+01 | 2a09584e0035379e795cfa70f6615eec
        1 |     609786 | 2015-08-01 23:00:41.505358+02 | 16d0845557d4d246cd431b74356422f6
        1 |     682945 | 2015-05-02 10:07:26.260558+02 | 776c602308bb93fad0d9753b125d8727
        1 |      82612 | 2016-11-08 18:29:11.822158+01 | ab90846f963d2d24980f9ce9d8a658bb
        1 |     668048 | 2015-06-25 23:31:54.398158+02 | bf6bafca69ffe00dcc5f701889bcdc97
        2 |      53543 | 2016-10-16 15:31:52.180558+02 | 6771fa4d76e866ea7f0cd062e5608f65
        2 |     276975 | 2015-07-02 22:18:54.263758+02 | 5582f058c2f800a5c74ff6621afea48e
        2 |     619981 | 2016-02-07 09:22:35.678158+01 | d3177fd39ce9566bc568dab3607622cd
        2 |     292590 | 2016-02-01 13:36:33.959758+01 | cf4a398869f666acc0399c37c68e4c17
        2 |      60572 | 2015-07-23 05:26:34.199758+02 | 019a33e73a3de44198ed8e9c55b38092
(20 ROWS)

So, let's see if we can get “first" of anything:

$ SELECT
    group_id,
    FIRST(random_int) AS first_int,
    FIRST(random_ts) AS first_ts,
    FIRST(random_text) AS first_text
FROM test
GROUP BY group_id
ORDER BY group_id;
 group_id | first_int |           first_ts            |            first_text            
----------+-----------+-------------------------------+----------------------------------
        0 |    895509 | 2016-08-08 07:07:43.713358+02 | f56fbaeb6119963a4bdb6758608d2b8f
        1 |     58672 | 2017-02-02 08:53:27.978958+01 | f58ddb16c746175c57ed7859274ba120
        2 |    619981 | 2016-02-07 09:22:35.678158+01 | d3177fd39ce9566bc568dab3607622cd
(3 ROWS)

Because of PostgreSQL power, we can also do something smarter:

$ SELECT
    group_id,
    FIRST(random_int ORDER BY random_int) AS first_int,
    FIRST(random_ts ORDER BY random_int) AS first_ts,
    FIRST(random_text ORDER BY random_int) AS first_text
FROM test
GROUP BY group_id
ORDER BY group_id;
 group_id | first_int |           first_ts            |            first_text            
----------+-----------+-------------------------------+----------------------------------
        0 |    202836 | 2015-06-11 05:07:16.526158+02 | 6e99221b0cce929ba0ca41bc8dfaaf27
        1 |     58672 | 2017-02-02 08:53:27.978958+01 | f58ddb16c746175c57ed7859274ba120
        2 |     53543 | 2016-10-16 15:31:52.180558+02 | 6771fa4d76e866ea7f0cd062e5608f65
(3 ROWS)

In this case, I got values for first row, in a group, if it was sorted by random_int.

Now, that we have this in place, let's write last():

$ CREATE OR REPLACE FUNCTION agg_last(
    IN p_state       anyelement,
    IN p_new_element anyelement
    )
    RETURNS anyelement
    LANGUAGE SQL
    AS $$
SELECT COALESCE( p_new_element, p_state );
$$;
CREATE FUNCTION
 
$ CREATE AGGREGATE LAST(anyelement) (
    sfunc = agg_last,
    stype = anyelement
);
CREATE AGGREGATE

Let's try it:

$ SELECT
    group_id,
    LAST(random_int ORDER BY random_int) AS last_int,
    LAST(random_ts ORDER BY random_int) AS last_ts,
    LAST(random_text ORDER BY random_int) AS last_text
FROM test
GROUP BY group_id
ORDER BY group_id;
 group_id | last_int |            last_ts            |            last_text             
----------+----------+-------------------------------+----------------------------------
        0 |   895509 | 2016-08-08 07:07:43.713358+02 | f56fbaeb6119963a4bdb6758608d2b8f
        1 |   780823 | 2015-05-04 07:26:56.807758+02 | 2e6bc5ce1c1d2965f00be96373638e59
        2 |   619981 | 2016-02-07 09:22:35.678158+01 | d3177fd39ce9566bc568dab3607622cd
(3 ROWS)

Cool. Looks like working. Now, how about getting couple of first/last elements? This will be slightly more complicated:

$ CREATE OR REPLACE FUNCTION agg_first(
    IN p_state       anyarray,
    IN p_new_element anyelement,
    IN p_limit       int4
    )
    RETURNS anyarray
    LANGUAGE SQL
    AS $$
SELECT CASE
    WHEN COALESCE( array_length( p_state, 1 ), 0 ) < p_limit
         THEN p_state || p_new_element
    ELSE p_state
     END;
$$;
CREATE FUNCTION
 
$ CREATE AGGREGATE FIRST(anyelement, int4) (
    sfunc = agg_first,
    stype = anyarray,
    initcond = '{}'
);
CREATE AGGREGATE

And quick test:

$ SELECT
    group_id,
    FIRST(random_int, 2 ORDER BY random_int) AS first_ints,
    FIRST(random_ts, 2 ORDER BY random_ts) AS first_tses,
    FIRST(random_text, 2 ORDER BY random_text) AS first_texts
FROM
    test
GROUP BY group_id
ORDER BY group_id;
-[ RECORD 1 ]--------------------------------------------------------------------
group_id    | 0
first_ints  | {202836,424242}
first_tses  | {"2015-06-11 05:07:16.526158+02","2015-07-03 13:17:44.366158+02"}
first_texts | {17192f91bf1e94f2e8af7c9bd2c15d07,4b9ac4c4ad6c2a806c0db801e9a03373}
-[ RECORD 2 ]--------------------------------------------------------------------
group_id    | 1
first_ints  | {58672,82612}
first_tses  | {"2015-05-02 10:07:26.260558+02","2015-05-04 07:26:56.807758+02"}
first_texts | {16d0845557d4d246cd431b74356422f6,2a09584e0035379e795cfa70f6615eec}
-[ RECORD 3 ]--------------------------------------------------------------------
group_id    | 2
first_ints  | {53543,60572}
first_tses  | {"2015-07-02 22:18:54.263758+02","2015-07-23 05:26:34.199758+02"}
first_texts | {019a33e73a3de44198ed8e9c55b38092,5582f058c2f800a5c74ff6621afea48e}

last function will be slightly more complex

$ CREATE OR REPLACE FUNCTION agg_last(
    IN p_state       anyarray,
    IN p_new_element anyelement,
    IN p_limit       int4
    )
    RETURNS anyarray
    LANGUAGE SQL
    AS $$
    SELECT ( CASE
        WHEN array_length( p_state, 1 ) >= p_limit
            THEN p_state[(array_upper(p_state, 1) - p_limit + 2):]
        ELSE
            p_state
        END ) || p_new_element;
$$;
CREATE FUNCTION
 
$ CREATE AGGREGATE LAST(anyelement, int4) (
    sfunc = agg_last,
    stype = anyarray,
    initcond = '{}'
);
CREATE AGGREGATE

So, let's see if it really works:

$ SELECT
    group_id,
    LAST(random_int, 2 ORDER BY random_int) AS last_ints,
    LAST(random_ts, 2 ORDER BY random_ts) AS last_tses,
    LAST(random_text, 2 ORDER BY random_text) AS last_texts
FROM
    test
GROUP BY group_id
ORDER BY group_id;
-[ RECORD 1 ]-------------------------------------------------------------------
group_id   | 0
last_ints  | {722122,895509}
last_tses  | {"2016-09-21 13:37:46.708558+02","2016-12-05 06:18:10.257358+01"}
last_texts | {f18d9774a1b74f4422116ec1ceec8b6a,f56fbaeb6119963a4bdb6758608d2b8f}
-[ RECORD 2 ]-------------------------------------------------------------------
group_id   | 1
last_ints  | {682945,780823}
last_tses  | {"2016-11-08 18:29:11.822158+01","2017-02-02 08:53:27.978958+01"}
last_texts | {bf6bafca69ffe00dcc5f701889bcdc97,f58ddb16c746175c57ed7859274ba120}
-[ RECORD 3 ]-------------------------------------------------------------------
group_id   | 2
last_ints  | {292590,619981}
last_tses  | {"2016-02-07 09:22:35.678158+01","2016-10-16 15:31:52.180558+02"}
last_texts | {cf4a398869f666acc0399c37c68e4c17,d3177fd39ce9566bc568dab3607622cd}

Comparing it with larger dump above can be complicated, so let's generate the same results using window functions, and compare:

$ WITH numbered_rows AS (
    SELECT
        group_id,
        ROW_NUMBER() OVER (partition BY group_id ORDER BY random_int DESC) AS row_no,
        random_int,
        random_ts,
        random_text
    FROM
        test
)
SELECT
    group_id,
    array_agg(random_int ORDER BY random_int) AS first_ints,
    array_agg(random_ts ORDER BY random_ts) AS first_tses,
    array_agg(random_text ORDER BY random_text) AS first_texts
FROM
    numbered_rows
WHERE
    row_no < 3
GROUP BY group_id
ORDER BY group_id;
-[ RECORD 1 ]--------------------------------------------------------------------
group_id    | 0
first_ints  | {722122,895509}
first_tses  | {"2015-09-13 20:07:48.580558+02","2016-08-08 07:07:43.713358+02"}
first_texts | {17192f91bf1e94f2e8af7c9bd2c15d07,f56fbaeb6119963a4bdb6758608d2b8f}
-[ RECORD 2 ]--------------------------------------------------------------------
group_id    | 1
first_ints  | {682945,780823}
first_tses  | {"2015-05-02 10:07:26.260558+02","2015-05-04 07:26:56.807758+02"}
first_texts | {2e6bc5ce1c1d2965f00be96373638e59,776c602308bb93fad0d9753b125d8727}
-[ RECORD 3 ]--------------------------------------------------------------------
group_id    | 2
first_ints  | {292590,619981}
first_tses  | {"2016-02-01 13:36:33.959758+01","2016-02-07 09:22:35.678158+01"}
first_texts | {cf4a398869f666acc0399c37c68e4c17,d3177fd39ce9566bc568dab3607622cd}

Data looks the same to me. And query with first/last aggregates seem to be much simpler to write, read, and understand.

Just like with short_ids lately, there is no point in copy/pasting the code from blog – you can download it as usable extension. Use either GitHub or PGXN.

2 thoughts on “Getting first and last values per group”

gotar says:

2017-02-24 at 09:56

For simple cases, one can use ‘DISTINCT ON’ (beware, this is error prone). I’ve been using row_number() approach before, now there is first_value() function which helps a lot (trying to avoid last_value() as it’s not straightforward to use).
Alexey Bashtanov says:

2017-03-09 at 17:32

The aggregate functions provided by https://github.com/bashtanov/argm extension can do the same faster, as they allows hash aggregation, and are written in C.

Comments are closed.