# Getting first and last values per group

Every so often someone needs solution to getting first (or couple of first) values for given column. Or last.

For some cases (when there is not many groups) you can use recursive queries. But it's not always the best choice.

Let's try to implement first() and last() aggregates, so these could be easily used by anybody.

First, we need a function. Technically functions require parameter datatypes, which would make it pretty long, but luckily, we can use “ANYELEMENT" pseudo datatype:

```\$ CREATE OR REPLACE FUNCTION agg_first(
IN p_state       anyelement,
IN p_new_element anyelement
)
RETURNS anyelement
LANGUAGE SQL
AS \$\$
SELECT COALESCE( p_state, p_new_element );
\$\$;
CREATE FUNCTION```

This functions looks really simple, but that's because it should be simple. Now, let's make the aggregate:

```\$ CREATE AGGREGATE FIRST(anyelement) (
sfunc = agg_first,
stype = anyelement
);
CREATE AGGREGATE```

Now, we'll need some sample data, so let's create simple table with random data:

```\$ CREATE TABLE test (
group_id int4,
random_int int4,
random_ts timestamptz,
random_text text
);
CREATE TABLE

\$ INSERT INTO test (group_id, random_int, random_ts, random_text)
SELECT
FLOOR(random() * 3),
FLOOR(random() * 1000000),
now() - '2 years'::INTERVAL * random(),
md5(random()::text)
FROM generate_series(1,20);
INSERT 0 20```

Data is pretty simple:

```\$ SELECT * FROM test ORDER BY group_id;
group_id | random_int |           random_ts           |           random_text
----------+------------+-------------------------------+----------------------------------
0 |     493798 | 2015-12-19 17:41:12.110158+01 | 4b9ac4c4ad6c2a806c0db801e9a03373
0 |     424242 | 2015-10-07 21:35:20.318158+02 | 58d3a010b6fcd2200dd078b445132281
0 |     202836 | 2015-06-11 05:07:16.526158+02 | 6e99221b0cce929ba0ca41bc8dfaaf27
0 |     635961 | 2016-12-05 06:18:10.257358+01 | f18d9774a1b74f4422116ec1ceec8b6a
0 |     722122 | 2015-09-13 20:07:48.580558+02 | 17192f91bf1e94f2e8af7c9bd2c15d07
0 |     484963 | 2015-07-03 13:17:44.366158+02 | d17fcf48649fe6be0bf639c888bd9c5b
0 |     510350 | 2016-09-21 13:37:46.708558+02 | 7560516244c289dc84bcb30f04ed35a8
0 |     895509 | 2016-08-08 07:07:43.713358+02 | f56fbaeb6119963a4bdb6758608d2b8f
1 |      58672 | 2017-02-02 08:53:27.978958+01 | f58ddb16c746175c57ed7859274ba120
1 |     780823 | 2015-05-04 07:26:56.807758+02 | 2e6bc5ce1c1d2965f00be96373638e59
1 |     485881 | 2015-11-26 03:23:53.092558+01 | 2a09584e0035379e795cfa70f6615eec
1 |     609786 | 2015-08-01 23:00:41.505358+02 | 16d0845557d4d246cd431b74356422f6
1 |     682945 | 2015-05-02 10:07:26.260558+02 | 776c602308bb93fad0d9753b125d8727
1 |      82612 | 2016-11-08 18:29:11.822158+01 | ab90846f963d2d24980f9ce9d8a658bb
1 |     668048 | 2015-06-25 23:31:54.398158+02 | bf6bafca69ffe00dcc5f701889bcdc97
2 |      53543 | 2016-10-16 15:31:52.180558+02 | 6771fa4d76e866ea7f0cd062e5608f65
2 |     276975 | 2015-07-02 22:18:54.263758+02 | 5582f058c2f800a5c74ff6621afea48e
2 |     619981 | 2016-02-07 09:22:35.678158+01 | d3177fd39ce9566bc568dab3607622cd
2 |     292590 | 2016-02-01 13:36:33.959758+01 | cf4a398869f666acc0399c37c68e4c17
2 |      60572 | 2015-07-23 05:26:34.199758+02 | 019a33e73a3de44198ed8e9c55b38092
(20 ROWS)```

So, let's see if we can get “first" of anything:

```\$ SELECT
group_id,
FIRST(random_int) AS first_int,
FIRST(random_ts) AS first_ts,
FIRST(random_text) AS first_text
FROM test
GROUP BY group_id
ORDER BY group_id;
group_id | first_int |           first_ts            |            first_text
----------+-----------+-------------------------------+----------------------------------
0 |    895509 | 2016-08-08 07:07:43.713358+02 | f56fbaeb6119963a4bdb6758608d2b8f
1 |     58672 | 2017-02-02 08:53:27.978958+01 | f58ddb16c746175c57ed7859274ba120
2 |    619981 | 2016-02-07 09:22:35.678158+01 | d3177fd39ce9566bc568dab3607622cd
(3 ROWS)```

Because of PostgreSQL power, we can also do something smarter:

```\$ SELECT
group_id,
FIRST(random_int ORDER BY random_int) AS first_int,
FIRST(random_ts ORDER BY random_int) AS first_ts,
FIRST(random_text ORDER BY random_int) AS first_text
FROM test
GROUP BY group_id
ORDER BY group_id;
group_id | first_int |           first_ts            |            first_text
----------+-----------+-------------------------------+----------------------------------
0 |    202836 | 2015-06-11 05:07:16.526158+02 | 6e99221b0cce929ba0ca41bc8dfaaf27
1 |     58672 | 2017-02-02 08:53:27.978958+01 | f58ddb16c746175c57ed7859274ba120
2 |     53543 | 2016-10-16 15:31:52.180558+02 | 6771fa4d76e866ea7f0cd062e5608f65
(3 ROWS)```

In this case, I got values for first row, in a group, if it was sorted by random_int.

Now, that we have this in place, let's write last():

```\$ CREATE OR REPLACE FUNCTION agg_last(
IN p_state       anyelement,
IN p_new_element anyelement
)
RETURNS anyelement
LANGUAGE SQL
AS \$\$
SELECT COALESCE( p_new_element, p_state );
\$\$;
CREATE FUNCTION

\$ CREATE AGGREGATE LAST(anyelement) (
sfunc = agg_last,
stype = anyelement
);
CREATE AGGREGATE```

Let's try it:

```\$ SELECT
group_id,
LAST(random_int ORDER BY random_int) AS last_int,
LAST(random_ts ORDER BY random_int) AS last_ts,
LAST(random_text ORDER BY random_int) AS last_text
FROM test
GROUP BY group_id
ORDER BY group_id;
group_id | last_int |            last_ts            |            last_text
----------+----------+-------------------------------+----------------------------------
0 |   895509 | 2016-08-08 07:07:43.713358+02 | f56fbaeb6119963a4bdb6758608d2b8f
1 |   780823 | 2015-05-04 07:26:56.807758+02 | 2e6bc5ce1c1d2965f00be96373638e59
2 |   619981 | 2016-02-07 09:22:35.678158+01 | d3177fd39ce9566bc568dab3607622cd
(3 ROWS)```

Cool. Looks like working. Now, how about getting couple of first/last elements? This will be slightly more complicated:

```\$ CREATE OR REPLACE FUNCTION agg_first(
IN p_state       anyarray,
IN p_new_element anyelement,
IN p_limit       int4
)
RETURNS anyarray
LANGUAGE SQL
AS \$\$
SELECT CASE
WHEN COALESCE( array_length( p_state, 1 ), 0 ) < p_limit
THEN p_state || p_new_element
ELSE p_state
END;
\$\$;
CREATE FUNCTION

\$ CREATE AGGREGATE FIRST(anyelement, int4) (
sfunc = agg_first,
stype = anyarray,
initcond = '{}'
);
CREATE AGGREGATE```

And quick test:

```\$ SELECT
group_id,
FIRST(random_int, 2 ORDER BY random_int) AS first_ints,
FIRST(random_ts, 2 ORDER BY random_ts) AS first_tses,
FIRST(random_text, 2 ORDER BY random_text) AS first_texts
FROM
test
GROUP BY group_id
ORDER BY group_id;
-[ RECORD 1 ]--------------------------------------------------------------------
group_id    | 0
first_ints  | {202836,424242}
first_tses  | {"2015-06-11 05:07:16.526158+02","2015-07-03 13:17:44.366158+02"}
-[ RECORD 2 ]--------------------------------------------------------------------
group_id    | 1
first_ints  | {58672,82612}
first_tses  | {"2015-05-02 10:07:26.260558+02","2015-05-04 07:26:56.807758+02"}
first_texts | {16d0845557d4d246cd431b74356422f6,2a09584e0035379e795cfa70f6615eec}
-[ RECORD 3 ]--------------------------------------------------------------------
group_id    | 2
first_ints  | {53543,60572}
first_tses  | {"2015-07-02 22:18:54.263758+02","2015-07-23 05:26:34.199758+02"}
first_texts | {019a33e73a3de44198ed8e9c55b38092,5582f058c2f800a5c74ff6621afea48e}```

last function will be slightly more complex

```\$ CREATE OR REPLACE FUNCTION agg_last(
IN p_state       anyarray,
IN p_new_element anyelement,
IN p_limit       int4
)
RETURNS anyarray
LANGUAGE SQL
AS \$\$
SELECT ( CASE
WHEN array_length( p_state, 1 ) >= p_limit
THEN p_state[(array_upper(p_state, 1) - p_limit + 2):]
ELSE
p_state
END ) || p_new_element;
\$\$;
CREATE FUNCTION

\$ CREATE AGGREGATE LAST(anyelement, int4) (
sfunc = agg_last,
stype = anyarray,
initcond = '{}'
);
CREATE AGGREGATE```

So, let's see if it really works:

```\$ SELECT
group_id,
LAST(random_int, 2 ORDER BY random_int) AS last_ints,
LAST(random_ts, 2 ORDER BY random_ts) AS last_tses,
LAST(random_text, 2 ORDER BY random_text) AS last_texts
FROM
test
GROUP BY group_id
ORDER BY group_id;
-[ RECORD 1 ]-------------------------------------------------------------------
group_id   | 0
last_ints  | {722122,895509}
last_tses  | {"2016-09-21 13:37:46.708558+02","2016-12-05 06:18:10.257358+01"}
last_texts | {f18d9774a1b74f4422116ec1ceec8b6a,f56fbaeb6119963a4bdb6758608d2b8f}
-[ RECORD 2 ]-------------------------------------------------------------------
group_id   | 1
last_ints  | {682945,780823}
last_tses  | {"2016-11-08 18:29:11.822158+01","2017-02-02 08:53:27.978958+01"}
last_texts | {bf6bafca69ffe00dcc5f701889bcdc97,f58ddb16c746175c57ed7859274ba120}
-[ RECORD 3 ]-------------------------------------------------------------------
group_id   | 2
last_ints  | {292590,619981}
last_tses  | {"2016-02-07 09:22:35.678158+01","2016-10-16 15:31:52.180558+02"}
last_texts | {cf4a398869f666acc0399c37c68e4c17,d3177fd39ce9566bc568dab3607622cd}```

Comparing it with larger dump above can be complicated, so let's generate the same results using window functions, and compare:

```\$ WITH numbered_rows AS (
SELECT
group_id,
ROW_NUMBER() OVER (partition BY group_id ORDER BY random_int DESC) AS row_no,
random_int,
random_ts,
random_text
FROM
test
)
SELECT
group_id,
array_agg(random_int ORDER BY random_int) AS first_ints,
array_agg(random_ts ORDER BY random_ts) AS first_tses,
array_agg(random_text ORDER BY random_text) AS first_texts
FROM
numbered_rows
WHERE
row_no < 3
GROUP BY group_id
ORDER BY group_id;
-[ RECORD 1 ]--------------------------------------------------------------------
group_id    | 0
first_ints  | {722122,895509}
first_tses  | {"2015-09-13 20:07:48.580558+02","2016-08-08 07:07:43.713358+02"}
first_texts | {17192f91bf1e94f2e8af7c9bd2c15d07,f56fbaeb6119963a4bdb6758608d2b8f}
-[ RECORD 2 ]--------------------------------------------------------------------
group_id    | 1
first_ints  | {682945,780823}
first_tses  | {"2015-05-02 10:07:26.260558+02","2015-05-04 07:26:56.807758+02"}
-[ RECORD 3 ]--------------------------------------------------------------------
group_id    | 2
first_ints  | {292590,619981}
first_tses  | {"2016-02-01 13:36:33.959758+01","2016-02-07 09:22:35.678158+01"}
first_texts | {cf4a398869f666acc0399c37c68e4c17,d3177fd39ce9566bc568dab3607622cd}```

Data looks the same to me. And query with first/last aggregates seem to be much simpler to write, read, and understand.

Just like with short_ids lately, there is no point in copy/pasting the code from blog – you can download it as usable extension. Use either GitHub or PGXN.

## 2 thoughts on “Getting first and last values per group”

1. gotar says:

For simple cases, one can use ‘DISTINCT ON’ (beware, this is error prone). I’ve been using row_number() approach before, now there is first_value() function which helps a lot (trying to avoid last_value() as it’s not straightforward to use).

2. Alexey Bashtanov says:

The aggregate functions provided by https://github.com/bashtanov/argm extension can do the same faster, as they allows hash aggregation, and are written in C.