Who logged in to the system from multiple countries within 2 hours?

Yesterday someone posted a set of queries for interviews, all centered on answering business-like questions from a database.

Today this post is hidden behind some “subscribe to read more” thing, so I will not even link to it, but one question there caught my eye.

Since I can't copy-paste the text, I'll try to write what I remember:

Given a table sessions, with columns user_id, login_time, and country_id, list all cases where a single account logged in to the system from more than one country within a 2-hour time frame.

The idea behind it is that it would be a tool to find hacked accounts, based on the idea that you generally can't change country within 2 hours. Which is somewhat true.

The solution in the blog post suggested joining the sessions table with itself, using some inequality condition. I think we can do better…

First, of course, let's make a table, and put some data in it. Since I don't care about the size of the data, for now, let's make it small, with just a couple of well-defined cases:

=$ create table sessions (
    user_id int8,
    login_time timestamptz,
    country_id int8
);
CREATE TABLE

In real life there would be some indexes, foreign keys, and a primary key, but for this example, I just don't care enough.

Sample data. I will need at least three cases:

  1. a user that never changed country; that will be user_id == 1
  2. a user that changed country, but more than 2 hours apart; that will be user_id == 2
  3. a user that changed country, and did it within 2 hours; let's make that user_id == 3

So, the insert statement will be:

=$ insert into sessions (user_id, login_time, country_id) values
    (1, '2025-05-01 12:34:56', 100),
    (1, '2025-05-01 13:34:56', 100),
    (1, '2025-05-01 19:34:56', 100),
    (2, '2025-05-01 12:34:56', 100),
    (2, '2025-05-01 19:34:56', 200),
    (3, '2025-05-01 12:34:56', 100),
    (3, '2025-05-01 13:34:56', 200);
INSERT 0 7

Now, the basic approach with a join would be something like:

=$ select
    s_start.user_id,
    s_start.login_time,
    array_agg(distinct s_end.country_id) as countries
from
    sessions as s_start
    join sessions s_end on
        s_start.user_id = s_end.user_id and
        s_end.login_time between s_start.login_time and s_start.login_time + '2 hours'::interval
group by
    s_start.user_id, s_start.login_time
having
    count(distinct s_end.country_id) > 1;
 user_id |       login_time       | countries
---------+------------------------+-----------
       3 | 2025-05-01 12:34:56+02 | {100,200}
(1 row)

Simple-ish. But can it be done with just one scan of sessions?

I think we can. Using window functions, and specifically non-standard “frame” definitions…

Specifically, in the window definition we can use something like this: order by login_time range between current row and '2 hours'::interval following. This makes a frame that takes into account all rows whose login_time is between that of the current row and 2 hours into the future. We can see it working here:

=$ SELECT
    s.*,
    array_agg(login_time) over (partition by user_id order by login_time range between current row and '2 hours'::interval following)
FROM
    sessions s;
 user_id |       login_time       | country_id |                      array_agg
---------+------------------------+------------+-----------------------------------------------------
       1 | 2025-05-01 12:34:56+02 |        100 | {"2025-05-01 12:34:56+02","2025-05-01 13:34:56+02"}
       1 | 2025-05-01 13:34:56+02 |        100 | {"2025-05-01 13:34:56+02"}
       1 | 2025-05-01 19:34:56+02 |        100 | {"2025-05-01 19:34:56+02"}
       2 | 2025-05-01 12:34:56+02 |        100 | {"2025-05-01 12:34:56+02"}
       2 | 2025-05-01 19:34:56+02 |        200 | {"2025-05-01 19:34:56+02"}
       3 | 2025-05-01 12:34:56+02 |        100 | {"2025-05-01 12:34:56+02","2025-05-01 13:34:56+02"}
       3 | 2025-05-01 13:34:56+02 |        200 | {"2025-05-01 13:34:56+02"}
(7 rows)

The partition by user_id part makes it so that for each row we only take into account other rows with the same user_id.

Anyway, with this in place, we can change the query to actually show the country ids, and filter the output so that a row is shown only when there are multiple countries:

=$ SELECT
    s.user_id,
    s.login_time,
    array_agg(distinct s.country_id) over (partition by user_id order by login_time range between current row and '2 hours'::interval following) as countries
FROM
    sessions s;
ERROR:  DISTINCT is not implemented for window functions
LINE 4:     array_agg(distinct country_id) over (partition by user_i...
            ^

That throws a tiny wrench into my plans. But OK, let's make it work with plain array_agg, without distinct, and we can filter out duplicates later:

=$ SELECT
    s.user_id,
    s.login_time,
    array_agg(s.country_id) over (partition by user_id order by login_time range between current row and '2 hours'::interval following) as countries
FROM
    sessions s;
 user_id |       login_time       | countries
---------+------------------------+-----------
       1 | 2025-05-01 12:34:56+02 | {100,100}
       1 | 2025-05-01 13:34:56+02 | {100}
       1 | 2025-05-01 19:34:56+02 | {100}
       2 | 2025-05-01 12:34:56+02 | {100}
       2 | 2025-05-01 19:34:56+02 | {200}
       3 | 2025-05-01 12:34:56+02 | {100,200}
       3 | 2025-05-01 13:34:56+02 | {200}
(7 rows)

That worked as expected, so let's get rid of the duplicate countries. A relatively simple way, at least for me, is to write a custom aggregate. I know that this sounds scary, but it's just a matter of writing one very simple function, and one create aggregate statement:

=$ create function array_agg_unique( INOUT p_state int8[], IN p_newval int8 ) returns int8[] as $$
    select case
        when p_newval = any(p_state) then p_state
        else p_state || p_newval
    end;
$$ language sql;
CREATE FUNCTION
 
=$ create aggregate array_agg_unique( int8 ) (
    sfunc = array_agg_unique,
    stype = int8[],
    initcond = '{}'
);
CREATE AGGREGATE
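Before plugging the new aggregate into a window, we can sanity-check it on its own. A quick throwaway check, not from the original post, using a VALUES list:

```sql
-- Each new value is appended to the state array only when it is not
-- already there, so duplicates collapse and first-seen order is kept.
select array_agg_unique(v)
from (values (100), (100), (200), (100)) as t(v);
-- {100,200}
```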

That's all. So how does the new unique aggregate work?

=$ SELECT
    s.user_id,
    s.login_time,
    array_agg_unique(s.country_id) over (partition by user_id order by login_time range between current row and '2 hours'::interval following) as countries
FROM
    sessions s;
 user_id |       login_time       | countries 
---------+------------------------+-----------
       1 | 2025-05-01 12:34:56+02 | {100}
       1 | 2025-05-01 13:34:56+02 | {100}
       1 | 2025-05-01 19:34:56+02 | {100}
       2 | 2025-05-01 12:34:56+02 | {100}
       2 | 2025-05-01 19:34:56+02 | {200}
       3 | 2025-05-01 12:34:56+02 | {100,200}
       3 | 2025-05-01 13:34:56+02 | {200}
(7 rows)

And now the filtering. The simplest way, for me, is to just wrap it in a CTE:

WITH base as (
    SELECT
        s.user_id,
        s.login_time,
        array_agg_unique(s.country_id) over (partition by user_id order by login_time range between current row and '2 hours'::interval following) as countries
    FROM
        sessions s
)
select *
from base
where array_upper(countries, 1) > 1;
 user_id |       login_time       | countries 
---------+------------------------+-----------
       3 | 2025-05-01 12:34:56+02 | {100,200}
(1 row)
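As a side note, the custom aggregate isn't strictly required. A sketch of an alternative (same idea, untested against this dataset): keep the plain array_agg window call, then deduplicate each finished array with unnest() in a lateral subquery. It uses cardinality(), which reads a bit more naturally than array_upper() for one-dimensional arrays:

```sql
WITH base as (
    SELECT
        s.user_id,
        s.login_time,
        array_agg(s.country_id) over (partition by user_id order by login_time range between current row and '2 hours'::interval following) as countries
    FROM
        sessions s
)
select b.user_id, b.login_time, d.countries
from base b,
    -- deduplicate the finished array, row by row
    lateral (
        select array_agg(distinct c) as countries
        from unnest(b.countries) as u(c)
    ) d
where cardinality(d.countries) > 1;
```

Whether this beats the custom aggregate depends on the data: the per-row unnest() adds overhead, but it avoids creating database objects, which may matter if you can't create functions.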

Now, the question is: is it really faster than the join approach?

Let's look first at simple explain analyze of both queries. First the join approach:

  1. =$ explain analyze
  2. select
  3.     s_start.user_id,
  4.     s_start.login_time,
  5.     array_agg(distinct s_end.country_id) as countries
  6. from
  7.     sessions as s_start
  8.     join sessions s_end on
  9.         s_start.user_id = s_end.user_id and
  10.         s_end.login_time between s_start.login_time and s_start.login_time + '2 hours'::interval
  11. group by
  12.     s_start.user_id, s_start.login_time
  13. having
  14.     count(distinct s_end.country_id) > 1;
  15.                                                                  QUERY PLAN                                                                  
  16. ---------------------------------------------------------------------------------------------------------------------------------------------
  17.  GroupAggregate  (cost=194.32..209.42 rows=183 width=48) (actual time=0.147..0.163 rows=1.00 loops=1)
  18.    Group Key: s_start.user_id, s_start.login_time
  19.    Filter: (count(DISTINCT s_end.country_id) > 1)
  20.    Rows Removed by Filter: 6
  21.    Buffers: shared hit=8
  22.    ->  Sort  (cost=194.32..195.69 rows=549 width=24) (actual time=0.120..0.139 rows=9.00 loops=1)
  23.          Sort Key: s_start.user_id, s_start.login_time, s_end.country_id
  24.          Sort Method: quicksort  Memory: 25kB
  25.          Buffers: shared hit=8
  26.          ->  Hash Join  (cost=45.33..169.34 rows=549 width=24) (actual time=0.039..0.074 rows=9.00 loops=1)
  27.                Hash Cond: (s_start.user_id = s_end.user_id)
  28.                Join Filter: ((s_end.login_time >= s_start.login_time) AND (s_end.login_time <= (s_start.login_time + '02:00:00'::interval)))
  29.                Rows Removed by Join Filter: 8
  30.                Buffers: shared hit=2
  31.                ->  Seq Scan on sessions s_start  (cost=0.00..25.70 rows=1570 width=16) (actual time=0.004..0.013 rows=7.00 loops=1)
  32.                      Buffers: shared hit=1
  33.                ->  Hash  (cost=25.70..25.70 rows=1570 width=24) (actual time=0.026..0.030 rows=7.00 loops=1)
  34.                      Buckets: 2048  Batches: 1  Memory Usage: 17kB
  35.                      Buffers: shared hit=1
  36.                      ->  Seq Scan on sessions s_end  (cost=0.00..25.70 rows=1570 width=24) (actual time=0.004..0.013 rows=7.00 loops=1)
  37.                            Buffers: shared hit=1
  38.  Planning:
  39.    Buffers: shared hit=156
  40.  Planning Time: 0.435 ms
  41.  Execution Time: 0.243 ms
  42. (25 rows)

OK. So it took two separate Seq Scans of sessions (in lines 31 and 36). I guess that if there was more data, one of them would be changed into a set of Index Scans. We'll test that in a moment.

On the other hand my window+frame+aggregate approach generates this explain analyze:

=$ explain analyze
WITH base as (
    SELECT
        s.user_id,
        s.login_time,
        array_agg_unique(s.country_id) over (partition by user_id order by login_time range between current row and '2 hours'::interval following) as countries
    FROM
        sessions s
)
select *
from base
where array_upper(countries, 1) > 1;
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on base  (cost=109.08..163.99 rows=523 width=48) (actual time=0.279..0.296 rows=1.00 loops=1)
   Filter: (array_upper(base.countries, 1) > 1)
   Rows Removed by Filter: 6
   Buffers: shared hit=50
   ->  WindowAgg  (cost=109.08..140.44 rows=1570 width=48) (actual time=0.221..0.283 rows=7.00 loops=1)
         Window: w1 AS (PARTITION BY s.user_id ORDER BY s.login_time RANGE BETWEEN CURRENT ROW AND '02:00:00'::interval FOLLOWING)
         Storage: Memory  Maximum Storage: 17kB
         Buffers: shared hit=50
         ->  Sort  (cost=109.04..112.96 rows=1570 width=24) (actual time=0.044..0.056 rows=7.00 loops=1)
               Sort Key: s.user_id, s.login_time
               Sort Method: quicksort  Memory: 25kB
               Buffers: shared hit=4
               ->  Seq Scan on sessions s  (cost=0.00..25.70 rows=1570 width=24) (actual time=0.006..0.015 rows=7.00 loops=1)
                     Buffers: shared hit=1
 Planning:
   Buffers: shared hit=63 dirtied=1
 Planning Time: 0.279 ms
 Execution Time: 0.385 ms
(18 rows)

Single Seq Scan, and some processing.

Great. So now, let's put in some more data. Let's say 100,000 rows, with random users, times, and countries:

=$ insert into sessions (user_id, login_time, country_id)
select
    floor( 1 + random() * 50 ),
    now() - '3 months'::interval * random(),
    floor( 1 + random() * random() * random() * random() * random() * 5 )
from generate_series(1,100000);
INSERT 0 100000

The multi-random expression for country simply makes country_id much less evenly distributed:

=$ select country_id, count(*) from sessions group by 1 order by 1;
 country_id | count 
------------+-------
          1 | 97623
          2 |  2119
          3 |   235
          4 |    23
        100 |     5
        200 |     2
(6 rows)

To make sure that we have it all nicely indexed, I'll add an index on (user_id, login_time):

=$ create index the_index on sessions (user_id, login_time);
CREATE INDEX

Sweet. And now, the moment of truth: I'll run both queries (the one using the join, and the one using the window) 3 times each, pick the fastest of each type, and we'll see how they behave:

Fastest JOIN run:

                                                                              QUERY PLAN                                                                              
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=17.24..2143404.69 rows=33298 width=48) (actual time=154.160..3097.721 rows=6316.00 loops=1)
   Group Key: s_start.user_id, s_start.login_time
   Filter: (count(DISTINCT s_end.country_id) > 1)
   Rows Removed by Filter: 93690
   Buffers: shared hit=585882
   ->  Incremental Sort  (cost=17.24..1919272.77 rows=22263351 width=24) (actual time=151.612..2689.938 rows=285069.00 loops=1)
         Sort Key: s_start.user_id, s_start.login_time, s_end.country_id
         Presorted Key: s_start.user_id, s_start.login_time
         Full-sort Groups: 8578  Sort Method: quicksort  Average Memory: 26kB  Peak Memory: 26kB
         Buffers: shared hit=585882
         ->  Nested Loop  (cost=0.84..770705.97 rows=22263351 width=24) (actual time=151.285..1929.886 rows=285069.00 loops=1)
               Buffers: shared hit=585882
               ->  Index Only Scan using the_index on sessions s_start  (cost=0.42..3052.52 rows=100007 width=16) (actual time=0.011..130.035 rows=100007.00 loops=1)
                     Heap Fetches: 0
                     Index Searches: 1
                     Buffers: shared hit=387
               ->  Index Scan using the_index on sessions s_end  (cost=0.42..5.46 rows=222 width=24) (actual time=0.003..0.007 rows=2.85 loops=100007)
                     Index Cond: ((user_id = s_start.user_id) AND (login_time >= s_start.login_time) AND (login_time <= (s_start.login_time + '02:00:00'::interval)))
                     Index Searches: 100007
                     Buffers: shared hit=585495
 Planning Time: 0.145 ms
 JIT:
   Functions: 13
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.490 ms (Deform 0.123 ms), Inlining 8.884 ms, Optimization 123.118 ms, Emission 19.230 ms, Total 151.722 ms
 Execution Time: 3106.287 ms
(26 rows)

and fastest query with custom aggregate and window-based approach:

                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on base  (cost=49.51..8850.37 rows=33336 width=48) (actual time=0.890..957.652 rows=6316.00 loops=1)
   Filter: (array_upper(base.countries, 1) > 1)
   Rows Removed by Filter: 93691
   Buffers: shared hit=100241
   ->  WindowAgg  (cost=49.51..7350.26 rows=100007 width=48) (actual time=0.034..816.409 rows=100007.00 loops=1)
         Window: w1 AS (PARTITION BY s.user_id ORDER BY s.login_time RANGE BETWEEN CURRENT ROW AND '02:00:00'::interval FOLLOWING)
         Storage: Memory  Maximum Storage: 17kB
         Buffers: shared hit=100241
         ->  Index Scan using the_index on sessions s  (cost=0.42..5600.14 rows=100007 width=24) (actual time=0.011..169.658 rows=100007.00 loops=1)
               Index Searches: 1
               Buffers: shared hit=100241
 Planning Time: 0.068 ms
 Execution Time: 965.772 ms
(13 rows)

Sweet. So it seems to be ~ 3 times faster.

Hope it will help someone in some kind of job interview. Or at least show that there are many ways to get to an answer for a problem…
