July 12th, 2016 by depesz | 2 comments

Today, on IRC, someone asked an interesting question.

Basically she ran a query like:

select a, b, c, d, e, f from table order by a

Then she processed the output to get, for each value of a, arrays of the unique values of b, c, d, e, and f, and inserted the result back into the database, into some other table.

This was a problem because the table had many rows (millions, I would assume), and the whole process was slow.

So, how to make it faster?

First, let's make some test data:

$ create table test ( a int4, b int4, c int4, d int4, e int4, f int4);

Now, let's load some values there. I don't know how many distinct values there were in each of the columns, so let's assume that this query will generate a somewhat sensible dataset:

$ insert into test (a,b,c,d,e,f)
select
    cast( random() * 100000 as int4 ),
    cast( random() * 10 as int4 ),
    cast( random() * 20 as int4 ),
    cast( random() * 30 as int4 ),
    cast( random() * 40 as int4 ),
    cast( random() * 50 as int4 )
from
generate_series(1,10000000);

This inserted 10 million rows with some, hopefully obvious, values:

$ select * from test limit 10;
   a   | b  | c  | d  | e  | f  
-------+----+----+----+----+----
 12152 |  1 | 10 | 24 | 35 |  9
 93998 |  9 |  6 | 14 |  7 | 42
 57485 |  6 | 17 | 19 |  5 |  8
 83496 |  9 |  5 |  2 | 15 | 28
 92942 | 10 | 10 | 26 | 19 | 19
 35400 |  6 |  9 | 25 | 16 | 17
   513 |  3 |  4 |  9 | 33 | 18
 15767 |  4 | 20 |  0 |  0 |  5
 16405 |  8 |  1 | 12 | 36 | 22
 96755 |  8 |  8 | 14 | 29 | 43
(10 rows)

Now, the goal is to get one row for each value of a, where b, c, d, e, and f are arrays containing the unique values of the given column, preferably sorted.

For example given these 3 rows:

$ select * from test order by a limit 3;
 a | b | c  | d  | e  | f  
---+---+----+----+----+----
 0 | 6 | 13 |  2 | 25 | 10
 0 | 1 | 14 | 15 |  5 | 49
 0 | 6 | 14 | 10 | 22 | 15
(3 rows)

We would like to get one row, with columns:

  • a = 0
  • b = array[1,6]
  • c = array[13,14]
  • d = array[2,10,15]
  • e = array[5,22,25]
  • f = array[10,15,49]
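The per-group logic itself is simple; here is a minimal standalone sketch in Python (hypothetical code, not tied to the database, using the three example rows above):

```python
from collections import defaultdict

# Example rows as (a, b, c, d, e, f) tuples, assumed already sorted by a;
# these are the three rows shown above.
rows = [
    (0, 6, 13, 2, 25, 10),
    (0, 1, 14, 15, 5, 49),
    (0, 6, 14, 10, 22, 15),
]

# For each value of a, keep one set per remaining column to deduplicate values.
groups = defaultdict(lambda: [set() for _ in range(5)])
for a, *rest in rows:
    for column_set, value in zip(groups[a], rest):
        column_set.add(value)

# One result row per a: each column becomes a sorted list of its unique values.
result = {a: [sorted(s) for s in sets] for a, sets in groups.items()}
```

This is essentially the transformation that both approaches below perform, one on the client, one inside the database.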

We could, for example, use such script:

#!/usr/bin/env perl
use strict;
use DBI;
 
my $dbh = DBI->connect("dbi:Pg:dbname=depesz;port=5960;host=localhost");
 
# Stream the sorted rows out via COPY, so the whole set is never held in memory.
$dbh->do("COPY (select a, b, c, d, e, f from test order by a) to stdout");
my $single_row;
my %current;
while ( $dbh->pg_getcopydata( $single_row ) >= 0 ) {
    $single_row =~ s/\s*\z//;
    my @columns = split /\t/, $single_row;
    $current{'id'} = $columns[0] unless exists $current{'id'};
 
    # A new value of a means the previous group is complete - print it and reset.
    if ($current{'id'} != $columns[0]) {
        print_current(\%current);
        %current = ( 'id' => $columns[0] );
    }
    # Hash keys serve as a set of unique values for each column.
    $current{'b'}->{$columns[1]} = 1;
    $current{'c'}->{$columns[2]} = 1;
    $current{'d'}->{$columns[3]} = 1;
    $current{'e'}->{$columns[4]} = 1;
    $current{'f'}->{$columns[5]} = 1;
}
$dbh->disconnect;
# Don't forget the last group, which the loop never flushes.
print_current(\%current);
 
exit;
 
sub print_current {
    my $c = shift;
    print $c->{'id'};
    # Print each column's unique values, numerically sorted, as an array literal.
    for my $k (qw( b c d e f ) ) {
        printf "\t{%s}", join(",", sort{$a<=>$b} keys %{$c->{$k}});
    }
    print "\n";
}

This script, when I ran it on my desktop, took 47 seconds. What's more, it needed to load 187MB of data from the database, and send back 40MB.

Now, the same can be achieved with this query:

$ select
    a,
    array_agg(distinct b order by b),
    array_agg(distinct c order by c),
    array_agg(distinct d order by d),
    array_agg(distinct e order by e),
    array_agg(distinct f order by f)
from test
group by a;

Its explain analyze:

                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1736503.61..1913689.26 rows=97240 width=164) (actual time=5943.621..12482.748 rows=100001 loops=1)
   Group Key: a
   ->  Sort  (cost=1736503.61..1761503.29 rows=9999871 width=24) (actual time=5943.554..7201.170 rows=10000000 loops=1)
         Sort Key: a
         Sort Method: external merge  Disk: 332432kB
         ->  Seq Scan on test  (cost=0.00..163693.71 rows=9999871 width=24) (actual time=0.016..566.024 rows=10000000 loops=1)
 Planning time: 0.140 ms
 Execution time: 12523.137 ms
(8 rows)

This shows that it ran in 12.5 seconds.

To have it working you need PostgreSQL 9.0 or newer, which means that it works on virtually every sensible Pg installation there is (9.0 was released 6 years ago, and it's so old that it is no longer even supported).

Pretty cool, isn't it?

  1. # Chris Travers
    Aug 12, 2016

    Have you tried a loose index scan with the same data? It might be more complex of a query but I bet it would go faster.
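    [For reference: PostgreSQL has no native loose index scan, but one is commonly emulated with a recursive CTE that hops through the index on `a` to enumerate its distinct values. A sketch of that approach, assuming an index on `test(a)`, with the per-group aggregation done via a LATERAL subquery:]

    WITH RECURSIVE distinct_a AS (
        (SELECT a FROM test ORDER BY a LIMIT 1)
        UNION ALL
        SELECT (SELECT t.a FROM test t WHERE t.a > da.a ORDER BY t.a LIMIT 1)
        FROM distinct_a da
        WHERE da.a IS NOT NULL
    )
    SELECT da.a, agg.*
    FROM distinct_a da
    CROSS JOIN LATERAL (
        SELECT array_agg(DISTINCT g.b ORDER BY g.b) AS bs,
               array_agg(DISTINCT g.c ORDER BY g.c) AS cs,
               array_agg(DISTINCT g.d ORDER BY g.d) AS ds,
               array_agg(DISTINCT g.e ORDER BY g.e) AS es,
               array_agg(DISTINCT g.f ORDER BY g.f) AS fs
        FROM test g
        WHERE g.a = da.a
    ) agg
    WHERE da.a IS NOT NULL;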

  2. # Pritam Baral
    Oct 29, 2016

    I tried Chris Travers’s suggestion. The results were … interesting.

    Adding an index on column `a` slowed down the query to 150% of its no-index run. Consistent behaviour across multiple runs.

    Then I sorted the data (by means of a `CREATE TABLE test2 AS … ORDER BY a;`). The query times dropped to 66% of the original run.

    But, adding an index on column `a` on this sorted table, sped up the query to a further 65% of the non-index, sorted table run. A total of 42% of the original, non-indexed, non-sorted run.
