Waiting for PostgreSQL 11 – Add hash partitioning.

On 9th of November 2017, Robert Haas committed patch:

Add hash partitioning.
 
Hash partitioning is useful when you want to partition a growing data
set evenly.  This can be useful to keep table sizes reasonable, which
makes maintenance operations such as VACUUM faster, or to enable
partition-wise join.
 
At present, we still depend on constraint exclusion for partitioning
pruning, and the shape of the partition constraints for hash
partitioning is such that that doesn't work.  Work is underway to fix
that, which should both improve performance and make partitioning
pruning work with hash partitioning.
 
Amul Sul, reviewed and tested by Dilip Kumar, Ashutosh Bapat, Yugo
Nagata, Rajkumar Raghuwanshi, Jesper Pedersen, and by me.  A few
final tweaks also by me.
 
Discussion: http://postgr.es/m/CAAJ_b96fhpJAP=ALbETmeLk1Uni_GFZD938zgenhF49qgDTjaQ@mail.gmail.com

Some time ago, I wrote that Pg 10 will get native partitioning.

Back then we were limited to partitions by range and by list. These are the most commonly used, and they might even be the most useful. But there is also one more type of partition – by hash.

This can be used to divide rows (more or less) equally into multiple partitions. So, for example, we can have 1 billion rows in 100 partitions, and each will contain 10 million rows (more or less).

Let's see how this can be achieved now, in Pg 11. For my test let's assume we have users table with this base structure:

$ CREATE TABLE users (
    username    text         PRIMARY KEY,
    password    text,
    created_on  timestamptz  NOT NULL DEFAULT now(),
    id_admin    bool         NOT NULL DEFAULT FALSE
);

Making it partitioned by hash (of username) could be done like this:

$ CREATE TABLE users (
    username    text         NOT NULL,
    password    text,
    created_on  timestamptz  NOT NULL DEFAULT now(),
    id_admin    bool         NOT NULL DEFAULT FALSE
) PARTITION BY HASH (username);

Partitions are then defined using 2 numbers:

  • MODULUS
  • REMAINDER

In the most simple case – MODULUS is number of partitions, and REMAINDER is number, 0 or more, but less than MODULUS.

For example, assuming we'd want to partition into eight partitions, I could:

$ CREATE TABLE users_p0 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 0);
$ CREATE TABLE users_p1 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 1);
$ CREATE TABLE users_p2 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 2);
$ CREATE TABLE users_p3 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 3);
$ CREATE TABLE users_p4 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 4);
$ CREATE TABLE users_p5 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 5);
$ CREATE TABLE users_p6 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 6);
$ CREATE TABLE users_p7 PARTITION OF users ( PRIMARY KEY (username) ) FOR VALUES WITH (MODULUS 8, REMAINDER 7);

With this in place, I now have tables that look like this:

$ \d+ users
                                              TABLE "public.users"
   COLUMN   |           TYPE           | Collation | NULLABLE | DEFAULT | Storage  | Stats target | Description 
------------+--------------------------+-----------+----------+---------+----------+--------------+-------------
 username   | text                     |           | NOT NULL |         | extended |              | 
 password   | text                     |           |          |         | extended |              | 
 created_on | TIMESTAMP WITH TIME zone |           | NOT NULL | now()   | plain    |              | 
 id_admin   | BOOLEAN                  |           | NOT NULL | FALSE   | plain    |              | 
Partition KEY: HASH (username)
Partitions: users_p0 FOR VALUES WITH (modulus 8, remainder 0),
            users_p1 FOR VALUES WITH (modulus 8, remainder 1),
            users_p2 FOR VALUES WITH (modulus 8, remainder 2),
            users_p3 FOR VALUES WITH (modulus 8, remainder 3),
            users_p4 FOR VALUES WITH (modulus 8, remainder 4),
            users_p5 FOR VALUES WITH (modulus 8, remainder 5),
            users_p6 FOR VALUES WITH (modulus 8, remainder 6),
            users_p7 FOR VALUES WITH (modulus 8, remainder 7)
 
$ \d+ users_p1
                                            TABLE "public.users_p1"
   COLUMN   |           TYPE           | Collation | NULLABLE | DEFAULT | Storage  | Stats target | Description 
------------+--------------------------+-----------+----------+---------+----------+--------------+-------------
 username   | text                     |           | NOT NULL |         | extended |              | 
 password   | text                     |           |          |         | extended |              | 
 created_on | TIMESTAMP WITH TIME zone |           | NOT NULL | now()   | plain    |              | 
 id_admin   | BOOLEAN                  |           | NOT NULL | FALSE   | plain    |              | 
Partition OF: users FOR VALUES WITH (modulus 8, remainder 1)
Partition CONSTRAINT: satisfies_hash_partition('1161847'::oid, 8, 1, username)
Indexes:
    "users_p1_pkey" PRIMARY KEY, btree (username)

So, let's insert some random users, and see what happens:

$ \copy users (username) FROM stdin;
proffers
babbles
cents
choose
chalked
redoubts
pitting
coddling
relieves
wooing
codgers
sinewy
separate
ferry
crusty
cursing
hawkers
deducted
gaseous
voyagers
\.

All data is visible in users, of course:

$ SELECT * FROM users;
 username | password |          created_on          | id_admin 
----------+----------+------------------------------+----------
 coddling |          | 2017-11-10 14:15:55.71057+01 | f
 voyagers |          | 2017-11-10 14:15:55.71057+01 | f
 relieves |          | 2017-11-10 14:15:55.71057+01 | f
 codgers  |          | 2017-11-10 14:15:55.71057+01 | f
 separate |          | 2017-11-10 14:15:55.71057+01 | f
 hawkers  |          | 2017-11-10 14:15:55.71057+01 | f
 deducted |          | 2017-11-10 14:15:55.71057+01 | f
 crusty   |          | 2017-11-10 14:15:55.71057+01 | f
 chalked  |          | 2017-11-10 14:15:55.71057+01 | f
 redoubts |          | 2017-11-10 14:15:55.71057+01 | f
 sinewy   |          | 2017-11-10 14:15:55.71057+01 | f
 pitting  |          | 2017-11-10 14:15:55.71057+01 | f
 wooing   |          | 2017-11-10 14:15:55.71057+01 | f
 cents    |          | 2017-11-10 14:15:55.71057+01 | f
 cursing  |          | 2017-11-10 14:15:55.71057+01 | f
 gaseous  |          | 2017-11-10 14:15:55.71057+01 | f
 proffers |          | 2017-11-10 14:15:55.71057+01 | f
 choose   |          | 2017-11-10 14:15:55.71057+01 | f
 ferry    |          | 2017-11-10 14:15:55.71057+01 | f
 babbles  |          | 2017-11-10 14:15:55.71057+01 | f
(20 ROWS)

But we can also see that it is being distributed among partitions:

$ SELECT tableoid::regclass AS partition_name, COUNT(*) FROM users GROUP BY 1 ORDER BY 1;
 partition_name | COUNT 
----------------+-------
 users_p0       |     2
 users_p1       |     5
 users_p2       |     1
 users_p3       |     3
 users_p4       |     2
 users_p5       |     3
 users_p6       |     3
 users_p7       |     1
(8 ROWS)

Of course, with only 20 rows distribution is not perfect, but it works sanely.

Unfortunately, as of now, PostgreSQL can't prune partitions from execution, based on their values:

$ EXPLAIN analyze SELECT * FROM users WHERE username = 'redoubts';
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.15..65.34 ROWS=8 width=73) (actual TIME=0.030..0.049 ROWS=1 loops=1)
   ->  INDEX Scan USING users_p0_pkey ON users_p0  (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.010..0.010 ROWS=0 loops=1)
         INDEX Cond: (username = 'redoubts'::text)
   ->  INDEX Scan USING users_p1_pkey ON users_p1  (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.004..0.004 ROWS=0 loops=1)
         INDEX Cond: (username = 'redoubts'::text)
   ->  INDEX Scan USING users_p2_pkey ON users_p2  (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.007..0.007 ROWS=0 loops=1)
         INDEX Cond: (username = 'redoubts'::text)
   ->  INDEX Scan USING users_p3_pkey ON users_p3  (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.007..0.008 ROWS=1 loops=1)
         INDEX Cond: (username = 'redoubts'::text)
   ->  INDEX Scan USING users_p4_pkey ON users_p4  (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.006..0.006 ROWS=0 loops=1)
         INDEX Cond: (username = 'redoubts'::text)
   ->  INDEX Scan USING users_p5_pkey ON users_p5  (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.004..0.004 ROWS=0 loops=1)
         INDEX Cond: (username = 'redoubts'::text)
   ->  INDEX Scan USING users_p6_pkey ON users_p6  (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.005..0.005 ROWS=0 loops=1)
         INDEX Cond: (username = 'redoubts'::text)
   ->  INDEX Scan USING users_p7_pkey ON users_p7  (cost=0.15..8.17 ROWS=1 width=73) (actual TIME=0.003..0.003 ROWS=0 loops=1)
         INDEX Cond: (username = 'redoubts'::text)
 Planning TIME: 0.999 ms
 Execution TIME: 0.115 ms
(19 ROWS)

but, from what I gather – there is work underway to make it work.

All things considered – it's great addition, one that will be definitely helpful in some cases, and it will just become better. Thanks a lot to all involved.

One thought on “Waiting for PostgreSQL 11 – Add hash partitioning.”

  1. I don’t understand why the syntax allows each partition to specify its own modulus. I can’t see why you would ever want to mix different moduli, and in doing so surely you would end up with tuples that could exist in multiple different partitions?

    Skimming through the mailing list it certainly looks like allowing different moduli is intentional, but I didn’t see any explanation for how that would work (or why it would be useful)

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.