June 7th, 2015 by depesz | Tags: , , , , | 26 comments »
Did it help? If yes - maybe you can help me?

Recently I noticed that more and more cases that I deal with could use some partitioning. And while theoretically most people know about it, it's definitely not a very well-understood feature, and sometimes people are scared of it.

So, I'll try to explain, to my best knowledge, what it is, why one would want to use it, and how to actually make it happen.

As you probably know PostgreSQL has tables. And tables have data. Sometimes there is just a few rows, and sometimes there are billions.

Partitioning is a method of splitting the large (based on number of records, not number of column) tables into many smaller ones. Preferably it should be done in a way that is transparent for application.

One of the less used features of PostgreSQL is the fact that it's object-relational database. And the object in here is what is important. Because objects (well, classes) do know something called “inheritance". And this is what is being used for partitioning.

Let's see what it is.

I'll create normal table, users:

$ create table users (
    id             serial primary key,
    username       text not null unique,
    password       text,
    created_on     timestamptz not null,
    last_logged_on timestamptz not null
);

Now, just for completeness sake, let's add some rows, and additional index:

$ insert into users (username, password, created_on, last_logged_on)
    select
        random_string( (random() * 4 + 5)::int4),
        random_string( 20 ),
        now() - '2 years'::interval * random(),
        now() - '2 years'::interval * random()
    from
        generate_series(1, 10000);
$ create index newest_users on users (created_on);

So, we have test table:

$ \d
                                      Table "public.users"
     Column     |           Type           |                     Modifiers                      
----------------+--------------------------+----------------------------------------------------
 id             | integer                  | not null default nextval('users_id_seq'::regclass)
 username       | text                     | not null
 password       | text                     | 
 created_on     | timestamp with time zone | not null
 last_logged_on | timestamp with time zone | not null
Indexes:
    "users_pkey" PRIMARY KEY, btree (id)
    "users_username_key" UNIQUE CONSTRAINT, btree (username)
    "newest_users" btree (created_on)

with some random data:

$ select * from users limit 10;
 id | username |       password       |          created_on           |        last_logged_on         
----+----------+----------------------+-------------------------------+-------------------------------
  1 | ityfce3  | 2ukgbflj_l2ndo3vilt2 | 2015-01-02 16:56:41.346113+01 | 2015-04-15 12:34:58.318913+02
  2 | _xg_pv   | u8hy20aifyblg9f3_rf2 | 2014-09-27 05:41:05.317313+02 | 2014-08-07 14:46:14.197313+02
  3 | uvi1wo   | h09ae85v_f_cx0gf6_8r | 2013-06-17 18:48:44.389313+02 | 2014-06-03 06:53:49.640513+02
  4 | o6rgs    | vzbrkwhnsucxco5pjep0 | 2015-01-30 11:33:25.150913+01 | 2013-11-05 07:18:47.730113+01
  5 | nk61jw77 | lidk_mnpe_olffmod7ed | 2014-06-15 07:18:34.597313+02 | 2014-03-21 17:42:44.763713+01
  6 | 3w326_2u | pyoqg87feemojhql7jrn | 2015-01-20 05:41:54.133313+01 | 2014-09-07 20:33:23.682113+02
  7 | m9rk9mnx | 6pvt94s6ol46kn0yl62b | 2013-07-17 15:13:36.315713+02 | 2013-11-12 10:53:06.123713+01
  8 | adk6c    | egfp8re0z492e6ri8urz | 2014-07-23 11:41:11.883713+02 | 2013-10-22 07:19:36.200513+02
  9 | rsyaedw  | ond0tie9er92oqhmdj39 | 2015-05-11 16:45:40.472513+02 | 2013-08-31 17:29:18.910913+02
 10 | prlobe46 | _3br5v97t2xngcd7xz4n | 2015-01-10 20:13:29.461313+01 | 2014-05-04 06:25:56.072513+02
(10 rows)

Now, with this table in place, I can create partitions. And this means – inherited tables:

$ create table users_1 () inherits (users);
 
$ \d users_1
                                     Table "public.users_1"
     Column     |           Type           |                     Modifiers                      
----------------+--------------------------+----------------------------------------------------
 id             | integer                  | not null default nextval('users_id_seq'::regclass)
 username       | text                     | not null
 password       | text                     | 
 created_on     | timestamp with time zone | not null
 last_logged_on | timestamp with time zone | not null
Inherits: users

So, we have new table, which has certain interesting properties:

  • is uses the same sequence as base of its id column
  • all columns have the same definition, including not null limits
  • there is no primary key, nor unique constraint on username, nor index on created_on

Let's try again, but with more “bang":

$ drop table users_1;
$ create table users_1 ( like users including all );
$ \d users_1
                                     Table "public.users_1"
     Column     |           Type           |                     Modifiers                      
----------------+--------------------------+----------------------------------------------------
 id             | integer                  | not null default nextval('users_id_seq'::regclass)
 username       | text                     | not null
 password       | text                     | 
 created_on     | timestamp with time zone | not null
 last_logged_on | timestamp with time zone | not null
Indexes:
    "users_1_pkey" PRIMARY KEY, btree (id)
    "users_1_username_key" UNIQUE CONSTRAINT, btree (username)
    "users_1_created_on_idx" btree (created_on)

Now we have all the indexes and constraints, but we lost inheritance information. But we can add it afterwards, with:

$ alter table users_1 inherit users;
$ \d users_1
                                     Table "public.users_1"
     Column     |           Type           |                     Modifiers                      
----------------+--------------------------+----------------------------------------------------
 id             | integer                  | not null default nextval('users_id_seq'::regclass)
 username       | text                     | not null
 password       | text                     | 
 created_on     | timestamp with time zone | not null
 last_logged_on | timestamp with time zone | not null
Indexes:
    "users_1_pkey" PRIMARY KEY, btree (id)
    "users_1_username_key" UNIQUE CONSTRAINT, btree (username)
    "users_1_created_on_idx" btree (created_on)
Inherits: users

We could have done it in one step, but then some not nice notices are shown:

$ drop table users_1;
 
$ create table users_1 ( like users including all ) inherits (users);
NOTICE:  merging column "id" with inherited definition
NOTICE:  merging column "username" with inherited definition
NOTICE:  merging column "password" with inherited definition
NOTICE:  merging column "created_on" with inherited definition
NOTICE:  merging column "last_logged_on" with inherited definition
 
$ \d users_1
                                     Table "public.users_1"
     Column     |           Type           |                     Modifiers                      
----------------+--------------------------+----------------------------------------------------
 id             | integer                  | not null default nextval('users_id_seq'::regclass)
 username       | text                     | not null
 password       | text                     | 
 created_on     | timestamp with time zone | not null
 last_logged_on | timestamp with time zone | not null
Indexes:
    "users_1_pkey" PRIMARY KEY, btree (id)
    "users_1_username_key" UNIQUE CONSTRAINT, btree (username)
    "users_1_created_on_idx" btree (created_on)
Inherits: users

In any case, we now have 2 tables – base, and 1st partition.

If I'll do any kind of select/update/delete to users, then both of them will be scanned:

$ explain analyze select * from users where id = 123;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..16.47 rows=2 width=66) (actual time=0.008..0.009 rows=1 loops=1)
   ->  Index Scan using users_pkey on users  (cost=0.29..8.30 rows=1 width=48) (actual time=0.008..0.008 rows=1 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_1_pkey on users_1  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
 Planning time: 0.327 ms
 Execution time: 0.031 ms
(7 rows)

But if I'll ask the partition directly – it will just query the partition:

$ explain analyze select * from users_1 where id = 123;
                                                      QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------
 Index Scan using users_1_pkey on users_1  (cost=0.15..8.17 rows=1 width=84) (actual time=0.002..0.002 rows=0 loops=1)
   Index Cond: (id = 123)
 Planning time: 0.162 ms
 Execution time: 0.022 ms
(4 rows)

If we wanted, we could have asked only users table, without it's “children/partitions", using keyword ONLY:

$ explain analyze select * from only users where id = 123;
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Index Scan using users_pkey on users  (cost=0.29..8.30 rows=1 width=48) (actual time=0.008..0.008 rows=1 loops=1)
   Index Cond: (id = 123)
 Planning time: 0.229 ms
 Execution time: 0.031 ms
(4 rows)

You might have notices that I said that select/update/delete work on all partitions. What about insert? Insert has to insert data somewhere, so it works always as if there was ONLY added. So if I want to insert row to users_1, I have to do:

INSERT INTO users_1 ...

this might sound bad, but don't worry, there are ways around it, I'm not done yet.

OK. Knowing all this, let's use some actual partitioning. First we need to decide what will be the partitioning key – i.e. what is the algorithm that will be chosen to pick partition.

There are couple of more obvious:

  • partition by date – for example, have partitions based on year when user was created
  • partition by id range – for example – first million users, second million users, and so on
  • partition by something else – like first letter of username

There are also couple of others, less often used like “partition by hash of username".

Why would one use one schema and not another. Let's consider their benefits and drawbacks:

  • partition by date:
    • benefits:
      • simple to understand
      • number of rows in a given table will be somewhat stable
    • drawbacks:
      • requires maintenance – every so often we need to add new partition
      • searching by username or id will require scan of all partitions
  • partition by id:
    • benefits:
      • simple to understand
      • number of rows in a given table will be 100% stable
    • drawbacks:
      • requires maintenance – every so often we need to add new partition
      • searching by username will require scan of all partitions
  • partition by first letter of username:
    • benefits:
      • simple to understand
      • no maintenance – there is only limited set of partitions, and we never need to add more
    • drawbacks:
      • number of rows in partitions will steadily increase
      • some partitions will have significantly more rows than others (more people have nick “t*" than “y*")
      • searching by id will require scan of all partitions
  • partition by hash of username:
    • benefits:
      • no maintenance – there is only limited set of partitions, and we never need to add more
      • rows will be distributed equally among the partitions
    • drawbacks:
      • number of rows in partitions will steadily increase
      • searching by id will require scan of all partitions
      • searching by username will scan only one partition, but only when using additional conditions

The last listed drowback of hashed username approach is interesting. Let's see what's going on.

First I will need to create some more partitions:

$ create table users_2 ( like users including all );
$ alter table users_2 inherit users;
...
$ create table users_10 ( like users including all );
$ alter table users_10 inherit users;

now, our users knows about 10 partitions:

$ \d users
                                      Table "public.users"
     Column     |           Type           |                     Modifiers                      
----------------+--------------------------+----------------------------------------------------
 id             | integer                  | not null default nextval('users_id_seq'::regclass)
 username       | text                     | not null
 password       | text                     | 
 created_on     | timestamp with time zone | not null
 last_logged_on | timestamp with time zone | not null
Indexes:
    "users_pkey" PRIMARY KEY, btree (id)
    "users_username_key" UNIQUE CONSTRAINT, btree (username)
    "newest_users" btree (created_on)
Number of child tables: 10 (Use \d+ to list them.)

In PostgreSQL there is option constraint_exclusion which, when set to “on" or “partition", makes PostgreSQL skip partitions that cannot contain matching rows.

In my Pg it is, by default:

$ show constraint_exclusion;
 constraint_exclusion 
----------------------
 partition
(1 row)

Now, all my partitions, and base table don't have any sensible constraints, so any query will scan all 11 tables (master and 10 partitions):

$ explain analyze select * from users where id = 123;
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..89.98 rows=11 width=81) (actual time=0.009..0.013 rows=1 loops=1)
   ->  Index Scan using users_pkey on users  (cost=0.29..8.30 rows=1 width=48) (actual time=0.007..0.007 rows=1 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_1_pkey on users_1  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_2_pkey on users_2  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_3_pkey on users_3  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_4_pkey on users_4  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_5_pkey on users_5  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_6_pkey on users_6  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_7_pkey on users_7  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_8_pkey on users_8  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_9_pkey on users_9  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_10_pkey on users_10  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
 Planning time: 1.321 ms
 Execution time: 0.087 ms
(25 rows)

This is not really effective. But we can limit it.

Let's say that our partitions were made for “partition by id". And each partition should have 100,000 ids.

So we can add some constraints:

$ alter table users_1 add constraint partition_check check (id >= 0 and id < 100000);
$ alter table users_2 add constraint partition_check check (id >= 100000 and id < 200000);
...
$ alter table users_10 add constraint partition_check check (id >= 900000 and id < 1000000);

Now, the same query as previously:

$ explain analyze select * from users where id = 123;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..16.47 rows=2 width=66) (actual time=0.008..0.009 rows=1 loops=1)
   ->  Index Scan using users_pkey on users  (cost=0.29..8.30 rows=1 width=48) (actual time=0.008..0.009 rows=1 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_1_pkey on users_1  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
 Planning time: 1.104 ms
 Execution time: 0.031 ms
(7 rows)

scans only 2 tables: master (where currently all the data is, and there is no constraint, so it can't be excluded) and appropriate partition.

Cool, right?

We could add these kinds of partitioning conditions on username or created_on without any problem. But take a look at what happens when the partition key is more complicated:

$ alter table users_1 drop constraint partition_check, add constraint partition_check check (abs( hashtext(username) ) % 10 = 0);
$ alter table users_2 drop constraint partition_check, add constraint partition_check check (abs( hashtext(username) ) % 10 = 1);
...
$ alter table users_10 drop constraint partition_check, add constraint partition_check check (abs( hashtext(username) ) % 10 = 9);

In case you don't know – hashtext() takes string, and returns integer in the range of -2147483648 to 2147483647.

We know, because math tells us, that abs(hashtext(string)) % 10, will always give value within 0..9, and that it can be easily calculated for any parameter.

Does PostgreSQL know?

$ explain analyze select * from users where username = 'depesz';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..89.98 rows=11 width=81) (actual time=0.023..0.023 rows=0 loops=1)
   ->  Index Scan using users_username_key on users  (cost=0.29..8.30 rows=1 width=48) (actual time=0.016..0.016 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_1_username_key on users_1  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_2_username_key on users_2  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_3_username_key on users_3  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_4_username_key on users_4  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_5_username_key on users_5  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_6_username_key on users_6  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_7_username_key on users_7  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_8_username_key on users_8  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_9_username_key on users_9  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
   ->  Index Scan using users_10_username_key on users_10  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (username = 'depesz'::text)
 Planning time: 1.092 ms
 Execution time: 0.095 ms
(25 rows)

No. It doesn't. Basically, PostgreSQL can do automatic partition exclusion only for range based checks (or equality). Nothing based on functions. And even simple modulo is too much:

$ alter table users_1 drop constraint partition_check, add constraint partition_check check ( id % 10 = 0);
$ alter table users_2 drop constraint partition_check, add constraint partition_check check ( id % 10 = 1);
...
$ alter table users_10 drop constraint partition_check, add constraint partition_check check ( id % 10 = 9);
$ explain analyze select * from users where id = 123;
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..89.98 rows=11 width=81) (actual time=0.009..0.016 rows=1 loops=1)
   ->  Index Scan using users_pkey on users  (cost=0.29..8.30 rows=1 width=48) (actual time=0.009..0.009 rows=1 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_1_pkey on users_1  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_2_pkey on users_2  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_3_pkey on users_3  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_4_pkey on users_4  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_5_pkey on users_5  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_6_pkey on users_6  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_7_pkey on users_7  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_8_pkey on users_8  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_9_pkey on users_9  (cost=0.15..8.17 rows=1 width=84) (actual time=0.000..0.000 rows=0 loops=1)
         Index Cond: (id = 123)
   ->  Index Scan using users_10_pkey on users_10  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
 Planning time: 0.973 ms
 Execution time: 0.086 ms
(25 rows)

That's a shame because modulo based partition keys have one great ( in my opinion ) benefit – stable number of partitions. You don't have to create them afterwards, unless you'll decide to re-partition after hitting some larger amount of data.

Does that mean you can't use complicated (function/module based partition keys)? No. You can use them, but the queries will become a little bit more complex:

$ explain analyze select * from users where id = 123 and id % 10 = 123 % 10;
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.29..16.48 rows=2 width=66) (actual time=0.010..0.011 rows=1 loops=1)
   ->  Index Scan using users_pkey on users  (cost=0.29..8.31 rows=1 width=48) (actual time=0.010..0.010 rows=1 loops=1)
         Index Cond: (id = 123)
         Filter: ((id % 10) = 3)
   ->  Index Scan using users_4_pkey on users_4  (cost=0.15..8.17 rows=1 width=84) (actual time=0.001..0.001 rows=0 loops=1)
         Index Cond: (id = 123)
         Filter: ((id % 10) = 3)
 Planning time: 1.018 ms
 Execution time: 0.033 ms
(9 rows)

What I did is added another condition, which was:

id % 10 = 123 % 10

PostgreSQL can, while parsing, change it to:

id % 10 = 3

because it knows that % operator for integers is immutable. And now, I have, as part of the query, exact partition key – id % 10 = 3. So Pg can use these partitions that either don't have partition key (i.e. master) or have one matching the query.

Whether the additional complication is worth it – it's up to you.

If you'd prefer not to change queries, and you're fine with every so often creating new partitions, you might want to consider PG Partition Manger written by my former colleague Keith Fiske – it is set of functions that you run manually to define partitions, and one that you run from cronjob that takes care of creating new partitions for upcoming data.

Now, in all of these I did mention inserts, but didn't explain how to “get around the problem with inserts having to go to partition".

Basically – it's a job for trigger. Pg_partman by Keith creates these triggers for you, but I want you to understand what's going on, so you'll not use pg_partman as “black box", but rather as a helper tool that does the tedious so you don't have to.

Currently my partitioning schema is modulo based (something that partman can't do, as far as I know), so let's write appropriate trigger function. This will be called when data will be inserted to users table, and should, without raising errors, redirect the insert to appropriate partition. So let's write:

$ create function partition_for_users() returns trigger as $$
DECLARE
    v_parition_name text;
BEGIN
    v_parition_name := format( 'users_%s', 1 + NEW.id % 10 );
    execute 'INSERT INTO ' || v_parition_name || ' VALUES ( ($1).* )' USING NEW;
    return NULL;
END;
$$ language plpgsql;

and now the trigger definition:

$ create trigger partition_users before insert on users for each row execute procedure partition_for_users();

So, let's try to add a row:

$ insert into users (username, password, created_on, last_logged_on)
    values (
        'depesz',
        random_string( 20 ),
        now() - '2 years'::interval * random(),
        now() - '2 years'::interval * random()
    );
$ select currval('users_id_seq');
 currval 
---------
   10003
(1 row)

So, let's see if the data is visible:

$ select * from users where username = 'depesz';
  id   | username |       password       |          created_on           |        last_logged_on         
-------+----------+----------------------+-------------------------------+-------------------------------
 10003 | depesz   | bp7zwy8k3t3a37chf1hf | 2014-10-24 02:45:51.398824+02 | 2015-02-05 18:24:57.072424+01
(1 row)

Looks good, but where it actually is? In master?

$ select * from only users where username = 'depesz';
 id | username | password | created_on | last_logged_on 
----+----------+----------+------------+----------------
(0 rows)

No, so maybe in the correct partition?

$ select * from users_4 where username = 'depesz';
  id   | username |       password       |          created_on           |        last_logged_on         
-------+----------+----------------------+-------------------------------+-------------------------------
 10003 | depesz   | bp7zwy8k3t3a37chf1hf | 2014-10-24 02:45:51.398824+02 | 2015-02-05 18:24:57.072424+01

Yes. Trigger worked. But it has one drawback. Namely “RETURNING" doesn't work:

$ insert into users (username, password, created_on, last_logged_on)
    values (
        'test',
        random_string( 20 ),
        now() - '2 years'::interval * random(),
        now() - '2 years'::interval * random()
    )
    returning *;
 id | username | password | created_on | last_logged_on 
----+----------+----------+------------+----------------
(0 rows)

This is because, from the standpoint of executor – this insert didn't return anything – trigger returned NULL.

As of now, I'm not aware of sensible solution to this. In my cases, I simply opt to get my primary key value before using nextval(), and then insert the ready value – so I have it ready for after insert:

$ select nextval('users_id_seq');
 nextval 
---------
   10005
(1 row)
 
$ insert into users (id, username, password, created_on, last_logged_on)
    values (
        10005,
        'test',
        random_string( 20 ),
        now() - '2 years'::interval * random(),
        now() - '2 years'::interval * random()
    );

There is one note in here though. Routing all inserts via trigger will make them slower – after all for every row pg will have to fire another “insert".

If you have time-sensitive bulk inserts, I found that the best solution is to make them work directly with partitions. So, instead of, for example:

COPY users FROM stdin;
....
\.

You pre-obtain how many ids you need, for example using:

select nextval('users_id_seq') from generate_series(1, 100);

And then issue appropriate:

COPY users_p1 FROM stdin;
....
\.
COPY users_p2 FROM stdin;
....
\.
...

Not really nice, but it might be helpful if you have bulk data imports to partitioned tables.

So, now you should understand what is partitioning, and how it works. The next question on list in title was: why?

Why is relatively simple. For performance or maintenance.

Consider simple example like the users table, that has 1 billion rows (1,000,000,000).

Searching in it will get progressively more expensive, even with indexing, simply because depth of indexes will grow.

We can see it even on my small test table.

Let's drop all partitions, and partitioning trigger:

$ drop table users_1;
$ drop table users_2;
...
$ drop table users_10;
$ drop trigger partition_users on users;

Now, users has 10,000 rows. Simple search based on username, takes (3 tries, took best time): 0.020ms.

If I'll add more rows:

$ insert into users (username, password, created_on, last_logged_on)
    select
        random_string( (random() * 4 + 5)::int4),
        random_string( 20 ),
        now() - '2 years'::interval * random(),
        now() - '2 years'::interval * random()
    from
        generate_series(1, 100000);

the same search takes 0.025ms. The increase of 0.005ms might not be large, but we're still at only 110,000 rows, and no other tables in the system so the whole table, with indexes, fits in memory.

Of course your partitioning has to make sense. For example – if you usually search by username, then partitioning by id doesn't make sense – as Pg will have to search through all of the partitions (in could make sense in future, but this I leave for the very end of this blogpost).

Basically you have to consider what you usually query for – whether it's search by some key, or perhaps you usually only look in recent data? And partition in such a way that you limit number of partitions Pg has to scan.

What's also important, especially if you're more DBA than programmer – partitioning makes your life much easier. Any kind of maintenance tasks (create index, vacuum, pg_reorg/pg_repack, pg_dump) can be effectively split into as many tasks as you have partitions. So instead of single, multi-hour, transaction for repack of large table, you have 20, much faster, and using less disk space transactions, while the end result is basically the same!

Of course there are not only good news. There is also one big drawback of partitioning. You can't have foreign keys pointing to partitioned table.

It just doesn't work. You could have fkeys pointing directly to partitions, but that (usually) doesn't make sense.

Whether this is huge problem for you – it depends on your usecase. I figured that in most of the cases where we reach tables big enough to warrant partitioning, the application is tested well enough that we can agree to lack of this fkey. Plus – we can always add cronjob for testing if there are any “bad" values.

We now what it is, how it works, why use it. The last thing left is: how to migrate table to partitioned. Usually the app doesn't start with partitioned tables – it wouldn't make sense in the beginning. But after a while you have some table with many rows, and you can feel “I wish I partitioned it when I first created it".

But perhaps we still can partition it, while the app works? With minimal problems?

Let's see. For test, I created pgbench database that is 97GB. Most of it, 83GB, is in pgbench_accounts table, which contains 666,600,000 records.

Schema of this table is:

   Table "public.pgbench_accounts"
  Column  |     Type      | Modifiers 
----------+---------------+-----------
 aid      | integer       | not null
 bid      | integer       | 
 abalance | integer       | 
 filler   | character(84) | 
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)

And all queries to it are based on aid column, which has values from 1 to 666,600,000.

So, let's partition it, based on aid ranges.

Let's say that I'll put 10 million rows per partition, so I will need 67 partitions.

But how can I check if my work doesn't disrupt work? That's simple. I'll run pgbench, in a loop. I'm not interested in absolute reported speeds, just information how much my work influences what pgbench does.

With this in mind, I ran:

$ while true
do
    date
    pgbench -T 10 -c 2 bench
done 2>&1 | tee pgbench.log

Which will run 10 second tests, and log some statistics to file, so I can later on correlate performance with my partitioning work.

With this in place, let me start by creating the partitions, with appropriate checks already in place:

do $$
declare
    i int4;
    aid_min INT4;
    aid_max INT4;
begin
    for i in 1..67
    loop
        aid_min := (i - 1) * 10000000 + 1;
        aid_max := i * 10000000;
        execute format('CREATE TABLE pgbench_accounts_p_%s ( like pgbench_accounts including all )', i );
        execute format('ALTER TABLE pgbench_accounts_p_%s inherit pgbench_accounts', i);
        execute format('ALTER TABLE pgbench_accounts_p_%s add constraint partitioning_check check ( aid >= %s AND aid <= %s )', i, aid_min, aid_max );
    end loop;
end;
$$;

Now, partitions are in place, and I can veryfy that checks are being used:

$ explain analyze select * from pgbench_accounts where aid = 123;
                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.57..16.75 rows=2 width=224) (actual time=6.468..6.473 rows=1 loops=1)
   ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.57..8.59 rows=1 width=97) (actual time=6.468..6.469 rows=1 loops=1)
         Index Cond: (aid = 123)
   ->  Index Scan using pgbench_accounts_p_1_pkey on pgbench_accounts_p_1  (cost=0.14..8.16 rows=1 width=352) (actual time=0.004..0.004 rows=0 loops=1)
         Index Cond: (aid = 123)
 Planning time: 3.475 ms
 Execution time: 6.497 ms
(7 rows)

Now, we need to add “routing" trigger:

$ create function partition_for_accounts() returns trigger as $$
DECLARE
    v_parition_name text;
BEGIN
    v_parition_name := format( 'pgbench_accounts_p_%s', 1 + ( NEW.aid - 1 ) / 10000000 );
    execute 'INSERT INTO ' || v_parition_name || ' VALUES ( ($1).* )' USING NEW;
    return NULL;
END;
$$ language plpgsql;
 
$ create trigger partition_users before insert on pgbench_accounts for each row execute procedure partition_for_accounts();

This is all good, but this will work for newly inserted rows, and now, I have 666 million rows in master table. What can I do about it?

Well, I need to move them. This is relatively simple, in concept, but has some gotchas:

  1. at no point can both rows be visible to any transaction (i.e. from master and from partition)
  2. I can't delete all, and insert them to partitions as it would lock whole master table for the duration of migration

The second problem can be alleviated by working in batches. But – we can't use SQL for this.

Every so often someone asks about splitting large operation into batches, and calling it from single sql function, that would iterate over batches and do its work on them. This has fundamental problem: function call is in single transaction. So everything that function does will happen in single transaction. So it doesn't solve the locking problem.

But – we can use psql (or ruby, perl, python, whatever) to do it, with each batch moving just small number of rows, and thus locking for just a brief moment.

Basically single query will look like:

with x as (delete from only pgbench_accounts where aid between .. and .. returning *)
insert into appropriate_partition select * from x;

I will choose my batch size to be 1000 – it's small enough not to be too long, and large enough so that the total number of batches will not be exorbitant (666 thousand).

Now, let's generate the batch file:

\pset format unaligned
\pset tuples_only true
\o /tmp/run.batch.migration.sql
SELECT
    format(
        'with x as (DELETE FROM ONLY pgbench_accounts WHERE aid >= %s AND aid <= %s returning *) INSERT INTO pgbench_accounts_p_%s SELECT * FROM x;',
        i,
        i + 999,
        ( i - 1 ) / 10000000 + 1
    )
FROM
    generate_series( 1, 666600000, 1000 ) i;
\o

When I ran this in psql, it created file: /tmp/run.batch.migration.sql, which is quite large (97MB), as it contains 666,600 queries, like these:

with x as (DELETE FROM ONLY pgbench_accounts WHERE aid >= 1 AND aid <= 1000 returning *) INSERT INTO pgbench_accounts_p_1 SELECT * FROM x;
with x as (DELETE FROM ONLY pgbench_accounts WHERE aid >= 1001 AND aid <= 2000 returning *) INSERT INTO pgbench_accounts_p_1 SELECT * FROM x;
with x as (DELETE FROM ONLY pgbench_accounts WHERE aid >= 2001 AND aid <= 3000 returning *) INSERT INTO pgbench_accounts_p_1 SELECT * FROM x;

Now, that I have this ready, I can run it (of course “on screen" or “on tmux" so that it will not get lost should my ssh connection to server break:

$ psql -d bench -f /tmp/run.batch.migration.sql

This will take some time. In case of my test database, average batch takes ~ 92ms, which means I'm looking at 17 hours of migration time.

Actually, it took only 7 hours. Not bad.

Afterwards, the pgbench_accounts table is still ~ 83GB (I guess my disk is not fast enough to handle pgbench, migration, and vacuums).

But, I checked, and it looks like all rows have been migrated to partitions:

$ select count(*) from only pgbench_accounts;
 count 
-------
     0
(1 row)

How about speed of pgbench during the migration process?

Well, there have been four phases:

  1. Before any migration work was done
  2. After partitions were created
  3. After trigger was created
  4. During migration

Results?

  phase  |    min    |       avg        |    max    
---------+-----------+------------------+-----------
 Phase 1 | 28.662223 | 64.0359512839506 | 87.219148
 Phase 2 | 21.147816 | 56.2721418360656 | 75.967217
 Phase 3 | 23.868018 | 58.6375074477612 | 75.335558
 Phase 4 |  5.222364 | 23.6086916565574 | 65.770852
(4 rows)

So yes. Migration did slow things down. But please notice that this is simple desktop machine, with sata drives, not ssd, under heavy load all the time – pgbench was running the queries as fast as possible.

What's more – some of the slowdown is due to vacuum not really handling the deletions well. In my opinion – this is perfectly acceptable.

Afterwards, I could:

$ truncate only pgbench_accounts;

and afterwards, to check if everything is OK:

$ select count(*) from pgbench_accounts;
   count   
-----------
 666600000
(1 row)

All this was done without any errors, and without interrupting “real application".

And finally – partitioning will become soon(ish) even more cooler. Since recently we can have partitions on different servers. And there is work in progress (though unlikely to hit earlier than 9.6) to allow parallel execution of scans, which would make the whole thing even better.

Hope you'll find it useful.

  1. 26 comments

  2. # Anonymous
    Jun 7, 2015

    thx for nice article.

    I think there might be one little glitch. In the section where you added insert trigger on users table and execute first insert you write:

    No, so maybe in the correct partition?:

    $ select * from users_2 where username = 'depesz';
     id | username | password | created_on | last_logged_on 
    ----+----------+----------+------------+----------------
    (0 rows)

    Yes. Trigger worked. But it has one drawback. Namely “RETURNING” doesn’t work:

    But no rows are shown.

  3. # Anonymous
    Jun 8, 2015

    Perhaps one more inconvenience with hashed username approach may be the troubles when deleting individual partitions. Some databases have specialized MERGE/COALESCE syntax that reduce the number of partitions but not sure how one would just drop certain partition’s data. Maybe, hash partitioning should not be used at all if maintenance by deleting partitions is anticipated.

  4. Jun 8, 2015

    @anonymous: thanks, fixed.

  5. Jun 8, 2015

    @anonymous: sorry, I have no idea what you mean by deleting partitions. how you partition is irrelevant, when you can simply: drop table partition;

  6. # Anonymous
    Jun 8, 2015

    @depesz, that’s true but how do you describe what you just dropped. You can see that dropping, say, the oldest time-based range partition makes sense as a maintenance operation on partitions but how do you describe a maintenance operation namely dropping a hash bucket? Maybe, you can increase/decrease number of hash partitions but it would/should simply redistribute the data into new number of hash buckets keeping all of it intact. Hope that explains the point I’m trying to make.

  7. Jun 8, 2015

    No, sorry, I still don’t get it.

    dropping old partition makes sense for range based partitions. But for hashes the idea just doesn’t work. there is no new/old when it comes to hashes.

    So sorry, I have literally no idea what you are talking about.

  8. # Anonymous
    Jun 8, 2015

    I think we agree more or less but the way I put it may have been confusing; sorry about that.

    I just mean to suggest being careful about choosing partitioning method based on needs of querying (logical groupings of data), maintenance, or need for just a dumb uniform distribution. The last one does not necessarily help with the first two and hash seems to me to be just that. I agree but couldn’t articulate that dropping a hash partition does not make sense.

    And thanks for a detailed article as always.

  9. # Jose Luis
    Jun 8, 2015

    You say:

    > Searching in it will get progressively more expensive, even with indexing, simply because depth of indexes will grow.

    I am not sure about this. Have you tested it? All that I have readed from Oracle, it says that partitioning for optimize a index access by key doesn’t improve anything. It can be negative for oltp applications.

    Doing a partitioning you omit one one level of index, but instead a search the index structuresfor every partitioning table. You substitute one level a optimized access to a index, for a look in database dictionary dictionary for search the index of the partitioned table.

    I think that accessing a row by index, in a partitioned table must be slower.

  10. # AndyC
    Jun 9, 2015

    How about “Phase 5: After migration”. Might be interesting to compare Phase 1 to 5.

    Awesome write up, thanks!

  11. Jun 9, 2015

    @Jose:

    if your query has to search through all of the partitions – then sure, it will be slower than similar query on one large table.

    But the point of partitioning is skipping partitions that we know, and can prove, cannot contain the data we’re after.

  12. Jun 9, 2015

    @AndyC:

    I actually did test that, but the numbers I got were very weird (huge spikes), so figured my desktop decided to behave badly, and besides – the timings were just to shown that you can do the migration live – speed afterwards will depend on what queries you will be running there.

    And some things are even less comparable – for example creating index on partitioned system.

  13. # Matt
    Jun 9, 2015

    Regarding searches in indexes getting “progressively more expensive,” partial indexes accomplish exactly the same thing as partitions, no?

    So “partitions” (MySQL has implemented parts better, but some dumb limitations) seem to be for the NON-indexed part of the table, [possibly] divided on the same kind(s) of conditions that would be used for partial indexes.

    And of course with partial indexes, you have the ability to have “overlapping” conditions (same rows in multiple specialized indexes), depending on usage/need/etc.

  14. Jun 10, 2015

    @Matt:
    I wouldn’t say it’s the same.

    Partitions partition each index. So if you have 5 indexes, and 100 partitions you have 500 indexes, but only 5 on each partition table.
    In case of single table, you’d have to have 500 indexes, and this would (in my opinion) influence planning time.

    Not to mention, that one of the biggest benefits of partitioning would be lost: you would still have one single table, where doing any kind of maintenance work would take LONG time.

  15. # Romanov
    Jun 11, 2015

    I wonder how this partitioning technique up scales.
    In my case I have about 500M rows per year and the aim would be to keep at least 2 years (thus almost 1B rows) online.
    How would this technique scale with 1000 partitions?
    How would the planning cost be impacted?
    I fear that a number of internal algorithms are linear!

  16. Jun 11, 2015

    @Romanov:

    first of all – why in the $DEITY name would you even consider having 1000 partitions?

    second – since I never came even close to 1000 partitions (i would assume that I would need at least 20 billion rows for this), i have no idea.

    don’t fear – test.

  17. # Romanov
    Jun 16, 2015

    @depesz

    All examples I have seen so far involve no more than a dozen partitions. Something you can handle even manually.

    With historical data in the range of 500M rows per year I would think about partitioning by month, just like the examples.
    How would it scale with partitioning by week?

    Testing on a billion row table can be really hard, when not useless, unless there is some knowledge about the expected results (even at the large) that can drive this challenging test.

    Thus my original question …

  18. # TransPartition
    Aug 5, 2015

    Sorry, but table partitioning in Postgresql is very very far from the minimum acceptable. DB2, Oracle, SQL Server, Informix and (*sigh*) Mysql/MariaDB has transparent table partitioning, and everything is controlled by the core. Need to use an external script/plugin to create partitions is not usable for a “non shell” DBA,i.e., that use only SQL to maintain the database. Maintain this is a pain in a database with hundred tables. But appears that the core members see this feature like a ‘ugly duck’ and not a serious feature… 🙁

  19. Jan 16, 2016

    I found this link for an example of how to handle RETURNING * on your inserts:

    https://gist.github.com/copiousfreetime/59067

    Note that you need to create an after insert delete from master trigger as well in order to make it work (otherwise you get duplicate row in master and partition table). This is all covered in the code example.

  20. # Van Le
    Mar 18, 2016

    Thank you.

    I follow your post and get duplicate rows on master.
    End up writing delete trigger ( from https://gist.github.com/copiousfreetime/59067)

  21. # rushabh
    Apr 25, 2016

    Deleting table only will create a void. You need to CASCADE that table too in order to avoid leaving the space empty.
    e.g DROP TABLE tablename CASCADE;

  22. # Terry
    Jan 10, 2017

    this is a great article. I’m wondering about if you have another article about the best practices when you want add columns in partitioned tables.

    Regards.
    Terry

  23. Jan 11, 2017

    @Terry:

    nope, add column to parent table, and it will be added to all tables that inherit from it.

    To make the process less invasive you can first add column to all children, and only then to parent, but it will require certain caution.

    Also – just like with any other column addition – it’s best to add without “not null” or “default …” clauses.

  24. # Terry
    Jan 11, 2017

    @Depesz
    Thank you for the info, I’m gonna run some test adding the column to all children first as you say.

  25. # Mangal
    Jun 14, 2017

    Just came across this article while googling other stuff. Amazing read. Thanks for sharing it. Bookmarked!

  26. Jul 19, 2017

    Many thanks for a lot of details (as usual) regarding this topic.

    I want to add important correction: if you choose to use `CREATE TABLE … (LIKE … INCUDING ALL)` and then `ALTER .. INHERIT ..` — this will give you a child table where all columns are **own**, not really inherited. So if you later, say, drop some column in the parent table, the child will continue to have it.

    To fix that, you need manually fix metadata and tell Postgres that child’s columns are inherited, smth like this:

    update pg_attribute
    set attislocal = false
    where
    attrelid = (select oid from pg_class where relname = ‘CHILD_TABLE_NAME’)
    and attname not in (‘tableoid’, ‘cmax’, ‘xmax’, ‘cmin’, ‘xmin’, ‘ctid’)
    and not attisdropped
    ;

    So I’d rather choose the other approach and see some “not nice notices”.

  27. Jul 19, 2017

    update: `create table users_1 ( like users including all ) inherits (users);` will also give you a “local” copy for each column, so if then one drops a column in master, such column will remain in the child.

    Bottom line: if you’re using `create .. (like ..)`, keep in mind that such child table will have its own cloned columns rather than truly inherited.

Leave a comment