On the 22nd of March, Tom Lane committed this patch:

Allow foreign tables to participate in inheritance.
 
Foreign tables can now be inheritance children, or parents.  Much of the
system was already ready for this, but we had to fix a few things of
course, mostly in the area of planner and executor handling of row locks.
 
As side effects of this, allow foreign tables to have NOT VALID CHECK
constraints (and hence to accept ALTER ... VALIDATE CONSTRAINT), and to
accept ALTER SET STORAGE and ALTER SET WITH/WITHOUT OIDS.  Continuing to
disallow these things would've required bizarre and inconsistent special
cases in inheritance behavior.  Since foreign tables don't enforce CHECK
constraints anyway, a NOT VALID one is a complete no-op, but that doesn't
mean we shouldn't allow it.  And it's possible that some FDWs might have
use for SET STORAGE or SET WITH OIDS, though doubtless they will be no-ops
for most.
 
An additional change in support of this is that when a ModifyTable node
has multiple target tables, they will all now be explicitly identified
in EXPLAIN output, for example:
 
 Update on pt1  (cost=0.00..321.05 rows=3541 width=46)
   Update on pt1
   Foreign Update on ft1
   Foreign Update on ft2
   Update on child3
   ->  Seq Scan on pt1  (cost=0.00..0.00 rows=1 width=46)
   ->  Foreign Scan on ft1  (cost=100.00..148.03 rows=1170 width=46)
   ->  Foreign Scan on ft2  (cost=100.00..148.03 rows=1170 width=46)
   ->  Seq Scan on child3  (cost=0.00..25.00 rows=1200 width=46)
 
This was done mainly to provide an unambiguous place to attach "Remote SQL"
fields, but it is useful for inherited updates even when no foreign tables
are involved.
 
Shigeru Hanada and Etsuro Fujita, reviewed by Ashutosh Bapat and Kyotaro
Horiguchi, some additional hacking by me

Initially I almost dismissed it. Another cool feature for hard-core nerds. But then it hit me: this is really big. Why? Read on …

Let's see how to get it working, and what we can use it for 🙂

I'll make myself 5 databases, named master, shard_1, shard_2, shard_3, and shard_4.

$ echo master shard_{1,2,3,4} | xargs -n1 createdb

Of course, in a real-life scenario the databases would be on separate machines/instances, but this is just a test environment, so it will have to do.

Now, in each of these databases, I'll make a poor man's sharded users table:

for a in {1..4}
do
    echo "
        create table users (id serial primary key, username text not null);
        alter sequence users_id_seq increment by 4 restart with $a;
    " | psql -d shard_$a
done

It creates the users table, but in each shard the sequence will generate different ids. In shard 1 they will be 1, 5, 9, …, in shard 2 they will be 2, 6, 10, …, and so on.
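This interleaving can be sanity-checked without a database – each shard's sequence simply emits an arithmetic progression (a quick sketch using seq):

```shell
# Shard a yields ids a, a+4, a+8, ... so the four shards jointly
# cover all positive integers without collisions.
for a in {1..4}
do
    echo "shard_$a: $(seq -s ' ' $a 4 20)"
done
```

which prints shard_1: 1 5 9 13 17, shard_2: 2 6 10 14 18, and so on.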

With the tables in place, let's make a master table and make all the others inherit from it.

$ create table users (id serial primary key, username text not null);
$ create extension postgres_fdw;
$ create server shard_1 foreign data wrapper postgres_fdw options( dbname 'shard_1' );
$ create server shard_2 foreign data wrapper postgres_fdw options( dbname 'shard_2' );
$ create server shard_3 foreign data wrapper postgres_fdw options( dbname 'shard_3' );
$ create server shard_4 foreign data wrapper postgres_fdw options( dbname 'shard_4' );
$ create user mapping for depesz server shard_1 options ( user 'depesz' );
$ create user mapping for depesz server shard_2 options ( user 'depesz' );
$ create user mapping for depesz server shard_3 options ( user 'depesz' );
$ create user mapping for depesz server shard_4 options ( user 'depesz' );
$ create foreign table users_shard_1 () INHERITS (users) server shard_1 options ( table_name 'users' );
$ create foreign table users_shard_2 () INHERITS (users) server shard_2 options ( table_name 'users' );
$ create foreign table users_shard_3 () INHERITS (users) server shard_3 options ( table_name 'users' );
$ create foreign table users_shard_4 () INHERITS (users) server shard_4 options ( table_name 'users' );

Tedious. But it can be made more automatic.
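For example, a small shell loop could generate all of that repetitive DDL (a sketch – adjust the role name and connection options to your own setup before piping the output to psql -d master):

```shell
# Emit the server, user mapping, and foreign table DDL for each shard.
for a in {1..4}
do
    cat <<SQL
create server shard_$a foreign data wrapper postgres_fdw options ( dbname 'shard_$a' );
create user mapping for depesz server shard_$a options ( user 'depesz' );
create foreign table users_shard_$a () inherits (users) server shard_$a options ( table_name 'users' );
SQL
done
```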

Now, with this in place, I should be able to select, insert, update, and delete rows directly in the shards:

[master] $ insert into users_shard_1 (username) values ('Alice') returning *;
 id | username 
----+----------
  1 | Alice
(1 row)
 
INSERT 0 1
 
[master] $ insert into users_shard_1 (username) values ('Bob') returning *;
 id | username 
----+----------
  2 | Bob
(1 row)
 
INSERT 0 1
 
[master] $ insert into users_shard_2 (username) values ('Carol') returning *;
 id | username 
----+----------
  3 | Carol
(1 row)
 
INSERT 0 1
 
[master] $ insert into users_shard_2 (username) values ('Dan') returning *;
 id | username 
----+----------
  4 | Dan
(1 row)
 
INSERT 0 1
 
[master] $ insert into users_shard_3 (username) values ('Eve') returning *;
 id | username 
----+----------
  5 | Eve
(1 row)
 
INSERT 0 1
 
[master] $ insert into users_shard_3 (username) values ('Frank') returning *;
 id | username 
----+----------
  6 | Frank
(1 row)
 
INSERT 0 1
 
[master] $ insert into users_shard_4 (username) values ('George') returning *;
 id | username 
----+----------
  7 | George
(1 row)
 
INSERT 0 1
 
[master] $ insert into users_shard_4 (username) values ('Harold') returning *;
 id | username 
----+----------
  8 | Harold
(1 row)

Immediately I can see a problem – my sequences on the shards do not give the ids I thought they would. The reason is simple – the ID is assigned on master, by the parent table's sequence. So, to make it work sensibly, I have to make 4 sequences on master and do the magic there. Given that values up to 8 are already taken, I have to change the start values too:

$ create sequence users_id_seq_shard_1 increment by 4 start with 9;
$ create sequence users_id_seq_shard_2 increment by 4 start with 10;
$ create sequence users_id_seq_shard_3 increment by 4 start with 11;
$ create sequence users_id_seq_shard_4 increment by 4 start with 12;
$ alter foreign table users_shard_1 alter COLUMN id set default nextval('users_id_seq_shard_1');
$ alter foreign table users_shard_2 alter COLUMN id set default nextval('users_id_seq_shard_2');
$ alter foreign table users_shard_3 alter COLUMN id set default nextval('users_id_seq_shard_3');
$ alter foreign table users_shard_4 alter COLUMN id set default nextval('users_id_seq_shard_4');

And now new inserts will have ids assigned correctly:

[master] $ insert into users_shard_1 (username) values ('ian'), ('james') returning *;
 id | username 
----+----------
  9 | ian
 13 | james
(2 rows)
 
INSERT 0 2
 
[master] $ insert into users_shard_2 (username) values ('kenneth'), ('larry') returning *;
 id | username 
----+----------
 10 | kenneth
 14 | larry
(2 rows)
 
INSERT 0 2
 
[master] $ insert into users_shard_3 (username) values ('michael'), ('nicholas') returning *;
 id | username 
----+----------
 11 | michael
 15 | nicholas
(2 rows)
 
INSERT 0 2
 
[master] $ insert into users_shard_4 (username) values ('oscar'), ('paul') returning *;
 id | username 
----+----------
 12 | oscar
 16 | paul
(2 rows)
 
INSERT 0 2

Better.

Now, thanks to this new patch, I can scan them all together:

[master] $ explain analyze select count(*) from users;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=883.69..883.70 rows=1 width=0) (actual time=3.392..3.392 rows=1 loops=1)
   ->  Append  (cost=0.00..849.56 rows=13653 width=0) (actual time=0.880..3.381 rows=16 loops=1)
         ->  Seq Scan on users  (cost=0.00..0.00 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
         ->  Foreign Scan on users_shard_1  (cost=100.00..212.39 rows=3413 width=0) (actual time=0.878..0.879 rows=4 loops=1)
         ->  Foreign Scan on users_shard_2  (cost=100.00..212.39 rows=3413 width=0) (actual time=0.864..0.864 rows=4 loops=1)
         ->  Foreign Scan on users_shard_3  (cost=100.00..212.39 rows=3413 width=0) (actual time=0.768..0.769 rows=4 loops=1)
         ->  Foreign Scan on users_shard_4  (cost=100.00..212.39 rows=3413 width=0) (actual time=0.857..0.860 rows=4 loops=1)
 Planning time: 0.292 ms
 Execution time: 4.871 ms
(9 rows)

Delete rows from them:

[master] $ delete from users where random() < 0.3 returning *;
 id | username 
----+----------
 10 | kenneth
 15 | nicholas
  7 | George
 12 | oscar
(4 rows)

and of course update them:

[master] $ explain analyze update users set username = lower(username) where username <> lower(username) returning *;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Update on users  (cost=0.00..625.61 rows=4941 width=42) (actual time=0.775..2.344 rows=7 loops=1)
   Update on users
   Foreign Update on users_shard_1
   Foreign Update on users_shard_2
   Foreign Update on users_shard_3
   Foreign Update on users_shard_4
   ->  Seq Scan on users  (cost=0.00..0.00 rows=1 width=42) (actual time=0.001..0.001 rows=0 loops=1)
         Filter: (username <> lower(username))
   ->  Foreign Scan on users_shard_1  (cost=100.00..156.40 rows=1235 width=42) (actual time=0.507..0.510 rows=2 loops=1)
   ->  Foreign Scan on users_shard_2  (cost=100.00..156.40 rows=1235 width=42) (actual time=0.313..0.315 rows=2 loops=1)
   ->  Foreign Scan on users_shard_3  (cost=100.00..156.40 rows=1235 width=42) (actual time=0.294..0.296 rows=2 loops=1)
   ->  Foreign Scan on users_shard_4  (cost=100.00..156.40 rows=1235 width=42) (actual time=0.261..0.262 rows=1 loops=1)
 Planning time: 0.135 ms
 Execution time: 2.898 ms
(14 rows)

Now, you have to notice that sharding in this way doesn't really make the queries faster – all foreign scans are done sequentially.

But, assuming a somewhat larger dataset:

[master] $ insert into users_shard_1 (username) select 'random user ' || i from generate_series(1,100000) i;
INSERT 0 100000
 
[master] $ insert into users_shard_2 (username) select 'random user ' || i from generate_series(1,100000) i;
INSERT 0 100000
 
[master] $ insert into users_shard_3 (username) select 'random user ' || i from generate_series(1,100000) i;
INSERT 0 100000
 
[master] $ insert into users_shard_4 (username) select 'random user ' || i from generate_series(1,100000) i;
INSERT 0 100000

we can check how it does with fast, single-row queries:

[master] $ explain analyze select * from users where id = 123124;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..508.81 rows=29 width=36) (actual time=4.270..4.271 rows=1 loops=1)
   ->  Seq Scan on users  (cost=0.00..0.00 rows=1 width=36) (actual time=0.001..0.001 rows=0 loops=1)
         Filter: (id = 123124)
   ->  Foreign Scan on users_shard_1  (cost=100.00..127.20 rows=7 width=36) (actual time=1.020..1.020 rows=0 loops=1)
   ->  Foreign Scan on users_shard_2  (cost=100.00..127.20 rows=7 width=36) (actual time=1.247..1.247 rows=0 loops=1)
   ->  Foreign Scan on users_shard_3  (cost=100.00..127.20 rows=7 width=36) (actual time=1.004..1.004 rows=0 loops=1)
   ->  Foreign Scan on users_shard_4  (cost=100.00..127.20 rows=7 width=36) (actual time=0.993..0.994 rows=1 loops=1)
 Planning time: 0.349 ms
 Execution time: 6.070 ms
(9 rows)

This doesn't look so good. But PostgreSQL has built-in partitioning (constraint exclusion), which we can now put to great use:

[master] $ alter foreign table  users_shard_1 add check ( id % 4 = 1 );
ALTER FOREIGN TABLE
 
[master] $ alter foreign table  users_shard_2 add check ( id % 4 = 2 );
ALTER FOREIGN TABLE
 
[master] $ alter foreign table  users_shard_3 add check ( id % 4 = 3 );
ALTER FOREIGN TABLE
 
[master] $ alter foreign table  users_shard_4 add check ( id % 4 = 0 );
ALTER FOREIGN TABLE

and with this in place:

[master] $ explain analyze select * from users where id = 123124 and (id % 4) = (123124 % 4);
                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..133.91 rows=2 width=36) (actual time=0.905..0.907 rows=1 loops=1)
   ->  Seq Scan on users  (cost=0.00..0.00 rows=1 width=36) (actual time=0.002..0.002 rows=0 loops=1)
         Filter: ((id = 123124) AND ((id % 4) = 0))
   ->  Foreign Scan on users_shard_4  (cost=100.00..133.91 rows=1 width=36) (actual time=0.902..0.903 rows=1 loops=1)
 Planning time: 0.448 ms
 Execution time: 1.451 ms
(6 rows)

This is much nicer.

Only two "tables" got scanned – users, the empty parent of all shards, and the single shard that we needed.

If I had used the partition/sharding key differently (ranges, for example), PostgreSQL could even have been smart enough to pick the right shard without guiding it with (id % 4) = (123124 % 4).
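For instance, with hypothetical range-based shards (table names and boundaries invented here purely for illustration), a plain id comparison would be enough for constraint exclusion:

```sql
-- Illustration only: range checks instead of the modulus scheme.
alter foreign table users_range_1 add check ( id >= 1      and id < 100000 );
alter foreign table users_range_2 add check ( id >= 100000 and id < 200000 );
-- A query like: select * from users where id = 123124
-- would then prune straight to users_range_2, no (id % 4) hint needed.
```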

But this little thing doesn't really spoil the fun. Lo and behold, it looks like we have a sharding solution that will actually work (though of course my choice of database names did spoil the fun a bit).

With some simple triggers on the users table (to redirect inserts to partitions/shards) we'll be golden.
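Such a redirecting trigger might look roughly like this (an untested sketch, assuming the modulus scheme above; note that the parent's own id default would also need replacing with something shard-aware):

```sql
-- Route every insert on the parent table to the shard matching id modulo 4.
create function users_insert_router() returns trigger as $$
begin
    case new.id % 4
        when 1 then insert into users_shard_1 values (new.*);
        when 2 then insert into users_shard_2 values (new.*);
        when 3 then insert into users_shard_3 values (new.*);
        else        insert into users_shard_4 values (new.*);
    end case;
    return null;  -- row already stored in a shard; skip the parent
end;
$$ language plpgsql;

create trigger users_insert_router
    before insert on users
    for each row execute procedure users_insert_router();
```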

The only thing missing is some way to run the foreign scans in parallel (when there is more than one). Maybe it will materialize one day; we'll see.

A great THANK YOU goes to all authors, reviewers, and editors. This is a really huge thing.

Comments:

  2. Apr 3, 2015

    Yes, this is very big.

    Isn't Mr. Haas working on parallel scans with custom background workers?

    This would be seriously major if parallel FDW scans could happen sooner than later, because without parallel scans most “big data” problems would still not be fun to solve with postgres due to the single CPU limitation and the sequential FDW access issue.

    get over that and a whole world of problems just became solvable with postgres.

  3. # RobJ
    Apr 3, 2015

    Your first group of inserts (Alice, Bob et al) assigned the wrong IDs, and the second group of inserts (ian, james and company) assigned the correct IDs. You described how to fix it, but didn’t include the actual commands to do that. Could you show those commands?

    Thanks for this post.

  4. Apr 3, 2015

    @RobJ:

    Sorry, no idea how I missed that. Fixed.

  5. # RobJ
    Apr 3, 2015

    Thanks for fixing that so quickly! I understand now how you did it – the foreign tables inherit the default value from the parent’s sequence when they are created, but then you can change the default value to use the new, properly-spaced and initialized sequences.

    You’re right, this feature is HUGE.

  6. Apr 8, 2015

    Yes, Robert Haas is working on parallel scans, and this will be used for parallel foreign table scans.

  7. # Sylvain R
    Apr 14, 2015

    Hi,

    I made a version of this with a trigger which automatically spreads the data across the shards: https://gist.github.com/sylr/623bab09edd04d53ee4e

    Hope it’ll help someone.

    Cheers.

  8. # Mirosław
    Apr 25, 2015

    It says: Now, in each of these tables
    It should be: Now, in each of these databases

  9. Apr 27, 2015

    @Mirosław:
    thanks, fixed.

  10. Apr 30, 2015

    Hi,

    I tried this with a little variation-

    postgres=# CREATE FOREIGN TABLE log_entry_y2015_f (log_time timestamp,
    entry text, constraint check_15_f check (log_time = '2015-01-01'))
    INHERITS (log_entries) SERVER log OPTIONS (filename '/home/postgres/log_entries.csv', format 'csv');
    NOTICE: merging column "log_time" with inherited definition
    NOTICE: merging column "entry" with inherited definition
    CREATE FOREIGN TABLE

    postgres=# CREATE FOREIGN TABLE log_entry_y2014_f (log_time timestamp,
    entry text, constraint check_for check (log_time = '2014-01-01'))
    INHERITS (log_entries) SERVER server_5433 OPTIONS (table_name 'log_entry_y2014_f');
    NOTICE: merging column "log_time" with inherited definition
    NOTICE: merging column "entry" with inherited definition
    CREATE FOREIGN TABLE

    and then I tried a select operation-

    postgres=# explain select * from log_entries where log_time = '2014-04-29 19:09:20.841275';
                                      QUERY PLAN
    ----------------------------------------------------------------------------------------------
     Append  (cost=0.00..126.12 rows=7 width=40)
       ->  Seq Scan on log_entries  (cost=0.00..0.00 rows=1 width=40)
             Filter: (log_time = '2014-04-29 19:09:20.841275'::timestamp without time zone)
       ->  Foreign Scan on log_entry_y2014_f  (cost=100.00..126.12 rows=6 width=40)
    (4 rows)

    postgres=# explain select * from log_entries where log_time >= '2015-04-29 19:09:20.841275';
                                      QUERY PLAN
    -----------------------------------------------------------------------------------------------
     Append  (cost=0.00..1.31 rows=2 width=34)
       ->  Seq Scan on log_entries  (cost=0.00..0.00 rows=1 width=40)
             Filter: (log_time >= '2015-04-29 19:09:20.841275'::timestamp without time zone)
       ->  Foreign Scan on log_entry_y2015_f  (cost=0.00..1.31 rows=1 width=27)
             Filter: (log_time >= '2015-04-29 19:09:20.841275'::timestamp without time zone)
             Foreign File: /home/postgres/log_entries.csv
             Foreign File Size: 139
    (7 rows)

    So that does partition pruning plus I can have my data spread across different ‘types’ of foreign servers (and not just different foreign servers).

    I think this is a great feature!

    Regards
    Sameer

  11. Apr 30, 2015

    Oops! Looks like I commented in the wrong post. I actually borrowed the example from here- http://michael.otacoo.com/postgresql-2/postgres-9-5-feature-foreign-table-inheritance/

  12. # Mah
    Oct 18, 2015

    This feature is just coming at the right time 🙂
    Hope the performance would be good as well.

  13. # Nikos Baliktsioglou
    Nov 4, 2015

    This is a great feature, but the thing that immediately comes to my mind when we are talking about real installations in the field is: what about High Availability? We would have the master DB, to which all queries are sent, which is a single point of failure, and a number of slave machines that actually hold data. As I see it, each of the DB servers in the shard will need to be replicated to ensure that data is safe. This essentially means that for the number of DB servers in the shard, we will need an equal number of standby machines… and managing that number of clusters will be very demanding, to say the least. So, I think that for this feature to be meaningful as a read sharding solution, somehow High Availability / Replication of data must be taken into account to work together with this feature. Of course I may be missing something here, so please send me your thoughts 🙂
