Performance gains from using foreign keys

Foreign keys are known for a couple of things, but speeding up your system is not one of them. Sometimes, though, having them in place lets you make queries significantly faster.

How? Let me show you an example I ran into recently (well, it's a simplified example based on something much more convoluted, and definitely longer):

There was this database, and in it there were products:

                            Table "public.products"
   Column    |  Type   |                       Modifiers                       
-------------+---------+-------------------------------------------------------
 id          | integer | not null default nextval('products_id_seq'::regclass)
 codename    | text    | 
 description | text    | 
...
Indexes:
    "products_pkey" PRIMARY KEY, btree (id)
...

and sales:

                                       Table "public.sales"
      Column      |           Type           |                     Modifiers                      
------------------+--------------------------+----------------------------------------------------
 id               | integer                  | not null default nextval('sales_id_seq'::regclass)
 sold_object_type | text                     | 
 sold_object_id   | integer                  | 
 sold_when        | timestamp with time zone | 
...
Indexes:
    "sales_pkey" PRIMARY KEY, btree (id)
    "sales_sold_when_idx" btree (sold_when)
...

The important part was that someone, at some distant point in the past, decided to use “polymorphic foreign keys”, which are basically application-enforced keys that can point to different places.

In our case we could sell products, services, and probably some other types of objects.

Depending on the value in sold_object_type, sold_object_id points to a different table.
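
To make that concrete, here is a sketch of how an application would resolve a single sale, picking the join target based on sold_object_type (the services table is my assumption; its definition isn't shown in this post):

-- sold_object_type = 'product': sold_object_id is supposed to match products.id
SELECT p.*
FROM sales s
JOIN products p ON p.id = s.sold_object_id
WHERE s.id = $1 AND s.sold_object_type = 'product';

-- sold_object_type = 'service': sold_object_id is supposed to match services.id
-- (services is assumed here, by analogy with products)
SELECT sv.*
FROM sales s
JOIN services sv ON sv.id = s.sold_object_id
WHERE s.id = $1 AND s.sold_object_type = 'service';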

It was all working nice and cool, until there was a request: we need a report that shows how many different products were sold between two given timestamps.

Because we were not sure that the values in sold_object_id were sensible (after a couple of months (or years?) there were multiple applications, running from different places, doing different things, written in various programming languages, so the application logic enforcing these keys could have been circumvented), the query had to be:

SELECT
    COUNT( DISTINCT p.id )
FROM
    sales s
    JOIN products p ON s.sold_object_id = p.id
WHERE
    s.sold_object_type = 'product'
    AND sold_when BETWEEN $1 AND $2;

While it looks perfectly sensible, it is unfortunately rather slow. Generally we were able to get the query to run with a merge join (scan sales using the sold_when index, sort the result by sold_object_id, and merge it with the rows returned by a whole-table index scan of products) or a hash join (scan sales using the sold_when index, seq scan products into a hash, and join).

Both of these were rather slow, mostly because the number of products was several million, and the number of sales was also quite high.

Let's make some test data, and run explains on it:
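
The table listings above are trimmed (note the “...” lines), so here is a minimal schema sketch (assumed, reconstructed from the column listings and reduced to the columns used here) that the inserts below can be loaded into:

CREATE TABLE products (
    id          serial PRIMARY KEY,
    codename    text,
    description text
);

CREATE TABLE sales (
    id               serial PRIMARY KEY,
    sold_object_type text,
    sold_object_id   integer,
    sold_when        timestamptz
);
CREATE INDEX sales_sold_when_idx ON sales (sold_when);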

$ INSERT INTO products (id, codename) SELECT i, i FROM generate_series(1,5000000) i;
INSERT 0 5000000
 
$ INSERT INTO sales (sold_object_type, sold_object_id, sold_when)
    SELECT
        CASE WHEN random() < 0.9 THEN 'product' ELSE 'service' END,
        CAST(random() * 5100000 AS int4),
        now() - '3 years'::INTERVAL * random()
    FROM generate_series(1,10000000);
INSERT 0 10000000
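
Note that random() * 5100000 intentionally goes above the 5,000,000 products, so a small fraction of the 'product' rows point at products that do not exist, just like on the real system, where the application-level enforcement had been circumvented. If you want to count those dangling rows, a query along these lines (my addition, not from the original report) will do it:

SELECT COUNT(*)
FROM sales s
WHERE s.sold_object_type = 'product'
    AND NOT EXISTS (SELECT 1 FROM products p WHERE p.id = s.sold_object_id);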

Hash join:

                                                                                QUERY PLAN                                                                                 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=262146.71..262146.72 rows=1 width=4) (actual time=4007.942..4007.942 rows=1 loops=1)
   ->  Hash Join  (cost=166198.45..261490.83 rows=262349 width=4) (actual time=2084.606..3808.164 rows=245274 loops=1)
         Hash Cond: (s.sold_object_id = p.id)
         ->  Bitmap Heap Scan on sales s  (cost=7139.45..75930.79 rows=262349 width=4) (actual time=434.552..1565.965 rows=250040 loops=1)
               Recheck Cond: ((sold_when >= '2010-09-01 02:00:00+02'::timestamp with time zone) AND (sold_when <= '2010-10-01 01:59:59+02'::timestamp with time zone))
               Filter: (sold_object_type = 'product'::text)
               ->  Bitmap Index Scan on sales_sold_when_idx  (cost=0.00..7073.87 rows=291219 width=0) (actual time=432.240..432.240 rows=277970 loops=1)
                     Index Cond: ((sold_when >= '2010-09-01 02:00:00+02'::timestamp with time zone) AND (sold_when <= '2010-10-01 01:59:59+02'::timestamp with time zone))
         ->  Hash  (cost=77027.00..77027.00 rows=5000000 width=4) (actual time=1648.872..1648.872 rows=5000000 loops=1)
               Buckets: 4096  Batches: 256  Memory Usage: 701kB
               ->  Seq Scan on products p  (cost=0.00..77027.00 rows=5000000 width=4) (actual time=0.009..721.422 rows=5000000 loops=1)
 Total runtime: 4008.004 ms
(12 rows)

and merge join:

                                                                                   QUERY PLAN                                                                                    
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=273488.49..273488.50 rows=1 width=4) (actual time=1943.630..1943.631 rows=1 loops=1)
   ->  Merge Join  (cost=99543.81..272832.61 rows=262349 width=4) (actual time=486.449..1900.820 rows=245274 loops=1)
         Merge Cond: (p.id = s.sold_object_id)
         ->  Index Scan using products_pkey on products p  (cost=0.00..156878.80 rows=5000000 width=4) (actual time=0.011..809.598 rows=5000000 loops=1)
         ->  Sort  (cost=99543.68..100199.55 rows=262349 width=4) (actual time=486.414..525.182 rows=245275 loops=1)
               Sort Key: s.sold_object_id
               Sort Method:  quicksort  Memory: 17865kB
               ->  Bitmap Heap Scan on sales s  (cost=7139.45..75930.79 rows=262349 width=4) (actual time=61.938..341.319 rows=250040 loops=1)
                     Recheck Cond: ((sold_when >= '2010-09-01 02:00:00+02'::timestamp with time zone) AND (sold_when <= '2010-10-01 01:59:59+02'::timestamp with time zone))
                     Filter: (sold_object_type = 'product'::text)
                     ->  Bitmap Index Scan on sales_sold_when_idx  (cost=0.00..7073.87 rows=291219 width=0) (actual time=47.084..47.084 rows=277970 loops=1)
                           Index Cond: ((sold_when >= '2010-09-01 02:00:00+02'::timestamp with time zone) AND (sold_when <= '2010-10-01 01:59:59+02'::timestamp with time zone))
 Total runtime: 1945.287 ms
(13 rows)

This might not look like much (1.9s and 4s), but this is just my test environment. On the real server, with its load and its data, the queries took over 30 minutes to finish.

This is exactly the case where having an fkey would help. If we were absolutely sure that every row matches a product (i.e. there are no rows that point to a product which does not exist), we could do:

SELECT
    COUNT( DISTINCT s.sold_object_id )
FROM
    sales s
WHERE
    s.sold_object_type = 'product'
    AND sold_when BETWEEN $1 AND $2;

Which has a much nicer plan:

                                                                             QUERY PLAN                                                                              
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=76586.66..76586.67 rows=1 width=4) (actual time=476.159..476.159 rows=1 loops=1)
   ->  Bitmap Heap Scan on sales s  (cost=7139.45..75930.79 rows=262349 width=4) (actual time=65.518..333.893 rows=250040 loops=1)
         Recheck Cond: ((sold_when >= '2010-09-01 02:00:00+02'::timestamp with time zone) AND (sold_when <= '2010-10-01 01:59:59+02'::timestamp with time zone))
         Filter: (sold_object_type = 'product'::text)
         ->  Bitmap Index Scan on sales_sold_when_idx  (cost=0.00..7073.87 rows=291219 width=0) (actual time=50.586..50.586 rows=277970 loops=1)
               Index Cond: ((sold_when >= '2010-09-01 02:00:00+02'::timestamp with time zone) AND (sold_when <= '2010-10-01 01:59:59+02'::timestamp with time zone))
 Total runtime: 476.796 ms
(7 rows)

Besides the time, it doesn't touch the products table at all, thus limiting I/O usage.

The lesson from this particular case: put your constraints in the database. This makes it possible to write better/faster queries based on facts that are visible at the db level, without having to audit all the apps that use the db and can potentially misbehave.

It is still worth having them in your webapp too, if only to avoid some round-trips with faulty data, but keeping the checks as close to the actual storage as possible is what really pays off.

And as a final note: for this particular case we couldn't do anything (due to the large codebase that would need to be reviewed and potentially modified), but generally I would replace the polymorphic key with several columns (product_id, service_id), plus a check to make sure only one of them is not null.
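
A sketch of what that could look like (the column and constraint names are mine, and the services table is assumed; I'm reading "only one" as "exactly one"):

ALTER TABLE sales ADD COLUMN product_id integer REFERENCES products (id);
ALTER TABLE sales ADD COLUMN service_id integer REFERENCES services (id);  -- services table assumed

-- exactly one of the target columns has to be set for every sale
-- (use <= 1 instead if a sale may reference nothing)
ALTER TABLE sales ADD CONSTRAINT sales_one_target_chk
    CHECK ( (product_id IS NOT NULL)::int + (service_id IS NOT NULL)::int = 1 );

With real foreign keys in place, the report becomes a simple COUNT(DISTINCT product_id) over sales, and both the planner and anyone reading the schema know the ids are valid.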

6 thoughts on “Performance gains from using foreign keys”

  1. What happens if you put an EXISTS on the second query to verify that the ID is valid? This comes up a lot in my work where we unfortunately use A LOT of polymorphic foreign keys.

  2. @Anonymous:
    It would cause a nested loop, with 250k index scans on products. Read: a loooong time to finish.

  3. If polymorphic foreign keys are structured the way I think they are, then the structure could be supported at the DBMS level using something I could call “distributed key constraints” plus “distributed foreign key constraints”; the first is essentially a primary or unique key ranging over multiple tables, so it says that between corresponding columns of several tables, the tuple values must be unique; the second is like a foreign key where the parent table is the union of tables defined by the distributed key. This is not very complicated as I see it and could be useful and perform much better than other implementation methods. An implementation would probably involve an index that covers several tables rather than just one.

  4. @darren duncan: that is the sort of thing we need to handle FK constraints that reference (inheritance-based) partitioned tables

  5. @ Thomas and Darren:

    Wouldn’t they (“distributed key constraints” plus “distributed foreign key constraints”) also be the necessary infrastructure to allow postgres to finally have ASSERTions ?

  6. @Thomas, sure, DKC/DFKC could be used for what you say.
    However, the main case I’m thinking of for them is more from the basic relational modelling point of view with no need to use inheritance or whatever.

    Say we have a product database with books and DVDs, every product has a unique product ID, and books and DVDs have different detail attributes. So you could have a books table and a DVDs table, each with a “product_id” field and potentially all of the other fields differing. So not only would the product_id field be a unique or primary key for each table, but a distributed key would exist saying that every product_id value is unique across the 2 tables. And a foreign key could then target this over-arching product_id as its parent side.

    @Mike Adams, if by “ASSERTions” you mean generic database constraint expressions, then no you don’t need DKCs or DFKCs, and you don’t need ordinary primary or foreign keys either; rather you just need a generic boolean-resulting function which takes one or more database tables as input and returns true if the function thinks they obey its arbitrary criteria; you can define all of the other kinds of constraints with just an expression and they are just shorthands for such; eg, a pk/unique constraint is just compare the cardinality of the key columns before and after removal of duplicates, and if they don’t match, the key is violated.
