Waiting for 8.4 – UNION / INTERSECT / EXCEPT

During last month or so, Tom Lane commited changes in PostgreSQL, which were foundations for adding hash-based versions of popular features.

I already described first such feature – DISTINCT.

Now, there were 3 more commits which were related to this:

All 3 were committed by Tom, on 7th of August. Commit messages:

Teach the system how to use hashing for UNION.  (INTERSECT/EXCEPT will follow,
but seem like a separate patch since most of the remaining work is on the
executor side.)  I took the opportunity to push selection of the grouping
operators for set operations into the parser where it belongs.  Otherwise this
is just a small exercise in making prepunion.c consider both alternatives.
 
As with the recent DISTINCT patch, this means we can UNION on datatypes that
can hash but not sort, and it means that UNION without ORDER BY is no longer
certain to produce sorted output.
Support hashing for duplicate-elimination in INTERSECT and EXCEPT queries.
This completes my project of improving usage of hashing for duplicate
elimination (aggregate functions with DISTINCT remain undone, but that's
for some other day).
 
As with the previous patches, this means we can INTERSECT/EXCEPT on datatypes
that can hash but not sort, and it means that INTERSECT/EXCEPT without ORDER
BY are no longer certain to produce sorted output.
Improve INTERSECT/EXCEPT hashing by realizing that we don't need to make any
hashtable entries for tuples that are found only in the second input: they
can never contribute to the output.  Furthermore, this implies that the
planner should endeavor to put first the smaller (in number of groups) input
relation for an INTERSECT.  Implement that, and upgrade prepunion's estimation
of the number of rows returned by setops so that there's some amount of sanity
in the estimate of which one is smaller.

What it means for average user? Basically union, intersect and except can now work faster.

Since the changes are quite obvious I'll skip specific examples and plans of execution, but I would like to use this space to send personal big THANK YOU to Tom Lane for these changes, as I use set operators a lot, and speedup of them is very welcome 🙂

6 thoughts on “Waiting for 8.4 – UNION / INTERSECT / EXCEPT”

  1. What happened to the usual depesz-style benchmark? You always did this for PostgreSQL performance enhancements. 🙂

  2. @intgr:
    To be honest – I don’t have a pre-8.4 version right now. So I can’t really benchmark 8.4 against anything. And I just got a bit too lazy to install 8.3 just for 5 minutes of benchmarking.

    Sorry. I’ll do better next tme.

  3. Hum, so is the order of ‘UNION ALL’ not garanteed any more with 8.4, or is it just for ‘UNION’ ?

    I’m using “SELECT foo FROM bar WHERE baz UNION ALL SELECT baz2 LIMIT 1” to return a default value if none is in foo, will I have to add explicit ordering ?

  4. @moltonel:
    “union all” – as far as i know *never* guarateed any order.

    union did guarantee, but only in pg <= 8.3. your approach at getting default value looks flawed, as it relies on order of rows from union all. why don't you simply use coalesce?

  5. @depesz:
    Not using COALESCE because I also have NULLS in my table, and want those if they exist. The default value is non-null. I’m on 8.2 right now, so no “NULLS FIRST” available.

    I can fix my method by adding explicit ordering (wont look as nice :p). But if I have missed a smarter way to do this, I’m all ears (I’d rather not create a function for this, though).

  6. Just noticed the followup questions here. The behavior of UNION ALL shouldn’t be changed at all by these patches: it still just appends the results of the second query to the results of the first. What these patches are about is enabling use of hashing to recognize duplicate tuples for the various set-ops that require recognizing that; which is to say, all of them *except* UNION ALL.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.