Waiting for 9.5 – Use abbreviated keys for faster sorting of text datums.

On 19th of January, Robert Haas committed patch:

Use abbreviated keys for faster sorting of text datums.
This commit extends the SortSupport infrastructure to allow operator
classes the option to provide abbreviated representations of Datums;
in the case of text, we abbreviate by taking the first few characters
of the strxfrm() blob.  If the abbreviated comparison is insufficient
to resolve the comparison, we fall back on the normal comparator.
This can be much faster than the old way of doing sorting if the
first few bytes of the string are usually sufficient to resolve the
comparison.
There is the potential for a performance regression if all of the
strings to be sorted are identical for the first 8+ characters and
differ only in later positions; therefore, the SortSupport machinery
now provides an infrastructure to abort the use of abbreviation if
it appears that abbreviation is producing comparatively few distinct
keys.  HyperLogLog, a streaming cardinality estimator, is included in
this commit and used to make that determination for text.
Peter Geoghegan, reviewed by me.
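The mechanism can be sketched in a few lines of Python – a toy model of the idea, not the C implementation from the commit. Compare cheap, fixed-size abbreviations first (a memcmp in C), and call the full comparator only when the abbreviations tie. In the C locale strxfrm() is essentially a memcpy, so raw UTF-8 bytes stand in for the strxfrm() blob here:

```python
from functools import cmp_to_key

def full_cmp(a: str, b: str) -> int:
    """Authoritative comparator (stand-in for strcoll())."""
    return (a > b) - (a < b)

def abbrev(s: str, nbytes: int = 8) -> bytes:
    """Abbreviated key: first nbytes of the sort blob.  In the C locale
    strxfrm() is essentially memcpy, so raw UTF-8 bytes stand in for
    the strxfrm() output in this sketch."""
    return s.encode("utf-8")[:nbytes]

def abbrev_cmp(a: str, b: str) -> int:
    """Compare the cheap fixed-size abbreviations first; only equal
    abbreviations fall back to the full comparator."""
    ka, kb = abbrev(a), abbrev(b)
    if ka != kb:
        return (ka > kb) - (ka < kb)
    return full_cmp(a, b)  # tie: abbreviation was not enough

words = ["zebra", "apple", "applesauce", "banana", "appld"]
print(sorted(words, key=cmp_to_key(abbrev_cmp)))
```

The win comes from most comparisons being decided on the 8-byte abbreviation without ever touching the full strings – which is exactly why strings that are identical for the first 8+ characters gain nothing and pay the abbreviation overhead.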

To be honest, I somehow missed this commit, but I've read Peter's blogpost about it, and decided to check it out.

For starters – this commit will not give you any more functionality. But it should make certain operations faster.

The biggest winner should be index creation, so I figured I'd test that.

Since the change was added to 9.5, I'll compare index creation time on 9.4.0 and 9.5, and we'll see how it works out.

Both Pg servers are set up on the same hardware with the same config, so while there could be some variance in results, it shouldn't be all that big – aside from the effects of the speedup given by the patch.

For tests I used 3 different datasets:

  • short – 1,000,000 random, unique strings, 5 characters long
  • long – 1,000,000 random, unique strings, 80 characters long
  • long_prefixed – 1,000,000 random, unique strings, 80 characters long, but the first 20 characters have only 100 distinct possibilities: 18 zeros followed by 2 digits
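The post doesn't show how the data was generated, but a sketch along these lines (function names and the lowercase-letters alphabet are my own choices) could produce equivalent datasets:

```python
import random
import string

def random_string(n: int) -> str:
    """n random lowercase letters."""
    return "".join(random.choices(string.ascii_lowercase, k=n))

def make_dataset(count: int, length: int, prefixed: bool = False) -> set:
    """Generate `count` unique random strings of `length` characters.
    With prefixed=True, mimic the long_prefixed set: 18 zeros, then
    2 digits (only 100 distinct prefixes), then a random tail."""
    rows = set()
    while len(rows) < count:
        if prefixed:
            s = "0" * 18 + "%02d" % random.randrange(100) \
                + random_string(length - 20)
        else:
            s = random_string(length)
        rows.add(s)  # set membership enforces uniqueness
    return rows
```

From there it's just a matter of writing the strings to a file and loading them with COPY.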

Sample data:

=$ select * from short order by random() limit 10;
(10 rows)
=$ select * from long order by random() limit 10;
(10 rows)
=$ select * from long_prefixed order by random() limit 10;
(10 rows)

To test the speed, I ran:

create index q on _TABLE_ (x);
drop index q;

10 times on each of the tables. Then I took the best CREATE INDEX time (out of these 10 runs; drop time excluded) for comparison. Results:

table:         9.4:           9.5:           difference:
short          3,801.084 ms   1,069.608 ms   -71%
long           7,696.992 ms   4,359.055 ms   -43%
long_prefixed  23,795.578 ms  24,538.389 ms  +3%

So it looks like the speedup is there in two cases, and it's pretty significant. In the one case specifically designed to be the worst-case scenario, we lost ~3% of performance.
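This worst case is exactly what the abort logic from the commit message is for: a HyperLogLog estimator watches how many distinct abbreviated keys show up, and abbreviation is abandoned when there are too few. A minimal sketch of such an estimator (my own toy Python version, not the C code shipped in the commit):

```python
import hashlib
import math

def hll_estimate(values, b: int = 10) -> float:
    """Minimal HyperLogLog: estimate the number of distinct values in
    a single streaming pass, using 2^b one-byte registers."""
    m = 1 << b                       # number of registers
    registers = [0] * m
    for v in values:
        h = int.from_bytes(hashlib.sha1(v.encode()).digest()[:8], "big")
        idx = h >> (64 - b)          # top b bits pick a register
        rest = h & ((1 << (64 - b)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - b) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    est = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if est <= 2.5 * m and zeros:     # small-range (linear counting) correction
        est = m * math.log(m / zeros)
    return est
```

With the long_prefixed data, the first 8 bytes of every abbreviated key are all zeros, so an estimator like this would report a cardinality of 1 – a clear signal that abbreviation is not paying for itself.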

I think it's definitely worth it – after all, if you have 1 million rows and they all share the same first 18 characters, just get rid of the common prefix 🙂

Anyway – great stuff, thanks Peter & Robert.

9 thoughts on “Waiting for 9.5 – Use abbreviated keys for faster sorting of text datums.”

  1. I’m not suggesting this would be a *useful* thing, but what’s the performance on an index on long_prefixed(substr(x,1,10),substr(x,11))?

    I read that memcmp() is used for a quick direct-equality test, and if that’s the case there should be a speedup from that.

  2. Which means we can expect speedups on indexes with very low cardinality. While you’re at it, could you test index on the substr(x,1,10) by itself? I’d ask for one on substr(x,11) but that seems like it’d have the same results as the “long” test.

  3. @Anssi:
    Luckily pg does it correctly, even with 9.5 and abbrev-sorting:

    $ create table t (id serial, n text collate "hu_HU.utf8");
    $ INSERT INTO t VALUES (1,'ab'),(2,'áb'),(3,'az'),(4,'áz'),(5,'comb'),(6,'cukor'),(7,'csak'),(8,'tz'),(9,'tty');
    INSERT 0 9
    $ select * from t order by n asc, id asc;
 id |   n
----+-------
      1 | ab
      2 | áb
      3 | az
      4 | áz
      5 | comb
      6 | cukor
      7 | csak
      8 | tz
      9 | tty
    (9 rows)
    $ select * from t order by n asc, id desc;
 id |   n
----+-------
      1 | ab
      2 | áb
      3 | az
      4 | áz
      5 | comb
      6 | cukor
      7 | csak
      8 | tz
      9 | tty
    (9 rows)
  4. I’m not surprised PostgreSQL does it correctly.

    I was referring to the substr index test case, which I believe doesn’t work correctly.

  5. @puqun: write a short script (one-liner?) that generates the data, and then issue “create table …” and “copy … from …”

  6. @puqun I’ve recently published a [faker_fdw](http://github.com/guedes/faker_fdw) that can help with generating random data for tests. faker_fdw supports locales and can generate texts, names, phone numbers, addresses and a bunch of other things that can be useful when random() is not enough, so it could be another option.

Comments are closed.