Waiting for 9.5 – Use abbreviated keys for faster sorting of text datums.

On 19th of January, Robert Haas committed patch:

Use abbreviated keys for faster sorting of text datums.
 
This commit extends the SortSupport infrastructure to allow operator
classes the option to provide abbreviated representations of Datums;
in the case of text, we abbreviate by taking the first few characters
of the strxfrm() blob.  If the abbreviated comparison is insufficent
to resolve the comparison, we fall back on the normal comparator.
This can be much faster than the old way of doing sorting if the
first few bytes of the string are usually sufficient to resolve the
comparison.
 
There is the potential for a performance regression if all of the
strings to be sorted are identical for the first 8+ characters and
differ only in later positions; therefore, the SortSupport machinery
now provides an infrastructure to abort the use of abbreviation if
it appears that abbreviation is producing comparatively few distinct
keys.  HyperLogLog, a streaming cardinality estimator, is included in
this commit and used to make that determination for text.
 
Peter Geoghegan, reviewed by me.

To be honest, I somehow missed this commit, but I've read Peter's blogpost about it, and decided to check it out.

For starters – this commit will not give you any more functionality. But it should make certain operations faster.

The biggest winner should be index creation, so I figured I'll test it.

Since the change was added to 9.5, I'll compare index creation time on 9.4.0 and 9.5, and we'll see how it works out.

Both Pg servers are setup on the same hardware with the same config, so while there could be some variance in results, it shouldn't be all that big – aside from effects of speedup given by the patch.

For tests I used 3 different datasets:

  • short – 1,000,000 random, unique, strings, 5 characters long
  • long – 1,000,000 random, unique, strings, 80 characters long
  • long_prefixed – 1,000,000 random, unique, strings, 80 characters long, but first 20 characters have only 100 different possibilities, and the prefix is 18 zeros, and then 2 digits

Sample data:

=$ SELECT * FROM short ORDER BY random() LIMIT 10;
   x    
--------
 luASGO
 KANanb
 xLxNyQ
 bGWGuI
 w8xujk
 wN58nP
 9BFQA1
 g7cZBn
 1XAy54
 iVrAXP
(10 ROWS)
 
=$ SELECT * FROM long ORDER BY random() LIMIT 10;
                                        x                                         
----------------------------------------------------------------------------------
 RMdALzft60a7SpM3a4oH70SvJmVMyrwNlUNFKGsqSlXYDX1WvxQbAEix0iHcwiddjriQlfjgeqNxLA7o
 OtyeodMPArSrdf5XnZDUpPv73sQv4bryTGZ0isfnH14UyPRbcbgGotxDe3Gxmj7UE4cF7RNs9B2sejUz
 oRvw19mA4BwkSswUr5ie6EOH9bfiZ7Z924qcyWb2nBGNKRL1xqSYoDMU8h7vIB8SQqKkWW1ajXRH1z2h
 B9h0hw1MFyqE2gPNRujt0VhNjEayd9dzztLW6ArDOvX6f8DCvup5v8zPHuvPE7Ma5OkbsCd7AxLOY0W2
 yK7pL1mIIYDo83ZX1NszRaiVlgTMBi0dbDNglQs88IcXWCKzbVq8m8osgys43vfAcHCHgcyDvp81ZLUR
 fya0uVWSrSMF9j1RReMvDeZDrFl8jRfyFQcyggHTkqPBAKVvAYPugC7AXo6q9GBzcxtVDg6KXCYy4dQd
 RfSqoc3l4a5DoPmFc7djuRPMJOcBsDIUGXPOH1LKJbE0rm7zBYMh1mU6JEaEE3KWHN6b7sB66KPUj4Lv
 8QcL9NQc2Q7uyiUlHN14Bk37sCAxkmdQ9lUcIpqLarFuiIxgCT1Vnl4vs7hMAq4ioZ9vrCwOKrjTGLqp
 sFiTu8IHbIAcOjWTyaxofjKEHAWHd8fqDAztYrMxaaiAYAPLOnODh5bQvFX10p191dhWpVJJnBaizfMx
 DSQvqayewNR1uwxQwT7GyYVo0NubofjGYZywLEOOZxyJHnHCQJn4RfBzWQC4n8B8xmYfxhxFy0u5PDHd
(10 ROWS)
 
=$ SELECT * FROM long_prefixed ORDER BY random() LIMIT 10;
                                        x                                         
----------------------------------------------------------------------------------
 00000000000000000038LQMpDuy15O11PzrWnbaad84GNpAoyOaSkAP9k8U6uAy4J5B1VCkmP81LICyz
 00000000000000000017PgIm4SqFCS2EQE6PjyVanMJGtfh7dte28Bm8Yk43yjH4ut43wFTudcLiUesW
 00000000000000000015R14GKcjKjX7CY22Paoe4DLqaU5Q1B1xJYGDm7QhsiA2ecGf5Mkw3YyqHcmQm
 000000000000000000060MmvCNrOvnzE9tOMmIWWynS7U4Ce5S50TyyQwbmL2srL4gYkXZfLWXS5Vkwp
 00000000000000000093rhUKB05HkCzW4z7z7G2K0kliFKHWp5VvZveRPXr1WqxpQU5B9E0L6yn7bfLc
 000000000000000000277U8V7J8TG4hhHkuEw5JDYx5GjTWdAH8icIFW2rLzTv2n8K9xSiaFYvdRX842
 00000000000000000042cOJ7dX3NEqxvK0wWjkqZ7wEhBE2gUVkWt864vvufvo8BbR8OOHMSFfLl16CG
 00000000000000000032a1otXaIAPYCW2MJC8k4A5L54wJJkNXQx0aWxHQEd2rnuSjihAoshJLQOLsUV
 00000000000000000052peumks0hKU0BojAA4063BMPINg7OblC1R203cdGh20gDMPKdqXrhY1sBlaIc
 00000000000000000089Chw9QVnKR99cbWVw75JXMhWw9QuhygEgn9NxjYgXqJZwLsCcsmCGITTbttdW
(10 ROWS)

To test the speed, I ran:

CREATE INDEX q ON _TABLE_ (x);
DROP INDEX q;

10 times on each of the tables. Then I took the best time (out of these 10, and only time of create index, not dropping) for comparison. Results:

table: 9.4: 9.5: difference:
short 3,801.084 ms 1,069.608 ms – 71%
long 7,696.992 ms 4,359.055 ms – 43%
long_prefixed 23,795.578 ms 24,538.389 ms + 3%

So, it looks that in two cases the speedup is there, and is pretty significant. In one case, specifically designed to be worst case scenario – we lost ~ 3% of the performance.

I think that it's definitely worth if – after all, if you have 1 million rows and they all have the same 18 characters – just get rid of the common prefix 🙂

In any way – great stuff, thanks Peter & Robert.

9 thoughts on “Waiting for 9.5 – Use abbreviated keys for faster sorting of text datums.”

  1. I’m not suggesting this would be a *useful* thing, but what’s the performance on an index on long_prefixed(substr(x,1,10),substr(x,11))?

    I read that memcmp() is used for a quick direct-equality test, and if that’s the case there should be a speedup from that.

  2. Which means we can expect speedups on indexes with very low cardinality. While you’re at it, could you test index on the substr(x,1,10) by itself? I’d ask for one on substr(x,11) but that seems like it’d have the same results as the “long” test.

  3. @Anssi:
    Luckily pg does it correctly, even with 9.5 and abbrev-sorting:

    $ CREATE TABLE t (id serial, n text COLLATE "hu_HU.utf8");
    CREATE TABLE
     
    $ INSERT INTO t VALUES (1,'ab'),(2,'áb'),(3,'az'),(4,'áz'),(5,'comb'),(6,'cukor'),(7,'csak'),(8,'tz'),(9,'tty');
    INSERT 0 9
     
    $ SELECT * FROM t ORDER BY n ASC, id ASC;
     id |   n   
    ----+-------
      1 | ab
      2 | áb
      3 | az
      4 | áz
      5 | comb
      6 | cukor
      7 | csak
      8 | tz
      9 | tty
    (9 ROWS)
     
    $ SELECT * FROM t ORDER BY n ASC, id DESC;
     id |   n   
    ----+-------
      1 | ab
      2 | áb
      3 | az
      4 | áz
      5 | comb
      6 | cukor
      7 | csak
      8 | tz
      9 | tty
    (9 ROWS)
  4. I’m not surprised PostgreSQL does it correctly.

    I was referring to the substr index test case, which I believe doesn’t work correctly.

  5. @puqun: write a short script (one-liner?) that generates the data, and then issue “create table …” and “copy … from …”

  6. @puqun I’ve recently published a [faker_fdw](http://github.com/guedes/faker_fdw) that can helps on generating randomly data for tests. The faker_fdw supports locale e can generate texts, names, phone_number, addresses and a bunch of other things that can be useful when random() is not enough, so it could be another option.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.