Should you use HASH index?

Today, Mattias|farm on IRC asked how to create a primary key using a HASH index. After some talk, he said that some books claim that for "=" (equality) searches, hash indexes are better.

So, I dug a bit deeper.

IMPORTANT UPDATE: As of PostgreSQL 10, hash indexes are WAL-logged. As such, the main point against them is gone.

First of all, there is this thread on the pgsql-general mailing list (thanks to omarqureshi for the link). Then there are the docs, which say:

Note: Hash index operations are not presently WAL-logged, so hash indexes might need to be rebuilt with REINDEX after a database crash. For this reason, hash index use is presently discouraged.

Based on the same document, we know that a hash index supports only the "=" operation, while B-Tree (the most natural, and default, index type) also supports <, <=, >= and > – which comes in really handy, even for text-based searches.
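You can see the difference in practice with EXPLAIN (a sketch – the table t with a text column v is hypothetical, just for illustration):

CREATE INDEX i_btree ON t USING btree (v);
CREATE INDEX i_hash  ON t USING hash  (v);

EXPLAIN SELECT * FROM t WHERE v = 'abc';  -- either index supports this operator
EXPLAIN SELECT * FROM t WHERE v > 'abc';  -- only the B-Tree supports ">"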

So far, we can see that HASH indexes have less functionality, and they are not as safe (given the lack of WAL logging, I would suspect that they also don't get updated on WAL slaves, but I might be wrong here).

The only thing left is speed. The thread I mentioned earlier states that they are slower, but let's check the facts.

$ CREATE TABLE test ( string text );
CREATE TABLE

And now let's put some data in there.

Since I want a couple of million rows, I can't really use any dictionary, so let's generate the strings in Perl:

#!/usr/bin/perl -w
use strict;

# Print 10 million random strings, one per line.
print generate_random_string() . "\n" for 1 .. 10_000_000;

exit;

# A "string" is 2-5 random words joined with spaces.
sub generate_random_string {
    my $word_count = 2 + int(rand() * 4);
    return join(' ', map { generate_random_word() } 1..$word_count);
}

# A "word" is 3-7 random alphanumeric characters.
sub generate_random_word {
    my $len = 3 + int(rand() * 5);
    my @chars = ( "a".."z", "A".."Z", "0".."9" );
    my @word_chars = map { $chars[rand @chars] } 1..$len;
    return join '', @word_chars;
}

With this I made myself a nice 10M-line file (as it happened, all lines turned out to be unique), which I could then load into Pg:

$ copy test FROM '/tmp/words.lst';
COPY 10000000

Now I need to test index creation speed. For this, I will create 4 different indexes:

  • B-Tree
  • Unique B-Tree
  • Hash
  • Unique Hash

And check the creation time. To get a somewhat sensible answer, I will build each of the indexes 3 times, and present all the timings.
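The commands boil down to this (a sketch – the index names are my own, chosen to match the results table below; \timing provides the numbers):

$ \timing on
$ CREATE INDEX plain_btree ON test USING btree (string);
$ DROP INDEX plain_btree;
$ CREATE UNIQUE INDEX unique_btree ON test (string);
$ DROP INDEX unique_btree;
$ CREATE INDEX plain_hash ON test USING hash (string);
$ DROP INDEX plain_hash;
$ CREATE UNIQUE INDEX unique_hash ON test USING hash (string);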

While doing the test I got:

ERROR:  access method "hash" does not support unique indexes

OK, so you can't make unique hash indexes. What about the times?

index        |       min |       avg |       max
-------------+-----------+-----------+-----------
plain_btree  | 81681.452 | 83055.347 | 85590.522
plain_hash   | 22879.853 | 23502.960 | 24549.948
unique_btree | 81000.509 | 82097.202 | 83032.581

(all times in milliseconds, as reported by \timing)

That's pretty interesting – index creation is actually much faster for hash than for B-Tree. This could be partly because the hash index is smaller (256MB vs. 394MB), so maintenance_work_mem was less of a limit during creation – but whatever the reason, it's simply faster.
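If you want to check the sizes yourself, pg_relation_size() does it (using the index names from the sketch above):

$ SELECT pg_size_pretty( pg_relation_size( 'plain_hash' ) );
$ SELECT pg_size_pretty( pg_relation_size( 'plain_btree' ) );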

How about queries?

We need 2 types of queries:

  • query for something that is in set
  • query for something that is not in set

To make the tests as quick as possible, I created 2 copies of the base table, renamed them, and created the appropriate indexes (a sketch of the commands follows the table listings below):

$ \d test_hash 
 Table "public.test_hash"
 Column | Type | Modifiers 
--------+------+-----------
 string | text | 
Indexes:
    "i_hash" hash (string)
 
$ \d test_btree 
 Table "public.test_btree"
 Column | Type | Modifiers 
--------+------+-----------
 string | text | 
Indexes:
    "i_btree" btree (string)
 
$ \d test_unique_btree 
Table "public.test_unique_btree"
 Column | Type | Modifiers 
--------+------+-----------
 string | text | 
Indexes:
    "i_unique_btree" btree (string)

And again, a simple Perl script:

#!/usr/bin/perl -w
use strict;
use Benchmark qw( cmpthese );
use DBI;
 
my @existing_strings = (
    'iMGd8w VWGbR 98ssaUF',
    'QRK aBiC bMchM',
    'ilbD 4hlj30 cGg',
    'kWM PXQU lJVK',
    '6QlVY vbqthh aE2Rmy',
    'U502 gvKbY wRA4d9',
    '2ATzccg vf8Xt uhiCbph',
    'fXn474 f8M 7Pk4CQd',
    'k2Umnl n5kfnEJ sa0m',
    'JnzKYZ w1c bpt',
);
 
my @nonexisting_strings = (
    '3zwy PJe 34P',
    'pqo6d n6j 5tV',
    'XEyoX ciKRh0d iJrSS',
    'pa3r29P V16Gvgy ayEv',
    'TB0f9 5TE wbPGI',
    'T0btpG aK6RhF x1mXlo',
    '2T2 xdMf6kb WUuZi',
    'dPV xipP4 azF',
    't3IC Qjbtx5V H86',
    'XLyiZa6 Wj1 XKKz',
);
 
my $dbh = DBI->connect(
    'dbi:Pg:dbname=depesz;port=5900', 'depesz', undef,
    {
        'AutoCommit' => 1,
        'RaiseError' => 1,
        'PrintError' => 1,
    }
);
 
print "Testing existing keys:\n";
cmpthese(
    50000,
    {
        'existing_hash'            => sub { fetch_from( 'test_hash',         \@existing_strings ); },
        'existing_btree'           => sub { fetch_from( 'test_btree',        \@existing_strings ); },
        'existing_unique_btree'    => sub { fetch_from( 'test_unique_btree', \@existing_strings ); },
    }
);
 
print "\nTesting nonexisting keys:\n";
cmpthese(
    50000,
    {
        'nonexisting_hash'         => sub { fetch_from( 'test_hash',         \@nonexisting_strings ); },
        'nonexisting_btree'        => sub { fetch_from( 'test_btree',        \@nonexisting_strings ); },
        'nonexisting_unique_btree' => sub { fetch_from( 'test_unique_btree', \@nonexisting_strings ); },
    }
);
 
exit;
 
# Fetch rows matching any of the given values - one round trip per 10 values.
sub fetch_from {
    my ( $table, $values ) = @_;
    my $sql = 'SELECT * FROM ' . $table . ' WHERE string = ANY( ? )';
    my $rows = $dbh->selectall_arrayref( $sql, undef, $values );
    return;
}

And here are the (somewhat unexpected) results:

Testing existing keys:
                        Rate existing_btree  existing_hash existing_unique_btree
existing_btree        6203/s             --            -0%                   -2%
existing_hash         6219/s             0%             --                   -1%
existing_unique_btree 6305/s             2%             1%                    --
 
Testing nonexisting keys:
                           Rate nonexisting_btree nonexisting_unique_btree nonexisting_hash
nonexisting_btree        7519/s                --                      -7%             -13%
nonexisting_unique_btree 8078/s                7%                       --              -6%
nonexisting_hash         8606/s               14%                       7%               --

This is after 50,000 selects for each method. Each select was working with 10 different strings.

Now, that's interesting. It seems that for non-existing keys the hash index is 6-14% faster than B-Tree.
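If you want to verify that these queries actually use the indexes (and don't just seq scan 10M rows), a quick EXPLAIN shows it (a sketch; plan output omitted):

$ EXPLAIN SELECT * FROM test_hash WHERE string = ANY( '{abc,def}' );
$ EXPLAIN SELECT * FROM test_btree WHERE string = ANY( '{abc,def}' );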

The last thing to check is how they behave under concurrent writes.

To test it, I will insert 100,000 new strings into each of the tables, in 10 threads, each thread inserting 10k rows. Like this:

/bin/ls -c1 data.0[0-9] | xargs -d "\n" -I_FILE_ -P10 psql -c '\copy test_hash from _FILE_'

Times:

  • hash: 1m42.595s
  • normal btree: 1m53.631s
  • unique btree: 1m55.463s

Summary:

Hash indexes are actually a bit faster (not much, though) than B-Tree indexes. They are also smaller. Currently there are 2 problems with them:

  • you cannot make a unique hash index
  • they are not WAL-logged

The first of these problems – well, to be honest, I don't know why it's still there – there was a ready patch back in 2003 that was never applied (or it was, but the functionality got removed later).

The second problem is a much bigger one, and (especially given hot standby and streaming replication in 9.0) it is the reason not to use hash indexes.

7 thoughts on “Should you use HASH index?”

  1. Great article depesz, just what I wanted =) Great benchmarks and great feedback both on IRC and through this article.

  2. It's supposed to be much quicker on 9.0. I also blogged a while back about hashes, but done on blobs and such, to speed things up (I used md5 as the ‘hash’).

  3. You can make a hash-based UNIQUE index in 9.0 with exclusion constraints.

    CREATE TABLE foo (
        t TEXT NOT NULL,
        EXCLUDE USING hash (t WITH =)
    );

    Whether it’s actually worth it in terms of speed is a different question, but yes, you can do it.
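    To illustrate, a duplicate insert then fails much like with a unique constraint (the constraint name in the error is whatever PostgreSQL auto-generated; "foo_t_excl" is my guess at the default):

    INSERT INTO foo (t) VALUES ('abc');
    INSERT INTO foo (t) VALUES ('abc');
    ERROR:  conflicting key value violates exclusion constraint "foo_t_excl"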

  4. Can you try to get around the Perl round-trip time in the tests? I would believe that it is a significant factor in the equation.

    Jesper

  5. @David:
    true, I absolutely forgot about exclusion constraints. Good point.

    @Jesper:
    theoretically I could, but I don’t really think it’s all that important. Reason: the round trip is there for all access methods – i.e. it should “punish” all indexes in the same way. Also – it’s not that bad – there is 1 round trip for every 10 values.
