July 3rd, 2011 by depesz | Tags: , , , , , | 10 comments »
Did it help? If yes - maybe you can help me? Donate BTC to 19zPa5diT2LZqGtTi8f8bfApLn8rw9zBHx

For todays post in Understanding postgresql.conf series, I chose work_mem parameter.

Documentation describes it as:

Specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files. The value defaults to one megabyte (1MB). Note that for a complex query, several sort or hash operations might be running in parallel; each operation will be allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.

So, it looks pretty well described, but let's see it in some examples.

We know, that work_mem is limit of memory allocated to operation, where the operation is related to sorting or hashes.

Let's start with sorting, to see how it works.

First I will need some test table:

$ create table test (id serial primary key, random_text text );
NOTICE:  CREATE TABLE will create implicit sequence "test_id_seq" for serial column "test.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey" for table "test"
CREATE TABLE

And in it some data:

$ \! perl -e '@c=("a".."z","A".."Z",0..9); print join("",map{$c[rand@c]}10..20+rand(40))."\n" for 1..1000000' > /tmp/random_strings
 
$ copy test (random_text) from '/tmp/random_strings';
COPY

This Perl one-liner prints 1 million of random strings to file. Strings which look like:

$ select * from test limit 5;
 id |                    random_text
----+---------------------------------------------------
  1 | KNDq8noeQ66d51GPEgGwBze9PYePMfC
  2 | DqoZ4vW7TwRI2hOIi27odsbfVuwyAdeeXeql2CfbWrdhlr
  3 | bZIVA3dmjcDp89X0vYRYQKZpVLrsMvO7L6a6fMRS0WKsQ8Cl2
  4 | K5Cc1dOKkL2wh4XJRj5BlQnD3DJGpeVzDyMX
  5 | ON8wCwQNsHewrEzcoHZmBmOncKs2hxrvph9CSVg
(5 rows)

OK. Thanks to id column we will be able to easily limit whether we want 10, 100, 1000 or more rows to be sorted, and the random_text column, without index, will be great for sorting.

So, let's see first test:

$ explain analyze select * from test where id <= 10 order by random_text asc;
                                                       QUERY PLAN·······················································
------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=8.66..8.68 rows=9 width=35) (actual time=0.080..0.081 rows=10 loops=1)
   Sort Key: random_text
   Sort Method: quicksort  Memory: 26kB
   ->  Index Scan using test_pkey on test  (cost=0.00..8.52 rows=9 width=35) (actual time=0.004..0.010 rows=10 loops=1)
         Index Cond: (id <= 10)
 Total runtime: 0.133 ms
(6 rows)
(6 rows)

We see that for sorting 10 rows PostgreSQL used 25kB of RAM. Current limit is:

$ show work_mem ;
 work_mem
----------
 1MB
(1 row)

So, let's see how it will work out for larger sets of data, still with 1MB of work_mem. I ran the explain 3 times with rowcounts 10, 100, 1000, 10k, 100k, and 1M. Results:

rows avg. time Sort method
10 0.072 ms quicksort Memory: 26kB
100 0.258 ms quicksort Memory: 33kB
1000 3.421 ms quicksort Memory: 114kB
10000 42.047 ms external merge Disk: 448kB
100000 498.615 ms external merge Disk: 4440kB
1000000 6721.819 ms external merge Disk: 44504kB

As you can see when we hit 10k rows PostgreSQL switched from quicksort in memory, to external merge method. Interestingly – time didn't increase (i.e. time per row), but that's just because we're dealing with (still) relatively small datasets, and my test machine has pretty much memory, so kernel caches most of temporary files data.

You might wander, though, why PostgreSQL switched to Disk, when it used only 448kB? After all, work_mem is 1MB. Answer is pretty simple – as I understand – disk is used when work_mem is not enough, so it means it's already been filled. So, sort with “Disk: 448kB" would mean that more or less whole work_mem has been used plus 448kB of disk.

and how would that work with pure memory?

$ set work_mem = '2MB';
SET
 
$ explain analyze select * from test where id <= 10000 order by random_text asc;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1028.64..1053.41 rows=9911 width=35) (actual time=29.778..30.410 rows=10000 loops=1)
   Sort Key: random_text
   Sort Method: quicksort  Memory: 1265kB
   ->  Index Scan using test_pkey on test  (cost=0.00..370.80 rows=9911 width=35) (actual time=0.019..3.769 rows=10000 loops=1)
         Index Cond: (id <= 10000)
 Total runtime: 30.932 ms
(6 rows)

As you can see it used 1265kB of work_mem.

As with hashing – as far as I was able to tell Hash operator doesn't spill to disk (or at least doesn't mention it), but instead increases number of “Batches":

$ explain analyze  select * from test a join test b using (random_text) where a.id < 10;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=9.62..219822.85 rows=9 width=39) (actual time=0.078..2421.989 rows=9 loops=1)
   Hash Cond: (b.random_text = a.random_text)
   ->  Seq Scan on test b  (cost=0.00..182313.65 rows=9999865 width=35) (actual time=0.007..997.380 rows=10000000 loops=1)
   ->  Hash  (cost=9.50..9.50 rows=9 width=35) (actual time=0.021..0.021 rows=9 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  Index Scan using test_pkey on test a  (cost=0.00..9.50 rows=9 width=35) (actual time=0.004..0.007 rows=9 loops=1)
               Index Cond: (id < 10)
 Total runtime: 2422.059 ms
(8 rows)

With larger counts I got:

$ explain analyze  select * from test a join test b using (random_text) where a.id < 100000;
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=5634.86..445954.33 rows=98717 width=39) (actual time=64.133..5755.718 rows=99999 loops=1)
   Hash Cond: (b.random_text = a.random_text)
   ->  Seq Scan on test b  (cost=0.00..182313.65 rows=9999865 width=35) (actual time=0.009..1079.314 rows=10000000 loops=1)
   ->  Hash  (cost=3628.89..3628.89 rows=98717 width=35) (actual time=64.057..64.057 rows=99999 loops=1)
         Buckets: 2048  Batches: 8  Memory Usage: 863kB
         ->  Index Scan using test_pkey on test a  (cost=0.00..3628.89 rows=98717 width=35) (actual time=0.019..30.527 rows=99999 loops=1)
               Index Cond: (id < 100000)
 Total runtime: 5760.123 ms
(8 rows)

or

$ explain analyze  select * from test a join test b using (random_text) where a.id < 1000000;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=59578.00..491750.18 rows=1044855 width=39) (actual time=569.559..6815.947 rows=999999 loops=1)
   Hash Cond: (b.random_text = a.random_text)
   ->  Seq Scan on test b  (cost=0.00..182313.65 rows=9999865 width=35) (actual time=0.009..1152.624 rows=10000000 loops=1)
   ->  Hash  (cost=38354.31..38354.31 rows=1044855 width=35) (actual time=569.288..569.288 rows=999999 loops=1)
         Buckets: 2048  Batches: 128  Memory Usage: 544kB
         ->  Index Scan using test_pkey on test a  (cost=0.00..38354.31 rows=1044855 width=35) (actual time=0.021..259.011 rows=999999 loops=1)
               Index Cond: (id < 1000000)
 Total runtime: 6863.036 ms
(8 rows)

So, what does it help to have higher work_mem? The idea is that having higher work_mem let's the hash be larger with fewer “Batches", and thus make it faster. Why “the idea"? When I tried to test it I got:

$ set work_mem = '100MB';
SET
 
$ explain  analyze  select * from test a join test b using (random_text) where a.id < 1000000;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=51415.00..369175.51 rows=1044855 width=39) (actual time=580.566..10006.115 rows=999999 loops=1)
   Hash Cond: (b.random_text = a.random_text)
   ->  Seq Scan on test b  (cost=0.00..182313.65 rows=9999865 width=35) (actual time=0.008..1017.055 rows=10000000 loops=1)
   ->  Hash  (cost=38354.31..38354.31 rows=1044855 width=35) (actual time=580.400..580.400 rows=999999 loops=1)
         Buckets: 131072  Batches: 1  Memory Usage: 67387kB
         ->  Index Scan using test_pkey on test a  (cost=0.00..38354.31 rows=1044855 width=35) (actual time=0.036..252.311 rows=999999 loops=1)
               Index Cond: (id < 1000000)
 Total runtime: 10048.051 ms
(8 rows)

As we can see hashing used 67MB of ram, but the speed was basically the same – even a bit slower.

And even decreasing the work_mem to the point of absurd didn't really help me show performance difference:

$ set work_mem = '128kB';
SET
 
$ explain  analyze  select * from test a join test b using (random_text) where a.id < 1000000;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=59578.00..454250.69 rows=1044855 width=39) (actual time=611.189..6666.236 rows=999999 loops=1)
   Hash Cond: (b.random_text = a.random_text)
   ->  Seq Scan on test b  (cost=0.00..182313.65 rows=9999865 width=35) (actual time=0.009..1107.376 rows=10000000 loops=1)
   ->  Hash  (cost=38354.31..38354.31 rows=1044855 width=35) (actual time=604.508..604.508 rows=999999 loops=1)
         Buckets: 1024  Batches: 1024  Memory Usage: 74kB
         ->  Index Scan using test_pkey on test a  (cost=0.00..38354.31 rows=1044855 width=35) (actual time=0.039..249.613 rows=999999 loops=1)
               Index Cond: (id < 1000000)
 Total runtime: 6710.618 ms
(8 rows)

Frankly – I don't quite understand it, perhaps someone can explain this result to me in comments. For now I assume it's just a matter of specifics of the data I have here.

As I tried to show, increasing work_mem can lead to faster sorts (thanks to sort being done in memory, and not spilling to disk). Thanks to this, operations that depend on sorting (aggregates (sometimes), distinct, merge joins) can be faster.

So, why not simply set work_mem to 1GB, and be happy with it?

The problem lies in the fact that every execution node that uses work_mem, can use it up to the limit.

So, if we'd have query that executes plan like this, where there are 101 Sort or Hash operations, you'd risk using up to 101 * work_mem of memory. In single DB connection!.

To show some perspective – Let's assume you have some queries that do 4-10 Sort/Hash operations. And you have 100 max_connections limit. This means that theoretically you can get yp to 1000 x work_mem usage! (100 connections, each running the 10-sort/hash query).

Above means, that while increasing work_mem is cool, you have to be sure not create situation when pg will try to allocate more memory than there physically is. A lot of “out of memory" errors reported to PostgreSQL mailing lists came from users which set their work_mem to large values like 512MB or more, and then ran some complex queries. On machine with (relatively) small amount of ram.

What has to be noted though, is that you can change work_mem using normal sql query – and it will be changed for this particular session only.

So, it might make sense to set work_mem in postgresql.conf to relatively low value (1-10MB), and then, find out which queries would use more, and change the application to issue

set work_mem = '100MB'

before running them, and

reset work_mem

afterwards.

This of course raises question: how do I know which queries could use more ram? It's simple – just change logging configuration – set log_temp_files to 0 (to log all temp files), and then just check logs every so often. Or check slowest queries, their execution plans, if they have disk based sorts.

As for values of work_mem. There is a long standing idea/rumour that work_mem is internally limited to 1GB, and PostgreSQL will not use more than that even when having set work_mem = '10GB'. Well, I tested it, and for query:

explain analyze select * from lima where mike < 9000000 order by tango

I got this plan, which says:

Sort Method: quicksort Memory: 1304617kB

(side note: I love the anonymization thingie in explain.depesz.com)

Finally, the last usage of work mem is when you're dealing with GIN indexes:

During a series of insertions into an existing GIN index that has FASTUPDATE enabled, the system will clean up the pending-entry list whenever it grows larger than work_mem. To avoid fluctuations in observed response time, it's desirable to have pending-list cleanup occur in the background (i.e., via autovacuum). Foreground cleanup operations can be avoided by increasing work_mem or making autovacuum more aggressive. However, enlarging work_mem means that if a foreground cleanup does occur, it will take even longer.

GIN indexes are not very common - so it might be irrelevant for your case. The problem seems to be well cleared - when you're doing lots of insertions to GIN - like import of data to table which has GIN index on tsvector for full text search - increasing work_mem can make the process faster (if work_mem is larger than changes in index).

So. Let's see how it plays.

First, I need some data. So I wrote simple Perl script:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;
 
open my $fh, '<', '/usr/share/dict/american-english-insane';
my @dict;
while (my $l = <$fh>) {
    $l =~ s/\s+//;;
    push @dict, $l if $l =~ /\S/;
}
close $fh;
 
for my $size ( 100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000 ) {
    open my $out, '>', '/tmp/' . $size . '.list';
 
    for my $i ( 1.. $size ) {
        my $count = int( 20 + rand 150 );
        my @words = map { $dict[ rand @dict ] } 1..$count;
        print $out join( ' ', @words ) . "\n";
    }
    close $out;
}
 
exit;

Which created me set of 5 files:

=$ ls -l /tmp/100*
-rw-r--r-- 1 depesz depesz 9952497412 2011-07-03 17:59 /tmp/10000000.list
-rw-r--r-- 1 depesz depesz  994157889 2011-07-03 17:46 /tmp/1000000.list
-rw-r--r-- 1 depesz depesz   99494600 2011-07-03 17:45 /tmp/100000.list
-rw-r--r-- 1 depesz depesz    9982229 2011-07-03 17:45 /tmp/10000.list
-rw-r--r-- 1 depesz depesz     993531 2011-07-03 17:45 /tmp/1000.list
-rw-r--r-- 1 depesz depesz     108166 2011-07-03 17:45 /tmp/100.list

Then, I wrote simple shell script which will:

  • create table with text column, and gin index on it.
  • load data to the table
  • drop the table

The process was repeated 3 times for every file, and every work_mem from list:

  • 1MB
  • 10MB
  • 100MB
  • 1GB

Script, in case you're interested:

#!/bin/bash
for data_input_file in /tmp/100.list /tmp/1000.list /tmp/10000.list /tmp/100000.list /tmp/1000000.list /tmp/10000000.list
do
    for work_mem in 1MB 10MB 100MB 1GB
    do
        echo "Working on $data_input_file with $work_mem work_mem."
        for i in 1 2 3
        do
            (
                echo "set work_mem = '$work_mem';"
                echo "create table gin_test ( body text );"
                echo "CREATE INDEX gin_idx ON gin_test USING gin(to_tsvector('english', body));"
            ) | psql -qAtX
            /usr/bin/time -f "- %e" psql -c "\\copy gin_test from '$data_input_file'"
            psql -qAtX -c "drop table gin_test"
        done
    done
done

Unfortunately it took long enough to render me uninterested. Especially since the results are far from spectacular:

file work_mem
1MB 10MB 100MB 1GB
/tmp/100.list 0.05 s 0.05 s 0.05 s 0.05 s
/tmp/1000.list 0.72 s 0.74 s 0.81 s 0.76 s
/tmp/10000.list 9.90 s 10.07 s 10.01 s 9.92 s
/tmp/100000.list 673.33 s 672.68 s 679.72 s 680.76 s

Given the lack of difference in times, I assume that this is irrelevant (unless proven/shown otherwise).

So. That would conclude the blogpost on work_mem. If you'll have any questions - please let me know in comment.

  1. 10 comments

  2. Jul 4, 2011

    I’ve noticed when sorting spills to disk, and I increase the work_mem a bit but not enough to put the whole sort in memory, it still uses the same amount of disk. So I’m not sure this is correct: “So, sort with “Disk: 448kB” would mean that more or less whole work_mem has been used plus 448kB of disk.”

    I don’t have numbers with me but I was seeing eg:
    work_mem = 256MB, used 100MB disk
    work_mem = 512MB, used 100MB disk
    work_mem = 1024MB, used 0 disk (suddenly fits all in memory)

    Mind you, this is probably on 8.3 so behaviour may have changed?

  3. # Jesper Krogh
    Jul 4, 2011

    For GIN, I think you’ll need to construct an index that physically fills more than your memory for fastupdate to have any relevance.

  4. Jul 4, 2011

    The upper limit on work_mem is at 2GB:

    select unit,max_val from pg_settings where name=’work_mem’;
    unit | max_val
    ——+———
    kB | 2097151

    I think you can set it larger in some versions of PostgreSQL, but it still won’t use an amount beyond this limit. Your example only proved it will go usefully higher than 1GB, I doubt you can find any version where it will go over 2GB. On the 9.1 server I do testing on here, trying to set it higher than 2GB does give the appropriate error.

    The suggested explanation for why the work_mem transition point to disk isn’t so clear isn’t right, as demonstrated by Nathan Thom’s example. The amount of RAM taken up to run the quicksort and the amount needed to do the merge sort are not the same. Merge sorts actually use quite a bit less memory than a quicksort on the same size data set. You can see this in the article’s sample table too. Note how with the 1000 element example, quicksort uses 114kB of RAM. A quicksort of 10000 elements would take closer to 1MB of RAM, but since it switches to the more disk efficient disk merge sort there it’s only 448KB instead. The disk-based merge sort doesn’t use work_mem worth of caching and spill the remains; it’s all done on disk, and the amount reported is the total.

  5. # Chris Mellish
    Jul 4, 2011

    I’m not a PG internals guy but as to the reason that a hash using more memory ended up being slower than a smaller hash the broke the workload into batches, the most logical explanation for me is to do with CPU caches. The smaller hash size was just 544KB which, depending on the CPU, will fit entirely within the cache. With the 67MB example that will definitely not fit within the cache and so will be much more reliant on the speed of main memory.

    Depending on the way in which Postgres breaks the task up in to smaller batches it is conceivable that lots of small fast operations are quicker than one big but slower operation.

  6. Jul 4, 2011

    @Jeper:
    it’s possible. But I’m not really willing to do such experiment – it takes so long, and I need the machine for other purposes.

    It might be that this code works when you update index – i.e. where there are some data in index, and you add new rows.

  7. Jul 4, 2011

    @Greg:
    In my pg (9.2devel):

    $ select unit,max_val from pg_settings where name=’work_mem’;
    unit | max_val
    ——+————
    kB | 2147483647
    (1 row)

    which is 2TB :).

    As for your explanation of memory usage – it definitely makes sense. Thanks. This makes me think – wouldn’t it be possible to make disk based sorts use memory for as long as they can, and then spill to disk?

  8. # gj
    Jul 4, 2011

    “Interestingly – time didn’t increase (i.e. time per row),”
    It looks like it did, 3.421 ms to 42.047 ms

  9. Jul 4, 2011

    @GJ: Per row.

    there is increase but not really big. 0.0034ms vs. 0.0042ms.

  10. # Chaw-Chi Yu
    Apr 11, 2014

    I have run into a problem of PostgreSql reporting “out of memory”. Please see the following PostgreSql log entries:
    TopMemoryContext: 49816 total in 6 blocks; 5016 free (24 chunks); 44800 used
    TopTransactionContext: 8192 total in 1 blocks; 7856 free (0 chunks); 336 used
    Operator lookup cache: 24576 total in 2 blocks; 14072 free (6 chunks); 10504 used
    Operator class cache: 8192 total in 1 blocks; 3848 free (0 chunks); 4344 used
    MessageContext: 24576 total in 2 blocks; 20344 free (11 chunks); 4232 used
    smgr relation table: 8192 total in 1 blocks; 744 free (0 chunks); 7448 used
    TransactionAbortContext: 32768 total in 1 blocks; 32752 free (0 chunks); 16 used
    Portal hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
    PortalMemory: 8192 total in 1 blocks; 8176 free (1 chunks); 16 used
    Relcache by OID: 8192 total in 1 blocks; 2336 free (0 chunks); 5856 used
    CacheMemoryContext: 667472 total in 20 blocks; 1864 free (15 chunks); 665608 used
    …………………
    MdSmgr: 8192 total in 1 blocks; 6160 free (0 chunks); 2032 used
    LOCALLOCK hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
    Rendezvous variable hash: 8192 total in 1 blocks; 3848 free (0 chunks); 4344 used
    Timezones: 49432 total in 2 blocks; 5968 free (0 chunks); 43464 used
    ErrorContext: 8192 total in 1 blocks; 8176 free (0 chunks); 16 used
    2014-03-11 16:04:53 HKT ERROR: out of memory
    2014-03-11 16:04:53 HKT DETAIL: Failed on request of size 144.

    I was told this can be fixed by increasing work_mem. Is this correct? I’m using version 8.3 with default postgresql.conf. Any comment or help is greatly appreciated

  11. Apr 11, 2014

    @Chaw:
    increasing? most likely no. What is your: shared_buffers, work_mem, and explain of the query that was causing it? Also – what OS and architecture is it (32 bit? 64 bit?)? are there any system limitations (ulimit)?

Leave a comment