June 9th, 2012 by depesz | Tags: , , , , , , , , , | 5 comments »
Did it help? If yes - maybe you can help me? Donate BTC to 19zPa5diT2LZqGtTi8f8bfApLn8rw9zBHx

(disclaimer: all the data and examples in here are on Linux – the same data can be probably obtained on other systems too, it's just that I work on Linux and don't know other systems well).

This question pops occasionally in various places – PostgreSQL is using too much memory, why is that, and how can it be mitigated?

Before we can go to “optimizing", we should understand the problem. But do we? Both standard tools – ps and top – lie. How/why? Let's see.

I have a very simple PostgreSQL instance on my desktop, configured with 4 GB of shared buffers, and 100MB of work_mem. I did run some work, and now the ps shows:

=$ ps -u pgdba uf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
pgdba    32324  0.0  0.0  79788  1900 ?        S    14:26   0:00 sshd: pgdba@pts/13  
pgdba    32325  0.0  0.0  25844  5788 pts/13   Ss+  14:26   0:00  \_ -bash
pgdba    27502  0.0  0.8 4344112 109724 ?      S    14:18   0:00 /home/pgdba/work/bin/postgres
pgdba    27506  0.0  0.0  24792   620 ?        Ss   14:18   0:00  \_ postgres: logger process     
pgdba    27508  1.5 34.7 4346688 4274752 ?     Ss   14:18   0:14  \_ postgres: checkpointer process   
pgdba    27509  0.2 12.1 4346164 1495780 ?     Ss   14:18   0:02  \_ postgres: writer process     
pgdba    27510  0.3  0.1 4346164 17292 ?       Ss   14:18   0:03  \_ postgres: wal writer process   
pgdba    27511  0.0  0.0 4347168 2408 ?        Ss   14:18   0:00  \_ postgres: autovacuum launcher process   
pgdba    27512  0.0  0.0  26888   856 ?        Ss   14:18   0:00  \_ postgres: archiver process   last was 00000001000000060000004D
pgdba    27513  0.0  0.0  27184  1160 ?        Ss   14:18   0:00  \_ postgres: stats collector process   
pgdba    27713  5.6 34.8 4347268 4285716 ?     Ss   14:19   0:51  \_ postgres: depesz depesz [local] idle
pgdba    27722  2.6  3.1 4347412 392704 ?      Ss   14:19   0:23  \_ postgres: depesz depesz [local] idle
pgdba    27726 15.8 35.0 4352560 4309776 ?     Ss   14:19   2:25  \_ postgres: depesz depesz [local] idle

top shows basically the same numbers.

Which is suspicious, because “free" shows:

=$ free
             total       used       free     shared    buffers     cached
Mem:      12296140   12144356     151784          0      26828   10644460
-/+ buffers/cache:    1473068   10823072
Swap:            0          0          0

i.e – only 1.5GB of ram used (and ~ 10GB used as disk cache – ready to be freed if any app would need more ram).

So how come I see such huge numbers in PS?

First of all – we need to disregard VSZ column. The one that's important is RSS. But its still not really useful:

=$ ps -u pgdba o pid,rss:8,cmd | awk 'NR>1 {A+=$2} {print} END{print "Total RSS: " A}'
  PID      RSS CMD
27502   109724 /home/pgdba/work/bin/postgres
27506      620 postgres: logger process     
27508  4274752 postgres: checkpointer process   
27509  1755420 postgres: writer process     
27510    17292 postgres: wal writer process   
27511     2408 postgres: autovacuum launcher process   
27512      856 postgres: archiver process   last was 00000001000000060000004D
27513     1160 postgres: stats collector process   
27713  4285716 postgres: depesz depesz [local] idle
27722   392700 postgres: depesz depesz [local] idle
27726  4309776 postgres: depesz depesz [local] idle
32324     1900 sshd: pgdba@pts/13  
32325     5788 -bash
Total RSS: 15158112

So, this clocks at ~ 15GB, which is more than I have in this box. So, what is the real memory usage. As in: how much memory I would gain, if I'd kill Pg?

Luckily, we can view what exactly is the RAM being used for, thanks to wonderful world of /proc directory/filesystem/mountpoint/fairy.

Each process on Linux has a directory in /proc. In this directory there is a number of files and directories. Don't be fooled by reported size of the files – they all have “0" bytes, but they do contain information. It's magic.

The one file we're interested in is “smaps".

Content of it looks like this:

=$ sudo head -n 20 /proc/27713/smaps
00400000-00914000 r-xp 00000000 09:00 3545633                            /home/pgdba/work/bin/postgres
Size:               5200 kB
Rss:                 964 kB
Pss:                 214 kB
Shared_Clean:        964 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:          964 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
00b13000-00b14000 r--p 00513000 09:00 3545633                            /home/pgdba/work/bin/postgres
Size:                  4 kB
Rss:                   4 kB
Pss:                   0 kB
Shared_Clean:          0 kB
...

For this particular process smaps has over 2000 lines, so I will not show you all of it.

So, anyway – this process – 27713 – according to ps, uses 4285716 kilobytes of RAM. So, what is it so big? Quick grep, and we see:

=$ sudo grep -B1 -E '^Size: *[0-9]{6}' /proc/27713/smaps
7fde8dacc000-7fdf952d6000 rw-s 00000000 00:04 232882235                  /SYSV005a5501 (deleted)
Size:            4317224 kB

There is only one “block" that is over 100MB of size, and it's size is very close to total size of process.

Full information about it:

7fde8dacc000-7fdf952d6000 rw-s 00000000 00:04 232882235                  /SYSV005a5501 (deleted)
Size:            4317224 kB
Rss:             4280924 kB
Pss:             1245734 kB
Shared_Clean:          0 kB
Shared_Dirty:    4280924 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:      4280924 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB

Most of this information is more or less cryptic, but we see couple of things:

  • it is shared memory (first line contains rw-s, where “s" is for shared)
  • by the look of it (/SYSV… deleted) it looks like the shared memory is done using mmaping deleted file – so the memory will be in “Cached", and not “Used" columns in free output.
  • Size of the shared block is 4317224, and 4280924 from it is actually resident in memory

That's ok – that's shared_buffers. But the thing is – shared buffers are used by most of the backends. And, what's worse, not always to the same extent. For example, the same shared buffers data from process 27722:

=$ sudo grep -A14 7fde8dacc000-7fdf952d6000 /proc/27722/smaps
7fde8dacc000-7fdf952d6000 rw-s 00000000 00:04 232882235                  /SYSV005a5501 (deleted)
Size:            4317224 kB
Rss:              388652 kB
Pss:               95756 kB
Shared_Clean:          0 kB
Shared_Dirty:     388652 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:       388652 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB

In here we see that this process requested/used only 388MB of the memory.

So calculating will be complex. For exaple – we might have two processes, each using 400MB of shared_buffers, but it doesn't tell us how much memory it is actually using – because it could be that they are using 100MB of the same buffers, and 300MB of different – so in total the memory usage would be 700MB.

We do know that total size of this shared_buffers block is 4317224. Which is great. But what about other things? Libraries for example – they can be shared by kernel between multiple processes.

Luckily, in 2007 Fengguang Wu sent (and previously wrote) a very cool patch for kernel – which added “Pss" info to smaps.

Basically, Pss is at most Rss, but gets decreased if the same pages of memory are used by more than one process.

That's why Pss above was so much lower than Rss/Size. For example – in the last example. Rss was given 388652, but Pss just 95756, which means that most of the pages this backend used was also used by 3 other backends.

So, now – knowing about Pss, we can finally get real memory usage of a running pg cluster:

=$ ps -u pgdba o pid= | sed 's#.*#/proc/&/smaps#' | xargs sudo grep ^Pss: | awk '{A+=$2} END{print A}'
4329040

If you just said “WTF, WHAT DID HE RUN?!", let me explain. First command:

=$ ps -u pgdba o pid=
27502
...
32325

just returns pids of pgdba user (usually you'll want postgres, but I'm different, and run PostgreSQL as pgdba).

Second – sed – changes the pids to paths to smaps files:

=$ ps -u pgdba o pid= | sed 's#.*#/proc/&/smaps#'
/proc/27502/smaps
...
/proc/32325/smaps

Then I do simple grep for ^Pss lines in the files given from sed. Which returns lots of lines like:

/proc/32325/smaps:Pss:                   0 kB
/proc/32325/smaps:Pss:                   4 kB
/proc/32325/smaps:Pss:                   4 kB

And then the awk summarizes 2nd column (which is the size). And I get the number 4329040 – in kilobytes.

So, theoretically, if I'd stop Pg, I would reclaim that much of RAM. Let's see if it's true:

=$ free; pg_ctl -m immediate stop; free
             total       used       free     shared    buffers     cached
Mem:      12296140   12145424     150716          0      40708   10640968
-/+ buffers/cache:    1463748   10832392
Swap:            0          0          0
waiting for server to shut down.... done
server stopped
             total       used       free     shared    buffers     cached
Mem:      12296140    7781960    4514180          0      40856    6325092
-/+ buffers/cache:    1416012   10880128
Swap:            0          0          0

Used memory dropped from 12145424 to 7781960 – meaning I got back 4363464 kB of RAM. Which is even a bit higher than expected 4329040, but it's close enough. And most of it came off disk cache – as expected, because it was used for shared_buffers.

This is all good, but can this method be used to estimate how much RAM I would reclaim by killing single backend?

No and yes. Shutting down whole Pg meant that the shared memory that it was using – could have been freed. In normal environments, when you kill backend – you end up freeing only the memory that was private to this backend. And this is usually depressingly low number.

For example, on another machine, with much more impressive hardware:

=> ps uxf | grep user.db_name | sort -nk6 | tail -n 1 | tee >( cat - >&2) | awk '{system("cat /proc/"$2"/smaps")}' | grep ^Private | awk '{A+=$2} END{print A}'
postgres  5278  8.2  0.3 107465132 1727408 ?   Ss   13:21   0:03  \_ postgres: user db_name aa.bbb.cc.dd(eeeee) idle
52580

I.e. – the process has 1.7GB of RSS (as visible in ps output), but only 52MB of it is Private memory that would be freed if it got killed.

So no – you can't use the Pss for this, but you can use Private_* data from smaps to get the number.

To summarize it all – PostgreSQL uses much less memory than it looks like at first sight, and while it is possible to get quite accurate numbers – you need to do some shell scripting to get them.

And now I brace myself for comments from the people that will point all technical mistakes in this post, or (what's worse) typos. :)

  1. 5 comments

  2. # Vatroslav
    Jun 9, 2012

    This is pretty advanced for my understanding, but it has helped me grasp a bit more about Linux and PostgreSQL, and for that, I’m thankful.

  3. Jun 16, 2012

    While my memory on the subject is a bit fuzzy and me being a lousy reader, I’m afraid I can’t point out any specific mistakes.

    I’d strongly suggest you to try out “smem” tool (www.selenic.com/smem/) though, which provides same valuable info in a much more *ahem* user-friendly way, as well as providing great aggregations of it.

  4. # 73rddef
    Jul 21, 2012

    My version of ps aligns the output, which means that the command fails if the pids have a different number of digits. (It tries to access filenames such as “/proc/ 1792/smaps”)

    I rewrote the sed as ‘s#[0-9]\+#/proc/&/smaps#’ which isn’t perfect but works

  5. Apr 8, 2013

    It’s worth to mention some tools which show process listings with better memory metrics than `ps`. Including `meminfo.py` http://koltsoff.com/pub/meminfo/ – here private memory size is shown as URES.

  6. # Ramachandran
    Jul 30, 2014

    Very very good.

Leave a comment