While there are some docs on it, I decided to write about it in perhaps more accessible language – not as a developer, but as a PostgreSQL user.

Some parts (quite large parts) were described in one of my earlier posts, but I'll try to concentrate on WAL itself, and show a bit more here.

Before I go into how it's written, and how to check various things, let's first think about a simple question: why does such a thing even exist? What does it do?

Let's imagine you have a 1GB file, and you want to change 100kB of it, starting at some given offset.

Theoretically it's simple – open the file for writing (without truncating it), fseek() to the appropriate location, and write the new 100kB. Then close the file, and open a beer – job done.

But is it? What will happen if the power goes down while you're doing the write? And you have no UPS?

It's actually a pretty bad situation – let's assume it happened in the middle of the process. Your program got killed (well, the machine got turned off), and the data on disk now contains half old data and half new data!

Of course you could argue that this is why we have UPSes, but reality is a bit more complicated – various disks and storage controllers have write caches, and can lie to the operating system (which then lies to the application) that the data has been written, while it's really only in cache – and then the problem can strike again.

So, a solution has to be found that makes sure we end up with either the new data or the old data, but never a mix of the two.

The solution here is relatively simple. Aside from the 1GB data file, store an additional file which never gets overwritten, only appended to. And change your process to:

  • open the log file in append mode
  • write to the log file: “will write this data (here goes the data) to this file (path) at offset (offset)"
  • close the log file
  • make sure the log file actually got written to disk – call fsync() and hope that the disks will do it properly
  • change the data file normally
  • mark the operation in the log file as done

The last part can be done simply by storing somewhere the location of the last applied change from the log file.
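The steps above can be sketched in code. This is a toy model only – the file names and the JSON record format are my own invention for illustration, nothing like what PostgreSQL actually does:

```python
import json
import os

LOG = "changes.log"  # hypothetical log file name

def log_write(path, offset, data):
    # 1-4: append an intent record and fsync it BEFORE touching the data file
    with open(LOG, "a") as f:
        f.write(json.dumps({"path": path, "offset": offset,
                            "data": data.hex(), "done": False}) + "\n")
        f.flush()
        os.fsync(f.fileno())
    # 5: change the data file normally
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # 6: mark the operation as done (here: append a "done" record)
    with open(LOG, "a") as f:
        f.write(json.dumps({"done": True}) + "\n")
        f.flush()
        os.fsync(f.fileno())

def recover():
    # on startup: re-apply the last logged change that has no "done" mark
    try:
        with open(LOG) as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        return
    pending = None
    for line in lines:
        try:
            rec = json.loads(line)
        except ValueError:
            break  # ignore a half-written last entry
        pending = None if rec.get("done") else rec
    if pending:
        with open(pending["path"], "r+b") as f:
            f.seek(pending["offset"])
            f.write(bytes.fromhex(pending["data"]))
            f.flush()
            os.fsync(f.fileno())
```

Note that whichever moment the crash happens at, `recover()` either finds nothing to do (crash during logging – the half-written entry is ignored) or redoes the change (crash during the data-file write, or before the "done" mark).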

Now, let's think about power failure. What happens if the power fails while writing to the log file? Nothing. The data in the real file didn't get damaged, and your program just has to be smart enough to ignore log entries that weren't fully written. And what happens if the power goes out while changing data in your real file? That's simple – on the next start of your program, you check the log for any changes that should have been applied but weren't, and you apply them. When the program starts, the content of the data file is broken, but it gets fixed fast.

And if the power fails while marking the operation as done? No problem – on the next start the operation will simply be redone.

Thanks to this we are (reasonably) safe from such failures. It has other benefits too, but those will come later.

So, now that we know the purpose of WAL (in case it wasn't clear: to protect your data from hardware/power failures), let's think about how it works.

PostgreSQL uses so-called “pages". All things aside, a page is simply 8kB of data. That's why table/index files always have sizes divisible by 8192 (8kB) – you can't have a table that is 10.5 pages in size; it's 11 pages. Not all pages are full. Not all are even guaranteed to be used (they could contain data that got removed).

All I/O operations use pages. I.e., to get a single INT4 value from a table, PostgreSQL reads 8kB of data (at least).

So, what happens when you INSERT a new row? First PostgreSQL finds which page to put it in. It might be a newly created page if all pages of the table are full, or some existing page if it has free space.

After the page has been chosen, PostgreSQL loads it into memory (shared_buffers), and makes the change there. Information about the change gets logged to WAL, but this is done by a simple write() (without a call to fsync, yet), so it's very fast.

Now, when you issue COMMIT; (which could happen automatically if it's an auto-commit connection), PostgreSQL actually does an fsync() on the WAL. And that's all. The data is not written to your table file.

At this moment, the change you wanted (the new row) is in 2 places:

  • the modified copy of the table page in shared_buffers
  • a record in WAL with information about the change

All writes to WAL are done not by individual backends, but by a specialized process – the wal writer, which you can see for example here:

=$ ps uxf
pgdba    25473  0.0  0.0 112552  2104 ?        S    01:16   0:00 sshd: pgdba@pts/8
pgdba    25474  0.0  0.0  28000  6704 pts/8    Ss   01:16   0:00  \_ -bash
pgdba    25524  0.0  0.0  20520  1148 pts/8    R+   01:16   0:00      \_ ps uxf
pgdba     7366  0.0  0.0  67716  7492 ?        S    Jul13   0:00 /home/pgdba/work/bin/postgres
pgdba     7368  0.0  0.0  25308   888 ?        Ss   Jul13   0:01  \_ postgres: logger process
pgdba     7370  0.5  0.2  67936 36176 ?        Ss   Jul13   0:55  \_ postgres: writer process
pgdba     7371  0.0  0.0  67716  2072 ?        Ss   Jul13   0:08  \_ postgres: wal writer process
pgdba     7372  0.0  0.0  68824  5588 ?        Ss   Jul13   0:00  \_ postgres: autovacuum launcher process
pgdba     7373  0.0  0.0  27404   900 ?        Ss   Jul13   0:00  \_ postgres: archiver process   last was 000000010000000800000057
pgdba     7374  0.0  0.0  27816  1340 ?        Ss   Jul13   0:01  \_ postgres: stats collector process

Thanks to this there is no locking on WAL writes, just simple, plain appending of data.

Now, if we continued this process for a long time, we would have lots of modified pages in memory, and lots of records in WAL.

So, when does data get written to the actual on-disk pages of tables?

Two situations:

  • page swap
  • checkpoint

Page swap is a very simple process – let's assume we have shared_buffers set to 10, all of these buffers are taken by 10 different pages, and all are modified. Now, due to user activity, PostgreSQL has to load another page to get data from it. What will happen? Simple – one of the pages gets evicted from memory, and the new page gets loaded. If the evicted page was “dirty" (which means there were changes in it that weren't yet saved to the table file), it will first be written to the table file.
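A toy model of that eviction rule (nothing like PostgreSQL's real buffer manager – just a least-recently-used cache that illustrates the “write back dirty pages before eviction" part):

```python
class BufferPool:
    """Toy buffer pool: evicts the least recently used page,
    writing it back to storage first if it is dirty."""

    def __init__(self, capacity, storage):
        self.capacity = capacity
        self.storage = storage   # page_no -> bytes, plays the role of the table file
        self.pages = {}          # page_no -> (data, dirty); dict preserves insertion order

    def get(self, page_no):
        if page_no not in self.pages:
            if len(self.pages) >= self.capacity:
                # evict the least recently used page (first in dict order)
                old_no, (old_data, dirty) = next(iter(self.pages.items()))
                del self.pages[old_no]
                if dirty:
                    # dirty page: must hit the "table file" before eviction
                    self.storage[old_no] = old_data
            self.pages[page_no] = (self.storage[page_no], False)
        # move the page to the end = mark as most recently used
        data, dirty = self.pages.pop(page_no)
        self.pages[page_no] = (data, dirty)
        return data

    def modify(self, page_no, data):
        self.get(page_no)
        # the change happens only in memory; the page is now dirty
        self.pages[page_no] = (data, True)
```

Clean pages can simply be dropped on eviction – the copy in the table file is already up to date – which is why only dirty pages cost an extra write.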

Checkpoint is much more interesting. Before we go into what it is, let's think about a theoretical scenario. You have a database that is 1GB in size, and your server has 10GB of RAM. Clearly you can keep all pages of the database in memory, so a page swap never happens.

What would happen if you let the database run, with writes, changes, additions, for a long time? Theoretically all would be OK – all changes get logged to WAL, memory pages get modified, all good. Now imagine that after 24 hours of work, the system gets killed – again, a power failure.

On the next start PostgreSQL would have to read, and apply, all changes from all WAL segments written in the last 24 hours! That's a lot of work, and it would make PostgreSQL's startup take a loooooooong time.

To solve this problem, we have checkpoints. These usually happen automatically, but you can force one at will by issuing the CHECKPOINT command.

So, what is a checkpoint? A checkpoint does a very simple thing: it writes all dirty pages from memory to disk, marks them as “clean" in shared_buffers, and stores the information that all of the WAL up to this point has been applied. This happens without any heavyweight locking. So, the immediate takeaway is that the amount of work a newly started PostgreSQL has to do is related to how much time passed between the last CHECKPOINT and the moment PostgreSQL got stopped.

This brings us back to the question of when it happens. Manual checkpoints are not common; usually one doesn't even think about them – it all happens in the background. How does PostgreSQL know when to checkpoint, then? Simple – thanks to two configuration parameters:

  • checkpoint_segments
  • checkpoint_timeout

And here, we have to learn a bit about segments.

As I wrote earlier – WAL is (in theory) an infinite file, which only gets new data appended, and is never overwritten.

While this is nice in theory, practice is a bit more complex. For example – there isn't really any use for WAL data that was logged before the last checkpoint. And files of infinite size are (at least for now) not possible.

PostgreSQL developers decided to split this infinite WAL into segments. Each segment has its consecutive number, and is 16MB in size. When one segment is full, PostgreSQL simply switches to the next.

Now that we know what segments are, we can understand what checkpoint_segments is about. It is a number (default: 3) which means: if that many segments have been filled since the last checkpoint, issue a new checkpoint.

With the defaults, this means that if you insert data that takes (in PostgreSQL's format) 6144 pages (a 16MB segment is 2048 pages, so 3 segments are 6144 pages) – a checkpoint will be issued automatically.
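The arithmetic behind that number:

```python
PAGE_SIZE = 8192                  # bytes per PostgreSQL page
SEGMENT_SIZE = 16 * 1024 * 1024   # bytes per WAL segment
checkpoint_segments = 3           # the default

pages_per_segment = SEGMENT_SIZE // PAGE_SIZE              # 2048 pages
pages_before_checkpoint = checkpoint_segments * pages_per_segment
print(pages_before_checkpoint)    # 6144
```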

The second parameter – checkpoint_timeout – is a time interval (default: 5 minutes); if this much time passes since the last checkpoint, a new checkpoint will be issued. It has to be understood that (generally) the more often you make checkpoints, the less invasive they are.

This comes from a simple fact – generally, over time, more and more different pages get dirty. If you checkpointed every minute – only pages dirtied within that minute would have to be written to disk. 5 minutes – more pages. 1 hour – even more pages.

While checkpointing doesn't take any heavyweight locks, it has to be understood that a checkpoint of (for example) 30 segments will cause a high-intensity write of up to 480 MB of data to disk. And this might cause some slowdowns for concurrent reads.

So far, I hope, it's pretty clear.

Now, the next piece of the jigsaw – WAL segments.

These files (WAL segments) reside in the pg_xlog/ directory of the PostgreSQL PGDATA:

=$ ls -l data/pg_xlog/
total 131076
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:39 000000010000000800000058
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:03 000000010000000800000059
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:03 00000001000000080000005A
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:04 00000001000000080000005B
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:04 00000001000000080000005C
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:04 00000001000000080000005D
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:06 00000001000000080000005E
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:06 00000001000000080000005F
drwx------ 2 pgdba pgdba     4096 2011-07-14 01:14 archive_status/

Each segment name consists of 3 blocks of 8 hex digits. For example, 00000001000000080000005C means:

  • 00000001 – timeline 1
  • 00000008 – 8th block
  • 0000005C – segment 0x5C within the block

Last part goes only from 00000000 to 000000FE (not FF!).

The 2nd part of the filename, plus the 2 characters at the end of the 3rd part, give us the location in this theoretical infinite WAL file.
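This naming scheme can be decoded programmatically. A small sketch, based on the layout described above (255 segments per block, since the last part wraps at FE):

```python
SEGMENT_SIZE = 16 * 1024 * 1024
SEGMENTS_PER_BLOCK = 0xFF   # last part goes 00..FE, so 255 segments per block

def decode_segment_name(name):
    """Split a 24-hex-digit WAL segment filename into (timeline, block, segment)."""
    timeline = int(name[0:8], 16)
    block = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return timeline, block, segment

def segment_start_offset(name):
    """Byte position of this segment's start within the 'infinite' WAL file."""
    _, block, segment = decode_segment_name(name)
    return (block * SEGMENTS_PER_BLOCK + segment) * SEGMENT_SIZE
```

For example, decode_segment_name("00000001000000080000005C") gives timeline 1, block 8, segment 0x5C.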

Within PostgreSQL we can always check what is current WAL location:

$ select pg_current_xlog_location();
 pg_current_xlog_location
--------------------------
 8/584A62E0
(1 row)

This means that we are now using file xxxxxxxx0000000800000058, and PostgreSQL is writing at offset 4A62E0 within it – which is 4874976 in decimal; since a WAL segment is 16MB, this means the segment is now filled in ~29%.
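The same calculation, done by hand above, can be sketched in Python (this follows the "X/YYYYYYYY" location format shown in this post; the timeline has to be supplied separately):

```python
SEGMENT_SIZE = 16 * 1024 * 1024

def split_location(loc, timeline=1):
    """Turn a pg_current_xlog_location() result like '8/584A62E0' into
    (segment file name, offset within segment, percent filled)."""
    log_id, rec_off = (int(part, 16) for part in loc.split("/"))
    seg = rec_off >> 24              # the high byte selects the segment
    offset = rec_off & 0xFFFFFF      # the low 3 bytes: offset within it
    name = "%08X%08X%08X" % (timeline, log_id, seg)
    return name, offset, 100.0 * offset / SEGMENT_SIZE
```

split_location("8/584A62E0") yields segment 000000010000000800000058, offset 4874976, roughly 29% full.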

The most mysterious thing is the timeline. The timeline starts at 1, and increments (by one) every time you make a WAL slave of a server, and that slave is promoted to standalone. Generally – within a given working server this value doesn't change.

All of this information we can also get using the pg_controldata program:

=$ pg_controldata data
pg_control version number:            903
Catalog version number:               201107031
Database system identifier:           5628532665370989219
Database cluster state:               in production
pg_control last modified:             Thu 14 Jul 2011 01:49:12 AM CEST
Latest checkpoint location:           8/584A6318
Prior checkpoint location:            8/584A6288
Latest checkpoint's REDO location:    8/584A62E0
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          0/33611
Latest checkpoint's NextOID:          28047
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        727
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  33611
Time of latest checkpoint:            Thu 14 Jul 2011 01:49:12 AM CEST
Minimum recovery ending location:     0/0
Backup start location:                0/0
Current wal_level setting:            hot_standby
Current max_connections setting:      100
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value

This has some interesting information – for example the location (in the WAL infinite-file) of the last checkpoint, the previous checkpoint, and the REDO location.

The REDO location is very important – this is the place in WAL that PostgreSQL will have to read from if it gets killed and restarted.

The values above don't differ much because this is my test system, which doesn't have any traffic now, but we can see a bigger spread on another machine:

=> pg_controldata data/
Latest checkpoint location:           623C/E07AC698
Prior checkpoint location:            623C/DDD73588
Latest checkpoint's REDO location:    623C/DE0915B0

The last important thing is to understand what happens to obsolete WAL segments, and how “new" WAL segments are created.

Let me show you one thing again:

=$ ls -l data/pg_xlog/
total 131076
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:39 000000010000000800000058
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:03 000000010000000800000059
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:03 00000001000000080000005A
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:04 00000001000000080000005B
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:04 00000001000000080000005C
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:04 00000001000000080000005D
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:06 00000001000000080000005E
-rw------- 1 pgdba pgdba 16777216 2011-07-14 01:06 00000001000000080000005F
drwx------ 2 pgdba pgdba     4096 2011-07-14 01:14 archive_status/

This was on a system with no writes, and a REDO location of 8/584A62E0.

Since on start PostgreSQL will need to read from this location, all WAL segments before 000000010000000800000058 (i.e. 000000010000000800000057, 000000010000000800000056 and so on) are obsolete.

On the other hand – please note that we have seven files ready for future use.

PostgreSQL works in this way: whenever a WAL segment becomes obsolete (i.e. the REDO location is later in WAL than this segment), the file is renamed. That's right – it's not removed, it's renamed. Renamed to what? To a future file in WAL. So when I do some writes, and then there is a checkpoint at some 8/59* location, file 000000010000000800000058 will get renamed to 000000010000000800000060.

This is one of the reasons why your checkpoint_segments shouldn't be too low.

Let's think for a while about what would happen if we had a very long checkpoint_timeout, and we filled all checkpoint_segments. To record a new write, PostgreSQL would have to do a checkpoint (which it will do), but at the same time – it wouldn't have any ready segments left to use. So it would have to create a new file. A new file – 16MB of data (\x00 probably) – would have to be written to disk before PostgreSQL could write anything the user requested. Which means that if you ever fill all checkpoint_segments, concurrent user activity will be slowed down, because PostgreSQL will have to create new files to accommodate the writes requested by users.

Usually it's not a problem – you just set checkpoint_segments to some relatively high number, and you're done.

Anyway. When looking at the pg_xlog/ directory, the current WAL segment (the one that gets the writes) is usually somewhere in the middle. This might cause some confusion, because the mtimes of the files do not change in the same direction as the numbers in the filenames. Like here:

$ ls -l
total 704512
-rw------- 1 postgres postgres 16777216 Jul 13 16:51 000000010000002B0000002A
-rw------- 1 postgres postgres 16777216 Jul 13 16:55 000000010000002B0000002B
-rw------- 1 postgres postgres 16777216 Jul 13 16:55 000000010000002B0000002C
-rw------- 1 postgres postgres 16777216 Jul 13 16:55 000000010000002B0000002D
-rw------- 1 postgres postgres 16777216 Jul 13 16:55 000000010000002B0000002E
-rw------- 1 postgres postgres 16777216 Jul 13 16:55 000000010000002B0000002F
-rw------- 1 postgres postgres 16777216 Jul 13 16:55 000000010000002B00000030
-rw------- 1 postgres postgres 16777216 Jul 13 17:01 000000010000002B00000031
-rw------- 1 postgres postgres 16777216 Jul 13 17:16 000000010000002B00000032
-rw------- 1 postgres postgres 16777216 Jul 13 17:21 000000010000002B00000033
-rw------- 1 postgres postgres 16777216 Jul 13 14:31 000000010000002B00000034
-rw------- 1 postgres postgres 16777216 Jul 13 14:32 000000010000002B00000035
-rw------- 1 postgres postgres 16777216 Jul 13 14:19 000000010000002B00000036
-rw------- 1 postgres postgres 16777216 Jul 13 14:36 000000010000002B00000037
-rw------- 1 postgres postgres 16777216 Jul 13 14:37 000000010000002B00000038
-rw------- 1 postgres postgres 16777216 Jul 13 14:38 000000010000002B00000039
-rw------- 1 postgres postgres 16777216 Jul 13 14:39 000000010000002B0000003A
-rw------- 1 postgres postgres 16777216 Jul 13 14:40 000000010000002B0000003B
-rw------- 1 postgres postgres 16777216 Jul 13 14:41 000000010000002B0000003C
-rw------- 1 postgres postgres 16777216 Jul 13 14:41 000000010000002B0000003D
-rw------- 1 postgres postgres 16777216 Jul 13 14:42 000000010000002B0000003E
-rw------- 1 postgres postgres 16777216 Jul 13 14:43 000000010000002B0000003F
-rw------- 1 postgres postgres 16777216 Jul 13 14:33 000000010000002B00000040
-rw------- 1 postgres postgres 16777216 Jul 13 14:34 000000010000002B00000041
-rw------- 1 postgres postgres 16777216 Jul 13 14:45 000000010000002B00000042
-rw------- 1 postgres postgres 16777216 Jul 13 14:55 000000010000002B00000043
-rw------- 1 postgres postgres 16777216 Jul 13 14:55 000000010000002B00000044
-rw------- 1 postgres postgres 16777216 Jul 13 14:55 000000010000002B00000045
-rw------- 1 postgres postgres 16777216 Jul 13 14:55 000000010000002B00000046
-rw------- 1 postgres postgres 16777216 Jul 13 14:55 000000010000002B00000047
-rw------- 1 postgres postgres 16777216 Jul 13 14:55 000000010000002B00000048
-rw------- 1 postgres postgres 16777216 Jul 13 15:09 000000010000002B00000049
-rw------- 1 postgres postgres 16777216 Jul 13 15:25 000000010000002B0000004A
-rw------- 1 postgres postgres 16777216 Jul 13 15:35 000000010000002B0000004B
-rw------- 1 postgres postgres 16777216 Jul 13 15:51 000000010000002B0000004C
-rw------- 1 postgres postgres 16777216 Jul 13 15:55 000000010000002B0000004D
-rw------- 1 postgres postgres 16777216 Jul 13 15:55 000000010000002B0000004E
-rw------- 1 postgres postgres 16777216 Jul 13 15:55 000000010000002B0000004F
-rw------- 1 postgres postgres 16777216 Jul 13 15:55 000000010000002B00000050
-rw------- 1 postgres postgres 16777216 Jul 13 15:55 000000010000002B00000051
-rw------- 1 postgres postgres 16777216 Jul 13 15:55 000000010000002B00000052
-rw------- 1 postgres postgres 16777216 Jul 13 16:19 000000010000002B00000053
-rw------- 1 postgres postgres 16777216 Jul 13 16:35 000000010000002B00000054
drwx------ 2 postgres postgres       96 Jun  4 23:28 archive_status

Please note that the newest file – 000000010000002B00000033 – is neither the first nor the last. And the oldest file – 000000010000002B00000036 – comes quite close after the newest.

This is all natural. All files before the current one are the ones that are still needed, and their mtimes go in the same direction as the WAL segment numbering.

The last file (based on filenames) – *54 – has an mtime just before *2A's, which tells us that it previously was *29, but got renamed when the REDO location moved somewhere into file *2A.

I hope it's clear from the above explanation; if not – please state your questions/concerns in the comments.

So, to wrap it up. WAL exists to save your bacon in case of emergency. Thanks to WAL it is very hard to get into any problems with data – I would even say impossible, but it's still possible in case your hardware misbehaves – like lying about actual disk writes.

WAL is stored as a number of files in the pg_xlog/ directory, and the files get reused, so the directory shouldn't grow. The number of these files is usually 2 * checkpoint_segments + 1.

Whoa? Why 2* checkpoint_segments?

The reason is very simple. Let's assume you have checkpoint_segments set to 5. You filled them all, and a checkpoint is called. The checkpoint is called in WAL segment #x. In #x + 5 we will have another checkpoint. But PostgreSQL always keeps (at least) checkpoint_segments segments ahead of the current location, to avoid the need to create new segments for data from user queries. So, at any given moment, you might have:

  • the current segment
  • checkpoint_segments segments since the REDO location
  • a “buffer" of checkpoint_segments segments in front of the current location

Sometimes you have more writes than checkpoint_segments can absorb, in which case PostgreSQL will create new segments (as I described above). This will inflate the number of files in pg_xlog/. But it will get back to normal after some time – some obsolete segments will simply not get renamed, but removed instead.
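The steady-state count from the three bullets above, as a trivial helper (a rule of thumb only – as a commenter notes below, later versions use a more complex formula involving checkpoint_completion_target and wal_keep_segments):

```python
def expected_wal_files(checkpoint_segments):
    """Usual number of files in pg_xlog/:
    current segment + segments since REDO + ready-to-use buffer."""
    since_redo = checkpoint_segments
    ready_buffer = checkpoint_segments
    return 1 + since_redo + ready_buffer   # = 2 * checkpoint_segments + 1
```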

Finally, one last thing: the GUC checkpoint_warning. It is also (like checkpoint_timeout) an interval, usually a much shorter one – by default 30 seconds. It is used to log (not to WAL, but to the normal log) information when the automated checkpoints happen too often.

Since checkpoint_timeout is supposedly larger than checkpoint_warning, this usually means it alerts when you fill more than checkpoint_segments worth of log within checkpoint_warning time.

Such information looks like this:

2011-07-14 01:03:22.160 CEST @ 7370  LOG:  checkpoint starting: xlog
2011-07-14 01:03:26.175 CEST @ 7370  LOG:  checkpoint complete: wrote 1666 buffers (40.7%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=3.225 s, sync=0.720 s, total=4.014 s; sync files=5, longest=0.292 s, average=0.144 s
2011-07-14 01:03:34.904 CEST @ 7370  LOG:  checkpoints are occurring too frequently (12 seconds apart)
2011-07-14 01:03:34.904 CEST @ 7370  HINT:  Consider increasing the configuration parameter "checkpoint_segments".
2011-07-14 01:03:34.904 CEST @ 7370  LOG:  checkpoint starting: xlog
2011-07-14 01:03:39.239 CEST @ 7370  LOG:  checkpoint complete: wrote 1686 buffers (41.2%); 0 transaction log file(s) added, 0 removed, 3 recycled; write=3.425 s, sync=0.839 s, total=4.334 s; sync files=5, longest=0.267 s, average=0.167 s
2011-07-14 01:03:48.077 CEST @ 7370  LOG:  checkpoints are occurring too frequently (14 seconds apart)
2011-07-14 01:03:48.077 CEST @ 7370  HINT:  Consider increasing the configuration parameter "checkpoint_segments".
2011-07-14 01:03:48.077 CEST @ 7370  LOG:  checkpoint starting: xlog

Please note the “HINT" lines.

These are only hints (that is, not warnings or fatals) because too low a checkpoint_segments doesn't pose any risk to your data – it just might slow down interaction with clients, if a user sends a modification query that has to wait for a new WAL segment to be created (i.e. 16MB written to disk).

As a last note – if you have some kind of monitoring of your PostgreSQL (like cacti, or ganglia, or munin, or something commercial like circonus), you might want to add a graph that shows your WAL progress over time.

To do it, you need to convert the current xlog location to a normal decimal number, and then graph the differences. For example like this:

=$ psql -qAtX -c "select pg_current_xlog_location()" | \
    awk -F/ 'BEGIN{print "ibase=16"} {printf "%s%08s\n", $1, $2}' | \
    bc

Or, if the numbers get too big, use just the decimal “number" of the current file:

=$ (
    echo "ibase=16"
    psql -qAtX -c "select pg_xlogfile_name(pg_current_xlog_location())" | \
        cut -b 9-16,23-24
) | bc
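If shelling out to bc feels fragile, the same conversion can be sketched in Python (the psql invocation mirrors the shell examples above; the pure conversion is split out so it works without a server):

```python
import subprocess

def location_to_decimal(loc):
    """'8/584A62E0' -> one decimal number, like the awk | bc pipeline:
    concatenate the high part with the zero-padded low part, parse as hex."""
    log_id, rec_off = loc.split("/")
    return int(log_id + rec_off.rjust(8, "0"), 16)

def current_location_decimal():
    # needs a reachable server, like the psql calls above
    loc = subprocess.check_output(
        ["psql", "-qAtX", "-c", "select pg_current_xlog_location()"],
        text=True).strip()
    return location_to_decimal(loc)
```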

Drawing increments of the 2nd value (6438394) in 5-minute steps will tell you what the optimal checkpoint_segments is (although always remember to make it a bit larger than actually needed, just in case of a sudden spike in traffic).

  1. 72 comments

  2. # RaFi
    Jul 14, 2011

    “Since checkpoint_timeout is supposedly larger than checkpoint_timeout”

  3. # gj
    Jul 14, 2011

    “Your program god killed ” what did id my program god do again ? πŸ˜‰

  4. Jul 14, 2011

    @Rafi, @GJ:

    Thanks fixed.

    Two comments, both about typos. :(

  5. # Mauro Infantino
    Jul 14, 2011

    This should be part of the official documentation.

    Thank you very much! Great article!

  6. # RaFi
    Jul 14, 2011

    Depesz, Mauro is 101% right: it should be a part of the documentation πŸ˜‰
    So please keep writing πŸ˜‰

    If I can ask for anything: please share with us your magic wisdom of postgresql tuning πŸ˜‰


  7. # jigar shah
    Jul 14, 2011

    amazing job at articulating the whole process. i am a big fan. thank you Hubert.

  8. # Marti Raudsepp
    Jul 15, 2011

    Thanks for the article :)

    I don’t mean to be nitpicky, but you state that flushing dirty pages “happens without any locking of course”

    This is not entirely accurate, shared memory data structures, such as shared buffers, are always protected by lightweight locks. Though these locks are only acquired for short periods and don’t show up in pg_locks.

    Saying that it happens without any _heavyweight_ locks would be correct.

  9. Jul 15, 2011

    Thanks for clarification. You’re right, of course. I meant without heavyweight locks.

  10. Jul 16, 2011

    “Number of these files is usually 2 * checkpoint_segments + 1.”

    Really you should expect anywhere up to ((2 + checkpoint_completion_target) * checkpoint_segments + 1) to be a normal amount. Check out this graph https://skitch.com/xzilla/fkg7f/circonus-wal-files, which shows that 2*checkpoint_segments is often a base line for the number of wal files.

  11. Jul 21, 2011


    Great article !! Would you mind if i translate some of your articles to pt_BR? Im from brazilian postgresql community thanks a lot!!

  12. Jul 21, 2011

    not at all. Would be glad if you could post link to translated version afterwards, but it’s not required.

  13. Jul 23, 2011

    The math on how many segments are kept around (“recycled”) is actually a bit more complicated than that. Starting in 8.3, it’s (2 + checkpoint_completion_target) * checkpoint_segments + 1; the completion target actually causes the end of the checkpoint to slip a bit to the right. And starting in 9.0, it can use the logic checkpoint_segments + wal_keep_segments + 1 for how many to keep instead. It’s also worth noting the WAL recycling stops if you have an archive_command set but it stops executing correctly.

  14. Jul 23, 2011


    thanks for additional info.

    As for archiving – it will be in future, in another blogpost in “Understanding postgresql.conf” post. Not sure when – it has to wait for some inspiration.

  15. # daevy
    Aug 17, 2011

    I understand right that? there can be such a scenario?
    I set TH = 30 min. Checkpoint been executed, the database worked for 25 minutes, and then it crashed. Thus I lose all the changes within 25 minutes after the last checkpoint?

  16. # daevy
    Aug 17, 2011

    oh sorry, read the article more carefully and the question was dropped

  17. # Madhu
    Oct 27, 2011

    Great article. can you talk about the archive_status dir. the files get named with a .done right?
    if i have to manually cleanup pg_xlog (filesystem full cause my slave host is down), is it safe to delete those which have a corresponding .done in archive_status?

  18. # Dinesh
    Dec 18, 2011

    Thank you very much for your wonderful explanation..

  19. # RobJ
    Dec 28, 2011

    IIRC, when Oracle does a checkpoint, it puts the checkpoint number in the header of every datafile, so that during recovery, it knows how recent the datafiles are, and what redo needs to be applied.

    When I look at Postgres 9.0 datafiles under data/base, I see that the timestamps on them aren’t all current. In fact, some of the datafiles have timestamps that are several days old. Does this mean that during a checkpoint, Postgres doesn’t touch all of the datafiles? (That would be a *good* thing for us.)

    If so, then how does Postgres know what to do during recovery? Is it just the DBA’s responsibility to ensure that the datafile backups are current enough to apply the WAL files properly?

  20. Dec 28, 2011

    Solution is very simple – pg stores (not in datafiles, in additional file) last fully written position in WAL.

    So, when it has to do recovery – it just reapplies all changes since the recorded WAL location to newest available WAL location.

  21. # esh
    Jan 26, 2012

    ” While checkpointing
    doesn’ t lock anything, it
    has to be understood that
    checkpoint of (for
    example) 30 segments,
    will cause high-intensive
    write of 480 MB of data
    to disk.”

    Why 480 MB? IMHO it sync only dirty _memory_ (shared buffers).

  22. Jan 26, 2012

    well, I could have said “up to 480MB”.

  23. # Dii
    Mar 10, 2012

    Thank you for this article, exactly what I wanted to know about this topic, great stuff!

  24. # charles
    Apr 3, 2012

    an easy to understand explanation,thanks for the great article.

  25. # esh
    Aug 8, 2012

    Latest checkpoint location: 8/584A6318
    Prior checkpoint location: 8/584A6288
    Latest checkpoint’s REDO location: 8/584A62E0

    Can you explain why postgres need REDO location? Why not just replay WAL from Prior checkpoin location? Thanks!

  26. Aug 9, 2012

    well, not really – I mean – I have been thinking about it too, and there definitely is a good reason (in PostgreSQL every important thing has a reason, even if the reason is not obvious). I know the answer could be found in sources, but I don’t know it now.

    One thing – please notice that REDO location is actually *later* than prior checkpoint location. So having REDO location makes recovery faster.

  27. # Steve Haslam
    Oct 11, 2012

    In my understanding: the redo point is where the checkpoint started, the checkpoint location is where it finished. In order to get back to the state we were in when the checkpoint finished at A6318 we have to play back the changes that were made while the checkpoint was in progress, i.e. from A62E0 to A6318, since these changes may or may not have been included in the blocks that the checkpoint pushed to disk.


  28. # Chiru
    Oct 22, 2012

    Great Article….
    Thank you very much….

  29. # ravi
    Nov 21, 2012

    when I am restoring database,following error came.

    restore_command = ‘cp /opt/pgsqlbackup/wals/%f %p’
    LOG: recovery_target_time = ‘2012-11-20 11:40:50+05:30′
    cp: cannot stat `/opt/pgsqlbackup/wals/00000001.history’: No such file or directory
    LOG: restored log file “000000010000000000000007” from archive
    LOG: record with zero length at 0/70000A0
    LOG: invalid primary checkpoint record
    LOG: restored log file “000000010000000000000007” from archive
    LOG: invalid xl_info in secondary checkpoint record
    PANIC: could not locate a valid checkpoint record
    LOG: startup process (PID 5739) was terminated by signal 6: Aborted
    LOG: aborting startup due to startup process failure
    LOG: database system was shut down at 2012-11-20 11:51:33 IST
    LOG: starting archive recovery

    — what is the solution?


  30. Nov 21, 2012

    1. please note that this is just a blog. if you have technical problems with pg, you’ll be much better off writing to some PostgreSQL related mailing list ( http://www.postgresql.org/community/lists/ ) – most likely – general

    2. looks like the wal segment …7 is broken. why? no idea. there is not even close to enough information.

  31. # ravi
    Nov 21, 2012

    After reading your article, I got an idea of the checkpoint parameters. Thank you for that information.

  32. # kenyon
    Nov 22, 2012

    Thanks for your great article!! I’m from China, a Postgres lover. Would you mind if I translate parts of your article to Chinese? I will post the web link.

  33. Nov 22, 2012

    @Kenyon: I wouldn’t mind. I will be glad to see my texts translated (not that I can read Chinese, unfortunately).

  34. # Jon Coulter
    Dec 6, 2012

    I know you wrote this over a year ago, but I wanted to let you know it is an excellent post with great information.

    I’ve been wondering how/when/why the wal segments are named the way they are as a way of ‘auditing’ backups. In other words, postgres is reporting the “last” archived file is X, so let’s figure out what the last, say, 1000 files before “X” were and make sure they are backed up. Any files missed means we have a broken log chain. I should be able to extrapolate this from this information now.
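    Jon's "last 1000 files before X" audit can be sketched in a few lines of Python. This assumes PostgreSQL ≥ 9.3 segment numbering, where the low 8 hex digits of a name run from 00000000 through 000000FF before the middle field increments (older versions skipped the FF value); the helper name is made up for illustration:

    ```python
    def prev_segments(name, count):
        """Yield the `count` WAL segment names preceding `name` on the same
        timeline, treating (log, seg) as one counter with 256 segments per log."""
        tli = name[0:8]
        combined = int(name[8:16], 16) * 256 + int(name[16:24], 16)
        for i in range(1, count + 1):
            c = combined - i
            if c < 0:
                break  # ran off the start of the timeline
            yield "%s%08X%08X" % (tli, c // 256, c % 256)

    # The two segments right before 000000010000000A00000000:
    print(list(prev_segments("000000010000000A00000000", 2)))
    # ['0000000100000009000000FF', '0000000100000009000000FE']
    ```

    Any name produced by this walk that is missing from the archive would indicate a broken log chain.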


  35. # Mandeep Gandhi
    Mar 12, 2013

    Suppose that I want to check whether my standby database has replayed the last WAL log created by the primary database. Is the only thing to do to check select pg_current_xlog_location() and see that that segment’s WAL log is replayed on the standby?

  36. Mar 12, 2013

    Check the pg_stat_replication view on the master.

  37. # Keith Wiggins
    Mar 21, 2013

    I have been having a hard time with unexpected timeline IDs for my slave when it starts up. I rsync the pg_xlog from the active node prior to starting up. I am hot standby with archive logging. One thing I had noticed is that when a slave became active (after the original active was stopped) it selected a new timeline ID, but I did not see a new history file nor a new WAL file for that new timeline ID. When the slave started up, it complained about the previous timeline with an ‘unexpected timeline id bla’. Should a new active create a new WAL and history file, and would this cause the unexpected timeline id? I perform an rsync on the slave to pull over the pg_xlog before starting up the slave. To clear this, I have to stop/start the active and then start up the slave, then we are ok.


  38. Mar 21, 2013

    To be honest – I don’t know if you’re asking for help, or describing a method to solve a problem. But just in case: never, literally: never, copy pg_xlog data from master to slave.
    pg_xlog on slave should be initially empty.

  39. # Keith Wiggins
    Mar 21, 2013


    Was asking for help :) Thanks for responding so fast, was not expecting that :) I was emptying it out but I had to change that because we have a GEO configuration and the time to rsync was too large. I can put it back to rsync to an empty directory. I guess my question is: when a slave becomes active, should it always create a new history file and a new WAL file for the new timeline id?


  40. Mar 21, 2013

    active? New timeline is when a slave becomes a master (or standalone, which is a master with no slaves).
    If by active you mean queryable – no. You can have a slave that is queryable, and then it uses the timeline from the master.

  41. # Keith Wiggins
    Mar 21, 2013

    Sorry, that’s just our internal description of it, I mean when a slave becomes a master.

  42. Mar 21, 2013

    In such case – yes. Becoming a master is a timeline change.

  43. # Keith Wiggins
    Mar 21, 2013

    I am seeing the timeline change in the postgres log file, and mostly I see the new master create a new history file and WAL file but I have seen where it has created a new timeline but did not create the history or WAL file.

  44. Mar 21, 2013

    Not sure what you mean. You see that it did, but you don’t see where? The new history file should be in pg_xlog, and it will be archived, if archiving is enabled on this server. It’s possible that it will be removed from pg_xlog afterwards, if I recall correctly.

  45. # Keith Wiggins
    Mar 21, 2013


    What I mean is that when a slave becomes master and creates a new timeline, a new .history file and WAL file are not always created in the pg_xlog directory. It doesn’t happen all of the time, but every now and then, and that seems to mess up my slave when it comes up.


  46. # Keith Wiggins
    Mar 21, 2013

    So, if the new timeline is 12 for example, I should see a new 00000012.history and a new WAL file of 000000120000001.. for example. I have seen cases where a new timeline is created (according to the log file) but the history and WAL are not created.

  47. Mar 21, 2013

    the file is *definitely* not 120000001

    it’s just timeline change, so if your xlog normally was 000000110000abcd00000099 new one will be 000000120000abcd0000009a
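    The naming rule depesz describes is easier to see with the name split into its three 8-hex-digit fields; only the first (timeline) field changes at promotion. A small illustrative sketch:

    ```python
    def split_walname(name):
        """Split a 24-character WAL file name into (timeline, log, segment)."""
        return name[0:8], name[8:16], name[16:24]

    print(split_walname("000000110000ABCD00000099"))  # ('00000011', '0000ABCD', '00000099')
    print(split_walname("000000120000ABCD0000009A"))  # ('00000012', '0000ABCD', '0000009A')
    ```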

    as for the lack of files – sorry, without logging in there and checking the logs, plus the logs from the archiving tool, I cannot guess what could be happening.

  48. Mar 21, 2013

    also – please note that this is not an instant messenger. if you have problems – please come to IRC. comments on a blogpost about checkpoints are not the best place for discussing a problem.

  49. # Keith Wiggins
    Mar 21, 2013

    I truncated the log name (didn’t feel like typing it all out). I think you answered my question that there should be a new WAL file. Thanks for your help, very appreciated.


  50. # Cagdas
    Oct 6, 2013

    Well done, this is the easiest yet most explanatory doc on archiving I could find on the internet. Keep up the good work…

  51. Nov 12, 2013

    Great explanation. You mention “Information of all changes gets logged to WAL, but this is done by simple ‘write()’”. Which process takes care of that? WAL writer writes data from WAL buffer to the physical files, right?

  52. Nov 12, 2013

    normal backend.

    Start a backend, before it does anything, attach strace, and then issue some writing command.

    You’ll see something like:

    12:20:25.151575 open("pg_xlog/0000000100000019000000EC", O_RDWR) = 7 <0.000117>
    12:20:25.151833 lseek(7, 245760, SEEK_SET) = 245760 <0.000083>
    12:20:25.152005 write(7, "v\320\5\0\1\0\0\0\0\300\3\354\31\0\0\0\35\1\0\0\0\0\0\0\0\0\0\1\0\0\1\0"..., 8192) = 8192 <0.000160>
  53. Nov 14, 2013

    Thanks. One more – in the event of a transaction with a few updates/inserts, I assume the write() to the data pages will happen immediately, without waiting for a commit. What about writes to the WAL buffer?

  54. Nov 14, 2013

    Oh, I got the answer by reading your blog again – even before commit, the changes will be there in WAL buffer. Right?

  55. Nov 14, 2013


    yes. they will *not* be written to table, though. At least not initially.

  56. Nov 15, 2013

    One more doubt – checkpoints flush all ‘dirty’ data, not necessarily ‘committed’ data?

  57. Nov 15, 2013


  58. # hossein
    Nov 26, 2013

    Please help me – how can we view log files in Postgres?

  59. Nov 27, 2013

    sorry, but:
    1. the question is extremely vague. What kind of logs? Xlogs? text query logs? startup logs?
    2. you might find that mailing lists are the preferred support channel for PostgreSQL. I do welcome questions related to blogposts, but I am, definitely, not an official support channel for Pg.

  60. # hossein
    Nov 27, 2013

    @depesz: can we chat? I need your help.
    I read that log files save records that have information about when a transaction starts, when it commits, and so on. I want to see them for an insert command in Postgres. Please help me.

  61. # hossein
    Nov 27, 2013

    I am waiting for your answer. I have a presentation about “recovery in Postgres” next week. I need your help.

  62. # Fabio Caiut
    Dec 16, 2013

    Thanks, the better explanation about WAL that I have seen!

  63. # Fabio Caiut
    Dec 16, 2013

    Sorry, the best …

  64. # Aleksandr
    May 21, 2014

    “This means that we are using now file xxxxxxxx000000080000058, and PostgreSQL is writing at offset 4A62E0 in it – which is 4874976, which, since the WAL segment is 16MB means that the wal segment is filled in ~ 25% now.”

    I didn’t get it.
    Where this “4874976” comes from?

  65. # Aleksandr
    May 21, 2014

    Oh, I see)
    (4A62E0)₁₆ = (4874976)₁₀

  66. May 21, 2014

    @Aleksandr: yeah, sorry, I should have made it clearer.
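    For reference, the arithmetic from this exchange as a quick Python check (16 MB is the default WAL segment size; note the exact fill fraction comes out closer to 29% than the eyeballed ~25% in the quoted text):

    ```python
    SEGMENT_SIZE = 16 * 1024 * 1024            # default WAL segment size, 16 MB

    offset = int("4A62E0", 16)                 # write offset within the current segment
    print(offset)                              # 4874976
    print(round(100 * offset / SEGMENT_SIZE))  # 29 (percent of the segment used)
    ```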

  67. # yayix
    Jan 14, 2015

    Great blog and with so much details. Keep it coming. I have a question though, what would be the maximum “Latest checkpoint location” from pg_controldata, I’m currently at E0C7/13015CD8 (it is a long-running cluster), will E0C7 increment after FFFF to 10000 and so on?

  68. Jan 14, 2015

    Yes, it will increase. The xlog position rises up to FFFFFFFF/FFFFFFFF. Not sure what happens then, but I don’t think it’s all that important.
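    To put that ceiling in perspective: an xlog location is a 64-bit byte position, so the maximum value corresponds to just under 16 exbibytes of WAL ever written. A quick check (assuming the standard high/low 32-bit split around the slash):

    ```python
    # The maximal "high/low" location interpreted as a 64-bit byte position:
    max_lsn = (0xFFFFFFFF << 32) | 0xFFFFFFFF
    print(max_lsn == 2**64 - 1)   # True
    print(max_lsn // 2**60)       # 15 -> just under 16 EiB of total WAL
    ```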

  69. # yayix
    Jan 14, 2015

    Awesome! That is really good to know. I was under the impression that I may need to reload the db soon and that I could be near wraparound or just plain run out of some limit there. Thanks for the prompt reply – appreciate it.

  70. # Freddy
    Aug 8, 2015

    Nice write up. Good work DEPESZ

  71. # Harsh
    Sep 24, 2015

    This is amazing work…

  72. # BobL
    Oct 13, 2015

    Depesz, you are my hero! That is the most complete, detailed, and understandable explanation of WAL files, checkpoints, and recovery I have read anywhere. This *should* be part of the official documentation. Great job!

  73. # Alexis Daniel Gomez Honig
    Nov 5, 2015

    Depesz, I have a time issue with WAL senders/receiver during the initial backup. It’s taking too much time (4 days).
    I have 14 terabytes of data in a storage. I’ve set up a secondary Postgres: pg_start_backup, rsync via fiber, pg_stop_backup, and finally started the slave.
    But the problem is that the slave server never starts up for queries.
    The WAL sending/receiving process is too slow – 4 days in:
    -bash-4.1$ ps -ef | grep sender
    postgres 25969 23661 0 Nov02 ? 00:00:02 postgres: wal sender process rep streaming 152F/2402DD28)
    With a single gigabit NIC (for 16 MB of data that seems enough).

Leave a comment