Waiting for PostgreSQL 17 – Add support for incremental backup.

On 20th of December 2023, Robert Haas committed patch:

Add support for incremental backup.
 
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup.  You then use BASE_BACKUP with the INCREMENTAL option to take
the backup.  pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
 
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
 
Patch by me.  Reviewed by Matthias van de Meent, Dilip Kumar, Jakub
Wartak, Peter Eisentraut, and Álvaro Herrera. Thanks especially to
Jakub for incredibly helpful and extensive testing.
 
Discussion: http://postgr.es/m/CA+TgmoYOYZfMCyOXFyC-P+-mdrZqm5pP2N7S-r0z3_402h9rsA@mail.gmail.com

This is huge change. While there are already tools that offer incremental backups (like pgBackRest), internal (to pg) pg_basebackup, so far didn't allow it.

It could have been used to make copy off server, and optionally compress/tarball it. But that's about it.

So, let's see how original pg_basebackup works, and what do the new functionality does.

To make sure that I have things changing, on some not-big db, I made table:

=# CREATE TABLE just_for_fun (last_updated timestamptz);
=# INSERT INTO just_for_fun (last_updated) VALUES (now());

And am running:

=# UPDATE just_for_fun SET last_updated = now();

every second.

With this in place, I can run simple pg_basebackup:

=$ mkdir /var/tmp/backups; pg_basebackup -D /var/tmp/backups
 
=$ ls -l /var/tmp/backups/
total 360
-rw------- 1 pgdba pgdba    227 Jan  8 17:16 backup_label
-rw------- 1 pgdba pgdba 226076 Jan  8 17:16 backup_manifest
drwx------ 7 pgdba pgdba   4096 Jan  8 17:16 base/
…
-rw------- 1 pgdba pgdba     88 Jan  8 17:16 postgresql.auto.conf
-rw------- 1 pgdba pgdba  29806 Jan  8 17:16 postgresql.conf

This command basically copied whole PGDATA to /var/tmp/backups, and if I were so inclined, I could have started pg there using pg_ctl -D /var/tmp/backups start. Of course making some changes not to conflict (port number, for example) with existing Pg).

This is plain backup. There is also tar backup:

=$ rm -rf /var/tmp/backups/; mkdir /var/tmp/backups; pg_basebackup -Ft -D /var/tmp/backups
 
=$ ls -l /var/tmp/backups/
total 56176
-rw------- 1 pgdba pgdba   226218 Jan  8 17:19 backup_manifest
-rw------- 1 pgdba pgdba 40509440 Jan  8 17:19 base.tar
-rw------- 1 pgdba pgdba 16778752 Jan  8 17:19 pg_wal.tar

Inside is basically the same thing, but as tar archives (by default with no compression). The interesting thing is the backup_manifest file. It contains:

=$ cat  /var/tmp/backups/backup_manifest  | head -n 10
{ "PostgreSQL-Backup-Manifest-Version": 1,
"Files": [
{ "Path": "backup_label", "Size": 227, "Last-Modified": "2024-01-08 16:21:14 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "f6db08ca" },
{ "Path": "tablespace_map", "Size": 0, "Last-Modified": "2024-01-08 16:21:14 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "00000000" },
{ "Path": "pg_xact/0000", "Size": 8192, "Last-Modified": "2024-01-08 16:21:13 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "c79e44f3" },
{ "Path": "PG_VERSION", "Size": 3, "Last-Modified": "2024-01-08 13:08:53 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "64440205" },
{ "Path": "pg_multixact/offsets/0000", "Size": 8192, "Last-Modified": "2024-01-08 13:09:02 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "23464490" },
{ "Path": "pg_multixact/members/0000", "Size": 8192, "Last-Modified": "2024-01-08 13:08:53 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "23464490" },
{ "Path": "conf.d/depesz.conf", "Size": 512, "Last-Modified": "2024-01-08 13:08:54 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "c6f171e0" },
{ "Path": "pg_ident.conf", "Size": 2640, "Last-Modified": "2024-01-08 13:08:53 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "0ce04d87" },
…
{ "Path": "base/5/2652", "Size": 16384, "Last-Modified": "2024-01-08 13:08:53 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "259eec8e" },
{ "Path": "pg_logical/replorigin_checkpoint", "Size": 8, "Last-Modified": "2024-01-08 16:21:13 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "c74b6748" },
{ "Path": "current_logfiles", "Size": 44, "Last-Modified": "2024-01-08 13:08:54 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "97357c1c" },
{ "Path": "log/postgresql-2024-01-08_140854.log", "Size": 1021834, "Last-Modified": "2024-01-08 16:21:14 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "d2498fb2" },
{ "Path": "global/pg_control", "Size": 8192, "Last-Modified": "2024-01-08 16:21:14 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "43872087" }
],
"WAL-Ranges": [
{ "Timeline": 1, "Start-LSN": "4/86000028", "End-LSN": "4/86000750" }
],
"Manifest-Checksum": "106517baea81404769cd4deb686ff58b58997308f0d90d9afbfa9d0111a5003d"}

This file can be used for verification that backup is complete.

But it can also be used to see what has changed since the backup was made.

So, let's remove this backup, make full one in slightly more complex path:

=$ rm -rf /var/tmp/backups/; mkdir /var/tmp/backups/
=$ pg_basebackup -Ft -D "/var/tmp/backups/$( date +%Y-%m-%d_%H%M%S-FULL )"
 
=$ ls -l /var/tmp/backups/
total 4
drwx------ 2 pgdba pgdba 4096 Jan  8 17:39 2024-01-08_173902-FULL/
 
=$ ls -l /var/tmp/backups/2024-01-08_173902-FULL/
total 56356
-rw------- 1 pgdba pgdba   226219 Jan  8 17:39 backup_manifest
-rw------- 1 pgdba pgdba 40691712 Jan  8 17:39 base.tar
-rw------- 1 pgdba pgdba 16778752 Jan  8 17:39 pg_wal.tar

Sweet. Now, I can:

=$ pg_basebackup -i /var/tmp/backups/2024-01-08_173902-FULL/backup_manifest -Ft -D "/var/tmp/backups/$( date +%Y-%m-%d_%H%M%S-INCREMENTAL )"
pg_basebackup: error: could NOT initiate base backup: ERROR:  incremental backups cannot be taken unless WAL summarization IS enabled
pg_basebackup: removing DATA directory "/var/tmp/backups/2024-01-08_173956-INCREMENTAL"

Ooops. I can't. Well, what is this WAL summarization? Doc for pg_basebackup doesn't mention it. But docs for PostgreSQL configuration do.

So, quick:

$ ALTER system SET summarize_wal = ON;
$ SELECT pg_reload_conf();

and now…

=$ pg_basebackup -i /var/tmp/backups/2024-01-08_173902-FULL/backup_manifest -Ft -D "/var/tmp/backups/$( date +%Y-%m-%d_%H%M%S-INCREMENTAL )"
 
=$ ls -l /var/tmp/backups/
total 8
drwx------ 2 pgdba pgdba 4096 Jan  8 17:39 2024-01-08_173902-FULL/
drwx------ 2 pgdba pgdba 4096 Jan  8 17:40 2024-01-08_174043-INCREMENTAL/
 
=$ ls -l /var/tmp/backups/2024-01-08_174043-INCREMENTAL/
total 23860
-rw------- 1 pgdba pgdba   236528 Jan  8 17:40 backup_manifest
-rw------- 1 pgdba pgdba  7413248 Jan  8 17:40 base.tar
-rw------- 1 pgdba pgdba 16778752 Jan  8 17:40 pg_wal.tar

Sweet. Backup made, and is significantly smaller than previous one (40MB in full, vs. 7MB in incremental).

Interestingly – number of files in manifest is the same:

=$ jq '.Files | length' /var/tmp/backups/2024-01-08_173902-FULL/backup_manifest
1571
 
=$ jq '.Files | length' /var/tmp/backups/2024-01-08_174043-INCREMENTAL/backup_manifest
1571

Quick looking into the manifests shows, that incremental manifest can contain two types of files:

=$ jq .Files[13] /var/tmp/backups/2024-01-08_174043-INCREMENTAL/backup_manifest
{
  "Path": "global/1214_fsm",
  "Size": 24576,
  "Last-Modified": "2024-01-08 16:10:41 GMT",
  "Checksum-Algorithm": "CRC32C",
  "Checksum": "722d586a"
}

and

=$ jq .Files[12] /var/tmp/backups/2024-01-08_174043-INCREMENTAL/backup_manifest
{
  "Path": "global/INCREMENTAL.2695",
  "Size": 12,
  "Last-Modified": "2024-01-08 13:08:53 GMT",
  "Checksum-Algorithm": "CRC32C",
  "Checksum": "e34c7d7c"
}

If the filename doesn't start with INCREMENTAL, then it's normal file, easily available in tar.

But if the file starts with INCREMENTAL, it has to be fetched from some earlier backup.

It looks that to find original file, the path should just have “INCREMENTAL." removed:

=$ jq '.Files[] | select(.Path == "global/2695")' /var/tmp/backups/2024-01-08_173902-FULL/backup_manifest
{
  "Path": "global/2695",
  "Size": 16384,
  "Last-Modified": "2024-01-08 13:08:53 GMT",
  "Checksum-Algorithm": "CRC32C",
  "Checksum": "a3f40679"
}

Though, it caught me by surprise that the checksum doesn't match (e34c7d7c vs. a3f40679), but it could be just because filename is part of data to generate checksum.

OK. But how about restoring such backup? Clearly, do it, I will need to process multiple separate backups, and combine them.

Luckily, there is a tool there, just for that: pg_combinebackup.

So, let's try it:

=$ pg_combinebackup -o /var/tmp/backups/combined /var/tmp/backups/2024-01-08_173902-FULL/ /var/tmp/backups/2024-01-08_174043-INCREMENTAL/
pg_combinebackup: error: could not open file "/var/tmp/backups/2024-01-08_174043-INCREMENTAL//PG_VERSION": No such file or directory

OK. So, it looks that it requires PG_VERSION file in the directory, which means that it has to be unpacked. Which is slightly annoying, but I guess I can work with it:

=$ mkdir /var/tmp/backups/FULL
=$ tar -x -C /var/tmp/backups/FULL -f /var/tmp/backups/2024-01-08_173902-FULL/base.tar
=$ tar -x -C /var/tmp/backups/FULL/pg_wal/ -f /var/tmp/backups/2024-01-08_173902-FULL/pg_wal.tar
=$ cp /var/tmp/backups/2024-01-08_173902-FULL/backup_manifest /var/tmp/backups/FULL/
 
=$ mkdir /var/tmp/backups/INCR
=$ tar -x -C /var/tmp/backups/INCR -f /var/tmp/backups/2024-01-08_174043-INCREMENTAL/base.tar
=$ tar -x -C /var/tmp/backups/INCR/pg_wal -f /var/tmp/backups/2024-01-08_174043-INCREMENTAL/pg_wal.tar
=$ cp /var/tmp/backups/2024-01-08_174043-INCREMENTAL/backup_manifest /var/tmp/backups/INCR/

and now, I can:

=$ pg_combinebackup -o /var/tmp/backups/combined /var/tmp/backups/FULL /var/tmp/backups/INCR

This is cool.

What's even cooler, is that you can easily make incremental backups with increment going from “last full", or “from any previously made". This can be easily used to make any chain of incremental backups.

All in all, I think this is great change. Thanks to everyone involved.

One thought on “Waiting for PostgreSQL 17 – Add support for incremental backup.”

  1. For complicated reasons that I may do a blog post about, we need BOTH the backup_manifest and the data produced by WAL summarization to take a correct incremental backup. The WAL summarization data does most of the work, but the manifest is needed to handle corner cases. Note also that this is block-level incremental backup, so an INCREMENTAL file can contain some blocks while others may need to be fetched from the predecessor backup.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.