On 20th of December 2023, Robert Haas committed patch:
Add support for incremental backup. To take an incremental backup, you use the new replication command UPLOAD_MANIFEST to upload the manifest for the prior backup. This prior backup could either be a full backup or another incremental backup. You then use BASE_BACKUP with the INCREMENTAL option to take the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST option to trigger this behavior. An incremental backup is like a regular full backup except that some relation files are replaced with files with names like INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains additional lines identifying it as an incremental backup. The new pg_combinebackup tool can be used to reconstruct a data directory from a full backup and a series of incremental backups. Patch by me. Reviewed by Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut, and Álvaro Herrera. Thanks especially to Jakub for incredibly helpful and extensive testing. Discussion: http://postgr.es/m/CA+TgmoYOYZfMCyOXFyC-P+-mdrZqm5pP2N7S-r0z3_402h9rsA@mail.gmail.com
This is huge change. While there are already tools that offer incremental backups (like pgBackRest), internal (to pg) pg_basebackup, so far didn't allow it.
It could have been used to make copy off server, and optionally compress/tarball it. But that's about it.
So, let's see how original pg_basebackup works, and what do the new functionality does.
To make sure that I have things changing, on some not-big db, I made table:
=# CREATE TABLE just_for_fun (last_updated timestamptz); =# INSERT INTO just_for_fun (last_updated) VALUES (now());
And am running:
=# UPDATE just_for_fun SET last_updated = now();
every second.
With this in place, I can run simple pg_basebackup:
=$ mkdir /var/tmp/backups; pg_basebackup -D /var/tmp/backups =$ ls -l /var/tmp/backups/ total 360 -rw------- 1 pgdba pgdba 227 Jan 8 17:16 backup_label -rw------- 1 pgdba pgdba 226076 Jan 8 17:16 backup_manifest drwx------ 7 pgdba pgdba 4096 Jan 8 17:16 base/ … -rw------- 1 pgdba pgdba 88 Jan 8 17:16 postgresql.auto.conf -rw------- 1 pgdba pgdba 29806 Jan 8 17:16 postgresql.conf
This command basically copied whole PGDATA to /var/tmp/backups, and if I were so inclined, I could have started pg there using pg_ctl -D /var/tmp/backups start
. Of course making some changes not to conflict (port number, for example) with existing Pg).
This is plain backup. There is also tar backup:
=$ rm -rf /var/tmp/backups/; mkdir /var/tmp/backups; pg_basebackup -Ft -D /var/tmp/backups =$ ls -l /var/tmp/backups/ total 56176 -rw------- 1 pgdba pgdba 226218 Jan 8 17:19 backup_manifest -rw------- 1 pgdba pgdba 40509440 Jan 8 17:19 base.tar -rw------- 1 pgdba pgdba 16778752 Jan 8 17:19 pg_wal.tar
Inside is basically the same thing, but as tar archives (by default with no compression). The interesting thing is the backup_manifest file. It contains:
=$ cat /var/tmp/backups/backup_manifest | head -n 10 { "PostgreSQL-Backup-Manifest-Version": 1, "Files": [ { "Path": "backup_label", "Size": 227, "Last-Modified": "2024-01-08 16:21:14 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "f6db08ca" }, { "Path": "tablespace_map", "Size": 0, "Last-Modified": "2024-01-08 16:21:14 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "00000000" }, { "Path": "pg_xact/0000", "Size": 8192, "Last-Modified": "2024-01-08 16:21:13 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "c79e44f3" }, { "Path": "PG_VERSION", "Size": 3, "Last-Modified": "2024-01-08 13:08:53 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "64440205" }, { "Path": "pg_multixact/offsets/0000", "Size": 8192, "Last-Modified": "2024-01-08 13:09:02 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "23464490" }, { "Path": "pg_multixact/members/0000", "Size": 8192, "Last-Modified": "2024-01-08 13:08:53 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "23464490" }, { "Path": "conf.d/depesz.conf", "Size": 512, "Last-Modified": "2024-01-08 13:08:54 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "c6f171e0" }, { "Path": "pg_ident.conf", "Size": 2640, "Last-Modified": "2024-01-08 13:08:53 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "0ce04d87" }, … { "Path": "base/5/2652", "Size": 16384, "Last-Modified": "2024-01-08 13:08:53 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "259eec8e" }, { "Path": "pg_logical/replorigin_checkpoint", "Size": 8, "Last-Modified": "2024-01-08 16:21:13 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "c74b6748" }, { "Path": "current_logfiles", "Size": 44, "Last-Modified": "2024-01-08 13:08:54 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "97357c1c" }, { "Path": "log/postgresql-2024-01-08_140854.log", "Size": 1021834, "Last-Modified": "2024-01-08 16:21:14 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "d2498fb2" }, { "Path": "global/pg_control", "Size": 8192, "Last-Modified": "2024-01-08 16:21:14 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "43872087" } ], "WAL-Ranges": [ { "Timeline": 1, "Start-LSN": "4/86000028", "End-LSN": "4/86000750" } ], "Manifest-Checksum": "106517baea81404769cd4deb686ff58b58997308f0d90d9afbfa9d0111a5003d"}
This file can be used for verification that backup is complete.
But it can also be used to see what has changed since the backup was made.
So, let's remove this backup, make full one in slightly more complex path:
=$ rm -rf /var/tmp/backups/; mkdir /var/tmp/backups/ =$ pg_basebackup -Ft -D "/var/tmp/backups/$( date +%Y-%m-%d_%H%M%S-FULL )" =$ ls -l /var/tmp/backups/ total 4 drwx------ 2 pgdba pgdba 4096 Jan 8 17:39 2024-01-08_173902-FULL/ =$ ls -l /var/tmp/backups/2024-01-08_173902-FULL/ total 56356 -rw------- 1 pgdba pgdba 226219 Jan 8 17:39 backup_manifest -rw------- 1 pgdba pgdba 40691712 Jan 8 17:39 base.tar -rw------- 1 pgdba pgdba 16778752 Jan 8 17:39 pg_wal.tar
Sweet. Now, I can:
=$ pg_basebackup -i /var/tmp/backups/2024-01-08_173902-FULL/backup_manifest -Ft -D "/var/tmp/backups/$( date +%Y-%m-%d_%H%M%S-INCREMENTAL )" pg_basebackup: error: could NOT initiate base backup: ERROR: incremental backups cannot be taken unless WAL summarization IS enabled pg_basebackup: removing DATA directory "/var/tmp/backups/2024-01-08_173956-INCREMENTAL"
Ooops. I can't. Well, what is this WAL summarization? Doc for pg_basebackup doesn't mention it. But docs for PostgreSQL configuration do.
So, quick:
$ ALTER system SET summarize_wal = ON; $ SELECT pg_reload_conf();
and now…
=$ pg_basebackup -i /var/tmp/backups/2024-01-08_173902-FULL/backup_manifest -Ft -D "/var/tmp/backups/$( date +%Y-%m-%d_%H%M%S-INCREMENTAL )" =$ ls -l /var/tmp/backups/ total 8 drwx------ 2 pgdba pgdba 4096 Jan 8 17:39 2024-01-08_173902-FULL/ drwx------ 2 pgdba pgdba 4096 Jan 8 17:40 2024-01-08_174043-INCREMENTAL/ =$ ls -l /var/tmp/backups/2024-01-08_174043-INCREMENTAL/ total 23860 -rw------- 1 pgdba pgdba 236528 Jan 8 17:40 backup_manifest -rw------- 1 pgdba pgdba 7413248 Jan 8 17:40 base.tar -rw------- 1 pgdba pgdba 16778752 Jan 8 17:40 pg_wal.tar
Sweet. Backup made, and is significantly smaller than previous one (40MB in full, vs. 7MB in incremental).
Interestingly – number of files in manifest is the same:
=$ jq '.Files | length' /var/tmp/backups/2024-01-08_173902-FULL/backup_manifest 1571 =$ jq '.Files | length' /var/tmp/backups/2024-01-08_174043-INCREMENTAL/backup_manifest 1571
Quick looking into the manifests shows, that incremental manifest can contain two types of files:
=$ jq .Files[13] /var/tmp/backups/2024-01-08_174043-INCREMENTAL/backup_manifest { "Path": "global/1214_fsm", "Size": 24576, "Last-Modified": "2024-01-08 16:10:41 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "722d586a" }
and
=$ jq .Files[12] /var/tmp/backups/2024-01-08_174043-INCREMENTAL/backup_manifest { "Path": "global/INCREMENTAL.2695", "Size": 12, "Last-Modified": "2024-01-08 13:08:53 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "e34c7d7c" }
If the filename doesn't start with INCREMENTAL, then it's normal file, easily available in tar.
But if the file starts with INCREMENTAL, it has to be fetched from some earlier backup.
It looks that to find original file, the path should just have “INCREMENTAL." removed:
=$ jq '.Files[] | select(.Path == "global/2695")' /var/tmp/backups/2024-01-08_173902-FULL/backup_manifest { "Path": "global/2695", "Size": 16384, "Last-Modified": "2024-01-08 13:08:53 GMT", "Checksum-Algorithm": "CRC32C", "Checksum": "a3f40679" }
Though, it caught me by surprise that the checksum doesn't match (e34c7d7c vs. a3f40679), but it could be just because filename is part of data to generate checksum.
OK. But how about restoring such backup? Clearly, do it, I will need to process multiple separate backups, and combine them.
Luckily, there is a tool there, just for that: pg_combinebackup.
So, let's try it:
=$ pg_combinebackup -o /var/tmp/backups/combined /var/tmp/backups/2024-01-08_173902-FULL/ /var/tmp/backups/2024-01-08_174043-INCREMENTAL/ pg_combinebackup: error: could not open file "/var/tmp/backups/2024-01-08_174043-INCREMENTAL//PG_VERSION": No such file or directory
OK. So, it looks that it requires PG_VERSION file in the directory, which means that it has to be unpacked. Which is slightly annoying, but I guess I can work with it:
=$ mkdir /var/tmp/backups/FULL =$ tar -x -C /var/tmp/backups/FULL -f /var/tmp/backups/2024-01-08_173902-FULL/base.tar =$ tar -x -C /var/tmp/backups/FULL/pg_wal/ -f /var/tmp/backups/2024-01-08_173902-FULL/pg_wal.tar =$ cp /var/tmp/backups/2024-01-08_173902-FULL/backup_manifest /var/tmp/backups/FULL/ =$ mkdir /var/tmp/backups/INCR =$ tar -x -C /var/tmp/backups/INCR -f /var/tmp/backups/2024-01-08_174043-INCREMENTAL/base.tar =$ tar -x -C /var/tmp/backups/INCR/pg_wal -f /var/tmp/backups/2024-01-08_174043-INCREMENTAL/pg_wal.tar =$ cp /var/tmp/backups/2024-01-08_174043-INCREMENTAL/backup_manifest /var/tmp/backups/INCR/
and now, I can:
=$ pg_combinebackup -o /var/tmp/backups/combined /var/tmp/backups/FULL /var/tmp/backups/INCR
This is cool.
What's even cooler, is that you can easily make incremental backups with increment going from “last full", or “from any previously made". This can be easily used to make any chain of incremental backups.
All in all, I think this is great change. Thanks to everyone involved.
For complicated reasons that I may do a blog post about, we need BOTH the backup_manifest and the data produced by WAL summarization to take a correct incremental backup. The WAL summarization data does most of the work, but the manifest is needed to handle corner cases. Note also that this is block-level incremental backup, so an INCREMENTAL file can contain some blocks while others may need to be fetched from the predecessor backup.