September 11th, 2013 by depesz | Tags: , , , , | 11 comments »
Did it help? If yes - maybe you can help me? Donate BTC to 19zPa5diT2LZqGtTi8f8bfApLn8rw9zBHx

Recently someone was looking for help with script from Pg wiki.

I never really checked it, but when talking with the guy needing help, I looked at it. And didn't really like it.

For starters – the scripts are huge – over 9kB. Second – they use techniques that are not that great (I'm talking about bash scripting/programming, not Pg related stuff).

Based on problems of the guy from IRC, decided to write something. Maybe it will not have all the functionality, but at the very least – it should be simpler to understand, and fit on single screen.

Before I will go to this, let me first provide some theory.

There are at least two kinds of backups that PostgreSQL handles. pg_dump way – which (more or less) generates SQL script that you can load to any new database using psql, and file-level backup – which is basically tarball of whole data directory, and as such cannot be loaded to existing database, but rather – can be untarred, and Pg can be started from it.

Both approaches have their benefits and drawbacks, of course:

pg_dump is simpler. It is also smaller. You can load just parts of it, and you can load it to any database you want.

file level backup is larger, and cannot be loaded to existing database. On the other hand – it contains everything about the source database, including things like bloat, statistics, and all kind of metainformation.

The script on the wiki page is for pg_dump backups. It's basically a wrapper around pg_dump that does some smart things, and is configurable.

So, let's try to get most of its functionality, without writing 9kB of code.

Simplest backup of database can be done using:

pg_dump -f /var/backups/pg-database_name.sql database_name

Output ( file /var/backups/pg-database_name.sql ) is plain sql script that will create all tables, and load data.

That's fine, but it's better to use so called “custom format". Thanks to this you will get ability to easily pick what to restore, and what not to restore. Luckily the change is simple:

pg_dump -Fc -f /var/backups/pg-database_name.dump database_name

Great. What about compressing it? Well, it just so happens that it already is compressed. The data at least. If you'd do `strings /var/backups/pg-database_name.dump | less` you would notice sql queries, but these are only for schema creation, not data loading!

But this dumps only single database. How can we dump more?

You could, of course, write multiple pg_dump commands. But on the other hand – we have full power of shell to utilize, so why don't we use it?

First, list of all databases, that might contain data, sorted from largest to smallest:

psql -qAtX -c "select datname from pg_database where datallowconn order by pg_database_size(oid) desc"

What it does it gets all rows from pg_database table (which conveniently contains information on all databases, limits to these that users can connect to (it removes template0, which is important for Pg, but not important for backups), and sorts by database size.

Now, with this in place, we can use returned names and run pg_dumps for all of the databases:

psql -qAtX -c "select datname from pg_database where datallowconn order by pg_database_size(oid) desc" | \
    xargs -d'\n' -I{} pg_dump -Fc -f pg-{}.dump {}

This is more complex, so let's look at it more closely.

Output from psql (which is list of databases, one database name per line) is feed to xargs.

Xargs will read it, and split on line end (-d'\n'). Next, for each line, it will run:

pg_dump -Fc -f pg-{}.dump {}

where {} will be changed to database name.

So, for example, if your largest database is named “main_db" it will call:

pg_dump -Fc -f pg-main_db.dump main_db

What's more. xargs program has the great thing that is -P option. This makes it possible to run the backups in parallel.

For example:

psql -qAtX -c "select datname from pg_database where datallowconn order by pg_database_size(oid) desc" | \
    xargs -d'\n' -I{} -P2 pg_dump -Fc -f pg-{}.dump {}

would dump 2 databases at the same time.

With this in place, we can have backup of every database. But – there is no dump of roles. Database dumps to contain tables, data, views, functions and even privileges, but not the users themselves.

To add dump of roles, we need to use pg_dumpall program and its -r option:

pg_dumpall -r > roles.sql

pg_dumpall can be used to dump all databases, but I don't use it because:

  • it dumps everything into one single file
  • it uses plain text format, which means it's more complicated to extract just a bit from the dump

So, script to dump roles, and all databases is currently:

pg_dumpall -r > roles.sql
psql -qAtX -c "select datname from pg_database where datallowconn order by pg_database_size(oid) desc" | \
    xargs -d'\n' -I{} -P2 pg_dump -Fc -f pg-{}.dump {}

Now, you could say that it's unreadable. To this, I can only answer with:

pg_dumpall --roles-only > roles.sql
psql --quiet --no-align --tuples-only --no-psqlrc --command="select datname from pg_database where datallowconn order by pg_database_size(oid) desc" | \
    xargs --delimiter='\n' -I{} --max-procs=2 pg_dump --format=c --file=pg-{}.dump {}

Interestingly, -I option for xargs doesn't have long version (–replace is for -i, not -I).

Anyway. We have the dump. Now what about keeping more than one, and removing old ones?

I assume we're on Linux system with GNU utils. This is important, because on other systems base shell tools are backward compatible, which means – very limited in their uses.

But on Linux we can write:

#!/usr/bin/env bash
cd /var/backups/pg
backup_dir=$( date +%Y-%m-%d )
mkdir "$backup_dir"
cd "$backup_dir"
pg_dumpall -r > roles.dump
psql -qAtX -c "select datname from pg_database where datallowconn order by pg_database_size(oid) desc" | \
    xargs -d'\n' -I{} -P2 pg_dump -Fc -f pg-{}.dump {}
cd /var/backups/pg
find . -type f -mtime +6 -delete
find . -type d -delete 2>/dev/null

This would dump backups to /var/backups/pg/DATE (date in format YYYY-MM-DD), and keep just like week of them.

Now, this script is of course not fully safe. It could potentially remove some files from other directory, if /var/backups/pg wouldn't work. But adding safety precautions, and making it more readable, is still pretty simple:

#!/usr/bin/env bash
# common
top_level_backup_dir=/var/backups/pg
# common
 
cd "$top_level_backup_dir" || exit 1
 
backup_dir=$( date +%Y-%m-%d )
mkdir -p "$backup_dir" || exit 1
cd "$backup_dir"
 
# Make actual backup files
pg_dumpall -r > roles.dump
psql -qAtX -c "select datname from pg_database where datallowconn order by pg_database_size(oid) desc" | \
    xargs -d'\n' -I{} -P2 pg_dump -Fc -f pg-{}.dump {}
 
# Retention policy
cd "$top_level_backup_dir" || exit 1
find . -mindepth 2 -maxdepth 2 -type f -regex '\./[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/.*dump' -mtime +6 -delete
find . -mindepth 1 -maxdepth 1 -type d -regex '\./[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]' -empty -delete

Now you can ask – why there is only one variable in common configuration? Why isn't there PATH, or number of days to keep backups for.

Answer is very simple. PATH should be set properly before the script is run. And the number of days is used only once in script, so moving it to another location, would not make it simpler to change, but instead would make the script larger.

This is actually pretty common problem, which I've seen many times. There is a simple problem to solve (make a backup). Someone decides that it's best to write super-general wrapper around existing tool, so that it can be configured and not written for scratch when new case will happen.

The problem is that it often leads to configurable monstrocity that has thousands of lines of code, and dozens of lines of configuration, just to replace 10 line bash script. And all of this in the name of making it reusable and easier to maintain.

Anyway. The script, as it is, has 683 characters (including all comments, new lines, and so on). It does what one could assume to be sensible backup policy, and is (or at least should be) easy to read and understand by virtually anyone. Or so I hope.

  1. 11 comments

  2. # postgres
    Sep 11, 2013

    For these kind of scripts I prefer to use `set -e’ at the beginning of the bash script, which will make it exit whenever a command fails (instead of adding `|| exit 1′ to all commands):

    > set -e
    > cd “$top_level_backup_dir”
    > backup_dir=$( date +%Y-%m-%d )
    > mkdir -p — “$backup_dir”
    > cd “$backup_dir”
    > [...]

    This also allows you to catch errors from all the `pg_dumpall’, `psql’ and `xargs/pg_dump’.

    Here, your script will exit with a success exit code even if a (or all) pg_dump(s) have failed. Whereas with `set -e’ it will exit with an error exit code at the first error encountered.

    Oh, and by the way, these kind of scripts are always perfectible… :)

  3. Sep 11, 2013

    @postgres:
    Based on http://mywiki.wooledge.org/BashFAQ/105 I tend to not use `set -e` in scripts that I want people to learn from. In my own, I sometimes take the -e shortcut, but am trying to get rid of this.

  4. Sep 11, 2013

    We have a couple scripts to do this. First is a python script (rewritten from an old shell script that I had written) that does basic rotation by keeping x copies of dumps around: https://github.com/pgexperts/pgbackup

    Second one is more robust for keeping various amounts of backups per day/week/month and is written in perl: https://github.com/pgexperts/pg-rolldump

  5. # Joe
    Sep 11, 2013

    The psql commands don’t specify a database name so they will fail if the default dbname doesn’t exist (or PGDATABASE hasn’t been used).

  6. Sep 12, 2013

    @Joe:
    Sure. Neither is hostname, or port, or username. I am working under certain assumptions in here. The most important one being that I’m merely providing a way, not a complete toolset.
    And anyway – running backups should happen from postgres account, and as such it’s unlikely to hit “no such database” error.

  7. # Gabriele
    Sep 13, 2013

    Surely custom format has many advantages w.r.t. plain format.

    Suppose however that you have a fixed amount of space and a number of not so small db. And suppose that you would like to preserve the maximum number of copies of each db.

    Well, if one day nothing changes in a db you can think of symlink backup file and save space.

    Plain format can be md5-summed and symlinked.

    AFAIK custom format cannot since date and time is embedded so even if schema and data are the same the dump is different every day.
    I have not found a way to strip metadata from the dump.

  8. Sep 13, 2013

    @Gabriele:
    just use rdiff-backup to store it.

  9. # Ben
    Sep 14, 2013

    Nice and simple.

    Adding a “psql -l > dabatases.list” may also prove useful at restore time (unless you remember all the dbs encodings, collations, etc).

    Sometimes I also generate a file containing the sizes of each databases. So, when I need to replace a dead server in a hurry, I can choose a suitable one (with enough disk space) and I can partition it as needed. Otherwise I’d have to ask Munin for the last tablespace’s known size on this server.

  10. Sep 17, 2013

    Regarding backup rotation, I wrote a Ruby tool which “understands” date formats and returns files that should be deleted: https://github.com/francois/surrender

    Hope that helps!

  11. # zawada
    Dec 17, 2013

    @Gabriele:

    http://zbackup.org/ – nice tool for storage.
    I use that for KVM images

  12. Have you ever considered publishing an e-book or guest authoring on other sites?
    I have a blog centered on the same information you discuss and would love to have you share some stories/information.
    I know my readers would appreciate your work. If you’re even remotely interested, feel
    free to send me an email.

Leave a comment