Every now and then somebody asks how to make a diff of database schemata.
The usual background is something like: we have a production database and a development database, and we want to see what is different on development, to be able to change production in the same way.
Personally, I think that such an approach is inherently flawed. Why?
First of all – if you have a development database – you might have all kinds of development cruft in there – secondary indexes, semi-temporary tables, and so on.
What's even more important – not all changes are trivial to detect. Let's assume you have a change which splits the "email" column into two different columns – one for the domain and one for the username. The change of schema is nice, but we'll also need to migrate the data.
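To make that concrete, here is what such a patch might look like – a sketch with hypothetical column names; the UPDATE in the middle is exactly the part that no automatic schema differ will ever generate for you:

```sql
-- Hypothetical example: split users.email into username and domain
-- parts. Column names are assumed for illustration only.
BEGIN;
    ALTER TABLE users ADD COLUMN email_user   text;
    ALTER TABLE users ADD COLUMN email_domain text;
    UPDATE users
       SET email_user   = split_part( email, '@', 1 ),
           email_domain = split_part( email, '@', 2 );
    ALTER TABLE users DROP COLUMN email;
COMMIT;
```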
There is also the problem of dependencies, and the correct ordering of changes.
A system might see that we added new tables and foreign keys, but ordering them correctly can be tricky – or even impossible for simple automatic processes.
There is also the matter of rolling a change back – and I'm not talking about SQL "ROLLBACK" – I'm talking about a huge fuckup which generates HTTP 500 errors on your site due to some untested change, but is discovered only after the change has been committed to production.
So. If detecting changes is flawed, what's my solution?
My solution is to work with diffs.
What do I mean?
When I first start with a new database, I make it absolutely empty (plus PL/pgSQL, but that's going to be there by default in 9.0). Then I load my "Versioning" solution. And then I load my schemata.
What is my "Versioning"? It's a set of functions and a table in the database that let me track which patches have been applied and what their dependencies are, plus some very simple logic to manage it all.
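The tracking table can be imagined roughly like this – a minimal sketch; the column names here are my assumptions for illustration, not necessarily the actual Versioning schema:

```sql
-- Illustrative sketch of a patch-tracking table; names are assumed.
CREATE SCHEMA _v;
CREATE TABLE _v.patches (
    patch_name  text        PRIMARY KEY,
    applied_tsz timestamptz NOT NULL DEFAULT now(),
    applied_by  text        NOT NULL DEFAULT current_user,
    requires    text[],
    conflicts   text[]
);
```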
Before I go into the gory technical details, let me walk step by step through what Versioning is and what it's not:
- it is not infallible. It is not meant to be perfect. It is meant as a tool that helps the willing, but it's still perfectly possible to work around it – for good and bad.
- it's lightweight. Just one table (plus eventual user-created ones, but that's a different case) and a handful of functions.
- it's built into database, so it can work without relying on any external tools (although external tools can help with some cases)
- it relies on convention, not strictly enforced rules (due to the first point)
- it's not complete – there are things that I'd like to add there, but as of now – I haven't found the willpower to work on it
- it will not “think" for you. You have to think. It can just catch some simple mistakes.
How does the workflow look?
As I said – I start with an empty database, PL/pgSQL and Versioning (I'm writing it capitalized to make sure it's treated as a product name, and not a noun – perhaps you can find a better name for it?).
Then I write a set of "patches" which add the basic schema. In the beginning this might be a single .sql file which contains the whole schema as we have it developed already.
Afterwards, whenever I want to change something – be it adding a new column, a new table, or dropping something – instead of issuing commands to the development database, I write an .sql script which uses the Versioning functionality to register itself, check its dependencies, and do the actual work.
All such scripts are stored in some kind of VCS (CVS, svn, git, mercurial, whatever).
Whenever we want to push a set of changes to production – we just list which patches have to be applied, and a DBA (or any admin with just rudimentary database skills) can apply them to the production database. As long as all changes are done using Versioning patches – the outcome should be the same.
So why do I bother with "Versioning" at all? I could just as well keep a repo of .sql files without any database logic. The answer is pretty simple – to be able to tell which patches have already been applied, when, and by whom. And to be able to make sure that (for example) the patch adding "sessions" will be applied only after "users-ng" has been applied.
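With such a table in the database, answering "what's applied, when, and by whom" is a plain query – table and column names below are assumed for illustration:

```sql
-- List applied patches, newest first; _v.patches and its columns are
-- illustrative assumptions, not necessarily the real Versioning names.
SELECT patch_name, applied_tsz, applied_by
  FROM _v.patches
 ORDER BY applied_tsz DESC;
```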
Given the above information, let's recap what kind of functionality we need from Versioning:
- Ability to write: “patch named ‘xxx' is (as of now) applied"
- Check if patch named “xxx" is installed, and fail if it is.
- Check if patch named “xxx" is installed, and fail if it is not.
- List currently applied patches
- Remove patch from list of installed
Since there is no need to do checks just for their own sake, I combined the first three functionalities into a single function call. And here we have a working solution.
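In rough form, that combined call could look like this – a simplified sketch of the idea, not the actual function body (conflict handling is omitted):

```sql
-- Simplified sketch of register_patch; not the real implementation.
CREATE FUNCTION _v.register_patch(
    in_patch     text,
    in_requires  text[],
    in_conflicts text[]
) RETURNS void AS $$
BEGIN
    -- check: is this patch already installed? fail if it is
    IF EXISTS ( SELECT 1 FROM _v.patches WHERE patch_name = in_patch ) THEN
        RAISE EXCEPTION 'Patch % is already applied.', in_patch;
    END IF;
    -- check: are all required patches installed? fail if not
    IF in_requires IS NOT NULL AND EXISTS (
        SELECT 1 FROM unnest( in_requires ) AS req ( patch )
        WHERE patch NOT IN ( SELECT patch_name FROM _v.patches )
    ) THEN
        RAISE EXCEPTION 'Some patches required by % are not applied.', in_patch;
    END IF;
    -- register: "patch named xxx is (as of now) applied"
    INSERT INTO _v.patches ( patch_name, requires, conflicts )
        VALUES ( in_patch, in_requires, in_conflicts );
END;
$$ LANGUAGE plpgsql;
```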
How to use it?
Simple – start with an empty database, and load "install.versioning.sql" into it.
Afterwards, in your project, make directories for patches and "rollbacks".
Before I go on, I'll explain why I put rollbacks in quotes. I don't mean rollback in terms of SQL transactions. It's just a way to remove a patch from the database.
Let me show you some example.
Let's assume we have some database which already contains a patch that created the 'users' table (it was named: 'user-management').
So, now we want to add “sessions".
I create a file (in the patches directory) named "user-sessions.sql", which contains:
BEGIN;
    SELECT _v.register_patch( 'user-sessions', ARRAY[ 'user-management' ], NULL );
    CREATE TABLE sessions (
        id           serial PRIMARY KEY,
        user_id      int4 NOT NULL REFERENCES users,
        started_tsz  timestamptz,
        remote_addr  inet,
        user_browser text
    );
    ALTER TABLE users ADD COLUMN last_session_id int4;
    ALTER TABLE users ADD FOREIGN KEY ( last_session_id ) REFERENCES sessions ( id );
COMMIT;
and in the rollbacks directory, I put a file "rollback-user-sessions.sql", which contains:
BEGIN;
    SELECT _v.unregister_patch( 'user-sessions' );
    ALTER TABLE users DROP COLUMN last_session_id;
    DROP TABLE sessions;
COMMIT;
A couple of notes:
- I put my register/unregister function calls at the beginning – thanks to this I avoid a long wait if there is a conflict or some other error with registering the patch
- I name the patch files the same way as the patches themselves – it helps when you need to find the patch that should be installed before the one you have to install
- If you already have a database – hope is not lost – simply do pg_dump -s, wrap it in a transaction, and make it into a "base" patch
- In patches I use transactions – this is to be sure that when running the patches automatically we get sane results – either the patch will be installed, or not, but never partially.
- If you have a patch that has problems with being applied as a single transaction due to locks – you can always make a decision and try to apply it manually, step by step, without transactions – but that should not be the default!
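The "base" patch trick from the notes above can be scripted; here is a sketch, where schema-dump.sql stands in for real `pg_dump -s` output, and 'base' is just an assumed patch name:

```shell
# Fake stand-in for `pg_dump -s` output, so the sketch is self-contained.
printf 'CREATE TABLE users ( id serial PRIMARY KEY );\n' > schema-dump.sql

# Wrap the dump in a transaction plus a register_patch call,
# producing a loadable "base" patch.
{
    echo "BEGIN;"
    echo "SELECT _v.register_patch( 'base', NULL, NULL );"
    cat schema-dump.sql
    echo "COMMIT;"
} > base-patch.sql
```

The resulting base-patch.sql can then be loaded like any other patch.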
Installation of such patches is usually trivial: psql -v ON_ERROR_STOP=1 -f patch-file.sql.
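For a whole list of patches this extends naturally into a loop – a sketch; patches.txt is an assumed file listing patch files in the order they should be applied, and APPLY defaults to just echoing the commands, so the sketch can be dry-run without a database:

```shell
# Dry-run by default: prints the psql invocations instead of running them.
# Set APPLY="psql -v ON_ERROR_STOP=1 -f" to apply for real.
APPLY="${APPLY:-echo psql -v ON_ERROR_STOP=1 -f}"

# Assumed list of patch files, one per line, in dependency order.
printf '%s\n' user-management.sql user-sessions.sql > patches.txt

# Apply each patch in turn, stopping at the first failure.
while IFS= read -r patch_file; do
    $APPLY "$patch_file" || { echo "FAILED: $patch_file" >&2; exit 1; }
done < patches.txt
```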
There is a question – how do we apply a list of 20 patches that have requirements, so that they end up applied in a sane order?
One answer would be to use some kind of numerical id in the names, and force loading in order of id. But this is wrong on many levels.
Let me show you something nicer.
Let's assume we have following patches:
=$ head -n 2 *.sql
==> first.sql <==
BEGIN;
SELECT _v.register_patch( 'first', NULL, NULL );

==> fourth.sql <==
BEGIN;
SELECT _v.register_patch( 'fourth', ARRAY[ 'second', 'first' ], NULL );

==> second.sql <==
BEGIN;
SELECT _v.register_patch( 'second', ARRAY[ 'third' ], NULL );

==> third.sql <==
BEGIN;
SELECT _v.register_patch( 'third', ARRAY[ 'first' ], NULL );
As you can see we have 4 patches, 3 of them requiring some other.
In Versioning, there is a file tools/list-dependencies-from-patches.sh. When run on these files, it shows this output:
=$ ~/projects/private/Versioning/tools/list-dependencies-from-patches.sh *.sql
first first
fourth second
fourth first
fourth fourth
second third
third first
That's not really helpful on its own, but with the help of two other smart tools we can turn it into:
=$ ~/projects/private/Versioning/tools/list-dependencies-from-patches.sh *.sql | tsort | tac
first
third
second
fourth
Great – sorted into the order in which I have to load them into the database to make it all work!
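The same sort can be reproduced with plain coreutils, feeding the dependency pairs by hand – tsort orders each "a b" pair so that a comes before b, and tac flips its output into load order:

```shell
# The pairs below mirror the script output shown above; a "x x" pair
# just declares that x exists, without ordering it against anything.
printf '%s\n' \
    'first first' \
    'fourth second' \
    'fourth first' \
    'fourth fourth' \
    'second third' \
    'third first' \
    | tsort | tac
# prints: first, third, second, fourth (one per line)
```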
Of course it is possible to write a patch that will not be parseable by the list-dependencies-from-patches.sh script, but hey – look at one of the points above:
it is not infallible. It is not meant to be perfect. It is meant as a tool that helps the willing, but it's still perfectly possible to work around it – for good and bad.
So, that's about it. That's what I use, and why I use it. There are bad sides to it, and there are good sides to it, but I think that the benefits outweigh the drawbacks (at least for me).