August 22nd, 2010 by depesz

Every now and then somebody asks how to make a diff of database schemata.

The usual background is: we have a production database and a development database, and we want to see what is different on development, to be able to change production in the same way.

Personally I think that such an approach is inherently flawed. Why?

First of all – if you have a development database – you might have all kinds of development cruft in there – secondary indexes, semi-temporary tables, and so on.

What's also more important – not all changes are trivial to detect. Let's assume that you have a change which splits the "email" column into two different columns – one for the domain and one for the username. The change of schema is nice, but we'll also need to change the data.

There is also the problem of dependencies, and the correct ordering of changes.

A system might see that we added new tables and foreign keys, but ordering them correctly can be tricky – or even impossible for simple automatic processes.

There is also the matter of rolling a change back – and I'm not talking about an SQL "ROLLBACK" – I'm talking about a huge fuckup which generated HTTP/500 errors on your site due to some untested change, but was discovered only after the change had been committed on production.

So. If detecting changes is flawed, what's my solution?

My solution is to work with diffs.

What do I mean?

When I first start with a new database, I make it absolutely empty (plus PL/pgSQL, but that will be there by default in 9.0). Then I load my "Versioning" solution. And then I load my schemata.

What is my "Versioning"? It's a set of functions and a table in the database that let me track which patches have been applied and what their dependencies are, plus some very simple logic to manage it.
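To make this concrete, the bookkeeping table can be pictured roughly like this. This is a hypothetical sketch only – the authoritative definitions are created by install.versioning.sql (in the _v schema), and the real column list may differ:

```sql
-- Hypothetical sketch of the bookkeeping table; the real one is
-- created by install.versioning.sql in the _v schema.
CREATE SCHEMA _v;

CREATE TABLE _v.patches (
    patch_name  text        PRIMARY KEY,                   -- unique name of the patch
    applied_tsz timestamptz NOT NULL DEFAULT now(),        -- when it was applied
    applied_by  text        NOT NULL DEFAULT current_user, -- who applied it
    requires    text[],  -- patches that must already be applied
    conflicts   text[]   -- patches that must NOT be applied
);
```

One row per applied patch is all the state there is – everything else is convention plus a handful of functions.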

Before I go into the gory technical details, let me describe step by step what Versioning is and what it's not:

  • it is not infallible. It is not meant to be perfect. It is meant as a tool that helps the willing, but it's still perfectly possible to work around it – for good and bad.
  • it's lightweight. Just one table (plus possible user-created ones, but that's a different case) and a handful of functions.
  • it's built into the database, so it can work without relying on any external tools (although external tools can help in some cases)
  • it relies on convention, not strictly enforced rules (due to the first point)
  • it's not complete – there are things that I'd like to add, but as of now I haven't found the willpower to work on them
  • it will not "think" for you. You have to think. It can just catch some simple mistakes.

How does the workflow look?

As I said – I start with an empty database, PL/pgSQL, and Versioning (I'm capitalizing it to make sure it's treated as a product name and not a common noun – perhaps you can find a better name for it?).

Then I write a set of "patches" which add the basic schema. In the beginning this might be a single .sql file which contains the whole schema as we have it developed already.

Afterwards, whenever I want to change something – be it adding a new column, a new table, or dropping something – instead of issuing commands to the development database, I write an .sql script which uses the Versioning functionality to register itself, check for dependencies, and do the actual work.

All such scripts are stored in some kind of VCS (CVS, svn, git, mercurial, whatever).

Whenever we want to push a set of changes to production – we just list which patches have to be applied, and a DBA (or any admin with just rudimentary database skills) can apply them to the production database. As long as all changes are done using Versioning patches – the outcome should be the same.

So why do I bother with "Versioning" at all? I could just as well keep a repo of .sql files without any database logic. The answer is pretty simple – to be able to tell which patches have already been applied, when, and by whom. And to be able to make sure that (for example) the patch adding "sessions" will be applied only after "users-ng" has been.

Given the above information, let's recap what kind of functionality we need from Versioning:

  • Ability to record: "patch named 'xxx' is (as of now) applied"
  • Check if patch named "xxx" is installed, and fail if it is.
  • Check if patch named "xxx" is installed, and fail if it is not.
  • List currently applied patches
  • Remove a patch from the list of installed ones

Since there is no need to do checks just for their own sake, I combined the first three functionalities into a single function call. And here we have a working solution.
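The intended semantics of that single call, sketched here as a hedged illustration (the exact signature and error messages are defined by install.versioning.sql; 'my-patch' and 'base-schema' are made-up names):

```sql
-- register_patch( name, requires, conflicts ) – a single call that:
--   1. fails if a patch with this name is already registered,
--   2. fails if any patch listed in "requires" has not been applied yet,
--   3. fails if any patch listed in "conflicts" has been applied,
--   4. otherwise records the patch as applied (as of now).
-- NULL for requires/conflicts means "none".
SELECT _v.register_patch( 'my-patch', ARRAY[ 'base-schema' ], NULL );

-- and the reverse operation, used by "rollbacks":
SELECT _v.unregister_patch( 'my-patch' );
```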

How to use it?

Simple – start with an empty database, and load "install.versioning.sql" into it.

Afterwards, in your project, make directories for patches and "rollbacks".

Before I go on, I'll explain why I put "rollbacks" in quotes. I don't mean rollback in terms of SQL transactions. It's just a way to remove some patch from the database.

Let me show you some example.

Let's assume we have some database which already contains a patch that created the 'users' table (the patch was named 'user-management').

So, now we want to add "sessions".

I create a file (in the patches directory) named "user-sessions.sql", which contains:

BEGIN;
SELECT _v.register_patch( 'user-sessions', ARRAY[ 'user-management' ], NULL );
CREATE TABLE sessions (
    id           serial PRIMARY KEY,
    user_id      int4 NOT NULL REFERENCES users,
    started_tsz  timestamptz,
    remote_addr  inet,
    user_browser text
);
ALTER TABLE users ADD COLUMN last_session_id int4;
ALTER TABLE users ADD FOREIGN KEY (last_session_id) REFERENCES sessions (id);
COMMIT;

and in the rollbacks directory I put a file "rollback-user-sessions.sql", which contains:

BEGIN;
SELECT _v.unregister_patch( 'user-sessions' );
ALTER TABLE users DROP COLUMN last_session_id;
DROP TABLE sessions;
COMMIT;

A couple of notes:

  • I put my register/unregister function calls at the beginning – thanks to this I avoid a long wait if there is a conflict or some other error when registering the patch
  • I put each patch in a file named the same as the patch – it helps when you need to find a patch that should be installed before the one you have to install
  • If you already have a database – hope is not lost – simply do pg_dump -s, add a transaction, and make it into a "base" patch
  • In patches I use transactions – this is to be sure that when running the patches in an automatic way we get sane results – either the patch will be installed, or not, but never partially.
  • If you have a patch that has problems with being applied as a single transaction due to locks – you can always make a decision and try to apply it manually, step by step, without transactions – but that should not be the default!

Installation of such patches is usually trivial: psql -v ON_ERROR_STOP=1 -f patch-file.sql.
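To apply a whole list of patches in one go, a small wrapper helps. The helper below is a hypothetical sketch, not part of Versioning; RUN_SQL is parameterized only so the logic can be exercised without a live database – in real use it would be the psql invocation above:

```shell
# Hypothetical helper: apply patch files in the given order, stop at the
# first failure.  RUN_SQL defaults to psql; override it for dry runs.
RUN_SQL=${RUN_SQL:-'psql -v ON_ERROR_STOP=1 -X -f'}

apply_patches() {
    for patch in "$@"; do
        echo "applying: $patch"
        if ! $RUN_SQL "$patch"; then
            echo "FAILED on $patch - stopping" >&2
            return 1
        fi
    done
}
```

Because every patch is a single transaction, stopping at the first failure leaves the database in a known state: all earlier patches applied, the failing one not.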

There is a question – how do we apply a list of 20 patches that have dependencies, and have them applied in a sane order?

One answer would be to use some kind of numerical id in the names, and force loading in order of id. But this is wrong on many levels.

Let me show you something nicer.

Let's assume we have the following patches:

=$ head -n 2 *.sql
==> first.sql <==
BEGIN;
SELECT _v.register_patch( 'first', NULL, NULL );
 
==> fourth.sql <==
BEGIN;
SELECT _v.register_patch( 'fourth', ARRAY[ 'second', 'first' ], NULL );
 
==> second.sql <==
BEGIN;
SELECT _v.register_patch( 'second', ARRAY[ 'third' ], NULL );
 
==> third.sql <==
BEGIN;
SELECT _v.register_patch( 'third', ARRAY[ 'first' ], NULL );

As you can see we have 4 patches, 3 of them requiring some others.

In Versioning, there is a file tools/list-dependencies-from-patches.sh. When run on these files, it shows this output:

=$ ~/projects/private/Versioning/tools/list-dependencies-from-patches.sh *.sql
first first
fourth second
fourth first
fourth fourth
second third
third first

That's not really helpful by itself, but with the help of two more smart tools we can turn it into:

=$ ~/projects/private/Versioning/tools/list-dependencies-from-patches.sh *.sql | tsort | tac
first
third
second
fourth

Great – sorted into the order in which I have to load them into the database to make it work!
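If you are curious how such a dependency listing can work, here is a self-contained, hypothetical stand-in for list-dependencies-from-patches.sh (the real script may be implemented differently). It greps the register_patch() calls, prints "patch dependency" pairs plus a self-pair (so patches without dependencies are listed too), and feeds them to tsort | tac:

```shell
# Hypothetical minimal stand-in for tools/list-dependencies-from-patches.sh.
list_deps() {
    for f in "$@"; do
        # pull out all quoted names from the register_patch() line:
        # the first one is the patch name, the rest are its dependencies
        grep '_v\.register_patch' "$f" \
            | grep -o "'[^']*'" \
            | tr -d "'" \
            | {
                if read -r patch; then
                    echo "$patch $patch"   # self-pair, so tsort lists it even with no deps
                    while read -r dep; do
                        echo "$patch $dep"
                    done
                fi
              }
    done
}

# recreate the four example patches and sort them
dir=$(mktemp -d)
echo "SELECT _v.register_patch( 'first', NULL, NULL );"                        > "$dir/first.sql"
echo "SELECT _v.register_patch( 'second', ARRAY[ 'third' ], NULL );"           > "$dir/second.sql"
echo "SELECT _v.register_patch( 'third', ARRAY[ 'first' ], NULL );"            > "$dir/third.sql"
echo "SELECT _v.register_patch( 'fourth', ARRAY[ 'second', 'first' ], NULL );" > "$dir/fourth.sql"

SORTED=$(list_deps "$dir"/*.sql | tsort | tac)
echo "$SORTED"    # first, third, second, fourth – the order to apply them in
rm -rf "$dir"
```

Note that tsort treats a pair of identical items as a mere mention of that item, which is exactly why the original script emits lines like "first first".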

Of course it is possible to write a patch that will not be parseable by the list-dependencies script, but hey – look at one of the points above:

it is not infallible. It is not meant to be perfect. It is meant as a tool that helps the willing, but it's still perfectly possible to work around it – for good and bad.

So, that's about it. That's what I use, and why I use it. There are bad sides to it, and good sides to it, but I think the benefits outweigh the drawbacks (at least for me).

  27 comments

  2. # gregj
    Aug 22, 2010

    Another thing that would greatly gain from the ability to have DDL triggers in PostgreSQL.

  3. # Thomas
    Aug 22, 2010

    We are pretty satisfied with using Liquibase to manage all database scripts.

  4. Shouldn’t your title read “how to”?

  5. Aug 22, 2010

    @Caleb:
    Sure. Fixed, thanks.

  6. Aug 22, 2010

    @Thomas:
    it introduces its own "language", which I wanted to avoid. Sooner or later you end up in a situation where only using direct SQL solves the problem.

  7. Aug 22, 2010

    @GregJ:

    Well – to be honest – I’m not really sure how ddl triggers relate to versioning your schema. I mean – I’d like to get ddl triggers, but they seem (to me) orthogonal to versioning.

  8. # Thomas
    Aug 22, 2010

    @depesz: Yes, I know it's "another language", but it is really helpful and you can still use "native" SQL as well. It is a solid platform for these kinds of things.

  9. Aug 22, 2010

    Rails (as in ‘ruby on rails’) has something called ‘migrations’ which more or less does the above.

    Also, in another era many moons ago, I developed such a system for MS SQL Server, where a 'current version' as well as update scripts to go from version A to B were stored in the db itself.

    In Rails migrations you can write schema changes in a db independent way but also use native SQL if needed (there are still people who think only oracle, db2 or mssql are safe for production). There’s also a way to do rollbacks, but I’ve never ever had to use this so far (thank god I guess).

    Version is stored in the db, the update scripts (the actual migrations) are plain text on your filesystem, so you can store them in subversion, git,…

    There's no dependency system, but I have a feeling that might be overkill and make things unnecessarily complicated. Sure, 'update' scripts need to be executed in the correct order (so in a sense each one depends on the previous one), but I've never felt I needed a real dependency system.

    On occasion I’ve used rails migrations without a rails project, just because it’s such a handy and straightforward system to handle db versioning.

    My 2 cents…

  10. # Tom Davis
    Aug 23, 2010

    What method if any do you use for determining whether a rollback does in fact do what is expected?

    Do you and yours enforce a policy to create a rollback at the same time as the patch or do you only write the rollback when you need to revert to a previous version of the database? Or does it depend?

  11. Aug 23, 2010

    @Tom:
    I require a rollback to exist. Always.

    As for making sure it's sane – well, it's free-text SQL, so to analyze anything like this I would have to write an SQL parser and a lot of logic, so I just test it.

    The simplest test that I do perform is: apply patch, apply rollback, apply patch.

    If none of these raises errors – I assume it's fine. I mean – even if it's broken on some non-syntax level (for example: an error in logic/math) – I can fix it with the next patch.

  12. # Thom Brown
    Aug 23, 2010

    Is this similar to Post Facto? (http://www.post-facto.org/)

  13. Aug 23, 2010

    @Thom:
    Not sure. I haven't used Post Facto, but from a quick glance:

    - my versioning doesn't depend on any tools external to the database – it can use them (for example to provide a sorted list of patches), but doesn't *require* them.
    - patches can contain both schema and data changes
    - as I understand it, it *tracks* changes and converts them into .sql files – i.e. it works exactly the way I don't like – there are always cases when deducing how a change was done is not trivial from the outcome. And knowing the process that leads to the outcome (what commands to run) is better than having to find out how to obtain the same result.

  14. # Milos Babic
    Aug 23, 2010

    Can you explain why using numerical IDs in versioning is wrong?

    Personally I use one table (versions) which contains the id of the script being applied, the name of the user who executed the script, a short description, and an indication of whether the script performs DDL, DML, or a mix. A simple "select * from versions order by version_id" gives all scripts that have been executed.

  15. Aug 23, 2010

    @Milos:

    situation:
    I started to work on a new patch, and gave it a new sequential number.

    While working on it, I got a request to add something else, closely related to this new patch I was working on.

    It was so close that I simply refactored the code, put the new feature into a patch with id (for example) "100", and went back to my previous task (90), adding a dependency on 100.

    In such a case – ordering doesn't work, as you might have dependencies that require some "later" patch.

    This is definitely not common – that's why it's often overlooked. It happened to me 2 or 3 times over the years, always when working on a really big change.

  16. Aug 23, 2010

    I'm currently using the open source DbDeploy for my projects (involving Oracle and PostgreSQL). The basic idea looks remarkably similar to your Versioning solution.

  17. Aug 23, 2010

    Much like rails migrations (but without all the rails), I use Sequel to manage the current ‘state’ of the DB. It’s a fantastic DB swiss army knife for Ruby, and allows you to use both SQL and ruby — as much or as little as you like of either.

    With it, I can say “of the last 100 DB changes, I want version 26.” Or “update me to the most recent DDL.” Or even “put the DB into the state it was on November 2nd, 2008.”

    Those DB migrations are generally committed as a part of the codebase, so they are tracked as well.

    Still useful if you’re not using Ruby (albeit definitely more ‘external’, then) — but if you’re already in Ruby and using PG, I find it indispensable.

    http://sequel.rubyforge.org/rdoc/

  18. # Ben Finney
    Sep 1, 2010

    Thanks for reinforcing that version control is for database changes as well.

    I’ve been following a similar procedure to what you outline here. I was informed and inspired by the Refactoring Databases book.

  19. Sep 6, 2010

    Our solution for preparing DB patches is quite simple. All database objects are stored in Subversion; each database object – table, domain, type, etc. – is a separate file. Each version of the system is a separate Subversion tag. When we create a new patch we compare the old tag with the new one (this is normal SVN functionality). All changes – delete/update/add – are traced in such a solution.

  20. Oct 7, 2010

    I never used something like this. What always worked for me was exactly the opposite, which is diffing SQL dumps :-) I simply kept a snapshot of the last development database schema, which usually was also production, and developed my application on a copy of that database. When I was done with the work, I created dumps of the production and development database schemas and let apgdiff (Another PostgreSQL Diff Tool, a free application, can be found at http://apgdiff.startnet.biz/) compare the two dumps and create DDL statements that would transform the original database to the new state.

    If you do things like column splits, database object renames, new columns with NOT NULL constraints without default values, etc., you have to go through the generated file and adjust some statements by hand. But this is generally pretty fast if there are not many changes, and in many cases there were none in my scenarios.

    After this, I put that SQL "diff" file into my application's update routines, so on application startup the database was updated to the latest version. Then I tested whether everything went well on my original copy of the database – that means I ran my application on a database that was in the same state as the production database (or it might also have been just in the state of the previous commit), the application performed the upgrade of the database, and I was done with that commit.

    With this approach, you have schema “diffs” in the form of DDL commands in your repository in case you need to do something with them. You can also do changes in your development database the way you like. You can also use apgdiff anytime, even without any repository.

  21. Oct 7, 2010

    @Miroslav:
    this is exactly the solution that I believe is inherently flawed, please check first couple of paragraphs in this blogpost for explanation why.

  22. Oct 7, 2010

    @Depesz:
    In my opinion it depends on the way developers work. For many developers my solution would not work, and for many would.

    I can comment on your objections to diffing dumps:

    development cruft – this I do not have in my databases, not when I test stuff for a commit, and really seldom at other times

    changes like splitting a column – that would simply appear as dropping one column and adding new columns, which means I have to move the drop lower and add one update statement to the diff – that's all for this case. But it's true that many people think a diffing application can do everything for them, until they dig a little bit deeper into diffing schemas and find out they have to review the diffs before they use them.

    dependencies and ordering – that was seldom an issue for me with apgdiff; sure, it is not clever enough to catch all cases, but those you will catch while testing the database update, which you have to do anyway.

    database changeset rollback – that is not a trivial task whatever method you use, as it covers not only structure but also data, so I prefer fixing the problem and moving forward rather than moving backward and fixing data … but that is just a general approach; what the best solution is really depends on the situation – whether data were lost, how precious they were, whether they can be recovered, etc.

    There is no single best solution for every developer; apparently your solution works for many of them, just as my solution works for others … and many can even profit from both :-)

  23. # Gunnar
    Nov 2, 2010

    Nice solution!
    Here is a script to generate a graph to illustrate the dependencies

    echo "digraph { size=\"16,12\";" > /tmp/file.dot
    /list-dependencies-from-patches.sh /sqlfiles/* |sed 's/ / -> /g' |sed 's/$/;/' >> /tmp/file.dot 
    echo "}" >> /tmp/file.dot
    dot -Tjpeg -Nshape=box /tmp/file.dot > /tmp/file.jpeg
  24. Nov 2, 2010

    @Gunnar:
    nice, thanks. Took the liberty of putting your code in pre-block, with some syntax highlighting, hope you don’t mind.

  25. # Joel Jacobson
    May 20, 2012

    You might want to check out a patch I submitted to pghackers, which adds a --split option to pg_dump. Each table, sequence, view, function, etc. is dumped in a separate file, optimal for version control. It's then super-easy to diff two different schemas – simply use git to compare them, and the summary will easily show which files are affected.

    I've found the summary extremely important – you don't want to be bothered with the actual diff in the first stage. First you want to see which objects changed; many of them can probably be ignored if you are comparing a development database with production, as people tend to add crazy stuff to development databases that is safe to ignore.

    You can find the patch here:

    http://archives.postgresql.org/pgsql-hackers/2010-12/msg02285.php

  26. # Brian White
    Sep 20, 2013

    Stumbled onto this today. Depesz, my thinking is much like yours. For anyone using diff tools – I think your databases must be rather small and have few concurrent users. Redgate is the one I'm most familiar with, and the diffs it creates are simply insane. Something that could be done with 3 lines of SQL turns into 300 lines as your table is torn apart and rebuilt. Just not something you do when you have hundreds of millions of rows in there. 24/7 high-usage sites require careful manual SQL just to change a datatype – typically adding a new column, copying data over in blocks, and then only grabbing a schema lock for a second as you rename the tables, then drop the old one.

    And numbering for versions – at a certain complexity it falls apart. We need to support multiple active branches. Something critical may come up that needs to go to all three. Something else may need to go to 1 and 2. Something else may need to go to 2 and 3. While the normal pattern is to just send to 1. The numbering problem gets to be unsolvable. I'm thinking of the named patches as well.

  27. # Thomas Ramfjord
    Feb 19, 2014

    Do you have a strategy for changing a column type that a view depends on?

    I understand that this should ideally be done within a single transaction, but since each change in your script locks the table, it doesn’t seem like you could re-run the view migration from within the alter column type migration.

    Would you recreate the view code each time?

  28. Feb 20, 2014

    @Thomas:
    not sure what you mean. You just drop the view, alter the table, recreate the view. That's all. Dropping/creating the view is trivial and fast, so I just put it all in a transaction.
