Is there any good versioning package for database schema and/or data?

Like practically any application, applications that I deal with evolve. They change their requirements when it comes to database storage – new tables, new columns, modified columns. Or perhaps – new base data – like new values in virtually static dictionaries.

For past years we've been working with set of pl/plgsql functions which kept track of “patches", and their dependencies.

The problem with this approach is that it doesn't really scale well, and it generates problems when we use replication (Slony).

The problems mean that we have to apply the patches “by hand" on master/slave server, because downtime in the day is not acceptable, and nobody is willing to do upgrades at 3 am just to be able to add new column.

So, we deal with it. But lately I grew annoyed by this, and started to think about a better way to organize patches.

I “dream" about a system when I would write a patch itself, and the system will make it possible to install and uninstall it with automatic changing database to prior state (this is not simple with things like “update", but it is definitely possible).

System which would “understand" replication, and apply changes to all replicated servers in a way that it will be as safe as possible with no or minimal downtime.

System which would let me track dependencies, and then install them if I'll tell it to install patch, that requires some other patches that were not applied to target database.

Basically – I want something like apt-get/dpkg/rpm for database.

So, writing this seems to be perfectly possible, but is it necessary? Perhaps somebody someplace already wrote such system? Do you know any? Or should I stop whining, sit down and write it myself?

24 thoughts on “Is there any good versioning package for database schema and/or data?”

tmarc says:

2008-10-23 at 12:26

ActiveRecord, migrations?

http://api.rubyonrails.com/classes/ActiveRecord/Migration.html
depesz says:

2008-10-23 at 12:29

@tmarc:
Never played much with it.

Can I do patch with activerecord that will:
1. add new column
2. for all existing rows set new column value to some_old_column * some_other_old_column
3. drop some_old_column
4. insert 2 new rows – with new column data

and then – uninstall the patch with recovering removed data?
Alex says:

2008-10-23 at 13:52

Yes. In Active Record you define a set of SQL commands for moving forward and for moving backward. The steps moving backward essentially “undo” the changes in the active record. These commands can be either schema changes or changes to the data.

I’m planning to build a system that using a migration-like approach for schema changes. This is based on a master database that describes the format of our production database. When I save changes to the database spec, it will create the upward and downward migrations. When we roll out an update for our clients, we’ll insert the new migrations into the database and call a function to roll the database forward to the right version.

I’ll also create a “reorganization” routine that compares the database to the master spec and updates it to be the same. This is the approach we use now and while it works well, it’s quite time-consuming. I believe it will be good to have this routine in case the database gets off of the migration path.

Wish I could say this is built and working, but it’s still in the planning stages. Still, we do use migrations in Rails and we have reorganization routine we use now. I’m simply planning to do this better and completely at the database level.
depesz says:

2008-10-23 at 13:59

@Alex:
But do I *have to* write the undo commands? If yes – then it doesn’t help me.
Also – can it upgrade all servers in slony replication cluster? Especially since not all elements of upgrade patch should be executed on slave.
David Fetter says:

2008-10-23 at 14:12

Depesz,

I know it’s heresy, but the way to handle this is at the organization level, using your SCM system and development and deployment processes. The DBMS never has enough information to handle this on its own.

Cheers,
David.
depesz says:

2008-10-23 at 14:19

@David Fetter:
I’m not sure how this can help.

Let’s assume there is a project, which was already in at least 20 production upgrades, and we have 7 new functionalities to be added. Each functionality is being worked on separately – both in terms of people and code.

We do have scm, and people do write patches to database, and we do keep them in svn. But I want to make writing patches less complicated in *most* cases. Not all, as writing general system which will be able to generate “downgrade” based on “upgrade” in some cases would require AI, but most of the cases are pretty simple.
Vincenzo Romano says:

2008-10-23 at 14:48

I’ve found Druid (http://druid.sourceforge.net/) a very interesting piece of software to work as a DBA/DBD.
It needs some other work for a really complete product, but most of features are already there.
For the very first time your work on the DB is not the DB itself, but is rather saved in an XML file. When you are done, you can commit the changes to the DB or can export it into an SQL script to be run into the SQL client.
Use that along with your favourite revision control system ™ and here comes a solution.
I’d advise you to give it a try.
Norman Yamada says:

2008-10-23 at 15:25

I’ve been looking for a similar thing for my organization; we have multiple instances of a database with multiple schemas and managing changes across schemas has proven to be quite challenging.

For the moment, we’re hand-rolling a small patch system using dbi-link to connect across databases and running queries like this to compare tables:
select row(r.*) as a,
row(t.*) as b
from run_market r
full outer join test.run_market t
on r.mktid = t.mktid and r.sig_sysid = t.sig_sysid and r.pos_sysid = t.pos_sysid
where row(r.*) is distinct from row(t.*);.

This allows us to walk a table quickly and see all the differences between tables (assuming the structure of the table hasn’t changed) and allows us to create both patchsets and reversal sets, which we store in a different database.

While this works for our small shop, it’s difficult to manage, because your patchset can get out of sync unless change control is tightly managed. And we have to handle table structure changes on an ad-hoc basis.

But it would be nice if someone had written a more generalized approach to this problem.
Theory says:

2008-10-23 at 18:42

FYI, Rails migrations are incompatible with Slony and other non-log shipping (e.g., PITR) replication solutions. Furthermore, if you use its methods for updating the database, you’re quite limited in what you can do.

If this is something you want to work on, depesz, I’d be glad to help out!

—Theory
Oskar says:

2008-10-23 at 19:24

Support for several databases and seems to be very flexible. Haven’t tried it myself yet though.
http://www.liquibase.org/
Milan Martak says:

2008-10-23 at 19:29

coming from Oracle background here is what we do (sadly only in Oracle):
1) DEV/TEST/PATCH/PROD instances (depends on you flavor/process)
2) PLSQL Developer (tool from Allround Automations) that provides
i) schema comparison (all objects) between 2 databases/schemas,
output is a “patch” script (alter tables, create or replace, etc., tool
to show the diffs is indeed included)
ii)data comparison between 2 tables (can be in different databases/schemas),
output is a patch script (insert/update/delete)
3) manually promoting the scripts to the TEST/PATCH/PROD as per our process

obviously this will not help you in Postgres – and even this tool/functionality is not perfect – but this is the closest we were able to get with ANY database development.
sadly i know of nothing like this available in Postgres world – i waste my time doing a research from time to time – so we may end-up writing something similar on our own for our Postgres development…

and – trust me i feel your pain as we maintain number of Postgres and Oracle instances w/heavy development in both.
depesz says:

2008-10-23 at 20:10

@Milan Martak:
step 1 – works the same with pg
step 2 – more tricky, but generally doable
step 3 – works the same with pg.

but this is not what i need/want.

let me explain, as there is clearly a lot of misconceptions.

there is a “product”. which is generally a website.

we have teams (plural) working in parallel on several different, not connected features (for example – new option for user data, new type of service to be sold with products, new kind of search).

each of this changes is developed separately, and each generates its own “patch” to database. written by developers.

generally patches contain usually:
1. new column (~ 47% patches)
2. new table (~37%)
3. updates (~ 22%)
4. inserts (~12%)
5. triggers (~10%)
6. dropping column (~8%)
(generates > 100%, because there are patches that do many things).

at some point in time, we generate new release, which contains list of all patches that should be applied to database. this goes through devel and test systems, and then it is applied to production. in more or less one step.

usually, number of patches to be applied is within 1-5.

the problem is that:
1. we require each patch to come with a way to uninstall, including bringing data back to state “as close to original as possible”.
2. on production we use slony replication, which makes things more complicated, as the patches should be ran on slave systems only partially.

at the moment we handle the thing by writing patches in a way that they are directly applicable to non-replicated systems, and they are put into production by “highly skilled professional” (i.e. me 🙂

i need a system that will make *most* of the patches:
1. generate its own uninstall procedures
2. automatically applicable to replicated systems

i’m definitely not pursuing “golden bullet” that will resolve 100% of my problems with patches. i will be happy to have something that can handle basic things like “add table, add column, add trigger, modify function, simple table update”.

i do not want to have graphical “db designers” for it. i just need a way to express:
“create table xxx (id serial primary key, some_fields text)”
in a way that will make it work on repicated systems, and i will not have to write uninstaller by my own.
eggyknap says:

2008-10-23 at 20:20

I’m not convinced that SCM + processes are the best way. I’ve not seen a better one, but I’m unconvinced a better one couldn’t be found, particularly if you elect only to support one particular database. But the sort of thing I’m looking for would take a fair bit of development, probably within the database itself.
Robert Treat says:

2008-10-23 at 20:23

@Milan Martak, there are tools that do database schema comparison for postgres, http://sqlmanager.net/en/products/postgresql/dbcomparer is one of them. There are also tools that will do data comparisons; we played with one not to long ago that did cross-db comparisons (ie. oracle/postgres) via odbc, (sorry, dont remember the name, we gave it up cause it couldnt handle tables as large as we work with, but there are some out there). I also know that some ERD tools are able to generate schema scripts and do schema diff between versions (I used one by Charonware a few years back with Postgres), so that is also an avenue to persue for that type of functionality.

That all said, Personally I think doing the changes manually and checking them into svn (very migrations like) is the way to go. Mainly because some upgrades will require things like data changes that no tool will be able to automate for you. And if you really cant be bothered to write add/drop column statements, use one of the above systems to do some of that work for you (I just tend to find those systems often create as much overhead as they reduce)
Steve says:

2008-10-24 at 08:41

There were also some of the issues for which in my present project I have selected Brainwave Platform Its Schemaless semantic databasebuilt on the RDF model can handle requirement and schema changes easily as there are no tables and columns to bother you.
gj says:

2008-10-24 at 10:50

having multiple ppl working on the same db schema sure must generate loads of problems. Surely what you need, is someone or few folks to concentrate on merging these features toughether, and make sure they work seemlesly. Or even , one team of ppl just doing DB stuff.

Otherwise, I am sure the schemas are full of akward things, and duplications of features.
depesz says:

2008-10-24 at 10:56

@gj:
that’s strange assumption.

when multiple people are working on the same program do we surely get duplication of features? no. it is known for a long time that you can work in groups, and each group works on its own part of the system.

same thing with databases. one group needs to change “users” and “user_data”, some other “search.*” and so on. their work is generally independent on work of other groups.
Joe says:

2008-10-24 at 14:21

Have you taken a look at http://www.agiledata.org/essays/databaseRefactoring.html?
depesz says:

2008-10-24 at 14:38

@Joe:
I just did, but I don’t see how’s that relevant.
gj says:

2008-10-24 at 19:08

all I can say, from expierence – it is not so obvious 🙂 and the different teams, and features they design surely interconnect at some level.
Milan Martak says:

2008-10-26 at 05:52

@Robert Treat, last time i checked the dbcomparer does not work w/8.3 …
Milan Martak says:

2008-10-27 at 02:26

@depesz – understood.
the truth is that we do the DB/data compare to create the scripts and then maintain everything manually in SC and there is no requirement for un-install…
when/if we build something to automate this i will share but i doubt there will be any value for you as we do not care about the uninstall piece.
Carlo A. Curino says:

2008-11-04 at 05:05

Hey,

At UCLA we are working on a system named PRISM

http://yellowstone.cs.ucla.edu/schema-evolution/index.php/Prism

that has as a main pbjectives to:
1) achieve predictability of the evolution process (by predicting impact over schema, data and queries)
2) automate data migration scripts generation (from a higher level specification)
3) automate the query/view adaptation process

At the current stage, provided that the evolution is designed using an SQL-like language we designed, we achieve the above objectives.
In particular, 3 is achieved by rewriting existing queries expressed over an old schema version in (guaranteed to be) equivalent ones over the current DB schema.
At the moment we are working on the rewriting of updates and to support integrity constraints evolution.

I know i’m not talking about an actual product, but about academic prototypes, but hopefully in a reasonable time span we might have something really practical out there for you.

In the mean time you might want to check out a couple of videos of the current system Demo:
http://yellowstone.cs.ucla.edu/schema-evolution/documents/Prism-Demo.mov

And for the more formal/theoretical guys the paper VLDB2008 presentation:
http://yellowstone.cs.ucla.edu/schema-evolution/documents/prism-vldb2008.mov

Your feedbacks are more than welcome… please post a reply or send me emails at carlo (at) curino.us

Bests,
Carlo A. Curino, PhD Candidate
Misha says:

2009-11-09 at 09:35

http://apgdiff.sourceforge.net/ – compares two schema dump files and creates output file that is (after some hand-made modifications) suitable for upgrade of old schema.

Comments are closed.