September 30th, 2007 by depesz | Tags: , , | 3 comments »
Did it help? If yes - maybe you can help me?

just recently we got another array for out main production database. this means – we will be able to add new tablespace, thus making everything go faster.

in theory – it's nice. but which tables to move to the other?

the basic assumption is simple – index on table should not be on the same tablespace as the table itself. that's easy. but – should we really put all tables on one tablespace, and all indexes on another?

we decided that the important things that should be “boosted" are seeks and writes. sequential reads are (in our situation) more or less irrelevant.

read on to check how we split the load.

but how do i know how many seeks and writes i have for every table?

luckily – postgres tells it to me:

# select relname, idx_tup_fetch, n_tup_ins, n_tup_upd, n_tup_del from pg_stat_user_tables order by relname;
relname | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del
----------------------------+---------------+-----------+-----------+-----------
browse | 0 | 0 | 0 | 0
last | 249 | 248 | 0 | 247
mailaddresses | 0 | 0 | 0 | 0
messages | [null] | 17577 | 0 | 37
presence | 250 | 1844 | 0 | 1843
presence_history | 0 | 3687 | 0 | 0
privacy | 0 | 0 | 0 | 0
private | 0 | 0 | 0 | 0
roster | 1231 | 295 | 0 | 294
storedsubscriptionrequests | 44 | 35 | 0 | 30
users | 6158 | 1 | 0 | 0
vcard | 821 | 5 | 0 | 4
(12 rows)

this is example output from my jabber database. it shows that the most seeks (which are related to idx_tup_fetch) are in users table. as for writes – i have to sumammarize inserted, updated and deleted rows, but it's not really complicated:

# select relname, idx_tup_fetch as seeks, n_tup_ins + n_tup_upd + n_tup_del as writes from pg_stat_user_tables order by relname;
relname | seeks | writes
----------------------------+--------+--------
browse | 0 | 0
last | 249 | 495
mailaddresses | 0 | 0
messages | [null] | 17614
presence | 250 | 3687
presence_history | 0 | 3687
privacy | 0 | 0
private | 0 | 0
roster | 1231 | 589
storedsubscriptionrequests | 44 | 65
users | 6158 | 1
vcard | 821 | 9
(12 rows)

now, the best option to split all these tables would be to do so in a way that summarized seeks on both tablespaces will be the same (it would mean that we had splitthe seeks load exactly 50/50). same goes for writes.

for example:

let's assume that tablespace “primary" will get tables browse, last and users. while new, secondary tablespace will get rest of them.

primary load would be:

  • seeks = 0 (browse) + 249 (last) + 6158 (users) = 6407.
  • writes = 0 + 495 + 1 = 496

secondary load:

  • seeks = 0 (messages) + 250 (persence) + 1231 (roster) + 44 (storedsubscriptionrequests) + 821 (vcard) = 2346
  • writes = 17614 + 3687 + 3687 + 589 + 65 + 9 = 25651

so, let's see how the load is balanced. seeks: 6407 vs. 2346. which means smaller is around 36.6% of bigger.

in case of writes the lack of balance is even more visible: 496 vs. 25651, so the smaller is only 1.9%.

now – in the best situation – there would be no smaller, so both tablespaces would have “100%" in terms of seeks and writes. combined – it would show 200% (seeks + writes).

based on this assumption i wrote a program which tries to find the best layout.

it connects to database, gets list of tables, their respective seeks and writes numbers, and tries to find the best layout.

output of this program looks like this:

tablespace: primary ||| tablespace: secondary ||| score
--------------------------------------------+++--------------------------------------------+++----------
tables | idx_tup_fetch | writes ||| tables | idx_tup_fetch | writes ||| score
--------+-----------------+-----------------+++--------+-----------------+-----------------+++----------
------: 132 | 24 100 792 839 | 30 964 283 ||| 148 | 12 386 640 794 | 17 829 644 ||| 1.089765
1000: 132 | 20 153 660 201 | 24 973 620 ||| 148 | 16 333 773 432 | 23 820 307 ||| 1.764281
2000: 132 | 20 153 659 880 | 24 973 597 ||| 148 | 16 333 773 753 | 23 820 330 ||| 1.764282
3000: 132 | 20 153 659 880 | 24 973 488 ||| 148 | 16 333 773 753 | 23 820 439 ||| 1.764291
4000: 132 | 20 153 660 109 | 24 973 485 ||| 148 | 16 333 773 524 | 23 820 442 ||| 1.764291
5000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
6000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
7000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
8000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
9000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
10000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
11000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
12000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
13000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
14000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
15000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
16000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
17000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
18000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
19000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
20000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
21000: 132 | 20 153 659 880 | 24 973 485 ||| 148 | 16 333 773 753 | 23 820 442 ||| 1.764291
tablespace: primary ||| tablespace: secondary ||| score
--------------------------------------------+++--------------------------------------------+++----------
tables | idx_tup_fetch | writes ||| tables | idx_tup_fetch | writes ||| score
--------+-----------------+-----------------+++--------+-----------------+-----------------+++----------
------: 133 | 12 947 666 336 | 39 956 104 ||| 147 | 23 539 824 882 | 8 837 991 ||| 0.771225
1000: 133 | 18 158 881 443 | 24 397 048 ||| 147 | 18 328 609 775 | 24 397 047 ||| 1.990740
2000: 133 | 18 158 881 443 | 24 397 048 ||| 147 | 18 328 609 775 | 24 397 047 ||| 1.990740

what it all means?

first the program assigns tables to tablespaces (virtually, of course!) totally randomly. then it calculates achieved “balance".

this is this line:

------: 132 | 24 100 792 839 | 30 964 283 ||| 148 | 12 386 640 794 | 17 829 644 ||| 1.089765

it means that after randomized layout, it did put 132 tables to primary tablespace, and 148 to secondary.

summarized seeks are 24 100 792 839 in case of primary, and 12 386 640 794 for secondary. for writes the numbers were 30 964 283 vs. 17 829 644.

number 1.089… is “balance" score which is calculated like i showed above but not in percents – so score 1.089, means 108.9% (this is summarized for 2 factors).

now, my program tries to randomly swap tables between tablespaces. if new (after swap) score is higher than previous – the move is made. if it is not higher – it is rollbacked (again, virtually).

every 1000 iterations program prints current score:

1000: 132 | 20 153 660 201 | 24 973 620 ||| 148 | 16 333 773 432 | 23 820 307 ||| 1.764281

as this can be seen, after 1000 iterations it did found a nice layout – with score 1.76, and really close numbers of seeks and writes.

after printing out results, program checks if it didn't reach the target for current check – in this case, target was 1.99. if the target is not reached, it makes next swaps.

if 21000 swaps happened, and the target is still not reached, program starts from scratch – gets new data from database, randomizes layout, and starts random swaps.

when target score is reached – program stops.

since every time new line is printed, program outputs current layout (in form of ready sql script) – after finishing the program i have nice sql which can be used to put the tables to their respective tablespaces (assuming they are named primary and secondary).

simple. yet, doing it “by hand" for 300 tables is not something i would call “fun".

additionally the program can be told not to put 2 tables on the same tablespace. this is useful for example if you have triggers that on every insert/update/delete to table “a", automatically update table “b". in such a case putting them (tables “a" and “b") on separate tablespaces will be a good idea.

now, you can get the program from my GitHub (it is named: find.best.tablespace.split.pl).

as usual there is no manual, and the code is ugly, but it should be easily customizable.

important parts:

  1. database connection info is in get_data_from_db() function (last one in code)
  2. list of which tables should not be on the same tablespace as another is at the very beginning, and is stored in %cant_be_together (hash, dictionary).
  3. sql script with commands to move tables to tablespaces is saved as tablespace.split.out in current directory

if you have any questions about it – feel free to mail or comment on blog 🙂

  1. 3 comments

  2. Oct 18, 2007

    Would simulated annealing (or a similar algorithm) help you find a solution without needing to restart?

    i.e. add one function, and change one line to:

    if ($new_result

  3. Oct 18, 2007

    if ($new_result is less than $current_result + slowly_reducing_temperature()) {

  1. 1 Trackback(s)

  2. Dec 14, 2007: Log Buffer #65: a Carnival of the Vanities for DBAs » Hancorp Technologies

Leave a comment