April 4th, 2010 by depesz | Tags: , , , , , , , , , | 28 comments »
Did it help? If yes - maybe you can help me?

Question from title sounds weird to you? It's just a ‘rm backup_filename'? Well. I really wish it was so simple in some cases.

One of the servers I'm looking into, there is interesting situation:

  • quite busy database server (2k tps is the low point of the day)
  • very beefy hardware
  • daily backups, each sized at about 100GB
  • backups stored on ext3 filesystem with default options
  • before launching daily backup, script removes oldest backup (we keep 3 days of backups on this machine)

So. The problem is, when doing rm of the oldest backup – load skyrockets, and all concurrent performance goes south. And it wasn't for a second or two. Removal of the file took 9 minutes!

After some searching, we found out that it was because PostgreSQL xlog files shared mountpoint with backups, and when rm kicked in – Pg was not really able to write to them!

So I envisioned a simple test for my local machine to test various ways of handling large files removal:

  • create 15 1GB files with irrelevant content
  • create 3 50GB files, in order, with irrelevant content
  • start script which does a lot of concurrent reads and writes to those 1GB files
  • remove oldest of the 50GB files
  • check what happens

Quick note – since I need to be sure that those 30GB files are in order, after each test, I remove them, and recreate using simple script:

rm -rf rm
mkdir rm
for a in 1 2 3
do
dd if=/dev/zero of=rm/$a bs=1M count=50000
done

During all tests, I have run these 2 “programs":

(
while true
do
uptime;
sleep 1;
done &> /tmp/uptime.log
) &

and:

(
iostat -xk 1 /dev/sdb2 | \
perl -MPOSIX=strftime -ne '
$|=1;
next unless /^(sdb2|\s+\d)/;
if (/^\s+/) {
s/\s*$/ /;
$_=strftime("%H:%M:%S ", localtime time).$_
}
print
' &> /tmp/iostat.log
) &

This provide me with following logs:

=$ tail -n 5 /tmp/uptime.log
23:16:06 up 13:40, 2 users, load average: 0.03, 0.04, 0.03
23:16:07 up 13:40, 2 users, load average: 0.03, 0.04, 0.03
23:16:08 up 13:40, 2 users, load average: 0.02, 0.04, 0.03
23:16:09 up 13:40, 2 users, load average: 0.02, 0.04, 0.03
23:16:10 up 13:40, 2 users, load average: 0.02, 0.04, 0.03

and

=$ tail -n 5 /tmp/iostat.log
23:16:05 0.62 0.00 0.00 0.12 0.00 99.25 sdb2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
23:16:06 0.49 0.00 0.12 0.00 0.00 99.39 sdb2 0.00 9.00 0.00 3.00 0.00 48.00 32.00 0.00 0.00 0.00 0.00
23:16:07 0.86 0.00 0.25 0.00 0.00 98.89 sdb2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
23:16:08 0.00 0.00 0.00 0.00 0.00 100.00 sdb2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
23:16:09 0.73 0.00 0.37 0.00 0.00 98.90 sdb2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

while uptime.log is pretty obvious, meaning of columns in iostat.log is:

timestamp %user %nice %system %iowait %steal %idle device-name rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util

If you're not sure what these mean – please consult the fine manual.

Program that runs io operations (reads and writes) is too long to include in here, but you can download it.

So, first let's get test run, with no removals, to see how my system handles it:

=$ rm /tmp/tester.log; date; sleep 3; time ./test.pl ; date
Sun Apr 4 00:13:46 CEST 2010
 
real 24m45.194s
user 0m17.186s
sys 0m27.532s
Sun Apr 4 00:38:34 CEST 2010

Test is done, so let's see some nice graphs (click to zoom).

One note – due to obvious reason, we don't have uptime info for every second, in such case, I assume it to be the same as in previous second.

test1-blocks.png

test1-iousage.png

test1-load.png

As you can see above adding new workers (every 20 seconds new pair) caused nice increase of load, iowaits and IO usage.

With the exception of time between 00:29:24 and 00:29:46 it was working rather nicely, although differences second-to-second are huge. But let's blame test machine for this.

So, now. I will restart the test, wait for 6 minutes (at this moment, we should have all workers started, and working for at least 40 seconds), and then I'll remove oldest 50GB file:

=$ rm /tmp/tester.log; date; sleep 3; time ./test.pl ; date
Sun Apr 4 02:31:22 CEST 2010
 
real 24m43.989s
user 0m16.699s
sys 0m27.182s
Sun Apr 4 02:56:09 CEST 2010

and in the same time, on another console:

=$ sleep 6m; date; time rm rm/1; date
Sun Apr 4 02:37:25 CEST 2010
 
real 0m56.059s
user 0m0.000s
sys 0m1.920s
Sun Apr 4 02:38:21 CEST 2010

In logs, I can see:

2010-04-04 02:36:29.539921 +0200 : 17343 : all workers started

So, at the moment when rm started, we had all workers ready, and working for 56 seconds.

So, how do graphs look like?

test2-blocks.png

test2-iousage.png

test2-load.png

Again we see some small, temporary drops in blocks per second, but just take a look at how the graphs look like at 2:37:25 (zooming in on those graphs helps a lot)!

Raw numbers:

02:37:25 – 85 reads , 102 writes
02:37:26 – 3 reads , 1 writes
02:37:27 – 1 reads
02:37:34 – 2 reads , 12 writes
02:37:45 – 1 reads
02:38:15 – 5 reads
02:38:21 – 3 reads , 2 writes
02:38:22 – 1 reads
02:38:23 – 3 reads , 2 writes
02:38:24 – 50 reads , 71 writes

All missing data – both for missing seconds, or missing writes – are 0.

For example – between 02:37:46 and 02:38:14 literally no block has been read or written.

Sweet.

Now, that I have replicated the problem. Let's find solution. On ServerFault I was told to use ionice. So, let's try.

I verified that I'm running CFQ:

=$ cat /sys/block/sdb/queue/scheduler
noop anticipatory deadline [cfq]

So, ionice should work nicely. Another test:

=$ rm /tmp/tester.log; date; sleep 3; time ./test.pl ; date
Sun Apr 4 08:55:30 CEST 2010
 
real 24m43.078s
user 0m15.946s
sys 0m27.042s
Sun Apr 4 09:20:16 CEST 2010

=$ sleep 7m; date; time ionice -c2 -n7 rm rm/1; date
Sun Apr 4 09:02:31 CEST 2010
 
real 1m1.625s
user 0m0.000s
sys 0m1.940s
Sun Apr 4 09:03:33 CEST 2010

As You can see, I used -c2 -n7 options to ionice, which seem sane.

Results:

test3-blocks.png

test3-iousage.png

test3-load.png

Oops. Basically – big fail for ionice.

So, what other things can we do? Well, we can try to overwrite the file with file of size 0, and then remove it.

Since these will be 2 operations, I'll add 1 minute delay between them, to see (in case there will be problems) which operation causes the problem: freeing space or actual file removal.

So, let's play:

=$ rm /tmp/tester.log; date; sleep 3; time ./test.pl ; date
Sun Apr 4 10:04:50 CEST 2010
 
real 24m43.726s
user 0m15.799s
sys 0m26.822s
Sun Apr 4 10:29:37 CEST 2010

=$ sleep 7m; date; time cat /dev/null > rm/1; date; sleep 1m; date; time rm rm/1; date
Sun Apr 4 10:11:52 CEST 2010
 
real 0m55.278s
user 0m0.000s
sys 0m1.890s
Sun Apr 4 10:12:48 CEST 2010
Sun Apr 4 10:13:48 CEST 2010
 
real 0m0.012s
user 0m0.000s
sys 0m0.000s
Sun Apr 4 10:13:48 CEST 2010

First of all – as we can see – it was the freeing of space that takes time, not actual rm – just notice the times in above example.

And how do the graphs look like?

test4-blocks.png

test4-iousage.png

test4-load.png

Basically – it's just like normal rm.

Is there no way to do remove this old backup without damaging concurrent performance?

Well. There is hope. There is little known program truncate. It is little known, since it's only relatively new in core utils, despite the fact that it uses syscall that has been available for a long time.

It might be that your system doesn't have truncate. In such case – you can hack your own C code, or use perl -e ‘… truncate …'.

But what can we do with truncate? It's simple – I will truncate the file, in 10MB steps, till it reaches 0 bytes, and them remove it. Between truncations, I'll add some short sleep.

Since I have 50GB file, truncating it by 10MB will take 5000 truncations, so I'll sleep after each truncate 0.02s. Total time of sleep will be 100s.

=$ rm /tmp/tester.log; date; sleep 3; time ./test.pl ; date
Sun Apr 4 11:04:38 CEST 2010
 
real 24m44.519s
user 0m15.869s
sys 0m25.828s
Sun Apr 4 11:29:26 CEST 2010

=$ sleep 7m; date; for a in $( seq 1 5000 ); do truncate -s -10485760 rm/1; sleep 0.02; done; rm rm/1; date
Sun Apr 4 11:11:39 CEST 2010
Sun Apr 4 11:15:42 CEST 2010

First thing that I notice, is that runtime of file removal is 4 minutes, which is longer than I expected. But, let's see the graphs:

test5-blocks.png

test5-iousage.png

test5-load.png

Whoa. A lot of changes.

First of all: removal of files still does influence concurrent operations, but it doesn't stop them. Which is very cool. Remember that while on my test machine I'm removing 50GB from system, and it takes ~ 1 minute (without slowdown induced by truncate), on my real production it was 9 minutes!.

Now. The only question left regarding this is: will is scale?

Let's retry with twice as long sleeps:

=$ rm /tmp/tester.log; date; sleep 3; time ./test.pl ; date
Sun Apr 4 13:54:31 CEST 2010
 
real 24m43.061s
user 0m15.632s
sys 0m26.338s
Sun Apr 4 14:19:17 CEST 2010

=$ sleep 7m; date; for a in $( seq 1 5000 ); do truncate -s -10485760 rm/1; sleep 0.04; done; rm rm/1; date
Sun Apr 4 14:01:33 CEST 2010
Sun Apr 4 14:07:22 CEST 2010

File removal, as expected, was much longer. And how about graphs?

test6-blocks.png

test6-iousage.png

test6-load.png

Hmm … unfortunately it seems that my machine took the liberty to do some extra work in the same time (yeah, sorry – don't have dedicated machine just for such tests), but I think that the timeline, when truncates kicked-in, when it finished, and what was the impact – is clearly visible.

OK. So what did I learn?

First of all – if I'll ever need to use large files, I will steer away from ext3. I am by no means filesystem wizard, but this just seems insane.

Second – if i'll be using ext3 for such cases, I'll just use truncate-loop, and make the rm take even very long time, but not disrupt other things.

Last question is: why it happens?

Well. To be honest – I was not able to find any confirmation on this on the web, so there are 2 options: I am wrong, or my search skills are that good, but I'll describe my idea, so perhaps you can confirm/refute it:

ext3, when freeing disk space is doing some kind of defragmentation, which moves some data around to maximize continuous free disk space.

So, when we had 3 large files (blocks A, B and C), and some free space (.):

AAAAAAAABBBBBBBBCCCCCCCC.....

and I removed whole A:

........BBBBBBBBCCCCCCCC.....

ext3 relocated some blocks (most likely from C):

CCCCCCCCBBBBBBBB.............

Since these were 100GB, it had to read 100GB of C, and write it to new place, and then mark all of previous-C blocks as free. That's at least 200GB in IO!

By doing it step by step, I make each “defrag" work on very small amount of data, so it is fast.

I also theorize that because this is ext3 internal work, ionice settings do not apply to it.

Now. There goes a question to my dear readers – do you agree with my assessment of what's happening and why? Do you know any docs that would confirm/refute this idea, perhaps providing new explanation? Waiting for your comments.

  1. 28 comments

  2. # Macdice
    Apr 4, 2010

    This is not an answer to your question, but may be interesting: At one stage I had MythTV set up to record a large number of TV shows, and delete old ones automatically. The files were very large — more than 1GB — and I was using ext3, and it fairly quickly became apparent that ext3 sucks at deleting big files — it would severely degrade other IO for big chunks of time, wreaking havoc. The advice on MythTV forums etc was to switch to JFS, which has good delete performance. It worked wonderfully — very fast, smooth deletes (no idea if JFS any good for PostgreSQL in general, just though I’d share this anecdote!)

  3. Apr 4, 2010

    I have also experienced this when deleting MythTV recordings on a test machine running Ext3. My production MythTV backend at the time ran Reiser3 which did not have this issue. Today I am using NFS to my Mac Pro running HFSX which is also immune.

    MythTV added sometime ago an option to slowly delete recordings using many truncations. As you have discovered, this works.

    Anybody using this method in production may wish to rename the file before commencing to prevent confusion should the process be interrupted.

  4. Apr 4, 2010

    ext3 is known to be terrible for this.

    The following benchmarks are 2 years old, but already showed ext4 to be much lower-impact at deletes than ext3:

    http://www.bullopensource.org/ext4/20070404/FileDeletion.html

    Jon

  5. Apr 4, 2010

    @Macdice:
    thanks, I heard also a lot of good stuff about XFS, but unfortunately FS change on large databases server is not a thing that you can easily do.

  6. Apr 4, 2010

    @Jason:
    Thanks about information. It seems MythTV guys worked it out quite well.

  7. Apr 4, 2010

    @Jon:
    Thanks for link.

    Time for me is not really important – file can even take multiple hours to remove, but impact on concurrent operations should be minimal.

    From the benchmark it looks like XFS is clear winner. ext3 with extents takes shorter time than ext3, but still writes *very* comparable amount of data. Reads way less, though.

  8. # gregj
    Apr 4, 2010

    probowales albo usunac katalog z plikiem (przenies go tam najpierw), albo zmienic ustawienia ext3, albo kolejne rozwiazanie, to osobny naped na te pliki.

  9. Apr 4, 2010

    @gregj:
    usuniecie katalogu z plikiem nic nie da – przeciez pokazalem, ze istotne jest to, ze jest zwalniane miejsce – wiec co za roznica czy usuwasz plik z katalogiem czy bez.

    co do ustawien ext3 – nie. znasz jakies ustawienia ktore moglyby pomoc?

    co do osobnego napedu – dodawanie nowych dyskow do pracujacego serwera jest dosyc skomplikowane.

  10. any filesystem that uses extents will be better at this. ext4, and xfs I know both use extents (I don’t know about jfs, or reiserfs).

    the reason being is that in ext3 every block used gets recorded block1 is for this file, block2, block3 or whatever… at 4096 blocks for a 100G file this takes quite a while to remove all those entries. When you are using extents this entry is block1..block100G or whatever. It uses a range instead of listing each block sequentially.

    What IO scheduler are you using? you might get better performance out of deadline. I don’t think ionice works with deadline at all though.. nicing is for cfq only (I think, was when it first came out).

    I don’t know that it would help… but perhaps ‘really nicing’ your program would help

    http://xenoterracide.blogspot.com/2008/04/really-nice.html

    this is of course just combining nice and ionice giving it less processor and IO time.

  11. # BCS
    Apr 5, 2010

    I’m at least as in the dark as you are but I don’t *think* ext3 would do defrag like that.

    My guess is that the kernel is trying to do the delete in one step and as a result is locking the file system for that whole interval. OTOH this could in theory be solvable by making the system call code release it’s locks when it’s preempted by a context switch. Given that no one has implemented that, I might be wrong about the root cause (or that fix might be to hard to implement).

  12. Apr 5, 2010

    Being the person who forced the author to use ext3, to satisify support requirements for RHEL5 + qlogic + brocade + NetApp, as NetApp, or most other companies, will not support just your ordinary (no offense to other unsupported distros) linux distro, and this becomes important when you spend $x on your storage solution, I thought I might offer the fruits of the authors suggestions to purging multi-hundred gig files from ext3 file systems, at least from my point of view. The author suggested that we truncate the logs rather than just purge them in one shot. This seemed like a great idea except that only the truncate C libs were included with Centos5 and RHEL5, and not the “truncate” command included with more recent core-utils rpm packages. Other options such as perl/python/name you home made hack existed, but a bet that the truncate command from a newer fedora(12) was based on the same libs for truncate as the current Cent5/RHEL5, a simple copy of the truncate binary from Fedora 12 to Cent 5 worked perfectly. (not yet tested on RHEL5) But having that command made this script possible. And may buy some time until we can migrate to ext4, or do away with OEM support of any kind. (unlikely)

    Again, thanks to the author for his wide and deep knowledge of systems, and excellent understanding of databases, operating systems and storage. And so I give to you, “truncatenator”, a replacement for rm in large file removal crons:

    #!/bin/bash
    # yes I watch Phineas and Ferb on disney, dont ask.
    trunc="/truncatenator/truncate"
    file=$1
    size=`ls -l $file |awk '{print $5}'`
    while [ $size -gt 1000000000 ]; do
    size=`echo $size-100000000|bc`;
    $trunc -c -s -100000000 $file ;
    sleep 0.25;
    done
    rm -f $file
    #end

    This has proved itself on Centos5 with a reasonably fast disk subsystem to be able to remove large files without blocking other processes from disk access. You can adjust to any needed values to allow for your system to not block on file system access. Again, thanks to the author, who has time and again proven his chops to this old dog.
    Cheers
    R

  13. Apr 5, 2010

    I would love to see these same tests run on other filesystems, same criteria.

  14. Apr 5, 2010

    @Allspaw:
    well, I would love to run those tests, but I don’t have hardware to put it on, and running it on virtual box doesn’t make much sense.

  15. we could probably convince phoronix to do them…. they already love doing all kinds of benchmarks…

  16. # aszeszo
    Apr 6, 2010

    Out of curiosity, what kind of hardware is PostgreSQL running on?

  17. Apr 6, 2010

    @Aszeszo:
    beefy. really beefy. sorry – can’t tell anything more.

  18. Apr 7, 2010

    Slow deletion on ext3 is simply because it touches every inode of the file, which turns into a bunch of random writes all over the disk for a giant file. There’s no defragmentation or similar work going on. A volume that’s also holding pg_xlog is a really bad combination because that’s going to force all the inode writes to disk each time a fsync comes in from a WAL write, rather than grouping them up into more efficient, sorted chunks before writing as would normally be the case.

  19. # crg
    Apr 13, 2010

    One thought wanders through my mind:

    why truncate/delete the file? If the backups are simmilar in size you could try to overwrite previous backup file contents with a command simmilar to this one:

    pg_dump … | dd of=backup_file conv=notrunc

  20. Apr 13, 2010

    @CRG:
    1. it’s pretty obscure, and not a lot of people know you can do it
    2. in case next backup is smaller – it leads to errors on restore. these errors can be ignored, but the sole fact that uncompressing is not clean – makes me worried.

  21. # crg
    Apr 13, 2010

    @depesz:

    ad. 1)

    Yeah, it’s not so obvious.

    ad. 2)

    why would that lead to errors? Let’s assume that we use custom backup file format. As far as I understand custom is a binary format that uses file offsets and it’s not a stream like a regular SQL dump.

    If you’re in doubt you could try it on some test db…

  22. Apr 13, 2010

    @CRG:
    it’s tar.gz of $PGDATA, not pg_dump.

    use this technique to overwrite tar, and you’ll see errors at the end of uncompression.

  23. # Thomas Thorne
    Jun 6, 2012

    The article says “As You can see, I used -c2 -n7 options to ionice, which seem sane.” which is true but if you want no disruption then -c3 ‘idle’ would be a better choice than -c2 ‘best-effort’. I use -c3 to build in the background and have found it to work well without causing the build to wait for ever. If you really do have 100% io usage then -c3 will not let the delete ever complete but I don’t expect that is what you have based on the worked test.

    On Serverfault the “use ionice” answer is not very helpful if none of the options are explained, for starters your default process will already have some ionice (equivalent) settings granted by the OS.

  24. # Tim
    Aug 30, 2012

    No doubt, I am going to seem stupid (sigh, not the first time) but I would just try normal practice for low priority I/O interfering with high priority I/O. Isolate the backup files on their own physical disk(s), preferable SLOW ones, with their own mount point(s). Unless Linux is a lot more stupid than I think, this ought to prevent the rm from eating too much in the way of resources, and I cannot imagine locking of anything important being an issue. If you have REALLY beefy hardware, as you say, the CPU and memory usage for this delete should be negligible. The only thing I would double check is that Linux does not decide to increase disk cache substantially during the rm. I doubt it will.

  25. Aug 30, 2012

    @Tim:
    the case is long gone, but backups were on separate set of disks, afair.

  26. # Jules
    Nov 26, 2012

    If the backup dir is on a different disk or lv, why not just write a script to drop it, put it back and then format it?

  27. Nov 26, 2012

    @Jules:
    because you might have for example 4 backups on it, and you want to drop only one? Though it is interesting idea to have separate lv/partition for each backup. not such how well it would scale, though.

  28. # Jules
    Nov 27, 2012

    Yes, of course.
    I have a system that is a disk staging area for an RMAN job. A weekly full followed by daily incrementals. All sent off to tape on a daily basis. Once a week we needed to delete all the files and start again. 4TB took 6 hours. Much faster for us drop the disk and then put it back.

  29. # Malina
    Apr 4, 2013

    @Depesz, sorry for gravedigging here a bit but you said, depend son how it scales… according to your parameters back then, it required either 3 (or 4 for redundancy?) LV partitions .. hardly a huge scaling issue. And since you have a fixed size more or less, you don’t need ot worry about LV resizing either. So like osmeone said… instead of deleting, simply release the whole set of pointers, by a quick format or so 🙂

Leave a comment