A tale of automating tests of Pg with Bash

Word of warning: this blogpost is about thing related to Bash (well, maybe other shells too, didn't really test), but since I found it while doing Pg work, and it might bite someone else doing Pg related work, I decided to add it to “postgresql" tag.

So, due to some work I had to do, I needed a quick, repeatable way to setup some Pg instances, replication between them, and some data loader. All very simple, no real problems. At least that's what I thought…

Let's start with two simple bash scripts:

setup.sh:

#!/usr/bin/env bash
pg_ctl -D test/master start

and generate-load.sh:

#!/usr/bin/env bash
i=0
j=0
while true
do
    seq 1 10 | sed 's/.*/select * from increment();/' | psql -qAtX -p 5432 -U test -d postgres
    echo -n '#'
    i=$(( i + 1 ))
    if (( i == 100 ))
    then
        j=$(( j + 1 ))
        echo " : $j"
        i=0
    fi
    sleep 0.1
done

First script is just one line, that starts Pg, but the other looks more complex. What is does is very simple – it runs psql, calls increment() function in postgres database (some helper function, irrelevant , could be select 1) – 10 times (i.e. 10 times calls the function), then sleeps 0.1s, and repeats.

Every repeat it prints “#" character and every 100 repeats it moves cursor to new line.

Very simple.

When I start it manually, from shell, it looks like this:

=$ ./setup.sh 
server starting
 
=$ ./generate-load.sh 
################################^C
 
=$ ps uxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
test     19409  0.0  0.0  90532  2268 ?        S    May15   0:01 sshd: test@pts/4    
test     19410  0.0  0.0  25084  5264 pts/4    Ss   May15   0:00  \_ -bash
test     12152  0.0  0.0  19668  1308 pts/4    R+   00:14   0:00      \_ ps uxf
test     11925  0.1  0.1 173492 12940 pts/4    S    00:13   0:00 /home/pgdba/work/bin/postgres -D test/master
test     11929  0.0  0.0  27016   804 ?        Ss   00:13   0:00  \_ postgres: logger process                    
test     11931  0.0  0.0 173492   976 ?        Ss   00:13   0:00  \_ postgres: checkpointer process              
test     11932  0.0  0.0 173492   984 ?        Ss   00:13   0:00  \_ postgres: writer process                    
test     11933  0.0  0.0 173492   976 ?        Ss   00:13   0:00  \_ postgres: wal writer process                
test     11934  0.0  0.0 174308  2180 ?        Ss   00:13   0:00  \_ postgres: autovacuum launcher process       
test     11935  0.0  0.0  29112   880 ?        Ss   00:13   0:00  \_ postgres: archiver process   last was 000000010000000000000003
test     11936  0.0  0.0  29248  1076 ?        Ss   00:13   0:00  \_ postgres: stats collector process

All fine. But I'm lazy. So at the end of setup.sh (in my real script, it was longer, but that's irrelevant), I added: ./generate-load.sh, and to make it wait properly, added also -w flag to pg_ctl. So finally setup.sh looks like:

#!/usr/bin/env bash
pg_ctl -D test/master -w START
./generate-LOAD.sh

Theoretically all is fine, there shouldn't be any problems. Right?

Well:

=$ ps uxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
test     19409  0.0  0.0  90532  2268 ?        S    May15   0:01 sshd: test@pts/4    
test     19410  0.0  0.0  25088  5268 pts/4    Ss   May15   0:00  \_ -bash
test     14129  0.0  0.0  19668  1304 pts/4    R+   00:16   0:00      \_ ps uxf
 
=$ ./setup.sh
waiting for server to start.... done
server started
##########################^C
 
=$ ps uxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
test     19409  0.0  0.0  90532  2268 ?        S    May15   0:01 sshd: test@pts/4    
test     19410  0.0  0.0  25088  5268 pts/4    Ss   May15   0:00  \_ -bash
test     14291  0.0  0.0  19668  1304 pts/4    R+   00:16   0:00      \_ ps uxf

Pg got killed when I pressed ctrl-c when executing generate-load.sh!

I tried all kinds of “stuff". Removed -w (and added sleep to prevent generate-load.sh from starting too soon). Added /dev/null redirections for all standard streams. Added nohup. Added “&" at the end.

All for nothing.

When script runs Pg, and later on is CTRL-C'ed – Pg gets stopped too.

That was major WTF.

It looked like SIGINT (ctrl-c) got propagated to already-started, happily running background processes. But why?

RhodiumToad on irc found out why exactly, and what to do to prevent it from happening.

Before I will go into fixing it, let's try to simplify the case. Let's write new script – test.sh, that will do something “like" running pg_ctl:

#!/usr/bin/env bash
perl -e 'fork and exit; sleep(1000000)' &
echo "press ctrl-c, run ps, and notice lack of perl script"
sleep 1000000

When I run it, I get:

=$ ps uxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
test     15064  0.0  0.0  90380  2232 ?        S    00:22   0:00 sshd: test@pts/2    
test     15065  0.2  0.0  24880  5056 pts/2    Ss   00:22   0:00  \_ -bash
test     15323  0.0  0.0  19668  1308 pts/2    R+   00:23   0:00      \_ ps uxf
 
=$ ./test.sh 
press ctrl-c, run ps, and notice lack of perl script
^C
 
=$ ps uxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
test     15064  0.0  0.0  90380  2232 ?        S    00:22   0:00 sshd: test@pts/2    
test     15065  0.2  0.0  24880  5056 pts/2    Ss   00:22   0:00  \_ -bash
test     15334  0.0  0.0  19668  1304 pts/2    R+   00:23   0:00      \_ ps uxf
test     15330  0.0  0.0  17348   276 pts/2    S    00:23   0:00 perl -e fork and exit; sleep(1000000)

WTF?! Why is it there? Clearly bash left it to its work, but why?

Well, Bash didn't leave it alone. It's just that Perl, when started in background ( & at the end ) simply ignores SIGINT (which is what is generated when ctrl-c is pressed). But postgres doesn't ignore it – it exits. Of course, I could modify PostgreSQL sources to make it invulnerable, but somehow this didn't strike me as particularly good idea.

Instead, decided to make my perl “stand-in for PostgreSQL" die on sigint, which is luckily simple:

$ cat test.sh
#!/usr/bin/env bash
perl -e 'fork and exit; $SIG{INT} = sub {die "SIGINT\n" }; sleep(1000000)' &
echo "press ctrl-c, run ps, and notice lack of perl script"
sleep 1000000
 
=$ ps uxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
test     15064  0.0  0.0  90380  2232 ?        S    00:22   0:00 sshd: test@pts/2    
test     15065  0.0  0.0  24880  5056 pts/2    Ss   00:22   0:00  \_ -bash
test     15575  0.0  0.0  19668  1304 pts/2    R+   00:26   0:00      \_ ps uxf
 
=$ ./test.sh 
press ctrl-c, run ps, and notice lack of perl script
^CSIGINT
 
 
=$ ps uxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
test     15064  0.0  0.0  90380  2232 ?        S    00:22   0:00 sshd: test@pts/2    
test     15065  0.0  0.0  24880  5056 pts/2    Ss   00:22   0:00  \_ -bash
test     15624  0.0  0.0  19668  1304 pts/2    R+   00:27   0:00      \_ ps uxf

Now – perl is gone, and it even printed “SIGINT" as requested.

So – clearly Bash is propagating those sigints to previously started processes.

Simple solution is to use so called “monitor" mode in Bash. This is also what interactive shell uses by default – it enables normal job control.

When I'll change test.sh to:

#!/usr/bin/env bash
set -o monitor
perl -e 'fork and exit; $SIG{INT} = sub {die "SIGINT\n" }; sleep(1000000)' &
echo "press ctrl-c, run ps, and notice lack of perl script"
sleep 1000000

When I run it, I get:

=$ ps uxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
test     15064  0.0  0.0  90380  2232 ?        S    00:22   0:00 sshd: test@pts/2    
test     15065  0.0  0.0  24880  5056 pts/2    Ss   00:22   0:00  \_ -bash
test     15784  0.0  0.0  19668  1304 pts/2    R+   00:29   0:00      \_ ps uxf
 
=$ ./test.sh 
press ctrl-c, run ps, and notice lack of perl script
^C
=$ ps uxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
test     15064  0.0  0.0  90380  2232 ?        S    00:22   0:00 sshd: test@pts/2    
test     15065  0.0  0.0  24880  5056 pts/2    Ss   00:22   0:00  \_ -bash
test     15796  0.0  0.0  19668  1304 pts/2    R+   00:29   0:00      \_ ps uxf
test     15792  0.0  0.0  17348   480 pts/2    S    00:29   0:00 perl -e fork and exit; $SIG{INT} = sub {die "SIGINT\n" }; sleep(1000000)

Nice. Better. But why exactly did it help? What has changed? Let's use ps to view some details of processes from within the script:

#!/usr/bin/env bash
perl -e 'fork and exit; $SIG{INT} = sub {die "SIGINT #1\n" }; sleep(10)' &
{
    set -o monitor
    perl -e 'fork and exit; $SIG{INT} = sub {die "SIGINT #2\n" }; sleep(10)'
} &
sleep 1
ps xf o pid,ppid,sid,pgrp,tt,cmd
echo "press ctrl-c"
sleep 1000000

1st perl (SIGINT #1) is called in normal Bash environment.

The second one though is ran separately, with monitor mode enabled.

So, let's see how it works:

=$ ps xf o pid,ppid,sid,pgrp,tt,cmd
  PID  PPID   SID  PGRP TT       CMD
15064 14793 14793 14793 ?        sshd: test@pts/2    
15065 15064 15065 15065 pts/2     \_ -bash
16547 15065 15065 16547 pts/2         \_ ps xf o pid,ppid,sid,pgrp,tt,cmd
 
=$ ./test.sh 
  PID  PPID   SID  PGRP TT       CMD
15064 14793 14793 14793 ?        sshd: test@pts/2    
15065 15064 15065 15065 pts/2     \_ -bash
16552 15065 15065 16552 pts/2         \_ bash ./test.sh
16559 16552 15065 16552 pts/2             \_ ps xf o pid,ppid,sid,pgrp,tt,cmd
16558     1 15065 16556 pts/2    perl -e fork and exit; $SIG{INT} = sub {die "SIGINT #2\n" }; sleep(10)
16557     1 15065 16552 pts/2    perl -e fork and exit; $SIG{INT} = sub {die "SIGINT #1\n" }; sleep(10)
press ctrl-c
^CSIGINT #1
 
 
=$ ps xf o pid,ppid,sid,pgrp,tt,cmd
  PID  PPID   SID  PGRP TT       CMD
15064 14793 14793 14793 ?        sshd: test@pts/2    
15065 15064 15065 15065 pts/2     \_ -bash
16564 15065 15065 16564 pts/2         \_ ps xf o pid,ppid,sid,pgrp,tt,cmd
16558     1 15065 16556 pts/2    perl -e fork and exit; $SIG{INT} = sub {die "SIGINT #2\n" }; sleep(10)

Please notice the values in PGRP column (process group id) for both Perls.

The one ran without monitor mode has the same PGRP as bash script itself. On the other hand – the one that was started within monitor mode has different pgrp (and it's not artefact of using {}s, checked it).

Looks like Bash is sending SIGINT to processes it has within the same PGRP.

This theory looks great until we'll do one slight change:

#!/usr/bin/env bash
perl -e 'fork and exit; $SIG{INT} = sub {die "SIGINT #1\n" }; sleep(10)' &
set -o monitor
perl -e 'fork and exit; $SIG{INT} = sub {die "SIGINT #2\n" }; sleep(10)' &
sleep 1
ps xf o pid,ppid,sid,pgrp,tt,cmd
echo "press ctrl-c"
sleep 1000000

And running it shows:

=$ ps xf o pid,ppid,sid,pgrp,tt,cmd
  PID  PPID   SID  PGRP TT       CMD
15064 14793 14793 14793 ?        sshd: test@pts/2    
15065 15064 15065 15065 pts/2     \_ -bash
17010 15065 15065 17010 pts/2         \_ ps xf o pid,ppid,sid,pgrp,tt,cmd
 
=$ ./test.sh 
  PID  PPID   SID  PGRP TT       CMD
15064 14793 14793 14793 ?        sshd: test@pts/2    
15065 15064 15065 15065 pts/2     \_ -bash
17015 15065 15065 17015 pts/2         \_ bash ./test.sh
17021 17015 15065 17021 pts/2             \_ ps xf o pid,ppid,sid,pgrp,tt,cmd
17020     1 15065 17017 pts/2    perl -e fork and exit; $SIG{INT} = sub {die "SIGINT #2\n" }; sleep(10)
17019     1 15065 17015 pts/2    perl -e fork and exit; $SIG{INT} = sub {die "SIGINT #1\n" }; sleep(10)
press ctrl-c
^C
=$ ps xf o pid,ppid,sid,pgrp,tt,cmd
  PID  PPID   SID  PGRP TT       CMD
15064 14793 14793 14793 ?        sshd: test@pts/2    
15065 15064 15065 15065 pts/2     \_ -bash
17031 15065 15065 17031 pts/2         \_ ps xf o pid,ppid,sid,pgrp,tt,cmd
17020     1 15065 17017 pts/2    perl -e fork and exit; $SIG{INT} = sub {die "SIGINT #2\n" }; sleep(10)
17019     1 15065 17015 pts/2    perl -e fork and exit; $SIG{INT} = sub {die "SIGINT #1\n" }; sleep(10)

Apparently globally enabled monitor mode stops Bash from sending SIGINT even to processes in it's group.

To be honest, I don't fully understand what's going on, and why monitor mode (apparently) disables all SIGINT propagation, but the final thought is: if you want to start Pg (or anything else that dies on SIGINT) from a script, and this script can be later on killed by ctrl-c, make sure you use monitor-mode.

This can be used also, as I showed, for single command:

...
{
    set -o monitor
    pg_ctl ...
} &
...

Hope it helps.

12 thoughts on “A tale of automating tests of Pg with Bash”

  1. You’re looking for the ‘setsid’ command. On my machine this is in the ‘util-linux’ package, so I guess it might not exist on other OSs. It’s a pretty simple wrapper around the POSIX setsid() API, though, which is available by default in both Perl and Python, so it shouldn’t be too hard to write such a script yourself.

  2. @Screwtape:

    portability is a problem. And sure, I could write simple “daemonizer” script in Perl, but it just seems weird that I had to go through all the problem with starting daemon from *shell* – which supposedly exists to run other programs 🙂

    Anyway, I got my script fixed, and setsid is definitely a possibility if one is lucky enough to be sure that all he needs is Linux.

  3. @Gotar:
    did you check it?
    I just did, and both “nohup” and disown did *not* help – the process is killed.

    I really did test all:
    – redirections of std*
    – nohup
    – disown

    and all of these keep killing (sigint) the process.

  4. Hello,

    I use a custom script to start postgres relying on a bash specific internal function close to nohup, but which remove the command from the job table: disown.

    I found this while messing with “pg_ctl -w” with no success to be able to know exactly if postgres succed to start or not. Last time I checked, this was still a running discussion on pgsql-hackers.

    Main part of my script about this is:

    postmaster -D “$PGDATA” >> “$PGLOGS/postgresql-$1.log” 2>&1 &
    PGPID=$!
    disown $PGPID

    …then some more code related to linux to wait until I know it start or failed. I can share my script if needed.

    Cheers,

  5. @ioguix:
    it doesn’t help in any way with the problem I have.

    Do what I was describing: start pg, in any way you want, and then put in the same script – sleep 10000. and ctrl-c the script execution while sleeping.
    you will notice that the pg will get killed too, just as I described. disown does not help with it, and simplest possible test shows. like this one:

    #!/usr/bin/env zsh
    postmaster -D test/master >> /tmp/log 2>&1 &
    PGPID=$!
    disown $PGPID
    ps uxf
    echo "press ctrl-c, run ps, and notice lack of perl script"
    sleep 1000000
  6. Oh, I kept the page opened since a bunch of time and didn’t noticed the other answers.

    I will make some further tests with disown, but according to the bash manual and my experience with this script (which has some years now), it should certainly not receive the SIGINT and stay up.

    The only difference with your use case is that I use this code in a bash function I source-ed from my .bashrc, not from a script. But when I close a terminal where I started one or more pgsql cluster, they are definitely not killed and I can see the processes re-ordered with a “ps fax”.

  7. @ioguix:

    I KNOW.
    That’s the whole point. If the script that started pg has ended, or you’re running in interactive shell, all works fine.

    But if you are using a script (i.e. not interactive shell), and you do stuff after pg has been started, and you ctrl-c the script – it kills pg too.

    Sorry – I just don’t know how to be clearer on this – all the comments so far suggest testing disown/nohup/redirections – where I thought that I was clear that none of these does work. And even 2 line shell script can prove that SIGINT is propagated regardless of nohup/disown/stream redirections.

  8. Wow, you’re right.

    So I guess there’s a difference between calling disown from the script itself or from a function.

    My bad, thanks for sharing this information!

  9. The difference is on whether you’re using interactive shell or a script. That’s why “set -o monitor” helps – because it makes bash “assume” it’s interactive even when it’s not.

Comments are closed.