A tale of slow ssh connections – select * from depesz;

At home I have couple of machines, but one of them takes relatively long time to login to.

It looks like this:

=$ time ssh blob id -u
1000
 
real    0m0.401s
user    0m0.010s
sys     0m0.000s

While 0.4s is not very long, it bugged me. And today, I decided to look into it, to see what's happening.

First of all some information about environment. Blob is Linux machine, based on Ubuntu Maverick. Software is updated daily, so we can be relatively sure that it's OK.

As for network. Pinging blob shows:

=$ ping -c 5 blob
PING blob.home.depesz.com (172.28.173.1) 56(84) bytes of data.
64 bytes from blob.home.depesz.com (172.28.173.1): icmp_req=1 ttl=64 time=0.202 ms
64 bytes from blob.home.depesz.com (172.28.173.1): icmp_req=2 ttl=64 time=0.177 ms
64 bytes from blob.home.depesz.com (172.28.173.1): icmp_req=3 ttl=64 time=0.225 ms
64 bytes from blob.home.depesz.com (172.28.173.1): icmp_req=4 ttl=64 time=0.204 ms
64 bytes from blob.home.depesz.com (172.28.173.1): icmp_req=5 ttl=64 time=0.196 ms
 
--- blob.home.depesz.com ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3997ms
rtt min/avg/max/mdev = 0.177/0.200/0.225/0.023 ms

So, it doesn't show any correlation with 400ms to get simple “id" output.

First thing I did, was stracing ssh command, to see what it was waiting for.

So, I ran:

strace -f -o a.log -tt -T ssh blob id -u

And checked a.log. Inside I saw:

8841  11:11:11.767118 select(8, [3 7], [], NULL, NULL) = 1 (in [3]) <0.294210>
8841  11:11:12.061389 read(3, "\233\355\322\31\256\242\375T?4\241\276]\323\365k=}\272w]&y P]\361\260\364H#\237"..., 8192) = 48 <0.000009>

Which shows that it was waiting for 0.3s for input on socket 3. Which is:

8841  11:11:11.640007 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3 <0.000052>
8841  11:11:11.640143 fcntl(3, F_SETFD, FD_CLOEXEC) = 0 <0.000043>
8841  11:11:11.640270 connect(3, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("172.28.173.1")}, 16) = 0 <0.000227>

which is the ssh connection. So it seems to be related to problems on the other side.

Let's check, perhaps ssh -v will show anything …:

=$ time ssh -v blob id 2>&1 | perl -pe 'use Time::HiRes qw( time ); $_ = sprintf("%.6f %s", time(), $_)'
1292235250.919918 OpenSSH_5.5p1 Debian-4ubuntu4, OpenSSL 0.9.8o 01 Jun 2010
...
1292235251.011839 debug1: Entering interactive session.
1292235251.309365 debug1: Requesting authentication agent forwarding.
...
1292235251.316409 debug1: compress incoming: raw data 235, compressed 177, factor 0.75

As we can see, ssh was waiting on “Entering interactive session".

So, what could it do?

Perhaps it's a problem with DNS?

So, I added “UseDNS no", restarted sshd, and retried. To no change. It was still taking ~ 0.4s.

So, knowing that sshd must be doing something bad, I decided to strace it, to see what it's waiting for.

=# mkdir ssh.strace
 
=# cd ssh.strace/
 
=# netstat -ntlp | grep ":22 "
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      21691/sshd
tcp6       0      0 :::22                   :::*                    LISTEN      21691/sshd
 
=# strace -ff -o strace.log -tt -T -p 21691

After running above commands on blob, I switched to other terminal, ran the “time ssh blob id -u" test (which now, due to strace tool over 1 second), and stopped strace.

First problem. Try to guess how many files were created by this.

There were no other connections in the mean time. Each file means – another process has been created/forked/cloned.

I would assume that it will require:

1 file for sshd daemon listening for connections
1 file for ssh forked off daemon to handle this connection
possibly 1 file for user shell
1 file for “id" command

In total 4 files. Reality?

=# ls -l strace.log.* | wc -l
89

89 files. Oh my. That shows we have some serious problems.

Let's see what is there, and why it took so much time.

Since whole ‘time ssh blob id -u' took 1 second now, let's see if there was anything in there that took over 1 second:

=# grep ' <1\.' strace.log.2*
strace.log.21691:11:20:26.080354 select(7, [3 4], NULL, NULL, NULL) = ? ERESTARTNOHAND (To be restarted) <1.022610>

Well, 21691 waiting is not really helpful, as it's the listening daemon, and the time was simply the time of ssh session. Hmm. So, let's see what exactly took so much time:

Searching for smaller times, I found that 0.638228 s was spent waiting for program “/usr/bin/landscape-sysinfo" ?!

But I don't know/care what it is?

It seems that this program was called from /etc/update-motd.d/50-landscape-sysinfo.

And this in turn was called by “/bin/run-parts –lsbsysinit /etc/update-motd.d".

Which in turn was called by sshd itself (the one forked off listener-ssh).

It looks like on every connection my motd is being updated! Which doesn't make even slightest sense – mostly, because I don't use ubuntu-provided motd (which is auto-regenerated, and stored somewhere in /var), but my own motd:

=# ls -l /etc/motd*
lrwxrwxrwx 1 root root  11 2010-12-13 10:11 /etc/motd -> motd.static
-rw-r--r-- 1 root root   5 2010-12-13 10:11 /etc/motd.static
-rw-r--r-- 1 root root 346 2009-09-27 06:10 /etc/motd.tail.old

( in case you don't know – usually /etc/motd is symlink to /var/run/motd which is being regenerated – as it look – on every login ! ).

Since I don't care about motd in this particular case, I decided to simply remove it at all.

Motd seems to be regenerated by pam_motd (at least that's the connection I was able to make using some greps). In /etc/ssh/sshd_config there is also “PrintMotd" option, but I have set it to “no".

So. I did cd /etc/pam.d/ and:

=# grep -i motd *
login:# Prints the motd upon succesful login
login:# (Replaces the `MOTD_FILE' option in login.defs)
login:session    optional   pam_motd.so
sshd:session    optional     pam_motd.so

OK. So it seems to be used by login, and by sshd. Quick fix:

=# perl -pi -e 's/^/# / if /pam_motd/' *
 
=# grep pam_motd *
login:# session    optional   pam_motd.so
sshd:# session    optional     pam_motd.so

And test:

=$ time ssh blob id -u
1000
 
real    0m0.120s
user    0m0.010s
sys     0m0.000s

YEAH!. 0.3s dropped from connection time, just by disabling simply stupid “feature".

What's also interesting, that when I now strace sshd – there are only 7 files. I.e. only 6 processes forked off for every connection (7th file is the master, listener, sshd).

There 6 processes are mostly some internal stuff of ssh, and 2 more, that are related to my ~/.bashrc (they do run “which …").

All in all – I got rid of 82 new processes, and 75% of runtime, just by disabling pam_motd!

6 thoughts on “A tale of slow ssh connections”

gj says:

2010-12-13 at 13:57

start using debian instead of ubuntu 🙂
User-friendliness comes with a price.
Viktor says:

2010-12-13 at 21:43

What’s missing in this analysis is how long it took you to figure out the problem.

Just kidding! I know how annoying those little slow-downs are — they tend to break your concentration — but it would still be interesting from a cost-benefit perspective.
depesz says:

2010-12-13 at 21:44

@Viktor – around 30 minutes. So of course – it’s not effective, if we’ll look at 0.3s vs. 30 minutes. But on the other hand – I learned from it, so it’s much more valuable.
Edwin says:

2010-12-14 at 00:11

Can you report the bug here?
https://bugs.launchpad.net/ubuntu/+source/landscape-client/+filebug
depesz says:

2010-12-14 at 00:14

@Edwin:

I’m not sure if/what the bug would be. 0.3s is not very slow. On the other hand – the problem in my opinion stems from doing too much on login time, not necessarily from landscape-client itself.
larryd says:

2016-07-07 at 05:03

Nice work, this has been bugging me for months!

Comments are closed.