This weblog is licensed under a Creative Commons License.


parallel ssh

[warning: technical wonkery]

pssh is a godsend when you're working with a number of servers. Pssh lets you run the same command on many different computers via ssh, collating the results for you.

The man page says pssh is "most useful for operating on clusters of homogenously-configured hosts."

But to my mind, where it really shines is doing diagnostics over a number of machines.

I was recently called in to fix up a problem like this. A website had gone down, the sysadmin was missing, and nobody really knew which of seventy-odd machines was responsible. So I was flailing around in the dark. All I had was the shared root password, and a client very keen for things to start working Real Soon Now.

My first tool out of the box was nmap. This lets me figure out which servers we have around:

  root@dv06x:/home/dan# nmap -sP 10.10.22.0/24

    Starting Nmap 4.20 ( http://insecure.org ) at 2011-05-29 15:03 PDT
    Host 10.10.22.1 appears to be up.
    MAC Address: 00:04:23:E1:F8:DF (Intel)
    Host 10.10.22.5 appears to be up.
    MAC Address: 00:04:23:E1:F8:DF (Intel)

Here I'm telling nmap to ping every machine with an IP address (internal to the data centre) of the form 10.10.22.XXX, that is, every address that shares its first 24 bits with 10.10.22.0. Why 10.10.22? Because ifconfig showed me the IP of the firewall server I was first logged in to, and it began with 10.10.22.
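
As a rough illustration of the arithmetic behind the /24, here is a small sketch. The function names net24 and cidr_size are made up for this example; they are not part of nmap or any standard tool. The first turns an address such as 10.10.22.5 into the range to scan, and the second counts how many addresses a prefix length covers.

```shell
# Hypothetical helpers for illustration only.

# Drop the last octet of a dotted-quad address and return its /24 network.
net24() {
    echo "${1%.*}.0/24"
}

# Number of addresses covered by a prefix length: 2^(32 - prefix).
cidr_size() {
    echo $(( 1 << (32 - $1) ))
}

net24 10.10.22.5    # prints 10.10.22.0/24
cidr_size 24        # prints 256
```

So a /24 scan pings 256 addresses, which is quick enough to run interactively.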

Now, let's pull a list of IPs out of there. I couldn't find an option in nmap to just print the IP address, so let's do it with perl:

    root@dv06x:/home/dan# nmap -sP 10.10.22.0/24 | grep "appears to be up" | perl -pe 's/.*?((\d+\.){3}\d+).*/$1/' > servers.txt
    root@dv06x:/home/dan# cat servers.txt
    10.10.22.1
    10.10.22.5
    10.10.22.6
    ...
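
The same extraction can be sketched with awk alone. The helper name extract_up_hosts is hypothetical, and it assumes the line format shown in the output above ("Host <ip> appears to be up."), where the address is the second field. If nmap resolves hostnames, or on versions that print a different report format, the field position would need adjusting.

```shell
# Hypothetical helper: read "nmap -sP" output on stdin, print one IP per line.
# Assumes lines of the form "Host <ip> appears to be up."
extract_up_hosts() {
    awk '/appears to be up/ { print $2 }'
}

# Usage (not run here):
#   nmap -sP 10.10.22.0/24 | extract_up_hosts > servers.txt
```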

Now that we have a list of servers, we can try ssh'ing into them. First we need password-less ssh access, which starts with generating a keypair:

    ssh-keygen -t rsa

This generates keys in ~/.ssh/id_rsa (private) and ~/.ssh/id_rsa.pub (public). Now we must copy the public key to every other server, appending it to the file ~/.ssh/authorized_keys.

scp doesn't support appending. One way would be to first scp to a temp file, then log in and append that file to ~/.ssh/authorized_keys.

But that's a lot of excess typing, if you're doing it seventy times. Instead we can use pipes with cat:

    cat ~/.ssh/id_rsa.pub | ssh root@10.10.22.6 "cat >> ~/.ssh/authorized_keys"

Now we need to do this for every machine. Let's do it the semi-manual way: pssh won't save much time, since we need to enter passwords and accept fingerprints anyway:

 
    root@dv06x:/home/dan# for server in `cat servers.txt`
    > do
    > cat ~/.ssh/id_rsa.pub | ssh root@$server "cat >> ~/.ssh/authorized_keys"
    > done
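
One caveat with a plain append: if the loop is run a second time (say, after a typo in a password), every server ends up with the key listed twice. A minimal sketch of an append that skips keys already present follows. The function name append_key_once is made up for this example, and it works on local files; the same grep test could in principle be placed inside the remote command.

```shell
# Hypothetical helper: append the key stored in file $1 to file $2,
# but only if that exact line is not already there.
append_key_once() {
    key=$(cat "$1")
    touch "$2"
    # -q quiet, -x match whole lines, -F treat the key as a fixed string
    grep -qxF -e "$key" "$2" || printf '%s\n' "$key" >> "$2"
}

# Usage on the local machine (not run here):
#   append_key_once ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```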

Now we should be able to connect to any of them without a password:

 
    root@dv06x:~# ssh root@10.10.22.23 
    Last login: Mon May 16 04:16:41 2011 from 10.10.22.5
    Linux xn03 2.6.18-xen #1 SMP Fri May 18 16:01:42 BST 2007 x86_64

Now is when pssh comes into its own. The man page explains basic usage:

 
       parallel-ssh [OPTIONS] -h hosts.txt prog [arg0...]

That is, you give it a list of hosts in a file, and a command to execute on the command line. It'll ssh into each host in parallel and run the command everywhere. I'll also use the -P option, so that we can see the output directly on the terminal.
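
To make the idea concrete, here is a rough sketch of the same pattern in plain shell: one background ssh per host, with each output line prefixed by the host name. This is not how pssh is implemented (pssh also handles timeouts and limits on parallelism, among other things). The name run_parallel and the RUN override are assumptions made for this example; RUN exists only so the loop can be tried without real servers.

```shell
# Hypothetical sketch, not pssh's actual implementation.
# RUN is an assumed override: it defaults to ssh, set RUN=echo for a dry run.
run_parallel() {
    hosts_file=$1
    shift
    while read -r host; do
        [ -n "$host" ] || continue
        # One background job per host; prefix each output line with the host.
        ( ${RUN:-ssh} "$host" "$@" < /dev/null 2>&1 | sed "s/^/$host: /" ) &
    done < "$hosts_file"
    wait    # block until every background job has finished
}

# Usage (not run here):
#   run_parallel servers.txt uptime
```

With seventy-odd hosts this would start seventy-odd ssh processes at once, which is one reason to prefer pssh itself.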

Let's start with the uptime command. This prints out how long the server has been up and, more interestingly, the current load:

 
    root@dv06x:/home/dan# pssh -P -h servers_all_ip uptime
    ...
    10.10.104.156:  16:03:42 up 53 days, 22:46,  0 users,  load average: 0.00, 0.02, 0.26
    [69] 16:21:56 [SUCCESS] 10.10.104.156
    10.10.22.101:  16:03:18 up 2 days,  3:47,  1 user,  load average: 69.52, 70.14, 70.21
    [70] 16:21:56 [SUCCESS] 10.10.22.101
    10.10.104.146:  16:03:12 up 33 days, 15:25,  0 users,  load average: 1.24, 1.22, 1.19
    [71] 16:21:56 [SUCCESS] 10.10.104.146

It's slightly irritating not to have the output matched up with the results, but it's already telling us something useful. 10.10.22.101 has an immense load, and has recently been rebooted. That's probably somewhere to concentrate our attention.
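
With seventy hosts scrolling past, it can help to filter the output rather than read it by eye. The following is a sketch, assuming the "host: ... load average: a, b, c" line format shown above; the function name high_load is made up for this example. It prints only the hosts whose 1-minute load average is above a threshold.

```shell
# Hypothetical helper: read "pssh -P ... uptime" output on stdin and print
# "host load" for every host whose 1-minute load average exceeds $1.
high_load() {
    awk -v max="$1" '/load average:/ {
        host = $1
        sub(/:$/, "", host)                 # "10.10.22.101:" -> "10.10.22.101"
        load = $0
        sub(/.*load average: */, "", load)  # keep only the three averages
        sub(/,.*/, "", load)                # keep only the first (1-minute) one
        if (load + 0 > max + 0) print host, load
    }'
}

# Usage (not run here):
#   pssh -P -h servers_all_ip uptime | high_load 5
```

The [SUCCESS] status lines contain no "load average:" text, so they are skipped automatically.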

We can also gather some information about what operating systems we're dealing with. lsb_release -a will get us that:

 
    root@dv06x:/home/dan# pssh -P -h servers.txt "lsb_release -a"
    ...
    10.10.22.12: Distributor ID:    Ubuntu
    Description:    Ubuntu 7.10
    Release:        7.10
    Codename:       gutsy
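
To get an overview rather than seventy separate reports, the Description lines can be tallied. This is a sketch only: count_releases is a made-up name, and it assumes each host's output contains a "Description:" line in the format shown above.

```shell
# Hypothetical helper: read "pssh -P ... lsb_release -a" output on stdin and
# print how many hosts reported each Description, most common first.
count_releases() {
    awk '/Description:/ {
        sub(/.*Description:[ \t]*/, "")   # keep only the text after the label
        count[$0]++
    }
    END { for (d in count) print count[d], d }' | sort -rn
}

# Usage (not run here):
#   pssh -P -h servers.txt "lsb_release -a" | count_releases
```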

Much more can be done along these lines, but I'll leave it there for now.