Loops and Git Automation

This post provides a few quick overviews of cool bits of shell script that I've written or put together recently. Nothing earth shattering, but perhaps interesting nonetheless.

Commit all Git Changes

For a long time, I used the following bit of code to provide the inverse operation of "git add .". Where "git add ." adds all uncommitted changes to the staging area for the next commit, the following command automatically removes all files that are no longer present on the file system from the staging area for the next commit.

if [ "`git ls-files -d | wc -l`" -gt "0" ]; then
  git rm --quiet `git ls-files -d`
fi

This is great: if you forget to use "git mv" or you delete a file using rm, you can run this operation and pretty quickly have git catch up with the state of reality. In retrospect I'm not really sure why I put the error-checking if statement in there.

There are two other implementations of this basic idea that I'm aware of:

for i in `git ls-files -d`; do
  git rm --quiet "$i"
done

Turns out you can do pretty much the same thing using the xargs command, and you end up with something that's a bit more succinct:

git ls-files --deleted -z | xargs -0 git rm --quiet

I'm not sure why, but I think it's because I started being a Unix nerd after Linux dropped the argument-number limit, and as a result I've never really gotten a chance to become familiar with xargs. While I sometimes sense that a problem is xargs-shaped, I almost never run into "too many arguments" errors, and always attempt other solutions first.

A Note About xargs

If you're familiar with xargs skip this section. Otherwise, it's geeky story time.

While this isn't currently an issue on Linux, some older UNIX systems (including older versions of Linux) had a limitation where you could only pass a limited number of arguments to a command. If you had too many, the command would produce an error, and you had to find another way.

I'm not sure what the number was, and the specific number isn't particularly important to the story. Generally, I understand that this problem would crop up when attempting to take the output of a command like find and piping or passing it to another command like grep or the like. I'm not sure if you can trigger "too many arguments" errors with globbing (i.e. *) but like I said this kind of thing is pretty uncommon these days.

One of the "other ways" was to use the xargs command which basically takes very long list of arguments and passes them one by one (or in batches?) to another command. My gut feeling is that xargs can do some things, like the above a bit more robustly, but that isn't experimentally grounded. Thoughts?

Onward and Upward!

Cron is the Wrong Solution

Cron is great, right? For the uninitiated, if there are any of you left, Cron is a task scheduler that makes it possible to run various scripts and programs at specified intervals. This means that you can write programs that "do a thing" in a stateless way and set them to run regularly, without having to build in any logic about when to run or any kind of state tracking. Cron is simple and the right way to do a great deal of routine automation, but there are caveats.

At times I've had scads of cron jobs, and while they work, from time to time I find myself going through my list of cron tasks on various systems and removing most of them or finding better ways.

The problems with cron are simple:

  • It's often a sledgehammer, and it's very easy to put something in a cron job that needs a little more delicacy.

  • While it's possible to capture the output of cron tasks (typically via email), the feedback from cron jobs is hard to follow. So it's hard to detect errors, performance deterioration, inefficiencies, or bugs proactively.

  • It's too easy to cron something to run every minute or couple of minutes. A task that seems relatively lightweight when you run it once can end up being expensive in the aggregate when it has to run a thousand times a day.

    This isn't to say that there aren't places where cron is absolutely the right solution, but often there are better approaches. For instance:

  • Include simple tests and logic in the cron task to determine if it needs to run before actually running (see the sketch after this list).

  • Make things easy to invoke on demand rather than running them automatically on a schedule.

    I've begun to find little scripts launched from dmenu, or an easily called emacs-lisp function, to be preferable for a lot of tasks that I'd otherwise set up as cron jobs.

  • Write real daemons. It's hard and you have to make sure that they don't error out or quit unexpectedly--which requires at least primitive monitoring--but a little bit of work here can go a long way.
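
As a concrete sketch of the first point: a cron-invoked script can check for new work and exit immediately if there's nothing to do. The paths and the rsync job here are hypothetical placeholders; the pattern is what matters.

#!/bin/bash
# Guarded cron task: bail out cheaply unless something actually changed.
WATCH_DIR="$HOME/documents"
STAMP="$HOME/.local/share/backup.stamp"

# If nothing under $WATCH_DIR is newer than the last successful run, do nothing.
if [ -f "$STAMP" ] && [ -z "$(find "$WATCH_DIR" -newer "$STAMP" -print -quit)" ]; then
  exit 0
fi

# The "real" (and, here, made-up) task only runs when the guard falls through.
rsync -a "$WATCH_DIR/" backup-host:documents/
touch "$STAMP"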

Onward and Upward!

6 Awesome Arch Linux Tricks

A couple of years ago I wrote "Why Arch Linux Rocks" and "Getting the most from Arch Linux." I've made a number of attempts to get more involved in the Arch project and community, but mostly I've been too busy working (and using Arch to do that work) to contribute much. Then a few weeks ago I needed to do something minor with my system--I forget what--and I found myself thinking "this Arch thing is pretty swell, really."

This post is a collection of the clever little things that make Arch great.

abs

I'm using abs as a macro for all of the things about the package build system that I enjoy.

Arch packages are easy to build for users: you download a few files, read the bash script in the PKGBUILD file, and run the makepkg command. Done. Arch packages are also easy to specify for developers: just specify a "build()" function and some variables in the PKGBUILD file.
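
To give a sense of how little a package specification requires, here's a minimal, hypothetical PKGBUILD; the project name, URL, and version are placeholders, and a real package would also carry proper checksums (makepkg -g will generate them) and whatever other fields the packaging guidelines call for:

# Hypothetical sketch of a PKGBUILD -- not a real package.
pkgname=example-tool
pkgver=1.0
pkgrel=1
pkgdesc="A hypothetical example package"
arch=('i686' 'x86_64')
url="http://example.com/example-tool"
license=('GPL')
source=("http://example.com/example-tool-$pkgver.tar.gz")
# md5sums=(...) -- a real PKGBUILD needs actual checksums here

build() {
  cd "$srcdir/example-tool-$pkgver"
  ./configure --prefix=/usr
  make
  make DESTDIR="$pkgdir" install
}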

Arch may not have as many packages as Debian, but I think it's clear that you don't need comprehensive package coverage when making packages is trivially easy.

If you use Arch and you don't frequent the AUR, or if you ever find yourself doing "./configure; make; make install", then you're wasting your time or jeopardizing the stability of your server.

yaourt

The default package management tool for Arch Linux, pacman, is a completely sufficient utility. This puts pacman ahead of a number of other similar tools, but to be honest I'm not terribly wild about it. Having said that, I think that yaourt is a great thing. It provides a wrapper around all of pacman's functionality and adds support for AUR/ABS packages in a completely idiomatic manner. The reduction in the cost of installing software from the AUR is quite welcome.

It's not "official" or supported, because it's theoretically possible to really screw up your system with yaourt but if you're cautious, you should be good.

yaourt -G

The main yaourt functions that I use regularly are "-Ss", which searches the AUR, and the -G option. -G just downloads the tarball with the package specification (e.g. the PKGBUILD and associated files) from the AUR and untars the archive into the current directory.

With that accomplished, it's trivial to build and install the package, and you get to keep a record of the build files for future reference and possible tweaking. So basically, this is the way to take away the tedium of getting packages from the AUR, while giving you more control and oversight of package installation.
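
The workflow, with a hypothetical package name, looks something like this:

# download and unpack the build files for a (hypothetical) AUR package
yaourt -G some-aur-package
cd some-aur-package

# inspect or tweak the PKGBUILD, then build and install it
makepkg -si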

rc.conf

If you've installed Arch, then you're already familiar with the rc.conf file. In case you didn't catch how it works, rc.conf is a bash script that defines certain global configuration values, which in turn control certain aspects of the boot process and process initialization.

I like that it's centralized, that you can do all kinds of wild network configuration in the script, and that everything lives in one well-known place.

netcfg

In point of fact, one of the primary reasons I switched to Arch Linux full time was the network configuration tool, netcfg. Like the rc.conf setup, netcfg works by having network configuration files that define a number of variables, which are sourced by netcfg when initiating a network connection.

It's all in bash, of course, and it works incredibly well. I like having network management easy to configure, and setup in a way that doesn't require a management daemon.
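
A profile is just a short bash file in /etc/network.d/. The sketch below is from memory, and the interface, ESSID, and key are placeholders, so treat the sample profiles that ship with netcfg as authoritative for the variable names:

# /etc/network.d/home-wireless -- hypothetical example profile
CONNECTION='wireless'
DESCRIPTION='Home wireless network'
INTERFACE='wlan0'
SECURITY='wpa'
ESSID='example-network'
KEY='hypothetical-passphrase'
IP='dhcp'

Bring it up with "netcfg home-wireless".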

Init System

Previous points have touched on this, but the "BSD-style" init system is perfect. It works quickly, and boot-ups are stunningly fast: even without an SSD I got to a prompt in less than a minute, and probably not much more than 30 seconds. With an SSD it's even better. The points that you should know:

  • Daemon control scripts (i.e. init scripts) are located in /etc/rc.d. There's a pretty useful "library" of shell functions in /etc/rc.d/functions and a good template file in /etc/rc.d/skel for use when building your own control scripts. The convention is to have clear and useful output and easy-to-understand scripts, and with the provided material this is pretty easy.

  • In /etc/rc.conf there's a DAEMONS variable that holds an array. Place names of daemons, corresponding to their /etc/rc.d file names, in this array to start them at boot time. Daemons are started synchronously by default (i.e. the order of items in this array matters, and each control script must exit before the next one runs). However, if a daemon's name is prefixed by an @ sign, the process is started in the background and the init process moves to the next item in the array without waiting.

    Start-up dependency issues are yours to address, but using ordering and background start-up this is trivial to manage, and background start-ups lead to fast boot times. A sketch of the relevant line follows this list.
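
Here's roughly what that array might look like in /etc/rc.conf; the daemon names are examples, and yours will differ:

# daemons start in order; "@" backgrounds a daemon, "!" disables it without removing it
DAEMONS=(syslog-ng network netfs @crond @sshd !ntpd)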

Xen and KVM: Failing Differently Together

When I bought what is now my primary laptop, I had intended to use the extra flexibility to learn the prevailing (industrial-grade) virtualization technology. While that project would have been edifying on its own, I also hoped to use the extra flexibility to do some more consistent testing and development work.

This project spawned a Xen laptop project, but the truth is that Xen is incredibly difficult to get working, and eventually the "new laptop" just became the "everyday laptop," and I let go of the laptop Xen project. In fact, until very recently I'd pretty much given up on doing virtualization things entirely, but for various reasons beyond the scope of this post I've been inspired to begin tinkering with virtualization solutions again.

As a matter of course, I found myself trying KVM in a serious way for the first time. This experience both generated a new list of annoyances and reminded me about all the things I didn't like about Xen. I've collected these annoyances and thoughts into the following post. I hope that these thoughts will be helpful for people thinking about virtualization pragmatically, and also help identify some of the larger pain points with the current solutions.

Xen Hardships: It's all about the Kernel

Xen is, without a doubt, the more elegant solution from a design perspective, and it has a history of being the more robust and usable tool. Performance is great, and Xen hosts can have uptimes in excess of a year or two.

The problem is that dom0 support has, for the past 2-3 years, been in shambles, and the situation isn't improving very rapidly. For years, the only way to run a Xen box was to use an ancient kernel with a set of patches that was frightening, or a more recent kernel with ancient patches forward ported. Or you could use cutting edge kernel builds, with reasonably unstable Xen support.

A mess in other words.

Now that Debian Squeeze (6.0) has a pv-ops dom0 kernel, things might look up, but other than that kernel (which I've not had any success with, but that may be me) basically the only way to run Xen is to pay Citrix [1] or build your own kernel from scratch. Again, results will be mixed (particularly given the non-existent documentation), maintenance costs are high, and a lot of energy will be duplicated.

What to do? Write documentation and work with the distributions so that if someone says "I want to try using Xen," they'll be able to get something that works.

KVM Struggles: It's all about the User Experience

The great thing about KVM is that it just works. "sudo modprobe kvm kvm-intel" is basically the only thing between most people and a KVM host. No reboot required. To be completely frank, the prospect of doing industrial-scale virtualization on top of nothing but the Linux kernel with a wild module in it gives me the willies; it's inelegant as hell. But for now, it's pretty much the best we have.

The problem is that it really only half works, which is to say that while you can have hypervisor functionality and a booted virtual machine with a few commands, it's not incredibly functional in practical systems. There aren't really good management tools, getting even basic networking configured off the bat is a struggle, and qemu as the "front end" for KVM leaves me writhing in anger and frustration. [2]

Xen is also subject to these concerns, particularly around networking. At the same time, Xen's basic administrative tools make more sense, and domUs can be configured outside of interminable, non-paradigmatic command-line switches.

The core of this problem is that KVM isn't very Unix-like. It's a problem that pervades the entire tool, and it's probably rooted in the history of its development.

What to do? First, KVM does a wretched job of anticipating actual real-world use cases, and it needs to do better at that. For instance, it sets up networking in a way that's pretty much only good for software testing and GUI interfaces, but sticking the kernel on the inside of the VM makes it horrible for kernel testing. Sort out the use cases, and there ought to be associated tooling that makes common networking configurations easy.

Second, KVM needs to at least pretend to be Unix-like. I want config files with sane configurations, and I want otherwise mountable disk images that can be easily mounted by the host.

Easy right?

[1]The commercial vendor behind Xen, under whose stewardship the project seems to have mostly stalled. And I suspect that the commercial distribution is Red Hat 5-based, which is pretty dead-end. Citrix doesn't seem to be very keen on using "open source" to generate a sales channel, and also seems somewhat hesitant to put energy into making Xen easier to run for existing Linux/Unix users.
[2]libvirtd and Virt-Manager work pretty well, though they're not particularly flexible, and they're not a simple command-line interface with a configuration file system.

9 Awesome SSH Tricks

Sorry for the lame title. I was thinking the other day about how awesome SSH is, and how it's probably one of the most crucial pieces of technology that I use every single day. Here's a list of nine things that I think are particularly awesome and perhaps a bit off the beaten path.

Update: (2011-09-19) There are some user-submitted ssh-tricks on the wiki now! Please feel free to add your favorites. Also the hacker news thread might be helpful for some.

SSH Config

I used SSH regularly for years before I learned about the config file that you can create at ~/.ssh/config to tell ssh how you want it to behave.

Consider the following configuration example:

Host example.com *.example.net
    User root
Host dev.example.net dev.example.net
    User shared
    Port 220
Host test.example.com
    User root
    UserKnownHostsFile /dev/null
    StrictHostKeyChecking no
Host t
    HostName test.example.org
Host *
    Compression yes
    CompressionLevel 7
    Cipher blowfish
    ServerAliveInterval 600
    ControlMaster auto
    ControlPath /tmp/ssh-%r@%h:%p

I'll cover some of the settings in the "Host *" block, which apply to all outgoing ssh connections, in other items in this post. Basically, you can use this file to create shortcuts for the ssh command, to control what username is used to connect to a given host, and to set the port number if you need to connect to an ssh daemon running on a non-standard port. See "man ssh_config" for more information.

Control Master/Control Path

This is probably the coolest thing that I know about in SSH. Set the "ControlMaster" and "ControlPath" values as above in the ssh configuration. Any time you connect to a host that matches that configuration, a "master session" is created. Subsequent connections to the same host will then reuse the master connection rather than renegotiate and create a separate connection. The result is greater speed and less overhead.

This can cause problems if you want to do port forwarding, as this must be configured on the original connection; otherwise it won't work.
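
If you're using connection sharing, ssh's -O flag lets you interrogate or tear down the master connection for a host; the subcommands below have been around for a while, but check your ssh(1) man page, and the hostname is a placeholder:

# is there a live master connection for this host?
ssh -O check example.com

# cleanly close the master connection (and everything multiplexed over it)
ssh -O exit example.com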

SSH Keys

While ControlMaster/ControlPath is the coolest thing you can do with SSH, key-based authentication is probably my favorite. Basically, rather than force users to authenticate with passwords, you can use a secure cryptographic method to gain (and grant) access to a system. Deposit a public key on servers far and wide, while keeping a "private" key secure on your local machine. And it just works.

You can generate multiple keys, to make it more difficult for an intruder to gain access to multiple machines by breaching a specific key or machine. You can specify which keys and key files should be used for specific hosts in the ssh config file (see above). Keys can also be (optionally) encrypted locally with a passphrase, for additional security. Once I understood how secure the system is (or can be), I found myself thinking "I wish you could use this for more than just SSH."
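
Getting started is just a couple of commands; the remote hostname is a placeholder:

# generate a key pair (accept the defaults; optionally set a passphrase)
ssh-keygen -t rsa

# append the public key to the remote host's authorized_keys
ssh-copy-id user@remote.example.com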

SSH Agent

Most people start using SSH keys because they're easier and it means that you don't have to enter a password every time you want to connect to a host. But the truth is that in most cases you don't want unencrypted private keys that have meaningful access to systems, because once someone has a copy of the private key they have full access to the system. That's not good.

But the truth is that typing in passwords is a pain, so there's a solution: the ssh-agent. Basically, you authenticate to the ssh-agent locally, which decrypts the key and does some magic, so that whenever the key is needed for connecting to a host you don't have to enter your passphrase. ssh-agent manages the local encryption of your key for the current session.
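
In practice that looks something like this (many desktop environments already start an agent for you, in which case only the ssh-add steps matter):

# start an agent for the current shell session
eval "$(ssh-agent)"

# decrypt and load your key; you're prompted for the passphrase once
ssh-add ~/.ssh/id_rsa

# list the keys the agent currently holds
ssh-add -l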

SSH Reagent

I'm not sure where I found this amazing little function, but it's great. Typically, ssh-agents are attached to the current session, like the window manager, so that when the window manager dies, the ssh-agent loses the decrypted bits from your ssh key. That's nice, but it also means that processes that exist outside of your window manager's session (e.g. screen sessions) lose the ssh-agent and get trapped without access to one. You end up having to restart would-be-persistent processes, or run a large number of ssh-agents, which is not ideal.

Enter "ssh-reagent." stick this in your shell configuration (e.g. ~/.bashrc or ~/.zshrc) and run ssh-reagent whenever you have an agent session running and a terminal that can't see it.

ssh-reagent () {
  # look for agent sockets left behind by other sessions and adopt the first live one
  for agent in /tmp/ssh-*/agent.*; do
      export SSH_AUTH_SOCK="$agent"
      if ssh-add -l > /dev/null 2>&1; then
         echo Found working SSH Agent:
         ssh-add -l
         return
      fi
  done
  echo Cannot find ssh agent - maybe you should reconnect and forward it?
}

It's magic.

SSHFS and SFTP

Typically we think of ssh as a way to run a command or get a prompt on a remote machine. But SSH can do a lot more than that, and OpenSSH, which is probably the most popular implementation of SSH these days, has a lot of features that go beyond just "shell" access. Here are two cool ones:

SSHFS creates a mountable file system using FUSE of the files located on a remote system over SSH. It's not always very fast, but it's simple and works great for quick operations on local systems, where the speed issue is much less relevant.
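
Using SSHFS is about as simple as mounting anything else; the host and paths are placeholders:

# mount a remote directory over SSH (the mount point must already exist)
sshfs user@remote.example.com:/srv/files ~/mnt/remote

# work with the files as if they were local, then unmount
fusermount -u ~/mnt/remote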

SFTP replaces FTP (which is plagued by security problems) with a similar tool for transferring files between two systems that's secure (because it works over SSH) and just as easy to use. In fact, most recent OpenSSH daemons provide SFTP access by default.

There's more, like a full VPN solution in recent versions, secure remote file copy, port forwarding, and the list could go on.

SSH Tunnels

SSH includes the ability to connect a port on your local system to a port on a remote system, so that to applications on your local system the local port looks like a normal local port, but when accessed the service running on the remote machine responds. All traffic is really sent over ssh.

I set up an SSH tunnel from my local system to the outgoing mail server on my server. I tell my mail client to send mail to the localhost port (without mail server authentication!), and it magically goes to my personal mail relay, encrypted over ssh. The applications of this are nearly endless.
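
A tunnel like that looks something like the following; the hostname is a placeholder, and since binding local port 25 requires root, a high local port is often easier:

# forward local port 2525 to port 25 on the mail server's loopback interface;
# -N skips running a remote command, -f backgrounds ssh after authentication
ssh -f -N -L 2525:127.0.0.1:25 user@mail.example.com

Then the mail client is pointed at localhost port 2525.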

Keep Alive Packets

The problem: unless you're doing something with SSH it doesn't send any packets, and as a result the connections can be pretty resilient to network disturbances. That's not a problem in itself, but it does mean that if you aren't actively using an SSH session, it can go silent, causing your local network's NAT to eat a connection that it thinks has died but hasn't. The solution is to set the "ServerAliveInterval [seconds]" option in the SSH configuration so that your ssh client sends a "dummy packet" at a regular interval, so the router thinks the connection is active even if it's particularly quiet. It's good stuff.

/dev/null known_hosts

A lot of what I do in my day job involves deploying new systems, testing something out, and then destroying that installation and starting over on the same virtual machine. So my "test rigs" share a few IP addresses, and I can't readily deploy keys on these hosts. Every time I redeploy, SSH's host-key checking tells me that a different system is responding for the host, which is usually the symptom of some sort of security problem. Knowing this is generally a good thing, but in this situation it's just very annoying.

These configuration values tell your SSH session to save host keys to /dev/null (i.e. drop them on the floor) and not to ask you to verify unknown hosts:

UserKnownHostsFile /dev/null
StrictHostKeyChecking no

This probably saves me a little annoyance and a minute or two every day or more, but it's totally worth it. Don't set these values for hosts that you actually care about.
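
You can also do the same thing for a single connection without touching your config file, which is handy for one-off test boxes; the address below is a placeholder:

# one-off connection that neither records nor checks the host key
ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@192.0.2.10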


I'm sure there are other awesome things you can do with ssh, and I'd love to hear more. Onward and Upward!

Persistent SSH Tunnels with AutoSSH

Rather than authenticate to a SMTP server to send email, which is fraught with potential security issues and hassles, I use a SSH tunnel to the machine running my mail server. This is automatic, easy to configure both for the mail server and mail client, and incredibly secure. It's good stuff.

The downside, if there is one, is that the tunnel has to be active to be able to send email messages, and SSH tunnels sometimes disconnect a bit too silently, particularly on unstable (wireless) connections. I (and others, I suspect) have had some success with integrating the tunnel connection with pre- and post-connection hooks, so that the network manager automatically creates a tunnel after connecting to the network, but this is a flawed solution that produces uneven results.

Recently I've discovered this program called "AutoSSH," which creates an SSH tunnel and tests it regularly to ensure that the tunnel is functional. If it isn't, AutoSSH recreates the tunnel. Great!

First start off by getting a copy of the program. It's not part of the OpenSSH package, so you'll need to download it separately. It's in every package management repository that I've tried to get it from, so installation will probably involve one of the following commands at your system's command line:

apt-get install autossh
pacman -S autossh
yum install autossh
port install autossh

When that's done, you'll issue a command that resembles the following:

autossh -M 25 -f tychoish@foucault.cyborginstitute.net -L 25:127.0.0.1:25

Really, the important part here is the "autossh -M 25" part of the command. This tells autossh to watch ("monitor") port number 25 on the local system for the tunnel. The rest of the command (e.g. "-f tychoish@foucault.cyborginstitute.net -L 25:127.0.0.1:25") is just a typical call to the ssh program.

Things to remember:

  • If you need to create a tunnel on a local port numbered lower than 1024 (a privileged port), you'll need to run the autossh command as root.
  • SSH port forwarding only forwards traffic from a local port to a remote port, through an SSH connection. All traffic is transmitted over the wire on port 22. Unless you establish multiple tunnels, only traffic sent to the specific local port will be forwarded.
  • Perhaps it's obvious, but there has to be some service listening on the specified remote end of the tunnel, or else the tunnel won't do anything.
  • In a lot of ways, depending on your use case, autossh can obviate the need for much more complex VPN setups for a lot of deployments. Put an autossh command in an @reboot cron job, with an account that has ssh keys generated, and just forget about it for encrypting things like database traffic and the like (see the sketch after this list).
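
For that last point, the crontab entry might look roughly like this; the host, ports, and monitoring port are placeholders, and the account's key has to be usable without a passphrase prompt:

# crontab entry: recreate an encrypted tunnel to a remote database at every boot
@reboot autossh -M 20000 -f -N -L 5432:127.0.0.1:5432 dbuser@db.example.com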

Onward and Upward!

There's Always Something New to Learn

Now that I'm fairly confident in my ability to do basic Linux systems administration tasks--manage web and email servers, maintain most Linux systems, convince desktop systems that they really do want to work the way they're supposed to--I'm embarking on a new learning process. I've been playing around with "real" virtualization on my desktop, and I've been reading a bunch about systems administration topics that are generally beyond the scope of what I've dealt with until now. Here is a selection of the projects I'm playing with:

  • Getting a working Xen setup at home. This requires learning a bit more about building working operating systems, and also, in the not too distant future, buying a new (server) computer.
  • Installing xen on the laptop, because it'll support it, I have the resources to make it go, and it'll be awesome.
  • Learning everything I can about LVM, which is a new (to me) way of managing partitions and disk images that makes backups, disk snapshots, and other awesomeness much easier. It means some system migration work that I have yet to tinker with, as none of my systems currently use LVM.
  • Doing package development for Arch Linux, because I think that's probably within the scope of my ability, because I think it would add to my skill set, and because I appreciate the community, and I want to be able to give back. Also I should spend some time editing the wiki, because I'm really lazy with that.

I guess the overriding lesson of all these projects is a more firm grasp of how incredibly awesome, powerful, and frankly stable Arch Linux is (or can be.) I mean there are flaws of course, but given how I use systems, I've yet to run into something show stopping. That's pretty cool.

I want ZFS in the Kernel

The background:

Sun Microsystems developed this file system called "ZFS," which is exceptionally awesome in its capabilities and possibilities. The problem is that it was developed and released as part of the OpenSolaris project, which has a licensing incompatibility with the Linux kernel. Both are open source, but there is a technical (and not altogether uncommon) conflict in the terms of the licenses that makes it impossible to combine code from both in a single executable.

Basically the GPL, under which the Linux kernel is distributed, says that if you distribute a binary (executable) under the terms of the GPL, you must provide the source code for all the files that you used to make that binary. By contrast, ZFS's license says "here are all the files that we used to make this binary; if you change them when you make your binary and give that binary to other people, you have to give them those changes, but if you add additional files, you don't have to give those out to people."

Apparently the idea behind the ZFS license (i.e. the CDDL, and the MPL from whence it originated) is that it allows for easier embedding of ZFS (and other technologies) in proprietary code, because the resulting binary isn't covered by the CDDL in most cases. Even though the CDDL is incredibly confusing, apparently it's more "business friendly," but I diverge from my original point.

And so if Linux users want to run ZFS, they have to run it as a user-space process (i.e. not in the kernel), which is suboptimal, or they have to run Solaris in a virtualized environment (difficult), or something. There's also a ZFS-like file system called "btrfs," which can be included in the kernel (interestingly, developed by Oracle, who of course now own ZFS itself), but it is not production ready.

What I'm about to propose is an end run around the GPL. It seems to me that combining the source code violates neither license, distributing source code violates no license, and compiling the source code for your own use violates no license. I mean, it's annoying and would require a bit of bootstrapping to get a Linux+ZFS system up and running, but this is the kind of thing that Gentoo Linux users do all the time, and isn't a huge technological barrier.

It feels a bit creepy, of course, but I think it works. The logic has also been used before. We'll call it the "PGP loophole."

PGP is an encryption system that's damn good; so good, in fact, that when it was first released, there were export restrictions on the source code because it qualified as military-grade munitions in America. Source code. How peculiar. In any case there were lawsuits, and PGP source was released outside of America by printing it in a book, which could be disassembled, scanned into a computer, and then compiled. Books were never and--as far as I know--are not classified as munitions, and so they could be exported. Of course I'm not a lawyer, but it strikes me that Linux+ZFS and PGP in the 90s may be in analogous situations.

And I think, because this proposal centers around the distribution of source code and only source code, this kind of distribution is fully within the spirit of free software. Sure, it's pretty easy, even for the "good guys," to run afoul of it by distributing a binary, but this would be easy to spot, and there are already suitable enforcement mechanisms in place for the Linux kernel generally; Oracle's legal department, we can assume, will take care of itself.

Or Oracle could release ZFS under GPL. Either solution works for me.