Systems Administrators are the Problem

For years now, the idea of the terrible stack, or the dynamic duo of Terraform and Ansible, from this tweet has given me a huge amount of joy, basically any time someone mentions either Terraform or Ansible, which happens rather a lot. It’s not that I think Terraform or Ansible are terrible: the configuration management problems that these pieces of software are trying to solve are real and actually terrible, and having tools that help regularize the problem of configuration management definitely improves things. And yet the tools leave something wanting.

Why care so much about configuration management?

Configuration matters because every application needs some kind of configuration: a way to connect to a database (or similar), a place to store its output, and inevitably other things, like dependencies, feature flags, or whatever.

And that’s the simple case. While most things are probably roughly simple, it’s very easy to have requirements that go beyond this a bit, and it turns out that while a development team might--but only might--avoid requirements that qualify as “weird,” every organization has something.

As a developer, configuration and deployment often matter a great deal, and it’s pretty common to need to make changes in this area of the code. While it’s possible to architect things so that configuration can be managed within an application (say), that all takes longer and isn’t always easy to implement, and if your application requires escalated permissions, or needs a system configuration value set, then it’s easy to get stuck.

And there’s no real way to avoid it: if you don’t have a good way to manage configuration state, then infrastructure becomes bespoke and fragile, and that’s bad. Sometimes people suggest using image-based distribution (so-called “immutable infrastructure”), but this tends to be slow (images are large and can take a while to build), and you still have to capture configuration in some way.

But how did we get here?

I think I could weave a really convincing, and likely true story about the discipline of system administration and software operations in general and its history, but rather than go overboard, I think the following factors are pretty important:

  • computers used to be very expensive, were difficult to operate, and so it made sense to have people who were primarily responsible for operating them, and this role has more or less persisted forever.
  • service disruptions can be very expensive, so it’s useful for organizations to have people who are responsible for “keeping the lights on,” and for troubleshooting operational problems when things go wrong.
  • most computer systems depend on state of some kind--files on disks, the data in databases--and managing that state can be quite delicate.
  • recent trends in computing make it possible to manipulate infrastructure--computers themselves, storage devices, networks--with code, which means we have this unfortunate dualism of infrastructure where it’s kind of code but also kind of data, and so it feels hard to know what the right thing to do is.

Why not just use <xyz>

This isn’t fair, really, and you know it’s gonna be good when someone trivializes an adjacent problem domain with a question like this, but this is my post so you must endure it, because the idea that there’s another technology or way of framing the problem that makes all of this better is incredibly persistent.

Usually <xyz>, in recent years, has been “Kubernetes” or “docker” or “containers,” but it sort of doesn’t matter; in the past the solutions were platforms-as-a-service (e.g. AppEngine/etc.) or backends-as-a-service (e.g. parse/etc.). So let’s run down some answers:

  • “bake configuration into the container/virtual machine/etc. and then you won’t have state,” is a good idea, except it means that if you need to change configuration very quickly, it becomes quite hard because you have to rebuild and deploy an image, which can take a long time, and then there are the problems of how you get secrets like credentials into the service.
  • “use a service for your platform needs,” is a good solution, except that it can be pretty inflexible, particularly if you have an application that wasn’t designed for the service, or need to use some kind of off-the-shelf service or tool (a message bus, a cache, etc.) that wasn’t designed to run in this kind of environment. It’s also the case that the hard cost of using platforms-as-a-service can be pretty high.
  • “serverless” approaches have something of a bootstrapping problem: how do you manage the configuration of the provider? How do you get secrets into the execution units?

What’s so terrible about these tools?

  • The tools can’t decide if configuration should be described programmatically, using general-purpose programming languages and frameworks (e.g. Chef, many deployment tools), declaratively using some kind of structured format (Puppet, Ansible), or with some kind of ungodly hybrid (e.g. Helm, anything with HCL). I’m not sure that there’s a good answer here. I like being able to write code, and I think YAML-based DSLs aren’t great; but capturing configuration in code creates a huge amount of difficult-to-test code. Regardless, you need to find ways to test the code inexpensively, and doing this in a way that’s useful can be hard.
  • Many tools are opinionated and have strong idioms, in hopes of making infrastructure more regular and easier to reason about. This is cool and a good idea, but it makes it harder to generalize. While concepts like immutability and idempotency are great properties for configuration systems to have, they’re difficult to enforce, and so maybe developing patterns and systems with weaker opinions that are easy to comply with, and idioms that can be applied iteratively, would be more useful.
  • Tools are willing to do things to your systems that you’d never do by hand, including a number of destructive operations (terraform is particularly guilty of this), which erodes trust and inspires otherwise bored ops folks to write (or recapitulate) their own systems, which is why so many different configuration management tools emerge.

Maybe the tools aren’t actually terrible, and it’s the organizational factors that lead to the entrenchment of operations teams (incumbency, incomplete cost analysis, difficult-to-meet stability requirements) that lead to the entrenchment of the kinds of processes that require tools like this (though causality could easily flow in the opposite direction, with the same effect).

Observation at Scale

I wrote this thing about monitoring in software, and monitoring in web applications (and similar) about a year ago, and I sort of forgot about it, but as I was cleaning up recently I found this, and think that I mostly still agree with the point. Enjoy!

It is almost always the case that writing software that does what you want it to do is the easy part and everything else is the hard part.

As your software does more, a number of common features emerge:

  • other people are responsible for operating your software.
  • multiple instances of the program are running at once, on different computers.
  • you may not be able to connect to all of the running instances of the program when something goes wrong.
  • people will observe behaviors that you don’t expect and that you won’t be able to understand by observing the program’s inputs or outputs.

There are many things you can do to take a good proof-of-concept program and turn it into a production-ready program, but I think logging and introspection abilities are among the most powerful: they give you the most bang for your buck, as it were. It’s also true that observability (monitoring) is a hot area of software development that’s seeing a lot of investment and thought at the moment.

While your application can have its own internal reporting system, it’s almost always easier to start by collecting data in logs and build from there.

Aggregate Logs

Conventionally, operators and developers interact with logs using standard unix stream processing tools: tail, less, and grep, and sometimes wc, awk, and sed. This is great when you have one process (or a small number of them) running on one machine. When applications get bigger, stream processing begins to break down.

Mostly you can’t stream process because of volume: there’s too much data, it’s hard to justify spending disk space on all of your application servers for logs, and there’s too much of it to look at and do useful things with. It’s also true that once you have multiple machines, it’s really helpful to be able to look at all of the logs in a single place.

At the lowest level, the syslog protocol and its associated infrastructure solve this problem by providing a common way for services to send log data over the network (UDP, etc.). It works, but you still only have stream processing tools, which may be fine, depending on your use case and users.
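
As a minimal sketch in Python (assuming a syslog collector is listening on UDP port 514; the hostname here is a placeholder), the standard library is enough to get logs off the box:

    import logging
    import logging.handlers

    logger = logging.getLogger("myapp")
    logger.setLevel(logging.INFO)

    # Forward records to a central syslog collector over UDP.
    # "localhost" stands in for wherever your aggregation host lives.
    handler = logging.handlers.SysLogHandler(address=("localhost", 514))
    handler.setFormatter(logging.Formatter("myapp: %(levelname)s %(message)s"))
    logger.addHandler(handler)

    logger.info("service started")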

Additionally, there are services and applications that solve this problem: Splunk (commercial/enterprise software), Sumo Logic (commercial/cloud software), and the ELK stack (an amalgamation of open source tools) all provide really powerful ways to do log search, reporting, and even visualization. There are probably others as well.

Use them.

Structure Logs

The most common interview question for systems administrators that my colleagues give is a “log sawing” question. This seems pretty standard, and it’s a pretty common exercise: parse information out of a well-known stream of log data, like “find a running average request time,” or figure out the request rate.

The hard part is that most logs in this exercise are unstructured, in the sense that they are just line-wise printed strings, so the exercise is really in figuring out the structure of the messages, parsing data out of the strings, and then tracking that data over the course of the logs. It’s a common exercise, definitely a thing that you have to do, and also totally infuriating and basically impossible to generalize.
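
To make the exercise concrete, here’s a minimal sketch in Python, assuming hypothetical access-log lines that end in a request duration like “0.123s”; the shape of real logs will differ:

    import re
    import sys

    # Matches a trailing duration field such as "0.123s"; adjust to your log format.
    PATTERN = re.compile(r"(?P<duration>\d+\.\d+)s\s*$")

    count = 0
    total = 0.0
    for line in sys.stdin:
        match = PATTERN.search(line)
        if not match:
            continue  # not a request line; skip it
        count += 1
        total += float(match.group("duration"))
        print(f"running average after {count} requests: {total / count:.4f}s")

Every change to the log format quietly breaks a script like this, which is part of why the next suggestion matters.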

If you’re writing software, don’t make your users do this kind of thing. Capture events (log messages) in your program and output them with the information already parsed. The easiest way is to make your log messages mapping types, and then write them out as JSON, but there are other options.

In short, construct your log messages so that they’re easy to consume by other tools: strongly (and stably) type your messages, and provide easy ways to group and filter similar messages. Report operations in reasonable units (e.g. seconds rather than nanoseconds) to avoid complex calculations during processing, and think about how a given data point would be interesting to track over time.
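
As a sketch of what that can look like in Python (using the standard logging module and a hypothetical JSON formatter; libraries like structlog exist for exactly this, but the idea is the same):

    import json
    import logging
    import time

    class JSONFormatter(logging.Formatter):
        """Render each log record as one JSON object per line."""

        def format(self, record):
            payload = {
                "time": round(time.time(), 3),
                "level": record.levelname,
                "event": record.getMessage(),
            }
            # Carry any structured fields passed via the `extra` argument.
            payload.update(getattr(record, "fields", {}))
            return json.dumps(payload)

    logger = logging.getLogger("myapp")
    handler = logging.StreamHandler()
    handler.setFormatter(JSONFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    # Durations reported in seconds, already parsed for downstream tools.
    logger.info("request completed",
                extra={"fields": {"path": "/api/thing", "duration_secs": 0.123}})

The exact fields matter less than the fact that downstream tools never have to re-parse them.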

Annotate Events

Building on the power of structured logs, it’s often useful to be able to determine the flow of traffic or operations through the system to make it possible to understand the drivers of different kinds of load, and the impact of certain kinds of traffic on overall performance. Because a single operation may impact multiple areas of the system, annotating messages appropriately makes it possible to draw more concrete conclusions based on the data you collect.

For example, when a client makes a request for data, your system probably has a request-started and a request-ended event. In addition, the operation may retrieve data, do some application-level manipulation, modify other data, and then return a result to the user. If there’s any logging between the start and end of a request, then it’s useful to tie these specific events together, and annotations can help.

Unlike other observability strategies, there’s not a single software feature you can turn on to annotate messages once you have structured logging, although it is quite useful if your logging system supports some kind of middleware for injecting annotations.
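
A sketch of what such middleware can look like in Python, assuming a hypothetical per-request ID carried in a contextvar and a logging filter that injects it into every record:

    import contextvars
    import logging
    import uuid

    # Holds the ID of the request currently being handled, if any.
    request_id = contextvars.ContextVar("request_id", default="-")

    class RequestIDFilter(logging.Filter):
        """Attach the current request ID to every record this logger emits."""

        def filter(self, record):
            record.request_id = request_id.get()
            return True

    logging.basicConfig(format="%(asctime)s %(request_id)s %(message)s",
                        level=logging.INFO)
    logger = logging.getLogger("myapp")
    logger.addFilter(RequestIDFilter())

    def handle_request():
        # Annotate once at the start of the request...
        request_id.set(str(uuid.uuid4()))
        logger.info("request started")
        # ...and every event logged while handling it shares the same ID.
        logger.info("request finished")

    handle_request()

Anything from request IDs to tenant names or feature-flag states can ride along in the same way.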

Collect Metrics

In addition to the events produced by your system, it may be useful to have a background data collection thread that reports on your application and system’s resource utilization. Things like runtime resource utilization, garbage collector stats, and system I/O, CPU, and memory use can all be useful.

There are ways to collect this data via other means, and there are a host of observability tools that support this kind of metrics collection. But using multiple providers complicates actually using the data, and makes it harder to understand what’s going on in the course of running a system. If your application is already reporting other stats, consider bundling these metrics into your existing approach.
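
A minimal sketch in Python of that kind of background reporter, using only standard-library facilities (resource is Unix-only, and the field names are illustrative, following the structured-logging sketch above):

    import gc
    import logging
    import resource
    import threading
    import time

    logger = logging.getLogger("myapp.metrics")

    def report_metrics(interval_secs=60):
        """Periodically log process resource usage and GC activity."""
        while True:
            usage = resource.getrusage(resource.RUSAGE_SELF)
            gc_collections = sum(s["collections"] for s in gc.get_stats())
            logger.info(
                "process metrics",
                extra={"fields": {
                    "cpu_user_secs": usage.ru_utime,
                    "cpu_system_secs": usage.ru_stime,
                    "max_rss_kb": usage.ru_maxrss,
                    "gc_collections": gc_collections,
                }},
            )
            time.sleep(interval_secs)

    # daemon=True so the reporter never keeps the process alive at shutdown.
    threading.Thread(target=report_metrics, daemon=True).start()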

By making your application responsible for system metrics you immediately increase the collaboration between the people working on development and operations, if such a divide exists.

Conclusion

In short:

  • collect more data,
  • increase the fidelity and richness of the data you collect,
  • aggregate potentially related data in the same systems to maximize value,
  • annotate messages to add value, and provide increasingly high-level detail.

Cron is the Wrong Solution

Cron is great, right? For the uninitiated, if there are any of you left, Cron is a task scheduler that makes it possible to run various scripts and programs at specified intervals. This means that you can write programs that “do a thing” in a stateless way, set them to run regularly, without having to consider any logic regarding when to run, or any kind of state tracking. Cron is simple and the right way to do a great deal of routine automation, but there are caveats.

At times I’ve had scads of cron jobs, and while they work, from time to time I find myself going through my list of cron tasks on various systems and removing most of them or finding better ways.

The problems with cron are simple:

  • It’s often a sledgehammer, and it’s very easy to put something in a cron job that needs a little more delicacy.

  • While it’s possible to capture the output of cron tasks (typically via email), the feedback from cron jobs is hard to follow, so it’s hard to detect errors, performance deterioration, inefficiencies, or bugs proactively.

  • It’s too easy to cron something to run every minute or every couple of minutes. A task that seems relatively lightweight when you run it once can end up being expensive in the aggregate when it has to run a thousand times a day.

    This isn’t to say that there aren’t places where cron is absolutely the right solution, but there are often better options. For instance:

  • Include simple tests and logic for the cron task to determine whether it actually needs to run before doing the real work (there’s a sketch of this after the list).

  • Make things very easy to invoke on demand rather than running them automatically on a schedule.

    I’ve begun to find little scripts run from dmenu, or an easily called emacs-lisp function, to be preferable to a cron job for a lot of tasks that I’d otherwise set up in cron.

  • Write real daemons. It’s hard and you have to make sure that they don’t error out or quit unexpectedly--which requires at least primitive monitoring--but a little bit of work here can go a long way.
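
For that first suggestion, here’s a minimal sketch in Python of a cron-friendly wrapper, assuming a hypothetical backup job that only needs to run when its source directory has changed since the last successful run ("backup-host" is a placeholder):

    #!/usr/bin/env python3
    """Run an expensive task only when its inputs have actually changed.

    Safe to schedule frequently: the cheap check runs every time, the
    expensive work only when needed.
    """
    import pathlib
    import subprocess
    import sys

    SOURCE = pathlib.Path.home() / "documents"        # hypothetical input directory
    STAMP = pathlib.Path.home() / ".documents-synced"  # marks the last successful run

    def newest_mtime(root):
        """Return the most recent modification time under root."""
        return max((p.stat().st_mtime for p in root.rglob("*")), default=0.0)

    def main():
        last_run = STAMP.stat().st_mtime if STAMP.exists() else 0.0
        if newest_mtime(SOURCE) <= last_run:
            return 0  # nothing changed since the last run; exit quietly
        subprocess.run(
            ["rsync", "-a", f"{SOURCE}/", "backup-host:documents/"],
            check=True,
        )
        STAMP.touch()
        return 0

    if __name__ == "__main__":
        sys.exit(main())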

Onward and Upward!

Allowable Complexity

I’m not sure I’d fully realized it before, but the key problems in systems administration--at least the kind that I interact with the most--are really manifestations of a tension between complexity and reliability.

Complex systems are often more capable and flexible, or so the theory goes. At the same time, complexity often leads to operational failure, as a larger number of moving parts means more potential points of failure. I think it’s an age-old engineering problem, and I doubt that there are good practical answers.

I’ve been working on a writing project where I explore a number of fundamental systems administration problem domains, so this kind of thing is on my mind. It seems that the way to address the hard questions often comes back to “what are the actual requirements, and are you willing to pay the premium to make the complex system reliable?”

Trade-offs around complexity also happen in software development proper: I’ve heard more than a few developers in the last few months weigh the complexity of using dynamic languages like Python for very large scale projects. While the questions and implications manifest differently for code, it seems like this is part of the same problem.

Rather than prattle on about various approaches, I’m just going to close out this post with a few open questions/thoughts:

  • What’s the process for determining requirements that accounts for actual required complexity?

  • How do things that had previously been complex, become less complex?

    Perhaps someone just has to write the code in C or C++ and let it mature for a few years before administrators accept it as stable?

  • Is there a corresponding complexity threshold in software development and within software itself? (Likely yes,) and is it related to something intrinsic to particular design patterns, or to tooling (i.e. programming language implementations, compilers, and so forth)?

Might better developer tooling allow us to write programs of larger scope in dynamic languages (perhaps?)

Reader submitted questions:

  • Your questions here.

Answers, or attempts thereat, in the comments.

The Overhead of Management

Every resource, every person, every project, every machine you have to manage comes with an ongoing cost. This is just as true of servers as it is of people who work on projects that you’re in charge of or have some responsibility for, and while servers and teammates present very different kinds of management challenges, working effectively and managing management costs across contexts is (I would propose) similar. Or at least similar enough to merit some synthetic discussion.

There’s basically only one approach to managing “systems administration costs,” and that’s to avoid them as much as possible. This isn’t to say that sysadmins avoid admining, but rather that we work very hard to ensure that systems don’t need administration. We write operating systems that administer themselves, we script procedures to automate most tasks as much as possible (the Perl programming language was developed and popularized for easing the administration of UNIX systems), and we use tools to manage larger systems more effectively.

People, time, and other resources cannot be so easily automated, and I think in response there are two major approaches (if we can create a somewhat false dichotomy for a moment):

On the one hand there’s the school of thought that says “admit and assess management costs early, and pay them up front.” This is the corporate model in many ways. Have (layers upon layers of) resources dedicated to managing management costs, and then let this “middle management” make sure that things get done in spite of the management burden. On servers this is spending a lot of time choosing tools, configuring the base system, organizing the file system proactively, and constructing a healthy collection of “best practices.”

By contrast, the other perspective suggests that management costs should only be paid when absolutely necessary: make things, get something working and extant, and then if something needs to be managed later, do it then and only as you need to. On some level this is the philosophy behind the frequent favoring of “working code” over “great ideas” in the open source world.1 Though I think they phrase it differently, this is the basic approach that many hacker-oriented start-ups have taken, and it seems to work for them. On the server, this approach is the “get it working” approach, and these administrators aren’t bothered by having to go in every so often to “redo” how things are configured; I think on some level this kind of approach to “management overhead” grows out of the agile world and the avoidance of “premature optimization.”

But like all “somewhat false dichotomies,” there are flaws in the above formulation. Mostly the “late management” camp is able to delay management most effectively by anticipating their future needs (either by smarts or by dumb luck) early and planning around that. And the “early management” camp has to delay some management needs or else you’d be drowned in overhead before you started: and besides, the MBA union isn’t that strong.

We might even cast the “early management” approach as “top down,” and the “late management” camp as “bottom up.” You know, if we were into that kind of thing. It’s always tempting, particularly in the contemporary moment, to look at the bottom-up approach and say “that’s really innovative and awesome, that’s better,” and to view “top-down” organizations as “stodgy and old world,” when neither does a very good job of explaining what’s going on, and there isn’t inherent radicalism or stodginess in either organization. But it is interesting. At least mildly.

Thoughts? Onward and Upward!


  1. Alan Cox’s Cathedrals, Bazaars and the Town Council ↩︎

technology as infrastructure, act three

Continued from Technology as Infrastructure, Act Two.

Act Three

All my discussions of “technology as infrastructure” thus far have been fairly high level: discussions of the particular business strategies of major players (e.g. google and amazon), discussions of approaches to “the cloud,” and so forth. As is my way, however, I’ve noticed that the obvious missing piece of this puzzle is how users--like you and me--are going to use the cloud, how thinking about technology as infrastructure changes the way we interact with our technology, and other related issues.

One of my introductory interludes was a new use case that I’ve developed for myself: I run my chat clients on a server using GNU screen, which is an incredibly powerful, clever, and impossible-to-describe application. I’ve written about it before, but let’s just describe its functionality as follows:

Screen allows users to begin a persistent (terminal/shell/console) session on one computer, and then “detach” and continue that session on another machine, where the session runs, virtually indistinguishable from a “native” session.

So my chat programs are running on a server “inside of” a screen session and when I want to talk to someone, I trigger something on my local machine that connects to that screen session, and a second later, the program is up and running just as I left it.

Screen can, of course, be used locally (and I do use it in this mode every waking moment of my day), but there’s something fundamentally different about how this specific use case affects the way I think about my connection.

This is just one, and one very geeky, example of what infrastructural computing--the cloud--is all about. We (I) can talk till we’re (I’m) blue in the face, but I think the interesting questions arise not from thinking about how the infrastructure and the software will develop, but rather from thinking about what this means to people on the ground.

At a(n apparently) crucial moment in the development of “the cloud” my personal technological consumption went from “quirky but popular and mainstream” to fiercely independent, hackerish, and free-software-based. As a result, my examples in this area may not be concretely helpful in figuring out the path of things to come.

I guess the best I can do at the moment is to pose a series of questions, and we’ll discuss the answers, if they seem apparent, in the comments:

  • Does “the cloud” provide more--in any meaningful way--than a backup service? It seems like the key functionality that cloud services provide is hosting for things like email and documents that is more reliable than what the ordinary consumer could manage with their own backups.
  • Is there functionality in standards and conventions that is underutilized in desktop computing, and that infrastructural approaches could take advantage of without building proprietary layers on top of JavaScript and HTTP?
  • Is it more effective to teach casual users advanced computing techniques (i.e. using SSH) or to develop solutions that make advanced infrastructural computing easier for casual users (i.e. front ends for git, more effective remote-desktop services)?
  • Is it more effective for connections to “the cloud” to be baked into current applications (more or less the current approach), or to bake connections to the cloud into the operating system (e.g. mounting infrastructural resources as file systems)?
  • Is the browser indeed the prevailing modality, or simply the most convenient tool for network interaction?
  • Do we have enough conceptual experience with using technology to collaborate (eg. wikis, source control systems like git, email) to be able to leverage the potential of the cloud, in ways that reduce total workloads rather than increase said workloads?
  • Does infrastructural computing grow out of the problem of limited computing power (we might call this “vertical complexity”), or out of the problem of managing computing resources in multiple contexts (e.g. work, home, laptop, desktop, cellphone; we might call this “horizontal complexity”)? And does this affect the kinds of solutions that we are able to think about and use?

Perhaps the last question isn’t quite user-centric, but I think it leads to a lot of interesting thinking about possible technologies. In a lot of ways the most useful “cloud” tool that I use is Google’s Blackberry sync tool, which keeps my calendar and address book synced (perfectly! so much that I don’t even notice) between my computer, the phone, and the web. Git, for me, solves the horizontal problem. I’m not sure that there are many “vertical problems” other than search and data filtering, but it’s going to be interesting to think about.

In any case, I look forward to discussing the answers and implications of these issues with you all, so if you’re feeling shy, don’t, and leave a comment.

Cheers!

technology as infrastructure, act two

Continued from Technology as Infrastructure, Act One.

Act Two

Cnet’s Matt Assay, covering this post by RedMonk’s Stephen O’Grady, suggests that an “open source cloud” is unlikely because superstructure (hardware/concrete power) matters more than infrastructure (software)--though in IT “infrastructure” means something different, so go read Stephen’s article.

It’s my understanding that, in a manner of speaking, open source has already “won” this game. Though google’s code is proprietary, it runs on a Linux/JavaScript/Python platform. Amazon’s “cloud” (EC2) runs on Xen (the open source virtualization platform), and nearly all of the operating system choices are Linux-based (Solaris and Windows are options).

I guess the question of “what cloud” might seem trite, but I think clarifying “which cloud” is crucial at this point, particularly with regard to openness. There seem to be several:

  • Cloud infrastructure. Web servers, hosting, email servers. Traditionally these are things an institution ran its own servers for; these days that same institution might run its servers on some sort of virtualized hardware, for which there are many providers.

    How open? Open. There are certainly proprietary virtualization tools (VMware, windows-whatever, etc.), and you can virtualize Windows, and I suppose HP-UX and AIX are getting virtualized as well. But Linux-based operating systems are likely virtualized at astonishing rates compared to non-open-source OSes. And much of the server infrastructure (sendmail, postfix/exim, Apache, etc.) is open source at some point.

    In point of fact, this cloud is more or less the way it’s always been and is, I’d argue, open-source’s “home turf.”

  • Cloud applications: consumer. This would be stuff like Gmail, flickr, wikipedia, twitter, facebook, ubuntuONE, google docs, google wave, and other “application services” targeted at non-commercial/enterprise consumers and very small groups of people. This cloud consists entirely of software, provided as services, and is largely dominated by google and the other big players (Microsoft, yahoo, etc.)

    How open? Not very. This space looks very much like the desktop computing world looked in the mid-90s: very proprietary, very closed, and the alternatives are pretty primitive and have a hard time doing anything but throwing rocks at the feet of the giant (google).

  • Cloud applications: enterprise. This would be things like SalesForce (a software-as-a-service CRM tool) and other SaaS applications. I suppose google-apps-for-domains falls under this category, as does pretty much anything that uses the term SaaS.

    How open? Not very. SaaS is basically Proprietary Software: The Next Generation, as the business model is based on the exclusivity of rights over the source code. At the same time, in most sectors there are viable open source projects competing with the proprietary options: SugarCRM, Horde, Squirrel Mail, etc.

  • Cloud services: enterprise. This is what act one covered, or alluded to, but generally it includes things like PBX systems, all the stuff that runs corporate intranets, groupware applications (some of which are open source), collaboration tools, internal issue tracking systems, and shared storage systems.

    How open? Reasonably open. Certainly there’s a lot of variance here, but for the most part: there’s Asterisk for PBX stuff, and there are a number of open source groupware applications. Jira/perforce/bitkeeper aren’t open source, but Trac/SVN/git are. The samba project kills in this area and is a drop-in replacement for Microsoft’s file-sharing systems.

The relationship between open source and “the cloud,” thus, depends a lot on what you’re talking about. I guess this means there needs to be an “act three” to cover specific user strategies. Because, regardless of which cloud you use, your freedom has more to do with practice than it does with some inherent capability of the software stack.

technology as infrastructure, act one

Act One

This post is inspired by three converging observations:

1. Matt posted a comment on a previous post that read:

“Cloud” computing. Seriously. Do we really want to give up that much control over our computing? In the dystopian future celebrated by many tech bloggers, computers will be locked down appliances, and we will rely on big companies to deliver services to us.

2. A number of podcasts that I listened to while I drove to New Jersey produced/hosted/etc. by Michael Cote for RedMonk that discussed current events and trends in “Enterprise-grade Information Technology,” which is a world, that I’m only beginning to scratch the surface of.

3. Because my Internet connection at home is somewhat spotty, and because it makes sense to have an always-on (and mobile) connection to IRC for work, I’ve started running my chat clients (mcabber and irssi) inside of a gnu screen session on my server.


My specific responses:

1. Matt’s right, from a certain perspective. There’s a lot of buzz-word-heavy, venture-capital-driven, consumer-targeted “cloud computing tools” which seem to be all about getting people to use web-based “applications,” and give up autonomy in exchange for data that may be more available to us because it’s stored on someone’s network.

Really, however, I think this isn’t so much a problem with “networked computing,” as it is with both existing business models for information technology, and an example of the worst kind of cloud computing. And I’m using Matt’s statement as a bit of a straw man, as a lot of the things that I’m including under the general heading of “cloud computing,” aren’t really what Matt’s talking about above.

At the same time I think there is the cloud that Matt refers to: the Google/Microsoft/Startup/Ubuntu One/etc. cloud, and then there’s all the rest of distributed/networked/infrastructural computing which isn’t new or sexy, but I think is really the same as the rest of the cloud.

2. The “enterprise” world thinks about computers in a much different way than I ever do. Sometimes this is frustrating: the tendrils of proprietary software are strongest here, and enterprise folks care way too much about Java. In other aspects it’s really fascinating, because technology becomes an infrastructural resource, rather than a concrete tool which accomplishes a specific task.

Enterprise hardware and software exists to give large corporate institutions the tools to manage large amounts of data/projects/communications/etc.

This is, I think on some level, the real cloud. This “technology-as-infrastructure” thing.

3. In an elaboration of the above, I outsourced a chunk of my computing to “the cloud.” I could run those applications locally, and I haven’t given up that possibility, but one needs a network connection to use a chat client anyway, so the set of situations where I would want to connect to a chat server but wouldn’t be able to reach my own server is vanishingly small (particularly because some of the chat servers run on my hardware).


I guess the point I’m driving at is: maybe this “cloud thing” isn’t about functionality, or websites, or software, or business models, but rather about the evolution of our computing needs from providing a set of tools and localized resources to providing infrastructure.

And that the shift isn’t so much about the technology: in point of fact running a terminal application in a screen session over SSH isn’t a cutting edge technology by any means, but rather about how we use the technology to support what it is we do.

Or am I totally off my rocker here?