Systems Administrators are the Problem
For years now, the idea of the terrible stack, or the dynamic duo of Terraform and Ansible, from this tweet has given me a huge amount of joy, basically anytime someone mentions either Terraform or Ansible, which happens rather a lot. It’s not that I think that Terraform or Ansible are exactly terrible: the configuration management problems that these pieces of software are trying to solve are real and actually terrible, and having tools that help regularize the problem of configuration management definitely improves things. And yet the tools leave a bit to be desired.
Why care so much about configuration management?
Configuration matters because every application needs some kind of configuration: a way to connect to a database (or similar), a place to store its output, and inevitably other things, like dependencies, or feature flags, or whatever.
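To make that concrete, here is a minimal sketch of what even the simple case looks like in code. The variable names (`APP_DATABASE_URL`, `APP_OUTPUT_DIR`, `APP_FEATURE_FLAGS`) and the `AppConfig` type are made up for illustration; the point is only that a database connection, an output location, and a set of flags all have to come from somewhere outside the program.

```python
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AppConfig:
    # Hypothetical settings mirroring the examples in the text.
    database_url: str
    output_dir: str
    feature_flags: frozenset = field(default_factory=frozenset)


def load_config(env=None):
    """Build an AppConfig from a mapping of environment variables."""
    env = os.environ if env is None else env
    required = ("APP_DATABASE_URL", "APP_OUTPUT_DIR")
    missing = [key for key in required if not env.get(key)]
    if missing:
        # Fail loudly at startup rather than halfway through a request.
        raise ValueError("missing required settings: " + ", ".join(missing))
    raw_flags = env.get("APP_FEATURE_FLAGS", "")
    flags = frozenset(f.strip() for f in raw_flags.split(",") if f.strip())
    return AppConfig(env["APP_DATABASE_URL"], env["APP_OUTPUT_DIR"], flags)
```

Something still has to set those variables on every machine the application runs on, and that is the part that turns into a configuration management problem.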
And that’s the simple case. While most things are probably roughly simple, it’s very easy to have requirements that go a bit beyond this, and it turns out that while a given development team might--but only might--not have any requirements that qualify as “weird,” every organization has something.
As a developer, configuration and deployment often matter a bunch, and it’s pretty common to need to make changes to this area of the code. While it’s possible to architect things so that configuration can be managed within an application (say), this takes longer and isn’t always easy to implement, and if your application requires escalated permissions or needs a system configuration value set, then it’s easy to get stuck.
And there’s no real way to avoid it: if you don’t have a good way to manage configuration state, then infrastructure becomes bespoke and fragile, and this is bad. Sometimes people suggest using image-based distribution (so-called “immutable infrastructure”), but this tends to be slow (images are large and can take a while to build), and you still have to capture configuration in some way.
But how did we get here?
I think I could weave a really convincing, and likely true, story about the discipline of system administration and software operations in general and its history, but rather than go overboard, I think the following factors are pretty important:
- computers used to be very expensive and difficult to operate, so it made sense to have people who were primarily responsible for operating them, and this role has more or less persisted forever.
- service disruptions can be very expensive, so it’s useful for organizations to have people who are responsible for “keeping the lights on,” and troubleshoot operational problems when things go wrong.
- most computer systems depend on state of some kind--files on disks, the data in databases--and managing that state can be quite delicate.
- recent trends in computing make it possible to manipulate infrastructure--computers themselves, storage devices, networks--with code, which means we have this unfortunate dualism of infrastructure where it’s kind of code but also kind of data, and so it feels hard to know what the right thing to do is.
Why not just use <xyz>?
This isn’t fair, really, and you know it’s gonna be good when someone trivializes an adjacent problem domain with a question like this, but this is my post so you must endure it, because the idea that there’s another technology or way of framing the problem that makes this better is incredibly persistent.
Usually <xyz>, in recent years, has been “Kubernetes” or “Docker” or “containers,” but it sort of doesn’t matter, and in the past the solutions were platforms-as-a-service (e.g. AppEngine/etc.) or backend-as-a-service (e.g. Parse/etc.). So let’s run down some answers:
- “bake configuration into the container/virtual machine/etc. and then you won’t have state” is a good idea, except it means that if you need to change configuration very quickly, it becomes quite hard because you have to rebuild and deploy an image, which can take a long time, and then there’s the problem of how you get secrets like credentials into the service.
- “use a service for your platform needs” is a good solution, except that it can be pretty inflexible, particularly if you have an application that wasn’t designed for the service, or need to use some kind of off-the-shelf service or tool (a message bus, a cache, etc.) that wasn’t designed to run in this kind of environment. It’s also the case that the hard cost of using platforms-as-a-service can be pretty high.
- “serverless” approaches have something of a bootstrapping problem: how do you manage the configuration of the provider? How do you get secrets into the execution units?
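The secrets question shows up in every one of these answers, so here is a rough sketch of what the application side usually ends up looking like. It assumes, purely for illustration, that the platform either mounts secrets as files in a directory (the `/run/secrets` default here is an assumption, as is the `read_secret` helper itself) or exposes them as environment variables:

```python
import os
from pathlib import Path


def read_secret(name, secrets_dir="/run/secrets", env=None):
    """Look up a secret at runtime instead of baking it into an image.

    Checks a mounted file first, then falls back to an upper-cased
    environment variable of the same name.
    """
    env = os.environ if env is None else env
    path = Path(secrets_dir) / name
    if path.is_file():
        return path.read_text().strip()
    value = env.get(name.upper())
    if value is None:
        raise KeyError(f"secret {name!r} not found in {secrets_dir} or environment")
    return value
```

Note that this only moves the problem: something still has to put the file or the variable there, which is exactly the configuration you were hoping not to manage.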
What’s so terrible about these tools?
- The tools can’t decide if configuration should be described programmatically, using general-purpose programming languages and frameworks (e.g. Chef, many deployment tools), or using some kind of declarative structured tool (Puppet, Ansible), or some kind of ungodly hybrid (e.g. Helm, anything with HCL). I’m not sure that there’s a good answer here. I like being able to write code, and I think YAML-based DSLs aren’t great, but capturing configuration creates a huge amount of difficult-to-test code. Regardless, you need to find ways of being able to test the code inexpensively, and doing this in a way that’s useful can be hard.
- Many tools are opinionated and have strong idioms, in hopes of helping to make infrastructure more regular and easier to reason about. This is cool and a good idea, but it makes it harder to generalize. While concepts like immutability and idempotency are great properties for configuration systems to have, they’re difficult to enforce, and so maybe it would be more useful to develop patterns and systems that have weaker opinions that are easy to comply with, and idioms that can be applied iteratively.
- Tools are willing to do things to your systems that you’d never do by hand, including a number of destructive operations (Terraform is particularly guilty of this), which erodes trust in them and inspires otherwise bored ops folks to write/recapitulate their own systems, which is why so many different configuration management tools emerge.
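For what it’s worth, the properties in question are easy to illustrate at a small scale, and hard mostly at a large one. The following is a toy sketch, not how any of the tools above actually work: `ensure_line` is a hypothetical helper that is idempotent (running it twice changes nothing the second time) and has a dry-run mode, so you can see what it would do before it does it.

```python
from pathlib import Path


def ensure_line(path, line, dry_run=False):
    """Make sure `line` appears in the file at `path`.

    Returns True if a change was made (or would be made, under dry_run)
    and False if the file already contains the line.
    """
    path = Path(path)
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already converged, nothing to do
    if not dry_run:
        path.write_text("\n".join(existing + [line]) + "\n")
    return True
```

Calling `ensure_line(path, line, dry_run=True)` first and only then running it for real is the kind of weak, easy-to-comply-with idiom that can be applied one operation at a time; the trouble starts when a tool decides that converging means deleting something.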
Maybe the tools aren’t actually terrible, and the organizational factors that lead to the entrenchment of operations teams (incumbency, incomplete cost analysis, difficult-to-meet stability requirements) also lead to the entrenchment of the kinds of processes that require tools like this (though causality could easily flow in the opposite direction, with the same effect).