For years now, the idea of the terrible stack, or the dynamic duo of
Terraform and Ansible, from this tweet has given me a huge amount of
joy, basically any time someone mentions either Terraform or Ansible,
which happens rather a lot. It’s not that I think Terraform or Ansible
are exactly terrible: the configuration management problems that these
pieces of software are trying to solve are real and actually terrible,
and having tools that help regularize the problem of configuration
management definitely improves things. And yet the tools still leave
something to be desired.
Why care so much about configuration management?
Configuration matters because every application needs some kind of
configuration: a way to connect to a database (or similar), a place to
store its output, and inevitably other things, like dependencies,
feature flags, or whatever.
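To make that concrete, here’s a minimal sketch (in Python, with
made-up names) of the kind of thing an application reads at startup;
the point isn’t the mechanism, it’s that every one of these values has
to come from somewhere, and something has to put it there:

```python
import os

# A made-up but typical set of settings an application might read at
# startup: where the database is, where output goes, a feature flag.
DATABASE_URL = os.environ.get("DATABASE_URL", "postgres://localhost:5432/app")
OUTPUT_PATH = os.environ.get("OUTPUT_PATH", "/var/lib/app/output")
NEW_REPORTS_ENABLED = os.environ.get("NEW_REPORTS_ENABLED", "false") == "true"
```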
And that’s the simple case. While most things are probably roughly
simple, it’s very easy to have requirements that go beyond this a bit,
and it turns out that while a development team might--but only
might--not have requirements for something that qualifies as “weird,”
every organization has something.
As a developer, configuration and deployment often matter a bunch, and
it’s pretty common to need to make changes in this area of the code.
While it’s possible to architect things so that configuration can be
managed within an application (say), this all takes longer and isn’t
always easy to implement, and if your application requires escalated
permissions, or needs a system configuration value set, then it’s easy
to get stuck.
And there’s no real way to avoid it: if you don’t have a good way to
manage configuration state, then infrastructure becomes bespoke and
fragile, and that’s bad. Sometimes people suggest using image-based
distribution (so-called “immutable infrastructure”), but this tends to
be slow (images are large and can take a while to build), and you still
have to capture configuration in some way.
But how did we get here?
I think I could weave a really convincing, and likely true, story about
the history of the discipline of system administration and software
operations in general, but rather than go overboard, I think the
following factors are pretty important:
- computers used to be very expensive, were difficult to operate, and so
it made sense to have people who were primarily responsible for
operating them, and this role has more or less persisted forever.
- service disruptions can be very expensive, so it’s useful for
organizations to have people who are responsible for “keeping the
lights on” and for troubleshooting operational problems when things go
wrong.
- most computer systems depend on state of some kind--files on disks,
the data in databases--and managing that state can be quite delicate.
- recent trends in computing make it possible to manipulate
infrastructure--computers themselves, storage devices,
networks--with code, which means we have this unfortunate dualism of
infrastructure where it’s kind of code but also kind of data, and so
it feels hard to know what the right thing to do is.
Why not just use <xyz>?
This isn’t fair, really--and you know it’s gonna be good when someone
trivializes an adjacent problem domain with a question like this--but
this is my post so you must endure it, because the idea that there’s
another technology or way of framing the problem that makes this all
better is incredibly persistent.
Usually <xyz>, in recent years, has been “Kubernetes” or “docker” or
“containers,” but it sort of doesn’t matter; in the past the solutions
were platforms-as-a-service (e.g. AppEngine/etc.) or
backends-as-a-service (e.g. Parse/etc.). So let’s run down some answers:
- “bake configuration into the container/virtual machine/etc. and then
you won’t have state,” is a good idea, except that if you need to
change configuration very quickly it becomes quite hard, because you
have to rebuild and deploy an image, which can take a long time, and
then there are the problems of how you get secrets like credentials
into the service (more on that in the sketch after this list).
- “use a service for your platform needs,” is a good solution, except
that it can be pretty inflexible, particularly if you have an
application that wasn’t designed for the service, or need to use some
kind of off-the-shelf service or tool (a message bus, a cache, etc.)
that wasn’t designed to run in this kind of environment. It’s also the
case that the hard cost of using platforms-as-a-service can be pretty
high.
- “serverless” approaches have something of a bootstrapping problem:
how do you manage the configuration of the provider? How do you get
secrets into the execution units?
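For the secrets question in particular, the usual workaround is to
inject credentials at runtime rather than baking them into the image.
Here’s a minimal sketch of that pattern; the environment variable name
and the mounted-file path are assumptions for illustration, not any
particular platform’s convention:

```python
import os
from pathlib import Path

def load_db_password() -> str:
    # Prefer an injected environment variable, then fall back to a
    # mounted secrets file; both are runtime inputs, so rotating the
    # credential doesn't require rebuilding an image. (The variable
    # name and path are made up.)
    if "DB_PASSWORD" in os.environ:
        return os.environ["DB_PASSWORD"]
    secret_file = Path("/run/secrets/db_password")
    if secret_file.exists():
        return secret_file.read_text().strip()
    raise RuntimeError("no database credential was provided")
```

The code is trivial; the terrible part is that something outside the
application still has to put that value in place, which is the
configuration management problem all over again.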
What’s so terrible about these tools?
- The tools can’t decide if configuration should be described
programmatically, using general-purpose programming languages and
frameworks (e.g. Chef, many deployment tools), or using some kind of
declarative structured tool (Puppet, Ansible), or some kind of ungodly
hybrid (e.g. Helm, anything with HCL). I’m not sure that there’s a
good answer here. I like being able to write code, and I think
YAML-based DSLs aren’t great; but capturing configuration creates a
huge amount of difficult-to-test code. Regardless, you need to find
ways of testing that code inexpensively, and doing this in a way
that’s useful can be hard.
- Many tools are opinionated and have strong idioms, in hopes of
helping to make infrastructure more regular and easier to reason
about. This is cool and a good idea, but it makes it harder to
generalize. While concepts like immutability and idempotency are great
properties for configuration systems to have, they’re difficult to
enforce, and so maybe developing patterns and systems with weaker
opinions that are easy to comply with, and idioms that can be applied
iteratively, would be useful (there’s a toy sketch of the idempotent
pattern after this list).
- Tools are willing to do things to your systems that you’d never do by
hand, including a number of destructive operations (Terraform is
particularly guilty of this), which erodes some of the trust in them
and inspires otherwise bored ops folks to write (or recapitulate) their
own systems, which is why so many different configuration management
tools keep emerging.
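To make the idempotency point a little more concrete, here’s a toy
sketch of the “declare desired state, apply it idempotently” pattern
that tools like Puppet and Ansible are organized around. This isn’t
any real tool’s API, just the shape of the idea:

```python
from pathlib import Path

# Desired state: files that should exist with specific contents.
# (The path and contents are made up for illustration.)
DESIRED_FILES = {
    "/etc/app/feature_flags.conf": "new_reports=true\n",
}

def apply(state: dict) -> None:
    for path, content in state.items():
        target = Path(path)
        # Only write when the file is missing or different, so running
        # this repeatedly converges and then does nothing: idempotency.
        if not target.exists() or target.read_text() != content:
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content)

# apply(DESIRED_FILES)  # typically run with enough privileges to write /etc
```

Writing the happy path is easy; the hard parts are the ones the bullets
above complain about: enforcing these properties across resources you
don’t fully control, testing the code cheaply, and deciding how much of
this belongs in a general-purpose language versus a DSL.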
Maybe the tools aren’t actually terrible, and the organizational
factors that lead to the entrenchment of operations teams (incumbency,
incomplete cost analysis, difficult-to-meet stability requirements)
lead to the entrenchment of the kinds of processes that require tools
like this (though causality could easily flow in the opposite
direction, with the same effect).