Systems Administrators are the Problem

For years now, the idea of the terrible stack, or the dynamic duo of Terraform and Ansible, from this tweet has given me a huge amount of joy, basically anytime someone mentions either Terraform or Ansible, which happens rather a lot. It’s not exactly that I think that Terriform or Ansible are exactly terrible: the configuration management problems that these pieces of software are trying to solve are real and actually terrible, and having tools that help regularize the problem of configuration management definitely improve things. And yet the tools leave things wanting a bit.

Why care so much about configuration management?

Configuration matters because every application needs some kind of configuration: a way to connect to a database (or similar), a place to store its output, and inevitably other things, like a dependencies, or feature flags or whatever.

And that’s the simple case. While most things are probably roughly simple, it’s very easy to have requirements that go beyond this a bit, and it turns out that while a development team might--but only might--not have requirements for something that qualifies as “weird” but every organization has something.

As a developer, configuration and deployment often matters a bunch, and it’s pretty common to need to make changes to this area of the code. While it’s possible to architect things so that configuration can be managed within an application (say), this all takes longer and isn’t always easy to implement, and if your application requires escalated permissions, or needs a system configuration value set then it’s easy to get stuck.

And there’s no real way to avoid it: If you don’t have a good way to manage configuration state, then infrastructure becomes bespoke and fragile: this is bad. Sometimes people suggest using image-based distribution (so called “immutable infrastructure,") but this tends to be slow (images are large and can take a while to build,) and you still have to capture configuration in some way.

But how did we get here?

I think I could weave a really convincing, and likely true story about the discipline of system administration and software operations in general and its history, but rather than go overboard, I think the following factors are pretty important:

  • computers used to be very expensive, were difficult to operate, and so it made sense to have people who were primarily responsible for operating them, and this role has more or less persisted forever.
  • service disruptions can be very expensive, so it’s useful for organizations to have people who are responsible for “keeping the lights on,” and troubleshoot operational problems when things go wrong.
  • most computer systems depend on state of some kind--files on disks, the data in databases--and managing that state can be quite delicate.
  • recent trends in computing make it possible to manipulate infrastructure--computers themselves, storage devices, networks--with code, which means we have this unfortunate dualism of infrastructure where it’s kind of code but also kind of data, and so it feels hard to know what the right thing to do.

Why not just use <xyz>

This isn’t fair, really, but and you know it’s gonna be good when someone trivializes an adjacent problem domain with a question like this, but this is my post so you must endure it, because the idea that there’s another technology or way of framing the problem that makes this better is incredibly persistent.

Usually <xyz>, in recent years has been “Kubernetes” or “docker” or “containers,” but it sort of doesn’t matter, and in the past solutions platforms-as-a-service (e.g. AppEngine/etc.) or backend-as-a-service (e.g. parse/etc.) So let’s run down some answers:

  • “bake configuration into the container/virtual machine/etc. and then you won’t have state,” is a good idea, except it means that if you need to change configuration very quickly, it becomes quite hard because you have to rebuild and deploy an image, which can take a long time, and then there’s problems of how you get secrets like credentials into the service.
  • “use a service for your platform needs,” is a good solution, except that it can be pretty inflexible, particularly if you have an application that wasn’t designed for the service, or need to use some kind of off-the-shelf (a message bus, a cache, etc.) service or tool that wasn’t designed to run in this kind of environment. It’s also the case that the hard cost of using platforms-as-a-service can be pretty high.
  • “serverless” approaches something of a bootstrapping problem, how do you manage the configuration of the provider? How do you get secrets into the execution units?

What’s so terrible about these tools?

  • The tools can’t decide if configuration should be described programatically, using general purpose programming languages and frameworks (e.g. Chef, many deployment tools) or using some kind of declarative structured tool (Puppet, Ansible), or some kind of ungodly hybrid (e.g. Helm, anything with HCL). I’m not sure that there’s a good answer here. I like being able to write code, and I think YAML-based DSLs aren’t great; but capturing configuration creates a huge amount of difficult to test code. Regardless, you need to find ways of being able to test the code inexpensively, and doing this in a way that’s useful can be hard.
  • Many tools are opinionated have strong idioms in hopes of helping to make infrastructure more regular and easier to reason about. This is cool and a good idea, it makes it harder to generalize. While concepts like immutability and idempotency are great properties for configuration systems to have, say, they’re difficult to enforce, and so maybe developing patterns and systems that have weaker opinions that are easy to comply with, and idioms that can be applied iteratively are useful.
  • Tools are willing to do things to your systems that you’d never do by hand, including a number of destructive operations (terraform is particularly guilty of this), which erodes some of their trust and inspires otherwise bored ops folks, to write/recapitulate their own systems, which is why so many different configuration management tools emerge.

Maybe the tools aren’t actually terrible, and the organizational factors that lead to the entrenchment of operations teams (incumbency, incomplete cost analysis, difficult to meet stability requirements,) lead to the entrenchment of the kinds of processes that require tools like this (though causality could easily flow in the opposite direction, with the same effect.)

Easy Mode, Hard Mode

I’ve been thinking more recently about that the way that we organize software development projects, and have been using this model of “hard mode” vs “easy mode” a bit recently, and thought it might be useful to expound upon it.


Often users or business interests come to software developers and make a feature request. “I want it to be possible to do thing faster or in combination with another operation, or avoid this class of errors.” Sometimes (often!) these are reasonable requests, and sometimes they’re even easy to do, but sometimes a seemingly innocuous feature or improvement are really hard sometimes, engineering work requires hard work. This isn’t really a problem, and hard work can be quite interesting. It is perhaps, an indication of an architectural flaw when many or most easy requests require disproportionately hard work.

It’s also the case that it’s possible to frame the problem in ways that make the work of developing software easier or harder. Breaking problems into smaller constituent problems make them easier to deal with. Improving the quality of the abstractions and testing infrastructure around a problematic area of code makes it easier to make changes later to an area of the code.

I’ve definitely been on projects where the only way to develop features and make improvement is to have a large amount of experience with the problem domain and the codebase, and those engineers have to spend a lot of consentrated time building features and fighting against the state of the code and its context. This is writing software in “hard mode,” and not only is the work harder than it needs to be, features take longer to develop than users would like. This mode of development makes it very hard to find and retain engineers because of the large ramping period and consistently frustrating nature of the work. Frustration that’s often compounded by the expectation or assumption that easy requests are easy to produce.

In some ways the theme of my engineering career has been work on taking “hard mode projects” reducing the barriers to entry in code bases and project so that they become more “easy mode projects”: changing the organization of the code, adding abstractions that make it easier to develop meaningful features without rippling effects in other parts of the code, improving operational observability to facilitate debugging, restructuring project infrastructure to reduce development friction. In general, I think of the hallmarks of “easy mode” projects as:

  • abstractions and library functions exist for common tasks. For most pieces of “internet infrastructure” (network attached services,) developer’s should be able to add behavior without needing to deal with the nitty gritty of thread pools or socket abstractions (say.) If you’re adding a new REST request, you should be able to just write business logic and not need to think about the applications threading model (say). If something happens often (say, retrying failed requests against upstream API,) you should be able to rely on an existing tool to orchestrate retries.
  • APIs and tools are safe ergonomic. Developers writing code in your project should be able to call into existing APIs and trust that they behave reasonably and handle errors reasonably. This means, methods should do what they say, and exported/public interfaces should be difficult to use improperly, and (e.g. expected exception handling/safety, as well as thread safety and nil semantics ad appropriate.) While it’s useful to interact with external APIs defensively, you can reduce the amount of effort by being less defensive for internal/proximal APIs.
  • Well supported code and operational infrastructure. It should be easy to deploy and test changes to the software, the tests should run quickly, and when there’s a problem there should be a limited number of places that you could look to figure out what’s happening. Making tests more reliable, improving error reporting and tracing, exposing more information to metrics systems, to make the behavior of the system easier to understand in the long term.
  • Changes are scoped to support incremental development. While there are lots of core technical and code infrastructure work that support making projects more “easy mode” a lot of this is about the way that development teams decide to structure projects. This isn’t technical, ususally, but has more to do with planning cadences, release cadences, and scoping practices. There are easier and harder ways of making changes, and it’s often worthwhile to ask yourself “could we make this easier.” The answer, I’ve found, is often “yes”.

Moving a project from hard to easy mode is often in large part about investing in managing technical debt, but it’s also a choice: we can prioritize things to make our projects easier, we can make small changes to the way we approach specific projects that all move projects toward being easier. The first step is always that choice.

Tips for More Effective Multi-Tasking

I posted something about how I organize my own work, and I touched on “multi-tasking,” and I realized immediately that I had touched something that required a bit more explanation.

I feel like a bit of an outlier to suggest that people spend time learning how to multitask better, particularly when the prevaling conventional wisdom is just “increase focus,” “decrease multitasking,” reduce “context switches,” between different tasks. It’s as if there’s this mythical word where you can just “focus more” taking advantage of longer blocks of time, with fewer distractions, and suddenly be able to get more done.

This has certainly never been true of my experience.

I was, perhaps unsurprisingly, a bit disorganized as a kid. Couldn’t sit still, forgot deadlines, focused inconsistently on things: sometimes I was unstoppable, and sometimes nothing stuck. As an adult, I’ve learned more about myself and I know how to provide the kind of structure I need to get things done, even for work that I find less intrinsically fascinating. Also I drink a lot more caffeine. I’m also aware that with a slightly different brain or a slightly different set of coping strategies, I would struggle a lot more than I do.

There are a lot of reasons why it can be difficult to focus, but I don’t think the why matters much here: thinking pragmatically about how to make the most of the moments we do have, the focus that’s available. Working on multiple things just is, and I think to some extent its a skill that we can cultivate or at least approximate. Perhaps some of the things I do would be useful to you:

  • fit your tasks to the attention you have. I often write test code later in the day or during my afternoon slump between 2-3:30, and do more complicated work with my morning coffee between 9:30 and 11:30, and do more writing later in the day. There are different of tasks, and knowing what kinds of work makes sense for which part of the day can be a great help.
  • break tasks apart as small as you can do, even if it’s just for yourself. It’s easy to get a little thing done, and bigger tasks can be intimidating. If the units of work that you focus on are the right size it’s possible to give yourself enough time to do the work that you need to do and intersperse tasks from a few related projects.
  • plan what you do before you do it, and leave yourself notes about your plan. As I write code I often write a little todo list that contains the requirements for a function. This makes it easy to pick something up if you get interrupted. My writing process also involves leaving little outlines of paragraphs that I want to write or narrative elements that I want to pass.
  • leave projects, when possible, at a stopping point. Make it easy for yourself to pick it back up when you’re ready. Maybe this means making sure that you finish writing a test or some code, rather than leaving a function half written. When writing prose, I sometimes finish a paragraph and write the first half of the next sentence, to make it easier to pick up.
  • exercise control what and when you do things. There are always interruptions, or incoming mesages and alerts that could require our attention. There are rarely alerts that must cause us to drop what we’re currently working on. While there are “drop everything” tasks sometimes, most things are fine to come back to in a little while, and most emails are safe to ignore for a couple of hours. It’s fine to quickly add something to a list to come back to later. It’s also fine to be disrupted, but having some control over that is often helpful.
  • find non-intrusive ways to feel connected. While it should be possible to do some level of multitasking as you work, there are some kind of interruptions that take a lot of attention. When you’re focusing on work, checking your email can be a distraction (say), but it can be hard to totally turn off email while you’re working. Rather than switch to look at my email on some cadence throughout the day, I (effectively,) check my phone far more regularly just to make sure that there’s nothing critical, and can go much longer between looking at my email. The notifications I see are limited, and may messages never trigger alerts. I feel like I know what’s going on, and I don’t get stuck replying to email all day.1

This is, more or less, what works for me, and I (hope) that there’s something generalizable here, even if we do different kinds of work!


  1. Email is kind of terrible, in a lot of ways: there’s a lot of it, messages come in at all times, people are bad at drafting good subject lines, a large percentage of email messages are just automated notifications, historically you had to “check it” which took time, and drafting responses can take quite a while, given that the convention is for slightly longer messages. I famously opted out of email, basically for years, and gleefully used all the time I wasn’t reading email to get things done. The only way this was viable, was that I’ve always had a script that checks my mail and sends me a notification (as an IM) with the From and Subject line of most important messages, which gives me enough context to actually respond to things that were important (most things aren’t) without needing to actually dedicate time to looking at email. ↩︎

Test Multi-Execution

Editoral Note: this is a follow up to my earlier Principles of Test Oriented Software Development post.

In software development, we write tests to make sure the code we write does what we want it to do. Great this is pretty easy to get behind.

Tests sometimes fail.

The goal, is that, most of the time when tests fail, it’s because the code is broken: you fix the code and the test passes. Sometimes when test fail there’s a bug in the test, it makes an assertion that can’t or shouldn’t be true: these are bad because they mean the test is broken, but all code has bugs, and test code can be broken so that’s fine.

Ideally either pass or fail, and if a test fails it fails repeatedly, with the same error. Unfortunately, this is of course not always true, and tests can fail intermittently if they test something that can change, or the outcome of the test is impacted by some external factor like “the test passes if the processor is very fast, and the system does not have IO contention, but fails sometimes as the system slows down.” Sometimes tests include (intentionally or not) some notion of “randomnesses,” and fail intermittently because of this.

A test suite with intermittent failures is basically the worst. A suite that never fails isn’t super valuable, because it probably builds false confidence, a test suite that always fails isn’t useful because developers will ignore the results or disable the tests, but a test that fails intermittently, particularly one that fails 10 or 20 percent of the time, means that developers always will always look at the test, or just rerun the test until it passes.

There are a couple of things you can do to fix your tests:

  • write better tests: find sources of non-determinism in your test and rewrite tests to avoid these kinds of “flaky” outcomes. Sometimes this means restructuring your tests in a more “pyramid-like” structure, with more unit tests and fewer integration tests (which are likely to be less deterministic.)
  • run tests more reliably: find ways of running your test suite that produce more consistent results. This means running tests in more isolated environments, changing the amount of test parallelism, ensure that tests clean up their environment before they run, and can be as logically isolated as possible.

But it’s hard to find these tests and you can end up playing wack-a-mole with dodgy tests for a long time, and the urge to just run the tests a second (or third) time to get them to pass so you can merge your change and move on with your work is tempting. This leaves:

  • run tests multiple times: so that a test doesn’t pass until it passes multiple times. Many test runner’s have some kind of repeated execution mode, and if you can combine with some kind of “stop executing after the first fail,” then this can be reasonably efficient. Use multiple execution to force the tests to produce more reliable results rather than cover-up or exacerbates the flakiness.
  • run fewer tests: it’s great to have a regression suite, but if you have unreliable tests, and you can’t use the multi-execution hack to smoke out your bad tests, then running a really full matrix of tests is just going to produce more failures, which means you’ll spend more of your time looking at tests, in non-systematic ways, which are unlikely to actually improve code.

Staff Engineering

In August of 2019 I became a Staff Engineer, which is what a lot of companies are calling their “level above Senior Engineer” role these days. Engineering leveling is a weird beast, which probably a post onto itself. Despite my odd entry into a career in tech, my path in the last 4 or 5 years has been pretty conventional; however, somehow, despite having an increasingly normal career trajectory, explaining what I do on a day to day basis has not gotten easier.

Staff Engineers are important for scaling engineering teams, but lots of teams get by with out them, and unlike more junior engineers who have broadly similar job roles, there are a lot of different ways to be a Staff Engineer, which only muddies things. This post is a reflection on some key aspects of my experience organized in to topics that I hope will be useful for people who may be interested in becoming staff engineers or managing such a person. If you’re also a Staff Engineer and your experience is different, I wouldn’t be particularly surprised.

Staff Engineers Help Teams Build Great Software

Lots of teams function just fine without Staff Engineers and teams can build great products without having contributors in Staff-type roles. Indeed, because Staff Engineers vary a lot, the utility of having more senior individual contributors on a team depends a lot of the specific engineer and the team in question: finding a good fit is even harder than usual. In general, having Senior Technical leadership can help teams by:

  • giving people managers more space and time to focus on the team organization, processes, and people. Particularly in small organizations, team managers often pick up technical leadership.
  • providing connections and collaborations between groups and efforts. While almost all senior engineers have a “home team” and are directly involved in a few specific projects, they also tend to have broader scope, and so can help coordinate efforts between different projects and groups.
  • increasing the parallelism of teams, and can provide the kind of infrastructure that allows a team to persue multiple streams of development at one time.
  • supporting the career path and growth of more junior engineers, both as a result of direct mentoring, but also by enabling the team to be more successful by having more technical leadership capacity creates opportunities for growth for everyone on the team.

Staff Promotions Reflect Organizational Capacity

In addition to experience and a history of effusiveness, like other promotions, getting promoted to Staff Engineer is less straight forward than other promotions. This is in part because the ways we think about leveling and job roles (i.e. to describe the professional activities and capabilities along several dimensions for each level,) become complicated when there are lots of different ways to be a Staff Engineer. Pragmatically, these kind of promotions often depend on other factors:

  • the existence of other Staff Engineers in the organization make it more likely that there’s an easy comparison for a candidate.
  • past experience of managers getting Staff+ promotions for engineers. Enginering Managers without this kind of experience may have difficulty creating the kinds of opportunities within their organizations and for advocating these kinds of promotions.
  • organizational maturity and breadth to support the workload of a Staff Engineer: there are ways to partition teams and organizations that preclude some of the kinds of higher level concerns that justify having Staff Engineers, and while having senior technical leadership is often useful, if the organization can’t support it, it won’t happen.
  • teams with a sizable population of more junior engineers, particularly where the team is growing, will have more opportunity and need for Staff Engineers. Teams that are on the balance more senior, or are small and relatively static tend to have less opportunity for the kind of broadly synthetic work that tends to lead to Staff promotions.

There are also, of course, some kinds of technical achievements and professional characteristics that Staff Engineers often have, and I’m not saying that anyone in the right organizational context can be promoted, exactly. However, without the right kind of organizational support and context, even the most exceptional engineers will never be promoted.

Staff Promotions are Harder to Get Than Equivalent Management Promotions

In many organizations its true that Staff promotions are often much harder to get than equivalent promotions to peer-level management postions: the organizational contexts required to support the promotion of Engineers into management roles are much easier to create, particularly as organizations grow. As you hire more engineers you need more Engineering Managers. There are other factors:

  • managers control promotions, and it’s easier for them to recapitulate their own career paths in their reports than to think about the Staff role, and so more Engineers tend to be pushed towards management than Senior IC roles. It’s also probably that meta-managers benefit organizationally from having more front-line managers in their organizations than more senior ICs, which exacerbates this bias.
  • from an output perspective, Senior Engineers can write the code that Staff Engineers would otherwise write, in a way that Engineering Management tends to be difficult to avoid or do without. In other terms, management promotions are often more critical from the organization’s perspective and therefore prioritized over Staff promotions, particularly during growth.
  • cost. Staff Engineers are expensive, often more expensive than managers particularly at the bottom of the brackets, and it’s difficult to imagine that the timing of Staff promotions are not impacted by budgetary requirements.

Promoting a Staff Engineer is Easier than Hiring One

Because there are many valid ways to do the Staff job, and so much of the job is about leveraging context and building broader connections between different projects, people with more organizational experience and history often have an advantage over fresh industry hires. In general:

  • Success as a Staff Engineer in one organization does not necessarily translate to success at another.
  • The conventions within the process for industry hiring, are good at selecting junior engineers, and there are fewer conventions for more Senior roles, which means that candidates are not assessed for skills and experiences that are relevant to their day-to-day while also being penalized for (often) being unexceptional at the kind of problems that junior engineering interviews focus on. While interview processes are imperfect assessment tools in all cases, they’re particularly bad at more senior levels.
  • Senior engineering contributors have a the potential to have huge impact on product development, engineering outcomes, all of which requires a bunch of trust on the part of the organization, and that kind of trust is often easier to build with someone who already has organizational experience

This isn’t to say that it’s impossible to hire Staff engineers, I’m just deeply dubious of the hiring process for these kinds of roles having both interviewed for these kinds of roles and also interviewed candidates for them. I’ve also watched more than one senior contributor not really get along well with a team or other leadership after being hired externally, and for reasons that end up making sense in retrospect. It’s really hard.

Staff Engineers Don’t Not Manage

Most companies have a clear distinction between the career trajectories of people involved in “management” and senior “individual contributor” roles (like Staff Engineers,) with managers involved in leadership for teams and humans, with ICs involved in technical aspects. This seems really clear on paper but incredibly messy in practice. The decisions that managers make about team organization and prioritization have necessary technical implications; while it’s difficult to organize larger scale technical initiatives without awareness of the people and teams. Sometimes Staff Engineers end up doing actual management on a temporary basis in order to fill gaps as organizations change or cover parental leave

It’s also the case that a huge part of the job for many Staff Engineer’s involves direct mentorship of junior engineers, which can involve leading specific projects, conversations about career trajectories and growth, as well as conversations about specific technical topics. This has a lot of overlap with management, and that’s fine. The major differences is that senior contributors share responsibility for the people they mentor with their actual managers, and tend to focus mentoring on smaller groups of contributors.

Staff Engineers aren’t (or shouldn’t be!) managers, even when they are involved in broader leadership work, even if the specific engineer is capable of doing management work: putting ICs in management roles, takes time away from their (likely more valuable) technical projects.

Staff Engineers Write Boring and Tedious But Easy Code

While this is perhaps not a universal view, I feel pretty safe in suggesting that Staff Engineers should be directly involved in development projects. While there are lots of ways to be involved in development: technical design, architecture, reviewing code and documents, project planning and development, and so fort, I think it’s really important that Staff Engineers be involved with code-writing, and similar activies. This makes it easy to stay grounded and relevant, and also makes it possible to do a better job at all of the other kinds of engineering work.

Having said that, it’s almost inevitable that the kinds of contribution to the code that you make as a Staff Engineer are not the same kinds of contributions that you make at other points in your career. Your attention is probably pulled in different directions. Where a junior engineer can spend most of their day focusing on a few projects and writing code, Staff Engineers:

  • consult with other teams.
  • mentor other engineers.
  • build long and medium term plans for teams and products.
  • breaking larger projects apart and designing APIs between components.

All of this “other engineering work” takes time, and the broader portfolio of concerns means that more junior engineers often have more time and attention to focus on specific programming tasks. The result is that the kind of code you end up writing tends to be different:

  • fixing problems and bugs in systems that require a lot of context. The bugs are often not very complicated themselves, but require understanding the implication of one component with regards to other components, which can make them difficult.
  • projects to enable future development work, including building infrastructure or designing an interface and connecting an existing implementation to that interface ahead of some larger effort. This kind of “refactor things to make it possible to write a new implementation.”
  • writing small isolated components to support broader initiatives, such as exposing existing information via new APIs, and building libraries to facilitate connections between different projects or components.
  • projects that support the work of the team as a whole: tools, build and deployment systems, code clean up, performance tuning, test infrastructure, and so forth.

These kinds of projects can amount to rather a lot of development work, but they definitely have their own flavor. As I approached Staff and certainly since, the kind of projects I had attention for definitely shifted. I actually like this kind of work rather a lot, so that’s been quite good for me, but the change is real.

There’s definitely a temptation to give Staff Engineers big projects that they can go off and work on alone, and I’ve seen lots of teams and engineers attempt this: sometimes these projects work out, though more often the successes feel like an exception. There’s no “right kind” of way to write software as a Staff Engineer, sometimes senior engineer’s get to work on bigger “core projects.” Having said that, if a Staff Engineer is working on the “other engineering” aspects of the job, there’s just limited time to do big development projects in a reasonable time frame.

Related Reading