Software Engineering for 2.0

I've been thinking for a while about what I do as a software engineer, because there seems to be a common thread through the kinds of projects and teams that I'm drawn toward. I wanted to write a few blog posts on this topic to collect my thoughts and see if these ideas resonate with anyone else.

I've never been particularly interested in building new and exciting features. Hackathons have never held any particular appeal, and the things I really enjoy working on are on the spectrum of "stabilize this piece of software," "make this service easy to operate," or "refactor this code to support future development," and less "design and build some new feature." Which isn't to say that I don't like building new features or writing code, but that I'm more driven by the code and by supporting my teammates than I am by the feature.

I think it's great that I'm different from software engineers who are really focused on features, because the tension between our interests pushes both classes of software engineer to do great things. Feature development keeps software and products relevant and addresses users' needs. Stabilization work makes projects last and reduces the incidence of failures that distract from feature work. When there's consistent attention paid to aligning infrastructure [1] work with feature development over the long term, infrastructure engineers can significantly lower the cost of implementing a feature.

The kinds of projects that fall into these categories include the following areas:

  • managing application state and workload in larger distributed contexts. This has involved designing and implementing things like configuration management, deployment processes, queuing systems, and persistence layers.
  • concurrency control patterns and process lifecycle. In programming environments where threads are available, it takes some work to ensure that processes can shut down safely and that errors can be communicated between threads and processes. Providing mechanisms to shut down cleanly, communicate abort signals to worker threads, and handle communication between threads in a regular and expected way is really important. Concurrency is a great tool, but it helps to be able to manage it safely and predictably, and in discrete parts of the code.
  • programming model and ergonomic APIs and services. No developer produces a really compelling set of abstractions on the first draft, particularly when they're focused on delivering different kinds of functionality. The revision and iteration process helps everyone build better software.
  • test infrastructure and improvements. No one thinks tests should take a long time or report results non-deterministically, and yet so many tests do. The challenge is that tests often look good, seem reasonable, or are stable when you write them, and then their slow runtimes compound over time, or orthogonal changes make them slower. Sometimes adding an extra check in some pre-flight test-infrastructure code ends up causing tests that had been just fine, thank you, to become problems. Maintaining and structuring test infrastructure has been a big part of what I've ended up doing. Often, however, working back from the tests, it's possible to see how a changed interface or an alternate factoring of code would make core components easier to test, and doing a cleanup pass of tests on some regular cadence improves things. Faster, more reliable tests make it possible to develop with greater confidence.
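
As a rough illustration of the concurrency item above, here is a minimal Go sketch of a worker pool that shuts down cleanly and passes the first error back to its caller. The names runWorkers, demo, and errBadJob are hypothetical and exist only for this example. It is one possible shape for the pattern under those assumptions, not a description of any particular project's code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// errBadJob is a hypothetical error used only for this illustration.
var errBadJob = errors.New("bad job")

// runWorkers starts n workers that pull from jobs until the channel is
// closed or the context is canceled. The first failing worker cancels the
// shared context, which acts as the abort signal for the others.
func runWorkers(ctx context.Context, n int, jobs <-chan int,
	handle func(context.Context, int) error) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	errs := make(chan error, n) // one slot per worker, so sends never block
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done():
					return // asked to stop
				case job, ok := <-jobs:
					if !ok {
						return // no more work
					}
					if err := handle(ctx, job); err != nil {
						errs <- err
						cancel()
						return
					}
				}
			}
		}()
	}

	wg.Wait() // every worker has exited before we return
	close(errs)
	return <-errs // nil when no worker reported an error
}

// demo queues jobs 1..5 and fails on the job equal to failOn (0 = never).
func demo(failOn int) error {
	jobs := make(chan int, 5)
	for i := 1; i <= 5; i++ {
		jobs <- i
	}
	close(jobs)
	return runWorkers(context.Background(), 2, jobs,
		func(_ context.Context, job int) error {
			if job == failOn {
				return errBadJob
			}
			return nil
		})
}

func main() {
	fmt.Println("all jobs ok:", demo(0))
	fmt.Println("job 3 fails:", demo(3))
}
```

The caller only sees a single function that returns once every goroutine has exited, which keeps the concurrency confined to one discrete part of the code.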

In practice this has included:

  • changing the build system for a project to produce consistent artifacts, and regularizing the deployment process to avoid problems during deploy.
  • writing a queuing system without any extra service-level dependencies (e.g. built on the project's existing database infrastructure) and then refactoring (almost) all arbitrary parallel workloads to use the new queuing system.
  • designing and implementing runtime feature flagging systems so operators could toggle features or components on and off via configuration options rather than expensive deploys.
  • replacing bespoke implementations with components provided by libraries or improving implementation quality by replacing components in-place, with the goal of making new implementations more testable or performant (or both!).
  • plumbing contexts (e.g. Golang's service contexts) through codebases to control the lifecycle of concurrent processes.
  • implementing and migrating structured logging systems and building observability systems based on these tools to monitor fleets of application services.
  • refactoring tests to reuse expensive test infrastructure, or using table-driven tests to reduce test duplication.
  • managing processes' startup and shutdown code to avoid corrupted states and efficiently terminate and resume in-progress work.
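
To make the runtime feature flag item in the list above a little more concrete, here is a small Go sketch of a flag registry that can be flipped while a process is running. The Flags type, its methods, and the "new-queue" flag name are all hypothetical. A real system would also need some way for operators to deliver changes (a config file reload, an admin endpoint, and so on), which this sketch leaves out.

```go
package main

import (
	"fmt"
	"sync"
)

// Flags is a hypothetical in-memory registry of feature toggles that is
// safe to read and update from multiple goroutines.
type Flags struct {
	mu     sync.RWMutex
	values map[string]bool
}

// NewFlags copies the defaults so later changes to the caller's map do not
// leak into the registry.
func NewFlags(defaults map[string]bool) *Flags {
	f := &Flags{values: make(map[string]bool, len(defaults))}
	for name, on := range defaults {
		f.values[name] = on
	}
	return f
}

// Enabled reports whether a flag is on. Unknown flags read as off.
func (f *Flags) Enabled(name string) bool {
	f.mu.RLock()
	defer f.mu.RUnlock()
	return f.values[name]
}

// Set flips a flag at runtime, with no redeploy involved.
func (f *Flags) Set(name string, on bool) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.values[name] = on
}

func main() {
	flags := NewFlags(map[string]bool{"new-queue": false})
	fmt.Println("before:", flags.Enabled("new-queue"))
	flags.Set("new-queue", true)
	fmt.Println("after:", flags.Enabled("new-queue"))
}
```

Treating unknown flags as off is a design choice in this sketch: code guarded by a flag that nobody has configured stays on the old path.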

When done well (or just done at all), this kind of work has always paid clear dividends for teams, even when under pressure to produce new features, because the work on the underlying platform reduces the friction for everyone doing work on the codebase.

[1] It's something of an annoyance that the word "infrastructure" is overloaded, and often refers to the discipline of running software rather than the parts of a piece of software that support the execution and implementation of the business logic of user-facing features. Code has and needs infrastructure too, and a lot of the work of providing that infrastructure is also software development, and not operational work, though clearly all of these boundaries are somewhat porous.