The Mid Career Shuffle

2023-09-20 – tychoish

tl;dr> I’ve decided to take a new job as an early engineering manager/tech lead, at a data analytics/database startup. And I hope to begin blogging/journaling here more about software development and the way that software and databases interact with other things that grab my interest.

This post doesn’t really have a narrative. It’s just a collection of thoughts. I’m sort of out of the habit of writing blog posts, and one of the things that’s gotten me stuck on writing in the past year or so is feeling like the narrative of any given blog post is not quite right.¹ But we have to start somewhere, and the story will happen over the next few posts.

In the middle of 2020 I left a job at a company that I’d been at for almost 9 years. It seemed like the world was on fire, what it meant to work at a globally distributed tech company was changing (and still is!), and it felt like a good time to make a change. But to what? I’ve spent the last few years working on different projects and exporing different things: I’ve worked on cool distributed systems problems, I’ve worked on product-focused (backend) engineering, I’ve worked on familiar latency-sensitive compute provisioning and orchestration, and a little bit this and that between.

I’ve identified a few things about jobs and working in this process:

while remote work is (and has been for a while) definitely a reality of our world,² the one thing you can’t really accommodate for is timezones. A friend, zmagg, claims (and I believe this) that time differences between 9 hours and 16 hours are basically unworkable as there isn’t really enough overlap to really collaborate.
if the center of gravity in a company is in a place, or in an office, remote teams really have to be intentional about making sure to include people who are outside of that bubble.
the interesting parts of software engineering is what happens between people building software on a team: programming is a means to an end, and while the tools change every so often, how you build together is really fascinating (and hard)!

My new job is different from any job I’ve ever had: First, I’m going to be working on building and developing teams and helping support the engineers and the engineering teams rather than working directly/full-time on the software. I probably will end up doing some software work too. This isn’t that novel: I’ve definitely done “the management thing” a few times of times, but more of my time will be doing this, and I get to be more intentional about how I approach it. Second, my work (mostly) has been in support of products that other people--largely my coworkers develop, in the sense that the work that I’m drawn to is the internal infrastructure of software. I like infrastructure a lot, and I think I tend to focus on these kinds of problems because I feel like it gives me the biggest opportunity to make an impact. In this sense, in a lot of ways this role is very similar to my previous jobs: building the infrastructure to enable and support the people working around me. Only this time it is (not entirely) software. I’m actually looking forward to mapping concrete business goals to actual research and development problems and projects, helping to provide context and prioritize projects without needing to filter through layers of indirection.

I have been blogging in one form or another (mostly this form) to greater and lesser (mostly lesser) extent for a (really) really long time. The landscape of the way we interact on the internet is changing: twitter is really a shell of its former self, “big” social media networks (facebook, instagram, ticktock, twitter, etc.) have really segmented by generation and region, and little social media sites are still niche.

For my part, I’ve been spending more time in “group chat,” with sort of wacky combinations of random friends. This isn’t new for me, and I take a lot of joy of building out very small communities of folks. Also, as I think about it, the thing I most want to do is have coversations with people. Sometimes that conversation is longer form (like this,) and sometimes, it’s more one-to-many (like this sort of half baked telegram channel idea that’s just sort of a “things I would have said on twitter” but as a chat,) but rarely is it “I would like to become a publisher,” or I want to use blogging as a way to market some other entrepreneurial effort. Not that there are anything wrong with these reasons for blogging.

It’s easy (as a software engineer, though I think this probably holds true for anyone who process for “building” thing requires a lot of just “thinking,” about things,) to do a lot of the thinking part in private and to only show up with the “right” solution (with maybe a few alternatives for rhetorical effect,) or to mostly focus our writing on conclusions reached rather than questions asked.

So perhaps, (and we’ll see if holds,) I can use this space more as a journal and as a collection of questions and (perhaps) failed attempts and less of news feed.

I think I often get stuck on where to start, and how much exposition to provide for the topic. I think too much time as a technical writer where my job was to explain difficult concepts means I can (at least try) to start to start to early, and then feel like I get to the end of a post before I’ve said what I think needs to be said. The answer to this might be to just write more and post more, so the exposition is just in the scroll back rather than trying to do too much in one post. And to always err on the side of not enough exposition. Let me know, though if you have an opinion on that one. ↩︎
companies are often necessarily global in scope, and having a team or project spread between different offices isn’t all that different than when it is spread between a bunch of people’s homes. Once you have more than one location, n locations are pretty much the same. ↩︎

The Most Forgotten CI Feature

2023-08-22 – tychoish

Developers love to complain about their tools, particularly continuous integration (CI) software, everyone has a pet idea of how to make it faster to run or make the results easier to interpret, or more useful. They’re all wrong.¹

I spent a few years as the, I guess tech lead(?),² of a team that was developing a CI tool/platform. It was a cool project, big scale (IO-bound workloads! Cloud!) in an organization that grew rapidly and was exactly the right age for the organization to believe in CI, but also predate the maturity of really mature off-the-shelf CI systems (e.g. Github Actions.)

We did a lot of cool things, but I think the most useful thing we ever built was a notification system that let people know when their tests were done running.

That’s it.

See, people were always asking “I want my test to run in 5 minutes, or 10 minutes,” so that I don’t lose too much time waiting for the results and I can avoid getting distracted or losing focus. You could spend a lot of time making things faster, and in some cases this is a great idea: slow things are bad, compute time is expensive (particularly in aggregate), and for trivial things you do want a fast response time.

The problem is really that sometimes things can’t be made all that much faster without an exceptional amount of effort, and while compute time is expensive sometimes you end up spending significantly more on faster machines or more machinesfor increased parallelism, which can result in for only modest gains.

This of course misses the point: human attention spans are incredibly fickle and while really well focused and disciplined engineers might be able to wait for 5 minutes, anything longer than that and most people will have moved on and at that point it might as well taken an hour. While there are some upper bounds and pragmatic aspects on this just because if a build takes 2 hours (say) a developer can only really has time to try 1 or 2 things out in a given work day, but it does mean that execution times between a minute and about 20 minutes are functionally the same.

So, just notify people when their build is done. Don’t beat distraction by being really fast, beat distraction by interrupting the distraction. People don’t need to spend their day looking at a dashboard if the dashboard tells them (gently) to come back when their task is done.

Why doesn’t every CI tool do this? It might be the feature that every developer wants, and yet, there’s no really good “tell me when my build is done,” feature in any of the popular tools. It’s a hard feature to get right, and there are a lot of tricky bits:

People generally don’t want to get emails; emails would be easy, but they’re not a good way to send a quick--largely ephemeral--reminder. While you can pull emails out of commits (which isn’t a great strategy most of the time,) or (presumably) usernames if you’re on a platform like GitHub, there are some important user settings that you have to store somewhere.
Who to notify for a particular build is a little hard, it turns out. People often want to opt into notifications and be able to only receive notifications that are important to them. Sometimes the person who wants the notification isn’t the set of the authors of a commit, or the user that submitted the branch/pull request.
When to send a notification isn’t clear either. Whenall tasks in the build complete? What if some tasks are blocked from starting because one thing failed? Is “do what I mean” notifications something like “notify me on the first failure, or when all (runable) tasks succeed”? If a task fails (and generates a notification,) and then a task is restarted and then the build goes on to succeed do you send a second notification (probably?) Not only are these hard questions, but different teams might want different semantics.

It’s one of those things that seems simple, but there’s just enough complexity that it’s hard to build and hard to get right. Easier, a bit to do when all of your CI platform is developed in house, but only a bit, and (probably for the better) there aren’t many specific tasks Anyway, I hope someone builds something like this, I’d certainly use it.

Well not all of them. The problem is that developers are very likely to get stuck spinning in weird local optimizations and it’s really hard to think about CI features as a user. Developing CI software is also a fun ride because all your users are also engineers, which is a tough crowd. ↩︎
In retrospect I realize I was doing tech-lead things, but my title and how we talked about I what I was doing at the time wasn’t indicative of this reality. Ah well. ↩︎

Reintroducing Grip

2023-08-17 – tychoish

Once upon a time, I wrote this logging library for Go called grip and used it a lot at a couple of jobs, where it provided a pretty great API and way to think about running logging and metrics infrastructure for an application. For the last year and a half, I’ve mostly ignored it. I didn’t want to be that guy about loggers, so I mostly let it drop, but I was looking at some other logging systems and I felt inspired to unwrap it a bit and see if I could improve things and if the time away would inspire me to rethink any of the basic assumptions.

This post is about a lot of things (and maybe it will spill over,) but generally:

the practice and use of logging, and how that is changing,
adoption of libraries out side of the standard library for common infrastructure,
the process of changing grip, and also digging into some more specific changes.

Logging Trends

Grip came about sort of at the height of the structured logging craze, and I think still focuses more on being a way to provide not just the ability to do slightly cooler than a collection of print statements but also be the primary way your software transmits events. Because all of the parts of grip are pretty plugable, we used this for everything from normal application logging to our entire event notification system, and kind of everything in between, including a sort of frightful amount of managing the output of a distributed build farm. The output of these messages went to all of the usual notification targets (e.g. email, slack, github, jira,) but also directly to log aggregation services (e.g. splunk) and the system log without being written to a file, which just made the services much easier to operate.

Like me, grip is somewhat opinionated. A lot of loggers compete based on “speed” (typically of writing messages to standard output or to any file) or on features (automatic contextual data collection and message annotation, formatting tools, hierarchical filtering, etc.) but I think grip’s selling point is really:

great interfaces for doing all of the things that you normally do with a logger, and all the things you want to do.
provide ways of logging data that doesn’t involve building/formatting strings, and lets the normal data.
make logging work just as well for the small CLI process as the large distributed application, and with similar amounts of ease. Pursuant to this I really think that any logging system or architecture that requires lots of extra stuff (agents, forwarders, rotation tools, etc.) to actually be able to use the logs is overkill. There’s nothing particularly exciting about sending logs over the network (or to a file, or a local socket,) and it should be possible to reduce the amount of complexity in this part of the deployment, which is good for just about everyone.
provide good interfaces and paradigms for replacing or custimizing core logging behavior. It should be possible for applications to write their own tools for sending log messages to places, or functions to filter and format log messages. This is a logging tool kit that can grow and develop with your application.
- in the x hierarchy include implementations for many messaging formats, and tools: use grip implementations to send alerts to telegram, email, slack, jira, etc. Also, support for sending logs _directly to wherever they need to be (splunk, syslog, etc.). Whever the logs need to be, grip should be able to get you there.
- The core grip package provides logging multiplexers, buffers, and batching tools to help control and manage the flow of logging from your application code to whatever logging backend you’re targeting.'
- Implementations and bridges between grip’s interfaces and other logging framework and tools. Grip attempts to make it possible to become the logging system for existing applications without needing to replace all logging call sites immediately (or ever!)
- Tools to support using grip interfaces and backends inside of other tools: you can use a grip Sender as a writer, or access a standard-library logger implementation that writes to the underlying grip logger.
logging should be fast, but speed of logging only really matters when data volume is really high, which is usually a problem for any logger: when data volume is lower even a really slow logger is still faster than the network or the file system. Picking a logger that’s fast, given this, is typically a poor optimization. Grip should be fast enough for any real world application and contains enough flexibility to provide an optimized path for distributing log messages as needed.
provide features to support some great logging paradigms:
- lazy construction of messages: while the speed of writing messages to their output handler typically doesn’t matter, sometimes building log messages can be intensive. Recently I ran into a case where calling [Sprintf]{.title-ref} at the call site of the logger for messages that were not logged (e.g. debug messages,) had material effects on the memory usage of the program. While string formating is a common case (and grip has “[Logf]{.title-ref}” style handlers that are lazy by default,) we found a bunch of metrics collection workloads that had similar sorts of patterns and lazy execution of these metrics tended to help a lot.
- conditional logging, rather than wrapping a log statement in an [if]{.title-ref} block, I found that you could have logging handlers that
- randomized logging or sampled logging: In some cases, I want to have log messages that only get logged half the time or something similar. In some high-volume code paths logging all the time is too much, and never is also too much, but it’s possible to devise a happy medium. Grip, for years now, has implemented this in terms of the conditional logging functionality.
- structured logging, rather than logging strings, it’s sometimes nice to just pass data, in the form of a special structure or just a map and let the logging system package deal with the output formating and transmission of messages. In general, line-oriented JSON makes for a good logging format, assuming most or enough of your messages have related structures and your log viewing and processing tools support this kind of data, although alternate human-readable string formats should also be available.

None of these things changed in the rewrite: by default, grip mostly just looks and works like the Go standard library logger (and even uses the logger under the hood for a lot of things,) but it was definitely fun to look at more contemporary logging practices and tools and think about what makes a logging library compelling (or not).

Infrastructure Libraries

Infrastructure libraries are an interresting beast: ideally every dependency carries some kind of maintenance cost, so you want to minimize the number of dependencies you require. At the same time, not using libraries is also bad because it means you end up writing more code and that has even more maintenance costs. It’s also the case that software engineers love writing infrastructure code and are often quite opinionated about the ergonomics of their infrastructure libraries.

On top of that, you make infrastructure software decisions once and then are sort of stuck with them for a really long time. I’ve changed loggers in projects and it’s rough, and in general you want to choose libraries: as an application developer you have the great feeling that no one’s differentiating feature is going to have anything to do with the logger¹ and you want something that’s battle tested and familiar to everyone. Sometimes--as in the logging package in Python--the standard library has a library just works and everyone just uses that; other times, there are one or two options that most project uses (e.g. log4j in java, etc.).

Even if grip is great, it seems unlikely that everyone (or anyone?) will switch to using grip over some of the other more popular options. I’m OK with that, I’m not sure that beyond writing a few blog posts I’m really that excited about doing the kind of marketing and promotion that it might take to promote a logging library like this, and at the same time the moment for a new logging library might have already passed.

Grip Reborn

The “new grip” is a pretty substantial change. A lot of implementation details changed and I deleted big parts of the interface that didn’t quite make sense, or that I thought were a bit fussy to use. Basically just sanding off a lot of awkward edges. The big changes:

greatly simplified dependencies, with more packages and an x hierarchy for extensions. The main grip package, and it’s primary sub packages' no longer has any external dependencies (beyond github.com/tychoish/fun.) Any logging backend or message type that has additional dependencies are in x.
I deleted a lot of code. There were a lot of things that just weren’t needed, there was an extra interface, a bunch of constructors for various objects that weren’t useful. I also simplified the concept of levels/priority within the logging system.
simpler high level logging interface. I used to have an extra interface and package to hold all of the Info, Error, (and so forth), and I cut a lot of that out and just made a Logger type in the main package which just wraps an underlying interface and doesn’t need to be mutable, and doing this made it possible to simplify a lot of the code.
added some straight forward handlers to attach loggers to contexts. I think previously, I was split on the opinion that loggers should either be (functionally) global or passed explicitly around as objects, and I think I’ve come around to the idea that loggers maybe ought to hang off the context object, but contextual loggers, global loggers, and logging objects are all available.
the high level logging interface is much smaller, with handlers for all the levels and formatting, line, and conditional (e.g. f, ln, When) logging. I’m not 100% sold on having ln, but I’m pretty pleased with having things be pretty simple and streamlined. The logger interface, as with the standard logger is mirrored in the grip package, with a shim.
new message.Builder interface and methods that provides a chainable interface for building messages without needing to mess with the internals of the message package which might be ungainly at logging call sites.
new KV message type: this makes it possible to have structured logging payloads without using map types, which might prove easier easier to integrate with the new zerolog and zap backends.
I have begun exploring in the [series]{.title-ref} package, what it’d mean to have a metrics collection and publication system that is integrated into a logger. I wrote probably too much code, but I am excited to do some more work to do some more work in this area.

Arguably, in a CI platform, most of the hard problems have something to do with logging, so this is an exception, but perhaps one of the only exceptions ↩︎

New Go Module: tychoish/fun

2023-08-16 – tychoish

This is a follow up to my New Go Modules post about a project that I’ve been working on for the past few months: github.com/tychoish/fun.

fun is a collection of simple libraries using generics to do a collection of relatively mundane things, with a focus on well-built tools to make it easier for developers to solve higher level problems without needing to re-implement low level infrastructure, or use some rougher parts of the go standard library. It has no dependencies outside of the standard library, and contains a number of pretty cool tools, which were fun to write. Some of the basic structures were:

I wrote linked list implementations (single and double), adapted a single-ended Queue implementation that a former coworker did as part of an abandoned branch of development, and wrote a similar Deque. (fun/pubsub)
I adapted and developed a “broker” which uses the queues above to be able to do one-to-many pubsub implementations. (fun/pubsub)
I made a few high-level standard library synchronization tools (sync.Map, sync.WaitGroup, sync.Pool, and atomic.Value) even more higher level, with the help generics and more than a few stubborn opinions. (fun/adt; atomic data types)
I revisited an error aggregation tool that I wrote years ago, and made it a bunch faster, less quirky, and more compatible with the current state of the art with regards to error handling (e.g. errors.Is and errors.As). (fun/erc). I also wrote some basic error mangling tools including a more concise errors.Join, tools to unwind/unwrap errors, and a simple error type ers.Error to provide for constant errors (fun/ers)
I wrote an iterator interface and a collection of function wrappers and types (in the top level package), for interacting with iterators (or really, streams in one way or another,) and decorating those handlers and processors with common kinds of operations, modifications, and simple semantics.

I don’t know that it will catch on, but I’ve written (and rewritten) a lot of worker pools and stream processing things to use thse tools, and it’s been a useful programming model. By providing the wrappers and the iteration, users can implement features almost functionally (which should be easy to test.)
I built a collection of service orchestration tools to manage long running application services (e.g. http servers, worker pools, long running subprocesses, etc.) and a collection of context helpers to make it easier to manage the lifecycle of the kind of long-running applications I end up working on most of the time. Every time I’ve joined a new project… ever, I end up doing some amount of service orchestration work, and this is the toolkit I would want. (fun/srv)
I wrote some simple testing frameworks and tools for assertions (fun/assert) halt-on-failure, but with a tesitfy-inspired interface, and better reporiting along with a mirrored, fail-but-continue fun/assert/check, along with fun/testt (“test-tee” or “testy”) which has a bunch of simple helpers for using contexts, and fun/ensure, which is a totally different take on an assertion library.

I don’t want this post to be documentation about the project; there are a lot of docs in the README and on the go documentation, also the implementations are meant to be pretty readable, so feel free to dive in there. I did want to call out a few ways that I’ve begun using this library and the lessons it’s taught me in my day to day work.

I set a goal of writing code that’s 100% covered by tests. This is hard, and only possible in part because it’s a stateless library without dependencies, but I learned a lot about the code I was writing, and feel quite confident in it as a result.
I also set a goal of having no dependencies outside of the module and the standard library: I didn’t want to require users opt in using a set of tools that I liked, or that would require on going maintenance to update and manage. Not only is this a good goal in terms of facilitating adoption, it also constrained what I would do, and forced me to write things with external extensibility: there had to be hooks, interfaces and function types had to be easy to implement, and I decided to limit scope for other things.
Having a more complete set of atomic types and tools (particularly the map and the typed atomic value, also the typed integer atomic types in the standard library are great), has allowed me to approach concurrent programming problems without doing crazy things with channels or putting mutexes everywhere. I don’t think either channels or mutexes are a problem in the hands of a practiced practitioner, but having a viable alternative means it’s one less thing to go wrong, and you can save the big guns (mutexes) for more complex synchronization problems.
Linked lists are super cool. I’ve previously taken the opinion that you shouldn’t implement your own sequence types ever, and mostly avoided writing one of these before now. Having now done it, and now having truly double-ended structures means things like “adding something to the beginning” or “inserting into/removing from the middle” of a sequence isn’t so awkward.

The experimental slices library makes this a little less awkward with standard library slices/arrays, and they are proably faster.
It was really fun to take the concept of an interator and then build out from this concept to build tools that would make them easy to use, and got some good filtering, parallel processing, and map/reduce tools. I definitely learned a bunch but also I think (and have found) these tools useful.
I’ve long held that most go applications should be structured so that you shouldn’t really need to think too much about concurrency when writing business logic. I’ve previously tried to posit that the solution to this was to provide robust queue processing tools (e.g. amboy), but I think that’s too heavy weight, and it was cool to be able to think about the solution to this concept from a different angle.

Anyway, give it a try, and tell me what you think! Also pull requests are welcome!

New Go Modules

2023-08-15 – tychoish

I’ve been doing a little big of programming for fun in recent months, and I wanted to do a little big of blogging here to explain myself, so expect that over the next few days.

By way of creating the facade of suspense--which the astute among you will be able to spoil by way of looking at my github--this post isn’t going to mention the actual projects, but rather answer two questions that are more interesting from a higher level:

Why have you done this?
What have you learned about yourself as a programmer?

Why Have You Done This?

First, while I’ve been writing Go for a long time, I must confess that I was a long time generic skeptic and hold out. Writing some code for myself and I wanted to get a better feeling for how they worked, now that they’re here.

In my day to day work, I have found myself writing similar code again and again: while I have definitely have a class of problems that I tend to work on, I’d be ok if I never wrote another function to bridge the gap between a [context.Context]{.title-ref} and a [sync.WaitGroup]{.title-ref} (and I’m not holding my breath for this in the standard library.)

Finally, over the years I’ve written a few projects that I’ve worked on professionally have been open source, and a few of them I’ve even rather liked, and I wanted to see what it would be like to revisit some of these projects with (potentially) wiser eyes and fingers.

What Have You Learned?

I actually think I’ve learned a lot:

While I’m a big fan of the part of software development which is “deleting code” in my professional work, it was really interesting to take that to code that I knew I’d written (or directed the writing of) and been able to see what I could cut.
I’m better at writing tests and writing testable code than I was even a few years ago. While I believe in writing defensive code, I found myself writing an implementation and then going through and deleting a lot of code that wasn’t useful or was (fundamentally) superstitious when I discovered that there was no way to trigger a potential case.
Because I’ve spent a few years mostly working on relatively high level projects and not having the bandwidth to work on lower (mid?) level infrastructural code, in practice I’ve spent more time assessing other people’s libraries and I was keenly aware that I was developing code that would only be used if someone chose to, and in thinking about how those decisions are made.
This isn’t new, but I care a lot about the erognomics of software. I think 5 or 6 years ago, ergonomics meant “code which was easy to use and provented users from doing wrong things,” and I think my view of erognomic code has expanded to include interfaces that are easy and obvious to use, and promote the users of my code to write better code.

Software Engineering for 2.0

2022-07-07 – tychoish

I’ve been thinking about what I do as a software engineer for a while, as there seems to be a common thread through the kinds of projects and teams that I’m drawn toward and I wanted to write a few blog posts on this topic to sort of collect my thoughts and see if these ideas resonated with anyone else.

I’ve never been particularly interested in building new and exciting features. Hackathon’s have never held any particular appeal, and the things I really enjoy are working on are on the spectrum of “stabilize this piece of software,” or “make this service easy to operate” or “refactor this code to make support future development” and less “design and build some new feature.” Which isn’t to say that I don’t like building new features or writing code, but that I’m more driven by the code and supporting my teammates than I am by the feature.

I think it’s great that I’m different from software engineers who are really focused on the features, both because I think the tension between our interests pushes both classes of software engineer to do great things. Feature development keeps software and products relevant and addresses users' needs. Stabilization work makes projects last and reduces the incidence of failures that distract from feature work, and when there’s consistent attention paid to aligning infrastructure¹ work with feature development of the long term, infrastructure engineers can significantly lower the cost of implementing a feature.

The kinds of projects that fall into these categories inculde the following areas:

managing application state and workload in larger distributed contexts. This has involved designing and implementing things like configuration management, deployment processes, queuing systems, and persistence layers.
concurrency control patterns and process lifecycle. In programming environments where threads are available, finding ways to ensure that processes can safely shut down, and errors can be communicated between threads and processes takes some work and providing mechanisms to shutdown cleanly, communicate abort signals to worker threads, and handle communication patterns between threads in a regular and expected way, is really important. Concurrency is a great tool, but being able to manage concurrency safely and predictably and in descret parts of the code are useful.
programming model and ergonomic APIs and services. No developers produces a really compelling set of abstractions on the first draft, particularly when they’re focused on delivering different kinds of functionality. The revision and iteration process helps everyone build better software.
test infrastructure and improvements. No one thinks tests should take a long time or report results non-deterministically, and yet so many test are. The challenge is that tests often look good or seem reasonable or are stable when you write them, and their slow runtimes compound overtime, or orthogonal changes make them slower. Sometimes adding an extra check in some pre-flight test-infrastructure code ends ups causing tests that had been just fine, thank you to become problems. Maintaining and structure test infrastructure has been a big part of what I’ve ended up doing. Often, however, working back from the tests, it’s possible to see how a changed interface or an alternate factoring of code would make core components easier to test, and doing a cleanup pass of tests on some regular cadence to improve things. Faster more reliable tests, make it possible to develop with greater confidence.

In practice this has included:

changing the build system for a project to produce consistent artifacts, and regularizing the deployment process to avoid problems during deploy.
writing a queuing system without any extra service level dependencies (e.g. in the project’s existing database infrastructure) and then refactoring (almost) all arbitrary parallel workloads to use the new queuing system.
designing and implementing runtime feature flagging systems so operators could toggle features or components on-and-off via configuration options rather than expensive deploys.
replacing bespoke implementations with components provided by libraries or improving implementation quality by replacing components in-place, with the goal of making new implementations more testable or performant (or both!)
plumbing contexts (e.g. Golang’s service contexts) through codebases to be able to control the lifecycle of concurrent processes.
implementing and migrating structured logging systems and building observability systems based on these tools to monitor fleets of application services.
Refactoring tests to reuse expensive test infrastructure, or using table-driven tests to reduce test duplication.
managing processes' startup and shutdown code to avoid corrupted states and efficiently terminate and resume in-progress work.

When done well (or just done at all), this kind of work has always paid clear dividends for teams, even when under pressure to produce new features, because the work on the underlying platform reduces the friction for everyone doing work on the codebase.

It’s something of an annoyance that the word “infrastructure” is overloaded, and often refers to the discipline of running software rather than the parts of a piece of software that supports the execution and implementation of the business logic of user-facing features. Code has and needs infrastructure too, and a lot of the work of providing that infrastructure is also software development, and not operational work, though clearly all of these boundaries are somewhat porous. ↩︎

Systems Administrators are the Problem

2022-07-06 – tychoish

For years now, the idea of the terrible stack, or the dynamic duo of Terraform and Ansible, from this tweet has given me a huge amount of joy, basically anytime someone mentions either Terraform or Ansible, which happens rather a lot. It’s not exactly that I think that Terriform or Ansible are exactly terrible: the configuration management problems that these pieces of software are trying to solve are real and actually terrible, and having tools that help regularize the problem of configuration management definitely improve things. And yet the tools leave things wanting a bit.

Why care so much about configuration management?

Configuration matters because every application needs some kind of configuration: a way to connect to a database (or similar), a place to store its output, and inevitably other things, like a dependencies, or feature flags or whatever.

And that’s the simple case. While most things are probably roughly simple, it’s very easy to have requirements that go beyond this a bit, and it turns out that while a development team might--but only might--not have requirements for something that qualifies as “weird” but every organization has something.

As a developer, configuration and deployment often matters a bunch, and it’s pretty common to need to make changes to this area of the code. While it’s possible to architect things so that configuration can be managed within an application (say), this all takes longer and isn’t always easy to implement, and if your application requires escalated permissions, or needs a system configuration value set then it’s easy to get stuck.

And there’s no real way to avoid it: If you don’t have a good way to manage configuration state, then infrastructure becomes bespoke and fragile: this is bad. Sometimes people suggest using image-based distribution (so called “immutable infrastructure,") but this tends to be slow (images are large and can take a while to build,) and you still have to capture configuration in some way.

But how did we get here?

I think I could weave a really convincing, and likely true story about the discipline of system administration and software operations in general and its history, but rather than go overboard, I think the following factors are pretty important:

computers used to be very expensive, were difficult to operate, and so it made sense to have people who were primarily responsible for operating them, and this role has more or less persisted forever.
service disruptions can be very expensive, so it’s useful for organizations to have people who are responsible for “keeping the lights on,” and troubleshoot operational problems when things go wrong.
most computer systems depend on state of some kind--files on disks, the data in databases--and managing that state can be quite delicate.
recent trends in computing make it possible to manipulate infrastructure--computers themselves, storage devices, networks--with code, which means we have this unfortunate dualism of infrastructure where it’s kind of code but also kind of data, and so it feels hard to know what the right thing to do.

Why not just use <xyz>

This isn’t fair, really, but and you know it’s gonna be good when someone trivializes an adjacent problem domain with a question like this, but this is my post so you must endure it, because the idea that there’s another technology or way of framing the problem that makes this better is incredibly persistent.

Usually <xyz>, in recent years has been “Kubernetes” or “docker” or “containers,” but it sort of doesn’t matter, and in the past solutions platforms-as-a-service (e.g. AppEngine/etc.) or backend-as-a-service (e.g. parse/etc.) So let’s run down some answers:

“bake configuration into the container/virtual machine/etc. and then you won’t have state,” is a good idea, except it means that if you need to change configuration very quickly, it becomes quite hard because you have to rebuild and deploy an image, which can take a long time, and then there’s problems of how you get secrets like credentials into the service.
“use a service for your platform needs,” is a good solution, except that it can be pretty inflexible, particularly if you have an application that wasn’t designed for the service, or need to use some kind of off-the-shelf (a message bus, a cache, etc.) service or tool that wasn’t designed to run in this kind of environment. It’s also the case that the hard cost of using platforms-as-a-service can be pretty high.
“serverless” approaches something of a bootstrapping problem, how do you manage the configuration of the provider? How do you get secrets into the execution units?

What’s so terrible about these tools?

The tools can’t decide if configuration should be described programatically, using general purpose programming languages and frameworks (e.g. Chef, many deployment tools) or using some kind of declarative structured tool (Puppet, Ansible), or some kind of ungodly hybrid (e.g. Helm, anything with HCL). I’m not sure that there’s a good answer here. I like being able to write code, and I think YAML-based DSLs aren’t great; but capturing configuration creates a huge amount of difficult to test code. Regardless, you need to find ways of being able to test the code inexpensively, and doing this in a way that’s useful can be hard.
Many tools are opinionated have strong idioms in hopes of helping to make infrastructure more regular and easier to reason about. This is cool and a good idea, it makes it harder to generalize. While concepts like immutability and idempotency are great properties for configuration systems to have, say, they’re difficult to enforce, and so maybe developing patterns and systems that have weaker opinions that are easy to comply with, and idioms that can be applied iteratively are useful.
Tools are willing to do things to your systems that you’d never do by hand, including a number of destructive operations (terraform is particularly guilty of this), which erodes some of their trust and inspires otherwise bored ops folks, to write/recapitulate their own systems, which is why so many different configuration management tools emerge.

Maybe the tools aren’t actually terrible, and the organizational factors that lead to the entrenchment of operations teams (incumbency, incomplete cost analysis, difficult to meet stability requirements,) lead to the entrenchment of the kinds of processes that require tools like this (though causality could easily flow in the opposite direction, with the same effect.)

API Ergonomics

2022-07-05 – tychoish

I touched on the idea of API ergonomics in Values for Collaborative Codebases, but I think the topic is due a bit more exploration. Typically you think about an API as being “safe” or “functionally complete,” or “easy to use,” but “ergonomic” is a bit far afield from the standard way that people think and talk about APIs (in my experience.)

I think part of the confusion is that “API” gets used in a couple of different contexts, but let’s say that an API here are the collection of nouns (types, structures,) and verbs (methods, functions) used to interact with a concept (hardware, library, service). APIs can be conceptually really large (e.g. all of a database, a public service), or quite small and expose only a few simple methods (e.g. a data serialization library, or some kind of hashing process.) I think some of the confusion is that people also use the term API to refer to the ways that services access data (e.g. REST, etc.) and while I have no objection to this formulation, service API design and class or library API design feel like related but different problems.

Ergonomics, then is really about making choices in the design of an API, so that:

functionality is discoverable during programming. If you’re writing in a language with good code completion tools, then make sure methods and functions are well located and named in a way to take advantage of completion. Chainable APIs are awesome for this.
use clear naming for functions and arguments that describe your intent and their use.
types should imply semantic intent. If your programming language has a sense of mutability (e.g. passing references verses concrete types in Go, or const (for all its failings) in C++), then make sure you use these markers to both enforce correct behavior and communicate intent.
do whatever you can to encourage appropriate use and discourage inappropriate use, by taking advantage of encapsulation features (interfaces, non-exported/private functions, etc.), and passing data into and out of the API with strongly/explicitly-typed objects (e.g. return POD classes, or enumerated values or similar rather than numeric or string types.)
reduce the complexity of the surface area by exporting the smallest reasonable API, and also avoiding ambiguous situations, as with functions that take more than one argument of a given type, which leads to cases where users can easily (and legally) do the wrong thing.
increase safety of the API by removing or reducing and being explicit about the API’s use of global state. Avoid providing APIs that are not thread safe. Avoid throwing exceptions (or equivalents) in your API that you expect users to handle. If users pass nil pointers into an API, its OK to throw an exception (or let the runtime do it,) but there shouldn’t be exceptions that originate in your code that need to be handled outside of it.

Ergonomic interfaces feel good to use, but they also improve quality across the ecosystem of connected products.

Logging Trends#

Infrastructure Libraries#

Grip Reborn#

Why Have You Done This?#

What Have You Learned?#

Logging Trends

Infrastructure Libraries

Grip Reborn

Why Have You Done This?

What Have You Learned?