In Favor of an Application Infrastructure Framework

The byproduct of a lot of my work on Evergreen over the past few years has been that I’ve amassed a small collection of reusable components in the form of libraries that address important but not particularly core functionality. While I think the actual features and scale that we’ve achieved for “real” features, the infrastructure that we built has been particularly exciting.

It turns out that I’ve written about a number of these components already here, even. Though I think, my initial posts were about these components in their more proof-of-concept stage, now (finally!) we’re using them all in production so their a bit more hardened.

The first grip is a logging framework. Initially, I thought a high-level logging framework with plug-able backends was going to be really compelling. While configurable back-ends has been good for using grip as the primary toolkit for writing messaging and user-facing alerting, the most compelling feature has been structured logging.

Most of the logging that we do, now, (thanks to grip,) has been to pass structures (e.g. maps) to the logger with key/value data. In combination with log aggregation services/tools (like ELK, splunk, or sumologic,) we can basically take care of nearly all of our application observablity (monitoring) use cases in one stop. It includes easy to use system and golang runtime metrics collection, all using an easy push-based collection, and can also power alert escalation. After having maintained an application using this kind of event driven structured logging system, I have a hard time thinking about running applications without it.

Next we have amboy which is a queue-system. Like grip, all of the components are plug-able, so it support in-memory (ephemeral) queues, distributed queues, dependency graph systems and priority queue implementations as well as a number of different execution models. The most powerful thing that amboy affords us is a single and clear abstraction for defining “background” execution and workloads.

In go it’s easy to spin up a go routine to do some work in the background, it’s super easy to implement worker pools to parallelize the processing of simple tasks. The problem is that as systems grow, it becomes pretty hard to track this complexity in your own code, and we discovered that our application was essentially bifurcated between offline (e.g. background) and online (e.g. request-driven) work. To address all of this problem, we defined all of the background work as small, independent units of work, which can be easily tested, and as a result there is essentially no-adhoc concurrency in the application except what runs in the queues.

The end result of having a unified way to characterize background work is that scaling the application because much less complicated. We can build new queue implementations, without needing to think about the business logic of the background work itself, and we add capacity by increasing the resources of worker machines without needing to think about the architecture of the system. Delightfully, the queue metaphor is independent of external services, so we can run the queue in memory backed by a heap or hash map with executors running in dedicated go-routines if we want, and also scale it out to use databases or dedicated queue services with additional process-based workers, as needed.

The last component, gimlet, addresses building HTTP interfaces, and provides tools for registering routes, writing responses, managing middleware and authentication, an defining routes in a way that’s easy to test. Gimlet is just a wrapper around some established tools like negroni, gorilla/mux, all built on established standard-library foundations. Gimlet has allowed us to unify a bunch of different approaches to these problems, and has lowered the barrier to entry for most of our interfaces.

There are other infrastructural problems still on the table: tools for building inter-system communication and RPC when you can’t communicate via a queue or a shared database (I’ve been thinking a lot about gRPC and protocol buffers for this,) and also about object-mapping and database access patterns, which I don’t really have an answer for.¹

Nevertheless, with the observability, background tasks, and HTTP interface problems well understood at supported, it definitely frees developers to spend more of their time focused core problems of importance to users and the goals of the project. Which is a great place to be.

I built a database migration tool called anser which is mostly focused on integrating migration workflows into production systems so that migrations are part of the core code and can run without affecting production traffic, and while these tools have been useful, I haven’t seen a clear path between this project and meaningfully simplifying the way we manage access to data. ↩︎