Almost two years ago, I switched teams at work to join the team behind Evergreen, a homegrown continuous integration tool that we use organization-wide to support development efforts and operational automation. It's been great.
At a high level, Evergreen takes changes that developers make to source code repositories and runs a set of tasks for each of those changes on a wide variety of systems. It's a key part of how we verify that the software we write works on computers other than the ones we interact with directly. There are plenty of CI systems in the world, but Evergreen has some interesting features:
- it runs tasks in parallel, fanning work out to a large pool of machines to shorten the wall-clock time of task execution.
- tasks execute on ephemeral systems managed by Evergreen in response to demands of incoming work.
- the service maintains a queue of work and handles task dispatching and results collection.
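The fan-out pattern behind that first point is worth making concrete. This is not Evergreen's actual implementation, just a minimal sketch in Go of dispatching queued tasks to a worker pool and collecting results; the `Task` and `Result` types and the `dispatch` function are hypothetical names for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical unit of work; Evergreen's real task model is
// far richer (project configuration, build variants, dependencies).
type Task struct {
	ID string
}

// Result records the outcome of running one task.
type Result struct {
	TaskID string
	OK     bool
}

// dispatch fans tasks out to a pool of workers and collects their
// results, illustrating how parallelism shortens wall-clock time.
func dispatch(tasks []Task, workers int) []Result {
	in := make(chan Task)
	out := make(chan Result)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range in {
				// a real worker would execute the task on a
				// (possibly ephemeral) remote host
				out <- Result{TaskID: t.ID, OK: true}
			}
		}()
	}

	// feed the queue, then close it so workers drain and exit
	go func() {
		for _, t := range tasks {
			in <- t
		}
		close(in)
	}()

	// close the results channel once every worker has finished
	go func() {
		wg.Wait()
		close(out)
	}()

	var results []Result
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	tasks := []Task{{ID: "compile"}, {ID: "unit-tests"}, {ID: "lint"}}
	fmt.Printf("ran %d tasks\n", len(dispatch(tasks, 2)))
}
```

The channels-plus-WaitGroup shape is the idiomatic Go version of this pattern; the real system layers persistence, host management, and retries on top of it.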
This focus on large-scale task parallelism and managed host pools lets Evergreen address larger continuous integration workloads with lower maintenance overhead. This is totally my jam: we get to shape the development workflow and engineering policies for basically everyone, and improving operational efficiency is a leading goal.
My previous gig was more operational, on a sibling team, so it's been really powerful to be able to address problems relating to application scale and drive the architecture from the other side. I wrote a blog post for a work-adjacent outlet about the features and improvements, but this is my blog, and I think it'd be fun to have some space to explore "what I've been working on," rather than focusing on Evergreen as a product.
My first order of business, after becoming familiar with the code base, was to work on logging. When I started learning Go, I wrote a logging library (I even blogged about it), and using this library has allowed us to "get serious about logging." While it was a long play, we now have highly structured logging, which has allowed the logging system to become a centerpiece of our observability story, and we've been able to use centralized log aggregation services (and even shop around!). As our deployment has grown, centralized logging has been the thing that keeps everything together.
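To show what "highly structured" buys you: instead of formatting strings, callers attach typed fields that an aggregation service can index and query. This is a generic sketch of the idea using Go's standard `encoding/json`, not the API of the library mentioned above; the `message` function is a hypothetical name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// message renders a structured log record as one JSON line. Typed
// fields, rather than an interpolated string, are what let a central
// aggregator filter on, say, op or duration_ms across the fleet.
func message(level string, fields map[string]interface{}) string {
	record := map[string]interface{}{"level": level}
	for k, v := range fields {
		record[k] = v
	}
	// encoding/json marshals map keys in sorted order, so output
	// is deterministic and diff-friendly
	out, _ := json.Marshal(record)
	return string(out)
}

func main() {
	fmt.Println(message("info", map[string]interface{}{
		"op":          "task-dispatch",
		"duration_ms": 42,
	}))
}
```

A grep-able string log tells you what one process did; a structured record like this tells an aggregation service what the whole deployment is doing.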
Recently, I've been focusing on how the application handles "offline" or background work. Historically the application has had a number of loosely coupled "cron-job" like operations that all happened on a single machine at a regular interval. I'm focusing on how to move these systems into more tightly coupled, event-driven operations that can be distributed to a larger cluster of machines. Amboy is a big part of this, but there have been other changes related to this project.
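The shape of that migration is: take a step that used to live inside a periodic loop and make it a discrete job that any worker can pick up. The sketch below is in the spirit of Amboy's job abstraction but is not its actual API; the `Job` interface, `cleanupJob` type, and `runQueue` function are all hypothetical names for illustration.

```go
package main

import (
	"fmt"
)

// Job is a hypothetical interface for a unit of background work:
// it knows how to run itself and report completion, so a queue can
// dispatch it anywhere rather than relying on one machine's cron.
type Job interface {
	ID() string
	Run()
	Completed() bool
}

// cleanupJob stands in for one of the cron-style operations,
// repackaged as a self-contained job.
type cleanupJob struct {
	id   string
	done bool
}

func (j *cleanupJob) ID() string      { return j.id }
func (j *cleanupJob) Run()            { j.done = true }
func (j *cleanupJob) Completed() bool { return j.done }

// runQueue drains a set of jobs in-process; a real queue would hand
// them to remote workers and persist state across restarts.
func runQueue(jobs []Job) int {
	completed := 0
	for _, j := range jobs {
		j.Run()
		if j.Completed() {
			completed++
		}
	}
	return completed
}

func main() {
	jobs := []Job{
		&cleanupJob{id: "prune-hosts"},
		&cleanupJob{id: "stats-rollup"},
	}
	fmt.Printf("completed %d jobs\n", runQueue(jobs))
}
```

Once work is expressed as jobs behind an interface like this, moving from "one machine, on a timer" to "any machine, on demand" is a change to the queue, not to the work itself.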
On the horizon, I'm also starting to think about how to reduce the cost of exposing data and operations to clients and users in a way that's lightweight, flexible, and relatively inexpensive in developer time. Right now there's a lot of technical debt, a myriad of ways to describe interfaces, and inconsistent client coverage. Nothing insurmountable, but definitely the next frontier of growing pains.
The theme here is "how do we take an application that works and does something really cool, and turn it into a robust piece of software" that can scale as needs grow, while also providing a platform for developing new features with increasing levels of confidence, stability, and speed.
The conventional wisdom is that it's easy to build features fast-and-loose without a bunch of infrastructure, and that as you scale the speed of feature development slows down. I'm pretty convinced that this is untrue and am excited to explore the ways that improved common infrastructure can reduce the impact of this ossification and lead to more nimble and expansive feature development.
We'll see how it goes! And I hope to be able to find the time to write about it more here.