At a certain scale, most applications end up having to contend with a
class of “distributed systems” problems: when a single computer or a
single copy of an application can’t support the required throughput,
there’s not much to do except distribute it, and
therein lies the problem. Taking one of a thing and making many of
the thing operate similarly can be really fascinating, and frankly
empowering. At some point, all systems become distributed in some way,
to a greater or lesser extent. While the underlying problems and
strategies are simple enough, distributed systems-type bugs can be
gnarly and having some framework for thinking about these kinds of
systems and architectures can be useful, or even essential, when writing
any kind of software.
Concerns#
Application State#
Applications all have some kind of internal state: configuration and
runtime settings, in addition to whatever happens in memory as a result
of running the application. When you have more than one copy of a single
logical application, you have to put state somewhere. That somewhere is
usually a database, but it can be another service or in some kind of
shared file resource (e.g. NFS or blob storage like S3.)
The challenge is not “where to put the state,” because it probably
doesn’t matter much, but rather in organizing the application to remove
the assumption that state can be stored in the application. This often
means avoiding caching data in global variables and avoiding storing
data locally on the filesystem, but there are a host of ways in which
application state can get stuck or captured, and the fix is generally
“ensure this data is always read out of some centralized and
authoritative service,” and ensure that any locally cached data is
refreshed regularly and saved centrally when needed.
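As a rough illustration, here’s a minimal sketch in Go of what “read it from the authoritative service, cache it briefly” can look like; the `Store` interface and the TTL-based refresh are assumptions for the example, not a specific library:

```go
// A sketch: read settings from an authoritative store rather than caching
// them in a global for the life of the process. The Store interface and
// whatever service sits behind it are assumptions.
package config

import (
	"context"
	"sync"
	"time"
)

// Store is the central, authoritative service that holds the state
// (a database row, a key in a config service, an object in S3, etc.).
type Store interface {
	GetSetting(ctx context.Context, key string) (string, error)
}

// Settings wraps the store with a short-lived cache so reads are cheap,
// but the process never becomes the source of truth.
type Settings struct {
	store Store
	ttl   time.Duration

	mu      sync.Mutex
	cached  map[string]string
	fetched map[string]time.Time
}

func NewSettings(store Store, ttl time.Duration) *Settings {
	return &Settings{
		store:   store,
		ttl:     ttl,
		cached:  map[string]string{},
		fetched: map[string]time.Time{},
	}
}

// Get returns a possibly cached value, refreshing it from the central
// store once the TTL has expired.
func (s *Settings) Get(ctx context.Context, key string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if at, ok := s.fetched[key]; ok && time.Since(at) < s.ttl {
		return s.cached[key], nil
	}

	val, err := s.store.GetSetting(ctx, key)
	if err != nil {
		return "", err
	}
	s.cached[key] = val
	s.fetched[key] = time.Now()
	return val, nil
}
```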
In general, better state management within applications makes code
better regardless of how distributed the system is, and when we use the
“turn it off and turn it back on” trick, we’re really just clearing out
some bit of application state that’s gotten stuck during the runtime of
a piece of software.
Startup and Shutdown#
Process creation and initialization, as well as shutdown, are difficult
in distributed systems. While most configuration and state is probably
stored in some remote service (like a database,) there’s a
bootstrapping process where each process needs enough local
configuration to reach that central service and fetch the rest of its
configuration at startup, which can be a bit delicate.
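To make the bootstrapping step concrete, here’s a small hypothetical sketch: the only truly local configuration is how to reach the central service (environment variables here), and everything else is fetched from it at startup. The `ConfigService` type and its `Fetch` method are invented for illustration:

```go
// A sketch of the bootstrapping step: the process only needs enough local
// configuration (here, environment variables) to reach the central service,
// and everything else comes from that service.
package startup

import (
	"context"
	"fmt"
	"os"
	"time"
)

// AppConfig is whatever the central service returns.
type AppConfig struct {
	ListenAddr   string
	FeatureFlags map[string]bool
}

// ConfigService is a hypothetical client for the central configuration
// service; Fetch would call it over HTTP or RPC.
type ConfigService struct{ url, token string }

func (c *ConfigService) Fetch(ctx context.Context) (*AppConfig, error) {
	// Stubbed out for the sketch: a real implementation would make a
	// request to c.url using c.token.
	return &AppConfig{ListenAddr: ":8080"}, nil
}

func Bootstrap(ctx context.Context) (*AppConfig, error) {
	// The only truly local state: how to find and authenticate to the
	// central configuration service.
	url := os.Getenv("CONFIG_SERVICE_URL")
	token := os.Getenv("CONFIG_SERVICE_TOKEN")
	if url == "" {
		return nil, fmt.Errorf("CONFIG_SERVICE_URL is not set")
	}

	svc := &ConfigService{url: url, token: token}

	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	return svc.Fetch(ctx)
}
```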
Shutdown has its own set of problems, as processes need to be able to
complete or safely abort in-progress operations.
For request-driven work (i.e. HTTP or RPC APIs) without stateful or
long-running requests (e.g. many websockets and most streaming
connections), applications have to stop accepting new connections and
let all in-progress requests complete before terminating. For other
kinds of work, the process has to either complete in-progress work or
provide some kind of “checkpointing” approach so that another process
can pick up the work later.
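For the HTTP case, the standard library’s `http.Server` makes the “stop accepting, drain, then exit” pattern fairly direct; this is a minimal sketch with placeholder timeouts and handlers:

```go
// A minimal sketch of graceful shutdown for request-driven work: stop
// accepting new connections and give in-flight requests a bounded amount
// of time to finish before the process exits.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Wait for a termination signal from the orchestrator or operator.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
	<-sig

	// Shutdown stops accepting new connections and waits for in-flight
	// requests, up to the deadline on the context.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
}
```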
Horizontal Scalability#
Horizontal scalability, being able to increase the capability of an
application by adding more instances of the application rather than
increasing the resources allotted to the application itself, is one of
the reasons that we build distributed systems in the first place, but
simply being able to run multiple copies of the application at once
isn’t always enough: the application needs to be able to effectively
distribute its workloads. For request-driven work this is generally
some kind of load balancing layer or strategy, and for other kinds of
workloads you need some way to distribute work across the application.
There are lots of different ways to provide load balancing, and a lot
depends on your application and clients. There is specialized software
(and even hardware) that provides load balancing by sitting “in front
of” the application and routing requests to a backend, but there are
also a collection of client-side solutions that work quite well. The
complexity of load balancing solutions varies a lot: some approaches
just distribute requests “evenly” (by number) to each backend
one-by-one (“round-robin”), and some attempt to distribute requests
more “fairly” based on some reporting from each backend or an analysis
of the incoming requests. The strategy here depends a lot on the
requirements of the application or service.
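As a sketch of the simplest end of that spectrum, here’s a client-side round-robin picker in Go; the static backend list is an assumption, since real deployments usually discover backends from DNS or a registry:

```go
// The simplest client-side strategy: plain round-robin over a fixed list
// of backends, distributing requests evenly by count with no knowledge of
// backend load or request cost.
package loadbalance

import "sync/atomic"

type RoundRobin struct {
	backends []string
	next     uint64
}

func NewRoundRobin(backends []string) *RoundRobin {
	return &RoundRobin{backends: backends}
}

// Pick returns the next backend in order, wrapping around at the end of
// the list; it is safe to call from many goroutines.
func (r *RoundRobin) Pick() string {
	n := atomic.AddUint64(&r.next, 1)
	return r.backends[(n-1)%uint64(len(r.backends))]
}
```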
For workloads that aren’t request driven, systems require some
mechanism for distributing work to workers, usually with some kind of
messaging system, though it’s possible to get pretty far using just a
normal general purpose database to store pending work. The options for
managing, ordering, and distributing the work are the meat of the problem.
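As an example of the “just use a database” approach, here’s a sketch of workers claiming pending jobs from a relational database; the `jobs` table and its columns are assumptions, and the `SKIP LOCKED` clause is Postgres-specific:

```go
// A sketch of using a plain relational database as the work-distribution
// mechanism: each worker atomically claims the next pending job so that no
// two workers pick up the same work.
package work

import (
	"context"
	"database/sql"
)

type Job struct {
	ID      int64
	Payload []byte
}

// ClaimNext marks one pending job as running for this worker and returns
// it; Scan returns sql.ErrNoRows when there is no pending work.
func ClaimNext(ctx context.Context, db *sql.DB, worker string) (*Job, error) {
	const q = `
	UPDATE jobs SET status = 'running', claimed_by = $1, claimed_at = now()
	WHERE id = (
		SELECT id FROM jobs
		WHERE status = 'pending'
		ORDER BY created_at
		FOR UPDATE SKIP LOCKED
		LIMIT 1
	)
	RETURNING id, payload`

	job := &Job{}
	if err := db.QueryRowContext(ctx, q, worker).Scan(&job.ID, &job.Payload); err != nil {
		return nil, err
	}
	return job, nil
}
```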
Challenges#
When thinking about system design or architecture, I tend to start with
the following questions.
- how does the system handle intermittent failures of particular
components?
- what kind of downtime is acceptable for any given component? for the
system as a whole?
- how do operations time out and get terminated, and how do clients
handle these kinds of failures?
- what are the tolerances for the application in terms of latency of
various kinds of operations, and also the tolerances for “missing”
or “duplicating” an operation?
- when (any single) node or component of the system aborts or restarts
abruptly, how does the application/service respond? Does work resume
or abort safely?
- what level of manual intervention is acceptable? Does the system need
to handle node failure autonomously? If so, how many simultaneous node
failures?
Concepts like “node” or “component” or “operation,” can mean
different things in different systems, and I use the terms somewhat
vaguely as a result. These general factors and questions apply to
systems that have monolithic architectures (i.e. many copies of a single
type of process which performs many functions,) and service-based
architectures (i.e. many different processes performing specialized
functions.)
Solutions#
Ignore the Problem, For Now#
Many applications run in a distributed fashion while only really
addressing parts of their relevant distributed systems problems, and in
practice it works out ok. Applications may store most of their data in a
database, but have some configuration files that are stored locally:
this is annoying, and sometimes an out-of-sync file can lead to some
unexpected behavior. Applications may have distributed application
servers for all request-driven workloads, but may still have a separate
single process that does some kind of coordinated background work, or
run cron jobs.
Ignoring the problem isn’t always the best solution in the long term,
but making sure that everything is distributed (or able to be
distributed,) isn’t always the best use of time, and depending on the
specific application it works out fine. The important part isn’t to
distribute things in all cases, but to make it possible to
distribute functions in response to needs: in some ways I think about
this as the “just in time” approach.
Federation#
Federated architectures manage distributed systems protocols at a higher
level: rather than assembling a large distributed system, build very
small systems that can communicate at a high level using some kind of
established protocol. The best example of a federated system is probably
email, though there are others.
Federated systems have more complex protocols that have to be
specification-based, which can be difficult to design and build. Also,
federated services have to maintain the ability to interoperate with
previous versions and even sometimes non-compliant services, which can
be difficult to maintain. Federated systems also end up pushing a lot of
the user experience into the clients, which can make it hard to control
this aspect of the system.
On the upside, specific implementations and instances of a federated
service can be quite simple and have straightforward and lightweight
implementations. Supporting email for a few users (or even a few
hundred) is a much more tractable problem than supporting email for many
millions of users.
Distributed Locks#
Needing some kind of lock (for mutual exclusion, or mutex) is common
enough in programming: locks provide an easy way to ensure that only a
single actor has access to a specific resource. Doing this within a
single process involves kernel primitives (futexes) or programming
language runtime implementations, and is simple to conceptualize. While
the concept in a distributed system is functionally the same, the
implementation of distributed locks is more complicated and necessarily
slower (both the locks themselves, and their impact on the system as a
whole).
All locks, local or distributed, can be difficult to use correctly: the
lock must be acquired before using the resource, and it must fully
protect the resource without protecting too much, so that a large
portion of the functionality doesn’t end up requiring the lock. So while
locks are sometimes required, and conceptually simple, using them
correctly is hard. With that disclaimer, to work, distributed locks
require the following (sketched in code after the list):
- some concept of an owner, which must be sufficiently specific
(hostname, process identifier,) and sufficiently unique to protect
against process restarts, host renaming, and collisions.
- lock status (locked/unlocked) and, if the lock has different modes, such as
a multi-reader/single-writer lock, then that status.
- a timeout or similar mechanism to prevent deadlocks: if the actor
holding a lock halts or becomes inaccessible, the lock is eventually
released.
- versioning, to prevent stale actors from modifying the same lock. In
the case that actor-1 holds a lock and stalls for longer than the
timeout period, such that actor-2 gains the lock, when actor-1 runs
again it must know that it has been usurped.
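Here’s a sketch of the lock record those requirements imply, with a compare-and-swap style acquire against a hypothetical document store (the `Store` interface and its semantics are assumptions):

```go
// A sketch of a distributed lock record and acquisition. Every field maps
// to one of the requirements above: owner, status, timeout, and version.
package dlock

import (
	"context"
	"errors"
	"time"
)

type Lock struct {
	Name    string
	Owner   string    // hostname + process ID + a unique token per acquisition
	Locked  bool
	Expires time.Time // after this point, other actors may take the lock
	Version int64     // incremented on every successful mutation
}

// Store is a hypothetical central document store.
type Store interface {
	Get(ctx context.Context, name string) (Lock, error)
	// CompareAndSwap writes the updated lock only if the stored version
	// still matches old.Version; this protects against stale actors.
	CompareAndSwap(ctx context.Context, old, updated Lock) error
}

var ErrNotAcquired = errors.New("lock held by another owner")

func Acquire(ctx context.Context, s Store, name, owner string, ttl time.Duration) (Lock, error) {
	cur, err := s.Get(ctx, name)
	if err != nil {
		return Lock{}, err
	}
	// Fail if someone else holds the lock and it hasn't timed out yet.
	if cur.Locked && time.Now().Before(cur.Expires) && cur.Owner != owner {
		return Lock{}, ErrNotAcquired
	}
	next := Lock{
		Name:    name,
		Owner:   owner,
		Locked:  true,
		Expires: time.Now().Add(ttl),
		Version: cur.Version + 1,
	}
	if err := s.CompareAndSwap(ctx, cur, next); err != nil {
		return Lock{}, err
	}
	return next, nil
}
```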
Not all distributed systems require distributed locks, and in most
cases, transactions in the data layer provide most of the isolation
that you might need from a distributed lock, but it’s a useful concept
to have.
Duplicate Work (Idempotency)#
For a lot of operations, in big systems, duplicating some work is easier
and ultimately faster than coordinating and isolating that work in a
single location. For this, having idempotent operations is useful.
Some kinds of operations and systems make idempotency easier to
implement, and in cases where the work is not naturally idempotent
(e.g. as in data processing or transformation,) the operation can be
made idempotent by attaching some kind of clock to the data or
operation.
Using clocks and idempotency makes it possible to maintain data
consistency without locks. At the same time, some of the same
considerations apply. Duplicating all operations is difficult to scale,
so having ways for operations to abort early can be useful.
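As a sketch of the clock idea, here’s a small Go example where each update carries a version and applying it is a no-op unless the version is newer than what’s stored; the in-memory map stands in for whatever storage layer you’d actually use:

```go
// A sketch of idempotency via a logical clock: an update carries a version,
// and applying it is a no-op unless the version is newer than the stored
// one, so replayed or duplicated operations are harmless.
package idempotent

import "sync"

type Update struct {
	Key     string
	Value   string
	Version int64 // logical clock attached to the operation
}

type Record struct {
	Value   string
	Version int64
}

type Store struct {
	mu   sync.Mutex
	data map[string]Record
}

func NewStore() *Store { return &Store{data: map[string]Record{}} }

// Apply returns true if the update changed state; applying the same update
// twice (or out of order) leaves the store unchanged.
func (s *Store) Apply(u Update) bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	cur, ok := s.data[u.Key]
	if ok && u.Version <= cur.Version {
		return false // stale or duplicate operation: ignore it
	}
	s.data[u.Key] = Record{Value: u.Value, Version: u.Version}
	return true
}
```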
Consensus Protocols#
Some operations can’t be effectively distributed, but are also not safe
to duplicate. Applications can use consensus protocols to do “leader
election,” ensuring that there’s only one node “in charge” at a
time, with the protocol handling the hand-off when that node fails or
becomes unavailable. This is common in database systems, where
“single leader” systems are useful for balancing write performance in a
distributed context. Consensus protocols have some amount of overhead,
and are best suited to systems of small to moderate size, because all
elements of the system must communicate with all other nodes in the
system.
The two prevailing consensus protocols are Paxos and Raft--pardoning
the oversimplification here--with Raft being a simpler and
easier-to-implement incarnation of the same underlying principles. I’ve
characterized consensus as being about leader election, though you can
use these protocols to allow a distributed system to reach agreement on
any manner of operations or shared state.
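In practice, most applications don’t implement Raft or Paxos themselves; they lean on a system that already runs one. Here’s a sketch of leader election using etcd’s client library (which exposes the primitive on top of Raft); the endpoint, election prefix, and node name are placeholders:

```go
// A sketch of leader election delegated to etcd, rather than implementing
// a consensus protocol directly in the application.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The session's lease means leadership is released if this process dies.
	session, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	election := concurrency.NewElection(session, "/my-service/leader")

	// Campaign blocks until this node becomes the leader.
	if err := election.Campaign(context.Background(), "node-1"); err != nil {
		log.Fatal(err)
	}

	log.Println("this node is now the leader; run the singleton work here")
}
```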
Queues#
Building a fully generalized distributed application with consensus is a
very lofty proposition, and commonly beyond the scope of most
applications. If you can characterize the work of your system as
discrete units of work (tasks or jobs,) and can build or access a queue
mechanism within your application that supports workers on multiple
processes, this might be enough to support a great deal of the
application’s distributed requirements.
Once you have reliable mechanisms and abstractions for distributing work
to a queue, scaling the system can be managed outside of the application
by using different backing systems, or changing the dispatching layer,
and queue optimization is pretty well understood. There are lots of
different ways to schedule and distribute queued work, but perhaps this
is beyond the scope of this article.
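To make the abstraction point concrete, here’s a sketch of the kind of small queue interface I have in mind, with a trivial in-memory backend; the names are illustrative, and in practice the backend would be a database or message broker:

```go
// A sketch of a queue abstraction: the application enqueues units of work
// behind a small interface, and the backing implementation can change
// without touching the callers.
package queue

import (
	"context"
	"sync"
)

// Job is a discrete unit of work.
type Job interface {
	Run(ctx context.Context) error
}

// Queue is the only surface the rest of the application sees.
type Queue interface {
	Put(ctx context.Context, j Job) error
	Close()
}

// memQueue is a trivial in-process backend: a channel plus a worker pool.
type memQueue struct {
	jobs chan Job
	wg   sync.WaitGroup
}

func NewMemQueue(workers, buffer int) Queue {
	q := &memQueue{jobs: make(chan Job, buffer)}
	for i := 0; i < workers; i++ {
		q.wg.Add(1)
		go func() {
			defer q.wg.Done()
			for j := range q.jobs {
				_ = j.Run(context.Background()) // real code would record errors
			}
		}()
	}
	return q
}

func (q *memQueue) Put(ctx context.Context, j Job) error {
	select {
	case q.jobs <- j:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Close stops accepting work and waits for the workers to drain the queue.
func (q *memQueue) Close() {
	close(q.jobs)
	q.wg.Wait()
}
```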
I wrote one of these, amboy, but
things like gearman and
celery do this as well, and many of
these tools are built on messaging systems like Kafka or AMQP, or just
use general purpose databases as a backend. Keeping a solid abstraction
between the application’s queue and the messaging system seems good, but
a lot depends on your application’s workload.
Delegate to Upstream Services#
While there are distributed system problems that applications must solve
for themselves, in most cases no solution is required! In practice many
applications centralize a lot of their concerns in trusted systems like
databases, messaging systems, or lock servers. This is probably correct!
While distributed systems are, in most senses, required, distributed
systems themselves are rarely the core feature of an application, and it
makes sense to delegate these problems to services that are focused on
solving them.
While multiple external services can increase the overall operational
complexity of the application, implementing your own distributed system
fundamentals can be quite expensive (in terms of developer time), and
error prone, so it’s generally a reasonable trade off.
Conclusion#
I hope this was as useful for you all as it has been fun for me to
write!