At a certain scale, most applications end up having to contend with a class of “distributed systems” problems: when a single computer or a single copy of an application can’t support the required throughput of an application there’s not much to do except to distribute it, and therein lies the problem. Taking one of a thing and making many of the thing operate similarly can be really fascinating, and frankly empowering. At some point, all systems become distributed in some way, to a greater or lesser extent. While the underlying problems and strategies are simple enough, distributed systems-type bugs can be gnarly and having some framework for thinking about these kinds of systems and architectures can be useful, or even essential, when writing any kind of software.

Concerns

Application State

Applications all have some kind of internal state: configuration, runtime settings, in addition to whatever happens in memory as a result of running the application. When you have more than one copy of a single logical application, you have to put state somewhere. That somewhere is usually a database, but it can be another service or in some kind of shared file resource (e.g. NFS or blob storage like S3.)

The challenge is not “where to put the state,” because it probably doesn’t matter much, but rather in organizing the application to remove the assumption that state can be stored in the application. This often means avoiding caching data in global variables and avoiding storing data locally on the filesystem, but there are a host of ways in which application state can get stuck or captured, and the fix is generally “ensure this data is always read out of some centralized and authoritative service,” and ensure that any locally cached data is refreshed regularly and saved centrally when needed.

In general, better state management within applications makes code better regardless of how distributed the system is, and when we use the “turn it off and turn it back on,” we’re really just clearing out some bit of application state that’s gotten stuck during the runtime of a piece of software.

Startup and Shutdown

Process creation and initialization, as well as shutdown, is difficult in distributed systems. While most configuration and state is probably stored in some remote service (like a database,) there’s a bootstrapping process where each process gets enough local configuration required to get that configuration and startup from the central service, which can be a bit delicate.

Shutdown has its own problems set of problems, as specific processes need to be able to complete or safely abort in progress operations.

For request driven work (i.e. HTTP or RPC APIs) without statefull or long-running requests (e.g. many websockets and most streaming connections), applications have to stop accepting new connections and let all in progress requests complete before terminating. For other kinds of work, the process has to either complete in progress work or provide some kind of “checkpointing” approach so that another process can pick up the work later.

Horizontal Scalability

Horizontal scalability, being able to increase the capability of an application by adding more instances of the application rather than creasing the resources allotted to the application itself, is one of the reasons that we build distributed systems in the first place,¹ but simply being able to run multiple copies of the application at once isn’t always enough, the application needs to be able to effectively distribute it’s workloads. For request driven work this is genreally some kind of load balancing layer or strategy, and for other kinds of workloads you need some way to distribute work across the application.

There are lots of different ways to provide loadbalancing, and a lot depends on your application and clients, there is specialized software (and even hardware) that provides loadbalancing by sitting “in front of” the application and routing requests to a backend, but there are also a collection of client-side solutions that work quite well. The complexity of load balancing solutions varies a lot: there are some approaches that just distribute responses “evenly” (by number) to a single backend one-by-one (“round-robin”) and some approaches that attempt to distribute requests more “fairly” based on some reporting of each backend or an analysis of the incoming requests, and the strategy here depends a lot on the requirements of the application or service.

For workloads that aren’t request driven, systems require some mechanism of distributing work to workers, ususally with some kind of messaging system, though it’s possible to get pretty far using a just a normal general purpose database to store pending work. The options for managing, ordering, and distributing the work, is the meat of problem.

Challenges

When thinking about system design or architecture, I tend to start with the following questions.

how does the system handle intermittent failures of particular components?
what kind of downtime is acceptable for any given component? for the system as a whole?
how do operations timeout and get terminated, and how to clients handle these kinds of failures?
what are the tolerances for the application in terms of latency of various kinds of operations, and also the tolerances for “missing” or “duplicating” an operation?
when (any single) node or component of the system aborts or restarts abruptly, how does the application/service respond? Does work resume or abort safely?
what level of manual intervention is acceptable? Does the system need to node failure autonomously? If so how many nodes?

Concepts like “node” or “component” or “operation,” can mean different things in different systems, and I use the terms somewhat vaguely as a result. These general factors and questions apply to systems that have monolithic architectures (i.e. many copies of a single type of process which performs many functions,) and service-based architectures (i.e. many different processes performing specialized functions.)

Solutions

Ignore the Problem, For Now

Many applications run in a distributed fashion while only really addressing parts of their relevant distributed systems problems, and in practice it works out ok. Applications may store most of their data in a database, but have some configuration files that are stored locally: this is annoying, and sometimes an out-of-sync file can lead to some unexpected behavior. Applications may have distributed application servers for all request-driven workloads, but may still have a separate single process that does some kind of coordinated background work, or run cron jobs.

Ignoring the problem isn’t always the best solution in the long term, but making sure that everything is distributed (or able to be distributed,) isn’t always the best use of time, and depending the specific application it works out fine. The important part, isn’t always to distribute things in all cases, but to make it possible to distribute functions in response to needs: in some ways I think about this as the “just in time” approach.

Federation

Federated architectures manage distributed systems protocols at a higher level: rather than assembling a large distributed system, build very small systems that can communicate at a high level using some kind of established protocol. The best example of a federated system is probably email, though there are others.²

Federated systems have more complex protocols that have to be specification based, which can be complicated/difficult to build. Also, federated services have to maintain the ability to interoperate with previous versions and even sometimes non-compliant services, which can be difficult to maintain. Federated systems also end up pushing a lot of the user experience into the clients, which can make it hard to control this aspect of the system.

On the upside, specific implementations and instances of a federated service can be quite simple and have straight forward and lightweight implementations. Supporting email for a few users (or even a few hundred) is a much more tractable problem than supporting email for many millions of users.

Distributed Locks

Needing some kind of lock (for mutual exclusion or mutex) is common enough in programming, and provide some kind of easy way to ensure that only a single actor has access to a specific resource. Doing this within a single process involves using kernel (futexes) or programming language runtime implementations, and is simple to conceptualize, and while the concept in a distributed system is functionally the same, the implementation of distributed locks are more complicated and necessarily slower (both the lock themselves, and their impact on the system as a whole).

All locks, local or distributed can be difficult to use correctly: the lock must be acquired before using the resource, and it must fully protect the resource, without protecting too much and having a large portion of functionality require the lock. So while locks are required sometimes, and conceptually simple, using them correctly is hard. With that disclaimer, to work, distributed locks require:³

some concept of an owner, which must be sufficiently specific (hostname, process identifier,) but that should be sufficiently unique to protect against process restarts, host renaming and collision.
lock status (locked/link) and if the lock has different modes, such as a multi-reader/single-writer lock, then that status.
a timeout or similar mechanism to prevent deadlocks if the actor holding a lock halts or becomes inaccessible, the lock is eventually released.
versioning, to prevent stale actors from modifying the same lock. In the case that actor-1 has a lock and stalls for longer than the timeout period, such that actor-2 gains the lock, when actor-1 runs again it must know that its been usurped.

Not all distributed systems require distributed locks, and in most cases, transactions in the data layer, provide most of the isolation that you might need from a distributed lock, but it’s a useful concept to have.

Duplicate Work (Idempotency)

For a lot of operations, in big systems, duplicating some work is easier and ultimately faster than coordinating and isolating that work in a single location. For this, having idempotent operations⁴ is useful. Some kinds of operations and systems make idempotency easier to implement, and in cases where the work is not idempotent (e.g. as in data processing or transformation,) the operation can be, by attaching some kind of clock to the data or operation.⁵

Using clocks and idempotency makes it possible to maintain data consistency without locks. At the same time, some of the same considerations apply. Having all operations duplicated is difficult to scale so having ways for operations to abort early can be useful.

Consensus Protocols

Some operations can’t be effectively distributed, but are also not safe to duplicate. Applications can use consensus protocols to do “leader election,” to ensure that there’s only one node “in charge” at a time, and the protocol. This is common in database systems, where “single leader” systems are useful for balancing write performance in distributed context. Consensus protocols have some amount of overhead, and are good for systems of a small to moderate size, because all elements of the system must communicate with all other nodes in the system.

The two prevailing consensus protocols are Paxos and Raft--pardoning the oversimplification here--with Raft being a simpler and easier to implement imagination of the same underlying principles. I’ve characterized consensus as being about leader election, though you can use these protocols to allow a distributed system to reach agreement on any manner of operations or shared state.

Queues

Building a fully generalized distributed application with consensus is a very lofty proposition, and commonly beyond the scope of most applications. If you can characterize the work of your system as discrete units of work (tasks or jobs,) and can build or access a queue mechanism within your application that supports workers on multiple processes, this might be enough to support a great deal of your distributed requirements for the application.

Once you have reliable mechanisms and abstractions for distributing work to a queue, scaling the system can be managed outside of the application by using different backing systems, or changing the dispatching layer, and queue optimization is pretty well understood. There are lots of different ways to schedule and distribute queued work, but perhaps this is beyond the scope of this article.

I wrote one of these, amboy, but things like gearman and celery do this as well, and many of these tools are built on messaging systems like Kafka or AMPQ, or just use general purpose databases as a backend. Keeping a solid abstraction between the applications queue and then messaging system seems good, but a lot depends on your application’s workload.

Delegate to Upstream Services

While there are distributed system problems that applications must solve for themselves, in most cases no solution is required! In practice many applications centralize a lot of their concerns in trusted systems like databases, messaging systems, or lock servers. This is probably correct! While distributed systems are required in most senses, distributed systems themselves are rarely the core feature of an application, and it makes sense to delegate these problem to services that that are focused on solving this problem.

While multiple external services can increase the overall operational complexity of the application, implementing your own distributed system fundamentals can be quite expensive (in terms of developer time), and error prone, so it’s generally a reasonable trade off.

Conclusion

I hope this was as useful for you all as it has been fun for me to write!

In most cases, some increase in reliability, by adding redundancy is a strong secondary motivation. ↩︎
xmpp , the protocol behind jabber which powered/powers many IM systems is another federated example, and the fediverse points to others. I also suspect that some federation-like features will be used at the infrastructure layer to coordinate between constrained elements (e.g. multiple k8s clusters will use federation for coordination, and maybe multi-cloud/multi-region orchestration as well…) ↩︎
This article about distributed locks in redis was helpful in summarizing the principles for me. ↩︎
An operation is idempotent if it can be performed more than once without changing the outcome. For instance, the operation “increment the value by 10” is not idempotent because it increments a value every time it runs, so running the operation once is different than running it twice. At the same time the operation “set the value to 10” is idempotent, because the value is always 10 at the end of the operation. ↩︎
Clocks can take the form of a “last modified timestamp,” or some kind of versioning integer associated with a record. Operations can check their local state against a canonical record, and abort if their data is out of date. ↩︎

Concerns#

Application State#

Startup and Shutdown#

Horizontal Scalability#

Challenges#

Solutions#

Ignore the Problem, For Now#

Federation#

Distributed Locks#

Duplicate Work (Idempotency)#

Consensus Protocols#

Queues#

Delegate to Upstream Services#

Conclusion#