At a certain scale, most applications end up having to contend with a
class of “distributed systems” problems: when a single computer or a
single copy of an application can’t support the required throughput,
there’s not much to do except distribute it, and
therein lies the problem. Taking one of a thing and making many of
the thing operate similarly can be really fascinating, and frankly
empowering. At some point, all systems become distributed in some way,
to a greater or lesser extent. While the underlying problems and
strategies are simple enough, distributed systems-type bugs can be
gnarly and having some framework for thinking about these kinds of
systems and architectures can be useful, or even essential, when writing
any kind of software.
Concerns#
Application State#
Applications all have some kind of internal state: configuration and
runtime settings, in addition to whatever happens in memory as a result
of running the application. When you have more than one copy of a single
logical application, you have to put state somewhere. That somewhere is
usually a database, but it can be another service or in some kind of
shared file resource (e.g. NFS or blob storage like S3.)
The challenge is not “where to put the state,” because it probably
doesn’t matter much, but rather in organizing the application to remove
the assumption that state can be stored in the application. This often
means avoiding caching data in global variables and avoiding storing
data locally on the filesystem, but there are a host of ways in which
application state can get stuck or captured, and the fix is generally
“ensure this data is always read out of some centralized and
authoritative service,” and ensure that any locally cached data is
refreshed regularly and saved centrally when needed.
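As a rough illustration, here’s a minimal sketch in Go of what “read it from the authoritative service, cache it briefly” can look like; the `Store` interface and the TTL-based refresh are assumptions for the example, not a specific library:

```go
// A sketch: read settings from an authoritative store rather than caching
// them in a global for the life of the process. The Store interface and
// whatever service sits behind it are assumptions.
package config

import (
	"context"
	"sync"
	"time"
)

// Store is the central, authoritative service that holds the state
// (a database row, a key in a config service, an object in S3, etc.).
type Store interface {
	GetSetting(ctx context.Context, key string) (string, error)
}

// Settings wraps the store with a short-lived cache so reads are cheap,
// but the process never becomes the source of truth.
type Settings struct {
	store Store
	ttl   time.Duration

	mu      sync.Mutex
	cached  map[string]string
	fetched map[string]time.Time
}

func NewSettings(store Store, ttl time.Duration) *Settings {
	return &Settings{
		store:   store,
		ttl:     ttl,
		cached:  map[string]string{},
		fetched: map[string]time.Time{},
	}
}

// Get returns a possibly cached value, refreshing it from the central
// store once the TTL has expired.
func (s *Settings) Get(ctx context.Context, key string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if at, ok := s.fetched[key]; ok && time.Since(at) < s.ttl {
		return s.cached[key], nil
	}

	val, err := s.store.GetSetting(ctx, key)
	if err != nil {
		return "", err
	}
	s.cached[key] = val
	s.fetched[key] = time.Now()
	return val, nil
}
```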
In general, better state management within applications makes code
better regardless of how distributed the system is, and when we use the
“turn it off and turn it back on” trick, we’re really just clearing out
some bit of application state that’s gotten stuck during the runtime of
a piece of software.
Startup and Shutdown#
Process creation and initialization, as well as shutdown, are difficult
in distributed systems. While most configuration and state is probably
stored in some remote service (like a database,) there’s a
bootstrapping process where each process needs enough local
configuration to reach that central service and fetch the rest of its
configuration at startup, which can be a bit delicate.
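To make the bootstrapping step concrete, here’s a small hypothetical sketch: the only truly local configuration is how to reach the central service (environment variables here), and everything else is fetched from it at startup. The `ConfigService` type and its `Fetch` method are invented for illustration:

```go
// A sketch of the bootstrapping step: the process only needs enough local
// configuration (here, environment variables) to reach the central service,
// and everything else comes from that service.
package startup

import (
	"context"
	"fmt"
	"os"
	"time"
)

// AppConfig is whatever the central service returns.
type AppConfig struct {
	ListenAddr   string
	FeatureFlags map[string]bool
}

// ConfigService is a hypothetical client for the central configuration
// service; Fetch would call it over HTTP or RPC.
type ConfigService struct{ url, token string }

func (c *ConfigService) Fetch(ctx context.Context) (*AppConfig, error) {
	// Stubbed out for the sketch: a real implementation would make a
	// request to c.url using c.token.
	return &AppConfig{ListenAddr: ":8080"}, nil
}

func Bootstrap(ctx context.Context) (*AppConfig, error) {
	// The only truly local state: how to find and authenticate to the
	// central configuration service.
	url := os.Getenv("CONFIG_SERVICE_URL")
	token := os.Getenv("CONFIG_SERVICE_TOKEN")
	if url == "" {
		return nil, fmt.Errorf("CONFIG_SERVICE_URL is not set")
	}

	svc := &ConfigService{url: url, token: token}

	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	return svc.Fetch(ctx)
}
```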
Shutdown has its own set of problems, as processes need to be able to
complete or safely abort in-progress operations.
For request-driven work (i.e. HTTP or RPC APIs) without stateful or
long-running requests (e.g. many websockets and most streaming
connections), applications have to stop accepting new connections and
let all in-progress requests complete before terminating. For other
kinds of work, the process has to either complete in-progress work or
provide some kind of “checkpointing” approach so that another process
can pick up the work later.
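For the HTTP case, the standard library’s `http.Server` makes the “stop accepting, drain, then exit” pattern fairly direct; this is a minimal sketch with placeholder timeouts and handlers:

```go
// A minimal sketch of graceful shutdown for request-driven work: stop
// accepting new connections and give in-flight requests a bounded amount
// of time to finish before the process exits.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Wait for a termination signal from the orchestrator or operator.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
	<-sig

	// Shutdown stops accepting new connections and waits for in-flight
	// requests, up to the deadline on the context.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
}
```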
Horizontal Scalability#
Horizontal scalability, being able to increase the capability of an
application by adding more instances of the application rather than
increasing the resources allotted to the application itself, is one of
the reasons that we build distributed systems in the first place, but
simply being able to run multiple copies of the application at once
isn’t always enough: the application needs to be able to effectively
distribute its workloads. For request-driven work this is generally
some kind of load balancing layer or strategy, and for other kinds of
workloads you need some way to distribute work across the application.
There are lots of different ways to provide load balancing, and a lot
depends on your application and clients. There is specialized software
(and even hardware) that provides load balancing by sitting “in front
of” the application and routing requests to a backend, but there are
also a collection of client-side solutions that work quite well. The
complexity of load balancing solutions varies a lot: some approaches
just distribute requests “evenly” (by number) to each backend
one-by-one (“round-robin”), and some attempt to distribute requests
more “fairly” based on some reporting from each backend or an analysis
of the incoming requests. The strategy here depends a lot on the
requirements of the application or service.
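As a sketch of the simplest end of that spectrum, here’s a client-side round-robin picker in Go; the static backend list is an assumption, since real deployments usually discover backends from DNS or a registry:

```go
// The simplest client-side strategy: plain round-robin over a fixed list
// of backends, distributing requests evenly by count with no knowledge of
// backend load or request cost.
package loadbalance

import "sync/atomic"

type RoundRobin struct {
	backends []string
	next     uint64
}

func NewRoundRobin(backends []string) *RoundRobin {
	return &RoundRobin{backends: backends}
}

// Pick returns the next backend in order, wrapping around at the end of
// the list; it is safe to call from many goroutines.
func (r *RoundRobin) Pick() string {
	n := atomic.AddUint64(&r.next, 1)
	return r.backends[(n-1)%uint64(len(r.backends))]
}
```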
For workloads that aren’t request driven, systems require some
mechanism for distributing work to workers, usually with some kind of
messaging system, though it’s possible to get pretty far using just a
normal general purpose database to store pending work. The options for
managing, ordering, and distributing the work are the meat of the problem.
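As an example of the “just use a database” approach, here’s a sketch of workers claiming pending jobs from a relational database; the `jobs` table and its columns are assumptions, and the `SKIP LOCKED` clause is Postgres-specific:

```go
// A sketch of using a plain relational database as the work-distribution
// mechanism: each worker atomically claims the next pending job so that no
// two workers pick up the same work.
package work

import (
	"context"
	"database/sql"
)

type Job struct {
	ID      int64
	Payload []byte
}

// ClaimNext marks one pending job as running for this worker and returns
// it; Scan returns sql.ErrNoRows when there is no pending work.
func ClaimNext(ctx context.Context, db *sql.DB, worker string) (*Job, error) {
	const q = `
	UPDATE jobs SET status = 'running', claimed_by = $1, claimed_at = now()
	WHERE id = (
		SELECT id FROM jobs
		WHERE status = 'pending'
		ORDER BY created_at
		FOR UPDATE SKIP LOCKED
		LIMIT 1
	)
	RETURNING id, payload`

	job := &Job{}
	if err := db.QueryRowContext(ctx, q, worker).Scan(&job.ID, &job.Payload); err != nil {
		return nil, err
	}
	return job, nil
}
```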
Challenges#
When thinking about system design or architecture, I tend to start with
the following questions.
- how does the system handle intermittent failures of particular
components?
- what kind of downtime is acceptable for any given component? for the
system as a whole?
- how do operations time out and get terminated, and how do clients
handle these kinds of failures?
- what are the tolerances for the application in terms of latency of
various kinds of operations, and also the tolerances for “missing”
or “duplicating” an operation?
- when (any single) node or component of the system aborts or restarts
abruptly, how does the application/service respond? Does work resume
or abort safely?
- what level of manual intervention is acceptable? Does the system need
to handle node failure autonomously? If so, how many simultaneous node
failures?
Concepts like “node” or “component” or “operation,” can mean
different things in different systems, and I use the terms somewhat
vaguely as a result. These general factors and questions apply to
systems that have monolithic architectures (i.e. many copies of a single
type of process which performs many functions,) and service-based
architectures (i.e. many different processes performing specialized
functions.)
Solutions#
Ignore the Problem, For Now#
Many applications run in a distributed fashion while only really
addressing parts of their relevant distributed systems problems, and in
practice it works out ok. Applications may store most of their data in a
database, but have some configuration files that are stored locally:
this is annoying, and sometimes an out-of-sync file can lead to some
unexpected behavior. Applications may have distributed application
servers for all request-driven workloads, but may still have a separate
single process that does some kind of coordinated background work, or
run cron jobs.
Ignoring the problem isn’t always the best solution in the long term,
but making sure that everything is distributed (or able to be
distributed,) isn’t always the best use of time, and depending on the
specific application it works out fine. The important part isn’t to
distribute things in all cases, but to make it possible to
distribute functions in response to needs: in some ways I think about
this as the “just in time” approach.
Federation#
Federated architectures manage distributed systems protocols at a higher
level: rather than assembling a large distributed system, build very
small systems that can communicate at a high level using some kind of
established protocol. The best example of a federated system is probably
email, though there are others.
Federated systems have more complex protocols that have to be
specification-based, which can be difficult to design and build. Also,
federated services have to maintain the ability to interoperate with
previous versions and even sometimes non-compliant services, which can
be difficult to maintain. Federated systems also end up pushing a lot of
the user experience into the clients, which can make it hard to control
this aspect of the system.
On the upside, specific implementations and instances of a federated
service can be quite simple and have straightforward and lightweight
implementations. Supporting email for a few users (or even a few
hundred) is a much more tractable problem than supporting email for many
millions of users.
Distributed Locks#
Needing some kind of lock (for mutual exclusion, or mutex) is common
enough in programming: locks provide an easy way to ensure that only a
single actor has access to a specific resource. Doing this within a
single process involves kernel primitives (futexes) or programming
language runtime implementations, and is simple to conceptualize. While
the concept in a distributed system is functionally the same, the
implementation of distributed locks is more complicated and necessarily
slower (both the locks themselves, and their impact on the system as a
whole).
All locks, local or distributed, can be difficult to use correctly: the
lock must be acquired before using the resource, and it must fully
protect the resource without protecting too much, so that a large
portion of the functionality doesn’t end up requiring the lock. So while
locks are sometimes required, and conceptually simple, using them
correctly is hard. With that disclaimer, to work, distributed locks
require the following (sketched in code after the list):
- some concept of an owner, which must be sufficiently specific
(hostname, process identifier,) and sufficiently unique to protect
against process restarts, host renaming, and collisions.
- lock status (locked/unlocked) and, if the lock has different modes, such as
a multi-reader/single-writer lock, then that status.
- a timeout or similar mechanism to prevent deadlocks: if the actor
holding a lock halts or becomes inaccessible, the lock is eventually
released.
- versioning, to prevent stale actors from modifying the same lock. In
the case that actor-1 holds a lock and stalls for longer than the
timeout period, such that actor-2 gains the lock, when actor-1 runs
again it must know that it has been usurped.
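Here’s a sketch of the lock record those requirements imply, with a compare-and-swap style acquire against a hypothetical document store (the `Store` interface and its semantics are assumptions):

```go
// A sketch of a distributed lock record and acquisition. Every field maps
// to one of the requirements above: owner, status, timeout, and version.
package dlock

import (
	"context"
	"errors"
	"time"
)

type Lock struct {
	Name    string
	Owner   string    // hostname + process ID + a unique token per acquisition
	Locked  bool
	Expires time.Time // after this point, other actors may take the lock
	Version int64     // incremented on every successful mutation
}

// Store is a hypothetical central document store.
type Store interface {
	Get(ctx context.Context, name string) (Lock, error)
	// CompareAndSwap writes the updated lock only if the stored version
	// still matches old.Version; this protects against stale actors.
	CompareAndSwap(ctx context.Context, old, updated Lock) error
}

var ErrNotAcquired = errors.New("lock held by another owner")

func Acquire(ctx context.Context, s Store, name, owner string, ttl time.Duration) (Lock, error) {
	cur, err := s.Get(ctx, name)
	if err != nil {
		return Lock{}, err
	}
	// Fail if someone else holds the lock and it hasn't timed out yet.
	if cur.Locked && time.Now().Before(cur.Expires) && cur.Owner != owner {
		return Lock{}, ErrNotAcquired
	}
	next := Lock{
		Name:    name,
		Owner:   owner,
		Locked:  true,
		Expires: time.Now().Add(ttl),
		Version: cur.Version + 1,
	}
	if err := s.CompareAndSwap(ctx, cur, next); err != nil {
		return Lock{}, err
	}
	return next, nil
}
```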
Not all distributed systems require distributed locks, and in most
cases, transactions in the data layer provide most of the isolation
that you might need from a distributed lock, but it’s a useful concept
to have.
Duplicate Work (Idempotency)#
For a lot of operations, in big systems, duplicating some work is easier
and ultimately faster than coordinating and isolating that work in a
single location. For this, having idempotent operations is useful.
Some kinds of operations and systems make idempotency easier to
implement, and in cases where the work is not naturally idempotent
(e.g. as in data processing or transformation,) the operation can be
made idempotent by attaching some kind of clock to the data or
operation.
Using clocks and idempotency makes it possible to maintain data
consistency without locks. At the same time, some of the same
considerations apply. Duplicating all operations is difficult to scale,
so having ways for operations to abort early can be useful.
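As a sketch of the clock idea, here’s a small Go example where each update carries a version and applying it is a no-op unless the version is newer than what’s stored; the in-memory map stands in for whatever storage layer you’d actually use:

```go
// A sketch of idempotency via a logical clock: an update carries a version,
// and applying it is a no-op unless the version is newer than the stored
// one, so replayed or duplicated operations are harmless.
package idempotent

import "sync"

type Update struct {
	Key     string
	Value   string
	Version int64 // logical clock attached to the operation
}

type Record struct {
	Value   string
	Version int64
}

type Store struct {
	mu   sync.Mutex
	data map[string]Record
}

func NewStore() *Store { return &Store{data: map[string]Record{}} }

// Apply returns true if the update changed state; applying the same update
// twice (or out of order) leaves the store unchanged.
func (s *Store) Apply(u Update) bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	cur, ok := s.data[u.Key]
	if ok && u.Version <= cur.Version {
		return false // stale or duplicate operation: ignore it
	}
	s.data[u.Key] = Record{Value: u.Value, Version: u.Version}
	return true
}
```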
Consensus Protocols#
Some operations can’t be effectively distributed, but are also not safe
to duplicate. Applications can use consensus protocols to do “leader
election,” ensuring that there’s only one node “in charge” at a
time, with the protocol handling the hand-off when that node fails or
becomes unavailable. This is common in database systems, where
“single leader” systems are useful for balancing write performance in a
distributed context. Consensus protocols have some amount of overhead,
and are best suited to systems of small to moderate size, because all
elements of the system must communicate with all other nodes in the
system.
The two prevailing consensus protocols are Paxos and Raft--pardoning
the oversimplification here--with Raft being a simpler and
easier-to-implement incarnation of the same underlying principles. I’ve
characterized consensus as being about leader election, though you can
use these protocols to allow a distributed system to reach agreement on
any manner of operations or shared state.
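In practice, most applications don’t implement Raft or Paxos themselves; they lean on a system that already runs one. Here’s a sketch of leader election using etcd’s client library (which exposes the primitive on top of Raft); the endpoint, election prefix, and node name are placeholders:

```go
// A sketch of leader election delegated to etcd, rather than implementing
// a consensus protocol directly in the application.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// The session's lease means leadership is released if this process dies.
	session, err := concurrency.NewSession(cli)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	election := concurrency.NewElection(session, "/my-service/leader")

	// Campaign blocks until this node becomes the leader.
	if err := election.Campaign(context.Background(), "node-1"); err != nil {
		log.Fatal(err)
	}

	log.Println("this node is now the leader; run the singleton work here")
}
```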
Queues#
Building a fully generalized distributed application with consensus is a
very lofty proposition, and commonly beyond the scope of most
applications. If you can characterize the work of your system as
discrete units of work (tasks or jobs,) and can build or access a queue
mechanism within your application that supports workers on multiple
processes, this might be enough to support a great deal of the
application’s distributed requirements.
Once you have reliable mechanisms and abstractions for distributing work
to a queue, scaling the system can be managed outside of the application
by using different backing systems, or changing the dispatching layer,
and queue optimization is pretty well understood. There are lots of
different ways to schedule and distribute queued work, but perhaps this
is beyond the scope of this article.
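To make the abstraction point concrete, here’s a sketch of the kind of small queue interface I have in mind, with a trivial in-memory backend; the names are illustrative, and in practice the backend would be a database or message broker:

```go
// A sketch of a queue abstraction: the application enqueues units of work
// behind a small interface, and the backing implementation can change
// without touching the callers.
package queue

import (
	"context"
	"sync"
)

// Job is a discrete unit of work.
type Job interface {
	Run(ctx context.Context) error
}

// Queue is the only surface the rest of the application sees.
type Queue interface {
	Put(ctx context.Context, j Job) error
	Close()
}

// memQueue is a trivial in-process backend: a channel plus a worker pool.
type memQueue struct {
	jobs chan Job
	wg   sync.WaitGroup
}

func NewMemQueue(workers, buffer int) Queue {
	q := &memQueue{jobs: make(chan Job, buffer)}
	for i := 0; i < workers; i++ {
		q.wg.Add(1)
		go func() {
			defer q.wg.Done()
			for j := range q.jobs {
				_ = j.Run(context.Background()) // real code would record errors
			}
		}()
	}
	return q
}

func (q *memQueue) Put(ctx context.Context, j Job) error {
	select {
	case q.jobs <- j:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Close stops accepting work and waits for the workers to drain the queue.
func (q *memQueue) Close() {
	close(q.jobs)
	q.wg.Wait()
}
```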
I wrote one of these, amboy, but
things like gearman and
celery do this as well, and many of
these tools are built on messaging systems like Kafka or AMQP, or just
use general purpose databases as a backend. Keeping a solid abstraction
between the application’s queue and the messaging system seems good, but
a lot depends on your application’s workload.
Delegate to Upstream Services#
While there are distributed system problems that applications must solve
for themselves, in most cases no solution is required! In practice many
applications centralize a lot of their concerns in trusted systems like
databases, messaging systems, or lock servers. This is probably correct!
While distributed systems are, in most senses, required, distributed
systems themselves are rarely the core feature of an application, and it
makes sense to delegate these problems to services that are focused on
solving them.
While multiple external services can increase the overall operational
complexity of the application, implementing your own distributed system
fundamentals can be quite expensive (in terms of developer time), and
error prone, so it’s generally a reasonable trade off.
Conclusion#
I hope this was as useful for you all as it has been fun for me to
write!