At a certain scale, most applications end up having to contend with a class of
"distributed systems" problems: when a single computer or a single copy of an
application can't support the required throughput, there's not much to do
except distribute it, and therein lies the problem. Taking one of a thing and
making many copies of the thing operate together can be really fascinating,
and frankly empowering. At some point, all systems become distributed in some
way, to a greater or lesser extent. While the underlying problems and
strategies are simple enough, distributed systems-type bugs can be gnarly, and
having some framework for thinking about these kinds of systems and
architectures can be useful, or even essential, when writing any kind of
software.
Concerns
Application State
Applications all have some kind of internal state: configuration, runtime
settings, and whatever accumulates in memory as a result of running the
application. When you have more than one copy of a single logical application,
you have to put that state somewhere. That somewhere is usually a database,
but it can be another service or some kind of shared file resource (e.g. NFS
or blob storage like S3.)
The challenge is not "where to put the state," because it probably doesn't
matter much, but rather in organizing the application to remove the assumption
that state can be stored in the application. This often means avoiding caching
data in global variables and avoiding storing data locally on the filesystem,
but there are a host of ways in which application state can get stuck or
captured, and the fix is generally "ensure this data is always read out of
some centralized and authoritative service," and ensure that any locally
cached data is refreshed regularly and saved centrally when needed.
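As a concrete (if simplified) illustration, here's a minimal sketch in Go of reading state out of a central service with a short-lived local cache rather than caching it in a global; the service URL and payload shape are hypothetical:

```go
// Minimal sketch of refreshing locally cached state from a central,
// authoritative service on a TTL. The endpoint and Settings shape are
// hypothetical stand-ins for whatever the application actually stores.
package config

import (
	"encoding/json"
	"net/http"
	"sync"
	"time"
)

type Settings struct {
	FeatureFlags map[string]bool `json:"feature_flags"`
}

type Loader struct {
	url     string
	ttl     time.Duration
	mu      sync.Mutex
	cached  Settings
	fetched time.Time
}

func NewLoader(url string, ttl time.Duration) *Loader {
	return &Loader{url: url, ttl: ttl}
}

// Get returns the cached settings, refreshing them from the central
// service whenever the local copy is older than the TTL.
func (l *Loader) Get() (Settings, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if time.Since(l.fetched) < l.ttl {
		return l.cached, nil
	}
	resp, err := http.Get(l.url)
	if err != nil {
		// fall back to the stale copy rather than failing outright
		return l.cached, err
	}
	defer resp.Body.Close()
	var s Settings
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return l.cached, err
	}
	l.cached, l.fetched = s, time.Now()
	return s, nil
}
```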
In general, better state management within applications makes code better
regardless of how distributed the system is, and when we reach for "turn it
off and turn it back on," we're really just clearing out some bit of
application state that's gotten stuck during the runtime of a piece of
software.
Startup and Shutdown
Process creation and initialization, as well as shutdown, are difficult in
distributed systems. While most configuration and state is probably stored in
some remote service (like a database,) there's a bootstrapping process where
each process needs just enough local configuration to reach the central
service and fetch the rest of its configuration at startup, which can be a bit
delicate.
Shutdown has its own set of problems, as processes need to be able to complete
or safely abort in-progress operations.
For request-driven work (i.e. HTTP or RPC APIs) without stateful or
long-running requests (e.g. many websockets and most streaming connections),
applications have to stop accepting new connections and let all in-progress
requests complete before terminating. For other kinds of work, the process has
to either complete in-progress work or provide some kind of "checkpointing"
approach so that another process can pick up the work later.
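As an illustration, here's a minimal sketch of the "stop accepting new connections and drain in-progress requests" pattern for an HTTP service in Go, using the standard library's graceful shutdown:

```go
// Sketch of graceful shutdown: on SIGINT/SIGTERM, stop accepting new
// connections and give in-progress requests a bounded window to finish.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Wait for a termination signal from the operator or orchestrator.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGINT, syscall.SIGTERM)
	<-stop

	// Stop accepting new connections; let in-progress requests complete
	// within the timeout before terminating.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
}
```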
Horizontal Scalability
Horizontal scalability, being able to increase the capability of an
application by adding more instances of the application rather than increasing
the resources allotted to the application itself, is one of the reasons that
we build distributed systems in the first place, but simply being able to run
multiple copies of the application at once isn't always enough: the
application needs to be able to effectively distribute its workloads. For
request-driven work this is generally some kind of load balancing layer or
strategy, and for other kinds of workloads you need some way to distribute
work across the application.
There are lots of different ways to provide load balancing, and a lot depends
on your application and clients. There is specialized software (and even
hardware) that provides load balancing by sitting "in front of" the
application and routing requests to a backend, but there are also a collection
of client-side solutions that work quite well. The complexity of load
balancing solutions varies a lot: some approaches just distribute requests
"evenly" (by count) to each backend in turn ("round-robin"), and some attempt
to distribute requests more "fairly" based on some reporting from each backend
or an analysis of the incoming requests. The strategy here depends a lot on
the requirements of the application or service.
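As a sketch of the simplest client-side approach, a round-robin picker over a static list of backends might look like this in Go (real balancers also track health, weights, and connection counts):

```go
// Sketch of client-side round-robin selection: each call hands back
// the next backend in order, wrapping around the list.
package balance

import "sync/atomic"

type RoundRobin struct {
	backends []string
	next     atomic.Uint64
}

func New(backends []string) *RoundRobin {
	return &RoundRobin{backends: backends}
}

// Pick returns the next backend; safe for concurrent callers.
func (r *RoundRobin) Pick() string {
	n := r.next.Add(1)
	return r.backends[int((n-1)%uint64(len(r.backends)))]
}
```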
For workloads that aren't request driven, systems require some mechanism of
distributing work to workers, usually with some kind of messaging system,
though it's possible to get pretty far using just a normal general purpose
database to store pending work. The options for managing, ordering, and
distributing the work are the meat of the problem.
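For a sense of what the database-as-queue approach looks like, here's a hedged sketch in Go of a worker atomically claiming one pending job; the jobs table and its columns are hypothetical, and the SKIP LOCKED clause assumes PostgreSQL:

```go
// Sketch of using a plain SQL database to distribute pending work:
// each worker claims one job inside a transaction so two workers
// never pick up the same row.
package worker

import (
	"context"
	"database/sql"
)

// ClaimJob marks one pending job as running for this worker and
// returns its id; Scan returns sql.ErrNoRows when nothing is pending.
func ClaimJob(ctx context.Context, db *sql.DB, worker string) (int64, error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return 0, err
	}
	defer tx.Rollback()

	var id int64
	err = tx.QueryRowContext(ctx, `
		SELECT id FROM jobs
		WHERE status = 'pending'
		ORDER BY created_at
		LIMIT 1
		FOR UPDATE SKIP LOCKED`).Scan(&id)
	if err != nil {
		return 0, err
	}
	if _, err := tx.ExecContext(ctx,
		`UPDATE jobs SET status = 'running', owner = $1 WHERE id = $2`,
		worker, id); err != nil {
		return 0, err
	}
	return id, tx.Commit()
}
```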
Challenges
When thinking about system design or architecture, I tend to start with the
following questions.
- how does the system handle intermittent failures of particular components?
- what kind of downtime is acceptable for any given component? for the system
as a whole?
- how do operations time out and get terminated, and how do clients handle
these kinds of failures?
- what are the tolerances for the application in terms of latency of various
kinds of operations, and also the tolerances for "missing" or "duplicating"
an operation?
- when (any single) node or component of the system aborts or restarts
abruptly, how does the application/service respond? Does work resume or
abort safely?
- what level of manual intervention is acceptable? Does the system need to
handle node failure autonomously? If so, how many node failures?
Concepts like "node," "component," or "operation" can mean different things
in different systems, and I use the terms somewhat vaguely as a result. These
general factors and questions apply to systems that have monolithic
architectures (i.e. many copies of a single type of process which performs many
functions,) and service-based architectures (i.e. many different processes
performing specialized functions.)
Solutions
Ignore the Problem, For Now
Many applications run in a distributed fashion while only really addressing
parts of their relevant distributed systems problems, and in practice it works
out ok. Applications may store most of their data in a database, but have some
configuration files that are stored locally: this is annoying, and sometimes
an out-of-sync file can lead to some unexpected behavior. Applications may
have distributed application servers for all request-driven workloads, but may
still have a separate single process that does some kind of coordinated
background work, or run cron jobs.
Ignoring the problem isn't always the best solution in the long term, but
making sure that everything is distributed (or able to be distributed,) isn't
always the best use of time, and depending on the specific application it
works out fine. The important part isn't always to distribute things in all
cases, but to make it possible to distribute functions in response to needs:
in some ways I think about this as the "just in time" approach.
Federation
Federated architectures manage distributed systems protocols at a higher
level: rather than assembling a large distributed system, build very small
systems that can communicate at a high level using some kind of established
protocol. The best example of a federated system is probably email, though
there are others.
Federated systems have more complex protocols that have to be specification
based, which can be complicated and difficult to build. Also, federated
services have to maintain the ability to interoperate with previous versions
and sometimes even non-compliant services, which can be difficult. Federated
systems also end up pushing a lot of the user experience into the clients,
which can make it hard to control this aspect of the system.
On the upside, specific implementations and instances of a federated service
can be quite simple and have straightforward and lightweight
implementations. Supporting email for a few users (or even a few hundred) is a
much more tractable problem than supporting email for many millions of users.
Distributed Locks
Needing some kind of lock (for mutual exclusion, or mutex) is common enough in
programming: locks provide an easy way to ensure that only a single actor has
access to a specific resource. Doing this within a single process involves
using kernel primitives (futexes) or programming language runtime
implementations, and is simple to conceptualize. While the concept in a
distributed system is functionally the same, implementations of distributed
locks are more complicated and necessarily slower (both the locks themselves,
and their impact on the system as a whole).
All locks, local or distributed, can be difficult to use correctly: the lock
must be acquired before using the resource, and it must fully protect the
resource, without protecting too much and having a large portion of
functionality require the lock. So while locks are required sometimes, and
conceptually simple, using them correctly is hard. With that disclaimer, to
work, distributed locks require:
- some concept of an owner, which must be sufficiently specific (hostname,
process identifier,) and sufficiently unique to protect against process
restarts, host renaming, and collisions.
- lock status (locked/unlocked) and, if the lock has different modes, such as
a multi-reader/single-writer lock, that status as well.
- a timeout or similar mechanism to prevent deadlocks: if the actor holding a
lock halts or becomes inaccessible, the lock is eventually released.
- versioning, to prevent stale actors from modifying the same lock. In the
case that actor-1 has a lock and stalls for longer than the timeout period,
such that actor-2 gains the lock, when actor-1 runs again it must know that
it's been usurped.
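A minimal sketch of those requirements as a single-row, database-backed lock might look like the following; the locks table and its columns are hypothetical, and a production lock would also need release and renewal paths:

```go
// Sketch of a database-backed lock with an owner, an expiry to break
// deadlocks, and a version that a stale holder can use to detect that
// it has been usurped. Table and column names are hypothetical.
package dlock

import (
	"context"
	"database/sql"
	"time"
)

// Acquire takes the named lock for owner, stealing it only if the
// current lease has expired. It reports whether the lock was taken
// and, if so, the new version the caller should carry.
func Acquire(ctx context.Context, db *sql.DB, name, owner string, ttl time.Duration) (int64, bool, error) {
	now := time.Now()
	var version int64
	err := db.QueryRowContext(ctx, `
		UPDATE locks
		SET owner = $1, expires_at = $2, version = version + 1
		WHERE name = $3 AND (owner IS NULL OR expires_at < $4)
		RETURNING version`,
		owner, now.Add(ttl), name, now).Scan(&version)
	if err == sql.ErrNoRows {
		return 0, false, nil // another owner holds an unexpired lease
	}
	if err != nil {
		return 0, false, err
	}
	// Callers should present this version on subsequent operations so
	// a usurped holder can discover that it lost the lock.
	return version, true, nil
}
```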
Not all distributed systems require distributed locks, and in most cases,
transactions in the data layer provide most of the isolation that you might
need from a distributed lock, but it's a useful concept to have.
Duplicate Work (Idempotency)
For a lot of operations in big systems, duplicating some work is easier and
ultimately faster than coordinating and isolating that work in a single
location. For this, having idempotent operations is useful. Some kinds of
operations and systems make idempotency easier to implement, and in cases
where the work is not inherently idempotent (e.g. as in data processing or
transformation,) the operation can be made idempotent by attaching some kind
of clock to the data or operation.
Using clocks and idempotency makes it possible to maintain data consistency
without locks. At the same time, some of the same considerations apply: having
all operations duplicated is difficult to scale, so having ways for operations
to abort early can be useful.
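As a sketch of the clock-plus-idempotency idea, here's a compare-on-version write in Go: replaying the same write (or delivering it late) becomes a harmless no-op. The Document type and the in-memory store are illustrative only:

```go
// Sketch of making a non-idempotent write safe to duplicate by
// attaching a logical clock: a write only applies if it carries a
// newer version than what's already stored.
package idem

import "sync"

type Document struct {
	Body    string
	Version int64 // logical clock attached to the data
}

type Store struct {
	mu   sync.Mutex
	docs map[string]Document
}

func NewStore() *Store {
	return &Store{docs: make(map[string]Document)}
}

// Apply writes doc under key only if its version is newer than the
// stored copy; duplicated or stale deliveries are ignored.
func (s *Store) Apply(key string, doc Document) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.docs[key]; ok && cur.Version >= doc.Version {
		return false // already applied (or superseded): safe no-op
	}
	s.docs[key] = doc
	return true
}
```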
Consensus Protocols
Some operations can't be effectively distributed, but are also not safe to
duplicate. Applications can use consensus protocols to do "leader election,"
ensuring that there's only one node "in charge" at a time. This is common in
database systems, where "single leader" systems are useful for balancing write
performance in a distributed context. Consensus protocols have some amount of
overhead, and are good for systems of small to moderate size, because all
elements of the system must communicate with all other nodes in the system.
The two prevailing consensus protocols are Paxos and Raft--pardoning the
oversimplification here--with Raft being a simpler and easier-to-implement
iteration of the same underlying principles. I've characterized consensus
as being about leader election, though you can use these protocols to allow a
distributed system to reach agreement on all manner of operations or shared
state.
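In practice, applications usually consume consensus indirectly, leasing leadership from a consensus-backed store (etcd, ZooKeeper, or a database) rather than implementing Paxos or Raft themselves. Here's a sketch of that renewal loop, with a hypothetical LeaseStore interface standing in for the store's client:

```go
// Sketch of lease-based leader election as applications typically
// consume it: leadership is leased from a consensus-backed store and
// renewed while the node is alive. LeaseStore is hypothetical.
package leader

import (
	"context"
	"time"
)

type LeaseStore interface {
	// TryAcquire reports whether this node now holds the lease.
	TryAcquire(ctx context.Context, key, node string, ttl time.Duration) (bool, error)
	// Renew extends the lease; it fails if the node lost leadership.
	Renew(ctx context.Context, key, node string, ttl time.Duration) error
}

// Run executes lead() only while this node holds the lease, and
// returns when the context is cancelled.
func Run(ctx context.Context, store LeaseStore, key, node string, ttl time.Duration, lead func(context.Context)) {
	for ctx.Err() == nil {
		ok, err := store.TryAcquire(ctx, key, node, ttl)
		if err != nil || !ok {
			time.Sleep(ttl / 2) // not the leader: retry later
			continue
		}

		// We are the leader: do leader-only work until renewal fails.
		leadCtx, cancel := context.WithCancel(ctx)
		go lead(leadCtx)
		for {
			time.Sleep(ttl / 3)
			if err := store.Renew(ctx, key, node, ttl); err != nil {
				cancel() // lost leadership; stop leader-only work
				break
			}
			if ctx.Err() != nil {
				cancel()
				return
			}
		}
	}
}
```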
Queues
Building a fully generalized distributed application with consensus is a very
lofty proposition, and commonly beyond the scope of most applications. If you
can characterize the work of your system as discrete units of work (tasks or
jobs,) and can build or access a queue mechanism within your application that
supports workers on multiple processes, this might be enough to support a
great deal of your distributed requirements for the application.
Once you have reliable mechanisms and abstractions for distributing work to a
queue, scaling the system can be managed outside of the application by using
different backing systems, or changing the dispatching layer, and queue
optimization is pretty well understood. There are lots of different ways to
schedule and distribute queued work, but perhaps this is beyond the scope of
this article.
I wrote one of these, amboy, but
things like gearman and celery do this as well, and many of these tools
are built on messaging systems like Kafka or AMQP, or just use general purpose
databases as a backend. Keeping a solid abstraction between the application's
queue and the messaging system seems good, but a lot depends on your
application's workload.
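As an illustration of keeping that abstraction, here's a sketch of a small queue interface that application code depends on, with an in-memory implementation behind it; the interface is illustrative and not amboy's (or any other library's) API:

```go
// Sketch of keeping the application's queue behind a small interface
// so the backing system (a database, Kafka, AMQP, etc.) can change
// without touching the callers.
package queue

import "context"

// Job is a discrete unit of work.
type Job interface {
	ID() string
	Run(ctx context.Context) error
}

// Queue is what application code depends on; implementations may be
// backed by memory, a database, or a messaging system.
type Queue interface {
	Put(ctx context.Context, j Job) error
	Next(ctx context.Context) (Job, error)
}

// memQueue is a channel-backed implementation, fine for one process
// or for tests.
type memQueue struct{ jobs chan Job }

func NewMem(size int) Queue { return &memQueue{jobs: make(chan Job, size)} }

func (q *memQueue) Put(ctx context.Context, j Job) error {
	select {
	case q.jobs <- j:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (q *memQueue) Next(ctx context.Context) (Job, error) {
	select {
	case j := <-q.jobs:
		return j, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}
```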
Delegate to Upstream Services
While there are distributed system problems that applications must solve
for themselves, in most cases no solution is required! In practice many
applications centralize a lot of their concerns in trusted systems like
databases, messaging systems, or lock servers. This is probably correct! While
some distributed systems capability is required in most cases, distributed
systems themselves are rarely the core feature of an application, and it makes
sense to delegate these problems to services that are focused on solving them.
While multiple external services can increase the overall operational
complexity of the application, implementing your own distributed system
fundamentals can be quite expensive (in terms of developer time) and
error-prone, so it's generally a reasonable trade-off.