The Org Mode Product

As a degenerate emacs user, as it were, I have of course used org-mode a lot, and it's probably the mode I end up doing a huge amount of my editing in, because it's great with text and I end up writing a lot of text. But I'm not really an org-mode user in the sense that it's not the system or tool I use to stay organized, I haven't done much development of my own tooling or process around using org-mode for document production, and honestly, most of the time I use reStructuredText as my preferred lightweight markup language.

I was thinking, though, as I was pondering ox-leanpub: what is org-mode even trying to do, and what the hell would a product manager do if faced with org-mode?

In some ways, I think it sucks the air out of the fun of hacking on things like emacs to bring all of the "professionalization of making software" to things like org-mode, so please trust that this is meant with a lot of affection for org-mode: it's a thought experiment.


Org has a lot going on:

  • it provides a set of compelling tools for interacting with hierarchical human-language documents,
  • it's a document markup and structure system,
  • the table editing features, given the ability to write formulas in lisp, are basically a spreadsheet,
  • it's a literate programming environment (babel),
  • it's a document preparation system (ox),
  • it's a task manager (agenda),
  • it's a time tracking system,
  • it even has pretty extensive calendar management tools.

Perhaps the thing that's most exciting about org-mode is that it provides functionality for all of these kinds of tasks in a single product so you don't have to bounce between lots of different tools to do all of these things.

It's got most of the "office" suite covered, and I think (particularly for new people, but also for people like me) it's not clear why I would want my task system, my notes system, and my document preparation system to all have their data intermingled in the same set of files. The feeling is a bit unfocused.

The reason for this makes sense historically: org-mode grew out of the needs of technically minded academics who mostly used it to prepare papers, who ended up being responsible for structuring their own time and work, but who did most of that work alone. With this kind of user story in mind, the gestalt of org-mode really comes together as a product, but otherwise it's definitely a bit all over the place.

I don't think this is bad, and particularly given its history, it's easy to understand why things are the way they are, but I think that it is useful to take a step back and think about the workflow that org supports and inspires, while also not forgetting the kinds of workflows that it precludes, and the ways that org, itself, can create a lot of conceptual overhead.

There are also some gaps in org, as a product, which I think grow out of this history.

They are, to my mind:

  • importing data, and bidirectional sync. These are really hard problems, and there've been some decent projects over the years to help get data into org (org-trello is the best example I can think of), but it can be a little dodgy, and the "import story" pales in comparison to the export story. It would be great if:
    • you could use the org interface to interact with and manipulate data that isn't actually in org-files, or at least where the system-of-record for the data isn't org. Google docs? Files in other formats?
  • collaborating with other people. Org-mode files tend to cope really poorly with multiple people editing them at the same time (or asynchronously, as with git), and also with cases where not everyone uses org-mode. One of the side effects of having the implementation of org's features so deeply tied to the structure of text in the org format is that it becomes hard to interact with org data outside of emacs (again, you can understand why this happens, and it's indeed very lispy), which means you have to use emacs, and use org, if you want to collaborate on projects that use org.
    • this might look like different diff-drivers for git, in addition to some other more novel tools.
    • bi-directional sync might also help with this issue.
  • beyond the agenda, building a story for content that spans multiple files. Because the files are hierarchical, and org provides a great deal of taxonomic indexing features, you really never need more than one org file, but it's also kind of wild to just keep everything in one file, so you end up with lots of org files; and while the agenda provides a way to filter out the task and calendar data, it's sort of unclear how to manage multi-file systems for some of the other projects. It's also the case that, because you can inject some configuration at the file level, it can be easy to get stuck.
  • tools for interacting with org content without (interactive or apparent) emacs. While I love emacs as much as the next nerd, I tend to think that having a dependency on emacs is hard to stomach, particularly for collaborative efforts (though with docker and the increasing size of various runtimes, this may be less relevant). What if it were trivially easy to write build processes that extracted or ran babel programs without needing to run from within emacs? What if there were an org-export CLI tool? (A sketch of that last idea follows this list.)
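On that last point, emacs can already run non-interactively, so a thin CLI is possible today. Here's a minimal, hypothetical sketch (the wrapper and its name are mine, not a real tool) that shells out to batch-mode emacs to export an org file to HTML; a real tool would want error handling and format selection:

    #!/usr/bin/env python3
    """org-export: a hypothetical CLI wrapper around batch-mode emacs."""
    import subprocess
    import sys

    def export_to_html(org_file: str) -> None:
        # --batch runs emacs non-interactively; visiting the file and
        # calling org-html-export-to-html writes <name>.html beside it.
        subprocess.run(
            ["emacs", "--batch", org_file, "-f", "org-html-export-to-html"],
            check=True,
        )

    if __name__ == "__main__":
        export_to_html(sys.argv[1])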

Docker Isn't A Build System

I can't quite decide if this title is ironic or not. I've been trying really hard to not be a build system guy, and I think I'm succeeding--mostly--but sometimes these things come back around. I may still be smarting from people proposing "just using docker" to solve any number of mostly unrelated problems.

The thing is that docker does help solve many build problems: before docker, you had to either write code that supported any possible execution environment, or carefully control the environments your code could run in. This was a lot of work, and was generally really annoying. Because docker provides a (potentially) really stable execution environment, it can make a lot of sense to do your building inside of a docker container, in the same way that folks often do builds in chroot environments (or at least did). Really, containers are kind of super-chroots, and it's a great thing to be able to give your development team a common starting point for doing development work. This is cool.

It's also the case that Docker makes a lot of sense as a de facto standard distribution or deployment form, and in this way it's kind of a really fat binary. Maybe it's too big, maybe it's the wrong tool, maybe it's not perfect, but a lot of applications will end up running in containers anyway, and treating a docker container like your executable format makes it possible to avoid issues that only appear in one (set of) environments.

At the same time, I think it's important to keep these use-cases separate: try to avoid using the same container for deployment that you use for development, or even for build systems. This is good because "running the [deployment] container" shouldn't build software, and it also limits the size of your production containers and avoids unintentionally picking up dependencies. This is, of course, less clear in runtimes that don't have a strong "compiled artifacts" story, but it's still possible.

There are some notes/caveats:

  • Dockerfiles are actually kind of build systems, and under the hood they're just snapshotting the diffs of the filesystem between each step. So they work best if you treat them like build systems: make the steps discrete and small, keep the stable deterministic things early in the build, and push the more ephemeral steps later in the build to prevent unnecessary rebuilding.
  • "Build in one container and deploy in another," requires moving artifacts between containers, or being able to run docker-in-docker, which are both possible but may be less obvious than some other workflows.
  • Docker's "build system qualities," can improve the noop and rebuild-performance of some operations (e.g. the amount of time to rebuild things if you've just built or only made small changes.) which can be a good measure of the friction that developers experience, because of the way that docker can share/cache between builds. This is often at the expense of making artifacts huge and at greatly expanding the amount of time that the operations can take. This might be a reasonable tradeoff to make, but it's still a tradeoff.

My Code Review Comments

I end up reviewing a lot of code, and while doing code review (and getting reviews) used to take up a lot of my time, I think I've gotten better at doing reviews, and at knowing what's important to comment on and what isn't:

  • The code review process is not there to discover bugs. Write tests to catch bugs, and use the code review process to learn about a specific change, and find things that are difficult to test for.
  • Ask yourself if something is difficult to follow, and comment on that. If you can't figure out what something is doing, or you have to read it more than once, then that's probably a problem.
  • Examine and comment on the naming of functions. Does the function appear to do what the name indicates?
  • Think about the interface of a piece of code:
    • What's exported or public?
    • How many arguments do your functions take?
  • Look for any kind of shared state between functions, particularly data that's mutable or stored in globally accessible or package-local structures.
  • Focus your time on the production-facing, public code, and less on things like tests and private/un-exported APIs. Tests are important, and good coverage (authors should use coverage tooling to check this) and efficient test execution matter, but beyond these high-level aspects you can mostly skim test code.
  • Put yourself in the shoes of someone who might need to debug this code and think about logging as well as error handling and reporting.
  • Push yourself and others to be able to get very small pieces of code reviewed at a time. Shorter reviews are easier to process and while it's annoying to break a logical component into a bunch of pieces, it's definitely worth it.

Values for Collaborative Codebases

After my post on what I do for work, I thought it'd be good to describe the kinds of things that make software easy to work on and collaborative. Here's the list:

  1. Minimal Documentation. As a former technical writer this is sort of painful, but most programming environments (e.g. languages) have idioms and patterns that you can follow for how to organize code, run tests, and build artifacts. It's ok if your project has exceptional requirements that require you to break the rules in some way, but the conventions should be obvious and the justification for rule-breaking should be plain. If you adhere to convention, you don't need as much documentation. It's paradoxical, because better documentation is supposed to facilitate accessibility, but too much documentation is sometimes an indication that things are weird and hard.
  2. Make it Easy To Run. I'd argue that the most difficult (and frustrating) part of writing software is getting it to run everywhere that you might want it to run. Writing software that runs, and even does what you want, on your own computer is relatively easy: making it work on someone else's computer is hard. One of the big advantages of developing software that runs as web apps is that you (mostly) get to control (and limit) where the software runs. Make it possible to easily run a piece of software on any computer where it might reasonably run (e.g. developers' computers, users' computers, and/or production environments). Your software itself should be responsible for managing this environment, to the greatest extent possible. This isn't to say that you need to use containers or some such, but having packaging and install scripts that use well understood idioms and tools (e.g. requirements.txt, virtualenv, makefiles, etc.) is good.
  3. Clear Dependencies. Software is all built upon other pieces of software, and the versions of those libraries are important to record, capture, and recreate. It's generally a good idea to update dependencies regularly so you can take advantage of improvements from upstream providers, but unless you regularly test against multiple versions of your dependencies (you don't), and can control all of your developer and production environments totally (you can't), you should provide a single, centralized way of describing your dependencies. Typical strategies include vendoring dependencies, using lockfiles (requirements.txt and similar), or build system integration; any of these helps organize this aspect of a project.
  4. Automated Tests. Software requires some kind of testing to ensure that it has the intended behavior, and tests that can run quickly and automatically, without requiring users to exercise the software manually, are absolutely essential. Tests should run quickly, and it should be possible to run a small group of tests very quickly to support iterative development on a specific area of the code. Indeed, most software development can and should be oriented toward the experience of writing tests and exercising new features with tests above pretty much everything else. The best test suites exercise the code at many levels, ranging from very small unit tests that ensure the correct behavior of functions and methods, to higher level tests of higher-order functions and methods, to full integration tests that exercise the entire system.
  5. Continuous Integration. Continuous integration systems are tools that support development and ensure that changes to a codebase pass a more extensive range of tests than is readily available to developers. CI systems are also useful for automating other aspects of a project (releases, performance testing, etc.) A well maintained CI environment provides a basis of commonality for projects with larger groups of developers, and is all but required to ensure a well supported automated test system and well managed dependencies.
  6. Files and Directories Make Sense. Code is just text in files, and software is just a lot of code. Different languages and frameworks have different expectations about how code is organized, but you should be able to get a basic understanding of the software by browsing the files: you should be able to mostly understand what components are and how they relate to each other based on directory names, and be able to (roughly) understand what's in a file and how the files relate to each other based on their names. In this respect, shorter files, when possible, are nice, as are directory structures that have limited depth (wide and shallow), though there are exceptions for some programming languages.
  7. Isolate Components and Provide Internal APIs. Often when we talk about APIs we talk about the interface through which users access our software, but larger systems have the same need for abstractions and interfaces internally that we expose for (programmatic) users externally. These APIs take different forms from public ones (sometimes), but in general:
    • Safe APIs. The APIs should be easy to use and difficult to use incorrectly. This isn't just "make your API thread safe if your users are multithreaded," but also: reduce the possibility for unintended side effects, and avoid calling conventions that are easy to mistake: effects of ordering, positional arguments, larger numbers of arguments, and complicated state management. (There's a small sketch of this after the list.)
    • Good API Ergonomics. The ergonomics of an API are somewhat ethereal, but it's clear when an API has good ergonomics: writing code that uses the API feels "native," it's easy to look at calling code and understand what's going on, and errors make sense and are easy to handle. It's not enough for an API simply to be safe to use; it should be straightforward and clear.
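To make the "safe API" point concrete, here's a small sketch (the function and names are hypothetical, not from any real library): keyword-only arguments avoid the transposed-boolean and argument-ordering mistakes described above.

    # hard to use correctly: callers must remember positional order
    #   copy_table("users", "archive", True, False, 100)
    # easier to use correctly: flags must be named at the call site
    def copy_table(src: str, dst: str, *, overwrite: bool = False,
                   dry_run: bool = False, batch_size: int = 1000) -> None:
        """Copy rows from src to dst in batches."""
        ...

    # the call site is self-describing, and reordering keywords is harmless:
    copy_table("users", "users_archive", overwrite=True, batch_size=500)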

How to Become a Self-Taught Programmer

i am a self taught programmer. i don't know that i'd recommend it to anyone else: there are so many different curricula and training programs that are well tested and very efficacious. for lots of historical reasons, many programmers end up being all or mostly self taught: in the early days because programming was vocational and people learned on the job, then because people learned programming on their own before entering cs programs, and more recently because demand (and the rate of change) has outpaced formal training for the kinds of programming work that are most in demand these days. and knowing that it's possible (and potentially cheaper) to teach yourself, it seems like a tempting option.

this post, then, is a collection of suggestions, guidelines, and pointers for anyone attempting to teach themselves to program:

  • focus on learning one thing (programming language and problem domain) at a time. there are so many different things you could learn, and people who know how to program seem to have an endless knowledge of different things. knowing one set of tools and one area (e.g. "web development in javascript," or "system administration in python,") gives you the framework to expand later, and the truth is that you'll be able to learn additional things more easily once you have a framework to build upon.

  • when learning something in programming, always start with a goal. have some piece of data that you want to explore or visualize, have a set of files that you want to organize, or something that you want to accomplish. learning how to program without a goal means that you don't end up asking the kinds of questions that you need to form the right kinds of associations.

  • programming is all about learning different things: if you end up programming for long enough you'll end up learning different languages, and being able to pick up new things is the real skill.

  • being able to clearly capture what you were thinking when you write code is basically a programming super power.

  • programming is about understanding problems [1] abstractly and building abstractions around various kinds of problems. being able to break apart these problems into smaller core issues, and thinking abstractly about the problem so that you can solve both the problem in front of you and the versions of it you'll meet in the future, are crucial skills.

  • collect questions and curiosities as you encounter them, and use them to guide your background reading, but don't feel like you have to understand everything, or that you have to immediately hunt down the answer to every unfamiliar term you hear or see. if you're pretty rigorous about going back and looking things up, you'll get a pretty good base of knowledge over time.

  • always think about the users of your software as you build, at every level. even if you're building software for your own use, think about the future version of yourself that will use that software, imagine that other people might use the interfaces and functions that you write and think about what assumptions they might bring to the table. think about the output that your program, script, or function produces, and how someone would interact with that output.

  • think about the function as the fundamental building block of your software. lower level forms (i.e. statements) are required, but functions are the unit where meaning is created in the context of a program. functions (or methods, depending) take input (arguments, usually, but sometimes also an object in the case of methods) and produce some output, sometimes with some kind of side-effect. the best functions:

    • clearly indicate side-effects when possible.
    • have a mechanism for reporting on error conditions (exceptions, return values,)
    • avoid dependencies on external state, beyond what is passed as arguments.
    • are as short as possible.
    • use names that clearly describe the behavior and operations of the function.

    if programming were human language (english,) you'd strive to construct functions that were simple sentences and not paragraphs, but also more than a couple of words or phrases, and you'd want these sentences to be clear to understand with limited context. if you have good functions, interfaces are more clear and easier to use, and code becomes easier to read, debug, and test. (there's a small sketch of these qualities after the end of this list.)

  • avoid being too weird. many programmers are total nerds, and you may be too, and that's ok, but it's easier to learn how to do something if there's prior art that you can learn from and copy. on a day-to-day basis, a lot of programming work is just doing something until you get stuck and then googling for the answer. if you're doing something weird--using a programming language that's less widely used, or working in a problem space that's a bit out of the mainstream--it can be harder to find answers to your problems.
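to make the earlier list of function qualities concrete, here's a tiny sketch (the function is hypothetical): short, clearly named, no dependence on external state, and explicit about errors.

    def count_lines(path: str) -> int:
        """return the number of lines in the file at path.

        raises FileNotFoundError if the path doesn't exist: an explicit
        error instead of a silent sentinel value."""
        with open(path, encoding="utf-8") as f:
            return sum(1 for _ in f)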

Notes

[1]I use the term "problem" to cover both things like "connecting two components internally" and also more broadly "satisfying a requirement for users," and programming often includes both of these kinds of work.

Test Multi-Execution

Editorial Note: this is a follow up to my earlier Principles of Test Oriented Software Development post.

In software development, we write tests to make sure the code we write does what we want it to do. Great: this is pretty easy to get behind.

Tests sometimes fail.

The goal is that, most of the time, when tests fail it's because the code is broken: you fix the code and the test passes. Sometimes when tests fail there's a bug in the test: it makes an assertion that can't or shouldn't be true. These failures are bad because they mean the test is broken, but all code has bugs, and test code can be broken too, so that's fine.

Ideally tests either pass or fail, and if a test fails, it fails repeatedly and with the same error. Unfortunately this is not always true: tests can fail intermittently if they test something that can change, or if the outcome of the test is impacted by some external factor like "the test passes if the processor is very fast and the system does not have IO contention, but fails sometimes as the system slows down." Sometimes tests include (intentionally or not) some notion of randomness, and fail intermittently because of it.

A test suite with intermittent failures is basically the worst. A suite that never fails isn't super valuable, because it probably builds false confidence; a test suite that always fails isn't useful because developers will ignore the results or disable the tests; but a test that fails intermittently, particularly one that fails 10 or 20 percent of the time, means that developers will always have to look at the test, or will just rerun it until it passes.

There are a couple of things you can do to fix your tests:

  • write better tests: find sources of non-determinism in your test and rewrite tests to avoid these kinds of "flaky" outcomes. Sometimes this means restructuring your tests in a more "pyramid-like" structure, with more unit tests and fewer integration tests (which are likely to be less deterministic.)
  • run tests more reliably: find ways of running your test suite that produce more consistent results. This means running tests in more isolated environments, changing the amount of test parallelism, and ensuring that tests clean up their environment before they run and are as logically isolated as possible.

But it's hard to find these tests, and you can end up playing whack-a-mole with dodgy tests for a long time, and the urge to just run the tests a second (or third) time to get them to pass, so you can merge your change and move on with your work, is tempting. This leaves:

  • run tests multiple times: so that a test doesn't pass until it passes multiple times. Many test runners have some kind of repeated execution mode, and if you can combine that with some kind of "stop executing after the first fail" option, this can be reasonably efficient. Use multiple execution to force tests to produce more reliable results, rather than to cover up or exacerbate the flakiness. (There's a small harness sketch after this list.)
  • run fewer tests: it's great to have a regression suite, but if you have unreliable tests, and you can't use the multi-execution hack to smoke out your bad tests, then running a really full matrix of tests is just going to produce more failures, which means you'll spend more of your time looking at tests, in non-systematic ways, which is unlikely to actually improve the code.
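As a sketch of the multi-execution approach, here's a minimal harness, assuming pytest is the runner (the test path is illustrative): it runs a test selection repeatedly and stops on the first failure, so a flaky test gets caught before it's merged.

    import subprocess
    import sys

    RUNS = 20  # how many consecutive passes we require

    for i in range(RUNS):
        # -x stops pytest after its first failing test
        result = subprocess.run(["pytest", "-x", "tests/test_worker.py"])
        if result.returncode != 0:
            print(f"failed on run {i + 1} of {RUNS}")
            sys.exit(result.returncode)
    print(f"passed {RUNS} consecutive runs")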

Principles of Test Oriented Software Development

I want to like test-driven-development (TDD), but realistically it's not something that I ever actually do. Part of this is because TDD, as canonically described, is really hard to actually practice: TDD involves writing tests before writing code, writing tests which must fail before the implementation is complete or correct, and then using the tests to refactor the code. It's a nice idea, and it definitely leads to better test coverage, but the methodology forces you to iterate inefficiently on aspects of a design, and is rarely viable when extending existing code bases. Therefore, I'd like to propose a less-dogmatic alternative: test-oriented-development. [1]

I think, in practice, this largely aligns with the way that people write software, and so test oriented development does not describe a new way of writing code or of writing tests, but rather describes the strategies we use to ensure that the code we write is well tested and testable. Also, I think providing these strategies in a single concrete form will present a reasonable and pragmatic alternative to TDD that will make the aim of "developing more well tested software" more achievable.

  1. Make state explicit. This is good practice for all kinds of development, but generally: don't put data in global variables, and pass as much state (configuration, services, etc.) into functions and classes as possible, rather than "magicking" it in.
  2. Methods and functions should be functional. I don't tend to think of myself as a functional programmer, as my tendencies are not very ideological in this regard, but generally having a functional approach simplifies a lot of decisions and makes it easy to test systems at multiple layers.
  3. Most code should be internal and encapsulated. Packages and types with large numbers of exported or public methods should be a warning sign. The best kinds of tests can provide all desired coverage by testing the interfaces themselves.
  4. Write few, simple tests, and vary the data passed to those tests. This is essentially a form of "table driven testing," where you write a small number of simple test bodies and run them against a table of cases. Having test infrastructure that allows this kind of flexibility is a great technique. (There's a sketch of this below.)
  5. Begin writing tests as soon as possible. Orthodox TDD suggests that you should start writing tests first, and I think that this is probably one of the reasons that TDD is so hard to adopt. It's also probably the case that orthodox TDD emerged when prototyping was considerably harder than it is today, and as a result TDD just feels like friction, because it's difficult to plan implementations in a test-first sort of way. Having said that, start writing tests as soon as possible.
  6. Experiment in tests. Somehow, I've managed to write a lot of code without working an interactive debugger into my day-to-day routine, which means I do a lot of debugging by reading code, and also by writing tests to try and replicate production phenomena in more isolated settings. Writing and running tests in a system is a great way to learn about it.
[1]Sorry that this doesn't lead to a better acronym.
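As a sketch of table-driven testing in this style (assuming pytest; normalize_name is a trivial stand-in for real code under test): one simple test body, with the cases supplied as data.

    import pytest

    def normalize_name(raw: str) -> str:
        # stand-in for the code under test
        return raw.strip().lower()

    @pytest.mark.parametrize("raw,expected", [
        ("  Alice ", "alice"),
        ("BOB", "bob"),
        ("", ""),
    ])
    def test_normalize_name(raw, expected):
        assert normalize_name(raw) == expected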

Distributed Systems Problems and Strategies

At a certain scale, most applications end up having to contend with a class of "distributed systems" problems: when a single computer or a single copy of an application can't support the required throughput of an application there's not much to do except to distribute it, and therein lies the problem. Taking one of a thing and making many of the thing operate similarly can be really fascinating, and frankly empowering. At some point, all systems become distributed in some way, to a greater or lesser extent. While the underlying problems and strategies are simple enough, distributed systems-type bugs can be gnarly and having some framework for thinking about these kinds of systems and architectures can be useful, or even essential, when writing any kind of software.

Concerns

Application State

Applications all have some kind of internal state: configuration, runtime settings, in addition to whatever happens in memory as a result of running the application. When you have more than one copy of a single logical application, you have to put state somewhere. That somewhere is usually a database, but it can be another service or in some kind of shared file resource (e.g. NFS or blob storage like S3.)

The challenge is not "where to put the state," because it probably doesn't matter much, but rather organizing the application to remove the assumption that state can be stored in the application. This often means avoiding caching data in global variables and avoiding storing data locally on the filesystem, but there are a host of ways in which application state can get stuck or captured, and the fix is generally "ensure this data is always read out of some centralized and authoritative service," while making sure that any locally cached data is refreshed regularly and saved centrally when needed.

In general, better state management within applications makes code better regardless of how distributed the system is, and when we use the "turn it off and turn it back on" fix, we're really just clearing out some bit of application state that got stuck during the runtime of a piece of software.

Startup and Shutdown

Process creation and initialization, as well as shutdown, is difficult in distributed systems. While most configuration and state is probably stored in some remote service (like a database,) there's a bootstrapping process where each process gets enough local configuration to fetch the rest of its configuration from the central service and start up, and this can be a bit delicate.

Shutdown has its own set of problems, as specific processes need to be able to complete, or safely abort, in-progress operations.

For request driven work (i.e. HTTP or RPC APIs) without stateful or long-running requests (e.g. many websockets and most streaming connections), applications have to stop accepting new connections and let all in-progress requests complete before terminating (a small sketch of this follows). For other kinds of work, the process has to either complete in-progress work or provide some kind of "checkpointing" approach so that another process can pick up the work later.
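Here's a minimal sketch of that drain-then-exit pattern (the next_request and handle functions are placeholders): on SIGTERM the loop stops accepting new work and exits once in-progress work completes.

    import signal
    import threading

    shutting_down = threading.Event()

    def handle_sigterm(signum, frame):
        # stop accepting new work; in-flight work is allowed to finish
        shutting_down.set()

    signal.signal(signal.SIGTERM, handle_sigterm)

    def serve(next_request, handle):
        while not shutting_down.is_set():
            request = next_request()  # placeholder: accept one request
            handle(request)           # placeholder: process it
        # (a real server would also need to unblock next_request on shutdown)
        # loop exited: nothing in flight remains, safe to terminate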

Horizontal Scalability

Horizontal scalability, being able to increase the capability of an application by adding more instances of the application rather than increasing the resources allotted to the application itself, is one of the reasons that we build distributed systems in the first place, [1] but simply being able to run multiple copies of the application at once isn't always enough: the application needs to be able to effectively distribute its workloads. For request driven work this is generally some kind of load balancing layer or strategy, and for other kinds of workloads you need some way to distribute work across the application.

There are lots of different ways to provide load balancing, and a lot depends on your application and clients. There is specialized software (and even hardware) that provides load balancing by sitting "in front of" the application and routing requests to a backend, but there are also a collection of client-side solutions that work quite well. The complexity of load balancing solutions varies a lot: some approaches just distribute requests "evenly" (by number) across the backends one-by-one ("round-robin"), and some attempt to distribute requests more "fairly" based on some reporting from each backend or an analysis of the incoming requests; the strategy here depends a lot on the requirements of the application or service.
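A toy client-side round-robin balancer, to make the simplest strategy concrete (the backend addresses are made up): each call hands out the next backend in rotation.

    import itertools

    backends = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
    rotation = itertools.cycle(backends)

    def pick_backend() -> str:
        # round-robin: requests are spread evenly by count, not by load
        return next(rotation)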

For workloads that aren't request driven, systems require some mechanism for distributing work to workers, usually with some kind of messaging system, though it's possible to get pretty far using just a normal general purpose database to store pending work. The options for managing, ordering, and distributing the work are the meat of the problem.

[1]In most cases, some increase in reliability, by adding redundancy, is a strong secondary motivation.

Challenges

When thinking about system design or architecture, I tend to start with the following questions.

  • how does the system handle intermittent failures of particular components?
  • what kind of downtime is acceptable for any given component? for the system as a whole?
  • how do operations time out and get terminated, and how do clients handle these kinds of failures?
  • what are the tolerances for the application in terms of latency of various kinds of operations, and also the tolerances for "missing" or "duplicating" an operation?
  • when (any single) node or component of the system aborts or restarts abruptly, how does the application/service respond? Does work resume or abort safely?
  • what level of manual intervention is acceptable? Does the system need to handle node failure autonomously? If so, how many node failures?

Concepts like "node" or "component" or "operation," can mean different things in different systems, and I use the terms somewhat vaguely as a result. These general factors and questions apply to systems that have monolithic architectures (i.e. many copies of a single type of process which performs many functions,) and service-based architectures (i.e. many different processes performing specialized functions.)

Solutions

Ignore the Problem, For Now

Many applications run in a distributed fashion while only really addressing parts of their relevant distributed systems problems, and in practice it works out ok. Applications may store most of their data in a database, but have some configuration files that are stored locally: this is annoying, and sometimes an out-of-sync file can lead to some unexpected behavior. Applications may have distributed application servers for all request-driven workloads, but may still have a separate single process that does some kind of coordinated background work, or run cron jobs.

Ignoring the problem isn't always the best solution in the long term, but making sure that everything is distributed (or able to be distributed) isn't always the best use of time either, and depending on the specific application it can work out fine. The important thing isn't to distribute everything in all cases, but to make it possible to distribute functions in response to needs: in some ways I think about this as the "just in time" approach.

Federation

Federated architectures manage distributed systems protocols at a higher level: rather than assembling a large distributed system, build very small systems that can communicate at a high level using some kind of established protocol. The best example of a federated system is probably email, though there are others. [2]

Federated systems have more complex protocols that have to be specification based, which can be complicated/difficult to build. Also, federated services have to maintain the ability to interoperate with previous versions and even sometimes non-compliant services, which can be difficult to maintain. Federated systems also end up pushing a lot of the user experience into the clients, which can make it hard to control this aspect of the system.

On the upside, specific implementations and instances of a federated service can be quite simple, straightforward, and lightweight. Supporting email for a few users (or even a few hundred) is a much more tractable problem than supporting email for many millions of users.

[2]xmpp, the protocol behind jabber, which powered/powers many IM systems, is another federated example, and the fediverse points to others. I also suspect that some federation-like features will be used at the infrastructure layer to coordinate between constrained elements (e.g. multiple k8s clusters using federation for coordination, and maybe multi-cloud/multi-region orchestration as well...)

Distributed Locks

Needing some kind of lock (for mutual exclusion, or mutex) is common enough in programming, and locks provide an easy way to ensure that only a single actor has access to a specific resource. Doing this within a single process involves using kernel primitives (futexes) or programming language runtime implementations, and is simple to conceptualize. While the concept in a distributed system is functionally the same, the implementation of distributed locks is more complicated and necessarily slower (both the locks themselves, and their impact on the system as a whole).

All locks, local or distributed can be difficult to use correctly: the lock must be acquired before using the resource, and it must fully protect the resource, without protecting too much and having a large portion of functionality require the lock. So while locks are required sometimes, and conceptually simple, using them correctly is hard. With that disclaimer, to work, distributed locks require: [3]

  • some concept of an owner, which must be sufficiently specific (hostname, process identifier,) but that should be sufficiently unique to protect against process restarts, host renaming and collision.
  • lock status (locked/unlocked), and if the lock has different modes, such as a multi-reader/single-writer lock, that status as well.
  • a timeout or similar mechanism to prevent deadlocks: if the actor holding a lock halts or becomes inaccessible, the lock is eventually released.
  • versioning, to prevent stale actors from modifying the same lock. In the case that actor-1 has a lock and stalls for longer than the timeout period, such that actor-2 gains the lock, when actor-1 runs again it must know that it's been usurped.

Not all distributed systems require distributed locks, and in most cases transactions in the data layer provide most of the isolation that you might need from a distributed lock, but it's a useful concept to have. (A minimal sketch of these requirements follows the note below.)

[3]This article about distributed locks in redis was helpful in summarizing the principles for me.
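Here's a minimal single-instance sketch along the lines of the redis pattern from the cited article (assuming the redis-py client; this is not a full Redlock implementation): the token is the owner identity, the expiry provides the timeout, and the guarded release protects against a stale actor.

    import uuid
    import redis

    r = redis.Redis()

    def acquire(name: str, ttl_ms: int = 30_000) -> str | None:
        token = uuid.uuid4().hex  # unique owner identity
        # nx=True: set only if the key is absent; px: expire to avoid deadlock
        if r.set(name, token, nx=True, px=ttl_ms):
            return token
        return None

    # release only if we still own the lock; the check-and-delete is atomic
    # because it runs as a single Lua script on the server
    RELEASE = """
    if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
    end
    return 0
    """

    def release(name: str, token: str) -> bool:
        return bool(r.eval(RELEASE, 1, name, token))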

Duplicate Work (Idempotency)

For a lot of operations, in big systems, duplicating some work is easier and ultimately faster than coordinating and isolating that work in a single location. For this, having idempotent operations [4] is useful. Some kinds of operations and systems make idempotency easier to implement, and in cases where the work is not naturally idempotent (e.g. as in data processing or transformation,) the operation can be made idempotent by attaching some kind of clock to the data or operation. [5]

Using clocks and idempotency makes it possible to maintain data consistency without locks. At the same time, some of the same considerations apply: having all operations duplicated is difficult to scale, so having ways for operations to abort early can be useful. (There's a small sketch of a clock-guarded update after the notes below.)

[4]An operation is idempotent if it can be performed more than once without changing the outcome. For instance, the operation "increment the value by 10" is not idempotent because it increments a value every time it runs, so running the operation once is different than running it twice. At the same time the operation "set the value to 10" is idempotent, because the value is always 10 at the end of the operation.
[5]Clocks can take the form of a "last modified timestamp," or some kind of versioning integer associated with a record. Operations can check their local state against a canonical record, and abort if their data is out of date.
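A small sketch of a clock-guarded update (the record store here is just a dict; a real system would need the check-and-set to be atomic, e.g. via a transaction or compare-and-swap): a stale actor aborts rather than clobbering newer data.

    from dataclasses import dataclass

    @dataclass
    class Record:
        value: str
        version: int  # the "clock": a monotonically increasing integer

    def apply_update(store: dict, key: str, new_value: str,
                     expected_version: int) -> bool:
        record = store[key]
        if record.version != expected_version:
            return False  # our copy is stale: abort early
        store[key] = Record(new_value, expected_version + 1)
        return True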

Consensus Protocols

Some operations can't be effectively distributed, but are also not safe to duplicate. Applications can use consensus protocols to do "leader election," to ensure that there's only one node "in charge" at a time. This is common in database systems, where "single leader" systems are useful for balancing write performance in a distributed context. Consensus protocols have some amount of overhead, and are good for systems of small to moderate size, because all elements of the system must communicate with all other nodes in the system.

The two prevailing consensus protocols are Paxos and Raft--pardoning the oversimplification here--with Raft being a simpler and easier to implement realization of the same underlying principles. I've characterized consensus as being about leader election, though you can use these protocols to allow a distributed system to reach agreement on any manner of operation or shared state.

Queues

Building a fully generalized distributed application with consensus is a very lofty proposition, and commonly beyond the scope of most applications. If you can characterize the work of your system as discrete units of work (tasks or jobs,) and can build or access a queue mechanism within your application that supports workers in multiple processes, this might be enough to support a great deal of your application's distributed requirements.

Once you have reliable mechanisms and abstractions for distributing work to a queue, scaling the system can be managed outside of the application by using different backing systems, or changing the dispatching layer, and queue optimization is pretty well understood. There are lots of different ways to schedule and distribute queued work, but perhaps this is beyond the scope of this article.

I wrote one of these, amboy, but things like gearman and celery do this as well, and many of these tools are built on messaging systems like Kafka or AMQP, or just use general purpose databases as a backend. Keeping a solid abstraction between the application's queue and the messaging system seems good, but a lot depends on your application's workload. (A toy sketch of the database-as-queue idea follows.)
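As a toy sketch of the "general purpose database as a backend" idea (everything here is illustrative; a real system would claim jobs with a transactional update): workers atomically claim one pending job at a time, so no two workers run the same job.

    import threading

    jobs = [{"id": 1, "state": "pending"}, {"id": 2, "state": "pending"}]
    claim_lock = threading.Lock()  # stands in for a database transaction

    def claim_next_job():
        with claim_lock:
            for job in jobs:
                if job["state"] == "pending":
                    job["state"] = "running"  # mark claimed; others skip it
                    return job
        return None  # no pending work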

Delegate to Upstream Services

While there are distributed system problems that applications must solve for themselves, in most cases no bespoke solution is required! In practice many applications centralize a lot of their concerns in trusted systems like databases, messaging systems, or lock servers. This is probably correct! While distributed systems are required in most cases, distributed systems themselves are rarely the core feature of an application, and it makes sense to delegate these problems to services that are focused on solving them.

While multiple external services can increase the overall operational complexity of an application, implementing your own distributed system fundamentals can be quite expensive (in terms of developer time) and error prone, so it's generally a reasonable trade-off.

Conclusion

I hope this was as useful for you all as it has been fun for me to write!