A Common Failure

2020-07-03 – tychoish

I’ve been intermittently working on a common lisp library to produce a binary encoding of arbitrary objects, and I think I’m going to be abandoning the project. This is an explanation of that decision and an reflection on my experience.

Why Common Lisp?

First, some background. I’ve always thought that Common Lisp is a language with a bunch of cool features and selling points, but unsurprisingly, I’ve never really had the experience of writing more than some one-off bits of code in CL, which isn’t surprising. This project was a good experience for really digging into writing and managing a conceptually larger project which was a good kick in the pants to learn more.

The things I like:

the implementations of the core runtime are really robust and high quality, and make it possible to imagine running your code in a bunch of different contexts. Even though it’s a language with relatively few users, it feels robust in a way. The most common implementations also have ways of producing fully self contained static binaries (like Go, say), which makes the thought of distributing software seem reasonable.
quicklisp, a package/library management tool is new (in the last decade or so,) has really raised the level of the ecosystem. It’s not as complete as I’d expect in many ways, but quicklisp changed CL from something quaint to something that you could actually imagine using.
the object system is really nice. There isn’t quite compile time-type checking on the values of slots (attributes) of objects, though you can opt in. My general feeling is that I can pretty easily get the feeling of writing statically typed code with all of the freedom of writing dynamic code.
multiple dispatch, and the conceptual approach to genericism, is amazing and really simplifies flow control in a lot of cases. You implement the methods you need, for the appropriate types/objects and then just write the logic you need, and the function call machinery just does the right thing. There’s surprisingly little conditional logic, as a result.

Things I don’t like:

there are all sorts of things that don’t quite have existing libraries, and so I find myself wanting to do things that require more effort than necessary. This project to write a binary encoding tool would have been a component in service of a much larger project. It’d be nice if you could skip some of the lower level components, or didn’t have your design choices so constrained by gaps in infrastructure.
at the same time, the library ecosystem is pretty fractured and there are common tools around which there aren’t really consensus. Why are there so many half-finished YAML and JSON libraries? There are a bunch of HTTP server (!) implementations, but really you need 2 and not 5?
looping/iteration isn’t intuitive and difficult to get common patterns to work. The answer, in most cases is to use (map) with lambdas rather than loops, in most cases, but there’s this pitfall where you try and use a (loop) and really, that’s rarely the right answer.
implicit returns seem like an over sight, hilariously, Rust also makes this error. Implicit returns also make knowing what type a function or method returns quite hard to reason about.

Writing an Encoder

So the project I wrote was an attempt to write really object oriented code as a way to writing an object encoder to a JSON-like format. Simple enough, I had a good mental model of the format, and my general approach to doing any kind of binary format processing is to just write a crap ton of unit tests and work somewhat iteratively.

I had a lot of fun with the project, and it gave me a bunch of key experiences which make me feel comfortable saying that I’m able to write common lisp even if it’s not a language that I feel maximally comfortable in (yet?). The experiences that really helped included:

producing enough code to really have to think about how packaging and code organization worked. I’d written a function here and there, before, but never something where I needed to really understand and use the library/module/packaging (e.g. systems and libraries.) infrastructure.
writing and running lots of tests. I don’t always follow test-driven development closely, but writing lots of tests is part of my process, and being able to watch the layers of this code come together was a lot of fun and very instructive.
this project for me, was mostly about API design and it was nice to have a project that didn’t require much design in terms of the actual functionality, as object encoding is pretty straight forward.

From an educational perspective all of my goals were achieved.

Failure Mode

The problem is that it didn’t work out, in the final analysis. While the library I constructed was able to encode and decode objects and was internally correct, I never got it to produce encoding that other implementations of the same specification could reliably read, and the ability to read data encoded by other libraries only worked in trivial cases. In the end:

this was mostly a project designed to help me gain competence in a programming language I don’t really know, and in that I was successful.
adding this encoding format isn’t particularly useful to any project I’m thinking of working on in the short term, and doesn’t directly enable me to do anything in particular.
the architecture of the library would not be particularly performant in practice, as the encoding process didn’t deploy a buffer pool of any kind, and it would have been harder than not to back fill that in, and I wasn’t particularly interested in that.
it didn’t work, and the effort to debug the issue would be more substantive than I’m really interested in doing at this point, particularly given the limited utility.

So here we are. Onto a different project!

There's No Full Stack

2020-04-26 – tychoish

Software engineers use terms like “backend” and “frontend” to describe areas of expertice and focus, and the thought is that these terms map roughly onto “underlying systems and business logic” and “user interfaces.” The thought, is that these are different kinds of work and no person can really specialize on “everything.”

But it’s all about perspective. Software is build in layers: there are frontends and backends at almost every level, so the classification easily breaks down if you look at it too hard. It’s also the case that that logical features, from the perspective of the product and user, require the efforts of both disciplines. Often development organizations struggle to hand projects off between groups of front-end and back-end teams.¹

Backend/Frontend is also a poor way to organize work, as often it forces a needless boundary between people and teams wokring on related projects. Backend work (ususally) has to be completed first, and if that slips (or estimation is off) then the front end work has to happen in a crunch. Even if timing goes well, it’s difficult to maintain engineering continuity through the handoff and context is often lost in the process.

In response to splitting projects and teams into front and backend, engineers have developed this idea of “full stack” engineering. This typically means “integrated front end and backend development.” A noble approach: keep the same engineer on the project from start to finish, and avoid an awkward handoff or resetting context halfway through a project. Historic concerns about “front end and backend being in different languages” are reduced both by the advent of back-end javascript, and a realization that programmers often work in multiple languages.

While full stack sounds great, it’s a total lie. First, engineers by and large cannot maintain context on all aspects of a system, so boundaries end up appearing in different places. A full stack engineer might end up writing front end and the APIs on the backed that the front end depends on, but not the application logic that supports the feature. Or an engineer might focus only a very specific set of features, but not be able to branch out very broadly. Second, specialization is important for allowing engineers to focus and be productive, and while context switching projects between engineers, having engineers that must context switch regularly between different disciplines is bad for those engineers. In short you can’t just declare that engineers will be able to do it all.'

Some, particularly larger, teams and prodcuts can get around the issue entirely by dividing ownership and specialization along functional boundaries rather than by engineering discipline, but there can be real technical limitations, and getting a team to move to this kind of ownership model is super difficult. Therefore, I’d propose a different organization or a way of dividing projects and engineering that avoids both “frontend/backend” as well as the idea of “full stack”:

feature or product engineers, that focus on core functionality delivered to users. This includes UI, supporting backend APIs, and core functionality. The users of these teams are the users of the product. These jobs have the best parts of “full stack” type orientation, but draw an effective “lower” boundary of responsibility and allow feature-based specialization.
infrastructure or product platform engineers, that focus on deployment, operations and supporting internal APIs. These teams and engineers should see their users as feature and product engineers. These engineers should fall somewhere between “backend engineers,” and the “devops” and “sre” -type roles of the last decade, and cover the area “above” systems (e.g. not inclusive of machine management and access provisioning,) and below features.

This framework helps teams scale up as needs and requirements change: Feature teams can be divided and parallelized and focus in functionality slices, while, infrastructure teams divide easily into specialties (e.g. networking, storage, databases, internal libraries, queues, etc.) and along service boundaries. Teams are in a better position to handle continuity of projects, and engineers can maintain context and operate using more agile methods. I suspect that, if we look carefully, many organizations and teams have this kind of de facto organization, even if they use different kind of terminology.

Thoughts?

In truth this problem of coordination between frontend and backend teams is really that it forces a waterfall-like coordination between teams, which is always awkward. The problem isn’t that backend engineers can’t write frontend code, but that having different teams requires a handoff that is difficult to manage correctly, and around that handoff processes and management happens. ↩︎

Experimental Sweater Pattern

2019-12-24 – tychoish

I wrote this post nearly 5 years ago, and have been sitting on the draft for a long time: not for any reason, I think it’s actually pretty good post. For non-knitters, this is kind of a “ask a great cook for their comfort food recpie,” but in a narrative form.

In any case, I haven’t really been knitting very much recently, and while I enjoy writing knitting patterns there’s a lot of work in writing a well formed knitting pattern that I’m poorly positioned for ring now (test knitting! good photography! talking with knitters!) But, perhaps someone will find it useful… Enjoy!

Part of my recent return to knitting has been about taking a much more simple approach to yarn. I think yarn is cool and working with good yarn is awesome but at a certain point, I think yarn distracts from the things that I like most about knitting: the consistency, the dependability, the rhythm of the activity, and coordination of parallel activities.

Novel yarns and yarn variety actually makes the process of knitting less enjoyable for me. It also doesn’t really jive with my taste in clothing: I like plain things that fit well without a lot of adornment. While I enjoy knitting patterned sweaters for the rhythm, I don’t really wear them much. I also, live and spend my time in a climate where a I’m almost always wearing a light sweater (during the cold months) and inside during the rest of the months.

The result of this is that I’ve mostly been working on sock knitting. I like wearing wool socks, and after a period of not wearing them for a few reasons, I didn’t actually have that many wool socks. Which has lead me to get acclimated to knitting fingering weight yarn with size 0 needles.

So I want to make a sweater in this mold: fingering weight, very plain lines, probably knit in the round using the Elizabeth Zimmerman system. Starting from the bottom, I’ve been leaning away from ribbings at the bottom, and have tended to like hems though they sometimes flare. Ususally, I just cast on provisionally and add the hem (or whatever) at the end anyway. There’s time yet to decide.

More importantly, I’m quite interested in having a rolled collar for the neck, but I tend to think that rolled ends mostly have a flare look anyway. I can defer this decision for a while.

For shaping and even most of the styling I intend to copy the Chrome Cobra Zip Up, which is, by far, my favorite article of clothing.

I think really subtle increases (so that the sweater tapers to the waist) is a good feature and might choose to do some of those, particularly if the model sweater has them. The model sweater has a really long back, and I think I might moderate this slightly.

The shoulders are an open question. If this goes well, I think I’d like to knit my way though most of the standard EZ shoulder constructions: I think I’ve knit all of the options at least once, but I’ve not done all of them in plain knitting, and most of the sweaters are a bit odd in one way or another. There’s a long project. I want to start with, and hopefully master, the set-in-sleeve.

For those of you playing along at home, set-in-sleeves are probably what you think of as “normal” sleeves, the garment fits in the shoulders, and the sleeves angle gently down from the shoulders. Most shirts have this shaping but the shapes aren’t terribly natural for knitting.

To knit set in sleeves in the round, you join the sleeves to the body, setting some stitches aside where the pieces meet and then decreasing body snitches into the sleeves as you knit until the body is just as wide as the shoulders. Then decrease the sleeve stitches into the body until you have about 3 inches of sleeve stitches left. Finally, knit knit short rows across the front and back (or just the front) stitches, decreasing the remaining stitches in the short rows ending with a 3 needle bind-off at the appropriate moment. To get a good crew neck, begin shaping the front of the neck every row 1.5 inches before you start the shoulder short-rows, and shape the back of the neck every other row when you start the shoulder short rows.

The shaping and body of the knitting is pretty straightforward from design perspective. The hard part from the perspective of the success of the sweater is the hems and/or ribbing, and figuring out the right thing to for each hem. It’s always something.

What is it That You Do?

2019-12-24 – tychoish

The longer that I have this job, the more difficult it is to explain what I do. I say, “I’m a programmer,” and you’d think that I write code all day, but that doesn’t map onto what my days look like, and the longer it seems the less code I actually end up writing. I think the complexity of this seemingly simple question grows from the fact that building software involves a lot more than writing code, particularly as projects become more complex.

I’d venture to say that most code is written and maintained by one person, and typically used by a very small number of pepole (often on behalf of many more people,) though this is difficult to quantify. Single maintainer software is still software, and there are lots of interesting problems, but as much as anything else I’m interested in the problems adjacent to multi-author code-bases and multi-operator software development.¹

Fundamentally, I’m interested in the following questions:

How can (sometimes larger) groups of people collaborate to build something that’s bigger than the scope of any of their work?
How can we build software in a way that lets individual developers focus most of the time on the features and concerns that are the most important to them and their users.²

The software development process, regardless of the scope of the problem, has a number of different aspects:

Operations: How does is this software execute and how do we know that its successful when it runs?
Behavior: What does it do, and how do we ensure it has the correct behavior?
Interface: How will users interact with the process, and how do we ensure a consistent experience across versions and users' environment?
Product: Who are the users? What features do they want? Which features are the most important?

Sometimes we can address these questions by writing code, but often there’s a lot of talking to users, other developers, and other people who work in software development organizations (e.g. product managers, support, etc.) not to mention writing a lot of English (documentation, specs, and the like.)

I still don’t think that I’ve successfully answered the framing question, except to paint a large picture of what kinds of work goes into making software, and described some of my specific domain interests. This ends up boiling down to:

I write a lot of documents describing new features and improvements to our software. [product]
I think a lot about how our product works as it grows (scaling), and what kinds of changes we can make now to make that process more smooth. [operations]
How can I help the more junior members of my team focus on the aspects of their jobs that they enjoy the most, and help illustrate broader contexts to them. [mentoring]
How can we take the problems we’re solving today and build the solution that balances the immediate requirements with longer term maintainability and reuse. [operations/infrastructure]

The actual “what” I’m spending my time boils down to reading a bunch of code, meeting with my teamates, meeting with users (who are also coworkers.) And sometimes writing code. If I’m lucky.

I think the single-author and/or single-operator class is super interesting and valuable, particularly because it includes a lot of software outside of the conventional disciplinary boundaries of software and includes things like macros, spreadsheets, small scale database, and IT/operations (“scripting”) work. ↩︎
It’s very easy to spend most of your time as a developer writing infrastructure code of some sort, to address either internal concerns (logging, data management and modeling, integrating with services) or project/process automation (build, test, operations) concerns. Infrastructure isn’t bad, but it isn’t the same as working on product features. ↩︎

The Case for Better Build Systems

2019-07-20 – tychoish

A lot of my work, these days, focuses on figuring out how to improve how people develop software in ways that reduces the amount of time developers have to spend doing work outside of development and that improves the quality of their work. This post, has been sitting in my drafts folder for the last year, and does a good job of explaining how I locate my workand* makes a case for high quality generic build system tooling that I continue to feel is compelling.*

Incidentally, it turns out that I wrote an introductory post about buildsystems 6 years ago. Go past me.

Canonically, build systems described the steps required to produce artifacts, as system (graph) of dependencies¹ and these dependencies are between source files (code) and artifacts (programs and packages) with intermediate artifacts all in terms of the files they are or create. Though different development environments, programming languages, and kinds of software have different.

While the canonical “build systems are about producing files,” the truth is that the challenge of contemporary _software development isn’t really just about producing files. Everything from test automation to deployment is something that we can think about as a kind of build system problem.

Let’s unwind for a moment. The work of “making software,” breaks down into a collection of--reasonably disparate--tasks, which include:

collecting requirements (figuring out what people want,)
project planning (figuring out how to break larger collections of functionality into more reasonable units.)
writing new code in existing code bases.
exploring unfamiliar code and making changes.
writing tests for code you’ve recently written, or areas of the code base that have recently chaining.
rewriting existing code with functionally equivalent code (refactoring,)
fixing bugs discovered by users.
fixing bugs discovered by an automated test suite.
releasing software (deploying code.)

Within these tasks developers do a lot of little experiments and tests. Make a change, see what it’s impact is by doing something like compiling the code, running the program or running a test program. The goal, therefore, of the project of developer productivity projects is to automate these processes and shorten the time it takes to do any of these tasks. In particular the feedback loop between “making a change” and seeing if that change had an impact. The more complex the system that you’re developing, with regards to distinct components, dependencies, target platforms, compilation model, and integration’s, them the more time you spend in any one of these loops and the less productive you can be.

Build systems are uniquely positioned to suport the development process: they’re typically purpose built per-project (sharing common infrastructure,) most projects already have one, and they provide an ideal environment to provide the kind of incremental development of additional functionality and tooling. The model of build systems: the description of processes in terms of dependency graphs and the optimization for local environments means.

The problem, I think, is that build systems tend to be pretty terrible, or at least many suffer a number of classic flaws:

implicit assumptions about the build or development environment which make it difficult to start using.
unexpressed dependencies on services or systems that the build requires to be running in a certain configuration.
improperly configured dependency graphs which end up requiring repeated work.
incomplete expression of dependencies which require users to manually complete operations in particular orders.
poorly configured defaults which make for overly complex invocations for common tasks.
operations distributed among a collection of tools with little integration so that compilation, test automation, release automation, and other functions.

By improving the quality, correctness, and usability of build systems, we:

improve the experience for developers every day,
make it really easy to optimize basically every aspect of the development process,
reduce the friction for including new developers in a project’s development process.

I’m not saying “we need to spend more time writing build automation tools” (like make, ninja, cmake, and friends,) or that the existing ones are bad and hard to use (they, by and large are,) but that they’re the first and best hook we have into developer workflows. A high quality, trustable, tested, and easy to use build system for a project make development easier, continuous integration easy and maintainable, and ultimately improve the ability of developers to spend more of their time focusing on important problems.

ideally build systems describe directed acylcic graph, though many projects have buildsystems with cyclic dependency graphs that they ignore in some way. ↩︎

Observation at Scale

2019-07-19 – tychoish

I wrote this thing about monitoring in software, and monitoring in web applications (and similar) about a year ago, and I sort of forgot about it, but as I was cleaning up recently I found this, and think that I mostly still agree with the point. Enjoy!

It is almost always the case that writing software that does what you want it to do is the easy part and everything else is the hard part.

As your software does more a number of common features emerge:

other people are responsible for operating your software.
multiple instances of the program are running at once, on different computers.
you may not be able to connect to all of the running instances of the program when something goes wrong.
people will observe behaviors that you don’t expect and that you won’t be able to understand by observing the program’s inputs or outputs''

There are many things you can do to take a good proof of concept program and turn it into a production-ready program, but I think logging and introspection abilities are among the most powerful: they give you the most bang for your buck, as it were. It’s also true that observability (monitoring) is a hot area of software development that’s seeing a lot of development and thought at the moment.

While your application can have its own internal reporting system, its almost always easier to collect data in logs first rather than

Aggregate Logs

Conventionally operators and developers interact with logs using standard unix stream processing tools: tail, less, and grep and sometimes wc, awk, and sed. This is great when you have one (or a small number) process running on one machine. When applications get bigger, stream processing begins to break down.

Mostly you can’t stream process because of volume there’s too much data, it’s hard to justify spending disk space on all of your application servers on logs, and there’s too much of it to look at and do useful things. It’s also true that once you have multiple machines, its really helpful to be able to look at all of the logs in a single place.

At the lowest level the syslog protocol and associated infrastructure solves this problem by providing a common way for services to send log data via a network (UDP, etc.) It works but you still only have stream processing tools, which may be fine, depending on your use case and users.

Additionally there are services and applications that solve this problem: splunk (commercial/enterprise software ) sumologic (commercial/cloud software) and the ELK stack (an amalgamation of open source tools.) that provide really powerful ways to do log search, reporting, and even build visualizations. There are probably others as well.

Use them.

Structure Logs

The most common interview question for systems administrators that my colleagues give is a “log sawing” question. This seems pretty standard, and is a pretty common exercise for parsing information out of well known streams of log data. Like “find a running average request time,” or figure out the request rate.

The hard part is that most logs, in this example are unstructured in the sense that they are just line-wise printed strings, and so the exercise is in figuring out the structure of the messages, parsing data from the string, and then tracking data over the course of the logs. Common exercise, definitely a thing that you have to do, and also totally infuriating and basically impossible to generalize.

If you’re writing software, don’t make your users do this kind of thing. Capture events (log messages) in your program and output them with the information already parsed. The easiest way is to make your log messages mapping types, and then write them out in JASON, but there are other options.

In short, construct your log messages so that they’re easy to consume by other tools: strongly (and stably) type your messages, provide easy way to group and filter similar messages. Report operations in reasonable units (e.g. seconds rather than nanoseconds) to avoid complex calculations during processing, and think about how a given data point would beinteresting to track over time.

Annotate Events

Building on the power of structured logs, it’s often useful to be able to determine the flow of traffic or operations through the system to make it possible to understand the drivers of different kinds of load, and the impact of certain kinds of traffic on overall performance. Because a single operation may impact multiple areas of the system, annotating messages appropriately makes it possible to draw more concrete conclusions based on the data you collect.

For example when a client makes a user request for data, your system probably has a request-started and request-ended event. In addition this operation may retrieve data, do some application-level manipulation, modify other data, and then return it to the user. If there’s any logging between the start and end of a request, then it’s useful to tie these specific events together, and annotations can help.

Unlike other observability strategies, there’s not a single software feature that you can use to annotate messages once you have structured capabilities, although the ability of your logging systems to have some kind of middleware to inject annotations is quite useful.

Collect Metrics

In addition to events produced by your system, it may be useful to have a background data collection thread to report on your application and system’s resource utilization. Things like, runtime resource utilization, garbage collector stats, and system IO/CPU/Memory use can all be useful.

There are ways to collect this data via other means, and there are a host of observability tools that support this kind of metrics collection. But using multiple providers complicates actually using this data, and makes it harder to understand what’s going in the course of running a system. If your application is already reporting other stats, consider bundling these metrics in your existing approach.

By making your application responsible for system metrics you immediately increase the collaboration between the people working on development and operations, if such a divide exists.

Conclusion

In short:

collect more data,
increase the fidelity and richness of the data you collect,
aggregate potentially related data in the same systems to maximize value,
annotate messages to add value, and provide increasingly high level details.

Citation Practices in Informal Writing

2019-07-18 – tychoish

I’ve been thinking recently about the way that my interest in writing and publishing blog posts has waned, and while there are a lot of factors that have contributed to this, I think there’s a big part of me that questions what the purpose of writing is or should be, and because I mostly write about the things I’m learning or thinking about, my posts end up being off-the-cuff explanations of things I’ve learned or musings on a theoretical point which aren’t particularly well referenced, and while they’re fun to write and useful for my own process they’re not particularly useful to anyone else. Realizing this puts me at something of a crossroads with my own writing, and has me thinking a lot about the practice of citation.¹

Mechanically, citation anchors text in relationship to other work,² but it also allows discussion to happen in and between texts. Also, the convention for citation in the context of informal writing is a link or an informal reference, so it’s difficult to track over time, and hard to be systemtic in the way that one text interacts with its sources.

Blogs bring out confessional writing with ambling³ structure and the freedom to say just about anything, which I have found liberating and generally instructive, but it’s also limiting. For writing that comes out of personal experience, it’s difficult to extrapolate and contextualize your argument, or even to form an argument, particularly in the context of a blog where you’re writing a larger number of shorter pieces. It’s also probably true that by framing discussions in personal experience its hard for people with different experiences to relate to the content, and more importantly the concepts within.

I’m not arguing against journaling: journals are greatgreat, but sometimes, I think journals might be best unpublished. I’m also not arguing against the personal essay as a form: there are many topics that are well served by that genre of writing. I do want to think about what else is possible⁴ and how to write things that are stronger, more grounded, and easier to relate to and interact with. I think more citations and references are the key, but I’m left with two problems:

Style. There aren’t great conventions for referencing things in informal writing. Throwing a link in the right context works, and is clear, but it might not be enough as it’s hard to know what’s a citation-typed-reference and other kinds of links. Also links don’t hold up well over time. The more formal approaches are rooted in out of date technologies and tactics. Citations often reference page numbers, footnotes don’t often make sense in informal situations,⁵ and bibliography conventions are mostly non-existant.
Tooling. I’m pretty sure that well cited texts are well-cited, because their authors have great memories for things they’ve read, but because researchers often have tooling that supports managing a database of references, notes and bibliographic information. If you have a record of the resources you’ve read (or otherwise consumed), it becomes easier to pull out citations as you write and edit.⁶

Neither of these are insurmountable, but I think would require a good deal of work both on figuring out better citation formats and patterns, as well as developing better tooling. I don’t have answers yet, but I do want to think more about it, and probably play with writing some tools.

My initial intent was to sort of discuss the personal conflict, and reflect on the corpus of this site and consider my own growth as a writer, which might have been a fine way to tell this story, but it felt much more self indulgent, and I think probably makes my point by example better than I am here. ↩︎
Admittedly this should be cited. ↩︎
Limited, of course, by size. ↩︎
The problem is that I think there aren’t a lot of great examples or models to follow. I’ve been thinking about other kinds of writing (e.g. journalism, academic writing, and fiction,) for potential models. The academic writing and journalism are clearly the starting points for this post. ↩︎
My attempts to the contrary aside. Having said that I expect that ↩︎
Zotero is probably the most popular one. There are tools that allow you to maintain BibTeX files, with similar effects. The space is probably underdeveloped, and most tooling is targeted at researchers in specific fields. It’s unfortunately a difficult space to develop a compelling tool in because the technology is easy (so it’s easy to overengineer,) and there are just enough users (and different kinds of users) to make the interface/interaction design problems non trivial. ↩︎

Things I Learned About Queues

2018-09-11 – tychoish

I think the queue is a really powerful metaphor for organizing and orchestrating the internal architecture of an application. Once you have a queue, and tasks that are running in that queue, making that system run well requires some attention. This post is very much a sequel to the application framework post.

Factors of your queue implementation and system may impact the applicability of any of these suggestions for your particular application. Additionally, there is lots of work on queue theory so there are formally described properties of queues, and this is really just a collection of informal knowledge that I’ve collected on this subject. I hope you find it useful!

As the operator of a queue there are two properties: latency, or time to completion, for work in the queue and throughput, or total amount of work completed. These properties are generally trade-offs with each other, and often work to improve throughput will impact latency, and vice versa. It turns out, however, that the theoretical limits of your system’s capacity for either latency or throughput are below the actual requirements of your application, so you can generally just focus on improving one area or the other without really feeling like you’re trading latency for throughput.

All tasks in the queue should, generally, of similar size in terms of execution time and resource usage. When there are tasks that run slowly or take a long time and tasks that run quickly, you can easily end up in situations where long running tasks group together. Indeed, this isn’t just a possibility, but a near certainty. If you can’t break work into similar sized units, then you main recourse is to either separate the work into different queues and proportion resources as needed to ensure that both queues are making progress. You generally want to run longer tasks before shorter tasks, but the impact on overall performance depends on other characteristics and the way that your application expects certain kinds of latency and throughput.

Always monitor task runtime (by type,) as well as overall queue depth, and if possible currently running operations. When something goes wrong, or there’s an external event that impacts queue performance, you’ll need these data to understand the state of your world and debug the underlying issue. Don’t wait for something to go wrong to set this up.

Idempotentcy, or the ability of a task to run more than once without chaining the final outcome is a great property in any distributed system, but the more idempotent your operations are the less you have to worry about edge cases where you might run them more than once, particularly around process restarts and deployments. While you generally only want to run things once for efficiency sake, it’s important to be able to know that you can run things more than once without causing a crisis.

While it’s easy to think about the time a task spends waiting in a queue when tasks are ordered in the queue in a first-in-first-out model, other ordering mechanisms can easily lead to items getting stuck in the queue. Consider the following behaviors:

if there are dependencies between tasks, or cases where doing work leads to the creation of more tasks, always run these tasks earlier before other tasks.
consider boosting the relative priority of tasks that have been waiting longer relative to other tasks in the queue. If tasks have a priority, and new tasks come in that have higher priority than older tasks, then some lower priority tasks may never get done. While priority of tasks is important, if its important that all tasks get done, balance wait time with priority.
alternatively, consider elimiting tasks that have been waiting in the queue for longer than a specified period. These “TTL” for queue items can avoid wasting resources doing work that is not useful.
separate desperate priority or types of work into seperate queues to reduce latency. Having additional queues often incurs some per-queue/per-worker resource overhead. When worker infrastructure canbe shared between queues, and both queues are not consistently running at capacity (e.g. have backlogs).
quantify the job dispatching overhead. While generally breaking apart larger tasks into smaller execution units improves efficiency, if the overhead of creating, dispatching, and running jobs is a significant portion of a job’s runtime, then your system is probably spending too many resources on overhead and you can increase throughput by increasing the overall task size.

There’s more, but this definitely covers the basics.

Why Common Lisp?#

Writing an Encoder#

Failure Mode#

Aggregate Logs#

Structure Logs#

Annotate Events#

Collect Metrics#

Conclusion#