Observation at Scale

I wrote this piece about monitoring in software, particularly web applications and similar systems, about a year ago and then forgot about it. I found it while cleaning up recently, and I mostly still agree with the point. Enjoy!

It is almost always the case that writing software that does what you want it to do is the easy part and everything else is the hard part.

As your software does more, a number of common features emerge:

  • other people are responsible for operating your software.
  • multiple instances of the program are running at once, on different computers.
  • you may not be able to connect to all of the running instances of the program when something goes wrong.
  • people will observe behaviors that you don't expect and that you won't be able to understand by observing the program's inputs or outputs.

There are many things you can do to take a good proof-of-concept program and turn it into a production-ready program, but I think logging and introspection abilities are among the most powerful: they give you the most bang for your buck, as it were. Observability (monitoring) is also an area of software development that's getting a lot of attention and thought at the moment.

While your application can have its own internal reporting system, it's almost always easier to collect data in logs first.

Aggregate Logs

Conventionally, operators and developers interact with logs using standard Unix stream processing tools: tail, less, and grep, and sometimes wc, awk, and sed. This is great when you have one process (or a small number of processes) running on one machine. When applications get bigger, stream processing begins to break down.

Mostly you can't stream process because of volume: there's too much data, it's hard to justify spending disk space on logs on all of your application servers, and there's too much to look at and do useful things with. It's also true that once you have multiple machines, it's really helpful to be able to look at all of the logs in a single place.

At the lowest level, the syslog protocol and associated infrastructure solve this problem by providing a common way for services to send log data over a network (UDP, etc.). It works, but you still only have stream processing tools, which may be fine, depending on your use case and users.
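As a rough sketch of what sending log data over the network looks like from inside an application, here is one way to do it with Python's standard library `SysLogHandler`, which uses UDP by default. The function name, the message format, and the collector address are assumptions made for illustration; a real deployment would point at whatever syslog collector you actually run.

```python
import logging
import logging.handlers


def make_syslog_logger(name, host="localhost", port=514):
    """Return a logger that forwards records to a syslog collector over UDP.

    The host and port are placeholders: substitute your own collector.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Given a (host, port) tuple and no socktype, SysLogHandler sends
    # each record as a UDP datagram.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Calling `make_syslog_logger("my-app").info("request finished")` then emits one datagram per message. Because this is UDP, delivery is not guaranteed, which is one of the trade-offs of going this route.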

Additionally, there are services and applications that solve this problem: Splunk (commercial/enterprise software), Sumo Logic (commercial/cloud software), and the ELK stack (an amalgamation of open source tools). These provide really powerful ways to do log search and reporting, and even to build visualizations. There are probably others as well.

Use them.

Structure Logs

The most common interview question that my colleagues give to systems administrators is a "log sawing" question. This seems pretty standard, and is a common exercise in parsing information out of well-known streams of log data: "find a running average request time," say, or "figure out the request rate."

The hard part is that most logs in this kind of exercise are unstructured, in the sense that they are just line-wise printed strings, so the work is in figuring out the structure of the messages, parsing data from the strings, and then tracking data over the course of the logs. It's a common exercise and definitely a thing that you have to do, but it's also totally infuriating and basically impossible to generalize.
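To make the pain concrete, here is a small sketch of a log sawing answer that computes an average request time. The line format and the names `LINE` and `average_request_time` are invented for this example; every real log has its own format, which is exactly why this kind of code never generalizes.

```python
import re

# Hypothetical line format, made up for illustration:
#   2019-03-01T12:00:00Z GET /users/42 200 took 0.120s
LINE = re.compile(
    r"^(?P<ts>\S+) (?P<method>[A-Z]+) (?P<path>\S+) "
    r"(?P<status>\d{3}) took (?P<secs>\d+\.\d+)s$"
)


def average_request_time(lines):
    """Return the mean request time in seconds, or None if nothing parsed."""
    total = 0.0
    count = 0
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            # Lines that don't fit the guessed format are skipped, so a
            # small change to the message silently changes the result.
            continue
        total += float(match.group("secs"))
        count += 1
    return total / count if count else None
```

Note how much of the function is about recovering structure that the program already had when it printed the line.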

If you're writing software, don't make your users do this kind of thing. Capture events (log messages) in your program and output them with the information already parsed. The easiest way is to make your log messages mapping types, and then write them out in JSON, but there are other options.

In short, construct your log messages so that they're easy for other tools to consume: strongly (and stably) type your messages, and provide an easy way to group and filter similar messages. Report operations in reasonable units (e.g. seconds rather than nanoseconds) to avoid complex calculations during processing, and think about how a given data point would be interesting to track over time.
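One possible shape for this, sketched with Python's standard `logging` and `json` modules: each message is a mapping, written out as one JSON object per line. The `JSONFormatter` class, the `log_request` helper, the `fields` convention, and the key names are all assumptions for the sake of the example, not an established API.

```python
import json
import logging


class JSONFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "time": round(record.created, 3),  # Unix timestamp, in seconds
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),  # stable name, easy to filter on
        }
        # Assumed convention: callers pass extra={"fields": {...}} and
        # those key/value pairs are merged into the message as-is.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def log_request(logger, path, status, started, finished):
    """Log one finished request, with its duration already in seconds."""
    logger.info(
        "request-ended",
        extra={"fields": {
            "path": path,
            "status": status,
            "duration_secs": round(finished - started, 6),
        }},
    )
```

With messages in this form, "average request time" becomes a filter on `event` and a mean over `duration_secs`, in whatever tool you happen to be using.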

Annotate Events

Building on the power of structured logs, it's often useful to be able to determine the flow of traffic or operations through the system to make it possible to understand the drivers of different kinds of load, and the impact of certain kinds of traffic on overall performance. Because a single operation may impact multiple areas of the system, annotating messages appropriately makes it possible to draw more concrete conclusions based on the data you collect.

For example, when a client makes a user request for data, your system probably has a request-started and a request-ended event. In addition, this operation may retrieve data, do some application-level manipulation, modify other data, and then return it to the user. If there's any logging between the start and end of a request, then it's useful to tie these specific events together, and annotations can help.

Unlike other observability strategies, there's not a single software feature that you can use to annotate messages once you have structured logging capabilities, although the ability of your logging system to have some kind of middleware to inject annotations is quite useful.
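As one illustration of middleware-style annotation, the sketch below uses a `logging.Filter` together with `contextvars` from the Python standard library to stamp every message logged during a request with the same identifier. The names `request_id`, `RequestAnnotator`, and `handle_request` are hypothetical, and this is only one of several ways to thread an annotation through a system.

```python
import contextvars
import logging
import uuid

# Hypothetical name: holds the identifier of the request being handled.
request_id = contextvars.ContextVar("request_id", default=None)


class RequestAnnotator(logging.Filter):
    """Attach the current request ID to every record that passes through."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True  # never drop records, only annotate them


def handle_request(logger, work):
    """Run `work` between request-started and request-ended events."""
    token = request_id.set(uuid.uuid4().hex)
    try:
        logger.info("request-started")
        result = work()
        logger.info("request-ended")
        return result
    finally:
        # Restore the previous value so later messages aren't mislabeled.
        request_id.reset(token)
```

After `logger.addFilter(RequestAnnotator())`, anything that `work` logs through the same logger carries the same `request_id` as the start and end events, so the whole operation can be grouped with a single filter.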

Collect Metrics

In addition to events produced by your system, it may be useful to have a background data collection thread to report on your application and system's resource utilization. Things like runtime resource utilization, garbage collector stats, and system IO/CPU/memory use can all be useful.
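A minimal sketch of such a background thread, using only the Python standard library. The function names, the metric keys, and the `fields` convention for attaching data to a log record are assumptions for illustration, and the `resource` module is only available on Unix-like systems.

```python
import gc
import logging
import resource  # Unix-only
import threading


def collect_metrics():
    """Gather a few process-level numbers as a plain mapping."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "cpu_user_secs": usage.ru_utime,
        "cpu_system_secs": usage.ru_stime,
        "max_rss": usage.ru_maxrss,  # kilobytes on Linux, bytes on macOS
        "gc_collections": [gen["collections"] for gen in gc.get_stats()],
        "threads": threading.active_count(),
    }


def start_metrics_thread(logger, interval_secs=10.0):
    """Log a metrics event every `interval_secs` until the event is set."""
    stop = threading.Event()

    def run():
        # Event.wait returns False on timeout and True once stop is set,
        # so this loop both sleeps and checks for shutdown.
        while not stop.wait(interval_secs):
            logger.info("metrics", extra={"fields": collect_metrics()})

    thread = threading.Thread(target=run, name="metrics", daemon=True)
    thread.start()
    return stop, thread
```

Calling `stop.set()` on the returned event ends the loop. Because the metrics go through the same logger as everything else, they land in the same place as the rest of your events.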

There are ways to collect this data via other means, and there are a host of observability tools that support this kind of metrics collection. But using multiple providers complicates actually using this data, and makes it harder to understand what's going on in the course of running a system. If your application is already reporting other stats, consider bundling these metrics into your existing approach.

By making your application responsible for system metrics you immediately increase the collaboration between the people working on development and operations, if such a divide exists.

Conclusion

In short:

  • collect more data,
  • increase the fidelity and richness of the data you collect,
  • aggregate potentially related data in the same systems to maximize value,
  • annotate messages to add value, and provide increasingly high level details.