I spent some time yesterday evening dealing with a bug in some code I wrote/maintain, and I thought it would be a good exercise to just talk about the issue and my approach to problem solving. This is both an effort to demystify the act of programming and debugging and a brainstorming exercise. I explained a bunch of the background in an external page, to make the content a bit more accessible.

The System

To build a large, complex documentation site we use Sphinx as the core "compiler" to translate source into outputs like HTML and PDFs, but that is really just the core operation; there's a lot of other work: generating content, analyzing content to make sure that everything that needs to be recompiled is, and migrating content to staging and production environments.

Essentially the build happens in three parts:

  1. In the first step, we generate a bunch of content from structured sources. Then we have to assemble a dependency graph of all the content and update the timestamp (i.e. mtime) of all files that depend on files that have changed, so that Sphinx will actually rebuild the files that need it. (There's a sketch of this step after the list.)

    Most of this work happens in some sort of parallel execution model. I'll rant about concurrency in Python in a bit. Hold your questions.

  2. In the second step Sphinx runs. For testing purposes, this just builds one output (i.e. HTML), but our production builds need to build multiple output formats. When there are multiple builds, each build runs in its own thread, and the build itself executes as a sub-process. Sphinx's processing has a two-phase approach: there's a read phase and a write phase. The read phase happens in one linear execution, but writing files can happen in a process pool (if Sphinx needs to write a large number of files). There's a sketch of this thread/sub-process structure after the list.

  3. Finally we do a bunch of post-processing on the builds to move the files into the right places. This happens (typically) in a process pool that runs within the thread where the build executed.
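
To make the first step concrete, here's a minimal sketch of the timestamp-touching logic. The dependency mapping and file names are hypothetical, not taken from the real system; the point is just that bumping a file's mtime with os.utime() is enough to make Sphinx treat it as changed:

    import os
    from collections import defaultdict

    # Hypothetical dependency graph: maps a source file to the files that
    # include or otherwise depend on it.
    dependencies = defaultdict(set)
    dependencies["includes/shared-option.rst"].add("reference/program.rst")

    def touch_dependents(changed_files, dependencies):
        """Bump the mtime of every file that depends on a changed file,
        so that Sphinx's own timestamp comparison triggers a rebuild."""
        for changed in changed_files:
            for dependent in dependencies.get(changed, ()):
                # os.utime(path, None) sets atime/mtime to "now", the same
                # effect as running ``touch`` on the file.
                os.utime(dependent, None)

    # Hypothetical usage:
    # touch_dependents(["includes/shared-option.rst"], dependencies)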
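
And here's a rough sketch of the shape of the second and third steps: one thread per output format, with Sphinx running as a sub-process in that thread and post-processing running in a process pool owned by the thread. The builder names, paths, and postprocess() function are placeholders, and the blocking check_call() is exactly the kind of call that the "Attempts" section below replaces:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def postprocess(path):
        # Placeholder for the real post-processing (moving files into place).
        return path

    def run_build(builder, source_dir, build_dir):
        """Run one Sphinx build as a sub-process, then post-process its output."""
        output_dir = f"{build_dir}/{builder}"
        subprocess.check_call(["sphinx-build", "-b", builder, source_dir, output_dir])
        # Post-processing happens in a process pool that belongs to this thread.
        with ProcessPoolExecutor() as pool:
            list(pool.map(postprocess, [output_dir]))

    if __name__ == "__main__":
        builders = ["html", "latex", "man"]  # hypothetical set of output formats
        with ThreadPoolExecutor(max_workers=len(builders)) as threads:
            futures = [threads.submit(run_build, b, "source", "build") for b in builders]
            for future in futures:
                future.result()  # surface any build failures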

See my overview of concurrency, and rant on Python concurrency for some background.

The Problem

There's a deadlock. Sometimes, one of the Sphinx builds will just hang: it stops doing anything, produces no messages, and waits forever. (This is in step 2 above.) Everything else works fine.

Great.

The problem, or part of the problem, is that all of the parts of the system work fine in isolation. The code that starts the external process gets called extensively and works fine. The thread/process management code is also well exercised. In essence the question is:

"What could make code that works fine most of the time, fail oddly only sometimes."

As I think about it, I've observed intermittent deadlocks from time to time, but they've gotten much worse recently:

  • We weren't doing automated builds for a few weeks, for an unrelated reason, and the problem shows up more frequently in automated builds.
  • We moved the execution of the Sphinx process from process pools to thread pools: those operations were already spending most of their time yielding, so there was no need to double the overhead.
  • A few more operations moved to threads as well: again, mostly operations with lots of yielding, to reduce the overhead of processes, and in one case because the task was incompatible with processes (i.e. not picklable).

Attempts

  • Make sure that the output of the Sphinx build isn't filling up the output buffer that subprocess sets up. Write output to a temporary file rather than using the buffer. (see tempfile)
  • Use Popen.poll() to figure out when an operation is done, rather than using a blocking call which might prevent the thread from ever waking up. (There's a sketch of both changes after the list.)
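
Here's a minimal sketch of what those two changes look like together, assuming a plain sphinx-build invocation; the command, paths, and poll interval are placeholders rather than the real code. Writing stdout/stderr to a temporary file means the child process can never stall on a full pipe buffer, and polling means the calling thread retains control instead of parking in a blocking wait:

    import subprocess
    import tempfile
    import time

    def run_sphinx(args, poll_interval=1.0):
        """Run a command, spooling its output to a temporary file and
        polling for completion instead of blocking on wait()."""
        with tempfile.TemporaryFile() as output:
            # Sending stdout/stderr to a real file (not subprocess.PIPE) means
            # the child can't block because a pipe buffer filled up.
            proc = subprocess.Popen(args, stdout=output, stderr=subprocess.STDOUT)

            # poll() returns None while the process is running and the return
            # code once it exits, so the thread never blocks indefinitely.
            while proc.poll() is None:
                time.sleep(poll_interval)

            output.seek(0)
            return proc.returncode, output.read().decode(errors="replace")

    # Hypothetical usage:
    # code, log = run_sphinx(["sphinx-build", "-b", "html", "source", "build/html"])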

Current State

The bug is difficult to reproduce, but the above changes seem to reduce the incidence of the bug by a significant margin. My current theory is that there were two deadlocks: one in Sphinx itself that was pretty uncommon, and one in how we were calling Sphinx that was more common, and exacerbated by switching the pool that runs Sphinx from a process pool to a thread pool.

So the bug has probably been around for a while, but it was much more rare under the old execution model. And regardless, it was easily confused with another bug. Delightful.

Having resolved the bigger issue, this is no longer a real problem for our automated testing, so I consider it a (minor) success. Longer term, I think we need to streamline the execution and process management, but that's not crucial now, so we can continue to increment toward that solution.