Delegated Builds

Introduction

I cooked up something at work that I think is going to be awesome for building the project that I work on day to day. Here’s the basic problem (in a different post, I’ll expand on these points in more depth):

  • we do continuous deployment.
  • we maintain and publish multiple branches.
  • our builds take a non-trivial amount of time (4-6 minutes, depending on hardware) and will continue to get longer.
  • our documentation toolkit, Sphinx, lacks concurrency for some steps, which means builds take too long and leave most of a contemporary computer idle. Furthermore, given our use of topic branches, it can be hard to get work done during a build.

So there are a couple of notable hacks that I’ve come up with, over the past few months that help:

  • duplicate some of the initial work so that different output formats can build in parallel (using Make’s job control), at the expense of some disk space.
  • using a source proxy: copying the source content into the build directory and building from this copy, so that the actual source files can change during the build (see the sketch just after this list).
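
To illustrate the source-proxy idea, here’s a minimal sketch, assuming a layout with a source directory, a build directory, and a Makefile that accepts a SOURCEDIR override; none of these names reflect the actual project, they’re just placeholders.

#!/usr/bin/python
# Sketch of a "source proxy": copy the source tree into the build
# directory and run the build against the copy, so edits to the real
# source files can't affect a build that's already in progress.
# The "source" and "build/source-proxy" paths, and the SOURCEDIR
# make variable, are illustrative assumptions.

import shutil
import subprocess

SOURCE = "source"
PROXY = "build/source-proxy"

# Refresh the proxy copy. rmtree + copytree is crude but portable;
# rsync would be faster for incremental updates.
shutil.rmtree(PROXY, ignore_errors=True)
shutil.copytree(SOURCE, PROXY)

# Build from the copy rather than the live source tree.
subprocess.check_call(["make", "html", "SOURCEDIR=" + PROXY])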

These changes are simple and amount to some really minor changes to commands and Makefiles. The next fix required a non-trivial amount of code, but is really awesome:

  • building content, if it’s already committed, from a local checkout in the build directory. This way you can build (and publish!) from a different branch without doing anything to your current working directory. A rough sketch of the idea follows.
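
To give a flavor of the approach (this is a hedged sketch, not the actual script), you can keep a throwaway clone of the repository inside the build directory, sync it to the committed state of whatever branch you want, and run make there; the build/.checkouts layout and the html target are assumptions for the example.

#!/usr/bin/python
# Sketch: build a committed branch from a disposable checkout in the
# build directory, leaving the current working tree alone. The
# "build/.checkouts/<branch>" layout and the "html" target are
# illustrative, not the real script's conventions.

import os
import subprocess
import sys

branch = sys.argv[1]
repo = os.path.abspath(".")
checkout = os.path.join("build", ".checkouts", branch)

if not os.path.exists(checkout):
    # A local clone is cheap: git hard-links objects when it can.
    subprocess.check_call(["git", "clone", "--local", repo, checkout])

# Sync the checkout with the committed state of the branch.
subprocess.check_call(["git", "fetch", "origin"], cwd=checkout)
subprocess.check_call(["git", "checkout", branch], cwd=checkout)
subprocess.check_call(["git", "reset", "--hard", "origin/" + branch], cwd=checkout)

# Build (and potentially publish) from the checkout, not the working tree.
subprocess.check_call(["make", "html"], cwd=checkout)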

The Code

See the gist for a basic overview, and keep your eyes on the repositories:

Some implementation notes:

  • it needs a bit more clean up and configuration with regards to a few hard-coded directory names, and assumptions about projects.

  • in practice it should work fine with Python 2.7 and 3.0. If you have the backported argparse module for 2.6, that should work too.

  • this plugs in really nicely with some existing infrastructure: because we generate most of our Makefiles, it’s trivial to make this script smart and only permit sane things with regard to branch creation/management and build targets.

    Building on this, I’ve written up a separate script to generate makefile targets to invoke these commands, which allows the script to fit more nicely into the existing idiom. That’s not included here, yet.

  • this, so far, has been the best introduction I’ve gotten to the subprocess module, so perhaps this will be useful to you.

  • there’s no good way to queue builds in make, except to use blocking mode with a bunch of make calls, which is fairly inefficient. To get better at this, we’ll need to make some underlying build changes, but the gains could be pretty significant.

  • finally, this is the first bit of Python of a moderate amount of complexity that I’ve written without classes since I had the breakthrough where I finally understood classes. No harm done, and it’s not like there’s any internal state; at the same time, a bit of encapsulation around the interactions with git might be useful.

Pull requests and suggestions are always welcome.

More On Delegated Builds

I’ve written a bunch more about this problem and script and will be posting some of that very soon!

(Also it’s good to return to blogging/posting. Thanks for sticking around!)

git raspberry

This is an awful pun, but I’ve recently written the following script to help with some of the work of back-porting patch sets to maintenance branches. Basically, you pass it a bunch of commit identifiers and it cherry-picks them all in order.

#!/usr/bin/python

import os
import sys
import subprocess

# Cherry-pick each commit given on the command line, in order,
# discarding git's output.
for commit in sys.argv[1:]:
    with open(os.devnull, "w") as fnull:
        subprocess.call(
            ["git", "cherry-pick", commit],
            stdout=fnull,
            stderr=fnull,
        )

What I’d been doing previously is assembling commit hashes in an Emacs buffer, copy-pasting git cherry-pick before each line, and then pasting those lines into the shell and hoping nothing goes wrong. The script isn’t much better, but it’s a start.

To use it, save the script in a file named git-raspberry somewhere in your $PATH, chmod +x the file, and then just run “git raspberry” with the commit identifiers as arguments. It turns out git runs any program in the path whose name starts with git- as a subcommand.

The more you know.

New Knitting Project: Ballstown

I’ve started a new project, much to my own surprise. After many years of looking at the merino/tencel blend “colrain,” I ordered a cone of it and have cast on a project: a plain tube using size 0s.

I think I may be crazy.

The thing is, I got one of these neck tubes a month or two ago, and it’s the most amazing thing ever. Looks good with most things, not weird, very comfortable, etc.

So I’m making myself one…

I’m calling it “Ballstown” after a tune in the Sacred Harp of the same name. The tune is named after the town in the capital region of New York State, now known as “Ballston Spa.” Why? Because I cast on 217 stitches.

It turns out I’ve really rather missed plain knitting: the kind you can knit on for hours without really thinking about it, or knit on in the dark.

One of the reasons that I’ve not been knitting as much recently, other than available time, is that I’ve found it difficult to actually wear or use the things I knit. Sweaters, even finer-weight ones, are too warm to wear inside, and not windproof enough to keep me warm outside without a substantial jacket.

The answer is to knit finer fabrics, of course, but this has been easier said than done, for me. Mostly I’ve stuck to fair isle sweaters, which are great fun to knit and reasonably wearable, but difficult to knit on casually: there’s a lot to lug around, and starting to knit something with a pattern requires some “spin-up time” as you remember where you were and what you’re supposed to be doing.

In most ways this plain tube is the perfect answer to this problem….

I’ll blog more about this (or not,) as I progress.

Sphinx Caveats

This is a rough sketch of some things that I’ve learned about the Sphinx documentation generation system. I should probably spend some time to collect what I’ve learned in a more coherent and durable format, but this will have to do for now:

  • If you describe a type in parameter documentation, it will automatically link to the Python documentation for that type, provided you’re using the Python domain and have intersphinx connected. That’s pretty cool.

  • Sphinx lets you define a scope for a file in some cases. If you’re documenting command-line options to a program (i.e. with the “program” directive and subsidiary “option” directives), or if you’re documenting Python objects and callables within the context of a module, the program and module directives have a scoping effect.

    Cool, but it breaks the reStructuredText idiom, which only allows you to decorate and provide semantic context for specific nested blocks within the file. As in Python code, there’s no way to end a block except via whitespace,1 which produces some very confusing markup effects.

    The “default-domain” directive is similarly… odd.

  • Sphinx cannot take advantage of multiple cores to render a single project, except when building multiple output formats (i.e. PDF/LaTeX, HTML with and/or without directories), and even then with the caveat that only one builder can touch the doctree directory at a time. (So you either need to give each builder its own doctree directory, or let one build complete and then build the rest in parallel; there’s a rough sketch of this after the list.)

    For small documentation sets, under a few dozen pages/kb, this isn’t a huge problem; for larger documentation sets it can be quite frustrating.2

    This limitation means that while it’s possible to write extensions to Sphinx to do additional processing of the build, in most cases it makes more sense to build custom content and extensions that modify or generate reStructuredText, or that munge the output in some way. The include directive in reStructuredText and milking the hell out of make are good places to start.

  • Be careful when instantiating objects in Sphinx’s conf.py file: Sphinx stores a pickle (serialization) of conf.py and compares that stored pickle with the current file to ensure that the configuration hasn’t changed (changed configuration files necessitate a full rebuild of the project). Creating objects in this file will make that comparison fail and trigger full (and often unneeded) rebuilds.

  • Delightfully, Sphinx produces .po files that you can use to power translated output, using the gettext Sphinx builder. Specify a different doctree directory for this output to prevent some issues with overlapping builds. This is really cool.
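
For what it’s worth, here’s a minimal sketch of the parallel-builder workaround mentioned above: each sphinx-build invocation gets its own doctree directory via -d, so the builders don’t contend for it. The builder list and the source/build paths are just examples.

#!/usr/bin/python
# Sketch: run several Sphinx builders at once, giving each one its own
# doctree directory (-d) so they don't step on each other. The builder
# names and the directory layout are examples only.

import subprocess

builders = ["html", "dirhtml", "latex", "gettext"]

procs = []
for builder in builders:
    cmd = [
        "sphinx-build",
        "-b", builder,
        "-d", "build/doctrees-" + builder,  # per-builder doctree directory
        "source",                           # source directory
        "build/" + builder,                 # output directory
    ]
    procs.append(subprocess.Popen(cmd))

# Wait for every build and report failures.
for proc in procs:
    proc.wait()
    if proc.returncode != 0:
        raise SystemExit("a sphinx-build invocation failed")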

Sphinx is great. Even though I’m always looking at different documentation systems and practices, I can’t find anything that’s better. My hope is that the more I/we talk about these issues, the closer I/we’ll get to solutions, and the better those solutions will be.

Onward and Upward!


  1. In Python this isn’t a real problem, but reStructuredText describes a basically XML-like document, and some structures like headings are not easy to embed in rst blocks. ↩︎

  2. In reality, documentation sets would need to be many hundreds of thousands of words for this to actually matter in a significant way. I’ve seen documentation take 2-3 minutes for a clean regeneration using Sphinx on very powerful hardware (SSDs, contemporary server-grade processors, etc.), and while this shouldn’t be a deal breaker for anyone, documentation that’s slow to regenerate is harder to maintain and keep up to date (e.g. it’s difficult to test the effect of changes on output, and non-trivial to update the documents regularly with small fixes). ↩︎

Stability is a Crutch

I don’t think the tension between having good, robust, and bug-free software and having software with new features and capabilities is solvable in the macro case. What follows is a musing on this subject, related in my mind to the On Installing Linux post.


I’m not exactly making the argument that we should all prefer to use unstable and untested software, but I think there is a way in which the stability1 of the most prevalent Linux distributions is a crutch. Because developers can trust that the operating system will effectively never change, there’s no need to write code that expects that it might change.

The argument for this is largely economic: by spacing updates out to once a year or once every 18 months, you can batch “update” costs and save some amount of overhead. The downside is that if you defer update costs, they tend to increase. Conversely, it’s difficult to move development forward if you’re continuously updating, and if your software is too “fresh,” you may lose time to working out bugs in your dependencies rather than in your system itself.

The logic of both arguments holds, but I’m not aware of comparative numbers for the costs of either approach. I’m not sure that there are deployments of significant size that actually deploy on anything that isn’t reasonably stable. Other factors:

  • automated updating and system management.
  • testing infrastructure.
  • size of deployment.
  • number and variety of deployment configurations.

  1. Reliably updated and patched regularly for several years of maintenance, but otherwise totally stable and static. ↩︎

Markdown Standardization

I (mostly) lurk on the markdown discussion list, which is a great collection of people who implement and develop projects that use markdown. And there’s been a recent spate of conversation about standardization of markdown implementations and syntax. This post is both a reflection on this topic and a brief overview of the oft-forgotten history.

A History of Markdown Standardization

Markdown is a simple project that takes the conventions most people have been using to convey text formatting and style in plain-text email, and provides a very minimalist and lightweight script that translates this “markup” (i.e. “markdown”) into HTML. It’s a great idea, and having systems that make it possible for people to focus on writing rather than formatting is a great thing.

People should never write XML, HTML, or XHTML by hand. *Ever*.

Here’s the problem: the initial implementation is a Perl script that uses a bunch of pattern-matching regular expressions (as you’d expect) to parse the input. It’s slow and imprecise, there are a few minor logical bugs, there’s no formal specification, and the description of the markdown language is ambiguous on a few key questions. Furthermore, there are a number of features that are simple and frequently requested/desired, but that have no official description of their behavior.

As people have gone about developing markdown implementations and extensions in other languages, to fix up the inconsistencies and to provide markdown support in every programming language under the sun, without a formal specification or disambiguation of the open questions, the result has been fragmentation: all the implementations are subtly different. Often you’ll never notice, but if you use footnotes (which are non-standard) or want to have nested lists, you will end up writing implementation-dependent markdown.

The result is that either you tie your text to a specific implementation, or you go blithely on with the knowledge that the markdown that you write or store today will require intervention of some sort in the future. If you need to extend markdown syntax, you can’t without becoming an implementer of markdown itself.

That’s an awful thing. And there’s no real path out of this: the originator of markdown has publicly stated that he has no interest in blessing a successor, continuing development of the reference implementation, or in contributing to a specification process. Insofar as he controls the authoritative definition of markdown, the project to standardize markdown is dead before it even begins.

The problem is that most people involved in markdown (implementers, application developers, etc.) want some resolution to this problem: it’s bad for users and it makes implementing markdown difficult (which markdown flavor should you use? should you reimplement bugs for consistency and compatibility, or provide a correct system that breaks compatibility?). At the same time, markdown implementations are not commercial products and were built to address their authors’ needs, and none of those maintainers really have the time or a non-goodwill interest in a standardization process.

So even if markdown standardization weren’t doomed from the start, the only people with any real ability to rally community support for a standardized markdown are not inclined to participate in a standardization process.

Markdown Isn’t For Text That Matters

If markdown were better, more clear, and more rigorously defined and implemented, this wouldn’t be a problem, but the truth is that markdown’s main role has been for README files, blog posts, wikis, and comments on blog posts and in discussion forums.

It’s a great “lowest common denominator” for multi-authored text that needs rich hypertext features but also needs markdown’s simplicity and intuitiveness. Big projects? Multi-file projects? Outputs beyond single files?

Sure you can hack it with things like maruku and multi-markdown to get LaTeX output, and footnotes, and more complex metadata. And there are some systems that make it possible to handle projects beyond the scope of a single file, but they’re not amazing, or particularly innovative, particularly at scale.

To recap, markdown is probably not an ideal archival format for important text, because:

  • The implementation-dependency means that markdown often fails at genericism, which I think is supposed to be its primary feature.

    Generic text representation formats are a must.

  • If you need output formats beyond HTML/XHTML then markdown is probably not for you.

You can get other formats, but it’s even more implementation specific.

The Alternatives

Don’t standardize anything. While markdown isn’t perfect the way it is now, there’s no real change possible that wouldn’t make markdown worse. There are two paths forward, as I see it:

  1. Give up and use reStructuredText for all new projects.

    RST is fussy, but has definite and clear solutions to the issues that plague markdown.

    • It has support for every major output format, and it wouldn’t be too hard to expand on that.
    • It’s fast.
    • In addition to the primary (Python) implementation, there is support in Pandoc, and there are early-stage Java/PHP implementations. Most tools just wrap the Python implementation, which isn’t really a problem.
    • There are clear paths for extending rst as needed for new projects.
  2. Design and implement a new markdown-like language. I think reMarkdown would be a good name. This would be a lot of work, and would have the following components:

    • a complete test suite that other implementations could use to confirm compatibility with reMarkdown.
    • a formal specification.
    • a lexer/parser design and reference implementation, with an abstract XML-like output format. We want a realistic model implementation that isn’t overly dependent upon a single output format.
    • an explicit and defined process for changing and improving the syntax.

On Installing Linux

(alternately, “Installing Linux the Hard Way”)

I’ve had the occasion to install Linux on three systems in the recent past. People don’t really install Linux anymore, it seems: with “cloud” instances and image-based provisioning, no one really has to install Linux as such. My experiences have been mostly awful:

  • I couldn’t make my current laptop do a full LVM boot for the life of me. I partitioned the hard drive in the conventional way, and while the system works fine, I think non-abstracted disk volumes are bad practice.

    Disk partitioning and bootloaders remain the most difficult and frustrating aspects of the installation process, and there’s no automation to support this work. Furthermore, even if it takes you a day to get it right, you usually don’t have to mess with it again for a year or two, which makes it difficult to get better at in practice.

    The Debian installer will do this pretty well, but you can’t get the auto partitioning tool to not use the full disk. Or I can’t figure it out.

  • I recently tried to install Arch Linux on an infrastructural system. Apparently in the last couple of months Arch totally did away with the installation system. So it dumps you into a mostly working shell and provides a couple of shell scripts to “automate” the installation.

It’s a great idea, as long as you never have to use it.

Conversely, it’s a great idea if you’re constantly running installations.

If you install Arch once every year or two, as I suspect is the most common case, good luck.

I need to do it again: to update an older laptop to the 64-bit version of Arch, and I fear this is going to be terribly painful. I’m left with two main questions:

  1. Have we given up on the idea that desktop Linux may be viable for people who aren’t already familiar with Linux, or who aren’t software developers (or the next best thing)?

  2. Does the desktop experience actually matter?

I’m asking this along a narrower line of questioning. There’s computer usage that revolves around things that happen in the browser, which is (probably) better suited to embedded systems (i.e. Android and iOS based devices), and it’s not clear where the line between that and “general purpose” computing will fall.

If we end up using embedded systems for most of the computers that we actually touch, this fundamentally changes the desktop experience as we know it, particularly for things like installation.

Three Way Merge Script

Note: This is an old post, written a few months ago, about a piece of code that I’m no longer (really) using. I present it here as an archival piece with a boatload of caveats. Enjoy!

I have a problem that I think is not terribly unique: I have a directory of files and I want to maintain two distinct copies of these files at once, and I want a tool that looks at both directories and makes sure they’re up to date. That’s all. Turns out nothing does exactly that, so I wrote a hacked up shell script, and you can get it from the code section:

merge-script

I hope you enjoy!

Background

You might say, “why not just use git to take care of this,” which is fair. The truth is that I don’t really care about the histories as long as there’s some form of revision tracking. Here’s the situation:

I keep a personal ikiwiki instance for all of my notes, tasks, and project stuff. There’s nothing revolutionary about it, and I even use deft, dired, and some hacked-up lisp to do most of the work. But I also work on a lot of projects that have their own git repositories, and I want to be able to track the notes for some of those projects in those repositories as well.

Conflicts.

There are some possible solutions:

1. Use hard links so that both files will point at the same data on disk.

Great idea, but it breaks on multiple systems. Even if it might have worked in this case, it frightens me to have such fragile systems.

Note: the more I play with this, the less suitable I think it is for multi-system use. If one or both of the sides is in a git repo, and you make changes locally and then pull changes in from a git upstream, the git-managed files may look newer than the files that you changed. A flaw.

2. Only edit files in one repository or the other, and have a pre-commit hook, or similar, that copies data from the new system to the old system.

I rejected this because I thought I’d have a hard time enforcing this behavior.

3. Write a script that uses diff3 to merge (potential) changes from both sources.

This is what I did.

The script actually uses the merge command, which is a wrapper around diff3 from RCS. Shrug. A rough sketch of the approach follows.
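
The core of the idea looks roughly like this (a sketch under assumptions, not the actual merge-script): keep a cached copy of the last-merged state to act as the common ancestor, and let merge(1) fold the changes from both live copies together. The three directory paths are placeholders.

#!/usr/bin/python
# Sketch of a three-way merge between two live copies of a directory,
# using merge(1) from RCS and a cached common-ancestor copy. The
# directory names here are placeholders, not real paths.

import os
import shutil
import subprocess

LEFT = os.path.expanduser("~/wiki/notes")      # e.g. the ikiwiki copy
RIGHT = os.path.expanduser("~/project/notes")  # e.g. the project-repo copy
BASE = os.path.expanduser("~/.merge-base")     # last merged state (ancestor)

for name in os.listdir(LEFT):
    left = os.path.join(LEFT, name)
    right = os.path.join(RIGHT, name)
    base = os.path.join(BASE, name)

    # Only merge regular files that exist in all three trees.
    if not (os.path.isfile(left) and os.path.isfile(right) and os.path.isfile(base)):
        continue

    # merge(1) writes the result into its first argument; a non-zero
    # exit status means there were conflicts to resolve by hand.
    status = subprocess.call(["merge", left, base, right])
    if status == 0:
        shutil.copy2(left, right)  # propagate the merged result
        shutil.copy2(left, base)   # and record it as the new ancestor
    else:
        print("conflict in %s; resolve by hand" % name)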

Beyond my somewhat trivial and weird use-case, I actually think that this script is more useful for the following situation:

You use a service like Dropbox as a way of getting data onto mobile devices (say), but you want the canonical version of the file to live in a git repository on your system.

This is the script for you.

I hope you enjoy it!