In Favor of Fast Builds

This is an entry in my loose series of posts about build systems.

I've been thinking recently about why I've come to think that build systems are so important, and this post is mostly just me thinking aloud about this issue and related questions.

Making Builds Efficient

Writing a build system for a project is often relatively trivial: once you capture the process and figure out the base dependencies, you can write scripts and makefiles to automate it. The problem is that the most rudimentary build systems aren't terribly efficient, for two main reasons:

1. It's difficult to stumble into a build process that is easy to parallelize, so these rudimentary solutions often depend on a series of steps happening in a specific order.

2. It's easier to write a build system that rebuilds too much rather than too little for subsequent builds. From the perspective of build tool designers, this is the correct behavior; but it means that it takes more work to ensure that you only rebuild what you need to.

As a corollary, you need to test build systems and approaches against reasonably large projects, where "rebuilding too much" is actually detectable.

Making a build system efficient isn't too hard, but it does require some amount of testing and experimentation, and it often centers on having explicit dependencies, so that the build tool (i.e. Make, SCons, Ninja, etc.) can build output files in the correct order and only rebuild when a dependency changes. [1]
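
As a minimal sketch of the underlying idea, the following shell loop only recompiles an object file when its source is newer than the existing output; the src/ and build/ directories and the cc invocation are assumptions for illustration, and this mtime comparison is essentially what a build tool automates once you declare an explicit rule like "foo.o: foo.c".

# only rebuild an output when its input is newer than the existing output
mkdir -p build
for src in src/*.c; do
  out="build/$(basename "${src%.c}").o"
  if [ ! -e "$out" ] || [ "$src" -nt "$out" ]; then
    cc -c "$src" -o "$out"
  fi
done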

The Benefits of a Fast Build

  1. Fast builds increase overall personal productivity.

    You don't have to wait for a build to complete, and you're not tempted to context switch during the build, so you stay focused on your work.

  2. Fast builds increase quality.

    If your build system (and to a similar extent, your test system,) runs efficiently, it's possible to detect errors earlier in the development process, which helps keep errors and defects from accumulating. A tighter feedback loop on the code you write is helpful.

  3. Fast builds democratize the development process.

    If builds are easy to run, and require minimal cajoling and intervention, it becomes much more likely that many people will actually run them and contribute to the project.

    While this is most obviously true of open source communities and projects, it's probably true of all development teams.

  4. Fast builds promote freshness.

    If the build process is frustrating, then anyone who might run the build will avoid it and run it less frequently, and on the whole the development effort loses important feedback and data.

    Continuous integration systems help with this, but they require significant resources, are clumsy solutions, and above all, CI attempts to solve a slightly different problem.

Optimizing Builds

Steps you can take to optimize builds:

(Note: I'm by no means an expert in this, so feel free to add or edit these suggestions.)

  • A large number of smaller jobs that can complete independently of each other are easy to run in parallel. If the jobs that create a product are longer-running and harder to split into components, then the build will be slower, particularly on more powerful hardware that could otherwise run many jobs at once.
  • Incremental builds are a huge win, particularly for larger processes. Most of the reasons why you want "fast builds" only require fast rebuilds and partial builds, not necessarily fast full "clean builds." While fast initial builds are not unimportant, they account for a small percentage of use.
  • Manage complexity.

There are a lot of things you can do to make builds smarter, which should theoretically make builds faster.

Examples of this kind of complexity include storing dependency information in a database, or using hashing rather than "mtime" to detect staleness, or integrating the build automation with other parts of the development tool chain, or using a more limited method to specify build processes.

The problem, or the potential problem, is that you lose simplicity, and it's possible that something in this "smarter and more complex" system will break or slow down under certain pressures, or carry enough overhead to render the optimization unproductive.
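
For instance, here is a rough sketch of hash-based staleness detection in shell, as an alternative to mtime comparison; the .hashes cache file and the src/ layout are assumptions for illustration, not part of any particular build tool.

# compare each source file's current digest against a cached digest
hashfile=.hashes
touch "$hashfile"
for src in src/*.c; do
  new="$(sha256sum "$src" | awk '{print $1}')"
  old="$(awk -v f="$src" '$2 == f {print $1}' "$hashfile")"
  if [ "$new" != "$old" ]; then
    echo "stale: $src"   # a real build would recompile here
  fi
done
sha256sum src/*.c > "$hashfile"   # refresh the cache for the next run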

[1]It's too easy to use wild-cards so that the system must rebuild a given output if any of a number of input files change. Some of this is unavoidable, and generally there are more input files than output files, but particularly with builds that have intermediate stages, or more complex relationships between files, it's important to attend to these relationships.

On Build Processes

I've found myself writing a fair number of Makefiles in the last few weeks: in part because it was a tool, hell, a class of tools, that I didn't really understand, and I'm a big sucker for learning new things; and in part because I had a lot of build process-related tasks to automate. But I think my interest is a bit deeper than that.

Make and related tools provide a good metaphor for thinking about certain kinds of tasks and processes. Build systems are less about making something more efficient (though they often do that,) and more about making processes reproducible and consistent. In some respects I think it's appropriate to think of build tools as general purpose automation tools for these kinds of processes.

I've written here before about the merits of /technical-writing/compilation for documentation, and I think that still holds true: build processes add necessary procedural structure. Indirectly, having a formalized build process also makes it very easy to extend and develop processes as needs change. There's some up-front work, but it nearly always pays off.

While I want to avoid thinking that everything is a Makefile-shaped nail, I think it's also probably true that there are a lot of common tasks in general purpose computing that are make-shaped: format conversion, extracting and importing data, typesetting (and all sorts of publication related tasks,) archiving, system configuration, etc. Perhaps more generic build tools need to be part of basic computer literacy. That's another topic for a much larger discussion.

Finally, I want to raise (or re-raise) the point that another function of build systems is to reduce friction on common tasks and increase the likelihood that tasks will get done, while requiring less technical background for fundamentally mundane tasks. Build systems are absolutely essential for producing output from any really complex process, because it's hard to reliably produce builds without them; for less complex processes they're essential because no one (or fewer people) will do those tasks without some kind of support.

Rough thoughts as always.

Git Feature Requests

  • The ability to mark a branch "diverged," to prevent (or warn) on attempted merges from master (for example) into a maintenance branch.

  • The ability to create and track dedicated topic branches, and complementary tooling to encourage rebasing commits in these sorts of branches. We might call them "patch sets" or "sets" rather than "branches." Also, it might be useful to think about using/displaying these commits, when published, in a different way.

  • Represent merge commits as hyperlinks to the user, when possible. I think GitHub's "network graph" and similar visualizations are great for showing how commits and branches interact and relate to each other.

    This would probably require some additional or modified output from "git log".

  • Named stashes.

  • Branched stashes (perhaps this is closer to what I'm thinking about for the request regarding topic branches.)

  • The ability to check out "working copies" of different points/branches from a single repository at the same time, using "native" git utilities (a rough sketch of one current approximation follows this list).

    Related, "shelf" functionality is scriptable, but this too needs to be easier and more well supported.

    I think legit is a step in the right direction, but it's weird and probably makes it more difficult to understand what's happening with git conceptually, as opposed to the above features, which would provide more appropriate conceptual metaphors for the work that would-be git users need to do.
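
On the multiple-working-copies request: newer versions of git ship a "git worktree" command that approximates this, though not the fully integrated experience described above. A minimal sketch, with hypothetical paths and branch names:

# attach additional working directories to one repository
git worktree add ../project-maint maint-1.0    # check out a branch in a second directory
git worktree add ../project-review HEAD~5      # or a detached point in history
git worktree list                              # show all attached working copies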

Limitations of GitHub Forks

Assumptions:

  1. git is pretty awesome, but it's conceptually complex. As a result using git demands a preexisting familiarity with git itself or some sort of wrapper to minimize the conceptual overhead.
  2. The collaboration methods (i.e. hosting) provided by git, which are simple by design to allow maximum flexibility, do not provide enough structure to be practically useful. As a result providers like GitHub (and BitBucket and gitorious) offer a valuable service that makes it easier--or even possible--for people to use git.

Caveats:

  • There are problems with using centralized repository services controlled by third parties, particularly for open source/free software projects.

    There are ways that GitHub succeeds and fails in this regard, but this dynamic is too complex to fully investigate within the scope of this post.

  • If you use GitHub as designed, and the way that most projects use GitHub, then you have a very specific and particular view of how Git works.

    While this isn't a bad thing, it's less easy to use git in some more distributed workflows as a result. This isn't GitHub's fault so much as it is an artifact of people not really knowing how git itself works.

Assertions:

  1. GitHub's "fork" model[^fork] disincentives people from working in "topic" branches.

  2. By making it really easy for people to publish their branches, GitHub disincentivizes the most productive use of the "git rebase" command, which leads to clean and clear histories (see the sketch after this list).

  3. There's no distinction between a "soft fork" where you create a fork for the purpose of submitting a patch (i.e. a "pull request") and a "hard fork," where you actually want to break the relationship with the original project.

    This is mostly meaningful in the context of the other features that GitHub provides, notably the "Network" chart and the issue tracker. In a soft fork that I intend to merge back in, I'd like the issues to "come with" the repository, or at least connect in some way to the "parent." For hard forks, it might make sense to leave the old issues behind. The same goes for the network chart, which is incredibly powerful but not great at guessing how your repository relates to the rest of its "social network."
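
As a sketch of the rebase workflow that the fork model tends to discourage, with hypothetical branch and remote names:

# tidy a topic branch before publishing it, so reviewers see a clean history
git checkout feature/parser
git rebase -i master                 # opens an editor to squash fixups, reword messages, drop noise
git push -f origin feature/parser    # republish the rewritten branch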

The solution: keep innovating, keep fighting lock-in, and don't let GitHub dictate how you work.

Distributed Bug Tracking

The free software/open source/software development world needs a distributed bug tracking story. Because the current one sucks.

The State of the Art

There are a number of tools written between 2006 and 2010 or so that provide partial or incomplete solutions to the problem. Almost isn't quite good enough. The "Resources" section of this post contains an overview of the most important (in my judgment) representatives of the current work in the area, with a bit of editorializing.

In general these solutions are good starts, and I think they allow us (or me) a good starting point for thinking about what distributed bug tracking could be like. Someday.

Bug tracking needs are diverse, which creates a significant design challenge for any system in this space. There are many existing solutions that everyone hates, and I suspect most would-be developers and innovators in the space would like to avoid opening this can of worms.

Another factor is that, while most people have come to the conclusion that distributed source control tools are the "serious" contemporary tools for managing source code, the benefits of distributed bug tracking haven't yet propagated in the same way. Many folks have begun to come to terms with the fact that some amount of tactical centralization is inevitable, required, and even desirable [1] in the context of issue tracking systems.

Add to this the frequent requirement that non-developer users often need to track and create issues, and the result is that we've arrived at something of an impasse.

Requirements

A distributed bug tracking system would need:

  • A good way to provide short, unique identifiers for individual issues and comments so that users can discuss issues canonically.

  • An interface contained in a single application, script, or binary, that you could distribute with the application.

  • A simple/lightweight web-based interface so that users can (at least) review, search, and reference issues from a web browser.

    Write access would also be good, but is less critical. Also, it might be more practical (both from a design and a workflow perspective,) to have users submit bugs on the web into a read-only "staging queue," that developers/administrators would then formally import into the project. This formalizes a certain type of triage approach that many projects may find useful.

  • To be separable from the source code history, either by using a branch, or by using pre-commit hooks to ensure that you never commit changes to code/content and to the bugs at the same time (a sketch of such a hook follows this list).

  • To be editable with, and to interact with, commonly accessible tools that users already use: email, command line tools, the version control system, potentially documentation systems, build systems, testing frameworks, and so forth.

  • Built on reliable tools. [2]

  • To provide an easy way to customize your "views" on bugs for a particular team or project. In other words, each team can freely decide which extra fields get attached to their bugs, along with which fields are visible by default, which are required, and so on--without interfering with other projects.
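
As a rough illustration of the hook-based separation mentioned above, here is a minimal pre-commit sketch; the .issues/ directory is a hypothetical location for the bug database, not a convention any of the tools below actually use.

#!/bin/sh
# reject commits that mix bug-database changes with code/content changes
staged="$(git diff --cached --name-only)"
if echo "$staged" | grep -q '^\.issues/' && echo "$staged" | grep -qv '^\.issues/'; then
  echo "error: commit mixes .issues/ changes with other changes" >&2
  exit 1
fi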

The Future of the Art

  1. We (all) need to work on building new and better tools to help solve the distributed issue tracking problem. This will involve:
    • learning from the existing attempts,
    • continuing to develop and solidify the above requirements,
    • (potentially) test and develop a standard (yaml/json?) based data storage format that is easy to parse and merge, and that multiple tools can use (a minimal sketch follows this list).
    • Develop some simple prototype tools, potentially as a suite of related utilities (a la early versions of git,) that facilitate interaction with the git database, with an eye towards flexibility and extensibility.
  2. While there are implications for free software hosting as well as vendor independence and network service autonomy (a la the `Franklin Street Statement <http://autonomo.us/2008/07/franklin-street-statement/>`_), I think the primary reason to pursue distributed bug tracking has more to do with productivity and better engineering practices, and less to do with policy. In summary:
    • Bug database systems that run locally and are fast[3] and always available.
    • Tools that permit offline interaction with issue database.
    • Tools that allow users to connect issues to branches.
    • Tools that make it possible to componentize bug databases in parallel with the software.
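
For concreteness, here is one hypothetical shape a mergeable, file-per-issue YAML record could take; the field names and the .issues/ layout are invented for illustration and don't correspond to any of the tools below.

# create a hypothetical issue record as a standalone YAML file
mkdir -p .issues
cat > .issues/4f2a9c1e.yaml <<'EOF'
id: 4f2a9c1e            # short unique identifier, e.g. a truncated hash
title: web interface times out on large databases
status: open
created: 2012-03-14
comments:
  - author: sam@example.com
    date: 2012-03-15
    body: reproduced against a database with ~10k issues; needs pagination.
EOF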

Resources

(With commentary,)

  • dist-bugs mailing list

    This is the canonical source for discussion around distributed bug tracking.

  • Bugs Everywhere

    This is among the most well developed solutions, speaking holistically. "be" is written in Python, and can generate output for the web. It uses its own data format, and has a pretty good command line tool. The HTML output it generates is probably not very fast at scale (none of them are,) but I have not tested it.

  • Ditz

    Ditz is a very well developed solution. Ditz is implemented in Ruby, has a web interface, has a command line tool, uses a basic YAML data format, and stores data in a branch. Current development is slow, getting it up and running is non-trivial, and my sense is that there isn't a very active community of contributors. There are likely reasons for this, but they are beyond the scope of this overview.

  • pitz

    Pitz is a Python re-implementation of Ditz, and while the developer(s?) have produced a "release," the "interface" is a Python shell, and to interact with the database you basically have to write commands in Python syntax. From a data perspective, however, Pitz, like Ditz, is quite developed. While Pitz stores data in-tree, I think it's an important source of ideas/examples/scaffolding.

  • Artemis

    This is a really clever solution that uses Maildirs to store issues. As a result you can interact with and integrate Artemis issues with your existing email client: pull down changes, and see new bugs in your email, without any complicated email and list server setups.

    The huge caveat is that it's implemented as a plugin for Mercurial, and so can't be used with git projects. Also, all data resides in the tree.

  • git-issues

    In most ways, git-issues is my favorite: it's two Python files, 1700 lines of code, stores issues outside of the source branch, and has a good command line interface. On the downside, it uses XML (which shouldn't matter, but I think probably does, at least in terms of attracting developers,) and doesn't have a web-based interface. It's also currently un-maintained.

  • Prophet/sd

    SD, which is based on a distributed database named Prophet, is a great solution. The primary issue is that it's currently unmaintained and is not as feature complete as it should be. Also, a lot of SD focuses on synchronizing with existing centralized issue trackers, potentially at the expense of developing other tools.

[1]It seems that you want centralized issue databases, or at least the fact that centralized issue databases appear canonical is a major selling point for issue tracking software in general. Otherwise, everyone would have their own text file with a bunch of issues, and that would suck.
[2]Because I don't program (much) and it's easy to criticize architectural decisions from afar, I don't want to explicitly say "we need to write this in Python for portability reasons" or something that would be similarly unfounded. At the same time, adoption and ease of use is crucial here, both for developers and users. Java and Ruby (and maybe Perl,) for various reasons, add friction to the adoption possibilities.
[3]"Is Jira/Bugzilla/etc. slow for you today?"

Lies About Documentation...

.. that developers tell.

  1. All the documentation you'd need is in the test cases.
  2. My comments are really clear and detailed.
  3. I'm really interested in and committed to having really good documentation.
  4. This code is easy to read because it's so procedural.
  5. This doesn't really need documentation.
  6. I've developed a really powerful way to extract documentation from this code.
  7. The documentation is up to date.
  8. We've tested this and nothing's changed.
  9. This behavior hasn't changed, and wouldn't affect users anyway.
  10. The error message is clear.
  11. This entire document needs to be rewritten to account for this change.
  12. You can document this structure with a pretty clear table.

    Often this is true; more often these kinds of comments assume that it's possible to convey 3-5 dimensional matrices clearly on paper/computer screens.

  13. I can do that.
  14. I will do that.
  15. No one should need to understand.

Loops and Git Automation

This post provides a few quick overviews of cool bits of shell script that I've written or put together recently. Nothing earth shattering, but perhaps interesting nonetheless.

Commit all Git Changes

For a long time, I used the following bit of code to provide the inverse operation of "git add .". Where "git add ." adds all uncommitted changes to the staging area for the next commit, the following snippet automatically removes all files that are no longer present on the file-system from the staging area for the next commit.

if [ "`git ls-files -d | wc -l`" -gt "0" ]; then
  git rm --quiet `git ls-files -d`
fi

This is great: if you forget to use "git mv" or you delete a file using rm, you can run this operation and pretty quickly have git catch up with the state of reality. In retrospect I'm not really sure why I put the error-checking if statement in there.

There are two other implementations of this basic idea that I'm aware of:

for i in `git ls-files -d`; do
  git rm --quiet $i
done

It turns out you can do pretty much the same thing with the following statement, using the xargs command, and end up with something that's a bit more succinct:

git ls-files --deleted -z | xargs -0 git rm --quiet

I'm not sure why, but I think it's because I started being a Unix nerd after Linux dropped the argument-number limit, and as a result I've never really gotten a chance to become familiar with xargs. While I sometimes sense that a problem is xargs-shaped, I almost never run into "too many arguments" errors, and always attempt other solutions first.

A Note About xargs

If you're familiar with xargs skip this section. Otherwise, it's geeky story time.

While this isn't currently an issue on Linux, some older UNIX systems (including older versions of Linux,) had this limitation where you could only pass a limited number of arguments to a command. If you had too many, the command would produce an error, and you had to find another way.

I'm not sure what the number was, and the specific number isn't particularly important to the story. Generally, I understand that this problem would crop up when attempting to take the output of a command like find and pipe or pass it to another command like grep. I'm not sure if you can trigger "too many arguments" errors with globbing (i.e. *), but like I said, this kind of thing is pretty uncommon these days.

One of the "other ways" was to use the xargs command which basically takes very long list of arguments and passes them one by one (or in batches?) to another command. My gut feeling is that xargs can do some things, like the above a bit more robustly, but that isn't experimentally grounded. Thoughts?

Onward and Upward!

Erstwhile Programmer

This is the story of how I occasionally realize I exist on the continuum of "programmers," rather than just being an eccentric sort of writer type.

Evidence

download-mail

I have this somewhat peculiar method of downloading email that I think works great. A few weeks ago, however, I was trying to compress things in "hot storage," and realized that I had a problem.

For a year or so, I had been automating commits to the git repository that held all my mail. In order to effectively archive and compress some mail, I needed to do some serious rebasing to not only remove a bunch of messages from the current repository but also pull that content from the history and flatten the history somewhat.

The problem was that I had 50,000 commits, and there's simply no effective way to rebase that many commits in a reasonable amount of time, particularly given I/O limitations. So I gave up, started from (relative) scratch, and rewrote the scripts to be a little bit smarter... you know, in an afternoon.

See the revised code here: download mail

ikiwiki-tasklist

I've written about this before in my post on my new personal organization stuff, but it's no great announcement that I'm moving away from working in emacs' org-mode and doing more work with ikiwiki and some hand-rolled scripts. I think org-mode is great, it just ended up getting in my way a bit and I think I can get more of what I need to get done in other ways.

I have learned a great deal from org-mode. I made the biggest leap away from org-mode when I wrote ikiwiki tasklist, which does all of the things I had been using org-mode's agenda for. It's not complicated at all: it looks in some files for lines that begin with specific strings and puts them into a page that is the perfect task list.

See the code here: ikiwiki tasklist.
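
The basic idea amounts to something like the following rough sketch (not the actual script; the wiki paths and the markers are hypothetical):

# pull every line that starts with a todo marker out of the wiki source files
# and collect them into a single task-list page
grep -h -E '^(TODO|XXX)' ~/wiki/projects/*.mdwn > ~/wiki/tasklist.mdwn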

Common Lisp Weenie

"What Window Manager is that," he asked.

"StumpWM, it's written in Common Lisp," I said, launching into a 30 second pitch for Stump.

My pitch about Stump is pretty basic: the Common Lisp interface allows you to evaluate code during run-time without restarting the window manager or losing state; it's functionally like screen, which is very intuitive for window management; and it has emacs-like key-bindings, which I think work pretty well.

"So you're a Common Lisp programmer?"

"Well not really, I mean, I know enough to get by."

"Right."

"Right."

Conclusion

In several (technical writing) job interviews recently, people asked me about my programming experience, and my answer varied a lot.

I know how computer programs work, I know how people write computer programs, I understand how software testing and debugging works, I understand the kinds of designs that lead to good programs and the kinds that lead to bad software. I don't write code--really--but I can sort of hack things together in shell scripts when I need to.

The answer to the question, these days, is "I'm a programmer in the way that most people are writers: most people are comfortable writing a quick email or a short blurb, but get stuck and have trouble really writing longer or more complicated kinds of text. Reasonably capable but not skilled."

The above code examples work: they do what I need them to do, and particularly in the case of the mail script, they work much better than the previous iteration. I need to do more work, and I feel like I'm reaching the boundaries of what can be comfortably done in shell scripting. My next big programming project is to go through these two scripts and port them to Python and see if I can add just a little bit of additional functionality in the process.

I'm sure I'll report to you on this as my work progresses.