Project Orientation

(or my latest attempt to do things in a more “project-oriented” way.)

This post is about recently completed projects, the projects that I’m working on now, and how my work has changed in recent months.

A couple of weeks ago, I finally posted all of the content that I’ve been working on for the new, revived Cyborg Institute. While the book on systems administration itself had been mostly done for a while, I’d delayed for two reasons:

  1. I wanted to have a couple of other projects completed to demonstrate that the Institute as a project wasn’t just isolated to book-like objects.
  2. I wanted to have some infrastructure in place to be able to sanely publish the Institute site without using some gnarly content management system.1

The end result is that, in addition to the book, I’ve put together a few other projects and pieces of documentation. The more exciting part is that I may do more work like this in the future.

In addition to a lot of day-job work--forthcoming releases and team growth are eating a lot of my time--I’ve been working on a series of wiki pages (and related blog posts) that address the “information debt” that accumulates when organizations don’t put resources and energy into maintaining resources and knowledge. Expect to hear more on this topic.


The truth is that I really like working on bigger projects. Writing blog posts and participating in online conversations has been very rewarding over the past ~10 years, but I feel like I’ve hit a wall: I’ve written ~830,000 words on tychoish.com, and I’m frustrated that there’s not a lot to show for it:

  • readership is steady, even increasing, but not inspiring,

  • I don’t actually want to work as a blogger, and

  • most importantly, the work I’ve done here doesn’t really build to anything more than a half-dozen or so blog posts.

    While there are themes throughout all of the posts, the work isn’t very rigorous, and it lacks a certain kind of depth.

    So here I am, writing book-like objects about technology that I hope are, and will remain, useful for both technical and non-technical audiences, as well as compiling the little things that I hack on so that other people can improve and benefit from them, and writing fiction (which I may try to publish conventionally, though I may end up self-publishing using a similar process). The goal is to:

  • Write things with more rigor, including better citations and research.

  • Work on projects that address topics more comprehensively.

  • Produce, document, and maintain scripts and other programs that I write rather than endlessly critique existing tools and approaches. In short, less talking about stuff and more making stuff.

Let’s see how this goes!


  1. All content management systems are gnarly. ↩︎

imenu for Markdown

For a while, I’ve been envious of some of the project and file navigation features in emacs for browsing bigger projects and programs. Things like imenu and tags have always seemed awesome, but given that I spend most of my time editing reStructuredText and Markdown files (I’m a technical writer), these tools have been distant and not a part of my day-to-day work.

It’s not that it would be impossible to write imenu or etags interfaces for the formats I use regularly; it’s more that I’d never gotten around to it until now.

We’re still a ways away on the etags front, but it turns out that, when I wasn’t looking, rst mode got imenu support, and with the following little bit of elisp you can get imenu for markdown as well.

;; imenu patterns for Markdown: setext-style titles (underlined with = or -),
;; atx-style headings (h1-h6), and footnote definitions. The trailing 1 is
;; the regexp group that becomes the index entry's name.
(setq markdown-imenu-generic-expression
      '(("title" "^\\(.*\\)[\n]=+$" 1)
        ("h2-"   "^\\(.*\\)[\n]-+$" 1)
        ("h1"    "^# \\(.*\\)$" 1)
        ("h2"    "^## \\(.*\\)$" 1)
        ("h3"    "^### \\(.*\\)$" 1)
        ("h4"    "^#### \\(.*\\)$" 1)
        ("h5"    "^##### \\(.*\\)$" 1)
        ("h6"    "^###### \\(.*\\)$" 1)
        ("fn"    "^\\[\\^\\(.*\\)\\]" 1)))

;; imenu-generic-expression is buffer-local, so set it from the mode hook.
(add-hook 'markdown-mode-hook
          (lambda ()
            (setq imenu-generic-expression markdown-imenu-generic-expression)))

Pretty awesome! I hope it helps you make awesome things.

Practical Branch Layout

I’ve recently gone from being someone who uses git entirely “on my own,” to being someone who uses git with a lot of other people at once. I’ve also had to introduce git to the uninitiated a few times. These have both been notable challenges.

Git is hard, particularly in these contexts: not only are there many concepts to learn, but there’s no single prescribed workflow, and there’s a multitude of workflow possibilities. Which is great from a philosophy perspective, and arguably good from a “useful/robust tool” perspective, but horrible from a “best practices” perspective. Hell, it’s horrible from a “consistent and sane practices” perspective.

There are a lot of proposed solutions to the “finding a sane practice” problem when working with git and large teams. Patching or reformulating the interface is a common strategy. Legit is a great example of this, but I don’t think it’s enough, because the problem is really one of bad and unclear defaults. For instance:

  • The “master” branch, in larger distributed systems, is highly confusing. If you have multiple remotes (which is common), every remote has its own master branch, and they all (necessarily) refer to different possible points in a repository’s history.
  • The name of a local branch does not necessarily correspond to the name of the remote branch in any specific repository. The decoupling of local and remote branches makes sense from a design perspective, but it’s difficult to retain the mapping in your head, and it’s also difficult to talk about branch configurations because your “view of the universe” doesn’t often coincide with anyone else’s.

Here are some ideas:

  1. Have two modes of operation: a maintainer’s mode that resembles current-day git with a few basic tweaks (described in the later points), and a contributor’s mode that is designed around the following assumptions:

    • The “mainline” of the project’s development will occur in a branch to which this user only has read-only access.
    • Most of this user’s work will happen in isolated topic branches.
  2. Branch naming enforcement:

    • All branches will be uniquely named, relative to a username and/or hostname. This will be (largely) transparent in the history, but will make all conversations about branches less awkward. This is basically how branches work now, with [remote]/[branch], except that all local branches need self/[branch], and the software should make this more transparent (see the sketch after this list).
    • Remote branches will implicitly have local tracking branches with identical names. You could commit to any of the local tracking branches, and pull will have the ability to convert your changes to a self/[branch] if needed.
    • All local branches, if published, will map to a single remote branch. One remote will be the user’s default “publication target” for branches.

    This is basically what the origin remote does today, so again, this isn’t a major change.

  3. When you run git clone, the cloned repository should become the upstream remote, not the origin.

    The origin remote should be the default “place to publish my work,” and would be configured separately.

  4. Minor tweaks.

    • Map local branches directly to remote branches.
    • Be able to specify a remote branch as a “mirror” of another branch.
    • Make cherry-picked commits identifiable by their original commit-id internally. The goal is to push people to cherry-pick commits as much as possible to construct histories without needing to rebase.1
    • Have submodule support automatically configured, without fussing.
    • Have better functions for cleaning up branch cruft, particularly on remotes.
    • Have some sort of configurable “published pointer,” that users can use as a safe block against rebases before a given point.

    The goals here are to:

    • Make working with git all about branches and iteration rather than a sequence of commits.
    • Provide good tools to prevent people from rebasing commits, which is always confusing and rarely actually required.
    • Make branch names as canonical as possible. The fact that there can be many possible names for the same thing is awful.
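
To make the naming idea in point 2 a little more concrete, here’s a minimal sketch in Python (with entirely hypothetical names; this isn’t a real git interface) of how a tool might canonicalize branch names so that every branch has exactly one unambiguous name, both locally and on its single publication remote:

# A sketch only, with hypothetical names: canonical "owner/topic" branch
# names, where "self/" is shorthand for the current user, and every local
# branch maps to exactly one branch on a single publication remote.
from dataclasses import dataclass


@dataclass
class BranchLayout:
    user: str                       # e.g. "tycho"
    publish_remote: str = "origin"  # the default "publication target"

    def canonical(self, name: str) -> str:
        """Expand 'self/<topic>' to '<user>/<topic>'; leave other names alone."""
        if name.startswith("self/"):
            return self.user + "/" + name[len("self/"):]
        return name

    def publication_target(self, name: str) -> str:
        """The single remote branch a published local branch maps to."""
        return self.publish_remote + "/" + self.canonical(name)


layout = BranchLayout(user="tycho")
print(layout.canonical("self/branch-layout"))           # tycho/branch-layout
print(layout.publication_target("self/branch-layout"))  # origin/tycho/branch-layout

The point isn’t the code itself; it’s that the mapping from the name you use locally to the name everyone else sees is deterministic, and therefore easy to talk about.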

Who’s with me? We can sort out the details in the comments, if you want.


  1. To make this plausible, github needs to allow cherry-picked commits to close a pull request. ↩︎

Hypertextuality

I recently took some of my writing time to create a makefile (Novel Makefile) to manage work I hope to be doing on a new novel project. I’ve started outlining and researching the story in earnest after having spent the past couple of years talking about it, and I think writing will commence soon. In another post I’d like to write up some thoughts on the tooling and technology of writing non-technical/non-manual long-form work.

This post, drawing on some time spent buried deep in production, is about the state of (conceptually) longer-form work in digital mediums. Or, at least, a brief commentary on same.


The tools that I use to write technical materials do all sorts of cool things, like:

  • provide instant cross referencing,
  • generate great indexes, and
  • automatically generate and link glossaries.

This is not particularly unusual, and in fact Sphinx is somewhat under-featured relative to other documentation generation systems like DocBook.1

And yet people publish ebooks that are virtually identical to paper books. Ebooks seem to say “*this electronic book is the best facsimile of a paper book that we can imagine right now,*” while totally ignoring anything more that a *hyper*text might rightfully be.

I enjoy reading ebooks, just as I have enjoyed reading paper books, but mostly because ebooks basically are paper books. I’ve written posts in the past challenging myself, and fiction writers in general, to actually do hypertext rather than recapitulating prior modalities in digital form.

At various points I’ve thought that wikis might be a good model of how to do hypertext, because the form is structurally novel. These days, I don’t think that this is the case: wikis are unstructured and chaotic, and I’ve come to believe that the secret to hypertext is structure. There are so many possibilities in hypertext, and I think much experimentation in hypertext has attempted to address the chaos of that experience. This does serve to highlight the extent to which “the future is here,” but it obscures the fact that structure makes narratives understandable. Think about how much great, new, innovative (and successful!) fiction of the past decade (or so) is not structurally experimental or chaotic. (Answer: a lot of it.)

The not-so-secret of hypertext is, I suspect, tooling: without really good tools, the mechanics of producing a complex, interactive textual experience2 are difficult for a single writer, or even a small group of writers, to manage. Most tools that manage the publication and writing of text are not suited to producing large, multi-page, multi-component texts. One potential glimmer of hope is that tools for developing programs (IDEs, build systems, compilers, templating systems, introspection tools, pattern matching, etc.) are well developed and could be modified for use in text production.

The second not-so-secret of hypertext is probably that hypertext is an evolution of text production and consumption, not a revolution. Which only seems reasonable. We have the technology now to produce really cool text products. While the tooling needs to get better, the literature also needs to do some catching up.

Let’s start making things!


  1. It’s not that Sphinx is “bad,” but it’s clearly designed for a specific kind of documentation project, and if you stray too far outside of those bounds, or need formats that aren’t quite supported, then you end up without a lot of recourse. Having said that, the “normal,” well-supported path covers most projects--documentation or otherwise--and you will only very rarely hit upon an actual limitation of Sphinx itself. ↩︎

  2. To be clear, I’m partial to the argument that today’s computer games, particularly role-playing games, are the things that the futurists of the 1960s and 70s (e.g. Theodor Holm Nelson) called “hypertext.” ↩︎

Denormalize Access Control

Access control is both immensely useful and incredibly broken.

Access control, or the ability to constrain access to data and programs in a shared system, is the only way that we, as users of shared systems, can maintain our identities, personal security, and privacy. Shared systems--databases, file servers, social networking sites, virtualized computing systems, vendor accounts, control panels, management tools, and so forth--all need robust, flexible, granular, and scalable access control tools.

Contemporary access control tools--access control lists (ACLs) and access control groups--indeed, the entire conceptual framework for managing access to data and resources, don’t work. From a theoretical perspective, ACLs that express a relationship between users or groups of users and data or resources represent a parsimonious solution to the “access control problem:” if properly deployed, only those with access grants will have access to a given resource.

In practice, these kinds of relationships do not work. Typically, the relationships between data and users are rich and complex, and different users need to be able to do different things with different resources. Some users need “read only” access, others need partial read access, and some need read and write access but only to a subset of a resource. While ACL systems can impose these kinds of restrictions, the access control abstraction doesn’t match the data abstraction or the real-world relationships that it supposedly reflects.

Compounding this problem are two important factors:

  1. Access control needs change over time in response to social and cultural shifts among the users and providers of these resources.
  2. There are too many pieces of information or resources in any potential shared system to allocate access on a per-object or per-resource basis, and the volume of objects and resources is only increasing.

Often many objects or resources have the same or similar access control patterns, which leads to the “group” abstraction. Groups make it possible to describe a specific access control pattern that applies to a number of objects, and to connect this pattern with specific resources.

Conceptual deficiencies:

  • There’s a volume problem. Access control data represents a many-to-many-to-many relationship. There are many different users and (nested) groups, many different kinds of access controls that systems can grant, and many different (nested) resources. This would be unmanageably complex without the possibility for nesting, but nesting means that the relationships between resources and between groups and users are also important. With nesting in the mix, access control becomes effectively impossible to reason about.

  • ACLs and group-based access control don’t account for the fact that access must constantly evolve, and current systems don’t contain support for ongoing maintenance. (We need background threads that go through and validate access control consistency.) Also, all access control grants must have some capacity for automatic expiration.

  • Access control requirements and possibilities shift as data becomes more or less structured, and as data use patterns change. The same conceptual framework that works well for access control in the context of data stored in a relational database doesn’t work as well when the data in question is a word processing document, an email folder, or a spreadsheet.

    The fewer people that need access to a single piece of data, the simpler the access control system can be. While this seems self-evident, it also means that access control systems are difficult to test in the really large, complex systems in which they’re used.

  • Group-based access control systems, in effect, normalize data about access control in an effort to speed up data access times. While this performance is welcome, in most cases granting access via groups leads to an overly liberal distribution of access control rights. At once, it’s too difficult to understand “who has access to what” and too easy to add people to groups that give them more access than they need.

So the solution:

  1. Denormalize all access control data,
  2. don’t grant access to groups, and
  3. forbid inheritance.

This is totally counter to the state of the art. In most ways, normalized access control data, with role/group-based access control, and complex inheritance are the gold standard. Why would it work?

  • If you have a piece of data, you will always be able to determine who has access to it, without needing to do another look-up (a minimal sketch of this approach follows this list).

  • If you can deactivate credentials, then a background process can go through and remove access without causing a large security problem. (For partial removes, you would freeze an account, let the background process modify access control and then unfreeze the account.)

    The downside is that, potentially, in a large system, it may take a rather long time for access grants to propagate to users. Locking user accounts makes the system secure and viable, but doesn’t make the process any quicker.

    As an added bonus, these processes could probably be independent and wouldn’t require any sort of shared state or lock, which means many such operations could run in parallel, and they could stop and restart at will.

  • The inheritance option should be fuzzy. Some sort of “bucket-based” access control should be possible, if there’s a lot of data with the same access control rules and users.

    Once things get more complex, buckets are the wrong metaphor, and you should use granular controls everywhere.
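
To make this a bit less abstract, here’s a minimal sketch in Python (hypothetical names and structure, not any real system’s API) of the denormalized model: every resource carries its own flat list of grants, with no groups and no inheritance, and a background sweep strips grants for deactivated credentials.

# A sketch only, with hypothetical names: each resource carries its own
# flat list of grants (no groups, no inheritance), and a background sweep
# removes grants whose credentials have been deactivated.
from dataclasses import dataclass, field


@dataclass
class Grant:
    user: str
    rights: frozenset  # e.g. frozenset({"read"}) or frozenset({"read", "write"})


@dataclass
class Resource:
    name: str
    grants: list = field(default_factory=list)  # access data lives with the resource

    def allowed(self, user: str, right: str) -> bool:
        # Answering "who can do what" requires no lookup beyond the resource itself.
        return any(g.user == user and right in g.rights for g in self.grants)


def sweep(resources, deactivated):
    """Background process: strip grants for deactivated credentials.

    Each resource is swept independently, so many sweeps can run in
    parallel, and they can stop and restart at will."""
    for res in resources:
        res.grants = [g for g in res.grants if g.user not in deactivated]


doc = Resource("q3-report", [Grant("sam", frozenset({"read", "write"})),
                             Grant("kim", frozenset({"read"}))])
print(doc.allowed("kim", "write"))   # False
sweep([doc], deactivated={"sam"})
print([g.user for g in doc.grants])  # ['kim']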

Problems/Conclusion:

  • Denormalization might fix the problems with ACLs and permissions systems, but it doesn’t fix the problems with distributed identity management.

    As a counterpoint, this seems like a cryptography management problem.

  • Storing access control information with data means that it’s difficult to take a user and return a list of what these credentials have access to.

    In truth, centralized ACL systems are subject to this flaw as well.

  • A huge part of the problem with centralized ACLs derives from nesting, and from the fact that we tend to model and organize data in tree-like structures that often run counter to the organization of access control rights. As a result, access control tools must be arbitrary.

Taxonomic Failure

I tell people that I’m a professional writer, but this is a bit misleading, because what I really do is figure out how to organize information so that it’s useful and usable. Anyone, with sufficient training and practice, can figure out how to convey simple facts in plain language, but figuring out how to organize simple facts in plain language into a coherent text is the more important part of my job and work.

This post and the “Information Debt” wiki page begin to address some of the problems of information resource maintenance, organization, and institutional practices with regard to information and knowledge resources.

Organization is hard. Really hard. One of the challenges for digital resources is that they lack all of the conventions of /technical-writing/books, which would seem to be freeing: you get more space and you get the opportunity to do really flexible categorization and organization things.

Great, right?

Right.

Really flexible and powerful taxonomic systems, like tagging systems, have a number of problems when applied to large information resources:

  • the relationship between the “scope” of the tag and the specificity of the tag matters a lot. Too much. Problems arise when:

    • tags are really specific, pages include a number of pieces of information, and tags can only map to pages.
    • tags are general and the resources all address similar or related topics.
  • the size of the tag “buckets” matters as well. If there are too many items with a tag, users will not find the tag useful for answering their questions (see the sketch after this list).
  • if your users or applications have added additional functionality using tags, tags begin to break down as a useful taxonomic system. For example, if your system attaches actions to specific tags (i.e., sending email alerts when content gets a specific tag), or if you use a regular notation to simulate a hierarchy, then editors begin adding content to tags not for taxonomic reasons, but for workflow reasons or to trigger the system.

The resulting organization isn’t useful from a textual perspective.

  • If you have to have multiple tagging systems or namespaces.

    Using namespaces is powerful, and helps prevent collisions. At the same time, if your taxonomic system has collisions, this points to a larger problem.

  • If the taxonomy ever has more than one term for a conceptual facet, then the tagging system is broken.
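
As a rough illustration of what checking for these failure modes might look like, here’s a minimal sketch in Python (the threshold and the normalization rule are hypothetical and deliberately crude):

# A sketch only, with hypothetical thresholds and a deliberately crude
# normalization rule: flag tag buckets that have grown too large to be
# useful, and flag likely duplicate terms for the same facet.
from collections import defaultdict


def audit_tags(pages, max_bucket=50):
    """pages maps a page name to the set of tags applied to it."""
    buckets = defaultdict(set)
    for page, tags in pages.items():
        for tag in tags:
            buckets[tag].add(page)

    # buckets with too many members stop being useful for finding anything
    oversized = sorted(t for t, ps in buckets.items() if len(ps) > max_bucket)

    # crude duplicate-facet check: tags that collide after normalization
    normalized = defaultdict(set)
    for tag in buckets:
        normalized[tag.lower().rstrip("s").replace("-", " ")].add(tag)
    duplicates = [sorted(ts) for ts in normalized.values() if len(ts) > 1]

    return {"oversized": oversized, "duplicates": duplicates}


report = audit_tags({
    "intro":     {"git", "Git"},
    "branching": {"git", "workflow"},
    "howto":     {"tag", "tags"},
})
print(report["duplicates"])  # [['Git', 'git'], ['tag', 'tags']]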

These problems tend to be exacerbated as:

  • the resource ages.
  • the number of contributors grows.

There’s a core paradox in tagging systems: to tag content effectively, you need a fixed list of potential tags before you begin tagging content, and you need to be very familiar with the corpus of tagged content *before* beginning to tag content.

And there’s not much you can do to avoid it. To further complicate the problem, it’s essentially impossible to “redo” a taxonomic system for sufficiently large resources given the time requirements for reclassification and the fact that classification systems and processes are difficult to automate.

The prevalence of tagging systems and the promises of easy, quick taxonomic organization are hard to avoid and counteract. As part of the fight against information debt it’s important to draw attention to the failure of broken taxonomy systems. We need, as technical writers, information custodians, and “knowledge workers,” to develop approaches to organization that are easy to implement and less likely to lead to huge amounts of information debt.

Work Logging

Joey Hess’ blog about his work on the git-annex assistant has been a great inspiration to me. Basically, Joey had a very successful Kickstarter campaign to fund work on a very cool tool of his that he wants to expand into a kind of Dropbox replacement. As part of the project, he’s been blogging nearly every day about the work, his progress, and the problems he’s faced.

I really admire this kind of note taking, and think it’s really interesting to see how people progress on cool projects. More than that, I think it’s a cool way to make yourself accountable to the world, and help ensure things get done.

By the same token, when your project work is writing, increasing the daily workload by writing notes and posts rather than actually doing the work is a problem. On the other hand, given the right kind of templates and a good end-of-day routine, it might turn out to be an easy and worthwhile habit.

Anyone else interested in this? Thoughts on using rhizome versus some other area of tychoish.com? So many choices!

In Favor of PDF

This is really a short rant, and should come as a surprise to no one.

I hate DOC files and RTF files, to say nothing of ODF, DOCX, and their ilk, because they have two necessarily conflicting properties:

1. They’re oriented at producing documents on paper. Which is crazy. Paper is an output, but it’s not the only output in common use, so it’s nuts that generic document representation formats would be so tightly coupled with paper.

2. The rendering of the content is editor specific, particularly with regards to display options. If I compile a document and send it to you, I have no guarantee whatsoever about the presentation or display of the document on your system, particularly if I’m not certain that your system is similarly configured with respect to fonts, page breaks, and so forth.

This is particularly idiotic with respect to point 1.

It’s not that PDF is great, or especially usable, but it’s consistent and behaves as expected. Furthermore, it does a good job of appropriately expressing the limitations of paper.

So use PDF and accept no substitutions.