Programming Tutorials

This post is a follow-up to my :doc:`/posts/coding-pedagogy` post. This “series” addresses how people learn to program, the state of the technical materials that support this education process, and the role of programming in technology development.

I’ve wanted to learn how to program for a while, and I’ve been perpetually frustrated by pretty much every lesson or document I’ve ever encountered in this search. This is hyperbolic, but it’s pretty close to the truth. Teaching people how to program is hard, and the materials are written by one of two kinds of people:

  • people who don’t really remember how they learned to program.

Many programming tutorials were written by these kinds of programmers, and the resulting materials tend to be decent in and of themselves, but they fail to actually teach people who don’t already know how to program.

If you already know how to program, or have learned to program in a few different languages, it’s easy to substitute “learning how to program” with “learning how to program in a new language,” because that experience is fresher and easier to understand.

These kinds of materials will teach the novice programmer a lot about programming languages and fundamental computer science topics, but not much of what they really need in order to learn how to write code.

  • people who don’t really know how to program.

People who don’t know how to program tend to assume that you can teach by example, using guided tutorials. You can’t really. Examples are good for demonstrating syntax and procedure, and for answering tactical questions, but they aren’t sufficient for teaching the required higher-order problem-solving skills. Focusing on the concrete aspects of programming syntax, the standard library, and the process for executing code isn’t enough.

These kinds of documents can be very instructive, and outsider perspectives are quite useful, but if the document can’t convey how to solve real problems with code, you’ll be hard-pressed to learn how to write useful programs from these guides.

In essence, we have a chicken and egg problem.


Interlude:

Even six months ago, when people asked me “are you a programmer?” (or engineer,) I’d often object strenuously. Now, I wave my hand back and forth and say “sorta, I program a bit, but I’m the technical writer.” I don’t write code on a daily basis and I’m not very nimble at starting to write programs from scratch, but sometimes when the need arises, I know enough to write code that works, to figure out the best solution to fix at least some of the problems I run into.

I still ask other people to write programs or fix problems I’m having, but it’s usually more because I don’t have time to figure out an existing system that I know they’re familiar with and less because I’m incapable of making the change myself.

Even with these advances, I still find it hard to sit down with a blank buffer and write code from scratch, even if I have a pretty clear idea of what it needs to do. Increasingly, I’ve begun to believe that this is the case for most people who write code, even very skilled engineers.

This will be the subject of an upcoming post.


The solution(s):

1. Teach people how to code by having them debug programs and make trivial modifications to existing code.

People pick up syntax pretty easily, but struggle more with the problem-solving aspects of code. While there are some subtle aspects of syntax, the compiler or interpreter does enough to teach people syntax. The larger challenge is getting people to understand the relationship between their changes and a program’s behavior, and between any single change and the rest of a piece of code.

2. Teach people how to program by getting them to solve actual problems using actual tools, libraries, and packages.

Too often, programming tutorials and examples attempt to be self-contained or unrealistically simple. While this makes sense from a number of perspectives (easier to create, easier to explain, fewer dependency problems for users,) such self-contained code is incredibly uncommon in practice, and it probably leads people to think that a lot of programming revolves around re-implementing solutions to already-solved problems.
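For instance, an exercise grounded in a real task might look something like the following minimal sketch (Python, standard library only; the log file name and the "status" column are hypothetical):

# Count how often each HTTP status code appears in a (hypothetical) CSV access log,
# using only the Python standard library.
import csv
from collections import Counter

def status_counts(path):
    """Return a Counter mapping status codes to request counts."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["status"]] += 1
    return counts

if __name__ == "__main__":
    for status, count in status_counts("access_log.csv").most_common():
        print(status, count)

Even a small exercise like this forces learners to deal with a file they didn’t write, a library they have to read documentation for, and output they have to interpret, which is much closer to real programming work.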

I’m not making an argument about computer science education or formal engineering training, with which I have very little experience and in which I have little interest. As contemporary, technically literate actors in digital systems, most people will find programming relevant.

I’m convinced that many people do a great deal of work that is effectively programming: manipulating tools, identifying and recording procedures, collecting information about the environment, performing analysis, and taking action based on collected data. Editing macros, mail filtering systems, and spreadsheets are obvious examples though there are others.

Would teaching these people how programming worked and how they could use programming tools improve their digital existences? Possibly.

Would general productivity improve if more people knew how to think about automation and were able to do some of their own programming? Almost certainly.

Would having more casual programmers create additional problems and challenges in technology? Yes. These would be interesting problems to solve as well.

Project Orientation

(or my latest attempt to do things in a more “project-oriented” way.)

This post is about recent projects, projects that I’m working on, and how my work has changed in recent months.

A couple of weeks ago, I finally posted all of the content that I’ve been working on for the new, revived Cyborg Institute. While the book on systems administration itself had been mostly done for a while, I’d delayed for two reasons:

  1. I wanted to have a couple of other projects completed to demonstrate that the Institute as a project wasn’t just isolated to book-like objects.
  2. I wanted to have some infrastructure in place to be able to sanely publish the Institute site without using some gnarly content management system.[1]

The end result is that, in addition to the book, I’ve put together a few other projects and documentation. The more exciting thing is that I might do more projects like this in the future.

In addition to a lot of day-job work--forthcoming releases and team growth are eating a lot of my time--I’ve been working on a series of wiki pages (and related blog posts,) that address “information debt,” the debt that accumulates when organizations don’t put resources and energy into maintaining resources and “knowledge.” Expect to hear more on this topic.


The truth is that I really like working on bigger projects. Writing blog posts and participating in online conversations has been very rewarding over the past ~10 years, but I feel like I’ve hit a wall: I’ve written ~830,000 words on tychoish.com, and I’m frustrated that there’s not a lot to show for it:

  • readership is steady, even increasing, but not inspiring,

  • I don’t actually want to work as a blogger, and

  • most importantly, the work I’ve done here doesn’t really build to anything more than a half-dozen or so blog posts.

    While there are themes throughout all of the posts, the work isn’t very rigorous, and it lacks a certain kind of depth.

    So here I am, writing book-like objects about technology that I hope are and will be useful for both technical and non-technical audiences, as well as compiling the little things that I hack on for other people to improve and benefit from, and writing fiction (that I may try to publish conventionally, but I may end up self-publishing using a similar process.) The goal is to:

  • Write things with more rigor, including better citations and research.

  • Work on projects that address topics more comprehensively.

  • Produce, document, and maintain scripts and other programs that I write rather than endlessly critique existing tools and approaches. In short, less talking about stuff and more making stuff.

Let’s see how this goes!


  1. All content management systems are gnarly.

imenu for Markdown

For a while, I’ve been envious of some of the project and file navigation features in emacs for browsing bigger projects/programs. Things like imenu and tags have always seemed awesome, but given that I spend most of my time editing reStructuredText and markdown files (I’m a technical writer), these tools have been distant and not a part of my day-to-day work.

It’s not that it would be impossible to write imenu or etags interfaces for the formats I use regularly, but more that I’ve never gotten around to it until now.

We’re still a ways away on the question of etags, but it turns out that, when I wasn’t looking, rst-mode got imenu support, and with the following little bit of elisp you can get imenu for markdown.

;; imenu patterns for markdown: each entry is (MENU-TITLE REGEXP MATCH-GROUP).
(setq markdown-imenu-generic-expression
      '(("title" "^\\(.*\\)[\n]=+$" 1)   ; setext headings underlined with =
        ("h2-"   "^\\(.*\\)[\n]-+$" 1)   ; setext headings underlined with -
        ("h1"    "^# \\(.*\\)$" 1)
        ("h2"    "^## \\(.*\\)$" 1)
        ("h3"    "^### \\(.*\\)$" 1)
        ("h4"    "^#### \\(.*\\)$" 1)
        ("h5"    "^##### \\(.*\\)$" 1)
        ("h6"    "^###### \\(.*\\)$" 1)
        ("fn"    "^\\[\\^\\(.*\\)\\]" 1)))  ; footnote definitions

;; Install the patterns whenever markdown-mode starts.
(add-hook 'markdown-mode-hook
          (lambda ()
            (setq imenu-generic-expression markdown-imenu-generic-expression)))
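With this in place, M-x imenu (or a front-end like imenu-list or helm-imenu, if you use one) will jump straight to headings and footnote definitions in the current buffer.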

Pretty awesome! I hope it helps you make awesome things.

Practical Branch Layout

I’ve recently gone from being someone who uses git entirely “on my own,” to being someone who uses git with a lot of other people at once. I’ve also had to introduce git to the uninitiated a few times. These have both been notable challenges.

Git is hard, particularly in these contexts: not only are there many concepts to learn, but there’s no single prescribed workflow and a multitude of workflow possibilities. Which is great from a philosophy perspective, and arguably good from a “useful/robust tool” perspective, but horrible from a “best practices” perspective. Hell, it’s horrible from a “consistent” and “sane” practices perspective.

There are a lot of solutions to the “finding a sane practice” problem when working with git and large teams. Patching or reformulating the interface is a common strategy. Legit is a great example of this, but I don’t think it’s enough, because the problem is really one of bad and unclear defaults. For instance:

  • The “master” branch, in larger distributed systems, is highly confusing. If you have multiple remotes (which is common), every remote has its own master branch, and these can all refer to different points in a repository’s history.
  • The name of a local branch does not necessarily correspond to the name of a branch in any specific remote repository. The decoupling of local and remote branches makes sense from a design perspective, but it’s difficult to retain this mapping in your head, and it’s also difficult to talk about branch configurations because your “view of the universe” doesn’t often coincide with anyone else’s.

Here are some ideas:

  1. Have two modes of operation: a maintainer’s mode that resembles current-day git with a few basic tweaks described in later options, and a contributor’s mode that makes the following assumptions:

    • The “mainline” of the project’s development will occur in a branch to which this user only has read-only access.
    • Most of this user’s work will happen in isolated topic branches.
  2. Branch naming enforcement:

    • All branches will be uniquely named, relative to a username and/or hostname. This will be largely transparent to the history, but will make all conversations about branches less awkward. This is basically how branches work now, with [remote]/[branch], except that all local branches need self/[branch], and the software should make this more transparent. (See the naming sketch after this list.)
    • Remote branches will implicitly have local tracking branches with identical names. You could commit to any of the local tracking branches, and pull will have the ability to convert your changes to a self/[branch] if needed.
    • All local branches, if published, will map to a single remote branch. One remote will be the user’s default “publication target” for branches.

    This is basically what the origin remote does today, so again, this isn’t a major change.

  3. When you run git clone, the cloned repository should become the upstream remote, not the origin.

    The origin remote should be the default “place to publish my work,” and would be configured separately.

  4. Minor tweaks.

    • Map local branches directly to remote branches.
    • Be able to specify a remote branch as a “mirror” of another branch.
    • Make cherry-picked commits identifiable by their original commit-id internally. The goal is to push people to cherry-pick commits as much as possible to construct histories without needing to rebase.[1]
    • Have submodule support automatically configured, without fussing.
    • Have better functions for cleaning up branch cruft, particularly on remotes.
    • Have some sort of configurable “published pointer,” that users can use as a safe block against rebases before a given point.

    The goals here are to:

    • Make working with git all about branches and iteration rather than a sequence of commits.
    • Provide good tools to prevent people from rebasing commits, which is always confusing and rarely actually required.
    • Make branch names as canonical as possible. The fact that there can be many possible names for the same thing is awful.
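To make the branch-naming idea in point 2 above concrete, here is a minimal sketch (in Python; the publish_name helper and the user/branch names are hypothetical illustrations, not a proposal for git’s actual interface) of how local self/[branch] names could map to per-user names on a shared remote:

# Sketch: translate a local "self/<branch>" name into the per-user name it
# would get when published, so every branch has one canonical name to talk about.
def publish_name(local_branch, username):
    prefix = "self/"
    if local_branch.startswith(prefix):
        return "%s/%s" % (username, local_branch[len(prefix):])
    # Branches tracking someone else's work keep their qualified name as-is.
    return local_branch

assert publish_name("self/branch-layout", "tychoish") == "tychoish/branch-layout"
assert publish_name("upstream/master", "tychoish") == "upstream/master"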

Who’s with me? We can sort out the details in the comments, if you want.


  1. To make this plausible, GitHub needs to allow cherry-picked commits to close a pull request.

Hypertextuality

I recently took some of my writing time to create a makefile (Novel Makefile) to manage work I hope to be doing on a new novel project. I’ve started outlining and researching the story in earnest after having spent the past couple of years talking about it, and I think writing will commence soon. In another post I’d like to write up some thoughts on the tooling and technology of writing non-technical/non-manual long-form work.

This post, drawing on some time spent buried deep in production, is about the state of (conceptually) longer-form work in digital media. Or, at least, a brief commentary on same.


The tools that I use to write technical materials do all sorts of cool things, like:

  • provide instant cross referencing,
  • generate great indexes, and
  • automatically generate and link glossaries.

This is not particularly unusual, and in fact Sphinx is somewhat under-featured relative to other documentation generation systems like DocBook.[1]

And yet people publish ebooks that are virtually identical to paper books. Ebooks seem to say “this electronic book is the best facsimile of a paper book that we can imagine right now,” while totally ignoring anything more that a hypertext rightfully might be.

I enjoy reading ebooks, just as I have enjoyed reading paper books, but mostly because ebooks basically are paper books. I’ve written posts in the past challenging myself, and fiction writers in general, to actually do hypertext rather than recapitulating prior modalities in digital form.

At various points I’ve thought that wikis might be a good model of how to do hypertext, because the form is structurally novel. Anymore, I don’t think that this is the case: wikis are unstructured and chaotic, and I’ve come to believe that the secret to hypertext is structure. There are so many possibilities in hypertext, and I think much experimentation in hypertext has attempted to address the chaos of this experience. This does serve to highlight the extent to which “the future is here,” but it obscures the fact that structure makes narratives understandable. Think about how much great, new, innovative (and successful!) fiction in the past decade (or so) is not structurally experimental or chaotic. (Answer: there’s a lot of it.)

The not-so-secret of hypertext is, I suspect, tooling: without really good tools, the mechanics of producing a complex, interactive textual experience[2] are difficult for a single writer, or even a small group of writers. Most tools that manage the publication and writing of text are not suited to helping with the production of large, multi-page, multi-component texts. One potential glimmer of hope is that the tools for developing programs (IDEs, build systems, compilers, templating systems, introspection tools, pattern matching, etc.) are well developed and could be modified for use in text production.

The second not-so-secret of hypertext is probably that hypertext is an evolution of text production and consumption, not a revolution. Which only seems reasonable. We have the technology now to produce really cool text products. While the tooling needs to get better, the literature needs to do some catching up.

Let’s start making things!


  1. It’s not that Sphinx is “bad,” but it’s clearly designed for a specific kind of documentation project, and if you stray too far outside of those bounds, or need formats that aren’t quite supported, then you end up without a lot of recourse. Having said that, most “normal,” well-supported projects--documentation or otherwise--will only very rarely hit upon an actual limitation of Sphinx itself.

  2. To be clear, I’m partial to the argument that today’s computer games, particularly role-playing games, are the things that the futurists of the 1960s and 70s (e.g. Theodor Holm Nelson) called “hypertext.”

Denormalize Access Control

Access control is both immensely useful and incredibly broken.

Access control, or the ability to constrain access to data and programs in a shared system, is the only way that we, as users of shared systems, can maintain our identities, personal security, and privacy. Shared systems--databases, file servers, social networking sites, virtualized computing systems, vendor accounts, control panels, management tools, and so forth--all need robust, flexible, granular, and scalable access control tools.

Contemporary access control tools--access control lists (ACLs) and access control groups--indeed the entire conceptual framework for managing access to data and resources, don’t work. From a theoretical perspective, ACLs that express a relationship between users or groups of users and data or resources represent a parsimonious solution to the “access control problem:” if properly deployed, only those with access grants will have access to a given resource.

In practice, these kinds of relationships do not work. Typically, the relationships between data and users are rich and complex, and different users need to be able to do different things with different resources. Some users need “read-only” access, others need partial read access, and some need read and write access but only to a subset of a resource. While ACL systems can impose these kinds of restrictions, the access control abstraction doesn’t match the data abstraction or the real-world relationships that it supposedly reflects.

Compounding this problem are two important factors:

  1. Access control needs change over time in response to social and cultural shifts among the users and providers of these resources.
  2. There are too many pieces of information or resources in any potential shared system to allocate access on a per-object or per-resource basis, and the volume of objects and resources is only increasing.

Often, many objects or resources have the same or similar access control patterns, which leads to the “group” abstraction. Groups make it possible to describe a specific access control pattern that applies to a number of objects, and to connect this pattern with specific resources.

Conceptual deficiencies:

  • There’s a volume problem. Access control data represents a many-to-many-to-many relationship. There are many different users and (nested) groups, many different kinds of access controls that systems can grant, and many different (nested) resources. This would be unmanageably complex without the possibility of nesting, but nesting means that the relationships between resources, and between groups and users, also matter; with nesting, reasoning about access control becomes effectively impossible.

  • ACLs and group-based access control don’t account for the fact that access must be constantly evolving, and current systems don’t contain support for ongoing maintenance. (We need background processes that go through and validate access control consistency.) Also, all access control grants must have some capacity for automatic expiration.

  • Access control requirements and possibilities shift as data becomes more or less structured, and as data use patterns change. The same conceptual framework that works well for access control in the context of the data stored in a relational database doesn’t work as well when the data in question is a word processing document, an email folder, or a spreadsheet.

    The fewer people that need access to a single piece of data, the simpler the access control system can be. While this seems self-evident, it also means that access control systems are difficult to test in the really large, complex systems in which they’re used.

  • Group-based access control systems, in effect, normalize data about access control in an effort to speed up data access times. While this performance is welcome, in most cases granting access via groups leads to an overly liberal distribution of access control rights. At once, it’s too difficult to understand “who has access to what” and too easy to add people to groups that give them more access than they need.

So the solution:

  1. Denormalize all access control data,
  2. don’t grant access to groups, and
  3. forbid inheritance.

This is totally counter to the state of the art. In most ways, normalized access control data, with role/group-based access control, and complex inheritance are the gold standard. Why would it work?

  • If you have a piece of data, you will always be able to determine who has access to data, without needing to do another look-up.

  • If you can deactivate credentials, then a background process can go through and remove access without causing a large security problem. (For partial removes, you would freeze an account, let the background process modify access control and then unfreeze the account.)

    The downside is that, potentially, in a large system, it may take a rather long time for access grants to propagate to users. Locking user accounts makes the system secure/viable, but doesn’t make the process any quicker.

    As an added bonus, these processes could probably be independent and wouldn’t require any sort of shared state or lock, which means many such operations could run in parallel, and they could stop and restart at will.

  • The inheritance option should be fuzzy. Some sort of “bucket-based” access control should be possible, if there’s a lot of data with the same access control rules and users.

    Once things get more complex, buckets are the wrong metaphor, and you should use granular controls everywhere.
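As a concrete illustration of the first two bullets above, a denormalized scheme might store grants directly on each object, something like this minimal sketch (in Python; the field names, the in-memory document, and the expiration policy are all hypothetical):

import time

# Sketch: store access grants directly on each object -- no groups, no
# inheritance -- so answering "who has access to this?" never needs a second lookup.
document = {
    "id": "doc-42",
    "body": "...",
    "grants": [
        # Every grant names a single user, a specific capability, and an expiry.
        {"user": "sam", "access": "read", "expires": time.time() + 30 * 86400},
        {"user": "kim", "access": "write", "expires": time.time() + 7 * 86400},
    ],
}

def allowed(obj, user, access, now=None):
    """Check a user's access using only data stored with the object itself."""
    now = now or time.time()
    return any(
        g["user"] == user and g["access"] == access and g["expires"] > now
        for g in obj["grants"]
    )

def sweep_expired(obj, now=None):
    """The kind of background pass described above: drop stale grants."""
    now = now or time.time()
    obj["grants"] = [g for g in obj["grants"] if g["expires"] > now]

print(allowed(document, "sam", "read"))   # True
print(allowed(document, "sam", "write"))  # False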

Problems/Conclusion:

  • Denormalization might fix the problems with ACLs and permissions systems, but it doesn’t fix the problems with distributed identity management.

    As a counterpoint, this seems like a cryptography management problem.

  • Storing access control information with data means that it’s difficult to take a user and return a list of what these credentials have access to.

    In truth, centralized ACL systems are subject to this flaw as well.

  • A huge part of the problem with centralized ACLs derives from nesting, and from the fact that we tend to model and organize data in tree-like structures that often run counter to the organization of access control rights. As a result, access control tools end up feeling arbitrary.

Taxonomic Failure

I tell people that I’m a professional writer, but this is a bit misleading, because what I really do is figure out how to organize information so that it’s useful and usable. Anyone, with sufficient training and practice, can figure out how to convey simple facts in plain language, but figuring out how to organize simple facts in plain language into a coherent text is the more important part of my job and work.

This post, and the “Information Debt” wiki page, begin to address the problems of information resource maintenance, organization, and institutional practices with regard to information and knowledge resources.

Organization is hard. Really hard. One of the challenges for digital resources is that they lack all of the conventions of /technical-writing/books, which would seem to be freeing: you get more space, and you get the opportunity to do really flexible categorization and organization.

Great, right?

Right.

Really flexible and powerful taxonomic systems, like tagging systems have a number of problems when applied to large information resources:

  • the relationship between the “scope” of a tag and the specificity of the tag matters a lot. Too much. Problems arise when:
  • tags are really specific, pages include a number of pieces of information, and tags can only map to pages.
  • tags are general and the resources all address similar or related topics.
  • the size of the tag “buckets” matters as well. If there are too many items with a tag, users will not find the tag useful for answering their questions.
  • if your users or applications have added additional functionality using tags, tags begin to break down as a useful taxonomic system. For example, if your system attaches actions to specific tags (e.g. sending email alerts when content gets a specific tag,) or if you use a regular notation to simulate a hierarchy, then editors begin adding content to tags not for taxonomic reasons, but for workflow reasons or to trigger the system.

The resulting organization isn’t useful from a textual perspective.

  • If you have to have multiple tagging systems or namespaces.

    Using namespaces is powerful, and helps prevent collisions. At the same time, if your taxonomic system has collisions, this points to a larger problem.

  • If the taxonomy ever has more than one term for a conceptual facet, then the tagging system is broken.
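One way to see the bucket-size and specificity problems above in an existing corpus is to audit the tags directly; a minimal sketch (in Python; the pages mapping and the thresholds are hypothetical):

from collections import Counter

# Sketch: flag tags that are too general (huge buckets) or too specific (singletons).
# `pages` is a hypothetical mapping of page name -> list of tags.
pages = {
    "install-guide": ["setup", "linux"],
    "linux-notes": ["linux"],
    "faq": ["setup", "linux", "faq"],
}

def audit_tags(pages, too_big=100):
    buckets = Counter(tag for tags in pages.values() for tag in tags)
    too_general = [t for t, n in buckets.items() if n >= too_big]
    too_specific = [t for t, n in buckets.items() if n == 1]
    return too_general, too_specific

print(audit_tags(pages))

An audit like this won’t fix a broken taxonomy, but it makes the scale of the problem visible before attempting a reorganization.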

These problems tend to get worse as:

  • the resource ages.
  • the number of contributors grows.

There’s a core paradox in tagging systems: to tag content effectively, you need a fixed list of potential tags before you begin tagging, and you need to be very familiar with the corpus of tagged content before beginning to tag it.

And there’s not much you can do to avoid it. To further complicate the problem, it’s essentially impossible to “redo” a taxonomic system for sufficiently large resources given the time requirements for reclassification and the fact that classification systems and processes are difficult to automate.

The prevalence of tagging systems and the promises of easy, quick taxonomic organization are hard to avoid and counteract. As part of the fight against information debt it’s important to draw attention to the failure of broken taxonomy systems. We need, as technical writers, information custodians, and “knowledge workers,” to develop approaches to organization that are easy to implement and less likely to lead to huge amounts of information debt.

Work Logging

Joey Hess' blog of his work on the git-annex assistant has been a great inspiration to me. Basically, Joey had a very successful Kickstarter campaign to fund his work on a very cool tool of his that he wants to expand into a kind of Dropbox replacement. As part of the project, he’s been blogging nearly every day about the work, his progress, and the problems he’s faced.

I really admire this kind of note taking, and think it’s really interesting to see how people progress on cool projects. More than that, I think it’s a cool way to make yourself accountable to the world, and help ensure things get done.

By the same token, when your project work is writing, increasing the daily workload by writing notes/posts rather than actually doing the work is a problem. On the other hand, given the right kind of templates and a good end-of-day habit, it might be easy, and a good habit.

Anyone else interested in this? Thoughts on using rhizome versus some other area of tychoish.com? So many choices!