New Beginnings: Deciduous Platform

I left my job at MongoDB (8.5 years!) at the beginning of the summer, and started a new job at the beginning of the month. [1] I'll be writing and posting more about my new gig, career paths in general, reflections on what I accomplished on my old team, the process of interviewing as a software engineer, as well as the profession and industry over time. For now, though, I want to write about one of the things I've been working on this summer: making a bunch of the open source libraries that I worked on more generally usable. I've been calling this the deciduous platform, [2] which now has its own GitHub organization! So it must be real.

The main modification in these forks, aside from adding a few features that had been on my list for a while, has been to update the buildsystem to use go modules [3] and rewrite the history of the repository to remove all of the old vendoring. I expect to continue development on some aspects of these over time, though the truth is that these libraries were quite stable and were nearly in maintenance mode anyway.

Background

The team was responsible for a big monolith (or so) application: development had begun in 2013, which was early for Go, and while everything worked, it was a bit weird. My efforts when I joined in 2015 focused mostly on stabilization, architecture, and reliability. While the application worked, mostly, it was clear that it suffered from a few problems, which I believe were the result of originating early in the history of Go. First, because no one had tried to write big applications yet, the patterns weren't well established, and so the team ended up writing code that worked but that was difficult to maintain, and ended up with bespoke solutions to a number of generic problems like running workloads in the background or managing APIs. Second, Go's standard library tends to be really solid, but also tends towards being a little low level for most day-to-day tasks, so things like logging and process management end up requiring more code [4] than is reasonable.

I taught myself to write Go by working on a logging library, and worked on a distributed queue library. One of the things that I realized early was that breaking the application into "microservices" would have been both difficult and offered minimal benefit, [5] so I went with the approach of creating a well factored monolith, which included a lot of application specific work, but also building a small collection of libraries and internal services to provide useful abstractions and separations for application developers and projects.

This allowed for a certain level of focus, both for the team creating the infrastructure, but also for the application itself: the developers working on the application mostly focused on the kind of high level core business logic that you'd expect, while the infrastructure/platform team really focused on these libraries and various integration problems. The focus wasn't just organizational: the codebases became easier to maintain and features became easier to develop.

This experience has led me to think that architecture decisions may not be well captured by the monolith/microservice dichotomy, but rather there's this third option that centers on internal architecture, platforms, and the possibility for developer focus and velocity.

Platform Overview

While there are 13 or so repositories in the platform, really there are 4 major libraries: grip, a logging library; jasper, a process management framework; amboy, a (possibly distributed) worker queue; and gimlet, a collection of tools for building HTTP/REST services.

The tools all work pretty well together, and combine to provide an environment where you can focus on writing the business logic for your HTTP services and background tasks, with minimal boilerplate to get it all running. It's pretty swell, and makes it possible to spin up (or spin out) well factored services with similar internal architectures, and robust internal infrastructure.

I wanted to write a bit about each of the major components, addressing why I think these libraries are compelling and the kinds of features that I'm excited to add in the future.

Grip

Grip is a structured-logging friendly library, and is broadly similar to other third-party logging systems. There are two main underlying interfaces, representing logging targets (Sender) and messages, as well as a higher level "journal" interface for use during programming. It's pretty easy to write new message types or backends, which means you can use grip to capture all kinds of arbitrary messages in a consistent manner, and also send those messages wherever they're needed.
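To make the split between targets, messages, and the journal concrete, here's a minimal sketch in plain Go. Every name in it (`Message`, `Sender`, `Journal`, `fields`, `memorySender`) is a hypothetical illustration of the idea, not grip's actual API, and the real interfaces are almost certainly richer.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Message is a hypothetical stand-in for a loggable payload: anything that
// can render itself as a string.
type Message interface {
	String() string
}

// Sender is a hypothetical stand-in for a logging target.
type Sender interface {
	Send(Message)
}

// fields is a structured message: key/value pairs rendered in sorted key
// order so the output is stable.
type fields map[string]interface{}

func (f fields) String() string {
	keys := make([]string, 0, len(f))
	for k := range f {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%v", k, f[k]))
	}
	return strings.Join(parts, " ")
}

// memorySender collects rendered messages in memory; a real target might
// write to a file or forward to a remote service instead.
type memorySender struct{ lines []string }

func (m *memorySender) Send(msg Message) { m.lines = append(m.lines, msg.String()) }

// Journal is the higher level interface that application code calls.
type Journal struct{ targets []Sender }

// Log fans a message out to every configured target.
func (j *Journal) Log(msg Message) {
	for _, t := range j.targets {
		t.Send(msg)
	}
}

func main() {
	mem := &memorySender{}
	j := &Journal{targets: []Sender{mem}}
	j.Log(fields{"op": "startup", "port": 8080})
	fmt.Println(mem.lines[0])
}
```

Because the journal only knows about the two small interfaces, adding a new backend or a new kind of message means implementing one method, which is the property the paragraph above describes.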

Internally, it's quite nice to be able to just send messages to specific log targets, using configuration within an application rather than needing to operationally manage log output. Operations folks shouldn't be stuck dealing with just managing logs, after all, and it's quite nice to just send data directly to Splunk or Sumologic. We also used the same grip fundamentals to send notifications and alerts to Slack channels, email lists, or even to create Jira Issues, minimizing the amount of clunky integration code.

There are some pretty cool projects in and around grip:

  • Support for additional logging targets. The deciduous version of grip adds Twitter as an output format as well as creating desktop notifications (e.g. growl/libnotify,) but I think it would also be interesting to add fluent/logstash connections that don't have to transit via standard error.
  • While structured logging is great, I noticed that we ended up logging messages automatically in the background as a method of metrics collection. It would be cool to be able to add some kind of "intercepting sender" that handled some of these structured metrics, and was able to expose this data in a format that the conventional tools these days (Prometheus and others) can handle. Some of this code would clearly need to be in Grip, and other aspects clearly fall into other tools/libraries.

Amboy

Amboy is an interface for doing things with queues. The interfaces are simple, and you have:

  • a queue that has some way of storing and dispatching jobs.
  • implementations of jobs, which are responsible for executing your business logic, along with a base implementation that you can easily compose into your job types. All you need to implement, really, is a Run() method.
  • a queue "group" which provides a higher level abstraction on top of queues to support segregating workflows/queues in a single system to improve quality of service. Group queues function like other queues but can be automatically managed by the processes.
  • a runner/pool implementation that provides the actual thread pool.

There's a type registry for job implementations and versioning in the schema for jobs so that you can safely round-trip a job between machines and update the implementation safely without ensuring the queue is empty.

This turns out to be incredibly powerful for managing background and asynchronous work in applications. The package includes a number of in-memory queues for managing workloads in ephemeral utilities, as well as a distributed MongoDB-backed queue for running multiple copies of an application with one or more shared queues. There's also a layer of management tools for introspecting and managing the state of jobs.

While Amboy is quite stable, there is a small collection of work that I'm interested in:

  • a queue implementation that stores jobs in a local Badger database on-disk to provide single-system restartability for jobs.
  • a queue implementation that stores jobs in PostgreSQL, mirroring the MongoDB job functionality, to be able to meet job backends.
  • queue implementations that use messaging systems (Kafka, AMQP) for backends. There exists an SQS implementation, but all of these systems have less strict semantics for process restarts than the database options, and a database can easily handle on the order of a hundred thousand jobs an hour.
  • changes to the queue API to remove a few legacy methods that return channels instead of iterators.
  • improve the semantics for closing a queue.

While Amboy has provisions for building architectures with workers running on multiple processes, rather than having queues running multiple threads within the same process, it would be interesting to develop more fully-fledged examples of this.

Jasper

Jasper provides a high level set of tools for managing subprocesses in Go, adding a highly ergonomic API (in Go,) as well as exposing process management as a service to facilitate running processes on remote machines. Jasper also manages/tracks the state of running processes, and can reduce pressures on calling code to track the state of processes.

The package currently exposes Jasper services over REST, gRPC, and MongoDB's wire protocol, and there is also code to support using SSH as a transport so that you don't need to expose these remote services publicly.

Jasper is, perhaps, the most stable of the libraries, but I am interested in thinking about a couple of extensions:

  • using jasper as PID 1 within a container to be able to orchestrate workloads running on containers, and contain (some) support for lower level container orchestration.
  • write configuration file-based tools for using jasper to orchestrate buildsystems and distributed test orchestration.

I'm also interested in cleaning up some of the MongoDB-specific code (i.e. the code that downloads MongoDB versions for use in test harnesses,) and perhaps re-envisioning that as client code that uses Jasper rather than as a part of Jasper.

Gimlet

I've written about gimlet here before when I started the project, and it remains a pretty useful and ergonomic way to define and register HTTP APIs. In the past few years, it's grown to add more authentication features, as well as a new "framework" for defining routes. This makes it possible to define routes by implementing an interface that:

  • makes it very easy to produce paginated routes, and provides some helpers for managing content
  • separates the parsing of inputs from executing the results, which can make route definitions easy to test without integration tests.

In the future, I'd like to rehome this functionality on top of the chi router. The current implementation uses Negroni and gorilla mux (though neither is exposed in the interface), but I think it'd be nice to have this be optional, and chi looks pretty nice.

Other Great Tools

The following libraries are definitely smaller, but I think they're really cool:

  • birch is a builder for programmatically building BSON documents, and MongoDB's extended JSON format. It's built upon an earlier version of the BSON library. While it's unlikely to be as fast at scale for many operations (like finding a key in a document), the interface is great for constructing payloads.
  • ftdc provides a way to generate (and read) MongoDB's diagnostic data format, which is a highly compressed timeseries data format. While this implementation could drift from the internal implementation over time, the format and tool remain useful for arbitrary timeseries data.
  • certdepot provides a way to manage a certificate authority with the certificates stored in a centralized store. I'd like to add other storage backends over time.

And more...

Notes

[1]Though, given my usual publication lag, I'm writing this a couple days before starting.
[2]My old team built a continuous integration tool called evergreen which is itself a pun (using "green" to indicate passing builds, most CI systems are not ever-green.) Many of the tools and libraries that we built got names with tree puns, and somehow "deciduous" seems like the right plan.
[3]For an arcane reason, all of these tools had to build with an old version of Go (1.10) that didn't support modules, so we had an arcane and annoying vendoring solution that wasn't compatible with modules.
[4]Go tends to be a pretty verbose language, and I think most of the time this creates clarity; however, for common tasks it has the feeling of offering a poor abstraction, or forcing you to write duplicated code. While I don't believe that more-terse-code is better, I think there's a point where the extra verbosity for rote operations just creates the possibility for more errors.
[5]The team was small, and as an internal tools team, unlikely to grow to the size where microservices offered any kind of engineering efficiency (at some cost,) and there weren't significant technical gains that we could take advantage of: the services of the application didn't need to be globally distributed and the boundaries between components didn't need to scale independently.

What is it That You Do?

The longer that I have this job, the more difficult it is to explain what I do. I say, "I'm a programmer," and you'd think that I write code all day, but that doesn't map onto what my days look like, and the longer I do this, it seems, the less code I actually end up writing. I think the complexity of this seemingly simple question grows from the fact that building software involves a lot more than writing code, particularly as projects become more complex.

I'd venture to say that most code is written and maintained by one person, and typically used by a very small number of people (often on behalf of many more people,) though this is difficult to quantify. Single maintainer software is still software, and there are lots of interesting problems, but as much as anything else I'm interested in the problems adjacent to multi-author code-bases and multi-operator software development. [1]

Fundamentally, I'm interested in the following questions:

  • How can (sometimes larger) groups of people collaborate to build something that's bigger than the scope of any of their work?
  • How can we build software in a way that lets individual developers focus most of the time on the features and concerns that are the most important to them and their users? [2]

The software development process, regardless of the scope of the problem, has a number of different aspects:

  • Operations: How does this software execute, and how do we know that it's successful when it runs?
  • Behavior: What does it do, and how do we ensure it has the correct behavior?
  • Interface: How will users interact with the process, and how do we ensure a consistent experience across versions and users' environments?
  • Product: Who are the users? What features do they want? Which features are the most important?

Sometimes we can address these questions by writing code, but often there's a lot of talking to users, other developers, and other people who work in software development organizations (e.g. product managers, support, etc.) not to mention writing a lot of English (documentation, specs, and the like.)

I still don't think that I've successfully answered the framing question, except to paint a large picture of what kinds of work go into making software, and to describe some of my specific domain interests. This ends up boiling down to:

  • I write a lot of documents describing new features and improvements to our software. [product]
  • I think a lot about how our product works as it grows (scaling), and what kinds of changes we can make now to make that process more smooth. [operations]
  • How can I help the more junior members of my team focus on the aspects of their jobs that they enjoy the most, and help illustrate broader contexts to them? [mentoring]
  • How can we take the problems we're solving today and build the solution that balances the immediate requirements with longer term maintainability and reuse? [operations/infrastructure]

What I'm actually spending my time on boils down to reading a bunch of code, meeting with my teammates, and meeting with users (who are also coworkers.) And sometimes writing code. If I'm lucky.

[1]I think the single-author and/or single-operator class is super interesting and valuable, particularly because it includes a lot of software outside of the conventional disciplinary boundaries of software and includes things like macros, spreadsheets, small scale databases, and IT/operations ("scripting") work.
[2]It's very easy to spend most of your time as a developer writing infrastructure code of some sort, to address either internal concerns (logging, data management and modeling, integrating with services) or project/process automation (build, test, operations) concerns. Infrastructure isn't bad, but it isn't the same as working on product features.

The Case for Better Build Systems

A lot of my work, these days, focuses on figuring out how to improve how people develop software in ways that reduce the amount of time developers have to spend doing work outside of development and that improve the quality of their work. This post has been sitting in my drafts folder for the last year, and does a good job of explaining how I locate my work and makes a case for high quality generic build system tooling that I continue to feel is compelling.


Incidentally, it turns out that I wrote an introductory post about buildsystems 6 years ago. Go past me.

Canonically, build systems describe the steps required to produce artifacts as a system (graph) of dependencies. [1] These dependencies are between source files (code) and artifacts (programs and packages), with intermediate artifacts, all in terms of the files they are or create. Though different development environments, programming languages, and kinds of software have different.

While the canonical view is that "build systems are about producing files," the truth is that the challenge of contemporary _software_ development isn't really just about producing files. Everything from test automation to deployment is something that we can think about as a kind of build system problem.

Let's unwind for a moment. The work of "making software," breaks down into a collection of--reasonably disparate--tasks, which include:

  • collecting requirements (figuring out what people want,)
  • project planning (figuring out how to break larger collections of functionality into more reasonable units.)
  • writing new code in existing code bases.
  • exploring unfamiliar code and making changes.
  • writing tests for code you've recently written, or areas of the code base that have recently changed.
  • rewriting existing code with functionally equivalent code (refactoring,)
  • fixing bugs discovered by users.
  • fixing bugs discovered by an automated test suite.
  • releasing software (deploying code.)

Within these tasks developers do a lot of little experiments and tests: make a change, then see what its impact is by doing something like compiling the code, running the program, or running a test program. The goal of developer productivity projects, therefore, is to automate these processes and shorten the time it takes to do any of these tasks, in particular the feedback loop between "making a change" and seeing if that change had an impact. The more complex the system that you're developing, with regards to distinct components, dependencies, target platforms, compilation model, and integrations, the more time you spend in any one of these loops and the less productive you can be.

Build systems are uniquely positioned to support the development process: they're typically purpose built per-project (sharing common infrastructure,) most projects already have one, and they provide an ideal environment for the kind of incremental development of additional functionality and tooling. The model of build systems: the description of processes in terms of dependency graphs and the optimization for local environments means.

The problem, I think, is that build systems tend to be pretty terrible, or at least many suffer a number of classic flaws:

  • implicit assumptions about the build or development environment which make it difficult to start using.
  • unexpressed dependencies on services or systems that the build requires to be running in a certain configuration.
  • improperly configured dependency graphs which end up requiring repeated work.
  • incomplete expression of dependencies which require users to manually complete operations in particular orders.
  • poorly configured defaults which make for overly complex invocations for common tasks.
  • operations distributed among a collection of tools with little integration so that compilation, test automation, release automation, and other functions.

By improving the quality, correctness, and usability of build systems, we:

  • improve the experience for developers every day,
  • make it really easy to optimize basically every aspect of the development process,
  • reduce the friction for including new developers in a project's development process.

I'm not saying "we need to spend more time writing build automation tools" (like make, ninja, cmake, and friends,) or that the existing ones are bad and hard to use (they, by and large, are,) but that they're the first and best hook we have into developer workflows. A high quality, trustable, tested, and easy to use build system for a project makes development easier, makes continuous integration easy and maintainable, and ultimately improves the ability of developers to spend more of their time focusing on important problems.

[1]Ideally build systems describe a directed acyclic graph, though many projects have buildsystems with cyclic dependency graphs that they ignore in some way.

Three Way Merge Script

Note: This is an old post about a script I wrote a few months ago about a piece of code that I'm no longer (really) using. I present it here as an archival piece with a boatload of caveats. Enjoy!

I have a problem that I think is not terribly unique: I have a directory of files and I want to maintain two distinct copies of these files at once, and I want a tool that looks at both directories and makes sure they're up to date. That's all. Turns out nothing does exactly that, so I wrote a hacked up shell script, and you can get it from the code section:

merge-script

I hope you enjoy!

Background

You might say, "why not just use git to take care of this," which is fair. The truth is that I don't really care about the histories as long as there's revision. Here's the situation:

I keep a personal ikiwiki instance for all of my notes, tasks, and project stuff. There's nothing revolutionary, and I even use deft, dired, and some hacked up lisp to do most of the work. But I also work on a lot of projects that have their own git repositories and I want to be able to track the notes of some of those files in those repositories as well.

Conflicts.

There are some possible solutions:

1. Use hard links so that both files will point at the same data on disk.

Great idea, but it breaks on multiple systems. Even if it might have worked in this case, it frightens me to have such fragile systems.

Note: the more I play with this, the less suitable I think that it might be for multi-system use. If one or both of the sides is in a git repo, and you make changes locally and then pull changes in from a git upstream, the git files may look newer than the files that you changed. A flaw.

2. Only edit files in one repository or the other, and have a pre-commit hook, or similar, that copies data from the new system to the old system.

I rejected this because I thought I'd have a hard time enforcing this behavior.

3. Write a script that uses diff3 to merge (potential) changes from both sources of changes.

This is what I did.

The script actually uses the merge command which is a wrapper around diff3 from rcs. shrug.
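The script itself isn't reproduced here, but the three-way idea can be sketched at the whole-file level. The sketch below assumes you keep a copy of the last synced version of each file as a common base, which is an assumption for illustration and not necessarily how the script finds its ancestor; `Decide` and the `Action` values are hypothetical names. Because it compares contents against the base instead of comparing timestamps, it sidesteps the "git files may look newer" flaw described in the note above, and it only falls back to a line-level tool like merge/diff3 when both sides changed.

```go
package main

import "fmt"

// Action says what to do to bring two copies of a file back in sync.
type Action string

const (
	InSync     Action = "in sync"
	TakeTheirs Action = "copy theirs over mine"
	TakeMine   Action = "copy mine over theirs"
	NeedsMerge Action = "both changed: run a line-level merge"
)

// Decide compares both copies against base, the last version that was known
// to be identical on both sides.
func Decide(base, mine, theirs string) Action {
	switch {
	case mine == theirs:
		// Either nothing changed or both sides made the same change.
		return InSync
	case mine == base:
		// Only the other side changed.
		return TakeTheirs
	case theirs == base:
		// Only this side changed.
		return TakeMine
	default:
		// Both sides diverged from the base in different ways.
		return NeedsMerge
	}
}

func main() {
	base := "buy milk\n"
	fmt.Println(Decide(base, base, base))
	fmt.Println(Decide(base, base, "buy milk\ncall bank\n"))
	fmt.Println(Decide(base, "buy oat milk\n", base))
	fmt.Println(Decide(base, "buy oat milk\n", "buy milk\ncall bank\n"))
}
```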

Beyond my somewhat trivial and weird use-case, I actually think that this script is more useful for the following situation:

You use services like Dropbox as a way of getting data onto mobile devices (say,) but you want the canonical version of the file to live in a git repository on your system.

This is the script for you.

I hope you enjoy it!

Today's Bottleneck

Computers are always getting faster. From the perspective of the casual observer it may seem like every year all of the various specs keep going up, and systems are faster. [1] In truth, progress isn't uniform across all systems and subsystems, and thinking about this progression of technology gives us a chance to think about the constraints that developers [2] and other people who build technology face.

For most of the past year, I've used a single laptop for all of my computing work, and while it's been great, in this time I lost touch with the comparative speed of systems. No great loss, but I found myself surprised to learn that all computers did not have the same speed: it wasn't until I started using other machines on a regular basis that I remembered that hardware could affect performance.

For most of the past decade, processors have been fast. While some processors are theoretically faster and some have other features like virtualization extensions and better multitasking capacities (i.e. hyperthreading and multi-core systems) the improvements have been incremental at best.

Memory (RAM) manages to mostly keep up with the processors, so there's no real bottleneck between RAM and the processor. Although RAM capacities are growing, at current volumes extra RAM just means services/systems that had to be distributed given RAM density can all run on one server. In general: "ho hum."

Disks are another story altogether.

While disks got faster over this period, they didn't get much faster, and so for a long time disks were the bottleneck in computing speed. To address this problem, a number of things changed:

  • We designed systems for asynchronous operation.

Basically, folks spilled a lot of blood and energy to make sure that systems could continue to do work while waiting for the disk to read or write data. This involves using a lot of event loops, queuing systems, and so forth.

These systems are really cool; the only problem is that it means that we have to be smarter about some aspects of software design and deployment. This doesn't fix the tons of legacy software sitting around, or the fact that a lot of tools and programmers are struggling to keep up.

  • We started to build more distributed systems so that any individual spinning disk is responsible for writing/reading less data.

  • We hacked disks themselves to get better performance.

    There are some ways you can eke out a bit of extra performance from spinning disks: namely RAID-10, hardware RAID controllers, and using smaller platters. RAID approaches use multiple drives (4) to provide simple redundancy and roughly double performance. Smaller platters require less movement of the disk arm, and you get a bit more out of the hardware.

Now, with affordable solid state disks (SSDs,) all of these disk related speed problems are basically moot. So what are the next bottlenecks for computers and performance?

  • Processors. It might be the case that processors are going to be the slow to develop bottleneck. There are a lot of expectations on processors these days: high speed, low power consumption, low temperature, high amount of parallelism (cores and hyperthreading.) But these expectations are necessarily conflicting.

    The main route to innovation is to make the processors themselves smaller, which does increase performance and helps control heat and power consumption, but there is a practical limit to the size of a processor.

    Also, no matter how fast you make the processor, it's irrelevant unless the software is capable of taking advantage of the feature.

  • Software.

    We're still not great at building software with asynchronous components. "Non-blocking" systems do make it easier to have systems that work better with slower disks. Still, we don't have a lot of software that does a great job of using the parallelism of a processor, so it's possible to get some operations that are slow and will remain slow because a single threaded process must grind through a long task and can't share it.

  • Network overhead.

    While I think better software is a huge problem, network throughput could be a huge issue. Internet endpoints (your connection) have gotten much faster in the past few years. That's a good thing, indeed, but there are a number of problems:

  • Transfer speeds aren't keeping up with data growth or data storage, and if that trend continues, we're going to end up with a lot of data that only exists in one physical location, which leads to catastrophic data loss.

    I think we'll get back to a point where moving physical media around will begin to make sense. Again.

  • Wireless data speeds and architectures (particularly 802.11x, but also wide area wireless,) have become ubiquitous, but aren't really sufficient for serious use. The fact that our homes, public places, and even offices (in some cases) aren't wired correctly to be able to provide opportunities to plug in will begin to hurt.
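On the software point above, the difference between a single-threaded process grinding through a long task and one that shares the work across cores can be sketched in a few lines. The function names here are hypothetical, and summing integers is only a stand-in for real work; whether splitting actually helps depends on the task being divisible at all.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSerial grinds through the whole slice on a single goroutine.
func sumSerial(nums []int) int {
	total := 0
	for _, n := range nums {
		total += n
	}
	return total
}

// sumParallel splits the slice into one chunk per worker, sums the chunks
// concurrently, and then combines the partial sums.
func sumParallel(nums []int, workers int) int {
	if workers < 1 {
		workers = 1
	}
	partial := make([]int, workers)
	chunk := (len(nums) + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		start := w * chunk
		if start >= len(nums) {
			break
		}
		end := start + chunk
		if end > len(nums) {
			end = len(nums)
		}
		wg.Add(1)
		// Each worker writes only to its own slot, so no locking is needed.
		go func(slot int, part []int) {
			defer wg.Done()
			partial[slot] = sumSerial(part)
		}(w, nums[start:end])
	}
	wg.Wait()
	return sumSerial(partial)
}

func main() {
	nums := make([]int, 1000)
	for i := range nums {
		nums[i] = i + 1
	}
	fmt.Println(sumSerial(nums), sumParallel(nums, 4))
}
```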

Thoughts? Other bottlenecks? Different reading of the history?

[1]By contrast, software seems like it's always getting slower, and while this is partially true, there are additional factors at play, including feature growth, programmer efficiency, and legacy support requirements.
[2]Because developers control, at least to some extent, how everyone uses and understands technology, the constraints on the way they use computers are important to everyone.

Novel Automation

This post is a follow up to the interlude in the /posts/programming-tutorials post, which is part of an ongoing series of posts on programmer training and related issues in technological literacy and education.

In short, creating novel automations is hard. The process would have to look something like:

  1. Realize that you have an unfulfilled software need.
  2. Decide what the proper solution to that need is. Make sure the solution is sufficiently flexible to be able to support all required complexity.
  3. Then sit down, open an empty buffer and begin writing code.

Not easy. [1]

Something I've learned in the past few years is that the above process is relatively uncommon for actual working programmers: most of the time you're adding a few lines here and there, testing various changes or adding small features built upon other existing systems and features.

If this is how programming work is actually done, then the kinds of methods we use to teach programmers how to program should hold some resemblance to the actual work that programmers do. As an attempt at a case study, my own recent experience:

I've been playing with Buildbot for a few weeks now for personal curiosity, and it may be useful to automate some stuff for the Cyborg Institute. Buildbot has its merits and frustrations, but this post isn't really about buildbot. Rather, the experience of doing buildbot work has taught me something about programming and about "building things," including:

  • When you set up buildbot, it generates a python configuration file where all buildbot configuration and "programming" goes.

    As a bit of a sidebar, I've been using a base configuration derived from the buildbot configuration for buildbot itself, and the fact that the default configuration is less clean and a big and I'd assumed that I was configuring a buildbot in the "normal way."

    Turns out I haven't, and this hurts my (larger) argument slightly.

    I like the idea of having a very programmatic interface for systems that must integrate with other components, and I really like the idea of a system that produces a good starting template. I'm not sure what this does for overall maintainability in the long term, but it makes getting started and using the software in a meaningful way, much more possible.

  • Organizing my buildbot configuration as I have, modeled on the "metabuildbot," has nicely illustrated the idea that software is just a collection of modules that interact with each other in a defined way. Nothing more, nothing less.

  • Distributed systems are incredibly difficult for anyone to conceptualize properly, and I think most of the frustration with buildbot stems from this.

  • Buildbot provides an immediate object lesson on the trade-offs between simplicity and terseness on the one hand and maintainability and complexity on the other.

    This point relates to the previous one. Because distributed systems are hard, it's easy to configure something that's too complex and that isn't what you want at all in your Buildbot before you realize that what you actually need is something else entirely.

    This doesn't mean that there aren't nightmarish Buildbot configs, and there are, but the lesson is quite valuable.

  • There's something interesting and instructive in the way that Buildbot's user experience lies somewhere between "an application," that you install and use, and a program that you write using a toolkit.

    It's clearly not exactly either, and both at the same time.

I suspect some web-programming systems may be similar, but I have relatively little experience with systems like these. And frankly, I have little need for these kinds of systems in any of my current projects.

Thoughts?

[1] Indeed, this may explain the incidence of people writing code, getting it working, and then rewriting it from the ground up: writing things from scratch is an objectively hard thing, whereas rewriting and iterating is considerably easier. And the end result is often, but not always, better.

Programming Tutorials

This post is a follow-up to my coding pedagogy post (/posts/coding-pedagogy). This "series" addresses how people learn how to program, the state of the technical materials that support this education process, and the role of programming in technology development.

I've wanted to learn how to program for a while and I've been perpetually frustrated by pretty much every lesson or document I've ever encountered in this search. This is hyperbolic, but it's pretty close to the truth. Teaching people how to program is hard and the materials are either written by people who:

  • don't really remember how they learned to program.

Many programming tutorials were written by these kinds of programmers, and the resulting materials tend to be decent in and of themselves, but they fail to actually teach people how to program if they don't know how to program already.

If you already know how to program, or have learned to program in a few different languages, it's easy to substitute "learning how to program" with "learning how to program in a new language" because that experience is fresher and easier to understand.

These kinds of materials will teach the novice programmer a lot about programming languages and fundamental computer science topics, but not what you really need in order to learn how to write code.

  • don't really know how to program.

People who don't know how to program tend to assume that you can teach by example, using guided tutorials. You can't really. Examples are good for demonstrating syntax and procedure, and answering tactical questions, but aren't sufficient for teaching the required higher order problem solving skills. Focusing on the concrete aspects of programming syntax, the standard library, and the process for executing code isn't enough.

These kinds of documents can be very instructive, and outsider perspectives are quite useful, but if the document can't convey how to solve real problems with code, you'll be hard pressed to learn how to write useful programs from these guides.

In essence, we have a chicken and egg problem.


Interlude:

Even six months ago, when people asked me "are you a programmer?" (or engineer), I'd often object strenuously. Now, I wave my hand back and forth and say "sorta, I program a bit, but I'm the technical writer." I don't write code on a daily basis and I'm not very nimble at starting to write programs from scratch, but when the need arises, I know enough to write code that works and to figure out the best solution to at least some of the problems I run into.

I still ask other people to write programs or fix problems I'm having, but it's usually more because I don't have time to figure out an existing system that I know they're familiar with and less because I'm incapable of making the change myself.

Despite these advances, I still find it hard to sit down with a blank buffer and write code from scratch, even if I have a pretty clear idea of what it needs to do. Increasingly, I've begun to believe that this is the case for most people who write code, even very skilled engineers.

This will be the subject of an upcoming post.


The solution(s):

1. Teach people how to code by having them debug programs and make trivial modifications to code.

People pick up syntax pretty easily, but struggle more with the problem solving aspects of code. While there are some subtle aspects of syntax, the compiler or interpreter does enough to teach people syntax. The larger challenge is getting people to understand the relationship between their changes and the program's behavior, and between any single change and the rest of a piece of code.

2. Teach people how to program by getting them to solve actual problems using actual tools, libraries, and packages.

Too often, programming tutorials and examples attempt to be self-contained or unrealistically simple. While this makes sense from a number of perspectives (easier to create, easier to explain, fewer dependency problems for users), self-contained code of this kind is incredibly uncommon in practice, and it probably leads people to think that a lot of programming revolves around re-implementing solutions to solved problems.

I'm not making a real argument about computer science education, or formal engineering training, with which I have very little experience or interest. Programming is relevant for most people as contemporary, technically literate actors in digital systems.

I'm convinced that many people do a great deal of work that is effectively programming: manipulating tools, identifying and recording procedures, collecting information about the environment, performing analysis, and taking action based on collected data. Editing macros, mail filtering systems, and spreadsheets are obvious examples though there are others.

Would teaching these people how programming worked and how they could use programming tools improve their digital existences? Possibly.

Would general productivity improve if more people knew how to think about automation and were able to do some of their own programming? Almost certainly.

Would having more casual programmers create additional problems and challenges in technology? Yes. These would be interesting problems to solve as well.

Denormalize Access Control

Access control is both immensely useful and incredibly broken.

Access control, or the ability to constrain access to data and programs in a shared system, is the only way that we, as users of shared systems, can maintain our identities, personal security, and privacy. Shared systems, including databases, file servers, social networking sites, virtualized computing systems, vendor accounts, control panels, management tools, and so forth, all need robust, flexible, granular, and scalable access control tools.

Contemporary access control tools--access control lists (ACLs) and access control groups--indeed the entire conceptual framework for managing access to data and resources, don't work. From a theoretical perspective, ACLs that express a relationship between users or groups of users and data or resources represent a parsimonious solution to the "access control problem:" if properly deployed, only those with access grants will have access to a given resource.

In practice, these kinds of relationships do not work. Typically, the relationship between data and users is rich and complex, and different users need to be able to do different things with different resources. Some users need "read only" access, others need partial read access, and some need read and write access but only to a subset of a resource. While ACL systems can impose these kinds of restrictions, the access control abstraction doesn't match the data abstraction or the real-world relationships that it supposedly reflects.

Compounding this problem are two important factors:

  1. Access control needs change over time in response to social and cultural shifts among the users and providers of these resources.
  2. There are too many pieces of information or resources in any potential shared system to allocate access on a per-object or per-resource basis, and the volume of objects and resources is only increasing.

Often many objects or resources have the same or similar access control patterns, which leads to the "group" abstraction. Groups make it possible to describe a specific access control pattern that applies to a number of objects, and connect this pattern with specific resources.

Conceptual deficiencies:

  • There's a volume problem. Access control data represents a many-to-many-to-many relationship. There are many different users and (nested) groups, many different kinds of access controls that systems can grant, and many different (nested) resources. This would be unmanageably complex without the possibility for nesting, but nesting means that the relationships between resources and between groups and users are also important. With the possibility for nesting, access control is impossible.

  • ACLs and group-based access control don't account for the fact that access must be constantly evolving, and current systems don't contain support for ongoing maintenance. (We need background threads that go through and validate access control consistency.) Also, all access control grants must have some capacity for automatic expiration.

  • Access control requirements and possibilities shift as data becomes more or less structured, and as data use patterns change. The same conceptual framework that works well for access control in the context of the data stored in a relational database doesn't work so well when the data in question is a word processing document, an email folder, or a spreadsheet.

    The fewer people that need access to a single piece of data, the easier the access control system can be. While this seems self-evident, it also means that access control systems are difficult to test in the really large complex systems in which they're used.

  • Group-based access control systems, in effect, normalize data about access control, in an effort to speed up data access times. While this performance is welcome, in most cases granting access via groups leads to an overly liberal distribution of access control rights. At once, it's too difficult to understand "who has access to what" and too easy to add people to groups that give them more access than they need.
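
To make the "who has access to what" difficulty concrete, here is a minimal sketch of normalized, group-based access control with nested groups. Every name in it (the groups, the users, the "deploy-keys" resource, and the "group:" prefix convention) is hypothetical and only there for illustration; it is not modeled on any particular system.

```python
# Hypothetical normalized access control data: resources grant access to
# groups, and groups contain users or other groups (marked "group:<name>").
GROUP_MEMBERS = {
    "engineering": {"alice", "group:platform"},
    "platform": {"bob", "group:contractors"},
    "contractors": {"carol"},
}

RESOURCE_GRANTS = {
    "deploy-keys": {"engineering"},
}


def expand_group(group, seen=None):
    """Return every user reachable from a group, following nested groups."""
    seen = set() if seen is None else seen
    if group in seen:  # guard against cycles in the group nesting
        return set()
    seen.add(group)
    users = set()
    for member in GROUP_MEMBERS.get(group, set()):
        if member.startswith("group:"):
            users |= expand_group(member[len("group:"):], seen)
        else:
            users.add(member)
    return users


def who_has_access(resource):
    """Answering "who can use this?" means walking the whole group graph."""
    users = set()
    for group in RESOURCE_GRANTS.get(resource, set()):
        users |= expand_group(group)
    return users
```

In this sketch the resource only names "engineering," yet a contractor two levels down ends up with access, and nothing stored alongside the resource says so.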

So the solution:

  1. Denormalize all access control data,
  2. don't grant access to groups, and
  3. forbid inheritance.
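
As a rough sketch of what those three rules might look like in code, assume a hypothetical `Record` type that stores its grants alongside its data. The field names and the action strings are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Record:
    """A piece of data that carries its own access grants (hypothetical model)."""

    name: str
    payload: str
    # Maps a user ID directly to the actions that user may perform.
    # There are no group IDs and no pointer to a parent record, so there is
    # nothing to resolve and nothing to inherit.
    grants: Dict[str, Set[str]] = field(default_factory=dict)


def can(record, user, action):
    """Check access using only the record itself, with no second lookup."""
    return action in record.grants.get(user, set())


def who_has_access(record):
    """List every user holding at least one grant on this record."""
    return sorted(user for user, actions in record.grants.items() if actions)
```

The trade is deliberate: each record gets bigger and grants are repeated across records, but the answer to "who can touch this?" is always sitting next to the data.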

This is totally counter to the state of the art. In most ways, normalized access control data, with role/group-based access control, and complex inheritance are the gold standard. Why would it work?

  • If you have a piece of data, you will always be able to determine who has access to that data, without needing to do another look-up.

  • If you can deactivate credentials, then a background process can go through and remove access without causing a large security problem. (For partial removes, you would freeze an account, let the background process modify access control and then unfreeze the account.)

    The downside is that, potentially, in a large system, it may take a rather long time for access grants to propagate to users. Locking user accounts makes the system secure/viable, but doesn't make the process any quicker.

    As an added bonus, these processes could probably be independent and wouldn't require any sort of shared state or lock, which means many such operations could run in parallel, and they could stop and restart at will.

  • The inheritance option should be fuzzy. Some sort of "bucket-based" access control should be possible, if there's a lot of data with the same access control rules and users.

    Once things get more complex, buckets are the wrong metaphor, and you should use granular controls everywhere.
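
The freeze, sweep, and unfreeze idea above can be sketched in a few lines. This assumes a hypothetical in-memory layout where `grants_by_record` maps a record name to a `{user: set_of_actions}` dictionary and `frozen` is a set of user IDs with disabled credentials; a real system would persist both somewhere.

```python
def is_allowed(grants_by_record, frozen, record, user, action):
    """A frozen account is denied everything, whatever grants still exist."""
    if user in frozen:
        return False
    return action in grants_by_record.get(record, {}).get(user, set())


def revoke(grants_by_record, frozen, user, actions):
    """Remove `actions` for `user` from every record.

    The account is frozen for the duration of the sweep. Each record is
    updated on its own, so no shared lock is needed, and removing a grant
    that is already gone is a no-op, which makes the sweep safe to stop
    and run again.
    """
    frozen.add(user)
    for grants in grants_by_record.values():
        remaining = grants.get(user, set()) - actions
        if remaining:
            grants[user] = remaining
        else:
            grants.pop(user, None)
    # Unfreeze only after a complete sweep; if the loop raises, the
    # account stays frozen rather than keeping stale grants.
    frozen.discard(user)
```

The sweep still touches every record, so it is no faster than the text suggests; the freeze is what keeps the slow propagation from being a security problem.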

Problems/Conclusion:

  • Denormalization might fix the problems with ACLs and permissions systems, but it doesn't fix the problems with distributed identity management.

    As a counterpoint, this seems like a cryptography management problem.

  • Storing access control information with data means that it's difficult to take a user and return a list of what their credentials have access to.

    In truth, centralized ACL systems are subject to this flaw as well.

  • A huge part of the problem with centralized ACL derives from nesting, and the fact that we tend to model/organize data in tree-like structures that often run counter to the organization of access control rights. As a result, access control tools must be arbitrary.
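
The reverse look-up problem mentioned above is easy to see in a sketch. Assuming the same hypothetical layout of grants stored per record (a mapping of record name to `{user: set_of_actions}`), listing what one user can reach means scanning everything:

```python
def resources_for(user, grants_by_record):
    """Return the names of records a user can reach by scanning every record."""
    return sorted(
        name for name, grants in grants_by_record.items() if grants.get(user)
    )
```

The cost grows with the total number of records rather than with the number of things the user can actually see.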