Novel Automation

This post is a follow up to the interlude in the /posts/programming-tutorials post, which is part of an ongoing series of posts on programmer training and related issues in technological literacy and education.

In short, creating novel automations is hard. The process would have to look something like:

  1. Realize that you have an unfulfilled software need.
  2. Decide what the proper solution to that need is. Make sure the solution is sufficiently flexible to be able to support all required complexity.
  3. Then sit down, open an empty buffer and begin writing code.

Not easy. [1]

Something I've learned in the past few years is that the above process is relatively uncommon for actual working programmers: most of the time you're adding a few lines here and there, testing various changes or adding small features built upon other existing systems and features.

If this is how programming work is actually done, then the kinds of methods we use to teach programmers how to program should bear some resemblance to the actual work that programmers do. As an attempt at a case study, here's my own recent experience:

I've been playing with Buildbot for a few weeks now out of personal curiosity, and because it may be useful for automating some tasks for the Cyborg Institute. Buildbot has its merits and frustrations, but this post isn't really about Buildbot. Rather, the experience of doing buildbot work has taught me something about programming and about "building things," including:

  • When you set up buildbot, it generates a python configuration file where all buildbot configuration and "programming" goes.

    As a bit of a sidebar, I've been using a base configuration derived from the buildbot configuration for buildbot itself, and because the default configuration is considerably less clean, I'd assumed that I was configuring buildbot in the "normal" way.

    Turns out I haven't, and this hurts my (larger) argument slightly.

    I like the idea of having a very programmatic interface for systems that must integrate with other components, and I really like the idea of a system that produces a good starting template. I'm not sure what this does for overall maintainability in the long term, but it makes getting started and using the software in a meaningful way much more approachable. (A minimal sketch of what this kind of configuration looks like follows this list.)

  • Organizing my buildbot configuration as I have, modeled on the "metabuildbot," has nicely illustrated the idea that software is just a collection of modules that interact with each other in a defined way. Nothing more, nothing less.

  • Distributed systems are incredibly difficult for anyone to conceptualize properly, and I think most of the frustration with buildbot stems from this.

  • Buildbot provides an immediate object lesson on the trade-offs between simplicity and terseness on the one hand and maintainability and complexity on the other.

    This point relates to the previous one. Because distributed systems are hard, it's easy to build something in your Buildbot configuration that's too complex and isn't what you want at all, before you realize that what you actually need is something else entirely.

    This isn't to say that there aren't nightmarish Buildbot configs--there are--but the lesson is quite valuable.

  • There's something interesting and instructive in the way that Buildbot's user experience lies somewhere between "an application" that you install and use, and a program that you write using a toolkit.

    It's clearly not exactly either, and somehow both at the same time.
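For the curious, here's roughly the shape of a generated master.cfg in current Buildbot releases. The worker name, repository URL, and build commands are placeholders of my own, not anything from my actual configuration; treat it as a sketch of the style rather than a working setup:

# master.cfg -- Buildbot configuration is just a Python module that
# populates a single dictionary named BuildmasterConfig.
from buildbot.plugins import schedulers, steps, util, worker

c = BuildmasterConfig = {}

# machines that actually run builds
c['workers'] = [worker.Worker("example-worker", "password")]
c['protocols'] = {'pb': {'port': 9989}}

# a build is a factory: an ordered list of steps
factory = util.BuildFactory()
factory.addStep(steps.Git(repourl="https://example.com/project.git", mode="incremental"))
factory.addStep(steps.ShellCommand(command=["make", "test"]))

# builders bind a factory to workers; schedulers decide when builders run
c['builders'] = [util.BuilderConfig(name="tests",
                                    workernames=["example-worker"],
                                    factory=factory)]
c['schedulers'] = [schedulers.SingleBranchScheduler(
    name="on-commit",
    change_filter=util.ChangeFilter(branch="master"),
    treeStableTimer=120,
    builderNames=["tests"])]

The whole system--workers, schedulers, builders--is just Python objects assembled into a dictionary, which is the "collection of modules that interact in a defined way" point above in miniature.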

I suspect some web-programming frameworks may be similar, but I have relatively little experience with systems like these. And frankly, I have little need for them in any of my current projects.

Thoughts?

[1]Indeed this may be why it's so common for people to write code, get it working, and then rewrite it from the ground up: writing things from scratch is objectively hard, while rewriting and iterating is considerably easier. And the end result is often, but not always, better.

Programming Tutorials

This post is a follow up to my :doc:`/posts/coding-pedagogy` post. This "series" addresses how people learn how to program, the state of the technical materials that support this education process, and the role of programming in technology development.

I've wanted to learn how to program for a while and I've been perpetually frustrated by pretty much every lesson or document I've ever encountered in this search. This is hyperbolic, but it's pretty close to the truth. Teaching people how to program is hard, and the materials are written by people who either:

  • don't really remember how they learned to program.

Many programming tutorials were written by these kinds of programmers, and the resulting materials tend to be decent in and of themselves, but they fail to actually teach people who don't already know how to program.

If you already know how to program, or have learned to program in a few different languages, it's easy to substitute "learning how to program" with "learning how to program in a new language," because that experience is fresher and easier to understand.

These kinds of materials will teach the novice programmer a lot about programming languages and fundamental computer science topics, but not the things you really need in order to learn how to write code.

  • don't really know how to program.

People who don't know how to program tend to assume that you can teach by example, using guided tutorials. You can't, really. Examples are good for demonstrating syntax and procedure, and for answering tactical questions, but they aren't sufficient for teaching the required higher-order problem-solving skills. Focusing on the concrete aspects of programming syntax, the standard library, and the process for executing code isn't enough.

These kinds of documents can be very instructive, and outsider perspectives are quite useful, but if the document can't convey how to solve real problems with code, you'll be hard pressed to learn how to write useful programs from these guides.

In essence, we have a chicken and egg problem.


Interlude:

Even six months ago, when people asked me "are you a programmer?" (or engineer,) I'd often object strenuously. Now, I wave my hand back and forth and say "sorta, I program a bit, but I'm the technical writer." I don't write code on a daily basis and I'm not very nimble at starting to write programs from scratch, but sometimes when the need arises, I know enough to write code that works, to figure out the best solution to fix at least some of the problems I run into.

I still ask other people to write programs or fix problems I'm having, but it's usually more because I don't have time to figure out an existing system that I know they're familiar with and less because I'm incapable of making the change myself.

Despite these advances, I still find it hard to sit down with a blank buffer and write code from scratch, even if I have a pretty clear idea of what it needs to do. Increasingly, I've begun to believe that this is the case for most people who write code, even very skilled engineers.

This will be the subject of an upcoming post.


The solution(s):

1. Teach people how to code by forcing people to debug programs and make trivial modifications to code.

People pick up syntax pretty easily, but struggle more with the problem-solving aspects of code. While there are some subtle aspects of syntax, the compiler or interpreter does enough to teach people syntax. The larger challenge is getting people to understand the relationship between their changes and the resulting behavior, and between any single change and the rest of a piece of code. (An example of the kind of exercise I have in mind follows this list.)

2. Teach people how to program by getting them to solve actual problems using actual tools, libraries, and packages.

Too often, programming tutorials and examples attempt to be self-contained or unrealistically simple. While this makes sense from a number of perspectives (easier to create, easier to explain, fewer dependency problems for users,) it bears little resemblance to real work and probably leads people to think that a lot of programming revolves around re-implementing solutions to already-solved problems.
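To make the first suggestion concrete, here's the kind of tiny, deliberately broken exercise I have in mind (the chore list is made up, and the bug is intentional):

# exercise: this script should print a numbered list of all five chores,
# but it skips the first one and only numbers four of them.
# fix it without rewriting it from scratch.
chores = ["dishes", "laundry", "compost", "vacuum", "mail"]

for i in range(1, len(chores)):
    print(i, chores[i])

The syntax here is trivial; the whole exercise is about understanding the relationship between one small change and the behavior of the rest of the program.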

I'm not making an argument about computer science education or formal engineering training, with which I have very little experience or interest. But programming is relevant to most people as contemporary, technically literate actors in digital systems.

I'm convinced that many people do a great deal of work that is effectively programming: manipulating tools, identifying and recording procedures, collecting information about the environment, performing analysis, and taking action based on collected data. Editing macros, mail filtering systems, and spreadsheets are obvious examples though there are others.

Would teaching these people how programming worked and how they could use programming tools improve their digital existences? Possibly.

Would general productivity improve if more people knew how to think about automation and were able to do some of their own programming? Almost certainly.

Would having more casual programmers create additional problems and challenges in technology? Yes. These would be interesting problems to solve as well.

Denormalize Access Control

Access control is both immensely useful and incredibly broken.

Access control, or the ability to constrain access to data and programs in a shared system, is the only way that we, as users of shared systems, can maintain our identities, personal security, and privacy. Shared systems--databases, file servers, social networking sites, virtualized computing systems, vendor accounts, control panels, management tools, and so forth--all need robust, flexible, granular, and scalable access control tools.

Contemporary access control tools--access control lists (ACLs) and access control groups--indeed the entire conceptual framework for managing access to data and resources, don't work. From a theoretical perspective, ACLs that express a relationship between users or groups of users and data or resources represent a parsimonious solution to the "access control problem:" if properly deployed, only those with access grants will have access to a given resource.

In practice, these kinds of relationships do not work. Typically the relationships between data and users are rich and complex, and different users need to be able to do different things with different resources. Some users need "read only" access, others need partial read access, some need read and write access but only to a subset of a resource. While ACL systems can impose these kinds of restrictions, the access control abstraction doesn't match the data abstraction or the real-world relationships that it supposedly reflects.

Compounding this problem are two important factors:

  1. Access control needs change over time in response to social and cultural shifts among the users and providers of these resources.
  2. There are too many pieces of information or resources in any potential shared system to allocate access on a per-object or per-resource basis, and the volume of objects and resources is only increasing.

Often many objects or resources have the same or similar access control patterns, which leads to the "group" abstraction. Groups make it possible to describe a specific access control pattern that applies to a number of objects, and to connect this pattern with specific resources.

Conceptual deficiencies:

  • There's a volume problem. Access control data represents a many-to-many-to-many relationship. There are many different users and (nested) groups, many different kinds of access controls that systems can grant, and many different (nested) resources. This would be unmanageably complex without the possibility for nesting, but nesting means that the relationships between resources and between groups and users also matter; once nesting enters the picture, reasoning about who actually has access to what becomes effectively impossible.

  • ACLs and group-based access control don't account for the fact that access needs constantly evolve, and current systems don't contain support for ongoing maintenance. (We need background threads that go through and validate access control consistency.) Also, all access control grants must have some capacity for automatic expiration.

  • Access control requirements and possibilities shift as data becomes more or less structured, and as data use patterns change. The same conceptual framework that works well for access control in the context of data stored in a relational database doesn't work as well when the data in question is a word processing document, an email folder, or a spreadsheet.

    The fewer people that need access to a single piece of data, the easier the access control system can be. While this seems self evident, it also means that access control systems are difficult to test in the really large complex systems in which they're used.

  • Group-based access control systems, in effect, normalize data about access control in an effort to speed up data access times. While this performance is welcome, in most cases granting access via groups leads to an overly liberal distribution of access control rights. At once, it's too difficult to understand "who has access to what" and too easy to add people to groups that give them more access than they need.

So the solution:

  1. Denormalize all access control data,
  2. don't grant access to groups, and
  3. forbid inheritance.

This is totally counter to the state of the art. In most ways, normalized access control data, role/group-based access control, and complex inheritance are the gold standard. Why would this work?

  • If you have a piece of data, you will always be able to determine who has access to it, without needing to do another look-up.

  • If you can deactivate credentials, then a background process can go through and remove access without causing a large security problem. (For partial removes, you would freeze an account, let the background process modify access control and then unfreeze the account.)

    The down side is that, potentially, in a large system, it may take a rather long time for access grants to propagate to users. Locking user accounts keeps the system secure and viable, but doesn't make the process any quicker.

    As an added bonus, these processes could probably be independent and wouldn't require any sort of shared state or lock, which means many such operations could run in parallel, and they could stop and restart at will. (A sketch of this kind of sweep follows this list.)

  • The inheritance option should be fuzzy. Some sort of "bucket-based" access control should be possible, if there's a lot of data with the same access control rules and users.

    Once things get more complex, buckets are the wrong metaphor, and you should use granular controls everywhere.
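To make this concrete, here's a minimal sketch of denormalized, expiring grants; the field names and the in-memory objects are purely illustrative, not a reference to any particular system:

import time

# every stored object carries its own grants; answering "who has access
# to this?" never requires a second lookup against groups or ACLs
document = {
    "_id": "report-q1",
    "content": "quarterly numbers",
    "grants": [
        {"user": "alice", "rights": ["read", "write"],
         "expires": time.time() + 30 * 86400},
        {"user": "bob", "rights": ["read"],
         "expires": time.time() + 7 * 86400},
    ],
}


def can(user, right, obj):
    """Check access using only the object itself."""
    now = time.time()
    return any(g["user"] == user and right in g["rights"] and g["expires"] > now
               for g in obj["grants"])


def sweep(objects, deactivated):
    """Background process: drop expired grants and grants held by
    deactivated (frozen) credentials. Needs no shared state, so many
    sweeps can run in parallel over different slices of the data."""
    now = time.time()
    for obj in objects:
        obj["grants"] = [g for g in obj["grants"]
                         if g["expires"] > now and g["user"] not in deactivated]


print(can("alice", "write", document))   # True
sweep([document], deactivated={"bob"})
print(can("bob", "read", document))      # False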

Problems/Conclusion:

  • Denormalization might fix the problems with ACLs and permissions systems, but it doesn't fix the problems with distributed identity management.

    As a counterpoint, this seems like a cryptography management problem.

  • Storing access control information with data means that it's difficult to take a user and return a list of what these credentials have access to.

    In truth, centralized ACL systems are subject to this flaw as well.

  • A huge part of the problem with centralized ACLs derives from nesting, and from the fact that we tend to model and organize data in tree-like structures that often run counter to the organization of access control rights. As a result, access control tools end up feeling arbitrary.

In Favor of Fast Builds

This is an entry in my loose series of posts about build systems.

I've been thinking recently about why I've come to think that build systems are so important, and this post is mostly just me thinking aloud about this issue and related questions.

Making Builds Efficient

Writing a build system for a project is often relatively trivial: once you capture the process and figure out the base dependencies, you can write scripts and makefiles to automate it. The problem is that the most rudimentary build systems aren't terribly efficient, for two main reasons:

1. It's difficult to stumble into a build process that is easy to parallelize, so these rudimentary solutions often depend on a series of steps happening in a specific order.

2. It's easier to write a build system that rebuilds too much rather than too little for subsequent builds. From the perspective of build tool designers, this is the correct behavior; but it means that it takes more work to ensure that you only rebuild what you need to.

As a corollary, you need to test build systems and approaches with significantly large projects, where "rebuilding too much" is actually detectable.

Making a build system efficient isn't too hard, but it does require some amount of testing and experimentation, and often it centers on having explicit dependencies, so that the build tool (i.e. Make, SCons, Ninja, etc.) can build output files in the correct order and only build when a dependency changes. [1]
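As an illustration, here's a minimal SConstruct (SCons build files are plain Python). The file names are made up; the point is that every target declares its sources explicitly, so the tool can schedule independent compiles in parallel and rebuild only what actually changed:

# SConstruct -- a hypothetical three-file C project
env = Environment()

# each object file is an explicit target with explicit sources, so
# "scons -j4" can compile the three of them in parallel...
objects = env.Object(['main.c', 'parser.c', 'render.c'])

# ...and the program is relinked only when one of its objects changes
env.Program(target='app', source=objects)

A second run with nothing modified does essentially no work, which is exactly the incremental behavior you want.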

The Benefits of a Fast Build

  1. Fast builds increase overall personal productivity.

    You don't have to wait for a build to complete, and you're not tempted to context switch during the build, so you stay focused on your work.

  2. Fast builds increase quality.

    If your build system (and to a similar extent, your test system,) runs efficiently, it's possible to detect errors earlier in the development process, which helps prevent defects from accumulating. A tighter feedback loop on the code you write is helpful.

  3. Fast builds democratize the development process.

    If builds are easy to run, and require minimal cajoling and intervention, it becomes much more likely that many people will build, test, and contribute to the project.

    This is most obvious in open source communities and projects, but it's probably true of all development teams.

  4. Fast builds promote freshness.

    If the build process is frustrating, then anyone who might run the build will avoid it and run the build less frequently, and on the whole the development effort loses important feedback and data.

    Continuous integration systems help with this, but they require significant resources, are clumsy solutions, and above all, CI attempts to solve a slightly different problem.

Optimizing Builds

Steps you can take to optimize builds:

(Note: I'm by no means an expert in this, so feel free to add or edit these suggestions.)

  • A large number of smaller jobs that can complete independently of each other are easy to run in parallel. If the jobs that create a product take longer and are more difficult to split into components, then the build will be slower, particularly on more powerful hardware that could otherwise run many jobs at once.
  • Incremental builds are a huge win, particularly for larger processes. Most of the reasons why you want "fast builds," only require fast rebuilds and partial builds, not necessarily the full "clean builds." While fast initial builds are not unimportant, they account for a small percentage of use.
  • Manage complexity.

There are a lot of things you can do to make builds smarter, which should theoretically make builds faster.

Examples of this kind of complexity include storing dependency information in a database, or using hashing rather than "mtime" to detect staleness, or integrating the build automation with other parts of the development tool chain, or using a more limited method to specify build processes.

The problem, or the potential problem is that you lose simplicity, and it's possible that something in this "smarter and more complex" system can break or slow down under certain pressures, or can have enough overhead to render them unproductive optimizations.
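As one sketch of that trade-off, consider content-hash staleness detection of the sort mentioned above: it ignores spurious timestamp changes and catches real ones, but it has to read and hash every input and maintain a manifest on the side. The manifest location and function names here are made up:

import hashlib
import json
import os

MANIFEST = ".build-state.json"   # hypothetical side-car file of recorded signatures


def signature(path):
    """Content hash of a file; unlike mtime, untouched content never looks stale."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def stale_inputs(inputs, manifest=MANIFEST):
    """Return the inputs whose contents changed since the last recorded build."""
    if os.path.exists(manifest):
        with open(manifest) as f:
            recorded = json.load(f)
    else:
        recorded = {}
    return [p for p in inputs if recorded.get(p) != signature(p)]


def record_build(inputs, manifest=MANIFEST):
    """Write the current signatures after a successful build."""
    with open(manifest, "w") as f:
        json.dump({p: signature(p) for p in inputs}, f)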

[1]It's too easy to use wildcards so that the system must rebuild a given output if any of a number of input files change. Some of this is unavoidable, and generally there are more input files than output files, but particularly with builds that have intermediate stages, or more complex relationships between files, it's important to attend to these relationships.

Distributed Bug Tracking

The free software/open source/software development world needs a distributed bug tracking story. Because the current one sucks.

The State of the Art

There are a number of tools written between 2006 and 2010 or so that provide partial or incomplete solutions to the problem. Almost isn't quite good enough. The "Resources" section of this post contains an overview of the most important (in my judgment) representatives of the current work in the area, with a bit of editorializing.

In general these solutions are good starts, and I think they allow us (or me) a good starting point for thinking about what distributed bug tracking could be like. Someday.

Bug tracking needs are diverse, which creates a significant design challenge for any system in this space. There are many existing solutions that everyone hates, and I suspect most would-be developers and innovators in the space would like to avoid opening this can of worms.

Another factor is that, while most people have come to the conclusion that distributed source control tools are the "serious" contemporary tools for managing source code, the benefits of distributed bug tracking haven't yet propagated in the same way. Many folks have begun to come to terms with the fact that some amount of tactical centralization is inevitable, required, and even desirable [1] in the context of issue tracking systems.

Add to this the frequent requirement that non-developer users often need to track and create issues, and the result is that we've arrived at something of an impasse.

Requirements

A distributed bug tracking system would need:

  • A good way to provide short, unique identifiers for individual issues and comments so that users can discuss issues canonically.

  • An interface contained in a single application, script, or binary, that you could distribute with the application.

  • A simple/lightweight web-based interface so that users can (at least) review, search, and reference issues from a web browser.

    Write access would also be good, but is less critical. Also, it might be more practical (both from a design and a workflow perspective,) to have users submit bugs on the web into a read-only "staging queue," that developers/administrators would then formally import into the project. This formalizes a certain type of triage approach that many projects may find useful.

  • To be separable from the source code history, either by using a branch, or by using pre-commit hooks to ensure that you never commit changes to code/content and the bugs at the same time.

  • To be editable, and to interact with commonly accessible tools that users already use. Email, command line tools, the version control systems, potentially documentation systems, build systems, testing frameworks and so forth.

  • Built on reliable tools. [2]

  • To provide an easy way to customize your "views" on bugs for a particular team or project. In other words, each team can freely decide which extra fields get attached to their bugs, along with which fields are visible by default, which are required, and so on--without interfering with other projects.

The Future of the Art

  1. We (all) need to work on building new and better tools to help solve the distributed issue tracking problem. This will involve:
    • learning from the existing attempts,
    • continuing to develop and solidify the above requirements,
    • (potentially) test and develop a standard (yaml/json?) based data storage format that is easy to parse and easy to merge, and that multiple tools can use (a hypothetical sketch of such a record follows this list).
    • Develop some simple prototype tools, potentially as a suite of related utilities (a la early versions of git,) that facilitate interaction with the git database, with an eye towards flexibility and extensibility.
  2. While there are implications for free software hosting as well as vendor independence and network service autonomy (a la the `Franklin Street Statement <http://autonomo.us/2008/07/franklin-street-statement/>`_,) I think the primary reason to pursue distributed bug tracking has more to do with productivity and better engineering practices, and less to do with policy. In summary:
    • Bug database systems that run locally and are fast[3] and always available.
    • Tools that permit offline interaction with issue database.
    • Tools that allow users to connect issues to branches.
    • Tools that make it possible to componentize bug databases in parallel with the software they track.
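As a sketch of what such a format might look like--every field name here is hypothetical, not drawn from any existing tool--an issue could be a small, merge-friendly document with a content-derived identifier:

import hashlib
import json

issue = {
    "title": "crash when the config file is empty",
    "status": "open",
    "creator": "alice@example.com",
    "created": "2013-04-01T12:00:00Z",
    "labels": ["bug", "parser"],
    "comments": [
        {"author": "bob@example.com",
         "date": "2013-04-02T09:30:00Z",
         "body": "confirmed here as well"},
    ],
}

# a short identifier derived from the content gives everyone a canonical,
# collision-resistant way to say "issue 3f2a91c" across clones
serialized = json.dumps(issue, sort_keys=True).encode("utf-8")
issue["id"] = hashlib.sha1(serialized).hexdigest()[:7]

print(json.dumps(issue, indent=2, sort_keys=True))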

Resources

(With commentary,)

  • dist-bugs mailing list

    This is the canonical source for discussion around distributed bug tracking.

  • Bugs Everywhere

    This is among the most well developed solutions, speaking holistically. "be" is written in Python and can generate output for the web. It uses its own data format and has a pretty good command line tool. The HTML output generation is probably not very fast at scale (none of them are,) but I have not tested it.

  • Ditz

    Ditz is a very well developed solution. Ditz is implemented in Ruby, has a web interface, has a command line tool, uses a basic YAML data format, and stores data in a branch. Current development is slow, getting it up and running is non-trivial, and my sense is that there isn't a very active community of contributors. There are likely reasons for this, but they are beyond the scope of this overview.

  • pitz

    Pitz is a Python re-implementation of Ditz, and while the developer(s?) have produced a "release," the "interface" is a Python shell, and to interact with the database you have to, basically, write commands in Python syntax. From a data perspective, however, Pitz, like Ditz, is quite developed. While Pitz stores data in-tree, I think it's an important source of ideas/examples/scaffolding.

  • Artemis

    This is a really clever solution that uses Maildirs to store issues. As a result you can interact with and integrate Artemis issues with your existing email client: pull down changes, and see new bugs in your email, without any complicated email and list server setups.

    The huge caveat is that it's implemented as a plugin for Mercurial, and so can't be used with git projects. Also, all data resides in the tree.

  • git-issues

    In most ways, git-issues is my favorite: it's two Python files, 1700 lines of code, stores issues outside of the source branch, and has a good command line interface. On the downside, it uses XML (which shouldn't matter, but I think probably does, at least in terms of attracting developers,) and doesn't have a web-based interface. It's also currently un-maintained.

  • Prophet/sd

    SD, which is based on a distributed database named Prophet, is a great solution. The primary issue is that it's currently unmaintained and is not as feature complete as it should be. Also, a lot of SD focuses on synchronizing with existing centralized issue trackers, potentially at the expense of developing other tools.

[1]It seems that you want centralized issue databases, or at least the fact that centralized issue databases appear canonical is a major selling point for issue tracking software in general. Otherwise, everyone would have their own text file with a bunch of issues, and that would suck.
[2]Because I don't program (much) and it's easy to criticize architectural decisions from afar, I don't want to explicitly say "we need to write this in Python for portability reasons" or something that would be similarly unfounded. At the same time, adoption and ease of use is crucial here, both for developers and users. Java and Ruby (and maybe Perl,) for various reasons, add friction to the adoption possibilities.
[3]"Is Jira/Bugzilla/etc. slow for you today?"

Cron is the Wrong Solution

Cron is great, right? For the uninitiated, if there are any of you left, Cron is a task scheduler that makes it possible to run various scripts and programs at specified intervals. This means that you can write programs that "do a thing" in a stateless way, set them to run regularly, without having to consider any logic regarding when to run, or any kind of state tracking. Cron is simple and the right way to do a great deal of routine automation, but there are caveats.

At times I've had scads of cron jobs, and while they work, from time to time I find myself going through my list of cron tasks on various systems and removing most of them or finding better ways.

The problems with cron are simple:

  • It's often a sledgehammer, and it's very easy to put something in a cron job that needs a little more delicacy.

  • While it's possible to capture the output of cron tasks (typically via email,) the feedback from cronjobs is hard to follow. So it's hard to detect errors, performance deterioration, inefficiencies, or bugs proactively.

  • It's too easy to cron something to run every minute or couple of minutes. A task that seems relatively lightweight when you run it once can end up being expensive in the aggregate when it has to run a thousand times a day.

    This isn't to say that there aren't places where cron is absolutely the right solution, but often there are better approaches. For instance:

  • Include simple tests and logic in the cron task to determine if it needs to run before actually running. (A minimal sketch of this follows the list.)

  • Make things very easy to invoke on demand rather than running them automatically on a schedule.

    I've begun to find little scripts and dmenu, or an easily called emacs-lisp function to be preferable to a cron job for a lot of tasks that I'd otherwise set in a cron job.

  • Write real daemons. It's hard and you have to make sure that they don't error out or quit unexpectedly--which requires at least primitive monitoring--but a little bit of work here can go a long way.
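Here's a minimal sketch of the first suggestion: guard the expensive work with a cheap check, so invoking the script blindly (from cron or by hand) costs almost nothing when there's nothing to do. The paths, the interval, and the sync_mail() stand-in are all hypothetical:

#!/usr/bin/env python
"""Run an expensive task only when it actually needs to run."""
import os
import sys
import time

STAMP = os.path.expanduser("~/.cache/mail-sync.stamp")
MIN_INTERVAL = 15 * 60   # seconds; don't repeat the work more than every 15 minutes


def needs_to_run():
    if not os.path.exists(STAMP):
        return True
    return time.time() - os.path.getmtime(STAMP) > MIN_INTERVAL


def sync_mail():
    # stand-in for the real work: fetching mail, rebuilding docs, whatever
    print("doing the expensive thing")


if __name__ == "__main__":
    if not needs_to_run():
        sys.exit(0)   # nothing to do; stay quiet so cron doesn't send mail
    sync_mail()
    os.makedirs(os.path.dirname(STAMP), exist_ok=True)
    with open(STAMP, "w") as f:
        f.write(str(time.time()))

The same guard makes the task safe to bind to a key or a dmenu entry, which is the "invoke on demand" point above.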

Onward and Upward!

Git In Practice

Most people don't use git particularly well. It's a capable piece of software that supports a number of different workflows, but because it doesn't mandate any particular workflow it's possible to use git productively for years without ever really touching some features.

And some of the features--in my experience mostly those related to more manual branching, merging, and history manipulation operations--are woefully underutilized. Part of this is because Github, which is responsible for facilitating much of git's use, promotes a specific workflow that makes it possible to do most of the (minimally required) branch operations on the server side, with the help of a much more constrained interface. Github makes git usable by making it possible to get most of the benefit of git without needing to mess with SHA1 hashes, or anything difficult on the command-line.

That's a good thing. Mostly.

Nevertheless, there are a few operations that remain hard with git: I sometimes encounter situations that I have to try a few times before I get it right, and there are commands that I always have to check the man page to figure out how to specify the references. And even then I'm sometimes still confused. So maybe I (or we?) can spend a little bit of time and figure out what processes remain hard with git and maybe try and see if there is a way to make the process a bit more streamlined.

Here's my list:

  • Reorder all commits since x commit.

    This is basically: find the commit before the earliest one that you want to change, then run git rebase -i <commit hash> to reorder the commits, even though git lists the commits in an order (oldest first) that I find unintuitive.

  • Create local branches to track remote branches or repositories.

    Setup the remotes, if necessary, and then run: git branch --track <local-branch-name> <remote>/<branch-name> and git config branch.{name}.push {local-branch}:master.

  • Stash all local changes and switch branches.

    It would also be nice if you could figure out a way for git (or a helper) to see any open files in your text editor and save/close them if needed.

  • Pull a commit from the history of one branch into another branch without pulling anything else.

    I think this is cherry-pick? It might also be nice to pull a series of commits from one branch, rebase them into one commit in the destination branch, and then commit that.

  • Pretty much every time I've tried to use the merge command to get something other than what I would have expected to happen by using "pull," it ends tragically.

Reader suggestions:

  • Put your process/procedural frustrations with git here.

How about we work on figuring out how to solve these problems in comments?

9 Awesome Git Tricks

I'm sure that most "hacker bloggers" have probably done their own "N Git Tricks" post at this point. But git is one of those programs that has so much functionality and everyone uses it differently that there is a never ending supply of fresh posts on this topic. My use of git changes enough that I could probably write this post annually and come up with a different 9 things. That said, here's the best list right now.

See Staged Differences

The git diff command shows you the difference between the last commit and the state of the current working directory. That's really useful and you might not use it as much as you should. The --cached option shows you just the differences that you've staged.

This provides a way to preview your own patch, to make sure everything is in order. Crazy useful. See below for the example:

git diff --cached

Eliminate Merge Commits

In most cases, if two or more people publish commits to a shared repository, and everyone commits to remote repositories more frequently than they publish changes, when they pull, git has to make "meta commits" that make it possible to view a branching (i.e. "tree-like") commit history in a linear form. This is good for making sure that the tool works, but it's kind of messy, and you get histories with these artificial events in them that you really ought to remove (but no one does.) The "--rebase" option to "git pull" does this automatically and subtly rewrites your own history in such a way as to remove the need for merge commits. It's way clever and it works. Use the following command:

git pull --rebase

There are caveats:

  • You can't have uncommitted changes in your working copy when you run this command or else it will refuse to run. Make sure everything's committed, or use "git stash"
  • Sometimes the output isn't as clear as you'd want it to be, particularly when things don't go right. If you don't feel comfortable rescuing yourself in a hairy git rebase, you might want to avoid this one.
  • If the merge isn't clean, there has to be a merge commit anyway I believe.

Amend the Last Commit

This is a recent one for me.

If you commit something, but realized that you forgot to save one file, use the "--amend" switch (as below) and you get to add whatever changes you have staged to the previous commit.

git commit --amend

Note: if you amend a commit that you've published, you might have to do a forced update (i.e. git push -f) which can mess with the state of your collaborators and your remote repository.

Stage all of Current State

I've been using a version of this function for years now as part of my download mail scheme. For some reason in my head, it's called "readd." In any case, the effect of this is simple:

  • If a file is deleted from the working copy of the repository, remove it (git rm) from the next commit.
  • Add all changes in the working copy to the next commit.
git-stage-all(){
   # remove deleted files from the index, then stage everything else
   if [ "`git ls-files -d | wc -l`" -gt "0" ]; then
      git rm --quiet `git ls-files -d`
   fi
   git add .
}

So the truth of the matter is that you probably don't want to be this blasé about commits, but it's a great time saver if you use the rm/mv/cp commands on a git repo and want to commit those changes, or have a lot of small files that you want to process in one way and then snapshot the tree with git.

Editor Integration

The chances are that your text editor has some kind of git integration that makes it possible to interact with git without needing to drop into a shell.

If you use something other than emacs I leave this as an exercise for the reader. If you use emacs, get "magit," possibly from your distribution's repository, or from the upstream.

As an aside you probably want to add the following to your .emacs somewhere.

(setq magit-save-some-buffers nil)
(add-hook 'before-save-hook 'delete-trailing-whitespace)

Custom Git Command Aliases

In your user account's "~/.gitconfig" file or in a per-repository ".git/config" file, it's possible to define aliases that add bits of functionality to your git command. This is useful for defining shortcuts and combinations, and for triggering arbitrary scripts. Consider the following:

[alias]
all-push  = "!git push origin master; git push secondary master"
secondary = "!git push secondary master"

Then from the command line, you can use:

git secondary
git all-push

Git Stash

"git stash" takes all of the staged changes and stores them away somewhere. This is useful if you want to break apart a number of changes into several commits, or have changes that you don't want to get rid of (i.e. "git reset") but also don't want to commit. "git stash" puts staged changes onto the stash and "git stash pop" applies the changes to the current working copy. It operates as a FILO stack (e.g. "First In, Last Out") stack in the default operation.

To be honest, I'm not a git stash power user. For me it's just a stack that I put patches on and pull them off later. Apparently it's possible to pop things off the stash in any order you like, and I'm sure I'm missing other subtlety.

Everyone has room for growth.

Ignore Files

You can add files and directories to a .gitignore file in the top level of your repository, and git will automatically ignore these files. One "ignore pattern" per line, and it's possible to use shell-style globbing.

This is great for avoiding accidentally committing temporary files, but I also sometimes add entire sub-directories if I need to nest git repositories within git repositories. Technically, you ought to use git's submodule support for this, but this is easier. Here's the list of temporary files that I use:

.DS_Store
*.swp
*~
\#*#
.#*
\#*
*fasl
*aux
*log

Host Your Own Remotes

I've only accidentally said "git" when I meant "github" (or vice versa) once or twice. With github providing public git-hosting services and a great complement of additional tooling, it's easy to forget how easy it is to host your own git repositories.

The problem is that, aside from making git dependent on one vendor, relying on a hosting service ignores the "distributed" parts of git and all of the independence and flexibility that comes with that. If you're familiar with how Linux/GNU/Unix works, git hosting is entirely paradigmatic.

Issue the following commands to create a repository:

mkdir -p /srv/git/repo.git
cd /srv/git/repo.git
git init --bare

Edit the .git/config file in your existing repository to include a remote block that resembles the following:

[remote "origin"]
fetch = +refs/heads/*:refs/remotes/origin/*
url = [username]@[hostname]:/srv/git/repo.git

If you already have a remote named origin, replace "origin" in the above snippet with another name for this remote. (In multi-remote situations, I prefer to use descriptive identifiers like "public" or the machine's hostname.)

Then issue "git push origin master" on the local machine, and you're good. You can us a command in the following form to clone this repository at any time.

git clone [username]@[hostname]:/srv/git/repo.git

Does anyone have git tricks that they'd like to share with the group?