on git: in two parts

A post about the distributed version control system "git" in two parts.

Part One: Git Puns

My identi.ca buddy madalu, a frequent commenter here, posted the following notice a few weeks ago:

#ubuntu-one... No thanks! I'll stick with my home-brewed git + server + usb drive solution. My git repos breed like rabbits!

Which basically sums up my opinion on ubuntuone. But I thought that "my git repos breed like rabbits" was both accurate (git repositories are designed to be replicated in their entirety) and a sort of funny way to put it. And being the kind of person that I am, I decided to see what other (potentially dirty) puns I could make about git. Here's what I came up with:

what did one git repo say to another git repo? pull my diff

what did mama git say when she found her remote in his room making new branches? octopus merge this instant!

what did one git remote say to entice another remote to branch? it's ok we can just tell them we were cherry picking later.

what did dr. git say when a repo complained of bloating? git gc

I should point out that these four puns all demonstrate a factual feature of git, though the "pull my diff" isn't exactly what happens.

"Octopus merge" is the strategy that git uses when more than two heads need to be merged at once, that is, when you merge several branches into the current one in a single step. Similarly, "cherry picking" is a way to apply the changes from individual commits by hand if you're not ready to do full merges, and git gc is the cleanup command that goes through, re-compresses the object database, and prunes unreachable objects so that your repo works faster and takes less disk space.
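For anyone who wants to see those three commands in action, here's a rough sketch that builds a throwaway repository in a temp directory. The branch names (trunk, experiment, topic-a and so on), the file names, and the demo identity are all made up for illustration; treat it as one way to poke at the commands, not as a recommended workflow.

```shell
#!/usr/bin/env bash
# Sketch only: everything happens in a scratch repository.
set -euo pipefail

repo="$(mktemp -d)"
cd "$repo"
git init -q
git checkout -q -b trunk                    # hypothetical branch name
git config user.name "Demo"                 # assumed identity, demo only
git config user.email "demo@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "base"

# Three topic branches, each adding its own file.
for topic in a b c; do
    git checkout -q -b "topic-$topic" trunk
    echo "$topic" > "$topic.txt"
    git add "$topic.txt"
    git commit -q -m "add $topic.txt"
done

# A branch with two commits, of which we only want the second.
git checkout -q -b experiment trunk
echo wip > wip.txt; git add wip.txt; git commit -q -m "half-finished idea"
echo fix > fix.txt; git add fix.txt; git commit -q -m "useful fix"

git checkout -q trunk

# cherry-pick: apply just the tip commit of experiment, not the whole branch.
git cherry-pick experiment > /dev/null

# Octopus merge: naming several branches in one merge yields one merge commit.
git merge -q -m "merge all topics" topic-a topic-b topic-c > /dev/null

# gc: repack the object database and prune unreachable objects.
git gc -q

parents="$(git cat-file -p HEAD | grep -c '^parent ')"
echo "merge commit has $parents parents"
```

After this runs, trunk has fix.txt but not wip.txt, and the last commit has one parent for the old trunk plus one for each topic branch.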

Anyway, I'm out of puns; you're all welcome to join in.

Part Two: Atypical uses of Git.

I'm sure I've written a bunch here about how I'm not really a programmer, and while this is true, I do use git a lot. In part I think this is because git is really mostly an ad-hoc file system, and in part because the kind of writing I do isn't that different from programming.

So aside from storing my writing projects and my orgmode files, I do things like store all of my mail directories in git. You might think that's kind of weird, but the truth is that it makes keeping lots of computers in sync a rather simple proposition, and it's damn fast.

I also have a directory I call "garen" (but used to call "main") that is basically my home directory. It has all my emacs lisp files, most of my non-mail related scripts, various configuration files, and so forth. It started out as a backup and workspace for smaller projects, but it's since morphed into "that one thing I need to have on my computer in order to actually work." When I was setting up the server, it took a thousand things that might have been huge headaches and made them non-issues. Here's what this repo looks like:

  • emacs/ This is where my emacs-lisp files all live. I have an 'init.el' file, which is basically the standard .emacs file, and a 'gui-init.el' file for code that I only want to run on a desktop machine, where I'll be running non-console emacs frames. As a result, on my machines my .emacs file looks like this:

    (load "~/garen/emacs/gui-init.el")
    (load "~/garen/emacs/init.el")
    

    With the first line commented out if needed. End result, emacs loads the same everywhere, no thinking.

  • scripts/ I add this to my path, so that any little bit of bash script that I want to be able to use is accessible and the same on all my machines.

  • configs/ Generally my format is to have config_file.machine_name, for example: bashrc.leibniz. In the case of the bashrc, I have a ".common" file that has everything that all my machines need, while the machine specific files have everything that's... well specific, and a source statement for the common file. So my "real" .bashrc looks like this:

    source /home/tychoish/garen/configs/bashrc.leibniz
    

    And everything stays in sync between the machines. How cool is that.
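If you'd rather not hard-code the machine name in each stub, something like the following might work. This is a variation on the setup above, not what my own files do: the function names load_machine_bashrc and add_to_path are invented for the example, and it assumes each machine file sources the ".common" file itself, as described.

```shell
# Hypothetical helper: source configs/bashrc.<machine> if it exists,
# otherwise fall back to the shared bashrc.common.
load_machine_bashrc() {
    local config_dir="$1"
    local machine="${2:-$(hostname)}"
    local specific="$config_dir/bashrc.$machine"

    if [ -r "$specific" ]; then
        # The machine file is expected to source bashrc.common on its own.
        . "$specific"
    elif [ -r "$config_dir/bashrc.common" ]; then
        . "$config_dir/bashrc.common"
    fi
}

# Hypothetical helper: prepend a directory to PATH unless it's already there.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, do nothing
        *) PATH="$1:$PATH" ;;
    esac
}

# Possible use in ~/.bashrc (paths assumed):
#   add_to_path "$HOME/garen/scripts"
#   load_machine_bashrc "$HOME/garen/configs"
```

The trade-off is that the stub becomes identical on every machine, at the cost of depending on whatever hostname happens to report.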

That's sort of the most important thing. The great thing is that this makes setting up a new user account on a server, or a new box itself, a piece of cake.
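To give a sense of how little there is to it, here's a sketch of what that setup could look like as a script. The function name bootstrap_account, its arguments, and the repository URL are placeholders I've made up; the two stub files it writes are the ones shown in the list above.

```shell
# Hypothetical bootstrap helper, not my actual script.
bootstrap_account() {
    local repo_url="$1"      # wherever the garen repo lives (assumed)
    local home_dir="$2"
    local machine="$3"
    local garen="$home_dir/garen"

    # Clone only if the checkout isn't already there.
    [ -d "$garen" ] || git clone -q "$repo_url" "$garen" || return 1

    # Write the stub files, without clobbering anything that exists.
    if [ ! -e "$home_dir/.emacs" ]; then
        # The gui line starts out commented; uncomment it on desktops.
        printf '%s\n' ';(load "~/garen/emacs/gui-init.el")' \
                      '(load "~/garen/emacs/init.el")' > "$home_dir/.emacs"
    fi
    if [ ! -e "$home_dir/.bashrc" ]; then
        printf 'source %s/configs/bashrc.%s\n' "$garen" "$machine" \
            > "$home_dir/.bashrc"
    fi
}

# Possible use (URL is a placeholder):
#   bootstrap_account "ssh://example.invalid/garen.git" "$HOME" leibniz
```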

Food for thought!

emacs blogging? me too? forward directions...

One of the things that I find a lot when I'm searching the internet for emacs things (or, in the case of my google alert, when emacs stuff on the web finds me) is people writing blog posts along the lines of, "so I was playing around with emacs weblogger mode..." Which is pretty much what this is.

I've been toying with the idea of switching to a git-based blogging platform/site generator that would be much slimmer than my current tool, which (though I love it, and recommend it to other people regularly) doesn't seem to fit my workflow particularly well. Since leaving the (wonderful) TextMate blogging bundle behind, I've been in search of a blogging tool... and, well, I'm still looking.

When I find it, I assure you that you'll be the first to know.

Committing From the Bottom Up

My blog reading eyes/ears tend to perk up when I see someone writing about git, as this piece of software fascinates me in a potentially unhealthy sort of way. I read a post the other day that talked a bunch about git and centralized SCM tools like SVN and CVS, as well as the other distributed SCM, Bazaar. If that last sentence was Greek to you, don't worry, I'm heading into a pretty general discussion. Here's the background:

Version control or source control management systems (VCS/SCM) are tools that programmers use to store the code of a program or project as they develop it. These tools store versions of a code base, which has a lot of benefits: programmers can work concurrently on a project and distribute their changes regularly to avoid duplicating effort or working on divergent editions of the code. SCMs also save your history, so that in case you change something you didn't intend to, you can go back to known working states or "revive" older features that you'd deleted. It's a good thing, and I'd wager that most programmers use some sort of system to handle this task. [1]

The basic unit of any version control system is the "commit," which represents a collection or set of changes that a given developer chooses to "check in" to the system. There are two basic models of VCS/SCM: the centralized client/server system and the distributed system. Centralization means that the history is stored on a server or centralized machine, and a group of developers all send and pull changes from that central "repository." Distributed systems give every developer in a project a copy of the full history, and give them the capability of sending or pulling changes from any other developer in a system.


There's a lot of discussion about the various merits of both distributed and centralized version control systems, and much of it ends up being hashed out over technological features, like speed and the ease of various operations, or over process features that relate to what a system allows or promotes in terms of workflow. While these discussions are interesting, they're too close to the actual programs to see something that I think is pretty interesting.

In centralized systems, "the commit" is something that serves the project's management. If done right (so the theory goes), only a select few have access to submit changes in a centralized system: the central server's only way of reconciling diverging versions of a code base is to accept the first submitted change (a poor solution), and the more developers you have, the greater the chance of version collisions. As a result, a lot less committing happens. In big projects, you still have to mail patches around because only a few people can commit changes, and in smaller teams people are more likely to "put off committing," because frequent commits of incremental changes are more likely to confuse teammates and committing amounts to publication.

In distributed systems, since the total "repository" is stored locally, committing changes to your repository and publishing changes to collaborators are separate operations. As a result, there's less incentive for developers to avoid creating commits for incremental changes. Rather than have commits mark complete working states with a lot of changes in every individual commit, commits mark points of experimentation in the distributed system.
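Here's a small sketch of that separation using git. It sets up a bare repository in a temp directory to stand in for whatever your collaborators can see, makes a few incremental commits in a private clone, and only then publishes them. The directory names, the branch name trunk, and the demo identity are assumptions for the example.

```shell
#!/usr/bin/env bash
# Sketch only: committing and publishing as two separate steps.
set -euo pipefail

work="$(mktemp -d)"
shared="$work/shared.git"                   # stand-in for the published repo
git init -q --bare "$shared"
git clone -q "$shared" "$work/mine"
cd "$work/mine"
git checkout -q -b trunk                    # hypothetical branch name
git config user.name "Demo"                 # assumed identity, demo only
git config user.email "demo@example.com"

# Three small, experimental commits. Nobody else can see these yet.
for step in 1 2 3; do
    echo "attempt $step" >> notes.txt
    git add notes.txt
    git commit -q -m "experiment $step"
done
local_count="$(git rev-list --count HEAD)"

# The shared repository still has no trunk branch at this point.
if git --git-dir="$shared" rev-parse -q --verify refs/heads/trunk > /dev/null; then
    published_before=yes
else
    published_before=no
fi

# Publishing is its own, deliberate step.
git push -q origin trunk
published_after="$(git --git-dir="$shared" rev-list --count trunk)"

echo "local commits: $local_count"
echo "published before push: $published_before"
echo "published commits after push: $published_after"
```

In a centralized system the loop body and the push would effectively be the same act, which is the whole point of the comparison.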

This difference is really critical. Commits in a centralized system serve the person who "owns" the repository, whereas in the distributed system they serve the developer. There are other aspects of these programs which affect the way developers relate to their code, but I think on some fundamental level this is really important.

Also, I don't want to make the argument that "bottom up distribution = good and top down centralization = bad," as I think it's more complicated than that. It's also possible to use distributed technology in centralized workflows, and if you use centralized systems with the right team, the top-down limitation isn't particularly noticeable. But as a starting point, it's an interesting analysis.

[1] So common are they that I was surprised to learn that the Linux kernel (a massive project) spent many, many years without any formal system to manage these functions. They used "tar balls and patches, for years," which is amazing.