Three Way Merge Script

Note: This is an old post about a script I wrote a few months ago, a piece of code that I'm no longer (really) using. I present it here as an archival piece with a boatload of caveats. Enjoy!

I have a problem that I think is not terribly unique: I have a directory of files and I want to maintain two distinct copies of these files at once, and I want a tool that looks at both directories and makes sure they're up to date. That's all. Turns out nothing does exactly that, so I wrote a hacked up shell script, and you can get it from the code section:

merge-script

I hope you enjoy!

Background

You might say, "why not just use git to take care of this," which is fair. The truth is that I don't really care about the histories as long as there's revision. Here's the situation:

I keep a personal ikiwiki instance for all of my notes, tasks, and project stuff. There's nothing revolutionary, and I even use deft, dired, and some hacked up lisp to do most of the work. But I also work on a lot of projects that have their own git repositories and I want to be able to track the notes of some of those files in those repositories as well.

Conflicts.

There are some possible solutions:

1. Use hard links so that both files will point at the same data on disk.

Great idea, but it breaks on multiple systems. Even if it might have worked in this case, it frightens me to have such fragile systems.

Note: the more I play with this, the less suitable I think it might be for multi-system use. If one or both of the sides is in a git repo, and you make changes locally and then pull changes in from a git upstream, the git files may look newer than the files that you changed. A flaw.

2. Only edit files in one repository or the other, and have a pre-commit hook, or similar, that copies data from the new system to the old system.

I rejected this because I thought I'd have a hard time enforcing this behavior.

3. Write a script that uses something like diff3 to merge (potential) changes from both sources.

This is what I did.

The script actually uses the merge command which is a wrapper around diff3 from rcs. shrug.
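
For the curious, here is a minimal sketch of what a three-way merge looks like, assuming GNU diff3 is installed. This is not the script itself; the file names (base.txt, wiki.txt, repo.txt) are made up for illustration, and the two copies change different lines so that the merge is clean.

```shell
#!/bin/sh
# Sketch: three-way merge of two diverged copies against a common ancestor.
# All file names here are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"

# The common ancestor, plus two copies that each changed a different line.
printf 'alpha\nbeta\ngamma\ndelta\nepsilon\n' > base.txt
printf 'ALPHA\nbeta\ngamma\ndelta\nepsilon\n' > wiki.txt
printf 'alpha\nbeta\ngamma\ndelta\nEPSILON\n' > repo.txt

# diff3 -m prints the merged file; the argument order is MINE OLDER YOURS.
# With rcs installed, "merge wiki.txt base.txt repo.txt" does the same job
# but writes the result back into wiki.txt.
diff3 -m wiki.txt base.txt repo.txt > merged.txt
cat merged.txt
```

If both copies had changed the same line, diff3 would instead leave conflict markers in the output and exit non-zero, which is the case a real script has to handle.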

Beyond my somewhat trivial and weird use-case, I actually think that this script is more useful for the following situation:

You use services like Dropbox as a way of getting data onto mobile devices (say,) but you want the canonical version of the file to live in a git repository on your system.

This is the script for you.

I hope you enjoy it!

Practical Branch Layout

I've recently gone from being someone who uses git entirely "on my own," to being someone who uses git with a lot of other people at once. I've also had to introduce git to the uninitiated a few times. These have both been notable challenges.

Git is hard, particularly in these contexts: not only are there many concepts to learn, but there's no single prescribed workflow and a multitude of workflow possibilities. Which is great from a philosophy perspective, and arguably good from a "useful/robust tool" perspective, but horrible from a "best practices" perspective. Hell, it's horrible from a "consistent" and "sane" practices perspective.

There are a lot of solutions to the problem of "finding a sane practice" when working with git and large teams. Patching or reformulating the interface is a common strategy. Legit is a great example of this, but I don't think it's enough, because the problem is really one of bad and unclear defaults. For instance:

  • The "master" branch in larger distributed systems is highly confusing. If you have multiple remotes (which is common), every remote has its own master branch, which all (necessarily) refer to different possible points in a repository's history.
  • The names of a local branch do not necessarily refer to the names of the remote branch in any specific repository. The decoupling of local and remote branches makes sense from a design perspective, but it's difficult to retain this mapping in your head, and it's also difficult to talk about branch configurations because your "view of the universe" doesn't often coincide with anyone else's.

Here are some ideas:

  1. Have two modes of operation: a maintainer's mode that resembles current day git with a few basic tweaks described in later options, and a contributor's mode that makes the following assumptions:

    • The "mainline" of the project's development will occur in a branch to which this user only has read-only access.
    • Most of this user's work will happen in isolated topic branches.
  2. Branch naming enforcement:

    • All branches will be uniquely named, relative to a username and/or hostname. This will be transparent (largely) to the history, but will make all conversations about branches less awkward. This is basically how branches work now, with [remote]/[branch], except that all local branches need self/[branch], and the software should make this more transparent.
    • Remote branches will implicitly have local tracking branches with identical names. You could commit to any of the local tracking branches, and pull will have the ability to convert your changes to a self/[branch] if needed.
    • All local branches, if published, will map to a single remote branch. One remote will be the user's default "publication target" for branches.

    This is basically what the origin remote does today, so again, this isn't a major change.

  3. When you run git clone, the remote repository you cloned from should be the upstream remote, not origin.

    The origin remote should be the default "place to publish my work," and would be configured separately.

  4. Minor tweaks.

    • Map local branches directly to remote branches.
    • Be able to specify a remote branch as a "mirror" of another branch.
    • Make cherry-picked commits identifiable by their original commit-id internally. The goal is to push people to cherry-pick commits as much as possible to construct histories without needing to rebase. [1]
    • Have sub-module support automatically configured, without fussing.
    • Have better functions for cleaning up branch cruft. Particularly on remotes.
    • Have some sort of configurable "published pointer," that users can use as a safe block against rebases before a given point.

    The goals here are to:

    • Make working with git all about branches and iteration rather than a sequence of commits.
    • Provide good tools to prevent people from rebasing commits, which is always confusing and rarely actually required.
    • Make branch names as canonical as possible. The fact that there can be many possible names for the same thing is awful.
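
As an aside on idea 3: the upstream/origin split can already be set up by hand with stock git. The following is only a sketch of that manual setup, not the proposed default behavior; the repository names (mainline, mine.git, checkout) are hypothetical, and local directories stand in for real servers.

```shell
#!/bin/sh
# Sketch: clone from a project's mainline, call that remote "upstream",
# and keep "origin" for the place you publish. All paths are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"

# Stand-ins for the project's repository and your own publishing repository.
git init -q mainline
git -C mainline -c user.name="Example" -c user.email="example@example.com" \
    commit -q --allow-empty -m "initial commit"
git init -q --bare mine.git

git clone -q mainline checkout
cd checkout
git remote rename origin upstream        # the thing you cloned from
git remote add origin "$work/mine.git"   # the place you publish to
git remote
```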

Who's with me? We can sort out the details in the comments, if you want.

[1] To make this plausible, github needs to allow cherry-picked commits to close a pull request.

Git Feature Requests

  • The ability to mark a branch "diverged," to prevent (or warn) on attempted merges from master (for example) into a maintenance branch.

  • The ability to create and track dedicated topic branches, and complementary tooling to encourage rebasing commits in these sorts of branches. We might call them "patch sets" or "sets" rather than "branches." Also, it might be useful to think about using/displaying these commits, when published, in a different way.

  • Represent merge commits as hyperlinks to the user, when possible. I think GitHub's "network graph" and similar visualizations are great for showing how commits and branches interact and relate to each other.

    This would probably require some additional or modified output from "git log".

  • Named stashes.

  • Branched stashes (perhaps this is closer to what I'm thinking about for the request regarding topic branches.)

  • The ability to checkout "working copies," of different points/branches currently from a single repository at the same time, using "native" git utilities.

    Related, "shelf" functionality is scriptable, but this too needs to be easier and better supported.

    I think legit is a step in the right direction, but it's weird and probably makes it more difficult to understand what's happening with git conceptually as opposed to the above features which would provide more appropriate conceptual metaphors for the work that would-be-git-users need.

Git In Practice

Most people don't use git particularly well. It's a capable piece of software that supports a number of different workflows, but because it doesn't mandate any particular workflow it's possible to use git productively for years without ever really touching some features.

And some of the features--in my experience mostly those related to more manual branching, merging, and history manipulation operations--are woefully underutilized. Part of this is because Github, which is responsible for facilitating much of git's use, promotes a specific workflow that makes it possible to do most of the (minimal required) branch operations on the server side, with the help of a much constrained interface. Github makes git usable by making it possible to get most of the benefit of git without needing to mess with SHA1 hashes, or anything difficult on the command-line.

That's a good thing. Mostly.

Nevertheless, there are a few operations that remain hard with git: I sometimes encounter situations that I have to try a few times before I get it right, and there are commands that I always have to check the man page to figure out how to specify the references. And even then I'm sometimes still confused. So maybe I (or we?) can spend a little bit of time and figure out what processes remain hard with git and maybe try and see if there is a way to make the process a bit more streamlined.

Here's my list:

  • Reorder all commits since x commit.

    This is basically: find the commit before the earliest one that you want to change, run git rebase -i <commit hash> to reorder the commits even though git sorts the commits in the order that I find most un-intuitive.

  • Create local branches to track remote branches or repositories.

    Set up the remotes, if necessary, and then run: git branch --track <local-branch-name> <remote>/<branch-name> and git config remote.<remote>.push <local-branch-name>:master.

  • Stash all local changes and switch branches.

    It would also be nice if you could figure out a way for git (or a helper) to see any open files in your text editor and save/close them if needed.

  • Pull a commit from the history of one branch into another branch without pulling anything else.

    I think this is cherry-pick? It might also be nice to pull a series of commits from one branch, rebase them into one commit in the destination branch, and then commit that.

  • Pretty much every time I've tried to use the merge command to get something other than what I would have expected to happen by using "pull," it ends tragically.
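
To make the "pull one commit into another branch" item concrete, here is a small sketch using git cherry-pick in a throwaway repository. The file and branch names are invented for the example, and it assumes the picked commit applies cleanly.

```shell
#!/bin/sh
# Sketch: copy exactly one commit from a topic branch onto the main branch.
# Repository contents and branch names are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo base > notes.txt
git add notes.txt
git commit -q -m "base"
main=$(git rev-parse --abbrev-ref HEAD)   # "master" or "main", depending on config

# Two commits on a topic branch; only the second one is wanted elsewhere.
git checkout -q -b topic
echo draft > draft.txt
git add draft.txt
git commit -q -m "add draft"
echo fix > fix.txt
git add fix.txt
git commit -q -m "add fix"
wanted=$(git rev-parse HEAD)

# Back on the main branch, take just that one commit.
git checkout -q "$main"
git cherry-pick "$wanted" >/dev/null
git log --format=%s
```

The main branch ends up with fix.txt but not draft.txt, and the picked commit gets a new hash of its own.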

Reader suggestions:

  • Put your process/procedural frustrations with git here.

How about we work on figuring out how to solve these problems in comments?

9 Awesome Git Tricks

I'm sure that most "hacker bloggers" have probably done their own "N Git Tricks" post at this point. But git is one of those programs that has so much functionality, and everyone uses it so differently, that there is a never ending supply of fresh posts on this topic. My use of git changes enough that I could probably write this post annually and come up with a different 9 things. That said, here's the best list right now.

See Staged Differences

The git diff command shows you the difference between what you've staged (or the last commit, if nothing is staged) and the state of the current working directory. That's really useful and you might not use it as much as you should. The --cached option shows you just the differences that you've staged, relative to the last commit.

This provides a way to preview your own patch, to make sure everything is in order. Crazy useful. See below for the example:

git diff --cached
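
Here is a small sketch of the difference in a scratch repository, with two hypothetical files where only one change has been staged:

```shell
#!/bin/sh
# Sketch: "git diff" versus "git diff --cached". File names are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

printf 'one\n' > a.txt
printf 'one\n' > b.txt
git add a.txt b.txt
git commit -q -m "initial"

# Change both files, but stage only a.txt.
printf 'two\n' >> a.txt
printf 'two\n' >> b.txt
git add a.txt

echo "staged:"
git diff --cached --name-only   # a.txt
echo "unstaged:"
git diff --name-only            # b.txt
```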

Eliminate Merge Commits

In most cases, if two or more people publish commits to a shared repository, and everyone commits locally more frequently than they publish changes, then when they pull, git has to make "meta commits" (merge commits) that make it possible to view a branching (i.e. "tree-like") commit history in a linear form. This is good for making sure that the tool works, but it's kind of messy, and you get histories with these artificial events in them that you really ought to remove (but no one does.) The "--rebase" option to "git pull" does this automatically and subtly rewrites your own history in such a way as to remove the need for merge commits. It's way clever and it works. Use the following command:

git pull --rebase

There are caveats:

  • You can't have uncommitted changes in your working copy when you run this command or else it will refuse to run. Make sure everything's committed, or use "git stash".
  • Sometimes the output isn't as clear as you'd want it to be, particularly when things don't go right. If you don't feel comfortable rescuing yourself in a hairy git rebase, you might want to avoid this one.
  • If your changes conflict with what you pulled, the rebase stops partway and you have to resolve the conflict and run "git rebase --continue" by hand. You still don't get a merge commit, but it isn't automatic anymore.
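
A sketch of the happy path, using a local bare repository as the shared remote and two clones named alice and bob (all of these names are made up, and the two people touch different files so there's no conflict):

```shell
#!/bin/sh
# Sketch: two clones of one shared repository; bob uses "git pull --rebase".
# Repository and user names are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"

commit_file() {   # usage: commit_file <repo> <file> <message>
    echo "$3" > "$1/$2"
    git -C "$1" add "$2"
    git -C "$1" commit -q -m "$3"
}

git init -q --bare shared.git
git clone -q shared.git alice 2>/dev/null
git -C alice config user.name alice
git -C alice config user.email alice@example.com
commit_file alice base.txt "base"
git -C alice push -q origin HEAD

git clone -q shared.git bob
git -C bob config user.name bob
git -C bob config user.email bob@example.com

# Both people commit; alice publishes first.
commit_file alice alice.txt "alice work"
git -C alice push -q origin HEAD
commit_file bob bob.txt "bob work"

# bob replays his commit on top of alice's instead of merging.
git -C bob pull -q --rebase
git -C bob log --format=%s
```

Afterwards bob's history is a straight line (bob work, alice work, base) with no merge commit in it.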

Amend the Last Commit

This is a recent one for me.

If you commit something, but then realize that you forgot to save one file, use the "--amend" switch (as below) and you get to add whatever changes you have staged to the previous commit.

git commit --amend

Note: if you amend a commit that you've published, you might have to do a forced update (i.e. git push -f) which can mess with the state of your collaborators and your remote repository.
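
The whole sequence, sketched in a scratch repository with hypothetical file names. The --no-edit flag is my addition here; it just reuses the old commit message instead of opening an editor.

```shell
#!/bin/sh
# Sketch: fold a forgotten file into the previous commit. Names are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo one > saved.txt
git add saved.txt
git commit -q -m "add both files"

# The forgotten file: stage it and amend the commit that should have had it.
echo two > forgotten.txt
git add forgotten.txt
git commit -q --amend --no-edit

git log --format=%s             # still a single commit
git ls-tree --name-only HEAD    # and it now contains both files
```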

Stage all of Current State

I've been using a version of this function for years now as part of my download mail scheme. For some reason in my head, it's called "readd." In any case, the effect of this is simple:

  • If a file is deleted from the working copy of the repository, remove it (git rm) from the next commit.
  • Add all changes in the working copy to the next commit.

git-stage-all(){
   # Remove files deleted from the working copy, then stage everything else.
   if [ "$(git ls-files -d | wc -l)" -gt 0 ]; then
      git ls-files -d -z | xargs -0 git rm --quiet
   fi
   git add .
}

So the truth of the matter is that you probably don't want to be this blasé about commits, but it's a great time saver if you use the rm/mv/cp commands on a git repo and want to commit those changes, or have a lot of small files that you want to process in one way and then snapshot the tree with git.

Editor Integration

The chances are that your text editor has some kind of git integration that makes it possible to interact with git without needing to drop into a shell.

If you use something other than emacs I leave this as an exercise for the reader. If you use emacs, get "magit," possibly from your distribution's repository, or from the upstream.

As an aside you probably want to add the following to your .emacs somewhere.

(setq magit-save-some-buffers nil)
(add-hook 'before-save-hook 'delete-trailing-whitespace)

Custom Git Command Aliases

In your user account's "~/.gitconfig" file or in a per-repository ".git/config" file, it's possible to define aliases that add bits of functionality to your git command. This is useful for defining shortcuts, combinations, and for triggering arbitrary scripts. Consider the following:

[alias]
all-push  = "!git push origin master; git push secondary master"
secondary = "!git push secondary master"

Then from the command line, you can use:

git secondary
git all-push

Git Stash

"git stash" takes all of the uncommitted changes to tracked files, staged or not, and stores them away somewhere. This is useful if you want to break apart a number of changes into several commits, or have changes that you don't want to get rid of (i.e. "git reset") but also don't want to commit. "git stash" puts those changes onto the stash and "git stash pop" applies them to the current working copy. It operates as a FILO (i.e. "First In, Last Out") stack in the default operation.

To be honest, I'm not a git stash power user. For me it's just a stack that I put patches on and pull them off later. Apparently it's possible to pop things off the stash in any order you like, and I'm sure I'm missing other subtleties.

Everyone has room for growth.

Ignore Files

You can add files and directories to a .gitignore file in the top level of your repository, and git will automatically ignore these files. One "ignore pattern" per line, and it's possible to use shell-style globbing.

This is great to avoid accidentally committing temporary files, but I also sometimes put entire sub-directories if I need to nest git repositories within git-repositories. Technically, you ought to use git's submodule support for this, but this is easier. Here's the list of temporary files that I use:

.DS_Store
*.swp
*~
\#*#
.#*
\#*
*fasl
*aux
*log

Host Your Own Remotes

I've only accidentally said "git" when I meant "github" (or vice versa) once or twice. With github providing public git-hosting services and a great complement of additional tooling, it's easy to forget how easy it is to host your own git repositories.

The problem is that, aside from making git dependent on one vendor, this ignores the "distributed" parts of git and all of the independence and flexibility that comes with that. If you're familiar with how Linux/GNU/Unix works, git hosting is entirely paradigmatic.

Issue the following commands to create a repository:

mkdir -p /srv/git/repo.git
cd /srv/git/repo.git
git init --bare

Edit the .git/config file in your existing repository to include a remote block that resembles the following:

[remote "origin"]
fetch = +refs/heads/*:refs/remotes/origin/*
url = [username]@[hostname]:/srv/git/repo.git

If you already have a remote named origin, replace the word origin in the above snippet with the name of your remote. (In multi-remote situations, I prefer to use descriptive identifiers like "public" or the machines' hostnames.)

Then issue "git push origin master" on the local machine, and you're good. You can use a command in the following form to clone this repository at any time.

git clone [username]@[hostname]:/srv/git/repo.git
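
If you want to try the whole round trip before touching a server, the same steps work against a bare repository on the local disk. This is just a sketch: the directory names are made up, and a local path stands in for the [username]@[hostname]:/srv/git/repo.git URL.

```shell
#!/bin/sh
# Sketch: bare repository, push, and clone, all on one machine.
# Directory names are hypothetical; a local path stands in for the ssh URL.
set -eu
work=$(mktemp -d)
cd "$work"

# The "server side": a bare repository with no working copy.
git init -q --bare repo.git

# An existing project that gets published to it.
git init -q project
cd project
git config user.name "Example User"
git config user.email "user@example.com"
echo hello > README
git add README
git commit -q -m "initial commit"
git remote add origin "$work/repo.git"
git push -q origin HEAD

# Anyone with access can now clone it.
cd "$work"
git clone -q repo.git copy
cat copy/README
```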

Does anyone have git tricks that they'd like to share with the group?

Key Git Concepts

Git is a very... different kind of software. It's explicitly designed against the paradigm of other programs like it (version control/source management) and to make matters worse most of its innovations and eccentricities are very difficult to create metaphors and analogies around. This is likely because it takes a non-prescriptive approach to workflow (you can work with your collaborators in any way that makes sense for you) and more importantly it lets people do away with linearity. Git makes it possible, and perhaps even encourages, creators to give up an idea of a singular or linear authorship process.

That sounds great (or is at least passable) in conversation but it is practically unclear. But even when you sit down and can interact with a "real" git repository, it can still be fairly difficult to "grok." And to make matters worse, there are a number of very key concepts that regular users of git acclimate to but that are still difficult to grasp from the outset. This post, then, attempts to illuminate a couple of these concepts more practically in hopes of making future interactions with git less painful. Consider the following:

The Staging Area

The state of every committed object (i.e. file) as of the last commit is the HEAD. Every commit has a unique identifying hash that you can see when you run git log.

The working tree, or checkout, is the files you interact with inside of the local repository. You can checkout different branches, so that you're not working in the "master" (default or trunk) branch of the repository, which is mostly an issue when collaborating with other people.

If you want to commit something to the repository, it must first be "staged" or added with the git add command. Use git status to see what files are staged and what files are not staged. The output of git diff is the difference between the HEAD plus all staged changes, and all unstaged changes. To see the difference between all staged changes and HEAD use "git diff --cached".

The staging area makes it possible to construct commits in very granular sorts of ways. The staging area makes it possible to use commits less like "snapshots" of the entire tree of a repository, and rather as discrete objects that contain a single atomic change set. This relationship to commits is enhanced by the ability to do "squash merges" and squash a series of commits in a rebase, but it starts with the staging area.

If you've staged files incorrectly you can use the git reset command to reset this process. Used alone, reset is a non-destructive operation.
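
A small sketch of that last point, in a scratch repository with a hypothetical essay.txt: the file gets staged, then unstaged with a bare git reset, and the edit itself survives.

```shell
#!/bin/sh
# Sketch: unstage a change with "git reset" without losing the edit.
# The file name is hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo v1 > essay.txt
git add essay.txt
git commit -q -m "first draft"

echo v2 > essay.txt
git add essay.txt
git diff --cached --name-only   # essay.txt is staged

git reset -q
git diff --cached --name-only   # prints nothing: the index matches HEAD again
cat essay.txt                   # still v2: the working copy is untouched
```

It's "git reset --hard" that throws away working copy changes, which is a different animal.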

Branches

The ability to work effectively in branches is the fundamental function of git, and probably also the easiest to be confused by. A branch in git, fundamentally, is just a different HEAD in the same repository. Branches within a single repository allow you to work on specific sets of changes (e.g. "topics") and track other people's changes, without needing to make modifications to the "master" or default branch of the repository.

The major confusion of branches springs from git's ability to treat every branch of every potentially related repository as a branch of each other. Therefore it's possible to push to and pull from multiple remote branches from a single remote repository and to push to and pull from multiple repositories. Ignore this for a moment (or several) and remember:

A branch just means your local repository has more than one "HEAD" against which you can create commits and "diff" your working checkout. When something happens in one of these branches that's worth saving or preserving or sharing, you can either publish this branch or merge it into the "master" branch, and publish those changes.

The goal of git is to construct a sequence of commits that represent the progress of a project. Branches are a tool that allow you to isolate changes within trees until you're ready to merge them together. When the differences between HEAD and your working copy become too difficult to manage using git add and git reset, create a branch and go from there.
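
In practice the isolate-then-merge cycle is only a handful of commands. A sketch, with invented branch and file names:

```shell
#!/bin/sh
# Sketch: isolate work on a branch, then merge it back. Names are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo base > story.txt
git add story.txt
git commit -q -m "base"
main=$(git rev-parse --abbrev-ref HEAD)   # "master" or "main", depending on config

# Work happens on its own branch.
git checkout -q -b chapter-two
echo "chapter two" > chapter2.txt
git add chapter2.txt
git commit -q -m "draft chapter two"

# The main branch doesn't see it until the merge.
git checkout -q "$main"
ls
git merge -q chapter-two
ls
```

The first ls shows only story.txt; after the merge, chapter2.txt is there too.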

Rebase

Rebasing git repositories is scary, because the operation forces you to rewrite the history of a repository to "cherry pick" and reorder commits in a way that leads to a useful progression and collection of atomic moments in a project's history. As opposed to the tools that git replaces, "the git way" suggests that one ought to "commit often" because all commits are local operations, and this makes it possible to use the commit history to facilitate experimentation and very small change sets that the author of a body of code (or text!) can revert or amend over time.

Rebasing allows you to take the past history objects, presumably created frequently during the process of working (i.e. to save a current state) and compress this history into a set of changes (patches) that reflect a more usable history once the focus of work has moved on. I've read and heard objections to git on the basis that it allows developers to "rewrite history," and individuals shouldn't be able to perform destructive operations on the history of a repository. The answer to this is twofold:

  • Git, and version control in general, isn't necessarily supposed to provide a consistently reliable history of a project's code. It's supposed to manage the code, and provide useful tools for managing and using the history of a project. Because of the way the staging area works, sometimes commits are made out of order or a "logical history object" is split into two actual objects. Rebasing makes these non-issues.
  • Features like rebasing are really intended to happen before commits are published, in most cases. Developers will make a series of commits and then, while still working locally, rebase the repository to build a useful history and then publish those changes to their collaborators. So it's not so much that rebasing allows or encourages revisionist histories, but that it allows developers to control the state of their present or the relative near future.
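
Here is a sketch of that "compress before publishing" step: three work-in-progress commits get folded into one. Normally you'd edit the todo list in your editor during "git rebase -i"; to keep the example non-interactive, a tiny helper script (todo-editor.sh, a name I made up) does the edit through the GIT_SEQUENCE_EDITOR environment variable.

```shell
#!/bin/sh
# Sketch: squash three local "wip" commits into one before publishing.
# File names and the helper script are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git config user.name "Example User"
git config user.email "user@example.com"

echo start > draft.txt
git add draft.txt
git commit -q -m "start"
for n in 1 2 3; do
    echo "work in progress $n" >> draft.txt
    git commit -q -a -m "wip $n"
done

# Stand-in for editing the todo list by hand: keep the first "pick"
# and turn every later one into "fixup".
cat > "$work/todo-editor.sh" <<'EOF'
sed '2,$s/^pick /fixup /' "$1" > "$1.new" && mv "$1.new" "$1"
EOF

GIT_SEQUENCE_EDITOR="sh $work/todo-editor.sh" git rebase -i HEAD~3
git log --format=%s
```

The result is two commits, "wip 1" (now carrying all three changes) on top of "start". Since none of this was published, nobody else ever sees the intermediate commits.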

Bonus: The git stash

The git stash isn't so much a concept that's difficult to grasp, but a tool for interacting with the features described above that is pretty easy to get. Imagine one of the following cases:

You're making changes to a repository, you're not ready to commit, but someone writes you an email, and says that they need you to quickly change 10 or 12 strings in a couple of files (some of which you're in the middle of editing,) and they need this change published very soon. You can't commit what you've edited as that might break something you're unwilling to risk breaking. How do you make the changes you need to make without committing your in-progress changes?

You're working in a topic branch, you've changed a number of files, and suddenly realized that you need to be working in a different branch. You can't commit your changes and merge them into the branch you need to be using, because that would disrupt the history of that branch. How do you save current changes and then import them to another branch without committing?

Basically, invoke git stash, which saves the difference between HEAD and the current state of the working directory (including anything you've staged). Then do whatever you need to do (change branches, pull new changes, do some other work,) and then invoke git stash pop and everything that was included in your stash will be applied to the new working copy. It's sort of like a temporary commit. There's a lot of additional functionality within git stash, but that's an entirely distinct bag of worms.
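
The second scenario, sketched in a scratch repository (the branch name "other" and the file are hypothetical):

```shell
#!/bin/sh
# Sketch: carry uncommitted edits to another branch with git stash.
# Branch and file names are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo base > notes.txt
git add notes.txt
git commit -q -m "base"
git branch other

# Uncommitted work in progress on the current branch.
echo "half-finished idea" >> notes.txt

git stash -q             # the working copy is clean again
git checkout -q other
git stash pop -q         # the edit reappears here, still uncommitted
git status --porcelain   # notes.txt shows up as modified
```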

Onward and Upward!

Git Tips for Writers

The Context

git is this version control system that's designed to be used in a distributed manner, and supports a very diverse and non-linear workflow. While it's designed to support the work of software developers--particularly in large projects like the linux kernel--at the core, git is just a file system layer that has an awareness of time and iteration. It also does its magic on any kind of text files... code or writing. I use git to manage a lot of my writing--indeed, most of my digital life, which is a bit weird admittedly; and as a result people on the Internet, not to mention my coworkers, come to me with git questions from time to time. This post is a response to a more recent change.

How I Work

I have two kinds of repositories: general repositories that store a bunch of different kinds of files that I always need to get things done, and specific project-only repositories that only have the text (and possibly notes) for a very specific project. I also have a "writing" repository where I do drafting for the blog, and start writing projects that I'd like to version, but are too small yet for their own project repositories. The brief overview:

  • garen is like my home directory within my home directory, and it has config files, scripts, and other daily essentials.
  • org stores my org-mode files.
  • fiction projects: I have five repositories in ~/ that store fiction projects, that I'm theoretically working on in some capacity, though I haven't touched most of them regularly.
  • writing holds blog drafting, and a couple of not-exactly-fiction, projects that I'm not quite ready to admit exist.
  • website content: wikish, tychoish.com, cyborginstitute.com, the cyborg institute wiki and a few other website projects that I'm involved with have repositories to store their content.

The lesson here, about repository organization, is that git wants you to have distinct repositories for different projects. It's possible to merge repositories together (really!) and also to separate the histories of specific directories into their own repositories if you're so inclined.

I write in emacs almost exclusively, and I sometimes use magit, which is a delightful interface to git that works within emacs in a very emacs-centric way. If you use dired, magit will be familiar. Having said that, I mostly just add files, make commits and push repositories. Although I've been very interested in flashbake for some time, I've never really used it: it seems designed for people who aren't used to version control or git, and the fact that I am means that it feels cumbersome to me. I suppose I should take this as a challenge, and attempt to hack it into something more usable from my perspective, but I've not felt the urge yet.

I use gitosis (it's in the debian repositories) on foucault (my server) to manage the publication of my git repositories. I push regularly, both to make sure that all of my machines are up to date, and also as a way of keeping my systems backed up. While I don't take snapshots of my systems, I've been able to set up systems and been up and running inside of ninety minutes after reimaging a laptop without losing a single bit. Although unorthodox, git is my backup strategy, and the restores work fine. I strongly recommend having your own git hosting set up. It's not difficult, and while I think git hub is awesome on its own terms, independence and self sufficiency are really important here.

I don't really take advantage of any branching and merging in git, though I've played with it enough to know how it works. I do have a branch in the repository for the novel I'm writing for an editor to be able to edit the novel as I write on it without needing to see their changes and comments until I get to that point.

And that's sort of it. I use jekyll (or an old personal fork, soon to be cyblog) as well as ikiwiki to publish content, but other than that, I just write stuff.

In any case, if you have thoughts on the subject I'd love to see your input on the wikish git writing page.

Write on!

git magic

The following, mostly accurate conversation (apologies for any liberties) should be a parable for the use of the git version control system: As I was about to leave work the other day...

tycho: I pushed today's work to our repository, have at, I'm headed out.

Coworker A: Awesome. I did too. (pause) wait. It's screwed up. I deleted a file I didn't mean to. (pastes link to diff into chatroom)

tycho: Oh, that's easy to fix. You can reset back to before the file, add all the changes that are in your repository, except the deletion of the file, commit, and then "git reset --hard" and then publish that.

Coworker A: But your changes...

(as an aside, the original solution should still work, I think)

tycho: Oh. Hrm. Right. Well... Rebase to remove the bad commit and then add the file in question back on top of my changes.

Coworker A: Wait, what?

tycho: (looks at clock). Shit, I'll do it. (turns to Coworker P), have you pulled recently?

Coworker P: Nope I'll do that no--

tycho: Don't!

Coworker P: Alright then!

tycho: (mumbles and works)


At this juncture, I pull out crazy git commands and rebase the repository, back a few commits to pull out a single changeset. And then recommit the file with the changes worth saving (which I had copied into ~/ before beginning this operation.)

One thing I've learned about using git rebase is that I always have to go back a commit or two further than I think I need to when picking out the hash for the last good commit. Also, when using "git rebase -i" I find that the commits are listed in the reverse of the order that I want them listed in.
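
For what it's worth, pulling a single changeset out of the middle of a history doesn't strictly need the interactive editor. What follows is not what I typed that day, just a sketch of a non-interactive variant using "git rebase --onto", with made-up commits standing in for the real ones:

```shell
#!/bin/sh
# Sketch: drop one bad commit from the middle of a branch's history.
# A non-interactive alternative to "git rebase -i"; all names are hypothetical.
set -eu
work=$(mktemp -d)
cd "$work"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo keep > keep.txt
git add keep.txt
git commit -q -m "good commit"

git rm -q keep.txt
git commit -q -m "bad: delete keep.txt"
bad=$(git rev-parse HEAD)

echo more > more.txt
git add more.txt
git commit -q -m "later work"

# Replay everything after $bad onto the commit before it, which drops $bad.
git rebase -q --onto "$bad^" "$bad"
git log --format=%s
ls
```

The log now reads "later work" then "good commit", and keep.txt is back. Like any rebase of published commits, this rewrites history, so the forced push caveat below still applies.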

Another great hint: Issue the following command if you're an emacs user and you don't want git to open rebase editing sessions in vim.

git config --global core.editor "emacsclient -t -a emacs -NOW"

The one issue here is that I had to rewrite the history of an already published series of changes. This is why I didn't want P to pull. When I was done, and the state of my repository was as it should have been, my next push (predictably) failed, as it needed to be a "git push -f", which is something of a scary operation. It worked out, and when everyone pulled the next time everything was fine: I knew it would be for P because their local repository never knew about the first iteration of the history. I was less sure if A's would adjust so seamlessly, but it did.


tycho: Ok, done. Pull A.

Coworker A: All better! I have no clue what happened.

tycho: It's cool, don't sweat it. There's very little that isn't fixable. As long as you don't hard reset changes, and don't do crazy rebasing stuff, you should be ok.

Coworker A: Like what you just did?

tycho: Pretty much.


Here are the lessons:

  • "git push" and "git pull" would seem like parallel operations but they're not. Pull with abandon, it never hurts to pull. But if lots of people are pulling from the same repository, and you push a change that you don't mean to push, it's really hard to take that change back in a logical and productive way. So push with caution.
  • Rebasing is a tool with great power that shouldn't be feared, even though theoretically you can screw stuff up with it. The git way, "commit your changes early and often," is great, but it can be sort of anti-social, as individual commits become sort of meaningless, and change logs can get hard to manage. Rebasing, though scary, makes it possible to both commit as often as you need to and then rebase to be presentable.
  • Fear forced pushes.
  • Everything in git can be changed, so play with things, and then only publish changes when the repository is in a good working state.

Onward!