Insufficiently Random: Why commit messages matter

Some folks wonder why I want longer, detailed commit messages in a project. Often other people claim "Fix the frobinator bug when it frobs too slow" might be sufficiently detailed to cover a change. But it's usually not.

As you explored the issue and tried to understand the problem, you filled your head up with important details about how the frobinator works, what a frob even is, what a slow frob looks like, and why a slow frob shouldn't be permitted in this context. All of this information is necessary for you to understand the problem and code a patch that resolves it. Moreover, if this detail wasn't necessary for you to code the patch, you wouldn't have had the slow frobbing in the first place. It would have been fairly obvious at the time of original development.

Commit messages, when combined with a powerful blame engine in your version control, can give you really powerful insight into what you were thinking at the time. This can be incredibly handy when someone asks a question later.

Yesterday, Junio Hamano, git maintainer extraordinaire, asked me why git-gui implements its own clone function. When I wrote this code, it must have been really obvious to me why it needed to reimplement the same logic as git clone. But I wrote it back in 2007. I've done a ton of things since then. There's no way I can remember what I was doing, or why I was doing it. I do however remember thinking, "this code is done, it works, I'll never have to look at or think about it again". Famous last words.

When Junio asked this question... I honestly couldn't remember what I was doing. I'm usually somewhat against reinventing the wheel, and I try to avoid rewriting something unless I seem to have a good reason for it. So I really was looking at his question saying, "yeah, why did I do that there...".

Fortunately, I write fairly detailed commit messages, and git blame is an incredible tool:


  $ git blame lib/choose_repository.tcl
  ...
  81d4d3dd (Shawn O. Pearce     2007-09-24 08:40:44 -0400  633) 
  81d4d3dd (Shawn O. Pearce     2007-09-24 08:40:44 -0400  634)           $o_cons start \
  81d4d3dd (Shawn O. Pearce     2007-09-24 08:40:44 -0400  635)                   [mc "Counting objects"] \
  81d4d3dd (Shawn O. Pearce     2007-09-24 08:40:44 -0400  636)                   [mc "buckets"]
  ...
  81d4d3dd (Shawn O. Pearce     2007-09-24 08:40:44 -0400  673)           update
  81d4d3dd (Shawn O. Pearce     2007-09-24 08:40:44 -0400  674) 
  ab08b363 (Shawn O. Pearce     2007-09-22 03:47:43 -0400  675)           file mkdir [file join .git objects pack]
  ab08b363 (Shawn O. Pearce     2007-09-22 03:47:43 -0400  676)           foreach i [glob -tails -nocomplain \
  ab08b363 (Shawn O. Pearce     2007-09-22 03:47:43 -0400  677)                   -directory [file join $objdir pack] *] {
  ab08b363 (Shawn O. Pearce     2007-09-22 03:47:43 -0400  678)                   lappend tolink [file join pack $i]
  ab08b363 (Shawn O. Pearce     2007-09-22 03:47:43 -0400  679)           }
  81d4d3dd (Shawn O. Pearce     2007-09-24 08:40:44 -0400  680)           $o_cons update [incr bcur] $bcnt
  81d4d3dd (Shawn O. Pearce     2007-09-24 08:40:44 -0400  681)           update
  81d4d3dd (Shawn O. Pearce     2007-09-24 08:40:44 -0400  682) 
  81d4d3dd (Shawn O. Pearce     2007-09-24 08:40:44 -0400  683)           foreach i $buckets {
  ab08b363 (Shawn O. Pearce     2007-09-22 03:47:43 -0400  684)                   file mkdir [file join .git objects $i]
  ab08b363 (Shawn O. Pearce     2007-09-22 03:47:43 -0400  685)                   foreach j [glob -tails -nocomplain \
  ab08b363 (Shawn O. Pearce     2007-09-22 03:47:43 -0400  686)                           -directory [file join $objdir $i] *] {
  ab08b363 (Shawn O. Pearce     2007-09-22 03:47:43 -0400  687)                           lappend tolink [file join $i $j]
  ab08b363 (Shawn O. Pearce     2007-09-22 03:47:43 -0400  688)                   }
  81d4d3dd (Shawn O. Pearce     2007-09-24 08:40:44 -0400  689)                   $o_cons update [incr bcur] $bcnt
  81d4d3dd (Shawn O. Pearce     2007-09-24 08:40:44 -0400  690)                   update
  ab08b363 (Shawn O. Pearce     2007-09-22 03:47:43 -0400  691)           }

It would seem that 81d4d3dd and ab08b363 are commits adding code to do a clone.
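
Going from blame output to the commit messages behind it is mechanical enough to script. What follows is a minimal Python sketch, not anything that ships with git or git-gui: the function names `commits_in_blame` and `explain` are made up for illustration, and it assumes `git` is on your PATH and that you run it inside a checkout. It reads `git blame --porcelain`, collects each distinct commit once, and prints that commit's message with `git show -s`.

```python
import re
import subprocess

# A porcelain header line: commit id (40 hex digits for SHA-1, up to 64 for
# SHA-256), original line number, final line number, optional group size.
HEADER = re.compile(r"^([0-9a-f]{40,64}) \d+ \d+(?: \d+)?$")


def commits_in_blame(porcelain: str) -> list[str]:
    """Return the distinct commit ids in blame output, in first-seen order."""
    seen: list[str] = []
    for line in porcelain.splitlines():
        # Source lines start with a tab in porcelain output, so they never match.
        match = HEADER.match(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def explain(path: str, start: int, end: int) -> None:
    """Print the commit message behind every line of path in [start, end]."""
    blame = subprocess.run(
        ["git", "blame", "--porcelain", "-L", f"{start},{end}", path],
        check=True, capture_output=True, text=True,
    ).stdout
    for commit in commits_in_blame(blame):
        message = subprocess.run(
            ["git", "show", "-s", "--format=%h %s%n%n%b", commit],
            check=True, capture_output=True, text=True,
        ).stdout
        print(message)
```

Something like `explain("lib/choose_repository.tcl", 675, 691)` would cover the range shown above, assuming the file still has those line numbers in your checkout.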


  $ git show 81d4d3dd
  commit 81d4d3dddc5e96aea45a2623c9b1840491348b92
  Author: Shawn O. Pearce <spearce@spearce.org>
  Date:   Mon Sep 24 08:40:44 2007 -0400

    git-gui: Keep the UI responsive while counting objects in clone

    If we are doing a "standard" clone by way of hardlinking the
    objects (or copying them if hardlinks are not available) the
    UI can freeze up for a good few seconds while Tcl scans all
    of the object directories.  This is espeically noticed on a
    Windows system when you are working off network shares and
    need to wait for both the NT overheads and the network.

    We now show a progress bar as we count the objects and build
    our list of things to copy.  This keeps the user amused and
    also makes sure we run the Tk event loop often enough that
    the window can still be dragged around the desktop.

    Signed-off-by: Shawn O. Pearce <spearce@spearce.org>

  $ git show ab08b363
  commit ab08b3630414dfb867825c4a5828438e1c69199d
  Author: Shawn O. Pearce <spearce@spearce.org>
  Date:   Sat Sep 22 03:47:43 2007 -0400

    git-gui: Allow users to choose/create/clone a repository
  …
    Rather than relying on the git-clone Porcelain that ships with
    git we build the new repository ourselves and then obtain content
    by git-fetch.  This technique simplifies the entire clone process
    to roughly: `git init && git fetch && git pull`.  Today we use
    three passes with git-fetch; the first pass gets us the bulk of
    the objects and the branches, the second pass gets us the tags,
    and the final pass gets us the current value of HEAD to initialize
    the default branch.

    If the source repository is on the local disk we try to use a
    hardlink to connect the objects into the new clone as this can
    be many times faster than copying the objects or packing them and
    passing the data through a pipe to index-pack.  Unlike git-clone
    we stick to pure Tcl [file link -hard] operation thus avoiding the
    need to fork a cpio process to setup the hardlinks.  If hardlinks
    do not appear to be supported (e.g. filesystem doesn't allow them or
    we are crossing filesystem boundaries) we use file copying instead.

    Signed-off-by: Shawn O. Pearce <spearce@spearce.org>

So 30 seconds after being asked, I've managed to remember this was mostly about git-gui on Windows, where Cygwin can be pretty slow for file operations, and hardlinks are available on NTFS if your application knows how to make them. By doing the clone logic within Tcl, which is a native Win32 application, we can bypass Cygwin overheads, including the need to fork and execute a bunch of commands from the git-clone.sh shell script. Because, back in 2007, git-clone was still just a shell script.

In hindsight, that paragraph above should also be in the commit messages. And I probably should have ported git clone to C instead. It's C now, but not because of my efforts. And now git-gui maybe should just call it. It would have made git-gui a whole lot smaller.

You can follow the rest of the thread.

5 comments:

Sverre Rabbelier said...: Thanks for the post, I pointed some 2nd year CS students at it to explain why high quality commit messages are important. (It is at times horrifying to see the quality of the commit messages by 1st and 2nd year CS students.) Hopefully this'll motivate them to Do The Right Thing (TM) from now on ;).; February 4, 2010 at 6:35 AM
Tim Daly said...: I agree that the information you know when you code, especially the "why", needs to be written down. Code written over a year ago is as obscure as if it were written by someone else. However, I think that the information you know ought to be provided as literate programming (ref: Knuth) rather than in changelogs. That is, the "how" and "why" information needs to be in paragraphs of text that reside in the source code, next to the code that implements the ideas. Having the information stored in changelogs implies that I know what to look for and by that time I'm already deep into debugging. The information in a literate source file format is available when I start looking at the code.; February 4, 2010 at 10:58 AM
Sveger Olofson said...: Yes, this kind of information should be in comments, not commit messages.; February 4, 2010 at 9:26 PM
Leo said...: Nothing against the point you make here, but the pathetic lack of comments in the choose_repository.tcl source file prompts me to say this: Such explanations belong in the source file, not in the meta information of a possibly inaccessible repository. I absolutely second Tim Daly's point.; February 4, 2010 at 10:50 PM
Clemens Buchacher said...: I think you should take a look at git's git repository to better understand the advantage of commit messages over inline comments. The git repository has approximately 250 thousand lines of code and only about 20 thousand lines of comments. But it has 200 thousand lines of commit messages. Having such a detailed history is incredibly useful.

Commit messages are more powerful than inline code comments for several reasons.

- They are associated with a single logical change, which has a very specific motivation to comment on. This is useful if you want to know why a problem was solved in a particular way. If you read the code, however, you do not usually want to know "why was this implemented like this, and not like that." First, you want to know what the code does and how it works. Having inline explanations for the motivation behind writing the code would only be a distraction at that point.

- Inline comments are often misleading, because they are not maintained with the code. It is hard to keep all comments up-to-date, because you have to notice that a certain comment pertains to a change you made and is possibly invalidated by it.

- Commit messages, on the other hand, maintain themselves, because they belong to a version and can therefore be associated with a change. If code that you commented on in a commit message is changed later, the commit message will automatically be updated with the new change. And you are still able to dig out the original commit that created the code.

The event that the repository is not available to view the commit message information is quite unlikely, especially because git by default copies all history information locally. If you're going to read the code without its history, you might as well read it without syntax highlighting, ctags or grep.; February 20, 2010 at 10:15 AM

Insufficiently Random

The lonely musings of a loosely connected software developer.

Thursday, February 4, 2010

Why commit messages matter
