<b>Insufficiently Random</b><br/>The lonely musings of a loosely connected software developer.<br/><br/><b>Don't Assume You Know What's Best</b> (2010-05-14)<br/><br/>Yesterday I think we finally found the cause of <a href="http://code.google.com/p/gerrit/">Gerrit Code Review</a> <a href="http://code.google.com/p/gerrit/issues/detail?id=390">issue 390</a> and <a href="http://www.eclipse.org/jgit/">JGit</a> <a href="https://bugs.eclipse.org/bugs/show_bug.cgi?id=308945">bug 308945</a>. In issue 390 a long-running Gerrit Code Review daemon suddenly loses access to objects in one or more Git repositories. The daemon's error log shows the server cannot find a commit, but <span class="Apple-style-span" style="font-family:'courier new';">git cat-file</span> on the command line is able to read the same commit without errors. A restart of the daemon JVM always corrects the problem. So we knew the problem had to be data corruption within JGit's in-memory caches.<div><br /></div><div>All along I've been looking for some sort of data corruption in the list of known pack files. The list is managed using volatiles and <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/atomic/AtomicReference.html">AtomicReferences</a>, and does most atomic operations itself, rather than building upon the data structures offered by the <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/package-summary.html">java.util.concurrent</a> package. Since I'm not <a href="http://gee.cs.oswego.edu/dl/index.html">Doug Lea</a>, it's entirely possible that there is a race or unsafe write within this code. Fortunately, this code appears to be OK. 
I've gone over it dozens of times and cannot find a logic fault.</div><div><br /></div><div>Then along comes bug 308945, where JGit is reading from a closed pack file.</div><div><br /></div><div>JGit opens each pack file once using a <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/io/RandomAccessFile.html">RandomAccessFile</a>, and then uses the NIO API <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/nio/channels/FileChannel.html#read(java.nio.ByteBuffer, long)">read(ByteBuffer, long)</a> to execute a thread-safe <a href="http://opengroup.org/onlinepubs/007908799/xsh/pread.html">pread(2)</a> system call any time it needs data from the file. This allows JGit to reuse the same file descriptor across multiple concurrent threads, reducing the number of times that it needs to check the pack file's header and footer. It also minimizes the number of open file descriptors required to service a given traffic load. To prevent the file from being closed by one thread while it's being read by another, JGit keeps its own internal 'in-use' counter for each file, and only closes the file descriptor when this counter drops to 0. In theory, this should work well.</div><div><br /></div><div>Right until we mixed in <a href="http://mina.apache.org/sshd/">MINA SSHD</a>, a pure Java SSH server. When a client disconnects unexpectedly, MINA appears to send an <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Thread.html#interrupt()">interrupt</a> to the thread that last read data from the client connection. If that thread is currently inside of read(ByteBuffer,long), the read is interrupted, the file descriptor is closed, and an exception is thrown to the caller.</div><div><br /></div><div>Wait, what? The <i>file descriptor is closed</i>?</div><div><br /></div><div>And therein lies the bug. 
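To make the reuse scheme above concrete, here is a rough sketch of a reference-counted pack file handle. This is an illustration only; the class and method names are invented, and this is not JGit's actual code:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of the in-use counting described above.
// Names are invented for illustration; this is not JGit's code.
public class SharedPackFile {
    private final RandomAccessFile file;
    // Starts at 1: the "registered in the pack list" reference.
    private final AtomicInteger inUse = new AtomicInteger(1);

    public SharedPackFile(Path path) throws IOException {
        file = new RandomAccessFile(path.toFile(), "r");
    }

    // Every reader brackets its access with beginUse()/endUse().
    public boolean beginUse() {
        for (;;) {
            int n = inUse.get();
            if (n == 0)
                return false; // descriptor already closed; caller must reopen
            if (inUse.compareAndSet(n, n + 1))
                return true;
        }
    }

    public void endUse() throws IOException {
        if (inUse.decrementAndGet() == 0)
            file.close(); // last user out closes the descriptor
    }

    // pread(2)-style positional read: never moves a shared file pointer,
    // so many threads can safely share one descriptor.
    public int read(ByteBuffer dst, long position) throws IOException {
        return file.getChannel().read(dst, position);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("pack", ".demo");
        Files.write(tmp, new byte[] { 1, 2, 3, 4 });
        SharedPackFile pack = new SharedPackFile(tmp);
        if (pack.beginUse()) {
            System.out.println("read " + pack.read(ByteBuffer.allocate(4), 0) + " bytes");
            pack.endUse();
        }
        pack.endUse(); // drop the registration reference; count hits 0, file closes
        System.out.println("reopen needed: " + !pack.beginUse());
    }
}
```

The counter is sound as far as it goes; the trouble comes when something other than endUse() closes the descriptor, leaving the count positive while the file is already gone.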
When I selected read(ByteBuffer,long) and this file descriptor reuse strategy for JGit, I failed to notice that it is documented to throw <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/nio/channels/ClosedByInterruptException.html">ClosedByInterruptException</a>. That oversight on my part led to the misuse of an API, and an ugly race condition.</div><div><br /></div><div>When MINA SSHD interrupts a working JGit thread at the right place, the current pack file gets closed, but JGit's in-use counter thinks it's still open. Subsequent attempts to access that file all fail, because it's closed. When JGit encounters a pack file that is failing to perform IO as expected, it removes the file from the in-memory pack file list, but leaves it alone on disk. JGit never picks up the pack file again, as the pack file list is only updated when the modification time on the GIT_DIR/objects/pack directory changes. Without the pack file in the list, its contained objects cannot be found, and they appear to just vanish from the repository.</div><div><br /></div><div>I don't know what possessed the people who worked on <a href="http://jcp.org/en/jsr/detail?id=051">JSR 51</a> (New I/O APIs for the Java<sup><span style="font-size:-2;">TM</span></sup> Platform) to think that closing a file descriptor automatically during an interrupt was a good idea. RandomAccessFile's own read method doesn't do this, but its associated FileChannel does. 
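The close-on-interrupt behavior is easy to demonstrate in isolation. This toy program (not JGit code; the temp file and names are made up for illustration) shows that a pending interrupt causes FileChannel's positional read to throw ClosedByInterruptException and permanently close the channel, so every later read fails too:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class InterruptDemo {
    // Returns the exception names seen: first from the interrupted read,
    // then from a retry against the (now closed) channel.
    static String readAfterInterrupt() throws IOException {
        Path tmp = Files.createTempFile("demo", ".bin");
        Files.write(tmp, new byte[] { 1, 2, 3, 4 });
        FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ);
        StringBuilder seen = new StringBuilder();

        Thread.currentThread().interrupt(); // simulate MINA interrupting us
        try {
            ch.read(ByteBuffer.allocate(4), 0);
        } catch (ClosedByInterruptException e) {
            seen.append(e.getClass().getSimpleName()); // channel is now closed
        }
        Thread.interrupted(); // clear the interrupt flag

        try {
            ch.read(ByteBuffer.allocate(4), 0); // any later access fails too
        } catch (ClosedChannelException e) {
            seen.append(',').append(e.getClass().getSimpleName());
        }
        return seen.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAfterInterrupt());
    }
}
```

RandomAccessFile's read, by contrast, pays no attention to the interrupt flag at all, which is exactly the asymmetry at issue here.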
In my opinion, they might as well have just invoked a setuid root copy of <a href="http://linux.die.net/man/8/halt">/sbin/halt</a> and powered off the host computer.</div><br/><b>Why am I surprised when things work?</b> (2010-04-28)<br/><br/>I recently purchased a <a href="http://www.fujitsu.com/us/services/computing/peripherals/scanners/scansnap/scansnap-s1500m.html">Fujitsu ScanSnap S1500M</a>. This isn't interesting; it's a scanner. You plug it into your computer, and it's supposed to make picture files from paper. Yay. We've had scanners for ages. It's not blog worthy.<div><br /></div><div><b>What shocked me was, the damn thing does exactly what it says on the tin.</b></div><div><br /></div><div>Load its sheet feeder up with paper, plug it into the computer's USB port. Push the only "Scan" button on the front. Next thing you know, there is a folder full of sequentially numbered JPEG files. It automatically detects the length of the paper. It scans double-sided at the same speed it scans single-sided. It automatically drops back sides that are completely blank. Pages narrower than 8.5" are correctly detected and scanned with a narrower image width. It goes through 20 pages per minute. That's fast enough that it's done before you realize it's started.</div><div><br /></div><div>I realized after scanning several hundred pages in just a few minutes that very few things I purchase these days "just work". Most products still require a lot of tinkering from the user, or are still so complex that you need an advanced degree to operate them. This scanner, well, anyone's cat could use it. Just tap that scan button.</div><div><br /></div><div>Most products require you to purchase additional stuff, e.g. cables, to get them to work. Fujitsu actually included a USB cable in the box. Just unpack, plug in, and go. 
It's hard to argue with that. Even my <a href="http://www.tivo.com/">HD TiVo</a> was harder to get set up and going.</div><div><br /></div><div>To organize that directory of image files, I started using Brad Fitzpatrick's <a href="http://github.com/bradfitz/scanningcabinet">scanningcabinet</a> application. Though I did make a few changes in <a href="http://github.com/spearce/scanningcabinet">my own scanningcabinet fork on GitHub</a>. Now if only <a href="http://code.google.com/appengine/">Google AppEngine</a> supported <a href="http://googleappengine.blogspot.com/2010/04/making-your-app-searchable-using-self.html">full text search better</a>...</div><br/><b>Gerrit Code Review on FLOSS Weekly</b> (2010-04-23)<br/><br/>On Wednesday I recorded a netcast for <a href="http://twit.tv/floss">FLOSS Weekly</a> with Randal Schwartz and Randi Harper about <a href="http://code.google.com/p/gerrit/">Gerrit Code Review</a>, <a href="http://www.eclipse.org/jgit/">JGit</a>, <a href="http://www.eclipse.org/egit/">EGit</a>, and <a href="http://git-scm.com/">Git</a> in general. The video and audio versions of the netcast are <a href="http://twit.tv/floss118">now available</a>.<div><br /></div><div>It was fun recording the show. I don't usually do these sorts of things; I find talking to a laptop somewhat challenging conceptually. It's just a thing sitting there, and it doesn't talk back. You can't see your audience's reactions to your words. I guess that's why I never got into radio; I couldn't sit and talk to a wall for four hours a day, every day. 
I definitely prefer getting up on stage and giving a talk in person.</div><br/><b>Pre-testing commits with Git</b> (2010-04-12)<br/><br/>The awesome folks who work on <a href="http://hudson-ci.org/">Hudson CI</a> have finally brought us <a href="http://blog.hudson-ci.org/content/pre-tested-commits-git">pre-tested commits</a> with <a href="http://code.google.com/p/gerrit/">Gerrit Code Review</a>. Their solution of watching everything under <span class="Apple-style-span" style="font-family:'courier new';">refs/changes/</span> is a bit brute-force, but it's an amazing first step, because Hudson can "vote" on the change and prevent it from being submitted if the build failed.<div><br /></div><div>A few years ago I started a similar sort of thing for Git. It's carried in the <a href="http://git.kernel.org/?p=git/git.git;a=tree;f=contrib/continuous;hb=HEAD">contrib/continuous</a> directory of the <a href="http://git.kernel.org/?p=git/git.git;a=summary">git.git source code distribution</a>. But this whole Hudson-Gerrit integration is way better, because it lets you catch the failure before it's submitted to your development branches.</div><br/><b>Git is moving...</b> (2010-04-07)<br/><br/>Cedric recently wrote an interesting post on his blog, <a href="http://beust.com/weblog/2010/04/06/git-for-the-nervous-developer/">Git for the nervous developer</a>. Unlike a lot of the other blogs out there, he approached Git from the kicking and screaming angle, where he was already comfortable with another VCS and was forced to switch to Git for his day-job. 
It's an interesting perspective: he has found some sort of happiness with a tool he didn't choose to use.<div><br /></div><div>Today I also gave a talk on Git, JGit, EGit and Gerrit Code Review at the <a href="http://www.sonatype.com/about/in-the-news/20090915">Sonatype Maven Meetup</a> in Philadelphia. The talk was really well attended; according to Jason van Zyl, everyone chose my talk during the first time slot of the day. Most of the audience was apparently from the financial services sector, so they are a bit behind the bleeding edge of the open source VCS curve, but they were aware of Git and asking some really great questions about its capabilities. I'm glad I went; maybe we'll see some wider adoption of Git outside of the more usual open source communities.</div><br/><b>JGit 0.7.1</b> (2010-03-29)<br/><br/>We finally managed to release <a href="http://dev.eclipse.org/mhonarc/lists/jgit-dev/msg00233.html">0.7.1</a> of <a href="http://www.eclipse.org/jgit/">JGit</a>, through the Eclipse Foundation's incubation process. Unfortunately we have yet to figure out how to get our Hudson CI server to produce a Maven update site and make that available through the download farm used by Eclipse projects. 
So we have yet to get an official Maven repository published.<div><br /></div><div>But Eclipse users can install <a href="http://www.eclipse.org/egit/">EGit</a> 0.7.1 through the official P2 update <a href="http://download.eclipse.org/egit/updates">site, http://download.eclipse.org/egit/updates</a>.</div><div><br /></div><div><b>[Update] </b>We now have an official <a href="http://www.eclipse.org/jgit/download/">JGit Maven site</a>.</div><br/><b>Can't beat the cloud</b> (2010-03-27)<br/><br/>I have decided it is time to stop running my own web server just for my silly little blog. And that was about the only reason I'm still renting a virtual server from <a href="http://www.slicehost.com/">Slicehost</a> (err <a href="http://www.rackspace.com/">Rackspace</a>). At $20/month it just doesn't make much sense anymore.<div><br /></div><div>So I've moved the blog onto <a href="http://www.blogger.com/">Blogger</a>, static files onto <a href="http://aws.amazon.com/s3/">AWS S3</a>, and some small URL redirection glue onto <a href="http://code.google.com/appengine/">Google AppEngine</a>. With free DNS hosting provided by my registrar, free blog hosting at Blogger, and free redirection glue on AppEngine, I can cut my costs by nearly $20/month. The only real cost is the static files on S3, which are just a couple of tiny images. I fortunately have never had a very media-rich site. Given S3's prices, this is pennies/month.</div><div><br /></div><div>The domain's email was moved months ago onto <a href="https://www.google.com/a/">Google Apps for Your Domain</a>. I just couldn't keep <a href="http://spamassassin.apache.org/">SpamAssassin</a> running with sufficiently up-to-date rules on a tiny virtual server with only 256 MB of memory allocated to it. Fortunately, Gmail works great over IMAP and direct SMTP. 
And I do like having the fast web-based search every once in a while.</div><div><br /></div><div>This means the <a href="/2009/12/moving-my-git-repositories.html">git.spearce.org experiment</a> will be going away soon. Most likely I'll keep my Git repositories at <a href="http://github.com/spearce">GitHub</a> or <a href="http://repo.or.cz/">repo.or.cz</a>. Fortunately these are small and generally just mirrors of open source projects whose primary repository lives elsewhere.</div><br/><b>The Eclipse.org JGit follies continue...</b> (2010-02-10)<br/><br/>Another day. Another complaint from me about running a project at Eclipse.org. This time it wound up in the <a href="http://dev.eclipse.org/mhonarc/lists/jgit-dev">jgit-dev mailing list archives</a>, as replies to <a href="http://dev.eclipse.org/mhonarc/lists/jgit-dev/msg00103.html">a thread</a> that I think started from my blog post on <a href="http://www.spearce.org/2010/02/the-tragedy-of-eclipse-org.html">the tragedy of Eclipse</a>.<br/><br/>Instead of reposting the whole thing, I'll just point to my two messages in context:<br/><br/><a href="http://dev.eclipse.org/mhonarc/lists/jgit-dev/msg00109.html">why do I need to spend my time on this crap?</a><br/><a href="http://dev.eclipse.org/mhonarc/lists/jgit-dev/msg00111.html">why is the new file header what it is?</a><br/><br/><b>The tragedy of Eclipse.org</b> (2010-02-08)<br/><br/>I've probably posted something about this before. But I'm really getting fed up with the <a href="http://www.eclipse.org/projects/dev_process/development_process.php">Eclipse Development Process</a>. It's a frelling nightmare for a committer to work with. 
I'm really starting to regret moving JGit there.<br/><br/>Right now, if I have X hours to work on a project, I seem to be averaging what feels like X/2 hours in paperwork and other "important steps" of the development process. None of which has helped my project ship higher-quality or more feature-complete code. Which means either my or my employer's time is being wasted. I don't have time to waste when I have 108 bugs open in <a href="http://code.google.com/p/gerrit/">Gerrit Code Review</a>, and 64 bugs open in <a href="http://www.eclipse.org/EGit">EGit</a> and <a href="http://www.eclipse.org/jgit/">JGit</a>.<br/><br/>Based on a private email chain I'm having with the Eclipse IP review team, it looks like the initial EGit code contribution was bungled not just by myself, but also by the foundation's IP review process. Which means I probably have to run EGit back through IP review, almost from scratch. But only after I write a script to datamine contributors out of the <a href="http://repo.or.cz/w/egit.git/shortlog/refs/heads/historical/pre-eclipse">old EGit history</a> and inject a complete, per-file <code>git shortlog</code> into each file header. It's a good thing I have an <a href="http://git-scm.com/">awesome version control system like Git</a> to keep these records for me. Too bad nobody else on the planet can use it to obtain information they might want to know about our source code. I guess running software to read information about a file is too scary for some individuals. So I have to do it for them. Now, and for every change we make in the future. Yay. :-(<br/><br/>The astute reader may notice that in the above paragraph, "private email chain" doesn't jibe with other publications from the Eclipse Foundation demanding that projects be run in an open and transparent manner (see how do I start a project on <a href="http://www.eclipse.org/home/newcomers.php">Eclipse Newcomers</a>). 
I really do feel like JGit is a less open project now that it has moved to Eclipse.org. Conversations with the Eclipse IP team about the legal status of any contribution always happen by private email. These things never make it to the project mailing list. The IPzilla database is closed to everyone but committers. There are backroom deals going on about what our file headers should look like in order to sufficiently convey that the source code is under the <a href="http://www.opensource.org/licenses/bsd-license.php">new-style BSD</a>. The discussion that led to the approval of the <a href="http://www.eclipse.org/egit/iplog/v0.7.0.pdf">EGit IP log for 0.7.0</a>, granted despite what appears to be an error in the initial review, also happened by private email.<br/><br/>It took a significant amount of effort on my part to even get JGit hosted at Eclipse.org. Originally, the new-style BSD license wasn't permissible for a hosted project, and I had to seek a special exemption from the Eclipse Board of Directors. That process required significant backroom conversations over at least six months. Again, not exactly open. The only reason I think I haven't pulled the project back is the huge initial investment I've already made in this.<br/><br/>Maybe JGit and EGit are just unique projects. But in my experience, I am not a unique snowflake, and neither is my work. I'm not as special as I might seem at first glance.<br/><br/>I wouldn't be surprised if I've lost at least 2 days every month to paperwork. That's about 30 days, or 1.5 person-months since the project really started this move in January 2009. 1/12 of my time over the past year has just gone to catering to the Eclipse development process. Food for thought. Join Eclipse... 
make sure you pick up at least 1/12 of another full-time developer just to deal with the red tape.<br/><br/>The part that really troubles me with the red tape isn't so much that it is there, but that committers bear the brunt of the effort, while large corporations that are <a href="http://www.eclipse.org/membership/showMembersWithTag.php?TagID=strategic">strategic members</a> reap the benefits of having a concise change history listed inside of each source code file, or knowing that every contributor who ever touched this source code has been <a href="https://bugs.eclipse.org/bugs/show_bug.cgi?id=300397#c4">grilled in detail on a bug tracker</a>.<br/><br/>So back to my post title. The real tragedy is, these corporations that sell commercial products based on top of Eclipse.org distributions are pushing not just the open source development work, but also a whole ton of onerous legal and reporting constraints back onto their project committers. It's enough to make this committer start to reconsider things. I wish I had been using a time clock this past year, to accurately record how many days the Eclipse development process has robbed me of since the start of all of this. It feels significant enough that if I went to my manager with it, I think he'd go ballistic.<br/><br/><b>Why commit messages matter</b> (2010-02-04)<br/><br/>Some folks wonder why I want longer, detailed commit messages in a project. Often other people claim "Fix the frobinator bug when it frobs too slow" might be sufficiently detailed to cover a change. 
But it's usually not.<br/><a name='more'></a><br/>As you explored the issue and tried to understand the problem, you filled your head up with important details about how the frobinator works, what a frob even is, what a slow frob looks like, and why a slow frob shouldn't be permitted in this context. All of this information is necessary for you to understand the problem and code a patch that resolves it. Moreover, if this detail wasn't necessary for you to code the patch, you wouldn't have had the slow frobbing in the first place. It would have been fairly obvious at the time of original development.<br/><br/>Commit messages, when combined with a powerful blame engine in your version control, can give you powerful insight into what you were thinking at the time. This can be incredibly handy when someone asks a question later.<br/><br/>Yesterday, <a href="http://gitster.livejournal.com/">Junio Hamano</a>, git maintainer extraordinaire, asked me why <a href="http://git.spearce.org/?p=git-gui.git;a=summary">git-gui</a> implements its own clone function. When I wrote this code, it must have been really obvious to me why it needed to reimplement the same logic as <a href="http://www.kernel.org/pub/software/scm/git/docs/git-clone.html"><code>git clone</code></a>. But I wrote it back in 2007. I've done a ton of things since then. There's no way I can remember what I was doing, or why I was doing it. I do, however, remember thinking, "this code is done, it works, I'll never have to look at or think about it again". Famous last words.<br/><br/>When Junio asked this question... I honestly couldn't remember what I was doing. I'm usually somewhat against reinventing the wheel, and I try to avoid rewriting something unless I seem to have a good reason for it. 
So I really was looking at his question saying, "yea, why did I do that there...".<br/><br/>Fortunately, I write fairly detailed commit messages, and <a href="http://www.kernel.org/pub/software/scm/git/docs/git-blame.html"><code>git blame</code></a> is an incredible tool:<br/><br/><blockquote><pre><br/> $ git blame lib/choose_repository.tcl<br/> ...<br/> 81d4d3dd (Shawn O. Pearce 2007-09-24 08:40:44 -0400 633) <br/> 81d4d3dd (Shawn O. Pearce 2007-09-24 08:40:44 -0400 634) $o_cons start \<br/> 81d4d3dd (Shawn O. Pearce 2007-09-24 08:40:44 -0400 635) [mc "Counting objects"] \<br/> 81d4d3dd (Shawn O. Pearce 2007-09-24 08:40:44 -0400 636) [mc "buckets"]<br/> ...<br/> 81d4d3dd (Shawn O. Pearce 2007-09-24 08:40:44 -0400 673) update<br/> 81d4d3dd (Shawn O. Pearce 2007-09-24 08:40:44 -0400 674) <br/> ab08b363 (Shawn O. Pearce 2007-09-22 03:47:43 -0400 675) file mkdir [file join .git objects pack]<br/> ab08b363 (Shawn O. Pearce 2007-09-22 03:47:43 -0400 676) foreach i [glob -tails -nocomplain \<br/> ab08b363 (Shawn O. Pearce 2007-09-22 03:47:43 -0400 677) -directory [file join $objdir pack] *] {<br/> ab08b363 (Shawn O. Pearce 2007-09-22 03:47:43 -0400 678) lappend tolink [file join pack $i]<br/> ab08b363 (Shawn O. Pearce 2007-09-22 03:47:43 -0400 679) }<br/> 81d4d3dd (Shawn O. Pearce 2007-09-24 08:40:44 -0400 680) $o_cons update [incr bcur] $bcnt<br/> 81d4d3dd (Shawn O. Pearce 2007-09-24 08:40:44 -0400 681) update<br/> 81d4d3dd (Shawn O. Pearce 2007-09-24 08:40:44 -0400 682) <br/> 81d4d3dd (Shawn O. Pearce 2007-09-24 08:40:44 -0400 683) foreach i $buckets {<br/> ab08b363 (Shawn O. Pearce 2007-09-22 03:47:43 -0400 684) file mkdir [file join .git objects $i]<br/> ab08b363 (Shawn O. Pearce 2007-09-22 03:47:43 -0400 685) foreach j [glob -tails -nocomplain \<br/> ab08b363 (Shawn O. Pearce 2007-09-22 03:47:43 -0400 686) -directory [file join $objdir $i] *] {<br/> ab08b363 (Shawn O. Pearce 2007-09-22 03:47:43 -0400 687) lappend tolink [file join $i $j]<br/> ab08b363 (Shawn O. 
Pearce 2007-09-22 03:47:43 -0400 688) }<br/> 81d4d3dd (Shawn O. Pearce 2007-09-24 08:40:44 -0400 689) $o_cons update [incr bcur] $bcnt<br/> 81d4d3dd (Shawn O. Pearce 2007-09-24 08:40:44 -0400 690) update<br/> ab08b363 (Shawn O. Pearce 2007-09-22 03:47:43 -0400 691) }<br/></pre></blockquote><br/><br/>It would seem that <a href="http://git.spearce.org/?p=git-gui.git;a=commit;h=81d4d3dddc5e96aea45a2623c9b1840491348b92"><code>81d4d3dd</code></a>, and <a href="http://git.spearce.org/?p=git-gui.git;a=commit;h=ab08b3630414dfb867825c4a5828438e1c69199d"><code>ab08b363</code></a> are commits adding code to do a clone.<br/><br/><blockquote><pre><br/> $ git show 81d4d3dd<br/> commit 81d4d3dddc5e96aea45a2623c9b1840491348b92<br/> Author: Shawn O. Pearce <spearce <at> spearce.org><br/> Date: Mon Sep 24 08:40:44 2007 -0400<br/><br/> git-gui: Keep the UI responsive while counting objects in clone<br/><br/> If we are doing a "standard" clone by way of hardlinking the<br/> objects (or copying them if hardlinks are not available) the<br/> UI can freeze up for a good few seconds while Tcl scans all<br/> of the object directories. This is espeically noticed on a<br/> Windows system when you are working off network shares and<br/> need to wait for both the NT overheads and the network.<br/><br/> We now show a progress bar as we count the objects and build<br/> our list of things to copy. This keeps the user amused and<br/> also makes sure we run the Tk event loop often enough that<br/> the window can still be dragged around the desktop.<br/><br/> Signed-off-by: Shawn O. Pearce <spearce <at> spearce.org><br/><br/> $ git show ab08b363<br/> commit ab08b3630414dfb867825c4a5828438e1c69199d<br/> Author: Shawn O. 
Pearce <spearce <at> spearce.org><br/> Date: Sat Sep 22 03:47:43 2007 -0400<br/><br/> git-gui: Allow users to choose/create/clone a repository<br/> …<br/> Rather than relying on the git-clone Porcelain that ships with<br/> git we build the new repository ourselves and then obtain content<br/> by git-fetch. This technique simplifies the entire clone process<br/> to roughly: `git init && git fetch && git pull`. Today we use<br/> three passes with git-fetch; the first pass gets us the bulk of<br/> the objects and the branches, the second pass gets us the tags,<br/> and the final pass gets us the current value of HEAD to initialize<br/> the default branch.<br/><br/> If the source repository is on the local disk we try to use a<br/> hardlink to connect the objects into the new clone as this can<br/> be many times faster than copying the objects or packing them and<br/> passing the data through a pipe to index-pack. Unlike git-clone<br/> we stick to pure Tcl [file link -hard] operation thus avoiding the<br/> need to fork a cpio process to setup the hardlinks. If hardlinks<br/> do not appear to be supported (e.g. filesystem doesn't allow them or<br/> we are crossing filesystem boundaries) we use file copying instead.<br/><br/> Signed-off-by: Shawn O. Pearce <spearce <at> spearce.org><br/></pre></blockquote><br/><br/>So 30 seconds after being asked, I've managed to remember this was mostly about git-gui on Windows, where Cygwin can be pretty slow for file operations, and hardlinks are available on NTFS if your application knows how to make them. By doing the clone logic within Tcl, which is a native Win32 application, we can bypass Cygwin overheads, including the need to fork and execute a bunch of commands from the <code>git-clone.sh</code> shell script. Because, back in 2007, git-clone was still just a shell script.<br/><br/>In hindsight, that paragraph above should also be in the commit messages. And I probably should have ported git clone to C instead. 
It's C now, but not because of my efforts. And now git-gui probably should just call it. It would have made git-gui a whole lot smaller.<br/><br/>You can follow the <a href="http://thread.gmane.org/gmane.comp.version-control.git/138612/focus=138874">rest of the thread</a>.<br/><br/><b>How class names can go horribly wrong</b> (2010-01-04)<br/><br/>Somehow I found myself writing this in a JGit test case:<br/><blockquote><br/><pre><br/>assertTrue("isa TransportHttp", t instanceof TransportHttp);<br/>assertTrue("isa HttpTransport", t instanceof HttpTransport);<br/></pre><br/></blockquote><br/>What is wrong with me...<br/><br/><b>Eclipse is the new RPM hell</b> (2009-12-15)<br/><br/>Remember back when RedHat was the best Linux distribution? If not, let me remind you of <a href="http://www.germane-software.com/~ser/Files/Essays/RPM_Hell.html">RPM hell</a>. A situation where you can't install Foo 1.X, because it needs Bar 2.3.Y, but you also have Zidget 8.Z installed and that needs Bar 2.2.Y. Long story short, you can have either Foo or Zidget on your system, but not both.<br/><br/>Today I just tried to install the <a href="http://www.eclipse.org/tptp/">Eclipse Test &amp; Performance Tools</a>, so I could try to find out why <a href="http://mina.apache.org/sshd/">Apache MINA SSHD</a> has such poor throughput during uploads into the server. Unfortunately I can't run version 4.5 because it <a href="https://bugs.eclipse.org/bugs/show_bug.cgi?id=240677#c2">depends on a decade-old version of libstdc++</a>. But I can't install version 4.6, which supposedly has a newer linkage, because 4.6 requires SWT 3.4.0. 
But I'm running Eclipse 3.4.2, which apparently does not have a new enough SWT.<br/><br/>Folks. Seriously?<br/><br/>The only major consumer of SWT that really matters is the Eclipse SDK. And the SDK platform version numbers don't match SWT version numbers. And the test and performance tools require a decade-old shared library that isn't even distributed with <a href="http://releases.ubuntu.com/hardy/">Ubuntu Hardy</a>.<br/><br/>I guess since it's Java, it's OK to repeat decade-old mistakes, because it's in a different programming language.<br/><br/>:-(<br/><br/><b>My ancient history and the art of software</b> (2009-12-08)<br/><br/>This week I'm traveling. I'm in Miami for the two-day Eclipse board of directors meeting. When I'm forced to fly, which I usually try to avoid doing, I tend to take a stack of books with me to try and pass the time on the airplane. Unfortunately, airline flight speeds haven't quite caught up with Moore's law as it relates to the power draw of modern laptops. Nor has the airline seat space kept up with the size of my 15" PowerBook and my increasing girth. So books it is.<br/><br/>So I'm currently reading Peter Seibel's <a href="http://www.codersatwork.com/">Coders at Work</a>. Tonight, while reading through Brendan Eich's interview and his history at Netscape, it reminded me of my own development history around the same time period. I really don't talk about my past very much, so before I get to be too old to remember it, I might as well write some of it down. :-)<br/><br/><a name='more'></a>During the mid-to-late '90s I was still in high school, but was working part-time in the afternoons after school as a software developer (err, script monkey) for a now-defunct website and ISP called <a href="http://web.archive.org/web/19961106235534/http://www.injersey.com/">INJersey</a>. 
INJersey was the aspiring, and perhaps too early for its time, online arm of the local daily print paper, the <a href="http://www.app.com/">Asbury Park Press</a>. Somehow its owners realized this Internet thing was worth investing in, before a lot of users figured it out and actually got online.<br/><br/>Back then, we didn't have <a href="http://httpd.apache.org/">Apache</a> and all of its module glory. We had a pile of patch files you had to apply to NCSA httpd to get <strong><a href="http://httpd.apache.org/docs/1.3/misc/FAQ.html#name">A PA</a></strong><a href="http://httpd.apache.org/docs/1.3/misc/FAQ.html#name">t</a><strong><a href="http://httpd.apache.org/docs/1.3/misc/FAQ.html#name">CH</a></strong><a href="http://httpd.apache.org/docs/1.3/misc/FAQ.html#name">y server</a>. We didn't have FastCGI, we had plain old CGI scripts, often written in Perl 4, where <code>&article'render()</code> was an actual function call and not some syntax error brought on by a lack of caffeine. We didn't even have JavaScript or animated images. <a href="http://www.google.com/search?q=netscape+blink+tag"><blink></a> was as good as it got.<br/><br/>My job back then was really simple. Take Microsoft Office documents from writers who couldn't be bothered to learn HTML, and put them online in HTML. I wrote a lot of Perl and AppleScript to rip the stuff apart and put it back together again as plain HTML files that we could serve to web users. This was out of sheer laziness, the <a href="http://c2.com/cgi/wiki?LazinessImpatienceHubris">first great virtue of a programmer</a>. INJersey hired me not as a software developer, but as a data entry monkey to copy and paste the text from Office into an HTML document, and stick in the <td> or <b> tag when necessary.
I quickly grew bored with that task and found it much more fun to write scripts to do my job for me, while I explored the wonders of a <a href="http://en.wikipedia.org/wiki/T-carrier">T1</a> internet connection.<br/><br/>My managers quickly realized I was able to do more than just copy and paste text with a computer, so they started giving me simple programming assignments. When Netscape 2.0 launched, one of our real programmers figured out we could do <a href="http://oreilly.com/openbook/cgi/ch11_01.html">server-push based animated images</a>. This is one of those monstrously stupid ideas that I'm glad has died on the web. I'm quite happy that I actually can't come up with a great reference link for it anymore. Anyway, INJersey's website was just awesome one day, because we had animated images, and others didn't.<br/><br/>About this time one of my managers saw this website called JangaChat. It was a free online web-based chat room system. No IRC client needed. No Java applets. No plugins. Just Netscape 2.0. Even better, they allowed HTML to be entered without escaping, so you could write fancy messages like "Bob, that was the best fish <b>ever</b>!" and actually have it come out in bold. Their site was more awesome than our animated images. We somehow had to out-awesome them again.<br/><br/>Remember, this is like 1995. We just got <a href="http://www.antipope.org/charlie/journo/netscape.html">Netscape 2.0</a>, and <a href="http://inventors.about.com/od/jstartinventions/a/JavaScript.htm">Brendan Eich</a> had just unleashed this JavaScript thing on the world. We didn't know what we could do with it... or the damage it could cause.
Cross site scripting hadn't even been invented yet.<br/><br/>My managers gave me a simple task: create an INJersey version of JangaChat that we could run on our own servers, so our users could chat online in real time about whatever they felt like, without needing to first install an IRC client.<br/><br/>At the time, I only really knew Perl, and was only self-taught at that. So I wrote the first version as a Perl CGI. I remember abusing the server-based multipart push feature we used for image animation to allow the Perl CGI to stream new messages to the browser as they arrived in the chat room. JangaChat, and this system I worked on at INJersey, may have been some of the first uses of <a href="http://alex.dojotoolkit.org/2006/03/comet-low-latency-data-for-the-browser/">hanging GET</a>s.<br/><br/>Unfortunately, not only did I not know C, I also didn't know how to do proper interprocess communication on UNIX. So I implemented it with what I did know: each chat room was assigned a local file. New messages were appended onto the end of the file using a POST CGI, and messages were read out to active users by their hanging GET CGIs, which were continuously trying to read the tail of the room's file. Over a decade later, I can't believe I was once foolish enough to believe this was a good idea.<br/><br/>We launched under the name <a href="http://web.archive.org/web/19980111133651/http://chat.injersey.com/">ChatterBox</a>. We quickly stole most of the users from JangaChat, plus picked up our own users, and our little Pentium 133 server could barely keep up with all of those Perl-based hanging GET CGI processes demanding resources. Messages were also often truncated or otherwise badly mangled, as I had no file locking, and no way to ensure a full message was written before being read and sent to a browser.<br/><br/>Management really wanted this product to work, because the demo was flashy, and their local advertising customers were wowed by it.
I rewrote the Perl code into C, but kept the basic design of individual hanging GET CGIs per user, and a single log file per chat room. Even in C, we still couldn't keep up with traffic. Somehow I convinced management that buying some "real server hardware", rather than our tiny single-processor Pentium 133, would solve the scaling problems, and they went out and purchased a pair of <a href="http://en.wikipedia.org/wiki/SGI_Origin_200">SGI Origin 200</a> servers running Irix. Today of course, all 3 of you reading this are screaming "but it's the software that wasn't scalable!". That's what a decade of learning gets you. :-)<br/><br/>So upon getting this shiny new server hardware, I started thinking about how I might have done things differently. I knew I couldn't rely on the (then pretty crappy and unscalable) dbm library for user data management, like I had with Perl. So I started poking around at things like <a href="http://www.hughes.com.au/products/msql/">mSQL</a>, <a href="http://www.postgresql.org/">PostgreSQL</a>, and <a href="http://www.mysql.com/">MySQL</a>. Back in 1996, mSQL was the best there was, but it was single-threaded. PostgreSQL was pretty slow and barely ran; it was more of a research project than it was a production database engine. MySQL crashed more than it stayed up, severely lacked features compared to mSQL, and required nasty pthread libraries which weren't exactly standard or well supported on UNIX systems back then.<br/><br/>So I toyed with writing my own. As it turned out, Irix had a decent pthread implementation at the time. I actually ended up with a thread-safe, balanced B*-tree that stored user record data in fixed-size leaf nodes, and used an arbitrary byte sequence as the record key. To make the implementation easier on myself, I made every block of the file the same size (I think I used 2048 bytes, but I can't remember), and the parent that pointed to the block told you the type using the lower bits of the file offset. E.g.
<code>offset & 1 == 0</code> meant the block was another btree intermediate node, while <code>offset & 1 == 1</code> meant it was an application data record.<br/><br/>Within the user data record, I didn't want to write all of the glue required for a formal DDL like a proper SQL server would support. After all, I just had to store a couple of values for each user account, like their email address, date of last login, and a handful of preference settings. So I basically did what <a href="http://code.google.com/apis/protocolbuffers/docs/overview.html">Google protocol buffers</a> does, and stored a flexible bag of key/value pairs in binary form. Schema changes were accomplished incrementally, as users were updated during normal application processing, rather than all up front when the software changed.<br/><br/>So, I solved my data storage problem by just rolling my own system. Looking back on it, it's not too different from the approach Google takes with <a href="http://labs.google.com/papers/bigtable.html">BigTable</a>. Only they scale across machines by partitioning the key space, while I just tossed the entire key space onto a single node.<br/><br/>For the interactive web components, I finally got a clue and realized that the hanging GET CGI processes were my real scaling bottleneck. They consumed too much of the system's resources per process. My local UNIX system administrator finally got around to telling me about <code>sar</code> and <code>ps</code> and how to use them to figure out that I was being a moron.<br/><br/>Since we still didn't have FastCGI, I wrote my own HTTP server. From scratch. In C. First and last time I've ever done that. Lesson learned. There's a reason I haven't been involved in the Apache HTTPd project. My scars still haven't healed.<br/><br/>Initially, my HTTP server was using pthreads, spawning a pthread per browser request. Which meant that my server used 1 thread per hanging GET. At first I thought this would be OK; threads were lightweight compared to a process, right?
Until my local UNIX administrator brought out the clue stick and made me think about it for a few minutes. In the process-per-connection implementation we had very little allocated memory, just a kilobyte or so of global state, and the entire program executable was mapped as shared memory between all of the instances. The real memory cost was in the program's stack, which had an OS-enforced minimum size. With pthreads we didn't gain much resource savings over the process-per-connection approach because we still had the same per-connection state, and the same minimum thread stack size.<br/><br/>I started digging around the Irix manual pages, and one day found this nifty thing called <a href="http://linux.die.net/man/2/select">select()</a>. I wrote a quick, simple server using it, and realized non-blocking asynchronous IO was the best thing since sliced bread. That night I completely tore apart the pthread-based HTTP server and rewrote it as a non-blocking, asynchronous IO server.<br/><br/>By now I had completely abandoned the idea of the message distribution going through a local log file, and instead stored it in shared memory protected by pthread mutexes and condition variables. Threads posting messages into a chat room appended their message object onto a distribution queue and signaled one of the IO threads to start streaming the messages out using asynchronous IO to each connected browser.<br/><br/>Along the way I also had to write an HTML parser, because we wanted to allow messages like "Here is a <font color=red>red rose</font>" to render as intended by the author, but we didn't want to have unclosed tags like "<h1>Hi!" ruin the formatting for every subsequent message on the page. End users also figured out cross site scripting before we did, with plenty of "<script>window.close()</script>" nonsense being pasted into rooms and closing hundreds of browsers at once. Yes, we learned about cross site scripting attacks the hard way.
Back in 1996/97, nobody really thought about this sort of stuff.<br/><br/>We relaunched in the winter of 1997. The Wayback machine has a snapshot of the rewritten <a href="http://web.archive.org/web/19980111133651/http://chat.injersey.com/">homepage dated April 13, 1997</a>.<br/><br/>Of all of that though, the thing I'm still most excited about is the "non-Shockwave" version of the ChatterBox website. Back then we didn't have AJAX and XMLHttpRequest. Heck, we didn't even have iframe. We had straight up boring <frameset>. I rolled my own AJAX system from scratch using JavaScript, a hand-rolled message queue, and a 1-pixel-high frame in a frameset that performed POSTs driven by JavaScript. In very early 1997. Apparently I discovered parts of the web that we didn't see again until XMLHttpRequest was widely supported.<br/><br/>Unfortunately management pulled the plug a year or so later, and that neat little HTTP server with its pre-AJAX AJAX and pre-Google Talk hanging GETs disappeared into the sands of time.<br/><br/>So, tonight, while reading Brendan Eich's interview talking about building JavaScript in two weeks, I remembered just how much I learned that year in my part-time job. I went from being a Perl script monkey who couldn't even compile a C program, to having written an asynchronous IO HTTP server from the ground up, mastered multi-threading, mutexes, condition variables, AJAX, cross site scripting, and a balanced B*-tree implementation that bears a lot of resemblance to BigTable. All without taking any CS classes in high school.<br/><br/>Maybe that's why college, and my later jobs, were all so boring for me. I'd done all of it already. Maybe that's why I was less amazed than my peers when Google Maps launched. I didn't make something nearly as awesome, or as empowering to so many people around the world as Lars and the Maps team did. But I had certainly seen, worked with, and discovered much of today's web.
In 1997.<br/><br/>What is <a href="http://www.whatwg.org/">HTML 5</a> going to bring us that we haven't discovered yet?Shawnhttp://www.blogger.com/profile/04158038862839573213noreply@blogger.com0tag:blogger.com,1999:blog-7214366369588984730.post-75509980668772581162009-12-07T19:19:00.000-08:002010-03-27T14:26:39.159-07:00Introduction to Gerrit Code ReviewYesterday R. Tyler Ballance (aka rtyler on <a href="irc://irc.freenode.net/%26git">#git</a>) started poking me about <a href="http://code.google.com/p/gerrit/">Gerrit Code Review</a>. One day later, he's writing an <a href="http://unethicalblogger.com/posts/2009/12/code_review_gerrit_mostly_visual_guide">amazing blog post</a> describing how to install the latest development version, and why it is so awesome to use for team development. He even has screenshots! :-)<br/><br/>Thanks rtyler.Shawnhttp://www.blogger.com/profile/04158038862839573213noreply@blogger.com2tag:blogger.com,1999:blog-7214366369588984730.post-87984801505039809502009-12-07T08:37:00.000-08:002010-03-27T14:26:39.149-07:00EGit in 2010?Mike just described some changes coming to Eclipse in 2010, <a href="http://dev.eclipse.org/blogs/mike/2009/12/07/project-community-enhancements-for-2010/">including Git for projects</a>. Yay!<br/><br/>However, he's right. EGit has to get better, fast. We need more contributors who know and love the SWT/JFace/Resource APIs and can crank out the UI improvements necessary to bring the Git team provider up to the same level as the CVS team provider.Shawnhttp://www.blogger.com/profile/04158038862839573213noreply@blogger.com0tag:blogger.com,1999:blog-7214366369588984730.post-48871124534209879462009-12-05T09:01:00.000-08:002010-03-27T14:26:39.141-07:00Moving my git repositoriesI finally got around to creating <a href="http://git.spearce.org/">git.spearce.org</a>.
I've been contributing to Git since <a href="http://git.kernel.org/?p=git/git.git;a=commit;h=772d8a3b63ff669c285edb8aff0c63b609614933">Feb 17 2006</a> and yet I couldn't be bothered to set up my own Git host for my repositories. For the past 3 years I've primarily leaned on Pasky's excellent <a href="http://repo.or.cz/">repo.or.cz</a> service. But I've always wanted a more permanent home for my projects.<br/><br/>Last week I threw <a href="http://spamassassin.apache.org/">SpamAssassin</a> off my domain's server, which meant I finally had virtual memory free to run <a href="http://www.kernel.org/pub/software/scm/git/docs/git-daemon.html"><code>git daemon</code></a>. So you can now find most of my projects at <a href="http://git.spearce.org/">git.spearce.org</a>, with proper git:// URLs available for efficient cloning. Maybe sometime later this month I'll get smart HTTP enabled as well.<br/><br/>I'll continue to update the repositories on repo.or.cz, but I'll primarily be using the ones hosted on git.spearce.org.Shawnhttp://www.blogger.com/profile/04158038862839573213noreply@blogger.com0tag:blogger.com,1999:blog-7214366369588984730.post-85842860182771736532009-12-03T09:36:00.000-08:002010-03-27T14:26:39.132-07:00EGit at EclipseA few months ago we moved <a href="http://www.eclipse.org/egit/">EGit</a>, the Git team provider for Eclipse, over to the <a href="http://www.eclipse.org/org/">Eclipse Foundation</a>. Along the way we decided to try out some new development techniques, like taking advantage of my day-job project <a href="http://code.google.com/p/gerrit/">Gerrit Code Review</a> to help us discuss pending changes.
This led us down the road of not paying too much attention to the <a href="http://www.eclipse.org/projects/dev_process/ip-process-in-cartoons.php">Eclipse IP process</a>, and failing to tag all contributed patches with the +iplog flag in Bugzilla.<br/><br/>Fortunately Wayne Beaton helped us get Gerrit configured in a way that meets the foundation's IP process guidelines, and has <a href="http://dev.eclipse.org/blogs/wayne/2009/12/01/git-at-eclipse/">encouraged us to continue forward</a>. This is great, because it means we can rely on Git for attribution tracking, rather than Bugzilla.<br/><br/>Also, since the project moved homes we picked up 3 prolific contributors, and all of them have turned into committers on the project.Shawnhttp://www.blogger.com/profile/04158038862839573213noreply@blogger.com0tag:blogger.com,1999:blog-7214366369588984730.post-91499355238061087302009-12-03T09:28:00.000-08:002010-03-27T14:26:39.125-07:00Been a while...Has it really been over a year since I last updated spearce.org?<br/><br/>Yup. It has.<br/><br/>A lot happens in a year. We moved. I got a new job. I got a lot of new projects. I started spending a lot of time working on <a href="http://git-scm.com/">Git</a>, or more precisely, <a href="http://www.eclipse.org/egit/">JGit</a>. And I let my amusing little corner of the web grow cobwebs and wither.<br/><br/>Yay.Shawnhttp://www.blogger.com/profile/04158038862839573213noreply@blogger.com0tag:blogger.com,1999:blog-7214366369588984730.post-59677381332000782182008-07-18T08:23:00.000-07:002010-03-27T14:26:39.115-07:00Getting Giddy with GitRecently Johannes Schindelin and I participated in a podcast about Git's involvement in <a href="http://code.google.com/soc/">Google Summer of Code</a>.
You can listen to the podcast on <a href="http://google-opensource.blogspot.com/2008/07/getting-giddy-with-git.html">the Google open source blog</a>.Shawnhttp://www.blogger.com/profile/04158038862839573213noreply@blogger.com0tag:blogger.com,1999:blog-7214366369588984730.post-28166228569834493182008-07-08T17:06:00.000-07:002014-07-03T12:37:10.667-07:00Using jgit To Publish on Amazon S3Recent versions of <a href="http://eclipse.org/jgit/download/">jgit</a>, the 100% pure Java implementation of the <a href="http://git.or.cz/">Git version control system</a>, support fetch and push directly over <a href="http://aws.amazon.com/s3">Amazon S3</a>.<br />
<br />
It behaves like HTTP push in C git, in that it is transparent to the end user. Transparent client-side encryption can also be enabled, in case the repository data must be protected from the operators of S3.<br />
<a name='more'></a><br />
First you need to create a bucket using some sort of standard S3 tool. I used <a href="http://jets3t.s3.amazonaws.com/index.html">jets3t's cockpit</a> tool to create "gitney". A bucket may hold any number of repositories and acts as a root directory. It may also be a domain name if you want to use <a href="http://docs.amazonwebservices.com/AmazonS3/2006-03-01/VirtualHosting.html">S3 based virtual hosting</a>.<br />
<br />
Next you need to create a properties file containing your AWSAccessKeyId and AWSSecretAccessKey so that jgit can authenticate itself with the S3 service. Since the AWSSecretAccessKey should be kept private, it's a good idea to store this in a protected file within your home directory.<br />
<br />
<pre>
$ touch ~/.jgit_s3_public
$ chmod 600 ~/.jgit_s3_public
$ cat >>~/.jgit_s3_public <<EOF
accesskey: AWSAccessKeyId
secretkey: AWSSecretAccessKey
acl: public
EOF</pre>
<br />
<br />
We also include <code>acl: public</code> so all objects (files) created by jgit through this configuration file are readable by anyone. The default (if not specified) is <code>acl: private</code>, making the objects readable only by yourself, and those who manage the S3 service.<br />
<br />
Next we configure the remote in Git and push to the S3 bucket:<br />
<pre>
$ git remote add s3 amazon-s3://.jgit_s3_public@gitney/projects/egit.git/
$ jgit push s3 refs/heads/master
$ jgit push --tags s3</pre>
<br />
<br />
Future updates are just as easy:<br />
<pre>
$ jgit push s3 refs/heads/master</pre>
<br />
(or)<br />
<pre>
$ git config --add remote.s3.push refs/heads/master
$ jgit push s3</pre>
<br />
Pushes are always incremental, so subsequent pushes consume relatively little bandwidth.<br />
<br />
Our repository is now cloneable directly over HTTP (assuming we used <code>acl: public</code>):<br />
<pre>
$ git clone http://gitney.s3.amazonaws.com/projects/egit.git</pre>
<br />
<br />
A jgit amazon-s3 URL is organized as:<br />
<pre>
amazon-s3://$config@$bucket/$prefix
http://$bucket.s3.amazonaws.com/$prefix</pre>
<br />
where the three major components are:<br />
<ul><br />
<li><code>$config</code> is the name of the configuration properties file stored in <code>$GIT_DIR/$config</code> or <code>$HOME/$config</code> (searched for in that order).</li>
<br />
<li><code>$bucket</code> is the name of the Amazon S3 bucket holding the objects.</li>
<br />
<li><code>$prefix</code> is the prefix to apply to all objects (files) within this repository. It implicitly ends in "/". You may omit this portion of the URI if you want the bucket to contain only one repository.</li>
</ul>
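<br />
For example, the URL used with <code>git remote add</code> above breaks down with <code>$config</code> as <code>.jgit_s3_public</code>, <code>$bucket</code> as <code>gitney</code>, and <code>$prefix</code> as <code>projects/egit.git</code>:<br />
<pre>
amazon-s3://.jgit_s3_public@gitney/projects/egit.git/
http://gitney.s3.amazonaws.com/projects/egit.git</pre>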
<br />
This is something of an abuse of URI syntax, as the traditional username field holds the name of a file in either <code>$GIT_DIR</code> or <code>$HOME</code>; however, it keeps the secret access key out of the URI itself and supplies a way to carry more information (such as acl or encryption settings) than can appear in a URI.<br />
<br />
Transparent client-side encryption for a repository stored on S3 can be enabled by adding a <code>password</code> to the properties file:<br />
<pre>
$ cp ~/.jgit_s3_public ~/.jgit_s3_private
$ echo password: Sup3rS3cr3t >>~/.jgit_s3_private</pre>
<br />
and using <code>.jgit_s3_private</code> in the <code>$config</code> field of an amazon-s3:// URL. The encryption algorithm can also be specified in the <code>crypto.algorithm</code> property, which defaults to PBEWithMD5AndDES.<br />
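As a rough sketch of what the default algorithm involves, the JDK's standard JCE API can perform PBEWithMD5AndDES encryption directly. This only illustrates the cipher itself, not jgit's actual on-disk format; the class name, salt, and iteration count below are arbitrary choices made for this example:

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.PBEParameterSpec;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: password-based encryption with the JDK's built-in
// PBEWithMD5AndDES cipher. The password matches the example properties file;
// the salt and iteration count are arbitrary values for illustration only.
public class S3Crypto {
    // DES-based PBE requires an 8-byte salt.
    private static final byte[] SALT = {0x61, 0x10, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77};
    private static final int ITERATIONS = 5000;

    private static Cipher cipher(int mode) throws Exception {
        // Derive the DES key from the password via MD5-based key stretching.
        SecretKey key = SecretKeyFactory.getInstance("PBEWithMD5AndDES")
                .generateSecret(new PBEKeySpec("Sup3rS3cr3t".toCharArray()));
        Cipher c = Cipher.getInstance("PBEWithMD5AndDES");
        c.init(mode, key, new PBEParameterSpec(SALT, ITERATIONS));
        return c;
    }

    public static byte[] encrypt(byte[] data) throws Exception {
        return cipher(Cipher.ENCRYPT_MODE).doFinal(data);
    }

    public static byte[] decrypt(byte[] data) throws Exception {
        return cipher(Cipher.DECRYPT_MODE).doFinal(data);
    }

    public static void main(String[] args) throws Exception {
        byte[] sealed = encrypt("pack data".getBytes(StandardCharsets.UTF_8));
        // Decrypting with the same password and parameters round-trips the plaintext.
        System.out.println(new String(decrypt(sealed), StandardCharsets.UTF_8));
    }
}
```

Because the same password-derived key and parameters are used on both sides, anyone holding the properties file can decrypt the objects, while the S3 operators see only ciphertext.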
<br />
The encryption format currently used by jgit matches the format used by jets3t (specifically format version 2), making it possible to download and decrypt a repository through cockpit in the event that jgit is not readily available.Shawnhttp://www.blogger.com/profile/04158038862839573213noreply@blogger.com8tag:blogger.com,1999:blog-7214366369588984730.post-15271800567240921482008-06-26T16:59:00.000-07:002010-03-27T14:26:38.276-07:00New Server, New HomeIn honor of us moving from New York to California I have also moved the server that hosts spearce.org. If you are reading this blog post, welcome to my new home away from home on the web.<br/><br/>If you aren't reading this blog post, you may need to reconsider how you reached this location, since you can't see it.Shawnhttp://www.blogger.com/profile/04158038862839573213noreply@blogger.com0tag:blogger.com,1999:blog-7214366369588984730.post-4692986631054677032007-07-26T15:59:00.000-07:002010-03-27T17:07:27.285-07:00Difficult gitk GraphsOn the Git mailing list I've talked about one of the repositories I develop on/maintain, as its graph in gitk is somewhat interesting. Today I took a couple of redacted screenshots from two of the interesting parts of the history.<br/><br/>The first image is from a set of octopus merges that occurred in the history. 
This probably would look better if we had just used <a href="http://www.kernel.org/pub/software/scm/git/docs/git-rebase.html"><code>git-rebase</code></a> to transplant the commits instead of merging them, but at the time the user who created these was still quite new to Git...<br/><br/><p style="margin-left: 40px"><a href="http://d.spearce.org/2007/07/wide-gitk.gif"><img alt="Many Branches and Octopus Merges" title="Many Branches and Octopus Merges" src="http://d.spearce.org/2007/07/wide-gitk-thumb.gif" /></a><br/><a href="http://d.spearce.org/2007/07/wide-gitk.gif"><em><font size="-1">(larger version)</font></em></a><br/><br/><a name='more'></a><br/><br/>What's even worse about that particular rendering is the branches on either side. This was taken with the following gitk preferences set:<br/><blockquote>Maximum graph width (lines): 80<br/>Maximum graph width (% width of pane): 90</blockquote><br/>Even with those settings, gitk still does not have enough space to show all of the active branches passing through this particular point in time (see the lines cut off on the upper right corner). In case you are wondering, yes, nearly all of those merged together later on. A few haven't yet.<br/><br/>This second image was taken from a more recent point in the project's timeline, during the same gitk session, and with the same preferences.
We've obviously reduced the number of active branches somewhat, and there is now space for the commit subject lines (which I had to redact) and tracking branch labels (also had to be redacted).<br/><br/><p style="margin-left: 40px"><a href="http://d.spearce.org/2007/07/ugly-gitk.gif"><img alt="Chained Merges" title="Chained Merges" src="http://d.spearce.org/2007/07/ugly-gitk-thumb.gif" /></a><br/><a href="http://d.spearce.org/2007/07/ugly-gitk.gif"><em><font size="-1">(larger version)</font></em></a><br/><br/>There is still an octopus here ("Cauterize prior batch m"), but this one was automatically generated by a script we developed and was not directly caused by a user. The script takes a single commit and merges it to all branches that share a common prefix. I don't have the repository on hand right now, but I think this particular octopus was created while we were merging one commit to ~80 branches. In this case 4 branches were already fully merged with a 5th, and we wanted to keep it that way after the new commit was merged to all 80 branches. To do that the script remerged these 4 by way of an octopus.<br/><br/>Before you can ask, no, this is not an imported repository. This was all created in git, using the core porcelain and <a href="http://repo.or.cz/w/git-gui.git/">git-gui</a>.Shawnhttp://www.blogger.com/profile/04158038862839573213noreply@blogger.com2tag:blogger.com,1999:blog-7214366369588984730.post-78671405387185300262007-04-28T17:50:00.000-07:002010-03-27T14:26:37.622-07:00pg is for saleI have decided to no longer support pg, as I haven't used it myself in a very, very long time. It was a useful tool and learning vehicle for myself and a few others, but it just isn't nearly as good as core Git with topic branches. 
Or <a href="/category/projects/scm/git-gui/">git-gui</a>.Shawnhttp://www.blogger.com/profile/04158038862839573213noreply@blogger.com0tag:blogger.com,1999:blog-7214366369588984730.post-52682368793487339702007-02-28T18:24:00.000-08:002010-03-27T17:06:27.436-07:00Git and Linux Repository GrowthI got curious about the growth rate for the git.git and linux-2.6.git repositories, so I wrote <a href="http://article.gmane.org/gmane.comp.version-control.git/41042"><code>git-statplot</code></a> to dump out object counts and sizes by earliest date entered. Plotting these with Gnuplot gave me some interesting results:<br/><br/><a href="http://d.spearce.org/2007/03/bytes.pdf"><img src="http://d.spearce.org/2007/03/bytes.png" height="216" width="360" /></a><br/><a href="http://d.spearce.org/2007/03/commits.pdf"><img src="http://d.spearce.org/2007/03/commits.png" height="216" width="360" /></a><br/><a name='more'></a><br/>The git.git repository (red line) appears to have a very stable history, but sees a huge spike right around now. This spike is an outlier caused by my development repository; I was very actively rebasing a branch on Feb 26th, resulting in hundreds of commits in my reflog.<br/><br/>The linux-2.6.git history seems to show a periodic cycle, with some days reaching up to 700 commits per day. The obvious correlation between number of KiB of disk space used and the number of commits per day is also clearly visible. As I am not a kernel developer the linux-2.6.git data came from a mirror of Linus' repository.
I omitted the first-day outlier to prevent the Y-axis scales from shooting through the roof.<br/><br/>I have linked the images to higher-resolution PDFs; clicking on them will open the PDFs.Shawnhttp://www.blogger.com/profile/04158038862839573213noreply@blogger.com0tag:blogger.com,1999:blog-7214366369588984730.post-75891636365481453772007-01-21T18:57:00.000-08:002010-03-27T17:05:21.086-07:00git-gui Screenshotsniv on #git nudged me enough to create some screenshots of git-gui.<br/><a name='more'></a><br/><br/>The main window:<br/><img src="http://d.spearce.org/2007/01/gitgui-Main.png" /><br/><br/>Branch creation dialog:<br/><img src="http://d.spearce.org/2007/01/gitgui-CreateBranch.png" /><br/><br/>Delete branch dialog:<br/><img src="http://d.spearce.org/2007/01/gitgui-DeleteBranch.png" /><br/><br/>You can currently get git-gui by cloning the repository from repo.or.cz:<br/><pre>git clone git://repo.or.cz/git-gui.git</pre> or track its progress through gitweb at <a href="http://repo.or.cz/w/git-gui.git">git-gui @ repo.or.cz</a>.Shawnhttp://www.blogger.com/profile/04158038862839573213noreply@blogger.com2