Insufficiently Random: My ancient history and the art of software

This week I'm traveling. I'm in Miami for the two-day Eclipse board of directors meeting. When I'm forced to fly, which I usually try to avoid doing, I tend to take a stack of books with me to try and pass the time on the airplane. Unfortunately, airline flight speeds haven't quite caught up with Moore's law as it relates to the power draw of modern laptops. Nor has the airline seat space kept up with the size of my 15" PowerBook and my increasing girth. So books it is.

So I'm currently reading Peter Seibel's Coders at Work. Tonight, while reading through Brendan Eich's interview and his history at Netscape, it reminded me of my own development history around the same time period. I really don't talk about my past very much, so before I get to be too old to remember it, I might as well write some of it down. :-)

During the mid-to-late 90's I was still in high school, but was working part-time in the afternoons after school as a software developer (err, script monkey) for a now defunct website and ISP called INJersey. INJersey was the aspiring, and perhaps too early for its time, online arm of the local daily print paper, the Asbury Park Press. Somehow its owners realized this Internet thing was worth investing in, before a lot of users figured it out and actually got online.

Back then, we didn't have Apache and all of its module glory. We had a pile of patch files you had to apply to NCSA httpd to get A PAtCHy server. We didn't have FastCGI, we had plain old CGI scripts, often written in Perl 4, where &article'render() was an actual function call and not some syntax error brought on by a lack of caffeine. We didn't even have JavaScript or animated images. <blink> was as good as it got.

My job back then was really simple. Take Microsoft Office documents from writers who couldn't be bothered to learn HTML, and put them online in HTML. I wrote a lot of Perl and AppleScript to rip the stuff apart and put it back together again as plain HTML files that we could serve to web users. This was out of sheer laziness, the first great virtue of a programmer. INJersey hired me not as a software developer, but as a data entry monkey to copy and paste the text from Office into an HTML document, and stick in the <td> or <b> tag when necessary. I quickly grew bored with that task and found it much more fun to write scripts to do my job for me, while I explored the wonders of a T1 internet connection.

My managers quickly realized I was able to do more than just copy and paste text with a computer, so they started giving me simple programming assignments. When Netscape 2.0 launched, one of our real programmers figured out we could do server-push based animated images. This is one of those monstrously stupid ideas that I'm glad has died on the web. I'm quite happy that I actually can't come up with a great reference link for it anymore. Anyway, INJersey's website was just awesome one day, because we had animated images, and others didn't.

About this time one of my managers saw this website called JangaChat. It was a free online web based chat room system. No IRC client needed. No Java applets. No plugins. Just Netscape 2.0. Even better, they allowed HTML to be entered without escaping, so you could write fancy messages like "Bob, that was the best fish <b>ever</b>!" and actually have it come out in bold. Their site was more awesome than our animated images. We somehow had to out awesome them again.

Remember, this is like 1995. We just got Netscape 2.0, and Brendan Eich had just unleashed this JavaScript thing on the world. We didn't know what we could do with it... or the damage it could cause. Cross site scripting hadn't even been invented yet.

My managers gave me a simple task: create an INJersey version of JangaChat that we could run on our own servers, so our users could chat online in real time about whatever they felt like, without needing to first install an IRC client.

At the time, I only really knew Perl, and was only self-taught at that. So I wrote the first version as a Perl CGI. I remember abusing the server-based multipart push feature we used for image animation to allow the Perl CGI to stream new messages to the browser as they arrived in the chat room. JangaChat, and this system I worked on at INJersey, may have been some of the first uses of hanging GETs.
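For readers who never saw server push, here is a minimal sketch of what such a CGI response can look like, written in Python rather than the original Perl. The boundary string and the function names are hypothetical, and the post does not say how ChatterBox actually framed its pages; this only shows the general multipart/x-mixed-replace shape, where the response stays open and each new part replaces the previous one in the browser.

```python
import sys
import time

BOUNDARY = "ChatterBoxPush"  # hypothetical boundary string, not from the post


def push_header():
    """CGI response header announcing an open-ended multipart stream."""
    return f"Content-Type: multipart/x-mixed-replace;boundary={BOUNDARY}\r\n\r\n"


def push_part(html):
    """One part of the stream: boundary line, part headers, blank line, body."""
    return f"--{BOUNDARY}\r\nContent-Type: text/html\r\n\r\n{html}\r\n"


def stream(pages, out=sys.stdout, delay=0.0):
    """Write the header, then one part per page, flushing after each part."""
    out.write(push_header())
    for html in pages:
        out.write(push_part(html))
        out.flush()        # push the part to the browser right away
        time.sleep(delay)  # a real chat would wait for new messages here
```

The flush after every part is the important bit: without it the output sits in a buffer and the GET just hangs without showing anything.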

Unfortunately, not only did I not know C, I also didn't know how to do proper interprocess communication on UNIX. So I implemented it with what I did know: each chat room was assigned a local file. New messages were appended onto the end of the file using a POST CGI, and messages were read out to active users by their hanging GET CGIs, which were continuously trying to read the tail of the room's file. Over a decade later, I can't believe I was once foolish enough to believe this was a good idea.

We launched under the name ChatterBox. We quickly stole most of the users from JangaChat, plus picked up our own users, and our little Pentium 133 server could barely keep up with all of those Perl based hanging GET CGI processes demanding resources. Messages were also often truncated or otherwise badly mangled, as I had no file locking, and no way to ensure a full message was written before being read and sent to a browser.
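The mangled messages come down to two missing pieces: writers were not serialized, and readers had no way to tell a half-written message from a finished one. A minimal sketch of both fixes, in Python rather than the original Perl, assuming a UNIX system with flock() and one newline-terminated message per line (the function names and the line format are assumptions, not the original design):

```python
import fcntl


def post_message(path, text):
    """Append one message as a single newline-terminated line, under a lock."""
    line = text.replace("\n", " ").encode("utf-8") + b"\n"
    with open(path, "ab") as room:
        fcntl.flock(room, fcntl.LOCK_EX)  # one writer at a time
        try:
            room.write(line)
            room.flush()
        finally:
            fcntl.flock(room, fcntl.LOCK_UN)


def read_new_messages(path, offset):
    """Return (complete messages after offset, new offset)."""
    with open(path, "rb") as room:
        fcntl.flock(room, fcntl.LOCK_SH)  # wait out any writer mid-append
        try:
            room.seek(offset)
            data = room.read()
        finally:
            fcntl.flock(room, fcntl.LOCK_UN)
    end = data.rfind(b"\n") + 1  # ignore a trailing partial line, if any
    messages = [m.decode("utf-8") for m in data[:end].split(b"\n")[:-1]]
    return messages, offset + end
```

This fixes the corruption but not the scaling problem: every reader is still a separate process polling the same file.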

Management really wanted this product to work, because the demo was flashy, and their local advertising customers were wowed by it. I rewrote the Perl code into C, but kept the basic design of individual hanging GET CGIs per user, and a single log file per chat room. Even in C, we still couldn't keep up with traffic. Somehow I convinced management that buying some "real server hardware", rather than our tiny single processor Pentium 133, would solve the scaling problems, and they went out and purchased a pair of SGI Origin 200 servers running Irix. Today of course, all 3 of you reading this are screaming "but it's the software that wasn't scalable!". That's what a decade of learning gets you. :-)

So upon getting this shiny new server hardware, I started thinking about how I might have done things differently. I knew I couldn't rely on the (then pretty crappy and unscalable) dbm library for user data management, like I had with Perl. So I started poking around at things like mSQL, PostgreSQL, and MySQL. Back in 1996, mSQL was the best there was, but it was single threaded. PostgreSQL was pretty slow and barely ran; it was more of a research project than it was a production database engine. MySQL crashed more than it stayed up, severely lacked features compared to mSQL, and required nasty pthread libraries which weren't exactly standard or well supported on UNIX systems back then.

So I toyed with writing my own. As it turned out, Irix had a decent pthread implementation at the time. I actually ended up with a thread safe, balanced B*-tree that stored user record data in fixed size leaf nodes, and used an arbitrary byte sequence as the record key. To make the implementation easier on myself, I made every block of the file the same size (I think I used 2048 bytes, but I can't remember), and the parent that pointed to the block told you the type using the lower bits of the file offset. E.g. (offset & 1) == 0 meant the block was another btree intermediate node, while (offset & 1) == 1 meant it was an application data record.
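The trick works because every block starts at a multiple of the block size, so the low bits of a real offset are always zero and are free to carry a type tag. A minimal sketch in Python, assuming the 2048-byte block size recalled above; the function names are hypothetical and the original was written in C:

```python
BLOCK_SIZE = 2048  # block size as recalled in the post; may not be exact
RECORD_TAG = 1     # low bit set: data record; clear: intermediate node


def tag_pointer(offset, is_record):
    """Pack a block-aligned file offset and a node type into one integer."""
    if offset % BLOCK_SIZE != 0:
        raise ValueError("offset must be block-aligned")
    return offset | RECORD_TAG if is_record else offset


def untag_pointer(pointer):
    """Split a stored pointer back into (file offset, is_record)."""
    return pointer & ~RECORD_TAG, bool(pointer & RECORD_TAG)
```

The parent node stores the tagged value, so the reader knows how to interpret a block before it has read a single byte of it.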

Within the user data record, I didn't want to write all of the glue required for a formal DDL like a proper SQL server would support. After all, I just had to store a couple of values for each user account, like their email address, date of last login, and a handful of preference settings. So I basically did what Google protocol buffers does, and stored a flexible bag of key/value pairs in binary form. Schema changes were accomplished incrementally, as users were updated during normal application processing, rather than all up front when the software changed.
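A minimal sketch of such a key/value bag, in Python. The wire layout here (a one-byte key length, a two-byte value length, then the raw bytes) and the field names are assumptions for illustration; the post does not describe the original format, and this is not the Protocol Buffers encoding either.

```python
import struct

HEADER = ">BH"  # hypothetical layout: 1-byte key length, 2-byte value length
HEADER_SIZE = struct.calcsize(HEADER)

DEFAULTS = {"color": b"black"}  # hypothetical field added in a later release


def encode_record(fields):
    """Serialize a dict of str -> bytes into one flat byte string."""
    out = bytearray()
    for key, value in fields.items():
        raw_key = key.encode("utf-8")
        out += struct.pack(HEADER, len(raw_key), len(value)) + raw_key + value
    return bytes(out)


def decode_record(blob):
    """Parse the byte string back into a dict, keeping unknown keys."""
    fields, pos = {}, 0
    while pos < len(blob):
        key_len, value_len = struct.unpack_from(HEADER, blob, pos)
        pos += HEADER_SIZE
        key = blob[pos:pos + key_len].decode("utf-8")
        pos += key_len
        fields[key] = blob[pos:pos + value_len]
        pos += value_len
    return fields


def load_user(blob):
    """Decode a record and fill in fields that older records never stored."""
    fields = decode_record(blob)
    for key, value in DEFAULTS.items():
        fields.setdefault(key, value)
    return fields
```

Because old records simply lack the new keys, the "schema change" happens the next time a user's record is loaded and written back, which matches the incremental upgrade described above.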

So, I solved my data storage problem by just rolling my own system. Looking back on it, it's not too different from the approach Google takes with BigTable. Only they scale across machines by partitioning the key space, while I just tossed the entire key space onto a single node.

For the interactive web components, I finally got a clue and realized that the hanging GET CGI processes were my real scaling bottleneck. They consumed too much of the system's resources per process. My local UNIX system administrator finally got around to telling me about sar and ps and how to use them to figure out that I was being a moron.

Since we still didn't have FastCGI, I wrote my own HTTP server. From scratch. In C. First and last time I've ever done that. Lesson learned. There's a reason I haven't been involved in the Apache HTTPd project. My scars still haven't healed.

Initially, my HTTP server was using pthreads, spawning a pthread per browser request. Which meant that my server used 1 thread per hanging GET. At first I thought this would be OK, threads were lightweight compared to a process, right? Until my local UNIX administrator brought out the clue stick and made me think about it for a few minutes. In the process-per-connection implementation we had very little allocated memory, just a kilobyte or so of global state, and the entire program executable was mapped as shared memory between all of the instances. The real memory cost was in the program's stack, which the OS had a minimum size on. With pthreads we didn't gain much resource savings over the process-per-connection approach because we still had the same per-connection state, and the same minimum thread stack size.

I started digging around the Irix manual pages, and one day found this nifty thing called select(). I wrote a quick simple server using it, and realized non-blocking asynchronous IO was the best thing since sliced bread. That night I completely tore apart the pthread based HTTP server and rewrote it as a non-blocking, asynchronous IO server.
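The appeal of select() is that one thread can watch every connection at once and only touch the sockets that are actually ready. A minimal sketch of that loop, in Python rather than C, as a toy server that relays whatever one client sends to all connected clients. The class and method names are hypothetical, and it speaks raw bytes rather than HTTP.

```python
import select
import socket


class ChatServer:
    def __init__(self, host="127.0.0.1", port=0):
        self.listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        self.listener.bind((host, port))
        self.listener.listen()
        self.listener.setblocking(False)
        self.clients = {}  # socket -> bytes still waiting to be sent

    def poll(self, timeout=0.1):
        """Run one pass of the event loop."""
        readers = [self.listener, *self.clients]
        writers = [s for s, pending in self.clients.items() if pending]
        readable, writable, _ = select.select(readers, writers, [], timeout)
        for sock in readable:
            if sock is self.listener:
                conn, _addr = sock.accept()
                conn.setblocking(False)
                self.clients[conn] = b""
            else:
                self._receive(sock)
        for sock in writable:
            if sock in self.clients:  # may have been dropped just above
                self._flush(sock)

    def _drop(self, sock):
        del self.clients[sock]
        sock.close()

    def _receive(self, sock):
        try:
            data = sock.recv(4096)
        except BlockingIOError:
            return
        except ConnectionError:
            data = b""
        if not data:  # peer closed the connection
            self._drop(sock)
            return
        for peer in self.clients:  # queue the bytes for every client
            self.clients[peer] += data

    def _flush(self, sock):
        try:
            sent = sock.send(self.clients[sock])
        except BlockingIOError:
            return
        except ConnectionError:
            self._drop(sock)
            return
        self.clients[sock] = self.clients[sock][sent:]
```

Calling poll() in a loop is the whole server. Per-connection cost is one socket and one pending-bytes buffer, with no stack per client.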

By now I had completely abandoned the idea of the message distribution going through a local log file, and instead stored it in shared memory protected by pthread mutexes and condition variables. Threads posting messages into a chat room appended their message object onto a distribution queue and signaled one of the IO threads to start streaming the messages out using asynchronous IO to each connected browser.
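The mutex-plus-condition-variable handoff can be sketched in a few lines. This is a minimal Python version with hypothetical names, using threading.Condition (which bundles a lock with a condition variable) in place of the pthread primitives the original used:

```python
import threading
from collections import deque


class RoomQueue:
    """Distribution queue for one chat room."""

    def __init__(self):
        self._messages = deque()
        self._cond = threading.Condition()  # a lock plus a condition variable

    def post(self, message):
        """Called by a thread handling a POST: enqueue and wake one IO thread."""
        with self._cond:
            self._messages.append(message)
            self._cond.notify()

    def take_all(self, timeout=None):
        """Called by an IO thread: sleep until there is something to send."""
        with self._cond:
            self._cond.wait_for(lambda: self._messages, timeout)
            batch = list(self._messages)
            self._messages.clear()
            return batch
```

The waiting thread sleeps inside wait_for() without holding the lock, so posters never block on an idle reader, and a timeout simply yields an empty batch.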

Along the way I also had to write an HTML parser, because we wanted to allow messages like "Here is a <font color=red>red rose</font>" to render as intended by the author, but we didn't want to have unclosed tags like "<h1>Hi!" ruin the formatting for every subsequent message on the page. End users also figured out cross site scripting before we did, with plenty of "<script>window.close()</script>" nonsense being pasted into rooms and closing hundreds of browsers at once. Yes, we learned about cross site scripting attacks the hard way. Back in 1996/97, nobody really thought about this sort of stuff.
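A minimal sketch of that kind of message filter, using Python's standard html.parser instead of a hand-written parser. The tag and attribute allowlists are assumptions chosen to match the examples above, not the original rules, and a production sanitizer needs far more care than this.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i", "u", "font", "h1"}  # hypothetical allowlist
ALLOWED_ATTRS = {"font": {"color"}}           # hypothetical allowlist


class MessageCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself, e.g. <script>
        allowed = ALLOWED_ATTRS.get(tag, ())
        kept = [(k, v) for k, v in attrs if k in allowed and v is not None]
        attr_text = "".join(f' {k}="{escape(v)}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_text}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray close tag, ignore it
        while self.open_tags:  # close anything nested inside it first
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))  # text is always emitted escaped


def clean_message(text):
    """Keep allowlisted tags, drop the rest, and close whatever is left open."""
    cleaner = MessageCleaner()
    cleaner.feed(text)
    cleaner.close()  # flush any buffered trailing text
    while cleaner.open_tags:
        cleaner.out.append(f"</{cleaner.open_tags.pop()}>")
    return "".join(cleaner.out)
```

With this, "<h1>Hi!" comes out with its closing tag appended, so one sloppy message cannot restyle everything after it, and a pasted script tag never reaches the page as markup.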

We relaunched in the winter of 1997. The Wayback machine has a snapshot of the rewritten homepage dated April 13, 1997.

Of all of that though, the thing I'm still most excited about is the "non-Shockwave" version of the ChatterBox website. Back then we didn't have AJAX and XMLHttpRequest. Heck, we didn't even have iframe. We had straight up boring <frameset>. I rolled my own AJAX system from scratch using JavaScript, a hand rolled message queue, and a 1 pixel high frame in a frameset that performed POSTs driven by JavaScript. In very early 1997. Apparently I discovered parts of the web that we didn't see again until XMLHttpRequest was widely supported.

Unfortunately management pulled the plug a year or so later, and that neat little HTTP server with its pre-AJAX AJAX and pre-Google Talk hanging GETs disappeared into the sands of time.

So, tonight, while reading Brendan Eich's interview talking about building JavaScript in two weeks, I remembered just how much I learned that year in my part time job. I went from being a Perl script monkey who couldn't even compile a C program, to having written an asynchronous IO HTTP server from the ground up, mastered multi-threading, mutexes, condition variables, AJAX, cross site scripting, and a balanced B*-tree implementation that bears a lot of resemblance to BigTable. All without taking any CS classes in high school.

Maybe that's why college, and my later jobs, were all so boring for me. I'd done all of it already. Maybe that's why I was less amazed than my peers when Google Maps launched. I didn't make something nearly as awesome, or as empowering to so many people around the world as Lars and the Maps team did. But I had certainly seen, worked with, and discovered much of today's web. In 1997.

What is HTML 5 going to bring us that we haven't discovered yet?

Tuesday, December 8, 2009
