(It feels kind of unreal to see Google these days jostling with Apple for the title of Most Valuable Company on the Planet. Seems like not that long ago it was this crazy little grad student project running off of borrowed machines in the CSL basement. The secret behind making that transition was an elegantly simple business model backed by fiendishly complex software. And the engineer behind that software was Ron Garret.
Ron has his own site with lots of stories from the bad old days, and has kindly given permission to reproduce the story of the AdWords launch here. Strangely, there does not seem to have been an AdWords launch t-shirt, but the story is too good to not include here, along with the “stock” photo above [photo credit Muhammad Cordiaz].)
Ron sez: I dove into the adstoo project with as much enthusiasm as I could muster, which I’m ashamed to say wasn’t much. The situation was exacerbated by the fact that we had no Java development infrastructure. We were writing bareback, so to speak. We had no debugger. We were using JSP, but had no editor support for JSP syntax. (That turned into a real debugging nightmare. It could take many tens of minutes to find a single typographical error because the only indication that there was a problem was that the output just looked all wrong, but the actual problem could be very far away from the apparent source of the problem.)
Fortunately for me, I was assigned a junior engineer to work with/for me, and he actually knew what he was doing. While I struggled to learn the Java libraries and debugging techniques (I knew the basic language, but I had never done any serious development in it before) this guy just took the bull by the horns and pretty much just wrote the whole damn thing in a matter of weeks. I sometimes pull this old joke out of the dustbin, that in the ancient tradition of senior-junior relationships, he did all the work and I took all the credit.
That’s not quite true. I did end up writing the credit card billing and accounting system, which is a nontrivial thing to get right. Fortunately for me, just before coming to Google I had taken some time to study computer security and cryptography, so I was actually well prepared for that particular task. Back in those days internal security was more or less nonexistent. All the engineers had root access to all of the servers. But I believe in planning ahead, and I anticipated the day when Google was not going to be a small company any more, and so I designed the billing system to be secure against even a dishonest employee with root access (which is not such an easy thing to do). I have no idea if they are still using my system, but if they are then I’d feel pretty confident that my credit card number was not going to get stolen.
Things were made worse by the fact that I had been assigned an office mate who was also new to Google, and who was not part of the ads group. Most of the other ads group members were sharing offices (or cubicles) with other ads group members, and so I felt I wasn’t really part of the club. On top of that, I was away from home and didn’t really have a life up there in Northern California. The stress mounted. I started to get paranoid that I would get fired before reaching the one-year mark. I started experiencing stress-related health problems, some of which are still with me today. On more than one occasion I came that close to quitting. To this day I have no idea why I didn’t.
It was about this time that I had my one and only meeting with Larry Page. It was to discuss the progress of the adstoo project and to set a launch date. My manager was there along with a couple of other people (including Doug, I think). Things went smoothly until Larry suggested changing the way billing was handled. I don’t remember the details, but my response was that this would be significant work. No one challenged me, but I found out later that the reaction of people in the room was something along the lines of, “Is he crazy? This ought to be a trivial change.” This little incident turned out to have very far-ranging repercussions later, but that will have to wait for the next blog entry.
Somehow we actually managed to launch AdWords on schedule, in September of 2000. It still seems like a bloody miracle. Most of the credit goes to Jeremy, Ed and Schwim. It could not have been done without them.
I can still remember watching the very first ad roll in. It was for a company called Lively Lobsters. Two months ago, after five years of intending to do so, I finally bought myself a little toy stuffed lobster to commemorate the occasion. (Update on 12/9/2005: It appears that Lively Lobsters has gone out of business. There’s some irony for you.)
About two weeks later all hell broke loose.
The AdWords launch went fairly smoothly, and I spent most of the next two weeks just monitoring the system, fixing miscellaneous bugs, and answering emails from users. (Yes, I was front-line AdWords support for the first month or so.)
The billing system that I had written ran as a cron job (for you non-programmers, that means that it ran automatically on a set schedule) and the output scrolled by in a window on my screen. Everything was working so well I didn’t really pay much attention to it any more, until out of the corner of my eye I noticed that something didn’t look quite right.
I pulled up the biller window and saw that a whole bunch of credit card charges were being declined one after another. The reason was immediately obvious: the amounts being charged were outrageous, tens of thousands, hundreds of thousands, millions of dollars. Basically random numbers, most of which no doubt exceeded people’s credit limits by orders of magnitude.
But a few didn’t. Some charges, for hundreds or thousands of dollars, were getting through. Either way it was bad. For the charges that weren’t getting through the biller was automatically shutting down the accounts, suspending all their ads, and sending out nasty emails telling people that their credit cards had been rejected.
I got a sick feeling in the pit of my stomach, killed the biller, and started trying to figure out what the fsck was going on. (For you non-programmers out there, that’s a little geek insider joke. Fsck is a Unix command. It’s short for File System ChecK.)
It quickly became evident that the root cause of the problem was some database corruption. The ad servers, which actually served up the ads, would keep track of how many times a particular ad had been served and periodically dump those counts into a database. The biller would then come along and periodically collect all those counts, roll them up into an invoice, and bill the credit cards. The database was filled with entries containing essentially random numbers. No one had a clue how they got there.
I began the process of manually going through the database to clean up the bad entries, roll back the erroneous transactions, and send out apologetic emails to all the people who had been affected. Fortunately, there weren’t a huge number of users back then, and I had caught the problem early enough that only a small number of them were affected. Still, it took several days to finally clean up the mess.
Now, it’s a complete no-brainer that when something like that happens you add some code to detect the problem if it ever happens again, especially when you don’t know why the problem happened in the first place. But I didn’t. It’s probably the single biggest professional mistake I’ve ever made. In my defense I can only say that I was under a lot of stress (more than I even realized at the time), but that’s no excuse. I dropped the ball. And it was just pure dumb luck that the consequences were not more severe. If the problem had waited a year to crop up instead of a couple of weeks, or if I hadn’t just happened to be there watching the biller window (both times!) when the problem cropped up, Google could have had a serious public relations problem on its hands. As it happened, only a few dozen people were affected and we were able to undo the damage fairly easily.
You can probably guess what happened next. Yep. One week later. Same problem. This time I added a sanity check to the billing code and kicked myself black and blue for not thinking to do it earlier. At least the cleanup went a little faster this time because by now I had a lot of practice in what to do.
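For the programmers reading along, a sanity check of that general kind might look something like the sketch below. It is written in Python purely for illustration and is not the original biller code. The names `AdCount` and `split_sane_counts`, the flat per-impression price, and the ceiling `MAX_PLAUSIBLE_CHARGE_CENTS` are all hypothetical; the story does not say what the real check tested for.

```python
from dataclasses import dataclass

# Hypothetical ceiling: the story does not give the real threshold.
MAX_PLAUSIBLE_CHARGE_CENTS = 500_000  # $5,000
# Assumed flat rate, only so the example has something to multiply by.
PRICE_PER_IMPRESSION_CENTS = 1


@dataclass
class AdCount:
    account_id: str
    impressions: int


def split_sane_counts(counts):
    """Separate plausible charges from suspicious ones without billing anyone."""
    billable, suspicious = [], []
    for count in counts:
        amount = count.impressions * PRICE_PER_IMPRESSION_CENTS
        if 0 <= amount <= MAX_PLAUSIBLE_CHARGE_CENTS:
            billable.append((count.account_id, amount))
        else:
            # Do not charge, and do not suspend the account either:
            # a bad count says nothing about the customer's card.
            suspicious.append(count)
    return billable, suspicious
```

The point of returning the suspicious entries rather than silently dropping them is that someone still has to look at them; in this sketch a corrupted count neither bills the card nor triggers the "your card was rejected" path.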
And we still didn’t know where the random numbers were coming from despite the fact that everyone on the ads team was trying to figure it out.
OK, time to wrap up this little soap opera.
The problem turned out to be something called a race condition, which is one of the most pernicious and difficult kinds of bugs to find. (Those of you who are technically savvy can skip to the end.)
Most modern server code is multi-threaded, which means that it does more than one computation at once. This is important because computers do more than just compute. They also store and retrieve information from hard disks, which are much, much slower than the computers. Every time the computer has to access the disk things come to a screeching halt. To give you some idea, most modern computers run at clock speeds measured in gigahertz, or billions of cycles per second. The fastest hard disks have seek times (that is, the time it takes the drive to move the read/write head into the proper position) of several milliseconds. So a computer can perform tens of millions of computations in the time it takes a hard disk just to get into position to read or write data.
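The arithmetic behind that "tens of millions" figure can be checked in a couple of lines. The 3 GHz clock and 8 ms seek time below are assumed round numbers chosen for illustration, not measurements of any particular machine, and the function name is made up for this example.

```python
def cycles_per_seek(clock_hz: int, seek_microseconds: int) -> int:
    """Clock cycles that elapse while the disk head is still seeking."""
    return clock_hz * seek_microseconds // 1_000_000


# Assumed values: a 3 GHz processor and an 8 ms (8,000 microsecond) seek.
print(f"{cycles_per_seek(3_000_000_000, 8_000):,}")  # 24,000,000
```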
In order to keep things from bogging down, when one computation has to access the disk, it suspends itself, and another computation takes over. This way, one computer sort of “pretends” that it is really multiple computers all running at the same time, even though in reality what is happening is that one computer is just time-slicing lots of simultaneous computations.
The ad server, the machine that actually served up ads in response to search terms, ran multi-threaded code written in C++, which is more or less the industry standard nowadays for high-performance applications. C++ is byzantine, one of the most complex programming languages ever invented. I’ve been studying C++ off and on for ten years and I’m still far from being an expert. Its designers didn’t really set out to make it that complicated, it just sort of accreted more and more cruft over the years until it turned into this hulking behemoth.
C++ has a lot of features, but one feature that it lacks that Lisp and Java have is automatic memory management. Lisp and Java (and most other modern programming languages) use a technique called garbage collection to automatically figure out when a piece of memory is no longer being used and put it back in the pool of available memory. In C++ you have to do this manually.
Memory management in multi-threaded applications is one of the biggest challenges C++ programmers face. It’s a nightmare. All kinds of techniques and protocols have been developed to help make the task easier, but none of them work very well. At the very least they all require a certain discipline on the part of the programmer that is very difficult to maintain. And for complex pieces of code that are being worked on by more than one person it is very, very hard to get it right.
What happened, it turned out, was this: the ad server kept a count of all the ads that it served, which it periodically wrote out to the database. (For those of you wondering what database we were using, it was MySQL, which leads to another story, but that will have to wait for another post.) It also had a feature where, if it was shut down for any reason, it would write out the final served ads count before it actually quit. The ad counts were stored in a block of memory that was stack allocated by one thread. The final ad counts were written out by code running in a different thread. So when the ad server was shut down, the first thread would exit and free up the memory holding the ad counts, which would then be reused by some other process, which would write essentially random data there. In the meantime, the thread writing out the final ad counts would still be reading that memory. This is why it’s called a race condition, because the two threads were racing each other, with the ad-count-writer trying to finish before the main thread freed up the memory it was using to get those counts. And because the ad-count-writer was writing those counts to a database, which is to say, to disk, it always lost the race.
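The shape of that race can be simulated in a few lines, with some loud caveats. This is a Python sketch, not the ad server's C++ code, and Python cannot produce a genuine use-after-free. Overwriting the list in place stands in for the freed memory being reused, and a `threading.Event` stands in for the slow disk, forcing the ordering the story describes (the writer always loses). Every name here is hypothetical.

```python
import threading


def shutdown(wait_for_writer: bool) -> list:
    counts = [3, 1, 4]              # stands in for the block of ad counts
    disk_ready = threading.Event()  # stands in for the slow disk
    written = []                    # what ends up "in the database"

    def write_final_counts():
        disk_ready.wait()           # the writer stalls waiting on the disk...
        written.extend(counts)      # ...then reads whatever is in the block now

    writer = threading.Thread(target=write_final_counts)
    writer.start()

    if wait_for_writer:
        # The fix: let the writer finish before giving up the memory.
        disk_ready.set()
        writer.join()

    # The owning thread exits; its memory is "freed" and reused by
    # something else, which scribbles over it.
    counts[:] = [987654321, -42, 555555]

    disk_ready.set()
    writer.join()
    return written


print(shutdown(wait_for_writer=False))  # [987654321, -42, 555555]
print(shutdown(wait_for_writer=True))   # [3, 1, 4]
```

In the buggy ordering the writer records garbage; joining the writer before the memory is given up is one simple way to close the window, though the story does not say how Ed actually fixed it.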
Now, here is the supreme irony: remember the meeting with Larry where he wanted to make a change to the billing model that I said would be hard and everyone else in the room thought would be easy? The bug was introduced when the ad server code was changed to accommodate that new billing model. On top of that, this kind of bug is actually impossible to introduce except in a language with manual memory management like C++. In a language with automatic memory management like Java or Lisp the system automatically notices that the memory is still in use and prevents it from being reused until all threads are actually done with it.
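To illustrate that last point, here is a small sketch in Python, which also manages memory automatically. The owning thread drops its reference to the counts, yet the writer thread still reads the original values, because its own reference keeps the object alive. As before, the `threading.Event` is a stand-in for the slow disk and all names are hypothetical.

```python
import threading


def shutdown_with_automatic_memory() -> list:
    counts = [3, 1, 4]
    disk_ready = threading.Event()  # stands in for the slow disk
    written = []

    def write_final_counts(snapshot):
        disk_ready.wait()           # still waiting on the disk
        written.extend(snapshot)    # the object is guaranteed to still exist

    writer = threading.Thread(target=write_final_counts, args=(counts,))
    writer.start()

    # The owning thread is finished with the counts and drops its reference.
    del counts
    # Other code allocates new objects; they cannot land on top of the old
    # list, because the writer thread still holds a reference to it.
    scratch = [987654321, -42, 555555]

    disk_ready.set()
    writer.join()
    return written


print(shutdown_with_automatic_memory())  # [3, 1, 4]
```

Automatic memory management rules out the memory being reused underneath the writer; it would not help if the owning thread deliberately modified the same list in place while the writer was still reading it.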
By the time this bug was found and fixed (by Ed) I was a mental wreck, and well on my way to becoming a physical wreck as well. My relationship with my wife was beginning to strain. My manager and I were barely on speaking terms. And I was getting a crick in my neck from the chip I was carrying around on my shoulder from feeling that I had been vindicated in my assessment of the potential difficulties of changing the billing model.
So I went to my manager and offered to resign from the ads group. To my utter astonishment, she did not accept.