launch – google tshirt project

AdWords Launch

February 10, 2016May 13, 2016Leave a comment

(It feels kind of unreal to see Google these days jostling with Apple for the title of Most Valuable Company on the Planet. Seems like not that long ago it was this crazy little grad student project running off of borrowed machines in the CSL basement. The secret behind making that transition was an elegantly simple business model backed by fiendishly complex software. And the engineer behind that software was Ron Garret.

Ron has his own site with lots of stories from the bad old days, and has kindly given permission to reproduce the story of the AdWords launch here. Strangely, there does not seem to have been an AdWords launch t-shirt, but the story is too good to not include here, along with the “stock” photo above [photo credit Muhammad Cordiaz].)

Ron sez: I dove into the adstoo project with as much enthusiasm as I could muster, which I’m ashamed to say wasn’t much. The situation was exacerbated by the fact that we had no Java development infrastructure. We were writing bareback, so to speak. We had no debugger. We were using JSP, but had no editor support for JSP syntax. (That turned into a real debugging nightmare. It could take many tens of minutes to find a single typographical error because the only indication that there was a problem was that the output just looked all wrong, but the actual problem could be very far away from the apparent source of the problem.)

Fortunately for me, I was assigned a junior engineer to work with/for me, and he actually knew what he was doing. While I struggled to learn the Java libraries and debugging techniques (I knew the basic language, but I had never done any serious development in it before) this guy just took the bull by the horns and pretty much just wrote the whole damn thing in a matter of weeks. I sometimes pull this old joke out of the dustbin, that in the ancient tradition of senior-junior relationships, he did all the work and I took all the credit.

That’s not quite true. I did end up writing the credit card billing and accounting system, which is a nontrivial thing to get right. Fortunately for me, just before coming to Google I had taken some time to study computer security and cryptography, so I was actually well prepared for that particular task. Back in those days internal security was more or less nonexistent. All the engineers had root access to all of the servers. But I believe in planning ahead, and I anticipated the day when Google was not going to be a small company any more, and so I designed the billing system to be secure against even a dishonest employee with root access (which is not such an easy thing to do). I have no idea if they are still using my system, but if they are then I’d feel pretty confident that my credit card number was not going to get stolen.

Things were made worse by the fact that I had been assigned an office mate who was also new to Google, and who was not part of the ads group. Most of the other ads group members were sharing offices (or cubicles) with other ads group members, and so I felt I wasn’t really part of the club. On top of that, I was away from home and didn’t really have a life up there in Northern California. The stress mounted. I started to get paranoid that I would get fired before reaching the one-year mark. I started experiencing stress-related health problems, some of which are still with me today. On more than one occasion I came that close to quitting. To this day I have no idea why I didn’t.

It was about this time that I had my one and only meeting with Larry Page. It was to discuss the progress of the adstoo project and to set a launch date. My manager was there along with a couple of other people (including Doug I think). Things went smoothly until Larry suggested changing the way billing was handled. I don’t remember the details, but my response was that this would be significant work. No one challenged me, but I found out later that the reaction of people in the room was something along the lines of, “Is he crazy? This ought to be a trivial change.” This little incident turned out to have very far ranging repurcussions later, but that will have to wait for the next blog entry.

Somehow we actually managed to launch AdWords on schedule, in September of 2000. It still seems like a bloody miracle. Most of the credit goes to Jeremy, Ed and Schwim. It could not have been done without them.

I can still remember watching the very first ad roll in. It was for a company called Lively Lobsters. Two months ago, after five years of intending to do so, I finally bought myself a little toy stuffed lobster to commemorate the occasion. (Update on 12/9/2005: It appears that Lively Lobsters has gone out of business. There’s some irony for you.)

About two weeks later all hell broke loose.

The AdWords launch went fairly smoothly, and I spent most of the next two weeks just monitoring the system, fixing miscellaneous bugs, and answering emails from users. (Yes, I was front-line AdWords support for the first month or so.)

The billing system that I had written ran as a cron job (for you non-programmers, that means that it ran automatically on a set schedule) and the output scrolled by in a window on my screen. Everything was working so well I didn’t really pay much attention to it any more, until out of the corner of my eye I noticed that something didn’t look quite right.

I pulled up the biller window and saw that a whole bunch of credit card charges were being declined one after another. The reason was immediately obvious: the amounts being charged were outrageous, tens of thousands, hundreds of thousands, millions of dollars. Basically random numbers, most of which no doubt exceeded people’s credit limits by orders of magnitude.

But a few didn’t. Some charges, for hundreds or thousands of dollars, were getting through. Either way it was bad. For the charges that weren’t getting through the biller was automatically shutting down the accounts, suspending all their ads, and sending out nasty emails telling people that their credit cards had been rejected.

I got a sick feeling in the pit of my stomach, killed the biller, and started trying to figure out what the fsck was going on. (For you non-programmers out there, that’s a little geek insider joke. Fsck is a unix command. It’s short for File System ChecK.)

It quickly became evident that the root cause of the problem was some database corruption. The ad servers which actually served up the the ads would keep track of how many times a particular ad had been served and periodically dump those counts into a database. The biller would then come along and periodically collect all those counts, roll them up into an invoice, and bill the credit cards. The database was filled with entries containing essentially random numbers. No one had a clue how they got there.

I began the process of manually going through the database to clean up the bad entries, roll back the erroneous transactions, and send out apologetic emails to all the people who had been affected. Fortunately, there weren’t a huge number of users back then, and I had caught the problem early enough that only a small number of them were affected. Still, it took several days to finally clean up the mess.

Now, it’s a complete no-brainer that when something like that happens you add some code to detect the problem if it ever happens again, especially when you don’t know why the problem happened in the first place. But I didn’t. It’s probably the single biggest professional mistake I’ve ever made. In my defense I can only say that I was under a lot of stress (more than I even realized at the time), but that’s no excuse. I dropped the ball. And it was just pure dumb luck that the consequences were not more severe. If the problem had waited a year to crop up instead of a couple of weeks, or if I hadn’t just happened to be there watching the biller window (both times!) when the problem cropped up Google could have had a serious public relations problem on its hands. As it happened, only a few dozen people were affected and we were able to undo the damage fairly easily.

You can probably guess what happened next. Yep. One week later. Same problem. This time I added a sanity check to the billing code and kicked myself black and blue for not thinking to do it earlier. At least the cleanup went a little faster this time because by now I had a lot of practice in what to do.

Stress.

And we still didn’t know where the random numbers were coming from despite the fact that everyone on the ads team was trying to figure it out.

OK, time to wrap up this little soap opera.

The problem turned out to be something called a race condition, which is one of the most pernicious and difficult kinds of bugs to find. (Those of you who are technically savvy can skip to the end.)

Most modern server code is multi-threaded, which means that it does more than one computation at once. This is important because computers do more than just compute. They also store and retrieve information from hard disks, which are much, much slower than the computers. Every time the computer has to access the disk things come to a screeching halt. To give you some idea, most modern computers run at clock speed measured in gigahertz, or billions of cycles per second. The fastest hard disks have seek times (that is, the time it takes the drive to move the read/write head into the proper position) of several milliseconds. So a computer can perform tens of millions of computations in the time it takes a hard disk just to get into position to read or write data.

In order to keep things from bogging down, when one computation has to access the disk, it suspends itself, and another computation takes over. This way, one computer sort of “pretends” that it is really multiple computers all running at the same time, even though in reality what is happening is that one computer is just time-slicing lots of simultaneous computations.

The ad server, the machine that actually served up ads in response to search terms, ran multi-threaded code written in C++, which is more or less the industry standard nowadays for high-performance applications. C++ is byzantine, one of the most complex programming languages ever invented. I’ve been studying C++ off and on for ten years and I’m still far from being an expert. Its designers didn’t really set out to make it that complicated, it just sort of accreted more and more cruft over the years until it turned into this hulking behemoth.

C++ has a lot of features, but one feature that it lacks that Lisp and Java have is automatic memory management. Lisp and Java (and most other modern programming langauges) use a technique called garbage collection to automatically figure out when a piece of memory is no longer being used and put it back in the pool of available memory. In C++ you have to do this manually.

Memory management in multi-threaded applications is one of the biggest challenges C++ programmers face. It’s a nightmare. All kinds of techniques and protocols have been developed to help make the task easier, but none of them work very well. At the very least they all require a certain discipline on the part of the programmer that is very difficult to maintain. And for complex pieces of code that are being worked on by more than one person it is very, very hard to get it right.

What happened, it turned out, was this: the ad server kept a count of all the ads that it served, which it periodically wrote out to the database. (For those of you wondering what database we were using, it was MySQL, which leads to another story, but that will have to wait for another post.) It also had a feature where, if it was shut down for any reason, it would write out the final served ads count before it actually quit. The ad counts were stored in a block of memory that was stack allocated by one thread. The final ad counts were written out by code running in a different thread. So when the ad server was shut down, the first thread would exit and free up the memory holding the ad counts, which would then be reused by some other process, which would write essentially random data there. In the meantime, the thread writing out the final ad counts would still be reading that memory. This is why it’s called a race condition, because the two threads were racing each other, with the ad-count-writer trying to finish before the main thread freed up the memory it was using to get those counts. And because the ad-count-writer was writing those counts to a database, which is to say, to disk, it always lost the race.

Now, here is the supreme irony: remember the meeting with Larry where he wanted to make a change to the billing model that I said would be hard and everyone else in the room thought would be easy? The bug was introduced when the ad server code was changed to accommodate that new billing model. On top of that, this kind of bug is actually impossible to introduce except in a language with manual memory management like C++. In a language with automatic memory management like Java or Lisp the system automatically notices that the memory is still in use and prevent it from being reused until all threads were actually done with it.

By the time this bug was found and fixed (by Ed) I was a mental wreck, and well on my way to becoming a physical wreck as well. My relationship with my wife was beginning to strain. My manager and I were barely on speaking terms. And I was getting a crick in my neck from the chip I was carrying around on my shoulder from feeling that I had been vindicated in my assessment of the potential difficulties of changing the billing model.

So I went to my manager and offered to resign from the ads group. To my utter astonishment, she did not accept.

Strangely, I can’t find any record of an Adwords Launch t-shirt, but this is the base of Ron’s commemorative AdWords Launch lava lamp.

Google Labs 2.0: the Launch

February 9, 2016May 19, 20173 Comments

20 April, 2009

It’s hard to overstate the adrenaline of the launch. We’ve been on this project for a bit over a year – nothing major by any standards. And what we’re launching isn’t a major product. Technically speaking, it isn’t even a product. Prosaically, we’ve just changed the presentation style of the Labs homepage, a site Google’s been serving for years. We are adding to it links to a couple of other “Labs” products that are getting unveiled today, but they’ve got their own teams and product managers. We’re not getting ourselves stressed on their behalf; we’ve got enough on our plate.

The actual sequence of steps is pretty straightforward – I’ve scrawled them up on the whiteboard in the unoccupied cubes our team is camped out at this morning:

10:30: flip switch to make new apps externally visible
10:30+e: verify new apps externally visible
10:30+2e: flip switch to make new site externally visible at googlelabs.com
10:30+3e: panic and debug
12:30: blog post goes out
12:30+e: flip redirect to make old site (labs.google.com) redirect to new site (googlelabs.com)

Monday morning, 9:00. Arthur and I are halfway up to the city, carpooling through the tail end of rush hour. We’re calculating transit times – 15 more minutes to the San Francisco office, with 10 minutes to park, puts us into action at 9:25. We’ll have 90 minutes to spare – plenty of time to put together an impromptu war room, sync with the infrastructure guys who’ll be flipping switches for us, and run the pre-launch tests one last time. Hell, there’s even time to get breakfast and a cup of coffee.

That, of course, is when my phone rings. I fumble for the right buttons to pick it up on the car’s hands-free audio.

“Hi Pablo? It’s Michael. Do you have a moment? It’s Important.”

Crap.

I hate when It’s Important. Because “It” is never Important Good, like Larry and Sergey wanting to lend us the jet for the weekend, and needing to know where we want the limo pickup. No, “It’s Important” always means It’s Important Bad, like the datacenter we’re hosted out of has just gone down in flames.

Sure enough, it’s the datacenter. Down for the count, and something’s not working right on the automatic failover to a backup. All of a sudden, ninety minutes doesn’t seem like such a lot of time.

Michael is our product manager, and he fills us in. It may not be that bad. He’s already up in SF, coordinating a workaround (have I mentioned how much I love this guy?). The infrastructure guys are manually porting our code to another site and configuring a dedicated pool of machines to host us. They just need to know what special runtime flags, if any, we need for the pool.

Arthur and I parley briefly, trying to remember what special favors we’d asked when we got our machines set up the first time. We come up blank, which is either a good thing (we didn’t ask for any) or a bad thing (we’ve forgotten the arcane animal sacrifices our code requires). In five minutes we’ve crossed from nervous confidence to outright panic, then over into the eerie, liminal world of “Hope for the Best.” And we’re still 20 minutes from the office.

It’s good to have Arthur in the right seat here. He’s a couple of years younger than me and too modest to wear the hard-earned brass rat, but quietly carries a decade more experience than most of the engineers at this company. In previous fire drills, I’ve often imagined his calm get-it-done perspective as having been channelled from the days of Apollo’s mission control. Alright boys, they’re on fire out there halfway to the moon bleeding hydrazine and short on oxygen – whaddaya got for me?

Plan B is on deck, so we review the options for Plan C. By the time we make it to a desk, Michael will have confirmed that the manual port is working. Or not. If not, one of us can join the infrastructure guys if figuring out why the hell it isn’t, while the other can try copying our data over the internal, test-only version of the console. It can’t go live, but we’ll at least be able to do the demo, and we’ll just have to get RJ to massage the messaging: instead of “We’ve just launched…” he’ll get to announce that “we are about to launch…” Of course, there’s the rest of the PR freight train we’ll have to deal with. The blog, the press release, the “googlegrams”. Everything’s choreographed to trip at 12:30, alerting the press – and the world at large – to go have a look at http://www.googlelabs.com.

A haunting story looms in my mind. Somewhere in the Soviet Union during the peak of the Cold War space race. The rocket launch was aborted just before launch. Countdown clock stopped, and ignition sequence shut down as the crew went in to diagnose a problem with the first stage motor. The scientists, engineers, generals and VIPs gathered at the base of the monstrous rocket. But the rocket’s fourth stage motor relied on a timed ignition. It was to fire once the third stage had burnt out, at some precise number of minutes after T-0. Somehow, when the countdown clock was stopped, no one ever told the fourth stage, and when the designated time came it dutifully burst to life, straight into the thousand ton stack of explosive fuel in the stages below it. Some hold that the Soviet space program never recovered from the disaster. The carnage and loss of life was horrific, and though carefully shrouded from the public, the Russian scientists knew, and the effect on morale was devastating.

I decide to spare Arthur my gruesome vision, but to me the moral was that any go/no-go decision had to be an all-in proposition.

Parking turns out to be a non-trivial affair – a couple of times around the block before we find an underground garage of twisty turn passages, all alike. But by the time we’re riding the elevator up to the fourth floor (9:25, on the dot), we still haven’t heard from Michael. We take this as a good sign, mostly because there’s not a lot we can do right now if it isn’t.

The thing about Michael – the thing about all the best product managers at Google – is that he really cares. I don’t just mean that he cares about the product; that’s given. But he cares about everything, and everyone. As software engineer-turned Harvard MBA-turned product manager, he’s spent sweat on both sides of the divide. His boyish charm and easy smile are effective tools, but they’re earnest. He really wants to know what the UI folks think is best. He really wants the PR team to have their say. You get the feeling that he really truly likes everyone, and because of that, everyone seems willing to go just one step further to keep him happy.

Arthur and I find the corner where the infrastructure team sits, and corner Pete, their PM, for an update. He offers that the backup servers “seem to be holding”, which Arthur confirms with a few test queries. Our adrenaline settles down a notch as we scout a pair of empty desks around the corner and start to play Marco Polo on the phone with Michael. Arthur, the covert Zen monk, reminds me: time to start breathing again, eh?

The only member of the team still missing is Artem, our Russian. He’s young, he’s fast and he’s fearless. Sometimes he scares me. Started coding professionally at age 13, and as far as I can tell, hasn’t stopped for a breath since. I sometimes imagine him as James Bond’s technological, Russian counterpart. Or maybe McGyver’s. “We’ve got latency problems? I think it’s in the distributed memcache; lemme build an intermediate hot-caching layer that fronts it on a machine-by-machine basis. Yes, I know it’ll require recoding the storage layer representation. Yes, I know we’re launching on Monday. Hey, we’ve got an entire weekend ahead – we could write half of Vista in that time. Yes, it would probably be the bad half, but – look, just let me do it, okay? I’ll send you the code reviews.”

And somehow, it works. Time and again, I’ve seen his code go into security review and come out with no comment other than the proverbial gold star: LGTM (“looks good to me.”). If anything, the “additional comments” section will say something like “Very nicely done!”

So, Artem’s won everyone’s trust. The downside of this trust is that, when a grenade comes tumbling into the office, we’ve gotten into the habit of looking to him to throw himself on it (and defuse it, and turn it into an undocumented feature, with plenty of time to spare). So without Artem on hand, we’re missing our safety – we’re test pilots without a parachute.

Thirty minutes to go, and Artem finds us. He cruises in like it’s Friday morning, and his calm is almost unnerving – “Anyone know where the espresso machine is in this office? All I can find is the pre-made stuff.”

Michael’s briefed him by phone already. He picks a corner near the whiteboard, settles in, and flips open his laptop as if to catch up on email. I look over to Arthur, who reminds me (again) to keep breathing. Artem looks up and reads my mind: “Look, the code is up on the backup datacenter, we’ve run the tests, and everything looks good at this point. There’s nothing else we can do until it’s time to flip the switch. You guys eaten breakfast yet?”

I find myself wondering if I actually like to panic, and stifle the thought as soon as it surfaces. Artem is right. I breathe – a couple of times for good measure – then return to my temporary seat and try to focus on my backlog of email. None of it really needs to be answered today, but slogging through some of the bitwork and meeting requests is as good as anything for making the time pass, and is certainly more productive than hyperventilating.

Inexplicably, the next time I look at my watch, it’s 10:27. Holy crap – show time. Give or take an hour. Honestly, I need to keep reminding myself that the only real deadline is two hours away, at 12:30, when RJ starts showing thing off live to the assembled press. But Radhika and Andy, tech leads for the new apps we’re launching, give us the thumbs up, so there’s no reason not to launch. We poke our heads around the corner to where Pete and the AppEngine team are still dealing with the smoking aftermath of the datacenter crash. Pete rolls over to his monitor, flips a few virtual switches in DNS-land from his keyboard and we run back to our laptops to check. It’s all good so far.

10:30 – or something like it. One last round of checks with Michael, Arthur and Artem before flagging Pete again, who – almost anticlimactically – fires off another keyboard command for us. “There – you guys should be live now.”

Internal access? Check. External? Nothing. 404 – what the hell? Clear the DNS cache and refresh. 404 again. For real. The site’s just not there. Somewhere in my brainstem, that reptilian ball of Rambo neurons controlling our basic fight-or-flight reflexes kicks the door down and says “Oh yeah – that’s what I’m talkin’ about!”

I’m back at Pete’s office so fast I swear I can still see my shadow back at the cubes. “You’re sure you flipped the switch to take us live?” He looks back at me as though I’d asked him whether he was sure he was Pete. “Yeah – absolutely.” And I believe him, because he does this stuff for a living, and we’re just one more app out of thousands, absolutely routine stuff.

I relay confirmation back to the cubes, where Arthur and Michael are conferring. Even Artem looks worried now, which scares me more than anything else.

So what next? Packet traces? We’ve got no idea how the hell to do something like that without help from the App Engine team, and they’ve got their hands full with more important stuff at the moment. I stifle the urge to ask Pete whether he’s sure that he’s sure, and remind myself, yet again, to do some of that “breathing” stuff. Arthur’s right (as he – maddeningly – always is), it helps me focus again.

We’re all at our screens now. Artem’s trying packets, Arthur’s looking at the logs, and I’ve got my head in the code. I don’t have much hope that I’ll be much use, but scrolling through the sequence of lines fired when a request arrives gives me something to do. It’s sort of a rosary, I guess, for coders: receive request, parse headers, dispatch database query… Artem calls out from across the cube: “Requests are making it to the server – why the hell are we getting 404’s?” He’s not asking anyone in particular – or maybe he’s asking himself. Arthur’s unflappable, as always, his voice as flat as that of Spock, acknowledging the impossible: “Confirmed – we’re logging the requests.”

I circle back through the request processing code to where the logging statements are. “What’s the log message?”

“Just that the GET was received.”

“Anything else after that?

“Nothing.”

I loop back through the code, but it’s now scrolling past me, Matrix-like, in a cascade of indecipherable symbols. I have a vague sense that I’m panicking, and breathing doesn’t help.

Arthur looks up, joins me on the code and takes less than a minute to find it: we – no, I – had left a failsafe in place. Just in case the server were accidentally exposed to the outside, it did its own check where the request was coming from. If it looked like an external IP address, the server pretended it wasn’t there, faking an “address not found” response.

I flip the failsafe flag, reboot, and hit the server again. The shiny new “Google Labs” home page fills my screen. “We have liftoff!” – I kind of shriek it.

“It’s live?” – Arthur’s voice is measured.

Artem’s not waiting for confirmation: “I’ve got it here, too. Woo hoooo!” I think he starts doing some kind of dance, but I’m not paying attention anymore. I click through a couple of the links, confirming that everything’s live.

Something else kicks into our collective bloodstreams – endorphins? – blending with the adrenaline, transforming the panic of 15 seconds ago into a Hell-yeah-bring-it-on machismo. We’re pumped now, ready to face the world: we have launched.

…

Mind you, at this point, there are exactly three people in the world who know that Google Labs has launched. Five, if you count Pete and Radhika, but they’ve got other things to worry about. Until someone stumbles across the improbable URL http://www.googlelabs.com, we’re invisible.

The press conference is still ahead, with a dozen plus reporters from Tech Crunch, the New York Times, Wired and the like breathlessly typing away, “I’m here live-blogging the much anticipated Google Labs launch…” while RJ eloquently explaining how Google understands that “innovation is a conversation, not a one-way street.” And when it does come, Arthur, Artem, Michael and I are at the back of the conference room, as invisible and as much a part of the furniture as the stackable flexichairs we’re sitting in.

Artem’s got a window up on Google News, hitting “refresh” as the updates come fast and furious. RJ is in the groove – he’s got the press eating from his hand, and we know – we just know – that tomorrow’s headlines are going to be glorious (“Google News Timeline: A Glorious, Intriguing Time Sink”). I do find myself wishing that RJ will, at some point, direct his audience’s attention back to where we’re sitting, wishing that he’ll make passing reference of thanks to the engineers who worked, mostly in their spare time, for over a year to conceive of and bring to fruition this new way to let engineers launch products and bring users into the process.

I find myself wishing there was a way for him to tell the story of how improbably it all came together, starting with a Friday night email appeal blitzed out to eng-misc almost two years ago, asking for help and promising a “Google Labs” t-shirt for anyone who helped. Which reminds me: I’ve got to get some t-shirts printed up…

	Gabriella Sannino (@… on Google Dance!
	david pablo cohn on Pittsburgh
	Andrea H. on Pittsburgh
	An Unreasonable Idea… on Google Labs 2.0: the Laun…
	Edward M. Sills III on Google Labs 2.0: the Laun…

	Gabriella Sannino (@… on Google Dance!
	david pablo cohn on Pittsburgh
	Andrea H. on Pittsburgh
	An Unreasonable Idea… on Google Labs 2.0: the Laun…
	Edward M. Sills III on Google Labs 2.0: the Laun…