The Google Incident

Saturday, October 24, 2009 - Jerry Sievert

Starting a new job is always difficult: coming up to speed on essential projects, finding your niche, even remembering everyone’s names. Then there's the added challenge of starting a job at a partner of Google. Backing up a bit, my new employer had been approached by Google to be both a data provider and a partner in their new Social App section of iGoogle.

It’s never easy to step into a project that is considered 90% complete, especially when it is an example of quickly churned-out anti-patterns, spearheaded by an ex-employee trying to make their way out the door. As your application goes live and Google's servers start pounding on your network and infrastructure, you quickly learn what it is to be scalable.

Sadly, this was not the case. When testing cannot be 100% complete and infrastructure is oftentimes opaque, it’s very easy to make assumptions and hope that they pan out under fire. Unfortunately, not every assumption is correct - like the assumption that your caching is working.

Caching is a double-edged sword, especially when you throw a MySQL database into the mix and use it as a document storage engine. What could be a textbook example of where to use CouchDB or MongoDB can suddenly turn into a nightmare of unmaintainability, especially when you have a Google project manager emailing you every morning asking for status updates.
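The post never shows the schema, but the anti-pattern it describes usually boils down to serializing whole documents into a single text column of a relational table. Here is a minimal sketch of that shape in Python, using sqlite3 purely as a stand-in for the MySQL instance, with hypothetical table and field names:

```python
import json
import sqlite3

# Stand-in for the MySQL instance the post describes; the table and
# column names here are hypothetical, for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, body TEXT)")

def save_document(doc_id, doc):
    # The entire document is serialized into one text column, so the
    # database cannot index or query individual fields inside it.
    db.execute(
        "REPLACE INTO documents (id, body) VALUES (?, ?)",
        (doc_id, json.dumps(doc)),
    )

def load_document(doc_id):
    row = db.execute(
        "SELECT body FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None

save_document("activity:42", {"user": "jerry", "action": "posted"})
print(load_document("activity:42"))
```

Every read pays to deserialize the whole blob, and the database can't look inside the document at all, which is exactly the gap document stores like CouchDB and MongoDB were built to fill.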

Enter debugging and pair programming. When in doubt, find someone smart and sequester them with you. I did just this. Slowly but surely, anti-patterns started to give way to patterns; bugs and unimplemented features slowly started to disappear. Caching suddenly became viable. Traffic to the database servers dropped from 80 Mbit/s down to 3 Mbit/s. Cache hits started approaching 99%, which for an ever-changing social media tool was amazing. Finally, you have a sustainable social application.
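The post doesn't describe the caching layer itself, but the pattern that takes database traffic from 80 Mbit/s down to 3 Mbit/s with a ~99% hit rate is typically a read-through cache in front of the document store. A minimal sketch, with an in-memory dictionary standing in for whatever memcached-style layer the application actually used, and hypothetical key names and TTLs:

```python
import time

# Read-through cache sketch; the cache, TTL, and key format are
# hypothetical stand-ins for the application's real caching layer.
CACHE = {}          # key -> (expires_at, value)
CACHE_TTL = 300     # seconds

def fetch_from_database(key):
    # Placeholder for the expensive document lookup against MySQL.
    return {"key": key, "fetched_at": time.time()}

def cached_fetch(key):
    entry = CACHE.get(key)
    if entry and entry[0] > time.time():
        return entry[1]                     # cache hit: no database traffic
    value = fetch_from_database(key)        # cache miss: go to the database
    CACHE[key] = (time.time() + CACHE_TTL, value)
    return value

cached_fetch("profile:42")   # miss, hits the database
cached_fetch("profile:42")   # hit, served from memory
```

On a hit the request never touches the database; the trade-off is picking a TTL short enough that an ever-changing social feed still looks fresh.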

Until that email shows up again, saying that everything isn’t quite right. You are still missing some core features, reality doesn’t quite match the documentation, and it’s still possible to trigger an obscure bug that you swear works just fine in your application stack. That surge of adrenaline hits again, and you’re off and running. When in doubt, assume it’s the application. But if hours of testing repeatedly show the application working, go ask questions. Sometimes you find out that on the deployment side there’s an additional layer of caching that you didn’t know about.

It takes a lot of effort to convince an operations group to put a caching exception into place, especially given both the colossal surge of traffic that occurs and the lack of track record any new application has (not to mention the application-side caching issues described above). I was lucky: I had an in. I had worked with the head of operations before, and I was willing to put my own reputation and sanity on the line to back up the promise that everything would go smoothly. He took the bait (sucker). The application went live without the additional page caching that normally occurs. It was then up to the code, caching, and databases to hold their own.
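The post doesn't say how the exception was actually configured, but one common arrangement is for the application to mark its own responses as uncacheable so an ops-side page cache passes them straight through. A rough sketch as a tiny WSGI app, with hypothetical paths and header choices:

```python
from wsgiref.simple_server import make_server

def app(environ, start_response):
    # Hypothetical sketch: flag gadget responses as uncacheable so an
    # upstream page cache (the ops-side layer) skips them entirely.
    headers = [("Content-Type", "application/json")]
    if environ.get("PATH_INFO", "").startswith("/gadget/"):
        headers.append(("Cache-Control", "private, no-cache, must-revalidate"))
    start_response("200 OK", headers)
    return [b'{"status": "ok"}']

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```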

A note back from Google: this bug is fixed, but we found one more issue we’d like resolved. Another 100 lines of code deleted and replaced with some simple logic, and one last-ditch effort to get Google to give the application a pass. Finally, that email arrives: “This feed seems good now.” Small words, but big elation.