Motherboard

For Google's Cloud, 18 Minutes of Downtime Is an Eternity

Google's engineering VP offers a post-mortem analysis of Monday night's outage.

by Michael Byrne
Apr 14 2016, 9:00am

Image: Arthur Caranta/Flickr

On Monday night, Google's Compute Engine went dark for 18 minutes. The Engine itself was whirring away just fine, but incoming internet traffic to its servers was misrouted, breaking users' connections and preventing them from reconnecting. If you're not actually sure what the Compute Engine is or who it serves, that may seem like no big deal. To Google, however, a Cloud outage, whether it's rooted in connectivity issues or somewhere deeper, is dire as hell.

"We recognize the severity of this outage, and we apologize to all of our customers for allowing it to occur," a Google post-mortem analysis published on Tuesday begins. "As of this writing, the root cause of the outage is fully understood and GCE is not at risk of a recurrence... Our engineering teams will be working over the next several weeks on a broad array of prevention, detection and mitigation systems intended to add additional defense in depth to our existing production safeguards."


So, what is the Google Compute Engine, exactly? For starters, it runs on the same infrastructure that makes Google itself tick, the machinery behind Gmail, YouTube, and other services. Its utility as a Google product is in providing high-performance, distributed virtualized computers to users on demand. GCE tasks might range from video rendering to machine learning to really anything that requires crunching truly epic amounts of data.

It's an incredible idea, really: Virtually unlimited (or endlessly scalable) computing power is right there at this moment just waiting for anyone to use, for anything. It's also a very literal interpretation of cloud computing itself, i.e. beaming deep computing problems up to some amorphous blob of Google hardware and then, with startling quickness, having results beamed back down.

As a very literal manifestation of the cloud, the GCE's weakness is the cloud's weakness: maintaining connections. For the cloud illusion to persist, and with it the cloud's usefulness, the cloud must be nigh indistinguishable from a local, terrestrial computer doing the same work. If your computers live in Google's cloud, then downtime is akin to suddenly not having a computer at all, which is bad.

The failure on Monday had to do with how the GCE tells the rest of the internet where to find it. This is accomplished by announcing blocks of IP addresses (and so GCE users) using the Border Gateway Protocol (BGP).

"To maximize service performance, Google's networking systems announce the same IP blocks from several different locations in our network, so that users can take the shortest available path through the internet to reach their Google service," Google engineering VP Benjamin Treynor Sloss explains in the post-mortem.

"This approach also enhances reliability; if a user is unable to reach one location announcing an IP block due to an internet failure between the user and Google, this approach will send the user to the next-closest point of announcement," Sloss continues. "This is part of the internet's fabled ability to 'route around' problems, and it masks or avoids numerous localized outages every week as individual systems in the internet have temporary problems."
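The behavior Sloss describes, many sites announcing the same IP block with each user routed to the nearest reachable one, can be sketched in miniature. Everything below is illustrative: the site names, hop counts, and data structures are invented and are not Google's actual routing internals.

```python
# Rough sketch of anycast-style route selection: several sites announce
# the same IP prefix, and traffic flows to the nearest reachable one.
# Site names and hop counts are invented for illustration.

def best_site(sites):
    """Pick the reachable announcement point with the fewest hops."""
    reachable = [s for s in sites if s["reachable"]]
    if not reachable:
        return None  # no site is announcing the prefix: total blackout
    return min(reachable, key=lambda s: s["hops"])

sites = [
    {"name": "us-east", "hops": 3,  "reachable": True},
    {"name": "eu-west", "hops": 7,  "reachable": True},
    {"name": "asia",    "hops": 12, "reachable": True},
]

print(best_site(sites)["name"])  # nearest site wins: us-east

# If the nearest site drops out, traffic "routes around" to the next one.
sites[0]["reachable"] = False
print(best_site(sites)["name"])  # eu-west

# If every site stops announcing, as happened Monday, nothing is left.
for s in sites:
    s["reachable"] = False
print(best_site(sites))  # None
```

The last case is the key one: the internet's route-around trick only works while at least one location is still announcing the block.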

On Monday night, Google engineers were swapping out one IP block configuration for a new one, a routine process. On this occasion, however, a timing glitch meant the IP configuration files were updated at different times, which triggered a failsafe: the system rejected the new IP blocks and attempted to revert to the old ones.
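The triggering event, as described, was a race: pieces of the configuration were written at different moments, a consistency check saw the mismatch, and the system fell back to the last known-good state. Here's a toy version of that kind of check; the version fields and structure are invented for illustration, not Google's actual format.

```python
# Toy version of the failsafe described above: if the pieces of an IP
# configuration were written at different times, their version stamps
# disagree, and the system rejects the new config in favor of the old one.
# Field names and structure are invented for illustration.

def choose_config(new_parts, old_config):
    versions = {part["version"] for part in new_parts}
    if len(versions) != 1:
        # Timing glitch: parts were updated at different moments.
        return old_config  # failsafe: revert to the last known-good config
    return {"version": versions.pop(),
            "blocks": [b for part in new_parts for b in part["blocks"]]}

old = {"version": 41, "blocks": ["10.0.0.0/8"]}
consistent = [{"version": 42, "blocks": ["10.0.0.0/8"]},
              {"version": 42, "blocks": ["172.16.0.0/12"]}]
torn = [{"version": 42, "blocks": ["10.0.0.0/8"]},
        {"version": 41, "blocks": ["172.16.0.0/12"]}]

print(choose_config(consistent, old)["version"])  # 42: new config accepted
print(choose_config(torn, old)["version"])        # 41: reverted to old
```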

That would have been fine, but at this point a previously unknown bug surfaced. This new bug forced the system to pivot again: it ignored the old configurations and pushed out the new, still-incomplete IP blocks. This shouldn't have happened, because the system implements what's known as a canary step, in which new configurations are first pushed to only one location; if errors are detected there, the push across the entire vast network of server infrastructure should be cancelled. The canary step did catch the problem, but because of yet another bug, its error message never got to where it needed to go.
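The canary pattern Google describes can be sketched in a few lines: push the new config to one site first, and only fan out if that site validates it. This is a generic illustration of the technique, with invented names and a stand-in validation check, not Google's actual tooling.

```python
# Illustrative canary rollout: push a new config to one location first;
# if validation fails there, the global push should be cancelled.
# In Monday's incident, the canary *did* fail, but a second bug kept the
# failure signal from reaching the rollout system.

def config_is_complete(config, expected_blocks):
    """Stand-in validation: does the config carry every expected IP block?"""
    return expected_blocks.issubset(config)

def rollout(new_config, expected_blocks, sites):
    canary, rest = sites[0], sites[1:]
    # Step 1: push to the canary site only and validate there.
    if not config_is_complete(new_config, expected_blocks):
        return f"canary {canary} rejected config; rollout cancelled"
    # Step 2: the canary validated the config, so fan out everywhere else.
    return f"pushed to {canary} and {len(rest)} other sites"

expected = {"10.0.0.0/8", "172.16.0.0/12"}
sites = ["us-east", "eu-west", "asia"]

print(rollout({"10.0.0.0/8", "172.16.0.0/12"}, expected, sites))
print(rollout({"10.0.0.0/8"}, expected, sites))  # incomplete config
```

The whole design hinges on step 2 being reachable only after step 1 reports success, which is exactly the link the second bug severed.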

As the new IP blocks reached their destinations, more fault-detection triggers started going off, which forced the endpoints to stop announcing IP information to the internet at all. The result: blackout.

It took an error and two bugs to shut the GCE down, which is a helluva unlikely confluence of events. But 18 minutes is a long time when you're leasing out high-performance distributed computing systems on a per-minute basis.

It so happens I was doing my own Google cloud wrenching Monday night with a related but unaffected service called the Google App Engine, which is used to develop and deploy rich, ultra-scalable web applications. The GAE is likewise an amazing platform that I'm pretty sure will wind up scrambling a whole lot of what we assume to be true about the internet and computing in general. But, like its cloud sibling, that potential depends on a futzy old thing originally meant to serve text files to scientists.

Anyhow, 18 minutes of downtime didn't break the internet, and Google is offering some healthy discounts (the cloud ain't free) to those affected. In order to even get to the point of being able to break the internet, the cloud, whether it's Google's or Amazon's or whoever's, needs to prove itself bulletproof. That's more reasonable than it may sound, but we obviously still have a ways to go.