We had a plan last week. We were going to publish a story on CNN.com with a video we made, about the hackerspace NYC Resistor. Because we are savvy social media gurus, we were also going to do an IAmA on Reddit with some of the Resistor crew. We were going to reap the rewards of thousands of hits, and then treat ourselves to some funny YouTube videos, maybe a little bit of longform reading.
But then, sometime around 3 in the morning on Thursday, everything broke. Specifically, an Amazon Elastic Compute Cloud (EC2) server broke. That took down Mobo’s content, leaving our readers with a “SORRY SOMETHING WENT WRONG” greeting page.
And nothing else seemed to work. If I had gone to Quora to ask the Internet about it, I would have been met by another apology. The trusted advisors I could possibly have had on GroupMe were nowhere to be found. Even my listless attempts to participate in the real-life gaming layer of Scvngr would have fallen flat in the face of server troubles. Couldn’t check in to “My Website is Down” on Foursquare, because if there was such a place, it had disappeared with Amazon’s server. Hootsuite was affected too, and while we don’t use that to send Tweets, we suddenly felt like we had been struck by an Internet curse. I was convinced of it when I discovered that Reddit had gone into read-only mode.
I wrote to Amazon hoping for some precious clues as to what had precipitated what the EC2 status page called “a networking event.”
A networking event. I imagined server administrators passing around wine and cheese amid blinking bookshelves made of custom-built servers and miles of ethernet cables, while a hellstorm rained bodies on the city outside, explaining it all away with every concocted excuse and euphemism (“Your cache is full,” “There was an error”) to be found in an “IT for Dummies” book.
But nary a word from Amazon. And then on Thursday evening, an administrator showed up on the company’s blog and added some details to the story:
We’d like to provide additional color on what we’re working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
“Additional color”: as if he were writing a story for the society pages. It occurred to me that “we will continue to provide updates when we have them” has probably been a stock phrase of IT administrators since antiquity, passed down like a shibboleth.
I wondered what lessons would be gleaned from this whole experience. The media, I suspected, would gang up on the cloud, because it’s new, and revolutionary, and people don’t understand it. And it’s called “the cloud,” which seems like an easy enough target.
But I already felt bad for the cloud, that nebulous and dumb network of servers, satisfying our ceaseless appetite for bits and in exchange hogging up lots of our dirty energy, dumbly. (See Greenpeace’s new ‘dirty data’ report; pdf) Ultimately, I knew, this was bigger than the cloud, which, let’s face it, isn’t going anywhere. “This will be a bump in the road for the cloud, nothing more,” said Jesse Knight, our server guru. “If anything, it gives some other networks some room to compete more effectively.”
The real problem is bigger than the cloud, I realized. It’s about the brain. The human cloud. How much do we depend on our most important tools, what do we expect from them, and how do we prepare to cope when they break? It’s a planning-ahead problem, one as ancient as those IT admins.
Of course Amazon’s system includes redundancy, that most important asset, so that if one Availability Zone fails, another can step in seamlessly. Without redundancy, we’re always susceptible to failures. And as dependable as a big server farm run by Amazon may be — they’re certainly no amateurs when it comes to serving up data, lots of data — failures just happen. In this case, trouble in one of a number of redundant zones in northern Virginia took out the other supposedly redundant zones; our best solution would have been to keep our own backup, rather than relying on Amazon’s. But that would have been costly, and given the rarity of these kinds of failures, not worth it.
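The planning-ahead idea has a familiar engineering shape: if one copy of your content is unreachable, fall over to another. A minimal sketch in Python, with invented zone labels and a toy fetch function standing in for real network calls (actual failover involves DNS, replication, and much more):

```python
def fetch_with_failover(endpoints, fetch):
    """Try each endpoint in order; return the first successful result.

    `fetch` is any callable that takes an endpoint and either returns
    data or raises an exception (e.g. a network error).
    """
    errors = []
    for endpoint in endpoints:
        try:
            return fetch(endpoint)
        except Exception as exc:
            # Remember the failure and move on to the next redundant copy.
            errors.append((endpoint, exc))
    raise RuntimeError(f"all endpoints failed: {errors}")


# Toy demonstration: the "primary" zone is down, the backup answers.
# The zone names are illustrative only, not real infrastructure.
def toy_fetch(endpoint):
    if endpoint == "us-east-1a":
        raise ConnectionError("networking event")
    return f"content served from {endpoint}"


print(fetch_with_failover(["us-east-1a", "us-east-1b"], toy_fetch))
```

The catch, as Thursday showed, is that this only helps when the backup copy truly fails independently of the primary.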
Then again, in complex systems, as in life, the smallest error can, by the mad and beautiful logic of chaos and emergence, become simply catastrophic.
Or not. Sometimes the smallest error simply reminds you of what’s important, and provides an excuse to go outside and look up at the clouds, the actual clouds, if you ever needed one.