Many have proclaimed outrage over last week's AWS outage, but the blame lies with our own poor planning. PHOTO: Kristina Flour

In our Twitter-fueled, Facebook-fed society, it's very easy to get caught up in hype. 

So after AWS’s outage last week, it was amazing to see all of the anti-cloud pundits coming out of the woodwork to hold AWS responsible for taking down a majority of the internet. 

Pundits said there were too many monopolies. Some global equities firms immediately predicted a 2 percent negative impact on 1Q 2017 AWS revenues. 

What? Let's put this in context.

Who’s to Blame?

The AWS S3 outage happened in a single AWS region, US-EAST-1 (Northern Virginia). While this is a large region, AWS operates more than a dozen others, and all of them were operating normally during the outage. 

AWS S3 is designed to deliver 99.99 percent availability and to scale past trillions of objects worldwide. Last week’s outage illustrated that one-in-ten-thousand chance of non-availability. The world is not built for 100 percent availability, nor should any company believe that the public cloud automatically provides it.
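To put that 99.99 percent figure in everyday terms, here is a rough back-of-the-envelope sketch. It assumes the target is read as a yearly time budget (an interpretation chosen for illustration, not AWS's own accounting), and the function name is made up for this example.

```python
# Convert an availability target into a yearly downtime budget.
# Assumption: availability is measured over a 365-day year.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR * 60


budget = allowed_downtime_minutes(99.99)
outage_minutes = 4 * 60  # the roughly four-hour outage discussed in this article

print(f"Yearly budget at 99.99 percent: {budget:.1f} minutes")
print(f"A four-hour outage is {outage_minutes / budget:.1f} times that budget")
```

Under that reading, a single location can use up more than a year's worth of budget in one afternoon, which is exactly why no one should treat a single location as always available.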

There is no denying that many businesses rely on AWS S3. According to market research firms, S3 is used by nearly 150,000 sites, 120,000 unique domains and has almost 4 trillion pieces of data stored in it. It powers big brand sites like Netflix, Adobe, Spotify, Pinterest, Trello, IFTTT and Buzzfeed, as well as tens of thousands of smaller sites.

But who’s to blame here? Did anyone consider why Amazon’s ecommerce site didn’t go down? Was it because Amazon gets preferential treatment, or was it because it built its site with redundancy in mind? 

The reason is relatively simple: its sites are spread out across a number of geographic zones so an outage in one area doesn't mean the whole site goes down. If your SaaS provider or application architecture does not provide for redundancy, it’s not really Amazon’s fault. This is common sense. 
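What that kind of redundancy looks like in application code can be sketched in a few lines. This is a minimal illustration, not how Amazon actually builds its site: the `fetch_with_failover` helper, the `RegionUnavailable` exception and the two stand-in fetchers are all hypothetical, and the region labels are used only as example names.

```python
from typing import Callable, List, Sequence, Tuple


class RegionUnavailable(Exception):
    """Raised by a fetcher when its region cannot serve the request."""


# A fetcher takes an object key and returns its bytes (hypothetical signature).
Fetcher = Callable[[str], bytes]


def fetch_with_failover(key: str, replicas: Sequence[Tuple[str, Fetcher]]) -> Tuple[str, bytes]:
    """Try each replica in order and return (region, data) from the first that answers."""
    failures: List[str] = []
    for region, fetch in replicas:
        try:
            return region, fetch(key)
        except RegionUnavailable as exc:
            failures.append(f"{region}: {exc}")
    raise RegionUnavailable("all replicas failed: " + "; ".join(failures))


# Stand-in fetchers: the first simulates an outage, the second serves a copy.
def primary_fetch(key: str) -> bytes:
    raise RegionUnavailable("simulated outage")


def secondary_fetch(key: str) -> bytes:
    return b"copy of " + key.encode()


served_by, data = fetch_with_failover(
    "logo.png",
    [("us-east-1", primary_fetch), ("us-west-2", secondary_fetch)],
)
print(f"{served_by} served {len(data)} bytes")
```

The point is the ordering, not the plumbing: the application only fails when every copy is unreachable. In a real system the fetchers would wrap actual storage clients, and the data would have to be replicated to the second location ahead of time.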

This outage is an indictment, not of AWS, but of business and IT decision makers. As many companies moved their IT applications to the public cloud, cost was a major driver. Unfortunately, that focus led decision makers to abandon traditional thinking about architectures that provide redundant data services. 
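To make the stakes of that trade-off concrete, here is a rough calculation of what a second or third copy buys. It rests on a big assumption, flagged in the code as well: that failures in each location are independent, which real outages do not always respect. The function name and the 99.99 percent starting point are chosen for illustration.

```python
def combined_availability(per_location: float, locations: int) -> float:
    """Chance that at least one of several copies is up.

    Assumption: each location fails independently of the others.
    """
    return 1 - (1 - per_location) ** locations


for n in (1, 2, 3):
    downtime_share = 1 - combined_availability(0.9999, n)
    print(f"{n} location(s): unavailable about {downtime_share:.2e} of the time")
```

Even if the independence assumption is only roughly true, each added location cuts the expected unavailability by orders of magnitude. That is the benefit companies gave up when cost alone drove the design.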

AWS Is Not Alone

Let’s put the four-hour AWS S3 outage in perspective by comparing it with other major outages we have seen over the past year. 

In September 2016, Microsoft Azure suffered a serious outage due to a spike in network traffic that caused DNS issues resulting in several regions being unavailable. Despite that being the second multi-region outage in a week (Europe was hit the week before), it barely made headlines. 

Google’s cloud services were down in April, affecting its Compute Engine instances and VPN services in all of its regions. For damage control, it offered customers a 10 percent discount on their monthly compute charges.  

Last March, some Salesforce customers in Europe had to cope with a CRM disruption for up to 10 hours caused by a storage problem across an instance on that continent. And in May, a Salesforce outage wiped out four hours of customer data that took days to fully remediate. 

In January 2016, a power outage at a Verizon data center impacted JetBlue Airways operations, delaying flights and sending many passengers scrambling to rebook. The Verizon data center outage impacted customer support systems, including jetblue.com, mobile apps, a toll-free phone number and check-in and airport counter/gate systems.

Our beloved Twitter was down for eight hours in the same month after faulty code was deployed, taking down the Twitter website and mobile apps. Security services expert Symantec experienced a 24-hour outage in April 2016, caused by a database update error, that prevented Symantec clients from administering email and web security services.  

And Apple’s outage in June 2016 took some of the tech giant's popular services offline, including iCloud backup, the App Store and iTunes. 

The Cost of Downtime

Downtime is no joking matter. The estimated cost of downtime to US businesses in 2016 reached $700 billion. Fortune 1000 companies reported between $1.25 billion and $2.5 billion in estimated business losses. The cost of downtime has increased 38 percent since 2010, according to a recent study by Ponemon Institute. 

But outages are not caused by cloud service providers like AWS being unprepared for enterprise availability. Cybercrime is the fastest-growing cause of data center outages, rising from 2 percent of outages in 2010 to 22 percent. Uninterruptible power supply (UPS) failure continues to be the number one cause of unplanned data center outages, accounting for one-quarter of all such events.

IT equipment malfunction accounted for only 4 percent. Water, heat or air conditioning failure accounted for 11 percent of outages, followed by weather at 10 percent and generator failure at 6 percent.

Many of the above failures actually occur in private, on-premises data centers, not at cloud service providers. 

But in the end, human error still accounts for a majority of all outages. The Feb. 28 AWS outage has been attributed to one of its employees debugging an issue with the billing system and accidentally taking more servers offline than intended. That error started a domino effect that took down many other server subsystems.

Don’t Put Your Head in the Clouds 

The public cloud is a fantastic platform offering very compelling cost, elasticity and availability benefits. 

But like any infrastructure, it’s not immune to failure. You need to leverage it the right way. 

The need to build resiliency into application architectures does not disappear when you use the public cloud. If you don’t build it in, you're likely to lose control of your business when the next outage happens.