Amazon Web Services (AWS) suffered an outage late Thursday, June 14. This was the first major outage since last year’s highly publicized incident. Tongues are wagging, bloggers are writing and everyone is wondering how the outage will impact Amazon’s efforts to make inroads into the enterprise. Is Amazon receiving unfair criticism and attention for its most recent availability problems?
What Happened with AWS
Thursday at 8:50 p.m. PDT, AWS updated its cloud status to reflect a power outage in its Virginia data center. The problem was reportedly resolved at 3:26 a.m. PDT. This was the same data center that took down several high-profile sites when it failed in April last year.
The Virginia data center also had performance issues in December, when Amazon rebooted thousands of instances. Thursday’s outage affected several popular sites, including Quora, Pinterest, Dropbox and Heroku, Salesforce’s platform-as-a-service company.
Only days after the latest outage, Amazon released a statement identifying the source of the problem: a power outage. Amazon further explained:
“At approximately 8:44PM PDT, there was a cable fault in the high voltage Utility power distribution system. Two Utility substations that feed the impacted Availability Zone went offline, causing the entire Availability Zone to fail over to generator power. All EC2 instances and EBS volumes successfully transferred to back-up generator power.”
The power went out and the servers switched over to the backup power source. Everything was working as expected: the system failed gracefully and activated the backup system. Then things took a bad turn. Nine minutes after the failure, one of the generators overheated because of a defective cooling fan.
When the generator overheated, the servers using the generator attempted to connect with the secondary backup power supply. Unfortunately, one of the breakers on the secondary backup power supply was incorrectly configured and failed, leaving the servers without power. Three strikes and part of AWS was out.
Is the AWS Criticism Justified?
Almost everyone agrees that when AWS is up, it works great. The company promises 99.95 percent uptime, which equates to roughly 22 minutes of downtime in a 30-day month, and AWS exceeds these numbers most months.
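The downtime an uptime promise allows is simple arithmetic, and it is easy to check. The short Python sketch below is only an illustration: the function name and the 30-day month are assumptions made here, not part of any AWS tool or of the way AWS measures its own service commitment.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float = 30) -> float:
    """Minutes of downtime permitted in a period at the given uptime percentage.

    Hypothetical helper for illustration; period_days defaults to an
    assumed 30-day month.
    """
    minutes_in_period = period_days * 24 * 60
    return (100 - uptime_percent) / 100 * minutes_in_period


monthly = allowed_downtime_minutes(99.95)        # assumed 30-day month
yearly = allowed_downtime_minutes(99.95, 365)    # one 365-day year

print(f"{monthly:.1f} minutes per 30-day month")
print(f"{yearly / 60:.1f} hours per year")
```

At 99.95 percent, this works out to about 21.6 minutes in a 30-day month, or about 4.4 hours over a year.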
The company has been criticized for its poor communication, but has corrected the behavior in recent months. On the same day as the outage, Amazon announced it was offering all customers 24-hour access to customer support at no cost and lowering the price for some categories of premium support.
In addition, AWS added a system to monitor services and customer usage patterns, which it uses to send alerts to inform users how they can save money, improve performance or avoid security problems.
Despite the respectable service record and recent enhancements, critics were quick to reprimand AWS for poor testing of its failover systems. Others questioned whether the outage would shake business confidence in the public cloud. Is this criticism fair?
In my opinion, the scrutiny of AWS outages far exceeds their severity. Every computer system will eventually fail, even if there are layers of backup plans, and computers power the cloud. After the April outage, AWS gave customers the ability to distribute their workloads geographically, and many users did. However, some users continued to rely on a single geographic location, ignoring the possibility of a site outage. When you take risks, there are bound to be consequences one day. For many AWS users, Thursday was that day.