See, what happened was… The internet and traditional media alike have been fixated on the most talked-about technology outage since the airline reservation system crashed a decade ago: cloud failure. Now that Amazon's EC2 service is back up and an apology has been issued, let the hindsight discussions begin. Here's a quick round-up of the mea culpa analysis, including reimbursement details.

Amazon is Really, Really Sorry

Amazon has published a post-mortem and apology explaining why EC2 failed. The apology note, which is several thousand words long and includes enough techno talk to satisfy a 101-level CS course, is the first of what will surely be many steps to restore confidence in the cloud provider's services.

The steps Amazon says it will take to reduce the risk of another outage include:

  • Give customers the ability to take advantage of multiple Availability Zones
  • Make it easier to deploy a service across multiple Availability Zones, which even Amazon says is too hard
  • Improve client communications

The company is also putting its money where its mouth is. Amazon will reimburse impacted customers with a 10-day credit equal to 100% of their usage of EBS Volumes, EC2 Instances and RDS database instances that were running in the affected Availability Zone.

An apology and a little cash back may not be enough for some. In addition to causing outages, the failure may have caused some permanent data loss. In the current environment, where big data is king and as valuable as a precious metal, this is very bad news for Amazon.

Wake-up Call: The Cloud Is NOT Magic

Not all organizations that leveraged Amazon’s platform fell from the clouds with the internet giant. Tiny cloud directory services startup Okta, founded by former Salesforce.com head of engineering Todd McKinnon, did not go down. Why? According to McKinnon,

“We had a fail over to a backup system and we ran on that during the outage. We span multiple availability zones and regions.”

The knowledge that all infrastructure is ultimately brittle is surely a lesson learned from his days at Salesforce.com. Engine Yard, a Ruby on Rails PaaS provider, used a similar strategy to avert failure. So did Netflix. While not unscathed, SaaS Web CMS provider Acquia had measures in place for addressing a Drupal Gardens or related failure.
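For readers who want to see the idea in miniature, here is a rough sketch of the "span multiple zones and regions" approach. It is not how Okta, Engine Yard, or Netflix actually implemented failover; the endpoint names, the `call_with_failover` helper, and the simulated outage in `fake_send` are all made up for illustration.

```python
from typing import Callable, Sequence


class AllEndpointsFailed(Exception):
    """Raised when no endpoint in the list could serve the request."""


def call_with_failover(endpoints: Sequence[str], send: Callable[[str], str]) -> str:
    """Try each endpoint in priority order and return the first successful reply.

    `endpoints` is ordered from primary to last-resort backup; `send` is any
    function that performs the request and raises an exception on failure.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append((endpoint, exc))
    raise AllEndpointsFailed(errors)


# Hypothetical endpoints: two zones in one region, then a second region.
ENDPOINTS = [
    "app.zone-a.region-1.example.com",
    "app.zone-b.region-1.example.com",
    "app.zone-a.region-2.example.com",
]


def fake_send(endpoint: str) -> str:
    """Stand-in for a network call: simulates all of region-1 being down."""
    if "region-1" in endpoint:
        raise ConnectionError(f"{endpoint} unreachable")
    return f"served by {endpoint}"


if __name__ == "__main__":
    # Both region-1 zones fail, so the request lands in region-2.
    print(call_with_failover(ENDPOINTS, fake_send))
```

The point of the ordering is that a second zone in the same region is tried first, and a different region is the fallback when the whole region is unavailable. Production setups usually push this decision into DNS or a load balancer rather than application code.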

It is Amazon’s fault that their service went down. Customers are not to blame for the failure. Ultimately, however, customers must take responsibility for their own business operations and clients.

Organizations enamored with the cloud must understand that IT systems fail, even fancy ones in the cloud, and take appropriate steps to ensure non-functional requirements such as availability are still met if a piece of the infrastructure fails catastrophically. This does not necessarily mean a super expensive, fully redundant implementation, but technology leaders should evaluate each system, determine which components must be constantly available and implement failover solutions commensurate with each system's business importance.
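One way to picture that evaluation is as a simple tiering exercise. The sketch below is only an illustration: the tier names, the downtime thresholds, and the strategy attached to each tier are assumptions, not recommendations from Amazon or anyone quoted here.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CRITICAL = "critical"
    IMPORTANT = "important"
    DEFERRABLE = "deferrable"


# Illustrative mapping only: each organization would choose its own strategies.
STRATEGY = {
    Tier.CRITICAL: "active failover across regions",
    Tier.IMPORTANT: "standby in a second availability zone",
    Tier.DEFERRABLE: "restore from backup after the outage",
}


@dataclass(frozen=True)
class System:
    name: str
    max_tolerable_downtime_minutes: int


def classify(system: System) -> Tier:
    """Assign a tier from tolerable downtime (thresholds are assumptions)."""
    if system.max_tolerable_downtime_minutes <= 5:
        return Tier.CRITICAL
    if system.max_tolerable_downtime_minutes <= 240:
        return Tier.IMPORTANT
    return Tier.DEFERRABLE


if __name__ == "__main__":
    # Hypothetical inventory of systems and how long each can be down.
    inventory = [
        System("customer login", 2),
        System("reporting dashboard", 120),
        System("internal wiki", 1440),
    ]
    for system in inventory:
        print(f"{system.name}: {STRATEGY[classify(system)]}")
```

Only the systems in the top tier get the expensive treatment; everything else gets a cheaper safeguard that matches how much downtime the business can actually absorb.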

(Stepping down from soapbox now.)