Amazon’s Simple Storage Service (S3) experienced substantial technical issues today — and thousands of websites paid the price.
According to Amazon’s AWS service health dashboard, the S3 outage was due to “high error rates with S3 in US-EAST-1.”
Whatever that means.
A Cloudless 4 Hours
The Amazon S3 outage lasted from around 10 am to 2 pm PST, affecting websites like Quora, Business Insider, Instagram, Slack and Giphy.
In an ironic twist, Amazon wasn't able to update its own service health dashboard for the first two hours of the outage because the dashboard itself was hosted on AWS.
And as you would expect, the fallout was marked by an amusing Twitter hashtag, #AWSOutage.
It began trending just minutes into the saga.
By 11:35 am PST, Amazon had restored its ability to update its own dashboard. The company then shared its findings and plans, albeit without much detail:
“We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause and are working on implementing what we believe will remediate the issue.”
A few hours and a few million 404 errors later, Amazon confirmed that everything was back to normal:
“As of 1:49 PM PST, we are fully recovered for operations for adding new objects in S3, which was our last operation showing a high error rate. The Amazon S3 service is operating normally.”
During those four hours, some websites and apps were knocked totally out of the cloud, while others were accessible but not fully functioning. Slack users, for example, could chat normally, but file uploads were out of the question.
Ironically, Is It Down Right Now — the website dedicated to telling you if websites are down — went down itself.
A Wake-Up Call?
For the majority of internet surfers, this outage was little more than a short and comical inconvenience. But for the brands relying on Amazon S3 to keep their websites, apps and platforms in the cloud, four hours of downtime is no joke, particularly when Amazon claims to deliver “99.999999999% durability.”
In response to the outage, CMSWire spoke to Shawn Moore, CTO of Solodev, about how Amazon’s four-hour failure is actually a wake-up call for many of its customers:
“The AWS breakdown, caused by high error rates with S3 in US-EAST-1, has caused websites completely held by those servers to go down. For those in the industry, this brings to light the reality that all technology will fail eventually – even ones that are ‘too big to fail’ like Amazon.
“While this does impact an estimated 20 percent of the internet, there are many businesses hosted on Amazon that are not having these issues. The difference is that the ones who have fully embraced Amazon’s design philosophy to have their website data distributed across multiple regions were prepared.
“This is a wake-up call for those hosted on AWS and other providers to take a deeper look at how their infrastructure is set up and emphasizes the need for redundancy – a capability that AWS offers, but it’s now being revealed how few were actually using.”
Some Amazon S3 customers will choose to see this as an inevitable part of life in the cloud, since no piece of technology is infallible. But others may find that Shawn Moore’s words ring true enough to consider further data diversification.
In any case, if there’s one thing this episode proves, it’s that when Amazon sneezes, the online world catches a cold.