Microsoft is trying to recover from a widespread outage that affected its Azure cloud platform across multiple regions. The company acknowledged that the roughly 11-hour issue, which started last night, affected customers with virtual machines in every region except the new Australian data center.
The unanswered question now: What's the long-term impact of the outage, which knocked many third-party sites offline and created problems with Microsoft's Office 365 suite?
In a blog post explaining the outage, Microsoft Azure Corporate Vice President Jason Zander attributed the problem to a performance update that Microsoft applied to Azure Storage services globally, all at the same time. Zander wrote (we quote a significant part of his statement for clarity):
"As part of a performance update to Azure Storage, an issue was discovered that resulted in reduced capacity across services utilizing Azure Storage, including Virtual Machines, Visual Studio Online, Websites, Search and other Microsoft services. Prior to applying the performance update, it had been tested over several weeks in a subset of our customer-facing storage service for Azure Tables.
We typically call this 'flighting,' as we work to identify issues before we broadly deploy any updates. The flighting test demonstrated a notable performance improvement and we proceeded to deploy the update across the storage service. During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting. The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues."
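Zander's description of flighting boils down to a canary pattern: validate an update on a small subset of the fleet before deploying it broadly, with a health check strict enough to catch a failure mode like a hung front end. A minimal sketch of that idea follows; the function names, the probe interface, and the 5 percent flight fraction are illustrative assumptions, not Azure's actual tooling:

```python
import time

def healthy(frontend, update, timeout=5.0):
    """Probe a front end after applying the update; treat a hang
    (e.g. an infinite loop) as a failure via a response deadline."""
    start = time.monotonic()
    ok = frontend(update)  # hypothetical probe call on one front end
    return ok and (time.monotonic() - start) < timeout

def flight_then_rollout(frontends, update, flight_fraction=0.05):
    """Test the update on a small 'flight' subset first, then roll
    out broadly; abort on the first unhealthy front end."""
    n_flight = max(1, int(len(frontends) * flight_fraction))
    flight, rest = frontends[:n_flight], frontends[n_flight:]

    for fe in flight:              # canary phase
        if not healthy(fe, update):
            return "aborted in flighting"

    for fe in rest:                # broad rollout
        if not healthy(fe, update):
            return "aborted mid-rollout"
    return "deployed"
```

The point of the sketch is the failure mode Zander describes: if the bad behavior only surfaces outside the flight subset, the canary phase passes and the rollout proceeds, which is why the abort check has to keep running during the broad deployment as well.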
To add insult to injury, the Service Health Dashboard and the Azure Management Portal, which report on Azure Storage services and depend directly on them, could not be updated. So for a large part of the evening they reported that everything at Azure was just dandy.
For a lot of irate customers, though, things were far from OK. Small businesses across the US and UK experienced what was, in effect, a total shutdown of storage services; customers could not use, or even access, their corporate websites.
In fairness to Microsoft, it was pretty quick to ‘fess up, but whether that is going to appease current or potential users remains to be seen.
Microsoft is already moving on: Zander has outlined some of the actions the company will take to ensure this doesn't happen again, even if some Azure employees are undoubtedly still smarting from the scolding they got over this one. Zander writes:
We are taking steps to improve the Microsoft Azure Platform and our processes to ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):
- Ensure that the deployment tools enforce the standard protocol of applying production changes in incremental batches
- Improve the recovery methods to minimize the time to recovery
- Fix the infinite loop bug in the CPU reduction improvement from the Blob Front-Ends before it is rolled out into production
- Improve Service Health Dashboard Infrastructure and protocols
Leaving aside the direct impact of the downtime, Microsoft's reputation is going to suffer from this, whichever way it moves. Its Service Level Agreement (SLA) states: "We guarantee at least 99.9 percent availability of the Azure Active Directory Basic and Premium services."
However, a lot of people are not convinced. A selection of the comments that greeted Zander's blog post demonstrates just how much anger there is over this. One poster calling himself Nathan summarizes the gist of many of the comments that have appeared to date:
"Why was the dashboard marked all green for Azure West despite our TAM and TAs informing us that the issue was ongoing? We rely on the dashboard to know what the health of Azure is, it seems like it's more of a PR stunt now? Why was the response time from the engineering team over 20 hours for a Sev-A? Why despite using geo-redundant storage was data loss realized by your customers?"
But a considerable number of people were also annoyed at Microsoft's handling of the situation at a customer level, and in these kinds of situations there is always talk of jumping ship and moving to other service providers.
For Microsoft, the timing is particularly bad. The outage comes just as the company is building out its cloud presence to challenge Amazon, IBM, Google and others offering rival products.