Outage Outrage As Microsoft's Azure Stumbles

Microsoft is trying to recover from a widespread outage that affected its Azure cloud platform across multiple regions. The company acknowledged that the 11-hour issue, which started last night, affected customers with virtual machines in all regions other than the new Australian data center.

The unanswered question now: What's the long-term impact of the outage, which knocked many third-party sites offline and created problems with Microsoft's Office 365 suite?

Failed Update

In a blog post about the outage, Microsoft Azure Corporate Vice President Jason Zander explained that the problem stemmed from a performance update that Microsoft applied to Azure storage services globally, all at the same time. Zander wrote (we cite a significant part of his statement for clarity):

"As part of a performance update to Azure Storage, an issue was discovered that resulted in reduced capacity across services utilizing Azure Storage, including Virtual Machines, Visual Studio Online, Websites, Search and other Microsoft services. Prior to applying the performance update, it had been tested over several weeks in a subset of our customer-facing storage service for Azure Tables.

"We typically call this 'flighting,' as we work to identify issues before we broadly deploy any updates. The flighting test demonstrated a notable performance improvement and we proceeded to deploy the update across the storage service. During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting. The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues."

To add insult to injury, the Service Health Dashboard and Azure Management Portal, which report on Azure Storage services and link directly to them, could not update. So for a large part of the evening they reported that everything at Azure was just dandy.

Microsoft Reaction

But for a lot of irate customers, things were far from OK. Small businesses across the US and UK experienced what was, in effect, a total shutdown of storage services. Customers were not able to use or even access their corporate websites.

In fairness to Microsoft, it was pretty quick to ’fess up, but whether that will appease current or potential users remains to be seen.

Microsoft is already moving on, and Zander has outlined some of the actions the company will take to ensure it doesn’t happen again, even if some Azure employees are undoubtedly still smarting from the scolding they got over this one. Zander writes:

We are taking steps to improve the Microsoft Azure Platform and our processes to ensure such incidents do not occur in the future, and in this case it includes (but is not limited to):

Ensure that the deployment tools enforce the standard protocol of applying production changes in incremental batches
Improve the recovery methods to minimize the time to recovery
Fix the infinite loop bug in the CPU reduction improvement from the Blob Front-Ends before it is rolled out into production
Improve Service Health Dashboard Infrastructure and protocols
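
The first item on that list, applying production changes in incremental batches, can be illustrated with a minimal sketch. This is not Microsoft's deployment tooling; the function name, the `deploy` and `is_healthy` callbacks and the region names are all hypothetical, chosen only to show the idea of halting a rollout as soon as one batch looks unhealthy.

```python
from typing import Callable, List, Tuple


def rollout_in_batches(
    regions: List[str],
    batch_size: int,
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    """Deploy to regions one batch at a time (hypothetical helper).

    Returns the regions that received the update and whether the
    rollout completed. The rollout stops after the first batch in
    which any region fails its health check, so later batches are
    never touched.
    """
    updated: List[str] = []
    for start in range(0, len(regions), batch_size):
        batch = regions[start:start + batch_size]
        for region in batch:
            deploy(region)
            updated.append(region)
        # Check the whole batch before moving on to the next one.
        if not all(is_healthy(region) for region in batch):
            return updated, False
    return updated, True
```

With five regions, a batch size of two and a fault that only shows up in the third region, the sketch stops after the second batch and leaves the fifth region on the old version, which is the containment a simultaneous global update cannot offer.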

Leaving aside the direct impact of the downtime, Microsoft’s reputation is going to suffer from this, whichever way it moves. In its Service Level Agreement (SLA), Microsoft states: "We guarantee at least 99.9 percent availability of the Azure Active Directory Basic and Premium services."
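
To put a 99.9 percent figure in perspective, here is a rough calculation. It assumes a 30-day (720-hour) billing month and treats the full 11 hours as downtime for a single service, which is a simplification: the SLA quoted above covers Azure Active Directory rather than storage, and actual SLA terms define downtime more narrowly. The function names are illustrative only.

```python
def allowed_downtime_minutes(availability_pct: float, period_hours: float) -> float:
    """Minutes of downtime permitted in the period at the given availability."""
    return period_hours * 60 * (1 - availability_pct / 100)


def achieved_availability_pct(outage_hours: float, period_hours: float) -> float:
    """Availability percentage for the period after an outage of the given length."""
    return 100 * (period_hours - outage_hours) / period_hours


HOURS_IN_MONTH = 30 * 24  # assumption: a 30-day month

budget = allowed_downtime_minutes(99.9, HOURS_IN_MONTH)
achieved = achieved_availability_pct(11, HOURS_IN_MONTH)
print(f"Downtime budget at 99.9%: {budget:.1f} minutes")
print(f"Availability after an 11-hour outage: {achieved:.2f}%")
```

Under those assumptions, a 99.9 percent guarantee leaves room for roughly 43 minutes of downtime a month, while an 11-hour outage works out to about 98.5 percent availability for that month.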

Outage Reactions

However, a lot of people are not convinced. A selection of the comments that greeted Zander's blog post explaining the outage demonstrates just how much anger there is over this. One poster calling himself Nathan summarizes the gist of many of the comments that have appeared to date:

"Why was the dashboard marked all green for Azure West despite our TAM and TAs informing us that the issue was ongoing? We rely on the dashboard to know what the health of Azure is, it seems like it’s more of a PR stunt now? Why was the response time from the engineering team over 20 hours for a Sev-A? Why despite using geo-redundant storage was data loss realized by your customers?"

But a considerable number of people were also annoyed at Microsoft’s handling of the situation at a customer level. Here’s another comment:

[Image: comment from Ray Suelzer on the Azure outage]

In these kinds of situations there is always talk of jumping ship and moving to other service providers:

[Image: comment from Michael on the Azure outage]

Azure isn't the first to stumble. AWS had a couple of real doozies with the Easter service outage of 2011 and the July outage of 2012, which left millions of Netflix customers unable to stream video.

For Microsoft, though, the timing is pretty bad. The outage comes just as the company is building up its cloud presence to challenge Amazon, IBM, Google and others offering rival products.