If you are a current user of Microsoft's Azure, the infrastructure and software as a service cloud technology, you already know of the worldwide outage that occurred for over eight hours on the Windows Azure Service Management Component. The glitch that brought the system down began at 1:45 a.m. (GMT) on Leap Year Day, and some have speculated this may be at least part of the cause.
A quick link to the Azure Service Dashboard tells the story, and though it is up this morning, at the time of this writing (GMT 5:51 p.m.) the outage continued with comments from the Microsoft team that included hour-by-hour updates on how things were progressing.
One Hour, 45 Minutes Into Leap Day
At the specific Coordinated Universal Time, 1:45 a.m. UTC, Microsoft reported:
We are experiencing an issue with Windows Azure service management. Customers will not be able to carry out service management operations. We are actively investigating this issue and working to resolve it as soon as possible. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers."
UTC or Temps universel coordonné in French is basically GMT (Greenwich Mean Time) with added leap seconds to keep Atomic Time as accurate as possible.
Microsoft Azure Service Dashboard
True to its word, hourly updates came flowing in beginning at 3:00 a.m. (UTC) first with the good news there would be no impact to storage accounts. Microsoft also said hosted services currently running, "...may experience a capacity impact."
By 4:00 a.m., the team identified the "root cause of this incident" and were working on a fix. The report said the initial problem was "...tracked back to a cert issue triggered on 2/29/2012 GMT. " By 5:00 a.m. containment was achieved, with the group reporting that less than 3.8% of the hosted services were impacted. "...We have taken the necessary measures to prevent the incident from spreading across the production environment."
Between 6:00 a.m. and 9:00 a.m., Microsoft Azure team reported Hotfix development, testing and deployment with a gradual roll-out in the North Central US sub-region, and the promise to "...progressively enable service management back for customers." And by 1:30 p.m. UTC, the team reported, "The issue is mitigated and service management is restored for the majority of customers. But, there are still lingering issues needing work...before we can completely restore service management."
But as the hotfix was being deployed, other issues began around 2:30 UTC with Windows Azure Compute service across North Europe and the USA that culminated in sub-region outages affecting a double-digit percentage of hosted services, where incoming web traffic could not get through.
Northern Europe, South Central US Hit Hardest
Locationwise, Windows Azure Compute Availability in the N. Europe and S. Central US sub-regions were hit hardest, with reports from Microsoft that at 2:30 UTC, 37%, and 28% of hosted services respectively were affected by the issue. In North Central US, 6.7% hosted services were down at that time.
The latest reported status from Microsoft includes:
Service Unavailable for:
- SQL Azure Reporting [North Europe] - Last update 8:00 a.m. UTC
- Windows Azure Marketplace - DataMarket [South Central US] - Last Update 5:11 p.m. UTC
- Windows Azure Service Management [Worldwide] - Last Update 3:30 p.m. UTC
Service Performance Degradation for:
- Windows Azure Compute [North Central US] - Last Update 3:30 p.m. UTC
- Windows Azure Compute [North Europe] - Last Update 3:30 p.m. UTC
- Windows Azure Compute [South Central US] - Last Update 3:30 p.m. UTC
Microsoft Azure launched in 2009, and up to now has had a good track record, with no reported outages. Issues over cloud service reliability are being hotly debated on the web as a result, with a growing realization that, with all the benefits the cloud has to offer, you must expect a little rain sometime.