Blame People for Cloud Downtime Woes

By Dom Nicastro

People are among the top concerns for public clouds. They make mistakes. And those mistakes lead to downtimes.

The findings come from CloudEndure's first survey of IT professionals in North America and Europe, "2014 State of Public Cloud Disaster Recovery."

Specifically, the 116 IT pros surveyed ranked human error right up there with application bugs and network failures as a primary risk to system availability.

Downtimes 'Unavoidable'

Last year had its share of cloud downtime news. Several companies were involved in a major outage last fall when they had to patch a security vulnerability affecting certain versions of XenServer, a popular open-source hypervisor.

Microsoft in November tried to recover from a widespread outage that hit its Azure cloud platform across multiple regions. The outage affected customers with virtual machines in all regions other than the new Australian data center.

Leonid Feinberg, vice president of products for Tel Aviv-based CloudEndure, a cloud mobility provider, told CMSWire the most common type of human error found in the company's 2014 research was a mistake made during maintenance or reconfiguration.

According to the survey, the top challenges in meeting availability goals are insufficient IT resources, budget limitations and limited ability to prevent software bugs.

"From our perspective, the lesson that is learned here is that downtimes are unavoidable, especially in today's world when one relies on many third party service providers," Feinberg said. "We believe that trying to prevent disasters is never enough and one needs to accept that disasters will happen and prepare for this possibility to make sure the disruption to the business is minimal."

Lessons Learned

Rackspace certainly was forthcoming with its mistakes last fall. Taylor Rhodes, CEO and president of the San Antonio, Texas-based public and private cloud hosting provider, admitted some response errors and said the downtime ultimately forced a reboot for about a quarter of Rackspace's 200,000 customers.

"In the course of it, we dropped a few balls," Rhodes said in a blog post to customers. "Some of our reboots, for example, took much longer than they should. And some of our notifications were not as clear as they should have been. We are making changes to address those mistakes. And we welcome your feedback on how we can better serve you." 

Too many times, the handling of downtime is not taken seriously, Feinberg told CMSWire.

"The adoption of the cloud made many people carefree about believing that the cloud provider will take care of this," Feinberg said, "and the results show that it's not a realistic approach."

Other Findings

The CloudEndure results also revealed:

  • Almost all respondents claim they meet their availability goals consistently (43 percent) or most of the time (49 percent), but 26 percent of the organizations surveyed don’t measure service availability at all
  • 79 percent have a service availability goal of 99.9 percent or better, but over half of the companies (54 percent) had at least one outage in the past three months
  • Load balancing and local (single region/zone) storage backup are the leading strategies to ensure system availability and data protection, cited by 59 percent and 51 percent of the respondents, respectively
  • There is a strong correlation between the cost of downtime and the average hours per week invested in backup/disaster recovery
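
To put the 99.9 percent availability goal in perspective, it helps to convert it into an allowable downtime budget. The snippet below is only a rough sketch of that arithmetic. The function name and the 90-day quarter are illustrative assumptions, not something taken from the CloudEndure survey.

```python
def downtime_budget_hours(availability_percent: float, period_hours: float) -> float:
    """Return the maximum downtime, in hours, that still meets the availability goal."""
    return period_hours * (1 - availability_percent / 100)


HOURS_PER_YEAR = 365 * 24      # 8,760 hours, ignoring leap years
HOURS_PER_QUARTER = 90 * 24    # assumes a 90-day quarter (illustrative)

yearly = downtime_budget_hours(99.9, HOURS_PER_YEAR)
quarterly = downtime_budget_hours(99.9, HOURS_PER_QUARTER)

print(f"99.9 percent over a year allows about {yearly:.2f} hours of downtime")
print(f"99.9 percent over a quarter allows about {quarterly:.2f} hours of downtime")
```

Under these assumptions, a 99.9 percent goal leaves roughly 8.76 hours of downtime per year, or a little over two hours per quarter, so a single lengthy outage in a three-month window could be enough to miss the target.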

Title image by ktpupp, used under a Creative Commons Attribution-Noncommercial-No Derivative Works 2.0 Generic License.