Oops! Is Rackspace Rethinking its 99.99% Uptime Boast?

Dom Nicastro

Eleven hours and counting. 

It's been a long haul of public-cloud downtime for David Björkevik and his team at Schemagi, a Linköping, Sweden-based company that makes schedules for nurses and other healthcare personnel using advanced optimization techniques.

Schemagi is one of the Rackspace public cloud customers experiencing downtime because of a maintenance reboot scheduled by the San Antonio, Texas-based managed hosting provider that offers public and private cloud hosting services.

Rackspace posted an "urgent notice" early Saturday morning ET on its website notifying customers of cloud server reboots in light of a potential problem with its public cloud environment.

The news comes around the same time the company announced a 99.99 percent OpenStack API uptime guarantee for the new release of its private cloud software, which is built on OpenStack, the open source cloud computing platform it helped create.

Downtime Frustration

It's left customers like Björkevik waiting for answers and without means to do business.

"We use the cloud service to gather requests and feedback from the nurses," Björkevik told CMSWire. "Taking swift action to security issues is a good thing. They could have been more clear on the long downtimes. I did not expect so many hours of downtime. If I did, I would have prepared backup arrangements. They have a 24-hour service window, but the way they worded it I just expected a reboot. Maybe five minutes of downtime and some fixing. Luckily our cloud service is only a small part of our service offering. Maybe 50-100 users affected at this point."

Several others are tweeting concerns about the Rackspace response and the timing of the reboots, many under the hashtag #RackspaceReboot.

According to a blog post from Knopf, Rackspace notified customers of the problem and reboot by email at 9:22 p.m. Friday, Sept. 26. That email was turned into the Rackspace website post a few hours later:

"Recently, an issue that has the potential to impact a portion of the Public Cloud environment was reported. Our engineers and developers continue to work closely with our vendors and partners to apply the solution to remediate this issue. While we believe in transparent communication, there are times when we must withhold certain details in order to protect you, our customers."

Happened with Amazon, Too

Ben Kepes, technology evangelist and the director of Diversity Limited, blogged for Forbes about the reboot. He noted that on Sept. 24 Amazon Web Services notified its public cloud customers of an issue that turned out to be related to a problem with the Xen hypervisor that AWS’ service is built upon.

"This was somewhat surprising given that Rackspace’s public cloud is also built upon Xen," Kepes wrote. 

He also noted that Rackspace's timing in emailing customers late on a Friday night was poor, suggesting "this should go out during working hours." He also criticized the subsequent "24-hour maintenance window," writing, "Hello? Are you expecting me to wait up all night to check my website."

Rackspace Response

Rackspace did not immediately return an email sent this morning with questions from CMSWire.

UPDATE: At 4:30 ET today, a Rackspace media relations official responded to CMSWire:

"Rackspace has identified a bug in the software for our Cloud Servers (Standard, Performance 1 & Performance 2) and we are running a global maintenance, in which reboots are occurring. All reboots are expected to be completed in the next 72 hours.

"Rackspace strives to provide as much information to our customers as possible and regularly posts updates regarding any such issues on our status site."

In its post to customers, it noted that "as part of the solution that is being developed, we anticipate that a reboot will be necessary for all Standard, Performance 1 and Performance 2 Cloud Servers within our Cloud Servers infrastructure. In preparation for these reboots, we recommend that you take proactive steps to ensure that your environment is configured to return to proper operations after a reboot."

Customers should, according to Rackspace:

  • Verify all necessary services (Apache, IIS, MySQL, etc.) are configured to start on server boot
  • Ensure that you have up-to-date server images and file-level backups enabled, and confirm that you have backups of all critical data
  • Confirm that any unsaved changes, such as firewall rules and application configurations, are indeed saved
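The first item on that checklist can be verified with a short script. What follows is only a minimal sketch, not anything Rackspace published: it assumes a Linux server managed by systemd, and the function names and the example unit names (`apache2`, `mysql`) are hypothetical placeholders to be replaced with whatever your own servers run.

```python
import subprocess
from typing import Callable, Iterable, List


def systemctl_is_enabled(service: str) -> bool:
    """Return True if systemd reports the unit as enabled at boot.

    Assumption: the host uses systemd. Servers on other init systems
    would need a different check.
    """
    try:
        result = subprocess.run(
            ["systemctl", "is-enabled", service],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # systemctl is not installed, so boot-time startup cannot be confirmed.
        return False
    return result.stdout.strip() == "enabled"


def services_not_starting_on_boot(
    services: Iterable[str],
    is_enabled: Callable[[str], bool] = systemctl_is_enabled,
) -> List[str]:
    """List the services that are NOT confirmed to start on server boot."""
    return [name for name in services if not is_enabled(name)]
```

Calling `services_not_starting_on_boot(["apache2", "mysql"])` on a server returns the names that still need attention before a reboot. Passing the check in as a parameter keeps the logic testable without touching a real machine.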

It also posted a maintenance window by time zone that began Sunday and is expected to continue through Wednesday. "We will perform this maintenance one region at a time, and will not begin the maintenance for the next region until the maintenance for the previous region is complete," Rackspace officials said.

Rackspace officials said they'll communicate with customers via email, the company's status page and the Rackspace Community.

Better Communication Needed?

"I feel like Rackspace could use this incident as an opportunity for improvement in their customer communications," said Patrick McKenzie, founder of Kalzumeus Software and Appointment Reminder.

McKenzie said his team runs all of its servers (three product lines) on Rackspace's cloud, having come to the company through its Slicehost acquisition.

"Back in the day, Slicehost was far and away the best option for hosting Ruby on Rails app, the main programming stack we use," McKenzie told CMSWire. "I could write love letters of how great that company was -- dropping in on their customer support chatroom meant you were talking to someone who could teach you to be a better engineer, every single time."

For McKenzie, Rackspace's first notification of the reboot came after close of business Friday, ahead of a reboot window scheduled for Sunday.

"While I understand that critical security vulnerabilities wait for no one, standard practice in hosting is to give customers a) enough advance notice to have staff ready for maintenance windows or to do a planned cutover to back up providers, and b) give users approximately one hour of notice before an individual machine goes down rather than saying, 'You have to be at your computers, ready to respond, at any time in a 24-hour window.'"

After he had prepared for the maintenance window, McKenzie learned through a third party that the servers he has with Rackspace, because they are on a legacy platform, would not require a reboot at all.

"This caused a lot of wasted effort for us," he told CMSWire.

McKenzie said he lost about half a workday waiting for a reboot that, because of his company's infrastructure, was never coming.

"It isn't the worst thing in the world, but that's below the standard of fanatical support that Rackspace reports they hold themselves to," McKenzie said. "My experiences since the acquisition are, regrettably, mixed on that score."

Large Issues at Stake?

Does this situation speak to larger issues in the software industry regarding cloud services that must be addressed?

McKenzie responded to that question by saying, "We're all still learning things as we go along, particularly for those of us who don't have Google/Facebook/etc. quality operations teams."

"There exist ways to have a service operating on cloud servers such that randomly shooting a machine gun through your data center wouldn't affect your operations in a way you'd notice," McKenzie said. "Those are largely beyond the means of smaller users, such as our company. Apparently, they're also beyond the reach of Rackspace at present. Eventually, as the knowhow percolates through the industry and the technology stacks and service offerings get more mature, I think that might be something that more of us can benefit from."

The industry, McKenzie noted, is "far, far, far behind where we need to be" on security issues like these.  

"Rackspace reacting with alacrity to this issue is a point in their favor," he said. "They probably should have reacted faster -- AWS beat them by several days, which is an eternity when automated for loops can compromise every machine on the Internet in hours. But acting quickly was far, far, far better than not acting and hoping no one noticed, which was our industry's responses to security issues for years.  We've still got a long, long way to go about getting more proactive about software security -- hopefully, the Amazon Web Services and Rackspaces and Googles of the world can continue throwing smart people and money at the substantial R&D challenges required to secure the software that effectively underpins the entire world."