Amazon's Elastic Compute Cloud service went down last week and is still recovering today, after bringing a number of major services to a standstill. Meanwhile, Sony's PlayStation Network is also experiencing problems.

UPDATE - Tuesday 16:00 EST

Sony has just come clean about the whole PSN outage, in quite a sobering message on its blog. The highlights, if they can be called that, are that "between April 17 and April 19, 2011, certain PlayStation Network and Qriocity service user account information was compromised in connection with an illegal and unauthorized intrusion into our network."

Sony has shut down the service, called in outside investigators and recommends that users get their credit reports and check for dubious behavior. Don't expect PSN to come back up until additional security measures are in place.

As for Amazon, who look like they are getting off quite lightly in comparison, things seem to be getting back to normal. Affected Amazon users are apologizing to their own customers while waiting to hear what Amazon will do to compensate them.

One such example is from Drupal, which sent an email to customers:

Late last week, Drupal Gardens experienced an unexpected delay in service due to a widely-publicized outage at our partner Amazon Web Services (AWS). Sites were restored through Friday and, by midnight Friday, all sites were back up and fully functional.

The purpose of this message is to offer our sincere apologies to you for this disruption and to give affected customers a free month of service. Annual subscribers will get a one month extension. For those with monthly plans, your next month of service will be completely free of charge.

UPDATE - Tuesday 5:00 EST

Amazon continues its return to normality, but things just get worse for Sony, with rumors of credit card theft doing the rounds, although this could be an extension of a fake-funds rumor that surfaced when PSN first went down. In its latest statement, the company isn't even sure if the service will be back up on Wednesday, as previously planned.

Obviously, there is no point putting the same broken system back in place, and Sony needs time to improve its security, but the delay will only anger gamers, film and music fans who use Sony's various services. Sony unveiled its new tablet PCs in Tokyo today and made no mention of the outage, even though they will use PSN services.

UPDATE - Monday 9:00 EST

Amazon's services are now largely back up and running, with the caveat of delays and slow service, and complaints seem to be dying down. Now the inquest into this failure can begin, and compensation can be paid to those inconvenienced. There's already an excellent lessons-to-learn piece on CNET.

Sony, on the other hand, is basically having to rebuild its PSN network to prevent a repeat of the hack that brought it down last Wednesday. Service may be brought back up on Tuesday/Wednesday, but the whole global service will have been down for a week.

Sony is still keeping largely quiet about this disaster, which has left tens of millions of gamers without any online service. But what seems to have happened is that, after an original protest hack by the Anonymous group, another hacker found a similar way in and started doing either lots of damage or mischief (possibly crediting accounts with fake funds).

Either way, Sony has been caught with its pants down in public, proving that its service is both insecure and overly restrictive. If someone takes down Microsoft's Xbox online service, Netflix or Apple's iTunes, then the whole walled-garden concept of consumer service will look seriously shaky.

UPDATE - Saturday 4:30 EST

Sony's PSN is still inaccessible to millions of gamers and its new Qriocity music service is also out of action. At least Sony has acknowledged the nature of the outage and given a practical timeframe (even if it is days away) for recovery. Sony says:

"An external intrusion on our system has affected our PlayStation Network and Qriocity services. In order to conduct a thorough investigation and to verify the smooth and secure operation of our network services going forward, we turned off PlayStation Network & Qriocity services on the evening of Wednesday, April 20th. Providing quality entertainment services to our customers and partners is our utmost priority. We are doing all we can to resolve this situation quickly, and we once again thank you for your patience. We will continue to update you promptly as we have additional information to share."

As for Amazon, there is a lengthy update on the service page, but the upshot is things are getting back to normal, slowly -- mostly due to the massive amounts of data involved. Requests are being moderated to prevent bottlenecks, so any resumed services may be a little slower than normal.

The mainstream media is now picking through the carcasses of the stories, with the Sony story the fourth most read on the BBC, and both stories are still generating masses of Twitter messages. If you look behind the outdated Top Tweets and general murmurings, there are some decent articles starting to appear analyzing the Amazon issue.

UPDATE - Friday 11:30AM EST

Amazon is still restoring services to customers, well into the second day of its outage. Popular sites like Foursquare and Quora seem to be back up and running. Reddit is running in emergency mode, but many businesses are still without access to their data. The latest on Amazon's AWS service status page is:

"6:18 AM PDT We're starting to see more meaningful progress in restoring volumes (many have been restored in the last few hours) and expect this progress to continue over the next few hours. We expect that we'll reach a point where a minority of these stuck volumes will need to be restored with a more time consuming process, using backups made to S3 yesterday (these will have longer recovery times for the affected volumes). When we get to that point, we'll let folks know. As volumes are restored, they become available to running instances, however they will not be able to be detached until we enable the API commands in the affected Availability Zone."

Sony has not done much updating of its PlayStation Blog in response to the global wipeout of its PlayStation Network gaming service, but a comment from a blog manager suggests that the PSN downtime could be down to an "outside attack":

"Our support teams are investigating the cause of the problem, including the possibility of targeted behaviour by an outside party. If the reported Network problems are indeed caused by such acts, we would like to once again thank our customers who have borne the brunt of the attack through interrupted service. Our engineers are continuing to work to restore and maintain the services, and we appreciate our customers' continued support."

Has your company suffered from this outage? What steps have you taken for continuity and do you plan to change how you operate in the future? Let us know in the comments. Original article follows:

Bad Day at the Office?

A host of major Internet-based services were affected yesterday by an outage of Amazon's EC2 and Relational Database Service at its Virginia data center. The service page lists instance and connectivity errors, plus latency problems that have hamstrung the likes of Quora and Foursquare.

The first we knew of it was when Sony gamers were complaining last night that the company's PSN games store was down. A message on the company's site leaves that issue unexplained (we think it is unrelated to Amazon's issues and possibly stems from Sony's ongoing problems with the hacking group Anonymous).

Then, while investigating, we started seeing messages regarding Amazon's EC2 service, and the likes of Reddit and HootSuite followed. Few seem to have had a happy day since, posting messages asking users to bear with them, although perhaps the general productivity of the world rose a bit without all those distractions.

This is a Crisis

Amazon's log lists the problems starting at:

"1:41 AM PDT We are currently investigating latency and error rates with EBS volumes and connectivity issues reaching EC2 instances in the US-EAST-1 region.
2:18 AM PDT We can confirm connectivity errors impacting EC2 instances and increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region. Increased error rates are affecting EBS CreateVolume API calls. We continue to work towards resolution.
2:49 AM PDT We are continuing to see connectivity errors impacting EC2 instances, increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region, and increased error rates affecting EBS CreateVolume API calls. We are also experiencing delayed launches for EBS backed EC2 instances in affected availability zones in the US-EAST-1 region. We continue to work towards resolution."

The log continues up to the latest updates:

"10:58 PM PDT Just a short note to let you know that the team continues to be all-hands on deck trying to add capacity to the affected Availability Zone to re-mirror stuck volumes. It's taking us longer than we anticipated to add capacity to this fleet. When we have an updated ETA or meaningful new update, we will make sure to post it here. But, we can assure you that the team is working this hard and will do so as long as it takes to get this resolved.

2:41 AM PDT We continue to make progress in restoring volumes but don't yet have an estimated time of recovery for the remainder of the affected volumes. We will continue to update this status and provide a time frame when available."

The sites affected are showing somewhat less technical messages. This was Foursquare's page early this morning (the site appears to be bouncing up and down), and others are offering similarly terse "it's not our fault" apologies. Some sites are coming back up but are citing limited service or availability.

fsqam.jpg

Foursquare and many other sites are affected by Amazon's outage

While this is far from the Internet's Titanic or Concorde moment, it is a timely reminder of the perils of cloud and off-site data storage and computing. Amazon will have a lot of PR firefighting to do after the techs have fixed the issues. Not companies to miss a trick, many social media rivals to the affected firms are turning to Twitter and other social sites to promote their non-Amazon powered alternatives.

Other hosting providers, such as Storm on Demand, are offering US$100 discounts to try their own services. The whole problem will have companies looking at their disaster recovery plans carefully... or making one up as the sheepish CIO rushes to the boardroom with some explaining to do.

SkyNet Not to Blame

Jokers on Twitter were quick to point out that yesterday was the day referenced in the Terminator movie as the date when SkyNet became self-aware and started wiping us out. Amazon has categorically denied SkyNet's (the U.K.'s defence satellite comms system) involvement in the outage on its forum.

"From the information I have and to answer your questions, SkyNet did not have anything to do with the service event at this time."

So, that's a relief. Sony's problems are still going on, with the entire worldwide gaming service knocked out and an estimate of "days" before it will return, leaving millions of on-holiday gamers fuming and unable to play online games or even some single-player titles.

Is this cloud thing really all it's cracked up to be?