Thursday, 10 January 2013

Ouchage – Windows Azure

Just when you thought it was safe to move to the cloud, along comes a whole series of high-profile outages. But should you really give up on the cloud?

Christmas and New Year is always a good time to schedule a bit of downtime: nobody really works much, and a few hours here or there won’t make much difference. Unfortunately the same isn’t true of unscheduled downtime, as PC management service Soluto and online movie service Netflix found to their cost this Christmas and New Year.

Netflix was the first outage victim. Just as the US was cozying up to its internet TVs to watch some streamed Christmas movies, the service failed, and it continued to fail over the Christmas Eve peak viewing time and into Christmas Day. The next victim was Soluto: the service was down for 62 hours between Christmas and New Year, probably just as its users were taking advantage of a bit of R&R to sort out their computers before the New Year.

The more observant of you will have noticed the word “victim” used in both of these examples. But were they really victims, or did they sow the seeds of their own problems by choosing the cloud?

Netflix are a service that couldn’t exist without the cloud: building the sort of infrastructure they needed to launch would have required tens of millions of dollars and taken years. They also know that to build a successful consumer service you need to inspire confidence. So, unlike many businesses, Netflix take time out to test their systems continually, using a team of Chaos Monkeys and Chaos Gorillas to see what happens if their own internal systems, or those of their cloud service provider Amazon Web Services (AWS), were to go wrong.

Soluto are similar in their use of the cloud but very different when it comes to resilience. Like Netflix, Soluto wouldn’t have been able to launch if it wasn’t for the cloud. Unlike Netflix, though, Soluto took a decision to rely solely on their cloud provider (Microsoft Azure) to look after the resilience of the service, as the Soluto team explains in their apology email:

We could have obviously spent time building various mechanisms to make sure that whatever happens to Azure, we’ll be able to provide our service (the extreme example would be creating a redundant deployment in Amazon). But that’s not the startup way. Because by doing so, we wouldn’t have created hundreds of features for our users at the same time. And for well over a year, we didn’t experience a severe downtime except for a single case of several hours in February, but once a year that’s acceptable.

So are Netflix victims and Soluto guilty? The answer is probably yes and no. Soluto were unlucky to see such a catastrophic failure: 62 hours is a long time, and no doubt Soluto will be drawing Microsoft’s attention to their SLA and be spared the bill for this month’s hosting.
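To put 62 hours into SLA terms, here is a rough sketch of the arithmetic in Python. It assumes the whole outage fell inside a single 31-day billing month, and the 99.9% target is a hypothetical figure chosen for illustration, not a quote from Microsoft’s actual SLA.

```python
HOURS_PER_DAY = 24


def availability(downtime_hours: float, days_in_period: int) -> float:
    """Fraction of the period during which the service was up."""
    total_hours = days_in_period * HOURS_PER_DAY
    return (total_hours - downtime_hours) / total_hours


def allowed_downtime_hours(sla_target: float, days_in_period: int) -> float:
    """Downtime budget, in hours, that a given availability target permits."""
    total_hours = days_in_period * HOURS_PER_DAY
    return total_hours * (1 - sla_target)


# Assumption: the outage sits entirely within one 31-day month.
OUTAGE_HOURS = 62
DAYS_IN_MONTH = 31

# Assumption: illustrative target only, not Microsoft's published SLA.
HYPOTHETICAL_SLA = 0.999

achieved = availability(OUTAGE_HOURS, DAYS_IN_MONTH)
budget = allowed_downtime_hours(HYPOTHETICAL_SLA, DAYS_IN_MONTH)

print(f"Achieved availability: {achieved:.2%}")
print(f"Downtime budget at {HYPOTHETICAL_SLA:.1%}: {budget:.2f} hours")
print(f"Outage exceeded the budget by {OUTAGE_HOURS - budget:.2f} hours")
```

Under those assumptions the outage used up one twelfth of the month, which is a long way outside the sort of downtime budget a “three nines” target would allow.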

What you should take away from these two high-profile problems is not a fear of the cloud but a template for how to deal with a crisis. If there is a problem, deal with it quickly, communicate it to your customers, and keep them regularly updated. Don’t pretend it will go away and don’t hide it. Both Netflix and Soluto handled their outages well, and hopefully it won’t tarnish their brands. Learn from them.

Lastly, if you are choosing a cloud solution, you should probably take a look at a plan B for 2013.

About the Author:
Marcus Austin works for Firebrand Training as a Technical Author. Marcus has over 25 years’ experience in the technology and business sector. His recent work includes constructing a mobile strategy for the Guardian Media Group, together with writing and editing for magazines and websites including TechRadar, Internet Retailing, IT Perspectives, and Santander Breakthrough.