“Office365” states it very clearly – it would appear that the Microsoft SLA doesn’t take leap years into account, as we discovered last Wednesday. Customers lost access to Azure at 1:45 a.m. GMT on February 29, in an outage that lasted for the following 36 hours and was apparently caused by a bug traced back to a “cert issue” in a Microsoft datacenter in Dublin. The details themselves are rather notable from a historical point of view:
“While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year,” Bill Laing, a Microsoft exec, wrote on the Azure blog. “Once we discovered the issue we immediately took steps to protect customer services that were already up and running, and began creating a fix for the issue.”
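The quote does not say what the faulty calculation actually was, so the following is only a sketch of one common class of leap-year bug, not Microsoft's code. It assumes a hypothetical helper that computes a "one year from now" date by bumping the year and leaving the month and day alone, which breaks on February 29 because the following year has no such day. Both function names are made up for illustration.

```python
from datetime import date


def naive_add_year(d: date) -> date:
    # Hypothetical buggy helper: bump the year and keep month/day as they are.
    # For February 29 this asks for a date that does not exist and raises ValueError.
    return d.replace(year=d.year + 1)


def safe_add_year(d: date) -> date:
    # Same idea, but fall back to February 28 when the target year
    # has no February 29.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    naive_add_year(leap_day)
except ValueError as err:
    print("naive calculation failed:", err)

print("safe calculation:", safe_add_year(leap_day))
```

Clamping to February 28 is one design choice among several; rolling forward to March 1 would be just as defensible, as long as the behaviour is decided on purpose rather than discovered in production.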
The issue was eventually resolved, and it is far from the first to bring a major PaaS service to a complete halt. Competitor AWS has also had its fair share of unscheduled shutdowns over the past few years, the most recent of which was in late April last year. (Interestingly, that particular incident didn’t technically breach Amazon’s 99.95% uptime SLA because – as this ZDNet piece pointed out at the time – while EC2 was covered by the agreement, the two other services that actually caused the downtime were not.)
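To put figures like 99.95% and 36 hours side by side, here is a rough back-of-the-envelope sketch. It assumes the SLA is measured over a plain 365-day year and that every hour of the outage counts against it; real agreements define their own measurement windows and exclusions. The function names are illustrative, and pairing Amazon's percentage with the length of the Azure outage is purely for the sake of the arithmetic.

```python
HOURS_PER_DAY = 24


def allowed_downtime_hours(sla_percent: float, period_days: int) -> float:
    # Hours of downtime a service can accumulate in the period
    # while still meeting the stated uptime percentage.
    period_hours = period_days * HOURS_PER_DAY
    return period_hours * (1 - sla_percent / 100)


def achieved_uptime_percent(downtime_hours: float, period_days: int) -> float:
    # Uptime percentage actually delivered, given total downtime in the period.
    period_hours = period_days * HOURS_PER_DAY
    return 100 * (1 - downtime_hours / period_hours)


budget = allowed_downtime_hours(99.95, 365)
uptime = achieved_uptime_percent(36, 365)

print(f"Downtime allowed by a 99.95% SLA over a year: {budget:.2f} hours")
print(f"Uptime after a single 36-hour outage: {uptime:.2f}%")
```

Under these assumptions a 99.95% commitment leaves a little under four and a half hours of slack per year, while a single 36-hour outage on its own works out to roughly 99.59% for the year.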
Generally speaking, cloud platforms such as Azure can guarantee considerably higher uptime than traditional on-premises infrastructure, not to mention much better performance and utilization rates. This is not surprising considering Microsoft is probably using better components on both ends of the stack, but it’s this level of service – and the demand it rightfully earns – that leads to all the problems in the first place.
The aftermath
It’s all about the ‘public’ in public cloud. As one of the most widely used PaaS offerings out there, Azure is leveraged by thousands of organizations and independent developers worldwide – all of whom temporarily lost access to their data and apps during the blackout. One good example is CloudStore, a UK-based cloud service storefront that recently exited stealth and received a nice bit of coverage from The Guardian following the outage. The bad PR for Azure, however, resonated across more than just one publication.
Cloud adoption is increasing in the enterprise space, a trend bolstered by a long list of supporting predictions, forecasts, infographics and analyst quotes that make their way into the blogosphere on a daily basis. By default, this movement leads to increased dependency on these consolidated environments, which undoubtedly come with many advantages and with shortcomings to match.
The gap between traditional data center operators and those of tomorrow is steadily narrowing as the market shifts, but the bridge is still, and will most likely always remain, a work in progress.