Last week was not a good one for the platform business. Uber continues to receive bad publicity on multiple fronts, as noted in my post on Uber’s Defeat Device and Denial of Service (March 2017). And on Tuesday, a fat-fingered system admin at AWS managed to take out a significant chunk of the largest platform on the planet: a mistyped command disrupted the S3 storage service in the Northern Virginia (US-EAST-1) Region, seriously degrading online retail. According to one estimate, performance at over half of the top internet retailers was hit by 20 percent or more, and some websites were completely down.
What have we learned from this? Yahoo Finance tells us not to worry.
“The good news: Amazon has addressed the issue, and is working to ensure nothing similar happens again. … Let’s just hope … that Amazon doesn’t experience any further issues in the near future.”
Other commentators are not so optimistic. For Computer Weekly, this incident
“highlights the risk of running critical systems in the public cloud. Even the most sophisticated cloud IT infrastructure is not infallible.”
So perhaps one lesson is not to trust platforms. Or at least not to practice wilful blindness when your chosen platform or cloud provider represents a single point of failure.
One of the myths of cloud, according to Aidan Finn,
“is that you get disaster recovery by default from your cloud vendor (such as Microsoft and Amazon). Everything in the cloud is a utility, and every utility has a price. If you want it, you need to pay for it and deploy it, and this includes a scenario in which a data center burns down and you need to recover. If you didn’t design in and deploy a disaster recovery solution, you’re as cooked as the servers in the smoky data center.”
Interestingly, Amazon itself was relatively unaffected by Tuesday’s problem. This may have been because they split their deployment across multiple geographic regions. However, as Brian Guy points out, there are significant costs involved in multi-region deployment, as well as data protection issues. He also notes that this question is not (yet) addressed by Amazon’s architectural guidelines for AWS users, known as the Well-Architected Framework.
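To make the single-point-of-failure argument concrete, here is a minimal sketch of reading an object from a second region when the first is unavailable. It is an illustration only: the function name `read_with_failover` and the per-region reader callables are hypothetical, not part of any AWS SDK, and the sketch says nothing about the replication costs and data protection issues that Brian Guy raises.

```python
from typing import Callable, Dict, List, Tuple

# A reader takes an object key and returns its bytes, or raises on failure.
# In a real system each reader would wrap a storage client for one region;
# here they are plain callables so the sketch stays self-contained.
Reader = Callable[[str], bytes]


def read_with_failover(
    key: str,
    readers: Dict[str, Reader],
    order: List[str],
) -> Tuple[str, bytes]:
    """Try each region in the given order and return (region, data)
    from the first one that answers."""
    errors = []
    for region in order:
        try:
            return region, readers[region](key)
        except Exception as exc:  # any failure moves us on to the next region
            errors.append(f"{region}: {exc}")
    raise RuntimeError("all regions failed: " + "; ".join(errors))


if __name__ == "__main__":
    def primary_down(key: str) -> bytes:
        raise ConnectionError("region unavailable")

    def secondary_ok(key: str) -> bytes:
        return b"catalogue data for " + key.encode()

    region, data = read_with_failover(
        "products.json",
        {"us-east-1": primary_down, "us-west-2": secondary_ok},
        ["us-east-1", "us-west-2"],
    )
    print(region, data)
```

The fallback only helps if the data has already been copied to the second region, which is exactly the part you have to design in and pay for.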
Amazon recently added another pillar to the Well-Architected Framework, namely operational excellence. This includes such practices as performing operations with code: in other words, automating operations as much as possible. Did someone say Fat Finger?
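As a rough illustration of what performing operations with code might look like, the sketch below wraps a capacity-removal command in a guard that rejects obviously unsafe inputs before anything is touched. The names `CapacityGuard`, `min_servers` and `max_batch` are hypothetical and the limits are invented for the example; this is not Amazon's tooling, just one way a mistyped argument could be caught.

```python
from dataclasses import dataclass


class UnsafeOperation(Exception):
    """Raised when a requested change would breach a safety limit."""


@dataclass(frozen=True)
class CapacityGuard:
    min_servers: int  # hypothetical floor the fleet must never drop below
    max_batch: int    # hypothetical cap on servers removed by one command

    def check_removal(self, fleet_size: int, to_remove: int) -> int:
        """Return the remaining fleet size if the removal is allowed,
        otherwise raise UnsafeOperation."""
        if to_remove <= 0:
            raise UnsafeOperation("removal count must be positive")
        if to_remove > self.max_batch:
            raise UnsafeOperation(
                f"refusing to remove {to_remove} servers at once "
                f"(limit is {self.max_batch})"
            )
        remaining = fleet_size - to_remove
        if remaining < self.min_servers:
            raise UnsafeOperation(
                f"removal would leave {remaining} servers "
                f"(minimum is {self.min_servers})"
            )
        return remaining


if __name__ == "__main__":
    guard = CapacityGuard(min_servers=8, max_batch=2)
    print(guard.check_removal(fleet_size=12, to_remove=2))
    try:
        # A fat-fingered "20" instead of "2" is rejected before it runs.
        guard.check_removal(fleet_size=12, to_remove=20)
    except UnsafeOperation as exc:
        print("blocked:", exc)
```

The point of the design is that the check lives in the tool rather than in the operator's head, so a typo produces an error message instead of an outage.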
Abel Avram, The AWS Well-Architected Framework Adds Operational Excellence (InfoQ, 25 Nov 2016)
Julie Bort, The massive AWS outage hurt 54 of the top 100 internet retailers — but not Amazon (Business Insider, 1 March 2017)
Aidan Finn, How to Avoid an AWS-Style Outage in Azure (Petri, 6 March 2017)
Brian Guy, Analysis: Rethinking cloud architecture after the outage of Amazon Web Services (GeekWire, 5 March 2017)
Daniel Howley, Why you should still trust Amazon Web Services even though it took down the internet (Yahoo Finance, 6 March 2017)
Chris Mellor, Tuesday’s AWS S3-izure exposes Amazon-sized internet bottleneck (The Register, 1 March 2017)
Shaun Nichols, Amazon S3-izure cause: Half the web vanished because an AWS bod fat-fingered a command (The Register, 2 March 2017)
Cliff Saran, AWS outage shows vulnerability of cloud disaster recovery (Computer Weekly, 6 March 2017)