Since The Big List Of Articles On The Amazon Outage was published we've had a few updates that people might not have seen. Amazon of course released their Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region. Netflix shared their Lessons Learned from the AWS Outage, as did Heroku (How Heroku Survived the Amazon Outage), SmugMug (How SmugMug survived the Amazonpocalypse), and SimpleGeo (How SimpleGeo Stayed Up During the AWS Downtime).
The curious thing from my perspective is the general lack of response to Amazon's explanation. I expected more discussion. There's been almost none that I've seen. My guess is that very few people understand what Amazon was talking about well enough to comment, whereas almost everyone feels qualified to talk about the event itself.
Lesson for crisis handlers: deep dive post-mortems that are timely, long, honestish, and highly technical are the most effective means of staunching the downward spiral of media attention.
Amazon's Explanation Of What Happened
- Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region
- Hacker News thread on AWS Service Disruption Post Mortem
- Quite Funny Commentary on the Summary
Experiences From Specific Companies, Both Good And Bad
- Lessons Netflix Learned from the AWS Outage by several Netflixians on the Netflix Tech Blog
- How Heroku Survived the Amazon Outage on the Heroku status page
- How SimpleGeo Stayed Up During the AWS Downtime by Mike Malone
- How SmugMug survived the Amazonpocalypse by Don MacAskill (Hacker News discussion)
- How Bizo survived the Great AWS Outage of 2011 relatively unscathed... by Someone at Bizo
- Joe Stump's explanation of how SimpleGeo survived
- How Netflix Survived the Outage
- Why Twilio Wasn’t Affected by Today’s AWS Issues on Twilio Engineering's Blog (Hacker News thread)
- On reddit's outage
- What caused the Quora problems/outage in April 2011?
- Recovering from Amazon cloud outage by Drew Engelson of PBS.
- PBS was affected for a while primarily because we do use EBS-backed RDS databases. Despite being spread across multiple availability-zones, we weren’t easily able to launch new resources ANYWHERE in the East region since everyone else was trying to do the same. I ended up pushing the RDS stuff out West for the time being. From Comment
Amazon Web Services Discussion Forum
A fascinating peek into the experiences of people who were dealing with the outage while they were experiencing it. Great real-time social archeology in action.
- Amazon Web Services Discussion Forum
- Cost-effective backup plan from now on?
- Life of our patients is at stake - I am desperately asking you to contact
- Why did the EBS, RDS, Cloudformation, Cloudwatch and Beanstalk all fail?
- Moved all resources off of AWS
- Any success stories?
- Is the mass exodus from East going to cause demand problems in the West?
- Finally back online after about 71 hours
- Amazon EC2 features vs windows azure
- Aren't Availability Zones supposed to be "insulated from failures"?
- What a lot of people aren't realizing about the downtime:
- ELB CNAME
- Availability Zones were used in a misleading manner
- Tip: How to recover your instance
- Crying in Forum Gets Results, Silver-level AWS Premium Support Doesn't
- Well-worth reading: "design for failure" cloud deployment strategy
- New best practice
- Don't bother with Premium Support
- Best practices for multi-region redundancy
- "Postmortum"
- Learning from this case
- Amazon, still no instructions what to do?
- Anyone else prepared for an all-nighter?
- Is Jeff Bezos going to give a public statement?
- Rackspace, GoGrid, StormonDemand and Others
- Jeff Barr, Werner Vogels and other AWS persons - where have you been???
- After you guys fix EBS do I have do anything on my side?
- Need Help!!! Lives of people and billions in revenue are at risk now!!!
- I've Got A Suspicion
- Farewell EC2, Farewell
There were also many, many instances of support and help in the log.
In Summary
- Amazon EC2 outage: summary and lessons learned by RightScale
- AWS outage timeline & downtimes by recovery strategy by Eric Kidd
- The Aftermath of Amazon’s Cloud Outage by Rich Miller
Taking Sides: It's The Customer's Fault
- So Your AWS-based Application is Down? Don’t Blame Amazon by The Storage Architect
- The Cloud is not a Silver Bullet by Joe Stump (Hacker News thread)
- The AWS Outage: The Cloud's Shining Moment by George Reese (Hacker News discussion)
- Failing to Plan is Planning to Fail by Ted Theodoropoulos
- Get a life and build redundancy/resiliency in your apps on the Cloud Computing group
Taking Sides: It's Amazon's Fault
- Stop Blaming the Customers - the Fault is on Amazon Web Services by Klint Finley
- AWS is down: Why the sky is falling by Justin Santa Barbara (Hacker News thread)
- Amazon Web Services are down - Huge Hacker News thread
Lessons Learned And Other Insight Articles
- Amazon’s EBS outage by Robin Harris of StorageMojo
- People Using Amazon Cloud: Get Some Cheap Insurance At Least by Bob Warfield
- Basic scalability principles to avert downtime by Ronald Bradford
- Amazon crash reveals 'cloud' computing actually based on data centers by Kevin Fogarty
- Seven lessons to learn from Amazon's outage by Phil Wainewright
- The Cloud and Outages : Five Key Lessons by Patrick Baillie (Cloud Computing Group discussion)
- Some thoughts on outages by Till Klampaeckel
- Amazon.com’s real problem isn’t the outage, it’s the communication by Keith Smith
- How to work around Amazon EC2 outages by James Cohen (Hacker News thread)
- Today’s EC2 / EBS Outage: Lessons learned on Agile Sysadmin
- Amazon EC2 has gone down -what would a prefered hosting platform be? on Focus
- Single Points of Failure by Mat
- Coping with Cloud Downtime with Puppet
- Amazon Outage Concerns Are Overblown by Tim Crawford
- Where There Are Clouds, It Sometimes Rains by Clay Loveless
- Availability, redundancy, failover and data backups at LearnBoost by Guillermo Rauch
- Cloud hosting vs colocation by Chris Chandler (Hacker News thread)
- Amazon’s EC2 & EBS outage by Arnon Rotem-Gal-Oz
- Complex Systems Have Complex Failures. That’s Cloud Computing by Greg Ferro
- Amazon Web Services, Hosting in the Cloud and Configuration Management by Ian Chilton
- Lessons learned from deploying a production database in EC2 by Grig Gheorghiu of Agile Testing
- Bezos on Amazon as a technology and invention company by John Gruber on Daring Fireball.
Vendor's Vent
- Amazon Outage Proves Value of Riak’s Vision by Basho
- Magical Block Store: When Abstractions Fail Us by Mark Mayo of Joyent (Hacker News discussion)
- On Cascading Failures and Amazon’s Elastic Block Store by Jason
- An unofficial EC2 outage postmortem - the sky is not falling from CloudHarmony
- Cloudfail: Lessons Learned from AWS Outage by Jyoti Bansal
Summary Of The Amazon EC2 And Amazon RDS Service Disruption In The US East Region
- A network change was performed as part of our normal AWS scaling activities in a single Availability Zone in the US East Region. The configuration change was to upgrade the capacity of the primary network. During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network.
- When this network connectivity issue occurred, a large number of EBS nodes in a single EBS cluster lost connection to their replicas. When the incorrect traffic shift was rolled back and network connectivity was restored, these nodes rapidly began searching the EBS cluster for available server space where they could re-mirror data. Once again, in a normally functioning cluster, this occurs in milliseconds. In this case, because the issue affected such a large number of volumes concurrently, the free capacity of the EBS cluster was quickly exhausted, leaving many of the nodes “stuck” in a loop, continuously searching the cluster for free space. This quickly led to a “re-mirroring storm,” where a large number of volumes were effectively “stuck” while the nodes searched the cluster for the storage space it needed for its new replica. At this point, about 13% of the volumes in the affected Availability Zone were in this “stuck” state.
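To make the capacity exhaustion described above a little more concrete, here is a minimal sketch in Python. It is a toy model, not Amazon's implementation: the function names, the one-slot-per-replica rule, and every number in it are assumptions picked for illustration (the figures are chosen so the large failure lands on the same 13% mentioned in the summary, not taken from any Amazon data). It only shows why a failure that is harmless at small scale leaves volumes stuck once the number of volumes needing a new replica exceeds the cluster's free space.

```python
from dataclasses import dataclass


@dataclass
class RemirrorResult:
    remirrored: int  # volumes that found free space for a new replica
    stuck: int       # volumes left searching the cluster for free space


def simulate_remirror(volumes_needing_replica: int, free_slots: int) -> RemirrorResult:
    """Toy model: each volume that lost its replica claims one free slot, if any remain.

    Assumption: one slot holds exactly one replica, and no new capacity
    is added while the search is running.
    """
    remirrored = 0
    stuck = 0
    for _ in range(volumes_needing_replica):
        if free_slots > 0:
            free_slots -= 1   # space found, re-mirroring can proceed
            remirrored += 1
        else:
            stuck += 1        # cluster is full, this volume keeps searching
    return RemirrorResult(remirrored, stuck)


def stuck_fraction(total_volumes: int, volumes_needing_replica: int, free_slots: int) -> float:
    """Share of all volumes in the zone that end up stuck."""
    return simulate_remirror(volumes_needing_replica, free_slots).stuck / total_volumes


if __name__ == "__main__":
    # Hypothetical zone: 10,000 volumes, room for 500 extra replicas.
    print("small failure:", simulate_remirror(20, 500))
    print("large failure:", simulate_remirror(1800, 500))
    print(f"stuck share: {stuck_fraction(10_000, 1800, 500):.0%}")
```

In the small case every volume finds space and nothing is stuck. In the large case the free space runs out after the first 500 volumes, and everything behind them is left searching, which is the "stuck" state the summary describes.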