Monday 30 May 2011

Stuff to Watch from Surge 2010



Surge is a conference put on by OmniTI targeting practical scalability matters. OmniTI specializes in helping people solve their scalability problems, which is only natural given that it was founded by Theo Schlossnagle, author of the canonical Scalable Internet Architectures.


Now that Surge 2011 is on the horizon, they've generously made available nearly all the videos from the Surge 2010 conference. Hopefully every conference will follow that pattern (only don't wait a year, please); we lose a lot of collective wisdom when talks aren't made available online in a timely manner.

In truth, nearly all the talks are on topic and are worth watching, but here are a few that seem especially relevant:

  • Going 0 to 60: Scaling LinkedIn by Ruslan Belkin, Sr. Director of Engineering, LinkedIn.
    • Have you ever wondered what architectures a site like LinkedIn may have used, and what insights the teams have learned while growing the system from serving just a handful of users to close to a hundred million?
  • Scaling and Loadbalancing Wikia Across The World by Artur Bergman, VP of Engineering and Operations, Wikia.
    • Wikia hosts around 100,000 wikis using the open source MediaWiki software. In this talk I'll take a tour through the process of taking a legacy codebase and turning it into a globally distributed system.
  • Design for Scale - Patterns, Anti-Patterns, Successes and Failures by Christopher Brown, VP of Engineering, Opscode.
    • This isn't your "Gang of Four". Christopher will discuss his experiences building Amazon's EC2 and the Opscode Platform, and the experiences of others designing large-scale online services.
  • Quantifying Scalability FTW by Neil Gunther, Founder/Principal Consultant, Performance Dynamics.
    • Successful scalability requires transforming your data to quantify the cost-benefit of any architectural decisions. In other words: information = measurement + method.
  • Database Scalability Patterns by Robert Treat, Lead Database Architect, OmniTI.
    • In Database Scalability Patterns we will attempt to distill all of the information, hype, and discussion around scaling databases, and break down the common patterns we've seen.
  • From disaster to stability: scaling challenges of my.opera.com by Cosimo Streppone, Lead Developer, Opera Software.
    • This talk tells the story of these last 3 years. Our successes, our failures, and what remains to be done.
  • Embracing Concurrency at Scale by Justin Sheehy, CTO, Basho Technologies.
    • Justin will focus on methods for designing and building robust fundamentally-concurrent distributed systems. We will look at practices that are "common knowledge" but too often forgotten, at old lessons that the software industry at large has somehow missed, and at some general "good practices" and rules that must be thrown away when moving into a distributed and concurrent world.
  • Scalable Design Patterns by Theo Schlossnagle, Principal/CEO, OmniTI.
    • In this talk, we'll take a whirlwind tour through different patterns for scalable architecture design and focus on evaluating whether each is the right tool for the job. Topics include load balancing, networking, caching, operations management, code deployment, storage, service decoupling and data management (the RDBMS vs. noSQL argument).
  • Why Some Architects Almost Never Shard Their Applications by Baron Schwartz, VP of Consulting, Percona.
    • "Shard early, shard often" is common advice -- and it's often wrong. In reality, many systems don't have to be sharded.
  • Scaling myYearbook.com - Lessons Learned From Rapid Growth by Gavin M. Roy, Chief Technology Officer, MyYearbook.
    • In this talk Gavin will review the growing pains and methodologies used to handle the consistent growth and demand while affording the rapid development cycles required by the product development team.

Tuesday 24 May 2011

Updated AWS Security White Paper; New Risk and Compliance White Paper


We have updated the AWS Security White Paper and we've created a new Risk and Compliance White Paper.  Both are available now.


The AWS Security White Paper describes our physical and operational security principles and practices.


It includes a description of the shared responsibility model, a summary of our control environment, a review of secure design principles, and detailed information about the security and backup considerations related to each part of AWS including the Virtual Private Cloud, EC2, and the Simple Storage Service.


 

The new AWS Risk and Compliance White Paper covers a number of important topics including (again) the shared responsibility model, additional information about our control environment and how to evaluate it, and detailed information about our certifications and third-party attestations. A section on key compliance issues addresses a number of topics that we are asked about on a regular basis.


 

The AWS Security team and the AWS Compliance team are complementary organizations and are responsible for the security infrastructure, practices, and compliance programs described in these white papers. The AWS Security team is headed by our Chief Information Security Officer and is based outside of Washington, DC. Like most parts of AWS, this team is growing and has a number of open positions.



We also have a number of security-related positions open in Seattle.



-- Jeff;




    "

    Wednesday 11 May 2011

    How can academics do research on cloud computing?

    This week I'm in Napa for HotOS 2011 -- the premier workshop on operating systems. HotOS is in its 24th year -- it started as the Workshop on Workstation Operating Systems in 1987. More on HotOS in a forthcoming blog post, but for now I wanted to comment on a very lively discussion that took place during the panel session yesterday.

    The panel consisted of Mendel Rosenblum from Stanford (and VMware, of course); Rebecca Isaacs from Microsoft Research; John Wilkes from Google; and Ion Stoica from Berkeley. The charge to the panel was to discuss the gap between academic research in cloud computing and the realities faced by industry. This came about in part because a bunch of cloud papers were submitted to HotOS from academic research groups. In some cases, the PC felt that the papers were trying to solve the wrong problems, or were making incorrect assumptions about the state of cloud computing in the real world. We thought it would be interesting to hear from both academic and industry representatives about whether and how academic researchers can hope to do work on the cloud, given that there's no way for a university to build something at the scale and complexity of a real-world cloud platform. The concern is that academics will be relegated to working on little problems at the periphery, or will come up with only toy solutions.

    The big challenge, as I see it, is how to enable academics to do interesting and relevant work on the cloud when it's nearly impossible to build up the infrastructure in a university setting. John Wilkes made the point that he never wanted to see another paper submission showing a 10% performance improvement in Hadoop, and he's right -- this is not the right problem for academics to be working on. Not because a 10% improvement isn't useful, or because Hadoop is a bad platform, but because those kinds of problems are already being solved by industry. In my opinion, the best role for academia is to open up new areas and look well beyond where industry is working. But this is often at odds with the desire for academics to work on 'industry relevant' problems, as well as to get funding from industry. Too often I think academics fall into the trap of working on things that might as well be done at a company.

    Much of the debate at HotOS centered on the industry vs. academia divide, and a fair bit of it was targeted at my previous blog posts on this topic. Timothy Roscoe argued that academia's role was to shed light on complex problems and gain understanding, not just to engineer solutions. I agree with this. Sometimes at Google, I feel that we are in such a rush to implement that we don't take the time to understand the problems deeply enough: build something that works and move on to the next problem. Of course, you have to move fast in industry. The pace is very different from academia, where a PhD student needs to spend multiple years focused on a single problem to get a dissertation written about it.

    We're not there yet, but there are some efforts to open up cloud infrastructure to academic research. OpenCirrus is a testbed supported by HP, Intel, and Yahoo! with more than 10,000 cores that academics can use for systems research. Microsoft has opened up its Azure cloud platform for academic research. Only one person at HotOS raised their hand when asked if anyone was using this -- this is really unfortunate. (My theory is that academics have an allergic reaction to programming in C# and Visual Studio, which is too bad, since this is a really great platform if you can get over the toolchain.) Google is offering a billion core hours through its Exacycle program, and Amazon has a research grant program as well.

    Providing infrastructure is only one part of the solution. Knowing what problems to work on is the other. Many people at HotOS bemoaned the fact that companies like Google are so secretive about what they're doing, and it's hard to learn what the 'real' challenges are from the outside. My answer to this is to spend time at Google as a visiting scientist, and send your students to do internships. Even though it might not result in a publication, I can guarantee you will learn a tremendous amount about what the hard problems are in cloud computing and where the great opportunities are for academic work. (Hell, my mind was blown after my first couple of days at Google. It's like taking the red pill.)

    A few things that jump to mind as ripe areas for academic research on the cloud:
    • Understanding and predicting performance at scale, with uncertain workloads and frequent node failures.
    • Managing workloads across multiple datacenters with widely varying capacity, occasional outages, and constrained inter-datacenter network links.
    • Building failure recovery mechanisms that are robust to massive correlated outages. (This is what brought down Amazon's EC2 a few weeks ago.)
    • Debugging large-scale cloud applications: tools to collect, visualize, and inspect the state of jobs running across many thousands of cores.
    • Managing dependencies in a large codebase that relies upon a wide range of distributed services like Chubby and GFS.
    • Handling both large-scale upgrades to computing capacity as well as large-scale outages seamlessly, without having to completely shut down your service and everything it depends on.
    "

    Tuesday 10 May 2011

    EMC Makes a Big Bet on Hadoop


    EMC is throwing its weight behind Hadoop. Today, at EMC World, the storage giant announced a slew of Hadoop-centric products, including a specialized appliance for Hadoop-based big data analytics and two separate Hadoop distributions. EMC’s entry is most definitely going to shake up the Hadoop and database markets. EMC is now the largest company actively pushing its own Hadoop distribution, and it has an appliance that will put EMC out in front of analytics vendors such as Oracle and Teradata when it comes to handling unstructured data.


    EMC’s flagship Hadoop distribution is called Greenplum HD Enterprise Edition. EMC describes it as “a 100 percent interface-compatible implementation of the Apache Hadoop stack” that also includes enterprise-grade features such as snapshots and wide-area replication, a native network file system, and integrated storage and cluster management capabilities. The company also claims performance improvements of two to five times over the standard Apache Hadoop distribution.


    MapR Magic


    It’s noteworthy that many of these capabilities are also available in startup MapR’s HDFS alternative, and that MapR CEO John Schroeder took the stage at a morning EMC World press conference announcing the news. EMC Greenplum’s Luke Lonergan wouldn’t confirm to me that EMC’s Enterprise Edition will use MapR as the primary storage engine, but it’s not too difficult to connect the dots.


    However, while the Enterprise Edition is proprietary in part, the Greenplum HD Community Edition is fully open source and still makes big improvements over what’s currently available with the Apache version. In fact, Lonergan told me, Community Edition is based on Facebook’s optimized version of Hadoop. Like Cloudera’s distribution for Hadoop, Community Edition pre-integrates Hadoop MapReduce, Hadoop Distributed File System, HBase, ZooKeeper and Hive, but it also includes fault tolerance for the NameNode in HDFS and the JobTracker node in Hadoop MapReduce. These improvements are underway within Apache thanks to Yahoo, but they’re not included in any official release yet.


    Too Much Hadoop?


    I asked a couple of weeks ago whether the Hadoop-distribution market could handle all the players it now hosts, and now that question is even more pressing. As Luke Lonergan put it during the press conference, EMC is an “8,000-pound elephant” in the Hadoop space, and that should make Cloudera, IBM, DataStax and (possibly) Yahoo seek higher ground.


    For Cloudera, EMC is a major threat because it competes directly against Cloudera’s open-source and proprietary products. It even has partnerships with a large number of business intelligence and other up-the-stack vendors, some of which already are Cloudera partners. These include Concurrent, CSC, Datameer, Informatica, Jaspersoft, Karmasphere, MicroStrategy, Pentaho, SAS, SnapLogic, Talend, and VMware.


    Oh, and Cloudera and Greenplum have an existing integration partnership. As Lonergan noted, “This definitely marks a change [in that relationship].” The two are now competitors, after all.


    EMC vs Big Blue


    IBM is still the largest company involved in selling Hadoop products, but it has not yet announced its official Hadoop distribution. EMC’s Hadoop distributions will be available later this quarter. I noted recently how EMC is following IBM’s lead in acquiring capabilities across the big data stack — from Hadoop to predictive analytics — and today’s news further proves how competitive the two storage heavyweights might become in the analytics space, too.


    IBM isn’t the only big-name vendor that should be worried about EMC’s new Hadoop-heavy plans, though. The EMC Greenplum HD Data Computing Appliance should make appliance makers Oracle and Teradata, as well as analytic database vendors such as HP, ParAccel and others, quite nervous. The appliance is like the existing EMC Greenplum Data Computing Appliance, only it lets customers process Hadoop data within the same system as their Greenplum analytic database. Presently, most analytic databases and appliances integrate with Hadoop, but still suffer from the latency of having to send data over the network from Hadoop clusters to the database and back.


    IBM already has integrated Hadoop with its other big data tools, including with InfoSphere BigInsights, Watson and Cognos Consumer Insight, and I have to believe a version of its Netezza analytics appliance with Hadoop co-processing will be on the way shortly, possibly in conjunction with its official Hadoop distribution release.


    Lonergan also noted that EMC is working closely with VMware, of which EMC is the majority stockholder, on integrating EMC’s Hadoop products with VMware’s virtualization and cloud products, as well as its GemStone distributed database software.


    There still will be opportunities for community collaboration among all the open source Hadoop distributions — Cloudera, DataStax Brisk and EMC Greenplum HD Community Edition — but we’ll see how willing they are to work together now that the competition has really heated up. All of a sudden, EMC looks like the strongest Hadoop company going, and everyone else needs to figure out in a hurry how they’ll counter today’s landscape-altering news.



    Comet - An Example of the New Key-Code Databases

    Comet is an active distributed key-value store built at the University of Washington. The paper describing Comet is Comet: An active distributed key-value store; there are also slides and an MP3 of a presentation given at OSDI '10. Here's a succinct overview of Comet:

    Today's cloud storage services, such as Amazon S3 or peer-to-peer DHTs, are highly inflexible and impose a variety of constraints on their clients: specific replication and consistency schemes, fixed data timeouts, limited logging, etc. We witnessed such inflexibility first-hand as part of our Vanish work, where we used a DHT to store encryption keys temporarily. To address this issue, we built Comet, an extensible storage service that allows clients to inject snippets of code that control their data's behavior inside the storage service.

    I found this paper quite interesting because it takes the initial steps of collocating code with a key-value store, which turns it into what might be called a key-code store. This is something I've been exploring as a way of moving behavior to data in order to overcome network limitations in the cloud and provide other benefits. An innovator in this area is the Alchemy Database, which has already combined Redis and Lua. A good platform for this sort of thing might be Node.js integrated with V8, which would allow complex JavaScript programs to run in an efficient evented container. There are a lot of implications to this sort of architecture, more about that later, but the Comet paper describes a very interesting start.

    From the abstract and conclusion:

    This paper described Comet, an active distributed key value store. Comet enables clients to customize a distributed storage system in application-specific ways using Comet’s active storage objects. By supporting ASOs, Comet allows multiple applications with diverse requirements to share a common storage system. We implemented Comet on the Vuze DHT using a severely restricted Lua language sandbox for handler programming. Our measurements and experience demonstrate that a broad range of behaviors and customizations are possible in a safe, but active, storage environment.
    Distributed key-value storage systems are widely used in corporations and across the Internet. Our research seeks to greatly expand the application space for key-value storage systems through application-specific customization. We designed and implemented Comet, an extensible, distributed key-value store. Each Comet node stores a collection of active storage objects (ASOs) that consist of a key, a value, and a set of handlers. Comet handlers run as a result of timers or storage operations, such as get or put, allowing an ASO to take dynamic, application-specific actions to customize its behavior. Handlers are written in a simple sandboxed extension language, providing properties of safety and isolation.
    We implemented a Comet prototype for the Vuze DHT, deployed Comet nodes on Vuze from PlanetLab, and built and evaluated over a dozen Comet applications. Our experience demonstrates that simple, safe, and restricted extensibility can significantly increase the power and range of applications that can run on distributed active storage systems. This approach facilitates the sharing of a single storage system by applications with diverse needs, allowing them to reap the consolidation benefits inherent in today’s massive clouds.
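
    To make the ASO idea concrete, here is a minimal toy sketch in plain Python. It is purely illustrative and assumes nothing about Comet's real API: the class names and handler signatures are invented, and actual Comet handlers run inside a severely restricted Lua sandbox rather than as arbitrary host-language code.

        import time

        class ActiveObject:
            """A toy 'active storage object': a value plus optional handlers."""
            def __init__(self, value, on_get=None, on_put=None):
                self.value = value
                self.on_get = on_get      # called as on_get(obj) before a read returns
                self.on_put = on_put      # called as on_put(obj, new_value) before a write
                self.created = time.time()
                self.gets = 0

        class ActiveStore:
            """A toy key-code store: keys map to objects carrying their own behavior."""
            def __init__(self):
                self._objects = {}

            def put(self, key, value, on_get=None, on_put=None):
                existing = self._objects.get(key)
                if existing is not None and existing.on_put is not None:
                    existing.on_put(existing, value)   # let the object react to the overwrite
                self._objects[key] = ActiveObject(value, on_get, on_put)

            def get(self, key):
                obj = self._objects.get(key)
                if obj is None:
                    return None
                obj.gets += 1
                if obj.on_get is not None and obj.on_get(obj) is False:
                    return None                        # the handler suppressed the read
                return obj.value

        # Example handler in the spirit of Vanish: the key refuses reads after a
        # fixed lifetime, instead of relying on the store's one-size-fits-all timeout.
        def expire_after_60s(obj):
            return (time.time() - obj.created) < 60

        store = ActiveStore()
        store.put("session-key", "0xDEADBEEF", on_get=expire_after_60s)
        print(store.get("session-key"))   # the value while young, None once expired

    The point of the sketch is only the shape: each object bundles data with a bit of code, so applications with different replication, timeout, or logging needs can share the same store.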

    Related Articles
    Roxana Geambasu, Ph.D. Candidate
    Scriptable Object Cache by Rohit Karlupia

    Monday 9 May 2011

    Presentation: More Best Practices for Large-Scale Websites: Lessons from eBay


    Summary
    Randy Shoup shares 10 lessons learned from eBay: Partition Everything, Asynchrony Everywhere, Automate Everything, Everything Fails, Embrace Inconsistency, Expect (R)evolution, Dependencies Matter, Respect Authority, Never Enough Data, Custom Infrastructure.

    Bio
    Randy Shoup is a Distinguished Architect in the eBay Marketplace Architecture group. Since 2004, he has been the primary architect for eBay's search infrastructure. Prior to eBay, Randy was Chief Architect at Tumbleweed Communications, and has also held a variety of software development and architecture roles at Oracle and Informatica.


    Monday 2 May 2011

    Stack Overflow Makes Slow Pages 100x Faster by Simple SQL Tuning



    Source: highscalability.com

    The most common complaint against NoSQL is that if you know how to write good SQL queries, then SQL works fine; if SQL is slow, you can always tune it and make it faster. A great example of this incremental improvement process was written up by StackExchange's Sam Saffron in A day in the life of a slow page at Stack Overflow, where he shows that through profiling and SQL tuning it was possible to reduce page load times from 630ms to 40ms for some pages, and for other pages the improvement was 100x.


    Sam provides a lot of wonderful detail of his tuning process, how it works, the thought process, the tools used, and the tradeoffs involved. Here's a short summary of the steps:
    1. Using their mini-profiler, Sam found that a badge detail page was taking 630.3 ms to load, with 298.1 ms of that spent on SQL queries; the tool also listed the SQL queries for the page and how long each took.
    2. From the historical logs, which HAProxy stores on a month-by-month basis, Sam determined that this page is accessed 26,000 times a day and takes 532ms on average to render. That is too long; there are probably higher-value problems to solve, but Google takes speed into account for page rank, and Sam thought the page could be made faster.
    3. Sam noticed:
      1. There were lots of select queries, which Sam calls the N+1 Select Problem: many individual detail selects are issued to gather all the data for a single master record (see the sketch after this list). More details at: Hibernate Pitfall: Why Relationships Should Be Lazy.
      2. Half the time was spent on the web server.
      3. There were some expensive queries.
    4. A code review showed the code uses a LINQ-2-SQL multi-join. LINQ-2-SQL takes a high-level ORM description and generates SQL code from it; here the generated code was slow, causing a 10x slowdown in production.
    5. The strategy was to remove the ORM overhead by rewriting the query with Dapper, a simple .NET object mapper. The resulting code is faster and more debuggable. LINQ-2-SQL is being removed from selects, but is still used for writes on Stack Overflow.
    6. In production this query was taking too long, so the next step was to look at the query plan, which showed a table scan was being used instead of an index. A new index was created which cut the page load time by a factor of 10.
    7. The N+1 problem was fixed by changing the ViewModel to use a left join which pulled all the records in one query.
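
    To make steps 3, 6 and 7 concrete, here is a minimal, self-contained sketch. It uses Python's built-in sqlite3 rather than the SQL Server / LINQ-2-SQL / Dapper stack Stack Overflow actually runs on, and the badge schema, table names, and index name are all invented for illustration.

        import sqlite3

        # In-memory stand-in for a badges schema (invented for this sketch).
        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE badges (id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE users_badges (user_id INTEGER, badge_id INTEGER, granted_at TEXT);
        """)
        conn.executemany("INSERT INTO badges VALUES (?, ?)",
                         [(i, "badge-%d" % i) for i in range(100)])
        conn.executemany("INSERT INTO users_badges VALUES (?, ?, ?)",
                         [(u, u % 100, "2011-05-01") for u in range(10000)])

        badge_id = 42

        # N+1 style (step 3.1): one query for the master row, then one more query
        # per detail row -- every extra round trip adds latency to the page.
        badge = conn.execute("SELECT id, name FROM badges WHERE id = ?",
                             (badge_id,)).fetchone()
        user_ids = [r[0] for r in conn.execute(
            "SELECT user_id FROM users_badges WHERE badge_id = ?", (badge_id,))]
        details = [conn.execute(
            "SELECT granted_at FROM users_badges WHERE badge_id = ? AND user_id = ?",
            (badge_id, uid)).fetchone() for uid in user_ids]

        # Single-query style (step 7): a left join pulls everything the page needs
        # in one round trip.
        rows = conn.execute("""
            SELECT b.name, ub.user_id, ub.granted_at
            FROM badges b
            LEFT JOIN users_badges ub ON ub.badge_id = b.id
            WHERE b.id = ?
        """, (badge_id,)).fetchall()

        # Index fix (step 6): inspect the plan, then add the index the scan was missing.
        print(conn.execute(
            "EXPLAIN QUERY PLAN SELECT * FROM users_badges WHERE badge_id = 42").fetchall())
        conn.execute("CREATE INDEX idx_users_badges_badge_id ON users_badges (badge_id)")
        print(conn.execute(
            "EXPLAIN QUERY PLAN SELECT * FROM users_badges WHERE badge_id = 42").fetchall())
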
    A NoSQLite might counter that this is exactly what a key-value database is for: all that data could have been retrieved in one get, no tuning, problem solved. The counter to that is you then lose all the benefits of a relational database, and as shown here the original design was fast enough and could be made very fast through a simple tuning process, so there is no reason to go NoSQL.
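
    For contrast, here is the single-get approach the NoSQL side is arguing for, again as a purely illustrative sketch (a plain dict stands in for any key-value store, and the document shape is invented): denormalize everything the badge page needs under one key, so a read is a single lookup instead of several queries. The cost is that every write path then has to keep that denormalized document consistent.

        # A plain dict standing in for a key-value store (illustrative only).
        kv_store = {}

        # At write time, store the whole badge page document under one key.
        kv_store["badge:42"] = {
            "name": "badge-42",
            "awarded": [
                {"user_id": 7,  "granted_at": "2011-05-01"},
                {"user_id": 19, "granted_at": "2011-05-02"},
            ],
        }

        # At read time, one get replaces the joins and detail queries.
        page_data = kv_store["badge:42"]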


    The Updated Big List of Articles on the Amazon Outage


    Source: Highscalability.com

    Since The Big List Of Articles On The Amazon Outage was published, we've had a few updates that people might not have seen. Amazon of course released their Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region. Netflix shared their Lessons Learned from the AWS Outage, as did Heroku (How Heroku Survived the Amazon Outage), SmugMug (How SmugMug survived the Amazonpocalypse), and SimpleGeo (How SimpleGeo Stayed Up During the AWS Downtime).


    The curious thing from my perspective is the general lack of response to Amazon's explanation. I expected more discussion, but there's been almost none that I've seen. My guess is that very few people understand what Amazon was talking about well enough to comment, whereas almost everyone feels qualified to talk about the event itself.


    Lesson for crisis handlers: deep dive post-mortems that are timely, long, honestish, and highly technical are the most effective means of staunching the downward spiral of media attention.


    Amazon's Explanation of What Happened


    Experiences From Specific Companies, Both Good And Bad

    Amazon Web Services Discussion Forum

    A fascinating peek into the experiences of people who were dealing with the outage while they were experiencing it. Great real-time social archeology in action.

    There were also many, many instances of support and help in the log.

    In Summary

    Taking Sides: It's The Customer's Fault

    Taking Sides: It's Amazon's Fault

    Lessons Learned And Other Insight Articles

    Vendor's Vent