Monday 30 May 2011

Stuff to Watch from Surge 2010



Surge is a conference put on by OmniTI targeting practical scalability matters. OmniTI specializes in helping people solve their scalability problems, which is only natural given that it was founded by Theo Schlossnagle, author of the canonical Scalable Internet Architectures.


Now that Surge 2011 is on the horizon, they've generously made available nearly all the videos from the Surge 2010 conference. Hopefully every conference will follow that pattern (only don't wait a year, please); we lose a lot of collective wisdom when talks aren't made available online in a timely manner.

In truth, nearly all the talks are on topic and are worth watching, but here are a few that seem especially relevant:

  • Going 0 to 60: Scaling LinkedIn by Ruslan Belkin, Sr. Director of Engineering, LinkedIn.
    • Have you ever wondered what architectures a site like LinkedIn may have used, and what insights the teams have learned while growing the system from serving just a handful of users to close to a hundred million?
  • Scaling and Loadbalancing Wikia Across The World by Artur Bergman, VP of Engineering and Operations, Wikia.
    • Wikia hosts around 100,000 wikis using the open source MediaWiki software. In this talk I'll take a tour through the process of taking a legacy codebase and turning it into a globally distributed system.
  • Design for Scale - Patterns, Anti-Patterns, Successes and Failures by Christopher Brown, VP of Engineering, Opscode.
    • This isn't your "Gang of Four". Christopher will discuss his experiences building Amazon's EC2 and the Opscode Platform, and the experiences of others designing large-scale online services.
  • Quantifying Scalability FTW by Neil Gunther, Founder/Principal Consultant, Performance Dynamics.
    • Successful scalability requires transforming your data to quantify the cost-benefit of any architectural decisions. In other words: information = measurement + method.
  • Database Scalability Patterns by Robert Treat, Lead Database Architect, OmniTI.
    • In Database Scalability Patterns we will attempt to distill all of the information, hype, and discussion around scaling databases, and break down the common patterns we've seen.
  • From disaster to stability: scaling challenges of my.opera.com by Cosimo Streppone, Lead Developer, Opera Software.
    • This talk tells the story of these last 3 years. Our successes, our failures, and what remains to be done.
  • Embracing Concurrency at Scale by Justin Sheehy, CTO, Basho Technologies.
    • Justin will focus on methods for designing and building robust fundamentally-concurrent distributed systems. We will look at practices that are "common knowledge" but too often forgotten, at old lessons that the software industry at large has somehow missed, and at some general "good practices" and rules that must be thrown away when moving into a distributed and concurrent world.
  • Scalable Design Patterns by Theo Schlossnagle, Principal/CEO, OmniTI.
    • In this talk, we'll take a whirlwind tour through different patterns for scalable architecture design and focus on evaluating whether each is the right tool for the job. Topics include load balancing, networking, caching, operations management, code deployment, storage, service decoupling and data management (the RDBMS vs. noSQL argument).
  • Why Some Architects Almost Never Shard Their Applications by Baron Schwartz, VP of Consulting, Percona.
    • "Shard early, shard often" is common advice -- and it's often wrong. In reality, many systems don't have to be sharded.
  • Scaling myYearbook.com - Lessons Learned From Rapid Growth by Gavin M. Roy, Chief Technology Officer, MyYearbook.
    • In this talk Gavin will review the growing pains and methodologies used to handle the consistent growth and demand while affording the rapid development cycles required by the product development team.

Tuesday 24 May 2011

Updated AWS Security White Paper; New Risk and Compliance White Paper


We have updated the AWS Security White Paper and we've created a new Risk and Compliance White Paper.  Both are available now.


The AWS Security White Paper describes our physical and operational security principles and practices.


It includes a description of the shared responsibility model, a summary of our control environment, a review of secure design principles, and detailed information about the security and backup considerations related to each part of AWS including the Virtual Private Cloud, EC2, and the Simple Storage Service.


 

The new AWS Risk and Compliance White Paper covers a number of important topics including (again) the shared responsibility model, additional information about our control environment and how to evaluate it, and detailed information about our certifications and third-party attestations. A section on key compliance issues addresses a number of topics that we are asked about on a regular basis.


 

The AWS Security team and the AWS Compliance team are complementary organizations and are responsible for the security infrastructure, practices, and compliance programs described in these white papers. The AWS Security team is headed by our Chief Information Security Officer and is based outside of Washington, DC. Like most parts of AWS, this team is growing and has a number of open positions.



We also have a number of security-related positions open in Seattle.



-- Jeff;




    "

    Wednesday 11 May 2011

    How can academics do research on cloud computing?

    This week I'm in Napa for HotOS 2011 -- the premier workshop on operating systems. HotOS is in its 24th year -- it started as the Workshop on Workstation Operating Systems in 1987. More on HotOS in a forthcoming blog post, but for now I wanted to comment on a very lively discussion that took place during the panel session yesterday.

    The panel consisted of Mendel Rosenblum from Stanford (and VMware, of course); Rebecca Isaacs from Microsoft Research; John Wilkes from Google; and Ion Stoica from Berkeley. The charge to the panel was to discuss the gap between academic research in cloud computing and the realities faced by industry. This came about in part because a bunch of cloud papers were submitted to HotOS from academic research groups. In some cases, the PC felt that the papers were trying to solve the wrong problems, or were making incorrect assumptions about the state of cloud computing in the real world. We thought it would be interesting to hear from both academic and industry representatives about whether and how academic researchers can hope to do work on the cloud, given that there's no way for a university to build something at the scale and complexity of a real-world cloud platform. The concern is that academics will be relegated to working on little problems at the periphery, or will come up with only toy solutions.

    The big challenge, as I see it, is how to enable academics to do interesting and relevant work on the cloud when it's nearly impossible to build up the infrastructure in a university setting. John Wilkes made the point that he never wanted to see another paper submission showing a 10% performance improvement in Hadoop, and he's right -- this is not the right problem for academics to be working on. Not because a 10% improvement isn't useful, or because Hadoop is a bad platform, but because those kinds of problems are already being solved by industry. In my opinion, the best role for academia is to open up new areas and look well beyond where industry is working. But this is often at odds with the desire for academics to work on 'industry relevant' problems, as well as to get funding from industry. Too often I think academics fall into the trap of working on things that might as well be done at a company.

    Much of the debate at HotOS centered on the industry vs. academia divide, and a fair bit of it was targeted at my previous blog posts on this topic. Timothy Roscoe argued that academia's role was to shed light on complex problems and gain understanding, not just to engineer solutions. I agree with this. Sometimes at Google, I feel that we are in such a rush to implement that we don't take the time to understand the problems deeply enough: build something that works and move on to the next problem. Of course, you have to move fast in industry. The pace is very different from academia, where a PhD student needs to spend multiple years focused on a single problem to get a dissertation written about it.

    We're not there yet, but there are some efforts to open up cloud infrastructure to academic research. OpenCirrus is a testbed supported by HP, Intel, and Yahoo! with more than 10,000 cores that academics can use for systems research. Microsoft has opened up its Azure cloud platform for academic research. Only one person at HotOS raised their hand when asked if anyone was using this -- this is really unfortunate. (My theory is that academics have an allergic reaction to programming in C# and Visual Studio, which is too bad, since this is a really great platform if you can get over the toolchain.) Google is offering a billion core hours through its Exacycle program, and Amazon has a research grant program as well.

    Providing infrastructure is only one part of the solution. Knowing what problems to work on is the other. Many people at HotOS bemoaned the fact that companies like Google are so secretive about what they're doing, and it's hard to learn what the 'real' challenges are from the outside. My answer to this is to spend time at Google as a visiting scientist, and send your students to do internships. Even though it might not result in a publication, I can guarantee you will learn a tremendous amount about what the hard problems are in cloud computing and where the great opportunities are for academic work. (Hell, my mind was blown after my first couple of days at Google. It's like taking the red pill.)

    A few things that jump to mind as ripe areas for academic research on the cloud:
    • Understanding and predicting performance at scale, with uncertain workloads and frequent node failures.
    • Managing workloads across multiple datacenters with widely varying capacity, occasional outages, and constrained inter-datacenter network links.
    • Building failure recovery mechanisms that are robust to massive correlated outages. (This is what brought down Amazon's EC2 a few weeks ago.)
    • Debugging large-scale cloud applications: tools to collect, visualize, and inspect the state of jobs running across many thousands of cores.
    • Managing dependencies in a large codebase that relies upon a wide range of distributed services like Chubby and GFS.
    • Handling both large-scale upgrades to computing capacity as well as large-scale outages seamlessly, without having to completely shut down your service and everything it depends on.
    "

    Tuesday 10 May 2011

    EMC Makes a Big Bet on Hadoop


    EMC is throwing its weight behind Hadoop. Today, at EMC World, the storage giant announced a slew of Hadoop-centric products, including a specialized appliance for Hadoop-based big data analytics and two separate Hadoop distributions. EMC’s entry is most definitely going to shake up the Hadoop and database markets. EMC is now the largest company actively pushing its own Hadoop distribution, and it has an appliance that will put EMC out in front of analytics vendors such as Oracle and Teradata when it comes to handling unstructured data.


    EMC’s flagship Hadoop distribution is called Greenplum HD Enterprise Edition. EMC describes it as “a 100 percent interface-compatible implementation of the Apache Hadoop stack” that also includes enterprise-grade features such as snapshots and wide-area replication, a native network file system, and integrated storage and cluster management capabilities. The company also claims performance improvements of two to five times over the standard Apache Hadoop distribution.


    MapR Magic


    It’s noteworthy that many of these capabilities are also available in startup MapR’s HDFS alternative, and that MapR CEO John Schroeder took the stage at a morning EMC World press conference announcing the news. EMC Greenplum’s Luke Lonergan wouldn’t confirm to me that EMC’s Enterprise Edition will use MapR as the primary storage engine, but it’s not too difficult to connect the dots.


    However, while the Enterprise Edition is proprietary in part, the Greenplum HD Community Edition is fully open source and still makes big improvements over what’s currently available with the Apache version. In fact, Lonergan told me, Community Edition is based on Facebook’s optimized version of Hadoop. Like Cloudera’s distribution for Hadoop, Community Edition pre-integrates Hadoop MapReduce, Hadoop Distributed File System, HBase, ZooKeeper and Hive, but it also includes fault tolerance for the NameNode in HDFS and the JobTracker node in Hadoop MapReduce. These improvements are underway within Apache thanks to Yahoo, but they’re not included in any official release yet.


    Too Much Hadoop?


    I asked a couple of weeks ago whether the Hadoop-distribution market could handle all the players it now hosts, and now that question is even more pressing. As Luke Lonergan put it during the press conference, EMC is an “8,000-pound elephant” in the Hadoop space, and that should make Cloudera, IBM, DataStax and (possibly) Yahoo seek higher ground.


    For Cloudera, EMC is a major threat because it competes directly against Cloudera’s open-source and proprietary products. It even has partnerships with a large number of business intelligence and other up-the-stack vendors, some of which already are Cloudera partners. These include Concurrent, CSC, Datameer, Informatica, Jaspersoft, Karmasphere, MicroStrategy, Pentaho, SAS, SnapLogic, Talend, and VMware.


    Oh, and Cloudera and Greenplum have an existing integration partnership. As Lonergan noted, “This definitely marks a change [in that relationship].” The two are now competitors, after all.


    EMC vs Big Blue


    IBM is still the largest company involved in selling Hadoop products, but it has not yet announced its official Hadoop distribution. EMC’s Hadoop distributions will be available later this quarter. I noted recently how EMC is following IBM’s lead in acquiring capabilities across the big data stack — from Hadoop to predictive analytics — and today’s news further proves how competitive the two storage heavyweights might become in the analytics space, too.


    IBM isn’t the only big-name vendor that should be worried about EMC’s new Hadoop-heavy plans, though. The EMC Greenplum HD Data Computing Appliance should make appliance makers Oracle and Teradata, as well as analytic database vendors such as HP, ParAccel and others, quite nervous. The appliance is like the existing EMC Greenplum Data Computing Appliance, only it lets customers process Hadoop data within the same system as their Greenplum analytic database. Presently, most analytic databases and appliances integrate with Hadoop, but still suffer from the latency of having to send data over the network from Hadoop clusters to the database and back.


    IBM already has integrated Hadoop with its other big data tools, including with InfoSphere BigInsights, Watson and Cognos Consumer Insight, and I have to believe a version of its Netezza analytics appliance with Hadoop co-processing will be on the way shortly, possibly in conjunction with its official Hadoop distribution release.


    Lonergan also noted that EMC is working closely with VMware, of which EMC is the majority stockholder, on integrating EMC’s Hadoop products with VMware’s virtualization and cloud products, as well as its GemStone distributed database software.


    There still will be opportunities for community collaboration among all the open source Hadoop distributions — Cloudera, DataStax Brisk and EMC Greenplum HD Community Edition — but we’ll see how willing they are to work together now that the competition has really heated up. All of a sudden, EMC looks like the strongest Hadoop company going, and everyone else needs to figure out in a hurry how they’ll counter today’s landscape-altering news.



    Comet - An Example of the New Key-Code Databases

    Comet is an active distributed key-value store built at the University of Washington. The paper describing Comet is Comet: An active distributed key-value store; there are also slides and an MP3 of a presentation given at OSDI '10. Here's a succinct overview of Comet:

    Today's cloud storage services, such as Amazon S3 or peer-to-peer DHTs, are highly inflexible and impose a variety of constraints on their clients: specific replication and consistency schemes, fixed data timeouts, limited logging, etc. We witnessed such inflexibility first-hand as part of our Vanish work, where we used a DHT to store encryption keys temporarily. To address this issue, we built Comet, an extensible storage service that allows clients to inject snippets of code that control their data's behavior inside the storage service.

    I found this paper quite interesting because it takes the initial steps of collocating code with a key-value store, which turns it into what might be called a key-code store. This is something I've been exploring as a way of moving behavior to data in order to overcome network limitations in the cloud and provide other benefits. An innovator in this area is the Alchemy Database, which has already combined Redis and Lua. A good platform for this sort of thing might be Node.js integrated with V8, which would allow complex JavaScript programs to run in an efficient evented container. There are a lot of implications to this sort of architecture, more about that later, but the Comet paper describes a very interesting start.

    From the abstract and conclusion:

    This paper described Comet, an active distributed key value store. Comet enables clients to customize a distributed storage system in application-specific ways using Comet’s active storage objects. By supporting ASOs, Comet allows multiple applications with diverse requirements to share a common storage system. We implemented Comet on the Vuze DHT using a severely restricted Lua language sandbox for handler programming. Our measurements and experience demonstrate that a broad range of behaviors and customizations are possible in a safe, but active, storage environment.
    Distributed key-value storage systems are widely used in corporations and across the Internet. Our research seeks to greatly expand the application space for key-value storage systems through application-specific customization. We designed and implemented Comet, an extensible, distributed key-value store. Each Comet node stores a collection of active storage objects (ASOs) that consist of a key, a value, and a set of handlers. Comet handlers run as a result of timers or storage operations, such as get or put, allowing an ASO to take dynamic, application-specific actions to customize its behavior. Handlers are written in a simple sandboxed extension language, providing properties of safety and isolation.
    We implemented a Comet prototype for the Vuze DHT, deployed Comet nodes on Vuze from PlanetLab, and built and evaluated over a dozen Comet applications. Our experience demonstrates that simple, safe, and restricted extensibility can significantly increase the power and range of applications that can run on distributed active storage systems. This approach facilitates the sharing of a single storage system by applications with diverse needs, allowing them to reap the consolidation benefits inherent in today’s massive clouds.
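
    To make the ASO idea concrete, here is a minimal toy sketch in plain Python. It is purely illustrative and assumes nothing about Comet's real API: the class names and handler signatures are invented, and actual Comet handlers run inside a severely restricted Lua sandbox rather than as arbitrary host-language code.

        import time

        class ActiveObject:
            """A toy 'active storage object': a value plus optional handlers."""
            def __init__(self, value, on_get=None, on_put=None):
                self.value = value
                self.on_get = on_get      # called as on_get(obj) before a read returns
                self.on_put = on_put      # called as on_put(obj, new_value) before a write
                self.created = time.time()
                self.gets = 0

        class ActiveStore:
            """A toy key-code store: keys map to objects carrying their own behavior."""
            def __init__(self):
                self._objects = {}

            def put(self, key, value, on_get=None, on_put=None):
                existing = self._objects.get(key)
                if existing is not None and existing.on_put is not None:
                    existing.on_put(existing, value)   # let the object react to the overwrite
                self._objects[key] = ActiveObject(value, on_get, on_put)

            def get(self, key):
                obj = self._objects.get(key)
                if obj is None:
                    return None
                obj.gets += 1
                if obj.on_get is not None and obj.on_get(obj) is False:
                    return None                        # the handler suppressed the read
                return obj.value

        # Example handler in the spirit of Vanish: the key refuses reads after a
        # fixed lifetime, instead of relying on the store's one-size-fits-all timeout.
        def expire_after_60s(obj):
            return (time.time() - obj.created) < 60

        store = ActiveStore()
        store.put("session-key", "0xDEADBEEF", on_get=expire_after_60s)
        print(store.get("session-key"))   # the value while young, None once expired

    The point of the sketch is only the shape: each object bundles data with a bit of code, so applications with different replication, timeout, or logging needs can share the same store.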

    Related Articles
    Roxana Geambasu, Ph.D. Candidate
    Scriptable Object Cache by Rohit Karlupia

    Monday 9 May 2011

    Presentation: More Best Practices for Large-Scale Websites: Lessons from eBay


    Summary
    Randy Shoup shares 10 lessons learned from eBay: Partition Everything, Asynchrony Everywhere, Automate Everything, Everything Fails, Embrace Inconsistency, Expect (R)evolution, Dependencies Matter, Respect Authority, Never Enough Data, Custom Infrastructure.

    Bio
    Randy Shoup is a Distinguished Architect in the eBay Marketplace Architecture group. Since 2004, he has been the primary architect for eBay's search infrastructure. Prior to eBay, Randy was Chief Architect at Tumbleweed Communications, and has also held a variety of software development and architecture roles at Oracle and Informatica.


    Monday 2 May 2011

    Stack Overflow Makes Slow Pages 100x Faster by Simple SQL Tuning



    Source: highscalability.com

    The most common complaint against NoSQL is that if you know how to write good SQL queries, then SQL works fine; if SQL is slow, you can always tune it and make it faster. A great example of this incremental improvement process was written up by StackExchange's Sam Saffron in A day in the life of a slow page at Stack Overflow, where he shows that through profiling and SQL tuning it was possible to reduce page load times from 630ms to 40ms for some pages, and for other pages the improvement was 100x.


    Sam provides a lot of wonderful detail of his tuning process, how it works, the thought process, the tools used, and the tradeoffs involved. Here's a short summary of the steps:
    1. Using their mini-profiler, Sam found that a badge detail page was taking 630.3 ms to load, with 298.1 ms of that spent on SQL queries; the tool also listed the SQL queries for the page and how long each took.
    2. From the historical logs, which HAProxy stores on a month-by-month basis, Sam determined that this page is accessed 26,000 times a day and takes 532ms on average to render. That is too long; there are probably higher-value problems to solve, but Google takes speed into account for page rank, and Sam thought the page could be made faster.
    3. Sam noticed:
      1. There were lots of select queries, which Sam calls the N+1 Select Problem: many individual detail selects are issued to gather all the data for a single master record (see the sketch after this list). More details at: Hibernate Pitfall: Why Relationships Should Be Lazy.
      2. Half the time was spent on the web server.
      3. There were some expensive queries.
    4. A code review showed the code uses a LINQ-2-SQL multi-join. LINQ-2-SQL takes a high-level ORM description and generates SQL code from it; here the generated code was slow, causing a 10x slowdown in production.
    5. The strategy was to remove the ORM overhead by rewriting the query with Dapper, a simple .NET object mapper. The resulting code is faster and more debuggable. LINQ-2-SQL is being removed from selects, but is still used for writes on Stack Overflow.
    6. In production this query was taking too long, so the next step was to look at the query plan, which showed a table scan was being used instead of an index. A new index was created which cut the page load time by a factor of 10.
    7. The N+1 problem was fixed by changing the ViewModel to use a left join which pulled all the records in one query.
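
    To make steps 3, 6 and 7 concrete, here is a minimal, self-contained sketch. It uses Python's built-in sqlite3 rather than the SQL Server / LINQ-2-SQL / Dapper stack Stack Overflow actually runs on, and the badge schema, table names, and index name are all invented for illustration.

        import sqlite3

        # In-memory stand-in for a badges schema (invented for this sketch).
        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE badges (id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE users_badges (user_id INTEGER, badge_id INTEGER, granted_at TEXT);
        """)
        conn.executemany("INSERT INTO badges VALUES (?, ?)",
                         [(i, "badge-%d" % i) for i in range(100)])
        conn.executemany("INSERT INTO users_badges VALUES (?, ?, ?)",
                         [(u, u % 100, "2011-05-01") for u in range(10000)])

        badge_id = 42

        # N+1 style (step 3.1): one query for the master row, then one more query
        # per detail row -- every extra round trip adds latency to the page.
        badge = conn.execute("SELECT id, name FROM badges WHERE id = ?",
                             (badge_id,)).fetchone()
        user_ids = [r[0] for r in conn.execute(
            "SELECT user_id FROM users_badges WHERE badge_id = ?", (badge_id,))]
        details = [conn.execute(
            "SELECT granted_at FROM users_badges WHERE badge_id = ? AND user_id = ?",
            (badge_id, uid)).fetchone() for uid in user_ids]

        # Single-query style (step 7): a left join pulls everything the page needs
        # in one round trip.
        rows = conn.execute("""
            SELECT b.name, ub.user_id, ub.granted_at
            FROM badges b
            LEFT JOIN users_badges ub ON ub.badge_id = b.id
            WHERE b.id = ?
        """, (badge_id,)).fetchall()

        # Index fix (step 6): inspect the plan, then add the index the scan was missing.
        print(conn.execute(
            "EXPLAIN QUERY PLAN SELECT * FROM users_badges WHERE badge_id = 42").fetchall())
        conn.execute("CREATE INDEX idx_users_badges_badge_id ON users_badges (badge_id)")
        print(conn.execute(
            "EXPLAIN QUERY PLAN SELECT * FROM users_badges WHERE badge_id = 42").fetchall())
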
    A NoSQLite might counter that this is exactly what a key-value database is for: all that data could have been retrieved in one get, no tuning, problem solved. The counter to that is you then lose all the benefits of a relational database, and as shown here the original design was fast enough and could be made very fast through a simple tuning process, so there is no reason to go NoSQL.
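
    For contrast, here is the single-get approach the NoSQL side is arguing for, again as a purely illustrative sketch (a plain dict stands in for any key-value store, and the document shape is invented): denormalize everything the badge page needs under one key, so a read is a single lookup instead of several queries. The cost is that every write path then has to keep that denormalized document consistent.

        # A plain dict standing in for a key-value store (illustrative only).
        kv_store = {}

        # At write time, store the whole badge page document under one key.
        kv_store["badge:42"] = {
            "name": "badge-42",
            "awarded": [
                {"user_id": 7,  "granted_at": "2011-05-01"},
                {"user_id": 19, "granted_at": "2011-05-02"},
            ],
        }

        # At read time, one get replaces the joins and detail queries.
        page_data = kv_store["badge:42"]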


    The Updated Big List of Articles on the Amazon Outage


    Source: Highscalability.com

    Since The Big List Of Articles On The Amazon Outage was published, we've had a few updates that people might not have seen. Amazon of course released their Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region. Netflix shared their Lessons Learned from the AWS Outage, as did Heroku (How Heroku Survived the Amazon Outage), SmugMug (How SmugMug survived the Amazonpocalypse), and SimpleGeo (How SimpleGeo Stayed Up During the AWS Downtime).


    The curious thing from my perspective is the general lack of response to Amazon's explanation. I expected more discussion, but there's been almost none that I've seen. My guess is that very few people understand what Amazon was talking about well enough to comment, whereas almost everyone feels qualified to talk about the event itself.


    Lesson for crisis handlers: deep dive post-mortems that are timely, long, honestish, and highly technical are the most effective means of staunching the downward spiral of media attention.


    Amazon's Explanation of What Happened


    Experiences From Specific Companies, Both Good And Bad

    Amazon Web Services Discussion Forum

    A fascinating peek into the experiences of people who were dealing with the outage while they were experiencing it. Great real-time social archeology in action.

    There were also many, many instances of support and help in the log.

    In Summary

    Taking Sides: It's The Customer's Fault

    Taking Sides: It's Amazon's Fault

    Lessons Learned And Other Insight Articles

    Vendor's Vent