Tuesday, 7 June 2011

Apple iCloud: Syncing and Distributed Storage Over Streaming and Centralized Storage

There has been a lot of speculation over how Apple's iCloud would work. With the Apple Worldwide Developers Conference keynote just completed, we finally learned the truth. We can handle it. Apple made some interesting and cost-effective architecture choices that preserve the value of its devices and the basic model of how its existing services work.


A lot of pundits predicted that, with all the datacenters Apple was building, we would get a streaming music solution: only one copy of each song would be stored and then streamed on demand to everyone. Or Apple could take Google's brute-force approach and upload every user's entire music library to play it back on demand.


Apple did neither. They chose an interesting middle path that's not Google, Amazon, or even MobileMe.


The key idea is that you no longer need a PC. Device content is now synced over the air and managed by the cloud, not by your legacy computer. Your data may not even be stored in the cloud, but the management, syncing, and control of content is done by the cloud instead of the PC. The PC is now just another device, on par with the iPhone and iPad.


What happens to your data depends on the type of data. Apple gives you 5GB of free storage, which doesn't sound like a lot at all. The twist is that purchased music, apps, and books, as well as photos, don't count against the free storage, because these are stored on your devices. Photos stay in the cloud for a maximum of 30 days, which gives your devices 30 days to contact the cloud and download them; after that I assume they are gone. All the big data is stored on your devices.


Some smaller content is stored in the cloud: mail, documents, the Camera Roll, account information, settings, and other app data. This data is much smaller than photos, videos, and music, so it's a manageable amount of storage per user. It wasn't discussed, but I'd imagine storage can be increased for a price, so any additional storage usage would be funded.


What Apple ended up creating is a syncing model where large content is synced between devices, while smaller metadata-style content and shared, changeable data like mail are stored in the cloud (a small sketch of this split follows the list below). The advantages of this approach are:

  • It's consistent with how iTunes works now. Instead of a PC there's a cloud. No revolution here.
  • It's cost-efficient, which is important because iCloud is free. Storage grows linearly with the number of users, not the number of media items, which is a much more manageable curve. Apple is not on the hook for ever-increasing amounts of data storage.
  • Bandwidth usage is bursty during syncing operations but otherwise low, aside from background notifications and the like. In a streaming system bandwidth usage is continuous and high, which would shift Apple's cost structure toward that of a CDN. The path Apple has taken skips that whole problem. Most of this traffic will probably go over WiFi instead of 3G, so user bandwidth caps can be avoided.
  • The role for devices is preserved. Devices aren't just a thin client for the cloud. The meat of the application logic is still solidly on the device and not the cloud. Apple can still sell high margin devices and you can get your media on all your devices. And since these devices already have storage, there's no need to duplicate storage in the cloud, which is more economical.
  • The low cost of Apple's new Scan and Match service is made possible because of the minimal storage and bandwidth costs. They can just keep one copy and push it down to devices for local access. What other vendor has this ability? The devices that people are already used to buying offload this cost and are themselves a profit center.
  • This will really drive the demand for larger and larger SSD drives. The 1000 image storage limit for photos on your devices will be a big negative.
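
To make the split concrete, here is a minimal sketch of the storage model described above. The 5GB quota and the content categories come from the keynote; the structure, names, and the way photo expiry is handled are my own guesses at how a client might model it, not Apple's actual implementation.

```python
# Hypothetical model of the iCloud storage split described above.
# The 5 GB quota and the content categories come from the keynote;
# the structure and names below are invented for illustration only.

from dataclasses import dataclass

FREE_QUOTA_BYTES = 5 * 1024**3        # 5 GB of free iCloud storage
PHOTO_RETENTION_DAYS = 30             # photos pass through the cloud for ~30 days

# Never counts against the quota: purchases are re-downloadable and
# photos only transit the cloud briefly; the full copies live on devices.
QUOTA_EXEMPT = {"purchased_music", "purchased_apps", "purchased_books", "photos"}

# Lives in the cloud and counts against the quota.
QUOTA_COUNTED = {"mail", "documents", "camera_roll", "account_info", "settings", "app_data"}

@dataclass
class Item:
    kind: str
    size_bytes: int

def used_quota(items):
    """Bytes of iCloud storage an account consumes under this model."""
    return sum(i.size_bytes for i in items if i.kind in QUOTA_COUNTED)

library = [
    Item("purchased_music", 20 * 1024**3),  # large, but exempt
    Item("photos", 4 * 1024**3),            # exempt, and expires after 30 days
    Item("mail", 1 * 1024**3),
    Item("documents", 2 * 1024**3),
]
used = used_quota(library)
print(f"{used / 1024**3:.1f} GB counted against the {FREE_QUOTA_BYTES / 1024**3:.0f} GB free quota")
```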

Low-level details, like how merging happens if conflicting changes are made on different devices, weren't talked about, but they will be interesting. There are still a lot of details that need to be explained about where things are stored, how much can be stored, how things get synced, and how much it will cost. And this is still a very personal service. It's device centric. It's not social; there doesn't appear to be any sharing, which seems a strange oversight.
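
Purely to illustrate what that merge problem looks like, here is a last-writer-wins merge, one common (and lossy) baseline for reconciling a record edited on two devices. Nothing suggests this is what iCloud actually does; the field layout and timestamps are made up.

```python
# Illustration only: a last-writer-wins merge, one common (and lossy)
# baseline for reconciling the same record edited on two devices.
# Nothing here is based on how iCloud actually merges -- Apple didn't say.

def merge_last_writer_wins(a: dict, b: dict) -> dict:
    """Merge two versions of a record, field by field.

    Each version maps field -> (value, modification_timestamp).
    The newer timestamp wins per field; concurrent edits to different
    fields are both kept, concurrent edits to the same field lose data.
    """
    merged = {}
    for field in a.keys() | b.keys():
        va, vb = a.get(field), b.get(field)
        if va is None:
            merged[field] = vb
        elif vb is None:
            merged[field] = va
        else:
            merged[field] = va if va[1] >= vb[1] else vb
    return merged

# Example: a contact edited on an iPhone and an iPad while both were offline.
iphone = {"phone": ("555-1234", 1002), "email": ("old@example.com", 900)}
ipad   = {"phone": ("555-0000", 1001), "email": ("new@example.com", 1005)}
print(merge_last_writer_wins(iphone, ipad))
# -> phone from the iPhone (newer), email from the iPad (newer)
```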

Apple has taken an interesting middle path in their architecture, and there's a lot to learn from it. They aren't just storing stuff like Amazon and Google. They aren't creating another streaming service. They are creating a product unique to their ecosystem, an environment users will find difficult to leave.

I should mention that this is my very early impression of how things work taken from the presentation and poking around the iCloud site. It's possible I could have got some of it wrong.


Friday, 3 June 2011

Awesome List of Advanced Distributed Systems Papers

For his CS 525 Spring 2011 Advanced Distributed Systems class, Dr. Indranil Gupta has collected an incredible list of resources on distributed systems. His research group is also doing some interesting work.


The various topics include: Before there Were Clouds, Cloud Computing, P2P Systems, Basic Distributed Computing Concepts, Sensor Networks, Overlays and DHTs, Cloud Programming, Cloud Scheduling, Key-Value Stores, Storage, Sensor Net Routing, Geo-Distribution, P2P Apps, In-network processing, Epidemics, Probabilistic Membership Protocols, Distributed Monitoring and  Management, Publish-Subscribe/CDNs, Measurement Studies, Old Wine: Stale or Vintage?, In Byzantium, Cloud Pricing, Other Industrial Systems, Structure of Networks, Completing the Circle, Green Clouds, Distributed Debugging, Flash!, The Middle or the End?, Availability-Aware Systems, Design Methodologies, Handling Stress, Sources of unreliability in networks, Handling Stress, Selfish algorithms, Security, Economic Theory, The future of sensor nets?, The End-to-End Approach, Automatic Computing and Inference, Caching, Classical Algorithms, Topology and Naming, Practical theory perspectives, Modular Systems.


That's just the list of topics! For every topic there's the slide deck used to teach the class, a main list of papers and a second list of optional papers. So there's a lot to choose from. Happy reading! If any of the papers really stand out for you, please share.



"

Monday, 30 May 2011

Stuff to Watch from Surge 2010

Surge is a conference put on by OmniTI targeting practical scalability matters. OmniTI specializes in helping people solve their scalability problems, which is only natural, since it was founded by Theo Schlossnagle, author of the canonical Scalable Internet Architectures.


Now that Surge 2011 is on the horizon, they've generously made available nearly all the videos from the Surge 2010 conference, a pattern hopefully every conference will follow (only don't wait a year, please). We lose a lot of collective wisdom when events aren't made available online in a timely manner.

In truth, nearly all the talks are on topic and are worth watching, but here are a few that seem especially relevant:

  • Going 0 to 60: Scaling LinkedIn by Ruslan Belkin, Sr. Director of Engineering, LinkedIn.
    • Have you ever wondered what architectures a site like LinkedIn may have used and what insights its teams have learned while growing the system from serving just a handful of users to close to a hundred million?
  • Scaling and Loadbalancing Wikia Across The World by Artur Bergman, VP of Engineering and Operations, Wikia.
    • Wikia hosts around 100,000 wikis using the open source MediaWiki software. In this talk I'll take a tour through the process of taking a legacy codebase and turning it into a globally distributed system.
  • Design for Scale - Patterns, Anti-Patterns, Successes and Failures by Christopher Brown, VP of Engineering, Opscode.
    • This isn't your "Gang of Four". Christopher will discuss his experiences building Amazon's EC2 and the Opscode Platform, and the experiences of others designing large-scale online services.
  • Quantifying Scalability FTW by Neil Gunther, Founder/Principal Consultant, Performance Dynamics.
    • Successful scalability requires transforming your data to quantify the cost-benefit of any architectural decisions. In other words: information = measurement + method.
  • Database Scalability Patterns by Robert Treat, Lead Database Architect, OmniTI.
    • In Database Scalability Patterns we will attempt to distill all of the information/hype/discussions around scaling databases, and break down the common patterns we've seen dealing with scaling databases.
  • From disaster to stability: scaling challenges of my.opera.com by Cosimo Streppone, Lead Developer, Opera Software.
    • This talk tells the story of these last 3 years. Our successes, our failures, and what remains to be done.
  • Embracing Concurrency at Scale by Justin Sheehy, CTO, Basho Technologies.
    • Justin will focus on methods for designing and building robust fundamentally-concurrent distributed systems. We will look at practices that are "common knowledge" but too often forgotten, at old lessons that the software industry at large has somehow missed, and at some general "good practices" and rules that must be thrown away when moving into a distributed and concurrent world.
  • Scalable Design Patterns by Theo Schlossnagle, Principal/CEO, OmniTI.
    • In this talk, we'll take a whirlwind tour through different patterns for scalable architecture design and focus on evaluating whether each is the right tool for the job. Topics include load balancing, networking, caching, operations management, code deployment, storage, service decoupling and data management (the RDBMS vs. noSQL argument).
  • Why Some Architects Almost Never Shard Their Applications by Baron Schwartz, VP of Consulting, Percona.
    • "Shard early, shard often" is common advice -- and it's often wrong. In reality, many systems don't have to be sharded.
  • Scaling myYearbook.com - Lessons Learned From Rapid Growth by Gavin M. Roy, Chief Technology Officer, MyYearbook.
    • In this talk Gavin will review the growing pains and methodologies used to handle the consistent growth and demand while affording the rapid development cycles required by the product development team.

Tuesday, 24 May 2011

Updated AWS Security White Paper; New Risk and Compliance White Paper

We have updated the AWS Security White Paper and we've created a new Risk and Compliance White Paper.  Both are available now.


The AWS Security White Paper describes our physical and operational security principles and practices.


It includes a description of the shared responsibility model, a summary of our control environment, a review of secure design principles, and detailed information about the security and backup considerations related to each part of AWS including the Virtual Private Cloud, EC2, and the Simple Storage Service.


 

The new AWS Risk and Compliance White Paper covers a number of important topics including (again) the shared responsibility model, additional information about our control environment and how to evaluate it, and detailed information about our certifications and third-party attestations. A section on key compliance issues addresses a number of topics that we are asked about on a regular basis.


 

The AWS Security team and the AWS Compliance team are complementary organizations and are responsible for the security infrastructure, practices, and compliance programs described in these white papers. The AWS Security team is headed by our Chief Information Security Officer and is based outside of Washington, DC. Like most parts of AWS, this team is growing and they have a number of open positions:



We also have a number of security-related positions open in Seattle:



-- Jeff;




    "

    Wednesday, 11 May 2011

    How can academics do research on cloud computing?

    This week I'm in Napa for HotOS 2011 -- the premier workshop on operating systems. HotOS is in its 24th year -- it started as the Workshop on Workstation Operating Systems in 1987. More on HotOS in a forthcoming blog post, but for now I wanted to comment on a very lively discussion that took place during the panel session yesterday.

    The panel consisted of Mendel Rosenblum from Stanford (and VMWare, of course); Rebecca Isaacs from Microsoft Research; John Wilkes from Google; and Ion Stoica from Berkeley. The charge to the panel was to discuss the gap between academic research in cloud computing and the realities faced by industry. This came about in part because a bunch of cloud papers were submitted to HotOS from academic research groups. In some cases, the PC felt that the papers were trying to solve the wrong problems, or making incorrect assumptions about the state of cloud computing in the real world. We thought it would be interesting to hear from both academic and industry representatives about whether and how academic researchers can hope to do work on the cloud, given that there's no way for a university to build something at the scale and complexity of a real-world cloud platform. The concern is that academics will be relegated to working on little problems at the periphery, or come up with toy solutions.

    The big challenge, as I see it, is how to enable academics to do interesting and relevant work on the cloud when it's nearly impossible to build up the infrastructure in a university setting. John Wilkes made the point that he never wanted to see another paper submission showing a 10% performance improvement in Hadoop, and he's right -- this is not the right problem for academics to be working on. Not because a 10% improvement is not useful, or because Hadoop is a bad platform, but because those kinds of problems are already being solved by industry. In my opinion, the best role for academia is to open up new areas and look well beyond where industry is working. But this is often at odds with the desire of academics to work on 'industry relevant' problems, as well as to get funding from industry. Too often I think academics fall into the trap of working on things that might as well be done at a company.

    Much of the debate at HotOS centered around the industry vs. academic divide and a fair bit of it was targeted at my previous blog posts on this topic. Timothy Roscoe argued that academia's role was to shed light on complex problems and gain understanding, not just to engineer solutions. I agree with this. Sometimes at Google, I feel that we are in such a rush to implement that we don't take the time to understand the problems deeply enough: build something that works and move onto the next problem. Of course, you have to move fast in industry. The pace is very different than academia, where a PhD student needs to spend multiple years focused on a single problem to get a dissertation written about it.

    We're not there yet, but there are some efforts to open up cloud infrastructure to academic research. OpenCirrus is a testbed supported by HP, Intel, and Yahoo! with more than 10,000 cores that academics can use for systems research. Microsoft has opened up its Azure cloud platform for academic research. Only one person at HotOS raised their hand when asked if anyone was using this -- this is really unfortunate. (My theory is that academics have an allergic reaction to programming in C# and Visual Studio, which is too bad, since this is a really great platform if you can get over the toolchain.) Google is offering a billion core hours through its Exacycle program, and Amazon has a research grant program as well.

    Providing infrastructure is only one part of the solution. Knowing what problems to work on is the other. Many people at HotOS bemoaned the fact that companies like Google are so secretive about what they're doing, and it's hard to learn what the 'real' challenges are from the outside. My answer to this is to spend time at Google as a visiting scientist, and send your students to do internships. Even though it might not result in a publication, I can guarantee you will learn a tremendous amount about what the hard problems are in cloud computing and where the great opportunities are for academic work. (Hell, my mind was blown after my first couple of days at Google. It's like taking the red pill.)

    A few things that jump to mind as ripe areas for academic research on the cloud:
    • Understanding and predicting performance at scale, with uncertain workloads and frequent node failures (see the toy sketch after this list).
    • Managing workloads across multiple datacenters with widely varying capacity, occasional outages, and constrained inter-datacenter network links.
    • Building failure recovery mechanisms that are robust to massive correlated outages. (This is what brought down Amazon's EC2 a few weeks ago.)
    • Debugging large-scale cloud applications: tools to collect, visualize, and inspect the state of jobs running across many thousands of cores.
    • Managing dependencies in a large codebase that relies upon a wide range of distributed services like Chubby and GFS.
    • Handling both large-scale upgrades to computing capacity as well as large-scale outages seamlessly, without having to completely shut down your service and everything it depends on.
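
    As a toy example of the first item in this list, here is a tiny Monte Carlo sketch that estimates how retried tasks from random failures stretch the makespan of a large job. It is not based on any real system; the task counts, failure probability, and retry penalty are all made-up parameters.

```python
# Toy model: tasks are placed round-robin on nodes; a failed task is
# re-run at double cost; the job finishes when the slowest node finishes.
# All parameters are invented for illustration.
import random

def simulate_makespan(tasks=10_000, nodes=1_000, task_secs=60.0,
                      failure_prob=0.01, retry_penalty=2.0, trials=50):
    totals = []
    for _ in range(trials):
        per_node = [0.0] * nodes
        for t in range(tasks):
            cost = task_secs
            if random.random() < failure_prob:
                cost *= retry_penalty          # rough cost of detecting and retrying
            per_node[t % nodes] += cost
        totals.append(max(per_node))           # makespan = slowest node
    return sum(totals) / len(totals)

ideal = 10_000 * 60.0 / 1_000                  # perfect balance, no failures
print(f"ideal: {ideal:.0f}s  simulated: {simulate_makespan():.0f}s")
```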
    "

    Tuesday, 10 May 2011

    EMC Makes a Big Bet on Hadoop

    EMC is throwing its weight behind Hadoop. Today, at EMC World, the storage giant announced a slew of Hadoop-centric products, including a specialized appliance for Hadoop-based big data analytics and two separate Hadoop distributions. EMC’s entry is most definitely going to shake up the Hadoop and database markets. EMC is now the largest company actively pushing its own Hadoop distribution, and it has an appliance that will put EMC out in front of analytics vendors such as Oracle and Teradata when it comes to handling unstructured data.


    EMC’s flagship Hadoop distribution is called Greenplum HD Enterprise Edition. EMC describes it as “a 100 percent interface-compatible implementation of the Apache Hadoop stack” that also includes enterprise-grade features such as snapshots and wide-area replication, a native network file system, and integrated storage and cluster management capabilities. The company also claims performance improvements of two to five times over the standard Apache Hadoop distribution.


    MapR Magic


    It’s noteworthy that many of these capabilities are also available in startup MapR’s HDFS alternative, and that MapR CEO John Schroeder took the stage at a morning EMC World press conference announcing the news. EMC Greenplum’s Luke Lonergan wouldn’t confirm to me that EMC’s Enterprise Edition will use MapR as the primary storage engine, but it’s not too difficult to connect the dots.


    However, while the Enterprise Edition is proprietary in part, the Greenplum HD Community Edition is fully open source and still makes big improvements over what’s currently available with the Apache version. In fact, Lonergan told me, Community Edition is based on Facebook’s optimized version of Hadoop. Like Cloudera’s distribution for Hadoop, Community Edition pre-integrates Hadoop MapReduce, Hadoop Distributed File System, HBase, Zookeeper and Hive, but it also includes fault tolerance for the NameNode in HDFS and the JobTracker node in Hadoop MapReduce. These improvements are underway within Apache thanks to Yahoo, but they’re not included in any official release yet.


    Too Much Hadoop?


    I asked a couple of weeks ago whether the Hadoop-distribution market could handle all the players it now hosts, and now that question is even more pressing. As Luke Lonergan put it during the press conference, EMC is an “8,000-pound elephant” in the Hadoop space, and that should make Cloudera, IBM, DataStax and (possibly) Yahoo seek higher ground.


    For Cloudera, EMC is a major threat because it competes directly against Cloudera’s open-source and proprietary products. It even has partnerships with a large number of business intelligence and other up-the-stack vendors, some of which already are Cloudera partners. These include Concurrent, CSC, Datameer, Informatica, Jaspersoft, Karmasphere, Microstrategy, Pentaho, SAS, SnapLogic, Talend, and VMware.


    Oh, and Cloudera and Greenplum have an existing integration partnership. As Lonergan noted, “This definitely marks a change [in that relationship].” The two are now competitors, after all.


    EMC vs Big Blue


    IBM is still the largest company involved in selling Hadoop products, but it presently suffers from the problem of not having yet announced its official Hadoop distribution. EMC’s Hadoop distributions will be available later this quarter. I noted recently how EMC is following IBM’s lead in acquiring capabilities across the big data stack — from Hadoop to predictive analytics — and today’s news further proves how competitive the two storage heavyweights might become in the analytics space, too.


    IBM isn’t the only big-name vendor that should be worried about EMC’s new Hadoop-heavy plans, though. The EMC Greenplum HD Data Computing Appliance should make appliance makers Oracle and Teradata, as well as analytic database vendors such as HP, ParAccel and others, quite nervous. The appliance is like the existing EMC Greenplum Data Computing Appliance, only it lets customers process Hadoop data within the same system as their Greenplum analytic database. Presently, most analytic databases and appliances integrate with Hadoop, but still suffer from the latency of having to send data over the network from Hadoop clusters to the database and back.


    IBM already has integrated Hadoop with its other big data tools, including with InfoSphere BigInsights, Watson and Cognos Consumer Insight, and I have to believe a version of its Netezza analytics appliance with Hadoop co-processing will be on the way shortly, possibly in conjunction with its official Hadoop distribution release.


    Lonergan also noted that EMC is working closely with VMware, of which EMC is the majority stockholder, on integrating EMC’s Hadoop products with VMware’s virtualization and cloud products, as well as its GemStone distributed database software.


    There still will be opportunities for community collaboration among all the open source Hadoop distributions — Cloudera, DataStax Brisk and EMC Greenplum HD Community Edition — but we’ll see how willing they are to work together now that the competition has really heated up. All of a sudden, EMC looks like the strongest Hadoop company going, and everyone else needs to figure out in a hurry how they’ll counter today’s landscape-altering news.



    Comet - An Example of the New Key-Code Databases

    Comet is an active distributed key-value store built at the University of Washington. The paper describing Comet is Comet: An active distributed key-value store; there are also slides and an MP3 of a presentation given at OSDI '10. Here's a succinct overview of Comet:

    Today's cloud storage services, such as Amazon S3 or peer-to-peer DHTs, are highly inflexible and impose a variety of constraints on their clients: specific replication and consistency schemes, fixed data timeouts, limited logging, etc. We witnessed such inflexibility first-hand as part of our Vanish work, where we used a DHT to store encryption keys temporarily. To address this issue, we built Comet, an extensible storage service that allows clients to inject snippets of code that control their data's behavior inside the storage service.

    I found this paper quite interesting because it takes the initial steps of collocating code with a key-value store, which turns it into what might be called a key-code store. This is something I've been exploring as a way of moving behavior to data in order to overcome network limitations in the cloud and provide other benefits. An innovator in this area is the Alchemy Database, which has already combined Redis and Lua. A good platform for this sort of thing might be Node.js integrated with V8. This would allow complex Javascript programs to run in an efficient evented container. There are a lot of implications to this sort of architecture (more about that later), but the Comet paper describes a very interesting start.
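
    As a sketch of the idea (and only a sketch): the toy store below lets a value carry small handlers that run on get and put, in the spirit of Comet's active storage objects. The API is invented for this post; Comet itself runs handlers in a severely restricted Lua sandbox inside the Vuze DHT, not in Python.

```python
# A minimal, hypothetical sketch of a "key-code" store: each value can carry
# small handlers that the store runs on get/put, in the spirit of Comet's
# active storage objects (ASOs). The API below is invented for illustration;
# Comet itself sandboxes handlers in a restricted Lua, not Python.
import time

class ActiveObject:
    def __init__(self, value, on_get=None, on_put=None, ttl_secs=None):
        self.value = value
        self.on_get = on_get            # handler run when the object is read
        self.on_put = on_put            # handler run when the object is (re)written
        self.expires = time.time() + ttl_secs if ttl_secs else None

class KeyCodeStore:
    def __init__(self):
        self._objects = {}

    def put(self, key, obj: ActiveObject):
        if obj.on_put:
            obj.on_put(key, obj)        # application-specific behavior on write
        self._objects[key] = obj

    def get(self, key):
        obj = self._objects.get(key)
        if obj is None:
            return None
        if obj.expires and time.time() > obj.expires:
            del self._objects[key]      # application-chosen timeout, not a fixed one
            return None
        if obj.on_get:
            obj.on_get(key, obj)
        return obj.value

# Example: a self-expiring key that also counts its reads (a tiny access log).
store = KeyCodeStore()
reads = []
store.put("session:42", ActiveObject(
    "encrypted-key-material",
    on_get=lambda k, o: reads.append((k, time.time())),
    ttl_secs=3600))
print(store.get("session:42"), len(reads))
```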

    From the abstract and conclusion:

    This paper described Comet, an active distributed key value store. Comet enables clients to customize a distributed storage system in application-specific ways using Comet’s active storage objects. By supporting ASOs, Comet allows multiple applications with diverse requirements to share a common storage system. We implemented Comet on the Vuze DHT using a severely restricted Lua language sandbox for handler programming. Our measurements and experience demonstrate that a broad range of behaviors and customizations are possible in a safe, but active, storage environment.
    Distributed key-value storage systems are widely used in corporations and across the Internet. Our research seeks to greatly expand the application space for key-value storage systems through application-specific customization. We designed and implemented Comet, an extensible, distributed key-value store. Each Comet node stores a collection of active storage objects (ASOs) that consist of a key, a value, and a set of handlers. Comet handlers run as a result of timers or storage operations, such as get or put, allowing an ASO to take dynamic, application-specific actions to customize its behavior. Handlers are written in a simple sandboxed extension language, providing properties of safety and isolation.
    We implemented a Comet prototype for the Vuze DHT, deployed Comet nodes on Vuze from PlanetLab, and built and evaluated over a dozen Comet applications. Our experience demonstrates that simple, safe, and restricted extensibility can significantly increase the power and range of applications that can run on distributed active storage systems. This approach facilitates the sharing of a single storage system by applications with diverse needs, allowing them to reap the consolidation benefits inherent in today’s massive clouds.

    Related Articles
    Roxana Geambasu, Ph.D. Candidate
    Scriptable Object Cache by Rohit Karlupia