Unlimited-Data. moved to lab.itbee.vn

Sunday, 8 August 2010

Google North American Faculty Summit - cloud computing

Posted by Brian Bershad, Director of Engineering, Site Director, Google Seattle

Of the three themes of our 2010 Faculty Summit, cloud computing was the one that pervaded all others, from security in the cloud to the presumption of cloud infrastructure behind the social web. But in our more focused discussion on cloud computing last Thursday, we started with the premise of “prodigiousness,” a concept introduced by Alfred Spector, VP of Research and Special Initiatives.

While we all know that systems are huge and will get even huger, the implications of this size for programmability, manageability, power, etc. are hard to comprehend. Alfred noted that the Internet is predicted to be carrying a zettabyte (10²¹ bytes) per year in just a few years. And growth in the number of processing elements per chip may give rise to warehouse computers having 10¹⁰ or more processing elements. To use systems at this scale, we need new solutions for storage and computation. It was these solutions we focused on throughout our discussions.
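To get a feel for that number, here is a back-of-the-envelope calculation that turns a zettabyte per year into an average sustained rate. It is only arithmetic on the figure quoted above; the function name is made up for illustration, and real traffic is of course far from evenly spread over the year.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years
ZETTABYTE = 10**21                     # bytes, as quoted in the post


def average_rate_bytes_per_second(bytes_per_year):
    """Average sustained rate needed to move bytes_per_year in one year."""
    return bytes_per_year / SECONDS_PER_YEAR


rate = average_rate_bytes_per_second(ZETTABYTE)
print(f"{rate / 1e12:.1f} TB/s")  # roughly 31.7 terabytes every second
```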

In the plenary talk, Andrew Fikes spoke on storage system opportunities. Among many topics, he talked about shifting engineering foci to storage management and optimization not just on an individual cluster of co-located systems, but across geographically distributed clusters. The goal is so-called planetary-scale systems. This brings up all manner of diverse challenges, including the need to continually balance storage vs. transmission costs, the need to account for variable network latency characteristics, and the desire to optimize storage (e.g., by physically storing only one copy of a file that many feel they have rights to, or own).
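The "one physical copy, many owners" idea can be sketched with a content-addressed store. This is a toy illustration under my own assumptions, not a description of any Google system: the class and method names are hypothetical, and a SHA-256 digest of the content stands in for whatever fingerprinting a real system would use.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical content is kept only once."""

    def __init__(self):
        self._blobs = {}   # digest -> bytes (the single physical copy)
        self._owners = {}  # digest -> set of (user, filename) references

    def put(self, user, filename, data):
        """Store data for a user; return the digest that identifies it."""
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # keep the first copy only
        self._owners.setdefault(digest, set()).add((user, filename))
        return digest

    def get(self, digest):
        return self._blobs[digest]

    def physical_bytes(self):
        """Bytes actually stored."""
        return sum(len(blob) for blob in self._blobs.values())

    def logical_bytes(self):
        """Bytes the users collectively believe they have stored."""
        return sum(len(self._blobs[d]) * len(refs)
                   for d, refs in self._owners.items())


store = DedupStore()
video = b"x" * 1000
store.put("alice", "clip.mp4", video)
store.put("bob", "copy-of-clip.mp4", video)
print(store.physical_bytes(), store.logical_bytes())  # 1000 2000
```

Two users uploaded the same bytes, so the store holds them once while still recording both references.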

We had a few roundtables in the afternoon for deeper discussions. In the table I led, we discussed two systems for “programming the data center” developed by systems researchers at Google Seattle/Kirkland. The first, Dremel, is a scalable, interactive ad-hoc query system for analysis of read-only nested databases. Dremel was recently presented in a paper at VLDB (Dremel: Interactive Analysis of Web-Scale Datasets, Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis. In Proceedings of the 36th Int'l Conf on Very Large Data Bases, 2010). The system serves as the foundational technology behind BigQuery, a product launched in limited preview mode at Google I/O in May.
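To make "ad-hoc query over nested data" a little more concrete, here is a toy aggregate over nested records. It only illustrates the kind of question one asks of such data; it says nothing about how Dremel stores or executes anything, and the record layout, the dotted-path syntax, and the function names are all invented for this sketch.

```python
from collections import defaultdict


def iter_path(record, path):
    """Yield every value found at a dotted path, descending into lists."""
    head, _, rest = path.partition(".")
    value = record.get(head)
    if value is None:
        return
    items = value if isinstance(value, list) else [value]
    for item in items:
        if rest:
            if isinstance(item, dict):
                yield from iter_path(item, rest)
        else:
            yield item


def count_by(records, path):
    """Count how often each value appears at the given path."""
    counts = defaultdict(int)
    for record in records:
        for value in iter_path(record, path):
            counts[value] += 1
    return dict(counts)


# Hypothetical nested documents: each page has a repeated "links" field.
pages = [
    {"url": "a.example", "links": [{"lang": "en"}, {"lang": "fr"}]},
    {"url": "b.example", "links": [{"lang": "en"}]},
    {"url": "c.example"},  # no links at all
]
print(count_by(pages, "links.lang"))  # {'en': 2, 'fr': 1}
```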

We also discussed FlumeJava, a Java library that makes it easy to develop, test and run efficient data-parallel pipelines at data center scale. FlumeJava was developed by programming languages researchers at Google Seattle, and is currently in widespread use within Google. It was presented at the recent PLDI conference (FlumeJava: easy, efficient data-parallel pipelines, Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, Nathan Weizenbaum. In Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation). The work reflects Google’s commitment to programming language and compiler technologies at scale.
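As a rough picture of what a data-parallel pipeline library looks like to the programmer, here is a small Python sketch in which operations are recorded first and executed only when the result is requested. It is not the FlumeJava API (FlumeJava is a Java library); the class and method names are hypothetical, and this toy runs sequentially in one process with none of the optimization or distribution the real system provides.

```python
from collections import defaultdict


class LazyCollection:
    """Records operations instead of running them; run() executes the plan."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def parallel_do(self, fn):
        """Add an element-wise step; fn maps one element to an iterable."""
        return LazyCollection(self._source, self._steps + (("do", fn),))

    def group_by_key(self):
        """Add a step turning (key, value) pairs into (key, [values])."""
        return LazyCollection(self._source, self._steps + (("group", None),))

    def combine_values(self, fn):
        """Add a step reducing each key's list of values with fn."""
        return self.parallel_do(lambda kv: [(kv[0], fn(kv[1]))])

    def run(self):
        """Execute the recorded steps in order (sequentially, in this toy)."""
        data = list(self._source)
        for kind, fn in self._steps:
            if kind == "do":
                data = [out for item in data for out in fn(item)]
            else:
                groups = defaultdict(list)
                for key, value in data:
                    groups[key].append(value)
                data = list(groups.items())
        return data


lines = ["to be or not to be"]
word_counts = dict(
    LazyCollection(lines)
    .parallel_do(lambda line: [(word, 1) for word in line.split()])
    .group_by_key()
    .combine_values(sum)
    .run()
)
print(word_counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Because nothing executes until `run()`, a real library has the whole plan in hand before it starts and is free to rearrange it; this sketch simply replays the steps.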

The field of data center programming has progressed substantially in the last 10 years. The Dremel and FlumeJava systems represent abstractions of a higher level than the MapReduce construct we previously introduced, and we think they are easier to use (within their domain of applicability) and more automatically optimizable. With time, the field will discover new “instructions” and even better abstractions, leading us to a point where computations that run on nearly unlimited processors can be expressed as easily as sequential programs. We are working hard to make progress here, and I look forward to reporting on our progress in the future.

Cloud storage integrated with NAS


There are several new companies working to combine ultra-fast local storage with feature-rich cloud storage, which we will be profiling in the coming weeks (keep your eyes open for a new category in our Directory specifically built for NAS-cloud hybrids). There are also some existing companies who are believers in the power of hybrid local/cloud solutions – please let us know your favorites…

In the meantime, we think the guys at NETGEAR (formerly Infrant) nailed the idea in this video (warning: this video contains a giant animated reptile) for the ReadyNAS Vault:

These are some of the companies or products we think fit the bill: Nasuni Filer, Ctera Cloud Plus, NETGEAR ReadyNAS Vault, Nirvanix CloudNAS, JungleDisk Map Drive, ElephantDrive Map Drive. Who are we missing?

VC Perspective on BigData and NoSQL Databases

Fantastic overview of the BigData and NoSQL databases market from a VC:

[…] Though many companies in the Fortune 1000 are starting to experiment with Hadoop, today only 10-20% of enterprises need big data solutions. This number could grow as high as 40-50% in 5 years.

[…]

Too many NoSQL database companies have already been created (Cloudera, 10gen, MongoDB, VoltDB, CouchDB, etc). While the user interest in such databases is increasing (many Fortune 1000 companies have started Hadoop evaluation projects), the market won’t be able to sustain them. I expect to see significant consolidation in the next 3-5 years.

[…]

For no reason apparent to me, NoSQL database companies are trying to reinvent the data warehousing and business intelligence infrastructures that have been created over the years.

Note also the fantastic BigData definition:

The data in these sets is at the terabyte or petabyte scale, it is semi-structured, highly distributed, and much of it is of unknown value so it must be processed quickly to identify the interesting parts to keep.

VC Perspective on BigData and NoSQL Databases originally posted on the NoSQL blog: myNoSQL

Building Blocks of Dynamo-like Distributed Systems


The Basho guys have started to talk about their experience building Riak, the Dynamo-like distributed key-value store, and about the common building blocks of distributed systems.

Justin Sheehy interviewed by Sadek Drobi over ☞ InfoQ.com:

Even just the Dynamo specific parts are very dramatic in differences. There have been a number of Dynamo-like systems developed over the past few years, each of which has had to design and implement large portions of even just the Dynamo-like sections on their own. Because Dynamo tells you what some very good design decisions are but it doesn’t show you how to implement the system. Even just the Dynamo portion you have to do a lot of design work, just to implement that.

Justin on choosing Erlang for implementing Riak:

There was a really natural choice because especially when you look at the Dynamo model, where they talk about all these operations where to get a value you’ll send messages to multiple other parties, then you’ll wait through various phases for responses of different classes to come back and the basic building blocks to do that kind of messaging and to do that kind of more complex state machine are there for you out of the box for you in Erlang.
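For readers who have not seen the Dynamo model, the "send messages to multiple parties, then wait for responses" pattern for a read looks roughly like the sketch below. It is written in Python rather than Erlang and is not Riak code: the `Replica` class and `quorum_get` function are hypothetical, threads stand in for message passing, and a plain integer version replaces the vector clocks Dynamo uses to order values.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class Replica:
    """Stand-in for a remote node holding key -> (version, value)."""

    def __init__(self, data=None, up=True):
        self.data = dict(data or {})
        self.up = up

    def get(self, key):
        if not self.up:
            raise ConnectionError("replica unavailable")
        return self.data.get(key)


def quorum_get(replicas, key, r):
    """Ask every replica; answer with the newest value once r have replied."""
    replies = []
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(replica.get, key) for replica in replicas]
        for future in as_completed(futures):
            try:
                replies.append(future.result())
            except ConnectionError:
                continue  # a failed replica simply does not count
            if len(replies) >= r:
                break  # enough answers; stop waiting for the rest
    if len(replies) < r:
        raise RuntimeError(f"only {len(replies)} of {r} required replies")
    found = [reply for reply in replies if reply is not None]
    if not found:
        return None
    return max(found, key=lambda reply: reply[0])[1]  # highest version wins


nodes = [
    Replica({"cart": (2, ["book", "pen"])}),
    Replica({"cart": (2, ["book", "pen"])}),
    Replica(up=False),  # one of the three replicas is down
]
print(quorum_get(nodes, "cart", r=2))  # ['book', 'pen']
```

With three replicas and `r=2`, the read still succeeds while one node is down; asking for `r=3` in the same situation raises an error instead.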

Kevin Smith promises a series of posts covering the details of ☞ riak_core, the refactored core of the Riak system that can be used for building Dynamo-like distributed systems:

Distributed systems are complex and some of that complexity shows in the amount of features available in riak_core. Rather than dive deeply into code, I’m going to separate the features into broad categories and give an overview of each.

The ☞ first part covers aspects like:

node liveness & membership (it is interesting to note that improvements to the failure recovery mechanism are part of the latest Riak release)

partitioning & distributing work

cluster state
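The "partitioning & distributing work" item is the easiest of the three to show in a few lines. Below is a minimal sketch of Dynamo-style partitioning in general, not riak_core's implementation: the `Ring` class, the round-robin assignment of partitions to nodes, and the default of 8 partitions are all assumptions made for illustration.

```python
import hashlib

RING_SIZE = 2**160  # size of the SHA-1 output space


class Ring:
    """Hash ring cut into equal partitions, each owned by one node."""

    def __init__(self, nodes, num_partitions=8):
        self.num_partitions = num_partitions
        # Assumption for this sketch: partitions are handed out round-robin.
        self.owners = [nodes[i % len(nodes)] for i in range(num_partitions)]

    def partition_for(self, key):
        """Map a key to the index of the partition its hash falls into."""
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        position = int.from_bytes(digest, "big")
        return position * self.num_partitions // RING_SIZE

    def preference_list(self, key, n=3):
        """Owners of the key's partition and the n - 1 partitions after it.

        In this toy the same node can appear twice where the ring wraps.
        """
        first = self.partition_for(key)
        return [self.owners[(first + i) % self.num_partitions]
                for i in range(n)]


ring = Ring(["node-a", "node-b", "node-c"])
print(ring.partition_for("user:42"), ring.preference_list("user:42"))
```

Every node that knows the ring computes the same answer for a given key, which is what lets any of them route a request without a central directory.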

Definitely a series I’ll keep an eye on, as I’m pretty sure there are many things to be learned from their experience. (shameless plug) If you happen to be in the Bay Area in November, come check the NoSQL track at QCon where, even if not published yet, among others, Andy Gross, VP of engineering at Basho, will be speaking about how to build Dynamo-style systems using Riak’s core.

Building Blocks of Dynamo-like Distributed Systems originally posted on the NoSQL blog: myNoSQL

If You Could Have One Resource For Cloud Security…

I got an interesting tweet sent to me today that asked a great question:

I thought about this and it occurred to me that while I would have liked to have answered that the Cloud Security Alliance Guidance was my first choice, I think the most appropriate answer is actually the following:

“Cloud Security and Privacy: An Enterprise Perspective on Risks and Compliance” by Tim Mather, Subra Kumaraswamy, and Shahed Latif is an excellent overview of the issues (and approaches to solutions) for cloud security and privacy. Pair it with the CSA and ENISA guidance and you’ve got a fantastic set of resources. I’d also suggest George Reese’s excellent book “Cloud Application Architectures: Building Applications and Infrastructure in the Cloud”.

I suppose it’s only fair to disclose that I played a small part in reviewing/commenting on both of these books prior to their publication.

/Hoff

ArchCamp: Scalable Databases (NoSQL)


The ArchCamp unconference was held this past Friday at HackerDojo in Mountain View, CA. There was plenty of pizza, beer, and great conversation. This session started out free-form, but shaped up pretty quickly into a discussion of the popular open source scalable NoSQL databases and the architectural categories in which they belong.

Saturday, 24 July 2010

Measuring and Comparing the Performance of 5 Cloud Platforms

Bitcurrent and Webmetrics have run a number of tests for a month on 5 different cloud platforms (Amazon, Google, Rackspace, Salesforce.com, and Terremark), attempting to measure the performance of each platform. One of their conclusions is that each platform works better for different application types. By Abel Avram