Tuesday 29 March 2011

Music to my Ears - Introducing Amazon Cloud Drive and Amazon Cloud Player


Today Amazon.com announced new solutions to help customers manage their digital music collections. Amazon Cloud Drive and Amazon Cloud Player enable customers to securely and reliably store music in the cloud and play it on any Android phone, tablet, Mac or PC, wherever they are.


As a big music fan with well over 100 GB of digital music, I am particularly excited that I now have access to all my digital music anywhere I go.



Order in the Chaos



The number of digital objects in our lives is growing rapidly. What used to be available only in physical formats now often has digital equivalents, and this digitization is driving great new innovations. The methods for accessing these objects are also changing rapidly; where in the past you needed a PC or a laptop to access these objects, many of our electronic devices are now capable of processing them. Our smartphones and tablets are obvious examples, but many other devices are quickly gaining these capabilities; TV sets and hi-fi systems are internet-enabled, and soon our treadmills and automobiles will be equally plugged into the digital world.


Managing all these devices, along with the content we store and access on them, is becoming harder and harder. We see that with our Amazon customers: when they hear a great tune on the radio, they may identify it using the Shazam or SoundHound apps on their mobile phone and buy the song instantly from the Amazon MP3 store. But now this MP3 is on their phone, not on the laptop they use to sync their iPod, and not on the Windows Media Center PC that powers their hi-fi and TV set. That's frustrating - so much so that customers tell us they wait to buy digital music until they are in front of the computer that stores their music library, which brings back memories of a world constrained by physical resources.


The launch of Amazon Cloud Drive, Amazon Cloud Player and Amazon Cloud Player for Android will help bring order to this chaos and ensure that customers can buy, access and play their music anywhere. Customers can upload their existing music library into Amazon Cloud Drive, and music purchased from the Amazon MP3 store can be added directly upon purchase. Customers then use the Amazon Cloud Player web application to easily manage their music collections, with options to download or stream. Amazon Cloud Player for Android is integrated with the Amazon MP3 app and gives customers instant access on their mobile device to all the music they have stored in Amazon Cloud Drive. Any purchases that customers make on their Android devices can be stored in Amazon Cloud Drive and are immediately accessible from anywhere.


[Image: Amazon Cloud Drive]



A Drive in the Cloud



To build Amazon Cloud Drive, the team made use of a number of cloud computing services offered by Amazon Web Services. The scalability, reliability and durability requirements for Cloud Drive are very high, which is why they decided to use the Amazon Simple Storage Service (S3) as the core component of the service. Amazon S3 is used by enterprises of all sizes and is designed to scale extremely well; it stores hundreds of billions of objects and easily performs several hundred thousand storage transactions per second.
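
For readers who haven't used S3: storing and retrieving an object takes only a few lines. Here is a minimal sketch using the boto Python library; the bucket name, credentials and file paths are placeholders of my own, not anything from the Cloud Drive service itself.

    # Minimal sketch: store and fetch an MP3 as an S3 object using boto.
    # Bucket name, credentials and paths are hypothetical placeholders.
    from boto.s3.connection import S3Connection
    from boto.s3.key import Key

    conn = S3Connection('<aws_access_key_id>', '<aws_secret_access_key>')
    bucket = conn.create_bucket('my-cloud-drive-example')

    k = Key(bucket)
    k.key = 'music/artist/album/track01.mp3'
    k.set_contents_from_filename('track01.mp3')     # upload
    k.get_contents_to_filename('track01-copy.mp3')  # download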


Amazon S3 uses advanced techniques to provide very high durability and reliability; for example, it is designed to provide 99.999999999% durability of objects over a given year. Such a high durability level means that if you store 10,000 objects with Amazon S3, you can on average expect to lose a single object once every 10,000,000 years. Amazon S3 redundantly stores your objects on multiple devices across multiple facilities in an Amazon S3 Region. The service is designed to sustain concurrent device failures by quickly detecting and repairing any lost redundancy; for example, data may be lost concurrently in two facilities without the customer ever noticing.
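
As a quick back-of-the-envelope check of that claim (my arithmetic, not Amazon's): eleven nines of durability corresponds to an annual loss probability of $10^{-11}$ per object, so for $10^{4}$ objects the expected number of losses per year is

    $10^{4} \times 10^{-11} = 10^{-7}$,

and the mean time between single-object losses is $1 / 10^{-7} = 10^{7}$ years, i.e. the 10,000,000 years quoted above.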


Cloud Drive also makes extensive use of AWS Identity and Access Management (IAM) to help ensure that objects owned by a customer can be accessed only by that customer. IAM is designed to meet the strict security requirements of enterprises and government agencies using cloud services, and it allows Amazon Cloud Drive to manage access to objects at a very fine-grained level.
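
As an illustration of what fine-grained access control looks like (my own sketch of an IAM-style policy, not Amazon's actual Cloud Drive configuration), a policy can restrict a user to a single per-customer prefix in a bucket:

    # Hypothetical sketch: restrict a user to their own S3 prefix with IAM.
    # Bucket, user and policy names are made up for illustration.
    import json
    from boto.iam.connection import IAMConnection

    customer = 'customer-12345'
    policy = {
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::my-cloud-drive-example/%s/*" % customer
        }]
    }

    iam = IAMConnection('<aws_access_key_id>', '<aws_secret_access_key>')
    iam.create_user(customer)  # one IAM user per customer (illustrative)
    iam.put_user_policy(customer, 'cloud-drive-access', json.dumps(policy))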


A key part of the Cloud Drive architecture is a Metadata Service that allows customers to quickly search and organize their digital collections within Cloud Drive. The Cloud Player web application and Cloud Player for Android make extensive use of this Metadata Service to ensure a fast and smooth customer experience.
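
Amazon hasn't published details of the Metadata Service, but the general idea of keeping a searchable index separate from the bulk objects is easy to sketch (a purely illustrative toy, nothing more):

    # Toy illustration of a metadata index kept apart from the objects in S3:
    # searches return object keys without ever touching the bulk storage.
    from collections import defaultdict

    index = defaultdict(list)  # search term -> list of S3 object keys

    def add_track(s3_key, artist, album, title):
        for term in (artist, album, title):
            index[term.lower()].append(s3_key)

    def search(term):
        return index.get(term.lower(), [])

    add_track('music/ht/burgers/01.mp3', 'Hot Tuna', 'Burgers', 'Water Song')
    print(search('hot tuna'))  # -> ['music/ht/burgers/01.mp3']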



Making it simple for everyone



Amazon Cloud Drive and Amazon Cloud Player are important milestones in making sure that customers have access to their digital goods at any time, from anywhere. I am excited about this because it is already making my digital music experience simpler, and I am looking forward to the innovation that these services will drive on behalf of our customers.


If you are an engineer interested in working on Amazon Cloud Drive and related technologies the team has a number of openings and would love to talk to you! More details at http://www.amazon.com/careers.

"

As Big Data Takes Off, the Hadoop Wars Begin

It turns out “big data” isn’t just a buzzword, but a legitimate concern for companies across the board. Their interest in tools for analyzing all this data has sparked a land grab among established vendors and startups alike. The action is centered around Hadoop, the flagship technology for storing and processing large amounts of unstructured data.
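
For readers new to Hadoop, the programming model is easy to demonstrate. Here is the classic word-count example written for Hadoop Streaming in Python (a minimal sketch of my own, not from any of the vendors discussed here). Hadoop sorts the mapper's output by key before it reaches the reducer, which is what makes the reducer's single pass work:

    # wordcount.py -- run as 'python wordcount.py map' or '... reduce'.
    # Hadoop Streaming pipes input through the mapper, sorts by key, then
    # pipes the sorted stream through the reducer.
    import sys

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                sys.stdout.write('%s\t1\n' % word)

    def reducer():
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rsplit('\t', 1)
            if word != current and current is not None:
                sys.stdout.write('%s\t%d\n' % (current, count))
                count = 0
            current = word
            count += int(n)
        if current is not None:
            sys.stdout.write('%s\t%d\n' % (current, count))

    if __name__ == '__main__':
        mapper() if sys.argv[1] == 'map' else reducer()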

Since Yahoo open-sourced Hadoop a few years ago, the primary options for organizations wanting to take advantage of it have been the open-source Apache Hadoop distribution, the Cloudera distribution of Hadoop, and Amazon Web Services’ Elastic MapReduce service. That will change soon, as everyone from EMC and IBM to database startups like Hadapt and DataStax gets into the business of selling Hadoop-based technologies and services.

So far, Cloudera, which provides commercial support for its open-source distribution, as well as its own proprietary Hadoop-cluster management software, has been the only company to truly capitalize on Hadoop financially. Arguably, its success is to blame for the stiff competition it’s about to face for companies’ Hadoop attention and dollars.

Too Many Distributions


Cloudera, a private company, hasn’t released any financial details, but Wednesday at Structure: Big Data, VP of Engineering Amr Awadallah mentioned during a panel that Cloudera has more than 80 customers running Hadoop in production, and the company has technology partnerships across the data world, including leading data warehouse, BI and database vendors. Cloudera also has raised $36 million from investors since launching in 2009. It appears other software companies have noticed all the activity around Cloudera and want in on some of the action.

IBM already has a Hadoop business that includes its own distribution, which it says is better suited to commercial users than the open-source Apache Hadoop distribution, though both IBM’s and Cloudera’s are based on the Apache distribution. IBM’s offering also provides an application called InfoSphere BigSheets, which hides the complexities of Hadoop underneath a variety of advanced analytics, BI and visualization tools. Based on a few sources I spoke with at Structure: Big Data, and after reading an advertisement in the conference program, it looks like EMC is getting into the game as well. The ad hints that EMC will announce a Hadoop product involving its new Greenplum database on May 9; it reads, “05.09.11: EMC Greenplum. Apache Hadoop.” Also at the event, two independent sources suggested members of Yahoo’s Hadoop team will be spinning off their own separate business, and there is speculation this move is somehow tied into EMC’s Hadoop plans.

IBM isn’t to be taken lightly, nor is EMC on its own, but the latter turn of events would be potentially market-changing given the Hadoop know-how within Yahoo, which has contributed the majority of the code now included in Apache Hadoop. During a panel at Structure: Big Data, Yahoo’s VP of Cloud Architecture Todd Papaioannou quipped to Cloudera’s Awadallah that Yahoo will keep innovating on Hadoop and everyone else could keep reselling it. Papaioannou declined to comment on the rumors of a Hadoop spinout, but did tell me via email, “I think Apache Hadoop will remain the go-to place to get access to new improvements and innovation in the core Hadoop platform. That’s exactly why we announced our ‘double down’ strategy and the work we are doing on the next generation of both Map Reduce and HDFS.”

Death by a Thousand Startups


It’s not only large vendors that Cloudera will have to fight off; its real threat is death by a thousand startups and ISVs. At Structure: Big Data, NoSQL startup DataStax announced its own open-source Hadoop distribution built on the NoSQL database Cassandra, which provides a replacement for the Hadoop Distributed File System (HDFS). DataStax says this gives users the ability to process data and feed it back to applications at extremely low latencies, something Cloudera can’t offer because Apache Hadoop environments currently reside on infrastructure separate from application servers and databases. Om wrote earlier about MapR, a startup focused on improving the performance and reliability of HDFS, and Appistry is already addressing the same problem with its own wholly distributed HDFS alternative.

The launches weren’t over yet. Another database startup, Hadapt, officially launched Wednesday with a product that melds the HDFS-based HBase database with traditional RDBMS capabilities. HBase is an Apache Hadoop subproject heavily used by Facebook and included as part of Cloudera’s Hadoop distribution. And next Tuesday, high-performance computing pioneer Platform Computing — which has a presence in many large financial data centers and at 10 of the top 20 Fortune companies — will announce an analytics offering that applies its current cluster- and grid-management capabilities to MapReduce workloads. As noted above, management tools are where Cloudera actually makes money selling software as opposed to services.

There are several commercial alternatives to Apache Hadoop MapReduce as well. Pervasive Software’s DataRush is designed for writing big data workflows that take full advantage of multicore processors. Cascading, an open-source data-processing API, sits atop MapReduce; a startup called Concurrent offers commercial support and services for it. And Amazon Web Services offers a cloud-based Hadoop service called Elastic MapReduce, which spares users the cost of buying their own gear on which to run Hadoop workloads.
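
Elastic MapReduce can run a job like the word count sketched earlier without any cluster of your own. Here is a hedged sketch using boto; the bucket names, script locations and log paths are placeholders I made up:

    # Sketch: submit a Hadoop Streaming step to Amazon Elastic MapReduce.
    # All S3 locations below are hypothetical placeholders.
    from boto.emr.connection import EmrConnection
    from boto.emr.step import StreamingStep

    conn = EmrConnection('<aws_access_key_id>', '<aws_secret_access_key>')

    step = StreamingStep(
        name='Word count',
        mapper='s3n://my-bucket/wordcount-mapper.py',
        reducer='s3n://my-bucket/wordcount-reducer.py',
        input='s3n://my-bucket/input/',
        output='s3n://my-bucket/output/')

    jobflow_id = conn.run_jobflow(
        name='Word count job flow',
        log_uri='s3n://my-bucket/logs/',
        steps=[step])
    print(jobflow_id)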

Confused? Here’s a round-up of currently available Hadoop distributions:

Full-on distributions

  • Apache Hadoop
  • Cloudera’s Distribution including Apache Hadoop (that’s the official name)
  • IBM Distribution of Apache Hadoop
  • DataStax Brisk
  • Amazon Elastic MapReduce

HDFS alternatives



  • MapR
  • Appistry CloudIQ Storage Hadoop Edition
  • IBM General Parallel File System (GPFS)
  • CloudStore

Hadoop MapReduce alternatives



  • Pervasive DataRush
  • Cascading
  • Hive (an Apache subproject, included in Cloudera’s distribution)
  • Pig (a Yahoo-developed language, included in Cloudera’s distribution)

Cloudera Isn’t Flinching — Yet


Even with all this competition, however, it’s unclear whether Cloudera actually feels its iron grip on the commercial Hadoop world slipping away. CEO Mike Olson thinks a rich ecosystem of Hadoop companies is necessary if the market is to grow into the multi-billion-dollar business he believes it can be, but he sees most of that activity taking place up the stack from the foundational distribution layer where Cloudera operates. He said via email, “I believe there’s an enormous opportunity for smart companies, and even open-source projects, to build a new generation of data analysis tools on top of that platform.”

His colleague Awadallah was slightly less politic when asked specifically about the DataStax distribution, stating in a video interview with my colleague Stacey Higginbotham Wednesday that he thinks DataStax’s distribution is a “big mistake,” and that he doesn’t believe the company can yet back up its claims of Hadoop support. He added that a better alternative to reinventing the wheel in terms of Hadoop support and stability would have been for DataStax to keep its focus on Cassandra and partner with Cloudera on the Hadoop integration.


Cloudera has plenty of reason to be confident, actually. Among its ranks are Hadoop creator Doug Cutting and former Yahoo colleague Awadallah, as well as Chief Scientist Jeff Hammerbacher, who previously led Facebook’s massive data efforts, and Vertica veteran Omer Trajman. Olson himself is the former CEO of Sleepycat Software, which distributed the open-source Berkeley DB database before Oracle bought the company in 2006. Or, as Awadallah put it, “[W]e have the muscle to be able to back up our words with execution.” Further, as long as Facebook and Yahoo continue contributing their webscale-driven — and proven — enhancements back to Apache Hadoop, Cloudera has plenty of fuel to feed its evolution. Facebook, for example, is responsible for the popular Hive query language that gives Hadoop users a SQL-like experience many prefer to raw MapReduce, and, as noted above, Yahoo is currently pushing a next-generation architecture that addresses some known performance bottlenecks in Apache Hadoop.

But the threat is real. Cloudera has partnerships with many analytics vendors, but none of the companies mentioned here are operating up the stack from Cloudera. They’re all addressing the foundational HDFS, Hadoop MapReduce and cluster-management areas where Cloudera presently does business (although IBM and EMC are operating up the stack with analytics software, too). With so many options available — and with Apache Hadoop code open to anyone who wants to use it — every vendor with aspirations of making big money in Hadoop is going to have to work extra hard to convince users they’re adding value worth paying for.


Thursday 24 March 2011

Cassandra + Hadoop = Brisk by DataStax

I just heard the announcement DataStax, the company offering Cassandra services, made about Brisk, a Hadoop and Hive distribution built on top of Cassandra:


Brisk provides integrated Hadoop MapReduce, Hive and job and task tracking capabilities, while providing an HDFS-compatible storage layer powered by Cassandra.


Brisk was announced officially during the MapReduce panel at the Structure Big Data event. But it looks like others have already had a chance to hear about Brisk — is there something I should be doing to hear the “unofficial” announcements?

DataStax has also made available a whitepaper, “Evolving Hadoop into a Low-Latency Data Infrastructure: Unifying Hadoop, Hive and Apache Cassandra for Real-time and Analytics,” which you can download from here.




"

Tuesday 15 March 2011

6 Lessons from Dropbox - One Million Files Saved Every 15 minutes



Dropbox saves one million files every 15 minutes, more than the number of tweets Twitter users send in the same period. That mind-blowing statistic was revealed by Rian Hunter, a Dropbox engineer, in his presentation How Dropbox Did It and How Python Helped at PyCon 2011.
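
To put that rate in perspective (my arithmetic, not from the talk):

    $\frac{10^{6}\ \text{files}}{15 \times 60\ \text{s}} \approx 1{,}111\ \text{files per second}$,

sustained around the clock.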

The first part of the presentation is Dropbox lore: origin stories and other foundational myths. We learn that Dropbox is a startup located in San Francisco that makes probably one of the most popular file synchronization and sharing tools in the world, shipping Python on the desktop and supporting millions of users, with more joining every day.

About halfway through, the talk turns technical. Not a lot of information about how Dropbox handles this massive scale was shared, but there were a number of good lessons to ponder:

"

Sunday 6 March 2011

Great work at FAST ’11


After a quick scan of the paper titles I wasn't impressed. But after seeing the presentations and posters, I am.


Here are some I found interesting. I'll be posting longer pieces on some of them.


  • A Study of Practical Deduplication Full paper *Best Paper Winner*
  • Tradeoffs in Scalable Data Routing for Deduplication Clusters Full paper
  • Exploiting Half-Wits: Smarter Storage for Low-Power Devices Full paper
  • Reliably Erasing Data from Flash-Based Solid State Drives Full paper
  • Scale and Concurrency of GIGA+: File System Directories with Millions of Files Full paper
  • Emulating Goliath Storage Systems with David Full paper *Best Paper Winner*

An excellent conference. NetApp, EMC, Microsoft and IBM were recruiting.


The StorageMojo take

We’re still learning about flash, and the research presented here is a substantial addition to our meager knowledge.


Microsoft tells me they’re delivering major improvements to NTFS and Windows Server later this year. I’m looking forward to that briefing.


And it’s always a pleasure catching up with the people who, for some reason, never come to Sedona.


Courteous comments welcome, as always.





"

Saturday 5 March 2011

XCP 1.0 released


After 16 months of development, Xen.org is proud to present the first full version of the Xen Cloud Platform. We want to thank the project team who made this happen.


A full feature list as well as the install image and source packages can be found on the download page.


The following new features and improvements have been added since the XCP 0.5 release last summer:


  • Includes Xen hypervisor version 3.4.2
  • Includes Linux 2.6.32 privileged domain
  • VM Protection and Recovery: configure scheduled snapshots and (optionally) archival of virtual machines via snapshot or export
  • Local host storage caching of VM images to reduce load on shared storage
  • Boot from SAN: boots Xen Hypervisor hosts with HBAs from a SAN, with multipathing support
  • Improved Linux guest support: Ubuntu templates, Fedora 13/Red Hat Enterprise Linux (RHEL) 6 templates, and RHEL/CentOS/Oracle Enterprise Linux versions 5.0 to 5.5 support with a generic “RHEL 5” template
  • Enhanced guest OS support for Windows 7 SP1, Windows Server 2008 R2 SP1, Windows Server 2003, and SUSE Linux Enterprise Server (SLES) 11 SP1
  • Improved MPP RDAC multipathing including path health reporting and alerting through XAPI
  • Snapshot improvements: improved reclamation of space after VM snapshots are deleted, even if the VM is running
  • Support for blktap2 disk backend driver rather than blktap1
  • Support for Citrix XenCenter 5.6 FP1 Windows-based GUI management tool (see here)
  • Support for the OpenStack Bexar release

XCP is significant for Xen.org because it allows the Xen.org community to develop interesting new functionality against a mature, stable and scalable virtualization stack. If you want to get involved, check out the project’s wish list and get in touch with the XCP team via the mailing list.


Although XCP can be used as a stand-alone solution to build private clouds or as an enterprise server virtualization solution, there are significant opportunities to extend, innovate and build on top of XCP. Check out the list of open source projects and commercial solutions which already do this.


XCP integrates seamlessly with the OpenStack Bexar release: this means that the Xen Hypervisor and XCP are part of an end-to-end open source software stack covering all components from the bare metal to cloud orchestration software. Over the last year, you have seen the Xen community working more closely with downstream Linux and QEMU. The same is happening with upstream projects such as OpenStack and OpenNebula.


Unlike the Xen Hypervisor project, XCP delivers an installable binary. This represents a step-change in usability and enables the Xen developer community to engage more directly with its users.


You can find more information on XCP on the XCP home page and on the Wiki. And thank you again, to everybody who made this release happen!

"

Tuesday 1 March 2011

Paper: An Experimental Investigation of the Akamai Adaptive Video Streaming


Video is hot on the Internet and people are really interested in knowing how to make it work. Dan Rayburn has a post pointing to a fascinating paper, An Experimental Investigation of the Akamai Adaptive Video Streaming, which discusses in some detail the protocols big players like YouTube, Skype and Akamai use to serve video over an inherently video-unfriendly medium like the Internet. For Akamai they found:


  1. Each video is encoded in five versions at different bit rates and stored in separate files.
  2. The client sends commands to the server with an average inter-departure time of about 2 s, i.e., the control algorithm executes on average every 2 seconds (see the sketch after this list).
  3. Akamai uses only the video level to adapt the video source to the available bandwidth; the frame rate of the video is kept constant.
  4. When a sudden drop in the available bandwidth occurs, short interruptions of video playback can occur due to a large actuation delay.
  5. After a sudden increase in the available bandwidth, the transient time to match the new bandwidth is roughly 150 seconds.
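
The paper doesn't reproduce Akamai's actual controller, but the behavior described above is easy to sketch: every couple of seconds, pick whichever of the five encoded versions fits the measured bandwidth. This toy loop is my own illustration; the bitrate ladder, the callbacks and the headroom factor are all assumptions, not values from the paper.

    # Toy sketch of level-based rate adaptation, not Akamai's algorithm.
    # measure_bandwidth_kbps and request_level are caller-supplied callbacks;
    # the bitrate ladder and 0.8 headroom factor are made-up assumptions.
    import time

    BITRATES_KBPS = [300, 700, 1500, 2500, 3500]  # five versions per video

    def choose_level(measured_kbps, headroom=0.8):
        # Highest level whose bitrate fits within a fraction of the bandwidth.
        level = 0
        for i, rate in enumerate(BITRATES_KBPS):
            if rate <= measured_kbps * headroom:
                level = i
        return level

    def control_loop(measure_bandwidth_kbps, request_level, interval_s=2.0):
        while True:
            request_level(choose_level(measure_bandwidth_kbps()))
            time.sleep(interval_s)  # command inter-departure time of ~2 s

Note that only the level changes between intervals; the frame rate of the played video stays constant, matching observation 3 above.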
