Showing posts with label Cloud storage. Show all posts

Tuesday, 7 June 2011

Apple iCloud: Syncing and Distributed Storage Over Streaming and Centralized Storage



There has been a lot of speculation over how Apple's iCloud would work. With the Apple Worldwide Developers Conference keynote having just completed, we finally learned the truth. We can handle it. Apple made some interesting and cost-effective architecture choices that preserve the value of their devices and the basic model of how their existing services work.


A lot of pundits foretold that, with all the datacenters Apple was building, we would get a streaming music solution: only one copy of each song would be stored and then streamed on demand to everyone. Or Apple could take Google's brute-force approach: copy up all of a user's music and play it back on demand.


Apple did neither. They chose an interesting middle path that's not Google, Amazon, or even MobileMe.


The key idea is you no longer need a PC. Device content is now synced over the air and is managed by the cloud, not your legacy computer. Your data may not even be stored in the cloud, but the whole management, syncing, and control of content is done by the cloud instead of the PC. PCs are now just another device, on par with the iPhone and iPad.


What happens to your data depends on the type of data. Apple gives you 5GB of free storage, which doesn't sound like a lot at all. The twist here is that purchased music, apps, books, and photos will not count against free storage because these are stored on your devices. Photos stay in the cloud for a maximum of 30 days, which gives your devices 30 days to contact the cloud and download them; after that I guess they are lost. All the big data is stored on your devices.


Some smaller content is stored in the cloud: mail, documents, Camera Roll, account information, settings, and other app data. This data is much smaller than photos, videos, and music, so it's a manageable amount of storage per user. It wasn't talked about, but I'd imagine storage could be increased for a price, so any increased storage usage would be funded.


What Apple ended up creating is a syncing model where large content is synced between devices, smaller metadata-type content is stored in the cloud, and shared, changeable data like mail is stored in the cloud. The advantages of this approach are:

  • It's consistent with how iTunes works now. Instead of a PC there's a cloud. No revolution here.
  • It's cost efficient, which is important because iCloud is free. Storage will increase linearly based on the number of users, not the number of media items, which is a much more manageable curve. Apple is not on the hook for ever increasing amounts of data storage.
  • Bandwidth usage is bursty during syncing operations, but is otherwise low, other than background notifications, etc. In a streaming system bandwidth usage is continuous and high, which shifts their cost structure to that of a CDN. The path taken here by Apple skips that whole problem. Most of this will probably be over WiFi instead of 3G, so user bandwidth caps can be avoided.
  • The role for devices is preserved. Devices aren't just a thin client for the cloud. The meat of the application logic is still solidly on the device and not the cloud. Apple can still sell high margin devices and you can get your media on all your devices. And since these devices already have storage, there's no need to duplicate storage in the cloud, which is more economical.
  • The low cost of Apple's new Scan and Match service is made possible because of the minimal storage and bandwidth costs. They can just keep one copy and push it down to devices for local access. What other vendor has this ability? The devices that people are already used to buying offload this cost and are themselves a profit center.
  • This will really drive the demand for larger and larger SSD drives. The 1000 image storage limit for photos on your devices will be a big negative.
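The storage split described above — big media synced between devices, small metadata kept in the cloud, photos passing through for 30 days — can be sketched as a simple policy lookup. This is my own illustrative code (the names and structure are not Apple's API):

```python
# Illustrative sketch (my own names, not Apple's API): the iCloud storage
# split described above, expressed as a simple policy lookup.

# Content the cloud merely brokers: synced between devices, not stored.
DEVICE_SYNCED = {"music", "apps", "books", "videos"}
# Content stored in the cloud and counted against the 5GB free quota.
CLOUD_STORED = {"mail", "documents", "settings", "account_info", "app_data"}

def storage_policy(content_type):
    """Return where a piece of content lives and whether it uses quota."""
    if content_type == "photos":
        # Photos transit the cloud for up to 30 days; devices hold them long-term.
        return {"location": "cloud_transit_30_days", "counts_against_quota": False}
    if content_type in DEVICE_SYNCED:
        return {"location": "devices", "counts_against_quota": False}
    if content_type in CLOUD_STORED:
        return {"location": "cloud", "counts_against_quota": True}
    raise ValueError("unknown content type: " + content_type)

print(storage_policy("music"))  # synced device-to-device, free
print(storage_policy("mail"))   # lives in the cloud, uses quota
```

The point of the sketch is the cost structure: only the small `CLOUD_STORED` set consumes Apple-side storage, so Apple's costs scale with users, not with media libraries.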

Low-level details, like how merging happens if conflicting changes are made on different devices, were not talked about, but they will be interesting. There are still a lot of details that need to be explained about where things are stored, how much can be stored, how things get synced, and how much it will cost. And this is still a very personal service. It's device centric. It's not social; there doesn't appear to be any sharing, which seems a strange oversight.

Apple has taken an interesting middle path in their architecture and there's a lot to learn from. They aren't just storing stuff like Amazon and Google. They aren't creating another streaming service. They are creating a product unique to their ecosystem, an environment users will find difficult to leave.

I should mention that this is my very early impression of how things work taken from the presentation and poking around the iCloud site. It's possible I could have got some of it wrong.


Friday, 29 April 2011

The Rise of Hadoop: How many Hadoop-related solutions exist?


The CMSWire commented list of Hadoop-related solutions:


  1. Apache Hadoop.
  2. Appistry CloudIQ Storage Hadoop Edition: an HDFS replacement improving on the single NameNode (here). Shipping.
  3. IBM Distribution of Apache Hadoop: Apache Hadoop, a 32-bit Linux version of the IBM SDK for Java 6 SR 8, and an easy-to-use installer that will install and configure both Hadoop (including SSH setup) and Java (here). Shipping, but in alphaWorks.
  4. IBM General Parallel File System (GPFS): a high-performance shared-disk clustered file system developed by IBM (here). Shipping.
  5. Cloudera’s Distribution including Apache Hadoop: Cloudera’s packaging for Hadoop and the Hadoop toolkit (here). Shipping.
  6. DataStax Brisk: using Apache Cassandra for Hadoop (and Hive) core services (here). Announced, but not released yet.
  7. Amazon Elastic MapReduce: Amazon-hosted Hadoop framework running on the infrastructure of Amazon EC2 and Amazon S3 (here). Shipping.
  8. MapR: a proprietary replacement for HDFS. Talked about.
  9. CloudStore: the former Kosmos open-source distributed filesystem (here). Shipping[1].
  10. Pervasive DataRush: parallel data-processing optimization for Hadoop jobs (here). Shipping.
  11. Cascading: query API and query planner. Shipping.
  12. Apache Hive: data warehouse on top of Hadoop. Shipping.
  13. Yahoo Pig: high-level data-flow language and execution framework for parallel computation. Shipping.
  14. Hadapt: hybrid architecture combining relational databases and Hadoop (here). Announced.


Some others are in the Hadoop toolkit.
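Several of the tools in the list above — Hive, Pig, Cascading — compile queries down to Hadoop's MapReduce model. A minimal in-memory sketch of that map → shuffle → reduce pattern (plain Python, not Hadoop's actual API) may help fix the idea:

```python
# A minimal in-memory sketch of the MapReduce pattern that Hive, Pig,
# and Cascading compile down to. Plain Python, not Hadoop's actual API.
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all values by key, as Hadoop's shuffle/sort step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word-count example.
def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    return sum(counts)

lines = ["hadoop hive pig", "hive hadoop"]
result = reduce_phase(shuffle(map_phase(lines, mapper)), reducer)
print(result)  # {'hadoop': 2, 'hive': 2, 'pig': 1}
```

What distinguishes the products listed here is not this model, which they share, but where the data lives underneath it (HDFS, GPFS, Cassandra, MapR's filesystem) and who operates the cluster.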



Instead of “shipping”, another criterion that can be used is the number of users and the amount of processed data.






  1. The current Kosmos release is 0.5, dated June 2010.




Original title and link: The Rise of Hadoop: How many Hadoop-related solutions exist? (NoSQL databases © myNoSQL)




Tuesday, 14 December 2010

Big Just Got Bigger - 5 Terabyte Object Support in Amazon S3



Today, Amazon S3 announced a new breakthrough in supporting customers with large files by increasing the maximum supported object size from 5 gigabytes to 5 terabytes. This allows customers to store and reference a large file as a single object instead of smaller 'chunks'. When combined with the Amazon S3 Multipart Upload release, this dramatically improves how customers upload, store and share large files on Amazon S3.


Who has files larger than 5GB?



Amazon S3 has always been a scalable, durable and available data repository for almost any customer workload. However, as use of the cloud has grown, so have the file sizes customers want to store in Amazon S3 as objects. This is especially true for customers managing HD video or data-intensive instruments such as genomic sequencers. For example, a 2-hour movie on Blu-ray can be 50 gigabytes. The same movie stored in an uncompressed 1080p HD format is around 1.5 terabytes.


By supporting such large object sizes, Amazon S3 better enables a variety of interesting big data use cases. For example, a movie studio can now store and manage their entire catalog of high definition origin files on Amazon S3 as individual objects. Any movie or collection of content could be easily pulled in to Amazon EC2 for transcoding on demand and moved back into Amazon S3 for distribution through edge locations throughout the world with Amazon CloudFront. Or, BioPharma researchers and scientists can stream genomic sequencer data directly into Amazon S3, which frees up local resources and allows scientists to store, aggregate, and share human genomes as single objects in Amazon S3. Any researcher anywhere in the world then has access to a vast genomic data set with the on-demand compute power for analysis, such as Amazon EC2 Cluster GPU Instances, previously only available to the largest research institutions and companies.


Multipart Upload and moving large objects into Amazon S3



To make uploading large objects easier, Amazon S3 also recently announced Multipart Upload, which allows you to upload an object in parts. You can create parallel uploads to better utilize your available bandwidth and even stream data into Amazon S3 as it's being created. Also, if a given upload runs into a networking issue, you only have to restart that part, not the entire object, allowing you to recover quickly from intermittent network errors.


Multipart Upload isn't just for customers with files larger than 5 gigabytes. With Multipart Upload, you can upload any object larger than 5 megabytes in parts. So, we expect customers with objects larger than 100 megabytes to extensively use Multipart Upload when moving their data into Amazon S3 for a faster, more flexible upload experience.
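To make the part arithmetic concrete, here is an illustrative helper (my own code, not an AWS SDK call) that plans the parts for a multipart upload. It encodes S3's documented limits: parts other than the last must be at least 5 MB, an object can have at most 10,000 parts, and objects top out at 5 TB:

```python
# Illustrative helper (my own code, not an AWS SDK call): plan the parts
# for an S3 Multipart Upload. S3 requires every part except the last to be
# at least 5 MB, allows at most 10,000 parts, and caps objects at 5 TB.
MIN_PART = 5 * 1024**2     # 5 MB minimum part size
MAX_PARTS = 10_000         # per-object part limit
MAX_OBJECT = 5 * 1024**4   # 5 TB maximum object size

def plan_parts(object_size, part_size=100 * 1024**2):
    """Return a list of (part_number, offset, length) tuples."""
    if object_size > MAX_OBJECT:
        raise ValueError("object exceeds S3's 5 TB limit")
    if part_size < MIN_PART:
        raise ValueError("parts must be at least 5 MB")
    parts = []
    offset = 0
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((len(parts) + 1, offset, length))
        offset += length
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; use a larger part size")
    return parts

# The 50 GB Blu-ray example from the post, in 100 MB parts:
parts = plan_parts(50 * 1024**3)
print(len(parts))  # 512
```

Each planned part could then be uploaded independently and in parallel, with only a failed part needing a retry, which is exactly the recovery property described above.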


More information



For more information on Multipart Upload and managing large objects in Amazon S3, see Jeff Barr's blog posts on Amazon S3 Multipart Upload and Large Object Support as well as the Amazon S3 Developer Guide.


Sunday, 5 December 2010

Almost half of cloud revenues from storage!


A new report from the 451 Group says that the cloud computing marketplace will reach $16.7bn in revenue by 2013. Even more interesting, however, the Group reports that cloud-based storage will play a starring role in cloud growth, accounting for nearly 40% of the core cloud pie in 2010. “We view storage as the most fertile sector, and predict that cloud storage will experience the strongest growth in the cloud platforms segment,” the report says.


More insights from the report…


Including the large and well-established software-as-a-service (SaaS) category, cloud computing will grow from revenue of $8.7bn in 2010 to $16.7bn in 2013, a compound annual growth rate (CAGR) of 24%.


The core cloud computing market will grow at a much more rapid pace as the cloud increasingly becomes a mainstream IT strategy embraced by corporate enterprises and government agencies. Excluding SaaS revenue, cloud-delivered platform and infrastructure services will grow from $964m in revenue in 2010 to $3.9bn in 2013 – a CAGR of 60% – the report said. The core market includes platform-as-a-service (PaaS) and infrastructure-as-a-service (IaaS) offerings, as well as the cloud-delivered software used to build and manage a cloud environment, which The 451 Group calls ‘software infrastructure as a service’ (SIaaS).
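The quoted growth rates can be sanity-checked with a quick compound-growth calculation over the three years from 2010 to 2013:

```python
# Sanity check of the report's compound annual growth rates (CAGR).
def cagr(start, end, years):
    """Compound annual growth rate from start to end over the given years."""
    return (end / start) ** (1 / years) - 1

# Overall cloud revenue, 2010 -> 2013: $8.7bn -> $16.7bn over 3 years.
print(round(cagr(8.7, 16.7, 3) * 100))   # 24 (%), matching the quoted 24%

# Core cloud (excluding SaaS): $964m -> $3.9bn over 3 years.
print(round(cagr(0.964, 3.9, 3) * 100))  # 59 (%), close to the quoted 60%
```

Both figures check out against the report's numbers, with the 60% core-cloud figure evidently rounded up slightly.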




Monday, 29 November 2010

Design — Sheepdog Project



The architecture of Sheepdog is fully symmetric; there is no central node such as a metadata server. This design enables the following features.
  • Linear scalability in performance and capacity
    When more performance or capacity is needed, Sheepdog can be grown linearly by simply adding new machines to the cluster.
  • No single point of failure
    Even if a machine fails, the data is still accessible through other machines.
  • Easy administration
    There is no configuration file describing the cluster's roles. When administrators launch Sheepdog on a newly added machine, Sheepdog automatically detects the machine and begins to configure it as a member of the cluster.

Architecture

Sheepdog is a storage system that provides a simple key-value interface to the Sheepdog client (a QEMU block driver). Sheepdog consists of multiple nodes.
Compare the Sheepdog architecture with a regular cluster file system architecture:
Each node runs only one server process (which we call collie) alongside a patched QEMU/KVM.
Sheepdog components

Virtual Disk Image (VDI)

A Sheepdog client divides a VM image into fixed-size objects (4 MB by default) and stores them on the distributed storage system. Each object is identified by a globally unique 64-bit id and replicated to multiple nodes.
Virtual disk image
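The offset-to-object mapping can be sketched in a few lines. Note the exact 64-bit id layout is not described in the text, so the packing below is hypothetical:

```python
# Illustrative sketch: map a byte offset in a VM image to the fixed-size
# object holding it. The 64-bit id layout is hypothetical; the text only
# says ids are globally unique 64-bit values.
OBJECT_SIZE = 4 * 1024**2  # 4 MB default object size

def locate(vdi_id, offset):
    """Return a hypothetical 64-bit object id and the offset inside it."""
    index = offset // OBJECT_SIZE             # which 4 MB object
    # Hypothetical packing: high bits identify the VDI, low bits the index.
    object_id = (vdi_id << 32) | index
    return object_id, offset % OBJECT_SIZE

oid, inner = locate(vdi_id=7, offset=10 * 1024**2)  # 10 MB into the image
print(oid & 0xFFFFFFFF, inner)  # object index 2, 2 MB into that object
```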

Object

Sheepdog objects are grouped into two types.
  • VDI Object: A VDI object contains metadata for a VM image, such as image name, disk size, creation time, etc.
  • Data Object: A VM image is divided into data objects. Sheepdog clients generally access these objects.
Sheepdog uses consistent hashing to decide where objects are stored. Consistent hashing is a scheme that provides hash-table functionality in which the addition or removal of nodes does not significantly change the mapping of objects. I/O load is balanced across the nodes by the properties of the hash function. A mechanism for distributing the data not randomly but intelligently is future work.
Each node is placed on the consistent hashing ring based on its own id. To determine where to store an object, the Sheepdog client takes the object id, finds the corresponding point on the ring, and walks clockwise to determine the target nodes.
Consistent hashing
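The clockwise walk described above can be sketched as follows. This is a generic consistent-hashing illustration (my own code, not Sheepdog's implementation; real systems typically add virtual nodes for smoother balancing):

```python
# A minimal sketch of the placement scheme described above: nodes are
# hashed onto a ring, and an object's replicas go to the first distinct
# nodes found walking clockwise from the object's position. Generic
# illustration, not Sheepdog's actual code.
import hashlib
from bisect import bisect_right

def ring_hash(key):
    """Hash a string to a position on the ring."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

def build_ring(node_ids):
    """Place each node on the ring, sorted by its hash position."""
    return sorted((ring_hash(n), n) for n in node_ids)

def target_nodes(ring, object_id, replicas=3):
    """Walk clockwise from the object's point to pick replica nodes."""
    points = [p for p, _ in ring]
    start = bisect_right(points, ring_hash(object_id))
    targets = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in targets:
            targets.append(node)
        if len(targets) == replicas:
            break
    return targets

ring = build_ring(["node-a", "node-b", "node-c", "node-d"])
print(target_nodes(ring, "object-42"))  # three distinct nodes, in clockwise order
```

The key property is that adding or removing one node only remaps the objects whose clockwise walk passes through that node's position, leaving the rest of the mapping untouched.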

VDI Operation

In most cases, Sheepdog clients can access their images independently because we do not allow clients to access the same image at the same time. But some VDI operations (e.g. cloning a VDI, locking a VDI) must be done exclusively because these operations update global information. To implement this in a highly available system, we use a group communication system (GCS). Group communication systems provide specific guarantees such as total ordering of messages. We use corosync, one of the most widely used GCSs.
Cluster communication