Monday 29 November 2010

Design — Sheepdog Project

The architecture of Sheepdog is fully symmetric; there is no central node such as a metadata server. This design enables the following features.
  • Linear scalability in performance and capacity
    When more performance or capacity is needed, Sheepdog can be grown linearly by simply adding new machines to the cluster.
  • No single point of failure
    Even if a machine fails, the data is still accessible through other machines.
  • Easy administration
    There is no configuration file describing cluster roles. When administrators launch the Sheepdog program on a newly added machine, Sheepdog automatically detects it and begins configuring it as a member of the cluster.

Architecture

Sheepdog is a storage system that provides a simple key-value interface to the Sheepdog client (the QEMU block driver). Sheepdog consists of multiple nodes.
[Figure: Sheepdog architecture compared with a regular cluster file system architecture]
Sheepdog consists of only one server program (which we call collie) and a patched QEMU/KVM.
[Figure: Sheepdog components]

Virtual Disk Image (VDI)

A Sheepdog client divides a VM image into fixed-size objects (4 MB by default) and stores them on the distributed storage system. Each object is identified by a globally unique 64-bit ID and replicated to multiple nodes.
[Figure: Virtual disk image]
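
As a rough illustration of this split (a sketch only; the 64-bit object-ID layout below is hypothetical, not Sheepdog's actual format), in Python:

    OBJECT_SIZE = 4 * 1024 * 1024  # the default 4 MB object size

    def split_into_objects(vdi_id, image_data):
        """Divide a VM image into fixed-size data objects.

        The ID layout (VDI id in the high 32 bits, object index in the
        low 32 bits) is a hypothetical illustration of a globally
        unique 64-bit object ID.
        """
        objects = {}
        for offset in range(0, len(image_data), OBJECT_SIZE):
            object_id = (vdi_id << 32) | (offset // OBJECT_SIZE)
            objects[object_id] = image_data[offset:offset + OBJECT_SIZE]
        return objects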

Object

Sheepdog objects are grouped into two types.
  • VDI Object: A VDI object contains metadata for a VM image, such as the image name, disk size, creation time, etc.
  • Data Object: A VM image is divided into data objects. Sheepdog clients generally access these objects.
Sheepdog uses consistent hashing to decide where to store objects. Consistent hashing is a scheme that provides hash-table functionality in which the addition or removal of nodes does not significantly change the mapping of objects. I/O load is balanced across the nodes by the uniformity of the hash function. A mechanism for distributing the data intelligently rather than randomly is future work.
Each node is placed on the consistent hashing ring based on its own ID. To determine where to store an object, the Sheepdog client hashes the object ID, finds the corresponding point on the ring, and walks clockwise to determine the target nodes.
[Figure: Consistent hashing]
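
A minimal consistent-hashing sketch in Python (an illustration of the technique, not Sheepdog's implementation):

    import bisect
    import hashlib

    def ring_hash(key):
        # Map node IDs and object IDs onto points on the ring.
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, node_ids):
            # Each node sits on the ring at the hash of its own ID.
            self._points = sorted((ring_hash(n), n) for n in node_ids)
            self._keys = [point for point, _ in self._points]

        def target_nodes(self, object_id, replicas=3):
            # Find the object's point on the ring, then walk clockwise,
            # collecting distinct nodes until we have enough replicas.
            start = bisect.bisect(self._keys, ring_hash(object_id))
            targets = []
            for i in range(len(self._points)):
                node = self._points[(start + i) % len(self._points)][1]
                if node not in targets:
                    targets.append(node)
                if len(targets) == replicas:
                    break
            return targets

    ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
    print(ring.target_nodes(0x8000000000000001))  # three target nodes

Adding or removing a node only remaps the objects in the ring segment that node covers, which is why membership changes stay cheap.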

VDI Operation

In most cases, Sheepdog clients can access their images independently because we do not allow clients to access the same image at the same time. But some VDI operations (e.g., cloning or locking a VDI) must be done exclusively, because these operations update global information. To implement this in a highly available system, we use a group communication system (GCS). Group communication systems provide specific guarantees, such as total ordering of messages. We use corosync, one of the most widely used GCSs.
[Figure: Cluster communication]
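
To see why total ordering matters here, consider a toy sketch (an illustration of the idea, not Sheepdog's actual protocol): a sequencer assigns every VDI operation a global sequence number, and each node applies operations strictly in that order, so all nodes converge on the same global state even when messages arrive out of order:

    import itertools

    class Sequencer:
        """Stands in for the total-ordering guarantee a GCS provides."""
        def __init__(self):
            self._counter = itertools.count()

        def broadcast(self, op, nodes):
            seq = next(self._counter)
            for node in nodes:
                node.deliver(seq, op)

    class Node:
        def __init__(self):
            self._pending = {}
            self._next = 0
            self.vdi_names = set()

        def deliver(self, seq, op):
            # Buffer out-of-order deliveries; apply strictly in order.
            self._pending[seq] = op
            while self._next in self._pending:
                self._apply(self._pending.pop(self._next))
                self._next += 1

        def _apply(self, op):
            kind, name = op
            if kind == "create":
                self.vdi_names.add(name)
            elif kind == "delete":
                self.vdi_names.discard(name)

    nodes = [Node(), Node(), Node()]
    sequencer = Sequencer()
    sequencer.broadcast(("create", "vm-image-1"), nodes)
    sequencer.broadcast(("delete", "vm-image-1"), nodes)
    assert all(n.vdi_names == set() for n in nodes)  # identical state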

Friday 26 November 2010

Scalability | Harvard Computer Science Lecture

Watch it on Academic Earth

LECTURE DESCRIPTION

Professor David J. Malan discusses scalability as it pertains to building dynamic websites.

COURSE DESCRIPTION

Today's websites are increasingly dynamic. Pages are no longer static HTML files but instead generated by scripts and database calls. User interfaces are more seamless, with technologies like Ajax replacing traditional page reloads. This course teaches students how to build dynamic websites with Ajax and with Linux, Apache, MySQL, and PHP (LAMP), one of today's most popular frameworks. Students learn how to set up domain names with DNS, how to structure pages with XHTML and CSS, how to program in JavaScript and PHP, how to configure Apache and MySQL, how to design and query databases with SQL, how to use Ajax with both XML and JSON, and how to build mashups. The course explores issues of security, scalability, and cross-browser support and also discusses enterprise-level deployments of websites, including third-party hosting, virtualization, colocation in data centers, firewalling, and load-balancing.

Tuesday 23 November 2010

ZooKeeper Promoted to Apache Top Level Project

ZooKeeper, the centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services, that started under the Hadoop umbrella has been promoted to an Apache Top Level Project, according to the ☞ report sent out by Doug Cutting.

In case you are wondering what it means, simply put it’s proof of the project’s maturity and of its community’s ability to ensure the project’s future. On the other hand, if the Hadoop, HBase, and ZooKeeper communities do not coordinate their efforts, it might mean more work for users, who will have to match and test versions when using ZooKeeper together with Hadoop and HBase.


Original title and link: ZooKeeper Promoted to Apache Top Level Project (NoSQL databases © myNoSQL)

"

Another NoSQL Comparison: Evaluation Guide

The requirements were clear:


  • Fast data insertion.
  • Extremely fast random reads on large datasets.
  • Consistent read/write speed across the whole data set.
  • Efficient data storage.
  • Scale well.
  • Easy to maintain.
  • Have a network interface.
  • Stable, of course.

The list of NoSQL databases to be compared (Tokyo Cabinet, BerkeleyDB, MemcacheDB, Project Voldemort, Redis, and MongoDB) was not so clear.

The evaluation methodology and the results were definitely not clear at all.


NoSQL Comparison Guide / A review of Tokyo Cabinet, Tokyo Tyrant, Berkeley DB, MemcacheDB, Voldemort, Redis, MongoDB


And the conclusion is quite wrong:


Although MongoDB is the solution for most NoSQL use cases, it’s not the only solution for all NoSQL needs.



Original title and link: Another NoSQL Comparison: Evaluation Guide (NoSQL databases © myNoSQL)

"

Tuesday 16 November 2010

Videos from Hadoop World

There was one NoSQL conference that I missed, and I was really pissed off about it: Hadoop World. Even though I followed and curated the Twitter feed, resulting in Hadoop World in tweets, not being there made me really sad. But now, thanks to Cloudera, I’ll be able to watch most of the presentations. Many of them have already been published, and the complete list can be found ☞ here.

Based on the Twitter activity that day, I’ve selected below the ones that seemed to generate the most buzz. The list contains names like Facebook, Twitter, eBay, Yahoo!, StumbleUpon, comScore, Mozilla, and AOL. And there are quite a few more …




  • HBase in Production at Facebook, presented by Jonathan Gray (Facebook)
  • The Hadoop Ecosystem at Twitter, presented by Kevin Weil (Twitter)
  • Hadoop at eBay, presented by Anil Madan (eBay)
  • A Fireside Chat: Using Hadoop to Tackle Big Data at comScore, presented by Martin Hall (Karmasphere) and Will Duckworth (comScore)
  • ScaleIn: Collecting and Querying Log Data in Near Real-time, presented by Anurag Phadke (Mozilla)
  • AOL’s Data Layer, presented by Ian Holsman (AOL)
  • Hadoop Based Intelligent Text Information Processing System, presented by Vaijanath Rao and Rohini Uppuluri (AOL)
  • Mixing Real-Time Needs and Batch Processing: How StumbleUpon Built an Advertising Platform using HBase and Hadoop, presented by Jean-Daniel Cryans (StumbleUpon)
  • Hadoop at Yahoo!: Ready for Business, presented by Arun C. Murthy (Yahoo!)
  • Apache ZooKeeper at Yahoo!, presented by Mahadev Konar (Yahoo!)

And keeping in mind names like Bank of America, Orbitz, CME, Infochimps, and Sematext, I bet you can find ☞ many more. So I guess we now have videos for at least a few days.


Thanks Cloudera!


Original title and link: Videos from Hadoop World (NoSQL databases © myNoSQL)

"

The Underlying Technology of Messages

We're launching a new version of Messages today that combines chat, SMS, email, and Messages into a real-time conversation. The product team spent the last year building out a robust, scalable infrastructure. As we launch the product, we wanted to share some details about the technology.
The current Messages infrastructure handles over 350 million users sending over 15 billion person-to-person messages per month. Our chat service supports over 300 million users who send over 120 billion messages per month. By monitoring usage, we identified two general data patterns:
  1. A short set of temporal data that tends to be volatile
  2. An ever-growing set of data that rarely gets accessed
When we started investigating a replacement for the existing Messages infrastructure, we wanted to take an objective approach to storage for these two usage patterns. In 2008 we open-sourced Cassandra, an eventual-consistency key-value store that was already in production serving traffic for Inbox Search. Our Operations and Databases teams have extensive knowledge in managing and running MySQL, so switching off of either technology was a serious concern. We either had to move away from our investment in Cassandra or train our Operations teams to support a new, large system.
We spent a few weeks setting up a test framework to evaluate clusters of MySQL, Apache Cassandra, Apache HBase, and a couple of other systems. We ultimately chose HBase. MySQL proved to not handle the long tail of data well; as indexes and data sets grew large, performance suffered. We found Cassandra's eventual consistency model to be a difficult pattern to reconcile for our new Messages infrastructure.
HBase comes with very good scalability and performance for this workload and a simpler consistency model than Cassandra. While we’ve done a lot of work on HBase itself over the past year, when we started we also found it to be the most feature rich in terms of our requirements (auto load balancing and failover, compression support, multiple shards per server, etc.). HDFS, the underlying filesystem used by HBase, provides several nice features such as replication, end-to-end checksums, and automatic rebalancing. Additionally, our technical teams already had a lot of development and operational expertise in HDFS from data processing with Hadoop. Since we started working on HBase, we've been focused on committing our changes back to HBase itself and working closely with the community. The open source release of HBase is what we’re running today.
Since Messages accepts data from many sources such as email and SMS, we decided to write an application server from scratch instead of using our generic Web infrastructure to handle all decision making for a user's messages. It interfaces with a large number of other services: we store attachments in Haystack, wrote a user discovery service on top of Apache ZooKeeper, and talk to other infrastructure services for email account verification, friend relationships, privacy decisions, and delivery decisions (for example, should a message be sent over chat or SMS). We spent a lot of time making sure each of these services is reliable, robust, and performant enough to handle a real-time messaging system.
The new Messages will launch over 20 new infrastructure services to ensure you have a great product experience. We hope you enjoy using it.
Kannan is a software engineer at Facebook.

"

Strategy: Biggest Performance Impact is to Reduce the Number of HTTP Requests



Low Cost, High Performance, Strong Security: Pick Any Three by Chris Palmer is a funny and informative presentation whose main message is: reduce the size and frequency of network communications, which will make your pages load faster, which will improve performance enough that you can use HTTPS all the time, which will make you safe and secure online, which is a good thing.

The benefits of HTTPS for security are overwhelming, but people are afraid of the performance hit. The argument is successfully made that the overhead of HTTPS is low enough that you can afford the cost if you do some basic optimization. Reducing the number of HTTP requests is a good source of low-hanging fruit.

From the Yahoo UI Blog:


Reducing the number of HTTP requests has the biggest impact on reducing response time and is often the easiest performance improvement to make.


"

Facebook: The Underlying Technology of Messages Using HBase

Cassandra, MySQL, and HBase were evaluated for Facebook’s new messaging system. HBase was finally picked and is now behind the new product announced today:


HBase comes with very good scalability and performance for this workload and a simpler consistency model than Cassandra. While we’ve done a lot of work on HBase itself over the past year, when we started we also found it to be the most feature rich in terms of our requirements (auto load balancing and failover, compression support, multiple shards per server, etc.). HDFS, the underlying filesystem used by HBase, provides several nice features such as replication, end-to-end checksums, and automatic rebalancing. Additionally, our technical teams already had a lot of development and operational expertise in HDFS from data processing with Hadoop. Since we started working on HBase, we’ve been focused on committing our changes back to HBase itself and working closely with the community. The open source release of HBase is what we’re running today.



Original title and link: Facebook: The Underlying Technology of Messages Using HBase (NoSQL databases © myNoSQL)

"

Saturday 13 November 2010

LISA '10 Technical Sessions

Data Structures from the Future: Bloom Filters, Distributed Hash Tables, and More!
Thomas A. Limoncelli, Google, Inc.

View the Slides
Greetings, earthlings of the year 2010! I've traveled back in time to share with you some of the technologies that system administrators operate in the future. Chances are you know what a cache is and how to tune it. In the future, there will be glorious things such as "bloom filters," "distributed hash tables," and "NoSQL databases." I will reveal what they are and (more important) how to tune them. (This will be an informal talk with a lot of hand-waving.)
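
For a taste of one of those data structures, here is a minimal Bloom filter sketch in Python. The standard tuning formulas apply: for n items and a target false-positive rate p, use about m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions:

    import hashlib
    import math

    class BloomFilter:
        def __init__(self, n_items, fp_rate):
            # Size the bit array and hash count from the target rate.
            self.m = max(1, int(-n_items * math.log(fp_rate) / math.log(2) ** 2))
            self.k = max(1, round(self.m / n_items * math.log(2)))
            self.bits = bytearray((self.m + 7) // 8)

        def _positions(self, item):
            # Derive k positions by salting one hash function.
            for i in range(self.k):
                digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            # May return a false positive, never a false negative.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter(n_items=1000, fp_rate=0.01)
    bf.add("cache-key-42")
    print("cache-key-42" in bf, "missing-key" in bf)  # True False (usually)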

Thursday 11 November 2010

Quick Reference: Hadoop Tools Ecosystem

Just a quick reference to the continuously growing Hadoop tools ecosystem.

Hadoop


The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.

hadoop.apache.org


HDFS


A distributed file system that provides high throughput access to application data.

hadoop.apache.org/hdfs/


MapReduce


A software framework for distributed processing of large data sets on compute clusters.
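
To make the programming model concrete, here is the classic word count written as Hadoop Streaming scripts in Python (a sketch; in practice the two functions live in separate files passed via the streaming jar's -mapper and -reducer options):

    import sys

    def mapper():
        # Emit ("word", 1) for every word read from stdin.
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        # Hadoop sorts mapper output by key, so identical words
        # arrive consecutively and can be summed with one counter.
        current, total = None, 0
        for line in sys.stdin:
            word, count = line.rsplit("\t", 1)
            if word != current and current is not None:
                print(f"{current}\t{total}")
                total = 0
            current = word
            total += int(count)
        if current is not None:
            print(f"{current}\t{total}")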


Amazon Elastic MapReduce


Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

aws.amazon.com/elasticmapreduce/


Cloudera Distribution for Hadoop (CDH)


Cloudera’s Distribution for Hadoop (CDH) sets a new standard for Hadoop-based data management platforms.

cloudera.com/hadoop


ZooKeeper


A high-performance coordination service for distributed applications. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

hadoop.apache.org/zookeeper/
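
For a feel of what that means in code, here is a short sketch using the third-party kazoo Python client (the host address and paths are illustrative):

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")  # illustrative address
    zk.start()

    # Configuration: keep a small piece of shared config in a znode.
    zk.ensure_path("/app/config")
    zk.set("/app/config", b"max_workers=8")
    value, stat = zk.get("/app/config")

    # Distributed synchronization: a lock recipe built on ephemeral,
    # sequential znodes.
    lock = zk.Lock("/app/locks/reindex", identifier="worker-1")
    with lock:
        pass  # only one client at a time runs this section

    zk.stop()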


HBase


A scalable, distributed database that supports structured data storage for large tables.

hbase.apache.org


Avro


A data serialization system. Similar to ☞ Thrift and ☞ Protocol Buffers.

avro.apache.org


Sqoop


Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:

  • Imports individual tables or entire databases to files in HDFS
  • Generates Java classes to allow you to interact with your imported data
  • Provides the ability to import from SQL databases straight into your Hive data warehouse
cloudera.com/downloads/sqoop/


Flume


Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced.

archive.cloudera.com/cdh/3/flume/


Hive


Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad-hoc querying, and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data, along with a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to query the data. At the same time, the language allows traditional map/reduce programmers to plug in their custom mappers and reducers for more sophisticated analysis that may not be supported by its built-in capabilities.

hive.apache.org


Pig


A high-level data-flow language and execution framework for parallel computation. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

pig.apache.org


Oozie


Oozie is a workflow/coordination service to manage data processing jobs for Apache Hadoop. It is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce).

yahoo.github.com/oozie


Cascading


Cascading is a Query API and Query Planner used for defining and executing complex, scale-free, and fault tolerant data processing workflows on a Hadoop cluster.

cascading.org


Cascalog


Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.

github.com/nathanmarz/cascalog


HUE


Hue is a graphical user interface to operate and develop applications for Hadoop. Hue applications are collected into a desktop-style environment and delivered as a Web application, requiring no additional installation for individual users.

archive.cloudera.com/cdh3/hue

You can read more about HUE on ☞ Cloudera blog.


Chukwa


Chukwa is a data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

incubator.apache.org/chukwa/


Mahout


A scalable machine learning and data mining library.

mahout.apache.org


Integration with Relational databases


Integration with Data Warehouses


The only list I have is MapReduce, RDBMS, and Data Warehouse, but I’m afraid it is quite old, so maybe someone could help me update it.

Anything else? Once we validate this list, I’ll probably have to move it to the NoSQL reference.


Original title and link: Quick Reference: Hadoop Tools Ecosystem (NoSQL databases © myNoSQL)

"

MongoDB Replica Sets

Even if the MongoDB replica sets ☞ official documentation is quite good, that doesn’t mean more coverage of the subject isn’t useful. So, this is kind of everything you need to read about MongoDB replica sets.




Let’s start with Kristina Chodorow’s series on MongoDB replica sets:

  • ☞ Replica sets part 1: Master-Slave is so 2009
    This post shows how to do the “Hello, world” of replica sets. I was going to start with a post explaining what they are, but coding is more fun than reading. For now, all you have to know is that they’re master-slave with automatic failover.
  • ☞ Replica sets part 2: what are replica sets
    Replica sets are basically just master-slave with automatic failover.

    So, you have a pool of servers with one primary (the master) and N secondaries (slaves). If the primary crashes or disappears, the other servers will hold an election to choose a new primary.
  • ☞ Part 3: Replica sets in the wild:
    […] you might want to migrate dev servers into production, add new slaves, prioritize servers, change things on the fly… that’s what this post covers.
  • ☞ Sharding and replica sets illustrated:



    Let’s say we drop a tray. CRASH! With this setup, your data is safe (as long as you were using w) and the cluster loses no functionality (in terms of reads and writes).

    Chunks will not be able to migrate (because one of the config servers is down), so a shard may become bloated if the config server is down for a very long time.

    Network partitions and losing two servers are bigger problems, so you should have more than three servers if you actually want great availability.
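
The posts above describe failover from the server side; from the client side, a small sketch with the modern PyMongo driver (the hostnames and the replica-set name rs0 are illustrative) shows what it buys you: the driver discovers the current primary, routes writes to it, and fails over automatically after an election.

    from pymongo import MongoClient, ReadPreference

    # Connect to any reachable members; the driver discovers the rest
    # of the set and tracks which member is currently primary.
    client = MongoClient(
        "mongodb://db1.example.com,db2.example.com,db3.example.com"
        "/?replicaSet=rs0"
    )

    db = client.get_database("app")
    db.events.insert_one({"type": "login"})  # writes go to the primary

    # Reads can be offloaded to secondaries when stale reads are OK.
    secondary_db = client.get_database(
        "app", read_preference=ReadPreference.SECONDARY_PREFERRED
    )
    doc = secondary_db.events.find_one({"type": "login"})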



If Kristina’s first post covered MongoDB replica set setup, the Boxed Ice guys ☞ also have a post on the topic:


You can have any number of members in a replica set, and your data will exist in full on each member of the set. This allows you to have servers distributed across data centres and geographies to ensure full redundancy. One server is always the primary, to which reads and writes are sent, with the other members being secondaries that accept reads only. In the event of the primary failing, another member will take over automatically.


plus a video showing the setup:




And nothing is complete without videos, so here is Dwight Merriman’s talk on MongoDB replication and replica sets:




Last but not least, you can find a replica sets demo video on the ☞ 10gen blog.


Original title and link: MongoDB Replica Sets (NoSQL databases © myNoSQL)

"

Tuesday 9 November 2010

Presentation: Machine Learning: A Love Story

Hilary Mason presents the history of machine learning, covering some of the most significant developments of the last two decades, especially the fundamental math and algorithmic tools employed. She also shows how machine learning is used by bit.ly to derive various statistical information about users.

Introducing S4

[Figure: Stream of Words]

Motivation

Data streams abound in the world of Big Data: Twitter, search queries, stock quotes, website analytics, and sensor data, to name a few. Yet popular approaches for data processing at this scale are based on MapReduce, a batch-oriented framework; in other cases there are proprietary stream processing systems or ad-hoc solutions for particular problems.

We understand that the greatest value from data is sometimes derived by processing it as soon as we get it. S4 is a general-purpose open platform built from the ground up to process data streams: we mean processing data as it arrives, one event at a time, not buffered in arbitrary batches.
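
As a rough illustration of the event-at-a-time model (a sketch of the general idea, not S4's actual API), a processing element receives keyed events one by one and updates its state immediately:

    from collections import defaultdict

    class WordCountPE:
        """Toy processing element: consumes one event at a time and
        updates its state immediately, with no batching."""
        def __init__(self):
            self.counts = defaultdict(int)

        def process_event(self, event):
            # Each event is handled the moment it arrives.
            self.counts[event["word"]] += 1

    pe = WordCountPE()
    for event in [{"word": "stream"}, {"word": "stream"}, {"word": "event"}]:
        pe.process_event(event)
    print(dict(pe.counts))  # {'stream': 2, 'event': 1}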

Application

In search advertising we aim to select, rank, filter, and place ads in ways that benefit users, publishers, and advertisers. In other words, we want to show as few ads as possible that are highly relevant and generate the most clicks and revenue. To accomplish all this we build machine learning models that accurately predict how users will click on ads. To improve the accuracy of the click prediction we process recent search queries, clicks, and timing information. We call this personalization. To scale we need to process thousands of search queries per second from potentially millions of users per day. This problem is what motivated us to build a software platform to process events in the same way that MapReduce and Hadoop try to address the scalability problem in batch processing, and the NoSQL projects try to address the scalability problem for data storage.

Who we are

Leo Neumeyer led the search ad optimization group and became the main internal champion and early designer of the project. Bruce Robbins has been the key software engineer who coded 80% of the system and is a key designer. Many scientists including Anish Nair, Anand Kesari, Stefan Schroedl, Erick Cantu-Paz, and Jon Malkin implemented algorithms for personalization, click prediction, and online parameter optimization that drove the design of the project over several iterations. Kishore Gopalakrishna designed and implemented most of the communication layer as a standalone component on top of ZooKeeper. Ben Reed gave us lots of great advice and support, and JC Mao was the executive who supported making S4 an open source project. Finally, Gil Yehuda, Director of Open Source at Yahoo!, is a great open source evangelist who knows how to get things done.

We learned by trying to solve real world problems and we are just starting to understand the use cases and the challenges that need to be solved. For this reason we decided to make S4 open source in October 2010 after incubating it for a couple of years. It feels like in one day we got more feedback from the community than in the last two years combined, pretty amazing. Please remember that this is an early stage release, it has a long road[map] ahead and we need to work on documentation. We challenge people passionate about stream processing to get involved and help us grow the project. The best way to get started is to play with the sample application and/or build your own. The forum is a great place to ask questions and give us feedback.