Tuesday, 31 August 2010

NoSQL databases: 10 Things you Should Know About Them

NoSQL databases: 10 Things you Should Know About Them: "
5 pros and 5 cons by Guy Harrison:


Five advantages of NoSQL


  1. Elastic scaling
  2. Big data
  3. Goodbye DBAs (see you later?)
  4. Economics
  5. Flexible data models

Five challenges of NoSQL


  1. Maturity
  2. Support
  3. Analytics and business intelligence
  4. Administration
  5. Expertise


A few amendments, though:



  • Elastic scaling: partially correct. Right now only a few of them feature elastic scaling: Cassandra, HBase, Riak, Project Voldemort, Membase, and, just recently, CouchDB through BigCouch.



  • Big data:
    partially correct. Some NoSQL databases do not scale horizontally, so they are not a perfect fit for big data.



  • Goodbye DBAs (see you later?):
    Maybe you won't call them DBAs, but someone still has to do data modeling and think about data access patterns.



  • Support and Expertise:
    I see these as just sub-categories of maturity (or the lack thereof).


Original title and link for this post: NoSQL databases: 10 Things you Should Know About Them (published on the NoSQL blog: myNoSQL)

"

Monday, 30 August 2010

Pomegranate - Storing Billions and Billions of Tiny Little Files

Pomegranate - Storing Billions and Billions of Tiny Little Files: "


Pomegranate is a novel distributed file system built over distributed tabular storage that acts an awful lot like a NoSQL system. It's targeted at increasing the performance of tiny object access in order to support applications like online photo and micro-blog services, which require high concurrency, high throughput, and low latency. Their tests seem to indicate it works:


We have demonstrated that a file system over tabular storage performs well for highly concurrent access. In our test cluster, we observed linear scaling to more than 100,000 aggregate read and write requests served per second (RPS).


Rather than sitting atop the file system like almost every other K-V store, Pomegranate is baked into the file system. The idea is that the file system API is common to every platform, so it wouldn't require a separate API to use. Every application could use it out of the box.

The features of Pomegranate are:

  • It handles billions of small files efficiently, even in one directory;
  • It provides a separate, scalable caching layer that can be snapshotted;
  • The storage layer uses a log-structured store to absorb small-file writes and make full use of disk bandwidth;
  • It builds a global namespace for both small files and large files;
  • It uses columnar storage to exploit temporal and spatial locality;
  • It uses a distributed extendible hash to index metadata;
  • Its snapshot-able, reconfigurable caching increases parallelism and tolerates failures;
  • Pomegranate should be the first file system built over tabular storage, and the experience of building it should be valuable to the file system community.


Can Ma, who leads the research on Pomegranate, was kind enough to agree to a short interview.

"

Can you please give an overview of the architecture and what you are doing that's cool and different?
Basically, there is no distributed or parallel file system that can handle billions of small files efficiently. However, we can foresee that web applications (such as email, photo, and even video) and bio-computing (gene sequencing) will need massive small-file access. Meanwhile, the file system API is general enough and well understood by most programmers.
Thus, we want to build a file system to manage billions of small files and provide high throughput for concurrent access. Although Pomegranate is designed for access to small files, it supports large files as well. It is built on top of other distributed file systems, such as Lustre, and only manages the namespace and the small files. We just want to stand on "the shoulders of giants". See the figure below:
Pomegranate has many metadata servers (MDS) and metadata storage servers (MDSL) to serve metadata requests and small-file read/write requests. The MDSs are just a caching layer: they load metadata from storage and commit memory snapshots back to storage. The core of Pomegranate is a distributed tabular storage system called xTable. It supports key-indexed multi-column lookups. We use a distributed extendible hash to locate the server for a key, because an extendible hash adapts better to scaling up and down.
In file systems, the directory table and the inode table are usually kept separate to support two different types of lookup: lookups by pathname are handled by the directory table, while lookups by inode number are handled by the inode table. It is nontrivial to keep these two indexes consistently updated, especially in a distributed file system. Using two indexes also adds lookup latency, which is unacceptable for accessing tiny files. Typically there are in-memory caches for dentries and inodes, but those caches can't easily be extended, and modifying metadata means updating multiple locations. To keep things consistent, an operation log is introduced, but the operation log is always a serialization point for request flows.
Pomegranate uses a table-like directory structure that merges the directory table and the inode table. The two types of lookup are unified into lookups by key; for the file system, the key is the hash value of the dentry name, and hash conflicts are resolved by a globally unique id for each file. For each update, we only need to search and update one table. To eliminate the operation log, we designed memory snapshots to obtain a consistent image: the dirty regions of each snapshot can be written to storage safely without considering concurrent modifications (concurrent updates are copy-on-write).
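This is not Pomegranate's code, but a toy Java sketch of the idea of a single table keyed by the hash of the dentry name, with a per-file unique id carried alongside to tell apart rows whose names collide on the hash:

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Toy merged directory/inode table: rows are keyed by a hash of the dentry name,
  // and rows that collide on the hash are told apart by name (each also carries a
  // globally unique file id).
  class TinyMetaTable {
      record Row(long uid, String dentryName, long inode, long size) {}

      private final Map<Long, List<Row>> buckets = new HashMap<>();

      void insert(Row row) {
          buckets.computeIfAbsent(hashName(row.dentryName()), k -> new ArrayList<>()).add(row);
      }

      Row lookup(String dentryName) {
          List<Row> bucket = buckets.get(hashName(dentryName));
          if (bucket == null) return null;
          for (Row r : bucket) {                 // collision chains are expected to be tiny
              if (r.dentryName().equals(dentryName)) return r;
          }
          return null;
      }

      private static long hashName(String name) {
          return name.hashCode() & 0xffffffffL;  // any stable hash works for this sketch
      }
  }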
However, some complex file system operations, such as mkdir, rmdir, hard link, and rename, have to update at least two tables. We implemented a reliable multi-site update service to propagate deltas from one table to another. For example, on mkdir, we propagate the delta ("nlink +1") to the parent table.
Are there any single points of failure? 
There is no SPOF in the design. We use a cluster of MDSs to serve metadata requests; if one MDS crashes, requests are redirected to other MDSs (consistent hashing and heartbeats are used). Metadata and small files are also replicated to multiple nodes. However, this replication is triggered by external sync tools and is asynchronous to the writes.
Small files have usually been the death of filesystems because of directory structure maintenance. How do you get around that?
Yep, small-file access is deadly slow in traditional file systems. We replace the traditional directory table (a B+ tree or hash tree) with a distributed extendible hash table. The dentry name and inode metadata are treated as columns of the table. Lookups from clients are sent (or routed if needed) to the correct MDS. Thus, to access a small file, we just need to read one table row to find the file location. We keep each small file stored sequentially in the native file system, so one I/O access can serve a small-file read.
What POSIX APIs are supported? Can files be locked, mapped, symlinked, etc.?
At present, POSIX support is still progressing. We do support symlinks and mmap access, but flock is not supported yet.
Why do a kernel level file system rather than a K-V store on top?
Our initial objective was to implement a file system to support more existing applications, but we do support a K/V interface on top of xTable now. In the architecture figure, the AMC client is the key/value client for Pomegranate. We support simple predicates on key or value; for example, "select * from table where key < 10 and 'xyz' in value" returns the k/v pairs whose value contains "xyz" and whose key is less than 10.
How does it compare to other distributed filesystems?
We want to compare small-file performance with other file systems, but we have not tested it yet; we will do that next month. That said, we believe most distributed file systems cannot handle massive small-file access efficiently.
Are indexes and any sort of queries supported?
For now, these have not been properly considered yet. We plan to look at range queries next.
Does it work across datacenters, that is, how does it deal with latency?
Pomegranate only works in a datacenter. WAN support has not been considered yet.
It looks like you use an in-memory architecture for speed. Can you talk about that?
We use a dedicated memory cache layer for speed. Table rows are grouped into table slices. In memory, the table slices are hashed into a local extendible hash table, both for performance and for space efficiency, as shown in the figure below.
Clients issue a request by hashing the file name and looking it up in the bitmap, then use a consistent hash ring to locate the cache server (MDS) or storage server (MDSL). Each update first joins the currently *open* transaction group and is applied to the in-memory table row. Each transaction group change is atomic. After all pending updates are finished, the transaction group can be committed to storage safely. This approach is similar to Sun's ZFS.
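Again, this is not Pomegranate's code, but a generic Java sketch of the consistent-hash-ring lookup used to pick the server responsible for a key:

  import java.nio.charset.StandardCharsets;
  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;
  import java.util.SortedMap;
  import java.util.TreeMap;

  // Generic consistent hash ring: servers are placed at several points on a ring of
  // hash values, and a key is served by the first server at or after the key's hash
  // (wrapping around to the start of the ring).
  class HashRing {
      private final TreeMap<Long, String> ring = new TreeMap<>();
      private final int virtualNodes;

      HashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

      void addServer(String server) {
          for (int i = 0; i < virtualNodes; i++) ring.put(hash(server + "#" + i), server);
      }

      void removeServer(String server) {
          for (int i = 0; i < virtualNodes; i++) ring.remove(hash(server + "#" + i));
      }

      String serverFor(String key) {
          if (ring.isEmpty()) throw new IllegalStateException("no servers in the ring");
          SortedMap<Long, String> tail = ring.tailMap(hash(key));
          return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
      }

      private static long hash(String s) {
          try {
              byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
              return ((d[0] & 0xffL) << 24) | ((d[1] & 0xffL) << 16) | ((d[2] & 0xffL) << 8) | (d[3] & 0xffL);
          } catch (NoSuchAlgorithmException e) {
              throw new RuntimeException(e);
          }
      }
  }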
How is high availability handled?
Well, the central server for consistent-hash-ring management and failure coordination should be replicated with a Paxos-style algorithm. We plan to use ZooKeeper for a highly available central service.
Other components are designed to be fault tolerant. Crashes of an MDS or MDSL can be configured to be recovered from immediately by routing requests to new servers (by selecting the next point in the consistent hash ring).
Operationally, how does it work? How are nodes added into the system?
Adding nodes to the caching layer is simple. The central server (R2) adds the new node to the consistent hash ring. All the cache servers act on this change and invalidate their cached table slices if those slices will now be managed by the new node. Requests from clients are routed to the new server, and a ring-change notification piggybacks on responses so clients pull the new ring from the central server.
How do you handle large files? Is it suitable for streaming video?
As described earlier, large files are relayed to other distributed file systems. Our caching layer will not be polluted by the streaming video data.
Anything else you would like to add?
Another figure shows the interaction of Pomegranate's components.

Sunday, 29 August 2010

Scalability updates for Aug 27th 2010

Scalability updates for Aug 27th 2010: "

My updates have been slow recently due to other things I'm involved in. If you need more frequent updates on what I'm reading, feel free to follow me on Twitter or Buzz.


Here are some of the big ones I have mentioned on my twitter/buzz feeds.









"

OpenStack - The Answer to: How do We Compete with Amazon?

OpenStack - The Answer to: How do We Compete with Amazon?: "


The Silicon Valley Cloud Computing Group had a meetup Wednesday on OpenStack, whose tag line is the open source, open standards cloud. I was shocked at the large turnout. 287 people registered and it looked like a large percentage of them actually showed up. I wonder, was it the gourmet pizza, the free t-shirts, or are people really that interested in OpenStack? And if they are really interested, why are they that interested? On the surface an open cloud doesn't seem all that sexy a topic, but with contributions from NASA, from Rackspace, and from a very avid user community, a lot of interest there seems to be.

The brief intro blurb to OpenStack is:

"

Friday, 27 August 2010

Zookeeper experience

Zookeeper experience: "
While working on Kafka, a distributed pub/sub system (more on that later) at LinkedIn, I need to use ZooKeeper (ZK) to implement the load-balancing logic. I'd like to share my experience of using ZooKeeper. First of all, for those of you who don't know, ZooKeeper is an Apache project that implements a consensus service based on a variant of Paxos (it's similar to Google's Chubby). ZK has a very simple, file-system-like API: one can create a path, set the value of a path, read the value of a path, delete a path, and list the children of a path. ZK does a couple of more interesting things: (a) one can register a watcher on a path and get notified when the children or the value of that path change, and (b) a path can be created as ephemeral, which means that if the client that created the path goes away, the path is automatically removed by the ZK server. However, don't let the simple API fool you. One needs to understand a lot more than those APIs in order to use them properly. For me, this translated into weeks of asking the ZK mailing list (which is pretty responsive) and our local ZK experts.
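To make those basic calls concrete, here is a minimal Java sketch against the standard org.apache.zookeeper.ZooKeeper client; the connect string, paths, and values are made up for illustration, and error handling is omitted:

  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.ZooDefs.Ids;
  import org.apache.zookeeper.ZooKeeper;
  import java.util.List;

  public class ZkBasics {
      public static void main(String[] args) throws Exception {
          // Connect; the session watcher here just logs state changes.
          ZooKeeper zk = new ZooKeeper("localhost:2181", 30000,
                  event -> System.out.println("session event: " + event.getState()));

          // Create a path, update and read its value, list its children.
          zk.create("/app", "v1".getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
          zk.setData("/app", "v2".getBytes(), -1);               // -1 = any version
          byte[] value = zk.getData("/app", false, null);
          List<String> children = zk.getChildren("/app", true);  // true = set a watch

          // An ephemeral path is removed automatically when this client's session ends.
          zk.create("/app/worker-1", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

          zk.close();
      }
  }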

To get started, it's important to understand the state transitions and the associated watcher events inside a ZK client. A ZK client can be in one of three states: disconnected, connected, and closed. When a client is created, it is in the disconnected state. Once a connection is established, the client moves to the connected state. If the client loses its connection to a server, it switches back to the disconnected state, and if it can't connect to any server within some time limit, it eventually transitions to the closed state. For each state transition, a state-changing event (Disconnected, SyncConnected, or Expired) is sent to the client's watcher. As you will see, those events are critical to the client. Finally, if one performs an operation on ZK while the client is in the disconnected state, a ConnectionLossException (CLE) is thrown back to the caller. More detailed information can be found on the ZK site. A lot of the subtlety in using ZK comes from dealing with those state-changing events.

The first tricky issue is related to the CLE. The problem is that when a CLE happens, the requested operation may or may not have taken place on ZK. If the connection was lost before the request reached the server, the operation didn't take place. On the other hand, it can happen that the request did reach the server and was executed there, but the connection was lost before the server could send a response back. If the request is a read or an update, one can just keep retrying until the operation succeeds. It becomes a problem if the request is a create. If you simply retry, you may get a NodeExistsException, and it's not clear whether you or someone else created the path. What one can do is set the value of the path to a client-specific value during creation; if a NodeExistsException is thrown, read the value back to check who actually created it. One can't use this approach for sequential paths (a ZK feature that creates a path with a generated sequential id), though: if you retry, a different path will be created, and you also can't check who created the path, since after a CLE you don't know the name of the path that was created. For this reason, I think sequential paths have very limited benefit, since it's very hard to use them correctly.
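As a sketch of that create-then-check pattern (not code from the post; the path and client id are illustrative), one might write something like:

  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.KeeperException;
  import org.apache.zookeeper.ZooDefs.Ids;
  import org.apache.zookeeper.ZooKeeper;
  import java.nio.charset.StandardCharsets;

  public class SafeCreate {
      // Create 'path' with this client's id as its value, retrying on connection loss.
      // Returns true if this client ended up being the one that created the path.
      static boolean createOwned(ZooKeeper zk, String path, String myId)
              throws KeeperException, InterruptedException {
          byte[] data = myId.getBytes(StandardCharsets.UTF_8);
          while (true) {
              try {
                  zk.create(path, data, Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                  return true;                          // the create definitely succeeded
              } catch (KeeperException.ConnectionLossException e) {
                  // Unknown whether the create took effect; loop and retry.
              } catch (KeeperException.NodeExistsException e) {
                  // Someone created it -- possibly us, on an earlier attempt.
                  byte[] existing = zk.getData(path, false, null);
                  return myId.equals(new String(existing, StandardCharsets.UTF_8));
              }
          }
      }
  }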

The second tricky issue is to distinguish between a disconnected and an expired event. The former happens when the ZK client can't connect to the server, either because (1) the ZK server is down, or (2) the ZK server is up but the ZK client is partitioned from it, or is in a long GC pause and can't send heartbeats in time. In case (1), when the ZK server comes back, the client watcher gets a SyncConnected event and everything is back to normal. Surprisingly, in this case all the ephemeral paths and the watchers are still kept at the server, and you don't have to recreate them. In case (2), when the client finally reconnects to the server, it gets back an Expired event. This means the server thinks the client is dead and has taken the liberty of deleting all the ephemeral paths and watchers created by that client. It is the responsibility of the client to start a new ZK session and recreate the ephemeral paths and the watchers.
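A sketch of a session watcher that reacts to those three events might look like the following; the recovery action for an expired session is left as a hypothetical application callback:

  import org.apache.zookeeper.WatchedEvent;
  import org.apache.zookeeper.Watcher;

  public class SessionWatcher implements Watcher {
      @Override
      public void process(WatchedEvent event) {
          if (event.getType() != Event.EventType.None) {
              return;  // a data/children watch fired; handle it elsewhere
          }
          switch (event.getState()) {
              case Disconnected:
                  // Server unreachable or a long GC pause: ephemeral paths and watchers
                  // still exist on the server, so just wait for reconnection.
                  break;
              case SyncConnected:
                  // Reconnected within the session timeout: nothing to recreate.
                  break;
              case Expired:
                  // The server considers this client dead: ephemeral paths and watchers
                  // are gone. Open a new session and recreate them.
                  // reconnectAndRecreate();  // hypothetical application callback
                  break;
              default:
                  break;
          }
      }
  }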

To deal with the above issues, one has to write additional code that keeps track of the ZK client state, starts a new session when the old one expires, and handles the CLE appropriately. For my application, I found the ZKClient package quite useful. ZKClient is a wrapper around the original ZK client: it maintains the current state of the ZK client, hides the CLE from the caller by retrying the request once the state transitions back to connected, and reconnects when necessary. ZKClient has an Apache license and has been used in Katta for quite some time. Even with the help of ZKClient, I still have to handle things like determining who actually created a path when a NodeExistsException occurs, and re-registering after a session expires.

Finally, how do you test your ZK application, especially the various failure scenarios? One can use utilities like “ifconfig down/up” to simulate network partitioning. Todd Lipcon’s Gremlins seems very useful too.
"

What’s New in Apache Hadoop 0.21

What’s New in Apache Hadoop 0.21: "
Apache Hadoop 0.21.0 was released on August 23, 2010. The last major release was 0.20.0 in April last year, so it’s not surprising that there are so many changes in this release, given the amount of activity in the Hadoop development community. In fact, there were over 1300 issues fixed in JIRA (Common, HDFS, MapReduce), the issue tracker used for Apache Hadoop development. Bear in mind that the 0.21.0 release, like all dot zero releases, isn’t suitable for production use.

With such a large delta from the last release, it is difficult to grasp the important new features and changes. This post is intended to give a high-level view of some of the more significant features introduced in the 0.21.0 release. Of course, it can’t hope to cover everything, so please consult the release notes (Common, HDFS, MapReduce) and the change logs (Common, HDFS, MapReduce) for the full details. Also, please let us know in the comments of any features, improvements, or bug fixes that you are excited about.

You can download Hadoop 0.21.0 from an Apache Mirror. Thanks to everyone who contributed to this release!



Project Split


Organizationally, a significant chunk of work has arisen from the project split, which transformed a single Hadoop project (called Core) into three constituents: Common, HDFS, and MapReduce. HDFS and MapReduce both have dependencies on Common, but (other than for running tests) MapReduce has no dependency on HDFS. This separation emphasizes the fact that MapReduce can run on alternative distributed file systems (although HDFS is still the best choice for sheer throughput and scalability), and it has made following development easier since there are now separate lists for each subproject. There is one release tarball still, however, although it is laid out a little differently from previous releases, since it has a subdirectory containing each of the subproject source files.

From a user’s point of view little has changed as a result of the split. The configuration files are divided into core-site.xml, hdfs-site.xml, and mapred-site.xml (this was supported in 0.20 too), and the control scripts are now broken into three (HADOOP-4868): in addition to the bin/hadoop script, there is a bin/hdfs script and a bin/mapreduce script for running HDFS and MapReduce daemons and commands, respectively. The bin/hadoop script still works as before, but issues a deprecation warning. Finally, you will need to set the HADOOP_HOME environment variable to have the scripts work smoothly.

Common


The 0.21.0 release is technically a minor release (traditionally Hadoop 0.x releases have been major, and have been allowed to break compatibility with the previous 0.x-1 release) so it is API compatible with 0.20.2. To make the intended stability and audience of a particular API in Hadoop clear to users, all Java members with public visibility have been marked with classification annotations to say whether they are Public or Private (there is also LimitedPrivate, which signifies that another, named project may use it), and whether they are Stable, Evolving, or Unstable (HADOOP-5073). Only elements marked as Public appear in the user Javadoc (Common, MapReduce; note that HDFS is all marked as private since it is accessed through the FileSystem interface in Common). The classification interface is described in detail in Towards Enterprise-Class Compatibility for Apache Hadoop by Sanjay Radia.
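As an illustration, marking an API element looks roughly like the sketch below; the interface itself is invented, but the annotations are the ones from org.apache.hadoop.classification:

  import org.apache.hadoop.classification.InterfaceAudience;
  import org.apache.hadoop.classification.InterfaceStability;

  // Hypothetical example of the HADOOP-5073 classification annotations: a type marked
  // like this appears in the public Javadoc but may still change between releases.
  @InterfaceAudience.Public
  @InterfaceStability.Evolving
  public interface ExampleMetricsSource {
      long getRequestCount();
  }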

This release has seen some significant improvements to testing. The Large-Scale Automated Test Framework, known as Herriot (HADOOP-6332), allows developers to write tests that run against a real (possibly large) cluster. While there are only a dozen or so tests at the moment, the intention is that more tests will be written over time so that regression tests can be shared and run against new Hadoop release candidates, thereby making Hadoop upgrades more predictable for users.

Hadoop 0.21 also introduces a fault injection framework, which uses AOP to inject faults into a part of the system that is running under test (e.g. a datanode), and asserts that the system reacts to the fault in the expected manner. Complementing fault injection is mock object testing, which tests code “in the small”, at the class-level rather than the system-level. Hadoop has a growing number of Mockito-based tests for this purpose (MAPREDUCE-1050).

Among the many other improvements and new features, a couple of small ones stand out: the ability to retrieve metrics and configuration from Hadoop daemons by accessing the URLs /metrics and /conf in a browser (HADOOP-5469, HADOOP-6408).

HDFS


Support for appends in HDFS has had a rocky history. The feature was introduced in the 0.19.0 release, and then disabled in 0.19.1 due to stability issues. The good news is that the append call is back in 0.21.0 with a brand new implementation (HDFS-265), and may be accessed via FileSystem’s append() method. Closely related—and more interesting for many applications, such as HBase—is the Syncable interface that FSDataOutputStream now implements, which brings sync semantics to HDFS (HADOOP-6313).
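A sketch of how an application might use the restored append call together with the Syncable semantics; the path is illustrative, and the flush call is spelled sync() here as on FSDataOutputStream at the time (later releases rename it hflush()):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class AppendExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);

          Path log = new Path("/logs/events.log");   // illustrative path; must already exist
          FSDataOutputStream out = fs.append(log);   // reopen for append (HDFS-265)
          out.write("event-42\n".getBytes("UTF-8"));
          out.sync();                                // Syncable: expose the bytes to new readers
          out.close();
      }
  }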

Hadoop 0.21 has a new filesystem API, called FileContext, which makes it easier for applications to work with multiple filesystems (HADOOP-4952). The API is not in widespread use yet (e.g. it is not integrated with MapReduce), but it has some features that the old FileSystem interface doesn’t, notably support for symbolic links (HADOOP-6421, HDFS-245).
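A sketch of the FileContext API with the new symlink support; the paths are illustrative:

  import org.apache.hadoop.fs.FileContext;
  import org.apache.hadoop.fs.Path;

  public class SymlinkExample {
      public static void main(String[] args) throws Exception {
          FileContext fc = FileContext.getFileContext();   // context for the default filesystem

          // Create a symbolic link /data/current -> /data/2010-08-30
          // (the final argument asks FileContext to create missing parent directories).
          fc.createSymlink(new Path("/data/2010-08-30"), new Path("/data/current"), true);

          // Read the link target back.
          Path target = fc.getLinkTarget(new Path("/data/current"));
          System.out.println("/data/current points to " + target);
      }
  }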

The secondary namenode has been deprecated in 0.21. Instead you should consider running a checkpoint node (which essentially acts like a secondary namenode) or a backup node (HADOOP-4539). By using a backup node you no longer need an NFS-mount for namenode metadata, since it accepts a stream of filesystem edits from the namenode, which it writes to disk.

New in 0.21 is the offline image viewer (oiv) for HDFS image files (HADOOP-5467). This tool allows admins to analyze HDFS metadata without impacting the namenode (it also works with older versions of HDFS). There is also a block forensics tool for finding corrupt and missing blocks from the HDFS logs (HDFS-567).

Modularization continues in the platform with the introduction of pluggable block placement (HDFS-385), an expert-level interface for developers who want to try out new placement algorithms for HDFS.

Other notable new features include:

  • Support for efficient file concatenation in HDFS (HDFS-222)
  • Distributed RAID filesystem (HDFS-503) – an erasure coding filesystem running on HDFS, designed for archival storage since the replication factor is reduced from 3 to 2, while keeping the likelihood of data loss about the same. (Note that the RAID code is a MapReduce contrib module since it has a dependency on MapReduce for generating parity blocks.)

MapReduce


The biggest user-facing change in MapReduce is the status of the new API, sometimes called “context objects”. The new API is now more broadly supported since the MapReduce libraries (in org.apache.hadoop.mapreduce.lib) have been ported to use it (MAPREDUCE-334). The examples all use the new API too (MAPREDUCE-271). Nevertheless, to give users more time to migrate to the new API, the old API has been un-deprecated in this release (MAPREDUCE-1735), which means that existing programs will compile without deprecation warnings.
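For readers who haven't seen the "context objects" style, here is a minimal mapper written against the new org.apache.hadoop.mapreduce API (the job setup is omitted):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Minimal word-count mapper using the new "context objects" API.
  public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
              throws IOException, InterruptedException {
          for (String token : line.toString().split("\\s+")) {
              if (!token.isEmpty()) {
                  word.set(token);
                  context.write(word, ONE);   // emit (word, 1)
              }
          }
      }
  }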

The LocalJobRunner (for trying out MapReduce programs on small local datasets) has been enhanced to make it more like running MapReduce on a cluster. It now supports the distributed cache (MAPREDUCE-476), and can run mappers in parallel (MAPREDUCE-1367).

Distcp has seen a number of small improvements too, such as preserving file modification times (HADOOP-5620), input file globbing (HADOOP-5472), and preserving the source path (MAPREDUCE-642).

Continuing the testing theme, this release is the first to feature MRUnit, a contrib module that helps users write unit tests for their MapReduce jobs (HADOOP-5518).

Other new contrib modules include Rumen (MAPREDUCE-751) and Mumak (MAPREDUCE-728), tools for modelling MapReduce. The two are designed to work together: Rumen extracts job data from historical logs, which Mumak then uses to simulate MapReduce applications and clusters on a cluster. Gridmix3 is also designed to work with Rumen traces. The job history log analyzer is another tool that gives information about MapReduce cluster utilization (HDFS-459).

On the job scheduling front there have been updates to the Fair Scheduler, including global scheduling (MAPREDUCE-548), preemption (MAPREDUCE-551), and support for FIFO pools (MAPREDUCE-706). Similarly, the Capacity Scheduler now supports hierarchical queues (MAPREDUCE-824), and admin-defined hard limits (MAPREDUCE-532). There is also a brand new scheduler, the Dynamic Priority Scheduler, which dynamically changes queue shares using a pricing model (HADOOP-4768).

Smarter speculative execution has been added to all schedulers using a more robust algorithm, called Longest Approximate Time to End (LATE) (HADOOP-2141).

Finally, a couple of smaller changes:

  • Streaming combiners are now supported, so that the -combiner option may specify any streaming script or executable, not just a Java class. (HADOOP-4842)
  • On the successful completion of a job, the MapReduce runtime creates a _SUCCESS file in the output directory. This may be useful for applications that need to see whether a result set is complete just by inspecting HDFS, as in the sketch after this list. (MAPREDUCE-947)
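A sketch of such a check, with an illustrative output path argument:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class OutputReady {
      // True once the job writing to outputDir has finished successfully, signalled
      // by the _SUCCESS marker file it leaves behind (MAPREDUCE-947).
      public static boolean isComplete(Configuration conf, String outputDir) throws Exception {
          FileSystem fs = FileSystem.get(conf);
          return fs.exists(new Path(outputDir, "_SUCCESS"));
      }
  }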

What’s Not In


Finally, it bears mentioning what didn’t make it into 0.21.0. The biggest omission is the new Kerberos authentication work from Yahoo! While a majority of the patches are included, security is turned off by default, and is unlikely to work if enabled (certainly there is no guarantee that it will provide any level of security, since it is incomplete). A full working security implementation will be available in 0.22, and also the next version of CDH.

Also, Sqoop, which was initially developed as a Hadoop contrib module, is not in 0.21.0, since it was moved out to become a standalone open source project hosted on github.
"

Thursday, 26 August 2010

Designing a Web Application with Scalability in Mind

Designing a Web Application with Scalability in Mind: "Max Indelicato, a Software Development Director and former Chief Software Architect, has written a post on how to design a web application for scalability. He suggests choosing the right deployment and storage solution, a scalable data store and schema, and using abstraction layers. By Abel Avram"