Tuesday 31 August 2010

NoSQL databases: 10 Things you Should Know About Them

NoSQL databases: 10 Things you Should Know About Them: "
5 pros and 5 cons by Guy Harrison:


Five advantages of NoSQL


  1. Elastic scaling
  2. Big data
  3. Goodbye DBAs (see you later?)
  4. Economics
  5. Flexible data models

Five challenges of NoSQL


  1. Maturity
  2. Support
  3. Analytics and business intelligence
  4. Administration
  5. Expertise


A few amendments though:



  • Elastic scaling: partially correct. Right now only a few of them feature elastic scaling: Cassandra, HBase, Riak, Project Voldemort, Membase, and, most recently, CouchDB through BigCouch.



  • Big data:
    partially correct. Some NoSQL databases do not scale horizontally, so they are not a perfect fit for big data.



  • Goodbye DBAs (see you later?):
    Maybe you won't call them DBAs, but someone still has to do data modeling and think about data access patterns.



  • Support and Expertise:
    I see these as just sub-categories of maturity (or the lack thereof).


Original title and link for this post: NoSQL databases: 10 Things you Should Know About Them (published on the NoSQL blog: myNoSQL)

"

Monday 30 August 2010

Pomegranate - Storing Billions and Billions of Tiny Little Files

Pomegranate - Storing Billions and Billions of Tiny Little Files: "


Pomegranate is a novel distributed file system built over distributed tabular storage that acts an awful lot like a NoSQL system. It's targeted at increasing the performance of tiny object access in order to support applications like online photo and micro-blog services, which require high concurrency, high throughput, and low latency. Their tests seem to indicate it works:


We have demonstrated that a file system over tabular storage performs well for highly concurrent access. In our test cluster, we observed linear scaling up to more than 100,000 aggregate read and write requests served per second (RPS).


Rather than sitting atop the file system like almost every other K-V store, Pomegranate is baked into the file system itself. The idea is that the file system API is common to every platform, so it wouldn't require a separate API to use. Every application could use it out of the box.

The features of Pomegranate are:

  • It handles billions of small files efficiently, even in a single directory;
  • It provides a separate, scalable caching layer that can be snapshotted;
  • The storage layer uses a log-structured store to absorb small-file writes and make full use of disk bandwidth;
  • It builds a global namespace for both small files and large files;
  • Columnar storage exploits temporal and spatial locality;
  • A distributed extendible hash indexes the metadata;
  • Snapshot-able and reconfigurable caching increases parallelism and tolerates failures;
  • Pomegranate should be the first file system built over tabular storage, and the experience of building it should be valuable to the file system community.


Can Ma, who leads the research on Pomegranate, was kind enough to agree to a short interview.

"

Can you please give an overview of the architecture and what you are doing that's cool and different?
Basically, there is no distributed or parallel file system that can handle billions of small files efficiently. However, we can foresee that web applications (such as email, photo, and even video) and bio-computing (gene sequencing) need massive small-file access. Meanwhile, the file system API is general enough and well understood by most programmers.
Thus, we wanted to build a file system to manage billions of small files and provide high throughput for concurrent accesses. Although Pomegranate is designed for access to small files, it supports large files as well. It is built on top of other distributed file systems, such as Lustre, and only manages the namespace and the small files. We just want to stand on "the Shoulders of Giants". See the figure below:
Pomegranate has many metadata servers (MDS) and metadata storage servers (MDSL) to serve metadata requests and small-file read/write requests. The MDSs are just a caching layer: they load metadata from storage and commit memory snapshots back to storage. The core of Pomegranate is a distributed tabular storage system called xTable. It supports key-indexed multi-column lookups. We use a distributed extendible hash to locate the server for a key, because extendible hashing adapts well to scaling up and down.
In file systems, the directory table and the inode table are kept separate to support two different types of lookup: lookups by pathname go through the directory table, while lookups by inode number go through the inode table. It is nontrivial to keep these two indexes consistent, especially in a distributed file system. Using two indexes also increases lookup latency, which is unacceptable when accessing tiny files. Typically there are in-memory caches for dentries and inodes, but such caches don't extend easily, and modifying metadata has to update multiple locations. To keep them consistent, an operation log is introduced; but the operation log then becomes a serialization point for request flows.
Pomegranate uses a table-like directory structure to merge the directory table and the inode table. The two types of lookup are unified into lookups by key. For the file system, the key is the hash of the dentry name; hash conflicts are resolved by a globally unique id per file. For each update, we only need to search and update one table. To eliminate the operation log, we designed memory snapshots to obtain a consistent image: the dirty regions of each snapshot can be written to storage safely without worrying about concurrent modifications (concurrent updates are copy-on-write).
However, some complex file system operations such as mkdir, rmdir, hard link, and rename have to update at least two tables. We implemented a reliable multi-site update service to propagate deltas from one table to another. For example, on mkdir we propagate the delta ("nlink +1") to the parent table.
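To make the merged-table idea above concrete, here is a minimal Java sketch of a single metadata table keyed by the hash of the dentry name, with a per-file unique id to disambiguate hash collisions. This is not Pomegranate's actual code; all class and field names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the merged directory/inode table described above:
// one table keyed by a hash of the dentry name, with a per-file unique id
// to disambiguate hash collisions. Names and structure are illustrative only.
public class MetadataTableSketch {

    // One row holds what a dentry and an inode would normally hold separately.
    static class Row {
        final long uid;          // globally unique file id, resolves hash conflicts
        final String name;       // dentry name
        long size;               // inode-style attributes live in the same row
        long mtime;
        Row(long uid, String name) { this.uid = uid; this.name = name; }
    }

    // key = hash(dentry name); value = rows sharing that hash bucket
    private final Map<Long, Map<Long, Row>> table = new HashMap<>();
    private long nextUid = 1;

    long create(String dentryName) {
        long key = hash(dentryName);
        Row row = new Row(nextUid++, dentryName);
        table.computeIfAbsent(key, k -> new HashMap<>()).put(row.uid, row);
        return row.uid;
    }

    // A single lookup by key serves both the "directory table" and "inode table" roles.
    Row lookup(String dentryName, long uid) {
        Map<Long, Row> bucket = table.get(hash(dentryName));
        return bucket == null ? null : bucket.get(uid);
    }

    private static long hash(String name) {
        // Any stable hash works for the sketch; Pomegranate's actual function is not specified here.
        long h = 1125899906842597L;
        for (byte b : name.getBytes(StandardCharsets.UTF_8)) h = 31 * h + b;
        return h;
    }
}
```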
Are there any single points of failure? 
There is no SPOF in the design. We use a cluster of MDSs to serve metadata requests; if one MDS crashes, requests are redirected to other MDSs (consistent hashing and heartbeats are used). Metadata and small files are replicated to multiple nodes as well. However, this replication is triggered by external sync tools and is asynchronous to the writes.
Small files have usually been the death of filesystems because of directory structure maintenance. How do you get around that?
Yep, small-file access is deadly slow in traditional file systems. We replace the traditional directory table (a B+ tree or hash tree) with a distributed extendible hash table. The dentry name and the inode metadata are treated as columns of the table. Lookups from clients are sent (or routed if needed) to the correct MDS. Thus, to access a small file we only need to read one table row to find the file's location. We keep each small file stored sequentially in the native file system, so a single I/O can serve a small-file read.
What POSIX APIs are supported? Can files be locked, mapped, symlinked, etc.?
At present, POSIX support is still progressing. We do support symlinks and mmap access; flock is not supported yet.
Why do a kernel level file system rather than a K-V store on top?
Our initial objective was to implement a file system to support more existing applications. That said, we now also support a K/V interface on top of xTable. See the architecture figure: the AMC client is the key/value client for Pomegranate. We support simple predicates on keys or values; for example, "select * from table where key < 10 and 'xyz' in value" returns the k/v pairs whose value contains "xyz" and whose key is less than 10.
How does it compare to other distributed filesystems?
We want to compare the small-file performance with other file systems, but we have not tested it yet; we will do so next month. That said, we believe most distributed file systems cannot handle massive small-file access efficiently.
Are indexes and any sort of queries supported?
For now, these have not been properly considered yet. We plan to look at range queries next.
Does it work across datacenters, that is, how does it deal with latency?
Pomegranate only works in a datacenter. WAN support has not been considered yet.
It looks like you use an in-memory architecture for speed. Can you talk about that?
We use a dedicated in-memory cache layer for speed. Table rows are grouped into table slices, and in memory the slices are hashed into a local extendible hash table, for both performance and space efficiency, as shown in the figure below.
Clients issue a request by hashing the file name and looking it up in the bitmap, then use a consistent hash ring to locate the cache server (MDS) or storage server (MDSL). Each update first gets the *opened* transaction group and is applied to the in-memory table row. Each transaction group change is atomic. After all pending updates are finished, the transaction group can be committed to storage safely. This approach is similar to Sun's ZFS.
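For readers unfamiliar with consistent hashing, here is a small, generic Java sketch of the ring-based lookup described above. It is illustrative only, not Pomegranate's implementation, and the hashing details are arbitrary.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative consistent-hash ring: hash each server onto a ring (with virtual
// nodes) and route a key to the first server clockwise from the key's hash.
// This mirrors the routing idea described above, not Pomegranate's real code.
public class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    public void addServer(String server) {
        for (int i = 0; i < virtualNodes; i++) ring.put(hash(server + "#" + i), server);
    }

    public void removeServer(String server) {
        for (int i = 0; i < virtualNodes; i++) ring.remove(hash(server + "#" + i));
    }

    // Route a file name (or table-slice key) to its owning server.
    public String serverFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no servers");
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            // Use the first 8 bytes of the digest as the position on the ring.
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }
}
```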
How is high availability handled?
Well, the central server for consistent hash ring management and failure coordination should be replicated with a Paxos-style algorithm; we plan to use ZooKeeper for a highly available central service.
The other components are designed to be fault tolerant. Crashes of an MDS or MDSL can be recovered from immediately by routing requests to new servers (by selecting the next point on the consistent hash ring).
Operationally, how does it work? How are nodes added into the system?
Adding nodes to the caching layer is simple. The central server (R2) adds the new node to the consistent hash ring. All the cache servers act on this change and invalidate the cached table slices that will now be managed by the new node. Requests from clients are routed to the new server, and a ring-change notification piggybacks on responses so that clients pull the new ring from the central server.
How do you handle large files? Is it suitable for streaming video?
As described earlier, large files are relayed to other distributed file systems. Our caching layer will not be polluted by the streaming video data.
Anything else you would like to add?
Another figure shows the interaction of the Pomegranate components.

Sunday 29 August 2010

Scalability updates for Aug 27th 2010

Scalability updates for Aug 27th 2010: "

My updates have been slow recently due to other things I’m involved in. If you need more updates around what I’m reading, please feel free to follow me on twitter or buzz.


Here are some of the big ones I have mentioned on my twitter/buzz feeds.














"

OpenStack - The Answer to: How do We Compete with Amazon?

OpenStack - The Answer to: How do We Compete with Amazon?: "


The Silicon Valley Cloud Computing Group had a meetup Wednesday on OpenStack, whose tag line is the open source, open standards cloud. I was shocked at the large turnout. 287 people registered and it looked like a large percentage of them actually showed up. I wonder, was it the gourmet pizza, the free t-shirts, or are people really that interested in OpenStack? And if they are really interested, why are they that interested? On the surface an open cloud doesn't seem all that sexy a topic, but with contributions from NASA, from Rackspace, and from a very avid user community, there seems to be a lot of interest.

The brief intro blurb to OpenStack is:

"

Friday 27 August 2010

Zookeeper experience

Zookeeper experience: "
While working on Kafka, a distributed pub/sub system (more on that later) at LinkedIn, I needed to use Zookeeper (ZK) to implement the load-balancing logic. I'd like to share my experience of using Zookeeper. First of all, for those of you who don't know, Zookeeper is an Apache project that implements a consensus service based on a variant of Paxos (it's similar to Google's Chubby). ZK has a very simple, file-system-like API. One can create a path, set the value of a path, read the value of a path, delete a path, and list the children of a path. ZK does a couple more interesting things: (a) one can register a watcher on a path and get notified when the children of a path or the value of a path change, and (b) a path can be created as ephemeral, which means that if the client that created the path goes away, the path is automatically removed by the ZK server. However, don't let the simple API fool you. One needs to understand a lot more than those APIs in order to use them properly. For me, this translated into weeks of asking the ZK mailing list (which is pretty responsive) and our local ZK experts.
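For readers who haven't used ZooKeeper, the operations listed above map onto the standard Java client roughly as follows. This is a minimal sketch rather than production code: the connection string, paths, and timeouts are made up, and for brevity it doesn't wait for the SyncConnected event before issuing requests.

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Minimal tour of the ZooKeeper API mentioned above: create/read/write/delete a
// path, list children, register a watcher, and create an ephemeral node.
public class ZkApiTour {
    public static void main(String[] args) throws Exception {
        Watcher watcher = (WatchedEvent event) ->
                System.out.println("state=" + event.getState() + " type=" + event.getType());

        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, watcher);

        // Create a persistent path with a value.
        zk.create("/brokers", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral path: removed automatically when this session goes away.
        zk.create("/brokers/broker-0", "host:port".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Read the value, set a new value, list children with a watch registered.
        byte[] data = zk.getData("/brokers", false, null);
        zk.setData("/brokers", "v2".getBytes(), -1);               // -1 = any version
        List<String> children = zk.getChildren("/brokers", true);  // true = watch for changes

        zk.delete("/brokers/broker-0", -1);
        zk.close();
    }
}
```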

To get started, it's important to understand the state transitions and the associated watcher events inside a ZK client. A ZK client can be in one of 3 states: disconnected, connected, and closed. When a client is created, it's in the disconnected state. Once a connection is established, the client moves to the connected state. If the client loses its connection to a server, it switches back to the disconnected state. If it can't connect to any server within some time limit, it eventually transitions to the closed state. For each state transition, a state-changing event (disconnected, syncconnected, or expired) is sent to the client's watcher. As you will see, those events are critical to the client. Finally, if one performs an operation on ZK while the client is in the disconnected state, a ConnectionLossException (CLE) is thrown back to the caller. More detailed information can be found at the ZK site. A lot of the subtleties of using ZK are in dealing with those state-changing events.

The first tricky issue is related to the CLE. The problem is that when a CLE happens, the requested operation may or may not have taken place on ZK. If the connection was lost before the request reached the server, the operation didn't take place. On the other hand, the request may have reached the server and been executed there, with the connection lost before the server could send a response back. If the request is a read or an update, one can just keep retrying until the operation succeeds. It becomes a problem if the request is a create. If you simply retry, you may get a NodeExistsException, and it's not clear whether you or someone else created the path. What one can do is set the value of the path to a client-specific value during creation. If a NodeExistsException is thrown, read the value back to check who actually created it. One can't use this approach for sequential paths (a ZK feature that creates a path with a generated sequential id) though: if you retry, a different path will be created, and you also can't check who created the path, since after a CLE you don't know the name of the path that got created. For this reason, I think sequential paths have very limited benefit, since it's very hard to use them correctly.
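A minimal sketch of the create-retry pattern just described might look like the following. It assumes an already-connected ZooKeeper handle and a client-specific value; error handling is stripped down to the two exceptions under discussion, and in real code the getData call would need the same retry treatment.

```java
import java.util.Arrays;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch of the create-retry pattern described above: retry on connection loss,
// and on NodeExistsException check whether the stored value is ours to decide who won.
public class SafeCreate {

    // Returns true if this client ended up owning the path.
    // myValue must be client-specific, otherwise the ownership check is meaningless.
    static boolean createWithRetry(ZooKeeper zk, String path, byte[] myValue)
            throws KeeperException, InterruptedException {
        while (true) {
            try {
                zk.create(path, myValue, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                return true;
            } catch (KeeperException.ConnectionLossException e) {
                // The create may or may not have reached the server; just retry.
            } catch (KeeperException.NodeExistsException e) {
                // Someone created it -- possibly our earlier, "lost" attempt.
                byte[] existing = zk.getData(path, false, null);
                return Arrays.equals(existing, myValue);
            }
        }
    }
}
```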

The second tricky issue is to distinguish between a disconnect and an expired event. The former happens when the ZK client can’t connect to the server. This is because either (1) the ZK server is down, or (2) the ZK server is up, but the ZK client is partitioned from the server or it is in a long GC pause and can’t send the heartbeat in time. In case (1), when the ZK server comes back, the client watcher will get a syncconnected event and everything is back to normal. Surprisingly, in this case, all the ephemeral paths and the watchers are still kept at the server and you don’t have to recreate them. In case (2), when the client finally reconnects to the server, it will get back an expired event. This implies that the server thinks the client is dead and has taken the liberty to delete all the ephemeral paths and watchers created by that client. It’s the responsibility of the client to start a new ZK session and to recreate the ephemeral paths and the watchers.
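Put together, a watcher that treats a disconnect as transient but rebuilds the session and its ephemeral paths on expiration could be sketched like this. The connection string, paths, and timeout are hypothetical, and the sketch assumes the parent path of the ephemeral node already exists.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch of the Disconnected-vs-Expired handling described above. On Disconnected
// we simply wait for SyncConnected (ephemerals and watches survive on the server);
// on Expired we must open a brand new session and recreate our ephemeral paths.
public class SessionAwareClient implements Watcher {
    private final String connectString = "localhost:2181";
    private volatile ZooKeeper zk;

    public void start() throws Exception {
        zk = new ZooKeeper(connectString, 30000, this);
    }

    @Override
    public void process(WatchedEvent event) {
        KeeperState state = event.getState();
        if (state == KeeperState.Disconnected) {
            // Transient: the client library keeps trying to reconnect; do nothing yet.
        } else if (state == KeeperState.SyncConnected) {
            // Reconnected within the session timeout: nothing to recreate.
        } else if (state == KeeperState.Expired) {
            // The server discarded our ephemerals and watches: start over.
            try {
                zk.close();
                start();
                registerEphemerals();
            } catch (Exception e) {
                throw new RuntimeException("failed to recover ZK session", e);
            }
        }
    }

    private void registerEphemerals() throws Exception {
        // Recreate whatever ephemeral state this client owns (path is hypothetical).
        zk.create("/consumers/me", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }
}
```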

To deal with the above issues, one has to write additional code that keeps track of the ZK client state, starts a new session when the old one expires, and handles the CLE appropriately. For my application, I find the ZKClient package quite useful. ZKClient is a wrapper of the original ZK client. It maintains the current state of the ZK client, hides the CLE from the caller by retrying the request when the state is transitioned to connected again, and reconnects when necessary. ZKClient has an Apache license and has been used in Katta for quite some time. Even with the help of ZKClient, I still have to handle things like who actually created a path when a NodeExistsException occurs and re-registering after a session expires.

Finally, how do you test your ZK application, especially the various failure scenarios? One can use utilities like “ifconfig down/up” to simulate network partitioning. Todd Lipcon’s Gremlins seems very useful too.
"

What’s New in Apache Hadoop 0.21

What’s New in Apache Hadoop 0.21: "
Apache Hadoop 0.21.0 was released on August 23, 2010. The last major release was 0.20.0 in April last year, so it’s not surprising that there are so many changes in this release, given the amount of activity in the Hadoop development community. In fact, there were over 1300 issues fixed in JIRA (Common, HDFS, MapReduce), the issue tracker used for Apache Hadoop development. Bear in mind that the 0.21.0 release, like all dot zero releases, isn’t suitable for production use.

With such a large delta from the last release, it is difficult to grasp the important new features and changes. This post is intended to give a high-level view of some of the more significant features introduced in the 0.21.0 release. Of course, it can’t hope to cover everything, so please consult the release notes (Common, HDFS, MapReduce) and the change logs (Common, HDFS, MapReduce) for the full details. Also, please let us know in the comments of any features, improvements, or bug fixes that you are excited about.

You can download Hadoop 0.21.0 from an Apache Mirror. Thanks to everyone who contributed to this release!



Project Split


Organizationally, a significant chunk of work has arisen from the project split, which transformed a single Hadoop project (called Core) into three constituents: Common, HDFS, and MapReduce. HDFS and MapReduce both have dependencies on Common, but (other than for running tests) MapReduce has no dependency on HDFS. This separation emphasizes the fact that MapReduce can run on alternative distributed file systems (although HDFS is still the best choice for sheer throughput and scalability), and it has made following development easier since there are now separate lists for each subproject. There is one release tarball still, however, although it is laid out a little differently from previous releases, since it has a subdirectory containing each of the subproject source files.

From a user’s point of view little has changed as a result of the split. The configuration files are divided into core-site.xml, hdfs-site.xml, and mapred-site.xml (this was supported in 0.20 too), and the control scripts are now broken into three (HADOOP-4868): in addition to the bin/hadoop script, there is a bin/hdfs script and a bin/mapred script for running HDFS and MapReduce daemons and commands, respectively. The bin/hadoop script still works as before, but issues a deprecation warning. Finally, you will need to set the HADOOP_HOME environment variable to have the scripts work smoothly.

Common


The 0.21.0 release is technically a minor release (traditionally Hadoop 0.x releases have been major, and have been allowed to break compatibility with the previous 0.x-1 release) so it is API compatible with 0.20.2. To make the intended stability and audience of a particular API in Hadoop clear to users, all Java members with public visibility have been marked with classification annotations to say whether they are Public, or Private (there is also LimitedPrivate which signifies another, named, project may use it), and whether they are Stable, Evolving, or Unstable (HADOOP-5073). Only elements marked as Public appear in the user Javadoc (Common, MapReduce; note that HDFS is all marked as private since it is accessed through the FileSystem interface in Common). The classification interface is described in detail in Towards Enterprise-Class Compatibility for Apache Hadoop by Sanjay Radia.

This release has seen some significant improvements to testing. The Large-Scale Automated Test Framework, known as Herriot (HADOOP-6332), allows developers to write tests that run against a real (possibly large) cluster. While there are only a dozen or so tests at the moment, the intention is that more tests will be written over time so that regression tests can be shared and run against new Hadoop release candidates, thereby making Hadoop upgrades more predictable for users.

Hadoop 0.21 also introduces a fault injection framework, which uses AOP to inject faults into a part of the system that is running under test (e.g. a datanode), and asserts that the system reacts to the fault in the expected manner. Complementing fault injection is mock object testing, which tests code “in the small”, at the class-level rather than the system-level. Hadoop has a growing number of Mockito-based tests for this purpose (MAPREDUCE-1050).

Among the many other improvements and new features, a couple of small ones stand out: the ability to retrieve metrics and configuration from Hadoop daemons by accessing the URLs /metrics and /conf in a browser (HADOOP-5469, HADOOP-6408).

HDFS


Support for appends in HDFS has had a rocky history. The feature was introduced in the 0.19.0 release, and then disabled in 0.19.1 due to stability issues. The good news is that the append call is back in 0.21.0 with a brand new implementation (HDFS-265), and may be accessed via FileSystem’s append() method. Closely related—and more interesting for many applications, such as HBase—is the Syncable interface that FSDataOutputStream now implements, which brings sync semantics to HDFS (HADOOP-6313).
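In code, append plus sync looks roughly like the sketch below. The paths are made up, append support must be enabled in the cluster configuration, and the exact name of the flush call changed over time (sync() in the 0.21 era, hflush() later), so treat this as a sketch to adapt to your version.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of append plus sync as described above: reopen an existing file for
// append, write a record, and flush it so new readers can see it.
public class AppendAndSync {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path log = new Path("/logs/events.log");

        FSDataOutputStream out = fs.exists(log) ? fs.append(log) : fs.create(log);
        out.write("one more record\n".getBytes("UTF-8"));

        // Make the bytes durable/visible to new readers before carrying on.
        out.sync();

        out.close();
        fs.close();
    }
}
```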

Hadoop 0.21 has a new filesystem API, called FileContext, which makes it easier for applications to work with multiple filesystems (HADOOP-4952). The API is not in widespread use yet (e.g. it is not integrated with MapReduce), but it has some features that the old FileSystem interface doesn’t, notably support for symbolic links (HADOOP-6421, HDFS-245).
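A hedged sketch of what using FileContext, including symlink creation, looks like. The paths are made up, and the method names follow the 0.21-era API as far as I can tell.

```java
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

// Hedged sketch of the FileContext API mentioned above, including symlink
// creation, which the old FileSystem interface does not offer.
public class FileContextExample {
    public static void main(String[] args) throws Exception {
        FileContext fc = FileContext.getFileContext(new Configuration());

        Path data = new Path("/data/current/part-00000");
        FSDataOutputStream out =
                fc.create(data, EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE));
        out.write("hello".getBytes("UTF-8"));
        out.close();

        // Symbolic links are a FileContext-only feature (HADOOP-6421, HDFS-245).
        Path link = new Path("/data/latest");
        fc.createSymlink(data, link, true /* create missing parents */);

        System.out.println(fc.getFileStatus(link).getLen());
    }
}
```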

The secondary namenode has been deprecated in 0.21. Instead you should consider running a checkpoint node (which essentially acts like a secondary namenode) or a backup node (HADOOP-4539). By using a backup node you no longer need an NFS-mount for namenode metadata, since it accepts a stream of filesystem edits from the namenode, which it writes to disk.

New in 0.21 is the offline image viewer (oiv) for HDFS image files (HADOOP-5467). This tool allows admins to analyze HDFS metadata without impacting the namenode (it also works with older versions of HDFS). There is also a block forensics tool for finding corrupt and missing blocks from the HDFS logs (HDFS-567).

Modularization continues in the platform with the introduction of pluggable block placement (HDFS-385), an expert-level interface for developers who want to try out new placement algorithms for HDFS.

Other notable new features include:

  • Support for efficient file concatenation in HDFS (HDFS-222)
  • Distributed RAID filesystem (HDFS-503) – an erasure coding filesystem running on HDFS, designed for archival storage since the replication factor is reduced from 3 to 2, while keeping the likelihood of data loss about the same. (Note that the RAID code is a MapReduce contrib module since it has a dependency on MapReduce for generating parity blocks.)

MapReduce


The biggest user-facing change in MapReduce is the status of the new API, sometimes called “context objects”. The new API is now more broadly supported since the MapReduce libraries (in org.apache.hadoop.mapreduce.lib) have been ported to use it (MAPREDUCE-334). The examples all use the new API too (MAPREDUCE-271). Nevertheless, to give users more time to migrate to the new API, the old API has been un-deprecated in this release (MAPREDUCE-1735), which means that existing programs will compile without deprecation warnings.
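For reference, a mapper written against the new API looks like the minimal word-count-style example below; the Context object replaces the old OutputCollector/Reporter pair.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal example of the "context objects" API referred to above: the mapper
// extends org.apache.hadoop.mapreduce.Mapper and emits through a Context.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);   // the old API would have used OutputCollector
        }
    }
}
```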

The LocalJobRunner (for trying out MapReduce programs on small local datasets) has been enhanced to make it more like running MapReduce on a cluster. It now supports the distributed cache (MAPREDUCE-476), and can run mappers in parallel (MAPREDUCE-1367).

Distcp has seen a number of small improvements too, such as preserving file modification times (HADOOP-5620), input file globbing (HADOOP-5472), and preserving the source path (MAPREDUCE-642).

Continuing the testing theme, this release is the first to feature MRUnit, a contrib module that helps users write unit tests for their MapReduce jobs (HADOOP-5518).
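A sketch of an MRUnit test for a mapper like the one above might look like the following. The driver and package names follow the contrib-era MRUnit as I recall them, so adjust to the version you are actually using.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

// Hedged sketch of an MRUnit test for the new-API TokenMapper shown earlier:
// feed one input record, assert the expected (key, value) outputs.
public class TokenMapperTest {
    @Test
    public void emitsOneCountPerToken() {
        new MapDriver<LongWritable, Text, Text, IntWritable>()
                .withMapper(new TokenMapper())
                .withInput(new LongWritable(0), new Text("hadoop hadoop"))
                .withOutput(new Text("hadoop"), new IntWritable(1))
                .withOutput(new Text("hadoop"), new IntWritable(1))
                .runTest();
    }
}
```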

Other new contrib modules include Rumen (MAPREDUCE-751) and Mumak (MAPREDUCE-728), tools for modelling MapReduce. The two are designed to work together: Rumen extracts job data from historical logs, which Mumak then uses to simulate MapReduce applications and clusters on a cluster. Gridmix3 is also designed to work with Rumen traces. The job history log analyzer is another tool that gives information about MapReduce cluster utilization (HDFS-459).

On the job scheduling front there have been updates to the Fair Scheduler, including global scheduling (MAPREDUCE-548), preemption (MAPREDUCE-551), and support for FIFO pools (MAPREDUCE-706). Similarly, the Capacity Scheduler now supports hierarchical queues (MAPREDUCE-824), and admin-defined hard limits (MAPREDUCE-532). There is also a brand new scheduler, the Dynamic Priority Scheduler, which dynamically changes queue shares using a pricing model (HADOOP-4768).

Smarter speculative execution has been added to all schedulers using a more robust algorithm, called Longest Approximate Time to End (LATE) (HADOOP-2141).

Finally, a couple of smaller changes:

  • Streaming combiners are now supported, so that the -combiner option may specify any streaming script or executable, not just a Java class. (HADOOP-4842)
  • On the successful completion of a job, the MapReduce runtime creates a _SUCCESS file in the output directory. This may be useful for applications that need to see whether a result set is complete just by inspecting HDFS (a minimal check is sketched after this list). (MAPREDUCE-947)
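As an illustration of the second point, a downstream consumer could check for completeness with something like this minimal sketch (the output path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of the _SUCCESS check mentioned in the list above: a downstream
// consumer decides whether an output directory is complete purely from HDFS.
public class OutputReady {
    public static boolean isComplete(String outputDir) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        return fs.exists(new Path(outputDir, "_SUCCESS"));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isComplete("/results/2010-08-27") ? "ready" : "not yet");
    }
}
```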

What’s Not In


Finally, it bears mentioning what didn’t make it into 0.21.0. The biggest omission is the new Kerberos authentication work from Yahoo! While a majority of the patches are included, security is turned off by default, and is unlikely to work if enabled (certainly there is no guarantee that it will provide any level of security, since it is incomplete). A full working security implementation will be available in 0.22, and also the next version of CDH.

Also, Sqoop, which was initially developed as a Hadoop contrib module, is not in 0.21.0, since it was moved out to become a standalone open source project hosted on github.
"

Thursday 26 August 2010

Designing a Web Application with Scalability in Mind

Designing a Web Application with Scalability in Mind: "Max Indelicato, a Software Development Director and former Chief Software Architect, has written a post on how to design a web application for scalability. He suggests choosing the right deploying and storage solution, a scalable data storage and schema, and using abstraction layers. By Abel Avram"

Tuesday 24 August 2010

Tech talk: Get your distributed pub/sub on… Ben Reed talks about Hedwig

Tech talk: Get your distributed pub/sub on… Ben Reed talks about Hedwig: "


Hedwig

Benjamin Reed (Yahoo! Research)

June 7, 2010


ABSTRACT


Hedwig is a large scale cross data center publish/subscribe service developed at Yahoo! Research. We needed a scalable, fault tolerant, publish/subscribe service that has strong delivery and ordering guarantees to maintain the consistency of replicas of datasets in different data centers. We found that we could use ZooKeeper, a coordination service, and BookKeeper, a distributed write-ahead logging service to build such a service. To present Hedwig I will first review two of the services it is built on: ZooKeeper and BookKeeper. I will then present the motivation for Hedwig, review its design, and present its current status.


BIOGRAPHY


Benjamin Reed has worked for almost 2 decades in the industry: from an intern working on CAD/CAM systems, to shipping and receiving applications in OS/2, AIX, and CICS, to operations, to system admin research and Java frameworks at IBM Almaden Research (11 years), until finally arriving at Yahoo! Research (3 years ago) to work on distributed computing problems. His main interests are large-scale processing environments and highly available and scalable systems. Dr. Reed’s research project at IBM grew into OSGi, which is now in application servers, IDEs, cars, and mobile phones. While at Yahoo!, he has also worked on Pig and ZooKeeper, Apache projects for which he is a committer. Benjamin has a Ph.D. in Computer Science from the University of California, Santa Cruz.


[This video is part of LinkedIn's tech talk series.]

"

6 Ways to Kill Your Servers - Learning How to Scale the Hard Way

6 Ways to Kill Your Servers - Learning How to Scale the Hard Way: "


This is a guest post by Steffen Konerow, author of the High Performance Blog.


Learning how to scale isn’t easy without any prior experience. Nowadays you have plenty of websites like highscalability.com to get some inspiration, but unfortunately there is no solution that fits all websites and needs. You still have to think on your own to find a concept that works for your requirements. So did I.


A few years ago, my bosses came to me and said “We’ve got a new project for you. It’s the relaunch of a website that already has 1 million users a month. You have to build the website and make sure we’ll be able to grow afterwards”. I was already an experienced coder, but not in these dimensions, so I had to start learning how to scale – the hard way.



"

Sunday 22 August 2010

Hoff’s 5 Rules Of Cloud Security…

Hoff’s 5 Rules Of Cloud Security…: "
Mike Dahn pinged me via Twitter with an interesting and challenging question:



I took this as a challenge to articulate, in 5 minutes or less, an answer in succinct, bulleted form. I timed it. 4 minutes & 48 seconds. Loaded with snark and Hoffacino-fueled dogma.

Here goes:

  1. Get an Amazon Web Services account, instantiate a couple of AMIs as though you were deploying a web-based application with sensitive information that requires resilience, security, survivability and monitoring. If you have never done this and you’re in security spouting off about the insecurities of Cloud, STFU and don’t proceed to step 2 until you do. AWS (IaaS) puts much of the burden on you to understand what needs to be done to secure Cloud-based services which is why I focus on it. It’s also accessible and available to everyone.

    -
  2. Take some time to be able to intelligently understand that as abstracted as much of Cloud is in terms of the lack of exposed operational moving parts, you still need to grok architecture holistically in order to be able to secure it — and the things that matter most within it. Building survivable systems, deploying securable (and as secure as you can make it) code, focusing on protecting information and ensuring you understand system design and The Three R’s (Resistance, Recognition, Recovery) is pretty darned important. That means you have to understand how the Cloud provider actually works so when they don’t you’ll already have planned around that…

    -
  3. Employ a well-developed risk assessment/management framework and perform threat modeling. See OCTAVE, STRIDE/DREAD, FAIR. Understanding whether an application or datum is OK to move to “the Cloud” isn’t nuanced. It’s a simple application of basic, straightforward and prudent risk management. If you’re not doing that now, Cloud is the least of your problems. As I’ve said in the past “if your security sucks now, you’ll be pleasantly surprised by the lack of change when you move to Cloud.”

    -
  4. Proceed to the Cloud Security Alliance website and download the guidance. Read it. Join one or more of the working groups and participate to make Cloud Security better in any way you believe you have the capacity to do so. If you just crow about how “more secure” the Cloud is or how “horribly insecure by definition” it is, it’s clear you’ve not done steps 1-3. Skip 1-3, go to #5 and then return to #1.

    -
  5. Use common sense. There ain’t no patch for stupid. Most of us inherently understand that this is a marathon and not a sprint. If you take steps 1-4 seriously you’re going to be able to logically have discussions and make decisions about what deployment models and providers suit your needs. Not everything will move to the Cloud (public, private or otherwise) but a lot of it can and should. Being able to lay out a reasonable timeline is what moves the needle. Being an ideologue on either side of the tarpit does nobody any good. Arguing is for Twitter, doing is for people who matter.

Cloud is only rocket science if you’re NASA and using the Cloud for rocket science. Else, for the rest of us, it’s an awesome platform upon which we leverage various opportunities to improve the way in which we think about and implement the practices and technology needed to secure the things that matter most to us.

/Hoff

(Yeah, I know. Not particularly novel or complex, right? Nope. That’s the point. Just like “How to Kick Ass in Information Security — Hoff’s Spiritually-Enlightened Top Ten Guide to Health, Wealth and Happiness”)

"

Tuesday 17 August 2010

Scaling an AWS infrastructure - Tools and Patterns

Scaling an AWS infrastructure - Tools and Patterns: "


This is a guest post by Frédéric Faure (architect at Ysance), you can follow him on twitter.

How do you scale an AWS (Amazon Web Services) infrastructure? This article will give you a detailed reply in two parts: the tools you can use to make the most of Amazon’s dynamic approach, and the architectural model you should adopt for a scalable infrastructure.

I base my report on my experience gained in several AWS production projects in casual gaming (Facebook), e-commerce infrastructures and within the mainstream GIS (Geographic Information System). It’s true that my experience in gaming (IsCool, The Game) is currently the most representative in terms of scalability, due to the number of users (over 800 thousand DAU – daily active users – at peak usage and over 20 million page views every day). However, my experiences in e-commerce and GIS (currently underway) provide a different view of scalability, taking into account the various problems of availability and data management. I will therefore attempt to provide a detailed overview of the factors to take into account in order to optimise the dynamic nature of an infrastructure constructed in a Cloud Computing environment, and in this case, in the AWS environment.

"

Thursday 12 August 2010

Three Papers on Load Balancing

Three Papers on Load Balancing: "
Found a reference to these three papers on load balancing in the Cassandra mailing list.

Simple Efficient Load Balancing Algorithms for Peer-to-Peer Systems


David R. Karger, Matthias Ruhl:


Load balancing is a critical issue for the efficient operation of peer-to-peer networks. We give two new load-balancing protocols whose provable performance guarantees are within a constant factor of optimal. Our protocols refine the consistent hashing data structure that underlies the Chord (and Koorde) P2P network. Both preserve Chord’s logarithmic query time and near-optimal data migration cost.
Our first protocol balances the distribution of the key address space to nodes, which yields a load-balanced system when the DHT maps items “randomly” into the address space. To our knowledge, this yields the first P2P scheme simultaneously achieving O(log n) degree, O(log n) look-up cost, and constant-factor load balance (previous schemes settled for any two of the three).
Our second protocol aims to directly balance the distribution of items among the nodes. This is useful when the distribution of items in the address space cannot be randomized, for example, if we wish to support range searches on “ordered” keys. We give a simple protocol that balances load by moving nodes to arbitrary locations “where they are needed.” As an application, we use the last protocol to give an optimal implementation of a distributed data structure for range searches on ordered data.


PDF available ☞ here.

Load Balancing in Structured P2P Systems


Ananth Rao, Karthik Lakshminarayanan, Sonesh Surana, Richard Karp, Ion Stoica:


Most P2P systems that provide a DHT abstraction distribute objects among “peer nodes” by choosing random identifiers for the objects. This could result in an O(log N) imbalance. Besides, P2P systems can be highly heterogeneous, i.e. they may consist of peers that range from old desktops behind modem lines to powerful servers connected to the Internet through high-bandwidth lines. In this paper, we address the problem of load balancing in such P2P systems.
We explore the space of designing load-balancing algorithms that use the notion of “virtual servers”. We present three schemes that differ primarily in the amount of information used to decide how to re-arrange load. Our simulation results show that even the simplest scheme is able to balance the load within 80% of the optimal value, while the most complex scheme is able to balance the load within 95% of the optimal value.


PS document available ☞ here.

Simple Load Balancing for Distributed Hash Tables


John Byers, Jeffrey Considine, Michael Mitzenmacher:


Distributed hash tables have recently become a useful building block for a variety of distributed applications. However, current schemes based upon consistent hashing require both considerable implementation complexity and substantial storage overhead to achieve desired load balancing goals. We argue in this paper that these goals can be achieved more simply and more cost-effectively. First, we suggest the direct application of the “power of two choices” paradigm, whereby an item is stored at the less loaded of two (or more) random alternatives. We then consider how associating a small constant number of hash values with a key can naturally be extended to support other load balancing strategies, including load-stealing or load-shedding, as well as providing natural fault-tolerance mechanisms.


PS document available ☞ here.
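To make the “power of two choices” idea from the third paper concrete, here is a small, self-contained Java sketch: each key gets two candidate nodes and is placed on the less loaded one. The node selection and the load metric are deliberately simplistic and purely illustrative.

```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

// Minimal illustration of the "power of two choices" paradigm: hash a key to
// two candidate nodes and place the item on the less loaded of the two.
public class TwoChoicesPlacement {
    private final String[] nodes;
    private final Map<String, Integer> load = new ConcurrentHashMap<>();

    public TwoChoicesPlacement(String[] nodes) {
        this.nodes = nodes;
        for (String n : nodes) load.put(n, 0);
    }

    public String place(String key) {
        // Two independent "hashes" of the key give two candidate nodes.
        String a = nodes[Math.floorMod(key.hashCode(), nodes.length)];
        String b = nodes[Math.floorMod((key + "#2").hashCode(), nodes.length)];
        String chosen = load.get(a) <= load.get(b) ? a : b;
        load.merge(chosen, 1, Integer::sum);
        return chosen;
    }

    public static void main(String[] args) {
        TwoChoicesPlacement placer =
                new TwoChoicesPlacement(new String[] {"n1", "n2", "n3", "n4"});
        Random rnd = new Random(42);
        for (int i = 0; i < 1000; i++) placer.place("item-" + rnd.nextInt(10000));
        System.out.println(placer.load);   // loads stay close to each other
    }
}
```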

Happy reading!


Three Papers on Load Balancing originally posted on the NoSQL blog: myNoSQL

"

Hadoop: The Problem of Many Small Files

Hadoop: The Problem of Many Small Files: "
On why storing small files in HDFS is inefficient and how to solve this issue using Hadoop Archive:


When there are many small files stored in the system, these small files occupy a large portion of the namespace. As a consequence, the disk space is underutilized because of the namespace limitation. In one of our production clusters, there are 57 million files smaller than 128 MB, which means that these files contain only one block. These small files use up 95% of the namespace but only occupy 30% of the disk space.
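Once small files have been packed into an archive with the hadoop archive tool, they can be read back through the har:// filesystem. Here is a hedged Java sketch; the archive path and layout are hypothetical, and the .har would have been created beforehand.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: reading files back out of a Hadoop Archive through the har://
// scheme. The archive collapses many tiny files into a handful of HDFS files,
// easing namenode namespace pressure, while keeping them individually readable.
public class ReadFromHar {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        URI har = new URI("har:///user/me/logs-2010.har");
        FileSystem harFs = FileSystem.get(har, conf);

        // List and open the archived files as if they were ordinary HDFS files.
        for (FileStatus status : harFs.listStatus(new Path("har:///user/me/logs-2010.har/day=01"))) {
            try (FSDataInputStream in = harFs.open(status.getPath())) {
                // Read the first byte just to show the file is accessible.
                System.out.println(status.getPath() + " -> first byte " + in.read());
            }
        }
    }
}
```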



Hadoop: The Problem of Many Small Files originally posted on the NoSQL blog: myNoSQL

"