Thursday 11 August 2011

LevelDB - Fast and Lightweight Key/Value Database From the Authors of MapReduce and BigTable

LevelDB is an exciting new entrant into the pantheon of embedded databases, notable both for its pedigree, being authored by the makers of the now mythical Google MapReduce and BigTable products, and for its emphasis on efficient disk based random access using log-structured-merge (LSM) trees.
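
For readers new to the term, the core LSM idea is that writes go into a sorted in-memory buffer (a memtable) that is periodically flushed to immutable sorted files on disk, and reads consult the memtable first and then the on-disk runs from newest to oldest. The toy C++ sketch below shows just that shape; it is not LevelDB's implementation, which adds a write-ahead log, background compaction into levels, and indexed table files:

```cpp
#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Toy LSM store: Put() writes into a sorted in-memory "memtable"; when it
// fills up, the memtable is flushed as an immutable sorted run on disk.
// Get() checks the memtable, then scans runs from newest to oldest.
// (A real engine would index the runs and compact them in the background.)
class ToyLsm {
 public:
  explicit ToyLsm(size_t flush_threshold) : flush_threshold_(flush_threshold) {}

  void Put(const std::string& key, const std::string& value) {
    memtable_[key] = value;
    if (memtable_.size() >= flush_threshold_) Flush();
  }

  bool Get(const std::string& key, std::string* value) const {
    auto it = memtable_.find(key);
    if (it != memtable_.end()) { *value = it->second; return true; }
    for (auto run = runs_.rbegin(); run != runs_.rend(); ++run) {
      std::ifstream in(*run);
      std::string k, v;
      while (in >> k >> v)
        if (k == key) { *value = v; return true; }
    }
    return false;
  }

 private:
  void Flush() {
    std::ostringstream name;
    name << "run-" << runs_.size() << ".txt";
    std::ofstream out(name.str());
    for (const auto& kv : memtable_)           // std::map keeps keys sorted
      out << kv.first << ' ' << kv.second << '\n';
    runs_.push_back(name.str());
    memtable_.clear();
  }

  size_t flush_threshold_;
  std::map<std::string, std::string> memtable_;
  std::vector<std::string> runs_;
};

int main() {
  ToyLsm db(2);
  db.Put("b", "2");
  db.Put("a", "1");  // second write triggers a flush to run-0.txt
  db.Put("c", "3");
  std::string v;
  if (db.Get("a", &v)) std::cout << "a => " << v << "\n";
  return 0;
}
```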

The plan is to keep LevelDB fairly low-level. The intention is that it will be a useful building block for higher-level storage systems. Basho is already investigating using LevelDB as one of its storage engines.

In the past many systems were built around embedded databases, though most developers now use database servers accessed via RPC. An embedded database is a database distributed as a library and linked directly into your application. The application is responsible for providing a service-level API, sharding, backups, initiating consistency checks, initiating rollback, startup, shutdown, queries, etc. Applications become the container for the database and the manager of the database.

Architectures using embedded databases typically never expose a raw database abstraction at all. They have a service API, and the services use the embedded database library transparently behind the scenes. Often an embedded database will provide multiple access types, like indexed access for key-value uses and btrees for range queries and cursors.
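
To make the "low-level building block" point concrete, here is a minimal sketch against LevelDB's public C++ API, showing a point write, a point read, and an iterator-based range scan over a key prefix (the database path and keys are made up for illustration):

```cpp
#include <cassert>
#include <iostream>
#include <string>

#include "leveldb/db.h"

int main() {
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status status = leveldb::DB::Open(options, "/tmp/exampledb", &db);
  assert(status.ok());

  // Point write and point read.
  db->Put(leveldb::WriteOptions(), "user:42:name", "alice");
  std::string value;
  status = db->Get(leveldb::ReadOptions(), "user:42:name", &value);
  assert(status.ok());

  // Keys are stored in sorted order, so a prefix scan with an iterator
  // works as a simple range query / cursor.
  leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
  for (it->Seek("user:42:");
       it->Valid() && it->key().starts_with("user:42:");
       it->Next()) {
    std::cout << it->key().ToString() << " => " << it->value().ToString() << "\n";
  }
  delete it;
  delete db;
  return 0;
}
```

Everything above that (replication, backups, a query language, a network API) is left to the application, which is exactly the division of labor described above.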

Berkeley DB is one well-known example of an embedded database, SQLite is another, the file system is perhaps the most commonly used database, and there have been many other btree libraries in common use. I've used C-tree on several projects. In a battle of old versus new, a user named IM46 compared LevelDB to Berkeley DB and found that LevelDB solidly outperforms Berkeley DB for larger databases.

Programmers usually thought doing this stuff was easy, wrote their own failed on-disk btree library (raises hand), and then looked around for a proven product. It's only relatively recently that databases have gone upmarket and included a network layer and higher-level services.

Building a hybrid application/database architecture is still a very viable option when you want everything to be just so. If you are going to load balance requests across sharded application servers anyway, using a heavyweight external database infrastructure may not be necessary.
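
As a sketch of what that hybrid can look like, the front end simply hashes a key to pick the application server that owns it, and that server answers out of its own embedded store. The shard count and host names below are made up, and a real system would use a stable hash or an explicit shard map so placement survives restarts and resharding:

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Each application server owns one shard and embeds its own key/value store
// (LevelDB, Berkeley DB, ...). Routing is just "hash the key, pick the owner".
int ShardFor(const std::string& key, size_t num_shards) {
  // Illustrative only: std::hash is not stable across builds, so a real
  // deployment would use a fixed hash function or an explicit shard map.
  return static_cast<int>(std::hash<std::string>{}(key) % num_shards);
}

int main() {
  const std::vector<std::string> app_servers = {
      "app-0.internal", "app-1.internal", "app-2.internal", "app-3.internal"};
  const std::string key = "user:31337:profile";
  int shard = ShardFor(key, app_servers.size());
  // In production this would be an RPC to app_servers[shard], which serves
  // the request from its local embedded database.
  std::cout << key << " is owned by " << app_servers[shard] << "\n";
  return 0;
}
```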

The LevelDB mailing list started off very active and, though it has quieted down a bit, is still nicely active and informative. It is a good source of FAQish tips, performance suggestions, and notes on porting issues.

Friday 5 August 2011

Hadoop's tremendous inefficiency on graph data management (and how to avoid it)

Hadoop is great. It seems clear that it will serve as the basis of the vast majority of analytical data management within five years. Already today it is extremely popular for unstructured and polystructured data analysis and processing, since it is hard to find other options that are superior from a price/performance perspective. The reader should not take the following as me blasting Hadoop. I believe that Hadoop (with its ecosystem) is going to take over the world.

The problem with Hadoop is that its strength is also its weakness. Hadoop gives the user tremendous flexibility and power to scale all kinds of different data management problems. This is obviously great. But it is this same flexibility that allows the user to perform incredibly inefficient things and not care, because (a) they can simply add more machines and use Hadoop's scalability to hide inefficiency in user code, and (b) they can convince themselves that since everyone talks about Hadoop as being designed for 'batch data processing' anyway, they can let their process run in the background and not care about how long it will take to return.

Although not the subject of this post, an example of this inefficiency can be found in a SIGMOD paper that a bunch of us from Yale and the University of Wisconsin published 5 weeks ago. The paper shows that using Hadoop on structured (relational) data is at least a factor of 50 less efficient than it needs to be (an incredibly large number given how hard data center administrators work to yield less than a factor of two improvement in efficiency). As many readers of this blog already know, this factor of 50 improvement is the reason why Hadapt was founded. But this post is not about Hadapt or relational data. In this post, the focus is on graph data, and how if one is not careful, using Hadoop can be well over a factor of 1000 less efficient than it needs to be.

Before we get into how to improve Hadoop's efficiency on graph data by a factor of 1000, let's pause for a second to comprehend how dangerous it is to let inefficiencies in Hadoop become widespread. Imagine a world where the vast majority of data processing runs on Hadoop (a not entirely implausible scenario). If people allow these factors of 50 or 1000 to exist in their Hadoop utilization, these inefficiency factors translate directly to factors of 50 or 1000 more power utilization, more carbon emissions, more data center space, and more silicon waste. The disastrous environmental consequences of a world where everyone standardizes on incredibly inefficient technology are downright terrifying. And this is ignoring the impact on businesses in terms of server and energy costs, and lower performance. It seems clear that developing a series of 'best practices' around using Hadoop efficiently is going to be extremely important moving forward.

Let's delve into the subject of graph data in more detail. Recently there was a paper by Rohloff et al. that showed how to store graph data (represented in vertex-edge-vertex 'triple' format) in Hadoop, and perform sub-graph pattern matching in a scalable fashion over this graph of data. The particular focus of the paper is on Semantic Web graphs (where the data is stored in RDF and the queries are performed in SPARQL), but the techniques presented in the paper are generalizable to other types of graphs. This paper and the resulting system (called SHARD) have received significant publicity, including a presentation at HadoopWorld 2010, a presentation at DIDC 2011, and a feature on Cloudera's website. In fact, it is a very nice technique. It leverages Hadoop to scale sub-graph pattern matching (something that has historically been difficult to do); and by aggregating all outgoing edges for a given vertex into the same key-value pair in Hadoop, it even scales queries in a way that is 2-3 times more efficient than the naive way to use Hadoop for the same task.

The only problem is that, as shown by an upcoming VLDB paper that we're releasing today, this technique is an astonishing factor of 1340 times less efficient than an alternative technique for processing sub-graph pattern matching queries within a Hadoop-based system that we introduce in our paper. Our paper, led by my student, Jiewen Huang, achieves these enormous speedups in the following ways:

  1. Hadoop, by default, hash partitions data across nodes. In practice (e.g., in the SHARD paper) this results in data for each vertex in the graph being randomly distributed across the cluster (dependent on the result of a hash function applied to the vertex identifier). Therefore, data that is close to each other in the graph can end up very far away from each other in the cluster, spread out across many different physical machines. For graph operations such as sub-graph pattern matching, this is wildly suboptimal. For these types of operations, the graph is traversed by passing through neighbors of vertexes; it is hugely beneficial if these neighbors are stored physically near each other (ideally on the same physical machine). When using hash partitioning, since there is no connection between graph locality and physical locality, a large amount of network traffic is required for each hop in the query pattern being matched (on the order of one MapReduce job per graph hop), which results in severe inefficiency. Using a clustering algorithm to graph partition data across nodes in the Hadoop cluster (instead of using hash partitioning) is a big win.

  2. Hadoop, by default, has a very simple replication algorithm, where all data is generally replicated a fixed number of times (e.g. 3 times) across the cluster. Treating all data equally when it comes to replication is quite inefficient. If data is graph partitioned across a cluster, the data that is on the border of any particular partition is far more important to replicate than the data that is internal to a partition and already has all of its neighbors stored locally. This is because vertexes that are on the border of a partition might have several of their neighbors stored on different physical machines. For the same reasons why it is a good idea to graph partition data to keep graph neighbors local, it is a good idea to replicate data on the edges of partitions so that vertexes are stored on the same physical machine as their neighbors. Hence, allowing different data to be replicated at different factors can further improve system efficiency.

  3. Hadoop, by default, stores data on a distributed file system (HDFS) or a sparse NoSQL store (HBase). Neither of these data stores is optimized for graph data. HDFS is optimized for unstructured data, and HBase for semi-structured data. But there has been significant research in the database community on creating optimized data stores for graph-structured data. Using a suboptimal store for the graph data is another source of tremendous inefficiency. By replacing the physical storage system with graph-optimized storage, but keeping the rest of the system intact (similar to the theme of the HadoopDB project), it is possible to greatly increase the efficiency of the system.

To a first degree of approximation, each of the above three improvements yields an entire order of magnitude speedup (a factor of 10). By combining them, we therefore saw the factor of 1340 improvement in performance on the identical benchmark that was run in the SHARD paper. (For more details on the system architecture, partitioning and data placement algorithms, query processing, and experimental results, please see our paper.)
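
Below is a minimal sketch of the first two ideas (graph-aware placement plus replication of boundary edges). The tiny graph and the hand-written partition map are stand-ins for what a real graph partitioner, for example METIS-style clustering, would produce:

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Place each edge on the partition that owns its source vertex (idea 1), and
// also replicate edges whose endpoints live in different partitions onto the
// destination's partition (idea 2). The result: every partition holds all
// edges incident to the vertexes it owns, so one-hop lookups stay local.
using Edge = std::pair<std::string, std::string>;  // (source, target)

int main() {
  const std::vector<Edge> edges = {
      {"a", "b"}, {"b", "c"}, {"c", "d"}, {"d", "a"}, {"b", "d"}};
  std::map<std::string, int> partition_of = {
      {"a", 0}, {"b", 0}, {"c", 1}, {"d", 1}};  // stand-in for a real partitioner

  std::map<int, std::set<Edge>> placed;  // partition id -> edges stored there
  for (const auto& e : edges) {
    int src = partition_of[e.first];
    int dst = partition_of[e.second];
    placed[src].insert(e);                   // graph partitioning: owner stores it
    if (dst != src) placed[dst].insert(e);   // boundary edge: replicate it
  }

  for (const auto& p : placed) {
    std::cout << "partition " << p.first << ":";
    for (const auto& e : p.second) std::cout << " " << e.first << "->" << e.second;
    std::cout << "\n";
  }
  return 0;
}
```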

It is important to note that since we wanted to run the same benchmark as the SHARD paper, we used the famous Lehigh University Benchmark (LUBM) for Semantic Web graph data and queries. Semantic Web sub-graph pattern matching queries tend to contain quite a lot of constants (especially on edge labels) relative to other types of graph queries. The next step for this project is to extend and benchmark the system on other graph applications (the types of graphs for which people tend to use systems based on Google's Pregel project today).

In conclusion, it is perfectly acceptable to give up a little bit of efficiency for improved scalability when using Hadoop. However, once this decrease in efficiency starts to reach a factor of two, it is likely a good idea to think about what is causing this inefficiency, and attempt to find ways to avoid it (while keeping the same scalability properties). Certainly once the factor extends beyond two (such as the enormous factor of 1340 we discovered in our VLDB paper), the sheer waste in power and hardware cannot be ignored. This does not mean that Hadoop should be thrown away; however, it will become necessary to package Hadoop with 'best practice' solutions to avoid such unnecessarily high levels of waste.

Wednesday 3 August 2011

Is Stonebraker right? Why SQL isn’t the choice du jour for many apps

Two weeks ago, I wrote a post that sparked a pretty overwhelming response. The gist of the post, derived from an interview with database pioneer Michael Stonebraker, was that legacy SQL databases, including MySQL, are relics and no longer relevant with regard to today’s web applications. Stonebraker cited Facebook’s renowned MySQL-plus-memcached architecture as an example of how much effort it takes to make such databases keep up with applications that store lots of data and serve high rates of transactions.


By and large, the responses weren’t positive. Some singled out Stonebraker as out of touch or as just trying to sell a product. Some pointed to the popularity of MySQL as evidence of its continued relevance. Many challenged how Stonebraker dared to question the wisdom of Facebook’s top-of-the-line database engineers.

They’re all fair-enough statements, but they also somewhat miss the point. Stonebraker wasn’t calling out Facebook, nor was he suggesting (as far as I can tell) that it abandon MySQL tomorrow. Yes, he has a product, VoltDB, to sell, but that shouldn’t blur the overall message: Whatever database technology someone might choose for a new web application, anyone who hopes to achieve even a fraction of Facebook’s traffic should not go down the same path Facebook did.

Facebook’s implementation is a sign of the times in which it was built, but the evidence suggests that if Facebook could do it over again with today’s database options, it wouldn’t go down the same path. Sharding MySQL thousands of times, operating thousands of memcached servers and paying a team of crack engineers to keep it scaling is nobody’s idea of fun.

First, Facebook


Nobody denies that Facebook’s MySQL team is supremely smart or that it does a great job innovating to ensure that the database is able to keep up with the site’s transactions.


Jim Starkey, the founder and CTO of NimbusDB — and a man with some serious relational database and MySQL credentials — puts it well. “You either scale to where your customer base takes you or you die,” he said, and Facebook has been able to do with MySQL what others would not have been able to do. It has “absolutely skilled” engineers, he added, but they don’t exist everywhere, and Facebook has the added benefit of being able to pay them.

Paul Mikesell, the founder and CEO of Clustrix, echoed that sentiment, telling me that Facebook has done great work to make its site scalable. Clustrix sells a “NewSQL” database that is compatible with MySQL. Interestingly, Jonathan Heiliger, the soon-to-be former VP of technical operations at Facebook, sits on Clustrix’s advisory board.

No, it’s not so much Facebook’s MySQL implementation that’s the problem. By and large, it does what it’s designed to do, which is to keep up with the myriad status updates and other data that populate users’ profiles. Rather, it’s that Facebook had to expend so much money and so many man-hours to get there.

Facebook has declined numerous requests for comments, save for this snippet from a spokesperson: “[Our] philosophy is to build infrastructure using the best tools available for the job and [we] are constantly evaluating better ways to do things when and where it matters.”

Indeed it is. As I noted in the original post, as Facebook has rolled out new applications, it has increasingly utilized newer database technologies better suited for those tasks. Inbox search within Facebook is powered by the Cassandra NoSQL database that it created, while Facebook Messages and some other new applications use HBase. It looks like Facebook is onto something.

Actually, MySQL isn’t the problem . . .



According to database industry analyst Curt Monash, Stonebraker makes a valid point in citing Facebook’s complex MySQL situation, because Facebook isn’t using MySQL for its relational capabilities. MySQL might be a fine database choice for a low-end application that requires full relational capabilities, but sharded MySQL plus memcached is not. You lose a lot of those capabilities as soon as you begin sharding, he explained, and the application actually communicates directly with memcached for data that resides in that layer. It’s that architecture that’s the problem.
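
For readers unfamiliar with the pattern, the application-side read path in a sharded-MySQL-plus-memcached setup looks roughly like the sketch below. The in-memory maps stand in for memcached and for the MySQL shards so the example stays self-contained, and the key format and shard count are made up:

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Rough shape of the read path described above: the application checks the
// cache tier first and only falls through to the owning MySQL shard on a
// miss. The in-memory maps below stand in for memcached and for the shards;
// a real client would speak the memcached protocol and issue SQL against
// the chosen shard.
static std::unordered_map<std::string, std::string> cache;                   // "memcached"
static std::vector<std::unordered_map<std::string, std::string>> shards(4);  // "MySQL shards"

std::string ReadProfile(const std::string& user_id) {
  const std::string cache_key = "profile:" + user_id;

  auto hit = cache.find(cache_key);
  if (hit != cache.end()) return hit->second;      // cache hit: no database work

  // Cache miss: pick the shard that owns this user and "query" it.
  size_t shard = std::hash<std::string>{}(user_id) % shards.size();
  auto row = shards[shard].find(user_id);
  std::string value = (row != shards[shard].end()) ? row->second : "";

  cache[cache_key] = value;                        // populate the cache for next time
  return value;
}

int main() {
  const std::string uid = "31337";
  shards[std::hash<std::string>{}(uid) % shards.size()][uid] = "alice";

  std::cout << ReadProfile(uid) << "\n";  // first call misses and hits the shard
  std::cout << ReadProfile(uid) << "\n";  // second call is served by the cache
  return 0;
}
```

Once reads are spread across shards and a separate cache tier like this, cross-shard joins and transactional guarantees have to be handled in application code, which is the loss of relational capability Monash is pointing at.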

Monash believes there are two timelines for when a technology runs its course, depending on the situation: when you shouldn’t use it to start a new project, and when you should upgrade. For new projects that might have to scale massively, he said, you wouldn’t choose MySQL plus memcached.

As for the sharding, Starkey said, “The only thing sharding has going for it is the absence of alternatives.” He noted that although it’s difficult to find anything he and Stonebraker agree on, they do both agree that traditional SQL databases aren’t easy to scale. Because scaling them is so complex, Starkey — who, like Stonebraker, has a horse in the NewSQL race with NimbusDB — thinks all legacy databases will be irrelevant in a few years. All except low-end MySQL, that is.

Monash said there are several possible options for companies that want to retain MySQL features while still being able to scale, including Clustrix, TokuDB, ScaleDB and Schooner MySQL with Active Cluster. Clustrix’s Mikesell noted that several of its customers were very happy to be done sharding after they made the switch, while others saved lots of human and capital resources by never having to shard in the first place.

There also are startups, such as dbshards and ScaleBase, that make sharding transparent to applications, saving developers from having to write applications that can handle a sharded database.

… always


However, if you don’t need relational features and/or ACID compliance, Monash says there are many possibilities, of which VoltDB, NimbusDB and the other NewSQL databases might not even be the best options. Monash actually takes a pretty harsh stance when it comes to VoltDB.

Even Starkey acknowledges this, explaining that you only really need ACID if you have valuable data. Google has a relational database for its revenue-related information, he said, but uses NoSQL tools like BigTable elsewhere. If a company has plans for its web application to scale and start driving a lot of traffic, Starkey said, he can’t imagine why it would build that new application using MySQL.

But Facebook isn’t a greenfield environment, which makes matters more complicated. Given Facebook’s reliance on memcached and use of it as a key-value store, though, Monash said Membase Server, a NoSQL database, might actually be a good replacement if Facebook were to transition from MySQL. That’s because Membase has memcached built in and is designed to mimic it in many ways, only in a single tier.
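
To see why that mimicry matters, here is a minimal client written against the standard libmemcached C library. The premise of the Membase argument is that code like this could be pointed at a Membase node instead of a memcached instance without changing the call pattern; the host, port, key and value are placeholders:

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <string>

#include <libmemcached/memcached.h>

int main() {
  memcached_st* memc = memcached_create(nullptr);
  // A memcached node today; in principle a Membase node tomorrow, since
  // Membase is designed to speak the same protocol.
  memcached_server_add(memc, "127.0.0.1", 11211);

  const char* key = "profile:31337";
  const char* value = "alice";
  memcached_set(memc, key, strlen(key), value, strlen(value),
                /*expiration=*/0, /*flags=*/0);

  size_t value_len = 0;
  uint32_t flags = 0;
  memcached_return_t rc;
  char* fetched = memcached_get(memc, key, strlen(key), &value_len, &flags, &rc);
  if (rc == MEMCACHED_SUCCESS && fetched != nullptr) {
    std::cout << std::string(fetched, value_len) << "\n";
    free(fetched);  // libmemcached allocates the returned buffer
  }
  memcached_free(memc);
  return 0;
}
```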

James Phillips, the co-founder and senior VP of products at Couchbase (the new corporate home for Membase Server), said the vast majority of Membase deployments are for new applications, but large sites switching to it from a MySQL-plus-memcached environment isn’t unheard of. In fact, Zynga recently made the switch.

Also, Netflix recently transitioned from an Oracle database to SimpleDB on Amazon Web Services and Cassandra. For a detailed explanation of how and why, check out this presentation by Sid Anand, its cloud data architect.

Based on what he knows of Facebook’s architecture, some of which likely was gleaned from Facebook Director of Engineering Robert Johnson, who sits on Couchbase’s advisory board, Phillips thinks it would be possible, although not necessarily easy, for Facebook to make a switch.

Furthermore, most NoSQL databases and a number of NewSQL databases have open-source and/or free versions, so developers concerned with cost or flexibility aren’t without options.

In closing


Monash sums it up nicely: “Are there undesirable aspects to the Facebook architecture? Absolutely. Are they as serious as [Stonebraker] makes them out to be? Absolutely not.”

That’s because it has the engineering talent to do what it pleases, whether that’s sticking with MySQL or eventually transitioning to something else. But not everyone has that luxury, and if they either don’t really need a relational database or need a relational database that can scale, there’s a strong case to be made that MySQL is no longer the most desirable option.


Facebook trapped in MySQL ‘fate worse than death’

According to database pioneer Michael Stonebraker, Facebook is operating a huge, complex MySQL implementation equivalent to “a fate worse than death,” and the only way out is to “bite the bullet and rewrite everything.”

Not that it’s necessarily Facebook’s fault, though. Stonebraker says the social network’s predicament is all too common among web startups that start small and grow to epic proportions.

During an interview this week, Stonebraker explained to me that Facebook has split its MySQL database into 4,000 shards in order to handle the site’s massive data volume, and is running 9,000 instances of memcached in order to keep up with the number of transactions the database must serve. I’m checking with Facebook to verify the accuracy of those numbers, but Facebook’s history with MySQL is no mystery.

The oft-quoted statistic from 2008 is that the site had 1,800 servers dedicated to MySQL and 805 servers dedicated to memcached, although multiple MySQL shards and memcached instances can run on a single server. Facebook even maintains a MySQL at Facebook page dedicated to updating readers on the progress of its extensive work to make the database scale along with the site.

The widely accepted problem with MySQL is that it wasn’t built for webscale applications or those that must handle excessive transaction volumes. Stonebraker said the problem with MySQL and other SQL databases is that they consume too many resources on overhead tasks (e.g., maintaining ACID compliance and handling multithreading) and relatively few on actually finding and serving data. This might be fine for a small application with a small data set, but it quickly becomes too much to handle as data and transaction volumes grow.

This is a problem for a company like Facebook because it has so much user data, and because every user clicking “Like,” updating his status, joining a new group or otherwise interacting with the site constitutes a transaction its MySQL database has to process. Every second a user has to wait while a Facebook service calls the database is time that user might spend wondering if it’s worth the wait.

Not just a Facebook problem


In Stonebraker’s opinion, “old SQL” (as he calls it) is “good for nothing” and needs to be “sent to the home for retired software.” After all, he explained, SQL was created decades ago, before the web, mobile devices and sensors forever changed how and how often databases are accessed.

But products such as MySQL are also open source and free, and SQL skills aren’t hard to come by. This means, Stonebraker says, that when web startups decide they need to build a product in a hurry, MySQL is a natural choice. But then they hit that hockey-stick-like growth rate like Facebook did, and they don’t really have the time to re-engineer the service from the database up. Instead, he said, they end up applying Band-Aid fixes that solve problems as they occur, but that never really fix the underlying problem of an inadequate data-management strategy.



There have been various attempts to overcome SQL’s performance and scalability problems, including the buzzworthy NoSQL movement that burst onto the scene a couple of years ago. However, it was quickly discovered that while NoSQL might be faster and scale better, it achieves this at the expense of ACID consistency. As I explained in a post earlier this year about Citrusleaf, a NoSQL provider claiming to maintain ACID properties:

ACID is an acronym for “Atomicity, Consistency, Isolation, Durability” — a relatively complicated way of saying transactions are performed reliably and accurately, which can be very important in situations like e-commerce, where every transaction relies on the accuracy of the data set.

Stonebraker thinks sacrificing ACID is a “terrible idea,” and, he noted, NoSQL databases end up only being marginally faster because they require writing certain consistency and other functions into the application’s business logic.

Stonebraker added, though, that NoSQL is a fine option for storing and serving unstructured or semi-structured data such as documents, which aren’t really suitable for relational databases. Facebook, for example, created Cassandra for certain tasks and also uses the Hadoop-based HBase heavily, but it’s still a MySQL shop for much of its core needs.

Is ‘NewSQL’ the cure?


But Stonebraker — an entrepreneur as much as a computer scientist — has an answer for the shortcomings of both “old SQL” and NoSQL. It’s called NewSQL (a term coined by 451 Group analyst Matthew Aslett), or scalable SQL, as I’ve referred to it in the past. Pushed by companies such as Xeround, Clustrix, NimbusDB, GenieDB and Stonebraker’s own VoltDB, NewSQL products maintain ACID properties while eliminating most of the other functions that slow legacy SQL performance. VoltDB, an online transaction processing (OLTP) database, uses a number of methods to improve speed, including running entirely in memory instead of on disk.

It would be easy to accuse Stonebraker of tooting his own horn, but NewSQL vendors have been garnering lots of attention, investment and customers over the past year. There’s no guarantee they’re the solution for Facebook’s MySQL woes — the complexity of Facebook’s architecture and the company’s penchant for open source being among the reasons — but perhaps NewSQL will help the next generation of web startups avoid falling into the pitfalls of their predecessors. Until, that is, it, too, becomes a relic of the Web 3.0 era.
