Friday 31 May 2013

SQL is what’s next for Hadoop: Here’s who’s doing it

When we first began putting together the schedule for Structure: Data several months ago, we knew that running SQL queries on Hadoop would be a big deal — we just didn’t know how big a deal it would actually become. Fast-forward to today, a mere month away from the event (March 20-21 in New York), and the writing on the wall is a lot clearer. SQL support isn’t the end-game for Hadoop, but it’s the feature that will help Hadoop find its way into more places in more companies that understand the importance of next-generation analytics but don’t want to (or can’t yet) re-invent the wheel by becoming MapReduce experts.
In fact, there are now so many products and projects pushing SQL queries and interactive data analysis on Hadoop — including two more announced this week — that it’s getting hard to keep track. But I’ll do my best.
Of course, Facebook began this whole movement to bring SQL database-like functionality to Hadoop when it created Hive in 2009. Hive, now an Apache project, includes a data-management layer and SQL-like query language called HiveQL. It has proven rather useful and popular over the years, but Hive’s reliance on MapReduce makes it somewhat slow by nature — MapReduce scans the entire data set and moves a lot of data over the network while processing a job — and there hasn’t been much effort to package it in a manner that might attract mainstream users.
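For readers who haven't touched Hive, the pull is that an analyst writes something close to ordinary SQL and Hive compiles it into MapReduce jobs behind the scenes. The fragment below is a minimal, hypothetical sketch of that workflow through Hive's JDBC driver; the host, credentials and web_logs table are invented, and the driver class and URL scheme differ between the original HiveServer (jdbc:hive://) and the newer HiveServer2 (jdbc:hive2://), so adjust for your installation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: submit a HiveQL aggregation through the Hive JDBC driver.
// Host, credentials and the web_logs table are hypothetical. The query is planned
// as one or more MapReduce jobs, which is where Hive's batch-like latency comes from.
public class HiveQlSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2-era driver; older releases differ
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-server.example.com:10000/default", "analyst", "");
             Statement stmt = conn.createStatement()) {
            // HiveQL looks like SQL, but executes as batch MapReduce work.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS views " +
                "FROM web_logs WHERE dt = '2013-02-19' " +
                "GROUP BY page ORDER BY views DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("views"));
            }
        }
    }
}
```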
And keep in mind that these next-generation SQL-on-Hadoop tools aren’t just business intelligence or database products that can access data stored in Hadoop; EMC Greenplum, HP Vertica, IBM Netezza, ParAccel, Microsoft SQL Server and Teradata/Aster Data (which this week released some cool new features for just this purpose) all allow some sort of access to Hadoop data. Rather, these are applications, frameworks and engines that let users query Hadoop data from inside Hadoop, sometimes by re-architecting the underlying compute and data infrastructures. The beauty of this approach is that data is usable in its existing form and, in theory, doesn’t require two separate data stores for analytic applications.

Data warehouses and BI: The Structure: Data set

I’m highlighting this group of companies first, not because I think they’re the best (although they might well be), but because I’m truly excited about the panel they’ll be featured on at our conference next month. The panel is moderated by Facebook engineering manager Ravi Murthy — a guy who knows his way around a database — so they’ll have to answer some tough questions from one of the most advanced and most aggressive Hadoop and analytics tool users out there:
Apache Drill: Drill is a MapR-led effort to create a Google Dremel-like (or BigQuery-like) interactive query engine on top of Hadoop. First announced in August, the project is still under development and in the incubator program within Apache. According to its web site, “One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.”
Hadapt: Hadapt, which actually launched at Structure: Data in 2011, was the first of the SQL-on-Hadoop vendors and is somewhat unique in that it has a real product on the market and real users in production. Its architecture includes tools for advanced SQL functions, a split-execution engine for MapReduce and relational tasks, and both HDFS and relational storage. In October, the company announced a tight integration with Tableau Software around advanced visual analytics.
Platfora: Technically not a SQL product, Platfora is red-hot right now and is trying to re-imagine the world of business intelligence for a big data world. Essentially an HTML5 canvas laid atop Hadoop and an in-memory, massively parallel processing engine, the company’s software, which it unveiled in October, is designed to make analyzing data stored in Hadoop a fast and visually intuitive process.
Qubole: Qubole is an interesting case in that it’s essentially a cloud-based version of the popular Apache Hive framework, launched by the guys who created Hive while working at Facebook. Qubole claims its auto-scaling abilities, optimized Hadoop code and columnar data cache make its service run much faster than Hive alone — and running on Amazon Web Services makes it easier than maintaining a physical cluster.

Data warehouses and BI: The rest

Citus Data: Citus Data’s CitusDB isn’t just about Hadoop, but rather wants to bring the power of its distributed Postgres implementation to all types of data. It relies on Postgres’s foreign data wrappers feature to convert disparate data types into the database’s native format, and then on its own distributed-processing technology to carry out queries in seconds or less. Because of its Postgres foundation, CitusDB can join data from different data sources and retains all the native features that come with that database.
Cloudera Impala: Cloudera’s Impala might just be the most-important SQL-on-Hadoop effort around because of Cloudera’s expansive installation and partner footprints. It’s a massively parallel processing engine that bypasses MapReduce to enable interactive queries on data stored in either HDFS or HBase, using the same variant of SQL that Hive uses. However, because Cloudera doesn’t build applications, it’s relying on higher-level BI and analytics partners to provide the user interface.
Karmasphere: Karmasphere is one of the first startups to build an analytic application atop Hadoop, and in its 2.0 release last year the company added support for SQL queries of data in HDFS. Like Hive, Karmasphere still relies on MapReduce to process queries, which means it’s inherently slower than newer approaches. However, unlike Hive, Karmasphere allows for parallel queries to run at the same time and includes a visual interface for writing queries and filtering results.
Lingual: Lingual is a new open source project from Concurrent (see disclosure), the parent company of the Cascading framework for Hadoop. Announced on Wednesday, Lingual runs on Cascading and gives developers and analysts a true ANSI SQL interface from which to run analytics or build applications. Lingual is compatible with traditional BI tools, JDBC and the Cascading family of APIs.
Phoenix: Phoenix is a new and relatively unknown open source project that comes out of Salesforce.com and aims to allow fast SQL queries of data stored in HBase, the NoSQL database built atop HDFS. Its stated mission: “Become the standard means of accessing HBase data through a well-defined, industry standard API.” Users interact with it through JDBC interfaces, and its developers claim sub-second response times for small queries and seconds-long responses when querying tens of millions of rows.
A sample of Phoenix via the SQuirreL client
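Because Phoenix presents itself through standard JDBC, querying an HBase-backed table looks, to the client, like querying any relational database. Here is a minimal, hypothetical sketch under that assumption: the ZooKeeper address and the metrics table are invented, and the driver’s package name varies by Phoenix release, so treat the class name below as illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal sketch of querying HBase-backed data through Phoenix's JDBC interface.
// The ZooKeeper quorum and the "metrics" table are hypothetical; the driver class
// name shown is from the Salesforce.com-era packaging and differs in later releases.
public class PhoenixSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1.example.com:2181")) {
            // Phoenix turns this SQL into HBase scans rather than MapReduce jobs,
            // which is how it targets sub-second responses for small queries.
            PreparedStatement ps = conn.prepareStatement(
                "SELECT host, MAX(cpu) AS peak_cpu FROM metrics GROUP BY host");
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getString("host") + " -> " + rs.getDouble("peak_cpu"));
            }
        }
    }
}
```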
Shark: Shark isn’t technically Hadoop, but it’s cut from the same cloth. Shark, in this case, stands for “Hive on Spark,” with Hive meaning the same thing it does to Hadoop, but with Spark being an in-memory platform designed to run parallel-processing jobs 100 times faster than MapReduce (a speed improvement over traditional Hive that Shark also claims). Shark also includes APIs for turning query results into a type of data format amenable to machine learning algorithms. Both Shark and Spark are developed by the University of California, Berkeley’s AMPLab.

Stinger Initiative: Launched on Wednesday (along with a security gateway called Knox and a faster, simpler processing framework called Tez), the Stinger Initiative is a Hortonworks-led effort to make Hive faster — up to 100x — and more functional. Stinger adds more SQL analytics capabilities to Hive, but the most-important aspects are infrastructural: an optimized execution engine, a columnar file format and the ability to avoid MapReduce bottlenecks by running atop Tez.

Operational SQL

Drawn to Scale: Drawn to Scale is a startup that has built an operational SQL database on top of HBase. The key word here is database, as its product, called Spire, is modeled after Google’s F1 and designed to power transactional applications as well as analytic ones. Spire has a fully distributed index, and queries are sent only to the node with the relevant data, so reads and writes are fast and the system can handle lots of concurrent users without falling down.
Splice Machine: Database startup Splice Machine is also trying to get into the operational space by building its Splice SQL Engine atop the naturally distributed HBase database. Splice Machine focuses its message on transactional integrity, which is really where it separates itself from scalable NoSQL databases and analytics-focused SQL-on-Hadoop efforts. It relies on HBase’s auto-sharding feature to make scaling an easy process.

Replication and the latency-consistency tradeoff

by Daniel Abadi, dbmsmusings.blogspot.com
December 7th 2011

As 24/7 availability becomes increasingly important for modern applications, database systems are frequently replicated in order to stay up and running in the face of database server failure. It is no longer acceptable for an application to wait for a database to recover from a log on disk --- most mission-critical applications need immediate failover to a replica.
There are several important tradeoffs to consider when it comes to system design for replicated database systems. The most famous one is CAP --- you have to trade off consistency vs. availability in the event of a network partition. In this post, I will go into detail about a lesser-known but equally important tradeoff --- between latency and consistency. Unlike CAP, where consistency and availability are only traded off in the event of a network partition, the latency vs. consistency tradeoff is present even during normal operations of the system. (Note: the latency-consistency tradeoff discussed in this post is the same as the "ELC" case in my PACELC post).
The intuition behind the tradeoff is the following: there's no way to perform consistent replication across database replicas without some level of synchronous network communication. This communication takes time and introduces latency. For replicas that are physically close to each other (e.g., on the same switch), this latency is not necessarily onerous. But replication over a WAN will introduce significant latency.
The rest of this post adds more meat to the above intuition. I will discuss several general techniques for performing replication, and show how each technique trades off latency or consistency. I will then discuss several modern implementations of distributed database systems and show how they fit into the general replication techniques that are outlined in this post.
There are only three alternatives for implementing replication (each with several variations): (1) data updates are sent to all replicas at the same time, (2) data updates are sent to an agreed upon master node first, or (3) data updates are sent to a single (arbitrary) node first. Each of these three cases can be implemented in various ways; however each implementation comes with a consistency-latency tradeoff. This is described in detail below.
  1. Data updates are sent to all replicas at the same time. If updates are not first passed through a preprocessing layer or some other agreement protocol, replica divergence (a clear lack of consistency) could ensue (assuming there are multiple updates to the system that are submitted concurrently, e.g., from different clients), since each replica might choose a different order with which to apply the updates. On the other hand, if updates are first passed through a preprocessing layer, or all nodes involved in the write use an agreement protocol to decide on the order of operations, then it is possible to ensure that all replicas will agree on the order in which to process the updates, but this leads to several sources of increased latency. For the case of the agreement protocol, the protocol itself is the additional source of latency. For the case of the preprocessor, the additional sources of latency are:

    1. Routing updates through an additional system component (the preprocessor) increases latency
    2. The preprocessor either consists of multiple machines or a single machine. If it consists of multiple machines, an agreement protocol to decide on operation ordering is still needed across machines. Alternatively, if it runs on a single machine, all updates, no matter where they are initiated (potentially anywhere in the world), are forced to route all the way to the single preprocessor first, even if there is a data replica that is nearer to the update initiation location.

  2. Data updates are sent to an agreed upon location first (this location can be dependent on the actual data being updated) --- we will call this the "master node" for a particular data item. This master node resolves all requests to update the same data item, and the order that it picks to perform these updates will determine the order that all replicas perform the updates. After it resolves updates, it replicates them to all replica locations. There are three options for this replication:

    a. The replication is done synchronously, meaning that the master node waits until all updates have made it to the replica(s) before "committing" the update. This ensures that the replicas remain consistent, but synchronous actions across independent entities (especially if this occurs over a WAN) increase latency due to the requirement to pass messages between these entities, and the fact that latency is limited by the speed of the slowest entity.
    b. The replication is done asynchronously, meaning that the update is treated as if it were completed before it has been replicated. Typically the update has at least made it to stable storage somewhere before the initiator of the update is told that it has completed (in case the master node fails), but there are no guarantees that the update has been propagated to replicas. The consistency-latency tradeoff in this case is dependent on how reads are dealt with:
      i. If all reads are routed to the master node and served from there, then there is no reduction in consistency. However, there are several latency problems with this approach:
        1. Even if there is a replica close to the initiator of the read request, the request must still be routed to the master node which could potentially be physically much farther away.
        2. If the master node is overloaded with other requests or has failed, there is no option to serve the read from a different node. Rather, the request must wait for the master node to become free or recover. In other words, there is a potential for increased latency due to lack of load balancing options.

      ii. If reads can be served from any node, read latency is much better, but this can result in inconsistent reads of the same data item, since different locations have different versions of a data item while its updates are still being propagated, and a read can potentially be sent to any of these locations. Although the level of reduced consistency can be bounded by keeping track of update sequence numbers and using them to implement "sequential/timeline consistency" or "read-your-writes consistency" (a small sketch of this sequence-number approach appears after this list), these options are nonetheless reduced consistency options. Furthermore, write latency can be high if the master for a write operation is geographically far away from the requester of the write.

    c. A combination of (a) and (b) is possible. Updates are sent to some subset of replicas synchronously, and the rest asynchronously. The consistency-latency tradeoff in this case is again determined by how reads are dealt with. If reads are routed to at least one node that had been synchronously updated (e.g. when R + W > N in a quorum protocol, where R is the number of nodes involved in a synchronous read, W is the number of nodes involved in a synchronous write, and N is the number of replicas), then consistency can be preserved, but the latency problems of (a), (b)(i)(1), and (b)(i)(2) are all present (though to somewhat lower degrees, since the number of nodes involved in the synchronization is smaller, and there is potentially more than one node that can serve read requests). If it is possible for reads to be served from nodes that have not been synchronously updated (e.g. when R + W <= N), then inconsistent reads are possible, as in (b)(ii) above (see the sketch after this list for the quorum arithmetic).

  3. Data updates are sent to an arbitrary location first, the updates are performed there, and are then propagated to the other replicas. The difference between this case and case (2) above is that the location that updates are sent to for a particular data item is not always the same. For example, two different updates for a particular data item can be initiated at two different locations simultaneously. The consistency-latency tradeoff again depends on two options:
    a. If replication is done synchronously, then the latency problems of case (2)(a) above are present. Additionally, extra latency can be incurred in order to detect and resolve cases of simultaneous updates to the same data item initiated at two different locations.
    b. If replication is done asynchronously, then similar consistency problems as described in cases (1) and (2)(b) above present themselves.
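To make the quorum arithmetic in option (2)(c) concrete, here is a minimal, illustrative sketch (not from the original post, and with all names invented): a write is acknowledged once W of the N replicas have applied it, a read consults R replicas and keeps the highest version it sees, and R + W > N is exactly the condition that guarantees the read set overlaps the write set.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the R/W/N quorum tradeoff in option (2)(c). All names are
// invented; real systems (Dynamo, Riak, Cassandra) add vector clocks, hinted handoff,
// failure detection and more on top of this skeleton.
public class QuorumSketch {
    static class Replica { long version = 0; String value = null; }

    static final int N = 5, W = 3, R = 3;   // R + W > N, so reads overlap the latest write
    static final List<Replica> replicas = new ArrayList<>();
    static { for (int i = 0; i < N; i++) replicas.add(new Replica()); }

    // "Write": only the first W replicas are updated before the client is acknowledged;
    // the remaining N - W copies stay stale, as an asynchronous tail would.
    static void write(String value, long version) {
        for (int i = 0; i < W; i++) {
            replicas.get(i).value = value;
            replicas.get(i).version = version;
        }
    }

    // "Read": consult R replicas (here, deliberately the ones most likely to be stale)
    // and keep the freshest version seen. With R + W > N at least one of them is current;
    // with R + W <= N this same read could miss the latest write entirely.
    static String read() {
        Replica freshest = null;
        for (int i = N - R; i < N; i++) {
            Replica r = replicas.get(i);
            if (freshest == null || r.version > freshest.version) freshest = r;
        }
        return freshest == null ? null : freshest.value;
    }

    public static void main(String[] args) {
        write("v1", 1);
        write("v2", 2);
        System.out.println(read());   // prints "v2": the read quorum caught the write quorum
    }
}
```

And here is an equally stripped-down sketch of the sequence-number idea behind "read-your-writes" consistency from option (2)(b)(ii): the client remembers the sequence number of its own last write and reads from a low-latency replica only if that replica has caught up at least that far; otherwise it pays the latency of going back to the master.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of "read-your-writes" via sequence numbers, as in option (2)(b)(ii).
// All names are invented; PNUTS-style systems track per-record versions to achieve this.
public class ReadYourWritesSketch {
    static class Replica {
        long appliedSeq = 0;
        final Map<String, String> data = new HashMap<>();
    }

    static final Replica master = new Replica();
    static final Replica nearby = new Replica();  // asynchronously updated, may lag behind
    static long lastWriteSeq = 0;                 // highest sequence number this client has written

    static void write(String key, String value) {
        master.data.put(key, value);
        master.appliedSeq = ++lastWriteSeq;
        // Propagation to 'nearby' happens asynchronously, some unknown time later.
    }

    // Prefer the low-latency local replica, but only if it has applied this client's last
    // write; otherwise fall back to the master and pay the extra latency (the tradeoff).
    static String read(String key) {
        Replica source = (nearby.appliedSeq >= lastWriteSeq) ? nearby : master;
        return source.data.get(key);
    }

    public static void main(String[] args) {
        write("user:42:status", "updated");
        System.out.println(read("user:42:status"));  // "updated" -- served by the master here
    }
}
```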

Therefore, no matter how the replication is performed, there is a tradeoff between consistency and latency. For carefully controlled replication across short distances, there exist reasonable options (e.g. choice (2)(a) above, since network communication latency is small in local data centers); however, for replication over a WAN, there is no way around the significant consistency-latency tradeoff.
To more fully understand the tradeoff, it is helpful to consider how several well-known distributed systems are placed into the categories outlined above. Dynamo, Riak, and Cassandra choose a combination of (2)(c) and (3) from the replication alternatives described above. In particular, updates generally go to the same node, and are then propagated synchronously to W other nodes (case (2)(c)). Reads are synchronously sent to R nodes with R + W typically being set to a number less than or equal to N, leading to a possibility of inconsistent reads. However, the system does not always send updates to the same node for a particular data item (e.g., this can happen in various failure cases, or due to rerouting by a load balancer), which leads to the situation described in alternative (3) above, and the potentially more substantial types of consistency shortfalls. PNUTS chooses option (2)(b)(ii) above, for excellent latency at reduced consistency. HBase chooses (2)(a) within a cluster, but gives up consistency for lower latency for replication across different clusters (using option (2)(b)).
In conclusion, there are two major reasons to reduce consistency in modern distributed database systems, and only one of them is CAP. Ignoring the consistency-latency tradeoff of replicated systems is a great oversight, since it is present at all times during system operation, whereas CAP is only relevant in the (arguably) rare case of a network partition. In fact, the consistency-latency tradeoff is potentially more significant than CAP, since it has a more direct effect on the baseline operations of modern distributed database systems.

Original Page: http://dbmsmusings.blogspot.com/2011/12/replication-and-latency-consistency.html