Friday 25 November 2011

Paper: Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS

Teams from Princeton and CMU are working together to solve one of the most difficult problems in the repertoire: scalable geo-distributed data stores. Major companies like Google and Facebook have been working on multiple datacenter database functionality for some time, but there's still a general lack of available systems that work for complex data scenarios.
The ideas in this paper--Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS--are different. It's not another eventually consistent system, or a traditional transaction oriented system, or a replication based system, or a system that punts on the issue. It's something new, a causally consistent system that achieves the ALPS system properties. Move over CAP, NoSQL, etc, we have another acronym: ALPS - Available (operations always complete successfully), Low-latency (operations complete quickly, in single digit milliseconds), Partition-tolerant (operates despite network partitions), and Scalable (just add more servers to add more capacity). ALPS is the recipe for an always-on data store: operations always complete, they are always successful, and they are always fast.
ALPS sounds great, but we want more, we want consistency guarantees as well. Fast and wrong is no way to go through life. Most current systems achieve low latency by avoiding synchronous operation across the WAN, directing reads and writes to a local datacenter, and then using eventual consistency to maintain order. Causal consistency promises another way.
Intrigued? Let's learn more about causal consistency and how it might help us build bigger and better distributed systems.

In a talk on COPS, Wyatt Lloyd defines consistency as a restriction on the ordering and timing of operations. We want the strongest consistency guarantees possible because they make the programmer's life a lot easier. Strong consistency defines a total ordering on all operations, and what you write is what you read, regardless of location. This is called linearizability, and it is impossible to achieve together with the ALPS properties. Remember your CAP. Sequential consistency still guarantees a total ordering on operations, but the ordering is not required to match real-time. Sequential consistency and low latency are impossible to achieve together on a WAN. Eventual consistency gives you an ALPS system (Cassandra), but it is a weak property that doesn't give any ordering guarantees at all.
There's a general idea that if you want an always-on scalable datastore you have to sacrifice consistency and settle for eventual consistency. There's another form of consistency, causal consistency, that sits between eventual consistency and the stronger forms of consistency. Causal consistency gives a partial order over operations, so clients see operations in an order governed by causality. Causal consistency is a stronger guarantee than eventual consistency, yet it is scalable and maintains the ALPS properties. It's a sweet spot for providing ALPS features and strongish consistency guarantees.
A key property of causal consistency to keep in mind is that it guarantees you will be working on consistent values, but it doesn't guarantee you will be working on the most recent values. That's a property of strong consistency. So under a network partition your operations won't match those in other datacenters until they are made eventually consistent.
The driver for causal consistency is low latency. They want operations to always be fast. Other approaches emphasize avoiding write-write conflicts via transactions, and for them latency isn't as important. COPS will never do a slow 2PC across a WAN.
Here's a money quote describing causal consistency in more detail:
The central approach in COPS involves explicitly tracking and enforcing causal dependencies between updates.  For instance, if you upload a photo and add it to an album, the album update “depends on” the photo addition, and should only be applied after it.  Writes in COPS are accepted by a local datacenter that then propagates them to other, remote, datacenters.  These remote datacenters check that all dependencies are satisfied by querying other nodes in the cluster before applying writes.  This approach differs from traditional causal systems that exchange update logs between replicas.  In particular, the COPS approach avoids any single serialization point to collect, transmit, merge, or apply logs.  Avoiding single serialization points is a major factor in enabling COPS to scale to large cluster sizes.
Even though COPS provides a causal+ consistent data store, it is impossible for clients to obtain a consistent view of multiple keys by issuing single-key gets.  (This problem exists even in linearizable systems.)  In COPS-GT, we enable clients to issue get transactions that return a set of consistent values.  Our get transaction algorithm is non-blocking, lock-free, and takes at most two rounds of inter-datacenter queries.  It does, however, require COPS-GT to store and propagate more metadata than normal COPS. Our evaluation shows that COPS completes operations in less than a millisecond, provides throughput similar to previous systems when using one server per cluster, and scales well as we increase the number of servers in each cluster. It also shows that COPS-GT provides similar latency, throughput, and scaling to COPS for common workloads.
Michael Freedman gives an example involving three operations on a social networking site:
  1. Remove boss from friends group.
  2. Post looking for a new job.
  3. A friend reads the post.
Causality is given by the following rules:
  1. Thread of execution rule. Operations done by the same thread of execution are ordered by causality: the post (2) happens after the unfriending (1).
  2. Gets-From rule. An operation that reads a value is ordered after the write that produced it: the friend's read (3) happens after the post (2).
  3. Transitive closure rule. Causality is transitive: the unfriending (1) is before the friend's read of the post (3).
The result is that operations happen in the order you expect. The post for a new job happens after the boss is removed from the friends group. In another example, a photo upload followed by adding a reference to that photo in an album will always happen in that order, so you don't have to worry about dangling references. This makes the job of the programmer a lot easier, which is why we like transactional systems so much: the expected happens.
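To make the rules concrete, here is a minimal sketch (my own illustration, not code from the paper) that derives the happens-before relation for the three operations above from the thread-of-execution, gets-from, and transitivity rules:

    from itertools import product

    # Hypothetical event records for the example above (names are mine).
    events = {
        "unfriend_boss": {"thread": "alice", "reads_from": None},
        "post_job_hunt": {"thread": "alice", "reads_from": None},
        "friend_reads":  {"thread": "bob",   "reads_from": "post_job_hunt"},
    }
    program_order = [("unfriend_boss", "post_job_hunt")]   # issued in this order by alice

    def happens_before():
        """Derive the causal partial order from the three rules."""
        hb = set(program_order)                                    # rule 1: thread of execution
        hb |= {(e["reads_from"], name)                             # rule 2: gets-from
               for name, e in events.items() if e["reads_from"]}
        changed = True
        while changed:                                             # rule 3: transitive closure
            changed = False
            for (a, b), (c, d) in product(list(hb), list(hb)):
                if b == c and (a, d) not in hb:
                    hb.add((a, d))
                    changed = True
        return hb

    # ('unfriend_boss', 'friend_reads') is in the result: the unfriending is
    # causally before the friend's read, so it must be applied first everywhere.
    print(('unfriend_boss', 'friend_reads') in happens_before())   # True

Causal consistency then just means every datacenter applies writes in some order consistent with this partial order.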
How does causality handle conflicting updates? Say two writes in different datacenters happen to the same key at the same time. This is unordered by causality because the operations do not occur in the same thread of execution. What we want is for all datacenters to agree on a value. By default the rule is that the last writer wins. You can also supply application specific handlers so that all datacenters converge on the same value. This sounds a lot like eventual consistency to me. They call this combination of causal consistency and convergent conflict handling causal+ consistency.
Their innovation is to create a causal+ consistent system that is also scalable. Previous systems used log shipping, which serializes at a centralized point. Instead of logs they use dependency metadata to capture causality. They replace the single serialization point with distributed verification. They don't expose the value of a replicated put operation until they confirm that all the causally previous operations have shown up in the datacenter.
COPS is their system implementing causal+ consistency:
  • Organized as a geo-replicated system with a cluster of nodes in each datacenter.
  • Each cluster stores all data.
  • Scale-out architecture with many nodes inside each cluster.
  • Consistent hashing to partition keys across nodes.
  • Assumes partitions do not occur within a datacenter, so strongly consistent replication is used within a datacenter. They use chain replication, though Paxos could be used.
  • Between datacenters where latency is high, data is replicated in a causal+ consistent manner.
  • They use a thick client library. It tracks causality and mediates local cluster access. 
  • A value is written immediately to the local datacenter and immediately queued up for asynchronous replication to the other datacenters.
  • Clients maintain dependency information, which includes a version number uniquely identifying each value. This information is inserted into a dependency list. Any future operations are causally after the current operation. This information is used to resolve dependencies in the system (a minimal sketch of this bookkeeping appears after this list).
    • Why not just use vector clocks? Because they've targeted very large distributed systems where the vector clock state would get out of control.
  • Get transactions give a consistent view of multiple keys with low latency. They only have read transactions. Write conflicts are handled by last writer wins or application specific reconciliation.
  • They've found their system gives high throughput and near linear scalability while providing causal+ consistency.
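To make the client library and replication bullets concrete, here is a minimal, hypothetical sketch (the names and structures are mine, not the COPS API) of a client context that records the version of everything it reads or writes as a dependency, and a remote datacenter that delays applying a replicated put until those dependencies are locally visible:

    import time

    class CopsClient:
        """Toy client library: tracks causal dependencies for one client context."""
        def __init__(self, local_store):
            self.store = local_store
            self.deps = {}                      # key -> version this context depends on

        def get(self, key):
            value, version = self.store.get(key)
            self.deps[key] = version            # gets-from rule: we now depend on this write
            return value

        def put(self, key, value):
            version = self.store.put_local(key, value, dict(self.deps))
            self.deps = {key: version}          # the new write subsumes its older dependencies
            return version

    class Datacenter:
        """Toy datacenter: applies remote puts only after their dependencies are visible."""
        def __init__(self):
            self.data = {}                      # key -> (value, version)
            self.clock = 0

        def get(self, key):
            return self.data.get(key, (None, 0))

        def put_local(self, key, value, deps):
            self.clock += 1
            version = self.clock
            self.data[key] = (value, version)
            # in real COPS the write plus its deps is queued here for async replication
            return version

        def apply_replicated(self, key, value, version, deps):
            # dependency check replaces the single serialization point of log shipping
            while any(self.data.get(k, (None, 0))[1] < v for k, v in deps.items()):
                time.sleep(0.01)                # wait until causally earlier writes arrive
            if version > self.data.get(key, (None, 0))[1]:
                self.data[key] = (value, version)   # last-writer-wins convergent handling

The dependency check is what replaces the single log serialization point: each remote node verifies its own dependencies independently, which is what lets the design scale out within a cluster.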
The details of how all this works quickly spiral out of control. Best to watch the video and read the paper for the details. The questioning at the end of the video is contentious and entertaining. I'd like to see that part go on longer as everyone seems to have their own take on what works best. It's pretty clear from the questions that there's no one best way to build these systems. You pick what's important to you and create a solution that gives you that. You can't have it all, it seems, but what you can have is the question.
Will we see a lot of COPS clones immediately spring up like we saw when the Dynamo paper was published? I don't know. Eventually consistent systems like Cassandra get you most of what COPS has without the risk. Though COPS has a lot of good features. Causal ordering is a beautiful property for a programmer, as are the ALPS properties in general. The emphasis on low-latency is a winner too. Thick client libraries are a minus as they reduce adoption rates. Complex client libraries are very difficult to port to other languages. Not being able to deal with write-write conflicts in an equally programmer friendly manner while maintaining scalability for large systems is unfortunate, but is just part of the reality of a CAP world. You could say using a strongly consistent model in each datacenter could limit the potential size of your system. But, all together it's interesting and different. Low-latency, geo-distribution, combined with a more intuitive consistency model could be big drivers for adoption for developers, and it's developers that matter in these sorts of things.

Tuesday 15 November 2011

Using Gossip Protocols For Failure Detection, Monitoring, Messaging And Other Good Things

When building a system on top of a set of wildly uncooperative and unruly computers you have knowledge problems: knowing when other nodes are dead; knowing when nodes become alive; getting information about other nodes so you can make local decisions, like knowing which node should handle a request based on a scheme for assigning nodes to a certain range of users; learning about new configuration data; agreeing on data values; and so on.
How do you solve these problems?
A common centralized approach is to use a database that all nodes query for information. This has obvious availability and performance issues for large distributed clusters. Another approach is to use Paxos, a protocol for solving consensus in a network, to maintain strict consistency requirements for small groups of unreliable processes. That's not practical when a larger number of nodes is involved.
So what's the super cool decentralized way to bring order to large clusters?
Gossip protocols, which maintain relaxed consistency requirements amongst a very large group of nodes. A gossip protocol is simple in concept. Each node sends out some data to a set of other nodes. Data propagates through the system node by node like a virus. Eventually data propagates to every node in the system. It's a way for nodes to build a global map from limited local interactions.
As you might imagine there are all sorts of subtleties involved, but at its core it's a simple and robust system. A node only has to send to a subset of other nodes. That's it.
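As a rough illustration (my own sketch, not from any particular system), each node periodically picks a few random peers and pushes its current view to them; views merge by keeping the freshest entry per node:

    import random

    FANOUT = 3          # peers contacted per round

    def merge(target_view, source_view):
        """Keep the entry with the highest heartbeat (freshest) for each node id."""
        for node_id, (heartbeat, payload) in source_view.items():
            if node_id not in target_view or target_view[node_id][0] < heartbeat:
                target_view[node_id] = (heartbeat, payload)

    def gossip_round(nodes):
        """One synchronous round of push gossip over an in-memory 'cluster'."""
        for node in nodes:
            peers = random.sample([n for n in nodes if n is not node],
                                  min(FANOUT, len(nodes) - 1))
            for peer in peers:
                merge(peer.view, node.view)

    class Node:
        def __init__(self, node_id):
            self.id = node_id
            self.heartbeat = 0
            self.view = {node_id: (0, {})}      # node_id -> (heartbeat, payload)

        def tick(self, payload):
            self.heartbeat += 1
            self.view[self.id] = (self.heartbeat, payload)

    nodes = [Node(f"n{i}") for i in range(5)]
    nodes[0].tick({"load": 0.7})
    for _ in range(3):
        gossip_round(nodes)
    print(all("n0" in n.view for n in nodes))   # True with overwhelming probability

After a logarithmic number of rounds every node's view contains an entry for every other node, which is the global map built from limited local interactions described above.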
Cassandra, for example, uses what's called an anti-entropy version of the gossip protocol for repairing unread data using Merkle Trees. Riak uses a gossip protocol to share and communicate ring state and bucket properties around the cluster.
For a detailed look at using gossip protocols take a look at GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems by Rajagopal Subramaniyan, Pirabhu Raman, Alan George, and Matthew Radlinski. I really like this paper because of how marvelously well written and clear it is on how to use gossip protocols to detect node failures and load balance based on data sampled from other nodes. Details are explained clearly and it dares to cover a variety of possibly useful topics.
From the abstract:
Gossip protocols have proven to be effective means by which failures can be detected in large, distributed systems in an asynchronous manner without the limitations associated with reliable multicasting for group communications. In this paper, we discuss the development and features of a Gossip-Enabled Monitoring Service (GEMS), a highly responsive and scalable resource monitoring service, to monitor health and performance information in heterogeneous distributed systems. GEMS has many novel and essential features such as detection of network partitions and dynamic insertion of new nodes into the service. Easily extensible, GEMS also incorporates facilities for distributing arbitrary system and application-specific data. We present experiments and analytical projections demonstrating scalability, fast response times and low resource utilization requirements, making GEMS a potent solution for resource monitoring in distributed computing.

Failure Detection

The failure detection part of the paper is good and makes sense. By combining the reachability data from a lot of different nodes you can quickly determine when a node is down. When a node is down, for example, there's no need to attempt to write to that node, saving queue space, CPU, and bandwidth.
In a distributed system you need at least two independent sources of information to mark a node down. It's not enough to say that because your node can't contact another node, the other node is down. It's quite possible that your node is broken and the other node is fine. But if other nodes in the system also see that the other node is dead, then you can conclude with some confidence that the node really is dead. Many complex, hard to debug bugs are hidden here. How do you know what other nodes are seeing? Via a gossip protocol exchanging this kind of reachability data.
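A minimal sketch of that idea (the threshold and structure are my own simplification, not GEMS's exact algorithm): each node gossips a reachability set of who it can currently contact, and a node is only declared dead once several independent observers agree:

    SUSPECTS_REQUIRED = 2   # independent observers needed before declaring a node dead

    def is_node_dead(target, reachability):
        """reachability: observer_id -> set of node ids that observer can currently reach.

        A node is declared dead only if at least SUSPECTS_REQUIRED observers
        (other than the target itself) report that they cannot reach it."""
        suspects = [obs for obs, reachable in reachability.items()
                    if obs != target and target not in reachable]
        return len(suspects) >= SUSPECTS_REQUIRED

    # Example: node "c" is unreachable from both "a" and "b", so it is marked dead.
    views = {
        "a": {"a", "b"},
        "b": {"a", "b"},
        "c": {"c"},
    }
    print(is_node_dead("c", views))   # True
    print(is_node_dead("b", views))   # False ("c" alone can't condemn "b")

The real GEMS algorithm is more elaborate, but the core idea is the same: never condemn a node on one observer's say-so.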
In embedded systems the backplane often has traces between nodes so a local system can get an independent source of confirmation that a given node is dead, or alive, or transitioning between the two states. If the datacenter is really the computer, it would be nice to see datacenters step up and implement higher level services like node liveness and time syncing so every application doesn't have to worry about these issues, again.
The paper covers the obvious issue of scaling as the number of nodes increases by dividing nodes into groups and introducing a hierarchy of layers at which node information is aggregated. They found running the gossip protocol used less than 60 Kbps of bandwidth and less than 2% of CPU for a system of 128 nodes.
One thing I would add is the communication subsystem can also contribute what it learns about reachability, we don't just have to rely on a gossip heartbeat. If the communication layer can't reach a node that fact can be noted in a reachability table. This keeps data as up to date as possible.

Using Gossip As A Form Of Messaging

In addition to failure detection, the paper shows how to transmit node and subsystem properties between nodes. This is a great extension and is a far more robust mechanism than individual modules using TCP connections to exchange data and command and control. We want to abstract communication out of application level code and this type of approach accomplishes that.
It seems somewhat obvious that you would transmit node properties to other nodes. Stats like load average, free memory, etc. would allow a local node to decide where to send work, for example. If a node is idle send it work (as long as everyone doesn't send it work at the same time). This local decision making angle is the key to scale. There's no centralized controller. Local nodes make local decisions based on local data. This can scale as far as the gossip protocol can scale.
What goes to another level is that they use an architecture I've used on several products, sending subsystem information so software modules on a node can send information to other modules on other nodes. For example, queue depth for a module could be sent out so other modules could gauge the work load. Alarm information could be sent out so other entities know the status of modules they are dependent on. Key information like configuration changes can be passed on. Even requests and response can be sent through this mechanism. At an architecture level this allows the aggregation of updates (from all sources on a node) so they can be sent in big blocks through the system instead of small messages, which is a big win.
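As a sketch of this pattern (my own illustration, reusing the node_id -> (heartbeat, payload) view layout from the gossip sketch above), per-module stats ride along on the regular gossip payload, and each node makes a purely local decision about where to send work:

    def piggyback(view, node_id, heartbeat, load, module_stats):
        """Fold per-module stats (queue depth, alarms, config version, ...) into this
        node's own entry so they spread with the normal gossip heartbeat."""
        view[node_id] = (heartbeat, {"load": load, "modules": module_stats})

    def pick_least_loaded(view):
        """Purely local decision: choose the node with the lowest gossiped load."""
        return min(view.items(), key=lambda kv: kv[1][1].get("load", float("inf")))[0]

    # Example: two nodes advertising their load and an http module's queue depth.
    view = {}
    piggyback(view, "n1", 7, 0.9, {"http": {"queue_depth": 40}})
    piggyback(view, "n2", 5, 0.2, {"http": {"queue_depth": 3}})
    print(pick_least_loaded(view))   # n2

Aggregating all of a node's module updates into its single gossip payload is also what enables the big-blocks-instead-of-small-messages win mentioned above.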
This approach can be combined with a publish/subscribe topic registration system to reduce useless communication between nodes.
Another advantage of this approach is data could flow directly into your monitoring system rather than having a completely separate monitoring subsystem bolted on.
In the meat world we are warned against gossiping, it's a sin, it can ruin lives, it can ruin your reputation, etc., but in software, gossiping is a powerful tool in your distributed toolbox. So go forth and gossip.


Tuesday 25 October 2011

VoltDB Decapitates Six SQL Urban Myths And Delivers Internet Scale OLTP In The Process

What do you get when you take a SQL database and start a new implementation from scratch, taking advantage of the latest research and modern hardware? Mike Stonebraker, the sword wielding Johnny Appleseed of the database world, hopes you get something like his new database, VoltDB: a pure SQL, pure ACID, pure OLTP, shared nothing, sharded, scalable, lockless, open source, in-memory DBMS, purpose-built for running hundreds of thousands of transactions a second. VoltDB claims to be 100 times faster than MySQL, up to 13 times faster than Cassandra, and 45 times faster than Oracle, with near-linear scaling.
Will VoltDB kill off the new NoSQL upstarts? Will VoltDB cause a mass extinction of ancient databases? Probably no and no to both questions, but it's a product with a definite point-of-view and is worth a look as the transaction component in your system. But will it be right for you? Let's see...
I first heard the details about VoltDB at Gluecon, where Mr. Stonebraker presented his highly entertaining Urban Myths About SQL (slides) talk. The hallways were buzzing long afterwards debating his in-your-face challenge of the existing order. With a refreshing take-no-prisoners style he lambastes existing SQL products as not good enough, saying they should either get better or die. NoSQL products fare no better, as they simply aren't necessary: their whole existence is based on a perception of the weakness of current SQL implementations rather than what SQL could be, if properly implemented.
As an argument against NoSQL, VoltDB simply asks: if you can get a relational database with all the scalable ACIDy goodness, why would you ever stoop to using a NoSQL database that might only ever be eventually reliable?
The attack against existing relational databases systems is to position them as legacy systems that suffer from archaic architectures that are slow by design. They have to deal with disks, threads, and other performance killing constructs that must not be so much evolved as transcended. You see, VoltDB isn't just competing against NoSQL, it's aiming squarely at existing relational database vendors by using the patented technological leap play.
It's a bold two-pronged strategy, but will the compromises that are part of the reconceptualization of SQL engine architectures prove too limiting for the majority?

VoltDB's Architecture

The bulk of the rest of this article is about the SQL Myths, but I think touching a bit on Volt's architecture before we address the myths will help frame the discussion a little better.
John Hugg, from VoltDB Engineering, says:
VoltDB is designed to solve OLTP at internet scale. If you don't need the scale of VoltDB, then of course you're going to be much happier with a general system like Postgres that offers so many features and is compatible with so much.
Some other quotes indicating the spirit behind VoltDB:
VoltDB is designed to make difficult or impossible problems manageable.
And:
When we set out to build VoltDB, we figured it wasn't worth the tradeoffs unless it's MUCH faster. So it is.
John is not kidding. What matters to VoltDB is: speed at scale, speed at scale, speed at scale, SQL, and ACID. If that matches your priorities then you'll probably be happy. Otherwise, as you'll see, everything is sacrificed for speed at scale, and what is sacrificed is often ease of use, generality, and error checking. It's likely we'll see ease of use improve over time, but for now it looks like rough going, unless, of course, you are going for speed at scale.
Some of the more interesting features are:
  • Main-memory storage. VoltDB stores all its data in RAM. This means there are no buffer pools to manage, so that source of overhead is removed, and there are no blocking disk stalls.
  • Run transactions to completion, single threaded, in timestamp order. Based on the model that 200 record updates is a hefty transaction, you might as well run them to completion. By single threading, all locking overhead is removed. On multi-core systems they allocate chunks of memory to each CPU and run each CPU single threaded. (A toy sketch of this execution model appears after this list.)
  • Replicas. Persistence is guaranteed by having the data reside in multiple main memories. There are no logs, so there are no disks, which removes the disk overhead. For high availability an active-active architecture is used. Each transaction is run twice, once on each node in the same time-stamp order, which means the replicas are ACID consistent. Data can be asynchronously shipped to disk for a 5% performance hit. VoltDB replication is not a master/slave or primary/backup replication system. Each replica is a first class, fully capable instance of a partition.
  • Tables are partitioned across multiple servers. Partitions are defined in a project XML configuration file that defines the partition keys. Clients can connect through any node. Partitioning is automatic, but you have to decide on the sharding key. They are working on an automated system to give advice on which key to use. If new nodes are added then the data must be reloaded to cause the new nodes to be used; the data is not automatically distributed to the new nodes. Not all tables need to be partitioned. Small, mostly read-only tables can be replicated across all of the partitions of a VoltDB database.
  • Stored procedures, written in Java, are the unit of transaction. Only data changed within a single invocation of a stored procedure is in a transaction; transactions can't span multiple rounds of communication with a client. Also, from the same source as the previous link: the vast majority (almost 100%) of your stored procedure invocations must be single partition for VoltDB to be useful to you. If, for example, you partitioned by graph node, updating multiple nodes in a single transaction/procedure will not be a single partition transaction. These and other restrictions show the tradeoff between speed and generality.
  • A limited subset of SQL '99 is supported. DDL operations like ALTER and DROP aren't supported. Operations having to do with users, groups and security have been moved into XML configuration files. Updating table structure on the fly is convenient, but it's not fast, so it's out. You are also discouraged from doing SUM operations because it would take a long time and block other transactions. Single threading means you must quantize your work into small enough chunks that don't stall the work pipeline. The goal is to have transactions run in under 50 milliseconds. This is all done for speed.
  • Design a schema and workflow to use single-sited procedures. Data for a table is stored in a partition that is split onto different nodes. Each set of data on a node is called a slice. When a query can run on a single node it is said to be single-sited. Performance is clearly best when a query can run on just one node against a limited set of data.
  • Challenging operations model. Changing the database schema or reconfiguring the cluster hardware requires first saving and shutting down the database. An exception is stored procedures, which can be updated on the fly. In general, choosing speed as the primary design point has made the development and deployment process complicated and limiting. VoltDB, for example, does not support bringing a node back into the cluster while the database is running. All clients must be stopped, the database must be snapshotted, the database must be restarted in a special mode, the data is reloaded, and then clients can be restarted. See Using VoltDB for more details.
  • No WAN support. In the case of a network partition VoltDB chooses consistency over availability, so you will see a hiccup until connectivity can be restored. Out of all the possible failures, Mr. Stonebraker argues, network partitioning is one of the least likely failures, especially compared to programmer error, so choosing strong consistency over availability is the right engineering call. Future versions of VoltDB will do more to address this single-data-center catastrophe scenario.
  • OLAP is purposefully kept separate. VoltDB is only for OLTP. It's not for reporting or OLAP because those uses require locks to be taken, which destroys performance. Every update is spooled to an (optional) companion warehouse system that is a second or two behind the main transaction system (yet is perfectly consistent). The companion system could be something like Vertica, an analytics RDBMS also built by Mr. Stonebraker. The justification for this split of responsibilities is that one size does not fit all. You should run transactions on something that is good at transactions and run reporting on something that's good at reporting. A specialized transaction architecture will run circles around (50 times to 100 times faster) a one size fits all solution.
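To illustrate the single threaded, run-to-completion execution model (a toy sketch, not VoltDB's actual engine), here is a partition executor that applies whole transactions one at a time, in timestamp order, against an in-memory table, so no locks or latches are ever needed:

    import heapq
    import itertools

    class PartitionExecutor:
        """Toy single-threaded executor: one partition, all data in memory,
        transactions run to completion in timestamp order."""
        def __init__(self):
            self.rows = {}          # primary key -> row dict, all in RAM
            self.queue = []         # min-heap of (timestamp, seq, transaction)
            self._seq = itertools.count()   # tie-breaker for equal timestamps

        def submit(self, timestamp, transaction):
            heapq.heappush(self.queue, (timestamp, next(self._seq), transaction))

        def run(self):
            while self.queue:
                ts, _, txn = heapq.heappop(self.queue)
                txn(self.rows)      # no locks, no latches: nothing else is running

    # A 'stored procedure' here is just a function over the in-memory rows.
    def add_credit(account_id, amount):
        def txn(rows):
            row = rows.setdefault(account_id, {"balance": 0})
            row["balance"] += amount
        return txn

    executor = PartitionExecutor()
    executor.submit(2, add_credit("acct-7", 50))
    executor.submit(1, add_credit("acct-7", 100))
    executor.run()
    print(executor.rows["acct-7"]["balance"])   # 150, applied in timestamp order

Feeding the identical timestamp-ordered stream of procedures to each replica is also how the active-active replicas stay consistent without logs.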
VoltDB is different because it has consciously been architected to remove what research shows are the four common sources of overhead in database management systems: logging (19%), latching (19%), locking (17%), and B-tree and buffer management operations (35%). By removing all four sources of overhead VoltDB can be really fast while still being ACID. How fast?
VoltDB claims to be 100 times faster than MySQL, up to 13 times faster than Cassandra, and 45 times faster than Oracle, with near-linear scaling. Though I think linear scaling only applies when you are not using distributed transactions. Two-phase transactions with an in-memory database will be relatively fast, but they will still be slow given the protocol overhead.
Keep in mind VoltDB is in-memory, so it should be fast, that should be a given. But VoltDB says they have gone beyond what other in-memory databases have done, they haven't just improved buffer management. By removing locks, latching, and threading overhead it's that much faster than other in-memory databases. You could argue that it's a waste of RAM, that only hot data should be kept in RAM, but the contention is that RAM can hold the entire data set, so there's no reason to compromise anymore.
The performance comparison against databases like Cassandra is somewhat of a strawman as they are designed for a much different purpose. Cassandra can store petabytes of data, across hundreds of nodes, across multiple data centers, and new nodes can be added at will. Operationally there's no comparison. Though I realize the purpose of the benchmarks is to show SQL is not natively slow, can work well for key-value usage patterns, and compares favorably with other industry leaders.
I really like the parallelism of the origin of relational theory with the origin of VoltDB's architecture. Relational theory was invented to remove update anomalies that occur when storing duplicate data. When data is duplicated, either within a record or in different tables, then it's easy to cause inconsistencies when performing updates, deletes, and adds. The normalization process makes it much harder to have inconsistencies because facts are stored once and only once. It may be a stretch, but I think of the process of creating a SQL engine architecture based on removing performance anomalies as fascinatingly analogous.

The Six SQL Urban Myths

Here are the six myths that Mr. Stonebraker says NoSQL advocates incorrectly perpetuate:
• Myth #1: SQL is too slow, so use a lower level interface
• Myth #2: I like a K-V interface, so SQL is a non-starter
• Myth #3: SQL systems don’t scale
• Myth #4: There are no open source, scalable SQL engines
• Myth #5: ACID is too slow, so avoid using it
• Myth #6: in CAP, choose AP over CA

Myth #1A: SQL Is Too Slow Because Of Heavy Interfaces Like ODBC/JDBC

Problem:
SQL is compiled into an optimized intermediate format in a way similar to how languages like C are compiled into assembly. SQL is not slow. What is slow are heavy interfaces like ODBC/JDBC that cause too many round trips to the database. Performance is determined by this interface.
VoltDB's Solution:
Only stored procedures are supported. A main advantage stored procedures have over chatty ODBC/JDBC protocols is one message is sent to the database and one reply is sent back. All the computation is in the database. Much more efficient than ODBC/JDBC. To go even faster you can batch stored procedure calls together. Stored procedures are wildly faster than using ODBC/JDBC.
The execution flow is something like:
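A rough sketch, using a hypothetical client object rather than VoltDB's actual client API:

    # Chatty ODBC/JDBC style: one logical operation, many network round trips.
    #   begin(); SELECT balance ...; UPDATE ...; UPDATE ...; commit()
    #
    # Stored procedure style: one message out (procedure name plus parameters),
    # all SQL runs inside the database engine, one reply comes back.
    def transfer_funds(client, src_acct, dst_acct, amount):
        # hypothetical call; the real client method name may differ
        return client.call("TransferFunds", src_acct, dst_acct, amount)

The win is simply fewer round trips: the work per transaction is unchanged, but the client/server chatter collapses into a single request and reply.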
Discussion:
This is a bit of a strawman argument in that I've never heard anyone seriously suggest SQL itself as a language is slow.
Using stored procedures was at one time the canonical relational database architecture. For every operation a stored procedure was created that executed all the required data manipulation and a result was returned. It's perfectly right to say this is the most efficient path, given certain conditions, but there are many problems:
  1. Stored procedure languages suck. They are difficult to program, ugly to use, and nearly impossible to debug. So programmers escape to the tool chains they know and love, leaving the database to deal with data, not logic. I understand Java is the stored procedure language for VoltDB. Depending on your allegiances that may be the worst or best thing in the world. The more overarching point is that you have no choice, but that may be the price of performance at scale.
  2. Putting logic into the database makes the database an application server, a function for which they are ill equipped. Let's say during a stored procedure you need to make a REST call to get a discount rate, for example, this involves blocking, IO, threading and all the usual backend server issues that databases don't know how to do well. VoltDB gets around this issue by simply not allowing you to do this.
  3. Once it can't scale you are dead, dead, dead. On a few projects I've worked on we've used the stored procedure approach. It works fine until it doesn't. At a certain load the database just dies and as a system you are stuck. You can't make it perform better, so you are forced to hack away until all the benefits of using the database are gone and you are left with an expensive and brittle albatross at your system's core. That's why projects have learned not to trust the success of their project to how good of an application server your database can be. Instead, they separate out logic from data and let each scale independently. This is a lower risk approach. VoltDB counters this objection by allowing new nodes to be added into the system, by limiting the work you can do in the server, by using RAM, and by sharding so you can control how much work is done on each node. The problem is that, operationally, the process is so onerous.
An interesting fact about stored procedures is that they can take anywhere from half a second to several seconds to prepare a statement in VoltDB 1.0.01, so ad-hoc queries are not practical. VoltDB also does not allow SQL to support getting the schema of tables, listing tables or altering tables. DDL operations aren't considered part of a core OLTP database.

Myth #1B: SQL Is Too Slow Because SQL Engine Implementations Have Too Much Overhead

Problem:
Traditional relational databases are using 30 year old architectures that make them slow by design. Ninety percent of all CPU cycles in your typical OLTP database go into non-productive work like: managing disk buffer pools, locking, crash recovery, multi-threading. To go faster you need to get rid of the overhead. Traditional relational databases do not do this, they are overhead rich and slow, but this is not the fault of SQL, it's the implementations that are faulty.
VoltDB's Solution:
Remove the overhead. VoltDB has made rather radical architectural choices to remove the bottlenecks. The results according to their performance test for single node performance:
• MySQL: X
• A very popular RDBMS elephant: 2.5 * X
• VoltDB: 100 X
VoltDB is 100 times faster than MySQL in their tests.
Discussion:
Seems spot on to me.

Myth #2: I Like A K-V Interface, So SQL Is A Non-Starter

Problem:
Programmers like the convenience that key-value interfaces provide. Put a value by key and get a value by key. Simple. SQL doesn't provide a key-value interface so programmers won't use SQL.
VoltDB's Solution:
Create a thin get/put layer on top of SQL using stored procedures. They are easy to support on top of a SQL engine. Using gets and puts, their tests show that VoltDB is 15 times faster than MySQL and memcached (a sketch of such a facade appears after the numbers below).
  • MySQL/Memcached: X
  • Cassandra: 7*X
  • VoltDB: 15 * X
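A thin get/put facade over SQL is easy to picture. Here is a minimal sketch using SQLite as a stand-in for the SQL engine (an illustration of the idea, not VoltDB code):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")

    def put(key, value):
        # the whole K-V 'API' is one parameterized upsert...
        conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))
        conn.commit()

    def get(key):
        # ...and one parameterized select
        row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None

    put("user:42", b'{"name": "alice"}')
    print(get("user:42"))

VoltDB's version would be two trivial stored procedures instead of inline SQL, but the shape of the interface is the same.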
Discussion:
Creating a get/put set of stored procedures was at one time standard operating procedure. Any time a table was added, so were new stored procedures. What this means is that every time any attribute or table is changed, the effects ripple up and down the entire stack. This is a maintenance nightmare, which reveals one of the major strengths of NoSQL: schemaless design.
I'm sure developers value the ease of putting a value, but what really matters is what that implies: you aren't tied to a rigid schema that causes nightmares every time the data model changes. Should adding a parameter really cause every interface to break? A simple stored procedure facade completely ignores this aspect of the key-value approach.

Myth #3: SQL Systems Don’t Scale

Problem:
Some SQL engines don't support multiple nodes so they don't scale: MySQL, Postgres, Oracle, SQLServer. Some modern SQL engines do scale linearly: DB2, Vertica, Asterdata, Greenplum, and EnterpriseDB.
VoltDB's Solution:
Their architecture scales linearly. So far they've tested on 100 cores across 12 nodes and have scaled linearly.
Discussion:
As discussed previously, you must follow very precise rules about how to partition your system, limit how long queries take, accept a less than agile operations model, and accept that replication equals durability. Speed and scale have their costs; if you are willing to pay, VoltDB will probably deliver.

Myth #4: There Are No Open Source, Scalable SQL Engines

VoltDB is open source. In fact, Mr. Stonebraker believes all future databases will be open source.

Myth #5: ACID Is Too Slow, So Avoid Using It

Problem:
Transactions are said to be too expensive, which NoSQL vendors use as an excuse not to provide real ACID transactions. Some applications require ACID semantics. If the database doesn't provide this feature then it's pushed into user code, which is the worst of all worlds. Databases live a long time, so if you need ACID later and your database doesn't support it then you are in trouble.
Mr. Stonebraker contends 99% of all OLTP databases are 1TB or less, especially if you factor out static content like pictures. This means transactional databases can or will fit in main memory. Facebook is not the common case.
Now combine this with the findings in OLTP Through the Looking Glass: current transactional databases do about 12% useful work, which is the actual cost of doing the retrieves and updates; the rest is spent on logging (19%), latching (19%), locking (17%), and B-tree and buffer management operations (35%). Multi-threading overhead on shared data structures is surprisingly high.
The abstract from the paper:
Online Transaction Processing (OLTP) databases include a suite of features - disk-resident B-trees and heap files, locking-based concurrency control, support for multi-threading - that were optimized for computer technology of the late 1970's. Advances in modern processors, memories, and networks mean that today's computers are vastly different from those of 30 years ago, such that many OLTP databases will now fit in main memory, and most OLTP transactions can be processed in milliseconds or less. Yet database architecture has changed little.
Based on this observation, we look at some interesting variants of conventional database systems that one might build that exploit recent hardware trends, and speculate on their performance through a detailed instruction-level breakdown of the major components involved in a transaction processing database system (Shore) running a subset of TPC-C. Rather than simply profiling Shore, we progressively modified it so that after every feature removal or optimization, we had a (faster) working system that fully ran our workload. Overall, we identify overheads and optimizations that explain a total difference of about a factor of 20x in raw performance. We also show that there is no single "high pole in the tent" in modern (memory resident) database systems, but that substantial time is spent in logging, latching, locking, B-tree, and buffer management operations.
There's not one thing you can do to improve performance as the cost is spread around evenly. Better B-Trees, for example, don't really help. You are only optimizing the 12% part of the overhead which gets you nowhere.
VoltDB's Solution:
VoltDB removes all four sources of overhead. Getting rid of ACID only gets you about two times faster; if you want to go ten times faster you need to get rid of the buffer pool overhead and the multithreading overhead too. This is a far more radical architecture change.
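The arithmetic behind those multipliers is easy to check against the Looking Glass numbers quoted above (a back-of-the-envelope calculation of mine, not from the talk):

    # Approximate overhead shares quoted above (OLTP Through the Looking Glass):
    logging, latching, locking, buffers = 0.19, 0.19, 0.17, 0.35

    def speedup(removed_fraction):
        """Eliminating a fraction of the work speeds you up by 1 / (1 - fraction)."""
        return 1.0 / (1.0 - removed_fraction)

    # Counting logging + latching + locking as the ACID-related overhead (an assumption):
    print(round(speedup(logging + latching + locking), 1))            # ~2.2x
    # Removing buffer management as well, i.e. all four sources:
    print(round(speedup(logging + latching + locking + buffers), 1))  # 10.0x

The exact bucketing of which overheads count as ACID is my assumption; the point is just that no single bucket dominates, so only removing all of them gets you the order-of-magnitude win.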
The result of getting rid of all four sources of overhead is that VoltDB supports ACID transaction semantics while retaining performance. Transactions can be used all the time with no penalty. The obvious implication being: so why would you ever not use a fast, scalable relational database that is faster than anything else you have ever used?
Discussion:
I assume, if it were possible, everyone would like ACID transactions. To make it possible VoltDB makes a number of restrictions that make it less than ideal as a general purpose database. If you have lots of data then VoltDB won't work for you, because lots of data still won't fit in RAM. If you want easy operations then VoltDB won't work for you. If you want to use your own language then VoltDB won't work for you. If you want longer lived transactions that integrate a series of related inputs then VoltDB will not work for you. If you want a little OLAP with your OLTP then VoltDB will not work for you. If you want cross data center availability then VoltDB will not work for you.
And VoltDB never said it would do all that. VoltDB promises damn fast and scalable ACID transactions at all costs. But given all the limitations needed to get there, saying NoSQL simply thinks ACID is too expensive, seems mighty unfair.

Myth #6: In CAP, Choose AP Over CA

Problem:
The CAP theorem says partitions happen, so you have to give up either consistency or availability; if you prefer availability then you have to jettison consistency.
Mr. Stonebraker says there a few things to think about when making this tradeoff:
  • Order matters. Not all transactions are commutative. So if you get rid of ACID and you have operations where order matters, then eventual consistency will give you the wrong result. If your transactions are a series of additions and subtractions then the order doesn't matter. But as soon as you throw in a non-commutative operation like multiplication then you will get wrong results (see the sketch after this list). If you decide later that you need ACID type transactions then it's a fork-lift upgrade.
  • Semantic constraints. If you have something like stock for an item and you don't want to ship an item when there is no stock left, then during a partition in an eventually consistent system you will not be able to enforce this, as each side of the partition has no idea what the other is doing. So if you have semantic constraints you need ACID.
  • Errors. Errors are what cause partitions. Humans screwing up are the biggest source of errors. Next are bugs in applications or the database itself. Lastly, network failures cause partitions. Network failures are rare, so designing an eventually consistent system for the rarest failure case seems like a bad engineering decision. It makes more sense to take the hit for the rare case of a network being down, especially given that the number of failures you'll have because of other non-network related problems is so much higher. Sacrificing consistency for availability is a bad engineering tradeoff.
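A tiny illustration of the ordering point (my own example): replicas that apply the same operations in different orders still converge when every operation commutes, but diverge as soon as one doesn't:

    from functools import reduce

    add5      = lambda x: x + 5
    subtract2 = lambda x: x - 2
    double    = lambda x: x * 2      # does not commute with addition

    def apply_in_order(start, ops):
        return reduce(lambda acc, op: op(acc), ops, start)

    # Additions and subtractions: any order gives the same answer.
    print(apply_in_order(10, [add5, subtract2]))   # 13
    print(apply_in_order(10, [subtract2, add5]))   # 13

    # Throw in a multiplication and replicas that saw different orders disagree.
    print(apply_in_order(10, [add5, double]))      # 30
    print(apply_in_order(10, [double, add5]))      # 25

This is why commutativity, or the lack of it, decides whether eventual consistency can quietly give you wrong answers.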
VoltDB's Solution:
On network failures VoltDB proudly chooses consistency over availability. Your database is down until the network partition and the database are repaired.
Discussion:
I agree completely that moving the repair logic to the programmer is a recipe for disaster. Having programmers worry about read repair, vector clocks, the commutativity of transactions, how to design compensatory transactions to make up for previous failed transactions, and the other very careful bits of design, is asking for a very fragile system. ACID transactions are clean and understandable and that's why people like them.
Do you buy the argument that network partitions are rare? The extent to which you buy this argument determines how you feel about VoltDB's design choices. Cloud architectures now assume failure as a starting point. They generally assume nodes within a data center are unreliable, and that the more nodes you have and the more data center boundaries you cross, the more reason there is to expect failure. Amazon, for example, made the decision to go with an eventually consistent architecture by using Dynamo to back their shopping carts. Would they have done that for no reason? With that philosophy in mind, Cassandra has embraced failure at its core. Nodes can be added or taken down at any time, all because failure is assumed from the start. This makes for a very robust system. The price is a lack of distributed ACID transactions.
In contrast, the VoltDB model seems from a different age: a small number of highly tended nodes in a centralized location. In return though you get performance and ACID guarantees. It all depends on where you want to place your partition failure bets.

General Discussion

Our Future is Polyglot
VoltDB is a racehorse. It's a very specialized component that has been bred to do one thing and one thing only: be a very fast and scalable OLTP database. If that's what you need then VoltDB could be your Triple Crown winner.
What VoltDB reinforces for me is that our future is polyglot. Specialized tools for specialized purposes. VoltDB is specialized for OLTP. It's a singletasker that doesn't want to be mediocre at everything, it wants to be great at one thing. A pattern we are seeing more and more. For graph access use a graph database. For search use a search subsystem. And so on.
We Need Cache Consistency Protocols
What we may need is a robust cache coherence protocol between different data silos. In this kind of architecture there's no real master database other databases can sync to. Every silo has their equally valid perspective on the data. As events stream in to each silo we need some way of keeping all the silos consistent. I have not seen this type of component yet.
Open Source Means We Can Build Complex Interacting Systems
What makes the polyglot future possible is open source. With open source it's possible to make a system of many complex parts because open source pricing means it doesn't cost anything until you want extra support. Spending a good percentage of your budget on a mega database means it simply has to do everything because you can't afford anything else. With open source we can create very powerful composite systems that wouldn't otherwise be possible to create with lower end budgets.


Friday 30 September 2011

Presentation: HBase @ Facebook

Presentation: HBase @ Facebook: Kannan Muthukkaruppan overviews HBase, explaining what Facebook Messages is and why they chose HBase to implement it, their contribution to HBase, and what they plan to use it for in the future. By Kannan Muthukkaruppan

Thursday 22 September 2011

Presentation: Building Scalable Systems: an Asynchronous Approach

Presentation: Building Scalable Systems: an Asynchronous Approach: Theo Schlossnagle expresses his opinion on Big Data, NoSQL, cloud, system architecture and design, then he discusses the benefit of using asynchronous queues for building scalable systems. By Theo Schlossnagle

Thursday 15 September 2011

Programming != Computer Science

Programming != Computer Science: I recently read this very interesting article on ways to "level up" as a software developer. Reading this article brought home something that has been nagging me for a while since joining Google: that there is a huge skill and cultural gap between "developers" and "Computer Scientists." Jason's advice to leveling-up in the aforementioned article is very practical: write code in assembly, write a mobile app, complete the exercises in SICP, that sort of thing. This is good advice, but certainly not all that I would want people on my team spending their time doing in order to be true technical leaders. Whether you can sling JavaScript all day or know the ins and outs of C++ templates often has little bearing on whether you're able to grasp the bigger, more abstract, less well-defined problems and be able to make headway on them.

For that you need a very different set of skills, which is where I start to draw the line between a Computer Scientist and a developer. Personally, I consider myself a Computer Scientist first and a software engineer second. I am probably not the right guy to crank out thousands of lines of Java on a tight deadline, and I'll be damned if I fully grok C++'s inheritance rules. But this isn't what Google hired me to do (I hope!) and I lean heavily on some amazing programmers who do understand these things better than I do.

Note that I am not defining a Computer Scientist as someone with a PhD -- although it helps. Doing a PhD trains you to think critically, to study the literature, make effective use of experimental design, and to identify unsolved problems. By no means do you need a PhD to do these things (and not everyone with a PhD can do them, either).

A few observations on the difference between Computer Scientists and Programmers...

Think Big vs. Get 'er Done

One thing that drove me a little nuts when I first started at Google was how quickly things move, and how often solutions are put into place that are necessary to move ahead, even if they aren't fully general or completely thought through. Coming from an academic background I am used to spending years pounding away at a single problem until you have a single, beautiful, general solution that can stand up to a tremendous amount of scrutiny (mostly in the peer review process). Not so in industry -- we gotta move fast, so often it's necessary to solve a problem well enough to get onto the next thing. Some of my colleagues at Google have no doubt been driven batty by my insistence on getting something "right" when they would rather just (and in fact need to) plow ahead.

Another aspect of this is that programmers are often satisfied with something that solves a concrete, well-defined problem and passes the unit tests. What they sometimes don't ask is "what can my approach not do?" They don't always do a thorough job at measurement and analysis: they test something, it seems to work on a few cases, they're terribly busy, so they go ahead and check it in and get onto the next thing. In academia we can spend months doing performance evaluation just to get some pretty graphs that show that a given technical approach works well in a broad range of cases.

Throwaway prototype vs. robust solution

On the other hand, one thing that Computer Scientists are not often good at is developing production-quality code. I know I am still working at it. The joke is that most academics write code so flimsy that it collapses into a pile of bits as soon as the paper deadline passes. Developing code that is truly robust, scales well, is easy to maintain, well-documented, well-tested, and uses all of the accepted best practices is not something academics are trained to do. I enjoy working with hardcore software engineers at Google who have no problem pointing out the totally obvious mistakes in my own code, or suggesting a cleaner, more elegant approach to some ass-backwards page of code I submitted for review. So there is a lot that Computer Scientists can learn about writing "real" software rather than prototypes.

My team at Google has a good mix of folks from both development and research backgrounds, and I think that's essential to striking the right balance between rapid, efficient software development and pushing the envelope of what is possible.