Wednesday, 15 August 2012

MemSQL Architecture - The Fast (MVCC, InMem, LockFree, CodeGen) And Familiar (SQL)

This is an interview with MemSQL cofounders Eric Frenkiel and Nikita Shamgunov, in which they try to answer critics by going into more depth about their technology.

MemSQL ruffled a few feathers with their claim of being the fastest database in the world. According to their benchmarks MemSQL can execute 200K TPS on an EC2 Quadruple Extra Large and on a 64 core machine they can push 1.2 million transactions a second.

Benchmarks are always a dark mirror, so make of them what you will, but the target market for MemSQL is clear: projects looking for something both fast and familiar. Fast as in a novel design using a combination of technologies like MVCC, code generation, lock-free data structures (skip lists), and in-memory execution. Familiar as in SQL and nothing but SQL. The only interface to MemSQL is SQL.

It’s right to point out MemSQL gets a boost by being a first release. Only a limited subset of SQL is supported, neither replication nor sharding is implemented yet, and writes queue in memory before flushing to disk. The next release will include a baseline distributed system, native replication, n-way joins, and subqueries. Maintaining performance as more features are added is a truer test.

And MemSQL is RAM based, so of course it’s fast, right? Even among in-memory databases MemSQL hopes to convince you they’ve made some compelling design choices. The reasoning for their design goes something like:
  • Modern hardware requires a modern database. The idea is to strip everything down and rethink it all from scratch.
  • Facebook scaled for two reasons: Memcached and Code Generation. Memcached provides in-memory key-value access and HipHop translates PHP to C++. Applying those ideas to a database you get an in-memory database that uses SQL instead of KV.
  • Since performance requires operating out of RAM, the first assumption is the data set fits in RAM.
  • Reading fast from RAM has been solved by Memcached, which is basically a hash table sitting behind a network interface. What is not solved is the fast write problem.
  • The fast write problem is solved by eliminating contention. The way to eliminate contention is by using lock-free data structures. Lock-free data structures scale well which is why MemSQL has faster writes than Memcached.
  • Hash tables are an obvious choice for key-value data structures, but what would you use for range queries? Skip lists are the only efficient data structure for range queries that is more scalable than b-trees, though lock-free skip lists are difficult to build.
  • When competing with in-memory technologies you need to execute fewer instructions for each SQL query. This goal is met via code generation. In a traditional database system there is a fixed overhead per query to set up all the contexts, prepare the tree for interpretation, and then run the query through the plan. All that is sidestepped by generating code, which becomes more of a win as the number of cores and the number of queries increases.
  • MVCC is a good match with an in-memory database because it offers both efficiency and transactional correctness. It also supports fast deletes. Rows can be marked deleted and then cleaned up behind the scenes.

On first hearing of this strange brew of technologies you could be forgiven for experiencing a little buzzword fatigue. But it all ends up working together. The mix of lock-free data structures, code generation, skip lists, and MVCC makes sense when you consider the driving forces of data living in memory and the requirement for blindingly fast execution of SQL queries.

In a single machine environment MemSQL makes an excellent case for their architecture. In a distributed environment they are limited in the same way every distributed database is limited. MVCC doesn’t offer any magic as it doesn’t translate easily across shards. The choices MemSQL has made reflect their primary use case of fast transactions plus fast real-time SQL based analytics. MemSQL uses a sharded shared-nothing approach where queries are run independently on each shard and merged together on aggregation nodes. Transactions across shards won’t be supported until two phase commit is implemented, but then they will perform like any other database. What they really want to do well is run fast real-time aggregations across a cluster, so that’s what their design reflects.

A lot of other questions come to mind with such a novel design. Will MemSQL perform common operations like “return the top 5 X” as programmers have come to expect? Will MemSQL still perform when hit with a wide variety of different SQL queries? They say yes given code generation and their data structure choices. Is SQL expressive enough to solve real world problems across many domains? When you start adding stored procs or user defined functions will the carefully orchestrated dance of data structures still work?

To answer these questions and more, let’s take a deeper look into the technology behind MemSQL.

Stats

  • Largest Installation: 500 node cluster deployed on EC2 (4000 cores)
  • Biggest single machine deployment: 320-core SGI supercomputer; 4 TB RAM
  • 20 billion records in a single table on a single machine
  • 1.25m inserts/sec on a single 64-core machine
  • Run 25 16-core servers 24/7 for testing
  • 15 employees, heavy on engineering

Information Sources

  • Interview over Skype.
  • Email Q&A.
  • Everything listed under section Related Articles.

Use Cases

  • Data is proliferating and there are always green field markets that need fixing. Giving a relational interface is a good way to get real-time solved in an understandable way.
  • Targeted at high throughput workloads with small transactions in a heavily concurrent environment.
  • Users with lots of CPU and RAM who need performance. MemSQL is a complex high performance piece of software. You’ll use MemSQL if you are making money.
  • Answers “what is happening right now” questions. The most successful use case is simultaneous inserts, which are only possible with a row-based system, and simultaneous selects, which support real-time analytics like min/max/distinct/avg, etc.
  • Not in the BI market; works well alongside products like Vertica. MemSQL’s strength is real-time analytics: data is inserted transactionally yet supports a real-time analytical layer.
  • Appeals to startups that value time over money. Build or buy? Time is the most valuable thing. There’s no need to create vertical infrastructure, and Memcached is not needed, which simplifies layers in the stack.
  • Write-heavy, read-heavy workloads
    • Machine data
    • Traffic spikes
    • Streaming data

Origin Story

  • Started in Jan 2011. Worked in stealth mode for 14 months.
  • Both Eric and Nikita worked at Facebook and decided there was a better way to give SQL at scale and speed.
  • They quit and applied to Y Combinator with zero lines of code. Y Combinator normally expects to see a demo, but bought the argument that they could build a very fast database based on their experience: Eric worked on Platform at Facebook, and Nikita worked on Infrastructure there, having previously worked on the Microsoft SQL Server core engine, among other geek credentials.
  • Any Facebook partner that touched the social graph immediately saw problems with scale. Games would add 2-3 million users in a matter of weeks. Media companies would have to start tracking Like buttons and comments on social deployments.
  • The aha moment was realizing that not just Facebook had these problems: downstream traffic from social networks, and new data sources like capital markets, may have to consume a million messages a second.

Why Faster?

  • Lock free + code generation + MVCC is faster on multiple cores than an approach using partitioning by core and serializing data structure access per core.
  • Code generation minimizes code execution paths within queries and removes interpretation overhead. SQL is being hardwired into the server.
  • C++ is used instead of Java.
  • Skip lists are used instead of b-trees because b-trees don’t scale.
  • CPU efficiency means higher throughput. Since MemSQL uses fewer instructions per query they can achieve higher throughput. More queries can be pushed through the system because they’ve minimized parsing, caching, and plan cache matching.

The Y Combinator Cabal

  • The Y Combinator network is very powerful. Through Y Combinator they were able to get a first customer way before release, which helped them prioritize features and avoid the be everything to everybody trap.
  • Their first customer grew to a million users in 6 weeks, put all their data in RAM, and supported just the SQL that they needed.

Lock-Free Data Structures

  • Lock-free data structures scale exceptionally well as more resources (CPU, RAM) are added to a system. Lock-free data structures minimize wasted CPU during points of high contention.
  • Every component of the MemSQL engine is built on lock-free data structures: linked lists, queues, stacks, skip lists, and hash tables.
  • Lock-free queues and stacks are used throughout the system for managing state in transactions and memory managers.
  • The lock-free hash table is used to map query shapes to compiled plans in the plan cache.
  • Lock-free skip lists and hash tables are available as index data structures.
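The compare-and-swap retry loop at the heart of these lock-free structures can be sketched with a Treiber stack. MemSQL's engine is C++, and CPython has no user-level CAS instruction, so the `AtomicRef` below is a stand-in (a lock models the atomicity purely for illustration); the point is the shape of the algorithm, not real lock-free code:

```python
import threading

class AtomicRef:
    """Stand-in for a hardware compare-and-swap cell. A real lock-free
    implementation would use an atomic pointer instruction instead."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

class Node:
    __slots__ = ("value", "next")
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

class TreiberStack:
    """Lock-free stack shape: retry the CAS until the head is unchanged."""
    def __init__(self):
        self.head = AtomicRef(None)

    def push(self, value):
        node = Node(value)
        while True:
            old = self.head.get()
            node.next = old
            if self.head.compare_and_set(old, node):
                return            # nobody changed head under us; done

    def pop(self):
        while True:
            old = self.head.get()
            if old is None:
                return None
            if self.head.compare_and_set(old, old.next):
                return old.value  # we won the race for this node

s = TreiberStack()
s.push(1); s.push(2)
print(s.pop(), s.pop(), s.pop())  # 2 1 None
```

Instead of waiting on a lock, a thread that loses a race simply retries, which is why contention wastes far less CPU than blocking does.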

Skip Lists

  • A skip list is a popular data structure that performs extremely well in-memory. It shares many fundamental properties with a randomized tree. For example, it offers O(log N) time to seek to a specific value.
  • The basic idea is that the bottom layer is a sorted linked list. Each higher layer is an “express” lane for the lists below. An element in layer i appears in layer i+1 with some fixed probability p (usually something like ½ or ¼).
  • Among the data structures that offer the efficient seek and insert properties of a tree, skip lists are notable in that they can be implemented lock-free and perform extremely well in highly concurrent environments.
  • Skip lists have two main tradeoffs:
    • Compared to b-trees, skip lists are slightly slower for long sequential scans.
    • Lock-free skip lists are unidirectional, so in MemSQL you have to specify for each skip list index whether it should be ascending or descending. If you need both directions, you need two indexes.
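The express-lane structure, the fixed-probability promotion coin flip, and the bottom-lane range scan can all be seen in a minimal skip list sketch. This is a plain single-threaded version for illustration (MemSQL's is lock-free and in C++); `MAX_LEVEL` and `P` are our choices:

```python
import random

MAX_LEVEL = 8
P = 0.5  # probability an element is promoted to the next "express lane"

class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)  # one next-pointer per lane

class SkipList:
    def __init__(self):
        self.head = SkipNode(None, MAX_LEVEL)
        self.level = 0

    def _random_level(self):
        lvl = 0
        while random.random() < P and lvl < MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * (MAX_LEVEL + 1)
        node = self.head
        for i in range(self.level, -1, -1):      # seek: expected O(log N)
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = SkipNode(key, lvl)
        for i in range(lvl + 1):                 # splice into each lane
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def range_scan(self, lo, hi):
        """Seek with the express lanes, then walk the sorted bottom lane;
        this linked-list walk is where skip lists pay a small sequential-scan
        penalty versus b-trees."""
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < lo:
                node = node.forward[i]
        node = node.forward[0]
        out = []
        while node and node.key <= hi:
            out.append(node.key)
            node = node.forward[0]
        return out

sl = SkipList()
for k in [7, 3, 19, 1, 12, 5, 9]:
    sl.insert(k)
print(sl.range_scan(4, 12))  # [5, 7, 9, 12]
```

Note the bottom lane is itself a sorted singly linked list, which is also why a lock-free version only scans in one direction.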

MVCC

MemSQL implements multiversion concurrency control (MVCC) to provide transactional semantics:
  • Every time a transaction modifies a row, MemSQL creates a new version that sits on top of the existing one. This version is only visible to the transaction that made the modification. Read queries that access the same row “see” the old version of this row.
  • Versions in MemSQL are implemented as a lock-free linked list.
  • MemSQL only takes a lock in the case of a write-write conflict on the same row. MemSQL takes a lock because it is easier to program against. The alternative would be to fail the second transaction, which would require the programmer to resolve the failure.
  • MemSQL implements a lock-wait timeout for deadlocks. The default value is 60 seconds. If the timeout occurs, the transaction is aborted.
  • MemSQL queues modified rows for the garbage collector. This lets the garbage collector clean up old versions very efficiently by avoiding a full-table scan.
  • Wherever possible, MemSQL can optimize single-row update queries down to simple atomic operations.
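The version-chain idea can be made concrete with a minimal sketch. Class and field names here are ours, not MemSQL's, and visibility is reduced to "committed, or mine" for illustration:

```python
class Version:
    def __init__(self, value, txn_id, prev=None):
        self.value = value      # None would mark a deleted row
        self.txn_id = txn_id    # transaction that created this version
        self.committed = False
        self.prev = prev        # the older version "below" this one

class Row:
    def __init__(self):
        self.latest = None      # top of the version chain

    def write(self, value, txn_id):
        # A new version sits on top of the existing one; it is visible
        # only to the writing transaction until it commits.
        self.latest = Version(value, txn_id, prev=self.latest)
        return self.latest

    def read(self, txn_id):
        v = self.latest
        while v is not None:
            if v.committed or v.txn_id == txn_id:
                return v.value
            v = v.prev          # skip other transactions' uncommitted versions
        return None

row = Row()
row.write("a", txn_id=1).committed = True
row.write("b", txn_id=2)                  # txn 2's uncommitted update
print(row.read(txn_id=2))  # b  (writer sees its own version)
print(row.read(txn_id=3))  # a  (others still see the committed version)
```

A delete is just a new version whose value marks the row dead, which is why rows can be marked deleted cheaply and garbage-collected later.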

MemSQL Durability

  • How it Works
    • MemSQL pushes transactions to disk as fast as the disk will allow.
    • Transactions first commit to an in-memory buffer and then asynchronously start writing to disk. If the transaction buffer size is set to zero, a transaction is not acknowledged until it is committed to disk (full synchronous durability).
    • A log flusher thread flushes the transactions to disk every few milliseconds. MemSQL leverages group commit to dramatically improve disk throughput.
    • Once MemSQL log files reach a certain size (2 GB by default), they are compressed into snapshots. Snapshots are more compact and faster for recovery. This number is configurable as snapshot-trigger-size.
    • When MemSQL is restarted, it recovers its state by reading the snapshot file (a multi-threaded process) and then replaying the remaining log file. Snapshot recovery is significantly faster than log recovery.
    • Durability can be fully disabled for workloads that do not require it. Unless the transaction buffer fills up, this has no impact on query throughput or performance. It does not improve read performance.
    • MemSQL uses checksums in its snapshot and log files to validate data consistency.
  • Why it's Fast
    • Group commit makes MemSQL faster in highly concurrent use cases (many writers pushing a lot of data into the log). This scenario is common to customers seeking a high throughput database.
    • The on-disk backup of MemSQL is extremely compact, which reduces I/O pressure.
    • Without a buffer pool, writes to disk are limited to sequentially writing logs and snapshots, so MemSQL is able to efficiently take advantage of sequential I/O.
    • MemSQL writes both its snapshot and log files to disk sequentially. Both hard disk and solid state drives offer significantly better performance for sequential I/O than they do for random I/O.
    • MemSQL completely avoids page-swapping and can guarantee consistent high-throughput SLAs on read/write queries. No read query in MemSQL will ever wait for disk. This property is extremely important for real-time analytics scanning over terabytes of data.
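The buffer-plus-flusher scheme described above can be sketched as follows. This is an illustrative model only (class names and the flush interval are our assumptions, and a Python list stands in for the sequential log file), not MemSQL's implementation:

```python
import threading
import time

class GroupCommitLog:
    """Transactions append to a RAM buffer and optionally wait; a single
    flusher thread writes the whole batch with one sequential 'I/O'."""
    def __init__(self, flush_interval=0.005):
        self.buffer = []
        self.lock = threading.Lock()
        self.flushed = threading.Condition(self.lock)
        self.flushed_upto = 0
        self.appended = 0
        self.disk = []                  # stands in for the sequential log file
        self.interval = flush_interval
        threading.Thread(target=self._flusher, daemon=True).start()

    def commit(self, record, synchronous=False):
        with self.lock:
            self.buffer.append(record)
            self.appended += 1
            seq = self.appended
        if synchronous:                 # full durability: wait for the write
            with self.flushed:
                while self.flushed_upto < seq:
                    self.flushed.wait()

    def _flusher(self):
        while True:
            time.sleep(self.interval)   # flush every few milliseconds
            with self.flushed:
                if self.buffer:
                    batch, self.buffer = self.buffer, []
                    self.disk.append(batch)      # one write for many commits
                    self.flushed_upto += len(batch)
                    self.flushed.notify_all()

log = GroupCommitLog()
for i in range(100):
    log.commit(("INSERT", i))                    # buffered, fire-and-forget
log.commit(("COMMIT", 100), synchronous=True)    # waits for the disk write
print(sum(len(b) for b in log.disk))  # 101
```

The group-commit win is visible in `log.disk`: 101 committed records land in only a handful of sequential writes.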

Code Generation

  • How it Works
    • MemSQL compiles SQL queries to native code with SQL to C++ code generation. C++ is compiled with GCC and loaded into the database as a shared object.
    • Compilation happens at run time. MemSQL is a just-in-time compiler for SQL. Compiled query plans are reused across server restarts so they only have to be compiled once in the lifetime of an application.
    • MemSQL uses a two-phase parser. The first phase is a lightweight one-pass layer called the auto-parameterizer, which strips numbers and strings out of queries.
    • For example "SELECT * FROM t WHERE id > 5 and name=’John’;" is converted to "SELECT * FROM t WHERE id > @ and name = ^;". These plans are stored in a hash table that maps parameterized queries to compiled query plans.
      • If the hash table contains the plan, then the parameters are passed into the compiled code which executes the query.
      • Otherwise, the query is processed by a traditional tree-based SQL parser and compiled into C++ code. The next time a query with the same shape is run, it will match the compiled plan in the hash table.
    • DDL queries (CREATE/ALTER) and DML queries (SELECT/INSERT/UPDATE/DELETE) all go through code generation.
    • Compiled C++ code is stored in the /plancache directory. Feel free to dive in and take a look.
    • Does not support “if statements” or any procedural type logic. It’s SQL and only SQL (for now). Though they contend using the SQL base for creating generated code yields optimum performance because it removes as much interpretation as possible. Machine code generated from SQL is hardwired into the code path.
    • Dynamic SQL is supported. The code generation is handled in the background by calling gcc, which is not unusual for systems that combine DSLs with dynamic linking.
  • Why it's Fast
    • The resulting speedup is comparable to upgrading from an interpreted language (PHP) to a compiled language (C++). This is the premise behind HipHop's PHP -> C++ compilation used at Facebook.
    • The hot code path for query plans that have been compiled by the system is optimized very carefully. The hash table used to manage the plan cache is lock-free. Read queries smaller than 4kb use either pre-allocated or stack-allocated memory. Write queries allocate table memory via header-inlined slab allocators.
    • malloc() is never called. Memory is allocated on process creation and managed internally from then on. This allows for complete control of garbage collection.
    • Index data structures are defined in a per-table header file, which is included in every query on that table. This enables all storage engine operations to be inlined directly into the code and avoids the overhead of expensive per-row virtual function calls.
    • Memory based systems can be slow depending on their design. Taking a global write lock or using memory mapped files is slow. A granular lock is taken only on a write-write conflict.
    • Query compilation latency. A fresh installation of MemSQL comes with an empty plancache. Every new query shape not in plancache will be compiled with GCC. GCC compilation is still a little bit slow compared to query compilation in MySQL. Compilation takes 0.5 to 10 seconds per query per thread (depending on hardware).
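The auto-parameterizer and plan-cache lookup described under "How it Works" can be sketched like this. The regex and the cache are our simplification (a string stands in for a compiled shared object), using the `@` and `^` placeholders from the example above:

```python
import re

# One-pass "auto-parameterizer": strip numeric and string literals
# out of a query so queries of the same shape share one plan.
_literal = re.compile(r"'(?:[^'\\]|\\.)*'|\b\d+(?:\.\d+)?\b")

def parameterize(sql):
    params = []
    def repl(match):
        tok = match.group(0)
        params.append(tok)
        return "^" if tok.startswith("'") else "@"
    return _literal.sub(repl, sql), params

plan_cache = {}  # query shape -> "compiled plan" (stands in for a .so)

def execute(sql):
    shape, params = parameterize(sql)
    plan = plan_cache.get(shape)
    if plan is None:                 # cache miss: this is where GCC would run
        plan = f"compiled<{shape}>"
        plan_cache[shape] = plan
    return plan, params              # cache hit: params feed the compiled code

shape, params = parameterize("SELECT * FROM t WHERE id > 5 AND name='John'")
print(shape)   # SELECT * FROM t WHERE id > @ AND name=^
print(params)  # ['5', "'John'"]
```

After the first (slow, GCC-backed) compilation of a shape, every later query with the same shape skips parsing and planning entirely, which is the source of the speedup and of the cold-start compilation latency noted above.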

Replication

  • MemSQL replication is row-based, and supports master/multi-slave configuration.
  • Supports K-Safety. As many servers as required can be used for High Availability.
  • MemSQL supports online provisioning. Provisioning works by shipping and recovering from a snapshot, and then continuing to replay from the log. This process fits naturally into MemSQL’s durability scheme and is what enables online provisioning.
  • A slave will never encounter conflicts because order of execution is serialized in the transaction log.
  • MemSQL does not support master-master replication. A master-master design has a negative trade-off: loss of data consistency.
  • MemSQL also supports synchronous replication. The tradeoff is higher write-query latency. Does not slow down reads.
  • Load is balanced across slaves.
  • When using asynchronous replication slaves are a few milliseconds behind.

The Distributed Story

  • Note: this feature is not released yet; it’s set to be released in late September.
  • Query Routing
    • Two-tier architecture with aggregators and leaves. Leaf nodes run MemSQL and store data. Aggregators have a query planner that receives queries, breaks them into smaller queries, and routes them to one or more leaf nodes. They aggregate the results intelligently back to the user.
    • Leaf nodes have a shared-nothing design. Data sharding across leaf nodes is managed by the aggregators.
    • MemSQL uses standard SQL partitioning syntax to implement range and hash (key) partitioning. Range partitioning involves defining every range of data to split against, while hash partitioning takes only a shard key to hash against. MemSQL uses consistent hashing to minimize data movement in the event of a failure.
    • In the event of a network partition, an aggregator communicates with the metadata node to either (a) resync changed metadata or (b) assume responsibility for remapping data ranges and update it. During this time, queries are blocked up to a tolerance timeout.
    • MemSQL supports single-shard OLTP queries (routing) and complex cross-shard OLAP queries that iterate over the entire dataset.
    • Because the aggregators push most of the work to the leaf nodes, MemSQL scales almost linearly with each added leaf node.
    • The aggregator also uses code generation to compile the logic. For simple queries the overhead is less than 0.5 milliseconds.
    • Leaf nodes are dumb, they do not know about other leaf nodes. MVCC only works on each leaf node, not across leaf nodes.
    • There are no cross shard transactions. Updates happen independently on each shard. Two phase commit is a future feature.
    • Optimize for throughput and scalability.
      • Using a shared-nothing architecture and avoiding two-phase commit means a very high number of nodes can be supported with this approach.
      • Use case for this is integration with Vertica, which requires a long load step, but can store data larger than memory. MemSQL lets you see what is happening with a system right now.
      • Vertica is a column store which can do fast aggregations but can’t do fast updates, which is what MemSQL is optimized for.
      • Example use case is a financial system where sharding by stocks makes sense and stats are calculated by stock.
      • JDBC/ODBC can be used to sync to 3rd party systems.
  • Clustering
    • Analytical queries can be run that touch every single node in order to produce aggregations.
    • Clustering is managed by aggregator nodes communicating with an external metadata service. The metadata service can be backed by MemSQL, MySQL, or ZooKeeper.
    • Each aggregator syncs and updates against the metadata service. The unified-metadata design was picked because it minimizes chatter in the system. If aggregators cannot contact the metadata node, they cannot perform DDL operations.
    • Short answer about CAP theorem: MemSQL can be configured for AP (availability, partition tolerance) or CP (consistency, partition tolerance). With asynchronous replication and load-balanced reads, MemSQL can even be configured for eventual consistency.
    • Long answer about the CAP theorem: As Eric Brewer (CAP inventor) points out, CAP theorem and the “pick 2 out of 3” mentality are too simplistic for analyzing a complex system (http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed). Instead, the discussion should be about how the system “maximizes combinations of consistency and availability that make sense for the specific application.”
    • Users can configure replication to match the requirements of their application:
      • Replication enables high availability in the system. In the event of a network partition, MemSQL will immediately elect a new replication master to keep the system available.
      • Asynchronous replication minimizes query latency for write queries. In the event of a failure, failover to replication slave and suffer partial data loss for the replication delta. MemSQL replication is powerful enough to detect and in some cases resolve divergences when the partitioned leaf node is restarted and restored to the system.
      • Synchronous replication maintains strong consistency. It enables the system to be configured for k-safety (the system remains consistent and available in the presence of k failures). This option results in higher write query latencies.
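Consistent hashing, which the aggregators use to minimize data movement on failure, can be sketched as a ring of hashed node points. This is a generic illustration (names and the virtual-replica count are our assumptions), not MemSQL's code:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: each leaf node owns the arc up to its
    points on the ring, so removing a node only remaps that node's keys."""
    def __init__(self, nodes, replicas=64):
        self.replicas = replicas
        self.ring = []               # sorted (point, node) pairs
        for n in nodes:
            self.add(n)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Virtual replicas spread each node's ownership around the ring.
        for i in range(self.replicas):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def node_for(self, key):
        # A key belongs to the first node point at or after its hash.
        point = self._hash(key)
        idx = bisect.bisect(self.ring, (point, ""))
        if idx == len(self.ring):
            idx = 0                  # wrap around the ring
        return self.ring[idx][1]

ring = HashRing(["leaf1", "leaf2", "leaf3"])
print(ring.node_for("user:42") in {"leaf1", "leaf2", "leaf3"})  # True
```

The useful property: when `leaf3` fails and is removed, only keys that mapped to `leaf3` move; everything owned by `leaf1` and `leaf2` stays put, unlike naive `hash(key) % n` sharding where nearly all keys would be remapped.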

Deployment And Management

  • Easy new query deployment.
    • Deploying new queries doesn’t require wrapping code in Java, compiling the Java, stopping the server and then deploying.
    • MySQL clients send SQL statements to the server. Nothing extra is needed.
    • The implication is that the simplicity comes from SQL; adding any third-party code won’t work.
  • How to get up and running with MemSQL:
    • Download of developer edition is available directly on the website. http://www.memsql.com/#download
    • Streamlined deployment on EC2. Pre-configured AMI that accepts your license key and launches MemSQL for you on Ubuntu 12.04. 20% of production MemSQL deployments are run this way.
    • Can also run on a variety of Linux 64-bit systems by downloading a tar.gz file from download.memsql.com.
  • Client Libraries
    • The MySQL protocol is used instead of creating their own client protocol. This makes it easy to integrate into existing environments.
    • Very smart use of the extensive MySQL ecosystem by leveraging high performance MySQL clients instead of building their own.
    • Just change your port and point to MemSQL instead of MySQL. Allows the focus to be on server development instead of client development.
    • MemSQL works with any MySQL client library: ODBC, JDBC, PHP/Python/Perl/Ruby/etc, mysql c library.
    • Popular manageability tools (Sequel Pro, PHPMyAdmin, MySQL Workbench), app frameworks (Ruby on Rails, Django, Symfony), and visualization tools (panopticon) work with MemSQL.
  • Server Management
    • MemSQL guards against running out of memory with two knobs: maximum_memory and maximum_table_memory. maximum_memory limits the total memory used by the server and maximum_table_memory limits the amount used for table storage. If memory usage exceeds maximum_table_memory then write queries are blocked but read queries stay up. On the developer build maximum_table_memory is hardcoded to 10 GB.
    • MemSQL exposes custom statistics with the SHOW STATUS, SHOW STATUS EXTENDED, and SHOW PLANCACHE commands. You can get numbers on total query compilations, query execution performance, and durability performance.

SQL Support

  • Very limited SQL support. Just joins between two tables. No outer or full outer joins.
  • Does not support: views, prepared queries, stored procedures, user defined functions, triggers, foreign keys, charsets other than utf8
  • The only supported isolation level: READ COMMITTED
  • MemSQL only supports single query transactions. Every query begins and commits in its own transaction.
  • SQL is used:
    • To reduce training costs, people (traders, business types) know SQL.
    • Because they wanted to build something that was easy to use.

Pricing

  • Pricing isn’t being disclosed just yet. The thought is to use an hourly model so more developers can deploy on the cloud today.

Lessons Learned

  • Y Combinator isn’t just about funding; it’s a support network that helps you make much-needed connections, like getting early customer wins.
  • Use C++ for systems-level infrastructure. It allows you to build more efficient and robust software.
  • You should use established interfaces to drive adoption. Make it extremely easy for people to try your software.
  • Hire people who are “ahead of the curve” in their careers and promote from within.
  • Invest in a good code review system; we use Phabricator (Facebook’s code review system, now at phabricator.org)
  • Make it easy to add tests to the system and invest in hardware and software to make testing easy. Software will only become reliable from extensive testing. If you’re testing 24/7, invest in your own hardware.

Related Articles

Thursday, 17 May 2012

Article: If all these new DBMS technologies are so scalable, why are Oracle and DB2 still on top of TPC-C? A roadmap to end their dominance.


http://dbmsmusings.blogspot.com/2012/05/if-all-these-new-dbms-technologies-are.html

(This post is coauthored by Alexander Thomson and Daniel Abadi)
In the last decade, database technology has arguably progressed furthest along the scalability dimension. There have been hundreds of research papers, dozens of open-source projects, and numerous startups attempting to improve the scalability of database technology. Many of these new technologies have been extremely influential---some papers have earned thousands of citations, and some new systems have been deployed by thousands of enterprises.

So let's ask a simple question: If all these new technologies are so scalable, why on earth are Oracle and DB2 still on top of the TPC-C standings? Go to the TPC-C Website with the top 10 results in raw transactions per second. As of today (May 16th, 2012), Oracle 11g is used for 3 of the results (including the top result), 10g is used for 2 of the results, and the rest of the top 10 is filled with various versions of DB2. How is technology designed decades ago still dominating TPC-C? What happened to all these new technologies with all these scalability claims?

The surprising truth is that these new DBMS technologies are not listed in the TPC-C top ten results not because they do not care enough to enter, but rather because they would not win if they did.

To understand why this is the case, one must understand that scalability does not come for free. Something must be sacrificed to achieve high scalability. Today, there are three major categories of tradeoff that can be exploited to make a system scale. The new technologies basically fall into two of these categories; Oracle and DB2 fall into a third. And the later parts of this blog post describe research from our group at Yale that introduces a fourth category of tradeoff that provides a roadmap to end the dominance of Oracle and DB2.

These categories are:

(1) Sacrifice ACID for scalability. Our previous post on this topic discussed this in detail. Basically we argue that a major class of new scalable technologies fall under the category of "NoSQL", which achieves scalability by dropping ACID guarantees, thereby allowing them to eschew two phase locking, two phase commit, and other impediments to concurrency and processor independence that hurt scalability. All of these systems that relax ACID are immediately ineligible to enter the TPC-C competition since ACID guarantees are one of TPC-C's requirements. That's why you don't see NoSQL databases in the TPC-C top 10---they are immediately disqualified.

(2) Reduce transaction flexibility for scalability. There are many so-called "NewSQL" databases that claim to be both ACID-compliant and scalable. And these claims are true---to a degree. However, the fine print is that they are only linearly scalable when transactions can be completely isolated to a single "partition" or "shard" of data. While these NewSQL databases often hide the complexity of sharding from the application developer, they still rely on the shards to be fairly independent. As soon as a transaction needs to span multiple shards (e.g., update two different user records on two different shards in the same atomic transaction), then these NewSQL systems all run into problems. Some simply reject such transactions. Others allow them, but need to perform two phase commit or other agreement protocols in order to ensure ACID compliance (since each shard may fail independently). Unfortunately, agreement protocols such as two phase commit come at a great scalability cost (see our 2010 paper that explains why). Therefore, NewSQL databases only scale well if multi-shard transactions (also called "distributed transactions" or "multi-partition transactions") are very rare. Unfortunately for these databases, TPC-C models a fairly reasonable retail application where customers buy products and the inventory needs to be updated in the same atomic transaction. 10% of TPC-C New Order transactions involve customers buying products from a "remote" warehouse, which is generally stored in a separate shard. Therefore, even for basic applications like TPC-C, NewSQL databases lose their scalability advantages. That's why the NewSQL databases do not enter TPC-C results --- even just 10% of multi-shard transactions causes their performance to degrade rapidly.

(3) Trade cost for scalability. If you use high end hardware, it is possible to get stunningly high transactional throughput using old database technologies that don't have shared-nothing horizontal scalability. Oracle tops TPC-C with an incredibly high throughput of 500,000 transactions per second. There exists no application in the modern world that produces more than 500,000 transactions per second (as long as humans are initiating the transactions---machine-generated transactions are a different story). Therefore, Oracle basically has all the scalability that is needed for human scale applications. The only downside is cost---the Oracle system that is able to achieve 500,000 transactions per second costs a prohibitive $30,000,000!

Since the first two types of tradeoffs are immediate disqualifiers for TPC-C, the only remaining thing to give up is cost-for-scale, and that's why the old database technologies are still dominating TPC-C. None of these new technologies can handle both ACID and 10% remote transactions.

A fourth approach...

TPC-C is a very reasonable application. New technologies should be able to handle it. Therefore, at Yale we set out to find a new dimension in this tradeoff space that could allow a system to handle TPC-C at scale without costing $30,000,000. Indeed, we are presenting a paper next week at SIGMOD (see the full paper) that describes a system that can achieve 500,000 ACID-compliant TPC-C New Order transactions per second using commodity hardware in the cloud. The cost to us to run these experiments was less than $300 (of course, this is renting hardware rather than buying, so it's hard to compare prices --- but still --- a factor of 100,000 less than $30,000,000 is quite large).

Calvin, our prototype system designed and built by a large team of researchers at Yale that includes Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Anton Petrov, Michael Giuffrida, and Aaron Segal (in addition to the authors of this blog post), explores a tradeoff very different from the three described above. Calvin requires all transactions to be executed fully server-side and sacrifices the freedom to non-deterministically abort or reorder transactions on-the-fly during execution. In return, Calvin gets scalability, ACID-compliance, and extremely low-overhead multi-shard transactions over a shared-nothing architecture. In other words, Calvin is designed to handle high-volume OLTP throughput on sharded databases on cheap, commodity hardware stored locally or in the cloud. Calvin significantly improves the scalability over our previous approach to achieving determinism in database systems.

Scaling ACID

The key to Calvin's strong performance is that it reorganizes the transaction execution pipeline normally used in DBMSs according to the principle: do all the "hard" work before acquiring locks and beginning execution. In particular, Calvin moves the following stages to the front of the pipeline:

  • Replication. In traditional systems, replicas agree on each modification to database state only after some transaction has made the change at some "master" replica. In Calvin, all replicas agree in advance on the sequence of transactions that they will (deterministically) attempt to execute.
  • Agreement between participants in distributed transactions. Database systems traditionally use two-phase commit (2PC) to handle distributed transactions. In Calvin, every node sees the same global sequence of transaction requests, and is able to use this already-agreed-upon information in place of a commit protocol.
  • Disk accesses. In our VLDB 2010 paper, we observed that deterministic systems performed terribly in disk-based environments due to holding locks for the 10ms+ duration of reading the needed data from disk, since they cannot reorder conflicting transactions on the fly. Calvin gets around this setback by prefetching into memory all records that a transaction will need during the replication phase---before locks are even acquired.

As a result, each transaction's user-specified logic can be executed at each shard with an absolute minimum of runtime synchronization between shards or replicas to slow it down, even if the transaction's logic requires it to access records at multiple shards. By minimizing the time that locks are held, concurrency can be greatly increased, thereby leading to near-linear scalability on a commodity cluster of machines.
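The agree-on-order-first idea can be sketched in a few lines. The toy model below is my own simplification, not Calvin's actual code: a sequencer fixes a global transaction order up front, and every replica then runs the same ordered list against its own copy of the data. Because execution is deterministic, all replicas converge on identical state with no commit protocol and no synchronization of results.

```python
# Toy deterministic execution: agree on the order first, then execute.
def sequencer(transactions):
    # In a real system this is a replicated, batched log; here it simply
    # fixes a global order that every replica will see identically.
    return list(enumerate(transactions))

def run_replica(ordered_txns, store):
    # Each replica executes the same transactions in the same order.
    for _, txn in ordered_txns:
        txn(store)
    return store

def transfer(src, dst, amount):
    # A deterministic transaction body: no randomness, no wall-clock reads.
    def txn(store):
        if store[src] >= amount:
            store[src] -= amount
            store[dst] += amount
    return txn

log = sequencer([transfer("alice", "bob", 30), transfer("bob", "carol", 50)])
replica1 = run_replica(log, {"alice": 100, "bob": 40, "carol": 0})
replica2 = run_replica(log, {"alice": 100, "bob": 40, "carol": 0})
# Both replicas reach the same final state without exchanging results.
```

The same property is what makes replicating only the transactional *input* sufficient for strong consistency, as described below.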

Strongly consistent global replication

Calvin's deterministic execution semantics provide an additional benefit: replicating transactional input is sufficient to achieve strongly consistent replication. Since replicating batches of transaction requests is extremely inexpensive and happens before the transactions acquire locks and begin executing, Calvin's transactional throughput capacity does not depend at all on its replication configuration.

In other words, not only can Calvin run 500,000 transactions per second on 100 EC2 instances in Amazon's US East (Virginia) data center, it can maintain strongly-consistent, up-to-date 100-node replicas in Amazon's Europe (Ireland) and US West (California) data centers---at no cost to throughput.

Calvin accomplishes this by having replicas perform the actual processing of transactions completely independently of one another, maintaining strong consistency without having to constantly synchronize transaction results between replicas. (Calvin's end-to-end transaction latency does depend on message delays between replicas, of course---there is no getting around the speed of light.)

Flexible data model

So where does Calvin fall in the OldSQL/NewSQL/NoSQL trichotomy?

Actually, nowhere. Calvin is not a database system itself, but rather a transaction scheduling and replication coordination service. We designed the system to integrate with any data storage layer, relational or otherwise. Calvin allows user transaction code to access the data layer freely, using any data access language or interface supported by the underlying storage engine (so long as Calvin can observe which records user transactions access). The experiments presented in the paper use a custom key-value store. More recently, we've hooked Calvin up to Google's LevelDB and added support for SQL-based data access within transactions, building relational tables on top of LevelDB's efficient sorted-string storage.

From an application developer's point of view, Calvin's primary limitation compared to other systems is that transactions must be executed entirely server-side. Calvin has to know in advance what code will be executed for a given transaction. Users may pre-define transactions directly in C++, or submit arbitrary Python code snippets on-the-fly to be parsed and executed as transactions.

For some applications, this requirement of completely server-side transactions might be a difficult limitation. However, many applications prefer to execute transaction code on the database server anyway (in the form of stored procedures), in order to avoid multiple round trip messages between the database server and application server in the middle of a transaction.
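The stored-procedure style described above is easy to illustrate. The sketch below is a hypothetical toy, not Calvin's actual API: transaction bodies are registered with the server ahead of time and invoked by name with their arguments, so the server knows the full logic in advance and no client round trips happen mid-transaction.

```python
# Hypothetical sketch of server-side (stored-procedure style) transactions.
PROCEDURES = {}

def procedure(fn):
    """Register a transaction body so the server knows its code in advance."""
    PROCEDURES[fn.__name__] = fn
    return fn

@procedure
def new_order(store, customer, item, qty):
    # Runs entirely server-side: read stock, check, decrement, record order.
    if store["stock"][item] < qty:
        return "rejected"
    store["stock"][item] -= qty
    store["orders"].append((customer, item, qty))
    return "ok"

def invoke(store, name, *args):
    # The client sends only the procedure name and arguments.
    return PROCEDURES[name](store, *args)

db = {"stock": {"widget": 5}, "orders": []}
status = invoke(db, "new_order", "alice", "widget", 2)
```

Because the whole body is declared up front, a scheduler can also inspect which records it touches, which is the property Calvin relies on.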

If this limitation is acceptable, Calvin presents a nice alternative in the tradeoff space to achieving high scalability without sacrificing ACID or multi-shard transactions. Hence, we believe that our SIGMOD paper may present a roadmap for overcoming the scalability dominance of the decades-old database solutions on traditional OLTP workloads. We look forward to debating the merits of this approach in the weeks ahead (and Alex will be presenting the paper at SIGMOD next week).

Friday, 11 May 2012

InfoQ: Panel: Multicore, Manycore, and Cloud Computing

InfoQ: Panel: Multicore, Manycore, and Cloud Computing:

Note: The biggest challenges are the correctness and performance of parallel execution.
In terms of correctness: false sharing, atomicity, the consistency model.
In terms of performance: locking, synchronization, deadlock.


Thursday, 19 April 2012

Building Highly Available Systems in Erlang

InfoQ: Building Highly Available Systems in Erlang:

Key ideas:


The process approach to fault isolation advocates that the process software be fail-fast: it should either function correctly or it should detect the fault, signal failure, and stop operating.

Processes are made fail-fast by defensive programming. They check all their inputs, intermediate results and data structures as a matter of course. If any error is detected, they signal a failure and stop. In the terminology of [Christian], fail-fast software has small fault detection latency.

Saturday, 7 April 2012

Are Cloud Based Memory Architectures the Next Big Thing?

Are Cloud Based Memory Architectures the Next Big Thing?:
We are on the edge of two potent technological changes: Clouds and Memory Based Architectures. This evolution will rip open a chasm where new players can enter and prosper. Google is the master of disk. You can't beat them at a game they perfected. Disk based databases like SimpleDB and BigTable are complicated beasts, typical last gasp products of any aging technology before a change. The next era is the age of Memory and Cloud which will allow for new players to succeed. The tipping point will be soon.

Let's take a short trip down web architecture lane:

  • It's 1993: Yahoo runs on FreeBSD, Apache, Perl scripts and a SQL database
  • It's 1995: Scale-up the database.
  • It's 1998: LAMP
  • It's 1999: Stateless + Load Balanced + Database + SAN
  • It's 2001: In-memory data-grid.
  • It's 2003: Add a caching layer.
  • It's 2004: Add scale-out and partitioning.
  • It's 2005: Add asynchronous job scheduling and maybe a distributed file system.
  • It's 2007: Move it all into the cloud.
  • It's 2008: Cloud + web scalable database.
  • It's 20??: Cloud + Memory Based Architectures

    You may disagree with the timing of various innovations and you would be correct. I couldn't find a history of the evolution of website architectures, so I just made stuff up. If you have any better information please let me know.

    Why might cloud based memory architectures be the next big thing? For now we'll just address the memory based architecture part of the question, the cloud component is covered a little later.

    Behold the power of keeping data in memory:
    Google query results are now served in under an astonishingly fast 200ms, down from 1000ms in the olden days. The vast majority of this great performance improvement is due to holding indexes completely in memory. Thousands of machines process each query in order to make search results appear nearly instantaneously.
    This text was adapted from notes on Google Fellow Jeff Dean's keynote speech at WSDM 2009.

    Google isn't the only one getting a performance bang from moving data into memory. Both LinkedIn and Digg keep their social network graphs in memory. Facebook has northwards of 800 memcached servers creating a reservoir of 28 terabytes of memory, enabling a 99% cache hit rate. Even little guys can handle 100s of millions of events per day by using memory instead of disk.

    With their new Unified Computing strategy Cisco is also entering the memory game. Their new machines "will be focusing on networking and memory" with servers crammed with 384 GB of RAM, fast processors, and blazingly fast processor interconnects. Just what you need when creating memory based systems.

    Memory Is The System Of Record

    What makes Memory Based Architectures different from traditional architectures is that memory is the system of record. Typically disk based databases have been the system of record. Disk has been King, safely storing data away within its castle walls. Because disk is slow, we've ended up wrapping it in complicated caching and distributed file systems to make it perform.

    Sure, memory is used all over the place as a cache, but we're always supposed to pretend that the cache can be invalidated at any time and old Mr. Reliable, the database, will step in and provide the correct values. In Memory Based Architectures memory is where the "official" data values are stored.

    Caching also serves a different purpose. Cache based architectures aim to minimize the bottleneck of reading through to disk. Memory based architectures can address the entire end-to-end application stack, and data in memory can be of higher reliability and availability than in traditional architectures.

    Memory Based Architectures initially developed out of the need in some application spaces for very low latencies. The dramatic drop in RAM prices, along with the ability of servers to handle larger and larger amounts of RAM, has caused memory architectures to verge on going mainstream. For example, someone recently calculated that 1TB of RAM across 40 servers at 24 GB per server would cost an additional $40,000, which is really quite affordable given the cost of the servers. Projecting out, 1U and 2U rack-mounted servers will soon support a terabyte or more of memory.

    RAM = High Bandwidth And Low Latency

    Why are Memory Based Architectures so attractive? Compared to disk, RAM is a high bandwidth and low latency storage medium. Depending on who you ask, the bandwidth of RAM is around 5 GB/s, while the bandwidth of disk is about 100 MB/s, making RAM roughly 50 times faster. RAM wins. Modern hard drives have latencies around 13 milliseconds, and when many applications are queued for disk reads, latencies can easily be in the many-second range. Memory latency is in the 5 nanosecond range, roughly a million times lower. RAM wins again.
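Using the post's own round numbers, the gap is easy to quantify:

```python
# Back-of-envelope check of the figures quoted above.
ram_bandwidth_mb_s = 5 * 1000        # ~5 GB/s
disk_bandwidth_mb_s = 100            # ~100 MB/s
bandwidth_ratio = ram_bandwidth_mb_s / disk_bandwidth_mb_s

disk_latency_s = 13e-3               # ~13 ms per random disk access
ram_latency_s = 5e-9                 # ~5 ns per memory access
latency_ratio = disk_latency_s / ram_latency_s

print(bandwidth_ratio)   # 50.0 -> RAM has ~50x the bandwidth
print(latency_ratio)     # ~2.6 million -> and far lower latency
```

The latency gap, not the bandwidth gap, is the dramatic one, which is why random-access workloads benefit most from moving to memory.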

    RAM Is The New Disk

    The superiority of RAM is at the heart of the RAM is the New Disk paradigm. As an architecture it combines the holy quadrinity of computing:
  • Performance is better because data is accessed from memory instead of through a database to a disk.
  • Scalability is linear because as more servers are added data is transparently load balanced across the servers so there is an automated in-memory sharding.
  • Availability is higher because multiple copies of data are kept in memory and the entire system reroutes on failure.
  • Application development is faster because there’s only one layer of software to deal with, the cache, and its API is simple. All the complexity is hidden from the programmer which means all a developer has to do is get and put data.

    Accessing disk on the critical path of any transaction limits both throughput and latency. Committing a transaction over the network in-memory is faster than writing through to disk. Reading data from memory is also faster than reading data from disk. So the idea is to skip disk, except perhaps as an asynchronous write-behind option, archival storage, and for large files.
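The asynchronous write-behind option can be sketched in a few lines. The toy class below is my own illustration, not any product's API: writes commit to an in-memory map on the critical path, while a background thread drains a queue to a dictionary standing in for disk.

```python
import queue
import threading

# Toy write-behind store: memory is the system of record; "disk" is
# updated asynchronously, off the critical path.
class WriteBehindStore:
    def __init__(self):
        self.memory = {}             # system of record
        self.disk = {}               # asynchronous archival copy
        self._queue = queue.Queue()
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def put(self, key, value):
        self.memory[key] = value     # the transaction commits here
        self._queue.put((key, value))

    def _drain(self):
        while True:
            key, value = self._queue.get()
            self.disk[key] = value   # slow path, invisible to callers
            self._queue.task_done()

    def flush(self):
        # Block until all queued writes have reached "disk".
        self._queue.join()

store = WriteBehindStore()
store.put("user:1", "alice")   # returns immediately, no disk wait
store.flush()
```

The tradeoff, as the article notes, is that an unflushed write can be lost on failure, which is why memory-based systems lean on replication for durability instead.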

    Or Is Disk The New RAM?

    To be fair there is also a "Disk is the new RAM, RAM is the new cache" paradigm too. This somewhat counterintuitive notion is that a cluster of about 50 disks has the same bandwidth as RAM, so the bandwidth problem is taken care of by adding more disks.

    The latency problem is handled by reorganizing data structures and low level algorithms. It's as simple as avoiding piecemeal reads and organizing algorithms around moving data to and from memory in very large batches and writing highly parallelized programs. While I have no doubt this approach can be made to work by very clever people in many domains, a large chunk of applications spend most of their time in the random access domain, for which RAM based architectures are a better fit.

    Grids And A Few Other Definitions

    There's a constellation of different concepts centered around Memory Based Architectures that we'll need to understand before we can understand the different products in this space. They include:
  • Compute Grid - parallel execution. A Compute Grid is a set of CPUs on which calculations/jobs/work is run. Problems are broken up into smaller tasks and spread across nodes in the grid. The result is calculated faster because it is happening in parallel.
  • Data Grid - a system that deals with data — the controlled sharing and management of large amounts of distributed data.
  • In-Memory Data Grid (IMDG) - parallel in-memory data storage. Data Grids are scaled horizontally, that is, by adding more nodes. Data contention is removed by partitioning data across nodes.
  • Colocation - Business logic and object state are colocated within the same process. Methods are invoked by routing to the object and having the object execute the method on the node it was mapped to. Latency is low because object state is not sent across the wire.
  • Grid Computing - Compute Grids + Data Grids
  • Cloud Computing - datacenter + API. The API allows the set of CPUs in the grid to be dynamically allocated and deallocated.
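A toy example ties these definitions together. The sketch below (names are illustrative, not any product's API) shows a compute grid running a task against each partition of a data grid, colocation style: the task is shipped to the data, and only small partial results cross the wire.

```python
# Toy scatter/gather over a partitioned in-memory data grid.
class Partition:
    def __init__(self, records):
        self.records = records

    def run(self, task):
        # Colocation: the task executes next to the data it needs,
        # so record-level data never crosses the wire.
        return task(self.records)

def scatter_gather(partitions, task, combine):
    # A real grid runs these in parallel across nodes; here, a list
    # comprehension stands in for the parallel dispatch.
    partial_results = [p.run(task) for p in partitions]
    return combine(partial_results)

grid = [Partition([1, 2, 3]), Partition([4, 5]), Partition([6])]
total = scatter_gather(grid, task=sum, combine=sum)
```

Swapping in different `task`/`combine` functions gives the map-reduce and parallel query patterns the products below offer.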

    Who Are The Major Players In This Space?

    With that bit of background behind us, there are several major players in this space (in alphabetical order):
  • Coherence - a peer-to-peer, clustered, in-memory data management system. Coherence is a good match for applications that need write-behind functionality when working with a database and that require multiple applications to have ACID transactions on the database. Java, JavaEE, C++, and .NET.
  • GemFire - an in-memory data caching solution that provides low-latency and near-zero downtime along with horizontal & global scalability. C++, Java and .NET.
  • GigaSpaces - GigaSpaces attacks the whole stack: Compute Grid, Data Grid, Message, Colocation, and Application Server capabilities. This makes for greater complexity, but it means there's less plumbing that needs to be written and developers can concentrate on writing business logic. Java, C, or .Net.
  • GridGain - A compute grid that can operate over many data grids. It specializes in the transparent and low configuration implementation of features. Java only.
  • Terracotta - Terracotta is network-attached memory that allows you to share memory and do anything across a cluster. Terracotta works its magic at the JVM level and provides: high availability, an end of messaging, distributed caching, a single JVM image. Java only.
  • WebSphere eXtreme Scale - operates as an in-memory data grid that dynamically caches, partitions, replicates, and manages application data and business logic across multiple servers.

    This class of products has generally been called In-Memory Data Grids (IMDG), though not all the products fit snugly in this category. There's quite a range of different features amongst the different products.

    I tossed the IMDG acronym in favor of Memory Based Architectures because the "in-memory" part seems redundant, the grid part has given way to the cloud, and the "data" part really can include both data and code. And there are other architectures that will exploit memory yet won't be classic IMDG. So I just used Memory Based Architecture as that's the part that counts.

    Given the wide differences between the products there's no canonical architecture. As an example here's a diagram of how GigaSpaces In-Memory-Data-Grid on the Cloud works.


    Some key points to note are:
  • A POJO (Plain Old Java Object) is written through a proxy using a hash-based data routing mechanism to be stored in a partition on a Processing Unit. Attributes of the object are used as a key. This is straightforward hash based partitioning like you would use with memcached.
  • You are operating through GigaSpace's framework/container so they can automatically handle things like messaging, sending change events, replication, failover, master-worker pattern, map-reduce, transactions, parallel processing, parallel query processing, and write-behind to databases.
  • Scaling is accomplished by dividing your objects into more partitions and assigning the partitions to Processing Unit instances which run on nodes-- a scale-out strategy. Objects are kept in RAM and the objects contain both state and behavior. A Service Grid component supports the dynamic creation and termination of Processing Units.
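The hash-based routing in the first bullet is simple to sketch. Below is an illustrative version (not GigaSpaces' actual API): a stable hash of the key picks the partition, so every client maps a given key to the same node. Python's built-in `hash()` is salted per process, hence `hashlib` for a hash that is stable across clients.

```python
import hashlib

def route(key, num_partitions):
    # Stable hash so every process routes a given key identically.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Four toy partitions standing in for Processing Units holding objects.
partitions = [dict() for _ in range(4)]

def put(key, value):
    partitions[route(key, len(partitions))][key] = value

def get(key):
    return partitions[route(key, len(partitions))].get(key)

put("user:42", {"name": "Ada"})
```

This is the same straightforward partitioning scheme memcached clients use; scaling out means raising the partition count and rebalancing keys across nodes.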

    Not conceptually difficult, and familiar to anyone who has used caching systems like memcached. Only in this case memory is not just a cache, it's the system of record.

    Obviously there are a million more juicy details at play, but that's the gist of it. Admittedly GigaSpaces is on the full featured side of the product equation, but from a memory based architecture perspective the ideas should generalize. When you shard a database, for example, you generally lose the ability to execute queries; you have to do all the assembly yourself. By using the GigaSpaces framework you get a lot of very high-end features like parallel query processing for free.

    The power of this approach certainly comes in part from familiar concepts like partitioning. But the speed of memory versus disk also allows entire new levels of performance and reliability in a relatively simple and easy to understand and deploy package.

    NimbusDB - The Database In The Cloud

    Jim Starkey, President of NimbusDB, is not following the IDMG gang's lead. He's taking a completely fresh approach based on thinking of the cloud as a new platform unto itself. Starting from scratch, what would a database for the cloud look like?

    Jim is in a position to answer this question, as he created a transactional database engine for MySQL named Falcon and added multi-versioning support to InterBase, the first relational database to feature MVCC (Multiversion Concurrency Control).

    What defines the cloud as a platform? Here are some thoughts from Jim I copied out of the Cloud Computing group. You'll notice I've quoted Jim way too much. I did that because Jim is an insightful guy, he has a lot of interesting things to say, and I think he has a different spin on the future of databases in the cloud than anyone else I've read. He also has the advantage, of course, of not having a shipping product, but we shall see.
  • I've probably said this before, but the cloud is a new computing platform that some have learned to exploit, others are scrambling to master, but most people will see as nothing but a minor variation on what they're already doing. This is not new. When time sharing was invented, the batch guys considered it remote job entry, just a variation on batch. When departmental computing came along (VAXes, et al), the timesharing guys considered it nothing but timesharing on a smaller scale. When PCs and client/server computing came along, the departmental computing guys (i.e. DEC) considered PCs to be a special case of smart terminals. And when the Internet blew into town, the client server guys considered it nothing more than a global scale LAN. So the batch guys are dead, the timesharing guys are dead, the departmental computing guys are dead, and the client server guys are dead. Notice a pattern?
  • The reason that databases are important to cloud computing is that virtually all applications involve the interaction of client data with a shared, persistent data store. And while application processing can be easily scaled, the limiting factor is the database system. So if you plan to do anything more than play Tetris in the cloud, the issue of database management should be foremost in your mind.
  • Disks are the limiting factor in contemporary database systems. Horrible things, disks. But conventional wisdom is that you build a clustered database system by starting with a distributed file system. Wrong. Evolution is faster processors, bigger memory, better tools. Revolution is a different way of thinking, a different topology, a different way of putting the parts together.
  • What I'm arguing is that a cloud is a different platform, and what works well on a single computer doesn't work at all well in a cloud, and things that work well in a cloud don't work at all on a single computer system. So it behooves us to re-examine a lot of ancient and honorable assumptions to see if they make any sense at all in this brave new world.
  • Sharing a high performance disk system is fine on a single computer, troublesome in a cluster, and miserable on a cloud.
  • I'm a database guy who's had it with disks. Didn't much like the IBM 1301, and disks haven't gotten much better since. Ugly, warty, slow things that require complex subsystems to hide their miserable characteristics. The alternative is to use the memory in a cloud as a distributed L2 cache. Yes, disks are still there, but they're out of the performance loop except for data so stale that nobody has it in memory.
  • Another machine or set of machines is just as good as a disk. You can quibble about reliable power, etc, but write queuing disks have the same problem.
  • Once you give up the idea of logs and page caches in favor of asynchronous replication, life gets a great deal brighter. It really does make sense to design to the strengths of clouds (redundancy) rather than their weaknesses (shared anything).
  • And while one guy is fetching his 100 MB per second, the disk is busy and everyone else is waiting in line contemplating existence. Even the cheapest of servers have two gigabit ethernet channels and a switch. The network serves everyone in parallel while the disk is single threaded.
  • I favor data sharing through a formal abstraction like a relational database. Shared objects are things most programmers are good at handling. The fewer the things that application developers need to manage the more likely it is that the application will work.
  • I buy the model of object level replication, but only as a substrate for something with a more civilized API. Or in other words, it's a foundation, not a house.
  • I'd much rather have a pair of quad-core processors running as independent servers than contending for memory on a dual socket server. I don't object to more cores per processor chip, but I don't want to pay for die size for cores perpetually stalled for memory.
  • The object substrate worries about data distribution and who should see what. It doesn't even know it's a database. SQL semantics are applied by an engine layered on the object substrate. The SQL engine doesn't worry or even know that it's part of a distributed database -- it just executes SQL statements. The black magic is MVCC.
  • I'm a database developer building a database system for clouds. Tell me what you need. Here is my first approximation: A database that scales by adding more computers and degrades gracefully when machines are yanked out; A database system that never needs to be shut down; Hardware and software fault tolerance; Multi-site archiving for disaster survival; A facility to reach into the past to recover from human errors (drop table customers; oops;); Automatic load balancing
  • MySQL scales with read replication which requires a full database copy to start up. For any cloud relevant application, that's probably hundreds of gigabytes. That makes it a mighty poor candidate for on-demand virtual servers.
  • Do remember that the primary function of a database system is to maintain consistency. You don't want a dozen people each draining the last thousand bucks from a bank account, or a debit to happen without the corresponding credit.
  • Whether the data moves to the work or the work moves to the data isn't that important as long as they both end up at the same place with as few intermediate round trips as possible.
  • In my area, for example, databases are either limited by the biggest, ugliest machine you can afford *or* you have to learn to operate without consistent, atomic transactions. A bad rock / hard place choice that sends the cost of scalable application development through the ceiling. Once we solve that, applications that serve 20,000,000 users will be simple and cheap to write. Who knows where that will go?
  • To paraphrase our new president, we must reject the false choice between data consistency and scalability.
  • Cloud computing is about using many computers to scale problems that were once limited by the capabilities of a single computer. That's what makes clouds exciting, at least to me. But most will argue that cloud computing is a better economic model for running many instances of a single computer. Bah, I say, bah!
  • Cloud computing is a wonderful new platform. Let's not let the dinosaurs waiting for extinction define it as a minor variation of what they've been doing for years. They will, of course, but this (and the dinosaurs) will pass.
  • The revolutionary idea is that applications don't run on a single computer but on an elastic cloud of computers that grows and contracts with demand. This, in turn, requires an application infrastructure that can a) run a single application across as many machines as necessary, and b) run many applications on the same machines without any of the cross talk and software maintenance problems of years past. No, the software infrastructure required to enable this is not mature and certainly not off the shelf, but many smart folks are working on it.
  • There's nothing limiting in relational except the companies that build them. A relational database can scale as well as BigTable and SimpleDB but still be transactional. And, unlike BigTable and SimpleDB, a relational database can model relationships and do exotic things like transferring money from one account to another without "breaking the bank." It is true that existing relational database systems are largely constrained to a single CPU or a cluster with a shared file system, but we'll get over that.
  • Personally, I don't like masters any more than I like slaves. I strongly favor peer to peer architectures with no single point of failure. I also believe that database federation is a work-around rather than a feature. If a database system had sufficient capacity, reliability, and availability, nobody would ever partition or shard data. (If one database instance is a headache, a million tiny ones is a horrible, horrible migraine.)
  • Logic does need to be pushed to the data, which is why relational database systems destroyed hierarchical (IMS), network (CODASYL), and OODBMS. But there is a constant need to push semantics higher to further reduce the number of round trips between application semantics and the database systems. As for I/O, a database system that can use the cloud as an L2 cache breaks free from dependencies on file systems. This means that bandwidth and cycles are the limiting factors, not I/O capacity.
  • What we should be talking about is trans-server application architecture, trans-server application platforms, both, or whether one will make the other unnecessary.
  • If you scale, you don't/can't worry about server reliability. Money spent on (alleged) server reliability is money wasted.
  • If you view the cloud as a new model for scalable applications, it is a radical change in computing platform. But if, like most people, you see the cloud through the lens of EC2, which is just another way to run a server that you have to manage and control, then the cloud is little more than a rather boring business model. When clouds evolve to the point that applications and databases can utilize whatever resources they need to meet demand without the constraint of single machine limitations, we'll have something really neat.
  • On MVCC: Forget about the concept of master. Synchronizing slaves to a master is hopeless. Instead, think of a transaction as a temporal view of database state; different transactions will have different views. Certain critical operations must be serialized, but that still doesn't require that all nodes have identical views of database state.
  • Low latency is definitely good, but I'm designing the system to support geographically separated sub-clouds. How well that works under heavy load is probably application specific. If the amount of volatile data common to the sub-clouds is relatively low, it should work just fine provided there is enough bandwidth to handle the replication messages.
  • MVCC tracks multiple versions to provide a transaction with a view of the database consistent with the instant it started, while preventing a transaction from updating a piece of data that it could not see. MVCC is consistent, but it is not serializable. Opinions vary between academia and the real world, but most database practitioners recognize that the consistency provided by MVCC is sufficient for programmers of modest skills to produce robust applications.
  • MVCC, heretofore, has been limited to single node databases. Applied to the cloud with suitable bookkeeping to control visibility of updates on individual nodes, MVCC is as close to black magic as you are likely to see in your lifetime, enabling concurrency and consistency with mostly non-blocking, asynchronous messaging. It does, however, dispense with the idea that a cloud has at any given point of time a single definitive state. Serializability implemented with record locking is an attempt to make distributed systems march in lock-step so that the result is as if there were no parallelism between nodes. MVCC recognizes that parallelism is the key to scalability. Data that is a few microseconds old is not a problem as long as updates don't collide.
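    The visibility rule Jim describes can be sketched in a few lines. This is a toy single-node model, not NimbusDB's implementation: each key keeps a list of committed versions, a transaction's snapshot reads the newest version committed at or before its start timestamp, and a first-committer-wins check at commit time stops colliding updates.

    ```python
    import itertools

    class MVCCStore:
        """Toy single-node MVCC store: snapshot reads, first-committer-wins."""
        def __init__(self):
            self._clock = itertools.count(1)   # logical timestamps
            self._versions = {}                # key -> [(commit_ts, value)], newest last

        def begin(self):
            # A transaction is "a temporal view of database state".
            return {"start_ts": next(self._clock), "writes": {}}

        def read(self, txn, key):
            # Newest version committed at or before this snapshot's start.
            for ts, value in reversed(self._versions.get(key, [])):
                if ts <= txn["start_ts"]:
                    return value
            return None

        def write(self, txn, key, value):
            txn["writes"][key] = value         # buffered until commit

        def commit(self, txn):
            # Abort if any written key gained a newer version since our snapshot:
            # a transaction may not update data it could not see.
            for key in txn["writes"]:
                versions = self._versions.get(key, [])
                if versions and versions[-1][0] > txn["start_ts"]:
                    raise RuntimeError(f"write-write conflict on {key!r}")
            commit_ts = next(self._clock)
            for key, value in txn["writes"].items():
                self._versions.setdefault(key, []).append((commit_ts, value))
    ```

    Note that two transactions can read and commit concurrently without blocking each other; only colliding writes to the same key force an abort, which is exactly the "parallelism is the key to scalability" trade-off.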

    Jim certainly isn't shy with his opinions :-)

    My summary of what he wants to do with NimbusDB is:
  • Make a scalable relational database in the cloud where you can use normal everyday SQL to perform summary functions, define referential integrity, and all that other good stuff.
  • Transactions scale using a distributed version of MVCC, which I do not believe has been done before. This is the key part of the plan and a lot depends on it working.
  • The database is stored primarily in RAM which makes cloud level scaling of an RDBMS possible.
  • The database will handle all the details of scaling in the cloud. To the developer it will look like just a very large highly available database.

    I'm not sure if NimbusDB will support a compute grid and map-reduce type functionality. The low latency argument for data and code collocation is a good one, so I hope it integrates some sort of extension mechanism.

    Why might NimbusDB be a good idea?
  • Keeps simple things simple. Web scale databases like BigTable and SimpleDB make simple things difficult. They are full of quotas, limits, and restrictions because by their very nature they are just a key-value layer on top of a distributed file system. The database knows as little about the data as possible. If you want to build a sequence number for a comment system, for example, it takes complicated sharding logic to remove write contention. Developers are used to SQL and are comfortable working within the transaction model, so the transition to cloud computing would be that much easier. Now, to be fair, who knows if NimbusDB will be able to scale under high load either, but we need to make simple things simple again.
  • Language independence. Notice that the IMDG products are all language specific. They support some combination of .Net/Java/C/C++. This is because they need low level object knowledge to transparently implement their magic. This isn't bad, but it does mean if you use Python, Erlang, Ruby, or any other unsupported language then you are out of luck. As many problems as SQL has, one of its great gifts is programmatic universal access.
  • Separates data from code. Data is forever, code changes all the time. That's one of the common reasons for preferring a database instead of an objectbase. This also dovetails with the language independence issue. Any application can access data from any language and any platform from now and into the future. That's a good quality to have.
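    The sequence-number example above is worth making concrete. A sketch of the kind of sharding logic a key-value store forces on you (the class and shard-key naming here are hypothetical, not any product's API): increments are spread across N shard keys so concurrent writers rarely contend on the same key, and reading the total means summing every shard.

    ```python
    import random

    class ShardedCounter:
        """Counter spread over N keys to dodge single-key write contention."""
        def __init__(self, kv, name, num_shards=16):
            self.kv = kv                 # stand-in for a key-value store
            self.name = name
            self.num_shards = num_shards

        def increment(self):
            # Random shard choice: concurrent writers mostly touch different keys.
            shard = f"{self.name}:shard:{random.randrange(self.num_shards)}"
            self.kv[shard] = self.kv.get(shard, 0) + 1

        def value(self):
            # Reads must fan out and sum all shards.
            return sum(self.kv.get(f"{self.name}:shard:{i}", 0)
                       for i in range(self.num_shards))
    ```

    Even this only yields a contention-free count; handing out dense, gap-free sequence numbers is harder still. In a relational database the same job is one auto-increment column, which is the "keeps simple things simple" point.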

    The smart money has been that cloud level scaling requires abandoning relational databases and distributed transactions. That's why we've seen an epidemic of key-value databases and eventually consistent semantics. It will be fascinating to see if Jim's combination of Cloud + Memory + MVCC can prove the insiders wrong.

    Are Cloud Based Memory Architectures The Next Big Thing?

    We've gone through a couple of different approaches to deploying Memory Based Architectures. So are they the next big thing?

    Adoption has been slow because it's new and different and that inertia takes a while to overcome. Historically tools haven't made it easy for early adopters to make the big switch, but that is changing with easier to deploy cloud based systems. And current architectures, with a lot of elbow grease, have generally been good enough.

    But we are seeing a wide convergence on caching as a way to make slow disks perform. Truly enormous amounts of effort are going into adding cache and then trying to keep the database and applications all in sync with the cache as bottom-up and top-down driven changes flow through the system.

    After all that work it's a simple step to wonder why that extra layer is needed when the data could just as well have been kept in memory from the start. Now add the ease of cloud deployments and the ease of creating scalable, low latency applications that are still easy to program, manage, and deploy. Building multiple complicated layers of application code just to make the disk happy will make less and less sense over time.
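    The bookkeeping being described is the familiar cache-aside pattern. A minimal illustrative sketch (plain dicts standing in for memcached and a disk-backed database, not any real client API): reads fill the cache on a miss, and writes must remember to invalidate the cached copy or readers see stale data.

    ```python
    class CacheAsideStore:
        """Minimal cache-aside layer: the extra code the cache forces you to write."""
        def __init__(self, db):
            self.db = db        # stand-in for a slow disk-backed database
            self.cache = {}     # stand-in for memcached

        def get(self, key):
            if key in self.cache:
                return self.cache[key]      # cache hit: no disk touched
            value = self.db.get(key)        # cache miss: go to disk...
            if value is not None:
                self.cache[key] = value     # ...and fill the cache
            return value

        def put(self, key, value):
            self.db[key] = value
            # Invalidate rather than update, to avoid racing concurrent readers;
            # forget this line anywhere in the codebase and reads go stale.
            self.cache.pop(key, None)
    ```

    Every write path in the application has to carry that invalidation logic. Keeping the data in memory from the start makes the whole layer, and its failure modes, disappear.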

    We are on the edge of two potent technological changes: Clouds and Memory Based Architectures. This evolution will rip open a chasm where new players can enter and prosper. Google is the master of disk. You can't beat them at a game they perfected. Disk based databases like SimpleDB and BigTable are complicated beasts, typical last gasp products of any aging technology before a change. The next era is the age of Memory and Cloud which will allow for new players to succeed. The tipping point is soon.

    Related Articles

  • GridGain: One Compute Grid, Many Data Grids
  • GridGain vs Hadoop
  • Cameron Purdy: Defining a Data Grid
  • Compute Grids vs. Data Grids
  • Performance killer: Disk I/O by Nathanael Jones
  • RAM is the new disk... by Steven Robbins
  • Talk on disk as the new RAM by Greg Linden
  • Disk-Based Parallel Computation, Rubik's Cube, and Checkpointing by Gene Cooperman, Northeastern Professor, High Performance Computing Lab - Disk is the new RAM and RAM is the new cache
  • Disk is the new disk by David Hilley.
  • Latency lags bandwidth by David A. Patterson
  • InfoQ Article - RAM is the new disk... by Nati Shalom
  • Tape is Dead Disk is Tape Flash is Disk RAM Locality is King by Jim Gray
  • Product: ScaleOut StateServer is Memcached on Steroids
  • Latency is Everywhere and it Costs You Sales - How to Crush it
  • Virtualization for High Performance Computing by Shai Fultheim
  • Multi-Multicore Single System Image / Cloud Computing. A Good Idea? (part 1) by Greg Pfister
  • How do you design and handle peak load on the Cloud ? by Cloudiquity.
  • The Share-Nothing Architecture by Zef Hemel.
  • Scaling memcached at Facebook
  • Cache-aside, write-behind, magic and why it sucks being an Oracle customer by Stefan Norberg.
  • Introduction to Terracotta by Mike
  • The five-minute rule twenty years later, and how flash memory changes the rules by Goetz Graefe