Thursday 17 May 2012

Article: If all these new DBMS technologies are so scalable, why are Oracle and DB2 still on top of TPC-C? A roadmap to end their dominance.


http://dbmsmusings.blogspot.com/2012/05/if-all-these-new-dbms-technologies-are.html

(This post is coauthored by Alexander Thomson and Daniel Abadi)
In the last decade, database technology has arguably progressed furthest along the scalability dimension. There have been hundreds of research papers, dozens of open-source projects, and numerous startups attempting to improve the scalability of database technology. Many of these new technologies have been extremely influential---some papers have earned thousands of citations, and some new systems have been deployed by thousands of enterprises.

So let's ask a simple question: If all these new technologies are so scalable, why on earth are Oracle and DB2 still on top of the TPC-C standings? Go to the TPC-C Website and look at the top 10 results in raw transactions per second. As of today (May 16th, 2012), Oracle 11g is used for 3 of the results (including the top result), 10g is used for 2 of the results, and the rest of the top 10 is filled with various versions of DB2. How is technology designed decades ago still dominating TPC-C? What happened to all these new technologies with all these scalability claims?

The surprising truth is that these new DBMS technologies are not listed in the TPC-C top ten results not because they do not care enough to enter, but rather because they would not win if they did.

To understand why this is the case, one must understand that scalability does not come for free. Something must be sacrificed to achieve high scalability. Today, there are three major categories of tradeoff that can be exploited to make a system scale. The new technologies basically fall into two of these categories; Oracle and DB2 fall into a third. And the later parts of this blog post describe research from our group at Yale that introduces a fourth category of tradeoff that provides a roadmap to end the dominance of Oracle and DB2.

These categories are:

(1) Sacrifice ACID for scalability. Our previous post on this topic discussed this in detail. Basically, we argue that a major class of new scalable technologies falls under the category of "NoSQL", which achieves scalability by dropping ACID guarantees, thereby allowing these systems to eschew two phase locking, two phase commit, and other impediments to concurrency and processor independence that hurt scalability. All of these systems that relax ACID are immediately ineligible to enter the TPC-C competition since ACID guarantees are one of TPC-C's requirements. That's why you don't see NoSQL databases in the TPC-C top 10---they are immediately disqualified.

(2) Reduce transaction flexibility for scalability. There are many so-called "NewSQL" databases that claim to be both ACID-compliant and scalable. And these claims are true---to a degree. However, the fine print is that they are only linearly scalable when transactions can be completely isolated to a single "partition" or "shard" of data. While these NewSQL databases often hide the complexity of sharding from the application developer, they still rely on the shards being fairly independent. As soon as a transaction needs to span multiple shards (e.g., update two different user records on two different shards in the same atomic transaction), these NewSQL systems all run into problems. Some simply reject such transactions. Others allow them, but need to perform two phase commit or other agreement protocols in order to ensure ACID compliance (since each shard may fail independently). Unfortunately, agreement protocols such as two phase commit come at a great scalability cost (see our 2010 paper that explains why). Therefore, NewSQL databases only scale well if multi-shard transactions (also called "distributed transactions" or "multi-partition transactions") are very rare. Unfortunately for these databases, TPC-C models a fairly reasonable retail application where customers buy products and the inventory needs to be updated in the same atomic transaction. 10% of TPC-C New Order transactions involve customers buying products from a "remote" warehouse, which is generally stored in a separate shard. Therefore, even for basic applications like TPC-C, NewSQL databases lose their scalability advantages. That's why the NewSQL databases do not enter TPC-C results---even just 10% multi-shard transactions causes their performance to degrade rapidly.
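To make the cost of multi-shard transactions concrete, here is a minimal, purely illustrative Python sketch of the two-phase commit pattern described above. All class and function names are made up for this sketch; no real database's API is shown. The point is that every multi-shard transaction adds two rounds of coordinator-to-participant messaging, during which each prepared shard must hold its locks:

```python
# Toy sketch of two-phase commit (2PC), the agreement protocol the post
# says NewSQL systems fall back on for multi-shard transactions.
# Names are illustrative only; real systems also handle failures, logging,
# and recovery, which add further cost.

class Shard:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.staged = None

    def prepare(self, writes):
        # Phase 1: stage the writes and vote. While "prepared", the shard
        # must hold locks on these records until the coordinator decides.
        self.staged = writes
        return True  # vote YES (a real shard could vote NO on conflict or failure)

    def commit(self):
        # Phase 2: apply the staged writes and release locks.
        self.data.update(self.staged)
        self.staged = None

def two_phase_commit(shards, writes_per_shard):
    """Coordinator logic: two message rounds to every participant."""
    votes = [s.prepare(writes_per_shard[s.name]) for s in shards]
    if all(votes):
        for s in shards:
            s.commit()
        return "committed"
    return "aborted"

a, b = Shard("a"), Shard("b")
# A multi-shard transaction: update one record on each shard atomically.
result = two_phase_commit([a, b], {"a": {"user1": 10}, "b": {"user2": 20}})
```

Even in this failure-free sketch, the coordinator cannot let either shard release its locks until both have voted, which is exactly the throughput bottleneck the post describes.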

(3) Trade cost for scalability. If you use high end hardware, it is possible to get stunningly high transactional throughput using old database technologies that don't have shared-nothing horizontal scalability. Oracle tops TPC-C with an incredibly high throughput of 500,000 transactions per second. There exists no application in the modern world that produces more than 500,000 transactions per second (as long as humans are initiating the transactions---machine-generated transactions are a different story). Therefore, Oracle basically has all the scalability that is needed for human scale applications. The only downside is cost---the Oracle system that is able to achieve 500,000 transactions per second costs a prohibitive $30,000,000!

Since the first two types of tradeoffs are immediate disqualifiers for TPC-C, the only remaining thing to give up is cost-for-scale, and that's why the old database technologies are still dominating TPC-C. None of these new technologies can handle both ACID and 10% remote transactions.

A fourth approach...

TPC-C is a very reasonable application. New technologies should be able to handle it. Therefore, at Yale we set out to find a new dimension in this tradeoff space that could allow a system to handle TPC-C at scale without costing $30,000,000. Indeed, we are presenting a paper next week at SIGMOD (see the full paper) that describes a system that can achieve 500,000 ACID-compliant TPC-C New Order transactions per second using commodity hardware in the cloud. The cost to us to run these experiments was less than $300 (of course, this is renting hardware rather than buying, so it's hard to compare prices---but still---a factor of 100,000 less than $30,000,000 is quite large).

Calvin, our prototype system designed and built by a large team of researchers at Yale that includes Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Anton Petrov, Michael Giuffrida, and Aaron Segal (in addition to the authors of this blog post), explores a tradeoff very different from the three described above. Calvin requires all transactions to be executed fully server-side and sacrifices the freedom to non-deterministically abort or reorder transactions on-the-fly during execution. In return, Calvin gets scalability, ACID-compliance, and extremely low-overhead multi-shard transactions over a shared-nothing architecture. In other words, Calvin is designed to handle high-volume OLTP throughput on sharded databases on cheap, commodity hardware stored locally or in the cloud. Calvin significantly improves the scalability over our previous approach to achieving determinism in database systems.

Scaling ACID

The key to Calvin's strong performance is that it reorganizes the transaction execution pipeline normally used in DBMSs according to the principle: do all the "hard" work before acquiring locks and beginning execution. In particular, Calvin moves the following stages to the front of the pipeline:

  • Replication. In traditional systems, replicas agree on each modification to database state only after some transaction has made the change at some "master" replica. In Calvin, all replicas agree in advance on the sequence of transactions that they will (deterministically) attempt to execute.
  • Agreement between participants in distributed transactions. Database systems traditionally use two-phase commit (2PC) to handle distributed transactions. In Calvin, every node sees the same global sequence of transaction requests, and is able to use this already-agreed-upon information in place of a commit protocol.
  • Disk accesses. In our VLDB 2010 paper, we observed that deterministic systems performed terribly in disk-based environments due to holding locks for the 10ms+ duration of reading the needed data from disk, since they cannot reorder conflicting transactions on the fly. Calvin gets around this setback by prefetching into memory all records that a transaction will need during the replication phase---before locks are even acquired.
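The pipeline above can be sketched end-to-end. The following is a toy Python illustration (all function names are invented for this sketch, and the real system is far more involved): agree on a global transaction log first, then let every shard execute that log deterministically with no commit protocol.

```python
# Illustrative sketch (not Calvin's actual code) of the reorganized pipeline:
# the "hard" agreement work happens up front, producing an ordered log;
# execution at each shard then just applies the log in order.

def sequence(transactions):
    # Stand-in for the replication/agreement phase: every replica and shard
    # receives the same ordered log before any locks are acquired.
    return list(enumerate(transactions))

def execute_on_shard(shard_state, log):
    # Each shard applies, in log order, the operations touching its records.
    # No 2PC is needed: the order was already agreed upon globally.
    for txn_id, txn in log:
        for key, delta in txn.items():
            if key in shard_state:
                shard_state[key] += delta
    return shard_state

log = sequence([{"x": 1, "y": 2}, {"x": 5}])  # first txn spans both shards
shard_a = execute_on_shard({"x": 0}, log)      # shard a holds record "x"
shard_b = execute_on_shard({"y": 0}, log)      # shard b holds record "y"
```

Note how the multi-shard transaction needs no runtime coordination between the two shards: each independently executes its portion of the same pre-agreed sequence.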

As a result, each transaction's user-specified logic can be executed at each shard with an absolute minimum of runtime synchronization between shards or replicas to slow it down, even if the transaction's logic requires it to access records at multiple shards. By minimizing the time that locks are held, concurrency can be greatly increased, thereby leading to near-linear scalability on a commodity cluster of machines.

Strongly consistent global replication

Calvin's deterministic execution semantics provide an additional benefit: replicating transactional input is sufficient to achieve strongly consistent replication. Since replicating batches of transaction requests is extremely inexpensive and happens before the transactions acquire locks and begin executing, Calvin's transactional throughput capacity does not depend at all on its replication configuration.

In other words, not only can Calvin run 500,000 transactions per second on 100 EC2 instances in Amazon's US East (Virginia) data center, it can maintain strongly-consistent, up-to-date 100-node replicas in Amazon's Europe (Ireland) and US West (California) data centers---at no cost to throughput.

Calvin accomplishes this by having replicas perform the actual processing of transactions completely independently of one another, maintaining strong consistency without having to constantly synchronize transaction results between replicas. (Calvin's end-to-end transaction latency does depend on message delays between replicas, of course---there is no getting around the speed of light.)
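The idea of replicating inputs rather than results can be shown in a few lines. This is a hedged sketch only (the function below is invented for illustration): because execution is deterministic, shipping the same ordered batch of transaction requests to each replica guarantees they converge to identical state with no result synchronization.

```python
# Sketch of input replication: deterministic execution means
# same ordered input log in -> same final state out, on every replica.
# No results ever need to be shipped between replicas.

def apply_log(state, log):
    # Deterministic, order-sensitive execution of a batch of requests.
    for txn in log:
        for key, delta in txn.items():
            state[key] = state.get(key, 0) + delta
    return state

input_log = [{"balance": 100}, {"balance": -30}]  # the replicated batch of requests
virginia = apply_log({}, input_log)   # replica in one data center
ireland  = apply_log({}, input_log)   # replica in another data center
```

Each replica processed the log completely independently, yet both hold exactly the same state, which is the property the paragraph above describes.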

Flexible data model

So where does Calvin fall in the OldSQL/NewSQL/NoSQL trichotomy?

Actually, nowhere. Calvin is not a database system itself, but rather a transaction scheduling and replication coordination service. We designed the system to integrate with any data storage layer, relational or otherwise. Calvin allows user transaction code to access the data layer freely, using any data access language or interface supported by the underlying storage engine (so long as Calvin can observe which records user transactions access). The experiments presented in the paper use a custom key-value store. More recently, we've hooked Calvin up to Google's LevelDB and added support for SQL-based data access within transactions, building relational tables on top of LevelDB's efficient sorted-string storage.
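The layering described above can be illustrated with a minimal sketch, under the assumption (ours, not the paper's) that the storage engine only needs to expose a get/put-style interface. The `DictStore` class and `run_transactions` function below are hypothetical names invented for this example:

```python
# Sketch of a scheduling layer that is agnostic to the storage engine:
# any object with get/put methods (an in-memory dict, a LevelDB wrapper,
# a relational table adapter, ...) can plug in underneath.

class DictStore:
    """Simplest possible storage engine: an in-memory key-value store."""
    def __init__(self):
        self._d = {}
    def get(self, key):
        return self._d.get(key)
    def put(self, key, value):
        self._d[key] = value

def run_transactions(store, log):
    # The scheduler executes the already-agreed-upon log against whatever
    # engine it was given; each transaction is a function over the store.
    for txn in log:
        txn(store)

store = DictStore()
run_transactions(store, [
    lambda s: s.put("k", 1),
    lambda s: s.put("k", (s.get("k") or 0) + 41),
])
```

Swapping `DictStore` for any other object with the same two methods changes nothing in the scheduling layer, which is the sense in which the service is storage-independent.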

From an application developer's point of view, Calvin's primary limitation compared to other systems is that transactions must be executed entirely server-side. Calvin has to know in advance what code will be executed for a given transaction. Users may pre-define transactions directly in C++, or submit arbitrary Python code snippets on-the-fly to be parsed and executed as transactions.
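A fully server-side transaction in this style might look roughly like the following. This is our own hypothetical example, not Calvin's actual transaction API: the entire logic, including the conditional abort, is declared up front as one unit of code.

```python
# Hypothetical pre-defined, stored-procedure-style transaction: all logic
# runs server-side in one shot, with no mid-transaction round trips back
# to an application server.

def new_order(db, customer, item):
    """Buy one unit of `item` for `customer`, atomically."""
    stock = db[item]
    if stock <= 0:
        return "rejected"          # deterministic, declared-up-front abort path
    db[item] = stock - 1           # decrement inventory
    db[customer] = db.get(customer, 0) + 1  # record the purchase
    return "ok"

db = {"widget": 2}
status = new_order(db, "alice", "widget")
```

Because the system sees the whole function before execution, it can determine which records the transaction will touch, which is what the prefetching and lock-scheduling described earlier rely on.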

For some applications, this requirement of completely server-side transactions might be a difficult limitation. However, many applications prefer to execute transaction code on the database server anyway (in the form of stored procedures), in order to avoid multiple round trip messages between the database server and application server in the middle of a transaction.

If this limitation is acceptable, Calvin presents a nice alternative in the tradeoff space to achieving high scalability without sacrificing ACID or multi-shard transactions. Hence, we believe that our SIGMOD paper may present a roadmap for overcoming the scalability dominance of the decades-old database solutions on traditional OLTP workloads. We look forward to debating the merits of this approach in the weeks ahead (and Alex will be presenting the paper at SIGMOD next week).

Friday 11 May 2012

InfoQ: Panel: Multicore, Manycore, and Cloud Computing

Note: The biggest challenges are the correctness and performance of parallel execution.
In terms of correctness: false sharing, atomicity, consistency model.
In terms of performance: locking, synchronization, deadlock.