Tuesday 18 January 2011

Google Megastore: The Data Engine Behind GAE

Google Megastore: The Data Engine Behind GAE: "

Megastore is the data engine supporting the Google
Application Engine
.
It’s a scalable structured data store providing full ACID semantics within partitions
but lower consistency guarantees across partitions.









I wrote up some notes on it back in 2008 Under
the Covers of the App Engine Datastore
and
posted Phil Bernstein’s excellent notes from a 2008
SIGMOD
talk: Google
Megastore
. But there has been remarkably
little written about this datastore over the intervening couple of years until this
year’s CIDR conference papers were posted. CIDR 2011 includes Megastore:
Providing Scalable, Highly Available Storage for Interactive Services
.








My rough notes from the paper:



· Megastore
is built upon BigTable



· Bigtable
supports fault-tolerant storage within a single datacenter



· Synchronous
replication based upon Paxos and
optimized for long distance inter-datacenter links



· Partitioned
into a vast space of small databases each with its own replicated log



· Each
log stored across a Paxos cluster



· Because
they are so aggressively partitioned, each Paxos group only has to accept logs for
operations on a small partition. However, the design does serialize updates on each
partition



· 3
billion writes and 20 billion read transactions per day



· Support
for consistency unusual for a NoSQL database but driven by (what I believe to be)
the correct belief that inconsistent updates make many applications difficult to write
(see I
Love Eventual Consistency but …
)



· Data
Model:




· The
data model is declared in a strongly typed schema



· There
are potentially many tables per schema



· There
are potentially many entities per table



· There
are potentially many strongly typed properties per entity



· Repeating
properties are allowed



· Tables
can be arranged hierarchically where child tables point to root tables



· Megastore
tables are either entity group root tables or child tables



· The
root table and all child tables are stored in the same entity group



· Secondary
indexes are supported



· Local
secondary indexes index a specific entity group and are maintained consistently



· Global
secondary indexes index across entity groups are asynchronously updates and eventually
consistent



· Repeated
indexes: supports indexing repeated values (e.g. photo tags)



· Inline
indexes provide a way to denormalize data
from source entities into a related target entity as a virtual repeated column.



· There
are physical storage hints:



· “IN
TABLE” directs Megastore to store two tables in the same underlying BigTable



· “SCATTER”
attribute prepends a 2 byte hash to each key to cool hot spots on tables with monotonically
increasing values like dates (e.g. a history table).



· “STORING”
clause on an index supports index-only-access by redundantly storing additional data
in an index. This avoids the double access often required of doing a secondary index
lookup to find the correct entity and then selecting the correct properties from that
entity through a second table access. By pulling values up into the secondary index,
the base table doesn’t need to be accessed to obtain these properties.



· 3
levels of read consistency:



· Current:
Last committed value



· Snapshot:
Value as of start of the read transaction



· Inconsistent
reads: used for cross entity group reads



· Update
path:



· Transaction
writes its mutations to the entity groups write-ahead log and then apply the mutations
to the data (write ahead logging).



· Write
transaction always begins with a current read to determine the next available log
position. The commit operation gathers mutations into a log entry, assigns an increasing
timestamp, and appends to log which is maintained using paxos.



· Update
rates within a entity group are seriously limited by:



· When
there is log contention, one wins and the rest fail and must be retried.




· Paxos
only accepts a very limited update rate (order 10^2 updates per second).



· Paper
reports that “limiting updates within an entity group to a few writes per second per
entity group yields insignificant write conflicts”



· Implication:
programmers must shard aggressively to get even moderate update rates and consistent
update across shards is only supported using two phase commit which is not recommended.



· Cross
entity group updates are supported by:



· Two-phase
commit
with the fragility
that it brings



· Queueing
and asynchronously applying the changes



· Excellent
support for backup and redundancy:



· Synchronous
replication to protect against media failure



· Snapshots
and incremental log backups








Overall, an excellent paper with
lots of detail on a nicely executed storage system. Supporting consistent read and
full ACID update semantics is impressive although the limitation of not being able
to update an entity group at more than a “few per second” is limiting.














Thanks to Zhu
Han
, Reto
Kramer
, and Chris
Newcombe
for all sending
this paper my way.








--jrh






















From Perspectives."

No comments:

Post a Comment