
Tuesday, 31 December 2013

Top Posts of 2013: Big Data Beyond MapReduce: Google's Big Data Papers | Architects Zone

Source: http://architects.dzone.com/articles/big-data-beyond-mapreduce

Mainstream Big Data is all about MapReduce, but when looking at real-time data, the limitations of that approach are starting to show. In this post, I'll review Google's most important Big Data publications and discuss where the company is headed (as far as it has disclosed).

MapReduce, Google File System and Bigtable: the mother of all big data algorithms

Chronologically, the first paper is the Google File System paper from 2003, which describes a distributed file system. Basically, files are split into chunks which are stored in a redundant fashion on a cluster of commodity machines. (Every article about Google has to include the term "commodity machines"!)
Next up is the MapReduce paper from 2004. MapReduce has become synonymous with Big Data. Legend has it that Google used it to compute their search indices. I imagine it worked like this: they had all the crawled web pages sitting on their cluster, and every day or so they ran MapReduce to recompute everything.
Next up is the Bigtable paper from 2006, which has become the inspiration for countless NoSQL databases like Cassandra, HBase, and others. About half of the architecture of Cassandra is modeled after Bigtable, including the data model, SSTables, and write-ahead (commit) logs (the other half being Amazon's Dynamo database for the peer-to-peer clustering model).

Percolator: Handling individual updates

Google didn't stop with MapReduce. In fact, with the exponential growth of the Internet, it became impractical to recompute the whole search index from scratch. Instead, Google developed a more incremental system, which still allowed for distributed computing.
Now here is where it gets interesting, particularly when compared with the common message coming from mainstream Big Data. For example, Google reintroduced transactions, something NoSQL still tells you that you don't need, or cannot have, if you want scalability.
In the Percolator paper from 2010, they describe how Google keeps its web search index up to date. Percolator is built on existing technologies like Bigtable, but adds transactions and locks on rows and tables, as well as notifications for changes in the tables. These notifications are then used to trigger the different stages of a computation. This way, individual updates can "percolate" through the database.
This approach is reminiscent of stream processing frameworks (SPFs) like Twitter's Storm or Yahoo's S4, but with an underlying database. SPFs usually use message passing and no shared data, which makes it easier to reason about what is happening, but also means there is no way to access the result of the computation unless you manually store it somewhere at the end.
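To illustrate only the notification-driven, incremental flavor of this idea (a toy sketch, not Percolator's actual API; the table abstraction, observer registration, and column names below are hypothetical, and real Percolator additionally provides cross-row transactions and locks, which this omits):

    from collections import deque

    table = {}            # {(row, column): value} -- stand-in for a Bigtable-like store
    observers = {}        # {column: [callback, ...]}
    pending = deque()     # queue of (row, column) change notifications

    def on_change(column, callback):
        observers.setdefault(column, []).append(callback)

    def write(row, column, value):
        table[(row, column)] = value
        pending.append((row, column))   # notify observers that this cell changed

    def run_until_quiet():
        # Run observers until no notifications remain, letting updates "percolate".
        while pending:
            row, column = pending.popleft()
            for callback in observers.get(column, []):
                callback(row)

    # Hypothetical pipeline: a new document triggers link extraction,
    # which in turn triggers an (entirely fake) inbound-link counter.
    def extract_links(row):
        doc = table[(row, "raw")]
        write(row, "links", [w for w in doc.split() if w.startswith("http")])

    def count_links(row):
        for target in table[(row, "links")]:
            write(target, "inlinks", table.get((target, "inlinks"), 0) + 1)

    on_change("raw", extract_links)
    on_change("links", count_links)

    write("doc1", "raw", "see http://a and http://b")
    run_until_quiet()
    print(table)

The point is simply that each write triggers the next stage of processing for the affected rows, so work is proportional to what changed rather than to the size of the whole dataset.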

Pregel: Scalable graph computing

Eventually, Google also had to start mining graph data, like the social graph in an online social network, so they developed Pregel, published in 2010.
The underlying computational model is much more complex than in MapReduce: basically, you have worker threads for each node, which are run in parallel iteratively. In each so-called superstep, the worker threads can read messages in the node's inbox, send messages to other nodes, set and read values associated with nodes or edges, or vote to halt. Computations run until all nodes have voted to halt. In addition, there are also Aggregators and Combiners which compute global statistics.
The paper shows how to implement a number of algorithms like Google's PageRank, shortest path, or bipartite matching. My personal feeling is that Pregel requires even more rethinking on the side of the implementor than MapReduce or SPFs.
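To give a feel for the superstep model, here is a minimal single-machine sketch of a PageRank-style vertex program (my own illustration, not Google's API; like the paper's PageRank example, it simply runs for a fixed number of supersteps):

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # adjacency lists
    rank = {v: 1.0 / len(graph) for v in graph}         # initial per-vertex values
    inbox = {v: [] for v in graph}

    for superstep in range(30):                         # fixed superstep budget
        outbox = {v: [] for v in graph}
        for v in graph:                                 # conceptually runs in parallel per vertex
            if superstep > 0:
                # Read the messages delivered in the previous superstep
                # (a Combiner could pre-sum these on the sending side).
                rank[v] = 0.15 / len(graph) + 0.85 * sum(inbox[v])
            for w in graph[v]:
                # Send this vertex's contribution along each outgoing edge.
                outbox[w].append(rank[v] / len(graph[v]))
        inbox = outbox                                  # messages arrive next superstep
        # A real vertex program could also vote to halt; the computation ends
        # once every vertex has halted and no messages are in flight.

    print(rank)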

Dremel: Online visualizations

Finally, in another paper from 2010, Google describes Dremel, which is an interactive database with an SQL-like language for structured data. So instead of tables with fixed fields as in an SQL database, each row is something like a JSON object (of course, Google uses its own protocol buffer format). Queries are pushed down to servers and then aggregated on their way back up, using a clever data format for maximum performance.
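To illustrate just the "push down, aggregate on the way back up" idea (a toy sketch, not Dremel's actual architecture or query language; the shard layout below is made up):

    # Three leaf "servers", each holding a shard of JSON-like records.
    shards = [
        [{"lang": "en"}, {"lang": "de"}, {"lang": "en"}],
        [{"lang": "en"}, {"lang": "fr"}],
        [{"lang": "de"}, {"lang": "de"}],
    ]

    def leaf_query(shard):
        # Each leaf computes a partial aggregate over its own data,
        # e.g. SELECT lang, COUNT(*) ... GROUP BY lang.
        counts = {}
        for record in shard:
            counts[record["lang"]] = counts.get(record["lang"], 0) + 1
        return counts

    def merge(partials):
        # Intermediate and root servers only merge partial results on the way up.
        total = {}
        for partial in partials:
            for key, count in partial.items():
                total[key] = total.get(key, 0) + count
        return total

    print(merge(leaf_query(shard) for shard in shards))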

Big Data beyond MapReduce

Google didn't stop with MapReduce; they developed other approaches for applications where MapReduce wasn't a good fit, and I think this is an important message for the whole Big Data landscape. You cannot solve everything with MapReduce. You can make it faster by getting rid of the disks and moving all the data into memory, but there are tasks whose inherent structure makes it hard for MapReduce to scale.
Open source projects have picked up on the more recent ideas and papers by Google. For example, Apache Drill is reimplementing the Dremel framework, while projects like Apache Giraph and Stanford's GPS are inspired by Pregel.
There are still other approaches as well. I'm personally a big fan of stream mining (not to be confused with stream processing), which aims to process event streams with bounded computational resources by resorting to approximation algorithms. Noel Welsh has some interesting slides on the topic.
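As a taste of what bounded-resource stream mining looks like (my own example, not taken from those slides): reservoir sampling maintains a uniform random sample of a stream of unknown length in constant memory.

    import random

    def reservoir_sample(stream, k):
        # Keep a uniform random sample of k items from a stream of unknown
        # length using O(k) memory (the classic "Algorithm R").
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                # Keep item i with probability k / (i + 1), evicting a random slot.
                j = random.randint(0, i)
                if j < k:
                    reservoir[j] = item
        return reservoir

    print(reservoir_sample(range(1_000_000), 5))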


Tuesday, 6 November 2012

Tutorial: Designing Good Algorithms for Map-Reduce and Beyond

Program - ACM Symposium on Cloud Computing 2012:

Tutorial: Designing Good Algorithms for Map-Reduce and Beyond (view)

Foto N. Afrati (NTUA and Google), Magdalena Balazinska (University of Washington), Anish Das Sarma (Google), Bill Howe (University of Washington), Semih Salihoglu (Stanford University), and Jeffrey D. Ullman (Stanford University)

Wednesday, 17 October 2012

Statistics about Hadoop and Mapreduce Algorithm Papers

Statistics about Hadoop and Mapreduce Algorithm Papers:


Below are statistics showing which 20 papers (out of about 80) were most read in our 3 previous postings about MapReduce and Hadoop algorithms (the postings have been read approximately 5000 times). The list is ordered by decreasing reading frequency, i.e. the most popular paper is at spot 1.
  1. MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network
    authors: Yang Liu, Xiaohong Jiang, Huajun Chen, Jun Ma and Xiangyu Zhang – Zhejiang University
  2. Data-intensive text processing with Mapreduce
    authors: Jimmy Lin and Chris Dyer – University of Maryland
  3. Large-Scale Behavioral Targeting
    authors: Ye Chen (eBay), Dmitry Pavlov (Yandex Labs) and John F. Canny (University of California, Berkeley)
  4. Improving Ad Relevance in Sponsored Search
    authors: Dustin Hillard, Stefan Schroedl, Eren Manavoglu, Hema Raghavan and Chris Leggetter (Yahoo Labs)
  5. Experiences on Processing Spatial Data with MapReduce
    authors: Ariel Cary, Zhengguo Sun, Vagelis Hristidis and Naphtali Rishe – Florida International University
  6. Extracting user profiles from large scale data
    authors: Michal Shmueli-Scheuer, Haggai Roitman, David Carmel, Yosi Mass and David Konopnicki – IBM Research, Haifa
  7. Predicting the Click-Through Rate for Rare/New Ads
    authors: Kushal Dave and Vasudeva Varma – IIIT Hyderabad
  8. Parallel K-Means Clustering Based on MapReduce
    authors: Weizhong Zhao, Huifang Ma and Qing He – Chinese Academy of Sciences
  9. Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
    authors: Mohammad Farhan Husain, Pankil Doshi, Latifur Khan and Bhavani Thuraisingham – University of Texas at Dallas
  10. Map-Reduce Meets Wider Varieties of Applications
    authors: Shimin Chen and Steven W. Schlosser – Intel Research
  11. LogMaster: Mining Event Correlations in Logs of Large-scale Cluster Systems
    authors: Wei Zhou, Jianfeng Zhan, Dan Meng (Chinese Academy of Sciences), Dongyan Xu (Purdue University) and Zhihong Zhang (China Mobile Research)
  12. Efficient Clustering of Web-Derived Data Sets
    authors: Luís Sarmento, Eugenio Oliveira (University of Porto), Alexander P. Kehlenbeck (Google), Lyle Ungar (University of Pennsylvania)
  13. A novel approach to multiple sequence alignment using hadoop data grids
    authors: G. Sudha Sadasivam and G. Baktavatchalam – PSG College of Technology
  14. Web-Scale Distributional Similarity and Entity Set Expansion
    authors: Patrick Pantel, Eric Crestan, Ana-Maria Popescu, Vishnu Vyas (Yahoo Labs) and Arkady Borkovsky (Yandex Labs)
  15. Grammar based statistical MT on Hadoop
    authors: Ashish Venugopal and Andreas Zollmann (Carnegie Mellon University)
  16. Distributed Algorithms for Topic Models
    authors: David Newman, Arthur Asuncion, Padhraic Smyth and Max Welling – University of California, Irvine
  17. Parallel algorithms for mining large-scale rich-media data
    authors: Edward Y. Chang, Hongjie Bai and Kaihua Zhu – Google Research
  18. Learning Influence Probabilities In Social Networks
    authors: Amit Goyal, Laks V. S. Lakshmanan (University of British Columbia) and Francesco Bonchi (Yahoo! Research)
  19. MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
    authors: Suzanne J Matthews and Tiffani L Williams – Texas A&M University
  20. User-Based Collaborative-Filtering Recommendation Algorithms on Hadoop
    authors: Zhi-Dan Zhao and Ming-sheng Shang

Tuesday, 12 April 2011

MapReduce: A major step backwards | The Database Column


MapReduce: A major step backwards

on Jan 17 in Database architecture, Database history, Database innovation posted by DeWitt

[Note: Although the system attributes this post to a single author, it was written by David J. DeWitt and Michael Stonebraker]

On January 8, a Database Column reader asked for our views on new distributed database research efforts, and we’ll begin here with our views on MapReduce. This is a good time to discuss it, since the recent trade press has been filled with news of the revolution of so-called “cloud computing.” This paradigm entails harnessing large numbers of (low-end) processors working in parallel to solve a computing problem. In effect, this suggests constructing a data center by lining up a large number of “jelly beans” rather than utilizing a much smaller number of high-end servers.

For example, IBM and Google have announced plans to make a 1,000 processor cluster available to a few select universities to teach students how to program such clusters using a software tool called MapReduce [1]. Berkeley has gone so far as to plan on teaching their freshmen how to program using the MapReduce framework.

As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:

  1. A giant step backward in the programming paradigm for large-scale data intensive applications

  2. A sub-optimal implementation, in that it uses brute force instead of indexing
  3. Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago
  4. Missing most of the features that are routinely included in current DBMS
  5. Incompatible with all of the tools DBMS users have come to depend on

First, we will briefly discuss what MapReduce is; then we will go into more detail about our five reactions listed above.

What is MapReduce?

The basic idea of MapReduce is straightforward. It consists of two programs that the user writes called map and reduce plus a framework for executing a possibly large number of instances of each program on a compute cluster.

The map program reads a set of “records” from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a “split” function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.

In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce scheduler to process. If N nodes participate in the map phase, then there are M files on disk storage at each of N nodes, for a total of N * M files: Fi,j, 1 ≤ i ≤ N, 1 ≤ j ≤ M.

The key thing to observe is that all map instances use the same hash function. Hence, all output records with the same hash value will be in corresponding output files.

The second phase of a MapReduce job executes M instances of the reduce program, Rj, 1 ≤ j ≤ M. The input for each reduce instance Rj consists of the files Fi,j, 1 ≤ i ≤ N. Again notice that all output records from the map phase with the same hash value will be consumed by the same reduce instance — no matter which map instance produced them. After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and fed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the "answer" to a MapReduce computation.

To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.
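As an illustration of the mechanics described above (a minimal single-machine sketch of the paradigm, not Google's or Hadoop's actual API; the function names are made up), here is word count expressed as map, split, and reduce:

    from collections import defaultdict

    M = 4   # number of reduce buckets / reduce instances

    def map_fn(record):
        # Emit (key, data) records: here, one (word, 1) pair per word.
        for word in record.split():
            yield (word.lower(), 1)

    def split(key):
        # Deterministic "split" function assigning each key to one of M buckets.
        # (Python's hash() is stable within a single process, which is enough here.)
        return hash(key) % M

    def reduce_fn(key, values):
        # Arbitrary per-key computation; here, a simple sum.
        return (key, sum(values))

    def run_job(records):
        # Map phase: every output record lands in exactly one of the M buckets.
        buckets = [defaultdict(list) for _ in range(M)]
        for record in records:
            for key, value in map_fn(record):
                buckets[split(key)][key].append(value)
        # Reduce phase: one reduce "instance" per bucket, input grouped by key.
        return [reduce_fn(key, values)
                for bucket in buckets
                for key, values in bucket.items()]

    print(run_job(["the quick brown fox", "the lazy dog jumps over the fox"]))

Because the split function is the same everywhere, all occurrences of a given word end up in the same bucket and are summed by a single reduce call, mirroring the group-by/aggregate analogy above.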

We now turn to the five concerns we have with this computing paradigm.

1. MapReduce is a step backwards in database access

As a data processing paradigm, MapReduce represents a giant step backwards. The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968.

  • Schemas are good.

  • Separation of the schema from the application is good.
  • High-level access languages are good.

MapReduce has learned none of these lessons and represents a throwback to the 1960s, before modern DBMSs were invented.

The DBMS community learned the importance of schemas, whereby the fields and their data types are recorded in storage. More importantly, the run-time system of the DBMS can ensure that input records obey this schema. This is the best way to keep an application from adding “garbage” to a data set. MapReduce has no such functionality, and there are no controls to keep garbage out of its data sets. A corrupted MapReduce dataset can actually silently break all the MapReduce applications that use that dataset.

It is also crucial to separate the schema from the application program. If a programmer wants to write a new application against a data set, he or she must discover the record structure. In modern DBMSs, the schema is stored in a collection of system catalogs and can be queried (in SQL) by any user to uncover such structure. In contrast, when the schema does not exist or is buried in an application program, the programmer must discover the structure by an examination of the code. Not only is this a very tedious exercise, but also the programmer must find the source code for the application. This latter tedium is forced onto every MapReduce programmer, since there are no system catalogs recording the structure of records — if any such structure exists.

During the 1970s the DBMS community engaged in a “great debate” between the relational advocates and the Codasyl advocates. One of the key issues was whether a DBMS access program should be written:

  • By stating what you want – rather than presenting an algorithm for how to get it (relational view)

  • By presenting an algorithm for data access (Codasyl view)

The result is now ancient history, but the entire world saw the value of high-level languages and relational systems prevailed. Programs in high-level languages are easier to write, easier to modify, and easier for a new person to understand. Codasyl was rightly criticized for being “the assembly language of DBMS access.” A MapReduce programmer is analogous to a Codasyl programmer — he or she is writing in a low-level language performing low-level record manipulation. Nobody advocates returning to assembly language; similarly nobody should be forced to program in MapReduce.

MapReduce advocates might counter this argument by claiming that the datasets they are targeting have no schema. We dismiss this assertion. In extracting a key from the input data set, the map function is relying on the existence of at least one data field in each input record. The same holds for a reduce function that computes some value from the records it receives to process.

Writing MapReduce applications on top of Google’s BigTable (or Hadoop’s HBase) does not really change the situation significantly. By using a self-describing tuple format (row key, column name, {values}) different tuples within the same table can actually have different schemas. In addition, BigTable and HBase do not provide logical independence, for example with a view mechanism. Views significantly simplify keeping applications running when the logical schema changes.


2. MapReduce is a poor implementation

All modern DBMSs use hash or B-tree indexes to accelerate access to data. If one is looking for a subset of the records (e.g., those employees with a salary of 10,000 or those in the shoe department), then one can often use an index to advantage to cut down the scope of the search by one to two orders of magnitude. In addition, there is a query optimizer to decide whether to use an index or perform a brute-force sequential search.

MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism.

One could argue that the value of MapReduce is that it automatically provides parallel execution on a grid of computers. This feature was explored by the DBMS research community in the 1980s, and multiple prototypes were built including Gamma [2,3], Bubba [4], and Grace [5]. Commercialization of these ideas occurred in the late 1980s with systems such as Teradata.

In summary to this first point, there have been high-performance, commercial, grid-oriented SQL engines (with schemas and indexing) for the past 20 years. MapReduce does not fare well when compared with such systems.

There are also some lower-level implementation issues with MapReduce, specifically skew and data interchange.

One factor that MapReduce advocates seem to have overlooked is the issue of skew. As described in “Parallel Database System: The Future of High Performance Database Systems,” [6] skew is a huge impediment to achieving successful scale-up in parallel query systems. The problem occurs in the map phase when there is wide variance in the distribution of records with the same key. This variance, in turn, causes some reduce instances to take much longer to run than others, resulting in the execution time for the computation being the running time of the slowest reduce instance. The parallel database community has studied this problem extensively and has developed solutions that the MapReduce community might want to adopt.

There is a second serious performance problem that gets glossed over by the MapReduce proponents. Recall that each of the N map instances produces M output files — each destined for a different reduce instance. These files are written to a disk local to the computer used to run the map instance. If N is 1,000 and M is 500, the map phase produces 500,000 local files. When the reduce phase starts, each of the 500 reduce instances needs to read its 1,000 input files and must use a protocol like FTP to “pull” each of its input files from the nodes on which the map instances were run. With 100s of reduce instances running simultaneously, it is inevitable that two or more reduce instances will attempt to read their input files from the same map node simultaneously — inducing large numbers of disk seeks and slowing the effective disk transfer rate by more than a factor of 20. This is why parallel database systems do not materialize their split files and use push (to sockets) instead of pull. Since much of the excellent fault-tolerance that MapReduce obtains depends on materializing its split files, it is not clear whether the MapReduce framework could be successfully modified to use the push paradigm instead.

Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale. Moreover, the MapReduce implementers would do well to study the last 25 years of parallel DBMS research literature.

3. MapReduce is not novel

The MapReduce community seems to feel that they have discovered an entirely new paradigm for processing large data sets. In actuality, the techniques employed by MapReduce are more than 20 years old. The idea of partitioning a large data set into smaller partitions was first proposed in “Application of Hash to Data Base Machine and Its Architecture” [11] as the basis for a new type of join algorithm. In “Multiprocessor Hash-Based Join Algorithms,” [7], Gerber demonstrated how Kitsuregawa’s techniques could be extended to execute joins in parallel on a shared-nothing [8] cluster using a combination of partitioned tables, partitioned execution, and hash based splitting. DeWitt [2] showed how these techniques could be adapted to execute aggregates with and without group by clauses in parallel. DeWitt and Gray [6] described parallel database systems and how they process queries. Shatdal and Naughton [9] explored alternative strategies for executing aggregates in parallel.

Teradata has been selling a commercial DBMS utilizing all of these techniques for more than 20 years; exactly the techniques that the MapReduce crowd claims to have invented.

While MapReduce advocates will undoubtedly assert that being able to write MapReduce functions is what differentiates their software from a parallel SQL implementation, we would remind them that POSTGRES supported user-defined functions and user-defined aggregates in the mid 1980s. Essentially, all modern database systems have provided such functionality for quite a while, starting with the Illustra engine around 1995.

4. MapReduce is missing features

All of the following features are routinely provided by modern DBMSs, and all are missing from MapReduce:

  • Bulk loader — to transform input data in files into a desired format and load it into a DBMS

  • Indexing — as noted above
  • Updates — to change the data in the data base
  • Transactions — to support parallel update and recovery from failures during update
  • Integrity constraints — to help keep garbage out of the data base
  • Referential integrity — again, to help keep garbage out of the data base
  • Views — so the schema can change without having to rewrite the application program

In summary, MapReduce provides only a sliver of the functionality found in modern DBMSs.


5. MapReduce is incompatible with the DBMS tools

A modern SQL DBMS has available all of the following classes of tools:

  • Report writers (e.g., Crystal reports) to prepare reports for human visualization

  • Business intelligence tools (e.g., Business Objects or Cognos) to enable ad-hoc querying of large data warehouses
  • Data mining tools (e.g., Oracle Data Mining or IBM DB2 Intelligent Miner) to allow a user to discover structure in large data sets
  • Replication tools (e.g., Golden Gate) to allow a user to replicate data from one DBMS to another
  • Database design tools (e.g., Embarcadero) to assist the user in constructing a data base.

MapReduce cannot use these tools and has none of its own. Until it becomes SQL-compatible or until someone writes all of these tools, MapReduce will remain very difficult to use in an end-to-end task.

In Summary

It is exciting to see a much larger community engaged in the design and implementation of scalable query processing techniques. We, however, assert that they should not overlook the lessons of more than 40 years of database technology — in particular the many advantages that a data model, physical and logical data independence, and a declarative query language, such as SQL, bring to the design, implementation, and maintenance of application programs. Moreover, computer science communities tend to be insular and do not read the literature of other communities. We would encourage the wider community to examine the parallel DBMS literature of the last 25 years. Last, before MapReduce can measure up to modern DBMSs, there is a large collection of unmet features and required tools that must be added.

We fully understand that database systems are not without their problems. The database community recognizes that database systems are too “hard” to use and is working to solve this problem. The database community can also learn something valuable from the excellent fault-tolerance that MapReduce provides its applications. Finally we note that some database researchers are beginning to explore using the MapReduce framework as the basis for building scalable database systems. The Pig[10] project at Yahoo! Research is one such effort.


References

[1] “MapReduce: Simplified Data Processing on Large Clusters,” Jeff Dean and Sanjay Ghemawat, Proceedings of the 2004 OSDI Conference, 2004.

[2] “The Gamma Database Machine Project,” DeWitt, et. al., IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 1, March 1990.

[3] “Gamma – A High Performance Dataflow Database Machine,” DeWitt, D., R. Gerber, G. Graefe, M. Heytens, K. Kumar, and M. Muralikrishna, Proceedings of the 1986 VLDB Conference, 1986.

[4] “Prototyping Bubba, A Highly Parallel Database System,” Boral, et. al., IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 1, March 1990.

[6] “Parallel Database System: The Future of High Performance Database Systems,” David J. DeWitt and Jim Gray, CACM, Vol. 35, No. 6, June 1992.

[7] “Multiprocessor Hash-Based Join Algorithms,” David J. DeWitt and Robert H. Gerber, Proceedings of the 1985 VLDB Conference, 1985.

[8] “The Case for Shared-Nothing,” Michael Stonebraker, Data Engineering Bulletin, Vol. 9, No. 1, 1986.

[9] “Adaptive Parallel Aggregation Algorithms,” Ambuj Shatdal and Jeffrey F. Naughton, Proceedings of the 1995 SIGMOD Conference, 1995.

[10] “Pig”, Chris Olston, http://research.yahoo.com/project/90

[11] “Application of Hash to Data Base Machine and Its Architecture,” Masaru Kitsuregawa, Hidehiko Tanaka, Tohru Moto-Oka, New Generation Comput. 1(1): 63-74 (1983)

Tuesday, 26 October 2010

Google search index splits with MapReduce • The Register

Google search index splits with MapReduce

Welds BigTable to file system 'Colossus'
Exclusive Google Caffeine — the remodeled search infrastructure rolled out across Google's worldwide data center network earlier this year — is not based on MapReduce, the distributed number-crunching platform that famously underpinned the company's previous indexing system. As the likes of Yahoo!, Facebook, and Microsoft work to duplicate MapReduce through the open source Hadoop project, Google is moving on.
According to Eisar Lipkovitz, a senior director of engineering at Google, Caffeine moves Google's back-end indexing system away from MapReduce and onto BigTable, the company's distributed database platform.
As Google's Matt Cutts told us last year, the new search infrastructure also uses an overhaul of the company's GFS distributed file system. This has been referred to as Google File System 2 or GFS2, but inside Google, Lipkovitz says, it's known as Colossus.
Caffeine expands on BigTable to create a kind of database programming model that lets the company make changes to its web index without rebuilding the entire index from scratch. "[Caffeine] is a database-driven, Big Table–variety indexing system," Lipkovitz tells The Reg, saying that Google will soon publish a paper discussing the system. The paper, he says, will be delivered next month at the USENIX Symposium on Operating Systems Design and Implementation (OSDI).
Before the arrival of Caffeine, Google built its search index — an index of the entire web — using MapReduce. In essence, the index was created by a sequence of batch operations. The MapReduce platform "maps" data-crunching tasks across a collection of distributed machines, splitting them into tiny sub-tasks, before "reducing" the results into one master calculation.
MapReduce would receive the epic amounts of webpage data collected by Google's crawlers, and it would crunch this down to the links and metadata needed to actually search these pages. Among so many other things, it would determine each site's PageRank, the site's relationship to all the other sites on the web.
"We would start with a very large [collection of data] and we would process it," Lipkovitz tells us. "Then, eight hours or so later, we'd get the output of this whole process and we'd copy it to our serving systems. And we did this continuously."
But MapReduce didn't allow Google to update its index as quickly as it would like. In the age of the "real-time" web, the company is determined to refresh its index within seconds. Over the last few years, MapReduce has received ample criticism from the likes of MIT database guru Mike Stonebraker, and according to Lipkovitz, Google long ago made "similar observations." MapReduce, he says, isn't suited to calculations that need to occur in near real-time.
MapReduce is a sequence of batch operations, and generally, Lipkovitz explains, you can't start your next phase of operations until you finish the first. It suffers from "stragglers," he says. If you want to build a system that's based on a series of map-reduces, there's a certain probability that something will go wrong, and this gets larger as you increase the number of operations. "You can't do anything that takes a relatively short amount of time," Lipkovitz says, "so we got rid of it."
With Caffeine, Google can update its index by making direct changes to the web map already stored in BigTable. This includes a kind of framework that sits atop BigTable, and Lipkovitz compares it to old-school database programming and the use of "database triggers."
"It's completely incremental," he says. When a new page is crawled, Google can update its index with the necessarily changes rather than rebuilding the whole thing.
Previously, Google's index was separated into layers, and some layers updated faster than others. The main layer wouldn't be updated for about two weeks. "To refresh a layer of the old index, we would analyze the entire web, which meant there was a significant delay between when we found a page and made it available to you," Google said in a blog post when Caffeine was rolled out.
"With Caffeine, we analyze the web in small portions and update our search index on a continuous basis, globally. As we find new pages, or new information on existing pages, we can add these straight to the index. That means you can find fresher information than ever before — no matter when or where it was published."
There's also a "compute framework," Lipkovitz says, that lets engineers execute code atop BigTable. And the system is underpinned by Colossus, the distributed storage platform also known as GFS2. The original Google File System, he says, didn't scale as well as the company would like.
Colossus is specifically designed for BigTable, and for this reason it's not as suited to "general use" as GFS was. In other words, it was built specifically for use with the new Caffeine search-indexing system, and though it may be used in some form with other Google services, it isn't the sort of thing that's designed to span the entire Google infrastructure.
At Wednesday's launch of Google Instant, a new incarnation of Google's search engine that updates search results as you type, Google distinguished engineer Ben Gomes told The Reg that Caffeine was not built with Instant in mind, but that it helped deal with the added load placed on the system by the "streaming" search service.
Lipkovitz makes a point of saying that MapReduce is by no means dead. There are still cases where Caffeine uses batch processing, and MapReduce is still the basis for myriad other Google services. But prior to the arrival of Caffeine, the indexing system was Google's largest MapReduce application, so use of the platform has been significantly, well, reduced.
In 2004, Google published research papers on GFS and MapReduce that became the basis for the open source Hadoop platform now used by Yahoo!, Facebook, and — yes — Microsoft. But as Google moves beyond GFS and MapReduce, Lipkovitz stresses that he is "not claiming that the rest of the world is behind us."
"We're in business of making searches useful," he says. "We're not in the business of selling infrastructure." ®

Friday, 8 October 2010

MapReduce: A major step backwards | The Database Column
