Unlimited-Data. moved to lab.itbee.vn : October 2012

Wednesday 17 October 2012

Statistics about Hadoop and Mapreduce Algorithm Papers

Statistics about Hadoop and Mapreduce Algorithm Papers:

Underneath are statistics about which 20 papers (of about 80 papers) were most read in our 3 previous postings about mapreduce and hadoop algorithms (the postings have been read approximately 5000 times). The list is ordered by decreasing reading frequency, i.e. most popular at spot 1.

MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network
authors: Yang Liu, Xiaohong Jiang, Huajun Chen , Jun Ma and Xiangyu Zhang – Zhejiang University
Data-intensive text processing with Mapreduce
authors: Jimmy Lin and Chris Dyer – University of Maryland
Large-Scale Behavioral Targeting
authors: Ye Chen (eBay), Dmitry Pavlov (Yandex Labs) and John F. Canny (University of California, Berkeley)
Improving Ad Relevance in Sponsored Search
authors: Dustin Hillard, Stefan Schroedl, Eren Manavoglu, Hema Raghavan and Chris Leggetter (Yahoo Labs)
Experiences on Processing Spatial Data with MapReduce
authors: Ariel Cary, Zhengguo Sun, Vagelis Hristidis and Naphtali Rishe – Florida International University
Extracting user profiles from large scale data
authors: Michal Shmueli-Scheuer, Haggai Roitman, David Carmel, Yosi Mass and David Konopnicki – IBM Research, Haifa
Predicting the Click-Through Rate for Rare/New Ads
authors: Kushal Dave and Vasudeva Varma – IIIT Hyderabad
Parallel K-Means Clustering Based on MapReduce
authors: Weizhong Zhao, Huifang Ma and Qing He – Chinese Academy of Sciences
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
authors: Mohammad Farhan Husain, Pankil Doshi, Latifur Khan and Bhavani Thuraisingham – University of Texas at Dallas
Map-Reduce Meets Wider Varieties of Applications
authors: Shimin Chen and Steven W. Schlosser – Intel Research
LogMaster: Mining Event Correlations in Logs of Large-scale Cluster Systems
authors: Wei Zhou, Jianfeng Zhan, Dan Meng (Chinese Academy of Sciences), Dongyan Xu (Purdue University) and Zhihong Zhang (China Mobile Research)
Efficient Clustering of Web-Derived Data Sets
authors: Luıs Sarmento, Eugenio Oliveira (University of Porto), Alexander P. Kehlenbeck (Google), Lyle Ungar (University of Pennsylvania)
A novel approach to multiple sequence alignment using hadoop data grids
authors: G. Sudha Sadasivam and G. Baktavatchalam – PSG College of Technology
Web-Scale Distributional Similarity and Entity Set Expansion
authors: Patrick Pantel, Eric Crestan, Ana-Maria Popescu, Vishnu Vyas (Yahoo Labs) and Arkady Borkovsky (Yandex Labs)
Grammar based statistical MT on Hadoop
authors: Ashish Venugopal and Andreas Zollmann (Carnegie Mellon University)
Distributed Algorithms for Topic Models
authors: David Newman, Arthur Asuncion, Padhraic Smyth and Max Welling – University of California, Irvine
Parallel algorithms for mining large-scale rich-media data
authors: Edward Y. Chang, Hongjie Bai and Kaihua Zhu – Google Research
Learning Influence Probabilities In Social Networks
authors: Amit Goyal, Laks V. S. Lakshmanan (University of British Columbia) and Francesco Bonchi (Yahoo! Research)
MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
authors: Suzanne J Matthews and Tiffani L Williams – Texas A&M University
User-Based Collaborative-Filtering Recommendation Algorithms on Hadoop
authors: Zhi-Dan Zhao and Ming-sheng Shang

Thursday 4 October 2012

CS345A: Data Mining

http://infolab.stanford.edu/~ullman/mining/2009/index.html

Date	Topic	PowerPoint Slides	PDF Document
1/7	Introductory Remarks (JDU)	PPT	PDF
1/7	Introductory Remarks (AR)	PPT	PDF
1/12	Map-Reduce	PPT	PDF
1/14	Frequent Itemsets 1	PPT	PDF
1/14-1/21	Frequent Itemsets 2	PPT	PDF
1/16	Peter Pawlowski's Talk on Aster Data	PPTX	PDF
1/16	Nanda Kishore's Talk on ShareThis	PPT	PDF
1/26	Recommendation Systems	PPT	PDF
1/28	Shingling, Minhashing, Locality-Sensitive Hashing	PPT	PDF
2/2	Applications and Variants of LSH	PPT	PDF
2/2-2/4	Distance Measures, Generalizations of Minhashing and LSH	PPT	PDF
2/4	High-Similarity Algorithms	PPT	PDF
2/9	PageRank	PPT	PDF
2/11	Link Spam, Hubs & Authorities	PPT	PDF
2/18	Generalization of Map-Reduce	PPT	PDF
2/18-2/23	Clustering	PPT	PDF
2/23	Streaming Data	PPT	PDF
2/25	Relation Extraction	PPT	PDF
3/2	On-Line Algorithms, Advertising Optimization	PPT	PDF
3/4	Algorithms on Streams	PPT	PDF

Tuesday 2 October 2012

Google Spanner's Most Surprising Revelation: NoSQL is Out and NewSQL is In

High Scalability - High Scalability - Google Spanner's Most Surprising Revelation: NoSQL is Out and NewSQL is In:

Google recently released a paper on Spanner, their planet enveloping tool for organizing the world’s monetizable information. Reading the Spanner paper I felt it had that chiseled in stone feel that all of Google’s best papers have. An instant classic. Jeff Dean foreshadowed Spanner’s humungousness as early as2009. Now Spanner seems fully online, just waiting to handle “millions of machines across hundreds of datacenters and trillions of database rows.” Wow.

The Wise have yet to weigh in on Spanner en masse. I look forward to more insightful commentary. There’s a lot to make sense of. What struck me most in the paper was a deeply buried section essentially describing Google’s motivation for shifting away from NoSQL and to NewSQL. The money quote:

We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.

This reads as ironic given Bigtable helped kickstart the NoSQL/eventual consistency/key-value revolution.

We see most of the criticisms leveled against NoSQL turned out to be problems for Google too. Only Google solved the problems in a typically Googlish way, through the fruitful melding of advanced theory and technology. The result: programmers get the real transactions, schemas, and query languages many crave along with the scalability and high availability they require.

The full quote:

Spanner exposes the following set of data features to applications: a data model based on schematized semi-relational tables, a query language, and general purpose transactions. The move towards supporting these features was driven by many factors. The need to support schematized semi-relational tables and synchronous replication is supported by the popularity of Megastore [5].

At least 300 applications within Google use Megastore (despite its relatively low performance) because its data model is simpler to manage than Bigtable’s, and because of its support for synchronous replication across datacenters. (Bigtable only supports eventually-consistent replication across datacenters.) Examples of well-known Google applications that use Megastore are Gmail, Picasa, Calendar, Android Market, and AppEngine.

The need to support a SQLlike query language in Spanner was also clear, given the popularity of Dremel [28] as an interactive data analysis tool. Finally, the lack of cross-row transactions in Bigtable led to frequent complaints; Percolator [32] was in part built to address this failing.

Some authors have claimed that general two-phase commit is too expensive to support, because of the performance or availability problems that it brings [9, 10, 19]. We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. Running two-phase commit over Paxos mitigates the availability problems.

What was the cost? It appears to be latency, but apparently not of the crippling sort, though we don’t have benchmarks. In any case, Google thought dealing with latency was an easier task than programmers hacking around the lack of transactions. I find that just fascinating. It brings to mind so many years of RDBMS vs NoSQL arguments it’s not even funny.

I wonder if Amazon could build their highly available shopping cart application, said to a be a motivator for Dynamo, on top of Spanner?

Is Spanner The Future In The Same Way Bigtable Was The Future?

Will this paper spark the same revolution that the original Bigtable paper caused? Maybe not. As it is Open Source energy that drives these projects, and since few organizations need to support transactions on a global scale (yet), whereas quite a few needed to do something roughly Bigtablish, it might be awhile before we see a parallel Open Source development tract.

A complicating factor for an Open Source effort is that Spanner includes the use of GPS and Atomic clock hardware. Software only projects tend to be the most successful. Hopefully we’ll see clouds step it up and start including higher value specialized services. A cloud wide timing plane should be a base feature. But we are still stuck a little bit in the cloud as Internet model instead of the cloud as a highly specialized and productive software container.

Another complicating factor is that as Masters of Disk it’s not surprising Google built Spanner on top of a new Distributed File System called Colossus. Can you compete with Google using disk? If you go down the Spanner path and commit yourself to disk, Google already has many years lead time on you and you’ll never be quite as good. It makes more sense to skip a technological generation and move to RAM/SSD as a competitive edge. Maybe this time Open Source efforts should focus elsewhere, innovating rather than following Google?