Wednesday 17 October 2012

Statistics about Hadoop and Mapreduce Algorithm Papers

Statistics about Hadoop and Mapreduce Algorithm Papers:


Underneath are statistics about which 20 papers (of about 80 papers) were most read in our 3 previous postings about mapreduce and hadoop algorithms (the postings have been read approximately 5000 times). The list is ordered by decreasing reading frequency, i.e. most popular at spot 1.
  1. MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network
    authors: Yang Liu, Xiaohong Jiang, Huajun Chen , Jun Ma and Xiangyu Zhang – Zhejiang University
  2. Data-intensive text processing with Mapreduce
    authors: Jimmy Lin and Chris Dyer – University of Maryland
  3. Large-Scale Behavioral Targeting
    authors: Ye Chen (eBay), Dmitry Pavlov (Yandex Labs) and John F. Canny (University of California, Berkeley)
  4. Improving Ad Relevance in Sponsored Search
    authors: Dustin Hillard, Stefan Schroedl, Eren Manavoglu, Hema Raghavan and Chris Leggetter (Yahoo Labs)
  5. Experiences on Processing Spatial Data with MapReduce
    authors: Ariel Cary, Zhengguo Sun, Vagelis Hristidis and Naphtali Rishe – Florida International University
  6. Extracting user profiles from large scale data
    authors: Michal Shmueli-Scheuer, Haggai Roitman, David Carmel, Yosi Mass and David Konopnicki – IBM Research, Haifa
  7. Predicting the Click-Through Rate for Rare/New Ads
    authors: Kushal Dave and Vasudeva Varma – IIIT Hyderabad
  8. Parallel K-Means Clustering Based on MapReduce
    authors: Weizhong Zhao, Huifang Ma and Qing He – Chinese Academy of Sciences
  9. Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
    authors: Mohammad Farhan Husain, Pankil Doshi, Latifur Khan and Bhavani Thuraisingham – University of Texas at Dallas
  10. Map-Reduce Meets Wider Varieties of Applications
    authors: Shimin Chen and Steven W. Schlosser – Intel Research
  11. LogMaster: Mining Event Correlations in Logs of Large-scale Cluster Systems
    authors: Wei Zhou, Jianfeng Zhan, Dan Meng (Chinese Academy of Sciences), Dongyan Xu (Purdue University) and Zhihong Zhang (China Mobile Research)
  12. Efficient Clustering of Web-Derived Data Sets
    authors: Luıs Sarmento, Eugenio Oliveira (University of Porto), Alexander P. Kehlenbeck (Google), Lyle Ungar (University of Pennsylvania)
  13. A novel approach to multiple sequence alignment using hadoop data grids
    authors: G. Sudha Sadasivam and G. Baktavatchalam – PSG College of Technology
  14. Web-Scale Distributional Similarity and Entity Set Expansion
    authors: Patrick Pantel, Eric Crestan, Ana-Maria Popescu, Vishnu Vyas (Yahoo Labs) and Arkady Borkovsky (Yandex Labs)
  15. Grammar based statistical MT on Hadoop
    authors: Ashish Venugopal and Andreas Zollmann (Carnegie Mellon University)
  16. Distributed Algorithms for Topic Models
    authors: David Newman, Arthur Asuncion, Padhraic Smyth and Max Welling – University of California, Irvine
  17. Parallel algorithms for mining large-scale rich-media data
    authors: Edward Y. Chang, Hongjie Bai and Kaihua Zhu – Google Research
  18. Learning Influence Probabilities In Social Networks
    authors: Amit Goyal, Laks V. S. Lakshmanan (University of British Columbia) and Francesco Bonchi (Yahoo! Research)
  19. MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
    authors: Suzanne J Matthews and Tiffani L Williams – Texas A&M University
  20. User-Based Collaborative-Filtering Recommendation Algorithms on Hadoop
    authors: Zhi-Dan Zhao and Ming-sheng Shang

Thursday 4 October 2012

CS345A: Data Mining

http://infolab.stanford.edu/~ullman/mining/2009/index.html

DateTopicPowerPoint SlidesPDF Document
1/7Introductory Remarks (JDU)PPTPDF
1/7Introductory Remarks (AR)PPTPDF
1/12Map-ReducePPTPDF
1/14Frequent Itemsets 1PPTPDF
1/14-1/21Frequent Itemsets 2PPTPDF
1/16Peter Pawlowski's Talk on Aster DataPPTXPDF
1/16Nanda Kishore's Talk on ShareThisPPTPDF
1/26Recommendation SystemsPPTPDF
1/28Shingling, Minhashing, Locality-Sensitive HashingPPTPDF
2/2Applications and Variants of LSHPPTPDF
2/2-2/4Distance Measures, Generalizations of Minhashing and LSHPPTPDF
2/4High-Similarity AlgorithmsPPTPDF
2/9PageRankPPTPDF
2/11Link Spam, Hubs & AuthoritiesPPTPDF
2/18Generalization of Map-ReducePPTPDF
2/18-2/23ClusteringPPTPDF
2/23Streaming DataPPTPDF
2/25Relation ExtractionPPTPDF
3/2On-Line Algorithms, Advertising OptimizationPPTPDF
3/4Algorithms on StreamsPPTPDF

Tuesday 2 October 2012

Google Spanner's Most Surprising Revelation: NoSQL is Out and NewSQL is In

High Scalability - High Scalability - Google Spanner's Most Surprising Revelation: NoSQL is Out and NewSQL is In:


Google recently released a paper on Spanner, their planet enveloping tool for organizing the world’s monetizable information. Reading the Spanner paper I felt it had that chiseled in stone feel that all of Google’s best papers have. An instant classic. Jeff Dean foreshadowed Spanner’s humungousness as early as2009.  Now Spanner seems fully online, just waiting to handle “millions of machines across hundreds of datacenters and trillions of database rows.” Wow.

The Wise have yet to weigh in on Spanner en masse. I look forward to more insightful commentary. There’s a lot to make sense of. What struck me most in the paper was a deeply buried section essentially describing Google’s motivation for shifting away from NoSQL and to NewSQL. The money quote:
We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.
This reads as ironic given Bigtable helped kickstart the NoSQL/eventual consistency/key-value revolution.
We see most of the criticisms leveled against NoSQL turned out to be problems for Google too. Only Google solved the problems in a typically Googlish way, through the fruitful melding of advanced theory and technology. The result: programmers get the real transactions, schemas, and query languages many crave along with the scalability and high availability they require.

The full quote:
Spanner exposes the following set of data features to applications: a data model based on schematized semi-relational tables, a query language, and general purpose transactions. The move towards supporting these features was driven by many factors. The need to support schematized semi-relational tables and synchronous replication is supported by the popularity of Megastore [5].
At least 300 applications within Google use Megastore (despite its relatively low performance) because its data model is simpler to manage than Bigtable’s, and because of its support for synchronous replication across datacenters. (Bigtable only supports eventually-consistent replication across datacenters.) Examples of well-known Google applications that use Megastore are Gmail, Picasa, Calendar, Android Market, and AppEngine.
The need to support a SQLlike query language in Spanner was also clear, given the popularity of Dremel [28] as an interactive data analysis tool. Finally, the lack of cross-row transactions in Bigtable led to frequent complaints; Percolator [32] was in part built to address this failing.
Some authors have claimed that general two-phase commit is too expensive to support, because of the performance or availability problems that it brings [9, 10, 19]. We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. Running two-phase commit over Paxos mitigates the availability problems.

What was the cost? It appears to be latency, but apparently not of the crippling sort, though we don’t have benchmarks. In any case, Google thought dealing with latency was an easier task than programmers hacking around the lack of transactions. I find that just fascinating. It brings to mind so many years of RDBMS vs NoSQL arguments it’s not even funny.

I wonder if Amazon could build their highly available shopping cart application, said to a be a motivator for Dynamo, on top of Spanner?

Is Spanner The Future In The Same Way Bigtable Was The Future?

Will this paper spark the same revolution that the original Bigtable paper caused? Maybe not. As it is Open Source energy that drives these projects, and since few organizations need to support transactions on a global scale (yet), whereas quite a few needed to do something roughly Bigtablish, it might be awhile before we see a parallel Open Source development tract.

A complicating factor for an Open Source effort is that Spanner includes the use of GPS and Atomic clock hardware. Software only projects tend to be the most successful. Hopefully we’ll see clouds step it up and start including higher value specialized services. A cloud wide timing plane should be a base feature. But we are still stuck a little bit in the cloud as Internet model instead of the cloud as a highly specialized and productive software container.

Another complicating factor is that as Masters of Disk it’s not surprising Google built Spanner on top of a new Distributed File System called Colossus. Can you compete with Google using disk? If you go down the Spanner path and commit yourself to disk, Google already has many years lead time on you and you’ll never be quite as good. It makes more sense to skip a technological generation and move to RAM/SSD as a competitive edge. Maybe this time Open Source efforts should focus elsewhere, innovating rather than following Google?