Monday 1 April 2013

Big Data Beyond MapReduce: Google's Big Data Papers | Architects Zone

Mainstream Big Data is all about MapReduce, but the limitations of that approach start to show when it comes to real-time data. In this post, I’ll review Google’s most important Big Data publications and discuss where they are now (as far as they’ve disclosed).

MapReduce, Google File System and Bigtable: the mother of all big data algorithms

Chronologically, the first paper is on the Google File System from 2003, which is a distributed file system. Basically, files are split into chunks which are stored redundantly on a cluster of commodity machines (every article about Google has to include the term “commodity machines”!).
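To make the chunking idea concrete, here is a toy Python sketch (my own illustration, with made-up sizes and a naive round-robin placement policy; GFS used 64 MB chunks and smarter placement and failure handling):

import itertools

CHUNK_SIZE = 64   # bytes, just for this demo; GFS used 64 MB chunks
REPLICAS = 3

def place_chunks(data, machines):
    # split the data into fixed-size chunks
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    # assign each chunk to REPLICAS machines, round-robin for simplicity
    rotation = itertools.cycle(machines)
    return [{"chunk": i, "size": len(chunk),
             "machines": [next(rotation) for _ in range(REPLICAS)]}
            for i, chunk in enumerate(chunks)]

machines = ["node1", "node2", "node3", "node4", "node5"]
for entry in place_chunks(b"x" * 300, machines):
    print(entry)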
Next up is the MapReduce paper from 2004. MapReduce has become synonymous with Big Data. Legend has it that Google used it to compute their search indices. I imagine it worked like this: they had all the crawled web pages sitting on their cluster, and every day or so they ran MapReduce to recompute everything.
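As a rough illustration of the programming model (a single-machine word-count sketch, not Google's implementation; the real system partitions the map and reduce phases across a cluster and handles failures):

from collections import defaultdict

def map_phase(documents):
    # map: emit (key, value) pairs, here (word, 1) for every word
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # the framework groups all emitted values by key between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: combine all values for a key into one result
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(docs))))   # {'the': 3, 'fox': 2, ...}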
Next up is the Bigtable paper from 2006, which has become the inspiration for countless NoSQL databases like Cassandra, HBase, and others. About half of the architecture of Cassandra is modeled after Bigtable, including the data model, SSTables, and write-ahead logs (the other half being modeled after Amazon’s Dynamo database for the peer-to-peer clustering model).

Percolator: Handling individual updates

Google didn’t stop with MapReduce. In fact, with the exponential growth of the Internet, it became impractical to recompute the whole search index from scratch. Instead, Google developed a more incremental system, which still allowed for distributed computing.
Now here is where it gets interesting, in particular compared to the common messaging of mainstream Big Data. For example, Google has reintroduced transactions, something the NoSQL camp still tells you that you don’t need, or cannot have, if you want scalability.
In the Percolator paper from 2010, they describe how Google keeps its web search index up to date. Percolator is built on existing technologies like Bigtable, but adds transactions and locks on rows and tables, as well as notifications for changes in the tables. These notifications are then used to trigger the different stages of a computation. This way, the individual updates can “percolate” through the database.
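Here is a toy illustration of the notification idea (this is not Percolator’s actual API, and the transactional machinery is left out entirely): writing to a watched column fires observers, and the observers’ own writes can trigger further observers, so an update percolates through the table.

class ToyTable:
    def __init__(self):
        self.rows = {}          # {row_key: {column: value}}
        self.observers = {}     # {column: [callback, ...]}

    def register_observer(self, column, callback):
        self.observers.setdefault(column, []).append(callback)

    def write(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value
        for callback in self.observers.get(column, []):
            callback(self, row_key)

def parse_document(table, row_key):
    # observer on "raw_html": store a parsed version of the page
    raw = table.rows[row_key]["raw_html"]
    table.write(row_key, "parsed_text", raw.lower())

def index_document(table, row_key):
    # observer on "parsed_text": mark the document as indexed
    table.write(row_key, "indexed", True)

table = ToyTable()
table.register_observer("raw_html", parse_document)
table.register_observer("parsed_text", index_document)
table.write("example.com", "raw_html", "<HTML>Hello</HTML>")
print(table.rows["example.com"])   # raw_html, parsed_text and indexed are all set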
This approach is reminiscent of stream processing frameworks (SPFs) like Twitter’s Storm or Yahoo’s S4, but with an underlying database. SPFs usually use message passing and no shared data. This makes it easier to reason about what is happening, but it also means there is no way to access the result of the computation unless you manually store it somewhere in the end.

Pregel: Scalable graph computing

Eventually, Google also had to start mining graph data like the social graph in an online social network, so they developed Pregel, published in 2010.
The underlying computational model is much more complex than in MapReduce: basically, you have a worker thread for each node, and these are run in parallel iteratively. In each so-called superstep, a worker thread can read the messages in its node’s inbox, send messages to other nodes, set and read values associated with nodes or edges, or vote to halt. The computation runs until all nodes have voted to halt. In addition, there are also Aggregators, which compute global statistics, and Combiners, which reduce message traffic.
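To make the superstep idea concrete, here is a minimal single-machine sketch (not a real Pregel worker) of the maximum-value example from the paper: every node repeatedly forwards the largest value it has seen, and votes to halt once its value stops changing.

def pregel_max(values, edges):
    # values: {node: initial value}, edges: {node: [neighbour, ...]}
    values = dict(values)
    # superstep 0: every node sends its value to its neighbours
    inbox = {v: [] for v in values}
    for v in values:
        for n in edges.get(v, []):
            inbox[n].append(values[v])
    # keep running supersteps until no messages are in flight
    while any(inbox.values()):
        next_inbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if not messages:
                continue                       # node stays halted this superstep
            new_value = max([values[v]] + messages)
            if new_value > values[v]:          # value changed: update it and stay active
                values[v] = new_value
                for n in edges.get(v, []):
                    next_inbox[n].append(new_value)
            # otherwise the node votes to halt (sends nothing)
        inbox = next_inbox
    return values

values = {"a": 3, "b": 6, "c": 2, "d": 1}
edges = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c"]}
print(pregel_max(values, edges))               # every node ends up with the maximum, 6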
The paper shows how to implement a number of algorithms like Google’s PageRank, shortest paths, or bipartite matching. My personal feeling is that Pregel requires even more rethinking on the part of the implementor than MapReduce or SPFs.

Dremel: Online visualizations

Finally, in another paper from 2010, Google describes Dremel, which is an interactive database with an SQL-like language for structured data. So instead of tables with fixed fields like in an SQL database, each row is something like a JSON object (of course, Google uses its own protocol buffer format). Queries are pushed down to the servers and then aggregated on their way back up, using a clever data format for maximum performance.
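As a toy illustration of the data model (my own example, not Dremel’s engine or exact query syntax): each record below is a nested object rather than a flat tuple, and a query such as "SELECT url, COUNT(links.forward)" simply reaches into the nested fields.

# Toy nested records standing in for protocol-buffer-encoded rows
rows = [
    {"url": "a.com", "links": {"forward": ["b.com", "c.com"], "backward": []}},
    {"url": "b.com", "links": {"forward": ["c.com"], "backward": ["a.com"]}},
]

# Roughly what a query like "SELECT url, COUNT(links.forward)" would return
result = [(row["url"], len(row["links"]["forward"])) for row in rows]
print(result)   # [('a.com', 2), ('b.com', 1)]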

Big Data beyond MapReduce

Google didn’t stop with MapReduce; they developed other approaches for applications where MapReduce wasn’t a good fit, and I think this is an important message for the whole Big Data landscape: you cannot solve everything with MapReduce. You can make it faster by getting rid of the disks and moving all the data into memory, but there are tasks whose inherent structure makes it hard for MapReduce to scale.
Open source projects have picked up on the more recent ideas and papers by Google. For example, Apache Drill is reimplementing the Dremel framework, while projects like Apache Giraph and Stanford’s GPS are inspired by Pregel.
There are still other approaches as well. I’m personally a big fan of stream mining (not to be confused with stream processing), which aims to process event streams with bounded computational resources by resorting to approximation algorithms. Noel Welsh has some interesting slides on the topic.
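As one classic example of this style of computation (my example, not taken from those slides): reservoir sampling keeps a uniform random sample of an unbounded stream using a fixed amount of memory.

import random

def reservoir_sample(stream, k):
    # keep k items; after seeing i+1 items, each item is in the sample
    # with probability k/(i+1)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), 5))   # 5 items sampled uniformly from the stream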
Published at DZone with permission of Mikio Braun, author and DZone MVB. (source)

CodeBuild: SCRUM Software Development Methodology : Nuts & Bolts

Scrum is an agile software development methodology. It is useful for relatively small development teams (5-9 developers) and projects with limited calendars. This post gives brief definitions and useful suggestions about the methodology.
Team Members
  • Product Owner: Acts as the real customer and critiques the product. Attends meetings and can determine work priorities. Adds items to the product backlog (described below). Only one person can be the product owner in a scrum team.
  • Scrum Master: Enforcer of the rules. Generally the most experienced team member; works like a team leader. Solves problems that the team or a team member could not solve alone. Only one person can be the scrum master in a scrum team.
  • Team: A scrum team member is responsible for analysis, design, development, testing, and the other required processes, so specialization is rare. Self-organization is a plus. There is no hierarchy among members. The scrum master is also a team member.
Document Types:
  • Product Backlog: This list contains items that the project must have and may have. The number of items can grow at any time.
  • Sprint Backlog: A sprint is a time interval (generally 2-4 weeks) within which a group of items must be completed (similar to an "iteration" in the Unified Process and some other processes). The sprint backlog contains the items that must be completed in that sprint.
Meeting Types: 
  • Daily Scrum/Standup Meeting: Maximum time is 15 minutes. Performed at the start of the day, generally in stand-up style. Every member should answer 3 questions: 
    • What did you do yesterday? 
    • What will you do today? 
    • Is there any problem with your work item(s)?
  • Sprint Planning Meeting: Performed at the start of each sprint. The sprint backlog is created from product backlog items, based on the priority opinions of all team members.
  • Sprint Review: Performed at the end of each sprint. The status of the sprint backlog items is discussed and the results are recorded.
  • Sprint Retrospective: Performed at the end of each sprint. Issues with the sprint process are discussed in order to improve process quality.
Process:

[Image: Scrum Process diagram]
  • The product backlog is created from product owner opinions and other internal/external sources.
  • The sprint planning meeting is performed at the start of each sprint. The previous sprint's unfinished items (if any) and the product backlog are the sources of the sprint backlog. The backlog is created according to the item priority opinions of all team members. Sprint length (in weeks) may also be defined here, once for the whole project.
  • During the sprint, daily meetings are performed. Team members continuously take responsibility for items from the sprint backlog pool as they finish their current items. Meanwhile, the scrum master solves possible problems and the product owner critiques the results.
  • Sprint review and retrospective meetings are performed at the end of each sprint. Results are discussed in these meetings and recorded to be applied in the next sprints.