Showing posts with label Google.

Sunday, 29 September 2013

We’re on the cusp of deep learning for the masses. You can thank Google later — Tech News and Analysis

Clipped from: http://gigaom.com/2013/08/16/were-on-the-cusp-of-deep-learning-for-the-masses-you-can-thank-google-later/

Google silently did something revolutionary on Thursday. It open sourced a tool called word2vec, prepackaged deep-learning software designed to understand the relationships between words with no human guidance. Just input a textual data set and let the underlying predictive models get to work learning.
"This is a really, really, really big deal," said Jeremy Howard, president and chief scientist of data-science competition platform Kaggle. "… It's going to enable whole new classes of products that have never existed before." Think of Siri on steroids, for starters, or perhaps emulators that could mimic your writing style down to the tone.

When deep learning works, it works great 

To understand Howard's excitement, let's go back a few days. It was Monday and I was watching him give a presentation in Chicago about how deep learning was dominating the competition in Kaggle, the online platform where organizations present vexing predictive problems and data scientists compete to create the best models. Whenever someone has used a deep learning model to tackle one of the challenges, he told the room, it has performed better than any model previously devised for that specific problem.

Jeremy Howard (left) at Structure: Data 2012 (c) Pinar Ozger / http://www.pinarozger.com
But there's a catch: deep learning is really hard. So far, only a handful of teams in hundreds of Kaggle competitions have used it. Most of them have included Geoffrey Hinton or have been associated with him.
Hinton is a University of Toronto professor who pioneered the use of deep learning for image recognition and is now also a distinguished engineer at Google. What got Google really interested in Hinton — at least to the point where it hired him — was his work in an image-recognition competition called ImageNet. For years the contest's winners had been improving only incrementally on previous results, until Hinton and his team used deep learning to improve by an order of magnitude.

Neural networks: A way-simplified overview 

Deep learning, Howard explained, is essentially a bigger, badder take on the neural network models that have been around for some time. It's particularly useful for analyzing image, audio, text, genomic and other multidimensional data that doesn't lend itself well to traditional machine learning techniques.
Neural networks work by analyzing inputs (e.g., words or images) and recognizing the features that comprise them as well as how all those features relate to each other. With images, for example, a neural network model might recognize various formations of pixels or intensities of pixels as features.

A very simple neural network. Source: Wikipedia Commons
Trained against a set of labeled data, the output of a neural network might be the classification of an input as a dog or cat, for example. In cases where there is no labeled training data — a process called self-taught learning — neural networks can be used to identify the common features of their inputs and group similar inputs, even though the models can't predict what those inputs actually are. That's how Google researchers constructed neural networks that were able to recognize cats and human faces without having been trained to do so.
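To make the idea concrete, here's a toy forward pass through a one-hidden-layer network in Python. The weights are made up for illustration; a real model would learn them from training data.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    # Each hidden unit computes a weighted sum of the inputs, squashed by a sigmoid.
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs))) for ws in hidden_weights]
    # The output layer does the same over the hidden activations.
    return [sigmoid(sum(w * h for w, h in zip(ws, hidden))) for ws in output_weights]

# Toy network: 2 inputs -> 2 hidden units -> 1 output (say, a "cat" score).
hidden_w = [[0.5, -0.6], [-0.3, 0.8]]
output_w = [[1.2, -1.1]]
score = forward([1.0, 0.0], hidden_w, output_w)[0]
print(0.0 < score < 1.0)  # True: a sigmoid output always lies in (0, 1)
```

Training is the hard part this sketch skips: adjusting those weights from data is what backpropagation (and, stacked deeper, deep learning) is about.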

Stacking neural networks to do deep learning 

In deep learning, multiple neural networks are "stacked" on top of each other, or layered, in order to create models that are even better at prediction because each new layer learns from the ones before it. In Hinton's approach, each layer randomly omits features — a process called "dropout" — to minimize the chances the model will overfit itself to just the data upon which it was trained. That's a technical way of saying the model won't work as well when trying to analyze new data.
So dropout or similar techniques are critical to helping deep learning models understand the real causality between the inputs and the outputs, Howard explained during a call on Thursday. It's like looking at the same thing under the same lighting all the time versus looking at it in different lighting and from different angles. You'll see new aspects and won't see others, he said, "But the underlying structure is going to be the same each time."
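Hinton's dropout idea is simple enough to sketch in a few lines. This toy version uses the common "inverted dropout" variant (scaling the survivors at training time), which is my assumption for the sketch, not a detail from the talk:

```python
import random

def dropout(activations, p_drop, rng):
    # During training, each unit is omitted with probability p_drop and the
    # survivors are scaled up so the expected activation stays the same.
    # At test time the layer is used unchanged.
    keep = 1.0 - p_drop
    return [a / keep if rng.random() >= p_drop else 0.0 for a in activations]

rng = random.Random(42)  # seeded so the run is reproducible
layer = [0.2, 0.9, 0.5, 0.7]
dropped = dropout(layer, 0.5, rng)
print(any(a == 0.0 for a in dropped))  # True: some units were omitted
```

Each training pass sees a different random sub-network, which is the "different lighting and angles" effect Howard describes.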

An example of what features a neural network might learn from images. Source: Hinton et al
Still, it's difficult to create accurate models and to program them to run on the number of computing cores necessary to process them in a reasonable timeframe. It can also be difficult to train them on enough data to guarantee accuracy in an unsupervised environment. That's why so much of the cutting-edge work in the field is still done by experts such as Hinton, Jeff Dean and Andrew Ng, all of whom had or still have strong ties to Google.
There are open source tools such as Theano and PyLearn2 that try to minimize the complexity, Howard told the audience on Monday, but a user-friendly, commercialized software package could be revolutionary. If data scientists in places outside Google could simply (a relative term if ever there was one) input their multidimensional data and train models to learn it, that could make other approaches to predictive modeling all but obsolete. It wouldn't be inconceivable, Howard noted, that a software package like this could emerge within the next year.

Enter word2vec 

Which brings us back to word2vec. Google calls it "an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words." Those "architectures" are two new natural-language processing techniques developed by Google researchers Tomas Mikolov, Ilya Sutskever, and Quoc Le (Google Fellow Jeff Dean was also involved, although modestly, he told me). They're like neural networks, only simpler, so they can be trained on larger data sets.
Kaggle's Howard calls word2vec the "crown jewel" of natural language processing. "It's the English language compressed down to a list of numbers," he said.
Word2vec is designed to run on a system as small as a single multicore machine (Google tested its underlying techniques over days across more than 100 cores on its data center servers). Its creators have shown how it can recognize the similarities among words (e.g., the countries in Europe) as well as how they're related to other words (e.g., countries and capitals). It's able to decipher analogical relationships (e.g., short is to shortest as big is to biggest), word classes (e.g., carnivore and cormorant both relate to animals) and "linguistic regularities" (e.g., "vector('king') – vector('man') + vector('woman') is close to vector('queen')").
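The famous king/queen arithmetic can be demonstrated with toy vectors. The three-dimensional vectors below are hand-picked stand-ins purely for illustration; real word2vec vectors have hundreds of dimensions learned from text:

```python
import math

# Toy "word vectors" (dimensions loosely: royalty, maleness, ordinariness).
vec = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "man":   [0.1, 0.9, 0.9],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.05, 0.05, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# vector('king') - vector('man') + vector('woman') ...
target = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
# ... should land nearest vector('queen').
nearest = max((w for w in vec if w not in ("king", "man", "woman")),
              key=lambda w: cosine(target, vec[w]))
print(nearest)  # queen
```

The same nearest-neighbor search over real learned vectors is what produces the country/capital and comparative/superlative regularities described above.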

Source: Google
Right now, the word2vec Google Code page notes, "The linearity of the vector operations seems to weakly hold also for the addition of several vectors, so it is possible to add several word or phrase vectors to form representation of short sentences."
This is accomplished by turning words into numbers that correlate with their characteristics, Howard said. Words that express positive sentiment, adjectives, nouns associated with sporting events — they'll all have certain numbers in common based on how they're used in the training data (so bigger data is better).

Smarter models means smarter apps 

If this is all too esoteric, think about these methods applied to auto-correct or word suggestions in text-messaging apps. Current methods for doing this might be as simple as suggesting words that are usually paired together, Howard explained, meaning a suggestion could be based solely on the word immediately before it. Using deep-learning-based approaches, a texting app could take the entire sentence into account, because the app would have a better understanding of what all the words really mean in context.
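For contrast, here's roughly what the "word immediately before it" approach looks like: a next-word suggester built from bigram counts over a tiny made-up corpus:

```python
from collections import Counter, defaultdict

# Count which word follows which in the (made-up) training text.
corpus = "the cat sat on the mat and the cat ran".split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def suggest(prev_word):
    # Suggest the most frequent follower of the single preceding word.
    counts = bigrams.get(prev_word)
    return counts.most_common(1)[0][0] if counts else None

print(suggest("the"))  # cat ("the cat" appears twice, "the mat" once)
```

Everything before the previous word is invisible to this model, which is exactly the limitation Howard says vector-based approaches remove.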
Maybe you could average out all the numbers in a tweet, Howard suggested, and get a vector output that would accurately infer the sentiment, subject and level of formality of the tweet. Really, the possibilities are limited only by the types of applications people can think up to take advantage of word2vec's deep understanding of natural language.
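The averaging trick Howard describes is as simple as it sounds. Again, the vectors here are hypothetical stand-ins for real word2vec output:

```python
# Average the per-word vectors of a tweet into a single vector, which a
# downstream classifier could then score for sentiment or formality.
vectors = {
    "great":   [0.9, 0.1],
    "game":    [0.2, 0.8],
    "tonight": [0.1, 0.6],
}

def tweet_vector(words):
    known = [vectors[w] for w in words if w in vectors]
    n = len(known)
    return [sum(v[i] for v in known) / n for i in range(len(known[0]))]

avg = tweet_vector("great game tonight".split())
print([round(x, 2) for x in avg])  # [0.4, 0.5]
```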

An example output file from word2vec that has grouped similar words
The big caveat, however, is researchers and industry data scientists still need to learn how to use word2vec. There hasn't been a lot of research done on how to best use these types of models, Howard said, and the thousands of researchers working on other methods of natural language processing aren't going to jump ship to Google's tools overnight. Still, he believes the community will come around and word2vec and its underlying techniques could make all other approaches to natural language processing obsolete.
And this is just the start. A year from now, Howard predicts, deep learning will have surpassed a whole class of algorithms in other fields (i.e., things other than speech recognition, image recognition and natural language processing), and a year after that it will be integrated into all sorts of software packages. The only questions — and they're admittedly big ones — are how smart deep learning models can get (and whether they'll run into another era of hardware constraints that graphical processing units helped resolve earlier this millennium) and how accessible software packages like word2vec can make deep learning, even for relatively unsophisticated users.
"Maybe in 10 years' time," Howard proposed, "we'll get to that next level."

Monday, 1 April 2013

Big Data Beyond MapReduce: Google's Big Data Papers | Architects Zone

Mainstream Big Data is all about MapReduce, but when looking at real-time data, limitations of that approach are starting to show. In this post, I’ll review Google’s most important Big Data publications and discuss where they are (as far as they’ve disclosed).

MapReduce, Google File System and Bigtable: the mother of all big data algorithms

Chronologically, the first paper is on the Google File System from 2003, which is a distributed file system. Basically, files are split into chunks which are stored in a redundant fashion on a cluster of commodity machines (every article about Google has to include the term "commodity machines"!).
Next up is the MapReduce paper from 2004. MapReduce has become synonymous with Big Data. Legend has it that Google used it to compute their search indices. I imagine it worked like this: They have all the crawled web pages sitting on their cluster and every day or so they ran MapReduce to recompute everything.
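The canonical MapReduce example is a word count, and it fits in a few lines of Python once you fake the map, shuffle, and reduce phases on one machine:

```python
from collections import defaultdict
from itertools import chain

# Map phase: each "page" independently emits (word, 1) pairs.
def map_page(page):
    return [(word, 1) for word in page.split()]

# Shuffle: group the emitted pairs by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the counts for each word.
def reduce_counts(groups):
    return {word: sum(values) for word, values in groups.items()}

pages = ["the web is big", "the web is fast"]
counts = reduce_counts(shuffle(chain.from_iterable(map_page(p) for p in pages)))
print(counts["the"], counts["big"])  # 2 1
```

In the real system, the map and reduce calls run on thousands of machines and the shuffle moves data across the network; the programming model, though, is exactly this.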
Next up is the Bigtable paper from 2006, which has become the inspiration for countless NoSQL databases like Cassandra, HBase, and others. About half of the architecture of Cassandra is modeled after Bigtable, including the data model, SSTables, and write-ahead logs (the other half being Amazon's Dynamo database for the peer-to-peer clustering model).

Percolator: Handling individual updates

Google didn’t stop with MapReduce. In fact, with the exponential growth of the Internet, it became impractical to recompute the whole search index from scratch. Instead, Google developed a more incremental system, which still allowed for distributed computing.
Now here is where it gets interesting, particularly compared with the common message from mainstream Big Data. For example, Google has reintroduced transactions — something the NoSQL camp still tells you that you don't need, or cannot have, if you want scalability.
In the Percolator paper from 2010, they describe how Google keeps its web search index up to date. Percolator is built on existing technologies like Bigtable, but adds transactions and locks on rows and tables, as well as notifications for changes in the tables. These notifications are then used to trigger the different stages in a computation. This way, the individual updates can "percolate" through the database.
This approach is reminiscent of stream processing frameworks (SPFs) like Twitter's Storm, or Yahoo's S4, but with an underlying database. SPFs usually use message passing and no shared data. This makes it easier to reason about what is happening, but also has the problem that there is no way to access the result of the computation unless you manually store it somewhere in the end.
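The notification mechanism is easy to sketch. This toy version (my own simplification — no transactions, locks, or persistence) shows how a write to one column can trigger the next stage of a computation:

```python
# A table maps (row, column) to a value; observers fire on writes to a column,
# and their own writes can trigger further observers, so updates "percolate".
table = {}
observers = {}  # column -> callback

def write(row, column, value):
    table[(row, column)] = value
    callback = observers.get(column)
    if callback:
        callback(row, value)

# Downstream stage: when a raw document arrives, derive its word count.
def index_document(row, text):
    write(row, "word_count", len(text.split()))

observers["raw_text"] = index_document

write("doc1", "raw_text", "hello big data world")
print(table[("doc1", "word_count")])  # 4
```

Percolator's real contribution is making this safe at scale — the cross-row transactions and locks that keep concurrent observers consistent — which this sketch deliberately omits.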

Pregel: Scalable graph computing

Eventually, Google also had to start mining graph data like the social graph in an online social network, so they developed Pregel, published in 2010.
The underlying computational model is much more complex than in MapReduce: basically, you have worker threads for each node which are run in parallel iteratively. In each so-called superstep, the worker threads can read messages in the node's inbox, send messages to other nodes, set and read values associated with nodes or edges, or vote to halt. Computations are run until all nodes have voted to halt. In addition, there are also Aggregators and Combiners which compute global statistics.
The paper shows how to implement a number of algorithms like Google’s PageRank, shortest path, or bipartite matching. My personal feeling is that Pregel requires even more rethinking on the side of the implementor than MapReduce or SPFs.
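The superstep loop itself is small enough to sketch. This toy version propagates the maximum value through a graph — a standard Pregel example — with the message passing collapsed into plain dictionaries:

```python
def pregel_max(values, edges):
    values = dict(values)
    # Superstep 0: every vertex announces its value to its neighbors.
    inbox = {v: [] for v in values}
    for v in values:
        for n in edges.get(v, []):
            inbox[n].append(values[v])
    # Later supersteps: a vertex that learns a larger value updates itself and
    # keeps talking; the computation halts once no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in edges.get(v, []):
                    outbox[n].append(values[v])
        inbox = outbox
    return values

# Chain 1 -> 2 -> 3 with values 3, 1, 2: the 3 propagates down the chain.
result = pregel_max({1: 3, 2: 1, 3: 2}, {1: [2], 2: [3]})
print(result)  # {1: 3, 2: 3, 3: 3}
```

PageRank in Pregel has the same shape: each superstep, a vertex combines incoming rank messages, updates its own rank, and sends fractions of it along its out-edges.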

Dremel: Online visualizations

Finally, in another paper from 2010, Google describes Dremel, which is an interactive database with an SQL-like language for structured data. So instead of tables with fixed fields like in an SQL database, each row is something like a JSON object (of course, Google uses its own protocol buffer format). Queries are pushed down to servers and then aggregated on their way back up, using some clever data format for maximum performance.
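The query shape — if not Dremel's columnar storage tricks — can be illustrated over JSON-like records:

```python
# Nested, JSON-like records instead of flat rows.
records = [
    {"name": "doc1", "links": {"forward": [20, 40]}, "lang": "en"},
    {"name": "doc2", "links": {"forward": [60]}, "lang": "en"},
    {"name": "doc3", "links": {"forward": []}, "lang": "de"},
]

# Roughly: SELECT COUNT(links.forward) FROM records WHERE lang = 'en'
total_links = sum(len(r["links"]["forward"]) for r in records if r["lang"] == "en")
print(total_links)  # 3
```

Dremel's trick is storing those nested fields column by column, so a query touching only `lang` and `links.forward` never reads the rest of each record.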

Big Data beyond MapReduce

Google didn’t stop with MapReduce, but they developed other approaches for applications where MapReduce wasn’t a good fit, and I think this is an important message for the whole Big Data landscape. You cannot solve everything with MapReduce. You can make it faster by getting rid of the disks and moving all the data to in-memory, but there are tasks whose inherent structure makes it hard for MapReduce to scale.
Open source projects have picked up on the more recent ideas and papers by Google. For example, Apache Drill is reimplementing the Dremel framework, while projects like Apache Giraph and Stanford's GPS are inspired by Pregel.
There are still other approaches as well. I'm personally a big fan of stream mining (not to be confused with stream processing), which aims to process event streams with bounded computational resources by resorting to approximation algorithms. Noel Welsh has some interesting slides on the topic.
Published at DZone with permission of Mikio Braun, author and DZone MVB. (source)

Thursday, 21 March 2013

Vietnamese students surprise Google’s engineer — TalkVietnam

VietNamNet Bridge – Neil Fraser, a Google software engineer, said Vietnamese high school students have informatics knowledge good enough to get through the interview round at Google.
The engineer from one of the world's biggest technology firms spent his visit to Vietnam touring schools to find out how informatics is taught in the country. He said what he witnessed there really surprised him.
Vietnamese primary school students begin learning informatics in the second grade. The first lessons cover basic computer skills, including how to care for hard disks and floppy disks.
In the third grade, students get lessons on Microsoft Word and have to complete genuinely difficult typing exercises.
What the engineer found interesting was that such small children could learn to type using software written in English, which is not their mother tongue.
He said he was really surprised to learn that fourth and fifth graders begin programming in Logo and have to solve complicated problems. Meanwhile, in the US, students in higher grades struggle with HTML exercises, while loops and conditionals are believed to be too difficult for students to understand.
Impressed by the informatics curricula followed by Vietnamese schools, Neil offered to support a school in Da Nang City.
After realizing that the school's biggest problem was a lack of teaching software, he spent his holiday writing a program, called Blockly Maze, that lets students teach themselves loop and conditional exercises more effectively.
And after learning that Be Van Dan School lacked the money to hire informatics teachers, which is why only 50 percent of its students could attend informatics lessons, Neil donated $1,500 to the school to hire more teachers the next year.
Neil was also curious about the informatics knowledge of older students in Vietnam, which prompted him to visit a high school unannounced. He was surprised to see the students in one class solving very difficult Pascal exercises.
After returning to the US, he consulted with senior officials at Google and found that these exercises rank among the hardest third of those given to candidates applying for a job at Google.
Meanwhile, the Vietnamese students had only 45 minutes to solve an exercise of this kind, and most of them finished it.
In other words, 50 percent of the 11th graders could pass the interview round for a job at Google.
The US engineer said informatics training at Vietnamese universities is not as good as he expected. This could be one of the reasons behind technology groups' complaints about the lack of a qualified workforce for the information technology industry.
However, he said what he witnessed in Vietnam was really impressive. Vietnamese students and teachers are more eager to teach and learn informatics than their American counterparts. In the US, informatics teaching has not received appropriate investment because of many problems, including ones in the educational system, the teaching force, and parents' attitudes.
The software the engineer wrote during his short holiday for the primary school in Da Nang was mastered by the students after just 10 minutes of introduction, which surprised him once again.

Tuesday, 2 October 2012

Google Spanner's Most Surprising Revelation: NoSQL is Out and NewSQL is In

Google recently released a paper on Spanner, their planet-enveloping tool for organizing the world's monetizable information. Reading the Spanner paper, I felt it had that chiseled-in-stone feel that all of Google's best papers have. An instant classic. Jeff Dean foreshadowed Spanner's humungousness as early as 2009. Now Spanner seems fully online, just waiting to handle "millions of machines across hundreds of datacenters and trillions of database rows." Wow.

The Wise have yet to weigh in on Spanner en masse. I look forward to more insightful commentary. There’s a lot to make sense of. What struck me most in the paper was a deeply buried section essentially describing Google’s motivation for shifting away from NoSQL and to NewSQL. The money quote:
We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.
This reads as ironic, given that Bigtable helped kickstart the NoSQL/eventual consistency/key-value revolution.
Most of the criticisms leveled against NoSQL turned out to be problems for Google too. Only Google solved those problems in a typically Googlish way, through the fruitful melding of advanced theory and technology. The result: programmers get the real transactions, schemas, and query languages many crave, along with the scalability and high availability they require.

The full quote:
Spanner exposes the following set of data features to applications: a data model based on schematized semi-relational tables, a query language, and general purpose transactions. The move towards supporting these features was driven by many factors. The need to support schematized semi-relational tables and synchronous replication is supported by the popularity of Megastore [5].
At least 300 applications within Google use Megastore (despite its relatively low performance) because its data model is simpler to manage than Bigtable’s, and because of its support for synchronous replication across datacenters. (Bigtable only supports eventually-consistent replication across datacenters.) Examples of well-known Google applications that use Megastore are Gmail, Picasa, Calendar, Android Market, and AppEngine.
The need to support a SQL-like query language in Spanner was also clear, given the popularity of Dremel [28] as an interactive data analysis tool. Finally, the lack of cross-row transactions in Bigtable led to frequent complaints; Percolator [32] was in part built to address this failing.
Some authors have claimed that general two-phase commit is too expensive to support, because of the performance or availability problems that it brings [9, 10, 19]. We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. Running two-phase commit over Paxos mitigates the availability problems.

What was the cost? It appears to be latency, but apparently not of the crippling sort, though we don’t have benchmarks. In any case, Google thought dealing with latency was an easier task than programmers hacking around the lack of transactions. I find that just fascinating. It brings to mind so many years of RDBMS vs NoSQL arguments it’s not even funny.
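For reference, the two-phase commit protocol the paper defends looks roughly like this in miniature. Spanner's real machinery (Paxos-replicated participant groups, TrueTime timestamps) is not modeled; this is only the bare protocol shape:

```python
# Minimal two-phase commit: phase one asks every participant to prepare;
# only if all vote yes does phase two commit, otherwise everyone aborts.
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1: prepare
    decision = all(votes)
    for p in participants:                        # phase 2: commit or abort
        p.finish(decision)
    return decision

group = [Participant(), Participant(can_commit=False)]
print(two_phase_commit(group))  # False: one "no" vote aborts the transaction
```

The availability problem the critics cite is that a coordinator crash between the phases leaves participants blocked in "prepared"; running the coordinator state over Paxos, as Spanner does, is what mitigates that.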

I wonder if Amazon could build their highly available shopping cart application, said to be a motivator for Dynamo, on top of Spanner?

Is Spanner The Future In The Same Way Bigtable Was The Future?

Will this paper spark the same revolution that the original Bigtable paper caused? Maybe not. Since it is Open Source energy that drives these projects, and since few organizations (yet) need to support transactions on a global scale, whereas quite a few needed to do something roughly Bigtablish, it might be a while before we see a parallel Open Source development track.

A complicating factor for an Open Source effort is that Spanner includes the use of GPS and atomic clock hardware. Software-only projects tend to be the most successful. Hopefully we'll see clouds step it up and start including higher-value specialized services. A cloud-wide timing plane should be a base feature. But we are still stuck a little bit in the cloud-as-Internet model instead of the cloud as a highly specialized and productive software container.

Another complicating factor is that as Masters of Disk it’s not surprising Google built Spanner on top of a new Distributed File System called Colossus. Can you compete with Google using disk? If you go down the Spanner path and commit yourself to disk, Google already has many years lead time on you and you’ll never be quite as good. It makes more sense to skip a technological generation and move to RAM/SSD as a competitive edge. Maybe this time Open Source efforts should focus elsewhere, innovating rather than following Google?

Tuesday, 6 September 2011

High Scalability - The Three Ages of Google - Batch, Warehouse, Instant

The world has changed. And some things that should not have been forgotten, were lost. I found these words from the Lord of the Rings echoing in my head as I listened to a fascinating presentation by Luiz André Barroso, Distinguished Engineer at Google, concerning Google's legendary past, golden present, and apocryphal future. His talk, Warehouse-Scale Computing: Entering the Teenage Decade, was given at the Federated Computing Research Conference. Luiz clearly knows his stuff and was early at Google, so he has a deep and penetrating perspective on the technology. There's much to learn from, think about, and build.
Lord of the Rings applies at two levels. At the change level, Middle Earth went through three ages. While listening to Luiz talk, it seems so has Google: Batch (indexes calculated every month), Warehouse (the datacenter is the computer), and Instant (make it all real-time). At the "what was forgot" level, in the Instant Age section of the talk, a common theme was the challenge of making low latency systems on top of commodity systems. These are issues very common in the real-time area and it struck me that these were the things that should not have been forgotten.
What is completely new, however, is the combining of Warehouse + Instant, and that's where the opportunities and the future is to be found- the Fourth Age.

The First Age - The Age Of Batch

The time is 2003. The web is still young and HTML is still page oriented. Ajax has been invented, but is still awaiting early killer apps like Google Maps and a killer marketing strategy, a catchy brand name like Ajax.
Google is batch oriented. They crawled the web every month (every month!), built a search index, and answered queries. Google was largely read-only, which is pretty easy to scale. This is still probably the model most people have in their mind's eye about how Google works.
Google was still unsophisticated in their hardware. They built racks in colo spaces, bought fans from Walmart and cable trays from Home Depot.
It's quaint to think that all of Google's hardware and software architecture could be described in seven pages: Web Search for a Planet: The Google Cluster Architecture by Luiz Barroso, Jeffrey Dean, and Urs Hoelzle. That would quickly change.

The Second Age - The Age Of The Warehouse

The time is 2005. Things move fast on the Internet. The Internet has happened, it has become pervasive, higher speed, and interactive. Google is building their own datacenters and becoming more sophisticated at every level. Iconic systems like BigTable are in production.
About this time Google realized they were building something qualitatively different from anything that had come before, something we now think of, more or less, as cloud computing. Amazon's EC2 launched in 2006. Another paper (this one really a book) summarizes what they were trying to do: The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines by Luiz André Barroso and Urs Hölzle. Note the jump from 7 pages to book size, and note that it was published in 2009, 4 years after they were implementing the vision. To learn what Google is really up to now will probably take an encyclopedia and come out in a few years, after they are on to the next thing.
The fundamental insight in this age is that the datacenter is the computer. You may recall that in the late 1980s Sun's John Gage hailed "the network is the computer." The differences are interesting to ponder. When the network was the computer, we created client-server architectures that appeared to the outside world as a single application, but in reality were made of individual nodes connected by a network. Warehouse-Scale Computing (WSC) moves up the stack: it treats computing resources, as much as possible, as fungible — that is, interchangeable and location-independent. Individual computers lose identity and become just a part of a service. Sun later had their own grid network, but I don't think they ever had this full-on WSC vision.
Warehouse-scale machines are different. They are not made up of separate computers; applications are not designed to run on single machines, but to run Internet services on a datacenter full of machines. What matters is the aggregate performance of the entire system.
The WSC club is not a big one. Luiz says you might have a warehouse-scale computer if you get paged in the middle of the night because you have only petabytes of storage left.

The Third Age - The Age Of Instant

The time is now. There's no encyclopedia yet on how the Age of Instant works because it is still being developed. But because Google is quite open, we do get clues: Google's Colossus Makes Search Real-Time By Dumping MapReduce; Large-Scale Incremental Processing Using Distributed Transactions And Notifications; Tree Distribution Of Requests And Responses; Google Megastore - 3 Billion Writes and 20 Billion Read Transactions Daily; and so much more I didn't cover or only referenced.
Google's Instant Search Results is a crude example, Luiz says, of what the future will hold. This is the feature where, when you type a letter into the search box, you instantly get back query results. It means 5 or 6 queries are executed for every search. You can imagine the infrastructure this must take.
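One simplistic way to make per-keystroke responses cheap is to precompute results for every prefix, so each keystroke becomes a lookup rather than a full search. Google's real serving stack is of course vastly more involved; this is only a toy illustration of the idea:

```python
# Made-up query log with popularity scores.
queries = {"weather": 100, "web search": 80, "west wing": 20}

# Precompute: map every prefix of every query to (popularity, query) pairs.
completions = {}
for q, popularity in queries.items():
    for i in range(1, len(q) + 1):
        completions.setdefault(q[:i], []).append((popularity, q))

def instant(prefix, k=2):
    # Each keystroke is now a dictionary lookup plus a small sort.
    return [q for _, q in sorted(completions.get(prefix, []), reverse=True)[:k]]

print(instant("we"))  # ['weather', 'web search']
```

The precomputation trades storage for latency — the same trade the whole Instant architecture makes at datacenter scale.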
The flip side of search is content indexing. The month long indexing runs are long gone. The Internet is now a giant event monster feeding Google with new content to index continuously and immediately. It is astonishing how quickly content is indexed now. That's a revolution in architecture.
Luiz thinks in the next few years the level of interactivity, insight and background information the system will have to help you, will dwarf what there is in Instant Search. If you want to know why Google is so insistent on using Real Names in Google+, this is why.
Luiz explains this change having 4 drivers:
  • Applications - instantaneous, personalized, contextual
  • Scale - increased attention to latency tail
  • Efficiency - driving utilization up, and energy/water usage down
  • Hardware Trends - non-volatile storage, multi-cores, fast networks
Instant in the context of warehouse computing is a massive engineering challenge. It's a hard thing to treat a datacenter as a computer, and it's a hard thing to provide instant indexing and instant results; to provide instant in a warehouse-scale computer is an entirely new level of challenge. This challenge is what the second half of his talk covers.
The problem is we aren't meeting this challenge. Our infrastructure is broken. Datacenters have the diameter of a microsecond, yet we are still using entire stacks designed for WANs. Real-time requires low and bounded latencies and our stacks can't provide low latency at scale. We need to fix this problem and towards this end Luiz sets out a research agenda, targeting problems that need to be solved:
  • Rethink the IO software stack. An OS that takes tens of milliseconds to make scheduling decisions is incompatible with IO devices that respond in microseconds.
  • Revisit operating systems scheduling.
  • Rethink threading models.
  • Re-read 1990's fast messaging papers.
  • Make IO design a higher priority. Not just NICs and RDMA, consider CPU design and memory systems.
"The fun starts now," Luiz says; these are still very early days. He predicts this will be the:
  • Decade of resource efficiency
  • Decade of IO
  • Decade of low latency (and low tail latency)
  • Decade of warehouse-scale disaggregation, making resources available beyond just one machine or a single rack, across all machines
This is a great talk, very informative and very inspiring. Well worth watching. We'll talk more about specific technical points in later articles, but this sets the stage not just for Google, but for the rest of the industry as well.

Sunday, 26 June 2011

Google BigTable, MapReduce, MegaStore vs. Hadoop, MongoDB

Google BigTable, MapReduce, MegaStore vs. Hadoop, MongoDB: "Dhanji R. Prasanna leaving Google:


Here is something you may have heard but never quite believed before: Google’s vaunted scalable software infrastructure is obsolete. Don’t get me wrong, their hardware and datacenters are the best in the world, and as far as I know, nobody is close to matching it. But the software stack on top of it is 10 years old, aging and designed for building search engines and crawlers. And it is well and truly obsolete.

Protocol Buffers, BigTable and MapReduce are ancient, creaking dinosaurs compared to MessagePack, JSON, and Hadoop. And new projects like GWT, Closure and MegaStore are sluggish, overengineered Leviathans compared to fast, elegant tools like jQuery and mongoDB. Designed by engineers in a vacuum, rather than by developers who have need of tools.



Maybe it is just the disappointment of someone whose main project was killed. Or maybe it is true. Or maybe it is just another magic triangle:

Agility Scalability Coolness factor Triangle

Edward Ribeiro mentioned a post from another ex-Googler which points out similar issues with Google’s philosophy.


Original title and link: Google BigTable, MapReduce, MegaStore vs. Hadoop, MongoDB (NoSQL databases © myNoSQL)

"

Monday, 7 February 2011

Does Google do "research"?

Does Google do "research"?: "I've been asked a lot by folks recently about whether the work I'm doing now at Google is 'research' and whether one can really have a 'research career' at Google. This has also led to a lot of interesting discussions about what the role of research is in an industrial setting. TL;DR -- yes, Google does research, but not like any other company I know.

Here's my personal take on what 'research' means at Google. (Don't take this as any official statement -- and I'm sure not everyone at Google would agree with this!)

They don't give us lab coats like this, though I wish they did.
The conventional model for industrial research is to set up a lab populated entirely by PhDs, whose job is mostly to write papers, and (in the best case) inform the five-to-ten year roadmap for the company. Usually the 'research lab' is a separate entity from the product side of the company, and may even be physically remote.

Under this model, it can be difficult to get anything you build into production. I have a lot of friends and colleagues at places like Microsoft Research and Intel Labs, and they readily admit that 'technology transfer' is not always easy. Of course, it's not their job to build real systems -- it's primarily to build prototypes, write papers about those prototypes, and then move on to the next big thing. Sometimes, rarely, a research project will mature to the point where it gets picked up by the product side of the company, but this is the exception rather than the norm. It's like throwing ping pong balls at a mountain -- it takes a long time to make a dent.

But these labs aren't supposed to be writing code that gets picked up directly by products -- it's about informing the long-term strategic direction. And a lot of great things can come out of that model. I'm not knocking it -- I spent a year at Intel Research Berkeley before joining Harvard, so I have some experience with this style of industrial research.

From what I can tell, Google takes a very different approach to research. We don't have a separate 'research lab.' Instead, research is distributed throughout the many engineering efforts within the company. Most of the PhDs at Google (myself included) have the job title 'software engineer,' and there's generally no special distinction between the kinds of work done by people with PhDs versus those without. Rather than forking advanced projects off as a separate “research” activity, much of the research happens in the course of building Google’s core systems. Because of the sheer scale at which Google operates, a lot of what we do involves research even if we don't always call it that.

There is also an entity called Google Research, which is not a separate physical lab, but rather a distributed set of teams working in areas such as machine learning, information retrieval, natural language processing, algorithms, and so forth. It's my understanding that even Google Research builds and deploys real systems, like Google’s automatic language translation and voice recognition platforms.

(Update 23-Jan-2011: Someone pointed out that Google also has a 'Quantitative Analyst' job role. These folks work closely with teams in engineering and research to analyze massive data sets, build models, and so forth -- a lot of this work results in research publications as well.)

I like the Google model a lot, since it keeps research and engineering tightly integrated, and keeps us honest. But there are some tradeoffs. Some of the most common questions I've fielded lately include:

Can you publish papers at Google? Sure. Google publishes hundreds of research papers a year. (Some more details here.) You can even sit on program committees, give talks, attend conferences, all that. But this is not your main job, so it's important to make sure that the research outreach isn't interfering with your ability to get 'real' work done. It's also true that Google teams are sometimes too busy to spend much time pushing out papers, even when the work is eminently publishable.

Can you do long-term crazy beard-scratching pie-in-the-sky research at Google? Maybe. Google does some crazy stuff, like developing self-driving cars. If you wanted to come to Google and start an effort to, say, reinvent the Internet, you'd have to work pretty hard to convince people that it could be done and makes sense for the company. Fortunately, in my area of systems and networking, I don't need to look that far out to find really juicy problems to work on.

Do you have to -- gulp -- maintain your code? And write unit tests? And documentation? And fix bugs? Oh yes. All of that and more. And I love it. Nothing gets me going more than adding a feature or fixing a bug in my code when I know that it will affect millions of people. Yes, there is overhead involved in building real production systems. But knowing that the systems I build will have immediate impact is a huge motivator. So, it's a tradeoff.

But doesn't it bother you that you don't have a fancy title like 'distinguished scientist' and get your own office? I thought it would bug me, but I'm actually quite proud to be a lowly software engineer. I love the open desk seating, and I'm way more productive in that setting. It's also been quite humbling to work side by side with these hotshot developers who are only a couple of years out of college and know way more than I do about programming.

I will be frank that Google doesn't always do the best job reaching out to folks with PhDs or coming from an academic background. When I interviewed (both in 2002 and in 2010), I didn't get a good sense of what I could contribute at Google. The software engineering interview can be fairly brutal: I was asked questions about things I hadn't seen since I was a sophomore in college. And a lot of people you talk to will tell you (incorrectly) that 'Google doesn't do research.' Now that I've been at Google for a few months, I have a much better picture, and one of my goals is to get the company to do a better job at this. I'll try to use this blog to give some of the insider view as well.

Obligatory disclaimer: This is my personal blog. The views expressed here are mine alone and not those of my employer.
"

Tuesday, 18 January 2011

Google Megastore: The Data Engine Behind GAE

Google Megastore: The Data Engine Behind GAE: "Megastore is the data engine supporting Google App Engine. It’s a scalable structured data store providing full ACID semantics within partitions but lower consistency guarantees across partitions.

I wrote up some notes on it back in 2008 in Under the Covers of the App Engine Datastore, and posted Phil Bernstein’s excellent notes from a 2008 SIGMOD talk: Google Megastore. But there has been remarkably little written about this datastore over the intervening couple of years, until this year’s CIDR conference papers were posted. CIDR 2011 includes Megastore: Providing Scalable, Highly Available Storage for Interactive Services.

My rough notes from the paper:



· Megastore is built upon BigTable.
· Bigtable supports fault-tolerant storage within a single datacenter.
· Synchronous replication based upon Paxos, optimized for long-distance inter-datacenter links.
· Partitioned into a vast space of small databases, each with its own replicated log.
· Each log is stored across a Paxos cluster.
· Because they are so aggressively partitioned, each Paxos group only has to accept logs for operations on a small partition. However, the design does serialize updates on each partition.
· 3 billion writes and 20 billion read transactions per day.
· Support for consistency is unusual for a NoSQL database, but driven by (what I believe to be) the correct belief that inconsistent updates make many applications difficult to write (see I Love Eventual Consistency but …).


· Data model:
  · The data model is declared in a strongly typed schema.
  · There are potentially many tables per schema.
  · There are potentially many entities per table.
  · There are potentially many strongly typed properties per entity.
  · Repeating properties are allowed.
  · Tables can be arranged hierarchically, where child tables point to root tables.
  · Megastore tables are either entity group root tables or child tables.
  · The root table and all child tables are stored in the same entity group.
· Secondary indexes are supported:
  · Local secondary indexes index a specific entity group and are maintained consistently.
  · Global secondary indexes index across entity groups; they are asynchronously updated and eventually consistent.
  · Repeated indexes support indexing repeated values (e.g. photo tags).
  · Inline indexes provide a way to denormalize data from source entities into a related target entity as a virtual repeated column.
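To make the root/child hierarchy concrete, here is a toy Python mini-model of entity groups. All the names and classes here are mine, not Megastore's actual schema language; it only illustrates the idea that a child's key is prefixed by its root's key, so the whole group clusters together as the unit of ACID transactions.

```python
from dataclasses import dataclass, field

# Hypothetical mini-model of Megastore's hierarchy: a root entity and its
# child entities share one entity group, the unit of ACID transactions.

@dataclass
class Entity:
    kind: str    # table name, e.g. "User" or "Photo"
    key: tuple   # full key path; begins with the root's key

@dataclass
class EntityGroup:
    root: Entity
    children: list = field(default_factory=list)

    def add_child(self, kind: str, child_id: str) -> Entity:
        # A child's key is prefixed by its root's key, so all members of
        # the group cluster together and commit through one replicated log.
        child = Entity(kind=kind, key=self.root.key + (kind, child_id))
        self.children.append(child)
        return child

user = Entity(kind="User", key=("User", "alice"))
group = EntityGroup(root=user)
photo = group.add_child("Photo", "p1")
print(photo.key)  # ('User', 'alice', 'Photo', 'p1')
```

This key-prefix layout is also why local secondary indexes can be maintained consistently: the index and the data it covers live in the same entity group and commit through the same log.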



· There are physical storage hints:
  · “IN TABLE” directs Megastore to store two tables in the same underlying BigTable.
  · “SCATTER” prepends a 2-byte hash to each key to cool hot spots on tables with monotonically increasing keys, like dates (e.g. a history table).
  · A “STORING” clause on an index supports index-only access by redundantly storing additional data in the index. This avoids the double access often required by a secondary index lookup: finding the correct entity and then selecting the correct properties from that entity through a second table access. By pulling values up into the secondary index, the base table doesn’t need to be accessed to obtain these properties.
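The SCATTER hint is simple enough to sketch in a few lines. The 2-byte prefix comes from the notes above; the choice of hash function here is my assumption, purely for illustration.

```python
import hashlib

def scatter_key(key: bytes) -> bytes:
    """Prepend a 2-byte hash so monotonically increasing keys (e.g. dates)
    spread across the keyspace instead of piling onto one hot tablet."""
    prefix = hashlib.sha1(key).digest()[:2]
    return prefix + key

# The original key survives as a suffix; only its sort position changes.
k = scatter_key(b"2011-01-18")
print(len(k), k[2:])  # 12 b'2011-01-18'
```

The trade-off is that range scans over the original key order now have to fan out across all 65,536 possible prefixes, which is why SCATTER is a hint you opt into per table rather than a default.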



· 3 levels of read consistency:
  · Current: the last committed value.
  · Snapshot: the value as of the start of the read transaction.
  · Inconsistent reads: used for cross-entity-group reads.
· Update path:
  · A transaction writes its mutations to the entity group’s write-ahead log and then applies the mutations to the data (write-ahead logging).
  · A write transaction always begins with a current read to determine the next available log position. The commit operation gathers mutations into a log entry, assigns an increasing timestamp, and appends it to the log, which is maintained using Paxos.
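The update path above amounts to optimistic concurrency on the log. Here's a toy single-process sketch of that shape; a plain dict stands in for the Paxos-replicated log (obviously an enormous simplification), and all names are mine.

```python
# Toy sketch of Megastore's write path: read the next log position, then
# try to claim it; on contention exactly one writer wins, losers retry.

class EntityGroupLog:
    def __init__(self):
        self.entries = {}          # position -> mutation batch

    def next_position(self) -> int:
        return len(self.entries)   # the "current read" of the log head

    def try_append(self, position: int, mutations) -> bool:
        if position in self.entries:   # someone else won this slot
            return False
        self.entries[position] = mutations
        return True

def commit(log: EntityGroupLog, mutations, retries: int = 3) -> int:
    for _ in range(retries):
        pos = log.next_position()
        if log.try_append(pos, mutations):
            return pos             # mutations are durable; now apply them
    raise RuntimeError("log contention: too many retries")

log = EntityGroupLog()
print(commit(log, ["set x=1"]))  # 0
print(commit(log, ["set y=2"]))  # 1
```

The retry loop is exactly where the update-rate limit below comes from: every writer in the group is racing for the same next log slot.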



· Update rates within an entity group are seriously limited:
  · When there is log contention, one writer wins and the rest fail and must be retried.
  · Paxos only accepts a very limited update rate (on the order of 10^2 updates per second).
  · The paper reports that “limiting updates within an entity group to a few writes per second per entity group yields insignificant write conflicts.”
  · Implication: programmers must shard aggressively to get even moderate update rates, and consistent update across shards is only supported using two-phase commit, which is not recommended.
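That "shard aggressively" implication is the root of the sharded-counter pattern familiar to App Engine developers: split one hot counter across N entity groups so writes never contend on a single log. A toy sketch, with names of my own invention:

```python
import random

# With only a few writes/sec per entity group, a hot counter must be
# split across N independent groups; reads sum the shards.

class ShardedCounter:
    def __init__(self, num_shards: int = 20):
        self.shards = [0] * num_shards

    def increment(self) -> None:
        # Each shard would live in its own entity group, so increments to
        # different shards never contend on the same replicated log.
        idx = random.randrange(len(self.shards))
        self.shards[idx] += 1

    def value(self) -> int:
        return sum(self.shards)

counter = ShardedCounter()
for _ in range(1000):
    counter.increment()
print(counter.value())  # 1000
```

The cost is that `value()` is no longer a single consistent read: summing across shards is a cross-entity-group read, which in Megastore terms is exactly the "inconsistent reads" level above.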



· Cross-entity-group updates are supported by:
  · Two-phase commit, with the fragility that it brings.
  · Queueing and asynchronously applying the changes.
· Excellent support for backup and redundancy:
  · Synchronous replication to protect against media failure.
  · Snapshots and incremental log backups.








Overall, an excellent paper with lots of detail on a nicely executed storage system. Supporting consistent read and full ACID update semantics is impressive, although the limitation of not being able to update an entity group at more than a “few per second” is limiting.

Thanks to Zhu Han, Reto Kramer, and Chris Newcombe for all sending this paper my way.

--jrh

From Perspectives."

Google Megastore - 3 Billion Writes and 20 Billion Read Transactions Daily

Google Megastore - 3 Billion Writes and 20 Billion Read Transactions Daily: "


A giant step into the fully distributed future has been taken by the Google App Engine team with the release of their High Replication Datastore. The HRD is targeted at mission critical applications that require data replicated to at least three datacenters, full ACID semantics for entity groups, and lower consistency guarantees across entity groups.

This is a major accomplishment. Few organizations can implement a true multi-datacenter datastore. Other than SimpleDB, how many other publicly accessible database services can operate out of multiple datacenters? Now that capability can be had by anyone. But there is a price, literally and otherwise. Because the HRD uses three times the resources of Google App Engine's Master/Slave datastore, it will cost three times as much. And because it is a distributed database, with all that implies in the CAP sense, developers will have to be very careful in how they architect their applications: costs have increased, reliability has increased, complexity has increased, and performance has decreased. This is why HRD is targeted at mission-critical applications; you gotta want it, otherwise the Master/Slave datastore makes a lot more sense.

The technical details behind the HRD are described in this paper, Megastore: Providing Scalable, Highly Available Storage for Interactive Services. This is a wonderfully written and accessible paper, chock full of useful and interesting details. James Hamilton wrote an excellent summary of the paper in Google Megastore: The Data Engine Behind GAE. There are also a few useful threads in Google Groups that go into more detail about how it works, costs, and performance (the original announcement, performance comparison).


Some Megastore highlights:

"