Unlimited-Data. moved to lab.itbee.vn : February 2011

Friday, 11 February 2011

InfoQ: NoSQL Shake-Up. Membase and CouchOne merge into Couchbase

InfoQ: NoSQL Shake-Up. Membase and CouchOne merge into Couchbase: "The companies describe the merger as a real synergy or optimal fit. Membase will enhance CouchDB performance by providing an efficient, scalable and distributed caching layer and speeding up view server operations. CouchDB replaces the Membase persistence store (SQLite) and enhances Membase with querying, indexing, map-reduce and more database capabilities. It also benefits from the more mature operations and tools support that Membase offers. The combination of both allows the new products to serve many more different customer needs than before and scale up to millions of users or down to a single mobile device."

Two Books Alike in Dignity - ACM Queue

Monday, 7 February 2011

Does Google do "research"?

Does Google do "research"?: "I've been asked a lot by folks recently about whether the work I'm doing now at Google is 'research' and whether one can really have a 'research career' at Google. This has also led to a lot of interesting discussions about what the role of research is in an industrial setting. TL;DR -- yes, Google does research, but not like any other company I know.

Here's my personal take on what 'research' means at Google. (Don't take this as any official statement -- and I'm sure not everyone at Google would agree with this!)

They don't give us lab coats like this, though I wish they did.

The conventional model for industrial research is to set up a lab populated entirely by PhDs, whose job is mostly to write papers, and (in the best case) inform the five-to-ten year roadmap for the company. Usually the 'research lab' is a separate entity from the product side of the company, and may even be physically remote.

Under this model, it can be difficult to get anything you build into production. I have a lot of friends and colleagues at places like Microsoft Research and Intel Labs, and they readily admit that 'technology transfer' is not always easy. Of course, it's not their job to build real systems -- it's primarily to build prototypes, write papers about those prototypes, and then move on to the next big thing. Sometimes, rarely, a research project will mature to the point where it gets picked up by the product side of the company, but this is the exception rather than the norm. It's like throwing ping pong balls at a mountain -- it takes a long time to make a dent.

But these labs aren't supposed to be writing code that gets picked up directly by products -- it's about informing the long-term strategic direction. And a lot of great things can come out of that model. I'm not knocking it -- I spent a year at Intel Research Berkeley before joining Harvard, so I have some experience with this style of industrial research.

From what I can tell, Google takes a very different approach to research. We don't have a separate 'research lab.' Instead, research is distributed throughout the many engineering efforts within the company. Most of the PhDs at Google (myself included) have the job title 'software engineer,' and there's generally no special distinction between the kinds of work done by people with PhDs versus those without. Rather than forking advanced projects off as a separate “research” activity, much of the research happens in the course of building Google’s core systems. Because of the sheer scales at which Google operates, a lot of what we do involves research even if we don't always call it that.

There is also an entity called Google Research, which is not a separate physical lab, but rather a distributed set of teams working in areas such as machine learning, information retrieval, natural language processing, algorithms, and so forth. It's my understanding that even Google Research builds and deploys real systems, like Google’s automatic language translation and voice recognition platforms.

(Update 23-Jan-2011: Someone pointed out that Google also has a 'Quantitative Analyst' job role. These folks work closely with teams in engineering and research to analyze massive data sets, build models, and so forth -- a lot of this work results in research publications as well.)

I like the Google model a lot, since it keeps research and engineering tightly integrated, and keeps us honest. But there are some tradeoffs. Some of the most common questions I've fielded lately include:

Can you publish papers at Google? Sure. Google publishes hundreds of research papers a year. (Some more details here.) You can even sit on program committees, give talks, attend conferences, all that. But this is not your main job, so it's important to make sure that the research outreach isn't interfering with your ability to get 'real' work done. It's also true that Google teams are sometimes too busy to spend much time pushing out papers, even when the work is eminently publishable.

Can you do long-term crazy beard-scratching pie-in-the-sky research at Google? Maybe. Google does some crazy stuff, like developing self-driving cars. If you wanted to come to Google and start an effort to, say, reinvent the Internet, you'd have to work pretty hard to convince people that it could be done and makes sense for the company. Fortunately, in my area of systems and networking, I don't need to look that far out to find really juicy problems to work on.

Do you have to -- gulp -- maintain your code? And write unit tests? And documentation? And fix bugs? Oh yes. All of that and more. And I love it. Nothing gets me going more than adding a feature or fixing a bug in my code when I know that it will affect millions of people. Yes, there is overhead involved in building real production systems. But knowing that the systems I build will have immediate impact is a huge motivator. So, it's a tradeoff.

But doesn't it bother you that you don't have a fancy title like 'distinguished scientist' and get your own office? I thought it would bug me, but I'm actually quite proud to be a lowly software engineer. I love the open desk seating, and I'm way more productive in that setting. It's also been quite humbling to work side by side with these hotshot developers who are only a couple of years out of college and know way more than I do about programming.

I will be frank that Google doesn't always do the best job reaching out to folks with PhDs or coming from an academic background. When I interviewed (both in 2002 and in 2010), I didn't get a good sense of what I could contribute at Google. The software engineering interview can be fairly brutal: I was asked questions about things I haven't seen since I was a sophomore in college. And a lot of people you talk to will tell you (incorrectly) that 'Google doesn't do research.' Since I've been at Google for a few months, I have a much better picture and one of my goals is to get the company to do a better job at this. I'll try to use this blog to give some of the insider view as well.

Obligatory disclaimer: This is my personal blog. The views expressed here are mine alone and not those of my employer.

Tuesday, 1 February 2011

Why Netflix Picked Amazon SimpleDB, Hadoop/HBase, and Cassandra

Yury Izrailevsky^[1]:

The reason why we use multiple NoSQL solutions is because each one is best suited for a specific set of use cases. For example, HBase is naturally integrated with the Hadoop platform, whereas Cassandra is best for cross-regional deployments and scaling with no single points of failure. Adopting the non-relational model in general is not easy, and Netflix has been paying a steep pioneer tax while integrating these rapidly evolving and still maturing NoSQL products. There is a learning curve and an operational overhead. Still, the scalability, availability and performance advantages of the NoSQL persistence model are evident and are paying for themselves already, and will be central to our long-term cloud strategy.

Summarizing the pros for each of the 3 solutions:

Amazon SimpleDB Pros
- highly durable, writes spanning multiple availability zones
- handy query and data formats
- batch operations
- consistent reads
- hosted solution

HBase Pros
- dynamic partitioning model
- built-in support for compression
- range queries
- support for distributed counters
- strong consistency
- interoperability with Hadoop

Cassandra Pros
- no dedicated name nodes
- no practical architectural limitations on data sizes, row/column counts, etc.
- flexible data model
- no underlying storage format requirements like HDFS
- uniquely flexible consistency and replication models
- cross-datacenter and cross-regional replication
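
The "flexible consistency" item in the Cassandra list refers to consistency levels that can be chosen per read and per write. A minimal sketch of the usual replica-overlap rule behind that idea follows: with N replicas, a read that waits for R replicas is guaranteed to see the latest write acknowledged by W replicas only when R + W > N. The class and function names here are hypothetical and written for illustration; they are not part of any Cassandra driver. Only the level names ONE, QUORUM and ALL are taken from Cassandra.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsistencySetting:
    """Hypothetical model of one read/write consistency choice."""
    replication_factor: int  # N: replicas that hold each row
    write_acks: int          # W: replicas that must acknowledge a write
    read_acks: int           # R: replicas that must answer a read

    def reads_see_latest_write(self) -> bool:
        # Any read set overlaps any write set only when R + W > N.
        return self.read_acks + self.write_acks > self.replication_factor


def quorum(replication_factor: int) -> int:
    # A majority of the replicas: floor(N / 2) + 1.
    return replication_factor // 2 + 1


if __name__ == "__main__":
    n = 3
    settings = {
        "write ONE / read ONE": ConsistencySetting(n, 1, 1),
        "write QUORUM / read QUORUM": ConsistencySetting(n, quorum(n), quorum(n)),
        "write ALL / read ONE": ConsistencySetting(n, n, 1),
    }
    for name, setting in settings.items():
        print(name, "->", setting.reads_see_latest_write())
```

With three replicas, only the first combination leaves room for stale reads. That is the trade-off being tuned: fewer acknowledgements mean lower latency and higher availability, more acknowledgements mean stronger guarantees.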

I hope the next post will be about the “small” issues Netflix ran into when adopting each of these systems. In the past they’ve shared some of the challenges of an Oracle - Amazon SimpleDB hybrid solution.

[1] Yury Izrailevsky: Netflix Director of Cloud and Systems Infrastructure

Monday, 28 February 2011

TechPack - Cloud Computing