Tuesday, 9 November 2010

Introducing S4

Stream of Words

Motivation

Data streams abound in the world of Big Data: Twitter, search queries, stock quotes, website analytics, and sensor data, to name a few. Yet popular approaches to data processing at this scale are based on MapReduce, a batch-oriented framework. In other cases there are proprietary stream processing systems or ad hoc solutions for particular problems.

We understand that the greatest value from data is sometimes derived by processing it as soon as we get it. S4 is a general-purpose open platform built from the ground up to process data streams. By that we mean processing data as it arrives, one event at a time, not buffered in arbitrary batches.

Application

In search advertising we aim to select, rank, filter, and place ads in ways that benefit users, publishers, and advertisers. In other words, we want to show as few ads as possible that are highly relevant and generate the most clicks and revenue. To accomplish all this, we build machine learning models that accurately predict how users will click on ads. To improve the accuracy of the click prediction, we process recent search queries, clicks, and timing information. We call this personalization. To scale, we need to process thousands of search queries per second from potentially millions of users per day. This problem is what motivated us to build a software platform for processing events, much as MapReduce and Hadoop try to address the scalability problem in batch processing and the NoSQL projects try to address the scalability problem for data storage.
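To make the event-at-a-time idea a bit more concrete, here is a minimal Python sketch of a per-user click-rate tracker that updates its state on every incoming event instead of waiting for a batch. It is only an illustration: the names Event, ClickRateTracker, and on_event are made up for this post and are not part of the S4 API, and real click prediction uses far richer models than a running ratio.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical stream event: a user either saw an ad or clicked one."""
    user: str
    kind: str  # "impression" or "click"


class ClickRateTracker:
    """Keeps per-user counts and updates them as each event arrives."""

    def __init__(self):
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def on_event(self, event):
        # Update state immediately; nothing is buffered into a batch.
        if event.kind == "impression":
            self.impressions[event.user] += 1
        elif event.kind == "click":
            self.clicks[event.user] += 1
        return self.click_rate(event.user)

    def click_rate(self, user):
        # Clicks per impression for this user, or 0.0 if nothing was shown yet.
        shown = self.impressions[user]
        return self.clicks[user] / shown if shown else 0.0


tracker = ClickRateTracker()
stream = [
    Event("alice", "impression"),
    Event("alice", "click"),
    Event("bob", "impression"),
    Event("alice", "impression"),
]
# The per-user estimate is refreshed after every single event.
rates = [tracker.on_event(e) for e in stream]
```

Because state is keyed by user, this kind of work can in principle be spread across many machines by routing each user's events to the same worker, which is the sort of scaling problem described above.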

Who we are

Leo Neumeyer led the search ad optimization group and became the main internal champion and early designer of the project. Bruce Robbins has been the key software engineer who coded 80% of the system and is a key designer. Many scientists, including Anish Nair, Anand Kesari, Stefan Schroedl, Erick Cantu-Paz, and Jon Malkin, implemented algorithms for personalization, click prediction, and online parameter optimization that drove the design of the project over several iterations. Kishore Gopalakrishna designed and implemented most of the communication layer as a standalone component on top of ZooKeeper. Ben Reed gave us lots of great advice and support, and JC Mao was the executive who supported making S4 an open source project. Finally, Gil Yehuda, Director of Open Source at Yahoo!, is a great open source evangelist who knows how to get things done.

We learned by trying to solve real-world problems, and we are just starting to understand the use cases and the challenges that need to be solved. For this reason we decided to make S4 open source in October 2010 after incubating it for a couple of years. It feels like in one day we got more feedback from the community than in the last two years combined, which is pretty amazing. Please remember that this is an early-stage release: it has a long road[map] ahead and we need to work on documentation. We challenge people passionate about stream processing to get involved and help us grow the project. The best way to get started is to play with the sample application and/or build your own. The forum is a great place to ask questions and give us feedback.
