Wednesday, 18 December 2013

Using RequireJS with Angular - Inline Block's Blog


Since attending Fluent Conf 2013 and watching the many AngularJS talks and seeing the power of its constructs, I wanted to get some experience with it.
Most patterns for structuring single-page web apps use some sort of dependency management for the JavaScript instead of relying on global controllers or other similarly bad practices. Many of the AngularJS examples seem to follow these bad-ish patterns. Using angular.module('name', []) helps with this (why don't the tutorials show more angular.module() usage?), but you can still end up with a bunch of dependency-loading issues, at least without hardcoding your load order in your header. I even spent time talking to a few engineers with plenty of Angular experience, and they all seemed to be okay with just using something like Ruby's asset pipeline to include files (into a global scope) and making sure everything ends up in one file via the build process. I don't really like that, but if that works for you, do what you are most comfortable with.
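For context, here is a minimal sketch of the module-based style (the module and controller names are illustrative, not from any particular tutorial):

// Create the module once (passing the dependency array creates it)...
var app = angular.module('myApp', []);

// ...then register pieces on the module instead of declaring global controller functions.
app.controller('GreetingCtrl', ['$scope', function ($scope) {
  $scope.greeting = 'Hello';
}]);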

Why RequireJS?

I love using RequireJS. You can asynchronously load your dependencies and basically remove all globals from your app. You can use r.js to compile all your JavaScript into a single file and minify it easily, so that your app loads quickly.
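For reference, a minimal r.js build profile can look roughly like this (the file names and output path are just examples, not from this project):

// build.js -- run with: node r.js -o build.js
({
  baseUrl: 'javascripts',
  mainConfigFile: 'javascripts/main.js',
  name: 'main',
  out: 'javascripts/main-built.js',
  // CDN-hosted dependencies (like the Google-hosted angular in main.js below)
  // can be excluded from the built file by mapping them to 'empty:'.
  paths: {
    'jQuery': 'empty:',
    'angular': 'empty:',
    'angular-resource': 'empty:'
  }
})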
So how does this work with Angular? You'd think it would be easy for a single-page web app. You need your 'module', a.k.a. your app. You add routing to your app, but the routing needs the controllers, and the controllers need the module they belong to. If you don't structure your code and what you load with RequireJS in the right order, you end up with circular dependencies.

Example

So below is my directory structure. My module/app is called "mainApp".
My base public directory:
directory listing
index.html
- javascripts
    - controllers/
    - directives/
    - factories/
    - modules/
    - routes/
    - templates/
    - vendors/
      require.js
      jquery.js
    main.js
    require.js
- stylesheets/
  ...
Here is my boot file, aka my main.js.
javascripts/main.js
require.config({
  baseUrl: '/javascripts',
  paths: {
    'jQuery': '//ajax.googleapis.com/ajax/libs/jquery/1.10.1/jquery.min',
    'angular': '//ajax.googleapis.com/ajax/libs/angularjs/1.0.7/angular',
    'angular-resource': '//ajax.googleapis.com/ajax/libs/angularjs/1.0.7/angular-resource'
  },
  shim: {
    'angular': { 'exports': 'angular' },
    'angular-resource': { deps: ['angular'] },
    'jQuery': { 'exports': 'jQuery' }
  }
});

require(['jQuery', 'angular', 'routes/mainRoutes'], function ($, angular, mainRoutes) {
  $(function () { // using jQuery because it will run this even if DOM load already happened
    angular.bootstrap(document, ['mainApp']);
  });
});
You'll notice that I am not loading mainApp here. Basically, we bring in the last thing that needs to be configured for the app to load, which prevents circular dependencies. Since the routes need the mainApp controllers and the controllers need the mainApp module, we just have them require mainApp.js directly.
We also configure RequireJS to bring in angular and angular-resource (angular-resource so we can create model factories).
Here is my super simple mainApp.js
javascripts/modules/mainApp.js
define(['angular', 'angular-resource'], function (angular) {
  return angular.module('mainApp', ['ngResource']);
});
And here is my mainRoutes file:
javascripts/routes/mainRoutes.js
define(['modules/mainApp', 'controllers/listCtrl'], function (mainApp) {
  return mainApp.config(['$routeProvider', function ($routeProvider) {
    $routeProvider.when('/', { controller: 'listCtrl', templateUrl: '/templates/List.html' });
  }]);
});
You will notice I require listCtrl but never actually use its reference. Including it registers the controller on my mainApp module so the route can use it.
Here is my super simple controller:
javascripts/controllers/listCtrl.js
define(['modules/mainApp', 'factories/Item'], function (mainApp) {
  mainApp.controller('listCtrl', ['$scope', 'Item', function ($scope, Item) {
    $scope.items = Item.query();
  }]);
});
So you'll notice I have to include mainApp again so I can add the controller to it. I also have a dependency on Item, which in this case is a factory. The reason I include it is so that it gets added to the app and the dependency injection works. Again, I don't actually reference it; I just let dependency injection do its thing.
Let's take a look at this factory really quick.
javascripts/factories/Item.js
define(['modules/mainApp'], function (mainApp) {
  mainApp.factory('Item', ['$resource', function ($resource) {
    return $resource('/item/:id', { id: '@id' });
  }]);
});
Pretty simple, but again, we have to pull in that mainApp module to add the factory to it.
So finally, let's look at our index.html. Most of it is simple stuff, but the key part is the ng-view element, which tells Angular where to place the view. Even if you don't bootstrap on document and opt for a specific element instead, you still need this ng-view (see the sketch after the markup below).
index.html
<!DOCTYPE html>
<html>
<head>
  <title>Angular and Require</title>
  <script src="/javascripts/require.js" data-main="javascripts/main"></script>
</head>
<body>
  <div class="page-content">
    <ng:view></ng:view>
  </div>
</body>
</html>
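And if you do bootstrap a specific element instead of the whole document, only the require() call in main.js changes; a quick sketch (the '.page-content' selector is just an example):

require(['jQuery', 'angular', 'routes/mainRoutes'], function ($, angular) {
  $(function () {
    // Bootstrap only the .page-content element; the ng-view still has to live inside it.
    angular.bootstrap(document.querySelector('.page-content'), ['mainApp']);
  });
});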
Posted by Inline Block Jun 6th, 2013 amd, angular, angularjs, coding, javascript, requirejs



Monday, 9 December 2013

PayPal Switches from Java to JavaScript



by Abel Avram on Nov 29, 2013
PayPal has decided to use JavaScript from the browser all the way to the back-end server for its web applications, giving up legacy code written in JSP/Java.
Jeff Harrell, Director of Engineering at PayPal, has explained in a couple of blog posts (Set My UI Free Part 1: Dust JavaScript Templating, Open Source and More; Node.js at PayPal) why they made the decision, along with some conclusions drawn from switching their web application development from Java/JSP to a complete JavaScript/Node.js stack.
According to Harrell, PayPal's websites had accumulated a good deal of technical debt, and they wanted a "technology stack free of this which would enable greater product agility and innovation." Initially, there was a significant divide between front-end engineers working in web technologies and back-end ones coding in Java. When a UX person wanted to sketch up some pages, they had to ask Java programmers to do some back-end wiring to make it work. This did not fit with their Lean UX development model:
At the time, our UI applications were based on Java and JSP using a proprietary solution that was rigid, tightly coupled and hard to move fast in. Our teams didn't find it complimentary to our Lean UX development model and couldn't move fast in it so they would build their prototypes in a scripting language, test them with users, and then later port the code over to our production stack.
They wanted a "templating [solution that] must be decoupled from the underlying server technology and allow us to evolve our UIs independent of the application language" and that would work in multiple environments. They decided to go with Dust.js, a templating framework backed by LinkedIn, plus Twitter's Bootstrap and Bower, a package manager for the web. Additional pieces added later were LESS, RequireJS, Backbone.js, Grunt, and Mocha.
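To give a feel for the Dust side, here is a minimal sketch using the dustjs-linkedin package (my example, not taken from PayPal's posts):

var dust = require('dustjs-linkedin');

// Compile a template source string under a name, register it, then render it with data.
dust.loadSource(dust.compile('Hello {name}! You have {count} new messages.', 'greeting'));

dust.render('greeting', { name: 'Mollie', count: 30 }, function (err, out) {
  console.log(out); // "Hello Mollie! You have 30 new messages."
});

The same compiled template can be rendered in the browser, in Node.js, or (as described below) via V8 or Rhino embedded in the legacy stacks, which is what decouples the UI from the back-end language.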
Some of PayPal's pages have been redesigned but they still had some of the legacy stack:
… we have legacy C++/XSL and Java/JSP stacks, and we didn't want to leave these UIs behind as we continued to move forward. JavaScript templates are ideal for this. On the C++ stack, we built a library that used V8 to perform Dust renders natively – this was amazingly fast! On the Java side, we integrated Dust using a Spring ViewResolver coupled with Rhino to render the views.
At that time, they also started using Node.js for prototyping new pages, found it "extremely proficient," and decided to try it in production. For that they also built Kraken.js, a "convention layer" placed on top of Express, a Node.js-based web framework. (PayPal has recently open sourced Kraken.js.) The first application built in Node.js was the account overview page, one of the most accessed PayPal pages, according to Harrell. Because they were afraid the app might not scale well, they also created an equivalent Java application to fall back on in case the Node.js one didn't work out. Following are some conclusions regarding the development effort required for both apps:
                   Java/Spring     JavaScript/Node.js
Set-up time        0               2 months
Development        ~5 months       ~3 months
Engineers          5               2
Lines of code      unspecified     66% of unspecified
The JavaScript team needed 2 months for the initial setup of the infrastructure, but with fewer people it built an application with the same functionality in less time. Running the test suite on production hardware, they concluded that the Node.js app performed better than the Java one, serving:
Double the requests per second vs. the Java application. This is even more interesting because our initial performance results were using a single core for the node.js application compared to five cores in Java. We expect to increase this divide further.
and having
35% decrease in the average response time for the same page. This resulted in the pages being served 200ms faster— something users will definitely notice.
As a result, PayPal began using the Node.js application in beta in production, and have decided that "all of our consumer facing web applications going forward will be built on Node.js," while some of the existing ones are being ported to Node.js.
One of the benefits of using JavaScript from browser to server is, according to Harrell, the elimination of a divide between front and back-end development by having one team "which allows us to understand and react to our users' needs at any level in the technology stack."




List of 20+ Sentiment Analysis APIs | Mashape Blog



Just a few days back we posted a List of 40+ Machine Learning APIs.
The APIs below are the Sentiment Analysis subset of that Machine Learning API list. Sentiment Analysis refers to "the application of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in source materials."
We hope you'll find it useful! (A quick sketch of what calling one of these APIs looks like follows the list.)
  1. Sentiment Analysis for Social Media - The multilingual sentiment analysis API (with exceptional accuracy, 83.4% as opposed to the industry standard of 65.4%, and available in Mandarin) from Chatterbox classifies social media texts as positive or negative, with a free daily allowance to get you started. The system uses advanced statistical models (machine learning & NLP) trained on social data, meaning the detection can handle slang, common misspellings, emoticons, hashtags, etc.
  2. Text-Processing - Sentiment analysis, stemming and lemmatization, part-of-speech tagging and chunking, phrase extraction and named entity recognition.
  3. ML Analyzer - Text Classification, Article Summarization, Sentiment Analysis, Stock symbol extraction, Person Names Extractor, Language Detection, Locations Extractor, Adult content Analyzer.
  4. Anger Detection for Social Media  - This unique API will revolutionise your service levels, protect your brand and monitor both sales and promotional campaigns. Designed specifically for social media this API automatically measures the anger levels within social messages so you can quickly highlight action points. Combined with Chatterbox Sentiment Analysis, Anger Detection is designed to protect your brand and service interaction with an online audience.
  5. TweetSentiments - Returns the sentiment of Tweets. Two online APIs call the Twitter API to analyze Tweets from a given Twitter user or Tweets returned by a Twitter search query. The offline API analyzes texts of Tweets you've already got, one Tweet at a time.
  6. Repustate Sentiment and Social Media Analytics - Repustate's sentiment analysis and social media analytics API allows you to extract key words and phrases and determine social media sentiment in one of many languages. These languages include English, Arabic, German, French and Spanish. Monitor social media as well using our API and retrieve your data all with simple API calls.
  7. Chinese Sentiment Analysis for Social Media - This API performs sentiment analysis on Chinese social media text (for example, Sina Weibo), classifying each message as positive or negative. The system is built on social media data, so it can handle slang, special terms, and other emerging internet language. Note: the free version classifies up to 500 messages per day; usage beyond that limit incurs additional charges.
  8. Viralheat Sentiment - Viralheat Sentiment is a free API that allows users to submit short chunks of text for sentiment scoring.
  9. Text Processing - The WebKnox text processing API lets you process (natural) language texts. You can detect the text's language, the quality of the writing, find entity mentions, tag part-of-speech, extract dates, extract locations, or determine the sentiment of the text.
  10. Skyttle - Skyttle API is designed to turn any text into constituent terms (meaningful expressions), entities (names of people, places, and things), and sentiment terms. Languages supported are English, Spanish, French, German, Chinese, Swedish, Greek, Czech, Italian and Russian.
  11. Fluxifi NLP - Cloud based Natural Language Processing API. Includes Sentiment and Language Detection.
  12. Sentiment Analysis Spanish - Sentiment analysis for Spanish language of any given tweet.
  13. AlchemyAPI - AlchemyAPI provides advanced cloud-based and on-premise text analysis infrastructure that eliminates the expense and difficulty of integrating natural language processing systems into your application, service, or data processing pipeline.
  14. nlpTools - Text processing framework to analyse Natural Language. It is especially focused on text classification and sentiment analysis of online news media (general-purpose, multiple topics).
  15. Chinese Analytics - Soshio allows companies to quickly expand their understanding of the Chinese market. Its Chinese Analytics API provides Chinese text analytics and sentiment analysis capabilities for businesses to create their own social monitoring dashboard.
  16. Truthy - Write scripts to work with our data, statistics, and images using the API. Download tweet volume over time, network layout, and statistics about memes and users, such as predicted political partisanship, sentiment score, language, and activity.
  17. Speech2Topics - Yactraq Speech2Topics is a cloud service that converts audiovisual content into topic metadata via speech recognition & natural language processing. Customers use Yactraq metadata to target ads, build UX features like content search/discovery and mine Youtube videos for brand sentiment. In the past such services have been expensive and only used by large video publishers. The unique thing about Yactraq is we deliver our service at a price any product developer can afford.
  18. Bitext Sentiment Analysis - The purpose of this service is to extract opinions from text. An opinion represents the subject an author is writing about and a sentiment score that classifies how positively or negatively the author feels towards that subject. Deep Linguistic Analysis is used to identify the subject the author is discussing.  
  19. Textalytics Sentiment Analysis - Multilingual sentiment analysis of texts from different sources (blogs, social networks,…). Besides polarity at sentence and global level, Textalytics Sentiment Analysis 1.1 uses advanced natural language processing techniques to also detect the polarity associated to both entities and concepts in the text. Sentiment Analysis also gives the user the possibility of detecting the polarity of user-defined entities and concepts, making the service a flexible tool applicable to any kind of scenario.
  20. Sentiment - This tool works by examining individual words and short sequences of words (n-grams) and comparing them with a probability model. The probability model is built on a prelabeled test set of IMDb movie reviews. It can also detect negations in phrases, e.g., the phrase "not bad" will be classified as positive despite having two individual words with a negative sentiment.
  21. Starget sentiment analysis - This is a short-text (a tweet or a single sentence) sentiment classification API. It has two types of analysis: one that finds more (but less accurate) sentiment snippets, and another that finds more accurate sentiment (but misses some difficult cases).
  22. Textalytics Media Analysis - Textalytics Media Analysis API analyzes mentions, topics, opinions and facts in all types of media. Among its services is sentiment analysis, which extracts positive and negative opinions according to the context.
  23. Nevahold - Nevahold is a customer service application that leverages the social influence of its community to help users get their voice heard by companies. This API gives you real-time information on a company's response time on Facebook and Twitter, average response rate, customer service score, sentiments, geo-locations of interactions, and trending keywords.
  24. Free Natural Language Processing Service - 100% free service including sentiment analysis, content extraction, and language detection. Enjoy!
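To give a rough idea of what using these services involves, most of them boil down to a single HTTP call. Here is a Node.js sketch against the Text-Processing endpoint (the URL and response shape are assumptions based on its public documentation; the other APIs above differ mainly in URL, authentication headers, and output format):

// POST some text to the sentiment endpoint and read back a label plus probabilities (Node 18+).
var body = new URLSearchParams({ text: 'This movie was surprisingly good!' });

fetch('http://text-processing.com/api/sentiment/', { method: 'POST', body: body })
  .then(function (res) { return res.json(); })
  .then(function (result) {
    // Expected shape: { label: 'pos' | 'neg' | 'neutral', probability: { pos: ..., neg: ..., neutral: ... } }
    console.log(result.label, result.probability);
  });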
You should also check out our other useful API lists for machine learning , natural language processing , summarizing text , SMS APIs , and face recognition APIs .


Tuesday, 26 November 2013

Coreference Resolution Tools : A first look – Dreaming in Data



2010-09-28 17:44:08 » Natural Language Processing



Coreference is where two or more noun phrases refer to the same entity. It is an integral part of natural languages, used to avoid repetition, demonstrate possession or relation, etc.
E.g.: Harry wouldn't bother to read "Hogwarts: A History" as long as Hermione is around. He knows she knows the book by heart.
The different types of coreference include:
Noun phrases: Hogwarts A history <- the book
Pronouns : Harry <- He
Possessives : her, his, their
Demonstratives:  This boy
Coreference resolution (or anaphora resolution) is determining what an entity is referring to. This has profound applications in NLP tasks such as semantic analysis, text summarisation, sentiment analysis, etc.
In spite of extensive research, the number of tools available for coreference resolution, and their level of maturity, is much lower than for more established NLP tasks such as parsing. This is due to the inherent ambiguities in resolution.
The following are some of the tools currently available. Almost all of them come bundled with sentence detectors, taggers, parsers, named entity recognizers, etc., as setting these up individually would be tedious.
Let us use the following sentences, taken from one of the presentations on BART, as input. I'm using the demo app wherever possible; where that isn't available, I'm installing the tool on my local machine.
Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. Lionel Logue, a renowned speech therapist, was summoned to help the King overcome his speech impediment.
The following equivalence sets need to be identified.
QE: Queen Elizabeth, her
KG: husband, King George VI, the King, his
LL: Lionel Logue, a renowned speech therapist
The results are as follows.
For each tool below, the raw output is shown first, followed by comments.
Illinois Coreference Package
Lionel Logue(0)
a renowned speech |therapist|(2)
his(8)
Queen Elizabeth(3)
the |King|(5)
King(4)
King |George VI|(7)
transforming her |husband|(6)
her(1)
a viable |monarch|(9)
The 'his' and 'her' are matched to the wrong entities: 'his' is matched to Logue, and 'her' is matched to King George.
CherryPicker
<COREF ID="1">Queen</COREF> <COREF ID="2">Elizabeth</COREF> set about transforming <COREF ID="3" REF="2">her</COREF> <COREF ID="4">husband</COREF>, <COREF ID="5">King</COREF> <COREF ID="6" REF="5">George VI</COREF>, into a viable monarch.
<COREF ID="9">Lionel Logue</COREF>, a renowned speech <COREF ID="10">therapist</COREF>, was summoned to help the <COREF ID="7" REF="5">King</COREF> overcome <COREF ID="8" REF="5">his</COREF> speech impediment.
Queen
    her
Elizabeth
Husband
King
    George VI
    King
    his
Lionel Logue
Therapist
It is mostly OK, except that Queen Elizabeth is split into two entities, Queen and Elizabeth. Other than that, it is one of the best results. Notably, it matches the King to King George VI, and hence 'his' is correctly mapped to King George VI.
Natural Language Synergy Lab
Queen Elizabeth set about transforming her husband , King George VI , into a viable monarch. Lionel Logue , a renowned speech therapist , was summoned to help the King overcome his speech impediment .
BART
{person Queen Elizabeth } set about transforming {np {np her } husband } , {person King George VI } , into {np a viable monarch } . {person Lionel Logue } , {np a renowned {np speech } therapist } , was summoned to help {np the King } overcome {np {np his } {np speech } impediment } .

Coreference chain 1

{person Queen Elizabeth }
{np her }
{np a viable monarch }
{np the King }
{np his }

Coreference chain 2

{person Lionel Logue }
{np a renowned {np speech } therapist }

Coreference chain 3

{np speech }
{np speech }
JavaRAP
********Anaphor-antecedent pairs*****
(0,0) Queen Elizabeth <– (0,5) her,
(1,12) the King <– (1,15) his
********Text with substitution*****
Queen Elizabeth set about transforming <Queen Elizabeth's> husband, King George VI, into a viable monarch.
Lionel Logue, a renowned speech therapist, was summoned to help the King overcome <the King's> speech impediment.
It attempts only pronoun resolution, but that has been done well.
GuiTAR
Failed
OpenNLP
(TOP (S (NP#6 (NNP Queen) (NNP Elizabeth)) (VP (VBD set) (PP (IN about) (S (VP (VBG transforming) (NP (NP (NML#6 (PRP$ her)) (NN husband)) (, ,) (NP (NNP King) (person (NNP George) (NNP VI)))) (, ,) (PP (IN into) (NP (DT a) (JJ viable) (NN monarch))))))) (. .)) )
(TOP (S (NP#1 (NP (person (NNP Lionel) (NNP Logue))) (, ,) (NP#1 (DT a) (JJ renowned) (NN speech) (NN therapist))) (, ,) (VP (VBD was) (VP (VBN summoned) (S (VP (TO to) (VP (VB help) (S (NP (DT the) (NNP King)) (VP (VBN overcome) (NP (NML#1 (PRP$ his)) (NN speech) (NN impediment))))))))) (. .)) )
Lionel Logue
a renowned speech therapist
Queen Elizabeth
her husband
Reconcile
<NP NO="0" CorefID="1">Queen Elizabeth</NP> set about transforming <NP NO="2" CorefID="3"><NP NO="1" CorefID="1">her</NP> husband</NP>, <NP NO="3" CorefID="3">King George VI</NP>, into <NP NO="4" CorefID="4">a viable monarch</NP>. <NP NO="5" CorefID="6">Lionel Logue</NP>, <NP NO="6" CorefID="6">a renowned speech therapist</NP>, was summoned to help <NP NO="7" CorefID="6">the King</NP> overcome <NP NO="9" CorefID="9"><NP NO="8" CorefID="6">his</NP> speech impediment</NP>.
Queen Elizabeth
her
her husband
King George VI
A Viable monarch
Lionel Logue
a renowned speech therapist
the king
his
'the King' has been wrongly attributed to Lionel Logue, which resulted in 'his' also being wrongly attributed.
ARKref
[Queen Elizabeth]1 set about transforming [[her]1 husband , [King George VI]2 ,]2 into [a viable monarch] .
[Lionel Logue , [a renowned speech therapist]6 ,]6 was summoned to help [the King]8 overcome [[his]8 speech impediment] .
One of the best results. The only information lacking is the link from 'the King' to 'King George VI'.
As a side note, CherryPicker fails with the following error:
cherrypicker1.01/tools/crf++/.libs/lt-crf_test: error while loading shared libraries: libcrfpp.so.0: cannot open shared object file: No such file or directory
To proceed, we need to download and install CRF++.
Then we need to change line 18 of the cherrypicker.sh file from
tools/crf++/crf_test -m modelmd $1.crf > $1.pred
to
crf_test -m modelmd $1.crf > $1.pred
Then open a new terminal and run:
sudo ldconfig
Now CherryPicker should work.
ARKref and Cherrypicker seem to be the best options available right now.
Are there any other coreference resolution systems that have not been looked at?  Can we add more about the above tools?  Please post your comments.


Wednesday, 6 November 2013

Presto: Interacting with petabytes of data at Facebook



By Lydia Chan on Wednesday, November 6, 2013 at 7:01pm
By Martin Traverso
Background
Facebook is a data-driven company. Data processing and analytics are at the heart of building and delivering products for the 1 billion+ active users of Facebook. We have one of the largest data warehouses in the world, storing more than 300 petabytes. The data is used for a wide range of applications, from traditional batch processing to graph analytics [1], machine learning, and real-time interactive analytics.
For the analysts, data scientists, and engineers who crunch data, derive insights, and work to continuously improve our products, the performance of queries against our data warehouse is important. Being able to run more queries and get results faster improves their productivity.
Facebook's warehouse data is stored in a few large Hadoop/HDFS-based clusters. Hadoop MapReduce [2] and Hive are designed for large-scale, reliable computation, and are optimized for overall system throughput. But as our warehouse grew to petabyte scale and our needs evolved, it became clear that we needed an interactive system optimized for low query latency.
In Fall 2012, a small team in the Facebook Data Infrastructure group set out to solve this problem for our warehouse users. We evaluated a few external projects, but they were either too nascent or did not meet our requirements for flexibility and scale. So we decided to build Presto, a new interactive query system that could operate fast at petabyte scale.
In this post, we will briefly describe the architecture of Presto, its current status, and future roadmap.
Architecture
Presto is a distributed SQL query engine optimized for ad-hoc analysis at interactive speed. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions.
At a high level, the simplified system architecture of Presto works as follows. The client sends SQL to the Presto coordinator. The coordinator parses, analyzes, and plans the query execution. The scheduler wires together the execution pipeline, assigns work to nodes closest to the data, and monitors progress. The client pulls data from the output stage, which in turn pulls data from the underlying stages.
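To make the client/coordinator interaction concrete, here is a rough sketch of submitting a query over Presto's HTTP protocol (the /v1/statement endpoint, headers, and nextUri polling come from the open-source client protocol rather than from this post, so treat the details as assumptions):

// Sketch (Node 18+): send SQL to the coordinator, then follow nextUri until the query finishes.
var PRESTO = 'http://presto-coordinator.example.com:8080';

function run(sql) {
  return fetch(PRESTO + '/v1/statement', {
    method: 'POST',
    body: sql,
    headers: { 'X-Presto-User': 'analyst', 'X-Presto-Catalog': 'hive', 'X-Presto-Schema': 'default' }
  })
    .then(function (res) { return res.json(); })
    .then(function poll(page) {
      if (page.data) { page.data.forEach(function (row) { console.log(row); }); }
      if (!page.nextUri) { return page.stats; } // no nextUri means the query has finished
      return fetch(page.nextUri).then(function (res) { return res.json(); }).then(poll);
    });
}

run('SELECT COUNT(*) FROM page_views');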
The execution model of Presto is fundamentally different from Hive/MapReduce. Hive translates queries into multiple stages of MapReduce tasks that execute one after another. Each task reads inputs from disk and writes intermediate output back to disk. In contrast, the Presto engine does not use MapReduce. It employs a custom query and execution engine with operators designed to support SQL semantics. In addition to improved scheduling, all processing is in memory and pipelined across the network between stages. This avoids unnecessary I/O and associated latency overhead. The pipelined execution model runs multiple stages at once, and streams data from one stage to the next as it becomes available. This significantly reduces end-to-end latency for many types of queries.
The Presto system is implemented in Java because it's fast to develop, has a great ecosystem, and is easy to integrate with the rest of the data infrastructure components at Facebook that are primarily built in Java. Presto dynamically compiles certain portions of the query plan down to byte code which lets the JVM optimize and generate native machine code. Through careful use of memory and data structures, Presto avoids typical issues of Java code related to memory allocation and garbage collection. (In a later post, we will share some tips and tricks for writing high-performance Java system code and the lessons learned while building Presto.)
Extensibility is another key design point for Presto. During the initial phase of the project, we realized that large data sets were being stored in many other systems in addition to HDFS. Some data stores are well-known systems such as HBase, but others are custom systems such as the Facebook News Feed backend. Presto was designed with a simple storage abstraction that makes it easy to provide SQL query capability against these disparate data sources. Storage plugins (called connectors) only need to provide interfaces for fetching metadata, getting data locations, and accessing the data itself. In addition to the primary Hive/HDFS backend, we have built Presto connectors to several other systems, including HBase, Scribe, and other custom systems.
Current status
As mentioned above, development on Presto started in Fall 2012. We had our first production system up and running in early 2013. It was fully rolled out to the entire company by Spring 2013. Since then, Presto has become a major interactive system for the company's data warehouse. It is deployed in multiple geographical regions and we have successfully scaled a single cluster to 1,000 nodes. The system is actively used by over a thousand employees, who run more than 30,000 queries processing one petabyte daily.
Presto is 10x better than Hive/MapReduce in terms of CPU efficiency and latency for most queries at Facebook. It currently supports a large subset of ANSI SQL, including joins, left/right outer joins, subqueries, and most of the common aggregate and scalar functions, including approximate distinct counts (using HyperLogLog) and approximate percentiles (based on quantile digest). The main restrictions at this stage are a size limitation on the join tables and cardinality of unique keys/groups. The system also lacks the ability to write output data back to tables (currently query results are streamed to the client).
Roadmap
We are actively working on extending Presto functionality and improving performance. In the next few months, we will remove restrictions on join and aggregation sizes and introduce the ability to write output tables. We are also working on a query "accelerator" by designing a new data format that is optimized for query processing and avoids unnecessary transformations. This feature will allow hot subsets of data to be cached from the backend data store, and the system will transparently use the cached data to "accelerate" queries. We are also working on a high-performance HBase connector.
Open source
After our initial Presto announcement at the Analytics @ WebScale conference in June 2013 [3], there has been a lot of interest from the external community. In the last couple of months, we have released Presto code and binaries to a small number of external companies. They have successfully deployed and tested it within their environments and given us great feedback.
Today we are very happy to announce that we are open-sourcing Presto. You can check out the code and documentation on the site below. We look forward to hearing about your use cases and how Presto can help with your interactive analysis.
http://prestodb.io/
https://github.com/facebook/presto
The Presto team within Facebook Data Infrastructure consists of Martin Traverso, Dain Sundstrom, David Phillips, Eric Hwang, Nileema Shingte and Ravi Murthy.
Links
[1] Scaling Apache Giraph to a trillion edges. https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
[2] Under the hood: Scheduling MapReduce jobs more efficiently with Corona https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
[3] Video of Presto talk at Analytics@Webscale conference, June 2013 https://www.facebook.com/photo.php?v=10202463462128185
