Tuesday, 26 November 2013



Coreference Resolution Tools : A first look

2010-09-28 17:44:08 » Natural Language Processing



Coreference occurs when two or more noun phrases refer to the same entity. It is an integral part of natural language, used to avoid repetition, indicate possession or relation, etc.
E.g.: Harry wouldn't bother to read "Hogwarts: A History" as long as Hermione is around. He knows she knows the book by heart.
The different types of coreference include:
Noun phrases: "Hogwarts: A History" <- the book
Pronouns: Harry <- He
Possessives: her, his, their
Demonstratives: this boy
Coreference resolution, or anaphora resolution, is the task of determining which entity an expression refers to. It has profound applications in NLP tasks such as semantic analysis, text summarisation, sentiment analysis, etc.
In spite of extensive research, the number of tools available for coreference resolution, and their level of maturity, is much lower than for more established NLP tasks such as parsing. This is due to the inherent ambiguities in resolution.
The following are some of the tools currently available. Almost all of them come bundled with sentence detectors, taggers, parsers, named entity recognizers, etc., since setting these up separately would be tedious.
Let us use the following sentences, taken from one of the presentations on BART, as input. I'm using the demo app wherever possible; where there isn't one, I'm installing the tool on my local machine.
Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. Lionel Logue, a renowned speech therapist, was summoned to help the King overcome his speech impediment.
The following equivalence sets need to be identified (a minimal sketch of these target chains as a data structure follows the list).
QE: Queen Elizabeth, her
KG: husband, King George VI, the King, his
LL: Lionel Logue, a renowned speech therapist
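As a minimal sketch (my own illustration, not the output of any tool below), these gold chains can be written down as a map from entity label to mentions, which makes it easy to eyeball each tool's output against the target:

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GoldChains {
    public static void main(String[] args) {
        // Hand-written gold coreference chains for the two test sentences;
        // the labels QE, KG and LL match the equivalence sets listed above.
        Map<String, List<String>> chains = new LinkedHashMap<String, List<String>>();
        chains.put("QE", Arrays.asList("Queen Elizabeth", "her"));
        chains.put("KG", Arrays.asList("husband", "King George VI", "the King", "his"));
        chains.put("LL", Arrays.asList("Lionel Logue", "a renowned speech therapist"));
        System.out.println(chains);
    }
}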
The results are as follows.
For each tool, the raw result is shown first, followed by comments.
Illinois Coreference Package
Lionel Logue(0)
a renowned speech |therapist|(2)
his(8)
Queen Elizabeth(3)
the |King|(5)
King(4)
King |George VI|(7)
transforming her |husband|(6)
her(1)
a viable |monarch|(9)
'his' and 'her' are matched to the wrong entities: 'his' is matched to Logue and 'her' is matched to King George.
CherryPicker
<COREF ID="1">Queen</COREF> <COREF ID="2">Elizabeth</COREF> set about transforming <COREF ID="3" REF="2">her</COREF> <COREF ID="4">husband</COREF>, <COREF ID="5">King</COREF> <COREF ID="6" REF="5">George VI</COREF>, into a viable monarch.
<COREF ID="9">Lionel Logue</COREF>, a renowned speech <COREF ID="10">therapist</COREF>, was summoned to help the <COREF ID="7" REF="5">King</COREF> overcome <COREF ID="8" REF="5">his</COREF> speech impediment.
Queen
    her
Elizabeth
Husband
King
    George VI
    King
    his
Lionel Logue
Therapist
It is mostly correct, except that Queen Elizabeth is split into two entities, Queen and Elizabeth. Other than that, it is one of the best results. Notably, it matches 'the King' to King George VI, and hence 'his' is correctly mapped to King George VI.
Natural Language Synergy Lab
Queen Elizabeth set about transforming her husband , King George VI , into a viable monarch. Lionel Logue , a renowned speech therapist , was summoned to help the King overcome his speech impediment .
BART
{person Queen Elizabeth } set about transforming {np {np her } husband } , {person King George VI } , into {np a viable monarch } . {person Lionel Logue } , {np a renowned {np speech } therapist } , was summoned to help {np the King } overcome {np {np his } {np speech } impediment } .

Coreference chain 1

{person Queen Elizabeth }
{np her }
{np a viable monarch }
{np the King }
{np his }

Coreference chain 2

{person Lionel Logue }
{np a renowned {np speech } therapist }

Coreference chain 3

{np speech }
{np speech }
JavaRAP
********Anaphor-antecedent pairs*****
(0,0) Queen Elizabeth <– (0,5) her,
(1,12) the King <– (1,15) his
********Text with substitution*****
Queen Elizabeth set about transforming <Queen Elizabeth's> husband, King George VI, into a viable monarch.
Lionel Logue, a renowned speech therapist, was summoned to help the King overcome <the King's> speech impediment.
It attempts only pronoun resolution, but that has been done well.
GuiTAR
Failed
OpenNLP
(TOP (S (NP#6 (NNP Queen) (NNP Elizabeth)) (VP (VBD set) (PP (IN about) (S (VP (VBG transforming) (NP (NP (NML#6 (PRP$ her)) (NN husband)) (, ,) (NP (NNP King) (person (NNP George) (NNP VI)))) (, ,) (PP (IN into) (NP (DT a) (JJ viable) (NN monarch))))))) (. .)) )
(TOP (S (NP#1 (NP (person (NNP Lionel) (NNP Logue))) (, ,) (NP#1 (DT a) (JJ renowned) (NN speech) (NN therapist))) (, ,) (VP (VBD was) (VP (VBN summoned) (S (VP (TO to) (VP (VB help) (S (NP (DT the) (NNP King)) (VP (VBN overcome) (NP (NML#1 (PRP$ his)) (NN speech) (NN impediment))))))))) (. .)) )
Lionel Logue
a renowned speech therapist
Queen Elizabeth
her husband
Reconcile
<NP NO="0" CorefID="1">Queen Elizabeth</NP> set about transforming <NP NO="2" CorefID="3"><NP NO="1" CorefID="1">her</NP> husband</NP>, <NP NO="3" CorefID="3">King George VI</NP>, into <NP NO="4" CorefID="4">a viable monarch</NP>. <NP NO="5" CorefID="6">Lionel Logue</NP>, <NP NO="6" CorefID="6">a renowned speech therapist</NP>, was summoned to help <NP NO="7" CorefID="6">the King</NP> overcome <NP NO="9" CorefID="9"><NP NO="8" CorefID="6">his</NP> speech impediment</NP>.
Queen Elizabeth
her
her husband
King George VI
A Viable monarch
Lionel Logue
a renowned speech therapist
the king
his
'the King' has been wrongly attributed to Lionel Logue, which resulted in 'his' also being wrongly attributed.
ARKref
[Queen Elizabeth]1 set about transforming [[her]1 husband , [King George VI]2 ,]2 into [a viable monarch] .
[Lionel Logue , [a renowned speech therapist]6 ,]6 was summoned to help [the King]8 overcome [[his]8 speech impediment] .
One of the best results; the only information lacking is the link from 'the King' to 'King George VI'.
As a side note, CherryPicker fails with the following error:
cherrypicker1.01/tools/crf++/.libs/lt-crf_test: error while loading shared libraries: libcrfpp.so.0: cannot open shared object file: No such file or directory
To proceed, we need to download and install CRF++.
Then we need to change line no 18 in the cherrypicker.sh file from
tools/crf++/crf_test -m modelmd $1.crf > $1.pred
to
crf_test -m modelmd $1.crf > $1.pred
Then open a new terminal and run
sudo ldconfig
Now CherryPicker should work.
ARKref and CherryPicker seem to be the best options available right now.
Are there any other coreference resolution systems that have not been looked at?  Can we add more about the above tools?  Please post your comments.


Wednesday, 6 November 2013

Presto: Interacting with petabytes of data at Facebook



By Lydia Chan on Wednesday, November 6, 2013 at 7:01pm
By Martin Traverso
Background
Facebook is a data-driven company. Data processing and analytics are at the heart of building and delivering products for the 1 billion+ active users of Facebook. We have one of the largest data warehouses in the world, storing more than 300 petabytes. The data is used for a wide range of applications, from traditional batch processing to graph analytics [1], machine learning, and real-time interactive analytics.
For the analysts, data scientists, and engineers who crunch data, derive insights, and work to continuously improve our products, the performance of queries against our data warehouse is important. Being able to run more queries and get results faster improves their productivity.
Facebook's warehouse data is stored in a few large Hadoop/HDFS-based clusters. Hadoop MapReduce [2] and Hive are designed for large-scale, reliable computation, and are optimized for overall system throughput. But as our warehouse grew to petabyte scale and our needs evolved, it became clear that we needed an interactive system optimized for low query latency.
In Fall 2012, a small team in the Facebook Data Infrastructure group set out to solve this problem for our warehouse users. We evaluated a few external projects, but they were either too nascent or did not meet our requirements for flexibility and scale. So we decided to build Presto, a new interactive query system that could operate fast at petabyte scale.
In this post, we will briefly describe the architecture of Presto, its current status, and future roadmap.
Architecture
Presto is a distributed SQL query engine optimized for ad-hoc analysis at interactive speed. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions.
The diagram below shows the simplified system architecture of Presto. The client sends SQL to the Presto coordinator. The coordinator parses, analyzes, and plans the query execution. The scheduler wires together the execution pipeline, assigns work to nodes closest to the data, and monitors progress. The client pulls data from the output stage, which in turn pulls data from the underlying stages.
The execution model of Presto is fundamentally different from Hive/MapReduce. Hive translates queries into multiple stages of MapReduce tasks that execute one after another. Each task reads inputs from disk and writes intermediate output back to disk. In contrast, the Presto engine does not use MapReduce. It employs a custom query and execution engine with operators designed to support SQL semantics. In addition to improved scheduling, all processing is in memory and pipelined across the network between stages. This avoids unnecessary I/O and associated latency overhead. The pipelined execution model runs multiple stages at once, and streams data from one stage to the next as it becomes available. This significantly reduces end-to-end latency for many types of queries.
The Presto system is implemented in Java because it's fast to develop, has a great ecosystem, and is easy to integrate with the rest of the data infrastructure components at Facebook that are primarily built in Java. Presto dynamically compiles certain portions of the query plan down to byte code which lets the JVM optimize and generate native machine code. Through careful use of memory and data structures, Presto avoids typical issues of Java code related to memory allocation and garbage collection. (In a later post, we will share some tips and tricks for writing high-performance Java system code and the lessons learned while building Presto.)
Extensibility is another key design point for Presto. During the initial phase of the project, we realized that large data sets were being stored in many other systems in addition to HDFS. Some data stores are well-known systems such as HBase, but others are custom systems such as the Facebook News Feed backend. Presto was designed with a simple storage abstraction that makes it easy to provide SQL query capability against these disparate data sources. Storage plugins (called connectors) only need to provide interfaces for fetching metadata, getting data locations, and accessing the data itself. In addition to the primary Hive/HDFS backend, we have built Presto connectors to several other systems, including HBase, Scribe, and other custom systems.
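To make the connector contract concrete, here is a hypothetical sketch of such a storage abstraction in Java. The interface and method names are purely illustrative (they are not Presto's actual SPI); they simply mirror the three responsibilities listed above:

import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical connector abstraction mirroring the three responsibilities
// described above: metadata, data locations, and data access.
// These names are illustrative only; they are not Presto's real plugin API.
public interface StorageConnector {

    // Fetch table metadata (column name -> type) for a table.
    Map<String, String> getTableMetadata(String catalog, String schema, String table);

    // Return the locations (e.g. hosts or splits) where the table's data lives,
    // so the scheduler can assign work close to the data.
    List<String> getDataLocations(String catalog, String schema, String table);

    // Stream the rows of one split back to the engine.
    Iterator<Object[]> readSplit(String splitId, List<String> columns);
}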
Current status
As mentioned above, development on Presto started in Fall 2012. We had our first production system up and running in early 2013. It was fully rolled out to the entire company by Spring 2013. Since then, Presto has become a major interactive system for the company's data warehouse. It is deployed in multiple geographical regions and we have successfully scaled a single cluster to 1,000 nodes. The system is actively used by over a thousand employees, who run more than 30,000 queries processing one petabyte daily.
Presto is 10x better than Hive/MapReduce in terms of CPU efficiency and latency for most queries at Facebook. It currently supports a large subset of ANSI SQL, including joins, left/right outer joins, subqueries, and most of the common aggregate and scalar functions, including approximate distinct counts (using HyperLogLog) and approximate percentiles (based on quantile digest). The main restrictions at this stage are a size limitation on the join tables and the cardinality of unique keys/groups. The system also lacks the ability to write output data back to tables (currently query results are streamed to the client).
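As a rough sketch of what running such an interactive query could look like from a Java client over JDBC: the driver class name, connection URL, credentials, catalog/schema, and the events table below are assumptions for illustration, while approx_distinct is Presto's approximate distinct count mentioned above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed driver class and URL format for the Presto JDBC driver;
        // host, catalog, schema, table, and columns are placeholders.
        Class.forName("com.facebook.presto.jdbc.PrestoDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:presto://presto-coordinator:8080/hive/default", "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT ds, approx_distinct(user_id) AS daily_users " +
                 "FROM events GROUP BY ds ORDER BY ds")) {
            while (rs.next()) {
                // Results are streamed back to the client as they are produced.
                System.out.println(rs.getString("ds") + "\t" + rs.getLong("daily_users"));
            }
        }
    }
}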
Roadmap
We are actively working on extending Presto functionality and improving performance. In the next few months, we will remove restrictions on join and aggregation sizes and introduce the ability to write output tables. We are also working on a query "accelerator" by designing a new data format that is optimized for query processing and avoids unnecessary transformations. This feature will allow hot subsets of data to be cached from the backend data store, and the system will transparently use cached data to "accelerate" queries. We are also working on a high-performance HBase connector.
Open source
After our initial Presto announcement at the Analytics @ WebScale conference in June 2013 [3], there has been a lot of interest from the external community. In the last couple of months, we have released Presto code and binaries to a small number of external companies. They have successfully deployed and tested it within their environments and given us great feedback.
Today we are very happy to announce that we are open-sourcing Presto. You can check out the code and documentation on the site below. We look forward to hearing about your use cases and how Presto can help with your interactive analysis.
http://prestodb.io/
https://github.com/facebook/presto
The Presto team within Facebook Data Infrastructure consists of Martin Traverso, Dain Sundstrom, David Phillips, Eric Hwang, Nileema Shingte and Ravi Murthy.
Links
[1] Scaling Apache Giraph to a trillion edges. https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
[2] Under the hood: Scheduling MapReduce jobs more efficiently with Corona https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
[3] Video of Presto talk at Analytics@Webscale conference, June 2013 https://www.facebook.com/photo.php?v=10202463462128185


Sunday, 29 September 2013

The Power of Proxies in Java | Javalobby


Clipped from: http://java.dzone.com/articles/power-proxies-java

The Power of Proxies in Java

05.10.2010
In this article, I'll show you the path that leads to true Java power, the use of proxies.
They are everywhere but only a handful of people know about them. Hibernate for lazy loading entities, Spring for AOP, LambdaJ for DSL, only to name a few: they all use their hidden magic. What are they? They are… Java's dynamic proxies.
Everyone knows about the GOF Proxy design pattern:
Allows for object level access control by acting as a pass through entity or a placeholder object.
Likewise, in Java, a dynamic proxy is an instance that acts as a pass-through to the real object. This powerful pattern lets you change the real behaviour from the caller's point of view, since method calls can be intercepted by the proxy.

Pure Java proxies

Pure Java proxies have some interesting properties:
  • They are based on runtime implementations of interfaces
  • They are public, final and not abstract
  • They extend java.lang.reflect.Proxy
In Java, the proxy itself is not as important as the proxy's behaviour. The latter is done in an implementation of java.lang.reflect.InvocationHandler. It has only a single method to implement:
public Object invoke(Object proxy, Method method, Object[] args)
  • proxy: the proxy instance that the method was invoked on
  • method: the Method instance corresponding to the interface method invoked on the proxy instance. The declaring class of the Method object will be the interface that the method was declared in, which may be a superinterface of the proxy interface that the proxy class inherits the method through
  • args: an array of objects containing the values of the arguments passed in the method invocation on the proxy instance, or null if the interface method takes no arguments. Arguments of primitive types are wrapped in instances of the appropriate primitive wrapper class, such as java.lang.Integer or java.lang.Boolean
Let's take a simple example: suppose we want a List to which elements cannot be added. The first step is to create the invocation handler:
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.util.List;

public class NoOpAddInvocationHandler implements InvocationHandler {

    private final List proxied;

    public NoOpAddInvocationHandler(List proxied) {
        this.proxied = proxied;
    }

    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        // Swallow add() and addAll() calls and report that nothing was added
        if (method.getName().startsWith("add")) {
            return false;
        }
        // Delegate every other call to the real list
        return method.invoke(proxied, args);
    }
}
The invoke method will intercept method calls and do nothing if the method name starts with "add". Otherwise, it will pass the call on to the real proxied object. This is a very crude example, but it is enough to let us understand the magic behind it.
Notice that in case you want your method call to pass through, you need to call the method on the real object. For this, you'll need a reference to the latter, something the invoke method does not provide. That's why, in most cases, it's a good idea to pass it to the constructor and store it as an attribute.
Note: under no circumstances should you call the method on the proxy itself, since it will be intercepted again by the invocation handler and you will be faced with a StackOverflowError.
To create the proxy itself:
List proxy = (List) Proxy.newProxyInstance(
    NoOpAddInvocationHandlerTest.class.getClassLoader(),
    new Class[] { List.class },
    new NoOpAddInvocationHandler(list));
The newProxyInstance method takes 3 arguments:
  • the class loader
  • an array of interfaces that will be implemented by the proxy
  • the power behind the throne in the form of the invocation handler
Now, if you try to add elements to the proxy by calling any add methods, it won't have any effect.
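Putting the two snippets together, a minimal test class, assuming it is named NoOpAddInvocationHandlerTest as in the class-loader argument above, could look like this:

import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

public class NoOpAddInvocationHandlerTest {
    public static void main(String[] args) {
        List list = new ArrayList();
        List proxy = (List) Proxy.newProxyInstance(
            NoOpAddInvocationHandlerTest.class.getClassLoader(),
            new Class[] { List.class },
            new NoOpAddInvocationHandler(list));

        proxy.add("ignored");             // intercepted: nothing is added
        System.out.println(proxy.size()); // prints 0; size() is delegated to the real list
    }
}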

CGLib proxies

Java proxies are runtime implementations of interfaces. Objects do not necessarily implement interfaces, and collections of objects do not necessarily share the same interfaces. Confronted with such needs, Java proxies fail to provide an answer.
Here begins the realm of CGLib. CGLib is a third-party framework, based on the bytecode manipulation provided by ASM, that can help with the previous limitations. A word of advice first: CGLib's documentation is not on par with its features; there's no tutorial or reference documentation, and a handful of JavaDocs is all you can count on. This said, CGLib waives many limitations enforced by pure Java proxies:
  • you are not required to implement interfaces
  • you can extend a class
For example, since Hibernate entities are POJOs, Java proxies cannot be used for lazy loading; CGLib proxies can.
There are matches between pure Java proxies and CGLib proxies: where you use Proxy, you use the net.sf.cglib.proxy.Enhancer class, and where you use InvocationHandler, you use net.sf.cglib.proxy.Callback. The two main differences are that Enhancer has a public constructor and that Callback cannot be used as such but only through one of its subinterfaces:
  • Dispatcher: Dispatching Enhancer callback
  • FixedValue: Enhancer callback that simply returns the value to return from the proxied method
  • LazyLoader: Lazy-loading Enhancer callback
  • MethodInterceptor: General-purpose Enhancer callback which provides for "around advice"
  • NoOp: Methods using this Enhancer callback will delegate directly to the default (super) implementation in the base class
As an introductory example, let's create a proxy that returns the same value for hashCode() whatever the real object behind it. The feature looks like a MethodInterceptor, so let's implement it as such:
import java.lang.reflect.Method;

import net.sf.cglib.proxy.MethodInterceptor;
import net.sf.cglib.proxy.MethodProxy;

public class HashCodeAlwaysZeroMethodInterceptor implements MethodInterceptor {

    public Object intercept(Object object, Method method, Object[] args,
            MethodProxy methodProxy) throws Throwable {
        // Force every hashCode() call to return 0
        if ("hashCode".equals(method.getName())) {
            return 0;
        }
        // Every other call falls through to the real (super) implementation
        return methodProxy.invokeSuper(object, args);
    }
}
Looks awfully similar to a Java invocation handler, doesn't it? Now, in order to create the proxy itself:
Object proxy = Enhancer.create(
    Object.class,
    new HashCodeAlwaysZeroMethodInterceptor());
Likewise, the proxy creation isn't surprising. The real differences are:
  • there's no interface involved in the process
  • the proxy creation process also creates the proxied object; there's no clear cut between proxy and proxied from the caller's point of view
  • thus, the callback method can provide the proxied object and there's no need to create and store it in your own code
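To see the interceptor in action, a minimal usage sketch (assuming cglib and its ASM dependency are on the classpath) could be:

import net.sf.cglib.proxy.Enhancer;

public class HashCodeAlwaysZeroTest {
    public static void main(String[] args) {
        // Enhancer both creates the subclass proxy and instantiates the proxied object
        Object proxy = Enhancer.create(
            Object.class,
            new HashCodeAlwaysZeroMethodInterceptor());

        System.out.println(proxy.hashCode()); // always prints 0
        System.out.println(proxy.toString()); // delegated to the super implementation
    }
}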

Conclusion

This article only brushed the surface of what can be done with proxies. Anyway, I hope it let you see that Java has some interesting features and points of extension, whether out of the box or coming from some third-party framework.
You can find the sources for this article in Eclipse/Maven format here .
From http://blog.frankel.ch/the-power-of-proxies-in-java

We’re on the cusp of deep learning for the masses. You can thank Google later — Tech News and Analysis


Clipped from: http://gigaom.com/2013/08/16/were-on-the-cusp-of-deep-learning-for-the-masses-you-can-thank-google-later/


Google silently did something revolutionary on Thursday. It open-sourced a tool called word2vec, prepackaged deep-learning software designed to understand the relationships between words with no human guidance. Just input a textual data set and let the underlying predictive models get to work learning.
"This is a really, really, really big deal," said Jeremy Howard, president and chief scientist of data-science competition platform Kaggle. "… It's going to enable whole new classes of products that have never existed before." Think of Siri on steroids, for starters, or perhaps emulators that could mimic your writing style down to the tone.

When deep learning works, it works great 

To understand Howard's excitement, let's go back a few days. It was Monday and I was watching him give a presentation in Chicago about how deep learning was dominating the competition in Kaggle, the online platform where organizations present vexing predictive problems and data scientists compete to create the best models. Whenever someone has used a deep learning model to tackle one of the challenges, he told the room, it has performed better than any model ever previously devised to tackle that specific problem.

Jeremy Howard (left) at Structure: Data 2012 (c) Pinar Ozger / http://www.pinarozger.com
But there's a catch: deep learning is really hard. So far, only a handful of teams in hundreds of Kaggle competitions have used it. Most of them have included Geoffrey Hinton or have been associated with him.
Hinton is a University of Toronto professor who pioneered the use of deep learning for image recognition and is now a distinguished engineer at Google, as well. What got Google really interested in Hinton — at least to the point where it hired him — was his work in an image-recognition competition called ImageNet . For years the contest's winners had been improving only incrementally on previous results, until Hinton and his team used deep learning to improve by an order of magnitude.

Neural networks: A way-simplified overview 

Deep learning, Howard explained, is essentially a bigger, badder take on the neural network models that have been around for some time. It's particularly useful for analyzing image, audio, text, genomic and other multidimensional data that doesn't lend itself well to traditional machine learning techniques.
Neural networks work by analyzing inputs (e.g., words or images) and recognizing the features that comprise them as well as how all those features relate to each other. With images, for example, a neural network model might recognize various formations of pixels or intensities of pixels as features.

A very simple neural network. Source: Wikipedia Commons
Trained against a set of labeled data, the output of a neural network might be the classification of an input as a dog or cat, for example. In cases where there is no labeled training data — a process called self-taught learning — neural networks can be used to identify the common features of their inputs and group similar inputs even though the models can't predict what they actually are. Like when Google researchers constructed neural networks that were able to recognize cats and human faces without having been trained to do so.

Stacking neural networks to do deep learning 

In deep learning, multiple neural networks are "stacked" on top of each other, or layered, in order to create models that are even better at prediction because each new layer learns from the ones before it. In Hinton's approach, each layer randomly omits features — a process called "dropout" — to minimize the chances the model will overfit itself to just the data upon which it was trained. That's a technical way of saying the model won't work as well when trying to analyze new data.
So dropout or similar techniques are critical to helping deep learning models understand the real causality between the inputs and the outputs, Howard explained during a call on Thursday. It's like looking at the same thing under the same lighting all the time versus looking at it in different lighting and from different angles. You'll see new aspects and won't see others, he said, "But the underlying structure is going to be the same each time."
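As a toy sketch of the dropout idea described above (not Hinton's actual implementation; this is the "inverted dropout" variant, which rescales at training time), randomly zeroing features looks roughly like this:

import java.util.Arrays;
import java.util.Random;

public class DropoutSketch {
    // Zero each activation with probability p and scale the survivors by
    // 1/(1-p) so the layer's expected output stays the same (inverted dropout).
    static double[] dropout(double[] activations, double p, Random rng) {
        double[] out = new double[activations.length];
        for (int i = 0; i < activations.length; i++) {
            out[i] = rng.nextDouble() < p ? 0.0 : activations[i] / (1.0 - p);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] layerOutput = {0.3, 1.2, -0.7, 0.9};
        System.out.println(Arrays.toString(dropout(layerOutput, 0.5, new Random(42))));
    }
}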

An example of what features a neural network might learn from images. Source: Hinton et al
Still, it's difficult to create accurate models and to program them to run on the number of computing cores necessary to process them in a reasonable timeframe. It can also be difficult to train them on enough data to guarantee accuracy in an unsupervised environment. That's why so much of the cutting-edge work in the field is still done by experts such as Hinton, Jeff Dean and Andrew Ng, all of whom had or still have strong ties to Google.
There are open source tools such as Theano and PyLearn2 that try to minimize the complexity, Howard told the audience on Monday, but a user-friendly, commercialized software package could be revolutionary. If data scientists in places outside Google could simply (a relative term if ever there was one) input their multidimensional data and train models to learn it, that could make other approaches to predictive modeling all but obsolete. It wouldn't be inconceivable, Howard noted, that a software package like this could emerge within the next year.

Enter word2vec 

Which brings us back to word2vec. Google calls it "an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words." Those "architectures" are two new natural-language processing techniques developed by Google researchers Tomas Mikolov, Ilya Sutskever, and Quoc Le (Google Fellow Jeff Dean was also involved, although modestly, he told me.) They're like neural networks, only simpler so they can be trained on larger data sets.
Kaggle's Howard calls word2vec the "crown jewel" of natural language processing. "It's the English language compressed down to a list of numbers," he said.
Word2vec is designed to run on a system as small as a single multicore machine (Google tested its underlying techniques over days across more than 100 cores on its data center servers). Its creators have shown how it can recognize the similarities among words (e.g., the countries in Europe) as well as how they're related to other words (e.g., countries and capitals). It's able to decipher analogical relationships (e.g., short is to shortest as big is to biggest), word classes (e.g., carnivore and cormorant both relate to animals) and "linguistic regularities" (e.g., "vector('king') – vector('man') + vector('woman') is close to vector('queen')").

Source: Google
Right now, the word2vec Google Code page notes, "The linearity of the vector operations seems to weakly hold also for the addition of several vectors, so it is possible to add several word or phrase vectors to form representation of short sentences."
This is accomplished by turning words into numbers that correlate with their characteristics, Howard said. Words that express positive sentiment, adjectives, nouns associated with sporting events — they'll all have certain numbers in common based on how they're used in the training data (so bigger data is better).
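As a toy illustration of the vector arithmetic described above, with made-up 3-dimensional vectors (real word2vec embeddings are learned from data and have hundreds of dimensions):

public class WordVectorArithmetic {
    // Made-up vectors purely for illustration; not real word2vec output.
    static final double[] KING  = {0.8, 0.7, 0.1};
    static final double[] MAN   = {0.6, 0.1, 0.1};
    static final double[] WOMAN = {0.6, 0.1, 0.9};
    static final double[] QUEEN = {0.8, 0.7, 0.9};

    // a - b + c, element by element
    static double[] combine(double[] a, double[] b, double[] c) {
        double[] r = new double[a.length];
        for (int i = 0; i < a.length; i++) r[i] = a[i] - b[i] + c[i];
        return r;
    }

    // Cosine similarity: 1.0 means the vectors point in the same direction
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] result = combine(KING, MAN, WOMAN);
        // With these toy numbers, king - man + woman lands closest to queen.
        System.out.println("cosine(result, QUEEN) = " + cosine(result, QUEEN));
        System.out.println("cosine(result, MAN)   = " + cosine(result, MAN));
    }
}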

Smarter models mean smarter apps 

If this is all too esoteric, think about these methods applied to auto-correct or word suggestions in text-messaging apps. Current methods for doing this might be as simple as suggesting words that are usually paired together, Howard explained, meaning a suggestion could be based solely on the word immediately before it. Using deep-learning-based approaches, a texting app could take into account the entire sentence, for example, because the app would have a better understanding of what all the words really mean in context.
Maybe you could average out all the numbers in a tweet, Howard suggested, and get a vector output that would accurately infer the sentiment, subject and level of formality of the tweet. Really, the possibilities are limited only to the types of applications people can think up to take advantage of word2vec's deep understanding of natural language.

An example output file from word2vec that has grouped similar words
The big caveat, however, is researchers and industry data scientists still need to learn how to use word2vec. There hasn't been a lot of research done on how to best use these types of models, Howard said, and the thousands of researchers working on other methods of natural language processing aren't going to jump ship to Google's tools overnight. Still, he believes the community will come around and word2vec and its underlying techniques could make all other approaches to natural language processing obsolete.
And this is just the start. A year from now, Howard predicts, deep learning will have surpassed a whole class of algorithms in other fields (i.e., things other than speech recognition, image recognition and natural language processing), and a year after that it will be integrated into all sorts of software packages. The only questions, and they're admittedly big ones, are how smart deep learning models can get (and whether they'll run into another era of hardware constraints that graphical processing units helped resolve earlier this millennium) and how accessible software packages like word2vec can make deep learning even for relatively unsophisticated users.
"Maybe in 10 years' time," Howard proposed, "we'll get to that next level."

Cajo, the easiest way to accomplish distributed computing in Java | Java Code Geeks


Clipped from: http://www.javacodegeeks.com/2011/01/cajo-easiest-way-to-accomplish.html

Cajo, the easiest way to accomplish distributed computing in Java

Posted on January 27th, 2011 | Filed in: Enterprise Java
Derived from the introductory section of Jonas Boner's article "Distributed Computing Made Easy", posted on TheServerSide.com on May 1st, 2006:
"Distributed computing is becoming increasingly important in the world of enterprise application development. Today, developers continuously need to address questions like: How do you enhance scalability by scaling the application beyond a single node? How can you guarantee high-availability, eliminate single points of failure, and make sure that you meet your customer SLAs?
For many developers, the most natural way of tackling the problem would be to divide up the architecture into groups of components or services that are distributed among different servers. While this is not surprising, considering the heritage of CORBA, EJB, COM and RMI that most developers carry around, if you decide to go down this path then you are in for a lot of trouble. Most of the time it is not worth the effort and will give you more problems than it solves."
On the other hand, distributed computing and Java go together naturally. As the first language designed from the bottom up with networking in mind, Java makes it very easy for computers to cooperate. Even the simplest applet running in a browser is a distributed application, if you think about it. The client running the browser downloads and executes code that is delivered by some other system. But even this simple applet wouldn't be possible without Java's guarantees of portability and security: the applet can run on any platform, and can't sabotage its host.
The cajo project is a small library enabling powerful dynamic multi-machine cooperation. It is surprisingly easy to use, yet unmatched in performance. It is a uniquely 'drop-in' distributed computing framework: it imposes no structural requirements on your applications, nor any source changes. It allows multiple remote JVMs to work together seamlessly, as one.
The project owner, John Catherino, claims "King Of the Mountain! ;-)" and challenges anyone to prove that there exists a distributed computing framework in Java that is equally flexible and as fast as cajo.
To tell you the truth, I am personally convinced by John's claim, and I strongly believe that you will be too if you just let me walk you through this client-server example. You will be amazed at how easy and flexible the cajo framework is:
The Server.java
import gnu.cajo.Cajo; // The cajo implementation of the Grail

public class Server {

   public static class Test { // remotely callable classes must be public
      // though not necessarily declared in the same class
      private final String greeting;
      // no silly requirement to have no-arg constructors
      public Test(String greeting) { this.greeting = greeting; }
      // all public methods, instance or static, will be remotely callable
      public String foo(Object bar, int count) {
         System.out.println("foo called w/ " + bar + ' ' + count + " count");
         return greeting;
      }
      public Boolean bar(int count) {
         System.out.println("bar called w/ " + count + " count");
         return Boolean.TRUE;
      }
      public boolean baz() {
         System.out.println("baz called");
         return true;
      }
      public String other() { // functionality not needed by the test client
         return "This is extra stuff";
      }
   } // arguments and return objects can be custom or common to server and client

   public static void main(String args[]) throws Exception { // unit test
      Cajo cajo = new Cajo(0);
      System.out.println("Server running");
      cajo.export(new Test("Thanks"));
   }
}
Compile via:
javac -cp cajo.jar;. Server.java
Execute via:
java -cp cajo.jar;. Server
As you can see, with just two commands:
Cajo cajo = new Cajo(0);
cajo.export(new Test("Thanks"));
we can expose any POJO (Plain Old Java Object) as a distributed service!
And now the Client.java
import gnu.cajo.Cajo;

import java.rmi.RemoteException; // caused by network related errors

interface SuperSet { // client method sets need not be public
   void baz() throws RemoteException;
} // declaring RemoteException is optional, but a nice reminder

interface ClientSet extends SuperSet {
   boolean bar(Integer quantum) throws RemoteException;
   Object foo(String barbaz, int foobar) throws RemoteException;
} // the order of the client method set does not matter

public class Client {
   public static void main(String args[]) throws Exception { // unit test
      Cajo cajo = new Cajo(0);
      if (args.length > 0) { // either approach must work...
         int port = args.length > 1 ? Integer.parseInt(args[1]) : 1198;
         cajo.register(args[0], port);
         // find server by registry address & port, or...
      } else Thread.currentThread().sleep(100); // allow some discovery time

      Object refs[] = cajo.lookup(ClientSet.class);
      if (refs.length > 0) { // compatible server objects found
         System.out.println("Found " + refs.length);
         ClientSet cs = (ClientSet)cajo.proxy(refs[0], ClientSet.class);
         cs.baz();
         System.out.println(cs.bar(new Integer(77)));
         System.out.println(cs.foo(null, 99));
      } else System.out.println("No server objects found");
      System.exit(0); // nothing else left to do, so we can shut down
   }
}
Compile via:
javac -cp cajo.jar;. Client.java
Execute via:
java -cp cajo.jar;. Client
The client can find server objects either by providing the server address and port (if available) or by using multicast. To locate the appropriate server object, "Dynamic Client Subtyping" is used. For those of you who do not know what "Dynamic Client Subtyping" stands for, John Catherino explains it in his relevant blog post:
"Oftentimes service objects implement a large, rich interface. Other times service objects implement several interfaces, grouping their functionality into distinct logical concerns. Quite often, a client needs only to use a small portion of an interface; or perhaps some methods from a few of the logical grouping interfaces, to satisfy its own needs.
The ability of a client to define its own interface, from ones defined by the service object, is known as subtyping in Java. (in contrast to subclassing) However, unlike conventional Java subtyping; Dynamic Client Subtyping means creating an entirely different interface. What makes this subtyping dynamic, is that it works with the original, unmodified service object.
This can be a very potent technique, for client-side complexity management."
Isn't that really cool? We just have to define the interface our client "needs" to use and locate the appropriate server object that complies with the client's specification. The following command, derived from our example, accomplishes just that:
Object refs[] = cajo.lookup(ClientSet.class);
Last but not least, we can create a client-side "proxy" of the server object and remotely invoke its methods just like an ordinary local object reference, by issuing the following command:
ClientSet cs = (ClientSet) cajo.proxy(refs[0], ClientSet.class);
That's it. These allow for complete interoperability between distributed JVMs. It just can't get any easier than this.
As far as performance is concerned, I have conducted some preliminary tests on the provided example and achieved an average of 12,000 TPS on the following system:
Sony Vaio with the following characteristics:
  • System : openSUSE 11.1 (x86_64)
  • Processor (CPU) : Intel(R) Core(TM)2 Duo CPU T6670 @ 2.20GHz
  • Processor Speed : 1,200.00 MHz
  • Total memory (RAM) : 2.8 GB
  • Java : OpenJDK 1.6.0_0 64-Bit
For your convenience, here is the code snippet that I used to perform the stress test:
int repeats = 1000000;
long start = System.currentTimeMillis();
for (int i = 0; i < repeats; i++)
   cs.baz();
System.out.println("TPS : " + repeats / ((System.currentTimeMillis() - start) / 1000d));
Happy coding! And don't forget to share!
Justin