Tuesday, 26 November 2013

Coreference Resolution Tools : A first look – Dreaming in Data



2010-09-28 17:44:08 » Natural Language Processing



Coreference occurs when two or more noun phrases refer to the same entity. It is an integral part of natural language, used to avoid repetition, to express possession or relation, and so on.
E.g.: Harry wouldn't bother to read "Hogwarts: A History" as long as Hermione is around. He knows she knows the book by heart.
The different types of coreference include:
Noun phrases: Hogwarts: A History <- the book
Pronouns : Harry <- He
Possessives : her, his, their
Demonstratives:  This boy
Coreference resolution, or anaphora resolution, is the task of determining which entity a mention refers to. It has important applications in NLP tasks such as semantic analysis, text summarisation, and sentiment analysis.
In spite of extensive research, far fewer tools are available for coreference resolution (CR) than for more established NLP tasks such as parsing, and the ones that exist are less mature. This is due to the inherent ambiguities in resolution.
The following are some of the tools currently available. Almost all of them come bundled with sentence detectors, taggers, parsers, named entity recognizers, etc., since setting these up separately would be tedious.
Let us try the following sentences, taken from one of the presentations on BART, as input. I'm using each tool's demo app wherever possible; where there is none, I'm installing the tool on my local machine.
Queen Elizabeth set about transforming her husband, King George VI, into a viable monarch. Lionel Logue, a renowned speech therapist, was summoned to help the King overcome his speech impediment.
The following equivalence sets need to be identified.
QE: Queen Elizabeth, her
KG: husband, King George VI, the King, his
LL: Lionel Logue, a renowned speech therapist
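To compare tool output against these sets in a repeatable way, one option is to count coreferent mention pairs. The Python sketch below is only an illustration and is not something the tools themselves provide: it treats each mention string as a unique identifier (a real scorer would use text spans), and the function names are made up for this example. The predicted chains are the two pronoun pairs that JavaRAP reports in the results below.

```python
from itertools import combinations

# Gold equivalence sets from the post (mention strings exactly as listed there).
GOLD = {
    "QE": ["Queen Elizabeth", "her"],
    "KG": ["husband", "King George VI", "the King", "his"],
    "LL": ["Lionel Logue", "a renowned speech therapist"],
}


def links(chains):
    """Return the set of unordered mention pairs that share a chain."""
    pairs = set()
    for mentions in chains:
        # Sorting makes (a, b) and (b, a) the same tuple.
        for a, b in combinations(sorted(set(mentions)), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(gold_chains, predicted_chains):
    """Precision and recall over coreferent mention pairs."""
    gold, pred = links(gold_chains), links(predicted_chains)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Pronoun-only output: both pronouns resolved, nothing else linked.
PREDICTED = [["Queen Elizabeth", "her"], ["the King", "his"]]

precision, recall = pairwise_scores(GOLD.values(), PREDICTED)
print(precision, recall)
```

Both predicted pairs are correct, so precision is high, but only 2 of the 8 gold pairs are found, which is why a pronoun-only resolver scores low on recall here.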
The results are as follows.
Tools Result Comments
Illinois Coreference Package
Lionel Logue(0)
a renowned speech |therapist|(2)
his(8)
Queen Elizabeth(3)
the |King|(5)
King(4)
King |George VI|(7)
transforming her |husband|(6)
her(1)
a viable |monarch|(9)
'his' and 'her' are both matched to the wrong entities: 'his' is matched to Lionel Logue and 'her' is matched to King George VI.
CherryPicker
<COREF ID="1">Queen</COREF> <COREF ID="2">Elizabeth</COREF> set about transforming <COREF ID="3" REF="2">her</COREF> <COREF ID="4">husband</COREF>, <COREF ID="5">King</COREF> <COREF ID="6" REF="5">George VI</COREF>, into a viable monarch.
<COREF ID="9">Lionel Logue</COREF>, a renowned speech <COREF ID="10">therapist</COREF>, was summoned to help the <COREF ID="7" REF="5">King</COREF> overcome <COREF ID="8" REF="5">his</COREF> speech impediment.
Queen
    her
Elizabeth
Husband
King
    George VI
    King
    his
Lionel Logue
Therapist
It is mostly OK, except that Queen Elizabeth is split into two entities, Queen and Elizabeth. Other than that, it is one of the best results. Notably, it matches King to King George VI, and hence 'his' is correctly mapped to King George VI.
Natural Language Synergy Lab
Queen Elizabeth set about transforming her husband , King George VI , into a viable monarch. Lionel Logue , a renowned speech therapist , was summoned to help the King overcome his speech impediment .
BART
{person Queen Elizabeth } set about transforming {np {np her } husband } , {person King George VI } , into {np a viable monarch } . {person Lionel Logue } , {np a renowned {np speech } therapist } , was summoned to help {np the King } overcome {np {np his } {np speech } impediment } .

Coreference chain 1

{person Queen Elizabeth }
{np her }
{np a viable monarch }
{np the King }
{np his }

Coreference chain 2

{person Lionel Logue }
{np a renowned {np speech } therapist }

Coreference chain 3

{np speech }
{np speech }
JavaRAP
********Anaphor-antecedent pairs*****
(0,0) Queen Elizabeth <– (0,5) her,
(1,12) the King <– (1,15) his
********Text with substitution*****
Queen Elizabeth set about transforming <Queen Elizabeth's> husband, King George VI, into a viable monarch.
Lionel Logue, a renowned speech therapist, was summoned to help the King overcome <the King's> speech impediment.
It attempts only pronoun resolution, and does that well.
GuiTAR Failed
OpenNLP
(TOP (S (NP#6 (NNP Queen) (NNP Elizabeth)) (VP (VBD set) (PP (IN about) (S (VP (VBG transforming) (NP (NP (NML#6 (PRP$ her)) (NN husband)) (, ,) (NP (NNP King) (person (NNP George) (NNP VI)))) (, ,) (PP (IN into) (NP (DT a) (JJ viable) (NN monarch))))))) (. .)) )
(TOP (S (NP#1 (NP (person (NNP Lionel) (NNP Logue))) (, ,) (NP#1 (DT a) (JJ renowned) (NN speech) (NN therapist))) (, ,) (VP (VBD was) (VP (VBN summoned) (S (VP (TO to) (VP (VB help) (S (NP (DT the) (NNP King)) (VP (VBN overcome) (NP (NML#1 (PRP$ his)) (NN speech) (NN impediment))))))))) (. .)) )
Lionel Logue
a renowned speech therapist
Queen Elizabeth
her husband
Reconcile
<NP NO="0" CorefID="1">Queen Elizabeth</NP> set about transforming <NP NO="2" CorefID="3"><NP NO="1" CorefID="1">her</NP> husband</NP>, <NP NO="3" CorefID="3">King George VI</NP>, into <NP NO="4" CorefID="4">a viable monarch</NP>. <NP NO="5" CorefID="6">Lionel Logue</NP>, <NP NO="6" CorefID="6">a renowned speech therapist</NP>, was summoned to help <NP NO="7" CorefID="6">the King</NP> overcome <NP NO="9" CorefID="9"><NP NO="8" CorefID="6">his</NP> speech impediment</NP>.
Queen Elizabeth
her
her husband
King George VI
A Viable monarch
Lionel Logue
a renowned speech therapist
the king
his
'the King' has been wrongly attributed to Lionel Logue, which in turn caused 'his' to be wrongly attributed as well.
ARKref
[Queen Elizabeth]1 set about transforming [[her]1 husband , [King George VI]2 ,]2 into [a viable monarch] .
[Lionel Logue , [a renowned speech therapist]6 ,]6 was summoned to help [the King]8 overcome [[his]8 speech impediment] .
One of the best results. The only missing piece is the link from 'the King' to 'King George VI'.
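The "text with substitution" view in the JavaRAP output above can be reproduced from the anaphor-antecedent pairs alone. The Python sketch below rests on an assumption inferred from this one example, not from JavaRAP's documentation: that each coordinate is (sentence index, token index), with punctuation counted as tokens. The regex tokenizer and the function names are stand-ins invented for illustration, and only possessive pronouns are handled.

```python
import re


def tokenize(sentence):
    """Split into words and punctuation marks; a stand-in for a real tokenizer."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def substitute(sentences, pairs):
    """Replace each possessive anaphor token with <antecedent's>.

    pairs: list of ((sentence index, token index) of the anaphor, antecedent text).
    Returns one token list per sentence.
    """
    tokenized = [tokenize(s) for s in sentences]
    for (sent, tok), antecedent in pairs:
        tokenized[sent][tok] = f"<{antecedent}'s>"
    return tokenized


SENTENCES = [
    "Queen Elizabeth set about transforming her husband, King George VI, "
    "into a viable monarch.",
    "Lionel Logue, a renowned speech therapist, was summoned to help "
    "the King overcome his speech impediment.",
]

# Anaphor positions and antecedents as reported by JavaRAP above.
PAIRS = [((0, 5), "Queen Elizabeth"), ((1, 15), "the King")]

result = substitute(SENTENCES, PAIRS)
for tokens in result:
    print(" ".join(tokens))
```

Under that assumption the coordinates line up: token 5 of the first sentence is "her" and token 15 of the second is "his".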
As a side note, CherryPicker fails with the following error:
cherrypicker1.01/tools/crf++/.libs/lt-crf_test: error while loading shared libraries: libcrfpp.so.0: cannot open shared object file: No such file or directory
To proceed, download CRF++ and install it.
Then change line 18 of the cherrypicker.sh file from
tools/crf++/crf_test -m modelmd $1.crf > $1.pred
to
crf_test -m modelmd $1.crf > $1.pred
Open a new terminal and run:
sudo ldconfig
CherryPicker should now work.
ARKref and CherryPicker seem to be the best options available right now.
Are there any other coreference resolution systems that I have not looked at? Is there more to add about the tools above? Please post your comments.


Wednesday, 6 November 2013

Presto: Interacting with petabytes of data at Facebook



By Lydia Chan on Wednesday, November 6, 2013 at 7:01pm
By Martin Traverso
Background
Facebook is a data-driven company. Data processing and analytics are at the heart of building and delivering products for the 1 billion+ active users of Facebook. We have one of the largest data warehouses in the world, storing more than 300 petabytes. The data is used for a wide range of applications, from traditional batch processing to graph analytics [1], machine learning, and real-time interactive analytics.
For the analysts, data scientists, and engineers who crunch data, derive insights, and work to continuously improve our products, the performance of queries against our data warehouse is important. Being able to run more queries and get results faster improves their productivity.
Facebook's warehouse data is stored in a few large Hadoop/HDFS-based clusters. Hadoop MapReduce [2] and Hive are designed for large-scale, reliable computation, and are optimized for overall system throughput. But as our warehouse grew to petabyte scale and our needs evolved, it became clear that we needed an interactive system optimized for low query latency.
In Fall 2012, a small team in the Facebook Data Infrastructure group set out to solve this problem for our warehouse users. We evaluated a few external projects, but they were either too nascent or did not meet our requirements for flexibility and scale. So we decided to build Presto, a new interactive query system that could operate fast at petabyte scale.
In this post, we will briefly describe the architecture of Presto, its current status, and future roadmap.
Architecture
Presto is a distributed SQL query engine optimized for ad-hoc analysis at interactive speed. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions.
The diagram below shows the simplified system architecture of Presto. The client sends SQL to the Presto coordinator. The coordinator parses, analyzes, and plans the query execution. The scheduler wires together the execution pipeline, assigns work to nodes closest to the data, and monitors progress. The client pulls data from the output stage, which in turn pulls data from underlying stages.
The execution model of Presto is fundamentally different from Hive/MapReduce. Hive translates queries into multiple stages of MapReduce tasks that execute one after another. Each task reads inputs from disk and writes intermediate output back to disk. In contrast, the Presto engine does not use MapReduce. It employs a custom query and execution engine with operators designed to support SQL semantics. In addition to improved scheduling, all processing is in memory and pipelined across the network between stages. This avoids unnecessary I/O and associated latency overhead. The pipelined execution model runs multiple stages at once, and streams data from one stage to the next as it becomes available. This significantly reduces end-to-end latency for many types of queries.
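The difference between staged and pipelined execution can be seen in a toy, single-process form. The Python sketch below is only an analogy built on generators, with made-up stage names; it is not how Presto (a distributed Java system) is implemented, and the staged version materializes a list in memory rather than writing to disk. It records the order in which each stage touches each row.

```python
def scan(rows, log):
    """First stage: read rows one at a time."""
    for row in rows:
        log.append(f"scan {row}")
        yield row


def filter_even(rows, log):
    """Second stage: keep even rows only."""
    for row in rows:
        if row % 2 == 0:
            log.append(f"filter {row}")
            yield row


def staged(rows, log):
    """Each stage finishes and materializes its output before the next starts."""
    scanned = list(scan(rows, log))
    return list(filter_even(scanned, log))


def pipelined(rows, log):
    """Stages are chained; each row flows downstream as soon as it is produced."""
    return list(filter_even(scan(rows, log), log))


staged_log, pipelined_log = [], []
staged_result = staged(range(4), staged_log)
pipelined_result = pipelined(range(4), pipelined_log)

print(staged_log)
print(pipelined_log)
```

Both versions return the same rows, but in the pipelined log the filter sees row 0 before the scan has read row 1, which is the property that lets the first results reach the client early.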
The Presto system is implemented in Java because it's fast to develop, has a great ecosystem, and is easy to integrate with the rest of the data infrastructure components at Facebook that are primarily built in Java. Presto dynamically compiles certain portions of the query plan down to byte code which lets the JVM optimize and generate native machine code. Through careful use of memory and data structures, Presto avoids typical issues of Java code related to memory allocation and garbage collection. (In a later post, we will share some tips and tricks for writing high-performance Java system code and the lessons learned while building Presto.)
Extensibility is another key design point for Presto. During the initial phase of the project, we realized that large data sets were being stored in many other systems in addition to HDFS. Some data stores are well-known systems such as HBase, but others are custom systems such as the Facebook News Feed backend. Presto was designed with a simple storage abstraction that makes it easy to provide SQL query capability against these disparate data sources. Storage plugins (called connectors) only need to provide interfaces for fetching metadata, getting data locations, and accessing the data itself. In addition to the primary Hive/HDFS backend, we have built Presto connectors to several other systems, including HBase, Scribe, and other custom systems.
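The three responsibilities of a connector (metadata, data locations, data access) can be sketched as a small interface. The Python below is a hypothetical illustration only: Presto's real connector interfaces are written in Java, and none of the class or method names here come from its codebase.

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Hypothetical storage plugin; all names are illustrative, not Presto's."""

    @abstractmethod
    def get_table_metadata(self, table):
        """Return the column names of the table."""

    @abstractmethod
    def get_splits(self, table):
        """Return identifiers of the locations that hold the table's data."""

    @abstractmethod
    def read_split(self, table, split):
        """Yield the rows stored at one location."""


class InMemoryConnector(Connector):
    """Toy backend that keeps everything in a dictionary."""

    def __init__(self, tables):
        # tables: {name: {"columns": [...], "splits": {split_id: [row, ...]}}}
        self._tables = tables

    def get_table_metadata(self, table):
        return self._tables[table]["columns"]

    def get_splits(self, table):
        return sorted(self._tables[table]["splits"])

    def read_split(self, table, split):
        yield from self._tables[table]["splits"][split]


def scan_table(connector, table):
    """Engine-side scan that relies only on the three connector operations."""
    columns = connector.get_table_metadata(table)
    for split in connector.get_splits(table):
        for row in connector.read_split(table, split):
            yield dict(zip(columns, row))


connector = InMemoryConnector({
    "users": {
        "columns": ["id", "name"],
        "splits": {"node-a": [(1, "ann")], "node-b": [(2, "bob")]},
    }
})
rows = list(scan_table(connector, "users"))
print(rows)
```

Because scan_table never looks inside the backend, a different store can be queried by supplying another Connector subclass, which is the point of keeping the abstraction this small.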
Current status
As mentioned above, development on Presto started in Fall 2012. We had our first production system up and running in early 2013. It was fully rolled out to the entire company by Spring 2013. Since then, Presto has become a major interactive system for the company's data warehouse. It is deployed in multiple geographical regions and we have successfully scaled a single cluster to 1,000 nodes. The system is actively used by over a thousand employees, who run more than 30,000 queries processing one petabyte daily.
Presto is 10x better than Hive/MapReduce in terms of CPU efficiency and latency for most queries at Facebook. It currently supports a large subset of ANSI SQL, including joins, left/right outer joins, subqueries, and most of the common aggregate and scalar functions, including approximate distinct counts (using HyperLogLog) and approximate percentiles (based on quantile digest). The main restrictions at this stage are a size limitation on the join tables and cardinality of unique keys/groups. The system also lacks the ability to write output data back to tables (currently query results are streamed to the client).
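For readers unfamiliar with HyperLogLog, the idea behind approximate distinct counts can be sketched in a few lines. The Python below is a textbook-style version of the estimator and makes no claim about how Presto implements it; the SHA-1 hash, the register count (2^10), and the class name are choices made for this illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch; assumes p >= 7 so the alpha formula applies."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        index = x >> (64 - self.p)                 # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog(p=10)
for i in range(10000):
    hll.add(f"user-{i}")
for i in range(10000):
    hll.add(f"user-{i}")             # duplicates do not change the registers

estimate = hll.count()
print(round(estimate))
```

The sketch keeps only 1,024 small registers no matter how many values it sees, and its expected relative error is roughly 1.04 / sqrt(1024), or about 3%, which is the memory-for-accuracy trade that makes distinct counts feasible at warehouse scale.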
Roadmap
We are actively working on extending Presto functionality and improving performance. In the next few months, we will remove restrictions on join and aggregation sizes and introduce the ability to write output tables. We are also working on a query "accelerator" by designing a new data format that is optimized for query processing and avoids unnecessary transformations. This feature will allow hot subsets of data to be cached from the backend data store, and the system will transparently use cached data to "accelerate" queries. We are also working on a high performance HBase connector.
Open source
After our initial Presto announcement at the Analytics @ WebScale conference in June 2013 [3], there has been a lot of interest from the external community. In the last couple of months, we have released Presto code and binaries to a small number of external companies. They have successfully deployed and tested it within their environments and given us great feedback.
Today we are very happy to announce that we are open-sourcing Presto. You can check out the code and documentation on the site below. We look forward to hearing about your use cases and how Presto can help with your interactive analysis.
http://prestodb.io/
https://github.com/facebook/presto
The Presto team within Facebook Data Infrastructure consists of Martin Traverso, Dain Sundstrom, David Phillips, Eric Hwang, Nileema Shingte and Ravi Murthy.
Links
[1] Scaling Apache Giraph to a trillion edges. https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
[2] Under the hood: Scheduling MapReduce jobs more efficiently with Corona. https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
[3] Video of Presto talk at Analytics @ WebScale conference, June 2013. https://www.facebook.com/photo.php?v=10202463462128185
