Please join us for the next NLP Seminar on Monday, October 9, at 4:00pm in 202 South Hall.
Speaker: Siva Reddy (Stanford)
Title: Linguists-defined vs. Machine-induced Natural Language Structures for Executable Semantic Parsing
Querying a database to retrieve an answer, telling a robot to perform an action, or teaching a computer to play a game are tasks requiring communication with machines in a language interpretable by them. Here we consider the task of converting human languages to a knowledge-base (KB) language for question-answering. While human languages have latent structures, machine interpretable languages have explicit formal structures. The computational linguistics community has created several treebanks to understand the formal structures of human languages, e.g., universal dependencies. But are these useful for deriving machine interpretable formal structures?
In the first part of the talk, I will discuss how to convert universal dependencies in multiple languages to both general-purpose and kb-executable logical forms. In the second part, I will present a neural model on how to induce task-specific natural language structures. I will discuss the similarities and differences between linguists-defined and machine-induced structures, and pros and cons of each.
Siva Reddy is a postdoc at the Stanford NLP group working with Chris Manning. His research focuses on finding fundamental representations of language, mostly interpretable, which are useful for NLP applications, especially machine understanding. In this direction, he is currently exploring whether linguistic representations are necessary or all we need is end-to-end learning. His postdoc is partly funded by a Facebook AI Research grant. Prior to the postdoc, he was a Google PhD Fellow at the University of Edinburgh under the supervision of Mirella Lapata and Mark Steedman. He worked with Google Parsing team as an intern during his PhD, and as a full-time employee for Adam Kilgarriff’s Sketch Engine before his PhD. His team won the first place in SemEval 2011 Compositionality Detection task and a best paper at IJCNLP 2011. Apart from language, he loves nature and badminton.
Please join us for our first NLP Seminar of the Fall semester on Monday, September 25, at 4:00pm in 202 South Hall.
Speaker: David Smith (Northeastern University)
Title: Modeling Text Dependencies: Information Cascades, Translations, and Multi-Input Encoders
Dependencies among texts arise when speakers and writers copy manuscripts, cite the scholarly literature, speak from talking points, repost content on social networking platforms, or in other ways transform earlier texts. While in some cases these dependencies are observable—e.g., by citations or other links—we often need to infer them from the text alone. In our Viral Texts project, for example, we have built models of reprinting for noisily-OCR’d nineteenth-century newspapers to trace the flow of news, literature, jokes, and anecdotes throughout the United States. Our Oceanic Exchanges project is now extending that work to information propagation across language boundaries. Other projects in our group involve inferring and exploiting text dependencies to model the writing of legislation, the impact of scientific press releases, and changes in the syntax of language.
In this talk, I will discuss methods both for inferring these dependency structures and for exploiting them to improve other tasks. First, I will describe a new directed spanning tree model of information cascades and a new contrastive training procedure that exploits partial temporal ordering in lieu of labeled link data. This model outperforms previous approaches to network inference on blog datasets and, unlike those approaches, can evaluate individual links and cascades. Then, I will describe methods for extracting parallel passages from large multilingual, but not parallel, corpora by performing efficient search in the continuous document-topic simplex of a polylingual topic model. These extracted bilingual passages are sufficient to train translation systems with greater accuracy than some standard, smaller clean datasets. Finally, I will describe methods for automatically detecting multiple transcriptions of the same passage in a large corpus of noisy OCR and for exploiting these multiple witnesses to correct noisy text. These multi-input encoders provide an efficient and effective approximation to the intractable multi-sequence alignment approach to collation and allow us to produce transcripts with more than 75% reductions in error.