Please join us for our first NLP Seminar of the Fall semester on Monday, September 25, at 4:00pm in 202 South Hall.

Speaker: David Smith (Northeastern University)

Title: Modeling Text Dependencies: Information Cascades, Translations, and Multi-Input Encoders

Abstract:

Dependencies among texts arise when speakers and writers copy manuscripts, cite the scholarly literature, speak from talking points, repost content on social networking platforms, or in other ways transform earlier texts. While in some cases these dependencies are observable—e.g., by citations or other links—we often need to infer them from the text alone. In our Viral Texts project, for example, we have built models of reprinting for noisily-OCR’d nineteenth-century newspapers to trace the flow of news, literature, jokes, and anecdotes throughout the United States. Our Oceanic Exchanges project is now extending that work to information propagation across language boundaries. Other projects in our group involve inferring and exploiting text dependencies to model the writing of legislation, the impact of scientific press releases, and changes in the syntax of language.

In this talk, I will discuss methods both for inferring these dependency structures and for exploiting them to improve other tasks. First, I will describe a new directed spanning tree model of information cascades and a new contrastive training procedure that exploits partial temporal ordering in lieu of labeled link data. This model outperforms previous approaches to network inference on blog datasets and, unlike those approaches, can evaluate individual links and cascades. Then, I will describe methods for extracting parallel passages from large multilingual, but not parallel, corpora by performing efficient search in the continuous document-topic simplex of a polylingual topic model. These extracted bilingual passages are sufficient to train translation systems with greater accuracy than some standard, smaller clean datasets. Finally, I will describe methods for automatically detecting multiple transcriptions of the same passage in a large corpus of noisy OCR and for exploiting these multiple witnesses to correct noisy text. These multi-input encoders provide an efficient and effective approximation to the intractable multi-sequence alignment approach to collation and allow us to produce transcripts with more than 75% reductions in error.