Please join us for our first NLP Seminar of the Fall semester on Monday, September 25, at 4:00pm in 202 South Hall.
Speaker: David Smith (Northeastern University)
Title: Modeling Text Dependencies: Information Cascades, Translations, and Multi-Input Encoders
Dependencies among texts arise when speakers and writers copy manuscripts, cite the scholarly literature, speak from talking points, repost content on social networking platforms, or in other ways transform earlier texts. While in some cases these dependencies are observable—e.g., by citations or other links—we often need to infer them from the text alone. In our Viral Texts project, for example, we have built models of reprinting for noisily-OCR’d nineteenth-century newspapers to trace the flow of news, literature, jokes, and anecdotes throughout the United States. Our Oceanic Exchanges project is now extending that work to information propagation across language boundaries. Other projects in our group involve inferring and exploiting text dependencies to model the writing of legislation, the impact of scientific press releases, and changes in the syntax of language.
In this talk, I will discuss methods both for inferring these dependency structures and for exploiting them to improve other tasks. First, I will describe a new directed spanning tree model of information cascades and a new contrastive training procedure that exploits partial temporal ordering in lieu of labeled link data. This model outperforms previous approaches to network inference on blog datasets and, unlike those approaches, can evaluate individual links and cascades. Then, I will describe methods for extracting parallel passages from large multilingual, but not parallel, corpora by performing efficient search in the continuous document-topic simplex of a polylingual topic model. These extracted bilingual passages are sufficient to train translation systems with greater accuracy than some standard, smaller clean datasets. Finally, I will describe methods for automatically detecting multiple transcriptions of the same passage in a large corpus of noisy OCR and for exploiting these multiple witnesses to correct noisy text. These multi-input encoders provide an efficient and effective approximation to the intractable multi-sequence alignment approach to collation and allow us to produce transcripts with error reductions of more than 75%.
The NLP Seminar is back for Fall 2017! We are shifting our meeting time slightly, to Mondays from 4:00-5:00pm, in almost the same location, room 210 South Hall. We’ll be meeting approximately once a month this semester.
Here is the speaker list for this semester:
Sep 25: David Smith, Northeastern U
Oct 9: Siva Reddy, Stanford U
Oct 30: Christopher Potts, Stanford U
Nov 13: He He, Stanford U
Amber Boydstun, UC Davis: postponed to Spring 2018
For up-to-the-minute notifications, join the email list (UC Berkeley community only).
Please join us for our final NLP Seminar of the spring semester on Monday, May 1, at 3:30pm in 202 South Hall.
Speaker: Pramod Viswanath, University of Illinois
Title: Geometries of Word Embeddings
Real-valued word vectors have transformed NLP applications; popular examples are word2vec and GloVe, recognized for their ability to capture linguistic regularities via simple geometrical operations. In this talk, we demonstrate further striking geometrical properties of the word vectors. First we show that a very simple, and yet counter-intuitive, post-processing technique, which makes the vectors “more isotropic”, renders off-the-shelf vectors even stronger. Second, we show that a sentence containing a target word is well represented by a low rank subspace; subspaces associated with a particular sense of the target word tend to intersect over a line (one-dimensional subspace). We harness this Grassmannian geometry to disambiguate (in an unsupervised way) multiple senses of words, specifically so on the most promiscuously polysemous of all words: prepositions. A surprising finding is that rare senses, including idiomatic/sarcastic/metaphorical usages, are efficiently captured. Our algorithms are all unsupervised and rely on no linguistic resources; we validate them by presenting new state-of-the-art results on a variety of multilingual benchmark datasets.
Please join us for the NLP Seminar Monday, April 24 at 3:30pm in 202 South Hall.
Speaker: Marta Recasens (Google)
Title: There’s Life Beyond Coreference
I’ll give a bird’s eye view of the coreference resolution task, discussing why after more than two decades of research on this task, state-of-the-art systems are still far from performing satisfactorily for real applications. Then, I’ll focus on the long tail of the problem, exemplifying how to cheaply learn common sense of the kind required by the Winograd Schema Challenge, and I’ll finish by undermining the traditional definition of the task, whose attempt at simplifying the problem may be making it even harder.
Please join us for the NLP Seminar Monday, April 10 at 3:30pm in 202 South Hall.
Speaker: Danqi Chen (Stanford)
Title: Towards the Machine Comprehension of Text
Enabling a computer to understand a document so that it can answer comprehension questions is a central, yet unsolved goal of NLP. The task of reading comprehension (i.e., question answering over unstructured text) has received vast attention recently, and a lot of progress has been made thanks to the creation of large-scale datasets and development of attention-based neural networks.
In this talk, I’ll first present how we advance this line of research. I’ll show how simple models can achieve (nearly) state-of-the-art performance on recent benchmarks, including the CNN/Daily Mail datasets and the Stanford Question Answering Dataset. I’ll focus on explaining the logical structure behind these neural architectures and discussing advantages as well as limits of current approaches.
Lastly I’ll talk about how we leverage existing machine comprehension systems and enable them to answer open-domain questions using full Wikipedia. We demonstrate the promise of our system, as well as set up new benchmarks by evaluating on multiple existing QA datasets.
Danqi Chen is a Ph.D. candidate in Computer Science at Stanford University, advised by Prof. Christopher Manning. Her main research interests lie in deep learning for natural language processing and understanding, and she is particularly interested in the intersection between text understanding and knowledge reasoning. She has been working on machine comprehension, question answering, knowledge base population, and dependency parsing. She is a recipient of a Facebook Fellowship, a Microsoft Research Women’s Fellowship, and an Outstanding Paper Award at ACL’16. Prior to Stanford, she received her B.S. from Tsinghua University in 2012.
Please join us for the NLP Seminar Monday, Mar 6 at 4:00pm in 205 South Hall.
Speaker: Joel Tetreault, Grammarly
Title: Analyzing Formality in Online Communication
Full natural language understanding requires comprehending not only the content or meaning of a piece of text or speech, but also the stylistic way in which it is conveyed. To enable real advancements in dialog systems, information extraction, and human-computer interaction, computers need to understand the entirety of what humans say, both the literal and the non-literal. This talk presents an in-depth investigation of one particular stylistic aspect, formality. First, we provide an analysis of humans’ subjective perceptions of formality in four different genres of online communication. We highlight areas of high and low agreement and extract patterns that consistently differentiate formal from informal text. Next, we develop a statistical model for predicting formality at the sentence level, using rich NLP and deep learning features, and then evaluate the model’s performance against human judgments across genres. Finally, we apply our model to analyze language use in online debate forums. Our results provide new evidence in support of theories of linguistic coordination, underlining the importance of formality for language generation systems.
This work was done with Ellie Pavlick (UPenn) during her summer internship at Yahoo Labs.
Please join us for the NLP Seminar on Monday 2/27 at 3:30pm in 202 South Hall. All are welcome!
Speaker: Jayant Krishnamurthy (Allen Institute for AI)
Title: Semantic Parsing to Probabilistic Programs for Situated Question Answering
Situated question answering is the problem of answering questions about an environment such as an image or diagram. This problem is challenging because it requires jointly interpreting a question and an environment using background knowledge to select the correct answer. We present Parsing to Probabilistic Programs, a novel situated question answering model that can use background knowledge and global features of the question/environment interpretation while retaining efficient approximate inference. Our key insight is to treat semantic parses as probabilistic programs that execute nondeterministically and whose possible executions represent environmental uncertainty. We evaluate our approach on a new, publicly-released data set of 5000 science diagram questions, outperforming several competitive classical and neural baselines.
Please join us for the NLP Seminar on Monday 2/13 at 3:30pm in 202 South Hall. All are welcome!
Speaker: Stephan Meylan (UC Berkeley)
Title: Word forms are optimized for efficient communication
The inverse relationship between word length and use frequency, first identified by G.K. Zipf in 1935, is a classic empirical law that holds across a wide range of human languages. We demonstrate that length is one aspect of a much more general property of words: how distinctive they are with respect to other words in a language. Distinctiveness plays a critical role in recognizing words in fluent speech, in that it reflects the strength of potential competitors when selecting the best candidate for an ambiguous signal. Phonological information content, a measure of a word’s probability under a statistical model of a language’s sound or character sequences, concisely captures distinctiveness. Examining large-scale corpora from 13 languages, we find that distinctiveness significantly outperforms word length as a predictor of frequency. This finding provides evidence that listeners’ processing constraints shape fine-grained aspects of word forms across languages.
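As a rough illustration of the measure described above, the sketch below scores a word by its negative log probability under a character bigram model. It is only a minimal sketch, not the speaker's actual model: the bigram order, the add-one smoothing, the toy lexicon, and the function names (`train_bigram_model`, `information_content`) are all assumptions made for this example.

```python
import math
from collections import Counter

def train_bigram_model(words):
    """Count character bigrams, using '<' and '>' as word-boundary markers."""
    context_counts = Counter()   # how often each symbol appears as a left context
    bigram_counts = Counter()    # how often each (previous, current) pair appears
    alphabet = {">"}             # symbols the model can predict
    for word in words:
        alphabet.update(word)
        padded = "<" + word + ">"
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, len(alphabet)

def information_content(word, model):
    """Return -log2 P(word), in bits, under the add-one-smoothed bigram model."""
    context_counts, bigram_counts, vocab_size = model
    padded = "<" + word + ">"
    bits = 0.0
    for prev, cur in zip(padded, padded[1:]):
        prob = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        bits -= math.log2(prob)
    return bits

# Hypothetical toy lexicon; a real study would use large-scale corpora.
lexicon = ["the", "then", "than", "that", "this", "thin", "cat", "can"]
model = train_bigram_model(lexicon)

typical = information_content("the", model)   # built from frequent bigrams
unusual = information_content("xqz", model)   # built from unseen bigrams
print(f"the: {typical:.2f} bits, xqz: {unusual:.2f} bits")
```

Under this toy model, a word made of common character sequences carries fewer bits than an equally long word made of unseen sequences, which is the sense in which the measure separates distinctiveness from length.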
The NLP Seminar is back for Spring 2017! We will retain our meeting time of Mondays from 3:30-4:30pm, in the same location, room 202 South Hall.
Here is the speaker list:
Feb 13: Stephan Meylan, UC Berkeley
Feb 27: Jayant Krishnamurthy, Allen Institute for AI
March 6: Joel Tetreault, Grammarly
April 10: Danqi Chen, Stanford
April 24: Marta Recasens, Google
May 1: Pramod Viswanath, U Illinois
For up-to-the-minute notifications, join the email list (UC Berkeley community only).
Please join us for the NLP Seminar on Monday 11/14 at 3:30pm in 202 South Hall. All are welcome!
Speaker: David Jurgens (Stanford)
Title: Citation Classification for Behavioral Analysis of a Scientific Field
Citations are an important indicator of the state of a scientific field, reflecting how authors frame their work, and influencing uptake by future scholars. However, our understanding of citation behavior has been limited to small-scale manual citation analysis. We perform the largest behavioral study of citations to date, analyzing how citations are both framed and taken up by scholars in one entire field: natural language processing. We introduce a new dataset of nearly 2,000 citations annotated for function and centrality, and use it to develop a state-of-the-art classifier and label the entire ACL Reference Corpus. We then study how citations are framed by authors and use both papers and online traces to track how citations are followed by readers. We demonstrate that authors are sensitive to discourse structure and publication venue when citing, that online readers follow temporal links to previous and future work rather than methodological links, and that how a paper cites related work is predictive of its citation count. Finally, we use changes in citation roles to show that the field of NLP is undergoing a significant increase in consensus.