Yanai Elazar will be giving a virtual talk on Tuesday, June 22nd, from 10am to 11am. Zoom information will be distributed via the Berkeley NLP Seminar listserv. Please note that this differs from our usual time slot.

Title: Causal Attributions in Language Models

Abstract: The outstanding results of large language models are largely unexplained, and a range of interpretability methods aim to analyze these models in order to understand their working mechanisms. Probing, one such tool, suggests that properties which can be accurately predicted from a model's representations are likely to reflect some of the features or concepts the model makes use of in its predictions. In the first part of this talk, I'll propose a new interpretability method that takes inspiration from counterfactuals (what would the prediction have been if the model had not had access to certain information?) and argue that it is a more suitable method for asking causal questions about how certain attributes are used by models. In the second part, I'll talk about a different kind of probing that treats the model as a black box and uses cloze patterns to query it for world knowledge, under the LAMA framework. I will first describe a new framework that measures the consistency of language models' knowledge, that is, the invariance of a model's behavior under meaning-preserving alternations of its input, and show that current LMs are generally not consistent. I will conclude with ongoing work in which we develop a causal diagram and highlight different concepts, such as co-occurrences, that drive the model's predictions (as opposed to true and robust knowledge acquisition).
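For readers unfamiliar with cloze-style querying, the short sketch below illustrates the general idea (it is not the speaker's code); it assumes the Hugging Face transformers library and a bert-base-uncased checkpoint purely for demonstration. Comparing the answers a model gives to paraphrases of the same pattern is the intuition behind the consistency measurement mentioned above.

```python
# Illustrative LAMA-style cloze query; not the speaker's code.
# Assumes the Hugging Face `transformers` library and bert-base-uncased,
# both chosen here only for demonstration.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Ask the model to fill in a world-knowledge fact via a cloze pattern.
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))

# Consistency then asks whether a paraphrase of the same fact,
# e.g. "France's capital city is [MASK].", yields the same answer.
```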

Bio: Yanai Elazar is a third-year PhD student at Bar-Ilan University, working with Prof. Yoav Goldberg. His main interests include model interpretation and analysis, biases in datasets and models, and commonsense reasoning. Yanai has been awarded multiple scholarships, including the PBC fellowship for outstanding PhD candidates in Data Science and the Google PhD Fellowship.