Ethan Perez will be giving a hybrid talk at the NLP seminar on Friday, April 29, from 11am-noon PST. This talk will be held in person in South Hall 202, and Zoom information will be distributed via the Berkeley NLP Seminar listserv for those wishing to attend remotely.

Title: Aligning Language Models with Human Preferences

Abstract: Self-supervised learning objectives are highly effective at pretraining language models (LMs) for various tasks. In this talk, we first show that self-supervised objectives are misaligned with human preferences in many important ways: LMs trained on internet text generate misinformation, offensive jokes, and personal contact information, and are highly sensitive to the conditioning text (“prompt”). Next, we show that LM-based classifiers are effective at predicting which texts humans prefer. As a result, it is possible to use such classifiers as a learning signal to automatically correct the LM. We showcase this approach by training a high-quality retrieval system, obtaining strong performance across a variety of tasks using Retrieval-Augmented Generation (RAG). Even after such training schemes, some undesirable behaviors may remain undetected during training. We thus go a step further and use LMs to generate inputs that elicit undesirable behaviors from the LM under test, to preemptively catch and fix such behaviors. Overall, we find that some of the most powerful tools for aligning LMs with human preferences are LMs themselves.
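To make the classifier-as-learning-signal idea concrete, here is a minimal illustrative sketch (not the speaker's implementation): a preference classifier scores candidate LM outputs, and the highest-scoring candidate is kept, a best-of-n filtering step. The `preference_score` heuristic and its blocklist are purely hypothetical stand-ins for a learned LM-based classifier.

```python
def preference_score(text: str) -> float:
    """Toy stand-in for an LM-based preference classifier.

    A real system would use a learned model of human preferences;
    here we simply penalize hypothetical "undesirable" phrases.
    """
    blocklist = ("phone number", "offensive")
    return -sum(phrase in text.lower() for phrase in blocklist)


def best_of_n(candidates: list[str]) -> str:
    """Keep the candidate output the preference model scores highest."""
    return max(candidates, key=preference_score)


candidates = [
    "Call me at my phone number for details.",
    "Here is a summary of the requested topic.",
]
print(best_of_n(candidates))  # the candidate with no flagged phrases wins
```

In practice, the same classifier score can also serve as a reward for fine-tuning the LM itself, rather than only filtering its outputs at inference time.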

Bio: Ethan Perez is a fourth-year Ph.D. student in Natural Language Processing at New York University. He is advised by Kyunghyun Cho and Douwe Kiela and funded by NSF and Open Philanthropy. His research aims to develop learning algorithms that overcome human shortcomings, such as social biases, cognitive biases, and misconceptions. He has previously spent time at DeepMind, Facebook AI Research, the Montreal Institute for Learning Algorithms, and Google. He earned a Bachelor’s degree from Rice University as the Engineering department’s Outstanding Senior.