Month: October 2015

Oct 22: Marcus Rohrbach: LRCN — An Architecture for Visual Recognition, Description, and Question Answering

Speaker: Marcus Rohrbach (UC Berkeley)

Title: LRCN – An Architecture for Visual Recognition, Description, and Question Answering


Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models that are also recurrent, or “temporally deep”, are effective for tasks involving sequences, visual and otherwise. We developed the Long-term Recurrent Convolutional Network (LRCN), a novel architecture that is end-to-end trainable and suitable for large-scale visual learning. In this talk I will demonstrate the value of this model on video recognition tasks, image description and retrieval problems, video narration challenges, and visual question answering. In contrast to current models that assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are “doubly deep” in that they can be compositional in both spatial and temporal “layers”. Our model consequently has advantages when target concepts are complex and/or training data are limited, as we show on several benchmarks. I will conclude with some ongoing projects on visual grounding and on how we want to describe novel visual concepts.

Oct 8: Taylor Berg-Kirkpatrick: Unsupervised Transcription of Language and Music

Speaker: Taylor Berg-Kirkpatrick (UC Berkeley)

Title: Unsupervised Transcription of Language and Music


A variety of transcription tasks (for example, historical document transcription and polyphonic music transcription) can be viewed as linguistic decipherment problems. I’ll describe an approach to such problems that involves building a detailed generative model of the relationship between the input (e.g., an image of a historical document) and its transcription (the text the document contains). It turns out that these models can be learned in a completely unsupervised fashion, without ever seeing an example of an input annotated with its transcription, effectively deciphering the hidden correspondence. I’ll demo two state-of-the-art systems, one for historical document transcription and one for polyphonic piano music transcription, that outperform supervised methods.

Slides: (pdf)