Speaker: Marcus Rohrbach (UC Berkeley)

Title: LRCN – an Architecture for Visual Recognition, Description, and Question Answering


Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models that are also recurrent, or “temporally deep”, are effective for tasks involving sequences, visual and otherwise. We developed the Long-term Recurrent Convolutional Network (LRCN), a novel end-to-end trainable architecture suitable for large-scale visual learning. In this talk I will demonstrate the value of this model on video recognition tasks, image description and retrieval problems, video narration challenges, and visual question answering. In contrast to current models, which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are “doubly deep” in that they can be compositional in both spatial and temporal “layers”. Our model consequently has advantages when target concepts are complex and/or training data are limited, as we show on several benchmarks. I will conclude with some ongoing projects on visual grounding and on describing novel visual concepts.
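
To make the “doubly deep” idea concrete, here is a minimal sketch in plain Python of the general pattern the abstract describes: a convolutional feature extractor applied to every frame, followed by an LSTM that carries state across time. It is an illustration only and not the speaker's implementation. The tiny 1-D convolution stands in for a real CNN, the weights are random and untrained, and all names (conv_features, lstm_step, init_lstm, lrcn_forward) are hypothetical.

```python
import math
import random

def conv_features(frame, kernels):
    """Stand-in for a CNN: valid 1-D convolution, ReLU, then global max pooling."""
    feats = []
    for k in kernels:
        responses = [
            max(0.0, sum(w * frame[i + j] for j, w in enumerate(k)))
            for i in range(len(frame) - len(k) + 1)
        ]
        feats.append(max(responses))
    return feats

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM step; W maps a gate name to a list of (weight_row, bias) pairs."""
    z = x + h  # concatenate the frame features with the previous hidden state

    def gate(name, act):
        return [act(sum(w * v for w, v in zip(row, z)) + b) for row, b in W[name]]

    i = gate("i", sigmoid)    # input gate
    f = gate("f", sigmoid)    # forget gate
    o = gate("o", sigmoid)    # output gate
    g = gate("g", math.tanh)  # candidate cell update
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

def init_lstm(input_size, hidden_size, rng):
    """Random, untrained weights (an assumption made purely for illustration)."""
    def rows():
        return [
            ([rng.uniform(-0.5, 0.5) for _ in range(input_size + hidden_size)], 0.0)
            for _ in range(hidden_size)
        ]
    return {name: rows() for name in "ifog"}

def lrcn_forward(frames, kernels, W, hidden_size):
    """Spatial depth per frame (convolution), temporal depth across frames (LSTM)."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    outputs = []
    for frame in frames:
        x = conv_features(frame, kernels)
        h, c = lstm_step(x, h, c, W)
        outputs.append(h)
    return outputs

rng = random.Random(0)
kernels = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
W = init_lstm(input_size=len(kernels), hidden_size=4, rng=rng)
frames = [[rng.random() for _ in range(8)] for _ in range(5)]  # 5 toy "frames"

states = lrcn_forward(frames, kernels, W, hidden_size=4)
reversed_states = lrcn_forward(frames[::-1], kernels, W, hidden_size=4)
```

Unlike simple temporal averaging of per-frame features, which ignores frame order, the recurrent state depends on the order in which frames arrive, so states and reversed_states generally end in different final hidden vectors.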