Multimodal Learning for Vision and Language

Junhua Mao
PhD, 2017
Yuille, Alan L.
This thesis proposes and addresses several tasks in the field of vision and language, a new and challenging area spanning some of the most active research topics in both computer vision and natural language processing. We first propose an effective RNN-CNN (Recurrent Neural Network-Convolutional Neural Network) framework for image captioning, i.e., describing an image with a sentence. Building on this work, we propose effective models and construct large-scale datasets for a range of vision and language tasks, including unambiguous object description (i.e., referring expressions), image question answering, one-shot novel concept captioning, multimodal word embedding, and multi-label classification. Many of these tasks had not been successfully addressed, or even investigated, before. Our work is among the first deep learning efforts for these tasks and achieves state-of-the-art results. We hope the methods and datasets proposed in this thesis will provide insight for the future development of vision and language.
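To make the RNN-CNN framework concrete, the following is a minimal sketch in PyTorch of a generic CNN-encoder/RNN-decoder captioning model. It is an illustration under assumed dimensions, not the thesis's actual m-RNN formulation; the class and parameter names (CaptionModel, vocab_size, embed_dim, hidden_dim) are hypothetical choices for the demo.

# Minimal CNN-RNN captioning sketch (PyTorch); illustrative only,
# not the thesis's exact m-RNN architecture.
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: a small stack standing in for a pretrained network.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # -> (B, 64, 1, 1)
        )
        self.img_proj = nn.Linear(64, hidden_dim)    # image feature -> initial RNN state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids (teacher forcing).
        feats = self.cnn(images).flatten(1)                  # (B, 64)
        h0 = torch.tanh(self.img_proj(feats)).unsqueeze(0)   # (1, B, hidden)
        emb = self.embed(captions)                           # (B, T, embed)
        hidden, _ = self.rnn(emb, h0)                        # (B, T, hidden)
        return self.out(hidden)                              # (B, T, vocab) logits

# Usage: next-token logits for a dummy batch of two images.
model = CaptionModel(vocab_size=1000)
logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 1000])

In this encoder-decoder pattern the image feature conditions the language model (here via the initial hidden state); the thesis's m-RNN instead fuses the image feature with the word representation at every time step.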