Holistic Scene Understanding and Goal-directed Multi-agent Event Parsing

Yixin Chen
PhD, 2022
Zhu, Song-Chun
Humans, even young infants, are adept at perceiving and understanding complex indoor scenes and events. Holistic scene understanding spans many aspects, including 3D human pose, objects, physical relations, and functionality. Beyond the physical and functional configuration of the scene, interpreting human actions and goal-oriented tasks is a higher-level goal that requires reasoning about the complex structure of activities along the temporal dimension. When multiple people are in the scene, collaboration and communication inevitably occur, in both verbal and non-verbal forms. Despite recent remarkable progress in artificial intelligence, building an intelligent machine with human-like perception and reasoning capabilities for these complex tasks remains a significant and challenging problem.

In this dissertation, we study holistic scene understanding and goal-directed multi-agent event parsing by identifying the critical problems from various perspectives. We first propose a framework for holistic 3D scene parsing and human pose estimation, with a particular focus on human-object interaction and physical commonsense reasoning. Contact information is critical for modeling fine-grained human-object relations from visual cues. We demonstrate how to extract meaningful contact information from 2D images and show its usefulness in 3D human pose estimation. We then introduce our efforts in understanding goal-directed actions, concurrent multi-tasks, and collaborations among multiple agents. Finally, we investigate two typical types of human communication by proposing a spatial and temporal model for shared attention and by examining the power of both language and gesture in the embodied reference setting.
2022