A Cognition Platform for Joint Inference of 3D Geometry, Object States, and Human Belief

Tao Yuan
PhD, 2019
Zhu, Song Chun
Humans can extract rich information from visual scenes, such as the 3D locations of objects and humans, the actions of humans, the states of objects, and the beliefs of humans. Although state-of-the-art algorithms achieve good results on individual tasks, building a system that jointly infers these tasks for scene understanding remains an underexplored area. Most of these tasks are not independent of each other, and humans use commonsense knowledge to jointly infer hidden information across them. In this dissertation, we propose a spatio-temporal framework to jointly infer and optimize multiple tasks across different times and views with a unified explicit probabilistic graphical representation.
This dissertation contains four main parts. 1) We describe the system overview, the data flow in the system, and the engineering efforts that make the system scalable under different scenarios. 2) We propose an algorithm for holistic 3D scene parsing and human pose estimation with human-object interaction and physical commonsense. Human-object interaction models the fine-grained relations between agents and objects, and physical commonsense models the physical plausibility of the reconstructed scene. 3) We introduce a joint parsing framework that integrates view-centric proposals into scene-centric parse graphs, which represent a coherent understanding of cross-view scenes. 4) We present a joint inference algorithm for understanding object states, robot knowledge, and human beliefs under multi-view settings by maintaining three types of parse graphs. The algorithm can be applied to the cross-view small object tracking problem and to some false-belief problems. Experiments show that our joint inference framework can achieve better results than individual algorithms.
2019