Spatial-Temporal Hierarchical Model for Joint Learning and Inference of Human Action and Pose
Xiaohan Nie
Ph.D., 2017
Advisor: Song-Chun Zhu
In computer vision, human pose estimation and human action recognition are two classic and particularly important tasks. They often serve as preprocessing steps for higher-level tasks such as group activity analysis, visual search, and human identification, and they are key components in many real applications such as intelligent surveillance and human-computer interaction systems. The two tasks are closely related in understanding human motion; most methods, however, learn separate models and combine them sequentially.
In this dissertation, we pursue a unified framework that integrates training and inference for human pose estimation and action recognition in a spatial-temporal And-Or Graph (ST-AOG) representation. In particular, we study several ways to achieve this goal:
(1) A two-level And-Or Tree structure represents an action as an animated pose template (APT). Each action is a sequence of moving pose templates with transition probabilities. Each pose template consists of a shape template, represented by an And-node capturing part appearance, and a motion template, represented by an Or-node capturing part motions. The transitions between moving pose templates are governed by a Hidden Markov Model. The part locations, pose types, and action labels are estimated jointly during inference.
(2) To handle actions seen from unknown and unseen views, we present a multi-view spatial-temporal And-Or Graph (MST-AOG) for cross-view action recognition. As a compositional model, the MST-AOG compactly represents the hierarchical combinatorial structures of cross-view actions by explicitly modeling variations in geometry, appearance, and motion. Model training takes advantage of 3D human skeleton data obtained from Kinect cameras to avoid annotating video frames. Efficient inference enables action recognition from novel views. A new Multi-view Action3D dataset has been created and released.
(3) To further represent part, pose, and action jointly and improve performance, we represent action at three scales with an ST-AOG model. Each action is decomposed into poses, which are further divided into mid-level spatial-temporal parts (ST-parts) and then into parts. The hierarchical model structure captures the geometric and appearance variations of pose at each frame. The lateral connections between ST-parts at adjacent frames capture the action-specific motions. The model parameters at three scales are learned discriminatively, and dynamic programming is used for efficient inference. The experiments demonstrate a large benefit from modeling the two tasks jointly.
(4) Last but not least, we study a novel framework for full-body 3D human pose estimation, which is an essential task for human attention recognition, robot-based human action prediction, and interaction. We build a two-level hierarchy of tree-structured Long Short-Term Memory (LSTM) networks to predict the depth of 2D human joints and then reconstruct the 3D pose. Our two-level model uses two cues for depth prediction: 1) global features from the 2D skeleton and 2) local features from image patches of body parts.
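As an illustration of the Hidden Markov Model transitions in (1), the temporal part of the inference can be sketched as a Viterbi decode over pose-template indices. This is a minimal sketch under stated assumptions, not the dissertation's implementation: the function name `viterbi` and the arrays `log_init`, `log_trans`, and `log_emit` are hypothetical, and the per-frame template scores are assumed to be given (in the full model they would come from the shape and motion templates).

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely sequence of pose-template indices under an HMM (log space).

    log_init:  (S,)   log prior over the S pose templates (assumed input)
    log_trans: (S, S) log transition probabilities, row i -> column j
    log_emit:  (T, S) per-frame log score of each template (assumed given)
    """
    T, S = log_emit.shape
    score = log_init + log_emit[0]           # best log score ending in each state
    back = np.zeros((T, S), dtype=int)       # backpointers
    for t in range(1, T):
        cand = score[:, None] + log_trans    # cand[i, j]: come from i, land in j
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(score))
```

The cost is O(T S^2) for T frames and S templates. Working in log space avoids numerical underflow on long sequences.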
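The dynamic programming mentioned in (3) can be illustrated, for a single frame, by max-sum message passing on a tree of parts. The sketch below is a generic tree DP under stated assumptions, not the ST-AOG inference itself: it ignores the lateral connections between adjacent frames, and the names `tree_max_sum`, `children`, `unary`, and `pairwise` are hypothetical.

```python
import numpy as np

def tree_max_sum(children, unary, pairwise, root=0):
    """Exact best assignment on a tree-structured model via max-sum DP.

    children: dict node -> list of child nodes
    unary:    dict node -> (K_node,) score of each candidate state
    pairwise: dict (parent, child) -> (K_parent, K_child) compatibility scores
    Returns (best_score, dict node -> chosen state index).
    """
    best_child_state = {}

    def upward(node):
        # Score of the subtree rooted at `node`, for each state of `node`.
        total = np.asarray(unary[node], dtype=float).copy()
        for c in children.get(node, []):
            cand = pairwise[(node, c)] + upward(c)[None, :]   # (K_node, K_c)
            best_child_state[(node, c)] = np.argmax(cand, axis=1)
            total += np.max(cand, axis=1)
        return total

    root_score = upward(root)
    assign = {root: int(np.argmax(root_score))}
    stack = [root]
    while stack:                              # top-down backtracking
        node = stack.pop()
        for c in children.get(node, []):
            assign[c] = int(best_child_state[(node, c)][assign[node]])
            stack.append(c)
    return float(np.max(root_score)), assign
```

Each edge is visited once, so the cost is linear in the number of parts and quadratic in the number of candidate states per part.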
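The final step in (4), reconstructing a 3D pose from 2D joints and their predicted depths, can be sketched as a back-projection. The abstract does not state the camera model, so this sketch assumes a pinhole camera with known intrinsics; the function `backproject` and the parameters `focal` and `center` are hypothetical and introduced only for illustration.

```python
import numpy as np

def backproject(joints_2d, depths, focal, center):
    """Lift 2D joints to 3D camera coordinates given a depth per joint.

    Assumes a pinhole camera (an assumption, not stated in the source):
        X = (u - cx) * Z / f,   Y = (v - cy) * Z / f
    joints_2d: (J, 2) pixel coordinates (u, v)
    depths:    (J,)   predicted depth Z of each joint
    focal:     focal length in pixels
    center:    (cx, cy) principal point in pixels
    """
    joints_2d = np.asarray(joints_2d, dtype=float)
    depths = np.asarray(depths, dtype=float)
    xy = (joints_2d - np.asarray(center, dtype=float)) * depths[:, None] / focal
    return np.concatenate([xy, depths[:, None]], axis=1)   # (J, 3)
```

For example, with `focal=500` and `center=(320, 240)`, a joint at pixel (420, 240) with depth 4.0 maps to the 3D point (0.8, 0.0, 4.0).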