Coupling and Learning Hierarchical Generative and Descriptive Models for Image Synthesis and Analysis
Learning a generative model with compositional structure is a fundamental problem in statistics. My thesis generalizes two major classical statistical models by introducing Convolutional Neural Networks (ConvNets): (1) the exponential family model, which is generalized to a descriptor model by a bottom-up ConvNet, and (2) the latent factor model, which is generalized to a generator model by a top-down ConvNet. The probability distribution of the descriptor takes the form of an exponential tilting of a reference distribution. The descriptor can be derived directly from the discriminative ConvNet. Assuming rectified linear units and a Gaussian white noise reference distribution, the descriptor is piecewise Gaussian and contains a representational structure with multiple layers of binary activation variables, which reconstruct the mean of each Gaussian piece. The model is learned by Maximum Likelihood Estimation (MLE). The Langevin dynamics for data synthesis is driven by the reconstruction error, and the corresponding gradient descent dynamics converges to a local energy minimum that is auto-encoding. The probability distribution of the generator is a multivariate Gaussian whose mean is computed by a non-linear ConvNet mapping of the latent factors. The model is learned by an alternating back-propagation algorithm, which follows the alternating scheme of the well-known Expectation-Maximization (EM) algorithm. Alternating back-propagation iterates the following two steps: (a) inferential back-propagation, which infers the latent factors by Langevin dynamics or gradient descent, and (b) learning back-propagation, which updates the parameters by gradient descent given the inferred latent factors. The gradient computations in both steps are powered by back-propagation, and the two steps share most of their code.
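The two steps of alternating back-propagation can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: it assumes a linear map g(z) = Wz in place of the top-down ConvNet, so the gradients are written by hand rather than obtained by back-propagation, and the function name `alternating_backprop` and all parameter names and default values are hypothetical choices made for this illustration.

```python
import numpy as np

def alternating_backprop(Y, d=2, sigma=1.0, n_iters=300, n_langevin=20,
                         step=0.2, lr=0.05, langevin_noise=True, seed=0):
    """Toy alternating back-propagation for Y = W z + noise, z ~ N(0, I).

    Assumption: a linear generator stands in for the top-down ConvNet.
    """
    rng = np.random.default_rng(seed)
    n, D = Y.shape
    W = 0.1 * rng.standard_normal((D, d))  # generator parameters
    Z = rng.standard_normal((n, d))        # latent factors, kept across iterations
    history = []
    for _ in range(n_iters):
        # (a) Inferential step: Langevin dynamics on log p(y, z) with respect to z.
        for _ in range(n_langevin):
            resid = Y - Z @ W.T
            grad_z = resid @ W / sigma**2 - Z  # likelihood term plus N(0, I) prior term
            Z = Z + 0.5 * step**2 * grad_z
            if langevin_noise:  # set to False for plain gradient descent on z
                Z = Z + step * rng.standard_normal(Z.shape)
        # (b) Learning step: gradient ascent on log p(y, z) with respect to W.
        resid = Y - Z @ W.T
        W = W + lr * (resid.T @ Z) / (n * sigma**2)
        history.append(float(np.mean(resid**2)))  # reconstruction error per entry
    return W, Z, history
```

Both steps use the same residual Y - g(Z), which is the toy counterpart of the shared back-propagation code mentioned above. Calling `alternating_backprop(Y)` on an array of shape (n, D) returns the learned weights, the inferred latent factors, and the reconstruction error recorded at each iteration.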
The learning algorithms of the two models can be interwoven into a cooperative training algorithm, in which the generator model produces synthesized examples to jump-start the Markov Chain Monte Carlo (MCMC) sampling of the descriptor model, which in turn fuels the learning of the descriptor. The generator can then learn from how the descriptor's MCMC revises the synthesized examples, and this learning is supervised because the latent factors that generated those examples are known. The experimental results show that the two models can generate realistic images, audio, and dynamic patterns. Moreover, the generator can also learn from incomplete or indirect training data. The generator and the cooperative training algorithm can outperform the generative adversarial network (GAN) and the variational auto-encoder (VAE) in data recovery and completion tasks.
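The flow of information in cooperative training can be sketched on a deliberately small toy problem. The sketch below is an illustration under strong assumptions, not the thesis algorithm: the descriptor uses the hypothetical linear score f(y; a) = a·y with a unit-variance Gaussian white noise reference, so it only learns a mean, and the generator is the hypothetical linear map g(z) = Wz + b. The function name and all step sizes are likewise assumptions.

```python
import numpy as np

def cooperative_training(Y, d=2, n_iters=300, n_langevin=10, step=0.3,
                         lr_desc=0.1, lr_gen=0.1, seed=0):
    """Toy cooperative training with a mean-only descriptor and a linear generator."""
    rng = np.random.default_rng(seed)
    n, D = Y.shape
    a = np.zeros(D)                        # descriptor parameter, f(y; a) = a . y
    W = 0.5 * rng.standard_normal((D, d))  # generator weights
    b = np.zeros(D)                        # generator bias
    for _ in range(n_iters):
        # 1. The generator synthesizes examples from known latent factors.
        Z = rng.standard_normal((n, d))
        Y_hat = Z @ W.T + b
        # 2. The descriptor revises them by Langevin dynamics.
        #    Energy: |y|^2 / 2 - a . y, so the drift is a - y.
        Y_rev = Y_hat.copy()
        for _ in range(n_langevin):
            Y_rev = (Y_rev + 0.5 * step**2 * (a - Y_rev)
                     + step * rng.standard_normal(Y_rev.shape))
        # 3. Descriptor update: observed statistics minus synthesized statistics.
        a = a + lr_desc * (Y.mean(axis=0) - Y_rev.mean(axis=0))
        # 4. Generator update: supervised regression of revised examples on known Z.
        resid = Y_rev - Y_hat
        W = W + lr_gen * (resid.T @ Z) / n
        b = b + lr_gen * resid.mean(axis=0)
    return a, W, b
```

In this toy setting both `a` and `b` should move toward the sample mean of the training data. Because the assumed descriptor has a single mode and the Langevin noise is independent of the latent factors, the generator weights `W` shrink during training, so the sketch shows how the two models exchange information rather than how well they synthesize data.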