Model Checking for Incomplete High-dimensional Categorical Data

Ming-Yi Hu
Ph.D., 1999
Advisors: Thomas Belin and Robert Jennrich

Categorical data are often arranged in a contingency table and summarized by a loglinear model. A standard approach for comparing two competing models is to calculate twice the difference between their maximized loglikelihoods, which asymptotically follows a χ² distribution. When the data are sparse, however, the χ² approximation may be questionable. As an alternative to a large-sample approximation to the reference distribution, we implement the framework introduced by Rubin (1984) for finding the posterior predictive check (PPC) distribution. The PPC distribution is the conditional distribution of a future value of a test statistic given the observed data and the model specification, and it can serve as the reference distribution for the relevant likelihood-ratio statistics. However, constructing a PPC distribution from a large number of replicates can be computationally demanding. This is especially the case when the original data are incomplete, since generating each PPC replicate requires an involved statistical computing approach (we use a data-augmentation strategy). For practical use, we propose approximating the PPC distribution by a gamma distribution whose parameters are estimated from a combination of training data and a modest-sized sample of PPC replicates. Simulated examples suggest that this procedure, which can reduce the computation needed to approximate the PPC distribution by a factor of 20, has satisfactory statistical properties.
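The idea of replacing a large set of PPC replicates with a fitted gamma distribution can be illustrated with a minimal Python sketch. It is not the estimation procedure of the dissertation: it uses plain method-of-moments matching on the replicates alone and leaves out the training-data component. The function names, the stand-in replicates (drawn from a known gamma distribution rather than produced by data augmentation), and the observed statistic are all hypothetical.

```python
import math
import random
import statistics


def fit_gamma_by_moments(replicates):
    """Method-of-moments gamma fit (an assumed, simplified estimator).

    Returns (shape, scale) such that shape * scale equals the sample mean
    and shape * scale**2 equals the sample variance.
    """
    mean = statistics.fmean(replicates)
    var = statistics.variance(replicates)
    return mean * mean / var, var / mean


def gamma_upper_tail(x, shape, scale):
    """P(X >= x) for X ~ Gamma(shape, scale).

    Uses the power series for the regularized lower incomplete gamma
    function, which is adequate for the moderate arguments used here.
    """
    if x <= 0:
        return 1.0
    z = x / scale
    term = 1.0 / shape
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total) and n < 10000:
        n += 1
        term *= z / (shape + n)
        total += term
    log_lower = shape * math.log(z) - z - math.lgamma(shape) + math.log(total)
    return max(0.0, 1.0 - math.exp(log_lower))


# Stand-in for a modest-sized sample of PPC replicates of the
# likelihood-ratio statistic (hypothetical values, not real replicates).
random.seed(1)
replicates = [random.gammavariate(4.0, 2.0) for _ in range(50)]

shape, scale = fit_gamma_by_moments(replicates)

# Hypothetical observed likelihood-ratio statistic.
observed_statistic = 15.0
tail_probability = gamma_upper_tail(observed_statistic, shape, scale)
print(f"shape={shape:.3f}, scale={scale:.3f}, tail={tail_probability:.4f}")
```

With a fitted gamma in hand, the tail probability of the observed statistic is available in closed form, so far fewer replicates are needed than when estimating the same tail area by simple counting.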