Missing Data Imputation for Tree-Based Models

Yan He
Ph.D., 2006
Advisor: Richard Berk
Many kinds of data include some form of censoring or missing information. Missing data are a problem for all statistical analyses, and tree-based models such as CART and Random Forests (RF) are no exception. In recent years, many newly developed tools have been applied to missing-data problems: likelihood and estimating-function methodology, cross-validation, the bootstrap and other simulation techniques, Bayesian and multiple imputation, and the EM algorithm. Although these methods have been applied successfully to well-defined parametric models, they may be inappropriate for tree-based models, which are usually regarded as nonparametric. CART and RF have built-in algorithms for imputing missing data, such as surrogate variables (CART) or proximity (RF). But these imputation methods have no formal rationale and are unstable, especially for RF models. Nonparametric bootstrap methods for imputing missing values overcome the drawbacks implicit in both single and multiple imputation. They 1) do not depend on the missing-data mechanism, 2) require no knowledge of either the probability distributions or the model structure, and 3) incorporate estimates of the uncertainty associated with the imputed data. Furthermore, 2000 bootstrap replications provide stable and accurate statistical inferences (Efron, 1994). In my dissertation research, nonparametric bootstrap methods were implemented to impute missing values before cases were dropped down the tree (CART/RF), and the classification results were compared both to complete-data/full-data analysis and to the classification results obtained using surrogate variables/proximity. Significant improvement in predictive ability was found for both CART and RF models.
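The general idea of bootstrap-based imputation can be illustrated with a minimal sketch. It is not the procedure used in the dissertation, whose details are not given here. It assumes a single numeric variable with missing entries marked as None, and it fills each gap with a draw from a bootstrap resample of the observed values, so it ignores relationships between variables. The function name bootstrap_impute, the toy data, and the choice of 200 replicates are hypothetical and chosen only for illustration.

```python
import random
import statistics


def bootstrap_impute(values, n_replicates=200, seed=0):
    """Return n_replicates completed copies of `values` (None marks missing).

    Hypothetical helper for illustration: a simplified, univariate
    nonparametric bootstrap imputation, not the dissertation's method.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("need at least one observed value")

    completed = []
    for _ in range(n_replicates):
        # One bootstrap sample: resample the observed values with replacement.
        boot = rng.choices(observed, k=len(observed))
        # Fill each missing entry with a random draw from that bootstrap sample;
        # observed entries are left unchanged.
        filled = [v if v is not None else rng.choice(boot) for v in values]
        completed.append(filled)
    return completed


# Toy data (made up): two of six values are missing.
data = [2.0, None, 3.5, 4.0, None, 5.5]
replicates = bootstrap_impute(data, n_replicates=200, seed=1)

# Summarize across replicates: the average of the per-replicate means is a
# point estimate, and their spread reflects uncertainty due to imputation.
means = [statistics.mean(r) for r in replicates]
point_estimate = statistics.mean(means)
spread = statistics.stdev(means)
```

In a tree-based workflow, each completed copy would be passed to the classifier in turn and the predictions combined across copies; that step is omitted here because it depends on the modeling software used.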