A Two-Stage ML Approach to Missing Data: Theory and Application to Auxiliary Variables
Victoria Savalei, Peter M. Bentler
A popular ad hoc approach to conducting structural equation modeling (SEM) with missing data is to obtain a saturated maximum likelihood (ML) estimate of the population covariance matrix ('the EM covariance matrix') and then to use this estimate in the complete data ML fitting function to obtain parameter estimates. This two-stage approach is appealing because it minimizes a familiar function while being only marginally less efficient than the direct ML approach (Graham, 2003). Importantly, the two-stage approach also allows for easy incorporation of auxiliary variables, which can mitigate the bias and efficiency problems caused by missing data (Collins, Schafer, & Kam, 2001). Incorporating auxiliary variables with direct ML is not straightforward and requires setting up a special model. However, the standard errors and test statistics provided by a complete data routine analyzing the EM covariance matrix will be incorrect. Empirical approaches to finding the right corrections have failed to provide unequivocal solutions (Enders & Peugh, 2004). In this paper, we rely on the results of Yuan and Bentler (2000) to develop theoretical formulas for the correct standard errors and test statistics for the two-stage approach and for its extension to include auxiliary variables. Because these formulas accurately reflect the variability of the two-stage estimator, the actual sample size multiplier n can be used, and no adjustments are necessary. We study the performance of the two-stage test statistics and standard errors in a small simulation study that replicates the conditions studied by Enders and Peugh (2004). We find not only that the new two-stage approach performs well in all conditions, but also that the two-stage residual-based test statistic outperforms the direct ML statistic, which is deemed optimal in the missing data literature. We call for incorporating this new missing data method into standard SEM software for further study.
2007-09-01
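The two-stage procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: only two variables, x always observed and y possibly missing (coded as None), bivariate normal data, and ML (divide-by-n) scaling. The function names em_bivariate, f_ml and fit_equal_variances are hypothetical, and the equal-variance model in stage 2 is an invented toy model chosen because its minimizer has a closed form. Stage 1 computes the saturated EM moments; stage 2 plugs the EM covariance matrix into the complete data ML fitting function F = ln|Sigma| + tr(S Sigma^-1) - ln|S| - p. The sketch does not compute the corrected standard errors or test statistics developed in the paper; naive values from stage 2 would be incorrect, as the abstract notes.

```python
import math


def em_bivariate(data, n_iter=500, tol=1e-10):
    """Stage 1: saturated ML (EM) moments for pairs (x, y), y may be None.

    Assumption: x is always observed, so its mean and variance are the
    usual ML sample moments. Returns (mx, my, sxx, sxy, syy).
    """
    n = len(data)
    xs = [x for x, _ in data]
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    # Starting values for the y moments come from the complete cases.
    cc = [(x, y) for x, y in data if y is not None]
    my = sum(y for _, y in cc) / len(cc)
    syy = sum((y - my) ** 2 for _, y in cc) / len(cc)
    sxy = 0.0
    for _ in range(n_iter):
        slope = sxy / sxx                # regression of y on x
        resid_var = syy - sxy * slope    # conditional variance of y given x
        s_y = s_yy = s_xy = 0.0
        for x, y in data:
            if y is None:
                # E-step: expected y and y^2 given x under current estimates
                ey = my + slope * (x - mx)
                eyy = ey * ey + resid_var
            else:
                ey, eyy = y, y * y
            s_y += ey
            s_yy += eyy
            s_xy += x * ey
        # M-step: update the moments from the expected sufficient statistics
        new_my = s_y / n
        new_syy = s_yy / n - new_my ** 2
        new_sxy = s_xy / n - mx * new_my
        change = max(abs(new_my - my), abs(new_syy - syy), abs(new_sxy - sxy))
        my, syy, sxy = new_my, new_syy, new_sxy
        if change < tol:
            break
    return mx, my, sxx, sxy, syy


def f_ml(S, Sigma):
    """Complete data ML fitting function for 2x2 symmetric matrices.

    Each matrix is passed as a tuple (v11, v12, v22).
    """
    a, b, c = S
    d, e, f = Sigma
    det_s = a * c - b * b
    det_sigma = d * f - e * e
    trace = (a * f - 2 * b * e + c * d) / det_sigma  # tr(S Sigma^-1)
    return math.log(det_sigma) + trace - math.log(det_s) - 2


def fit_equal_variances(S):
    """Stage 2 for a toy model with Var(x) = Var(y) = v and Cov(x, y) = c.

    For this hypothetical model the minimizer of f_ml is closed form:
    v is the average of the two variances and c is the covariance.
    """
    a, b, c = S
    v = (a + c) / 2.0
    return v, b, v


# Usage: made-up data, None marks a missing y value.
data = [(1.0, 2.0), (2.0, None), (3.0, 4.0), (4.0, 3.0), (5.0, None), (6.0, 7.0)]
mx, my, sxx, sxy, syy = em_bivariate(data)
S_em = (sxx, sxy, syy)                 # the 'EM covariance matrix'
Sigma_hat = fit_equal_variances(S_em)  # stage 2 parameter estimates
F_min = f_ml(S_em, Sigma_hat)          # minimized fitting function value
```

A general implementation would handle arbitrary missing data patterns across p variables and would minimize the fitting function numerically for an arbitrary model; the bivariate case is used here only to keep the sketch free of external dependencies.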