Statistical Methods in Classification Problems using Gene Expression/Proteomic Signatures

Xuelian Wei
Ph.D., 2008
Advisor: Ker-Chau Li

Conventional methods for cancer diagnosis are invasive, subjective, and labor-intensive. With recent advances in microarray and proteomic technologies, gene expression and proteomic signatures of biological samples from patients have shown promise as a feasible, non-invasive, objective, and accurate molecular diagnostic method in oncology (Clarke, et al., 2008; Eisen, et al., 1998; Golub, et al., 1999). First, a new technique for cancer classification from microarray data, called distribution-based classification (DBC), is proposed. It is motivated by the heuristic that samples within the same class tend to have more similar gene expression profiles than samples from different classes. The difference between the distributions of the within-class and the between-class sample correlations forms the basis of DBC. The new method is shown to perform as well as or better than other popular machine learning methods on class prediction in 22 binary and multi-class microarray datasets involving human cancers. Furthermore, DBC could be extended as a general classification technique for other binary and multi-class classification tasks.
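The heuristic behind DBC can be illustrated with a minimal sketch. This is not the DBC algorithm itself, whose details are not given here; it only separates pairwise sample correlations into within-class and between-class groups, then uses a simple stand-in rule (assign a new sample to the class with the highest mean correlation). The function names and the toy expression values are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean(values):
    return sum(values) / len(values)


def correlation_groups(samples, labels):
    """Split all pairwise sample correlations into within- and between-class lists."""
    within, between = [], []
    for i, j in combinations(range(len(samples)), 2):
        r = pearson(samples[i], samples[j])
        (within if labels[i] == labels[j] else between).append(r)
    return within, between


def predict_by_mean_correlation(samples, labels, new_sample):
    """Stand-in rule (an assumption, not DBC): pick the class whose training
    samples are, on average, most correlated with the new sample."""
    by_class = {}
    for s, lab in zip(samples, labels):
        by_class.setdefault(lab, []).append(pearson(s, new_sample))
    return max(by_class, key=lambda lab: mean(by_class[lab]))


# Hypothetical toy data: four samples, four genes, two classes.
samples = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],
    [4.2, 2.8, 2.1, 0.9],
]
labels = ["A", "A", "B", "B"]

within, between = correlation_groups(samples, labels)
predicted = predict_by_mean_correlation(samples, labels, [0.9, 2.2, 3.1, 3.8])
```

In this toy example the within-class correlations are strongly positive and the between-class correlations are negative, which is the kind of separation between the two distributions that the heuristic relies on.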

Biomarker (feature) selection is as important as classification for molecular diagnostic applications. A good biomarker not only helps predict classes accurately but also provides information for understanding the development of cancer or other diseases. In the second half of the dissertation, two feature selection methods are successfully applied to the biomarker discovery problem of detecting protein/peptide biomarkers from proteomic data. In CHAPTER 3, a new method for aligning detected protein/peptide peaks across spectra, called correlation-based hierarchical clustering (CBHC), is proposed to improve the pre-processing of raw data and thereby facilitate biomarker discovery. It is motivated by the complete linkage hierarchical clustering (CLHC) method (Tibshirani, et al., 2004), with important modifications that use information from both the locations and the shapes of detected peaks to achieve more accurate peak alignment.
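As a point of reference for the peak alignment step, the following is a minimal sketch of location-only complete linkage clustering of peak positions pooled across spectra. It covers only the baseline idea of grouping nearby peaks; the shape-based correlation information that CBHC adds is not represented. The function name, the cutoff value, and the peak locations are assumptions for illustration.

```python
def complete_linkage_1d(locations, cutoff):
    """Agglomerative complete-linkage clustering of 1-D peak locations.

    In one dimension the complete-linkage distance between two clusters is
    the span of their union, so the closest pair is always adjacent once the
    locations are sorted. Merging stops when the smallest span exceeds
    `cutoff`.
    """
    clusters = [[x] for x in sorted(locations)]
    while len(clusters) > 1:
        # Span of each adjacent pair if merged: largest value minus smallest.
        spans = [clusters[i + 1][-1] - clusters[i][0]
                 for i in range(len(clusters) - 1)]
        best = min(range(len(spans)), key=spans.__getitem__)
        if spans[best] > cutoff:
            break
        clusters[best:best + 2] = [clusters[best] + clusters[best + 1]]
    return clusters


# Hypothetical peak locations (m/z) pooled from several spectra.
peaks = [1000.2, 1000.9, 1001.1, 1500.0, 1500.6, 2300.4]
aligned = complete_linkage_1d(peaks, cutoff=2.0)

# One representative location per aligned peak group.
centers = [sum(c) / len(c) for c in aligned]
```

Each resulting cluster is treated as one aligned peak across spectra. A fixed cutoff is the simplest stopping rule and is used here only to keep the sketch short.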

In CHAPTER 4, the plasma protein profiles of renal transplant patients are analyzed with the prospect of finding novel biomarkers indicative of the rejection process. Two statistical models, a binomial model and a linear mixed-effects model, are used to assess the power of the potential biomarkers to distinguish rejection samples from post-rejection samples. We found 25 potential biomarkers specifically associated with acute cellular rejection of renal allografts. Four of these candidates appear to have the highest diagnostic value. Three of them have so far been identified and confirmed by monoclonal antibody immunoprecipitation; identification and confirmation of the fourth is ongoing.