DNA Copy Number Reconstruction via Regularization

Zhongyang Zhang
Ph.D., 2012
Advisors: Qing Zhou and Chiara Sabatti

Recent advances in genomics have underscored the surprising ubiquity of DNA copy number variants (CNVs). They carry information on the modalities of genome evolution and on the deregulation of DNA replication in cancer cells; their study can help localize tumor suppressor genes, distinguish different populations of cancerous cells, and identify genomic variations responsible for disease phenotypes. A number of different high-throughput technologies can be used to identify copy number variable sites, and the literature documents multiple effective algorithms. We augment this literature with a focus on computational speed and simultaneous analysis of multiple sequences.

On the one hand, we explore CNV reconstruction for a single sample via estimation with a fused-lasso penalty. We mount a fresh attack on this difficult optimization problem using a majorization-minimization (MM) framework. We also reframe the reconstruction problem in terms of imputation via discrete optimization. This approach is easier and more accurate than parameter estimation. The accuracy of our imputations is comparable to that of hidden Markov models at a substantially lower computational cost.
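As an illustration of the fused-lasso formulation mentioned above, the following is a minimal sketch of a generic MM iteration for a one-dimensional signal. It is not taken from the dissertation: the function names, the smoothing constant `eps`, the penalty weights `lam1` and `lam2`, and the choice of a quadratic majorizer for each absolute value are all assumptions made for the example. Each iteration replaces the penalty terms with quadratic upper bounds and solves the resulting tridiagonal linear system.

```python
import math


def fused_lasso_objective(y, b, lam1, lam2):
    """Squared-error fit plus a sparsity penalty and a penalty on adjacent jumps."""
    fit = 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, b))
    sparsity = lam1 * sum(abs(bi) for bi in b)
    jumps = lam2 * sum(abs(b[i + 1] - b[i]) for i in range(len(b) - 1))
    return fit + sparsity + jumps


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm. In row i, lower[i] multiplies x[i-1], upper[i] multiplies x[i+1]."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def fused_lasso_mm(y, lam1, lam2, eps=1e-8, iters=200):
    """Hypothetical MM solver: |u| is smoothed to sqrt(u^2 + eps) and majorized
    at the current iterate by a quadratic, so each step is a linear solve."""
    n = len(y)
    b = list(y)
    for _ in range(iters):
        # Weights of the quadratic majorizers at the current iterate.
        w = [lam1 / math.sqrt(bi * bi + eps) for bi in b]
        v = [lam2 / math.sqrt((b[i + 1] - b[i]) ** 2 + eps) for i in range(n - 1)]
        # Assemble the system (I + W + D^T V D) b = y, which is tridiagonal.
        diag = [1.0 + wi for wi in w]
        lower = [0.0] * n
        upper = [0.0] * n
        for i in range(n - 1):
            diag[i] += v[i]
            diag[i + 1] += v[i]
            upper[i] = -v[i]
            lower[i + 1] = -v[i]
        b = solve_tridiagonal(lower, diag, upper, y)
    return b


if __name__ == "__main__":
    # Toy intensities: a baseline near zero with one elevated segment.
    y = [0.1, -0.1, 0.05, 2.0, 2.1, 1.9, 0.0, 0.1]
    b = fused_lasso_mm(y, lam1=0.1, lam2=0.5)
    print([round(bi, 3) for bi in b])
    print(fused_lasso_objective(y, b, 0.1, 0.5))
```

Because every update is a tridiagonal solve, the cost per iteration in this sketch grows linearly with the number of probes, which is the kind of property that makes a penalized approach attractive for long genomic signals.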

On the other hand, we investigate the specific problem of detecting regions where variation in copy number is relatively common in the sample at hand: this encompasses the cases of copy number polymorphisms (CNPs), related samples, technical replicates, and cancerous sub-populations from the same individual. We present a segmentation method for reconstructing CNV regions that is based on penalized estimation and is capable of processing multiple signals jointly. Our approach is computationally attractive and achieves sensitivity and specificity comparable to those of state-of-the-art specialized methodologies. Its versatility and speed make the method applicable to data obtained with a wide range of technologies and particularly useful in the initial screening stages of large data sets.

Finally, we perform CNV detection and analysis in a set of pedigrees from two Central American isolate and admixed populations. We characterize CNPs in this sample in terms of their frequencies and prevalence on different genetic backgrounds.