Modeling and Analysis of Multiple Alignments, ChIP-seq, and Gene Expression Data for Finding Transcription Factor Binding Sites

Gong Chen
Ph.D., 2010
Advisor: Qing Zhou

Transcription factors bind to sequence-specific sites in DNA to regulate gene transcription. Identifying transcription factor binding sites (TFBS's) is an important step toward understanding gene regulation. In this dissertation, I advance TFBS detection in two directions: (1) modeling DNA multiple alignments to improve the accuracy of detecting TFBS's, and (2) analyzing ChIP-seq and gene expression data to identify combinatorial transcriptional regulation in mouse embryonic stem cells.

By treating TFBS's as signals and the surrounding DNA sequences as background, TFBS detection can be understood as a problem of detecting signals against background. Although sophisticated in modeling TFBS's and their combinatorial patterns, computational methods for TFBS detection often make oversimplified, homogeneous model assumptions for background sequences. Since nucleotide base composition varies across genomic regions, detection methods are expected to benefit from incorporating this heterogeneity into background modeling. When sequences from multiple species are used, variation in evolutionary conservation violates the common assumption of an identical conservation level across a multiple alignment. To handle both types of heterogeneity, I propose a generative model in which a segmented Markov chain partitions a multiple alignment into regions of homogeneous nucleotide base composition and a hidden Markov model (HMM) accounts for different conservation levels. Bayesian inference on the model is developed via Gibbs sampling with dynamic programming. Simulation studies and empirical evidence from biological data sets reveal the dramatic effect of background modeling on detecting TFBS's, and demonstrate that the proposed approach achieves substantial improvements over commonly used background models.
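The dynamic programming that underlies HMM inference can be illustrated with a minimal sketch. It is not the dissertation's model: it assumes a toy two-state HMM (weakly versus strongly conserved) in which each alignment column is reduced to a single binary observation (1 if all species agree, 0 otherwise), and every parameter value and name below is hypothetical. The scaled forward recursion sums over all hidden conservation paths in time linear in the alignment length, and a brute-force enumeration is included only to check it.

```python
import itertools
import math


def forward_loglik(obs, init, trans, emit):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the scaled forward recursion (dynamic programming)."""
    n_states = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step through the transition matrix, then emit.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        # Rescale to avoid underflow; the scale factors multiply to P(obs).
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik


def brute_force_loglik(obs, init, trans, emit):
    """Same quantity by enumerating every hidden path (exponential cost)."""
    total = 0.0
    for path in itertools.product(range(len(init)), repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return math.log(total)


# Hypothetical parameters, chosen only for illustration.
INIT = [0.5, 0.5]                  # state 0: weakly conserved, state 1: strongly conserved
TRANS = [[0.9, 0.1], [0.2, 0.8]]   # conservation levels tend to persist along the alignment
EMIT = [[0.7, 0.3], [0.2, 0.8]]    # EMIT[state][obs]; obs 1 = column identical across species
COLUMNS = [1, 1, 0, 1, 0, 0, 1, 1]  # toy alignment summarized column by column

if __name__ == "__main__":
    print(forward_loglik(COLUMNS, INIT, TRANS, EMIT))
    print(brute_force_loglik(COLUMNS, INIT, TRANS, EMIT))
```

The two functions should agree up to floating-point error. In a Gibbs sampler, the same forward quantities are typically reused to draw hidden state paths jointly rather than one position at a time.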

Data generated by chromatin immunoprecipitation coupled with sequencing (ChIP-seq) provide genome-wide binding locations of DNA-binding proteins. A recent study made available ChIP-seq data for several important transcription factors in mouse embryonic stem cells. Although these data have been shown to predict gene expression well, they may not account for some large-scale distinctive expression patterns. I hypothesize that other transcription factors, referred to as cofactors, collaborate with the factors profiled in the ChIP-seq experiments, and that these collaborations, that is, the combinatorial effects of multiple factors, may help explain patterns in gene expression profiles. After constructing features that integrate information from genomic sequences with the ChIP-seq data to indicate potential combinatorial effects of multiple factors on gene expression, I identify features that have strong statistical significance under false discovery rate control. By treating gene expression patterns as class labels and features as predictors, I report a small number of significant features that lead to considerable improvement in predicting expression patterns compared with classification that uses only information on the ChIP-seq factors. In addition, I provide biological interpretations of the regulatory roles of the cofactors involved in these features, some of which are supported by the existing literature. Finally, I predict target genes of the cofactors based on the classification results and show that gene expression profiles in an independent data set are consistent with the predictions.
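Feature selection under false discovery rate control can be sketched as follows. The abstract does not say which procedure is used, so this example assumes the Benjamini-Hochberg step-up procedure as a stand-in; the function name and the p-values are hypothetical and are not results from the dissertation.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices of hypotheses rejected by the
    Benjamini-Hochberg step-up procedure at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is under its threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    # Hypothetical p-values for five candidate features.
    feature_p_values = [0.039, 0.001, 0.205, 0.008, 0.041]
    print(benjamini_hochberg(feature_p_values, alpha=0.05))
```

With these illustrative values, only the features at indices 1 and 3 pass. Because the rule is step-up, a feature whose own p-value exceeds its threshold is still selected whenever a higher-ranked p-value falls under its threshold.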