Machine Learning Approaches to Understanding Gene Regulation in Mouse Embryonic Stem Cells

Michael James Mason
Ph.D., 2010
Advisor: Qing Zhou

New high-throughput technologies have enabled biologists to gain a genome-wide perspective on cell function, disease development, and species evolution. As the cost of these techniques decreases, unprecedented amounts of data are being generated. While traditional statistical methods can be useful in analyzing such data, new methodologies are needed to make full use of the information the data contain. This dissertation presents novel statistical methods aimed at better analyzing two types of high-throughput data: microarray expression data and DNA binding data (ChIP-chip and ChIP-seq).

Network analysis methods are useful for analyzing gene expression data collected across many samples. These approaches can find modules of coexpressed genes related to a particular cell function. One such algorithm, weighted gene coexpression network analysis (WGCNA), identifies candidate genes that may regulate gene expression by measuring each gene's module centrality. Here I extend WGCNA to incorporate the direction (sign) of gene coexpression in network construction. Applying this method to microarray samples from mouse embryonic stem cells (ESCs), I identify important functional pathways relevant to ESC pluripotency and self-renewal that would not be found by unsigned network analysis. Using WGCNA’s measure of module centrality, I identify novel genes that may regulate these pathways.
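The difference between unsigned and signed coexpression can be illustrated with a minimal sketch. It assumes the soft-thresholding adjacencies commonly used with WGCNA, |r|^beta for unsigned networks and ((1 + r)/2)^beta for signed networks, which may differ in detail from the construction developed in this dissertation. The gene names, expression values, beta values, and function names are hypothetical, and connectivity is computed over all genes as a simple stand-in for module centrality.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def unsigned_adjacency(r, beta=6):
    """Unsigned soft threshold: strong negative correlation counts as a strong link."""
    return abs(r) ** beta

def signed_adjacency(r, beta=12):
    """Signed soft threshold: only positive correlation yields a strong link."""
    return ((1 + r) / 2) ** beta

def connectivity(expr, adjacency, beta):
    """Sum of each gene's adjacencies to all other genes (illustrative centrality)."""
    genes = list(expr)
    return {
        g: sum(adjacency(pearson(expr[g], expr[h]), beta) for h in genes if h != g)
        for g in genes
    }

# Hypothetical expression profiles across five samples (not real data).
expr = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with gene_a
    "gene_c": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as gene_a rises
}

r_ac = pearson(expr["gene_a"], expr["gene_c"])
unsigned_k = connectivity(expr, unsigned_adjacency, beta=6)
signed_k = connectivity(expr, signed_adjacency, beta=12)
```

In this toy example gene_a and gene_c are perfectly anticorrelated, so the unsigned adjacency links them as strongly as gene_a and gene_b, while the signed adjacency separates them. This is the kind of distinction that lets a signed network keep oppositely regulated genes in different modules.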

Transcription factors (TFs) regulate gene expression by binding DNA sequence patterns, called motifs, in the promoter regions of genes. Chromatin immunoprecipitation assays such as ChIP-chip and ChIP-seq identify DNA sequences that are bound by specific TFs. The improved sensitivity and wider genomic coverage of newer ChIP assays have facilitated the discovery of consensus motifs. In this dissertation I develop the contrast motif finder (CMF), which is designed to take advantage of these improvements in accuracy and coverage in order to find regulatory signals hidden within consensus motifs. I show that the consensus motifs found by CMF are more accurate than those found by other popular motif finders. Furthermore, CMF identifies context-dependent motifs with implications for the combinatorial regulatory roles of TFs in ESCs.