Hierarchical Empirical Bayes Analysis of Genomic Microarrays

Stephen Erickson
Ph.D., 2006
Advisor: Chiara Sabatti

Genomic microarray data are characterized by an immense number of variables (i.e., genes or loci) but modest sample sizes. Hierarchical empirical Bayes analysis provides a natural, flexible, and useful paradigm for tackling such data. It is hierarchical because hierarchical structures allow variables to share statistical power, and empirical because the distribution of model parameters is rarely known with any precision a priori yet can be estimated reasonably from the experimental data.

A typical gene expression microarray experiment uses a handful of arrays to infer which of the thousands of assayed genes are differentially expressed, and by how much, under two contrasting cellular conditions. If one defines θi as the true change in expression of gene i between the two conditions, the vector θ = (θ1, ..., θN) is the parameter of primary interest. This dissertation describes a Bayesian approach to estimating the θi, dubbed the l1 estimator. This estimator has two defining characteristics. First, the prior distribution on θi is a mixture of a discrete point mass at zero (corresponding to no differential expression of gene i) and a symmetric continuous distribution centered at zero (indicating that there is no a priori preference for up- or down-regulation). Second, the l1 estimator θ̂i is defined as the posterior median, not the posterior mean, of θi, and therefore minimizes posterior expected absolute error loss. Simulation and experimental results show an interesting connection to the false discovery rate (FDR, Benjamini and Hochberg 1995).
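To make the posterior-median idea concrete, the following is a minimal sketch rather than the dissertation's implementation. It assumes a single normally distributed observation x with known standard deviation sigma, and takes the symmetric continuous component of the prior to be a normal distribution with standard deviation tau. Both choices, along with the function name `posterior_median` and the fixed prior probability p of differential expression, are illustrative assumptions. In an empirical Bayes analysis, p and tau would be estimated from the data rather than supplied by hand.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf


def posterior_median(x, p, sigma, tau):
    """Posterior median of theta given one observation x.

    Assumed (hypothetical) model, not taken from the dissertation:
      x | theta ~ N(theta, sigma^2)
      theta     ~ (1 - p) * point mass at 0  +  p * N(0, tau^2)
    """
    # Marginal density of x under each mixture component, weighted by the prior.
    null_density = (1 - p) * NormalDist(0.0, sigma).pdf(x)
    slab_density = p * NormalDist(0.0, sqrt(sigma**2 + tau**2)).pdf(x)
    w0 = null_density / (null_density + slab_density)  # P(theta = 0 | x)

    # Given theta != 0, the posterior is N(m, s^2) by normal-normal conjugacy.
    shrink = tau**2 / (sigma**2 + tau**2)
    m = shrink * x
    s = sqrt(shrink) * sigma

    # Posterior mass strictly below zero comes only from the continuous part.
    below_zero = (1 - w0) * STD_NORMAL.cdf(-m / s)

    if below_zero > 0.5:
        # Median is negative: solve (1 - w0) * Phi((t - m) / s) = 0.5.
        return m + s * STD_NORMAL.inv_cdf(0.5 / (1 - w0))
    if below_zero + w0 >= 0.5:
        # The point mass at zero covers the 50% level, so the median is 0.
        return 0.0
    # Median is positive: solve (1 - w0) * Phi((t - m) / s) + w0 = 0.5.
    return m + s * STD_NORMAL.inv_cdf((0.5 - w0) / (1 - w0))
```

Under these assumptions the estimator acts as a thresholding rule: an observation close to zero is estimated as exactly zero (for instance, `posterior_median(0.5, p=0.1, sigma=1.0, tau=3.0)` falls in the zero branch), whereas a large observation is shrunk toward zero but remains nonzero.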

A chapter of the dissertation also addresses high-density genotyping microarrays. These arrays can simultaneously genotype hundreds of thousands of single nucleotide polymorphisms (SNPs) in the human genome. They are therefore commonly used in association and linkage studies, and are increasingly used to infer regions of genomic loss as well. The chapter first describes a nonparametric data-normalization technique for comparing intensity levels between arrays. Next, an empirical Bayes analysis of SNP-specific summed intensities allows intensities to be compared between SNPs. A set of genotyping arrays that includes known genomic loss in regions of chromosome 22 is analyzed, revealing that the Bayesian modeling improves sensitivity to reduced fluorescence.