Statistical criteria and procedures for controlling false positives with applications to biological and biomedical data analysis

Yiling Chen
PhD, 2021
Li, Jingyi
The need to control rates of false positives is prevalent in biological and biomedical data analysis. Two statistical conceptualizations of rates of false positives, type I error and false discovery rate (FDR), are widely used in these analyses. For example, in automated cancer detection from transcriptomics data, practitioners often need to control the type I error (the conditional probability of misclassifying a cancer patient as healthy) because such an error could lead to severe consequences such as delayed treatment or even loss of life. In contrast, the opposite error, misclassifying a healthy individual as having cancer, leads to less serious consequences. Another example is the widely used FDR control in multiple-testing problems such as identifying differentially expressed genes from RNA sequencing data. Because discoveries are often subject to laborious and expensive downstream validation, researchers want to control the FDR (the expected proportion of false discoveries among all discoveries) to save validation costs; in comparison, missing true discoveries is often less concerning. Despite existing efforts, controlling rates of false positives remains challenging. This dissertation aims to address these challenges in three projects.

My first project involves prioritizing the type I error in feature selection for binary classification problems. Binary classification problems are prevalent in biomedical data analysis; one example is the aforementioned automated cancer detection, where the response is binary: with or without cancer. In such cases, type I error control, i.e., false-positive rate control, is critical for keeping the chance of missing cancer patients below a reasonable level, a consideration neglected by existing model selection methods. In Chapter 2, we develop a novel model selection criterion, the Neyman-Pearson Criterion (NPC), that prioritizes the type I error in binary classification. The theoretical model selection property of NPC is studied for non-parametric plug-in methods.
A real data study on breast cancer detection using DNA methylation data suggests that NPC is a practical criterion that can reveal novel clinical biomarkers for cancer diagnosis with both high sensitivity and specificity.

My second project focuses on FDR control in high-throughput data analysis with two conditions. High-throughput data analysis commonly involves the identification of “interesting” features (e.g., genes, genomic regions, and proteins) whose values differ between two conditions. To ensure the reliability of such analysis, existing bioinformatics tools primarily use the FDR as the criterion, the control of which typically requires p-values. However, obtaining valid p-values is often hard or even impossible because of limited sample sizes in high-throughput data. In Chapter 3, we propose Clipper, a p-value-free FDR control framework for high-throughput data with two conditions. In comprehensive simulation and real-data benchmarking, Clipper outperforms existing generic FDR control methods and specific bioinformatics tools designed for various tasks, including differentially expressed gene identification from RNA-seq data, differentially interacting chromatin region identification from Hi-C data, and peptide identification from mass spectrometry data.

My third project focuses on FDR control in aggregating peptides identified by multiple database search algorithms from mass spectrometry data. State-of-the-art shotgun proteomics analysis relies on database search algorithms to identify peptides and proteins in biological samples. A key step in this process is peptide identification, which is done by matching mass spectra, which encode the sequence information of peptides, against protein databases that contain known protein sequences. Numerous database search algorithms have been developed over time, each with distinct advantages in peptide identification.
To leverage these complementary strengths, in Chapter 4 we develop a statistical framework, Aggregation of Peptide Identification Results (APIR), for combining peptide matching results from multiple database search algorithms with FDR control. Using benchmark data, we demonstrate that APIR achieves higher detection sensitivity than individual search algorithms while maintaining FDR control. Extensive real data studies show that APIR can uncover additional biologically meaningful proteins and post-translational modifications that would otherwise go undetected by individual search algorithms.
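
As a point of reference for the p-value-based FDR control that the abstract contrasts with Clipper, the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure. It is not the method of any chapter of this dissertation; the function names and the example p-values are hypothetical and chosen only for illustration.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Standard Benjamini-Hochberg step-up procedure (illustrative only).

    Returns a list of booleans marking which hypotheses are called
    discoveries at target FDR level q. Assumes valid p-values as input.
    """
    m = len(pvalues)
    # Indices of the hypotheses sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= q * k / m.
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            k = rank
    # Reject the k hypotheses with the smallest p-values.
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected


def false_discovery_proportion(rejected, is_null):
    """Proportion of false discoveries among discoveries (0 if none).

    The FDR is the expectation of this quantity; is_null is only
    known in simulations, never in real data.
    """
    n_discoveries = sum(rejected)
    n_false = sum(r and h for r, h in zip(rejected, is_null))
    return n_false / max(n_discoveries, 1)


if __name__ == "__main__":
    # Hypothetical p-values for ten features.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042,
             0.060, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(pvals, q=0.05))
```

The procedure depends entirely on having valid p-values, which is the requirement the abstract describes as hard to meet with limited sample sizes.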