Statistical methods for analyzing mRNA isoform variation in large-scale RNA-seq data

Levon Demirdjian
PhD, 2018
Wu, Yingnian
Alternative splicing (AS) is a major source of cellular and functional complexity in the eukaryotic transcriptome and plays a critical role in many developmental processes and diseases. Variations in AS are an important factor in disease-causing mutations, and it is hypothesized that over half of all known disease-causing mutations affect splicing patterns. Next-generation RNA sequencing (RNA-seq) technology has enabled the accumulation of large-scale sequencing data from diverse human tissues and populations and has provided an important resource for discovering variations in AS, yet the size and complexity of large-scale RNA-seq datasets continue to pose significant data analysis challenges to researchers. In this work, we propose new statistical methodologies that more effectively leverage complex RNA-seq data structures for studying AS.
In the first part of this work, we propose a sensitive and robust methodology called PAIRADISE for detecting genetic and allelic variation of alternative splicing in population-scale transcriptome datasets. PAIRADISE uses a novel statistical framework to detect allele-specific alternative splicing (ASAS) from population-scale RNA-seq data. A key feature of PAIRADISE is a statistical model that aggregates ASAS signals across multiple replicates of a given individual or multiple individuals in a population. PAIRADISE consistently outperforms alternative statistical models in simulation studies, and boosts the power of ASAS detection when applied to replicate or population-scale RNA-seq data.
Next, we introduce the rMATS-Iso statistical framework for quantifying AS in modules with complex patterns of AS using replicate RNA-seq data. Importantly, rMATS-Iso leverages an EM algorithm to disambiguate short RNA-seq reads that may be consistent with multiple mRNA isoforms. As a result, rMATS-Iso can accommodate complex patterns of AS within a splicing module, where transcripts can be defined by any combination of exons, splice site choices, and similar features. In addition, rMATS-Iso uses a likelihood ratio test to detect differential splicing between sample groups, and quantifies the extent to which each individual isoform contributes to the overall difference.
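To illustrate the general idea of using EM to resolve reads that are consistent with more than one isoform, the following is a minimal sketch and not the rMATS-Iso model itself. It assumes a hypothetical 0/1 compatibility matrix `compat` (one row per read, one column per isoform), ignores isoform effective lengths and replicate structure, and estimates only the isoform proportions within a single sample. The function name and parameters are illustrative.

```python
def em_isoform_proportions(compat, n_iter=200, tol=1e-10):
    """Estimate isoform proportions from read-isoform compatibility.

    compat[r][k] is 1 if read r is consistent with isoform k, else 0.
    This is a simplified mixture model (an assumption for illustration):
    each read comes from isoform k with probability theta[k].
    """
    n_iso = len(compat[0])
    theta = [1.0 / n_iso] * n_iso  # start from uniform proportions

    for _ in range(n_iter):
        # E-step: split each read fractionally among its compatible
        # isoforms, in proportion to the current theta.
        counts = [0.0] * n_iso
        for row in compat:
            weights = [c * t for c, t in zip(row, theta)]
            total = sum(weights)
            if total == 0:
                continue  # read is compatible with no isoform; skip it
            for k, w in enumerate(weights):
                counts[k] += w / total

        n_assigned = sum(counts)
        if n_assigned == 0:
            raise ValueError("no read is compatible with any isoform")

        # M-step: new proportions are the normalized expected counts.
        new_theta = [c / n_assigned for c in counts]

        converged = max(abs(a - b) for a, b in zip(theta, new_theta)) < tol
        theta = new_theta
        if converged:
            break

    return theta


# Toy data: 3 reads unique to isoform A, 1 unique to isoform B,
# and 4 ambiguous reads consistent with both.
reads = [[1, 0]] * 3 + [[0, 1]] * 1 + [[1, 1]] * 4
theta = em_isoform_proportions(reads)
```

In this toy example the ambiguous reads carry no information about which isoform they came from, so the estimate is driven by the 3:1 ratio of uniquely assigned reads and the ambiguous reads are apportioned accordingly.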
In conjunction with the continued development of next-generation sequencing methods, we anticipate that both PAIRADISE and rMATS-Iso will have broad utility in elucidating the landscape of alternative splicing variation, as well as other forms of mRNA isoform variation, in human populations.