Homals-clustering Analysis and its Applications in Computational Sequence Analysis
Advisors: Jan de Leeuw and Ker-Chau Li
Searching for conserved sequence patterns of known cis-regulatory elements not only provides an initial step towards elucidating their structures and mechanisms, but also greatly helps the prediction of novel regulatory elements. However, because of special properties of those elements, such as the lack of primary sequence similarity or the allowance of variations in nucleotide bases, the TRANSFAC and miRBase databases still rely on expert systems to perform the search manually. To tackle the challenge of automatically searching for conserved sequence patterns, we developed a novel method, Homals-clustering analysis, which clusters sequences based on their sharing of grouped N-mers (representing conserved patterns). Homals-clustering analysis consolidates a decryption of N-mers, homogeneity analysis, and newly designed jigsaw-puzzle clustering and multi-layer clustering strategies into a unified framework. We evaluated its performance on yeast data from TRANSFAC and on human and mouse data from miRBase by comparison with several related studies and methods. The results showed that our method detects conserved patterns with high sensitivity and robustness. Most importantly, since it requires no expert intervention, it enables users without expert knowledge to exploit those databases on an up-to-date basis.
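
To make the idea of clustering sequences by shared N-mers concrete, the following minimal Python sketch builds a binary sequence-by-N-mer indicator matrix, the kind of categorical input that homogeneity analysis generally operates on, and counts the N-mers shared by each pair of sequences. It is an illustration under stated assumptions, not the method of this work: the function names (`nmers`, `indicator_matrix`, `shared_counts`), the toy sequences, and the choice N = 4 are all hypothetical, and the sketch does not implement the homogeneity analysis, jigsaw-puzzle clustering, or multi-layer clustering steps.

```python
from itertools import combinations


def nmers(seq, n):
    """Return the set of all length-n substrings (N-mers) of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def indicator_matrix(seqs, n):
    """Build a binary matrix: one row per sequence, one column per N-mer.

    An entry is 1 if the sequence contains that N-mer, otherwise 0.
    """
    per_seq = [nmers(s, n) for s in seqs]
    vocab = sorted(set().union(*per_seq))
    rows = [[1 if k in present else 0 for k in vocab] for present in per_seq]
    return vocab, rows


def shared_counts(seqs, n):
    """Count the distinct N-mers shared by each pair of sequences."""
    per_seq = [nmers(s, n) for s in seqs]
    return {
        (i, j): len(per_seq[i] & per_seq[j])
        for i, j in combinations(range(len(seqs)), 2)
    }


# Toy input (hypothetical sequences, not taken from TRANSFAC or miRBase).
seqs = ["ACGTACGT", "TTACGTAA", "GGGGCCCC"]
vocab, rows = indicator_matrix(seqs, 4)
shared = shared_counts(seqs, 4)
```

In this toy input the first two sequences share several 4-mers while the third shares none with either, so a similarity measure built on such counts would place the first two together. Exact substring matching is a simplification: it does not capture the base variations mentioned above, which is part of what grouping N-mers is meant to address.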