Imbalanced Binary Classification for Detecting Transcription Factor Binding Sites in H1 Human Embryonic Stem Cells

Surui Sun
M.S., 2017
Advisor: Jingyi Jessica Li
A binary classification problem is imbalanced if the two classes are not equally represented. In this paper, we investigate the prediction of genome-wide transcription factor (TF) binding sites, which can be formulated as an imbalanced binary classification problem. We apply traditional binary classification methods (Logistic Regression, K-Nearest Neighbors, Random Forest, AdaBoost, Support Vector Machine, Naïve Bayes, and Linear Discriminant Analysis) to this problem. We also combine these methods with two synthetic resampling methods, SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling), to test whether synthetic resampling of the minority class improves classification performance in terms of F1 score, accuracy, precision, recall, area under the precision-recall (PR) curve, geometric mean of positive accuracy and negative accuracy (Gmean), and area under the receiver operating characteristic (ROC) curve. Our results show that, compared with the traditional methods alone, adding SMOTE or ADASYN effectively increases Gmean and recall at the cost of reduced precision. However, SMOTE and ADASYN yield no obvious improvement in the area under the ROC curve or the area under the PR curve.
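To make the threshold-based metrics named above concrete, the following is a minimal sketch of how they can be computed from predicted labels. It is not the code used in this work: the function name `imbalance_metrics` and the toy labels are hypothetical, chosen only to illustrate why accuracy alone can mislead when the positive class is rare, and why Gmean and recall can rise while precision falls.

```python
import math

def imbalance_metrics(y_true, y_pred):
    """Threshold-based metrics for binary labels, with 1 as the minority (positive) class.

    Hypothetical helper for illustration; ratios with an empty denominator are reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # positive accuracy
    specificity = tn / (tn + fp) if tn + fp else 0.0   # negative accuracy
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    gmean = math.sqrt(recall * specificity)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "gmean": gmean,
    }

# Assumed toy data: 10 positives among 100 examples.
y_true = [1] * 10 + [0] * 90

# A classifier that always predicts the majority class.
always_negative = [0] * 100

# A classifier that recovers 8 of 10 positives but raises 18 false alarms.
recall_oriented = [1] * 8 + [0] * 2 + [1] * 18 + [0] * 72

for name, y_pred in [("always_negative", always_negative),
                     ("recall_oriented", recall_oriented)]:
    scores = imbalance_metrics(y_true, y_pred)
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

On this toy data the always-negative predictor reaches 0.90 accuracy with a Gmean of 0, while the recall-oriented predictor has lower accuracy (0.80) and modest precision but a Gmean of 0.80. The area under the ROC and PR curves is not covered here, since those require continuous scores rather than hard labels.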