SMOTE Variants for Imbalanced Binary Classification: Heart Disease Prediction
Xiaoru Zheng
MS, 2020
Li, Jingyi
Class imbalance is prevalent in many medical diagnosis problems, where the number of patients suffering from a particular disease is much smaller than the number of healthy people in the population. A similar phenomenon occurs in credit card fraud detection and spam email filtering. Approaches to dealing with imbalanced data can be divided into data-level and algorithm-level methods. At the data level, resampling techniques such as oversampling and undersampling can produce a more balanced class distribution. At the algorithm level, cost-sensitive learning incorporates the unequal costs of different misclassification errors into the training process, which can improve overall prediction performance. In this study, four variants of the Synthetic Minority Oversampling Technique (SMOTE) are implemented for comparison: regular SMOTE, Borderline-SMOTE, SVM-SMOTE, and KMeans-SMOTE. Each variant is combined with four classification algorithms under the classical and Neyman-Pearson (NP) paradigms to build predictive models on the heart disease data, and the performance of these models is compared. Our results show that SVM-SMOTE and Borderline-SMOTE outperform the other SMOTE variants, and that NP classification is more effective at controlling the type I error.
2020
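The oversampling idea behind regular SMOTE, as named in the abstract, can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the function name `smote_sketch`, its parameters, and the toy data are hypothetical, and only the basic interpolation step is shown (no Borderline, SVM, or KMeans variant). Each synthetic point is drawn on the line segment between a randomly chosen minority sample and one of its k nearest minority neighbours.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic minority samples (hypothetical helper).

    minority: list of feature vectors from the minority class.
    k: number of nearest minority neighbours considered per sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest neighbours among the other minority samples.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: squared_distance(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Interpolate: base + gap * (neighbour - base), with gap in [0, 1).
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy usage: four minority samples with two features each (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, it stays inside the range already spanned by the minority class; the Borderline, SVM, and KMeans variants differ mainly in which base points and regions they choose to oversample.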