Combined approach for predicting sparse variables such as tips ratio and daily precipitation

Jinshu Li
MS, 2019
Wu, Yingnian
A sparse variable is a variable whose values are mostly zero. Because of this sparsity, neither a pure (i.e., single) regression method nor a pure classification method usually predicts a sparse variable satisfactorily. To resolve this difficulty, this thesis proposes a framework that combines a regression model with a classification model. Two types of this combined framework are discussed, and their differences are illustrated. Two sparse variables are selected as case studies: taxi tips ratio (i.e., tip amount divided by total fare) and daily precipitation volume (i.e., total rainfall in one day). The author first employs Lasso regression to select relevant features for each sparse variable, with the best Lasso parameter determined by cross-validation (CV). Second, the author selects logistic regression and AdaBoost as the classification methods and XGBoost as the regression method, with hyperparameters determined by tuning. The author then compares the prediction results of the pure classification method, the pure regression method, and the combined method, using root mean square error (RMSE) as the metric. The results show that the pure regression method yields the lowest RMSE for both variables; however, its predictions do not satisfy the sparsity requirement. By contrast, the combined method achieves an RMSE close to that of the pure regression method while also producing sparse predictions, which makes it an effective way to predict sparse variables such as taxi tips ratio and daily precipitation.
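The idea of combining a classifier with a regressor can be illustrated with a minimal sketch. The abstract does not specify how the two models are combined or what the two framework types are, so the hard-gating rule below is only one possible reading, not the thesis's method. The function names, the 0.5 threshold, and the toy numbers are all hypothetical; the inputs stand in for the outputs of an already fitted classifier (probability that the target is nonzero) and an already fitted regressor.

```python
from math import sqrt


def combine(p_nonzero, reg_pred, threshold=0.5):
    """Hard gate (an assumed rule): keep the regressor's value only where the
    classifier predicts a nonzero target; otherwise output exactly zero.
    Negative regressor outputs are clipped to zero, since tip ratios and
    precipitation volumes cannot be negative."""
    return [max(r, 0.0) if p >= threshold else 0.0
            for p, r in zip(p_nonzero, reg_pred)]


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def zero_fraction(values):
    """Share of entries that are exactly zero (a simple sparsity measure)."""
    return sum(1 for v in values if v == 0.0) / len(values)


# Made-up toy data for illustration only; not taken from the thesis.
y_true = [0.0, 0.0, 0.2, 0.0, 0.15]          # mostly-zero target
p_nonzero = [0.1, 0.3, 0.9, 0.2, 0.7]        # hypothetical classifier output
reg_pred = [0.03, 0.05, 0.18, 0.02, 0.12]    # hypothetical regressor output

combined = combine(p_nonzero, reg_pred)

print("zero fraction, regression only:", zero_fraction(reg_pred))
print("zero fraction, combined:       ", zero_fraction(combined))
print("RMSE, regression only:", rmse(y_true, reg_pred))
print("RMSE, combined:       ", rmse(y_true, combined))
```

In this sketch the regressor alone never outputs an exact zero, whereas the gated prediction does, which is the sparsity property the abstract refers to. The toy RMSE values say nothing about the thesis's reported results.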