Analysis and Modeling for Ganoderma Data
Zheqi Wu
MS, 2019
Xu, Hongquan
Learning from small data sets is fundamentally challenging. In this paper, we analyze data on Ganoderma, also called Lingzhi; the dataset has very few observations but a large number of chemical substances. The challenge is twofold: to build suitable models that fit the small data, and to perform feature extraction that identifies the critical subgroup of chemical substances that are effective for cancer treatment. First, we preprocess the data to adjust the response variable, eliminate outliers, and deal with the multicollinearity problem. Second, we experiment with four datasets using both linear and non-linear models. The results show that the XGBoost model fits the data best. In addition, the Principal Component Analysis and Partial Least Squares transformation techniques suit our dimension reduction purpose, reducing the features from 24 dimensions to 5. In the discussion, we analyze feature importance in the best-performing model in relation to the original features.
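The abstract does not say how multicollinearity was diagnosed, so the following is only a minimal sketch of one common screening step: flagging feature pairs whose absolute Pearson correlation exceeds a threshold. The function names, the 0.9 threshold, and the toy values are all assumptions made for illustration; none of them come from the thesis or its data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column; report 0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold.

    `columns` maps a feature name to its list of observed values.
    The threshold is a hypothetical choice, not one taken from the thesis.
    """
    names = list(columns)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(columns[names[i]], columns[names[j]])
            if abs(r) >= threshold:
                flagged.append((names[i], names[j], r))
    return flagged


# Made-up toy values standing in for chemical-substance measurements.
toy = {
    "compound_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "compound_b": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly 2 * compound_a
    "compound_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = correlated_pairs(toy)
for name_a, name_b, r in flagged:
    print(f"{name_a} ~ {name_b}: r = {r:.3f}")
```

With only a handful of observations, pairwise correlations are noisy, so a screen like this would at most suggest candidates for removal or for a transformation such as the PCA or PLS projections mentioned in the abstract.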
2019