Semi-Greedy Construction of Oblique-Split Decision Trees

Matthew Rudolph Larriva
MAS, 2019
Frederic Paik Schoenberg
Classification and Regression Trees (CART) are a method of structured prediction widely used in machine learning. Favored for their robustness to non-linear relationships and interpretability (James, Witten, Hastie, & Tibshirani, 2013), they have seen continued application since their broad introduction in 1984 (Breiman, Friedman, Olshen, & Stone, 1984). The shortcomings of a single-tree model, specifically overfitting and lack of competitiveness compared to other machine learning methods, have been overcome through advances in bagging, pruning, and boosting (James et al., 2013). Decision trees are fitted in a top-down greedy fashion because non-greedy fitting is computationally intractable. While this leads to efficient computation, using top-down greedy estimates can produce suboptimal models. To overcome this suboptimality I propose a semi-greedy method of decision tree construction. First, I reformulate the problem based on the work of Norouzi et. al. (2015)(Norouzi, Collins, Johnson, Fleet, & Kohli, 2015), and use an algorithm which optimizes the split functions at all levels of the tree jointly, based on a global objective. This algorithm is parameterized by a single scalar for which I propose a bound. Second, I propose a dimensionality reduction step for cases when the semi-greedy algorithm becomes intractable. In the five datasets tested (Breast Cancer Wisconsin, Banknotes Authentication, Haberman Survival Data, Blood Transfusion Data, Fertility Data), the algorithm produces higher accuracy than two existing greedy tree packages in R (rpart, tree). And, it is computationally tractable for low-width datasets (approximately 8 features or fewer). For higher order problems, I propose a preprocessing step which reduces the higher-width data to a more tractable subset. I show that a dimensionally reduced space, processed using the semi-greedy algorithm, outperforms the greedy methods (rpart, trees) trained with both full and reduced data sets.
2019