Rare Event Prediction with Mortgage Lead Data

Kaleb Julian Erickson
MAS, 2019
Guido Montufar Cuartas
LeadPoint Inc. is a digital marketplace for refinance mortgage leads. Lenders purchase these leads and then contact each lead in an attempt to refinance that lead's mortgage. LeadPoint is interested in a predictive model that identifies leads with a higher propensity to become funded loans. This paper describes the process of using LeadPoint's lead data to build a model that predicts which leads are most likely to fund. The lead data is extremely imbalanced: only 0.55% of the leads are listed as funded loans. This scarcity qualifies a funded loan as a rare event. Because the vast majority of leads never fund, it is extremely difficult to predict accurately how any given lead will turn out. This paper considers three methods for dealing with rare event data and compares them to a baseline logistic regression model: Rare Event Logistic Regression, Gradient Boosted Decision Trees (specifically the CatBoost implementation), and a data augmentation method called Synthetic Minority Oversampling Technique (SMOTE). The rare event logistic regression model trained on the original data performed best, although only slightly better than the baseline logistic regression model. This rare event logistic regression model is able to identify a subset of leads with a fund rate of 0.90%. Although this fund rate is only 67% higher than that of the original dataset, it is a strong result: it represents expanded business opportunities for LeadPoint as well as a potential 19% immediate increase in company revenue.
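As a rough illustration of how a subset fund rate and its improvement over the base rate might be computed, the following Python sketch ranks leads by a model score, keeps the top-scored fraction, and compares the fund rate in that subset to the overall fund rate. The function names, the synthetic scores and labels, and the 25% cutoff are assumptions made for illustration; none of them come from the LeadPoint data or from the models described above.

```python
import random


def fund_rate(labels):
    """Fraction of leads labeled as funded (1 = funded, 0 = not funded)."""
    return sum(labels) / len(labels)


def top_fraction_lift(scores, labels, fraction):
    """Return (fund rate in the top-scored fraction, ratio to the overall fund rate)."""
    # Rank leads from highest to lowest model score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # Keep at least one lead so the subset rate is always defined.
    cutoff = max(1, int(len(ranked) * fraction))
    selected = [label for _, label in ranked[:cutoff]]
    subset_rate = fund_rate(selected)
    return subset_rate, subset_rate / fund_rate(labels)


# Synthetic illustration only: a base rate near 0.55% and a weakly informative score.
random.seed(0)
labels = [1 if random.random() < 0.0055 else 0 for _ in range(200_000)]
scores = [random.random() + 0.3 * label for label in labels]

subset_rate, lift = top_fraction_lift(scores, labels, fraction=0.25)
print(f"overall fund rate: {fund_rate(labels):.4%}")
print(f"top-25% fund rate: {subset_rate:.4%} (ratio to overall: {lift:.2f})")
```

A ratio above 1 means the selected subset funds more often than the full lead pool, which is the sense in which a scored subset can be more valuable to a lender than an unfiltered purchase.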