This project was made for the 2022 Rice University Datathon, a 24-hour data science challenge for teams of up to four. My team worked with a dataset of loan defaults: we explored the data, created visualizations, and built models to predict loan default and interest rate.
In this project, we worked through the full data science pipeline, covering the following steps:
- Cleaning data
- EDA
- Creating visualizations
- Feature engineering
- Modeling
- Model Evaluation
The skills and models that we used in this project were:
- Logistic Regression
- SMOTE oversampling
- Linear Regression
- Random Forest Regression
In the cleaning section, we had to deal with missing values in 13 columns. To handle them, we worked out what each affected column represented and filled in an appropriate value. In many cases, a missing value stood in for 0 or for a specific category (like 'Accepted' for denial_reason).
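
As a rough illustration of this kind of fill, here is a minimal pandas sketch. The data is made up, and `loan_amount` and `num_delinquencies` are hypothetical column names; only `denial_reason` and its 'Accepted' fill value come from the description above.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only. `num_delinquencies` and `loan_amount`
# are hypothetical columns, not taken from the Datathon dataset.
loans = pd.DataFrame({
    "loan_amount": [10000, 25000, 5000, 12000],
    "num_delinquencies": [np.nan, 2, np.nan, 1],
    "denial_reason": [np.nan, "Low income", np.nan, "Credit history"],
})

# A missing count is treated as 0; a missing denial reason is taken
# to mean the loan was accepted.
fill_values = {"num_delinquencies": 0, "denial_reason": "Accepted"}
cleaned = loans.fillna(value=fill_values)

# Count the remaining missing values per column to confirm the fill worked.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Passing a dictionary to `fillna` keeps the fill rule for each column in one place, which helps when many columns need different defaults.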
Heatmap of missing values after Cleaning
For the EDA portion, we focused on understanding the data in depth and visualizing it in creative and meaningful ways. We started with the broadest visualizations and then narrowed in on specific relationships. Below are the visualizations we created:
We started by looking broadly at all correlations
We then got more specific and focused on fewer relationships
We then took a sample of the entire data set and looked at individual relationships
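
The broad-to-narrow progression described above could be sketched as follows. The data here is synthetic and the column names (`income`, `loan_amount`, `interest_rate`) are assumptions for illustration, not the actual columns of the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the column names are hypothetical.
rng = np.random.default_rng(0)
n = 1000
income = rng.normal(60000, 15000, n)
df = pd.DataFrame({
    "income": income,
    "loan_amount": income * 0.3 + rng.normal(0, 3000, n),
    "interest_rate": 12 - income / 20000 + rng.normal(0, 0.5, n),
})

# Step 1 (broad): every pairwise correlation in one matrix.
corr = df.corr()

# Step 2 (narrower): rank the other columns by the strength of their
# correlation with a single column of interest.
strongest = (
    corr["interest_rate"]
    .drop("interest_rate")
    .abs()
    .sort_values(ascending=False)
)

# Step 3 (individual relationships): take a reproducible random sample
# so that scatter plots of single pairs stay readable.
sample = df.sample(n=200, random_state=0)

print(corr.round(2))
print(strongest.round(2))
```

The correlation matrix is what a heatmap would be drawn from, and the sample is what a scatter plot of one pair of columns would use.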
To model this data set, we first focused on predicting the acceptance of the loan.
We used logistic regression for this task, producing the following results:
Because the categories were imbalanced, we used SMOTE to oversample the minority class and even out the class counts.
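
To show the idea behind SMOTE, here is a simplified NumPy sketch: each synthetic row is placed at a random point on the line between a minority sample and one of its nearest minority neighbors. This is not the code we ran; `smote_like_oversample` is a hypothetical helper written for illustration, and the data is synthetic. In practice, the imbalanced-learn library provides a `SMOTE` class for this.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Return n_new synthetic rows interpolated between minority neighbors.

    Simplified illustration of the SMOTE idea; needs at least 2 minority rows.
    """
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances among the minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest minority neighbors of each minority sample.
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbors for each new row.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]

    # Interpolate a random fraction of the way from base to neighbor.
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Synthetic example: 90 majority rows and 10 minority rows.
rng = np.random.default_rng(1)
X_major = rng.normal(0, 1, (90, 2))
X_minor = rng.normal(3, 1, (10, 2))

synthetic = smote_like_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_balanced = np.vstack([X_major, X_minor, synthetic])
y_balanced = np.array([0] * len(X_major) + [1] * (len(X_minor) + len(synthetic)))

print(np.bincount(y_balanced))
```

Oversampling is normally applied to the training split only, so that the test set keeps the original class balance.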
After this, we moved on to predicting interest rate, starting with linear regression. This model achieved a testing error of 0.97.
We finished our project by trying random forest regression to predict the interest rate. However, it did not produce better results than our linear model.
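
A comparison like this one can be set up in scikit-learn along the following lines. This is a sketch on synthetic data, not our actual pipeline: the features, the train/test split, and the choice of RMSE as the error metric are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the loan features and
# interest rate; the target is linear in the features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 10 + 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, preds)))

print(rmse)
```

Scoring both models on the same held-out split with the same metric is what makes the comparison fair. When the target depends mostly linearly on the features, a random forest may not beat a linear model.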









