calvinaberg1/BeginnerTrackDatathon

LoanDefaultRiceDatathon

Overview

This project was made for the 2022 Rice University Datathon, a 24-hour data science challenge for teams of up to four. My team and I worked with a dataset of loan defaults, exploring the data (EDA), creating visuals, and predicting both loan default and interest rate.

In this project, we worked through the full data science pipeline in the following steps:

  • Cleaning data
  • EDA
  • Creating visualizations
  • Feature engineering
  • Modeling
  • Model Evaluation

The skills and models that we used in this project were:

  1. Logistic Regression
  2. SMOTE oversampling
  3. Linear Regression
  4. Random Forest Regression

Cleaning

In the cleaning section, we had to deal with missing values in 13 columns. To handle this, we worked to understand what a missing value meant in each column and filled it with an appropriate value. In many cases, a missing value stood in for 0 or for a specific category (like 'Accepted' for denial_reason).
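The fill strategy described above could be sketched in pandas as follows. This is a minimal illustration on a toy frame, not the team's actual code: apart from denial_reason, the column names and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame; apart from denial_reason, the column names are hypothetical.
loans = pd.DataFrame({
    "loan_amount": [5000, 12000, 7500, 20000],
    "num_late_payments": [np.nan, 2, np.nan, 1],   # assumed: missing means "none"
    "denial_reason": [np.nan, "Low income", np.nan, "High debt"],
})

# Map each column to the value that a missing entry is assumed to stand for.
fill_values = {"num_late_payments": 0, "denial_reason": "Accepted"}
cleaned = loans.fillna(value=fill_values)

# Count the remaining missing values per column to confirm the cleaning worked.
remaining_missing = cleaned.isna().sum()
print(remaining_missing)
```

A boolean frame such as `cleaned.isna()` is also a common input for a missing-value heatmap, for example with `seaborn.heatmap`.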

Heatmap of missing values after Cleaning

image

EDA

For the EDA portion, we focused on understanding and visualizing the data in creative and meaningful ways. We started with the broadest visualizations and then narrowed in on specific relationships. Below are the visualizations we created:

We started by looking broadly at all correlations.

image
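A broad correlation overview of this kind can be computed with `DataFrame.corr()`. The sketch below uses synthetic data with hypothetical column names (income, loan_amount, interest_rate) and then ranks the column pairs by absolute correlation, which is one way to decide which relationships to examine more closely.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in data; the column names and relationships are hypothetical.
income = rng.normal(60_000, 15_000, n)
loan_amount = 0.2 * income + rng.normal(0, 2_000, n)
interest_rate = 15 - 0.0001 * income + rng.normal(0, 1, n)
df = pd.DataFrame({
    "income": income,
    "loan_amount": loan_amount,
    "interest_rate": interest_rate,
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank the pairs from strongest to weakest relationship.
pairs = pairs.sort_values(key=np.abs, ascending=False)
print(pairs)
```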

We then got more specific and focused on fewer relationships.

image

We then took a sample of the entire data set and looked at individual relationships.

image

image

image

image

image

Modeling

To model this data set, we first focused on predicting the acceptance of the loan.

We used logistic regression for this task, producing the following results:

image
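For readers who want to reproduce this kind of baseline, here is a minimal scikit-learn sketch. It runs on a synthetic, imbalanced data set from `make_classification` rather than the loan data, and the split sizes and hyperparameters are assumptions, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (roughly 90% class 0).
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0,
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```

With imbalanced classes, per-class precision and recall are usually more informative than overall accuracy, because a model can score high accuracy by mostly predicting the majority class.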

Because the classes were imbalanced, we used SMOTE to oversample the minority class.

image
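To illustrate what SMOTE does, the sketch below implements its core idea in NumPy: each synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration on synthetic data, not the implementation used in the project; the function name `smote_sketch` is hypothetical. In practice, the `SMOTE` class in `imblearn.over_sampling` (from the imbalanced-learn package) is the usual choice.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)               # random base samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]    # one neighbour for each
    gap = rng.random((n_new, 1))                                 # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Synthetic two-class data: 90 majority samples and 10 minority samples.
rng = np.random.default_rng(1)
X_majority = rng.normal(0.0, 1.0, size=(90, 2))
X_minority = rng.normal(3.0, 1.0, size=(10, 2))

# Generate enough synthetic samples to match the majority class size.
synthetic = smote_sketch(X_minority, n_new=len(X_majority) - len(X_minority))
X_minority_balanced = np.vstack([X_minority, synthetic])
print(X_majority.shape, X_minority_balanced.shape)  # (90, 2) (90, 2)
```

Oversampling should be applied to the training split only, so that the test set still reflects the real class balance.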

After this, we moved to predicting interest rate, first using linear regression. We achieved a testing error of 0.97 using this model.

We finished our project by attempting to use random forest regression to predict the interest rate. However, it did not achieve better results than our linear model.
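A comparison between the two regressors can be set up as in the sketch below. It uses synthetic data with a mostly linear target, and it assumes root mean squared error (RMSE) on a held-out split as the error metric; the metric behind the error reported above is not stated, so the numbers here are not comparable to it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a linear signal plus unit-variance noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 10 + X @ np.array([1.5, -2.0, 0.5, 0.0, 1.0]) + rng.normal(0, 1.0, 1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0,
)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the same test split.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))

print(rmse)
```

When the underlying relationship is close to linear, as in this synthetic example, a linear model can match or beat a random forest, which is consistent with the outcome described above.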
