Inspiration

As college students, we rely heavily on ride-sharing services like Uber, and we wondered what factors actually influence a potential candidate to become an Uber driver. That curiosity led us to explore and analyze the StrataScratch dataset for factors that influence whether Uber signups go on to take a first drive. Uber's business depends heavily on having active, reliable drivers, so identifying predictors of a first drive offers key insight for recruiting and for anticipating driver behavior.

What it does

Our project uses a Random Forest classification model to predict whether a new Uber signup will complete their first trip. The model draws on several features, most importantly the number of days between a driver's signup date and the date they added their vehicle to Uber's system. By learning from these features across many signups, the model captures behavioral patterns and estimates the likelihood that a driver will become active, which helps Uber recognize high-potential candidates.
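The core idea can be sketched as follows. This is a minimal illustration, not our actual pipeline: the column names (`signup_date`, `vehicle_added_date`, `took_first_trip`) and the toy rows are invented for the example, and the real StrataScratch schema differs.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the signup data; real column names and rows differ.
df = pd.DataFrame({
    "signup_date": pd.to_datetime(
        ["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04"]),
    "vehicle_added_date": pd.to_datetime(
        ["2016-01-02", "2016-01-20", "2016-01-04", None]),
    "took_first_trip": [1, 0, 1, 0],
})

# Engineered feature: days between signing up and adding a vehicle.
df["signup_to_vehicle_days"] = (
    df["vehicle_added_date"] - df["signup_date"]).dt.days
# Flag signups who never added a vehicle with a sentinel value.
df["signup_to_vehicle_days"] = df["signup_to_vehicle_days"].fillna(-1)

X = df[["signup_to_vehicle_days"]]
y = df["took_first_trip"]

# Random forest: an ensemble of decision trees voting on the outcome.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
```

On real data the feature matrix would hold several engineered columns, and predictions would be made on a held-out split rather than the training rows.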

How we built it

Our dataset came from the StrataScratch Challenge, but it was unfiltered and not immediately usable. We used Google Colab as our development environment, taking advantage of its collaborative features and integrated resources. We first cleaned the dataset, for example by removing null values that did not meaningfully contribute to whether signups eventually took first drives. We then visualized the data with matplotlib charts and conducted exploratory data analysis (EDA) using correlation heatmaps to discover the most predictive variables. The heatmap helped us decide which features truly corresponded to first drives, and from it we chose the data to use in our model. Finally, we trained a model on the most correlated features to predict which signups would convert to first drives, achieving 75% accuracy, and evaluated it with metrics such as accuracy and a confusion matrix to guide further improvement.
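The correlation-heatmap step of the EDA can be sketched like this. The frame below is synthetic (the feature names and the dependence between `signup_to_vehicle_days` and the target are fabricated purely so a correlation shows up), but the `corr()` + heatmap pattern is what we used.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Synthetic numeric frame standing in for the cleaned dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "signup_to_vehicle_days": rng.integers(0, 30, 200),
    "signup_to_bgc_days": rng.integers(0, 30, 200),
})
# Fabricated target that loosely depends on one feature,
# so the heatmap has something to show.
df["took_first_trip"] = (df["signup_to_vehicle_days"] < 10).astype(int)

corr = df.corr()  # pairwise Pearson correlations between all columns

# Render the correlation matrix as a heatmap.
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.tight_layout()
fig.savefig("heatmap.png")
```

Reading down the target's column of the matrix shows which features correlate most strongly with taking a first drive, which is how we shortlisted model inputs.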

Challenges we ran into

A major challenge we ran into was cleaning our data (i.e., removing inconsistent records and outliers that would skew our model). We also had to determine which features genuinely correlated with first drives and which were logically redundant. Our team experimented with engineering our own features by computing time differences between the dates of signup, background check completion, and vehicle registration. Overall, the hardest part was balancing the dataset for optimal model training while ensuring that valuable data was not discarded.

Accomplishments that we're proud of

We are most proud of how we manipulated and cleaned our data; since this was the first datathon for all of us, we gained a significant amount of valuable skills and insights along the way. The process required us to thoughtfully transform, remove, and add columns in our data frame. By creating derived variables, such as the difference between two pre-existing date columns, we significantly improved our model's predictive power, which its solid 75% accuracy reflects. We also communicated our findings effectively with heatmaps and bar charts that displayed the correlations.
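The evaluation metrics behind that accuracy figure can be illustrated with scikit-learn. The labels below are toy values chosen for the example (they happen to yield 75% accuracy), not our actual predictions or test split.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy true labels vs. model predictions, standing in for a held-out split.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Fraction of signups classified correctly.
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)
```

The confusion matrix matters beyond raw accuracy because it separates the two error types: signups wrongly predicted to take a first drive versus promising signups the model misses.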

What we learned

Throughout the process, we deepened our understanding of machine learning models, specifically decision trees and random forests and why they are so well suited to tabular data like this. We also learned how much data cleaning and feature engineering determine the outcome of real-world machine learning projects. Each team member faced a learning curve in building the model and following a data cleaning process; we all expanded our data science and machine learning knowledge, starting with no previous background and ending with a working project. We ultimately learned that a good machine learning model starts with good data, not necessarily just fancy algorithms.

What's next for StrataScratch Challenge

Moving forward, we would like to include more predictor variables beyond the three main ones we used in our heatmap. We also aim to build a simple interface for visualizing and interacting with our model's predictions, which would make the project more operationally useful. Lastly, we would like to explore different ways Uber could act on our model's output and how it could help Uber's team in the future.
