This project analyzes a dataset of 54,681 Uber driver signups to predict which drivers are likely to complete their first trip (convert). Only 11.22% of driver signups ever complete their first trip, representing a significant opportunity to improve conversion rates and reduce acquisition costs.
By building machine learning models and conducting thorough exploratory analysis, this project:
- Identifies key factors that determine driver conversion
- Builds predictive models to identify high-potential drivers
- Provides actionable recommendations to improve conversion rates
- Estimates the business impact of implementing these recommendations
The dataset contains information about driver signups from January 2016, with data pulled a few months later to include whether drivers completed their first trip. Each record represents a driver signup with the following attributes:
| Column | Description |
|---|---|
| id | Unique driver identifier |
| city_name | City where the driver signed up |
| signup_os | Device OS used for signup ("android", "ios", "website", "other") |
| signup_channel | Acquisition channel ("offline", "paid", "organic", "referral") |
| signup_date | Date of account creation (format: 'YYYY MM DD') |
| bgc_date | Date of background check consent (format: 'YYYY MM DD', 'NA' if not completed) |
| vehicle_added_date | Date when vehicle information was uploaded (format: 'YYYY MM DD', 'NA' if not completed) |
| vehicle_make | Make of vehicle uploaded (e.g., Honda, Ford, Kia) |
| vehicle_model | Model of vehicle uploaded (e.g., Accord, Prius, 350z) |
| vehicle_year | Year the car was made (format: 'YYYY') |
| first_completed_date | Date of first trip as a driver (format: 'YYYY MM DD', 'NA' if no trip completed) |
Note: Missing values are represented as 'NA' strings in the dataset, not actual null values.
This project requires Python 3.7+ and the following packages:
pandas==1.3.4
numpy==1.21.4
matplotlib==3.5.0
seaborn==0.11.2
scikit-learn==1.0.1
You can install all requirements with:
pip install -r requirements.txt
- Data Cleaning and Preprocessing:
  - Handling 'NA' values in date fields
  - Creating binary target variable for first trip completion
  - Converting dates to datetime format
  - Creating flags for completed onboarding steps
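The cleaning steps above could look roughly like the following. This is a minimal sketch, not the project's actual code: the function name `clean_signups`, the `converted` target name, and the `"%Y %m %d"` format string (read off the 'YYYY MM DD' column descriptions) are assumptions, and the three sample rows are made up.

```python
import pandas as pd

DATE_COLS = ["signup_date", "bgc_date", "vehicle_added_date", "first_completed_date"]


def clean_signups(df, date_format="%Y %m %d"):
    """Return a cleaned copy with parsed dates, onboarding flags, and the target."""
    out = df.copy()
    for col in DATE_COLS:
        # Blank out the literal 'NA' strings, then parse what remains.
        # errors="coerce" turns anything unparseable into NaT instead of raising.
        values = out[col].where(out[col] != "NA")
        out[col] = pd.to_datetime(values, format=date_format, errors="coerce")
    # A step counts as completed when its date is present.
    out["bgc_completed"] = out["bgc_date"].notna().astype(int)
    out["vehicle_added"] = out["vehicle_added_date"].notna().astype(int)
    # Binary target (name assumed): 1 if the driver completed a first trip.
    out["converted"] = out["first_completed_date"].notna().astype(int)
    return out


# Made-up rows for illustration only.
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "signup_date": ["2016 01 02", "2016 01 05", "2016 01 09"],
    "bgc_date": ["2016 01 03", "NA", "2016 01 20"],
    "vehicle_added_date": ["2016 01 04", "NA", "NA"],
    "first_completed_date": ["2016 01 10", "NA", "NA"],
})
clean = clean_signups(raw)
print(clean[["id", "bgc_completed", "vehicle_added", "converted"]])
```

If the real file uses a different date layout, only `date_format` should need to change.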
- Exploratory Data Analysis:
  - Analyzing onboarding funnel and dropout rates
  - Calculating conversion rates by various segments
  - Examining time relationships between signup and key milestones
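Segment-level conversion rates of the kind listed above can be computed with a single groupby. The sketch below assumes a cleaned frame with a 0/1 `converted` column (a hypothetical name); the helper `conversion_by` and the toy rows are illustrative only.

```python
import pandas as pd


def conversion_by(df, column, target="converted"):
    """Signup count and conversion rate (in percent) for each value of `column`."""
    summary = df.groupby(column)[target].agg(signups="count", conversion_rate="mean")
    summary["conversion_rate"] = (summary["conversion_rate"] * 100).round(2)
    return summary.sort_values("conversion_rate", ascending=False)


# Made-up rows; the real analysis would pass the full cleaned dataset.
toy = pd.DataFrame({
    "signup_channel": ["Referral", "Referral", "Paid", "Paid", "Paid", "Organic"],
    "converted": [1, 0, 0, 0, 1, 0],
})
by_channel = conversion_by(toy, "signup_channel")
print(by_channel)
```

The same helper works for `signup_os`, `city_name`, or any flag column, which keeps the segment tables consistent with each other.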
- Feature Engineering:
  - Creating completion status flags for key onboarding steps
  - Encoding categorical variables
  - Generating time-based features
  - Creating interaction terms
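As an illustration of the feature types listed above, the sketch below derives one time-based feature, one interaction term, and one-hot encodings. The new column names (`days_to_bgc`, `signup_weekday`, `bgc_and_vehicle`) are hypothetical, and the input is assumed to already have parsed dates and 0/1 `bgc_completed` / `vehicle_added` flags.

```python
import pandas as pd


def engineer_features(df):
    """Add time-based, interaction, and one-hot encoded features to a copy of df."""
    out = df.copy()
    # Time-based: whole days from signup to background check (NaN if no BGC yet).
    out["days_to_bgc"] = (out["bgc_date"] - out["signup_date"]).dt.days
    # Time-based: day of week of the signup (Monday=0 ... Sunday=6).
    out["signup_weekday"] = out["signup_date"].dt.dayofweek
    # Interaction: 1 only when both onboarding steps are done.
    out["bgc_and_vehicle"] = out["bgc_completed"] * out["vehicle_added"]
    # Categorical encoding: one 0/1 column per category value.
    return pd.get_dummies(out, columns=["signup_channel", "signup_os"], dtype=int)


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "signup_date": pd.to_datetime(["2016-01-02", "2016-01-05"]),
    "bgc_date": pd.to_datetime(["2016-01-03", None]),
    "bgc_completed": [1, 0],
    "vehicle_added": [1, 0],
    "signup_channel": ["Referral", "Paid"],
    "signup_os": ["mac", "ios web"],
})
features = engineer_features(sample)
print(features.columns.tolist())
```

Missing `days_to_bgc` values still need a strategy (imputation or a separate indicator) before they reach a model that cannot handle NaN.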
- Model Development:
  - Logistic Regression for interpretability
  - Random Forest for handling non-linear relationships
  - Gradient Boosting for maximizing predictive performance
  - Simple rule-based model as a baseline
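The four model families above can be set up side by side with scikit-learn. The sketch below is only illustrative: the data is synthetic, the hyperparameters are defaults rather than the project's settings, and the rule in `rule_based_predict` (predict conversion only when both onboarding flags are set) is an assumed baseline, since the exact rule is not stated here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def rule_based_predict(X):
    """Assumed baseline: predict 1 only when column 0 (BGC) and column 1 (vehicle) are both 1."""
    return (X[:, 0] * X[:, 1]).astype(int)


# Synthetic data: two binary onboarding flags; only drivers with both can convert.
rng = np.random.default_rng(0)
n = 400
bgc = rng.integers(0, 2, n)
vehicle = rng.integers(0, 2, n)
X = np.column_stack([bgc, vehicle])
y = ((bgc * vehicle == 1) & (rng.random(n) < 0.5)).astype(int)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
# fit() returns the estimator itself, so the dict holds the trained models.
fitted = {name: model.fit(X, y) for name, model in models.items()}

for name, model in fitted.items():
    print(name, "training accuracy:", model.score(X, y))
print("rule baseline training accuracy:", (rule_based_predict(X) == y).mean())
```

Keeping the rule-based baseline in the same comparison makes it easy to see how much the learned models add beyond the onboarding flags alone.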
- Model Evaluation:
  - Train/test split validation
  - Accuracy, precision, recall, and F1 score metrics
  - Confusion matrices
  - Feature importance analysis
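The evaluation steps above map onto a handful of scikit-learn calls. In this sketch the helper name `summarize`, the split parameters, and all labels and predictions are made up for illustration. Stratifying the split is a suggestion given the class imbalance, not something the project is known to do.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split


def summarize(y_true, y_pred):
    """Collect the listed metrics for binary labels (1 = converted)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        # Rows are true classes, columns are predicted classes, ordered [0, 1].
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]),
    }


# Stratifying keeps the share of converters similar in the train and test sets.
X = [[i] for i in range(20)]   # placeholder features
y = [1] * 6 + [0] * 14         # made-up labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Hand-made predictions so the metrics are easy to verify by eye:
# 2 true positives, 1 false negative, 1 false positive, 6 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
report = summarize(y_true, y_pred)
print(report)
```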
- Onboarding completion is critical:
  - Both BGC and vehicle completed: 45.59% conversion
  - Only BGC completed: 1.32% conversion
  - Only vehicle added or neither step: 0% conversion
- Time sensitivity is crucial:
  - Same day BGC completion: 42.10% conversion
  - 1-3 days: 31.10% conversion
  - 4-7 days: 19.04% conversion
  - 8-14 days: 9.67% conversion
  - 15+ days: 2.45% conversion
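The time buckets above can be reproduced with `pd.cut`. The sketch assumes the buckets measure whole days from signup to background check, which is an interpretation of the labels rather than something stated explicitly; the `converted` values in the toy frame are made up.

```python
import numpy as np
import pandas as pd

# Right-closed intervals: (-1, 0], (0, 3], (3, 7], (7, 14], (14, inf)
BINS = [-1, 0, 3, 7, 14, np.inf]
LABELS = ["Same day", "1-3 days", "4-7 days", "8-14 days", "15+ days"]


def bucket_days_to_bgc(days):
    """Map whole-day delays to the bucket labels; NaN or negative delays stay NaN."""
    return pd.cut(days, bins=BINS, labels=LABELS)


days = pd.Series([0, 1, 3, 4, 7, 8, 14, 15, 40])
buckets = bucket_days_to_bgc(days)
print(buckets.astype(str).tolist())

# Conversion rate per bucket on made-up outcomes.
toy = pd.DataFrame({"bgc_bucket": buckets, "converted": [1, 1, 0, 1, 0, 0, 0, 0, 0]})
print(toy.groupby("bgc_bucket", observed=True)["converted"].mean())
```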
- Acquisition channel matters:
  - Referral: 19.89% conversion
  - Organic: 9.01% conversion
  - Paid: 6.19% conversion
- Signup platform influences conversion:
  - Mac: 16.28% conversion
  - Windows: 13.25% conversion
  - iOS web: 13.17% conversion
  - Android web: 9.73% conversion
- Feature importance (Logistic Regression):
  - bgc_completed: 5.55
  - has_vehicle_info: 2.74
  - vehicle_added: 2.17
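If the values above are log-odds coefficients taken from a fitted logistic regression (for example scikit-learn's `model.coef_[0]`), exponentiating them gives odds ratios. That is an assumption: the README does not say how the importances were computed or whether features were scaled. The helper `rank_coefficients` is hypothetical.

```python
import math


def rank_coefficients(coefficients):
    """Sort (feature, coefficient) pairs by absolute size and attach exp(coefficient)."""
    ranked = sorted(coefficients.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [(name, coef, math.exp(coef)) for name, coef in ranked]


# Values copied from the list above; treated here as log-odds coefficients (assumption).
reported = {"has_vehicle_info": 2.74, "bgc_completed": 5.55, "vehicle_added": 2.17}
ranked = rank_coefficients(reported)
for name, coef, odds_ratio in ranked:
    print(f"{name:>18}  coef={coef:5.2f}  odds_ratio~{odds_ratio:8.1f}")
```

Under that reading, a larger coefficient means a larger multiplicative change in the odds of conversion when the flag is set, holding the other features fixed.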
Our best-performing model achieved:
- Accuracy: 92.93%
- Precision: 70.07%
- Recall: 65.13%
- F1 Score: 67.51%
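A recall of 65.13% means the model identifies roughly two-thirds of the drivers who go on to complete their first trip, and a precision of 70.07% means about 7 in 10 of the drivers it flags actually convert. Because only 11.22% of signups convert, a model that always predicts "no conversion" would already reach 88.78% accuracy, so precision, recall, and F1 are more informative than accuracy here.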
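As a quick consistency check, the reported F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The function name `f1_from` is just for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (both in the same units)."""
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall, in percent.
f1 = f1_from(70.07, 65.13)
print(round(f1, 2))
```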
Based on model insights, we recommend:
- Focus on completing both onboarding steps:
  - Simplify vehicle addition process
  - Create clear onboarding progress tracker
  - Target interventions for drivers who completed BGC but not vehicle info
- Optimize for quick background checks:
  - Conversion falls from 31.10% when BGC completes in 1-3 days to 19.04% at 4-7 days and 2.45% at 15+ days
  - Implement urgent follow-up for drivers with delayed BGC
- Expand the referral program:
  - Referrals convert at ~20% vs ~6% for paid channels
  - Reallocate budget from paid to referral incentives
- Implement real-time prediction and intervention:
  - Score drivers daily and flag those at risk
  - Deploy targeted interventions based on model predictions
  - A/B test different incentives for at-risk segments
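Before a model is wired into a daily job, the flagging step could start from simple rules taken from the recommendations above. The sketch below is a rule-based stand-in, not model scoring: the function `flag_for_followup`, the flag column names, and the 3-day threshold (chosen to match the drop-off after 1-3 days) are assumptions, and the rows are made up.

```python
import pandas as pd


def flag_for_followup(df, as_of, max_days_without_bgc=3):
    """Flag unconverted drivers who are stuck at a known drop-off point."""
    out = df.copy()
    days_since_signup = (pd.Timestamp(as_of) - out["signup_date"]).dt.days
    not_converted = out["first_completed_date"].isna()
    # Completed the background check but has not added a vehicle yet.
    out["needs_vehicle_nudge"] = (
        not_converted & out["bgc_date"].notna() & out["vehicle_added_date"].isna()
    )
    # No background check although more than the allowed number of days have passed.
    out["needs_bgc_nudge"] = (
        not_converted & out["bgc_date"].isna() & (days_since_signup > max_days_without_bgc)
    )
    return out


# Made-up rows for illustration only.
drivers = pd.DataFrame({
    "id": [1, 2, 3],
    "signup_date": pd.to_datetime(["2016-01-02", "2016-01-04", "2016-01-09"]),
    "bgc_date": pd.to_datetime(["2016-01-03", None, None]),
    "vehicle_added_date": pd.to_datetime([None, None, None]),
    "first_completed_date": pd.to_datetime([None, None, None]),
})
flagged = flag_for_followup(drivers, as_of="2016-01-10")
print(flagged[["id", "needs_vehicle_nudge", "needs_bgc_nudge"]])
```

A model-based version would replace the two rules with a predicted probability and a threshold, and the A/B tests could then compare interventions within each flagged group.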
Implementing these recommendations could:
- Increase overall conversion rate from 11.22% to 16-18%
- Reduce average time to first trip by 30-40%
- Decrease cost per converted driver by 20-25%
- Improve driver supply in key markets by 15-20%
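To put the first estimate in absolute terms, the conversion rates can be applied to a cohort the size of this dataset. This is plain arithmetic on the figures quoted above and assumes the cohort size stays at 54,681 signups; it is not a forecast.

```python
SIGNUPS = 54_681
BASELINE_RATE = 0.1122
TARGET_RATES = (0.16, 0.18)


def expected_conversions(signups, rate):
    """Expected number of converted drivers at a given conversion rate."""
    return signups * rate


baseline = expected_conversions(SIGNUPS, BASELINE_RATE)
print(f"baseline: {baseline:,.0f} converted drivers")
for rate in TARGET_RATES:
    projected = expected_conversions(SIGNUPS, rate)
    print(f"at {rate:.0%}: {projected:,.0f} converted ({projected - baseline:,.0f} more)")
```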
Future improvements:
- Real-time scoring system for new driver signups
- A/B testing framework for intervention validation
- Enhanced feature engineering with additional data sources
- Dynamic model updates with continuous retraining
- Personalized intervention optimization based on driver characteristics
This dataset was provided as part of a take-home assignment in the recruitment process for data science positions at Uber. The analysis and models are for educational purposes.
Presentation slides: https://docs.google.com/presentation/d/1X3fqHPpnkzPW_yd7OMVyDSamq9uDRc9s0ZLmzXUB5Hs/edit?usp=sharing
DevPost link: https://devpost.com/software/zotzotzot-predicting-driver-activation/joins/DITk_wOi4PpacQiDfKwYvQ