Accurately predicting the nightly price of Airbnb listings using regression models and data preprocessing techniques.
This project explores price prediction for Airbnb listings through a regression modeling pipeline. Three models are evaluated:
- Linear Regression
- Gradient Boosted Regressor
- Optimized Gradient Boosted Regressor (via GridSearchCV)
The end goal is to develop a model that can power an intelligent Price Estimation Tool for new Airbnb hosts.
- Feature engineering: added and removed relevant features
- Imputed missing values in `'beds'` using `pandas.isnull()` and `fillna()`
- Winsorized outliers in `'price'` and `'review_scores_rating'`
- Applied log transformation to `'price'` to normalize the distribution
- Converted `has_availability` from boolean to numeric
- One-hot encoded `room_type` and `neighbourhood_group_cleansed`
- Computed a correlation matrix to identify key features
- Dropped redundant columns like `beds_na` and original winsorized columns
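The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up DataFrame, not the project's actual notebook code. The mean fill for `beds`, the 1st/99th percentile cut-offs, the use of `np.log1p`, and the helper names `winsorize`, `price_win`, `rating_win`, and `log_price` are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; the real listing data is not reproduced here.
df = pd.DataFrame({
    "price": [50.0, 80.0, 120.0, 95.0, 5000.0, 60.0],
    "review_scores_rating": [4.8, 4.5, 1.0, 4.9, 4.7, 4.6],
    "beds": [1.0, np.nan, 2.0, 3.0, np.nan, 1.0],
    "has_availability": [True, True, False, True, True, False],
    "room_type": ["Entire home/apt", "Private room", "Private room",
                  "Entire home/apt", "Shared room", "Private room"],
    "neighbourhood_group_cleansed": ["Manhattan", "Brooklyn", "Queens",
                                     "Brooklyn", "Manhattan", "Bronx"],
})


def winsorize(series, lower=0.01, upper=0.99):
    """Clip values outside the given quantiles (cut-offs are assumptions)."""
    lo, hi = series.quantile([lower, upper])
    return series.clip(lower=lo, upper=hi)


# Flag missing 'beds' values, then fill them (mean fill is an assumption).
df["beds_na"] = df["beds"].isnull()
df["beds"] = df["beds"].fillna(df["beds"].mean())

# Winsorize the two outlier-prone columns, then log-transform the target.
df["price_win"] = winsorize(df["price"])
df["rating_win"] = winsorize(df["review_scores_rating"])
df["log_price"] = np.log1p(df["price_win"])

# Boolean -> numeric, then one-hot encode the categorical columns.
df["has_availability"] = df["has_availability"].astype(int)
df = pd.get_dummies(
    df, columns=["room_type", "neighbourhood_group_cleansed"], dtype=int
)

# Correlation of each numeric feature with the log-transformed target.
corr = (
    df.select_dtypes("number").corr()["log_price"].sort_values(ascending=False)
)

# Drop the helper flag and the columns superseded by their transformed versions.
df = df.drop(columns=["beds_na", "price", "review_scores_rating", "price_win"])
```

Clipping with `Series.clip` is one simple way to winsorize; `scipy.stats.mstats.winsorize` is an alternative if SciPy is already a dependency.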
- Split data into training and test sets
- Applied `StandardScaler` for normalization
- Fitted a `LinearRegression` model
- Evaluated with RMSE and R²
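A minimal sketch of this baseline, assuming a synthetic feature matrix in place of the prepared Airbnb data. The 80/20 split, the random seeds, and the variable names are illustrative choices, not values taken from the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features and log-price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
coefs = np.array([0.5, -0.3, 0.2, 0.0, 0.1])
y = X @ coefs + 4.5 + rng.normal(scale=0.1, size=500)

# Hold out 20% of the rows for testing (split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, so no test statistics leak in.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the baseline and score it on the held-out set.
model = LinearRegression().fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```

If the target is log-transformed, an RMSE computed this way is in log units; predictions would need to be transformed back before reporting an error in currency.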
- Fitted a `GradientBoostingRegressor` with `max_depth=10` and `n_estimators=300`
- Evaluated with RMSE and R²
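As a rough sketch, the non-optimized model can be fitted as below. Only `max_depth=10` and `n_estimators=300` come from the list above; the synthetic non-linear data, the split, and `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic non-linear target standing in for the prepared listing data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters as listed above; everything else is left at its default.
gbdt = GradientBoostingRegressor(max_depth=10, n_estimators=300, random_state=42)
gbdt.fit(X_train, y_train)

y_pred = gbdt.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```

Tree-based models are insensitive to monotonic feature scaling, so the sketch skips `StandardScaler`; passing scaled features would also work.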
- Used `GridSearchCV` to optimize `max_depth` and `n_estimators`
- Trained GBDT with optimal parameters
- Compared performance against baseline models
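The tuning step might look something like the sketch below. The candidate values in `param_grid`, the 3-fold cross-validation, and the RMSE-based scoring are assumptions; the grid actually searched in the project is not listed here. The data is synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic non-linear target standing in for the prepared listing data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hypothetical search grid over the two tuned hyperparameters.
param_grid = {"max_depth": [2, 3], "n_estimators": [50, 100]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",  # higher (less negative) is better
    cv=3,
)
search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is already retrained on the
# full training set using the best parameter combination.
best_gbdt = search.best_estimator_
y_pred = best_gbdt.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(search.best_params_)
print(f"RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```

Scoring the tuned model on the same held-out test set as the baselines keeps the comparison fair, since the cross-validation folds are drawn only from the training split.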
| Model | R² Score | RMSE |
|---|---|---|
| Linear Regression | 0.42 | 104.9 |
| GBDT (Non-Optimized) | 0.56 | 91.4 |
| GBDT (Optimized via GridSearchCV) | 0.76 | 89.1 |
- Expand to larger, more diverse property datasets
- Perform deeper EDA and feature scaling
- Explore advanced models:
  - Neural Networks
  - Stochastic Gradient Descent Regressor (SGD)
- Target a reduced RMSE of < 50
Affordability is a core value of Airbnb. This model contributes to a smart Price Estimation Tool to assist new hosts in setting competitive and fair prices.
"Price too low? Lose revenue. Too high? Lose customers."
This model aims to ensure:
- Hosts get data-driven price estimates
- Travelers benefit from fair pricing
- Airbnb improves marketplace efficiency
- Linear Regression — baseline model for price prediction
- Gradient Boosted Decision Trees (GBDT) — for improved non-linear performance
- GridSearchCV — for hyperparameter tuning and model optimization
- Pandas — data manipulation and preprocessing
- NumPy — numerical computing
- Scikit-learn — preprocessing tools (e.g., `StandardScaler`, `OneHotEncoder`)
- Matplotlib — basic plotting
- Seaborn — statistical visualizations (e.g., correlation heatmaps)
- Jupyter Notebook — interactive experimentation environment
- Python 3.8+ — core programming language
📫 Let’s Connect!
If you found this project insightful or want to collaborate, feel free to reach out or check out my GitHub profile.