This project predicts the prices of houses and flats in Gurgaon using data collected from 99acres.com. The dataset contains around 4,000 property samples. The project covers data processing, feature engineering, exploratory data analysis (EDA), outlier detection, missing value imputation, feature selection, model training, and deployment with Streamlit.
- Project Overview
- Data Processing and Cleaning
- Feature Engineering
- Exploratory Data Analysis
- Outlier Detection and Removal
- Missing Value Imputation
- Feature Selection
- Model Training and Selection
- Deployment
- How to Run
- Results
- Contributing
- License
The main objective of this project is to develop a machine learning model that accurately predicts the prices of real estate properties in Gurgaon. The pipeline runs from data processing and feature engineering through exploratory data analysis, outlier detection, and model training to deployment.
- Loading the Data: Import the dataset and check for initial anomalies.
- Data Cleaning: Remove duplicates, handle incorrect data types, and standardize categorical variables.
- Data Normalization: Scale numerical features to a common range so that features with large magnitudes do not dominate scale-sensitive models.
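The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the project's actual code: the column names (`property_type`, `sector`, `area_sqft`, `price_cr`), the helper names `clean` and `min_max_normalize`, and the choice of min-max scaling are all assumptions.

```python
import pandas as pd

# Hypothetical raw rows; the column names are assumptions, not the real schema.
raw = pd.DataFrame({
    "property_type": ["Flat", "flat ", "House", "Flat", "HOUSE"],
    "sector": ["Sector 45", "sector 45", "Sector 12", "Sector 45", "Sector 7"],
    "area_sqft": ["1200", "1200", "2400", "1200", "1800"],
    "price_cr": [1.10, 1.10, 3.50, 1.10, 2.20],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize categories, fix dtypes, then drop exact duplicates."""
    out = df.copy()
    # Standardize categorical text so "flat " and "Flat" compare equal.
    for col in ["property_type", "sector"]:
        out[col] = out[col].str.strip().str.lower()
    # Fix an incorrect data type: the area arrives as text in this example.
    out["area_sqft"] = pd.to_numeric(out["area_sqft"], errors="coerce")
    # Duplicates are dropped last so that standardized rows can match.
    return out.drop_duplicates().reset_index(drop=True)


def min_max_normalize(series: pd.Series) -> pd.Series:
    """Rescale a numeric column to the [0, 1] range."""
    return (series - series.min()) / (series.max() - series.min())


cleaned = clean(raw)
cleaned["area_scaled"] = min_max_normalize(cleaned["area_sqft"])
print(cleaned)
```

In a real pipeline the scaling parameters would normally be computed on the training split only and then reused on the test split, to avoid leaking test information into the model.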
- Feature Creation: Create new features based on domain knowledge (e.g., price per square foot, age of the property).
- Feature Transformation: Apply transformations (e.g., a log transform) to skewed features to bring their distributions closer to normal.
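As a rough sketch of the two steps above, the snippet below derives price per square foot and property age, then log-transforms the price. The data, the column names (`price`, `area_sqft`, `year_built`), and the `REFERENCE_YEAR` constant are hypothetical, and `log1p` is only one possible choice of transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical listings; column names and values are illustrative only.
df = pd.DataFrame({
    "price": [9_000_000, 15_000_000, 42_000_000, 120_000_000],
    "area_sqft": [1000, 1500, 3000, 6000],
    "year_built": [2010, 2018, 2005, 2021],
})

REFERENCE_YEAR = 2024  # assumption: the year used to compute property age

# Feature creation from domain knowledge.
df["price_per_sqft"] = df["price"] / df["area_sqft"]
df["property_age"] = REFERENCE_YEAR - df["year_built"]

# Feature transformation: log1p compresses the long right tail of prices.
df["log_price"] = np.log1p(df["price"])

skew_before = df["price"].skew()
skew_after = df["log_price"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

If a model is trained on the log-transformed target, its predictions have to be mapped back with `np.expm1` before they are reported as prices.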
- Distribution Plots: Plot histograms and density plots for numerical features.
- Box Plots: Visualize the spread of numerical features and identify outliers.
- Correlation Matrix: Examine the relationships between numerical features.
- Pair Plots: Visualize relationships between pairs of numerical features.
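The numbers behind these plots can be computed without any plotting library. The sketch below builds a synthetic dataset (the columns and the relationship between them are invented for illustration) and derives the summary statistics, histogram bin counts, and correlation matrix that the plots would display.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for the real dataset (assumption).
rng = np.random.default_rng(0)
area = rng.uniform(500, 4000, size=200)
bedrooms = np.clip((area / 700).round(), 1, 6)
price = area * 9000 + rng.normal(0, 2_000_000, size=200)
df = pd.DataFrame({"area_sqft": area, "bedrooms": bedrooms, "price": price})

# Quartiles and extremes: the values a box plot draws.
summary = df.describe()

# Bin counts: the values a histogram draws.
counts, bin_edges = np.histogram(df["price"], bins=10)

# Pairwise Pearson correlations between numerical features.
corr = df.corr()

print(summary.loc[["25%", "50%", "75%"]])
print(counts)
print(corr.round(2))
```

With matplotlib and seaborn installed, `df.hist()`, `df.boxplot()`, `seaborn.heatmap(corr)`, and `seaborn.pairplot(df)` would render the corresponding figures.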
- Identifying Outliers: Use statistical methods such as the interquartile range (IQR) to detect outliers.
- Removing Outliers: Remove the detected outliers from the dataset.
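A minimal sketch of IQR-based outlier removal is shown below. The helper names `iqr_bounds` and `remove_outliers`, the `price_per_sqft` column, and the conventional multiplier of 1.5 are assumptions; the project may use different columns or thresholds.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep only the rows whose value in `column` lies within the IQR fences."""
    low, high = iqr_bounds(df[column], k)
    return df[df[column].between(low, high)].reset_index(drop=True)


# Hypothetical values with one obviously extreme listing.
df = pd.DataFrame(
    {"price_per_sqft": [8000, 9000, 9500, 10000, 11000, 12000, 95000]}
)

low, high = iqr_bounds(df["price_per_sqft"])
filtered = remove_outliers(df, "price_per_sqft")
print(low, high)        # 5875.0 14875.0
print(len(filtered))    # 6
```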
- Identifying Missing Values: Check for missing values in the dataset.
- Imputation Techniques: Apply appropriate techniques (mean, median, mode, or model-based imputation) to fill missing values.
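The two steps above might look like the following sketch, which counts missing values and then fills a numerical column with its median and a categorical column with its mode. The columns `area_sqft` and `furnishing` are hypothetical, and which statistic suits which column is a judgment call for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in a numerical and a categorical column.
df = pd.DataFrame({
    "area_sqft": [1200.0, np.nan, 1800.0, 2400.0, np.nan],
    "furnishing": ["furnished", None, "unfurnished", "furnished", "furnished"],
})

# Identify missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

imputed = df.copy()
# Numerical column: the median is robust to extreme values.
imputed["area_sqft"] = imputed["area_sqft"].fillna(imputed["area_sqft"].median())
# Categorical column: fill with the most frequent category (the mode).
imputed["furnishing"] = imputed["furnishing"].fillna(
    imputed["furnishing"].mode().iloc[0]
)
print(imputed)
```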
- Feature Importance: Use algorithms like Random Forest or Lasso Regression to identify important features.
- Dimensionality Reduction: Apply techniques like PCA if needed to reduce the feature space.
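One way to rank features is with the impurity-based importances of a Random Forest, sketched below on synthetic data. The feature names, the generated target, and the hyperparameters are assumptions for illustration; the `noise` column is deliberately unrelated to the target so that it should rank low.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features; only area and bedrooms influence the target.
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "area_sqft": rng.uniform(500, 4000, n),
    "bedrooms": rng.integers(1, 6, n),
    "noise": rng.normal(size=n),
})
y = 9000 * X["area_sqft"] + 500_000 * X["bedrooms"] + rng.normal(0, 500_000, n)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Importances sum to 1; higher means the feature was more useful for splits.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be biased toward high-cardinality features, so cross-checking with `sklearn.inspection.permutation_importance` is often worthwhile.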
- Train-Test Split: Split the data into training and testing sets.
- Model Selection: Experiment with various algorithms (e.g., Linear Regression, Random Forest, XGBoost) and select the best model based on performance metrics.
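The split-and-compare workflow can be sketched as follows. The data is synthetic, the candidate list is limited to two scikit-learn models (XGBoost is omitted to keep the example dependency-free), and the scores printed here say nothing about the project's actual results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix and target (assumption).
rng = np.random.default_rng(7)
n = 400
X = pd.DataFrame({
    "area_sqft": rng.uniform(500, 4000, n),
    "bedrooms": rng.integers(1, 6, n),
})
y = 9000 * X["area_sqft"] + 500_000 * X["bedrooms"] + rng.normal(0, 1_000_000, n)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each candidate on the training split and score it on the test split.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
    }

best_name = max(scores, key=lambda name: scores[name]["r2"])
print(scores)
print("best by R2:", best_name)
```

Selecting on a single test split can be noisy; `sklearn.model_selection.cross_val_score` gives a more stable comparison when the dataset is small.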
- Model Serialization: Save the trained model using joblib or pickle.
- Streamlit Application: Develop a user-friendly interface using Streamlit for model deployment.
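The serialization step can be sketched as a save-and-reload round trip with joblib. The model, the training data, and the file name `model.joblib` are placeholders chosen for this example, not the project's actual artifacts.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train a small placeholder model on synthetic data (assumption).
rng = np.random.default_rng(1)
X = rng.uniform(500, 4000, size=(100, 1))
y = 9000 * X[:, 0]
model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Save the fitted model to disk and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
sample = np.array([[1500.0]])
original_pred = model.predict(sample)
restored_pred = restored.predict(sample)
print(original_pred, restored_pred)
```

A Streamlit script would typically load the saved file once with `joblib.load`, collect inputs through widgets such as `st.number_input`, and display the prediction with `st.write`. The scikit-learn version used for loading should match the one used for saving.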
- Clone the Repository:
git clone https://github.com/yourusername/real-estate-price-prediction.git
- Install Dependencies:
pip install -r requirements.txt
- Run the Streamlit Application:
streamlit run app.py
- Model Performance: The Random Forest model performed best, with an R² score of 0.90 and a mean absolute error (MAE) of 47 lakh (4.7 million) rupees.
Contributions are welcome! Please read the contributing guidelines before getting started.
This project is licensed under the MIT License. See the LICENSE file for more details.