- The notebook's primary purpose is data analysis for airline models, using data initially loaded from https://github.com/Speeb04/SDSS-Datathon/raw/refs/heads/main/Resources/Cases/Airline%20Tickets/airline_ticket_dataset.csv.
- Data preprocessing involved cleaning city names by removing the suffix ' (Metropolitan Area)' from the `city1` and `city2` columns for standardization, and identifying unique cities and carrier types.
- Feature engineering included the creation of boolean indicators: `city1_is_hub` and `city2_is_hub` (for hub cities), and `carrier_lg_is_full_service` and `carrier_low_is_low_cost` (for carrier types).
- Population data (2020-2024 estimates) was successfully integrated into the dataset for `city1` and `city2` using a lookup table (`pop_lut`) and the `addPOP` function.
- An attempt to integrate GDP data (2001-2018) encountered significant challenges, resulting in a large number of 'N/A' values after `GeoName` standardization, indicating difficulties in matching city names from the GDP dataset with those in the main DataFrame.
- The significant 'N/A' values encountered during GDP data integration highlight a need for more robust city name matching or alternative geographic data sources to effectively incorporate economic indicators.
- The comprehensive data preparation, including cleaning, feature engineering, and partial external data integration, provides a solid foundation for building and training machine learning models to predict various airline-related outcomes, such as ticket prices or passenger demand.
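
The suffix cleanup, hub flags, and population lookup described in this summary can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the rows, the hub list `HUB_CITIES`, and the values in `pop_lut` are made-up placeholders, and the loop stands in for the notebook's `addPOP` function, whose implementation is not shown here. The final step lists city names that have no entry in the lookup table, which is the kind of check that could help diagnose the 'N/A' values seen during GDP integration.

```python
import pandas as pd

# Toy rows standing in for the real dataset (illustrative only).
df = pd.DataFrame({
    "city1": ["Atlanta, GA (Metropolitan Area)", "Boise, ID"],
    "city2": ["Chicago, IL", "New York City, NY (Metropolitan Area)"],
})

SUFFIX = " (Metropolitan Area)"
HUB_CITIES = {"Atlanta, GA", "Chicago, IL"}   # hypothetical hub list
pop_lut = {                                   # placeholder populations, not real figures
    "Atlanta, GA": 1_000_000,
    "Chicago, IL": 2_000_000,
}

for col in ("city1", "city2"):
    # Remove the literal suffix (regex=False avoids treating "(" as a pattern).
    df[col] = df[col].str.replace(SUFFIX, "", regex=False)
    # Boolean hub indicator, e.g. city1_is_hub.
    df[f"{col}_is_hub"] = df[col].isin(HUB_CITIES)
    # Population lookup; cities missing from the table become NaN.
    df[f"{col}_pop_2024"] = df[col].map(pop_lut)

# Names with no match in the lookup table; these need manual review.
all_cities = set(df["city1"]) | set(df["city2"])
unmatched = sorted(all_cities - set(pop_lut))
print(unmatched)
```

Listing unmatched names before merging an external table makes it easier to tell whether missing values come from spelling differences or from genuinely absent cities.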
This notebook aims to predict airfare prices based on various route and city attributes using machine learning models.
- Problem Statement
- Data Loading and Initial Exploration
- Data Cleaning and Preprocessing
- Feature Engineering
- Model Training and Evaluation
- Conclusion and Next Steps
Given information about a desired route, the objective is to predict the airfare for that route.
- The dataset `final_output_data.csv` is downloaded from a GitHub repository.
- Essential libraries like `pandas`, `matplotlib`, `numpy`, and `sklearn` are imported.
- The CSV file is loaded into a pandas DataFrame named `data`.
- Initial descriptive statistics (`data.describe()`) and the head of the DataFrame (`data.head()`) are displayed to understand the data structure.
- A copy of the original DataFrame (`data`) is made for preprocessing (`df`).
- Columns with 100% missing values are dropped.
- Numerical columns are identified, and their values are cleaned by removing currency symbols ('$'), commas (','), and converting '#DIV/0!' to NaN. Data types are then coerced to numeric.
- Rows with NaN values in `city1_pop_2024` and `city2_pop_2024` are dropped.
- Remaining NaN values in numerical columns are imputed using the median of their respective columns.
- The `passengers` column is converted to integer type.
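
The cleaning steps above can be sketched on a toy DataFrame as follows. The rows, the `empty_col` column, and the hard-coded `numeric_cols` list are assumptions for illustration; the notebook identifies its numerical columns itself, and its exact code may differ.

```python
import pandas as pd

# Toy data with the kinds of problems described above (values are made up).
data = pd.DataFrame({
    "fare": ["$1,200.50", "#DIV/0!", "300", "$500.00"],
    "passengers": ["1,000", "250", "40", "60"],
    "city1_pop_2024": [5_000_000, 1_000_000, None, 800_000],
    "city2_pop_2024": [2_000_000, 900_000, 700_000, 600_000],
    "empty_col": [None, None, None, None],
})

df = data.copy()                     # keep the original untouched
df = df.dropna(axis=1, how="all")    # drop columns that are 100% missing

numeric_cols = ["fare", "passengers"]  # assumed list of numeric columns
for col in numeric_cols:
    cleaned = (
        df[col].astype(str)
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
    )
    # errors="coerce" turns '#DIV/0!' and other non-numeric text into NaN.
    df[col] = pd.to_numeric(cleaned, errors="coerce")

# Drop rows that lack population data for either city.
df = df.dropna(subset=["city1_pop_2024", "city2_pop_2024"])

# Impute remaining gaps with each column's median, then cast passengers.
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
df["passengers"] = df["passengers"].astype(int)
print(df)
```

Note that the median is computed after the population rows are dropped, which matches the order of the steps listed above.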
- A `route` column is created by sorting and concatenating `city1` and `city2` to represent unique routes.
- Categorical features such as `Year`, `quarter`, `city1`, `city2`, `carrier_lg`, `carrier_low`, and `route` are one-hot encoded using `pd.get_dummies()`.
- The target variable `y` is set to `fare`.
- Features `X` are selected, excluding `fare`, `fare_lg`, `fare_low`, `citymarketid_1`, and `citymarketid_2`.
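
A minimal sketch of these feature engineering steps, assuming a toy DataFrame with placeholder values: the `" - "` separator in `route` and all row values (cities, carrier codes, fares, market IDs) are illustrative choices, not taken from the notebook.

```python
import pandas as pd

# Placeholder rows; only the column names come from the description above.
df = pd.DataFrame({
    "Year": [2023, 2024, 2024],
    "quarter": [1, 2, 2],
    "city1": ["Boston, MA", "Atlanta, GA", "Atlanta, GA"],
    "city2": ["Atlanta, GA", "Boston, MA", "Denver, CO"],
    "carrier_lg": ["AA", "AA", "BB"],
    "carrier_low": ["CC", "CC", "DD"],
    "fare": [250.0, 265.0, 180.0],
    "fare_lg": [260.0, 270.0, 190.0],
    "fare_low": [200.0, 210.0, 150.0],
    "citymarketid_1": [1, 2, 2],
    "citymarketid_2": [2, 1, 3],
})

# Sorting the two city names makes A->B and B->A map to the same route.
df["route"] = df.apply(
    lambda row: " - ".join(sorted([row["city1"], row["city2"]])), axis=1
)

categorical = ["Year", "quarter", "city1", "city2",
               "carrier_lg", "carrier_low", "route"]
encoded = pd.get_dummies(df, columns=categorical)

y = encoded["fare"]
X = encoded.drop(
    columns=["fare", "fare_lg", "fare_low", "citymarketid_1", "citymarketid_2"]
)
print(X.columns.tolist())
```

Dropping `fare_lg` and `fare_low` along with `fare` keeps closely related fare columns out of the feature matrix, so the model cannot simply read the target off a near-duplicate.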
- The data is split into training and testing sets (`X_train`, `X_test`, `y_train`, `y_test`) with an 80/20 split.
- Model 1: Random Forest Regressor
  - A `RandomForestRegressor` model is initialized and trained.
  - Predictions are made on the test set.
  - The model is evaluated using Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2).
  - A scatter plot visualizes actual vs. predicted fares.
- Model 2: Linear Regression
  - A `LinearRegression` model is initialized and trained.
  - Predictions are made on the test set.
  - The model is evaluated using MAE, MSE, RMSE, and R2.
  - A scatter plot visualizes actual vs. predicted fares.
  - The intercept and coefficients of the Linear Regression model are displayed to understand feature importance.
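
The split, training, and evaluation workflow described above can be sketched as follows. Synthetic data stands in for the real feature matrix, and the hyperparameters (`n_estimators=50`, `random_state=42`) and the helper `evaluate` are assumptions for illustration; the notebook's actual settings are not stated.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (not the airline data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 100 + 50 * X[:, 0] - 20 * X[:, 1] + rng.normal(0, 1, size=200)

# 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def evaluate(model):
    """Fit on the training set and report MAE, MSE, RMSE, and R2 on the test set."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    return {
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_test, pred),
    }

models = {
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
    "LinearRegression": LinearRegression(),
}
results = {name: evaluate(model) for name, model in models.items()}

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})

# Intercept and coefficients of the fitted linear model.
linear = models["LinearRegression"]
print("intercept:", linear.intercept_, "coefficients:", linear.coef_)
```

Using one `evaluate` helper for both models keeps the metrics directly comparable. Linear coefficients reflect each feature's scale, so they are easiest to compare when features share similar ranges.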
The notebook demonstrates the process of building and evaluating models for airfare prediction. Future steps include further feature selection to remove confounding features.