Project Story: Predicting Route Price Using Socio-economic & Market Share Data

About the Project

Our project aimed to predict airfare prices for airline routes, leveraging a rich dataset containing various route-specific, carrier-specific, and city-specific socio-economic indicators. The core idea was to move beyond simple route identification and understand the underlying economic and market dynamics that dictate air travel costs.

Inspiration

As a group of students who live primarily outside of Toronto, we're heavily affected by airfare pricing when we travel home and back to school. Analyzing flight pricing data helps us understand what drives this pricing structure and how we can spend less on what is an essential expense for non-local, full-time students.

What We Learned

Through this project, we gained significant insights into:

  • The Power of Feature Engineering and Selection: The deliberate exclusion of direct city identifiers forced us to focus on more abstract, yet highly influential, features such as a city's GDP, population, and various market share metrics (e.g., large_ms, lf_ms). This refined feature set proved crucial for building generalizable models.
  • Impact of Data Preprocessing: The necessity of rigorous data cleaning, handling missing values, type conversions, and especially feature scaling (Standardization) became evident for optimal model performance, particularly with Linear Regression and Neural Networks.
  • Model Suitability for Tabular Data: We explored different modeling paradigms: traditional Linear Regression for interpretability, Neural Networks for their ability to capture complex non-linear relationships, and TabNet for its deep learning capabilities specifically tailored for tabular data, offering both high performance and some interpretability through feature importances.
  • Log Transformation for Skewed Targets: Applying a log transformation to the 'fare' target variable proved beneficial for models like TabNet, as it helps normalize skewed distributions and often improves convergence and predictive accuracy.
  • Hyperparameter Sensitivity: The cat_emb_dim parameter in TabNet, governing the embedding size for categorical features, demonstrated its critical role. Proper tuning of such hyperparameters is essential for models to effectively learn from high-cardinality categorical data.
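As a minimal sketch of the log-transformation point above (with made-up fare values, not figures from our dataset), `np.log1p` compresses the long right tail of a skewed target and `np.expm1` recovers the original scale exactly, which is what lets us evaluate predictions in dollars:

```python
import numpy as np

# Hypothetical skewed fare values in dollars (one large outlier)
fares = np.array([89.0, 120.0, 135.0, 210.0, 980.0])

# Train on log1p(fare) to compress the long right tail;
# log1p/expm1 are safe even if a fare were 0
log_fares = np.log1p(fares)

# After predicting in log space, invert before computing metrics in dollars
recovered = np.expm1(log_fares)
assert np.allclose(recovered, fares)

print(log_fares.round(3))  # the ~11x spread shrinks to under a 2.2x spread
```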

How We Built Our Project

Our project followed a structured machine learning workflow:

  1. Data Acquisition and Initial Exploration: We started by downloading and loading the final_output_data.csv into a Pandas DataFrame. Initial descriptive statistics and correlation matrices helped us understand the raw data.
  2. Data Cleaning and Feature Preparation: We meticulously cleaned the data by:
    • Dropping columns with 100% missing values.
    • Converting monetary and population columns from string to numeric, handling special characters.
    • Imputing missing numerical values with medians to maintain data integrity.
    • Converting Year to integer type.
    • Most crucially, we defined our feature set to exclude direct city1 and city2 identifiers, concentrating on socio-economic features (GDP, population) and market share data (large_ms, lf_ms, TotalFaredPax_city, TotalPerLFMkts_city, TotalPerPrem_city). Categorical features like quarter, carrier_lg, and carrier_low were retained.
  3. Model Implementation and Training:
    • Linear Regression: Implemented as a baseline, using one-hot encoded and standardized features. We analyzed its coefficients for feature impact.
    • Neural Network (TensorFlow/Keras): A multi-layer perceptron was built, using standardized input features. This allowed us to explore non-linear relationships.
    • TabNet (PyTorch): Chosen for its state-of-the-art performance on tabular data. This model utilized integer-encoded categorical features with learned entity embeddings, along with a log-transformed target variable. It provided a powerful, interpretable deep learning approach.
  4. Model Evaluation and Comparison: For each model, we calculated standard regression metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared ($R^2$). We visualized actual vs. predicted values and, for Linear Regression and TabNet, explored global feature importances to understand which features drove predictions.
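A condensed sketch of the cleaning steps in item 2, using a toy DataFrame (the column names `GDP`, `population`, and `Year` mirror the ones described above; `empty_col` is a stand-in for any fully missing column):

```python
import pandas as pd

# Toy frame mimicking the raw CSV's string-typed monetary/population columns
df = pd.DataFrame({
    "Year": ["2019", "2020", "2021"],
    "GDP": ["$1,200,000", "$980,000", None],
    "population": ["2,500,000", None, "2,650,000"],
    "empty_col": [None, None, None],
})

# Drop columns that are 100% missing
df = df.dropna(axis=1, how="all")

# Strip currency symbols and thousands separators, then convert to numeric
for col in ["GDP", "population"]:
    df[col] = pd.to_numeric(
        df[col].str.replace(r"[$,]", "", regex=True), errors="coerce"
    )

# Impute remaining numeric gaps with the column median
df[["GDP", "population"]] = df[["GDP", "population"]].fillna(
    df[["GDP", "population"]].median()
)

# Convert Year to integer type
df["Year"] = df["Year"].astype(int)
```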
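The four metrics in step 4 can be computed with `sklearn.metrics` or directly from their definitions; a NumPy sketch with made-up predictions:

```python
import numpy as np

# Hypothetical actual vs. predicted fares
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

mae = np.mean(np.abs(y_true - y_pred))            # Mean Absolute Error
mse = np.mean((y_true - y_pred) ** 2)             # Mean Squared Error
rmse = np.sqrt(mse)                               # Root Mean Squared Error
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}  R^2={r2:.3f}")
# → MAE=10.0  MSE=100.0  RMSE=10.0  R^2=0.968
```

Note that when the model is trained on a log-transformed target, predictions must be inverse-transformed back to dollars before computing these metrics.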

Challenges We Faced

  • Initial Data Quality: The raw dataset presented challenges with mixed data types and string formats (e.g., currency symbols, commas in numbers), requiring careful cleaning and conversion.
  • Preventing Memorization: The deliberate decision to exclude explicit city names and IDs was a design challenge. It meant forgoing potentially higher R-squared values in exchange for a more robust and generalizable model that learns from underlying market conditions rather than specific routes. We had to carefully curate the remaining features to ensure sufficient information was available.
  • TabNet Implementation: Adapting pytorch-tabnet for regression and correctly configuring categorical feature handling (specifically cat_idxs and cat_dims) was initially challenging. The need for a log transformation of the target variable and careful inverse transformation for evaluation also required attention.
  • Interpreting Complex Models: While Neural Networks and TabNet offered superior predictive power, interpreting their internal workings and deriving clear insights into feature contributions was more complex than with Linear Regression. TabNet's built-in feature importance mechanism helped bridge this gap to some extent.
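To illustrate the cat_idxs/cat_dims configuration mentioned above: pytorch-tabnet expects the positional indices of the categorical columns and their cardinalities. A pandas-only sketch of deriving them (toy column names based on the features listed earlier; the TabNetRegressor call is left commented out as it requires pytorch-tabnet to be installed):

```python
import pandas as pd

# Toy feature frame; quarter and carrier_lg stand in for our categorical columns
X = pd.DataFrame({
    "quarter": ["Q1", "Q2", "Q1", "Q4"],
    "carrier_lg": ["AA", "DL", "AA", "WN"],
    "GDP": [1.2, 0.9, 1.1, 1.4],
    "large_ms": [0.4, 0.6, 0.5, 0.3],
})
cat_cols = ["quarter", "carrier_lg"]

# Integer-encode each categorical column (TabNet learns embeddings from these codes)
for col in cat_cols:
    X[col] = X[col].astype("category").cat.codes

cat_idxs = [X.columns.get_loc(c) for c in cat_cols]  # positional column indices
cat_dims = [int(X[c].nunique()) for c in cat_cols]   # per-column cardinalities

# from pytorch_tabnet.tab_model import TabNetRegressor
# model = TabNetRegressor(cat_idxs=cat_idxs, cat_dims=cat_dims, cat_emb_dim=4)
```

Getting cat_idxs and cat_dims to line up with the columns of the final feature matrix (after any reordering or dropping) was exactly the kind of bookkeeping that tripped us up initially.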

Overall, the project successfully demonstrated that meaningful airfare prediction can be achieved using socio-economic and market share data, even without direct route identifiers, leading to models that are both predictive and insightful.

Built With

pandas · TensorFlow/Keras · PyTorch (pytorch-tabnet)