Scalable Databases – Final Project Phase 2

Overview

This repository contains the final project (Phases 1 and 2) for the Scalable Databases course. The project analyzes flight delay data using big data techniques and predictive modeling. It runs from data sampling in Hive to machine learning models in Python; the best model, XGBoost, reached approximately 90% accuracy on the test dataset.

Data Sampling in Hive

Sample Table Creation: A new sample table was created in Hive.
Delayed Column: A new column, "Delayed", was added to flag flight delays.
Random Sampling: 30,000 rows were extracted using random sampling to create a manageable dataset.
Data Export: The sample data was exported as a CSV file to be used in subsequent stages.
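
The project ran these steps in Hive. As a rough illustration of the same logic, here is a minimal pandas sketch that flags delays, draws a random sample, and serializes it as CSV. The function name, the use of `depdelay` as the source column, and the 15-minute threshold are assumptions for illustration; the rule actually used for the "Delayed" column is not stated here.

```python
import pandas as pd


def flag_and_sample(flights: pd.DataFrame, n_rows: int,
                    delay_threshold_min: float = 15.0,
                    seed: int = 42) -> pd.DataFrame:
    """Add a binary Delayed column and draw a random sample of rows."""
    flagged = flights.copy()
    # Assumption: a flight counts as delayed when depdelay exceeds the threshold.
    flagged["Delayed"] = (flagged["depdelay"] > delay_threshold_min).astype(int)
    # Sample without replacement, capped at the size of the table.
    n = min(n_rows, len(flagged))
    return flagged.sample(n=n, random_state=seed)


# Tiny synthetic table standing in for the Hive source table.
flights = pd.DataFrame({"depdelay": [0, 5, 20, 45, -3, 16, 15, 90]})
sample = flag_and_sample(flights, n_rows=4)
csv_text = sample.to_csv(index=False)  # pass a file path instead to write to disk
```

With the real data, `n_rows` would be 30,000 to match the sample size described above.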

Data Preparation in Python

Data Merging: Combined sample data across all years.
Handling Missing Values:
- Identified columns with null values.
- Dropped columns with excessive missing data.
Data Transformation:
- Replaced missing numerical values with the median.
- Replaced missing categorical values with the mode.
- Encoded categorical variables using LabelEncoder.
Data Splitting: Divided the data into training, validation, and test sets.
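
The preparation steps above can be sketched roughly as follows with pandas and scikit-learn. The 50% null cutoff, the 60/20/20 split ratios, the helper names, and the toy columns (`carrier`, `distance`, `mostly_null`) are assumptions for illustration, not values taken from the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


def prepare(df: pd.DataFrame, target: str = "Delayed",
            max_null_frac: float = 0.5):
    """Drop sparse columns, impute missing values, and encode categoricals."""
    df = df.copy()
    # Drop columns whose share of nulls exceeds the (assumed) cutoff.
    null_frac = df.isnull().mean()
    df = df.drop(columns=null_frac[null_frac > max_null_frac].index)
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numerical column: fill gaps with the median.
            df[col] = df[col].fillna(df[col].median())
        else:
            # Categorical column: fill gaps with the mode, then label-encode.
            df[col] = df[col].fillna(df[col].mode().iloc[0])
            df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df.drop(columns=[target]), df[target]


def split(X, y, seed: int = 42):
    """Split into train/validation/test (assumed 60/20/20)."""
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.4, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Toy data standing in for the merged sample.
raw = pd.DataFrame({
    "carrier": ["AA", "DL", None, "UA", "AA", "DL", "UA", "AA", "DL", "UA"],
    "distance": [300.0, None, 1200.0, 800.0, 450.0,
                 300.0, 950.0, None, 700.0, 500.0],
    "mostly_null": [None] * 9 + [1.0],
    "Delayed": [0, 1, 0, 1, 0, 0, 1, 1, 0, 1],
})
X, y = prepare(raw)
train, val, test = split(X, y)
```

This sketch imputes before splitting, in the order listed above. Computing the median and mode on the training split only would avoid leaking information from the validation and test sets.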

Model Implementation

Multiple machine learning models were implemented to predict flight delays:

XGBoost
Gradient Boosting
Decision Trees
Random Forest
Logistic Regression
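
A minimal sketch of fitting these model families side by side, assuming scikit-learn estimators with near-default hyperparameters and a synthetic dataset in place of the flight sample. The project's actual hyperparameters are not documented here. XGBoost is added only if the optional `xgboost` package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def build_models(seed: int = 42) -> dict:
    """Return one untrained estimator per model family."""
    models = {
        "Gradient Boosting": GradientBoostingClassifier(random_state=seed),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Random Forest": RandomForestClassifier(n_estimators=100,
                                                random_state=seed),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }
    try:
        # Optional dependency; skipped when xgboost is not installed.
        from xgboost import XGBClassifier
        models["XGBoost"] = XGBClassifier(random_state=seed)
    except ImportError:
        pass
    return models


# Synthetic binary classification data standing in for the flight sample.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit every model and record its accuracy on the validation split.
val_scores = {}
for name, model in build_models().items():
    model.fit(X_train, y_train)
    val_scores[name] = model.score(X_val, y_val)

best_name = max(val_scores, key=val_scores.get)
```

Keeping the estimators in one dictionary means every model is trained and scored on identical splits, which makes the comparison fair.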

Highlights:

The XGBoost model showed superior performance compared to other algorithms.

Model Evaluations

Models were assessed using various performance metrics.
The XGBoost model achieved approximately 90% accuracy on the test dataset.
Detailed comparisons were made to determine the best-performing model for predicting flight delays.
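
For illustration, one way to compute a set of classification metrics with scikit-learn is sketched below. Only accuracy is named in this README; precision, recall, F1, and the confusion matrix are assumed here as typical companions, and the labels are made-up values rather than project results.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)


def evaluate(y_true, y_pred) -> dict:
    """Compute common binary classification metrics (1 = delayed)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        # Rows are true classes, columns are predicted classes.
        "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
    }


# Made-up labels for demonstration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
report = evaluate(y_true, y_pred)
```

If delayed flights are a minority class, accuracy alone can look high for a model that rarely predicts a delay, so precision and recall are worth reporting alongside it.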

Target Predictions

Prediction Pipeline: The final model was used to generate target predictions.
Delay Calculations: Additional features (depdelay and arrdelay) were created by computing the differences between actual and scheduled times.
The predictions provide actionable insights into potential flight delays based on input features.
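
The delay calculation can be sketched as below. This assumes clock times are stored as HHMM integers (for example, 1745 for 17:45) and that the columns are named `deptime`, `crsdeptime`, `arrtime`, and `crsarrtime`; both are assumptions about the dataset, not details confirmed in this README.

```python
import pandas as pd


def hhmm_to_minutes(hhmm: pd.Series) -> pd.Series:
    """Convert HHMM-encoded clock times to minutes after midnight."""
    return (hhmm // 100) * 60 + (hhmm % 100)


def delay_minutes(actual: pd.Series, scheduled: pd.Series) -> pd.Series:
    """Actual minus scheduled time in minutes, allowing for midnight rollover."""
    diff = hhmm_to_minutes(actual) - hhmm_to_minutes(scheduled)
    # Wrap into [-720, 720) so a 23:50 departure that leaves at 00:20 is +30.
    return (diff + 720) % 1440 - 720


# Toy rows: a late flight, a flight that crosses midnight, and an early flight.
flights = pd.DataFrame({
    "crsdeptime": [1730, 2350, 900],
    "deptime": [1745, 20, 855],
    "crsarrtime": [1900, 130, 1015],
    "arrtime": [1930, 200, 1010],
})
flights["depdelay"] = delay_minutes(flights["deptime"], flights["crsdeptime"])
flights["arrdelay"] = delay_minutes(flights["arrtime"], flights["crsarrtime"])
```

The wrap-around step assumes no flight is more than 12 hours early or late; a longer delay would come out with the wrong sign.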

Conclusion

The project demonstrates an end-to-end data pipeline, from data extraction in Hive to predictive modeling in Python.
XGBoost emerged as the best-performing model, delivering high accuracy in predicting flight delays.
This work provides a scalable approach to handling large datasets and implementing machine learning solutions in a production-like environment.

Extra Steps & Credits

Extra Verification: An additional check was performed by calculating depdelay and arrdelay using actual versus scheduled times.
Team Effort: Special thanks to all team members for their contributions and collaboration throughout the project.

Thank you for exploring our project! If you have any questions or feedback, please feel free to reach out.
