This project involves analyzing the famous Titanic dataset to predict passenger survival using various machine learning models. The goal was to achieve the highest prediction accuracy possible.
- Project Overview
- Dataset
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Model Building
- Model Evaluation
- Kaggle Submission
- Conclusion
- How to Run the Project
- Acknowledgments
- License
The Titanic Survival Analysis project aimed to predict whether a passenger survived the disaster using features such as age, gender, ticket class, and other attributes. The analysis includes EDA, feature engineering, model building, and evaluation.
The Titanic dataset used for this project contains the following key features:
- PassengerId: Unique identifier for each passenger
- Pclass: Ticket class (1st, 2nd, or 3rd)
- Sex: Gender of the passenger
- Age: Age of the passenger
- SibSp: Number of siblings/spouses aboard
- Parch: Number of parents/children aboard
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number (if available)
- Embarked: Port of embarkation (C, Q, S)
The test set lacks the Survived column, requiring the model to predict survival outcomes.
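As a quick sketch, the expected schema can be checked after loading the CSVs. The tiny stand-in frames below are illustrative only; in the notebook the data would come from `pd.read_csv("train.csv")` and `pd.read_csv("test.csv")` (the standard Kaggle file names).

```python
import pandas as pd

# Feature columns listed above; train.csv additionally contains "Survived".
FEATURES = ["PassengerId", "Pclass", "Sex", "Age", "SibSp",
            "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

def check_schema(df: pd.DataFrame, is_train: bool) -> bool:
    """Return True if df has the expected Titanic columns."""
    expected = set(FEATURES) | ({"Survived"} if is_train else set())
    return expected.issubset(df.columns)

# Tiny stand-in rows (not the real data).
train = pd.DataFrame([[1, 3, "male", 22.0, 1, 0, "A/5 21171", 7.25, None, "S", 0]],
                     columns=FEATURES + ["Survived"])
test = pd.DataFrame([[892, 3, "male", 34.5, 0, 0, "330911", 7.83, None, "Q"]],
                    columns=FEATURES)
```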
Key insights gained from EDA:
- Passengers in higher ticket classes had a markedly higher survival rate.
- Women had a significantly higher survival rate compared to men.
- Passengers with lower ticket fares had lower survival rates.
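These patterns can be reproduced with a simple `groupby`. The toy frame below mimics the relevant columns; it is not the real data, so the exact rates differ from the notebook's.

```python
import pandas as pd

# Toy sample mimicking the Titanic columns (not the real data).
df = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female"],
    "Pclass":   [1, 2, 3, 1, 3, 3],
    "Fare":     [80.0, 26.0, 7.9, 52.0, 8.1, 7.8],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Survival rate by gender and by ticket class.
by_sex = df.groupby("Sex")["Survived"].mean()
by_class = df.groupby("Pclass")["Survived"].mean()
print(by_sex)
print(by_class)
```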
The following transformations and feature engineering techniques were applied:
- Handling missing values in columns like `Age`, `Cabin`, and `Embarked`.
- Creating new features based on ticket class and family size.
- Encoding categorical variables.
- Normalizing numerical features to improve model performance.
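The steps above can be sketched on a toy frame. The specific choices here (median imputation, a `FamilySize` column, min-max scaling) are common options and assumptions for illustration; the notebook's exact transformations may differ.

```python
import pandas as pd

# Toy frame with the columns the steps above touch (not the real data).
df = pd.DataFrame({
    "Age":      [22.0, None, 38.0],
    "Embarked": ["S", None, "C"],
    "Sex":      ["male", "female", "female"],
    "SibSp":    [1, 0, 1],
    "Parch":    [0, 0, 2],
    "Fare":     [7.25, 8.05, 71.28],
})

# Handle missing values: median for Age, mode for Embarked.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# New feature: family size = siblings/spouses + parents/children + self.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Encode categorical variables.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df = pd.get_dummies(df, columns=["Embarked"], prefix="Embarked")

# Normalize a numerical feature (min-max scaling as one option).
df["Fare"] = (df["Fare"] - df["Fare"].min()) / (df["Fare"].max() - df["Fare"].min())
```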
Various machine learning models were tested, including:
- Logistic Regression
- Random Forest
- XGBoost
- CatBoost
- MLP Classifier
Hyperparameter tuning was performed to optimize model performance. The Random Forest Classifier (after resampling) provided the most accurate and balanced predictions.
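A minimal sketch of the tuning workflow, using `GridSearchCV` over a Random Forest. The synthetic features, labels, and the small parameter grid are placeholders; the notebook's actual search space and resampling step are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in features/labels (the notebook uses the engineered Titanic features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Small illustrative grid; the real search space may differ.
param_grid = {"n_estimators": [100, 200], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```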
The best-performing model achieved an accuracy score of 0.77 on Kaggle. The Random Forest Classifier (Resampled) was finalized as the best model based on evaluation metrics.
- Metrics table for comparison (`Accuracy`, `Precision`, and `Recall`)
- Confusion matrix
- ROC-AUC curve for the best-performing model
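These metrics come straight from `sklearn.metrics`; the labels and predictions below are hypothetical, just to show the calls.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

# Hypothetical ground truth and predictions (not the project's results).
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
cm = confusion_matrix(y_true, y_pred)   # [[TN, FP], [FN, TP]]
print(acc, prec, rec)
print(cm)
```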
A final submission was made to Kaggle, achieving a prediction score of 0.77.
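The submission file Kaggle expects for this competition has exactly two columns, `PassengerId` and `Survived`. A minimal sketch (the IDs and predictions here are placeholders; real IDs come from `test.csv`):

```python
import io

import pandas as pd

# Hypothetical predictions for three test-set passengers.
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# Written to an in-memory buffer here; the notebook would use
# submission.to_csv("submission.csv", index=False).
buf = io.StringIO()
submission.to_csv(buf, index=False)
csv_text = buf.getvalue()
print(csv_text)
```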
- Effective feature engineering and hyperparameter tuning significantly contributed to the model's performance.
- The test-set prediction score of 0.77 reflects a solid, though improvable, predictive model.
- Further improvements could involve advanced ensemble methods.
- Clone this repository:
git clone <repository-url>
- Navigate to the project directory:
cd Titanic_Survival_Analysis
- Install the required dependencies:
pip install -r requirements.txt
- Run the Jupyter Notebook:
jupyter notebook Titanic_Survival_Analysis.ipynb
App Link: Click Here
- Kaggle for providing the Titanic dataset.
- Data Science Community for continuous learning and support.
- This project is licensed under the MIT License. See the LICENSE file for more information.


