Estimation of obesity levels based on eating habits and physical condition

Cohort: 7, Team: ML10

Members

Overview

We have access to a dataset of 2111 individuals that records their obesity Level along with 17 attributes related with eating habits, physical condition, and demographics.

Business Objective

World Health Organization (WHO) is interested in understanding key features associated with obsesity levels in three different locations - Mexico, Peru and Colombia. Using the provied dataset, we want to understand the main factors that contribute to obesity levels. Once the top factors are identified, WHO can make a decision about focusing educational initiatives and allocating financial resources towards the identified factors in order to achieve the most improvement in health outcomes.

Therefore, the key deliverable for us is the list of top 3 factors that are most strongly correlated with obesity levels.

Dataset details

The data consist of 2111 individuals with 17 attributes recorded for each individual. The list of attributes is as follows:

Variable Name	Type	Description	Category
Gender	Categorical	-	Demographic
Age	Continuous	-	Demographic
Height	Continuous	-	Other
Weight	Continuous	-	Other
family_history_with_overweight	Categorical	Has a family member suffered or suffers from overweight?	Family History
FAVC	Categorical	Do you eat high caloric food frequently?	Eating Habits
FCVC	Integer	Do you usually eat vegetables in your meals?	Eating Habits
NCP	Continuous	How many main meals do you have daily?	Eating Habits
CAEC	Categorical	Do you eat any food between meals?	Eating Habits
SMOKE	Categorical	Do you smoke?	Lifestyle
CH2O	Continuous	How much water do you drink daily?	Eating Habits
SCC	Categorical	Do you monitor the calories you eat daily?	Eating Habits
FAF	Continuous	How often do you have physical activity?	Lifestyle
TUE	Integer	How much time do you use technological devices such as cell phone, videogames, television, computer and others?	Lifestyle
CALC	Categorical	How often do you drink alcohol?	Lifestyle
MTRANS	Categorical	Which transportation do you usually use?	Lifestyle
NObeyesdad	Categorical	Obesity level	Target

Potential risks and uncertainty

Some important factors such as genetic pre-disposition to obesity, presence of diabetes, etc. are missing from the data and will therefore be ignored.
While the models will tell us the most important features that predict obesity, the relationships are not necessarily causal, which means that improving that factor may not reduce obesity levels.
BMI (weight/height^2) has been used to define the target categories, and both weight and height have also been included in the data as predictors. From our initial analysis, we see that weight is the most important predictor for each category as expected. Therefore, it's unclear if weight and height should be included as predictors.
Any insights we draw from these data will be geographically limited to Latin America and not transferable to other geographies.
Only 23% data are directly collected from people, the rest have been synthetically generated. Synthetic data may not be representative of the real population.
We have 7 categories in our target variable. Some of these may be hard to distinguish.
Some of the variables in the data are self-reported and may contain biases inherent in self-reported data. For example, a variable like FAVC (Do you eat high caloric food frequently?) is difficult to self-assess and report accurately.

Methodology

The overall plan to tackle the project is as follows:

Exploratory Data Analysis (EDA): understand each feature, explore relationships between features, understand class imbalances, analyze any missing data, and to generate ideas for feature engineering and data-pre-processing.
Unsupervised analysis: perform unsupervised clustering of data and analyze the key features contributing to each cluster.
Develop and evaluate predictive ML models:
- Create a baseline logistic regression model
- Evaluate the performance of the model using metrics such as precision, recall, and confusion matrices.
- Finetune baseline model and try other algorithms such as Random Forest Classifier and XGBoost classifier
- Compare performance of the more complex models with the baseline model.
Choose the best model: Compare performance of different models to choose the best one, calculate feature importance of input features
Make recommendations and suggestions for the executive team at WHO.

Git structure

├── data
├──── raw
├── experiments
├── models
├── images
├── README.md
└── .gitignore

Data: Contains the raw data.
Experiments: A folder containing ipython notebook for data exploration and experiments.
Models: A folder containing the final trained model
Images: Contain all images used in the README.md file
README: This file!
.gitignore: Files to exclude from this folder (e.g., large data files).

Technical stack

scikit-learn library to fit and evaluate supervised and unsupervised ML models.
pandas and numpy: to load and explore the dataset and basic data manipulations
matplotlib: to create visualizations

Task assignment

Shefali Lathwal: Exploratory data analysis
Suni Bek: Exploratory data analysis
Rameez Rosul: Unsupervised data analysis
Olalekan Kadri: Baseline logistic regression and experiments (including SHAP analysis)
Vinushiya Shanmugathasan: Experiment with Random Forest model (including SHAP analysis)
Eka Dwipayana: Experiment with XGBoost and Neural Network models (including SHAP analysis)

Exploratory data analysis

Dataset overview

The dataset contains 1 target variable, NObeyesdad, and 16 feature variables
There are a total of 2111 data points.
8 features are of type object, and 8 features are of type float
There are no missing values in any variable
The target category consists of seven different weight classes. The classes and number of participants in each class are shown below.

Key observations

People with a family history of obesity are much more likely to be overweight or obese, especially in the higher obesity classes. Those without a family history are mostly normal weight or underweight.
People who usually walk instead of other means of transportation tend to have normal weight, and people who use automobiles tend to belong to higher weight categories.
How often someone drinks alcohol does not seem to be related to weight class.
People who "sometimes" eat between meals tend to belong to higher weight categories, and ones who report eating between meals "frequently", and "always" tend to have normal weight or insufficient weight. This might point to a bias in self-reported survey metrics where people who belong to higher weight categories, might tend to underestimate how often they eat between meals.
There are only 44 people who report smoking out of 2111 in the survey, which is a much lower percentage that would be expected based on public health surveys.
Height and weight have a correlation of 0.46. No other features have signification correlation with each other.

Possible biases in the data due to synthetic data generation

Classes Obesity_Type_II and Obesity_Type_III have very skewed gender distributions - There are only two female participants in Obesity_Type_II and only one male participant in Obesity_Type_III.
All age ranges are not equally represented in each class. For Obesity_Type_III, Insufficient_Weight, and Normal_Weight in particular, the age range is quite narrow, compared to the other four classes, indicating possible bias in the data.
For Obesity_Type_III, we also see two distinct groups with weight, which is different from all other classes and may indicate issues with synthetic data generation.
Most of the synthetic data generated belong to Obesity_Level_II and Obesity_Level_III classes, which is the possible reason that there is a gender skew, two distinct weight groups and two distinct age groups in these classes. Therefore, any interpretation on these two classes may not be reliable.

Unsupervised analysis

Clustering Summary (K-Prototypes) This project applies K-Prototypes to group individuals based on lifestyle behaviors. Although clustering is unsupervised, the groups show alignment with obesity categories.

Cluster Insights Cluster 0 — Active & Balanced Features: FAF (activity), NCP (meal count), Age Pattern: High activity, diverse weight outcomes Dominant class: Insufficient Weight Insight: Health-conscious and active.

Cluster 1 — Moderate Activity, Overweight-Prone Features: NCP, Age, FAF Pattern: Slightly low activity, moderate lifestyle Dominant class: Overweight Level I Insights: Possible early-stage overweight,opportunity for preventive habits.

Cluster 2 — Low Hydration, Obesity II Features: Age, CH₂O (water intake), FAF Pattern: Older group, low hydration + low activity Dominant class: Obesity Type II Insights: overweight; hydration and activity are factors.

Cluster 3 — Inactive & Severely Obese Features: FAF (very low), NCP, Age Pattern: Lowest physical activity Dominant class: Obesity Type III Insights: High-risk group

Model Development and Evaluation

Model 1: Logistic Regression

The experiment was conducted by splitting the dataset into training and test dataset, stratified. Logistic regression model was used. With accuracy of 88%, the following feature importance were found:

Top 3 important features for Insufficient_Weight are : Weight, Height and Age
Top 3 important features for Normal_Weight are : Weight, Height and CAEC_sometimes
Top 3 important features for Obesity_Type_I are : Weight, Height and FCVC
Top 3 important features for Obesity_Type_II are : Weight, Height and Gender_Male
Top 3 important features for Obesity_Type_III are : Weight, FCVC and Gender_Male
Top 3 important features for Overweight_Level_I are : Weight, Height and FCVC
Top 3 important features for Overweight_Level_II are : Gender_Male, Age and FCVC

The SHAP feature important plot is shown as follows:

Note that weight or height is common important feaures to all target classes except overweight_Level_II. This is an insight that needs to further checked with other models to see if the same pattern is repeated.

Model 2: Random Forest

The second experiment involved training a random forest model, where the accuracy was about 93%.
The top three features that highly impacted the overall model are the following:
- Weight, FCVC and age
These are the top three features which impacted each class:
- Class 0 - Insufficient Weight: weight, age, family history with overweight
- Class 1 - Normal Weight: weight, CAEC_sometimes, age
- Class 2 - Obesity Type I: weight, FCVC, height
- Class 3 - Obesity_Type_II: weight, age, gender_male
- Class 4 - Obesity_Type_III: weight, FCVC, gender_male
- Class 5 - Overweight_Level_I: weight, age, FCVC
- Class 6 - Overweight_Level_II: weight, age, height

Model 3: XGBoost

The third model performed XGBoost, where the accuracy was around 95.27% with no sign of overfitting (by tightening the regularization).
The process of classification was import all the libraries required, read dataset, map the target, do train tes split, create pipeline that scaling the numerical features and do one hot encoding for categorical features, do hyperparameter tunning to find the best parameters value based on model's performance, train the final model with the best parameters value and calculate all the metrics. The train versus testing loss is shown in the following image, where the decrease of train loss is followed by the decrease of testing loss. It means there is no sign of overfitting.

The result of model metrics is detailed with confussion matrix as follows:

The SHAP feature importance showed:

Top 3 important features for Insufficient_Weight are : Weight, Height and Age
Top 3 important features for Normal_Weight are : Weight, Height and CH2O
Top 3 important features for Obesity_Type_I are : Weight, Height and FCVC
Top 3 important features for Obesity_Type_II are : Weight, Height and CH2O
Top 3 important features for Obesity_Type_III are : Weight, Gender_Female and FCVC
Top 3 important features for Overweight_Level_I are : Weight, Height and FCVC
Top 3 important features for Overweight_Level_II are : Weight, Age, Height

The plot of SHAP feature importance for Obesity Type III class is presented as follows:

Model 4: Neural Network with PyTorch

The third model carried out Nural Network with PyTorch, where the accuracy was around 95.27% with no sign of overfitting (by tightening the regularization with dropout rate optimization).
The process of classification was import all the libraries required, read dataset, map the target, do train tes split, create pipeline that scaling the numerical features and do one hot encoding for categorical features, create PyTorch dataloader, create neural network architecture, assign optuna for hyperparameter tunning to find the best parameters value based on model's performance, train the final model with the best parameters value and calculate all the metrics. The calculation of all metrics through all classes is demonstrated bellow:

The result of model metrics is detailed with confussion matrix as follows:

The SHAP feature importance showed:

Top 3 important features for Insufficient_Weight are : Weight, Height and Family_History_with_Overweight_Yes
Top 3 important features for Normal_Weight are : Weight, Height and CAEC_Sometimes
Top 3 important features for Obesity_Type_I are : Weight, Height and CAEC_Sometimes
Top 3 important features for Obesity_Type_II are : Weight, Height and Age
Top 3 important features for Obesity_Type_III are : Weight, Height and Gender_Male
Top 3 important features for Overweight_Level_I are : Weight, Height and Family_History_with_Overweight_Yes
Top 3 important features for Overweight_Level_II are : Weight, Age and Family_History_with_Overweight_Yes

The plot of SHAP feature importance for Obesity Type III class is presented as follows:

Legend note:

Class 0 is Insufficient Weight
Class 1 is Normal Weight
Class 2 is Obesity Type I
Class 3 is Obesity Type II
Class 4 is Obesity Type III
Class 5 is Overweight Level I
Class 6 is Overweight Level II

Models Summary

The resume of all 4 models performance are detailed as follows:

Model	Accuracy	Precision	Recall	F1-Score
Logistic Regression	88.00	88.00	88.00	88.00
Random Forest	93.14	94.00	93.00	93.00
XGBoost	95.27	95.49	95.27	95.31
Neural Network	95.27	95.13	95.15	95.09

Conclusions and Future Directions

Eating (and water-drinking) habits occur most frequently as the key factors.
Education resources and funding should be dedicated to nutritional programs and improving access to healthier food options.
-Promote water intake over soda, sweetened drinks, and alcohol
-Increase access to fruits and vegetables
-Decrease access to high-calorie, low-nutrient foods
We observed biases in the data likely due to synthetic oversampling. Therefore, WHO should allocate more resources to collecting more representative data from the general population.

Team Videos

Video Link
Eka Dwipayana Video
Olalekan Kadri Video
Rameez Rosul Video
Shefali Lathwal Video
Suni Bek Video
Vinushiya Shanmugathasan Video

References

Dataset has been sourced from UC Irvine Machine Learning Repository
Publication linked to the dataset

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation