A project submission to Nanyang Technological University for the course SC1015 (Introduction to Data Science & Artificial Intelligence).
Video presentation link: https://youtu.be/k0Ayb1S2R-o
The dataset, titled "Customer Personality Analysis", was obtained from Kaggle.
- Supermarkets attract a diverse range of customers with differing preferences and needs. Effective advertising requires an approach tailored to the needs and wants of each customer.
- Our chosen problem statement: How can supermarkets leverage machine learning to identify customer segments based on customer attributes?
- With many different clustering algorithms available, how can we identify the model that best segments the supermarket's customers?
- Exploratory Data Analysis was performed to identify required preprocessing steps.
- Preprocessing was done to clean and prepare the dataset for the models. Steps such as the replacement of null values, removal of outliers, scaling and one-hot encoding were performed.
- Data Visualisation was used to understand subtle relationships and distributions within the dataset.
- Dimensionality Reduction was achieved through Principal Component Analysis.
- The optimal number of clusters was identified using the Elbow Method, a Hierarchical Graph and the Gap Statistic.
- 6 Clustering Algorithms across 5 Clustering methods were employed on the dataset.
- Evaluation was performed on these 6 Clustering Algorithms using the Silhouette Score, Calinski-Harabasz Index and Davies-Bouldin Index.
- Profiling of the identified clusters was performed based on their demographic and behavioural characteristics.
- Recommendations for the supermarkets were drawn from the results.
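
The steps above can be sketched end to end with scikit-learn. This is a minimal illustration rather than the notebook's actual code: the data is synthetic, the column names (`income`, `age`, `spend`, `education`) are hypothetical, outlier removal is omitted for brevity, and the choices of 3 principal components and `k = 4` are assumptions made only so the example runs.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the customer table (hypothetical columns).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, n),
    "age": rng.integers(20, 80, n).astype(float),
    "spend": rng.gamma(2.0, 300.0, n),
    "education": rng.choice(["Basic", "Graduation", "PhD"], n),
})
# Inject a few nulls so the imputation step has something to do.
df.loc[rng.choice(n, 10, replace=False), "income"] = np.nan

numeric = ["income", "age", "spend"]
categorical = ["education"]

# Replace nulls with the median, scale numeric columns, one-hot encode categories.
preprocess = ColumnTransformer(
    [
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
X = preprocess.fit_transform(df)

# Dimensionality reduction with PCA (3 components is an assumption).
pca = PCA(n_components=3, random_state=0)
X_pca = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Elbow Method: inspect how the K-Means inertia falls as k grows.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca).inertia_
    for k in range(2, 9)
}
print("inertia by k:", inertias)

# Evaluate one candidate clustering with the three metrics.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_pca)
sil = silhouette_score(X_pca, labels)          # higher is better
ch = calinski_harabasz_score(X_pca, labels)    # higher is better
db = davies_bouldin_score(X_pca, labels)       # lower is better
print(f"silhouette={sil:.3f}  calinski_harabasz={ch:.1f}  davies_bouldin={db:.3f}")
```

On random data like this the scores carry no meaning; the point is only the shape of the workflow and the direction in which each metric should be read.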
Connectivity/ Hierarchical Clustering
- Agglomerative Clustering Model
Centroid/ Partition Clustering
- K-Means
- Mean Shift
Distribution Model
- Gaussian Mixture Model
Density Model
- Ordering Points To Identify the Clustering Structure (OPTICS)
Graph-based Model
- Spectral Clustering
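
As a rough sketch, the six algorithms listed above all have scikit-learn implementations that can be fitted through a common `fit_predict` call. The data here comes from `make_blobs` as a placeholder, and every hyperparameter (`k = 4`, `min_samples=10`, and the library defaults elsewhere) is an assumption, not the configuration used in the notebook.

```python
from sklearn.cluster import (
    OPTICS,
    AgglomerativeClustering,
    KMeans,
    MeanShift,
    SpectralClustering,
)
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Placeholder data: four synthetic blobs in two dimensions.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

k = 4  # assumed number of clusters
models = {
    "Agglomerative": AgglomerativeClustering(n_clusters=k),
    "K-Means": KMeans(n_clusters=k, n_init=10, random_state=0),
    "Mean Shift": MeanShift(),  # infers the number of clusters itself
    "Gaussian Mixture": GaussianMixture(n_components=k, random_state=0),
    "OPTICS": OPTICS(min_samples=10),  # labels noise points as -1
    "Spectral": SpectralClustering(n_clusters=k, random_state=0),
}

# Every estimator above exposes fit_predict, returning one label per sample.
labels = {name: model.fit_predict(X) for name, model in models.items()}

for name, lab in labels.items():
    n_found = len(set(lab) - {-1})  # ignore the OPTICS noise label
    print(f"{name}: {n_found} clusters")
```

Mean Shift and OPTICS do not take a cluster count, so the number of clusters they report can differ from `k`; that difference is one reason to compare the models with common evaluation metrics.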
- K-Means Clustering yielded the best results overall.
- Customers can be segmented into 4 clusters, each with its own demographic traits: Income, Age, Number of Children, Education (specifically Third Cycle).
- These clusters have differing spending behaviours: Receptivity to Campaigns, Complaint Tendencies, Highest Expenditure Product Categories and Preferred Purchase Avenue.
- More detailed conclusions can be found in the notebook.
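
Cluster profiles of this kind are typically produced by aggregating customer attributes per cluster label. The snippet below is a small illustration with pandas; the six rows are toy values made up for the example (not taken from the dataset or the notebook's results), and the column names are hypothetical.

```python
import pandas as pd

# Toy customer table with an already-assigned cluster label (illustrative values only).
customers = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "income": [30_000, 34_000, 62_000, 58_000, 90_000, 86_000],
    "children": [2, 1, 1, 0, 0, 0],
    "accepted_campaign": [0, 0, 1, 0, 1, 1],
})

# One row per cluster: size, average demographics and campaign acceptance rate.
profile = customers.groupby("cluster").agg(
    size=("income", "size"),
    mean_income=("income", "mean"),
    mean_children=("children", "mean"),
    campaign_rate=("accepted_campaign", "mean"),
)
print(profile)
```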
- https://neptune.ai/blog/clustering-algorithms
- https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
- https://www.analyticsvidhya.com/blog/2019/10/gaussian-mixture-models-clustering/
- https://analyticsindiamag.com/a-tutorial-on-various-clustering-evaluation-metrics/
- https://www.geeksforgeeks.org/ml-mean-shift-clustering/
- https://www.mygreatlearning.com/blog/introduction-to-spectral-clustering/
- https://www.datanovia.com/en/lessons/agglomerative-hierarchical-clustering/
- https://scentellegher.github.io/machine-learning/2020/01/27/pca-loadings-sklearn.html
- https://www.mikulskibartosz.name/pca-how-to-choose-the-number-of-components/
- https://www.kaggle.com/code/karnikakapoor/customer-segmentation-clustering
- https://www.kaggle.com/code/sonalisingh1411/customer-personality-analysis-segmentation
- https://www.kaggle.com/code/gaganmaahi224/9-clustering-techniques-for-customer-segmentation#Exploratory-Data-Analysis
- https://towardsdatascience.com/how-to-select-the-best-number-of-principal-components-for-the-dataset-287e64b14c6d
- https://towardsdatascience.com/clustering-evaluation-strategies-98a4006fcfc