How we built it
We began with a thorough Exploratory Data Analysis (EDA) of the dataset. This phase gave us critical insights into the Singlife insurance domain: the data's unique characteristics and challenges, and the policyholder behaviors and touchpoints that influence customer engagement and satisfaction during the insurance acquisition process. We then ran a rigorous data cleaning process, imputing missing values, one-hot encoding categorical variables, and scaling numerical features to a standard range. This step prepared the data for the analysis that followed and ensured its suitability for modeling within Singlife's context.
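The cleaning steps above can be sketched as a single scikit-learn preprocessing pipeline. This is an illustrative sketch on a toy frame; the column names (`annual_premium`, `num_touchpoints`, `channel`) are invented stand-ins, not the actual Singlife fields.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame standing in for the Singlife dataset (column names are invented).
df = pd.DataFrame({
    "annual_premium": [1200.0, np.nan, 950.0, 2100.0],
    "num_touchpoints": [3, 7, np.nan, 2],
    "channel": ["agent", "online", "online", np.nan],
})

numeric_cols = ["annual_premium", "num_touchpoints"]
categorical_cols = ["channel"]

preprocess = ColumnTransformer([
    # Numerical: median imputation, then standardisation.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Categorical: most-frequent imputation, then one-hot encoding.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

X = preprocess.fit_transform(df)
print(X.shape)  # 4 rows: 2 scaled numeric columns + one-hot channel columns
```

Bundling imputation, encoding, and scaling into one `ColumnTransformer` keeps the transformations fitted only on training data, which avoids leakage when the same pipeline is reused at prediction time.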
Next, we applied feature selection techniques, including Pearson correlation, point-biserial correlation, SelectKBest, and Lasso. These steps streamlined our dataset by identifying and retaining the features most predictive of Singlife policyholder interactions and behaviors while mitigating issues like multicollinearity.
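A minimal sketch of two of these selection methods, run on synthetic data rather than the real dataset: a univariate filter (`SelectKBest`) and an embedded method (Lasso, whose L1 penalty zeroes out weak coefficients). Point-biserial correlation is shown via `scipy.stats.pointbiserialr`; the `alpha` and `k` values are illustrative, not the ones we tuned.

```python
import numpy as np
from scipy.stats import pointbiserialr
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned feature matrix and binary target.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_redundant=2, random_state=0)
X = StandardScaler().fit_transform(X)

# Point-biserial correlation of one feature against the binary target.
r, p = pointbiserialr(y, X[:, 0])

# Univariate filter: keep the k features with the strongest ANOVA F-score.
kbest = SelectKBest(score_func=f_classif, k=5).fit(X, y)
kept_univariate = np.flatnonzero(kbest.get_support())

# Embedded method: the L1 penalty drives weak coefficients exactly to zero.
lasso = Lasso(alpha=0.05).fit(X, y)
kept_lasso = np.flatnonzero(lasso.coef_ != 0)

print("point-biserial r for feature 0:", round(r, 3))
print("SelectKBest kept:", kept_univariate)
print("Lasso kept:", kept_lasso)
```

Comparing the two kept sets is a quick sanity check: features that survive both a filter and an embedded method are usually the safest to retain.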
With a refined dataset in hand, the modeling stage commenced. We addressed class imbalance with the Synthetic Minority Over-sampling Technique (SMOTE) to improve the model's ability to generalize. We then rigorously evaluated two distinct models on their effectiveness in predicting policyholder behaviors: a Decision Tree pruned through hyperparameter tuning, and a Support Vector Machine (SVM).
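In practice SMOTE is typically applied via the `imbalanced-learn` library; the hand-rolled sketch below just illustrates its core idea on random data: each synthetic minority sample is an interpolation between a real minority sample and one of its k nearest minority-class neighbours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between
    each sample and one of its k nearest minority-class neighbours."""
    rng = rng or np.random.default_rng(0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # pick a minority sample at random
        j = idx[i, rng.integers(1, k + 1)]  # pick one of its k neighbours
        lam = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 3))            # 20 minority samples, 3 features
X_new = smote_sketch(X_min, n_new=30)
print(X_new.shape)  # (30, 3)
```

Oversampling should happen only on the training split (after the train/test split), otherwise synthetic points derived from test samples leak into training.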
Upon careful analysis of the models' performance metrics, precision, recall, and accuracy, the Support Vector Machine consistently achieved the highest scores. We chose to focus on precision and recall because accuracy is a misleading metric for imbalanced datasets: a classifier that predicts every individual as a non-purchaser can still achieve high accuracy while failing to identify a single actual purchaser.
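The all-non-purchasers failure mode is easy to demonstrate on made-up labels: with 95 non-purchasers and 5 purchasers, predicting zero for everyone scores 95% accuracy yet zero recall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 95 non-purchasers (0) and 5 purchasers (1); the "classifier" predicts
# non-purchaser for everyone.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
prec = precision_score(y_true, y_pred, zero_division=0)

print(acc)   # 0.95 — looks great, but is meaningless here
print(rec)   # 0.0  — not a single purchaser identified
print(prec)  # 0.0  — no positive predictions at all
```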
We tuned the model's hyperparameters using GridSearchCV, whose built-in k-fold cross-validation assessed performance across multiple subsets of the training data, reducing the risk of overfitting. It also helped us identify the optimal combination of hyperparameters for the model.
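A sketch of that tuning step, again on synthetic data: the grid values below are illustrative placeholders, not the grid we actually searched, and `scoring="recall"` reflects the metric focus discussed above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the (resampled) training data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Illustrative candidate grid; every combination is tried.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# cv=5 runs 5-fold cross-validation for each combination, so every
# candidate is scored on five held-out folds of the training data.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="recall")
search.fit(X, y)

print(search.best_params_)            # best hyperparameter combination
print(round(search.best_score_, 3))  # mean recall across the 5 folds
```

`best_estimator_` is then refit on the full training set with the winning combination, ready for final evaluation on the untouched test set.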