Predicting Blood Donations- Drivendata

Data Analysis and Interpretation Capstone WEEK 1:

I’ll be doing the Blood Donation Prediction problem from DrivenData; the link is https://www.drivendata.org/competitions/2/warm-up-predict-blood-donations/

Here are more details:

  1. Research Question:
    This problem is hosted on DrivenData, and the dataset comes from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive. Given information about past donors, we want to predict whether or not a donor will give blood the next time the vehicle comes to campus. It’s a supervised machine learning classification problem.
  2. My motivation:
    This is my first time trying an actual machine learning problem, and I wanted to start easy and simple. That’s why I chose this problem, which I was able to grasp easily, and I will try to do my best.
  3. Potential Implications
    I have been working as a Business Analyst for the past 8 months, and my usual working domain is time series data. I have yet to work with other analytics domains, so this problem can help me understand a bit about other machine learning techniques.

K-means Cluster Analysis

Course 4 Week 4: Assignment

Introduction

This final week I am gonna talk about cluster analysis, and specifically the k-means cluster analysis technique. The goal of cluster analysis is to group or cluster observations into subsets based on the similarity of responses on multiple variables. Observations that have similar response patterns are grouped together to form clusters. That is, the goal is to partition the observations in a data set into a smaller set of clusters, where each observation belongs to only one cluster.

Cluster analysis is an unsupervised learning method. Meaning, there is no specific response variable included in the analysis.

Below is the assignment and since the Program Code is really really long, I’ve put it at the end of the blog post.

Assignment

A k-means cluster analysis was conducted to identify underlying subgroups of adolescents based upon the similarity of responses on 11 variables that represent characteristics that could have an impact on adolescent depression.

Clustered variables included binary variables indicating whether or not the adolescent had ever used alcohol or marijuana. Quantitative variables included a measure of problems with alcohol usage, a scale measuring engagement in deviant behavior (such as vandalism, other property damage, lying, stealing, running away from home, driving without parental permission, selling drugs and unexcused school absence), and scales measuring violent behavior, self-esteem, parental presence, parental activities, family connectedness, school connectedness and academic achievement (measured as grade point average). All of the clustering variables were standardized to a mean of 0 and standard deviation of 1.

Using simple random sampling, the data were split into a training set that included 70% (N=3201) of the observations and a test set that included 30% (N=1701). A series of k-means cluster analyses was conducted on the training set specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables accounted for by the clusters (the R-square value) was plotted for each of the nine cluster solutions in an elbow curve to provide visual guidance for choosing the number of clusters to interpret.
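The analysis itself was run in SAS; as a rough sketch of the same workflow, here is a minimal Python version using scikit-learn on synthetic stand-in data (the values, sample sizes and 11 variables here are assumptions for illustration, not the real AddHealth data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 11 clustering variables (real AddHealth data not shown)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 11))

X_std = StandardScaler().fit_transform(X)  # standardize to mean 0, sd 1
X_train, X_test = train_test_split(X_std, test_size=0.30, random_state=0)

# R-square for k = 1..9: proportion of total variance explained by the clusters
total_ss = ((X_train - X_train.mean(axis=0)) ** 2).sum()
r_square = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
    r_square.append(1 - km.inertia_ / total_ss)  # inertia_ = within-cluster SS
```

Plotting r_square against k produces the elbow curve, and the bends in it suggest candidate numbers of clusters.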

[Figure: elbow curve of R-square by number of clusters]

Now we clearly see bends in the curve at 2, 4, 6 and 7 clusters. We can’t create a scatterplot for the 2-cluster solution, so we check the variability of the data for the 4-, 6- and 7-cluster solutions: whether the clusters overlap, whether the patterns of means on the clustering variables are unique and meaningful, and whether there are significant differences between the clusters on our external validation variable, Depression Level.

Here are the plots of 4, 6 and 7:

[Figures: canonical variable scatterplots for the 4-, 6- and 7-cluster solutions]

The scatterplot of the canonical variables for the 4-cluster solution indicates that the observations are densely packed, with low within-cluster variance, and do not overlap significantly across clusters. The 6- and 7-cluster solutions, on the other hand, were for the most part distinct but with greater spread among the observations, suggesting higher within-cluster variance. Cluster 4 showed the least compacted observations of all the clusters. These plot results suggest that an optimal solution may have fewer than 4 clusters, so it is important that solutions with fewer than 4 clusters also be evaluated.

So we look into the cluster means table for the 4-cluster solution to observe the relations:

[Figure: cluster means table for the 4-cluster solution]

The means of the clustering variables show that compared to the other clusters, adolescents in cluster 3 had a relatively low likelihood of ever having used alcohol or marijuana, or of having problems with alcohol, deviant behavior, or violence. Furthermore, cluster 3 adolescents also showed higher self-esteem, school and family connectedness, and achieved the highest GPA scores compared with the other adolescent clusters.

Cluster 1 adolescents appear to be the most troubled adolescents, having the highest likelihood of alcohol and marijuana use, alcohol-related problems, deviant behavior, and violence. Additionally, they showed the lowest self-esteem, school, and family connectedness and lowest GPA scores.

Finally, in order to externally validate the clusters, an analysis of variance (ANOVA) was performed to test for significant differences between the clusters on adolescent depression. The boxplot clearly shows that the Depression Level is highest in cluster 1 and lowest in cluster 3.

[Figure: boxplot of Depression Level by cluster]

A Tukey test was used for post hoc comparisons between the clusters. Results indicated significant differences between each pair of clusters on Depression Level.
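The assignment runs the ANOVA and Tukey test in SAS; a minimal Python equivalent with SciPy (its `tukey_hsd` needs SciPy 1.8+), on hypothetical scores whose group means merely echo the reported pattern, might look like:

```python
import numpy as np
from scipy import stats

# Hypothetical depression scores for four clusters; only the pattern
# (cluster 1 highest, cluster 3 lowest) mirrors the reported results.
rng = np.random.default_rng(1)
c1 = rng.normal(15.4, 8.6, 200)
c2 = rng.normal(11.0, 6.0, 200)
c3 = rng.normal(6.2, 4.8, 200)
c4 = rng.normal(9.0, 5.5, 200)

f_stat, p_value = stats.f_oneway(c1, c2, c3, c4)  # overall ANOVA
tukey = stats.tukey_hsd(c1, c2, c3, c4)           # all pairwise comparisons
```

`tukey.pvalue[i, j]` then gives the post hoc p-value for the comparison between clusters i and j.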

[Figure: Tukey post hoc comparison results]

Adolescents in cluster 1 had the highest Depression Level (Mean=15.3777778, SD=8.5677158), and cluster 3 had the lowest Depression Level (Mean=6.2380637, SD=4.8378852).

And that’s the final assignment of this fourth course of this specialization. Below is the program code. Thanks for reading.

Program

GitHub Link

 

Running a Lasso Regression Analysis

Course 4 Week 3: Assignment

This week’s assignment involves another machine learning technique called lasso regression. Lasso regression is a penalized regression method, often used in machine learning to select a subset of variables. It is a supervised machine learning method. Specifically, LASSO is a shrinkage and variable selection method for linear regression models; LASSO is an acronym for Least Absolute Shrinkage and Selection Operator.

I again use Alcohol Use Without Supervision as my response (target) variable.
The candidate explanatory variables include gender; race; marijuana, cocaine, or inhalant use; regular smoking; availability of cigarettes in the home; whether or not either parent was on public assistance; and any experience with being expelled from school. Age, alcohol problems, deviance, violence, depression, self-esteem, parental presence, activities with parents, family and school connectedness, and grade point average are the quantitative variables.

PROGRAM

GitHub Link

__________________________________________________________________________________________

Data were randomly split into a training set that included 65% of the observations (N=2972) and a test set that included 35% (N=1600), out of 4572 total observations. The least angle regression (LAR) algorithm with k=10-fold cross-validation was used to estimate the lasso regression model on the training set, and the model was validated using the test set. The change in the cross-validation average (mean) squared error at each step was used to identify the best subset of predictor variables.
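The model above is estimated in SAS; as an illustration only, a comparable sketch in Python uses scikit-learn’s LassoLarsCV (least angle regression with cross-validation) on synthetic data with made-up dimensions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 candidate predictors, only 5 truly informative
X, y = make_regression(n_samples=4572, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.35, random_state=0)

# Least angle regression with 10-fold cross-validation
model = LassoLarsCV(cv=10).fit(X_train, y_train)
selected = np.flatnonzero(model.coef_)  # predictors the lasso kept
test_mse = np.mean((model.predict(X_test) - y_test) ** 2)
```

Predictors whose coefficients are shrunk exactly to zero are dropped, which is how the lasso performs variable selection.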

Here are the results:

[Figure: LAR selection summary with ASE and Test ASE at each step]

The ASE and Test ASE are the average squared error, which is the same as the mean squared error, for the training data and the test data. We see that at the beginning there are no predictors in the model, just the intercept. Then variables are entered one at a time, in order of the magnitude of the reduction in the mean, or average, squared error (we already know that the model’s R-square increases as explanatory variables are added). So they are ordered in terms of how important they are in predicting alcohol use. According to the lasso regression results, the most important predictor of alcohol use is marijuana use, followed by alcohol problems, deviant behavior and so on.

Moreover, a really brilliant graph which SAS forms is the following:

[Figure: coefficient progression plot]

The coefficient progression plot shows the change in the regression coefficients at each step, and the vertical line represents the selected model. This plot shows the relative importance of the predictor selected at each step of the selection process, how the regression coefficients changed with the addition of a new predictor at each step, and the steps at which each variable entered the model.

We also see how these variables are related to the response variable, i.e. negatively or positively: the variables below the horizontal zero line are negatively related to the response, and those above it are positively related.

The lower plot shows how the chosen selection criterion, in this example CVPRESS (the residual sum of squares summed across all the cross-validation folds in the training set), changes as variables are added to the model. Initially it decreases rapidly, then levels off to a point where adding more predictors doesn’t lead to much reduction in the residual sum of squares.

During the estimation process, marijuana use and age were most strongly associated with alcohol use, followed by deviant behavior and cocaine abuse. While marijuana use, age, and deviant behavior were negatively associated with alcohol use, cocaine abuse was positively associated with it.

Finally, the output below shows the R-Square and adjusted R-Square for the selected model and the mean square error for both the training and test data. It also shows the final 12 explanatory variables for our model and what their estimated regression coefficients would be for the selected model.

[Figure: fit statistics and parameter estimates for the selected model]

So that’s all from this week.

Running a Random Forest

Course 4 Week 2: Assignment

This week’s assignment involves another machine learning technique called random forests. Random forests are predictive models that allow for a data-driven exploration of many explanatory variables in predicting a response or target variable. Random forests provide importance scores for each explanatory variable and also allow you to evaluate any increase in correct classification as smaller and larger numbers of trees are grown.

Random forest is a data mining algorithm based on decision trees, but it proceeds by growing many trees, i.e. a decision tree forest, which in some ways directly addresses the problem of model reproducibility.

We again use Alcohol Use Without Supervision as the response (target) variable.
The candidate explanatory variables include gender; race; marijuana, cocaine, or inhalant use; regular smoking; availability of cigarettes in the home; whether or not either parent was on public assistance; and any experience with being expelled from school. The quantitative variables are age, alcohol problems, deviance, violence, depression, self-esteem, parental presence, activities with parents, family and school connectedness, and grade point average.

Program

GitHub Link

_______________________________________________________________

We get the following results from running the above code:

[Figure: baseline fit statistics]

The number of observations read from my data set was 6,504, while the number of observations used was 6,444. Within the baseline fit statistics output, you can see the misclassification rate of the random forest. Here we see that the forest misclassified 19.8% of the sample, meaning it correctly classified 80.2% of the sample.

Next, we have the importance table:

[Figure: variable importance table]

The variables are listed from highest importance to lowest importance in predicting alcohol use. In this way, random forests are sometimes used as a data reduction technique, where variables are chosen in terms of their importance to be included in regression and other types of statistical models. Here we see that some of the most important variables in predicting alcohol use include marijuana use, deviant behavior, regular smoking, cigarette availability, race, inhalant use, cocaine use, etc.
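The forest in the assignment is grown with SAS; an analogous illustration with scikit-learn, on synthetic data with placeholder variable names, would be:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary-outcome data standing in for the AddHealth predictors
X, y = make_classification(n_samples=2000, n_features=18, n_informative=6,
                           random_state=0)
names = [f"var{i}" for i in range(18)]  # hypothetical variable names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
accuracy = forest.score(X, y)  # training accuracy, not a generalization estimate

# Rank variables by importance, highest first
ranked = sorted(zip(forest.feature_importances_, names), reverse=True)
```

The `feature_importances_` array sums to 1, so each score can be read as a variable’s share of the forest’s total impurity reduction.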

To summarize, like decision trees, random forests are a type of data mining algorithm that can select from among a large number of variables, those that are most important in determining the target or response variable to be explained. Also, like decision trees, the target variable in a random forest can be categorical or quantitative. And the group of explanatory variables can be categorical or quantitative, or any combination.

Thus this concludes this week’s assignment.

Running a Classification Tree: Decision Tree

Course 4 Week 1: Assignment

This week’s assignment involves decision trees, and more specifically, classification trees. Decision trees are predictive models that allow for a data-driven exploration of nonlinear relationships and interactions among many explanatory variables in predicting a response or target variable. When the response variable is categorical (two levels), the model is called a classification tree. Explanatory variables can be quantitative, categorical, or both. Decision trees create segmentations or subgroups in the data by applying a series of simple rules or criteria over and over again, choosing the variable constellations that best predict the response (i.e. target) variable.

Since a new dataset called treeaddhealth was created by the course instructor for this particular week’s assignment, and I am a bit overworked this week, I am gonna use the same dataset and do my assignment with the specific changes noted below. In the course videos, the decision tree was formed for a variable named TREG1 to create a model that correctly classifies those people who have smoked on a regular basis. Several explanatory variables, both categorical and quantitative, were used for that model, and we then saw the results obtained.

In my assignment, I am choosing Alcohol Use Without Supervision as my target variable, to see what the model predicts its final value would be. But before modeling, I changed the category values for the alcohol use variable, alcevr1: I recoded 0 (No) to 2, while 1 (Yes) remained as it is. The reason for this change was mentioned in the video: SAS predicts the lowest value of the target variable, which would make the model event level zero, or no. So I need to recode the no’s for alcohol use to a two, keeping one equal to yes. To interpret the trees correctly, it’s important to pay attention to this detail. I did this using a PROC SQL statement; the rest can be seen below in the program:
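The recode itself is just a value swap; outside SAS, the same step could be sketched in Python like this (the array values are made up for illustration):

```python
import numpy as np

# Hypothetical raw values of alcevr1: 0 = No, 1 = Yes
alcevr1 = np.array([0, 1, 0, 1, 1, 0])

# Recode No from 0 to 2 so the lowest value, and hence the model
# event level, becomes 1 (Yes)
recoded = np.where(alcevr1 == 0, 2, alcevr1)
```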

GitHub Link

Decision tree analysis was performed to test nonlinear relationships among a series of explanatory variables and a binary, categorical response variable. All possible separations (categorical) or cut points (quantitative) are tested. For the present analyses, the entropy “goodness of split” criterion was used to grow the tree and a cost complexity algorithm was used for pruning the full tree into a final subtree.
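The same recipe, entropy splits followed by cost-complexity pruning, can be sketched with scikit-learn. This is an illustration on synthetic data, not the SAS PROC HPSPLIT run from the assignment, and it selects the pruned subtree by held-out accuracy rather than the 1-SE rule SAS applies:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the full tree with the entropy "goodness of split" criterion
full = DecisionTreeClassifier(criterion="entropy", random_state=0)
full.fit(X_train, y_train)

# Cost-complexity pruning: candidate alphas for ever-smaller subtrees
path = full.cost_complexity_pruning_path(X_train, y_train)
alphas = np.clip(path.ccp_alphas, 0, None)  # guard against tiny negatives

# Keep the pruned subtree with the best held-out accuracy
best = max(
    (DecisionTreeClassifier(criterion="entropy", ccp_alpha=a,
                            random_state=0).fit(X_train, y_train)
     for a in alphas),
    key=lambda tree: tree.score(X_test, y_test),
)
```

Larger values of `ccp_alpha` penalize tree size more heavily and yield smaller subtrees, mirroring how pruning collapses the full tree into a final subtree.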

The following explanatory variables were included as possible contributors to a classification tree model evaluating alcohol use (my response variable): age; gender; race/ethnicity (Hispanic, White, Black, Native American and Asian); smoking experimentation; marijuana use; cocaine use; inhalant use; availability of cigarettes in the home; whether or not either parent was on public assistance; any experience with being expelled from school; alcohol problems; deviance; violence; depression; self-esteem; parental presence; parental activities; family connectedness; school connectedness; and grade point average.

Now interpreting the results obtained:

[Figure: Model Information table]

We can see in the Model Information table that the decision tree SAS grew has 189 leaves before pruning and 7 leaves after pruning.

Model event level lets us confirm that the tree is predicting the value one, that is yes, for our target variable Alcohol Use.

Notice too that the number of observations read from my data set was 6,564 while the number of observations used was only 4,575. 4,575 represents the number of observations with valid data for the target variable, and each of the explanatory variables. Those observations with missing data on even one variable have been set aside.

Next, by default PROC HPSPLIT creates a plot of the cross-validated average squared error (ASE) against the number of leaves of each of the trees generated on the training sample.

[Figure: cross-validated ASE by number of leaves]
A vertical reference line is drawn for the tree with the number of leaves that has the lowest cross-validated ASE, in this case the 7-leaf tree. The horizontal reference line represents that minimum ASE plus one standard error. When pruning via the cost-complexity method, the one-standard-error (1-SE) rule lets us select a smaller tree that has only a slightly higher error rate than the minimum ASE: selecting the smallest tree whose ASE falls below the horizontal reference line is in effect implementing the 1-SE rule. By default, SAS uses this rule to select and display the final tree.


Following the pruning plot, which chose a model with 14 split levels and 7 leaves, the final, smaller tree is presented, showing our model with splits on alcohol problems, marijuana use, and deviant behavior.

[Figure: final classification tree]

Alcohol problems was the first variable to separate the sample into two subgroups. Adolescents with an alcohol problems score greater than or equal to 0.060 (range 0 to 6) were more likely to have used alcohol than adolescents not meeting this cutoff (100%), which is easy to interpret: since the values of this variable are discrete from 0 to 6, an adolescent whose alcohol problems score is anything but 0 must clearly be using alcohol without supervision, our target variable. For such adolescents the probability becomes 1, or 100%, and this is clearly seen in our decision tree as well.

Of the adolescents with scores less than 0.060, a further subdivision was made on the dichotomous variable of marijuana use. Adolescents who reported having used marijuana were more likely to have been using alcohol without supervision, at about 76%.

Among adolescents with an alcohol problems score less than 0.060 who had never used marijuana, a further subdivision was made on the deviance score. Adolescents with a deviance score greater than or equal to 0.270 were less likely to have been using alcohol without supervision (56%), and adolescents with a deviance score less than 0.270, with no alcohol problems and no marijuana use, were less likely still (78%).

SAS also generated a model-based confusion matrix which shows how well the final classification tree performed.

[Figure: model-based confusion matrix]

The total model correctly classifies 71% of those who have used alcohol without supervision (one minus the error rate of 0.29) and 81% of those who have not (one minus the 19% error rate). So the model does a fair job of predicting both those who used alcohol during adolescence and those who did not.
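The per-class rates quoted above come straight from the confusion matrix; the arithmetic can be checked in a few lines of Python (the cell counts here are hypothetical, chosen only to reproduce the stated rates):

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual (No, Yes), cols = predicted
cm = np.array([[810, 190],   # actual No:  81% correctly classified
               [290, 710]])  # actual Yes: 71% correctly classified

# Per-class accuracy = diagonal count / row total (i.e. 1 minus the error rate)
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
```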

Finally, SAS also shows a variable importance table. Because decision trees attempt to maximize correct classification with the simplest tree structure, it’s possible for variables that do not represent primary splits in the model to still be of notable importance in predicting the target variable. When potential explanatory variables are, for example, highly correlated or provide similar information, one of them tends not to make the final cut. The absence of the alternate variable from the model does not necessarily suggest that it’s unimportant, but rather that it’s masked by the other.

[Figure: variable importance table]

To evaluate this phenomenon of masking, an importance measure is calculated for the primary splitting variables and for competing variables that were not selected as primary predictors in our final model. The importance score measures a variable’s ability to mimic the chosen tree and to play the role of a stand-in for variables appearing as primary splits. Here we see that alcohol problems and marijuana use are the two most important variables for this particular model.

That is it for this week’s assignment.
