Course 4 Week 4: Assignment
Introduction
This final week I’m going to talk about cluster analysis, and in particular the k-means cluster analysis technique. The goal of cluster analysis is to group, or cluster, observations into subsets based on the similarity of responses on multiple variables. Observations that have similar response patterns are grouped together to form clusters. That is, the goal is to partition the observations in a data set into a smaller set of clusters, with each observation belonging to only one cluster.
Cluster analysis is an unsupervised learning method, meaning there is no specific response variable included in the analysis.
Below is the assignment. Since the program code is really long, I’ve put it at the end of the blog post.
Assignment
A k-means cluster analysis was conducted to identify underlying subgroups of adolescents based upon the similarity of responses on 11 variables that represent characteristics that could have an impact on adolescent depression.
Clustered variables included two binary variables, indicating whether or not the adolescent had ever used alcohol or marijuana. Quantitative variables included a variable measuring problems with alcohol usage, a scale measuring engagement in deviant behavior (such as vandalism, other property damage, lying, stealing, running away from home, driving without parental permission, selling drugs and unexcused school absence), and scales measuring violent behavior, self-esteem, parental presence, parental activities, family connectedness, school connectedness and academic achievement (measured as grade point average). All of the clustered variables were standardized to a mean of 0 and a standard deviation of 1.
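To make the standardization step concrete, here is a minimal sketch in plain Python. It is not the program used for this assignment: the `standardize` helper and the `gpa` values are made up for illustration, and it assumes the population standard deviation (some tools divide by n-1 instead).

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    m = mean(values)
    sd = pstdev(values)  # assumption: population SD; a sample SD is also common
    return [(v - m) / sd for v in values]

# Made-up values for illustration only, not the assignment data
gpa = [2.0, 2.5, 3.0, 3.5, 4.0]
z = standardize(gpa)
print([round(v, 3) for v in z])
```

Standardizing matters here because k-means relies on Euclidean distance, so a variable measured on a larger scale would otherwise dominate the clustering.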
Using simple random sampling, data were split into a training set that included 70% (N=3201) of the observations and a test set that included 30% (N=1701) of the observations. A series of k-means cluster analyses was conducted on the training set specifying k=1 to 9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (R-square value) was plotted for each of the nine cluster solutions in an elbow curve to provide visual guidance for choosing the number of clusters to interpret.
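The procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are synthetic two-dimensional points (not the adolescent data), the helper names (`kmeans`, `r_square`, and so on) are hypothetical, and the k-means here is a bare-bones Lloyd's algorithm written in plain Python rather than the procedure from the actual program. R-square is taken as 1 minus the within-cluster sum of squares divided by the total sum of squares.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, rng, n_iter=50):
    """Bare-bones Lloyd's algorithm; returns (centers, labels)."""
    centers = rng.sample(points, k)  # start from k randomly chosen observations
    labels = [0] * len(points)
    for _ in range(n_iter):
        # assign every observation to its nearest center
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # move each center to the mean of its members (keep it if empty)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(centroid(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

def r_square(points, centers):
    """Share of total variance accounted for by the clusters."""
    grand = centroid(points)
    tss = sum(sq_dist(p, grand) for p in points)
    wss = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return 1 - wss / tss

rng = random.Random(0)
# Synthetic data: three blobs of 50 points each (illustration only)
data = [[rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)]
        for cx, cy in [(0, 0), (4, 0), (2, 4)] for _ in range(50)]

# Simple random 70/30 split into training and test sets
rng.shuffle(data)
cut = len(data) * 7 // 10
train, test = data[:cut], data[cut:]

# R-square for k = 1..9; keep the best of a few random starts per k
r2_by_k = {}
for k in range(1, 10):
    r2_by_k[k] = max(r_square(train, kmeans(train, k, rng)[0])
                     for _ in range(5))
    print(k, round(r2_by_k[k], 3))
```

Plotting `r2_by_k` against k gives the elbow curve: the point where the gain in R-square starts to flatten is a candidate for the number of clusters.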

Now we can see bends in the curve at 2, 4, 6 and 7 clusters. We can’t create a scatter plot for the 2-cluster solution, so we check the variability of the data in the 4-, 6- and 7-cluster solutions to see whether the clusters overlap, whether the patterns of means on the clustering variables are unique and meaningful, and whether there are significant differences between the clusters on our external validation variable, Depression Level.
Here are the plots for the 4-, 6- and 7-cluster solutions:



A scatterplot of the 4 canonical variables by cluster indicates that the observations are densely packed, with low within-cluster variance, and do not overlap significantly with the other clusters. Clusters 6 and 7, on the other hand, are for the most part distinct, but with greater spread among the observations, suggesting higher within-cluster variance. Cluster 4 shows observations that are the least compact of all the clusters. These plot results suggest that an optimal cluster solution may have fewer than 4 clusters, so it is important to also evaluate solutions with fewer than 4 clusters.
So we look at the Cluster Means table for the 4-cluster solution to see how the clusters differ:

The means of the clustering variables show that compared to the other clusters, adolescents in cluster 3 had a relatively low likelihood of ever having used alcohol or marijuana, or of having problems with alcohol, deviant behavior, or violence. Furthermore, cluster 3 adolescents also showed higher self-esteem, school and family connectedness, and achieved the highest GPA scores compared with the other adolescent clusters.
Cluster 1 adolescents appear to be the most troubled, having the highest likelihood of alcohol and marijuana use, alcohol-related problems, deviant behavior, and violence. Additionally, they showed the lowest self-esteem, the lowest school and family connectedness, and the lowest GPA scores.
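A cluster means table like the one interpreted above is just the mean of each standardized variable within each cluster. Here is a small sketch of that idea; the `cluster_means` helper, the variable names and the numbers are all hypothetical and chosen only to show the calculation.

```python
from collections import defaultdict
from statistics import mean

def cluster_means(rows, variables):
    """Return {cluster: {variable: mean}} for the given rows."""
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["cluster"]].append(row)
    return {
        c: {v: mean(r[v] for r in members) for v in variables}
        for c, members in sorted(by_cluster.items())
    }

# Made-up standardized scores with a cluster label per observation
rows = [
    {"cluster": 1, "alcohol_ever": 1.0, "self_esteem": -1.0},
    {"cluster": 1, "alcohol_ever": 1.5, "self_esteem": -0.5},
    {"cluster": 3, "alcohol_ever": -1.0, "self_esteem": 0.5},
    {"cluster": 3, "alcohol_ever": -0.5, "self_esteem": 1.0},
]

table = cluster_means(rows, ["alcohol_ever", "self_esteem"])
for cluster, means in table.items():
    print(cluster, means)
```

Because the variables are standardized, a positive mean says the cluster sits above the overall average on that variable and a negative mean says it sits below.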
Finally, in order to externally validate the clusters, an Analysis of Variance (ANOVA) was performed to test for significant differences between the clusters on Depression Level. The boxplot clearly shows that Depression Level is highest in cluster 1 and lowest in cluster 3.

The ANOVA results indicated significant differences between the clusters on Depression Level. A Tukey test was used for post hoc comparisons between the clusters, and it showed significant differences between each pair of clusters on Depression Level.

Adolescents in cluster 1 had the highest Depression Level (Mean=15.38, SD=8.57), and cluster 3 had the lowest Depression Level (Mean=6.24, SD=4.84).
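For readers who want to see what the ANOVA is computing, here is a rough sketch of the one-way F statistic in plain Python. The `anova_f` helper and the depression scores are made up for illustration and are not the assignment data. The sketch stops at the F statistic: the p-value and the Tukey comparisons need a statistics package.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    group_means = [mean(g) for g in groups]
    # variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand) ** 2
                     for g, m in zip(groups, group_means))
    # variation of the observations around their own group mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up depression scores for three clusters (illustration only)
scores_by_cluster = [
    [14, 16, 15, 17],
    [9, 10, 11, 10],
    [6, 5, 7, 6],
]
f_stat = anova_f(scores_by_cluster)
print(f_stat)
```

A large F means the cluster means differ by much more than the spread within the clusters, which is what we hope to see on a validation variable that was not used to build the clusters.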
And that’s the final assignment of this fourth course of this specialization. Below is the program code. Thanks for reading.
Program




