K-means Cluster Analysis

Course 4 Week 4: Assignment

Introduction

This final week I am going to talk about cluster analysis, and in particular the k-means cluster analysis technique. The goal of cluster analysis is to group, or cluster, observations into subsets based on the similarity of their responses on multiple variables. Observations that have similar response patterns are grouped together to form clusters. That is, the goal is to partition the observations in a data set into a smaller set of clusters, with each observation belonging to only one cluster.

Cluster analysis is an unsupervised learning method, meaning there is no specific response variable included in the analysis.

Below is the assignment. Since the program code is really long, I’ve put it at the end of the blog post.

Assignment

A k-means cluster analysis was conducted to identify underlying subgroups of adolescents based upon the similarity of responses on 11 variables that represent characteristics that could have an impact on adolescent depression.

Clustered variables included two binary variables, indicating whether or not the adolescent had ever used alcohol or marijuana. Quantitative variables included a measure of problems with alcohol usage, a scale measuring engagement in deviant behavior (such as vandalism, other property damage, lying, stealing, running away from home, driving without parental permission, selling drugs and unexcused school absence), and scales measuring violent behavior, self-esteem, parental presence, parental activities, family connectedness, school connectedness and academic achievement (measured as grade point average). All of the clustered variables were standardized to a mean of 0 and a standard deviation of 1.
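As a quick illustration of that last step, here is a minimal sketch of standardizing one variable in Python. This is not the program used for the assignment; the function name and the GPA values are made up for the example, and using the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def standardize(values):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = stdev(values)  # sample SD (assumption; a population SD is also common)
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sigma for v in values]

# Hypothetical GPA values, for illustration only
gpa = [2.0, 2.5, 3.0, 3.5, 4.0]
z_gpa = standardize(gpa)
```

After this step every variable is on the same scale, so no single variable dominates the Euclidean distances used by k-means.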

Using simple random sampling, data were split into a training set that included 70% (N=3201) of the observations and a test set that included 30% (N=1701) of the observations. A series of k-means cluster analyses was conducted on the training set specifying k=1-9 clusters, using Euclidean distance. The variance in the clustering variables that was accounted for by the clusters (R-square values) was plotted for each of the nine cluster solutions in an elbow curve to provide visual guidance for choosing the number of clusters to interpret.
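To make the procedure concrete, here is a rough pure-Python sketch of the same idea: a 70/30 random split, k-means for k=1-9 with Euclidean distance, and the R-square value for each solution. It is only an illustration on made-up two-dimensional data, not the program used for the assignment, and all function names here are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, n_iter=50, seed=0):
    """Basic k-means with randomly chosen starting centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign every point to its nearest center
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Move every center to the mean of its members (keep it if the cluster is empty)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(centroid(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

def r_square(points, centers, labels):
    """Share of total variance accounted for by the clusters."""
    grand = centroid(points)
    total_ss = sum(sq_dist(p, grand) for p in points)
    within_ss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return 1 - within_ss / total_ss

# Made-up data: three blobs of 50 points each (illustration only)
rng = random.Random(42)
data = [[rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)]
        for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(50)]

# Simple random 70/30 split
rng.shuffle(data)
cut = len(data) * 7 // 10
train, test = data[:cut], data[cut:]

# R-square for k = 1..9; plotting these values against k gives the elbow curve
elbow = {}
for k in range(1, 10):
    centers, labels = kmeans(train, k)
    elbow[k] = r_square(train, centers, labels)
```

With k=1 the single center is the overall mean, so R-square is 0; the interesting part of the curve is where adding another cluster stops increasing R-square by much.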

[Figure: elbow curve of R-square values for the k=1-9 cluster solutions]

Now we clearly see bends in the curve at 2, 4, 6 and 7 clusters. We can’t create a scatter plot for the 2-cluster solution, so we check the variability of the data in the 4-, 6- and 7-cluster solutions to see whether the clusters overlap, whether the patterns of means on the clustering variables are unique and meaningful, and whether there are significant differences between the clusters on our external validation variable, Depression Level.

Here are the plots of 4, 6 and 7:

[Figures: canonical variable scatterplots for the 4-, 6- and 7-cluster solutions]

A scatterplot of the canonical variables by cluster for the 4-cluster solution indicates that the observations are densely packed, with low within-cluster variance, and do not overlap significantly with the other clusters. In the 6- and 7-cluster solutions, on the other hand, the clusters were for the most part distinct but showed greater spread among the observations, suggesting higher within-cluster variance. Cluster 4 showed the observations that were the least compact of all the clusters. These plot results suggest that the optimal cluster solution may have fewer than 4 clusters, so solutions with fewer than 4 clusters should also be evaluated.

So we look at the cluster means table for the 4-cluster solution to see how the clusters differ:

[Table: cluster means for the 4-cluster solution]

The means of the clustering variables show that compared to the other clusters, adolescents in cluster 3 had a relatively low likelihood of ever having used alcohol or marijuana, or of having problems with alcohol, deviant behavior, or violence. Furthermore, cluster 3 adolescents also showed higher self-esteem, school and family connectedness, and achieved the highest GPA scores compared with the other adolescent clusters.

Cluster 1 adolescents appear to be the most troubled adolescents, having the highest likelihood of alcohol and marijuana use, alcohol-related problems, deviant behavior, and violence. Additionally, they showed the lowest self-esteem, school, and family connectedness and lowest GPA scores.

Finally, to externally validate the clusters, an analysis of variance (ANOVA) was performed to test for significant differences between the clusters on Depression Level. The boxplot clearly shows that Depression Level is highest in cluster 1 and lowest in cluster 3.

[Figure: boxplot of Depression Level by cluster]
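For readers who want to see what the ANOVA is actually computing, here is a small sketch of the one-way F statistic in plain Python. The scores are invented for the example and are not the assignment data; the function name is hypothetical too.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical depression scores for three clusters (illustration only)
scores_by_cluster = [
    [14, 16, 15, 17],
    [9, 10, 11, 10],
    [6, 5, 7, 6],
]
f_stat = one_way_anova_f(scores_by_cluster)
```

A large F value means the cluster means differ by much more than the spread inside each cluster would explain, which is what external validation is looking for.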

A Tukey test was used for post hoc comparisons between the clusters. The results showed significant differences in Depression Level between each pair of clusters.

[Table: Tukey post hoc comparisons of Depression Level between clusters]

Adolescents in cluster 1 had the highest Depression Level (Mean=15.38, SD=8.57), and adolescents in cluster 3 had the lowest Depression Level (Mean=6.24, SD=4.84).

And that’s the final assignment of the fourth course of this specialization. Below is the program code. Thanks for reading.

Program

GitHub Link

 

Looks easy? Think again!

The Whites have two kids, at least one of whom is a boy. What is the probability that both are boys?
The Browns have two kids, the older of whom is a boy. What is the probability that both are boys?

While reading an article published by the US Naval Academy today, which I found while searching for conditional probabilities, I came across this simple yet amazing and baffling probability problem. It seems easy, and it sounds like the answer in both cases should be the same, right? Guess what: if you think they should be equal, you’re wrong, sire. In fact, the original question has another family and a probability question related to them as well:

The Greens have two kids: what is the probability that they are both boys?

So the question none of you has thought of yet is: why was I searching for conditional probabilities in the first place? Well, funny story: it’s related to another mind-boggling problem that I was dealing with. Take a look at why I call it that:

Two cards are drawn from a pack of cards.
A. What’s the probability that both cards are aces, given that at least one of them is an ace?
B. What’s the probability that both cards are aces, given that one of them is the ace of spades?

See… looks similar, right? And it actually feels like both questions should have the same answer. But guess what, they aren’t the same at all; in fact, one answer is almost twice the other.

[Image: aces, poker cards]

I learned this today: “the more information you have, the more your odds go up.”

That may sound like a waste of time, but I can try to show you what I mean by solving our original question. So shall we?

Let event A = the Greens have two kids and both of them are boys. What is the probability of A?

[Image: calculation of the probability of event A]
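Written out, the calculation for the Greens looks like this. It is a reconstruction consistent with the 25% figure quoted further down, and it assumes that each child is equally likely to be a boy or a girl and that the two births are independent.

```latex
\[
P(A) = P(\text{first child is a boy}) \cdot P(\text{second child is a boy})
     = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}
\]
```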

Seems simple and easy to understand, right? Since the two births are independent events, we can find this probability simply by multiplying the two individual probabilities.

Let’s try the second one:

Event B = the Whites have two kids and both of them are boys, given that at least one of them is a boy. What is the probability of B?

Now this is a conditional probability, so we will make use of conditional probability formula:

[Image: conditional probability calculation for event B]
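Spelled out with the conditional probability formula, and again assuming that the four combinations BB, BG, GB and GG are equally likely, the calculation for the Whites is (a reconstruction consistent with the 33.33% quoted further down):

```latex
\[
P(\text{both boys} \mid \text{at least one boy})
  = \frac{P(\text{both boys} \cap \text{at least one boy})}{P(\text{at least one boy})}
  = \frac{P(BB)}{P(BB) + P(BG) + P(GB)}
  = \frac{1/4}{3/4} = \frac{1}{3}
\]
```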

Yes, the chance of two boys went up, which seems obvious since it’s a conditional probability. But what about the third case, which is almost exactly the same but with a tiny bit of extra information? Do you think the answer would be the same? Well, let’s try:

Event C = the Browns have two kids and both of them are boys, given that the older one is a boy. What is the probability of C?

[Image: conditional probability calculation for event C]
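For the Browns, the same formula with the narrower condition gives the following (a reconstruction consistent with the 50% quoted below, under the same equal-likelihood assumption, with the first letter standing for the older child):

```latex
\[
P(\text{both boys} \mid \text{older is a boy})
  = \frac{P(\text{both boys} \cap \text{older is a boy})}{P(\text{older is a boy})}
  = \frac{P(BB)}{P(BB) + P(BG)}
  = \frac{1/4}{1/2} = \frac{1}{2}
\]
```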

See… with each new piece of information the probability changed, from 25% to 33.33% and then to 50%. The computation above might look easy to understand, and the reasoning seems correct too, but this was actually a warm-up for understanding what’s happening in the Ace of Spades problem.
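If you don’t trust the algebra, the three answers can also be checked by brute force. Here is a small Python sketch that lists the four equally likely (older, younger) combinations and counts; the helper names are mine, and equal likelihood of boys and girls is an assumption.

```python
from fractions import Fraction
from itertools import product

# Each family is an (older, younger) pair; all four pairs are assumed equally likely
families = list(product("BG", repeat=2))

def prob(event, given=lambda f: True):
    """P(event | given), found by counting equally likely outcomes."""
    condition = [f for f in families if given(f)]
    return Fraction(sum(1 for f in condition if event(f)), len(condition))

def both_boys(f):
    return f == ("B", "B")

p_greens = prob(both_boys)                              # no extra information
p_whites = prob(both_boys, given=lambda f: "B" in f)    # at least one boy
p_browns = prob(both_boys, given=lambda f: f[0] == "B") # the older one is a boy

print(p_greens, p_whites, p_browns)
```

The extra information only changes which outcomes are still possible: three for the Whites, two for the Browns.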

I won’t be solving this question on my own, as it’s already been solved by two really good lecturers. I can just tell you that the probability of getting two aces, knowing that one of the cards is the ace of spades, is 1/17. Knowing only that at least one of the cards is an ace, the probability falls to 1/33.
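The two numbers can at least be checked by counting. The sketch below enumerates every two-card hand from a 52-card deck, reading the two conditions as "at least one of the two cards is an ace" and "one of the two cards is the ace of spades", which is the reading under which the answers are 1/33 and 1/17. The card encoding and helper names are just my own choices for the example.

```python
from fractions import Fraction
from itertools import combinations

ranks = "A23456789TJQK"
suits = "SHDC"
deck = [r + s for r in ranks for s in suits]  # e.g. "AS" is the ace of spades
hands = list(combinations(deck, 2))           # all unordered two-card hands

def cond_prob(event, given):
    """P(event | given), found by counting equally likely hands."""
    condition = [h for h in hands if given(h)]
    return Fraction(sum(1 for h in condition if event(h)), len(condition))

def both_aces(hand):
    return all(card[0] == "A" for card in hand)

p_given_an_ace = cond_prob(both_aces, lambda h: any(c[0] == "A" for c in h))
p_given_ace_of_spades = cond_prob(both_aces, lambda h: "AS" in h)

print(p_given_an_ace, p_given_ace_of_spades)
```

Knowing the exact ace leaves only 51 possible hands, 3 of which have a second ace, while knowing "some ace" leaves 198 hands with only 6 double-ace hands among them.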

For the solution of that question, here are the links:

  1. https://www.usna.edu/Users/physics/mungan/_files/documents/Scholarship/TwoAces.pdf
  2. The second solution is given as an example in this video lecture by Joe Blitzstein, Professor of the Practice in Statistics at Harvard University. Go to 9:35 to see the example from the start.


I hope you liked this article. It may seem highly unlikely that anyone would enjoy something related to mathematics and stats, but it might be of some interest to a few of us… right?

Thanks for reading.

Image Credits:
Header Image
Cards Image
