16 Unsupervised Learning

This chapter covers the eighth learning objective along with one additional topic, correlation analysis.

As of the fall of 2020, there have been eight different PA exams. The table below shows the types of questions in each. Your exam in December will likely have a question on either k-means or hierarchical clustering because these topics have not frequently been tested. The Hospital Readmissions sample project did have one question related to k-means, but that was in 2018. Our ExamPA.net practice exams contain questions on both topics.

16.1 Types of Learning

You are in a classroom and your teacher is giving a lesson that has a correct answer, such as 1 + 2 = 3. This is called supervised learning. The teacher is the supervisor, who is responsible for minimizing the number of incorrect answers. When a question is answered correctly, there is a reward, and when a mistake is made, there is a penalty. In machine learning, we measure performance using metrics such as RMSE, and we say that the model is a good “fit” when this metric is low.


Figure 16.1: Supervised (Left), Unsupervised (Right)

However, if you are given a pile of Legos and told to build whatever you want, that would be unsupervised learning. There are no right or wrong answers, and the goal is to explore patterns in the data. There are no performance metrics, and it is impossible to say whether a model is a good fit or not.

Up to this point, you have been working with supervised learning. This is the major focus of PA because predictive analytics often is used to predict a future outcome. We now move on to the unsupervised learning algorithms. Here are all of the learning algorithms on PA:

Supervised                       Unsupervised
GLM                              Correlation analysis
Lasso, Ridge, and Elastic Net    Principal component analysis (PCA)
Decision Tree                    K-means clustering
Bagged Tree                      Hierarchical clustering
Boosted Tree

Semi-supervised learning is a mix of the two. One example is using PCA or clustering to create features that are then used in a supervised model.

16.2 Correlation Analysis

  • Two variables are said to be positively correlated when increasing one tends to increase the other, and negatively correlated when increasing one tends to decrease the other (see the short R example after this list)
  • Correlation is unsupervised because it does not depend on the target variable
  • Correlation is only defined for numeric variables. It is possible to calculate the correlation between binary variables if they are coded as 0 or 1, but such questions have never appeared on PA.
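
As a quick, hedged illustration (the data set choice here is ours, not from any PA exam), the cor() function computes correlations between numeric variables:

# A minimal sketch using the built-in mtcars data: one pair of variables,
# then the full correlation matrix for three numeric columns
cor(mtcars$mpg, mtcars$wt)              # negative: heavier cars get lower mpg
cor(mtcars[, c("mpg", "wt", "hp")])     # symmetric matrix with 1s on the diagonal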

16.2.1 Correlation does not equal causation

This is a common saying of statisticians. Two things happening at the same time are not sufficient evidence to suggest that one causes the other. A spurious correlation is when two unrelated variables have a positive or negative correlation by random chance.

Examples:

Number of Drownings and Ice Cream Sales: Drownings rise when ice cream sales rise because the heat causes more people to go swimming and want ice cream.

Executives who say “please” and “thank you” more often enjoy better performance reviews: It might at first appear that this is due to brown-nosing or flattery, but a more likely explanation is that people who take the extra effort to be polite also take the extra effort to do their jobs well. People who have good performance may also be polite, but not all polite people are high performers.
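
A short simulation (our own sketch, not part of the syllabus readings) shows how easily spurious correlations arise in small samples: two variables generated independently can still show a sizable correlation by chance.

# Simulate 1,000 pairs of independent variables with only 10 observations each
set.seed(42)
cors = replicate(1000, cor(rnorm(10), rnorm(10)))
mean(abs(cors) > 0.5)   # a noticeable share of pairs correlate strongly by chance alone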

16.3 Principal Component Analysis (PCA)

Often there are variables that contain redundant information. PCA is one method of simplifying them.

StatQuest. “Principal Component Analysis (PCA), Step-by-Step” YouTube, Joshua Starmer, 2 Apr 2018, https://www.youtube.com/watch?v=FgakZw6K1QQ

PCA is a dimensionality reduction method which reduces the number of variables needed to retain most of the information in a matrix. If there are predictor variables \((x_1, x_2, x_3)\), then running PCA and choosing the first two Principal Components (PCs) would reduce the dimension from 3 to 2. You can imagine this as observations in three-dimensional space being projected down onto a plane. The coordinates of this plane are \((\text{PC}_1, \text{PC}_2)\).

Each PC is a linear combination of the original variables. For example, PC1 might be

\[PC_1 = 0.2X_1 + 0.3X_2 - 0.2X_3\]

The weights here are also called loadings or rotations, and are (0.2, 0.3, -0.2) in this example.
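
To make the arithmetic concrete, here is a toy calculation of a single PC1 score using these loadings (the observation values are made up for illustration):

# Toy check of the loading formula above for one (already centered and scaled) observation
loadings = c(0.2, 0.3, -0.2)
x_obs = c(1, 2, 3)
sum(loadings * x_obs)   # 0.2*1 + 0.3*2 - 0.2*3 = 0.2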

Readings
ISLR 10.2 Principal Component Analysis
ISLR 10.3 Clustering Methods

16.3.1 Example: US Arrests

In this example, we perform PCA on the USArrests data set, which is included with base R. The rows of the data set contain the 50 states, in alphabetical order:

library(tidyverse)
states=row.names(USArrests)
states
##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"       "California"    
##  [6] "Colorado"       "Connecticut"    "Delaware"       "Florida"        "Georgia"       
## [11] "Hawaii"         "Idaho"          "Illinois"       "Indiana"        "Iowa"          
## [16] "Kansas"         "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"    "Missouri"      
## [26] "Montana"        "Nebraska"       "Nevada"         "New Hampshire"  "New Jersey"    
## [31] "New Mexico"     "New York"       "North Carolina" "North Dakota"   "Ohio"          
## [36] "Oklahoma"       "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"           "Vermont"       
## [46] "Virginia"       "Washington"     "West Virginia"  "Wisconsin"      "Wyoming"

The columns of the data set contain four variables relating to various crimes:

glimpse(USArrests)
## Rows: 50
## Columns: 4
## $ Murder   <dbl> 13.2, 10.0, 8.1, 8.8, 9.0, 7.9, 3.3, 5.9, 15.4, 17.4, 5.3, 2.6, 10.4, 7.2, 2.2, 6.0, …
## $ Assault  <int> 236, 263, 294, 190, 276, 204, 110, 238, 335, 211, 46, 120, 249, 113, 56, 115, 109, 24…
## $ UrbanPop <int> 58, 48, 80, 50, 91, 78, 77, 72, 80, 60, 83, 54, 83, 65, 57, 66, 52, 66, 51, 67, 85, 7…
## $ Rape     <dbl> 21.2, 44.5, 31.0, 19.5, 40.6, 38.7, 11.1, 15.8, 31.9, 25.8, 20.2, 14.2, 24.0, 21.0, 1…

Let us start by taking a quick look at the column means of the data.

USArrests %>% summarise_all(mean)
##   Murder Assault UrbanPop   Rape
## 1  7.788  170.76    65.54 21.232

We see right away that the data have vastly different means. We can also examine the variances of the four variables.

USArrests %>% summarise_all(var)
##     Murder  Assault UrbanPop     Rape
## 1 18.97047 6945.166 209.5188 87.72916

Not surprisingly, the variables also have vastly different variances: the UrbanPop variable measures the percentage of the population in each state living in an urban area, which is not comparable to the number of crimes committed in each state per 100,000 individuals. If we failed to scale the variables before performing PCA, then most of the principal components we observed would be driven by the Assault variable, since it has by far the largest mean and variance.

Thus, it is important to standardize the variables to have mean zero and standard deviation one before performing PCA. We will perform principal components analysis using the prcomp() function, which is one of several functions that perform PCA. By default, this function centers the variables to have mean zero. By using the option scale=TRUE, we also scale the variables to have standard deviation one:

pca = prcomp(USArrests, scale=TRUE)

The output from prcomp() contains a number of useful quantities:

names(pca)
## [1] "sdev"     "rotation" "center"   "scale"    "x"

The center and scale components correspond to the means and standard deviations of the variables that were used for scaling prior to implementing PCA:

pca$center
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
pca$scale
##    Murder   Assault  UrbanPop      Rape 
##  4.355510 83.337661 14.474763  9.366385

The rotation matrix provides the principal component loadings; each column of pca$rotation contains the corresponding principal component loading vector:

pca$rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

We see that there are four distinct principal components. This is to be expected because there are in general min(n − 1, p) informative principal components in a data set with \(n\) observations and \(p\) variables.
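
As a quick check of the earlier remark about scaling (this snippet is our own addition, not part of the original lab), re-running PCA without scaling produces a first loading vector dominated by Assault:

# Without scaling, Assault's large variance dominates the first loading vector
pca_unscaled = prcomp(USArrests, scale = FALSE)
round(pca_unscaled$rotation[, 1], 3)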

Using the prcomp() function, we do not need to explicitly multiply the data by the principal component loading vectors in order to obtain the principal component score vectors. Rather, the 50 × 4 matrix pca$x has as its columns the principal component score vectors. That is, the \(k^{th}\) column is the \(k^{th}\) principal component score vector. We’ll take a look at the first few states:

head(pca$x)
##                   PC1        PC2         PC3          PC4
## Alabama    -0.9756604  1.1220012 -0.43980366  0.154696581
## Alaska     -1.9305379  1.0624269  2.01950027 -0.434175454
## Arizona    -1.7454429 -0.7384595  0.05423025 -0.826264240
## Arkansas    0.1399989  1.1085423  0.11342217 -0.180973554
## California -2.4986128 -1.5274267  0.59254100 -0.338559240
## Colorado   -1.4993407 -0.9776297  1.08400162  0.001450164
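
As a sanity check (not part of the original lab), the score matrix can be reproduced by hand: it is simply the centered-and-scaled data multiplied by the rotation matrix.

# The scores equal the standardized data times the loading matrix
scores_manual = scale(USArrests) %*% pca$rotation
max(abs(scores_manual - pca$x))   # effectively zero, up to floating-point noise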

We can plot the first two principal components using the biplot() function:

biplot(pca, scale=0)

The scale=0 argument to biplot() ensures that the arrows are scaled to represent the loadings; other values for scale give slightly different biplots with different interpretations.

The prcomp() function also outputs the standard deviation of each principal component. We can access these standard deviations as follows:

pca$sdev
## [1] 1.5748783 0.9948694 0.5971291 0.4164494

The variance explained by each principal component is obtained by squaring these:

pca_var=pca$sdev^2
pca_var
## [1] 2.4802416 0.9897652 0.3565632 0.1734301

To compute the proportion of variance explained by each principal component, we simply divide the variance explained by each principal component by the total variance explained by all four principal components:

pve=pca_var/sum(pca_var)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
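
As a cross-check (our own addition), the same proportions can be read directly from the summary() method for a prcomp object, which reports the standard deviation, proportion of variance, and cumulative proportion of each component:

# summary() reports sdev, proportion of variance, and cumulative proportion
summary(pca)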

We see that the first principal component explains 62.0% of the variance in the data, the next principal component explains 24.7% of the variance, and so forth. We can plot the PVE explained by each component as follows:

plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1),type='b')

We can also use the function cumsum(), which computes the cumulative sum of the elements of a numeric vector, to plot the cumulative PVE:

plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1),type='b')

a=c(1,2,8,-3)
cumsum(a)
## [1]  1  3 11  8

16.3.2 Example: SOA PA 6/12/19, Task 3

  1. (9 points) Use observations from principal components analysis (PCA) to generate a new feature

Your assistant has provided code to run a PCA on three variables. Run the code on these three variables. Interpret the output, including the loadings on significant principal components. Generate one new feature based on your observations (which may also involve dropping some current variables). Your assistant has provided some notes on using PCA on factor variables in the .Rmd file.
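
As a hedged sketch of the general idea only (using the USArrests data from the previous example rather than the exam's actual data and variable names), a new feature can be built from the first principal component and the original columns dropped:

# Sketch only: run PCA on three related variables and keep PC1 as a new feature
pca_feature = prcomp(USArrests[, c("Murder", "Assault", "Rape")], scale = TRUE)
arrests_new = USArrests
arrests_new$crime_pc1 = pca_feature$x[, 1]                 # summarizes the three crime variables
arrests_new = arrests_new[, c("UrbanPop", "crime_pc1")]    # drop the original columns
head(arrests_new)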


16.3.3 Example: PCA on Cancer Cells

The NCI60 data contains expression levels of 6,830 genes from 64 cancer cell lines. Cancer type is also recorded.

library(ISLR)
nci_labs=NCI60$labs
nci_data=NCI60$data

We first perform PCA on the data after scaling the variables (genes) to have standard deviation one, although one could reasonably argue that it is better not to scale the genes:

pca = prcomp(nci_data, scale=TRUE)

We now plot the first few principal component score vectors in order to visualize the data. The observations (cell lines) corresponding to a given cancer type will be plotted in the same color, so that we can see to what extent the observations within a cancer type are similar to each other. We first create a simple function that assigns a distinct color to each element of a numeric vector. The function will be used to assign a color to each of the 64 cell lines, based on the cancer type to which it corresponds. We will use the rainbow() function, which takes as its argument a positive integer and returns a vector containing that number of distinct colors.

Cols = function(vec){
  cols = rainbow(length(unique(vec)))
  return(cols[as.numeric(as.factor(vec))])
}

We now can plot the principal component score vectors:

par(mfrow=c(1,2))
plot(pca$x[,1:2], col=Cols(nci_labs), pch=19,xlab="Z1",ylab="Z2")
plot(pca$x[,c(1,3)], col=Cols(nci_labs), pch=19,xlab="Z1",ylab="Z3")

On the whole, cell lines corresponding to a single cancer type tend to have similar values on the first few principal component score vectors.

This indicates that cell lines from the same cancer type tend to have pretty similar gene expression levels.

We can obtain a summary of the proportion of variance explained (PVE) of the first few principal components using the summary() method for a prcomp object:

summary(pca)
## Importance of components:
##                            PC1      PC2      PC3      PC4      PC5      PC6      PC7      PC8      PC9
## Standard deviation     27.8535 21.48136 19.82046 17.03256 15.97181 15.72108 14.47145 13.54427 13.14400
## Proportion of Variance  0.1136  0.06756  0.05752  0.04248  0.03735  0.03619  0.03066  0.02686  0.02529
## Cumulative Proportion   0.1136  0.18115  0.23867  0.28115  0.31850  0.35468  0.38534  0.41220  0.43750
##                            PC10     PC11     PC12     PC13     PC14     PC15     PC16     PC17     PC18
## Standard deviation     12.73860 12.68672 12.15769 11.83019 11.62554 11.43779 11.00051 10.65666 10.48880
## Proportion of Variance  0.02376  0.02357  0.02164  0.02049  0.01979  0.01915  0.01772  0.01663  0.01611
## Cumulative Proportion   0.46126  0.48482  0.50646  0.52695  0.54674  0.56590  0.58361  0.60024  0.61635
##                            PC19    PC20     PC21    PC22    PC23    PC24    PC25    PC26    PC27   PC28
## Standard deviation     10.43518 10.3219 10.14608 10.0544 9.90265 9.64766 9.50764 9.33253 9.27320 9.0900
## Proportion of Variance  0.01594  0.0156  0.01507  0.0148 0.01436 0.01363 0.01324 0.01275 0.01259 0.0121
## Cumulative Proportion   0.63229  0.6479  0.66296  0.6778 0.69212 0.70575 0.71899 0.73174 0.74433 0.7564
##                           PC29    PC30    PC31    PC32    PC33    PC34    PC35    PC36    PC37    PC38
## Standard deviation     8.98117 8.75003 8.59962 8.44738 8.37305 8.21579 8.15731 7.97465 7.90446 7.82127
## Proportion of Variance 0.01181 0.01121 0.01083 0.01045 0.01026 0.00988 0.00974 0.00931 0.00915 0.00896
## Cumulative Proportion  0.76824 0.77945 0.79027 0.80072 0.81099 0.82087 0.83061 0.83992 0.84907 0.85803
##                           PC39    PC40    PC41   PC42    PC43   PC44    PC45   PC46    PC47    PC48
## Standard deviation     7.72156 7.58603 7.45619 7.3444 7.10449 7.0131 6.95839 6.8663 6.80744 6.64763
## Proportion of Variance 0.00873 0.00843 0.00814 0.0079 0.00739 0.0072 0.00709 0.0069 0.00678 0.00647
## Cumulative Proportion  0.86676 0.87518 0.88332 0.8912 0.89861 0.9058 0.91290 0.9198 0.92659 0.93306
##                           PC49    PC50    PC51    PC52    PC53    PC54    PC55    PC56    PC57   PC58
## Standard deviation     6.61607 6.40793 6.21984 6.20326 6.06706 5.91805 5.91233 5.73539 5.47261 5.2921
## Proportion of Variance 0.00641 0.00601 0.00566 0.00563 0.00539 0.00513 0.00512 0.00482 0.00438 0.0041
## Cumulative Proportion  0.93947 0.94548 0.95114 0.95678 0.96216 0.96729 0.97241 0.97723 0.98161 0.9857
##                           PC59    PC60    PC61    PC62    PC63      PC64
## Standard deviation     5.02117 4.68398 4.17567 4.08212 4.04124 2.148e-14
## Proportion of Variance 0.00369 0.00321 0.00255 0.00244 0.00239 0.000e+00
## Cumulative Proportion  0.98940 0.99262 0.99517 0.99761 1.00000 1.000e+00

Using the plot() function, we can also plot the variance explained by the first few principal components:

plot(pca)

Note that the height of each bar in the bar plot is given by squaring the corresponding element of pca$sdev. However, it is generally more informative to plot the PVE of each principal component (i.e., a scree plot) and the cumulative PVE of each principal component. This can be done with just a little tweaking:

pve = 100*pca$sdev^2/sum(pca$sdev^2)
par(mfrow=c(1,2))
plot(pve,  type="o", ylab="PVE", xlab="Principal Component", col="blue")
plot(cumsum(pve), type="o", ylab="Cumulative PVE", xlab="Principal Component", col="brown3")

Together, we see that the first seven principal components explain around 40% of the variance in the data. This is not a huge amount of the variance. However, looking at the scree plot, we see that while each of the first seven principal components explains a substantial amount of variance, there is a marked decrease in the variance explained by other principal components. That is, there is an elbow in the plot after approximately the seventh principal component. This suggests that there may be little benefit to examining more than seven or so principal components (phew! even examining seven principal components may be difficult).

16.4 Clustering

Imagine that you are a large retailer interested in understanding the customer base. There may be several “types” of customers, such as those shopping for a business with corporate accounts, those shopping for leisure, or debt-strapped grad students. Each of these customers would exhibit different behavior and should be treated differently statistically. However, how can the “type” of customer be defined? Especially for large customer data sets in the millions, one can imagine how this problem can be challenging.

Clustering algorithms look for groups of observations that are similar to one another. Because there is no target variable, measuring the quality of the “fit” is much more complicated. There are many clustering algorithms, but this exam only focuses on the two that are most common.

16.5 K-Means Clustering

K-means takes continuous data and assigns each observation to one of k clusters, or groups. In a two-dimensional example, this is the same as drawing lines around groups of points.

StatQuest. “K-means clustering” YouTube, Joshua Starmer, 2 Apr 2018, https://www.youtube.com/watch?v=4b5d3muPQmA&t=268s

K-means consists of the following steps:

a) Start with two variables (\(X_1\) on the x-axis and \(X_2\) on the y-axis).
b) Randomly assign cluster centers.
c) Put each observation into the cluster whose center is closest.
d) - f) Move each cluster center to the mean of the observations assigned to it and continue until the centers stop moving.
g) Repeat steps a) - f) a given number of times (controlled by the nstart argument). This reduces the uncertainty from choosing the initial centers randomly.

16.5.0.1 R Example

In R, the kmeans() function performs K-means clustering. We begin with a simple simulated example in which there truly are two clusters in the data: the first 25 observations have a mean shift relative to the next 25 observations.

set.seed(2)
x = matrix(rnorm(50*2), ncol = 2)
x[1:25,1] = x[1:25,1]+3
x[1:25,2] = x[1:25,2]-4

We now perform K-means clustering with K = 2:

km_out = kmeans(x,2,nstart = 20)

The cluster assignments of the 50 observations are contained in km_out$cluster:

km_out$cluster
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

The K-means clustering perfectly separated the observations into two clusters even though we did not supply any group information to kmeans(). We can plot the data, with each observation colored according to its cluster assignment:

plot(x, col = (km_out$cluster+1), main = "K-Means Clustering Results with K = 2", xlab = "", ylab = "", pch = 20, cex = 2)

Here the observations can be easily plotted because they are two-dimensional. If there were more than two variables, we could instead perform PCA and plot the first two principal components score vectors. In this example, we knew that there were two clusters because we generated the data. However, for real data, in general, we do not know the true number of clusters. We could instead have performed K-means clustering on this example with K = 3. If we do this, K-means clustering will split up the two “real” clusters since it has no information about them:

set.seed(4)
km_out = kmeans(x, 3, nstart = 20)
km_out
## K-means clustering with 3 clusters of sizes 17, 23, 10
## 
## Cluster means:
##         [,1]        [,2]
## 1  3.7789567 -4.56200798
## 2 -0.3820397 -0.08740753
## 3  2.3001545 -2.69622023
## 
## Clustering vector:
##  [1] 1 3 1 3 1 1 1 3 1 3 1 3 1 3 1 3 1 1 1 1 1 3 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 25.74089 52.67700 19.56137
##  (between_SS / total_SS =  79.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"   
## [7] "size"         "iter"         "ifault"
plot(x, col = (km_out$cluster+1), main = "K-Means Clustering Results with K = 3", xlab = "", ylab = "", pch = 20, cex = 2)

To run the kmeans() function in R with multiple initial cluster assignments, we use the nstart argument. If a value of nstart greater than one is used, then K-means clustering will be performed using multiple random assignments, and the kmeans() function will report only the best results. Here we compare using nstart = 1:

set.seed(3)
km_out = kmeans(x, 3, nstart = 1)
km_out$tot.withinss
## [1] 97.97927

to nstart = 20:

km_out = kmeans(x,3,nstart = 20)
km_out$tot.withinss
## [1] 97.97927

Note that km_out$tot.withinss is the total within-cluster sum of squares, which we seek to minimize by performing K-means clustering. The individual within-cluster sums of squares are contained in the vector km_out$withinss.

It is generally recommended to always run K-means clustering with a large value of nstart, such as 20 or 50, to avoid getting stuck in an undesirable local optimum.

When performing K-means clustering and using multiple initial cluster assignments, it is also important to set a random seed using the set.seed() function. This way, the initial cluster assignments can be replicated, and the K-means output will be fully reproducible.

16.6 Hierarchical Clustering

K-means requires that we choose the number of clusters, k, in advance. Hierarchical clustering is an alternative that does not require us to commit to a single value of k.

StatQuest. “Hierarchical Clustering” YouTube, Joshua Starmer, 20 Jun 2017, https://www.youtube.com/watch?v=7xHsRkOdVwo&t=137s

The most common type of hierarchical clustering uses a bottom-up approach. This starts with each observation in its own cluster and then merges the observations that are closest to one another into a cluster. It then looks for clusters that are similar and groups them together into a larger cluster. It continues to do this until all observations are in the same group.

This is analyzed with a graph called a dendrogram (dendro = tree, gram = graph). The height represents “distance,” or how similar the clusters are to one another. The clusters on the bottom, which are vertically close to one another, have similar data values; the clusters that are further apart vertically are less similar.

Choosing the value of the cutoff height changes the number of clusters that result.

Certain data have a natural hierarchical structure. For example, say that the variables are City, Town, State, Country, and Continent. If we used hierarchical clustering, this pattern could be established even if we did not have labels for Cities, Towns, and so forth.

The hclust() function implements hierarchical clustering in R. In the following example, we use the data from the previous section to plot the hierarchical clustering dendrogram using complete, single, and average linkage, with Euclidean distance as the dissimilarity measure. We begin by clustering observations using complete linkage. The dist() function is used to compute the 50 \(\times\) 50 inter-observation Euclidean distance matrix:

hc_complete = hclust(dist(x), method = "complete")

We could just as easily perform hierarchical clustering with average or single linkage instead:

hc_average = hclust(dist(x), method = "average")
hc_single = hclust(dist(x), method = "single")

We can now plot the dendrograms obtained using the usual plot() function. The numbers at the bottom of the plot identify each observation:

par(mfrow = c(1,3))
plot(hc_complete,main = "Complete Linkage", xlab = "", sub = "", cex = .9)
plot(hc_average, main = "Average Linkage", xlab = "", sub = "", cex = .9)
plot(hc_single, main = "Single Linkage", xlab = "", sub = "", cex = .9)

To determine the cluster labels for each observation associated with a given cut of the dendrogram, we can use the cutree() function:

cutree(hc_complete, 2)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
cutree(hc_average, 2)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2
cutree(hc_single, 2)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

For this data, complete and average linkage generally separate the observations into their correct groups. However, single linkage identifies one observation as belonging to its own cluster. A more sensible answer is obtained when four clusters are selected, although there are still two singletons:

cutree(hc_single, 4)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3
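
As noted earlier, the number of clusters can also be controlled by choosing a cutoff height on the dendrogram rather than a target number of clusters; cutree() supports this through its h argument (the height of 4 below is an arbitrary choice for illustration):

# Cut the complete-linkage dendrogram at a given height instead of asking for k clusters
cutree(hc_complete, h = 4)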

To scale the variables before performing hierarchical clustering of the observations, we can use the scale() function:

xsc = scale(x)
plot(hclust(dist(xsc), method = "complete"), main = "Hierarchical Clustering with Scaled Features")

Correlation-based distance can be computed using the as.dist() function, which converts an arbitrary square symmetric matrix into a form that the hclust() function recognizes as a distance matrix. However, this only makes sense for data with at least three features since the absolute correlation between any two observations with measurements on two features is always 1. Let us generate and cluster a three-dimensional data set:

x = matrix(rnorm(30*3), ncol = 3)
dd = as.dist(1-cor(t(x)))
plot(hclust(dd, method = "complete"), main = "Complete Linkage with Correlation-Based Distance", xlab = "", sub = "")

16.6.1 Example: Clustering Cancer Cells

Unsupervised techniques are often used in the analysis of genomic data. In this example, we will see how hierarchical and K-means clustering compare on the NCI60 cancer cell line microarray data, which consists of 6,830 gene expression measurements on 64 cancer cell lines:

# The NCI60 data
library(ISLR)
nci_labels = NCI60$labs
nci_data = NCI60$data

Each cell line is labeled with a cancer type. We will ignore the cancer types in performing clustering, as these are unsupervised techniques. After performing clustering, we will use this column to see the extent to which these cancer types agree with the results of these unsupervised techniques.

The data has 64 rows and 6,830 columns.

dim(nci_data)
## [1]   64 6830

Let’s take a look at the cancer types for the cell lines:

table(nci_labels)
## nci_labels
##      BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA MCF7A-repro MCF7D-repro 
##           7           5           7           1           1           6           1           1 
##    MELANOMA       NSCLC     OVARIAN    PROSTATE       RENAL     UNKNOWN 
##           8           9           6           2           9           1

We now proceed to hierarchically cluster the cell lines in the NCI60 data to determine whether or not the observations cluster into distinct types of cancer. To begin, we standardize the variables to have mean zero and standard deviation one. This step is optional and need only be performed if we want each gene to be on the same scale:

sd_data = scale(nci_data)

We now perform hierarchical clustering of the observations using complete, single, and average linkage. We will use standard Euclidean distance as the dissimilarity measure:

par(mfrow = c(1,3))
data_dist = dist(sd_data)
plot(hclust(data_dist), labels = nci_labels, main = "Complete Linkage", xlab = "", sub = "",ylab = "")
plot(hclust(data_dist, method = "average"), labels = nci_labels, main = "Average Linkage", xlab = "", sub = "",ylab = "")
plot(hclust(data_dist, method = "single"), labels = nci_labels,  main = "Single Linkage", xlab = "", sub = "",ylab = "")

We see that the choice of linkage certainly does affect the results obtained. Typically, single linkage tends to yield trailing clusters: very large clusters onto which individual observations attach one by one. On the other hand, complete and average linkage tend to yield more balanced, attractive clusters. For this reason, complete and average linkage are generally preferred to single linkage. Cell lines within a single cancer type tend to cluster together, although the clustering is not perfect.

Let us use our complete linkage hierarchical clustering for the analysis. We can cut the dendrogram at the height that will yield a particular number of clusters, say 4:

hc_out = hclust(dist(sd_data))
hc_clusters = cutree(hc_out,4)
table(hc_clusters,nci_labels)
##            nci_labels
## hc_clusters BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA MCF7A-repro MCF7D-repro MELANOMA NSCLC
##           1      2   3     2           0           0        0           0           0        8     8
##           2      3   2     0           0           0        0           0           0        0     1
##           3      0   0     0           1           1        6           0           0        0     0
##           4      2   0     5           0           0        0           1           1        0     0
##            nci_labels
## hc_clusters OVARIAN PROSTATE RENAL UNKNOWN
##           1       6        2     8       1
##           2       0        0     1       0
##           3       0        0     0       0
##           4       0        0     0       0

There are some clear patterns. All the leukemia cell lines fall in cluster 3, while the breast cancer cell lines are spread out over three different clusters. We can plot the cut on the dendrogram that produces these four clusters using the abline() function, which draws a straight line on top of any existing plot in R:

par(mfrow = c(1,1))
plot(hc_out, labels = nci_labels)
abline(h = 139, col = "red")

Printing the output of hclust gives a useful brief summary of the object:

hc_out
## 
## Call:
## hclust(d = dist(sd_data))
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 64

We claimed earlier that K-means clustering and hierarchical clustering with the dendrogram cut to obtain the same number of clusters could yield very different results. How do these NCI60 hierarchical clustering results compare to what we get if we perform K-means clustering with K = 4?

set.seed(2)
km_out = kmeans(sd_data, 4, nstart = 20)
km_clusters = km_out$cluster

We can use a confusion matrix to compare the differences in how the two methods assigned observations to clusters:

table(km_clusters,hc_clusters)
##            hc_clusters
## km_clusters  1  2  3  4
##           1 11  0  0  9
##           2 20  7  0  0
##           3  9  0  0  0
##           4  0  0  8  0

We see that the four clusters obtained using hierarchical clustering and K-means clustering are somewhat different. Cluster 4 in K-means clustering is identical to cluster 3 in hierarchical clustering. However, the other clusters differ: for instance, cluster 2 in K-means clustering contains a portion of the observations assigned to cluster 1 by hierarchical clustering as well as all of the observations assigned to cluster 2 by hierarchical clustering.

16.6.2 References

These examples are an adaptation of p. 404-407, 410-413 of “Introduction to Statistical Learning with Applications in R” by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. Adapted by R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016), and re-implemented in Fall 2016 tidyverse format by Amelia McNamara and R. Jordan Crouser at Smith College.

Used with permission from Jordan Crouser at Smith College and to the following contributors on GitHub:

  • github.com/jcrouser
  • github.com/AmeliaMN
  • github.com/mhusseinmidd
  • github.com/rudeboybert
  • github.com/ijlyttle