15.1. K-Means Clustering
15.1.1. Iris Data Set
We will work with the iris data set in this section. Our goal is to cluster the measurements automatically so that measurements from the same species fall into the same cluster. The Species variable in the data set serves as the ground truth; we will use the other variables to perform the clustering.
Let us estimate the Spearman rank correlation between each measurement variable and the species variable (as.integer converts the species factor to its integer level codes):
> cor(iris$Sepal.Length, as.integer(iris$Species), method='spearman')
[1] 0.7980781
> cor(iris$Sepal.Width, as.integer(iris$Species), method='spearman')
[1] -0.4402896
> cor(iris$Petal.Length, as.integer(iris$Species), method='spearman')
[1] 0.9354305
> cor(iris$Petal.Width, as.integer(iris$Species), method='spearman')
[1] 0.9381792
The correlations suggest that petal width and petal length are very strongly correlated with the species of the flower. We will use these two variables for clustering.
Let us prepare the subset data frame on which clustering will be applied:
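The four correlation calls above can also be written as a single expression. The following is a minimal sketch, assuming the built-in iris data frame; the names species_code and cors are introduced here for illustration only.

```r
# Integer level codes of the species factor (setosa = 1, versicolor = 2, virginica = 3)
species_code <- as.integer(iris$Species)

# Spearman correlation of each of the four measurement columns with the species code
cors <- sapply(iris[, 1:4], function(x) cor(x, species_code, method = "spearman"))

# Sort so that the most strongly correlated variables come first
sort(abs(cors), decreasing = TRUE)
```

Sorting by absolute value makes it easy to pick the strongest candidates regardless of the sign of the correlation.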
> iris2 <- iris[, c("Petal.Length", "Petal.Width")]
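We are ready to run the k-means clustering algorithm now. We fix the random seed so that the run is reproducible, and use nstart=20 so that kmeans tries 20 random starting configurations and keeps the best one: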
> num_clusters <- 3
> set.seed(1111)
> result <- kmeans(iris2, num_clusters, nstart=20)
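To give a feel for what happens inside a k-means run, here is a minimal sketch of the classic assign-then-update (Lloyd-style) iteration. It is an illustration only: simple_kmeans is a hypothetical helper written for this example, and kmeans() itself uses the Hartigan-Wong algorithm by default, which differs in detail.

```r
# Illustrative Lloyd-style k-means; not the algorithm kmeans() uses by default.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k randomly chosen rows as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every center (n x k matrix)
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Assign each point to its nearest center
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop once the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Move each non-empty cluster's center to the mean of its points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1111)
fit <- simple_kmeans(iris[, c("Petal.Length", "Petal.Width")], 3)
table(fit$cluster, iris$Species)
```

A single run like this can settle in a poor local optimum depending on the starting centers, which is the reason for requesting several random starts with nstart.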
We can see the centers for the 3 clusters as follows:
> result$centers
  Petal.Length Petal.Width
1     5.595833    2.037500
2     1.462000    0.246000
3     4.269231    1.342308
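Each center is the mean of the measurements assigned to that cluster. A small sketch that checks this, assuming the same data subset and kmeans call as above (manual_centers is a name introduced here for illustration):

```r
iris2 <- iris[, c("Petal.Length", "Petal.Width")]
set.seed(1111)
result <- kmeans(iris2, 3, nstart = 20)

# Per-cluster column means computed directly from the cluster assignments
manual_centers <- aggregate(iris2, by = list(cluster = result$cluster), FUN = mean)

# Compare with the centers reported by kmeans, ignoring row and column names
all.equal(unname(as.matrix(manual_centers[, -1])), unname(result$centers))
```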
We can see the cluster assigned to each measurement as follows:
> result$cluster
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[45] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 1 3 3 3 3
[89] 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 3 1 1 1 1 1
[133] 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1
Let us see how well the clusters match the species:
> table(result$cluster, iris$Species)
    setosa versicolor virginica
  1      0          2        46
  2     50          0         0
  3      0         48         4
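Six of the 150 measurements have been mis-clustered: 2 versicolor measurements fall into cluster 1, which is dominated by virginica, and 4 virginica measurements fall into cluster 3, which is dominated by versicolor.
The within-cluster sum of squares for each cluster can be seen as follows: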
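The mis-clustered count can also be computed from the contingency table rather than read off by eye. This is a sketch under the assumption that each cluster is identified with its most frequent species; tab and misclustered are illustrative names.

```r
iris2 <- iris[, c("Petal.Length", "Petal.Width")]
set.seed(1111)
result <- kmeans(iris2, 3, nstart = 20)

tab <- table(result$cluster, iris$Species)

# For each cluster (row), the largest count is the majority species;
# everything else in that row counts as mis-clustered.
misclustered <- sum(tab) - sum(apply(tab, 1, max))
misclustered
```

Because the count relies only on row maxima, it does not depend on how the clusters happen to be numbered.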
> result$withinss
[1] 16.29167 2.02200 13.05769
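For a given cluster, this quantity is the sum of squared Euclidean distances between the cluster's measurements and its center. The sketch below recomputes it by hand, assuming the same kmeans call as above (manual_wss is a name introduced for illustration):

```r
iris2 <- iris[, c("Petal.Length", "Petal.Width")]
set.seed(1111)
result <- kmeans(iris2, 3, nstart = 20)

manual_wss <- sapply(1:3, function(j) {
  # Measurements assigned to cluster j
  pts <- as.matrix(iris2[result$cluster == j, ])
  # Subtract the center from every row, square, and add everything up
  sum(sweep(pts, 2, result$centers[j, ])^2)
})

all.equal(manual_wss, result$withinss)

# The total over all clusters is what kmeans minimizes across its random starts
all.equal(sum(result$withinss), result$tot.withinss)
```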
At times, it is useful to first center and scale each variable before clustering, so that variables measured on a larger scale do not dominate the distance computation:
> iris3 <- scale(iris2)
> result <- kmeans(iris3, num_clusters, nstart=20)
> table(result$cluster, iris$Species)
    setosa versicolor virginica
  1      0         48         4
  2     50          0         0
  3      0          2        46
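In the table above, the grouping has the same counts as before scaling; only the cluster numbering differs, and that numbering is arbitrary and depends on the random starts. The sketch below shows what scale() does to the two columns and cross-tabulates the two clusterings. The names raw_fit and scaled_fit are illustrative, and the seed is an assumption added here for reproducibility.

```r
iris2 <- iris[, c("Petal.Length", "Petal.Width")]
iris3 <- scale(iris2)

# After scaling, every column has mean 0 and standard deviation 1
colMeans(iris3)
apply(iris3, 2, sd)

# scale() records the centering and scaling values it used as attributes
attr(iris3, "scaled:center")
attr(iris3, "scaled:scale")

# Cross-tabulate the unscaled and scaled clusterings to compare them
set.seed(1111)
raw_fit <- kmeans(iris2, 3, nstart = 20)
scaled_fit <- kmeans(iris3, 3, nstart = 20)
table(raw = raw_fit$cluster, scaled = scaled_fit$cluster)
```

If the two clusterings describe the same grouping, each row of the cross-tabulation has a single non-zero entry.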