Cluster analysis

This technique is generally used to group rows (i.e. cases or individuals) into similar subtypes. A number of packages are available for it. The most commonly used function is kmeans(), which is part of base R (the stats package).

The iris dataset in R contains the lengths and widths of sepals and petals for 150 flowers from 3 plant species (50 of each). In the following example, kmeans is used to divide the rows of the iris dataset into 3 clusters:

Code:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

To test whether kmeans can cluster the rows correctly into the 3 species using only the first 4 columns (iris[-5] drops the Species column):

> km = kmeans(iris[-5], 3)
> km$cluster
  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2 2 2 2 2 2 2
 [63] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 1 2
[125] 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 2

> table(km$cluster, iris$Species)
   
    setosa versicolor virginica
  1     33          0         0
  2     17          4         0
  3      0         46        50
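
kmeans() starts from randomly chosen centers, so repeating the call can give a different partition and different cluster numbers than the table above. A minimal sketch of one way to make the run repeatable is to fix the random seed and let kmeans() try several random starts, keeping the best one. The seed value 42 and nstart = 25 are arbitrary choices for illustration and are not part of the original example.

```r
# Fix the RNG seed so the same partition is produced on every run
# (42 is an arbitrary value chosen for this sketch)
set.seed(42)

# nstart = 25 runs kmeans from 25 random sets of initial centers and
# keeps the solution with the smallest total within-cluster sum of squares
km <- kmeans(iris[-5], centers = 3, nstart = 25)

km$size                            # number of rows in each cluster
km$tot.withinss                    # total within-cluster sum of squares
table(km$cluster, iris$Species)    # cross-tabulate clusters against species
```

The cluster numbers themselves carry no meaning; only the grouping of rows does, so two runs should be compared by their cross-tabulations rather than by their labels.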

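In the cross-tabulation for kmeans(iris[-5], 3), the first 2 groups consist mostly of setosa (group 2 also contains 4 versicolor), while the 3rd group mixes versicolor and virginica. Because kmeans starts from randomly chosen centers, cluster numbering and membership can differ between runs; the km$cluster vector printed earlier, for example, comes from a run in which all 50 setosa rows fell into a single cluster. Trying kmeans with 2 clusters, we get the following result: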

Code:

> km = kmeans(iris[-5], 2)
> table(km$cluster, iris$Species)
   
    setosa versicolor virginica
  1     50          3         0
  2      0         47        50

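Hence, kmeans separates setosa from the other 2 species almost perfectly (only 3 versicolor rows join the setosa cluster). This suggests that virginica and versicolor are more similar to each other on these four measurements than either is to setosa.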
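
One way to compare solutions with different numbers of clusters numerically is the total within-cluster sum of squares, which kmeans() returns as tot.withinss. The following is a sketch only: the range 2 to 6, the seed and nstart = 25 are assumptions made for illustration.

```r
set.seed(1)      # arbitrary seed, for repeatability only
ks <- 2:6        # candidate numbers of clusters (assumed range)

# Total within-cluster sum of squares for each candidate k,
# taking the best of 25 random starts each time
wss <- sapply(ks, function(k) {
  kmeans(iris[-5], centers = k, nstart = 25)$tot.withinss
})

data.frame(k = ks, wss = round(wss, 1))
```

A large drop in wss from one k to the next suggests that the extra cluster captures real structure, while small further drops suggest that additional clusters mainly split existing groups.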


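Clustering can also be done with the hddc() function of the HDclassif package, which does not require the number of clusters to be pre-specified. As the output below shows, it fits models with 1 to 10 clusters and selects the one with the best BIC: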

Code:

> library(HDclassif)
> res = hddc(iris[-5])
          Model            K       BIC
         ALL             1       -991.4453 
         AKJBKQKDK       2       -606.9822 
         AKJBKQKDK       3       -598.0384 
         AKJBKQKDK       4       -639.826 
         AKJBKQKDK       5       -675.3368 
         AKJBKQKDK       6       -700.6925 
         AKJBKQKDK       7       -756.5343 
         AKJBKQKDK       8       -828.1545 
         AKJBKQKDK       9       -870.1027 
         AKJBKQKDK       10      -884.5045 

SELECTED: model AKJBKQKDK with 3 clusters, BIC=-598.0384.

> res$class    # one cluster label (1 to 3) per row of iris; cross-tabulated below

> table(res$class, iris$Species)
   
    setosa versicolor virginica
  1      0         43         0
  2      0          7        50
  3     50          0         0

hddc thus identifies 3 clusters that correspond closely to the 3 species, using only the lengths and widths of petals and sepals. Only 7 of 150 cases (4.7%) are incorrectly classified, all of them versicolor rows grouped with virginica.
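
The 7-of-150 figure can be recomputed from the cross-tabulation above. This is a minimal sketch that re-enters the reported counts by hand, so it runs without HDclassif, and that assumes each cluster is labelled with its most frequent species.

```r
# Counts copied from the table(res$class, iris$Species) output above
tab <- matrix(c( 0, 43,  0,
                 0,  7, 50,
                50,  0,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = as.character(1:3),
                              species = c("setosa", "versicolor", "virginica")))

# Rows counted as correct: the largest count in each cluster (majority species)
correct    <- sum(apply(tab, 1, max))
errors     <- sum(tab) - correct
error_rate <- errors / sum(tab)

errors                        # 7
round(100 * error_rate, 1)    # 4.7
```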

References:
Berge, L., Bouveyron, C., Girard, S. (2012). HDclassif: An R Package for Model-Based Clustering and Discriminant Analysis of High-Dimensional Data. Journal of Statistical Software, 46(6), 1-29. URL http://www.jstatsoft.org/v46/i06/.

Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K. (2015). cluster: Cluster Analysis Basics and Extensions. R package version 2.0.2.
 

