
## Cluster analysis

This technique is generally used to cluster or classify rows (i.e. cases or individuals) into similar subtypes. A number of R packages are available for it. The most commonly used tool is the kmeans function, which ships with base R (in the stats package).

The iris dataset in R contains the lengths and widths of sepals and petals for 150 flowers from 3 plant species. In the following example, kmeans is used to divide the rows of the iris dataset into 3 clusters. The first few rows of the data:

Code:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

To test whether kmeans can cluster the rows correctly into the 3 species using only the first 4 columns:

> km = kmeans(iris[-5], 3)
> km$cluster
[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 2 2 2 2 2 2 2 2 2
[63] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 1 2
[125] 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 2

> table(km$cluster, iris$Species)

    setosa versicolor virginica
  1     33          0         0
  2     17          4         0
  3      0         46        50

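As can be seen, the first 2 groups created by kmeans both correspond to the setosa species (group 2 also contains 4 versicolor rows), while the 3rd group contains both versicolor and virginica. Note that kmeans starts from random centers, so cluster labels and groupings can differ between runs; this table comes from a different run than the cluster vector printed above. Trying kmeans with 2 clusters gives the following result: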

Code:

> km = kmeans(iris[-5], 2)
> table(km$cluster, iris$Species)

    setosa versicolor virginica
  1     50          3         0
  2      0         47        50

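Hence, kmeans is able to separate setosa from the other 2 species, with only 3 versicolor rows falling into the setosa cluster. This suggests that virginica and versicolor are more similar to each other in these measurements than either is to setosa.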
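
Since kmeans picks its starting centers at random, a repeatable run needs a fixed seed, and several random starts make a poor local solution less likely. The following is a minimal sketch only; the seed value 42 and nstart = 25 are arbitrary choices for illustration and are not taken from the runs shown above.

```r
# Fix the random number generator so the run can be repeated.
set.seed(42)  # arbitrary seed, chosen only for illustration

# nstart = 25 tries 25 random sets of starting centers and keeps the
# solution with the lowest total within-cluster sum of squares.
km <- kmeans(iris[-5], centers = 3, nstart = 25)

# Cross-tabulate cluster labels against the known species.
tab <- table(km$cluster, iris$Species)
print(tab)
```

The cluster numbers themselves are arbitrary labels, so only the pattern of counts in the table is meaningful, not which row a species lands in.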

Trying the clustering again with the hddc() function of the HDclassif package, which does not require the number of clusters to be pre-specified:

Code:

> library(HDclassif)
> res = hddc(iris[-5])
Model            K       BIC
ALL             1       -991.4453
AKJBKQKDK       2       -606.9822
AKJBKQKDK       3       -598.0384
AKJBKQKDK       4       -639.826
AKJBKQKDK       5       -675.3368
AKJBKQKDK       6       -700.6925
AKJBKQKDK       7       -756.5343
AKJBKQKDK       8       -828.1545
AKJBKQKDK       9       -870.1027
AKJBKQKDK       10      -884.5045

SELECTED: model AKJBKQKDK with 3 clusters, BIC=-598.0384.

> res$class
[1] 1 4 6 6 6 3 3 6 3 3 6 2 6 6 6 6 6 3 3 3 3 5 1 3 3 7 3 4 7 3 3 6 5 4 3 2 2 3 1 6 7 7 5 4 6 6 5 2 3 6 3 3 2 6 3 3 3 5 6 3 3 3
[63] 3 2 2 5 3 1 2 4 3 4 3 3 2 7 2 2 3 5 2 4 3 2 3 3 6 2 4 2 2 2 5 6 2 2 4 4 4 2 3 4 3 3 3 5 3 4 3 4 4 3 2 3 4 2 2 2 3 6 4 4 3 2
[125] 2 2 4 3 3 2 8 3 8 6 1 2 6 3 3 2 6 6 1 3 6 2 8 2 6 3 6 6 6 3 1 3 3 3 3 6 6 6 6 2 6 3 6 6 7 6 1 3 6 3 3 4 3 3 6 3 2 6 1 6 6 6
[187] 6 2 3

> table(res$class, iris$Species)

    setosa versicolor virginica
  1      0         43         0
  2      0          7        50
  3     50          0         0

hddc identifies 3 clusters that correspond closely to the 3 species using only the lengths and widths of petals and sepals, though 7 of 150 cases (4.7%), all versicolor rows grouped with virginica, are incorrectly classified.
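
The error count can be checked directly from the table. The sketch below re-enters the counts from the hddc table by hand and assumes that each cluster stands for whichever species is most frequent in it; that majority rule is an assumption made here for illustration.

```r
# Counts copied from the hddc table above
# (rows = clusters, columns = species).
tab <- matrix(c( 0, 43,  0,
                 0,  7, 50,
                50,  0,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = c("1", "2", "3"),
                              species = c("setosa", "versicolor", "virginica")))

# Assumption: the largest count in each row is the species the cluster
# represents; all other counts in that row are misclassified.
correct <- sum(apply(tab, 1, max))
errors  <- sum(tab) - correct

errors                              # 7
round(100 * errors / sum(tab), 1)   # 4.7
```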

References:
Bergé, L., Bouveyron, C., Girard, S. (2012). HDclassif: An R Package for Model-Based Clustering and Discriminant Analysis of High-Dimensional Data. Journal of Statistical Software, 46(6), 1-29. URL http://www.jstatsoft.org/v46/i06/.

Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K. (2015). cluster: Cluster Analysis Basics and Extensions. R package version 2.0.2.