##### Citations
Title Text Both

## Principal component analysis

Principal component analysis is a useful technique for dimensionality reduction, i.e. to reduce the number of variables by combining correlated variables together. The contribution of different variables (loadings) in components give an idea which variables are correlated with each other. The plot of first 2 component (biplot) is especially useful, both for variable loadings and for individual row scores. The function prcomp is generally used for principal component analysis.

Code:

> pr = prcomp(birthwt, center=T, scale=T)
> print(round(summary(pr)\$importance,2))
PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8  PC9 PC10
Standard deviation     1.51 1.21 1.15 1.07 0.97 0.90 0.85 0.78 0.72 0.44
Proportion of Variance 0.23 0.15 0.13 0.11 0.09 0.08 0.07 0.06 0.05 0.02
Cumulative Proportion  0.23 0.37 0.51 0.62 0.71 0.80 0.87 0.93 0.98 1.00

Here, first 2 components explain 37% variation of entire dataset.

The contribution of different variables in different components can be seen with following command:

Code:

> round(pr\$rotation,2)
PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9  PC10
low    0.55 -0.12  0.25 -0.11 -0.21  0.11  0.12 -0.16 -0.29 -0.66
age   -0.20 -0.37  0.02 -0.52  0.23  0.33  0.56  0.11  0.26 -0.07
lwt   -0.29 -0.27  0.45 -0.03  0.22 -0.42 -0.03 -0.65 -0.01  0.01
race   0.19  0.58  0.18 -0.29  0.13  0.12 -0.21 -0.23  0.61 -0.10
smoke  0.18 -0.53 -0.19  0.42 -0.19  0.06 -0.13 -0.11  0.63 -0.10
ptl    0.26 -0.25 -0.29 -0.15  0.60  0.30 -0.47 -0.17 -0.23  0.09
ht     0.06 -0.14  0.66  0.20  0.33  0.00 -0.20  0.59  0.09 -0.02
ui     0.31 -0.02 -0.32 -0.19  0.29 -0.76  0.14  0.23  0.13 -0.13
ftv   -0.16 -0.26  0.03 -0.58 -0.46 -0.14 -0.55  0.19  0.02 -0.01
bwt   -0.56  0.11 -0.23  0.15  0.20  0.02 -0.18  0.09  0.01 -0.72

The biplot for different variables can be plotted with following commands:

Code:

> pr = prcomp(birthwt, center=T, scale=T)
> abline(h=0, v=0)

Output graph

In above plot one can see that age, lwt and ftv are close to each other, hence related, while low and bwt are on opposite ends, hence invesely related.

The principal component score for each row can be obtained with following command:

Code:

low age lwt race smoke ptl ht ui ftv  bwt
85   0  19 182    2     0   0  0  1   0 2523
86   0  33 155    3     0   0  0  0   3 2551
87   0  20 105    1     1   0  0  0   1 2557
88   0  21 108    1     1   0  0  1   2 2594
89   0  18 107    1     1   0  0  1   0 2600
91   0  21 124    3     0   0  0  0   0 2622

PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9 PC10
85  0.26  0.65  0.05 -0.04  1.09 -2.91  0.58 -0.76 -0.08 0.65
86 -1.15 -0.12  0.86 -2.82 -0.45  0.19 -0.10 -0.26  0.97 0.67
87  0.05 -0.64 -0.74  1.10 -1.17  0.18 -0.19  0.45  0.28 0.86
88  0.67 -1.03 -1.57 -0.07 -0.71 -2.06 -0.22  1.25  0.70 0.43
89  1.10 -0.32 -1.65  1.32  0.02 -1.97  0.49  0.85  0.52 0.47
91 -0.01  1.73  0.26  0.05  0.13  0.27  0.20 -0.38  0.35 0.76

Biplots for different rows can be plotted with following commands:

Code:

> scores = pr\$x
> plot(scores[,1], scores[,2], xlab="PC1",ylab="PC2", type='n', main='Principal Component Analysis biplot')
> text(scores[,1]+0.005, scores[,2]+0.003, rownames(birthwt))
> abline(h=0, v=0)

Output graph

In above plot one can see that rows 159, 126 and 170 are close together (lower left corner of plot) hence are similar.