## Comparing many groups

For this section we will use the bwdf dataset to examine the relationship between age and race. Age is a numeric variable, while race is a factor with 3 levels, so this is a comparison of mean age across 3 groups.

### ANOVA (analysis of variance)

R provides the aov function to perform an analysis of variance:

code:

    > res <- aov(age ~ race, data = bwdf)
    > res
    Call:
       aov(formula = age ~ race, data = bwdf)

    Terms:
                        race Residuals
    Sum of Squares   230.080  5048.205
    Deg. of Freedom        2       186

    Residual standard error: 5.209692
    Estimated effects may be unbalanced
    > summary(res)
                 Df Sum Sq Mean Sq F value Pr(>F)
    race          2    230  115.04   4.239 0.0158 *
    Residuals   186   5048   27.14
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
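The F value in this table is the ratio of the two mean squares: the between-group sum of squares divided by its degrees of freedom, over the residual sum of squares divided by its degrees of freedom. As a minimal sketch of where these numbers come from, the block below computes them by hand and compares the result with aov. Because bwdf is not reproduced here, it uses synthetic data; the data frame sim_df and its group sizes and means are hypothetical.

```r
# Synthetic stand-in for bwdf: group sizes and means are made up for illustration
set.seed(42)
sim_df <- data.frame(
  age  = c(rnorm(40, mean = 24, sd = 5),
           rnorm(30, mean = 21, sd = 5),
           rnorm(30, mean = 22, sd = 5)),
  race = factor(rep(c("1", "2", "3"), times = c(40, 30, 30)))
)

grand_mean  <- mean(sim_df$age)
group_means <- tapply(sim_df$age, sim_df$race, mean)
group_sizes <- tapply(sim_df$age, sim_df$race, length)

# Between-group and within-group (residual) sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((sim_df$age - ave(sim_df$age, sim_df$race))^2)

df_between <- nlevels(sim_df$race) - 1
df_within  <- nrow(sim_df) - nlevels(sim_df$race)

# F statistic = ratio of mean squares; p value from the F distribution
f_manual <- (ss_between / df_between) / (ss_within / df_within)
p_manual <- pf(f_manual, df_between, df_within, lower.tail = FALSE)

# The same quantities as reported by summary(aov(...))
aov_table <- summary(aov(age ~ race, data = sim_df))[[1]]
f_aov <- aov_table[["F value"]][1]
p_aov <- aov_table[["Pr(>F)"]][1]

print(c(manual = f_manual, aov = f_aov))
print(c(manual = p_manual, aov = p_aov))
```

Applied to the bwdf table, the same arithmetic gives 230.080 / 2 = 115.04 and 5048.205 / 186 ≈ 27.14, so F ≈ 115.04 / 27.14 ≈ 4.239 on 2 and 186 degrees of freedom.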

The p value (0.0158) indicates a significant relationship between age and race at the 0.05 level; that is, mean age differs among at least some of the race groups. To determine which groups differ from each other, Tukey's HSD test and pairwise t tests can be used:

code:

    > TukeyHSD(res)
      Tukey multiple comparisons of means
        95% family-wise confidence level

    Fit: aov(formula = age ~ race, data = bwdf)

    $race
              diff       lwr         upr     p adj
    2-1 -2.7532051 -5.474434 -0.03197581 0.0466676
    3-1 -1.9036070 -3.863030  0.05581649 0.0589215
    3-2  0.8495982 -1.994371  3.69356754 0.7603654

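The adjusted p values show that only groups 1 and 2 differ significantly in age at the 0.05 level (p = 0.047). The difference between groups 1 and 3 falls just short of significance (p = 0.059, and its confidence interval includes zero), while there is no significant difference between groups 2 and 3 (p = 0.760).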
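The comparison table can also be filtered programmatically rather than read by eye. The following is a minimal sketch on synthetic data (sim_df is a hypothetical stand-in for bwdf, with made-up group sizes and means). It pulls the matrix out of the TukeyHSD result, keeps the rows with an adjusted p value below 0.05, and checks that these are the same rows whose 95% interval excludes zero.

```r
# Synthetic stand-in for bwdf: group sizes and means are made up for illustration
set.seed(42)
sim_df <- data.frame(
  age  = c(rnorm(40, mean = 24, sd = 5),
           rnorm(30, mean = 21, sd = 5),
           rnorm(30, mean = 22, sd = 5)),
  race = factor(rep(c("1", "2", "3"), times = c(40, 30, 30)))
)

fit <- aov(age ~ race, data = sim_df)

# TukeyHSD returns a list with one matrix per factor;
# the columns are diff, lwr, upr and "p adj"
tukey <- TukeyHSD(fit)$race

# Pairs that differ at the 0.05 family-wise level
significant <- tukey[tukey[, "p adj"] < 0.05, , drop = FALSE]

# Equivalent criterion: the 95% interval does not contain zero
excludes_zero <- tukey[, "lwr"] > 0 | tukey[, "upr"] < 0

print(tukey)
print(significant)
print(excludes_zero)
```

In the bwdf output, the 3-1 interval runs from -3.86 to 0.056 and so contains zero, which is consistent with its adjusted p value of 0.059.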

code:

    > with(bwdf, pairwise.t.test(age, race))

        Pairwise comparisons using t tests with pooled SD

    data:  age and race

      1     2
    2 0.053 -
    3 0.053 0.481

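By default, pairwise.t.test adjusts the p values for multiple comparisons using the Holm method. After adjustment, the comparisons 1 vs 2 and 1 vs 3 both have p = 0.053, just above the 0.05 threshold, so neither is significant at that level. Groups 2 and 3 clearly do not differ (p = 0.481).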
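The identical value 0.053 for two comparisons is a consequence of the Holm adjustment, which pairwise.t.test applies by default. As a rough illustration, the sketch below applies p.adjust to unadjusted p values. The values for 1 vs 2 and 1 vs 3 are taken from the race2 and race3 rows of the regression output later in this section, on the assumption that the pooled-SD t tests match those coefficient tests. The value for 2 vs 3 is assumed to be 0.481, since Holm leaves the largest p value unchanged.

```r
# Unadjusted p values for the three pairwise comparisons.
# 1 vs 2 and 1 vs 3 are read off the regression output in this section;
# 2 vs 3 is assumed to be 0.481 (Holm never changes the largest p value).
p_raw <- c("1 vs 2" = 0.0178, "1 vs 3" = 0.0228, "2 vs 3" = 0.481)

# Holm: multiply the sorted p values by 3, 2, 1 and enforce monotonicity
p_holm <- p.adjust(p_raw, method = "holm")

# Bonferroni, for comparison: multiply every p value by 3 (capped at 1)
p_bonf <- p.adjust(p_raw, method = "bonferroni")

print(round(p_holm, 3))
print(round(p_bonf, 3))
```

The smallest p value is multiplied by 3 (0.0178 × 3 ≈ 0.053) and the second by 2 (0.0228 × 2 ≈ 0.046), which is then raised to 0.053 so that the adjusted values stay in order. Passing p.adjust.method = "none" to pairwise.t.test would report the unadjusted values instead.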

### Non-parametric test

For data that are not normally distributed, or for small samples where normality cannot be assumed, the Kruskal-Wallis test can be used as a non-parametric alternative to one-way analysis of variance:

code:

    > res <- kruskal.test(age ~ race, data = bwdf)
    > res

        Kruskal-Wallis rank sum test

    data:  age by race
    Kruskal-Wallis chi-squared = 7.2515, df = 2, p-value = 0.02663
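With p = 0.02663, the rank-based test also indicates a difference between the groups at the 0.05 level. The statistic is computed from the ranks of the pooled observations and compared with a chi-squared distribution with (number of groups - 1) degrees of freedom. The sketch below reproduces that calculation by hand on synthetic data and compares it with kruskal.test. The data frame sim_df is a hypothetical stand-in for bwdf.

```r
# Synthetic stand-in for bwdf: group sizes and means are made up for illustration
set.seed(42)
sim_df <- data.frame(
  age  = c(rnorm(40, mean = 24, sd = 5),
           rnorm(30, mean = 21, sd = 5),
           rnorm(30, mean = 22, sd = 5)),
  race = factor(rep(c("1", "2", "3"), times = c(40, 30, 30)))
)

n_total  <- nrow(sim_df)
ranks    <- rank(sim_df$age)                    # ranks over the pooled sample
rank_sum <- tapply(ranks, sim_df$race, sum)     # rank sum per group
n_group  <- tapply(ranks, sim_df$race, length)  # group sizes

# Kruskal-Wallis H statistic
h_manual <- 12 / (n_total * (n_total + 1)) * sum(rank_sum^2 / n_group) -
  3 * (n_total + 1)

# Tie correction (the divisor equals 1 when no values are tied)
tie_counts <- table(sim_df$age)
h_manual <- h_manual / (1 - sum(tie_counts^3 - tie_counts) / (n_total^3 - n_total))

# p value from the chi-squared approximation
p_manual <- pchisq(h_manual, df = nlevels(sim_df$race) - 1, lower.tail = FALSE)

kw <- kruskal.test(age ~ race, data = sim_df)
print(c(manual = h_manual, kruskal.test = unname(kw$statistic)))
print(c(manual = p_manual, kruskal.test = kw$p.value))
```

For the bwdf result, pchisq(7.2515, df = 2, lower.tail = FALSE) gives the reported p value of about 0.0266.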

### Using regression

Linear regression can also be used to compare multiple groups:

code:

    > summary(lm(age ~ race, data = bwdf))

    Call:
    lm(formula = age ~ race, data = bwdf)

    Residuals:
         Min       1Q   Median       3Q      Max
    -10.2917  -4.2917  -0.5385   3.6119  20.7083

    Coefficients:
                Estimate Std. Error t value            Pr(>|t|)
    (Intercept)  24.2917     0.5317  45.686 <0.0000000000000002 ***
    race2        -2.7532     1.1518  -2.390              0.0178 *
    race3        -1.9036     0.8293  -2.295              0.0228 *
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 5.21 on 186 degrees of freedom
    Multiple R-squared:  0.04359,   Adjusted R-squared:  0.03331
    F-statistic: 4.239 on 2 and 186 DF,  p-value: 0.01585

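The coefficients race2 and race3 are the differences in mean age between groups 2 and 3 and the reference group 1. Their p values (0.0178 and 0.0228) show that both groups differ significantly from group 1 at the 0.05 level. These p values are not adjusted for multiple comparisons, which is why they are smaller than the adjusted values from TukeyHSD and pairwise.t.test. The overall F-statistic (4.239, p = 0.01585) matches the ANOVA result.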
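The link between the regression output and the ANOVA can be checked numerically. The sketch below uses synthetic data (sim_df is a hypothetical stand-in for bwdf, with made-up group sizes and means). It verifies that the intercept equals the mean of the reference group, that the other coefficients equal differences from that mean, and that lm and aov report the same F statistic.

```r
# Synthetic stand-in for bwdf: group sizes and means are made up for illustration
set.seed(42)
sim_df <- data.frame(
  age  = c(rnorm(40, mean = 24, sd = 5),
           rnorm(30, mean = 21, sd = 5),
           rnorm(30, mean = 22, sd = 5)),
  race = factor(rep(c("1", "2", "3"), times = c(40, 30, 30)))
)

fit_lm  <- lm(age ~ race, data = sim_df)
fit_aov <- aov(age ~ race, data = sim_df)

# Mean age per group as a plain named vector
group_means <- sapply(split(sim_df$age, sim_df$race), mean)

# With default treatment contrasts: intercept = mean of the reference level,
# remaining coefficients = differences from the reference level
coef_expected <- c(group_means[["1"]],
                   group_means[["2"]] - group_means[["1"]],
                   group_means[["3"]] - group_means[["1"]])

# Overall F statistic from both fits
f_lm  <- unname(summary(fit_lm)$fstatistic["value"])
f_aov <- summary(fit_aov)[[1]][["F value"]][1]

print(rbind(lm = unname(coef(fit_lm)), by_hand = coef_expected))
print(c(lm = f_lm, aov = f_aov))
```

The same pattern is visible in the bwdf results: the race2 and race3 estimates (-2.7532 and -1.9036) equal the 2-1 and 3-1 differences reported by TukeyHSD.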