ANOVA - 2

ANOVA with the R Language

Author

Joseph Mhango

Published

2024-09-13

Arrange the arithmetic to compare more than two groups

ANalysis Of VAriance (ANOVA)

The analysis of variance is not a mathematical theorem, but rather a convenient method of arranging the arithmetic. R. A. Fisher (via Wishart 1934. Sppl. J. Roy. Soc. 1(1):26-61.)

1.1 Objectives

The question of 1-way ANOVA
Data and assumptions
Graphing
Test and alternatives
Practice exercises

ANOVA - When?

anova-able scenarios:

one numeric continuous dependent variable of interest
a factor that contains 2 or more levels, often with a control
When there are just two levels, it’s essentially a t-test

ANOVA - scenarios

overall difference At least one factor mean is different from the rest
Planned comparisons: Targeted orthogonal contrasts
post-hoc tests: compare any factors you want, with some correction.
Partition variation: What’s causing variance in the dependent variable?

Test statistic

The test statistic for ANOVA is the F ratio
It is the proportion of variance in the dependent variable between the groups, relative to that within the groups

Data and assumptions

data example: +experiment in animal genetics, looking at the weight of chickens (8-week old weight in grams), where weight is the continuous dependent variable. The factor is the sire identity; the chicks were sired by one of 5 sires, thus sire is a factor with 5 levels A, B, C, D, and E.

Wide format data

A <- c(687, 691, 793, 675, 700, 753, 704, 717)
B <- c(618, 680, 592, 683, 631, 691, 694, 732)
C <- c(618, 687, 763, 747, 687, 737, 731, 603)
D <- c(600, 657, 669, 606, 718, 693, 669, 648)
E <- c(717, 658, 674, 611, 678, 788, 650, 690)

Long format data (preferred)

Convert your data to long format (aov in R works with long data)

weight <- c(A,B,C,D,E)
sire <- c(rep("A", 8),
          rep("B", 8),
          rep("C", 8),
          rep("D", 8),
          rep("E", 8) )
new.long<-data.frame(weight, sire)

Assumptions of ANOVA

Gaussian residuals (test graphically)
Homoscedasticity (test graphically, both overall and within factors; residuals versus fitted plot)
Independent observations (assumed)

Formal test of Gaussian residuals

We can use the shapiro NHST shapiro.test().

shapiro.test(rstandard(m1))

There is no evidence of difference to Gaussian in our residuals for our ANOVA model (Shapiro-Wilk: W = 0.99, n = 40, P = 0.99).

Homoscedasticity check

We will look at the residuals relative to the fitted values.

# Plot for homoscedasticity check
plot(formula = rstandard(m1) ~ fitted(m1),
     ylab = "m1: residuals",
     xlab = "m1: fitted values",
     main = "Spread similar across x?")
abline(h = 0,
       lty = 2, lwd = 2, col = "red")

# Make the mean residual y points (just to check)
y1 <- aggregate(rstandard(m1), by = list(new.long$Sire), FUN = mean)[,2]
# Make the x unique fitted values (just to check)
x1 <- unique(round(fitted(m1), 6))

points(x = x1, y = y1, 
       pch = 16, cex = 1.2, col = "blue")

Bartlett test for equal variances between groups

bartlett.test(): do groups have signif different variances?
significant differences = homoskedasticity violated

bartlett.test(formula = weight~sire, data = new.long)

We find no evidence that variance in offspring weight differs between sires (Bartlett test: K-sqared = 1.69, df = 4, P = 0.79).

Graphing ANOVA

Recommended: Boxplot, with some way to show the central tendency of the data separately for each factor level. For continuous variables, boxplots show this perfectly.

For count variables, barplots are sometimes used with the height set to the mean, along with some form of error bar.

Graphing ANOVA


boxplot(Weight ~ Sire, data = new.long,
        ylab = "Weight (g)",
        xlab = "Sire",
        main = "Effect of Sire on 8-wk weight",
        cex = 0) 

abline(h = mean(new.long$Weight), 
       lty = 2, lwd = 2, col = "red") # Mere vanity

points(x = jitter(rep(1:5, each = 8), amount = .1),
       y = new.long$Weight,
       pch = 16, cex = .8, col = "blue") # Mere vanity

ANOVA basic output

The ANOVA Table”
- There are 2 rows;for the sire effect and the residual
- The test statistic is the F value (1.87)
- The P-value column is named “Pr(>F)” (0.14)
- There are 4 degrees of freedom for the factor
- Residual degrees of freedom is 35
No sig evidence that sire explains variation in weight

Contrasts and post hoc test

Let’s say that sire C is our reference level sire
For now, we’ll use lm() to compare everything to sire C
you can generate contrasts and make tests using emmeans e.t.c.
Let’s interpret the coefficients
What can we notice about the F value and p-value?

Post hoc tests

The phrase post hoc implies an afterthought. Tread carefully
When multiple tests are made on the same data, it increases the chance of discovering a false positive (Type I error)
p-values can be adjusted to keep Type I error chance at 5%.

Bonferroni

The Bonferroni adjustment of p-values:
- Simply divides the P by the number of post hoc pairwise comparisons, so you end up evaluating at a lower P-value
- It is considered conservative, and there are alternatives.
- alternatives exists e.g. Tukey’s e.t.c.
- With pairwise.t.test(), the output will be a matrix of p-values

Tukey HSD (Honestly Significant Differences)

The Tukey HSD test is less conservative Bonferroni
We use the Tukey.HSD() function to apply it here.
Also note that the p-values tend to be smaller for the exact same comparisons, but there are sill no significant comparisons.

Alternatives amidst assumption violations

In case the assumptions of 1-way ANOVA cannot be met by the data, there are a few options:

Transformations (log(), sqrt() e.t.c.)
GLMs
Non-parameteric tests: a great escape, but tend to have less power

Kruskal-Wallis alternative to the 1-way ANOVA

# Try this:
# ?kruskal.test
kruskal.test(formula = Weight ~ Sire,
             data = new.long)

The result is qualitatively the same as that for our 1-way ANOVA test

ANOVA calculation details

# Our data
chicken.wide

ANOVA Equations

{width = “600px”}

ANOVA Variables

{width = “400px”}

ANOVA Sources of variation table

## Do an ANOVA “by hand” programmatically

For the ANOVA code in the script, try to follow what is going on in the code
It is okay if not every detail is clear yet
Do we get the same answer as aov()?