At the turn of the 20th century, statistics was beginning to find its voice.
Galton had introduced the world to correlation and regression, inspired by the orderly shapes he observed in human height and heredity. His cousin Darwin had sparked a hunger to understand variation, and Galton’s statistical tools gave it mathematical form. Karl Pearson, Galton’s protégé, took these ideas further — formalising correlation, fitting curves, and building the great machinery of early statistical inference.
Gosset, working quietly under the alias “Student,” challenged the assumption that large samples were always needed, showing that even small batches — like those in the Guinness brewery — could yield meaningful insight, as long as one accounted properly for uncertainty. This gave birth to the t-distribution, a gentler bell curve for modest data.
These pioneers were building tools to understand individual differences, to compare pairs of means, and to measure relationships between variables. Their assumptions were grounded in the bell-shaped normal distribution — a mathematical model that made much of early statistics possible.
But beneath this elegant structure, a tension was growing.
What if you weren’t comparing just two groups, but three or four?
What if you weren’t just curious about a relationship, but trying to design an experiment to test multiple treatments, under uncertain conditions, in the unpredictable world of nature?
This is the world Ronald A. Fisher walked into — not a world of idealised variables on chalkboards, but muddy fields, inconsistent weather, and the unglamorous grind of crop yield records.
It was 1919, and Fisher had just taken a position at the Rothamsted Experimental Station in England, home to some of the oldest agricultural experiments in the world. Here, researchers had been testing fertilisers, seed varieties, and farming practices for decades. But their analysis of the results had become a guessing game, relying more on intuition than inference.
Fisher wasn’t content with that. He believed that even the chaos of the real world could be tamed — if only we could understand how randomness itself behaves.
Picture this: you’re a scientist testing three fertilisers on wheat. You assign each to several plots of land, let the crops grow, then record the yields.
Now, you want to know — did any fertiliser perform better than the others?
Previously, you might compare fertiliser A vs B with a t-test, then B vs C, then A vs C. But this has two major flaws: each additional test inflates the chance of a false positive, so the family-wise Type I error rate creeps well above the nominal 5%; and each pairwise test uses only two groups’ worth of data, ignoring the information the remaining group carries about the underlying variability.
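The first of those flaws is easy to demonstrate. This sketch (a simulation in base R; the group sizes and distribution are arbitrary choices for illustration) draws three groups from the same distribution, so any “significant” pairwise t-test is, by construction, a false positive:

```r
set.seed(42)
# Three groups from the SAME normal distribution: the null hypothesis is true,
# so any rejection below is a false positive
false_positive <- replicate(2000, {
  g <- replicate(3, rnorm(10, mean = 25, sd = 2), simplify = FALSE)
  p <- c(t.test(g[[1]], g[[2]])$p.value,
         t.test(g[[1]], g[[3]])$p.value,
         t.test(g[[2]], g[[3]])$p.value)
  any(p < 0.05)   # did at least one of the three tests reject?
})
mean(false_positive)   # well above the nominal 0.05 level
```

With three comparisons, the chance of at least one spurious rejection is closer to 12% than to the 5% each individual test promises.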
Fisher saw through this. Instead of asking Which two groups are different?, he asked:
Is there evidence that not all group means are equal?
A single question. One test. But to answer it, Fisher had to change the way scientists thought about data.
Rather than comparing means directly, Fisher proposed a new idea:
If there’s a true difference in treatment, it should show up as increased variation between the group means.
But we also expect some variation to occur within each group — just from randomness.
So Fisher constructed a ratio:
\[ F = \frac{\text{Variance Between Groups}}{\text{Variance Within Groups}} \]
This was the birth of the F-statistic.
If the fertilisers all had the same effect, the variation between group means would be about the same as the variation within groups — the F-ratio would be close to 1. But if one fertiliser truly boosted yields, the means would diverge, increasing the numerator and pushing the F-statistic higher.
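A small simulation makes the “close to 1” claim concrete. The sketch below (group sizes and the number of simulations are arbitrary choices, not from the worked example) generates three groups with identical means and computes Fisher’s ratio many times:

```r
set.seed(1)
f_vals <- replicate(5000, {
  y <- matrix(rnorm(30), ncol = 3)                  # three groups of 10, no true effect
  m <- colMeans(y)
  ms_between <- 10 * sum((m - mean(m))^2) / (3 - 1) # n * (SS between) / (k - 1)
  ms_within  <- mean(apply(y, 2, var))              # average within-group variance
  ms_between / ms_within
})
mean(f_vals)   # hovers near 1, as Fisher's argument predicts
```

Individual ratios bounce around, but their typical value sits near 1 when no treatment effect exists; only a genuine difference in means pushes the ratio systematically higher.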
This wasn’t just clever — it was radical. Fisher had moved the conversation away from point estimates (means) and toward patterns of variability.
The decision to use variance wasn’t arbitrary. Fisher was steeped in the work of Gauss, whose normal distribution described measurement error in astronomy and was later adopted by Galton to explain natural traits.
The formula for the normal curve involved the population variance \(\sigma^2\), which quantified spread. In Pearson’s world, this was assumed known. Gosset challenged that, showing that when estimating variance from the data, the result was a distribution with heavier tails — the t-distribution.
Fisher extended this thinking. He understood that variance wasn’t just noise — it was a signal in itself. By analysing how total variation could be partitioned, he could infer whether treatment effects existed.
In this way, variance became more than a mathematical artefact — it became a lens through which to view structure.
To understand how variance is used in ANOVA, we must introduce the idea of Sum of Squares (SS) — the building block of variance.
Recall that variance is calculated as:
\[ \text{Variance} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
The numerator of this formula — the sum of squared deviations from the mean — is the Sum of Squares. It tells us the total squared distance that the data points are from their group mean.
We don’t divide by \(n - 1\) yet — that’s what turns SS into variance. But by keeping it as a total, we can compare sources of variation before standardising them.
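A quick numeric check ties the two formulas together; the vector here is the Group A yields from the worked example later in this piece:

```r
x <- c(20, 22, 19, 21)
SS <- sum((x - mean(x))^2)    # Sum of Squares: total squared deviation from the mean
SS                            # 5
SS / (length(x) - 1)          # dividing by n - 1 turns SS into the sample variance
var(x)                        # R's built-in variance gives the same value
```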
In ANOVA, the total Sum of Squares (\(SS_T\)) is partitioned into two components: the between-groups Sum of Squares (\(SS_B\)), which measures how far the group means sit from the grand mean, and the within-groups Sum of Squares (\(SS_W\)), which measures the scatter of observations around their own group mean, so that \(SS_T = SS_B + SS_W\).
Then, dividing each SS by its degrees of freedom gives us Mean Squares (MS) — which are essentially variances.
And the F-statistic is:
\[ F = \frac{MS_B}{MS_W} = \frac{SS_B / (k - 1)}{SS_W / (N - k)} \]
This ratio of two variances is what Fisher realised could detect true experimental effects beyond random variation.
To make this ratio of variances meaningful, Fisher derived its sampling distribution — the F-distribution.
This distribution gave Fisher a reference: if your F-statistic was unusually large compared to what randomness would typically produce, you had evidence that not all groups were the same.
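In R, that reference distribution is available directly through `qf()` and `pf()`. Using the degrees of freedom from the fertiliser example later in this piece (k = 3 groups, N = 12 observations, so 2 and 9 degrees of freedom):

```r
# Critical value: an F beyond this would arise by chance less than 5% of the time
qf(0.95, df1 = 2, df2 = 9)                       # roughly 4.26
# Tail probability (p-value) for an observed F-statistic of 187.6
pf(187.6, df1 = 2, df2 = 9, lower.tail = FALSE)  # vanishingly small
```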
Fisher formalised his approach as Analysis of Variance — or ANOVA. Here’s the elegant logic: partition the total variation into between-group and within-group components, convert each into a variance (a Mean Square) by dividing by its degrees of freedom, then compare the two as a ratio against the F-distribution.
It wasn’t just a test — it was a framework for experimental design.
| Source of Variation | Sum of Squares (SS) | Degrees of Freedom | Mean Square (MS) | F-Ratio |
|---|---|---|---|---|
| Between Groups | \(SS_B\) | \(k - 1\) | \(MS_B\) | \(F = MS_B / MS_W\) |
| Within Groups | \(SS_W\) | \(N - k\) | \(MS_W\) | |
| Total | \(SS_T\) | \(N - 1\) | | |
Let’s break it down with examples.
You have three fertiliser types: A, B, and C.
```r
# Yields for each fertiliser group
A <- c(20, 22, 19, 21)
B <- c(34, 35, 36, 34)
C <- c(27, 28, 29, 28)

# Within-group variances
var_A <- var(A)
var_B <- var(B)
var_C <- var(C)

# With equal group sizes, the pooled within-group variance
# is simply the mean of the three group variances
MS_within <- mean(c(var_A, var_B, var_C))
```

This shows how much scatter there is within each group.
```r
# Group means and the grand mean
mean_A <- mean(A)
mean_B <- mean(B)
mean_C <- mean(C)
grand_mean <- mean(c(A, B, C))

# Between-groups Mean Square: each squared deviation is weighted by the
# group size (n = 4), and the sum is divided by k - 1 = 2 degrees of freedom
n <- 4
MS_between <- n * sum((mean_A - grand_mean)^2,
                      (mean_B - grand_mean)^2,
                      (mean_C - grand_mean)^2) / 2
```

This shows how different the group means are from the overall average.
```r
# The F-statistic: between-group variance relative to within-group variance
F_value <- MS_between / MS_within
```

If `F_value` is large, the treatment is likely responsible for group differences.
But we don’t need to do all that manually in R. If the data are loaded as a data frame, we can run the whole analysis with R’s built-in ANOVA function, `aov()`.
```r
# Combine into a data frame
values <- c(A, B, C)
groups <- factor(rep(c("A", "B", "C"), each = 4))
df <- data.frame(values, groups)

# Use aov() to perform the ANOVA
model <- aov(values ~ groups, data = df)
summary(model)
```

```
            Df Sum Sq Mean Sq F value   Pr(>F)
groups       2  406.5  203.25   187.6 4.61e-08 ***
Residuals    9    9.7    1.08
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
This performs a full ANOVA using R’s `aov()` function. You’ll see the F-statistic and a p-value that tells you whether the group means are significantly different.
This output matches the logic of our manual calculation above and reinforces the beauty of Fisher’s framework: a comparison of structured variation against random noise.
If the ANOVA indicates significant differences, we can perform post hoc tests to identify which groups differ.
```r
TukeyHSD(model)
```

```
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = values ~ groups, data = df)

$groups
     diff      lwr      upr   p adj
B-A 14.25 12.19514 16.30486 0.00e+00
C-A  7.50  5.44514  9.55486 8.20e-06
C-B -6.75 -8.80486 -4.69514 1.95e-05
```
This adjusts for multiple comparisons and tells you where the significant differences lie.
```r
pairwise.t.test(df$values, df$groups, p.adjust.method = "bonferroni")
```

```
    Pairwise comparisons using t tests with pooled SD

data:  df$values and df$groups

  A       B
B 3.6e-08 -
C 9.2e-06 2.2e-05

P value adjustment method: bonferroni
```
This compares all pairs while adjusting for Type I error.
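The Bonferroni adjustment is simple enough to reproduce by hand: multiply each raw p-value by the number of comparisons, capping the result at 1. A sketch (the raw p-values below are hypothetical, chosen just to illustrate the arithmetic):

```r
raw_p <- c(1.2e-08, 3.1e-06, 7.2e-06)    # hypothetical raw p-values, for illustration
pmin(raw_p * length(raw_p), 1)            # Bonferroni: multiply by m, cap at 1
p.adjust(raw_p, method = "bonferroni")    # R's built-in gives the same result
```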
Use these only after a significant ANOVA result, to avoid data dredging.
Fisher’s genius was in shifting from comparison of values to partitioning of variability. He gave science a framework for asking:
“Where does the variation come from — and is it more than just noise?”