T-test

Author

Joseph Mhango

Published

2024-09-13

1 Student’s T-test

William Sealy Gosset is credited with inventing the t-test while working for the Guinness brewery. He first described it in a 1908 paper published under the pseudonym “Student”, in order to protect the commercial interests of Guinness, and the idea was later refined and supported by the great statistician R. A. Fisher. Today, it is one of the most widely used and basic tools in statistics, and it has a fascinating story behind it.

1.1 Objectives

The question of the t-test
Data and assumptions
Graphing
Test and alternatives
Practice exercises

2 The question of the t-test

The typical premise of the t-test is that you want to compare the means of populations you are interested in, which you measure using samples. There are a few versions of the basic question.

Compare two independent samples

Here you have measured a numeric variable and have two samples.
Do the two sample means differ by more than chance alone would explain (i.e. did the samples come from populations with different means)?
Example: An experiment with a control and one treatment group.
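
As a minimal sketch of this version of the question, the example below runs a two-sample t-test in R with `t.test()`. The data are simulated, and the names `control` and `treatment` are hypothetical, chosen only for illustration.

```r
# Simulated data (hypothetical): a control group and one treatment group
set.seed(42)
control   <- rnorm(20, mean = 10, sd = 2)
treatment <- rnorm(20, mean = 12, sd = 2)

# Two-sample t-test. By default R runs Welch's version,
# which does not assume equal variances in the two samples.
two_sample <- t.test(control, treatment)
print(two_sample)

# Student's classic version pools the variance (assumes equal variances)
two_sample_pooled <- t.test(control, treatment, var.equal = TRUE)
print(two_sample_pooled)
```

With `var.equal = TRUE` the degrees of freedom are n1 + n2 - 2; the Welch version usually reports a non-integer value.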

Compare 1 sample to a known mean

Here you have one sample which you wish to compare to a mean value.
Did the sample come from a population exhibiting the known mean?
The data are simply a single numeric vector, and the population mean for comparison.
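
A one-sample test might be sketched as follows. The vector `heights` and the comparison value `known_mean` are made-up values for illustration, not data from this page.

```r
# Simulated data (hypothetical): one numeric vector
set.seed(1)
heights <- rnorm(15, mean = 172, sd = 6)

# Hypothetical known population mean to compare against
known_mean <- 170

# One-sample t-test: did the sample come from a population with this mean?
one_sample <- t.test(heights, mu = known_mean)
print(one_sample)
```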

Paired samples

Here the individual observations comprising the 2 samples are not independent.
Example: Before vs After experiments
Another example: Measuring plots that are paired spatially
For each of these examples, there is a unit, patient, or plot identification, that represents the relationship of each paired measure.
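
The sketch below illustrates a paired test on simulated before/after data. The data frame `paired_data` and its columns (`id`, `before`, `after`) are hypothetical; the important point is that each row holds both measurements for the same unit.

```r
# Simulated data (hypothetical): 12 units, each measured twice
set.seed(7)
paired_data <- data.frame(
  id     = 1:12,
  before = rnorm(12, mean = 50, sd = 5)
)
paired_data$after <- paired_data$before + rnorm(12, mean = 2, sd = 1.5)

# Paired t-test: the two vectors must be in the same unit order
paired_test <- t.test(paired_data$after, paired_data$before, paired = TRUE)
print(paired_test)

# Equivalent formulation: a one-sample test on the within-unit differences
diff_test <- t.test(paired_data$after - paired_data$before, mu = 0)
print(diff_test)
```

Both calls give the same t value and degrees of freedom, which shows that a paired test is really a one-sample test on the differences.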

3 Data and assumptions

The principal assumptions of the t-test are:

Gaussian distribution of observations WITHIN each sample
Homoscedasticity (our old friend) - i.e., the variance is equal in each sample
Independence of observations

3.1 Evaluating and testing the assumptions

The t-test is thought to be somewhat robust to violation of assumptions.
To a certain extent, violations of the Gaussian distribution and homoscedasticity assumptions won’t bias your results.
The assumption of independence of observations is always of high importance.

Testing assumptions

Gaussian distribution of observations WITHIN each sample

Plot and evaluate a histogram (hist()) and a normal q-q plot (e.g., with qqnorm() and qqline())
Use a statistical test evaluating whether data are Gaussian (e.g., shapiro.test())
NB: test EACH SAMPLE SEPARATELY (this is sometimes confusing for beginners).
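
A minimal sketch of these checks, run on each sample separately, could look like this. The vectors `group1` and `group2` are simulated here purely for illustration; `qqnorm()` and `qqline()` are used for the normal q-q plot.

```r
# Simulated data (hypothetical): two samples
set.seed(123)
group1 <- rnorm(30, mean = 5, sd = 1)
group2 <- rnorm(30, mean = 6, sd = 1)

# Graphical checks: one histogram and one normal q-q plot per sample
op <- par(mfrow = c(2, 2))
hist(group1, main = "Histogram: group1")
qqnorm(group1, main = "Q-Q plot: group1"); qqline(group1)
hist(group2, main = "Histogram: group2")
qqnorm(group2, main = "Q-Q plot: group2"); qqline(group2)
par(op)

# Shapiro-Wilk test, applied to EACH sample separately
# (null hypothesis: the data are Gaussian)
shapiro_1 <- shapiro.test(group1)
shapiro_2 <- shapiro.test(group2)
print(shapiro_1)
print(shapiro_2)
```

A small P-value from `shapiro.test()` suggests the sample departs from a Gaussian distribution; it is worth reading it alongside the plots rather than on its own.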

And if guilty?…

The Mann-Whitney U-test does not assume that the data are Gaussian
It is also called the Wilcoxon rank-sum test (wilcox.test() in R)
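
As an illustration, the rank-based alternative can be run with `wilcox.test()`. The skewed samples below are simulated and the object names are hypothetical.

```r
# Simulated data (hypothetical): two right-skewed, non-Gaussian samples
set.seed(99)
group1 <- rexp(25, rate = 1)
group2 <- rexp(25, rate = 0.5)

# Mann-Whitney U-test, called the Wilcoxon rank-sum test in R
mw <- wilcox.test(group1, group2)
print(mw)
```

R reports the test statistic as `W` rather than `U`.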

Homoscedasticity assumption

Examine graphically

# Bartlett's Test
# group1 and group2 are numeric vectors, one per sample
bartlett.test(list(group1, group2))
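
To make the equal-variance check concrete, here is a hedged sketch that defines the two vectors, plots them, and runs the test. The simulated `group1` and `group2` are assumptions for illustration; `var.test()` and the Welch option are shown as one possible follow-up, not as the only approach.

```r
# Simulated data (hypothetical): same mean, different spread
set.seed(2024)
group1 <- rnorm(25, mean = 10, sd = 1)
group2 <- rnorm(25, mean = 10, sd = 3)

# Graphical check: compare the spread of the two samples
boxplot(list(group1 = group1, group2 = group2), ylab = "Measured value")

# Bartlett's test (null hypothesis: the variances are equal)
bt <- bartlett.test(list(group1, group2))
print(bt)

# F test of equal variances, an alternative when there are exactly two groups
ft <- var.test(group1, group2)
print(ft)

# If the variances look unequal, Welch's t-test (R's default)
# does not rely on the equal-variance assumption
welch <- t.test(group1, group2, var.equal = FALSE)
print(welch)
```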

Independence assumption

Can’t really be directly tested without supporting information
It’s up to you to ensure your analysis is being done on the right data
In time series and spatial data, you can test for autocorrelations (out of scope)

Typical output of the t.test() function

The t value: can be positive or negative. The larger the absolute value of t, the stronger the evidence that the samples came from different populations.
The degrees of freedom: the number of independent data points available to estimate the t-statistic.
The 95% confidence interval: this gives a range of values that is likely to contain the true difference in means between the populations from which the samples were taken.
The P-value: the probability of observing a t value at least this extreme if the population means were in fact equal.
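
Each of these pieces can also be pulled out of the object that `t.test()` returns. The sketch below uses simulated data with hypothetical names to show where each value lives.

```r
# Simulated data (hypothetical): two independent samples
set.seed(10)
group1 <- rnorm(20, mean = 10, sd = 2)
group2 <- rnorm(20, mean = 12, sd = 2)

result <- t.test(group1, group2)

result$statistic  # the t value (sign follows mean(group1) - mean(group2))
result$parameter  # the degrees of freedom
result$conf.int   # the 95% confidence interval for the difference in means
result$p.value    # the P-value
result$estimate   # the two sample means
```

The sign of t only reflects the order in which the samples were passed to the function; swapping `group1` and `group2` flips the sign but leaves the P-value unchanged.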