Inference for Multiple Means

ANOVA

Math 115

Extending Our Framework

Previously, we compared two groups using the t-test.

But what if we have 3 or more groups?

Compare test scores across 4 teaching methods?
Compare salaries across 5 job categories?
Compare vocabulary scores across 4 social classes?

We need a new approach!

Why Not Multiple T-Tests?

With 4 groups, we could do pairwise t-tests:

Groups 1 vs 2, 1 vs 3, 1 vs 4, 2 vs 3, 2 vs 4, 3 vs 4
That’s 6 separate tests!

The problem: Each test has a 5% chance of Type I error (if α = 0.05)

Overall probability of at least one Type I error (treating the 6 tests as independent):

\[1 - 0.95^6 \approx 0.265 = 26.5\%\]

This inflated error rate is unacceptable!
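The arithmetic above can be reproduced in a few lines. This is a minimal sketch that assumes the tests are independent; the function name `familywise_error_rate` is made up for illustration.

```python
from math import comb

def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one Type I error across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

k = 4                     # number of groups
n_pairs = comb(k, 2)      # number of pairwise comparisons among k groups
fwer = familywise_error_rate(0.05, n_pairs)
print(n_pairs, round(fwer, 3))   # 6 0.265
```

Trying `k = 5` gives 10 comparisons and an even larger error rate, which is why the number of pairwise tests becomes a problem quickly.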

The Solution: One Overall Test

Instead of many pairwise tests, do one holistic test:

Hypotheses:

\(H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k\) (all means are equal)
\(H_A:\) At least one mean is different

If we reject \(H_0\), THEN we can explore which groups differ (with appropriate adjustments).

This approach is called Analysis of Variance (ANOVA).

Vocabulary Scores

Research question: Do vocabulary test scores differ by self-identified social class?

Groups: Lower, Middle, Upper, Working (k = 4)
Response: Score on 10-question vocabulary test
Data: General Social Survey (n = 795)

Class	n	Mean	SD
Lower	41	5.07	2.24
Middle	331	6.76	1.89
Upper	16	6.19	2.34
Working	407	5.75	1.87

A Holistic Approach to Comparing Means

One way to approach this problem would be to make 6 pairwise comparisons (comparing each group to every other group) using two-sample t-tests
However, if the null hypothesis is true, there is a 5% chance of making a Type I error with each test (if \(\alpha=0.05\))
The probability of making at least one Type I error across m independent tests would be \(1-(1-\alpha)^m\)
- In our example, it would be \(1-0.95^6 \approx 0.265\)
Instead, we take a holistic view and test whether at least one of the means is different from the others

Note that this holistic approach does not identify which of the tested groups have significantly different means
If the null hypothesis is rejected, we know only that there are significant differences among the means
If there is convincing evidence that at least one of the means is different, we can follow up with post-hoc pairwise tests to see which groups differ
We will also need to take steps to control the overall Type I error rate across the multiple hypothesis tests
This is a topic we will discuss in more detail later

The Key Idea: Between vs. Within

Two sources of variability in the data:

Between groups: How different are the group means from each other?

Within groups: How much do individuals vary within each group?

If \(H_0\) is true (all means equal), the group means differ only by chance, so between-group variability should be no larger than what within-group variability alone would produce.

Variability Between vs. Within (Sums of Squares)

Between groups: How far are group means from the overall mean?

\[SSG = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2\]

Within groups: How far are individual values from their group means?

\[SSE = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2\]
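To make the two formulas concrete, here is a small sketch that computes SSG and SSE directly from raw data. The three groups are hypothetical toy numbers chosen so the arithmetic is easy to check by hand; they are not from the vocabulary data.

```python
def sums_of_squares(groups):
    """Return (SSG, SSE) for a list of groups, each a list of numbers."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    ssg = 0.0
    sse = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        # Between: squared distance of the group mean from the overall mean,
        # weighted by the group size n_i.
        ssg += len(g) * (group_mean - grand_mean) ** 2
        # Within: squared distances of each value from its own group mean.
        sse += sum((x - group_mean) ** 2 for x in g)
    return ssg, sse

toy_groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]   # hypothetical data
ssg, sse = sums_of_squares(toy_groups)
print(ssg, sse)   # 24.0 6.0
```

Here the group means are 5, 7, and 9 and the overall mean is 7, so most of the variability sits between the groups rather than within them.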

Mean Squares (MSG and MSE)

To compare variability, we convert sums of squares to mean squares by dividing by degrees of freedom:

Between Groups:

\[MSG = \frac{SSG}{df_G}\]

\[df_G = k - 1\]

(number of groups minus 1)

Within Groups:

\[MSE = \frac{SSE}{df_E}\]

\[df_E = n - k\]

(total observations minus number of groups)

Why divide by df? This accounts for how many groups and observations we have, making the variability measures comparable.
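A quick worked note: the degrees of freedom split the same way the sums of squares do. The symbols \(SST\) and \(df_T\) (total sum of squares and total degrees of freedom) are notation introduced only for this note.

\[SST = SSG + SSE, \qquad df_T = n - 1 = (k - 1) + (n - k)\]

For the vocabulary data (n = 795, k = 4):

\[df_T = 794 = 3 + 791\]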

The F Statistic

\[F = \frac{\text{Variability Between Groups}}{\text{Variability Within Groups}} = \frac{MSG}{MSE}\]

MSG = Mean Square between Groups
MSE = Mean Square Error (within groups)

Interpretation:

If \(H_0\) is true: F should be close to 1 (similar variability)
Large F → group means are more spread out than expected → evidence against \(H_0\)

Computing F

For the vocabulary example:

	Between Groups	Within Groups
Degrees of freedom	\(df_G = 4 - 1 = 3\)	\(df_E = 795 - 4 = 791\)
Sum of Squares	\(SSG = 236.56\)	\(SSE = 2869.8\)
Mean Square	\(MSG = \frac{236.56}{3} = 78.85\)	\(MSE = \frac{2869.8}{791} = 3.63\)

F statistic: \(F = \frac{MSG}{MSE} = \frac{78.85}{3.63} \approx 21.73\) (computed from the unrounded mean squares)

This F is much larger than 1!
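As a rough check, the F statistic can be approximately rebuilt from the summary table alone, using \(SSG = \sum n_i(\bar{x}_i - \bar{x})^2\) and \(SSE = \sum (n_i - 1)s_i^2\). This is a sketch, not how the slide values were produced: because the table's means and SDs are rounded, the result lands near the reported values rather than exactly on them.

```python
# Summary statistics from the vocabulary table: (n, mean, SD) per class.
summary = {
    "Lower": (41, 5.07, 2.24),
    "Middle": (331, 6.76, 1.89),
    "Upper": (16, 6.19, 2.34),
    "Working": (407, 5.75, 1.87),
}

n_total = sum(n for n, _, _ in summary.values())
k = len(summary)
grand_mean = sum(n * mean for n, mean, _ in summary.values()) / n_total

# Between-group and within-group sums of squares from summary statistics.
ssg = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in summary.values())
sse = sum((n - 1) * sd ** 2 for n, _, sd in summary.values())

msg = ssg / (k - 1)          # df_G = k - 1
mse = sse / (n_total - k)    # df_E = n - k
f_stat = msg / mse
print(round(ssg, 1), round(sse, 1), round(f_stat, 1))
```

The recomputed F comes out at about 21.6, close to the reported 21.73; the gap is due to rounding in the summary table.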

Hypothesis Test Using Random Permutation

To see whether this F statistic represents statistically significant evidence, we can simulate the null hypothesis
To simulate independence between vocabulary score and social class, we randomly permute the values of the response (wordsum score) relative to the class labels

Here are 5 random permutations:

# A tibble: 795 × 8
      id wordsum class   randPerm1 randPerm2 randPerm3 randPerm4 randPerm5
   <int>   <dbl> <chr>   <fct>     <fct>     <fct>     <fct>     <fct>    
 1     1       6 MIDDLE  MIDDLE    MIDDLE    MIDDLE    MIDDLE    MIDDLE   
 2     2       9 WORKING WORKING   WORKING   WORKING   WORKING   WORKING  
 3     3       6 WORKING WORKING   WORKING   WORKING   WORKING   WORKING  
 4     4       5 WORKING WORKING   WORKING   WORKING   WORKING   WORKING  
 5     5       6 WORKING WORKING   WORKING   WORKING   WORKING   WORKING  
 6     6       6 WORKING WORKING   WORKING   WORKING   WORKING   WORKING  
 7     7       8 MIDDLE  MIDDLE    MIDDLE    MIDDLE    MIDDLE    MIDDLE   
 8     8      10 WORKING WORKING   WORKING   WORKING   WORKING   WORKING  
 9     9       8 WORKING WORKING   WORKING   WORKING   WORKING   WORKING  
10    10       9 UPPER   UPPER     UPPER     UPPER     UPPER     UPPER    
# ℹ 785 more rows

Here is the dotplot of 100 simulations

Null distribution

Histogram of F scores (null distribution) for 1,000 random permutations of word scores. Dashed vertical line indicates observed F score.

None of the 1,000 randomized \(F\) statistics is at least as large as the observed value (21.73)
The estimated p-value is \(0/1000 = 0\), so we report it as less than \(1/1000 = 0.001\)
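The permutation procedure can be sketched as follows. The data here are hypothetical toy values (not the GSS data), the helper name `f_statistic` is made up for illustration, and the seed is fixed only so that the run is repeatable.

```python
import random

def f_statistic(values, labels):
    """One-way ANOVA F statistic for parallel lists of values and group labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ssg = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    sse = sum((x - sum(g) / len(g)) ** 2 for g in groups.values() for x in g)
    return (ssg / (k - 1)) / (sse / (n - k))

# Hypothetical toy data: three clearly separated groups of four.
values = [1, 2, 3, 2, 5, 6, 7, 6, 9, 10, 11, 10]
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

rng = random.Random(115)           # fixed seed for repeatability
observed_f = f_statistic(values, labels)

n_perm = 1000
null_fs = []
for _ in range(n_perm):
    shuffled = values[:]           # permute the response, keep labels fixed
    rng.shuffle(shuffled)
    null_fs.append(f_statistic(shuffled, labels))

# Proportion of permuted F statistics at least as large as the observed one.
p_value = sum(f >= observed_f for f in null_fs) / n_perm
print(round(observed_f, 2), p_value)
```

Shuffling breaks any real link between group and response, so the collected F values show what the statistic looks like when \(H_0\) is true.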

The ANOVA Table

Jamovi reports the full ANOVA table:

	Sum of Squares	df	Mean Square	F	p
class	236.56	3	78.85	21.73	< 0.001
Residuals	2869.80	791	3.63

F = 21.73 (test statistic)

The Null Distribution

When conditions are met, the F-statistic follows an F-distribution if \(H_0\) is true:

F-distribution depends on degrees of freedom: \(df_1 = k - 1\) and \(df_2 = n - k\)
This distribution describes what \(F\) values we’d expect if all group means are equal

Computing the p-value:

P-value = area to the right of observed \(F\) under the F-distribution
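One way to see what this null distribution looks like without a statistics library is to simulate it. The sketch below relies on the standard fact that an F(df1, df2) variable is a ratio of two independent chi-square variables, each divided by its degrees of freedom; the function name `simulate_f` and the number of draws are arbitrary choices for illustration.

```python
import random

def simulate_f(df1, df2, n_draws, seed=0):
    """Draw from an F(df1, df2) distribution as a ratio of scaled chi-squares."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # A chi-square with d degrees of freedom is Gamma(shape=d/2, scale=2).
        chi2_num = rng.gammavariate(df1 / 2, 2)
        chi2_den = rng.gammavariate(df2 / 2, 2)
        draws.append((chi2_num / df1) / (chi2_den / df2))
    return draws

df1, df2 = 3, 791                   # k - 1 and n - k for the vocabulary example
observed_f = 21.73
draws = simulate_f(df1, df2, n_draws=20000)

mean_f = sum(draws) / len(draws)    # should be close to 1 under H0
# Simulated right-tail area beyond the observed F.
right_tail = sum(f >= observed_f for f in draws) / len(draws)
print(round(mean_f, 2), right_tail)
```

The simulated values cluster around 1, and essentially none reach 21.73, which matches the tiny p-value. If SciPy is available, `scipy.stats.f.sf(21.73, 3, 791)` should give the exact right-tail area instead of a simulated one.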

Conditions for ANOVA

Independence:

Random sample ✓
Independent observations within and between groups ✓

Normality:

Reasonable sample sizes in each group ✓
No extreme outliers visible ✓

Equal variance:

The largest SD is less than or equal to twice the smallest SD
\((2.34 \le 2 \times 1.87)\) ✓

Conditions are met for using the F-distribution.
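The equal-variance rule of thumb above is easy to check mechanically. A minimal sketch using the SDs from the vocabulary table:

```python
# Group standard deviations from the vocabulary table.
sds = {"Lower": 2.24, "Middle": 1.89, "Upper": 2.34, "Working": 1.87}

largest, smallest = max(sds.values()), min(sds.values())
ratio = largest / smallest
# Rule of thumb from the slide: largest SD at most twice the smallest SD.
equal_variance_ok = largest <= 2 * smallest
print(round(ratio, 2), equal_variance_ok)   # 1.25 True
```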

The F-Distribution

Figure: Random Permutations and F-distribution

F-distribution has two df parameters: \(df_1 = k-1\), \(df_2 = n-k\)
P-value is always the right tail (area beyond observed F)

Conclusion

Results:

F = 21.73
df₁ = 3, df₂ = 791
P-value < 0.001

Decision: Reject \(H_0\)

Conclusion: There is convincing evidence that mean vocabulary scores differ across self-identified social classes. At least one group has a different mean.

What ANOVA Does and Doesn’t Tell Us

ANOVA tells us:

At least one group mean is different from the others

ANOVA does NOT tell us:

WHICH specific groups differ
The direction or magnitude of differences

Summary

ANOVA compares means across 3+ groups:

Component	Formula/Value
F statistic	\(F = \frac{MSG}{MSE}\)
Degrees of freedom	\(df_1 = k - 1\), \(df_2 = n - k\)
P-value	Right tail of F-distribution

Intuition: Large F means between-group variability is large relative to within-group variability → evidence that means differ

Conditions: Independence + Normality + Equal variance

Connection to Independent Samples T-Test

	Two Means	Multiple Means
Groups	2	3 or more
Test statistic	T	F
Null distribution	t-distribution	F-distribution
Hypothesis	\(\mu_1 = \mu_2\)	\(\mu_1 = \mu_2 = \cdots = \mu_k\)
Alternative	\(\mu_1 \neq \mu_2\)	At least one differs

For k = 2, ANOVA and the two-sample t-test with a pooled (equal-variance) standard error give equivalent results!

(In fact, \(F = T^2\) when comparing two groups this way)
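The \(F = T^2\) relationship can be checked numerically. This sketch uses hypothetical toy data and assumes the t statistic is computed with the pooled (equal-variance) standard error; the helper names are made up for illustration.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def pooled_t(a, b):
    """Two-sample t statistic using the pooled (equal-variance) standard error."""
    ss_a = sum((x - mean(a)) ** 2 for x in a)
    ss_b = sum((x - mean(b)) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (len(a) + len(b) - 2)
    se = sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return (mean(a) - mean(b)) / se

def anova_f(groups):
    """One-way ANOVA F statistic, MSG / MSE."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    ssg = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ssg / (len(groups) - 1)) / (sse / (len(all_values) - len(groups)))

a = [4, 5, 6, 7]            # hypothetical group 1
b = [6, 8, 9, 11, 12]       # hypothetical group 2
t = pooled_t(a, b)
f = anova_f([a, b])
print(round(t ** 2, 4), round(f, 4))   # the two numbers agree
```

With two groups, MSE is exactly the pooled variance, which is why squaring the pooled t statistic reproduces F.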

References

Introduction to Modern Statistics (2e) textbook by Mine Çetinkaya-Rundel and Johanna Hardin
Section 22.3