Inference for Multiple Means

ANOVA

Math 115

Extending Our Framework

Previously, we compared two groups using the t-test.

But what if we have 3 or more groups?

  • Compare test scores across 4 teaching methods?
  • Compare salaries across 5 job categories?
  • Compare vocabulary scores across 4 social classes?

We need a new approach!

Why Not Multiple T-Tests?

With 4 groups, we could do pairwise t-tests:

  • Groups 1 vs 2, 1 vs 3, 1 vs 4, 2 vs 3, 2 vs 4, 3 vs 4
  • That’s 6 separate tests!

The problem: Each test has a 5% chance of Type I error (if α = 0.05)

Overall probability of at least one Type I error:

\[1 - 0.95^6 \approx 0.265 = 26.5\%\]

This inflated error rate is unacceptable!
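As a quick check of this arithmetic, here is a minimal Python sketch. The function name `familywise_error_rate` is made up for illustration, and the formula assumes the pairwise tests are independent and that every null hypothesis is true, which is only an approximation for tests that share groups.

```python
from math import comb

def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    """Chance of at least one Type I error across all pairwise tests of k groups.

    Assumes independent tests and that every null hypothesis is true.
    """
    m = comb(k, 2)  # number of pairwise comparisons among k groups
    return 1 - (1 - alpha) ** m

for k in (3, 4, 5):
    # prints: number of groups, number of tests, overall error rate
    print(k, comb(k, 2), round(familywise_error_rate(k), 3))
# 3 3 0.143
# 4 6 0.265
# 5 10 0.401
```

The error rate climbs quickly: with 5 groups (10 tests) it is already about 40%.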

The Solution: One Overall Test

Instead of many pairwise tests, do one holistic test:

Hypotheses:

  • \(H_0: \mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k\) (all means are equal)
  • \(H_A:\) At least one mean is different

If we reject \(H_0\), THEN we can explore which groups differ (with appropriate adjustments).

This approach is called Analysis of Variance (ANOVA).

Vocabulary Scores

Research question: Do vocabulary test scores differ by self-identified social class?

  • Groups: Lower, Middle, Upper, Working (k = 4)
  • Response: Score on 10-question vocabulary test
  • Data: General Social Survey (n = 795)

EDA: Vocabulary Scores

Class      n     Mean   SD
Lower      41    5.07   2.24
Middle     331   6.76   1.89
Upper      16    6.19   2.34
Working    407   5.75   1.87

A Holistic Approach to Comparing Means

  • One way to approach this problem would be to make 6 pairwise comparisons (comparing each group to every other group) using two-sample t-tests
  • However, if the null hypothesis is true, there is a 5% chance of making a Type I error with each test (if \(\alpha=0.05\))
  • The probability of making at least one Type I error after m independent tests would be \(1-(1-\alpha)^m\)
    • In our example, it would be \(1-0.95^6 \approx 0.265\)
  • Instead we take a holistic view and test whether at least one of the means is different from the others
  • Note that this holistic approach does not identify which of the tested groups have significantly different means
  • If the null hypothesis is rejected, we know only that there are significant differences among the means
  • If there is convincing evidence that at least one of the means is different, we can follow up with post-hoc pairwise tests to see which groups differ
  • We will also need to take steps to control the Type I error rate across these multiple hypothesis tests
  • This is a topic we will discuss in more detail later

The Key Idea: Between vs. Within

Two sources of variability in the data:

Between groups: How different are the group means from each other?

Within groups: How much do individuals vary within each group?

If \(H_0\) is true (all means equal), between-group variability should be small relative to within-group variability.

Variability Between vs. Within (Sums of Squares)

Between groups: How far are group means from the overall mean?

\[SSG = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2\]

Within groups: How far are individual values from their group means?

\[SSE = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2\]
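To make the two formulas concrete, here is a minimal Python sketch on a small made-up data set (three groups of three values; the data and the name `sums_of_squares` are illustrative, not from the survey).

```python
from statistics import mean

def sums_of_squares(groups):
    """Return (SSG, SSE) for a dict mapping group name -> list of values."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    ssg = 0.0  # between groups: group means vs. the overall mean
    sse = 0.0  # within groups: individual values vs. their own group mean
    for values in groups.values():
        group_mean = mean(values)
        ssg += len(values) * (group_mean - grand_mean) ** 2
        sse += sum((x - group_mean) ** 2 for x in values)
    return ssg, sse

# Hypothetical toy data, chosen so the arithmetic is easy to follow by hand
toy = {"A": [4.0, 5.0, 6.0], "B": [7.0, 8.0, 9.0], "C": [1.0, 2.0, 3.0]}
ssg, sse = sums_of_squares(toy)
print(ssg, sse)
```

For this toy data the group means are 5, 8, and 2 and the overall mean is 5, so SSG = 3(0) + 3(9) + 3(9) = 54 and SSE = 2 + 2 + 2 = 6.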

Mean Squares (MSG and MSE)

To compare variability, we convert sums of squares to mean squares by dividing by degrees of freedom:

Between Groups:

\[MSG = \frac{SSG}{df_G}\]

\[df_G = k - 1\]

(number of groups minus 1)

Within Groups:

\[MSE = \frac{SSE}{df_E}\]

\[df_E = n - k\]

(total observations minus number of groups)

Why divide by df? This accounts for how many groups and observations we have, making the variability measures comparable.

The F Statistic

\[F = \frac{\text{Variability Between Groups}}{\text{Variability Within Groups}} = \frac{MSG}{MSE}\]

  • MSG = Mean Square between Groups
  • MSE = Mean Square Error (within groups)

Interpretation:

  • If \(H_0\) is true: F should be close to 1 (similar variability)
  • Large F → group means are more spread out than expected → evidence against \(H_0\)
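A minimal Python sketch of the whole calculation, from raw data to F, on made-up data (the function name `f_statistic` and both data sets are illustrative assumptions):

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups (each a list of values)."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = mean([x for g in groups for x in g])
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    msg = ssg / (k - 1)                  # mean square between groups
    mse = sse / (n - k)                  # mean square error (within groups)
    return msg / mse

# Hypothetical data: group means far apart relative to the spread inside groups
spread_out = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 2.0, 3.0]]
# Hypothetical data: same nine values, rearranged so every group mean is 5
same_means = [[4.0, 5.0, 6.0], [1.0, 5.0, 9.0], [2.0, 6.0, 7.0]]

print(f_statistic(spread_out))
print(f_statistic(same_means))
```

The first data set gives a large F (27), while the second gives F = 0 because the group means are identical. With real data and a true \(H_0\), sample means vary by chance, so F lands near 1 rather than exactly at 0.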

Computing F

For the vocabulary example:

                     Between Groups                          Within Groups
Degrees of freedom   \(df_G = 4 - 1 = 3\)                    \(df_E = 795 - 4 = 791\)
Sum of Squares       \(SSG = 236.56\)                        \(SSE = 2869.8\)
Mean Square          \(MSG = \frac{236.56}{3} = 78.85\)      \(MSE = \frac{2869.8}{791} = 3.63\)

F statistic: \(F = \frac{MSG}{MSE} = \frac{78.85}{3.63} = 21.73\)

This F is much larger than 1!
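These values can be approximately reproduced from the rounded summary statistics in the EDA table. The sketch below is a rough check, not the method the software uses: it relies on the fact that each group's within-group sum of squares equals \((n_i - 1)s_i^2\), and because the means and SDs are rounded to two decimals the results differ slightly from the values above.

```python
# Rounded group summaries from the EDA table: (n, mean, SD)
summary = {
    "Lower": (41, 5.07, 2.24),
    "Middle": (331, 6.76, 1.89),
    "Upper": (16, 6.19, 2.34),
    "Working": (407, 5.75, 1.87),
}

k = len(summary)
n = sum(size for size, _, _ in summary.values())
grand_mean = sum(size * m for size, m, _ in summary.values()) / n

# SSG: weighted squared distances of the group means from the overall mean
ssg = sum(size * (m - grand_mean) ** 2 for size, m, _ in summary.values())
# SSE: (n_i - 1) * SD_i^2 recovers each group's within-group sum of squares
sse = sum((size - 1) * sd ** 2 for size, _, sd in summary.values())

msg = ssg / (k - 1)
mse = sse / (n - k)
f_stat = msg / mse
print(f"SSG={ssg:.1f}  SSE={sse:.1f}  MSG={msg:.2f}  MSE={mse:.2f}  F={f_stat:.2f}")
```

The result is an F of about 21.6, close to the 21.73 computed from the unrounded data.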

Hypothesis Test Using Random Permutation

  • To see whether this F statistic represents statistically significant evidence, we can simulate the null hypothesis
  • To simulate independence between word score and social class, we randomly permute the values of the response (wordsum score)

Here are 5 random simulations

# A tibble: 795 × 8
      id wordsum class   randPerm1 randPerm2 randPerm3 randPerm4 randPerm5
   <int>   <dbl> <chr>   <fct>     <fct>     <fct>     <fct>     <fct>    
 1     1       6 MIDDLE  MIDDLE    MIDDLE    MIDDLE    MIDDLE    MIDDLE   
 2     2       9 WORKING WORKING   WORKING   WORKING   WORKING   WORKING  
 3     3       6 WORKING WORKING   WORKING   WORKING   WORKING   WORKING  
 4     4       5 WORKING WORKING   WORKING   WORKING   WORKING   WORKING  
 5     5       6 WORKING WORKING   WORKING   WORKING   WORKING   WORKING  
 6     6       6 WORKING WORKING   WORKING   WORKING   WORKING   WORKING  
 7     7       8 MIDDLE  MIDDLE    MIDDLE    MIDDLE    MIDDLE    MIDDLE   
 8     8      10 WORKING WORKING   WORKING   WORKING   WORKING   WORKING  
 9     9       8 WORKING WORKING   WORKING   WORKING   WORKING   WORKING  
10    10       9 UPPER   UPPER     UPPER     UPPER     UPPER     UPPER    
# ℹ 785 more rows

Here is the dotplot of 100 simulations

Null distribution

Histogram of F scores (null distribution) for 1,000 random permutations of word scores. Dashed vertical line indicates observed F score.

  • None of the 1,000 randomized \(F\) statistics is at least as large as the observed value (21.73)
  • The estimated p-value is \(0/1000 = 0\); with 1,000 permutations, all we can say is that the p-value is less than 0.001
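The permutation procedure can be sketched in a few lines of Python. This is an illustration on made-up data, not the code behind the slides: the values, labels, function names, seed, and number of permutations are all assumptions.

```python
import random
from statistics import mean

def f_statistic(values, labels):
    """One-way ANOVA F statistic for parallel lists of values and group labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    k, n = len(groups), len(values)
    grand_mean = mean(values)
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    sse = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
    return (ssg / (k - 1)) / (sse / (n - k))

def permutation_p_value(values, labels, n_perm=2000, seed=1):
    """Shuffle the response, recompute F each time, and report the share of
    shuffled F statistics at least as large as the observed one."""
    rng = random.Random(seed)
    observed = f_statistic(values, labels)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any link between response and group
        if f_statistic(shuffled, labels) >= observed - 1e-12:
            hits += 1
    return observed, hits / n_perm

# Hypothetical toy data: 9 responses in 3 groups
values = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1.0, 2.0, 3.0]
labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
observed, p_value = permutation_p_value(values, labels)
print(observed, p_value)
```

Shuffling the response while keeping the labels fixed is equivalent to shuffling the labels while keeping the response fixed; either way the pairing between the two is randomized.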

The ANOVA Table

Jamovi reports the full ANOVA table:

            Sum of Squares   df    Mean Square   F       p
class       236.56           3     78.85         21.73   < 0.001
Residuals   2869.80          791   3.63

  • F = 21.73 (test statistic)

The Null Distribution

When conditions are met, the F-statistic follows an F-distribution if \(H_0\) is true:

  • F-distribution depends on degrees of freedom: \(df_1 = k - 1\) and \(df_2 = n - k\)
  • This distribution describes what \(F\) values we’d expect if all group means are equal

Computing the p-value:

  • P-value = area to the right of observed \(F\) under the F-distribution
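To get a feel for this right-tail area without special software, the sketch below simulates draws from an F-distribution using only Python's standard library. It relies on the fact that an F random variable is a ratio of two independent chi-square variables, each divided by its degrees of freedom, and that a chi-square with d degrees of freedom is a gamma distribution with shape d/2 and scale 2. Statistical software computes the exact area instead; the number of draws and the seed here are arbitrary choices.

```python
import random

def simulate_f(df1, df2, n_draws=20000, seed=1):
    """Draw from an F(df1, df2) distribution as a ratio of scaled chi-squares."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        chi2_num = rng.gammavariate(df1 / 2, 2)  # chi-square with df1 df
        chi2_den = rng.gammavariate(df2 / 2, 2)  # chi-square with df2 df
        draws.append((chi2_num / df1) / (chi2_den / df2))
    return draws

draws = simulate_f(3, 791)
observed_f = 21.73
mean_f = sum(draws) / len(draws)
# Estimated right-tail area: share of simulated F values beyond the observed F
p_hat = sum(d >= observed_f for d in draws) / len(draws)
print(f"mean of simulated F values: {mean_f:.2f}")
print(f"share of draws >= {observed_f}: {p_hat}")
```

The simulated values center near 1, as expected under \(H_0\), and an F as large as 21.73 sits far out in the right tail.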

Conditions for ANOVA

Independence:

  • Random sample
  • Independent observations within and between groups

Normality:

  • Reasonable sample sizes in each group
  • No extreme outliers visible

Equal variance:

  • The largest SD is less than or equal to twice the smallest SD
    \((2.34 \le 2 \times 1.87 = 3.74)\)

Conditions are met for using the F-distribution.
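The equal-variance rule of thumb is easy to check in code. A minimal sketch using the SDs from the EDA table (the variable names are illustrative):

```python
# Group standard deviations from the EDA table
sds = {"Lower": 2.24, "Middle": 1.89, "Upper": 2.34, "Working": 1.87}

largest = max(sds.values())
smallest = min(sds.values())
ratio = largest / smallest

# Rule of thumb: the largest SD should be at most twice the smallest
equal_variance_ok = largest <= 2 * smallest
print(f"largest/smallest SD = {ratio:.2f}, condition met: {equal_variance_ok}")
```

Here the ratio is about 1.25, comfortably below the cutoff of 2.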

The F-Distribution

  • F-distribution has two df parameters: \(df_1 = k-1\), \(df_2 = n-k\)
  • P-value is always the right tail (area beyond observed F)

Conclusion

Results:

  • F = 21.73
  • df₁ = 3, df₂ = 791
  • P-value < 0.001

Decision: Reject \(H_0\)

Conclusion: There is convincing evidence that mean vocabulary scores differ across self-identified social classes. At least one group has a different mean.

What ANOVA Does and Doesn’t Tell Us

ANOVA tells us:

  • At least one group mean is different from the others

ANOVA does NOT tell us:

  • WHICH specific groups differ
  • The direction or magnitude of differences

Summary

ANOVA compares means across 3+ groups:

Component            Formula/Value
F statistic          \(F = \frac{MSG}{MSE}\)
Degrees of freedom   \(df_1 = k - 1\), \(df_2 = n - k\)
P-value              Right tail of F-distribution

Intuition: Large F means between-group variability is large relative to within-group variability → evidence that means differ

Conditions: Independence + Normality + Equal variance

Connection to Independent Samples T-Test

                    Two Means               Multiple Means
Groups              2                       3 or more
Test statistic      T                       F
Null distribution   t-distribution          F-distribution
Hypothesis          \(\mu_1 = \mu_2\)       \(\mu_1 = \mu_2 = \cdots = \mu_k\)
Alternative         \(\mu_1 \neq \mu_2\)    At least one differs

For k = 2, ANOVA and two-sample t-test give equivalent results!

(In fact, \(F = T^2\) when comparing two groups, provided the t-test uses the pooled variance)
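The relationship between F and T can be checked numerically. The sketch below uses made-up data for two groups and a pooled-variance t statistic (the equal-variance version of the two-sample t-test); the function names and data are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t statistic using the pooled variance estimate."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ssg / (k - 1)) / (sse / (n - k))

# Hypothetical data for two groups of unequal size
a = [4.0, 5.0, 6.0, 7.0]
b = [6.0, 8.0, 10.0]
t = pooled_t(a, b)
f = anova_f([a, b])
print(round(t ** 2, 6), round(f, 6))  # the two values agree
```

With the unpooled (Welch) t statistic the equality generally fails, because MSE in ANOVA is itself a pooled variance estimate.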
