Post-Hoc Tests, Multiple Comparisons

Addition to Chapter 22
Math 215

Wordsum Score

  • Do Wordsum test scores vary between self-identified social classes?
  • Self-identified social classes: “Lower” (L), “Middle” (M), “Upper” (U), “Working” (W)
  • Let \(\mu_C\) = mean score for social class \(C\)
  • Hypothesis test
    • \(H_0: \mu_L=\mu_M=\mu_U=\mu_W\)
    • \(H_A:\) at least one of the means is different

ANOVA

  • We conducted a hypothesis test using a randomization-based null distribution of \(F\) statistics
  • We also conducted a model-based test (ANOVA) using the \(F\)-distribution
  • We found convincing evidence (\(F_{3,791}=21.73\), p-value < 0.001) that at least one mean score is different
library(broom) # to get tidy()

aov(wordsum ~ class, data = gss) |> 
  tidy()
# A tibble: 2 × 6
  term         df sumsq meansq statistic   p.value
  <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
1 class         3  237.  78.9       21.7  1.56e-13
2 Residuals   791 2870.   3.63      NA   NA       
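
As a quick check on the table above, the \(F\) statistic is the ratio of the two mean squares, and the p-value is the upper-tail area of the \(F\)-distribution with 3 and 791 degrees of freedom. The sketch below uses the rounded values printed in the output, so it should agree with the table only approximately.

```r
# Values copied (rounded) from the ANOVA table above
meansq_class <- 78.9   # mean square for class
meansq_resid <- 3.63   # residual mean square
df_class <- 3
df_resid <- 791

# F statistic = ratio of the mean squares
f_stat <- meansq_class / meansq_resid

# p-value = upper-tail area of the F(3, 791) distribution
p_value <- pf(f_stat, df1 = df_class, df2 = df_resid, lower.tail = FALSE)

f_stat
p_value
```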

Follow-Up Tests

  • So far we have taken a holistic view, considering all of the groups at the same time to determine if at least one of the means is different
  • If we conclude that there is convincing evidence that there is a difference, we can make pairwise group comparisons
  • If there are \(k\) groups, then there are \(m=k\cdot(k-1)/2\) pairwise comparisons
  • In the Wordsum example there are 4 groups, resulting in \(m = (4\cdot 3)/2=6\) possible pairwise comparisons
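
The count of pairwise comparisons is the number of ways to choose 2 groups out of \(k\). A minimal sketch in base R; the vector `classes` and the name `class_pairs` are introduced here only for illustration.

```r
# Group labels from the Wordsum example
classes <- c("LOWER", "MIDDLE", "UPPER", "WORKING")
k <- length(classes)

# Number of pairwise comparisons: k(k - 1)/2, the same as "k choose 2"
m <- k * (k - 1) / 2
m == choose(k, 2)

# Each column of class_pairs is one pair of groups to compare
class_pairs <- combn(classes, 2)
class_pairs
```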

Multiple comparisons

We could perform 6 separate hypothesis tests (e.g., \(t\)-tests):

  • \(H_0:\mu_L-\mu_M=0\), \(\hspace{2ex} H_A:\mu_L-\mu_M\neq0\)
  • \(H_0:\mu_L-\mu_U=0\), \(\hspace{2ex} H_A:\mu_L-\mu_U\neq0\)
  • \(H_0:\mu_L-\mu_W=0\), \(\hspace{2ex} H_A:\mu_L-\mu_W\neq0\)
  • \(H_0:\mu_M-\mu_U=0\), \(\hspace{2ex} H_A:\mu_M-\mu_U\neq0\)
  • \(H_0:\mu_M-\mu_W=0\), \(\hspace{2ex} H_A:\mu_M-\mu_W\neq0\)
  • \(H_0:\mu_U-\mu_W=0\), \(\hspace{2ex} H_A:\mu_U-\mu_W\neq0\)

Data

Ridge plot showing distribution of word scores for each self-identified social class

class       n  mean    sd
LOWER      41  5.07  2.24
MIDDLE    331  6.76  1.89
UPPER      16  6.19  2.34
WORKING   407  5.75  1.87

  • Pairwise t-tests using pooled SD (pooled across all groups)
  • No adjustment for multiple comparison
  • p-values:
        LOWER   MIDDLE  UPPER
MIDDLE  1.1e-07 -       -    
UPPER   0.048   0.240   -    
WORKING 0.031   1.6e-12 0.367
  • At the 0.05 level, the unadjusted p-values suggest significant differences between the MIDDLE and LOWER, WORKING and LOWER, UPPER and LOWER, and WORKING and MIDDLE groups
  • They show no significant differences between the UPPER and MIDDLE groups or the UPPER and WORKING groups
  • HOWEVER, unadjusted p-values do not account for the increased probability of a Type 1 error when several tests are performed
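
To make the "pooled SD" idea concrete, the sketch below recomputes two of the unadjusted p-values from the summary statistics alone. It assumes the rounded group sizes, means, and SDs reported above, so the results should match the p-value table only approximately; `pooled_t_pvalue` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Summary statistics copied from the slides (rounded to two decimals)
n    <- c(LOWER = 41,   MIDDLE = 331,  UPPER = 16,   WORKING = 407)
xbar <- c(LOWER = 5.07, MIDDLE = 6.76, UPPER = 6.19, WORKING = 5.75)
s    <- c(LOWER = 2.24, MIDDLE = 1.89, UPPER = 2.34, WORKING = 1.87)

# Pooled SD across all groups, with N - k degrees of freedom
df_resid  <- sum(n) - length(n)
sd_pooled <- sqrt(sum((n - 1) * s^2) / df_resid)

# Hypothetical helper: two-sided p-value for H0: mu_a - mu_b = 0
pooled_t_pvalue <- function(a, b) {
  se     <- sd_pooled * sqrt(1 / n[[a]] + 1 / n[[b]])
  t_stat <- (xbar[[a]] - xbar[[b]]) / se
  2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)
}

pooled_t_pvalue("MIDDLE", "LOWER")   # compare with 1.1e-07 reported above
pooled_t_pvalue("UPPER", "MIDDLE")   # compare with 0.240 reported above
```

Note that `sd_pooled^2` is close to the residual mean square in the ANOVA table, which is why every pairwise test here uses 791 degrees of freedom.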

The Problem with Multiple Comparisons

  • With \(m\) pairwise comparisons, each using a significance level of \(\alpha\), the probability of making at least one Type 1 error when there are no differences between groups is \(1-(1-\alpha)^m\) (treating the tests as independent)
  • If every \(H_0\) is true, the probability of at least one Type 1 error in 6 tests with \(\alpha = 0.05\) is approximately \[1-0.95^6=0.265\]
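
The formula is easy to evaluate for different numbers of tests. A minimal sketch, assuming independent tests; the function name `fwe` is introduced here for illustration.

```r
# Probability of at least one Type 1 error in m independent tests,
# each at level alpha, when every null hypothesis is true
fwe <- function(alpha, m) 1 - (1 - alpha)^m

fwe(0.05, 6)                 # the Wordsum example
round(fwe(0.05, 1:10), 3)    # how the error rate grows with the number of tests
```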

Familywise Error Rate

  • The familywise error rate (FWE) is the probability of making at least one Type 1 error when performing multiple hypothesis tests
  • We can control the FWE using multiple comparison methods
  • These methods use a reduced significance level for each hypothesis test to ensure \(FWE\leq\alpha\)

Bonferroni Method

  • The Bonferroni method is the simplest multiple comparison method
  • Let \(E_i\) be the event of making a Type 1 error with test \(i\)
  • For \(m\) tests, \[P(E_1 \text{ or } E_2 \text{ or }\cdots\text{ or } E_m) \leq P(E_1) + P(E_2)+\ldots + P(E_m)\]
  • If each test is conducted at significance level \(\alpha/m\), then \[FWE\leq\frac{\alpha}{m}+\frac{\alpha}{m}+\ldots+\frac{\alpha}{m}=m\cdot\frac{\alpha}{m}=\alpha\]
  • For \(m\) tests, the Bonferroni method tests each one at a level of \(\alpha/m\)
  • For the Wordsum example each test would use a level of \(0.05/6=0.00833\)
  • Equivalently, the p-value from each test is adjusted by multiplying it by the number of tests (capping the result at 1)
  • The adjusted p-values are compared to the original significance level
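
The two equivalent forms of the Bonferroni method can be sketched as follows. The p-values are the unadjusted values reported earlier, typed in by hand as rounded, so the adjusted values may differ slightly from those computed on the full data; `p.adjust()` is the base R function for this adjustment.

```r
# Unadjusted p-values from the pairwise t-tests (rounded as printed)
p_raw <- c("MIDDLE-LOWER"   = 1.1e-07,
           "UPPER-LOWER"    = 0.048,
           "WORKING-LOWER"  = 0.031,
           "UPPER-MIDDLE"   = 0.240,
           "WORKING-MIDDLE" = 1.6e-12,
           "WORKING-UPPER"  = 0.367)
alpha <- 0.05
m     <- length(p_raw)

# Form 1: compare each unadjusted p-value to alpha / m
reject_level <- p_raw < alpha / m

# Form 2: multiply each p-value by m (capped at 1), then compare to alpha
p_bonf     <- pmin(m * p_raw, 1)
reject_adj <- p_bonf < alpha

# Base R's p.adjust() performs the same adjustment
p_builtin <- p.adjust(p_raw, method = "bonferroni")

p_bonf
names(p_raw)[reject_adj]   # comparisons that remain significant
```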

Adjusted P-Values

Using the Bonferroni method

library(magrittr) # to get the %$% pipe

gss %$%
  pairwise.t.test(wordsum, class, p.adjust.method = "bonferroni")

    Pairwise comparisons using t tests with pooled SD 

data:  wordsum and class 

        LOWER   MIDDLE  UPPER
MIDDLE  6.8e-07 -       -    
UPPER   0.29    1.00    -    
WORKING 0.18    9.8e-12 1.00 

P value adjustment method: bonferroni 
  • After the Bonferroni adjustment, the only significant differences are between the MIDDLE and LOWER groups and between the MIDDLE and WORKING groups
  • The Bonferroni method is very conservative, resulting in an FWE that is usually much smaller than \(\alpha\) (loss of power)
  • There are alternative methods that are less conservative and have higher power while still controlling FWE
  • E.g. Holm’s method (p.adjust.method = "holm")
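
Holm's method orders the p-values from smallest to largest and multiplies the smallest by \(m\), the next by \(m-1\), and so on, so its adjusted p-values are never larger than the Bonferroni ones. The sketch below compares the two adjustments with `p.adjust()`, again using the rounded unadjusted p-values reported earlier as an assumption in place of the raw data.

```r
# Unadjusted p-values from the pairwise t-tests (rounded as printed)
p_raw <- c("MIDDLE-LOWER"   = 1.1e-07,
           "UPPER-LOWER"    = 0.048,
           "WORKING-LOWER"  = 0.031,
           "UPPER-MIDDLE"   = 0.240,
           "WORKING-MIDDLE" = 1.6e-12,
           "WORKING-UPPER"  = 0.367)

p_bonf <- p.adjust(p_raw, method = "bonferroni")
p_holm <- p.adjust(p_raw, method = "holm")

# Side-by-side comparison of the three sets of p-values
signif(cbind(raw = p_raw, bonferroni = p_bonf, holm = p_holm), 3)
```

With these inputs both methods flag the same two comparisons at the 0.05 level; the gain in power from Holm's method shows up in the smaller adjusted p-values for the remaining comparisons.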

Tukey Procedure

  • Less conservative than Bonferroni method
  • Only for pairwise comparisons of means
aov(wordsum ~ class, data = gss) |> 
  TukeyHSD()
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = wordsum ~ class, data = gss)

$class
                     diff        lwr        upr     p adj
MIDDLE-LOWER    1.6881586  0.8762706  2.5000466 0.0000007
UPPER-LOWER     1.1143293 -0.3311641  2.5598226 0.1945998
WORKING-LOWER   0.6762150 -0.1272750  1.4797050 0.1335047
UPPER-MIDDLE   -0.5738293 -1.8290536  0.6813950 0.6416209
WORKING-MIDDLE -1.0119436 -1.3748942 -0.6489929 0.0000000
WORKING-UPPER  -0.4381143 -1.6879230  0.8116945 0.8035197
  • If the difference between two means is significant, then the corresponding confidence interval for the difference of means will not contain zero
  • So, based on the confidence intervals, there are significant differences between the MIDDLE and LOWER groups and between the WORKING and MIDDLE groups
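
The intervals and adjusted p-values in the Tukey output come from the studentized range distribution. The sketch below reconstructs two rows from the rounded summary statistics using base R's `qtukey()` and `ptukey()`; it is an approximation to the output above rather than a replacement for `TukeyHSD()`, and `tukey_pair` is a hypothetical helper defined only for this illustration.

```r
# Summary statistics copied from the slides (rounded to two decimals)
n    <- c(LOWER = 41,   MIDDLE = 331,  UPPER = 16,   WORKING = 407)
xbar <- c(LOWER = 5.07, MIDDLE = 6.76, UPPER = 6.19, WORKING = 5.75)
s    <- c(LOWER = 2.24, MIDDLE = 1.89, UPPER = 2.34, WORKING = 1.87)

k        <- length(n)
df_resid <- sum(n) - k
mse      <- sum((n - 1) * s^2) / df_resid   # pooled variance

# Hypothetical helper: Tukey interval and adjusted p-value for mean(a) - mean(b)
tukey_pair <- function(a, b, conf = 0.95) {
  d  <- xbar[[a]] - xbar[[b]]
  se <- sqrt(mse * (1 / n[[a]] + 1 / n[[b]]))
  # Critical value from the studentized range distribution
  crit <- qtukey(conf, nmeans = k, df = df_resid) / sqrt(2)
  # Studentized range statistic for this pair
  q <- abs(d) / se * sqrt(2)
  c(diff  = d,
    lwr   = d - crit * se,
    upr   = d + crit * se,
    p_adj = ptukey(q, nmeans = k, df = df_resid, lower.tail = FALSE))
}

round(tukey_pair("MIDDLE", "LOWER"), 4)   # interval excludes zero
round(tukey_pair("UPPER", "LOWER"), 4)    # interval contains zero
```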

Conclusions

  • We come to the same conclusions using the Tukey procedure or the Bonferroni method (but not the unadjusted p-values!)
  • Based on the results of the ANOVA, we concluded that there is convincing evidence that at least one of the mean scores is different
  • We followed this with post-hoc pairwise tests for differences between group means
  • Based on the pairwise tests, we conclude that there is convincing evidence of differences between mean scores for the “Middle” and “Lower” social classes and between mean scores for the “Working” and “Middle” social classes
  • We are unable to reject the other null hypotheses
  • For example, it is plausible that the mean scores are the same for “Upper” and “Lower” social classes

Additional Thoughts

  • We can also perform post-hoc tests after we perform a hypothesis test for multiple proportions (e.g., a chi-squared test)
  • In this case we would use the pairwise.prop.test function in R
  • We can calculate confidence intervals for pairwise differences (means or proportions) using the same ideas
  • Bonferroni correction can be applied to the confidence level
  • Use \(100\cdot(1-\alpha/m)\%\) for \(m\) comparisons
  • For a familywise confidence level of 95%, we would compute a CI for each of the 6 pairwise differences using a confidence level of \(100\cdot(1-0.05/6)\%=99.17\%\)
  • Also see CI output from Tukey procedure
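
A Bonferroni-adjusted confidence interval for one pairwise difference can be sketched from the summary statistics. This assumes the rounded group sizes, means, and SDs reported earlier and the pooled SD with 791 degrees of freedom, so the endpoints are approximate.

```r
alpha <- 0.05
m     <- 6
conf_level <- 1 - alpha / m   # per-interval confidence level, about 0.9917

# Summary statistics copied from the slides (rounded to two decimals)
n    <- c(LOWER = 41,   MIDDLE = 331,  UPPER = 16,   WORKING = 407)
xbar <- c(LOWER = 5.07, MIDDLE = 6.76, UPPER = 6.19, WORKING = 5.75)
s    <- c(LOWER = 2.24, MIDDLE = 1.89, UPPER = 2.34, WORKING = 1.87)

df_resid  <- sum(n) - length(n)
sd_pooled <- sqrt(sum((n - 1) * s^2) / df_resid)

# t critical value for a two-sided interval at the adjusted level
t_crit <- qt(1 - alpha / (2 * m), df = df_resid)

# Interval for the MIDDLE - LOWER difference in mean Wordsum score
d  <- xbar[["MIDDLE"]] - xbar[["LOWER"]]
se <- sd_pooled * sqrt(1 / n[["MIDDLE"]] + 1 / n[["LOWER"]])
ci <- d + c(-1, 1) * t_crit * se

round(100 * conf_level, 2)
round(ci, 3)
```

The adjusted critical value is larger than the usual `qt(0.975, df_resid)`, so each interval is wider than an unadjusted 95% interval.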