Post-Hoc Tests, Multiple Comparisons

Additional Topic
Math 115

Yurk

Wordsum Score

  • Do Wordsum test scores vary between self-identified social classes?
  • Self-identified social classes: “Lower” (L), “Middle” (M), “Upper” (U), “Working” (W)
  • Let \(\mu_C\) = mean score for social class \(C\)
  • Hypothesis test
    • \(H_0: \mu_L=\mu_M=\mu_U=\mu_W\)
    • \(H_A:\) at least one of the means is different

ANOVA

  • We conducted a hypothesis test based on a randomized null distribution of \(F\) statistics
  • We also conducted a model-based test (ANOVA) using the \(F\)-distribution
  • We found convincing evidence (\(F_{3,791}=21.73\), p-value < 0.001) that at least one mean score is different
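As a rough check on the reported statistic, the \(F\) ratio can be rebuilt from per-group summary statistics. The sketch below assumes the group sizes, means, and standard deviations listed on the Data slide of this deck; the function name `anova_f` is made up for illustration. Because those summaries are rounded to two decimals, the result should land close to, but not exactly on, the reported \(F_{3,791}=21.73\).

```python
# Per-group (n, mean, sd), copied from the Data slide (rounded values).
groups = {
    "LOWER": (41, 5.07, 2.24),
    "MIDDLE": (331, 6.76, 1.89),
    "UPPER": (16, 6.19, 2.34),
    "WORKING": (407, 5.75, 1.87),
}


def anova_f(groups):
    """Return (F, df_between, df_within) from per-group (n, mean, sd)."""
    k = len(groups)
    n_total = sum(n for n, _, _ in groups.values())
    grand_mean = sum(n * m for n, m, _ in groups.values()) / n_total

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups.values())
    # Within-group sum of squares: recovered from each group's sample SD.
    ss_within = sum((n - 1) * s ** 2 for n, _, s in groups.values())

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df1, df2 = anova_f(groups)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

The degrees of freedom (3 and 791) match the slide exactly, since they depend only on the number of groups and the total sample size.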

Follow-Up Tests

  • So far we have taken a holistic view, considering all of the groups at the same time to determine if at least one of the means is different
  • If we conclude that there is convincing evidence that there is a difference, we can make pairwise group comparisons
  • If there are \(k\) groups, then there are \(a=k\cdot(k-1)/2\) pairwise comparisons
  • In the Wordsum example there are 4 groups, resulting in \(a = (4\cdot 3)/2=6\) possible pairwise comparisons
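The count of pairwise comparisons can be checked by listing the pairs directly. This is a minimal sketch using the class labels L, M, U, W from the earlier slide; the variable names are illustrative only.

```python
from itertools import combinations

classes = ["L", "M", "U", "W"]
k = len(classes)

# Every unordered pair of distinct groups is one pairwise comparison.
pairs = list(combinations(classes, 2))

# Closed-form count: a = k(k-1)/2.
a = k * (k - 1) // 2

print(a)
print(pairs)
```

The length of `pairs` agrees with the formula, and the order of the pairs matches the six hypothesis tests listed on the next slide.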

Multiple comparisons

We could perform 6 separate hypothesis tests (e.g., \(t\)-tests):

  • \(H_0:\mu_L-\mu_M=0\), \(\hspace{2ex} H_A:\mu_L-\mu_M\neq0\)
  • \(H_0:\mu_L-\mu_U=0\), \(\hspace{2ex} H_A:\mu_L-\mu_U\neq0\)
  • \(H_0:\mu_L-\mu_W=0\), \(\hspace{2ex} H_A:\mu_L-\mu_W\neq0\)
  • \(H_0:\mu_M-\mu_U=0\), \(\hspace{2ex} H_A:\mu_M-\mu_U\neq0\)
  • \(H_0:\mu_M-\mu_W=0\), \(\hspace{2ex} H_A:\mu_M-\mu_W\neq0\)
  • \(H_0:\mu_U-\mu_W=0\), \(\hspace{2ex} H_A:\mu_U-\mu_W\neq0\)

Data

Ridge plot showing distribution of word scores for each self-identified social class

class      n     mean   sd
LOWER      41    5.07   2.24
MIDDLE     331   6.76   1.89
UPPER      16    6.19   2.34
WORKING    407   5.75   1.87

  • Pairwise t-tests using pooled SD (pooled across all groups)
  • No adjustment for multiple comparisons
  • p-values:
           LOWER     MIDDLE    UPPER
MIDDLE     1.1e-07   -         -
UPPER      0.048     0.240     -
WORKING    0.031     1.6e-12   0.367
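To make "pooled SD (pooled across all groups)" concrete, here is a sketch that rebuilds the standard error and \(t\) statistic for each pair from the summary table above. The helper names `pooled_sd` and `pairwise_t` are hypothetical, and the inputs are the rounded summary statistics, so the values only approximate what software computes from the raw data. Each p-value would then come from a \(t\)-distribution with \(N-k=791\) degrees of freedom, which is not computed here.

```python
from itertools import combinations
from math import sqrt

# Per-group (n, mean, sd) from the summary table (rounded values).
groups = {
    "LOWER": (41, 5.07, 2.24),
    "MIDDLE": (331, 6.76, 1.89),
    "UPPER": (16, 6.19, 2.34),
    "WORKING": (407, 5.75, 1.87),
}


def pooled_sd(groups):
    """Pooled SD across all groups, with N - k degrees of freedom."""
    n_total = sum(n for n, _, _ in groups.values())
    df = n_total - len(groups)
    ss_within = sum((n - 1) * s ** 2 for n, _, s in groups.values())
    return sqrt(ss_within / df), df


def pairwise_t(groups):
    """Return {(g1, g2): (mean difference, SE, t)} for every pair of groups."""
    sp, _ = pooled_sd(groups)
    results = {}
    for g1, g2 in combinations(groups, 2):
        n1, m1, _ = groups[g1]
        n2, m2, _ = groups[g2]
        se = sp * sqrt(1 / n1 + 1 / n2)
        diff = m1 - m2
        results[(g1, g2)] = (diff, se, diff / se)
    return results


for pair, (diff, se, t) in pairwise_t(groups).items():
    print(pair, round(diff, 3), round(se, 3), round(t, 3))
```

Note that every pair uses the same pooled SD, so the standard errors differ only through the two group sizes.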

The Problem with Multiple Comparisons

  • With \(a\) pairwise comparisons, each tested at significance level \(\alpha\), the probability of making at least one Type 1 error when there are no differences between groups is \(1-(1-\alpha)^a\) (treating the tests as independent)
  • If each \(H_0\) is true, the probability of at least one Type 1 error in 6 tests with \(\alpha = 0.05\) is \[1-0.95^6=0.265\]
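The growth of this probability with the number of tests is easy to tabulate. The sketch below simply evaluates \(1-(1-\alpha)^a\), so it carries the same assumption that the tests behave as if independent; the function name is illustrative.

```python
def familywise_error(alpha, a):
    """Chance of at least one Type 1 error in `a` independent tests at level alpha."""
    return 1 - (1 - alpha) ** a


for a in (1, 3, 6, 10):
    print(a, round(familywise_error(0.05, a), 3))
```

For \(a=6\) this reproduces the 0.265 shown above.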

Familywise Error Rate

  • The familywise error rate (FWE) is the probability of making at least one Type 1 error when performing multiple hypothesis tests
  • We can control the FWE using multiple comparison methods
  • These methods use a reduced significance level for each hypothesis test to ensure \(FWE\leq\alpha\)

Bonferroni Method

  • The Bonferroni method is the simplest multiple comparison method
  • Let \(E_i\) be the event of making a Type 1 error with test \(i\)
  • For \(a\) tests, \[P(E_1 \text{ or } E_2 \text{ or }\cdots\text{ or } E_a) \leq P(E_1) + P(E_2)+\ldots + P(E_a)\]
  • If each test is conducted at significance level \(\alpha/a\), then \[FWE\leq\frac{\alpha}{a}+\frac{\alpha}{a}+\ldots+\frac{\alpha}{a}=a\cdot\frac{\alpha}{a}=\alpha\]
  • For \(a\) tests, the Bonferroni method tests each one at a level of \(\alpha/a\)
  • For the Wordsum example each test would use a level of \(0.05/6=0.00833\)
  • Equivalently, the p-value from each test is adjusted by multiplying by the number of tests
  • The adjusted p-values are compared to the original significance level
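The adjustment described above can be sketched in a few lines. The inputs are the unadjusted p-values from the Data slide, which are rounded, so the adjusted values may differ in the third decimal from software output computed on the raw data. The function name `bonferroni` is illustrative and is not a Jamovi function. Adjusted values are capped at 1 because a p-value cannot exceed 1.

```python
# Unadjusted pairwise p-values from the Data slide (rounded as shown there).
p_unadjusted = {
    ("L", "M"): 1.1e-07,
    ("L", "U"): 0.048,
    ("L", "W"): 0.031,
    ("M", "U"): 0.240,
    ("M", "W"): 1.6e-12,
    ("U", "W"): 0.367,
}


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    a = len(p_values)
    return {pair: min(1.0, a * p) for pair, p in p_values.items()}


alpha = 0.05
p_adjusted = bonferroni(p_unadjusted)

# Compare adjusted p-values to the ORIGINAL significance level.
significant = [pair for pair, p in p_adjusted.items() if p < alpha]

for pair, p in p_adjusted.items():
    print(pair, f"{p:.3g}")
print("significant:", significant)
```

Comparing an adjusted p-value to \(\alpha\) gives the same decision as comparing the unadjusted p-value to \(\alpha/a\).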

Adjusted P-Values

  • The Bonferroni method can be used to correct p-values for post-hoc \(t\)-tests in the ANOVA analysis in Jamovi

 POST HOC TESTS

 Post Hoc Comparisons - class                                                                      
 ───────────────────────────────────────────────────────────────────────────────────────────────── 
   class          class      Mean Difference    SE       df     t         p         p-bonferroni   
 ───────────────────────────────────────────────────────────────────────────────────────────────── 
   LOWER     -    MIDDLE              -1.688    0.315    791    -5.353    < .001          < .001   
             -    UPPER               -1.114    0.561    791    -1.985     0.048           0.285   
             -    WORKING             -0.676    0.312    791    -2.167     0.031           0.183   
   MIDDLE    -    UPPER                0.574    0.488    791     1.177     0.240           1.000   
             -    WORKING              1.012    0.141    791     7.178    < .001          < .001   
   UPPER     -    WORKING              0.438    0.485    791     0.902     0.367           1.000   
 ───────────────────────────────────────────────────────────────────────────────────────────────── 
   Note. Comparisons are based on estimated marginal means
  • The Bonferroni method is very conservative, resulting in an FWE that is usually much smaller than \(\alpha\) (loss of power)
  • There are alternative methods that are less conservative and have higher power while still controlling FWE
  • E.g. Holm’s method (also implemented in Jamovi)
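For comparison, here is a sketch of the usual Holm step-down adjustment, applied to the rounded unadjusted p-values from the Data slide. This is the textbook procedure written out by hand, not Jamovi's implementation, and the function name `holm` is illustrative. The smallest p-value is multiplied by the number of tests \(a\), the next smallest by \(a-1\), and so on, with a running maximum so that adjusted values never decrease.

```python
# Unadjusted pairwise p-values from the Data slide (rounded as shown there).
p_unadjusted = {
    ("L", "M"): 1.1e-07,
    ("L", "U"): 0.048,
    ("L", "W"): 0.031,
    ("M", "U"): 0.240,
    ("M", "W"): 1.6e-12,
    ("U", "W"): 0.367,
}


def holm(p_values):
    """Holm step-down adjusted p-values, keyed like the input."""
    a = len(p_values)
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    adjusted = {}
    running_max = 0.0
    for rank, (pair, p) in enumerate(ordered):
        # The multiplier shrinks from a down to 1 as p-values grow.
        candidate = min(1.0, (a - rank) * p)
        # Enforce monotonicity: an adjusted p-value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[pair] = running_max
    return adjusted


p_holm = holm(p_unadjusted)
for pair, p in p_holm.items():
    print(pair, f"{p:.3g}")
```

Each Holm-adjusted p-value is no larger than the corresponding Bonferroni value, which is where the gain in power comes from.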

Tukey Procedure

  • Less conservative than Bonferroni method
  • Only for pairwise comparisons of means

 POST HOC TESTS

 Post Hoc Comparisons - class                                                                                 
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────── 
   class          class      Mean Difference    SE       df     t         p         p-tukey    p-bonferroni   
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────── 
   LOWER     -    MIDDLE              -1.688    0.315    791    -5.353    < .001     < .001          < .001   
             -    UPPER               -1.114    0.561    791    -1.985     0.048      0.195           0.285   
             -    WORKING             -0.676    0.312    791    -2.167     0.031      0.134           0.183   
   MIDDLE    -    UPPER                0.574    0.488    791     1.177     0.240      0.642           1.000   
             -    WORKING              1.012    0.141    791     7.178    < .001     < .001          < .001   
   UPPER     -    WORKING              0.438    0.485    791     0.902     0.367      0.804           1.000   
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────── 
   Note. Comparisons are based on estimated marginal means

Conclusions

  • We come to the same conclusions using the Tukey procedure or the Bonferroni method (but not the unadjusted p-values!)
  • Based on the results of the ANOVA, we concluded that there is convincing evidence that at least one of the mean scores is different
  • We followed this with post-hoc pairwise tests for differences between group means
  • Based on the pairwise tests, we conclude that there is convincing evidence of differences between mean scores for the “Middle” and “Lower” social classes and between mean scores for the “Working” and “Middle” social classes
  • We are unable to reject the other null hypotheses
  • For example, it is plausible that the mean scores are the same for “Upper” and “Lower” social classes
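The comparison of conclusions can be tabulated from the p-values reported in this deck. In the sketch below, entries that the Jamovi output shows only as "< .001" are entered as 0.001, which is a placeholder upper bound and not the actual value; the two unadjusted entries use the values from the Data slide. The dictionary layout and function name are illustrative.

```python
ALPHA = 0.05

# pair: (unadjusted p, Tukey p, Bonferroni p), as reported in this deck.
# 0.001 stands in for entries printed as "< .001" (placeholder upper bound).
p_table = {
    ("LOWER", "MIDDLE"): (1.1e-07, 0.001, 0.001),
    ("LOWER", "UPPER"): (0.048, 0.195, 0.285),
    ("LOWER", "WORKING"): (0.031, 0.134, 0.183),
    ("MIDDLE", "UPPER"): (0.240, 0.642, 1.000),
    ("MIDDLE", "WORKING"): (1.6e-12, 0.001, 0.001),
    ("UPPER", "WORKING"): (0.367, 0.804, 1.000),
}

METHODS = {"unadjusted": 0, "tukey": 1, "bonferroni": 2}


def significant_pairs(method):
    """Pairs whose p-value under the given method falls below ALPHA."""
    col = METHODS[method]
    return [pair for pair, ps in p_table.items() if ps[col] < ALPHA]


for method in METHODS:
    print(method, significant_pairs(method))
```

The unadjusted p-values flag four pairs, while both adjusted columns flag only Lower vs. Middle and Middle vs. Working.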

Additional Thoughts

  • We can also perform post-hoc tests after we perform a hypothesis test for multiple proportions (e.g., a chi-squared test)
  • We can calculate confidence intervals for pairwise differences (means or proportions) using the same ideas
  • Bonferroni correction can be applied to the confidence level
  • Use \(100\cdot(1-\alpha/a)\%\) for \(a\) comparisons
  • For a 95% familywise confidence level, we would compute CIs for the 6 pairwise differences, each with confidence level \(100\cdot(1-0.05/6)\%=99.17\%\)
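A sketch of a Bonferroni-adjusted interval for one pair follows. It uses the Middle vs. Working mean difference and standard error from the Jamovi output in this deck. As a simplifying assumption, the critical value is taken from the standard normal distribution rather than the \(t\)-distribution with 791 degrees of freedom, so the interval is slightly narrower than the exact one. The function and variable names are illustrative.

```python
from statistics import NormalDist


def bonferroni_confidence_level(alpha, a):
    """Per-interval confidence level so that a intervals jointly reach 1 - alpha."""
    return 1 - alpha / a


level = bonferroni_confidence_level(0.05, 6)
print(f"per-interval confidence level: {100 * level:.2f}%")

# Two-sided critical value; normal approximation to t with 791 df (assumption).
z = NormalDist().inv_cdf(1 - (1 - level) / 2)

# MIDDLE - WORKING: mean difference and SE as reported in the Jamovi table.
diff, se = 1.012, 0.141
ci = (diff - z * se, diff + z * se)
print(f"critical value: {z:.3f}")
print(f"CI for mu_M - mu_W: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval excludes 0, which agrees with the adjusted test result for this pair.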