Post-Hoc Tests, Multiple Comparisons

Additional Topic
Math 115

Yurk

Wordsum Score

Do Wordsum test scores vary between self-identified social classes?
Self-identified social classes: “Lower” (L), “Middle” (M), “Upper” (U), “Working” (W)
Let \(\mu_C\) = mean score for social class \(C\)
Hypothesis test
- \(H_0: \mu_L=\mu_M=\mu_U=\mu_W\)
- \(H_A:\) at least one of the means is different

ANOVA

We conducted a hypothesis test based on a randomized null distribution of \(F\) statistics
And a test (ANOVA) using a model (\(F\)-distribution)
We found convincing evidence (\(F_{3,791}=21.73\), p-value < 0.001) that at least one mean score is different

Follow-Up Tests

So far we have taken a holistic view, considering all of the groups at the same time to determine if at least one of the means is different
If we conclude that there is convincing evidence that there is a difference, we can make pairwise group comparisons
If there are \(k\) groups, then there are \(a=k\cdot(k-1)/2\) pairwise comparisons
In the Wordsum example there are 4 groups, resulting in \(a = (4\cdot 3)/2=6\) possible pairwise comparisons

Multiple comparisons

We could perform 6 separate hypothesis tests (e.g., \(t\)-tests):

\(H_0:\mu_L-\mu_M=0\), \(\hspace{2ex} H_A:\mu_L-\mu_M\neq0\)
\(H_0:\mu_L-\mu_U=0\), \(\hspace{2ex} H_A:\mu_L-\mu_U\neq0\)
\(H_0:\mu_L-\mu_W=0\), \(\hspace{2ex} H_A:\mu_L-\mu_W\neq0\)
\(H_0:\mu_M-\mu_U=0\), \(\hspace{2ex} H_A:\mu_M-\mu_U\neq0\)
\(H_0:\mu_M-\mu_W=0\), \(\hspace{2ex} H_A:\mu_M-\mu_W\neq0\)
\(H_0:\mu_U-\mu_W=0\), \(\hspace{2ex} H_A:\mu_U-\mu_W\neq0\)

Ridge plot showing distribution of word scores for each self-identified social class

class	n	mean	sd
LOWER	41	5.07	2.24
MIDDLE	331	6.76	1.89
UPPER	16	6.19	2.34
WORKING	407	5.75	1.87

Pairwise t-tests using pooled SD (pooled across all groups)
No adjustment for multiple comparison
p-values:

	LOWER	MIDDLE	UPPER
MIDDLE	1.1e-07	-	-
UPPER	0.048	0.240	-
WORKING	0.031	1.6e-12	0.367

The Problem with Multiple Comparisons

With \(a\) pairwise comparisons using a significance level of \(\alpha\), for each test, the probability of making at least one Type 1 error if there are no difference between groups is \(1-(1-\alpha)^a\)
If each \(H_0\) is true, probability of at least one Type 1 error in 6 tests with \(\alpha = 0.05\): \[1-0.95^6=0.265\]

Familywise Error Rate

The Familywise Error rate (FWE) is the probability of making at least one Type 1 error when performing multiple hypothesis tests
We can control the FWE using multiple comparison methods
These methods use a reduced significance level for each hypothesis test to ensure \(FWE\leq\alpha\)

Bonferroni Method

The Bonferroni method is the simplest multiple comparison method
Let \(E_i\) be the event of making a Type 1 error with test \(i\)
For \(a\) tests, \[P(E_1 \text{ or } E_2 \text{ or }\cdots\text{ or } E_a) \leq P(E_1) + P(E_2)+\ldots + P(E_a)\]
If each test conducted at significance level \(\alpha/a\) \[FWE\leq\frac{\alpha}{a}+\frac{\alpha}{a}+\ldots+\frac{\alpha}{a}=a\cdot\frac{\alpha}{a}=\alpha\]

For \(a\) tests, the Bonferroni method tests each one at a level of \(\alpha/a\)
For the Wordscore example each test would use a level of \(0.05/6=0.00833\)
Equivalently, the p-value from each test is adjusted by multiplying by the number of tests
The adjusted p-values are compared to the original significance level

Adjusted P-Values

The Bonferoni method can be used for correcting p-values in post-hoc T-tests in the ANOVA analysis in Jamovi


 POST HOC TESTS

 Post Hoc Comparisons - class                                                                      
 ───────────────────────────────────────────────────────────────────────────────────────────────── 
   class          class      Mean Difference    SE       df     t         p         p-bonferroni   
 ───────────────────────────────────────────────────────────────────────────────────────────────── 
   LOWER     -    MIDDLE              -1.688    0.315    791    -5.353    < .001    < .001   
             -    UPPER               -1.114    0.561    791    -1.985     0.048           0.285   
             -    WORKING             -0.676    0.312    791    -2.167     0.031           0.183   
   MIDDLE    -    UPPER                0.574    0.488    791     1.177     0.240           1.000   
             -    WORKING              1.012    0.141    791     7.178    < .001    < .001   
   UPPER     -    WORKING              0.438    0.485    791     0.902     0.367           1.000   
 ───────────────────────────────────────────────────────────────────────────────────────────────── 
   Note. Comparisons are based on estimated marginal means

Bonferroni method is very conservative, resulting in FWE that is usually much smaller than \(\alpha\) (loss of power)
There are alternative methods that are less conservative and have higher power while still controlling FWE
E.g. Holm’s method (also implemented in Jamovi)

Tukey Procedure

Less conservative than Bonferroni method
Only for pairwise comparisons of means


 POST HOC TESTS

 Post Hoc Comparisons - class                                                                                 
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────── 
   class          class      Mean Difference    SE       df     t         p         p-tukey    p-bonferroni   
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────── 
   LOWER     -    MIDDLE              -1.688    0.315    791    -5.353    < .001    < .001    < .001   
             -    UPPER               -1.114    0.561    791    -1.985     0.048      0.195           0.285   
             -    WORKING             -0.676    0.312    791    -2.167     0.031      0.134           0.183   
   MIDDLE    -    UPPER                0.574    0.488    791     1.177     0.240      0.642           1.000   
             -    WORKING              1.012    0.141    791     7.178    < .001    < .001    < .001   
   UPPER     -    WORKING              0.438    0.485    791     0.902     0.367      0.804           1.000   
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────── 
   Note. Comparisons are based on estimated marginal means

Conclusions

We come to the same conclusions using the Tukey procedure or the Bonferroni method (but not the unadjusted p-values!)
Based on the results of the ANOVA, we concluded that there is convincing evidence that at least one of the mean scores is different
We followed this with post-hoc pairwise tests for differences between group means

Based on the pairwise tests, we conclude that there is convincing evidence of differences between mean scores for the “Middle” and “Lower” social classes and and between mean scores for the “Working and Middle” social classes.
We are unable to reject the other null hypotheses
For example, it is plausible that the mean scores are the same for “Upper” and “Lower” social classes

Additional Thoughts

We can also perform post-hoc tests after we perform a hypothesis test for multiple proportions (e.g., a chi-squared test)

We can calculate confidence intervals for pairwise differences (means or proportions) using the same ideas
Bonferroni correction can be applied to the confidence level
Use \(100\cdot(1-\alpha/a)\%\) for \(a\) comparisons
For a 95% confidence level, we would compute CI for the 6 pairwise differences with confidence level \(100\cdot(1-0.05/6)=99.17\%\)