Math 115
Until now, we have used one-sided hypothesis tests. The alternative hypothesis looked like one of the following: \(H_A: p > p_0\) or \(H_A: p < p_0\).
P-value: Area in ONE tail of the normal distribution
We had a specific direction in mind for the alternative.
Scenario: You want to investigate how well people can identify AI-generated images.
Recent research has produced mixed results:
Research question: Is human detection accuracy DIFFERENT from chance (50%)?
We care about detecting a difference in either direction—this calls for a two-sided test.
Two-sided hypothesis test: \(H_0: p = p_0\) versus \(H_A: p \neq p_0\)
The \(\neq\) symbol means “not equal to”—the true value could be greater OR less than \(p_0\).
We don’t specify a direction in the alternative hypothesis.
Use two-sided when:
Default to two-sided unless you have a strong reason for one-sided.
Context: Given recent media attention on AI fakes, some people may be over-eager to call images AI-generated. Others may underestimate how realistic AI has become.
Study: 80 students each view the same AI-generated landscape image. Each guesses whether it’s real or AI-generated.
Result: 48 students correctly identify it as AI-generated (60%).
Hypotheses:
If we used a one-sided test (\(H_A: p > 0.5\)):
But wait… what if people were actually WORSE than chance?
For \(H_A: p \neq 0.5\), evidence against \(H_0\) comes from either direction.
If \(\hat{p} = 0.60\) (Z = 1.79) OR \(\hat{p} = 0.40\) (Z = -1.79), we’d have equal evidence against \(H_0\).
Method: Double the smaller tail
Calculate \(Z = \frac{\hat{p} - p_0}{SE}\)
Find area to the LEFT of Z and area to the RIGHT of Z
Double the SMALLER of these two areas
The two-sided p-value is twice the corresponding one-sided p-value, so the two-sided test is more conservative: it is harder to reject the null hypothesis.
This method works for both simulation-based and model-based inference, and for both symmetric and asymmetric distributions.
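As a minimal sketch of the doubling method for the normal model, the Python function below takes a Z statistic, finds both tail areas, and doubles the smaller one. The function name `two_sided_p_value` is illustrative, not something from these slides; it relies on `statistics.NormalDist` from the standard library (Python 3.8+).

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Double the smaller tail area under the standard normal curve."""
    left = NormalDist().cdf(z)   # area to the LEFT of z
    right = 1 - left             # area to the RIGHT of z
    return 2 * min(left, right)

# Z = 1.79 and Z = -1.79 are equally far from 0, so they give the same p-value
print(round(two_sided_p_value(1.79), 3))
print(round(two_sided_p_value(-1.79), 3))
```

Because the function always doubles the smaller tail, the sign of Z does not matter. Only its distance from 0 does.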
By symmetry, both tails have equal area. P-value = \(2 \times 0.037 = 0.074\)
The left tail is smaller, so the p-value is \(2 \times \frac{3}{100}=0.06\)
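The same doubling rule can be applied to a simulated null distribution. The sketch below is an illustration under assumptions: the function name `simulation_p_value` and the toy list of simulated statistics are hypothetical, built so that 3 of 100 simulated values fall at or below the observed statistic, matching the 3/100 above.

```python
def simulation_p_value(simulated, observed):
    """Double the smaller tail proportion of a simulated null distribution."""
    n = len(simulated)
    left = sum(s <= observed for s in simulated) / n    # proportion at or below
    right = sum(s >= observed for s in simulated) / n   # proportion at or above
    return min(1.0, 2 * min(left, right))               # a p-value cannot exceed 1

# Hypothetical toy data: 100 simulated statistics, 3 of them at or below 0.40
simulated = [0.38, 0.39, 0.40] + [0.45 + 0.001 * i for i in range(97)]
print(simulation_p_value(simulated, observed=0.40))
```

No symmetry is assumed here: each tail is counted separately, which is why the rule also applies to skewed simulated distributions.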
Hypotheses: \(H_0: p = 0.5\) versus \(H_A: p \neq 0.5\)
Data: \(\hat{p} = 48/80 = 0.6\)
Check conditions:
Independence: Each student responds independently ✓
Success-Failure (Expected successes and failures):
Compute:
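The condition check and computation for this study can be sketched in Python as follows. The counts (48 of 80) and \(p_0 = 0.5\) come from the study above; the success-failure threshold of 10 is an assumed convention that these slides do not state, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, successes, p0 = 80, 48, 0.5
p_hat = successes / n

# Success-failure check uses the NULL value p0.
# The threshold of 10 is an assumed convention, not stated in the slides.
expected_successes = n * p0
expected_failures = n * (1 - p0)
conditions_ok = expected_successes >= 10 and expected_failures >= 10

se = sqrt(p0 * (1 - p0) / n)        # standard error under H0
z = (p_hat - p0) / se               # test statistic
left = NormalDist().cdf(z)          # area to the left of z
p_value = 2 * min(left, 1 - left)   # double the smaller tail

print(f"conditions ok: {conditions_ok}")
print(f"p_hat = {p_hat:.2f}, SE = {se:.4f}, Z = {z:.2f}, p-value = {p_value:.3f}")
```

This reproduces Z ≈ 1.79 and a two-sided p-value of about 0.074.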
Decision: P-value = 0.074 > α = 0.05
We fail to reject \(H_0\) at the α = 0.05 significance level.
Interpretation: We do not have convincing evidence that students can identify this AI-generated image at a rate different from chance.
The observed 60% accuracy, while higher than 50%, could plausibly have occurred by chance alone.
We cannot generalize to a larger population because the students were not a random sample.
When we make a decision about the null hypothesis (\(H_0\)), there are four possible outcomes:
| What we decide (based on data) | Truth (unknown to us): null hypothesis is true | Truth (unknown to us): null hypothesis is false |
|---|---|---|
| Reject null hypothesis | Type I error (false alarm) ✗ | Correct decision ✓ |
| Do not reject null hypothesis | Correct decision ✓ | Type II error (missed opportunity) ✗ |
Two ways to be right, two ways to be wrong.
Definition: Rejecting \(H_0\) when it’s actually true.
“False alarm” — claiming an effect that doesn’t exist.
In AI study context:
Consequence: Overestimating human detection abilities.
Definition: Failing to reject \(H_0\) when \(H_A\) is actually true.
“Missed detection” — failing to find a real effect.
In AI study context:
Consequence: Missing important findings about human-AI interaction.
The significance level α is the probability of making a Type I error when \(H_0\) is true.
\[\boxed{\text{Probability of Type I Error} = \text{Significance Level } (\alpha)}\]
- If we always use α = 0.05, we make Type I errors about 5% of the time when \(H_0\) is true
- Smaller α → fewer Type I errors
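To see how the Type I error rate tracks α, the sketch below computes the exact chance that the two-sided z-test rejects \(H_0\) when \(H_0\) is true, by summing binomial probabilities over every count that leads to rejection. The function name `type_i_error_rate` is illustrative, and the settings n = 80 and \(p_0 = 0.5\) are borrowed from the AI study.

```python
from math import comb, sqrt
from statistics import NormalDist

def type_i_error_rate(n, p0, alpha):
    """Exact chance that a two-sided z-test rejects H0 when H0 is true."""
    se = sqrt(p0 * (1 - p0) / n)
    rate = 0.0
    for k in range(n + 1):                      # every possible success count
        z = (k / n - p0) / se
        left = NormalDist().cdf(z)
        p_value = 2 * min(left, 1 - left)
        if p_value < alpha:
            # add P(X = k) under H0, where X ~ Binomial(n, p0)
            rate += comb(n, k) * p0**k * (1 - p0)**(n - k)
    return rate

print(round(type_i_error_rate(80, 0.5, 0.05), 3))
print(round(type_i_error_rate(80, 0.5, 0.01), 3))
```

The rates come out close to α rather than exactly equal to it, because the number of successes is discrete while the normal model is continuous. Lowering α from 0.05 to 0.01 lowers the Type I error rate.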
But there’s a trade-off…
We can’t minimize both error types simultaneously.
Choose α based on which error is more costly in your context.
\[\boxed{\text{Probability of Type II Error} = 1 - \text{Power}}\]
The null hypothesis distribution (H₀).
The alternative hypothesis distribution (Hₐ).
Three main factors increase power:
Larger sample size → more precision → easier to detect effects
Larger true effect → bigger differences are easier to spot
Larger α → easier to reject \(H_0\) (but more Type I errors!)
For our AI study: With only 80 students and an observed difference of 10 percentage points from chance, we may not have had enough power to detect a real (but small) effect.
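The three factors can be illustrated with a rough power calculation. The sketch below uses a normal approximation for a two-sided one-proportion z-test; the function name `approx_power` is hypothetical, and the "true" detection rates of 0.60 and 0.70 are assumptions chosen for illustration, not findings from the study.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n, p0, p_true, alpha=0.05):
    """Normal-approximation power of a two-sided one-proportion z-test."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se_null = sqrt(p0 * (1 - p0) / n)            # SE used to set rejection cutoffs
    se_true = sqrt(p_true * (1 - p_true) / n)    # SE of p-hat under the assumed truth
    lower = p0 - z_crit * se_null                # reject H0 if p-hat < lower
    upper = p0 + z_crit * se_null                # reject H0 if p-hat > upper
    sampling = NormalDist(mu=p_true, sigma=se_true)
    return sampling.cdf(lower) + (1 - sampling.cdf(upper))

baseline = approx_power(80, 0.5, 0.60)                # assumed true rate of 0.60
print(round(baseline, 2))
print(round(approx_power(320, 0.5, 0.60), 2))         # larger sample size
print(round(approx_power(80, 0.5, 0.70), 2))          # larger true effect
print(round(approx_power(80, 0.5, 0.60, 0.10), 2))    # larger alpha
```

Under these assumptions the baseline power is well below 1, and each of the three changes raises it.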
It depends on context!
Prioritize avoiding Type I errors when:
Prioritize avoiding Type II errors when:
Choose your significance level α accordingly.
A \((1-\alpha)\) confidence interval and a two-sided test at level α give consistent results:
Our AI study: 95% CI = (0.493, 0.707) contains 0.5 → consistent with failing to reject \(H_0\).
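The interval above can be reproduced with a short Python sketch. It assumes the usual normal-model interval \(\hat{p} \pm z^* \cdot SE\), with the standard error computed from \(\hat{p}\) rather than \(p_0\); variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, successes, p0 = 80, 48, 0.5
p_hat = successes / n
confidence = 0.95

z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96
se = sqrt(p_hat * (1 - p_hat) / n)                        # CI uses p-hat, not p0
lower = p_hat - z_star * se
upper = p_hat + z_star * se

print(f"95% CI = ({lower:.3f}, {upper:.3f})")
print("contains p0:", lower <= p0 <= upper)
```

The interval is computed with the standard error based on \(\hat{p}\), while the test uses \(p_0\), so in borderline cases the two procedures can disagree slightly. Here they agree: 0.5 lies inside the interval and the test does not reject \(H_0\).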
The confidence interval tells us more than just the hypothesis test:
Recommendation: Always report the CI, not just the p-value!
Our AI study: We’re 95% confident the true detection rate is between 49.3% and 70.7%.
Two-sided tests:
Decision errors:
CI connection:
AI Image Detection Research: