Two-Sided Tests and Decision Errors

Topic 7

Math 115

Recap: One-Sided Tests

Until now, we used one-sided hypothesis tests. The alternative hypothesis looked like one of the following:

  • \(H_A: p > p_0\)
  • \(H_A: p < p_0\)

P-value: Area in ONE tail of the normal distribution

We had a specific direction in mind for the alternative.

When Direction is Unknown

Scenario: You want to investigate how well people can identify AI-generated images.

Recent research has produced mixed results:

  • Some studies find accuracy around 60% (above chance)
  • Others find rates near 50% (essentially chance)

Research question: Is human detection accuracy DIFFERENT from chance (50%)?

We care about detecting a difference in either direction—this calls for a two-sided test.

Two-Sided Hypotheses

Two-sided hypothesis test:

  • \(H_0: p = p_0\) (parameter equals null value)
  • \(H_A: p \neq p_0\) (parameter is different from null value)

The \(\neq\) symbol means “not equal to”—the true value could be greater OR less than \(p_0\).

We don’t specify a direction in the alternative hypothesis.

When to Use Two-Sided Tests

Use two-sided when:

  • No prior expectation about direction
  • Both directions would be scientifically interesting
  • You want to detect ANY difference from the null

Default to two-sided unless you have a strong reason for one-sided.

AI Image Study Setup

Context: Given recent media attention on AI fakes, some people may be over-eager to call images AI-generated. Others may underestimate how realistic AI has become.

Study: 80 students each view the same AI-generated landscape image. Each guesses whether it’s real or AI-generated.

Result: 48 students correctly identify it as AI-generated (60%).

Hypotheses:

  • \(H_0: p = 0.5\) (detection rate equals chance)
  • \(H_A: p \neq 0.5\) (detection rate differs from chance)

The Problem with One Tail

If we used a one-sided test (\(H_A: p > 0.5\)):

  • P-value = area to the RIGHT of our Z-score
  • Only counts evidence in ONE direction

But wait… what if people were actually WORSE than chance?

  • If only 32 students got it right (40%), that would also be evidence against \(H_0\)!
  • A one-sided test would miss this

Both Tails Count

For \(H_A: p \neq 0.5\), evidence against \(H_0\) comes from either direction.

If \(\hat{p} = 0.60\) (Z = 1.79) OR \(\hat{p} = 0.40\) (Z = -1.79), we’d have equal evidence against \(H_0\).

Calculating Two-Sided P-values

Method: Double the smaller tail

  1. Calculate \(Z = \frac{\hat{p} - p_0}{SE}\)

  2. Find area to the LEFT of Z and area to the RIGHT of Z

  3. Double the SMALLER of these two areas

  • The two-sided p-value is twice the smaller one-sided p-value, so it is more conservative: it is harder to reject the null hypothesis.

  • This method works for both simulation-based and model-based inference, and for both symmetric and asymmetric distributions.
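The three steps above can be sketched in Python. This is only an illustration under the normal model: the function name `two_sided_p_value` is made up for this example, and `NormalDist` comes from Python's standard `statistics` module.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Two-sided p-value for a Z-score under the standard normal model."""
    left = NormalDist().cdf(z)   # area to the LEFT of Z
    right = 1 - left             # area to the RIGHT of Z
    # Double the smaller tail; cap at 1 because a probability cannot exceed 1.
    return min(1.0, 2 * min(left, right))


# Z = 1.79 and Z = -1.79 sit equally far from the center,
# so they give the same two-sided p-value.
print(two_sided_p_value(1.79))
print(two_sided_p_value(-1.79))
```

Both calls print a value close to 0.073. Rounding the tail area to 0.037 before doubling gives 0.074.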

Visualizing the Two-Sided P-value (Mathematical Model)

By symmetry, both tails have equal area. P-value = \(2 \times 0.037 = 0.074\)

Visualizing the Two-Sided P-value (Simulation-based)

  • The p-value counts the simulated outcomes in each tail and doubles the SMALLER tail proportion

The left tail is smaller, so the p-value is \(2 \times \frac{3}{100} = 0.06\).
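The same doubling rule can be applied to a simulated null distribution. The sketch below is an assumption-laden illustration, not the simulation behind the slide: the function name `simulated_two_sided_p`, the seed, and the choice of 10,000 repetitions are all made up here, and the setup (80 guesses, 48 correct) is borrowed from the AI image study.

```python
import random


def simulated_two_sided_p(sim_stats, observed):
    """Double the smaller tail proportion of a simulated null distribution."""
    n = len(sim_stats)
    left = sum(s <= observed for s in sim_stats) / n    # at or below observed
    right = sum(s >= observed for s in sim_stats) / n   # at or above observed
    return min(1.0, 2 * min(left, right))


random.seed(115)  # arbitrary seed so the run is repeatable

# Each simulated study: 80 guesses, each correct with probability 0.5 (H0 true).
null_counts = [
    sum(random.random() < 0.5 for _ in range(80))
    for _ in range(10_000)
]

p_value = simulated_two_sided_p(null_counts, observed=48)
print(p_value)
```

Because counts are discrete, the simulated p-value need not match the normal-model value exactly.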

AI Image Study: Complete Analysis

Hypotheses:

  • \(H_0: p = 0.5\) (detection rate equals chance)
  • \(H_A: p \neq 0.5\) (detection rate differs from chance)

Data: \(\hat{p} = 48/80 = 0.6\)

Check conditions:

  • Independence: Each student responds independently

  • Success-Failure (Expected successes and failures):

    • \(80 \times 0.5 = 40\) ≥ 10
    • \(80 \times (1-0.5) = 40\) ≥ 10

Compute:

  • \(SE = \sqrt{\frac{0.5 \times(1-0.5)}{80}} = 0.0559\)
  • \(Z = \frac{0.6 - 0.5}{0.0559} = 1.79\)
  • Two-sided P-value = \(2 \times 0.037 = 0.074\)
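The condition check and the three computations can be reproduced in a few lines of Python. This is a sketch using only the standard library, and the variable names are chosen for this example.

```python
from math import sqrt
from statistics import NormalDist

n, successes, p0 = 80, 48, 0.5   # values from the AI image study

p_hat = successes / n

# For a test, the success-failure condition uses the null value p0.
conditions_ok = n * p0 >= 10 and n * (1 - p0) >= 10

se = sqrt(p0 * (1 - p0) / n)          # standard error computed under H0
z = (p_hat - p0) / se
tail = 1 - NormalDist().cdf(abs(z))   # area in one tail beyond |Z|
p_value = 2 * tail                    # same as doubling the smaller tail

print(conditions_ok, round(se, 4), round(z, 2), round(p_value, 3))
```

The rounded results agree with the values above: SE ≈ 0.0559, Z ≈ 1.79, and a two-sided p-value of about 0.074.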

AI Image Study: Conclusion

Decision: P-value = 0.074 > α = 0.05

We fail to reject \(H_0\) at the α = 0.05 significance level.

Interpretation:

  • We do not have convincing evidence that students can identify this AI-generated image at a rate different from chance.

  • The observed 60% accuracy, while higher than 50%, could plausibly have occurred by chance alone.

  • We cannot generalize to a larger population, since the students were not a random sample.

Confidence Interval

  • The confidence interval does not depend on which test of significance we run, so the CI calculation does not change
  • Whether the null value falls inside the CI corresponds to the two-sided p-value:
    • If the two-sided p-value is greater than the significance level (\(\alpha\)), the null value is inside the \((1-\alpha)\) CI
    • If the two-sided p-value is less than the significance level (\(\alpha\)), the null value is outside the \((1-\alpha)\) CI
  • In our example, the two-sided p-value (0.074) is greater than \(\alpha = 0.05\), so the null value (\(p_0 = 0.5\)) is expected to be inside the 95% CI
  • On the other hand, since 0.074 < \(\alpha = 0.1\), the null value is NOT expected to be inside the 90% CI

Check conditions:

  • Independence: Each student responds independently

  • Success-Failure (Expected successes and failures):

    • \(80 \times 0.6 = 48\) ≥ 10
    • \(80 \times (1-0.6) = 32\) ≥ 10

Compute:

  • \(SE_{ci} = \sqrt{\frac{0.6 \times(1-0.6)}{80}} = 0.0548\)
  • General confidence interval: \(\hat{p} \pm z^* \times SE_{ci}\)
  • 95% CI = \(\hat{p} \pm 1.96 \times SE_{ci} = (0.493, 0.707)\) ← 0.5 is inside
  • 90% CI = \(\hat{p} \pm 1.645 \times SE_{ci} = (0.51, 0.69)\) ← 0.5 is NOT inside
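Both intervals can be recomputed with a short Python sketch. The helper name `proportion_ci` is made up for this example, and the critical value comes from `NormalDist().inv_cdf` rather than a table.

```python
from math import sqrt
from statistics import NormalDist

n, successes = 80, 48
p_hat = successes / n
se_ci = sqrt(p_hat * (1 - p_hat) / n)   # uses p-hat, not the null value


def proportion_ci(level):
    """Normal-model confidence interval for p at the given confidence level."""
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for 95%
    margin = z_star * se_ci
    return p_hat - margin, p_hat + margin


for level in (0.95, 0.90):
    low, high = proportion_ci(level)
    # The last column reports whether the null value 0.5 is inside the interval.
    print(level, round(low, 3), round(high, 3), low <= 0.5 <= high)
```

The 95% interval contains 0.5 and the 90% interval does not, in line with a two-sided p-value between 0.05 and 0.10.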

Decision Errors: Four Scenarios

When we make a decision about the null hypothesis (\(H_0\)), there are four possible outcomes:

Columns show what is true (unknown to us); rows show what we decide (based on data).

| What we decide | Null hypothesis is true | Null hypothesis is false |
|---|---|---|
| Reject null hypothesis | Type I error (false alarm) | Correct decision |
| Do not reject null hypothesis | Correct decision | Type II error (missed opportunity) |

Two ways to be right, two ways to be wrong.

Type I Error (False Positive)

Definition: Rejecting \(H_0\) when it’s actually true.

“False alarm” — claiming an effect that doesn’t exist.

In AI study context:

  • Concluding people can detect AI images differently than chance…
  • …when they actually can’t (detection rate really is 50%)

Consequence: Overestimating human detection abilities.

Type II Error (False Negative)

Definition: Failing to reject \(H_0\) when \(H_A\) is actually true.

“Missed detection” — failing to find a real effect.

In AI study context:

  • Failing to conclude that the detection rate differs from chance…
  • …when people actually do better (or worse) than 50%

Consequence: Missing important findings about human-AI interaction.

Significance Level Controls Type I Errors

The significance level α is the probability of making a Type I error when \(H_0\) is true.

\[\boxed{\text{Probability of Type I Error} = \text{Significance Level } (\alpha)}\]

  • If we always use α = 0.05, we make Type I errors about 5% of the time when \(H_0\) is true
  • Smaller α → fewer Type I errors
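The link between α and the Type I error rate can be checked for the AI study setup. The sketch below assumes \(H_0\) is true (p = 0.5, n = 80) and adds up the exact binomial probabilities of every outcome that would lead the two-sided z-test to reject. The helper name `rejects` is made up for this example.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p0, alpha = 80, 0.5, 0.05
z_star = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
se = sqrt(p0 * (1 - p0) / n)


def rejects(x):
    """True if x correct guesses leads the two-sided z-test to reject H0."""
    z = (x / n - p0) / se
    return abs(z) >= z_star


# Binomial probabilities are computed with p = p0, i.e. assuming H0 is true,
# so every rejection counted here is a Type I error.
type_1_rate = sum(
    comb(n, x) * p0**x * (1 - p0) ** (n - x)
    for x in range(n + 1)
    if rejects(x)
)
print(round(type_1_rate, 3))
```

The result is close to, but not exactly, 0.05: counts are discrete, so the normal model is only an approximation here.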

But there’s a trade-off…

The Trade-off

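We can’t minimize both error types simultaneously. For a fixed sample size, lowering α makes Type I errors rarer but also makes \(H_0\) harder to reject, so Type II errors become more likely.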

Choose α based on which error is more costly in your context.

What is Power?

  • Power is the probability of correctly rejecting \(H_0\) when \(H_A\) is actually true.
  • In other words: Power is the probability of detecting a real effect when it exists.
  • We want high power (typically 80% or higher).
  • If power is known, then we can find the probability of type II error:

\[\boxed{\text{Probability of Type II Error} = 1 - \text{Power}}\]
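Power can only be computed once a specific alternative is assumed. The sketch below picks a hypothetical true detection rate of 0.6 (an assumption for illustration, not a finding of the study) and computes the exact probability that the two-sided z-test with n = 80 and α = 0.05 rejects \(H_0\).

```python
from math import comb, sqrt
from statistics import NormalDist

n, p0, alpha = 80, 0.5, 0.05
p_true = 0.6   # ASSUMPTION: hypothetical true detection rate

z_star = NormalDist().inv_cdf(1 - alpha / 2)
se = sqrt(p0 * (1 - p0) / n)   # the test still uses the null standard error


def binom_pmf(x, size, p):
    """Probability of exactly x successes in `size` independent trials."""
    return comb(size, x) * p**x * (1 - p) ** (size - x)


# Power: total probability, under p_true, of outcomes that reject H0.
power = sum(
    binom_pmf(x, n, p_true)
    for x in range(n + 1)
    if abs((x / n - p0) / se) >= z_star
)
type_2_prob = 1 - power

print(round(power, 3), round(type_2_prob, 3))
```

Under this assumed alternative the power falls well short of the 80% target, so a Type II error would be more likely than not.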

Illustration of Power

The null hypothesis distribution (H₀).

The alternative hypothesis distribution (Hₐ).

Factors Affecting Power

Three main factors increase power:

  1. Larger sample size → more precision → easier to detect effects

  2. Larger true effect → bigger differences are easier to spot

  3. Larger α → easier to reject \(H_0\) (but more Type I errors!)

For our AI study: With only 80 students and an observed difference of 10 percentage points from chance, we may not have had enough power to detect a real (but small) effect.
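The three factors can be explored with a normal-approximation power calculation. This is a rough sketch: the function name `approx_power` and the scenario values (a true rate of 0.55, n = 320, and so on) are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n, p_true, p0=0.5, alpha=0.05):
    """Normal-approximation power of the two-sided one-proportion z-test."""
    z_star = NormalDist().inv_cdf(1 - alpha / 2)
    se_null = sqrt(p0 * (1 - p0) / n)
    # Approximate sampling distribution of p-hat when the true rate is p_true.
    sampling = NormalDist(p_true, sqrt(p_true * (1 - p_true) / n))
    upper_cutoff = p0 + z_star * se_null   # reject if p-hat is above this
    lower_cutoff = p0 - z_star * se_null   # or below this
    return (1 - sampling.cdf(upper_cutoff)) + sampling.cdf(lower_cutoff)


baseline = approx_power(n=80, p_true=0.55)
bigger_sample = approx_power(n=320, p_true=0.55)            # factor 1
bigger_effect = approx_power(n=80, p_true=0.65)             # factor 2
bigger_alpha = approx_power(n=80, p_true=0.55, alpha=0.10)  # factor 3

for name, value in [
    ("baseline", baseline),
    ("larger sample", bigger_sample),
    ("larger effect", bigger_effect),
    ("larger alpha", bigger_alpha),
]:
    print(f"{name}: {value:.3f}")
```

Each change raises power relative to the baseline, but only the larger α does so at the cost of more Type I errors.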

Which Error Matters More?

It depends on context!

Prioritize avoiding Type I errors when:

  • False positives are dangerous or costly
  • Example: Approving an ineffective medical treatment

Prioritize avoiding Type II errors when:

  • Missing a real effect is dangerous
  • Example: Failing to detect a disease in screening

Choose your significance level α accordingly.

CI and Two-Sided Test Connection

A \((1-\alpha)\) confidence interval and a two-sided test at level α give consistent results:

  • If the CI contains \(p_0\) → Don’t reject \(H_0\) (p-value > α)
  • If the CI excludes \(p_0\) → Reject \(H_0\) (p-value ≤ α)

Visualizing the Connection

Our AI study: 95% CI = (0.493, 0.707) contains 0.5 → consistent with failing to reject \(H_0\).

CIs Give More Information

The confidence interval tells us more than just the hypothesis test:

  • Direction: Is the effect positive or negative?
  • Magnitude: How big might the effect be?
  • Plausible values: What’s the range of reasonable estimates?

Recommendation: Always report the CI, not just the p-value!

Our AI study: We’re 95% confident the true detection rate is between 49.3% and 70.7%.

Summary

Two-sided tests:

  • \(H_A: p \neq p_0\) (parameter differs from null in either direction)
  • P-value = 2 × (smaller one-tail area)
  • Default choice when direction is unknown

Decision errors:

  • Type I: False positive (controlled by α)
  • Type II: False negative (reduced by increasing power)
  • Trade-off between the two

CI connection:

  • CI contains \(p_0\) ↔︎ Don’t reject \(H_0\)
  • CI excludes \(p_0\) ↔︎ Reject \(H_0\)
