Associations and Causation

Comparing Proportions

Math 115

What We’ve Done vs. What We’ll Do

Previously: Examined variables individually

  • Frequency tables for single categorical variables
  • Bar plots for single categorical variables
  • Summary statistics for single numerical variables
  • Inference for a single proportion

Now: Examine relationships between two variables

  • Research questions often ask about relationships
  • “Does treatment affect outcome?”
  • “Do different groups show different patterns?”

Association and Independence

When we examine two variables together:

  • Associated: There IS a relationship between the two variables
    • Knowing the value of one variable tells you something about the other
  • Independent: There is NO relationship between the two variables
    • Knowing one variable tells you nothing about the other

Our goal: Determine if two categorical variables are associated

Comic Characters

  • 15,128 comic characters from DC and Marvel comics
  • Key variables:
    • Alignment (align): good, bad, neutral, reformed criminals
    • Identity (id): secret, public, etc.
  • Research question: Is alignment associated with identity type?

Contingency Tables

A contingency table summarizes the relationship between two categorical variables:

  • Each cell shows the count for that combination of values
  • Row totals: frequency of each category of the row variable
  • Column totals: frequency of each category of the column variable
  • Grand total: total number of observations

Also called a two-way table or cross-tabulation
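The four pieces of a contingency table can be tallied directly from paired observations. The sketch below is only an illustration: the `records` list is made-up toy data (not the comics dataset), and `contingency_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical toy records: one (alignment, identity) pair per character.
records = [
    ("Bad", "Secret"), ("Bad", "Secret"), ("Bad", "Public"),
    ("Good", "Public"), ("Good", "Secret"), ("Good", "Public"),
    ("Neutral", "Public"),
]

def contingency_table(pairs):
    """Tally cell counts, row totals, column totals, and the grand total."""
    cells = Counter(pairs)                          # count per combination
    row_totals = Counter(row for row, _ in pairs)   # count per row category
    col_totals = Counter(col for _, col in pairs)   # count per column category
    return cells, row_totals, col_totals, len(pairs)

cells, row_totals, col_totals, grand_total = contingency_table(records)
print("Bad & Secret:", cells[("Bad", "Secret")])
print("Row total (Good):", row_totals["Good"])
print("Column total (Public):", col_totals["Public"])
print("Grand total:", grand_total)
```

Every row total and every column total sums to the same grand total, which is a quick sanity check on any table built this way.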

Comics Contingency Table

Contingency table for identity and alignment
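| align | No Dual | Public | Secret | Unknown | Total |
|---|---|---|---|---|---|
| Bad | 453 | 2,106 | 4,352 | 7 | 6,918 |
| Good | 640 | 2,905 | 2,430 | 0 | 5,975 |
| Neutral | 377 | 946 | 908 | 2 | 2,233 |
| Reformed Criminals | 0 | 1 | 1 | 0 | 2 |
| Total | 1,470 | 5,958 | 7,691 | 9 | 15,128 |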

Raw counts alone don’t reveal patterns clearly…

Conditional Proportions

Conditional proportions: Proportions calculated within groups

  • Most useful when conditioned on the grouping variable
  • Reveals patterns that raw counts may hide

Formula (conditioned on rows): (count in cell) / (row total)

Example:

\[\begin{array}{c} \text{Proportion of} \\ \text{bad characters with} \\ \text{secret identities} \end{array} = \frac{\text{Bad characters with secret identity}}{\text{Total bad characters}}\]
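Plugging in the counts from the comics contingency table gives a worked instance of the formula (rounded to three decimal places):

\[\frac{\text{Bad characters with secret identity}}{\text{Total bad characters}} = \frac{4{,}352}{6{,}918} \approx 0.629\]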

Comics Conditional Proportions

| align | No Dual | Public | Secret | Unknown |
|---|---|---|---|---|
| Bad | 0.065 | 0.304 | 0.629 | 0.001 |
| Good | 0.107 | 0.486 | 0.407 | 0.000 |
| Neutral | 0.169 | 0.424 | 0.407 | 0.001 |
| Reformed Criminals | 0.000 | 0.500 | 0.500 | 0.000 |

Key finding: Bad characters appear more likely to have secret identities (62.9%) than good characters (40.7%)
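The row-conditioned proportions can be recomputed from the raw counts. This is a minimal sketch: the counts are copied from the comics contingency table, while the dictionary layout and the `row_proportions` helper are assumptions made for illustration.

```python
# Counts copied from the comics contingency table (one row per alignment).
identity_types = ["No Dual", "Public", "Secret", "Unknown"]
counts = {
    "Bad": [453, 2106, 4352, 7],
    "Good": [640, 2905, 2430, 0],
    "Neutral": [377, 946, 908, 2],
    "Reformed Criminals": [0, 1, 1, 0],
}

def row_proportions(row_counts):
    """Divide each cell count by its row total."""
    total = sum(row_counts)
    return [count / total for count in row_counts]

proportions = {align: row_proportions(row) for align, row in counts.items()}

print(f"{'align':<20}" + "  ".join(f"{name:>7}" for name in identity_types))
for align, props in proportions.items():
    print(f"{align:<20}" + "  ".join(f"{p:7.3f}" for p in props))
```

Because each row is divided by its own total, every row of proportions sums to 1, which is what makes rows with very different counts comparable.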

Visualizing Two Categorical Variables

Three types of bar plots for two categorical variables:

  • Dodged bar plot: Bars side-by-side, compare counts across groups
  • Stacked bar plot: Bars stacked, show composition of each group
  • Standardized bar plot: Bars stacked to same height, compare proportions within groups

Standardized bar plots are most useful for detecting associations

Comics Standardized Bar Plot

Standardized bar plot showing identity proportions by alignment

The plot clearly shows that bad characters have a higher proportion of secret identities

Explanatory and Response Variables

When we suspect one variable may affect another:

  • Explanatory variable: The variable that might influence or explain the other
    • Also called: independent variable, predictor
  • Response variable: The variable that might be affected or respond
    • Also called: dependent variable, outcome

Not all studies have clear explanatory/response roles

Stent Study Introduction

Researchers studied stent effectiveness for preventing strokes:

  • 451 at-risk patients assigned to receive stent (treatment) or not (control)
  • Outcome (“stroke” or “no event”) recorded after 365 days
First five patients
| group | outcome |
|---|---|
| control | no event |
| treatment | no event |
| treatment | stroke |
| treatment | no event |
| control | no event |
  • Explanatory variable: group (treatment/control)
  • Response variable: outcome (stroke/no event)

Two Scope of Inference Questions

Recall from before: The first scope of inference question

  1. Generalization: Can we apply findings to the broader population?
    • Answered by: How we sampled (random vs. convenience)

NEW: The second scope of inference question

  2. Causation: Can we conclude cause-and-effect?
    • Answered by: How we conducted the study

The Complete Picture

These are TWO SEPARATE QUESTIONS:

| Question | Determined by | Key feature |
|---|---|---|
| Generalization | Sampling method | Random sampling |
| Causation | Study design | Random assignment |

A study can:

  • Generalize but not establish causation
  • Establish causation but not generalize
  • Do both, or neither
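The four combinations can be spelled out as a small lookup. This is an illustrative sketch only; the function name `scope_of_inference` and the wording of the returned strings are assumptions, not terminology from the course.

```python
def scope_of_inference(random_sampling: bool, random_assignment: bool) -> str:
    """Answer the two scope-of-inference questions independently."""
    generalize = (
        "can generalize to the population"
        if random_sampling
        else "cannot generalize beyond the sample"
    )
    causation = (
        "can conclude cause-and-effect"
        if random_assignment
        else "can only conclude association"
    )
    return f"{generalize}; {causation}"

# Enumerate all four designs: the two answers never depend on each other.
for sampling in (True, False):
    for assignment in (True, False):
        print(f"sampling={sampling!s:<5} assignment={assignment!s:<5} -> "
              f"{scope_of_inference(sampling, assignment)}")
```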

Observational Study vs. Experiment

Observational study: Researchers observe without manipulation

  • Cases observed “as they are”
  • No control over explanatory variable
  • Example: Survey asking about diet and health

Experiment: Researchers deliberately manipulate explanatory variable

  • Cases assigned to treatment conditions
  • Control over explanatory variable
  • Example: Randomly assigning patients to receive drug or placebo

Ice Cream and Drowning

A surprising finding: Strong positive association between ice cream sales and drowning deaths

  • When ice cream sales are high → drowning deaths are high
  • When ice cream sales are low → drowning deaths are low

Does eating ice cream cause drowning?

No! A confounding variable explains both:

  • Hot weather → more ice cream sales
  • Hot weather → more swimming → more drowning
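A small simulation shows how a shared cause can produce an association on its own. Everything here is made up for illustration: the temperature range, the coefficients, the noise levels, and the "drownings" index are assumptions, and neither outcome appears in the other's formula.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical daily data: temperature drives BOTH outcomes.
temps = [random.uniform(10, 35) for _ in range(365)]
ice_cream_sales = [20 * t + random.gauss(0, 50) for t in temps]
drownings = [0.2 * t + random.gauss(0, 1) for t in temps]  # made-up index

r = pearson(ice_cream_sales, drownings)
print(f"correlation(ice cream sales, drownings) = {r:.2f}")
```

The correlation comes out clearly positive even though ice cream sales never influence drownings in the simulation; the only link between them is temperature.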

Confounding Variables

A confounding variable is associated with BOTH the explanatory and response variables

flowchart TD
    A[Temperature<br/>confounder] --> B[Ice cream sales]
    A --> C[Drowning deaths]

    style A fill:#ffcccc,stroke:#cc0000,stroke-width:3px
    style B fill:#cce5ff,stroke:#0066cc,stroke-width:2px
    style C fill:#cce5ff,stroke:#0066cc,stroke-width:2px

Problem: Creates alternative explanations for observed associations

  • In observational studies, we can’t rule out confounders
  • We can only say variables are associated, not that one causes the other

Random Assignment

The solution to confounding: Random assignment

Random assignment: Cases are randomly assigned to treatment groups

  • Researchers control who receives which treatment
  • Assignment made by chance (like flipping a coin)
  • Not based on patient characteristics, preferences, or doctor’s choice

Key benefit: Ensures treatment groups are similar on average with respect to all possible confounders
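The balancing effect of chance can be sketched with a quick simulation. The patients and their "severity" scores below are invented for illustration, and shuffling then splitting is just one simple way to assign by chance; the group sizes 224 and 227 are taken from the stent study's results table.

```python
import random
import statistics

random.seed(115)  # fixed seed so the sketch is reproducible

# Hypothetical patients, each with a made-up severity score
# standing in for a potential confounder.
patients = [{"id": i, "severity": random.uniform(0, 100)} for i in range(451)]

# Random assignment: shuffle, then split. Chance alone decides the groups;
# severity plays no role in who ends up where.
random.shuffle(patients)
treatment, control = patients[:224], patients[224:]

mean_treatment = statistics.mean(p["severity"] for p in treatment)
mean_control = statistics.mean(p["severity"] for p in control)
print(f"mean severity, treatment: {mean_treatment:.1f}")
print(f"mean severity, control:   {mean_control:.1f}")
```

The two group means should land close together, and the same would hold for any other characteristic, measured or not, because the assignment never looks at it.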

Stent Study: Random Assignment

How random assignment worked in the stent study:

  • 451 at-risk patients enrolled in the study
  • Each patient randomly assigned to treatment or control
  • Assignment determined by chance, not by patient or doctor choice

Result of random assignment: Treatment and control groups similar on average in:

  • Age
  • Severity of stenosis (narrowing of arteries)
  • Other health conditions

Stent Study: Observational Alternative

What if the stent study had been observational?

  • Patients and doctors choose whether to get stent (no random assignment)

Problem: Patients with more severe stenosis may be more likely to:

  • Receive stent (doctors recommend it for severe cases)
  • Have strokes (severe stenosis increases stroke risk)

Severity of stenosis is a confounder:

flowchart TD
    A[Severity of stenosis] --> B[Stent treatment]
    A --> C[Stroke outcome]

    style A fill:#ffcccc,stroke:#cc0000,stroke-width:3px
    style B fill:#cce5ff,stroke:#0066cc,stroke-width:2px
    style C fill:#cce5ff,stroke:#0066cc,stroke-width:2px

Why Experiments Can Establish Causation

Random assignment breaks the confounder link:

In the actual randomized stent study:

  1. Researchers controlled treatment assignment (not patients/doctors)
  2. Random assignment balanced severity of stenosis (and other potential confounders) between groups
  3. Any difference in stroke rates beyond what chance alone would produce can be attributed to the stent itself

Without random assignment, we couldn’t distinguish:

  • Does the stent cause more strokes?
  • Or do sicker patients (who would get stents) just have more strokes anyway?

Random Assignment vs. Random Sampling

| | Random Sampling | Random Assignment |
|---|---|---|
| What | How cases are selected from population | How cases are assigned to groups |
| Purpose | Enable generalization | Enable causal conclusions |
| Controls for | Sampling bias | Confounding variables |

These are different concepts that address different questions!

Stent Study Results

Summary of results for stent study
| Group | Stroke | No Event |
|---|---|---|
| control | 28 | 199 |
| treatment | 45 | 179 |

Stroke proportions: Control 12%, Treatment 20%
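The stroke proportions are conditional proportions within each group, computed from the results table. A minimal sketch, with the counts copied from the table and the variable and function names chosen for illustration:

```python
# Counts copied from the stent study results table.
results = {
    "control": {"stroke": 28, "no event": 199},
    "treatment": {"stroke": 45, "no event": 179},
}

def stroke_proportion(group_counts):
    """Proportion of a group's patients who had a stroke."""
    return group_counts["stroke"] / sum(group_counts.values())

p_control = stroke_proportion(results["control"])      # 28 / 227
p_treatment = stroke_proportion(results["treatment"])  # 45 / 224
difference = p_treatment - p_control

print(f"control:    {p_control:.3f}")
print(f"treatment:  {p_treatment:.3f}")
print(f"difference: {difference:.3f}")
```

The difference in proportions (treatment minus control) is the quantity whose statistical significance is being assessed.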

It turns out that this difference is statistically significant

This was a randomized experiment → Can we conclude stents caused more strokes? Yes! Random assignment balances potential confounders across the groups, so they cannot explain the difference.

Recycling Study: Observational

Question: Does a sign above a recycling bin reduce contamination?

Study design: Students compare two bins over one month

  • Bin in Student Center: no sign
  • Bin in Lichty Hall (residence): has sign
  • Result: Lower contamination with sign

Is this an experiment? No: the students did not manipulate sign placement

Possible confounder: Location type (public vs. residential spaces may have different behaviors regardless of signs)

Recycling Study: Experiment

Same question, different design:

  • 20 bins across campus
  • Each day, randomly assign 10 bins to get signs, 10 without
  • Record contamination for each bin

Now it’s an experiment!

  • Researchers control sign placement
  • Random assignment balances location types across conditions
  • I.e., a bin in a residential space is no more likely to get a sign than a public one
  • If lower contamination with signs → signs caused the reduction
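The daily assignment step could look like the sketch below. The bin labels, the 30-day study length, and the `assign_signs` helper are hypothetical; only the 20 bins and the 10/10 split come from the design described above.

```python
import random

random.seed(2024)  # fixed seed so the sketch is reproducible

bins = [f"bin_{i:02d}" for i in range(1, 21)]  # hypothetical labels for 20 bins

def assign_signs(bin_ids, n_signs=10):
    """Randomly pick which bins get a sign today; the rest go without."""
    with_sign = set(random.sample(bin_ids, n_signs))
    return {b: (b in with_sign) for b in bin_ids}

# Re-randomize every day of a hypothetical 30-day study.
schedule = [assign_signs(bins) for _ in range(30)]

# Count how many days each bin carried a sign.
days_with_sign = {b: sum(day[b] for day in schedule) for b in bins}
for b, days in days_with_sign.items():
    print(b, days)
```

Since location never enters `assign_signs`, each bin has the same chance of getting a sign on any given day, whatever kind of space it sits in.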

Summary: Two Categorical Variables

Summarizing relationships:

  • Contingency tables: Counts for each combination
  • Conditional proportions: Proportions within groups (reveal patterns)
  • Standardized bar plots: Visualize proportional differences

Key concepts:

  • Associated: Variables are related
  • Independent: No relationship between variables

Summary: Scope of Inference

Two separate questions about what we can conclude:

  • Generalization → How did we sample?
  • Causation → How did we conduct the study?
