Associations and Causation

Comparing Proportions

Math 115

What We’ve Done vs. What We’ll Do

Previously: Examined variables individually

Frequency tables for single categorical variables
Bar plots for single categorical variables
Summary statistics for single numerical variables
Inference for a single proportion

Now: Examine relationships between two variables

Research questions often ask about relationships
“Does treatment affect outcome?”
“Do different groups show different patterns?”

Association and Independence

When we examine two variables together:

Associated: There IS a relationship between the two variables
- Knowing the value of one variable tells you something about the other
Independent: There is NO relationship between the two variables
- Knowing one variable tells you nothing about the other

Our goal: Determine if two categorical variables are associated

Comic Characters

15,128 comic characters from DC and Marvel comics
Key variables:
- Alignment (align): good, bad, neutral
- Identity (id): secret, public, etc.
Research question: Is alignment associated with identity type?

Contingency Tables

A contingency table summarizes the relationship between two categorical variables:

Each cell shows the count for that combination of values
Row totals: frequency of each category of the row variable
Column totals: frequency of each category of the column variable
Grand total: total number of observations

Also called a two-way table or cross-tabulation
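
As a rough illustration of how such a table is tallied, the sketch below counts combinations from a handful of made-up records using only Python's standard library. The records and variable names are hypothetical and are not the comics data.

```python
from collections import Counter

# Hypothetical raw data: one (alignment, identity) pair per character.
records = [
    ("Good", "Secret"), ("Good", "Public"), ("Bad", "Secret"),
    ("Bad", "Secret"), ("Good", "Public"), ("Bad", "Public"),
]

# Each cell: the count for one combination of values.
cells = Counter(records)

# Margins: totals for the row variable and for the column variable.
row_totals = Counter(align for align, _ in records)
col_totals = Counter(identity for _, identity in records)
grand_total = len(records)

# Print one line per row of the table, then the column margins.
for align in sorted(row_totals):
    row = {identity: cells[(align, identity)] for identity in sorted(col_totals)}
    print(align, row, "Total:", row_totals[align])
print("Column totals:", dict(col_totals), "Grand total:", grand_total)
```

The cell counts, the row totals, and the column totals each add up to the grand total, which is a quick sanity check on any contingency table.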

Comics Contingency Table

Contingency table of alignment (rows) by identity (columns)
align	No Dual	Public	Secret	Unknown	Total
Bad	453	2,106	4,352	7	6,918
Good	640	2,905	2,430	0	5,975
Neutral	377	946	908	2	2,233
Reformed Criminals	0	1	1	0	2
Total	1,470	5,958	7,691	9	15,128

Raw counts alone don’t reveal patterns clearly…

Conditional Proportions

Conditional proportions: Proportions calculated within groups

Most useful when computed within each level of the grouping variable, so groups of different sizes can be compared directly
Reveal patterns that raw counts may hide

Formula (conditioned on rows): (count in cell) / (row total)

Example:

\[\begin{array}{c} \text{Proportion of} \\ \text{bad characters with} \\ \text{secret identities} \end{array} = \frac{\text{Bad characters with secret identity}}{\text{Total bad characters}}\]
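
Plugging in the counts from the comics contingency table (4,352 bad characters with secret identities out of 6,918 bad characters in total) gives:

\[\frac{4352}{6918} \approx 0.629\]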

Comics Conditional Proportions

align	No Dual	Public	Secret	Unknown
Bad	0.065	0.304	0.629	0.001
Good	0.107	0.486	0.407	0.000
Neutral	0.169	0.424	0.407	0.001
Reformed Criminals	0.000	0.500	0.500	0.000

Key finding: Bad characters appear more likely to have secret identities (62.9%) than good characters (40.7%)
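
The row proportions above can be reproduced from the counts in the contingency table. The following is a minimal sketch in plain Python; the function name `row_proportions` and the dictionary layout are choices made for this illustration.

```python
# Counts copied from the comics contingency table
# (rows: alignment, columns: identity).
identities = ["No Dual", "Public", "Secret", "Unknown"]
counts = {
    "Bad": [453, 2106, 4352, 7],
    "Good": [640, 2905, 2430, 0],
    "Neutral": [377, 946, 908, 2],
    "Reformed Criminals": [0, 1, 1, 0],
}


def row_proportions(table):
    """Divide each cell by its row total (conditioning on the row variable)."""
    result = {}
    for group, row in table.items():
        total = sum(row)
        result[group] = [count / total for count in row]
    return result


props = row_proportions(counts)
for group, row in props.items():
    print(group, [round(p, 3) for p in row])
```

Each row of proportions sums to 1 (up to rounding), because every character in a given alignment falls into exactly one identity category.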

Visualizing Two Categorical Variables

Three types of bar plots for two categorical variables:

Dodged bar plot: Bars side-by-side, compare counts across groups
Stacked bar plot: Bars stacked, show composition of each group
Standardized bar plot: Bars stacked to same height, compare proportions within groups

Standardized bar plots are most useful for detecting associations
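
To see what separates a stacked bar from a standardized one, it can help to compute where each segment ends. The sketch below does this for two rows of the comics table without drawing anything; the helper names `stacked_tops` and `standardized_tops` are made up for this example.

```python
from itertools import accumulate

# Counts for two alignments, copied from the comics contingency table.
# Column order: No Dual, Public, Secret, Unknown.
counts = {
    "Bad": [453, 2106, 4352, 7],
    "Good": [640, 2905, 2430, 0],
}


def stacked_tops(row):
    """Top edge of each segment in a stacked bar (segment heights are counts)."""
    return list(accumulate(row))


def standardized_tops(row):
    """Top edge of each segment after rescaling the whole bar to height 1."""
    total = sum(row)
    return [top / total for top in accumulate(row)]


for group, row in counts.items():
    print(group, "stacked:", stacked_tops(row))
    print(group, "standardized:", [round(t, 3) for t in standardized_tops(row)])
```

Stacked bars end at different heights (the group sizes), while standardized bars all end at 1, so only the within-group proportions are left to compare.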

Comics Standardized Bar Plot

Standardized bar plot showing identity proportions by alignment

The plot shows that bad characters have a higher proportion of secret identities than good or neutral characters

Explanatory and Response Variables

When we suspect one variable may affect another:

Explanatory variable: The variable that might influence or explain the other
- Also called: independent variable, predictor
Response variable: The variable that might be affected or respond
- Also called: dependent variable, outcome

Not all studies have clear explanatory/response roles

Stent Study Introduction

Researchers studied stent effectiveness for preventing strokes:

451 at-risk patients assigned to receive stent (treatment) or not (control)
Outcome (“stroke” or “no event”) recorded after 365 days

First five patients
group	outcome
control	no event
treatment	no event
treatment	stroke
treatment	no event
control	no event

Explanatory variable: group (treatment/control)
Response variable: outcome (stroke/no event)

Two Scope of Inference Questions

Recall from before: The first scope of inference question

Generalization: Can we apply findings to the broader population?
- Answered by: How we sampled (random vs. convenience)

NEW: The second scope of inference question

Causation: Can we conclude cause-and-effect?
- Answered by: How we conducted the study

The Complete Picture

These are TWO SEPARATE QUESTIONS:

Question	Determined by	Key feature
Generalization	Sampling method	Random sampling
Causation	Study design	Random assignment

A study can:

Generalize but not establish causation
Establish causation but not generalize
Do both, or neither
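
One way to keep the two questions apart is to treat them as two independent yes/no inputs. The small function below is only an illustrative sketch of the table above; its name and wording are not from the textbook.

```python
def scope_of_inference(random_sampling: bool, random_assignment: bool) -> str:
    """Summarize what a study supports, answering each question separately."""
    # Generalization depends only on how cases were sampled.
    if random_sampling:
        generalization = "can generalize to the population"
    else:
        generalization = "cannot generalize beyond the sample"
    # Causation depends only on how the study was conducted.
    if random_assignment:
        causation = "can conclude cause-and-effect"
    else:
        causation = "can conclude association only"
    return f"{generalization}; {causation}"


# All four combinations are possible.
for sampling in (True, False):
    for assignment in (True, False):
        print(sampling, assignment, "->", scope_of_inference(sampling, assignment))
```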

Observational Study vs. Experiment

Observational study: Researchers observe without manipulation

Cases observed “as they are”
No control over explanatory variable
Example: Survey asking about diet and health

Experiment: Researchers deliberately manipulate explanatory variable

Cases assigned to treatment conditions
Control over explanatory variable
Example: Randomly assigning patients to receive drug or placebo

Ice Cream and Drowning

A surprising finding: Strong positive association between ice cream sales and drowning deaths

When ice cream sales are high → drowning deaths are high
When ice cream sales are low → drowning deaths are low

Does eating ice cream cause drowning?

No! A confounding variable explains both:

Hot weather → more ice cream sales
Hot weather → more swimming → more drowning
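
A small simulation can make this concrete. In the sketch below, temperature drives both ice cream sales and drownings, and ice cream sales never enter the drowning calculation. All numbers (the temperature range, the sales formula, the drowning chances) are invented for illustration.

```python
import random
from statistics import mean, median

random.seed(115)  # arbitrary seed so the sketch is repeatable

days = []
for _ in range(1000):
    temp = random.uniform(0, 35)  # made-up daily temperature
    # Sales rise with temperature, plus some noise.
    sales = 100 + 10 * temp + random.gauss(0, 20)
    # Ten hypothetical at-risk swimmers per day; risk rises with temperature.
    # Ice cream sales play no role in this line.
    drownings = sum(random.random() < temp / 100 for _ in range(10))
    days.append((sales, drownings))

# Split days into high-sales and low-sales groups at the median.
cutoff = median(s for s, _ in days)
high_mean = mean([d for s, d in days if s > cutoff])
low_mean = mean([d for s, d in days if s <= cutoff])

print(f"Mean drownings on high-sales days: {high_mean:.2f}")
print(f"Mean drownings on low-sales days:  {low_mean:.2f}")
```

High-sales days show more drownings even though the simulation contains no causal link from ice cream to drowning; the association comes entirely from temperature.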

Confounding Variables

A confounding variable is associated with BOTH the explanatory and response variables

flowchart TD
    A[Temperature<br/>confounder] --> B[Ice cream sales]
    A --> C[Drowning deaths]

    style A fill:#ffcccc,stroke:#cc0000,stroke-width:3px
    style B fill:#cce5ff,stroke:#0066cc,stroke-width:2px
    style C fill:#cce5ff,stroke:#0066cc,stroke-width:2px

Problem: Creates alternative explanations for observed associations

In observational studies, we can’t rule out confounders
We can only say variables are associated, not that one causes the other

Random Assignment

The solution to confounding: Random assignment

Random assignment: Cases are randomly assigned to treatment groups

Researchers control who receives which treatment
Assignment made by chance (like flipping a coin)
Not based on patient characteristics, preferences, or doctor’s choice

Key benefit: Ensures treatment groups are similar on average with respect to all possible confounders
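
The sketch below shows one simple way to assign cases by chance: shuffle the list and split it. The patients and their `severity` scores are hypothetical, and the near-even split is an assumption of this example rather than a description of any particular study.

```python
import random
from statistics import mean

random.seed(8)  # arbitrary seed so the sketch is repeatable

# Hypothetical patients: severity is recorded but never used for assignment.
patients = [{"id": i, "severity": random.uniform(0, 1)} for i in range(451)]

# Random assignment: shuffle a copy of the list, then split it in two.
shuffled = patients[:]
random.shuffle(shuffled)
half = len(shuffled) // 2
treatment, control = shuffled[:half], shuffled[half:]

# Compare the groups on a characteristic that could act as a confounder.
treatment_mean = mean([p["severity"] for p in treatment])
control_mean = mean([p["severity"] for p in control])
gap = treatment_mean - control_mean

print(f"Mean severity, treatment: {treatment_mean:.3f}")
print(f"Mean severity, control:   {control_mean:.3f}")
```

Because chance alone decides group membership, the two group means land close together, and the same holds on average for characteristics nobody measured.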

Stent Study: Random Assignment

How random assignment worked in the stent study:

451 at-risk patients enrolled in the study
Each patient randomly assigned to treatment or control
Assignment determined by chance, not by patient or doctor choice

Result of random assignment: Treatment and control groups similar on average in:

Age
Severity of stenosis (narrowing of arteries)
Other health conditions

Stent Study: Observational Alternative

What if the stent study had been observational?

Patients and doctors choose whether to get stent (no random assignment)

Problem: Patients with more severe stenosis may be more likely to:

Receive stent (doctors recommend it for severe cases)
Have strokes (severe stenosis increases stroke risk)

Severity of stenosis is a confounder:

flowchart TD
    A[Severity of stenosis] --> B[Stent treatment]
    A --> C[Stroke outcome]

    style A fill:#ffcccc,stroke:#cc0000,stroke-width:3px
    style B fill:#cce5ff,stroke:#0066cc,stroke-width:2px
    style C fill:#cce5ff,stroke:#0066cc,stroke-width:2px

Why Experiments Can Establish Causation

Random assignment breaks the confounder link:

In the actual randomized stent study:

Researchers controlled treatment assignment (not patients/doctors)
Random assignment balanced severity of stenosis (and other potential confounders) between groups
A difference in stroke rates larger than chance alone would produce can be attributed to the stent itself

Without random assignment, we couldn’t distinguish:

Does the stent cause more strokes?
Or do sicker patients (who would get stents) just have more strokes anyway?
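
The simulation below illustrates the problem under a deliberately artificial assumption: the simulated stent has no effect at all, and stroke risk depends only on severity. The risk formula and the way doctors choose are invented for this sketch and do not describe the real study.

```python
import random

random.seed(2024)  # arbitrary seed so the sketch is repeatable


def stroke_rate_gap(assign_stent, n=20000):
    """Stroke rate with stent minus stroke rate without, for one simulated study."""
    strokes = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for _ in range(n):
        severity = random.uniform(0, 1)
        stent = assign_stent(severity)
        # Stroke risk depends on severity only: the simulated stent does nothing.
        stroke = random.random() < 0.05 + 0.25 * severity
        totals[stent] += 1
        strokes[stent] += stroke
    return strokes[True] / totals[True] - strokes[False] / totals[False]


# Observational design: sicker patients are more likely to get a stent.
observational = stroke_rate_gap(lambda severity: random.random() < severity)
# Experiment: a coin flip decides, ignoring severity.
randomized = stroke_rate_gap(lambda severity: random.random() < 0.5)

print(f"Observational gap in stroke rates: {observational:+.3f}")
print(f"Randomized gap in stroke rates:    {randomized:+.3f}")
```

In the observational version the stent group has noticeably more strokes even though the stent does nothing here, while the randomized version shows a gap near zero. That contrast is why only the randomized design lets a real gap be read as an effect of the treatment.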

Random Assignment vs. Random Sampling

	Random Sampling	Random Assignment
What	How cases are selected from population	How cases are assigned to groups
Purpose	Enable generalization	Enable causal conclusions
Controls for	Sampling bias	Confounding variables

These are different concepts that address different questions!
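
In code, the two steps are separate operations. The sketch below uses a made-up population of labeled people; the sizes (10,000 people, 100 sampled, 50 per group) are arbitrary.

```python
import random

random.seed(1)  # arbitrary seed so the sketch is repeatable

population = [f"person_{i}" for i in range(10_000)]  # hypothetical population

# Random sampling: decides WHO is in the study (supports generalization).
sample = random.sample(population, 100)

# Random assignment: decides WHICH GROUP each sampled case joins
# (supports causal conclusions).
random.shuffle(sample)
treatment, control = sample[:50], sample[50:]

print(len(sample), len(treatment), len(control))
```

A study could do the first step without the second (a survey of a random sample) or the second without the first (an experiment on volunteers).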

Stent Study Results

Summary of results for stent study
Group	Stroke	No Event
control	28	199
treatment	45	179

Stroke proportions: Control 12%, Treatment 20%

It turns out that this difference is statistically significant

This was a randomized experiment → Can we conclude stents caused more strokes? Yes! Random assignment rules out confounders, and statistical significance indicates the difference is larger than chance alone would explain.
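
The stroke proportions are conditional proportions within each group, computed from the results table. A minimal sketch, with the dictionary layout and function name chosen for this example:

```python
# Counts copied from the stent study results table.
results = {
    "control": {"stroke": 28, "no event": 199},
    "treatment": {"stroke": 45, "no event": 179},
}


def stroke_proportion(group):
    """Proportion of patients in the given group who had a stroke."""
    counts = results[group]
    return counts["stroke"] / sum(counts.values())


p_control = stroke_proportion("control")
p_treatment = stroke_proportion("treatment")
diff = p_treatment - p_control

print(f"Control:    {p_control:.1%}")
print(f"Treatment:  {p_treatment:.1%}")
print(f"Difference: {diff:.1%}")
```

The group sizes are 227 and 224, which together account for all 451 patients.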

Recycling Study: Observational

Question: Does a sign above a recycling bin reduce contamination?

Study design: Students compare two bins over one month

Bin in Student Center: no sign
Bin in Lichty Hall (residence): has sign
Result: Lower contamination with sign

Is this an experiment? No - no manipulation of sign placement

Possible confounder: Location type (public vs. residential spaces may have different behaviors regardless of signs)

Recycling Study: Experiment

Same question, different design:

20 bins across campus
Each day, randomly assign 10 bins to get signs and the other 10 to go without
Record contamination for each bin

Now it’s an experiment!

Researchers control sign placement
Random assignment balances location types across conditions
That is, a bin in a residential space is no more likely to get a sign than a bin in a public space
If contamination is lower with signs (beyond chance variation) → signs caused the reduction
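
The daily assignment step could be carried out along the following lines. The bin labels, the five-day schedule, and the function name `assign_signs` are hypothetical details added for this sketch.

```python
import random

random.seed(20)  # arbitrary seed so the sketch is repeatable

bins = [f"bin_{i:02d}" for i in range(1, 21)]  # hypothetical labels for the 20 bins


def assign_signs(bins, n_signs=10):
    """Randomly pick which bins get a sign today; returns {bin: has_sign}."""
    with_sign = set(random.sample(bins, n_signs))
    return {b: (b in with_sign) for b in bins}


# A fresh random assignment for each day of a (hypothetical) five-day run.
schedule = [assign_signs(bins) for _ in range(5)]

for day, assignment in enumerate(schedule, start=1):
    signed = sorted(b for b, has_sign in assignment.items() if has_sign)
    print(f"Day {day}: signs on {signed}")
```

Because the draw ignores where a bin is located, no location type is systematically favored for signs.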

Summary: Two Categorical Variables

Summarizing relationships:

Contingency tables: Counts for each combination
Conditional proportions: Proportions within groups (reveal patterns)
Standardized bar plots: Visualize proportional differences

Key concepts:

Associated: Variables are related
Independent: No relationship between variables

Summary: Scope of Inference

Two separate questions about what we can conclude:

Generalization → How did we sample?
Causation → How did we conduct the study?

References

Introduction to Modern Statistics (2e) textbook by Mine Çetinkaya-Rundel and Johanna Hardin
Sections 1.1, 1.2, 2.2, 2.3, 2.4, 4.1, 4.2, 4.3