The primary role of blocking in experiments is to reduce the unexplained variability in the response
Stratifying plays a similar role in observational studies
Groups of observational units are formed with similar values of the stratifying variable
Stratifying variable is accounted for as a source of variability in the analysis (similar to blocking in experiment)
Confounding Variables
Confounding variables are associated with both the dependent variable and the independent variable
Confounders can obscure the true relationship between the variables of interest
In experiments, randomization ensures that treatment groups are similar in terms of confounding variables
This is not the case in observational studies
Often, the stratifying variable is a potential confounding variable
Smoking and birth weight
Do infants whose mothers smoke have a different mean birth weight than infants whose mothers do not smoke?
births141 dataset
Random sample of 1,000 cases from US birth data set from 2014 (19 removed with missing values)
habit is smoking habit (“smoker” or “nonsmoker”)
weight is birth weight in pounds
premie birth is premature (premie) or full-term
First, we will include smoking habit as a single explanatory variable (no stratifying variable)
Although it would be appropriate to compare means for smoking and non-smoking mothers using a t-test (see Ch 20 slides), an F-test gives us the same results
lm(weight ~ habit, data = births14) |>anova() |>tidy()
# A tibble: 2 × 6
term df sumsq meansq statistic p.value
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 habit 1 35.4 35.4 21.6 0.00000382
2 Residuals 979 1604. 1.64 NA NA
premie as a Stratifying Variable
The length of the pregnancy is expected to explain some of the variation in birth weight
It also has the potential to be a confounding variable if there is an association between smoking and length of pregnancy
We will use premie as a stratifying variable in our analysis
Statistical Model with a Stratifying Variable
Statistical model for the \(k\)th observation in group \(i\) and stratum \(j\),
Due to the association between habit and premie it is difficult to fully separate their effects on birth weight
E.g., whether the mother smokes or not could affect whether the baby is a premie or not which could impact weight
\(SS_{habit}\) measures the variation in the response that is attributed to habit but does not include the variation that is jointly attributed to the two explanatory variables
If we reverse the order of the variables, we can also measure the variation that is attributed to premie without variation that is jointly attributed to the two variables
lm(weight ~ habit + premie, data = births14) |>anova() |>tidy()
# A tibble: 3 × 6
term df sumsq meansq statistic p.value
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 habit 1 35.4 35.4 27.7 1.75e- 7
2 premie 1 354. 354. 277. 5.13e-55
3 Residuals 978 1250. 1.28 NA NA
ANCOVA
In the previous analysis we used premie as a stratifying variable
The dataset also includes the variabe weeks, the length of the pregnancy in weeks
We can conduct an alternative analysis using weeks as a (numeric) covariate instead of stratifying by premie
Statistical model (parallel lines) for the \(j\)th observation in the \(i\)th group: \[y_{ij}=\mu + \alpha_i + \beta(X_{ij} - \bar{\bar{X}})+\varepsilon_{ij}\]
weeks as a covariate explains more of the variation in the response than premie (\(SS_{weeks} = 483\) vs. \(SS_{premie}=374\))
The resulting model explains more of the variability in the response than the model that included premie, so \(SSE\) is smaller
As a result, the \(F\) statistic is larger, and the \(p\)-value is smaller
lm(weight ~ weeks + habit, data = births14) |>anova() |>tidy()
# A tibble: 3 × 6
term df sumsq meansq statistic p.value
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 weeks 1 483. 483. 414. 4.66e-77
2 habit 1 14.9 14.9 12.8 3.68e- 4
3 Residuals 978 1141. 1.17 NA NA
Conclusions
Both analyses lead us to conclude that there is a significant association between smoking and birth weight when we also account for the length of pregnancies
We reach the same conclusion whether we treat the length of pregnancy as a categorical variable (premie) or a numeric variable (weeks)
Because this is an observational study, we cannot conclude that smoking causes a difference in birth weight
It is possible that there are other confounding variables that we have not accounted for in the model