Stratification in Observational Studies

Controlling Variation in Observational studies
Math 215

Reducing Unexplained Variability

The primary role of blocking in experiments is to reduce the unexplained variability in the response
Stratifying plays a similar role in observational studies
Groups of observational units are formed with similar values of the stratifying variable
Stratifying variable is accounted for as a source of variability in the analysis (similar to blocking in experiment)

Confounding Variables

Confounding variables are associated with both the dependent variable and the independent variable
Confounders can obscure the true relationship between the variables of interest
In experiments, randomization ensures that treatment groups are similar in terms of confounding variables
This is not the case in observational studies
Often, the stratifying variable is a potential confounding variable

Smoking and birth weight

Do infants whose mothers smoke have a different mean birth weight than infants whose mothers do not smoke?
births14 ¹ dataset
Random sample of 1,000 cases from US birth data set from 2014 (19 removed with missing values)
habit is smoking habit (“smoker” or “nonsmoker”)
weight is birth weight in pounds
premie birth is premature (premie) or full-term

First, we will include smoking habit as a single explanatory variable (no stratifying variable)
Although it would be appropriate to compare means for smoking and non-smoking mothers using a t-test (see Ch 20 slides), an F-test gives us the same results

lm(weight ~ habit, data = births14) |>
  anova() |>
  tidy()

# A tibble: 2 × 6
  term         df  sumsq meansq statistic     p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>       <dbl>
1 habit         1   35.4  35.4       21.6  0.00000382
2 Residuals   979 1604.    1.64      NA   NA

`premie` as a Stratifying Variable

The length of the pregnancy is expected to explain some of the variation in birth weight
It also has the potential to be a confounding variable if there is an association between smoking and length of pregnancy
We will use premie as a stratifying variable in our analysis

Statistical Model with a Stratifying Variable

Statistical model for the \(k\)th observation in group \(i\) and stratum \(j\),

\[y_{ijk}=\mu+\alpha_i+\beta_j+\varepsilon_{ijk}\]

\(\mu\) is the overall mean
\(\alpha_i\) is the differential effect of group \(i\)
\(\beta_j\) is the differential effect of stratum \(j\)
\(\varepsilon_{ijk}\sim N(0,\sigma^2)\) is the noise
Note that subscript \(k\) implies that there are several observations for each group/stratum combination

ANOVA Table

\(SS_{premie}\) accounts for the effect of premie
\(SS_{habit}\) accounts for both variables, but then \(SS_{premie}\) is subtracted off

lm(weight ~ premie + habit, data = births14) |>
  anova() |>
  tidy()

# A tibble: 3 × 6
  term         df  sumsq meansq statistic   p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>     <dbl>
1 premie        1  374.  374.       293.   1.18e-57
2 habit         1   15.4  15.4       12.0  5.42e- 4
3 Residuals   978 1250.    1.28      NA   NA

Interestingly, \(p\)-value for habit is higher when accounting for premie in the model (\(0.000542\) vs \(0.00000382\))
This is the opposite of what we would expect to see in an experiment, and is due to association between premie and habit

lm(weight ~ premie + habit, data = births14) |>
  anova() |>
  tidy()

# A tibble: 3 × 6
  term         df  sumsq meansq statistic   p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>     <dbl>
1 premie        1  374.  374.       293.   1.18e-57
2 habit         1   15.4  15.4       12.0  5.42e- 4
3 Residuals   978 1250.    1.28      NA   NA

We can also see the effects of this association if we compare the coefficient of the corresponding linear models.

lm(weight ~ habit, data = births14) |>
  tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic    p.value
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)    7.27     0.0435    167.   0         
2 habitsmoker   -0.593    0.128      -4.65 0.00000382

habit has a smaller effect when we account for premie

lm(weight ~ premie + habit, data = births14) |>
  tidy()

# A tibble: 3 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     7.47     0.0403    185.   0       
2 premiepremie   -1.84     0.110     -16.7  5.13e-55
3 habitsmoker    -0.393    0.113      -3.47 5.42e- 4

Association between `habit` and `premie`

Due to the association between habit and premie it is difficult to fully separate their effects on birth weight
E.g., whether the mother smokes or not could affect whether the baby is a premie or not which could impact weight
\(SS_{habit}\) measures the variation in the response that is attributed to habit but does not include the variation that is jointly attributed to the two explanatory variables

If we reverse the order of the variables, we can also measure the variation that is attributed to premie without variation that is jointly attributed to the two variables

lm(weight ~ habit + premie, data = births14) |>
  anova() |>
  tidy()

# A tibble: 3 × 6
  term         df  sumsq meansq statistic   p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>     <dbl>
1 habit         1   35.4  35.4       27.7  1.75e- 7
2 premie        1  354.  354.       277.   5.13e-55
3 Residuals   978 1250.    1.28      NA   NA

ANCOVA

In the previous analysis we used premie as a stratifying variable
The dataset also includes the variabe weeks, the length of the pregnancy in weeks
We can conduct an alternative analysis using weeks as a (numeric) covariate instead of stratifying by premie
Statistical model (parallel lines) for the \(j\)th observation in the \(i\)th group: \[y_{ij}=\mu + \alpha_i + \beta(X_{ij} - \bar{\bar{X}})+\varepsilon_{ij}\]

weeks as a covariate explains more of the variation in the response than premie (\(SS_{weeks} = 483\) vs. \(SS_{premie}=374\))
The resulting model explains more of the variability in the response than the model that included premie, so \(SSE\) is smaller
As a result, the \(F\) statistic is larger, and the \(p\)-value is smaller

lm(weight ~ weeks + habit, data = births14) |>
  anova() |>
  tidy()

# A tibble: 3 × 6
  term         df  sumsq meansq statistic   p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>     <dbl>
1 weeks         1  483.  483.       414.   4.66e-77
2 habit         1   14.9  14.9       12.8  3.68e- 4
3 Residuals   978 1141.    1.17      NA   NA

Conclusions

Both analyses lead us to conclude that there is a significant association between smoking and birth weight when we also account for the length of pregnancies
We reach the same conclusion whether we treat the length of pregnancy as a categorical variable (premie) or a numeric variable (weeks)
Because this is an observational study, we cannot conclude that smoking causes a difference in birth weight
It is possible that there are other confounding variables that we have not accounted for in the model