Stratification in Observational Studies

Controlling Variation in Observational Studies
Math 215

Reducing Unexplained Variability

  • The primary role of blocking in experiments is to reduce the unexplained variability in the response
  • Stratifying plays a similar role in observational studies
  • Groups of observational units are formed with similar values of the stratifying variable
  • The stratifying variable is accounted for as a source of variability in the analysis (similar to blocking in an experiment)
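
The points above can be illustrated with a small simulation. This is only a sketch on made-up data (the variable names `group`, `stratum`, and `y` and all effect sizes are assumptions, not part of the course data): it compares the residual sum of squares with and without the stratifying variable in the model.

```r
# Simulated (hypothetical) data: a two-level group and a three-level stratum
set.seed(215)
n <- 300
stratum <- factor(sample(c("low", "mid", "high"), n, replace = TRUE))
group   <- factor(sample(c("A", "B"), n, replace = TRUE))

# The response depends weakly on group and strongly on stratum (assumed effects)
stratum_shift <- unname(c(low = -2, mid = 0, high = 2)[as.character(stratum)])
y <- 10 + 0.5 * (group == "B") + stratum_shift + rnorm(n)

fit_group      <- lm(y ~ group)            # stratum ignored
fit_stratified <- lm(y ~ stratum + group)  # stratum accounted for

# Residual (unexplained) sum of squares for each model
rss_group      <- sum(resid(fit_group)^2)
rss_stratified <- sum(resid(fit_stratified)^2)
c(group_only = rss_group, stratified = rss_stratified)
```

Because the second model contains the first, its residual sum of squares cannot be larger; how much smaller it is depends on how strongly the stratifying variable is related to the response.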

Confounding Variables

  • Confounding variables are associated with both the dependent variable and the independent variable
  • Confounders can obscure the true relationship between the variables of interest
  • In experiments, randomization makes treatment groups similar, on average, in terms of potential confounding variables
  • This is not the case in observational studies
  • Often, the stratifying variable is a potential confounding variable
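
A minimal sketch of how a confounder can distort an estimated effect, using simulated data. Everything here is hypothetical: `z` plays the role of the confounder, `x` the explanatory variable, and the true effect of `x` on `y` is set to -0.5 by construction.

```r
set.seed(215)
n <- 2000
z <- rbinom(n, 1, 0.3)                       # confounder (hypothetical)
x <- rbinom(n, 1, ifelse(z == 1, 0.6, 0.2))  # x is more common when z = 1
y <- 5 - 0.5 * x - 2 * z + rnorm(n)          # true effect of x is -0.5

# Estimated effect of x without and with the confounder in the model
unadjusted <- unname(coef(lm(y ~ x))["x"])
adjusted   <- unname(coef(lm(y ~ z + x))["x"])
c(unadjusted = unadjusted, adjusted = adjusted)
```

In this setup the unadjusted estimate absorbs part of the effect of `z`, so it tends to overstate the effect of `x`; including `z` in the model brings the estimate back toward the value used in the simulation.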

Smoking and birth weight

  • Do infants whose mothers smoke have a different mean birth weight than infants whose mothers do not smoke?
  • births14 dataset
  • Random sample of 1,000 cases from US birth data set from 2014 (19 removed with missing values)
  • habit is smoking habit (“smoker” or “nonsmoker”)
  • weight is birth weight in pounds
  • premie indicates whether the birth is premature (premie) or full-term
  • First, we will include smoking habit as a single explanatory variable (no stratifying variable)
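  • Although it would be appropriate to compare means for smoking and non-smoking mothers using a t-test (see Ch 20 slides), an F-test gives the same result as the pooled (equal-variance) two-sample t-test
library(broom)  # provides tidy()
# births14 is assumed to be loaded already, with the 19 incomplete cases removed
lm(weight ~ habit, data = births14) |>
  anova() |>
  tidy()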
# A tibble: 2 × 6
  term         df  sumsq meansq statistic     p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>       <dbl>
1 habit         1   35.4  35.4       21.6  0.00000382
2 Residuals   979 1604.    1.64      NA   NA         
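
The equivalence between the two tests can be checked directly. The sketch below uses simulated data (the group sizes, means, and standard deviation are assumptions chosen only for illustration) and base R. Note that `t.test()` defaults to the Welch test, so `var.equal = TRUE` is needed for the match.

```r
set.seed(215)
# Hypothetical sample: 80 nonsmokers and 20 smokers
habit  <- factor(rep(c("nonsmoker", "smoker"), times = c(80, 20)))
weight <- rnorm(100, mean = ifelse(habit == "smoker", 6.7, 7.3), sd = 1.2)

tt  <- t.test(weight ~ habit, var.equal = TRUE)  # pooled two-sample t-test
aov_tab <- anova(lm(weight ~ habit))             # one-way ANOVA F-test

t_squared <- unname(tt$statistic)^2
f_stat    <- aov_tab[["F value"]][1]
c(t_squared = t_squared, F = f_stat)
c(p_t = tt$p.value, p_F = aov_tab[["Pr(>F)"]][1])
```

The same relationship appears in the births14 output: the t statistic for habitsmoker reported later in these slides is -4.65, and its square is approximately the F statistic of 21.6 shown above.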

premie as a Stratifying Variable

  • The length of the pregnancy is expected to explain some of the variation in birth weight
  • It also has the potential to be a confounding variable if there is an association between smoking and length of pregnancy
  • We will use premie as a stratifying variable in our analysis

Statistical Model with a Stratifying Variable

Statistical model for the \(k\)th observation in group \(i\) and stratum \(j\),

\[y_{ijk}=\mu+\alpha_i+\beta_j+\varepsilon_{ijk}\]

  • \(\mu\) is the overall mean
  • \(\alpha_i\) is the differential effect of group \(i\)
  • \(\beta_j\) is the differential effect of stratum \(j\)
  • \(\varepsilon_{ijk}\sim N(0,\sigma^2)\) is the noise
  • Note that subscript \(k\) implies that there are several observations for each group/stratum combination
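
To make the additive structure of the model concrete, here is a sketch that simulates from it and fits it with `lm()`. The values of \(\mu\), \(\alpha_i\), \(\beta_j\), and \(\sigma\) are made up, and the level names `g1`, `g2`, `s1`, `s2` are hypothetical.

```r
set.seed(215)
# 25 observations (index k) for each group/stratum combination
d <- expand.grid(rep = 1:25, group = c("g1", "g2"), stratum = c("s1", "s2"))

alpha <- c(g1 = 0.3, g2 = -0.3)  # assumed group effects
beta  <- c(s1 = 1.0, s2 = -1.0)  # assumed stratum effects
d$y <- 7 + unname(alpha[as.character(d$group)]) +
  unname(beta[as.character(d$stratum)]) + rnorm(nrow(d), sd = 1)

fit <- lm(y ~ stratum + group, data = d)

# Fitted means for the four group/stratum cells
cells <- expand.grid(group = c("g1", "g2"), stratum = c("s1", "s2"))
cells$fitted <- unname(predict(fit, newdata = cells))
cells

# Difference between groups within each stratum (s1, then s2)
diff_by_stratum <- with(cells, fitted[group == "g2"] - fitted[group == "g1"])
diff_by_stratum
```

Because the model has no interaction term, the fitted difference between groups is identical in every stratum and equals the `groupg2` coefficient. R's default coding measures effects relative to a reference level rather than the overall mean \(\mu\), but the fitted values are the same under either parameterization.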

ANOVA Table

  • \(SS_{premie}\) accounts for the effect of premie
  • \(SS_{habit}\) is the additional variation explained by habit after premie is already in the model: the sum of squares explained by both variables together minus \(SS_{premie}\)
lm(weight ~ premie + habit, data = births14) |>
  anova() |>
  tidy()
# A tibble: 3 × 6
  term         df  sumsq meansq statistic   p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>     <dbl>
1 premie        1  374.  374.       293.   1.18e-57
2 habit         1   15.4  15.4       12.0  5.42e- 4
3 Residuals   978 1250.    1.28      NA   NA       
  • Interestingly, the \(p\)-value for habit is higher when accounting for premie in the model (\(0.000542\) vs \(0.00000382\))
  • This is the opposite of what we would expect to see in an experiment, and is due to the association between premie and habit
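
The sequential definition of \(SS_{habit}\) can be verified numerically. The following is a sketch on simulated data that only mimics the structure of the example (the smoking rate, prematurity rates, and effect sizes are assumptions, and the simulated `weight`, `habit`, and `premie` are not the births14 variables).

```r
set.seed(215)
n <- 500
habit <- factor(ifelse(rbinom(n, 1, 0.15) == 1, "smoker", "nonsmoker"))
# Assumed association: prematurity is more likely among smokers
p_premie <- ifelse(habit == "smoker", 0.3, 0.1)
premie <- factor(ifelse(rbinom(n, 1, p_premie) == 1, "premie", "full term"))
weight <- 7.5 - 1.8 * (premie == "premie") - 0.4 * (habit == "smoker") + rnorm(n)

rss <- function(fit) sum(resid(fit)^2)  # residual sum of squares

fit_premie <- lm(weight ~ premie)
fit_both   <- lm(weight ~ premie + habit)

# Extra variation explained by habit once premie is in the model
ss_habit_from_rss <- rss(fit_premie) - rss(fit_both)
ss_habit_anova    <- anova(fit_both)["habit", "Sum Sq"]
c(from_rss = ss_habit_from_rss, from_anova = ss_habit_anova)
```

The two numbers agree because `anova()` on a single `lm` fit reports sequential sums of squares: each term is credited only with what it adds to the terms listed before it.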

We can also see the effects of this association if we compare the coefficients of the corresponding linear models.

lm(weight ~ habit, data = births14) |>
  tidy()
# A tibble: 2 × 5
  term        estimate std.error statistic    p.value
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)    7.27     0.0435    167.   0         
2 habitsmoker   -0.593    0.128      -4.65 0.00000382

The estimated effect of habit is smaller in magnitude when we account for premie:

lm(weight ~ premie + habit, data = births14) |>
  tidy()
# A tibble: 3 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     7.47     0.0403    185.   0       
2 premiepremie   -1.84     0.110     -16.7  5.13e-55
3 habitsmoker    -0.393    0.113      -3.47 5.42e- 4
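
As a rough worked check using only the rounded values printed above, the estimated smoking effect shrinks by about a third once premie is in the model:

\[\frac{0.593 - 0.393}{0.593} = \frac{0.200}{0.593} \approx 0.34\]

The coefficient table and the ANOVA table for weight ~ premie + habit also agree on the test for habit. Since habit is the last term in that model and has one degree of freedom, its \(F\) statistic is the square of its \(t\) statistic, and the \(p\)-value (\(0.000542\)) is the same in both tables:

\[(-3.47)^2 \approx 12.0\]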

Association between habit and premie

  • Due to the association between habit and premie, it is difficult to fully separate their effects on birth weight
  • E.g., smoking could affect whether the baby is born prematurely, which in turn could affect birth weight
  • \(SS_{habit}\) measures the variation in the response that is attributed to habit but does not include the variation that is jointly attributed to the two explanatory variables
  • If we reverse the order of the variables, we can also measure the variation that is attributed to premie without variation that is jointly attributed to the two variables
lm(weight ~ habit + premie, data = births14) |>
  anova() |>
  tidy()
# A tibble: 3 × 6
  term         df  sumsq meansq statistic   p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>     <dbl>
1 habit         1   35.4  35.4       27.7  1.75e- 7
2 premie        1  354.  354.       277.   5.13e-55
3 Residuals   978 1250.    1.28      NA   NA       
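
The bookkeeping behind the two orderings can be checked with the rounded sums of squares from the tables. The conditional notation \(SS_{habit \mid premie}\) (variation attributed to habit after premie) is introduced here only for this check. Both orderings explain the same total variation:

\[SS_{premie} + SS_{habit \mid premie} = 374 + 15.4 = 389.4\]

\[SS_{habit} + SS_{premie \mid habit} = 35.4 + 354 = 389.4\]

The variation that is jointly attributed to the two variables is credited to whichever variable enters the model first, and it is the same size from either point of view:

\[SS_{habit} - SS_{habit \mid premie} = 35.4 - 15.4 = 20.0, \qquad SS_{premie} - SS_{premie \mid habit} = 374 - 354 = 20\]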

ANCOVA

  • In the previous analysis we used premie as a stratifying variable
  • The dataset also includes the variable weeks, the length of the pregnancy in weeks
  • We can conduct an alternative analysis using weeks as a (numeric) covariate instead of stratifying by premie
  • Statistical model (parallel lines) for the \(j\)th observation in the \(i\)th group: \[y_{ij}=\mu + \alpha_i + \beta(X_{ij} - \bar{\bar{X}})+\varepsilon_{ij}\]
  • weeks as a covariate explains more of the variation in the response than premie (\(SS_{weeks} = 483\) vs. \(SS_{premie}=374\))
  • The resulting model explains more of the variability in the response than the model that included premie, so \(SSE\) is smaller
  • As a result, the \(F\) statistic for habit is larger (\(12.8\) vs. \(12.0\)), and the \(p\)-value is smaller (\(0.000368\) vs. \(0.000542\))
lm(weight ~ weeks + habit, data = births14) |>
  anova() |>
  tidy()
# A tibble: 3 × 6
  term         df  sumsq meansq statistic   p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>     <dbl>
1 weeks         1  483.  483.       414.   4.66e-77
2 habit         1   14.9  14.9       12.8  3.68e- 4
3 Residuals   978 1141.    1.17      NA   NA       
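
The model statement centers the covariate at its overall mean, while the R call above uses weeks as is. The sketch below, on simulated data with made-up parameter values, suggests why this makes no difference to the comparison of groups: centering only changes the intercept.

```r
set.seed(215)
n <- 400
habit <- factor(sample(c("nonsmoker", "smoker"), n, replace = TRUE,
                       prob = c(0.85, 0.15)))
weeks <- round(rnorm(n, mean = 38.7, sd = 2.5))
# Parallel lines: common slope for weeks, vertical shift for smokers (assumed values)
weight <- 7.3 + 0.3 * (weeks - 38.7) - 0.4 * (habit == "smoker") + rnorm(n)

fit_raw <- lm(weight ~ weeks + habit)

weeks_c <- weeks - mean(weeks)           # covariate centered at its mean
fit_centered <- lm(weight ~ weeks_c + habit)

rbind(raw = coef(fit_raw), centered = coef(fit_centered))
```

The slope and the habit effect are identical in the two fits, and so are the residuals. With R's default coding, the intercept of the centered fit is the fitted mean for the reference group (nonsmoker) at the average value of weeks, rather than the overall mean \(\mu\) in the model statement.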

Conclusions

  • Both analyses lead us to conclude that there is a significant association between smoking and birth weight when we also account for the length of pregnancies
  • We reach the same conclusion whether we treat the length of pregnancy as a categorical variable (premie) or a numeric variable (weeks)
  • Because this is an observational study, we cannot conclude that smoking causes a difference in birth weight
  • It is possible that there are other confounding variables that we have not accounted for in the model