Stratification in Observational Studies

Controlling Variation in Observational studies
Math 219

Reducing Unexplained Variability

The primary role of blocking in experiments is to reduce the unexplained variability in the response
Stratifying plays a similar role in observational studies
Groups of observational units are formed with similar values of the stratifying variable
Stratifying variable is accounted for as a source of variability in the analysis (similar to blocking in experiment)

Confounding Variables

Confounding variables are associated with both the dependent variable and the independent variable
Confounders can obscure the true relationship between the variables of interest
In experiments, randomization ensures that treatment groups are similar in terms of confounding variables
This is not the case in observational studies
Often, the stratifying variable is a potential confounding variable
Stratification adjusts for these confounding variables by dividing the treatment and control groups into smaller more homogeneous subgroups, where the confounding factor is the same

Smoking and birth weight

Do infants whose mothers smoke have a different mean birth weight than infants whose mothers do not smoke?
births14 ¹ dataset
Random sample of 1,000 cases from US birth data set from 2014 (19 removed with missing values)
habit is smoking habit (“smoker” or “nonsmoker”)
weight is birth weight in pounds
premie birth is premature (premie) or full-term

glimpse(births14)

Rows: 981
Columns: 13
$ fage           <int> 34, 36, 37, NA, 32, 32, 37, 29, 30, 29, 30, 34, 28, 28,…
$ mage           <dbl> 34, 31, 36, 16, 31, 26, 36, 24, 32, 26, 34, 27, 22, 31,…
$ mature         <chr> "younger mom", "younger mom", "mature mom", "younger mo…
$ weeks          <dbl> 37, 41, 37, 38, 36, 39, 36, 40, 39, 39, 42, 40, 40, 39,…
$ premie         <chr> "full term", "full term", "full term", "full term", "pr…
$ visits         <dbl> 14, 12, 10, NA, 12, 14, 10, 13, 15, 11, 14, 16, 20, 15,…
$ gained         <dbl> 28, 41, 28, 29, 48, 45, 20, 65, 25, 22, 40, 30, 31, NA,…
$ weight         <dbl> 6.96, 8.86, 7.51, 6.19, 6.75, 6.69, 6.13, 6.74, 8.94, 9…
$ lowbirthweight <chr> "not low", "not low", "not low", "not low", "not low", …
$ sex            <chr> "male", "female", "female", "male", "female", "female",…
$ habit          <chr> "nonsmoker", "nonsmoker", "nonsmoker", "nonsmoker", "no…
$ marital        <chr> "married", "married", "married", "not married", "marrie…
$ whitemom       <chr> "white", "white", "not white", "white", "white", "white…

First, we will include smoking habit as a single explanatory variable (no stratifying variable)
Although it would also be appropriate to compare means for smoking and non-smoking mothers using a t-test (see Ch 20 slides), an F-test gives us the same results

lm(weight ~ habit, data = births14) |>
  anova() |>
  tidy()

# A tibble: 2 × 6
  term         df  sumsq meansq statistic     p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>       <dbl>
1 habit         1   35.4  35.4       21.6  0.00000382
2 Residuals   979 1604.    1.64      NA   NA

`premie` as a Stratifying Variable

The length of the pregnancy is expected to explain some of the variation in birth weight
It also has the potential to be a confounding variable if there is an association between smoking and length of pregnancy
We will use premie as a stratifying variable in our analysis

Statistical Model with a Stratifying Variable

Statistical model for the \(k\)th observation in group \(i\) and stratum \(j\),

\[y_{ijk}=\mu+\alpha_i+\beta_j+\varepsilon_{ijk}\]

\(\mu\) is the overall mean
\(\alpha_i\) is the differential effect of group \(i\)
\(\beta_j\) is the differential effect of stratum \(j\)
\(\varepsilon_{ijk}\sim N(0,\sigma^2)\) is the noise
Note that subscript \(k\) implies that there are several observations for each group/stratum combination

ANOVA Table

\(SS_{premie}\) accounts for the effect of premie
\(SS_{habit}\) accounts for both variables, but then \(SS_{premie}\) is subtracted off

lm(weight ~ premie + habit, data = births14) |>
  anova() |>
  tidy()

# A tibble: 3 × 6
  term         df  sumsq meansq statistic   p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>     <dbl>
1 premie        1  374.  374.       293.   1.18e-57
2 habit         1   15.4  15.4       12.0  5.42e- 4
3 Residuals   978 1250.    1.28      NA   NA

Interestingly, \(p\)-value for habit is higher when accounting for premie in the model (\(0.000542\) vs \(0.00000382\))
This is the opposite of what we would expect to see in an experiment, and is due to association between premie and habit

habit only

lm(weight ~ habit, data = births14) |>
  anova() |>
  tidy()

# A tibble: 2 × 6
  term         df  sumsq meansq statistic     p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>       <dbl>
1 habit         1   35.4  35.4       21.6  0.00000382
2 Residuals   979 1604.    1.64      NA   NA

habit and premie

lm(weight ~ premie + habit, data = births14) |>
  anova() |>
  tidy()

# A tibble: 3 × 6
  term         df  sumsq meansq statistic   p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>     <dbl>
1 premie        1  374.  374.       293.   1.18e-57
2 habit         1   15.4  15.4       12.0  5.42e- 4
3 Residuals   978 1250.    1.28      NA   NA

Recall the results from the strawberries study

Storage only

lm(Firmness ~ Storage, data = strawberries) |>
  anova() |>
  tidy()

# A tibble: 2 × 6
  term         df sumsq meansq statistic p.value
  <chr>     <int> <dbl>  <dbl>     <dbl>   <dbl>
1 Storage       2  11.2   5.59     0.995   0.398
2 Residuals    12  67.4   5.62    NA      NA

Storage and Variety

lm(Firmness ~ Variety+Storage, data = strawberries) |>
  anova() |>
  tidy()

# A tibble: 3 × 6
  term         df sumsq meansq statistic    p.value
  <chr>     <int> <dbl>  <dbl>     <dbl>      <dbl>
1 Variety       4 63.5  15.9        32.4  0.0000547
2 Storage       2 11.2   5.59       11.4  0.00455  
3 Residuals     8  3.93  0.491      NA   NA

Introduction of Variety lowered \(SSE\) but did not change \(SS_{Storage}\) resulting in smaller p-value for Storage

We can also see the effects of this association if we compare the coefficient of the corresponding linear models.

lm(weight ~ habit, data = births14) |>
  tidy()

# A tibble: 2 × 5
  term        estimate std.error statistic    p.value
  <chr>          <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)    7.27     0.0435    167.   0         
2 habitsmoker   -0.593    0.128      -4.65 0.00000382

habit has a smaller effect when we account for premie

lm(weight ~ premie + habit, data = births14) |>
  tidy()

# A tibble: 3 × 5
  term         estimate std.error statistic  p.value
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     7.47     0.0403    185.   0       
2 premiepremie   -1.84     0.110     -16.7  5.13e-55
3 habitsmoker    -0.393    0.113      -3.47 5.42e- 4

Association between `habit` and `premie`

Due to the association between habit and premie it is difficult to fully separate their effects on birth weight
E.g., whether the mother smokes or not could affect whether the baby is a premie or not which could impact weight
\(SS_{habit}\) measures the variation in the response that is attributed to habit but does not include the variation that is jointly attributed to the two explanatory variables

If we reverse the order of the variables, we can also measure the variation that is attributed to premie without variation that is jointly attributed to the two variables

habit+premie

lm(weight ~ habit+premie, data = births14) |>
  anova() |>
  tidy()

# A tibble: 3 × 6
  term         df  sumsq meansq statistic   p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>     <dbl>
1 habit         1   35.4  35.4       27.7  1.75e- 7
2 premie        1  354.  354.       277.   5.13e-55
3 Residuals   978 1250.    1.28      NA   NA

premie+ habit

lm(weight ~ premie+habit, data = births14) |>
  anova() |>
  tidy()

# A tibble: 3 × 6
  term         df  sumsq meansq statistic   p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>     <dbl>
1 premie        1  374.  374.       293.   1.18e-57
2 habit         1   15.4  15.4       12.0  5.42e- 4
3 Residuals   978 1250.    1.28      NA   NA

ANCOVA

In the previous analysis we used premie as a stratifying variable
The dataset also includes the variabe weeks, the length of the pregnancy in weeks
We can conduct an alternative analysis using weeks as a (numeric) covariate instead of stratifying by premie
Statistical model (parallel lines) for the \(j\)th observation in the \(i\)th group: \[y_{ij}=\mu + \alpha_i + \beta(X_{ij} - \bar{\bar{X}})+\varepsilon_{ij}\]

weeks as a covariate explains more of the variation in the response than premie (\(SS_{weeks} = 483\) vs. \(SS_{premie}=374\))
The resulting model explains more of the variability in the response than the model that included premie, so \(SSE\) is smaller
As a result, the \(F\) statistic is larger, and the \(p\)-value is smaller

lm(weight ~ weeks + habit, data = births14) |>
  anova() |>
  tidy()

# A tibble: 3 × 6
  term         df  sumsq meansq statistic   p.value
  <chr>     <int>  <dbl>  <dbl>     <dbl>     <dbl>
1 weeks         1  483.  483.       414.   4.66e-77
2 habit         1   14.9  14.9       12.8  3.68e- 4
3 Residuals   978 1141.    1.17      NA   NA

Conclusions

Both analyses lead us to conclude that there is a significant association between smoking and birth weight when we also account for the length of pregnancies
We reach the same conclusion whether we treat the length of pregnancy as a categorical variable (premie) or a numeric variable (weeks)
Because this is an observational study, we cannot conclude that smoking causes a difference in birth weight
It is possible that there are other confounding variables that we have not accounted for in the model

Lung capacity and smoking in youth

Data set lungcap from GLMsdata package
The data give information on the health and smoking habits of a sample of 654 youths, aged 3 to 19, in the area of East Boston during middle to late 1970s.
Variables:
- FEV - the forced expiratory volume in litres, a measure of lung capacity; a numeric vector
- Gender the gender of the subjects: a vector with females coded as F and males as M
- Smoker the smoking status of the subject: a vector with smokers coded as Yes and non-smokers as No

Research question

One primary question of interest is whether smokers suffer reduced pulmonary function.

glimpse(lungcap)

Rows: 654
Columns: 6
$ Age    <int> 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
$ FEV    <dbl> 1.072, 0.839, 1.102, 1.389, 1.577, 1.418, 1.569, 1.196, 1.400, …
$ Ht     <dbl> 46.0, 48.0, 48.0, 48.0, 49.0, 49.0, 50.0, 46.5, 49.0, 49.0, 50.…
$ Gender <fct> F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, …
$ Smoke  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Smoker <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No…

Again, we will include smoking habit as a single explanatory variable (no stratifying variable)

lm(FEV~Smoker, data=lungcap) |> 
  anova() |> 
     tidy()

# A tibble: 2 × 6
  term         df sumsq meansq statistic   p.value
  <chr>     <int> <dbl>  <dbl>     <dbl>     <dbl>
1 Smoker        1  29.6 29.6        41.8  1.99e-10
2 Residuals   652 461.   0.708      NA   NA

We now consider variable Gender

lm(FEV~Gender, data=lungcap) |> 
  anova() |> 
     tidy()

# A tibble: 2 × 6
  term         df sumsq meansq statistic       p.value
  <chr>     <int> <dbl>  <dbl>     <dbl>         <dbl>
1 Gender        1  21.3 21.3        29.6  0.0000000750
2 Residuals   652 470.   0.720      NA   NA

Variable Gender is a significant factor in determining lung capacity

Now both

lm(FEV~Gender+Smoker, data=lungcap) |>
   anova() |> 
     tidy()

# A tibble: 3 × 6
  term         df sumsq meansq statistic   p.value
  <chr>     <int> <dbl>  <dbl>     <dbl>     <dbl>
1 Gender        1  21.3 21.3        31.8  2.49e- 8
2 Smoker        1  33.7 33.7        50.3  3.45e-12
3 Residuals   651 436.   0.670      NA   NA

Association between `Gender` and `Smoker`

Due to the association between Gender and Smoker it is difficult to fully separate their effects on lung capacity

ANCOVA

In the previous analysis we used Gender as a stratifying variable
The dataset also includes the variabe Age, the age of the youths
We can conduct an alternative analysis using Age as a numeric covariate

lm4 <- lm(FEV~Age+Smoker, data=lungcap)
tidy(lm4)

# A tibble: 3 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    0.367   0.0814       4.51 7.65e-  6
2 Age            0.231   0.00818     28.2  8.28e-115
3 SmokerYes     -0.209   0.0807      -2.59 9.86e-  3

anova(lm4) |> tidy()

# A tibble: 3 × 6
  term         df  sumsq  meansq statistic    p.value
  <chr>     <int>  <dbl>   <dbl>     <dbl>      <dbl>
1 Age           1 281.   281.       880.    5.54e-123
2 Smoker        1   2.14   2.14       6.70  9.86e-  3
3 Residuals   651 208.     0.319     NA    NA

Both analyses lead us to conclude that there is a significant association between lung capacity and smoking habit when we also account for the age of subjects
We reach the same conclusion whether we stratify on variable (Gender) or use a numeric covariate (Age)
The lung capacity is positively associated with Age and negatively associated with Smoker variable
It is possible that there are other confounding variables that we have not accounted for in the model
It is also possible to consider interaction terms in the model

Stratification in Observational Studies

Reducing Unexplained Variability

Confounding Variables

Smoking and birth weight

premie as a Stratifying Variable

Statistical Model with a Stratifying Variable

ANOVA Table

Association between habit and premie

ANCOVA

Conclusions

Lung capacity and smoking in youth

Research question

Association between Gender and Smoker

ANCOVA

`premie` as a Stratifying Variable

Association between `habit` and `premie`

Association between `Gender` and `Smoker`