ANCOVA

Analysis of Covariance
Math 215

Hot Dogs

  • Are hot dogs made with some types of meat healthier than others?
  • We will consider both calorie content and sodium levels for beef, poultry, and meat (mostly pork and beef) hot dogs
  • Based on an example from text by Heiberger and Holland (2015)
  • hotdog dataset from HH package
  • Hot dogs made from poultry are often lower in calories
  • Do manufacturers add sodium to enhance flavor to make up for the lower fat content?
  • We might start to approach this question by comparing sodium content between hot dog types using a standard one-way ANOVA

Sodium Levels

Ridge plot showing distributions of sodium content in different types of hot dogs.
Type n mean sd
Beef 20 401 102
Meat 17 419 93.9
Poultry 17 459 84.7
  aov(Sodium~Type, data=hotdog) |> tidy()
# A tibble: 2 × 6
  term         df   sumsq meansq statistic p.value
  <chr>     <dbl>   <dbl>  <dbl>     <dbl>   <dbl>
1 Type          2  31739. 15869.      1.78   0.179
2 Residuals    51 455249.  8926.     NA     NA    

One-Way ANOVA

Statistical model \[y_{ij}=\mu + \alpha_i + \varepsilon_{ij}\]

  • \(y_{ij}\) - observed sodium amount
  • \(\alpha_i\) - effect from \(i\)-th group
  • \(\varepsilon_{ij}\) - error terms (independent, normally distributed with mean 0 and common standard deviation \(\sigma\))

ANOVA table

hds_lm <- lm(Sodium ~ Type, data = hotdog)

hds_lm |>
  anova() |>
  tidy()
# A tibble: 2 × 6
  term         df   sumsq meansq statistic p.value
  <chr>     <int>   <dbl>  <dbl>     <dbl>   <dbl>
1 Type          2  31739. 15869.      1.78   0.179
2 Residuals    51 455249.  8926.     NA     NA    
  • Based on this analysis we do not have convincing evidence that at least one of the mean sodium levels is different
  • However, we have not accounted for the dependence of sodium on calorie content
  • Next we will conduct an analysis that accounts for this covariate

Analysis of Covariance (ANCOVA)

  • Statistical model for the \(j\)th observation in the \(i\)th group: \[y_{ij}=\mu + \alpha_i + \beta(X_{ij} - \bar{\bar{X}})+\varepsilon_{ij}\]
  • Like before, \(\mu\) is the overall population mean, \(\alpha_i\) is the differential effect of group \(i\), \(\beta\) is the slope
  • \(X_{ij}\) - observed amount of calories
  • \(\bar{\bar{X}}\) is the overall mean of the \(X_{ij}\)s
  • Each group has a different intercept (\(\mu+\alpha_i\))
  • All groups have a common slope \(\beta\)

Linear model

  • We can fit a linear model as we have in the past
  • One categorical predictor, one numeric predictor
hdsc_lm <- lm(Sodium ~ Calories + Type, data = hotdog)

hdsc_lm |>
  tidy()
# A tibble: 4 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -113.      53.3      -2.13  3.85e- 2
2 Calories        3.28     0.331     9.92  2.09e-13
3 TypeMeat       11.3     18.3       0.618 5.40e- 1
4 TypePoultry   183.      22.2       8.24  7.15e-11
# A tibble: 4 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -113.      53.3      -2.13  3.85e- 2
2 Calories        3.28     0.331     9.92  2.09e-13
3 TypeMeat       11.3     18.3       0.618 5.40e- 1
4 TypePoultry   183.      22.2       8.24  7.15e-11
  • The linear model that R fits to the data is \[\begin{array}{rcl}\widehat{Sodium} &=& -113 + 3.28\times Calories\\ && +11.3\times TypeMeat + 183\times TypePoultry\end{array}\]
  • By identifying terms in this model with the regression output, we can estimate the coefficients in the standard model
hotdog |>
    group_by(Type) |>
    summarize(n = n(), mean = mean(Calories), sd = sd(Calories))
# A tibble: 3 × 4
  Type        n  mean    sd
  <fct>   <int> <dbl> <dbl>
1 Beef       20  157.  22.6
2 Meat       17  159.  25.2
3 Poultry    17  119.  22.6

\[\bar{\bar{X}}= \frac{157+159+119}{3} = 145\]

Prediction model in standard form \[\widehat{Sodium} = 428+3.28\times (Calories-145) + \left\{\begin{array}{ll}-65 & \text{if } Beef\\ -53 & \text{if } Meat \\ 118 & \text{if } Poultry\end{array}\right.\]

Sodium vs Calories, faceted by Type with fitted linear model

Hypotheses

  • We will test the hypotheses
    • \(H_0: \alpha_1=\alpha_2=\alpha_3=0\)
    • \(H_A:\) at least one alpha is different
  • However, this time our analysis (ANCOVA) will take into account the relationship between Sodium and Calories

ANOVA table

hdsc_lm |>
  anova() |>
  tidy()
# A tibble: 3 × 6
  term         df   sumsq  meansq statistic   p.value
  <chr>     <int>   <dbl>   <dbl>     <dbl>     <dbl>
1 Calories      1 106270. 106270.      34.7  3.28e- 7
2 Type          2 227386. 113693.      37.1  1.34e-10
3 Residuals    50 153331.   3067.      NA   NA       

A different conclusion

  • When we take into the covariate (Calories) into account, we come to a different conclusion
  • We reject the null hypothesis. There is an association between sodium and hotdog type
  • The ANCOVA compared the intercepts of the three lines
  • We found that the vertical distance between the lines is significantly different from 0

Adjusting for Calories

  • We can adjust Sodium for Calories by subtracting \(b(X_{ij}-\bar{\bar{X}})\) from each \(y_{ij}\), where \(b=3.28\) is the estimate of the slope

  • Meat and Beef are indistinguishable and that Poultry differs from both Meat and Beef

Sodium Adjusted for Calories vs Calories, faceted by Type with adjusted linear model

Sequential sums of squares

  • The order of the factors is important
  • R computes sums of square sequentially by default
  • First, the sums of squares for Calories is calculated (as a regression sum of squares) \[SS_{Calories}=\sum_{i=1}^n(\hat{y}_i-\bar{y})^2\]
  • \(\hat{y}\) is based on a model that does not account for hot dog type

\(SS_{Calories}\) without Type

lm(Sodium ~ Calories, data = hotdog) |>
  anova() |>
  tidy()
# A tibble: 2 × 6
  term         df   sumsq  meansq statistic   p.value
  <chr>     <int>   <dbl>   <dbl>     <dbl>     <dbl>
1 Calories      1 106270. 106270.      14.5  0.000369
2 Residuals    52 380718.   7321.      NA   NA       

Compare to \(SS_{Calories}\) with Type

lm(Sodium ~ Calories + Type, data = hotdog) |>
  anova() |>
  tidy()
# A tibble: 3 × 6
  term         df   sumsq  meansq statistic   p.value
  <chr>     <int>   <dbl>   <dbl>     <dbl>     <dbl>
1 Calories      1 106270. 106270.      34.7  3.28e- 7
2 Type          2 227386. 113693.      37.1  1.34e-10
3 Residuals    50 153331.   3067.      NA   NA       
  • Next, the sums of squares for hot dog type is calculated, accounting for calories \[SS_{Type}=\left(\sum_{i=1}^a\sum_{j=1}^{n_i}(\hat{y}_{ij}-\bar{\bar{y}})^2\right)-SS_{Calories}\]
  • Here, the prediction \(\hat{y}_{ij}\) uses the full model: a different intercept for each type of hot dog (but same slope)
  • This is the sum of squares that is accounted for by the full model that is not accounted for by calories alone

Here \(SS_{Type}\) without accounting for Calories

lm(Sodium ~ Type, data = hotdog) |>
  anova() |>
  tidy()
# A tibble: 2 × 6
  term         df   sumsq meansq statistic p.value
  <chr>     <int>   <dbl>  <dbl>     <dbl>   <dbl>
1 Type          2  31739. 15869.      1.78   0.179
2 Residuals    51 455249.  8926.     NA     NA    

Compare \(SS_{Type}\) accounting for Calories

lm(Sodium ~ Calories + Type, data = hotdog) |>
  anova() |>
  tidy()
# A tibble: 3 × 6
  term         df   sumsq  meansq statistic   p.value
  <chr>     <int>   <dbl>   <dbl>     <dbl>     <dbl>
1 Calories      1 106270. 106270.      34.7  3.28e- 7
2 Type          2 227386. 113693.      37.1  1.34e-10
3 Residuals    50 153331.   3067.      NA   NA       

Sequential Sums of Squares: Order

Calories first and Type second

lm(Sodium ~ Calories + Type, data = hotdog) |>
  anova() |>
  tidy()
# A tibble: 3 × 6
  term         df   sumsq  meansq statistic   p.value
  <chr>     <int>   <dbl>   <dbl>     <dbl>     <dbl>
1 Calories      1 106270. 106270.      34.7  3.28e- 7
2 Type          2 227386. 113693.      37.1  1.34e-10
3 Residuals    50 153331.   3067.      NA   NA       

Compare to Type first and Calories second

lm(Sodium ~ Type + Calories, data = hotdog) |>
  anova() |>
  tidy()
# A tibble: 3 × 6
  term         df   sumsq  meansq statistic   p.value
  <chr>     <int>   <dbl>   <dbl>     <dbl>     <dbl>
1 Type          2  31739.  15869.      5.17  9.07e- 3
2 Calories      1 301917. 301917.     98.5   2.09e-13
3 Residuals    50 153331.   3067.     NA    NA       

If there is one factor of interest (Type), but we want to account for another variable (Calories), the factor of interest should enter the model last

Different Slopes

  • We can also consider a possible model with different slopes

\[y_{ij}=\mu + \alpha_i + \beta_i(X_{ij} - \bar{\bar{X}})+\varepsilon_{ij}\]

lm(Sodium ~ Type*Calories, data = hotdog) |>
  tidy()
# A tibble: 6 × 5
  term                 estimate std.error statistic       p.value
  <chr>                   <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)          -228.       87.6       -2.61 0.0121       
2 TypeMeat              137.      123.         1.11 0.272        
3 TypePoultry           392.      114.         3.44 0.00122      
4 Calories                4.01      0.553      7.26 0.00000000296
5 TypeMeat:Calories      -0.802     0.773     -1.04 0.305        
6 TypePoultry:Calories   -1.53      0.820     -1.86 0.0687       

Sodium vs Calories, faceted by Type different slopes