The linear model that R fits to the data is \[\begin{array}{rcl}\widehat{Sodium} &=& -113 + 3.28\times Calories\\ && +11.3\times TypeMeat + 183\times TypePoultry\end{array}\]
By identifying terms in this model with the regression output, we can estimate the coefficients in the standard model
hotdog |> group_by(Type) |> summarize(n = n(), mean = mean(Calories), sd = sd(Calories))
# A tibble: 3 × 4
  Type        n  mean    sd
  <fct>   <int> <dbl> <dbl>
1 Beef       20  157.  22.6
2 Meat       17  159.  25.2
3 Poultry    17  119.  22.6
\[\bar{\bar{X}}= \frac{157+159+119}{3} = 145\]
Prediction model in standard form \[\widehat{Sodium} = 428+3.28\times (Calories-145) + \left\{\begin{array}{ll}-65 & \text{if } Beef\\
-53 & \text{if } Meat \\ 118 & \text{if } Poultry\end{array}\right.\]
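The step from the regression output to the standard form can be sketched as a short calculation. This is a minimal sketch that uses only the rounded coefficients and group means printed on the slides; the variable names are illustrative, and treating Beef as the reference level is inferred from the absence of a TypeBeef term in the fitted model.

```r
# Rounded coefficients read off the fitted model on the slide
b0      <- -113   # intercept (Beef assumed to be the reference level)
b       <- 3.28   # common slope for Calories
meat    <- 11.3   # TypeMeat offset
poultry <- 183    # TypePoultry offset

# Unweighted mean of the three group mean calories
xbar <- (157 + 159 + 119) / 3

# Height of each fitted line at Calories = xbar
height <- c(Beef = b0, Meat = b0 + meat, Poultry = b0 + poultry) + b * xbar

mu    <- mean(height)   # overall level in the standard form
alpha <- height - mu    # type effects; these sum to zero by construction

round(mu, 1)
round(alpha, 1)
```

Because the inputs are rounded, the results agree with the slide only approximately: the overall level comes out near 427 rather than 428, presumably because the slide was computed from unrounded coefficients.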
Sodium vs Calories, faceted by Type with fitted linear model
Hypotheses
We will test the hypotheses
\(H_0: \alpha_1=\alpha_2=\alpha_3=0\)
\(H_A:\) at least one \(\alpha_i\) is different from 0
However, this time our analysis (ANCOVA) will take into account the relationship between Sodium and Calories
ANOVA table
hdsc_lm |> anova() |> tidy()
# A tibble: 3 × 6
  term         df   sumsq  meansq statistic   p.value
  <chr>     <int>   <dbl>   <dbl>     <dbl>     <dbl>
1 Calories      1 106270. 106270.      34.7  3.28e- 7
2 Type          2 227386. 113693.      37.1  1.34e-10
3 Residuals    50 153331.   3067.      NA   NA
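The test statistic for Type can be reproduced from the mean squares in the table. A minimal sketch, using the rounded table values (the variable names are illustrative):

```r
# Values copied from the ANOVA table
ms_type  <- 113693   # mean square for Type (2 df)
ms_resid <- 3067     # residual mean square (50 df)

# F statistic: ratio of the Type mean square to the residual mean square
f_type <- ms_type / ms_resid

# Upper-tail probability under an F distribution with 2 and 50 df
p_type <- pf(f_type, df1 = 2, df2 = 50, lower.tail = FALSE)

f_type
p_type
```

The results should agree with the statistic and p.value columns up to the rounding of the inputs.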
A different conclusion
When we take the covariate (Calories) into account, we come to a different conclusion
We reject the null hypothesis: after adjusting for Calories, there is an association between Sodium and hot dog Type
The ANCOVA compared the intercepts of the three lines
We found that the vertical distance between the lines is significantly different from 0
Adjusting for Calories
We can adjust Sodium for Calories by subtracting \(b(X_{ij}-\bar{\bar{X}})\) from each \(y_{ij}\), where \(b=3.28\) is the estimate of the slope
After adjustment, Meat and Beef are indistinguishable, while Poultry differs from both
Sodium Adjusted for Calories vs Calories, faceted by Type with adjusted linear model
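The adjustment itself is a one-line calculation. The sketch below wraps it in a hypothetical helper, `adjust_sodium()`, using the slope estimate and reference calorie value from the slides; the example hot dog is illustrative and not a row quoted from the data.

```r
# Slope estimate and reference calorie value taken from the slides
b    <- 3.28
xbar <- 145

# Remove the part of Sodium that the common slope attributes to Calories
adjust_sodium <- function(sodium, calories) {
  sodium - b * (calories - xbar)
}

# Hypothetical hot dog: 186 calories and 495 units of sodium
adjust_sodium(sodium = 495, calories = 186)
```

A hot dog with exactly 145 calories is left unchanged; one above 145 calories has its sodium adjusted downward, and one below has it adjusted upward.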
Sequential sums of squares
The order of the factors is important
R computes sums of squares sequentially by default
First, the sum of squares for Calories is calculated (as a regression sum of squares) \[SS_{Calories}=\sum_{i=1}^n(\hat{y}_i-\bar{y})^2\]
\(\hat{y}\) is based on a model that does not account for hot dog type
\(SS_{Calories}\) without Type
lm(Sodium ~ Calories, data = hotdog) |> anova() |> tidy()
# A tibble: 2 × 6
  term         df   sumsq  meansq statistic   p.value
  <chr>     <int>   <dbl>   <dbl>     <dbl>     <dbl>
1 Calories      1 106270. 106270.      14.5  0.000369
2 Residuals    52 380718.   7321.      NA   NA
Compare to \(SS_{Calories}\) with Type
lm(Sodium ~ Calories + Type, data = hotdog) |> anova() |> tidy()
# A tibble: 3 × 6
  term         df   sumsq  meansq statistic   p.value
  <chr>     <int>   <dbl>   <dbl>     <dbl>     <dbl>
1 Calories      1 106270. 106270.      34.7  3.28e- 7
2 Type          2 227386. 113693.      37.1  1.34e-10
3 Residuals    50 153331.   3067.      NA   NA
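The sum of squares for Calories is the same in both tables, yet its F statistic changes because the residual mean square in the denominator changes. A minimal sketch with the rounded table values (variable names are illustrative; Calories has 1 df, so its mean square equals its sum of squares):

```r
ss_calories           <- 106270   # identical in both tables
ms_resid_without_type <- 7321     # Residuals row, Sodium ~ Calories
ms_resid_with_type    <- 3067     # Residuals row, Sodium ~ Calories + Type

# F statistic for Calories under each model
f_without_type <- ss_calories / ms_resid_without_type
f_with_type    <- ss_calories / ms_resid_with_type

f_without_type
f_with_type
```

Adding Type removes variation from the residuals, so the same Calories sum of squares is compared against a smaller error term.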
Next, the sum of squares for hot dog type is calculated, accounting for calories \[SS_{Type}=\left(\sum_{i=1}^a\sum_{j=1}^{n_i}(\hat{y}_{ij}-\bar{\bar{y}})^2\right)-SS_{Calories}\]
Here, the prediction \(\hat{y}_{ij}\) uses the full model: a different intercept for each type of hot dog (but same slope)
This is the sum of squares that is accounted for by the full model that is not accounted for by calories alone
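Equivalently, this extra sum of squares is the drop in residual sum of squares when Type is added to the Calories-only model. As a check with the residual sums of squares reported in the ANOVA tables of this section (the notation \(SS_{Res}\) is introduced here for illustration): \[SS_{Type}=SS_{Res}(\text{Calories})-SS_{Res}(\text{Calories}+\text{Type})=380718-153331=227387\]
This agrees with the tabulated 227386 up to rounding.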
Here is \(SS_{Type}\) without accounting for Calories
lm(Sodium ~ Type, data = hotdog) |> anova() |> tidy()
# A tibble: 2 × 6
  term         df   sumsq meansq statistic p.value
  <chr>     <int>   <dbl>  <dbl>     <dbl>   <dbl>
1 Type          2  31739. 15869.      1.78   0.179
2 Residuals    51 455249.  8926.     NA     NA
Compare to \(SS_{Type}\) accounting for Calories
lm(Sodium ~ Calories + Type, data = hotdog) |> anova() |> tidy()
# A tibble: 3 × 6
  term         df   sumsq  meansq statistic   p.value
  <chr>     <int>   <dbl>   <dbl>     <dbl>     <dbl>
1 Calories      1 106270. 106270.      34.7  3.28e- 7
2 Type          2 227386. 113693.      37.1  1.34e-10
3 Residuals    50 153331.   3067.      NA   NA
Sequential Sums of Squares: Order
Calories first and Type second
lm(Sodium ~ Calories + Type, data = hotdog) |> anova() |> tidy()
# A tibble: 3 × 6
  term         df   sumsq  meansq statistic   p.value
  <chr>     <int>   <dbl>   <dbl>     <dbl>     <dbl>
1 Calories      1 106270. 106270.      34.7  3.28e- 7
2 Type          2 227386. 113693.      37.1  1.34e-10
3 Residuals    50 153331.   3067.      NA   NA
Compare to Type first and Calories second
lm(Sodium ~ Type + Calories, data = hotdog) |> anova() |> tidy()
# A tibble: 3 × 6
  term         df   sumsq  meansq statistic   p.value
  <chr>     <int>   <dbl>   <dbl>     <dbl>     <dbl>
1 Type          2  31739.  15869.      5.17  9.07e- 3
2 Calories      1 301917. 301917.     98.5   2.09e-13
3 Residuals    50 153331.   3067.     NA    NA
If there is one factor of interest (Type), but we want to account for another variable (Calories), the factor of interest should enter the model last
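A quick check on the two orderings: the sums of squares are split differently between the terms, but the total explained by the model is the same. A minimal sketch using the rounded values from the two ANOVA tables (the vector names are illustrative):

```r
# Sequential sums of squares copied from the two ANOVA tables
calories_first <- c(Calories = 106270, Type = 227386)
type_first     <- c(Type = 31739, Calories = 301917)

# Both orders explain the same total; only the split between terms changes
sum(calories_first)
sum(type_first)
```

The residual row (50 df, sum of squares 153331) is also identical in both tables, since both calls fit the same model.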
Different Slopes
We can also consider a possible model with different slopes