We will review one-way ANOVA (introduced in Ch 22)
In the process we will introduce some new notation and terminology and dig a bit deeper into the theory
We will also discuss the connection between ANOVA and linear models
One-Way ANOVA Example
Do Wordsum test scores vary between social classes?
Social classes: LOWER, MIDDLE, UPPER, WORKING
Ridge plot showing distribution of word scores for each self-identified social class
Statistical Model
We use the following statistical model for the \(j\)th observation from the \(i\)th group \[y_{ij}=\mu + \alpha_i + \varepsilon_{ij}\]
\(y_{ij}\) is the value of the response (e.g., wordsum)
\(\mu\) is the overall population mean
\(\alpha_i\) is the differential effect of group \(i\) in the population (\(i = 1,2,\ldots,a\))
Notes:
\(\sum_{i=1}^a\alpha_i=0\), where \(a\) is the number of groups
The population group means are \(\mu_i=\mu+\alpha_i\)
\(\varepsilon_{ij}\) represent error/noise, and are assumed to be independent, normally distributed with mean 0 and common standard deviation \(\sigma\)
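To make the model concrete, here is a minimal sketch that simulates data from it in base R. The values of mu, alpha, sigma and the group sizes are hypothetical, chosen only for illustration; they are not estimates from the Wordsum data.

```r
# Hypothetical parameter values, for illustration only
set.seed(1)
mu    <- 6                       # overall population mean
alpha <- c(-1, 0.5, 1.5, -1)     # differential group effects; must sum to zero
sigma <- 2                       # common error standard deviation
n_i   <- c(5, 8, 6, 7)           # group sizes (allowed to be unequal)

# Check the sum-to-zero constraint on the group effects
stopifnot(isTRUE(all.equal(sum(alpha), 0)))

group <- factor(rep(seq_along(alpha), times = n_i))
eps   <- rnorm(sum(n_i), mean = 0, sd = sigma)   # independent normal errors
y     <- mu + alpha[as.integer(group)] + eps     # y_ij = mu + alpha_i + eps_ij

mu_i <- mu + alpha               # population group means
```

Each simulated response is the overall mean, plus its group's effect, plus noise with the same standard deviation in every group.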
Hypothesis Test
We compare the \(a\) group means by testing the following hypotheses:
\(H_0: \alpha_1=\alpha_2=\cdots=\alpha_a=0\)
\(H_A:\) at least one \(\alpha_i \neq 0\)
This is equivalent to our previous formulation of the hypothesis test:
\(H_0: \mu_1=\mu_2=\cdots=\mu_a\)
\(H_A:\) at least one \(\mu_i\) is different
Point Estimates
The group sample means are \[\bar{y}_i=\frac{y_{i1}+y_{i2}+\cdots+y_{in_i}}{n_i}\]
\(\mu\) is estimated using the grand mean (mean of means): \[\bar{\bar{y}}=\frac{\bar{y}_1+\bar{y}_2+\cdots+\bar{y}_a}{a}\]
\(\alpha_i\) is estimated by \(\bar{y}_i-\bar{\bar{y}}\)
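The point estimates above can be sketched in base R as follows. The data frame scores is a hypothetical toy data set with three groups of unequal size, not the Wordsum data.

```r
# Hypothetical toy data: three groups of unequal size
scores <- data.frame(
  group = factor(c("A", "A", "A", "B", "B", "C", "C", "C", "C")),
  y     = c(4, 6, 5, 7, 9, 2, 3, 4, 3)
)

group_means <- tapply(scores$y, scores$group, mean)  # group sample means
grand_mean  <- mean(group_means)                     # mean of the a group means
alpha_hat   <- group_means - grand_mean              # estimated group effects
```

By construction the estimated effects sum to zero. Note that with unequal group sizes the mean of means (here 16/3) generally differs from the mean of all observations, mean(scores$y) (here 43/9).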
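The mean square error \(MSE=SSE/(n-a)\), where \(SSE=\sum_{i=1}^a\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)^2\) and \(n\) is the total number of observations, is an unbiased estimate of \(\sigma^2\)
Furthermore, \(SSE/\sigma^2\) follows a chi-squared distribution with \(n-a\) degrees of freedom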
Both of these properties hold whether \(H_0\) is true or not
F Statistic
The ratio of two independent chi-squared distributed statistics, each divided by its degrees of freedom, follows an \(F\) distribution
If \(H_0\) is true, \[F=\frac{MSG}{MSE}=\frac{SSG/df_G}{SSE/df_E}\] follows an \(F\) distribution with \(df_G=a-1\) and \(df_E=n-a\) degrees of freedom
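The link between chi-squared statistics and the \(F\) distribution can be checked by simulation. This is a rough sketch: the degrees of freedom (3 and 791), the seed and the number of draws are illustrative choices, and the result is only approximate because it depends on random draws.

```r
set.seed(42)
df1   <- 3        # illustrative numerator degrees of freedom
df2   <- 791      # illustrative denominator degrees of freedom
n_sim <- 1e5      # number of simulated ratios

num   <- rchisq(n_sim, df = df1) / df1   # chi-squared divided by its df
den   <- rchisq(n_sim, df = df2) / df2   # independent chi-squared divided by its df
ratio <- num / den

crit       <- qf(0.95, df1, df2)         # theoretical 95th percentile of F(df1, df2)
tail_share <- mean(ratio > crit)         # should be near 0.05
```

If the simulated ratios follow the \(F\) distribution, about 5% of them should exceed the theoretical 95th percentile.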
If \(H_0\) is true
\(MSG\) and \(MSE\) are both unbiased estimates of \(\sigma^2\)
We expect \(F\) to be close to 1
If \(H_A\) is true
\(MSE\) is an unbiased estimate of \(\sigma^2\)
\(MSG\) tends to overestimate \(\sigma^2\): \(E(MSG) > \sigma^2\)
We expect \(F > 1\)
Larger values of \(F\) provide more convincing evidence against \(H_0\)
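As a sketch of how the \(F\) statistic is assembled, the code below computes it by hand on a hypothetical balanced toy data set and compares it with R's anova() output. The formulas for SSG and SSE used here are the standard ones and are an assumption, since this section does not spell them out. The design is balanced so that the mean of means and the mean of all observations coincide.

```r
# Hypothetical balanced toy data: 3 groups, 4 observations each
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 4)),
  y     = c(4, 6, 5, 5,  7, 9, 8, 8,  2, 3, 4, 3)
)

a <- nlevels(toy$group)                       # number of groups
n <- nrow(toy)                                # total number of observations
group_means  <- tapply(toy$y, toy$group, mean)
n_i          <- tapply(toy$y, toy$group, length)
overall_mean <- mean(toy$y)                   # equals the mean of means here (balanced)

SSG <- sum(n_i * (group_means - overall_mean)^2)               # between-group variability
SSE <- sum((toy$y - group_means[as.character(toy$group)])^2)   # within-group variability
MSG <- SSG / (a - 1)
MSE <- SSE / (n - a)
F_manual <- MSG / MSE

# Same statistic from the built-in ANOVA table
F_builtin <- anova(lm(y ~ group, data = toy))[["F value"]][1]
```

For this toy data SSE is 6 and the \(F\) statistic is 38, far above 1, which is what we expect when the group means differ by much more than the within-group noise.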
P-value
# A tibble: 2 × 6
  term         df sumsq meansq statistic   p.value
  <chr>     <int> <dbl>  <dbl>     <dbl>     <dbl>
1 class         3  237.  78.9       21.7  1.56e-13
2 Residuals   791 2870.   3.63      NA   NA
The p-value is the area under the density curve for the \(F\) distribution beyond the observed value of \(F\)
1 - pf(21.73467, df1 = 3, df2 = 791)
[1] 1.560974e-13
We reject the null hypothesis. There is convincing evidence that at least one social class has a different mean Wordsum score.
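A small variation on the p-value calculation: pf() can return the upper-tail area directly through its lower.tail argument, which avoids subtracting a number very close to 1. The sketch below uses the observed \(F\) value and degrees of freedom from the ANOVA table above.

```r
F_obs <- 21.73467   # observed F statistic from the ANOVA table

# Complement of the lower-tail area
p_complement <- 1 - pf(F_obs, df1 = 3, df2 = 791)

# Upper-tail area requested directly
p_upper <- pf(F_obs, df1 = 3, df2 = 791, lower.tail = FALSE)
```

Both approaches give a p-value of about 1.56e-13 here; the direct upper-tail form is generally the safer habit when p-values are extremely small.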
One-Way ANOVA with Numeric Independent Variable
Is there a linear association between weight and height in physically active individuals?
Scatter plot of weight vs. height with line of best fit.
Statistical Model
We use the following statistical model for the \(i\)th observation \[y_{i}=\beta_0 + \beta_1x_i + \varepsilon_{i}\]
\(y_{i}\) is the value of the response (e.g., wgt)
\(\beta_0\) is the population intercept
\(\beta_1\) is the population slope
\(\varepsilon_{i}\) represent error/noise, and are assumed to be independent, normally distributed with mean 0 and constant standard deviation \(\sigma\)
# A tibble: 2 × 6
  term         df  sumsq  meansq statistic   p.value
  <chr>     <int>  <dbl>   <dbl>     <dbl>     <dbl>
1 hgt           1 46370. 46370.       535.  2.83e-81
2 Residuals   505 43753.    86.6       NA  NA
ANOVA table key
term                 df              sumsq     meansq              statistic
numeric predictor    \(df_R=1\)      \(SSR\)   \(MSR=SSR/df_R\)    \(F=MSR/MSE\)
Residuals (error)    \(df_E=n-2\)    \(SSE\)   \(MSE=SSE/df_E\)
Regression Sum of Squares
Regression sum of squares \[SSR=\sum_{i=1}^n(\hat{y}_i-\bar{y})^2\]
The regression mean square \[MSR=\frac{SSR}{df_R}=\frac{SSR}{1}\] is an unbiased estimate of \(\sigma^2\) if \(H_0: \beta_1=0\) is true, and tends to overestimate \(\sigma^2\) if \(H_A: \beta_1\neq 0\) is true
Sum of Squared Errors
Sum of squared errors \[SSE=\sum_{i=1}^n(y_i-\hat{y}_i)^2\]
The mean square error \[MSE=\frac{SSE}{df_E}=\frac{SSE}{n-2}\] is an unbiased estimate of \(\sigma^2\) whether \(H_0\) is true or not
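The two sums of squares can be computed directly from a fitted line. The sketch below uses simulated data (x and y are hypothetical, not the weight and height data) and checks the hand calculations against R's anova() table.

```r
set.seed(7)
# Hypothetical simulated data, for illustration only
n <- 40
x <- runif(n, 150, 195)
y <- -100 + 1 * x + rnorm(n, sd = 9)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)                 # fitted values from the least squares line

SSR <- sum((y_hat - mean(y))^2)      # regression (explained) sum of squares
SSE <- sum((y - y_hat)^2)            # sum of squared errors (unexplained)
SST <- sum((y - mean(y))^2)          # total sum of squares

MSR <- SSR / 1                       # df_R = 1
MSE <- SSE / (n - 2)                 # df_E = n - 2

tab <- anova(fit)                    # built-in ANOVA table for comparison
```

For a least squares fit with an intercept, SSR and SSE add up to the total sum of squares, and both match the sums of squares reported by anova(fit).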
F Statistic
If \(H_0\) is true, \[F=\frac{MSR}{MSE}=\frac{SSR/df_R}{SSE/df_E}\] follows an \(F\) distribution with \(df_R=1\) and \(df_E=n-2\) degrees of freedom
If \(H_0\) is true
\(MSR\) and \(MSE\) are both unbiased estimates of \(\sigma^2\)
We expect \(F\) to be close to 1
If \(H_A\) is true
\(MSE\) is an unbiased estimate of \(\sigma^2\)
\(MSR\) tends to overestimate \(\sigma^2\): \(E(MSR) > \sigma^2\)
We expect \(F > 1\)
Larger values of \(F\) provide more convincing evidence against \(H_0\)
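As a quick arithmetic check, the \(F\) statistic in the weight and height ANOVA table can be reproduced from its sums of squares and degrees of freedom. The inputs below are the rounded values printed in that table, so the results are approximate.

```r
# Rounded values as printed in the ANOVA table for weight vs. height
SSR  <- 46370
df_R <- 1
SSE  <- 43753
df_E <- 505

MSR    <- SSR / df_R     # regression mean square
MSE    <- SSE / df_E     # mean square error, about 86.6
F_stat <- MSR / MSE      # about 535
```

An \(F\) statistic this far above 1 indicates that the regression mean square is much larger than we would expect if the slope were zero.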
Variability: Explained vs Unexplained
Whether the independent variable is categorical or numeric, an ANOVA compares explained variability to unexplained variability
Explained variability is the variability captured by the model
Unexplained variability is the variability that is not described by the model
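One way to see this comparison in numbers is to split the total sum of squares into its two parts. The sketch below uses the rounded sums of squares printed in the weight and height ANOVA table, so the share is approximate.

```r
# Rounded sums of squares from the weight vs. height ANOVA table
SS_explained   <- 46370                         # captured by the model (hgt row)
SS_unexplained <- 43753                         # left in the residuals
SS_total       <- SS_explained + SS_unexplained

share_explained <- SS_explained / SS_total      # roughly 0.51
```

Under this split, roughly half of the variability in weight is captured by the linear relationship with height, and the rest is unexplained.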
# A tibble: 2 × 6
  term         df  sumsq  meansq statistic   p.value
  <chr>     <int>  <dbl>   <dbl>     <dbl>     <dbl>
1 hgt           1 46370. 46370.       535.  2.83e-81
2 Residuals   505 43753.    86.6       NA  NA
The ANOVA p-value is identical to the p-value for the slope in the regression table
Also note: for simple regression, \(F=T^2\), where \(T\) is the \(t\) statistic for the slope
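The relationship between the \(t\) test for the slope and the ANOVA \(F\) test can be verified numerically. This is a minimal sketch on simulated data; x and y are hypothetical and unrelated to the weight and height data.

```r
set.seed(11)
# Hypothetical simulated data, for illustration only
n <- 30
x <- rnorm(n, mean = 170, sd = 8)
y <- 5 + 0.4 * x + rnorm(n, sd = 6)

fit <- lm(y ~ x)

# t statistic and p-value for the slope, from the regression table
coefs   <- summary(fit)$coefficients
t_slope <- coefs["x", "t value"]
p_slope <- coefs["x", "Pr(>|t|)"]

# F statistic and p-value, from the ANOVA table
tab    <- anova(fit)
F_stat <- tab[["F value"]][1]
p_F    <- tab[["Pr(>F)"]][1]
```

Squaring t_slope reproduces F_stat, and the two p-values agree, so in simple regression the two tests lead to the same conclusion.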
We reject the null hypothesis. There is convincing evidence that the slope is nonzero. There is a statistically significant linear association between weight and height in physically active individuals.