# A tibble: 2 × 2
sex avg_hgt
<int> <dbl>
1 0 164.872
2 1 177.745
Coefficient of determination (\(R^2\))
The coefficient of determination, also known as R-squared (\(R^2\)), measures how well a model describes the data
\(R^2\) is the proportion of variation in the outcome/response variable that is explained by the model
For simple linear regression (one numeric predictor), \(R^2 = r^2\), where \(r\) is the sample correlation between the predictor and the outcome
For a least-squares linear model with an intercept, \(R^2\) always takes values between 0 and 1
A value close to 1 indicates that the linear model fits the data well (it describes nearly 100% of the variability in outcomes)
A value close to 0 indicates that the model does not fit well
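The relationship \(R^2 = r^2\) for simple linear regression can be checked numerically. The following is a minimal sketch that uses R's built-in women data set (height and weight) as a stand-in; that choice is an assumption made so the example runs without extra packages, and it is not the bdims data used in these slides.

```r
# Stand-in data: the built-in `women` data set (height in inches, weight in pounds).
# Assumption: used only for illustration; it is not the bdims data from the slides.
fit <- lm(weight ~ height, data = women)

r         <- cor(women$height, women$weight)  # sample correlation
r_squared <- summary(fit)$r.squared           # R^2 as reported for the fitted line

# The two quantities should agree up to floating-point error
c(r_squared_from_cor = r^2, r_squared_from_lm = r_squared)
all.equal(r^2, r_squared)
```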
Total Sum of Squares
The total sum of squares, denoted SST, describes the total variation in the outcome \[SST = (y_1-\bar{y})^2 + (y_2-\bar{y})^2 + \cdots + (y_n-\bar{y})^2\]
Note that SST does not involve the model at all
However, we can think of a null model that uses the sample mean \(\bar{y}\) as the prediction for every observation
SST is the sum of the squared residuals for the null model
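The null-model interpretation of SST can be illustrated with an intercept-only fit. This is a sketch under the assumption that the built-in women data set is an acceptable stand-in for the course data; the formula weight ~ 1 fits only an intercept, which least squares estimates as the sample mean.

```r
# Stand-in data: built-in `women` data set (an assumption, not the bdims data)
y <- women$weight

# SST computed directly from its definition
sst <- sum((y - mean(y))^2)

# Null model: intercept only, so every prediction is the sample mean
null_fit <- lm(weight ~ 1, data = women)
sse_null <- sum(resid(null_fit)^2)

# The fitted intercept equals the sample mean, and the null model's
# sum of squared residuals equals SST
c(intercept = unname(coef(null_fit)), mean_y = mean(y))
c(SST = sst, SSE_null = sse_null)
```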
Sum of Squared Errors
The sum of squared errors, denoted SSE, quantifies the variation in outcomes that the model fails to describe \[\begin{array}{rcl}SSE &=& (y_1-\hat{y}_1)^2 + (y_2-\hat{y}_2)^2 + \cdots + (y_n-\hat{y}_n)^2 \\ &=& e_1^2 + e_2^2 + \cdots + e_n^2\end{array}\]
SSE is the sum of the squared residuals, which we have encountered before
Regression Sum of Squares
The regression sum of squares, denoted SSR, measures the variation that is accounted for by the model \[SSR = SST - SSE\]
Hence, the proportion of variation in the outcome that is described by the model is \[R^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}\]
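The three sums of squares and the resulting \(R^2\) can be computed by hand and compared with the value R reports. The sketch below again assumes the built-in women data set as a stand-in for the course data; the variable names (sst, sse, ssr) are chosen here for illustration.

```r
# Stand-in data: built-in `women` data set (an assumption, not the bdims data)
fit <- lm(weight ~ height, data = women)

y     <- women$weight
y_hat <- fitted(fit)          # predictions from the fitted line

sst <- sum((y - mean(y))^2)   # total variation in the outcome
sse <- sum((y - y_hat)^2)     # variation the model fails to describe
ssr <- sst - sse              # variation accounted for by the model

r_squared <- 1 - sse / sst

c(SST = sst, SSE = sse, SSR = ssr, R2 = r_squared)

# Should match the R^2 reported by summary()
all.equal(r_squared, summary(fit)$r.squared)
```

For a least-squares fit with an intercept, SSR computed as SST - SSE also equals the sum of squared deviations of the predictions from \(\bar{y}\), which is why it is read as the variation captured by the model.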
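We can have R compute \(R^2\): glance() from the broom package reports it in the r.squared column
library(broom)

# Fit weight on height and extract model-level summaries, including r.squared
lm(wgt ~ hgt, data = bdims) |>
  glance()
Height explains about 51.5% of the variability in weights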
A residual plot is a plot of residuals vs. predicted values (a scatter plot of the points \((\hat{y}_i, e_i)\))
Useful for diagnosing problems with linear models
If there is a pattern in the residual plot, then a more complicated model (e.g., a nonlinear model or a model that includes more predictors) may be more appropriate
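To see what a pattern in a residual plot looks like, one can fit a straight line to data that are actually curved. The following sketch uses simulated data; the quadratic relationship, the noise level, and the seed are all assumptions chosen for illustration, and base R plotting is used instead of ggplot2 to keep the example self-contained.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible

# Simulated data with a curved (quadratic) relationship
x <- seq(0, 10, length.out = 100)
y <- 2 + 0.5 * x^2 + rnorm(100, sd = 1)

lin_fit  <- lm(y ~ x)            # straight line: misses the curvature
quad_fit <- lm(y ~ x + I(x^2))   # more complicated model with a squared term

# Residual plot for the straight-line fit: expect a U-shaped pattern
plot(fitted(lin_fit), resid(lin_fit),
     xlab = "Predicted value", ylab = "Residual")
abline(h = 0, lty = 2, col = "red")

# The straight-line residuals are systematically positive at the ends
# of the x range and negative in the middle
mean(resid(lin_fit)[x < 1])
mean(resid(lin_fit)[abs(x - 5) < 1])

# Compare R^2 for the two fits
c(linear = summary(lin_fit)$r.squared, quadratic = summary(quad_fit)$r.squared)
```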
A residual plot can be created using the augment() function from the broom package.
The predictions are stored in the variable .fitted and the residuals are stored in .resid
library(broom)

# Fit the model, then attach fitted values and residuals to the data
lm1 <- lm(wgt ~ hgt, data = bdims)
bdims_aug <- augment(lm1, bdims)
There are no obvious patterns in the height vs. weight residual plot.
bdims_aug |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  theme_minimal()
Residual plot for weight vs. height with horizontal line at \(e=0\) for reference.
More residual plots
Some scatter plots (top) and corresponding residual plots (bottom). From IMS1 Fig. 7.10
More residual plots. From IMS1 Ex. 7.2
Outliers
Outliers are observations that fall far from the point cloud
High leverage points fall horizontally far from the center of the point cloud
High leverage points have more pull on the regression line
Influential points have a strong influence on the slope of the regression line
Influential points can be identified by fitting the line with the point removed. If the slope is very different from the slope when the point is included, then the point is influential.
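The refit-and-compare check described above can be sketched in R. The data here are simulated, with one point deliberately placed far to the right of and well below the trend; the data-generating values and the seed are assumptions for illustration. The hatvalues() function from base R's stats package is used as one common numeric measure of leverage, which goes slightly beyond the visual definition given in these slides.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# 30 points following a linear trend, plus one point (row 31) that is
# horizontally far from the rest and well below the trend
x <- c(runif(30, 0, 10), 30)
y <- c(2 + 1.5 * x[1:30] + rnorm(30), 10)
dat <- data.frame(x = x, y = y)

fit_all  <- lm(y ~ x, data = dat)          # fit with every point
fit_drop <- lm(y ~ x, data = dat[-31, ])   # fit with the suspect point removed

# Leverage: the point far from the center of the x values has the largest value
lev <- hatvalues(fit_all)
unname(which.max(lev))

# Influence: compare the slope with and without the point
slope_all  <- unname(coef(fit_all)["x"])
slope_drop <- unname(coef(fit_drop)["x"])
c(with_point = slope_all, without_point = slope_drop)
```

If the two slopes differ substantially, as they should in this simulation, the removed point is influential.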
Each of the following plots has an outlier. Which are high leverage? Influential?