Linear Regression, Single Predictor

IMS2 Ch. 7
Math 115

Yurk

Body Measurements

bdims body measurement dataset
507 physically active individuals (247 men, 260 women)
age, weight (wgt), height (hgt), sex, 21 body girth variables (e.g., hip girth)

Weight vs. Height

Scatter plot of weight vs. height.

It appears that the data fall roughly along a line.

Linear Model

Scatter plot of weight vs. height with line of best fit.

We can add a line of best fit to the scatter plot.

Equation for line: \[y = b_0 + b_1 x\]
\(b_0\) and \(b_1\) are coefficients
- \(b_0\) = intercept
- \(b_1\) = slope
\(b_0\) and \(b_1\) are statistics (fit using sample)
\(\beta_0\) and \(\beta_1\) are the corresponding parameters
The fitted values are \(b_0=-105.0\), \(b_1=1.018\)

Variable Roles

wgt = outcome/response (dependent variable, \(y\))
hgt = predictor (independent variable, \(x\))
We use a hat to indicate an estimate or prediction \[\widehat{wgt} = -105.0 + 1.018 \times hgt\]

Using a Model to Make Predictions

Use the model to predict the weight of a person with a given height
The predicted weight of a 170 cm tall individual is \[\begin{array}{rcl}\widehat{wgt} &=& -105.0 + 1.018 \times hgt\\ &=& -105.0 + 1.018 \times 170 \\ &=& 68.06\, kg\end{array}\]
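The same calculation can be sketched in a few lines of Python. This is only an illustration: the function name `predict_weight` is made up here, and the coefficients are the rounded values from the fitted model above.

```python
# Rounded coefficients from the fitted model in these notes.
B0 = -105.0  # intercept, in kg
B1 = 1.018   # slope, in kg per cm


def predict_weight(hgt_cm):
    """Return the predicted weight (kg) for a given height (cm)."""
    return B0 + B1 * hgt_cm


print(round(predict_weight(170), 2))  # 68.06
```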

Correlation

The correlation coefficient describes the strength and direction of a linear relationship
Denoted \(r\) for a sample, \(\rho\) for a population
\(-1\leq r\leq1\)

Direction of linear relationship
- \(r>0\) indicates a positive association
- \(r<0\) indicates a negative association
Strength of linear relationship
- Values close to 0 indicate a weak linear association
- Values close to -1 or 1 indicate a strong linear association

Some scatter plots and their correlations. IMS2 Figure 7.10.

Let \((x_i,y_i)\) be the \(i\)th observation of the numeric variables \(x\) and \(y\)
Then \(r\) is \[r=\frac{1}{n-1}\sum_{i=1}^n\frac{x_i-\bar{x}}{s_x}\cdot\frac{y_i-\bar{y}}{s_y}\]
Here \(\bar{x}\) and \(\bar{y}\) are the sample means, and \(s_x\) and \(s_y\) are the sample standard deviations of \(x\) and \(y\)
\(r\) is independent of the units of measurement of \(x\) and \(y\)
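The formula for \(r\) can be applied directly, as in the sketch below. The five data points are made up for illustration (they are not taken from bdims), and the function name `correlation` is an assumption. Converting heights from centimeters to inches leaves \(r\) unchanged, which illustrates the independence from units.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation: average product of z-scores, dividing by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)


# Made-up illustrative data, not from the bdims dataset.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 62, 66, 74, 79]

r_cm = correlation(heights_cm, weights_kg)

# Changing the units of x (cm to inches) does not change r.
heights_in = [h / 2.54 for h in heights_cm]
r_in = correlation(heights_in, weights_kg)

print(r_cm, r_in)
```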

Scatter plot of weight vs. height with line of best fit.

Correlation between height and weight: \(r=0.717\)

Interpretation of coefficients

\[\widehat{wgt} = -105.0 + 1.018 \times hgt\]

Slope: for each additional centimeter of height, we expect weight to increase by 1.018 kg
Intercept: we would predict a 0 cm tall individual to weigh -105.0 kg
In many cases, this intercept interpretation is not useful
Better to think of intercept as positioning line vertically so it passes through the data cloud

Extrapolation

Predicting weight for an individual whose height is outside the range of the observed data is an example of extrapolation
We should not expect the model to apply outside of this range
Extrapolation can lead to nonsensical predictions (0 cm tall individuals with negative weight) or inaccurate ones

Least Squares Regression

How is the best fit line determined?
Slope and intercept chosen to minimize the error between the observed and predicted response

Plot highlighting three residuals. IMS2 Figure 7.8.

The residual (error) for the \(i\)th observation \((x_i,y_i)\) is \[e_i = y_i - \hat{y}_i\]
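As a worked illustration, suppose a 170 cm tall individual weighs 72 kg (a hypothetical observation, not a record from bdims). The model predicts 68.06 kg for this height, so the residual is \[e = y - \hat{y} = 72 - 68.06 = 3.94\, kg\]
A positive residual means the observation lies above the line (the model underpredicts); a negative residual means it lies below the line.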

Least Squares Line

The least squares regression line minimizes the sum of the squared residuals, \[e_1^2+e_2^2+\cdots+e_n^2\]
Properties of least squares line
- The line passes through the point \((\bar{x},\bar{y})\)
- The slope is \(b_1=\frac{s_y}{s_x}r\)
Note: We can use these properties to compute the slope and intercept if we know the means, SDs, and correlation
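A minimal sketch of this computation is given below. It uses made-up data (not bdims), and the function names are assumptions. The intercept follows from the first property: since the line passes through \((\bar{x},\bar{y})\), we have \(b_0=\bar{y}-b_1\bar{x}\). The last lines check that nudging the slope away from the least squares value increases the sum of squared residuals.

```python
from statistics import mean, stdev


def fit_from_summaries(xs, ys):
    """Compute (b0, b1) from the means, SDs, and correlation."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b1 = (s_y / s_x) * r        # slope property
    b0 = y_bar - b1 * x_bar     # line passes through (x_bar, y_bar)
    return b0, b1


def sum_squared_residuals(b0, b1, xs, ys):
    """Sum of e_i^2 for the line y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data, not from the bdims dataset.
xs = [160, 165, 170, 175, 180]
ys = [55, 62, 66, 74, 79]

b0, b1 = fit_from_summaries(xs, ys)
best = sum_squared_residuals(b0, b1, xs, ys)

# Any other line, e.g. one with a slightly different slope, does worse.
nudged = sum_squared_residuals(b0, b1 + 0.01, xs, ys)
print(best < nudged)  # True
```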

Using Software

Typically we will use Jamovi or other statistical software to compute the coefficients of the least squares line
We will learn to do this in J Lab 3
Results are usually presented in a regression table


 MODEL SPECIFIC RESULTS

 MODEL 1

 Model Coefficients - wgt                                              
 ───────────────────────────────────────────────────────────────────── 
   Predictor    Estimate       SE            t            p            
 ───────────────────────────────────────────────────────────────────── 
   Intercept    -105.011254    7.53940919    -13.92831    < .0000001   
   hgt             1.017617    0.04398680     23.13459    < .0000001   
 ─────────────────────────────────────────────────────────────────────
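In a regression table like this one, each t value is the estimate divided by its standard error (SE). The short check below reproduces the t column from the Estimate and SE columns; the variable names are illustrative only.

```python
# (estimate, standard error) pairs copied from the table above.
rows = {
    "Intercept": (-105.011254, 7.53940919),
    "hgt": (1.017617, 0.04398680),
}

# t statistic = estimate / SE for each coefficient.
t_values = {name: est / se for name, (est, se) in rows.items()}

for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
```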

Coefficient of determination (\(R^2\))

The coefficient of determination, also known as R-squared (\(R^2\)), is used to measure how well a model describes the data
\(R^2\) is the proportion of variation in the outcome/response variable that is explained by the model
For simple linear regression, \(R^2 = r^2\)
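The sketch below illustrates both statements on made-up data (not bdims): it computes \(R^2\) as one minus the ratio of residual variation to total variation, and checks that this equals \(r^2\). The function names are assumptions. It also squares the height and weight correlation reported earlier in these notes.

```python
from math import sqrt
from statistics import mean


def r_squared(xs, ys):
    """R^2 = 1 - (sum of squared residuals) / (total variation in y)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx            # least squares slope (same as r * s_y / s_x)
    b0 = y_bar - b1 * x_bar     # least squares intercept
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


def correlation(xs, ys):
    """Sample correlation coefficient r."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Made-up illustrative data, not from the bdims dataset.
xs = [160, 165, 170, 175, 180]
ys = [55, 62, 66, 74, 79]

print(abs(r_squared(xs, ys) - correlation(xs, ys) ** 2) < 1e-9)  # True

# Squaring the correlation reported for height and weight.
r_reported = 0.7173011
print(round(r_reported ** 2, 4))  # 0.5145
```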

\(R^2\) will always have values between 0 and 1
A value close to 1 indicates that the linear model fits the data well (it explains nearly 100% of the variability in outcomes)
A value close to 0 indicates that the linear model does not fit the data well

We can compute \(R^2\) using Jamovi or other statistical software
Height explains about 51.5% of the variability in weights


 Model Fit Measures                  
 ─────────────────────────────────── 
   Model    R            R²          
 ─────────────────────────────────── 
       1    0.7173011    0.5145208   
 ─────────────────────────────────── 
   Note. Models estimated using
   sample size of N=507

Residual plots

A residual plot is a plot of residuals vs. predicted values (a scatter plot with points \((\hat{y}_i,e_i)\))
Useful for diagnosing problems with a linear model
If there is a pattern in the residual plot, then a more complicated model (e.g., a nonlinear model or a model that includes more predictors) may be more appropriate
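As a sketch of what a pattern looks like, the code below computes the residual plot points \((\hat{y}_i,e_i)\) for a deliberately curved, made-up dataset (\(y=x^2\), not bdims). The function name is an assumption. The residuals go from positive to negative and back to positive, a U shape that suggests a straight line is not the right model.

```python
from statistics import mean


def residual_plot_points(xs, ys):
    """Return (fitted value, residual) pairs for the least squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    return [(y_hat, y - y_hat) for y_hat, y in zip(fitted, ys)]


# Made-up curved data: y = x^2, so a straight line is the wrong model.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

points = residual_plot_points(xs, ys)
residuals = [e for _, e in points]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

Plotting these pairs with a horizontal reference line at \(e=0\) gives the residual plot.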

There are no obvious patterns in the weight vs. height residual plot.

Residual plot for weight vs. height with horizontal line at \(e=0\) for reference.

More residual plots

Some scatter plots (top) and corresponding residual plots (bottom). From IMS2 Example.

More residual plots

More residual plots. From IMS2 Exercise 7.2.

Outliers

Outliers are observations that fall far from the point cloud
High leverage points fall horizontally far from the center of the point cloud
High leverage points have more pull on the regression line
Influential points have a strong influence on the slope of the regression line
Influential points can be identified by fitting a line with the point removed. If the slope is very different from the slope when the point is included, then the point is influential.
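The check described above can be sketched as follows. The data are made up for illustration (not bdims): five points lie exactly on a line with slope 2, and a sixth point sits far to the right of the others and well off the line. The function name `slope` is an assumption.

```python
from statistics import mean


def slope(xs, ys):
    """Least squares slope b1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


# Made-up data: the last point is horizontally far from the rest
# (high leverage) and does not follow their trend.
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 6, 8, 10, 5]

slope_with = slope(xs, ys)
slope_without = slope(xs[:-1], ys[:-1])  # refit with the last point removed

print(slope_with, slope_without)
```

Here the slope changes substantially when the point is removed, so the point is influential.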

Each of the following plots has an outlier. Which are high leverage? Influential?

Scatter plots with outliers. From IMS2 Fig. 7.16.