Inference: Multiple Regression

IMS1 Ch. 25
Math 215

Yurk

Interest Rates

loans ¹
10,000 loans through Lending Club, individuals lend to each other
Response: interest_rate
Predictors: dept_to_income (debt to income ratio), term (number of months for loan), credit_checks (number of credit inquiries in last 12 months)

Multiple Regression Output

lm(interest_rate ~ debt_to_income + term + credit_checks, data = loans) |> 
  tidy()

# A tibble: 4 × 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)      4.31     0.195        22.1 2.09e-105
2 debt_to_income   0.0408   0.00307      13.3 4.76e- 40
3 term             0.158    0.00417      37.9 6.24e-294
4 credit_checks    0.247    0.0193       12.8 3.29e- 37

Linear model: \[\begin{array}{rcl}\widehat{interest\_rate} &=& 4.31+0.0408\times debt\_to\_income\\ &+& 0.158 \times term +0.247\times credit\_checks\end{array}\]

# A tibble: 4 × 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)      4.31     0.195        22.1 2.09e-105
2 debt_to_income   0.0408   0.00307      13.3 4.76e- 40
3 term             0.158    0.00417      37.9 6.24e-294
4 credit_checks    0.247    0.0193       12.8 3.29e- 37

A standard error and $T$-statistic is listed for each coefficient
We will not discuss the details of the standard error calculation (linear algebra)

# A tibble: 4 × 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)      4.31     0.195        22.1 2.09e-105
2 debt_to_income   0.0408   0.00307      13.3 4.76e- 40
3 term             0.158    0.00417      37.9 6.24e-294
4 credit_checks    0.247    0.0193       12.8 3.29e- 37

Hypothesis test for coefficient for each predictor:
- $H_0: \beta_1=0$, given term and credit_checks included in the model
- $H_0: \beta_2=0$, given debt_to_income and credit_checks included in the model
- $H_0: \beta_3=0$, given debt_to_income and term included in the model
For $k$ predictors, $df=n-k-1$

# A tibble: 4 × 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)      4.31     0.195        22.1 2.09e-105
2 debt_to_income   0.0408   0.00307      13.3 4.76e- 40
3 term             0.158    0.00417      37.9 6.24e-294
4 credit_checks    0.247    0.0193       12.8 3.29e- 37

All three coefficients are significant
For example, it would be extremely unlikely to obtain a value for the debt_to_income coefficient that is at least as extreme as 0.0408 if there is no relationship between interest rate an debt to income ratio

Interpreting Coefficients

# A tibble: 4 × 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)      4.31     0.195        22.1 2.09e-105
2 debt_to_income   0.0408   0.00307      13.3 4.76e- 40
3 term             0.158    0.00417      37.9 6.24e-294
4 credit_checks    0.247    0.0193       12.8 3.29e- 37

The model predicts that interest rate will increase by 0.0408 for each increase of 1 in the debt to income ration (assuming the other predictors are held constant)
Interest rate is predicted to increase by 0.247 for each additional credit check
Does this mean that credit checks are more important than debt to income ratio?

It is not informative to compare coefficient values when the data are on different scales

predictor	mean	sd
`debt_to_income`	19.3	15.0
`term`	43.3	36
`credit_checks`	1.96	2.3

Standardized Predictors

We can standardize the predictors by calculating the number of standard deviations each value is from the mean
The $i$th observation of the $j$th predictor $x_j$ is standardized as \[u_{ij}=\frac{x_{ij}-\bar{x}_j}{s_j}\]
The standardized predictors have mean = 0, standard deviation = 1

Let’s standardize the predictors for the interest rate example
We can use the mutate function

loans_standardized <- loans |>
  mutate(debt_to_income = 
           (debt_to_income - mean(debt_to_income, na.rm = TRUE))/
           sd(debt_to_income, na.rm = TRUE)) |>
  mutate(term = 
           (term - mean(term, na.rm = TRUE))/
           sd(term, na.rm = TRUE)) |>
  mutate(credit_checks = 
           (credit_checks - mean(credit_checks, na.rm = TRUE))/
           sd(credit_checks, na.rm = TRUE))

lm(interest_rate ~ debt_to_income + term + credit_checks, 
   data = loans_standardized) |> 
  tidy()

# A tibble: 4 × 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)      12.4      0.0459     270.  0        
2 debt_to_income    0.612    0.0460      13.3 4.76e- 40
3 term              1.74     0.0460      37.9 6.24e-294
4 credit_checks     0.588    0.0460      12.8 3.29e- 37

The intercept is the predicted interest rate when each of the predictors have their mean value
If debt to income ratio is increased by 1 standard deviations, interest rate is predicted to increase by 0.612 (holding other predictors constant)
Similar interpretation for other coefficients

# A tibble: 4 × 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)      12.4      0.0459     270.  0        
2 debt_to_income    0.612    0.0460      13.3 4.76e- 40
3 term              1.74     0.0460      37.9 6.24e-294
4 credit_checks     0.588    0.0460      12.8 3.29e- 37

Since all predictors are on the same scale, coefficient comparisons are more meaningful
E.g., term has the largest impact on predicted interest_rate
p-values have not changed

Coins

Next we explore the money data set
Amount of money in different people’s coin dishes (simulated data)
Response: total_amount (USD)
Predictors: number_of_coins, number_of_low_coins (pennies, nickels, dimes)
We will use this dataset to explore multiple linear regression with correlated predictors

Relationships between the variables

Of course, there is a relationship between the total amount and each of the predictors
The number of coins and the number of low coins are also correlated
Multicollinearity occurs when the predictor variables are correlated with themselves

total vs coins
total vs. low coins
coins vs low coins

Scatter plot of `total_amount` vs. `number_of_coins`.

Scatter plot of `total_amount` vs. `number_of_low_coins`.

Scatter plot of `number_of_coins` vs. `number_of_low_coins`.

How is the multiple regression model affected by correlated predictors? \[\begin{array}{rcl}\widehat{total\_amount}&=&0.798 \\ &+& 0.206\times number\_of\_coins\\ &-& 0.160\times number\_of\_low\_coins\end{array}\]

lm(total_amount ~ number_of_coins + number_of_low_coins,
   data = money) |>
  tidy()

# A tibble: 3 × 5
  term                estimate std.error statistic  p.value
  <chr>                  <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)            0.798    0.301       2.65 1.42e- 2
2 number_of_coins        0.206    0.0209      9.89 9.44e-10
3 number_of_low_coins   -0.160    0.0291     -5.51 1.33e- 5

\[\begin{array}{rcl}\widehat{total\_amount}&=&0.798 \\ &+& 0.206\times number\_of\_coins\\ &-& 0.160\times number\_of\_low\_coins\end{array}\]

The relationship between the total amount and each predictor (when considered on its own) is positive
However, the coefficient for the number of low coins is negative in the multiple regression model
Why?

Intepreting Coefficients

\[\begin{array}{rcl}\widehat{total\_amount}&=&0.798 \\ &+& 0.206\times number\_of\_coins\\ &-& 0.160\times number\_of\_low\_coins\end{array}\]

The predicted amount increases by $0.206 for each additional coin, while keeping the number of low coins the same. A quarter is added.
The predicted amount decreases by $0.160 for each additional low coin, while keeping the number of coins the same. A quarter is replaced by a penny, nickel, or dime.

Multicollinearity

Multicollinearity is common when there are multiple predictors, especially in observational studies
Indicates that the predictors include some redundant information
Makes it difficult to interpret coefficients
Often results in models with high $R^2$, but few coefficients significantly different from 0
Can be avoided with careful experimental design (more on this later)

Macroeconomic Data

longley dataset ¹
US macroeconomic data from 1947 to 1962 (n = 16)
Note that observations are dependent, because this is a time series

Response: Employed: number of people employed
Predictors:
- GNP.deflator: GNP adjusted for inflation
- GNP: Gross National Product
- Unemployed: Number of unemployed people
- Armed.Forces: Number of people in armed forces
- Population: Noninstitutionalized people at least 14 years old
Predictors are collinear

library(GGally)
longley |>
  ggpairs() +
  theme_minimal()

Full Model

lm(Employed ~ ., data = longley) |>
  tidy()

# A tibble: 7 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -3482.     890.         -3.91  0.00356 
2 GNP.deflator     0.0151   0.0849      0.177 0.863   
3 GNP             -0.0358   0.0335     -1.07  0.313   
4 Unemployed      -0.0202   0.00488    -4.14  0.00254 
5 Armed.Forces    -0.0103   0.00214    -4.82  0.000944
6 Population      -0.0511   0.226      -0.226 0.826   
7 Year             1.83     0.455       4.02  0.00304

VIF

The variance inflation factor (VIF) for the $i$th predictor is \[VIF_i=\frac{1}{1-R_i^2}\]
Here, $R_i^2$ is the $R^2$ obtained from for regression of predictor $i$ in terms of the other predictors
$R_i^2$ close to 1 indicates that predictor $i$ is closely related to the other predictors and is redundant $\rightarrow$ large $VIF_i$
$VIF_i>5$ is taken as an indication of collinearity

Variance inflation factors for longley data

library(car)
lm(Employed ~ ., data = longley) |>
  vif()

GNP.deflator          GNP   Unemployed Armed.Forces   Population         Year 
   135.53244   1788.51348     33.61889      3.58893    399.15102    758.98060

Manual Variable Selection

We can use VIF and p-values to manually reduce the size of the model
Remove predictors with high VIF, high p-values
Before we start, note that adjusted $R^2$ for the full model is 0.992

lm(Employed ~ ., data = longley) |>
  glance()

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.995         0.992 0.305      330. 4.98e-10     6  0.907  14.2  20.4
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Full Model (6 Predictors)

# A tibble: 7 × 5
  term           estimate std.error statistic  p.value
  <chr>             <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -3482.     890.         -3.91  0.00356 
2 GNP.deflator     0.0151   0.0849      0.177 0.863   
3 GNP             -0.0358   0.0335     -1.07  0.313   
4 Unemployed      -0.0202   0.00488    -4.14  0.00254 
5 Armed.Forces    -0.0103   0.00214    -4.82  0.000944
6 Population      -0.0511   0.226      -0.226 0.826   
7 Year             1.83     0.455       4.02  0.00304

VIF

GNP.deflator          GNP   Unemployed Armed.Forces   Population         Year 
   135.53244   1788.51348     33.61889      3.58893    399.15102    758.98060

Population has a large p-value and large VIF
Eliminate it from the model

5 Predictor Model

lm(Employed ~ . - Population, data = longley) |>
  tidy()

# A tibble: 6 × 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  -3565.     772.         -4.62  0.000957 
2 GNP.deflator     0.0277   0.0607      0.456 0.658    
3 GNP             -0.0421   0.0176     -2.39  0.0379   
4 Unemployed      -0.0210   0.00303    -6.95  0.0000397
5 Armed.Forces    -0.0104   0.00200    -5.21  0.000397 
6 Year             1.87     0.399       4.68  0.000867

VIF

GNP.deflator          GNP   Unemployed Armed.Forces         Year 
   76.641401   546.870494    14.289620     3.460846   644.626426

GNP.deflator has a large p-value and large VIF
Eliminate it from the model

4 Predictor Model

# A tibble: 5 × 5
  term           estimate std.error statistic   p.value
  <chr>             <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)  -3599.     741.          -4.86 0.000503 
2 GNP             -0.0402   0.0165      -2.44 0.0328   
3 Unemployed      -0.0209   0.00290     -7.20 0.0000175
4 Armed.Forces    -0.0101   0.00184     -5.52 0.000180 
5 Year             1.89     0.383        4.93 0.000449

VIF

         GNP   Unemployed Armed.Forces         Year 
  515.123851    14.108642     3.141581   638.128041

GNP has the largest p-value and large VIF
Eliminate it from the model

Final Model (3 Predictors)

# A tibble: 4 × 5
  term            estimate std.error statistic  p.value
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -1797.       68.6        -26.2  5.89e-12
2 Unemployed      -0.0147    0.00167     -8.79 1.41e- 6
3 Armed.Forces    -0.00772   0.00184     -4.20 1.22e- 3
4 Year             0.956     0.0355      26.9  4.24e-12

VIF

  Unemployed Armed.Forces         Year 
    3.317929     2.223317     3.890861

All predictors have VIF < 5
All coefficients are significant

Adjusted $R^2$ is 0.991, so model describes slightly less variability in employment than the full model
However, simple model with reduced/no collinearity is preferred

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.993         0.991 0.332      555. 3.92e-13     3  -2.76  15.5  19.4
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Palmer Penguins

penguins dataset ¹
Measurements for three species of penguins from Palmer Archipelago

Response: body_mass_g
Predictors:
- species: Adelie, Chinstrap, or Gentoo
- bill_length_mm
- bill_depth_mm
- flipper_length_mm
- sex: female or male

3 Predictor Model

I performed manual variable selection
Dropping species and bill_length_mm resulted in all predictors having VIF < 5 and all coefficiencts significantly different from 0

# A tibble: 4 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        -2247.     625.       -3.59 3.76e- 4
2 bill_depth_mm        -86.9     15.5      -5.63 3.96e- 8
3 flipper_length_mm     38.2      2.08     18.3  3.47e-52
4 sexmale              538.      51.3      10.5  2.17e-22

VIF

    bill_depth_mm flipper_length_mm               sex 
         2.653895          2.444470          1.891038

Model Comparison

Does the simpler 3 predictor model predict body_mass_g as well as the full model?
We will also compare these two models to the best single predictor model, which uses flipper_length_mm as the predictor
We would like to compare how well each model performs on data that were not used to train/fit the model

One approach is to compare adjusted $R^2$

Model	predictors	adjusted $R^2$
Full	`species`, `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `sex`	0.873
3 predictor	`bill_depth_mm`, `flipper_length_mm`, `sex`	0.844
1 predictor	`flipper_length_mm`	0.758

Cross-Validation

Another approach is to use cross-validation
Divide the data into fourths (4-fold cross-validation)
We fit each model 4 times
Each time we hold out 1/4 of the data and fit the model to the remaining 3/4 of the data
Use the fitted model to predict body_mass_g on the 1/4 of the data we held back
Measure the prediction error (residual) on the holdout sample

Compare models using cross-validation SSE \[CV\,SSE=\sum_{i=1}^n(\hat{y}_{cv,i}-y_i)^2\]

Model	predictors	CV SSE
Full	`species`, `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, `sex`	28,105,231
3 predictor	`bill_depth_mm`, `flipper_length_mm`, `sex`	39,756,521
1 predictor	`flipper_length_mm`	52,576,385

The full model has the smallest CV SSE
We expect the full model to perform better when predicting body_mass_g for penguins that were not used to fit the model