Linear Regression: Multiple Predictors

Chapter 8
Math 215

Mario Kart

  • You have finally decided to sell your collection of Mario Kart games for the Nintendo Wii
  • How do different auction and game characteristics affect the price of the game on Ebay?

Box art from Mario Kart.

  • mariokart 1 data from 143 Ebay sales, 12 variables, including
Variable Description
total_pr Total price (auction price + shipping)
start_pr Starting price of auction
duration Auction length (days)
cond Condition (new or used)
wheels Number of steering wheels included
n_bids Number of bids

EDA

  • The two highest prices include more than just the game and wheels. Remove these points from the data.
mariokart |>
  filter(total_pr >= 75) |>
  select(total_pr, title)
# A tibble: 3 × 2
  total_pr title                                                    
     <dbl> <fct>                                                    
1     327. "Nintedo Wii Console Bundle Guitar Hero 5 Mario Kart "   
2     118. "10 Nintendo Wii Games - MarioKart Wii, SpiderMan 3, etc"
3      75  "NEW MARIO KART WITH WII WHEEL+2 GT PRO WHITE WII WHEEL" 
mariokart <- mariokart |>
  filter(total_pr < 80)

Single Predictor (total_pr ~ wheels)

\[\widehat{total\_pr} = 37.5 + 8.64\times wheels\]

lm1 <- lm(total_pr ~ wheels, data = mariokart)

tidy(lm1)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    37.5      0.780      48.1 1.80e-88
2 wheels          8.64     0.548      15.8 9.05e-33

Coefficient of determination: \(R^2=0.642\)

glance(lm1)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.642         0.639  5.48      249. 9.05e-33     1  -439.  884.  892.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Adding a Categorical Predictor - Parallel Slopes Model

  • We can include condition (new or used) as a second predictor in the model
  • The variable needs to be recoded first (e.g., “new” = 0, “used” = 1)
  • However, R will do it for us automatically when we fit a linear model with a categorical predictor
  • With character data, R will choose the levels alphabetically
  • condused is the recoded variable (condused = 0 if cond = new, condused = 1 if cond = used)

\[\widehat{total\_pr} = 42.4+7.23\times wheels-5.58\times condused\]

lm2 <- lm(total_pr ~ wheels + cond, data = mariokart)

tidy(lm2)
# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    42.4      1.07      39.8  1.75e-77
2 wheels          7.23     0.542     13.3  1.29e-26
3 condused       -5.58     0.925     -6.04 1.35e- 8

The model \[\widehat{total\_pr} = 42.4+7.23\times wheels-5.58\times condused\] can be rewritten as

\[\widehat{total\_pr} = \left\{\begin{array}{cc}42.4+7.23\times wheels, & \textrm{if } cond = ``new''\\36.8+7.23\times wheels, & \textrm{if } cond = ``used''\end{array}\right.\] Since this model is composed of two lines with the same slope, this is sometimes called a parallel slopes model

Scatter plot of total price vs. number of steering wheels colored by condition, along with parallel slopes model.

  • The coefficient of determination for the parallel slopes model is \(R^2=0.717\)
  • This model explains more of the variation in the response than the model with a single predictor (72% vs. 64%)
glance(lm2)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.717         0.712  4.89      174. 1.68e-38     2  -422.  853.  864.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Adjusted R-Squared

  • \(R^2\) will always increase as more variables are included in the model
  • However, that does not mean that the model will do a better job of predicting values of the response for new data (that were not used to fit the model)
  • \(R^2\) can be adjusted by introducing a penalty that increases with the number of predictors
  • Adjusted R-Squared is a better choice for comparing models with different numbers of predictors
  • Adjusted \(R^2\) is 0.639 for the single predictor model and 0.712 for the parallel slopes model
  • Based on adjusted \(R^2\) would select the parallel slopes model over the single predictor model
glance(lm1)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.642         0.639  5.48      249. 9.05e-33     1  -439.  884.  892.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(lm2)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.717         0.712  4.89      174. 1.68e-38     2  -422.  853.  864.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Many Predictors

  • We can include any number of predictors in our model
  • Let’s construct a model that uses all of the predictors in the
lm3 <- lm(total_pr ~ wheels + cond + start_pr + duration + n_bids,
   data = mariokart)

lm3 |> tidy()
# A tibble: 6 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   39.4      1.80       21.9  3.16e-46
2 wheels         6.72     0.508      13.2  3.82e-26
3 condused      -4.77     0.935      -5.10 1.10e- 6
4 start_pr       0.180    0.0335      5.35 3.62e- 7
5 duration      -0.275    0.171      -1.61 1.11e- 1
6 n_bids         0.191    0.0866      2.20 2.93e- 2

The fitted model is

\[\begin{array}{rcl}\widehat{total\_pr} & = & 39.4\\ & + & 6.72\times wheels\\ & - & 4.77\times condused\\ & + & 0.180 \times start\_pr\\ & - & 0.28\times duration\\ & + & 0.191\times n\_bids\end{array}\]

In general, a multiple regression model with \(k\) predictors has the form \[\hat{y}=b_0+b_1x_1+b_2x_2+\cdots+b_kx_k\]

  • Adjusted \(R^2\) for this model is 0.761.
  • We would choose this model over the parallel slopes model, which had adjusted \(R^2\) of 0.712.
glance(lm3)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.770         0.761  4.45      90.4 2.41e-41     5  -408.  829.  850.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Model Selection

  • The best model is not always the one that uses the most predictors
  • Sometimes including an additional predictor will make the model perform worse when it is used to predict the outcome for new observations
  • One way to compare competing models is using adjusted \(R^2\), but there are other measures of model performance that can be used
  • With \(k\) predictors there are \(2^k\) possible models

Stepwise Selection

  • Stepwise selection strategies can be used to select a subset of predictors
  • Backward elimination starts with a model with all \(k\) predictors (the full model). Next, each predictor is deleted from the model and adjusted \(R^2\) is computed. The model with \(k-1\) predictors that has the highest adjusted \(R^2\) is retained. This process is repeated as long as the adjusted \(R^2\) continues to increase at each step.
  • Forward selection starts with the null model (0 predictors). Next, a single predictor model is created for each of the \(k\) predictors. Adjusted \(R^2\) is computed for each one and the best single predictor (with the highest \(R^2\)) model is retained. This process is repeated, adding the best single new predictor at each step as long as the adjusted \(R^2\) continues to increase.

Example of a Forward Elimination

  • One predictor

    • lm(total_pr ~ wheels, data = mariokart) |> glance() #0.638977*

    • lm(total_pr ~ cond, data = mariokart) |> glance() #0.3458805

    • lm(total_pr ~ start_pr, data = mariokart) |> glance() #0.1008448

    • lm(total_pr ~ duration, data = mariokart) |> glance() #0.1338066

    • lm(total_pr ~ n_bids, data = mariokart) |> glance() #0.0009509121

    • So predictor wheels is included

  • Two predictors

    • lm(total_pr ~ wheels+cond, data = mariokart) |> glance() #0.7124098*

    • lm(total_pr ~ wheels+start_pr, data = mariokart) |> glance() #0.6752166

    • lm(total_pr ~ wheels+duration, data = mariokart) |> glance() #0.6528401

    • lm(total_pr ~ wheels+n_bids, data = mariokart) |> glance() #0.6368667

    • So predictor cond is included

  • Three predictors

    • lm(total_pr ~ wheels+cond+start_pr, data = mariokart) |> glance() #0.7523667*

    • lm(total_pr ~ wheels+cond+duration, data = mariokart) |> glance() #0.7106673

    • lm(total_pr ~ wheels+cond+n_bids, data = mariokart) |> glance() #0.7143661

    • So predictor start_pr is added

  • Four predictors

    • lm(total_pr ~ wheels+cond+start_pr+duration, data = mariokart) |> glance() #0.7547146

    • lm(total_pr ~ wheels+cond+start_pr+n_bids, data = mariokart) |> glance() ##0.7655999*

    • So predictor n_bids is added

  • All five predictors

    • lm(total_pr ~ wheels + cond + start_pr + duration + n_bids, data = mariokart) |> glance()#0.7614715
  • It is lower than with four predictors wheels, cond, start_pr and n_bids. Therefore we end up with four predictors in our model

Example of a Backward Elimination

# full 0.7614715
lm(total_pr ~ wheels + cond + start_pr + duration + n_bids, data = mariokart) |> glance()
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.770         0.761  4.45      90.4 2.41e-41     5  -408.  829.  850.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# 4 - 0.7655999
lm(total_pr ~ wheels + cond + start_pr + duration, data = mariokart) |>
  glance() #0.7547146
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.762         0.755  4.51      109. 2.31e-41     4  -410.  832.  850.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
lm(total_pr ~ wheels + cond + start_pr + n_bids, data = mariokart) |>
  glance() #0.7655999*
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.766         0.759  4.48      111. 7.62e-42     4  -409.  830.  847.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
lm(total_pr ~ wheels + cond + duration + n_bids, data = mariokart) |>
  glance() #0.7129729
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.721         0.713  4.88      87.9 9.59e-37     4  -421.  854.  872.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
lm(total_pr ~ wheels + start_pr + duration + n_bids, data = mariokart) |> 
  glance() #0.7175296
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.726         0.718  4.84      89.9 3.25e-37     4  -420.  852.  870.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
lm(total_pr ~ cond + start_pr + duration + n_bids, data = mariokart) |> 
  glance() #0.4561745
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.472         0.456  6.72      30.4 4.73e-18     4  -466.  944.  962.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# 3 - 0.7523667
lm(total_pr ~ wheels + cond + start_pr, data = mariokart) |>
  glance() #0.7523667*
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.758         0.752  4.54      143. 5.56e-42     3  -411.  832.  847.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
lm(total_pr ~ wheels + cond + n_bids, data = mariokart) |>
  glance() #0.7143661
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.720         0.714  4.87      118. 9.58e-38     3  -421.  853.  867.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
lm(total_pr ~ wheels + start_pr + n_bids, data = mariokart) |>
  glance() #0.6914357
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.698         0.691  5.06      106. 1.87e-35     3  -427.  863.  878.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
lm(total_pr ~ cond + start_pr + n_bids, data = mariokart) |>
  glance() #0.43299
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.445         0.433  6.86      36.6 1.89e-17     3  -470.  949.  964.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The results of the backward selection starting with the 5-predictor Mario Kart total price model are summarized below.

Step Predictors Adjusted \(R^2\)
0 wheels, cond, start_pr, duration, n_bids 0.761
1 wheels, cond, start_pr, duration 0.755
1 wheels, cond, start_pr, n_bids 0.766*
1 wheels, cond, duration, n_bids 0.713
1 wheels, start_pr, duration, n_bids 0.718
1 cond, start_pr, duration, n_bids 0.456
2 wheels, cond, start_pr 0.752
2 wheels, cond, n_bids 0.714
2 wheels, start_pr, n_bids 0.691
2 cond, start_pr, n_bids 0.433

The selected model has 4 predictors (duration) was dropped from the model.

# A tibble: 5 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   38.7      1.76       22.0  1.10e-46
2 wheels         6.86     0.503      13.6  3.19e-27
3 condused      -5.38     0.859      -6.26 4.68e- 9
4 start_pr       0.170    0.0332      5.12 1.04e- 6
5 n_bids         0.187    0.0871      2.14 3.38e- 2

Categorical Predictors with More than 2 Levels

  • The iris dataset has 150 observations of 5 variables
  • We will focus on the relationship between petal width and petal length for 3 different species, Iris setosa, Iris versicolor, and Iris virginica
  • Species is a categorical variable with 3 levels
  • Adding a categorical variable with \(L\) levels will introduce \(L-1\) rows to the regression output
  • You can can think of this as adding \(L-1\) indicator variables that take on values 0 and 1
  • Here is a model that uses Petal.Width and Species to predict Petal.Length.
  • Including Species as a predictor introduces coefficients for Speciesversicolor and Speciesvirginica to the model.
  • There is no coefficient for setosa. This is the base level for the Species variable (first alphabetically)
# A tibble: 4 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)           1.21    0.0652     18.6  2.88e-40
2 Petal.Width           1.02    0.152       6.69 4.41e-10
3 Speciesversicolor     1.70    0.181       9.38 1.17e-16
4 Speciesvirginica      2.28    0.281       8.09 2.08e-13

The model can be written in two ways

\[\widehat{Petal.Length}=1.21 + 1.02\times Petal.Width + 1.70\times Speciesversicolor + 2.28\times Speciesvirginica\]

or

\[\widehat{Petal.Length}=\left\{\begin{array}{cl}1.21+1.02\times Petal.Width, & \textrm{if } Species = ``setosa''\\2.91+1.02\times Petal.Width, & \textrm{if } Species = ``versicolor''\\3.49+1.02\times Petal.Width, & \textrm{if } Species = ``virginica''\end{array}\right.\]

This is another example of a parallel slopes model.

Scatter plot of petal length vs. petal width colored by species, along with parallel slopes model.