Inference: Regression Single Predictor

IMS2 Ch. 24
Math 115

Yurk

Body Measurements

  • bdims body measurement dataset, available here
  • 507 physically active individuals (247 men, 260 women)
  • age, weight (wgt), height (hgt), sex, 21 body girth variables (e.g., hip girth)

Variability in Slopes

  • Slope can vary from sample to sample from the same population
  • We will explore this variability with random samples of 20 individuals from the bdims data

Observations of wgt vs. hgt and least squares line for first sample of 20.

Sample 1

term         estimate     std.error   statistic  p.value
(Intercept)  -186.302648  47.3081367  -3.938068  0.0009641
hgt             1.507037   0.2759548   5.461175  0.0000346

Observations of wgt vs. hgt and least squares lines for first two samples of 20.

Sample 2

term         estimate     std.error   statistic  p.value
(Intercept)  -118.634720  42.7824360  -2.772977  0.0125415
hgt             1.102269   0.2468542   4.465263  0.0002991

Observations of wgt vs. hgt and least squares lines for first three samples of 20.

Sample 3

term         estimate     std.error   statistic  p.value
(Intercept)  -117.085702  26.2230835  -4.464986  0.0002993
hgt             1.067681   0.1513807   7.052953  0.0000014

Least squares lines for 100 random samples of 20.

Dotplot of slopes of least squares lines from 100 random samples.

n    mean      sd
100  1.009732  0.220826
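
The sampling experiment above can be sketched in Python. This is a minimal illustration, not the code used to produce the plots: the bdims data are not loaded here, so the population is a synthetic stand-in of 507 (hgt, wgt) pairs whose generating parameters (171, 9.4, -105, 1.02, 9.3) are made-up assumptions, and the function names are hypothetical.

```python
import random
import statistics

def least_squares_slope(xs, ys):
    """Slope b1 of the least squares line for paired data."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def simulate_slopes(population, n_samples=100, sample_size=20, seed=1):
    """Draw repeated random samples and return the fitted slope of each."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_samples):
        sample = rng.sample(population, sample_size)  # without replacement
        xs = [x for x, _ in sample]
        ys = [y for _, y in sample]
        slopes.append(least_squares_slope(xs, ys))
    return slopes

# Synthetic stand-in for bdims (ASSUMPTION: not the real measurements).
rng = random.Random(0)
population = []
for _ in range(507):
    hgt = rng.gauss(171, 9.4)                    # hypothetical heights
    wgt = -105 + 1.02 * hgt + rng.gauss(0, 9.3)  # hypothetical weights
    population.append((hgt, wgt))

slopes = simulate_slopes(population)
print(len(slopes), statistics.fmean(slopes), statistics.stdev(slopes))
```

The printed standard deviation plays the same role as the sd column in the table above: it describes how much the slope changes from one sample of 20 to the next.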

Inference for a Slope

  • Recall that a linear model with one predictor has the form \[\widehat{y}=b_0+b_1x\]
  • \(b_0\) and \(b_1\) are point estimates of the intercept and slope based on the sample (statistics)
  • \(\beta_0\) and \(\beta_1\) are the population intercept and slope (parameters)
  • We can construct confidence intervals for the slope, \(\beta_1\)
  • We can conduct hypothesis tests for the slope
  • Typically, the null hypothesis is \[H_0:\beta_1=0\]

Test Statistic for Slope

  • The test statistic for a slope is a \(T\) statistic \[T=\frac{b_1-\beta_{1,0}}{SE}\]
  • \(\beta_{1,0}\) denotes the value of the slope under the null hypothesis (usually 0)
  • The formula for the standard error for the slope is \[SE = \frac{s}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}}\]
  • \(s\) estimates the standard deviation of the residuals, given by \[\begin{array}{rcl}s &=& \sqrt{\frac{SSE}{n-2}}\\ &=& \sqrt{\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{n-2}}\end{array}\]
  • Recall that \(SSE\) is the sum of squared errors, also called the residual sum of squares (\(RSS\))
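
The formulas above for \(s\), \(SE\), and \(T\) can be traced step by step in a short Python sketch. The five data points and the function name `slope_inference` are hypothetical and chosen only to keep the arithmetic easy to follow; in practice these values come from regression output.

```python
import math

def slope_inference(xs, ys, null_slope=0.0):
    """Return (b1, SE, T) for simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx                     # least squares slope
    b0 = y_bar - b1 * x_bar            # least squares intercept
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))       # residual standard deviation
    se = s / math.sqrt(sxx)            # standard error of the slope
    t = (b1 - null_slope) / se         # T statistic
    return b1, se, t

# Hypothetical toy data (not from bdims or restNYC).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b1, se, t = slope_inference(xs, ys)
print(f"b1 = {b1:.3f}, SE = {se:.3f}, T = {t:.3f}")
```

For these toy data, \(SSE=2.4\) and \(\sum(x_i-\bar{x})^2=10\), so \(s=\sqrt{2.4/3}\approx0.894\), \(SE\approx0.283\), and \(T\approx2.121\) with \(df=3\).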

Mathematical Model for Slope

Note

When the null hypothesis is true and the following conditions are met, the \(T\) statistic has a \(t\)-distribution with \(df=n-2\) degrees of freedom.

  1. Linearity
  2. Independent observations
  3. Normality of residuals
  4. Constant variability

One way to check conditions is to look at residual plots.

Confidence Interval using Mathematical Model

  • If the technical conditions are met we can also use a \(t\) distribution with \(df=n-2\) to calculate a confidence interval for the slope
  • The interval is \[b_1\pm t^{\ast}_{df}\times SE\]
  • The standard error is the same as we used for the hypothesis test (use regression output)
  • The value of \(t^{\ast}_{df}\) depends on the confidence level and degrees of freedom

Italian Restaurants in NYC

  • Is the price of a meal associated with food quality?
  • restNYC dataset
  • Customer survey from Italian restaurants in NYC (\(n\) = 168)
  • Price (USD, includes tip and drink)
  • Food (rating: 1 to 30)
  • Hypothesis test: \(H_0: \beta_1=0\), \(H_A: \beta_1\neq0\)
  • Confidence interval for slope

Scatter plot of Price vs Food with least squares line.

Fitting a Linear Model

  • Least squares regression line \[\widehat{Price}=-17.8+2.94\times Food\]
  • Recall that we can use Jamovi to find the equation for the regression line (see, e.g., J Lab 4)

Checking Conditions

Linearity? Independent observations? Normality of residuals? Constant variability?

Residual plot.

Hypothesis Test Using Mathematical Model

term         estimate   std.error  statistic  p.value
(Intercept)  -17.83215  5.8631197  -3.04141   0.0027375
Food           2.93896  0.2833809  10.37106   0.0000000
  • \(T\) = 10.4
  • \(df = 168-2=166\)
  • p-value < 0.001
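
As a check, the \(T\) statistic can be recovered from the Food row of the regression output by dividing the estimate by its standard error, with a null value of 0 for the slope: \[T=\frac{b_1-\beta_{1,0}}{SE}=\frac{2.93896-0}{0.2833809}\approx10.37\]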

Randomization Test

  • We can randomly permute the value of the response (Price) to simulate the null hypothesis
  • Each time, compute the slope of the relationship between Price and Food
  • We can do this with the Randomize module in Jamovi

Histogram of slopes from different random permutations of Price (null distribution).

p-value \(\approx0\)
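
The permutation idea can also be sketched in Python. This is an illustration only, not the Jamovi procedure: the restNYC data are not loaded here, so `food` and `price` are synthetic stand-ins generated around the fitted line with an assumed noise level, and the function names are hypothetical.

```python
import random

def least_squares_slope(xs, ys):
    """Slope b1 of the least squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def permutation_slopes(xs, ys, n_perm=1000, seed=1):
    """Shuffle the response repeatedly and refit the slope each time."""
    rng = random.Random(seed)
    shuffled = list(ys)
    slopes = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any link between xs and ys
        slopes.append(least_squares_slope(xs, shuffled))
    return slopes

def two_sided_p_value(observed, null_slopes):
    """Share of permuted slopes at least as extreme as the observed one."""
    extreme = sum(abs(s) >= abs(observed) for s in null_slopes)
    return extreme / len(null_slopes)

# Synthetic stand-in for restNYC (ASSUMPTION: not the real survey data).
rng = random.Random(0)
food = [rng.uniform(16, 25) for _ in range(168)]
price = [-17.8 + 2.94 * f + rng.gauss(0, 7) for f in food]

observed = least_squares_slope(food, price)
null_slopes = permutation_slopes(food, price)
p_value = two_sided_p_value(observed, null_slopes)
print(f"observed slope = {observed:.2f}, p-value = {p_value:.3f}")
```

Because shuffling the response simulates a world where the slope is 0, the permuted slopes center near 0, and an observed slope far out in the tail yields a p-value near 0.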

CI Using Mathematical Model

  • A 95% confidence interval for the slope is given by \[b_1\pm t^{\ast}_{df}\times SE\]

  • \(SE=0.283\) (from regression output)

  • We can use the Randomize module in Jamovi to calculate \(t^{\ast}_{166}=1.974\) for a 95% CI

  • The 95% CI is \(2.94\pm1.974\times0.283\).

  • We are 95% confident that the slope is between 2.38 and 3.50.
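
The interval arithmetic can be reproduced with a few lines of Python. The function name `slope_confidence_interval` is hypothetical; the inputs are the rounded values quoted on this slide.

```python
def slope_confidence_interval(b1, se, t_star):
    """Return (lower, upper) for the interval b1 +/- t_star * SE."""
    margin = t_star * se
    return b1 - margin, b1 + margin

# Values from the slide: slope estimate, standard error, critical value.
lower, upper = slope_confidence_interval(b1=2.94, se=0.283, t_star=1.974)
print(f"{lower:.3f} to {upper:.3f}")  # prints "2.381 to 3.499"
```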

CI Using Randomization

  • We can also calculate a 95% bootstrap percentile confidence interval
  • We can use the Randomize module in Jamovi

95% bootstrap percentile confidence interval: (2.38, 3.45)
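
A bootstrap percentile interval can be sketched in Python as follows. This is an illustration under stated assumptions, not the Jamovi procedure: `food` and `price` are synthetic stand-ins for the restNYC data, the noise level is assumed, and the function names are hypothetical, so the resulting interval will not match the one reported above.

```python
import random

def least_squares_slope(xs, ys):
    """Slope b1 of the least squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def bootstrap_percentile_ci(xs, ys, n_boot=1000, level=0.95, seed=1):
    """Resample (x, y) pairs with replacement; return percentile bounds."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # rows drawn with replacement
        slopes.append(least_squares_slope([xs[i] for i in idx],
                                          [ys[i] for i in idx]))
    slopes.sort()
    k = round((1 - level) / 2 * n_boot)  # slopes trimmed from each tail
    return slopes[k], slopes[n_boot - k - 1]

# Synthetic stand-in for restNYC (ASSUMPTION: not the real survey data).
rng = random.Random(0)
food = [rng.uniform(16, 25) for _ in range(168)]
price = [-17.8 + 2.94 * f + rng.gauss(0, 7) for f in food]

lower, upper = bootstrap_percentile_ci(food, price)
print(f"95% bootstrap percentile CI: ({lower:.2f}, {upper:.2f})")
```

Pairs are resampled together so that each bootstrap sample keeps the observed relationship between the two variables; the middle 95% of the resulting slopes forms the interval.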

Conclusions

  • There is convincing evidence that there is an association between price and food rating in NYC Italian restaurants (p-value < 0.001)
  • We are 95% confident that the slope is between 2.38 and 3.45, meaning that, on average, the price of a meal is between $2.38 and $3.45 higher for each additional point in the food rating.
  • We do not know if this is a random sample, so we should be careful about generalizing the results
  • This is an observational study, so we cannot conclude a cause-and-effect relationship between the variables