Inference for Linear Regression

Topic 18

Math 115

Why Inference for the Slope?

Previously, we found a linear model:

\[\widehat{wgt} = b_0 + b_1 \times hgt\]

But is this relationship real or just due to chance?

  • \(b_1\) is a statistic (calculated from our sample)
  • \(\beta_1\) is the parameter (true population slope, unknown)
  • Maybe \(\beta_1 = 0\) and the nonzero \(b_1\) we observed is just sampling variability?

Hypotheses for Slope

Null hypothesis: No linear relationship

\[H_0: \beta_1 = 0\]

Alternative hypothesis: There IS a linear relationship

\[H_A: \beta_1 \neq 0\]

(Can also be one-sided: \(\beta_1 > 0\) or \(\beta_1 < 0\))

The T Statistic for Slope

Same pattern as other T tests:

\[T = \frac{\text{statistic} - \text{null value}}{SE}\]

For the slope:

\[T = \frac{b_1 - 0}{SE} = \frac{b_1}{SE}\]

Software provides \(b_1\) and \(SE\), so you can verify \(T\) yourself!
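As a minimal sketch of this arithmetic in Python (the function name `t_statistic` and the example numbers are hypothetical, chosen only to illustrate the pattern):

```python
def t_statistic(estimate, null_value, se):
    """Standardized distance between an estimate and its null value."""
    if se <= 0:
        raise ValueError("standard error must be positive")
    return (estimate - null_value) / se


# Hypothetical slope estimate and standard error (not from any real output).
b1 = 1.5
se_b1 = 0.5

# For the slope test the null value is 0, so T reduces to b1 / SE.
t_slope = t_statistic(b1, 0.0, se_b1)
print(t_slope)  # 3.0
```

The same function covers other T tests by changing the null value.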

Degrees of Freedom

For inference about a slope:

\[df = n - 2\]

Why \(n - 2\)?

  • We estimated 2 parameters (\(\beta_0\) and \(\beta_1\), using \(b_0\) and \(b_1\)) from the data
  • Each estimated parameter “uses up” one degree of freedom

Compare to one-sample T: \(df = n - 1\) (estimated 1 parameter, \(\mu\), using \(\bar{x}\))
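To see where \(n - 2\) enters the calculation, here is a minimal Python sketch that fits a least-squares line to a tiny made-up data set and computes the standard error of the slope. The data, the function name `slope_inference`, and the standard textbook formula \(SE = \sqrt{\frac{SSE/(n-2)}{\sum (x_i - \bar{x})^2}}\) are not from these slides; they are assumptions included for illustration.

```python
import math


def slope_inference(xs, ys):
    """Least-squares fit plus the slope's standard error and degrees of freedom."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 points so that df = n - 2 is positive")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx              # slope estimate
    b0 = y_bar - b1 * x_bar     # intercept estimate
    # Sum of squared residuals around the fitted line.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    df = n - 2                  # two estimated coefficients: b0 and b1
    se_b1 = math.sqrt(sse / df / sxx)
    return b0, b1, se_b1, df


# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [1, 3, 2, 5, 4]

b0, b1, se_b1, df = slope_inference(xs, ys)
print(round(b1, 2), round(se_b1, 3), df)  # 0.8 0.346 3
```

Dividing the squared residuals by \(n - 2\) rather than \(n\) is where the two "used up" degrees of freedom appear.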

Italian Restaurants in NYC

Research question: Is the price of a meal associated with food quality rating?

  • Data: Customer surveys from 168 Italian restaurants in NYC
  • Response (\(y\)): Price (in USD, includes tip and drink)
  • Predictor (\(x\)): Food rating (scale: 1 to 30)

Hypotheses:

  • \(H_0: \beta_1 = 0\) (no relationship between food rating and price)
  • \(H_A: \beta_1 \neq 0\) (there is a relationship)

The Data

The regression line: \(\widehat{Price} = -17.8 + 2.94 \times Food\)

The Null Distribution

If conditions are met and \(H_0\) is true (\(\beta_1 = 0\)), the test statistic

\[T = \frac{b_1}{SE}\]

follows a t-distribution with \(df = n - 2\).

For our example: \(df = 168 - 2 = 166\)

Conditions for Inference

Before using the T-distribution, check:

  1. Linearity: Relationship is approximately linear
  2. Independence: Observations are independent
  3. Normality: Residuals are approximately normal
  4. Constant variance: Spread of residuals is consistent

We check these using residual plots.

Checking Conditions

  • No obvious curve → Linearity ✓
  • Residuals scattered symmetrically above/below 0 → Normality plausible ✓
  • Roughly constant spread → Constant variance ✓

Software Output

Term          Estimate   Std. Error   T statistic   P-value
(Intercept)     -17.83        5.863         -3.04     0.003
Food              2.94        0.283         10.37   < 0.001

Key values for the slope (Food row):

  • \(b_1 = 2.94\) (the slope estimate)
  • \(SE = 0.283\) (standard error)
  • \(T = 10.37\) (test statistic)
  • p-value < 0.001

Verifying the T Statistic

From the output: \(b_1 = 2.94\) and \(SE = 0.283\)

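\[T = \frac{b_1}{SE} = \frac{2.94}{0.283} \approx 10.39\]

Compare to T from output: 10.37 ✓ (the small difference arises because the software computes T from unrounded values of \(b_1\) and \(SE\))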

You should be able to verify T from the slope and SE!

The T Distribution

P-value < 0.001
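To see why the p-value is so small, the sketch below approximates the two-sided tail area of a t-distribution with 166 degrees of freedom beyond \(T = 10.37\). It uses only the Python standard library and a simple midpoint-rule integration; the function names and the number of integration steps are assumptions for illustration, and statistical software would use a more precise routine.

```python
import math


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t_stat, df, steps=20000):
    """Approximate P(|T| >= |t_stat|) by integrating the upper tail."""
    t_abs = abs(t_stat)
    h = 1.0 / steps
    tail = 0.0
    for i in range(steps):
        u = (i + 0.5) * h            # midpoint of each subinterval of (0, 1)
        x = t_abs + u / (1.0 - u)    # maps (0, 1) onto (t_abs, infinity)
        tail += t_pdf(x, df) / (1.0 - u) ** 2  # includes dx/du
    return 2.0 * tail * h            # double the upper tail for two sides


p_value = two_sided_p_value(10.37, 166)
print(p_value < 0.001)  # True
```

For a one-sided alternative, the upper-tail area alone (half of this value when \(T > 0\)) would be used instead.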

Hypothesis Test Conclusion

Results: \(T = 10.37\), \(df = 166\), p-value < 0.001

Decision: Reject \(H_0\)

Conclusion: There is convincing evidence that food rating is associated with meal price at NYC Italian restaurants. Higher food ratings are associated with higher prices.

Note: This is an observational study, so we cannot conclude that higher food ratings cause higher prices.

CI Formula

We can also estimate \(\beta_1\) using a confidence interval.

Same structure as other T-based CIs:

\[b_1 \pm t^*_{df} \times SE_{b_1}\]

where \(t^*_{df}\) is the critical value for the desired confidence level.

For 95% CI: Use \(t^*\) with \(df = n - 2\)

Computing the CI

From the regression output: \(b_1 = 2.94\), \(SE = 0.283\), \(df = 166\)

For 95% CI: \(t^*_{166} = 1.974\)

\[2.94 \pm 1.974 \times 0.283\]

\[2.94 \pm 0.56\]

95% CI: (2.38, 3.50)
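The same computation as a short Python sketch, using the values from the regression output and the critical value \(t^*_{166} = 1.974\) given above (the function name `slope_confidence_interval` is hypothetical):

```python
def slope_confidence_interval(b1, se, t_star):
    """Return (lower, upper) for b1 +/- t_star * se."""
    margin = t_star * se
    return b1 - margin, b1 + margin


# Slope estimate, standard error, and critical value from the example.
lower, upper = slope_confidence_interval(2.94, 0.283, 1.974)
print(f"({lower:.2f}, {upper:.2f})")  # (2.38, 3.50)
```

A different confidence level only changes `t_star`; the estimate and standard error stay the same.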

Interpreting the CI

95% CI for slope: (2.38, 3.50)

We are 95% confident that each additional point in food rating is associated with a meal price that is between $2.38 and $3.50 higher, on average.

Note: The CI does not include 0, consistent with rejecting \(H_0\).
