Three of the predictors (job_city, honors, race) are statistically significant
Meaning of the p-value:
Focusing on the race predictor as an example: if received_callback is unrelated to race (i.e. if the null hypothesis \(H_0: \beta_{race}=0\) is true) and the model already contains the other predictors, it is very unlikely (p-value < 0.0001) to obtain a value of \(b_{race}\) at least as far from 0 as 0.442
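As a rough illustration of how such a p-value arises, the sketch below computes a two-sided Wald p-value from an estimate and its standard error. The estimate 0.442 is the value quoted above; the standard error `se_race` is not given in these notes and is an assumed value chosen only for illustration.

```r
b_race  <- 0.442   # estimate quoted above
se_race <- 0.108   # ASSUMED standard error, for illustration only

# Wald z statistic: how many standard errors the estimate is from 0
z <- b_race / se_race

# Two-sided p-value under H0: beta_race = 0 (standard normal reference)
p_value <- 2 * pnorm(-abs(z))

print(c(z = z, p_value = p_value))
```

Under this assumed standard error the estimate lies about four standard errors from 0, which is why the p-value is so small.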
Manual Variable Selection
Multicollinearity is not a problem here, because the data are from an experiment
The variance inflation factors (VIFs) are all near 1 (the smallest possible value)
library(car)
# variance inflation factor for each predictor in the fitted model
vif(mod)
          job_city    college_degree  years_experience            honors
          1.153433          1.070009          1.190619          1.065429
          military has_email_address              race               sex
          1.190074          1.136914          1.000901          1.130979
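To see why independent predictors give VIFs near 1, recall that for a linear model with numeric predictors the VIF of predictor \(j\) is \(1/(1-R_j^2)\), where \(R_j^2\) is the R-squared from regressing predictor \(j\) on the other predictors. The sketch below computes this by hand on simulated data; the variables `x1`, `x2`, `x3` are hypothetical and are not the resume data. For a logistic model, `car::vif()` works from the covariance matrix of the coefficient estimates, so its values may differ slightly from this formula.

```r
# Simulated, independently generated predictors (hypothetical; not the resume data)
set.seed(1)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rbinom(n, 1, 0.5))

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors
manual_vif <- sapply(names(dat), function(v) {
  others <- setdiff(names(dat), v)
  r2 <- summary(lm(reformulate(others, response = v), data = dat))$r.squared
  1 / (1 - r2)
})

print(round(manual_vif, 3))
```

Because the simulated predictors are unrelated to each other, each \(R_j^2\) is close to 0 and each VIF is close to 1.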
Removing insignificant coefficients
The coefficient for sex is not significantly different from 0, so we drop it from the model
The coefficients do not change much, because the predictors are not collinear
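Dropping a term and comparing the remaining slopes can be sketched with `update()`. The data below are simulated with independently randomized predictors; the names `race`, `honors`, `sex`, and `callback` are stand-ins that mimic the setting and are not the actual resume data.

```r
set.seed(2)
n <- 2000

# Independently randomized binary predictors, as in an experiment (simulated)
sim <- data.frame(race   = rbinom(n, 1, 0.5),
                  honors = rbinom(n, 1, 0.3),
                  sex    = rbinom(n, 1, 0.5))

# Simulated response: sex has no effect in the data-generating model
sim$callback <- rbinom(n, 1, plogis(-2 + 0.45 * sim$race + 0.8 * sim$honors))

full    <- glm(callback ~ race + honors + sex, data = sim, family = binomial)
reduced <- update(full, . ~ . - sex)   # refit without the sex term

# Compare the slopes that appear in both models
kept <- c("race", "honors")
print(round(cbind(full = coef(full)[kept], reduced = coef(reduced)[kept]), 3))
```

With nearly uncorrelated predictors, removing one term leaves the other slopes almost unchanged; with collinear predictors the remaining slopes could shift noticeably.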
All but one of the predictors (exclaim_mess) are statistically significant
Email Classification
We would like to use the model to predict whether an email is spam or not
The model predicts the log-odds that an email is spam
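Converting a predicted log-odds value \(\eta\) to a probability uses the inverse logit, \(p = e^{\eta}/(1+e^{\eta})\). A minimal sketch with made-up log-odds values (not output from the email model):

```r
# Hypothetical log-odds values; qlogis(0.1) is the log-odds of probability 0.1
log_odds <- c(-3, qlogis(0.1), 0, 2)

# Inverse logit written out, and the equivalent base R function
p_manual  <- exp(log_odds) / (1 + exp(log_odds))
p_builtin <- plogis(log_odds)

print(round(cbind(log_odds, p_manual, p_builtin), 3))
```

For a model fitted with `glm(..., family = binomial)`, `predict(model, type = "response")` returns probabilities directly, while the default `type = "link"` returns log-odds.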
One way to classify emails is to label an email as spam if the predicted probability exceeds 0.5
Most emails, including spam, have a low predicted probability of being spam
If we used a threshold of 0.5, only 1% of emails in the data would be classified as spam
Instead, we will use a lower threshold, 0.1
This allows us to classify more emails as spam
We expect to correctly classify a larger number of emails as spam
However, we also expect to incorrectly classify emails as spam that are not actually spam
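The trade-off between the two thresholds can be sketched with a pair of confusion tables. The fitted probabilities below are simulated to be mostly small, loosely mimicking the situation described; they are hypothetical and are not the email data.

```r
set.seed(3)
n <- 4000

# Simulated fitted probabilities, mostly small (hypothetical; not the email data)
p_hat <- rbeta(n, 1, 9)
spam  <- rbinom(n, 1, p_hat)   # true labels drawn to match the probabilities

# Cross-tabulate predicted class against the actual class for a given threshold
classify <- function(threshold) {
  predicted <- factor(p_hat > threshold, levels = c(FALSE, TRUE))
  table(predicted = predicted, actual = spam)
}

tab_50 <- classify(0.5)
tab_10 <- classify(0.1)
print(tab_50)
print(tab_10)
```

In each table, the row `TRUE` counts emails flagged as spam. Lowering the threshold moves emails into that row in both columns: more real spam is caught, and more non-spam is flagged by mistake.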
Predicted probability
We can visualize these data by plotting the true classification of the emails against the model's fitted probabilities
The predicted probability that each of the 3921 emails is spam. Points have been jittered to reduce overplotting
Quality of the Model
For example, we might ask: if we look at emails that we modeled as having a 10% chance of being spam, do we find that about 10% of them actually are spam? We can check this for groups of the data by constructing a plot as follows:
Bucket the data into groups based on their predicted probabilities.
Compute the average predicted probability for each group.
Compute the observed probability for each group, along with a 95% confidence interval for the true probability of success for the observations in that group.
Plot the observed probabilities (with 95% confidence intervals) against the average predicted probabilities for each group. If the model does a good job describing the data, the plotted points should fall close to the line \(y=x\), since the predicted probabilities should be similar to the observed probabilities.
We can use the confidence intervals to roughly gauge whether anything might be amiss.
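The four steps above can be sketched in base R as follows. The data are simulated so that the predicted probabilities are well calibrated by construction; the choice of ten quantile-based buckets and of a Wald interval for the 95% confidence interval are assumptions made for this sketch, not choices stated in these notes.

```r
set.seed(4)
n <- 4000
p_hat <- rbeta(n, 1, 9)        # simulated fitted probabilities (hypothetical)
spam  <- rbinom(n, 1, p_hat)   # outcomes drawn so the predictions are calibrated

# Step 1: bucket by predicted probability (ten quantile-based buckets, an assumption)
breaks <- quantile(p_hat, probs = seq(0, 1, by = 0.1))
bucket <- cut(p_hat, breaks = breaks, include.lowest = TRUE)

# Step 2: average predicted probability per bucket
avg_pred <- tapply(p_hat, bucket, mean)

# Step 3: observed proportion per bucket with a 95% Wald interval
obs   <- tapply(spam, bucket, mean)
n_b   <- tapply(spam, bucket, length)
se    <- sqrt(obs * (1 - obs) / n_b)
lower <- obs - 1.96 * se
upper <- obs + 1.96 * se

# Step 4: observed vs. predicted, with the y = x reference line (dashed)
lim <- c(0, max(upper))
plot(avg_pred, obs, xlim = lim, ylim = lim,
     xlab = "Predicted probability", ylab = "Observed probability")
segments(avg_pred, lower, avg_pred, upper)
abline(0, 1, lty = 2)
```

A bucket whose interval sits clearly above or below the dashed line would indicate that the model under- or over-predicts in that range of probabilities.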
Plot of the observed probabilities
The dashed line (\(y=x\)) falls within the 95% confidence interval of every bucket, suggesting the logistic fit is reasonable.