Confidence Intervals with Bootstrapping

IMS2 Ch. 12
Math 115

Yurk

Sampling Distribution

  • A sampling distribution is the distribution we would obtain if we could select samples of the same size again and again from the same population, calculating the value of the statistic of interest each time
  • Much of inferential statistics is based on being able to approximate sampling distributions
  • We rarely have the ability to select many samples from the same population (if we did we would usually just select a larger sample!)
  • However, we can make up a population and repeatedly sample from it to test different statistical ideas

Candidate X

  • Candidate X is running for president in the US
  • Candidate X’s campaign can only afford to poll a sample of 30 voters
  • They would like to use that sample to estimate the proportion of voters who support Candidate X in the whole population (\(p\))
  • Let’s make up a population in which 60% of voters support Candidate X for president (parameter: \(p = 0.6\))
  • The corresponding statistic is the proportion of voters in the sample who support Candidate X (statistic: \(\hat{p}\))
  • Using a computer, we can repeatedly select random samples of 30 from the theoretical population to see how \(\hat{p}\) might vary from sample to sample
  • Let’s do this 100 times
  • What does the sampling distribution of \(\hat{p}\) look like?

Sampling distribution. Proportions for 100 samples of 30 from a population with 60% (dashed vertical line) support for Candidate X.
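The repeated-sampling experiment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the seed is arbitrary, and the names `P_TRUE`, `SAMPLE_SIZE`, `NUM_SAMPLES`, and `sample_proportion` are made up for this sketch, so the exact proportions will differ from those in the figure.

```python
import random

random.seed(1)  # arbitrary seed, used only so the run is repeatable

P_TRUE = 0.6        # made-up population proportion from the slides
SAMPLE_SIZE = 30    # voters per simulated poll
NUM_SAMPLES = 100   # number of simulated polls


def sample_proportion(p, n):
    """Draw n voters from a population with support p and return p-hat."""
    supporters = sum(1 for _ in range(n) if random.random() < p)
    return supporters / n


# One p-hat per simulated poll; together they approximate the sampling distribution.
p_hats = [sample_proportion(P_TRUE, SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

print("smallest p-hats:", sorted(p_hats)[:5])
print("largest p-hats: ", sorted(p_hats)[-5:])
```

Plotting `p_hats` as a dot plot or histogram should give a picture similar in shape to the figure, centered near 0.6.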

Describing the Sampling Distribution

  • We can summarize the sampling distribution with a mean and standard deviation
  • The sampling distribution is centered at the true proportion of yes votes for Candidate X (\(p = 0.6\)), so the mean of our 100 simulated proportions is close to 0.6
  • The standard deviation of a statistic’s sampling distribution has a special name, the standard error (SE)
  • The standard error for the proportion of yes votes for Candidate X is SE = 0.094.
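As a rough check on these two summaries, the sketch below simulates 100 polls of 30 voters and computes the mean and standard deviation of the resulting proportions. The seed and variable names are assumptions made for illustration; because the random draws differ from those behind the slides, the standard error it prints should be near, but not exactly equal to, 0.094.

```python
import random
import statistics

random.seed(2)  # arbitrary seed for repeatability

P_TRUE, SAMPLE_SIZE, NUM_SAMPLES = 0.6, 30, 100

p_hats = []
for _ in range(NUM_SAMPLES):
    # Each voter supports Candidate X with probability P_TRUE.
    supporters = sum(1 for _ in range(SAMPLE_SIZE) if random.random() < P_TRUE)
    p_hats.append(supporters / SAMPLE_SIZE)

mean_p_hat = statistics.mean(p_hats)       # center of the simulated sampling distribution
standard_error = statistics.stdev(p_hats)  # its standard deviation, i.e. the SE

print(f"mean of p-hat:  {mean_p_hat:.3f}")
print(f"standard error: {standard_error:.3f}")
```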

A more realistic scenario (one sample)

  • Suppose you work for Candidate X’s campaign and want to estimate the proportion of US voters that support Candidate X
  • You do not know that the true value is \(p=0.6\)
  • You conduct a poll in which you select a single sample of 30 voters

Results of the poll

  • You find that 21 people in the sample support Candidate X. Thus, \(\hat{p} =21/30 = 0.7\)
  • This gives you a point estimate for the proportion of voters who support Candidate X (\(p\approx 0.7\))
  • However, we know that there will be variability from sample to sample, creating uncertainty in our estimate
  • We can express that uncertainty by making an interval estimate instead
  • E.g., we might estimate Candidate X’s support to be between 0.6 and 0.8 based on how much the statistic is expected to vary from sample to sample
  • Calculating a 95% confidence interval is one way to find such an estimate
  • To calculate one from a single sample, we need to approximate the sampling distribution
  • We can use a randomization-based approach called bootstrapping

Bootstrapping

  • In practice, our single sample is the best approximation we have of the population
  • We can simulate repeated random sampling from the population by randomly sampling from the sample with replacement
  • We select bootstrap samples that are the same size as the original sample
  • The variability of the statistic across bootstrap samples tends to be very close to its variability in the sampling distribution
  • Let’s select 100 bootstrap samples using the results from our one poll
  • Later, we will learn how to do this using Jamovi
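For readers who want to see the resampling step spelled out before using Jamovi, here is a minimal Python sketch. It assumes the poll is coded as 1 (supports Candidate X) and 0 (does not); the coding, the seed, and the helper name `bootstrap_sample` are choices made for this illustration only.

```python
import random

random.seed(3)  # arbitrary seed for repeatability

# Original poll from the slides: 21 supporters (1) and 9 non-supporters (0).
poll = [1] * 21 + [0] * 9
NUM_BOOTSTRAP = 100


def bootstrap_sample(sample):
    """Resample with replacement, keeping the original sample size."""
    return random.choices(sample, k=len(sample))


bootstrap_samples = [bootstrap_sample(poll) for _ in range(NUM_BOOTSTRAP)]

first = bootstrap_samples[0]
print(f"first bootstrap sample: {sum(first)} of {len(first)} support Candidate X")
```

Because sampling is done with replacement, some voters from the original poll appear more than once in a bootstrap sample and others not at all, which is what lets the count of supporters vary from one bootstrap sample to the next.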

100 Bootstrap samples

Bootstrap distribution

  • Calculate \(\hat{p}\) for each bootstrap sample
  • The resulting distribution is called a bootstrap distribution for \(\hat{p}\)
  • The variability of this distribution is approximately the same as the variability of the sampling distribution

Distribution of bootstrapped proportions. Each dot represents the proportion of support for Candidate X from a single bootstrap sample of 30.

Bootstrap estimate of SE

  • We can estimate the standard error using the standard deviation of the bootstrap distribution
  • Using the bootstrap distribution, we get a standard error estimate of \(SE_{boot} = 0.084\)
  • Compare this with the standard error we measured earlier using the actual sampling distribution, \(SE = 0.094\)
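The bootstrap estimate of the standard error can be sketched as follows: compute \(\hat{p}\) for each bootstrap sample, then take the standard deviation of those values. The 0/1 coding of the poll, the seed, and the variable names are assumptions for this illustration, and the printed value will differ slightly from the 0.084 reported above because the resamples are different.

```python
import random
import statistics

random.seed(4)  # arbitrary seed for repeatability

poll = [1] * 21 + [0] * 9   # 21 of 30 support Candidate X, so p-hat = 0.7
NUM_BOOTSTRAP = 100

boot_p_hats = []
for _ in range(NUM_BOOTSTRAP):
    resample = random.choices(poll, k=len(poll))    # sample with replacement
    boot_p_hats.append(sum(resample) / len(resample))

# The standard deviation of the bootstrap distribution estimates the SE.
se_boot = statistics.stdev(boot_p_hats)
print(f"bootstrap SE estimate: {se_boot:.3f}")
```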

Bootstrap 95% Confidence Interval

  • The 95% bootstrap percentile confidence interval for a parameter \(p\) is obtained by calculating the 2.5th and 97.5th percentiles of the bootstrapped statistics.

Distribution of bootstrapped proportions with 95% bootstrap CI.

  • These values are 0.516 and 0.867 for the bootstrap distribution of proportions of yes votes for Candidate X
  • We say we are 95% confident that the value of the true proportion of yes votes is between 0.516 and 0.867
  • Does the interval include the true value of \(p=0.6\)?
  • The confidence interval reflects what we think are plausible values for the parameter
  • For example, it is plausible that 80% of people plan to vote for Candidate X, because 0.8 is between 0.516 and 0.867
  • It is not plausible that 50% (or less) of people plan to vote for Candidate X, because 0.5 is not between 0.516 and 0.867
  • This would be good news for Candidate X.
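A percentile interval of this kind can be computed with a short Python sketch. Several things here are assumptions rather than part of the slides: the `percentile` helper (which uses linear interpolation between sorted values, one of several common percentile conventions), the seed, and the use of 1000 resamples instead of 100 to make the percentiles steadier. The endpoints it prints should be close to, but not necessarily equal to, 0.516 and 0.867.

```python
import random

random.seed(5)  # arbitrary seed for repeatability

poll = [1] * 21 + [0] * 9
NUM_BOOTSTRAP = 1000  # assumption: more resamples than the 100 used in the slides


def percentile(values, q):
    """q-th percentile (0 to 100), interpolating linearly between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


boot_p_hats = [
    sum(random.choices(poll, k=len(poll))) / len(poll)
    for _ in range(NUM_BOOTSTRAP)
]

# The middle 95% of the bootstrap distribution.
ci_lower = percentile(boot_p_hats, 2.5)
ci_upper = percentile(boot_p_hats, 97.5)
print(f"95% bootstrap percentile CI: ({ci_lower:.3f}, {ci_upper:.3f})")
```

Any proportion between `ci_lower` and `ci_upper` is then treated as a plausible value for \(p\), as in the discussion above.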

Other Confidence Levels

  • A 90% CI lies between the 5th and 95th percentiles of the bootstrap distribution (the middle 90%)

Distribution of bootstrapped proportions with 90% bootstrap CI.

  • The 90% bootstrap percentile confidence interval for the proportion of yes votes for Candidate X is between 0.533 and 0.833

  • The 90% CI is narrower than the 95% CI

    • 95% CI: (0.516, 0.867)
    • 90% CI: (0.533, 0.833)
  • With a higher confidence level (95%), the interval needs to be wider for us to be more confident that it contains the true value of the parameter
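To see the effect of the confidence level directly, the sketch below computes 90% and 95% percentile intervals from the same set of bootstrap proportions. The `percentile` and `percentile_ci` helpers, the seed, and the choice of 1000 resamples are assumptions made for this illustration; the endpoints will be near, but not identical to, those listed above.

```python
import random

random.seed(6)  # arbitrary seed for repeatability

poll = [1] * 21 + [0] * 9
NUM_BOOTSTRAP = 1000  # assumption: more resamples than the 100 used in the slides


def percentile(values, q):
    """q-th percentile (0 to 100), interpolating linearly between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def percentile_ci(values, level):
    """Middle `level` percent of the values, e.g. level=90 gives the 5th and 95th percentiles."""
    tail = (100 - level) / 2
    return percentile(values, tail), percentile(values, 100 - tail)


boot_p_hats = [
    sum(random.choices(poll, k=len(poll))) / len(poll)
    for _ in range(NUM_BOOTSTRAP)
]

ci_90 = percentile_ci(boot_p_hats, 90)
ci_95 = percentile_ci(boot_p_hats, 95)
width_90 = ci_90[1] - ci_90[0]
width_95 = ci_95[1] - ci_95[0]

print(f"90% CI: ({ci_90[0]:.3f}, {ci_90[1]:.3f}), width {width_90:.3f}")
print(f"95% CI: ({ci_95[0]:.3f}, {ci_95[1]:.3f}), width {width_95:.3f}")
```

Because both intervals come from the same bootstrap distribution, the 90% interval always sits inside the 95% interval.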

Properties of Confidence Intervals

  • The confidence interval will contain the observed value of the statistic (usually near the center of the interval)
  • Larger sample sizes result in narrower confidence intervals (we are more confident that the parameter is close to the point estimate if the point estimate comes from a larger sample)
  • If we were to repeatedly sample from the population and calculate a 95% confidence interval from each sample, about 95% of the confidence intervals would contain the true value of the parameter
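The last two properties can be checked by simulation. The sketch below is an illustration under several assumptions: the helper names, the seed, the use of 200 resamples per interval and 200 simulated polls (kept small so it runs quickly), and a hypothetical poll of 120 voters with the same \(\hat{p} = 0.7\), which does not appear in the slides. With so few repetitions the estimated coverage is itself noisy, so expect a value in the neighborhood of 95% rather than exactly 95%.

```python
import random

random.seed(7)  # arbitrary seed for repeatability


def percentile(values, q):
    """q-th percentile (0 to 100), interpolating linearly between sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def bootstrap_ci(sample, level=95, num_bootstrap=200):
    """Bootstrap percentile CI for the proportion of 1s in `sample`."""
    n = len(sample)
    boot_p_hats = [sum(random.choices(sample, k=n)) / n for _ in range(num_bootstrap)]
    tail = (100 - level) / 2
    return percentile(boot_p_hats, tail), percentile(boot_p_hats, 100 - tail)


# Larger samples give narrower intervals (same p-hat = 0.7 in both polls).
small_poll = [1] * 21 + [0] * 9     # the poll of 30 from the slides
large_poll = [1] * 84 + [0] * 36    # hypothetical poll of 120
low_s, high_s = bootstrap_ci(small_poll)
low_l, high_l = bootstrap_ci(large_poll)
width_small = high_s - low_s
width_large = high_l - low_l
print(f"width with n=30:  {width_small:.3f}")
print(f"width with n=120: {width_large:.3f}")

# Coverage: draw many polls from the made-up population and count how often
# the 95% interval built from each poll contains the true p.
P_TRUE, SAMPLE_SIZE, NUM_POLLS = 0.6, 30, 200
covered = 0
for _ in range(NUM_POLLS):
    poll = [1 if random.random() < P_TRUE else 0 for _ in range(SAMPLE_SIZE)]
    low, high = bootstrap_ci(poll)
    if low <= P_TRUE <= high:
        covered += 1
coverage = covered / NUM_POLLS
print(f"share of 95% CIs containing p: {coverage:.2f}")
```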