J Lab 4

Inference for proportions

Author

Yurk

Portions of this lab are based on an R lab from Chapter 23 of the Introduction to Modern Statistics (2e) textbook by Mine Çetinkaya-Rundel and Johanna Hardin.

Youth Risk Behavior Surveillance System

In this lab you will explore the yrbss data set, available here or on our Moodle page. The data set is from the Youth Risk Behavior Surveillance System (YRBSS) survey, which tracks high school student behaviors that could impact health. The data set includes a subset of the variables from the full YRBSS survey.

Download the yrbss.csv file to your computer, and open it in Jamovi. How many cases/observations are there? What does each row represent?

You will only use a subset of the variables for this lab, described in the following table. In Jamovi, set up the Measure type and Data type for each variable as shown in the table. All of these variables seem like they should be numeric. Look through the values in Jamovi to see why they are being set up this way.

Variable                   Measure type   Data type   Description
helmet_12m                 Nominal        Text        How often wore helmet when biking in last 12 months
text_while_driving_30d     Nominal        Text        How many days texting while driving in the last 30 days
strength_training_7d       Continuous     Integer     How many days strength training in the last 7 days
school_night_hours_sleep   Nominal        Text        How many hours sleep in typical school night

Single Proportion

In this part of the lab you will focus on the variable text_while_driving_30d, which measures the number of days a student texted while driving in the last 30 days. You will consider this variable for a particular subset of students – those who reported participating in another risky behavior, not wearing a helmet while biking.

Let’s start by filtering the data. You will focus only on students who never wore a helmet while biking in the last 12 months. In Jamovi, filter the data so that it only includes observations where helmet_12m is “never”. You can refer to the instructions in J Lab 2 if you need a refresher on how to filter data in Jamovi.

Next you will do some additional filtering. We are not interested in the value of text_while_driving_30d for students who did not respond to the question or that responded that they “did not drive” during the last 30 days. First, filter the data so that it only includes observations where text_while_driving_30d is not missing. You can do this by retaining observations for which text_while_driving_30d != NA. Missing values are indicated by NA in Jamovi, and != means not equal. Then, filter the data so that it only includes observations where text_while_driving_30d is not “did not drive”.

Figure 1 shows the Filter interface in Jamovi with the last two filters applied. Note that you will need to determine how to apply the first filter (the hidden “Filter 1”) on your own. This filter should retain only students that never wore a helmet.

Figure 1: Filtering the data to include only students who never wore a helmet while biking and whose response to the question about texting while driving indicated that they had driven.

Scroll through your filtered data set in Jamovi and confirm that the correct cases have been retained. How many observations are in the filtered data set? Check to make sure there are 4,387.

Now that you have filtered the data down to the relevant observations, you will create a new variable called text_ind that has a value of “yes” if the student texted while driving every day in the last 30 days, and a value of “no” otherwise. You can create this variable by transforming the text_while_driving_30d variable. Transforming a variable was also introduced in J Lab 2. Refer to that lab if you need a refresher.

The correct recoding condition is shown in Figure 2.

Figure 2: Creating a new variable that indicates whether a student texted while driving every day in the last 30 days.

Confirm that the new variable text_ind has been created correctly by scrolling through your data set in Jamovi. Compare its values to the original text_while_driving_30d variable to make sure the recoding was done correctly.

Next, you will calculate the proportion of students who texted while driving every day in the last 30 days in the filtered data. This proportion is a point estimate of the population proportion of students who texted while driving every day in the last 30 days, for the students who never wore a helmet while riding a bike.

Use the Descriptives analysis in Jamovi to calculate the proportion of students who texted while driving every day in the last 30 days, using the text_ind variable you created. Also, create a bar plot showing the distribution of the text_ind variable. If you need a refresher on how to calculate a proportion or create a bar plot in Jamovi, you can refer back to the instructions in J Lab 2. Your results should be similar to the following:


 DESCRIPTIVES

 FREQUENCIES

 Frequencies of text_ind                              
 ──────────────────────────────────────────────────── 
   text_ind    Counts    % of Total    Cumulative %   
 ──────────────────────────────────────────────────── 
   no            3924      89.44609        89.44609   
   yes            463      10.55391       100.00000   
 ──────────────────────────────────────────────────── 

What is the proportion of these students who reported texting while driving every day in the last 30 days?

Hypothesis test for a single proportion

Your friend says they know many teens that never wore a helmet while biking and says that about 12% of them text while driving every day. They claim that the proportion of US teenagers in this group that text every day must be about 0.12. You are skeptical of this claim and decide to test it using the YRBSS data. You test the following hypotheses at the \(\alpha = 0.05\) level: \[\begin{array}{ll} H_0: & p = 0.12\\ H_a: & p \neq 0.12\end{array}\]

You want to perform the hypothesis test two ways: using a mathematical model and using randomization. Before you can use a mathematical model, you need to check the validity conditions for the test. The observations in the data set are independent, so you just need to check the success-failure condition. Are the expected numbers of successes and failures both at least 10? Since you are testing the null hypothesis, you can use the null proportion of 0.12 to calculate the expected number of successes and failures in a sample of size 4,387. The expected number of successes (texting while driving every day) is \(0.12\times 4387 = 526.44\), and the expected number of failures is \(0.88\times 4387 = 3860.56\). Both of these values are greater than 10, so the success-failure condition is met.

Next, you compute the standard error of the sample proportion. The standard error of the sample proportion is \[SE = \sqrt{\frac{p(1-p)}{n}}.\] Of course, we do not know the true population proportion, \(p\). Since you are testing the null hypothesis you should use \(p_0=0.12\), the proportion under the null hypothesis. Thus, the standard error is \(SE = \sqrt{\frac{0.12\times 0.88}{4387}} = 0.0049\).

Then, you compute the \(Z\) score for the sample proportion. The \(Z\) score is \[Z = \frac{\hat{p} - p_0}{SE} = \frac{0.1055 - 0.12}{0.0049} = -2.959.\]

Finally, you compute the p-value for the test. The \(p\)-value is the probability of observing a \(Z\) score as extreme as the observed value if the null hypothesis is true. Since the technical conditions for using a normal distribution as a model for the null distribution are met, \(Z\) should follow an approximately standard normal distribution if \(H_0\) is true. Thus, you can calculate the p-value as an area under the density curve for the standard normal distribution.
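The hand calculation above can also be checked outside of Jamovi. The following Python sketch is not part of the lab's Jamovi workflow; it uses only the standard library, and the variable names are illustrative. It works with the unrounded sample proportion (463 out of 4,387), so its Z score differs slightly in the second decimal place from the value obtained with rounded inputs.

```python
from math import sqrt
from statistics import NormalDist

n = 4387          # number of cases in the filtered data
successes = 463   # cases with text_ind equal to "yes"
p0 = 0.12         # proportion under the null hypothesis

# Success-failure condition, checked with the null proportion
expected_successes = p0 * n
expected_failures = (1 - p0) * n
condition_met = min(expected_successes, expected_failures) >= 10

# Standard error under H0, Z score, and two-sided p-value
p_hat = successes / n
se = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se
p_value = 2 * NormalDist().cdf(-abs(z))  # area in both tails

print(f"condition met: {condition_met}")
print(f"SE = {se:.4f}, Z = {z:.3f}, two-sided p-value = {p_value:.4f}")
```

The two-sided p-value is the area below \(-|Z|\) doubled, which relies on the symmetry of the standard normal distribution.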

You can use the Model-Based Inference analysis in the Randomize module to calculate this area/p-value. If you have the Randomize module installed, you should see an icon labeled Randomize when you select the Analyses tab in Jamovi. Click on this icon, and select Model-Based Inference from the list. In the interface, make sure that the “Use standard normal distribution” button is selected. Then check the box labeled “Calculate area (p-value)”. Enter the \(Z\) score you calculated into the box labeled “Observed value (Z or T)”, and select “Both” tails to calculate a two-sided p-value. Figure 3 shows the Model-Based Inference interface after it has been set up correctly for the hypothesis test.

Figure 3: Calculating the p-value for a hypothesis test for a single proportion.

The p-value for the hypothesis test is 0.003, as shown in the table in Figure 3. The plot in the figure shows the standard normal distribution with the area corresponding to the p-value shaded in red. The shaded regions, which are small, are located in both tails because the test is two-sided. These represent values of \(Z\) that are less than -2.959 (the observed value) or greater than 2.959.

Next you use a randomization-based test of the null hypothesis. The randomization-based test approximates the null distribution of the sample proportion by simulating many samples from a population in which \(H_0\) is true. For a single proportion, the simulation is called a parametric bootstrap. It is equivalent to making a spinner with 12% of the area shaded red (representing a teenager that texts while driving every day) and 88% of the area shaded blue (representing a teenager that does not). These proportions correspond to a true null hypothesis. You spin the spinner 4,387 times (once for each teen in the sample), count the number of red spins, and then calculate the proportion of red spins. This simulates sampling 4,387 teenagers from a population in which \(H_0\) is true and calculating the proportion that text while driving every day. If you repeat this many times, you can build a null distribution of the sample proportion, which indicates how the sample proportion would vary if the null hypothesis were true.
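The spinner simulation can be sketched in Python as follows. This is an illustration of the idea only, not the code used by the Randomize module; the function name and the seed value are hypothetical, and a seed used here will not reproduce the Jamovi results.

```python
import random

def simulate_null_proportions(p0, n, n_sims, seed=None):
    """Simulate n_sims sample proportions from a population where p = p0."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(n_sims):
        # One "spin" per case: red (success) with probability p0
        reds = sum(1 for _ in range(n) if rng.random() < p0)
        proportions.append(reds / n)
    return proportions

# 100 simulated samples of 4,387 teens under H0: p = 0.12
null_props = simulate_null_proportions(p0=0.12, n=4387, n_sims=100, seed=1)
print(f"smallest simulated proportion: {min(null_props):.4f}")
print(f"largest simulated proportion:  {max(null_props):.4f}")
```

Each entry of `null_props` plays the role of one dot in a dot plot of the null distribution.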

You can use the Single Proportion - Hypothesis Test Analysis in the Randomize module to perform the randomization-based test. Click on the icon for the Randomize module, and select Single Proportion - Hypothesis Test from the list. In the interface, drag the text_ind variable into the Variable box, and enter 0.12 into the Test value box. This is the value of the proportion you are testing under the null hypothesis. Start with 100 simulations, so you can easily visualize the results. Enter 100 as the “number of simulated samples”, and select “Dot plot” for the plot. There is one more option that you may want to set in the interface. There is a check box labeled “Seed the random number generator”. If you select this, then you will get the same randomization results every time you use the same seed. This is useful if you want to compare results with a friend. If you do not select this, then you will get different results every time you run the test. The default seed is 8675309 (10,000 \(\times\) Jenny’s constant). If you use the same seed, your results will match the ones shown in this lab exactly.

Figure 4 shows the Single Proportion - Hypothesis Test interface with the correct options selected and the random number generator seeded with 8675309.

Figure 4: Performing a randomization-based test of a single proportion.

The dot plot in Figure 4 shows the null distribution of the sample proportion. Each dot represents a single simulated sample of 4,387 teenagers from a population in which \(H_0\) is true. Each dot is placed on the \(x\)-axis according to the value of \(\hat{p}\) for that sample. As you can see (and probably expect), the proportion varies from sample to sample, even though the true proportion is fixed at 0.12 in the population. The vertical red dashed line indicates the value of the observed sample proportion (0.106), which is also listed in the summary table that is created by the analysis (not shown in the figure, because it is above the plot). Dots that are colored red represent simulated sample proportions that are counted in the calculation of the p-value.

Since you are conducting a two-sided test, the p-value is calculated by first determining which tail (to the left of the observed value or to the right of the observed value) has fewer values. In this case it is the left tail. The proportion of the simulated sample proportions that appear in this tail (the ratio of red dots compared to all dots) is doubled to get the p-value. The Single Proportion - Hypothesis Test Analysis does this calculation for you and reports the p-value in the Simulation Results table, as seen in Figure 4. The approximate p-value in this case is 0.02. Note that this corresponds to the single red dot in the left tail, which represents \(1/100=0.01\) proportion of the simulated sample proportions. Doubling this gives the p-value of \(2\times0.01=0.02\).
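The doubling rule described above can be written as a short Python function. This is a sketch under stated assumptions: the function name is hypothetical, and counting simulated values that are equal to the observed value in both tails is an assumption about tie handling, not a documented detail of the Randomize module. The toy data below are made up so that exactly one of 100 simulated proportions falls in the left tail.

```python
def two_sided_p_value(simulated, observed):
    """Double the proportion of simulated values in the smaller tail."""
    n = len(simulated)
    left = sum(1 for s in simulated if s <= observed) / n
    right = sum(1 for s in simulated if s >= observed) / n
    # Cap at 1 because doubling can exceed 1 when the tails are balanced
    return min(1.0, 2 * min(left, right))

# Toy null distribution: 1 value below the observed proportion, 99 above
toy_simulated = [0.105] + [0.12] * 99
p_value = two_sided_p_value(toy_simulated, observed=0.106)
print(p_value)
```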

You can increase the number of simulations in the Single Proportion - Hypothesis Test interface to get a more precise estimate of the p-value. However, the null distribution becomes difficult to visualize as a dot plot with more than about 100 simulations. Select “Histogram” for the plot type, and then increase the number of simulations to 1,000. Your results should be similar to Figure 5.

Figure 5: Performing a randomization-based test of a single proportion with a histogram plot.

Again there is a single simulated proportion in the left tail. However, this time you did 1,000 simulations so the approximate p-value is \(2\times\frac{1}{1000}=0.002\). You may try to get an even more precise estimate using 10,000 simulations, but it may take a long time to run on your computer.

Based on the p-values from the mathematical model and the randomization-based test, do you reject or fail to reject the null hypothesis? What does this tell you about your friend’s claim?

Confidence interval for a single proportion

The sample proportion \(\hat{p}=0.106\) gives a point estimate for the population proportion of non-helmet-wearing teenagers who text while driving every day. However, you can also calculate a confidence interval for the population proportion. You will calculate a 95% confidence interval in two ways: using a mathematical model (a normal distribution) and using randomization (bootstrapping).

First use a normal distribution. The technical conditions for using a normal distribution are similar to the ones for the model-based hypothesis test. However, when you check the success-failure condition, you use the sample proportion, \(\hat{p}\), instead of the null proportion, \(p_0\). This is due to the fact that you are not testing the null hypothesis, but rather estimating the population proportion, and our best single value estimate of that proportion is \(\hat{p}\). Since \(\hat{p}n\) is just the number of successes in the sample and \((1-\hat{p})n\) is the number of failures, you can simply use these counts to check the success-failure condition. In the sample there are 463 successes (students who texted while driving every day) and 3,924 failures (students who did not). Both of these values are greater than 10, so the condition is met.

The confidence interval is calculated as \[\hat{p} \pm Z^*\times SE,\] where \(Z^*\) is the critical value that corresponds to the desired confidence level. The standard error of the sample proportion is calculated differently for a confidence interval than for a hypothesis test. Again, it is based on the sample proportion, \(\hat{p}\), instead of the null proportion, \(p_0\): \[SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.106\times(1-0.106)}{4387}}=0.0046.\] You probably remember that the critical value for a 95% confidence interval is \(Z^*=1.96\). However, in case you forgot, or in case you need to calculate a confidence interval using a different confidence level, you can use the Model-Based Inference analysis in the Randomize module to calculate the critical value (e.g., Figure 3). Simply select “Calculate CI multiplier” and enter the desired confidence level as a percentage (e.g., 95). The critical value is listed in the Critical value table under the heading “Multiplier”. Confirm that you get a critical value of \(Z^{\ast}=1.960\).

Now you are ready to calculate the confidence interval using the model. It is \[0.106 \pm 1.96\times 0.0046 = (0.097, 0.115).\]
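The same interval can be computed in Python as a check. This sketch uses only the standard library, the function name is illustrative, and it works with the unrounded sample proportion, so the endpoints may differ from the hand calculation in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-model confidence interval for a single proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)          # uses p_hat, not p0
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_hat - z_star * se, p_hat + z_star * se

low, high = proportion_ci(463, 4387, confidence=0.95)
print(f"95% CI: ({low:.4f}, {high:.4f})")
```

Changing the `confidence` argument (for example to 0.91) gives the critical value and interval for another confidence level.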

Next, you will calculate a confidence interval using randomization. To do this, you will simulate repeated samples of size 4,387 from the population, calculating the sample proportion for each one. The sample itself is the best representation of the population that you have access to, so you can simulate sampling from the population by selecting a random sample of 4,387 teens from the original sample with replacement. You can think of this as placing 4,387 balls in a bucket: 463 red balls (successes) and 3,924 white balls (failures). You draw a ball from the bucket, record its color, and then place it back in the bucket. You repeat this process 4,387 times, and then calculate the proportion of red balls (successes) in the sample. You repeat this process many times to build a distribution of sample proportions. This process is called bootstrapping, and the resulting distribution is called a bootstrap distribution. The variability of the bootstrap distribution gives you a good idea of how much the sample proportion would vary if you were to take many samples from the population (i.e., the variability of the sampling distribution).

The 95% confidence interval is constructed by finding the 2.5th and 97.5th percentiles of the simulated sample proportions. This is called a 95% bootstrap percentile confidence interval. You can find confidence intervals for other confidence levels by using different percentiles. For a 91% confidence interval, you want the middle 91% of the simulated sample proportions. That is, you want to leave out the bottom 4.5% and the top 4.5%. Thus, you would find the 4.5th and 95.5th percentiles of the simulated sample proportions.
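The bucket-of-balls procedure and the percentile rule can be sketched in Python. This is an illustration, not the Randomize module's implementation: the function names are hypothetical, the seed will not reproduce the Jamovi output, and the percentile rule (linear interpolation between ordered values) is one common convention that the module may not share.

```python
import random

def percentile(sorted_values, q):
    """q-th percentile (0-100) using linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])

def bootstrap_percentile_ci(successes, n, confidence=0.95, n_boot=1000, seed=None):
    rng = random.Random(seed)
    bucket = [1] * successes + [0] * (n - successes)  # 1 = red ball, 0 = white
    boot_props = []
    for _ in range(n_boot):
        resample = rng.choices(bucket, k=n)           # draw with replacement
        boot_props.append(sum(resample) / n)
    boot_props.sort()
    tail = (1 - confidence) / 2 * 100                 # e.g. 2.5 for 95%
    return percentile(boot_props, tail), percentile(boot_props, 100 - tail)

low, high = bootstrap_percentile_ci(463, 4387, confidence=0.95, n_boot=1000, seed=1)
print(f"95% bootstrap percentile CI: ({low:.3f}, {high:.3f})")
```

Passing `confidence=0.91` makes the function use the 4.5th and 95.5th percentiles instead.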

You can use the Single Proportion - Confidence Interval Analysis in the Randomize module to calculate a bootstrap percentile confidence interval. Click on the icon for the Randomize module, and select Single Proportion - Confidence Interval from the list. In the interface, drag the text_ind variable into the Variable box. Enter 95 as the “Confidence level” to calculate a 95% confidence interval, and select “bootstrap percentile” as the type of confidence interval. To begin with, enter 100 as the “number of bootstraps”, and select “Dot plot” as the plot type. You can also seed the random number generator with 8675309 if you want to get the same results as shown in the lab. Figure 6 shows the Single Proportion - Confidence Interval interface with the correct options selected.

Figure 6: Calculating a confidence interval for a single proportion using randomization.

The dot plot in Figure 6 shows the bootstrap distribution of the sample proportion. Each dot represents the proportion of successes, \(\hat{p}\), for a single simulated sample of 4,387 teenagers (a single bootstrap sample). The vertical red dashed lines indicate the 2.5th and 97.5th percentiles of the bootstrap distribution, which are used to construct the 95% bootstrap percentile confidence interval. The confidence interval is \((0.097, 0.113)\), as shown in the Simulation Results table that is created by the analysis.

As before, you can expect to get a more precise estimate of the confidence interval if you increase the number of simulations. Increase the number of simulations to 1000 and change the plot type to “Histogram”. Figure 7 shows the Single Proportion - Confidence Interval interface with the correct options selected.

Figure 7: Calculating a confidence interval for a single proportion using randomization with a histogram plot.

As you can see in the table in Figure 7, the new 95% bootstrap percentile confidence interval is \((0.097, 0.115)\).

Note that none of the confidence intervals you have calculated contain your friend’s claim of 0.12, so 0.12 is not considered to be a plausible value for the parameter. Based on the second 95% bootstrap percentile confidence interval, you can state that you are 95% confident that the true proportion of non-helmet-wearing teenagers who text while driving every day is between 0.097 and 0.115.

Recall that this means that if many samples of 4,387 teenagers were selected from the population of non-helmet-wearing teenagers, and a 95% confidence interval was calculated using each sample (like the one you just calculated), then about 95% of those intervals will contain the true population proportion.
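This interpretation can be illustrated with a small simulation. The sketch below assumes a hypothetical population proportion of 0.105 (the real value is unknown), draws repeated samples of 4,387, builds a normal-model 95% interval from each, and reports the fraction of intervals that contain the assumed true proportion. The function name, seed, and number of repetitions are illustrative choices.

```python
import random
from math import sqrt

def coverage_rate(true_p, n, n_samples, z_star=1.96, seed=None):
    """Fraction of normal-model intervals that contain true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        successes = sum(1 for _ in range(n) if rng.random() < true_p)
        p_hat = successes / n
        se = sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - z_star * se <= true_p <= p_hat + z_star * se:
            hits += 1
    return hits / n_samples

# true_p = 0.105 is an assumed value chosen only for illustration
rate = coverage_rate(true_p=0.105, n=4387, n_samples=300, seed=1)
print(f"fraction of intervals containing the true proportion: {rate:.3f}")
```

With more repetitions the reported fraction should settle close to 0.95.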

Comparing two proportions

Do students that wear helmets when biking text while driving less than students that do not wear helmets? To answer this question, you will compare the proportions of students who texted while driving every day in the last 30 days between students who never wore a helmet and students who reported wearing a helmet at least some of the time.

The first thing you will need to do is to inactivate the filter you created that filters out any case where the student responded something other than “never” to the helmet question. Once this is done, add two new filters. The first one should remove any cases where the student did not respond to the helmet question (the NAs), and the second should remove students that responded that they “did not ride”. After you have applied the filters, scroll through the data and make sure you have retained the correct cases. There should be 5,395 cases remaining.

Next, create a variable called helmet_ind that has a value of “yes” if the student wore a helmet at least some of the time in the last 12 months, and a value of “no” otherwise. You can create this variable by transforming the helmet_12m variable. Use the following recode condition: IF($source=="never", "no", "yes"). Compare the values of the new variable to the values of helmet_12m and make sure the results make sense.
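The logic of the two filters and the recode condition can be mirrored in a few lines of Python. This is only a sketch of the logic, not something you need to run for the lab; the example responses are hypothetical, and `None` stands in for Jamovi's NA.

```python
def recode_helmet(value):
    """Mirror of the Jamovi condition IF($source=="never", "no", "yes")."""
    return "no" if value == "never" else "yes"

# Hypothetical helmet_12m responses, including a missing value
responses = ["never", "rarely", "always", None, "did not ride", "never"]

# Filters: drop missing responses and students who did not ride
kept = [r for r in responses if r is not None and r != "did not ride"]

helmet_ind = [recode_helmet(r) for r in kept]
print(list(zip(kept, helmet_ind)))
```

Applying the filters before the recode matters: without them, missing responses and “did not ride” would be recoded as “yes”.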

The helmet_ind variable separates the cases into two groups, those who never wore a helmet and those who wore a helmet at least some of the time. You will compare the distributions of the text_ind variable between these two groups. Create a contingency table that shows the counts of students who texted while driving every day in the last 30 days and those who did not for each group. Add conditional proportions to the table to show the proportions for each group. Also, create a stacked bar plot that shows the distribution of the text_ind variable for each group. You learned how to create these numerical and visual summaries in J Lab 2, and you should refer to that lab if you need a refresher. Organize your contingency table so that the helmet_ind variable is along the rows and the text_ind variable is along the columns. Your results should be similar to the following:


 Contingency Tables                                                    
 ───────────────────────────────────────────────────────────────────── 
   helmet_ind                    yes          no           Total       
 ───────────────────────────────────────────────────────────────────── 
   no            Observed              463         3924         4387   
                 % within row     10.55391     89.44609    100.00000   
                                                                       
   yes           Observed               56          952         1008   
                 % within row      5.55556     94.44444    100.00000   
                                                                       
   Total         Observed              519         4876         5395   
                 % within row      9.62002     90.37998    100.00000   
 ───────────────────────────────────────────────────────────────────── 

Note that the first row of the table matches the proportions you calculated in the first part of the lab (the proportions of non-helmet-wearers that texted every day and did not text every day). The second row of the table shows the proportions for the helmet-wearers that texted every day and did not text every day. Based on the table and the bar plot you created, do you think there is a difference in the proportion of students who texted while driving every day between students who never wore a helmet and students who wore a helmet at least some of the time?

The statistic of interest is the difference in proportions of students who texted while driving every day between the two groups, \[\hat{p}_{\text{no helmet}} - \hat{p}_{\text{helmet}} = 0.106-0.056 = 0.050.\]

Hypothesis test for a difference in proportions

To address the question of whether students who never wore a helmet text while driving every day at a higher rate than students who wore a helmet at least some of the time, you will test the following hypotheses: \[\begin{array}{ll} H_0: & p_{\text{no helmet}} - p_{\text{helmet}}=0\\ H_a: & p_{\text{no helmet}} - p_{\text{helmet}}>0\end{array}\]

The technical conditions for using a normal distribution for the hypothesis test are similar to the single proportion case. To check the success-failure condition, you need to calculate the expected number of successes and failures in each group. Since you are testing the null hypothesis, use a single pooled proportion of successes, \(\hat{p}_{pool}\), for the calculations (since the null hypothesis is that the proportions are the same for both groups). \(\hat{p}_{pool}\) is simply the overall proportion of successes (ignoring groups), and it is listed in the last row of the contingency table you created, \(\hat{p}_{pool}=0.096\). Thus, the expected number of successes in the no helmet group is \(0.096\times 4387 = 421.2\), and the expected number of successes in the helmet group is \(0.096\times 1008 = 96.8\). The expected number of failures in the no helmet group is \(0.904\times 4387 = 3965.8\), and the expected number of failures in the helmet group is \(0.904\times 1008 = 911.2\). All four of these values are greater than 10, so the success-failure condition is met.

For a hypothesis test, the standard error for the difference in proportions is also calculated using \(\hat{p}_{pool}\), \[SE=\sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}=\sqrt{0.096\cdot(1-0.096)\left(\frac{1}{4387}+\frac{1}{1008}\right)}=0.0103\]

The \(Z\) score for the difference in proportions is \[Z = \frac{(\hat{p}_{\text{no helmet}} - \hat{p}_{\text{helmet}}) - 0}{SE} = \frac{0.050}{0.0103} = 4.85.\]

Use the Model-Based Inference analysis in the Randomize module to calculate the p-value for the hypothesis test. In the interface, make sure that the “Use standard normal distribution” button is selected. Then check the box labeled “Calculate area (p-value)”. Enter the \(Z\) score you calculated into the box labeled “Observed value (Z or T)”, and select “Right” tail to calculate the one-sided p-value. The p-value should be very small (<0.001).
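The model-based test for the difference in proportions can be checked with the Python sketch below. It uses only the standard library and the counts from the contingency table; the variable names are illustrative, and because it uses unrounded proportions, its results may differ slightly from the hand calculation.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 463, 4387   # no-helmet group: daily texters, group size
x2, n2 = 56, 1008    # helmet group: daily texters, group size

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)      # pooled proportion under H0

# Success-failure condition with the pooled proportion
expected = [p_pool * n1, (1 - p_pool) * n1, p_pool * n2, (1 - p_pool) * n2]
condition_met = min(expected) >= 10

se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = NormalDist().cdf(-z)      # right-tail area, by symmetry

print(f"pooled proportion = {p_pool:.4f}, SE = {se:.4f}")
print(f"Z = {z:.2f}, one-sided p-value = {p_value:.2e}")
```

The right-tail area is computed as the area below \(-Z\), which avoids subtracting a number very close to 1 from 1.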

Next, you will use a randomization-based test of the null hypothesis. The randomization-based test approximates the null distribution of the difference in proportions by simulating many samples from a population in which \(H_0\) is true. For a difference in proportions, the test is called a permutation test. Each simulated sample starts with the original sample. The values of the response variable, text_ind, are randomly shuffled (permuted) so that neither group should tend to have a higher proportion of teens that text every day than the other group. This is equivalent to taking 5,395 note cards (each representing a case in the sample) and writing “yes” on 519 of them (representing the students who text every day while driving) and “no” on the remaining 4,876 cards (representing those students who do not). You shuffle the note cards and divide them into two groups of 4,387 cards (the group of students that never wear a helmet) and 1,008 cards (the group that wears a helmet at least some of the time). You calculate the difference in proportions for the two groups, and then repeat this process many times to build a null distribution of the difference in proportions.
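The note-card shuffle can be sketched in Python as follows. This is an illustration of the permutation idea rather than the Randomize module's code; the function name, seed, and the choice of 500 shuffles (to keep the run time short) are assumptions made for the example.

```python
import random

def permutation_diffs(x1, n1, x2, n2, n_sims, seed=None):
    """Null distribution of p1 - p2 from shuffling the response labels."""
    rng = random.Random(seed)
    # One card per case: 1 = "yes" (texts every day), 0 = "no"
    cards = [1] * (x1 + x2) + [0] * (n1 + n2 - x1 - x2)
    diffs = []
    for _ in range(n_sims):
        rng.shuffle(cards)
        group1 = cards[:n1]          # dealt to the no-helmet group
        group2 = cards[n1:]          # dealt to the helmet group
        diffs.append(sum(group1) / n1 - sum(group2) / n2)
    return diffs

diffs = permutation_diffs(463, 4387, 56, 1008, n_sims=500, seed=1)
observed = 463 / 4387 - 56 / 1008

# One-sided p-value: share of shuffles at least as large as the observed difference
p_value = sum(1 for d in diffs if d >= observed) / len(diffs)
print(f"observed difference = {observed:.3f}, approximate p-value = {p_value}")
```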

You can use the Two Proportions - Hypothesis Test Analysis in the Randomize module to perform the randomization-based test. Click on the icon for the Randomize module, and select Two Proportions - Hypothesis Test from the list. In the interface, drag the text_ind variable into the Columns box, and drag the helmet_ind variable into the Rows box. Make sure to indicate that you want to compare “rows”, so that you are comparing the no-helmet group to the helmet group. The Two Proportions - Hypothesis Test Analysis will create a contingency table similar to the one you created earlier. Add row percentages to the table by selecting the correct options under the Cells section of the interface. The group in the first row of the table will be considered “Group 1” and the second group will be considered “Group 2”. In this case, “Group 1” is the non-helmet-wearers. The difference in proportions that is calculated by the Analysis tool is always calculated as Group 1 - Group 2. Select the appropriate alternative hypothesis. In this case, it should be “Group 1 > Group 2”.

Next, expand the Simulations section of the interface to set up the simulations. Start with a “Dot plot” and 100 simulations. What does each dot represent? What is the vertical red dashed line? What is the approximate p-value?

Next try 1,000 simulations and a “Histogram” plot. Figure 8 shows the Two Proportions - Hypothesis Test interface with the correct options selected.

Figure 8: Performing a randomization-based test of a difference in proportions.

What does the histogram show? Were any of the simulated differences in proportions as large as the observed difference? What is the approximate p-value?

What can you conclude based on the p-values from the mathematical model and the randomization-based test?

Confidence interval for a difference in proportions

You can also calculate a confidence interval for the difference in proportions of students who texted while driving every day between students who never wore a helmet and students who wore a helmet at least some of the time. You will calculate a 95% confidence interval in two ways: using a mathematical model (a normal distribution) and using randomization (bootstrapping).

The technical conditions for using a normal distribution are similar to the single proportion case. To check the success-failure condition, you need to calculate the expected number of successes and failures in each group. For a confidence interval, these are simply the observed counts, which are all greater than 10.

Using a normal distribution, the confidence interval is calculated as \[(\hat{p}_{\text{no helmet}} - \hat{p}_{\text{helmet}}) \pm Z^*\times SE,\] where \(Z^*\) is the critical value that corresponds to the desired confidence level. The standard error for the difference in proportions is calculated using \(\hat{p}_{\text{no helmet}}\) and \(\hat{p}_{\text{helmet}}\) rather than a pooled proportion. This is due to the fact that you are not assuming that \(H_0\) is true when you calculate the confidence interval. As a result, each group can have a different proportion of successes in the population, and \(\hat{p}_{\text{no helmet}}\) and \(\hat{p}_{\text{helmet}}\) are the best point estimates of those proportions. The standard error for the difference in proportions is \[\begin{array}{lcl} SE &=& \sqrt{\frac{\hat{p}_{\text{no helmet}}(1-\hat{p}_{\text{no helmet}})}{n_{\text{no helmet}}}+\frac{\hat{p}_{\text{helmet}}(1-\hat{p}_{\text{helmet}})}{n_{\text{helmet}}}}\\ &=& \sqrt{\frac{0.106\cdot(1-0.106)}{4387}+\frac{0.056\cdot(1-0.056)}{1008}}\\ &=& 0.00860\end{array}\]

The critical value for a 95% confidence interval is \(Z^*=1.96\). Recall that you can use the Model-Based Inference analysis in the Randomize module to calculate the critical value.

The confidence interval is \[0.050 \pm 1.96\times 0.00860 = (0.0331, 0.0669).\]
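The arithmetic above can be reproduced with a short Python sketch. It uses the rounded sample proportions and group sizes given in this lab, so the results should agree with the values above up to rounding. The function name `two_proportion_ci` is a hypothetical helper written for this example, not something provided by Jamovi.

```python
from math import sqrt

# Rounded sample proportions and group sizes from the lab text.
p_no_helmet, n_no_helmet = 0.106, 4387
p_helmet, n_helmet = 0.056, 1008
z_star = 1.96  # critical value for 95% confidence

def two_proportion_ci(p1, n1, p2, n2, z):
    """Normal-based CI for p1 - p2 using the unpooled standard error."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se, se

lower, upper, se = two_proportion_ci(
    p_no_helmet, n_no_helmet, p_helmet, n_helmet, z_star
)
print(f"SE = {se:.5f}")
print(f"95% CI = ({lower:.4f}, {upper:.4f})")
```

Note that the standard error uses each group's own sample proportion, not a pooled proportion, because no null hypothesis is being assumed.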

Next, you will calculate a 95% bootstrap percentile confidence interval using the Two Proportions - Confidence Interval Analysis in the Randomize module. The bootstrap samples are created by randomly resampling each of the original samples with replacement. That is, 4,387 cases are randomly selected from the no-helmet group (with replacement), and 1,008 cases are randomly selected from the helmet group (with replacement), to simulate the original sampling process. The difference in proportions is calculated for each bootstrap sample, and the 95% confidence interval is constructed by finding the 2.5th and 97.5th percentiles of the simulated differences in proportions.
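To make the resampling procedure concrete, here is a rough Python sketch of a bootstrap percentile interval for a difference in proportions. It is not how the Randomize module is implemented; it only illustrates the idea. The success counts (465 and 56) are an assumption, back-calculated from the rounded proportions 0.106 and 0.056, so the actual counts in your data may differ slightly. The function name `bootstrap_diff_ci_95` is hypothetical.

```python
import random
from statistics import quantiles

# ASSUMPTION: counts back-calculated from the rounded proportions in the lab
# (0.106 * 4387 is about 465, and 0.056 * 1008 is about 56).
no_helmet = [1] * 465 + [0] * (4387 - 465)  # 1 = texted while driving every day
helmet = [1] * 56 + [0] * (1008 - 56)

def bootstrap_diff_ci_95(group1, group2, n_boot=1000, seed=1):
    """95% bootstrap percentile CI for the difference in proportions
    (group1 minus group2)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(n_boot):
        # Resample each group separately, with replacement, at its original size.
        resample1 = rng.choices(group1, k=len(group1))
        resample2 = rng.choices(group2, k=len(group2))
        diffs.append(sum(resample1) / len(group1) - sum(resample2) / len(group2))
    # Splitting into 40 equal slices puts the first and last cut points
    # at the 2.5th and 97.5th percentiles.
    cuts = quantiles(diffs, n=40)
    return cuts[0], cuts[-1]

boot_lower, boot_upper = bootstrap_diff_ci_95(no_helmet, helmet)
print(f"95% bootstrap percentile CI: ({boot_lower:.3f}, {boot_upper:.3f})")
```

Because the resampling is random, the endpoints change a little with a different seed or a different number of bootstrap samples.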

Click on the icon for the Randomize module, and select Two Proportions - Confidence Interval from the list. In the interface, drag the text_ind variable into the Columns box, and drag the helmet_ind variable into the Rows box. Make sure to indicate that you want to compare “rows”, so that you are comparing the no-helmet group to the helmet group. The Two Proportions - Confidence Interval Analysis will create a contingency table similar to the ones you created earlier. Add row percentages to the table. Select the appropriate confidence level. In this case, it should be 95%. Select “bootstrap percentile” as the type of confidence interval. To begin with, enter 100 as the “number of bootstraps”, and select “Dot plot” as the plot type. What does each dot in the dot plot represent? What are the vertical red dashed lines? What is the approximate confidence interval?

Now, increase the number of simulations to 1,000 and change the plot type to “Histogram”. Figure 9 shows the Two Proportions - Confidence Interval interface with the correct options selected.

Figure 9: Calculating a confidence interval for a difference in proportions using randomization.

The 95% bootstrap percentile confidence interval is \((0.034, 0.065)\), as seen in Figure 9. Based on this interval, you can state that you are 95% confident that the proportion of students who texted while driving every day is between 0.034 and 0.065 higher for students who never wore a helmet while biking than for students who wore a helmet at least some of the time. Note that 0 is not in the interval, indicating that it is not plausible that the difference in proportions is 0 in the population. This is consistent with the results of the hypothesis tests you conducted earlier.

Saving your work

You can save your work in Jamovi by clicking on the hamburger menu and selecting Save. You can save your work as a .omv file, which is a file that can be opened in Jamovi. However, you will not turn this file in for your lab report. Instead, you will turn in a PDF of your lab report that includes screenshots of the Jamovi interface, plots, tables, and your answers to questions at the end of the lab. Even though you are not turning it in, you should save your Jamovi file in case you need to refer back to it later.

What you need to turn in

This section includes questions that you will turn in for this lab. You will continue to work with the yrbss data, but you will focus on the strength_training_7d and school_night_hours_sleep variables.

  1. First you will analyze a single proportion for a particular group of students. You will consider the group of students that gets 10+ hours of sleep in a typical school night, and analyze the students in this group that do strength training every day of the week (7 days). You will need to start by filtering the data. To get the correct subset of students, filter the data to only include students that get 10+ hours of sleep in a typical school night. Create a second filter that only retains students that answered the strength training question (remove any NAs). Activate both filters and check the data to make sure they are working the way you would expect. Next, create a new variable called strength_ind by transforming the strength_training_7d variable. The new variable should have the value “yes” for students who indicated that they did strength training for 7 days and “no” otherwise. Use the recode condition IF($source == 7, "yes", "no") for the transformation. Scroll through the data and compare the values of the new strength_ind variable to the values of strength_training_7d, and make sure the results are correct. Use the Descriptives analysis to create a table of counts and proportions for the strength_ind variable for the filtered data. Also create a bar plot showing the distribution of the strength_ind variable. Include the table and the plot in your lab report. What proportion of students that get 10+ hours of sleep in a typical school night do strength training every day of the week? What is the total number of students that get 10+ hours of sleep in a typical school night?
  2. Use the Model-Based Inference Analysis in the Randomize module to calculate a 97% confidence interval for the proportion of students that do strength training every day of the week for the 10+ hours of sleep group. Are the technical conditions for using a normal distribution satisfied? Explain. What is the standard error for the proportion of students that do strength training every day of the week? Show how you calculated it. What is the critical value for a 97% confidence interval? Take a screenshot of the Model-Based Inference interface with the correct options selected to calculate the confidence interval, and include it in your lab report. State the 97% confidence interval in your lab report. What does this interval tell you about the proportion of students that do strength training every day of the week for the 10+ hours of sleep group? Use a complete sentence to answer this question. Based on your confidence interval, is it plausible that 25% of students that get 10+ hours of sleep in a typical school night do strength training every day of the week? Explain.
  3. Next, use the Single Proportion - Confidence Interval analysis in the Randomize module to calculate a 97% bootstrap percentile confidence interval for the proportion of students that do strength training every day of the week for the 10+ hours of sleep group. Use 200 bootstrap samples, and visualize the bootstrap distribution using a dot plot. Include the dot plot and the Simulation Results table in your lab report. What does each dot in the dot plot represent? What are the vertical red dashed lines? What is the approximate confidence interval? State the 97% bootstrap percentile confidence interval in your lab report. How does it compare to the one you calculated using the normal distribution? How could you get a more precise estimate of the confidence interval using the bootstrap method?
  4. Now you will analyze a difference in proportions between two groups of students. You will compare the group of students that gets 10+ hours of sleep in a typical school night to the group of students that gets less than 10 hours of sleep in a typical school night. In particular, you will compare the proportions of students that do strength training every day of the week between these two groups. To start, you will need to inactivate the filter you created that only retains students that get 10+ hours of sleep. After that, you should add a filter to remove cases in which the student did not respond to the sleep question (NAs). Scroll through the data to ensure that the filters are working correctly. Then, create a new variable called sleep_ind by transforming the school_night_hours_sleep variable. The new variable should have the value “yes” for students that get 10+ hours of sleep in a typical school night and “no” otherwise. Use the recode condition IF($source == "10+", "yes", "no") for the transformation. Scroll through the data and compare the values of the new sleep_ind variable to the values of school_night_hours_sleep, and make sure the results are correct. Create a contingency table that shows the counts of students that do strength training every day of the week and those that do not for each sleep group. Add conditional proportions to the table to show the proportions for each sleep group. Also, create a stacked bar plot that shows the distribution of the strength_ind variable for each group. Ensure that the plot shows row percentages (each bar should reach 100%) with “Rows” on the x-axis. Include the table and the plot in your lab report. Based on the table and the bar plot, do you think there is a difference in the proportion of students that do strength training every day of the week between students that get 10+ hours of sleep in a typical school night and students that get less than 10 hours of sleep in a typical school night? Calculate the observed difference in proportions of students that do strength training every day of the week between the two groups (10+ hours - less than 10 hours). State the observed difference in your lab report.
  5. Next you will determine whether there is convincing evidence of an association between the strength_ind and sleep_ind variables. State the null and alternative hypotheses using symbols in your lab report. The hypotheses should be stated in terms of differences of population proportions. Since you are just testing whether or not the variables are associated, your alternative hypothesis should be two-sided. Explain in words what each proportion involved in your statement of the hypotheses represents.
  6. Use the Model-Based Inference analysis in the Randomize module to perform a hypothesis test for the difference in proportions of students that do strength training every day of the week between the two groups. What is the value of \(\hat{p}_{pool}\)? Show how you calculated it. Are the technical conditions met for the test? Explain. What is the standard error for the difference in proportions? Show your calculation for this, too. What is the \(Z\) score for the difference in proportions? How did you calculate this? What is the p-value for the hypothesis test? Take a screenshot of the Model-Based Inference interface with the correct options selected to calculate the p-value, and include it in your lab report.
  7. Next, use the Two Proportions - Hypothesis Test analysis in the Randomize module to perform a randomization-based test of the null hypothesis. Use 1,000 random permutations of the data, and display the null distribution using a histogram. Include the histogram and the table of Simulation Results in your lab report. What does the histogram show? Were any of the simulated differences in proportions as large as the observed difference? What is the approximate p-value? What can you conclude based on the p-values from the mathematical model and the randomization-based test? Use complete sentences to answer this question.

You may create your lab report in a Word document or a Google Doc. You may organize your report as numbered answers to the questions listed above. Include the screenshots, plots, and tables in your report, making sure that they are positioned under the correct question number. You should also include your answers to the questions in your report, and your answers should refer to the relevant plots or tables when applicable. Save your report as a PDF and submit it using the appropriate submission link on the course Moodle page (check the PDF before you submit it to make sure it is readable and complete).