J Lab 5

Inference for contingency tables and inference for a single mean

Author

Yurk

Portions of this lab are based on an R lab from Chapter 23 of the Introduction to Modern Statistics (2e) textbook by Mine Çetinkaya-Rundel and Johanna Hardin.

Youth Risk Behavior Surveillance System

In this lab you continue to explore the yrbss data set that was introduced in J Lab 4. The Youth Risk Behavior Surveillance System (YRBSS) survey tracks high school student behaviors that could impact health. Download the yrbss.csv file, available here or on our Moodle page, and open it in Jamovi.

You will only use a subset of the variables for this lab, described in the following table. In Jamovi, set up the Measure type and Data type for each variable as shown in the table.

Variable                    Measure type    Data type    Description
helmet_12m                  Nominal         Text         How often wore helmet when biking in last 12 months
text_while_driving_30d      Nominal         Text         How many days texting while driving in the last 30 days
strength_training_7d        Continuous      Integer      How many days strength training in the last 7 days
school_night_hours_sleep    Nominal         Text         How many hours sleep in typical school night
height                      Continuous      Decimal      Height (meters)
weight                      Continuous      Decimal      Weight (kilograms)
age                         Continuous      Integer      Age (years)

Comparing texting while driving behaviors between different bike helmet usage groups

In J Lab 4 you investigated the relationship between texting while driving and wearing a helmet while biking. You explored the difference in the proportions of students who texted while driving every day in the last 30 days between students who never wore a helmet and students who reported wearing a helmet at least some of the time in the last 12 months. You estimated the difference in proportions using a 95% confidence interval, and you tested the hypothesis that the difference in proportions was 0.

In this lab you will continue to investigate the relationship between texting while driving and wearing a helmet while biking. However, you will take a different approach to the analysis this time. Instead of a binary response (texted every day vs. did not), you will consider a related response variable with three levels (never texted, texted 1-19 days, and texted 20 or more days). Also, instead of comparing the values of the response across two groups (wore a helmet at least some of the time vs. never wore a helmet), you will compare the values of the response across three groups (always wore a helmet, sometimes wore a helmet but not always, and never wore a helmet).

Filtering data and recoding variables

Start by filtering the data down to the relevant observations by excluding the following cases:

  • Students who did not drive in the last 30 days
  • Students who did not respond to the texting question
  • Students who did not ride a bike in the last 12 months
  • Students who did not respond to the helmet question

You may refer to the instructions in J Lab 2 or J Lab 4 if you need a refresher on how to filter data in Jamovi.

Next, you will recode the helmet_12m and text_while_driving_30d variables. Create two new variables, helmet_ind and text_ind, that each have three levels. Start with the texting variable. As in previous labs, you can recode the variable by clicking on its name in the spreadsheet in Jamovi, then clicking on the Transform button in the Data tab. Set the name of the transformed variable to text_ind. Then, select “Create New Transform”.

In the Transform interface that pops up, click + recode condition three times and add the following recode conditions:

  • if $source == "0" use "0"
  • if $source == "30" use "20-30"
  • if $source == "20-29" use "20-30"
  • else use "1-19"

Figure 1 shows the last three of these conditions set up correctly (the first one is hidden in this view).

Figure 1: Recoding the text_while_driving_30d variable.

Once you have recoded the text_while_driving_30d variable, compare the values of the new text_ind variable to the original text_while_driving_30d variable to make sure it was done correctly.
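If it helps to see the recode rules outside of the Jamovi interface, the following is a minimal Python sketch of the same logic. The function name recode_text is hypothetical, and the sample values in the loop are assumed level labels used only to spot-check the rules. The sketch assumes the filters have already removed students who did not drive or did not respond.

```python
def recode_text(source: str) -> str:
    """Mirror the Jamovi recode conditions for text_while_driving_30d."""
    if source == "0":
        return "0"
    if source in ("30", "20-29"):
        return "20-30"
    # Every remaining value falls into the middle group (the "else" rule).
    return "1-19"


# Assumed sample values, used only to check the recoding by eye.
for original in ["0", "1-2", "10-19", "20-29", "30"]:
    print(original, "->", recode_text(original))
```

Comparing each original value to its recoded value in this way is the same check you are asked to do by eye in the Jamovi spreadsheet.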

Next recode the helmet_12m variable. The new helmet_ind variable should have the value “always” for students that reported always wearing a helmet, “never” for students that reported never wearing a helmet, and “sometimes” for everyone else. Compare the values of the new variable to the original helmet_12m variable to make sure the recoding was done correctly.

Comparing texting distributions between helmet groups

Create a contingency table that shows the counts of students in each of the three texting while driving groups for each of the three bike helmet groups. Add conditional proportions to the table to show the proportions within each helmet group. Also create a stacked bar plot that shows the distribution of the text_ind variable for each helmet_ind group. You learned how to create these numerical and visual summaries in J Lab 2, and you created similar tables and plots in J Lab 4. You should refer to those labs if you need a refresher. Your results should be similar to the following:


 Contingency Tables                                                                 
 ────────────────────────────────────────────────────────────────────────────────── 
   helmet_ind                    0            1-19         20-30        Total       
 ────────────────────────────────────────────────────────────────────────────────── 
   always        Observed              172           38           19          229   
                 % within row     75.10917     16.59389      8.29694    100.00000   
                                                                                    
   never         Observed             2566         1178          643         4387   
                 % within row     58.49100     26.85206     14.65694    100.00000   
                                                                                    
   sometimes     Observed              521          191           67          779   
                 % within row     66.88062     24.51861      8.60077    100.00000   
                                                                                    
   Total         Observed             3259         1407          729         5395   
                 % within row     60.40778     26.07970     13.51251    100.00000   
 ────────────────────────────────────────────────────────────────────────────────── 

In the contingency table and the stacked bar plot, it would be better to order the groups using increasing helmet usage and increasing texting frequency. Unfortunately, Jamovi does not currently allow you to reorder the levels of a transformed variable in an efficient way. It is possible to copy the data from the new helmet_ind or text_ind variable into a new variable and then to reorder the levels in the desired order. However, we will not do that here.

Based on the plot and the contingency table, does it appear that there is an association between texting while driving and helmet usage? How does the information in the table and the plot compare to the ones you created in J Lab 4?

Hypothesis tests

Next you will conduct tests of the following hypotheses:

  • \(H_0:\) There is no difference in rates of texting while driving between students with different bike helmet usage
  • \(H_A:\) There is a difference in rates of texting while driving between students with different bike helmet usage

You will conduct both a hypothesis test using a mathematical model (a Chi-squared test) and a hypothesis test using random permutation.

Before you test these hypotheses using a mathematical model, you must check the technical conditions for the test. In particular, you need to check that the expected count in each cell of the contingency table is at least 5 (assuming that the null hypothesis is true). Since the smallest helmet group is those who “always” wear a helmet, and the smallest texting group is those who texted “20-30 days”, the cell at the intersection of these two groups has the smallest expected count, so you can focus on that cell. Overall, about 13.5% of students texted while driving 20-30 days (see the Total row of the contingency table). If \(H_0\) is true, we expect the percentage of students who texted while driving 20-30 days to be the same (13.5%) for all three helmet groups. Thus, since there are 229 students that always wore a helmet, we expect \(0.135\times 229= 30.9\) of these students to text while driving 20-30 days. This is greater than 5, so the technical condition is met.

You can also have Jamovi calculate expected counts for the entire contingency table (assuming that the null hypothesis is true). These can be added in the same interface that allowed you to add row percentages to the table. Simply click on the checkbox next to “Expected counts” to add them to the table. Does the expected count for students who always wore a helmet and texted while driving for 20-30 days match the expected count of 30.9 you calculated above?
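As a cross-check on the expected counts, here is a minimal Python sketch that applies the usual rule (row total times column total, divided by the grand total) to the observed counts from the contingency table above. The variable names are illustrative and are not part of Jamovi.

```python
# Observed counts copied from the contingency table above.
observed = {
    "always":    {"0": 172,  "1-19": 38,   "20-30": 19},
    "never":     {"0": 2566, "1-19": 1178, "20-30": 643},
    "sometimes": {"0": 521,  "1-19": 191,  "20-30": 67},
}

row_totals = {row: sum(cells.values()) for row, cells in observed.items()}
col_totals = {}
for cells in observed.values():
    for col, count in cells.items():
        col_totals[col] = col_totals.get(col, 0) + count
n = sum(row_totals.values())

# Expected count under H0: (row total x column total) / grand total.
expected = {
    row: {col: row_totals[row] * col_totals[col] / n for col in col_totals}
    for row in observed
}

for row, cells in expected.items():
    print(row, {col: round(value, 1) for col, value in cells.items()})

# The technical condition requires every expected count to be at least 5.
smallest = min(value for cells in expected.values() for value in cells.values())
print("Smallest expected count:", round(smallest, 1))
```

The value for the "always" and "20-30" cell can differ slightly from a hand calculation because the sketch uses the unrounded proportion rather than 13.5%.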

Once you have constructed a contingency table in Jamovi and checked the technical conditions for the test, it is easy to conduct the model-based Chi-squared test. In fact, you may have noticed the results of the test have already appeared in Jamovi, since they are included in the Contingency Tables analysis results by default. If you do not see the results of the Chi-squared test, make sure the box next to \(X^2\) is checked in the Statistics section of the Contingency Tables interface. Your results should be similar to the following:


 χ² Tests                               
 ────────────────────────────────────── 
         Value       df    p            
 ────────────────────────────────────── 
   χ²    48.66734     4    < .0000001   
   N         5395                       
 ────────────────────────────────────── 

The table shows the value of the Chi-squared statistic, \(X^2=48.7\). It also shows the degrees of freedom for the Chi-squared distribution that approximates the null distribution. Recall that the degrees of freedom are calculated as \((r-1)\times(c-1)\), where \(r\) is the number of rows in the contingency table and \(c\) is the number of columns (not including marginal totals). Since the table has 3 rows and 3 columns, the degrees of freedom are \(2\times2=4\).
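To see where the statistic and the degrees of freedom come from, the following is a minimal Python sketch that recomputes both from the observed counts. It uses the standard formula, the sum over all cells of (observed - expected)^2 / expected. It is an illustration of the calculation, not a description of how Jamovi implements it.

```python
# Observed counts from the contingency table (columns: 0, 1-19, 20-30).
observed = [
    [172, 38, 19],      # always
    [2566, 1178, 643],  # never
    [521, 191, 67],     # sometimes
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Add up (observed - expected)^2 / expected over all nine cells.
chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n
        chi_sq += (obs - exp) ** 2 / exp

# Degrees of freedom: (rows - 1) x (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"X^2 = {chi_sq:.2f}, df = {df}")
```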

The p-value is very small (<0.001), so we reject the null hypothesis at the \(\alpha=0.05\) significance level. There is convincing evidence that there is a difference in the rates of texting while driving between students with different levels of bike helmet usage.

You can also conduct a simulation-based hypothesis test using the Randomize module in Jamovi. The module creates simulated samples by randomly permuting (shuffling) the values of the response variable, text_ind. Since the values of the response are randomly assigned to the cases, the distribution of the text_ind variable will not tend to differ between helmet usage groups. Thus, the simulated samples are examples of samples from a population in which the null hypothesis is true. We can create a null distribution for the \(X^2\) statistic by calculating the values of the statistic for the simulated samples. This is similar to the approach you took in J Lab 4 to perform a hypothesis test for the difference in proportions.

Select Contingency Table - Hypothesis Test from the Randomize module. Set up the analysis similar to how you built the contingency table above by dragging helmet_ind into the “Rows” box and text_ind into the “Columns” box. Under Simulations, specify 1000 simulations and select “Histogram” for the plot. You may also seed the random number generator with 8675309 if you want your results to match the ones in this lab.

Figure 2 shows the interface for the Contingency Table - Hypothesis Test analysis with the correct options selected, along with the results of the analysis.

Figure 2: Simulation-based hypothesis test for contingency table.

From the histogram in Figure 2, you can see that there were no simulated samples with \(X^2\) values as large as the observed value of \(X^2=48.7\). Thus the p-value is approximately \(0/1000 = 0\). The results of the simulation-based test match the results of the model-based Chi-squared test.
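The permutation idea can also be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Randomize module's actual implementation: the function names are hypothetical, and although the seed below reuses the number from this lab, Python's random number generator will not reproduce Jamovi's simulated samples.

```python
import random

# Observed counts from the contingency table (columns: 0, 1-19, 20-30).
observed = [
    [172, 38, 19],      # always
    [2566, 1178, 643],  # never
    [521, 191, 67],     # sometimes
]


def chi_squared(table):
    """Chi-squared statistic for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            total += (obs - exp) ** 2 / exp
    return total


def permutation_p_value(table, n_perm=1000, seed=8675309):
    """Shuffle the response labels and count statistics >= the observed one."""
    rng = random.Random(seed)
    n_cols = len(table[0])
    group_sizes = [sum(row) for row in table]
    # One response label (a column index) per student.
    labels = [j for row in table for j, count in enumerate(row) for _ in range(count)]
    observed_stat = chi_squared(table)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        shuffled, start = [], 0
        # Deal the shuffled labels back out to the helmet groups.
        for size in group_sizes:
            group = labels[start:start + size]
            shuffled.append([group.count(j) for j in range(n_cols)])
            start += size
        if chi_squared(shuffled) >= observed_stat:
            at_least_as_large += 1
    return observed_stat, at_least_as_large / n_perm


observed_stat, p_value = permutation_p_value(observed)
print(f"Observed X^2 = {observed_stat:.2f}, approximate p-value = {p_value}")
```

Shuffling only the response labels keeps the row and column totals fixed, which is why every simulated table is consistent with the null hypothesis of no association.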

Student height

Next you will learn how to use Jamovi to analyze a single mean. We will use the height variable in the yrbss data, which gives the height of the student in meters.

Another study claims that the average height of a 15 year old in the United States is 67 inches (1.70 meters). You will evaluate this claim using the yrbss data by conducting a hypothesis test and calculating a 95% confidence interval for the mean height of a 15 year old in the United States.

Hypothesis test for a single mean

You will test the following hypotheses:

  • \(H_0: \mu = 1.70\)
  • \(H_A: \mu \neq 1.70\)

where \(\mu\) is the average height of a 15 year old in the United States in meters.

Start by filtering the data to retain only 15 year olds. Deactivate the filters that you created for the first part of the lab. Then create a new filter that only retains the 15 year olds. Finally, create another new filter that removes any cases for which the height variable is missing. Activate both new filters and check the data to make sure they are working the way you would expect.

Next you should explore the distribution of heights for the 15 year olds. Create a histogram of the heights, and a table of descriptive statistics. Include the following in the table: the number of cases, the mean, the median, the standard deviation, the minimum, the maximum, Q1, and Q3. You learned how to create histograms and tables of descriptive statistics for a numeric variable in J Lab 2. Your results should be similar to the following:


 DESCRIPTIVES

 Descriptives                        
 ─────────────────────────────────── 
                         height      
 ─────────────────────────────────── 
   N                          2870   
   Mean                   1.678927   
   Median                 1.680000   
   Standard deviation    0.1004759   
   Minimum                1.270000   
   Maximum                2.110000   
   25th percentile        1.600000   
   75th percentile        1.750000   
 ─────────────────────────────────── 

Your hypothesis test will be based on a mathematical model (a T-distribution). However, before proceeding with the hypothesis test, you must check to see that the conditions for the test are met. You can use the table and the histogram you just created to check the conditions. Since there are at least 30 observations (see the table) and no extreme outliers (see the histogram), the conditions for the test are met.

Before you can conduct the hypothesis test you need to compute the value of the T-statistic, which is calculated as follows:

\[T = \frac{\bar{x}-\mu_0}{\frac{s}{\sqrt{n}}}\]

where \(\mu_0\) is the value of the mean under the null hypothesis, \(\bar{x}\) is the sample mean, \(s\) is the sample standard deviation, and \(n\) is the sample size. Most of this information is included in the table you created. Thus,

\[T = \frac{1.679-1.70}{\frac{0.1005}{\sqrt{2870}}}=-11.19\]

Next you have to calculate the degrees of freedom for the T-distribution that you will use as a model of the null distribution. The degrees of freedom for the T-distribution is given by \(df=n-1\). Thus, the degrees of freedom for our test are \(df=2870-1=2869\).

You can conduct the hypothesis test using the Model-Based Inference analysis in the Randomize module in Jamovi. In the interface, select Use T distribution and enter 2869 for the degrees of freedom. Then check the box to calculate the p-value, enter the observed value of \(T\), and select “Both” tails. Figure 3 shows the interface after it has been set up correctly for the hypothesis test along with the results.

Figure 3: Calculating the p-value for a hypothesis test for a single mean.

The p-value is very small (<0.001), so we reject the null hypothesis at the \(\alpha=0.05\) significance level. There is convincing evidence that the average height of a 15 year old in the United States is different from 1.70 meters.
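The arithmetic behind this test can be checked with a short Python sketch. It uses the rounded summary statistics from the Descriptives table. The p-value step is an assumption on our part: because the Python standard library has no T-distribution, the sketch approximates the two-sided tail area with the standard normal distribution, which is very close to a T-distribution with 2869 degrees of freedom. Jamovi's T-based p-value will differ slightly.

```python
import math

x_bar = 1.679  # sample mean in meters (rounded from the Descriptives table)
s = 0.1005     # sample standard deviation in meters (rounded)
n = 2870       # number of 15 year olds with a recorded height
mu_0 = 1.70    # mean height under the null hypothesis

standard_error = s / math.sqrt(n)
t_stat = (x_bar - mu_0) / standard_error
df = n - 1

# ASSUMPTION: normal approximation to the T-distribution (reasonable for large df).
# For a standard normal Z, the two-sided tail area P(|Z| >= |t|) equals erfc(|t| / sqrt(2)).
p_approx = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"T = {t_stat:.2f}, df = {df}, approximate two-sided p-value = {p_approx:.3g}")
```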

Confidence interval for a mean

Next you will calculate a 95% confidence interval for the average height of a 15 year old in the United States. You will calculate a 95% confidence interval in two ways: using a mathematical model (a T-distribution) and using randomization (bootstrapping).

You have already checked the technical conditions for using a T-distribution. You can use the Model-Based Inference analysis in the Randomize module in Jamovi to calculate the appropriate multiplier for the 95% confidence interval. This time, instead of checking the box for calculating a p-value, check the box for calculating a CI multiplier. Also, specify that the confidence level is 95%. Figure 4 shows the interface after it has been set up correctly for the confidence interval along with the results.

Figure 4: Calculating the multiplier for a 95% confidence interval for a single mean.

Since the multiplier is \(T^{\ast}_{2869}=1.961\), the 95% confidence interval is calculated as

\[\bar{x} \pm T^{\ast}_{df}\times \frac{s}{\sqrt{n}} = 1.679 \pm 1.961\times \frac{0.1005}{\sqrt{2870}}\]

This results in a 95% confidence interval of \((1.675, 1.683)\) for the mean height of 15 year olds in the United States. Note that this confidence interval does not include the value of 1.70 suggested by the other study, which is consistent with the results of the hypothesis test you just conducted.
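The interval arithmetic can be verified with a minimal Python sketch. The multiplier 1.961 is taken from the Jamovi output described above rather than computed here, and the other inputs are the rounded summary statistics from the Descriptives table.

```python
import math

x_bar = 1.679    # sample mean in meters (rounded)
s = 0.1005       # sample standard deviation in meters (rounded)
n = 2870         # sample size
t_star = 1.961   # 95% multiplier reported by Jamovi for df = 2869

margin_of_error = t_star * s / math.sqrt(n)
lower = x_bar - margin_of_error
upper = x_bar + margin_of_error

print(f"95% CI: ({lower:.3f}, {upper:.3f})")

# Check whether the value claimed by the other study falls inside the interval.
mu_0 = 1.70
contains_null = lower <= mu_0 <= upper
print("Interval contains 1.70:", contains_null)
```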

Next, you will calculate a 95% confidence interval using a bootstrap distribution of means. You will use the Single Mean - Confidence Interval analysis in the Randomize module to calculate a 95% bootstrap percentile confidence interval. Set the confidence level to 95%, select “bootstrap percentile” for the type of interval, specify that you want to do 1000 bootstraps, with a histogram as the plot type. You may also seed the random number generator with 8675309 if you want your results to match the ones in this lab. Figure 5 shows the interface after it has been set up correctly for the confidence interval along with the results.

Figure 5: Calculating a 95% bootstrap percentile confidence interval for a single mean.

The histogram shows the mean heights from 1,000 bootstrap samples. A bootstrap sample is randomly selected from the original sample with replacement. The variability of the bootstrap distribution approximates the variability in the sampling distribution for the mean. The 95% confidence interval is calculated as the 2.5th and 97.5th percentiles of the bootstrap distribution. In Figure 5 you can see that the 95% confidence interval is approximately \((1.675, 1.682)\), which is very close to the 95% confidence interval you calculated using a mathematical model. If your results do not show enough significant figures, you can increase the number of significant figures displayed in the results by clicking on the icon that looks like three vertically stacked dots at the top right of the Jamovi window. Under Results, set the Number format to “5 sf”.
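The bootstrap percentile method can be sketched in Python as follows. This is an illustration under loud assumptions: the heights list is synthetic stand-in data drawn from a normal distribution with the same mean and standard deviation as the sample, not the actual yrbss heights, so its interval will not match Figure 5. The function name is hypothetical, and Jamovi may compute percentiles with a slightly different rule.

```python
import random
import statistics

rng = random.Random(8675309)

# ASSUMPTION: synthetic stand-in data, NOT the real yrbss heights.
heights = [rng.gauss(1.679, 0.1005) for _ in range(2870)]


def bootstrap_percentile_ci(data, rng, n_boot=1000):
    """Return the 2.5th and 97.5th percentiles of the bootstrap means."""
    means = []
    for _ in range(n_boot):
        # A bootstrap sample: same size as the data, drawn with replacement.
        resample = rng.choices(data, k=len(data))
        means.append(statistics.fmean(resample))
    # Splitting into 40 equal parts gives cut points at 2.5%, 5%, ..., 97.5%.
    cut_points = statistics.quantiles(means, n=40)
    return cut_points[0], cut_points[-1]


lower, upper = bootstrap_percentile_ci(heights, rng)
print(f"95% bootstrap percentile CI: ({lower:.3f}, {upper:.3f})")
```

To try this on the real data, you would replace the synthetic list with the height values exported from Jamovi.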

Saving your work

You can save your work in Jamovi by clicking on the hamburger menu and selecting Save. You can save your work as a .omv file, which is a file that can be opened in Jamovi. However, you will not turn this file in for your lab report. Instead, you will turn in a PDF of your lab report that includes screenshots of the Jamovi interface, plots, tables, and your answers to questions at the end of the lab. Even though you are not turning it in, you should save your Jamovi file in case you need to refer back to it later.

What you need to turn in

This section includes questions that you will turn in for this lab. You will continue to work with the yrbss data, but you will focus on the strength_training_7d, school_night_hours_sleep, weight, and age variables.

  1. Another study reports that the mean weight for 15 year olds in the United States is 65.5 kg. Evaluate this claim using the yrbss data by conducting a hypothesis test using a T-distribution. First state your hypotheses in your report using symbols. Next, create a histogram of weights for the 15 year olds in the yrbss data. Then create a table of summary statistics for the weights, including the number of cases, the mean, the median, the standard deviation, the minimum, the maximum, Q1, and Q3. Include the table and the plot in your report. Are the conditions for using a T-distribution met? Explain. Calculate the T-statistic and degrees of freedom and include them in your report. Conduct the hypothesis test and include the results in your report. What is the p-value? What can you conclude based on the p-value? Use complete sentences to answer this question.
  2. Calculate a 95% confidence interval for the mean weight of 15 year olds in the United States using a mathematical model. Report the value of the multiplier you used, and the 95% confidence interval you calculated. Compare this to the results of the hypothesis test you just conducted. Are they consistent?
  3. Calculate a 95% bootstrap percentile confidence interval for the mean weight of 15 year olds in the United States. Report the confidence interval, and include the histogram showing the bootstrap distribution in your report. Compare this confidence interval to the one you calculated using a mathematical model. Are they consistent?
  4. Next you will investigate whether there is an association between the amount of strength training a student does and the number of hours they sleep in a typical school night. Start by filtering the data (make sure you have deactivated any existing filters first) to exclude students who did not respond to the strength training question and to exclude students who did not respond to the sleep question. Then, create two new variables. Create a variable called strength_ind by recoding the strength_training_7d variable. This variable should have three levels: “low” for students who did 0-2 days of strength training, “medium” for students who did 3-5 days of strength training, and “high” for students who did 6-7 days of strength training. Next create a variable called sleep_ind by recoding the school_night_hours_sleep variable. This variable should have three levels: “low” for students who slept 0-6 hours in a typical school night, “medium” for students who slept 7-9 hours in a typical school night, and “high” for students who slept 9 or more hours in a typical school night. Once you have applied your filters and created your new variables, check your data to make sure they match what you are expecting. Take a screenshot of the data in Jamovi that shows the new variables along with the source variables that were used to create them and include it in your report. If you are unable to fit all four variables in the same view, you may include multiple screenshots in your report. Your screenshots do not need to include all of the cases.
  5. Using the two new variables you created, create a contingency table showing the counts of students that were in each of the strength training groups for each of the sleep hours groups. Make sure the rows of the table correspond with the sleep hours groups. Also add the percentages of each strength group for each of the sleep hour groups (i.e., within each row) to the table. Then create a stacked bar plot that shows the percentages of each strength training group for each of the sleep groups (sleep_ind should be on the horizontal axis). Include the table and the plot in your report.
  6. Conduct a hypothesis test using a mathematical model to evaluate whether there is evidence of an association between the amount of strength training a student does and the number of hours they sleep in a typical school night. Write out your hypotheses in words. Explain how you determined whether the validity conditions are met for the test. Your answer should refer to the contingency table you just created or to a new table that you include in your report for this question. Use Jamovi to calculate the value of the \(X^2\) statistic and a p-value. Include the table in your report. What can you conclude based on the p-value?
  7. Now conduct the hypothesis test using a randomization approach with the Contingency Table - Hypothesis Test analysis in the Randomize module. Use at least 1,000 random permutations and create a histogram of the null distribution. Include the histogram in your report. What does the histogram show? What is the approximate p-value? Are your results consistent with the results from the model-based test?

You may create your lab report in a Word document or a Google Doc. You may organize your report as numbered answers to the questions listed above. Include the screenshots, plots, and tables in your report, making sure that they are positioned under the correct question number. You should also include your answers to the questions in your report, and your answers should refer to the relevant plots or tables when applicable. Save your report as a PDF and submit using the appropriate submission link on the course Moodle page (check the pdf before you submit it to make sure it is readable and complete).