DESCRIPTIVES
FREQUENCIES
Frequencies of text_ind
────────────────────────────────────────────────────
  text_ind    Counts    % of Total    Cumulative %
────────────────────────────────────────────────────
  no            3924      89.44609        89.44609
  yes            463      10.55391       100.00000
────────────────────────────────────────────────────
J Lab 1
Data Manipulation and Inference for Proportions
Portions of this lab are based on an R lab from the Introduction to Modern Statistics (2e) textbook by Mine Çetinkaya-Rundel and Johanna Hardin.
Part 1: Guided Walkthrough
In the guided walkthrough you will learn how to import data, set up variable types, filter, recode variables, create frequency tables and bar plots, create contingency tables and stacked bar plots, construct bootstrap confidence intervals, and perform permutation tests. You will use these skills in Part 2 to answer questions that you will turn in for your lab report.
Note: The questions in Part 1 are for your own understanding and do not need to be included in your lab report. Only the questions in Part 2 need to be turned in.
Section 1: Introduction & Import CSV
In this lab you will explore the yrbss data set, available here or on our Moodle page. The data set is from the Youth Risk Behavior Surveillance System (YRBSS) survey, which tracks high school student behaviors that could impact health. The data set includes a subset of the variables from the full YRBSS survey.
Download the yrbss.csv file to your computer. Then, open it in Jamovi by clicking on the hamburger menu (the three horizontal lines in the top left corner) and selecting Open. Click the Browse button and navigate to the file you downloaded.
How many cases/observations are there? What does each row represent?
Section 2: Set Up Variable Types
You will only use a subset of the variables for this lab, described in the following table. In Jamovi, click on the Data tab, then click on the Setup icon. Set up the Measure type and Data type for each variable as shown in the table. You can select different columns by clicking on the column names at the top of the spreadsheet.
| Variable | Measure type | Data type | Description |
|---|---|---|---|
| helmet_12m | Nominal | Text | How often wore helmet when biking in last 12 months |
| text_while_driving_30d | Nominal | Text | How many days texting while driving in the last 30 days |
| physically_active_7d | Continuous | Integer | How many days physically active in the last 7 days |
| hours_tv_per_school_day | Nominal | Text | How many hours of TV on a typical school day |
Some of these variables seem like they should be numeric. For example, text_while_driving_30d looks like it should contain numbers (0, 1, 2, …, 30). However, scroll through the values of this variable in Jamovi. You will see that some of the values are text strings (like “did not drive”), which is why the variable needs to be set up as Nominal/Text. The same is true for helmet_12m and hours_tv_per_school_day. The variable physically_active_7d is truly numeric (integer values 0 through 7), so it is set up as Continuous/Integer.
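The reasoning above can be sketched outside Jamovi. The short Python example below is only an illustration, not part of the Jamovi workflow: it checks whether every value in a column can be read as a whole number. The sample values are hypothetical stand-ins for a few rows of each column, and `all_integers` is a helper name made up for this sketch.

```python
# Hypothetical sample values; the real columns in yrbss.csv hold many more rows.
text_while_driving_30d = ["0", "30", "did not drive", "0"]
physically_active_7d = ["4", "7", "0", "5"]


def all_integers(values):
    """Return True only if every value can be read as a whole number."""
    for value in values:
        try:
            int(value)
        except ValueError:
            return False
    return True


print(all_integers(text_while_driving_30d))  # False
print(all_integers(physically_active_7d))    # True
```

A single text entry such as "did not drive" is enough to make the whole column non-numeric, which is why it is treated as Nominal/Text.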
Section 3: Filtering Data
In this section you will focus on the variable text_while_driving_30d, which measures the number of days a student texted while driving in the last 30 days. You will consider this variable for a particular subset of students – those who reported participating in another risky behavior, not wearing a helmet while biking.
You will apply three filters to the data. To set up filters, click on the Data tab in Jamovi, then click on the Filters icon.
Filter 1: In the field next to the \(f_x\) button, type helmet_12m == "never". This retains only students who never wore a helmet while biking in the last 12 months. Click on the toggle switch above the text field to make the filter active. Note: == means equals in the filter syntax, and text values must be in quotation marks.
Filter 2: Click the large blue + button to the left of Filter 1 to add a new filter. Type text_while_driving_30d != NA. This removes students who did not respond to the texting question. The symbol != means not equal, and NA represents missing values in Jamovi. Activate this filter.
Filter 3: Add another filter by clicking the + button again. Type text_while_driving_30d != "did not drive". This removes students who reported that they did not drive in the last 30 days. Activate this filter.
After all three filters are active, click on the eye icon beneath the + button to hide the filtered (grayed out) rows. This makes it easier to scroll through just the retained cases.
Figure 1 shows the Filter interface in Jamovi with Filters 2 and 3 visible. Filter 1 is scrolled up and not visible in the figure, but it should still be active in your Jamovi session.
Scroll through your filtered data set and confirm that the correct cases have been retained. You can calculate the number of remaining (non-filtered) cases by subtracting the number of Filtered rows from the Row count. These numbers are listed below the spreadsheet. Check: there should be 4,387 observations remaining.
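For readers who think in code, the three filters combine with a logical "and": a row is retained only if it passes all of them. The Python sketch below illustrates that logic on a handful of hypothetical rows (not the real data), with `None` standing in for a missing value (NA). It is an analogy for what the Jamovi filters do, not a description of how Jamovi implements them.

```python
# Hypothetical rows for illustration; None stands in for a missing value (NA).
rows = [
    {"helmet_12m": "never", "text_while_driving_30d": "30"},
    {"helmet_12m": "never", "text_while_driving_30d": "0"},
    {"helmet_12m": "never", "text_while_driving_30d": None},
    {"helmet_12m": "never", "text_while_driving_30d": "did not drive"},
    {"helmet_12m": "did not ride", "text_while_driving_30d": "0"},
]


def keep(row):
    """A row is retained only if it passes all three filters."""
    return (
        row["helmet_12m"] == "never"                          # Filter 1
        and row["text_while_driving_30d"] is not None         # Filter 2
        and row["text_while_driving_30d"] != "did not drive"  # Filter 3
    )


retained = [row for row in rows if keep(row)]
row_count = len(rows)
filtered_count = row_count - len(retained)

# Remaining cases = Row count - Filtered, as in the check above.
print(row_count - filtered_count)  # 2
```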
Section 4: Recode/Transform a Variable
Now you will create a new variable called text_ind that has a value of “yes” if the student texted while driving every day in the last 30 days (i.e., the value of text_while_driving_30d is “30”), and a value of “no” otherwise.
To create this variable:
- Click on the Data tab.
- Click on the header of the text_while_driving_30d column so it is selected.
- Click the Transform icon.
- Change the name of the transformed variable to text_ind (by default it will be something like text_while_driving_30d (2)).
- In the pull-down box labeled using transform, select “create new transform”.
- In the text field labeled \(f_x\), type the following recode condition:
IF($source == "30", "yes", "no")
In this formula, $source refers to the original variable (text_while_driving_30d). The IF function checks whether the value equals “30”. If it does, the new variable gets the value “yes”; otherwise it gets “no”.
Figure 2 shows the transformation interface with the correct recode condition.
Confirm that the new variable text_ind has been created correctly by scrolling through your data set. Compare its values to the original text_while_driving_30d variable to make sure the recoding was done correctly.
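The recode rule can also be written as a small function, which may help when checking values by eye. The Python sketch below mirrors the IF condition for one value at a time. The function name `recode_text_ind` and the sample values are hypothetical, chosen only for this illustration.

```python
def recode_text_ind(source):
    """Mirror IF($source == "30", "yes", "no") for a single value."""
    return "yes" if source == "30" else "no"


# Hypothetical values of text_while_driving_30d after filtering.
values = ["0", "30", "30", "0"]
text_ind = [recode_text_ind(v) for v in values]

for original, recoded in zip(values, text_ind):
    print(original, "->", recoded)
```

Comparing the original and recoded values side by side, as the loop does, is the same check you are asked to do by scrolling through the spreadsheet.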
Section 5: Frequency Table & Bar Plot (Single Proportion)
Now you will calculate the proportion of students who texted while driving every day in the last 30 days in the filtered data. This proportion is a point estimate of the population proportion.
- Click on the Analyses tab, then click on the Exploration icon, and select Descriptives.
- Drag the text_ind variable into the Variables box.
- Expand the Statistics options menu. Uncheck all of the numeric statistics (N, Mean, Median, Std. deviation, Minimum, Maximum) since they are not appropriate for a categorical variable. Then, check the box for Frequency tables (located below the Split by box).
- Expand the Plots options menu and check the box for Bar plot.
Your results should be similar to the following:
What proportion of these students reported texting while driving every day in the last 30 days?
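As a check on reading the frequency table, the "% of Total" column is each count divided by the total count. The Python sketch below redoes that arithmetic using the counts reported in this lab for the filtered data (3,924 "no" and 463 "yes"). The variable names are chosen for this illustration only.

```python
# Counts of text_ind in the filtered data, as reported in this lab.
counts = {"no": 3924, "yes": 463}

n = sum(counts.values())     # total number of retained students
p_hat = counts["yes"] / n    # sample proportion of "yes"

print(n, round(p_hat, 5))  # 4387 0.10554
```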
Section 6: Contingency Table & Stacked Bar Plot (Two Proportions)
Do students who wear helmets when biking text while driving less than students who do not wear helmets? To answer this question, you will compare the proportions of students who texted while driving every day between students who never wore a helmet and students who reported wearing a helmet at least some of the time.
First, modify your filters:
- Click on the Data tab and then the Filters icon. Inactivate Filter 1 (the helmet_12m == "never" filter) by clicking its toggle switch so it is no longer active. Do not delete it – you will reactivate it later. Leave the two text_while_driving_30d filters (Filters 2 and 3) active.
- Add a new filter to remove cases where helmet_12m is missing: helmet_12m != NA. Activate it.
- Add another filter to remove cases where the student responded “did not ride”: helmet_12m != "did not ride". Activate it.
Check: there should be 5,395 observations remaining.
Next, create a new variable called helmet_ind that has a value of “no” if the student never wore a helmet, and “yes” otherwise. Select the helmet_12m column, click Transform, rename the variable to helmet_ind, create a new transform, and enter the recode condition:
IF($source == "never", "no", "yes")
Compare the values of helmet_ind to helmet_12m and make sure the results make sense.
Now create a contingency table and stacked bar plot:
- Click on the Analyses tab, then click on the Frequencies icon, and select Independent Samples (under Contingency Tables).
- Drag helmet_ind into the Rows box and text_ind into the Columns box.
- Expand the Cells options menu. Under Percentages, check the box for Row. This adds conditional proportions showing the proportion within each helmet group.
- Expand the Plots options menu. Check the box for Bar plot. Select Stacked (instead of Side by side) and Percentages (instead of Counts). Select “within rows” from the drop-down menu next to Percentages.
Your results should be similar to the following:
Contingency Tables
─────────────────────────────────────────────────────────────────────
  helmet_ind                           yes          no        Total
─────────────────────────────────────────────────────────────────────
  no            Observed               463        3924         4387
                % within row      10.55391    89.44609    100.00000
  yes           Observed                56         952         1008
                % within row       5.55556    94.44444    100.00000
  Total         Observed               519        4876         5395
                % within row       9.62002    90.37998    100.00000
─────────────────────────────────────────────────────────────────────
The first row of the table shows the proportions for non-helmet-wearers, and the second row shows the proportions for helmet-wearers. Based on the table and the bar plot, the observed difference in proportions of students who texted while driving every day between the two groups is:
\[\hat{p}_{\text{no helmet}} - \hat{p}_{\text{helmet}} = 0.106 - 0.056 = 0.050\]
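The row percentages and the observed difference can be reproduced from the observed counts alone. The Python sketch below does this with the counts from the contingency table above. The dictionary layout and the helper `prop_yes` are assumptions made for this illustration.

```python
# Observed counts from the contingency table: rows are helmet_ind,
# inner keys are text_ind.
table = {
    "no":  {"yes": 463, "no": 3924},   # never wore a helmet
    "yes": {"yes": 56,  "no": 952},    # wore a helmet at least some of the time
}


def prop_yes(row):
    """Proportion of text_ind == "yes" within one helmet group (row)."""
    return row["yes"] / (row["yes"] + row["no"])


p_no_helmet = prop_yes(table["no"])
p_helmet = prop_yes(table["yes"])
diff = p_no_helmet - p_helmet

print(round(p_no_helmet, 3), round(p_helmet, 3), round(diff, 3))  # 0.106 0.056 0.05
```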
Section 7: Seeding the Random Number Generator
Sections 8 and 9 use the Randomize module in Jamovi. Before proceeding, verify that this module is installed: click on the Analyses tab and look for an icon labeled Randomize.
In Sections 8 and 9 you will use simulation-based methods (bootstrapping and permutation testing). These methods involve random sampling, which means the results will be slightly different each time you run them.
To make your results reproducible (and to match the results shown in this lab), you can seed the random number generator. A seed is a starting value that determines the sequence of random numbers generated. If you use the same seed, you will get the same results every time.
In the Randomize module interfaces, there is a check box labeled “Seed the random number generator”. If you select this, you can enter a seed value. The default seed is 8675309 (10,000 \(\times\) Jenny’s constant). Use this seed for all simulations in this lab.
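The effect of a seed is easy to see in a few lines of code. The Python sketch below uses Python's own random number generator, which is not the generator Jamovi uses, so the specific numbers will not match anything in Jamovi. It only demonstrates the principle that the same seed reproduces the same sequence. The helper `draw` is a name made up for this sketch.

```python
import random


def draw(seed, k=5):
    """Return k random numbers from a generator started at the given seed."""
    rng = random.Random(seed)   # independent generator with its own seed
    return [rng.random() for _ in range(k)]


first = draw(8675309)
second = draw(8675309)   # same seed as before
other = draw(12345)      # a different seed

print(first == second)  # True
print(first == other)   # False
```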
Section 8: Bootstrap Percentile CI (Single Proportion)
Now you will construct a confidence interval for the proportion of non-helmet-wearing students who text while driving every day, using bootstrapping.
First, reactivate the filters for non-helmet-wearers only. Go to the Filters interface and make the following changes:
- Reactivate the helmet_12m == "never" filter.
- Inactivate the helmet_12m != NA filter.
- Inactivate the helmet_12m != "did not ride" filter.
- Leave the two text_while_driving_30d filters active.
Check: you should have 4,387 observations.
The bootstrap method works by resampling from your original sample (with replacement) many times, calculating the sample proportion for each resample, and using the resulting distribution to estimate the confidence interval.
To perform the bootstrap:
- Click on the Analyses tab. Click on the Randomize module icon, and select Single Proportion - Confidence Interval.
- Drag the text_ind variable into the Variable box.
- Set the Confidence level to 95.
- Select bootstrap percentile as the type of confidence interval.
- Check “Seed the random number generator” and enter 8675309.
- Start with 100 bootstraps and select Dot plot.
Figure 3 shows the interface and results.
Each dot in the dot plot represents the proportion of successes (\(\hat{p}\)) from a single bootstrap sample of 4,387 students. The vertical red dashed lines indicate the 2.5th and 97.5th percentiles of the bootstrap distribution, which form the boundaries of the 95% bootstrap percentile confidence interval.
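To make the resampling idea concrete, here is a minimal Python sketch of a bootstrap percentile interval for a single proportion, using the counts from this lab (463 "yes" out of 4,387). It is not the Randomize module's implementation: the function name, the simple rule used to pick the percentiles, and Python's random number generator are all assumptions of this sketch, so the endpoints will differ slightly from Jamovi's even with the same seed.

```python
import random


def bootstrap_percentile_ci(successes, n, reps=1000, level=95, seed=8675309):
    """Bootstrap percentile CI for a proportion (level given as a whole percent)."""
    rng = random.Random(seed)
    data = [1] * successes + [0] * (n - successes)   # 1 = "yes", 0 = "no"

    props = []
    for _ in range(reps):
        resample = rng.choices(data, k=n)   # n draws with replacement
        props.append(sum(resample) / n)     # p-hat for this bootstrap sample
    props.sort()

    # Drop the lowest and highest (100 - level)/2 percent of the
    # bootstrap proportions; what remains spans the interval.
    tail = reps * (100 - level) // 200
    return props[tail], props[reps - 1 - tail]


lower, upper = bootstrap_percentile_ci(463, 4387)
print(round(lower, 3), round(upper, 3))
```

Software packages use slightly different conventions for computing percentiles, which is another reason small differences from the Jamovi output are expected.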
Now you will modify the analysis you just created. To reopen an existing analysis, click on the corresponding results in the Results pane on the right side of the screen. This will bring up the analysis interface so you can adjust its settings. Increase the number of bootstraps to 1,000 and change the plot type to Histogram. Your results should be similar to Figure 4.
Report the 95% bootstrap percentile confidence interval from the Simulation Results table. What does this interval tell you about the proportion of non-helmet-wearing teenagers who text while driving every day?
Section 9: Permutation Test (Two Proportions)
Now you will test whether students who never wore a helmet text while driving every day at a higher rate than students who wore a helmet at least some of the time.
First, reactivate the filters for both helmet groups. Go to the Filters interface and make the following changes:
- Inactivate the helmet_12m == "never" filter.
- Reactivate the helmet_12m != NA filter.
- Reactivate the helmet_12m != "did not ride" filter.
- Leave the two text_while_driving_30d filters active.
Check: you should have 5,395 observations.
State the hypotheses:
\[\begin{array}{ll} H_0: & p_{\text{no helmet}} - p_{\text{helmet}} = 0\\ H_a: & p_{\text{no helmet}} - p_{\text{helmet}} > 0\end{array}\]
The permutation test works by randomly shuffling the values of the response variable (text_ind) among all students, calculating the difference in proportions for each shuffle, and building a null distribution. This simulates what would happen if there were truly no association between helmet wearing and texting while driving.
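The shuffling procedure just described can be sketched in Python using the counts from the contingency table in Section 6. This is an illustration under stated assumptions, not the Randomize module's code: the function name is hypothetical, the p-value is taken as the share of shuffles with a difference at least as large as the observed one, and Python's random number generator differs from Jamovi's, so results will not match exactly.

```python
import random


def permutation_p_value(yes1, n1, yes2, n2, reps=1000, seed=8675309):
    """One-sided permutation p-value for H_a: p1 - p2 > 0."""
    rng = random.Random(seed)
    observed = yes1 / n1 - yes2 / n2
    total_yes = yes1 + yes2

    # Pool all responses: 1 = "yes", 0 = "no".
    responses = [1] * total_yes + [0] * (n1 + n2 - total_yes)

    extreme = 0
    for _ in range(reps):
        rng.shuffle(responses)            # break any link to the groups
        sim1 = sum(responses[:n1])        # "yes" count assigned to group 1
        sim2 = total_yes - sim1           # the rest go to group 2
        if sim1 / n1 - sim2 / n2 >= observed:
            extreme += 1
    return observed, extreme / reps


# Group 1 = never wore a helmet, Group 2 = wore a helmet at least sometimes.
observed, p_value = permutation_p_value(463, 4387, 56, 1008)
print(round(observed, 3), p_value)
```

Shuffling the pooled responses and then splitting them into groups of the original sizes is what "no association" looks like in data form: group membership carries no information about the response.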
To perform the permutation test:
- Click on the Analyses tab. Click on the Randomize module icon, and select Two Proportions - Hypothesis Test.
- Drag text_ind into the Columns box and helmet_ind into the Rows box.
- Make sure “compare rows” is selected.
- Check the contingency table that appears in the results to see which group is in the first row. The first row is “Group 1” and the second row is “Group 2”. In this case, Group 1 will probably be the no-helmet group (“no”) and Group 2 the helmet group (“yes”). If so, select the alternative hypothesis Group 1 > Group 2. If the groups are reversed, select Group 1 < Group 2 instead.
- Expand the Cells section and check Row percentages.
- Expand the Simulations section. Check “Seed the random number generator” and enter 8675309.
- Start with 100 simulations and select Dot plot. Each dot represents the difference in proportions from one random permutation of the data. The vertical red dashed line marks the observed difference. Dots colored red are counted toward the p-value.
- Now change to 1,000 simulations and select Histogram.
Figure 5 shows the interface and results with 1,000 simulations.
What is the approximate p-value from the Simulation Results table? Based on this p-value, do you reject or fail to reject the null hypothesis at the \(\alpha = 0.05\) level? What does this tell you about the relationship between helmet wearing and texting while driving?
Section 10: Taking Screenshots for Lab Reports
When creating your lab report, you will need to include screenshots of Jamovi tables and plots. The right-click copy method in Jamovi is sometimes unreliable, so it is better to take screenshots directly.
Mac: Press Command + Shift + 4, then drag to select the area you want to capture. The screenshot will be saved to your Desktop.
Windows: Press Windows + Shift + S to open the Snipping Tool. Drag to select the area you want to capture. The screenshot will be copied to your clipboard and can be pasted directly into your document.
Tips:
- Crop your screenshots tightly around the relevant table or plot.
- Position each screenshot directly under the question it corresponds to in your lab report.
Section 11: Saving Your Work
You can save your work in Jamovi by clicking on the hamburger menu and selecting Save. Save your work as a .omv file, which can be reopened in Jamovi later.
Important: The .omv file is not what you submit. You will submit a PDF of your lab report (see Part 2). However, you should save the .omv file in case you need to refer back to your work. If you are working on a classroom computer, email the file to yourself or save it to Google Drive before you leave.
Part 2: What You Need to Turn In
As you work through the questions below, build your lab report in a Word document or Google Doc. For each question, include the requested screenshots, tables, and written answers. You should write your report as you go rather than waiting until the end. See the Submission Instructions at the end of this document for more details.
You will continue to work with the yrbss data set, but now you will focus on different variables: physically_active_7d and hours_tv_per_school_day. The variable physically_active_7d records how many days a student was physically active in the last 7 days (integer values 0 through 7), and hours_tv_per_school_day records how many hours of TV a student watches on a typical school day.
Scenario: You want to investigate whether heavy TV watchers (those who watch 5 or more hours of TV per school day) are less likely to be physically active every day of the week compared to lighter TV watchers.
Before starting, inactivate all filters from the walkthrough by clicking on each filter’s toggle switch in the Filters interface.
Question 1: Filter, recode, frequency table & bar plot (single proportion)
Start by analyzing the proportion of heavy TV watchers who are physically active every day.
- Create a filter to retain only students who watch 5+ hours of TV per school day: hours_tv_per_school_day == "5+". Activate this filter.
- Create a second filter to remove students who did not respond to the physical activity question: physically_active_7d != NA. Activate this filter.
- Check: you should have 1,589 observations remaining.
- Create a new variable called active_ind by transforming the physically_active_7d variable. Use the recode condition: IF($source == 7, "yes", "no"). This creates a variable indicating whether a student was physically active every day of the week. Scroll through the data and verify that the recoding is correct.
  Note: In the walkthrough, the recode condition used "30" in quotes because text_while_driving_30d is a Text variable. Here, physically_active_7d is an Integer variable, so you use 7 without quotes. The rule is: use quotes for Text values, no quotes for numeric values.
- Use the Descriptives analysis to create a frequency table and bar plot for the active_ind variable. Uncheck all numeric statistics and check Frequency tables. Check Bar plot under Plots.
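The quotes rule in the note above has a direct analogue in most programming languages: a text value and a number are different things even when they look alike. The Python sketch below shows the idea. It is only an analogy, since Jamovi's handling of mismatched types may differ from Python's, and the sample values are hypothetical.

```python
text_value = "30"   # a Text value, like those in text_while_driving_30d
int_value = 7       # an Integer value, like those in physically_active_7d

print(text_value == "30")  # True
print(text_value == 30)    # False: text is never equal to a number here
print(int_value == 7)      # True
print(int_value == "7")    # False

# Recoding hypothetical integer values, comparing against 7 without quotes.
days_active = [7, 3, 0, 7]
active_ind = ["yes" if days == 7 else "no" for days in days_active]
print(active_ind)  # ['yes', 'no', 'no', 'yes']
```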
In your lab report, include:
- A screenshot of the frequency table.
- A screenshot of the bar plot.
- The proportion of heavy TV watchers who are physically active every day.
- The total number of students (N) in the filtered data.
Question 2: Contingency table & stacked bar plot (two proportions)
Now compare the proportions of students who are physically active every day between heavy TV watchers and lighter TV watchers.
- Inactivate the hours_tv_per_school_day == "5+" filter. Keep the physically_active_7d != NA filter active.
- Add a new filter to remove students who did not respond to the TV question: hours_tv_per_school_day != NA. Activate it.
- Check: you should have 13,213 observations remaining.
- Create a new variable called tv_ind by transforming the hours_tv_per_school_day variable. Use the recode condition: IF($source == "5+", "high", "low"). Verify the recoding is correct.
- Create a contingency table: Analyses \(\rightarrow\) Frequencies \(\rightarrow\) Independent Samples. Drag tv_ind into the Rows box and active_ind into the Columns box. Under Cells, check Row percentages. Under Plots, create a stacked bar plot with row percentages.
In your lab report, include:
- A screenshot of the contingency table.
- A screenshot of the stacked bar plot.
- The observed difference in proportions of students who are physically active every day: \(\hat{p}_{\text{high TV}} - \hat{p}_{\text{low TV}}\).
Question 3: Bootstrap percentile CI (single proportion)
Construct a confidence interval for the proportion of heavy TV watchers who are physically active every day.
- Reactivate the hours_tv_per_school_day == "5+" filter and inactivate the hours_tv_per_school_day != NA filter. Leave the physically_active_7d != NA filter active. Your data should now include only heavy TV watchers who responded to the physical activity question. Check: you should have 1,589 observations remaining.
- Use the Randomize module: Single Proportion - Confidence Interval.
- Drag active_ind into the Variable box.
- Set the confidence level to 95, select bootstrap percentile, use 1,000 bootstraps, select Histogram, and seed the RNG with 8675309.
In your lab report, include:
- A screenshot of the histogram.
- A screenshot of the Simulation Results table.
- State the 95% bootstrap percentile confidence interval.
- Interpret the confidence interval in context (use a complete sentence).
Question 4: Permutation test (two proportions)
Test whether there is a difference in the proportion of students who are physically active every day between heavy TV watchers and lighter TV watchers.
- Adjust your filters so that both TV groups are included (not just the heavy watchers). Inactivate the hours_tv_per_school_day == "5+" filter and reactivate the hours_tv_per_school_day != NA filter. Leave the physically_active_7d != NA filter active. Your data should now include all students who responded to both the TV and physical activity questions. Check: you should have 13,213 observations remaining.
- State the hypotheses. Since you are testing whether there is any difference (not a specific direction), use a two-sided test:
\[\begin{array}{ll} H_0: & p_{\text{high TV}} - p_{\text{low TV}} = 0 \\ H_a: & p_{\text{high TV}} - p_{\text{low TV}} \neq 0 \end{array}\]
- Use the Randomize module: Two Proportions - Hypothesis Test.
- Drag active_ind into the Columns box and tv_ind into the Rows box.
- Make sure “compare rows” is selected and set the alternative hypothesis to Group 1 \(\neq\) Group 2.
- Use 1,000 permutations, select Histogram, and seed the RNG with 8675309.
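For a two-sided alternative, simulated differences in either direction count as evidence against the null hypothesis. One common convention, sketched below in Python, is to count every simulated difference whose absolute value is at least as large as the absolute observed difference. This is an assumption of the sketch: the Randomize module may use a different convention (for example, doubling the smaller tail). The simulated differences and the observed value here are hypothetical, not results from the yrbss data.

```python
def two_sided_p_value(simulated_diffs, observed_diff):
    """Share of simulated differences at least as far from 0 as the observed one."""
    extreme = sum(1 for d in simulated_diffs if abs(d) >= abs(observed_diff))
    return extreme / len(simulated_diffs)


# Hypothetical null distribution of 10 permuted differences in proportions.
simulated = [-0.03, -0.01, 0.00, 0.01, 0.02, -0.02, 0.04, 0.01, -0.04, 0.00]

print(two_sided_p_value(simulated, -0.03))  # 0.3
```

In this toy example three of the ten simulated differences (-0.03, 0.04, and -0.04) are at least as extreme as the observed -0.03, so both tails contribute to the p-value.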
In your lab report, include:
- The null and alternative hypotheses (written using symbols as shown above).
- A screenshot of the histogram.
- A screenshot of the Simulation Results table.
- The p-value from the simulation.
- Your conclusion: do you reject or fail to reject the null hypothesis? State your conclusion in complete sentences, in the context of the problem.
Submission Instructions
Create your lab report in a Word document or Google Doc. Organize your report as numbered answers (1 through 4) to the questions above. Include screenshots and plots positioned under the correct question number. Your answers should refer to the relevant plots or tables when applicable.
Save your report as a PDF and submit it using the appropriate submission link on the course Moodle page. Check the PDF before you submit it to make sure it is readable and complete.