J Lab 1

Data Manipulation and Inference for Proportions

Author

Yurk

Portions of this lab are based on an R lab from the Introduction to Modern Statistics (2e) textbook by Mine Çetinkaya-Rundel and Johanna Hardin.

Part 1: Guided Walkthrough

In the guided walkthrough you will learn how to import data, set up variable types, filter, recode variables, create frequency tables and bar plots, create contingency tables and stacked bar plots, construct bootstrap confidence intervals, and perform permutation tests. You will use these skills in Part 2 to answer questions that you will turn in for your lab report.

Note: The questions in Part 1 are for your own understanding and do not need to be included in your lab report. Only the questions in Part 2 need to be turned in.

Section 1: Introduction & Import CSV

In this lab you will explore the yrbss data set, available here or on our Moodle page. The data set is from the Youth Risk Behavior Surveillance System (YRBSS) survey, which tracks high school student behaviors that could impact health. The data set includes a subset of the variables from the full YRBSS survey.

Download the yrbss.csv file to your computer. Then, open it in Jamovi by clicking on the hamburger menu (the three horizontal lines in the top left corner) and selecting Open. Click the Browse button and navigate to the file you downloaded.

How many cases/observations are there? What does each row represent?

Section 2: Set Up Variable Types

You will only use a subset of the variables for this lab, described in the following table. In Jamovi, click on the Data tab, then click on the Setup icon. Set up the Measure type and Data type for each variable as shown in the table. You can select different columns by clicking on the column names at the top of the spreadsheet.

Variable                   Measure type   Data type   Description
helmet_12m                 Nominal        Text        How often wore helmet when biking in last 12 months
text_while_driving_30d     Nominal        Text        How many days texting while driving in the last 30 days
physically_active_7d       Continuous     Integer     How many days physically active in the last 7 days
hours_tv_per_school_day    Nominal        Text        How many hours of TV on a typical school day

Some of these variables seem like they should be numeric. For example, text_while_driving_30d looks like it should contain numbers (0, 1, 2, …, 30). However, if you scroll through the values of this variable in Jamovi, you will see that some of them are text strings (like “did not drive”), which is why the variable needs to be set up as Nominal/Text. The same is true for helmet_12m and hours_tv_per_school_day. The variable physically_active_7d is truly numeric (integer values 0 through 7), so it is set up as Continuous/Integer.
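If it helps to see the idea outside of Jamovi, here is a minimal Python sketch of why one text value forces a whole column to be treated as text. The list of values is a made-up illustration (only “30” and “did not drive” come from the description above), and the helper name is_integer_text is hypothetical.

```python
# Hypothetical sample of column values, not the real data.
# Only "30" and "did not drive" are taken from the lab text.
values = ["0", "30", "did not drive"]


def is_integer_text(value):
    """Return True when the text can be read as a whole number."""
    try:
        int(value)
        return True
    except ValueError:
        return False


# Any value listed here prevents the column from being purely numeric.
non_numeric = [v for v in values if not is_integer_text(v)]
print(non_numeric)
```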

Section 3: Filtering Data

In this section you will focus on the variable text_while_driving_30d, which measures the number of days a student texted while driving in the last 30 days. You will consider this variable for a particular subset of students – those who reported participating in another risky behavior, not wearing a helmet while biking.

You will apply three filters to the data. To set up filters, click on the Data tab in Jamovi, then click on the Filters icon.

Filter 1: In the field next to the \(f_x\) button, type helmet_12m == "never". This retains only students who never wore a helmet while biking in the last 12 months. Click on the toggle switch above the text field to make the filter active. Note: == means equals in the filter syntax, and text values must be in quotation marks.

Filter 2: Click the large blue + button to the left of Filter 1 to add a new filter. Type text_while_driving_30d != NA. This removes students who did not respond to the texting question. The symbol != means not equal, and NA represents missing values in Jamovi. Activate this filter.

Filter 3: Add another filter by clicking the + button again. Type text_while_driving_30d != "did not drive". This removes students who reported that they did not drive in the last 30 days. Activate this filter.

After all three filters are active, click on the eye icon beneath the + button to hide the filtered (grayed out) rows. This makes it easier to scroll through just the retained cases.

Figure 1 shows the Filter interface in Jamovi with Filters 2 and 3 visible. Filter 1 is scrolled up and not visible in the figure, but it should still be active in your Jamovi session.

Figure 1: Filtering the data to include only students who never wore a helmet while biking and whose response to the question about texting while driving indicated that they had driven.

Scroll through your filtered data set and confirm that the correct cases have been retained. You can calculate the number of remaining (non-filtered) cases by subtracting the number of Filtered rows from the Row count. These numbers are listed below the spreadsheet. Check: there should be 4,387 observations remaining.
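The logic of the three filters can also be sketched in Python. This is only an illustration under stated assumptions: the five rows are toy data rather than the real yrbss file, None stands in for NA, and the helmet value "always" is a hypothetical category used to show a row being dropped.

```python
# Toy rows standing in for the yrbss data; None plays the role of NA.
# The value "always" is a hypothetical helmet category for illustration.
rows = [
    {"helmet_12m": "never", "text_while_driving_30d": "30"},
    {"helmet_12m": "never", "text_while_driving_30d": None},
    {"helmet_12m": "never", "text_while_driving_30d": "did not drive"},
    {"helmet_12m": "always", "text_while_driving_30d": "0"},
    {"helmet_12m": "never", "text_while_driving_30d": "0"},
]


def keep(row):
    """Return True when a row passes all three filters."""
    text = row["text_while_driving_30d"]
    return (
        row["helmet_12m"] == "never"   # Filter 1: never wore a helmet
        and text is not None           # Filter 2: responded to the question
        and text != "did not drive"    # Filter 3: drove in the last 30 days
    )


retained = [row for row in rows if keep(row)]

# Same bookkeeping as in Jamovi: remaining = Row count - Filtered.
n_filtered = len(rows) - len(retained)
print(len(rows), n_filtered, len(retained))
```

A row is retained only when every active filter is true for it, which is why turning a filter off later in the lab brings rows back.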

Section 4: Recode/Transform a Variable

Now you will create a new variable called text_ind that has a value of “yes” if the student texted while driving every day in the last 30 days (i.e., the value of text_while_driving_30d is “30”), and a value of “no” otherwise.

To create this variable:

  1. Click on the Data tab.

  2. Click on the header of the text_while_driving_30d column so it is selected.

  3. Click the Transform icon.

  4. Change the name of the transformed variable to text_ind (by default it will be something like text_while_driving_30d (2)).

  5. In the pull-down box labeled using transform, select “create new transform”.

  6. In the text field labeled \(f_x\), type the following recode condition:

    IF($source == "30", "yes", "no")

    In this formula, $source refers to the original variable (text_while_driving_30d). The IF function checks whether the value equals “30”. If it does, the new variable gets the value “yes”; otherwise it gets “no”.

Figure 2 shows the transformation interface with the correct recode condition.

Figure 2: Creating a new variable that indicates whether a student texted while driving every day in the last 30 days.

Confirm that the new variable text_ind has been created correctly by scrolling through your data set. Compare its values to the original text_while_driving_30d variable to make sure the recoding was done correctly.
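For comparison, the same recode can be written as a small Python function. This is a sketch rather than what Jamovi runs internally; the function name recode_text_ind and the toy list of values are assumptions made for illustration.

```python
def recode_text_ind(source):
    """Mirror IF($source == "30", "yes", "no") for one value."""
    return "yes" if source == "30" else "no"


# Toy values standing in for text_while_driving_30d after filtering.
values = ["0", "30", "0", "30", "0"]
text_ind = [recode_text_ind(v) for v in values]

# Print the original and recoded values side by side for checking.
for original, recoded in zip(values, text_ind):
    print(original, "->", recoded)
```

Printing the two columns side by side is the same check you do by scrolling through the spreadsheet.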

Section 5: Frequency Table & Bar Plot (Single Proportion)

Now you will calculate the proportion of students who texted while driving every day in the last 30 days in the filtered data. This proportion is a point estimate of the population proportion.

  1. Click on the Analyses tab, then click on the Exploration icon, and select Descriptives.
  2. Drag the text_ind variable into the Variables box.
  3. Expand the Statistics options menu. Uncheck all of the numeric statistics (N, Mean, Median, Std. deviation, Minimum, Maximum) since they are not appropriate for a categorical variable. Then, check the box for Frequency tables (located below the Split by box).
  4. Expand the Plots options menu and check the box for Bar plot.

Your results should be similar to the following:


 DESCRIPTIVES

 FREQUENCIES

 Frequencies of text_ind                              
 ──────────────────────────────────────────────────── 
   text_ind    Counts    % of Total    Cumulative %   
 ──────────────────────────────────────────────────── 
   no            3924      89.44609        89.44609   
   yes            463      10.55391       100.00000   
 ──────────────────────────────────────────────────── 

What proportion of these students reported texting while driving every day in the last 30 days?
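The percentages in the frequency table are just counts divided by the total. As a rough check, the following Python sketch rebuilds the column from the counts shown above and recomputes the table; the variable names are illustrative.

```python
from collections import Counter

# Rebuild the text_ind column from the counts in the frequency table.
text_ind = ["no"] * 3924 + ["yes"] * 463

counts = Counter(text_ind)
total = sum(counts.values())

# Percentage of the total in each category.
percents = {level: 100 * count / total for level, count in counts.items()}

for level in ("no", "yes"):
    print(f"{level:<5}{counts[level]:>8}{percents[level]:>12.5f}")
```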

Section 6: Contingency Table & Stacked Bar Plot (Two Proportions)

Do students who wear helmets when biking text while driving less than students who do not wear helmets? To answer this question, you will compare the proportions of students who texted while driving every day between students who never wore a helmet and students who reported wearing a helmet at least some of the time.

First, modify your filters:

  1. Click on the Data tab and then the Filters icon. Inactivate Filter 1 (the helmet_12m == "never" filter) by clicking its toggle switch so it is no longer active. Do not delete it – you will reactivate it later. Leave the two text_while_driving_30d filters (Filters 2 and 3) active.
  2. Add a new filter to remove cases where helmet_12m is missing: helmet_12m != NA. Activate it.
  3. Add another filter to remove cases where the student responded “did not ride”: helmet_12m != "did not ride". Activate it.

Check: there should be 5,395 observations remaining.

Next, create a new variable called helmet_ind that has a value of “no” if the student never wore a helmet, and “yes” otherwise. Select the helmet_12m column, click Transform, rename the variable to helmet_ind, create a new transform, and enter the recode condition:

IF($source == "never", "no", "yes")

Compare the values of helmet_ind to helmet_12m and make sure the results make sense.

Now create a contingency table and stacked bar plot:

  1. Click on the Analyses tab, then click on the Frequencies icon, and select Independent Samples (under Contingency Tables).
  2. Drag helmet_ind into the Rows box and text_ind into the Columns box.
  3. Expand the Cells options menu. Under Percentages, check the box for Row. This adds conditional proportions showing the proportion within each helmet group.
  4. Expand the Plots options menu. Check the box for Bar plot. Select Stacked (instead of Side by side) and Percentages (instead of Counts). Select “within rows” from the drop-down menu next to Percentages.

Your results should be similar to the following:


 Contingency Tables                                                    
 ───────────────────────────────────────────────────────────────────── 
   helmet_ind                    yes          no           Total       
 ───────────────────────────────────────────────────────────────────── 
   no            Observed              463         3924         4387   
                 % within row     10.55391     89.44609    100.00000   
                                                                       
   yes           Observed               56          952         1008   
                 % within row      5.55556     94.44444    100.00000   
                                                                       
   Total         Observed              519         4876         5395   
                 % within row      9.62002     90.37998    100.00000   
 ───────────────────────────────────────────────────────────────────── 

The first row of the table shows the proportions for non-helmet-wearers, and the second row shows the proportions for helmet-wearers. Based on the table and the bar plot, the observed difference in proportions of students who texted while driving every day between the two groups is:

\[\hat{p}_{\text{no helmet}} - \hat{p}_{\text{helmet}} = 0.106 - 0.056 = 0.050\]
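The arithmetic behind this difference can be checked directly from the Observed counts in the contingency table. The short Python sketch below does that; the variable names are illustrative.

```python
# Observed counts from the contingency table (rows are helmet_ind).
yes_no_helmet, n_no_helmet = 463, 4387   # helmet_ind = "no"
yes_helmet, n_helmet = 56, 1008          # helmet_ind = "yes"

# Conditional proportions of "yes" within each row.
p_hat_no_helmet = yes_no_helmet / n_no_helmet
p_hat_helmet = yes_helmet / n_helmet

diff = p_hat_no_helmet - p_hat_helmet
print(f"{p_hat_no_helmet:.3f} - {p_hat_helmet:.3f} = {diff:.3f}")
```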

Section 7: Seeding the Random Number Generator

Sections 8 and 9 use the Randomize module in Jamovi. Before proceeding, verify that this module is installed: click on the Analyses tab and look for an icon labeled Randomize.

In Sections 8 and 9 you will use simulation-based methods (bootstrapping and permutation testing). These methods involve random sampling, which means the results will be slightly different each time you run them.

To make your results reproducible (and to match the results shown in this lab), you can seed the random number generator. A seed is a starting value that determines the sequence of random numbers generated. If you use the same seed, you will get the same results every time.

In the Randomize module interfaces, there is a check box labeled “Seed the random number generator”. If you select this, you can enter a seed value. The default seed is 8675309 (10,000 \(\times\) Jenny’s constant). Use this seed for all simulations in this lab.
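To see what seeding does, here is a small Python sketch. It uses Python's own random number generator, which is not the one Jamovi uses, so the numbers themselves will not match anything in Jamovi; the point is only that the same seed reproduces the same sequence. The helper name draw is hypothetical.

```python
import random

SEED = 8675309


def draw(seed, k=5):
    """Return k random integers from a generator started at the given seed."""
    rng = random.Random(seed)
    return [rng.randint(1, 100) for _ in range(k)]


first = draw(SEED)
second = draw(SEED)       # same seed, so the same sequence
other = draw(SEED + 1)    # a different seed starts a different sequence

print(first == second)
print(first, other)
```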

Section 8: Bootstrap Percentile CI (Single Proportion)

Now you will construct a confidence interval for the proportion of non-helmet-wearing students who text while driving every day, using bootstrapping.

First, reactivate the filters for non-helmet-wearers only. Go to the Filters interface and make the following changes:

  • Reactivate the helmet_12m == "never" filter.
  • Inactivate the helmet_12m != NA filter.
  • Inactivate the helmet_12m != "did not ride" filter.
  • Leave the two text_while_driving_30d filters active.

Check: you should have 4,387 observations.

The bootstrap method works by resampling from your original sample (with replacement) many times, calculating the sample proportion for each resample, and using the resulting distribution to estimate the confidence interval.

To perform the bootstrap:

  1. Click on the Analyses tab. Click on the Randomize module icon, and select Single Proportion - Confidence Interval.
  2. Drag the text_ind variable into the Variable box.
  3. Set the Confidence level to 95.
  4. Select bootstrap percentile as the type of confidence interval.
  5. Check “Seed the random number generator” and enter 8675309.
  6. Start with 100 bootstraps and select Dot plot.

Figure 3 shows the interface and results.

Figure 3: Calculating a confidence interval for a single proportion using bootstrapping (100 bootstraps, dot plot).

Each dot in the dot plot represents the proportion of successes (\(\hat{p}\)) from a single bootstrap sample of 4,387 students. The vertical red dashed lines indicate the 2.5th and 97.5th percentiles of the bootstrap distribution, which form the boundaries of the 95% bootstrap percentile confidence interval.
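The resampling idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Randomize module's implementation: it uses the counts from the frequency table in Section 5, Python's own random number generator (so the interval will not match Jamovi's exactly, even with the same seed), and one simple convention for picking percentiles from the sorted bootstrap proportions.

```python
import random

N_YES, N_NO = 463, 3924   # counts from the frequency table in Section 5
N_BOOT = 1000             # number of bootstrap resamples
SEED = 8675309

sample = [1] * N_YES + [0] * N_NO   # 1 = "yes", 0 = "no"
n = len(sample)
rng = random.Random(SEED)

boot_props = []
for _ in range(N_BOOT):
    # Draw n cases from the original sample with replacement.
    resample = rng.choices(sample, k=n)
    boot_props.append(sum(resample) / n)

# Percentile interval: cut off the lowest and highest 2.5% of proportions.
boot_props.sort()
lower = boot_props[int(0.025 * N_BOOT)]
upper = boot_props[int(0.975 * N_BOOT) - 1]
print(f"95% bootstrap percentile CI: ({lower:.4f}, {upper:.4f})")
```

Each entry of boot_props corresponds to one dot in the dot plot, and lower and upper play the role of the two dashed lines.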

Now you will modify the analysis you just created. To reopen an existing analysis, click on the corresponding results in the Results pane on the right side of the screen. This will bring up the analysis interface so you can adjust its settings. Increase the number of bootstraps to 1,000 and change the plot type to Histogram. Your results should be similar to Figure 4.

Figure 4: Calculating a confidence interval for a single proportion using bootstrapping (1,000 bootstraps, histogram).

Report the 95% bootstrap percentile confidence interval from the Simulation Results table. What does this interval tell you about the proportion of non-helmet-wearing teenagers who text while driving every day?

Section 9: Permutation Test (Two Proportions)

Now you will test whether students who never wore a helmet text while driving every day at a higher rate than students who wore a helmet at least some of the time.

First, reactivate the filters for both helmet groups. Go to the Filters interface and make the following changes:

  • Inactivate the helmet_12m == "never" filter.
  • Reactivate the helmet_12m != NA filter.
  • Reactivate the helmet_12m != "did not ride" filter.
  • Leave the two text_while_driving_30d filters active.

Check: you should have 5,395 observations.

State the hypotheses:

\[\begin{array}{ll} H_0: & p_{\text{no helmet}} - p_{\text{helmet}} = 0\\ H_a: & p_{\text{no helmet}} - p_{\text{helmet}} > 0\end{array}\]

The permutation test works by randomly shuffling the values of the response variable (text_ind) among all students, calculating the difference in proportions for each shuffle, and building a null distribution. This simulates what would happen if there were truly no association between helmet wearing and texting while driving.
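The shuffling idea can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Randomize module's implementation: it uses the counts from the contingency table in Section 6 and Python's own random number generator, so the simulated values will not match Jamovi's exactly. Dealing a random set of 1,008 responses to the helmet group and giving the rest to the no-helmet group is equivalent to shuffling text_ind among all students.

```python
import random

# Counts from the contingency table in Section 6.
YES_NO_HELMET, N_NO_HELMET = 463, 4387
YES_HELMET, N_HELMET = 56, 1008
N_PERM = 1000
SEED = 8675309

observed = YES_NO_HELMET / N_NO_HELMET - YES_HELMET / N_HELMET

total_yes = YES_NO_HELMET + YES_HELMET
n = N_NO_HELMET + N_HELMET
text_ind = [1] * total_yes + [0] * (n - total_yes)   # 1 = "yes", 0 = "no"

rng = random.Random(SEED)
null_diffs = []
for _ in range(N_PERM):
    # Randomly deal N_HELMET responses to the helmet group;
    # the remaining responses belong to the no-helmet group.
    helmet_yes = sum(rng.sample(text_ind, N_HELMET))
    no_helmet_yes = total_yes - helmet_yes
    null_diffs.append(no_helmet_yes / N_NO_HELMET - helmet_yes / N_HELMET)

# One-sided p-value: share of shuffled differences at least as large
# as the observed difference.
p_value = sum(d >= observed for d in null_diffs) / N_PERM
print(f"observed difference: {observed:.3f}, approximate p-value: {p_value}")
```

The list null_diffs is the null distribution: each entry is the difference in proportions you would see from one shuffle if helmet wearing and texting were unrelated.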

To perform the permutation test:

  1. Click on the Analyses tab. Click on the Randomize module icon, and select Two Proportions - Hypothesis Test.
  2. Drag text_ind into the Columns box and helmet_ind into the Rows box.
  3. Make sure “compare rows” is selected.
  4. Check the contingency table that appears in the results to see which group is in the first row. The first row is “Group 1” and the second row is “Group 2”. In this case, Group 1 will probably be the no-helmet group (“no”) and Group 2 the helmet group (“yes”). If so, select the alternative hypothesis Group 1 > Group 2. If the groups are reversed, select Group 1 < Group 2 instead.
  5. Expand the Cells section and check Row percentages.
  6. Expand the Simulations section. Check “Seed the random number generator” and enter 8675309.
  7. Start with 100 simulations and select Dot plot. Each dot represents the difference in proportions from one random permutation of the data. The vertical red dashed line marks the observed difference. Dots colored red are counted toward the p-value.
  8. Now change to 1,000 simulations and select Histogram.

Figure 5 shows the interface and results with 1,000 simulations.

Figure 5: Performing a permutation test for a difference in two proportions (1,000 permutations, histogram).

What is the approximate p-value from the Simulation Results table? Based on this p-value, do you reject or fail to reject the null hypothesis at the \(\alpha = 0.05\) level? What does this tell you about the relationship between helmet wearing and texting while driving?

Section 10: Taking Screenshots for Lab Reports

When creating your lab report, you will need to include screenshots of Jamovi tables and plots. The right-click copy method in Jamovi is sometimes unreliable, so it is better to take screenshots directly.

Mac: Press Command + Shift + 4, then drag to select the area you want to capture. The screenshot will be saved to your Desktop.

Windows: Press Windows + Shift + S to open the Snipping Tool. Drag to select the area you want to capture. The screenshot will be copied to your clipboard and can be pasted directly into your document.

Tips:

  • Crop your screenshots tightly around the relevant table or plot.
  • Position each screenshot directly under the question it corresponds to in your lab report.

Section 11: Saving Your Work

You can save your work in Jamovi by clicking on the hamburger menu and selecting Save. Save your work as a .omv file, which can be reopened in Jamovi later.

Important: The .omv file is not what you submit. You will submit a PDF of your lab report (see Part 2). However, you should save the .omv file in case you need to refer back to your work. If you are working on a classroom computer, email the file to yourself or save it to Google Drive before you leave.


Part 2: What You Need to Turn In

As you work through the questions below, build your lab report in a Word document or Google Doc. For each question, include the requested screenshots, tables, and written answers. You should write your report as you go rather than waiting until the end. See the Submission Instructions at the end of this document for more details.

You will continue to work with the yrbss data set, but now you will focus on different variables: physically_active_7d and hours_tv_per_school_day. The variable physically_active_7d records how many days a student was physically active in the last 7 days (integer values 0 through 7), and hours_tv_per_school_day records how many hours of TV a student watches on a typical school day.

Scenario: You want to investigate whether heavy TV watchers (those who watch 5 or more hours of TV per school day) are less likely to be physically active every day of the week compared to lighter TV watchers.

Before starting, inactivate all filters from the walkthrough by clicking on each filter’s toggle switch in the Filters interface.

Question 1: Filter, recode, frequency table & bar plot (single proportion)

Start by analyzing the proportion of heavy TV watchers who are physically active every day.

  1. Create a filter to retain only students who watch 5+ hours of TV per school day: hours_tv_per_school_day == "5+". Activate this filter.

  2. Create a second filter to remove students who did not respond to the physical activity question: physically_active_7d != NA. Activate this filter.

  3. Check: you should have 1,589 observations remaining.

  4. Create a new variable called active_ind by transforming the physically_active_7d variable. Use the recode condition: IF($source == 7, "yes", "no"). This creates a variable indicating whether a student was physically active every day of the week. Scroll through the data and verify that the recoding is correct.

    Note: In the walkthrough, the recode condition used "30" in quotes because text_while_driving_30d is a Text variable. Here, physically_active_7d is an Integer variable, so you use 7 without quotes. The rule is: use quotes for Text values, no quotes for numeric values.

  5. Use the Descriptives analysis to create a frequency table and bar plot for the active_ind variable. Uncheck all numeric statistics and check Frequency tables. Check Bar plot under Plots.

In your lab report, include:

  • A screenshot of the frequency table.
  • A screenshot of the bar plot.
  • The proportion of heavy TV watchers who are physically active every day.
  • The total number of students (N) in the filtered data.

Question 2: Contingency table & stacked bar plot (two proportions)

Now compare the proportions of students who are physically active every day between heavy TV watchers and lighter TV watchers.

  1. Inactivate the hours_tv_per_school_day == "5+" filter. Keep the physically_active_7d != NA filter active.
  2. Add a new filter to remove students who did not respond to the TV question: hours_tv_per_school_day != NA. Activate it.
  3. Check: you should have 13,213 observations remaining.
  4. Create a new variable called tv_ind by transforming the hours_tv_per_school_day variable. Use the recode condition: IF($source == "5+", "high", "low"). Verify the recoding is correct.
  5. Create a contingency table: Analyses \(\rightarrow\) Frequencies \(\rightarrow\) Independent Samples. Drag tv_ind into the Rows box and active_ind into the Columns box. Under Cells, check Row percentages. Under Plots, create a stacked bar plot with row percentages.

In your lab report, include:

  • A screenshot of the contingency table.
  • A screenshot of the stacked bar plot.
  • The observed difference in proportions of students who are physically active every day: \(\hat{p}_{\text{high TV}} - \hat{p}_{\text{low TV}}\).

Question 3: Bootstrap percentile CI (single proportion)

Construct a confidence interval for the proportion of heavy TV watchers who are physically active every day.

  1. Reactivate the hours_tv_per_school_day == "5+" filter and inactivate the hours_tv_per_school_day != NA filter. Leave the physically_active_7d != NA filter active. Your data should now include only heavy TV watchers who responded to the physical activity question. Check: you should have 1,589 observations remaining.
  2. Use the Randomize module: Single Proportion - Confidence Interval.
  3. Drag active_ind into the Variable box.
  4. Set the confidence level to 95, select bootstrap percentile, use 1,000 bootstraps, select Histogram, and seed the RNG with 8675309.

In your lab report, include:

  • A screenshot of the histogram.
  • A screenshot of the Simulation Results table.
  • The 95% bootstrap percentile confidence interval.
  • An interpretation of the confidence interval in context (written as a complete sentence).

Question 4: Permutation test (two proportions)

Test whether there is a difference in the proportion of students who are physically active every day between heavy TV watchers and lighter TV watchers.

  1. Adjust your filters so that both TV groups are included (not just the heavy watchers). Inactivate the hours_tv_per_school_day == "5+" filter and reactivate the hours_tv_per_school_day != NA filter. Leave the physically_active_7d != NA filter active. Your data should now include all students who responded to both the TV and physical activity questions. Check: you should have 13,213 observations remaining.

  2. State the hypotheses. Since you are testing whether there is any difference (not a specific direction), use a two-sided test:

    \[\begin{array}{ll} H_0: & p_{\text{high TV}} - p_{\text{low TV}} = 0 \\ H_a: & p_{\text{high TV}} - p_{\text{low TV}} \neq 0 \end{array}\]

  3. Use the Randomize module: Two Proportions - Hypothesis Test.

  4. Drag active_ind into the Columns box and tv_ind into the Rows box.

  5. Make sure “compare rows” is selected and set the alternative hypothesis to Group 1 \(\neq\) Group 2.

  6. Use 1,000 permutations, select Histogram, and seed the RNG with 8675309.

In your lab report, include:

  • The null and alternative hypotheses (written using symbols as shown above).
  • A screenshot of the histogram.
  • A screenshot of the Simulation Results table.
  • The p-value from the simulation.
  • Your conclusion: do you reject or fail to reject the null hypothesis? State your conclusion in complete sentences, in the context of the problem.

Submission Instructions

Create your lab report in a Word document or Google Doc. Organize your report as numbered answers (1 through 4) to the questions above. Include screenshots and plots positioned under the correct question number. Your answers should refer to the relevant plots or tables when applicable.

Save your report as a PDF and submit it using the appropriate submission link on the course Moodle page. Check the PDF before you submit it to make sure it is readable and complete.