J Lab 2

Inference for Means Using Randomization

Author

Yurk

Portions of this lab are based on an R lab from the Introduction to Modern Statistics (2e) textbook by Mine Cetinkaya-Rundel and Johanna Hardin.

Part 1: Guided Walkthrough

In the guided walkthrough you will learn how to create descriptive statistics for a numeric variable, construct a bootstrap confidence interval for a single mean, perform a permutation test for two independent means, and construct a bootstrap confidence interval for the difference in two means. You will use these skills in Part 2 to answer questions that you will turn in for your lab report.

Note: The questions in Part 1 are for your own understanding and do not need to be included in your lab report. Only the questions in Part 2 need to be turned in.

Section 1: Introduction

In this lab you will continue to work with the yrbss data set from J Lab 1. In J Lab 1 you used bootstrapping and permutation tests for proportions. In this lab you will use the same randomization methods for means.

If you saved your Jamovi file from J Lab 1, you can reopen it. Otherwise, download yrbss.csv from the OpenIntro website or from our Moodle page, and open it in Jamovi.

You will use the following variables in this lab:

Variable              Measure type  Data type  Description
height                Continuous    Decimal    Height (meters)
weight                Continuous    Decimal    Weight (kilograms)
physically_active_7d  Continuous    Integer    Days physically active (at least 60 min) in last 7 days
age                   Continuous    Integer    Age (years)

Click on the Data tab, then click on the Setup icon, and set the Measure type and Data type for each variable as shown in the table. If you are reusing your J Lab 1 .omv file, you already set up physically_active_7d as Continuous/Integer and only need to set up height, weight, and age.

Also, if you are reusing your J Lab 1 file, deactivate (do not delete) any filters left over from J Lab 1. Click on the Data tab, then click on the Filters icon, and click the toggle switch on each active filter to turn it off.

Before proceeding, verify that the Randomize module is installed: click on the Analyses tab and look for an icon labeled Randomize.

Section 2: From Proportions to Means: The Same Randomization Logic

In J Lab 1 you learned two simulation-based inference methods: bootstrap confidence intervals and permutation tests. You applied these methods to proportions. In this lab you will apply the same methods to means. The logic is identical – only the statistic changes.

Bootstrap confidence interval:

  • In J Lab 1: You resampled from the original data with replacement, calculated \(\hat{p}\) for each resample, and used the percentiles of the bootstrap distribution for the CI.
  • For means: You resample from the original data with replacement, calculate \(\bar{x}\) for each resample, and use the percentiles of the bootstrap distribution for the CI.
  • In both cases, the bootstrap mimics repeated sampling from the population by resampling from the sample, which lets you estimate how much the statistic varies from sample to sample. The only difference is which statistic you track.
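
As a rough illustration of the bootstrap steps above, here is a minimal Python sketch. It is not what Jamovi runs internally. The heights are made up, the function names are hypothetical, and the percentile rule is a simple linear interpolation that may differ slightly from the one the Randomize module uses.

```python
import random
import statistics


def bootstrap_means(data, n_boot=1000, seed=8675309):
    """Return the mean of each of n_boot resamples drawn with replacement."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_boot)]


def percentile(sorted_vals, p):
    """Percentile p (0 to 100) of a pre-sorted list, by linear interpolation."""
    pos = (len(sorted_vals) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def percentile_ci(boot_stats, level=95):
    """Return the middle `level` percent of the bootstrap distribution."""
    ordered = sorted(boot_stats)
    tail = (100 - level) / 2
    return percentile(ordered, tail), percentile(ordered, 100 - tail)


# Hypothetical sample of heights in meters (not the yrbss data).
heights = [1.52, 1.60, 1.63, 1.65, 1.68, 1.70, 1.72, 1.75, 1.78, 1.81]
boot = bootstrap_means(heights)
lower, upper = percentile_ci(boot)
print(f"95% bootstrap percentile CI: ({lower:.3f}, {upper:.3f})")
```

Replacing `statistics.fmean` with a proportion calculation would give the J Lab 1 version of the same procedure, which is the sense in which only the statistic changes.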

Permutation test:

  • In J Lab 1: You shuffled the response variable values between groups, calculated the difference in \(\hat{p}\) for each shuffle, and built a null distribution.
  • For means: You shuffle the response variable values between groups, calculate the difference in \(\bar{x}\) for each shuffle, and build a null distribution.
  • In both cases, the permutation test simulates what would happen if there were no difference between groups, building a null distribution to compare against the observed result. The only difference is which statistic you track.
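
The shuffling logic above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Randomize module's code: the weights and function names are hypothetical, and the two-sided p-value is taken to be the proportion of shuffled differences at least as far from 0 as the observed difference, which may not match Jamovi's exact counting rule.

```python
import random
import statistics


def diff_in_means(values, labels, group1, group2):
    """Mean of group1 minus mean of group2."""
    g1 = [v for v, g in zip(values, labels) if g == group1]
    g2 = [v for v, g in zip(values, labels) if g == group2]
    return statistics.fmean(g1) - statistics.fmean(g2)


def permutation_test(values, labels, group1, group2, n_sim=1000, seed=8675309):
    """Return (observed difference, null distribution, two-sided p-value)."""
    rng = random.Random(seed)
    observed = diff_in_means(values, labels, group1, group2)
    shuffled = list(values)  # copy so the original order is untouched
    null_diffs = []
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # break any link between value and group
        null_diffs.append(diff_in_means(shuffled, labels, group1, group2))
    # Count shuffles at least as extreme as the observed difference.
    extreme = sum(abs(d) >= abs(observed) for d in null_diffs)
    return observed, null_diffs, extreme / n_sim


# Hypothetical weights in kilograms and group labels (not the yrbss data).
weights = [58.0, 61.5, 70.2, 66.0, 64.1, 72.3, 68.8, 75.0]
groups = ["0-2", "0-2", "0-2", "0-2", "3-7", "3-7", "3-7", "3-7"]
observed, null_diffs, p_value = permutation_test(weights, groups, "0-2", "3-7")
print(f"observed difference = {observed:.3f}, p-value = {p_value:.3f}")
```

Only the group labels stay fixed; the response values are reassigned at random, which is what "no difference between groups" looks like in the data.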

When to use randomization instead of the T-distribution:

In class you learned how to use the T-distribution for inference about means. The T-distribution works well when certain conditions are met (large enough sample, no extreme skew or outliers). Randomization methods are a good alternative when:

  • The sample size is small (n < 30) and the distribution is not approximately normal
  • The data are strongly skewed or have extreme outliers

Randomization methods do not assume that the population follows a particular distribution shape, making them more flexible. They still depend on the sample being representative of the population.

Section 3: Descriptive Statistics for a Numeric Variable

For this section you will analyze the heights of 15-year-old students in the YRBSS data.

Set up two filters:

Filter 1: age == 15. This retains only 15-year-old students. Note that age is a numeric variable, so you do not use quotation marks around the value. Recall from J Lab 1 that quotation marks are only needed for text values (like helmet_12m == "never").

Filter 2: Click the + button to add a second filter. Type height != NA. This removes students with missing height data.

Activate both filters. Check: you should have approximately 2,870 observations remaining. You can calculate the number of remaining (non-filtered) cases by subtracting the number of Filtered rows from the Row count. These numbers are listed below the spreadsheet.
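
The row counting described above can be sketched in Python. The rows below are invented, and `None` stands in for Jamovi's NA. This illustrates the bookkeeping only, not how Jamovi evaluates filters.

```python
# Hypothetical rows; None plays the role of a missing value (NA).
rows = [
    {"age": 15, "height": 1.65},
    {"age": 15, "height": None},
    {"age": 16, "height": 1.70},
    {"age": 15, "height": 1.72},
    {"age": None, "height": 1.60},
]


def passes_filters(row):
    """Keep 15-year-olds (Filter 1) with a recorded height (Filter 2)."""
    return row["age"] == 15 and row["height"] is not None


kept = [row for row in rows if passes_filters(row)]
row_count = len(rows)
filtered = row_count - len(kept)   # rows hidden by the filters
remaining = row_count - filtered   # Row count minus Filtered
print(row_count, filtered, remaining)  # prints: 5 3 2
```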

Now create descriptive statistics for height:

  1. Click on the Analyses tab, then click on the Exploration icon, and select Descriptives.
  2. Drag height into the Variables box.
  3. Expand the Statistics options menu. Keep N, Mean, Median, Std. deviation, Minimum, and Maximum checked. Also check Percentile Values and enter 25, 75 in the text field to add the 25th and 75th percentiles (Q1 and Q3).
  4. Expand the Plots options menu and check the box for Histogram.

Your results should be similar to the following:


 DESCRIPTIVES

 Descriptives                        
 ─────────────────────────────────── 
                         height      
 ─────────────────────────────────── 
   N                          2870   
   Mean                   1.678927   
   Median                 1.680000   
   Standard deviation    0.1004759   
   Minimum                1.270000   
   Maximum                2.110000   
   25th percentile        1.600000   
   75th percentile        1.750000   
 ─────────────────────────────────── 

The histogram shows the distribution of heights for 15-year-old students. The distribution is approximately symmetric and bell-shaped, centered near 1.68 meters, where the mean and median nearly coincide. The descriptive statistics table provides a numerical summary of the distribution.
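
For comparison, the same kind of summary can be sketched with Python's standard `statistics` module. The heights below are hypothetical. Software packages use different percentile conventions, so the quartiles from this sketch (which uses the "inclusive" method) are not guaranteed to match Jamovi's on real data.

```python
import statistics

# Hypothetical heights in meters (not the yrbss data).
heights = [1.52, 1.60, 1.63, 1.68, 1.70, 1.75, 1.81]

# Quartile cut points; the middle one is the median.
q1, _, q3 = statistics.quantiles(heights, n=4, method="inclusive")

summary = {
    "N": len(heights),
    "Mean": statistics.fmean(heights),
    "Median": statistics.median(heights),
    "Std. deviation": statistics.stdev(heights),  # sample SD (divides by n - 1)
    "Minimum": min(heights),
    "Maximum": max(heights),
    "25th percentile": q1,
    "75th percentile": q3,
}

for name, value in summary.items():
    print(f"{name:<18}{value:>10.4f}")
```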

Section 4: Bootstrap Percentile CI for a Single Mean

Another study claims that the average height of a 15-year-old in the US is 1.70 meters. You will estimate the true mean height using a bootstrap confidence interval.

To construct the bootstrap CI:

  1. Click on the Analyses tab. Click on the Randomize module icon, and select Single Mean - Confidence Interval.
  2. Drag height into the Variable box.
  3. Set the Confidence level to 95.
  4. Select bootstrap percentile as the type of confidence interval.
  5. Check “Seed the random number generator” and enter 8675309. As you learned in J Lab 1, seeding ensures your results are reproducible.
  6. Start with 100 bootstraps and select Dot plot.

Each dot in the dot plot represents the mean height (\(\bar{x}\)) from a single bootstrap resample of the same size as the original sample. This is the same process you used for a single proportion in J Lab 1, but now each dot is a bootstrap sample mean instead of a bootstrap sample proportion. The vertical red dashed lines indicate the boundaries of the 95% bootstrap percentile confidence interval.

Now modify the analysis: click on the corresponding results in the Results pane on the right side of the screen to reopen the analysis interface. Increase the number of bootstraps to 1,000 and change the plot type to Histogram. Your results should be similar to Figure 1.

Figure 1: Calculating a 95% bootstrap percentile confidence interval for a single mean (1,000 bootstraps, histogram).

Report the 95% bootstrap percentile confidence interval from the Simulation Results table. Does this interval include the claimed value of 1.70 meters? What does this tell you about the claim?
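
To see why seeding (step 5 above) makes results reproducible, consider this small Python sketch. The values and the `resample` helper are hypothetical. Python's random number generator is not the one Jamovi uses, so the seed 8675309 will not reproduce Jamovi's numbers here; the point is only that the same seed yields the same resample every time.

```python
import random

# Hypothetical heights in meters.
values = [1.55, 1.62, 1.68, 1.71, 1.80]


def resample(data, seed):
    """Draw one bootstrap resample (with replacement) using a seeded generator."""
    rng = random.Random(seed)
    return rng.choices(data, k=len(data))


first = resample(values, seed=8675309)
second = resample(values, seed=8675309)
print(first == second)  # prints: True
```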

Section 5: Setting Up Two Groups: Weight by Physical Activity

Next you will investigate whether there is a difference in mean weight between students who are physically active at least 3 of the last 7 days and those who are not.

First, modify your filters. Deactivate the age == 15 and height != NA filters. Then create the following new filters:

Filter 3: weight != NA. This removes students with missing weight data.

Filter 4: physically_active_7d != NA. This removes students with missing physical activity data.

Activate Filters 3 and 4 (and make sure Filters 1 and 2 are inactive). Check: you should have approximately 12,364 observations remaining.

Now create a new variable to define the two physical activity groups. Click on the header of the physically_active_7d column so it is selected, then click the Transform icon. Change the name of the transformed variable to physical_3plus. In the pull-down box labeled using transform, select “create new transform”. In the text field labeled \(f_x\), type the following recode condition:

IF($source < 3, "0-2", "3-7")

In this formula, $source refers to the original variable (physically_active_7d). The IF function checks whether the value is less than 3. If it is, the new variable gets the value “0-2”; otherwise it gets “3-7”. Note that this uses the < (less than) operator rather than the == (equals) operator you used in J Lab 1. Since physically_active_7d is a numeric variable, you do not use quotation marks around 3.

Compare the values of physical_3plus to physically_active_7d to verify the recoding is correct. Students with 0, 1, or 2 days of physical activity should be in the “0-2” group, and students with 3 through 7 days should be in the “3-7” group.
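
The recode rule can be checked against every possible input with a short Python sketch. The function name is hypothetical and simply mirrors the Jamovi formula. It assumes missing values have already been removed by Filter 4.

```python
def physical_3plus(days):
    """Mirror of IF($source < 3, "0-2", "3-7") for a non-missing day count."""
    return "0-2" if days < 3 else "3-7"


# physically_active_7d can only take the values 0 through 7.
for days in range(8):
    print(days, physical_3plus(days))
```

The boundary is the case worth checking by eye: 2 days should land in “0-2” and 3 days in “3-7”.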

Now create descriptive statistics split by group:

  1. Click on the Analyses tab, then click on the Exploration icon, and select Descriptives.
  2. Drag weight into the Variables box and drag physical_3plus into the Split by box.
  3. Expand the Statistics options menu. Keep N, Mean, Median, Std. deviation, Minimum, and Maximum checked.
  4. Expand the Plots options menu. Check the boxes for Histogram and Box plot.

Your results should be similar to the following:


 DESCRIPTIVES

 Descriptives                                         
 ──────────────────────────────────────────────────── 
                         physical_3plus    weight     
 ──────────────────────────────────────────────────── 
   N                     0-2                   4022   
                         3-7                   8342   
   Missing               0-2                      0   
                         3-7                      0   
   Mean                  0-2               66.67389   
                         3-7               68.44847   
   Median                0-2               62.60000   
                         3-7               65.77000   
   Standard deviation    0-2               17.63805   
                         3-7               16.47832   
   Minimum               0-2               29.94000   
                         3-7               33.11000   
   Maximum               0-2               180.9900   
                         3-7               160.1200   
 ──────────────────────────────────────────────────── 

What is the observed difference in sample means between the two groups? Do the distributions look similar in shape?
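
A split-by-group summary and the observed difference in sample means can be sketched in Python as follows. The records are hypothetical, and the order of subtraction (“0-2” minus “3-7”) is an assumption; check which group Jamovi lists first before comparing signs.

```python
import statistics
from collections import defaultdict

# Hypothetical (group, weight in kg) pairs, not the yrbss data.
records = [
    ("0-2", 58.0), ("3-7", 64.1), ("0-2", 61.5), ("3-7", 72.3),
    ("0-2", 70.2), ("3-7", 68.8), ("0-2", 66.0), ("3-7", 75.0),
]

# Collect the weights for each group label.
by_group = defaultdict(list)
for group, weight in records:
    by_group[group].append(weight)

# N, mean, and sample standard deviation for each group.
for group in sorted(by_group):
    values = by_group[group]
    print(group, len(values),
          round(statistics.fmean(values), 3),
          round(statistics.stdev(values), 3))

# Observed difference in sample means: first group minus second group.
observed_diff = statistics.fmean(by_group["0-2"]) - statistics.fmean(by_group["3-7"])
print(round(observed_diff, 3))
```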

Section 6: Permutation Test for Two Independent Means

You will test the following hypotheses:

\[\begin{array}{ll} H_0: & \mu_1 - \mu_2 = 0\\ H_a: & \mu_1 - \mu_2 \neq 0\end{array}\]

where \(\mu_1\) and \(\mu_2\) are the population mean weights for the two physical activity groups.

This is the same permutation logic from J Lab 1 Section 9. There, you shuffled text_ind values among helmet groups and calculated the difference in proportions. Here, you shuffle weight values among activity groups and calculate the difference in means.

To perform the permutation test:

  1. Click on the Analyses tab. Click on the Randomize module icon, and select Two Means - Hypothesis Test.
  2. Drag weight into the Dependent Variable box and physical_3plus into the Grouping Variable box.
  3. Check the boxes for Descriptives and Descriptives plots.
  4. Look at the descriptive statistics table that appears in the results to see which group Jamovi has listed first (Group 1) and which is listed second (Group 2). Note which group is which. Select the alternative hypothesis Group 1 \(\neq\) Group 2 (two-sided test).
  5. Expand the Simulations section. Check “Seed the random number generator” and enter 8675309.
  6. Start with 100 simulations and select Dot plot. Each dot represents the difference in means (\(\bar{x}_1 - \bar{x}_2\)) from one random permutation of the data. Dots colored red are counted toward the p-value.
  7. Now change to 1,000 simulations and select Histogram.

Your results with 1,000 simulations should be similar to Figure 2.

Figure 2: Permutation test comparing two independent means (1,000 permutations, histogram).

What is the p-value from the Simulation Results table? Based on this p-value, do you reject or fail to reject the null hypothesis at the \(\alpha = 0.05\) significance level? What does this tell you about the relationship between physical activity and weight?

Section 7: Bootstrap CI for Difference in Two Means

Now you will estimate the difference in mean weights between the two groups using a bootstrap confidence interval. This is the same idea as the single-mean bootstrap in Section 4, but now each bootstrap resample produces a difference in means (\(\bar{x}_1 - \bar{x}_2\)) instead of a single mean.

  1. Click on the Analyses tab. Click on the Randomize module icon, and select Two Means - Confidence Interval.
  2. Drag weight into the Dependent Variable box and physical_3plus into the Grouping Variable box.
  3. Set the Confidence level to 95.
  4. Select bootstrap percentile as the type of confidence interval.
  5. Check “Seed the random number generator” and enter 8675309.
  6. Specify 1,000 bootstraps and select Histogram.

Your results should be similar to Figure 3.

Figure 3: Calculating a 95% bootstrap percentile confidence interval for a difference in means (1,000 bootstraps, histogram).

Report the 95% bootstrap percentile confidence interval from the Simulation Results table. Interpret the interval in context. Does the confidence interval include 0? Is this consistent with the result of the permutation test in Section 6?
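
The two-group bootstrap in this section can be sketched in Python. This is a minimal sketch that assumes each group is resampled independently, with replacement, at its original size; the Randomize module's internal procedure may differ. The data and function name are hypothetical, and the percentile endpoints are picked by position in the sorted list rather than by interpolation.

```python
import random
import statistics


def bootstrap_diff_ci(group1, group2, n_boot=1000, level=95, seed=8675309):
    """Percentile CI for mean(group1) - mean(group2)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group separately, keeping its original size.
        r1 = rng.choices(group1, k=len(group1))
        r2 = rng.choices(group2, k=len(group2))
        diffs.append(statistics.fmean(r1) - statistics.fmean(r2))
    diffs.sort()
    # Number of bootstrap differences to cut from each tail.
    k = round(n_boot * (100 - level) / 200)
    return diffs[k], diffs[n_boot - 1 - k]


# Hypothetical weights in kilograms (not the yrbss data).
low_activity = [58.0, 61.5, 70.2, 66.0, 63.4, 59.9]
high_activity = [64.1, 72.3, 68.8, 75.0, 66.7, 70.5]
lower, upper = bootstrap_diff_ci(low_activity, high_activity)
print(f"95% CI for difference in means: ({lower:.2f}, {upper:.2f})")
```

If the resulting interval excludes 0, that points the same way as a small two-sided p-value from the permutation test, although the two methods are not guaranteed to agree in borderline cases.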

Section 8: Saving Your Work

You can save your work in Jamovi by clicking on the hamburger menu and selecting Save. Save your work as a .omv file, which can be reopened in Jamovi later.

Important: The .omv file is not what you submit. You will submit a PDF of your lab report (see Part 2). However, you should save the .omv file in case you need to refer back to your work. If you are working on a classroom computer, email the file to yourself or save it to Google Drive before you leave.


Part 2: What You Need to Turn In

Your instructor will provide you with a data set to work with for this part of the lab. You will determine your own research questions based on the variables in that data set.

As you work through the questions below, build your lab report in a Word document or Google Doc. For each question, include the requested screenshots, tables, and written answers. Refer to J Lab 1 if you need a refresher on how to take screenshots. You should write your report as you go rather than waiting until the end. See the Submission Instructions at the end of this document for more details.

Question 1: Single Mean – Bootstrap CI

Come up with a research question about the mean of a numeric variable in the data set. For example, you might ask: “What is the average ____ for ____ ?” You may need to filter the data to focus on a particular subgroup defined by a categorical variable.

Create descriptive statistics for your chosen variable:

  1. Use Analyses \(\rightarrow\) Exploration \(\rightarrow\) Descriptives.
  2. Include N, Mean, Median, Std. deviation, Minimum, and Maximum.
  3. Create a Histogram.

Construct a 95% bootstrap percentile confidence interval:

  1. Use Randomize \(\rightarrow\) Single Mean - Confidence Interval.
  2. Use 1,000 bootstraps, Histogram, seed 8675309.

In your lab report, include:

  • Your research question.
  • A description of the variable you chose and any filters you applied.
  • A statement of the population parameter you are estimating.
  • A screenshot of the descriptive statistics table.
  • A screenshot of the histogram of the data.
  • A screenshot of the bootstrap distribution histogram and the Simulation Results table.
  • The 95% bootstrap percentile confidence interval, with an interpretation in context (use a complete sentence).

Question 2: Two Independent Means – Permutation Test & Bootstrap CI

Come up with a research question about whether two groups differ in the mean of a numeric variable. For example: “Is there a difference in mean ____ between ____ and ____ ?” Choose a numeric variable (response) and a categorical variable with two levels (explanatory). You may need to recode a variable to create exactly two groups, similar to how you created physical_3plus from a numeric variable in Section 5. You could also recode a categorical variable with many levels into two groups.

State your null and alternative hypotheses using symbols:

\[\begin{array}{ll} H_0: & \mu_1 - \mu_2 = 0\\ H_a: & \mu_1 - \mu_2 \neq 0\end{array}\]

Create descriptive statistics split by group:

  1. Use Analyses \(\rightarrow\) Exploration \(\rightarrow\) Descriptives.
  2. Include N, Mean, Median, Std. deviation, Minimum, and Maximum.
  3. Create Histograms and Box plots (split by group).

Perform a permutation test:

  1. Use Randomize \(\rightarrow\) Two Means - Hypothesis Test.
  2. Use 1,000 simulations, Histogram, seed 8675309.

Construct a 95% bootstrap percentile CI for the difference in means:

  1. Use Randomize \(\rightarrow\) Two Means - Confidence Interval.
  2. Use 1,000 bootstraps, Histogram, seed 8675309.

In your lab report, include:

  • Your research question and hypotheses (using symbols). Define what Group 1 and Group 2 represent in your context.
  • A screenshot of the descriptive statistics table (split by group).
  • A screenshot of the histograms and box plots (split by group).
  • A screenshot of the permutation null distribution and the Simulation Results table.
  • The p-value and your conclusion at the \(\alpha = 0.05\) significance level (use complete sentences).
  • A screenshot of the bootstrap CI histogram and the Simulation Results table.
  • The 95% bootstrap percentile confidence interval, with an interpretation in context.
  • An explanation of whether the confidence interval is consistent with the hypothesis test result.

Question 3: Scope of Inference

Consider your results from Questions 1 and 2.

  • Can you infer a cause-and-effect relationship from your results? Why or why not?
  • To what population, if any, can you generalize your results? Explain.

Submission Instructions

Create your lab report in a Word document or Google Doc. Organize your report as numbered answers (1 through 3) to the questions above. Include screenshots and plots positioned under the correct question number. Your answers should refer to the relevant plots or tables when applicable.

Save your report as a PDF and submit it using the appropriate submission link on the course Moodle page. Check the PDF before you submit it to make sure it is readable and complete.