DESCRIPTIVES
Descriptives
────────────────────────────────────────────────────
physical_3plus weight
────────────────────────────────────────────────────
N 0-2 4022
3-7 8342
Missing 0-2 0
3-7 0
Mean 0-2 66.67389
3-7 68.44847
Median 0-2 62.60000
3-7 65.77000
Standard deviation 0-2 17.63805
3-7 16.47832
Minimum 0-2 29.94000
3-7 33.11000
Maximum 0-2 180.9900
3-7 160.1200
────────────────────────────────────────────────────
J Lab 6
Inference for comparing means
Portions of this lab are based on an R lab from Chapter 23 of the Introduction to Modern Statistics (2e) textbook by Mine Çetinkaya-Rundel and Johanna Hardin.
Youth Risk Behavior Surveillance System
In the first part of this lab you continue to explore the yrbss data set that was introduced in J Lab 4 and that you continued to explore in J Lab 5. The Youth Risk Behavior Surveillance System (YRBSS) survey tracks high school student behaviors that could impact health. Download the yrbss.csv file, available here or on our Moodle page, and open it in Jamovi.
You will only use a subset of the variables for this lab, described in the following table. In Jamovi, set up the Measure type and Data type for each variable as shown in the table.
| Variable | Measure type | Data type | Description |
|---|---|---|---|
| physically_active_7d | Continuous | Integer | How many days physically active for at least 60 minutes in the last 7 days |
| weight | Continuous | Decimal | Weight (kilograms) |
Weight and physical activity - comparing two groups
You will explore the relationship between a high school student’s weight and their physical activity. You will start by creating two groups of high schoolers in the data: those who were physically active for 3 or more days in the last 7 days and those who were not.
Filtering data and recoding the physical activity variable
Start by filtering the data down to the relevant observations by excluding the following cases:
- Students who did not respond to the physical activity question
- Students who did not respond to the weight question
You may refer to the instructions in J Lab 2 or J Lab 4 if you need a refresher on how to filter data in Jamovi.
Next, you will recode the physically_active_7d variable to create the two physical activity groups. Create a new variable, physical_3plus, that has two levels. As in previous labs, you can recode the variable by clicking on its name in the spreadsheet in Jamovi and then clicking the Transform button in the Data tab. Set the name of the transformed variable to physical_3plus. Then, select “Create New Transform”.
In the Transform interface that pops up, click + recode condition once and add the following recode conditions:
if $source < 3 use "0-2"
else use "3-7"
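To make the recode rule concrete, here is a minimal Python sketch of the same logic. The function name `recode_physical_3plus` is hypothetical and is not part of Jamovi; it only mirrors the two conditions above so you can see which original values land in each group.

```python
def recode_physical_3plus(days_active):
    """Mirror the Jamovi rule: values below 3 map to "0-2", all others to "3-7"."""
    return "0-2" if days_active < 3 else "3-7"

# Check every possible response, 0 through 7 days
recoded = {days: recode_physical_3plus(days) for days in range(8)}
for days, label in recoded.items():
    print(days, label)
```

Listing all eight possible values this way is the same check the lab asks you to do by comparing the new column against the original one.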
Once you have recoded the physically_active_7d variable, compare the values of the new physical_3plus variable to the original physically_active_7d variable to make sure it was done correctly.
Comparing weight distributions between physical activity groups
Create a table of descriptive statistics for the weight that is split by the physical_3plus variable. For each group, your table should list the sample size, mean, median, standard deviation, minimum, and maximum. Also create a faceted histogram and a side-by-side box plot that show the distributions of weight for the two physical activity groups. You learned how to create these numerical and visual summaries in J Lab 2. Your results should be similar to the following:
Based on the plots and the table, does it appear that there is an association between weight and physical activity? What is the observed difference in sample means (low activity - high activity)?
Hypothesis tests
Next you will conduct tests of the following hypotheses:
- \(H_0:\) \(\mu_1 - \mu_2 = 0\)
- \(H_A:\) \(\mu_1 - \mu_2 \neq 0\)
We will consider the group that was physically active for 0-2 days as the first group and the group that was physically active for 3-7 days as the second group.
You will conduct both a hypothesis test using a mathematical model (a T-test) and a hypothesis test using random permutation. For the T-test you should assume that the groups have equal variances.
Before you test these hypotheses using a mathematical model, you must check the technical conditions for the test. You can assume that the observations are independent because the observations are from a random sample of high school students. Both samples are much larger than 30, so you can assume that the sampling distribution for the difference in means is normal. Finally, the standard deviations are very similar in the two groups, satisfying the equal variance assumption.
The \(T\) statistic is \[T=\frac{(\bar{x}_1-\bar{x}_2)-0}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\] where \(\bar{x}_1\) and \(\bar{x}_2\) are the sample means, \(n_1\) and \(n_2\) are the sample sizes, and \(s_p\) is the pooled standard deviation.
For our samples, the pooled standard deviation is \[\begin{array}{rcl}s_p &=& \sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}} \\ &=& \sqrt{\frac{(4,022-1)\cdot 17.64^2+(8,342-1)\cdot 16.48^2}{4,022+8,342-2}} \\ &=& 16.86\end{array}\] and the \(T\) statistic is \[\begin{array}{rcl}T &=&\frac{(66.67-68.45)-0}{16.86\cdot\sqrt{\frac{1}{4,022}+\frac{1}{8,342}}} \\ &=& - 5.48\end{array}\]
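If you want to check the arithmetic outside Jamovi, the following Python sketch recomputes the pooled standard deviation and the \(T\) statistic from the unrounded summary statistics in the descriptives table. The variable names are illustrative choices, not anything produced by Jamovi.

```python
import math

# Summary statistics from the descriptives table (group 1 = "0-2", group 2 = "3-7")
n1, mean1, sd1 = 4022, 66.67389, 17.63805
n2, mean2, sd2 = 8342, 68.44847, 16.47832

# Pooled standard deviation
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
s_p = math.sqrt(pooled_var)

# Standard error of the difference in means and the T statistic (null value 0)
se = s_p * math.sqrt(1 / n1 + 1 / n2)
t_stat = ((mean1 - mean2) - 0) / se

print(f"s_p = {s_p:.2f}, SE = {se:.3f}, T = {t_stat:.2f}")
```

Because this sketch carries the unrounded means and standard deviations through the calculation, it should agree with the values reported above up to rounding.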
If \(H_0\) is true, the \(T\) statistic will follow a T-distribution with \(df=n_1+n_2-2\) degrees of freedom. Here, \(df=4,022+8,342-2=12,362\). Thus, you can compute a p-value using the Model-Based Inference analysis in the Randomize module in Jamovi. You learned how to do this in J Lab 5. Since the alternative hypothesis is two-sided, make sure you compute the area in both tails of the T-distribution. The resulting p-value is very small (\(<0.001\)).
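As a rough cross-check on the p-value, the sketch below uses a normal approximation rather than the exact T-distribution. This is an assumption made only to keep the example within the Python standard library: with \(df=12,362\) the T-distribution is very close to the standard normal, so the two give nearly the same tail areas. It is not how the Randomize module computes its result.

```python
from statistics import NormalDist

t_stat = -5.48  # T statistic reported in the lab

# Assumption: with df = 12,362 the t-distribution is nearly standard normal,
# so the standard normal CDF is used here as an approximation.
one_tail = NormalDist().cdf(-abs(t_stat))

# Two-sided alternative: add the area in both tails
p_value = 2 * one_tail

print(f"approximate two-sided p-value: {p_value:.2e}")
```

The approximate p-value is far below 0.001, which is consistent with the result from Jamovi.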
Next, conduct a corresponding simulation-based hypothesis test, using the Two Means - Hypothesis Test analysis in the Randomize module. Drag weight into the Dependent Variable box, and drag physical_3plus into the Grouping Variable box. Make sure you have selected a two-sided alternative hypothesis. Show the descriptive statistics and plots by clicking the appropriate checkboxes. Conduct 1,000 simulations of the null hypothesis, and plot the differences in means as a histogram. The module simulates the null hypothesis by randomly permuting the value of the response variable. If you seed the random number generator with 8675309, you should obtain results that match those in the lab.
Figure 1 shows the analysis with the correct setup and results.
The results are consistent with the hypothesis test based on a mathematical model. You can reject the null hypothesis and conclude that there is convincing evidence of a difference in the mean weights between students who are physically active fewer than 3 days and those who are physically active for 3+ days (p < 0.001).
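The permutation idea can be sketched in a few lines of Python. This is a generic illustration and not the Randomize module's actual implementation; the function name `permutation_test` and the toy data are hypothetical, and seeding Python's random number generator with 8675309 will not reproduce Jamovi's simulated values.

```python
import random
from statistics import mean

def permutation_test(group1, group2, n_sims=1000, seed=8675309):
    """Two-sided permutation test for a difference in means (group1 - group2)."""
    rng = random.Random(seed)
    observed = mean(group1) - mean(group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    extreme = 0
    for _ in range(n_sims):
        rng.shuffle(pooled)  # randomly reassign responses to the two groups
        diff = mean(pooled[:n1]) - mean(pooled[n1:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Proportion of simulated differences at least as extreme as the observed one
    return observed, extreme / n_sims

# Hypothetical toy data (kg), not the yrbss sample
low_activity = [60.1, 62.5, 58.9, 61.0, 59.4, 63.2]
high_activity = [70.3, 68.8, 72.1, 69.5, 71.0, 67.9]

observed, p_value = permutation_test(low_activity, high_activity)
print(f"observed difference = {observed:.2f}, p-value = {p_value:.3f}")
```

Shuffling the pooled responses breaks any link between weight and group membership, which is exactly what the null hypothesis claims.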
Confidence intervals
Next you will calculate a 95% confidence interval to estimate the difference in mean weights between the two physical activity groups in the population. You have already checked the technical conditions for using a T-distribution. The confidence interval for the difference in means is given by \[\bar{x}_1-\bar{x}_2\pm t^{\ast}_{df} \times SE\] where the standard error for the difference in means is \[SE=s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\] In this case the standard error is \[\begin{array}{rcl}SE &=& 16.86\cdot\sqrt{\frac{1}{4,022}+\frac{1}{8,342}} \\ &=& 0.324 \end{array}\]
You can use the Model-Based Inference analysis in the Randomize module to calculate the multiplier. You learned how to do this in J Lab 5. The multiplier is \(t^{\ast}_{12,362}=1.96\). Since the difference in sample means is -1.77 kg, the 95% confidence interval for the difference in means is \[-1.77\pm 1.96\times 0.324 = -1.77 \pm 0.63\] The 95% confidence interval can also be expressed in interval form as \((-2.40, -1.14)\). Thus, we are 95% confident that the mean weight is between 1.14 kg and 2.40 kg lower for the group that is physically active 0-2 days per week compared to the group that is physically active 3-7 days per week. Note that 0 is not included in the confidence interval, which is consistent with the results of the hypothesis test.
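The interval calculation can be reproduced with a short Python sketch. It starts from the unrounded summary statistics in the descriptives table and takes the multiplier 1.96 as given in the lab; the variable names are illustrative.

```python
import math

# Summary statistics from the descriptives table (group 1 = "0-2", group 2 = "3-7")
n1, mean1, sd1 = 4022, 66.67389, 17.63805
n2, mean2, sd2 = 8342, 68.44847, 16.47832
t_star = 1.96  # multiplier reported in the lab

# Pooled standard deviation and standard error of the difference
s_p = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
se = s_p * math.sqrt(1 / n1 + 1 / n2)

# Point estimate plus or minus the margin of error
diff = mean1 - mean2
margin = t_star * se
lower, upper = diff - margin, diff + margin

print(f"difference = {diff:.2f}, margin = {margin:.2f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Because the hand calculation rounds the difference and the margin of error before combining them, an endpoint computed at full precision may differ from the interval above in the last digit.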
Next, you will calculate a 95% confidence interval using a bootstrap distribution of differences in means. Use the Two Means - Confidence Interval analysis in the Randomize module to calculate a 95% bootstrap percentile confidence interval. Drag weight into the Dependent Variable box, and drag physical_3plus into the Grouping Variable box. Set the confidence level to 95%, select “bootstrap percentile” for the type of interval, specify 1,000 bootstrap samples, and choose a histogram as the plot type. You may also seed the random number generator with 8675309 if you want your results to match the ones in this lab. Figure 2 shows the interface after it has been set up correctly for the confidence interval along with the results.
The 95% bootstrap percentile confidence interval for the difference in means is \((-2.37, -1.11)\). Compare this interval to the one you obtained using a T-distribution.
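For intuition about what a bootstrap percentile interval does, here is a minimal Python sketch. It is a generic illustration under simple assumptions (each group is resampled separately with replacement, and the interval endpoints are the 2.5th and 97.5th percentiles of the sorted bootstrap differences); the Randomize module may differ in its details. The function name `bootstrap_percentile_ci` and the toy data are hypothetical.

```python
import random
from statistics import mean

def bootstrap_percentile_ci(group1, group2, level=0.95, n_boot=1000, seed=8675309):
    """Bootstrap percentile interval for the difference in means (group1 - group2)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size
        resample1 = rng.choices(group1, k=len(group1))
        resample2 = rng.choices(group2, k=len(group2))
        diffs.append(mean(resample1) - mean(resample2))
    diffs.sort()
    alpha = 1 - level
    lower = diffs[round(alpha / 2 * n_boot)]
    upper = diffs[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical toy data (kg), not the yrbss sample
low_activity = [60.1, 62.5, 58.9, 61.0, 59.4, 63.2]
high_activity = [70.3, 68.8, 72.1, 69.5, 71.0, 67.9]

lower, upper = bootstrap_percentile_ci(low_activity, high_activity)
print(f"95% bootstrap percentile CI: ({lower:.2f}, {upper:.2f})")
```

Unlike the T-based interval, this approach does not rely on a formula for the standard error; the spread of the resampled differences plays that role.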
Weight and physical activity - comparing more than two groups
In the previous analysis you divided the sample into two groups based on their physical activity levels. Would your results be much different if you divided the sample into four groups? This time you will create four groups of high schoolers: those who were physically active for 0-1 days (group 1), 2-3 days (group 2), 4-5 days (group 3), and 6-7 days (group 4).
Filtering data and recoding the physical activity variable
Use the same filters you used in the first part of the lab. You should exclude the following cases:
- Students who did not respond to the physical activity question
- Students who did not respond to the weight question
Next, recode the physically_active_7d variable to create the four physical activity groups. Create a new variable, physical_four, that has four levels. Use the following recode conditions:
if $source < 2 use "0-1"
if $source == 2 use "2-3"
if $source == 3 use "2-3"
if $source == 4 use "4-5"
if $source == 5 use "4-5"
else use "6-7"
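The four-level recode follows the same pattern as before. The Python sketch below mirrors the conditions above; the function name `recode_physical_four` is hypothetical and only serves to show how each original value is mapped.

```python
def recode_physical_four(days_active):
    """Mirror the Jamovi recode conditions for the four activity groups."""
    if days_active < 2:
        return "0-1"
    if days_active in (2, 3):
        return "2-3"
    if days_active in (4, 5):
        return "4-5"
    return "6-7"

# Check every possible response, 0 through 7 days
for days in range(8):
    print(days, recode_physical_four(days))
```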
Once you have recoded the physically_active_7d variable, compare the values of the new physical_four variable to the original physically_active_7d variable to make sure it was done correctly.
Comparing weight distributions between four physical activity groups
Create a table of descriptive statistics for weight that is split by the physical_four variable. For each group, your table should list the sample size, mean, median, standard deviation, minimum, and maximum. Also create a faceted histogram and a side-by-side box plot that show the distributions of weight for the four physical activity groups. Your results should be similar to the following:
DESCRIPTIVES
Descriptives
───────────────────────────────────────────────────
physical_four weight
───────────────────────────────────────────────────
N 0-1 2851
2-3 2539
4-5 2825
6-7 4149
Missing 0-1 0
2-3 0
4-5 0
6-7 0
Mean 0-1 66.56254
2-3 67.65722
4-5 68.18799
6-7 68.68570
Median 0-1 62.14000
2-3 63.50000
4-5 65.77000
6-7 65.77000
Standard deviation 0-1 17.87305
2-3 17.30102
4-5 16.60787
6-7 16.04068
Minimum 0-1 31.75000
2-3 29.94000
4-5 33.11000
6-7 35.38000
Maximum 0-1 180.9900
2-3 163.3000
4-5 158.7600
6-7 156.9500
───────────────────────────────────────────────────
Based on the plots and the table, does it appear that there is an association between weight and physical activity? Does mean weight seem to increase or decrease with increasing physical activity?
Hypothesis tests
Next you will conduct tests of the following hypotheses:
- \(H_0:\) \(\mu_1 = \mu_2 = \mu_3 = \mu_4\)
- \(H_A:\) At least one of the population means is different.
You will conduct both a hypothesis test using a mathematical model (ANOVA/F-Test) and a hypothesis test using random permutation. For the ANOVA you should assume that the groups have equal variances.
Before you test these hypotheses using a mathematical model, you must check the technical conditions for the test. You can assume that the observations are independent because the observations are from a random sample of high school students. All four samples are large, but the distributions of weights are right-skewed. This should cause some concern since ANOVA assumes that the responses within each group are normally distributed. The standard deviations are very similar in the four groups, satisfying the equal variance assumption.
Despite our concerns about the shapes of the distributions, we will proceed with the ANOVA. However, the permutation test you will conduct later is a better option to pursue in this case.
Conduct the ANOVA by clicking on the Analyses tab, then click on the ANOVA icon, and select ANOVA. Drag weight into the Dependent Variable box, and drag physical_four into the Fixed Factors box. The result should be an ANOVA table that is similar to the following:
ANOVA
ANOVA - weight
────────────────────────────────────────────────────────────────────────────────────
Sum of Squares df Mean Square F p
────────────────────────────────────────────────────────────────────────────────────
physical_four 8034.792 3 2678.2639 9.414196 0.0000033
Residuals 3516321.712 12360 284.4920
────────────────────────────────────────────────────────────────────────────────────
The value of the \(F\) statistic is 9.41. Comparing this to an \(F\) distribution with \(df_1=3\) and \(df_2=12,360\) degrees of freedom results in a small p-value (<0.001). Recall that \(df_1\) is one less than the number of groups and \(df_2\) is the difference between the overall sample size and the number of groups. Since the p-value is less than \(\alpha=0.05\), you can reject the null hypothesis and conclude that at least one of the group means is different.
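The entries in the ANOVA table are linked by simple arithmetic, which the following Python sketch reproduces from the sums of squares and group sizes reported above. The variable names are illustrative.

```python
# Group sizes from the descriptives table
group_sizes = {"0-1": 2851, "2-3": 2539, "4-5": 2825, "6-7": 4149}

# Sums of squares from the ANOVA table
ss_between = 8034.792      # physical_four
ss_within = 3516321.712    # Residuals

# Degrees of freedom: groups minus 1, and total sample size minus groups
k = len(group_sizes)
n = sum(group_sizes.values())
df1 = k - 1
df2 = n - k

# Mean squares and the F statistic
ms_between = ss_between / df1
ms_within = ss_within / df2
f_stat = ms_between / ms_within

print(f"df1 = {df1}, df2 = {df2}")
print(f"MS between = {ms_between:.4f}, MS within = {ms_within:.4f}, F = {f_stat:.2f}")
```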
You can follow up the test with post-hoc pairwise T-tests to determine which pairs of means are different from each other. Within the same ANOVA interface, click on “Post Hoc Tests” to expand the post-hoc test options. Select “Tukey” for the correction (to account for multiple hypothesis tests), and drag the physical_four variable to the box on the right.
Since there are 4 groups, there are \(4\cdot 3/2=6\) possible pairwise comparisons. Your table of results should be similar to the following:
POST HOC TESTS
Post Hoc Comparisons - physical_four
────────────────────────────────────────────────────────────────────────────────────────────────
physical_four physical_four Mean Difference SE df t p-tukey
────────────────────────────────────────────────────────────────────────────────────────────────
0-1 - 2-3 -1.095 0.460 12360 -2.38 0.081
- 4-5 -1.625 0.448 12360 -3.63 0.002
- 6-7 -2.123 0.410 12360 -5.17 < .001
2-3 - 4-5 -0.531 0.461 12360 -1.15 0.658
- 6-7 -1.028 0.425 12360 -2.42 0.073
4-5 - 6-7 -0.498 0.411 12360 -1.21 0.621
────────────────────────────────────────────────────────────────────────────────────────────────
Note. Comparisons are based on estimated marginal means
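The mean differences, standard errors, and \(t\) values in the post-hoc table can be checked from the group means, the group sizes, and the residual mean square. The Python sketch below assumes that each standard error is \(\sqrt{MS_{\text{Residuals}}\,(1/n_a+1/n_b)}\); that formula is an assumption about how the table was computed, although the values it gives agree with the table. The Tukey-adjusted p-values are not computed here because they require the studentized range distribution.

```python
import math
from itertools import combinations

# (sample size, mean weight) for each group, from the descriptives table
groups = {
    "0-1": (2851, 66.56254),
    "2-3": (2539, 67.65722),
    "4-5": (2825, 68.18799),
    "6-7": (4149, 68.68570),
}
ms_within = 284.4920  # residual mean square from the ANOVA table

rows = []
for (name_a, (n_a, mean_a)), (name_b, (n_b, mean_b)) in combinations(groups.items(), 2):
    diff = mean_a - mean_b
    # Assumed standard error: pooled residual variance scaled by both sample sizes
    se = math.sqrt(ms_within * (1 / n_a + 1 / n_b))
    rows.append((name_a, name_b, diff, se, diff / se))

for name_a, name_b, diff, se, t in rows:
    print(f"{name_a} - {name_b}: diff = {diff:.3f}, SE = {se:.3f}, t = {t:.2f}")
```

The loop visits each of the 6 pairs exactly once, in the same order as the rows of the table.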
Based on the adjusted p-values in the table, you can conclude that there is convincing evidence that the mean weight for group 1 is different than the mean weight for group 3, and that the mean weight for group 1 is different than the mean weight for group 4. However, none of the other pairwise differences are statistically significant at the \(\alpha=0.05\) significance level.
Next, conduct a simulation-based hypothesis test, using the Multiple Means - Hypothesis Test analysis in the Randomize module. This is an overall test of the null hypothesis that all of the group means are the same in the population. It is also based on the \(F\) statistic.
Drag weight into the Dependent Variable box, and drag physical_four into the Grouping Variable box. Show the descriptive statistics and plots by clicking the appropriate checkboxes. Conduct 1,000 simulations of the null hypothesis, and plot the simulated \(F\) statistics as a histogram. The module simulates the null hypothesis by randomly permuting the values of the response variable. If you seed the random number generator with 8675309, you should obtain results that match those in the lab.
Figure 3 shows the analysis with the correct setup and results.
The observed value of the \(F\) statistic is much larger than any of the simulated values, resulting in a p value that is approximately 0 (< 0.001). These results are consistent with the hypothesis test based on a mathematical model. You can reject the null hypothesis and conclude that there is convincing evidence that at least one of the group means is different.
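The permutation version of the \(F\) test can be sketched in Python as follows. As with the two-group example, this is a generic illustration rather than the Randomize module's implementation; the function names and the toy data are hypothetical, and Python's random number generator will not reproduce Jamovi's simulated values even with the same seed.

```python
import random
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def permutation_f_test(groups, n_sims=1000, seed=8675309):
    """Permutation test of equal means using the F statistic."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    sizes = [len(g) for g in groups]
    pooled = [v for g in groups for v in g]
    extreme = 0
    for _ in range(n_sims):
        rng.shuffle(pooled)  # randomly reassign responses to groups
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            extreme += 1
    return observed, extreme / n_sims

# Hypothetical toy data (kg), not the yrbss sample
toy_groups = [
    [60.1, 62.5, 58.9, 61.0],
    [64.2, 66.0, 63.1, 65.4],
    [68.8, 70.3, 67.9, 69.5],
    [72.1, 74.0, 71.3, 73.2],
]
observed_f, p_value = permutation_f_test(toy_groups)
print(f"observed F = {observed_f:.2f}, p-value = {p_value:.3f}")
```

Only large values of \(F\) count as extreme, so the p-value is the proportion of simulated \(F\) statistics that are at least as large as the observed one.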
Saving your work
You can save your work in Jamovi by clicking on the hamburger menu and selecting Save. You can save your work as a .omv file, which is a file that can be opened in Jamovi. However, you will not turn this file in for your lab report. Instead, you will turn in a PDF of your lab report that includes screenshots of the Jamovi interface, plots, tables, and your answers to the questions at the end of the lab. Even though you are not turning it in, you should save your Jamovi file in case you need to refer back to it later.
What you need to turn in
This section includes questions that you will turn in for this lab. The data set for this lab is available on our course Moodle website (look for “J Lab 6 Data set” in the “Jamovi Labs” section).
1. Come up with a research question evaluating the relationship between a numerical variable in your data set (the response variable) and a categorical variable in your data set with two levels (the explanatory variable). Note that your analysis may involve either comparing two independent means (as described in this lab) or paired means (not covered in this lab). Formulate your question in a way that it can be answered using a hypothesis test, and state your null and alternative hypotheses. Perform a permutation test rather than a test based on a mathematical model. Use at least 1,000 simulations of the null hypothesis. Include a table of descriptive statistics in your report. For each group, your table should list the sample size, mean, median, standard deviation, minimum, and maximum. Also include faceted histograms and side-by-side box plots showing the distribution of the response variable for each group. Also include your null distribution and table of simulation results. Report your statistical results/conclusions and provide an explanation of your results using plain language. Refer to your tables and plots in your explanation. Be sure to state the significance level you used.
2. For the same question you addressed in problem 1, calculate a 95% bootstrap percentile confidence interval for the difference in means (for independent means) or mean difference (for paired means). Include the bootstrap distribution and table of simulation results in your report. State the confidence interval you found and interpret it using plain language. Explain whether or not your confidence interval supports the conclusion from your hypothesis test in problem 1.
3. Come up with a different research question evaluating the relationship between a numerical variable in your data set (the response variable) and a categorical variable in your data set with at least three levels (the explanatory variable). Formulate your question in a way that it can be answered using a hypothesis test, and state your null and alternative hypotheses. Include a table of descriptive statistics in your report. For each group, your table should list the sample size, mean, median, standard deviation, minimum, and maximum. Also include faceted histograms and side-by-side box plots showing the distribution of the response variable for each group. If the technical conditions are met, perform a hypothesis test using a mathematical model (ANOVA), and include the ANOVA table in your report. Otherwise use a permutation test, and include the null distribution and table of simulation results in your report. Report the technical conditions you checked and explain why each of them was satisfied or not. Report your statistical results/conclusions and provide an explanation of your results using plain language. Refer to your tables and plots in your explanation. Be sure to state the significance level you used.
4. Address the scope of the inferences you made in problems 1-3. Can you infer any cause and effect relationships? To what, if any, population can you generalize your results? Explain.
You may create your lab report in a Word document or a Google Doc. You may organize your report as numbered answers to the questions listed above. Include the screenshots, plots, and tables in your report, making sure that they are positioned under the correct question number. You should also include your answers to the questions in your report, and your answers should refer to the relevant plots or tables when applicable. Save your report as a PDF and submit using the appropriate submission link on the course Moodle page (check the pdf before you submit it to make sure it is readable and complete).