J Lab 2

Exploring Data with Jamovi

Author

Yurk

Portions of this lab are based on the R lab from Chapter 6 of the Introduction to Modern Statistics (2e) textbook by Mine Çetinkaya-Rundel and Johanna Hardin.

NYC flights

In this lab we will explore the nycflights data set, available here or on our Moodle page. The data set contains information on a random sample of flights that departed from New York City in 2013. The data set has 32,735 rows and 16 columns. The columns are: year, month, day, dep_time, dep_delay, arr_time, arr_delay, carrier, tailnum, flight, origin, dest, air_time, distance, hour, and minute.

Download the nycflights.csv file to your computer. Next, open the file in Jamovi. How many cases/observations are there? How many variables are there?

The nycflights data set includes both categorical and numeric variables. For example, the carrier variable is a categorical variable that represents the airline carrier, and the dep_delay variable is a numeric variable that represents the departure delay in minutes. The origin and dest variables are also categorical variables that represent the origin and destination airports, respectively.

We will only use a subset of the variables for this lab, described in the following table. In Jamovi, set up the Measure type and Data type for each variable as shown in the table.

Variable     Measure type   Data type   Description
month        Ordinal        Integer     Month of departure
dep_delay    Continuous     Decimal     Departure delay (minutes)
arr_delay    Continuous     Decimal     Arrival delay (minutes)
carrier      Nominal        Text        Carrier abbreviation
dest         Nominal        Text        Destination airport
air_time     Continuous     Integer     Amount of time in the air (minutes)
distance     Continuous     Integer     Distance flown (miles)

Some of the values of dep_delay are negative. What does a negative value mean for this variable?

Exploring departure delays

We will explore the distribution of departure delays by creating numerical summaries and visualizations. In both cases we will use the Descriptives analysis in Jamovi. To access the Descriptives analysis, first click on the Analyses tab, then click on the Exploration icon, and select Descriptives.

Numerical summaries

We calculated some summary statistics for a numeric variable in the previous lab. We will do something similar here, but we will add more summary statistics to our table this time.

In the Descriptives analysis interface, drag the dep_delay variable into the Variables box. Then, expand the Statistics options menu by clicking on its header. Check the boxes for the following descriptive statistics: N, Mean, Median, Std. deviation, Range, Minimum, Maximum, IQR. Also select Cut points for \(\square\) equal groups, and enter 4 into the box. This will add Q1, Q2, and Q3 to your table.

Your results should look like the following table:


 Descriptives                        
 ─────────────────────────────────── 
                         dep_delay   
 ─────────────────────────────────── 
   N                         32735   
   Mean                   12.70515   
   Median                -2.000000   
   Standard deviation     40.40743   
   IQR                    16.00000   
   Range                  1322.000   
   Minimum               -21.00000   
   Maximum                1301.000   
   25th percentile       -5.000000   
   50th percentile       -2.000000   
   75th percentile        11.00000   
 ─────────────────────────────────── 

What does \(N\) represent? How long was the longest delay? Which values represent Q1 and Q3? How many times is the median listed in the table? Is the IQR value consistent with the values of the quartiles?
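
To see how the quartiles, IQR, and range in a table like this relate to one another, here is a minimal Python sketch. The five delay values are made up for illustration and are not taken from nycflights. Quartile conventions also differ between tools, so the "inclusive" method used here is an assumption and may not reproduce Jamovi's cut points on every data set.

```python
import statistics

# Hypothetical departure delays in minutes (not from nycflights).
delays = [-5, -2, 0, 3, 11]

# n=4 asks for the three cut points that split the data into 4 equal groups,
# i.e. Q1, Q2, and Q3.
q1, q2, q3 = statistics.quantiles(delays, n=4, method="inclusive")

# The IQR is the distance between Q1 and Q3; the range is max minus min.
iqr = q3 - q1
data_range = max(delays) - min(delays)

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("Range:", data_range)
print("Q2 equals the median:", q2 == statistics.median(delays))
```

The same two subtractions can be done by hand with the values in your Jamovi table.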

Note that the mean departure delay is about 12.7 minutes, while the median is -2 minutes. Recall that the mean is more sensitive to extreme values than the median. What does this suggest about the distribution of departure delays?
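
The sensitivity of the mean to extreme values can be illustrated with a small Python sketch. The delays below are hypothetical and chosen only to make the effect easy to see.

```python
import statistics

# Hypothetical delays in minutes: most flights leave slightly early.
typical = [-4, -3, -2, -2, -1, 0, 1]

# The same flights plus one very late departure.
with_outlier = typical + [300]

for label, data in [("typical", typical), ("with outlier", with_outlier)]:
    print(label,
          "mean =", round(statistics.mean(data), 2),
          "median =", statistics.median(data))
```

Adding a single large delay moves the mean by tens of minutes, while the median shifts by only half a minute.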

Histogram

Next we will use Jamovi to create a histogram of the departure delays. We do this using the same Descriptives interface. The dep_delay variable should have already been dragged into the Variables box. Now, expand the Plots options menu by clicking on its header. Check the box for Histogram. Your results should match the following histogram:

Is the distribution symmetric or skewed? What does this suggest about the distribution of departure delays? Is this consistent with what you observed about the median and the mean?

Filtering the data

Suppose we were only interested in departure delays for flights that were bound for the LAX airport. We can easily create numerical and visual summaries for LAX-bound flights by filtering the data set to include only observations where the dest variable is equal to LAX. In fact, once we apply the filter, the tables and plots we created earlier will be automatically updated to reflect just the LAX-bound flights.

To filter the data, click on the Data tab in Jamovi, then click on the Filters icon. In the field next to the \(f_x\) button, type dest == "LAX". Note: it is important that you type it exactly as shown here, including the quotation marks and the double equals sign. Now, click on the toggle switch above the text field to make the filter active.

Figure 1 shows the filter correctly defined and activated.

Figure 1: Defining and activating a filter.

After applying the filter, click the up arrow in the filter interface to hide it. Notice that most of the rows in the data set have been grayed out, and there is a new column in the spreadsheet called Filter 1. There is a check mark in this column when a row is retained by the filter, and the row is white. Otherwise there is an X in the column, and the row is gray.
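
The logic behind the filter can be sketched in Python. This is only an analogy for what the Filter 1 column records, not a description of how Jamovi works internally, and the three rows are hypothetical.

```python
# Hypothetical rows standing in for the spreadsheet (not real nycflights data).
flights = [
    {"dest": "LAX", "carrier": "DL", "dep_delay": -2},
    {"dest": "ORD", "carrier": "UA", "dep_delay": 15},
    {"dest": "LAX", "carrier": "B6", "dep_delay": 7},
]

# Analogue of the Filter 1 column: True (check mark) where dest == "LAX",
# False (X) otherwise.
filter_1 = [row["dest"] == "LAX" for row in flights]

# Only the retained rows would feed into the tables and plots.
retained = [row for row, keep in zip(flights, filter_1) if keep]

print(filter_1)
print(len(retained), "of", len(flights), "rows retained")
```

As in the Jamovi filter, the double equals sign tests whether two values are equal, and the quotation marks mark LAX as text.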

The table of descriptive statistics and the histogram will now only include the LAX-bound flights, and are shown below.


 DESCRIPTIVES

 Descriptives                        
 ─────────────────────────────────── 
                         dep_delay   
 ─────────────────────────────────── 
   N                          1583   
   Mean                   9.782059   
   Median                -1.000000   
   Standard deviation     33.48609   
   IQR                    11.00000   
   Range                  358.0000   
   Minimum               -13.00000   
   Maximum                345.0000   
   25th percentile       -4.000000   
   50th percentile       -1.000000   
   75th percentile        7.000000   
 ─────────────────────────────────── 

How many flights were bound for LAX? What is the mean departure delay for flights bound for LAX? How does this compare to the mean departure delay for all flights? Does it appear that most of the LAX-bound flights were early? late? on time?

We can also visualize the distribution of departure delays for LAX-bound flights using a box plot. To add a box plot to your results, you simply need to select Box plot in the Plots options menu in the Descriptives interface. You can change an existing analysis (instead of starting over with a new one) by clicking on the corresponding section in the Results pane on the right side of the screen. This will open the analysis interface in the left pane so you can adjust the settings (e.g., add a box plot after you have already created a table and histogram).

Your results should match the following box plot:

Departure delays for LAX-bound flights: Delta vs. Jet Blue

Do different airlines have different departure delays for flights bound for LAX? We will compare the departure delays for Delta and Jet Blue flights bound for LAX. This requires us to compare the distributions of a numeric variable (departure delay) across the levels of a second, categorical variable (carrier). We can do this by creating side-by-side box plots or faceted histograms and using more advanced tables of summary statistics.

This analysis requires two changes to the analyses we have already performed:

  1. Additional filtering. In addition to filtering for flights bound for LAX, we will also filter for flights operated by Delta and Jet Blue.

  2. Incorporating a second, categorical variable in our analysis. After filtering, we need to split the cases according to the levels of the carrier variable so that the distributions of departure delays can be compared between the two airlines.

Click on the Filters icon in the Data tab. You should see the filter you created earlier (probably labeled “Filter 1”). Make sure that filter is active. Now click the large blue + button to the left of your existing filter to add a new filter. In the text field labeled \(f_x\), type carrier == "DL" or carrier == "B6". Click on the toggle switch to activate the filter.

Figure 2 shows the filter correctly defined and activated.

Figure 2: Defining and activating a second filter.

DL is the abbreviation for Delta, and B6 is the abbreviation for Jet Blue. When both filters are active, the data set will only include flights bound for LAX (Filter 1) that were operated by Delta or Jet Blue (Filter 2).
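
As a rough Python analogy (hypothetical rows, not real nycflights data), a case is retained only when it passes both filters:

```python
# Hypothetical rows (not real nycflights data).
flights = [
    {"dest": "LAX", "carrier": "DL"},
    {"dest": "LAX", "carrier": "UA"},   # right destination, wrong carrier
    {"dest": "ORD", "carrier": "B6"},   # right carrier, wrong destination
    {"dest": "LAX", "carrier": "B6"},
]

def filter_1(row):
    # Same condition as the first filter: dest == "LAX"
    return row["dest"] == "LAX"

def filter_2(row):
    # Same condition as the second filter: carrier == "DL" or carrier == "B6"
    return row["carrier"] == "DL" or row["carrier"] == "B6"

# A row is kept only if it satisfies Filter 1 AND Filter 2.
retained = [row for row in flights if filter_1(row) and filter_2(row)]

print(retained)
```

Within Filter 2 the two carrier conditions are joined with "or", since no single flight can have both carriers.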

Click on the eye icon beneath the + icon that you used to add the second filter to hide all of the cases that are not retained by the filters you have applied. Figure 2 shows only the cases that are retained after both filters are applied.

Click the up arrow to hide the filter interface. Explore the data set, and confirm that all of the remaining cases are flights bound for LAX that were operated by Delta or Jet Blue.

Next we will split the data set according to the levels of the carrier variable, updating the table and plots we have already created to reflect the split (they should already be updated to reflect the new filtering). Click on the existing results in the Results pane to bring up the options for the Descriptives analysis. Next, drag the carrier variable into the Split by box in the Descriptives interface. The table and plots will then report each carrier separately.

Figure 3 shows the Descriptives interface with the carrier variable dragged into the Split by box. Your analysis should be set up the same as in the figure.

Figure 3: Splitting the data set by the carrier variable.

Your results should now match the following:


 DESCRIPTIVES

 Descriptives                                   
 ────────────────────────────────────────────── 
                         carrier    dep_delay   
 ────────────────────────────────────────────── 
   N                     B6               159   
                         DL               262   
   Mean                  B6          7.125786   
                         DL          2.145038   
   Median                B6         -1.000000   
                         DL         -2.000000   
   Standard deviation    B6          20.05143   
                         DL          18.41773   
   IQR                   B6          15.00000   
                         DL          4.750000   
   Range                 B6          153.0000   
                         DL          220.0000   
   Minimum               B6         -11.00000   
                         DL         -11.00000   
   Maximum               B6          142.0000   
                         DL          209.0000   
   25th percentile       B6         -4.000000   
                         DL         -4.000000   
   50th percentile       B6         -1.000000   
                         DL         -2.000000   
   75th percentile       B6          11.00000   
                         DL         0.7500000   
 ────────────────────────────────────────────── 

Note that the summary table now lists values of the statistics for each airline separately. How many of the flights were operated by Delta? By Jet Blue? What is the mean departure delay for Delta flights? For Jet Blue flights? How do the quartiles compare?
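
Splitting by a categorical variable amounts to grouping the cases by level and then computing each statistic within each group. The sketch below illustrates the idea in Python with made-up delays. It is not the real data, so its numbers will not match the table above.

```python
import statistics
from collections import defaultdict

# Hypothetical (carrier, dep_delay) pairs, not real nycflights data.
rows = [("B6", -1), ("DL", -2), ("B6", 12), ("DL", 0), ("B6", 4), ("DL", -4)]

# Group the delays by carrier, as the Split by box does.
groups = defaultdict(list)
for carrier, delay in rows:
    groups[carrier].append(delay)

# Compute the same statistics separately within each group.
summary = {
    carrier: {
        "N": len(delays),
        "Mean": statistics.mean(delays),
        "Median": statistics.median(delays),
    }
    for carrier, delays in sorted(groups.items())
}

for carrier, stats in summary.items():
    print(carrier, stats)
```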

The plots are also split according to carrier. The histogram is now faceted to show the distribution of departure delays for Jet Blue on the top (blue) and Delta on the bottom (orange). The box plot on the left shows the distribution for Jet Blue, and the one on the right shows the distribution for Delta. How do the distributions compare? Does one airline appear to be better than the other in terms of departing on time?

On-time departures

Now we will explore the proportion of flights that depart on time. We will define a flight as on time if the departure delay is less than 5 minutes. First, we will need to create a new variable that indicates whether a flight is on time or not. We will do this by transforming the existing dep_delay variable, which is numeric, into a new variable dep_type that is categorical with two levels, “on time” and “delayed”. Often this type of transformation is referred to as recoding the variable.

In Jamovi, click on the Data tab. Then, click on the header for the dep_delay column in the spreadsheet so that the correct variable is selected for the transformation. Next, click on the Transform icon. In the interface that appears, change the name of the transformed variable to dep_type (by default it will be something like dep_delay (2)). In the pull-down box labeled using transform, select “create new transform”. This will bring up a second interface where you can define the transformation.

In the text field labeled \(f_x\), type IF($source < 5, "on time", "delayed") to define the recode condition. In this condition $source refers to the original variable, dep_delay, that you are transforming. The IF function checks whether the value of dep_delay is less than 5. If it is, the value of dep_type will be “on time”. If it is not, the value of dep_type will be “delayed”.
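
The same recoding rule can be written as a short Python function, which makes the behavior at the threshold easy to check. The function name dep_type is borrowed from the new variable's name for illustration. This is a sketch of the rule, not of Jamovi itself.

```python
def dep_type(dep_delay):
    """Recode a departure delay (in minutes) as "on time" or "delayed"."""
    # Mirrors IF($source < 5, "on time", "delayed"):
    # strictly less than 5 minutes counts as on time.
    return "on time" if dep_delay < 5 else "delayed"

# Early departures (negative delays) count as on time,
# and a delay of exactly 5 minutes counts as delayed.
for delay in [-3, 0, 4, 5, 20]:
    print(delay, "->", dep_type(delay))
```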

If you have set everything up correctly, your interface should look like the one in Figure 4.

Figure 4: Defining a transformation to create a new variable.

Note that the new variable is added to the spreadsheet, and the values are automatically filled in based on the transformation you defined. Check to make sure that the values in the dep_type column are correct, indicating that you recoded the variable correctly.

Hide the interface that you used to define the transformation by clicking the down arrow. Then, hide the other transformation interface by clicking the up arrow.

Exploring a categorical variable

Now we will create numerical and visual summaries for the new dep_type variable. The summary statistics that are appropriate for a categorical variable are different from those that are appropriate for a numeric variable. For a categorical variable, we might want to know the counts of each level and the proportion or percentage of cases that fall into each level.

The visualizations that are appropriate for a categorical variable are also different from those that are appropriate for a numeric variable. For a categorical variable, we will use a bar plot instead of a histogram or box plot.

We will create a new analysis in Jamovi to summarize the dep_type variable. We can use the same Descriptives type of analysis that we used earlier, but we will need to use different settings to reflect the fact that dep_type is a categorical variable. Click on the Analyses tab, then click on the Exploration icon, and select Descriptives to create a new analysis.

We will start by creating a frequency table that includes percentages. In the Descriptives interface, drag the dep_type variable into the Variables box. Then, expand the Statistics options menu by clicking on its header. Uncheck all of the boxes in this menu, since they are only appropriate for a numeric variable. Next, click on the check box to turn on Frequency tables. This check box is located below the Split by variable box (which should be empty). This will create the summary table. Your interface should look like the one in Figure 5.

Figure 5: Creating a frequency table for a categorical variable.

Note that we are still working with the filtered data set that includes only flights bound for LAX that were operated by Delta or Jet Blue. The frequency table shows the counts and percentages of flights that were on time and delayed for these flights. What percentage of the flights were delayed?
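
A frequency table counts the cases at each level and divides each count by the total. The Python sketch below uses eight hypothetical values of dep_type, so the percentages will not match your Jamovi output.

```python
from collections import Counter

# Hypothetical dep_type values (not real nycflights data).
dep_types = ["on time", "delayed", "on time", "on time",
             "delayed", "on time", "on time", "on time"]

# Count the cases at each level.
counts = Counter(dep_types)
total = sum(counts.values())

# Convert each count to a percentage of all cases.
percents = {level: 100 * n / total for level, n in counts.items()}

for level in counts:
    print(level, counts[level], f"{percents[level]:.1f}%")
```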

We can also add a bar plot to visualize the distribution of the dep_type variable. First, hide the Statistics options menu by clicking on its header. Then, expand the Plots options menu beneath it. Check the box for Bar plot. This will add a bar plot to your results. Your plot should match the following:

Exploring two categorical variables

What if we wanted to compare the proportion of on-time flights bound for LAX between Delta and Jet Blue? To do this, we need to calculate conditional proportions. Unfortunately, we cannot do that from within the Descriptives interface we have been using. Instead, we will need to use a different analysis to create the summaries and visualizations for the dep_type variable, split by carrier. In Jamovi, click on the Analyses tab, then click on the Frequencies icon, and select Independent Samples, which is listed under Contingency Tables.

Drag the carrier variable into the Rows box, and drag the dep_type variable into the Columns box. This will create a contingency table that shows the numbers of flights bound for LAX that were on time and delayed, split by carrier. The table also includes marginal totals (e.g., there were 159 total Jet Blue flights, and there were 321 total on-time flights). Your interface should look like the one in Figure 6.

Figure 6: Creating a contingency table for two categorical variables.

Next, we will add conditional proportions to the table. We are interested in comparing the proportion of on-time flights for Delta and Jet Blue. So the conditional proportions should be calculated within each row of the table. To add the conditional proportions, expand the Cells options menu by clicking on its header. Under Percentages, check the box for Row. This will add the conditional proportions to the table. Your interface should look like the one in Figure 7.

Figure 7: Adding conditional proportions to a contingency table.
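
Row percentages divide each cell count by the total for its row, so each row adds up to 100%. The Python sketch below shows the calculation. The counts are hypothetical placeholders and are not the counts from the lab's contingency table.

```python
# Hypothetical contingency table: rows are carriers, columns are dep_type.
# These counts are made up for illustration only.
table = {
    "B6": {"delayed": 10, "on time": 30},
    "DL": {"delayed": 10, "on time": 40},
}

# Conditional (row) percentages: each cell divided by its row total.
row_pct = {}
for carrier, cells in table.items():
    row_total = sum(cells.values())
    row_pct[carrier] = {level: 100 * n / row_total for level, n in cells.items()}

for carrier, cells in row_pct.items():
    print(carrier, cells)
```

In this made-up table both carriers have 10 delayed flights, yet their delayed percentages differ because the row totals differ. This is why we compare conditional proportions rather than raw counts.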

What percentage of Delta flights were on time? What percentage of Jet Blue flights were on time? How do the percentages compare?

We can visualize the relationship between the two categorical variables using bar plots. Hide the Cells options menu by clicking on its header. Then, expand the Plots options menu beneath it. Check the box for Bar plot. This will add a side-by-side (dodged) bar plot to your results. Your interface should look like the one in Figure 8.

Figure 8: Creating a side-by-side bar plot for two categorical variables.

What does each pair of bars represent? Within each pair, what does the blue (left) bar represent? What does the orange (right) bar represent?

Now, create a stacked bar plot showing percentages by selecting the Stacked option instead of Side by side, and the Percentages option instead of Counts. Select “within rows” from the drop-down menu next to the Percentages option so that the percentages of on-time and delayed departures are calculated within each row of the table (for each airline). Your stacked bar plot should look like the following:

Compare the percentages of on-time flights for Delta and Jet Blue using the bar plot. Which airline has a higher percentage of on-time flights?

Clearing filters

If you want to clear the filters you have created so you can return to working with the original data set, click on the Data tab, then click on the Filters icon. Select inactive for each of the filters you want to remove.

Copying tables and plots from Jamovi to include in your lab report

You can copy tables and plots from Jamovi to include in reports. To copy a table or plot, right-click on the table or plot and select Copy. (If you are using a Mac without right-clicking turned on, hold control and then click instead of right-clicking.) You can then paste the table or plot into your report.

Saving your work

You can save your work in Jamovi by clicking on the hamburger menu and selecting Save. You can save your work as a .omv file, which is a file that can be opened in Jamovi. However, you will not turn this file in for your lab report. Instead, you will turn in a PDF of your lab report that includes screenshots of the Jamovi interface, plots, tables, and your answers to the questions at the end of the lab. Even though you are not turning it in, you should save your Jamovi file in case you need to refer back to it later.

What you need to turn in

This section includes questions that you will turn in for this lab. You will be working with the original nycflights data set for this part of the lab, so make sure you have cleared any filters you created earlier. You may leave the recoded variables in your data set, since they won’t affect the answers to the questions.

Arrival delays are probably more important than departure delays. We will focus on arrival delays in this part of the lab. The arr_delay variable in the nycflights data set represents the arrival delay in minutes. Negative values indicate early arrivals, and positive values indicate late arrivals.

Perhaps delays are more likely to occur in months with bad weather. We will further explore the delays for Jet Blue flights leaving from NYC airports by comparing flights in February to flights in August.

  1. Create two filters so that your filtered data set includes all Jet Blue flights that departed from NYC airports in February or August. Note that months are listed as numbers, so February corresponds with the value 2, and August corresponds with the value 8. When you are setting up your filters you do not need to include quotes around these numbers (i.e., use 2 instead of “2” for February). Take a screenshot that shows the filters correctly defined and activated, similar to Figure 2. Include the screenshot in your lab report.
  2. In this problem you will calculate descriptive statistics using the filtered data set you created in the previous problem. At this point you should not compare flights from February to flights from August. Instead, you should focus on the descriptive statistics for the filtered data set as a whole. Create a table of descriptive statistics for the arr_delay variable. Your table must include the number of cases, the mean, median, standard deviation, IQR, range, minimum, maximum, and the quartiles (Q1, Q2, Q3). Include the table in your lab report. Exactly how many cases are included in the filtered data set? (Hint: your answer should be between 850 and 870. If it is not, you probably do not have your filters set up correctly.) What is the mean arrival delay for the flights in the filtered data set? What is the median arrival delay? How do the mean and median compare? What does this suggest about the distribution of arrival delays?
  3. In this problem you will create a histogram of the arrival delays for the whole filtered data set you created in the first problem. Create a histogram of the arr_delay variable. Include the histogram in your lab report. Is the shape of the distribution consistent with your expectations from problem 2? Explain.
  4. Now we will begin to compare arrival delays for Jet Blue flights in February to those in August. Create a table of descriptive statistics for the arr_delay variable, split by the month variable. Your table must include the same statistics as in problem 2. Include the table in your lab report. How many flights were in February? How many were in August? What is the mean arrival delay for the February flights? For the August flights? How do the means compare?
  5. Create a faceted histogram and a side-by-side box plot to compare the arrival delays for Jet Blue flights in February to those in August. Include both plots in your lab report. What do the plots suggest about the arrival delays for Jet Blue flights in February compared to August?
  6. Create a new categorical variable that indicates whether a flight arrives on time or has a delayed arrival. Call your new variable arr_type. A flight is considered on time if the arrival delay is less than 8 minutes. Use the same transformation method you used to create the dep_type variable earlier in the lab, but note that the threshold is different. Take a screenshot of the Jamovi interface that shows the transformation correctly defined, similar to Figure 4. Include the screenshot in your lab report.
  7. Create a contingency table that shows the counts of on-time and delayed arrivals for Jet Blue flights in February and August, with the month variable as the rows and the arr_type variable as the columns. Include conditional proportions in the table that show the proportion of February flights that are delayed, the proportion of August flights that are delayed, etc. Your table should be similar to the contingency table in Figure 7 (broken down by month instead of carrier). Include the table in your lab report. What percentage of the February flights were delayed? What percentage of the August flights were delayed? How do the percentages compare?
  8. Create a stacked bar plot that shows the percentages of on-time and delayed arrivals for Jet Blue flights in February and August. Month should be along the \(x\)-axis, there should be two bars that are each broken into two segments, and the top of each bar should be at 100%. Include the stacked bar plot in your lab report. What does the bar on the left represent? Within the bar on the left, what does the blue area represent? What does the orange area represent? Is the percentage of delayed flights higher in February or August? Is the difference very large?

You may create your lab report in a Word document or a Google Doc. You may organize your report as numbered answers to the questions listed above. Include the screenshots, plots, and tables in your report, making sure that they are positioned under the correct question number. You should also include your answers to the questions in your report, and your answers should refer to the relevant plots or tables when applicable. Save your report as a PDF and submit using the appropriate submission link on the course Moodle page (check the pdf before you submit it to make sure it is readable and complete).