Exploring Numerical Data

Chapter 5
Math 215

Life Expectancies

  • Every county in the US (3,142 counties) in a data set life
  • Variables include county name, state, average life expectancy (expectancy), median income (income)
  • Let’s glimpse the data (we used pipe in this command)
library(tidyverse)

life |> glimpse()
Rows: 3,142
Columns: 4
$ state      <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala…
$ county     <chr> "Autauga County", "Baldwin County", "Barbour County", "Bibb…
$ expectancy <dbl> 76.060, 77.630, 74.675, 74.155, 75.880, 71.790, 73.730, 73.…
$ income     <dbl> 37773, 40121, 31443, 29075, 31663, 25929, 33518, 33418, 312…

Selecting columns

  • We will focus on life expectancy in Massachusetts
  • filter gives us a subset of the rows and select gives us a subset of the columns in a data frame
  • We will filter the data to obtain a subset of the observations for MA counties and name it life_ma
life_ma <- life |>
  filter(state == "Massachusetts")
  • And select the county and expectancy columns
life_ma <- life_ma |>
  select(county, expectancy)
  • Let’s glimpse the resulting data set
life_ma |>
  glimpse()
Rows: 14
Columns: 2
$ county     <chr> "Barnstable County", "Berkshire County", "Bristol County", …
$ expectancy <dbl> 80.325, 79.780, 78.975, 80.995, 80.415, 80.070, 78.420, 80.…

Mutating columns

  • We can use the mutate function to create new columns from existing columns or to manipulate existing columns
  • Let’s round the life expectancies to whole numbers
life_ma <- life_ma |>
  mutate(expectancy = round(expectancy))
  • Let’s glimpse the resulting data set
life_ma |>
  glimpse()
Rows: 14
Columns: 2
$ county     <chr> "Barnstable County", "Berkshire County", "Bristol County", …
$ expectancy <dbl> 80, 80, 79, 81, 80, 80, 78, 80, 81, 80, 81, 79, 79, 80
  • We can chain several operations together with pipes
  • Data flows from left to right through the pipeline
  • The following code chains all of the operations together that we just performed
life_ma <- life |>
  filter(state == "Massachusetts") |>
  select(county, expectancy) |>
  mutate(expectancy = round(expectancy))

Life Expectancies in MA Counties

  • Now we can take a look on a dotplot of life expectancies for the fourteen MA counties
  • Shape: Skewed to the left, Unimodal
  • Mode = 80 (the most commonly occurring data point)
  • Max = 81, Min = 78, Range = Max - Min = 81-78=3

Summaries of numerical data

  • Measures of center

    • mean
    • median
  • Percentiles/quantiles

    • quartiles
    • other percentiles
  • Measures of spread

    • interquartile range
    • standard deviation

Mean

  • If there are \(n\) cases in a sample then the sample mean of the numeric variable \(x\) is \[\bar{x}=\frac{x_1+x_2+\cdots+x_n}{n}\]
  • The sample mean is a measure of the center of the distribution of the data
  • The sample mean \(\bar{x}\) (a statistic) gives us a point estimate of the population mean \(\mu\) (a parameter)

Median

  • The median is the value that splits the data in half
  • 50% of the data fall below the median
  • We can compute the mean and median of the life expectancy data and summarize them
life_ma |>
  summarize(mean = mean(expectancy), median = median(expectancy))
# A tibble: 1 × 2
   mean median
  <dbl>  <dbl>
1  79.9     80
  • Let’s see where the mean and median fall on the dotplot
  • Note that the mean is pulled toward the left tail of the distribution.

Skew and Symmetry

Group means

  • We can also compare means between different groups in the data

  • Let’s compare the mean of the life expectancy variable between counties in West Coast states (California, Oregon, Washington) and counties that are not in West Coast states

  • First we add a new variable to the full data set that indicates whether a county is in a West Coast state

life <- life |>
  mutate(west_coast = if_else(state %in% c("California", "Oregon",
                                           "Washington"),
                              "yes",
                              "no"))
  • Next we group the data using the new west_coast variable
  • Then we use the summarize function to compute group means and medians
  • Is life expectancy higher in counties in west coast states?
life |>
  group_by(west_coast) |>
  summarize(mean = mean(expectancy), median = median(expectancy))
# A tibble: 2 × 3
  west_coast  mean median
  <chr>      <dbl>  <dbl>
1 no          77.1   77.3
2 yes         78.9   78.6

Percentiles

  • The Xth percentile is the value below which X% of the data fall
  • The median is the 50th percentile
  • Let’s compute the 90th percentile of the life expectancy variable in the Massachusetts data
quantile(life_ma$expectancy, 0.9)
90% 
 81 

Quartiles

  • The first quartile (Q1) is the 25th percentile, the value below which 25% of the data fall
  • The third quartile (Q3) is the 75th percentile, the value below which 75% of the data fall
  • The median is sometimes described as the second quartile (Q2)
  • Quartiles are often included in numerical summaries of a data set
  • Let’s add quartiles to our summary of the Massachusetts data
life_ma |>
  summarize(Q1 = quantile(expectancy, 0.25),
            median = median(expectancy),
            Q3 = quantile(expectancy, 0.75),
            mean = mean(expectancy))
# A tibble: 1 × 4
     Q1 median    Q3  mean
  <dbl>  <dbl> <dbl> <dbl>
1  79.2     80    80  79.9

Five number summary

  • Minimum, Q1, median, Q3 and maximum are called five number summary
  • Intrequartile Range (IQR)= Q3 - Q1
  • Range = Max - Min
life_ma |>
  summarize(min = min(expectancy),
            Q1 = quantile(expectancy, 0.25),
            median = median(expectancy),
            Q3 = quantile(expectancy, 0.75),
            max = max(expectancy),
            iqr = IQR(expectancy),
            range=max(expectancy)-min(expectancy))
# A tibble: 1 × 7
    min    Q1 median    Q3   max   iqr range
  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1    78  79.2     80    80    81  0.75     3

Box plot for the five number summary

  • The box extends from Q1 to Q3 with a vertical line at the median (Q2)
  • Whiskers extend from the box to the smallest and largest values that are not outliers
  • Outliers are plotted as individual points according to \(1.5\times\)IQR rule

# A tibble: 1 × 7
    min    Q1 median    Q3   max   iqr range
  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1    78  79.2     80    80    81  0.75     3

Identifying outliers using IQR

  • An outlier is an observation that is extreme relative to the rest of the data
  • There is no universally accepted method for identifying outliers
  • One common method uses the \(1.5\times\)IQR rule: a point is considered an outlier if it is larger than Q3+\(1.5\times\)IQR or smaller than Q1 - \(1.5\times\)IQR.
  • Let’s use this method to identify outliers in the life expectancy data for the whole country
iqr <- IQR(life$expectancy)
q1 <- quantile(life$expectancy, 0.25)
q3 <- quantile(life$expectancy, 0.75)

life <- life |>
  mutate(is_outlier = case_when(expectancy < q1 - 1.5*iqr ~ TRUE,
                                expectancy <= q3 + 1.5*iqr ~ FALSE,
                                TRUE ~ TRUE))
life |> count(is_outlier)
# A tibble: 2 × 2
  is_outlier     n
  <lgl>      <int>
1 FALSE       3130
2 TRUE          12
  • There are 12 counties identified as outliers
  • All of them have low life expectancies
  • Here is the boxplot of the life expectancy of the entire dataset
life |>
  ggplot(aes(x = expectancy)) +
  geom_boxplot() +
  theme_minimal() +
  theme(text = element_text(size = 20))

Standard deviation

  • The most commonly used measure of variability is the standard deviation
  • The deviation of a single observation \(i\) is the difference between the observed value and the mean, \(x_i - \bar{x}\)
  • The standard deviation describes the typical deviation of the data from the mean
  • The sample variance is the average squared deviance \[s^2=\frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 \cdots (x_n-\bar{x})^2}{n-1}\]
  • We divide by \(n-1\) rather than \(n\) (the sample size) to obtained an unbiased estimate of the population variance \(\sigma^2\). Otherwise \(s^2\) tends to underestimate \(\sigma^2\)
  • The sample standard deviation is \[s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}}\]

- For many numeric variables the following rules of thumb apply:

  • Roughly 68% of the data fall within 1 standard deviation of the mean
  • Roughly 95% of the data fall within 2 standard deviations of the mean
  • Roughly 99.7% of the data fall within 3 standard deviations of the mean

Illustrations of 68-95-99.7 rule. IMS1 Figure 5.7.

  • We can use the sd function to compute the sample standard deviation
life_ma |>
  summarize(sd = sd(expectancy),
            mean = mean(expectancy))
# A tibble: 1 × 2
     sd  mean
  <dbl> <dbl>
1 0.864  79.9

Cars

  • data on all new car models (428) in a certain year
  • 19 variables
  • includes weight, highway mpg (hwy_mpg), msrp, whether a pickup or not (pickup)
  • we will explore a variety of visualizations involving numerical variables
  • Let’s glimpse the data
glimpse(cars)
Rows: 428
Columns: 19
$ name        <chr> "Chevrolet Aveo 4dr", "Chevrolet Aveo LS 4dr hatch", "Chev…
$ sports_car  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ suv         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ wagon       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ minivan     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ pickup      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ all_wheel   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ rear_wheel  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ msrp        <int> 11690, 12585, 14610, 14810, 16385, 13670, 15040, 13270, 13…
$ dealer_cost <int> 10965, 11802, 13697, 13884, 15357, 12849, 14086, 12482, 12…
$ eng_size    <dbl> 1.6, 1.6, 2.2, 2.2, 2.2, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.7…
$ ncyl        <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
$ horsepwr    <int> 103, 103, 140, 140, 140, 132, 132, 130, 110, 130, 130, 115…
$ city_mpg    <int> 28, 28, 26, 26, 26, 29, 29, 26, 27, 26, 26, 32, 36, 32, 29…
$ hwy_mpg     <int> 34, 34, 37, 37, 37, 36, 36, 33, 36, 33, 33, 38, 44, 38, 33…
$ weight      <int> 2370, 2348, 2617, 2676, 2617, 2581, 2626, 2612, 2606, 2606…
$ wheel_base  <int> 98, 98, 104, 104, 104, 105, 105, 103, 103, 103, 103, 103, …
$ length      <int> 167, 153, 183, 183, 183, 174, 174, 168, 168, 168, 168, 175…
$ width       <int> 66, 66, 69, 68, 69, 67, 67, 67, 67, 67, 67, 67, 67, 68, 66…

Dotplot

  • A dotplot represents each case with a dot
  • Dots are stacked on top of each other at the appropriate location on the x-axis
  • Let’s create a dot plot of vehicle weights
  • By default ggplot does not produce a useful y-axis scale, so we hide it
cars |>
  ggplot(aes(x = weight)) +
  geom_dotplot(dotsize = 0.2, stackratio = 1.2) +
  scale_y_continuous(NULL, breaks = NULL) + # hide meaningless y scale
  theme_minimal() +
  theme(text = element_text(size = 20))
  • Note that the distribution of weights is skewed right, meaning that it has a tail on the right

Histogram

  • In a histogram data are aggregated into bins on the x-axis
  • The height of each bar is proportional to the number of cases in the bin
  • Let’s create a histogram of vehicle weights
cars |>
  ggplot(aes(x = weight)) +
  geom_histogram(col='gray',fill='blue') +
  theme_minimal() +
  theme(text = element_text(size = 20))
  • Rather than using the default values, we can also specify the number of bins (bins) or the bin width (binwidth)
  • Here we specify the number of bins
cars |>
  ggplot(aes(x = weight)) +
  geom_histogram(bins = 15,col='grey',fill='blue') +
  theme_minimal() +
  theme(text = element_text(size = 20))

Density plot

  • In a density plot the shape of the distribution is outlined by smooth line

  • Let’s create a density plot of vehicle weights (The bandwith (bw) controls the degree of smoothing)

cars |>
  ggplot(aes(x = weight)) +
  geom_density(bw=100) +
  theme_minimal() +
  theme(text = element_text(size = 20))

Scatter plot - visualizing 2 numerical variables

  • We can visualize two numeric variables using a scatter plot
  • Let’s plot highway gas mileage against vehicle weight
cars |>
  ggplot(aes(x = weight, y = hwy_mpg)) +
  geom_point() +
  theme_minimal() +
  theme(text = element_text(size = 20))

Faceted histograms

  • We can visualize two variables, where one is numeric and the other is categorical using faceted histograms
  • We simply plot a separate histogram for each level of the categorical variable
  • Let’s use a faceted histogram to visualize the distribution of highway mileage for vehicles that are pickups and vehicles that are not pickups
cars |>
  ggplot(aes(x = hwy_mpg,fill=pickup)) +
  geom_histogram(col='grey') +
  facet_wrap(~pickup, ncol = 1, scales = "free_y") +
  theme(text = element_text(size = 20))

Colored density plots

  • We can use colored density plots for a similar purpose-
  • By coloring the density plots according to the levels of the categorical variable, we can plot them on the same axes and still distinguish between the distributions
  • Let’s use colored density plots to visualize the distribution of highway mileage for vehicles that are pickups and vehicles that are not pickups
  • The level of alpha controls the transparency of the plots
cars |>
  ggplot(aes(x = hwy_mpg, fill = pickup)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  theme(text = element_text(size = 20))

Transforming data

  • Sometimes it is helpful to transform a variable
  • For example, a log transformation is commonly applied to distributions that are strongly skewed to the right
  • The transformed variable is often more appropriate for analyses that use a mathematical model to approximate the distribution of the data
  • The msrp (manufacturer’s suggested retail price) data is skewed right
cars |>
  ggplot(aes(x = msrp)) +
  geom_histogram(col='grey', fill='blue') +
  theme_minimal() +
  theme(text = element_text(size = 20))
  • Let’s create a new variable by taking the (natural) log of the msrp variable
  • The transformed variable has a more symmetric, bell-shaped distribution
cars <- cars |>
  mutate(msrp_log = log(msrp))

cars |>
  ggplot(aes(x = msrp_log)) +
  geom_histogram(col='grey', fill='blue') +
  theme_minimal() +
  theme(text = element_text(size = 20))

Intensity Maps

  • Sometimes it is useful to use colors to show higher or lower values of variables
  • Using various colors or a continuous gradient we can visualize distribution of a variable
  • Intensity maps are very helpful for seeing geographical trends
ggplot(data = chi_health_map, aes(fill = Unemployment)) +
  geom_sf() + 
  scale_fill_gradient2(low = "green", mid = "purple", high = "red", midpoint = 20) +
  ggtitle("Unemployment rates in Chicago Community Areas")