Exploring Numerical Data

Chapter 5
Math 115

Life Expectancies

  • Every county in the US (3,142 counties)
  • Variables include county name, state, average life expectancy (expectancy), median income (income)
  • First 15 rows of life_exp data
state    county           expectancy  income
Alabama  Autauga County   76.060      37773
Alabama  Baldwin County   77.630      40121
Alabama  Barbour County   74.675      31443
Alabama  Bibb County      74.155      29075
Alabama  Blount County    75.880      31663
Alabama  Bullock County   71.790      25929
Alabama  Butler County    73.730      33518
Alabama  Calhoun County   73.300      33418
Alabama  Chambers County  73.245      31282
Alabama  Cherokee County  74.650      32645
Alabama  Chilton County   73.880      31380
Alabama  Choctaw County   75.050      31046
Alabama  Clarke County    74.820      31877
Alabama  Clay County      74.145      32965
Alabama  Cleburne County  74.145      31209

Life Expectancies in MA Counties

  • For now, we will focus on life expectancy in Massachusetts (14 counties)
  • Since the life_exp data include all US counties, we need to filter the data to retain just the Massachusetts counties
  • You can filter a data set using Jamovi (or another spreadsheet program)
  • All 14 rows of the filtered life_exp data
state          county             expectancy  income
Massachusetts  Barnstable County  80.325      64730
Massachusetts  Berkshire County   79.780      50712
Massachusetts  Bristol County     78.975      48294
Massachusetts  Dukes County       80.995      78745
Massachusetts  Essex County       80.415      60320
Massachusetts  Franklin County    80.070      48428
Massachusetts  Hampden County     78.420      46216
Massachusetts  Hampshire County   80.040      46756
Massachusetts  Middlesex County   81.240      73265
Massachusetts  Nantucket County   80.325      107341
Massachusetts  Norfolk County     81.115      80711
Massachusetts  Plymouth County    79.145      59273
Massachusetts  Suffolk County     79.260      66074
Massachusetts  Worcester County   79.550      51370
  • Let’s take a quick peek at a dotplot of life expectancies for the MA counties
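The filtering step can also be done in code. The following is a minimal Python sketch, assuming the rows have been loaded as a list of dictionaries. Only a handful of rows from the tables above are included, and `filter_by_state` is a hypothetical helper name, not a Jamovi feature.

```python
# A few rows copied from the tables above (not the full 3,142-county data set).
life_exp = [
    {"state": "Alabama", "county": "Autauga County", "expectancy": 76.060, "income": 37773},
    {"state": "Alabama", "county": "Baldwin County", "expectancy": 77.630, "income": 40121},
    {"state": "Massachusetts", "county": "Barnstable County", "expectancy": 80.325, "income": 64730},
    {"state": "Massachusetts", "county": "Berkshire County", "expectancy": 79.780, "income": 50712},
    {"state": "Massachusetts", "county": "Bristol County", "expectancy": 78.975, "income": 48294},
]


def filter_by_state(rows, state):
    """Return only the rows whose 'state' field equals the given state."""
    return [row for row in rows if row["state"] == state]


ma_counties = filter_by_state(life_exp, "Massachusetts")
for row in ma_counties:
    print(row["county"], row["expectancy"])
```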

Summaries of numerical data

  • Measures of center

    • mean
    • median
  • Percentiles/quantiles

    • quartiles
    • other percentiles
  • Measures of spread

    • interquartile range
    • standard deviation

Mean

  • If there are \(n\) cases in a sample then the sample mean of the numeric variable \(x\) is \[\bar{x}=\frac{x_1+x_2+\cdots+x_n}{n}\]
  • The sample mean is a measure of the center of the distribution of the data
  • The sample mean \(\bar{x}\) (a statistic) gives us a point estimate of the population mean \(\mu\) (a parameter)

The mean life expectancy in Massachusetts counties is 79.9 years

mean
79.85714
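As a rough check, the mean formula can be written out directly in Python. The sketch below uses the 14 Massachusetts values from the table. The reported value of 79.85714 is reproduced when the life expectancies are first rounded to whole years; that rounding is inferred from the numbers, not stated in the slides, so treat it as an assumption.

```python
# Life expectancies for the 14 Massachusetts counties, from the table above.
expectancy = [80.325, 79.780, 78.975, 80.995, 80.415, 80.070, 78.420,
              80.040, 81.240, 80.325, 81.115, 79.145, 79.260, 79.550]


def sample_mean(x):
    """Sum of the values divided by the number of cases n."""
    return sum(x) / len(x)


# Assumption: the summary was computed on values rounded to whole years.
rounded = [round(v) for v in expectancy]

print(sample_mean(rounded))     # about 79.857, matching the reported mean
print(sample_mean(expectancy))  # about 79.975 on the unrounded values
```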

Median

  • The median is the value that splits the data in half
  • 50% of the data fall below the median
  • We can also compute the median of the life expectancy data
mean      median
79.85714  80
  • Let’s see where the mean and median fall on the dotplot
  • The mean is red and dashed. Note that it is pulled toward the longer left tail of the distribution.
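A minimal sketch of how the median is found: sort the values and take the middle one, or the average of the two middle ones when the number of cases is even. The whole-year values below are the Massachusetts life expectancies rounded to the nearest year, which is an assumption made so that the results line up with the summary above.

```python
import statistics

# Massachusetts life expectancies rounded to whole years (assumed rounding).
rounded = [80, 80, 79, 81, 80, 80, 78, 80, 81, 80, 81, 79, 79, 80]


def median(x):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    s = sorted(x)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


med = median(rounded)
avg = sum(rounded) / len(rounded)

print(med)  # 80.0
print(avg)  # about 79.857, slightly below the median
```

The mean falling below the median is consistent with a distribution whose longer tail is on the left.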

Skew and Symmetry

Group means

  • We can also compare means between different groups in the data
  • Let’s compare the mean of the life expectancy variable between counties in West Coast states (California, Oregon, Washington) and counties that are not in West Coast states
  • Is life expectancy higher in counties in West Coast states?
west_coast  mean      median
no          77.12750  77.31
yes         78.90545  78.65
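The West Coast comparison needs the full data set, which is not reproduced here. As an illustration of the same idea, the sketch below groups a few rows from the tables above by state and summarizes each group. The helper name `group_summary` is hypothetical, and the numbers it produces describe only these ten rows.

```python
from collections import defaultdict
from statistics import mean, median

# (state, expectancy) pairs: the first five Alabama and Massachusetts rows shown earlier.
rows = [
    ("Alabama", 76.060), ("Alabama", 77.630), ("Alabama", 74.675),
    ("Alabama", 74.155), ("Alabama", 75.880),
    ("Massachusetts", 80.325), ("Massachusetts", 79.780),
    ("Massachusetts", 78.975), ("Massachusetts", 80.995),
    ("Massachusetts", 80.415),
]


def group_summary(pairs):
    """Collect values by group label, then compute the mean and median of each group."""
    groups = defaultdict(list)
    for label, value in pairs:
        groups[label].append(value)
    return {label: {"mean": mean(values), "median": median(values)}
            for label, values in groups.items()}


summary = group_summary(rows)
for label, stats in summary.items():
    print(label, round(stats["mean"], 3), stats["median"])
```

Replacing the state label with a yes/no West Coast flag would give the comparison described above.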

Percentiles

  • The Xth percentile is the value below which X% of the data fall
  • The median is the 50th percentile
  • For example, the 90th percentile of the life expectancy variable in the Massachusetts data is 81 years, meaning that 90% of the counties have an average life expectancy that is less than 81 years
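Software packages use several slightly different conventions for computing percentiles. The sketch below uses linear interpolation between sorted values, which reproduces the 90th percentile quoted above when applied to the Massachusetts values rounded to whole years. Both the rounding and the choice of convention are assumptions.

```python
# Massachusetts life expectancies rounded to whole years (assumed rounding).
rounded = [80, 80, 79, 81, 80, 80, 78, 80, 81, 80, 81, 79, 79, 80]


def percentile(x, p):
    """p-th percentile (0 <= p <= 100) by linear interpolation between sorted values."""
    s = sorted(x)
    pos = (len(s) - 1) * p / 100      # fractional 0-based position in the sorted data
    lower = int(pos)                  # index of the value at or just below that position
    upper = min(lower + 1, len(s) - 1)
    frac = pos - lower
    return s[lower] + frac * (s[upper] - s[lower])


p90 = percentile(rounded, 90)
p50 = percentile(rounded, 50)

print(p90)  # 81.0
print(p50)  # 80.0, the median
```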

Quartiles

  • The first quartile (Q1) is the 25th percentile, the value below which 25% of the data fall
  • The third quartile (Q3) is the 75th percentile, the value below which 75% of the data fall
  • The median is sometimes described as the second quartile (Q2)
  • Quartiles are often included in numerical summaries of a data set
  • Let’s add quartiles to our summary of the Massachusetts county life expectancy data
Q1     median  Q3  mean
79.25  80      80  79.85714

Maximum and minimum values

  • Maximum and minimum values in a data set are often included in numerical summaries as well
  • Let’s add them to our summary of the Massachusetts county life expectancy data
min  Q1     median  Q3  max  mean
78   79.25  80      80  81   79.85714
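The full summary can be assembled with Python's standard `statistics` module. This is a sketch under two assumptions: the values are the Massachusetts life expectancies rounded to whole years, and quartiles use the `"inclusive"` interpolation method, which happens to reproduce the table above.

```python
import statistics

# Massachusetts life expectancies rounded to whole years (assumed rounding).
rounded = [80, 80, 79, 81, 80, 80, 78, 80, 81, 80, 81, 79, 79, 80]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(rounded, n=4, method="inclusive")

summary = {
    "min": min(rounded),
    "Q1": q1,
    "median": q2,
    "Q3": q3,
    "max": max(rounded),
    "mean": statistics.mean(rounded),
}

for name, value in summary.items():
    print(name, value)
```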

Range

  • The simplest measure of spread/variability of a distribution of data is the range
  • It is simply the difference between the largest and smallest values
  • The range is \(81-78=3\) years for the Massachusetts county life expectancy data
range
3

Interquartile range

  • The interquartile range (IQR) is the difference Q3-Q1
  • The IQR will never be larger than the range!
  • The IQR is \(80-79.25=0.75\) years for the Massachusetts county life expectancy data
iqr   range
0.75  3
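Both measures of spread are simple differences, as this short sketch shows. It assumes the whole-year Massachusetts values and the `"inclusive"` quartile method from Python's `statistics` module; `spread` is a hypothetical helper name.

```python
import statistics

# Massachusetts life expectancies rounded to whole years (assumed rounding).
rounded = [80, 80, 79, 81, 80, 80, 78, 80, 81, 80, 81, 79, 79, 80]


def spread(x):
    """Return (IQR, range) for a list of numbers."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    return q3 - q1, max(x) - min(x)


iqr, data_range = spread(rounded)

print(iqr)         # 0.75
print(data_range)  # 3
```

Because Q1 and Q3 both lie between the minimum and the maximum, the IQR can never exceed the range.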

Standard deviation

  • The most commonly used measure of variability is the standard deviation
  • The deviation of a single observation \(i\) is the difference between the observed value and the mean, \(x_i - \bar{x}\)
  • The standard deviation describes the typical deviation of the data from the mean
  • The sample variance is the average squared deviation \[s^2=\frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2}{n-1}\]
  • We divide by \(n-1\) rather than \(n\) (the sample size) to obtain an unbiased estimate of the population variance \(\sigma^2\). Otherwise \(s^2\) tends to underestimate \(\sigma^2\)
  • The sample standard deviation is \[s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}}\]

  • For many numeric variables, especially those with a roughly bell-shaped distribution, the following rules of thumb apply:

  • Roughly 68% of the data fall within 1 standard deviation of the mean
  • Roughly 95% of the data fall within 2 standard deviations of the mean
  • Roughly 99.7% of the data fall within 3 standard deviations of the mean

Illustrations of 68-95-99.7 rule. IMS2 Figure 5.7.
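These percentages are exact for a normal (bell-shaped) distribution, which is where the rule of thumb comes from. The sketch below checks them with `NormalDist` from Python's standard library. Real data will only follow the rule approximately, and `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```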

  • The standard deviation for the Massachusetts county life expectancy data is 0.864 years
sd         iqr   range
0.8644378  0.75  3
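The variance and standard deviation formulas translate directly into code. The sketch below applies them to the Massachusetts values rounded to whole years (an assumption that reproduces the reported 0.8644378) and compares the result with `statistics.stdev`, which also divides by \(n-1\).

```python
import math
import statistics

# Massachusetts life expectancies rounded to whole years (assumed rounding).
rounded = [80, 80, 79, 81, 80, 80, 78, 80, 81, 80, 81, 79, 79, 80]


def sample_variance(x):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    return sum((xi - x_bar) ** 2 for xi in x) / (n - 1)


s2 = sample_variance(rounded)
s = math.sqrt(s2)

print(s2)  # sample variance
print(s)   # sample standard deviation, about 0.864
```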

Identifying outliers using IQR

  • An outlier is an observation that is extreme relative to the rest of the data
  • There is no universally accepted method for identifying outliers
  • One common method that is also simple uses the IQR
  • With this method, a point is considered an outlier if it lies more than \(1.5\times\)IQR above Q3 or more than \(1.5\times\)IQR below Q1.
  • Let’s use this method to identify outliers in the life expectancy data for the whole country
  • For the whole country, \(Q1=75.70\) years and \(Q3=78.77\) years
  • Thus, \(IQR = 78.77-75.70=3.07\)
  • Any county with a life expectancy less than \(75.70-1.5\times3.07\approx71.09\) years or greater than \(78.77+1.5\times3.07\approx83.38\) years is considered an outlier
  • Using the \(1.5\times\)IQR rule, 12 counties are identified as outliers
  • All have low life expectancies
state          county              expectancy
Arkansas       Mississippi County  70.915
Arkansas       Phillips County     70.900
Mississippi    Coahoma County      70.740
Virginia       Petersburg City     70.740
Mississippi    Washington County   70.595
West Virginia  Mingo County        70.590
Mississippi    Sunflower County    70.385
Mississippi    Quitman County      70.030
Mississippi    Tunica County       70.030
Mississippi    Bolivar County      69.675
Kentucky       Perry County        69.585
West Virginia  Mcdowell County     68.400
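The fence arithmetic can be checked with a few lines of Python. This sketch starts from the national quartiles quoted above and tests a handful of counties taken from the tables in these slides. The names `lower_fence`, `upper_fence`, and `is_outlier` are illustrative choices.

```python
# Quartiles for all US counties, as quoted above (in years).
Q1, Q3 = 75.70, 78.77
IQR = Q3 - Q1

# Fences sit 1.5 * IQR below Q1 and 1.5 * IQR above Q3.
lower_fence = Q1 - 1.5 * IQR   # about 71.095
upper_fence = Q3 + 1.5 * IQR   # about 83.375


def is_outlier(value):
    """True if the value lies outside the 1.5 * IQR fences."""
    return value < lower_fence or value > upper_fence


# A few counties from the tables in these slides.
examples = {
    "Mcdowell County, West Virginia": 68.400,
    "Mississippi County, Arkansas": 70.915,
    "Autauga County, Alabama": 76.060,
    "Middlesex County, Massachusetts": 81.240,
}
flags = {name: is_outlier(value) for name, value in examples.items()}

for name, flag in flags.items():
    print(name, flag)
```

Only the first two counties fall below the lower fence. Middlesex County, despite having the highest life expectancy in Massachusetts, is well below the upper fence.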

Cars

  • Data on all 428 new car models in a certain year
  • 19 variables
  • Variables include weight, highway mpg (hwy_mpg), MSRP (msrp), and whether the vehicle is a pickup (pickup)
  • We will explore a variety of visualizations involving numerical variables
  • Here are the first 8 rows
name sports_car suv wagon minivan pickup all_wheel rear_wheel msrp dealer_cost eng_size ncyl horsepwr city_mpg hwy_mpg weight wheel_base length width
Chevrolet Aveo 4dr FALSE FALSE FALSE FALSE FALSE FALSE FALSE 11690 10965 1.6 4 103 28 34 2370 98 167 66
Chevrolet Aveo LS 4dr hatch FALSE FALSE FALSE FALSE FALSE FALSE FALSE 12585 11802 1.6 4 103 28 34 2348 98 153 66
Chevrolet Cavalier 2dr FALSE FALSE FALSE FALSE FALSE FALSE FALSE 14610 13697 2.2 4 140 26 37 2617 104 183 69
Chevrolet Cavalier 4dr FALSE FALSE FALSE FALSE FALSE FALSE FALSE 14810 13884 2.2 4 140 26 37 2676 104 183 68
Chevrolet Cavalier LS 2dr FALSE FALSE FALSE FALSE FALSE FALSE FALSE 16385 15357 2.2 4 140 26 37 2617 104 183 69
Dodge Neon SE 4dr FALSE FALSE FALSE FALSE FALSE FALSE FALSE 13670 12849 2.0 4 132 29 36 2581 105 174 67
Dodge Neon SXT 4dr FALSE FALSE FALSE FALSE FALSE FALSE FALSE 15040 14086 2.0 4 132 29 36 2626 105 174 67
Ford Focus ZX3 2dr hatch FALSE FALSE FALSE FALSE FALSE FALSE FALSE 13270 12482 2.0 4 130 26 33 2612 103 168 67

Dotplot

  • A dotplot represents each case with a dot
  • Dots are stacked on top of each other at the appropriate location on the x-axis
  • Dot plot of vehicle weights

Histogram

  • In a histogram data are aggregated into bins on the x-axis
  • The height of each bar is proportional to the number of cases in the bin
  • Histogram of vehicle weights
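The binning behind a histogram can be sketched in a few lines of Python. The example uses only the eight vehicle weights shown in the table above, not the full 428 models, and the bin width of 100 is an arbitrary choice made for illustration.

```python
from collections import Counter

# Weights of the eight vehicles listed in the table above.
weights = [2370, 2348, 2617, 2676, 2617, 2581, 2626, 2612]

BIN_WIDTH = 100  # assumed bin width; plotting software chooses its own default


def bin_counts(values, width):
    """Count how many values fall in each bin [left, left + width), including empty bins."""
    counts = Counter((v // width) * width for v in values)
    first, last = min(counts), max(counts)
    return {left: counts.get(left, 0) for left in range(first, last + width, width)}


counts = bin_counts(weights, BIN_WIDTH)

# Text version of the histogram: one '#' per vehicle in the bin.
for left, count in counts.items():
    print(f"[{left}, {left + BIN_WIDTH})", "#" * count)
```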

Density plot

  • In a density plot the shape of the distribution is represented using a smooth line (think of this as a smoothed out histogram)
  • Density plot of vehicle weights

Box plot

  • A box plot takes a different approach to visualizing the distribution of a numerical variable

  • Box plots are constructed using summary statistics:

    • The box extends from Q1 to Q3 with a vertical line at the median (Q2)
    • Whiskers extend from the box to the smallest and largest values that are not outliers
    • Outliers are plotted as individual points
  • Box plot of vehicle weights. Are there any outliers?
  • It is important to note that box plots can miss important characteristics of the distributions
  • For example, if the distribution is bimodal, then the box plot won’t show it
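The pieces of a box plot can be computed without drawing anything. The sketch below does so for the eight vehicle weights listed earlier, so it does not answer the question about the full data set. It assumes the `"inclusive"` quartile method from Python's `statistics` module and the \(1.5\times\)IQR rule for outliers; `box_plot_stats` is a hypothetical helper name.

```python
import statistics

# Weights of the eight vehicles listed in the table above.
weights = [2370, 2348, 2617, 2676, 2617, 2581, 2626, 2612]


def box_plot_stats(x):
    """Quartiles, whisker ends, and outliers using the 1.5 * IQR rule."""
    s = sorted(x)
    q1, med, q3 = statistics.quantiles(s, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in s if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value that is not an outlier
        "whisker_high": inside[-1],  # largest value that is not an outlier
        "outliers": [v for v in s if v < low_fence or v > high_fence],
    }


stats = box_plot_stats(weights)
for name, value in stats.items():
    print(name, value)
```

In this small subset the two lightest vehicles fall below the lower fence, so they would be drawn as individual points.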

Scatter plot - visualizing 2 numerical variables

  • We can visualize two numeric variables using a scatter plot
  • Typically, the x-axis is used for the explanatory variable and y-axis for the response variable
  • A scatter plot of highway gas mileage vs. vehicle weight

Faceted histograms

  • We can visualize two variables, one numeric and one categorical, using faceted histograms
  • We simply plot a separate histogram for each level of the categorical variable
  • Faceted histogram showing the distributions of highway mileage for vehicles that are pickups and vehicles that are not pickups

Colored density plots

  • We can use colored density plots for a similar purpose
  • By coloring the density plots according to the levels of the categorical variable, we can plot them on the same axes and still distinguish between the distributions
  • Colored density plots to visualize the distributions of highway mileage for vehicles that are pickups and vehicles that are not pickups

Transforming data

  • Sometimes it is helpful to transform a variable
  • For example, a log transformation is commonly applied to distributions that are strongly skewed to the right
  • The transformed variable is often more appropriate for analyses that use a mathematical model to approximate the distribution of the data
  • The msrp data are skewed right
  • Histogram of MSRP
  • We can create a new variable by taking the (natural) log of the msrp variable
  • The transformed variable has a more symmetric, bell-shaped distribution
  • Histogram of log of MSRP
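Creating the transformed variable is a one-line operation. The sketch below applies the natural log to the MSRP values of the eight vehicles listed earlier. These eight are all inexpensive models, so they do not show the right skew of the full data set; the point is only the mechanics of the transformation.

```python
import math

# MSRP (in dollars) of the eight vehicles listed in the table above.
msrp = [11690, 12585, 14610, 14810, 16385, 13670, 15040, 13270]

# New variable: natural log of msrp.
log_msrp = [math.log(price) for price in msrp]

for price, log_price in zip(msrp, log_msrp):
    print(price, round(log_price, 3))
```

The log preserves the ordering of the values but compresses large values more than small ones, which is why it pulls in a long right tail.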

Intensity Maps

  • Sometimes it is useful to use colors to show higher or lower values of variables
  • Using various colors or a continuous gradient, we can visualize the distribution of a variable
  • Intensity maps are very helpful for seeing geographical trends