Exploring Numerical Data
Chapter 5
Math 115
Life Expectancies
- Every county in the US (3,142 counties)
- Variables include county name, state, average life expectancy (expectancy), median income (income)
- First 15 rows of life_exp data

| state | county | expectancy | income |
|---|---|---|---|
| Alabama | Autauga County | 76.060 | 37773 |
| Alabama | Baldwin County | 77.630 | 40121 |
| Alabama | Barbour County | 74.675 | 31443 |
| Alabama | Bibb County | 74.155 | 29075 |
| Alabama | Blount County | 75.880 | 31663 |
| Alabama | Bullock County | 71.790 | 25929 |
| Alabama | Butler County | 73.730 | 33518 |
| Alabama | Calhoun County | 73.300 | 33418 |
| Alabama | Chambers County | 73.245 | 31282 |
| Alabama | Cherokee County | 74.650 | 32645 |
| Alabama | Chilton County | 73.880 | 31380 |
| Alabama | Choctaw County | 75.050 | 31046 |
| Alabama | Clarke County | 74.820 | 31877 |
| Alabama | Clay County | 74.145 | 32965 |
| Alabama | Cleburne County | 74.145 | 31209 |
Life Expectancies in MA Counties
- For now, we will focus on life expectancy in Massachusetts (14 counties)
- Since the life_exp data include all US counties, we need to filter the data to retain just the Massachusetts counties
- You can filter a data set using Jamovi (or another spreadsheet program)
- The filtered life_exp data (all 14 rows)

| state | county | expectancy | income |
|---|---|---|---|
| Massachusetts | Barnstable County | 80.325 | 64730 |
| Massachusetts | Berkshire County | 79.780 | 50712 |
| Massachusetts | Bristol County | 78.975 | 48294 |
| Massachusetts | Dukes County | 80.995 | 78745 |
| Massachusetts | Essex County | 80.415 | 60320 |
| Massachusetts | Franklin County | 80.070 | 48428 |
| Massachusetts | Hampden County | 78.420 | 46216 |
| Massachusetts | Hampshire County | 80.040 | 46756 |
| Massachusetts | Middlesex County | 81.240 | 73265 |
| Massachusetts | Nantucket County | 80.325 | 107341 |
| Massachusetts | Norfolk County | 81.115 | 80711 |
| Massachusetts | Plymouth County | 79.145 | 59273 |
| Massachusetts | Suffolk County | 79.260 | 66074 |
| Massachusetts | Worcester County | 79.550 | 51370 |
- Let’s take a quick peek at a dotplot of life expectancies for the MA counties
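The slides do the filtering in Jamovi, but the same idea can be sketched in a few lines of Python. This is only an illustration: it assumes the data are held as a list of dictionaries with keys `state`, `county`, and `expectancy`, and the name `rows` and the three sample rows (copied from the tables above) are stand-ins for the full data set.

```python
# A tiny stand-in for the life_exp data: three rows copied from the tables above.
rows = [
    {"state": "Alabama", "county": "Autauga County", "expectancy": 76.060},
    {"state": "Massachusetts", "county": "Barnstable County", "expectancy": 80.325},
    {"state": "Massachusetts", "county": "Berkshire County", "expectancy": 79.780},
]

# Filtering keeps only the cases (rows) that satisfy a condition.
ma_rows = [row for row in rows if row["state"] == "Massachusetts"]

for row in ma_rows:
    print(row["county"], row["expectancy"])
```

Filtering changes which cases are kept, not which variables: every retained row still has all of its columns.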
Summaries of numerical data
Measures of center
Percentiles/quantiles
- quartiles
- other percentiles
Measures of spread
- interquartile range
- standard deviation
Mean
- If there are \(n\) cases in a sample then the sample mean of the numeric variable \(x\) is \[\bar{x}=\frac{x_1+x_2+\cdots+x_n}{n}\]
- The sample mean is a measure of the center of the distribution of the data
- The sample mean \(\bar{x}\) (a statistic) gives us a point estimate of the population mean \(\mu\) (a parameter)
The mean life expectancy in Massachusetts counties is 79.9 years
- Let’s see where the mean and median fall on the dotplot
- The mean is red and dashed. Note that it is pulled toward the thicker left tail of the distribution.
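As a check on the formula, the sample mean can be computed from the 14 Massachusetts values listed earlier. This sketch assumes the summaries in these slides were computed after rounding each life expectancy to the nearest whole year. That is an inference from the reported values, not something the slides state. The names `ma_expectancy` and `rounded` are illustrative.

```python
# Life expectancies for the 14 Massachusetts counties, from the table above.
ma_expectancy = [80.325, 79.780, 78.975, 80.995, 80.415, 80.070, 78.420,
                 80.040, 81.240, 80.325, 81.115, 79.145, 79.260, 79.550]

# Assumption: values are rounded to whole years before being summarized.
rounded = [round(x) for x in ma_expectancy]

# Sample mean: add up the values and divide by the number of cases n.
n = len(rounded)
x_bar = sum(rounded) / n
print(round(x_bar, 1))  # 79.9
```

On the unrounded values the same calculation gives about 79.98 years, so the rounding matters slightly at this level of precision.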
Skew and Symmetry
Group means
- We can also compare means between different groups in the data
- Let’s compare the mean of the life expectancy variable between counties in West Coast states (California, Oregon, Washington) and counties that are not in West Coast states
- Is life expectancy higher in counties in West Coast states?
| no | 77.12750 | 77.31 |
| yes | 78.90545 | 78.65 |
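The West Coast county data are not shown in these slides, so the comparison above cannot be reproduced here. The same split-then-summarize idea can still be sketched with a different grouping: the snippet below groups by state, using only the Alabama and Massachusetts rows shown earlier. The variable names are illustrative, and the Alabama figure covers just the 15 counties listed, not the whole state.

```python
from collections import defaultdict
from statistics import mean

alabama = [76.060, 77.630, 74.675, 74.155, 75.880, 71.790, 73.730, 73.300,
           73.245, 74.650, 73.880, 75.050, 74.820, 74.145, 74.145]
massachusetts = [80.325, 79.780, 78.975, 80.995, 80.415, 80.070, 78.420,
                 80.040, 81.240, 80.325, 81.115, 79.145, 79.260, 79.550]

# One (group label, value) pair per county, as in a two-column data set.
rows = ([("Alabama", x) for x in alabama]
        + [("Massachusetts", x) for x in massachusetts])

# Step 1: split the values by group.
groups = defaultdict(list)
for state, expectancy in rows:
    groups[state].append(expectancy)

# Step 2: compute the mean within each group.
group_means = {state: mean(values) for state, values in groups.items()}
for state, m in group_means.items():
    print(state, round(m, 2))
```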
Percentiles
- The Xth percentile is the value below which X% of the data fall
- The median is the 50th percentile
- For example, the 90th percentile of the life expectancy variable in the Massachusetts data is 81 years, meaning that 90% of the counties have an average life expectancy that is less than 81 years
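The 90th percentile quoted above can be reproduced with Python's `statistics.quantiles`. Two assumptions are made here: the Massachusetts values are rounded to whole years (inferred from the reported summaries), and percentiles are found by linear interpolation between sorted values (`method="inclusive"`). Software packages use slightly different interpolation rules, so small data sets can give slightly different answers.

```python
from statistics import quantiles

# Massachusetts county life expectancies, rounded to whole years (assumption).
rounded = [80, 80, 79, 81, 80, 80, 78, 80, 81, 80, 81, 79, 79, 80]

# n=10 returns the nine cut points: the 10th, 20th, ..., 90th percentiles.
deciles = quantiles(rounded, n=10, method="inclusive")
p90 = deciles[-1]
print(p90)  # 81.0
```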
Quartiles
- The first quartile (Q1) is the 25th percentile, the value below which 25% of the data fall
- The third quartile (Q3) is the 75th percentile, the value below which 75% of the data fall
- The median is sometimes described as the second quartile (Q2)
- Quartiles are often included in numerical summaries of a data set
- Let’s add quartiles to our summary of the Massachusetts county life expectancy data
Maximum and minimum values
- Maximum and minimum values in a data set are often included in numerical summaries as well
- Let’s add them to our summary of the Massachusetts county life expectancy data
| min | Q1 | median | Q3 | max | mean |
|---|---|---|---|---|---|
| 78 | 79.25 | 80 | 80 | 81 | 79.85714 |
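A summary like this one can be assembled in a few lines. The sketch below again assumes life expectancies rounded to whole years and linear interpolation for the quartiles (`method="inclusive"`); the dictionary name `summary` is illustrative.

```python
from statistics import mean, quantiles

# Massachusetts county life expectancies, rounded to whole years (assumption).
rounded = [80, 80, 79, 81, 80, 80, 78, 80, 81, 80, 81, 79, 79, 80]

# n=4 returns the three quartile cut points: Q1, the median (Q2), and Q3.
q1, med, q3 = quantiles(rounded, n=4, method="inclusive")

summary = {
    "min": min(rounded),
    "Q1": q1,
    "median": med,
    "Q3": q3,
    "max": max(rounded),
    "mean": mean(rounded),
}
for name, value in summary.items():
    print(name, round(value, 5))
```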
Range
- The simplest measure of spread/variability of a distribution of data is the range
- It is simply the difference between the largest and smallest values
- The range is \(81-78=3\) years for the Massachusetts county life expectancy data
Interquartile range
- The interquartile range (IQR) is the difference Q3-Q1
- The IQR will never be larger than the range!
- The IQR is \(80-79.25=0.75\) years for the Massachusetts county life expectancy data
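Both measures of spread follow directly from the summary values. This short sketch computes them under the same assumptions as before (values rounded to whole years, linear interpolation for the quartiles).

```python
from statistics import quantiles

# Massachusetts county life expectancies, rounded to whole years (assumption).
rounded = [80, 80, 79, 81, 80, 80, 78, 80, 81, 80, 81, 79, 79, 80]

# Range: largest value minus smallest value.
data_range = max(rounded) - min(rounded)

# IQR: third quartile minus first quartile (the spread of the middle 50%).
q1, _, q3 = quantiles(rounded, n=4, method="inclusive")
iqr = q3 - q1

print(data_range, iqr)  # 3 0.75
```

Because Q1 and Q3 always lie between the minimum and the maximum, the IQR can never exceed the range.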
Standard deviation
- The most commonly used measure of variability is the standard deviation
- The deviation of a single observation \(i\) is the difference between the observed value and the mean, \(x_i - \bar{x}\)
- The standard deviation describes the typical deviation of the data from the mean
- The sample variance is (roughly) the average squared deviation \[s^2=\frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_n-\bar{x})^2}{n-1}\]
- We divide by \(n-1\) rather than \(n\) (the sample size) to obtain an unbiased estimate of the population variance \(\sigma^2\). Otherwise \(s^2\) tends to underestimate \(\sigma^2\)
- The sample standard deviation is \[s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}}\]
- For many numeric variables, especially those with roughly bell-shaped distributions, the following rules of thumb apply:
- Roughly 68% of the data fall within 1 standard deviation of the mean
- Roughly 95% of the data fall within 2 standard deviations of the mean
- Roughly 99.7% of the data fall within 3 standard deviations of the mean
Illustrations of 68-95-99.7 rule. IMS2 Figure 5.7.
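For an exactly normal (bell-shaped) distribution, the three percentages in the rule can be computed rather than memorized. The sketch below uses `statistics.NormalDist` from Python's standard library; real data follow the rule only approximately, and the dictionary name `shares` is illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Share of the distribution within k standard deviations of the mean.
shares = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(k, round(100 * share, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```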
- The standard deviation for the Massachusetts county life expectancy data is 0.864 years
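The variance and standard deviation formulas can be followed step by step in code. The sketch assumes, as elsewhere, that the Massachusetts values were rounded to whole years before summarizing (an inference from the reported numbers), and compares the hand computation with the library function `statistics.stdev`.

```python
from math import sqrt
from statistics import stdev

# Massachusetts county life expectancies, rounded to whole years (assumption).
rounded = [80, 80, 79, 81, 80, 80, 78, 80, 81, 80, 81, 79, 79, 80]

n = len(rounded)
x_bar = sum(rounded) / n

# Squared deviation of each observation from the mean.
squared_devs = [(x - x_bar) ** 2 for x in rounded]

# Sample variance divides by n - 1, not n.
s2 = sum(squared_devs) / (n - 1)
s = sqrt(s2)

print(round(s, 3))               # 0.864
print(round(stdev(rounded), 3))  # same value from the library function
```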
Identifying outliers using IQR
- An outlier is an observation that is extreme relative to the rest of the data
- There is no universally accepted method for identifying outliers
- One common method that is also simple uses the IQR
- With this method, a point is considered an outlier if it is larger than Q3 or smaller than Q1 by more than \(1.5\times\)IQR.
- Let’s use this method to identify outliers in the life expectancy data for the whole country
- For the whole country, \(Q1=75.70\) years and \(Q3=78.77\) years
- Thus, \(IQR = 78.77-75.70=3.07\)
- Any county with a life expectancy less than \(75.70-1.5\times3.07=71.095\) years or greater than \(78.77+1.5\times3.07=83.375\) years is considered an outlier
- Using the \(1.5\times\)IQR rule, 12 counties are identified as outliers
- All have low life expectancies
| state | county | expectancy |
|---|---|---|
| Arkansas | Mississippi County | 70.915 |
| Arkansas | Phillips County | 70.900 |
| Mississippi | Coahoma County | 70.740 |
| Virginia | Petersburg City | 70.740 |
| Mississippi | Washington County | 70.595 |
| West Virginia | Mingo County | 70.590 |
| Mississippi | Sunflower County | 70.385 |
| Mississippi | Quitman County | 70.030 |
| Mississippi | Tunica County | 70.030 |
| Mississippi | Bolivar County | 69.675 |
| Kentucky | Perry County | 69.585 |
| West Virginia | Mcdowell County | 68.400 |
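The \(1.5\times\)IQR rule is easy to express as a small function. The sketch below uses the national quartiles quoted above; the names `lower_fence`, `upper_fence`, and `is_outlier` are illustrative and do not come from the slides.

```python
# Quartiles for all US counties, as given in the slides.
q1, q3 = 75.70, 78.77
iqr = q3 - q1

# Cutoffs ("fences") 1.5 IQRs below Q1 and 1.5 IQRs above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

def is_outlier(x):
    """Return True if x falls outside the 1.5 x IQR fences."""
    return x < lower_fence or x > upper_fence

print(round(lower_fence, 3), round(upper_fence, 3))
print(is_outlier(70.915))  # Mississippi County, Arkansas -> True
print(is_outlier(81.240))  # Middlesex County, Massachusetts -> False
```

Note that the upper fence is measured from Q3 and the lower fence from Q1, so the two cutoffs are not symmetric about the median unless the quartiles are.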
Cars
- data on all new car models (428) in a certain year
- 19 variables
- includes weight, highway mpg (hwy_mpg), msrp, whether a pickup or not (pickup)
- we will explore a variety of visualizations involving numerical variables
- Here are the first 8 rows
| Chevrolet Aveo 4dr | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 11690 | 10965 | 1.6 | 4 | 103 | 28 | 34 | 2370 | 98 | 167 | 66 |
| Chevrolet Aveo LS 4dr hatch | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 12585 | 11802 | 1.6 | 4 | 103 | 28 | 34 | 2348 | 98 | 153 | 66 |
| Chevrolet Cavalier 2dr | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 14610 | 13697 | 2.2 | 4 | 140 | 26 | 37 | 2617 | 104 | 183 | 69 |
| Chevrolet Cavalier 4dr | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 14810 | 13884 | 2.2 | 4 | 140 | 26 | 37 | 2676 | 104 | 183 | 68 |
| Chevrolet Cavalier LS 2dr | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 16385 | 15357 | 2.2 | 4 | 140 | 26 | 37 | 2617 | 104 | 183 | 69 |
| Dodge Neon SE 4dr | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 13670 | 12849 | 2.0 | 4 | 132 | 29 | 36 | 2581 | 105 | 174 | 67 |
| Dodge Neon SXT 4dr | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 15040 | 14086 | 2.0 | 4 | 132 | 29 | 36 | 2626 | 105 | 174 | 67 |
| Ford Focus ZX3 2dr hatch | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | 13270 | 12482 | 2.0 | 4 | 130 | 26 | 33 | 2612 | 103 | 168 | 67 |
Dotplot
- A dotplot represents each case with a dot
- Dots are stacked on top of each other at the appropriate location on the x-axis
- Dot plot of vehicle weights
Histogram
- In a histogram data are aggregated into bins on the x-axis
- The height of each bar is proportional to the number of cases in the bin
- Histogram of vehicle weights
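The binning behind a histogram can be sketched as a simple count. The slides do not label the columns of the table above, so this example assumes that the values 2370, 2348, 2617, ... in the eight rows shown are the `weight` variable; the bin width of 100 is an arbitrary illustrative choice.

```python
from collections import Counter

# Assumed weights of the 8 cars shown above (column identity is an assumption).
weights = [2370, 2348, 2617, 2676, 2617, 2581, 2626, 2612]

bin_width = 100  # illustrative choice; other widths give different pictures

# Map each weight to the left edge of its bin, then count cases per bin.
counts = Counter((w // bin_width) * bin_width for w in weights)

# A text histogram: one '#' per case in each bin.
for start in range(min(counts), max(counts) + bin_width, bin_width):
    print(f"{start}-{start + bin_width - 1}: {'#' * counts[start]}")
```

Changing `bin_width` changes how much detail the histogram shows, which is why it is worth trying more than one bin width on real data.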
Density plot
- In a density plot the shape of the distribution is represented using a smooth line (think of this as a smoothed out histogram)
- Density plot of vehicle weights
Box plot
- Box plot of vehicle weights. Are there any outliers?
- It is important to note that box plots can miss important characteristics of the distributions
- For example, if the distribution is bimodal, then the box plot won’t show it
Scatter plot - visualizing 2 numerical variables
- We can visualize two numeric variables using a scatter plot
- Typically, the x-axis is used for the explanatory variable and y-axis for the response variable
- A scatter plot of highway gas mileage vs. vehicle weight
Faceted histograms
- We can visualize two variables, where one is numeric and the other is categorical using faceted histograms
- We simply plot a separate histogram for each level of the categorical variable
- Faceted histogram showing the distributions of highway mileage for vehicles that are pickups and vehicles that are not pickups
Colored density plots
- We can use colored density plots for a similar purpose
- By coloring the density plots according to the levels of the categorical variable, we can plot them on the same axes and still distinguish between the distributions
- Colored density plots to visualize the distributions of highway mileage for vehicles that are pickups and vehicles that are not pickups
- We can create a new variable by taking the (natural) log of the msrp variable
- The transformed variable has a more symmetric, bell-shaped distribution
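The transformation itself is a one-line calculation. The slides do not label the table columns, so this sketch assumes that the first price-like value in each of the eight rows shown (11690, 12585, ...) is `msrp`; the name `log_msrp` is illustrative.

```python
import math

# Assumed msrp values for the 8 cars shown earlier (column identity is an assumption).
msrp = [11690, 12585, 14610, 14810, 16385, 13670, 15040, 13270]

# math.log with one argument is the natural logarithm.
log_msrp = [math.log(price) for price in msrp]

for price, log_price in zip(msrp, log_msrp):
    print(price, round(log_price, 3))
```

The log compresses large values more than small ones, which is why it tends to pull in a long right tail while preserving the ordering of the cases.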
Intensity Maps
- Sometimes it is useful to use colors to show higher or lower values of variables
- Using various colors or a continuous gradient, we can visualize the distribution of a variable
- Intensity maps are very helpful for seeing geographical trends