Exploring Data
Topic 2
Math 115
Inferential statistics
- It is usually impractical to make observations on every individual in a population (this type of study is called a census)
- Instead select a sample from the population and use observations made on the sample to make inferences about the population
Example: The Hope College Biology Department would like to know the proportion of hemlock trees in the Hope College Nature Preserve (HCNP) that are infested by hemlock woolly adelgid (HWA)
- Census: researchers inspect every hemlock in the preserve (the population) for HWA infestation and calculate the proportion
- Survey: researchers randomly select 100 hemlock trees in the preserve (a sample) inspect them and calculate the proportion
Parameter vs. statistic
- A statistic is a numerical value or summary measure that is calculated from a sample
- A parameter is the corresponding value in the population
- The value of statistic give us an estimate of the value of the corresponding parameter
Identify the parameter and the statistic.
- The proportion of trees infested with HWA in a sample of 100 hemlocks from the HCNP
- The proportion of trees infested with HWA in the HCNP
Anecdotal Evidence
- Anecdotal evidence refers to personal stories, experiences, or individual observations
- Example:A man on the news got mercury poisoning from eating swordfish, so the average mercury concentration in swordfish must be dangerously high.
- Inferences should not be made using anecdotal evidence or data that are collected in a haphazard manner
- Such information may not be representative of the population
- Instead, inferences should be based on carefully designed studies
Identify anecdotal evidence.
- Your friend, who recently visited the HCNP, stated that they noticed that about half the hemlocks seemed to be infested with HWA
- Students from an introductory Biology lab use a map to select 50 hemlocks in the HCNP. They visit the trees and record whether each one has signs of infestation with HWA
EDA for Categorical Variables
15,128 comic characters from DC and Marvel comics
11 variables, including
- name
- identity (id) gives information about personal identity (e.g., identity is kept secret)
- alignment (align) gives information about whether character is good, bad, etc
The comics data set is saved in a CSV file
Let’s download the file and open it with Jamovi
- First 10 rows of the comics data
| Spider-Man (Peter Parker) |
Secret |
Good |
Hazel Eyes |
Brown Hair |
Male |
NA |
Living Characters |
4043 |
Aug-62 |
marvel |
| Captain America (Steven Rogers) |
Public |
Good |
Blue Eyes |
White Hair |
Male |
NA |
Living Characters |
3360 |
Mar-41 |
marvel |
| Wolverine (James \"Logan\" Howlett) |
Public |
Neutral |
Blue Eyes |
Black Hair |
Male |
NA |
Living Characters |
3061 |
Oct-74 |
marvel |
| Iron Man (Anthony \"Tony\" Stark) |
Public |
Good |
Blue Eyes |
Black Hair |
Male |
NA |
Living Characters |
2961 |
Mar-63 |
marvel |
| Thor (Thor Odinson) |
No Dual |
Good |
Blue Eyes |
Blond Hair |
Male |
NA |
Living Characters |
2258 |
Nov-50 |
marvel |
| Benjamin Grimm (Earth-616) |
Public |
Good |
Blue Eyes |
No Hair |
Male |
NA |
Living Characters |
2255 |
Nov-61 |
marvel |
| Reed Richards (Earth-616) |
Public |
Good |
Brown Eyes |
Brown Hair |
Male |
NA |
Living Characters |
2072 |
Nov-61 |
marvel |
| Hulk (Robert Bruce Banner) |
Public |
Good |
Brown Eyes |
Brown Hair |
Male |
NA |
Living Characters |
2017 |
May-62 |
marvel |
| Scott Summers (Earth-616) |
Public |
Neutral |
Brown Eyes |
Brown Hair |
Male |
NA |
Living Characters |
1955 |
Sep-63 |
marvel |
| Jonathan Storm (Earth-616) |
Public |
Good |
Blue Eyes |
Blond Hair |
Male |
NA |
Living Characters |
1934 |
Nov-61 |
marvel |
Describing categorical data
- We can summarize a single categorical variable using a frequency table
- Counts the number of observations for each level of the variable
| No Dual |
1,394 |
| Public |
5,656 |
| Secret |
7,281 |
| Unknown |
9 |
| Total |
15,128 |
Proportion Calculations
Example 1: What proportion of comic characters have a secret identity?
\(Proportion=\frac{Count}{Total}=\frac{7,691}{15,128}=0.508\)
Example 2: What percentage of comic characters have a secret identity?
\(Percentage=Proportion\times 100=0.508×100=50.8\%\)
- Here are the resulting proportions
| No Dual |
0.097 |
| Public |
0.394 |
| Secret |
0.508 |
| Unknown |
0.001 |
Visualizing categorical data
- We can use a bar plot to visualize categorical data
![]()
Bar plot showing frequencies of levels of identity variable
Summarizing two categorical variables
- A contingency table is a table that can be used to summarize two categorical variables
- Each value is a count of the number of times a variable outcome combination occurs
- Usually includes row and column totals as well (marginal totals)
Contingency table for identity and alignment
| Bad |
435 |
2,031 |
4,119 |
7 |
6,592 |
| Good |
598 |
2,726 |
2,296 |
0 |
5,620 |
| Neutral |
361 |
898 |
865 |
2 |
2,126 |
| Reformed Criminals |
0 |
1 |
1 |
0 |
2 |
| Total |
1,394 |
5,656 |
7,281 |
9 |
14,340 |
- It is also useful to create contingency tables with proportions
- The simplest version is obtained by dividing each count by the grand total
- In this case values in table sum to 1
Proportion of outcomes for each combination of allignment and identity
| Bad |
0.0303 |
0.1416 |
0.2872 |
0.0005 |
| Good |
0.0417 |
0.1901 |
0.1601 |
0.0000 |
| Neutral |
0.0252 |
0.0626 |
0.0603 |
0.0001 |
| Reformed Criminals |
0.0000 |
0.0001 |
0.0001 |
0.0000 |
- What does the value 0.0299 mean?
Conditional proportions
- We can also create tables of conditional proportions than can be helpful to explore associations between the variables
- We need to decide whether the proportions should be conditioned on rows (divide counts by row totals) or columns (divide counts by colum totals)
- If conditioned on rows, proportions sum to 1 along rows
- If conditioned on columns, proportions sum to 1 along columns
- These proportions are conditioned on rows (alignment)
- Allows us to compare proportions of identity types between different alignment groups
- For example, we can see that about 63% of bad characters have secret identities whereas only about 41% of good characters have secret identities.
| Bad |
0.0660 |
0.3081 |
0.6248 |
0.0011 |
| Good |
0.1064 |
0.4851 |
0.4085 |
0.0000 |
| Neutral |
0.1698 |
0.4224 |
0.4069 |
0.0009 |
| Reformed Criminals |
0.0000 |
0.5000 |
0.5000 |
0.0000 |
- These proportions are conditioned on columns (identity)
- Allows us to compare proportions of alignment types between different identity groups
- For example, we can see that about 57% characters with secret identities are bad, whereas only about 32% of characters with secret identities are good.
| Bad |
0.3121 |
0.3591 |
0.5657 |
0.7778 |
| Good |
0.4290 |
0.4820 |
0.3153 |
0.0000 |
| Neutral |
0.2590 |
0.1588 |
0.1188 |
0.2222 |
| Reformed Criminals |
0.0000 |
0.0002 |
0.0001 |
0.0000 |
Visualizing two categorical variables
- There are different ways to visualize two categorical variables using bar plots
- We can create stacked bar plot
- Colors show how composition varies within each group
![]()
Stacked bar plot showing alignment frequencies for different id levels
- We can also visualize the data using side-by-side (dodged) bar plots
![]()
Dodged bar plot showing alignment frequencies for different id levels
- Another alternative is to use faceted bar plots
- Facet according to one of the variables
- A facet (subplot) is created for each level of that variable
![]()
Faceted bar plot showing alignment frequencies for different id levels
- A fourth type of bar plot we can use to visualize two categorical variables is a standardized (filled) bar plot
- This shows conditional proportions (instead of counts) in a stacked format
- The following proportions are conditioned on
id
![]()
Standardized bar plot showing alignment proportions for different id levels
- We can take a different perspective by exchanging the roles of the variables
- The following proportions are conditioned on
align
![]()
Standardized bar plot showing id proportions for different alignment levels
EDA for Quantitative Variables
- Every county in the US (3,142 counties)
- Variables include county name, state, average life expectancy (expectancy), median income (income)
- First 15 rows of
life_exp data
| Alabama |
Autauga County |
76.060 |
37773 |
| Alabama |
Baldwin County |
77.630 |
40121 |
| Alabama |
Barbour County |
74.675 |
31443 |
| Alabama |
Bibb County |
74.155 |
29075 |
| Alabama |
Blount County |
75.880 |
31663 |
| Alabama |
Bullock County |
71.790 |
25929 |
| Alabama |
Butler County |
73.730 |
33518 |
| Alabama |
Calhoun County |
73.300 |
33418 |
| Alabama |
Chambers County |
73.245 |
31282 |
| Alabama |
Cherokee County |
74.650 |
32645 |
| Alabama |
Chilton County |
73.880 |
31380 |
| Alabama |
Choctaw County |
75.050 |
31046 |
| Alabama |
Clarke County |
74.820 |
31877 |
| Alabama |
Clay County |
74.145 |
32965 |
| Alabama |
Cleburne County |
74.145 |
31209 |
Life Expectancies in MA Counties
- For now, we will focus on life expectancy in Massachusetts (14 counties)
- Since the
life_exp data include all US counties, we need to filter the data to retain just the Massachusetts counties
- You can filter a data set using Jamovi (or another spreadsheet program)
- Here are 14 counties of Massachusetts in the filtered
life_exp data
| Massachusetts |
Barnstable County |
80.325 |
64730 |
| Massachusetts |
Berkshire County |
79.780 |
50712 |
| Massachusetts |
Bristol County |
78.975 |
48294 |
| Massachusetts |
Dukes County |
80.995 |
78745 |
| Massachusetts |
Essex County |
80.415 |
60320 |
| Massachusetts |
Franklin County |
80.070 |
48428 |
| Massachusetts |
Hampden County |
78.420 |
46216 |
| Massachusetts |
Hampshire County |
80.040 |
46756 |
| Massachusetts |
Middlesex County |
81.240 |
73265 |
| Massachusetts |
Nantucket County |
80.325 |
107341 |
| Massachusetts |
Norfolk County |
81.115 |
80711 |
| Massachusetts |
Plymouth County |
79.145 |
59273 |
| Massachusetts |
Suffolk County |
79.260 |
66074 |
| Massachusetts |
Worcester County |
79.550 |
51370 |
- Let’s take a quick peak at a dotplot of life expectancies for the MA counties (we rounded values to the nearest integer)
Summaries of numerical data
Measures of center
Percentiles/quantiles
- quartiles
- other percentiles
Measures of spread
- interquartile range
- standard deviation
Mean
- If there are \(n\) cases in a sample then the sample mean of the numeric variable \(x\) is \[\bar{x}=\frac{x_1+x_2+\cdots+x_n}{n}\]
- The sample mean is a measure of the center of the distribution of the data
- The sample mean \(\bar{x}\) (a statistic) gives us a point estimate of the population mean \(\mu\) (a parameter)
The mean life expectancy in Massachusetts counties is 79.9 years
- Let’s see where the mean and median fall on the dotplot
- The mean is red and dashed. Note that it is pulled toward the thicker left tail of the distribution.
Skew and Symmetry
Group means
- We can also compare means between different groups in the data
- Let’s compare the mean of the life expectancy variable between counties in West Coast states (California, Oregon, Washington) and counties that are not in West Coast states
- Is life expectancy higher in counties in west coast states?
| no |
77.12750 |
77.31 |
| yes |
78.90545 |
78.65 |
Percentiles
- The Xth percentile is the value below which X% of the data fall
- The median is the 50th percentile
- For example, the 90th percentile of the life expectancy variable in the Massachusetts data is 81 years, meaning that 90% of the counties have an average life expectancy that is less than 81 years
Quartiles
- The first quartile (Q1) is the 25th percentile, the value below which 25% of the data fall
- The third quartile (Q3) is the 75th percentile, the value below which 75% of the data fall
- The median is sometimes described as the second quartile (Q2)
- Quartiles are often included in numerical summaries of a data set
- Let’s add quartiles to our summary of the Massachusetts county life expectancy data
Five Number Summary
- Maximum and minimum values in a data set are often included numerical summaries as well
- Let’s add them to our summary of the Massachusetts county life expectancy data
| 78 |
79.25 |
80 |
80 |
81 |
79.85714 |
Boxplot
- A box plot is a visual represenation of the five number summary
- Boxplots are constructed using summary statistics:
- The box extends from Q1 to Q3 with a vertical line at the median (Q2)
- Whiskers extend from the box to the smallest and largest values that are not outliers
- Outliers are plotted as individual points
Range
- The simplest measure of spread/variability of a distribution of data is the range
- It is simply the difference between the largest and smallest values
- The range is \(81-78=3\) years for the Massachusetts county life expectancy data
Interquartile range
- The interquartile range (IQR) is the difference Q3-Q1
- The IQR will never be larger than the range!
- Th IQR is \(80-79.25=0.75\) years for the Massachusetts county life expectancy data
Standard deviation
- The most commonly used measure of variability is the standard deviation
- The deviation of a single observation \(i\) is the difference between the observed value and the mean, \(x_i - \bar{x}\)
- The standard deviation describes the typical deviation of the data from the mean
- The sample variance is the average squared deviance \[s^2=\frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 \cdots (x_n-\bar{x})^2}{n-1}\]
- We divide by \(n-1\) rather than \(n\) (the sample size) to obtained an unbiased estimate of the population variance \(\sigma^2\). Otherwise \(s^2\) tends to underestimate \(\sigma^2\)
- The sample standard deviation is \[s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}}\]
- For many numeric variables the following rules of thumb apply:
- Roughly 68% of the data fall within 1 standard deviation of the mean
- Roughly 95% of the data fall within 2 standard deviations of the mean
- Roughly 99.7% of the data fall within 3 standard deviations of the mean
![]()
Illustrations of 68-95-99.7 rule. IMS2 Figure 5.7.
- The standard deviation for the Massachusetts county life expectancy data is 0.864 years
Identifying outliers using IQR
- An outlier is an observation that is extreme relative to the rest of the data
- There is no universally accepted method for identifying outliers
- One common method that is also simple uses the IQR
- With this method, a point is considered an outlier if it is larger than Q3 or smaller than Q1 by more than \(1.5\times\)IQR.
- Let’s use this method to identify outliers in the life expectancy data for the whole country
- For the whole country, \(Q1=75.70\) years and \(Q3=78.77\) years
- Thus, \(IQR = 78.77-75.70=3.07\)
- Any county with a life expectancy less than \(75.70-1.5\times3.07=71.09\) years or greater than \(78.77+1.5\times3.07=80.31\) years is considered an outlier
- Using the \(1.5\times\)IQR rule, 12 counties identified as outliers
- All have low life expectancies
| Arkansas |
Mississippi County |
70.915 |
| Arkansas |
Phillips County |
70.900 |
| Mississippi |
Coahoma County |
70.740 |
| Virginia |
Petersburg City |
70.740 |
| Mississippi |
Washington County |
70.595 |
| West Virginia |
Mingo County |
70.590 |
| Mississippi |
Sunflower County |
70.385 |
| Mississippi |
Quitman County |
70.030 |
| Mississippi |
Tunica County |
70.030 |
| Mississippi |
Bolivar County |
69.675 |
| Kentucky |
Perry County |
69.585 |
| West Virginia |
Mcdowell County |
68.400 |
Cars
- data on all new car models (428) in a certain year
- 19 variables
- includes weight, highway mpg (hwy_mpg), msrp, whether a pickup or not (pickup)
- we will explore a variety of visualizations involving numerical variables
- Here are the first 8 rows
| Chevrolet Aveo 4dr |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
11690 |
10965 |
1.6 |
4 |
103 |
28 |
34 |
2370 |
98 |
167 |
66 |
| Chevrolet Aveo LS 4dr hatch |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
12585 |
11802 |
1.6 |
4 |
103 |
28 |
34 |
2348 |
98 |
153 |
66 |
| Chevrolet Cavalier 2dr |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
14610 |
13697 |
2.2 |
4 |
140 |
26 |
37 |
2617 |
104 |
183 |
69 |
| Chevrolet Cavalier 4dr |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
14810 |
13884 |
2.2 |
4 |
140 |
26 |
37 |
2676 |
104 |
183 |
68 |
| Chevrolet Cavalier LS 2dr |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
16385 |
15357 |
2.2 |
4 |
140 |
26 |
37 |
2617 |
104 |
183 |
69 |
| Dodge Neon SE 4dr |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
13670 |
12849 |
2.0 |
4 |
132 |
29 |
36 |
2581 |
105 |
174 |
67 |
| Dodge Neon SXT 4dr |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
15040 |
14086 |
2.0 |
4 |
132 |
29 |
36 |
2626 |
105 |
174 |
67 |
| Ford Focus ZX3 2dr hatch |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
13270 |
12482 |
2.0 |
4 |
130 |
26 |
33 |
2612 |
103 |
168 |
67 |
Dotplot
- A dotplot represents each case with a dot
- Dots are stacked on top of each other at the appropriate location on the x-axis
- Dot plot of vehicle weights
Histogram
- In a histogram data are aggregated into bins on the x-axis
- The height of each bar is proportional to the number of cases in the bin
- Histogram of vehicle weights
Density plot
- In a density plot the shape of the distribution is represented using a smooth line (think of this as a smoothed out histogram)
- Density plot of vehicle weights
Box plot
- Box plot of vehicle weights. Are there any outliers?
- It is important to note that box plots can miss important characteristics of the distributions
- For example, if the distribution is bimodal, then the box plot won’t show it
Scatter plot - visualizing 2 numerical variables
- We can visualize two numeric variables using a scatter plot
- Typically, the x-axis is used for the explanatory variable and y-axis for the response variable
- A scatter plot of highway gas mileage vs. vehicle weight
Faceted histograms
- We can visualize two variables, where one is numeric and the other is categorical using faceted histograms
- We simply plot a separate histogram for each level of the categorical variable
- Faceted histogram showing the distributions of highway mileage for vehicles that are pickups and vehicles that are not pickups
Colored density plots
- We can use colored density plots for a similar purpose
- By coloring the density plots according to the levels of the categorical variable, we can plot them on the same axes and still distinguish between the distributions
- Colored density plots to visualize the distributions of highway mileage for vehicles that are pickups and vehicles that are not pickups
Intensity Maps
- Sometimes it is useful to use colors to show higher or lower values of variables
- Using various colors or a continuous gradient we can visualize distribution of a variable
- Intensity maps are very helpful for seeing geographical trends