Exploring Data

Topic 2
Math 115

Inferential statistics

  • It is usually impractical to make observations on every individual in a population (this type of study is called a census)
  • Instead select a sample from the population and use observations made on the sample to make inferences about the population

Example: The Hope College Biology Department would like to know the proportion of hemlock trees in the Hope College Nature Preserve (HCNP) that are infested by hemlock woolly adelgid (HWA)

  • Census: researchers inspect every hemlock in the preserve (the population) for HWA infestation and calculate the proportion
  • Survey: researchers randomly select 100 hemlock trees in the preserve (a sample) inspect them and calculate the proportion

Parameter vs. statistic

  • A statistic is a numerical value or summary measure that is calculated from a sample
  • A parameter is the corresponding value in the population
  • The value of statistic give us an estimate of the value of the corresponding parameter

Identify the parameter and the statistic.

  • The proportion of trees infested with HWA in a sample of 100 hemlocks from the HCNP
  • The proportion of trees infested with HWA in the HCNP

Anecdotal Evidence

  • Anecdotal evidence refers to personal stories, experiences, or individual observations
  • Example:A man on the news got mercury poisoning from eating swordfish, so the average mercury concentration in swordfish must be dangerously high.
  • Inferences should not be made using anecdotal evidence or data that are collected in a haphazard manner
  • Such information may not be representative of the population
  • Instead, inferences should be based on carefully designed studies

Identify anecdotal evidence.

  • Your friend, who recently visited the HCNP, stated that they noticed that about half the hemlocks seemed to be infested with HWA
  • Students from an introductory Biology lab use a map to select 50 hemlocks in the HCNP. They visit the trees and record whether each one has signs of infestation with HWA

EDA for Categorical Variables

  • 15,128 comic characters from DC and Marvel comics

  • 11 variables, including

    • name
    • identity (id) gives information about personal identity (e.g., identity is kept secret)
    • alignment (align) gives information about whether character is good, bad, etc
  • The comics data set is saved in a CSV file

  • Let’s download the file and open it with Jamovi

  • First 10 rows of the comics data
name id align eye hair gender gsm alive appearances first_appear publisher
Spider-Man (Peter Parker) Secret Good Hazel Eyes Brown Hair Male NA Living Characters 4043 Aug-62 marvel
Captain America (Steven Rogers) Public Good Blue Eyes White Hair Male NA Living Characters 3360 Mar-41 marvel
Wolverine (James \"Logan\" Howlett) Public Neutral Blue Eyes Black Hair Male NA Living Characters 3061 Oct-74 marvel
Iron Man (Anthony \"Tony\" Stark) Public Good Blue Eyes Black Hair Male NA Living Characters 2961 Mar-63 marvel
Thor (Thor Odinson) No Dual Good Blue Eyes Blond Hair Male NA Living Characters 2258 Nov-50 marvel
Benjamin Grimm (Earth-616) Public Good Blue Eyes No Hair Male NA Living Characters 2255 Nov-61 marvel
Reed Richards (Earth-616) Public Good Brown Eyes Brown Hair Male NA Living Characters 2072 Nov-61 marvel
Hulk (Robert Bruce Banner) Public Good Brown Eyes Brown Hair Male NA Living Characters 2017 May-62 marvel
Scott Summers (Earth-616) Public Neutral Brown Eyes Brown Hair Male NA Living Characters 1955 Sep-63 marvel
Jonathan Storm (Earth-616) Public Good Blue Eyes Blond Hair Male NA Living Characters 1934 Nov-61 marvel

Describing categorical data

  • We can summarize a single categorical variable using a frequency table
  • Counts the number of observations for each level of the variable
identity count
No Dual 1,394
Public 5,656
Secret 7,281
Unknown 9
Total 15,128

Proportion Calculations

Example 1: What proportion of comic characters have a secret identity?

\(Proportion=\frac{Count}{Total}=\frac{7,691}{15,128}=0.508\)

Example 2: What percentage of comic characters have a secret identity?

\(Percentage=Proportion\times 100=0.508×100=50.8\%\)

  • Here are the resulting proportions
identity proportion
No Dual 0.097
Public 0.394
Secret 0.508
Unknown 0.001

Visualizing categorical data

  • We can use a bar plot to visualize categorical data

Bar plot showing frequencies of levels of identity variable

Summarizing two categorical variables

  • A contingency table is a table that can be used to summarize two categorical variables
  • Each value is a count of the number of times a variable outcome combination occurs
  • Usually includes row and column totals as well (marginal totals)
Contingency table for identity and alignment
align No Dual Public Secret Unknown Total
Bad 435 2,031 4,119 7 6,592
Good 598 2,726 2,296 0 5,620
Neutral 361 898 865 2 2,126
Reformed Criminals 0 1 1 0 2
Total 1,394 5,656 7,281 9 14,340
  • It is also useful to create contingency tables with proportions
  • The simplest version is obtained by dividing each count by the grand total
  • In this case values in table sum to 1
Proportion of outcomes for each combination of allignment and identity
align No Dual Public Secret Unknown
Bad 0.0303 0.1416 0.2872 0.0005
Good 0.0417 0.1901 0.1601 0.0000
Neutral 0.0252 0.0626 0.0603 0.0001
Reformed Criminals 0.0000 0.0001 0.0001 0.0000
  • What does the value 0.0299 mean?

Conditional proportions

  • We can also create tables of conditional proportions than can be helpful to explore associations between the variables
  • We need to decide whether the proportions should be conditioned on rows (divide counts by row totals) or columns (divide counts by colum totals)
  • If conditioned on rows, proportions sum to 1 along rows
  • If conditioned on columns, proportions sum to 1 along columns
  • These proportions are conditioned on rows (alignment)
  • Allows us to compare proportions of identity types between different alignment groups
  • For example, we can see that about 63% of bad characters have secret identities whereas only about 41% of good characters have secret identities.
align No Dual Public Secret Unknown
Bad 0.0660 0.3081 0.6248 0.0011
Good 0.1064 0.4851 0.4085 0.0000
Neutral 0.1698 0.4224 0.4069 0.0009
Reformed Criminals 0.0000 0.5000 0.5000 0.0000
  • These proportions are conditioned on columns (identity)
  • Allows us to compare proportions of alignment types between different identity groups
  • For example, we can see that about 57% characters with secret identities are bad, whereas only about 32% of characters with secret identities are good.
align No Dual Public Secret Unknown
Bad 0.3121 0.3591 0.5657 0.7778
Good 0.4290 0.4820 0.3153 0.0000
Neutral 0.2590 0.1588 0.1188 0.2222
Reformed Criminals 0.0000 0.0002 0.0001 0.0000

Visualizing two categorical variables

  • There are different ways to visualize two categorical variables using bar plots
  • We can create stacked bar plot
  • Colors show how composition varies within each group

Stacked bar plot showing alignment frequencies for different id levels

  • We can also visualize the data using side-by-side (dodged) bar plots

Dodged bar plot showing alignment frequencies for different id levels

  • Another alternative is to use faceted bar plots
  • Facet according to one of the variables
  • A facet (subplot) is created for each level of that variable

Faceted bar plot showing alignment frequencies for different id levels

  • A fourth type of bar plot we can use to visualize two categorical variables is a standardized (filled) bar plot
  • This shows conditional proportions (instead of counts) in a stacked format
  • The following proportions are conditioned on id

Standardized bar plot showing alignment proportions for different id levels

  • We can take a different perspective by exchanging the roles of the variables
  • The following proportions are conditioned on align

Standardized bar plot showing id proportions for different alignment levels

EDA for Quantitative Variables

  • Every county in the US (3,142 counties)
  • Variables include county name, state, average life expectancy (expectancy), median income (income)
  • First 15 rows of life_exp data
state county expectancy income
Alabama Autauga County 76.060 37773
Alabama Baldwin County 77.630 40121
Alabama Barbour County 74.675 31443
Alabama Bibb County 74.155 29075
Alabama Blount County 75.880 31663
Alabama Bullock County 71.790 25929
Alabama Butler County 73.730 33518
Alabama Calhoun County 73.300 33418
Alabama Chambers County 73.245 31282
Alabama Cherokee County 74.650 32645
Alabama Chilton County 73.880 31380
Alabama Choctaw County 75.050 31046
Alabama Clarke County 74.820 31877
Alabama Clay County 74.145 32965
Alabama Cleburne County 74.145 31209

Life Expectancies in MA Counties

  • For now, we will focus on life expectancy in Massachusetts (14 counties)
  • Since the life_exp data include all US counties, we need to filter the data to retain just the Massachusetts counties
  • You can filter a data set using Jamovi (or another spreadsheet program)
  • Here are 14 counties of Massachusetts in the filtered life_exp data
state county expectancy income
Massachusetts Barnstable County 80.325 64730
Massachusetts Berkshire County 79.780 50712
Massachusetts Bristol County 78.975 48294
Massachusetts Dukes County 80.995 78745
Massachusetts Essex County 80.415 60320
Massachusetts Franklin County 80.070 48428
Massachusetts Hampden County 78.420 46216
Massachusetts Hampshire County 80.040 46756
Massachusetts Middlesex County 81.240 73265
Massachusetts Nantucket County 80.325 107341
Massachusetts Norfolk County 81.115 80711
Massachusetts Plymouth County 79.145 59273
Massachusetts Suffolk County 79.260 66074
Massachusetts Worcester County 79.550 51370
  • Let’s take a quick peak at a dotplot of life expectancies for the MA counties (we rounded values to the nearest integer)

Summaries of numerical data

  • Measures of center

    • mean
    • median
  • Percentiles/quantiles

    • quartiles
    • other percentiles
  • Measures of spread

    • interquartile range
    • standard deviation

Mean

  • If there are \(n\) cases in a sample then the sample mean of the numeric variable \(x\) is \[\bar{x}=\frac{x_1+x_2+\cdots+x_n}{n}\]
  • The sample mean is a measure of the center of the distribution of the data
  • The sample mean \(\bar{x}\) (a statistic) gives us a point estimate of the population mean \(\mu\) (a parameter)

The mean life expectancy in Massachusetts counties is 79.9 years

mean
79.85714

Median

  • The median is the value that splits the data in half
  • 50% of the data fall below the median
  • We can also compute the median of the life expectancy data
mean median
79.85714 80
  • Let’s see where the mean and median fall on the dotplot
  • The mean is red and dashed. Note that it is pulled toward the thicker left tail of the distribution.

Skew and Symmetry

Group means

  • We can also compare means between different groups in the data
  • Let’s compare the mean of the life expectancy variable between counties in West Coast states (California, Oregon, Washington) and counties that are not in West Coast states
  • Is life expectancy higher in counties in west coast states?
west_coast mean median
no 77.12750 77.31
yes 78.90545 78.65

Percentiles

  • The Xth percentile is the value below which X% of the data fall
  • The median is the 50th percentile
  • For example, the 90th percentile of the life expectancy variable in the Massachusetts data is 81 years, meaning that 90% of the counties have an average life expectancy that is less than 81 years

Quartiles

  • The first quartile (Q1) is the 25th percentile, the value below which 25% of the data fall
  • The third quartile (Q3) is the 75th percentile, the value below which 75% of the data fall
  • The median is sometimes described as the second quartile (Q2)
  • Quartiles are often included in numerical summaries of a data set
  • Let’s add quartiles to our summary of the Massachusetts county life expectancy data
Q1 median Q3 mean
79.25 80 80 79.85714

Five Number Summary

  • Maximum and minimum values in a data set are often included numerical summaries as well
  • Let’s add them to our summary of the Massachusetts county life expectancy data
min Q1 median Q3 max mean
78 79.25 80 80 81 79.85714

Boxplot

  • A box plot is a visual represenation of the five number summary
  • Boxplots are constructed using summary statistics:
    • The box extends from Q1 to Q3 with a vertical line at the median (Q2)
    • Whiskers extend from the box to the smallest and largest values that are not outliers
    • Outliers are plotted as individual points

Range

  • The simplest measure of spread/variability of a distribution of data is the range
  • It is simply the difference between the largest and smallest values
  • The range is \(81-78=3\) years for the Massachusetts county life expectancy data
range
3

Interquartile range

  • The interquartile range (IQR) is the difference Q3-Q1
  • The IQR will never be larger than the range!
  • Th IQR is \(80-79.25=0.75\) years for the Massachusetts county life expectancy data
iqr range
0.75 3

Standard deviation

  • The most commonly used measure of variability is the standard deviation
  • The deviation of a single observation \(i\) is the difference between the observed value and the mean, \(x_i - \bar{x}\)
  • The standard deviation describes the typical deviation of the data from the mean
  • The sample variance is the average squared deviance \[s^2=\frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 \cdots (x_n-\bar{x})^2}{n-1}\]
  • We divide by \(n-1\) rather than \(n\) (the sample size) to obtained an unbiased estimate of the population variance \(\sigma^2\). Otherwise \(s^2\) tends to underestimate \(\sigma^2\)
  • The sample standard deviation is \[s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}}\]

- For many numeric variables the following rules of thumb apply:

  • Roughly 68% of the data fall within 1 standard deviation of the mean
  • Roughly 95% of the data fall within 2 standard deviations of the mean
  • Roughly 99.7% of the data fall within 3 standard deviations of the mean

Illustrations of 68-95-99.7 rule. IMS2 Figure 5.7.

  • The standard deviation for the Massachusetts county life expectancy data is 0.864 years
sd iqr range
0.8644378 0.75 3

Identifying outliers using IQR

  • An outlier is an observation that is extreme relative to the rest of the data
  • There is no universally accepted method for identifying outliers
  • One common method that is also simple uses the IQR
  • With this method, a point is considered an outlier if it is larger than Q3 or smaller than Q1 by more than \(1.5\times\)IQR.
  • Let’s use this method to identify outliers in the life expectancy data for the whole country
  • For the whole country, \(Q1=75.70\) years and \(Q3=78.77\) years
  • Thus, \(IQR = 78.77-75.70=3.07\)
  • Any county with a life expectancy less than \(75.70-1.5\times3.07=71.09\) years or greater than \(78.77+1.5\times3.07=80.31\) years is considered an outlier
  • Using the \(1.5\times\)IQR rule, 12 counties identified as outliers
  • All have low life expectancies
state county expectancy
Arkansas Mississippi County 70.915
Arkansas Phillips County 70.900
Mississippi Coahoma County 70.740
Virginia Petersburg City 70.740
Mississippi Washington County 70.595
West Virginia Mingo County 70.590
Mississippi Sunflower County 70.385
Mississippi Quitman County 70.030
Mississippi Tunica County 70.030
Mississippi Bolivar County 69.675
Kentucky Perry County 69.585
West Virginia Mcdowell County 68.400

Cars

  • data on all new car models (428) in a certain year
  • 19 variables
  • includes weight, highway mpg (hwy_mpg), msrp, whether a pickup or not (pickup)
  • we will explore a variety of visualizations involving numerical variables
  • Here are the first 8 rows
name sports_car suv wagon minivan pickup all_wheel rear_wheel msrp dealer_cost eng_size ncyl horsepwr city_mpg hwy_mpg weight wheel_base length width
Chevrolet Aveo 4dr FALSE FALSE FALSE FALSE FALSE FALSE FALSE 11690 10965 1.6 4 103 28 34 2370 98 167 66
Chevrolet Aveo LS 4dr hatch FALSE FALSE FALSE FALSE FALSE FALSE FALSE 12585 11802 1.6 4 103 28 34 2348 98 153 66
Chevrolet Cavalier 2dr FALSE FALSE FALSE FALSE FALSE FALSE FALSE 14610 13697 2.2 4 140 26 37 2617 104 183 69
Chevrolet Cavalier 4dr FALSE FALSE FALSE FALSE FALSE FALSE FALSE 14810 13884 2.2 4 140 26 37 2676 104 183 68
Chevrolet Cavalier LS 2dr FALSE FALSE FALSE FALSE FALSE FALSE FALSE 16385 15357 2.2 4 140 26 37 2617 104 183 69
Dodge Neon SE 4dr FALSE FALSE FALSE FALSE FALSE FALSE FALSE 13670 12849 2.0 4 132 29 36 2581 105 174 67
Dodge Neon SXT 4dr FALSE FALSE FALSE FALSE FALSE FALSE FALSE 15040 14086 2.0 4 132 29 36 2626 105 174 67
Ford Focus ZX3 2dr hatch FALSE FALSE FALSE FALSE FALSE FALSE FALSE 13270 12482 2.0 4 130 26 33 2612 103 168 67

Dotplot

  • A dotplot represents each case with a dot
  • Dots are stacked on top of each other at the appropriate location on the x-axis
  • Dot plot of vehicle weights

Histogram

  • In a histogram data are aggregated into bins on the x-axis
  • The height of each bar is proportional to the number of cases in the bin
  • Histogram of vehicle weights

Density plot

  • In a density plot the shape of the distribution is represented using a smooth line (think of this as a smoothed out histogram)
  • Density plot of vehicle weights

Box plot

  • Box plot of vehicle weights. Are there any outliers?
  • It is important to note that box plots can miss important characteristics of the distributions
  • For example, if the distribution is bimodal, then the box plot won’t show it

Scatter plot - visualizing 2 numerical variables

  • We can visualize two numeric variables using a scatter plot
  • Typically, the x-axis is used for the explanatory variable and y-axis for the response variable
  • A scatter plot of highway gas mileage vs. vehicle weight

Faceted histograms

  • We can visualize two variables, where one is numeric and the other is categorical using faceted histograms
  • We simply plot a separate histogram for each level of the categorical variable
  • Faceted histogram showing the distributions of highway mileage for vehicles that are pickups and vehicles that are not pickups

Colored density plots

  • We can use colored density plots for a similar purpose
  • By coloring the density plots according to the levels of the categorical variable, we can plot them on the same axes and still distinguish between the distributions
  • Colored density plots to visualize the distributions of highway mileage for vehicles that are pickups and vehicles that are not pickups

Intensity Maps

  • Sometimes it is useful to use colors to show higher or lower values of variables
  • Using various colors or a continuous gradient we can visualize distribution of a variable
  • Intensity maps are very helpful for seeing geographical trends