Exploring Categorical Data

IMS2 Ch. 4
Math 115

Yurk

Comic Characters

  • 15,182 comic characters from DC and Marvel comics

  • 11 variables, including

    • name
    • identity (id) gives information about personal identity (e.g., identity is kept secret)
    • alignment (align) gives information about whether the character is good, bad, etc.

CSV Data Files

  • In J Lab 1 we learned how to import data from a CSV file into Jamovi
  • The comics data set is also saved in a CSV file
  • Let’s download the file and explore it using a text editor
  • We can also open it with Jamovi
  • First 10 rows of the comics data
name                               id       align    eye         hair        gender  gsm  alive              appearances  first_appear  publisher
Spider-Man (Peter Parker)          Secret   Good     Hazel Eyes  Brown Hair  Male    NA   Living Characters  4043         Aug-62        marvel
Captain America (Steven Rogers)    Public   Good     Blue Eyes   White Hair  Male    NA   Living Characters  3360         Mar-41        marvel
Wolverine (James "Logan" Howlett)  Public   Neutral  Blue Eyes   Black Hair  Male    NA   Living Characters  3061         Oct-74        marvel
Iron Man (Anthony "Tony" Stark)    Public   Good     Blue Eyes   Black Hair  Male    NA   Living Characters  2961         Mar-63        marvel
Thor (Thor Odinson)                No Dual  Good     Blue Eyes   Blond Hair  Male    NA   Living Characters  2258         Nov-50        marvel
Benjamin Grimm (Earth-616)         Public   Good     Blue Eyes   No Hair     Male    NA   Living Characters  2255         Nov-61        marvel
Reed Richards (Earth-616)          Public   Good     Brown Eyes  Brown Hair  Male    NA   Living Characters  2072         Nov-61        marvel
Hulk (Robert Bruce Banner)         Public   Good     Brown Eyes  Brown Hair  Male    NA   Living Characters  2017         May-62        marvel
Scott Summers (Earth-616)          Public   Neutral  Brown Eyes  Brown Hair  Male    NA   Living Characters  1955         Sep-63        marvel
Jonathan Storm (Earth-616)         Public   Good     Blue Eyes   Blond Hair  Male    NA   Living Characters  1934         Nov-61        marvel
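For readers who want to see what "comma-separated" means in practice, here is a minimal Python sketch using only the standard library. The inline string stands in for the downloaded file and holds two rows copied from the table above. The exact quoting and layout of the real file may differ, so treat this as an illustration rather than a description of the actual file.

```python
import csv
import io

# Two rows copied from the table above, written the way a CSV file stores them:
# one observation per line, values separated by commas.
# The inline string is a stand-in for the downloaded file (an assumption).
csv_text = """name,id,align,eye,hair,gender,gsm,alive,appearances,first_appear,publisher
Spider-Man (Peter Parker),Secret,Good,Hazel Eyes,Brown Hair,Male,NA,Living Characters,4043,Aug-62,marvel
Captain America (Steven Rogers),Public,Good,Blue Eyes,White Hair,Male,NA,Living Characters,3360,Mar-41,marvel
"""

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# The first line of the file supplies the variable names.
print(reader.fieldnames)

# Each later line becomes one observation, keyed by variable name.
for row in rows:
    print(row["name"], "|", row["id"], "|", row["align"])
```

Note that every value is read as text, including appearances. A statistics package such as Jamovi decides the variable type after import.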

Describing categorical data

  • We can summarize a single categorical variable using a frequency table
  • Counts the number of observations for each level of the variable
identity    count
No Dual     1,470
Public      5,958
Secret      7,691
Unknown         9
Total      15,128
  • We can also calculate proportions for each of the levels
identity  proportion
No Dual        0.097
Public         0.394
Secret         0.508
Unknown        0.001
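The proportions are each count divided by the total. A short Python sketch of that arithmetic, using the counts from the frequency table (the variable names are illustrative only):

```python
# Counts copied from the frequency table for identity.
counts = {"No Dual": 1470, "Public": 5958, "Secret": 7691, "Unknown": 9}

# The total is the sum of the counts over all levels.
total = sum(counts.values())

# Proportion for a level = count for that level / total.
proportions = {level: n / total for level, n in counts.items()}

for level, p in proportions.items():
    print(f"{level:<8} {p:.3f}")
```

Rounded to three decimal places, the results match the proportion table.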

Visualizing categorical data

  • We can use a bar plot to visualize categorical data

Bar plot showing frequencies of levels of identity variable
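The idea behind a bar plot can be roughed out in plain text: draw one bar per level, with length proportional to the count. The sketch below is not how Jamovi draws its plots. The scale of one `#` per 200 comic characters is an arbitrary choice made for this illustration.

```python
# Counts copied from the frequency table for identity.
counts = {"No Dual": 1470, "Public": 5958, "Secret": 7691, "Unknown": 9}

# Hypothetical scale: one '#' stands for roughly 200 comic characters.
scale = 200

# Bar length is the count divided by the scale, rounded to a whole symbol.
bars = {level: "#" * round(n / scale) for level, n in counts.items()}

for level, bar in bars.items():
    print(f"{level:<8} {bar} {counts[level]:,}")
```

At this scale the Unknown level (9 characters) gets no visible bar, which is a reminder that very rare levels are easy to miss in a bar plot.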

Summarizing two categorical variables

  • A contingency table is a table that can be used to summarize two categorical variables
  • Each cell counts how many times a particular combination of outcomes of the two variables occurs
  • Usually includes row and column totals as well (marginal totals)
Contingency table for identity and alignment
align               No Dual  Public  Secret  Unknown   Total
Bad                     453   2,106   4,352        7   6,918
Good                    640   2,905   2,430        0   5,975
Neutral                 377     946     908        2   2,233
Reformed Criminals        0       1       1        0       2
Total                 1,470   5,958   7,691        9  15,128
  • It is also useful to create contingency tables with proportions
  • The simplest version is obtained by dividing each count by the grand total
  • In this case values in table sum to 1
Proportion of outcomes for each combination of alignment and identity
align               No Dual  Public  Secret  Unknown
Bad                  0.0299  0.1392  0.2877   0.0005
Good                 0.0423  0.1920  0.1606   0.0000
Neutral              0.0249  0.0625  0.0600   0.0001
Reformed Criminals   0.0000  0.0001  0.0001   0.0000
  • What does the value 0.0299 mean?
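One way to check an interpretation is to reproduce the number. The Python sketch below divides each count in the contingency table by the grand total. The data layout (a dictionary of rows) is just one convenient choice for illustration.

```python
# Column order for every row: the four levels of identity.
ids = ["No Dual", "Public", "Secret", "Unknown"]

# Counts copied from the contingency table (marginal totals left out).
table = {
    "Bad": [453, 2106, 4352, 7],
    "Good": [640, 2905, 2430, 0],
    "Neutral": [377, 946, 908, 2],
    "Reformed Criminals": [0, 1, 1, 0],
}

# Grand total = sum of every cell in the table.
grand_total = sum(sum(row) for row in table.values())

# Each proportion is a cell count divided by the grand total.
joint = {a: [n / grand_total for n in row] for a, row in table.items()}

for a, row in joint.items():
    print(f"{a:<18}", "  ".join(f"{p:.4f}" for p in row))

# 453 of the 15,128 characters are both bad and have no dual identity.
print(f"{table['Bad'][0]} / {grand_total} = {joint['Bad'][0]:.4f}")
```

Because every character falls in exactly one cell, the proportions add up to 1 (up to floating-point rounding).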

Conditional proportions

  • We can also create tables of conditional proportions that can be helpful for exploring associations between the variables
  • We need to decide whether the proportions should be conditioned on rows (divide counts by row totals) or columns (divide counts by column totals)
  • If conditioned on rows, proportions sum to 1 along rows
  • If conditioned on columns, proportions sum to 1 along columns
  • These proportions are conditioned on rows (alignment)
  • Allows us to compare proportions of identity types between different alignment groups
  • For example, we can see that about 63% of bad characters have secret identities whereas only about 41% of good characters have secret identities.
align               No Dual  Public  Secret  Unknown
Bad                  0.0655  0.3044  0.6291   0.0010
Good                 0.1071  0.4862  0.4067   0.0000
Neutral              0.1688  0.4236  0.4066   0.0009
Reformed Criminals   0.0000  0.5000  0.5000   0.0000
  • These proportions are conditioned on columns (identity)
  • Allows us to compare proportions of alignment types between different identity groups
  • For example, we can see that about 57% of characters with secret identities are bad, whereas only about 32% of characters with secret identities are good.
align               No Dual  Public  Secret  Unknown
Bad                  0.3082  0.3535  0.5659   0.7778
Good                 0.4354  0.4876  0.3160   0.0000
Neutral              0.2565  0.1588  0.1181   0.2222
Reformed Criminals   0.0000  0.0002  0.0001   0.0000
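The two kinds of conditioning differ only in the denominator. The Python sketch below recomputes a few entries of both tables from the contingency table counts. Names such as `by_row` and `by_col` are illustrative only.

```python
# Column order for every row: the four levels of identity.
ids = ["No Dual", "Public", "Secret", "Unknown"]

# Counts copied from the contingency table (marginal totals left out).
table = {
    "Bad": [453, 2106, 4352, 7],
    "Good": [640, 2905, 2430, 0],
    "Neutral": [377, 946, 908, 2],
    "Reformed Criminals": [0, 1, 1, 0],
}

# Marginal totals: one per alignment (row) and one per identity (column).
row_totals = {a: sum(row) for a, row in table.items()}
col_totals = [sum(row[j] for row in table.values()) for j in range(len(ids))]

# Conditioned on rows: divide each count by its row total.
by_row = {a: [n / row_totals[a] for n in row] for a, row in table.items()}

# Conditioned on columns: divide each count by its column total.
by_col = {a: [n / col_totals[j] for j, n in enumerate(row)]
          for a, row in table.items()}

secret = ids.index("Secret")
print(f"Proportion of bad characters that are secret:  {by_row['Bad'][secret]:.4f}")
print(f"Proportion of good characters that are secret: {by_row['Good'][secret]:.4f}")
print(f"Proportion of secret characters that are bad:  {by_col['Bad'][secret]:.4f}")
print(f"Proportion of secret characters that are good: {by_col['Good'][secret]:.4f}")
```

The same count (4,352 bad characters with secret identities) gives about 63% when divided by the number of bad characters and about 57% when divided by the number of characters with secret identities, so it matters which variable is conditioned on.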

Visualizing two categorical variables

  • There are different ways to visualize two categorical variables using bar plots
  • We can create a stacked bar plot
  • Colors show how composition varies within each group

Stacked bar plot showing alignment frequencies for different id levels

  • We can also visualize the data using side-by-side (dodged) bar plots

Dodged bar plot showing alignment frequencies for different id levels

  • Another alternative is to use faceted bar plots
  • Facet according to one of the variables
  • A facet (subplot) is created for each level of that variable

Faceted bar plot showing alignment frequencies for different id levels

  • A fourth type of bar plot we can use to visualize two categorical variables is a standardized (filled) bar plot
  • This shows conditional proportions (instead of counts) in a stacked format
  • The following proportions are conditioned on id

Standardized bar plot showing alignment proportions for different id levels

  • We can take a different perspective by exchanging the roles of the variables
  • The following proportions are conditioned on align

Standardized bar plot showing id proportions for different alignment levels