Exploring Categorical Data
IMS2 Ch. 4
Math 115
CSV Data Files
- In J Lab 1 we learned how to import data from a CSV file into Jamovi
- The `comics` data set is also saved in a CSV file
- Let’s download the file and explore it using a text editor
- We can also open it with Jamovi
- First 10 rows of the comics data
| name | id | align | eye | hair | gender | gsm | alive | appearances | first_appear | publisher |
|---|---|---|---|---|---|---|---|---|---|---|
| Spider-Man (Peter Parker) | Secret | Good | Hazel Eyes | Brown Hair | Male | NA | Living Characters | 4043 | Aug-62 | marvel |
| Captain America (Steven Rogers) | Public | Good | Blue Eyes | White Hair | Male | NA | Living Characters | 3360 | Mar-41 | marvel |
| Wolverine (James "Logan" Howlett) | Public | Neutral | Blue Eyes | Black Hair | Male | NA | Living Characters | 3061 | Oct-74 | marvel |
| Iron Man (Anthony "Tony" Stark) | Public | Good | Blue Eyes | Black Hair | Male | NA | Living Characters | 2961 | Mar-63 | marvel |
| Thor (Thor Odinson) | No Dual | Good | Blue Eyes | Blond Hair | Male | NA | Living Characters | 2258 | Nov-50 | marvel |
| Benjamin Grimm (Earth-616) | Public | Good | Blue Eyes | No Hair | Male | NA | Living Characters | 2255 | Nov-61 | marvel |
| Reed Richards (Earth-616) | Public | Good | Brown Eyes | Brown Hair | Male | NA | Living Characters | 2072 | Nov-61 | marvel |
| Hulk (Robert Bruce Banner) | Public | Good | Brown Eyes | Brown Hair | Male | NA | Living Characters | 2017 | May-62 | marvel |
| Scott Summers (Earth-616) | Public | Neutral | Brown Eyes | Brown Hair | Male | NA | Living Characters | 1955 | Sep-63 | marvel |
| Jonathan Storm (Earth-616) | Public | Good | Blue Eyes | Blond Hair | Male | NA | Living Characters | 1934 | Nov-61 | marvel |
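To see how a program reads a CSV file, here is a minimal Python sketch using the standard `csv` module. The two data rows are copied from the table above. The column names in the header line are an assumption made for illustration and may differ from those in the actual file.

```python
import csv
import io

# A tiny stand-in for the comics CSV file. The header names are assumed
# for illustration; the two data rows come from the table above.
csv_text = """name,id,align,appearances
Spider-Man (Peter Parker),Secret,Good,4043
Captain America (Steven Rogers),Public,Good,3360
"""

# DictReader treats the first line as column names and yields one dict per row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    print(row["name"], "|", row["id"], "|", row["align"])
```

Every value comes back as text, including `appearances`, so numeric columns need to be converted before doing arithmetic with them.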
Describing categorical data
- We can summarize a single categorical variable using a frequency table
- Counts the number of observations for each level of the variable
| Identity | Count |
|---|---|
| No Dual | 1,470 |
| Public | 5,958 |
| Secret | 7,691 |
| Unknown | 9 |
| Total | 15,128 |
- We can also calculate proportions for each of the levels
| Identity | Proportion |
|---|---|
| No Dual | 0.097 |
| Public | 0.394 |
| Secret | 0.508 |
| Unknown | 0.001 |
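The counting and dividing behind these two tables can be sketched in Python. This is an illustration only, not how Jamovi does it: the first part counts the identity values of the ten rows shown earlier, and the second part turns the full-data counts from the frequency table into proportions.

```python
from collections import Counter

# Identity values of the first ten rows of the comics data (not the full data set).
first_ten_ids = ["Secret", "Public", "Public", "Public", "No Dual",
                 "Public", "Public", "Public", "Public", "Public"]

# A frequency table counts the observations at each level.
small_table = Counter(first_ten_ids)
print(small_table)

# Counts for the full data set, taken from the frequency table.
counts = {"No Dual": 1470, "Public": 5958, "Secret": 7691, "Unknown": 9}
total = sum(counts.values())

# A proportion is a level's count divided by the total count.
proportions = {level: n / total for level, n in counts.items()}

for level, p in proportions.items():
    print(f"{level:8s} {counts[level]:6,d} {p:.3f}")
```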
Visualizing categorical data
- We can use a bar plot to visualize categorical data
![]()
Bar plot showing frequencies of levels of identity variable
Summarizing two categorical variables
- A contingency table is a table that can be used to summarize two categorical variables
- Each cell counts how many times a particular combination of outcomes of the two variables occurs
- Usually includes row and column totals as well (marginal totals)
Contingency table for identity and alignment
| Alignment | No Dual | Public | Secret | Unknown | Total |
|---|---|---|---|---|---|
| Bad | 453 | 2,106 | 4,352 | 7 | 6,918 |
| Good | 640 | 2,905 | 2,430 | 0 | 5,975 |
| Neutral | 377 | 946 | 908 | 2 | 2,233 |
| Reformed Criminals | 0 | 1 | 1 | 0 | 2 |
| Total | 1,470 | 5,958 | 7,691 | 9 | 15,128 |
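As a check on how the marginal totals arise, the following Python sketch recomputes them from the cell counts. The column order (No Dual, Public, Secret, Unknown) is inferred by matching the Total row against the frequency table for identity. The variable names are chosen here for illustration.

```python
# Cell counts from the contingency table, one list per alignment level.
# Column order is inferred from the identity frequency table.
identities = ["No Dual", "Public", "Secret", "Unknown"]
table = {
    "Bad": [453, 2106, 4352, 7],
    "Good": [640, 2905, 2430, 0],
    "Neutral": [377, 946, 908, 2],
    "Reformed Criminals": [0, 1, 1, 0],
}

# Row totals: add across identity levels within each alignment.
row_totals = {align: sum(cells) for align, cells in table.items()}

# Column totals: add across alignments within each identity level.
col_totals = {
    ident: sum(cells[j] for cells in table.values())
    for j, ident in enumerate(identities)
}

# The grand total is the same whether we add row totals or column totals.
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```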
- It is also useful to create contingency tables with proportions
- The simplest version is obtained by dividing each count by the grand total
- In this case, the values in the table sum to 1
Proportion of outcomes for each combination of alignment and identity
| Alignment | No Dual | Public | Secret | Unknown |
|---|---|---|---|---|
| Bad | 0.0299 | 0.1392 | 0.2877 | 0.0005 |
| Good | 0.0423 | 0.1920 | 0.1606 | 0.0000 |
| Neutral | 0.0249 | 0.0625 | 0.0600 | 0.0001 |
| Reformed Criminals | 0.0000 | 0.0001 | 0.0001 | 0.0000 |
- What does the value 0.0299 mean?
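One way to explore this question is to recompute the value by hand. The sketch below divides every cell count by the grand total, assuming the column order No Dual, Public, Secret, Unknown (inferred from the column totals). The first entry of the Bad row is 453 / 15,128, the share of all characters that are both bad and have no dual identity.

```python
# Cell counts from the contingency table (columns: No Dual, Public, Secret, Unknown).
table = {
    "Bad": [453, 2106, 4352, 7],
    "Good": [640, 2905, 2430, 0],
    "Neutral": [377, 946, 908, 2],
    "Reformed Criminals": [0, 1, 1, 0],
}

grand_total = sum(sum(cells) for cells in table.values())

# Joint proportions: each count divided by the grand total.
joint = {
    align: [n / grand_total for n in cells]
    for align, cells in table.items()
}

print(round(joint["Bad"][0], 4))  # 453 / 15128, prints 0.0299

# All joint proportions together account for every character exactly once.
print(round(sum(sum(props) for props in joint.values()), 10))  # prints 1.0
```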
Conditional proportions
- We can also create tables of conditional proportions that can be helpful to explore associations between the variables
- We need to decide whether the proportions should be conditioned on rows (divide counts by row totals) or columns (divide counts by column totals)
- If conditioned on rows, proportions sum to 1 along rows
- If conditioned on columns, proportions sum to 1 along columns
- These proportions are conditioned on rows (alignment)
- Allows us to compare proportions of identity types between different alignment groups
- For example, we can see that about 63% of bad characters have secret identities whereas only about 41% of good characters have secret identities.
| Alignment | No Dual | Public | Secret | Unknown |
|---|---|---|---|---|
| Bad | 0.0655 | 0.3044 | 0.6291 | 0.0010 |
| Good | 0.1071 | 0.4862 | 0.4067 | 0.0000 |
| Neutral | 0.1688 | 0.4236 | 0.4066 | 0.0009 |
| Reformed Criminals | 0.0000 | 0.5000 | 0.5000 | 0.0000 |
- These proportions are conditioned on columns (identity)
- Allows us to compare proportions of alignment types between different identity groups
- For example, we can see that about 57% of characters with secret identities are bad, whereas only about 32% of characters with secret identities are good.
| Alignment | No Dual | Public | Secret | Unknown |
|---|---|---|---|---|
| Bad | 0.3082 | 0.3535 | 0.5659 | 0.7778 |
| Good | 0.4354 | 0.4876 | 0.3160 | 0.0000 |
| Neutral | 0.2565 | 0.1588 | 0.1181 | 0.2222 |
| Reformed Criminals | 0.0000 | 0.0002 | 0.0001 | 0.0000 |
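The difference between the two kinds of conditioning comes down to which total each count is divided by. The following Python sketch, using the counts from the contingency table and illustrative variable names, computes both versions and compares the same cell (Bad, Secret) under each.

```python
# Cell counts from the contingency table.
identities = ["No Dual", "Public", "Secret", "Unknown"]
table = {
    "Bad": [453, 2106, 4352, 7],
    "Good": [640, 2905, 2430, 0],
    "Neutral": [377, 946, 908, 2],
    "Reformed Criminals": [0, 1, 1, 0],
}

# Conditioned on rows (alignment): divide each count by its row total.
row_cond = {
    align: [n / sum(cells) for n in cells]
    for align, cells in table.items()
}

# Conditioned on columns (identity): divide each count by its column total.
col_totals = [
    sum(cells[j] for cells in table.values())
    for j in range(len(identities))
]
col_cond = {
    align: [n / col_totals[j] for j, n in enumerate(cells)]
    for align, cells in table.items()
}

secret = identities.index("Secret")
# Among bad characters, what share has a secret identity?
print(f"{row_cond['Bad'][secret]:.4f}")  # prints 0.6291
# Among characters with a secret identity, what share is bad?
print(f"{col_cond['Bad'][secret]:.4f}")  # prints 0.5659
```

The two numbers answer different questions, which is why the choice of conditioning variable has to match the comparison we want to make.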
Visualizing two categorical variables
- There are different ways to visualize two categorical variables using bar plots
- We can create a stacked bar plot
- Colors show how composition varies within each group
![]()
Stacked bar plot showing alignment frequencies for different id levels
- We can also visualize the data using side-by-side (dodged) bar plots
![]()
Dodged bar plot showing alignment frequencies for different id levels
- Another alternative is to use faceted bar plots
- Facet according to one of the variables
- A facet (subplot) is created for each level of that variable
![]()
Faceted bar plot showing alignment frequencies for different id levels
- A fourth type of bar plot we can use to visualize two categorical variables is a standardized (filled) bar plot
- This shows conditional proportions (instead of counts) in a stacked format
- The following proportions are conditioned on `id`
![]()
Standardized bar plot showing alignment proportions for different id levels
- We can take a different perspective by exchanging the roles of the variables
- The following proportions are conditioned on `align`
![]()
Standardized bar plot showing id proportions for different alignment levels
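To make the construction of a standardized bar plot concrete, here is a rough Python sketch that computes where each colored segment starts and how tall it is when the proportions are conditioned on identity. It uses the counts from the contingency table. The stacking order and variable names are assumptions for illustration and need not match the plots shown above.

```python
# Cell counts from the contingency table.
identities = ["No Dual", "Public", "Secret", "Unknown"]
table = {
    "Bad": [453, 2106, 4352, 7],
    "Good": [640, 2905, 2430, 0],
    "Neutral": [377, 946, 908, 2],
    "Reformed Criminals": [0, 1, 1, 0],
}

# One bar per identity level; each bar stacks the alignment proportions
# for that level, so every bar has total height 1.
segments = {ident: [] for ident in identities}
for j, ident in enumerate(identities):
    col_total = sum(cells[j] for cells in table.values())
    bottom = 0.0
    for align, cells in table.items():
        height = cells[j] / col_total  # conditional proportion
        segments[ident].append((align, bottom, height))
        bottom += height  # next segment starts where this one ends

for align, bottom, height in segments["Secret"]:
    print(f"{align:18s} bottom={bottom:.4f} height={height:.4f}")
```

Each `(bottom, height)` pair is what a plotting library needs to draw one segment. In Matplotlib, for instance, the `bar` function accepts both a height and a `bottom` argument. Swapping the roles of rows and columns in the loop gives the bars conditioned on alignment instead.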