Exploring Categorical Data

Comic Characters

15,182 comic characters from DC and Marvel comics
11 variables, including
- name
- identity (id) gives information about personal identity (e.g., identity is kept secret)
- alignment (align) gives information about whether a character is good, bad, etc.

name	id	align	eye	hair	gender	gsm	alive	appearances	first_appear	publisher
Spider-Man (Peter Parker)	Secret	Good	Hazel Eyes	Brown Hair	Male	NA	Living Characters	4043	Aug-62	marvel
Captain America (Steven Rogers)	Public	Good	Blue Eyes	White Hair	Male	NA	Living Characters	3360	Mar-41	marvel
Wolverine (James "Logan" Howlett)	Public	Neutral	Blue Eyes	Black Hair	Male	NA	Living Characters	3061	Oct-74	marvel
Iron Man (Anthony "Tony" Stark)	Public	Good	Blue Eyes	Black Hair	Male	NA	Living Characters	2961	Mar-63	marvel
Thor (Thor Odinson)	No Dual	Good	Blue Eyes	Blond Hair	Male	NA	Living Characters	2258	Nov-50	marvel
Benjamin Grimm (Earth-616)	Public	Good	Blue Eyes	No Hair	Male	NA	Living Characters	2255	Nov-61	marvel
Reed Richards (Earth-616)	Public	Good	Brown Eyes	Brown Hair	Male	NA	Living Characters	2072	Nov-61	marvel
Hulk (Robert Bruce Banner)	Public	Good	Brown Eyes	Brown Hair	Male	NA	Living Characters	2017	May-62	marvel
Scott Summers (Earth-616)	Public	Neutral	Brown Eyes	Brown Hair	Male	NA	Living Characters	1955	Sep-63	marvel
Jonathan Storm (Earth-616)	Public	Good	Blue Eyes	Blond Hair	Male	NA	Living Characters	1934	Nov-61	marvel

Bar plot showing frequencies of levels of identity variable

A contingency table is a table that can be used to summarize two categorical variables
Each cell is a count of the number of times a particular combination of outcomes of the two variables occurs
Usually includes row and column totals as well (marginal totals)
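
As a minimal sketch of how such a table is built, the snippet below counts (id, align) combinations for the ten sample rows shown earlier and adds the marginal totals. It uses only the Python standard library; the variable names are illustrative and not part of the original material.

```python
from collections import Counter

# (id, align) pairs copied from the ten sample rows of the data set
pairs = [
    ("Secret", "Good"),     # Spider-Man
    ("Public", "Good"),     # Captain America
    ("Public", "Neutral"),  # Wolverine
    ("Public", "Good"),     # Iron Man
    ("No Dual", "Good"),    # Thor
    ("Public", "Good"),     # Benjamin Grimm
    ("Public", "Good"),     # Reed Richards
    ("Public", "Good"),     # Hulk
    ("Public", "Neutral"),  # Scott Summers
    ("Public", "Good"),     # Jonathan Storm
]

# Each cell of the table is the count of one (id, align) combination
cell_counts = Counter(pairs)

# Marginal totals: one per alignment (rows) and one per identity (columns)
row_totals = Counter(align for _, align in pairs)
col_totals = Counter(id_ for id_, _ in pairs)
grand_total = len(pairs)

# Print the table with alignment in rows and identity in columns
ids = sorted(col_totals)
print("align", *ids, "Total", sep="\t")
for align in sorted(row_totals):
    counts = [cell_counts[(id_, align)] for id_ in ids]
    print(align, *counts, row_totals[align], sep="\t")
print("Total", *(col_totals[id_] for id_ in ids), grand_total, sep="\t")
```

With only ten rows, most cells are zero; the full data set fills in the table that follows.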

Contingency table for identity and alignment
align	No Dual	Public	Secret	Unknown	Total
Bad	453	2,106	4,352	7	6,918
Good	640	2,905	2,430	0	5,975
Neutral	377	946	908	2	2,233
Reformed Criminals	0	1	1	0	2
Total	1,470	5,958	7,691	9	15,128

Proportion of outcomes for each combination of alignment and identity
align	No Dual	Public	Secret	Unknown
Bad	0.0299	0.1392	0.2877	0.0005
Good	0.0423	0.1920	0.1606	0.0000
Neutral	0.0249	0.0625	0.0600	0.0001
Reformed Criminals	0.0000	0.0001	0.0001	0.0000
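
These proportions are each count divided by the grand total. The sketch below reproduces them from the counts in the contingency table for identity and alignment; the dictionary layout and variable names are assumptions made for illustration.

```python
# Counts copied from the contingency table for identity and alignment
ids = ["No Dual", "Public", "Secret", "Unknown"]
counts = {
    "Bad": [453, 2106, 4352, 7],
    "Good": [640, 2905, 2430, 0],
    "Neutral": [377, 946, 908, 2],
    "Reformed Criminals": [0, 1, 1, 0],
}

# Grand total of all cells in the table
grand_total = sum(sum(row) for row in counts.values())

# Joint proportions: every cell divided by the grand total
joint = {align: [n / grand_total for n in row] for align, row in counts.items()}

print("align", *ids, sep="\t")
for align, row in joint.items():
    print(align, *(f"{p:.4f}" for p in row), sep="\t")
```

Because every cell is divided by the same total, all of the proportions in the table together sum to 1.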

We can also create tables of conditional proportions that can be helpful to explore associations between the variables
We need to decide whether the proportions should be conditioned on rows (divide counts by row totals) or columns (divide counts by column totals)
If conditioned on rows, proportions sum to 1 along rows
If conditioned on columns, proportions sum to 1 along columns
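
The two choices can be sketched as follows, using the counts from the contingency table for identity and alignment. The helper names condition_on_rows and condition_on_columns are hypothetical, and the sketch assumes no row or column total is zero (true for this table).

```python
# Counts copied from the contingency table for identity and alignment
ids = ["No Dual", "Public", "Secret", "Unknown"]
counts = {
    "Bad": [453, 2106, 4352, 7],
    "Good": [640, 2905, 2430, 0],
    "Neutral": [377, 946, 908, 2],
    "Reformed Criminals": [0, 1, 1, 0],
}


def condition_on_rows(table):
    """Divide each count by its row total (assumes no row total is zero)."""
    return {label: [n / sum(row) for n in row] for label, row in table.items()}


def condition_on_columns(table):
    """Divide each count by its column total (assumes no column total is zero)."""
    col_totals = [sum(col) for col in zip(*table.values())]
    return {
        label: [n / total for n, total in zip(row, col_totals)]
        for label, row in table.items()
    }


by_row = condition_on_rows(counts)     # conditioned on alignment
by_col = condition_on_columns(counts)  # conditioned on identity

for title, table in [("Conditioned on rows", by_row), ("Conditioned on columns", by_col)]:
    print(title)
    print("align", *ids, sep="\t")
    for align, row in table.items():
        print(align, *(f"{p:.4f}" for p in row), sep="\t")
```

The same cell gives a different proportion under each choice: the Bad/Secret cell is 4,352 out of 6,918 bad characters when conditioned on rows, but 4,352 out of 7,691 characters with secret identities when conditioned on columns.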

The following proportions are conditioned on rows (alignment)
Allows us to compare proportions of identity types between different alignment groups
For example, we can see that about 63% of bad characters have secret identities whereas only about 41% of good characters have secret identities.

align	No Dual	Public	Secret	Unknown
Bad	0.0655	0.3044	0.6291	0.0010
Good	0.1071	0.4862	0.4067	0.0000
Neutral	0.1688	0.4236	0.4066	0.0009
Reformed Criminals	0.0000	0.5000	0.5000	0.0000

The following proportions are conditioned on columns (identity)
Allows us to compare proportions of alignment types between different identity groups
For example, we can see that about 57% of characters with secret identities are bad, whereas only about 32% of characters with secret identities are good.

align	No Dual	Public	Secret	Unknown
Bad	0.3082	0.3535	0.5659	0.7778
Good	0.4354	0.4876	0.3160	0.0000
Neutral	0.2565	0.1588	0.1181	0.2222
Reformed Criminals	0.0000	0.0002	0.0001	0.0000

Stacked bar plot showing alignment frequencies for different id levels

Dodged bar plot showing alignment frequencies for different id levels

Faceted bar plot showing alignment frequencies for different id levels

A fourth type of bar plot we can use to visualize two categorical variables is a standardized (filled) bar plot
This shows conditional proportions (instead of counts) in a stacked format
The following proportions are conditioned on id

Standardized bar plot showing alignment proportions for different id levels

Standardized bar plot showing id proportions for different alignment levels