Hello Data
IMS2 Ch. 1
Math 115
Lending Club Loans
Six loans from the loan50 dataset

| loan_amount | interest_rate | term | grade | state | total_income | homeownership |
|---|---|---|---|---|---|---|
| 22000 | 10.90 | 60 | B | NJ | 59000 | rent |
| 6000 | 9.92 | 36 | B | CA | 60000 | rent |
| 25000 | 26.30 | 36 | E | SC | 75000 | mortgage |
| 6000 | 9.92 | 36 | B | CA | 75000 | rent |
| 25000 | 9.43 | 60 | B | OH | 254000 | mortgage |
| 6400 | 9.92 | 36 | B | IN | 67000 | mortgage |
- Random sample of 50 loans made through Lending Club platform
- A sample is a subset of a larger group (the population)
- The sample consists of the 50 loans. The population is all loans made through the platform
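
The sample/population distinction can be sketched in a few lines of Python. This is only an illustration: the population of 10,000 loan IDs and the seed are assumptions made for the example, not facts about the Lending Club platform or about how loan50 was drawn.

```python
import random

# Hypothetical population: loan IDs 1..10000 stand in for "all loans made
# through the platform". The size 10000 is an assumption for illustration.
population_ids = list(range(1, 10001))

# Fixed seed (arbitrary) so the draw is reproducible.
rng = random.Random(115)

# A simple random sample: 50 distinct loans, each equally likely to be chosen.
sample_ids = rng.sample(population_ids, k=50)

print(len(sample_ids))         # size of the sample
print(len(population_ids))     # size of the (hypothetical) population
```

Every ID in `sample_ids` also appears in `population_ids`, which is what "subset of a larger group" means here.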
- Data organized in a table called a data frame
- Each row represents a single case or observational unit
- Each column represents a variable, corresponding to a loan characteristic. E.g.,
  - loan_amount (amount of loan in USD)
  - term (number of months of the loan)
  - grade (related to likelihood of being repaid)
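
To make the row/column idea concrete, here is a minimal sketch that stores the six loans shown above as a list of dictionaries, using only the Python standard library. The names `columns`, `rows`, and `loans` are chosen for this example; a dedicated data frame library would usually be used in practice.

```python
# Variable names, in the same order as the columns of the table.
columns = ["loan_amount", "interest_rate", "term", "grade",
           "state", "total_income", "homeownership"]

# One tuple per case (loan), copied from the six rows shown in these notes.
rows = [
    (22000, 10.90, 60, "B", "NJ", 59000, "rent"),
    (6000, 9.92, 36, "B", "CA", 60000, "rent"),
    (25000, 26.30, 36, "E", "SC", 75000, "mortgage"),
    (6000, 9.92, 36, "B", "CA", 75000, "rent"),
    (25000, 9.43, 60, "B", "OH", 254000, "mortgage"),
    (6400, 9.92, 36, "B", "IN", 67000, "mortgage"),
]

# Each case becomes a dict keyed by variable name.
loans = [dict(zip(columns, row)) for row in rows]

first_case = loans[0]                            # one row = one loan
term_column = [loan["term"] for loan in loans]   # one column = one variable

print(first_case)
print(term_column)
```

Reading across `first_case` gives every characteristic of one loan, while `term_column` gives one characteristic for every loan.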
Summary Statistics
- A summary statistic is a single number that summarizes data from a sample
- Mean loan amount ($17,083.00) is a summary statistic
- Summary statistics can be organized in tables

| grade | mean interest_rate |
|---|---|
| A | 6.8 |
| B | 10.2 |
| C | 13.8 |
| D | 18.6 |
| E | 25.6 |
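
As a rough sketch of how such summaries are computed, the code below takes the mean loan amount and the mean interest rate per grade for only the six loans listed earlier. Because it uses six rows rather than all 50, its results will not match the figures quoted above; the variable names are chosen for this example.

```python
from collections import defaultdict
from statistics import mean

# (loan_amount, interest_rate, grade) for the six loans shown in these notes.
six_loans = [
    (22000, 10.90, "B"),
    (6000, 9.92, "B"),
    (25000, 26.30, "E"),
    (6000, 9.92, "B"),
    (25000, 9.43, "B"),
    (6400, 9.92, "B"),
]

# A single number summarizing the sample: the mean loan amount.
mean_amount = mean([amount for amount, _, _ in six_loans])

# Group interest rates by grade, then take the mean within each grade.
rates_by_grade = defaultdict(list)
for _, rate, grade in six_loans:
    rates_by_grade[grade].append(rate)

mean_rate_by_grade = {g: mean(rates) for g, rates in rates_by_grade.items()}

print(round(mean_amount, 2))
for grade in sorted(mean_rate_by_grade):
    print(grade, round(mean_rate_by_grade[grade], 2))
```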
Association
- If there is a relationship between two variables, we say that the variables are associated
- Interest rate and loan grade appear to be associated
- If there is no relationship between two variables, we say the variables are independent
Variable Types
- A numerical variable takes on numeric values for which it makes sense to add, subtract, average, etc.
- A categorical variable takes on values that indicate different levels or categories
Loan data variable types:

| variable | type |
|---|---|
| loan_amount | numerical |
| interest_rate | numerical |
| term | numerical |
| grade | categorical |
| state | categorical |
| total_income | numerical |
| homeownership | categorical |

Figure: Variable types. From IMS2 Fig. 1.1.
Loan data variable types:

| variable | type |
|---|---|
| loan_amount | numerical, continuous |
| interest_rate | numerical, continuous |
| term | numerical, discrete |
| grade | categorical, ordinal |
| state | categorical, nominal |
| total_income | numerical, continuous |
| homeownership | categorical, nominal |
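
A first-pass guess at variable type can be automated, as in the sketch below, which looks only at whether the stored values are numbers. The helper `rough_type` is hypothetical and deliberately crude: it cannot tell discrete from continuous or ordinal from nominal, and it would misclassify categories stored as numeric codes, so the classification still needs human judgment.

```python
def rough_type(values):
    """Guess 'numerical' or 'categorical' from the Python type of the values."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "numerical" if is_number else "categorical"

# Columns of the six loans shown in these notes.
loan_columns = {
    "loan_amount": [22000, 6000, 25000, 6000, 25000, 6400],
    "interest_rate": [10.90, 9.92, 26.30, 9.92, 9.43, 9.92],
    "term": [60, 36, 36, 36, 60, 36],
    "grade": ["B", "B", "E", "B", "B", "B"],
    "state": ["NJ", "CA", "SC", "CA", "OH", "IN"],
    "total_income": [59000, 60000, 75000, 75000, 254000, 67000],
    "homeownership": ["rent", "rent", "mortgage", "rent", "mortgage", "mortgage"],
}

guessed = {name: rough_type(vals) for name, vals in loan_columns.items()}

for name, kind in guessed.items():
    print(f"{name}: {kind}")
```

For these seven variables the guess agrees with the numerical/categorical split in the table.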
Scatterplots
- The relationship between two numerical variables can be visualized using a scatterplot

Figure: Scatterplot showing loan amount vs. income.
Direction of association
- Two numerical variables are said to have a positive association if the values of one variable tend to be higher when the values of the other variable are higher
- Two numerical variables are said to have a negative association if the values of one variable tend to be lower when the values of the other variable are higher
- Loan amount and total income appear to be positively associated
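
One numerical check on the direction of an association is the sign of the sample covariance, a technique these notes do not introduce and which is used here only as an illustration. The sketch applies it to the six loans listed earlier. With so few rows the result is heavily influenced by individual loans, so it is a demonstration rather than evidence about the full dataset.

```python
# Two numerical variables for the six loans shown in these notes.
loan_amount = [22000, 6000, 25000, 6000, 25000, 6400]
total_income = [59000, 60000, 75000, 75000, 254000, 67000]

def sample_covariance(xs, ys):
    """Average product of deviations from the means (n - 1 in the denominator)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

cov = sample_covariance(loan_amount, total_income)

# Positive covariance: higher values of one variable tend to go with higher
# values of the other. Negative covariance: they tend to move in opposite ways.
if cov > 0:
    direction = "positive"
elif cov < 0:
    direction = "negative"
else:
    direction = "none"

print(direction)
```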
Stents for Treating Strokes
First five patients

| group | outcome |
|---|---|
| control | no event |
| treatment | no event |
| treatment | stroke |
| treatment | no event |
| control | no event |
- Researchers designed a study to assess the effectiveness of stents in preventing strokes
- 451 at-risk patients randomly assigned to receive stent (treatment) or not (control)
- Outcome (“stroke” or “no event”) recorded after 365 days
- What are the cases?
- What are the variables?
- What are the variable types?
Summary of results for stent study

| group | stroke | no event |
|---|---|---|
| control | 28 | 199 |
| treatment | 45 | 179 |
- The proportion that had a stroke in the control group was 28/227 = 0.12
- The proportion that had a stroke in the treatment group was 45/224 = 0.20
- Does there appear to be an association between group and outcome?
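
The two proportions can be reproduced directly from the counts in the summary table. The sketch below does so; the dictionary layout and names are chosen for this example.

```python
# Counts from the summary of results for the stent study.
counts = {
    "control": {"stroke": 28, "no event": 199},
    "treatment": {"stroke": 45, "no event": 179},
}

# Proportion with a stroke = strokes / group size.
proportions = {}
for group, c in counts.items():
    group_size = c["stroke"] + c["no event"]
    proportions[group] = c["stroke"] / group_size
    print(f"{group}: {c['stroke']}/{group_size} = {proportions[group]:.2f}")

# Comparing the groups: a difference far from zero suggests an association.
difference = proportions["treatment"] - proportions["control"]
print(f"difference (treatment - control): {difference:.2f}")
```

A difference of zero would mean the stroke proportion is the same in both groups. Here the treatment group's proportion is higher.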
Variable Roles
- In some cases we may suspect that one variable affects the other
- We then call the variable we suspect of having an effect the explanatory variable, and the variable it may affect the response variable
- In the stent study, the group (treatment or control) is the explanatory variable, and the outcome is the response variable
Experiment vs. Observational Study
- An experiment is a study in which researchers manipulate or assign the values of the explanatory variable
- The stent study is an experiment, because the researchers assign patients to the treatment or control group
- When cases are randomly assigned to groups, the study is called a randomized experiment
- An observational study is a study without such manipulation. The cases are observed as they are
- The Lending Club data are from an observational study
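
Random assignment can be sketched as shuffling the cases and splitting the shuffled list. The notes do not describe the procedure the stent researchers actually used, so the code below is only one common approach. The patient IDs and the seed are assumptions, while the group sizes (224 treatment, 227 control) come from the results table.

```python
import random

# 451 hypothetical patient IDs; real identifiers are not given in the notes.
patient_ids = list(range(1, 452))

# Fixed seed (arbitrary) so the assignment is reproducible.
rng = random.Random(2024)

# Shuffle a copy, then split: chance alone decides who lands in which group.
shuffled = patient_ids[:]
rng.shuffle(shuffled)

treatment = set(shuffled[:224])   # first 224 after shuffling receive the stent
control = set(shuffled[224:])     # remaining 227 do not

print(len(treatment), len(control))
```

In an observational study there is no such step: the groups are whatever they happen to be when the cases are observed.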