Hello Data

Lending Club Loans

Six loans from the loan50 dataset
loan_amount	interest_rate	term	grade	state	total_income	homeownership
22000	10.90	60	B	NJ	59000	rent
6000	9.92	36	B	CA	60000	rent
25000	26.30	36	E	SC	75000	mortgage
6000	9.92	36	B	CA	75000	rent
25000	9.43	60	B	OH	254000	mortgage
6400	9.92	36	B	IN	67000	mortgage

Random sample of 50 loans made through Lending Club platform
A sample is a subset of a larger group (the population)
The sample consists of the 50 loans. The population is all loans made through the platform

Six loans from the loan50 dataset
loan_amount	interest_rate	term	grade	state	total_income	homeownership
22000	10.90	60	B	NJ	59000	rent
6000	9.92	36	B	CA	60000	rent
25000	26.30	36	E	SC	75000	mortgage
6000	9.92	36	B	CA	75000	rent
25000	9.43	60	B	OH	254000	mortgage
6400	9.92	36	B	IN	67000	mortgage

Six loans from the loan50 dataset
loan_amount	interest_rate	term	grade	state	total_income	homeownership
22000	10.90	60	B	NJ	59000	rent
6000	9.92	36	B	CA	60000	rent
25000	26.30	36	E	SC	75000	mortgage
6000	9.92	36	B	CA	75000	rent
25000	9.43	60	B	OH	254000	mortgage
6400	9.92	36	B	IN	67000	mortgage

Each row represents a single case or observational unit
Each column represents a variable, corresponding to a loan characteristic. E.g.,
- loan_amount (amount of loan in USD)
- term (number of months of the loan)
- grade (related to likelihood of being repaid)

If there is a relationship between two variables, we say that the variables are associated
Interest rate and loan grade appear to be associated
If there is no relationship between two variables, we say the variables are independent

A numerical variable takes on values that are described using numbers that make sense to add, subtract, average, etc
A categorical variable takes on values that indicate different levels or categories

Six loans from the loan50 dataset
loan_amount	interest_rate	term	grade	state	total_income	homeownership
22000	10.90	60	B	NJ	59000	rent
6000	9.92	36	B	CA	60000	rent
25000	26.30	36	E	SC	75000	mortgage
6000	9.92	36	B	CA	75000	rent
25000	9.43	60	B	OH	254000	mortgage
6400	9.92	36	B	IN	67000	mortgage

Loan data variable types:

Numerical variables can be further broken down into
- discrete: takes on discrete values (with jumps between consecutive values)
- continuous: can take on any value within a range
Categorical variables can be further broken down into
- ordinal: levels have a natural ordering
- nominal: levels do not have a natural ordering

Variable types. From IMS2 Fig. 1.1.

Six loans from the loan50 dataset
loan_amount	interest_rate	term	grade	state	total_income	homeownership
22000	10.90	60	B	NJ	59000	rent
6000	9.92	36	B	CA	60000	rent
25000	26.30	36	E	SC	75000	mortgage
6000	9.92	36	B	CA	75000	rent
25000	9.43	60	B	OH	254000	mortgage
6400	9.92	36	B	IN	67000	mortgage

Loan data variable types:

The relationship between two numerical variables can be visualized using a scatterplot

Scatterplot showing loan amount vs. income.

Two numerical variables are said to have a positive association if the values of one variable tend to be higher when the values of the other variable are higher
Two numerical variables are said to have a negative association if the values of one variable tend to be lower when the values of the other variable are higher
Loan amount and total income appear to be positively associated

Researchers designed a study to study the effectiveness of stents in preventing strokes
451 at-risk patients randomly assigned to receive stent (treatment) or not (control)
Outcome (“stroke” or “no event”) recorded after 365 days

Summary of results for stent study
Group	Stroke	No Event
control	28	199
treatment	45	179

In some cases we may think that one variable affects the other variable
We say that the explanatory variable affects the response variable
In the stent study, the group (treatment or control) is the explanatory variable, and the outcome is the response variable

An experiment is a study in which researchers manipulate or assign the values of the explanatory variable
The stent study is an experiment, because the researchers assign patients to the treatment or control group
When cases are randomly assigned to groups, the study is called a randomized experiment
An observational study is a study without such manipulation. The cases are observed as they are
The Lending Club data are from an observational study