Chapter 1: Getting to Know Data

Key Concepts

Statistics is the science of collecting, organizing, analyzing, and interpreting data. In this chapter, we learn about data types, individuals, variables, and datasets. Data can be categorical (qualitative) or numerical (quantitative). Understanding the distinction between these types is crucial, since the choice of graphs and summaries depends on it.

Types of variables:
- Categorical (qualitative): categories or groups (e.g., gender, eye color).
  - Nominal: no inherent order (e.g., types of fruit).
  - Ordinal: ranked order (e.g., class rank).
- Numerical (quantitative): numbers with meaningful magnitude.
  - Discrete: countable values, usually whole numbers (e.g., number of siblings).
  - Continuous: any real value in a range (e.g., height, weight).

Example

Consider a dataset of 6 students:

Student	Gender	Eye Color	Height (inches)	Number of Siblings
1	Male	Blue	70	1
2	Female	Brown	65	2
3	Male	Green	72	0
4	Female	Blue	60	3
5	Male	Brown	68	2
6	Female	Hazel	62	1

Gender, Eye Color → categorical (nominal: the categories have no inherent order).
Height → numerical, continuous.
Number of Siblings → numerical, discrete.
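
The link between variable type and choice of summary can be sketched in Python. This is a minimal illustration, not part of the original guide: the dictionary keys (`gender`, `eye_color`, `height_in`, `siblings`) are names chosen here, and the values are copied from the table above.

```python
from collections import Counter
from statistics import mean

# The six students from the table above, one dict per individual.
students = [
    {"gender": "Male", "eye_color": "Blue", "height_in": 70, "siblings": 1},
    {"gender": "Female", "eye_color": "Brown", "height_in": 65, "siblings": 2},
    {"gender": "Male", "eye_color": "Green", "height_in": 72, "siblings": 0},
    {"gender": "Female", "eye_color": "Blue", "height_in": 60, "siblings": 3},
    {"gender": "Male", "eye_color": "Brown", "height_in": 68, "siblings": 2},
    {"gender": "Female", "eye_color": "Hazel", "height_in": 62, "siblings": 1},
]

# Categorical variable: counts per category are the natural summary.
eye_color_counts = Counter(s["eye_color"] for s in students)

# Numerical variables: averages are meaningful.
mean_height = mean(s["height_in"] for s in students)
mean_siblings = mean(s["siblings"] for s in students)

print(dict(eye_color_counts))
print(round(mean_height, 2), mean_siblings)
```

Averaging eye color would be meaningless, which is why the type of a variable has to be identified before choosing a summary.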

Chapter 2: Study Design

Key Concepts

Study design is about how data are collected. A poor design can lead to bias and misleading conclusions. Important ideas include:

Observational studies record data without influencing outcomes.
Experiments apply treatments to individuals and record responses. They can demonstrate causation if well designed.
Sampling methods:
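- Simple random sample (SRS): every possible sample of the chosen size is equally likely, so every individual has an equal chance of selection.
- Stratified sampling: the population is divided into subgroups (strata), and a random sample is taken from each.
- Cluster sampling: the population is divided into clusters, some clusters are randomly selected, and all individuals within them are sampled.
- Convenience sampling: individuals are chosen because they are easy to reach, which often leads to bias.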
Bias types: selection bias, response bias, nonresponse bias.
Randomization and control groups are key principles in experimental design.
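
The difference between a simple random sample and a cluster sample can be sketched in Python. The population here is hypothetical (12 dorm floors of 10 students each, invented for illustration), and the sample sizes are arbitrary choices.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 12 dorm floors (clusters) with 10 students each.
floors = {
    f"floor-{k}": [f"floor-{k}-student-{i}" for i in range(10)]
    for k in range(12)
}
population = [student for members in floors.values() for student in members]

# Simple random sample: every set of 15 students is equally likely.
srs = rng.sample(population, 15)

# Cluster sample: randomly pick 3 floors, then take everyone on those floors.
chosen_floors = rng.sample(sorted(floors), 3)
cluster_sample = [student for floor in chosen_floors for student in floors[floor]]

print(len(srs), len(cluster_sample))
```

The cluster sample is cheaper to collect because only three floors are visited, but its members all come from those floors, whereas the SRS can draw from anywhere in the population.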

Example: Sampling

Suppose a university wants to estimate average textbook spending among students. They divide students by class year (freshman, sophomore, junior, senior), then randomly sample 20 students from each year. This is an example of stratified sampling, which ensures all class years are represented.
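
A minimal Python sketch of this stratified design follows. The roster is hypothetical (100 made-up student IDs per class year), and `stratified_sample` is a helper name introduced here, not something from the course materials.

```python
import random

# Hypothetical roster: 100 student IDs per class year (sizes are made up).
class_years = ["freshman", "sophomore", "junior", "senior"]
roster = {year: [f"{year}-{i}" for i in range(100)] for year in class_years}

def stratified_sample(strata, per_stratum, seed=None):
    """Draw a simple random sample of fixed size from every stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(members, per_stratum)
        for name, members in strata.items()
    }

sample = stratified_sample(roster, per_stratum=20, seed=1)
total = sum(len(chosen) for chosen in sample.values())
print(total)  # 80
```

Because each class year is sampled separately, no year can be left out by chance, which is the guarantee the example describes.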

Chapter 4: Exploring Categorical Data

Key Concepts

Categorical data can be summarized and compared using tables and graphs.

Frequency tables and relative frequency tables display counts or proportions.
Bar plots and side-by-side bar plots visualize categorical distributions.
Contingency tables (two-way tables) show relationships between two categorical variables.
Conditional distributions allow comparisons across groups.

Example

Contingency Table

Survey of 60 people about Exercise & Gender:

	Exercise Regularly	Exercise Sometimes	Exercise Never	Total
Male	12	8	10	30
Female	15	9	6	30
Total	27	17	16	60

Joint probability: P(Female and Exercises Regularly) = 15/60 = 0.25.
Conditional probability: P(Never Exercises | Male) = 10/30 ≈ 0.333.
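
The joint, marginal, and conditional proportions from this table can be computed with a short Python sketch. The function names (`joint`, `marginal_habit`, `conditional`) and the shortened column labels are choices made here for illustration.

```python
# Counts from the Exercise & Gender survey table above.
counts = {
    "Male": {"Regularly": 12, "Sometimes": 8, "Never": 10},
    "Female": {"Regularly": 15, "Sometimes": 9, "Never": 6},
}
grand_total = sum(sum(row.values()) for row in counts.values())

def joint(gender, habit):
    """P(gender and habit): one cell divided by the grand total."""
    return counts[gender][habit] / grand_total

def marginal_habit(habit):
    """P(habit): a column total divided by the grand total."""
    return sum(row[habit] for row in counts.values()) / grand_total

def conditional(habit, given_gender):
    """P(habit | gender): one cell divided by its row total."""
    row = counts[given_gender]
    return row[habit] / sum(row.values())

print(joint("Female", "Regularly"))             # 0.25
print(round(conditional("Never", "Male"), 3))   # 0.333
print(marginal_habit("Regularly"))              # 0.45
```

The only difference between the joint and conditional calculations is the denominator: the grand total (60) for a joint probability, the row total (30) for a probability conditioned on gender.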

Chapter 5: Exploring Numerical Data

Numerical data are summarized using measures of center (mean, median) and spread (range, interquartile range, standard deviation). Graphical summaries include histograms, boxplots, and scatterplots.

Five number summary: minimum, Q1, median, Q3, maximum.
Boxplots visually represent the five number summary. Outliers appear as points beyond “whiskers.”
Histograms show distribution shape (symmetric, skewed, uniform, bimodal).
Skewness: right-skewed has a long right tail, left-skewed has a long left tail.

Example: Five Number Summary

Dataset of exam scores for 14 students:

##  [1] 55 60 62 65 67 70 72 74 75 78 80 82 85 90

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    55.0    65.5    73.0    72.5    79.5    90.0

## [1] 55 65 73 80 90

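This dataset shows exam scores ranging from 55 to 90. The boxplot summarizes the spread and shows that the distribution is roughly symmetric.

Interpretation:
The middle 50% of scores fall between 65 and 80, the quartiles in the five-number summary. (The summary table reports 65.5 and 79.5 instead because it interpolates between data values; both conventions are common.) The distribution is roughly symmetric: the mean (72.5) and median (73) nearly coincide, and the quartiles sit at similar distances from the median.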
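
The two sets of quartiles printed above (65.5 and 79.5 versus 65 and 80) come from different quartile conventions, and a short Python sketch can reproduce both. `tukey_five_number` is a helper written here for illustration. The 1.5 × IQR outlier fences are an assumption, since the guide does not say which rule its boxplots use.

```python
from statistics import median, quantiles

scores = [55, 60, 62, 65, 67, 70, 72, 74, 75, 78, 80, 82, 85, 90]

def tukey_five_number(values):
    """Min, lower hinge, median, upper hinge, max (median-of-halves quartiles)."""
    data = sorted(values)
    half = (len(data) + 1) // 2  # each half includes the median when n is odd
    return (data[0], median(data[:half]), median(data),
            median(data[-half:]), data[-1])

# Interpolated quartiles: the convention that yields 65.5 and 79.5.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

# Median-of-halves quartiles: the convention that yields 65 and 80.
low, hinge_lo, med, hinge_hi, high = tukey_five_number(scores)

# Assumed outlier rule: flag points more than 1.5 * IQR beyond the quartiles.
iqr = hinge_hi - hinge_lo
fences = (hinge_lo - 1.5 * iqr, hinge_hi + 1.5 * iqr)
outliers = [s for s in scores if s < fences[0] or s > fences[1]]

print((q1, q2, q3))
print((low, hinge_lo, med, hinge_hi, high))
print(fences, outliers)
```

Under the assumed rule the fences are 42.5 and 102.5, so no score in this dataset would be drawn as an outlier.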

Chapter 7: Simple Linear Regression

Key Concepts

Regression models relationships between two quantitative variables. In simple linear regression, we model:

\[ y = \beta_0 + \beta_1 x \]

Slope (\(\beta_1\)): the slope of the regression line for the entire population (parameter).
Intercept (\(\beta_0\)): the y-intercept of the regression line for the entire population (parameter).
Formulas for sample estimates of the slope and y-intercept:
\[ b_1 = r \times \frac{s_y}{s_x}, \quad b_0 = \bar{y} - b_1 \bar{x} \]
Residuals: difference between observed and predicted value (\(e_i = y_i - \hat{y}_i\)).
Scatterplots help visualize the relationship.
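
The slope formula can be tied back to the raw data. The shorthand \(S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})\) and \(S_{xx} = \sum (x_i - \bar{x})^2\) is introduced here for convenience and is not notation from the chapter. Assuming the usual sample definitions of \(r\) and \(s_x\) (dividing by \(n-1\)):

\[ r = \frac{S_{xy}}{(n-1)\, s_x s_y}, \qquad s_x^2 = \frac{S_{xx}}{n-1} \quad \Longrightarrow \quad b_1 = r \times \frac{s_y}{s_x} = \frac{S_{xy}}{(n-1)\, s_x^2} = \frac{S_{xy}}{S_{xx}} \]

So the slope depends only on how x and y vary together relative to how much x varies. The factors of \(n-1\) cancel, which is why sample and population standard deviations give the same slope as long as the same convention is used for both variables.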

Example

Dataset of hours studied (x) and exam scores (y):

Hours	Score
2	65
4	70
6	75
8	85
10	88
12	95

Calculating slope and intercept using the formulas:

\(\bar{x}=7, \; \bar{y}\approx79.7, \; s_x\approx3.74, \; s_y\approx11.52, \; r\approx0.993\).
Thus, \(b_1 = r (s_y/s_x) \approx 3.06\), \(b_0 = \bar{y} - b_1\bar{x} \approx 58.3\).

Interpretation: each additional hour studied is associated with an increase of about 3 points in the predicted score.

Residual example: If a student studied 6 hours and scored 75, the predicted score is about 76.6 (from the unrounded coefficients), so the residual is 75 - 76.6 = -1.6.
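
The slope, intercept, and residual can be checked with a short Python sketch that applies the formulas to the six data points without intermediate rounding. The helper names `correlation` and `predict` are introduced here, and the standard deviations are sample standard deviations (dividing by n - 1).

```python
from statistics import mean, stdev

hours = [2, 4, 6, 8, 10, 12]
scores = [65, 70, 75, 85, 88, 95]

def correlation(xs, ys):
    """Pearson correlation r computed from sample standard deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / ((len(xs) - 1) * stdev(xs) * stdev(ys))

r = correlation(hours, scores)
b1 = r * stdev(scores) / stdev(hours)   # slope: b1 = r * (s_y / s_x)
b0 = mean(scores) - b1 * mean(hours)    # intercept: b0 = y_bar - b1 * x_bar

def predict(x):
    """Predicted score for x hours studied."""
    return b0 + b1 * x

residual = 75 - predict(6)  # observed minus predicted at x = 6
print(round(b1, 2), round(b0, 1), round(residual, 1))  # 3.06 58.3 -1.6
```

Carrying full precision through the calculation avoids the small drift that appears when rounded values of r, s_x, and s_y are multiplied together by hand.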

Scatterplot:

Study Checklist

Use this as a self-test to make sure you’re ready:

I can list & define: categorical vs numerical, nominal vs ordinal, discrete vs continuous.
I can distinguish observational studies vs experiments; identify biases & confounding.
I understand explanatory vs response variables.
I can compute and interpret proportions (marginal, joint, conditional).
I can create/read bar charts, segmented bar charts, pie charts.
I can compute and interpret mean, median, variance, standard deviation, IQR.
I can describe distributions (shape, center, spread, outliers).
I can compute and interpret regression line parameters (slope, intercept).
I can calculate residuals; interpret R².
I can do predictions using regression; recognize limits of predictions.
I can critique study designs: where inferences are valid, whether causal claims are supported.

Quiz Questions

Question 1: A dataset of 18 quiz scores is:
52, 55, 57, 60, 61, 62, 63, 65, 66, 67, 69, 70, 72, 73, 74, 76, 80, 85.
Find the five-number summary and draw a boxplot.

Question 2: A dataset of 20 students’ heights is collected. Construct a histogram and comment on the distribution shape.

Question 3: Identify whether each of the following variables is categorical or numerical:
(a) Favorite color
(b) Annual income
(c) Age group (child, teen, adult)
(d) Number of books read per year

Question 4: Using the contingency table below, answer parts (a) and (b).

	Like Cats	Like Dogs	Both	Total
Male	10	15	5	30
Female	12	14	4	30
Total	22	29	9	60

(a) Find the probability that a randomly chosen person is Female and Likes Dogs.
(b) Find the probability that a person Likes Cats given they are Male.

Question 5: A dataset of 30 household incomes is presented on a histogram. Describe its shape.

Question 6: The table shows students’ preferred study methods by class year. Create side-by-side bar graphs.

Question 7: Compare two datasets with side-by-side histograms and boxplots. Dataset A: 5,6,6,7,8,9,10. Dataset B: 7,7,8,9,10,11,12.

Question 8: Two datasets are plotted below. Identify which shows a linear trend and which shows a nonlinear trend.

Question 9: Using the dataset below, calculate the slope and intercept using the formulas.

x	2	4	6	8
y	60	65	72	80

Means: \(\bar{x}=5\), \(\bar{y}=69.25\).
SDs: \(s_x\approx2.58\), \(s_y\approx8.69\).
Correlation \(r \approx 0.995\).

Question 10: Using the regression line from Q9, predict the score for a student who studies 8 hours.

Question 11: For a student who studied 5 hours and scored 72, calculate the residual.

Question 12: Interpret the slope and intercept in the context of this dataset.

Quiz Answers

Answer 1: Five-number summary: Min=52, Q1=61, Median=66.5, Q3=73, Max=85. Boxplot shows slight right skew.

Answer 2: Histogram shows approximately symmetric distribution.

Answer 3: (a) categorical, (b) numerical, (c) categorical, (d) numerical.

Answer 4: (a) 14/60 = 0.233. (b) 10/30 = 0.333.

Answer 5: Histogram shows left-skewed distribution.

Answer 6: Side-by-side bar graphs show Seniors prefer Solo method more than Freshmen.

Answer 7: Dataset B is shifted right compared to A. Boxplots confirm higher median.

Answer 8: Picture A = linear trend, Picture B = nonlinear trend.

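Answer 9: From the data, \(\bar{x}=5, \bar{y}=69.25, s_x\approx2.58, s_y\approx8.69\). Correlation r≈0.995.
Slope: b1 = r(s_y/s_x) ≈ 3.35. Intercept: b0 = 69.25 - 3.35(5) ≈ 52.5.

Answer 10: Prediction for 8 hours: y ≈ 52.5 + 3.35(8) ≈ 79.3.

Answer 11: Predicted at x=5: y ≈ 69.25. Observed=72, residual = 72 - 69.25 = +2.75.

Answer 12: Slope: each additional study hour is associated with an increase of ≈3.35 points in the predicted score. Intercept: the predicted score is ≈52.5 when hours=0, which is an extrapolation because the data only cover 2 to 8 hours.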

Math 115 Test 1 Study Guide
