Chapter 1: Getting to Know Data

Key Concepts

Statistics is the science of collecting, organizing, analyzing, and interpreting data. In this chapter, we learn about data types, individuals, variables, and datasets. Data can be categorical (qualitative) or numerical (quantitative). Understanding the distinction between these types is crucial, since the choice of graphs and summaries depends on it.

  • Types of variables:
    • Categorical (qualitative): categories or groups (e.g., gender, eye color).
      • Nominal: no inherent order (e.g., types of fruit).
      • Ordinal: ranked order (e.g., class rank).
    • Numerical (quantitative): numbers with meaningful magnitude.
      • Discrete: countable values, usually whole numbers (e.g., number of siblings).
      • Continuous: any real value in a range (e.g., height, weight).

Example

Consider a dataset of 6 students:

Student   Gender   Eye Color   Height (inches)   Number of Siblings
1         Male     Blue        70                1
2         Female   Brown       65                2
3         Male     Green       72                0
4         Female   Blue        60                3
5         Male     Brown       68                2
6         Female   Hazel       62                1
  • Gender, Eye Color → categorical.
  • Height → numerical, continuous.
  • Number of Siblings → numerical, discrete.
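
As a minimal sketch of why the distinction matters, the Python snippet below stores the six students from the table together with a hand-written variable-type schema and picks a summary to match: counts for categorical variables, a mean for numerical ones. The names `students`, `VARIABLE_TYPES`, and `summarize` are illustrative assumptions, not part of the course material.

```python
from collections import Counter
from statistics import mean

# The six students from the example table.
students = [
    {"gender": "Male", "eye_color": "Blue", "height_in": 70, "siblings": 1},
    {"gender": "Female", "eye_color": "Brown", "height_in": 65, "siblings": 2},
    {"gender": "Male", "eye_color": "Green", "height_in": 72, "siblings": 0},
    {"gender": "Female", "eye_color": "Blue", "height_in": 60, "siblings": 3},
    {"gender": "Male", "eye_color": "Brown", "height_in": 68, "siblings": 2},
    {"gender": "Female", "eye_color": "Hazel", "height_in": 62, "siblings": 1},
]

# Variable types are declared by hand: how a value is stored does not reveal
# its type (height is recorded in whole inches here but is still continuous).
VARIABLE_TYPES = {
    "gender": "categorical",
    "eye_color": "categorical",
    "height_in": "numerical",
    "siblings": "numerical",
}

def summarize(rows, variable):
    """Return counts for a categorical variable, the mean for a numerical one."""
    values = [row[variable] for row in rows]
    if VARIABLE_TYPES[variable] == "categorical":
        return dict(Counter(values))
    return mean(values)

for name in VARIABLE_TYPES:
    print(name, "->", summarize(students, name))
```

Averaging eye colors would be meaningless, while counting distinct heights would say little about a continuous variable; the schema makes that choice explicit.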

Chapter 2: Study Design

Key Concepts

Study design is about how data are collected. A poor design can lead to bias and misleading conclusions. Important ideas include:

  • Observational studies record data without influencing outcomes.
  • Experiments apply treatments to individuals and record responses. They can demonstrate causation if well designed.
  • Sampling methods:
    • Simple random sample (SRS): every possible sample of a given size has an equal chance of being selected, so every individual is equally likely to be chosen.
    • Stratified sampling: population divided into subgroups (strata), with a random sample taken from each.
    • Cluster sampling: divide population into clusters, randomly select some clusters, then sample all individuals within them.
    • Convenience sampling: select the individuals who are easiest to reach; this often leads to bias.
  • Bias types: selection bias, response bias, nonresponse bias.
  • Randomization and control groups are key principles in experimental design.
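
The difference between a simple random sample and a cluster sample can be sketched in a few lines of Python. The population here (12 student IDs in four dorms) is entirely hypothetical and exists only to make the two selection rules concrete.

```python
import random

# Hypothetical population: 12 student IDs grouped into 4 dorms (clusters).
dorms = {
    "North": [1, 2, 3],
    "South": [4, 5, 6],
    "East": [7, 8, 9],
    "West": [10, 11, 12],
}
population = [sid for members in dorms.values() for sid in members]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Simple random sample: every group of 4 students is equally likely.
srs = rng.sample(population, k=4)

# Cluster sample: randomly choose 2 dorms, then take everyone in them.
chosen_dorms = rng.sample(sorted(dorms), k=2)
cluster_sample = [sid for dorm in chosen_dorms for sid in dorms[dorm]]

print("SRS:", srs)
print("Clusters chosen:", chosen_dorms)
print("Cluster sample:", cluster_sample)
```

In the cluster sample the random step acts on dorms, not on students, so students from the same dorm always enter or leave the sample together.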

Example: Sampling

Suppose a university wants to estimate average textbook spending among students. They divide students by class year (freshman, sophomore, junior, senior), then randomly sample 20 students from each year. This is an example of stratified sampling, which ensures all class years are represented.
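
A minimal sketch of this design in Python, assuming a made-up roster of 100 students per class year (the IDs and the helper `stratified_sample` are hypothetical):

```python
import random

# Hypothetical roster: 100 student IDs per class year (not real data).
class_years = ["freshman", "sophomore", "junior", "senior"]
roster = {
    year: [f"{year}-{i:03d}" for i in range(100)]
    for year in class_years
}

def stratified_sample(strata, per_stratum, rng):
    """Draw a simple random sample of `per_stratum` units from every stratum."""
    return {name: rng.sample(units, k=per_stratum) for name, units in strata.items()}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = stratified_sample(roster, per_stratum=20, rng=rng)

for year, students in sample.items():
    print(year, len(students))
```

Because a separate random draw is made inside each class year, no year can be left out of the sample by chance.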


Chapter 4: Exploring Categorical Data

Key Concepts

Categorical data can be summarized and compared using tables and graphs.

  • Frequency tables and relative frequency tables display counts or proportions.
  • Bar plots and side-by-side bar plots visualize categorical distributions.
  • Contingency tables (two-way tables) show relationships between two categorical variables.
  • Conditional distributions allow comparisons across groups.
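
As an illustration, a frequency table and a relative frequency table can be built with Python's standard library. This sketch reuses the eye colors of the six students from the Chapter 1 example; the variable names are assumptions made for the example.

```python
from collections import Counter

# Eye colors of the six students from the Chapter 1 example.
eye_colors = ["Blue", "Brown", "Green", "Blue", "Brown", "Hazel"]

counts = Counter(eye_colors)
n = len(eye_colors)

# Relative frequency = count / total; the proportions sum to 1.
relative = {color: count / n for color, count in counts.items()}

print(f"{'Eye color':<10}{'Count':>6}{'Proportion':>12}")
for color, count in counts.items():
    print(f"{color:<10}{count:>6}{relative[color]:>12.3f}")
```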

Example

Contingency Table

Survey of 60 people about Exercise & Gender:

Gender   Exercise Regularly   Exercise Sometimes   Exercise Never   Total
Male     12                   8                    10               30
Female   15                   9                    6                30
Total    27                   17                   16               60
  • Probability a randomly chosen person is Female and Exercises Regularly = 15/60 = 0.25.
  • Probability that a person Never Exercises, given that they are Male = 10/30 ≈ 0.333.
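
The two calculations differ only in the denominator: a joint probability divides by the grand total, while a conditional probability divides by the row total of the given group. A short Python sketch, with the hypothetical helper names `joint` and `conditional`:

```python
# Counts from the Exercise & Gender table above.
table = {
    "Male": {"Regularly": 12, "Sometimes": 8, "Never": 10},
    "Female": {"Regularly": 15, "Sometimes": 9, "Never": 6},
}

grand_total = sum(sum(row.values()) for row in table.values())

def joint(gender, habit):
    """P(gender and habit): cell count over the grand total."""
    return table[gender][habit] / grand_total

def conditional(habit, given_gender):
    """P(habit | gender): cell count over that gender's row total."""
    row = table[given_gender]
    return row[habit] / sum(row.values())

print(round(joint("Female", "Regularly"), 3))   # 0.25
print(round(conditional("Never", "Male"), 3))   # 0.333

# Conditional distribution of exercise habit within each gender.
for gender, row in table.items():
    total = sum(row.values())
    print(gender, {habit: round(count / total, 3) for habit, count in row.items()})
```

Comparing the two conditional distributions row by row is what allows a fair comparison between groups, even when the groups differ in size.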

Chapter 5: Exploring Numerical Data

Numerical data are summarized using measures of center (mean, median) and spread (range, interquartile range, standard deviation). Graphical summaries include histograms, boxplots, and scatterplots.

Example: Five Number Summary

Dataset of exam scores for 14 students:

##  [1] 55 60 62 65 67 70 72 74 75 78 80 82 85 90
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    55.0    65.5    73.0    72.5    79.5    90.0
## [1] 55 65 73 80 90

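The exam scores range from 55 to 90. The output above lists the sorted data, then a summary with interpolated quartiles (65.5 and 79.5), then the five-number summary, whose quartiles (65 and 80) are the medians of the lower and upper halves. A boxplot of these scores summarizes the spread and shows a roughly symmetric distribution.

Interpretation:
The middle 50% of scores fall between about 65 and 80. The distribution is roughly symmetric: the mean (72.5) and median (73) nearly coincide, and the minimum and maximum lie at similar distances from the median (18 and 17 points).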
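
The numbers in the output above can be reproduced with a short Python sketch. It assumes the median-of-halves convention for the five-number summary and Python's `statistics.quantiles` with `method="inclusive"` for the interpolated quartiles; `five_number_summary` is a hypothetical helper name.

```python
from statistics import mean, median, quantiles, stdev

scores = [55, 60, 62, 65, 67, 70, 72, 74, 75, 78, 80, 82, 85, 90]

def five_number_summary(data):
    """Min, Q1, median, Q3, max, with quartiles as medians of each half.

    When the number of values is odd, the overall median is left out of
    both halves (one common textbook convention). Needs at least 2 values.
    """
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]
    upper = values[-half:]
    return (values[0], median(lower), median(values), median(upper), values[-1])

print(five_number_summary(scores))

# Interpolated quartiles, which give the 1st Qu. and 3rd Qu. values above.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
print(q1, q2, q3)

print("mean:", mean(scores))
print("range:", max(scores) - min(scores))
print("IQR:", q3 - q1)
print("sample SD:", round(stdev(scores), 2))
```

Different software and textbooks use different quartile rules, so small disagreements such as 65 versus 65.5 are expected and are not errors.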


Chapter 7: Simple Linear Regression

Key Concepts

Regression models relationships between two quantitative variables. In simple linear regression, we model:

\[ y = \beta_0 + \beta_1 x \]

  • Slope (\(\beta_1\)): the change in y for each one-unit increase in x along the regression line for the entire population (parameter).
  • Intercept (\(\beta_0\)): the value of y on the regression line for the entire population when x = 0 (parameter).
  • Formulas for sample estimates of the slope and y-intercept:
    \[ b_1 = r \times \frac{s_y}{s_x}, \quad b_0 = \bar{y} - b_1 \bar{x} \]
  • Residuals: difference between observed and predicted value (\(e_i = y_i - \hat{y}_i\)).
  • Scatterplots help visualize the relationship.

Example

Dataset of hours studied (x) and exam scores (y):

Hours Score
2 65
4 70
6 75
8 85
10 88
12 95

Calculating slope and intercept using the formulas:

\(\bar{x}=7, \; \bar{y}\approx79.7, \; s_x\approx3.74, \; s_y\approx11.52, \; r\approx0.993\).
Thus, \(b_1 = r (s_y/s_x) \approx 3.06\), \(b_0 = \bar{y} - b_1\bar{x} \approx 58.3\).

Interpretation: each additional hour studied increases the predicted score by about 3 points.

Residual example: If a student studied 6 hours and scored 75, the predicted score is about 76.6, so the residual is about 75 - 76.6 = -1.6.
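
The same formulas can be applied directly to the six data points in Python. This is a minimal sketch using only the standard library; `correlation` and `fit_line` are hypothetical helper names, and `statistics.stdev` is the sample standard deviation (divisor n - 1). Results computed this way may differ slightly from hand calculations that round intermediate values.

```python
from math import sqrt
from statistics import mean, stdev

hours = [2, 4, 6, 8, 10, 12]
scores = [65, 70, 75, 85, 88, 95]

def correlation(x, y):
    """Pearson correlation r computed from deviations about the means."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def fit_line(x, y):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    r = correlation(x, y)
    b1 = r * stdev(y) / stdev(x)   # stdev is the sample SD (n - 1 divisor)
    b0 = mean(y) - b1 * mean(x)
    return b0, b1

b0, b1 = fit_line(hours, scores)
predicted = [b0 + b1 * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]

print(f"b1 = {b1:.2f}, b0 = {b0:.2f}")
for x, y, e in zip(hours, scores, residuals):
    print(f"x = {x:2d}  observed = {y}  residual = {e:+.2f}")
```

The residuals of a least-squares line always sum to zero (up to floating-point error), which is a quick sanity check on any fitted slope and intercept.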

Scatterplot: exam score (y) plotted against hours studied (x).


Study Checklist

Use this as a self-test to make sure you’re ready:


Quiz Questions

Question 1: A dataset of 18 quiz scores is:
52, 55, 57, 60, 61, 62, 63, 65, 66, 67, 69, 70, 72, 73, 74, 76, 80, 85.
Find the five-number summary and draw a boxplot.

Question 2: A dataset of 20 students’ heights is collected. Construct a histogram and comment on the distribution shape.

Question 3: Identify whether each of the following variables is categorical or numerical:
(a) Favorite color
(b) Annual income
(c) Age group (child, teen, adult)
(d) Number of books read per year

Question 4: Using the contingency table below, answer parts (a) and (b).

Gender   Like Cats   Like Dogs   Both   Total
Male     10          15          5      30
Female   12          14          4      30
Total    22          29          9      60
  (a) Find the probability that a randomly chosen person is Female and Likes Dogs.
  (b) Find the probability that a person Likes Cats given they are Male.

Question 5: A dataset of 30 household incomes is presented on a histogram. Describe its shape.

Question 6: The table shows students’ preferred study methods by class year. Create side-by-side bar graphs.

Question 7: Compare two datasets with side-by-side histograms and boxplots. Dataset A: 5,6,6,7,8,9,10. Dataset B: 7,7,8,9,10,11,12.

Question 8: Two datasets are plotted below. Identify which shows a linear trend and which shows a nonlinear trend.

Question 9: Using the dataset below, calculate the slope and intercept using the formulas.

x   2    4    6    8
y   60   65   72   80

Means: \(\bar{x}=5\), \(\bar{y}=69.25\).
SDs: \(s_x\approx2.58\), \(s_y\approx8.69\).
Correlation \(r \approx 0.995\).

Question 10: Using the regression line from Q9, predict the score for a student who studies 8 hours.

Question 11: For a student who studied 5 hours and scored 72, calculate the residual.

Question 12: Interpret the slope and intercept in the context of this dataset.


Quiz Answers

Answer 1: Five-number summary: Min=52, Q1=61, Median=66.5, Q3=73, Max=85 (quartiles taken as the medians of the lower and upper halves). Boxplot shows slight right skew.

Answer 2: Histogram shows approximately symmetric distribution.

Answer 3: (a) categorical, (b) numerical, (c) categorical, (d) numerical.

Answer 4: (a) 14/60 = 0.233. (b) 10/30 = 0.333.

Answer 5: Histogram shows left-skewed distribution.

Answer 6: Side-by-side bar graphs show Seniors prefer Solo method more than Freshmen.

Answer 7: Dataset B is shifted right compared to A. Boxplots confirm higher median.

Answer 8: Picture A = linear trend, Picture B = nonlinear trend.

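Answer 9: From the four data points, \(\bar{x}=5, \bar{y}=69.25, s_x\approx2.58, s_y\approx8.69\), and \(r\approx0.995\).
Slope: \(b_1 = r (s_y/s_x) \approx 3.35\). Intercept: \(b_0 = \bar{y} - b_1\bar{x} \approx 52.5\).

Answer 10: Prediction for 8 hours: \(\hat{y} \approx 52.5 + 3.35(8) = 79.3\).

Answer 11: Predicted at x=5: \(\hat{y} \approx 69.25\). Observed=72, so residual = 72 - 69.25 = +2.75.

Answer 12: Slope: each additional study hour is associated with an increase of about 3.35 points in predicted score. Intercept: the predicted score is about 52.5 when hours=0, which lies outside the observed range (2 to 8 hours) and is therefore an extrapolation.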