Statistics is the science of collecting, organizing, analyzing, and interpreting data. In this chapter, we learn about data types, individuals, variables, and datasets. Data can be categorical (qualitative) or numerical (quantitative). Understanding the distinction between these types is crucial, since the choice of graphs and summaries depends on it.
Consider a dataset of 6 students:
| Student | Gender | Eye Color | Height (inches) | Number of Siblings |
|---|---|---|---|---|
| 1 | Male | Blue | 70 | 1 |
| 2 | Female | Brown | 65 | 2 |
| 3 | Male | Green | 72 | 0 |
| 4 | Female | Blue | 60 | 3 |
| 5 | Male | Brown | 68 | 2 |
| 6 | Female | Hazel | 62 | 1 |
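As a rough illustration of the categorical/numerical distinction, the table above can be stored as a list of records and each variable labeled by the kind of value it holds. This is a minimal Python sketch: the field names and the helper `variable_type` are hypothetical, and classifying by Python type is a simplification (a numeric code such as a ZIP code would still be categorical). The student number is treated as an identifier for the individual, not as a variable.

```python
# Each dict is one individual (a student); each key is a variable.
students = [
    {"gender": "Male", "eye_color": "Blue", "height_in": 70, "siblings": 1},
    {"gender": "Female", "eye_color": "Brown", "height_in": 65, "siblings": 2},
    {"gender": "Male", "eye_color": "Green", "height_in": 72, "siblings": 0},
    {"gender": "Female", "eye_color": "Blue", "height_in": 60, "siblings": 3},
    {"gender": "Male", "eye_color": "Brown", "height_in": 68, "siblings": 2},
    {"gender": "Female", "eye_color": "Hazel", "height_in": 62, "siblings": 1},
]


def variable_type(records, name):
    """Label a variable 'numerical' if every value is a number, else 'categorical'."""
    values = [record[name] for record in records]
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numerical"
    return "categorical"


types = {name: variable_type(students, name) for name in students[0]}
for name, kind in types.items():
    print(f"{name}: {kind}")
```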
Study design is about how data are collected. A poor design can lead to bias and misleading conclusions. Important ideas include:
Suppose a university wants to estimate average textbook spending among students. They divide students by class year (freshman, sophomore, junior, senior), then randomly sample 20 students from each year. This is an example of stratified sampling, which ensures all class years are represented.
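The textbook-spending design can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the roster of 100 students per class year is made up, and `stratified_sample` is a hypothetical helper, not a library function. It groups the population by stratum and then draws a simple random sample of fixed size within each group.

```python
import random


def stratified_sample(population, stratum_of, per_stratum, seed=None):
    """Draw a simple random sample of size per_stratum from every stratum."""
    rng = random.Random(seed)
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    # Sampling is done separately, without replacement, inside each stratum.
    return {label: rng.sample(members, per_stratum) for label, members in strata.items()}


# Hypothetical roster: 100 students per class year, identified by (year, id).
years = ["freshman", "sophomore", "junior", "senior"]
roster = [(year, i) for year in years for i in range(100)]

sample = stratified_sample(roster, stratum_of=lambda s: s[0], per_stratum=20, seed=1)
for year in years:
    print(year, len(sample[year]))
```

Because every stratum contributes exactly 20 students, no class year can be left out by chance, which is the property the example relies on.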
Categorical data can be summarized and compared using tables and graphs.
Survey of 60 people about Exercise & Gender:
| | Exercise Regularly | Exercise Sometimes | Exercise Never | Total |
|---|---|---|---|---|
| Male | 12 | 8 | 10 | 30 |
| Female | 15 | 9 | 6 | 30 |
| Total | 27 | 17 | 16 | 60 |
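To compare groups in a two-way table, counts are usually converted to proportions. The following Python sketch, using the survey counts above, computes the marginal distribution of exercise habit and the conditional distribution within each gender. The dictionary layout and variable names are illustrative choices.

```python
# Counts from the exercise survey; rows are gender, columns are exercise habit.
counts = {
    "Male": {"Regularly": 12, "Sometimes": 8, "Never": 10},
    "Female": {"Regularly": 15, "Sometimes": 9, "Never": 6},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal distribution of exercise habit: column totals / grand total.
column_totals = {}
for row in counts.values():
    for habit, n in row.items():
        column_totals[habit] = column_totals.get(habit, 0) + n
marginal = {habit: n / grand_total for habit, n in column_totals.items()}

# Conditional distribution of exercise habit within each gender: row proportions.
conditional = {
    gender: {habit: n / sum(row.values()) for habit, n in row.items()}
    for gender, row in counts.items()
}

print(marginal)
print(conditional["Male"])
print(conditional["Female"])
```

The row proportions make the comparison direct: 12/30 = 40% of men and 15/30 = 50% of women in the survey exercise regularly.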
Numerical data are summarized using measures of center (mean, median) and spread (range, interquartile range, standard deviation). Graphical summaries include histograms, boxplots, and scatterplots.
Dataset of exam scores for 14 students:
Scores: 55, 60, 62, 65, 67, 70, 72, 74, 75, 78, 80, 82, 85, 90
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 55.0 | 65.5 | 73.0 | 72.5 | 79.5 | 90.0 |
Five-number summary: 55, 65, 73, 80, 90
This dataset shows exam scores ranging from 55 to 90. The boxplot summarizes the spread and highlights the near-symmetry of the distribution.
Interpretation:
The middle 50% of scores fall between 65 and 80. The distribution is roughly symmetric: the mean (72.5) and median (73) nearly coincide, and the median sits close to the middle of the box.
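The summaries for the exam scores can be reproduced with Python's `statistics` module. This is a sketch, not the method used to generate the output above. `tukey_five_number` is a hypothetical helper that takes the quartiles as medians of the lower and upper halves, which gives 65 and 80. The interpolating `method="inclusive"` quartiles give 65.5 and 79.5 instead. Both conventions are in common use, so small differences like this are expected.

```python
import statistics

scores = [55, 60, 62, 65, 67, 70, 72, 74, 75, 78, 80, 82, 85, 90]


def tukey_five_number(data):
    """Min, lower hinge, median, upper hinge, max (hinges = medians of each half)."""
    xs = sorted(data)
    half = (len(xs) + 1) // 2  # for odd n the median belongs to both halves
    return (
        xs[0],
        statistics.median(xs[:half]),
        statistics.median(xs),
        statistics.median(xs[-half:]),
        xs[-1],
    )


mean = statistics.mean(scores)
# method="inclusive" interpolates between neighbouring order statistics.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
five = tukey_five_number(scores)

print("mean:", mean)
print("interpolated quartiles:", q1, median, q3)
print("five-number summary:", five)
print("IQR from hinges:", five[3] - five[1])
```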
Regression models relationships between two quantitative variables. In simple linear regression, we model:
\[ y = \beta_0 + \beta_1 x \]
Dataset of hours studied (x) and exam scores (y):
| Hours | Score |
|---|---|
| 2 | 65 |
| 4 | 70 |
| 6 | 75 |
| 8 | 85 |
| 10 | 88 |
| 12 | 95 |
Calculating slope and intercept using the formulas:
\(\bar{x}=7, \; \bar{y}\approx79.67, \; s_x\approx3.74, \; s_y\approx11.52, \; r\approx0.993\).
Thus, \(b_1 = r (s_y/s_x) \approx 3.06\), \(b_0 = \bar{y} - b_1\bar{x} \approx 58.27\).
Interpretation: each additional hour studied is associated with an increase of about 3 points in the predicted score.
Residual example: If a student studied 6 hours and scored 75, the predicted score is \(58.27 + 3.06(6) \approx 76.6\), so the residual (observed minus predicted) is \(75 - 76.6 = -1.6\).
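The same calculation can be carried out from the raw (hours, score) pairs. The Python sketch below follows the formulas used in this section: sample standard deviations with an n - 1 divisor, the correlation r, then slope and intercept. The variable names and the `predict` helper are illustrative.

```python
import statistics

hours = [2, 4, 6, 8, 10, 12]
scores = [65, 70, 75, 85, 88, 95]

n = len(hours)
x_bar = statistics.mean(hours)
y_bar = statistics.mean(scores)
s_x = statistics.stdev(hours)   # sample SD (divides by n - 1)
s_y = statistics.stdev(scores)

# Correlation: sum of products of deviations, scaled by (n - 1) * s_x * s_y.
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores)) / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x        # slope
b0 = y_bar - b1 * x_bar   # intercept


def predict(x):
    """Predicted score for x hours of study on the fitted line."""
    return b0 + b1 * x


residual = 75 - predict(6)  # observed minus predicted for the 6-hour student

print(f"x_bar={x_bar:.2f} y_bar={y_bar:.2f} s_x={s_x:.2f} s_y={s_y:.2f} r={r:.3f}")
print(f"b1={b1:.2f} b0={b0:.2f} residual={residual:.2f}")
```

Carrying full precision through the calculation and rounding only at the end avoids the small drift that comes from plugging rounded summary statistics into the formulas.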
Scatterplot:
Use this as a self-test to make sure you’re ready:
Question 1: A dataset of 18 quiz scores is:
52, 55, 57, 60, 61, 62, 63, 65, 66, 67, 69, 70, 72, 73, 74, 76, 80, 85.
Find the five-number summary and draw a boxplot.
Question 2: A dataset of 20 students’ heights is collected. Construct a histogram and comment on the distribution shape.
Question 3: Identify whether each of the following variables is categorical or numerical:
(a) Favorite color
(b) Annual income
(c) Age group (child, teen, adult)
(d) Number of books read per year
Question 4: Using the contingency table below, answer parts (a) and (b).
| | Like Cats | Like Dogs | Both | Total |
|---|---|---|---|---|
| Male | 10 | 15 | 5 | 30 |
| Female | 12 | 14 | 4 | 30 |
| Total | 22 | 29 | 9 | 60 |
Question 5: A dataset of 30 household incomes is presented on a histogram. Describe its shape.
Question 6: The table shows students’ preferred study methods by class year. Create side-by-side bar graphs.
Question 7: Compare two datasets with side-by-side histograms and boxplots. Dataset A: 5,6,6,7,8,9,10. Dataset B: 7,7,8,9,10,11,12.
Question 8: Two datasets are plotted below. Identify which shows a linear trend and which shows a nonlinear trend.
Question 9: Using the dataset below, calculate the slope and intercept using the formulas. Then predict the score for x=8.
| x | 2 | 4 | 6 | 8 |
|---|---|---|---|---|
| y | 60 | 65 | 72 | 80 |
Means: \(\bar{x}=5\), \(\bar{y}=69.25\).
SDs: \(s_x\approx2.58\), \(s_y\approx8.69\).
Correlation \(r \approx 0.995\).
Question 10: Using the regression line from Q9, predict the score for a student who studies 8 hours.
Question 11: For a student who studied 5 hours and scored 72, calculate the residual.
Question 12: Interpret the slope and intercept in the context of this dataset.
Answer 1: Five-number summary: Min=52, Q1=61, Median=66.5, Q3=73, Max=85 (quartiles taken as the medians of the lower and upper nine values). Boxplot shows slight right skew.
Answer 2: Histogram shows approximately symmetric distribution.
Answer 3: (a) categorical, (b) numerical, (c) categorical, (d) numerical.
Answer 4: (a) 14/60 ≈ 0.233. (b) 10/30 ≈ 0.333.
Answer 5: Histogram shows left-skewed distribution.
Answer 6: Side-by-side bar graphs show Seniors prefer Solo method more than Freshmen.
Answer 7: Dataset B is shifted right relative to Dataset A. The boxplots confirm a higher median for B (9 versus 7).
Answer 8: Picture A = linear trend, Picture B = nonlinear trend.
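Answer 9: Means: \(\bar{x}=5\), \(\bar{y}=69.25\). SDs: \(s_x\approx2.58\), \(s_y\approx8.69\). Correlation \(r\approx0.995\).
Slope: \(b_1 = r(s_y/s_x) \approx 3.35\). Intercept: \(b_0 = \bar{y} - b_1\bar{x} \approx 52.5\).
Answer 10: Prediction for 8 hours: \(\hat{y} \approx 52.5 + 3.35(8) = 79.3\).
Answer 11: Predicted at x=5: \(\hat{y} \approx 52.5 + 3.35(5) = 69.25\). Observed = 72, so residual = 72 - 69.25 = +2.75.
Answer 12: Slope: each additional study hour is associated with an increase of about 3.35 points in the predicted score. Intercept: the predicted score is about 52.5 when hours = 0, which is an extrapolation below the observed range of 2 to 8 hours.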
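One way to check the regression questions is to compute the least-squares line directly from the four (x, y) pairs in the Question 9 table. The Python sketch below uses the sums-of-deviations form of the slope, \(b_1 = S_{xy}/S_{xx}\), which is algebraically the same as \(r(s_y/s_x)\). The variable names are illustrative.

```python
xs = [2, 4, 6, 8]
ys = [60, 65, 72, 80]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of squared deviations and cross-products about the means.
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b1 = s_xy / s_xx          # slope
b0 = y_bar - b1 * x_bar   # intercept

pred_8 = b0 + b1 * 8                # Question 10: prediction at 8 hours
residual_5 = 72 - (b0 + b1 * 5)     # Question 11: observed minus predicted

print(f"slope={b1:.2f} intercept={b0:.2f}")
print(f"prediction at x=8: {pred_8:.1f}")
print(f"residual for (5, 72): {residual_5:.2f}")
```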