Chapter 01
Probability vs. Statistics Cont’d.
What is a Sample?
A sample is a smaller, more manageable representation of a larger
group: a subset of a population that reflects the characteristics of
that population.
[Figure: a sample drawn as a subset of a population]
What is a Sample? Cont’d.
Samples are used when:
The population is too large to collect data from every member.
The data collected is not reliable.
The population is hypothetical and is unlimited in size.
Take the example of a study that documents the results of a new
medical procedure. It is unknown how the procedure will affect
people across the globe, so a test group is used to find out how
people react to it.
Descriptive vs. Inferential Statistics
There are two main branches in the field of statistics:
① Descriptive statistics aims to describe a set of raw data using
summary statistics, graphs, and tables.
Suppose we have a set of raw data that shows the test scores of 1000
students at a particular school. We might be interested in the average
test score along with the distribution of test scores.
② Inferential statistics uses a small sample of data to draw inferences
about the larger population that the sample came from.
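The contrast between the two branches can be sketched in a few lines of Python. This is only an illustration: the population of 1000 test scores is randomly generated (the slides give no actual scores), and the sample size of 50 is an arbitrary assumption.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: made-up test scores for 1000 students.
population = [random.randint(40, 100) for _ in range(1000)]

# Descriptive statistics: summarize the data we actually have.
population_mean = statistics.mean(population)

# Inferential statistics: use a small sample to estimate the population mean.
sample = random.sample(population, 50)  # sample size 50 is an assumption
sample_mean = statistics.mean(sample)

print(f"population mean    = {population_mean:.2f}")
print(f"sample mean (n=50) = {sample_mean:.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean; quantifying that gap is the job of inferential statistics.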
Descriptive vs. Inferential Statistics Cont’d.
The relationship between probability and inferential statistics can be
summarized by saying that probability reasons from the population to the
sample (deductive reasoning), whereas inferential statistics reasons from
the sample to the population (inductive reasoning).
Pictorial and Tabular Methods in Descriptive Statistics
Stem-and-Leaf displays
Consider a numerical dataset 𝑥₁, 𝑥₂, …, 𝑥ₙ for which each 𝑥ᵢ consists
of at least two digits. A quick way to obtain an informative visual
representation of the dataset is to construct a stem-and-leaf display.
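A minimal sketch of how such a display can be built, assuming non-negative integer data where the last digit is the leaf and the remaining digits form the stem. The function name `stem_and_leaf` and the data values are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (all but the last digit).

    Returns {stem: [leaves]} with stems and leaves in increasing order.
    """
    display = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 23 -> stem 2, leaf 3
        display[stem].append(leaf)
    return dict(display)

# Hypothetical data, not taken from the slides.
data = [12, 15, 21, 23, 23, 30, 34, 47]
for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

This sketch omits stems that have no leaves; a hand-drawn display would normally list every stem in the range so that gaps in the data stay visible.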
Stem-and-Leaf displays Cont’d.
Example. The average number of hours of sleep per day over a two-week
period for a sample of 253 college students.
Dotplots Cont’d.
Note. A dotplot can be cumbersome to construct and can look crowded when
the number of observations is large. Now, let’s look at other methods!
Histograms
A numerical variable is discrete if its set of possible values either is
finite or else can be listed in an infinite sequence (one in which there is
a first number, a second number, and so on).
A discrete variable 𝑥 almost always results from counting, in which case
possible values are 0, 1, 2, 3, … or some subset of these integers.
A numerical variable is continuous if its possible values consist of an
entire interval on the number line.
Continuous variables arise from making measurements. For example, if 𝑥 is
the pH of a chemical substance, then in theory 𝑥 could be any number
between 0 and 14, e.g., 7.0, 7.03, 7.032, and so on.
Histograms Cont’d.
Consider data consisting of observations on a discrete variable 𝑥. The
frequency of any 𝑥 value is the number of times that value occurs in the
dataset. The relative frequency of a value is the fraction or proportion of
times the value occurs:
relative frequency of a value = (number of times the value occurs) / (number of observations in the dataset)
Example. Suppose that our dataset consists of 200 observations
(students) on 𝑥 = the number of courses a college student is taking this
term. If 70 of these 𝑥 values are 3, then:
Histograms Cont’d.
Frequency of the 𝑥 value 3: 70
Relative frequency of the 𝑥 value 3: 70/200 = 0.35
Multiplying a relative frequency by 100 gives a percentage; in the
college-course example, 35% of the students in the sample are taking
three courses. The relative frequencies, or percentages, are usually of
more interest than the frequencies themselves.
In theory, the relative frequencies should sum to 1, but in practice the
sum may differ slightly from 1 because of rounding.
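The relative-frequency calculation can be sketched as follows. Only the 70 observations equal to 3 out of 200 come from the example; the other counts (80 fours and 50 fives) are made up so that the dataset has 200 values, and the function name `relative_frequencies` is illustrative.

```python
from collections import Counter

def relative_frequencies(observations):
    """Map each distinct value to (times it occurs) / (number of observations)."""
    counts = Counter(observations)
    n = len(observations)
    return {value: count / n for value, count in sorted(counts.items())}

# 70 threes as in the example; the fours and fives are hypothetical filler.
observations = [3] * 70 + [4] * 80 + [5] * 50

rel = relative_frequencies(observations)
print(f"relative frequency of 3: {rel[3]:.2f}")        # 70/200
print(f"percentage taking 3 courses: {rel[3] * 100:.0f}%")
print(f"sum of relative frequencies: {sum(rel.values()):.2f}")
```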
Histogram Shapes
Histograms come in a variety of shapes. A unimodal histogram is one
that rises to a single peak and then declines. A bimodal histogram has two
different peaks. Bimodality can occur when the dataset consists of
observations on two quite different kinds of individuals or objects.
Example. Consider a large dataset consisting of driving times for cars
traveling between San Luis Obispo, California, and Monterey, California
(exclusive of stopping time for sightseeing, eating, etc.). This histogram
would show two peaks: one for those cars that took the inland route
(roughly 2.5 hours) and another for those cars traveling up the coast
(3.5 to 4 hours).
A histogram with more than two peaks is said to be multimodal.
Histogram Shapes Cont’d.
A histogram is symmetric if the left half is a mirror image of the right half (b).
A unimodal histogram is positively skewed if the stretching is to the right (c)
and negatively skewed if the stretching is to the left (a).
Skewness simply reflects a dataset in which observations are heavily concentrated in one range and less concentrated in another.
Histogram Shapes Cont’d.
Draw a histogram to represent the following data: 5, 3, 3, 6, 4, 3, 5, 4, 7, 3, 3, 5,
3, 6, 4, 3, 4. Then draw a histogram to represent the following data: 7, 4, 6, 7,
5, 7, 6, 3, 4, 7, 5, 6, 6, 7, 7, 5, 7.
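One way to check a hand-drawn answer is a rough text histogram, with one row per value and one asterisk per observation. This is only a sketch; the helper name `text_histogram` is hypothetical.

```python
from collections import Counter

def text_histogram(data):
    """Return one line per integer value from min to max, e.g. '3 | ***'."""
    counts = Counter(data)
    return [
        f"{value} | {'*' * counts[value]}"
        for value in range(min(data), max(data) + 1)
    ]

first = [5, 3, 3, 6, 4, 3, 5, 4, 7, 3, 3, 5, 3, 6, 4, 3, 4]
second = [7, 4, 6, 7, 5, 7, 6, 3, 4, 7, 5, 6, 6, 7, 7, 5, 7]

print("\n".join(text_histogram(first)))
print()
print("\n".join(text_histogram(second)))
```

The first dataset peaks at 3 and tails off toward larger values; the second peaks at 7 and tails off toward smaller values.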
Histogram Shapes Cont’d.
Right-skewed histogram:
Also known as a positively skewed histogram.
Mean > Median > Mode.
The peak of the graph lies on the left side of the center.
Left-skewed histogram:
Also known as a negatively skewed histogram.
Mean < Median < Mode.
The peak of the graph lies on the right side of the center.
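The ordering of mean, median, and mode can be checked numerically on the two small datasets from the histogram exercise. This is a sketch using Python's standard `statistics` module; the ordering is typical of skewed data rather than guaranteed for every dataset.

```python
import statistics

right_skewed = [5, 3, 3, 6, 4, 3, 5, 4, 7, 3, 3, 5, 3, 6, 4, 3, 4]
left_skewed = [7, 4, 6, 7, 5, 7, 6, 3, 4, 7, 5, 6, 6, 7, 7, 5, 7]

for name, data in (("right-skewed", right_skewed), ("left-skewed", left_skewed)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    mode = statistics.mode(data)
    print(f"{name}: mean = {mean:.2f}, median = {median}, mode = {mode}")
```

For the first dataset the mean (about 4.18) exceeds the median (4), which exceeds the mode (3); for the second the order is reversed (about 5.82, 6, and 7).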
Measures of Location
The Mean
For a given set of numbers 𝑥₁, 𝑥₂, …, 𝑥ₙ, the most familiar and useful
measure of the center is the mean, or arithmetic average, of the set. We will
often refer to the arithmetic average as the sample mean and denote it by x̄.
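Written out, the standard definition of the sample mean of 𝑛 observations is the sum of the observations divided by 𝑛:

```latex
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i
```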
The median
The Variance
Boxplots
Boxplots Cont’d.
Find the median, lower quartile and upper quartile of the following
numbers: 12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25.
First, arrange the data in ascending order:
5, 7, 12, 14, 15, 22, 25, 30, 36, 42, 53
Median (middle value) = 22
Lower Quartile (middle value of the lower half) = 12
Upper Quartile (middle value of the upper half) = 36
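The steps above can be sketched in Python. The sketch assumes the convention used in this example, where the median is left out of both halves when the number of observations is odd; other conventions (for instance, the interpolation used by `statistics.quantiles`) can give different quartiles. The function name is illustrative.

```python
import statistics

def quartiles_excluding_median(data):
    """Return (lower quartile, median, upper quartile).

    The halves exclude the middle value when len(data) is odd.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )

numbers = [12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25]
q1, median, q3 = quartiles_excluding_median(numbers)
print(q1, median, q3)  # 12 22 36
```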
Boxplots Cont’d.
Example. The following is a sample of TN (total nitrogen) loads (kg
N/day) from a particular location, displayed in increasing order.
Boxplots Cont’d.
Relevant summary quantities are:
x̃ (median) = 92.17   lower 4th = 45.64   upper 4th = 167.79
𝑓ₛ = 122.15   1.5𝑓ₛ = 183.225   3𝑓ₛ = 366.45
Subtracting 1.5𝑓ₛ from the lower 4th gives a negative number, and none of
the observations are negative, so there are no outliers on the lower end of
the data. However,
upper 4th + 1.5𝑓ₛ = 351.015,   upper 4th + 3𝑓ₛ = 534.24
Thus, the four largest observations (563.92, 690.11, 826.54, and
1529.35) are extreme outliers, and 352.09, 371.47, 444.68, and 460.86
are mild outliers.
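The outlier rule used here can be sketched as a small function: an observation is a mild outlier if it lies more than 1.5 fourth spreads beyond the nearest fourth, and an extreme outlier if it lies more than 3 fourth spreads beyond it. The function name `classify` is hypothetical; the fourths and the eight observations are the ones quoted in the example.

```python
def classify(value, lower_fourth, upper_fourth):
    """Label a value using the 1.5*fs (mild) and 3*fs (extreme) cutoffs."""
    fs = upper_fourth - lower_fourth  # fourth spread
    # Distance beyond the nearest fourth; negative if the value is inside the box.
    distance = max(lower_fourth - value, value - upper_fourth)
    if distance > 3 * fs:
        return "extreme outlier"
    if distance > 1.5 * fs:
        return "mild outlier"
    return "not an outlier"

lower_fourth, upper_fourth = 45.64, 167.79
largest = [352.09, 371.47, 444.68, 460.86, 563.92, 690.11, 826.54, 1529.35]
for value in largest:
    print(f"{value:8.2f}  {classify(value, lower_fourth, upper_fourth)}")
```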
Boxplots Cont’d.
When the median is in the middle of the box and the whiskers are about the
same length on both sides of the box, the distribution is symmetric.
When the median is closer to the bottom of the box and the whisker is
shorter on the lower end of the box, the distribution is positively
skewed* (skewed right).
When the median is closer to the top of the box and the whisker is shorter
on the upper end of the box, the distribution is negatively skewed**
(skewed left).
*If your whisker extends out in the direction of the larger numbers, your
data are positively skewed.
**If your whisker extends out to the smaller numbers, your data are
negatively skewed.
Brainstorming
Do you think boxplots are a good choice for multimodal data?
Brainstorming
To see why boxplots are ill-suited for multimodal data, let’s consider an
example. Imagine our data set consisted of these values: 30, 30, 30, 62, 87, 115,
115, 115, 172, 209, 214. In this example, we have two modes: 30 and 115 both
occur three times.
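A short sketch makes the problem concrete: the five numbers a boxplot is drawn from say nothing about how many modes the data have. The quartiles below are computed as the medians of the lower and upper halves with the overall median excluded, which is one of several conventions; `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

data = [30, 30, 30, 62, 87, 115, 115, 115, 172, 209, 214]

ordered = sorted(data)
half = len(ordered) // 2  # 11 values, so the middle value sits at index 5

summary = {
    "min": ordered[0],
    "lower quartile": statistics.median(ordered[:half]),
    "median": statistics.median(ordered),
    "upper quartile": statistics.median(ordered[half + 1:]),
    "max": ordered[-1],
}
modes = statistics.multimode(data)

print(summary)
print("modes:", modes)
```

The summary is min 30, lower quartile 30, median 115, upper quartile 172, max 214, while the modes are 30 and 115. A boxplot drawn from these five numbers would look like that of a single right-skewed group, and the two clusters would go unnoticed.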
Symmetrical Distribution
The heights of males are roughly symmetrically distributed, with no skew.
The average height of a male in the United States is roughly 69.1 inches,
with some men shorter and others taller.
Notice that the vertical line inside the box representing the median is equally close
to the first and third quartiles, which means the distribution is symmetrical and has
no skew.
Right-Skewed Distribution
The distribution of annual household incomes in the United States is right-skewed.
Most households earn between $40k and $80k per year, but there is a long right tail on
the distribution representing households earning much more.
Notice that the vertical line inside the box representing the median is much closer to
the first quartile than the third quartile, meaning the distribution is right-skewed.
Left-Skewed Distribution
The distribution of age at death in most populations is left-skewed. Most people
live to be between 70 and 80 years old, with progressively fewer dying at younger ages.
Notice that the vertical line inside the box representing the median is much closer to
the third quartile than the first, meaning the distribution is left-skewed.
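The rule of thumb used in these three examples, comparing how far the median sits from each quartile, can be sketched as a small function. The function name, the tolerance of 10% of the box width, and the quartile values in the usage lines are all hypothetical; they are not measurements from the examples above.

```python
def skew_from_quartiles(q1, median, q3, tolerance=0.1):
    """Guess the skew direction from the position of the median inside the box.

    `tolerance` is the fraction of the box width (q3 - q1) by which the two
    gaps may differ and still count as roughly symmetric (an assumed value).
    """
    lower_gap = median - q1
    upper_gap = q3 - median
    box = q3 - q1
    if box == 0 or abs(upper_gap - lower_gap) <= tolerance * box:
        return "roughly symmetric"
    return "right-skewed" if upper_gap > lower_gap else "left-skewed"

# Hypothetical quartiles chosen only to exercise each branch.
print(skew_from_quartiles(66, 69, 72))  # roughly symmetric
print(skew_from_quartiles(40, 50, 80))  # right-skewed
print(skew_from_quartiles(60, 75, 80))  # left-skewed
```

This looks only at the box; the whisker lengths and any outliers should be checked as well before drawing a conclusion about skew.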