chapter 1
1 Overview and Descriptive Statistics
2/18/2025
M A A A M A A M A A
Branches of Statistics
An investigator who has collected data may wish simply to
summarize and describe important features of the data.
This entails using methods from descriptive statistics.
Computers are much more efficient than human beings at
calculation and the creation of pictures (once they have
received appropriate instructions from the user!).
Example 1.1
Charity is a big business in the United States. The Web site
charitynavigator.com gives information on roughly 6000
charitable organizations, and there are many smaller
charities that fly below the navigator’s radar screen.
6.1 12.6 34.7 1.6 18.8 2.2 3.0 2.2 5.6 3.8
2.2 3.1 1.3 1.1 14.1 4.0 21.0 6.1 1.3 20.4
7.5 3.9 10.1 8.1 19.5 5.2 12.0 15.8 10.4 5.2
6.4 10.8 83.1 3.6 6.2 6.3 16.3 12.7 1.3 0.8
8.8 5.1 3.7 26.3 6.0 48.0 8.2 11.7 7.2 3.9
15.3 16.6 8.8 12.0 4.7 14.7 6.4 17.0 2.5 16.2
Clearly a substantial majority of the charities in the sample
spend less than 20% on fundraising, and only a few
percentages might be viewed as beyond the bounds of
sensible practice.
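As a rough check of this claim, the sketch below counts how many of the 60 values listed in Example 1.1 fall below 20. It assumes those values are the fundraising percentages this sentence refers to; the variable names are illustrative only.

```python
# The 60 sample values listed in Example 1.1 (assumed here to be each
# charity's fundraising expense as a percentage of total spending).
fundraising_pct = [
    6.1, 12.6, 34.7, 1.6, 18.8, 2.2, 3.0, 2.2, 5.6, 3.8,
    2.2, 3.1, 1.3, 1.1, 14.1, 4.0, 21.0, 6.1, 1.3, 20.4,
    7.5, 3.9, 10.1, 8.1, 19.5, 5.2, 12.0, 15.8, 10.4, 5.2,
    6.4, 10.8, 83.1, 3.6, 6.2, 6.3, 16.3, 12.7, 1.3, 0.8,
    8.8, 5.1, 3.7, 26.3, 6.0, 48.0, 8.2, 11.7, 7.2, 3.9,
    15.3, 16.6, 8.8, 12.0, 4.7, 14.7, 6.4, 17.0, 2.5, 16.2,
]

# Count the observations strictly below 20 and convert to a proportion.
below_20 = sum(1 for p in fundraising_pct if p < 20)
share_below_20 = below_20 / len(fundraising_pct)

print(f"{below_20} of {len(fundraising_pct)} values are below 20 "
      f"({share_below_20:.0%})")
```

With these values, 54 of the 60 observations (90%) are below 20, which is consistent with the "substantial majority" described above.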
For example, the Nov. 23, 2009, New York Times reported
in an article “Behind Cancer Guidelines, Quest for Data”
that the new science for cancer investigations and more
sophisticated methods for data analysis spurred the U.S.
Preventive Services Task Force to re-examine guidelines for
how frequently middle-aged and older women should have
mammograms.
Enumerative Versus Analytic Studies
Collecting Data
Statistics deals not only with the organization and analysis
of data once it has been collected but also with the
development of techniques for collecting the data. If data is
not properly collected, an investigator may not be able to
answer the questions under consideration with a
reasonable degree of confidence.
The most systematic information of this sort comes from
placing monitoring devices in a small number of homes
across the United States. It has been conjectured that
placement of such devices in and of itself alters viewing
behavior, so that characteristics of the sample may be
different from those of the target population.
For example, if the frame consists of 1,000,000 serial
numbers, the numbers 1, 2, . . . , up to 1,000,000 could be
placed on identical slips of paper. After placing these slips
in a box and thoroughly mixing, slips could be drawn one
by one until the requisite sample size has been obtained.
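The slips-in-a-box procedure can be imitated in software. The sketch below is one possible way to do it, not a prescribed method: it uses Python's standard `random.sample`, which draws without replacement. The sample size of 25 and the seed are hypothetical, since the text does not specify them.

```python
import random

frame_size = 1_000_000   # serial numbers 1, 2, ..., 1,000,000 (from the text)
sample_size = 25         # hypothetical; the text leaves the sample size open

# A fixed seed is used only so that this sketch is reproducible.
rng = random.Random(2025)

# random.sample draws without replacement, like pulling slips from the box:
# every subset of the requested size is equally likely to be chosen.
sample = rng.sample(range(1, frame_size + 1), sample_size)

print(sorted(sample))
```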
Sometimes alternative sampling methods can be used to
make the selection process easier, to obtain extra
information, or to increase the degree of confidence in
conclusions. One such method, stratified sampling, entails
separating the population units into nonoverlapping groups
and taking a sample from each one.
This would result in information separately from each
specialty and ensure that no one specialty is over- or
underrepresented in the entire sample.
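A minimal sketch of stratified sampling follows. The specialty names, unit IDs, and the choice of an equal sample size per stratum are all hypothetical; they are there only to illustrate splitting the population into nonoverlapping groups and sampling from each one.

```python
import random


def stratified_sample(strata, per_stratum, seed=None):
    """Draw a simple random sample of size per_stratum from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, per_stratum)
            for name, units in strata.items()}


# Hypothetical frame: unit IDs grouped into nonoverlapping specialties.
strata = {
    "cardiology": [f"C{i}" for i in range(1, 41)],
    "pediatrics": [f"P{i}" for i in range(1, 121)],
    "surgery": [f"S{i}" for i in range(1, 61)],
}

# Equal allocation (an assumption): five units from every specialty.
samples = stratified_sample(strata, per_stratum=5, seed=1)

for name, chosen in samples.items():
    print(name, chosen)
```

Other allocation rules, such as sampling each stratum in proportion to its size, would only change how `per_stratum` is chosen for each group.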
If the bricks on the top and sides of the stack were
somehow different from the others, resulting sample data
would not be representative of the population.
Engineers and scientists often collect data by carrying out
some sort of designed experiment. This may involve
deciding how to allocate several different treatments (such
as fertilizers or coatings for corrosion protection) to the
various experimental units (plots of land or pieces of pipe).
Example 1.4
An article in the New York Times (Jan. 27, 1987) reported
that heart attack risk could be reduced by taking aspirin.
This conclusion was based on a designed experiment
involving both a control group of individuals that took a
placebo having the appearance of aspirin but known to be
inert and a treatment group that took aspirin according to a
specified regimen.
Notation
Some general notation will make it easier to apply our
methods and formulas to a wide variety of practical
problems.
An experiment to compare thermal efficiencies for two
different types of diesel engines might result in samples
{29.7, 31.6, 30.9} and {28.7, 29.5, 29.4, 30.3}, in which
case m = 3 and n = 4.
In many applications, x1 will be the first observation
gathered by the experimenter, x2 the second, and so on.
The ith observation in the data set will be denoted by xi.
Stem-and-Leaf Displays
Consider a numerical data set x1, x2, x3,…, xn for which
each xi consists of at least two digits. A quick way to obtain
an informative visual representation of the data set is to
construct a stem-and-leaf display.
If the data set consists of exam scores, each between 0
and 100, the score of 83 would have a stem of 8 and a leaf
of 3.
If all exam scores are in the 90s, 80s, and 70s, use of the
tens digit as the stem would give a display with only three rows.
In this case, it is desirable to stretch the display by
repeating each stem value twice (9H, 9L, 8H, ..., 7L), once
for high leaves 9, ..., 5 and again for low leaves 4, ..., 0.
Then a score of 93 would have a stem of 9L and a leaf of 3.
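The stem and leaf rules just described can be sketched in a few lines. This is an illustrative helper, not a standard library routine: the function name `stem_and_leaf`, the `split` flag, and the six exam scores are hypothetical. It assumes integer scores whose tens digit serves as the stem.

```python
def stem_and_leaf(scores, split=False):
    """Return the rows of a stem-and-leaf display, largest stem first.

    With split=True each stem appears twice: H for leaves 5-9 and
    L for leaves 0-4, as in the stretched display described above.
    """
    suffixes = ["H", "L"] if split else [""]
    leaves = {}
    for s in sorted(scores):
        stem, leaf = divmod(s, 10)          # 83 -> stem 8, leaf 3
        suffix = ("H" if leaf >= 5 else "L") if split else ""
        leaves.setdefault((stem, suffix), []).append(leaf)

    lines = []
    # Include every stem between the largest and smallest, even if empty.
    for stem in range(max(scores) // 10, min(scores) // 10 - 1, -1):
        for suffix in suffixes:
            row = " ".join(str(d) for d in leaves.get((stem, suffix), []))
            lines.append(f"{stem}{suffix} | {row}".rstrip())
    return lines


# Hypothetical exam scores, all in the 70s, 80s, and 90s.
display = stem_and_leaf([93, 97, 88, 71, 75, 93], split=True)
print("\n".join(display))
```

In this sketch the two scores of 93 both land on the 9L row, matching the rule stated above.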
Example 1.6
A common complaint among college students is that they
are getting less sleep than they need.
Stem-and-Leaf Displays
A stem-and-leaf display conveys information about the
following aspects of the data:
Dotplots
A dotplot is an attractive summary of numerical data when
the data set is reasonably small or there are relatively few
distinct data values. Each observation is represented by a
dot above the corresponding location on a horizontal
measurement scale.
Example 1.8
There is growing concern in the U.S. that not enough
students are graduating from college. America used to be
number 1 in the world for the percentage of adults with
college degrees, but it has recently dropped to 16th. Here
is data on the percentage of 25- to 34-year-olds in each
state who had some type of postsecondary degree as of
2010 (listed in alphabetical order, with the District of
Columbia included):
31.5 32.9 33.0 28.6 37.9 43.3 45.9 37.2 68.8 36.2 35.5
40.5 37.2 45.3 36.1 45.5 42.3 33.3 30.3 37.2 45.5 54.3
37.2 49.8 32.1 39.3 40.3 44.2 28.4 46.0 47.2 28.7 49.6
37.6 50.8 38.0 30.8 37.6 43.9 42.5 35.2 42.2 32.8 32.2
38.5 44.5 44.6 40.9 29.5 41.3 35.4
The overall percentage for the entire country is 39.3%; this
is not a simple average of the 51 numbers but an average
weighted by population sizes.
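A text-mode dotplot of the 51 percentages in Example 1.8 can be sketched as follows. Two choices here are assumptions made for illustration: each observation is stacked on the integer part of its value, and the plot is rotated so that the measurement scale runs down the page instead of across it.

```python
import math
from collections import Counter

# The 51 percentages listed in Example 1.8 (50 states plus D.C.).
pct = [
    31.5, 32.9, 33.0, 28.6, 37.9, 43.3, 45.9, 37.2, 68.8, 36.2, 35.5,
    40.5, 37.2, 45.3, 36.1, 45.5, 42.3, 33.3, 30.3, 37.2, 45.5, 54.3,
    37.2, 49.8, 32.1, 39.3, 40.3, 44.2, 28.4, 46.0, 47.2, 28.7, 49.6,
    37.6, 50.8, 38.0, 30.8, 37.6, 43.9, 42.5, 35.2, 42.2, 32.8, 32.2,
    38.5, 44.5, 44.6, 40.9, 29.5, 41.3, 35.4,
]

# One dot per observation, stacked on the integer part of the value.
counts = Counter(math.floor(p) for p in pct)

# One row per scale position from the smallest to the largest value;
# positions with no observations get an empty row.
rows = [f"{v:>3} | {'.' * counts[v]}"
        for v in range(min(counts), max(counts) + 1)]
print("\n".join(rows))
```

The long gap before the single dot at 68 makes the most extreme value stand out immediately.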
Histograms
Some numerical data is obtained by counting to determine
the value of a variable (the number of traffic citations a
person received during the last year, the number of
customers arriving for service during a particular period),
whereas other data is obtained by taking measurements
(weight of an individual, reaction time to a particular
stimulus).
Definition
A numerical variable is discrete if its set of possible values
either is finite or else can be listed in an infinite sequence
(one in which there is a first number, a second number, and
so on). A numerical variable is continuous if its possible
values consist of an entire interval on the number line.
Of course, in practice there are limitations on the degree of
accuracy of any measuring instrument, so we may not be
able to determine pH, reaction time, height, and
concentration to an arbitrarily large number of decimal
places.
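The relative frequency of a value is the fraction or
proportion of times the value occurs:
relative frequency of a value = (number of times the value occurs)/(number of observations in the data set)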
Multiplying a relative frequency by 100 gives a percentage;
in the college-course example, 35% of the students in the
sample are taking three courses.
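The frequency and relative frequency calculations can be sketched as below. The 20 course counts are hypothetical values, chosen so that the relative frequency of three courses comes out to .35 as in the percentage quoted above; they are not the data from the college-course example.

```python
from collections import Counter

# Hypothetical sample: number of courses taken by each of 20 students.
courses = [3, 4, 3, 5, 2, 3, 4, 4, 3, 1, 3, 5, 4, 3, 2, 4, 3, 6, 4, 5]

n = len(courses)
freq = Counter(courses)                       # frequency of each value

# Relative frequency = frequency / number of observations.
rel_freq = {value: count / n for value, count in sorted(freq.items())}

for value, rf in rel_freq.items():
    print(f"{value} courses: frequency {freq[value]:2d}, "
          f"relative frequency {rf:.2f}, percentage {100 * rf:.0f}%")
```

The relative frequencies always sum to 1, up to rounding, which is a quick check on a hand-built frequency table.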
Example 1.9
How unusual is a no-hitter or a one-hitter in a major league
baseball game, and how frequently does a team get more
than 10, 15, or even 20 hits?
= .0155
Similarly,
= .6361
Example 1.10
This resulted in the accompanying data (part of the stored
data set FURNACE.MTW available in Minitab), which we
have ordered from smallest to largest.
The most striking feature of the histogram in Figure 1.8 is
its resemblance to a bell-shaped curve, with the point of
symmetry roughly at 10.
Histograms
Equal-width classes may not be a sensible choice if there
are some regions of the measurement scale that have a
high concentration of data values and other parts where
data is quite sparse.
If a large number of equal-width classes are used, many
classes will have zero frequency. A sound choice is to use
a few wider intervals near extreme observations and
narrower intervals in the region of high concentration.
Example 1.11
Corrosion of reinforcing steel is a serious problem in
concrete structures located in environments affected by
severe weather conditions.
Consider the following 48 observations on measured bond
strength:
The resulting histogram appears in Figure 1.10. The right
or upper tail stretches out much farther than does the left or
lower tail—a substantial departure from symmetry.
Histograms
When class widths are unequal, not using a density scale
will give a picture with distorted areas.
Multiplying both sides of the formula for density by the
class width gives
relative frequency = (class width)(density) = (rectangle width)(rectangle height) = rectangle area
It is always possible to draw a histogram so that the area
equals the relative frequency (this is true also for a
histogram of discrete data)—just use the density scale.
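The density scale can be illustrated with the 60 values from Example 1.1, which are concentrated at the low end and sparse at the high end. The class boundaries below are hypothetical, picked only to give unequal widths; the density is taken to be relative frequency divided by class width, so each rectangle's area equals its relative frequency.

```python
from bisect import bisect_right

# The 60 sample values listed in Example 1.1.
fundraising_pct = [
    6.1, 12.6, 34.7, 1.6, 18.8, 2.2, 3.0, 2.2, 5.6, 3.8,
    2.2, 3.1, 1.3, 1.1, 14.1, 4.0, 21.0, 6.1, 1.3, 20.4,
    7.5, 3.9, 10.1, 8.1, 19.5, 5.2, 12.0, 15.8, 10.4, 5.2,
    6.4, 10.8, 83.1, 3.6, 6.2, 6.3, 16.3, 12.7, 1.3, 0.8,
    8.8, 5.1, 3.7, 26.3, 6.0, 48.0, 8.2, 11.7, 7.2, 3.9,
    15.3, 16.6, 8.8, 12.0, 4.7, 14.7, 6.4, 17.0, 2.5, 16.2,
]

# Hypothetical unequal-width classes: [0,5), [5,10), [10,20), [20,40), [40,100).
boundaries = [0, 5, 10, 20, 40, 100]

# Count the observations in each class; a value on a boundary goes
# into the class that starts at that boundary.
counts = [0] * (len(boundaries) - 1)
for p in fundraising_pct:
    counts[bisect_right(boundaries, p) - 1] += 1

n = len(fundraising_pct)
widths = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]
rel_freq = [c / n for c in counts]

# Rectangle height on the density scale.
density = [rf / w for rf, w in zip(rel_freq, widths)]

# Area of each rectangle = width * height = relative frequency.
total_area = sum(d * w for d, w in zip(density, widths))

for lo, hi, c, d in zip(boundaries, boundaries[1:], counts, density):
    print(f"[{lo:>3}, {hi:>3}): count {c:2d}, density {d:.4f}")
print("total area:", round(total_area, 10))
```

Plotting the raw counts instead would make the wide class from 40 to 100 look far more important than its two observations warrant; on the density scale its rectangle is correspondingly short.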
Histogram Shapes
Histograms come in a variety of shapes. A unimodal
histogram is one that rises to a single peak and then
declines. A bimodal histogram has two different peaks.
This histogram would show two peaks: one for those cars
that took the inland route (roughly 2.5 hours) and another
for those cars traveling up the coast (3.5–4 hours).
A histogram with more than two peaks is said to be
multimodal. Of course, the number of peaks may well
depend on the choice of class intervals, particularly with a
small number of observations. The larger the number of
classes, the more likely it is that bimodality or multimodality
will manifest itself.
Example 1.12
Figure 1.11(a) shows a Minitab histogram of the weights
(lb) of the 124 players listed on the rosters of the San
Francisco 49ers and the New England Patriots (teams the
author would like to see meet in the Super Bowl) as of Nov.
20, 2009.
Figure 1.12 Smoothed histograms
Qualitative Data
Both a frequency distribution and a histogram can be
constructed when the data set is qualitative (categorical) in
nature.
Example 1.13
The Public Policy Institute of California carried out a
telephone survey of 2501 California adult residents during
April 2006 to ascertain how they felt about various aspects
of K-12 public education. One question asked was “Overall,
how would you rate the quality of public schools in your
neighborhood today?”
Table 1.2 Frequency Distribution for the School Rating Data
Figure 1.13 Histogram of the school rating data from Minitab
Multivariate Data
Multivariate data is generally rather difficult to describe
visually. Several methods for doing so appear later in the
book, notably scatter plots for bivariate numerical data.
Measures of Location
Visual summaries of data are excellent tools for obtaining
preliminary impressions and insights. More formal data
analysis often requires the calculation and interpretation of
numerical summary measures.
Suppose, then, that our data set is of the form
x1, x2, ..., xn, where each xi is a number. What features of
such a set of numbers are of most interest and deserve
emphasis? One important characteristic of a set of
numbers is its location, and in particular its center.
The Mean
For a given set of numbers x1, x2, ..., xn, the most familiar
and useful measure of the center is the mean, or arithmetic
average of the set. Because we will almost always think of
the xi’s as constituting a sample, we will often refer to the
arithmetic average as the sample mean and denote it by x̄.
Example 1.14
Recent years have seen growing commercial interest in the
use of what is known as internally cured concrete.
A physical interpretation of x̄ demonstrates how it
measures the location (center) of a sample. Think of
drawing and scaling a horizontal measurement axis, and
then represent each sample observation by a 1-lb weight
placed at the corresponding point on the axis.
Just as x̄ represents the average value of the observations
in a sample, the average of all values in the population can
be calculated. This average is called the population mean
and is denoted by the Greek letter µ. When there are N
values in the population (a finite population), then
µ = (sum of the N population values)/N.
In the chapters on statistical inference, we will present
methods based on the sample mean for drawing
conclusions about a population mean.
The mean suffers from one deficiency that makes it an
inappropriate measure of center under some
circumstances: Its value can be greatly affected by the
presence of even a single outlier (unusually large or small
observation).
When sampling from such a population (a normal or bell-
shaped population being the most important example), the
sample mean will tend to be stable and quite representative
of the sample.
The Median
The word median is synonymous with “middle,” and the
sample median is indeed the middle value once the
observations are ordered from smallest to largest.
Example 1.15
People not familiar with classical music might tend to
believe that a composer’s instructions for playing a
particular piece are so specific that the duration would not
depend at all on the performer(s).
62.3 62.8 63.6 65.2 65.7 66.4 67.4 68.4 68.8 70.8
75.7 79.0
The Median
The data in Example 1.15 illustrates an important property
of x̃ in contrast to x̄: The sample median is very insensitive
to outliers. If, for example, we increased the two largest xi’s
from 75.7 and 79.0 to 85.7 and 89.0, respectively,
x̃ would be unaffected.
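The insensitivity of the median can be checked directly with the twelve values listed in Example 1.15. The sketch below uses Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

# The 12 ordered observations listed in Example 1.15.
durations = [62.3, 62.8, 63.6, 65.2, 65.7, 66.4,
             67.4, 68.4, 68.8, 70.8, 75.7, 79.0]

# Same data with the two largest values raised to 85.7 and 89.0.
inflated = durations[:-2] + [85.7, 89.0]

# With n = 12 (even), the median averages the 6th and 7th ordered values.
print("median:", median(durations), "->", median(inflated))
print("mean:  ", round(mean(durations), 3), "->", round(mean(inflated), 3))
```

The median stays at the average of 66.4 and 67.4 in both cases, while the mean rises because every observation, including the two inflated ones, enters the sum.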
Analogous to x̃ as the middle value in the sample is a
middle value in the population, the population median,
denoted by µ̃. As with x̄ and µ, we can think of using the
sample median x̃ to make an inference about µ̃.
When this is the case, in making inferences we must first
decide which of the two population characteristics is of
greater interest and then proceed accordingly.
Example 1.16
The production of Bidri is a traditional craft of India. Bidri
wares (bowls, vessels, and so on) are cast from an alloy
containing primarily zinc along with some copper.
2.0 2.4 2.5 2.6 2.6 2.7 2.7 2.8 3.0 3.1 3.2 3.3 3.3
3.4 3.4 3.6 3.6 3.6 3.6 3.7 4.4 4.6 4.7 4.8 5.3 10.1
Figure 1.17
Measures of Variability
Reporting a measure of center gives only partial
information about a data set or distribution. Different
samples or populations may have identical measures of
center yet differ from one another in other important ways.
Figure 1.18
The first sample has the largest amount of variability, the
third has the smallest amount, and the second is
intermediate to the other two in this respect.
Thus if s = 2.0 mpg, then some xi’s in the sample are closer
than 2.0 to x̄ whereas others are farther away; 2.0 is a
representative (or “standard”) deviation from the mean fuel
efficiency. If s = 3.0 for a second sample of cars of another
type, a typical deviation in this sample is roughly 1.5 times
what it is in the first sample, an indication of more variability
in the second sample.
Example 1.17
The Web site www.fueleconomy.gov contains a wealth of
information about fuel characteristics of various vehicles. In
addition to EPA mileage ratings, there are many vehicles
for which users have reported their own values of fuel
efficiency (mpg).
Effects of rounding account for the sum of deviations not
being exactly zero. The numerator of s² is Sxx = 314.106,
from which
Note: Of the nine people who also reported driving
behavior, only three did more than 80% of their driving in
highway mode; we bet you can guess which cars they
drove.
Motivation for s²
To explain the rationale for the divisor n – 1 in s², note first
that whereas s² measures sample variability, there is a
measure of variability in the population called the
population variance.
When the population is finite and consists of N values,
σ² = Σ(xi – µ)²/N, where the sum runs over all N population values.
If we actually knew the value of µ, then we could define the
sample variance as the average squared deviation of the
sample xi’s about µ.
In other words, if we used a divisor n in the sample
variance, then the resulting quantity would tend to
underestimate σ² (produce estimated values that are too
small on the average), whereas dividing by the slightly
smaller n – 1 corrects this underestimating.
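A small simulation can make the underestimation visible. Everything specific here is an assumption chosen for illustration: a normal population with mean 10 and standard deviation 2 (so σ² = 4), samples of size n = 5, 20,000 repetitions, and a fixed random seed.

```python
import random


def sum_sq_dev(xs):
    """Sum of squared deviations from the sample mean (the numerator Sxx)."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs)


rng = random.Random(0)                     # fixed seed for reproducibility
mu, sigma, n, reps = 10.0, 2.0, 5, 20_000  # hypothetical settings, sigma^2 = 4

total_div_n = 0.0
total_div_n_minus_1 = 0.0
for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    sxx = sum_sq_dev(sample)
    total_div_n += sxx / n                 # divisor n
    total_div_n_minus_1 += sxx / (n - 1)   # divisor n - 1

avg_div_n = total_div_n / reps
avg_div_n_minus_1 = total_div_n_minus_1 / reps

print("true variance:            ", sigma ** 2)
print("average with divisor n:   ", round(avg_div_n, 3))
print("average with divisor n-1: ", round(avg_div_n_minus_1, 3))
```

Over many samples the divisor n version should settle near (n – 1)/n times σ², which is 3.2 under these settings, while the divisor n – 1 version should settle near 4.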
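For example, if n = 4 and three of the deviations xi – x̄ are
specified, then the fourth is automatically determined because
the deviations sum to zero, so only three of
the four values of xi – x̄ are freely determined (3 df).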
Example 1.18
Traumatic knee dislocation often requires surgery to repair
ruptured ligaments. One measure of recovery is range of
motion (measured as the angle formed when, starting with
the leg straight, the knee is bent as far as possible).
154 142 137 133 122 126 135 135 108 120 127 134
122
The sum of these 13 sample observations is Σxi = 1695,
and the sum of their squares is Σxi² = 222,581. Thus the
numerator of the sample variance is
Sxx = Σxi² – (Σxi)²/n = 222,581 – (1695)²/13 = 1579.0769
from which
s² = 1579.0769/12
= 131.59
and
s = 11.47.
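The arithmetic in Example 1.18 can be reproduced from the 13 listed observations. The sketch below computes the numerator with the shortcut Sxx = Σxi² – (Σxi)²/n and then cross-checks the result against Python's standard `statistics.variance`, which also uses the divisor n – 1. The variable names are illustrative.

```python
import math
from statistics import variance

# The 13 range-of-motion observations listed in Example 1.18.
rom = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 134, 122]

n = len(rom)
sum_x = sum(rom)                       # sum of the observations
sum_x2 = sum(x * x for x in rom)       # sum of the squared observations

sxx = sum_x2 - sum_x ** 2 / n          # shortcut formula for the numerator
s2 = sxx / (n - 1)                     # sample variance, divisor n - 1
s = math.sqrt(s2)                      # sample standard deviation

print(f"Sxx = {sxx:.4f}, s^2 = {s2:.2f}, s = {s:.2f}")
print("statistics.variance:", round(variance(rom), 2))
```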
Boxplots
Stem-and-leaf displays and histograms convey rather
general impressions about a data set, whereas a single
summary such as the mean or standard deviation focuses
on just one aspect of the data.
A boxplot describes several of a data set’s most prominent
features at once. These features include (1) center, (2) spread, (3) the extent
and nature of any departure from symmetry, and (4)
identification of “outliers,” observations that lie unusually far
from the main body of the data.
Because even a single outlier can drastically affect the
values of x̄ and s, a boxplot is based on measures that are
“resistant” to the presence of a few outliers—the median
and a measure of variability called the fourth spread.
Definition
Roughly speaking, the fourth spread is unaffected by the
positions of those observations in the smallest 25% or the
largest 25% of the data. Hence it is resistant to outliers.
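A sketch of the fourth spread calculation follows, using the 26 values listed in Example 1.16. Two conventions are assumed here and should be checked against the definitions in use: the lower and upper fourths are taken as the medians of the smallest and largest halves of the ordered data (with the median included in both halves when n is odd), and an observation is flagged as an outlier if it lies more than 1.5 fourth spreads from the nearest fourth, and as extreme if more than 3.

```python
from statistics import median


def fourths(data):
    """Lower and upper fourths: medians of the smaller and larger halves."""
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2            # includes the median itself when n is odd
    return median(xs[:half]), median(xs[n - half:])


# The 26 ordered values listed in Example 1.16.
bidri_values = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
                3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]

lower4, upper4 = fourths(bidri_values)
fs = upper4 - lower4               # fourth spread


def distance_from_box(x):
    """Distance from x to the nearest fourth (0 if x lies inside the box)."""
    if x < lower4:
        return lower4 - x
    if x > upper4:
        return x - upper4
    return 0.0


# Assumed cutoffs: beyond 1.5 fs is an outlier, beyond 3 fs is extreme.
mild = [x for x in bidri_values if 1.5 * fs < distance_from_box(x) <= 3 * fs]
extreme = [x for x in bidri_values if distance_from_box(x) > 3 * fs]

print("fourths:", lower4, upper4, "fourth spread:", round(fs, 2))
print("mild outliers:", mild, "extreme outliers:", extreme)
```

Replacing the largest value, 10.1, with an even larger number would leave both fourths and the fourth spread unchanged, which is the resistance property described above.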
Place a vertical line segment or some other symbol inside
the rectangle at the location of the median; the position of
the median symbol relative to the two edges conveys
information about skewness in the middle 50% of the data.
Example 1.19
The accompanying data consists of observations on the
time until failure (1000s of hours) for a sample of
turbochargers from one type of engine (from “The Beta
Generalized Weibull Distribution: Properties and
Applications,” Reliability Engr. and System Safety, 2012: 5–15).
Figure 1.19 shows Minitab output from a request to describe
the data. Q1 and Q3 are the lower and upper quartiles,
respectively, and IQR (interquartile range) is the difference
between these quartiles. SE Mean is s/√n, the “standard
error of the mean”; it will be important in our subsequent
development of several widely used procedures for making
inferences about the population mean µ.
Figure 1.20 shows both a dotplot of the data and a boxplot.
Both plots indicate that there is a reasonable amount of
symmetry in the middle 50% of the data, but overall values
stretch out more toward the low end than toward the high
end—a negative skew. The box itself is not very narrow,
indicating a fair amount of variability in the middle half of
the data, and the lower whisker is especially long.
Definition
Example 1.20
The Clean Water Act and subsequent amendments require
that all waters in the United States meet specific pollution
reduction goals to ensure that water is “fishable and
swimmable.”
Among the data considered is the following sample of TN
(total nitrogen) loads (kg N/day) from a particular
Chesapeake Bay location, displayed here in increasing
order.
Relevant summary quantities are
The whiskers in the boxplot in Figure 1.21 extend out to the
smallest observation, 9.69, on the low end and 312.45, the
largest observation that is not an outlier, on the upper end.
Figure 1.21 A boxplot of the nitrogen load data showing mild and extreme outliers
There is some positive skewness in the middle half of the
data (the median line is somewhat closer to the left edge of
the box than to the right edge) and a great deal of positive
skewness overall.
Comparative Boxplots
A comparative or side-by-side boxplot is a very effective
way of revealing similarities and differences between two or
more data sets consisting of observations on the same
variable—fuel efficiency observations for four different
types of automobiles, crop yields for three different
varieties, and so on.
Example 1.21
High levels of sodium in food products represent a growing health
concern. The accompanying data consists of values of sodium
content in one serving of cereal for one sample of cereals
manufactured by General Mills, another sample manufactured by
Kellogg, and a third sample produced by Post (see the website
http://www.nutritionresource.com/foodcomp2.cfm?id=0800 rather
than visiting your neighborhood grocery store!).
Figure 1.22 shows a comparative boxplot of the data from
the software package R. The typical sodium content
(median) is roughly the same for all three companies. But
the distributions differ markedly in other respects.