chapter 1

The document discusses the importance of statistics in organizing and summarizing data, focusing on populations, samples, and processes. It explains the distinction between census and sample, the nature of variables, and the branches of statistics, particularly descriptive and inferential statistics. Additionally, it highlights the application of statistical methods across various disciplines and provides examples of recent statistical research.

2/18/2025

1 Overview and Descriptive Statistics

Copyright © Cengage Learning. All rights reserved.

1.1 Populations, Samples, and Processes

Populations, Samples, and Processes


Engineers and scientists are constantly exposed to
collections of facts, or data, both in their professional
capacities and in everyday activities.

The discipline of statistics provides methods for organizing
and summarizing data and for drawing conclusions based
on information contained in the data.

An investigation will typically focus on a well-defined
collection of objects constituting a population of interest. In
one study, the population might consist of all gelatin
capsules of a particular type produced during a specified
period.

Populations, Samples, and Processes


Another investigation might involve the population
consisting of all individuals who received a B.S. in
engineering during the most recent academic year.

When desired information is available for all objects in the
population, we have what is called a census.

Constraints on time, money, and other scarce resources
usually make a census impractical or infeasible. Instead, a
subset of the population (a sample) is selected in some
prescribed manner.

Populations, Samples, and Processes


Thus we might obtain a sample of bearings from a
particular production run as a basis for investigating
whether bearings are conforming to manufacturing
specifications, or we might select a sample of last year’s
engineering graduates to obtain feedback about the quality
of the engineering curricula.

We are usually interested only in certain characteristics of
the objects in a population: the number of flaws on the
surface of each casing, the thickness of each capsule wall,
the gender of an engineering graduate, the age at which
the individual graduated, and so on.

Populations, Samples, and Processes


A characteristic may be categorical, such as gender or type
of malfunction, or it may be numerical in nature.

In the former case, the value of the characteristic is a
category (e.g., female or insufficient solder), whereas in the
latter case, the value is a number (e.g., age = 23 or
diameter = 0.502 cm).

Populations, Samples, and Processes


A variable is any characteristic whose value may change
from one object to another in the population. We shall
initially denote variables by lowercase letters from the end
of our alphabet. Examples include

x = brand of calculator owned by a student

y = number of visits to a particular Web site during a
specified period

z = braking distance of an automobile under specified
conditions

Populations, Samples, and Processes


Data results from making observations either on a single
variable or simultaneously on two or more variables.
A univariate data set consists of observations on a single
variable.

For example, we might determine the type of transmission,
automatic (A) or manual (M), on each of ten automobiles
recently purchased at a certain dealership, resulting in the
categorical data set

M A A A M A A M A A
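
A categorical data set like this one is usually summarized by counting how often each category occurs. The following is a minimal sketch in Python (the language and the variable name `transmissions` are our own choices for illustration, not part of the text):

```python
from collections import Counter

# The ten observed transmission types, in the order listed above
transmissions = "M A A A M A A M A A".split()

# Frequency of each category, plus the relative frequency (proportion)
counts = Counter(transmissions)
for category, count in sorted(counts.items()):
    print(category, count, count / len(transmissions))
```

The relative frequencies are what a bar chart or pie chart of this data set would display.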


Populations, Samples, and Processes


The following sample of pulse rates (beats per minute) for
patients recently admitted to an adult intensive care unit is
a numerical univariate data set:
88 80 71 103 154 132 67 110 60 105

We have bivariate data when observations are made on
each of two variables. Our data set might consist of a
(height, weight) pair for each basketball player on a team,
with the first observation as (72, 168), the second as
(75, 212), and so on.

Populations, Samples, and Processes


If an engineer determines the value of both x = component
lifetime and y = reason for component failure, the resulting
data set is bivariate with one variable numerical and the
other categorical.

Multivariate data arises when observations are made on
more than one variable (so bivariate is a special case of
multivariate).

For example, a research physician might determine the
systolic blood pressure, diastolic blood pressure, and
serum cholesterol level for each patient participating in a
study.

Populations, Samples, and Processes


Each observation would be a triple of numbers, such as
(120, 80, 146). In many multivariate data sets, some
variables are numerical and others are categorical.

Thus the annual automobile issue of Consumer Reports
gives values of such variables as type of vehicle (small,
sporty, compact, mid-size, large), city fuel efficiency (mpg),
highway fuel efficiency (mpg), drivetrain type (rear wheel,
front wheel, four wheel), and so on.

Branches of Statistics

An investigator who has collected data may wish simply to
summarize and describe important features of the data.
This entails using methods from descriptive statistics.

Some of these methods are graphical in nature; histograms,
boxplots, and scatter plots are primary examples.

Other descriptive methods involve calculation of numerical
summary measures, such as means, standard deviations,
and correlation coefficients. The wide availability of
statistical computer software packages has made these
tasks much easier to carry out than they used to be.
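
To give a sense of how little effort such summaries take with software, here is a small sketch using Python's standard `statistics` module on the ten pulse rates listed earlier. The choice of Python and the variable names are assumptions made for illustration; any statistical package would serve equally well.

```python
import statistics

# Pulse rates (beats per minute) from the ICU sample given earlier
pulse_rates = [88, 80, 71, 103, 154, 132, 67, 110, 60, 105]

mean_rate = statistics.mean(pulse_rates)      # arithmetic average
median_rate = statistics.median(pulse_rates)  # middle of the ordered values
sd_rate = statistics.stdev(pulse_rates)       # sample standard deviation

print(f"mean = {mean_rate}, median = {median_rate}, sd = {sd_rate:.1f}")
```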

Branches of Statistics
Computers are much more efficient than human beings at
calculation and the creation of pictures (once they have
received appropriate instructions from the user!).

This means that the investigator doesn’t have to expend
much effort on “grunt work” and will have more time to
study the data and extract important messages.

Throughout this book, we will present output from various
packages such as Minitab, SAS, JMP, and R. The R
software can be downloaded without charge from the site
http://www.r-project.org.

Example 1.1
Charity is a big business in the United States. The Web site
charitynavigator.com gives information on roughly 6000
charitable organizations, and there are many smaller
charities that fly below the navigator’s radar screen.

Some charities operate very efficiently, with fundraising and
administrative expenses that are only a small percentage of
total expenses, whereas others spend a high percentage of
what they take in on such activities.

Example 1.1 cont’d

Here is data on fundraising expenses as a percentage of
total expenditures for a random sample of 60 charities:

6.1 12.6 34.7 1.6 18.8 2.2 3.0 2.2 5.6 3.8
2.2 3.1 1.3 1.1 14.1 4.0 21.0 6.1 1.3 20.4
7.5 3.9 10.1 8.1 19.5 5.2 12.0 15.8 10.4 5.2
6.4 10.8 83.1 3.6 6.2 6.3 16.3 12.7 1.3 0.8
8.8 5.1 3.7 26.3 6.0 48.0 8.2 11.7 7.2 3.9
15.3 16.6 8.8 12.0 4.7 14.7 6.4 17.0 2.5 16.2


Example 1.1 cont’d

Without any organization, it is difficult to get a sense of the
data’s most prominent features: what a typical
(i.e., representative) value might be, whether values are
highly concentrated about a typical value or quite
dispersed, whether there are any gaps in the data, what
fraction of the values are less than 20%, and so on.
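
Some of these questions can be answered directly once the 60 values are entered into software. The sketch below is only an illustration in Python (the variable names are ours); it sorts the data and computes the fraction of charities in the sample whose fundraising percentage is below 20%.

```python
import statistics

# Fundraising expenses as a percentage of total expenditures (60 charities)
fundraising = [
    6.1, 12.6, 34.7, 1.6, 18.8, 2.2, 3.0, 2.2, 5.6, 3.8,
    2.2, 3.1, 1.3, 1.1, 14.1, 4.0, 21.0, 6.1, 1.3, 20.4,
    7.5, 3.9, 10.1, 8.1, 19.5, 5.2, 12.0, 15.8, 10.4, 5.2,
    6.4, 10.8, 83.1, 3.6, 6.2, 6.3, 16.3, 12.7, 1.3, 0.8,
    8.8, 5.1, 3.7, 26.3, 6.0, 48.0, 8.2, 11.7, 7.2, 3.9,
    15.3, 16.6, 8.8, 12.0, 4.7, 14.7, 6.4, 17.0, 2.5, 16.2,
]

ordered = sorted(fundraising)                      # ordering reveals gaps and extremes
below_20 = sum(1 for pct in fundraising if pct < 20)
fraction_below_20 = below_20 / len(fundraising)

print("smallest and largest:", ordered[0], ordered[-1])
print("median:", statistics.median(fundraising))
print("fraction below 20%:", fraction_below_20)
```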


Example 1.1 cont’d

Figure 1.1 shows what is called a stem-and-leaf display as
well as a histogram.

[Figure 1.1: A Minitab stem-and-leaf display (tenths digit truncated) and
histogram for the charity fundraising percentage data]

Branches of Statistics
Clearly a substantial majority of the charities in the sample
spend less than 20% on fundraising, and only a few
percentages might be viewed as beyond the bounds of
sensible practice.

Having obtained a sample from a population, an
investigator would frequently like to use sample information
to draw some type of conclusion (make an inference of
some sort) about the population.

That is, the sample is a means to an end rather than an
end in itself. Techniques for generalizing from a sample to a
population are gathered within the branch of our discipline
called inferential statistics.

The Scope of Modern Statistics



These days statistical methodology is employed by
investigators in virtually all disciplines, including such areas
as
• molecular biology (analysis of microarray data)

• ecology (describing quantitatively how individuals in
various animal and plant populations are spatially
distributed)

• materials engineering (studying properties of various
treatments to retard corrosion)

The Scope of Modern Statistics


• marketing (developing market surveys and strategies for
marketing new products)

• public health (identifying sources of diseases and ways to
treat them)

• civil engineering (assessing the effects of stress on
structural elements and the impacts of traffic flows on
communities)

As you progress through the book, you’ll encounter a wide
spectrum of different scenarios in the examples and
exercises that illustrate the application of techniques from
probability and statistics.

The Scope of Modern Statistics


Many of these scenarios involve data or other material
extracted from articles in engineering and science journals.

The methods presented herein have become established
and trusted tools in the arsenal of those who work with
data.

Meanwhile, statisticians continue to develop new models
for describing randomness and uncertainty and new
methodology for analyzing data.

The Scope of Modern Statistics


As evidence of the continuing creative efforts in the
statistical community, here are titles and capsule
descriptions of some articles that have recently appeared in
statistics journals (Journal of the American Statistical
Association is abbreviated JASA, and AAS is short for the
Annals of Applied Statistics, two of the many prominent
journals in the discipline):


The Scope of Modern Statistics


• “How Many People Do You Know? Efficiently Estimating
Personal Network Size” (JASA, 2010: 59–70): How many
of the N individuals at your college do you know? You could
select a random sample of students from the population
and use an estimate based on the fraction of people in this
sample that you know.

A “latent mixing model” was proposed that the authors
asserted remedied deficiencies in previously used
techniques.

The Scope of Modern Statistics


• “Active Learning Through Sequential Design, with
Applications to the Detection of Money Laundering”
(JASA, 2009: 969–981): Money laundering involves
concealing the origin of funds obtained through illegal
activities.

The huge number of transactions occurring daily at
financial institutions makes detection of money laundering
difficult. The standard approach has been to extract
various summary quantities from the transaction history
and conduct a time-consuming investigation of suspicious
activities. The article proposes a more efficient statistical
method and illustrates its use in a case study.

The Scope of Modern Statistics


• “Robust Internal Benchmarking and False Discovery
Rates for Detecting Racial Bias in Police Stops” (JASA,
2009: 661–668): Allegations of police actions that are
attributable at least in part to racial bias have become a
contentious issue in many communities.

This article proposes a new method that is designed to
reduce the risk of flagging a substantial number of “false
positives” (individuals falsely identified as manifesting
bias).

The Scope of Modern Statistics


The method was applied to data on 500,000 pedestrian
stops in New York City in 2006; of the 3000 officers
regularly involved in pedestrian stops, 15 were identified
as having stopped a substantially greater fraction of Black
and Hispanic people than what would be predicted were
bias absent.


The Scope of Modern Statistics


• “Records in Athletics Through Extreme Value Theory”
(JASA, 2008: 1382–1391): The focus here is on the
modeling of extremes related to world records in athletics.

The authors start by posing two questions:
(1) What is the ultimate world record within a specific
event (e.g., the high jump for women)? and
(2) How “good” is the current world record, and how does
the quality of current world records compare across
different events?

A total of 28 events (8 running, 3 throwing, and 3 jumping
for both men and women) are considered.

The Scope of Modern Statistics


For example, one conclusion is that only about 20
seconds can be shaved off the men’s marathon record,
but that the current women’s marathon record is almost 5
minutes longer than what can ultimately be achieved.

The methodology also has applications to such issues as
ensuring airport runways are long enough and that dikes
in Holland are high enough.

The Scope of Modern Statistics


• “Self-Exciting Hurdle Models for Terrorist Activity” (AAS,
2012: 106–124): The authors developed a predictive model
of terrorist activity by considering the daily number of
terrorist attacks in Indonesia from 1994 through 2007. The
model estimates the chance of future attacks as a function
of the times since past attacks.

The article provides an interpretation of various model
characteristics and assesses its predictive performance.

The Scope of Modern Statistics


• “Prediction of Remaining Life of Power Transformers
Based on Left Truncated and Right Censored Lifetime
Data” (AAS, 2009: 857–879): There are roughly 150,000
high-voltage power transmission transformers in the
United States. Unexpected failures can cause substantial
economic losses, so it is important to have predictions for
remaining lifetimes.

Relevant data can be complicated because lifetimes of
some transformers extend over several decades during
which records were not necessarily complete.

The Scope of Modern Statistics


In particular, the authors of the article use data from a
certain energy company that began keeping careful
records in 1980. But some transformers had been
installed before January 1, 1980, and were still in service
after that date (“left truncated” data), whereas other units
were still in service at the time of the investigation, so
their complete lifetimes are not available (“right
censored” data).

The article describes various procedures for obtaining an
interval of plausible values (a prediction interval) for a
remaining lifetime and for the cumulative number of
failures over a specified time period.

The Scope of Modern Statistics


• “The BARISTA: A Model for Bid Arrivals in Online
Auctions” (AAS, 2007: 412–441): Online auctions such as
those on eBay and uBid often have characteristics that
differentiate them from traditional auctions.

One particularly important difference is that the number of
bidders at the outset of many traditional auctions is fixed,
whereas in online auctions this number and the number
of resulting bids are not predetermined.

The Scope of Modern Statistics


The article proposes a new BARISTA (for Bid ARivals In
STAges) model for describing the way in which bids arrive
online. The model allows for higher bidding intensity at the
outset of the auction and also as the auction comes to a
close.

Various properties of the model are investigated
and then validated using data from eBay.com on auctions
for Palm M515 personal assistants, Microsoft Xbox
games, and Cartier watches.

The Scope of Modern Statistics


• “Statistical Challenges in the Analysis of Cosmic
Microwave Background Radiation” (AAS, 2009: 61–95):
The cosmic microwave background (CMB) is a significant
source of information about the early history of the
universe.

Its radiation level is uniform, so extremely delicate
instruments have been developed to measure
fluctuations. The authors provide a review of statistical
issues with CMB data analysis; they also give many
examples of the application of statistical procedures to
data obtained from a recent NASA satellite mission, the
Wilkinson Microwave Anisotropy Probe.

The Scope of Modern Statistics


Statistical information now appears with increasing
frequency in the popular media, and occasionally the
spotlight is even turned on statisticians.

For example, the Nov. 23, 2009, New York Times reported
in an article “Behind Cancer Guidelines, Quest for Data”
that the new science for cancer investigations and more
sophisticated methods for data analysis spurred the U.S.
Preventive Services task force to re-examine guidelines for
how frequently middle-aged and older women should have
mammograms.


The Scope of Modern Statistics


The panel commissioned six independent groups to do
statistical modeling. The result was a new set of
conclusions, including an assertion that mammograms
every two years are nearly as beneficial to patients as
annual mammograms, but confer only half the risk of
harms.

Donald Berry, a very prominent biostatistician, was quoted
as saying he was pleasantly surprised that the task force
took the new research to heart in making its
recommendations. The task force’s report has generated
much controversy among cancer organizations, politicians,
and women themselves.

The Scope of Modern Statistics


It is our hope that you will become increasingly convinced
of the importance and relevance of the discipline of
statistics as you dig more deeply into the book and the
subject. Hopefully you’ll be turned on enough to want to
continue your statistical education beyond your current
course.


Enumerative Versus Analytic Studies


W. E. Deming, a very influential American statistician who
was a moving force in Japan’s quality revolution during the
1950s and 1960s, introduced the distinction between
enumerative studies and analytic studies.

In the former, interest is focused on a finite, identifiable,
unchanging collection of individuals or objects that make up
a population.

A sampling frame (that is, a listing of the individuals or
objects to be sampled) is either available to an investigator
or else can be constructed.

Enumerative Versus Analytic Studies


For example, the frame might consist of all signatures on a
petition to qualify a certain initiative for the ballot in an
upcoming election; a sample is usually selected to
ascertain whether the number of valid signatures exceeds
a specified value.

As another example, the frame may contain serial numbers
of all furnaces manufactured by a particular company
during a certain time period; a sample may be selected to
infer something about the average lifetime of these units.

Enumerative Versus Analytic Studies


The use of inferential methods to be developed in this book
is reasonably noncontroversial in such settings (though
statisticians may still argue over which particular methods
should be used).

An analytic study is broadly defined as one that is not
enumerative in nature. Such studies are often carried out
with the objective of improving a future product by taking
action on a process of some sort (e.g., recalibrating
equipment or adjusting the level of some input such as the
amount of a catalyst).

Enumerative Versus Analytic Studies


Data can often be obtained only on an existing process,
one that may differ in important respects from the future
process. There is thus no sampling frame listing the
individuals or objects of interest.

For example, a sample of five turbines with a new design
may be experimentally manufactured and tested to
investigate efficiency.

These five could be viewed as a sample from the
conceptual population of all prototypes that could be
manufactured under similar conditions, but not necessarily
as representative of the population of units manufactured
once regular production gets underway.

Enumerative Versus Analytic Studies


Methods for using sample information to draw conclusions
about future production units may be problematic.
Someone with expertise in the area of turbine design and
engineering (or whatever other subject area is relevant)
should be called upon to judge whether such extrapolation
is sensible.

A good exposition of these issues is contained in the article
“Assumptions for Statistical Inference” by Gerald Hahn and
William Meeker (The American Statistician, 1993: 1–11).

Collecting Data

Statistics deals not only with the organization and analysis
of data once it has been collected but also with the
development of techniques for collecting the data. If data is
not properly collected, an investigator may not be able to
answer the questions under consideration with a
reasonable degree of confidence.

One common problem is that the target population (the
one about which conclusions are to be drawn) may be
different from the population actually sampled. For
example, advertisers would like various kinds of information
about the television-viewing habits of potential customers.

Collecting Data
The most systematic information of this sort comes from
placing monitoring devices in a small number of homes
across the United States. It has been conjectured that
placement of such devices in and of itself alters viewing
behavior, so that characteristics of the sample may be
different from those of the target population.

When data collection entails selecting individuals or objects
from a frame, the simplest method for ensuring a
representative selection is to take a simple random sample.
This is one for which any particular subset of the specified
size (e.g., a sample of size 100) has the same chance of
being selected.

Collecting Data
For example, if the frame consists of 1,000,000 serial
numbers, the numbers 1, 2, . . . , up to 1,000,000 could be
placed on identical slips of paper. After placing these slips
in a box and thoroughly mixing, slips could be drawn one
by one until the requisite sample size has been obtained.

Alternatively (and much to be preferred), a table of random
numbers or a computer’s random number generator could
be employed.
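
As a sketch of the random number generator approach, Python's standard `random.sample` draws without replacement so that every subset of the requested size is equally likely. The function name `simple_random_sample` and the `seed` argument are our own additions for illustration and reproducibility.

```python
import random

def simple_random_sample(frame_size, sample_size, seed=None):
    """Draw serial numbers 1..frame_size without replacement.

    Each subset of `sample_size` serial numbers has the same chance of
    being selected, which is the defining property of a simple random sample.
    """
    rng = random.Random(seed)
    return rng.sample(range(1, frame_size + 1), sample_size)

# A sample of size 100 from a frame of 1,000,000 serial numbers
chosen = simple_random_sample(1_000_000, 100, seed=2025)
print(sorted(chosen)[:5])
```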


Collecting Data
Sometimes alternative sampling methods can be used to
make the selection process easier, to obtain extra
information, or to increase the degree of confidence in
conclusions. One such method, stratified sampling, entails
separating the population units into nonoverlapping groups
and taking a sample from each one.

For example, a study of how physicians feel about the
Affordable Care Act might proceed by stratifying according
to specialty: select a sample of surgeons, another sample
of radiologists, yet another sample of psychiatrists, and so
on.

Collecting Data
This would result in information separately from each
specialty and ensure that no one specialty is over- or
underrepresented in the entire sample.
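
Stratified sampling can be sketched as a simple random sample taken separately within each group. In the Python illustration below, the frame of physician IDs is entirely hypothetical, as are the function name and the equal allocation of three physicians per specialty; real studies often allocate sample sizes in proportion to stratum size.

```python
import random

def stratified_sample(frame, per_stratum, seed=None):
    """Take a simple random sample of `per_stratum` units from every stratum.

    `frame` maps each stratum label to the list of units in that stratum;
    strata are assumed to be nonoverlapping.
    """
    rng = random.Random(seed)
    return {label: rng.sample(units, per_stratum) for label, units in frame.items()}

# Hypothetical frame: physician IDs grouped by specialty
frame = {
    "surgeons": [f"S{i}" for i in range(1, 41)],
    "radiologists": [f"R{i}" for i in range(1, 26)],
    "psychiatrists": [f"P{i}" for i in range(1, 31)],
}

sample = stratified_sample(frame, per_stratum=3, seed=7)
for specialty, chosen in sample.items():
    print(specialty, chosen)
```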

Frequently a “convenience” sample is obtained by selecting
individuals or objects without systematic randomization. As
an example, a collection of bricks may be stacked in such a
way that it is extremely difficult for those in the center to be
selected.

Collecting Data
If the bricks on the top and sides of the stack were
somehow different from the others, resulting sample data
would not be representative of the population.

Often an investigator will assume that such a convenience
sample approximates a random sample, in which case a
statistician’s repertoire of inferential methods can be used;
however, this is a judgment call.

Collecting Data
Engineers and scientists often collect data by carrying out
some sort of designed experiment. This may involve
deciding how to allocate several different treatments (such
as fertilizers or coatings for corrosion protection) to the
various experimental units (plots of land or pieces of pipe).

Alternatively, an investigator may systematically vary the
levels or categories of certain factors (e.g., pressure or type
of insulating material) and observe the effect on some
response variable (such as yield from a production
process).

Example 1.4
An article in the New York Times (Jan. 27, 1987) reported
that heart attack risk could be reduced by taking aspirin.
This conclusion was based on a designed experiment
involving both a control group of individuals that took a
placebo having the appearance of aspirin but known to be
inert and a treatment group that took aspirin according to a
specified regimen.

Subjects were randomly assigned to the groups to protect
against any biases and so that probability-based methods
could be used to analyze the data.

Example 1.4 cont’d

Of the 11,034 individuals in the control group, 189
subsequently experienced heart attacks, whereas only 104
of the 11,037 in the aspirin group had a heart attack. The
incidence rate of heart attacks in the treatment group was
only about half that in the control group.

One possible explanation for this result is chance
variation: that aspirin really doesn’t have the desired effect
and the observed difference is just typical variation in the
same way that tossing two identical coins would usually
produce different numbers of heads.

Example 1.4 cont’d

However, in this case, inferential methods suggest that
chance variation by itself cannot adequately explain the
magnitude of the observed difference.
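
The text does not say which inferential method was applied. As one illustration only, the sketch below computes the two incidence rates and a standard large-sample two-proportion z statistic in Python; this is an assumption about a reasonable analysis, not a reconstruction of the study's own calculations.

```python
import math

# Counts reported in Example 1.4
control_attacks, control_n = 189, 11_034
aspirin_attacks, aspirin_n = 104, 11_037

control_rate = control_attacks / control_n
aspirin_rate = aspirin_attacks / aspirin_n
rate_ratio = aspirin_rate / control_rate   # "about half" per the text

# Pooled proportion under the hypothesis that aspirin has no effect
pooled = (control_attacks + aspirin_attacks) / (control_n + aspirin_n)
std_error = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / aspirin_n))
z = (control_rate - aspirin_rate) / std_error

# One-sided tail area of the standard normal distribution beyond z
p_value = 0.5 * math.erfc(z / math.sqrt(2))

print(f"control rate = {control_rate:.4f}, aspirin rate = {aspirin_rate:.4f}")
print(f"rate ratio = {rate_ratio:.2f}, z = {z:.2f}, one-sided p = {p_value:.1e}")
```

A z statistic this far from zero corresponds to a very small tail probability, which is the sense in which chance variation alone is an inadequate explanation.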

1.2 Pictorial and Tabular Methods in Descriptive Statistics

Pictorial and Tabular Methods in Descriptive Statistics

Descriptive statistics can be divided into two general
subject areas. In this section, we consider representing a
data set using visual techniques.

Many visual techniques may already be familiar to you:
frequency tables, tally sheets, histograms, pie charts, bar
graphs, scatter diagrams, and the like. Here we focus on a
selected few of these techniques that are most useful and
relevant to probability and inferential statistics.

Notation

Some general notation will make it easier to apply our
methods and formulas to a wide variety of practical
problems.

The number of observations in a single sample, that is, the
sample size, will often be denoted by n, so that n = 4 for the
sample of universities {Stanford, Iowa State, Wyoming,
Rochester} and also for the sample of pH measurements
{6.3, 6.2, 5.9, 6.5}.

If two samples are simultaneously under consideration,
either m and n or n1 and n2 can be used to denote the
numbers of observations.

Notation
An experiment to compare thermal efficiencies for two
different types of diesel engines might result in samples
{29.7, 31.6, 30.9} and {28.7, 29.5, 29.4, 30.3}, in which
case m = 3 and n = 4.

Given a data set consisting of n observations on some
variable x, the individual observations will be denoted by
x1, x2, x3,…, xn. The subscript bears no relation to the
magnitude of a particular observation.

Thus x1 will not in general be the smallest observation in
the set, nor will xn typically be the largest.

Notation
In many applications, x1 will be the first observation
gathered by the experimenter, x2 the second, and so on.
The ith observation in the data set will be denoted by xi.
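
When this notation is carried over to software, one detail deserves care: the text numbers observations from 1, whereas many languages index lists from 0. A brief Python sketch follows (the helper name `observation` is hypothetical), using the pH and diesel engine samples from this section.

```python
ph = [6.3, 6.2, 5.9, 6.5]              # sample of pH measurements
engine_1 = [29.7, 31.6, 30.9]          # thermal efficiencies, first engine type
engine_2 = [28.7, 29.5, 29.4, 30.3]    # thermal efficiencies, second engine type

n_ph = len(ph)                          # sample size n = 4
m, n = len(engine_1), len(engine_2)     # m = 3 and n = 4

def observation(sample, i):
    """Return x_i using the text's 1-based subscript (Python lists are 0-based)."""
    return sample[i - 1]

x1 = observation(ph, 1)                 # first observation gathered
print(n_ph, m, n, x1, min(ph))          # x1 need not be the smallest value
```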

Stem-and-Leaf Displays

Consider a numerical data set x1, x2, x3,…, xn for which
each xi consists of at least two digits. A quick way to obtain
an informative visual representation of the data set is to
construct a stem-and-leaf display.

Stem-and-Leaf Displays
If the data set consists of exam scores, each between 0
and 100, the score of 83 would have a stem of 8 and a leaf
of 3.

If all exam scores are in the 90s, 80s, and 70s, use of the
tens digit as the stem would give a display with three rows.
In this case, it is desirable to stretch the display by
repeating each stem value twice (9H, 9L, 8H, ..., 7L), once
for high leaves 9, ..., 5 and again for low leaves 4, ..., 0.
Then a score of 93 would have a stem of 9L and leaf of 3.

In general, a display based on between 5 and 20 stems is
recommended.
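
The construction just described can be sketched in a few lines of Python. The function below is an illustration under our own assumptions: it handles nonnegative integers only, uses the last digit as the leaf, lists larger stems first to match the 9H, 9L, 8H ordering in the text, and omits stems that have no leaves. The exam scores are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values, split=False):
    """Return the rows of a stem-and-leaf display as strings.

    The leaf is the last digit and the stem is everything before it.
    With split=True each stem appears twice: H for leaves 5-9, L for leaves 0-4.
    """
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        half = ("H" if leaf >= 5 else "L") if split else ""
        rows[(stem, half)].append(leaf)

    # Larger stems first; within a stem, the H row precedes the L row
    def row_order(key):
        stem, half = key
        return (-stem, 1 if half == "L" else 0)

    lines = []
    for stem, half in sorted(rows, key=row_order):
        leaves = "".join(str(leaf) for leaf in rows[(stem, half)])
        lines.append(f"{stem}{half} | {leaves}")
    return lines

# Hypothetical exam scores, all in the 70s, 80s, and 90s
scores = [93, 97, 88, 83, 85, 79, 72, 91, 76, 84]
display = stem_and_leaf(scores, split=True)
print("\n".join(display))
```

With `split=False` the same scores give only three rows, which is why the text recommends stretching the display in this situation.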

Example 1.6
A common complaint among college students is that they
are getting less sleep than they need.

The article “Class Start Times, Sleep, and Academic
Performance in College: A Path Analysis” (Chronobiology
Intl., 2012: 318–335) investigated factors that impact sleep
time.

Example 1.6 cont’d

The stem-and-leaf display in Figure 1.4 shows the average
number of hours of sleep per day over a two-week period
for a sample of 253 students.

Example 1.6 cont’d

The first observation in the top row of the display is 5.0,
corresponding to a stem of 5 and leaf of 0, and the last
observation at the bottom of the display is 10.6. Note that in
the absence of a context, without the identification of stem
and leaf digits in the display, we wouldn’t know whether the
observation with stem 7 and leaf 9 was .79, 7.9, or 79.

The leaves in each row are ordered from smallest to
largest; this is commonly done by software packages but is
not necessary if a display is created by hand.

Example 1.6 cont’d

The display suggests that a typical or representative sleep
time is in the stem 8L row, perhaps 8.1 or 8.2. The data is
not highly concentrated about this typical value as would be
the case if almost all students were getting between 7.5
and 9.5 hours of sleep on average.

The display appears to rise rather smoothly to a peak in
the 8L row and then decline smoothly (we conjecture that
the minor peak in the 6L row would disappear if more data
was available).

Example 1.6 cont’d

The general shape of the display is rather symmetric,
bearing strong resemblance to a bell-shaped curve; it does
not stretch out more in one direction than the other.

The two smallest and two largest values seem a bit
separated from the remainder of the data; perhaps they
are very mild, but certainly not extreme, “outliers.”

Example 1.6 cont’d

A reference in the cited article suggests that individuals in
this age group need about 8.4 hours of sleep per day. So it
appears that a substantial percentage of students in the
sample are sleep deprived.

Stem-and-Leaf Displays
A stem-and-leaf display conveys information about the
following aspects of the data:

• identification of a typical or representative value

• extent of spread about the typical value

• presence of any gaps in the data

• extent of symmetry in the distribution of values

• number and location of peaks

• presence of any outlying values


Dotplots

A dotplot is an attractive summary of numerical data when
the data set is reasonably small or there are relatively few
distinct data values. Each observation is represented by a
dot above the corresponding location on a horizontal
measurement scale.

When a value occurs more than once, there is a dot for each occurrence, and these dots are stacked vertically. As with a stem-and-leaf display, a dotplot gives information about location, spread, extremes, and gaps.
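The stacking rule described above can be sketched as a plain-text dotplot. This is a rough illustration, assuming small nonnegative integer-valued data; the function name text_dotplot and the data are made up for the example.

```python
from collections import Counter

def text_dotplot(values):
    """Return the lines of a text dotplot for integer data.

    One '*' is drawn per occurrence, stacked vertically above the value.
    """
    counts = Counter(values)
    lo, hi = min(values), max(values)
    scale = range(lo, hi + 1)  # every integer location, so gaps stay visible
    width = len(str(hi)) + 1   # column width for each scale position
    lines = []
    for level in range(max(counts.values()), 0, -1):
        row = "".join(("*" if counts[t] >= level else " ").rjust(width) for t in scale)
        lines.append(row.rstrip())
    lines.append("".join(str(t).rjust(width) for t in scale))  # measurement scale
    return lines

data = [2, 3, 3, 4, 4, 4, 5, 9]  # made-up observations
plot = text_dotplot(data)
print("\n".join(plot))
```

In the printed result the isolated dot at 9 and the empty positions 6 through 8 show how a dotplot reveals extremes and gaps.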


Example 1.8
There is growing concern in the U.S. that not enough
students are graduating from college. America used to be
number 1 in the world for the percentage of adults with
college degrees, but it has recently dropped to 16th. Here
is data on the percentage of 25- to 34-year-olds in each
state who had some type of postsecondary degree as of
2010 (listed in alphabetical order, with the District of
Columbia included):
31.5 32.9 33.0 28.6 37.9 43.3 45.9 37.2 68.8 36.2 35.5
40.5 37.2 45.3 36.1 45.5 42.3 33.3 30.3 37.2 45.5 54.3
37.2 49.8 32.1 39.3 40.3 44.2 28.4 46.0 47.2 28.7 49.6
37.6 50.8 38.0 30.8 37.6 43.9 42.5 35.2 42.2 32.8 32.2
38.5 44.5 44.6 40.9 29.5 41.3 35.4


Figure 1.6 shows a dotplot of the data. There is clearly a great deal of state-to-state variability.

The largest value, for D.C., is obviously an extreme outlier, and four other values on the upper end of the data are candidates for mild outliers (MA, MN, NY, and ND). There is also a cluster of states at the low end, primarily located in the South and Southwest.

The overall percentage for the entire country is 39.3%; this
is not a simple average of the 51 numbers but an average
weighted by population sizes.

A dotplot can be quite cumbersome to construct and look crowded when the number of observations is large. Our next technique is well suited to such situations.


Histograms

Some numerical data is obtained by counting to determine
the value of a variable (the number of traffic citations a
person received during the last year, the number of
customers arriving for service during a particular period),
whereas other data is obtained by taking measurements
(weight of an individual, reaction time to a particular
stimulus).

The prescription for drawing a histogram is generally different for these two cases.

Definition
A numerical variable is discrete if its set of possible values
either is finite or else can be listed in an infinite sequence
(one in which there is a first number, a second number, and
so on). A numerical variable is continuous if its possible
values consist of an entire interval on the number line.

A discrete variable x almost always results from counting, in which case possible values are 0, 1, 2, 3, . . . or some subset of these integers. Continuous variables arise from making measurements. For example, if x is the pH of a chemical substance, then in theory x could be any number between 0 and 14: 7.0, 7.03, 7.032, and so on.

Of course, in practice there are limitations on the degree of
accuracy of any measuring instrument, so we may not be
able to determine pH, reaction time, height, and
concentration to an arbitrarily large number of decimal
places.

However, from the point of view of creating mathematical models for distributions of data, it is helpful to imagine an entire continuum of possible values.

Consider data consisting of observations on a discrete variable x. The frequency of any particular x value is the number of times that value occurs in the data set.

The relative frequency of a value is the fraction or proportion of times the value occurs:

$$\text{relative frequency of a value} = \frac{\text{number of times the value occurs}}{\text{number of observations in the data set}}$$

Suppose, for example, that our data set consists of 200 observations on x = the number of courses a college student is taking this term. If 70 of these x values are 3, then

frequency of the x value 3: 70

relative frequency of the x value 3: 70/200 = .35

Multiplying a relative frequency by 100 gives a percentage;
in the college-course example, 35% of the students in the
sample are taking three courses.

The relative frequencies, or percentages, are usually of more interest than the frequencies themselves. In theory, the relative frequencies should sum to 1, but in practice the sum may differ slightly from 1 because of rounding.

A frequency distribution is a tabulation of the frequencies and/or relative frequencies.
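A frequency distribution of this kind can be tabulated with a short Python sketch. The function name frequency_distribution is hypothetical, and the course-load data are made up so that they agree with the text's example (200 observations, 70 of which equal 3); the other counts are assumptions.

```python
from collections import Counter

def frequency_distribution(data):
    """Map each distinct value to (frequency, relative frequency)."""
    n = len(data)
    counts = Counter(data)
    return {value: (count, count / n) for value, count in sorted(counts.items())}

# Made-up course loads: only the 70 threes out of 200 come from the text.
courses = [2] * 10 + [3] * 70 + [4] * 80 + [5] * 40
table = frequency_distribution(courses)
for value, (freq, rel) in table.items():
    print(f"x = {value}: frequency {freq}, relative frequency {rel:.3f}")
```

Multiplying each relative frequency by 100 gives the corresponding percentage.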

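A histogram for discrete data places a rectangle above each possible x value, with height equal to the relative frequency (or frequency) of that value and with all rectangles having equal width.

This construction ensures that the area of each rectangle is proportional to the relative frequency of the value. Thus if the relative frequencies of x = 1 and x = 5 are .35 and .07, respectively, then the area of the rectangle above 1 is five times the area of the rectangle above 5.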


Example 1.9
How unusual is a no-hitter or a one-hitter in a major league
baseball game, and how frequently does a team get more
than 10, 15, or even 20 hits?

Table 1.1 is a frequency distribution for the number of hits per team per game for all nine-inning games that were played between 1989 and 1993.

Table 1.1 Frequency Distribution for Hits in Nine-Inning Games

The corresponding histogram in Figure 1.7 rises rather smoothly to a single peak and then declines. The histogram extends a bit more on the right (toward large values) than it does on the left, a slight “positive skew.”

Figure 1.7 Histogram of number of hits per nine-inning game

Either from the tabulated information or from the histogram itself, we can determine the following:

$$\text{proportion of games with at most two hits} = .0010 + .0037 + .0108 = .0155$$

Similarly,

$$\text{proportion of games with between 5 and 10 hits (inclusive)} = .6361$$

That is, roughly 64% of all these games resulted in between 5 and 10 (inclusive) hits.

Constructing a histogram for continuous data (measurements) entails subdividing the measurement axis into a suitable number of class intervals or classes, such that each observation is contained in exactly one class.


Example 1.10

Power companies need information about customer usage to obtain accurate forecasts of demands. Investigators from Wisconsin Power and Light determined energy consumption (BTUs) during a particular period for a sample of 90 gas-heated homes. An adjusted consumption value was calculated as follows:

This resulted in the accompanying data (part of the stored
data set FURNACE.MTW available in Minitab), which we
have ordered from smallest to largest.

The most striking feature of the histogram in Figure 1.8 is
its resemblance to a bell-shaped curve, with the point of
symmetry roughly at 10.

Equal-width classes may not be a sensible choice if there
are some regions of the measurement scale that have a
high concentration of data values and other parts where
data is quite sparse.

Figure 1.9 shows a dotplot of such a data set; there is high concentration in the middle, and relatively few observations stretched out to either side. Using a small number of equal-width classes results in almost all observations falling in just one or two of the classes.

If a large number of equal-width classes are used, many
classes will have zero frequency. A sound choice is to use
a few wider intervals near extreme observations and
narrower intervals in the region of high concentration.


Example 1.11
Corrosion of reinforcing steel is a serious problem in
concrete structures located in environments affected by
severe weather conditions.

For this reason, researchers have been investigating the use of reinforcing bars made of composite material.

One study was carried out to develop guidelines for bonding glass-fiber-reinforced plastic rebars to concrete (“Design Recommendations for Bond of GFRP Rebars to Concrete,” J. of Structural Engr., 1996: 247–254).

Consider the following 48 observations on measured bond
strength:

The resulting histogram appears in Figure 1.10. The right
or upper tail stretches out much farther than does the left or
lower tail—a substantial departure from symmetry.

When class widths are unequal, not using a density scale
will give a picture with distorted areas.

For equal class widths, the divisor is the same in each density calculation, and the extra arithmetic simply results in a rescaling of the vertical axis (i.e., the histogram using relative frequency and the one using density will have exactly the same appearance).

Multiplying both sides of the formula for density by the class width gives

$$\text{relative frequency} = (\text{class width})(\text{density}) = (\text{rectangle width})(\text{rectangle height}) = \text{rectangle area}$$

That is, the area of each rectangle is the relative frequency of the corresponding class. Furthermore, since the sum of relative frequencies should be 1, the total area of all rectangles in a density histogram is 1.
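The area property can be checked numerically with a small sketch. It assumes the density of a class is its relative frequency divided by its class width, and that an observation on a class boundary is counted in the class to its right; the function name density_histogram, the data, and the class boundaries are all made up for illustration.

```python
def density_histogram(data, edges):
    """Return (left, right, relative frequency, density) for each class.

    Classes are [edges[i], edges[i+1]); density = relative frequency / class width.
    """
    n = len(data)
    rows = []
    for left, right in zip(edges, edges[1:]):
        count = sum(left <= x < right for x in data)  # boundary value goes right
        rel_freq = count / n
        rows.append((left, right, rel_freq, rel_freq / (right - left)))
    return rows

# Made-up measurements with unequal-width classes: wide at the extremes,
# narrow where the data is concentrated.
data = [1.2, 2.5, 3.1, 4.0, 4.4, 4.9, 5.3, 5.8, 7.5, 11.0]
edges = [0, 4, 5, 6, 12]
rows = density_histogram(data, edges)

# Area of each rectangle = width * height; the areas should total 1.
total_area = sum((right - left) * density for left, right, _, density in rows)
for left, right, rel_freq, density in rows:
    print(f"[{left}, {right}): relative frequency {rel_freq:.2f}, density {density:.4f}")
print("total area:", round(total_area, 10))
```

Here the first and second classes have the same relative frequency, but the wider first class gets a much shorter rectangle, which is what keeps the areas undistorted.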

It is always possible to draw a histogram so that the area
equals the relative frequency (this is true also for a
histogram of discrete data)—just use the density scale.

This property will play an important role in motivating models for distributions in Chapter 4.


Histogram Shapes

Histograms come in a variety of shapes. A unimodal
histogram is one that rises to a single peak and then
declines. A bimodal histogram has two different peaks.

Bimodality can occur when the data set consists of observations on two quite different kinds of individuals or objects.

For example, consider a large data set consisting of driving times for automobiles traveling between San Luis Obispo, California, and Monterey, California (exclusive of stopping time for sightseeing, eating, etc.).

This histogram would show two peaks: one for those cars
that took the inland route (roughly 2.5 hours) and another
for those cars traveling up the coast (3.5–4 hours).

However, bimodality does not automatically follow in such situations. Only if the two separate histograms are “far apart” relative to their spreads will bimodality occur in the histogram of combined data.

Thus a large data set consisting of heights of college students should not result in a bimodal histogram because the typical male height of about 69 inches is not far enough above the typical female height of about 64–65 inches.

A histogram with more than two peaks is said to be
multimodal. Of course, the number of peaks may well
depend on the choice of class intervals, particularly with a
small number of observations. The larger the number of
classes, the more likely it is that bimodality or multimodality
will manifest itself.


Example 1.12
Figure 1.11(a) shows a Minitab histogram of the weights
(lb) of the 124 players listed on the rosters of the San
Francisco 49ers and the New England Patriots (teams the
author would like to see meet in the Super Bowl) as of Nov.
20, 2009.

Figure 1.11(a) NFL player weights: histogram

Figure 1.11(b) is a smoothed histogram (actually what is called a density estimate) of the data from the R software package.

Figure 1.11(b) NFL player weights: smoothed histogram

Both the histogram and the smoothed histogram show three distinct peaks; the one on the right is for linemen, the middle peak corresponds to linebacker weights, and the peak on the left is for all other players (wide receivers, quarterbacks, etc.).

A histogram is symmetric if the left half is a mirror image of the right half. A unimodal histogram is positively skewed if the right or upper tail is stretched out compared with the left or lower tail and negatively skewed if the stretching is to the left.


Figure 1.12 shows “smoothed” histograms, obtained by superimposing a smooth curve on the rectangles, that illustrate the various possibilities.

Figure 1.12 Smoothed histograms: (a) symmetric unimodal; (b) bimodal; (c) positively skewed; (d) negatively skewed


Qualitative Data

Both a frequency distribution and a histogram can be
constructed when the data set is qualitative (categorical) in
nature.

In some cases, there will be a natural ordering of classes (for example, freshmen, sophomores, juniors, seniors, graduate students), whereas in other cases the order will be arbitrary (for example, Catholic, Jewish, Protestant, and the like).

With such categorical data, the intervals above which rectangles are constructed should have equal width.


Example 1.13
The Public Policy Institute of California carried out a
telephone survey of 2501 California adult residents during
April 2006 to ascertain how they felt about various aspects
of K-12 public education. One question asked was “Overall,
how would you rate the quality of public schools in your
neighborhood today?”

Table 1.2 displays the frequencies and relative frequencies, and Figure 1.13 shows the corresponding histogram (bar chart).

Table 1.2 Frequency Distribution for the School Rating Data

Figure 1.13 Histogram of the school rating data from Minitab

More than half the respondents gave an A or B rating, and only slightly more than 10% gave a D or F rating. The percentages for parents of public school children were somewhat more favorable to schools: 24%, 40%, 24%, 6%, 4%, and 2%.


Multivariate Data

Multivariate data is generally rather difficult to describe
visually. Several methods for doing so appear later in the
book, notably scatter plots for bivariate numerical data.


1.3 Measures of Location

Visual summaries of data are excellent tools for obtaining
preliminary impressions and insights. More formal data
analysis often requires the calculation and interpretation of
numerical summary measures.

That is, from the data we try to extract several summarizing numbers that might serve to characterize the data set and convey some of its salient features. Our primary concern will be with numerical data; some comments regarding categorical data appear at the end of the section.

Suppose, then, that our data set is of the form $x_1, x_2, \ldots, x_n$, where each $x_i$ is a number. What features of such a set of numbers are of most interest and deserve emphasis? One important characteristic of a set of numbers is its location, and in particular its center.

This section presents methods for describing the location of a data set.


The Mean

For a given set of numbers $x_1, x_2, \ldots, x_n$, the most familiar and useful measure of the center is the mean, or arithmetic average of the set. Because we will almost always think of the $x_i$’s as constituting a sample, we will often refer to the arithmetic average as the sample mean and denote it by $\bar{x}$.

The sample mean $\bar{x}$ of observations $x_1, x_2, \ldots, x_n$ is given by

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

For reporting $\bar{x}$, we recommend using decimal accuracy of one digit more than the accuracy of the $x_i$’s. Thus if observations are stopping distances with $x_1 = 125$, $x_2 = 131$, and so on, we might have $\bar{x} = 127.3$ ft.

Example 1.14
Recent years have seen growing commercial interest in the
use of what is known as internally cured concrete.

This concrete contains porous inclusions most commonly in the form of lightweight aggregate (LWA).

The article “Characterizing Lightweight Aggregate Desorption at High Relative Humidities Using a Pressure Plate Apparatus” (J. of Materials in Civil Engr, 2012: 961–969) reported on a study in which researchers examined various physical properties of 14 LWA specimens.

Here are the 24-hour water-absorption percentages for the specimens:

Figure 1.14 shows a dotplot of the data; a water-absorption percentage in the mid-teens appears to be “typical.” With $\sum x_i = 229.0$, the sample mean is

$$\bar{x} = \frac{229.0}{14} = 16.36$$

A physical interpretation of $\bar{x}$ demonstrates how it measures the location (center) of a sample. Think of drawing and scaling a horizontal measurement axis, and then represent each sample observation by a 1-lb weight placed at the corresponding point on the axis.

The only point at which a fulcrum can be placed to balance the system of weights is the point corresponding to the value of $\bar{x}$ (see Figure 1.14).

Just as $\bar{x}$ represents the average value of the observations in a sample, the average of all values in the population can be calculated. This average is called the population mean and is denoted by the Greek letter $\mu$. When there are N values in the population (a finite population), then $\mu$ = (sum of the N population values)/N.

We will give a more general definition for $\mu$ that applies to both finite and (conceptually) infinite populations. Just as $\bar{x}$ is an interesting and important measure of sample location, $\mu$ is an interesting and important (often the most important) characteristic of a population.

In the chapters on statistical inference, we will present methods based on the sample mean for drawing conclusions about a population mean.

For example, we might use the sample mean $\bar{x} = 16.36$ computed in Example 1.14 as a point estimate (a single number that is our “best” guess) of $\mu$ = the true average water-absorption percentage for all specimens treated as described.

The mean suffers from one deficiency that makes it an
inappropriate measure of center under some
circumstances: Its value can be greatly affected by the
presence of even a single outlier (unusually large or small
observation).

For example, if a sample of employees contains nine who earn $50,000 per year and one whose yearly salary is $150,000, the sample mean salary is $60,000; this value certainly does not seem representative of the data.
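The salary example can be verified directly. The sketch below uses Python's standard statistics module; the variable names are illustrative only.

```python
from statistics import mean

# Nine employees earning $50,000 and one earning $150,000, as in the text.
salaries = [50_000] * 9 + [150_000]

mean_all = mean(salaries)                  # mean with the outlier included
mean_without_outlier = mean(salaries[:9])  # mean of the nine typical salaries

print("mean of all ten salaries:", mean_all)
print("mean without the largest salary:", mean_without_outlier)
```

A single unusually large salary raises the mean by $10,000, even though nine of the ten employees earn exactly $50,000.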

In such situations, it is desirable to employ a measure that is less sensitive to outlying values than $\bar{x}$, and we will momentarily propose one.

However, although $\bar{x}$ does have this potential defect, it is still the most widely used measure, largely because there are many populations for which an extreme outlier in the sample would be highly unlikely.

When sampling from such a population (a normal or bell-shaped population being the most important example), the sample mean will tend to be stable and quite representative of the sample.


The Median

The word median is synonymous with “middle,” and the
sample median is indeed the middle value once the
observations are ordered from smallest to largest.

When the observations are denoted by $x_1, \ldots, x_n$, we will use the symbol $\tilde{x}$ to represent the sample median.


Example 1.15
People not familiar with classical music might tend to
believe that a composer’s instructions for playing a
particular piece are so specific that the duration would not
depend at all on the performer(s).

However, there is typically plenty of room for interpretation, and orchestral conductors and musicians take full advantage of this.

The author went to the Web site ArkivMusic.com and selected a sample of 12 recordings of Beethoven’s Symphony #9 (the “Choral,” a stunningly beautiful work), yielding the following durations (min) listed in increasing order:

62.3 62.8 63.6 65.2 65.7 66.4 67.4 68.4 68.8 70.8 75.7 79.0

Here is a dotplot of the data:

Figure 1.16 Dotplot of the data from Example 1.15

Since n = 12 is even, the sample median is the average of the n/2 = 6th and (n/2 + 1) = 7th values from the ordered list:

$$\tilde{x} = \frac{66.4 + 67.4}{2} = 66.90$$

Note that if the largest observation 79.0 had not been included in the sample, the resulting sample median for the n = 11 remaining observations would have been the single middle value 66.4 (the [n + 1]/2 = 6th ordered value, i.e., the 6th value in from either end of the ordered list).
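The even and odd cases described above can be sketched as a small Python function applied to the durations of Example 1.15. The function name sample_median is hypothetical; the standard library's statistics.median performs the same calculation.

```python
def sample_median(values):
    """Middle ordered value (n odd) or average of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the (n + 1)/2-th ordered value
    return (ordered[mid - 1] + ordered[mid]) / 2  # n/2-th and (n/2 + 1)-th values

durations = [62.3, 62.8, 63.6, 65.2, 65.7, 66.4,
             67.4, 68.4, 68.8, 70.8, 75.7, 79.0]

median_all = sample_median(durations)             # n = 12, even
median_without_max = sample_median(durations[:-1])  # n = 11, odd

# Making the two largest observations even larger leaves the median alone.
inflated = durations[:-2] + [85.7, 89.0]
median_inflated = sample_median(inflated)

print(round(median_all, 2), median_without_max, round(median_inflated, 2))
```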

The sample mean is $\bar{x} = \sum x_i / n = 816.1/12 = 68.01$, a bit more than a full minute larger than the median.

The mean is pulled out a bit relative to the median because the sample “stretches out” somewhat more on the upper end than on the lower end.

The data in Example 1.15 illustrates an important property of $\tilde{x}$ in contrast to $\bar{x}$: The sample median is very insensitive to outliers. If, for example, we increased the two largest $x_i$’s from 75.7 and 79.0 to 85.7 and 89.0, respectively, $\tilde{x}$ would be unaffected.

Thus, in the treatment of outlying data values, $\bar{x}$ and $\tilde{x}$ are at opposite ends of a spectrum. Both quantities describe where the data is centered, but they will not in general be equal because they focus on different aspects of the sample.

Analogous to $\tilde{x}$ as the middle value in the sample is a middle value in the population, the population median, denoted by $\tilde{\mu}$. As with $\bar{x}$ and $\mu$, we can think of using the sample median $\tilde{x}$ to make an inference about $\tilde{\mu}$.

In Example 1.15, we might use $\tilde{x} = 66.90$ as an estimate of the median time for the population of all recordings. A median is often used to describe income or salary data (because it is not greatly influenced by a few large salaries).

The population mean $\mu$ and median $\tilde{\mu}$ will not generally be identical. If the population distribution is positively or negatively skewed, as pictured in Figure 1.16, then $\mu \neq \tilde{\mu}$.

Figure 1.16 Three different shapes for a population distribution: (a) negative skew; (b) symmetric; (c) positive skew

When this is the case, in making inferences we must first
decide which of the two population characteristics is of
greater interest and then proceed accordingly.

Other Measures of Location: Quartiles, Percentiles, and Trimmed Means

The median (population or sample) divides the data set into two parts of equal size. To obtain finer measures of location, we could divide the data into more than two such parts.

Roughly speaking, quartiles divide the data set into four equal parts, with the observations above the third quartile constituting the upper quarter of the data set, the second quartile being identical to the median, and the first quartile separating the lower quarter from the upper three-quarters.

Similarly, a data set (sample or population) can be even more finely divided using percentiles; the 99th percentile separates the highest 1% from the bottom 99%, and so on.

Unless the number of observations is a multiple of 100, care must be exercised in obtaining percentiles.

The mean is quite sensitive to a single outlier, whereas the median is impervious to many outliers. Since extreme behavior of either type might be undesirable, we briefly consider alternative measures that are neither as sensitive as $\bar{x}$ nor as insensitive as $\tilde{x}$.

To motivate these alternatives, note that $\bar{x}$ and $\tilde{x}$ are at opposite extremes of the same “family” of measures.

The mean is the average of all the data, whereas the median results from eliminating all but the middle one or two values and then averaging.

To paraphrase, the mean involves trimming 0% from each end of the sample, whereas for the median the maximum possible amount is trimmed from each end.

A trimmed mean is a compromise between $\bar{x}$ and $\tilde{x}$. A 10% trimmed mean, for example, would be computed by eliminating the smallest 10% and the largest 10% of the sample and then averaging what remains.


Example 1.16
The production of Bidri is a traditional craft of India. Bidri
wares (bowls, vessels, and so on) are cast from an alloy
containing primarily zinc along with some copper.

Consider the following observations on copper content (%) for a sample of Bidri artifacts in London’s Victoria and Albert Museum (“Enigmas of Bidri,” Surface Engr., 2005: 333–339), listed in increasing order:

2.0 2.4 2.5 2.6 2.6 2.7 2.7 2.8 3.0 3.1 3.2 3.3 3.3
3.4 3.4 3.6 3.6 3.6 3.6 3.7 4.4 4.6 4.7 4.8 5.3 10.1

Figure 1.17 is a dotplot of the data. A prominent feature is the single outlier at the upper end; the distribution is somewhat sparser in the region of larger values than is the case for smaller values.

Figure 1.17 Dotplot of copper contents from Example 1.16

The sample mean and median are 3.65 and 3.35, respectively. A trimmed mean with a trimming percentage of 100(2/26) = 7.7% results from eliminating the two smallest and two largest observations; this gives

$$\bar{x}_{tr(7.7)} = 3.42$$

Trimming here eliminates the larger outlier and so pulls the trimmed mean toward the median.
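The three summaries for the copper data can be reproduced with a short sketch. The function name trimmed_mean and its parameter k (the number of observations dropped from each end) are illustrative choices, not notation from the text.

```python
from statistics import median

def trimmed_mean(values, k):
    """Average what remains after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]

mean_value = trimmed_mean(copper, 0)     # k = 0 is the ordinary mean
median_value = median(copper)
trimmed_value = trimmed_mean(copper, 2)  # trimming 100(2/26) = 7.7% per end

print(f"mean {mean_value:.2f}, median {median_value:.2f}, "
      f"7.7% trimmed mean {trimmed_value:.2f}")
```

The trimmed value of about 3.42 falls between the mean and the median, as the text describes.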

A trimmed mean with a moderate trimming percentage, someplace between 5% and 25%, will yield a measure of center that is neither as sensitive to outliers as is the mean nor as insensitive as the median.

If the desired trimming percentage is $100\alpha\%$ and $n\alpha$ is not an integer, the trimmed mean must be calculated by interpolation. For example, consider $\alpha = .10$ for a 10% trimming percentage and n = 26 as in Example 1.16.

Then $\bar{x}_{tr(10)}$ would be the appropriate weighted average of the 7.7% trimmed mean calculated there and the 11.5% trimmed mean resulting from trimming three observations from each end.
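One reasonable reading of "the appropriate weighted average" is linear interpolation in the trimming percentage: with n = 26 and a 10% trimming proportion, 2.6 observations would be trimmed from each end, which lies 60% of the way from trimming 2 to trimming 3. The sketch below follows that reading; it is an assumption rather than a formula given in the text, and the function names are hypothetical.

```python
import math

def trimmed_mean(values, k):
    """Average what remains after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def interpolated_trimmed_mean(values, alpha):
    """Trimmed mean for trimming proportion alpha, interpolating when
    n * alpha is not an integer (assumed: linear in the number trimmed)."""
    exact = len(values) * alpha  # number to trim per end, possibly fractional
    lower = math.floor(exact)
    weight = exact - lower       # fraction of the way from lower to lower + 1
    if weight == 0:
        return trimmed_mean(values, lower)
    return ((1 - weight) * trimmed_mean(values, lower)
            + weight * trimmed_mean(values, lower + 1))

copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]

t2 = trimmed_mean(copper, 2)   # 7.7% trimmed mean
t3 = trimmed_mean(copper, 3)   # 11.5% trimmed mean
t10 = interpolated_trimmed_mean(copper, 0.10)
print(f"7.7%: {t2:.3f}   11.5%: {t3:.3f}   interpolated 10%: {t10:.3f}")
```

Under this reading the 10% trimmed mean gives weight .4 to the 7.7% trimmed mean and weight .6 to the 11.5% trimmed mean.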

Categorical Data and Sample Proportions


When the data is categorical, a frequency distribution or
relative frequency distribution provides an effective tabular
summary of the data. The natural numerical summary
quantities in this situation are the individual frequencies
and the relative frequencies.

For example, if a survey of individuals who own digital cameras is undertaken to study brand preference, then each individual in the sample would identify the brand of camera that he or she owned, from which we could count the number owning Canon, Sony, Kodak, and so on.

Consider sampling a dichotomous population, one that consists of only two categories (such as voted or did not vote in the last election, does or does not own a digital camera, etc.).

If we let x denote the number in the sample falling in category 1, then the number in category 2 is n – x. The relative frequency or sample proportion in category 1 is x/n and the sample proportion in category 2 is 1 – x/n.

Let’s denote a response that falls in category 1 by a 1 and a response that falls in category 2 by a 0. A sample size of n = 10 might then yield the responses 1, 1, 0, 1, 1, 1, 0, 0, 1, 1. The sample mean for this numerical sample is (since number of 1s = x = 7)

$$\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1 + 1 + 0 + \cdots + 1 + 1}{10} = \frac{7}{10} = \frac{x}{n} = \text{sample proportion}$$

More generally, focus attention on a particular category and code the sample results so that a 1 is recorded for an observation in the category and a 0 for an observation not in the category.

Then the sample proportion of observations in the category is the sample mean of the sequence of 1s and 0s. Thus a sample mean can be used to summarize the results of a categorical sample.

These remarks also apply to situations in which categories are defined by grouping values in a numerical sample or population (e.g., we might be interested in knowing whether individuals have owned their present automobile for at least 5 years, rather than studying the exact length of ownership).
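The coding of categories as 1s and 0s can be illustrated with a short sketch. The ten responses are those quoted in the text; the ownership lengths in the second part are made up to show grouping of a numerical sample.

```python
# Responses from the text: 1 = category 1, 0 = category 2.
responses = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
n = len(responses)
x = sum(responses)       # number of observations in category 1
sample_mean = sum(responses) / n
sample_proportion = x / n  # identical to the mean of the 0/1 codes

# Grouping a numerical sample: hypothetical years of automobile ownership.
years_owned = [2, 7, 5, 1, 9, 4]
owned_at_least_5 = [1 if years >= 5 else 0 for years in years_owned]
proportion_at_least_5 = sum(owned_at_least_5) / len(owned_at_least_5)

print(sample_mean, sample_proportion, proportion_at_least_5)
```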

Analogous to the sample proportion x/n of individuals or objects falling in a particular category, let p represent the proportion of those in the entire population falling in the category.

As with x/n, p is a quantity between 0 and 1, and while x/n is a sample characteristic, p is a characteristic of the population.

The relationship between the two parallels the relationship between $\tilde{x}$ and $\tilde{\mu}$ and between $\bar{x}$ and $\mu$. In particular, we will subsequently use x/n to make inferences about p.

If a sample of 100 students from a large university reveals that 38 have Macintosh computers, then we could use 38/100 = .38 as a point estimate of the proportion of all students at the university who have Macs. Or we might ask whether this sample provides strong evidence for concluding that at least 1/3 of all students are Mac owners.

With k categories (k > 2), we can use the k sample proportions to answer questions about the population proportions $p_1, \ldots, p_k$.


1.4 Measures of Variability

Reporting a measure of center gives only partial
information about a data set or distribution. Different
samples or populations may have identical measures of
center yet differ from one another in other important ways.

Figure 1.18 shows dotplots of three samples with the same mean and median, yet the extent of spread about the center is different for all three samples.

Figure 1.18 Samples with identical measures of center but different amounts of variability

The first sample has the largest amount of variability, the third has the smallest amount, and the second is intermediate to the other two in this respect.

Measures of Variability for Sample Data

The simplest measure of variability in a sample is the range, which is the difference between the largest and smallest sample values. The value of the range for sample 1 in Figure 1.18 is much larger than it is for sample 3, reflecting more variability in the first sample than in the third.

3
2/18/2025

Measures of Variability for Sample Data


A defect of the range, though, is that it depends on only the
two most extreme observations and disregards the
positions of the remaining n – 2 values. Samples 1 and 2 in
Figure 1.18 have identical ranges, yet when we take into
account the observations between the two extremes, there
is much less variability or dispersion in the second sample
than in the first.

Our primary measures of variability involve the deviations
from the mean, $x_1 - \bar{x}, x_2 - \bar{x}, \dots, x_n - \bar{x}$. That is, the
deviations from the mean are obtained by subtracting $\bar{x}$
from each of the n sample observations.



A deviation will be positive if the observation is larger than
the mean (to the right of the mean on the measurement
axis) and negative if the observation is smaller than the
mean. If all the deviations are small in magnitude, then all
$x_i$'s are close to the mean and there is little variability.

Alternatively, if some of the deviations are large in
magnitude, then some $x_i$'s lie far from $\bar{x}$, suggesting a
greater amount of variability.

A simple way to combine the deviations into a single
quantity is to average them.


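Unfortunately, this is a bad idea:

$$\text{sum of deviations} = \sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

so that the average deviation is always zero. The
verification uses several standard rules of summation and
the fact that $\sum \bar{x} = \bar{x} + \bar{x} + \cdots + \bar{x} = n\bar{x}$.

How can we prevent negative and positive deviations from
counteracting one another when they are combined?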



One possibility is to work with the absolute values of the
deviations and calculate the average absolute deviation
$\sum |x_i - \bar{x}| / n$.

Because the absolute value operation leads to a number of
theoretical difficulties, consider instead the squared
deviations $(x_1 - \bar{x})^2, (x_2 - \bar{x})^2, \dots, (x_n - \bar{x})^2$.

Rather than use the average squared deviation
$\sum (x_i - \bar{x})^2 / n$,
for several reasons we divide the sum of squared
deviations by n – 1 rather than n.


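Definition

The sample variance, denoted by $s^2$, is given by

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{S_{xx}}{n - 1}$$

The sample standard deviation, denoted by $s$, is the
(positive) square root of the variance: $s = \sqrt{s^2}$.

Note that $s^2$ and $s$ are both nonnegative. The unit for $s$ is
the same as the unit for each of the $x_i$'s.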
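The steps leading to the sample variance can be sketched in a few lines. This is a minimal illustration, assuming the divisor n – 1 discussed above; the five fuel-efficiency values are hypothetical and are not data from the text.

```python
import math

def sample_variance(xs):
    """Sample variance from the defining formula: S_xx / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    deviations = [x - mean for x in xs]
    s_xx = sum(d * d for d in deviations)   # sum of squared deviations
    return s_xx / (n - 1)

# Hypothetical fuel efficiencies in mpg (illustrative values only).
mpg = [28.0, 30.0, 31.0, 33.0, 28.0]

mean = sum(mpg) / len(mpg)
deviations = [x - mean for x in mpg]
print("sum of deviations:", sum(deviations))   # zero, up to rounding

s2 = sample_variance(mpg)     # in squared units (mpg^2)
s = math.sqrt(s2)             # back in the original units (mpg)
print("s^2 =", s2, " s =", round(s, 3))
```

Averaging the raw deviations would be uninformative because they cancel; squaring first is what lets the combined quantity reflect spread.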


If, for example, the observations are fuel efficiencies in
miles per gallon, then we might have s = 2.0 mpg. A rough
interpretation of the sample standard deviation is that it is
the size of a typical or representative deviation from the
sample mean within the given sample.

Thus if s = 2.0 mpg, then some $x_i$'s in the sample are closer
than 2.0 to $\bar{x}$, whereas others are farther away; 2.0 is a
representative (or “standard”) deviation from the mean fuel
efficiency. If s = 3.0 for a second sample of cars of another
type, a typical deviation in this sample is roughly 1.5 times
what it is in the first sample, an indication of more variability
in the second sample.

Example 1.17
The Web site www.fueleconomy.gov contains a wealth of
information about fuel characteristics of various vehicles. In
addition to EPA mileage ratings, there are many vehicles
for which users have reported their own values of fuel
efficiency (mpg).

Consider the following sample of n = 11 efficiencies for the
2009 Ford Focus equipped with an automatic transmission
(for this model, EPA reports an overall rating of
27 mpg: 24 mpg for city driving and 33 mpg for highway
driving):

Effects of rounding account for the sum of deviations not
being exactly zero. The numerator of $s^2$ is $S_{xx} = 314.106$,
from which

$$s^2 = \frac{S_{xx}}{n - 1} = \frac{314.106}{11 - 1} = 31.41, \qquad s = \sqrt{31.41} = 5.60$$

The size of a representative deviation from the sample
mean 33.26 is roughly 5.6 mpg.

Note: Of the nine people who also reported driving
behavior, only three did more than 80% of their driving in
highway mode; we bet you can guess which cars they
drove.

We haven’t a clue why all 11 reported values exceed the
EPA figure; maybe only drivers with really good fuel
efficiencies communicate their results.


Motivation for $s^2$
To explain the rationale for the divisor n – 1 in $s^2$, note first
that whereas $s^2$ measures sample variability, there is a
measure of variability in the population called the
population variance.

We will use $\sigma^2$ (the square of the lowercase Greek letter
sigma) to denote the population variance and $\sigma$ to denote
the population standard deviation (the square root of $\sigma^2$).

When the population is finite and consists of N values,

$$\sigma^2 = \sum_{i=1}^{N} (x_i - \mu)^2 / N$$

which is the average of all squared deviations from the
population mean (for the population, the divisor is N and
not N – 1).

Just as $\bar{x}$ will be used to make inferences about the
population mean $\mu$, we should define the sample variance
so that it can be used to make inferences about $\sigma^2$. Now
note that $\sigma^2$ involves squared deviations about the
population mean $\mu$.

If we actually knew the value of $\mu$, then we could define the
sample variance as the average squared deviation of the
sample $x_i$'s about $\mu$.

However, the value of $\mu$ is almost never known, so the sum
of squared deviations about $\bar{x}$ must be used.

But the $x_i$'s tend to be closer to their average $\bar{x}$ than to the
population average $\mu$, so to compensate for this the divisor
n – 1 is used rather than n.

In other words, if we used a divisor n in the sample
variance, then the resulting quantity would tend to
underestimate $\sigma^2$ (produce estimated values that are too
small on the average), whereas dividing by the slightly
smaller n – 1 corrects this underestimating.
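The underestimation can be seen in a small simulation. The sketch below is only an illustration under stated assumptions: the population is taken to be normal with a hypothetical mean of 10 and standard deviation of 2 (so the population variance is 4), and the sample size, number of trials, and seed are arbitrary choices, none of which come from the text.

```python
import random

def variance_with_divisor(xs, divisor):
    """Sum of squared deviations about the sample mean, divided by `divisor`."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / divisor

random.seed(1)     # fixed seed so the run is repeatable
sigma = 2.0        # hypothetical population standard deviation
n = 5              # a small sample size makes the bias easy to see
trials = 20000

total_n = 0.0
total_n_minus_1 = 0.0
for _ in range(trials):
    sample = [random.gauss(10.0, sigma) for _ in range(n)]
    total_n += variance_with_divisor(sample, n)
    total_n_minus_1 += variance_with_divisor(sample, n - 1)

avg_n = total_n / trials                   # tends to fall below sigma^2
avg_n_minus_1 = total_n_minus_1 / trials   # tends to land near sigma^2
print("divisor n:    ", round(avg_n, 3))
print("divisor n - 1:", round(avg_n_minus_1, 3))
print("sigma^2:      ", sigma ** 2)
```

Over many samples the divisor-n average settles near (n – 1)/n times the population variance, while the divisor n – 1 average settles near the population variance itself.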

It is customary to refer to $s^2$ as being based on n – 1
degrees of freedom (df). This terminology reflects the fact
that although $s^2$ is based on the n quantities
$x_1 - \bar{x}, x_2 - \bar{x}, \dots, x_n - \bar{x}$, these sum to 0, so specifying the
values of any n – 1 of the quantities determines the
remaining value.

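For example, if n = 4 and the values of three of the deviations
are specified, then the fourth is automatically determined
because the four deviations must sum to 0, so only three of
the four values of $x_i - \bar{x}$ are freely determined (3 df).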


A Computing Formula for $s^2$


It is best to obtain $s^2$ from statistical software or else use a
calculator that allows you to enter data into memory and
then view $s^2$ with a single keystroke. If your calculator does
not have this capability, there is an alternative formula for
$S_{xx}$ that avoids calculating the deviations:

$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$$

The formula involves both summing and then
squaring, and squaring and then summing.


Example 1.18
Traumatic knee dislocation often requires surgery to repair
ruptured ligaments. One measure of recovery is range of
motion (measured as the angle formed when, starting with
the leg straight, the knee is bent as far as possible).

The given data on postsurgical range of motion appeared
in the article “Reconstruction of the Anterior and Posterior
Cruciate Ligaments After Knee Dislocation”
(Amer. J. Sports Med., 1999: 189–197):

154 142 137 133 122 126 135 135 108 120 127 134 122

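The sum of these 13 sample observations is $\sum x_i = 1695$,
and the sum of their squares is $\sum x_i^2 = 222{,}581$.

Thus the numerator of the sample variance is

$$S_{xx} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n} = 222{,}581 - \frac{(1695)^2}{13} = 1579.0769$$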

from which

$$s^2 = \frac{1579.0769}{12} = 131.59 \quad \text{and} \quad s = \sqrt{131.59} = 11.47$$
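The arithmetic of Example 1.18 can be checked with a short script. This is a sketch that applies the shortcut form of $S_{xx}$ (sum of squares minus the squared sum divided by n) to the 13 range-of-motion values listed in the example, and compares it with the defining sum of squared deviations; the variable names are illustrative.

```python
import math

# Range-of-motion observations listed in Example 1.18.
x = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 134, 122]
n = len(x)

sum_x = sum(x)                        # sum first (squared below)
sum_x_sq = sum(v * v for v in x)      # square first, then sum

# Shortcut formula: no individual deviations are needed.
s_xx_shortcut = sum_x_sq - sum_x ** 2 / n

# Defining formula, for comparison.
mean = sum_x / n
s_xx_definition = sum((v - mean) ** 2 for v in x)

s2 = s_xx_shortcut / (n - 1)
s = math.sqrt(s2)

print(sum_x, sum_x_sq)                # 1695 222581
print(round(s_xx_shortcut, 4))        # 1579.0769
print(round(s2, 2), round(s, 2))      # 131.59 11.47
```

With large data values and small spread, the shortcut subtracts two nearly equal numbers, which is one reason intermediate results should not be rounded.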



Both the defining formula and the computational formula for
$s^2$ can be sensitive to rounding, so as much decimal
accuracy as possible should be used in intermediate
calculations.

Several other properties of $s^2$ can enhance understanding
and facilitate computation.



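Proposition

Let $x_1, x_2, \dots, x_n$ be a sample and c be any nonzero constant.

1. If $y_1 = x_1 + c, \dots, y_n = x_n + c$, then $s_y^2 = s_x^2$.
2. If $y_1 = c x_1, \dots, y_n = c x_n$, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$,

where $s_x^2$ is the sample variance of the x's and $s_y^2$ is the
sample variance of the y's.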


In words, Result 1 says that if a constant c is added to (or
subtracted from) each data value, the variance is
unchanged. This is intuitive, since adding or subtracting c
shifts the location of the data set but leaves distances
between data values unchanged.

According to Result 2, multiplication of each $x_i$ by c results
in $s^2$ being multiplied by a factor of $c^2$. These properties can
be proved by noting in Result 1 that $\bar{y} = \bar{x} + c$ and in
Result 2 that $\bar{y} = c\bar{x}$.
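Results 1 and 2 can be checked numerically. The following sketch reuses the range-of-motion values from Example 1.18 and a hypothetical constant c = 2.5; it is a numerical check on one data set, not a proof.

```python
def sample_variance(xs):
    """Sample variance with divisor n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((v - mean) ** 2 for v in xs) / (n - 1)

# Range-of-motion observations from Example 1.18.
x = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 134, 122]
c = 2.5   # hypothetical constant chosen for illustration

shifted = [v + c for v in x]   # Result 1: add c to every value
scaled = [c * v for v in x]    # Result 2: multiply every value by c

var_x = sample_variance(x)
var_shifted = sample_variance(shifted)
var_scaled = sample_variance(scaled)

print(round(var_x, 4), round(var_shifted, 4))        # equal
print(round(var_scaled, 4), round(c ** 2 * var_x, 4))  # equal
```

Result 2 is what makes unit conversions easy: changing units multiplies s by the absolute value of the conversion factor.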


Boxplots
Stem-and-leaf displays and histograms convey rather
general impressions about a data set, whereas a single
summary such as the mean or standard deviation focuses
on just one aspect of the data.

In recent years, a pictorial summary called a boxplot has
been used successfully to describe several of a data set’s
most prominent features.

These features include (1) center, (2) spread, (3) the extent
and nature of any departure from symmetry, and (4)
identification of “outliers,” observations that lie unusually far
from the main body of the data.

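Because even a single outlier can drastically affect the
values of $\bar{x}$ and s, a boxplot is based on measures that are
“resistant” to the presence of a few outliers: the median
and a measure of variability called the fourth spread.

Definition

Order the n observations from smallest to largest and
separate the smallest half from the largest half; the median
$\tilde{x}$ is included in both halves if n is odd. Then the lower
fourth is the median of the smallest half and the upper
fourth is the median of the largest half. A measure of spread
that is resistant to outliers is the fourth spread $f_s$, given by

$$f_s = \text{upper fourth} - \text{lower fourth}$$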

Roughly speaking, the fourth spread is unaffected by the
positions of those observations in the smallest 25% or the
largest 25% of the data. Hence it is resistant to outliers.
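A short sketch of the fourth spread follows. It assumes the usual convention that the lower and upper fourths are the medians of the smaller and larger halves of the ordered data, with the median included in both halves when n is odd. The function names and the two small data sets are hypothetical.

```python
def median(sorted_xs):
    """Median of an already sorted list."""
    n = len(sorted_xs)
    mid = n // 2
    if n % 2 == 1:
        return sorted_xs[mid]
    return (sorted_xs[mid - 1] + sorted_xs[mid]) / 2

def fourths(xs):
    """Return (lower fourth, upper fourth)."""
    ordered = sorted(xs)
    n = len(ordered)
    half = (n + 1) // 2          # each half includes the median when n is odd
    lower = median(ordered[:half])
    upper = median(ordered[n - half:])
    return lower, upper

def fourth_spread(xs):
    lower, upper = fourths(xs)
    return upper - lower

# Hypothetical data: the second list replaces the largest value by an outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]

print(fourths(data), fourth_spread(data))
print(fourths(with_outlier), fourth_spread(with_outlier))
print(max(data) - min(data), max(with_outlier) - min(with_outlier))
```

The fourth spread is the same for both lists, whereas the range changes dramatically, which is the sense in which the fourth spread is resistant.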

The simplest boxplot is based on the following five-number
summary:

smallest $x_i$, lower fourth, median, upper fourth, largest $x_i$

First, draw a horizontal measurement scale. Then place a
rectangle above this axis; the left edge of the rectangle is at
the lower fourth, and the right edge is at the upper fourth
(so box width = $f_s$).

Place a vertical line segment or some other symbol inside
the rectangle at the location of the median; the position of
the median symbol relative to the two edges conveys
information about skewness in the middle 50% of the data.

Finally, draw “whiskers” out from either end of the rectangle
to the smallest and largest observations. A boxplot with a
vertical orientation can also be drawn by making obvious
modifications in the construction process.


Example 1.19
The accompanying data consists of observations on the
time until failure (1000s of hours) for a sample of
turbochargers from one type of engine (from “The Beta
Generalized Weibull Distribution: Properties and
Applications,” Reliability Engr. and System Safety, 2012: 5–15).

The five-number summary is as follows.

smallest: 1.6   lower fourth: 5.05   median: 6.5   upper fourth: 7.85   largest: 9.0

Figure 1.19 shows Minitab output from a request to describe
the data. Q1 and Q3 are the lower and upper quartiles,
respectively, and IQR (interquartile range) is the difference
between these quartiles. SE Mean is $s/\sqrt{n}$, the “standard
error of the mean”; it will be important in our subsequent
development of several widely used procedures for making
inferences about the population mean µ.

Figure 1.20 shows both a dotplot of the data and a boxplot.
Both plots indicate that there is a reasonable amount of
symmetry in the middle 50% of the data, but overall values
stretch out more toward the low end than toward the high
end—a negative skew. The box itself is not very narrow,
indicating a fair amount of variability in the middle half of
the data, and the lower whisker is especially long.


Boxplots That Show Outliers


A boxplot can be embellished to indicate explicitly the
presence of outliers. Many inferential procedures are based
on the assumption that the population distribution is normal
(a certain type of bell curve). Even a single extreme outlier
in the sample warns the investigator that such procedures
may be unreliable, and the presence of several mild
outliers conveys the same message.

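Definition

Any observation farther than $1.5f_s$ from the closest fourth is
an outlier. An outlier is extreme if it is more than $3f_s$ from
the nearest fourth, and it is mild otherwise.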



Let’s now modify our previous construction of a boxplot by
drawing a whisker out from each end of the box to the
smallest and largest observations that are not outliers.

Now represent each mild outlier by a closed circle and
each extreme outlier by an open circle. Some statistical
computer packages do not distinguish between mild and
extreme outliers.


Example 1.20
The Clean Water Act and subsequent amendments require
that all waters in the United States meet specific pollution
reduction goals to ensure that water is “fishable and
swimmable.”

The article “Spurious Correlation in the USEPA Rating
Curve Method for Estimating Pollutant Loads” (J. of
Environ. Engr., 2008: 610–618) investigated various
techniques for estimating pollutant loads in watersheds; the
authors “discuss the imperative need to use sound
statistical methods” for this purpose.

Among the data considered is the following sample of TN
(total nitrogen) loads (kg N/day) from a particular
Chesapeake Bay location, displayed here in increasing
order.

Relevant summary quantities are

lower 4th = 45.64   upper 4th = 167.79   $f_s$ = 122.15   $1.5f_s$ = 183.225   $3f_s$ = 366.45

Subtracting $1.5f_s$ from the lower 4th gives a negative
number, and none of the observations are negative, so
there are no outliers on the lower end of the data.
However,

upper 4th + $1.5f_s$ = 351.015   upper 4th + $3f_s$ = 534.24

Thus the four largest observations (563.92, 690.11,
826.54, and 1529.35) are extreme outliers, and 352.09,
371.47, 444.68, and 460.86 are mild outliers.
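The outlier classification used in this example can be sketched as a small function. Two assumptions are worth flagging: the rule coded here is the 1.5 and 3 fourth-spread cutoffs applied in the example, and the fourths 45.64 and 167.79 are backed out from the two cutoffs 351.015 and 534.24 quoted above rather than computed from the full data set, which is not reproduced here. The function name is hypothetical.

```python
def classify(x, lower_fourth, upper_fourth):
    """Label x using the 1.5*fs and 3*fs cutoffs measured from the nearer fourth."""
    fs = upper_fourth - lower_fourth
    # Distance beyond the box; zero for values between the fourths.
    distance = max(lower_fourth - x, x - upper_fourth, 0.0)
    if distance > 3 * fs:
        return "extreme outlier"
    if distance > 1.5 * fs:
        return "mild outlier"
    return "not an outlier"

# Fourths implied by the cutoffs quoted in the example (an assumption, see above).
LOWER_FOURTH = 45.64
UPPER_FOURTH = 167.79

# Observations mentioned in the example text.
for load in [9.69, 312.45, 352.09, 460.86, 563.92, 1529.35]:
    print(load, classify(load, LOWER_FOURTH, UPPER_FOURTH))
```

A whisker would then extend to the most extreme observation that the function labels "not an outlier" on each side.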

The whiskers in the boxplot in Figure 1.21 extend out to the
smallest observation, 9.69, on the low end and 312.45, the
largest observation that is not an outlier, on the upper end.

Figure 1.21: A boxplot of the nitrogen load data showing mild and extreme outliers

There is some positive skewness in the middle half of the
data (the median line is somewhat closer to the left edge of
the box than to the right edge) and a great deal of positive
skewness overall.


Comparative Boxplots
A comparative or side-by-side boxplot is a very effective
way of revealing similarities and differences between two or
more data sets consisting of observations on the same
variable—fuel efficiency observations for four different
types of automobiles, crop yields for three different
varieties, and so on.


Example 1.21
High levels of sodium in food products represent a growing health
concern. The accompanying data consists of values of sodium
content in one serving of cereal for one sample of cereals
manufactured by General Mills, another sample manufactured by
Kellogg, and a third sample produced by Post (see the website
http://www.nutritionresource.com/foodcomp2.cfm?id=0800 rather
than visiting your neighborhood grocery store!).

Figure 1.22 shows a comparative boxplot of the data from
the software package R. The typical sodium content
(median) is roughly the same for all three companies. But
the distributions differ markedly in other respects.


The General Mills data shows a substantial positive skew
both in the middle 50% and overall, with two outliers at the
upper end.

The Kellogg data exhibits a negative skew in the middle
50% and a positive skew overall, except for the outlier at
the low end (this outlier is not identified by Minitab).

The Post data is negatively skewed both in the middle 50%
and overall with no outliers.


Variability as assessed by the box length (here the
interquartile range rather than the fourth spread) is smallest
for the G brand and largest for the P brand, with the K
brand intermediate to the other two; looking instead at
standard deviations, 𝑠 and 𝑠 are roughly the same and
both much larger than 𝑠 .
