Numerical Descriptive Measures
MEASURES OF CENTRAL TENDENCY FOR
UNGROUPED DATA
The measures of central tendency that we discuss in this chapter are the
mean, the median, and the mode.
Figure 3.1
Mean
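The mean for ungrouped data is obtained by dividing the
sum of all values by the number of values in the data set. Thus,
Mean for population data: \( \mu = \frac{\sum x}{N} \)
Mean for sample data: \( \bar{x} = \frac{\sum x}{n} \)
where \( \sum x \) is the sum of all values, \( N \) is the population size, and \( n \) is the sample size.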
Example 3-1
Table 3.1 lists the total cash donations (rounded to millions of
dollars) given by eight U.S. companies during the year 2016.
Table 3.1 Cash Donations in 2016 by Eight U.S. Companies
\( \sum x = x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 \)
\( = 319 + 199 + 110 + 63 + 21 + 315 + 26 + 63 = 1116 \)
\( \bar{x} = \frac{\sum x}{n} = \frac{1116}{8} = 139.5 = \$139.5 \) million
Example 3-2
The following are the ages (in years) of all eight employees of a
small company:
53 32 61 27 39 44 49 57
Example 3-2: Solution
The population mean is
\( \mu = \frac{\sum x}{N} = \frac{362}{8} = 45.25 \) years
Reconsider Example 3-2. If we take a sample of 3
employees from this company and calculate the mean age of
those 3 employees, this mean will be denoted by \( \bar{x} \).
Suppose the three values included in the sample are 32, 39,
and 57. Then, the mean age for this sample is
\( \bar{x} = \frac{32 + 39 + 57}{3} = 42.67 \) years
If we take a second sample of 3 employees of this company,
the value of \( \bar{x} \) will (most likely) be different. Suppose the
second sample includes the values 53, 27, and 44. Then, the
mean age for this sample is
\( \bar{x} = \frac{53 + 27 + 44}{3} = 41.33 \) years
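The mean calculations in Examples 3-1 and 3-2 can be checked with a few lines of Python. This is a minimal sketch: the function name `mean` and the variable names are illustrative choices, not part of the original slides.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


donations = [319, 199, 110, 63, 21, 315, 26, 63]  # Example 3-1, millions of dollars
ages = [53, 32, 61, 27, 39, 44, 49, 57]           # Example 3-2, years (whole population)

sample_mean_donations = mean(donations)   # x-bar for the eight companies
population_mean_age = mean(ages)          # mu for all eight employees
first_sample_mean = mean([32, 39, 57])    # first sample of 3 employees
second_sample_mean = mean([53, 27, 44])   # second sample of 3 employees

print(sample_mean_donations, population_mean_age)
print(round(first_sample_mean, 2), round(second_sample_mean, 2))
```

The two sample means differ from each other and from the population mean, which is the point the slides make about sampling.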
Example 3-3
Note that the number of homes foreclosed in California is
very large compared to those in the other six states.
Hence, it is an outlier. Show how the inclusion of this outlier
affects the value of the mean.
Example 3-3: Solution
If we do not include the number of homes foreclosed in
California (the outlier), the mean of the number of
foreclosed homes in six states is
Now, to see the impact of the outlier on the value of the
mean, we include the number of homes foreclosed in
California and find the mean number of homes foreclosed
in the seven states. This mean is
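The effect of the outlier can be reproduced numerically. The sketch below uses the seven foreclosure counts listed in the solution of Example 3-4 and assumes that the largest value, 173,175, is the California figure (the table itself is not reproduced in this text). The variable names are illustrative.

```python
# Foreclosure counts for the seven states, as listed in Example 3-4.
foreclosures = [10824, 18038, 20352, 40911, 49723, 61848, 173175]

# Assumption: the largest value is the California outlier.
outlier = max(foreclosures)
without_outlier = [value for value in foreclosures if value != outlier]

mean_without = sum(without_outlier) / len(without_outlier)  # six states
mean_with = sum(foreclosures) / len(foreclosures)           # seven states

print(mean_without, mean_with)
```

Under that assumption, including the outlier raises the mean by roughly 60%, which illustrates why the mean is not always the best measure of central tendency.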
Median
The median is the value of the middle term in a data set
that has been ranked in increasing order; that is, it divides a
ranked data set into two equal parts.
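Note that if the number of observations in
a data set is odd, then the median is given
by the value of the middle term in the
ranked data. If the number of observations is
even, then the median is given by the average
of the values of the two middle terms.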
Example 3-4
Refer to the data on the number of homes foreclosed in 7
states given in Table 3.2 of Example 3-3. Find the median
for these data.
Example 3-4: Solution
First, we rank the given data in increasing order as follows:
10,824 18,038 20,352 40,911 49,723 61,848 173,175
Since there are 7 values in this data set and the middle term
is the fourth term, the median is given by the value of the 4th
term in the ranked data. Thus, the median is 40,911 homes.
Example 3-5
Table 3.3 gives the total compensations (in millions of
dollars) for the year 2010 of the 12 highest-paid CEOs of
U.S. companies.
Table 3.3 Total Compensations of 12 Highest-Paid CEOs for the Year 2010
Find the median for these data.
Example 3-5: Solution
First we rank the given total compensations of the 12 CEOs as
follows:
21.6 21.7 22.9 25.2 26.5 28.0 28.2 32.6 32.9 70.1 76.1 84.5
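The two middle values are the sixth and seventh in the
arranged data, and these two values are 28.0 and 28.2.
The median is the average of these two values:
\( \text{Median} = \frac{28.0 + 28.2}{2} = 28.1 = \$28.1 \) million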
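Both cases of the median (odd and even number of observations) can be sketched as a single function. The function name `median` is an illustrative choice; the data are the ranked values from Examples 3-4 and 3-5.

```python
def median(values):
    """Middle value of the ranked data; average of the two middle values if n is even."""
    ranked = sorted(values)
    n = len(ranked)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]                         # odd n: single middle term
    return (ranked[middle - 1] + ranked[middle]) / 2  # even n: mean of two middle terms


foreclosures = [10824, 18038, 20352, 40911, 49723, 61848, 173175]  # Example 3-4
compensations = [21.6, 21.7, 22.9, 25.2, 26.5, 28.0,
                 28.2, 32.6, 32.9, 70.1, 76.1, 84.5]               # Example 3-5

median_foreclosures = median(foreclosures)
median_compensation = median(compensations)
print(median_foreclosures, round(median_compensation, 1))
```

Unlike the mean, the median of the foreclosure data is not pulled upward by the one very large value.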
Mode
Example 3-6
The following data give the speeds (in miles per hour) of
eight cars that were stopped on NH-95 for speeding
violations.
77 82 74 81 79 84 74 78
Example 3-6: Solution
In this data set, 74 occurs twice and each of the remaining
values occurs only once. Because 74 occurs with the highest
frequency, it is the mode. Therefore, the mode is 74 miles per hour.
A major shortcoming of the mode is that a data set may
have none or may have more than one mode, whereas it will
have only one mean and only one median.
Example 3-7 (Data set with no mode)
Last year’s incomes of five randomly selected families were
$76,150, $95,750, $124,985, $87,490, and $53,740.
Example 3-7: Solution
Because each value in this data set occurs only once, this data
set contains no mode.
Example 3-8 (Data set with two modes)
A small company has 12 employees. Their commuting times
(rounded to the nearest minute) from home to work are 23,
36, 12, 23, 47, 32, 8, 12, 26, 31, 18, and 28, respectively.
Example 3-8: Solution
In the given data on the commuting times of the 12
employees, each of the values 12 and 23 occurs twice, and
each of the remaining values occurs only once. Therefore,
that data set has two modes: 12 and 23 minutes.
Example 3-9 (Data set with three modes)
The ages of 10 randomly selected students from a class are 21,
19, 27, 22, 29, 19, 25, 21, 22 and 30 years, respectively.
Example 3-9: Solution
This data set has three modes: 19, 21 and 22. Each of these
three values occurs with a (highest) frequency of 2.
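Examples 3-6 through 3-9 can be reproduced with one small function. This is a sketch that follows the convention used in the slides: a data set in which every value occurs only once has no mode. The function name `modes` is an illustrative choice.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency.

    Following the convention in the examples, return an empty list
    when every value occurs only once (no mode).
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(value for value, count in counts.items() if count == top)


speeds = [77, 82, 74, 81, 79, 84, 74, 78]                      # Example 3-6
incomes = [76150, 95750, 124985, 87490, 53740]                 # Example 3-7
commute = [23, 36, 12, 23, 47, 32, 8, 12, 26, 31, 18, 28]      # Example 3-8
ages = [21, 19, 27, 22, 29, 19, 25, 21, 22, 30]                # Example 3-9

print(modes(speeds), modes(incomes), modes(commute), modes(ages))
```

Because the function only counts occurrences, it also works for qualitative data such as a list of category labels.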
One advantage of the mode is that it can be calculated for
both kinds of data, quantitative and qualitative, whereas
the mean and median can be calculated for only
quantitative data.
Example 3-10
Example 3-10: Solution
Because senior occurs more frequently than the other
categories, it is the mode for this data set. We cannot
calculate the mean and median for this data set.
To sum up, we cannot say for sure which of the three
measures of central tendency is the best measure overall.
Each of them may be the better choice in different situations.
For a symmetric histogram and frequency distribution with one peak, the
values of the mean, median, and mode are identical, and they lie at the center
of the distribution.
Relationships Among the Mean, Median, and Mode
Figure 3.3 Mean, median, and mode for a histogram and frequency
distribution curve skewed to the right.
For a histogram and a frequency distribution curve skewed to the right, the
value of the mean is the largest, that of the mode is the smallest, and the value
of the median lies between these two. (Notice that the mode always occurs at
the peak point.) The value of the mean is the largest in this case because it is
sensitive to outliers that occur in the right tail. These outliers pull the mean to
the right.
Figure 3.4 Mean, median, and mode for a histogram and frequency
distribution curve skewed to the left.
If a histogram and a frequency distribution curve are skewed to the left, the
value of the mean is the smallest and that of the mode is the largest, with
the value of the median lying between these two. In this case, the outliers in
the left tail pull the mean to the left.
The measures of central tendency, such as the
mean, median, and mode, do not reveal the
whole picture of the distribution of a data set.
Consider the following two data sets on the ages
(in years) of all workers working for each of two
small companies.
Company 1: 47 38 35 40 36 45 39
Company 2: 70 33 18 52 27
If we do not know the ages of individual workers at
these two companies and are told only that the mean
age of the workers at both companies is the same, we
may deduce that the workers at these two companies
have a similar age distribution.
Thus, the mean, median, or mode by itself is usually
not a sufficient measure to reveal the shape of the
distribution of a data set.
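A quick check of the two company data sets shows why the mean alone is not enough. In this sketch the helper names `mean` and `value_range` are illustrative choices.

```python
company_1 = [47, 38, 35, 40, 36, 45, 39]  # ages in years
company_2 = [70, 33, 18, 52, 27]          # ages in years


def mean(values):
    return sum(values) / len(values)


def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


print(mean(company_1), mean(company_2))                # same mean age
print(value_range(company_1), value_range(company_2))  # very different spread
```

Both companies have a mean age of 40 years, but the ages at the second company are spread over a much wider range.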
Range
Variance and Standard Deviation
Population Parameters and Sample Statistics
Range
Finding the Range for Ungrouped Data
Range = Largest value – Smallest value
Example 3-11
Table 3.4 gives the total areas in square miles of the four
western South-Central states of the United States.
Table 3.4
Example 3-11: Solution
Thus, the total areas of these four states are spread over a range of
217,626 square miles.
Range
Disadvantages
Variance and Standard Deviation
Basic Formulas for the Variance and Standard Deviation for
Ungrouped Data
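\( \sigma^2 = \frac{\sum (x - \mu)^2}{N} \) and \( s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \)
\( \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \) and \( s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \)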
Table 3.5 (Midterm scores of a sample of 4 students)
Variance and Standard Deviation
Short-cut Formulas for the Variance and Standard Deviation
for Ungrouped Data
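\( \sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N} \) and \( s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} \)
\( \sigma = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}} \) and \( s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} \)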
where σ² is the population variance, s² is the sample variance,
σ is the population standard deviation, and s is the sample
standard deviation.
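The basic and short-cut formulas give the same result, which can be verified on a small data set. The sketch below uses the Company 1 and Company 2 ages given earlier in this chapter and treats them as population data, since the text says they cover all workers. The function names are illustrative choices.

```python
import math


def variance_basic(values, population=True):
    """Basic formula: sum of squared deviations from the mean, divided by N or n - 1."""
    n = len(values)
    center = sum(values) / n
    squared_deviations = sum((x - center) ** 2 for x in values)
    return squared_deviations / (n if population else n - 1)


def variance_shortcut(values, population=True):
    """Short-cut formula: (sum of x^2 - (sum of x)^2 / n), divided by N or n - 1."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    numerator = sum_x2 - sum_x ** 2 / n
    return numerator / (n if population else n - 1)


company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]

for ages in (company_1, company_2):
    sigma2 = variance_shortcut(ages)
    print(round(variance_basic(ages), 4), round(sigma2, 4), round(math.sqrt(sigma2), 2))
```

The two companies share a mean age of 40 years, yet the second has a far larger variance, which is the information the mean alone does not reveal.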
Example 3-12
Until about 2009, airline passengers were not charged for checked
baggage. Around 2009, however, many U.S. airlines started charging
a fee for bags. According to the Bureau of Transportation Statistics,
U.S. airlines collected more than $3 billion in baggage fee revenue in
2010. The following table lists the baggage fee revenues of 6 U.S.
airlines for the year 2010.
Example 3-12: Solution
Let x denote the 2010 baggage fee revenue (in millions of
dollars) of an airline. The values of \( \sum x \) and \( \sum x^2 \) are calculated
in Table 3.6.
Table 3.6
Step 1. Calculate \( \sum x \)
The sum of values in the first column of Table 3.6 gives
2,854.
Step 2. Calculate \( \sum x^2 \)
The sum of the squared values in Table 3.6 gives 1,746,098.
Step 3. Determine the variance
\( s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{1{,}746{,}098 - \frac{(2{,}854)^2}{6}}{6 - 1} = \frac{1{,}746{,}098 - 1{,}357{,}552.667}{5} = 77{,}709.06666 \)
Step 4. Obtain the standard deviation
The standard deviation is obtained by taking the (positive) square root
of the variance:
\( s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} = \sqrt{77{,}709.06666} = 278.7634601 = \$278.76 \) million
Thus, the standard deviation of the 2010 baggage fee revenues of
these six airlines is $278.76 million.
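The values of the variance and standard deviation are never
negative. Usually they are positive, but if a data set has no
variation, then the variance and standard deviation are both zero.
For example, every value in the following data set is the same, so
its variance and standard deviation are zero:
35 35 35 35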
Example 3-13
Following are the 2011 earnings (in thousands of dollars)
before taxes for all 6 employees of a small company.
Example 3-13: Solution
Let x denote the 2011 earnings before taxes of an employee
of this company. The values of \( \sum x \) and \( \sum x^2 \) are calculated in
Table 3.7.
Table 3.7
\( \sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N} = \frac{35{,}978.51 - \frac{(449.30)^2}{6}}{6} = 388.90 \)
\( \sigma = \sqrt{388.90} = \$19.721 \) thousand = $19,721
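The arithmetic in Examples 3-12 and 3-13 can be checked directly from the column totals reported in the solutions. Tables 3.6 and 3.7 are not reproduced in this text, so the sketch below works from the sums only. The function name `variance_from_sums` is an illustrative choice.

```python
import math


def variance_from_sums(n, sum_x, sum_x2, population=False):
    """Short-cut variance computed from n, the sum of x, and the sum of x squared."""
    numerator = sum_x2 - sum_x ** 2 / n
    return numerator / (n if population else n - 1)


# Example 3-12: sample of 6 airlines, revenues in millions of dollars.
s2 = variance_from_sums(6, 2854, 1746098)
s = math.sqrt(s2)

# Example 3-13: all 6 employees (population), earnings in thousands of dollars.
sigma2 = variance_from_sums(6, 449.30, 35978.51, population=True)
sigma = math.sqrt(sigma2)

print(round(s2, 2), round(s, 2))
print(round(sigma2, 2), round(sigma, 3))
```

The only difference between the two cases is the divisor: n - 1 for sample data and N for population data.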
Population Parameters and Sample Statistics
A numerical measure such as the mean, median, mode,
range, variance, or standard deviation calculated for a
population data set is called a population parameter, or
simply a parameter. A summary measure calculated for a
sample data set is called a sample statistic, or simply a
statistic.
MEAN, VARIANCE AND STANDARD
DEVIATION FOR GROUPED DATA
Mean for Grouped Data
Variance and Standard Deviation for Grouped Data
Mean for Grouped Data
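Calculating Mean for Grouped Data
Mean for population data: \( \mu = \frac{\sum mf}{N} \)
Mean for sample data: \( \bar{x} = \frac{\sum mf}{n} \)
where \( m \) is the midpoint and \( f \) is the frequency of a class.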
Example 3-14
Table 3.8 gives the frequency distribution of the daily
commuting times (in minutes) from home to work for all 25
employees of a company.
Table 3.8
Example 3-14: Solution
\( \mu = \frac{\sum mf}{N} = \frac{535}{25} = 21.40 \) minutes
Example 3-15
Table 3.10 gives the frequency distribution of the number of
orders received each day during the past 50 days at the office
of a mail-order company.
Table 3.10
Example 3-15: Solution
\( \bar{x} = \frac{\sum mf}{n} = \frac{832}{50} = 16.64 \) orders
Thus, this mail-order company received an average of
16.64 orders per day during these 50 days.
Variance and Standard Deviation for Grouped Data
Basic Formulas for the Variance and Standard Deviation for
Grouped Data
\( \sigma^2 = \frac{\sum f(m - \mu)^2}{N} \) and \( s^2 = \frac{\sum f(m - \bar{x})^2}{n - 1} \)
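Short-Cut Formulas for the Variance and Standard Deviation
for Grouped Data
\( \sigma^2 = \frac{\sum m^2 f - \frac{(\sum mf)^2}{N}}{N} \) and \( s^2 = \frac{\sum m^2 f - \frac{(\sum mf)^2}{n}}{n - 1} \)
The standard deviation is the positive square root of the variance:
\( \sigma = \sqrt{\sigma^2} \) and \( s = \sqrt{s^2} \)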
Example 3-16
The following data, reproduced from Table 3.8 of Example 3-14,
give the frequency distribution of the daily commuting times (in
minutes) from home to work for all 25 employees of a company.
Example 3-16: Solution
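\( \sigma^2 = \frac{\sum m^2 f - \frac{(\sum mf)^2}{N}}{N} = \frac{14{,}825 - \frac{(535)^2}{25}}{25} = \frac{3376}{25} = 135.04 \)
\( \sigma = \sqrt{135.04} = 11.62 \) minutes
Thus, the standard deviation of the daily commuting times for these
employees is 11.62 minutes.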
Example 3-17
The following data, reproduced from Table 3.10 of Example 3-15,
give the frequency distribution of the number of orders
received each day during the past 50 days at the office of a
mail-order company.
Example 3-17: Solution
\( s^2 = \frac{\sum m^2 f - \frac{(\sum mf)^2}{n}}{n - 1} = \frac{14{,}216 - \frac{(832)^2}{50}}{50 - 1} = 7.5820 \)
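The grouped-data calculations in Examples 3-14 through 3-17 can be sketched as follows. Tables 3.8 and 3.10 are not reproduced in this text, so the usage lines start from the sums reported in the solutions; the helper `grouped_sums` shows how those sums would be obtained from class midpoints and frequencies. All function names are illustrative choices.

```python
import math


def grouped_sums(midpoints, frequencies):
    """Return (n, sum of m*f, sum of m^2*f) for a frequency table."""
    n = sum(frequencies)
    sum_mf = sum(m * f for m, f in zip(midpoints, frequencies))
    sum_m2f = sum(m * m * f for m, f in zip(midpoints, frequencies))
    return n, sum_mf, sum_m2f


def grouped_mean(n, sum_mf):
    return sum_mf / n


def grouped_variance(n, sum_mf, sum_m2f, population):
    """Short-cut formula for grouped data; divide by N (population) or n - 1 (sample)."""
    numerator = sum_m2f - sum_mf ** 2 / n
    return numerator / (n if population else n - 1)


# Examples 3-14 and 3-16: all 25 employees (population), sums from the solutions.
mu = grouped_mean(25, 535)
sigma2 = grouped_variance(25, 535, 14825, population=True)
sigma = math.sqrt(sigma2)

# Examples 3-15 and 3-17: sample of 50 days, sums from the solutions.
x_bar = grouped_mean(50, 832)
s2 = grouped_variance(50, 832, 14216, population=False)

print(mu, round(sigma2, 2), round(sigma, 2))
print(x_bar, round(s2, 4))
```

Because the class midpoints stand in for the individual values, these results are approximations of the mean and variance of the underlying ungrouped data.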
USE OF STANDARD DEVIATION
Chebyshev’s Theorem
Empirical Rule
87
Chebyshev’s Theorem
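For any number \( k \) greater than 1, at least \( 1 - \frac{1}{k^2} \) of the elements
of any distribution lie within \( k \) standard deviations of the mean.
For \( k = 2 \): \( 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 75\% \), so at least 75% lie within 2 standard deviations of the mean.
For \( k = 3 \): \( 1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} = 89\% \), so at least 89% lie within 3 standard deviations of the mean.
For \( k = 4 \): \( 1 - \frac{1}{4^2} = 1 - \frac{1}{16} = \frac{15}{16} = 94\% \), so at least 94% lie within 4 standard deviations of the mean.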
Figure 3.5 Chebyshev’s theorem.
Figure 3.6 Percentage of values within two standard deviations of the
mean for Chebyshev’s theorem.
Figure 3.7 Percentage of values within three standard deviations of the
mean for Chebyshev’s theorem.
Example 3-18
The average systolic blood pressure for 4000 women who
were screened for high blood pressure was found to be 187
mm Hg with a standard deviation of 22. Using Chebyshev’s
theorem, find at least what percentage of women in this
group have a systolic blood pressure between 143 and 231
mm Hg.
Example 3-18: Solution
Let μ and σ be the mean and the standard deviation,
respectively, of the systolic blood pressures of these women.
μ = 187 and σ = 22
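The value of k is obtained by dividing the distance between
the mean and each point by the standard deviation. Each of the
two points, 143 and 231, is 44 units away from the mean. Thus
\( k = 44/22 = 2 \)
\( 1 - \frac{1}{k^2} = 1 - \frac{1}{(2)^2} = 1 - \frac{1}{4} = 1 - .25 = .75 \) or 75%
Hence, according to Chebyshev’s theorem, at least 75% of the women
have a systolic blood pressure between 143 and 231 mm Hg.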
Figure 3.8 Percentage of women with
systolic blood pressure between 143 and
231.
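The Chebyshev bound used in Example 3-18 can be sketched in a few lines. The function name `chebyshev_lower_bound` is an illustrative choice, and the interval is assumed to be symmetric about the mean, as it is in the example.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem is informative only for k > 1")
    return 1 - 1 / k ** 2


mean_bp, sd_bp = 187, 22   # Example 3-18, mm Hg
low, high = 143, 231       # interval of interest

# Both end points are the same distance from the mean, so one k describes the interval.
k = (high - mean_bp) / sd_bp
bound = chebyshev_lower_bound(k)

print(k, bound)
print(round(chebyshev_lower_bound(3), 2), round(chebyshev_lower_bound(4), 2))
```

The bound holds for a distribution of any shape, which is why it is stated as "at least" rather than "approximately".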
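Empirical Rule
For a bell-shaped distribution, approximately
1. 68% of the observations lie within one standard deviation of the mean.
2. 95% of the observations lie within two standard deviations of the mean.
3. 99.7% of the observations lie within three standard deviations of the mean.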
Figure 3.9 Illustration of the empirical rule.
Example 3-19
The age distribution of a sample of 5000 persons is bell-shaped
with a mean of 40 years and a standard deviation of 12 years.
Determine the approximate percentage of people who are 16
to 64 years old.
Example 3-19: Solution
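From the given information, for this distribution,
\( \bar{x} = 40 \) years and \( s = 12 \) years
Each of the two points, 16 and 64, is 24 units away from the
mean. Dividing 24 by 12 shows that each point is 2 standard
deviations from the mean. Because the distribution is bell-shaped,
approximately 95% of the people in the sample are 16 to 64 years old.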
Figure 3.10 Percentage of people who are
16 to 64 years old.
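The reasoning in Example 3-19 can be sketched as a lookup on the number of standard deviations. The function name `empirical_rule_share` and the dictionary `EMPIRICAL_RULE` are illustrative choices; the sketch handles only symmetric intervals at exactly 1, 2, or 3 standard deviations and assumes the distribution is bell-shaped.

```python
import math

# Approximate shares for a bell-shaped distribution.
EMPIRICAL_RULE = {1: 0.68, 2: 0.95, 3: 0.997}


def empirical_rule_share(mean, sd, low, high):
    """Approximate share of values in [low, high] for a bell-shaped distribution."""
    k_low = (mean - low) / sd
    k_high = (high - mean) / sd
    if not math.isclose(k_low, k_high):
        raise ValueError("the interval is not symmetric about the mean")
    k = round(k_high)
    if not math.isclose(k, k_high) or k not in EMPIRICAL_RULE:
        raise ValueError("the rule covers only k = 1, 2, or 3")
    return EMPIRICAL_RULE[k]


share = empirical_rule_share(40, 12, 16, 64)  # Example 3-19
print(share)
```

Compare this with Chebyshev's theorem, which guarantees only at least 75% within two standard deviations but makes no assumption about shape.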
MEASURES OF POSITION
A measure of position determines the position of a
single value in relation to other values in a sample or a
population data set.
Quartiles
Interquartile Range
Percentiles
Quartiles and Interquartile Range
Quartiles are the summary measures that divide a
ranked data set into four equal parts.
Figure 3.11 Quartiles.
Calculating Interquartile Range
The difference between the third and the first quartiles gives
the interquartile range:
IQR = Interquartile range = Q3 – Q1
Example 3-20
Table 3.3 in Example 3-5 gave the total compensations (in
millions of dollars) for the year 2010 of the 12 highest-paid
CEOs of U.S. companies. That table is reproduced on the next
slide.
(a) Find the values of the three quartiles. Where does the total
compensation of Michael D. White (CEO of DirecTV) fall in
relation to these quartiles?
(b) Find the interquartile range.
Example 3-20: Solution
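(a) The ranked data are
21.6 21.7 22.9 25.2 26.5 28.0 28.2 32.6 32.9 70.1 76.1 84.5
The second quartile is the median, Q2 = (28.0 + 28.2)/2 = 28.1. The first
quartile is the median of the six values below Q2, Q1 = (22.9 + 25.2)/2 = 24.05,
and the third quartile is the median of the six values above Q2,
Q3 = (32.9 + 70.1)/2 = 51.5.
(b) The interquartile range is given by the difference between
the values of the third and first quartiles. Thus
IQR = Q3 – Q1 = 51.5 – 24.05 = 27.45 = $27.45 million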
Example 3-21
The following are the ages (in years) of nine employees of an
insurance company:
47 28 39 51 33 37 59 24 33
(a) Find the values of the three quartiles. Where does the age of
28 years fall in relation to the ages of the employees?
Example 3-21: Solution
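(a) The ranked data are
24 28 33 33 37 39 47 51 59
The median is Q2 = 37. The first quartile is the median of the four values
below the median, Q1 = (28 + 33)/2 = 30.5, and the third quartile is the
median of the four values above the median, Q3 = (47 + 51)/2 = 49.
The age of 28 years is below Q1, so it falls in the lowest 25% of the ages.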
(b) The interquartile range is
IQR = Interquartile range = Q3 – Q1
= 49 – 30.5
= 18.5 years
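The quartile calculations in Examples 3-20 and 3-21 can be sketched as follows. This follows the method implied by the worked values (Q1 and Q3 are the medians of the values below and above the overall median, with the median itself excluded when n is odd); statistical software often uses other interpolation rules and may return slightly different quartiles. The function names are illustrative choices.

```python
def median(ranked):
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]
    return (ranked[middle - 1] + ranked[middle]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ranked = sorted(values)
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ranked[:half]
    # For odd n, skip the middle value so that it belongs to neither half.
    upper = ranked[half + 1:] if n % 2 == 1 else ranked[half:]
    return median(lower), median(ranked), median(upper)


ages = [47, 28, 39, 51, 33, 37, 59, 24, 33]                   # Example 3-21
compensations = [21.6, 21.7, 22.9, 25.2, 26.5, 28.0,
                 28.2, 32.6, 32.9, 70.1, 76.1, 84.5]          # Example 3-20

q1, q2, q3 = quartiles(ages)
print(q1, q2, q3, q3 - q1)

c1, c2, c3 = quartiles(compensations)
print(round(c1, 2), round(c2, 2), round(c3, 2), round(c3 - c1, 2))
```

For the ages, the sketch reproduces Q1 = 30.5, Q3 = 49, and IQR = 18.5 years from the solution above.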
Percentiles
Percentiles are the summary measures that divide a ranked data
set into 100 equal parts.
Each (ranked) data set has 99 percentiles that divide it into 100
equal parts.
Percentiles and Percentile Rank
Calculating Percentiles
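The (approximate) value of the kth percentile, denoted by \( P_k \), is
\( P_k = \) value of the \( \left( \frac{kn}{100} \right) \)th term in a ranked data set
where k denotes the number of the percentile and n is the number of values in the data set.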
Example 3-22
Refer to the data on total compensations (in millions of
dollars) for the year 2010 of the 12 highest-paid CEOs of U.S.
companies given in Example 3-20. Find the value of the 60th
percentile. Give a brief interpretation of the 60th percentile.
Example 3-22: Solution
The data arranged in increasing order is as follows:
21.6 21.7 22.9 25.2 26.5 28.0 28.2 32.6 32.9 70.1 76.1 84.5
\( \frac{kn}{100} = \frac{(60)(12)}{100} = 7.20\text{th term} \approx 7\text{th term} \)
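The value of the 7.20th term can be approximated by the value
of the 7th term in the ranked data. Therefore,
\( P_{60} = 28.2 = \$28.2 \) million
Thus, approximately 60% of these 12 CEOs had 2010 total compensations
of $28.2 million or less.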
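The percentile calculation in Example 3-22 can be sketched as below. The slides approximate the 7.20th term by the 7th term; this sketch assumes that a fractional position is rounded to the nearest whole term, which is an assumption about the rounding rule rather than something the slides state in general. The function name `percentile` is an illustrative choice.

```python
def percentile(values, k):
    """Approximate kth percentile: value of the (k*n/100)th term in the ranked data."""
    if not 0 < k < 100:
        raise ValueError("k must be between 0 and 100")
    ranked = sorted(values)
    n = len(ranked)
    position = k * n / 100
    # Assumption: round the position to the nearest term, then keep it within 1..n.
    term = min(max(int(position + 0.5), 1), n)
    return ranked[term - 1]


compensations = [21.6, 21.7, 22.9, 25.2, 26.5, 28.0,
                 28.2, 32.6, 32.9, 70.1, 76.1, 84.5]

p60 = percentile(compensations, 60)
print(p60)
```

Statistical packages usually interpolate between neighboring terms, so their percentile values can differ slightly from this approximation.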
BOX-AND-WHISKER PLOT
Example 3-24
The following data are the incomes (in thousands of dollars)
for a sample of 12 households.
Example 3-24: Solution
Step 1. First, rank the data in increasing order and calculate
the values of the median, the first quartile, the third quartile,
and the interquartile range. The ranked data are
Step 2. Find the points that are 1.5 × IQR below Q1 and
1.5 × IQR above Q3. These two points are called the lower and
the upper inner fences, respectively.
Step 3. Determine the smallest and the largest values in the
given data set within the two inner fences.
Step 4. Draw a horizontal line and mark the income levels
on it such that all the values in the given data set are
covered. The result of this step is shown in Figure 3.13.
Step 5. By drawing two lines, join the points of the
smallest and the largest values within the two inner
fences to the box. These values are 69 and 112 in this
example. This completes the box-and-whisker plot, as
shown in Figure 3.14.
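Steps 1 through 3 of the box-and-whisker procedure can be sketched in code. The income data for Example 3-24 are not reproduced in this text, so the usage line applies the sketch to the CEO compensation data from Example 3-5 instead. The function names and the dictionary keys are illustrative choices, and the quartiles use the median-of-halves method.

```python
def median(ranked):
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]
    return (ranked[middle - 1] + ranked[middle]) / 2


def box_plot_summary(values):
    """Quartiles, inner fences, whisker end points, and values outside the inner fences."""
    ranked = sorted(values)
    n = len(ranked)
    half = n // 2
    q1 = median(ranked[:half])
    q2 = median(ranked)
    q3 = median(ranked[half + 1:] if n % 2 == 1 else ranked[half:])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in ranked if lower_fence <= x <= upper_fence]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "lower_inner_fence": lower_fence, "upper_inner_fence": upper_fence,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outside_inner_fences": [x for x in ranked if x < lower_fence or x > upper_fence],
    }


compensations = [21.6, 21.7, 22.9, 25.2, 26.5, 28.0,
                 28.2, 32.6, 32.9, 70.1, 76.1, 84.5]  # Example 3-5, $ millions
summary = box_plot_summary(compensations)
print(summary)
```

For these data every value lies within the inner fences, so the whiskers run to the smallest and largest values and no point is flagged as a possible outlier.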
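Elements of a Box Plot
Figure: elements of a box plot. The box extends from Q1 to Q3 (the
interquartile range), with the median marked inside it. The whiskers
extend to the smallest data point not below the lower inner fence,
Q1 - 1.5(IQR), and to the largest data point not exceeding the upper
inner fence, Q3 + 1.5(IQR). A data point between an inner fence and the
corresponding outer fence, Q1 - 3(IQR) or Q3 + 3(IQR), is a suspected
outlier (*); a data point beyond an outer fence is an outlier (o).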