
Numerical Descriptive Measures

Numerical descriptive measures provide important information about data sets beyond what is shown in graphs. The measures of central tendency—mean, median, and mode—identify characteristics of the center or typical value in a data set. The mean is the average and is calculated by dividing the sum of all values by the total number of data points. The median is the middle value when the data is arranged in order. The mode is the most frequently occurring value in the data set. These measures can help understand features of distributions like the typical or relative positions of values.

Uploaded by

Vishesh Dwivedi

NUMERICAL DESCRIPTIVE MEASURES

• Graphs are one important component of statistics;
however, it is also important to numerically describe the
main characteristics of a data set. The numerical summary
measures, such as the ones that identify the center and
spread of a distribution, identify many important features of
a distribution.

• For example, we can prepare graphs based on family
income data. However, if we want to know the income of a
“typical” family (given by the center of the distribution), the
spread of the distribution of incomes, or the relative
position of a family with a particular income, the numerical
summary measures can provide more detailed information.

MEASURES OF CENTRAL TENDENCY FOR UNGROUPED DATA
The measures that we discuss in this chapter include measures of:

• Central Tendency (Mean, Median, Mode)
• Spread/Dispersion (Range, Standard Deviation)
• Position (Quartiles, Percentiles)

Figure 3.1

Mean
The mean for ungrouped data is obtained by dividing the
sum of all values by the number of values in the data set. Thus,

Mean for population data: $\mu = \dfrac{\sum x}{N}$

Mean for sample data: $\bar{x} = \dfrac{\sum x}{n}$

where $\sum x$ is the sum of all values; $N$ is the population size; $n$
is the sample size; $\mu$ is the population mean; and $\bar{x}$ is the
sample mean.

Example 3-1
Table 3.1 lists the total cash donations (rounded to millions of
dollars) given by eight U.S. companies during the year 2016.
Table 3.1 Cash Donations in 2016 by Eight U.S. Companies

Find the mean of cash donations made by these eight companies.

Example 3-1: Solution

$\sum x = x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 = 319 + 199 + 110 + 63 + 21 + 315 + 26 + 63 = 1116$

$\bar{x} = \dfrac{\sum x}{n} = \dfrac{1116}{8} = 139.5 = \$139.5$ million

Thus, these eight companies donated an average of $139.5 million in
2010 for charitable purposes.

6
Example 3-2
The following are the ages (in years) of all eight employees of a
small company:

53 32 61 27 39 44 49 57

Find the mean age of these employees.

Example 3-2: Solution
The population mean is

$\mu = \dfrac{\sum x}{N} = \dfrac{362}{8} = 45.25$ years

Thus, the mean age of all eight employees of this company is
45.25 years, or 45 years and 3 months.

• Reconsider Example 3-2. If we take a sample of 3
employees from this company and calculate the mean age of
those 3 employees, this mean will be denoted by $\bar{x}$.
Suppose the three values included in the sample are 32, 39,
and 57. Then, the mean age for this sample is:

$\bar{x} = \dfrac{32 + 39 + 57}{3} = 42.67$ years

• If we take a second sample of 3 employees of this company,
the value of $\bar{x}$ will (most likely) be different. Suppose the
second sample includes the values 53, 27, and 44. Then, the
mean age for this sample is

$\bar{x} = \dfrac{53 + 27 + 44}{3} = 41.33$ years

• Consequently, we can state that the value of the population
mean is constant. However, the value of the sample mean
varies from sample to sample. The value of $\bar{x}$ for a particular
sample depends on what values of the population are
included in that sample.
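
The contrast between the fixed population mean and the varying sample mean can be checked numerically. The following is a minimal Python sketch using the eight ages from Example 3-2 and the two samples quoted above; the variable names are illustrative and are not part of the original slides.

```python
from statistics import fmean

# Ages (in years) of all eight employees from Example 3-2: the population.
population = [53, 32, 61, 27, 39, 44, 49, 57]

# The two samples of three employees discussed above.
sample_1 = [32, 39, 57]
sample_2 = [53, 27, 44]

mu = fmean(population)      # population mean: fixed for this company
x_bar_1 = fmean(sample_1)   # sample mean of the first sample
x_bar_2 = fmean(sample_2)   # sample mean of the second sample

print(f"population mean = {mu:.2f}")
print(f"sample means    = {x_bar_1:.2f}, {x_bar_2:.2f}")
```

Any other choice of three employees would generally give yet another value of the sample mean, while the population mean stays the same.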

• A major shortcoming of the mean as a measure of central
tendency is that it is very sensitive to outliers.

Example 3-3
Table 3.2 Number of Homes Foreclosed in 2010

Note that the number of homes foreclosed in California is
very large compared to those in the other six states.
Hence, it is an outlier. Show how the inclusion of this outlier
affects the value of the mean.

Example 3-3: Solution
• If we do not include the number of homes foreclosed in
California (the outlier), the mean of the number of
foreclosed homes in six states is

Mean without the outlier
$= \dfrac{49{,}723 + 20{,}352 + 10{,}824 + 40{,}911 + 18{,}038 + 61{,}848}{6} = \dfrac{201{,}696}{6} = 33{,}616$

• Now, to see the impact of the outlier on the value of the
mean, we include the number of homes foreclosed in
California and find the mean number of homes foreclosed
in the seven states. This mean is

Mean with the outlier
$= \dfrac{173{,}175 + 49{,}723 + 20{,}352 + 10{,}824 + 40{,}911 + 18{,}038 + 61{,}848}{7} = \dfrac{374{,}871}{7} = 53{,}553$

• Thus, including the number of homes foreclosed in California increases
the value of the mean by roughly 60% (about 59.3%), from 33,616 to 53,553.
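
The effect of the outlier can be reproduced in a few lines. This is a small sketch in Python using the seven foreclosure counts quoted in Example 3-3; the variable names are assumptions made for illustration.

```python
from statistics import fmean

# Foreclosures in the six states other than California (Example 3-3).
six_states = [49_723, 20_352, 10_824, 40_911, 18_038, 61_848]
california = 173_175  # the outlier

mean_without = fmean(six_states)
mean_with = fmean(six_states + [california])

# Relative change in the mean caused by adding the single outlier.
pct_increase = 100 * (mean_with - mean_without) / mean_without

print(f"mean without outlier: {mean_without:,.0f}")
print(f"mean with outlier:    {mean_with:,.0f}")
print(f"increase:             {pct_increase:.1f}%")
```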

• Remember that the Mean is not always the best
measure of central tendency because it is heavily
influenced by outliers.

• Sometimes other measures of central tendency give a
more accurate impression of a data set.

• For example, when a data set has outliers, instead of
using the mean, we can use the Median as a measure
of central tendency.

Median
• The Median is the value of the middle term in a data set
that has been ranked in increasing order; that is, it divides a
ranked data set into two equal parts.

• The calculation of the median consists of the following two
steps:

1. Rank the data set in increasing order.

2. Find the middle term. The value of this term is the
median.

• Note that if the number of observations in
a data set is odd, then the median is given
by the value of the middle term in the
ranked data.

• However, if the number of observations is
even, then the median is given by the
average of the values of the two middle
terms.

Example 3-4
Refer to the data on the number of homes foreclosed in 7
states given in Table 3.2 of Example 3-3. Those values
are listed below.

173,175 49,723 20,352 10,824 40,911 18,038 61,848

Find the median for these data.

Example 3-4: Solution
First, we rank the given data in increasing order as follows:
10,824 18,038 20,352 40,911 49,723 61,848 173,175

Since there are 7 values in this data set, the middle term
is the fourth term, so the median is given by the value of the 4th
term in the ranked data.

Thus, the median number of homes foreclosed in these seven
states was 40,911 in 2010.

Example 3-5
• Table 3.3 gives the total compensations (in millions of
dollars) for the year 2010 of the 12 highest-paid CEOs of
U.S. companies.

Table 3.3 Total Compensations of 12 Highest-Paid CEOs for the Year 2010

Find the median for these data.

Example 3-5: Solution
• First we rank the given total compensations of the 12 CEOs as
follows:

21.6 21.7 22.9 25.2 26.5 28.0 28.2 32.6 32.9 70.1 76.1 84.5

• There are 12 values in this data set. Because there is an
even number of values in the data set, the median is given by
the average of the two middle values.

Example 3-5: Solution
• The two middle values are the sixth and seventh in the
arranged data, and these two values are 28.0 and 28.2.

$\text{Median} = \dfrac{28.0 + 28.2}{2} = \dfrac{56.2}{2} = 28.1 = \$28.1$ million

• Thus, the median for the 2010 compensations of these 12
CEOs is $28.1 million.
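
The two-step median procedure (rank, then find the middle) can be sketched in Python as follows. The function name `median` and the variable names are illustrative choices, and the data are the values quoted in Examples 3-4 and 3-5.

```python
def median(values):
    """Median by the two-step procedure: rank, then take the middle."""
    ranked = sorted(values)            # Step 1: rank in increasing order
    n = len(ranked)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                     # odd n: the single middle term
        return ranked[mid]
    # even n: average of the two middle terms
    return (ranked[mid - 1] + ranked[mid]) / 2

foreclosures = [173_175, 49_723, 20_352, 10_824, 40_911, 18_038, 61_848]
ceo_pay = [21.6, 21.7, 22.9, 25.2, 26.5, 28.0,
           28.2, 32.6, 32.9, 70.1, 76.1, 84.5]

print(median(foreclosures))            # Example 3-4 (7 values, odd)
print(round(median(ceo_pay), 2))       # Example 3-5 (12 values, even)
```

Note that the California outlier moves the mean of the foreclosure data substantially but leaves the median at a value typical of the other states.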

Median

• The median gives the center of a histogram, with half the
data values to the left of the median and half to the right of
the median.

• The advantage of using the median as a measure of central
tendency is that it is not influenced by outliers.

• Consequently, the median is preferred over the mean as a
measure of central tendency for data sets that contain
outliers.

Mode

• In statistics, the mode represents the most common value in
a data set.

• The mode is the value that occurs with the highest
frequency in a data set.

Example 3-6
• The following data give the speeds (in miles per hour) of
eight cars that were stopped on NH-95 for speeding
violations.

77 82 74 81 79 84 74 78

Find the mode.

Example 3-6: Solution
• In this data set, 74 occurs twice and each of the remaining
values occurs only once. Because 74 occurs with the highest
frequency, it is the mode. Therefore,

Mode = 74 miles per hour

Mode
• A major shortcoming of the mode is that a data set may
have none or may have more than one mode, whereas it will
have only one mean and only one median.

• No Mode: A data set with each value occurring only once.
• Unimodal: A data set with only one mode.
• Bimodal: A data set with two modes.
• Multimodal: A data set with more than two modes.

Example 3-7 (Data set with no mode)
• Last year’s incomes of five randomly selected families were
$76,150, $95,750, $124,985, $87,490, and $53,740.

• Find the mode.

Example 3-7: Solution
• Because each value in this data set occurs only once, this data
set contains no mode.

30
Example 3-8 (Data set with two modes)
A small company has 12 employees. Their commuting times
(rounded to the nearest minute) from home to work are 23,
36, 12, 23, 47, 32, 8, 12, 26, 31, 18, and 28, respectively.

Find the mode for these data.

Example 3-8: Solution
In the given data on the commuting times of the 12
employees, each of the values 12 and 23 occurs twice, and
each of the remaining values occurs only once. Therefore,
this data set has two modes: 12 and 23 minutes.

32
Example 3-9 (Data set with three modes)
The ages of 10 randomly selected students from a class are 21,
19, 27, 22, 29, 19, 25, 21, 22 and 30 years, respectively.

Find the mode.

33
Example 3-9: Solution
This data set has three modes: 19, 21 and 22. Each of these
three values occurs with a (highest) frequency of 2.

34
Mode
One advantage of the mode is that it can be calculated for
both kinds of data (quantitative and qualitative) - whereas
the mean and median can be calculated for only
quantitative data.

Example 3-10

• The statuses of five students who are members of the
student senate at a college are senior, sophomore, senior,
junior, and senior, respectively. Find the mode.

Example 3-10: Solution
• Because senior occurs more frequently than the other
categories, it is the mode for this data set. We cannot
calculate the mean and median for this data set.
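
The mode examples above (no mode, one mode, two modes, three modes, and qualitative data) can all be handled by one small function. The following is a sketch in Python; the function name `modes` is an illustrative choice, and returning an empty list for "no mode" is a convention adopted here to match the slides' definition.

```python
from collections import Counter

def modes(values):
    """Return all modes, or an empty list if every value occurs only once."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:                   # no value repeats: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([77, 82, 74, 81, 79, 84, 74, 78]))                       # Example 3-6
print(modes([76150, 95750, 124985, 87490, 53740]))                   # Example 3-7
print(modes([23, 36, 12, 23, 47, 32, 8, 12, 26, 31, 18, 28]))        # Example 3-8
print(modes([21, 19, 27, 22, 29, 19, 25, 21, 22, 30]))               # Example 3-9
print(modes(["senior", "sophomore", "senior", "junior", "senior"]))  # Example 3-10
```

Because the function only counts occurrences, it works for qualitative data such as class standing, where a mean or median cannot be computed.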

• To sum up, we cannot say for sure which of the three
measures of central tendency is the best measure overall.
Each of them may be better under different situations.

• Probably the mean is the most-used measure of central
tendency, followed by the median.

• The mean has the advantage that its calculation includes
each value of the data set.

• The median is a better measure when a data set includes
outliers.

• The mode is simple to locate, but it is not of much use in
practical applications.

Relationships Among the Mean, Median, and Mode
Figure 3.2 Mean, median, and mode for a symmetric histogram and
frequency distribution curve.

For a symmetric histogram and frequency distribution with one peak, the
values of the mean, median, and mode are identical, and they lie at the center
of the distribution.

Relationships Among the Mean, Median, and Mode
Figure 3.3 Mean, median, and mode for a histogram and frequency
distribution curve skewed to the right.

For a histogram and a frequency distribution curve skewed to the right, the
value of the mean is the largest, that of the mode is the smallest, and the value
of the median lies between these two. (Notice that the mode always occurs at
the peak point.) The value of the mean is the largest in this case because it is
sensitive to outliers that occur in the right tail. These outliers pull the mean to
the right.

Relationships Among the Mean, Median, and Mode
Figure 3.4 Mean, median, and mode for a histogram and frequency
distribution curve skewed to the left.

If a histogram and a frequency distribution curve are skewed to the left, the
value of the mean is the smallest and that of the mode is the largest, with
the value of the median lying between these two. In this case, the outliers in
the left tail pull the mean to the left.

• The measures of central tendency, such as the
mean, median, and mode, do not reveal the
whole picture of the distribution of a data set.

• Two data sets with the same mean may have
completely different spreads. The
variation/spread among the values of
observations for one data set may be much
larger or smaller than for the other data set.

• Consider the following two data sets on the ages
(in years) of all workers working for each of two
small companies.

Company 1: 47 38 35 40 36 45 39
Company 2: 70 33 18 52 27

The mean age of workers in both these companies
is the same, 40 years.

• If we do not know the ages of individual workers at
these two companies and are told only that the mean
age of the workers at both companies is the same, we
may deduce that the workers at these two companies
have a similar age distribution.

• However, the variation in the workers’ ages for each
of these two companies is very different.

• If we look carefully, the ages of the workers at the
second company have a much larger variation than
the ages of the workers at the first company.

• Thus, the mean, median, or mode by itself is usually
not a sufficient measure to reveal the shape of the
distribution of a data set.

• We also need a measure that can provide some
information about the variation among data values.

• The measures that help us learn about the spread of a
data set are called the measures of dispersion.

• The measures of central tendency and dispersion
taken together give a better picture of a data set than
the measures of central tendency alone.

Measures of Dispersion for Ungrouped Data
This section covers the following topics:

• Range
• Variance and Standard Deviation
• Population Parameters and Sample Statistics

Range
Finding the Range for Ungrouped Data

• The range is the simplest measure of dispersion.

Range = Largest value – Smallest value

Example 3-11

• Table 3.4 gives the total areas in square miles of the four
western South-Central states of the United States.

• Find the range for this data set.

Table 3.4

Example 3-11: Solution

Range = Largest value – Smallest value
      = 267,277 – 49,651
      = 217,626 square miles

Thus, the total areas of these four states are spread over a range of
217,626 square miles.

Range
Disadvantages

• The range, like the mean, has the disadvantage of being
influenced by outliers. In Example 3-11, if the state of
Texas with a total area of 267,277 square miles is
dropped, the range decreases from 217,626 square miles
to 20,252 square miles. Consequently, the range is not a
good measure of dispersion to use for a data set that
contains outliers.

• Its calculation is based on two values only: the largest
and the smallest. All other values in a data set are ignored
when calculating the range. Thus, the range is not a very
satisfactory measure of dispersion.

Variance and Standard Deviation

• The standard deviation is the most-used measure of
dispersion.

• The value of the standard deviation tells how closely the
values of a data set are clustered around the mean.

• In general, a lower value of the standard deviation for a
data set indicates that the values of that data set are
spread over a relatively smaller range around the mean.

• In contrast, a larger value of the standard deviation for a
data set indicates that the values of that data set are
spread over a relatively larger range around the mean.

• The variance calculated for population data is denoted by σ²
and the variance calculated for sample data is denoted by
s².

• The standard deviation calculated for population data is
denoted by σ, and the standard deviation calculated for
sample data is denoted by s.

Variance and Standard Deviation
Basic Formulas for the Variance and Standard Deviation for
Ungrouped Data

$\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$ and $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$ and $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

where σ² is the population variance, s² is the sample variance,
σ is the population standard deviation, and s is the sample
standard deviation.
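
As an illustration of the basic formulas, the sketch below applies them to the ages of the workers at the two small companies introduced earlier, which share a mean of 40 years but differ greatly in spread. This is a minimal Python example; the function name `variance` and its `sample` flag are assumptions made here, not notation from the slides.

```python
from math import sqrt

def variance(values, sample=False):
    """Basic-formula variance: sum of squared deviations from the mean.

    Divides by N for population data, or by n - 1 when sample=True.
    """
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return squared_deviations / (n - 1 if sample else n)

# Ages of all workers at two small companies (both have a mean age of 40).
company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]

for name, ages in (("Company 1", company_1), ("Company 2", company_2)):
    var = variance(ages)               # population data: divide by N
    print(f"{name}: variance = {var:.2f}, standard deviation = {sqrt(var):.2f}")
```

The much larger standard deviation for the second company quantifies what the raw ages already suggest: its workers' ages are spread far more widely around the common mean.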

Table 3.5 (Mid-Term scores of a sample of 4 students)

Variance and Standard Deviation
Short-cut Formulas for the Variance and Standard Deviation
for Ungrouped Data

$\sigma^2 = \dfrac{\sum x^2 - \dfrac{(\sum x)^2}{N}}{N}$ and $s^2 = \dfrac{\sum x^2 - \dfrac{(\sum x)^2}{n}}{n - 1}$

$\sigma = \sqrt{\dfrac{\sum x^2 - \dfrac{(\sum x)^2}{N}}{N}}$ and $s = \sqrt{\dfrac{\sum x^2 - \dfrac{(\sum x)^2}{n}}{n - 1}}$

where σ² is the population variance, s² is the sample variance,
σ is the population standard deviation, and s is the sample
standard deviation.

56
Example 3-12
Until about 2009, airline passengers were not charged for checked
baggage. Around 2009, however, many U.S. airlines started charging
a fee for bags. According to the Bureau of Transportation Statistics,
U.S. airlines collected more than $3 billion in baggage fee revenue in
2010. The following table lists the baggage fee revenues of 6 U.S.
airlines for the year 2010.

Find the variance and standard deviation for these data.

57
Example 3-12

Example 3-12: Solution
Let x denote the 2010 baggage fee revenue (in millions of
dollars) of an airline. The values of Σx and Σx² are calculated
in Table 3.6.
Table 3.6

Example 3-12: Solution
Step 1. Calculate Σx
The sum of the values in the first column of Table 3.6 gives
Σx = 2,854.

Step 2. Find Σx²
The sum of the values in the second column of Table 3.6 gives
Σx² = 1,746,098.

Example 3-12: Solution
Step 3. Determine the variance

$s^2 = \dfrac{\sum x^2 - \dfrac{(\sum x)^2}{n}}{n - 1} = \dfrac{1{,}746{,}098 - \dfrac{(2{,}854)^2}{6}}{6 - 1} = \dfrac{1{,}746{,}098 - 1{,}357{,}552.667}{5} = 77{,}709.06666$

Example 3-12: Solution
Step 4. Obtain the standard deviation
The standard deviation is obtained by taking the (positive) square root
of the variance:

$s = \sqrt{\dfrac{\sum x^2 - \dfrac{(\sum x)^2}{n}}{n - 1}} = \sqrt{77{,}709.06666} = 278.7634601 \approx \$278.76$ million

Thus, the standard deviation of the 2010 baggage fee revenues of
these six airlines is $278.76 million.

• Usually the values of the variance and standard deviation
are positive, but if a data set has no variation, then the
variance and standard deviation are both zero.

For example, if four persons in a group are the same age,
say 35 years, then the four values in the data set are

35 35 35 35

If we calculate the variance and standard deviation for these
data, their values are zero. This is because there is no
variation in the values of this data set.

63
Example 3-13
Following are the 2011 earnings (in thousands of dollars)
before taxes for all 6 employees of a small company.

88.50 108.40 65.50 52.50 79.80 54.60

Calculate the variance and standard deviation for these data.

Example 3-13: Solution
Let x denote the 2011 earnings before taxes of an employee
of this company. The values of Σx and Σx² are calculated in
Table 3.7.
Table 3.7

Example 3-13: Solution

$\sigma^2 = \dfrac{\sum x^2 - \dfrac{(\sum x)^2}{N}}{N} = \dfrac{35{,}978.51 - \dfrac{(449.30)^2}{6}}{6} = 388.90$

$\sigma = \sqrt{388.90} = \$19.721$ thousand $= \$19{,}721$

Thus, the standard deviation of the 2011 earnings of all six
employees of this company is $19,721.
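
The short-cut formulas need only the totals Σx and Σx² and the number of values, which makes them easy to check in code. The sketch below, in Python, re-derives the results of Examples 3-12 and 3-13; the function name `shortcut_variance` and its `sample` flag are illustrative assumptions. For Example 3-12 only the totals from Table 3.6 are used, since the individual revenues are not reproduced in the text.

```python
from math import sqrt

def shortcut_variance(sum_x, sum_x2, n, sample=False):
    """Short-cut variance from the totals Σx and Σx² and the count n."""
    numerator = sum_x2 - sum_x ** 2 / n
    return numerator / (n - 1 if sample else n)

# Example 3-12: totals from Table 3.6, treated as sample data (n = 6).
s2 = shortcut_variance(2_854, 1_746_098, 6, sample=True)

# Example 3-13: earnings (thousands of dollars) of all six employees,
# treated as population data (N = 6).
earnings = [88.50, 108.40, 65.50, 52.50, 79.80, 54.60]
sigma2 = shortcut_variance(sum(earnings),
                           sum(x * x for x in earnings),
                           len(earnings))

print(f"Example 3-12: s^2 = {s2:.2f}, s = {sqrt(s2):.2f}")
print(f"Example 3-13: sigma^2 = {sigma2:.2f}, sigma = {sqrt(sigma2):.3f}")
```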

Population Parameters and Sample Statistics
• A numerical measure such as the mean, median, mode,
range, variance, or standard deviation calculated for a
population data set is called a population parameter, or
simply a parameter.

Thus, μ and σ are population parameters.

• A summary measure calculated for a sample data set is
called a sample statistic, or simply a statistic.

Thus, $\bar{x}$ and s are sample statistics.

MEAN, VARIANCE AND STANDARD DEVIATION FOR GROUPED DATA
• Mean for Grouped Data
• Variance and Standard Deviation for Grouped Data

Mean for Grouped Data
Calculating Mean for Grouped Data

Mean for population data: $\mu = \dfrac{\sum mf}{N}$

Mean for sample data: $\bar{x} = \dfrac{\sum mf}{n}$

where m is the midpoint and f is the frequency of a class.

69
Example 3-14
Table 3.8 gives the frequency distribution of the daily
commuting times (in minutes) from home to work for all 25
employees of a company.

Calculate the mean of the daily commuting times.

70
Example 3-14
Table-3-8

Example 3-14: Solution

$\mu = \dfrac{\sum mf}{N} = \dfrac{535}{25} = 21.40$ minutes

Thus, the employees of this company spend an average of
21.40 minutes a day commuting from home to work.

73
Example 3-15
Table 3.10 gives the frequency distribution of the number of
orders received each day during the past 50 days at the office
of a mail-order company.

Calculate the mean.

74
Example 3-15
Table-3-10

Example 3-15: Solution

$\bar{x} = \dfrac{\sum mf}{n} = \dfrac{832}{50} = 16.64$ orders

Thus, this mail-order company received an average of
16.64 orders per day during these 50 days.

Variance and Standard Deviation for Grouped Data
Basic Formulas for the Variance and Standard Deviation for
Grouped Data

$\sigma^2 = \dfrac{\sum f (m - \mu)^2}{N}$ and $s^2 = \dfrac{\sum f (m - \bar{x})^2}{n - 1}$

where σ² is the population variance, s² is the sample variance,
and m is the midpoint of a class. In either case, the standard
deviation is obtained by taking the positive square root of the
variance.

Variance and Standard Deviation for Grouped Data
Short-Cut Formulas for the Variance and Standard Deviation
for Grouped Data

$\sigma^2 = \dfrac{\sum m^2 f - \dfrac{(\sum mf)^2}{N}}{N}$ and $s^2 = \dfrac{\sum m^2 f - \dfrac{(\sum mf)^2}{n}}{n - 1}$

where σ² is the population variance, s² is the sample variance,
m is the midpoint of a class, and f is the frequency of a class.

The standard deviation is obtained by taking the positive
square root of the variance.

Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Sample standard deviation: $s = \sqrt{s^2}$

80
Example 3-16
The following data, reproduced from Table 3.8 of Example 3-14,
give the frequency distribution of the daily commuting times (in
minutes) from home to work for all 25 employees of a company.

Calculate the variance and standard deviation.

Example 3-16: Solution

$\sigma^2 = \dfrac{\sum m^2 f - \dfrac{(\sum mf)^2}{N}}{N} = \dfrac{14{,}825 - \dfrac{(535)^2}{25}}{25} = \dfrac{3376}{25} = 135.04$

$\sigma = \sqrt{\sigma^2} = \sqrt{135.04} = 11.62$ minutes

Thus, the standard deviation of the daily commuting times for these
employees is 11.62 minutes.

83
Example 3-17
The following data, reproduced from Table 3.10 of Example 3-
15, give the frequency distribution of the number of orders
received each day during the past 50 days at the office of a
mail-order company.

Calculate the variance and standard deviation.

Example 3-17: Solution

$s^2 = \dfrac{\sum m^2 f - \dfrac{(\sum mf)^2}{n}}{n - 1} = \dfrac{14{,}216 - \dfrac{(832)^2}{50}}{50 - 1} = 7.5820$

$s = \sqrt{s^2} = \sqrt{7.5820} = 2.75$ orders

Thus, the standard deviation of the number of orders received at the
office of this mail-order company during the past 50 days is 2.75.
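
The grouped-data formulas can be checked from the column totals alone. The sketch below, in Python, uses the totals quoted in Examples 3-14 to 3-17 (Σmf = 535, Σm²f = 14,825, N = 25; Σmf = 832, Σm²f = 14,216, n = 50). Because Tables 3.8 and 3.10 are not reproduced in the text, the small class table passed to `grouped_sums` is hypothetical and serves only to show how the totals are built; all function names are illustrative assumptions.

```python
from math import sqrt

def grouped_sums(midpoints, freqs):
    """Return (Σf, Σmf, Σm²f) for a frequency distribution."""
    total_f = sum(freqs)
    sum_mf = sum(m * f for m, f in zip(midpoints, freqs))
    sum_m2f = sum(m * m * f for m, f in zip(midpoints, freqs))
    return total_f, sum_mf, sum_m2f

def grouped_mean_var(total_f, sum_mf, sum_m2f, sample=False):
    """Mean and short-cut variance for grouped data from its totals."""
    mean = sum_mf / total_f
    numerator = sum_m2f - sum_mf ** 2 / total_f
    return mean, numerator / (total_f - 1 if sample else total_f)

# Totals quoted in Examples 3-14 and 3-16 (population, N = 25).
mu, sigma2 = grouped_mean_var(25, 535, 14_825)
# Totals quoted in Examples 3-15 and 3-17 (sample, n = 50).
x_bar, s2 = grouped_mean_var(50, 832, 14_216, sample=True)

# Hypothetical classes (NOT taken from Table 3.8 or 3.10).
demo = grouped_sums(midpoints=[5, 15, 25], freqs=[2, 5, 3])

print(f"mu = {mu:.2f}, sigma = {sqrt(sigma2):.2f}")
print(f"x_bar = {x_bar:.2f}, s = {sqrt(s2):.2f}")
print(demo)
```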

USE OF STANDARD DEVIATION
• Chebyshev’s Theorem
• Empirical Rule

Chebyshev’s Theorem

• For any number k greater than 1, at least (1 – 1/k²) of the
data values lie within k standard deviations of the mean.

• Applies to any distribution, regardless of shape.

• Places lower limits on the percentages of observations
within a given number of standard deviations from the
mean.

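Chebyshev’s Theorem
• At least $1 - \dfrac{1}{k^2}$ of the elements of any distribution
lie within k standard deviations of the mean.

At least $1 - \dfrac{1}{2^2} = 1 - \dfrac{1}{4} = \dfrac{3}{4} = 75\%$ lie within 2 standard deviations of the mean
At least $1 - \dfrac{1}{3^2} = 1 - \dfrac{1}{9} = \dfrac{8}{9} \approx 89\%$ lie within 3 standard deviations of the mean
At least $1 - \dfrac{1}{4^2} = 1 - \dfrac{1}{16} = \dfrac{15}{16} \approx 94\%$ lie within 4 standard deviations of the mean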

89
Figure 3.5 Chebyshev’s theorem.

90
Figure 3.6 Percentage of values within two
standard deviations of the mean for
Chebyshev’s theorem.

91
Figure 3.7 Percentage of values within
three standard deviations of the mean for
Chebyshev’s theorem.

Example 3-18
• The average systolic blood pressure for 4000 women who
were screened for high blood pressure was found to be 187
mm Hg with a standard deviation of 22. Using Chebyshev’s
theorem, find at least what percentage of women in this
group have a systolic blood pressure between 143 and 231
mm Hg.

Example 3-18: Solution
• Let μ and σ be the mean and the standard deviation,
respectively, of the systolic blood pressures of these women.
• μ = 187 and σ = 22

Example 3-18: Solution
• The value of k is obtained by dividing the distance between
the mean and each point by the standard deviation. Each of the
two points, 143 and 231, is 44 units away from the mean of 187. Thus
• k = 44/22 = 2

$1 - \dfrac{1}{k^2} = 1 - \dfrac{1}{(2)^2} = 1 - \dfrac{1}{4} = 1 - .25 = .75$ or 75%

• Hence, according to Chebyshev's theorem, at least 75% of the
women have systolic blood pressure between 143 and 231
mm Hg. This percentage is shown in Figure 3.8.
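
Chebyshev's bound is a one-line calculation once k is known. The following Python sketch reproduces Example 3-18 and the bounds for k = 2, 3, and 4; the function name `chebyshev_lower_bound` is an illustrative choice.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

# Example 3-18: mean 187, standard deviation 22, interval 143 to 231.
mean, sd = 187, 22
lower, upper = 143, 231
# The interval is symmetric about the mean, so both distances give the same k.
k = (upper - mean) / sd

print(f"k = {k:g}, at least {chebyshev_lower_bound(k):.0%}")
for k_val in (2, 3, 4):
    print(k_val, f"{chebyshev_lower_bound(k_val):.1%}")
```

The theorem gives only a lower limit, so the actual share of women in the interval may be higher than 75%.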

95
Figure 3.8 Percentage of women with
systolic blood pressure between 143 and
231.

Empirical Rule

• Applies only to bell-shaped/symmetric distributions.

• Specifies approximate percentages of observations within a
given number of standard deviations from the mean.

• For a bell-shaped distribution, approximately

• 68% of the observations lie within 1 standard
deviation of the mean

• 95% of the observations lie within 2 standard
deviations of the mean

• 99.7% of the observations lie within 3 standard
deviations of the mean

98
Figure 3.9 Illustration of the empirical rule.

Example 3-19
• The age distribution of a sample of 5000 persons is bell-shaped
with a mean of 40 years and a standard deviation of 12 years.
Determine the approximate percentage of people who are 16
to 64 years old.

Example 3-19: Solution
• From the given information, for this distribution,
• $\bar{x}$ = 40 years and s = 12 years

• Each of the two points, 16 and 64, is 24 units away from the
mean, that is, 24/12 = 2 standard deviations.

• Because the area within two standard deviations of the mean
is approximately 95% for a bell-shaped curve, approximately
95% of the people in the sample are 16 to 64 years old.
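
The percentages in the empirical rule can be compared with an exactly normal curve. The sketch below, in Python, assumes that "bell-shaped" means normally distributed, which is a modeling assumption made here and not something the slides state; `share_within` is an illustrative name. It uses `statistics.NormalDist` from the standard library.

```python
from statistics import NormalDist

def share_within(k):
    """Share of a normal distribution within k standard deviations of its mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.1%}")

# Example 3-19: mean 40, standard deviation 12, interval 16 to 64.
x_bar, s = 40, 12
k = (64 - x_bar) / s   # equals (x_bar - 16) / s, so the interval is mean ± k·s
print(f"k = {k:g}, approximately {share_within(k):.0%}")
```

Under this assumption the computed shares round to the 68%, 95%, and 99.7% figures used in the rule.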

101
Figure 3.10 Percentage of people who are
16 to 64 years old.

MEASURES OF POSITION
A measure of position determines the position of a
single value in relation to other values in a sample or a
population data set.

• Quartiles
• Interquartile Range
• Percentiles

Quartiles and Interquartile Range
• Quartiles are the summary measures that divide a
ranked data set into four equal parts.

• The second quartile is the same as the median of a data
set.

• The first quartile is the value of the middle term among
the observations that are less than the median, and the
third quartile is the value of the middle term among the
observations that are greater than the median.

104
Figure 3.11 Quartiles.

Quartiles and Interquartile Range
• Calculating Interquartile Range
• The difference between the third and the first quartiles gives
the interquartile range:

• IQR = Interquartile range = Q3 – Q1

106
Example 3-20
Table 3.3 in Example 3-5 gave the total compensations (in
millions of dollars) for the year 2010 of the 12 highest-paid
CEOs of U.S. companies. That table is reproduced on the next
slide.

(a) Find the values of the three quartiles. Where does the total
compensation of Michael D. White (CEO of DirecTV) fall in
relation to these quartiles?

(b) Find the interquartile range.

107
Example 3-20

Example 3-20: Solution
(a) The values of the three quartiles are Q1 = 24.05, Q2 = 28.1, and
Q3 = 51.5 (in millions of dollars).

By looking at the position of $32.9 million (total compensation of
Michael D. White, CEO of DirecTV), we can state that this value lies
in the bottom 75% of the 2010 total compensations. This value
falls between the second and third quartiles.

Example 3-20: Solution
(b) The interquartile range is given by the difference between
the values of the third and first quartiles. Thus

IQR = Interquartile range = Q3 – Q1
    = 51.5 – 24.05 = $27.45 million

Example 3-21
The following are the ages (in years) of nine employees of an
insurance company:

47 28 39 51 33 37 59 24 33

(a) Find the values of the three quartiles. Where does the age of
28 years fall in relation to the ages of the employees?

(b) Find the interquartile range.

111
Example 3-21: Solution
(a)

The age of 28 falls in the lowest 25% of the ages.

Example 3-21: Solution
(b) The interquartile range is
IQR = Interquartile range = Q3 – Q1
    = 49 – 30.5
    = 18.5 years
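
The quartile definition used in these slides (Q2 is the median, and Q1 and Q3 are the middle terms of the observations below and above it) can be sketched in Python as follows. The function names are illustrative; splitting the ranked data by position, and leaving the median out of both halves when the number of values is odd, is how the definition is interpreted here. Other textbooks and software packages use different quartile conventions and may give slightly different values.

```python
def _median(ranked):
    """Median of an already ranked, non-empty list."""
    n = len(ranked)
    mid = n // 2
    return ranked[mid] if n % 2 else (ranked[mid - 1] + ranked[mid]) / 2

def quartiles(values):
    """Q1, Q2, Q3 using the median of each half of the ranked data.

    For an odd number of values the median itself is left out of both halves.
    """
    ranked = sorted(values)
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ranked[:half]              # observations below the median position
    upper = ranked[half + n % 2:]      # observations above the median position
    return _median(lower), _median(ranked), _median(upper)

ceo_pay = [21.6, 21.7, 22.9, 25.2, 26.5, 28.0,
           28.2, 32.6, 32.9, 70.1, 76.1, 84.5]        # Example 3-20
ages = [47, 28, 39, 51, 33, 37, 59, 24, 33]           # Example 3-21

for data in (ceo_pay, ages):
    q1, q2, q3 = quartiles(data)
    print(f"Q1 = {q1:.2f}, Q2 = {q2:.2f}, Q3 = {q3:.2f}, IQR = {q3 - q1:.2f}")
```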

Percentiles
• Percentiles are the summary measures that divide a ranked data
set into 100 equal parts.

• Each (ranked) data set has 99 percentiles that divide it into 100
equal parts.

Percentiles and Percentile Rank
• Calculating Percentiles
• The (approximate) value of the kth percentile, denoted by
$P_k$, is

$P_k$ = Value of the $\left(\dfrac{kn}{100}\right)$th term in a ranked data set

• where k denotes the number of the percentile and n
represents the sample size.

Example 3-22
• Refer to the data on total compensations (in millions of
dollars) for the year 2010 of the 12 highest-paid CEOs of U.S.
companies given in Example 3-20. Find the value of the 60th
percentile. Give a brief interpretation of the 60th percentile.

Example 3-22: Solution
• The data arranged in increasing order is as follows:

21.6 21.7 22.9 25.2 26.5 28.0 28.2 32.6 32.9 70.1 76.1 84.5

• The position of the 60th percentile is

$\dfrac{kn}{100} = \dfrac{(60)(12)}{100} = 7.20$th term $\approx$ 7th term

Example 3-22: Solution
• The value of the 7.20th term can be approximated by the value
of the 7th term in the ranked data. Therefore,

• P60 = 60th percentile = 28.2 = $28.2 million

• Thus, approximately 60% of these 12 CEOs had 2010 total
compensations less than or equal to $28.2 million.
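
The percentile rule can be sketched in Python as shown below. The slides only show that the 7.20th term is approximated by the 7th term; rounding a fractional position to the nearest whole term is an assumption made here, and other texts and software round or interpolate differently. The function name `percentile` is illustrative.

```python
import math

def percentile(values, k):
    """Approximate k-th percentile: value of the (k*n/100)-th ranked term.

    Assumption: a fractional position is rounded to the nearest whole term
    (7.20 -> 7, as in Example 3-22); other texts round differently.
    """
    if not 1 <= k <= 99:
        raise ValueError("k must be between 1 and 99")
    ranked = sorted(values)
    n = len(ranked)
    if n == 0:
        raise ValueError("percentile of an empty data set is undefined")
    position = k * n / 100
    term = max(1, math.floor(position + 0.5))  # nearest term, at least the 1st
    return ranked[term - 1]                    # terms are counted from 1

ceo_pay = [21.6, 21.7, 22.9, 25.2, 26.5, 28.0,
           28.2, 32.6, 32.9, 70.1, 76.1, 84.5]

print(percentile(ceo_pay, 60))   # Example 3-22: 60th percentile
```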

BOX-AND-WHISKER PLOT

• A box-and-whisker plot gives a graphic presentation of
data using five measures: the median, the first quartile, the
third quartile, and the smallest and the largest values in the
data set between the lower and the upper inner fences.

• A box-and-whisker plot can help us visualize the center, the
spread, and the skewness of a data set.

• It also helps detect outliers.

• We can compare different distributions by making box-and-whisker
plots for each of them.

Example 3-24
• The following data are the incomes (in thousands of dollars)
for a sample of 12 households.

75 69 84 112 74 104 81 90 94 144 79 98

• Construct a box-and-whisker plot for these data.

Example 3-24: Solution
• Step 1. First, rank the data in increasing order and calculate
the values of the median, the first quartile, the third quartile,
and the interquartile range. The ranked data are

69 74 75 79 81 84 90 94 98 104 112 144

• Median = (84 + 90) / 2 = 87
• Q1 = (75 + 79) / 2 = 77
• Q3 = (98 + 104) / 2 = 101
• IQR = Q3 – Q1 = 101 – 77 = 24

• Step 2. Find the points that are 1.5 × IQR below Q1 and
1.5 × IQR above Q3.

• 1.5 × IQR = 1.5 × 24 = 36
• Lower inner fence = Q1 – 36 = 77 – 36 = 41
• Upper inner fence = Q3 + 36 = 101 + 36 = 137

• Step 3. Determine the smallest and the largest values in the
given data set within the two inner fences.

• Smallest value within the two inner fences = 69
• Largest value within the two inner fences = 112

• Step 4. Draw a horizontal line and mark the income levels
on it such that all the values in the given data set are
covered. The result of this step is shown in Figure 3.13.

• Step 5. By drawing two lines, join the points of the
smallest and the largest values within the two inner
fences to the box. These values are 69 and 112 in this
example. This completes the box-and-whisker plot, as
shown in Figure 3.14.
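
The numerical part of the construction (Steps 1 to 3) can be sketched in Python as follows; the drawing steps are not covered. The function name `box_plot_summary` and the dictionary keys are illustrative assumptions, and the quartiles are computed as medians of the lower and upper halves of the ranked data, as in the quartile examples of these slides.

```python
def _median(ranked):
    """Median of an already ranked, non-empty list."""
    n = len(ranked)
    mid = n // 2
    return ranked[mid] if n % 2 else (ranked[mid - 1] + ranked[mid]) / 2

def box_plot_summary(values):
    """Quartiles, inner fences, whisker ends, and outliers for a box plot."""
    ranked = sorted(values)                     # Step 1: rank the data
    n = len(ranked)
    half = n // 2
    q1 = _median(ranked[:half])
    q3 = _median(ranked[half + n % 2:])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr                # Step 2: inner fences
    upper_fence = q3 + 1.5 * iqr
    # Step 3: smallest and largest values that lie within the inner fences
    inside = [x for x in ranked if lower_fence <= x <= upper_fence]
    return {
        "median": _median(ranked),
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_inner_fence": lower_fence,
        "upper_inner_fence": upper_fence,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in ranked if x < lower_fence or x > upper_fence],
    }

incomes = [75, 69, 84, 112, 74, 104, 81, 90, 94, 144, 79, 98]
summary = box_plot_summary(incomes)
for key, value in summary.items():
    print(f"{key}: {value}")
```

In this data set the income of 144 lies above the upper inner fence of 137, which is why the upper whisker stops at 112.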

Elements of a Box Plot
[Figure: elements of a box plot. Labels: outlier (o); suspected outlier (*);
smallest data point not below the inner fence (X); largest data point not
exceeding the inner fence (X); median; Q1; Q3; interquartile range;
inner fences at Q1 – 1.5(IQR) and Q3 + 1.5(IQR);
outer fences at Q1 – 3(IQR) and Q3 + 3(IQR).]
