STA201 Lec 04

Uploaded by

azmainadilyasar
Copyright
© © All Rights Reserved

Lecture Note: 04

Topic: Measures of Dispersion

Measures of Dispersion
When two or more datasets are compared using only measures of central tendency
(averages), they may give the same result even though the data are very different.
Consider the runs scored by two batsmen in their last ten matches:

Batsman A: 30, 91, 0, 64, 42, 80, 30, 5, 117, 71

Batsman B: 53, 46, 48, 50, 53, 53, 58, 60, 57, 52

The mean score of both batsmen A and B is the same, i.e., 53. Can we say that the
performance of the two players is the same? Clearly not, because the scores of
batsman A are spread from 0 to 117, whereas the scores of batsman B are spread only
from 46 to 60.
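
The comparison can be checked numerically. The following is a minimal Python sketch (the helper name `summarize` is illustrative and not part of the lecture) that computes the mean, minimum and maximum of each batsman's scores:

```python
from statistics import mean

batsman_a = [30, 91, 0, 64, 42, 80, 30, 5, 117, 71]
batsman_b = [53, 46, 48, 50, 53, 53, 58, 60, 57, 52]


def summarize(scores):
    """Return (mean, minimum, maximum) of a list of scores."""
    return mean(scores), min(scores), max(scores)


summary_a = summarize(batsman_a)
summary_b = summarize(batsman_b)
print(summary_a)  # same mean as B, but scores run from 0 to 117
print(summary_b)  # same mean as A, but scores run only from 46 to 60
```

Both summaries share the same mean, so only the minimum and maximum reveal the difference in spread.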

Just as there are several ways of measuring the central tendency of a dataset
(mean, median and mode), there are several ways of measuring and comparing
the dispersion of distributions.

Types of Measures of Dispersion


There are two types of measures of dispersion:

1. Absolute Measure of Dispersion

An absolute measure of dispersion measures the variability in the same unit as
the data; e.g., if the unit of the data is Tk, meter, kg, etc., the unit of the
measure of dispersion will also be Tk, meter, kg, etc.

The common absolute measures of dispersion are:

a) Range
b) Quartile Deviation
c) Mean Deviation (from mean or from median)
d) Standard Deviation

2. Relative Measure of Dispersion


A relative measure of dispersion compares the variability of two or more datasets
and is independent of the unit.

$$\text{Relative measure} = \frac{\text{Absolute measure}}{\text{Average}} \times 100\%$$

The common relative measures of dispersion are:



a) Coefficient of Range
b) Coefficient of Quartile Deviation
c) Coefficient of Mean Deviation
d) Coefficient of Variation (C.V)

The major difference between absolute and relative measures of dispersion is that an
absolute measure describes the variability of a single dataset and carries the unit of
measurement, whereas a relative measure is unit-less and is used to compare the
variation of two or more distributions.

1 Absolute Measures of Dispersion

1.1 Range

The range is a measure of spread that represents the difference between the largest and
smallest values in a dataset. For raw data, it’s calculated by subtracting the smallest
value from the largest. In continuous grouped data, the range is the difference between
the upper limit of the highest class and the lower limit of the lowest class.

➢ Formula for raw data,

$R = X_{Max} - X_{Min}$ = Largest value − smallest value

➢ Formula for grouped data,

$R = X_U - X_L$ = Upper limit of the highest class − lower limit of the lowest class.

EXAMPLE 1. Range for Raw Data:

The ages of 8 students in a classroom are recorded as follows: 15, 17, 16, 14, 19, 18, 15, 20
years. Find the range of the ages.

SOLUTION: For raw data, Range, 𝑅 = 𝑋𝑀𝑎𝑥 – 𝑋𝑀𝑖𝑛 = 20 − 14 = 6 𝑦𝑒𝑎𝑟𝑠

Interpretation: The range of 6 years shows the age spread of students, with 14 being the
youngest and 20 being the oldest in the group.

EXAMPLE 2. Range for Group Data:

You have been provided with a frequency distribution table that contains data on the
durations of 115 randomly selected programming tutorial videos uploaded by a YouTube
channel.

Duration in minutes 30-40 40-50 50-60 60-70 70-80 80-90 90-100


Number of tutorials 10 14 20 34 22 9 6

Find the range of the tutorial durations.



SOLUTION: For group data, Range, 𝑅 = 𝑋𝑈 − 𝑋𝐿 = 100 − 30 = 70 𝑚𝑖𝑛𝑢𝑡𝑒𝑠

Interpretation: The range of 70 minutes shows the difference between the shortest and
longest tutorials, with the shortest in the 30–40-minute range and the longest in the 90–
100-minute range. This spread indicates a considerable variation in tutorial lengths.
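
The two range formulas can be sketched in Python as follows. This is only an illustration; the function names `range_raw` and `range_grouped` and the representation of classes as `(lower, upper)` tuples are assumptions made for the example.

```python
def range_raw(values):
    """Range for raw data: largest value minus smallest value."""
    return max(values) - min(values)


def range_grouped(class_limits):
    """Range for grouped data.

    class_limits is a list of (lower, upper) tuples, one per class:
    upper limit of the highest class minus lower limit of the lowest class.
    """
    lowest = min(lower for lower, _ in class_limits)
    highest = max(upper for _, upper in class_limits)
    return highest - lowest


ages = [15, 17, 16, 14, 19, 18, 15, 20]
classes = [(30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
print(range_raw(ages))         # 6 (years)
print(range_grouped(classes))  # 70 (minutes)
```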

1.2 Quartile Deviation

The quartile deviation, also known as the semi-interquartile range, measures the spread
of the middle 50% of a dataset. It is calculated as half the difference between the first
quartile (𝑄1 ) and the third quartile (𝑄3 ). This measure gives insight into the dataset's
dispersion by considering only the central portion, minimizing the effect of extreme
values.

➢ Formula for both raw data and grouped data:

Quartile deviation, $QD = \frac{Q_3 - Q_1}{2}$; where $Q_3$ = third quartile and $Q_1$ = first quartile.

EXAMPLE 3. Quartile Deviation for Raw Data

The ages of 8 students in a classroom are recorded as follows: 15, 17, 16, 14, 19, 18, 15, 20
years. Find the quartile deviation of the ages.

SOLUTION:

Step 1. Arrange data: 14, 15, 15, 16, 17, 18, 19, 20.

Step 2. Find 𝑄1 and 𝑄3:

$Q_1$ (1st quartile): Here, $\frac{i \times n}{100} = \frac{25 \times 8}{100} = 2$

Therefore, $Q_1$ = average of the 2nd and 3rd values: $Q_1 = \frac{15 + 15}{2} = 15$.

$Q_3$ (3rd quartile): Here, $\frac{i \times n}{100} = \frac{75 \times 8}{100} = 6$

Therefore, $Q_3$ = average of the 6th and 7th values: $Q_3 = \frac{18 + 19}{2} = 18.5$.

Step 3. Calculate Quartile Deviation:

$$Q.D. = \frac{Q_3 - Q_1}{2} = \frac{18.5 - 15}{2} = 1.75$$

Interpretation: The quartile deviation of 1.75 years shows that the ages of students vary
around the median by about 1.75 years, indicating moderate dispersion within the central
portion of this dataset.
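
The positional rule used in this example can be sketched in Python. The names `quartile` and `quartile_deviation` are illustrative. The branch for a fractional position (round up to the next ordered value) is an assumption: this example only shows the integer case, but rounding up reproduces the quartiles quoted in the later nine-value examples.

```python
def quartile(values, i):
    """Return the i-th percentile (i = 25, 50 or 75) using the position i*n/100."""
    data = sorted(values)
    n = len(data)
    whole, remainder = divmod(i * n, 100)
    if remainder == 0:
        # Integer position k: average the k-th and (k+1)-th ordered values (1-based).
        return (data[whole - 1] + data[whole]) / 2
    # Assumed rule for a fractional position: round up to the next ordered value.
    return data[whole]


def quartile_deviation(values):
    """Half the difference between the third and first quartiles."""
    return (quartile(values, 75) - quartile(values, 25)) / 2


ages = [15, 17, 16, 14, 19, 18, 15, 20]
print(quartile(ages, 25), quartile(ages, 75), quartile_deviation(ages))
# 15.0 18.5 1.75
```

Other textbooks and software packages use different quartile conventions, so results from library functions may differ slightly from this rule.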

EXAMPLE 4. Quartile Deviation for Grouped Data

You have been provided with a frequency distribution table that contains data on the
durations of 115 randomly selected programming tutorial videos uploaded by a YouTube
channel.

Duration in minutes 30-40 40-50 50-60 60-70 70-80 80-90 90-100


Number of tutorials 10 14 20 34 22 9 6

Find the quartile deviation of the tutorial durations.

SOLUTION:

Step 1. Determine Cumulative Frequency to find 𝑄1 and 𝑄3.

Duration in minutes 30-40 40-50 50-60 60-70 70-80 80-90 90-100


Number of tutorials (𝑓) 10 14 20 34 22 9 6
Cumulative Frequency (𝐹) 10 24 44 78 100 109 115

Step 2. Find the 𝑄1 and 𝑄3 positions

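$Q_1$: Position $= \frac{i \times n}{4} = \frac{1 \times 115}{4} = 28.75$, which falls in the 50-60 interval.

$Q_3$: Position $= \frac{i \times n}{4} = \frac{3 \times 115}{4} = 86.25$, which falls in the 70-80 interval.

Step 3. Use Interpolation for $Q_1$ and $Q_3$:

$$Q_i = L_0 + \left(\frac{\frac{i \times n}{4} - F_{prev}}{f}\right) \times C$$

where $L_0$ is the lower limit of the quartile class, $F_{prev}$ is the cumulative frequency
of the preceding class, $f$ is the frequency of the quartile class, and $C$ is the class width.

$$Q_1 = 50 + \left(\frac{28.75 - 24}{20}\right) \times 10 = 52.375$$

$$Q_3 = 70 + \left(\frac{86.25 - 78}{22}\right) \times 10 = 73.75$$

Step 4. Calculate Quartile Deviation:

$$Q.D. = \frac{Q_3 - Q_1}{2} = \frac{73.75 - 52.375}{2} = 10.69$$

Interpretation: The quartile deviation of 10.69 minutes suggests that the middle 50% of
tutorial durations vary by about 10.69 minutes from the median duration, reflecting a
moderate spread within the core range of the dataset.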
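
The interpolation step can be sketched in Python. The function name `grouped_quartile` and its arguments (class lower limits, frequencies and a common class width) are assumptions made for illustration.

```python
def grouped_quartile(lower_limits, freqs, width, i):
    """i-th quartile (i = 1, 2, 3) of a continuous frequency distribution."""
    n = sum(freqs)
    position = i * n / 4
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= position:
            # Quartile class found: interpolate linearly inside it.
            return lower + (position - cumulative) / f * width
        cumulative += f
    raise ValueError("position lies beyond the last class")


lowers = [30, 40, 50, 60, 70, 80, 90]
freqs = [10, 14, 20, 34, 22, 9, 6]
q1 = grouped_quartile(lowers, freqs, 10, 1)
q3 = grouped_quartile(lowers, freqs, 10, 3)
qd = (q3 - q1) / 2
print(round(q1, 3), round(q3, 3), round(qd, 2))
```

Calling the same function with `i = 2` gives the median of the grouped data.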

1.3 Mean deviation

The mean deviation (also called the average absolute deviation) measures the average
distance of each data value from the mean or median of the dataset. It provides an insight
into the dispersion by showing how much, on average, values differ from the central point.

M.D from Mean:

Raw data: $M.D(\bar{x}) = \frac{\sum |x_i - \bar{x}|}{n}$

Grouped data: $M.D(\bar{x}) = \frac{\sum f_i |x_i - \bar{x}|}{n}$

M.D from Median:

Raw data: $M.D(Me) = \frac{\sum |x_i - Me|}{n}$

Grouped data: $M.D(Me) = \frac{\sum f_i |x_i - Me|}{n}$

EXAMPLE 5. Mean Deviation for Raw Data

The exam scores of students are recorded as follows:

85 92 78 89 95 88 76 82 90 91.

a) Calculate the mean deviation from the mean.


b) Calculate the mean deviation from the median.

SOLUTION:

a) Mean deviation from the mean, $M.D(\bar{x})$:

We know, the mean deviation from the mean for raw data: $M.D(\bar{x}) = \frac{\sum |x_i - \bar{x}|}{n}$

Step 1. Calculate the mean:

$$\text{Mean}, \bar{x} = \frac{\sum x}{n} = \frac{85 + 92 + 78 + 89 + 95 + 88 + 76 + 82 + 90 + 91}{10} = \frac{866}{10} = 86.6$$

Step 2. Now calculate the mean deviation from the mean, $MD(\bar{x})$:

$$MD(\bar{x}) = \frac{\sum |x_i - \bar{x}|}{n} = \frac{|85 - 86.6| + |92 - 86.6| + |78 - 86.6| + |89 - 86.6| + |95 - 86.6| + |88 - 86.6| + |76 - 86.6| + |82 - 86.6| + |90 - 86.6| + |91 - 86.6|}{10}$$

$$\therefore MD(\bar{x}) = \frac{50.8}{10} = 5.08$$

Interpretation: The mean deviation of 5.08 marks indicates that, on average, the scores
of students differ from the mean scores by 5.08 marks.

b) Mean Deviation from Median (MD(Me)):

We know the mean deviation from the median, $MD(Me) = \frac{\sum |x_i - \text{Median}|}{n}$

Step 1. Find the median:

To find the median, we need to sort our dataset:



76 78 82 85 88 89 90 91 92 95

$$\text{Median} = \frac{\left(\frac{n}{2}\right)\text{th value} + \left(\frac{n}{2} + 1\right)\text{th value}}{2} = \frac{88 + 89}{2} = 88.5$$

Step 2. Now calculate mean deviation from median, MD(Me):

$$MD(Me) = \frac{\sum |x_i - \text{Median}|}{n} = \frac{|85 - 88.5| + |92 - 88.5| + |78 - 88.5| + |89 - 88.5| + |95 - 88.5| + |88 - 88.5| + |76 - 88.5| + |82 - 88.5| + |90 - 88.5| + |91 - 88.5|}{10}$$

$$\text{or, } MD(Me) = \frac{48}{10} = 4.8$$

Interpretation: The mean deviation of 4.8 marks indicates that, on average, the scores of
students differ from the median scores by 4.8 marks.
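
Both parts of this example can be reproduced with a short Python sketch. The helper name `mean_deviation` is illustrative; it takes the centre (mean or median) as an argument so the same function covers both formulas.

```python
from statistics import mean, median


def mean_deviation(values, center):
    """Average absolute distance of the values from a chosen centre."""
    return sum(abs(x - center) for x in values) / len(values)


scores = [85, 92, 78, 89, 95, 88, 76, 82, 90, 91]
md_mean = mean_deviation(scores, mean(scores))
md_median = mean_deviation(scores, median(scores))
print(round(md_mean, 2), round(md_median, 2))  # 5.08 4.8
```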

EXAMPLE 6. Mean Deviation for Grouped Data

A frequency distribution table shows the monthly rainfall (in mm) recorded over 12
months:

Rainfall (mm) 50-60 60-70 70-80 80-90 90-100


Frequency 2 3 4 2 1

Find the mean deviation from the mean.

SOLUTION:

Step 1. Calculate Mean Using Midpoints:

Rainfall (mm) 50-60 60-70 70-80 80-90 90-100


Mid-points (𝑥𝑖 ) 55 65 75 85 95
Frequency (𝑓𝑖 ) 2 3 4 2 1

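Therefore, $\bar{x} = \frac{\sum f_i x_i}{n} = \frac{2 \times 55 + 3 \times 65 + 4 \times 75 + 2 \times 85 + 1 \times 95}{2 + 3 + 4 + 2 + 1} = \frac{870}{12} = 72.5$

Step 2. Calculate Deviations and Mean Deviation

$$M.D(\bar{x}) = \frac{\sum f_i |x_i - \bar{x}|}{n} = \frac{2 \times |55 - 72.5| + 3 \times |65 - 72.5| + \dots + 1 \times |95 - 72.5|}{12} = \frac{115}{12} = 9.58$$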

Interpretation: The mean deviation of 9.58 mm shows that the monthly rainfall varies, on
average, by 9.58 mm from the mean rainfall of 72.5 mm.
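
For grouped data the same idea applies with class midpoints weighted by frequency. The following Python sketch is an illustration only; the name `grouped_mean_deviation` is an assumption.

```python
def grouped_mean_deviation(midpoints, freqs):
    """Return (mean, mean deviation from the mean) for grouped data."""
    n = sum(freqs)
    grouped_mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    md = sum(f * abs(x - grouped_mean) for x, f in zip(midpoints, freqs)) / n
    return grouped_mean, md


rain_mid = [55, 65, 75, 85, 95]
rain_freq = [2, 3, 4, 2, 1]
rain_mean, rain_md = grouped_mean_deviation(rain_mid, rain_freq)
print(rain_mean, round(rain_md, 2))  # 72.5 9.58
```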

1.4 Variance and Standard Deviation

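Variance: Variance measures the average of the squared differences between each data
point and the mean. In mathematical terms, the variance ($\sigma^2$ or $s^2$) is calculated as:

Population variance, for ungrouped data:

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N}$$

Population variance, for grouped data:

$$\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{N} = \frac{\sum f_i x_i^2 - \frac{(\sum f_i x_i)^2}{N}}{N}$$

Sample variance, for ungrouped data:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$$

Sample variance, for grouped data:

$$s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{n - 1} = \frac{\sum f_i x_i^2 - \frac{(\sum f_i x_i)^2}{n}}{n - 1}$$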

Where, 𝜇= population mean and 𝑥̅ =sample mean.

Variance provides a measure of how data points are scattered around the mean. A higher
variance indicates greater data dispersion.

Standard Deviation: The standard deviation is the square root of the variance. It
measures, roughly, the typical distance between a data point and the mean. Mathematically,
the standard deviation ($\sigma$ or $s$) is calculated as:

Population standard deviation, $\sigma = \sqrt{\sigma^2}$

Sample standard deviation, $s = \sqrt{s^2}$

Standard deviation is expressed in the same units as the data, making it more
interpretable than variance. It provides a measure of the spread of data and is often used
for comparing the spread of different datasets. A larger standard deviation indicates
greater data variability.

EXAMPLE 7. Population Variance and Standard Deviation (Raw Data)

The ages (in years) of a small population of 6 employees in a company are: 25, 30, 35, 40,
45, and 50. Calculate the population variance and standard deviation.

SOLUTION:

Step 1. Calculate $\sum x_i$ and $\sum x_i^2$:

$\sum x_i = 25 + 30 + 35 + 40 + 45 + 50 = 225$

$\sum x_i^2 = 25^2 + 30^2 + 35^2 + 40^2 + 45^2 + 50^2 = 8875$

Step 2. Apply the formula:

Population variance, $\sigma^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N} = \frac{8875 - \frac{(225)^2}{6}}{6} = \frac{437.5}{6} = 72.92$

And, Population standard deviation, $\sigma = \sqrt{\sigma^2} = \sqrt{72.92} = 8.54$

Interpretation: The standard deviation of 8.54 years indicates the average deviation from
the mean age in this population is 8.54 years.

EXAMPLE 8. Sample Variance and Standard Deviation (Raw Data)

A random sample of exam scores from 10 students is:

85 92 78 89 95 88 76 82 90 91.

Calculate the sample variance and standard deviation.

SOLUTION:

Step 1. Calculate $\sum x_i$ and $\sum x_i^2$:

$\sum x_i^2 = 85^2 + 92^2 + \dots + 91^2 = 75344$ and $\sum x_i = 85 + 92 + \dots + 91 = 866$

Step 2. Apply the formula:

Sample variance, $s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{75344 - \frac{(866)^2}{10}}{10 - 1} = \frac{348.4}{9} = 38.71$

And Sample standard deviation, $s = \sqrt{s^2} = \sqrt{38.71} = 6.22$

Interpretation: The standard deviation of 6.22 indicates the average deviation from the
mean exam score in this sample is 6.22 marks.
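
Examples 7 and 8 can be cross-checked with Python's standard `statistics` module, which provides separate functions for the population (divide by N) and sample (divide by n − 1) versions. This is a quick check, not part of the lecture's method.

```python
from statistics import pstdev, pvariance, stdev, variance

ages = [25, 30, 35, 40, 45, 50]                     # Example 7: whole population
scores = [85, 92, 78, 89, 95, 88, 76, 82, 90, 91]   # Example 8: sample

# Population versions divide by N.
print(round(pvariance(ages), 2), round(pstdev(ages), 2))    # 72.92 8.54
# Sample versions divide by n - 1.
print(round(variance(scores), 2), round(stdev(scores), 2))  # 38.71 6.22
```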

EXAMPLE 9. Sample Variance and Standard Deviation (Grouped Data)

You have been provided with a frequency distribution table that contains data on the
durations of 115 randomly selected programming tutorial videos uploaded by a YouTube
channel.

Duration in minutes 30-40 40-50 50-60 60-70 70-80 80-90 90-100


Number of tutorials 10 14 20 34 22 9 6

Calculate the Sample variance and sample standard deviation.

SOLUTION:

Step 1. Calculate Midpoints (𝑥𝑖 ) for each class interval:

Duration in minutes 30-40 40-50 50-60 60-70 70-80 80-90 90-100


Mid-points (𝑥𝑖) 35 45 55 65 75 85 95
Number of tutorials (𝑓𝑖 ) 10 14 20 34 22 9 6

Step 2. Calculate $\sum f_i x_i$ and $\sum f_i x_i^2$

Here, $\sum f_i x_i^2 = (10 \times 35^2) + (14 \times 45^2) + \dots + (6 \times 95^2) = 487675$

And $\sum f_i x_i = (10 \times 35) + (14 \times 45) + \dots + (6 \times 95) = 7275$

Step 3. Apply the formula for grouped data:

Sample variance, $s^2 = \frac{\sum f_i x_i^2 - \frac{(\sum f_i x_i)^2}{n}}{n - 1} = \frac{487675 - \frac{7275^2}{115}}{114} = 240.81$

Therefore, the sample standard deviation, $s = \sqrt{240.81} = 15.52$

Interpretation: The standard deviation of 15.52 minutes indicates that the durations of
the tutorial videos typically deviate from the mean duration by about 15.52 minutes.
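
The shortcut formula for grouped data can be sketched in Python as below. The function name `grouped_sample_variance` is illustrative; it assumes each class is represented by its midpoint.

```python
import math


def grouped_sample_variance(midpoints, freqs):
    """Sample variance of grouped data via the shortcut (sum of squares) formula."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(midpoints, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, freqs))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


duration_mid = [35, 45, 55, 65, 75, 85, 95]
duration_freq = [10, 14, 20, 34, 22, 9, 6]
s2 = grouped_sample_variance(duration_mid, duration_freq)
s = math.sqrt(s2)
print(round(s2, 2), round(s, 2))  # 240.81 15.52
```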

2 Relative Measures of Dispersion

2.1 Coefficient of Variation

The Coefficient of Variation was introduced by Karl Pearson. It is the most commonly used
relative measure of dispersion and is used to compare the variation, or the consistency
of performance, of two sets of data. It expresses the standard deviation as a percentage
of the mean, allowing comparison between datasets with different units or widely varying
means. A higher CV indicates greater variability relative to the mean.

Formula for both raw data and grouped data:

➢ Population CV: $C.V = \frac{\sigma}{\mu} \times 100\%$
where $\mu$ is the population mean and $\sigma$ is the population standard deviation.

➢ Sample CV: $C.V = \frac{s}{\bar{x}} \times 100\%$
where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation.

A large value of C.V indicates greater variability, and a small value indicates less
variability; equivalently, the smaller the C.V, the more consistent the data.

For example, when comparing two datasets, A and B:

If 𝐶𝑉𝐴 > 𝐶𝑉𝐵 , dataset A exhibits greater variability than B. Alternatively, dataset B can
be considered more consistent than dataset A, as a lower CV indicates higher relative
consistency.

EXAMPLE 10. Population CV (Raw Data)

The ages (in years) of a small population of 6 employees in a company are: 25, 30, 35, 40,
45, and 50. Calculate the population coefficient of variation (CV).

SOLUTION:

Step 1. Calculate $\sum x_i$ and $\sum x_i^2$:

$\sum x_i = 25 + 30 + 35 + 40 + 45 + 50 = 225$

$\sum x_i^2 = 25^2 + 30^2 + 35^2 + 40^2 + 45^2 + 50^2 = 8875$

Step 2. Calculate the population mean and the population standard deviation:

Population mean, $\mu = \frac{\sum x_i}{N} = \frac{225}{6} = 37.5$

Population variance, $\sigma^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N} = \frac{8875 - \frac{(225)^2}{6}}{6} = \frac{437.5}{6} = 72.92$

Therefore, Population standard deviation, $\sigma = \sqrt{\sigma^2} = \sqrt{72.92} = 8.54$

Step 3. Calculate the coefficient of variation:

Therefore, the coefficient of variation, $CV = \frac{\sigma}{\mu} \times 100\% = \frac{8.54}{37.5} \times 100\% = 22.77\%$

Interpretation: The Coefficient of Variation (CV) of 22.77% indicates the standard
deviation is 22.77% of the mean age of this population.

EXAMPLE 11. Sample CV (Raw Data)

A random sample of exam scores from 10 students is:

85 92 78 89 95 88 76 82 90 91.

Calculate the sample coefficient of variation.

SOLUTION:

Step 1. Calculate $\sum x_i$ and $\sum x_i^2$:

$\sum x_i^2 = 85^2 + 92^2 + \dots + 91^2 = 75344$ and $\sum x_i = 85 + 92 + \dots + 91 = 866$

Step 2. Calculate the sample mean and the sample standard deviation:

Sample mean, $\bar{x} = \frac{\sum x_i}{n} = \frac{866}{10} = 86.6$

Sample variance, $s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{75344 - \frac{(866)^2}{10}}{10 - 1} = \frac{348.4}{9} = 38.71$

And Sample standard deviation, $s = \sqrt{s^2} = \sqrt{38.71} = 6.22$

Step 3. Calculate the CV:

Coefficient of Variation, $CV = \frac{s}{\bar{x}} \times 100\% = \frac{6.22}{86.6} \times 100\% = 7.18\%$

Interpretation: The CV expresses the variability of test scores as a percentage of the mean,
helpful for comparing test consistency. Here a CV of 7.18% indicates a consistency in the
test scores.

EXAMPLE 12. Sample CV (Group Data)

You have been provided with a frequency distribution table that contains data on the
durations of 115 randomly selected programming tutorial videos uploaded by a YouTube
channel.

Duration in minutes 30-40 40-50 50-60 60-70 70-80 80-90 90-100


Number of tutorials 10 14 20 34 22 9 6

Calculate the sample coefficient of variation.

SOLUTION: From Example 9, $\sum f_i x_i = 7275$ and $n = 115$, so the sample mean is
$\bar{x} = \frac{7275}{115} = 63.26$ minutes, and the sample standard deviation was found to be
15.52 minutes.

Therefore, the coefficient of variation, $CV = \frac{s}{\bar{x}} \times 100\% = \frac{15.52}{63.26} \times 100\% = 24.53\%$

Interpretation: This result suggests that the data has moderate variability, with a CV of
24.53%, meaning the durations vary by about 24.53% relative to the mean.

EXAMPLE 13. Let's say you have a dataset of five values representing the monthly
returns of a stock over the past five months: 2%, 3%, -1%, 5%, and -2%.

Calculate the sample CV.

SOLUTION: Here, $\sum x_i^2 = 2^2 + 3^2 + \dots + (-2)^2 = 43$ and $\sum x_i = 2 + 3 + \dots + (-2) = 7$

Sample mean, $\bar{x} = \frac{\sum x_i}{n} = \frac{7}{5} = 1.4\%$

Sample variance, $s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{43 - \frac{(7)^2}{5}}{5 - 1} = \frac{33.2}{4} = 8.3$

And Sample standard deviation, $s = \sqrt{s^2} = \sqrt{8.3} = 2.88\%$ (here, percentage is the
unit of the original data).

Therefore, the Coefficient of Variation, $CV = \frac{s}{\bar{x}} \times 100\% = \frac{2.88\%}{1.4\%} \times 100\% \approx 205.71\%$
(the unit of the original data, which is percentage, cancels out, and the final
percentage is the coefficient of variation expressed as a percentage).

Interpretation: So, in this example, the standard deviation is approximately 2.88%, and
the coefficient of variation is approximately 205.71%. This provides a measure of the
stock's risk (volatility) relative to its return. A higher coefficient of variation indicates
higher risk compared to the mean return.

EXAMPLE 14. Consider a dataset of five test scores for 2 students:

Student A: 90 85 78 92 88

Student B: 85 97 58 98 44

Which student exhibits greater variability in test scores?

SOLUTION: Step 1. Calculate the sample mean:

Sample mean for A, $\bar{x}_A = \frac{\sum x_i}{n} = \frac{90 + 85 + 78 + 92 + 88}{5} = \frac{433}{5} = 86.6$

Sample mean for B, $\bar{x}_B = \frac{\sum x_i}{n} = \frac{85 + 97 + 58 + 98 + 44}{5} = \frac{382}{5} = 76.4$

Step 2. Calculate the sample variance:

Sample variance for A, $s_A^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{37617 - \frac{(433)^2}{5}}{5 - 1} = \frac{119.2}{4} = 29.8$

Sample variance for B, $s_B^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{31538 - \frac{(382)^2}{5}}{5 - 1} = \frac{2353.2}{4} = 588.3$

Step 3. Calculate the sample standard deviation:

Sample standard deviation for A, $s_A = \sqrt{29.8} = 5.46$

Sample standard deviation for B, $s_B = \sqrt{588.3} = 24.25$

Step 4. Calculate the Coefficient of Variation (CV):

CV for A, $CV_A = \frac{s_A}{\bar{x}_A} \times 100\% = \frac{5.46}{86.6} \times 100\% = 6.30\%$

CV for B, $CV_B = \frac{s_B}{\bar{x}_B} \times 100\% = \frac{24.25}{76.4} \times 100\% = 31.74\%$

Interpretation: In this example, the coefficient of variation for student A is much lower
than that for student B. It indicates that the test scores of student A have less
variability (more consistency) than those of student B.
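
The comparison can be reproduced with a short Python sketch built on the standard `statistics` module. The function name `coefficient_of_variation` is illustrative; it uses the sample standard deviation (divisor n − 1), as in this example.

```python
from statistics import mean, stdev


def coefficient_of_variation(sample):
    """Sample CV in percent: sample standard deviation relative to the mean."""
    return stdev(sample) / mean(sample) * 100


student_a = [90, 85, 78, 92, 88]
student_b = [85, 97, 58, 98, 44]
cv_a = coefficient_of_variation(student_a)
cv_b = coefficient_of_variation(student_b)
print(round(cv_a, 1), round(cv_b, 1))  # 6.3 31.7
print("More consistent:", "A" if cv_a < cv_b else "B")
```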

USES OF THE STANDARD DEVIATION

THE EMPIRICAL RULE

For a symmetrical, bell-shaped frequency distribution, approximately 68% of the
observations will lie within plus and minus one standard deviation of the mean; about
95% of the observations will lie within plus and minus two standard deviations of the
mean; and practically all (99.7%) will lie within plus and minus three standard deviations
of the mean.

EXAMPLE 15. A sample of the rental rates at University Park Apartments approximates
a symmetrical, bell-shaped distribution. The sample mean is $500; the standard deviation
is $20. Using the Empirical Rule, answer these questions:

a) About 68% of the monthly rentals are between what two amounts?
b) About 95% of the monthly rentals are between what two amounts?
c) Almost all of the monthly rentals are between what two amounts?

SOLUTION:

a) About 68% are between $480 and $520, found by 𝑥̅ ± 1𝑠 = $500 ± 1($20).
b) About 95% are between $460 and $540, found by 𝑥̅ ± 2𝑠 = $500 ± 2($20).
c) Almost all (99.7%) are between $440 and $560, found by 𝑥̅ ± 3𝑠 = $500 ± 3($20)
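
The three intervals follow directly from mean ± k standard deviations for k = 1, 2, 3. A minimal Python sketch (the function name `empirical_rule_intervals` is an assumption for illustration):

```python
def empirical_rule_intervals(mean, sd):
    """Map each approximate coverage level to the interval mean ± k*sd."""
    levels = ((1, "68%"), (2, "95%"), (3, "99.7%"))
    return {coverage: (mean - k * sd, mean + k * sd) for k, coverage in levels}


rent_intervals = empirical_rule_intervals(500, 20)
for coverage, (low, high) in rent_intervals.items():
    print(f"About {coverage} of rentals lie between ${low} and ${high}")
```

These intervals are only meaningful when the data are roughly symmetrical and bell-shaped.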

LIMITATION OF EMPIRICAL RULE: The Empirical Rule applies only to a symmetrical,
bell-shaped distribution. Chebyshev’s theorem, however, applies to any set of values; that
is, the distribution of values can have any shape (positively skewed, negatively skewed or
symmetric).

CHEBYSHEV’S THEOREM

For any set of observations (sample or population), the proportion of the values that lie
within k standard deviations of the mean is at least $1 - \frac{1}{k^2}$, where k is any value greater
than 1.

EXAMPLE 16. Dupree Paint Company employees contribute a mean of $51.54 to the
company’s profit-sharing plan every two weeks. The standard deviation of biweekly
contributions is $7.51. At least what percent of the contributions lie within plus 3.5
standard deviations and minus 3.5 standard deviations of the mean, that is between
$25.26 and $77.83?

SOLUTION: At least about 92%, found by

$$1 - \frac{1}{k^2} = 1 - \frac{1}{(3.5)^2} = 1 - 0.0816 \approx 0.92$$
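
Chebyshev's bound is a one-line calculation. The sketch below (the function name `chebyshev_lower_bound` is illustrative) returns the minimum proportion of values within k standard deviations of the mean and rejects values of k for which the theorem is not stated:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


bound = chebyshev_lower_bound(3.5)
print(round(bound, 4))  # 0.9184
```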

Shape Characteristics of the Distribution


SKEWNESS:

Skewness is a measure of the asymmetry of the distribution of the measurements of
a variable relative to its mean. As the data becomes skewed from a normal distribution,
the mean loses its ability to provide the best measure of central tendency.

A frequency distribution is said to be symmetrical if the frequencies are equally
distributed on both sides of the central value. A symmetrical distribution may be
bell-shaped or U-shaped.

A frequency distribution is said to be skewed if the frequencies are not equally distributed
on both the sides of the central value. A skewed distribution may be - Positively Skewed
or Negatively Skewed.

(a) Left / Negatively skewed: Mean < Median < Mode, SK < 0
(b) Symmetric / Normal / Not skewed: Mean = Median = Mode, SK = 0
(c) Right / Positively skewed: Mean > Median > Mode, SK > 0

KURTOSIS:

Kurtosis is another measure of the shape of a frequency curve. It is a Greek word, which
means bulginess. While skewness signifies the extent of asymmetry, kurtosis measures
the degree of peaked-ness of a frequency distribution.

Karl Pearson classified curves into three types on the basis of the shape of their peaks:
mesokurtic, leptokurtic and platykurtic.

[Figure: leptokurtic ($\beta_2 > 3$), mesokurtic ($\beta_2 = 3$) and platykurtic ($\beta_2 < 3$) curves]

[Figure: histograms of (a) right skewed, (b) left skewed, (c) symmetric, (d) leptokurtic, (e) mesokurtic and (f) platykurtic distributions]

Graphical illustration of skewness and kurtosis using histogram.

MEASURES OF SKEWNESS:

By comparing the mean, median, and mode, we get an idea about the shape of the
distribution as follows:

➢ If 𝑀𝑒𝑎𝑛 < 𝑀𝑒𝑑𝑖𝑎𝑛 < 𝑀𝑜𝑑𝑒, the distribution is left/negative skewed.


➢ If 𝑀𝑒𝑎𝑛 > 𝑀𝑒𝑑𝑖𝑎𝑛 > 𝑀𝑜𝑑𝑒, the distribution is right/positive skewed.
➢ If 𝑀𝑒𝑎𝑛 = 𝑀𝑒𝑑𝑖𝑎𝑛 = 𝑀𝑜𝑑𝑒, the distribution is symmetric.

There are also several numerical measures of skewness. Three important ones are:

1. Karl Pearson's Coefficient of Skewness

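The formula for measuring skewness as given by Karl Pearson is as follows:

$$SK_p = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$$

When the mode is not well defined, the formula is written in terms of the median:

$$SK_p = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$$

The two forms agree approximately because of the empirical relation among the mean,
median, and mode for moderately skewed distributions:

$$\text{Mean} - \text{Mode} \approx 3(\text{Mean} - \text{Median})$$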

2. Bowley’s coefficient of skewness



The coefficient of skewness based on quartiles, also known as Bowley’s coefficient of
skewness, is used to measure the asymmetry of data using the first ($Q_1$), second ($Q_2$, the
median), and third ($Q_3$) quartiles. This method is especially useful when data may contain
extreme values.

Bowley’s Coefficient of Skewness,

$$SK_B = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$

3. Moment Based Coefficient of Skewness:

Moments: Moments in statistics are quantitative measures (or a set of statistical
parameters) that describe specific characteristics of a probability distribution. Thus,
by using moments, we can measure the central tendency of a series, its dispersion or
variability, its skewness and the peaked-ness of the curve. The moments about the arithmetic
mean are known as central moments and are denoted by $\mu_r$. The first four central
moments are obtained from the following formula with r = 1, 2, 3, 4:

r-th central moment (population), for ungrouped data: $\mu_r = \frac{\sum (x_i - \mu)^r}{N}$

r-th central moment (population), for grouped data: $\mu_r = \frac{\sum f_i (x_i - \mu)^r}{N}$

So, coefficient of skewness for the population, $\sqrt{\beta_1} = \frac{\mu_3}{(\sqrt{\mu_2})^3}$

And for the sample,

$$\sqrt{b_1} = \frac{n}{(n-1)(n-2)} \times \frac{\sum (x_i - \bar{x})^3}{\left(\frac{\sum (x_i - \bar{x})^2}{n-1}\right)^{3/2}}$$

Interpretation of coefficient of skewness:

➢ If the coefficient of skewness < 0, the distribution is left/negative skewed.
➢ If the coefficient of skewness > 0, the distribution is right/positive skewed.
➢ If the coefficient of skewness = 0, the distribution is symmetric.

MEASURES OF KURTOSIS
Coefficient of kurtosis for the population, $\beta_2 = \frac{\mu_4}{(\mu_2)^2}$

And for the sample,

$$b_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \times \frac{\sum (x_i - \bar{x})^4}{\left(\frac{\sum (x_i - \bar{x})^2}{n-1}\right)^2} - \frac{3(n-1)^2}{(n-2)(n-3)}$$

For a normal distribution, that is, a mesokurtic curve, the coefficient of kurtosis
$\beta_2$ is 3. If it is greater than 3, the curve is called leptokurtic and is more
peaked than the normal curve. If it is less than 3, the curve is called platykurtic
and is less peaked than the normal curve. Note that the sample formula for $b_2$ already
subtracts a correction term close to 3, so $b_2$ is compared with 0 rather than 3.
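
The population (moment-based) coefficients can be sketched in Python as follows. The function names are illustrative, and the sketch covers only the population formulas for ungrouped data, not the bias-adjusted sample versions.

```python
def central_moment(values, r):
    """r-th central moment of ungrouped data, treating it as a population."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** r for x in values) / n


def moment_skewness(values):
    """sqrt(beta_1) = mu_3 / (sqrt(mu_2))^3."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def moment_kurtosis(values):
    """beta_2 = mu_4 / mu_2^2 (equals 3 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]
right_skewed = [10, 11, 15, 20, 20, 36, 48, 50, 52]
print(moment_skewness(symmetric))     # 0.0: no skew
print(moment_kurtosis(symmetric))     # below 3: flatter than the normal curve
print(moment_skewness(right_skewed))  # positive: right skewed
```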

EXAMPLE 17. Shape Characteristics (Skewness) for Raw Data

Given a dataset: 10, 11, 15, 20, 20, 36, 48, 50, 52. Check the shape of the distribution.

SOLUTION: First, find the mean, median, mode, and standard deviation.

Here, Mean, $\bar{x} = \frac{262}{9} = 29.11$, Median = 20, Mode = 20, and Standard Deviation = 17.40

Using Pearson’s coefficient of skewness, $SK_p = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} = \frac{29.11 - 20}{17.40} = 0.52$

Since $SK_p > 0$, the distribution is positively skewed / right skewed.

Alternatively, we can check using quartiles:

Here, first quartile $Q_1 = 15$, second quartile $Q_2 = 20$, and third quartile $Q_3 = 48$.

Now, Bowley’s Coefficient of Skewness, $SK_B = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{(48 - 20) - (20 - 15)}{48 - 15} = \frac{23}{33} = 0.70$

Since $SK_B > 0$, the distribution is positively skewed / right skewed.
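
Both coefficients for this dataset can be checked with a short Python sketch. The function names are illustrative. `pearson_skewness` uses the sample standard deviation, which matches the value 17.40 quoted above, and the quartiles are passed in as given in the solution.

```python
from statistics import mean, mode, stdev


def pearson_skewness(values):
    """(Mean - Mode) / sample standard deviation."""
    return (mean(values) - mode(values)) / stdev(values)


def bowley_skewness(q1, q2, q3):
    """(Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)


data = [10, 11, 15, 20, 20, 36, 48, 50, 52]
sk_p = pearson_skewness(data)
sk_b = bowley_skewness(15, 20, 48)
print(round(sk_p, 2), round(sk_b, 2))  # 0.52 0.7
```

Both values are positive, so the two measures agree that the data are right skewed.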

EXAMPLE 18. Shape Characteristics (Skewness) for Group Data

You have been provided with a frequency distribution table that contains data on the
durations of 115 randomly selected programming tutorial videos uploaded by a YouTube
channel.

Duration in minutes 30-40 40-50 50-60 60-70 70-80 80-90 90-100


Number of tutorials 10 14 20 34 22 9 6

Are there any deviations from symmetry in the distribution of the programming
tutorial videos? Check the shape of the distribution.

SOLUTION: First, find the mean, median, mode, and standard deviation.

Here, Mean, $\bar{x} = \frac{7275}{115} = 63.26$, Median = 63.97, Mode = 65.38, and SD = 15.52 [see
previous examples]

Now, using Pearson’s coefficient of skewness,

$$SK_p = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} = \frac{63.26 - 65.38}{15.52} = -0.14$$

Since $SK_p < 0$, the distribution is negatively skewed / left skewed.

Box-plot and outlier detection


BOX-PLOT: A box plot represents a graphical summary of what is sometimes called a
“five-number summary” of the distribution

❖ Minimum
❖ 1st quartile
❖ Median/ 2nd quartile
❖ 3rd quartile
❖ Maximum

OUTLIER

An outlier is a data point that differs significantly from other observations. An outlier
may be due to a variability in the measurement, an indication of novel data, or it may be
the result of experimental error; the latter are sometimes excluded from the data set.

Detecting outlier using box-plot:

➢ 1.5 IQR Rule:


Any data point that falls outside the interval (𝑄1 − 1.5 × 𝐼𝑄𝑅, 𝑄3 + 1.5 × 𝐼𝑄𝑅) is
considered an outlier.

➢ 3 IQR Rule:
Any data point that falls outside the interval (𝑄1 − 3 × 𝐼𝑄𝑅, 𝑄3 + 3 × 𝐼𝑄𝑅) is
considered an extreme outlier.

Here, Inter-Quartile Range, 𝐼𝑄𝑅 = 𝑄3 − 𝑄1 ;

Lower fence = 𝑄1 − 1.5 × 𝐼𝑄𝑅 and Upper fence = 𝑄3 + 1.5 × 𝐼𝑄𝑅

Handling outliers: Handling outliers depends on the specific goals of our analysis and
the nature of our data. It's common to remove outliers to reduce their impact on statistical
analyses. However, removing too many outliers can lead to a loss of information and
potentially introduce bias. In such cases, statisticians use methods that are less
sensitive to extreme values, known as “Robust Statistics”.

EXAMPLE 20. Box-Plot for Raw Data



Consider the first 9 Commodore prices (in ’000): 6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0.
Draw a boxplot and identify any potential outliers in the dataset.

SOLUTION:

Arrange these in order of magnitude 3.8, 5.8, 5.99, 6.0, 6.7, 7.0, 9.975, 10.5, 20.0

𝑄2 = 6.7 (There are 4 values on either side)

𝑄1 = 5.99 (The median of the smallest half of the values)

𝑄3 = 9.975 (The median of the largest half of the values)

𝐼𝑄𝑅 = 𝑄3 – 𝑄1 = 9.975 − 5.99 = 3.985

Outlier Interval= (𝑄1 − 1.5 × 𝐼𝑄𝑅, 𝑄3 + 1.5 × 𝐼𝑄𝑅)

= (5.99 − 1.5 × 3.985, 9.975 + 1.5 × 3.985) = (0.0125, 15.9525)

Therefore, 20.0 is an outlier.

Similarly, extreme outlier interval = (5.99-3× 3.985, 9.975+ 3× 3.985) = (-5.965, 21.93)

The data contains no extreme outlier.

EXAMPLE 21. Outlier Detection for Raw Data

Given a dataset: 15, 20, 24, 29, 37, 40, 44, 48, 120. Identify any outliers or extreme outliers.

SOLUTION:

Here, 𝑀𝑖𝑛 = 15, 𝑄1 = 24, 𝑀𝑒𝑑𝑖𝑎𝑛 (𝑄2 ) = 37, 𝑄3 = 44, 𝑀𝑎𝑥 = 120

Inter-quartile-range, $IQR = Q_3 - Q_1 = 44 - 24 = 20$

Outlier Interval = (24 − 1.5 × 20, 44 + 1.5 × 20) = ( −6, 74)



Therefore, 120 is an outlier.

Similarly, extreme outlier interval = (24 − 3 × 20, 44 + 3 × 20) = (−36, 104)

Therefore, 120 is an extreme outlier.
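
The 1.5 IQR and 3 IQR rules can be sketched in Python as below. The function names `iqr_fences` and `find_outliers` are illustrative, and the quartiles are passed in as found in the solutions rather than recomputed.

```python
def iqr_fences(q1, q3, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Values lying outside the fences (k=1.5: outliers, k=3: extreme outliers)."""
    low, high = iqr_fences(q1, q3, k)
    return [x for x in values if x < low or x > high]


data = [15, 20, 24, 29, 37, 40, 44, 48, 120]
outliers = find_outliers(data, 24, 44)      # 1.5 IQR rule
extreme = find_outliers(data, 24, 44, k=3)  # 3 IQR rule
print(iqr_fences(24, 44), outliers, extreme)
# (-6.0, 74.0) [120] [120]
```

Applied to the Commodore prices of Example 20 with Q1 = 5.99 and Q3 = 9.975, the same functions flag 20.0 under the 1.5 IQR rule but not under the 3 IQR rule.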

EXAMPLE 22. Comparative Box-Plot for Raw Data

The monthly sales volumes (in units) of two teams, Team X and Team Y, over the last
nine months are given below:

Team X Sales Volume 35 40 42 43 45 47 48 49 50


Team Y Sales Volume 30 32 34 36 38 40 42 43 55

a) Draw a comparative box plot for both teams.


b) Identify and interpret any outliers.
c) Comment on the shape of the distribution of each dataset.

SOLUTION:

a) Five-Number Summary:

Team 𝑋: 𝑀𝑖𝑛 = 35, 𝑄1 = 42, 𝑀𝑒𝑑𝑖𝑎𝑛 (𝑄2 ) = 45, 𝑄3 = 48, 𝑀𝑎𝑥 = 50

Team 𝑌: 𝑀𝑖𝑛 = 30, 𝑄1 = 34, 𝑀𝑒𝑑𝑖𝑎𝑛 (𝑄2 ) = 38, 𝑄3 = 42, 𝑀𝑎𝑥 = 55

b) Outlier Interval:
➢ For Team X,

Inter-quartile-range, 𝐼𝑄𝑅 = 𝑄3 − 𝑄1 = 48 − 42 = 6

Lower fence: $Q_1 - 1.5 \times IQR = 42 - 1.5 \times 6 = 33$

Upper fence: 𝑄3 + 1.5 × 𝐼𝑄𝑅 = 48 + 1.5 × 6 = 57

Interpretation: There are no data points outside the expected range, so no outliers.

➢ For Team Y,

Inter-quartile-range, 𝐼𝑄𝑅 = 𝑄3 − 𝑄1 = 42 − 34 = 8

Lower fence: 𝑄1 − 1.5 × 𝐼𝑄𝑅 = 34 − 1.5 × 8 = 22

Upper fence: 𝑄3 + 1.5 × 𝐼𝑄𝑅 = 42 + 1.5 × 8 = 54

Interpretation: Since the data point 55 lies above the upper fence of 54, it is an
outlier.

c) Skewness:

The distributions of both 𝑇𝑒𝑎𝑚 𝑋 and 𝑇𝑒𝑎𝑚 𝑌 are fairly symmetric, as the distances
between 𝑄1 and 𝑄2, and between 𝑄2 and 𝑄3, are equal.

EXPLANATION OF A BOXPLOT: This box plot displays life expectancy across
continents in the years 1952 and 2007. Describe the notable observations and insights
reflected in the box plot.

Here are some key insights:

➢ Overall Increase: Life expectancy improved significantly from 1952 to 2007 across
all continents.
➢ Europe: It consistently shows the highest life expectancy levels, with relatively
smaller variation.
➢ Africa: The continent has the lowest life expectancy, particularly in 1952, with a
notable increase by 2007, though still with wide variability.
➢ Oceania: Exhibits the least variability and the highest life expectancy range,
especially in 2007.
➢ The outliers represent countries with life expectancy significantly deviating from
the median in their continent.

Practice math

1. Find the range of the following dataset: {16, 22, 18, 25, 30, 15, 28, 20, 12, 10}
2. Calculate the mean deviation from the mean for the dataset: {18, 22, 25, 30, 15, 28,
20, 24, 12, 10}
3. Determine the quartile deviation for the dataset: {32, 45, 50, 28, 35, 40, 38, 42, 48,
55}
4. Compute the variance, standard deviation, and CV for the population dataset: {8,
12, 15, 10, 14, 18, 20, 22, 25, 30}
5. Compute the variance, standard deviation, and CV for the sample data: {22, 18,
25, 20, 15, 28, 30, 35, 40, 45}
6. Determine the Pearson coefficient of skewness for the dataset: {28, 35, 40, 32, 38,
45, 50, 55, 60, 70}
7. Compute the first four central moments and, using them, compute the coefficients of
skewness and kurtosis for the dataset: {12, 14, 16, 18, 20, 22, 24, 26, 28, 30}
8. Create a box plot for the data: {25, 30, 35, 40, 45, 50, 55, 60, 200, 210} and identify
if there are any outliers.
9. Consider the following frequency distribution:

Class Interval Frequency


10-20 5
20-30 8
30-40 12
40-50 10
50-60 7

a) Find the range for the data.


b) Calculate the mean deviation from the mean for the data.
c) Compute the variance and standard deviation for the data.
d) Calculate the coefficient of variation for the data.
e) Determine the Pearson coefficient of skewness for the data.
10. Consider the following frequency distribution:

Class Interval Frequency


Less than 5 5
Less than 10 13
Less than 15 25
Less than 20 35
Less than 25 40

a) Calculate the mean deviation from the mean for the data.
b) Determine the quartile deviation for the data.
c) Compute the variance and standard deviation for the data.
d) Calculate the coefficient of variation for the data.
e) Determine the Pearson coefficient of skewness for the data.
