STA201 Lec 04
Measures of Dispersion
Sometimes, when two or more datasets are compared using measures of central tendency
(averages), they give the same result. Consider the runs scored by two batsmen in their
last ten matches:
Batsman B: 53, 46, 48, 50, 53, 53, 58, 60, 57, 52
The mean score of both batsmen A and B is the same, i.e., 53 runs. Can we say that the
two players perform equally? Clearly not, because the scores of batsman A are spread
from 0 to 117, whereas the scores of batsman B are spread only from 46 to 60.
Just as there are several ways of measuring the central tendency of a dataset (mean,
median, and mode), there are several ways of measuring and comparing the dispersion of
one or more distributions.
a) Range
b) Quartile Deviation
c) Mean Deviation (from mean or from median)
d) Standard Deviation
$$\text{Relative measure} = \frac{\text{Absolute measure}}{\text{Average}} \times 100\%$$
a) Coefficient of Range
b) Coefficient of Quartile Deviation
c) Coefficient of Mean Deviation
d) Coefficient of Variation (C.V)
The major difference between absolute and relative measures of dispersion is that an
absolute measure describes only the variability of a single dataset and carries the unit
of measurement of the data, whereas a relative measure is used to compare the variation
of two or more distributions and is unitless.
1.1 Range
The range is a measure of spread that represents the difference between the largest and
smallest values in a dataset. For raw data, it’s calculated by subtracting the smallest
value from the largest. In continuous grouped data, the range is the difference between
the upper limit of the highest class and the lower limit of the lowest class.
$R = X_U - X_L$ = upper limit of the highest class – lower limit of the lowest class.
The ages of 8 students in a classroom are recorded as follows: 15, 17, 16, 14, 19, 18, 15, 20
years. Find the range of the ages.
Interpretation: The range of 6 years shows the age spread of students, with 14 being the
youngest and 20 being the oldest in the group.
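As a quick check of the range for raw data, the calculation can be sketched in Python. The function name `data_range` is an illustrative choice, not part of the lecture.

```python
# Ages of the 8 students from the example, in years.
ages = [15, 17, 16, 14, 19, 18, 15, 20]

def data_range(values):
    """Range of raw data: largest value minus smallest value."""
    return max(values) - min(values)

age_range = data_range(ages)
print(age_range)  # 6
```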
You have been provided with a frequency distribution table that contains data on the
durations of 115 randomly selected programming tutorial videos uploaded by a YouTube
channel.
Interpretation: The range of 70 minutes shows the difference between the shortest and
longest tutorials, with the shortest in the 30–40-minute range and the longest in the 90–
100-minute range. This spread indicates a considerable variation in tutorial lengths.
The quartile deviation, also known as the semi-interquartile range, measures the spread
of the middle 50% of a dataset. It is calculated as half the difference between the first
quartile (𝑄1 ) and the third quartile (𝑄3 ). This measure gives insight into the dataset's
dispersion by considering only the central portion, minimizing the effect of extreme
values.
Quartile deviation, $QD = \frac{Q_3 - Q_1}{2}$; where $Q_3$ = third quartile and $Q_1$ = first quartile.
The ages of 8 students in a classroom are recorded as follows: 15, 17, 16, 14, 19, 18, 15, 20
years. Find the quartile deviation of the ages.
SOLUTION:
Step 1. Arrange data: 14, 15, 15, 16, 17, 18, 19, 20.
$Q_1$ (1st quartile): Here, $\frac{i \times n}{100} = \frac{25 \times 8}{100} = 2$
Therefore, $Q_1$ = average of the 2nd and 3rd values: $Q_1 = \frac{15 + 15}{2} = 15$.
$Q_3$ (3rd quartile): Here, $\frac{i \times n}{100} = \frac{75 \times 8}{100} = 6$
Therefore, $Q_3$ = average of the 6th and 7th values: $Q_3 = \frac{18 + 19}{2} = 18.5$.
$$Q.D. = \frac{Q_3 - Q_1}{2} = \frac{18.5 - 15}{2} = 1.75$$
Interpretation: The quartile deviation of 1.75 years shows that the ages of students vary
around the median by about 1.75 years, indicating moderate dispersion within the central
portion of this dataset.
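The positional rule used in this solution can be sketched in Python. The helper name `quartile` is hypothetical. The branch for a non-integer position (take the next ordered value) is an assumption that this example does not exercise; other textbooks and libraries interpolate instead, so results may differ from library functions.

```python
import math

def quartile(values, i):
    """Quartile via the positional rule: position = i * n / 100 (i = 25 or 75)."""
    data = sorted(values)
    n = len(data)
    pos = i * n / 100
    if pos == int(pos):
        k = int(pos)                        # 1-based position in the ordered data
        return (data[k - 1] + data[k]) / 2  # average of the k-th and (k+1)-th values
    # Assumption: for a non-integer position, take the next ordered value.
    return data[math.ceil(pos) - 1]

ages = [15, 17, 16, 14, 19, 18, 15, 20]
q1 = quartile(ages, 25)
q3 = quartile(ages, 75)
qd = (q3 - q1) / 2
print(q1, q3, qd)  # 15.0 18.5 1.75
```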
You have been provided with a frequency distribution table that contains data on the
durations of 115 randomly selected programming tutorial videos uploaded by a YouTube
channel.
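SOLUTION:
$Q_1$: Position $= \frac{i \times n}{4} = \frac{1 \times 115}{4} = 28.75$, which falls in the 50-60 interval.
$Q_3$: Position $= \frac{i \times n}{4} = \frac{3 \times 115}{4} = 86.25$, which falls in the 70-80 interval.
Using $Q_i = L_0 + \left( \frac{\frac{i \times n}{4} - F_{prev}}{f} \right) \times C$, where $L_0$ is the lower limit of the quartile class,
$F_{prev}$ is the cumulative frequency before that class, $f$ is its frequency, and $C$ is the class width:
$$Q_1 = 50 + \left( \frac{28.75 - 24}{20} \right) \times 10 = 52.375$$
$$Q_3 = 70 + \left( \frac{86.25 - 78}{22} \right) \times 10 = 73.75$$
$$QD = \frac{Q_3 - Q_1}{2} = \frac{73.75 - 52.375}{2} = 10.69$$
Interpretation: The quartile deviation of 10.69 minutes suggests that the middle 50% of
tutorial durations lie within about 10.69 minutes of the median duration, reflecting a
moderate spread within the core range of the dataset.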
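The interpolation formula for grouped data, $Q_i = L_0 + ((i \times n / 4 - F_{prev}) / f) \times C$, can be evaluated with a small helper. This is a minimal sketch: the function and parameter names are illustrative, and the inputs are the class limits, cumulative frequencies, and class frequencies that appear in the substitution above.

```python
def grouped_quartile(i, n, lower, cum_prev, freq, width):
    """Q_i = L0 + ((i*n/4 - F_prev) / f) * C for continuous grouped data."""
    position = i * n / 4
    return lower + (position - cum_prev) / freq * width

# Inputs taken from the substitution in the solution.
q1 = grouped_quartile(1, 115, lower=50, cum_prev=24, freq=20, width=10)
q3 = grouped_quartile(3, 115, lower=70, cum_prev=78, freq=22, width=10)
qd = (q3 - q1) / 2
print(f"Q1 = {q1:.3f}, Q3 = {q3:.3f}, QD = {qd:.2f}")
```

Running the helper is a convenient way to double-check hand arithmetic for each quartile.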
The mean deviation (also called the average absolute deviation) measures the average
distance of each data value from the mean or median of the dataset. It provides an insight
into the dispersion by showing how much, on average, values differ from the central point.
85 92 78 89 95 88 76 82 90 91.
SOLUTION:
We know the mean deviation from the mean, $MD(\bar{x})$, for raw data: $MD(\bar{x}) = \frac{\sum |x_i - \bar{x}|}{n}$
Step 1. Calculate the mean: $\bar{x} = \frac{\sum x_i}{n} = \frac{866}{10} = 86.6$
Step 2. Now calculate the mean deviation from the mean, $MD(\bar{x})$:
$$MD(\bar{x}) = \frac{\sum |x_i - \bar{x}|}{n}$$
$$= \frac{|85 - 86.6| + |92 - 86.6| + |78 - 86.6| + |89 - 86.6| + |95 - 86.6| + |88 - 86.6| + |76 - 86.6| + |82 - 86.6| + |90 - 86.6| + |91 - 86.6|}{10}$$
$$\therefore MD(\bar{x}) = \frac{50.8}{10} = 5.08$$
Interpretation: The mean deviation of 5.08 marks indicates that, on average, the scores
of students differ from the mean score by 5.08 marks.
We know the mean deviation from the median, $MD(Me) = \frac{\sum |x_i - \text{Median}|}{n}$
Arranged data: 76 78 82 85 88 89 90 91 92 95, so the median is $\frac{88 + 89}{2} = 88.5$.
$$MD(Me) = \frac{12.5 + 10.5 + 6.5 + 3.5 + 0.5 + 0.5 + 1.5 + 2.5 + 3.5 + 6.5}{10}$$
$$\text{or, } MD(Me) = \frac{48}{10} = 4.8$$
Interpretation: The mean deviation of 4.8 marks indicates that, on average, the scores of
students differ from the median score by 4.8 marks.
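Both mean deviations for the exam scores can be reproduced with one helper that takes the center as an argument. This is a minimal sketch; `mean_deviation` is an illustrative name.

```python
from statistics import mean, median

def mean_deviation(values, center):
    """Average absolute distance of the values from a chosen center."""
    return sum(abs(x - center) for x in values) / len(values)

scores = [85, 92, 78, 89, 95, 88, 76, 82, 90, 91]
md_mean = mean_deviation(scores, mean(scores))      # center = 86.6
md_median = mean_deviation(scores, median(scores))  # center = 88.5
print(round(md_mean, 2), round(md_median, 2))  # 5.08 4.8
```

The mean deviation from the median is never larger than the mean deviation from the mean, which matches the two results here.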
A frequency distribution table shows the monthly rainfall (in mm) recorded over 12
months:
SOLUTION:
Interpretation: The mean deviation of 9.58 mm shows that the monthly rainfall varies, on
average, by 9.58 mm from the mean rainfall of 72.5 mm.
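Variance: Variance measures the average of the squared differences between each data
point and the mean. In mathematical terms, the variance ($\sigma^2$ for a population, $s^2$ for a
sample) is calculated as:
Population variance (ungrouped data):
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N}$$
Population variance (grouped data):
$$\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{N} = \frac{\sum f_i x_i^2 - \frac{(\sum f_i x_i)^2}{N}}{N}$$
Sample variance (ungrouped data):
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n-1}$$
Sample variance (grouped data):
$$s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{n-1} = \frac{\sum f_i x_i^2 - \frac{(\sum f_i x_i)^2}{n}}{n-1}$$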
Variance provides a measure of how data points are scattered around the mean. A higher
variance indicates greater data dispersion.
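Standard Deviation: The standard deviation is the square root of the variance. Roughly
speaking, it measures the typical distance between a data point and the mean.
Mathematically, the standard deviation ($\sigma$ for a population, $s$ for a sample) is calculated as:
$$\sigma = \sqrt{\sigma^2}, \qquad s = \sqrt{s^2}$$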
Standard deviation is expressed in the same units as the data, making it more
interpretable than variance. It provides a measure of the spread of data and is often used
for comparing the spread of different datasets. A larger standard deviation indicates
greater data variability.
The ages (in years) of a small population of 6 employees in a company are: 25, 30, 35, 40,
45, and 50. Calculate the population variance and standard deviation.
SOLUTION:
∑𝑥𝑖 = 25 + 30 + 35 + 40 + 45 + 50 = 225
Interpretation: The standard deviation of 8.54 years indicates the average deviation from
the mean age in this population is 8.54 years.
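The population variance can be computed either from the definition or from the shortcut (sum of squares) form. The sketch below does both for the six ages, to show that the two forms agree; variable names are illustrative.

```python
import math

ages = [25, 30, 35, 40, 45, 50]
N = len(ages)

mu = sum(ages) / N                                  # population mean, 37.5
var_def = sum((x - mu) ** 2 for x in ages) / N      # definitional form
var_short = (sum(x * x for x in ages) - sum(ages) ** 2 / N) / N  # shortcut form
sigma = math.sqrt(var_def)

print(round(var_def, 2), round(var_short, 2), round(sigma, 2))  # 72.92 72.92 8.54
```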
85 92 78 89 95 88 76 82 90 91.
SOLUTION:
Interpretation: The standard deviation of 6.22 indicates the average deviation from the
mean exam score in this sample is 6.22 marks.
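For sample data, Python's standard `statistics` module already divides by $n - 1$ in `variance` and `stdev` (the population versions are `pvariance` and `pstdev`). A short check on the ten exam scores:

```python
from statistics import variance, stdev

scores = [85, 92, 78, 89, 95, 88, 76, 82, 90, 91]

s2 = variance(scores)  # sample variance, divides by n - 1
s = stdev(scores)      # sample standard deviation, sqrt of the above

print(round(s2, 2), round(s, 2))  # 38.71 6.22
```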
You have been provided with a frequency distribution table that contains data on the
durations of 115 randomly selected programming tutorial videos uploaded by a YouTube
channel.
SOLUTION:
Interpretation: The standard deviation of 15.52 minutes indicates that the durations of
the tutorial videos typically deviate from the mean duration by about 15.52 minutes.
The Coefficient of Variation (C.V.) was introduced by Karl Pearson. It is the most
commonly used relative measure of dispersion, and it is used to compare the variation,
or the consistency of performance, of two or more sets of data. It expresses the standard
deviation as a percentage of the mean, allowing comparison between datasets with
different units or widely varying means:
$$CV = \frac{\text{Standard deviation}}{\text{Mean}} \times 100\%$$
A larger C.V. indicates greater variability relative to the mean; a smaller C.V. indicates
greater consistency.
If $CV_A > CV_B$, dataset A exhibits greater variability than dataset B. Equivalently,
dataset B can be considered more consistent than dataset A.
The ages (in years) of a small population of 6 employees in a company are: 25, 30, 35, 40,
45, and 50. Calculate the population coefficient of variation (CV).
SOLUTION:
∑𝑥𝑖 = 25 + 30 + 35 + 40 + 45 + 50 = 225
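Step 2. Calculate the population mean and the population standard deviation:
Population mean, $\mu = \frac{\sum x_i}{N} = \frac{225}{6} = 37.5$
Population variance, $\sigma^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N} = \frac{8875 - \frac{(225)^2}{6}}{6} = \frac{437.5}{6} = 72.92$
Population standard deviation, $\sigma = \sqrt{72.92} = 8.54$
Therefore, the coefficient of variation, $CV = \frac{\sigma}{\mu} \times 100\% = \frac{8.54}{37.5} \times 100\% = 22.77\%$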
85 92 78 89 95 88 76 82 90 91.
SOLUTION:
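Step 2. Calculate the sample mean and the sample standard deviation:
Sample mean, $\bar{x} = \frac{\sum x_i}{n} = \frac{866}{10} = 86.6$
Sample variance, $s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n-1} = \frac{75344 - \frac{(866)^2}{10}}{10-1} = \frac{348.4}{9} = 38.71$
Sample standard deviation, $s = \sqrt{38.71} = 6.22$
Coefficient of Variation, $CV = \frac{s}{\bar{x}} \times 100\% = \frac{6.22}{86.6} \times 100\% = 7.18\%$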
Interpretation: The CV expresses the variability of test scores as a percentage of the mean,
which is helpful for comparing consistency. Here, a CV of 7.18% indicates low variability
relative to the mean, that is, the test scores are fairly consistent.
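The CV calculation for both the population and the sample case can be wrapped in one function. This is a sketch under the assumption that a `population` flag is a convenient way to switch between $\sigma / \mu$ and $s / \bar{x}$; the function name is illustrative.

```python
from statistics import mean, pstdev, stdev

def coefficient_of_variation(values, population=False):
    """Standard deviation as a percentage of the mean."""
    sd = pstdev(values) if population else stdev(values)
    return sd / mean(values) * 100

ages = [25, 30, 35, 40, 45, 50]                    # treated as a population
scores = [85, 92, 78, 89, 95, 88, 76, 82, 90, 91]  # treated as a sample

cv_ages = coefficient_of_variation(ages, population=True)
cv_scores = coefficient_of_variation(scores)
print(f"{cv_ages:.2f}% {cv_scores:.2f}%")  # 22.77% 7.18%
```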
You have been provided with a frequency distribution table that contains data on the
durations of 115 randomly selected programming tutorial videos uploaded by a YouTube
channel.
Therefore, the coefficient of variation, $CV = \frac{s}{\bar{x}} \times 100\% = \frac{15.52}{63.26} \times 100\% = 24.53\%$
Interpretation: A CV of 24.53% suggests moderate variability: the standard deviation of
the tutorial durations is about 24.53% of the mean duration.
EXAMPLE 13. Suppose you have a dataset of five values representing the monthly
returns of a stock over the past five months: 2%, 3%, -1%, 5%, and -2%.
Sample mean, $\bar{x} = \frac{\sum x_i}{n} = \frac{7}{5} = 1.4\%$
Sample variance, $s^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n-1} = \frac{43 - \frac{(7)^2}{5}}{5-1} = \frac{33.2}{4} = 8.3$
Sample standard deviation, $s = \sqrt{8.3} = 2.88\%$
Therefore, the coefficient of variation, $CV = \frac{s}{\bar{x}} \times 100\% = \frac{2.88\%}{1.4\%} \times 100\% \approx 205.71\%$
(The unit of the original data, which is itself a percentage, cancels in the ratio; the
final percentage is the coefficient of variation expressed as a percentage.)
Interpretation: So, in this example, the standard deviation is approximately 2.88%, and
the coefficient of variation is approximately 205.71%. This provides a measure of the
stock's risk (volatility) relative to its return. A higher coefficient of variation indicates
higher risk compared to the mean return.
Student A: 90 85 78 92 88
Student B: 85 97 58 98 44
Which student exhibits greater variability in test scores?
Sample mean for B, $\bar{x}_B = \frac{\sum x_i}{n} = \frac{85 + 97 + 58 + 98 + 44}{5} = \frac{382}{5} = 76.4$
Sample variance for B, $s_B^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n-1} = \frac{31538 - \frac{(382)^2}{5}}{5-1} = \frac{2353.2}{4} = 588.3$
CV for A, $CV_A = \frac{s_A}{\bar{x}_A} \times 100\% = \frac{5.46}{86.6} \times 100\% = 6.30\%$
CV for B, $CV_B = \frac{s_B}{\bar{x}_B} \times 100\% = \frac{24.25}{76.4} \times 100\% = 31.74\%$
Interpretation: In this example, the coefficient of variation for student A is much lower
than that for student B. This indicates that the test scores of student A have less
variability (more consistency) than those of student B.
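The comparison of the two students can be sketched in Python by computing the sample CV for each and picking the smaller one. The helper name `cv` is illustrative, and small differences in the last decimal place relative to a hand calculation come from rounding the standard deviation before dividing.

```python
from statistics import mean, stdev

def cv(values):
    """Sample coefficient of variation, in percent."""
    return stdev(values) / mean(values) * 100

student_a = [90, 85, 78, 92, 88]
student_b = [85, 97, 58, 98, 44]

cv_a, cv_b = cv(student_a), cv(student_b)
more_consistent = "A" if cv_a < cv_b else "B"
print(f"CV_A = {cv_a:.2f}%, CV_B = {cv_b:.2f}%, more consistent: {more_consistent}")
```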
EXAMPLE 15. A sample of the rental rates at University Park Apartments approximates
a symmetrical, bell-shaped distribution. The sample mean is $500; the standard deviation
is $20. Using the Empirical Rule, answer these questions:
a) About 68% of the monthly rentals are between what two amounts?
b) About 95% of the monthly rentals are between what two amounts?
c) Almost all of the monthly rentals are between what two amounts?
SOLUTION:
a) About 68% are between $480 and $520, found by 𝑥̅ ± 1𝑠 = $500 ± 1($20).
b) About 95% are between $460 and $540, found by 𝑥̅ ± 2𝑠 = $500 ± 2($20).
c) Almost all (99.7%) are between $440 and $560, found by 𝑥̅ ± 3𝑠 = $500 ± 3($20)
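The three Empirical Rule intervals follow the same pattern, mean ± k standard deviations for k = 1, 2, 3, so they can be generated in one pass. A minimal sketch with an illustrative function name:

```python
def empirical_intervals(mean, sd):
    """Intervals covering about 68%, 95%, and 99.7% of a bell-shaped distribution."""
    return {pct: (mean - k * sd, mean + k * sd)
            for k, pct in [(1, 68), (2, 95), (3, 99.7)]}

intervals = empirical_intervals(500, 20)
for pct, (low, high) in intervals.items():
    print(f"About {pct}% of rentals lie between ${low} and ${high}")
```

The rule applies only when the distribution is approximately symmetrical and bell-shaped, as stated in the example.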
CHEBYSHEV’S THEOREM
For any set of observations (sample or population), the proportion of the values that lie
within $k$ standard deviations of the mean is at least $1 - \frac{1}{k^2}$, where $k$ is any value greater
than 1.
EXAMPLE 16. Dupree Paint Company employees contribute a mean of $51.54 to the
company’s profit-sharing plan every two weeks. The standard deviation of biweekly
contributions is $7.51. At least what percent of the contributions lie within plus 3.5
standard deviations and minus 3.5 standard deviations of the mean, that is between
$25.26 and $77.83?
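Chebyshev's bound for this example can be evaluated directly. The sketch below applies $1 - 1/k^2$ with $k = 3.5$, which gives about 91.8%; the function name is illustrative.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

mean, sd, k = 51.54, 7.51, 3.5
bound = chebyshev_lower_bound(k)
low, high = mean - k * sd, mean + k * sd

print(f"At least {bound:.1%} of contributions lie between about {low:.3f} and {high:.3f} dollars")
```

Unlike the Empirical Rule, this bound holds for any distribution shape, which is why it is a lower bound rather than an approximation.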
A frequency distribution is said to be skewed if the frequencies are not equally distributed
on both sides of the central value. A skewed distribution may be positively skewed or
negatively skewed.
(a) Left / negatively skewed: Mean < Median < Mode, SK < 0
(b) Symmetric / normal / not skewed: Mean = Median = Mode, SK = 0
(c) Right / positively skewed: Mean > Median > Mode, SK > 0
KURTOSIS:
Kurtosis is another measure of the shape of a frequency curve. It is a Greek word, which
means bulginess. While skewness signifies the extent of asymmetry, kurtosis measures
the degree of peaked-ness of a frequency distribution.
Karl Pearson classified curves into three types on the basis of the shape of their peaks:
mesokurtic, leptokurtic, and platykurtic. These three types of curves are shown in the
figure below:
[Figure: leptokurtic ($\beta_2 > 3$), mesokurtic ($\beta_2 = 3$), and platykurtic ($\beta_2 < 3$) curves]
MEASURES OF SKEWNESS:
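Comparing the mean, median, and mode gives an idea about the shape of the distribution,
but there are also various numerical measures of skewness. Three important measures of
skewness are:
Pearson's coefficient of skewness:
$$SK_p = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$$
$$SK_p = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$$
Therefore, the empirical relation among mean, median, and mode is:
$$\text{Mean} - \text{Mode} = 3(\text{Mean} - \text{Median})$$
So, the coefficient of skewness for the population, $\sqrt{\beta_1} = \frac{\mu_3}{(\sqrt{\mu_2})^3}$
And for the sample, $\sqrt{b_1} = \frac{n}{(n-1)(n-2)} \times \frac{\sum (x_i - \bar{x})^3}{\left( \frac{\sum (x_i - \bar{x})^2}{n-1} \right)^{3/2}}$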
MEASURES OF KURTOSIS
Coefficient of kurtosis for the population, $\beta_2 = \frac{\mu_4}{(\mu_2)^2}$
In the case of a normal distribution, that is, a mesokurtic curve, the value of the
coefficient of kurtosis is 3. If it is greater than 3, the curve is called a leptokurtic curve
and is more peaked than the normal curve. If it is less than 3, the curve is called a
platykurtic curve and is less peaked than the normal curve.
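The population moment coefficients $\sqrt{\beta_1} = \mu_3 / (\sqrt{\mu_2})^3$ and $\beta_2 = \mu_4 / \mu_2^2$ can be sketched in a few lines of Python. The function names and the small dataset `[1, 2, 3, 4, 5]` are illustrative choices, not taken from the lecture; the sample-adjusted skewness formula is not implemented here.

```python
def central_moment(values, r):
    """r-th central moment: average of (x - mean)^r over all values."""
    m = sum(values) / len(values)
    return sum((x - m) ** r for x in values) / len(values)

def moment_skewness(values):
    """sqrt(beta_1) = mu_3 / (sqrt(mu_2))^3."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def moment_kurtosis(values):
    """beta_2 = mu_4 / mu_2^2."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

data = [1, 2, 3, 4, 5]          # symmetric illustrative data
skew = moment_skewness(data)    # 0 for symmetric data
kurt = moment_kurtosis(data)    # 6.8 / 4 = 1.7, below 3, so platykurtic
print(skew, round(kurt, 2))
```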
Given a dataset: 10, 11, 15, 20, 20, 36, 48, 50, 52. Check the shape of the distribution.
SOLUTION: First, find the mean, median, mode, and standard deviation.
Here, Mean, $\bar{x} = \frac{262}{9} = 29.11$, Median = 20, Mode = 20, and Standard Deviation = 17.40
Using Pearson's coefficient of skewness, $SK_p = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} = \frac{29.11 - 20}{17.40} = 0.52$
Since $SK_p > 0$, the distribution is positively skewed (right skewed).
Here, first quartile 𝑄1 = 15, second quartile 𝑄2 = 20, and the third quartile 𝑄3 = 48
Since 𝑆𝐾𝐵 > 0, therefore, the distribution is positively skewed/ right skewed.
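Both skewness checks for this dataset can be reproduced numerically. The sketch below assumes that $SK_B$ denotes Bowley's quartile coefficient, $(Q_3 + Q_1 - 2Q_2) / (Q_3 - Q_1)$, which is the usual quartile-based measure; that formula is an assumption here, since it does not appear in the text above. The quartile values are taken as stated in the example.

```python
from statistics import mean, mode, stdev

data = [10, 11, 15, 20, 20, 36, 48, 50, 52]

# Pearson's coefficient: (mean - mode) / sample standard deviation.
sk_p = (mean(data) - mode(data)) / stdev(data)

# Assumed Bowley coefficient, using the quartiles stated in the example.
q1, q2, q3 = 15, 20, 48
sk_b = (q3 + q1 - 2 * q2) / (q3 - q1)

print(round(sk_p, 2), round(sk_b, 2))  # 0.52 0.7
```

Both coefficients are positive, which agrees with the conclusion that the distribution is right skewed.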
You have been provided with a frequency distribution table that contains data on the
durations of 115 randomly selected programming tutorial videos uploaded by a YouTube
channel.
Are there any deviations from symmetry in the distribution of the programming
tutorial video durations? Check the shape of the distribution.
SOLUTION: First, find the mean, median, mode, and standard deviation.
Here, Mean, $\bar{x} = \frac{7275}{115} = 63.26$, Median = 63.97, Mode = 65.38, and SD = 15.52 [see the
previous examples]
$$SK_p = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}} = \frac{63.26 - 65.38}{15.52} = -0.14$$
Since $SK_p < 0$, the distribution is negatively skewed (left skewed).
❖ Minimum
❖ 1st quartile
❖ Median/ 2nd quartile
❖ 3rd quartile
❖ Maximum
OUTLIER
An outlier is a data point that differs significantly from other observations. An outlier
may be due to a variability in the measurement, an indication of novel data, or it may be
the result of experimental error; the latter are sometimes excluded from the data set.
➢ 3 IQR Rule:
Any data point that falls outside the interval (𝑄1 − 3 × 𝐼𝑄𝑅, 𝑄3 + 3 × 𝐼𝑄𝑅) is
considered an extreme outlier.
Handling outliers: How outliers are handled depends on the specific goals of the analysis
and the nature of the data. It is common to remove outliers to reduce their impact on
statistical analyses. However, removing too many outliers can lead to a loss of information
and potentially introduce bias. In such cases, statisticians use methods that are less
sensitive to outliers, known as “Robust Statistics”.
Consider the first 9 Commodore prices (in ’000): 6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0.
Draw a boxplot and identify any potential outliers in the dataset.
SOLUTION:
Arrange these in order of magnitude: 3.8, 5.8, 5.99, 6.0, 6.7, 7.0, 9.975, 10.5, 20.0
Similarly, extreme outlier interval = (5.99 - 3 × 3.985, 9.975 + 3 × 3.985) = (-5.965, 21.93)
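The outlier intervals for this dataset can be checked with a short sketch. It takes $Q_1 = 5.99$ and $Q_3 = 9.975$ (the 3rd and 7th ordered values, matching the interval above) and assumes the usual 1.5 × IQR rule for ordinary outliers alongside the 3 × IQR rule for extreme ones. The function name `fences` is illustrative.

```python
prices = sorted([6.0, 6.7, 3.8, 7.0, 5.8, 9.975, 10.5, 5.99, 20.0])

q1, q3 = prices[2], prices[6]   # 3rd and 7th ordered values: 5.99 and 9.975
iqr = q3 - q1

def fences(q1, q3, k):
    """Interval (Q1 - k*IQR, Q3 + k*IQR); points outside it are flagged."""
    width = k * (q3 - q1)
    return q1 - width, q3 + width

inner = fences(q1, q3, 1.5)     # assumed 1.5 IQR rule for outliers
outer = fences(q1, q3, 3)       # 3 IQR rule for extreme outliers

outliers = [x for x in prices if x < inner[0] or x > inner[1]]
extreme = [x for x in prices if x < outer[0] or x > outer[1]]
print(outliers, extreme)  # [20.0] []
```

Under these assumptions the price 20.0 is an outlier but not an extreme outlier, since it lies inside the 3 × IQR interval.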
Given a dataset: 15, 20, 24, 29, 37, 40, 44, 48, 120. Identify any outliers or extreme outliers.
SOLUTION:
Here, 𝑀𝑖𝑛 = 15, 𝑄1 = 24, 𝑀𝑒𝑑𝑖𝑎𝑛 (𝑄2 ) = 37, 𝑄3 = 44, 𝑀𝑎𝑥 = 120
Inter-quartile range, $IQR = Q_3 - Q_1 = 44 - 24 = 20$
The monthly sales volumes (in units) of two teams, Team X and Team Y, over the last
nine months are given below:
SOLUTION:
a) Five-Number Summary:
b) Outlier Interval:
➢ For Team X,
Inter-quartile-range, 𝐼𝑄𝑅 = 𝑄3 − 𝑄1 = 48 − 42 = 6
Interpretation: There are no data points outside the expected range, so no outliers.
➢ For Team Y,
Inter-quartile-range, 𝐼𝑄𝑅 = 𝑄3 − 𝑄1 = 42 − 34 = 8
Interpretation: Since the data point 55 falls outside the expected range, 55 is an
outlier.
c) Skewness:
The distributions of both 𝑇𝑒𝑎𝑚 𝑋 and 𝑇𝑒𝑎𝑚 𝑌 are fairly symmetric, as the distances
between 𝑄1 and 𝑄2, and between 𝑄2 and 𝑄3, are equal.
➢ Overall Increase: Life expectancy improved significantly from 1952 to 2007 across
all continents.
➢ Europe: It consistently shows the highest life expectancy levels, with relatively
smaller variation.
➢ Africa: The continent has the lowest life expectancy, particularly in 1952, with a
notable increase by 2007, though still with wide variability.
➢ Oceania: Exhibits the least variability and the highest life expectancy range,
especially in 2007.
➢ The outliers represent countries with life expectancy significantly deviating from
the median in their continent.
Practice math
1. Find the range of the following dataset: {16, 22, 18, 25, 30, 15, 28, 20, 12, 10}
2. Calculate the mean deviation from the mean for the dataset: {18, 22, 25, 30, 15, 28,
20, 24, 12, 10}
3. Determine the quartile deviation for the dataset: {32, 45, 50, 28, 35, 40, 38, 42, 48,
55}
4. Compute the variance, standard deviation, and CV for the population dataset: {8,
12, 15, 10, 14, 18, 20, 22, 25, 30}
5. Compute the variance, standard deviation, and CV for the sample data: {22, 18,
25, 20, 15, 28, 30, 35, 40, 45}
6. Determine the Pearson coefficient of skewness for the dataset: {28, 35, 40, 32, 38,
45, 50, 55, 60, 70}
7. Compute the first four central moments and, using these moments, compute the coefficients
of skewness and kurtosis for the dataset: {12, 14, 16, 18, 20, 22, 24, 26, 28, 30}
8. Create a box plot for the data: {25, 30, 35, 40, 45, 50, 55, 60, 200, 210} and identify
if there are any outliers.
9. Consider the following frequency distribution:
a) Calculate the mean deviation from the mean for the data.
b) Determine the quartile deviation for the data.
c) Compute the variance and standard deviation for the data.
d) Calculate the coefficient of variation for the data.
e) Determine the Pearson coefficient of skewness for the data.