Wa0009.
Examples:
Age (21, 35, 45)
Height (5.6 ft, 170 cm)
Test scores (85, 92, 78)
Number of students in a class (30, 45)
➢ Ungrouped Data
Definition: Raw data that has not been organized into groups or intervals.
➢ Characteristics:
Exact values are available.
Easy to calculate measures like mean, median, mode for small data.
Difficult to interpret visually when data is large.
➢ Grouped Data
Definition: Data that has been organized into classes or intervals.
Form: Data is arranged in a frequency distribution table.
Best used when the dataset is large, making ungrouped data hard to interpret.
Example (same test scores grouped into intervals):
Score Range Frequency
45-47 3
48-50 5
51-53 2
➢ Characteristics:
Data is summarized and easier to interpret.
Helps in constructing histograms and other charts.
Exact values are lost, only intervals and frequencies are used.
Used to estimate central tendency and dispersion.
➢ Frequency Distributions:
A frequency distribution represents how frequently each value of a variable appears in
a dataset. It shows the number of occurrences of each possible value within
the dataset.
A frequency distribution table has two columns: one lists the data, either as
class intervals or as individual values, and the other gives the frequency of
each interval or value.
Grouped frequency distribution:
Test Score Frequency
0-20 6
21-40 12
41-60 22
61-80 15
81-100 5
Ungrouped frequency distribution:
Test Score Frequency
45 1
47 1
48 2
49 3
50 2
➢ Ungrouped Frequency Distribution for Ungrouped Data
Value Frequency
10 4
15 3
20 2
25 3
30 2
➢ Grouped Frequency Distribution for Ungrouped Data
Observations are divided between different intervals known as
class intervals and then their frequencies are counted for each class
interval. This Frequency Distribution is used mostly when the data set
is very large.
CONSTRUCTING FREQUENCY DISTRIBUTIONS
1. Find the range, that is, the difference between the largest and smallest observations.
2. Find the class interval required to span the range by dividing the range by the desired
number of classes (ordinarily 10).
4. Determine where the lowest class should begin. (Ordinarily, this number should be a
multiple of the class interval.)
5. Determine where the lowest class should end by adding the class interval to the lower
boundary and then subtracting one unit of measurement.
6. Working upward, list as many equivalent classes as are required to include the largest
observation.
For example, list 130–139, 140–149, . . . , 240–249
8. Replace the tally count for each class with a number—the frequency (f )—and show the
total of all frequencies. (Tally marks are not usually shown in the final frequency
distribution.)
9. Supply headings for both columns and a title for the table.
Example 1:
Make the Frequency Distribution Table for the ungrouped data given as follows:
23, 27, 21, 14, 43, 37, 38, 41, 55, 11, 35, 15, 21, 24, 57, 35, 29, 10, 39, 42, 27, 17, 45,
52, 31, 36, 39, 38, 43, 46, 32, 37, 25
50 – 59 3
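The tabulation for Example 1 can be sketched in Python. This is a minimal illustration, not the original worked solution: the class width of 10 and the starting point of 10 are assumptions chosen to agree with the surviving row (50 – 59, frequency 3), and `grouped_frequency` is a hypothetical helper name.

```python
# Example 1 data from the text.
data = [23, 27, 21, 14, 43, 37, 38, 41, 55, 11, 35, 15, 21, 24, 57, 35, 29,
        10, 39, 42, 27, 17, 45, 52, 31, 36, 39, 38, 43, 46, 32, 37, 25]

def grouped_frequency(values, start, width):
    """Count observations in each class [lower, lower + width - 1]."""
    table = {}
    lower = start
    while lower <= max(values):
        upper = lower + width - 1  # subtract one unit of measurement
        table[(lower, upper)] = sum(lower <= v <= upper for v in values)
        lower += width
    return table

# Assumed classes: 10-19, 20-29, ..., 50-59.
freq_table = grouped_frequency(data, start=10, width=10)
for (lower, upper), f in freq_table.items():
    print(f"{lower} - {upper}  {f}")
print("Total", sum(freq_table.values()))
```

The total of the frequency column should equal the number of observations (33 here), which is a quick check that no value was missed.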
Ex.2
Consider a data set of 26 children of ages 1-6 years
2,2,1,3,3,3,6,6,2,1,1,1,1,3,3,3,5,5,4,4,4,5,5,4,4,3
Solution:
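As a sketch, the ungrouped frequency distribution for these 26 ages can be tabulated with `collections.Counter` from the Python standard library. The variable names are illustrative only.

```python
from collections import Counter

# Ages of the 26 children from Ex.2.
ages = [2, 2, 1, 3, 3, 3, 6, 6, 2, 1, 1, 1, 1, 3, 3, 3,
        5, 5, 4, 4, 4, 5, 5, 4, 4, 3]

# Counter maps each distinct value to the number of times it occurs.
age_counts = Counter(ages)

print("Age Frequency")
for age in sorted(age_counts):
    print(age, age_counts[age])
print("Total", sum(age_counts.values()))
```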
Relative Frequency Distribution
This distribution displays the proportion or percentage of observations in each
interval or class.
It is useful for comparing different data sets or for analyzing the distribution of data
within a set.
Frequency 5 10 20 10 5
Solution:
To create the relative frequency distribution table, we divide the frequency of each class interval
by the total frequency. The relative frequency distribution table is then given as follows:
Score Range Frequency Relative Frequency
0-20 5 5/50 = 0.10
Total 50 1.00
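The calculation can be sketched as follows, using only the frequency row shown above (5, 10, 20, 10, 5). Each relative frequency is the class frequency divided by the total, so the column must sum to 1.

```python
# Frequencies of the five classes, as given in the text.
frequencies = [5, 10, 20, 10, 5]

total = sum(frequencies)
relative = [f / total for f in frequencies]

for f, r in zip(frequencies, relative):
    print(f"{f}/{total} = {r:.2f}")
print("Total", total, f"{sum(relative):.2f}")
```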
Cumulative Frequency Distribution:
It is defined as the sum of all the frequencies in the previous values or intervals up to the
current one.
The distributions which represent the frequency distributions using cumulative frequencies
are called cumulative frequency distributions.
•More than Type: We sum all the frequencies after the current interval.
Example:
The table below gives the values of runs scored by Virat Kohli in the last 25 T-20
matches. Represent the data in the form of less-than-type cumulative frequency
distribution:
45 34 50 75 22
56 63 70 49 33
0 8 14 39 86
92 88 70 56 50
57 45 42 12 39
Since there are many distinct values, we express the data as a grouped
distribution with intervals such as 0-10, 10-20 and so on. First, let us represent the data
in the form of a grouped frequency distribution.
Runs Frequency
0-10 2
10-20 2
20-30 1
30-40 4
40-50 4
50-60 5
60-70 1
70-80 3
80-90 2
90-100 1
Less-than type:
Runs scored by Virat Kohli Cumulative Frequency
Less than 10 2
More-than type:
Runs scored by Virat Kohli Cumulative Frequency
More than 0 25
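Both cumulative columns can be generated from the grouped frequencies above. The sketch below assumes the less-than value for a class counts every observation below its upper limit, and the more-than value counts every observation at or above its lower limit.

```python
from itertools import accumulate

# Grouped frequency distribution of the 25 scores (from the table above).
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50),
           (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
frequencies = [2, 2, 1, 4, 4, 5, 1, 3, 2, 1]

total = sum(frequencies)

# Less-than type: running total up to and including the current class.
less_than = list(accumulate(frequencies))

# More-than type: total minus everything below the current class.
more_than = [total - below for below in [0] + less_than[:-1]]

for (lower, upper), lt, mt in zip(classes, less_than, more_than):
    print(f"Less than {upper}: {lt}   More than {lower}: {mt}")
```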
Sr. No. 1 2 3 4 5 6 7 8 9 10
Half-yearly bonus x (in Rs) 150 200 300 650 250 180 400 500 550 220
Find out the arithmetic mean.
Solution:
Here N = 10 and ΣX = 3400.
$$\bar{X} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{N} = \frac{\sum X}{N} = \frac{3400}{10} = 340$$
Example 3. Calculate the mean of the following frequency distribution of marks in a test in statistics:
Marks 10 20 30 40 50 60 70 80
No. of students 3 6 10 12 9 6 2 2
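For a discrete frequency distribution like this one, the mean is Σfx / Σf. A minimal sketch of that calculation for the Example 3 table (variable names are illustrative):

```python
# Example 3: marks and the number of students obtaining each mark.
marks = [10, 20, 30, 40, 50, 60, 70, 80]
students = [3, 6, 10, 12, 9, 6, 2, 2]

n = sum(students)                                      # N = sum of f
sum_fx = sum(f * x for f, x in zip(students, marks))   # sum of f * x
mean_marks = sum_fx / n

print("N =", n, " sum fx =", sum_fx, " mean =", round(mean_marks, 2))
```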
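Marks Mid values (x_i) No. of students f(x_i) f(x_i) x_i
0-10 5 5 25
10-20 15 10 150
20-30 25 40 1000
30-40 35 20 700
40-50 45 25 1125
N = Σf = 100, Σ f(x_i) x_i = 3000
$$\bar{X} = \frac{\sum f(x_i)\, x_i}{N} = \frac{3000}{100} = 30$$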
Example 5. For the following data , calculate arithmetic mean:
Marks No. of students
Less than 10 5
Less than 20 17
Less than 30 31
Less than 40 41
Less than 50 49
Solution: A cumulative frequency distribution should first be converted into a simple frequency distribution
Marks (x) Number of students (f)
0-10 5
10-20 17-5= 12
20-30 31-17=14
30-40 41-31=10
40-50 49-41=8
Now the mean of the data is obtained by the direct method as under:
Marks Mid values (x_i) Number of students f(x_i) f(x_i) x_i
0-10 5 5 25
10-20 15 12 180
20-30 25 14 350
30-40 35 10 350
40-50 45 8 360
N = Σf = 49, Σ f(x_i) x_i = 1265
$$\bar{X} = \frac{\sum f(x_i)\, x_i}{N} = \frac{1265}{49} = 25.82 \text{ marks}$$
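The two stages of Example 5 (de-cumulating the "less than" table, then applying the direct method) can be sketched as below. The class width of 10 is read off the table; the variable names are illustrative.

```python
# "Less than" cumulative distribution from Example 5.
upper_limits = [10, 20, 30, 40, 50]
cumulative = [5, 17, 31, 41, 49]
width = 10  # every class is 10 marks wide

# Simple frequency = this cumulative value minus the previous one.
frequencies = [c - prev for c, prev in zip(cumulative, [0] + cumulative[:-1])]

# The midpoint of each class is its upper limit minus half the width.
midpoints = [u - width / 2 for u in upper_limits]

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
grouped_mean = sum_fx / n

print(frequencies, n, sum_fx, round(grouped_mean, 2))
```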
Example 6 :
Where:
• 𝒇𝒊 = frequency of each class
• 𝒙𝒊 = midpoint of each class
• Σ𝒇𝒊 = total frequency
• The value of the middle item of a series when it is arranged in ascending or descending order of
magnitude.
• It is the value that divides the series into two equal parts: one part consists of the
values equal to or smaller than the median, and the other part consists of the values equal to or
larger than it.
• Unlike mean, median is the positional average. The position here means the place of value in the
series.
• Median as such is the positional average of the data and has a position more or less at the centre of
the values.
For ungrouped data/discrete series:
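First, arrange the data in ascending order.
(i) If n is odd:
$$\text{Median} = \left(\frac{n+1}{2}\right)\text{th observation}$$
(ii) If n is even:
$$\text{Median} = \frac{\left(\frac{n}{2}\right)\text{th observation} + \left(\frac{n}{2}+1\right)\text{th observation}}{2}$$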
Median for grouped data/continuous series
For grouped data :
Step 1: Construct the cumulative frequency distribution.
Step 2: Find the median class. The median class is the class in which the
value of $\frac{N}{2}$ falls in the cumulative frequency distribution.
Step 3: Find the median by using the following formula.
$$\text{Median} = L + \frac{\frac{N}{2} - c.f.}{f} \times h$$
So, the cumulative frequency just before or at 25 will help us find the median class.
Now from the cumulative frequency (CF) column, we see that $\frac{N}{2} = 25$ falls in the class 20 - 30
(since CF for this class is 25). Therefore, 20 - 30 is the median class.
Step 3: Use the median formula:
$$\text{Median} = L + \frac{\frac{N}{2} - c.f.}{f} \times h$$
Where:
• L =20 (lower boundary of the median class)
• N=50 (total frequency)
• c.f=13 (cumulative frequency of the class before the median class)
• 𝑓=12 (frequency of the median class)
• h= 10 (class width)
Step 4: Apply the formula.
$$\text{Median} = 20 + \frac{25 - 13}{12} \times 10 = 20 + 10 = 30$$
Thus, the median is 30.
Example 4. The consumption of printing paper reams (in units) for the first 11 months of a computer
operator is given as
20, 25, 30, 15, 17, 35, 26, 18, 40, 45, 50
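Solution. Arranging the data in ascending order, we get the series
15, 17, 18, 20, 25, 26, 30, 35, 40, 45, 50
Hence, the required median (M) = value of the $\left(\frac{11+1}{2}\right)$th observation
= value of the 6th observation
= 26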
Example 5. Calculate the median of the following data that relates to the monthly salaries of employees (in
thousand rupees):
110, 115, 108, 112, 120, 116, 140, 135, 128, 132
Solution. By arranging the data in ascending order, we get the series 108, 110, 112, 115, 116, 120, 128,
132, 135, 140
Since n = 10 is even,
$$\text{Median} = \frac{5\text{th value} + 6\text{th value}}{2}$$
Hence, the required median
$$M = \frac{116 + 120}{2} = 118$$
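Both ungrouped examples can be checked with `statistics.median` from the Python standard library, which sorts the data and applies the odd/even rule described earlier. This is only a verification sketch.

```python
import statistics

# Example 4: paper consumption over 11 months (n is odd).
paper_reams = [20, 25, 30, 15, 17, 35, 26, 18, 40, 45, 50]

# Example 5: monthly salaries in thousand rupees (n is even).
salaries = [110, 115, 108, 112, 120, 116, 140, 135, 128, 132]

median_reams = statistics.median(paper_reams)   # 6th value of the sorted data
median_salary = statistics.median(salaries)     # mean of the 5th and 6th values

print(median_reams, median_salary)
```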
Size 5 5½ 6 6½ 7 7½ 8 8½ 9 9½ 10 10½ 11 11½
No. of pairs 30 40 50 150 300 600 950 820 750 440 250 150 40 39
Size (x) No. of pairs (f) Cumulative frequency (c.f.)
5 30 30
5½ 40 70
6 50 120
6½ 150 270
7 300 570
7½ 600 1170
8 950 2120
8½ 820 2940
9 750 3690
9½ 440 4130
10 250 4380
10½ 150 4530
11 40 4570
11½ 39 4609
N = Σf = 4609
$$\text{Median} = \left(\frac{N+1}{2}\right)\text{th value} = \left(\frac{4609+1}{2}\right)\text{th value} = 2305\text{th value}$$
The median therefore corresponds to the 2305th value in the series. This value first falls within the
cumulative frequency 2940, so the median is the size corresponding to that cumulative
frequency, which is 8½.
Hence, the median size of shoes sold is 8½.
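For a discrete series like this, the lookup "first cumulative frequency that reaches the (N+1)/2-th value" can be sketched with `itertools.accumulate` and `bisect.bisect_left`. Half sizes are written as decimals (8.5 for 8½).

```python
from bisect import bisect_left
from itertools import accumulate

# Shoe sizes and number of pairs sold.
sizes = [5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5]
pairs = [30, 40, 50, 150, 300, 600, 950, 820, 750, 440, 250, 150, 40, 39]

cumulative = list(accumulate(pairs))
n = cumulative[-1]

# Position of the median item in the ordered series.
position = (n + 1) / 2

# Index of the first cumulative frequency that is >= position.
index = bisect_left(cumulative, position)
median_size = sizes[index]

print(n, position, cumulative[index], median_size)
```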
Example 7. An insurance company obtained the following data for accident claims from a particular
region. Obtain the median from this data.
Amount of claim Frequency (f)
1-3 6
3-5 53
5-7 85
7-9 56
9-11 21
11-13 16
13-15 4
15-17 4
Amount of claim Frequency (f) Cumulative frequency (c.f.)
1-3 6 6
3-5 53 59
5-7 85 144
7-9 56 200
9-11 21 221
11-13 16 237
13-15 4 241
15-17 4 245
N = Σf = 245
$\frac{N}{2} = \frac{245}{2} = 122.5$, which falls in the class 5-7 (see the row with cumulative frequency 144, which contains 122.5). Hence, the
median class is 5-7.
L= Lower limit of the median class = 5
f= frequency of the median class = 85
c.f. = cumulative frequency of the class, preceding the median class = 59
h=width of the class interval of median class = 2
$$\text{Median} = L + \frac{\frac{N}{2} - c.f.}{f} \times h$$
$$= 5 + \frac{\frac{245}{2} - 59}{85} \times 2 = 5 + \frac{63.5}{85} \times 2 = 5 + \frac{127}{85} = 5 + 1.49 = 6.49$$
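The three steps for grouped data can be sketched as a small function. `grouped_median` is a hypothetical helper, and it assumes equal class widths and that the median class is the first class whose cumulative frequency reaches N/2, as in the worked examples.

```python
def grouped_median(lower_limits, width, frequencies):
    """Median = L + ((N/2 - c.f.) / f) * h for equal-width classes."""
    half = sum(frequencies) / 2
    cf_before = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, frequencies):
        if cf_before + f >= half:  # N/2 falls in this class
            return lower + (half - cf_before) / f * width
        cf_before += f
    raise ValueError("frequencies must not be empty")

# Example 7: accident claims, classes 1-3, 3-5, ..., 15-17.
claim_lower_limits = [1, 3, 5, 7, 9, 11, 13, 15]
claim_frequencies = [6, 53, 85, 56, 21, 16, 4, 4]

claims_median = grouped_median(claim_lower_limits, 2, claim_frequencies)
print(round(claims_median, 2))
```

Passing the lower class boundaries rather than the class labels keeps the same function usable for series that have first been converted to continuous classes.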
Example 8: Calculate the median from the following data.
Solution. This series is given in the descending order. It should be first converted to continuous series and
placed in the ascending order, as in the following table.
Class intervals No. of persons(f) Cumulative frequency(c.f.)
10.5-15.5 7 7
15.5-20.5 10 17
20.5-25.5 13 30
25.5-30.5 26 56
30.5-35.5 35 91
35.5-40.5 22 113
40.5-45.5 11 124
45.5-50.5 5 129
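N = Σf = 129
$\frac{N}{2} = \frac{129}{2} = 64.5$, which falls in the class 30.5-35.5 (cumulative frequency 91). Hence
L = 30.5, f = 35, c.f. = 56 and h = 5.
$$\text{Median} = 30.5 + \frac{\frac{129}{2} - 56}{35} \times 5 = 30.5 + \frac{64.5 - 56}{35} \times 5 = 30.5 + \frac{8.5}{7} = 30.5 + 1.2 = 31.7 \text{ years}$$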
Mode : Value that occurs the most frequently in data set.
For ungrouped data :
Mode = number that occurs the highest number of times
For grouped data :
Step 1: Find the modal class. The modal class is the class with the maximum
frequency.
Step 2: Apply the formula
$$\text{Mode} = L + \frac{f_m - f_1}{2f_m - f_1 - f_2} \times h$$
Where, L = lower limit of the modal class
h = class width
𝑓𝑚 = frequency of the modal class
𝑓2 = frequency of the class succeeding the modal class
𝑓1 = frequency of the class preceding the modal class
• The mode is the value(s) that appear most frequently in the
dataset.
• If there is one mode, the data is unimodal.
• If there are two modes, the data is bimodal.
• If there are more than two modes, the data is multimodal.
• If no value repeats, there is no mode.
Illustrative examples:
Example 1
Data: 5, 8, 7, 8, 10, 8, 9, 7, 5
Step 1: Arrange the data in ascending order (optional).
5,5,7,7,8,8,8,9,10
Step 2: Identify the most frequent value.
5 appears 2 times.
7 appears 2 times.
8 appears 3 times.
9 appears 1 time.
10 appears 1 time.
Step 3: The mode is the value with the highest frequency, which is 8 (appears 3
times).
Thus, the mode of the data is 8.
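The counting in Steps 1 to 3 can be sketched with `collections.Counter`. The helper `modes` is hypothetical; it returns every value tied for the highest frequency, and an empty list when no value repeats, following the rules listed above.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no value repeats, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

example_1 = [5, 8, 7, 8, 10, 8, 9, 7, 5]
print(modes(example_1))
```

A result with one element is unimodal, two elements bimodal, and more than two multimodal.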
Example 2
Find mode for the grouped data given below
= 50+5
= 55
Thus mode is 55 marks.
Example 5: calculate the mode of the following series.
Since concentration of items is around the class 240-260, hence 240-260 is the modal class. It can be
verified with the help of the grouping method. Applying the formula:
$$\text{Mode} = L + \frac{f_m - f_1}{2f_m - f_1 - f_2} \times h$$
= 240 + 5.2632
= 245.2632
➢ Variance
It is a measure of how far the observed values in a dataset fall from the arithmetic mean and is therefore
a measure of spread - more specifically, it is a measure of variability. It is denoted by the Greek letter
sigma squared or Var(X) and its formula is given by:
$$\sigma^2 = Var(x) = \frac{\sum (x_i - \bar{x})^2}{n} = \frac{\sum x_i^2}{n} - \bar{x}^2$$
Standard deviation is used to measure the dispersion of data in a single data set.
Coefficient of variation is usually used to compare the variation of different data sets.
Example 1
Let's say we have the following dataset:
7, 12, 5, 18, 5, 9, 10, 9, 12, 8, 12, 16
Find the variance and standard deviation of this dataset.
Solution: we need to first find the mean, which is:
$$\bar{x} = \frac{7 + 12 + 5 + 18 + 5 + 9 + 10 + 9 + 12 + 8 + 12 + 16}{12} = \frac{123}{12} = 10.25$$
The variance of this dataset is then given by:
$$\sigma^2 = \frac{7^2 + 12^2 + 5^2 + 18^2 + 5^2 + 9^2 + 10^2 + 9^2 + 12^2 + 8^2 + 12^2 + 16^2}{12} - 10.25^2 = 14.69$$
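The same numbers can be reproduced with a short script that applies the shortcut formula Σx²/n − x̄². The standard deviation is taken as the square root of the variance, and the coefficient of variation is computed as (σ / x̄) × 100, the usual definition, which is assumed here because the notes do not spell it out.

```python
import math

data = [7, 12, 5, 18, 5, 9, 10, 9, 12, 8, 12, 16]
n = len(data)

mean = sum(data) / n
# Population variance by the shortcut formula: mean of squares minus squared mean.
variance = sum(x * x for x in data) / n - mean ** 2
std_dev = math.sqrt(variance)
coeff_variation = std_dev / mean * 100  # assumed definition, in percent

print(round(mean, 2), round(variance, 2), round(std_dev, 2),
      round(coeff_variation, 2))
```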
No of Apple 3 5 6 4 3 5 4
No of oranges 1 3 7 9 2 6 2
C.V1 = 23.54%, C.V2 = 65.50%. Since C.V1 < C.V2, we can conclude that the
consumption of apples is more consistent than that of oranges.
Examples for practice
For following grouped data compute mean, variance, standard deviation, coefficient of variation
Data set 1:
Class Frequency
10 - 20 15
20 - 30 25
30 - 40 20
40 - 50 12
50 - 60 8
60 - 70 5
70 - 80 3
Data set 2:
Class Frequency
0 - 2 5
2 - 4 16
4 - 6 13
6 - 8 7
8 - 10 5
10 - 12 4
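One way to check answers to these practice problems is sketched below for the first table. It treats every observation in a class as sitting at the class midpoint and uses the population (divide by N) variance; `grouped_stats` is a hypothetical helper, and the coefficient of variation is assumed to be (σ / mean) × 100.

```python
import math

def grouped_stats(classes, frequencies):
    """Return (mean, variance, std dev, CV %) using class midpoints."""
    mids = [(lower + upper) / 2 for lower, upper in classes]
    n = sum(frequencies)
    mean = sum(f * x for f, x in zip(frequencies, mids)) / n
    variance = sum(f * (x - mean) ** 2 for f, x in zip(frequencies, mids)) / n
    std_dev = math.sqrt(variance)
    return mean, variance, std_dev, std_dev / mean * 100

classes_1 = [(10, 20), (20, 30), (30, 40), (40, 50),
             (50, 60), (60, 70), (70, 80)]
frequencies_1 = [15, 25, 20, 12, 8, 5, 3]

mean_1, var_1, sd_1, cv_1 = grouped_stats(classes_1, frequencies_1)
print(round(mean_1, 2), round(var_1, 2), round(sd_1, 2), round(cv_1, 2))
```

The second table can be checked the same way by passing its classes and frequencies.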
Shape of Data
Skewness:
It means lack of symmetry.
If the right tail is longer, we get a positively skewed distribution for which
mean > median > mode.
while if the left tail is longer, we get a negatively skewed distribution for which
mean < median < mode.
The example of the Symmetrical curve, Positive skewed curve and Negative skewed
curve are given as follows:
Skewness Coefficient
(Pearson's First Coefficient of Skewness):
This is a numerical measure of skewness, which determines the skewness when mean and mode
are not equal.
Coefficient of Skewness as per Karl Pearson's Measure
1. With respect to Mean and Median:
$$Sk = \frac{3(\text{Mean} - \text{Median})}{\sigma}$$
2. With respect to Mean and Mode:
$$Sk = \frac{\text{Mean} - \text{Mode}}{\sigma}$$
•If Sk = 0, it indicates a perfectly symmetric distribution where the data is evenly balanced on
both sides of the mean.
•If Sk > 0, it suggests a positively skewed distribution where the tail on the right side is longer or
fatter, and most data points are concentrated on the left side of the mean.
•If Sk < 0, it indicates a negatively skewed distribution where the tail on the left side is longer or
fatter, and most data points are concentrated on the right side of the mean.
Note: The value of Karl Pearson's coefficient of skewness lies between -3 and 3.
Example 1:
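$$\text{Coefficient of skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$$
$$0.32 = \frac{29.6 - \text{Mode}}{6.5}$$
so that Mode = 29.6 − 0.32 × 6.5 = 29.6 − 2.08 = 27.52.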
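Both of Pearson's measures, and the rearrangement used in this example, can be sketched as small functions. The function names are illustrative only.

```python
def skew_mean_mode(mean, mode, std_dev):
    """Sk = (Mean - Mode) / sigma."""
    return (mean - mode) / std_dev

def skew_mean_median(mean, median, std_dev):
    """Sk = 3 * (Mean - Median) / sigma."""
    return 3 * (mean - median) / std_dev

def mode_from_skewness(mean, std_dev, sk):
    """Rearranged mean/mode formula: Mode = Mean - Sk * sigma."""
    return mean - sk * std_dev

# Values from Example 1: mean 29.6, standard deviation 6.5, Sk = 0.32.
mode_value = mode_from_skewness(29.6, 6.5, 0.32)
print(round(mode_value, 2))
print(round(skew_mean_mode(29.6, mode_value, 6.5), 2))  # recovers Sk
```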
x 2 3 4 5 6
f 1 3 7 3 1
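Solution:
x f fx (x − x̄) f(x − x̄) f(x − x̄)² f(x − x̄)³ f(x − x̄)⁴
2 1 2 -2 -2 4 -8 16
3 3 9 -1 -3 3 -3 3
4 7 28 0 0 0 0 0
5 3 15 1 3 3 3 3
6 1 6 2 2 4 8 16
Totals: N = Σf = 15, Σfx = 60, Σf(x − x̄) = 0, Σf(x − x̄)² = 14, Σf(x − x̄)³ = 0, Σf(x − x̄)⁴ = 38
$$\bar{x} = \frac{\sum fx}{N} = \frac{60}{15} = 4$$
$$\mu_1 = \frac{\sum f(x - \bar{x})}{N} = \frac{0}{15} = 0$$
$$\mu_2 = \frac{\sum f(x - \bar{x})^2}{N} = \frac{14}{15}$$
$$\mu_3 = \frac{\sum f(x - \bar{x})^3}{N} = \frac{0}{15} = 0$$
$$\mu_4 = \frac{\sum f(x - \bar{x})^4}{N} = \frac{38}{15}$$
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{0}{\left(\frac{14}{15}\right)^3} = 0$$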
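The moment table can be verified with exact fractions. This is a sketch: `central_moment` is a hypothetical helper, and β1 is computed as μ3² / μ2³, the standard definition.

```python
from fractions import Fraction

x_values = [2, 3, 4, 5, 6]
freqs = [1, 3, 7, 3, 1]

n = sum(freqs)
mean_x = Fraction(sum(f * x for f, x in zip(freqs, x_values)), n)

def central_moment(r):
    """r-th central moment: sum of f * (x - mean)^r, divided by N."""
    return sum(f * (x - mean_x) ** r for f, x in zip(freqs, x_values)) / n

mu1, mu2, mu3, mu4 = (central_moment(r) for r in (1, 2, 3, 4))
beta1 = mu3 ** 2 / mu2 ** 3

print(mean_x, mu1, mu2, mu3, mu4, beta1)
```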
They help in understanding the distribution and spread of data by indicating where
certain percentages of the data fall.
The most used partition values are quartiles, deciles, and percentiles.
To divide the observation into two equally sized parts, the median can be used.
Quartiles:
Quartiles are the three values that divide an ordered dataset into four equal parts.
The first quartile, second quartile, and third quartile are the three basic quartile categories.
• Individual Series.
While computing quartiles, deciles and percentiles, the first step will be to arrange the data in
ascending order only. After that we shall have to apply the following formulae:
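• Quartiles
$$Q_1 = \text{size of } \left(\frac{N+1}{4}\right)\text{th item}$$
$$Q_2 = \text{size of } 2\left(\frac{N+1}{4}\right)\text{th item}$$
$$Q_3 = \text{size of } 3\left(\frac{N+1}{4}\right)\text{th item}$$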
Deciles:
Deciles divide a dataset into ten equal parts based on numerical values. There are therefore
nine deciles altogether, represented as D1, D2, D3, ..., D9.
Deciles are used in descriptive statistics to group large data sets, ordered either from highest to
lowest values or vice versa.
The formulas for calculating deciles are:
$$D_1 = \left(\frac{N+1}{10}\right)\text{th item}$$
$$D_2 = \left(\frac{2(N+1)}{10}\right)\text{th item, and so on}$$
$$D_9 = \left(\frac{9(N+1)}{10}\right)\text{th item}$$
Where, N is the total number of observations, D1 is First Decile, D2 is Second Decile,……….D9 is Ninth
Decile.
Percentiles
Q1 = 2.5th term
Q1 = 12
Q3 = 7.5th item
Q3 = 37.5
Example 2:
Calculate Q1 and Q3 for the data related
to the age in years of 99 members in a housing society.
Solution:
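$$Q_2 = \text{Median} = \text{size of } 2\left(\frac{N+1}{4}\right)\text{th item} = \text{size of } 2\left(\frac{30+1}{4}\right)\text{th item} = \text{size of 15.5th item}$$
$$= \frac{\text{size of 15th item} + \text{size of 16th item}}{2} = \frac{550 + 600}{2} = \frac{1150}{2} = 575$$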
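Lower or first quartile
$$Q_1 = \text{size of } \left(\frac{N+1}{4}\right)\text{th item} = \text{size of } \left(\frac{30+1}{4}\right)\text{th item} = \text{size of 7.75th item}$$
$$= \text{size of 7th item} + \frac{3}{4}\left(\text{size of 8th item} - \text{size of 7th item}\right)$$
= 409 + 0.75(430 − 409) = 409 + 15.75
= 424.75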
Upper or third quartile
$$Q_3 = \text{size of } 3\left(\frac{N+1}{4}\right)\text{th item} = \text{size of } 3\left(\frac{30+1}{4}\right)\text{th item} = \text{size of 23.25th item}$$
= 700 + 0.25(710 − 700) = 700 + 2.5
= 702.5
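7th Decile
$$D_7 = \text{size of } 7\left(\frac{N+1}{10}\right)\text{th item} = \text{size of } 7\left(\frac{30+1}{10}\right)\text{th item} = \text{size of 21.7th item}$$
$$= \text{size of 21st item} + \frac{7}{10}\left(\text{size of 22nd item} - \text{size of 21st item}\right)$$
= 651 + 0.7(660 − 651) = 651 + 6.3
= 657.3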
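28th percentile
$$P_{28} = \text{size of } 28\left(\frac{N+1}{100}\right)\text{th item} = \text{size of } 28\left(\frac{30+1}{100}\right)\text{th item} = \text{size of 8.68th item}$$
$$= \text{size of 8th item} + \frac{68}{100}\left(\text{size of 9th item} - \text{size of 8th item}\right)$$
= 430 + 0.68(450 − 430) = 430 + 13.60
= 443.60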
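The arithmetic in this solution follows one pattern: compute the position k(N+1)/parts, then interpolate between the two neighbouring items. The sketch below reproduces it from the item values quoted in the solution (7th item 409, 8th item 430, and so on); the helper names are hypothetical, and the full data set is not needed.

```python
def position(k, parts, n):
    """Position of the k-th partition value when splitting into `parts` parts."""
    return k * (n + 1) / parts

def interpolate(lower_item, upper_item, fraction):
    """Move `fraction` of the way from the lower item to the upper item."""
    return lower_item + fraction * (upper_item - lower_item)

n_items = 30
print(position(1, 4, n_items), interpolate(409, 430, 0.75))     # Q1
print(position(3, 4, n_items), interpolate(700, 710, 0.25))     # Q3
print(position(7, 10, n_items), interpolate(651, 660, 0.7))     # D7
print(position(28, 100, n_items), interpolate(430, 450, 0.68))  # P28
```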
Example 4:(Bowley’s Coefficient of Skewness):
Calculate Bowley's Measure of Skewness for the following dataset representing the ages of a group of
people in a sample: 20, 24, 28, 32, 35, 40, 42, 45, 50.
Solution: Calculate the median (Q2)
Q2= 35 (the middle value)
Now, first quartile (Q1)
Q1 = 26
third quartile (Q3)
Q3 = 43.5
Substitute the above values in the formula $B = \frac{Q_1 + Q_3 - 2Q_2}{Q_3 - Q_1}$:
$$B = \frac{26 + 43.5 - 2(35)}{43.5 - 26} = \frac{-0.5}{17.5} \approx -0.03$$
Since B is negative (B < 0), the distribution is negatively skewed (left-skewed), although only
slightly, because B is close to 0. This means that the tail of the distribution is longer on the left
side, indicating that there may be outliers or low values on the left side of the data.
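The whole of Example 4 can be sketched end to end. The helper `partition_value` is hypothetical; it applies the (N+1) position rule with linear interpolation used throughout these notes, and clamping positions to the range 1..N is an added assumption for very small data sets.

```python
import math

def partition_value(values, k, parts):
    """k-th partition value: item at position k*(N+1)/parts, interpolated."""
    ordered = sorted(values)
    n = len(ordered)
    pos = min(max(k * (n + 1) / parts, 1), n)  # clamp (assumption)
    whole = math.floor(pos)
    fraction = pos - whole
    lower = ordered[whole - 1]  # positions are 1-based
    if fraction == 0:
        return lower
    return lower + fraction * (ordered[whole] - lower)

ages = [20, 24, 28, 32, 35, 40, 42, 45, 50]
q1 = partition_value(ages, 1, 4)
q2 = partition_value(ages, 2, 4)
q3 = partition_value(ages, 3, 4)

bowley = (q1 + q3 - 2 * q2) / (q3 - q1)
print(q1, q2, q3, round(bowley, 4))
```

The same function gives deciles with `parts=10` and percentiles with `parts=100`.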
Example 5:
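Solution. Coefficient of skewness based on quartiles $= \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$
$$Q_1 + Q_3 = 100 \quad \text{…(i)}$$
$$Q_2 = 38$$
$$0.6 = \frac{100 - 2 \times 38}{Q_3 - Q_1}$$
$$Q_3 - Q_1 = \frac{24}{0.6} = 40 \quad \text{…(ii)}$$
Solving (i) and (ii), $Q_3 = 70$ and $Q_1 = 30$.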
Data Visualization
Data Visualization: Histogram
➢ Equal units along the vertical axis (the Y axis, or ordinate) reflect increases
in frequency. (The units along the vertical axis do not have to be the same
width as those along the horizontal axis.)
➢ The intersection of the two axes defines the origin at which both numerical
scales equal 0
Age 1 2 3 4 5 6
Frequency 5 3 7 5 4 2
[Figure: histogram with Age (1 to 6) on the horizontal axis and Frequency on the vertical axis]
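As a rough stand-in for the plotted figure, the age/frequency table can be rendered as a text histogram using only the standard library. This is a sketch, with bars drawn horizontally rather than vertically; a plotting library would normally be used for a real chart.

```python
# Age and frequency from the table above.
ages_axis = [1, 2, 3, 4, 5, 6]
frequency = [5, 3, 7, 5, 4, 2]

# One bar per age; the bar length equals the frequency.
bars = [f"{age} | {'#' * f} ({f})" for age, f in zip(ages_axis, frequency)]

print("Age | Frequency")
for bar in bars:
    print(bar)
```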
Ex.2 Draw a histogram for the following data distribution:
Ex.3 Given below is the table showing the approximate lengths, in mm, of 40 leaves taken
from different parts of a certain species.
Length (mm) 25-30 30-35 35-40 40-45 45-50 50-55 55-60
Number of leaves 1 4 8 10 8 7 2
Data Visualization: Box Plot (Box-and-Whisker Plot)
The Box Plot is a graphical representation of a dataset’s five-
number summary: minimum, first quartile (25th percentile), median
(50th percentile), third quartile (75th percentile), and maximum.
Developed by John Tukey in the 1970s, this plotting system has
been recognized for its concise delivery of the distribution of a
dataset, thus simplifying the data analysis process.
It’s a powerful tool in data analysis because it can clearly highlight
the dataset’s central tendency, dispersion, and skewness. Moreover,
it effectively visualizes outliers, providing a complete picture of the
data distribution. This is particularly useful when comparing
multiple datasets, as it offers a clear, comparative visualization of
the different data distributions.
The five numbers used in a box plot are:
1. Minimum
2. First Quartile (Q1)
3. Median (Q2)
4. Third Quartile (Q3)
5. Maximum
The Essential Components of a Box Plot
➢ The second quartile (Q2), or median, is the middle value that separates the data into
two halves. It measures central tendency, providing a snapshot of the data’s center.
➢ Quartiles Q1 and Q3, marking the box ends, reflect the data’s dispersion. These
quartiles represent the 25th and 75th percentiles of the dataset, respectively. The
Q1 mark represents the median of the first half of the data, while the Q3 represents
the median of the second half.
➢ The whiskers are lines extending from the box, reaching the minimum and
maximum non-outlier data points.
Usually, the lower whisker extends from Q1 to the smallest non-outlier data point,
and the upper whisker extends from Q3 to the largest non-outlier data point.
➢ The length of the box is the Inter Quartile Range (IQR), calculated by subtracting
Q1 from Q3 (IQR = Q3 – Q1).
➢ The IQR spans the middle 50% of the data and is a measure of dispersion or spread.
➢ Outliers are typically identified as data points that fall below (Q1 – 1.5 × IQR)
or above (Q3 + 1.5 × IQR). These outliers are represented as individual points
outside the whiskers in the box plot.
➢ A point more than 3 interquartile ranges from the box edge is called an
extreme outlier.
How to Interpret a Box Plot
➢ Length of the Box: The length of the box (between Q1
and Q3) represents the IQR, showing the spread of the
middle 50% of the data.
➢ When the median is closer to the top of the box, and if the
whisker is shorter on the upper end of the box, then the
distribution is negatively skewed (skewed left).
Conclusion
Understanding these components of a box plot allows for rapid
comprehension of the data’s distribution, spread, and skewness. It
also aids in identifying and visualizing potential outliers, which can
be invaluable in data analysis.
Example 1:
Test scores for a college statistics class held during the day are:
99, 56,78, 55.5,32, 90,80, 81, 56, 59, 45, 77, 84.5, 84, 70, 72, 68, 32, 79, 90
Find the smallest and largest values, the median, and the first and third quartile for the day class. Also construct box plot
for the data.
Solution: Arranging data in ascending order
32,32,45,55.5,56,56,59,68,70,72,77,78,79,80,81,84,84.5,90,90,99
Population size: 20
Minimum: 32
First quartile: 56
Median: 74.5
Third quartile: 83.25
Maximum: 99
Interquartile Range (IQR): 27.25
1.5 × IQR: 40.875
Q1 – 1.5 × IQR: 15.125
Q3 + 1.5 × IQR: 124.125
Outliers: none
Since the minimum value of the data set is greater than Q1 – 1.5 × IQR and the maximum value is
less than Q3 + 1.5 × IQR, there are no outliers.
Thus, plotting all five numbers on a scaled line, we get the box plot as given below.
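The figures in this solution can be reproduced with the sketch below. It assumes quartiles are located at the k(N+1)/4-th item with linear interpolation, which matches the values quoted above; `item_at` and `box_plot_summary` are hypothetical helper names.

```python
import math

def item_at(ordered, pos):
    """Value at a (possibly fractional) 1-based position in sorted data."""
    whole = math.floor(pos)
    fraction = pos - whole
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower
    return lower + fraction * (ordered[whole] - lower)

def box_plot_summary(values):
    """Five-number summary, outlier fences and outliers."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = item_at(ordered, (n + 1) / 4)
    median = item_at(ordered, (n + 1) / 2)
    q3 = item_at(ordered, 3 * (n + 1) / 4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {"min": ordered[0], "q1": q1, "median": median, "q3": q3,
            "max": ordered[-1], "iqr": iqr, "low_fence": low_fence,
            "high_fence": high_fence, "outliers": outliers}

day_scores = [99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59, 45, 77, 84.5, 84,
              70, 72, 68, 32, 79, 90]
day_summary = box_plot_summary(day_scores)
for name, value in day_summary.items():
    print(name, value)
```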
Example 2:
Test scores for college statistics class held during the evening are:
98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98, 90, 80, 84.5, 85, 79, 78, 98, 90, 79, 81, 25.5
Find the smallest and largest values, the median, and the first and third quartile for the
night class. Also construct box plot for the data.
Solution: Arranging data in ascending order
25.5, 45, 65, 68, 76, 78, 78, 79, 79, 80, 81, 81, 83, 84.5, 85, 88, 89, 90, 90, 98, 98, 98
Population size: 22
Median: 81
Interquartile Range (IQR): 11.75
Solution
Population size: 40
Median: 66
Minimum: 59
Maximum: 77
First quartile: 64.25
Third quartile: 70
Interquartile Range: 5.75
Outliers: none
Software for statistical analysis and data visualization:
http://www.galaxy.gmu.edu/papers/astr1.html
http://ourworld.compuserve.com/homepages/Rainer_Wuerlaender/statsoft.htm#archiv
http://www.R-project.org