
Wa0009.

The document provides an overview of descriptive statistics, covering key concepts such as population vs. sample, types of variables, data representation, and measures of central tendency and dispersion. It highlights applications in various fields like business, healthcare, and education, and explains methods for constructing frequency distributions and calculating relative and cumulative frequencies. Additionally, it includes illustrative examples for calculating means and organizing data effectively.

Uploaded by

pepekksjsn

Vishwakarma Institute of Technology, Pune

Calculus and Statistics(HS1076)


Unit 5- Descriptive Statistics
Content
● Population, Sample
● Types of variables
● Data representation–Grouped, Ungrouped frequency distributions
● Measures of central tendency and dispersion
● Coefficient of Variation, Skewness, Kurtosis
● Quartiles, Deciles, Percentiles
● Data visualization (Graphical Representation-Histogram, Box plot)
Introduction to Descriptive Statistics

➢ Statistics exists because of the prevalence of variability in the real world.


➢ In its simplest form, known as descriptive statistics, statistics provides us
with tools—tables, graphs, averages, ranges, correlations—for organizing and
summarizing the inevitable variability in collections of actual observations or
scores
➢ Goal: To make sense of raw data using statistical tools to identify patterns or
trends.
Applications
1. Business and Economics: Summarizing sales data, customer demographics, and
financial performance using measures like mean sales, median income, or standard
deviations in profits.
2. Quality Control in Engineering: Analyzing production data to assess consistency,
such as using histograms to understand variations in product dimensions or
calculating averages to monitor quality.
3. Healthcare: Summarizing patient data, like average blood pressure levels, body
temperature, or demographic information to identify patterns.
4. Social Sciences: Summarizing survey responses using central tendency measures
(e.g., mean, median, mode) to understand societal trends and behaviors
5. Sports Analytics: Providing information about players, like average
points per game, highest scores, or batting averages, to assess
performance.
6. Environmental Studies: Summarizing temperature, rainfall, and
other climate data to identify patterns and monitor environmental
changes.
7. Education: Calculating average exam scores, pass rates, and other
statistics to evaluate student performance.
Population vs Sample
Population: The complete set of all possible observations or measurements that
could be made. It represents the entirety of individuals or instances about which
you want to make inferences.
● Example: All B.Tech students in a university.
Sample: A subset of the population selected for analysis. It is used to draw
conclusions about the population.
● Example: A group of 100 randomly selected B.Tech students.
Note: The sample size is denoted by n and the population size by N. Because a sample is a subset of the population, n ≤ N.
Data/Statistical Variable: A collection of actual observations or
scores in a survey or an experiment

Any statistical analysis is performed on data.


Qualitative Data (Categorical Data):
➢ Describes qualities or characteristics.
➢ Non-numeric (usually)
➢ Used to categorize or label data.
Examples:
Gender (Male, Female, Other)
Colors (Red, Blue, Green)
Nationality (Indian, American)
Type of car (SUV, Sedan, Hatchback)
Types of Qualitative Data:
Nominal – Categories with no order
(e.g., blood type: A, B, AB, O)
Ordinal – Categories with a meaningful order, but differences can’t be measured
(e.g., rating: Poor, Fair, Good, Excellent)
Quantitative Data (Numerical Data)
➢ Describes quantities or amounts.
➢ Numeric
➢ Used to measure or count.

Examples:
Age (21, 35, 45)
Height (5.6 ft, 170 cm)
Test scores (85, 92, 78)
Number of students in a class (30, 45)

Types of Quantitative Data:


1.Discrete – Countable values
(e.g., number of books, number of cars)
2.Continuous – Can take any value within a range
(e.g., weight, temperature, height)
Data representation: Grouped vs Ungrouped

➢ Ungrouped Data
Definition: Raw data that has not been organized into groups or intervals.

Form: A list of individual data points.


Best used when the dataset is small and easy to read/analyze directly.

Example: Test scores of 10 students:


45, 50, 48, 47, 50, 52, 48, 49, 51, 50

➢ Characteristics:
Exact values are available.
Easy to calculate measures like mean, median, mode for small data.
Difficult to interpret visually when data is large.
➢ Grouped Data
Definition: Data that has been organized into classes or intervals.
Form: Data is arranged in a frequency distribution table.
Best used when the dataset is large, making ungrouped data hard to interpret.
Example (same test scores grouped into intervals):
Score Range   Frequency
45-47         2
48-50         6
51-53         2

➢ Characteristics:
Data is summarized and easier to interpret.
Helps in constructing histograms and other charts.
Exact values are lost, only intervals and frequencies are used.
Used to estimate central tendency and dispersion.
➢ Frequency Distributions:
Represents the pattern of how frequently each value of a variable appears in
a dataset. It shows the number of occurrences for each possible value within
the dataset.

➢ Frequency Distribution Table


A way to organize and present data in a tabular form which helps us to
summarize the large dataset into a concise table.

A frequency distribution table has two columns: one lists the data, either as class intervals or as individual values, and the other gives the frequency of each interval or value.

Grouped (class intervals):
Test Score   Frequency
0-20         6
21-40        12
41-60        22
61-80        15
81-100       5

Ungrouped (individual values):
Test Score   Frequency
45           1
47           1
48           2
49           3
50           2
Ungrouped Frequency Distribution for Ungrouped Data

An ungrouped frequency distribution is produced whenever observations are sorted into classes of single values.
Example:
Make the Frequency Distribution Table for the ungrouped data given as follows:
10, 20, 15, 25, 30, 10, 15, 10, 25, 20, 15, 10, 30, 25

Value Frequency
10 4
15 3
20 2
25 3
30 2
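
The tally above can be reproduced with a short script. This is a minimal sketch using Python's standard `collections.Counter`; the variable names `data` and `freq` are illustrative choices, not part of the original notes.

```python
from collections import Counter

# Ungrouped data from the example above.
data = [10, 20, 15, 25, 30, 10, 15, 10, 25, 20, 15, 10, 30, 25]

# Count how many times each distinct value occurs.
freq = Counter(data)

# Print the table in ascending order of value.
print("Value  Frequency")
for value in sorted(freq):
    print(f"{value:>5}  {freq[value]:>9}")
```

The frequencies should add up to the number of observations (14 here), which is a quick check that no value was missed.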
➢Grouped Frequency Distribution for
Ungrouped Data
Observations are divided between different intervals known as
class intervals and then their frequencies are counted for each class
interval. This Frequency Distribution is used mostly when the data set
is very large.
CONSTRUCTING FREQUENCY DISTRIBUTIONS

1.Find the range, that is, the difference between the largest and smallest observations.

2. Find the class interval required to span the range by dividing the range by the desired
number of classes (ordinarily 10).

3. Round off to the nearest convenient interval (such as 1, 2, 3, . . . 10, particularly 5 or 10 or multiples of 5 or 10).

4. Determine where the lowest class should begin. (Ordinarily, this number should be a
multiple of the class interval.)

5. Determine where the lowest class should end by adding the class interval to the lower
boundary and then subtracting one unit of measurement.
6. Working upward, list as many equivalent classes as are required to include the largest
observation.
For example, list 130–139, 140–149, . . . , 240–249

7. Indicate with a tally the class in which each observation falls.

8. Replace the tally count for each class with a number—the frequency (f )—and show the
total of all frequencies. (Tally marks are not usually shown in the final frequency
distribution.)

9. Supply headings for both columns and a title for the table.
Example 1:

Make the Frequency Distribution Table for the ungrouped data given as follows:
23, 27, 21, 14, 43, 37, 38, 41, 55, 11, 35, 15, 21, 24, 57, 35, 29, 10, 39, 42, 27, 17, 45,
52, 31, 36, 39, 38, 43, 46, 32, 37, 25

Solution: The observations lie between 10 and 57. Arranged in ascending order:
10, 11, 14, 15, 17,
21, 21, 23, 24, 25, 27, 27, 29,
31, 32, 35, 35, 36, 37, 37, 38, 38, 39, 39,
41, 42, 43, 43, 45, 46,
52, 55, 57

We can choose the class intervals 10-19, 20-29, 30-39, 40-49, and 50-59. These class intervals cover all the observations.

Class     Frequency
10 – 19   5
20 – 29   8
30 – 39   11
40 – 49   6
50 – 59   3
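
The counting step of this example can be sketched in Python. The helper `grouped_frequencies` and its parameters (`start`, `width`, `n_classes`) are hypothetical names introduced here for illustration, and the sketch assumes integer observations and equal-width classes such as 10-19, 20-29, and so on.

```python
def grouped_frequencies(data, start, width, n_classes):
    """Count integer observations in classes start..start+width-1, and so on."""
    counts = [0] * n_classes
    for x in data:
        index = (x - start) // width  # which class x belongs to
        if not 0 <= index < n_classes:
            raise ValueError(f"{x} falls outside the chosen classes")
        counts[index] += 1
    labels = [
        f"{start + i * width}-{start + (i + 1) * width - 1}"
        for i in range(n_classes)
    ]
    return list(zip(labels, counts))


scores = [23, 27, 21, 14, 43, 37, 38, 41, 55, 11, 35, 15, 21, 24, 57, 35, 29,
          10, 39, 42, 27, 17, 45, 52, 31, 36, 39, 38, 43, 46, 32, 37, 25]

table = grouped_frequencies(scores, start=10, width=10, n_classes=5)
for label, count in table:
    print(f"{label:>7}  {count}")
```

Raising an error for out-of-range values is a design choice of this sketch: it makes a badly chosen lowest class or too few classes visible instead of silently dropping observations.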
Ex.2
Consider a data set of 26 children of ages 1-6 years
2,2,1,3,3,3,6,6,2,1,1,1,1,3,3,3,5,5,4,4,4,5,5,4,4,3

For this data set of 26 children of ages 1-6 years


1,1,1,1,1,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,5,5,5,5,6,6
Ungrouped Frequency Distribution
Age 1 2 3 4 5 6
Frequency 5 3 7 5 4 2
Grouped Frequency Distribution

Age Group 1-2 3-4 5-6


Frequency 8 12 6
Ex. 3 Construct a grouped frequency distribution table from the following data
77,41,85,82,85,96,93,66,78,94,50,57

Solution:
Relative Frequency Distribution
This distribution displays the proportion or percentage of observations in each
interval or class.
It is useful for comparing different data sets or for analyzing the distribution of data
within a set.

Relative Frequency is given by:

Relative Frequency = (Frequency of Event) / (Total Number of Events)
Example:
Score Range 0-20 21-40 41-60 61-80 81-100

Frequency 5 10 20 10 5

Solution:
To Create the Relative Frequency Distribution table, we need to calculate Relative Frequency for
each class interval. Thus, Relative Frequency Distribution table is given as follows:
Score Range Frequency Relative Frequency
0-20 5 5/50 = 0.10

21-40 10 10/50 = 0.20

41-60 20 20/50 = 0.40

61-80 10 10/50 = 0.20

81-100 5 5/50 = 0.10

Total 50 1.00
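
As a quick check of the table above, the relative frequencies can be computed in a few lines of Python. This is only a sketch; the list names are illustrative.

```python
classes = ["0-20", "21-40", "41-60", "61-80", "81-100"]
frequencies = [5, 10, 20, 10, 5]

# Relative frequency = class frequency / total frequency.
total = sum(frequencies)
relative = [f / total for f in frequencies]

for label, f, r in zip(classes, frequencies, relative):
    print(f"{label:>7}  {f:>3}  {r:.2f}")
print(f"{'Total':>7}  {total:>3}  {sum(relative):.2f}")
```

Multiplying each relative frequency by 100 gives the percentage form of the same distribution.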
Cumulative Frequency Distribution:

It is defined as the sum of all the frequencies in the previous values or intervals up to the
current one.

The distributions which represent the frequency distributions using cumulative frequencies
are called cumulative frequency distributions.

There are two types of cumulative frequency distributions:


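•Less than Type: For each class, we sum the frequencies of all classes up to and including the current one (the number of observations below its upper limit).

•More than Type: For each class, we sum the frequencies of the current class and all classes after it (the number of observations at or above its lower limit).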
Example:
The table below gives the values of runs scored by Virat Kohli in the last 25 T-20
matches. Represent the data in the form of less-than-type cumulative frequency
distribution:

45 34 50 75 22
56 63 70 49 33
0 8 14 39 86
92 88 70 56 50
57 45 42 12 39
Since there are a lot of distinct values, we’ll express this in the form of grouped
distributions with intervals like 0-10, 10-20 and so. First let’s represent the data in the
form of grouped frequency distribution.

Runs Frequency

0-10 2

10-20 2

20-30 1

30-40 4

40-50 4

50-60 5

60-70 1

70-80 3

80-90 2

90-100 1
Less-than type:
Runs scored by Virat Kohli   Cumulative Frequency
Less than 10                 2
Less than 20                 4
Less than 30                 5
Less than 40                 9
Less than 50                 13
Less than 60                 18
Less than 70                 19
Less than 80                 22
Less than 90                 24
Less than 100                25

More-than type:
Runs scored by Virat Kohli   Cumulative Frequency
More than 0                  25
More than 10                 23
More than 20                 21
More than 30                 20
More than 40                 16
More than 50                 12
More than 60                 7
More than 70                 6
More than 80                 3
More than 90                 1
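
Both cumulative columns can be derived from the raw scores. The sketch below assumes classes of width 10 in which the lower limit is included and the upper limit excluded (so a score of 50 is counted in 50-60), which is the convention the grouped table above follows; the variable names are illustrative.

```python
from itertools import accumulate

runs = [45, 34, 50, 75, 22, 56, 63, 70, 49, 33, 0, 8, 14,
        39, 86, 92, 88, 70, 56, 50, 57, 45, 42, 12, 39]

# Class frequencies for 0-10, 10-20, ..., 90-100.
frequencies = [0] * 10
for r in runs:
    frequencies[r // 10] += 1

# Less-than type: running total from the first class upward.
less_than = list(accumulate(frequencies))

# More-than type: running total from the last class downward.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for i in range(10):
    print(f"Less than {10 * (i + 1):>3}: {less_than[i]:>2}    "
          f"More than {10 * i:>2}: {more_than[i]:>2}")
```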


Measures of Central Tendency
Mean (Arithmetic Average): The sum of all observations divided by the number of observations.

For ungrouped data:
$$\text{Mean} = \frac{\text{Sum of all observations}}{\text{Number of observations}}, \qquad \bar{x} = \frac{\sum x_i}{n}$$
For grouped data:
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$
where $f_i$ is the frequency of class $i$ and $x_i$ is the midpoint of class $i$.
Illustrative Examples
Example 1
Calculate the mean for the data set as given below
5, 7, 9, 10, 12
The formula for the mean is:
$$\text{Mean} = \frac{\sum x_i}{n}$$
Step 1: Find the sum of the data values.
5 + 7 + 9 + 10 + 12 = 43
Step 2: Count the number of data points.
n = 5
Step 3: Calculate the mean.
$$\text{Mean} = \frac{43}{5} = 8.6$$
Example 2.The following table contains the half yearly bonus paid to 10 workers in a factory:

Sr no 1 2 3 4 5 6 7 8 9 10
Half- yearly bonus 150 200 300 650 250 180 400 500 550 220

Find out the arithmetic mean.

Solution:

Sr. No.   Half Yearly bonus x (in Rs)
1         150
2         200
3         300
4         650
5         250
6         180
7         400
8         500
9         550
10        220
N = 10    ΣX = 3400

$$\bar{X} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{N} = \frac{\sum X}{N} = \frac{3400}{10} = 340$$
Example 3. Calculate the mean of the following frequency distribution of marks in a test in statistics:

Marks             10   20   30   40   50   60   70   80
No. of students   3    6    10   12   9    6    2    2

Solution:

Marks (x)   Number of students (f)   fx
10          3                        30
20          6                        120
30          10                       300
40          12                       480
50          9                        450
60          6                        360
70          2                        140
80          2                        160
            N = Σf = 50              Σfx = 2040

$$\bar{X} = \frac{\sum fx}{N} = \frac{2040}{50} = 40.8$$
Hence the average or mean marks in statistics = 40.8
Example 4. Find out the arithmetic mean for the following data:
Marks             0-10   10-20   20-30   30-40   40-50
No. of students   5      10      40      20      25

Solution: By the direct method,

Marks    Mid values ($x_i$)   Number of students ($f_i$)   $f_i x_i$
0-10     5                    5                            25
10-20    15                   10                           150
20-30    25                   40                           1000
30-40    35                   20                           700
40-50    45                   25                           1125
                              N = Σf = 100                 Σ$f_i x_i$ = 3000

$$\bar{X} = \frac{\sum f_i x_i}{N} = \frac{3000}{100} = 30$$
Example 5. For the following data , calculate arithmetic mean:
Marks No. of students
Less than 10 5
Less than 20 17
Less than 30 31
Less than 40 41
Less than 50 49

Solution: A cumulative frequency distribution should first be converted into a simple frequency distribution

Marks(x) Number of
students(f)
0-10 5
10-20 17-5= 12
20-30 31-17=14
30-40 41-31=10
40-50 49-41=8
Now mean value of the data is obtained by direct method as under:
Marks    Mid values ($x_i$)   Number of students ($f_i$)   $f_i x_i$
0-10     5                    5                            25
10-20    15                   12                           180
20-30    25                   14                           350
30-40    35                   10                           350
40-50    45                   8                            360
                              N = Σf = 49                  Σ$f_i x_i$ = 1265

$$\bar{X} = \frac{\sum f_i x_i}{N} = \frac{1265}{49} = 25.82 \text{ marks}$$
Example 6 :

Find the mean of the grouped data given below:

Class Interval Frequency


40-49 3
50-59 5
60-69 7
70-79 4
80-89 1
Solution: Here
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$
Where:
• $f_i$ = frequency of each class
• $x_i$ = midpoint of each class
• $\sum f_i$ = total frequency

Class Interval   Frequency ($f_i$)   Midpoint ($x_i$)   $f_i x_i$
40-49            3                   44.5               133.5
50-59            5                   54.5               272.5
60-69            7                   64.5               451.5
70-79            4                   74.5               298
80-89            1                   84.5               84.5

$$\text{Mean } (\bar{x}) = \frac{133.5 + 272.5 + 451.5 + 298 + 84.5}{3 + 5 + 7 + 4 + 1} = \frac{1240}{20} = 62$$
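
The direct method for grouped data can be sketched as a small function. The name `grouped_mean` and the representation of classes as `(lower, upper)` pairs are assumptions made for this illustration.

```python
def grouped_mean(classes, frequencies):
    """Mean of grouped data: sum(f_i * x_i) / sum(f_i), x_i = class midpoint."""
    midpoints = [(lower + upper) / 2 for lower, upper in classes]
    total_f = sum(frequencies)
    return sum(f * x for f, x in zip(frequencies, midpoints)) / total_f


# Example 6: classes 40-49, ..., 80-89.
classes_6 = [(40, 49), (50, 59), (60, 69), (70, 79), (80, 89)]
mean_6 = grouped_mean(classes_6, [3, 5, 7, 4, 1])

# Example 4: classes 0-10, ..., 40-50.
classes_4 = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
mean_4 = grouped_mean(classes_4, [5, 10, 40, 20, 25])

print(mean_6, mean_4)
```

Because the exact observations are lost after grouping, this value is an estimate that treats every observation in a class as if it sat at the class midpoint.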
• Median:

• The value of the middle item of a series when it is arranged in ascending or descending order of magnitude.
• It divides the series into two equal parts: one part consists of the values less than or equal to the median, and the other of the values greater than or equal to it.
• Unlike the mean, the median is a positional average: it is determined by the position of a value in the ordered series, and it lies at or near the centre of the values.
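For ungrouped data/discrete series:
First, arrange the data in ascending order.
(i) If n is odd, Median = value of the $\left(\frac{n+1}{2}\right)^{\text{th}}$ observation
(ii) If n is even,
$$\text{Median} = \frac{\left(\frac{n}{2}\right)^{\text{th}} \text{ observation} + \left(\frac{n}{2} + 1\right)^{\text{th}} \text{ observation}}{2}$$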
Median for grouped data/continuous series
For grouped data:
Step 1: Construct the cumulative frequency distribution.
Step 2: Find the median class. The median class is the class in which the value of $\frac{N}{2}$ falls in the cumulative frequency distribution.
Step 3: Find the median by using the following formula.
$$\text{Median} = L + \frac{\frac{N}{2} - c.f.}{f} \times h$$
Where, N = Total frequency
L = Lower limit of the median class
f = Frequency of the median class
c.f. = Cumulative frequency of the class before the median class
h = Class width
Illustrative Examples
Example 1
Find Median of the data: 5, 8, 3, 7, 10
Step 1: Arrange the data in ascending order.
3,5,7,8,10
Step 2: The number of data points, n = 5 (odd).
Step 3: The median is the middle value (the 3rd value in this case).
Median=7
Example 2
Find Median of the data : 12, 16, 10, 8, 22, 18
Step 1: Arrange the data in ascending order.
8,10,12,16,18,22
Step 2: The number of data points n=6 (even).
Step 3: The median is the average of the two middle values (3rd
and 4th values).
$$\text{Median} = \frac{12 + 16}{2} = 14$$
Example.3

Step 1: Calculate the total frequency N


N = 50
Step 2: Find $\frac{N}{2}$:
$$\frac{N}{2} = \frac{50}{2} = 25$$

So, the first class whose cumulative frequency reaches 25 is the median class.
From the cumulative frequency (CF) column, we see that $\frac{N}{2} = 25$ falls in the class 20 - 30 (since the CF for this class is 25). Therefore, 20 - 30 is the median class.
Step 3: Use the median formula:
$$\text{Median} = L + \frac{\frac{N}{2} - c.f.}{f} \times h$$

Where:
• L = 20 (lower boundary of the median class)
• N = 50 (total frequency)
• c.f. = 13 (cumulative frequency of the class before the median class)
• f = 12 (frequency of the median class)
• h = 10 (class width)
Step 4: Apply the formula.
$$\text{Median} = 20 + \frac{25 - 13}{12} \times 10 = 20 + 10 = 30$$
Thus, the median is 30.
Example 4. The consumption of printing paper reams (in units) for the first 11 months of a computer
operator is given as

20, 25, 30, 15, 17, 35, 26, 18, 40, 45, 50

Find the median.

Solution: By arranging the data in ascending order, we get the series

15, 17, 18, 20, 25, 26, 30, 35, 40, 45, 50

The number of terms in this series is 11 which is odd

Hence, the required median (M) = value of the $\left(\frac{11+1}{2}\right)^{\text{th}}$ observation

= value of the 6th observation

= 26
Example 5. Calculate the median of the following data that relates to the monthly salaries of employees (in
thousand rupees):

110, 115, 108, 112, 120, 116, 140, 135, 128, 132

Solution. By arranging the data in ascending order, we get the series 108, 110, 112, 115, 116, 120, 128,
132, 135, 140

The number of terms in this series is 10 which is even value,


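$$\text{Median} = \frac{\left(\frac{10}{2}\right)^{\text{th}} \text{ value} + \left(\frac{10}{2} + 1\right)^{\text{th}} \text{ value}}{2} = \frac{5^{\text{th}} \text{ value} + 6^{\text{th}} \text{ value}}{2}$$

Hence, the required median (M) $= \frac{116 + 120}{2} = 118$

Thus, the median salary is 118 thousand rupees (Rs 118,000).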
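
For ungrouped data, the odd and even cases above are both handled by `statistics.median` from Python's standard library. The short sketch below re-checks Examples 4 and 5; the list names are illustrative.

```python
import statistics

# Example 4: 11 observations (odd), so the median is the 6th ordered value.
reams = [20, 25, 30, 15, 17, 35, 26, 18, 40, 45, 50]
median_reams = statistics.median(reams)

# Example 5: 10 observations (even), so the median is the mean of the
# 5th and 6th ordered values.
salaries = [110, 115, 108, 112, 120, 116, 140, 135, 128, 132]
median_salaries = statistics.median(salaries)

print(median_reams, median_salaries)
```

The function sorts a copy of the data internally, so the input does not need to be arranged in ascending order first.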


Example 6: Obtain the median size of shoes sold from the following data.

Size           5    5½   6    6½    7     7½    8     8½    9     9½    10    10½   11   11½
No. of pairs   30   40   50   150   300   600   950   820   750   440   250   150   40   39

Size (x)   No. of pairs (f)   Cumulative frequency (c.f.)
5          30                 30
5½         40                 70
6          50                 120
6½         150                270
7          300                570
7½         600                1170
8          950                2120
8½         820                2940
9          750                3690
9½         440                4130
10         250                4380
10½        150                4530
11         40                 4570
11½        39                 4609
           N = Σf = 4609

$$\text{Median} = \left(\frac{N+1}{2}\right)^{\text{th}} \text{ value} = \left(\frac{4609+1}{2}\right)^{\text{th}} \text{ value} = 2305^{\text{th}} \text{ value}$$

The median therefore corresponds to the 2305th value in the series. This value is first reached at the cumulative frequency 2940, so the median is the size corresponding to that cumulative frequency, which is 8½.

Hence, the median size of shoes sold is 8½.
Example 7. An insurance company obtained the following data for accident claims from a particular
region. Obtain the median from this data.

Amount of claim in thousand rupees Frequency

1-3 6
3-5 53
5-7 85
7-9 56
9-11 21
11-13 16
13-15 4
15-17 4
Amount of claim Frequency (f) Cumulative frequency (c.f.)
1-3 6 6
3-5 53 59
5-7 85 144
7-9 56 200
9-11 21 221
11-13 16 237
13-15 4 241
15-17 4 245
N=𝝨𝑓 = 245

Here N = 245, which is an odd number.

$$\frac{N}{2} = \frac{245}{2} = 122.5,$$

which falls in the class 5-7 (see the row with cumulative frequency 144, which contains 122.5). Hence, the median class is 5-7.
L = lower limit of the median class = 5
f = frequency of the median class = 85
c.f. = cumulative frequency of the class preceding the median class = 59
h = width of the class interval of the median class = 2
$$\text{Median} = L + \frac{\frac{N}{2} - c.f.}{f} \times h$$
$$= 5 + \frac{\frac{245}{2} - 59}{85} \times 2$$
$$= 5 + \frac{63.5}{85} \times 2$$
$$= 5 + \frac{127}{85}$$
= 5 + 1.49
= 6.49
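
The three steps for the grouped median can be sketched as one function. The name `grouped_median` and its arguments are hypothetical; the sketch assumes equal-width, continuous classes listed in ascending order, and takes the median class to be the first class whose cumulative frequency reaches N/2.

```python
def grouped_median(lower_limits, width, frequencies):
    """Median = L + ((N/2 - c.f.) / f) * h for equal-width classes."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    cumulative = 0  # c.f. of the classes before the current one
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f >= half:  # this is the median class
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("median class not found")


# Example 7: claim amounts 1-3, 3-5, ..., 15-17 (class width 2).
claims = grouped_median([1, 3, 5, 7, 9, 11, 13, 15], 2,
                        [6, 53, 85, 56, 21, 16, 4, 4])

# Example 8: ages after conversion to 10.5-15.5, ..., 45.5-50.5 (width 5).
ages = grouped_median([10.5, 15.5, 20.5, 25.5, 30.5, 35.5, 40.5, 45.5], 5,
                      [7, 10, 13, 26, 35, 22, 11, 5])

print(round(claims, 2), round(ages, 2))
```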
Example 8: Calculate the median from the following data.

Age in years No. of persons (f)


46-50 5
41-45 11
36-40 22
31-35 35
26-30 26
21-25 13
16-20 10
11-15 7

Solution. This series is given in the descending order. It should be first converted to continuous series and
placed in the ascending order, as in the following table.
Class intervals No. of persons(f) Cumulative frequency(c.f.)
10.5-15.5 7 7
15.5-20.5 10 17
20.5-25.5 13 30
25.5-30.5 26 56
30.5-35.5 35 91
35.5-40.5 22 113
40.5-45.5 11 124
45.5-50.5 5 129
N=𝝨𝑓 = 129

Here N = 129, which is an odd number.

$$\frac{N}{2} = \frac{129}{2} = 64.5,$$
which falls in the class 30.5-35.5 (see the row with cumulative frequency 91, which contains 64.5).
Hence the median class is 30.5-35.5.
L = lower limit of the median class = 30.5
f = frequency of the median class = 35
c.f. = cumulative frequency of the class preceding the median class = 56
h = width of the class interval of the median class = 5
$$\text{Median} = L + \frac{\frac{N}{2} - c.f.}{f} \times h$$
$$= 30.5 + \frac{\frac{129}{2} - 56}{35} \times 5$$
$$= 30.5 + \frac{64.5 - 56}{35} \times 5$$
$$= 30.5 + \frac{8.5}{7}$$
= 30.5 + 1.2
= 31.7 years
Mode: The value that occurs most frequently in a data set.
For ungrouped data:
Mode = the value that occurs the highest number of times
For grouped data:
Step 1: Find the modal class. The modal class is the class with the maximum frequency.
Step 2: Apply the formula
$$\text{Mode} = L + \frac{f_m - f_1}{2 f_m - f_1 - f_2} \times h$$
Where, L = lower limit of the modal class
h = class width
$f_m$ = frequency of the modal class
$f_1$ = frequency of the class preceding the modal class
$f_2$ = frequency of the class succeeding the modal class
• The mode is the value(s) that appear most frequently in the
dataset.
• If there is one mode, the data is unimodal.
• If there are two modes, the data is bimodal.
• If there are more than two modes, the data is multimodal.
• If no value repeats, there is no mode.
Illustrative examples:
Example 1
Data: 5, 8, 7, 8, 10, 8, 9, 7, 5
Step 1: Arrange the data in ascending order (optional).
5,5,7,7,8,8,8,9,10
Step 2: Identify the most frequent value.
5 appears 2 times.
7 appears 2 times.
8 appears 3 times.
9 appears 1 time.
10 appears 1 time.
Step 3: The mode is the value with the highest frequency, which is 8 (appears 3
times).
Thus, the mode of the data is 8.
Example 2
Find mode for the grouped data given below

Step 1: Identify the modal class.


The class with the highest frequency is 20-30 with a frequency of
20.
So, 20-30 is the modal class.
Step 2: Use the mode formula.
$$\text{Mode} = L + \frac{f_m - f_1}{2 f_m - f_1 - f_2} \times h$$

From the data:
L = 20 (the lower boundary of the modal class 20-30)
$f_m$ = 20 (frequency of the modal class)
$f_1$ = 12 (frequency of the class preceding the modal class, 10-20)
$f_2$ = 18 (frequency of the class succeeding the modal class, 30-40)
h = 10 (class width)
Step 3: Apply the formula.
$$\text{Mode} = 20 + \frac{20 - 12}{2 \times 20 - 12 - 18} \times 10 = 20 + \frac{8}{10} \times 10 = 28$$
Thus, the mode is 28.
Example 3: Find out the mode of the following marks obtained by 15 students in a class.
4 6 5 7 9 8 10 4 7 6 5 8 7 7 9
Solution. (a) By arranging the data:
4 4 5 5 6 6 7 7 7 7 8 8 9 9 10
It can be observed from the array that 7 is repeated four times, i.e., more than any other item in the series, so 7 is the mode, that is, the modal marks.
(b) By converting into a discrete series:
Marks   Tally Bars   Frequency
4       ||           2
5       ||           2
6       ||           2
7       ||||         4
8       ||           2
9       ||           2
10      |            1
Total                15
Hence the mode is 7 marks.
Example 4: Calculate mode from the following data.
Marks No. of students marks No. of students
0-10 3 50-60 15
10-20 5 60-70 12
20-30 7 70-80 6
30-40 10 80-90 2
40-50 12 90-100 8
By inspection, the modal class is 50-60.
$$\text{Mode} = L + \frac{f_m - f_1}{2 f_m - f_1 - f_2} \times h$$

$L = 50$, $f_m = 15$, $f_1 = 12$, $f_2 = 12$ and $h = 10$

$$\text{Mode} = 50 + \frac{15 - 12}{30 - 12 - 12} \times 10$$

= 50 + 5
= 55
Thus the mode is 55 marks.
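
The grouped-mode formula can be sketched as follows. The function name `grouped_mode` is hypothetical. Two choices in the sketch are assumptions rather than part of the notes: when the modal class is the first or last class, the missing neighbouring frequency is taken as 0, and when several classes share the maximum frequency the first one is used.

```python
def grouped_mode(lower_limits, width, frequencies):
    """Mode = L + (f_m - f_1) / (2*f_m - f_1 - f_2) * h."""
    # Index of the modal class (first class with the maximum frequency).
    m = max(range(len(frequencies)), key=frequencies.__getitem__)
    f_m = frequencies[m]
    f_1 = frequencies[m - 1] if m > 0 else 0                     # preceding class
    f_2 = frequencies[m + 1] if m + 1 < len(frequencies) else 0  # succeeding class
    return lower_limits[m] + (f_m - f_1) / (2 * f_m - f_1 - f_2) * width


# Example 4: marks 0-10, ..., 90-100 (class width 10).
mode_4 = grouped_mode(list(range(0, 100, 10)), 10,
                      [3, 5, 7, 10, 12, 15, 12, 6, 2, 8])

# Example 5: marks 200-220, ..., 320-340 (class width 20).
mode_5 = grouped_mode(list(range(200, 340, 20)), 20,
                      [7, 15, 20, 6, 6, 4, 2])

print(round(mode_4, 4), round(mode_5, 4))
```

The formula breaks down when the modal class and both neighbours have the same frequency, because the denominator becomes zero; the sketch does not guard against that case.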
Example 5: calculate the mode of the following series.

Marks             200-220   220-240   240-260   260-280   280-300   300-320   320-340
No. of students   7         15        20        6         6         4         2
Solution. To calculate mode, we first make class intervals equal. We have fixed 20 as the class interval.
The adjusted distribution is as follows:
Marks No. of students
200-220 7
220-240 15
240-260 20
260-280 6
280-300 6
300-320 4
320-340 2

Since the items are concentrated around the class 240-260, it is the modal class. This can be verified with the help of the grouping method. Applying the formula:
$$\text{Mode} = L + \frac{f_m - f_1}{2 f_m - f_1 - f_2} \times h$$

$L = 240$, $f_m = 20$, $f_1 = 15$, $f_2 = 6$ and $h = 20$

$$\text{Mode} = 240 + \frac{20 - 15}{40 - 15 - 6} \times 20$$

= 240 + 5.2632

= 245.2632

Thus the mode is 245.2632 marks.


Measures of Dispersion/Spread
Measures of Dispersion/ Variability Measurement
➢ Range
It is the difference between the largest and smallest observations (for grouped data, between the highest and lowest class midpoints):
Range = $x_{max} - x_{min}$

➢ Variance
It is a measure of how far the observed values in a dataset fall from the arithmetic mean and is therefore a measure of spread, or more specifically, of variability. It is denoted by the Greek letter sigma squared ($\sigma^2$) or Var(X), and its formula is given by:
$$\sigma^2 = Var(x) = \frac{\sum (x_i - \bar{x})^2}{n} = \frac{\sum x_i^2}{n} - \bar{x}^2$$

where $x_i$ is an observation in the dataset, $\bar{x}$ is the mean and n is the number of observations.

For frequency data:
$$Var(x) = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}$$
➢ Standard Deviation (σ)
Standard deviation is the square root of the variance and is therefore also a measure of spread (dispersion), expressed in the same units as the data.
It shows how far the values in a dataset typically lie from the mean, and can therefore be used to identify outliers.
Standard deviation is denoted by the Greek letter sigma and, being the square root of the variance, is written as:
$$\sigma = \sqrt{Var(x)} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} = \sqrt{\frac{\sum x_i^2}{n} - \bar{x}^2}$$

For frequency data:
$$\sigma = \sqrt{Var(x)} = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$$
➢ Coefficient of Variation:
A standardized measure of dispersion, calculated as the ratio of the standard deviation to the mean.
$$C.V. = \frac{\text{Standard Deviation}}{\mu} \times 100$$
where μ denotes the mean.
Note:
1) The distribution/series for which the coefficient of variation is greater is more variable (less
homogeneous, less consistent, less stable, or less uniform).
2) The main differences between the two measures are given in the table below.

Coefficient of Variation                                  | Standard deviation
It is a relative measure of dispersion                    | It is an absolute measure of dispersion
It measures the ratio of the standard deviation to the mean | It measures how far a data point lies from the mean
It is usually used to compare the variation of different data sets | It is used to measure the dispersion of data in a single data set
Example 1
Let's say we have the following dataset:
7, 12, 5, 18, 5, 9, 10, 9, 12, 8, 12, 16
Find the variance and standard deviation of this dataset.
Solution: We first need to find the mean, which is:
$$\bar{x} = \frac{7 + 12 + 5 + 18 + 5 + 9 + 10 + 9 + 12 + 8 + 12 + 16}{12} = \frac{123}{12} = 10.25$$
The variance of this dataset is then given by:
$$\sigma^2 = \frac{7^2 + 12^2 + 5^2 + 18^2 + 5^2 + 9^2 + 10^2 + 9^2 + 12^2 + 8^2 + 12^2 + 16^2}{12} - 10.25^2 = 14.69$$

Then, the standard deviation is $\sigma = \sqrt{14.69} = 3.83$
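
This result can be checked with Python's standard `statistics` module. The sketch uses `pvariance` and `pstdev`, the population versions that divide by n as in the formulas of this unit (the sample versions `variance` and `stdev` divide by n − 1 and would give different numbers).

```python
import statistics

data = [7, 12, 5, 18, 5, 9, 10, 9, 12, 8, 12, 16]

mean = statistics.fmean(data)
variance = statistics.pvariance(data)  # sum((x - mean)^2) / n
std_dev = statistics.pstdev(data)      # square root of the variance

# Shortcut form used in the worked solution: sum(x^2)/n - mean^2.
shortcut = sum(x * x for x in data) / len(data) - mean ** 2

print(mean, round(variance, 2), round(std_dev, 2), round(shortcut, 2))
```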


Example 2:For following grouped data compute mean, variance, standard deviation,
coefficient of variation

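Step 1: Calculate the midpoints ($x_i$):
$$x_i = \frac{\text{Lower limit} + \text{Upper limit}}{2}$$
Step 2: Thus,
$$\text{mean } \mu \text{ or } \bar{x} = \frac{(15 \times 3) + (25 \times 5) + (35 \times 7) + (45 \times 2)}{17} = \frac{505}{17} = 29.71$$
Step 3: Calculate the variance, standard deviation, and coefficient of variation.
First compute $\sum f_i (x_i - \bar{x})^2 = 1423.53$. Then,
$$\text{Variance} = \frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i} = \frac{1423.53}{17} = 83.74$$
$$\text{S.D.} = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}} = \sqrt{83.74} = 9.15$$
$$C.V. = \frac{\text{Standard Deviation}}{\mu} \times 100 = \frac{9.15}{29.71} \times 100 = 30.80\ \%$$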
Example.3 The consumption of number of apples and orange on a particular week by a
family are given below. Which fruit is consistently consumed by the family?

No of Apple 3 5 6 4 3 5 4
No of oranges 1 3 7 9 2 6 2

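Solution: Let the coefficient of variation for apples be C.V1 and the coefficient of variation for oranges be C.V2.

Both fruits have the same mean, 30/7 ≈ 4.29. Dividing by n, the standard deviation is about 1.03 for apples and about 2.81 for oranges, so

C.V1 ≈ 24.04% and C.V2 ≈ 65.66%. Since C.V1 < C.V2, we can conclude that the consumption of apples is more consistent than that of oranges.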
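
The comparison can be reproduced with a short sketch. The helper `coefficient_of_variation` is a hypothetical name, and it uses the population standard deviation (`statistics.pstdev`, dividing by n), which is an assumption about the convention intended in this example.

```python
import statistics


def coefficient_of_variation(values):
    """C.V. = (standard deviation / mean) * 100, using the population SD."""
    return statistics.pstdev(values) / statistics.fmean(values) * 100


apples = [3, 5, 6, 4, 3, 5, 4]
oranges = [1, 3, 7, 9, 2, 6, 2]

cv_apples = coefficient_of_variation(apples)
cv_oranges = coefficient_of_variation(oranges)

more_consistent = "apples" if cv_apples < cv_oranges else "oranges"
print(f"C.V. apples = {cv_apples:.2f}%, C.V. oranges = {cv_oranges:.2f}%")
print(f"More consistent: {more_consistent}")
```

Both fruits have the same mean here, so the ranking would follow from the standard deviations alone; the C.V. becomes essential when the means differ.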
Examples for practice
For following grouped data compute mean, variance, standard deviation, coefficient of variation
(a)
Class     Frequency
10 - 20   15
20 - 30   25
30 - 40   20
40 - 50   12
50 - 60   8
60 - 70   5
70 - 80   3

(b)
Class     Frequency
0 - 2     5
2 - 4     16
4 - 6     13
6 - 8     7
8 - 10    5
10 - 12   4
Shape of Data

Skewness:
It means lack of symmetry.

In Statistics, for a symmetric (unimodal) distribution the mean, median and mode coincide.


Otherwise, the distribution becomes asymmetric.

If the right tail is longer, we get a positively skewed distribution for which
mean > median > mode.

while if the left tail is longer, we get a negatively skewed distribution for which
mean < median < mode.

The example of the Symmetrical curve, Positive skewed curve and Negative skewed
curve are given as follows:
Skewness Coefficient
(Karl Pearson's Coefficient of Skewness):
This is a numerical measure of skewness, based on how far the mean lies from the median or the mode in units of the standard deviation.
Coefficient of Skewness as per Karl Pearson's Measure
1. With respect to Mean and Median: $Sk = \frac{3(\text{Mean} - \text{Median})}{\sigma}$
2. With respect to Mean and Mode: $Sk = \frac{\text{Mean} - \text{Mode}}{\sigma}$
•If Sk = 0, it indicates a perfectly symmetric distribution where the data is evenly balanced on
both sides of the mean.

•If Sk > 0, it suggests a positively skewed distribution where the tail on the right side is longer or
fatter, and most data points are concentrated on the left side of the mean.

•If Sk < 0, it indicates a negatively skewed distribution where the tail on the left side is longer or
fatter, and most data points are concentrated on the right side of the mean.

Note: The value of Karl Pearson's coefficient of skewness lies between -3 and 3
Example 1:

Calculate Pearson's skewness coefficient for a dataset of exam scores:


85, 88, 92, 94, 96, 98, 100, 100, 100, 100.
Solution:
Step 1: Calculation of Mean
𝑀𝑒𝑎𝑛 = (85 + 88 + 92 + 94 + 96 + 98 + 100 + 100 + 100 + 100)/10 = 953/10 = 95.3
Mean = 95.3
Step 2: Calculation of Median
Since there are 10 data points, the median is the average of the 5th and 6th values when sorted in ascending order:
Median = (96 + 98)/2 = 194/2 = 97
Median = 97
Step 3: Calculation of standard deviation.
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{(85 - 95.3)^2 + \dots + (100 - 95.3)^2}{10} = \frac{268.1}{10} = 26.81$$
Thus, $\sigma = \sqrt{26.81} \approx 5.178$ and we get
$$Sk = \frac{3(\text{Mean} - \text{Median})}{\sigma} = \frac{3(95.3 - 97)}{5.178} \approx -0.985$$
Interpretation:
Since Sk < 0, the tail of the distribution is longer on the left side, and most of the scores are concentrated on the right side of the mean.
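
The calculation can be sketched with Python's `statistics` module. The sketch uses the population standard deviation (`pstdev`, dividing by N), consistent with the variance formula of this unit; using the sample standard deviation (dividing by N − 1) instead would give a slightly smaller magnitude for Sk. The sign, and therefore the interpretation, is the same either way.

```python
import statistics

scores = [85, 88, 92, 94, 96, 98, 100, 100, 100, 100]

mean = statistics.fmean(scores)
median = statistics.median(scores)
sigma = statistics.pstdev(scores)  # population SD: divides by N

# Pearson's coefficient with respect to mean and median.
sk = 3 * (mean - median) / sigma

direction = "negative (left) skew" if sk < 0 else "positive (right) skew or none"
print(f"mean={mean}, median={median}, sigma={sigma:.3f}, Sk={sk:.3f}: {direction}")
```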
Example2:
Karl Pearson's coefficient of skewness of a distribution is 0.32, its standard deviation is 6.5 and
mean is 29.6. Find the mode of the distribution.
Solution. We know that,

$$\text{Coefficient of skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$$

$$0.32 = \frac{29.6 - \text{Mode}}{6.5}$$

$$0.32 \times 6.5 = 29.6 - \text{Mode}$$

$$\text{Mode} = 29.6 - 2.08 = 27.52$$


Bowley's Measure:
Bowley's Skewness Coefficient, named after the British economist Arthur Lyon Bowley, is a statistical measure used to
assess the skewness or asymmetry in a probability distribution.
Unlike some other skewness measures that rely on moments or deviations from the mean, Bowley's Skewness Coefficient
is based on quartiles.
This coefficient provides a simple and intuitive way to understand the direction and magnitude of skewness in a dataset.
It is especially useful when dealing with data that may not follow a normal distribution or when a robust measure of skewness is required.
$$B = \frac{Q_1 + Q_3 - 2 Q_2}{Q_3 - Q_1}$$
where
• Q1 is the first quartile (25th percentile),
• Q2 is the second quartile (50th percentile, or median), and

• Q3 is the third quartile (75th percentile).

•Note: The value of Bowley's coefficient of skewness lies between -1 and 1


Coefficient of Bowley's Measure
•If B = 0, the distribution is perfectly symmetric about the mean (no skewness).

•If B < 0, the distribution is negatively skewed (left-skewed), meaning the tail on the left side of the distribution is longer or heavier.

•If B > 0, the distribution is positively skewed (right-skewed), indicating that the tail on the right side of the distribution is longer or heavier.
Examples of Bowley's Measure
solved after quartile introduction
Kurtosis:
Kurtosis measures the degree of peakedness of the distribution. The measure is denoted by β2.
Leptokurtic: A distribution with heavy tails and a sharp peak (β2 > 3). The curve is peaked.
Platykurtic: A distribution with light tails and a flatter peak (β2 < 3). The curve is flat-topped.
Mesokurtic: A normal distribution (β2 = 3). The curve is normal.

To compute kurtosis we need the term known as Moments


Moments

Moments are statistical measures that give certain


characteristics of the distribution.
The Four moments in statistics are……….
Formulae to calculate Moments about the Mean:

First Moment (about the mean): μ1 = 0 (it is always zero)

Second Moment (about the mean): μ2 (variance)
$$\mu_2 = \frac{\sum f(x - \bar{x})^2}{N}$$

Third Moment (about the mean): μ3 (used to measure skewness)
$$\mu_3 = \frac{\sum f(x - \bar{x})^3}{N}$$

Fourth Moment (about the mean): μ4 (used to measure kurtosis)
$$\mu_4 = \frac{\sum f(x - \bar{x})^4}{N}$$
The Coefficient of kurtosis:
To calculate β1 (Beta 1) and β2 (Beta 2) for grouped data using a tabular form, we need to
first understand what these measures represent:
β1 used as measure of skewness of the distribution.
β2 measures kurtosis (the "tailedness" of the distribution).
Formula:
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$$
Where: μ2 is the second central moment (variance).
μ3 is the third central moment (used to measure skewness).
$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$
Where: μ2 is the second central moment.
μ4 is the fourth central moment (used to measure kurtosis).
Example: For the following distribution, find the first four moments about the mean. Also find the value of β1. Is it a
symmetrical distribution?

x   2   3   4   5   6
f   1   3   7   3   1

Solution:

x       f       fx      x − x̄   f(x − x̄)   f(x − x̄)²   f(x − x̄)³   f(x − x̄)⁴
2       1       2       -2      -2         4           -8          16
3       3       9       -1      -3         3           -3          3
4       7       28      0       0          0           0           0
5       3       15      1       3          3           3           3
6       1       6       2       2          4           8           16
Total   N = 15  60              0          14          0           38

$$\bar{x} = \frac{\sum fx}{N} = \frac{60}{15} = 4$$

$$\mu_1 = \frac{\sum f(x - \bar{x})}{N} = \frac{0}{15} = 0$$

$$\mu_2 = \frac{\sum f(x - \bar{x})^2}{N} = \frac{14}{15} \approx 0.933$$

$$\mu_3 = \frac{\sum f(x - \bar{x})^3}{N} = \frac{0}{15} = 0$$

$$\mu_4 = \frac{\sum f(x - \bar{x})^4}{N} = \frac{38}{15} \approx 2.533$$

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{0}{(14/15)^3} = 0$$

In a symmetrical distribution β1 is zero. Hence this distribution is symmetrical.
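The tabular calculation above can be reproduced with a minimal Python sketch. The helper name `central_moment` is an assumption made for illustration, and exact fractions are used so that the results can be compared directly with the hand computation. The sketch also evaluates β2, which the example does not ask for; by the classification given earlier, a value below 3 indicates a platykurtic curve.

```python
from fractions import Fraction


def central_moment(xs, fs, r):
    """r-th moment about the mean for a discrete frequency distribution."""
    n = sum(fs)
    mean = Fraction(sum(f * x for x, f in zip(xs, fs)), n)
    return sum(f * (x - mean) ** r for x, f in zip(xs, fs)) / n


xs = [2, 3, 4, 5, 6]
fs = [1, 3, 7, 3, 1]

mu1, mu2, mu3, mu4 = (central_moment(xs, fs, r) for r in (1, 2, 3, 4))
beta1 = mu3 ** 2 / mu2 ** 3   # skewness measure
beta2 = mu4 / mu2 ** 2        # kurtosis measure

print(mu1, mu2, mu3, mu4)     # 0 14/15 0 38/15
print(beta1, beta2)           # 0 285/98
```

Here β2 = 285/98 ≈ 2.91, slightly below 3.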


Quartiles, Deciles, Percentiles
Partition values are statistical measures that divide a dataset into equal parts.

They help in understanding the distribution and spread of data by indicating where
certain percentages of the data fall.

A set of observations can be divided in several ways, as required.

The most used partition values are quartiles, deciles, and percentiles.

To divide the observations into two equal parts, the median is used.
Quartiles:
Quartiles are the three values that divide an ordered dataset into four equal parts.

The three quartiles are the first quartile, the second quartile, and the third quartile.

The first quartile, or lower quartile, is denoted by Q1.

The second quartile, or median, is denoted by Q2.

The third quartile, or upper quartile, is denoted by Q3.


COMPUTATION OF PARTITION VALUES (QUARTILES, DECILES AND
PERCENTILES)

• Individual Series.

While computing quartiles, deciles and percentiles, the first step will be to arrange the data in
ascending order only. After that we shall have to apply the following formulae:

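• Quartiles

$$Q_1 = \text{size of } \left(\frac{N+1}{4}\right)\text{th item}$$

$$Q_2 = \text{size of } 2\left(\frac{N+1}{4}\right)\text{th item}$$

$$Q_3 = \text{size of } 3\left(\frac{N+1}{4}\right)\text{th item}$$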
Deciles:
Deciles divide a dataset into ten equal parts based on numerical values. There are therefore
nine deciles altogether, denoted D1, D2, D3, ..., D9.
In descriptive statistics, deciles are used to group large data sets, ordered either from highest to lowest values or
vice versa.
The formulas for calculating deciles are:

$$D_1 = \left(\frac{N+1}{10}\right)\text{th item}$$
$$D_2 = \left(\frac{2(N+1)}{10}\right)\text{th item, and so on}$$
$$D_9 = \left(\frac{9(N+1)}{10}\right)\text{th item}$$

Where N is the total number of observations, D1 is the First Decile, D2 is the Second Decile, ..., D9 is the Ninth
Decile.
Percentiles

Centiles is another term for percentiles.

Percentiles (or centiles) divide the ordered observations into 100 equal parts.
They are denoted P1, P2, P3, ..., P99.
P1 is the value such that 1/100 of the data is less than or equal to P1.
$$P_1 = \left(\frac{N+1}{100}\right)\text{th item}$$
$$P_2 = \left(\frac{2(N+1)}{100}\right)\text{th item, and so on}$$
$$P_{99} = \left(\frac{99(N+1)}{100}\right)\text{th item}$$
Where N is the total number of observations, P1 is the First Percentile, P2 is the Second Percentile, P3 is the Third
Percentile, ..., P99 is the Ninety-Ninth Percentile.
Example 1:
Calculate the lower and upper quartiles of the following weights in the family: 25, 17, 32, 11, 40, 35, 13, 5,
and 46.
Solution:
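First, organize the numbers in ascending order.
5, 11, 13, 17, 25, 32, 35, 40, 46
Here N = 9. As per the quartile formula,
$$Q_1 = \left(\frac{N+1}{4}\right)\text{th item} \quad \text{and} \quad Q_3 = \left(\frac{3(N+1)}{4}\right)\text{th item}$$

Q1 = 2.5th item = 2nd item + 0.5(3rd item − 2nd item) = 11 + 0.5(13 − 11)
Q1 = 12
Q3 = 7.5th item = 7th item + 0.5(8th item − 7th item) = 35 + 0.5(40 − 35)
Q3 = 37.5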
Example 2:
Calculate Q1 and Q3 for the data related
to the age in years of 99 members in a housing society.
Solution:

Q1 = 25th item, Q3 = 75th item


Now, the 25th item falls under the cumulative frequency of 25 and the age against this cf value is 18.
Now, the 75th item falls under the cumulative frequency of 85 and the age against this cf value is 40.
Example 3: From the following information, compute the median, lower quartile, upper quartile, 7th decile
and 28th percentile:

Sr. No.  Wages    Sr. No.  Wages    Sr. No.  Wages
1        660      11       600      21       203
2        620      12       400      22       403
3        770      13       500      23       603
4        710      14       350      24       715
5        540      15       450      25       525
6        640      16       550      26       627
7        750      17       651      27       400
8        430      18       720      28       409
9        550      19       729      29       505
10       700      20       745      30       72
Solution: The data must first be arranged in ascending order.
Sr. No.  Wages    Sr. No.  Wages    Sr. No.  Wages
1        72       11       505      21       651
2        203      12       525      22       660
3        350      13       540      23       700
4        400      14       550      24       710
5        400      15       550      25       715
6        403      16       600      26       720
7        409      17       603      27       729
8        430      18       620      28       745
9        450      19       627      29       750
10       500      20       640      30       770

$$Q_2 = \text{Median} = \text{size of } 2\left(\frac{N+1}{4}\right)\text{th item} = \text{size of } 2\left(\frac{30+1}{4}\right)\text{th item} = \text{size of 15.5th item}$$
$$= \frac{\text{size of 15th item} + \text{size of 16th item}}{2} = \frac{550 + 600}{2} = \frac{1150}{2} = 575$$
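Lower or first quartile
$$Q_1 = \text{size of } \left(\frac{N+1}{4}\right)\text{th item} = \text{size of } \left(\frac{30+1}{4}\right)\text{th item} = \text{size of 7.75th item}$$
$$= \text{size of 7th item} + \frac{3}{4}\left(\text{size of 8th item} - \text{size of 7th item}\right)$$
= 409 + 0.75(430 − 409) = 409 + 15.75
= 424.75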
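Upper or third quartile
$$Q_3 = \text{size of } 3\left(\frac{N+1}{4}\right)\text{th item} = \text{size of } 3\left(\frac{30+1}{4}\right)\text{th item} = \text{size of 23.25th item}$$
$$= \text{size of 23rd item} + \frac{1}{4}\left(\text{size of 24th item} - \text{size of 23rd item}\right)$$
= 700 + 0.25(710 − 700) = 700 + 2.5
= 702.5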
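7th Decile
$$D_7 = \text{size of } 7\left(\frac{N+1}{10}\right)\text{th item} = \text{size of } 7\left(\frac{30+1}{10}\right)\text{th item} = \text{size of 21.7th item}$$
$$= \text{size of 21st item} + \frac{7}{10}\left(\text{size of 22nd item} - \text{size of 21st item}\right)$$
= 651 + 0.7(660 − 651) = 651 + 6.3
= 657.3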
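28th percentile
$$P_{28} = \text{size of } 28\left(\frac{N+1}{100}\right)\text{th item} = \text{size of } 28\left(\frac{30+1}{100}\right)\text{th item} = \text{size of 8.68th item}$$
$$= \text{size of 8th item} + \frac{68}{100}\left(\text{size of 9th item} - \text{size of 8th item}\right)$$
= 430 + 0.68(450 − 430) = 430 + 13.60
= 443.60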
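The positional rule used throughout Example 3 (take the k(N+1)/parts-th item and interpolate linearly between neighbouring items) can be sketched in Python as follows. The function name `partition_value` and its parameters are hypothetical, chosen only for this illustration; clamping the position to the range 1..N is an added assumption for very small samples and is not part of the slides.

```python
import math


def partition_value(data, k, parts):
    """k-th partition value (parts=4 quartiles, 10 deciles, 100 percentiles)."""
    ordered = sorted(data)                      # step 1: ascending order
    n = len(ordered)
    position = k * (n + 1) / parts              # 1-based, possibly fractional
    position = min(max(position, 1), n)         # assumption: stay inside the data
    lower = math.floor(position)
    fraction = position - lower
    upper = min(lower + 1, n)
    low_item, high_item = ordered[lower - 1], ordered[upper - 1]
    return low_item + fraction * (high_item - low_item)


# Wages from Example 3, in the original (unsorted) order.
wages = [660, 620, 770, 710, 540, 640, 750, 430, 550, 700,
         600, 400, 500, 350, 450, 550, 651, 720, 729, 745,
         203, 403, 603, 715, 525, 627, 400, 409, 505, 72]

print(partition_value(wages, 2, 4))                  # median: 575.0
print(partition_value(wages, 1, 4))                  # Q1: 424.75
print(partition_value(wages, 3, 4))                  # Q3: 702.5
print(round(partition_value(wages, 7, 10), 2))       # D7: 657.3
print(round(partition_value(wages, 28, 100), 2))     # P28: 443.6
```

The decile and percentile results are rounded because the fractional positions 21.7 and 8.68 are not exactly representable in binary floating point.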
Example 4:(Bowley’s Coefficient of Skewness):

Calculate Bowley's Measure of Skewness for the following dataset representing the ages of a group of
people in a sample: 20, 24, 28, 32, 35, 40, 42, 45, 50.
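Solution: The data are already in ascending order, with N = 9.
Median: Q2 = 5th item = 35 (the middle value)
First quartile: Q1 = 2.5th item = 24 + 0.5(28 − 24)
Q1 = 26
Third quartile: Q3 = 7.5th item = 42 + 0.5(45 − 42)
Q3 = 43.5
Substitute the above values in the formula
$$B = \frac{Q_1 + Q_3 - 2Q_2}{Q_3 - Q_1} = \frac{26 + 43.5 - 2(35)}{43.5 - 26} = \frac{-0.5}{17.5}$$

B ≈ -0.03
Since B is negative (B < 0), the distribution is negatively skewed (left-skewed). This means that the tail
of the distribution is longer on the left side, indicating that there may be outliers or low values on the
left side of the data. Because B is very close to 0, the skewness is slight and the distribution is nearly symmetric.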
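As a cross-check of Example 4, Bowley's coefficient can be computed with Python's standard `statistics` module. This is a sketch under one assumption worth stating: `statistics.quantiles` with `method="exclusive"` interpolates at positions i(N+1)/n, which should coincide with the (N+1)-based rule used in these slides. The function name `bowley_skewness` is hypothetical.

```python
import statistics


def bowley_skewness(data):
    """Bowley's coefficient B = (Q1 + Q3 - 2*Q2) / (Q3 - Q1)."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return (q1 + q3 - 2 * q2) / (q3 - q1)


ages = [20, 24, 28, 32, 35, 40, 42, 45, 50]

print(statistics.quantiles(ages, n=4, method="exclusive"))  # [26.0, 35.0, 43.5]
print(round(bowley_skewness(ages), 3))                      # -0.029
```

Because the formula only uses quartiles, a single extreme value in either tail does not change B, which is why the slides describe it as a robust measure.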
Example 5:

Calculate the D1, D5 from the following weights in a family:


25, 17, 32, 11, 40, 35, 13, 5, and 46.
Solution:
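First, organize the numbers in ascending order.
5, 11, 13, 17, 25, 32, 35, 40, 46
Here N = 9.
$$D_1 = \left(\frac{N+1}{10}\right)\text{th item} = \left(\frac{9+1}{10}\right)\text{th item}$$
$$D_5 = \left(\frac{5(N+1)}{10}\right)\text{th item} = \left(\frac{5(9+1)}{10}\right)\text{th item}$$
D1 = 1st item = 5
D5 = 5th item = 25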
Example 6:
Calculate P10 and P75 for the data related to the age (in years) of 99 members in a housing
society.
Solution:

P10 = 10th item


Now, the 10th item falls under the cumulative frequency of 20 and the age against this cf value is 10.
P10 = 10 years
P75 = 75th item
Now, the 75th item falls under the cumulative frequency of 85 and the age against this cf value is 40.
P75 = 40 years
Example 7:
In a frequency distribution, the coefficient of skewness based on the quartiles is 0.6. If the sum of the upper and the
lower quartile is 100 and the median is 38, determine the values of the upper and the lower quartiles.

Solution.
$$\text{Coefficient of skewness based on quartiles} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$

$$Q_1 + Q_3 = 100 \quad \text{...(i)}$$
$$Q_2 = 38$$
$$0.6 = \frac{100 - 2 \times 38}{Q_3 - Q_1}$$

$$Q_3 - Q_1 = \frac{24}{0.6} = 40 \quad \text{...(ii)}$$

From equations (i) and (ii) we get

$$Q_3 = 70 \text{ and } Q_1 = 30$$
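The two simultaneous equations in this solution can be solved mechanically. The sketch below assumes a nonzero coefficient B; the function name `quartiles_from_bowley` is hypothetical.

```python
def quartiles_from_bowley(b, quartile_sum, median):
    """Recover (Q1, Q3) from B, Q1 + Q3 and the median. Requires b != 0."""
    quartile_diff = (quartile_sum - 2 * median) / b   # equation (ii): Q3 - Q1
    q3 = (quartile_sum + quartile_diff) / 2           # add (i) and (ii)
    q1 = (quartile_sum - quartile_diff) / 2           # subtract (ii) from (i)
    return q1, q3


q1, q3 = quartiles_from_bowley(b=0.6, quartile_sum=100, median=38)
print(round(q1, 2), round(q3, 2))  # 30.0 70.0
```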
Data Visualization
Data Visualization: Histogram

● Definition: A histogram is a graphical representation of the
distribution of numerical data. It is similar to a bar chart but
specifically used for quantitative data that is divided into ranges (also
called bins or intervals). Histograms are essential for visualizing the
frequency of data points in each range, helping to understand the
shape and spread of the data distribution.
Key Concepts Related to Histograms
● Bins (Intervals)
Definition: A bin is a continuous range of values within which data
points are grouped together. Each bin represents a specific interval of
values, and the height of the bar for each bin shows the number of data
points (or frequency) that fall within that range.

Example: If you're analyzing the test scores of students (0-100), bins
could be set at intervals of 10 (0-10, 10-20, 20-30, etc.).
Frequency Distributions

To construct a frequency distribution, we must divide the range of
the data into intervals, which are usually called class intervals,
cells, or bins.

Choosing the number of bins approximately equal to the square
root of the number of observations often works well in practice.
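The square-root rule can be sketched in Python as below, using the 30 wage values from Example 3 as input. The function name `frequency_distribution`, the choice of equal-width bins spanning the minimum to the maximum, and the convention of placing the maximum value in the last bin are all assumptions made for this illustration.

```python
import math


def frequency_distribution(data):
    """Equal-width bins, with the bin count chosen by the square-root rule."""
    n = len(data)
    bins = max(1, round(math.sqrt(n)))          # about sqrt(n) bins
    low, high = min(data), max(data)
    width = (high - low) / bins
    if width == 0:                              # all values identical
        return [low, high], [n]
    counts = [0] * bins
    for x in data:
        index = min(int((x - low) / width), bins - 1)   # max value goes in last bin
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts


wages = [660, 620, 770, 710, 540, 640, 750, 430, 550, 700,
         600, 400, 500, 350, 450, 550, 651, 720, 729, 745,
         203, 403, 603, 715, 525, 627, 400, 409, 505, 72]

edges, counts = frequency_distribution(wages)
print(len(counts))   # 5
print(counts)        # [2, 1, 6, 10, 11]
```

With 30 observations the rule gives about 5.5, rounded here to 5 bins. In practice the bin edges would usually be rounded to convenient values.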
Some of the important features of histograms.
➢ Equal units along the horizontal axis (the X axis, or abscissa) reflect the
various class intervals of the frequency distribution.

➢ Equal units along the vertical axis (the Y axis, or ordinate) reflect increases
in frequency. (The units along the vertical axis do not have to be the same
width as those along the horizontal axis.)

➢ The intersection of the two axes defines the origin at which both numerical
scales equal 0

➢ It is considered good practice to use wiggly lines to highlight breaks in
scale.
● Choosing Bin Width:
Too Few Bins: If the bin width is too large, the histogram may
not show important details about the distribution.
Too Many Bins: If the bin width is too small, the histogram
can become overly detailed and noisy.
Constructing a Histogram (Equal Bin Widths)

1. Label the bin (class interval) boundaries on a horizontal
scale.
2. Mark and label the vertical scale with the frequencies or
the relative frequencies.
3. Above each bin, draw a rectangle whose height is equal to
the frequency (or relative frequency) corresponding to
that bin.
Example 1

Consider a data set of 26 children of ages 1-6 years

Age 1 2 3 4 5 6

Frequency 5 3 7 5 4 2

Histogram of this data is as given here…


[Figure: Histogram of the children's age data. Horizontal axis: age (1 to 6); vertical axis: frequency (0 to 7).]
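The three construction steps can be illustrated for the children's age data with a small text-based sketch in Python. To keep it dependency-free, it draws the bars horizontally, one row per bin, rather than as vertical rectangles; the function name `text_histogram` and the `#` symbol are arbitrary choices for this illustration.

```python
def text_histogram(labels, frequencies, symbol="#"):
    """One row per bin: the label, then a bar whose length equals the frequency."""
    rows = []
    for label, freq in zip(labels, frequencies):
        rows.append(f"{label:>3} | {symbol * freq} ({freq})")
    return "\n".join(rows)


ages = [1, 2, 3, 4, 5, 6]
frequencies = [5, 3, 7, 5, 4, 2]

chart = text_histogram(ages, frequencies)
print(chart)
```

The longest bar belongs to age 3 (frequency 7), matching the tallest rectangle in the histogram.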
Ex.2 Draw a histogram for the following data distribution:

Class Intervals   50-60   60-70   70-80   80-90   90-100   100-110
Frequency         30      25      45      15      20       40
Ex.3 Given below is the table showing the approximate lengths, in mm, of 40 leaves taken
from different parts of a certain species.

Length (mm)        25-30   30-35   35-40   40-45   45-50   50-55   55-60
Number of leaves   1       4       8       10      8       7       2
Data Visualization:
Box Plot (Box-and-Whisker Plot)
The Box Plot is a graphical representation of a dataset’s five-
number summary: minimum, first quartile (25th percentile), median
(50th percentile), third quartile (75th percentile), and maximum.
Developed by John Tukey in the 1970s, this plotting system has
been recognized for its concise delivery of the distribution of a
dataset, thus simplifying the data analysis process.
It’s a powerful tool in data analysis because it can clearly highlight
the dataset’s central tendency, dispersion, and skewness. Moreover,
it effectively visualizes outliers, providing a complete picture of the
data distribution. This is particularly useful when comparing
multiple datasets, as it offers a clear, comparative visualization of
the different data distributions.
The five numbers used in a box plot are:

1. Minimum
2. First Quartile (Q1)
3. Median (Q2)
4. Third Quartile (Q3)
5. Maximum
The Essential Components of a Box Plot
➢ The median, or second quartile (Q2), is the middle value that separates the data into
two halves. It measures central tendency, providing a snapshot of the data’s center.

➢ Quartiles Q1 and Q3, marking the box ends, reflect the data’s dispersion. These
quartiles represent the 25th and 75th percentiles of the dataset, respectively. The
Q1 mark represents the median of the first half of the data, while the Q3 represents
the median of the second half.
➢ The whiskers are lines extending from the box, reaching the minimum and
maximum non-outlier data points.
Usually, the lower whisker extends from Q1 to the smallest non-outlier data point,
and the upper whisker extends from Q3 to the largest non-outlier data point.

➢ The length of the box is the Inter Quartile Range (IQR), calculated by subtracting
Q1 from Q3 (IQR = Q3 – Q1).
➢ The IQR measures the middle 50% of the data, measuring dispersion or spread.
➢ Outliers are typically identified as data points that fall below (Q1 – 1.5 × IQR)
or above (Q3 + 1.5 × IQR). These outliers are represented as individual points
outside the whiskers in the box plot.
➢ A point more than 3 interquartile ranges from the box edge is called an
extreme outlier.
How to Interpret a Box Plot
➢ Length of the Box: The length of the box (between Q1
and Q3) represents the IQR, showing the spread of the
middle 50% of the data.

➢ When the median is in the middle of the box, and the
whiskers are about the same on both sides of the box, then
the distribution is symmetric.

➢ When the median is closer to the bottom of the box, and if
the whisker is shorter on the lower end of the box, then
the distribution is positively skewed (skewed right).

➢ When the median is closer to the top of the box, and if the
whisker is shorter on the upper end of the box, then the
distribution is negatively skewed (skewed left).
Conclusion
Understanding these components of a box plot allows for rapid
comprehension of the data’s distribution, spread, and skewness. It
also aids in identifying and visualizing potential outliers, which can
be invaluable in data analysis.
Example 1:
Test scores for a college statistics class held during the day are:
99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59, 45, 77, 84.5, 84, 70, 72, 68, 32, 79, 90
Find the smallest and largest values, the median, and the first and third quartiles for the day class. Also construct a box plot
for the data.
Solution: Arranging the data in ascending order:
32, 32, 45, 55.5, 56, 56, 59, 68, 70, 72, 77, 78, 79, 80, 81, 84, 84.5, 90, 90, 99

Population size: 20
Minimum: 32
Maximum: 99
Median: 74.5
First quartile: 56
Third quartile: 83.25
Interquartile Range (IQR): 27.25
1.5 × IQR: 40.875
Q1 − 1.5 × IQR: 15.125
Q3 + 1.5 × IQR: 124.125
Outliers: none
Since the minimum value of the data set is greater than Q1 – 1.5 × IQR and the maximum value of the data set is
less than Q3 + 1.5 × IQR, there are no outliers.
Thus, plotting all five numbers on a scaled line, we get the box plot.
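The figures in this solution can be reproduced with a short Python sketch. It assumes that `statistics.quantiles` with `method="exclusive"` follows the same (N+1)-based positions as the slides; the function name `box_plot_summary` and the dictionary keys are hypothetical.

```python
import statistics


def box_plot_summary(data):
    """Five-number summary plus IQR and the 1.5 * IQR outlier fences."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    return {
        "min": min(data),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(data),
        "iqr": iqr,
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }


day_scores = [99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59,
              45, 77, 84.5, 84, 70, 72, 68, 32, 79, 90]

summary = box_plot_summary(day_scores)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Both the minimum (32) and the maximum (99) lie inside the fences (15.125 and 124.125), so the whiskers extend all the way to them.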
Example 2:
Test scores for college statistics class held during the evening are:
98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98, 90, 80, 84.5, 85, 79, 78, 98, 90, 79, 81, 25.5
Find the smallest and largest values, the median, and the first and third quartiles for the
evening class. Also construct a box plot for the data.
Solution: Arranging the data in ascending order:
25.5, 45, 65, 68, 76, 78, 78, 79, 79, 80, 81, 81, 83, 84.5, 85, 88, 89, 90, 90, 98, 98, 98

Population size: 22
Minimum: 25.5
Maximum: 98
Median: 81
First quartile: 77.5
Third quartile: 89.25
Interquartile Range (IQR): 11.75
1.5 × IQR: 17.625
Q1 − 1.5 × IQR: 59.875
Q3 + 1.5 × IQR: 106.875
Outliers: 25.5, 45
Since the data points 25.5 and 45 are less than Q1 – 1.5 × IQR, they are
outliers on the lower side. The maximum value of the data set is less than Q3 + 1.5 × IQR, so
there are no outliers on the upper side.
Here the lower whisker extends down to 65 and the upper whisker extends up to 98.
Thus, plotting all five numbers on a scaled line, with the outliers shown as individual points, we get the
box plot.
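The outlier and whisker rules applied in this solution can be sketched as follows. The sketch also applies the 3 × IQR rule for extreme outliers stated earlier in the slides. As before, `statistics.quantiles` with `method="exclusive"` is assumed to match the (N+1)-based quartile positions, and the function name `classify_points` is hypothetical.

```python
import statistics


def classify_points(data):
    """Whisker ends, outliers (beyond 1.5 * IQR) and extreme outliers (beyond 3 * IQR)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

    inside = [x for x in data if inner_low <= x <= inner_high]
    outliers = sorted(x for x in data if x < inner_low or x > inner_high)
    extreme = [x for x in outliers if x < outer_low or x > outer_high]
    return {
        "whiskers": (min(inside), max(inside)),  # smallest and largest non-outliers
        "outliers": outliers,
        "extreme": extreme,
    }


evening_scores = [98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98,
                  90, 80, 84.5, 85, 79, 78, 98, 90, 79, 81, 25.5]

result = classify_points(evening_scores)
print(result["whiskers"])   # (65, 98)
print(result["outliers"])   # [25.5, 45]
print(result["extreme"])    # [25.5]
```

Under the 3 × IQR rule the lower outer limit is 77.5 − 35.25 = 42.25, so 25.5 counts as an extreme outlier while 45 does not.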
From the previous two box plots, which box plot has the wider spread for the middle 50% of
the data (the data between the first and third quartiles)? What does this mean for that set of
data in comparison to the other set of data?
Conclusion:
The first data set has the wider spread for the middle 50% of the
data. The IQR for the first data set is greater than the IQR for the
second set.
This means that there is more variability in the middle 50% of the
first data set.
Example.3
The following data are the heights of 40 students in a statistics class.
59, 60, 61, 62, 62, 63, 63, 64, 64, 64, 65, 65, 65, 65, 65, 65, 65, 65, 65, 66, 66, 67, 67, 68, 68, 69, 70, 70, 70, 70, 70, 71,
71, 72, 72, 73, 74, 74, 75, 77
Construct a box plot

Solution
Population size: 40
Median: 66
Minimum: 59
Maximum: 77
First quartile: 64.25
Third quartile: 70
Interquartile Range: 5.75
Outliers: none
Software for statistical analysis and visualization of
data

SAS (Statistical Analysis System), S-plus, R, Matlab, Minitab,
BMDP, Stata, SPSS, StatXact, Statistica, LISREL, JMP, GLIM,
HIL, MS Excel, etc.
Some useful websites for more information on statistical software:

http://www.galaxy.gmu.edu/papers/astr1.html
http://ourworld.compuserve.com/homepages/Rainer_Wuerlaender/statsoft.htm#archiv
http://www.R-project.org
END
