
Chap 2

Uploaded by

adomibob
Copyright
© © All Rights Reserved
Bibliography

Chapter 2 UNIVARIATE DATA SETS


Suppose we want to conduct a study that focuses on a single characteristic of the elements of a data set. Since
the data are collected on a single variable, we are dealing with a univariate data set, whose main features we want to
summarise in order to identify patterns in the data.
This chapter presents the tools and methods used in univariate descriptive statistics.

2.1 FREQUENCY DISTRIBUTION


Once the required data have been gathered, the next step is to organise the data in some meaningful way.
The most basic way of organising data is to construct a frequency distribution. A frequency distribution is a
table representing all the possible values of the variable along with their corresponding frequencies.
Frequency distributions can be classified as categorical or numerical, discrete or continuous, ungrouped or
grouped depending on the type of data. Depending on which form is most helpful, a researcher may use
simple frequency distributions, relative frequency distributions or cumulative frequency distributions.
2.1.1 Categorical frequency distribution
Consider a data set of size n, described according to the qualitative variable X, and let x1, x2, ..., xk denote
the possible values (categories) assumed by that variable. First let us introduce some basic terminology.
• Frequency ni of the value xi is the number of times the value xi appears in the data set.
The sum of the frequencies equals the data set size, that is, $n = \sum_{i=1}^{k} n_i$.

• Relative frequency fi of the value xi is the proportion of the data set corresponding to the value xi. It is
obtained by dividing the frequency ni by the data set size n: $f_i = n_i / n$.

A relative frequency can be expressed as a fraction, a decimal number, or a percentage.

The sum of the relative frequencies always equals 1, i.e., $\sum_{i=1}^{k} f_i = 1$ (or 100%).

Furthermore, in the case of an ordinal qualitative variable X:

• Less than cumulative frequency of the value xi is the number of elements in the data set with values less
than or equal to xi.
• Greater than cumulative frequency of the value xi is the number of elements in the data set with values
greater than or equal to xi.
• Less than cumulative relative frequency of the value xi is the proportion of elements in the data set with
values less than or equal to xi.
• Greater than cumulative relative frequency of the value xi is the proportion of elements in the data set
with values greater than or equal to xi.

Descriptive Statistics 1
Example 2.1: The brands of 24 cars have been recorded in a car park. If R stands for Renault, T for Toyota,
H for Hyundai, V for Volkswagen and O for other brands, the observations collected are given below:
H R T H V R
V H V O T V
R O V R H O
V R T V V R
Table 2.1: Individual series of the car brands.
This is an example of an individual series, in which the raw data are listed observation by observation. Because raw data
in this form are hard to interpret, we construct the following categorical frequency table:
Car brand (xi)    Frequency (ni)    Relative frequency (fi)
Renault           6                 0.250
Toyota            3                 0.125
Hyundai           4                 0.167
Volkswagen        8                 0.333
Other             3                 0.125
Total             24                1
Table 2.2: Categorical frequency distribution with relative frequency of car brands.
The variable under study "car brands" is a nominal categorical variable. The five category names are listed
in the first column. The order of the brands has no significance. The second column shows the number of
cars for each brand: there are 6 Renaults, 3 Toyotas, 4 Hyundais, 8 Volkswagens and 3 cars of other brands. These
are the frequencies. The third column indicates the relative frequency associated with each brand. The total
of the frequency column, 24, represents the total number of cars included in the sample. The relative
frequencies add up to 1. For a categorical nominal variable, cumulative frequencies are meaningless.
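The counting in Example 2.1 can be reproduced with a short script. This is only a minimal sketch: the variable names (`observations`, `counts`) are illustrative choices, and the one-letter codes are those of Table 2.1.

```python
from collections import Counter

# Raw observations from Table 2.1, row by row
# (H = Hyundai, R = Renault, T = Toyota, V = Volkswagen, O = other brands).
observations = "HRTHVR VHVOTV ROVRHO VRTVVR".replace(" ", "")

counts = Counter(observations)   # frequency n_i of each category
n = sum(counts.values())         # data set size n

for brand in "RTHVO":
    n_i = counts[brand]
    f_i = n_i / n                # relative frequency f_i = n_i / n
    print(f"{brand}: n_i = {n_i}, f_i = {f_i:.3f}")
```

Computing each relative frequency directly from n_i / n, and rounding only when printing, keeps the column total at exactly 1.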
2.1.2 Numerical frequency distribution
a) Discrete data
The calculation of the frequency distribution of ordinal categorical variables and discrete quantitative
variables is similar.
Example 2.2: A survey conducted on 20 families in a locality revealed the following results for the number
of children in a family:
4 1 3 4 2
3 4 2 1 4
4 3 6 4 1
1 6 4 2 3
Table 2.3: Individual series of the number of children in a family.
Here, raw data are given as a list of numbers and as can be seen, they are hard to interpret in this format.
The given data are arranged in the ungrouped frequency table below:


Number of children (xi)    Frequency (ni)    Relative frequency (fi)    Less than cumulative frequency    Greater than cumulative frequency
1                          4                 0.20                       4                                 20
2                          3                 0.15                       7                                 16
3                          4                 0.20                       11                                13
4                          7                 0.35                       18                                9
5                          0                 0                          18                                2
6                          2                 0.10                       20                                2
Total                      20                1
Table 2.4: Ungrouped frequency distribution with relative and cumulative frequency for number of children in a family.
The variable "number of children in a family" is a discrete quantitative variable. In the first column, the
values taken by the variable are listed in ascending order. The second and third columns show the frequency
and relative frequency, respectively for each data value. The fourth column shows less than cumulative
frequencies. They are obtained by adding to the frequency of each data value the frequencies of all the data
values below it starting at the top of the table. The final number in the less than cumulative frequency
column, 20, matches the total number of families in the sample. Less than cumulative frequency is used to
determine the number of observations that lie at or below a particular value in a data set. Working backwards,
there are 20 families with 6 or fewer children, 18 families with 5 or fewer children, 18 families with 4 or
fewer children, 11 families with 3 or fewer children and so on. The fifth column shows greater than
cumulative frequencies. They are obtained by adding to the frequency of each data value the frequencies
of all the data values above it starting from the bottom of the table. The first number in the greater than
cumulative frequency column, 20, matches the total number of families in the sample. Greater than
cumulative frequency is used to determine the number of observations that lie at or above a particular value in a
data set. Working forwards, there are 20 families with 1 or more children, 16 families with 2 or more
children, 13 families with 3 or more children and so on. To obtain a less than (resp., greater than) cumulative
relative frequency column, we can follow the same steps that we did for the less than (resp., greater than)
cumulative frequency column.
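The two cumulative columns described above are running totals taken from the top and from the bottom of the frequency column. The sketch below reproduces them for the data of Example 2.2; the variable names are illustrative, and the empty value 5 is kept explicitly so that the rows match the table.

```python
from collections import Counter
from itertools import accumulate

# Raw data from Table 2.3 (number of children in 20 families).
data = [4, 1, 3, 4, 2, 3, 4, 2, 1, 4, 4, 3, 6, 4, 1, 1, 6, 4, 2, 3]

counts = Counter(data)
values = list(range(min(data), max(data) + 1))   # 1..6, including the unobserved value 5
freqs = [counts[v] for v in values]              # a Counter returns 0 for a missing key
n = len(data)

# Running totals from the top (less than) and from the bottom (greater than).
less_than = list(accumulate(freqs))
greater_than = list(accumulate(reversed(freqs)))[::-1]

for v, n_i, lt, gt in zip(values, freqs, less_than, greater_than):
    print(v, n_i, round(n_i / n, 2), lt, gt)
```

Dividing the cumulative columns by n gives the corresponding cumulative relative frequencies.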
b) Continuous data
For a continuous variable, the different values of the variable are grouped into intervals called classes. These
classes must be mutually exclusive and collectively exhaustive. In other words, they do not overlap and
include all data values. One way of achieving this is to consider half-open intervals of the form [xi , xi+1[ where
xi represents the lower class limit and xi+1 the upper class limit. Some further terminology:
• Size (or width) ai of the class [xi , xi+1[ is the difference xi+1 - xi.
• Mid-point (or central value) ci of the class [xi , xi+1[ is the quantity (xi + xi+1) / 2.
• Frequency ni of the class [xi , xi+1[ is the number of data values that fall within the interval [xi , xi+1[.


• Relative frequency fi of the class [xi , xi+1[ is the proportion of values corresponding to the interval [xi , xi+1[.
• Frequency density di of the class [xi , xi+1[ is the fraction ni / ai.
• Relative frequency density hi of the class [xi , xi+1[ is the fraction fi / ai.
• Less than cumulative frequency of the class [xi , xi+1[ is the sum of frequencies of earlier classes and the
class [xi , xi+1[.
• Greater than cumulative frequency of the class [xi , xi+1[ is the sum of frequencies of the class [xi , xi+1[
and the classes which succeed it.
• Less than cumulative relative frequency of the class [xi , xi+1[ is the sum of relative frequencies of previous
classes and the class [xi , xi+1[.
• Greater than cumulative relative frequency of the class [xi , xi+1[ is the sum of relative frequencies of the
class [xi , xi+1[ and the classes which come after it.

The number and width of the classes are left to the choice of the researcher. When grouping the
values of a continuous variable into classes, there are two options:
1. Classes have equal size. The number of classes k, for a data set of given size n, is calculated using one of
the following rules of thumb:
- Sturges' formula: $k = 1 + 3.3 \log_{10} n$

- Yule's formula: $k = 2.5 \sqrt[4]{n}$

The result of the calculations is rounded up to the nearest integer.
If e represents the range of the data set (i.e., the difference between the largest value and smallest value of
the variable), then the size a of each class is given by:
a = e/k
For convenience, the value obtained is rounded to the nearest whole number.
2. Classes are of unequal size. The number of classes should not be too small to avoid loss of information,
nor too large to avoid lots of empty classes. Typically, the number of classes should be between 5 and 20.
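The two rules of thumb for equal-size classes can be written as small functions. This is a sketch under stated assumptions: the function names `sturges` and `yule` are illustrative, the logarithm is taken in base 10, and the result is rounded up as the text prescribes. The sample sizes in the loop are arbitrary.

```python
import math

def sturges(n: int) -> int:
    """Number of classes by Sturges' formula, k = 1 + 3.3 log10(n), rounded up."""
    return math.ceil(1 + 3.3 * math.log10(n))

def yule(n: int) -> int:
    """Number of classes by Yule's formula, k = 2.5 * n^(1/4), rounded up."""
    return math.ceil(2.5 * n ** 0.25)

# Arbitrary sample sizes, chosen only to compare the two rules.
for n in (28, 100, 1000):
    print(n, sturges(n), yule(n))
```

The two rules usually give the same or neighbouring values of k, so either can serve as a starting point.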
Example 2.3: As part of a clinical study, 28 volunteers were weighed. The results, expressed in kg, are
recorded in the table below:
40.3 41.0 43.8 40.0 39.7 38.2 40.6
42.3 43.1 42.6 41.2 40.8 41.4 43.9
40.4 38.5 40.1 38.9 38.5 42.5 39.7
39.2 39.1 40.7 41.6 42.1 41.3 38.8
Table 2.5: Individual series of the weight of volunteers.
The variable "weight of volunteer" is a continuous quantitative variable. For a continuous variable, it is easier
to handle the data by grouping the values into classes. These classes must be mutually exclusive and
collectively exhaustive. Here, we have opted for classes of equal size.


Consequently, the number of classes k, for the n = 28 observations in Table 2.5, is:
- according to Sturges' formula: $k = 1 + 3.3 \log_{10} 28 \approx 5.78$, rounded up to 6;

- according to Yule's formula: $k = 2.5 \sqrt[4]{28} \approx 5.75$, rounded up to 6.

The range of the data set is:
e = 43.9 - 38.2 = 5.7 kg
The size of each class is:
a = e / k = 5.7 / 6 = 0.95 ≈ 1 kg
The results are given in the grouped frequency table below:
Weight (kg)    Frequency (ni)    Relative frequency (fi)    Less than cumulative frequency    Greater than cumulative frequency
[38-39[        5                 0.18                       5                                 28
[39-40[        4                 0.14                       9                                 23
[40-41[        7                 0.25                       16                                19
[41-42[        5                 0.18                       21                                12
[42-43[        4                 0.14                       25                                7
[43-44[        3                 0.11                       28                                3
Total          28
Table 2.6: Grouped frequency distribution with relative and cumulative frequency for weight of volunteers.
The first column shows the weight classes. The lower limit of the first class is chosen so that the first class
includes the minimum value, 38.2 kg. Obviously, the last class includes the maximum value, 43.9 kg. The
number of data values falling in each class is counted. The results are shown in the frequency column. The
third column shows the relative frequencies. Less than and greater than cumulative frequencies are
calculated by following the same steps as in the case of a discrete variable. For instance, from the fourth
column of Table 2.6, 28 volunteers weigh below 44 kg, 25 volunteers weigh below 43 kg, 21 volunteers weigh
below 42 kg and so on. From the fifth column, 28 volunteers weigh 38 kg or more, 23 volunteers weigh 39
kg or more, 19 volunteers weigh 40 kg or more, and so on.
It is worth noting that grouping data into classes results in a loss of information compared with raw data.
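The grouping step of Example 2.3 can be sketched as follows. The lower limit 38, the class width 1 kg and the number of classes 6 are taken from Table 2.6; the variable names are illustrative.

```python
# Raw data from Table 2.5 (weights in kg of 28 volunteers).
weights = [
    40.3, 41.0, 43.8, 40.0, 39.7, 38.2, 40.6,
    42.3, 43.1, 42.6, 41.2, 40.8, 41.4, 43.9,
    40.4, 38.5, 40.1, 38.9, 38.5, 42.5, 39.7,
    39.2, 39.1, 40.7, 41.6, 42.1, 41.3, 38.8,
]

lower, width, k = 38, 1, 6       # classes [38-39[, [39-40[, ..., [43-44[

freqs = [0] * k
for w in weights:
    # Half-open classes: a value equal to a class limit goes to the class it opens.
    index = int((w - lower) // width)
    freqs[index] += 1

n = len(weights)
rel_freqs = [round(n_i / n, 2) for n_i in freqs]

for i, (n_i, f_i) in enumerate(zip(freqs, rel_freqs)):
    print(f"[{lower + i * width}-{lower + (i + 1) * width}[", n_i, f_i)
```

Once the values are replaced by class counts, the individual weights can no longer be recovered, which is the loss of information mentioned above.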

2.2 GRAPHIC PRESENTATION


After data have been organised into a frequency distribution, they are usually presented through graphs
and diagrams for greater clarity. Graphic presentation, a visual form of presenting data, depends on the type
of variable under consideration.
Graphs are broadly divided into two types: frequency distribution graphs, which show the differences in frequencies
among values of the variable, and cumulative frequency graphs, which show how the cumulative
frequencies evolve as the values of the variable increase.


2.2.1 Frequency distribution graphs


A frequency distribution graph can take several forms depending on the types of data collected.
a) Qualitative data
For qualitative data, there are many types of graphical presentation. The most common are the bar graph and
the pie chart.
▪ Bar graph
The horizontal axis lists only the categories of the variable, so it has no scale, and the vertical axis represents
the frequencies (or relative frequencies). A bar graph consists of vertical bars of equal width, separated
from each other by equal gaps to make clear that they represent categories. The height of a bar is proportional
to the frequency (or relative frequency) of the corresponding category. For nominal data, categories can be put
in any order. For ordinal data, categories are arranged in their natural order.

Example 2.4: Here is the bar graph showing the car brands of Example 2.1.

Fig. 2.1: Bar graph showing the frequency by car brands.

▪ Pie chart
A pie chart is a circle, representing the entire data, divided into sectors that represent the possible categories
of the variable. The area of the sector for a particular category is proportional to the corresponding
frequency (or relative frequency). The angle at the centre αi corresponding to the sector of a particular
category can be calculated using the following formula:
$\alpha_i = \dfrac{n_i \times 360^\circ}{n} = f_i \times 360^\circ$
where n is the data set size and ni (resp., fi) is the frequency (resp., the relative frequency) of the category.


It is worth mentioning that the sum of all the central angles in a pie chart is 360°.
Example 2.5: To construct the pie chart showing the car brands of Example 2.1, we first need to calculate
the central angle for each car brand. The results are shown in the table below.
Car brand (xi)    Frequency (ni)    Relative frequency (fi)    Central angle (αi)
Renault           6                 0.250                      0.250 × 360° = 90°
Toyota            3                 0.125                      0.125 × 360° = 45°
Hyundai           4                 0.167                      0.167 × 360° ≈ 60°
Volkswagen        8                 0.333                      0.333 × 360° ≈ 120°
Other             3                 0.125                      0.125 × 360° = 45°
Total             24                1                          360°
Table 2.7: Categorical frequency distribution with relative frequency of car brands (showing central angles αi).
The resulting pie chart is shown below:

Fig. 2.2: Pie chart showing car brands (in percentages).
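The central angles can be checked with a few lines of code. This sketch works from the frequencies n_i rather than from rounded relative frequencies, so no rounding error enters the angles; the dictionary name `frequencies` is an illustrative choice.

```python
# Frequencies of the car brands from Example 2.1.
frequencies = {"Renault": 6, "Toyota": 3, "Hyundai": 4, "Volkswagen": 8, "Other": 3}

n = sum(frequencies.values())

# Central angle of each sector: alpha_i = n_i * 360 / n (in degrees).
angles = {brand: n_i * 360 / n for brand, n_i in frequencies.items()}

for brand, alpha in angles.items():
    print(f"{brand}: {alpha:.0f} degrees")

print("Sum of central angles:", sum(angles.values()))
```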

b) Quantitative data
Before attempting any graphic presentation, it is important to differentiate between discrete and
continuous variables.

➢ Discrete data
✓ Stick chart
A stick chart is appropriate for displaying discrete data. The discrete values taken by the variable are marked
on the horizontal axis and the frequencies (or relative frequencies) on the vertical axis. A stick looks like a
bar with no width. The height of the stick is proportional to the frequency (or relative frequency) of the
corresponding value of the variable.
Example 2.6: The following stick chart shows the number of children in a family from Example 2.2.


Fig. 2.3: Stick chart showing the frequency by number of children in a family.

✓ Frequency polygon
A frequency polygon is a particular line graph used to represent the distribution of a set of quantitative data.
For discrete data, the frequency polygon is obtained by joining the tops of the sticks in the stick chart with
straight lines. To make the frequency polygon touch the horizontal axis on both sides, we add
one value below and one value above our data, each with zero frequency.
Example 2.7: As shown in Fig. 2.4 below, the frequency polygon is superimposed on the stick chart it replaces
(using the data given in Table 2.4). When joining the tops of the sticks with straight lines, we also included
points at (0, 0) and (7, 0) which represent one value below and one value above our data. Clearly, the
frequency polygon touches the x-axis on both sides.

Fig. 2.4: Stick chart and frequency polygon for number of children in a family.

➢ Continuous data
✓ Histogram
A histogram is particularly suitable for continuous data arranged into classes. A histogram is a set of adjacent
rectangles whose bases correspond to the size of the classes and whose areas are proportional to the
frequencies (or relative frequencies) of the classes. The horizontal axis displays the class limits; the vertical axis
displays the frequencies (or relative frequencies) when the classes have equal size, and the frequency densities
otherwise. There are two ways of constructing a histogram depending on the class size.
• Classes of equal size. The class limits are marked on the horizontal axis and the frequencies (or relative
frequencies) are indicated on the vertical axis. In this case, the heights of the rectangles are proportional
to the frequencies (or relative frequencies).
Example 2.8: The table below shows the distribution of monthly salary (in thousands of DA) of 100
employees of a company:
Monthly salary    Number of employees (ni)
[20-30[           28
[30-40[           34
[40-50[           19
[50-60[           15
[60-70[           4
Total             100
Table 2.8: Grouped frequency distribution of monthly salary (with equal class width).
Here, the classes have equal size a = 10 (thousands of DA), so the heights of the rectangles are equal to the
frequencies. The resulting frequency histogram is shown below:

Fig. 2.5: Frequency histogram for monthly salary (with classes of equal size).
• Classes of unequal sizes. To ensure that the area of each rectangle remains proportional to the
corresponding frequency (or relative frequency), the class limits are marked on the horizontal axis and the
frequency densities (or relative frequency densities) are indicated on the vertical axis. In this case, the
height of each rectangle is not proportional to the corresponding class frequency (or class relative
frequency), but rather to the corresponding class frequency density (or class relative frequency density).


Example 2.9: The following table gives the age distribution of the number of deaths caused by road traffic
accidents during the last year:
Age group Number of deaths
[20-25[ 595
[25-35[ 410
[35-45[ 287
[45-65[ 456
Table 2.9: Age distribution for the number of deaths (classes have unequal widths).
The class widths are not equal. We wish to construct a histogram based on the frequency table above, so it
is necessary to calculate the frequency densities. The results are shown in the following table:
Age group    Frequency (ni)    Class size (ai)    Frequency density (di = ni / ai)
[20-25[      595               5                  119
[25-35[      410               10                 41
[35-45[      287               10                 28.7
[45-65[      456               20                 22.8
Total        1748
Table 2.10: Age distribution for the number of deaths showing the frequency densities.
The class limits are plotted on the x-axis and the frequency densities are plotted on the y-axis. The height of
each rectangle is equal to the corresponding class frequency density. The resulting frequency density
histogram is shown below.

Fig. 2.6: Frequency density histogram showing the number of deaths due to road accident in relation to age.
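The frequency densities of Table 2.10 follow directly from the definition di = ni / ai. The sketch below recomputes them and checks that the area of each rectangle, di × ai, gives back the class frequency; the tuple layout `(lower, upper, frequency)` is an illustrative choice.

```python
# Classes from Table 2.9 as (lower limit, upper limit, frequency n_i).
classes = [(20, 25, 595), (25, 35, 410), (35, 45, 287), (45, 65, 456)]

# Frequency density d_i = n_i / a_i, where a_i is the class size.
densities = [n_i / (upper - lower) for lower, upper, n_i in classes]

for (lower, upper, n_i), d_i in zip(classes, densities):
    a_i = upper - lower
    area = d_i * a_i              # area of the rectangle, equal to n_i
    print(f"[{lower}-{upper}[  a_i = {a_i}  d_i = {d_i}  area = {area:.0f}")
```

Because the areas equal the frequencies, the wide class [45-65[ is not visually overstated, even though it contains more deaths than [25-35[.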

✓ Frequency polygon
As previously mentioned, a frequency polygon is a particular line graph used to represent the distribution
of a set of quantitative data. For continuous data, there are two ways of constructing a frequency polygon.


• Classes have equal width. To ensure that the area under the polygon is equal to the total area of the
histogram, a class with zero frequency is added on either side of the histogram. These classes are known as
hypothetical classes. The frequency (or relative frequency) polygon is obtained by joining the mid-points of
the tops of the rectangles of the histogram, as well as the mid-points of the two hypothetical classes, with
straight lines. The area under the polygon represents the total frequency of the frequency distribution.
Example 2.10: As shown in Fig. 2.7 below, the frequency polygon is superimposed on the frequency
histogram it replaces (using the data given in example 2.8). After obtaining the mid-points of the tops of the
rectangles of the histogram, we add a hypothetical class with zero frequency on either side of the histogram.
The first one is [10-20[, the other is [70-80[. We then connect the mid-points of the adjacent rectangles of
the histogram by straight lines. We complete the frequency polygon by joining the mid-point of the first
rectangle to the mid-point of the class [10-20[, and the mid-point of the last rectangle to the mid-point of
the class [70-80[.

Fig. 2.7: Frequency histogram and frequency polygon for monthly salary (with classes of equal size).
• Classes have unequal widths. The histogram is artificially partitioned into rectangles of equal base,
denoted by $a_s$ and known as the standard class size. The value of $a_s$ is equal to the greatest common divisor
(GCD) of all the class sizes. A class with zero frequency and size $a_s$ is added on either side of the histogram
(these are the hypothetical classes) to ensure once again that the area under the polygon is equal to the
total area of the histogram. The polygon is obtained by joining the mid-points of the upper horizontal sides
of the rectangles of the partition, together with the mid-points of the two hypothetical classes, by straight lines.
Example 2.11: Let's go back to Example 2.9. Since the classes have unequal sizes, we need to calculate the
standard class size $a_s$. From Table 2.10 above, the class sizes are 5, 10 and 20. Since the GCD of 5, 10 and 20
is 5, $a_s = 5$. We artificially divide the histogram into rectangles of equal base $a_s = 5$. We add a
hypothetical class with zero frequency and size 5 on either side of the histogram: the first one is [15-20[, the
other is [65-70[. We join the mid-points of the tops of adjacent rectangles of the partition by straight lines. We
complete the frequency polygon by connecting the mid-point of the top of the first rectangle to the mid-point of the
class [15-20[, and the mid-point of the top of the last rectangle to the mid-point of the class [65-70[. As shown in Fig.
2.8 below, the polygon based on the frequency density is superimposed on the density histogram it replaces.

Fig. 2.8: Frequency density histogram and polygon showing the number of deaths due to road accident by age group.
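The construction for unequal class widths can be sketched in code: compute the standard class size as the GCD of the class sizes, split every class into sub-rectangles of that base, and collect the mid-points of their tops together with the two hypothetical classes. The names `a_s` and `points` are illustrative, and the class limits are assumed to be integers so that `range` can step through them.

```python
from functools import reduce
from math import gcd

# Classes from Table 2.9 as (lower limit, upper limit, frequency n_i).
classes = [(20, 25, 595), (25, 35, 410), (35, 45, 287), (45, 65, 456)]

widths = [upper - lower for lower, upper, _ in classes]
a_s = reduce(gcd, widths)                      # standard class size

# Start with the mid-point of the hypothetical class below the data.
points = [(classes[0][0] - a_s / 2, 0.0)]

for lower, upper, n_i in classes:
    d_i = n_i / (upper - lower)                # frequency density of the class
    # Every sub-rectangle of base a_s keeps the density of its class.
    for start in range(lower, upper, a_s):
        points.append((start + a_s / 2, d_i))

# End with the mid-point of the hypothetical class above the data.
points.append((classes[-1][1] + a_s / 2, 0.0))

for x, y in points:
    print(x, y)
```

Joining these points in order by straight lines gives the vertices of the polygon described in Example 2.11.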

✓ Frequency curve
If the number of data values becomes larger and larger, and at the same time the width of the classes is
made smaller and smaller, the frequency polygon will eventually become a smooth curve called a frequency
curve. In other words, a frequency curve may be regarded as a limiting form of the frequency polygon. An
example of this is shown in the figure below.

Fig. 2.9 Frequency histogram and frequency curve.


2.2.2 Cumulative frequency graphs


A cumulative frequency graph is a graph plotted from a cumulative frequency distribution. It can also be
constructed from a cumulative relative frequency distribution. Cumulative frequency graphs are classified
into two types: less than cumulative frequency graph and more than (greater than) cumulative frequency
graph. Plotting cumulative frequency graphs requires different approaches according to the types of
quantitative variables involved in the study.

a) Discrete variable
The cumulative frequency graph of a discrete variable looks similar to a step function (i.e. constant over
intervals). First, we plot the points whose x-coordinates are the possible values of the variable, and y-coordinates
are equal to the corresponding cumulative frequencies. Then, to complete the graph, we draw horizontal line
segments for each interval since, by definition, the running total remains constant between two successive values
of the variable. Note that each interval of this step function is left-closed and right-open. To make the graph easier
to read, in addition to the horizontal line segments (solid line), vertical line segments are shown (dashed line).
Example 2.12: Let us consider the ungrouped frequency distribution table for the number of children in a
family, seen earlier in example 2.2.
Number of children (xi)    Frequency (ni)    Less than cumulative frequency    Greater than cumulative frequency
1                          4                 4                                 20
2                          3                 7                                 16
3                          4                 11                                13
4                          7                 18                                9
5                          0                 18                                2
6                          2                 20                                2
Total                      20
Table 2.11: Ungrouped frequency distribution with cumulative frequency for number of children in a family.

• The less than cumulative frequency graph is constructed as follows:

The number of children is marked on the horizontal axis and the less than cumulative frequencies on
the vertical axis.
We then plot the following points:
(1, 4); (2, 7); (3, 11); (4, 18); (5, 18) and (6, 20).
We can also plot the point (0, 0) since there are no families recorded with a number of children less than 1.
Finally, we draw horizontal line segments (solid line) for each interval and add vertical line segments
(dashed line) to make the graph easier to read.
The less than cumulative frequency graph is shown in the following figure:


Fig. 2.10: Less than cumulative frequency graph for the number of children in a family.
• The more than cumulative frequency graph is constructed as follows:
The x-axis is labeled with the number of children and the y-axis is labeled with greater than cumulative
frequencies.
Then, we plot the points (1, 20); (2, 16); (3, 13); (4, 9); (5, 2) and (6, 2).
We can also plot the point (7, 0) as there are no families recorded with a number of children greater than 6.
We draw horizontal line segments (solid line) for each interval and vertical line segments (dashed line) to make
the graph easier to read.
The more than cumulative frequency graph is shown in the following figure:

Fig. 2.11: More than cumulative frequency graph for the number of children in a family.
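The less than cumulative frequency graph of Example 2.12 is a step function that is constant on left-closed, right-open intervals. A minimal sketch of that function is given below; the function name `less_than_cumulative` is illustrative, and the lookup uses the standard `bisect` module.

```python
from bisect import bisect_right
from itertools import accumulate

# Values and frequencies from Table 2.11.
values = [1, 2, 3, 4, 5, 6]
freqs = [4, 3, 4, 7, 0, 2]
cumulative = list(accumulate(freqs))      # less than cumulative frequencies

def less_than_cumulative(x: float) -> int:
    """Number of families with x children or fewer."""
    i = bisect_right(values, x)           # how many listed values are <= x
    return cumulative[i - 1] if i else 0

# The function jumps at each value of the variable and stays constant until the next one.
for x in (0.5, 1, 1.9, 2, 4.5, 6, 10):
    print(x, less_than_cumulative(x))
```

The running total changes only at the values 1 to 6, which is why the graph consists of horizontal segments.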

b) Continuous variable
The cumulative frequency graph of a continuous variable is called an ogive, also known as a cumulative
frequency polygon. There are two types of ogives:


• Less than ogive. We plot the points whose x-coordinates are the upper limits of the classes, and
y-coordinates are the corresponding less than cumulative frequencies. We add the point whose ordinate
is zero and abscissa is equal to the lower limit of the first class, since it is common to begin with a cumulative
frequency of zero (the cumulative frequencies are in ascending order). To complete the less than ogive, we
join all the points by line segments.
• More than ogive. We plot the points whose x-coordinates are the lower limits of the classes, and
y-coordinates are the corresponding greater than cumulative frequencies. We add the point whose
ordinate is zero and abscissa is equal to the upper limit of the last class, since it is common to end with a
cumulative frequency of zero (the cumulative frequencies are in descending order). To complete the more
than ogive, we connect all the points by line segments.
Note: It is worth mentioning that the less than and more than ogives are mirror images of each other.
Example 2.13: The table below gives the frequency distribution of the final grade average obtained in the (high school)
baccalaureate by 1480 students:
Final grade average    Number of students
[10-12[                560
[12-14[                380
[14-16[                300
[16-18[                210
[18-20[                30
Total                  1480
Table 2.12: Grouped frequency distribution for final grade average of baccalaureate.
We first need to build a cumulative frequency table:
Final grade average    Frequency (ni)    Less than cumulative frequency    Greater than cumulative frequency
[10-12[                560               560                               1480
[12-14[                380               940                               920
[14-16[                300               1240                              540
[16-18[                210               1450                              240
[18-20[                30                1480                              30
Total                  1480
Table 2.13: Grouped frequency distribution with cumulative frequency for final grade average of baccalaureate.
✓ The less than ogive is constructed as follows:
The upper limits of the classes are marked on the horizontal axis and the less than cumulative frequencies
on the vertical axis. We then plot the points (12, 560); (14, 940); (16, 1240); (18, 1450); (20, 1480) and the
additional point (10, 0). To obtain the less than ogive, we simply join these points by line segments.
✓ The more than ogive is constructed as follows:


The lower limits of the classes are marked on the horizontal axis and the greater than cumulative frequencies
on the vertical axis. We then plot the points (10, 1480); (12, 920); (14, 540); (16, 240); (18, 30) and the
additional point (20, 0). To obtain the more than ogive, we simply connect these points by line segments.
Both the less than ogive and the more than ogive are shown in the following figure:

Fig. 2.12: Less than and more than ogives.

The abscissa of the point at which the less than and more than ogives intersect is the median of the
corresponding frequency distribution, as will be discussed later.
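The points of both ogives in Example 2.13, and the abscissa at which they cross, can be computed as sketched below. The sketch relies on one observation: at every class limit the two cumulative frequencies add up to n, and both ogives are linear between the same limits, so they cross where the less than ogive reaches n / 2. The function name `crossing_abscissa` is illustrative.

```python
from itertools import accumulate

# Class limits and frequencies from Table 2.12.
limits = [10, 12, 14, 16, 18, 20]
freqs = [560, 380, 300, 210, 30]

n = sum(freqs)
less_than = [0] + list(accumulate(freqs))     # cumulative frequency at each limit
more_than = [n - c for c in less_than]        # complement at the same limits

less_ogive = list(zip(limits, less_than))     # (10, 0), (12, 560), ..., (20, 1480)
more_ogive = list(zip(limits, more_than))     # (10, 1480), (12, 920), ..., (20, 0)

def crossing_abscissa() -> float:
    """Abscissa where the less than ogive reaches n / 2 (linear interpolation)."""
    half = n / 2
    for (x0, y0), (x1, y1) in zip(less_ogive, less_ogive[1:]):
        if y0 <= half <= y1 and y1 > y0:
            return x0 + (x1 - x0) * (half - y0) / (y1 - y0)
    raise ValueError("no crossing found")

print(less_ogive)
print(more_ogive)
print(round(crossing_abscissa(), 2))
```

For these data the crossing lies in the class [12-14[, since the cumulative frequency passes from 560 to 940 over that class and n / 2 = 740.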

2.3 STATISTICAL MEASURES


Describing and summarising the observed data has so far been performed with the use of tables and graphs.
However, it is useful and necessary to summarise data numerically. A numerical summary is a value that is
representative of a data set. Numerical summaries include measures of central tendency, measures of location,
measures of dispersion, and measures of shape.

2.3.1 Measures of central tendency


Measures of central tendency aim to describe a data set with a particular value. This particular value is intended
to represent the central or typical value of the data set.
a) The mode
The mode is the value that occurs most often in a data set. It is denoted by Mo and can be determined for all
types of variables. How the mode is determined depends on the type of variable under consideration.
❖ Categorical variable
For a categorical variable, the determination of the mode is straightforward. The values in the data set are sorted
by categories. The value that occurs most frequently is the mode.


Example 2.14: Let us consider the data given in example 2.1. As already seen, "car brands" is a categorical
variable.
✓ From Table 2.2, the Volkswagen category is the value of the variable that appears the most: 8 times.
Therefore, the mode is Volkswagen. As there is only one mode, the distribution is called unimodal.
✓ From the bar graph representing the car brands (Fig. 2.1), the bar above the Volkswagen category is the highest,
with a height of 8. Here again, the mode is Volkswagen.
❖ Discrete variable
For a discrete variable, the determination of the mode is similar to that of a categorical variable.
Example 2.15: Let us consider the data given in example 2.2. As we already know, the number of children in
a family is a discrete quantitative variable.
✓ From Table 2.4, the number "4" is the value of the variable that appears the most: 7 times. So the mode
is Mo = 4.
✓ From the stick chart representing the number of children in a family (Fig. 2.3), the stick above the number
"4" is the longest, with a length of 7. Here again, the mode is Mo = 4. This is also a unimodal distribution.
✓ A data set may have no mode.
Example 2.16: Find the mode of the following data.
3, 5, 8, 12, 17.
There is no mode because all values appear the same number of times (once).
✓ A data set may have several modes.
Example 2.17: The table below shows the distribution of the number of days off work taken by 25 employees of
a company during the last month:
Number of days    Number of employees
0                 3
1                 6
2                 5
3                 4
4                 1
5                 6
Total             25
Table 2.14: Frequency distribution of number of days off from work.
From the frequency column, the highest frequency is 6. The numbers of days with the highest frequency are
1 and 5. Thus, this frequency distribution has two modes: 1 and 5. It is a bimodal distribution.
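The three situations described above (one mode, no mode, several modes) can be checked with a short Python sketch. The function name `modes` is illustrative and not part of the text; it assumes that "no mode" means every distinct value has the same frequency, as in Example 2.16.

```python
from collections import Counter

def modes(data):
    """Return the sorted list of modes; [] when no value stands out."""
    counts = Counter(data)
    if not counts:
        return []
    highest = max(counts.values())
    # All distinct values equally frequent: the text treats this as "no mode".
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    return sorted(value for value, count in counts.items() if count == highest)

# Example 2.16: every value appears once.
print(modes([3, 5, 8, 12, 17]))

# Table 2.14, expanded from (days off -> number of employees) into raw observations.
days_off = {0: 3, 1: 6, 2: 5, 3: 4, 4: 1, 5: 6}
observations = [days for days, n in days_off.items() for _ in range(n)]
print(modes(observations))
```

The first call prints an empty list and the second prints the two modes 1 and 5, matching the bimodal conclusion above.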
• Continuous variable
Let's start by considering a continuous variable for which the range of values is grouped into classes. Naturally,
continuous data are represented in the form of a histogram. To find the mode, we first need to identify the modal
class of the data, as will be explained later in this section. Next, we determine the mode within that modal class.
The conventional approach suggests the use of linear interpolation by taking into account adjacent classes as
discussed below.
Consider the histogram shown in Fig. 2.13. Classes have equal width, so the heights of the rectangles in the
histogram are equal to the frequencies, and the modal class corresponds to the highest rectangle (modal
rectangle). We join the top left corner of the modal rectangle to the top left corner of the rectangle of the
succeeding class by a straight line. In the same way, we join the top right corner of the modal rectangle to
the top right corner of the rectangle of the preceding class by a straight line. These two diagonal lines
intersect at point G. The abscissa of point G gives the value of the mode Mo.

Fig. 2.13: Histogram showing how to find the mode using linear interpolation (with classes of equal width).

First, let's define some symbols:


xi: lower limit of the modal class;
xi+1: upper limit of the modal class;
a: width of the modal class;
h: height of the modal rectangle;
d1: difference between the frequency of the modal class and frequency of the preceding class;
d2: difference between the frequency of the modal class and frequency of the succeeding class.
From Fig. 2.13, it is clear that:
a = xi+1 - xi, d1 = AD and d2 = CB.
The equation of a line can be written as y = mx + c where m is the slope (or gradient) and c is the y-intercept.
Line AB passes through points A(xi ; h) and B (xi + a ; h – d2). Using the two-point formula, its equation is:
$y = \dfrac{d_2}{a}(x_i - x) + h.$

Line CD passes through points C(xi + a ; h) and D (xi ; h – d1). Again, using the two-point formula, its equation is:
$y = \dfrac{d_1}{a}(x - x_i) + h - d_1.$

Lines AB and CD intersect at point G. Let the coordinates of G be (xG , yG). As previously mentioned, the
abscissa of point G gives the value of the mode (i.e., xG = Mo).
Furthermore, the coordinates of point G satisfy the equation of line AB and the equation of line CD
simultaneously, so we may write:
$\dfrac{d_2}{a}(x_i - Mo) + h = \dfrac{d_1}{a}(Mo - x_i) + h - d_1$

Solving the equation above for Mo yields:


$Mo = x_i + a \cdot \dfrac{d_1}{d_1 + d_2}$
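For readers who want the intermediate algebra, the solving step can be written out as follows. This is a worked sketch using only the symbols defined above; it assumes d1 + d2 is not zero.

```latex
\begin{align*}
\frac{d_2}{a}(x_i - Mo) + h &= \frac{d_1}{a}(Mo - x_i) + h - d_1 \\
d_1 &= \frac{d_1}{a}(Mo - x_i) + \frac{d_2}{a}(Mo - x_i) \\
d_1 &= \frac{d_1 + d_2}{a}\,(Mo - x_i) \\
Mo &= x_i + a\,\frac{d_1}{d_1 + d_2}
\end{align*}
```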
How the modal class is identified depends on whether the classes have equal width.
- Classes have equal width. The class corresponding to the highest frequency (or relative frequency) is the
modal class.
Example 2.18: The distribution of the distance (in km) from home to work for 30 employees of a company is
shown in the table below.
Distance (km)    Number of employees
[0-5[            3
[5-10[           6
[10-15[          16
[15-20[          5
Total            30
Table 2.15: Frequency distribution of distance from home to work.
The classes are of equal width. The heights of the rectangles in the histogram are therefore equal to the
number of employees. The modal class is [10-15[ because it has the highest frequency with 16 employees.
The value of the mode is calculated by the formula:
$Mo = x_i + a \cdot \dfrac{d_1}{d_1 + d_2}$
Here:
xi = 10, a = 15 – 10 = 5, d1 = 16 – 6 = 10, d2 = 16 – 5 = 11;
$Mo = 10 + 5 \times \dfrac{10}{10 + 11} \approx 12.4$ km (rounded to the nearest tenth).
- Classes have unequal widths. In this case, the modal class is the class with the highest frequency density
(or relative frequency density).

Example 2.19: Let’s consider the last example with a few modifications.
Distance (km)    Number of employees
[0-10[           9
[10-12[          12
[12-20[          9
Total            30
Table 2.16: Frequency distribution of distance from home to work.
The class widths are not equal. We need to calculate the frequency densities di = ni / ai. The results are shown
in the table below.
Distance (km)    Frequency ni    Class width ai    Frequency density di = ni / ai
[0-10[           9               10                0.9
[10-12[          12              2                 6.0
[12-20[          9               8                 1.1
Total            30
Table 2.17: Frequency distribution of distance from home to work showing the frequency densities.
The modal class is [10-12[ because it has the highest frequency density of 6. Since the classes are of unequal
width, if we were to draw a histogram, the heights of the rectangles in the histogram would be equal to the
frequency densities (not the frequencies). Here again, the modal class corresponds to the highest rectangle.
The value of the mode is calculated by the same formula as the previous case (i.e., classes have equal width),
but we need to replace frequencies with frequency densities.
$Mo = x_i + a \cdot \dfrac{d_1}{d_1 + d_2}$
Here:
xi = 10, a = 12 – 10 = 2, d1 = 6.0 – 0.9 = 5.1, d2 = 6.0 – 1.1 = 4.9;
$Mo = 10 + 2 \times \dfrac{5.1}{5.1 + 4.9} \approx 11.0$ km (rounded to the nearest tenth).
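The interpolation formula can be sketched in Python so that it covers both cases at once: working with frequency densities gives the same result as working with frequencies when the classes have equal width. The function name `grouped_mode` and the treatment of a missing neighbouring class as density 0 are assumptions of this sketch, not conventions stated in the text.

```python
def grouped_mode(bounds, freqs):
    """Mode of grouped data by linear interpolation on frequency densities.

    bounds: consecutive class limits, e.g. [0, 5, 10, 15, 20] for four classes.
    freqs:  class frequencies, one per class.
    """
    widths = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
    densities = [n / a for n, a in zip(freqs, widths)]
    i = densities.index(max(densities))          # modal class (first one if tied)
    # Assumption: a missing neighbour (first or last class) counts as density 0.
    before = densities[i - 1] if i > 0 else 0.0
    after = densities[i + 1] if i + 1 < len(densities) else 0.0
    d1 = densities[i] - before
    d2 = densities[i] - after
    return bounds[i] + widths[i] * d1 / (d1 + d2)

# Example 2.18 (equal widths) and Example 2.19 (unequal widths).
mode_equal = grouped_mode([0, 5, 10, 15, 20], [3, 6, 16, 5])
mode_unequal = grouped_mode([0, 10, 12, 20], [9, 12, 9])
print(round(mode_equal, 1), round(mode_unequal, 1))
```

Rounded to the nearest tenth, the two results agree with the values 12.4 km and 11.0 km obtained by hand. The second one is computed from the unrounded density 1.125 rather than 1.1, which does not change the rounded answer.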

b) The median
The median can be calculated for both quantitative and ordinal qualitative variables. The median, denoted by Me,
is the value of the variable that divides an ordered data set in half: at least half (50%) of
the values are less than or equal to the median, and at least half of the values are greater than or equal to
the median. There are several methods for calculating the median, depending on the type of data.

• Numerical data given on an individual basis


Consider an individual series containing n data values arranged in ascending order. There are two cases:
either n is odd or n is even.

Case 1: Suppose that n is odd. Then it can be written as n = 2p + 1, for some integer p. The median is the
value of rank p+1. The rank is the position of the data value. Here, the median is an observed value of the
data set.
Example 2.20: Consider the following individual series:
5 3 6 8 11 5 6
Let's arrange these values in ascending order:
3 5 5 6 6 8 11
This data set has 7 = 2×3 + 1 values. So the median is the 4th value, that is Me = 6.
Case 2: Suppose that n is even. Then it can be written as n = 2p, for some integer p. The median is half of
the sum of the data value of rank p and the data value of rank p + 1. In this case, the median is not necessarily
an observed value of the data set.
Example 2.21: Consider the following individual series:
14 6 11 13 8 14 8 12
Let's arrange these values in ascending order:
6 8 8 11 12 13 14 14
The data set has 8 = 2×4 values. So the median is half of the sum of the 4th and 5th values. In other words:
$Me = \dfrac{11 + 12}{2} = 11.5.$
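The two cases translate directly into code. The sketch below uses a hypothetical helper name, `median_individual`, and compares its result with `statistics.median` from the Python standard library, which follows the same odd/even rule.

```python
import statistics

def median_individual(values):
    """Median of an individual series, following Case 1 (odd n) and Case 2 (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    p = n // 2
    if n % 2 == 1:
        # n = 2p + 1: the value of rank p + 1 (index p when counting from 0).
        return ordered[p]
    # n = 2p: half the sum of the values of ranks p and p + 1.
    return (ordered[p - 1] + ordered[p]) / 2

odd_series = [5, 3, 6, 8, 11, 5, 6]             # Example 2.20
even_series = [14, 6, 11, 13, 8, 14, 8, 12]     # Example 2.21
print(median_individual(odd_series), median_individual(even_series))
print(statistics.median(odd_series), statistics.median(even_series))
```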
• Discrete data
The median can be determined using either the less than cumulative frequencies or the more than
cumulative frequencies. With less than cumulative frequencies, the median is the smallest value for which the
cumulative frequency is at least equal to n/2, where n is the total frequency.
Example 2.22: The distribution of the number of employees in 100 industrial companies is given in the table
below:
Number of employees    Number of companies
13 4
14 6
15 7
16 15
17 24
18 16
19 14
20 7
21 4
22 3
Table 2.18: Ungrouped frequency distribution for the number of employees.

Let's complete Table 2.18 by calculating the less than cumulative frequencies.
Number of employees xi    Frequency fi    Less than cumulative frequency
13                        4               4
14                        6               10
15                        7               17
16                        15              32
17                        24              56    (Me)
18                        16              72
19                        14              86
20                        7               93
21                        4               97
22                        3               100
Total                     100
Table 2.19: Ungrouped frequency distribution with cumulative frequency for the number of employees.
The median Me is the value of the number of employees for which the cumulative frequency is at least equal
to n/2 = 100/2 = 50. The value "50" does not appear explicitly in the less than cumulative frequency column
of Table 2.19, so we choose the next higher value, "56". From this value, we move horizontally
towards the first column to find the value of the number of employees corresponding to this cumulative
frequency. Hence Me = 17.
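The lookup in the cumulative frequency column can be sketched as follows. The function name `median_discrete` is illustrative, and the sketch assumes the values are listed in ascending order, as in the frequency table above.

```python
from itertools import accumulate

def median_discrete(values, freqs):
    """Smallest value whose less than cumulative frequency reaches n/2."""
    cumulative = list(accumulate(freqs))     # less than cumulative frequencies
    n = cumulative[-1]                       # total frequency
    for value, cum in zip(values, cumulative):
        if cum >= n / 2:
            return value

employees = [13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
companies = [4, 6, 7, 15, 24, 16, 14, 7, 4, 3]
print(median_discrete(employees, companies))
```

The loop stops at the cumulative frequency 56, the first one that is at least 50, and returns the corresponding value 17.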
• Continuous data
Let's consider a continuous variable with the whole range of data values grouped into classes. The median
can be calculated either analytically or graphically.
- Analytical method
To find the median analytically, we start by identifying the class containing the median (called the median class)
using the less than cumulative frequencies. The median class is the first class for which the cumulative
frequency is at least equal to n/2, where n is the total frequency. The value of the median is determined by
linear interpolation, within the median class, under the assumption that the frequencies are uniformly
distributed within each class.
First, let's define some symbols:
xi –1: upper limit of class preceding the median class;
xi: upper limit of the median class;
Ni –1 : less than cumulative frequency of the value xi –1;
Ni: less than cumulative frequency of the value xi.

Note: For less than cumulative frequencies, Ni –1 ≤ n/2 ≤ Ni.

Fig. 2.14: Calculating the median within the median class using linear interpolation for continuous data.

From Fig. 2.14: $\tan\alpha = \dfrac{GH}{AH} = \dfrac{BC}{AC}$
Hence, $AH = AC \cdot \dfrac{GH}{BC}$.
Let's write the line segments in terms of the symbols defined above:

AH = Me – xi –1 ; AC = xi – xi –1 ; GH = n/2 – Ni –1 ; BC = Ni – Ni –1

Let's replace the line segments with their respective expressions in the previous equation:
$Me - x_{i-1} = (x_i - x_{i-1}) \cdot \dfrac{n/2 - N_{i-1}}{N_i - N_{i-1}}$
Solving the equation above for Me gives:

$Me = x_{i-1} + (x_i - x_{i-1}) \cdot \dfrac{n/2 - N_{i-1}}{N_i - N_{i-1}}$
The median can also be calculated by linear interpolation using more than cumulative frequencies. Following
the same reasoning as above, we can prove that the value of the median is given by the formula:
$Me = x_i - (x_i - x_{i-1}) \cdot \dfrac{n/2 - N_i}{N_{i-1} - N_i}$

where:
xi –1: lower limit of the median class;
xi: lower limit of the class succeeding the median class;
Ni –1 : more than cumulative frequency of the value xi –1;
Ni: more than cumulative frequency of the value xi.

Note: For more than cumulative frequencies: Ni ≤ n/2 ≤ Ni –1.

Example 2.23: The distribution of rainfall (in cm) in a city over a period of 30 years is shown in the following
table:

Rainfall (cm)    Number of years
[28-31[ 4
[31-34[ 4
[34-37[ 9
[37-40[ 2
[40-43[ 5
[43-46[ 6
Table 2.20: Grouped frequency distribution of rainfall.
The median Me is the value of rainfall that corresponds to the cumulative frequency n/2 = 30/2 = 15.
- Let's complete Table 2.20 by calculating less than cumulative frequencies and determine the median.
Rainfall (cm)    Frequency ni    Less than cumulative frequency
[28 - 31[ 4 4
[31 - 34[ 4 8
[34 - 37[ 9 17
[37 - 40[ 2 19
[40 - 43[ 5 24
[43 - 46[ 6 30
Total 30
Table 2.21: Grouped frequency distribution with less than cumulative frequency of rainfall.
Here, n/2 = 15. From the less than cumulative frequency column, 8 ≤ 15 ≤ 17 (see Table 2.21 above). The
less than cumulative frequency 8 corresponds to the upper class limit 34 and less than cumulative frequency
17 corresponds to the upper class limit 37. Therefore:
xi –1 = 34, xi = 37, Ni –1 = 8, Ni = 17.
The median class is [34-37[.
The formula to calculate the median using less than cumulative frequencies is:
$Me = x_{i-1} + (x_i - x_{i-1}) \cdot \dfrac{n/2 - N_{i-1}}{N_i - N_{i-1}}$
Substituting the values in the formula:
$Me = 34 + (37 - 34) \times \dfrac{15 - 8}{17 - 8} \approx 36.33$ cm (rounded to the nearest hundredth).

- Let's complete Table 2.20 by calculating more than cumulative frequencies and determine the median.
Here, n/2 = 15. From the more than cumulative frequency column, 13 ≤ 15 ≤ 22 (see Table 2.22 below). The
more than cumulative frequency 13 corresponds to the lower class limit 37 and more than cumulative
frequency 22 corresponds to the lower class limit 34. Consequently:
xi –1 = 34, xi = 37, Ni –1 = 22, Ni = 13 and the median class is also [34-37[.

Rainfall (cm)    Frequency ni    More than cumulative frequency
[28 - 31[ 4 30
[31 - 34[ 4 26
[34 - 37[ 9 22
[37 - 40[ 2 13
[40 - 43[ 5 11
[43 - 46[ 6 6
Total 30
Table 2.22: Grouped frequency distribution with more than cumulative frequency of rainfall.
The formula to calculate the median using more than cumulative frequencies is:
$Me = x_i - (x_i - x_{i-1}) \cdot \dfrac{n/2 - N_i}{N_{i-1} - N_i}$
Substituting the values in the formula:
$Me = 37 - (37 - 34) \times \dfrac{15 - 13}{22 - 13} \approx 36.33$ cm (rounded to the nearest hundredth).
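Both interpolation formulas can be checked against each other with a short Python sketch. The function names `median_less_than` and `median_more_than` are illustrative; the class limits and frequencies are those of the rainfall example.

```python
from itertools import accumulate

def median_less_than(bounds, freqs):
    """Median by interpolation on less than cumulative frequencies."""
    n = sum(freqs)
    less = [0] + list(accumulate(freqs))     # less[j]: observations below bounds[j]
    # Median class: first class whose upper limit has cumulative frequency >= n/2.
    i = next(j for j in range(1, len(bounds)) if less[j] >= n / 2)
    width = bounds[i] - bounds[i - 1]
    return bounds[i - 1] + width * (n / 2 - less[i - 1]) / (less[i] - less[i - 1])

def median_more_than(bounds, freqs):
    """Median by interpolation on more than cumulative frequencies."""
    n = sum(freqs)
    less = [0] + list(accumulate(freqs))
    more = [n - c for c in less]             # more[j]: observations at or above bounds[j]
    # Median class: first class whose upper limit has more than frequency <= n/2.
    i = next(j for j in range(1, len(bounds)) if more[j] <= n / 2)
    width = bounds[i] - bounds[i - 1]
    return bounds[i] - width * (n / 2 - more[i]) / (more[i - 1] - more[i])

limits = [28, 31, 34, 37, 40, 43, 46]
years = [4, 4, 9, 2, 5, 6]
print(round(median_less_than(limits, years), 2))
print(round(median_more_than(limits, years), 2))
```

Both calls locate the class [34-37[ and give the same value, about 36.33 cm, as expected since the two cumulative frequency columns carry the same information.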
- Graphical method
The value of the median can be determined graphically using either type of ogive, or both together.
- Less than or more than ogive method. We draw either a less than or a more than ogive. We then draw a line
parallel to the x-axis that cuts the y-axis at n/2. From the point where this line intersects the
chosen ogive (less than or more than), we drop a perpendicular to the x-axis. The abscissa of the foot of
this perpendicular is the value of the median.
Example 2.24: Fig. 2.15 below shows the less than ogive for the frequency distribution of final grade average
of 1480 students of example 2.13. We draw a line parallel to the x-axis at frequency n/2 = 1480/2 = 740.
From the point of intersection of this line and the less than ogive, we draw a perpendicular on the x-axis.
The point where it meets the x-axis is the value of the median. Here, Me ≈ 13.

Fig. 2.15: Determining the median graphically using less than ogive.

- Less than and more than ogive method. We draw less than and more than ogives on the same graph.
From the point at which the less than and more than ogives intersect we draw a perpendicular on the x-axis.
The abscissa of the point at which it touches the x-axis is the value of the median.
Example 2.25: Fig. 2.16 below shows less than and more than ogives for the frequency distribution of final
grade average of 1480 students, as already seen in example 2.13.
From the point of intersection of the two ogives, we draw a perpendicular on the x-axis. The abscissa of the
point on the x-axis, at which the perpendicular cuts, is the value of the median. Here again, Me ≈ 13.

Fig. 2.16: Determining the median graphically with the help of less than and more than ogives.

• Ordinal data
The values in the data set are sorted by categories with a rank order. If the number of data values is odd, we
use the same method as for discrete data. If the number of data values is even (n = 2p), the values of rank p
and rank p + 1 cannot be averaged, because half of the sum of two categories is meaningless; the median is
then defined only when both values fall in the same category.
Example 2.26: An interviewer approached at random 40 people in the street and asked them about their
level of education. The results are recorded in the following table:
Level of education    Number of people
No formal education 5
Primary 8
Lower secondary 10
Upper secondary 12
Higher education 5
Total 40
Table 2.23: Categorical frequency distribution for the level of education.
Note that the categories are given in order, from lowest ranking to highest. Let's complete Table 2.23 by
calculating less than cumulative frequencies. The results are shown in the following table.

Level of education    Frequency fi    Less than cumulative frequency
No formal education 5 5
Primary 8 13
Lower secondary 10 23
Upper secondary 12 35
Higher education 5 40
Total 40
Table 2.24: Categorical frequency distribution with less than cumulative frequency of the level of education.
The median is the category corresponding to the cumulative frequency n/2 = 40/2 = 20. The value "20" does
not appear explicitly in the less than cumulative frequency column of Table 2.24, so we choose the next
higher value, "23". From this value, we move horizontally towards the first column to find the
category corresponding to this cumulative frequency. Hence the median is lower secondary.
c) The arithmetic mean
The arithmetic mean is the best known and most widely used measure of central tendency. It applies only
to quantitative (numerical) variables. Its determination depends on the type of distribution (individual,
discrete, and continuous).
• Individual series
For an individual series, we speak of simple arithmetic mean. Let’s consider an individual series containing
n data values, namely x1, x2, … , xn. The simple arithmetic mean is defined by:
$\bar{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i$

Example 2.27: The marks (out of 20) scored by a student in different subjects are 13, 12, 15, 9, 12.

There are 5 marks. The arithmetic mean is $\bar{x} = \dfrac{13 + 12 + 15 + 9 + 12}{5} = \dfrac{61}{5} = 12.2$.

• Discrete series
For a discrete series, we speak of weighted arithmetic mean. Let’s consider a discrete series of n observations
taking k distinct values x1, x2, … , xk with corresponding frequencies n1, n2, … , nk (so that
n = n1 + n2 + … + nk). The weighted arithmetic mean is defined as:
$\bar{x} = \dfrac{1}{n} \sum_{i=1}^{k} n_i x_i$

Example 2.28: The table below gives the frequency distribution of the number of children in a family, as
already seen in example 2.2.

Number of children xi    Frequency ni
1 4
2 3
3 4
4 7
5 0
6 2
Total 20
Table 2.25: Ungrouped frequency distribution for number of children in a family.
The mean number of children per family is:

$\bar{x} = \dfrac{4 \times 1 + 3 \times 2 + 4 \times 3 + 7 \times 4 + 0 \times 5 + 2 \times 6}{20} = \dfrac{62}{20} = 3.1$.

• Continuous series
Let’s consider a continuous series in which data values are grouped into classes. Unlike discrete data, the
individual values for grouped data are not available. So to calculate the mean of grouped data, we use the
same formula as in the case of a discrete series and replace the individual values xi with the mid-points ci of
the various classes. The weighted arithmetic mean is therefore defined as:
$\bar{x} = \dfrac{1}{n} \sum_{i=1}^{k} n_i c_i$

Example 2.29: The table below gives the frequency distribution of rainfall (in cm) in a city over a period of
30 years, as seen in example 2.23.
Rainfall Frequency Mid-point
(cm) ni ci
[28-31[ 4 29.5
[31-34[ 4 32.5
[34-37[ 9 35.5
[37-40[ 2 38.5
[40-43[ 5 41.5
[43-46[ 6 44.5
Total 30
Table 2.26: Grouped frequency distribution of rainfall showing mid-points of the classes.
As we don’t know the exact amount of rain that fell each year, we use mid-points. The average rainfall is:

$\bar{x} = \dfrac{4 \times 29.5 + 4 \times 32.5 + 9 \times 35.5 + 2 \times 38.5 + 5 \times 41.5 + 6 \times 44.5}{30} = \dfrac{1119}{30} = 37.3$ cm
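The discrete and continuous cases use the same weighted formula, which a short Python sketch makes explicit. The helper names `weighted_mean` and `midpoints` are illustrative; the data are those of the children and rainfall examples.

```python
def weighted_mean(values, freqs):
    """Weighted arithmetic mean: sum of n_i * x_i divided by the total frequency n."""
    n = sum(freqs)
    return sum(f * x for x, f in zip(values, freqs)) / n

def midpoints(bounds):
    """Mid-point c_i of each class, given consecutive class limits."""
    return [(lo + hi) / 2 for lo, hi in zip(bounds, bounds[1:])]

# Discrete series: number of children per family.
mean_children = weighted_mean([1, 2, 3, 4, 5, 6], [4, 3, 4, 7, 0, 2])

# Continuous series: rainfall classes are replaced by their mid-points.
rain_limits = [28, 31, 34, 37, 40, 43, 46]
mean_rain = weighted_mean(midpoints(rain_limits), [4, 4, 9, 2, 5, 6])

print(round(mean_children, 1), round(mean_rain, 1))
```

The only difference between the two calls is the first argument: observed values in the discrete case, class mid-points in the continuous case.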

• Characteristics of the mean
There are five characteristics of the mean.
- Adding a constant to each value in a data set will increase the mean by that constant.

- Multiplying each value in a data set by a constant multiplies the mean by that constant.
- If $\bar{x}_1$ and $\bar{x}_2$ are the means of two groups of sizes n1 and n2 respectively, then the combined mean (for
the two groups) is:
$\bar{x} = \dfrac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$
This formula can be generalized to apply to more than 2 groups.
- The sum of the deviations from the mean is zero. For instance, for an individual series:
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
- The mean, and no other value such as the median or some arbitrary number, minimizes the
sum of squared deviations; that is, the quantity $\sum_{i=1}^{n} (x_i - C)^2$ is minimal if and only if C is the mean.

d) Yule properties
The statistician Udny Yule defined six properties that an ideal measure of central tendency should satisfy.
The following table compares the different measures of central tendency based on Yule properties.
Yule properties                                       Mo     Me     x̄
is defined objectively                                no     yes    yes
depends on all the observations                       no     no     yes
has a concrete meaning                                yes    yes    no
is easy to calculate                                  yes    yes    yes
lends itself easily to mathematical manipulation      no     no     yes
is not very sensitive to sampling fluctuations        no     yes    no
Table 2.27: Comparison of mode, median and mean.
No measure of central tendency satisfies all of the Yule properties simultaneously. The mean is the
preferred measure of central tendency for a symmetrical distribution. The median is a suitable measure of
central tendency for a skewed distribution or when dealing with ordinal data. The mode is the most
appropriate measure of central tendency for nominal data. The best measure of central tendency
therefore depends on the situation and on the statistical study being carried out.

2.3.2 Measures of location


While measures of central tendency give information about the centre of a dataset, there are other
measures, known as measures of location or position, that are commonly used to describe the distribution
of the data set. Measures of position give information about the relative location of particular values in the
data set. The measures we consider here are quartiles, deciles and percentiles.

a) Quartiles
Quartiles are 3 values that divide the ordered data into four equal parts.
The first quartile, Q1, is the value such that at least a quarter (25%) of the data are less than or equal to this
value and at least three quarters (75%) of the data are greater than or equal to this value.
The second quartile, Q2, corresponds to the median.
The third quartile, Q3, is the value such that at least three quarters (75%) of the data are less than or equal
to this value and at least a quarter (25%) of the data are greater than or equal to this value.
There are several methods for calculating the quartiles, depending on the type of data.
• Numerical data given on an individual basis
Consider an individual series containing n data values arranged in ascending order. There are two cases:
either n/4 is an integer or not.
Case 1: n/4 is an integer, then the first quartile Q1 is the value of rank n/4 and the third quartile Q3 is the
value of rank 3n/4.
Example 2.30: Consider the following individual series:
3 13 22 5 13 17 10 22
Let's sort these values in ascending order:
3 5 10 13 13 17 22 22
There are 8 observations:

$\dfrac{1 \times 8}{4} = 2$, so the first quartile Q1 is the 2nd value, i.e., Q1 = 5;
$\dfrac{3 \times 8}{4} = 6$, so the third quartile Q3 is the 6th value, i.e., Q3 = 17.
Case 2: n/4 is not an integer. Then the first quartile Q1 is the value whose rank is the smallest integer greater
than n/4, and the third quartile Q3 is the value whose rank is the smallest integer greater than 3n/4.
Example 2.31: Consider the following individual series:
14 6 19 11 17 8 14 7 13
Let's sort these values in ascending order:
6 7 8 11 13 14 14 17 19
There are 9 observations:

$\dfrac{1 \times 9}{4} = 2.25$, so the first quartile Q1 is the 3rd value, so Q1 = 8;
$\dfrac{3 \times 9}{4} = 6.75$, so the third quartile Q3 is the 7th value, so Q3 = 14.
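Both cases amount to taking the observation whose rank is kn/4 rounded up to the next integer. The sketch below applies this rule to Q1 and Q3; the function name `quartile_by_rank` is illustrative. Statistical software often uses interpolation rules instead, so its quartiles may differ from the ones obtained here.

```python
def quartile_by_rank(values, k):
    """Q1 (k = 1) or Q3 (k = 3) as the observation of rank ceil(k * n / 4)."""
    ordered = sorted(values)
    n = len(ordered)
    rank = -(-k * n // 4)        # integer ceiling of k*n/4 (ranks start at 1)
    return ordered[rank - 1]

case_1 = [3, 13, 22, 5, 13, 17, 10, 22]        # Example 2.30, n/4 is an integer
case_2 = [14, 6, 19, 11, 17, 8, 14, 7, 13]     # Example 2.31, n/4 is not an integer
print(quartile_by_rank(case_1, 1), quartile_by_rank(case_1, 3))
print(quartile_by_rank(case_2, 1), quartile_by_rank(case_2, 3))
```

Integer arithmetic is used for the ceiling so that no floating-point rounding can shift the rank.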

• Discrete data
The quartiles are determined using the less than cumulative frequencies. The first quartile Q1 is the smallest
value for which the cumulative frequency is at least equal to n/4, and the third quartile Q3 is the smallest value
for which the cumulative frequency is at least equal to 3n/4, where n is the total number of data values.
Example 2.32: The frequency distribution with cumulative frequency for number of employees in 100
industrial companies is given in the table below (refer to the data in example 2.22):
Number of employees xi    Frequency fi    Less than cumulative frequency
13                        4               4
14                        6               10
15                        7               17
16                        15              32    (Q1)
17                        24              56
18                        16              72
19                        14              86    (Q3)
20                        7               93
21                        4               97
22                        3               100
Total                     100
Table 2.28: Frequency distribution with cumulative frequency for the number of employees.
- The first quartile Q1 is the value of the number of employees for which the cumulative frequency is at
least equal to n/4 = 100/4 = 25. The value "25" does not appear explicitly in the less than cumulative
frequency column of Table 2.28, so we choose the next higher value, "32". From this value,
we move horizontally towards the first column to find the value of the number of employees corresponding
to this cumulative frequency, hence Q1 = 16.
- The third quartile Q3 is the value of the number of employees for which the cumulative frequency is at
least equal to 3n/4 = 300/4 = 75. The value "75" does not appear explicitly in the less than cumulative
frequency column of Table 2.28, so we choose the next higher value, "86". From this value,
we move horizontally towards the first column to find the value of the number of employees corresponding
to this cumulative frequency, hence Q3 = 19.
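The search for the first cumulative frequency that reaches kn/4 is a sorted lookup, so it can be sketched with a binary search. The function name `quantile_discrete` and its `parts` parameter (4 for quartiles) are assumptions of this sketch; values must be listed in ascending order.

```python
from bisect import bisect_left
from itertools import accumulate

def quantile_discrete(values, freqs, k, parts=4):
    """Smallest value whose less than cumulative frequency is at least k*n/parts."""
    cumulative = list(accumulate(freqs))
    target = k * cumulative[-1] / parts
    # bisect_left returns the index of the first cumulative frequency >= target.
    return values[bisect_left(cumulative, target)]

staff = [13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
firms = [4, 6, 7, 15, 24, 16, 14, 7, 4, 3]
q1 = quantile_discrete(staff, firms, 1)
q3 = quantile_discrete(staff, firms, 3)
print(q1, q3)
```

With k = 2 the same function returns 17, the median found earlier for this distribution.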
• Continuous data
Let's consider a continuous variable with the whole range of data values grouped into classes. The quartiles
can be calculated either analytically or graphically.
- Analytical method
We start by identifying the class containing the kth quartile (k = 1, 2 or 3) using the less than cumulative
frequencies. The class containing the kth quartile is the first class for which the cumulative frequency is at least
equal to (kn/4), where n is the total frequency. The value of the kth quartile is determined by linear
interpolation, under the assumption that the frequencies are uniformly distributed within each class. By a
similar reasoning to the one adopted for the median, we can show that the value of the kth quartile is given
by the formula:
$Q_k = x_{i-1} + (x_i - x_{i-1}) \cdot \dfrac{kn/4 - N_{i-1}}{N_i - N_{i-1}}$

where:
xi –1: lower limit of the class containing Qk;
xi: upper limit of the class containing Qk;
Ni –1: less than cumulative frequency of the value xi –1;
Ni: less than cumulative frequency of the value xi.
Example 2.33: The frequency distribution with cumulative frequency of rainfall (in cm) in a city over a period
of 30 years is shown in the following table (refer to the data in example 2.23):
Rainfall (cm)    Frequency ni    Less than cumulative frequency
[28 - 31[ 4 4
[31 - 34[ 4 8
[34 - 37[ 9 17
[37 - 40[ 2 19
[40 - 43[ 5 24
[43 - 46[ 6 30
Total 30
Table 2.29: Grouped frequency distribution with less than cumulative frequency of rainfall.
- We want to calculate the first quartile Q1, so k = 1 in the above formula and (kn/4) = (1×30/4) = 7.5.
From the less than cumulative frequency column, 4 ≤ 7.5 ≤ 8. The less than cumulative frequency 4
corresponds to the upper class limit 31 and less than cumulative frequency 8 corresponds to the upper class
limit 34. Therefore:
xi –1 = 31, xi = 34, Ni –1 = 4, Ni = 8 so Q1 lies in the class [31-34[.
Substituting all these values in the formula:
$Q_1 = x_{i-1} + (x_i - x_{i-1}) \cdot \dfrac{n/4 - N_{i-1}}{N_i - N_{i-1}}$
$Q_1 = 31 + (34 - 31) \times \dfrac{7.5 - 4}{8 - 4} = 33.625 \approx 33.63$ cm (rounded to the nearest hundredth).

- Now, we want to calculate the third quartile Q3, so k = 3 in the above formula and (kn/4) = (3×30/4) = 22.5.
From the less than cumulative frequency column, 19 ≤ 22.5 ≤ 24. The less than cumulative frequency 19
corresponds to the upper class limit 40 and less than cumulative frequency 24 corresponds to the upper
class limit 43.
Therefore:
xi –1 = 40, xi = 43, Ni –1 = 19, Ni = 24 so Q3 lies in the class [40-43[.
Substituting all these values in the formula:
$Q_3 = x_{i-1} + (x_i - x_{i-1}) \cdot \dfrac{3n/4 - N_{i-1}}{N_i - N_{i-1}}$
$Q_3 = 40 + (43 - 40) \times \dfrac{22.5 - 19}{24 - 19} = 42.10$ cm (rounded to the nearest hundredth).

Note: To calculate the median, first and third quartiles by linear interpolation using relative frequencies, the
previous formulae are adapted by replacing:
- the cumulative frequencies Ni –1 and Ni by the cumulative relative frequencies Fi –1 and Fi respectively;
- the quantities n/4, n/2 and 3n/4 by 1/4, 1/2 and 3/4 respectively.
- Graphical method
Quartiles can be determined graphically by using the less than ogive. First, we draw a less than ogive. Then,
we determine the cumulative frequency (kn/4), where n is the total frequency (put k = 1 for the first quartile,
k = 2 for the second quartile and k = 3 for the third quartile). Next, we draw a parallel line to the x-axis having
intercept (kn/4) on the y-axis. From the point of intersection of this line with the less than ogive we draw a
perpendicular on the x-axis. The abscissa of the point on the x-axis, at which the perpendicular cuts, is the
value of the quartile under consideration.
Example 2.34: Fig. 2.17 below shows the less than ogive for the frequency distribution of final grade average
of 1480 students of example 2.13.
For the first quartile Q1, k = 1 and (kn/4) = (1×1480/4) = 370.
We draw a line parallel to the x-axis at frequency 370. From the point of intersection of this line and the less
than ogive, we draw a perpendicular on the x-axis. The point where it meets the x-axis is the value of the
first quartile. Here, Q1 ≈ 11.3.
For the third quartile Q3, k = 3 and (kn/4) = (3×1480/4) = 1110.
We draw a line parallel to the x-axis at frequency 1110. From the point of intersection of this line and the
less than ogive, we draw a perpendicular on the x-axis. The point where it meets the x-axis is the value of
the third quartile. Here, Q3 ≈ 15.1.

Fig. 2.17: Determining quartiles Q1 and Q3 graphically using less than ogive.

b) Deciles
Deciles are 9 values that divide the ordered data into ten equal parts.
The first decile, D1, is the value such that at least 10% of the data are less than or equal to this value and at
least 90% of the data are greater than or equal to this value.
The second decile, D2, is the value such that at least 20% of the data are less than or equal to this value and
at least 80% of the data are greater than or equal to this value.
The fifth decile, D5, is the value such that at least 50% of the data are less than or equal to this value and at
least 50% of the data are greater than or equal to this value. D5 corresponds to the median.
The ninth decile, D9, is the value such that at least 90% of the data are less than or equal to this value and
at least 10% of the data are greater than or equal to this value.
Deciles may be determined in the same way as quartiles except that in place of (kn/4), where k = 1, 2 or 3
we will use (kn/10), where k = 1, 2, …, 9.

c) Percentiles
Percentiles are 99 values that divide the ordered data into one hundred equal parts.
The kth percentile, denoted by Pk, is the value such that at least k% of the data are less than or equal to this
value and at least (100 – k)% of the data are greater than or equal to this value, where k = 1, 2, 3, …, 99.
Note: The 50th percentile is the median, the 25th percentile is the first quartile and the 75th percentile is the
third quartile.
Percentiles may be determined in the same way as quartiles except that in place of (kn/4), where k = 1, 2 or
3 we will use (kn/100), where k = 1, 2, 3, …, 99.
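Since quartiles, deciles and percentiles of grouped data differ only in the target cumulative frequency (kn/4, kn/10 or kn/100), a single interpolation routine can cover all three. The sketch below uses a hypothetical function `grouped_quantile` with a `parts` argument equal to 4, 10 or 100, applied to the rainfall distribution used earlier.

```python
from itertools import accumulate

def grouped_quantile(bounds, freqs, k, parts):
    """k-th quantile of grouped data when the ordered data are split into `parts` parts."""
    n = sum(freqs)
    target = k * n / parts
    less = [0] + list(accumulate(freqs))     # less than cumulative frequencies
    # First class whose upper limit has cumulative frequency >= target.
    i = next(j for j in range(1, len(bounds)) if less[j] >= target)
    lower, upper = bounds[i - 1], bounds[i]
    return lower + (upper - lower) * (target - less[i - 1]) / (less[i] - less[i - 1])

rain_bounds = [28, 31, 34, 37, 40, 43, 46]
rain_freqs = [4, 4, 9, 2, 5, 6]
q1 = grouped_quantile(rain_bounds, rain_freqs, 1, 4)
q3 = grouped_quantile(rain_bounds, rain_freqs, 3, 4)
d5 = grouped_quantile(rain_bounds, rain_freqs, 5, 10)
p50 = grouped_quantile(rain_bounds, rain_freqs, 50, 100)
print(q1, q3, d5, p50)
```

The fifth decile and the 50th percentile coincide, as both equal the median of the distribution.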

2.3.3 Measures of dispersion


While measures of central tendency give us information about the typical or central value in a data set,
measures of dispersion give us information about the spread of data, or their variation around a central
value. Measures of central tendency are used in conjunction with measures of variability to provide a more
complete numerical description of the data. Measures of dispersion can be divided into two categories:
absolute measures of dispersion that are expressed in the same units as the data, and relative measures of
dispersion that are dimensionless numbers.
a) Absolute measures of dispersion
• Range
The range of a data set is the difference between the largest and smallest values observed:
e = xmax – xmin
Range is simple to calculate and its meaning is clear. However, the range is not a robust measure of
dispersion since it depends solely on extreme values and outliers, and provides no information about how
the remaining data are distributed. The larger the range, the more spread out the data.
• Interquartile range
The interquartile range (IQR) is the difference between the third and first quartiles:
IQR = Q3 - Q1
The IQR is a measure of dispersion around the median and is unaffected by extreme values and outliers. The
IQR measures the spread of the middle 50% of values in a data set, ignoring the bottom 25% of the data
and the top 25% as well.
The larger the IQR, the wider the spread of the middle 50% of data values.

• Interdecile range
The interdecile range (IDR) is the difference between the ninth and the first deciles:
IDR = D9 - D1
The IDR is a measure of dispersion around the median and is not affected by extreme values and outliers.
The IDR measures the spread of the central 80% of values in a data set, ignoring the bottom 10% of the data
and the top 10% as well.
The bigger the IDR, the wider the spread of the central 80% of data values.
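The three quantile-based measures above can be sketched in Python as follows. This is only an illustration: the sample is hypothetical, and `statistics.quantiles` with `method="inclusive"` interpolates between data points, which may give slightly different quartiles and deciles than the positional rule used in this chapter.

```python
import statistics

def spread_summary(data):
    """Return (range, IQR, IDR) for a list of numbers."""
    data = sorted(data)
    value_range = data[-1] - data[0]
    # Assumption: linear interpolation between data points ("inclusive" method).
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    deciles = statistics.quantiles(data, n=10, method="inclusive")
    iqr = q3 - q1                    # spread of the middle 50%
    idr = deciles[8] - deciles[0]    # D9 - D1, spread of the central 80%
    return value_range, iqr, idr

# Hypothetical sample containing one extreme value (40).
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 40]
value_range, iqr, idr = spread_summary(sample)
print(value_range, iqr, idr)
```

The extreme value 40 enters the range directly, whereas the IQR and IDR are computed from values well inside the sample.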
 Mean absolute deviation
Mean absolute deviation (MAD) of a data set is the average of the absolute differences between each value
and the mean. The formula for calculating MAD is as follows:
$\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$   for individual series

$\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{k} n_i \, |x_i - \bar{x}|$   for discrete series

$\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{k} n_i \, |c_i - \bar{x}|$   for continuous series

MAD is calculated by considering all the values in the data set and is hard to work with algebraically because
it involves absolute values. MAD uses the original units of the data. Mean absolute deviation is a measure
of variability: a low MAD suggests that data are tightly grouped around the mean (low variability) while a
high MAD suggests that data are spread out from the mean (high variability).
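A minimal Python sketch of the MAD calculation, covering the individual and discrete cases. The function name and the sample values are hypothetical and chosen only for illustration.

```python
def mean_absolute_deviation(values, freqs=None):
    """MAD of an individual series, or of a discrete series when freqs is given."""
    if freqs is None:
        freqs = [1] * len(values)    # individual series: each value counted once
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * abs(x - mean) for x, f in zip(values, freqs)) / n

# Individual series: mean 6, absolute deviations 4, 2, 0, 2, 4.
mad_individual = mean_absolute_deviation([2, 4, 6, 8, 10])
# Discrete series: values 1, 2, 3 with frequencies 1, 2, 1 (mean 2).
mad_discrete = mean_absolute_deviation([1, 2, 3], freqs=[1, 2, 1])
print(mad_individual, mad_discrete)
```

For a continuous series the same function could be applied to the class centres in place of the values.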
 Variance
Variance of a data set is the average of the squared differences between each value and the mean. The
formula for calculating the variance is as follows:
$V = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$   for individual series

$V = \frac{1}{n} \sum_{i=1}^{k} n_i (x_i - \bar{x})^2$   for discrete series

$V = \frac{1}{n} \sum_{i=1}^{k} n_i (c_i - \bar{x})^2$   for continuous series

To simplify the calculations, we use the König-Huygens formula:


$V = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2$   for individual series

$V = \frac{1}{n} \sum_{i=1}^{k} n_i x_i^2 - \bar{x}^2$   for discrete series

$V = \frac{1}{n} \sum_{i=1}^{k} n_i c_i^2 - \bar{x}^2$   for continuous series

Variance is a measure of dispersion (or variability) of a set of data values around the mean. A low variance
indicates that data are concentrated near the mean. A high variance indicates that data are spread out from
the mean. Variance is calculated by considering all the values in the data set and is sensitive to extreme
values. Variance is measured in the square of the unit of the data, which is sometimes difficult to interpret.
For example, if we are looking at weights in kilograms the variance will be in kg squared. In order to remedy
this flaw, we use the square root of the variance known as the standard deviation.
 Standard deviation
The standard deviation of a data set is the square root of its variance:
$\sigma = \sqrt{V}$
The standard deviation is the most commonly used of the absolute measures of dispersion around the mean.
The standard deviation is calculated by considering all the values in the data set and is sensitive to extreme
values. It is expressed in the same unit as the data. The standard deviation is better suited to describing the
variability of the data, while the variance is more convenient for statistical calculations.
The variance and standard deviation are not linear, but they have two important properties (a and b are
constants):
 V(x + a) = V(x), σ(x + a) = σ(x): adding a constant to each value in a data set changes neither the standard
deviation nor the variance.
 V(bx) = b² V(x), σ(bx) = |b| σ(x): multiplying each value in a data set by a constant multiplies the standard
deviation by the absolute value of that constant and the variance by its square.
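Both properties can be checked numerically. The sketch below uses hypothetical data and constants, with `statistics.pvariance` and `statistics.pstdev`, which divide by n as in this chapter's definitions.

```python
import math
import statistics

data = [2, 6, 9, 11, 15]   # hypothetical values
a, b = 10, -3              # hypothetical constants; b is negative on purpose

v = statistics.pvariance(data)   # population variance (divides by n)
s = statistics.pstdev(data)      # population standard deviation

shifted = [x + a for x in data]
scaled = [b * x for x in data]

# V(x + a) = V(x)
shift_ok = math.isclose(statistics.pvariance(shifted), v)
# V(bx) = b^2 V(x)
scale_var_ok = math.isclose(statistics.pvariance(scaled), b**2 * v)
# sigma(bx) = |b| sigma(x): the absolute value matters because b < 0 here
scale_std_ok = math.isclose(statistics.pstdev(scaled), abs(b) * s)

print(shift_ok, scale_var_ok, scale_std_ok)
```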
Example 2.35: A delivery driver records the distance (in km) covered on his (daily) delivery round in the last
26 days:
Distance covered (km)    Number of days
x_i                      n_i
2                        5
6                        9
9                        4
11                       3
15                       5
Total                    26
Table 2.30: Distribution of the distance covered by the delivery driver.
The distance covered is a continuous quantitative variable. Because the values in Table 2.30 are rounded to
the nearest whole number this variable will be treated as a discrete quantitative variable.
The variance, for a discrete series, is given by the König-Huygens formula:
$V = \frac{1}{n} \sum_{i=1}^{k} n_i x_i^2 - \bar{x}^2$

The arithmetic mean, for a discrete series, is given by:


$\bar{x} = \frac{1}{n} \sum_{i=1}^{k} n_i x_i = \frac{1}{26} (5 \times 2 + 9 \times 6 + 4 \times 9 + 3 \times 11 + 5 \times 15) = \frac{208}{26} = 8$ km

$V = \frac{1}{26} (5 \times 2^2 + 9 \times 6^2 + 4 \times 9^2 + 3 \times 11^2 + 5 \times 15^2) - 8^2 = \frac{2156}{26} - 64 \approx 18.92$ km²
The standard deviation is:

$\sigma = \sqrt{V} \approx 4.35$ km.
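The arithmetic of Example 2.35 can be reproduced with a few lines of Python. This is a sketch using the values of Table 2.30; the variable names are illustrative. The variance is computed both with the König-Huygens formula and from the definition, as a cross-check.

```python
import math

distances = [2, 6, 9, 11, 15]   # x_i in km (Table 2.30)
days = [5, 9, 4, 3, 5]          # n_i

n = sum(days)
mean = sum(f * x for x, f in zip(distances, days)) / n

# König-Huygens form: mean of the squares minus the square of the mean.
variance = sum(f * x**2 for x, f in zip(distances, days)) / n - mean**2
# Definition form: weighted mean of squared deviations from the mean.
variance_def = sum(f * (x - mean) ** 2 for x, f in zip(distances, days)) / n

std_dev = math.sqrt(variance)

print(n, mean, round(variance, 2), round(std_dev, 2))  # 26 8.0 18.92 4.35
```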
b) Relative measures of dispersion
It is possible to compare absolute measures of dispersion of two or more data sets, provided that they share
the same units and they have the same, or approximately the same, order of magnitude. If this is not the
case, the comparison can only be made using relative measures of dispersion.


Relative measures of dispersion are expressed as ratios or percentages, which makes them unitless. There are
four relative measures of dispersion:
 Coefficient of range
The coefficient of range (CR) is based on the range. It is defined as:
$CR = \frac{x_{max} - x_{min}}{x_{max} + x_{min}}$

 Coefficient of quartile deviation


The coefficient of quartile deviation (CQD) is based on the quartiles Q1 and Q3. It is defined as:
$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$

 Coefficient of mean deviation


The coefficient of mean deviation (CMD) is based on mean absolute deviation (MAD). It is defined as:

$CMD = \frac{\mathrm{MAD}}{\bar{x}}$

 Coefficient of variation
The coefficient of variation (CV) is based on the standard deviation. It is usually expressed in percentage
terms and is the most commonly used of the relative measures of dispersion. It is defined as:

$CV = \frac{\sigma}{\bar{x}} \times 100\%$
These coefficients are dimensionless numbers. A low coefficient reflects high uniformity or a small dispersion
of data. A high coefficient indicates low uniformity or a large dispersion of data.
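As an illustration, the sketch below computes these coefficients in Python for the distribution of Table 2.30. The function names are hypothetical. The coefficient of quartile deviation takes Q1 and Q3 as inputs, since the quartiles must first be obtained with the method described earlier in the chapter; the values 3 and 9 passed to it are arbitrary.

```python
import math

def relative_dispersion(values, freqs):
    """Return CR, CMD and CV (in %) for a discrete series.

    Assumes every listed value has a positive frequency and the mean is non-zero.
    """
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    mad = sum(f * abs(x - mean) for x, f in zip(values, freqs)) / n
    variance = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n
    x_min, x_max = min(values), max(values)
    return {
        "CR": (x_max - x_min) / (x_max + x_min),
        "CMD": mad / mean,
        "CV_percent": math.sqrt(variance) / mean * 100,
    }

def coefficient_of_quartile_deviation(q1, q3):
    """CQD from quartiles that have already been computed."""
    return (q3 - q1) / (q3 + q1)

coeffs = relative_dispersion([2, 6, 9, 11, 15], [5, 9, 4, 3, 5])
cqd = coefficient_of_quartile_deviation(3, 9)   # arbitrary illustrative quartiles
print(coeffs, cqd)
```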

2.3.4 Box plot


A box plot, also known as a box-and-whisker plot and introduced by John Tukey, is a graphical display of the
distribution of data based on a five-number summary: minimum (xmin), first quartile (Q1), median (Me), third
quartile (Q3) and maximum (xmax).
A box plot is constructed as follows:
 A horizontal axis shows the minimum, first quartile, median, third quartile and maximum of the
distribution.
 We construct a rectangle (the box) parallel to the axis whose length is the interquartile range and whose
width is arbitrary.
 A vertical line segment is constructed inside the box corresponding to the median.
 Two lines called whiskers extend from either side of the box to the minimum and maximum.


Fig. 2.18: Box plot


Note: Box plots can be drawn in any orientation (horizontally or vertically).
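The five-number summary on which a box plot is built can be sketched in Python as follows. The sample is hypothetical, and the interpolation rule used by `statistics.quantiles` (here `method="inclusive"`) is an assumption that may differ slightly from the quartile method of this chapter.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(data)
    # Assumption: quartiles obtained by linear interpolation between data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

summary = five_number_summary([7, 1, 5, 3, 9])   # hypothetical sample
print(summary)
```

The box spans the second to fourth entries of the summary, and the whiskers of a standard box plot reach the first and last entries.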
Another type of box plot, called a modified box plot, can also be constructed in order to identify outliers
(values that lie far away from other values in a data set).
In a modified box plot, the box is drawn just as in a standard box plot, but the length of the whiskers has
to be redefined.
 One approach uses certain deciles or percentiles to limit the length of the whiskers. The left and right
ends of the whiskers then correspond respectively to the 1st and 9th deciles, or the 1st and 99th
percentiles, or the 5th and 95th percentiles, etc.
 Another approach consists of restricting the length of each whisker to 1.5 times the interquartile range.
A value is considered a low outlier if it is smaller than Q1 – 1.5 × IQR, and a value is considered a high
outlier if it is larger than Q3 + 1.5 × IQR. This is known as Tukey's rule for detecting outliers.
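Tukey's rule translates directly into code. The following Python sketch uses hypothetical data and function names; the quartiles come from `statistics.quantiles`, whose interpolation may differ from the quartile method of this chapter.

```python
import statistics

def tukey_fences(q1, q3, k=1.5):
    """Lower and upper limits beyond which a value counts as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def tukey_outliers(data):
    """Values lying outside the Tukey fences."""
    # Assumption: quartiles by linear interpolation ("inclusive" method).
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    low, high = tukey_fences(q1, q3)
    return [x for x in data if x < low or x > high]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]   # hypothetical sample
outliers = tukey_outliers(sample)
print(outliers)
```

Flagging a value this way only identifies it as a candidate outlier; whether to exclude it remains a separate decision.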
Outliers can have a big impact on a statistical analysis. However, the exclusion of outliers can also be
controversial. Therefore, a clear justification must be provided for their exclusion.
2.3.5 Summarising a data set
A proper descriptive summary of a data set should include a measure of central tendency in conjunction
with a measure of dispersion. As a general rule:
- Median is used as a measure of central tendency when interquartile range is used as a measure of
dispersion.
- Mean is used as a measure of central tendency when standard deviation is used as a measure of dispersion.
2.3.6 Measures of shape
The shape of a distribution can be described by two measures: its skewness and its kurtosis. Skewness is a
measure of the asymmetry of a distribution while kurtosis is a measure of tailedness of a distribution. Both
measures are based on moments.
a) Central moments
The r-th central moment (r ∈ ℕ*) of a distribution is the average of the r-th power of the deviations from
the mean. It is defined as:

$\mu_r = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r$   for individual series

$\mu_r = \frac{1}{n} \sum_{i=1}^{k} n_i (x_i - \bar{x})^r$   for discrete series

$\mu_r = \frac{1}{n} \sum_{i=1}^{k} n_i (c_i - \bar{x})^r$   for continuous series
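A short Python sketch of the central moment of order r, applied to the distribution of Table 2.30. The function name is illustrative. The first central moment is always zero, and the second is the variance, which gives a quick sanity check.

```python
def central_moment(values, r, freqs=None):
    """r-th central moment of an individual or discrete series."""
    if freqs is None:
        freqs = [1] * len(values)    # individual series
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * (x - mean) ** r for x, f in zip(values, freqs)) / n

values = [2, 6, 9, 11, 15]   # x_i (Table 2.30)
freqs = [5, 9, 4, 3, 5]      # n_i

mu1 = central_moment(values, 1, freqs)   # deviations from the mean cancel out
mu2 = central_moment(values, 2, freqs)   # equals the variance
print(mu1, round(mu2, 2))
```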

b) Skewness
A distribution is symmetric if its left side and right side are mirror images of each other. Otherwise, the
distribution is asymmetric. The concept of skewness helps describe asymmetric distributions. Skewness is
a measure of the lack of symmetry of a distribution. A distribution is skewed to the right (or positively
skewed) if it has a long tail on its right side. Likewise, a distribution is skewed to the left (or negatively
skewed) if it has a long tail on its left side. A normal distribution (bell-shaped curve) exhibits zero skewness.
The skewness of a distribution can be determined in two ways:

 By comparing measures of central tendency


In the case of a unimodal distribution, comparing the mean $\bar{x}$, median Me and mode Mo enables us to
characterise the shape of the distribution. There are three situations.
 $\bar{x}$ < Me < Mo: the distribution is skewed to the left.
 $\bar{x}$ = Me = Mo: the distribution is symmetric.
 $\bar{x}$ > Me > Mo: the distribution is skewed to the right.

 By calculating the coefficient of skewness


There are several formulas to measure skewness. One of them is Fisher’s skewness coefficient.
Fisher’s skewness coefficient, denoted by γ1, is the ratio of the third central moment to the cube of the
standard deviation:
$\gamma_1 = \frac{\mu_3}{\sigma^3}$
Fisher’s skewness coefficient is a unitless number.
 If it is zero, the distribution is symmetric.
 If it is negative, the distribution is skewed to the left (or negatively skewed).
 If it is positive, the distribution is skewed to the right (or positively skewed).
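Fisher's skewness coefficient can be sketched in Python for an individual series as follows. The three samples are hypothetical and were chosen to show the sign of the coefficient in each case.

```python
def fisher_skewness(data):
    """Third central moment divided by the cube of the standard deviation."""
    n = len(data)
    mean = sum(data) / n
    mu2 = sum((x - mean) ** 2 for x in data) / n   # variance
    mu3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    return mu3 / mu2 ** 1.5                        # sigma^3 = V^(3/2)

symmetric = fisher_skewness([1, 2, 3, 4, 5])
right_tail = fisher_skewness([1, 1, 2, 2, 3, 10])    # long tail on the right
left_tail = fisher_skewness([-10, -3, -2, -2, -1, -1])  # mirror image of the above
print(symmetric, right_tail, left_tail)
```

Because the third power keeps the sign of each deviation, a few large deviations on one side dominate the sum and determine the sign of the coefficient.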

Example 2.36: Consider the three frequency histograms and three frequency curves below:
The two left graphs show negatively skewed distributions (γ1 < 0).
The two middle graphs show symmetric distributions (γ1 = 0).
The two right graphs show positively skewed distributions (γ1 > 0).

Fig. 2.19: Frequency histograms with different types of skewness.

Fig. 2.20: Frequency curves with different types of skewness.

c) Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution
(bell-shaped curve). A distribution with a high kurtosis suggests heavy tails and more outliers. Alternatively,
a distribution with a low kurtosis suggests light tails and fewer outliers.

There are several formulas to measure kurtosis. One of them is Fisher’s kurtosis coefficient.

Fisher’s kurtosis coefficient, denoted by γ2, is the ratio of the fourth central moment to the fourth power
of the standard deviation, minus three:

$\gamma_2 = \frac{\mu_4}{\sigma^4} - 3$
Fisher’s kurtosis coefficient is a unitless number.
 If it is zero, the distribution is mesokurtic (exhibits tails that are similar to the normal distribution).

 If it is negative, the distribution is platykurtic (exhibits thinner tails than the normal distribution).

 If it is positive, the distribution is leptokurtic (exhibits fatter tails than the normal distribution).

Example 2.37: The figure below shows the three types of kurtosis in histograms.
The left graph shows a platykurtic distribution (γ2 < 0).
The middle graph shows a mesokurtic distribution (γ2 = 0).
The right graph shows a leptokurtic distribution (γ2 > 0).

Fig. 2.21: Frequency histograms with different types of kurtosis.

Below is a graph comparing a normal (mesokurtic) distribution with a platykurtic and leptokurtic one.

Fig. 2.22: Frequency curves with different types of kurtosis.
