Chap 2
Relative frequency fi of the value xi is the proportion of the data set corresponding to the value xi. It is
obtained by dividing the frequency ni by the data set size n: fi = ni / n.
Descriptive Statistics 1
Bibliography
Example 2.1: The brands of 24 cars have been recorded in a car park. If R stands for Renault, T for Toyota,
H for Hyundai, V for Volkswagen and O for other brands, the observations collected are given below:
H R T H V R
V H V O T V
R O V R H O
V R T V V R
Table 2.1: Individual series of the car brands.
This is an example of an individual series, where the raw data are listed observation by observation. Because raw data
are hard to interpret in this form, we construct the following categorical frequency table:
Car brands Frequency Relative Frequency
xi ni fi
Renault 6 0.250
Toyota 3 0.125
Hyundai 4 0.167
Volkswagen 8 0.333
Other 3 0.125
Total 24 1
Table 2.2: Categorical frequency distribution with relative frequency of car brands.
The variable under study "car brands" is a nominal categorical variable. The five category names are listed
in the first column. The order of the brands has no significance. The second column shows the number of
cars for each brand: there are 6 Renaults, 3 Toyotas, 4 Hyundais, 8 Volkswagens and 3 other brands. These
are the frequencies. The third column indicates the relative frequency associated with each brand. The total
of the frequency column, 24, represents the total number of cars included in the sample. The relative
frequencies add up to 1. For a categorical nominal variable, cumulative frequencies are meaningless.
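As a minimal sketch, the frequency table above could be rebuilt from the raw observations of Table 2.1 in Python. The function name `frequency_table` and the variable names are illustrative choices, not part of the original text.

```python
from collections import Counter

# Raw observations from Table 2.1 (R, T, H, V, O are the brand codes).
observations = "H R T H V R V H V O T V R O V R H O V R T V V R".split()

def frequency_table(data):
    """Map each distinct value to (frequency n_i, relative frequency f_i)."""
    counts = Counter(data)
    n = len(data)
    return {value: (n_i, n_i / n) for value, n_i in counts.items()}

table = frequency_table(observations)
for brand, (n_i, f_i) in sorted(table.items()):
    print(f"{brand}: n_i = {n_i}, f_i = {f_i:.3f}")
```

The relative frequencies are computed from the counts directly, so they sum to 1 up to floating-point error.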
2.1.2 Numerical frequency distribution
a) Discrete data
The calculation of the frequency distribution of ordinal categorical variables and discrete quantitative
variables is similar.
Example 2.2: A survey conducted on 20 families in a locality revealed the following results for the number
of children in a family:
4 1 3 4 2
3 4 2 1 4
4 3 6 4 1
1 6 4 2 3
Table 2.3: Individual series of the number of children in a family.
Here, raw data are given as a list of numbers and as can be seen, they are hard to interpret in this format.
The given data are arranged in the ungrouped frequency table below:
Relative frequency fi of the class [xi , xi+1[ is the proportion of values corresponding to the interval [xi , xi+1[.
Frequency density di of the class [xi , xi+1[ is the fraction ni / ai.
Relative frequency density hi of the class [xi , xi+1[ is the fraction fi / ai.
Less than cumulative frequency of the class [xi , xi+1[ is the sum of frequencies of earlier classes and the
class [xi , xi+1[.
Greater than cumulative frequency of the class [xi , xi+1[ is the sum of frequencies of the class [xi , xi+1[
and the classes which succeed it.
Less than cumulative relative frequency of the class [xi , xi+1[ is the sum of relative frequencies of previous
classes and the class [xi , xi+1[.
Greater than cumulative relative frequency of the class [xi , xi+1[ is the sum of relative frequencies of the
class [xi , xi+1[ and the classes which come after it.
The number and width of the classes are left to the researcher's judgment. When grouping the
values of a continuous variable into classes, the researcher must choose between two options:
1. Classes have equal size. The number of classes k, for a data set of given size n, is calculated using one of
the following rules of thumb:
- Sturges' formula: k = 1 + 3.3 log n (logarithm to base 10)
Consequently, the number of classes k, for the n = 28 observations in Table 2.5, is:
- according to Sturges’ formula: k = 1 + 3.3 log 28 ≈ 5.78, rounded to 6
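The Sturges calculation can be checked with a short Python sketch. It assumes the logarithm is taken to base 10, which is what reproduces the value 5.78 quoted above; the function name `sturges_classes` is hypothetical.

```python
import math

def sturges_classes(n):
    """Number of classes suggested by Sturges' rule: k = 1 + 3.3 * log10(n)."""
    return 1 + 3.3 * math.log10(n)

k = sturges_classes(28)
print(round(k, 2), "rounded to", round(k))
```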
Example 2.4: Here is the bar chart showing the car brands of example 2.1.
Pie chart
A pie chart is a circle, representing the entire data, divided into sectors that represent the possible categories
of the variable. The area of the sector for a particular category is proportional to the corresponding
frequency (or relative frequency). The angle at the centre αi corresponding to the sector of a particular
category can be calculated using the following formula:
$$\alpha_i = \frac{n_i}{n} \times 360^\circ = f_i \times 360^\circ$$
where:
n is the data set size, ni (resp., fi) is the frequency (resp., the relative frequency) of the category.
It is worth mentioning that the sum of all the central angles in a pie chart is 360°.
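A small Python sketch of the central-angle formula, applied to the car brand frequencies of example 2.1. The function name `central_angles` is an illustrative assumption. The angles are computed from the frequencies ni rather than from rounded relative frequencies, which avoids accumulating rounding error.

```python
def central_angles(frequencies):
    """Central angle (in degrees) of each category: n_i * 360 / n."""
    n = sum(frequencies.values())
    return {category: n_i * 360 / n for category, n_i in frequencies.items()}

car_brands = {"Renault": 6, "Toyota": 3, "Hyundai": 4, "Volkswagen": 8, "Other": 3}
angles = central_angles(car_brands)
for brand, angle in angles.items():
    print(f"{brand}: {angle:.0f} degrees")
```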
Example 2.5: To construct the pie chart showing the car brands of example 2.1, we first need to calculate
the central angle for each car brand. The results are shown in the table below.
Car brands   Frequency   Relative frequency   Central angle
xi   ni   fi   αi
Renault 6 0.250 0.250 × 360° = 90°
Toyota 3 0.125 0.125 × 360° = 45°
Hyundai 4 0.167 0.167 × 360° ≈ 60°
Volkswagen 8 0.333 0.333 × 360° ≈ 120°
Other 3 0.125 0.125 × 360° = 45°
Total 24 1 360°
Table 2.7: Categorical frequency distribution with relative frequency of car brands (showing central angles αi).
The resulting pie chart is shown below:
b) Quantitative data
Before attempting any graphic presentation, it is important to differentiate between discrete and
continuous variables.
Discrete data
Stick chart
Stick charts are well suited to displaying discrete data. The discrete values taken by the variable are marked
on the horizontal axis and the frequencies (or relative frequencies) on the vertical axis. A stick looks like a
bar with no width. The height of the stick is proportional to the frequency (or relative frequency) of the
corresponding value of the variable.
Example 2.6: The following stick chart shows the number of children in a family of example 2.2.
Fig. 2.3: Stick chart showing the frequency by number of children in a family.
Frequency polygon
A frequency polygon is a particular line graph used to represent the distribution of a set of quantitative data.
For discrete data, the frequency polygon is obtained by joining the tops of the sticks with straight lines in
the stick chart. To make the frequency polygon touch the horizontal axis on both sides, we add
one value below and one value above our data, each with zero frequency.
Example 2.7: As shown in Fig. 2.4 below, the frequency polygon is superimposed on the stick chart it replaces
(using the data given in Table 2.4). When joining the tops of the sticks with straight lines, we also included
points at (0, 0) and (7, 0) which represent one value below and one value above our data. Clearly, the
frequency polygon touches the x-axis on both sides.
Fig. 2.4: Stick chart and frequency polygon for number of children in a family.
Continuous data
Histogram
A histogram is particularly suitable for continuous data arranged into classes. A histogram is a set of adjacent
rectangles whose bases correspond to the size of the classes and whose areas are proportional to the
frequencies (or relative frequencies) of the classes. The horizontal and vertical axes display the class limits
and frequencies (or relative frequencies), respectively. There are two ways of constructing a histogram
depending on the class size.
Classes of equal size. The class limits are marked on the horizontal axis and the frequencies (or relative
frequencies) are indicated on the vertical axis. In this case, the heights of the rectangles are proportional
to the frequencies (or relative frequencies).
Example 2.8: The table below shows the distribution of monthly salary (in thousands of DA) of 100
employees of a company:
Monthly Number of
Salary employees, ni
[20-30[ 28
[30-40[ 34
[40-50[ 19
[50-60[ 15
[60-70[ 4
Total 100
Table 2.8: Grouped frequency distribution of monthly salary (with equal class width).
Here, the classes have equal size a = 10 (thousands of DA), so the heights of the rectangles are equal to the
frequencies. The resulting frequency histogram is shown below:
Fig. 2.5: Frequency histogram for monthly salary (with classes of equal size).
Classes of unequal sizes. To ensure that the area of each rectangle remains proportional to the
corresponding frequency (or relative frequency), the class limits are marked on the horizontal axis and the
frequency densities (or relative frequency densities) are indicated on the vertical axis. In this case, the
height of each rectangle is not proportional to the corresponding class frequency (or class relative
frequency), but rather to the corresponding class frequency density (or class relative frequency density).
Example 2.9: The following table gives the age distribution of deaths caused by road traffic
accidents during the last year:
Age group Number of deaths
[20-25[ 595
[25-35[ 410
[35-45[ 287
[45-65[ 456
Table 2.9: Age distribution for the number of deaths (classes have unequal widths).
The class widths are not equal. We wish to construct a histogram based on the frequency table above, so it
is necessary to calculate the frequency densities. The results are shown in the following table:
Age group Frequency Class size Frequency density
ni ai di = ni / ai
[20-25[ 595 5 119
[25-35[ 410 10 41
[35-45[ 287 10 28.7
[45-65[ 456 20 22.8
Total 1748
Table 2.10: Age distribution for the number of deaths showing the frequency densities.
The class limits are plotted on the x-axis and the frequency densities are plotted on the y-axis. The height of
each rectangle is equal to the corresponding class frequency density. The resulting frequency density
histogram is shown below.
Fig. 2.6: Frequency density histogram showing the number of deaths due to road accident in relation to age.
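The frequency densities of Table 2.10 can be reproduced with a minimal Python sketch. The representation of the classes as a list of consecutive limits, and the name `frequency_densities`, are assumptions made for illustration.

```python
def frequency_densities(limits, frequencies):
    """Frequency density d_i = n_i / a_i for classes [limits[i], limits[i+1][."""
    widths = [upper - lower for lower, upper in zip(limits, limits[1:])]
    return [n_i / a_i for n_i, a_i in zip(frequencies, widths)]

age_limits = [20, 25, 35, 45, 65]   # class limits of Table 2.9
deaths = [595, 410, 287, 456]       # frequencies n_i

densities = frequency_densities(age_limits, deaths)
print(densities)
```

These densities are the rectangle heights of the histogram, so that each rectangle's area (density times width) equals the class frequency.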
Frequency polygon
As previously mentioned, a frequency polygon is a particular line graph used to represent the distribution
of a set of quantitative data. For continuous data, there are two ways of constructing a frequency polygon.
Classes have equal width. To ensure that the area under the polygon is equal to the total area of the
histogram, a class with zero frequency is added on either side of the histogram. These classes are known as
hypothetical classes. The frequency (or relative frequency) polygon is obtained by joining the mid-points of
the tops of the rectangles of the histogram, as well as the mid-points of the two hypothetical classes, with
straight lines. The area under the polygon represents the total frequency of the frequency distribution.
Example 2.10: As shown in Fig. 2.7 below, the frequency polygon is superimposed on the frequency
histogram it replaces (using the data given in example 2.8). After obtaining the mid-points of the tops of the
rectangles of the histogram, we add a hypothetical class with zero frequency on either side of the histogram.
The first one is [10-20[, the other is [70-80[. We then connect the mid-points of the adjacent rectangles of
the histogram by straight lines. We complete the frequency polygon by joining the mid-point of the first
rectangle to the mid-point of the class [10-20[, and the mid-point of the last rectangle to the mid-point of
the class [70-80[.
Fig. 2.7: Frequency histogram and frequency polygon for monthly salary (with classes of equal size).
Classes have unequal widths. The histogram is artificially partitioned into rectangles of equal base,
denoted by a_s and known as the standard class size. The value of a_s is equal to the greatest common divisor
(GCD) of all the class sizes. A class with zero frequency and size a_s is added on either side of the histogram
(these are the hypothetical classes) to ensure once again that the area under the polygon is equal to the
total area of the histogram. The polygon is obtained by joining, with straight lines, the mid-points of the upper
horizontal sides of the rectangles of the partition and the mid-points of the two hypothetical classes.
Example 2.11: Let's go back to example 2.9. Since the classes have unequal sizes, we need to calculate the
standard class size a_s. From Table 2.10 above, the class sizes are 5, 10 and 20. Since the GCD of 5, 10 and 20
is 5, a_s = 5. We artificially divide the histogram into rectangles of equal base a_s = 5. We add a
hypothetical class with zero frequency and size 5 on either side of the histogram: the first one is [15-20[, the
other is [65-70[. We join the mid-points of the adjacent rectangles of the partition by straight lines. We
complete the frequency polygon by connecting the mid-point of the first rectangle to the mid-point of the
class [15-20[, and the mid-point of the last rectangle to the mid-point of the class [65-70[. As shown in Fig.
2.8 below, the polygon based on the frequency density is superimposed on the density histogram it replaces.
Fig. 2.8: Frequency density histogram and polygon showing the number of deaths due to road accident by age group.
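The construction of example 2.11 can be sketched in Python as follows. The sketch assumes integer class limits (so that the GCD is defined) and uses hypothetical function names. It returns the vertices of the polygon: the mid-points of the sub-rectangles of base a_s, plus the mid-points of the two hypothetical classes.

```python
from functools import reduce
from math import gcd

def standard_class_size(widths):
    """Greatest common divisor of the (integer) class widths."""
    return reduce(gcd, widths)

def density_polygon_points(limits, frequencies):
    """Vertices (x, y) of the frequency density polygon for unequal classes."""
    widths = [upper - lower for lower, upper in zip(limits, limits[1:])]
    a_s = standard_class_size(widths)
    points = [(limits[0] - a_s / 2, 0.0)]            # hypothetical class before
    for lower, width, n_i in zip(limits, widths, frequencies):
        density = n_i / width
        for j in range(width // a_s):                # sub-rectangles of base a_s
            points.append((lower + (j + 0.5) * a_s, density))
    points.append((limits[-1] + a_s / 2, 0.0))       # hypothetical class after
    return points

points = density_polygon_points([20, 25, 35, 45, 65], [595, 410, 287, 456])
print(points[0], points[1], points[-1])
```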
Frequency curve
If the number of data values becomes larger and larger, and at the same time the widths of the classes are
made smaller and smaller, the frequency polygon will eventually become a smooth curve called a frequency
curve. In other words, a frequency curve may be regarded as a limiting form of the frequency polygon. An
example of this is shown in the figure below.
a) Discrete variable
The cumulative frequency graph of a discrete variable looks similar to a step function (i.e. constant over
intervals). First, we plot the points whose x-coordinates are the possible values of the variable, and y-coordinates
are equal to the corresponding cumulative frequencies. Then, to complete the graph, we draw horizontal line
segments for each interval since, by definition, the running total remains constant between two successive values
of the variable. Note that each interval of this step function is left-closed and right-open. To make the graph easier
to read, in addition to the horizontal line segments (solid line), vertical line segments are shown (dashed line).
Example 2.12: Let us consider the ungrouped frequency distribution table for the number of children in a
family, seen earlier in example 2.2.
Number of children xi   Frequency ni   Less than cumulative frequency   Greater than cumulative frequency
1 4 4 20
2 3 7 16
3 4 11 13
4 7 18 9
5 0 18 2
6 2 20 2
Total 20
Table 2.11: Ungrouped frequency distribution with cumulative frequency for number of children in a family.
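The two cumulative columns of Table 2.11 can be obtained as running totals. The following Python sketch uses `itertools.accumulate` from the standard library; the variable names are illustrative.

```python
from itertools import accumulate

children = [1, 2, 3, 4, 5, 6]        # values x_i
frequencies = [4, 3, 4, 7, 0, 2]     # frequencies n_i

# Less than: running total from the first value onwards.
less_than = list(accumulate(frequencies))
# Greater than: running total from the last value backwards.
greater_than = list(accumulate(reversed(frequencies)))[::-1]

for x, n_i, lt, gt in zip(children, frequencies, less_than, greater_than):
    print(x, n_i, lt, gt)
```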
Fig. 2.10: Less than cumulative frequency graph for the number of children in a family.
The more than cumulative frequency graph is constructed as follows:
The x-axis is labeled with the number of children and the y-axis is labeled with greater than cumulative
frequencies.
Then, we plot the points (1, 20); (2, 16); (3, 13); (4, 9); (5, 2) and (6, 2).
We can also plot the point (7, 0) as there are no families recorded with a number of children greater than 6.
We draw horizontal line segments (solid line) for each interval and vertical line segments (dashed line) to make
the graph easier to read.
The more than cumulative frequency graph is shown in the following figure:
Fig. 2.11: More than cumulative frequency graph for the number of children in a family.
b) Continuous variable
The cumulative frequency graph of a continuous variable is called an ogive, also known as a cumulative
frequency polygon. There are two types of ogives:
Less than ogive. We plot the points whose x-coordinates are the upper limits of the classes, and
y-coordinates are the corresponding less than cumulative frequencies. We add the point whose ordinate
is zero and abscissa is equal to the lower limit of the first class, since it is common to begin with a cumulative
frequency of zero (the cumulative frequencies are in ascending order). To complete the less than ogive, we
join all the points by line segments.
More than ogive. We plot the points whose x-coordinates are the lower limits of the classes, and
y-coordinates are the corresponding greater than cumulative frequencies. We add the point whose
ordinate is zero and abscissa is equal to the upper limit of the last class, since it is common to end with a
cumulative frequency of zero (the cumulative frequencies are in descending order). To complete the more
than ogive, we connect all the points by line segments.
Note: It is worth mentioning that the less than and more than ogives are mirror images of each other.
Example 2.13: The table below gives the frequency distribution of the final grade averages obtained in the (high school)
baccalaureate by 1480 students:
Final grade Number of
average students
[10-12[ 560
[12-14[ 380
[14-16[ 300
[16-18[ 210
[18-20[ 30
Total 1480
Table 2.12: Grouped frequency distribution for final grade average of baccalaureate.
We first need to build a cumulative frequency table:
Final grade average   Frequency ni   Less than cumulative frequency   Greater than cumulative frequency
[10-12[ 560 560 1480
[12-14[ 380 940 920
[14-16[ 300 1240 540
[16-18[ 210 1450 240
[18-20[ 30 1480 30
Total 1480
Table 2.13: Grouped frequency distribution with cumulative frequency for final grade average of baccalaureate.
The less than ogive is constructed as follows:
The upper limits of the classes are marked on the horizontal axis and the less than cumulative frequencies
on the vertical axis. We then plot the points (12, 560); (14, 940); (16, 1240); (18, 1450); (20, 1480) and the
additional point (10, 0). To obtain the less than ogive, we simply join these points by line segments.
The more than ogive is constructed as follows:
The lower limits of the classes are marked on the horizontal axis and the greater than cumulative frequencies
on the vertical axis. We then plot the points (10, 1480); (12, 920); (14, 540); (16, 240); (18, 30) and the
additional point (20, 0). To obtain the more than ogive, we simply connect these points by line segments.
Both less than ogive and more than ogive are shown in the following figure:
The abscissa of the point at which the less than and more than ogives intersect is the median of the
corresponding frequency distribution, as will be discussed later.
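As a sketch, the points of both ogives in example 2.13 can be generated from the class limits and frequencies. The function name `ogive_points` is a hypothetical choice; the plotting itself is left out.

```python
from itertools import accumulate

def ogive_points(limits, frequencies):
    """Return (less_than_points, more_than_points) for a grouped distribution."""
    less_cum = list(accumulate(frequencies))
    more_cum = list(accumulate(reversed(frequencies)))[::-1]
    # Less than ogive: upper limits, preceded by (lower limit of first class, 0).
    less_than = [(limits[0], 0)] + list(zip(limits[1:], less_cum))
    # More than ogive: lower limits, followed by (upper limit of last class, 0).
    more_than = list(zip(limits[:-1], more_cum)) + [(limits[-1], 0)]
    return less_than, more_than

grade_limits = [10, 12, 14, 16, 18, 20]
students = [560, 380, 300, 210, 30]
less_than, more_than = ogive_points(grade_limits, students)
print(less_than)
print(more_than)
```

At every class limit the two ordinates add up to n = 1480, which is why the two curves mirror each other about the horizontal line y = n/2 and cross at the median.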
Example 2.14: Let us consider the data given in example 2.1. As already seen, "car brands" is a categorical
variable.
From Table 2.2, the Volkswagen category is the value of the variable that appears the most: 8 times.
Therefore, the mode is Volkswagen. As there is only one mode, the distribution is called unimodal.
From the bar graph representing the car brands (Fig. 2.1), the bar above the Volkswagen category is the highest,
with a height of 8. Here again, the mode is Volkswagen.
Discrete variable
For a discrete variable, the determination of the mode is similar to that of a categorical variable.
Example 2.15: Let us consider the data given in example 2.2. As we already know, the number of children in
a family is a discrete quantitative variable.
From Table 2.4, the number "4" is the value of the variable that appears the most: 7 times. So the mode
is Mo = 4.
From the stick chart representing the number of children in a family (Fig. 2.3), the stick above the number
"4" is the longest, with a length of 7. Here again, the mode is Mo = 4. This is also a unimodal distribution.
A data set may have no mode.
Example 2.16: Find the mode of the following data.
3, 5, 8, 12, 17.
There is no mode because all values appear the same number of times (once).
A data set may also have several modes.
Example 2.17: The table below shows the distribution of the number of days off work taken by 25 employees of
a company during the last month:
Number of Number of
days employees
0 3
1 6
2 5
3 4
4 1
5 6
Total 25
Table 2.14: Frequency distribution of number of days off from work.
From the frequency column, the highest frequency is 6. The numbers of days with the highest frequency are
1 and 5. Thus, this frequency distribution has two modes: 1 and 5. It is a bimodal distribution.
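The three situations described above (one mode, no mode, several modes) can be illustrated with a short Python sketch. The function `modes` is hypothetical; it follows the convention used in example 2.16, where a data set whose values all share the same frequency is said to have no mode.

```python
from collections import Counter

def modes(frequencies):
    """Return the list of modes of a {value: frequency} mapping.

    Returns [] when several values exist and all have the same frequency
    (the "no mode" convention of example 2.16).
    """
    highest = max(frequencies.values())
    if len(frequencies) > 1 and all(n_i == highest for n_i in frequencies.values()):
        return []
    return sorted(x for x, n_i in frequencies.items() if n_i == highest)

children = {1: 4, 2: 3, 3: 4, 4: 7, 5: 0, 6: 2}    # Table 2.4 data
no_mode = Counter([3, 5, 8, 12, 17])               # example 2.16
days_off = {0: 3, 1: 6, 2: 5, 3: 4, 4: 1, 5: 6}    # Table 2.14

print(modes(children), modes(no_mode), modes(days_off))
```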
Continuous variable
Let's start by considering a continuous variable for which the range of values is grouped into classes. Naturally,
continuous data are represented in the form of a histogram. To find the mode, we first need to identify the modal
class of the data, as will be explained later in this section. Next, we determine the mode within that modal class.
The conventional approach suggests the use of linear interpolation by taking into account adjacent classes as
discussed below.
Consider the histogram shown in Fig. 2.13. Classes have equal width, so the heights of the rectangles in the
histogram are equal to the frequencies, and the modal class corresponds to the highest rectangle (modal
rectangle). We join the top left corner of the modal rectangle to the top left corner of the rectangle of the
succeeding class by a straight line. In the same way, we join the top right corner of the modal rectangle to
the top right corner of the rectangle of the preceding class by a straight line. These two diagonal lines
intersect at point G. The abscissa of point G gives the value of the mode Mo.
Fig. 2.13: Histogram showing how to find the mode using linear interpolation (with classes of equal width).
Line CD passes through points C(xi + a ; h) and D(xi ; h – d1). Again, using the two-point formula, its equation is:
$$y = \frac{d_1}{a}(x - x_i) + h - d_1.$$
Lines AB and CD intersect at point G. Let the coordinates of G be (xG , yG). As previously mentioned, the
abscissa of point G gives the value of the mode (i.e., xG = Mo).
Furthermore, the coordinates of point G satisfy the equation of line AB and the equation of line CD
simultaneously, so we may write:
$$\frac{d_2}{a}(x_i - Mo) + h = \frac{d_1}{a}(Mo - x_i) + h - d_1$$
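Solving this equality for Mo takes two short steps, shown here as a worked sketch. The result agrees with the formula for the mode applied later in example 2.19. Subtracting h from both sides and multiplying by a gives the first line; collecting the terms in (Mo – xi) gives the second.

```latex
\begin{aligned}
d_2\,(x_i - Mo) &= d_1\,(Mo - x_i) - a\,d_1 \\
(d_1 + d_2)\,(Mo - x_i) &= a\,d_1 \\
Mo &= x_i + a\,\frac{d_1}{d_1 + d_2}
\end{aligned}
```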
Example 2.19: Let’s consider the last example with a few modifications.
Distance Number of
(km) employees
[0-10[ 9
[10-12[ 12
[12-20[ 9
Total 30
Table 2.16: Frequency distribution of distance from home to work.
The class widths are not equal. We need to calculate the frequency densities di = ni / ai. The results are shown
in the table below.
Distance Frequency Class width Frequency density
(km) ni ai di = ni / ai
[0-10[ 9 10 0.9
[10-12[ 12 2 6.0
[12-20[ 9 8 1.1
Total 30
Table 2.17: Frequency distribution of distance from home to work showing the frequency densities.
The modal class is [10-12[ because it has the highest frequency density of 6. Since the classes are of unequal
width, if we were to draw a histogram, the heights of the rectangles in the histogram would be equal to the
frequency densities (not the frequencies). Here again, the modal class corresponds to the highest rectangle.
The value of the mode is calculated by the same formula as the previous case (i.e., classes have equal width),
but we need to replace frequencies with frequency densities.
$$Mo = x_i + a\,\frac{d_1}{d_1 + d_2}$$
Here:
xi = 10, a = 12 – 10 = 2, d1 = 6.0 – 0.9 = 5.1, d2 = 6.0 – 1.1 = 4.9;
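The substitution can be carried out numerically with the following Python sketch. The function name `grouped_mode` is hypothetical, and two choices are assumptions of this sketch rather than statements of the text: a missing neighbouring class is treated as having density 0, and the densities are not rounded (so d2 = 6 – 1.125 rather than 6 – 1.1). With the data of Table 2.16 it returns a mode of about 11.02 km.

```python
def grouped_mode(limits, frequencies):
    """Mode by linear interpolation, Mo = x_i + a * d1 / (d1 + d2), using densities."""
    widths = [upper - lower for lower, upper in zip(limits, limits[1:])]
    densities = [n_i / a_i for n_i, a_i in zip(frequencies, widths)]
    i = max(range(len(densities)), key=densities.__getitem__)   # modal class index
    h = densities[i]
    # Assumption: a missing neighbour counts as a class of density 0.
    d1 = h - (densities[i - 1] if i > 0 else 0.0)
    d2 = h - (densities[i + 1] if i < len(densities) - 1 else 0.0)
    if d1 + d2 == 0:
        raise ValueError("modal class is not higher than its neighbours")
    return limits[i] + widths[i] * d1 / (d1 + d2)

mode_km = grouped_mode([0, 10, 12, 20], [9, 12, 9])
print(round(mode_km, 2))
```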
b) The median
It can be calculated for both quantitative and ordinal qualitative variables. The median, denoted by Me,
corresponds to the value of the variable which divides an ordered data set in half. So, at least half (50%) of
the values are less than or equal to the median, and at least half of the values are greater than or equal to
the median. There are several methods for calculating the median, depending on the type of data.
Case 1: Suppose that n is odd. Then it can be written as n = 2p + 1, for some integer p. The median is the
value of rank p+1. The rank is the position of the data value. Here, the median is an observed value of the
data set.
Example 2.20: Consider the following individual series:
5 3 6 8 11 5 6
Let's arrange these values in ascending order:
3 5 5 6 6 8 11
This data set has 7 = 2 × 3 + 1 values. So the median is the 4th value, that is Me = 6.
Case 2: Suppose that n is even. Then it can be written as n = 2p, for some integer p. The median is half of
the sum of the data value of rank p and the data value of rank p + 1. In this case, the median is not necessarily
an observed value of the data set.
Example 2.21: Consider the following individual series:
14 6 11 13 8 14 8 12
Let's arrange these values in ascending order:
6 8 8 11 12 13 14 14
The data set has 8 = 2 × 4 values. So the median is half of the sum of the 4th and 5th values. In other words:
Me = (11 + 12) / 2 = 11.5.
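Both cases can be written as one short Python function. This is a minimal sketch with an illustrative name, `median_individual`; Python's standard `statistics.median` applies the same two rules.

```python
def median_individual(values):
    """Median of an individual series (Case 1: n odd, Case 2: n even)."""
    ordered = sorted(values)
    n = len(ordered)
    p = n // 2
    if n % 2 == 1:
        return ordered[p]                      # value of rank p + 1 (1-based)
    return (ordered[p - 1] + ordered[p]) / 2   # mean of ranks p and p + 1

print(median_individual([5, 3, 6, 8, 11, 5, 6]))
print(median_individual([14, 6, 11, 13, 8, 14, 8, 12]))
```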
Discrete data
The median can be determined using either the less than cumulative frequencies or the more than
cumulative frequencies. With less than cumulative frequencies, the median is the first value for which the
cumulative frequency is at least equal to n/2, where n is the total frequency.
Example 2.22: The distribution of the number of employees in 100 industrial companies is given in the table
below:
Number of Number of
employees companies
13 4
14 6
15 7
16 15
17 24
18 16
19 14
20 7
21 4
22 3
Table 2.18: Ungrouped frequency distribution for the number of employees.
Let's complete Table 2.18 by calculating the less than cumulative frequencies.
Number of employees xi   Frequency ni   Less than cumulative frequency
13 4 4
14 6 10
15 7 17
16 15 32
17 24 56 ← Me
18 16 72
19 14 86
20 7 93
21 4 97
22 3 100
Total 100
Table 2.19: Ungrouped frequency distribution with cumulative frequency for the number of employees.
The median Me is the value of the number of employees for which the cumulative frequency is at least equal
to n/2 = 100/2 = 50. The value "50" does not appear explicitly in the less than cumulative frequency column
of Table 2.19, so we take the first cumulative frequency greater than 50, namely "56". From this value, we move
horizontally towards the first column to find the number of employees corresponding to this cumulative
frequency. Hence Me = 17.
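The lookup described above can be sketched in Python: scan the less than cumulative frequencies and stop at the first one that reaches n/2. The function name is hypothetical.

```python
from itertools import accumulate

def median_from_frequencies(values, frequencies):
    """First value whose less than cumulative frequency is at least n / 2."""
    n = sum(frequencies)
    for x, cumulative in zip(values, accumulate(frequencies)):
        if cumulative >= n / 2:
            return x

employees = [13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
companies = [4, 6, 7, 15, 24, 16, 14, 7, 4, 3]
print(median_from_frequencies(employees, companies))
```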
Continuous data
Let's consider a continuous variable with the whole range of data values grouped into classes. The median
can be calculated either analytically or graphically.
Analytical method
To find the median analytically, we start by identifying the class containing the median (called the median class)
using the less than cumulative frequencies. The median class is the first class for which the cumulative
frequency is at least equal to n/2, where n is the total frequency. The value of the median is determined by
linear interpolation, within the median class, under the assumption that the frequencies are uniformly
distributed within each class.
First, let's define some symbols:
xi –1: upper limit of class preceding the median class;
xi: upper limit of the median class;
Ni –1 : less than cumulative frequency of the value xi –1;
Ni: less than cumulative frequency of the value xi.
Fig. 2.14: Calculating the median within the median class using linear interpolation for continuous data.
Hence, AH = AC × GH / BC.
Let's write the line segments in terms of the symbols defined above:
AH = Me – xi –1 ; AC = xi – xi –1 ; GH = n/2 – Ni –1 ; BC = Ni – Ni –1
Let's replace the line segments with their respective expressions in the previous equation:
$$Me - x_{i-1} = (x_i - x_{i-1})\,\frac{n/2 - N_{i-1}}{N_i - N_{i-1}}$$
Solving the equation above for Me gives:
$$Me = x_{i-1} + (x_i - x_{i-1})\,\frac{n/2 - N_{i-1}}{N_i - N_{i-1}}$$
The median can also be calculated by linear interpolation using more than cumulative frequencies. Following
the same reasoning as above, we can prove that the value of the median is given by the formula:
$$Me = x_i - (x_i - x_{i-1})\,\frac{n/2 - N_i}{N_{i-1} - N_i}$$
where:
xi –1: lower limit of the median class;
xi: lower limit of the class succeeding the median class;
Ni –1 : more than cumulative frequency of the value xi –1;
Ni: more than cumulative frequency of the value xi.
Example 2.23: The distribution of rainfall (in cm) in a city over a period of 30 years is shown in the following
table:
Rainfall Number
(cm) of years
[28-31[ 4
[31-34[ 4
[34-37[ 9
[37-40[ 2
[40-43[ 5
[43-46[ 6
Table 2.20: Grouped frequency distribution of rainfall.
The median Me is the value of rainfall that corresponds to the cumulative frequency n/2 = 30/2 = 15.
Let's complete Table 2.20 by calculating less than cumulative frequencies and determine the median.
Rainfall (cm)   Frequency ni   Less than cumulative frequency
[28 - 31[ 4 4
[31 - 34[ 4 8
[34 - 37[ 9 17
[37 - 40[ 2 19
[40 - 43[ 5 24
[43 - 46[ 6 30
Total 30
Table 2.21: Grouped frequency distribution with less than cumulative frequency of rainfall.
Here, n/2 = 15. From the less than cumulative frequency column, 8 < 15 < 17 (see Table 2.21 above). The
less than cumulative frequency 8 corresponds to the upper class limit 34 and less than cumulative frequency
17 corresponds to the upper class limit 37. Therefore:
xi –1 = 34, xi = 37, Ni –1 = 8, Ni = 17.
The median class is [34-37[.
The formula to calculate the median using less than cumulative frequencies is:
$$Me = x_{i-1} + (x_i - x_{i-1})\,\frac{n/2 - N_{i-1}}{N_i - N_{i-1}}$$
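Substituting the values in the formula gives:
$$Me = 34 + (37 - 34) \times \frac{15 - 8}{17 - 8} = 34 + \frac{7}{3} \approx 36.33 \text{ cm}$$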
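The substitution can be checked numerically with a minimal Python sketch of the less than interpolation formula. The function name `grouped_median` and the list-of-limits representation are assumptions made for illustration. Applied to the rainfall data of Table 2.20 it gives approximately 36.33 cm.

```python
from itertools import accumulate

def grouped_median(limits, frequencies):
    """Median by linear interpolation within the median class (less than form)."""
    n = sum(frequencies)
    cumulative = [0] + list(accumulate(frequencies))   # N at each class limit
    for i in range(1, len(cumulative)):
        if cumulative[i] >= n / 2:                     # first class reaching n/2
            x_prev, x_i = limits[i - 1], limits[i]
            N_prev, N_i = cumulative[i - 1], cumulative[i]
            return x_prev + (x_i - x_prev) * (n / 2 - N_prev) / (N_i - N_prev)

rain_limits = [28, 31, 34, 37, 40, 43, 46]
years = [4, 4, 9, 2, 5, 6]
median_rain = grouped_median(rain_limits, years)
print(round(median_rain, 2))
```

The same function applied to the grade data of example 2.13 returns a value close to 13, in line with the graphical reading of the ogives.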
Let's complete Table 2.20 by calculating more than cumulative frequencies and determine the median.
Here, n/2 = 15. From the more than cumulative frequency column, 13 < 15 < 22 (see Table 2.22 below). The
more than cumulative frequency 13 corresponds to the lower class limit 37 and more than cumulative
frequency 22 corresponds to the lower class limit 34. Consequently:
xi –1 = 34, xi = 37, Ni –1 = 22, Ni = 13 and the median class is also [34-37[.
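As a worked check, substituting these values into the more than formula given earlier (Me equals xi minus an interpolation term) leads to the same median as the less than method:

```latex
Me = x_i - (x_i - x_{i-1})\,\frac{n/2 - N_i}{N_{i-1} - N_i}
   = 37 - (37 - 34) \times \frac{15 - 13}{22 - 13}
   = 37 - \frac{2}{3} \approx 36.33 \text{ cm}
```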
Fig. 2.15: Determining the median graphically using less than ogive.
Less than and more than ogive method. We draw the less than and more than ogives on the same graph.
From the point at which the two ogives intersect, we draw a perpendicular to the x-axis.
The abscissa of the point at which it meets the x-axis is the value of the median.
Example 2.25: Fig. 2.16 below shows less than and more than ogives for the frequency distribution of final
grade average of 1480 students, as already seen in example 2.13.
From the point of intersection of the two ogives, we draw a perpendicular to the x-axis. The abscissa of the
point at which the perpendicular meets the x-axis is the value of the median. Here again, Me ≈ 13.
Fig. 2.16: Determining the median graphically with the help of less than and more than ogives.
Ordinal data
The values in the data set are sorted by categories with a rank order. If the number of data values is odd, we
use the same method as for discrete data. If the number of data values is even, the median is determined only
when the data values of rank p and rank p + 1 fall in the same category, because half of the sum of two
different categories is meaningless.
Example 2.26: An interviewer approached at random 40 people in the street and asked them about their
level of education. The results are recorded in the following table:
Level of Number of
education people
No formal education 5
Primary 8
Lower secondary 10
Upper secondary 12
Higher education 5
Total 40
Table 2.23: Categorical frequency distribution for the level of education.
Note that the categories are given in order, from lowest ranking to highest. Let's complete Table 2.23 by
calculating less than cumulative frequencies. The results are shown in the following table.
Example 2.27: The marks (out of 20) scored by a student in different subjects are 13, 12, 15, 9, 12.
Discrete series
For a discrete series, we speak of the weighted arithmetic mean. Let's consider a discrete series of n
data values taking k distinct values x1, x2, … , xk with corresponding frequencies n1, n2, … , nk, so that
n = n1 + n2 + … + nk. The weighted arithmetic mean is defined as:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{k} n_i x_i$$
Example 2.28: The table below gives the frequency distribution of the number of children in a family, as
already seen in example 2.2.
Number of Frequency
children ni
xi
1 4
2 3
3 4
4 7
5 0
6 2
Total 20
Table 2.25: Ungrouped frequency distribution for number of children in a family.
The mean number of children per family is:
$$\bar{x} = \frac{4 \times 1 + 3 \times 2 + 4 \times 3 + 7 \times 4 + 0 \times 5 + 2 \times 6}{20} = \frac{62}{20} = 3.1.$$
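The weighted arithmetic mean of Table 2.25 can be reproduced with a few lines of Python. This is a minimal sketch; the function name `weighted_mean` is an illustrative choice.

```python
def weighted_mean(values, frequencies):
    """Weighted arithmetic mean: (1 / n) * sum of n_i * x_i."""
    n = sum(frequencies)
    return sum(n_i * x_i for x_i, n_i in zip(values, frequencies)) / n

children = [1, 2, 3, 4, 5, 6]
families = [4, 3, 4, 7, 0, 2]
mean_children = weighted_mean(children, families)
print(mean_children)
```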
Continuous series
Let’s consider a continuous series in which data values are grouped into classes. Unlike discrete data, the
individual values for grouped data are not available. So to calculate the mean of grouped data, we use the
same formula as in the case of a discrete series and replace the individual values xi with the mid-points ci of
the various classes. The weighted arithmetic mean is therefore defined as:
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{k} n_i c_i $$
Example 2.29: The table below gives the frequency distribution of rainfall (in cm) in a city over a period of
30 years, as seen in example 2.23.
Rainfall (cm)   Frequency   Mid-point
                ni          ci
[28-31[          4          29.5
[31-34[          4          32.5
[34-37[          9          35.5
[37-40[          2          38.5
[40-43[          5          41.5
[43-46[          6          44.5
Total           30
Table 2.26: Grouped frequency distribution of rainfall showing mid-points of the classes.
As we don’t know the exact amount of rain that fell each year, we use mid-points. The average rainfall is:
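The calculation with the mid-points of Table 2.26 can be sketched in Python as follows. The variable names are illustrative, and the mid-points are recomputed from the class limits rather than typed in.

```python
# Class limits [lower, upper[ and frequencies n_i from Table 2.26.
classes = [(28, 31), (31, 34), (34, 37), (37, 40), (40, 43), (43, 46)]
frequencies = [4, 4, 9, 2, 5, 6]

# Mid-point c_i of each class: (lower + upper) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

n = sum(frequencies)
mean_rainfall = sum(n_i * c_i for n_i, c_i in zip(frequencies, midpoints)) / n

print(midpoints)
print(round(mean_rainfall, 2))  # 1119 / 30 = 37.3
```

Because each class is represented by its mid-point, the result is an approximation of the true mean of the 30 yearly values.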
Multiplying each value in a data set by a constant multiplies the mean by that constant.
If x̄1 and x̄2 are the means of two groups of sizes n1 and n2 respectively, then the combined mean (for
the two groups taken together) is x̄ = (n1 x̄1 + n2 x̄2) / (n1 + n2).
The mean, and no other alternative value such as the median or some arbitrary number, minimizes the
sum of squared deviations, that is, the quantity $\sum_{i=1}^{n} (x_i - C)^2$ is minimal if and only if C is the mean.
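The least-squares property can be checked numerically. The sketch below uses the five marks of Example 2.27 and compares the sum of squared deviations around the mean with the same sum around the median; the helper name `sum_squared_deviations` is hypothetical.

```python
from statistics import mean, median

marks = [13, 12, 15, 9, 12]  # marks from Example 2.27

def sum_squared_deviations(data, c):
    """Return the sum of (x_i - c)^2 over the data set."""
    return sum((x - c) ** 2 for x in data)

x_bar = mean(marks)   # 61 / 5 = 12.2
me = median(marks)    # middle value of 9, 12, 12, 13, 15

print(sum_squared_deviations(marks, x_bar))  # about 18.8
print(sum_squared_deviations(marks, me))     # 19
```

Any other choice of C gives a larger sum than the mean does, which is a numerical illustration of the property rather than a proof.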
d) Yule properties
The statistician Udny Yule has defined six properties to be satisfied by an ideal measure of central tendency.
The following table compares the different measures of central tendency based on Yule properties.
Yule properties                                      Mo    Me    x̄
is defined objectively                               no    yes   yes
depends on all the observations                      no    no    yes
has a concrete meaning                               yes   yes   no
is easy to calculate                                 yes   yes   yes
lends itself easily to mathematical manipulation     no    no    yes
is not very sensitive to fluctuations of the sample  no    yes   no
Table 2.27: Comparison of mode, median and mean.
None of the measures of central tendency satisfies all of Yule's properties simultaneously. The mean is the
preferred measure of central tendency for a symmetrical distribution. The median is a suitable measure of
central tendency for a skewed distribution or when dealing with ordinal data. The mode is the most
appropriate measure of central tendency for nominal data. The best measure of central tendency therefore
depends on the type of data and on the statistical study being carried out.
a) Quartiles
Quartiles are 3 values that divide the ordered data into four equal parts.
The first quartile, Q1, is the value such that at least a quarter (25%) of the data are less than or equal to this
value and at least three quarters (75%) of the data are greater than or equal to this value.
The second quartile, Q2, corresponds to the median.
The third quartile, Q3, is the value such that at least three quarters (75%) of the data are less than or equal
to this value and at least a quarter (25%) of the data are greater than or equal to this value.
There are several methods for calculating the quartiles, depending on the type of data.
Numerical data given on individual basis
Consider an individual series containing n data values arranged in ascending order. There are two cases:
either n/4 is an integer or not.
Case 1: n/4 is an integer, then the first quartile Q1 is the value of rank n/4 and the third quartile Q3 is the
value of rank 3n/4.
Example 2.30: Consider the following individual series:
3 13 22 5 13 17 10 22
Let's sort these values in ascending order:
3 5 10 13 13 17 22 22
There are 8 observations:
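Here n/4 = 2 is an integer, so Case 1 applies: Q1 is the value of rank 2 and Q3 is the value of rank 3n/4 = 6. A minimal Python sketch of this rule is shown below. It covers Case 1 only, and the function name `quartiles_case1` is an illustrative assumption.

```python
def quartiles_case1(data):
    """Q1 and Q3 when n/4 is an integer: the values of rank n/4 and 3n/4."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 4 != 0:
        raise ValueError("this sketch only covers the case where n/4 is an integer")
    # Ranks are 1-based in the text, list indices are 0-based in Python.
    q1 = ordered[n // 4 - 1]
    q3 = ordered[3 * n // 4 - 1]
    return q1, q3

series = [3, 13, 22, 5, 13, 17, 10, 22]
print(quartiles_case1(series))  # (5, 17)
```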
Discrete data
The quartiles are determined using the less than cumulative frequencies. The first quartile Q1 is the value
for which the cumulative frequency is at least equal to n/4, and the third quartile Q3 is the value for which
the cumulative frequency is at least equal to 3n/4, where n is the total number of data values.
Example 2.32: The frequency distribution with cumulative frequency for number of employees in 100
industrial companies is given in the table below (refer to the data in example 2.22):
Number of employees   Frequency   Less than cumulative
xi                    ni          frequency
13                     4            4
14                     6           10
15                     7           17
16                    15           32    (Q1)
17                    24           56
18                    16           72
19                    14           86    (Q3)
20                     7           93
21                     4           97
22                     3          100
Total                100
Table 2.28: Frequency distribution with cumulative frequency for the number of employees.
The first quartile Q1 is the number of employees for which the cumulative frequency is at least equal to
n/4 = 100/4 = 25. The value 25 does not appear in the less than cumulative frequency column of
Table 2.28, so we take the next higher cumulative frequency, which is 32. Reading across this row to the
first column gives the corresponding number of employees, hence Q1 = 16.
The third quartile Q3 is the number of employees for which the cumulative frequency is at least equal to
3n/4 = 300/4 = 75. The value 75 does not appear in the less than cumulative frequency column of
Table 2.28, so we take the next higher cumulative frequency, which is 86. Reading across this row to the
first column gives the corresponding number of employees, hence Q3 = 19.
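The rule used above (take the first value whose less than cumulative frequency reaches kn/4) can be sketched in Python as follows. The function name `discrete_quartile` is an illustrative choice.

```python
from itertools import accumulate

# Number of employees (x_i) and frequencies (n_i) from Table 2.28.
values = [13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
frequencies = [4, 6, 7, 15, 24, 16, 14, 7, 4, 3]

def discrete_quartile(values, frequencies, k):
    """Return the first value whose cumulative frequency is >= k*n/4."""
    if k not in (1, 2, 3):
        raise ValueError("k must be 1, 2 or 3")
    n = sum(frequencies)
    target = k * n / 4
    for x_i, N_i in zip(values, accumulate(frequencies)):
        if N_i >= target:
            return x_i

print(discrete_quartile(values, frequencies, 1))  # 16
print(discrete_quartile(values, frequencies, 3))  # 19
```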
Continuous data
Let's consider a continuous variable with the whole range of data values grouped into classes. The quartiles
can be calculated either analytically or graphically.
Analytical method
We start by identifying the class containing the kth quartile (k = 1, 2 or 3) using the less than cumulative
frequencies. The class containing the kth quartile is the class for which the cumulative frequency is at least
equal to (kn/4), where n is the total frequency. The value of the kth quartile is determined by linear
interpolation, under the assumption that the frequencies are uniformly distributed within each class. By a
similar reasoning to the one adopted for the median, we can show that the value of the kth quartile is given
by the formula:
$$ Q_k = x_{i-1} + (x_i - x_{i-1}) \, \frac{(kn/4) - N_{i-1}}{N_i - N_{i-1}} $$
where:
$x_{i-1}$: lower limit of the class containing $Q_k$;
$x_i$: upper limit of the class containing $Q_k$;
$N_{i-1}$: less than cumulative frequency of the value $x_{i-1}$;
$N_i$: less than cumulative frequency of the value $x_i$.
Example 2.33: The frequency distribution with cumulative frequency of rainfall (in cm) in a city over a period
of 30 years is shown in the following table (refer to the data in example 2.23):
Rainfall (cm)   Frequency   Less than cumulative
                ni          frequency
[28 - 31[        4            4
[31 - 34[        4            8
[34 - 37[        9           17
[37 - 40[        2           19
[40 - 43[        5           24
[43 - 46[        6           30
Total           30
Table 2.29: Grouped frequency distribution with less than cumulative frequency of rainfall.
We want to calculate the first quartile Q1, so k = 1 in the above formula and (kn/4) = (1 × 30/4) = 7.5.
From the less than cumulative frequency column, 4 < 7.5 < 8. The less than cumulative frequency 4
corresponds to the upper class limit 31 and the less than cumulative frequency 8 corresponds to the upper
class limit 34. Therefore:
xi–1 = 31, xi = 34, Ni–1 = 4, Ni = 8, so Q1 lies in the class [31-34[.
Substituting all these values in the formula:
$$ Q_1 = x_{i-1} + (x_i - x_{i-1}) \, \frac{(n/4) - N_{i-1}}{N_i - N_{i-1}} = 31 + (34 - 31) \times \frac{7.5 - 4}{8 - 4} = 33.625 \text{ cm} $$
Now, we want to calculate the third quartile Q3, so k = 3 in the above formula and (kn/4) = (3 × 30/4) = 22.5.
From the less than cumulative frequency column, 19 < 22.5 < 24. The less than cumulative frequency 19
corresponds to the upper class limit 40 and the less than cumulative frequency 24 corresponds to the upper
class limit 43.
Therefore:
xi–1 = 40, xi = 43, Ni–1 = 19, Ni = 24, so Q3 lies in the class [40-43[.
Substituting all these values in the formula:
$$ Q_3 = x_{i-1} + (x_i - x_{i-1}) \, \frac{(3n/4) - N_{i-1}}{N_i - N_{i-1}} = 40 + (43 - 40) \times \frac{22.5 - 19}{24 - 19} = 42.1 \text{ cm} $$
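The interpolation formula can be sketched in Python using the class limits and less than cumulative frequencies of Table 2.29. The function name `grouped_quartile` and the leading pair (28, 0), which records that no value lies below the first class, are assumptions made for this illustration.

```python
# Class limits and less than cumulative frequencies from Table 2.29.
# The first pair (28, 0) is the lower limit of the first class.
limits = [28, 31, 34, 37, 40, 43, 46]
cumulative = [0, 4, 8, 17, 19, 24, 30]

def grouped_quartile(limits, cumulative, k):
    """Return Q_k by linear interpolation inside the class that contains it."""
    if k not in (1, 2, 3):
        raise ValueError("k must be 1, 2 or 3")
    n = cumulative[-1]
    target = k * n / 4
    for i in range(1, len(limits)):
        if cumulative[i] >= target:
            x_lo, x_hi = limits[i - 1], limits[i]
            N_lo, N_hi = cumulative[i - 1], cumulative[i]
            return x_lo + (x_hi - x_lo) * (target - N_lo) / (N_hi - N_lo)

q1 = grouped_quartile(limits, cumulative, 1)
q3 = grouped_quartile(limits, cumulative, 3)
print(q1, round(q3, 2))
```

With these data the sketch should give Q1 = 33.625 cm and Q3 = 42.1 cm, in line with substituting the values by hand.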
Note: To calculate the median, first and third quartiles by linear interpolation using relative frequencies, the
previous formulae are adapted by replacing:
the cumulative frequencies Ni – 1 and Ni by the cumulative relative frequencies Fi – 1 and Fi respectively;
the quantities n/4, n/2 and 3n/4 by 1/4, 1/2 and 3/4 respectively.
Graphical method
Quartiles can be determined graphically by using the less than ogive. First, we draw a less than ogive. Then,
we determine the cumulative frequency (kn/4), where n is the total frequency (put k = 1 for the first quartile,
k = 2 for the second quartile and k = 3 for the third quartile). Next, we draw a parallel line to the x-axis having
intercept (kn/4) on the y-axis. From the point of intersection of this line with the less than ogive we draw a
perpendicular on the x-axis. The abscissa of the point on the x-axis, at which the perpendicular cuts, is the
value of the quartile under consideration.
Example 2.34: Fig. 2.17 below shows the less than ogive for the frequency distribution of final grade average
of 1480 students of example 2.13.
For the first quartile Q1, k = 1 and (kn/4) = (1 × 1480/4) = 370.
We draw a line parallel to the x-axis at frequency 370. From the point of intersection of this line and the less
than ogive, we draw a perpendicular to the x-axis. The point where it meets the x-axis is the value of the
first quartile. Here, Q1 ≈ 11.3.
For the third quartile Q3, k = 3 and (kn/4) = (3 × 1480/4) = 1110.
We draw a line parallel to the x-axis at frequency 1110. From the point of intersection of this line and the
less than ogive, we draw a perpendicular to the x-axis. The point where it meets the x-axis is the value of
the third quartile. Here, Q3 ≈ 15.1.
Fig. 2.17: Determining quartiles Q1 and Q3 graphically using less than ogive.
b) Deciles
Deciles are 9 values that divide the ordered data into ten equal parts.
The first decile, D1, is the value such that at least 10% of the data are less than or equal to this value and at
least 90% of the data are greater than or equal to this value.
The second decile, D2, is the value such that at least 20% of the data are less than or equal to this value and
at least 80% of the data are greater than or equal to this value.
The fifth decile, D5, is the value such that at least 50% of the data are less than or equal to this value and at
least 50% of the data are greater than or equal to this value. D5 corresponds to the median.
The ninth decile, D9, is the value such that at least 90% of the data are less than or equal to this value and
at least 10% of the data are greater than or equal to this value.
Deciles may be determined in the same way as quartiles except that in place of (kn/4), where k = 1, 2 or 3
we will use (kn/10), where k = 1, 2, …, 9.
c) Percentiles
Percentiles are 99 values that divide the ordered data into one hundred equal parts.
The kth percentile, denoted by Pk, is the value such that at least k% of the data are less than or equal to this
value and at least (100 – k)% of the data are greater than or equal to this value, where k = 1, 2, 3, …, 99.
Note: The 50th percentile is the median, the 25th percentile is the first quartile and the 75th percentile is the
third quartile.
Percentiles may be determined in the same way as quartiles except that in place of (kn/4), where k = 1, 2 or
3, we will use (kn/100), where k = 1, 2, 3, …, 99.
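Since quartiles, deciles and percentiles differ only in the denominator (4, 10 or 100), a single routine can cover all three for discrete data. The sketch below reuses the data of Table 2.28 purely as an illustration; the function name `discrete_quantile` and its `parts` parameter are hypothetical.

```python
from itertools import accumulate

# Number of employees (x_i) and frequencies (n_i) from Table 2.28.
values = [13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
frequencies = [4, 6, 7, 15, 24, 16, 14, 7, 4, 3]

def discrete_quantile(values, frequencies, k, parts):
    """First value whose cumulative frequency is >= k*n/parts.

    parts = 4 for quartiles, 10 for deciles, 100 for percentiles.
    """
    if not 0 < k < parts:
        raise ValueError("k must satisfy 0 < k < parts")
    n = sum(frequencies)
    target = k * n / parts
    for x_i, N_i in zip(values, accumulate(frequencies)):
        if N_i >= target:
            return x_i

d1 = discrete_quantile(values, frequencies, 1, 10)     # first decile
d9 = discrete_quantile(values, frequencies, 9, 10)     # ninth decile
p50 = discrete_quantile(values, frequencies, 50, 100)  # 50th percentile
print(d1, d9, p50)  # 14 20 17
```

The 25th and 75th percentiles returned by this routine coincide with Q1 and Q3, as stated in the note above.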
Interdecile range
The interdecile range (IDR) is the difference between the ninth and the first deciles:
IDR = D9 - D1
The IDR is a measure of dispersion around the median and is not affected by extreme values and outliers.
The IDR measures the spread of the central 80% of values in a data set, ignoring the bottom 10% of the data
and the top 10% as well.
The bigger the IDR, the wider the spread of the central 80% of data values.
Mean absolute deviation
Mean absolute deviation (MAD) of a data set is the average of the absolute differences between each value
and the mean. The formula for calculating MAD is as follows:
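$$ \mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert \quad \text{for individual series} $$
$$ \mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{k} n_i \lvert x_i - \bar{x} \rvert \quad \text{for discrete series} $$
$$ \mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{k} n_i \lvert c_i - \bar{x} \rvert \quad \text{for continuous series} $$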
MAD is calculated by considering all the values in the data set and is hard to work with algebraically because
it involves absolute values. MAD uses the original units of the data. Mean absolute deviation is a measure
of variability: a low MAD suggests that data are tightly grouped around the mean (low variability) while a
high MAD suggests that data are spread out from the mean (high variability).
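The three formulas share the same structure, so one routine can handle them all: an individual series is a discrete series in which every frequency equals 1, and a continuous series uses the mid-points as values. The following Python sketch, with the hypothetical function name `mean_absolute_deviation`, illustrates this on the marks of Example 2.27 and the data of Table 2.25.

```python
def mean_absolute_deviation(values, frequencies=None):
    """Return (1/n) * sum(n_i * |x_i - mean|); frequencies default to 1."""
    if frequencies is None:
        frequencies = [1] * len(values)
    n = sum(frequencies)
    x_bar = sum(n_i * x_i for x_i, n_i in zip(values, frequencies)) / n
    return sum(n_i * abs(x_i - x_bar) for x_i, n_i in zip(values, frequencies)) / n

# Individual series: marks from Example 2.27 (mean 12.2).
marks = [13, 12, 15, 9, 12]
print(round(mean_absolute_deviation(marks), 2))  # 7.2 / 5 = 1.44

# Discrete series: number of children from Table 2.25 (mean 3.1).
children = [1, 2, 3, 4, 5, 6]
families = [4, 3, 4, 7, 0, 2]
print(round(mean_absolute_deviation(children, families), 2))  # 24.2 / 20 = 1.21
```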
Variance
Variance of a data set is the average of the squared differences between each value and the mean. The
formula for calculating the variance is as follows:
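$$ V = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad \text{for individual series} $$
$$ V = \frac{1}{n} \sum_{i=1}^{k} n_i (x_i - \bar{x})^2 \quad \text{for discrete series} $$
$$ V = \frac{1}{n} \sum_{i=1}^{k} n_i (c_i - \bar{x})^2 \quad \text{for continuous series} $$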
Variance is a measure of dispersion (or variability) of a set of data values around the mean. A low variance
indicates that data are concentrated near the mean. A high variance indicates that data are spread out from
the mean. Variance is calculated by considering all the values in the data set and is sensitive to extreme
values. Variance is measured in the square of the unit of the data, which is sometimes difficult to interpret.
For example, if we are looking at weights in kilograms the variance will be in kg squared. In order to remedy
this flaw, we use the square root of the variance known as the standard deviation.
Standard deviation
The standard deviation of a data set is the square root of its variance:
$$ \sigma = \sqrt{V} $$
The standard deviation is the most commonly used of the absolute measures of dispersion around the mean.
The standard deviation is calculated by considering all the values in the data set and is sensitive to extreme
values. It is expressed in the same unit as the data. The standard deviation is more adequate to describe the
variability of the data while variance is more suitable for statistical calculations.
The variance and standard deviation are not linear, but have very important properties:
V(x + a) = V(x), σ(x + a) = σ(x): adding a constant to each value in a data set changes neither the standard
deviation nor the variance.
V(bx) = b² V(x), σ(bx) = |b| σ(x): multiplying each value in a data set by a constant multiplies the standard
deviation by the absolute value of that constant and the variance by its square.
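These two properties can be checked numerically. The sketch below uses `pvariance` and `pstdev` from Python's standard `statistics` module, which divide by n as in the definitions above; the data set and the constants a and b are arbitrary choices made for illustration.

```python
from math import isclose
from statistics import pvariance, pstdev

data = [13, 12, 15, 9, 12]
a, b = 5, -2  # arbitrary constants chosen for illustration

shifted = [x + a for x in data]
scaled = [b * x for x in data]

# Adding a constant changes neither the variance nor the standard deviation.
print(isclose(pvariance(shifted), pvariance(data)))
print(isclose(pstdev(shifted), pstdev(data)))

# Multiplying by b multiplies the variance by b^2 and sigma by |b|.
print(isclose(pvariance(scaled), b ** 2 * pvariance(data)))
print(isclose(pstdev(scaled), abs(b) * pstdev(data)))
```

A negative value of b is used on purpose: it shows why the standard deviation is multiplied by |b| and not by b.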
Example 2.35: A delivery driver records the distance (in km) covered on his (daily) delivery round in the last
26 days:
Distance covered (km)   Number of days
xi                      ni
2                        5
6                        9
9                        4
11                       3
15                       5
Total                   26
Table 2.30: Distribution of the distance covered by the delivery driver.
The distance covered is a continuous quantitative variable. Because the values in Table 2.30 are rounded to
the nearest whole number this variable will be treated as a discrete quantitative variable.
The variance, for a discrete series, is given by the König-Huygens formula:
$$ V = \frac{1}{n} \sum_{i=1}^{k} n_i x_i^2 - \bar{x}^2 $$
$$ \sigma = \sqrt{V} \approx 4.35 \text{ km} $$
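The intermediate steps of the König-Huygens calculation for Table 2.30 can be sketched in Python as follows; the variable names are illustrative.

```python
from math import sqrt

# Distance covered (x_i, in km) and number of days (n_i) from Table 2.30.
distances = [2, 6, 9, 11, 15]
days = [5, 9, 4, 3, 5]

n = sum(days)  # 26
mean_distance = sum(n_i * x_i for x_i, n_i in zip(distances, days)) / n         # 208 / 26 = 8
mean_of_squares = sum(n_i * x_i ** 2 for x_i, n_i in zip(distances, days)) / n  # 2156 / 26

# König-Huygens: variance = mean of the squares minus square of the mean.
variance = mean_of_squares - mean_distance ** 2
sigma = sqrt(variance)

print(round(variance, 2), round(sigma, 2))  # 18.92 4.35
```

The variance is expressed in km² and the standard deviation in km.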
b) Relative measures of dispersion
It is possible to compare absolute measures of dispersion of two or more data sets, provided that they share
the same units and they have the same, or approximately the same, order of magnitude. If this is not the
case, the comparison can only be made using relative measures of dispersion.
Relative measures of dispersion are expressed as ratios or percentages, which makes them
unitless. There are four relative measures of dispersion:
Coefficient of range
The coefficient of range (CR) is based on the range. It is defined as:
$$ CR = \frac{x_{\max} - x_{\min}}{x_{\max} + x_{\min}} $$
$$ CMD = \frac{\mathrm{MAD}}{\bar{x}} $$
Coefficient of variation
The coefficient of variation (CV) is based on the standard deviation. It is usually expressed in percentage
terms and is the most commonly used of the relative measures of dispersion. It is defined as:
$$ CV = \frac{\sigma}{\bar{x}} \times 100\% $$
These coefficients are dimensionless numbers. A low coefficient reflects high uniformity or a small dispersion
of data. A high coefficient indicates low uniformity or a large dispersion of data.
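As an illustration, the coefficients CR, CMD and CV defined above can be computed for the delivery data of Table 2.30. Applying them to this table is our own choice of example, and the variable names are hypothetical.

```python
from math import sqrt

# Distance covered (x_i, in km) and number of days (n_i) from Table 2.30.
distances = [2, 6, 9, 11, 15]
days = [5, 9, 4, 3, 5]

n = sum(days)
mean = sum(n_i * x_i for x_i, n_i in zip(distances, days)) / n
mad = sum(n_i * abs(x_i - mean) for x_i, n_i in zip(distances, days)) / n
sigma = sqrt(sum(n_i * (x_i - mean) ** 2 for x_i, n_i in zip(distances, days)) / n)

# Coefficient of range: (x_max - x_min) / (x_max + x_min).
cr = (max(distances) - min(distances)) / (max(distances) + min(distances))
# CMD = MAD / mean.
cmd = mad / mean
# Coefficient of variation, in percent.
cv = sigma / mean * 100

print(round(cr, 3), round(cmd, 3), round(cv, 1))
```

All three results are pure numbers: the kilometres cancel out in each ratio.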
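$$ \mu_r = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r \quad \text{for individual series} $$
$$ \mu_r = \frac{1}{n} \sum_{i=1}^{k} n_i (x_i - \bar{x})^r \quad \text{for discrete series} $$
$$ \mu_r = \frac{1}{n} \sum_{i=1}^{k} n_i (c_i - \bar{x})^r \quad \text{for continuous series} $$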
b) Skewness
A distribution is symmetric if its left side and right side are mirror images of each other. Otherwise, the
distribution is asymmetric. The concept of skewness helps describe asymmetric distributions. Skewness is
a measure of the lack of symmetry of a distribution. A distribution is skewed to the right (or positively
skewed) if it has a long tail on its right side. Likewise, a distribution is skewed to the left (or negatively
skewed) if it has a long tail on its left side. A normal distribution (bell-shaped curve) exhibits zero skewness.
The skewness of a distribution can be determined in two ways:
Example 2.36: Consider the three frequency histograms and three frequency curves below:
The two left graphs show negatively skewed distributions (γ1 < 0).
The two middle graphs show symmetric distributions (γ1 = 0).
The two right graphs show positively skewed distributions (γ1 > 0).
c) Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution
(bell-shaped curve). A distribution with a high kurtosis suggests heavy tails and more outliers. Alternatively,
a distribution with a low kurtosis suggests light tails and fewer outliers.
There are several formulas to measure kurtosis. One of them is Fisher’s kurtosis coefficient.
Fisher’s kurtosis coefficient, denoted by γ2, is the ratio of the fourth central moment to the fourth power
of the standard deviation, minus three:
$$ \gamma_2 = \frac{\mu_4}{\sigma^4} - 3 $$
Fisher’s kurtosis coefficient is a unitless number.
If it is zero, the distribution is mesokurtic (exhibits tails that are similar to the normal distribution).
If it is negative, the distribution is platykurtic (exhibits thinner tails than the normal distribution).
If it is positive, the distribution is leptokurtic (exhibits fatter tails than the normal distribution).
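Fisher's kurtosis coefficient can be computed from the second and fourth central moments, since σ⁴ = μ2². The Python sketch below applies it to the delivery data of Table 2.30, which is our own choice of example; the function names `central_moment` and `fisher_kurtosis` are hypothetical.

```python
def central_moment(values, frequencies, r):
    """Return mu_r = (1/n) * sum(n_i * (x_i - mean)^r)."""
    n = sum(frequencies)
    mean = sum(n_i * x_i for x_i, n_i in zip(values, frequencies)) / n
    return sum(n_i * (x_i - mean) ** r for x_i, n_i in zip(values, frequencies)) / n

def fisher_kurtosis(values, frequencies):
    """Return gamma_2 = mu_4 / sigma^4 - 3, using sigma^2 = mu_2."""
    mu2 = central_moment(values, frequencies, 2)
    mu4 = central_moment(values, frequencies, 4)
    return mu4 / mu2 ** 2 - 3

# Distance covered (x_i, in km) and number of days (n_i) from Table 2.30.
distances = [2, 6, 9, 11, 15]
days = [5, 9, 4, 3, 5]

gamma2 = fisher_kurtosis(distances, days)
print(round(gamma2, 2))  # about -0.97
```

The coefficient is negative here, so under the classification above this distribution would be described as platykurtic.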
Example 2.37: The figure below shows the three types of kurtosis in histograms.
The left graph shows a platykurtic distribution (γ2 < 0).
The middle graph shows a mesokurtic distribution (γ2 = 0).
Below is a graph comparing a normal (mesokurtic) distribution with a platykurtic and leptokurtic one.