PLU Quantitative Techniques 2
Lecture Notes 2
Wanangwa Gondwe
Pentecostal Life
University
2 Descriptive Statistics
2.1 Introduction
This section presents an introduction to descriptive statistics. In this unit we will qualita-
tively and quantitatively describe characteristics of data. We will summarize attributes of a
sample with the aim of knowing its nature and present it so that others can understand and
use the information it contains. We will cover ways to numerically locate data centrally and
ascertain the nature of its variability. Most descriptive statistics are presented in what is
known as “summary statistics” because they summarize the characteristics of the data. They
provide visually easy to understand graphics and tables that will promote comprehension
and further inquiry. When we present descriptive statistics collectively either graphically or
in tabular form we do what is called exploratory data analysis.
Descriptive statistics gives several techniques for organizing data. Bar graphs, pie charts,
frequency distributions, histograms, and stem-and-leaf plots are techniques for describing
data. Oftentimes, we are interested in a typical numerical value to help us describe a data
set. This typical value is often called an average value or a measure of central tendency. We
are looking for a single number that is in some sense representative of the complete data set.
In statistics, this is called data reduction.
Frequency distribution
When data are qualitative, we use names to identify the different categories (or classes).
Often we summarize qualitative data by using a frequency distribution. It is a tabular
summary of data showing the number (frequency) of items in each of several non-overlapping
classes. A frequency distribution for qualitative data lists all categories and the number of
elements that belong to each of the categories. For example, the table below shows the regions
covered by a survey and the number of interviews conducted in each.
Region Frequency
North 2176
Central 3952
Southern 5306
Source: Malawi Integrated Household survey data, 2019
Relative frequency
A relative frequency distribution gives a tabular summary of data showing the relative
frequency for each class. For example, in Table 1 the second column shows the frequency for
each class. Dividing each frequency by the total gives the relative frequencies. A percent
frequency distribution summarizes the percent frequency of the data for each class. Column
3 of Fig. 3 shows the percent frequency. Note: to get the percent frequency, multiply
the relative frequency by 100.
Summary: The relative frequency of a category is obtained by dividing the frequency for a
category by the sum of all the frequencies. If you multiply the relative frequency by 100 you
get the percentages in the third column.
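The calculation in the summary can be sketched in a few lines of Python. This is only an illustration using the regional frequencies from the table above; the variable names are chosen here for the example and are not part of the notes.

```python
# Frequencies taken from the regional table above.
frequencies = {"North": 2176, "Central": 3952, "Southern": 5306}

# The total number of interviews is the sum of all class frequencies.
total = sum(frequencies.values())

# Relative frequency = class frequency / total; percent = relative * 100.
relative_frequencies = {region: f / total for region, f in frequencies.items()}
percent_frequencies = {region: 100 * rf for region, rf in relative_frequencies.items()}

for region, f in frequencies.items():
    rf = relative_frequencies[region]
    pf = percent_frequencies[region]
    print(f"{region:<10}{f:>6}{rf:>8.3f}{pf:>7.1f}%")
print(f"{'Total':<10}{total:>6}")
```

The relative frequencies always add up to 1 and the percent frequencies to 100, which is a quick check on the arithmetic.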
Cumulative distributions
A cumulative frequency distribution gives the total number of values that fall below various
class boundaries of a frequency distribution. Table below shows the frequency distribution
of the contents in milliliters of a sample of 25 one-liter bottles of Orange Squash.
Table 3: Frequency distribution table
Range Frequency
970 to less than 990 5
990 to less than 1010 10
1010 to less than 1030 5
1030 to less than 1050 3
1050 to less than 1070 2
From the Table above we can construct the cumulative frequency distribution as below.
Content less than   Cumulative frequency   Cumulative relative frequency   Cumulative percent
970                 0                      0/25 = 0                        0%
990                 5                      5/25 = 0.20                     20%
1010                5 + 10 = 15            15/25 = 0.60                    60%
1030                15 + 5 = 20            20/25 = 0.80                    80%
1050                20 + 3 = 23            23/25 = 0.92                    92%
1070                23 + 2 = 25            25/25 = 1                       100%
From the Table above cumulative relative frequency is obtained by dividing a cumulative
frequency by the total number of observations in the data set. Cumulative percentages are
obtained by multiplying cumulative relative frequencies by 100.
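As a rough sketch, the running totals in the cumulative table can be reproduced with `itertools.accumulate`. The class upper boundaries and frequencies are those of the Orange Squash table; the variable names are illustrative only.

```python
from itertools import accumulate

# Upper class boundaries (ml) and class frequencies from the Orange Squash table.
upper_boundaries = [990, 1010, 1030, 1050, 1070]
class_frequencies = [5, 10, 5, 3, 2]

# Running total of the frequencies gives the cumulative frequencies.
cumulative = list(accumulate(class_frequencies))
n = cumulative[-1]  # total number of observations (25 bottles)

# Cumulative relative frequency and cumulative percent for each boundary.
cumulative_relative = [c / n for c in cumulative]
cumulative_percent = [100 * c / n for c in cumulative]

for bound, c, rel, pct in zip(upper_boundaries, cumulative,
                              cumulative_relative, cumulative_percent):
    print(f"less than {bound}: {c:>2}  {rel:.2f}  {pct:.0f}%")
```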
There are many similarities between frequency distributions for qualitative data and fre-
quency distributions for quantitative data. Terminology for frequency distributions of quan-
titative data is discussed first, and then examples illustrating the construction of frequency
distributions for quantitative data are given. Table below gives a frequency distribution of
the University of Malawi Entrance examination scores
Table 5: Test scores for LUANAR Entrance exams (continued)
The frequency distribution given in Table above is composed of five classes. The classes
are: 80-94, 95-109, 110-124, 125-139, and 140-154. Each class has a lower class limit and
an upper class limit. The lower class limits for this distribution are 80, 95, 110, 125, and
140. The upper class limits are 94, 109, 124, 139, and 154. If the lower class limit for the
second class, 95, is added to the upper class limit for the first class, 94, and the sum divided
by 2, the upper boundary for the first class and the lower boundary for the second class are
determined. Table below gives all the boundaries for Table above. If the lower class limit is
added to the upper class limit for any class and the sum divided by 2, the class mark for that
class is obtained. The class mark for a class is the midpoint of the class and is sometimes
called the class midpoint rather than the class mark. The class marks for Table above are
shown in Table below. The difference between the boundaries for any class gives the class
width for a distribution. The class width for the distribution in Table below is 15.
Table 6: Class limit, boundary, width and mark
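The boundaries, class marks, and class width described above can be computed directly from the class limits. The sketch below uses the five classes named in the text (80-94 through 140-154). It assumes that the lower boundary of the first class and the upper boundary of the last class extend by the same half-unit gap as the interior boundaries; that assumption is the usual convention but is not stated in the notes.

```python
# Class limits (lower, upper) as listed in the text.
class_limits = [(80, 94), (95, 109), (110, 124), (125, 139), (140, 154)]

# Gap between one class's upper limit and the next class's lower limit (95 - 94 = 1).
gap = class_limits[1][0] - class_limits[0][1]

# Boundaries sit halfway across the gap; the outermost boundaries are an assumption.
boundaries = [(lower - gap / 2, upper + gap / 2) for lower, upper in class_limits]

# Class mark (midpoint) = (lower limit + upper limit) / 2.
class_marks = [(lower + upper) / 2 for lower, upper in class_limits]

# Class width = upper boundary - lower boundary.
class_widths = [ub - lb for lb, ub in boundaries]

for limits, bounds, mark, width in zip(class_limits, boundaries, class_marks, class_widths):
    print(f"{limits[0]}-{limits[1]}  boundaries {bounds[0]}-{bounds[1]}  "
          f"mark {mark}  width {width}")
```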
The general steps for forming a frequency distribution from raw data are:
1. Determine the largest and smallest numbers in the raw data and thus find the range
(the difference between the largest and smallest numbers).
2. Divide the range into a convenient number of class intervals having the same size. If
this is not feasible, use class intervals of different sizes or open class intervals. The
number of class intervals is usually between 5 and 20, depending on the data. Class
intervals are also chosen so that the class marks (or midpoints) coincide with the
actually observed data. This tends to lessen the so-called grouping error involved in
further mathematical analysis. However, the class boundaries should not coincide with
the actually observed data.
3. Determine the number of observations falling into each class interval; that is, find the
class frequencies.
Example
The following data set gives the yearly food distribution expenditure in Thousands of MK
for 25 households in TA Chapananga in Chikwawa:
Construct a frequency distribution consisting of six classes for this data set. Use 0.5 as the
lower limit for the first class and use a class width equal to 0.5.
Solution
The first class would extend from 0.5 to 0.9 since the desired lower limit is 0.5 and the desired
class width is 0.5. Note that the class boundaries are 0.45 and 0.95 and therefore the class
width equals 0.95 - 0.45 or 0.5. The frequency distribution is shown in Table below.
Table 7: Emergency Expenditure in Chikwawa
Expenditure   Frequency
0.5 - 0.9     1
1.0 - 1.4     2
1.5 - 1.9     5
2.0 - 2.4     5
2.5 - 2.9     7
3.0 - 3.4     4
Dot plot
A dot plot is a very simple graph that can be used to summarize a data set.
To make a dot plot we draw a horizontal axis that spans the range of the measurements in
the data set. We then place dots above the horizontal axis to represent the measurements.
As an example, the figure below shows a dot plot of the scores on the first Statistics 1 test
of the semester. The horizontal axis spans exam scores from 30 to 100. Each dot above
the axis represents an exam score. For instance, the two dots above the score of 90 tell us
that two students received a 90 on the exam. The dot plot shows us that there are two
concentrations of scores—those in the 80s and 90s and those in the 60s.
32 63 69 85 91 45 64 69 86 92 50 64 72 87 92 56 65 76 87 93
58 66 78 88 93 60 67 81 89 94 61 67 83 90 96 61 68 83 90 98
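A dot plot is essentially a count of how often each value occurs. The sketch below tallies the 40 exam scores listed above with `collections.Counter` and prints a sideways text version of the plot (one row per distinct score, one asterisk per student). The text layout is only an illustration; a plotting library would normally be used to draw the figure.

```python
from collections import Counter

# The 40 exam scores listed above.
scores = [
    32, 63, 69, 85, 91, 45, 64, 69, 86, 92, 50, 64, 72, 87, 92, 56, 65, 76, 87, 93,
    58, 66, 78, 88, 93, 60, 67, 81, 89, 94, 61, 67, 83, 90, 96, 61, 68, 83, 90, 98,
]

# Count how many students obtained each score.
dot_counts = Counter(scores)

# One row per distinct score, in increasing order, with one '*' per observation.
for score in sorted(dot_counts):
    print(f"{score:3d} " + "*" * dot_counts[score])
```

The row for 90 carries two asterisks, matching the two dots above 90 described in the text.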
[Figure: dot plot of the exam scores; horizontal axis: x1 (40 to 100), vertical axis: count]
Histogram
A histogram is a bar graph of a frequency distribution: it groups numerical data into ranges
(classes) and plots the frequency of each range as a bar. The figure below shows a histogram
of stunting in under-five children in Malawi using DHS data.
Stem-and-leaf display
Another simple graph that can be used to quickly summarize a data set is called a stem-and-leaf
display. This kind of graph places the measurements in order from smallest to largest,
and allows the analyst to simultaneously see all of the measurements in the data set and see
the shape of the data set’s distribution. The following are the mileages of cars imported from
Singapore to Malawi:
30.8 30.8 32.1 32.3 32.7 31.7 30.4 31.4 32.7 31.4 30.1 32.5 30.8 31.2 31.8
31.6 30.3 32.8 30.7 31.9 32.1 31.3 31.9 31.7 33.0 33.3 32.1 31.4 31.4 31.5
31.3 32.5 32.4 32.2 31.6 31.0 31.8 31.0 31.5 30.6 32.0 30.5 29.8 31.7 32.3
32.4 30.5 31.1 30.7 31.4
[Figure: histogram; horizontal axis: Household size, vertical axis: Count]
To develop a stem-and-leaf display, we note that the sample mileages range from 29.8 to
33.3 and we place the leading digits of these mileages—the whole numbers 29, 30, 31, 32,
and 33—in a column on the left side of a vertical line. This vertical arrangement of leading
digits forms the stem of the display. Next, we pass through the mileages in Table above one
at a time and place each last digit (the tenths place) to the right of the vertical line in the
row corresponding to its leading digits. We form the leaves of the display by continuing this
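procedure as we pass through all 50 mileages. After recording the last digit for each of the
mileages, we sort the digits in each row from smallest to largest. Where a stem has many
leaves, it is split into two rows, the first holding leaves 0 to 4 and the second holding
leaves 5 to 9. We obtain the stem-and-leaf display that follows: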
29 | 8
30 | 134
30 | 55677888
31 | 00123344444
31 | 55667778899
32 | 011123344
32 | 55778
33 | 03
Example
During a study of willingness to buy fish in Lilongwe market, ages of consumers who were
randomly picked were recorded. Their ages were 11, 11, 12, 14, 16, 17, 21, 23, 24, 25, 29, 30,
30, 32, 37, 40, 41, 53, 60. Draw a stem and leaf diagram of the data.
Solution
1 | 112467
2 | 13459
3 | 0027
4 | 01
5 | 3
6 | 0
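The solution above can be reproduced programmatically. The following is a minimal sketch for two-digit whole numbers, where the tens digit is the stem and the units digit is the leaf; it does not split stems into two rows as the mileage display does.

```python
from collections import defaultdict

# Ages of the consumers in the fish-market example.
ages = [11, 11, 12, 14, 16, 17, 21, 23, 24, 25, 29, 30,
        30, 32, 37, 40, 41, 53, 60]

# Group the units digits (leaves) under their tens digit (stem).
stems = defaultdict(list)
for age in sorted(ages):
    stem, leaf = divmod(age, 10)
    stems[stem].append(str(leaf))

# One text row per stem, leaves already in ascending order.
display_rows = [f"{stem} | {''.join(leaves)}" for stem, leaves in sorted(stems.items())]
print("\n".join(display_rows))
```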
Pie chart
A pie chart is a useful method for displaying the percentage of observations that fall into each
category of qualitative data. A pie chart is an effective method of showing the percentage
breakdown of a whole entity. It is a circular statistical graphic, which is divided into slices to
illustrate numerical proportion. In a pie chart, the arc length of each slice (and consequently
its central angle and area) is proportional to the quantity it represents. While it is named
for its resemblance to a pie which has been sliced, there are variations on the way it can be
presented.
Previous sections in this chapter have presented methods for summarizing data for a single
variable. Often, however, we wish to use statistics to study possible relationships between
several variables. In this section we present a simple way to study the relationship between
two variables. Crosstabulation is a process that classifies data on two dimensions. This
process results in a table that is called a contingency table. Such a table consists of rows and
columns: the rows classify the data according to one dimension and the columns classify
the data according to a second dimension. Together, the rows and columns represent all
possibilities.
Figure 3: A histogram showing symmetric data
We often study relationships between variables by using graphical methods. A simple graph
that can be used to study the relationship between two variables is called a scatter plot.
As an example, suppose that a marketing manager wishes to investigate the relationship
between the sales volume (in thousands of units) of a product and the amount spent (in
units of MK10,000) on advertising the product. To do this, the marketing manager randomly
selects 10 sales districts having equal sales potential. The manager assigns a different level
of advertising expenditure for January 2022 to each sales district, as shown in the table below.
At the end of the month, the sales volume for each district is recorded, as also shown in the
table below.
Table 9: Values of Advertising Expenditure (in MK10,000s) and Sales Volume (in 1000s) for Ten Sales Districts
[Figure: scatter plot; vertical axis: Sales Volume, y]
In the previous section we looked at bar graphs, pie charts, frequency distributions, histograms,
and stem-and-leaf plots. These are techniques for describing data. Oftentimes, we are
interested in a typical numerical value to help us describe a data set. This typical value
is often called an average value or a measure of central tendency. We are looking for a
single number that is in some sense representative of the complete data set. There are many
different measures of central tendency. The three most widely used measures of central
tendency are the mean, median, and mode.
a. Mean
The mean is the average value of a variable. It is denoted by µ (pronounced "mu") for the
population and by x̄ ("x bar") for the sample.
Example:
The number of 911 emergency calls classified as domestic disturbance calls in a low-density
area of Lilongwe city was sampled for thirty randomly selected 24-hour periods, with the
following results.
25 46 34 45 37 36 40 30 29 37 44 56 50 47 23
40 30 27 28 47 58 22 29 56 40 46 38 19 49 50
Find the mean number of calls per 24-hour period.
Solution
$$\bar{x} = \frac{\sum x}{n} = \frac{1168}{30} = 38.9$$
b. Median
The median of a set of data is a value that divides the bottom 50% of the data from the top
50% of the data. To find the median of a data set, first arrange the data
in increasing order. If the number of observations is odd, the median is the number in the
middle of the ordered list. If the number of observations is even, the median is the mean of
the two values closest to the middle of the ordered list.
Example:
Given the following data: 32, 42, 46, 54, 46. Find the median.
Solution:
Sort the data. 32, 42, 46, 46, 54. The middle number is 46. Therefore, the median is 46.
c. Mode
The mode is the value in a data set that occurs the most often. If no such value exists, we
say that the data set has no mode. If two such values exist, we say the data set is bimodal.
If three such values exist, we say the data set is trimodal. There is no symbol that is used
to represent the mode.
Example: Find the mode of the data in the previous example.
Solution: The most frequently occurring value is 46.
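The three measures of central tendency can be checked with Python's standard `statistics` module. The sketch below uses the five values from the median and mode examples; `multimode` is included because it returns every most-frequent value, which is how a bimodal or trimodal data set would show up.

```python
import statistics

# Data from the median and mode examples above.
data = [32, 42, 46, 54, 46]

mean_value = statistics.mean(data)      # sum of the values divided by their number
median_value = statistics.median(data)  # middle value of the sorted data
mode_value = statistics.mode(data)      # most frequently occurring value
all_modes = statistics.multimode(data)  # list of all values tied for most frequent

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
print("all modes:", all_modes)
```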
Measures of position are used to describe the location of a particular observation in relation
to the rest of the data set.
a. Percentile
Percentiles are values that divide the ranked data set into 100 equal parts. Percentiles provide
information about how the data are spread over the interval from the smallest value to the
largest value. The pth percentile is a value such that at least p percent of the observations
are less than or equal to this value and at least (100 − p) percent of observations are greater
than or equal to this value.
Calculating the pth percentile
The percentile for observation x is found by:
1. Dividing the number of observations less than x by the total number of observations.
2. Then multiplying this quantity by 100. This percent is then rounded to the nearest
whole number to give the percentile for observation x.
b. Decile
A decile rank arranges the data in order from lowest to highest and is done on a scale of
one to 10 where each successive number corresponds to an increase of 10 percentage points.
This type of data ranking is performed as part of many academic and statistical studies in
the finance and economics field. There is no one way of calculating a decile; however, it is
important that you are consistent with whatever formula you decide to use to calculate a
decile. One simple calculation of a decile is:
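$$D_1 = \frac{1}{10} \times (n + 1)\text{th data value}$$
$$D_2 = \frac{2}{10} \times (n + 1)\text{th data value}$$
$$D_3 = \frac{3}{10} \times (n + 1)\text{th data value}$$
$$D_5 = \frac{5}{10} \times (n + 1)\text{th data value}$$
$$D_9 = \frac{9}{10} \times (n + 1)\text{th data value}$$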
c. Quartiles
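Quartiles divide the ranked data into four equal parts and are found with the same percentile
formula: Q1 is the 25th percentile, Q2 (the median) is the 50th percentile, and Q3 is the 75th
percentile. The interquartile range, designated by IQR, is defined as follows:
IQR = Q3 − Q1
The interquartile range shows the spread of the middle 50% of the data and is not affected
by extremes (outliers) in the data set.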
Box-and-whiskers displays (box plots) - A box-and-whiskers display (sometimes called
a box plot) is constructed by using Q1, the median, Q3, and the interquartile range. A box is
drawn from Q1 to Q3, so the box contains the middle 50 percent of the data set. Next a
vertical line is drawn through the box at the value of the median. This line divides the data
set into two roughly equal parts. We next define what we call the lower and upper limits.
The lower limit is located 1.5 × IQR below Q1 and the upper limit is located 1.5 × IQR
above Q3:
Lower limit = Q1 − 1.5(IQR) and Upper limit = Q3 + 1.5(IQR)
The lower and upper limits are used to identify outliers. An outlier is a measurement that is
separated from (that is, different from) most of the other measurements in the data set. A
measurement that is less than the lower limit or greater than the upper limit is considered
to be an outlier.
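The limits can be illustrated with the consumer ages from the stem-and-leaf example earlier in these notes. The sketch below takes the quartiles from `statistics.quantiles`, whose default ("exclusive") method is one of several quartile conventions; this choice is an assumption made for the example, and other conventions can give slightly different quartiles.

```python
import statistics

# Consumer ages from the stem-and-leaf example.
ages = [11, 11, 12, 14, 16, 17, 21, 23, 24, 25, 29, 30,
        30, 32, 37, 40, 41, 53, 60]

# Quartiles using the library's default convention (an assumption of this sketch).
q1, median, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

# Limits located 1.5 * IQR below Q1 and 1.5 * IQR above Q3.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Any measurement outside the limits is flagged as an outlier.
outliers = [x for x in ages if x < lower_limit or x > upper_limit]

print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
print("limits:", lower_limit, "to", upper_limit)
print("outliers:", outliers)
```

For these ages no value falls outside the limits, so the list of outliers is empty.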
[Figure: box plots of household size by region; horizontal axis: Region (1, 2, 3), vertical axis: Household size]
d. Quintiles
Quintiles are obtained when the data are sorted in ascending order and divided into five equal
groups, each containing 20 percent of the data.
Example
Find the ninety-fifth percentile, the seventh decile, and the first quartile for the age distri-
bution given in Table below.
Age of Second year ODL Economics students
20 24 26 30
22 24 27 30
22 24 28 33
24 25 29 34
Solution
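Reading down the columns, the 16 ages are already in ascending order.
• To find P95, compute $i = \frac{np}{100} = \frac{16 \times 95}{100} = 15.2$. Since i is not a whole number, round up: the 95th percentile is the 16th observation in the ordered data set, which is 34.
• To find the 7th decile (same as P70), compute $i = \frac{np}{100} = \frac{16 \times 70}{100} = 11.2$. Rounding up, the 7th decile is the 12th observation in the ordered data set, which is 29. Thus, at least 70% of the students are aged 29 or younger.
• To find the first quartile (Q1), compute $i = \frac{np}{100} = \frac{16 \times 25}{100} = 4$. Since i is a whole number, the first quartile is the average of the observations in positions 4 and 5 in the ranked data set, that is, the average of 24 and 24, which is 24.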
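The rule used in this solution (round i up when it is fractional; average positions i and i + 1 when it is a whole number) can be written as a small function. This is a sketch of that hand rule only; the function name is made up for the example, and statistical software often uses other interpolation conventions.

```python
import math

def percentile_by_position(sorted_data, p):
    """p-th percentile (0 < p < 100, p an integer) using i = n * p / 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    n = len(sorted_data)
    if (n * p) % 100 == 0:
        # i is a whole number: average the observations in positions i and i + 1.
        i = n * p // 100
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    # i is fractional: round up and take the observation in that position.
    i = math.ceil(n * p / 100)
    return sorted_data[i - 1]

# Ages of the students, read down the columns of the table (ascending order).
student_ages = [20, 22, 22, 24, 24, 24, 24, 25, 26, 27, 28, 29, 30, 30, 33, 34]

p95 = percentile_by_position(student_ages, 95)  # 95th percentile
d7 = percentile_by_position(student_ages, 70)   # 7th decile
q1 = percentile_by_position(student_ages, 25)   # first quartile

print("P95:", p95, "D7:", d7, "Q1:", q1)
```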
1. Range: the difference between the largest and smallest observations. For the data table
shown above, the range is 34 − 20 = 14 years.
2. Interquartile range: overcomes the dependency on extreme values. Denoted IQR.
The interquartile range for the students' ages is found by subtracting the
value of Q1 from Q3. The first quartile is equal to the 25th percentile and was found above
to be 24. The third quartile is equal to the 75th percentile: i = (16 × 75)/100 = 12 is a
whole number, so Q3 is the average of the observations in positions 12 and 13, that is,
(29 + 30)/2 = 29.5. The IQR equals 29.5 − 24 = 5.5 years.
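3. Variance: a measure of variability that uses all the data. It is the sum of squared
deviations from the mean divided by the number of observations (or by n − 1 for a sample).
Population variance is denoted by σ² (sigma squared). Sample variance is denoted by s².
Population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
4. Standard deviation: the positive square root of the variance, denoted by σ for a
population and by s for a sample.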
5. Coefficient of variation: The coefficient of variation (CV) is the ratio of the standard
deviation to the mean and shows the extent of variability in relation to the mean of
the population. The higher the CV, the greater the dispersion.
CV = Standard Deviation/Mean
The standard deviation is useful as a measure of variation within a given set of data.
When one desires to compare the dispersion in two sets of data, however, comparing
the two standard deviations may lead to fallacious results. It may be that the two
variables involved are measured in different units. For example, we may wish to know,
for a certain population, whether serum cholesterol levels, measured in milligrams per
100 ml, are more variable than body weight, measured in kgs. Furthermore, although
the same unit of measurement is used, the two means may be quite different. If we
compare the standard deviation of weights of first-grade children with the standard
deviation of weights of high school freshmen, we may find that the latter standard
deviation is numerically larger than the former, because the weights themselves are
larger, not because the dispersion is greater.
Example
The times required in minutes for students to solve a particular math problem were 5, 10,
15, 3, and 7. Calculate the standard deviation.
Solution
The mean time for the five students is 8 minutes. The table below illustrates the computation
indicated by the sample variance formula. The first column lists the observations, x. The
second column lists the deviations from the mean, x − x̄. The third column lists the squares
of the deviations. The sum at the bottom of the second column is called the sum of the
deviations, and is always equal to zero for any data set. The sum at the bottom of the third
column is referred to as the sum of the squares of the deviations. The sample variance is
obtained by dividing the sum of the squares of the deviations by n − 1, or 5 − 1 = 4. The
sample variance equals 88 divided by 4 which is 22 minutes squared. The standard deviation
is the square root of the variance, √22 ≈ 4.69 minutes.
x      x − x̄          (x − x̄)²
5      5 − 8 = −3     (−3)² = 9
10     10 − 8 = 2     (2)² = 4
15     15 − 8 = 7     (7)² = 49
3      3 − 8 = −5     (−5)² = 25
7      7 − 8 = −1     (−1)² = 1
       Σ(x − x̄) = 0   Σ(x − x̄)² = 88
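The columns of the table can be reproduced step by step. The sketch below follows the same computation for the five solving times and then takes the square root to get the standard deviation and divides by the mean to get the coefficient of variation; variable names are illustrative.

```python
import math

# Times (minutes) taken to solve the problem.
times = [5, 10, 15, 3, 7]
n = len(times)

mean_time = sum(times) / n                        # x-bar
deviations = [x - mean_time for x in times]       # second column: x - x-bar
squared_deviations = [d ** 2 for d in deviations] # third column: (x - x-bar)^2

sum_of_deviations = sum(deviations)               # zero for any data set
sum_of_squares = sum(squared_deviations)

sample_variance = sum_of_squares / (n - 1)        # divide by n - 1 for a sample
sample_sd = math.sqrt(sample_variance)            # standard deviation
coefficient_of_variation = sample_sd / mean_time  # CV = standard deviation / mean

print("mean:", mean_time)
print("sum of squared deviations:", sum_of_squares)
print("sample variance:", sample_variance)
print("sample standard deviation:", round(sample_sd, 2))
print("coefficient of variation:", round(coefficient_of_variation, 3))
```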
a. Kurtosis
$$\text{kurtosis} = \frac{\frac{1}{n}\sum (x - \bar{x})^4}{\left(\frac{1}{n}\sum (x - \bar{x})^2\right)^2} - 3$$
[Figure: density curves of the normal distribution and the t-distribution]
b. Skewness
The shape of the distribution can be skewed either to the left or to the right. Skewness is
given by:
$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$$
[Figure: density curves of left-skewed, normal, and right-skewed distributions; axes: x and y]
Chebyshev’s Theorem
If we fear that the Empirical Rule does not hold for a particular population, we can con-
sider using Chebyshev’s Theorem to find an interval that contains a specified percentage of
the individual measurements in the population. Although Chebyshev’s Theorem technically
applies to any population, we will see that it is not as practically useful as we might hope.
Chebyshev’s theorem states that the fraction of any data set lying within k standard
deviations of the mean is at least 1 − 1/k², where k is a number greater than 1. For example, if
we choose k equal to 2, then at least 100(1 − 1/2²)% = 100(3/4)% = 75% of the population
measurements lie in the interval [µ − 2σ, µ + 2σ]. As another example, if we choose k equal to 3, then
at least 100(1 − 1/3²)% = 100(8/9)% = 88.89% of the population measurements lie in the
interval [µ − 3σ, µ + 3σ].
The theorem applies to either a sample or a population. The implication of the theorem
for k = 2, 3, and 4 standard deviations is that:
a. At least 0.75 or 75% of the data values must be within 2 standard deviations of the mean.
b. At least 0.89 or 89% of the data values must be within 3 standard deviations of the
mean.
c. At least 0.94 or 94% of the data values must be within 4 standard deviations of the
mean.
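The bound 1 − 1/k² is easy to tabulate. The sketch below evaluates it for k = 2, 3, and 4; the function name is chosen for this example only.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k greater than 1")
    return 1 - 1 / k ** 2

# Tabulate the guaranteed minimum percentage for a few values of k.
for k in (2, 3, 4):
    print(f"k = {k}: at least {100 * chebyshev_lower_bound(k):.2f}% of the data")
```

The bound holds for any data set, which is why it is weaker than the percentages the Empirical Rule gives for normally distributed data.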
Empirical Rule
A practical interpretation of the standard deviation: the Empirical Rule. One type of relative
frequency curve describing a population is the normal curve. The normal curve is a
symmetrical, bell-shaped curve. If a population is described by a normal curve, we say that
the population is normally distributed, and the following result can be shown to hold.
1. 68.26 percent of the population measurements are within (plus or minus) one standard
deviation of the mean and thus lie in the interval [µ − σ, µ + σ].
2. 95.44 percent of the population measurements are within (plus or minus) two standard
deviations of the mean and thus lie in the interval [µ − 2σ, µ + 2σ].
3. 99.73 percent of the population measurements are within (plus or minus) three standard
deviations of the mean and thus lie in the interval [µ − 3σ, µ + 3σ].
Activity
1. Table 2.8 gives the ages of cars randomly selected from Lilongwe Civil servants. Find
the percentiles for the ages 10, 15, and 20.
Table 2.8
2 7 11 15 19
2 7 11 15 19
2 7 12 15 20
2 7 12 15 20
4 7 12 15 20
4 10 14 15 22
4 10 14 16 24
4 10 14 16 25
5 10 14 17 25
5 10 15 17 27
Solution
The age 10 is the thirtieth percentile. The age 15 is the fifty-eighth percentile. The
age 20 is the eighty-fourth percentile.
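The three answers above can be checked with a short function that follows the percentile steps given earlier (count the observations below x, divide by the total, multiply by 100, and round). The data are the 50 car ages in Table 2.8, read down the columns; the function name is illustrative.

```python
# Car ages from Table 2.8, read down the columns.
car_ages = [
    2, 2, 2, 2, 4, 4, 4, 4, 5, 5,
    7, 7, 7, 7, 7, 10, 10, 10, 10, 10,
    11, 11, 12, 12, 12, 14, 14, 14, 14, 15,
    15, 15, 15, 15, 15, 15, 16, 16, 17, 17,
    19, 19, 20, 20, 20, 22, 24, 25, 25, 27,
]

def percentile_of(data, x):
    """Percentile for observation x: share of observations below x, as a whole percent."""
    below = sum(1 for value in data if value < x)
    return round(100 * below / len(data))

for age in (10, 15, 20):
    print(f"age {age}: percentile {percentile_of(car_ages, age)}")
```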
2. Find P90 , D8 , and Q3 for the civil servant cars’ ages in Table 2.8.
Solution