STAE lecture notes_LU3_Annotated
LEARNING OBJECTIVES
• Understand the concepts of and calculate the mean, median, mode and percentiles
• Understand the concepts of and calculate the range, interquartile range, standard deviation, variance
and coefficient of variation
• Choose the appropriate measures of central tendency and variability for any given variable
3.1. Introduction
Descriptive statistics are numerical summary measures used to describe the data collected from a sample in
terms of central tendency, variability, skewness and kurtosis. These measures are used in most statistical
analyses. In this course, measures of central tendency and variability are calculated using raw and frequency
data, skewness is only evaluated visually, and kurtosis is not assessed.
3.2.1. Mode
For raw data and ungrouped frequency data the mode is the value(s) of the variable that occur(s) most
frequently. A variable can have one, two, more than two, or no mode.
• Unimodal = one mode
• Bimodal = two modes
• Multimodal = more than two modes
For grouped frequency data it is not possible to identify the most frequent value(s) since the data were grouped
into class intervals and information was lost. For such data the class(es) with the highest frequency
is/are the modal class(es) and the mode is generally estimated using the midpoint of the modal class(es).
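The mode for raw data can be sketched in a few lines of Python. This is only an illustration: the function name `modes` is hypothetical, and the convention that a dataset has "no mode" when every distinct value occurs equally often is an assumption, not something stated in these notes.

```python
from collections import Counter

def modes(values):
    """Return the sorted most frequent value(s); [] means no mode.

    Assumes a non-empty list of raw observations.
    """
    counts = Counter(values)
    highest = max(counts.values())
    # Assumed convention: if all distinct values are equally frequent,
    # no value stands out, so the data are treated as having no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 2, 2, 3]))      # unimodal
print(modes([1, 1, 2, 2, 3]))   # bimodal
print(modes([1, 2, 3, 4]))      # no mode
```

The length of the returned list tells you whether the variable is unimodal (1), bimodal (2) or multimodal (more than 2).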
3.2.2. Median
The median is the value of the variable in the middle of the ordered set of data values. Therefore, at most 50%
of observations are below the median value, and at most 50% of observations are above the median value.
To find the median for raw data
• Order the data from lowest to highest
• Find the median position = $\frac{n + 1}{2}$
• If n is odd, the median position value will be a whole number
o The median value is the value of the variable in the median position of the ordered data
o For example, for the ordered observations: 3 4 6 9 13
o Since n = 5, the median position = $\frac{5 + 1}{2} = \frac{6}{2} = 3$
o The value in position 3 of the ordered data is 6, i.e., median = 6
• If n is even, the median position value will be a fraction
o The median value is the average of the two variable values on either side of the median position in
the ordered data
o For example, for the ordered observations: 3 4 6 9
o Since n = 4, the median position = $\frac{4 + 1}{2} = \frac{5}{2} = 2.5$
o The value in position 2 of the ordered data is 4, and the value in position 3 of the ordered data is 6,
i.e., median = $\frac{4 + 6}{2} = \frac{10}{2} = 5$
For ungrouped frequency tables the median is calculated using cumulative frequencies. For grouped frequency
tables the median is estimated using cumulative frequencies and an interpolation formula. However, this is
beyond the scope of this course.
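The raw-data steps above can be sketched in Python as follows. The function name `median_raw` is hypothetical; the code simply follows the median-position rule, (n + 1) / 2, described in this section.

```python
def median_raw(values):
    """Median of raw data via the median position (n + 1) / 2."""
    ordered = sorted(values)            # order from lowest to highest
    n = len(ordered)
    position = (n + 1) / 2              # 1-based median position
    if n % 2 == 1:
        # Odd n: the position is a whole number, take that value.
        return ordered[int(position) - 1]
    # Even n: the position is a fraction, average the values on either side.
    below = ordered[int(position) - 1]
    above = ordered[int(position)]
    return (below + above) / 2

print(median_raw([3, 4, 6, 9, 13]))   # odd n, position 3
print(median_raw([3, 4, 6, 9]))       # even n, position 2.5
```

The two calls reproduce the worked examples: the value in position 3 for n = 5, and the average of the values in positions 2 and 3 for n = 4.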
3.2.3. Mean
The mean of a variable is also referred to as the arithmetic mean or the average. For raw data the mean is
calculated by adding all the values of the variable together and dividing by the total number of observations.
For a random variable X, the population mean is denoted by the Greek letter $\mu$ (mu):
$\mu = \frac{1}{N} \sum x$
For a random variable X, the sample mean is denoted by $\bar{x}$ (x-bar):
$\bar{x} = \frac{1}{n} \sum x$
For example, consider the random sample with observations: 9 2 4 13 6
$\bar{x} = \frac{9 + 2 + 4 + 13 + 6}{5} = \frac{34}{5} = 6.8$
For an ungrouped frequency table, the mean is calculated using a formula based on the values of the variable
and the frequency of occurrence. For a grouped frequency table, the mean is estimated using a formula based
on the midpoint of the class intervals and the frequency of occurrence. For the purpose of this course, it is
sufficient to calculate/estimate the mean from frequency tables using the calculator.
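As a rough illustration of what the calculator does, the sketch below computes the mean from raw data and from a frequency table. It assumes the standard frequency-weighted formula, the sum of (value × frequency) divided by the total frequency, which these notes do not state explicitly. The function names and the small frequency table are hypothetical.

```python
def mean_raw(values):
    """Sample mean: add all values and divide by the number of observations."""
    return sum(values) / len(values)

def mean_from_frequencies(table):
    """Mean from a frequency table.

    `table` maps each value (or, for grouped data, each class midpoint)
    to its frequency. Assumed formula: sum(f * x) / sum(f).
    """
    n = sum(table.values())
    return sum(x * f for x, f in table.items()) / n

print(mean_raw([9, 2, 4, 13, 6]))                # raw-data example from the text
print(mean_from_frequencies({1: 2, 2: 3, 3: 1}))  # hypothetical table
```

For a grouped table the same function gives an estimate only, because each observation is replaced by its class midpoint.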
For example, if 10% of students scored at least 80 on a test, then a student who scored 82 performed in the
top 10% of the distribution. The value “80” is the minimum value obtained by the top 10% of the distribution
and is therefore the 90th percentile, i.e., P90 = 80, as it separates the lowest 90% from the remaining 10% of
the distribution. Therefore, at most 90% of students scored less than 80 and at most 10% of students scored
more than 80.
Recall the interpretation of the median, namely at most 50% of observations are below the median value and
at most 50% of observations are above the median value. The median of a distribution is the 50th percentile
value, i.e., P50 = median. Other commonly used percentiles are deciles, which divide the distribution into ten
equal parts (D1, D2, …, D10) and quartiles, which divide the distribution into four equal parts (Q1, Q2, Q3, Q4).
Both deciles and quartiles can be expressed in terms of percentiles. For example, D5 = Q2 = P50 = median. For
raw data any percentile value is obtained by first sorting the data from lowest to highest, locating the percentile
position and then using a formula to calculate the percentile value. Percentile calculation from frequency data
is beyond the scope of this course.
• $P_r = x_{(k)} + d\,(x_{(k+1)} - x_{(k)})$
o Where $x_{(k)}$ is the value in position k of the ordered dataset
o k is the whole-number part and d is the decimal part of the percentile position
For example, find and interpret P20 and Q3 for the following 12 observations (already ordered):
4 5 8 9 11 12 12 14 15 17 19 21
• P20
o Position = $\frac{20}{100}(12 + 1) = 2.6$. Therefore k = 2 and d = 0.6
o The value in position 2 (k) is 5 and the value in position 3 (k + 1) is 8
o $P_{20} = x_{(2)} + 0.6\,(x_{(3)} - x_{(2)}) = 5 + 0.6\,(8 - 5) = 6.8$
o At most 20% of observations are less than 6.8 and at most 80% of observations are greater than 6.8
• Q3 = P75
o Position = $\frac{75}{100}(12 + 1) = 9.75$. Therefore k = 9 and d = 0.75
o The value in position 9 (k) is 15 and the value in position 10 (k + 1) is 17
o $P_{75} = x_{(9)} + 0.75\,(x_{(10)} - x_{(9)}) = 15 + 0.75\,(17 - 15) = 16.5$
o At most 75% of observations are less than 16.5 and at most 25% of observations are greater than 16.5
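The percentile calculation in the worked example can be sketched in Python as below. The position rule, (r / 100)(n + 1), is taken from the example. The function name `percentile` is hypothetical, and clamping to the smallest or largest observation when the position falls outside 1 to n is an assumption these notes do not address. Software packages often use other interpolation rules, so their results may differ slightly.

```python
import math

def percentile(values, r):
    """r-th percentile using the position (r / 100) * (n + 1)."""
    ordered = sorted(values)
    n = len(ordered)
    position = r / 100 * (n + 1)   # 1-based percentile position
    k = math.floor(position)       # whole-number part
    d = position - k               # decimal part
    # Assumption: clamp when the position lies outside the data.
    if k < 1:
        return ordered[0]
    if k >= n:
        return ordered[-1]
    # Interpolate between the values in positions k and k + 1.
    return ordered[k - 1] + d * (ordered[k] - ordered[k - 1])

data = [4, 5, 8, 9, 11, 12, 12, 14, 15, 17, 19, 21]
print(round(percentile(data, 20), 2))   # P20
print(round(percentile(data, 75), 2))   # Q3 = P75
```

Rounding is applied only for display, since floating-point arithmetic can add a tiny error to the decimal part d.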
3.4. Measures Of Variability
Measures of variability (or spread or dispersion) describe the extent to which data are spread around their central
tendency and across the scale. The commonly used measures of variability are range, interquartile range,
variance, standard deviation and coefficient of variation.
3.4.1. Range
The range is an approximate measure of variability and shows how much of the scale is utilised. For raw data
and ungrouped frequency data the range is the difference between the maximum and the minimum values of
a variable. For grouped frequency data the range is the difference between the upper limit of the last class
interval and the lower limit of the first class interval.
Range = maximum – minimum
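A minimal Python sketch of both versions of the range follows. The function names and the class intervals in the second call are hypothetical and serve only to illustrate the two definitions.

```python
def range_raw(values):
    """Range for raw or ungrouped frequency data: maximum - minimum."""
    return max(values) - min(values)

def range_grouped(class_limits):
    """Range for grouped data.

    `class_limits` is a list of (lower, upper) tuples ordered from the
    first to the last class interval.
    """
    first_lower = class_limits[0][0]
    last_upper = class_limits[-1][1]
    return last_upper - first_lower

print(range_raw([9, 2, 4, 13, 6]))                       # 13 - 2
print(range_grouped([(10, 20), (20, 30), (30, 40)]))     # 40 - 10
```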
3.4.4. Variance
To solve the problem encountered with the average deviation measure, differences are considered as distances
which must always be positive. There are two ways in which negative values can be removed: either take the
absolute value (i.e., remove the sign), or square the value. The variance is the average squared deviation around
the mean. It is the most commonly used measure of variability in statistics. The larger the value of the variance
the more the data values vary around the mean and the greater the spread of the data. The variance is expressed
in the squared unit of measurement of a variable, which is of no practical value and is difficult to interpret.
The population variance is denoted by the Greek letter $\sigma^2$ (sigma-squared) and is calculated as follows:
$\sigma^2 = \frac{1}{N} \sum (x - \mu)^2$
The sample variance is denoted by the Roman letter $s^2$ (s-squared) and is calculated as follows:
$s^2 = \frac{1}{n - 1} \sum (x - \bar{x})^2 = \frac{n \sum x^2 - \left(\sum x\right)^2}{n(n - 1)}$
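The two forms of the sample variance formula give the same result. The sketch below checks this on the sample 9 2 4 13 6 used earlier for the mean. The function names are hypothetical.

```python
def sample_variance_definition(values):
    """Average squared deviation around the mean, with divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_variance_computational(values):
    """Computational form: (n * sum(x^2) - (sum(x))^2) / (n * (n - 1))."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

sample = [9, 2, 4, 13, 6]
print(round(sample_variance_definition(sample), 4))
print(round(sample_variance_computational(sample), 4))
```

For this sample the sum of x is 34 and the sum of x squared is 306, so the computational form is (5 × 306 − 34²) / (5 × 4) = 374 / 20 = 18.7, which matches the definitional form. The computational form is convenient when only the sums are available.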
For an ungrouped frequency table, the variance is calculated based on the values of the variable and the
frequency of occurrence. For a grouped frequency table, the variance is estimated using a formula based on
the midpoint of the class intervals and the frequency of occurrence. For the purpose of this course, it is
sufficient to calculate/estimate the variance from frequency tables using the calculator. The calculator gives
the population and sample standard deviations, which must be squared to obtain the variance. Steps to perform
calculations are discussed in Section 3.4.5.
The population standard deviation is denoted by the Greek symbol $\sigma$ (sigma) and is calculated as follows:
$\sigma = \sqrt{\frac{1}{N} \sum (x - \mu)^2}$
The sample standard deviation is denoted by the Roman letter s and is calculated as follows:
$s = \sqrt{\frac{1}{n - 1} \sum (x - \bar{x})^2} = \sqrt{\frac{n \sum x^2 - \left(\sum x\right)^2}{n(n - 1)}}$
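As a hedged illustration of the difference between the two divisors, the sketch below uses Python's standard `statistics` module, where `pstdev` divides by N and `stdev` divides by n − 1. Treating the same five observations as both a population and a sample is done here only to compare the two formulas.

```python
import statistics

sample = [9, 2, 4, 13, 6]

pop_sd = statistics.pstdev(sample)     # population formula, divisor N
sample_sd = statistics.stdev(sample)   # sample formula, divisor n - 1

# Squaring a standard deviation gives the corresponding variance.
pop_var = pop_sd ** 2
sample_var = sample_sd ** 2

print(round(pop_sd, 4), round(pop_var, 4))
print(round(sample_sd, 4), round(sample_var, 4))
```

The sample standard deviation is always the larger of the two for the same data, because it divides the same sum of squared deviations by n − 1 rather than N.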
The coefficient of variation (CV) is the ratio of the standard deviation to the mean, expressed as a percentage, i.e., the variability in the
variable is expressed as a percentage of the mean of that variable. This measures variability on comparable
scales for multiple variables. Note, this value is not bounded by 100% and can be greater than 100% (when the standard
deviation exceeds the mean); larger values imply more variability relative to the mean.
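A minimal sketch of the CV calculation follows. It assumes the sample standard deviation (divisor n − 1) and a non-zero mean. The function name is hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """CV = (sample standard deviation / mean) * 100, as a percentage.

    Assumes the mean is not zero.
    """
    return statistics.stdev(values) / statistics.mean(values) * 100

sample = [9, 2, 4, 13, 6]
print(round(coefficient_of_variation(sample), 2))
```

For this sample the standard deviation is about 4.32 and the mean is 6.8, so the CV is roughly 63.6%. Because the result is unit-free, it can be compared across variables measured on different scales.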
Exercise 3.1
The sums for X = coffee consumption are $\sum x = 59$ and $\sum x^2 = 251$. The following table shows the frequency
distribution for coffee consumption. Calculate the mean, range, variance, standard deviation and coefficient
of variation using the computational formulae as well as the calculator. Compare the results.
Coffee consumption Frequency
1 5
2 6
3 3
4 2
5 2
6 0
7 1
8 1
Total 20
From table
Mean =
Range =
Variance =
Standard deviation =
Coefficient of variation =
From sums: $\sum x = 59$, $\sum x^2 = 251$
Mean = $\bar{x} = \frac{1}{n} \sum x$ =
Variance = $s^2 = \frac{n \sum x^2 - \left(\sum x\right)^2}{n(n - 1)}$ =
Standard deviation = $s$ =
Coefficient of variation = $\frac{s}{\bar{x}} \times 100$ =
Comparison
Exercise 3.2
Use the raw data for the coffee affinity score as well as the grouped frequency table to calculate the mean,
range, variance, standard deviation and coefficient of variation. Compare the results.
Coffee affinity score Frequency Midpoint
(0, 1] 7
(1, 2] 4
(2, 3] 2
(3, 4] 4
(4, 5] 3
Total 20
From table
Mean =
Range =
Variance =
Standard deviation =
Coefficient of variation =
Comparison
Exercise 3.3
Use the following stem-and-leaf plot of age (leaf unit = 1) and calculate the mode(s), median, D2 and IQR.
Stem Leaf
1 9 9 9
2 1 4 4 5 6 6 8 9 9
3 0 2 4 5 6 7
4 0 3
Median
D2
P25
P75
IQR