SMA 160 - Stds Notes (2025)
INTRODUCTION TO STATISTICS
Inferential statistics refers to the application of sample statistics to the parameters of the
parent population from which the samples were drawn, subject to the stated significance levels.
Population
A collection of items sharing a common characteristic, that is, all subjects possessing
the characteristic that is being studied.
Census
An examination or the collection of data from every element in a population.
Sample
A subgroup or subset of the population.
Parameter
Characteristic or measure obtained from a population.
Sampling Error
The difference between the sample result and the true population result that
arises by chance because only a sample, and not the whole population, is observed.
Non-sampling error
An error that occurs because the sample data is incorrectly collected, recorded, or
analyzed, for example through questionnaire wording, data entry errors, and biased
decisions (measurement and analyzing errors).
Margin of error
The largest difference expected, at the stated level of confidence, between the sample
statistic and the actual population parameter.
Sampling
Sampling is a technique of selecting individual members or a subset of the population to
make statistical inferences from them and estimate characteristics of the whole population
from which the samples were drawn. Gathering the information by examining every item
in the population is referred to as census.
Methods of sampling
These are the techniques of selecting the items to represent the population of the study.
Sampling Techniques
Generally, under probability sampling a sample is chosen based on the theory of probability,
while in non-probability sampling a sample is chosen based on non-random criteria, and not
every member of the population has a chance of being included in the sample.
Probability sampling reduces the sample bias and is appropriate with diverse and vast
Bases: Probability Sampling Methods versus Non-Probability Sampling Methods
Definition
  Probability sampling: a sampling technique in which samples from a larger population
  are chosen using a method based on the theory of probability.
  Non-probability sampling: a sampling technique in which the researcher selects samples
  based on the researcher’s subjective judgment rather than random selection.
Alternatively Known as
  Probability sampling: Random sampling method.
  Non-probability sampling: Non-random sampling method.
Population selection
  Probability sampling: The population is selected randomly.
  Non-probability sampling: The population is selected arbitrarily.
Nature
  Probability sampling: The research is conclusive.
  Non-probability sampling: The research is exploratory.
Sample
  Probability sampling: Since there is a method for deciding the sample, the population
  demographics are conclusively represented.
  Non-probability sampling: Since the sampling method is arbitrary, the population
  demographics representation is almost always skewed.
Time Taken
  Probability sampling: Takes longer to conduct since the research design defines the
  selection parameters before the research study begins.
  Non-probability sampling: This type of sampling method is quick since neither the sample
  nor the selection criteria of the sample are defined.
Results
  Probability sampling: This type of sampling is entirely unbiased and hence the results
  are unbiased too and conclusive.
  Non-probability sampling: This type of sampling is entirely biased and hence the results
  are biased too, rendering the research subjective and speculative.
Hypothesis
  Probability sampling: There is an underlying hypothesis before the study begins and the
  objective of this method is to prove the hypothesis.
  Non-probability sampling: The hypothesis is derived after conducting the research study.
Statistical
analysis
Undergraduate Lecture Notes-Dr. Kasina MM Page 3
Introduction to Probability & Statistics
Data Information
Scales of Measurement
In statistics, there are four data measurement scales: nominal, ordinal, interval and ratio
data types.
• Nominal- under the nominal scale the items are differentiated by a simple naming
system. Nominal items are usually categorical, so nominal scales are used just for
labeling variables, without any quantitative value. Questions like 'What is your
gender?' and 'Where do you live?' yield nominal data. Categories may be without
order, e.g. male or female, or with order, e.g. cold, warm, hot; ordered categories
are more properly treated under the ordinal scale.
• Ordinal- under the ordinal scale the items are set into some kind of order by their
position on the scale. Ordinal items are usually categorical, e.g. teams can be ranked
as first, third and fifth etc. regardless of the score between each consecutive position.
• Interval- interval data are measured, just like the ratio scale, along a scale in which
each position is equidistant from the next, but the zero point is arbitrary rather than
absolute, e.g. altitude (the height above sea level) and Celsius temperature, in which
the difference between any two consecutive values is the same.
• Ratio- under the ratio scale, items are measured along a regular scale in which each
position point is equidistant from one another, therefore numbers can be compared
as multiples of one another and have an absolute zero (reference point) eg weight
and height (Both have absolute zero such that no numbers or values exist below the
zero point)
A variable is a representation of something that can change and assume different values,
such as gender, which can take the value male or female.
• Categorical variable - results from a selection from categories. Nominal and ordinal
variables are categorical.
• Continuous variables are numeric variables that can take any value within an interval,
so that further possible values always lie between any two data points, e.g. weight.
Data sources:
Primary sources- These are first-hand, freshly collected data for a particular use. The
information can be obtained through surveys, interviews or observation, among others.
Secondary sources- These are second-hand data mainly obtained from published sources
such as print or online reference or research works, abstracts, indexes, finding aids,
publishers’ or distributors’ brochures or websites and broadcast program schedules. They
are not original in character and have undergone some statistical treatment at least once, but
may attract a secondary use.
Experiments- An orderly procedure carried out to test a hypothesis. It can give an in-depth
view of a cause-effect relationship by showing how the response varies when the regressors
are manipulated.
The method to apply is mostly dictated by the resources and time available, the intended form
of data analysis and, finally, where the data resides: in the environment, in files or with people.
Direct Observation
Observation is the process in which a researcher observes what is occurring in some real
life situation and then classifies and records pertinent happenings according to some planned
criteria. The observation method is most useful when the study relates to behavioral science. It
is subject to many controls and checks. The different types of observations are:
The tools needed for gathering data using this technique include the eyes and other
senses, microscope, a pen and a paper.
Surveys
A survey solicits data from people; it is most appropriate with data elements that are not
easily quantifiable. It can be administered through a Personal (Physical) Interview, a
Telephone Interview or a Self-Administered Questionnaire. The tools for administering this
technique include an interview guide, a check list, a tape recorder or a questionnaire. The
following are some of the key questionnaire design principles:
Keep the questionnaire as short and simple as possible (Avoid technical terms).
Ensure clearly worded questions free from ambiguities
Include both the closed and open-ended questions
Avoid using leading-questions.
The Response Rate, a key survey parameter, is the proportion of all people selected who
complete the survey.
Experiments
Experiments can be carried out in the laboratory, in the field or using computer numerical
models for the purpose of collecting data. Currently there are several computer codes that
can be utilized to construct a model, e.g. Finite Element Method (FEM) codes and
Computational Fluid Dynamics (CFD) codes.
Case Study
Focus Groups
Online tracking
Frequency Distribution
Exercise 1
A random sample of 100 Machakos undergraduate students was selected and the time
(hours) each student spent in a gym for a particular semester was recorded as follows:
Table: 2. Machakos University students’ hours in a gym for a semester
65 22 84 100 88 87 105 44 85 67
80 109 83 89 91 104 90 103 67 52
110 98 86 39 72 66 92 99 60 75
88 112 97 88 49 62 70 66 88 62
72 85 81 78 77 41 105 92 94 74
78 75 87 83 71 99 56 69 78 60
119 39 104 86 67 79 98 102 82 91
46 120 73 125 132 86 48 55 112 28
42 24 130 100 46 57 31 129 137 59
By using the class interval 20-39, 40-59 and so forth construct the frequency distribution,
cumulative frequency distribution, relative frequency distribution and relative cumulative
frequency distribution in one table.
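The four columns asked for in this exercise can be built mechanically. The sketch below is only an illustration: it uses the first two rows of the table (20 values) so that the full exercise is left to the reader, and the function name `frequency_table` and its parameters are assumptions made for this example.

```python
# Illustrative subset: the first two rows of the gym-hours table (20 values).
hours = [65, 22, 84, 100, 88, 87, 105, 44, 85, 67,
         80, 109, 83, 89, 91, 104, 90, 103, 67, 52]

def frequency_table(data, start=20, width=20, n_classes=6):
    """Return rows of (class, f, cumulative f, relative f, relative cumulative f)."""
    n = len(data)
    rows, cumulative = [], 0
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width - 1              # class limits, e.g. 20-39
        f = sum(lower <= x <= upper for x in data)
        cumulative += f
        rows.append((f"{lower}-{upper}", f, cumulative, f / n, cumulative / n))
    return rows

table = frequency_table(hours)
for label, f, cf, rf, rcf in table:
    print(f"{label:>8} {f:>3} {cf:>3} {rf:6.2f} {rcf:6.2f}")
```

Running the same function on all the recorded values gives the table required by the exercise.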
Definition of terms
Class boundary is the precise point that separates one class from another, rather than being
a value included in one of the classes. A class boundary is typically located midway
between the upper limit of a class and the lower limit of the next higher class adjoining it.
Therefore the class boundary separating the class 60-79 and the class 80-99 is halfway
between 79 and 80, that is, at the point 79.5. This is the upper class boundary of the
60-79 class and the lower class boundary of the 80-99 class.
Class interval: is the width of a class. The class interval of a class is computed by
subtracting its lower class boundary from its upper class boundary.
Class midpoint or class mark: is the point dividing the class into equal halves on the basis
of class interval. This point can be obtained by adding the lower and upper limits
(boundaries) of a class and dividing by 2.
Relative frequency of a class: it is the ratio of the frequency of any class to sum of the
frequencies.
Cumulative frequency distribution: shows the number of items of a series that are less
than (or more than) certain specified values.
A value that would describe the 'centre' of a distribution would be visually located near the
spot where most of the data seem to be concentrated. Consequently, values that fulfil this
role are called measures of central tendency.
The most common measures of the central tendency of a data set are the arithmetic mean
(or simply the mean), the median and the mode.
Example 1 calculating mean, median and mode for individual (Ungrouped) data
The following table shows the hourly wage rates of eight sampled construction workers.
Table: 3 workers hourly wage
Worker (i)                  1    2    3    4    5    6    7    8
Hourly wage rate (x_i)     35   46   46   60   65   69   70   72
Arithmetic mean
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8}{8} = \frac{463}{8} = 57.875 \]
Other means
i. Geometric mean: \( G = (x_1 x_2 \cdots x_n)^{1/n} \)
ii. Harmonic mean: \( H = \frac{n}{\sum \frac{1}{x_i}} \) for ungrouped data, but if grouped
\( H = \frac{\sum f_i}{\sum \frac{f_i}{x_i}} \)
Location of the median: \( \frac{n + 1}{2} = \frac{9}{2} = 4.5 \)th position
\[ \text{Median (wage)} = \frac{x_4 + x_5}{2} = \frac{60 + 65}{2} = 62.5 \]
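The values in Example 1 can be checked with Python's standard `statistics` module. This is only a quick verification sketch; the variable names are chosen for illustration.

```python
import statistics

# Hourly wage rates of the eight workers in Table 3
wages = [35, 46, 46, 60, 65, 69, 70, 72]

mean = statistics.mean(wages)                  # arithmetic mean
median = statistics.median(wages)              # average of the 4th and 5th values
mode = statistics.mode(wages)                  # most frequent value
geometric = statistics.geometric_mean(wages)   # (x1 * x2 * ... * xn) ** (1/n)
harmonic = statistics.harmonic_mean(wages)     # n / sum(1/x)

print(mean, median, mode)
print(round(geometric, 3), round(harmonic, 3))
```

For positive data the harmonic mean never exceeds the geometric mean, which in turn never exceeds the arithmetic mean.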
The following table shows the daily wages of a random sample of construction workers.
Calculate its mean, median and mode.
Table 4: Workers daily wage
Daily Wages Number of Workers
200 - 399 5
400 - 599 15
600 - 799 25
800 - 999 30
1000 - 1199 18
1200 - 1399 7
Total 100
Solution
Daily Wages     Number of Workers   Class Mark     f_i x_i     Cum. frequency
                     (f_i)             (x_i)                        (F)
200 - 399              5               299.5        1,497.5          5
400 - 599             15               499.5        7,492.5         20
600 - 799             25               699.5       17,487.5         45
800 - 999             30               899.5       26,985.0         75
1000 - 1199           18             1,099.5       19,791.0         93
1200 - 1399            7             1,299.5        9,096.5        100
Total                100                           82,350.0
\[ \bar{x} = \frac{\sum_{i=1}^{6} f_i x_i}{\sum_{i=1}^{6} f_i} = \frac{82{,}350.0}{100} = 823.5 \]
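\[ M_d = L + \left( \frac{\tfrac{1}{2}\sum f - F_a}{f_w} \right) c_i \]
Where: L is the lower real limit (lower class boundary) of the middle (median) class
fw is the frequency of the middle class
Fa is the cumulative frequency of the classes above (preceding) the middle class
ci is the class interval of the middle class
\[ M_d = 799.5 + \left( \frac{0.5(100) - 45}{30} \right)(200) = 832.8 \]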
\[ \text{Mode} = L + \left( \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \right) c_i \]
Where: L is the lower real limit of the modal class
f1 is the frequency of the modal class
f0 is the frequency of the class preceding the modal class
f2 is the frequency of the class succeeding the modal class and
ci is the class interval of the modal class
\[ \text{Mode} = 799.5 + \left( \frac{30 - 25}{2(30) - 25 - 18} \right)(200) = 858.3 \]
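The grouped-data formulas for the mean, median and mode can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the class boundaries are taken to lie 0.5 below each lower limit, as in the worked solution.

```python
# Table 4 as (lower limit, upper limit, frequency)
classes = [(200, 399, 5), (400, 599, 15), (600, 799, 25),
           (800, 999, 30), (1000, 1199, 18), (1200, 1399, 7)]

def grouped_mean(classes):
    total_f = sum(f for _, _, f in classes)
    # the class mark is the midpoint of the class limits
    return sum(f * (lo + hi) / 2 for lo, hi, f in classes) / total_f

def grouped_median(classes):
    total_f = sum(f for _, _, f in classes)
    cumulative = 0                              # Fa: cumulative frequency so far
    for lo, hi, f in classes:
        if cumulative + f >= total_f / 2:       # first class reaching half the total
            width = hi - lo + 1                 # class interval from the boundaries
            return (lo - 0.5) + (total_f / 2 - cumulative) / f * width
        cumulative += f

def grouped_mode(classes):
    freqs = [f for _, _, f in classes]
    k = freqs.index(max(freqs))                 # modal class: highest frequency
    lo, hi, f1 = classes[k]
    f0 = freqs[k - 1] if k > 0 else 0
    f2 = freqs[k + 1] if k + 1 < len(freqs) else 0
    return (lo - 0.5) + (f1 - f0) / (2 * f1 - f0 - f2) * (hi - lo + 1)

print(grouped_mean(classes), grouped_median(classes), grouped_mode(classes))
```

The three results agree with the hand calculations above (823.5, about 832.8 and about 858.3).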
Mean
Advantages: (i) All values in the distribution are used in its calculation, so it can
be regarded as more representative than the other two measures.
Disadvantages: (i) Its result can be easily distorted by extreme values. As such, its
result may be rather lower or higher than the bulk of the values
and becomes unrepresentative.
(ii) In case of open end classes, mean can be calculated only if their
class marks are determined. If such classes contain a large
proportion of the values, then the mean may be subjected to
substantial error.
Median
Advantage: Its result will not be affected by extreme values and open end
classes.
Mode
Advantages: (i) Its result will not be affected by extreme values and open end
classes.
(i) Always select the mean whenever there is no special reason for choosing the other
two measures.
(ii) Select the median if the distribution contains a substantial number of extremely large
or small values.
(iii) Select the mode if integral result is preferred as in the cases where the data are in
ordinal scales.
The figure below represents frequency distributions with some of the characteristics we
need to understand. The two curves in (a) represent two distributions with the same mean 𝑋̅,
but with different variations. The two curves in (b) represent two distributions with the same
variations but with unequal means, 𝑋̅1 and 𝑋̅2. Finally, (c) represents two distributions with
unequal means and unequal variations.
The measures of central tendency are, therefore, insufficient. They must be supported and
supplemented with other measures. A measure of variation is designed to state the extent
to which the individual measures differ on an average from the mean. Hence for an
adequate summary and characteristics description of a set of data we need to determine the
data variation.
The most common measures of variability or dispersion are the range, mean deviation,
interquartile range, deciles, percentiles, variance and standard deviation.
Example 1
Consider the following measurements, in grams, for two samples of strawberry jam bottled
by companies A and B:
Table 5: Strawberry
Sample for Company A    31   32   32   33   32
Sample for Company B    28   29   32   35   36
Both samples have the same mean, 32 grams. It is obvious that company A, in comparison
with company B, bottles strawberry jam with a more consistent content. We say that the
variability of the observations is smaller for company A. Therefore in buying strawberry
jam we would feel more confident that the bottle we select will be closer to the advertised
average content if we buy from company A.
The range of a set of numbers is the difference between the largest (L) and the smallest (S)
number in the set. Therefore we have range = L - S and the Co-efficient of range
\( = \frac{L - S}{L + S} \).
Though range is simple and can be obtained easily, its result is unstable. This is particularly
true if the sample size is large. So whenever the sample size is over 10, we seldom choose
to use range to indicate variability of the data.
Absolute Mean deviation is the average of the absolute deviations of the numerical data
from their mean.
Table 6: Mean Absolute Deviation
Worker (i)                  1        2        3        4       5       6        7        8
Hourly wage rate (x_i)     35       38       46       60      65      69       72       78
|x_i - 57.875|           22.875   19.875   11.875    2.125   7.125   11.125   14.125   20.125
\[ \text{Mean Absolute deviation} = \frac{\sum_{i=1}^{8} |x_i - 57.875|}{8} = \frac{109.25}{8} = 13.656 \]
The mean deviation is a good measure to show the extent of variation of the data in a
distribution. However, when this measurement is used in further analysis, it would give
rise to some unnecessary tedious mathematical problem as a result of its absolute value
term. To avoid this pitfall, we can use the standard deviation instead.
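The mean absolute deviation of the Table 6 wages can be reproduced in a few lines. This is a verification sketch only; the variable names are illustrative.

```python
# Hourly wage rates from Table 6
wages = [35, 38, 46, 60, 65, 69, 72, 78]

mean = sum(wages) / len(wages)
abs_deviations = [abs(x - mean) for x in wages]       # |x_i - mean| for each worker
mean_abs_deviation = sum(abs_deviations) / len(wages)

print(mean, sum(abs_deviations), round(mean_abs_deviation, 3))
```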
Variance is the average of the squared deviations from the arithmetic mean.
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
Standard deviation of a population is the positive square root of the variance
Using the values in table 4 determine the variance and standard deviation
Solution
Daily Wages     Number of Workers   Class Mark    f_i (x_i - x̄)^2
                     (f_i)             (x_i)
200 - 399              5               299.5         1,372,880
400 - 599             15               499.5         1,574,640
600 - 799             25               699.5           384,400
800 - 999             30               899.5           173,280
1000 - 1199           18             1,099.5         1,371,168
1200 - 1399            7             1,299.5         1,586,032
Total                100                             6,462,400
\[ \text{Variance } (s^2) = \frac{6{,}462{,}400}{99} = 65{,}276.77 \]
\[ \text{Standard deviation } (s) = \sqrt{65{,}276.77} \approx 255.49 \]
\[ s^2 = \frac{\sum x^2 f}{\sum f} - \left( \frac{\sum x f}{\sum f} \right)^2 \]
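The variance and standard deviation of the grouped wages can be checked as below. The sketch is illustrative: it uses the class marks of Table 4 and also evaluates the shortcut form, which divides by the total frequency rather than by n - 1 and therefore gives a slightly smaller value.

```python
import math

# (class mark, frequency) pairs for Table 4
classes = [(299.5, 5), (499.5, 15), (699.5, 25),
           (899.5, 30), (1099.5, 18), (1299.5, 7)]

n = sum(f for _, f in classes)
mean = sum(f * x for x, f in classes) / n
sum_sq = sum(f * (x - mean) ** 2 for x, f in classes)   # sum of f(x - mean)^2

sample_variance = sum_sq / (n - 1)                      # divisor n - 1
sample_sd = math.sqrt(sample_variance)

# Shortcut form: sum(f x^2)/sum(f) - (sum(f x)/sum(f))^2, divisor n
shortcut_variance = sum(f * x * x for x, f in classes) / n - mean ** 2

print(round(sample_variance, 2), round(sample_sd, 2), round(shortcut_variance, 2))
```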
The values of the standard deviations cannot be used as the bases of the comparison
because:
(a) units of measurements of the two distributions may be different, and
(b) average values of two distributions may be widely dissimilar.
The correct measure that should be used is the coefficient of variation (CV), which does
not bear any unit of measurement, given as
\[ CV = \frac{s}{\bar{x}} \times 100\% \]
Example 4
The following table shows the summary statistics for the daily wages of two types of
companies.
i. Compare these two daily wages distributions and state the company with a higher
distribution variability.
ii. Compute the combined average wage and standard deviation.
Solution
In comparison        Distribution    Reason
Average magnitude    II > I          \( \bar{x}_{II} = 150 > \bar{x}_{I} = 100 \)
Variation            I > II          \( CV_{I} = \frac{20}{100} \times 100\% = 20\% > CV_{II} = \frac{24}{150} \times 100\% = 16\% \)
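The comparison in the solution can be reproduced directly from the summary figures that appear in it (means 100 and 150, standard deviations 20 and 24). The helper name `coefficient_of_variation` is an assumption made for this sketch.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: the standard deviation relative to the mean."""
    return sd / mean * 100

cv_1 = coefficient_of_variation(20, 100)    # distribution I
cv_2 = coefficient_of_variation(24, 150)    # distribution II

more_variable = "I" if cv_1 > cv_2 else "II"
print(round(cv_1, 1), round(cv_2, 1), more_variable)
```

Although distribution II has the larger standard deviation, distribution I is relatively more variable once the difference in means is taken into account.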
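The lower quartile and the upper quartile of grouped data are given by
\[ Q_1 = L + \left( \frac{\tfrac{1}{4}\sum f - F_a}{f_w} \right) c_i \quad \text{and} \quad Q_3 = L + \left( \frac{\tfrac{3}{4}\sum f - F_a}{f_w} \right) c_i \]
respectively.
Where: L is the lower real limit of the class containing lower/upper quartile score
fw is the frequency of the lower/upper quartile class
Fa is the cumulative frequency of the classes above (preceding) the lower/upper quartile class
ci is the class interval of the lower/upper quartile class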
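As an illustration, the quartile formulas can be applied to the daily wages in Table 4. The function name `grouped_quantile` is hypothetical, and the class boundaries are assumed to lie 0.5 below each lower limit.

```python
# Table 4 as (lower limit, upper limit, frequency)
classes = [(200, 399, 5), (400, 599, 15), (600, 799, 25),
           (800, 999, 30), (1000, 1199, 18), (1200, 1399, 7)]

def grouped_quantile(classes, p):
    """Value below which a proportion p of the grouped observations fall."""
    total_f = sum(f for _, _, f in classes)
    target = p * total_f
    cumulative = 0                          # Fa: cumulative frequency so far
    for lo, hi, f in classes:
        if cumulative + f >= target:        # class containing the quantile
            return (lo - 0.5) + (target - cumulative) / f * (hi - lo + 1)
        cumulative += f

q1 = grouped_quantile(classes, 0.25)
q3 = grouped_quantile(classes, 0.75)
print(q1, q3, q3 - q1)                      # quartiles and inter-quartile range
```

With p = 0.5 the same function returns the grouped median.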
Thus by measuring variation we are able to determine the nature and cause of variation in
order to control the variation itself. In matters of health, variation in body temperature,
pulse beat and blood pressure are the basic guides to diagnosis. Prescribed treatment is
designed to control their variation. In industrial production, efficient operation requires
control of quality variation, the causes of which are sought through inspection and quality
control programmes. Thus, measurement of variation is basic to the control of cause of
variation. In engineering problems, measures of variation are often specially important. In
social sciences, a special problem requiring the measurement of variability is the
measurement of “inequality” of the distribution of income and wealth, etc.
Again measures of variations enable comparison to be made of two or more series with
regard to their variability. The study of variation may also be looked upon as a means of
determining uniformity or consistency. A high degree of variation would mean little
uniformity or consistency whereas a low degree of variation would mean greater
uniformity or consistency.
Lastly many powerful analytical tools in statistics such as correlation analysis, the test of
hypothesis, the analysis of fluctuations, techniques of production control, cost control,
among others are based on measures of variations.
Properties of a Good Measure of Variation
A good measure of variation should possess, as far as possible, the following properties:
i. It should be simple to understand and easy to compute
ii. It should be rigidly defined
iii. It should be based on each and every observation of the distribution
iv. It should be amenable to further algebraic treatment
v. It should have sampling stability and
vi. It should not be unduly affected by extreme observations
a) Moments
The rth moment of a variable x about the mean \( \bar{x} \) is
\( \mu_r = \frac{\sum f (x - \bar{x})^r}{\sum f} \), such that using the set of discrete data
\( S = \{5, 7, 9\} \), whose mean is 7, the first central moment about the mean is
\[ \mu_1 = \frac{1(5-7)^1 + 1(7-7)^1 + 1(9-7)^1}{3} = 0 \]
The second moment about the mean will be
\[ \mu_2 = \frac{1(5-7)^2 + 1(7-7)^2 + 1(9-7)^2}{3} = 2.67 \]
which is equal to the variance of the data.
The first raw moment (about zero) gives the mean, while the first moment about the mean is
always zero. The second moment about the mean tells us about the variance, the third about
the skewness, i.e. if \( \mu_3 \neq 0 \) then the data is skewed, and the fourth moment about
the mean tells us about the kurtosis.
NB: The following relationship holds true
a) The moments about the mean (central moments) and the raw moments;
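\( \mu_1 = 0 \),
\( \mu_2 = \mu_2' - (\mu_1')^2 \),
\( \mu_3 = \mu_3' - 3\mu_1'\mu_2' + 2(\mu_1')^3 \) and \( \mu_4 = \mu_4' - 4\mu_1'\mu_3' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4 \)
Using \( S = \{5, 7, 9\} \) determine \( \mu_3 \) and \( \mu_4 \).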
b) The betas and the central moments;
\( \beta_1 = \frac{(\mu_3)^2}{(\mu_2)^3} \) and \( \beta_2 = \frac{\mu_4}{(\mu_2)^2} \). The first beta \( (\beta_1) \) is used to measure the data skewness,
while the second beta \( (\beta_2) \) measures the kurtosis of the plotted data curve as discussed below.
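The central moments and the two betas for S = {5, 7, 9} can be computed as in this sketch. The function `central_moment` is a hypothetical helper that gives each observation a frequency of 1.

```python
def central_moment(data, r):
    """r-th moment about the mean, each observation having frequency 1."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** r for x in data) / len(data)

S = [5, 7, 9]
mu1, mu2, mu3, mu4 = (central_moment(S, r) for r in (1, 2, 3, 4))

beta1 = mu3 ** 2 / mu2 ** 3      # skewness measure
beta2 = mu4 / mu2 ** 2           # kurtosis measure

print(mu1, round(mu2, 2), mu3, round(mu4, 2), beta1, round(beta2, 2))
```

Because the three values are symmetric about 7, the odd central moments vanish and the first beta is zero.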
b) Skewness
Asymmetrical data is said to have a skewed distribution. The distribution is either skewed to
the right or to the left; otherwise it is a symmetrical distribution (for example, normally distributed data).
Symmetric refers to equal amounts of data on either side of the ‘middle’ of the data, i.e.
the distribution of the data on one side is the mirror image of the distribution on the other
side. Skewness occurs when one ‘side’ of the data spreads out to take on larger values than
the other side. If the mean is much bigger than the median, then there must be large values
on the right-hand side of the distribution, compared to the left hand side (right skewed).
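For right (positively) skewed data \( m_o < m_d < \bar{x} \), while for left (negatively) skewed
data \( \bar{x} < m_d < m_o \).
Normally distributed data has the three measures of central tendency equal such that
\( \bar{x} = m_o = m_d \).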
Measures of Skewness
\[ S_{KB} = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2M_d}{Q_3 - Q_1} \]
iii. Kelly’s coefficient of skewness. It is based on percentiles and deciles such that
\[ S_K = \frac{P_{90} + P_{10} - 2P_{50}}{P_{90} - P_{10}} = \frac{D_9 + D_1 - 2D_5}{D_9 - D_1} \]
Also the fourth measure of skewness is based on the third moment such that if \( \mu_3 \neq 0 \) then
the distribution is said to be skewed. In measuring variation, we were interested in the
amount of the variation or its degree, while the skewness gives its direction.
c) Kurtosis
Platykurtic (- Kurtosis)
Examples
Stem and leaf
The following record represents the long jump results (in meters) of inter-house
competitions in a certain school within Machakos County:
2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0
Stem Leaf
2 35578
3 266
4 5
5 0
Note:
Say what the stem and leaf mean (Stem "2" Leaf "3" means 2.3)
In this case each leaf is a decimal
It is OK to repeat a leaf value
5.0 has a leaf of "0"
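The stem-and-leaf display above can be generated programmatically. A minimal sketch, assuming every value has exactly one decimal place so that the stem is the whole number and the leaf is the tenths digit:

```python
from collections import defaultdict

jumps = [2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0]

def stem_and_leaf(values):
    plot = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)             # 2.3 -> 23, avoiding float truncation
        stem, leaf = divmod(tenths, 10)    # 23 -> stem 2, leaf 3
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(jumps)
for stem, leaves in plot.items():
    print(stem, "".join(str(leaf) for leaf in leaves))
```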
Box-and-Whisker Plots:
Under this exploration technique statistics assumes that the data points are clustered
around some central value, the "box". To create a box-and-whisker plot, the data is
numerically ordered. The box divides the entire data set into quarters, called "quartiles".
Box plots enable us to study the distributional characteristics of a group of scores as well
as the level of the scores.
The median (middle quartile) marks the mid-point of the data and is shown by the line that
divides the box into two parts.
Upper quartile-Seventy-five percent of the scores fall below the upper quartile.
Inter-quartile range-The middle “box” represents the middle 50% of scores for the
group. The range of scores from lower to upper quartile is referred to as the inter-quartile
range.
Whiskers-The upper and lower whiskers represent scores outside the middle 50%.
Whiskers often (but not always) stretch over a wider range of scores than the middle
quartile groups. Any data points outside the range of the whiskers are plotted
individually. These points are often called “outliers”, based on the 1.5 IQR rule of thumb:
points lying more than 1.5 times the inter-quartile range below the lower quartile or above
the upper quartile. The term outlier is usually used for unusual or extreme points.
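The quantities a box-and-whisker plot displays can be computed as in the sketch below, applied here to the long jump data from the stem-and-leaf example. It assumes the median-of-halves convention for the quartiles; other conventions exist and can give slightly different values. The function names are illustrative.

```python
import statistics

def five_number_summary(values):
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower, upper = data[:half], data[n - half:]   # the median is excluded when n is odd
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

def outliers(values):
    """Points beyond 1.5 x IQR from the quartiles (the rule of thumb)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

jumps = [2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0]
summary = five_number_summary(jumps)
print(summary, outliers(jumps))
```

For this data set no jump falls outside the 1.5 IQR fences, so the whiskers extend to the minimum and maximum.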
Revision Exercise
Determine
i) The Company with the higher dispersion in awarding the air time
allowance
ii) The combined standard deviation
7. Differentiate the following terms as they apply in scientific research
i. Sample and a population
ii. Skewness and Kurtosis of a data distribution
iii. Sample statistic and Population parameter
iv. Sampling error and Non-Sampling error
8. The table below shows the wages of 80 employees of XYZ Company
Wages Number of
Ksh ‘000’ Employees
10-15 5
15-20 x
20-25 17
25-30 20
30-35 y
35-40 16
40-45 4
Given that the median wage is Ksh 27,000, determine
i. The values of x and y
ii. The mean
iii. The inter-quartile wage
iv. Karl Pearson’s coefficient of Skewness (Skp)
10. Define the term variable as used in statistics, giving two examples.
11. Explain in words each of the following terms as used in Statistics:
(i) Mean;
(ii) Median.
(iii) Mode
12. Estimate the sample median and quartiles using the box plot given below
13. The data given below represents the age in years of employees of an organisation.
28, 30, 33, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55
Use the data to construct a box and whisker
plot.
14. The data given below represents the frequency distribution of marks scored in
Mathematics by a random sample of 1000 students who sat for KCSE
examination in the year 2022.
Marks scored 00 - 09 10 - 19 20 - 29 30 - 39 40 - 49 50 - 59 60 - 69 70 - 79 80 - 89 90 - 99
No. of Students 15 34 143 169 189 170 125 105 37 13
(a) (i) If the top 65% of the students are supposed to pass this examination, determine
the mark which should be set as the pass-mark to achieve this. (2 marks)
(ii) Grades for results are awarded as follows:
Fail to the bottom 20%,
Pass to the next 35%,
Credit to the next 30%,
Distinction to the top 15%.
Determine the lower and upper limits of the marks for each grade: Fail, Pass,
Credit and Distinction. (8 marks)
(b) (i) Suppose the pass-mark is set at 45 marks, determine the proportion of the
students who will pass. (2 marks)
(ii) Determine the proportion of the students who will score the grades Fail, Pass,
Credit and Distinction, if the grades are awarded as follows:
Fail for marks below 40,
RELATIONSHIPS
eg the age of the students of a class. A distribution involving two discrete variables is
Types of Correlation
Correlation can be classified in several different ways. Three of the most important
are:
i. Linear or non-linear
ii. Simple, partial or multiple; and
The following are the important methods of ascertaining whether two variables are
correlated or not:
I. Scatter Diagram Method;
II. Karl Pearson’s Coefficient of Correlation;
III. Spearman’s Rank Correlation Coefficient; and
This is a dot chart, also called a dotogram, with one dot for each pair of X and Y values.
It uses dots to represent values for two different numeric variables. The position of each
dot on the horizontal and vertical axis indicates values for an individual data point. By
looking at the scatter of the various points, it is possible to form an idea as to whether the
variables are related.
By observing the following two variables X and Y make a scatter diagram and state if
they have any correlation.
X 10 12 11 18 21
Y 15 20 22 25 27
X: 10 20 30 40 50
It measures their joint variation. When x and y are not related its value is close to zero.
The position \( (\bar{x}, \bar{y}) \) is known as the centroid of all the points.
Illustration 2
Find correlation coefficient between the sales and expenses from the data given below:
Firm: A B C D E F G H I J
Sales (Ksh, 000): 50 50 55 60 65 65 65 60 60 50
Solution
Calculating the correlation coefficient
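\[ \bar{X} = \frac{\sum X}{N} = \frac{580}{10} = 58; \quad \bar{Y} = \frac{\sum Y}{N} = \frac{140}{10} = 14 \]
With \( x = X - \bar{X} \) and \( y = Y - \bar{Y} \),
\[ r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{70}{\sqrt{360 \times 22}} = \frac{70}{88.994} = 0.787 \]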
There is a strong positive correlation between X and Y.
The covariance between x and y is 70/10=7
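The calculation can be sketched in code. Only the sales column and the summary sums (the sum of xy is 70 and the sum of y squared is 22) are available in the illustration, so the snippet checks the sales deviations and then evaluates r from those sums. The general function `pearson_r` is a hypothetical helper shown for use with complete paired data.

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's coefficient from two equal-length lists of values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

sales = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]
mean_sales = sum(sales) / len(sales)
sum_x2 = sum((x - mean_sales) ** 2 for x in sales)   # squared deviations of sales

r = 70 / math.sqrt(sum_x2 * 22)    # sums of xy and y^2 taken from the solution
covariance = 70 / len(sales)
print(mean_sales, sum_x2, round(r, 3), covariance)
```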
Exercise
i. The following data relate to the age of 10 employees from company ABC Ltd
and the number of days which they reported sick in a month:
Age: 20 30 32 35 40 46 52 55 58 62
Sick days: 11 12 10 13 14 16 15 17 18 19
By letting the age and sick days be presented by variable X and Y respectively, calculate
Karl Pearson’s coefficient of correlation and interpret its value.
ii. Find the coefficient of correlation by Karl Pearson’s method between X and Y
and interpret its value.
X 57 42 40 33 42 45 42 44 40 56 44 43
Y 10 60 30 41 29 27 27 19 18 19 31 29
Coefficient of Determination*
The coefficient of determination equals r2. It expresses the proportion of the
variance in Y due to X, that is, the ratio of the explained variance to the total variance, e.g.
if r=0.9, r2 will be 0.81 and this would mean that 81 per cent of the variation in the
dependent variable has been explained by the independent variable. The maximum value
of r2 is 1 because it is possible to explain all of the variation in Y, but it is not
possible to explain more than all of it.
3. RANK CORRELATION COEFFICIENT (R)
This measure is especially useful where quantitative measures of certain factors (such as in the
evaluation of leadership ability or the judgment of female beauty) cannot be fixed, but
the individuals in the group can be arranged in order, thereby obtaining for each
individual a number indicating his (her) rank* in the group. The rank correlation
coefficient is therefore applied to a set of N paired ordinal (ranked) numbers. It is defined as
\[ R = 1 - \frac{6\sum D^2}{N(N^2 - 1)} = 1 - \frac{6\sum D^2}{N^3 - N} \]
Where R denotes rank coefficient of correlation and D refers to the difference of ranks
between paired items in two series. It derived the name Spearman’s rank correlation
coefficient in honour of the British psychologist Charles Edward Spearman, who
developed it in 1904. Again note that \( -1 \le R \le 1 \).
Illustration
Two managers were asked to rank a group of employees in order of their potential for
eventually being top managers. Given their rankings below, compute R and comment.
Ranking Solution
Employees Manager I Manager II (R1- R2)2=D2
A 10 9 1
B 2 4 4
C 1 2 1
D 4 3 1
E 3 1 4
F 6 5 1
G 5 6 1
H 8 8 0
I 7 7 0
J 9 10 1
N=10 ∑D2=14
\[ R = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 14}{990} = 1 - 0.085 = 0.915 \]
Thus, we find that there is a high degree of positive correlation in the ranks assigned
by the two managers.
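The illustration can be verified with a short function. The name `spearman_from_ranks` is chosen for this sketch, and it assumes the inputs are already ranks with no ties.

```python
# Ranks given to employees A..J by the two managers
manager_1 = [10, 2, 1, 4, 3, 6, 5, 8, 7, 9]
manager_2 = [9, 4, 2, 3, 1, 5, 6, 8, 7, 10]

def spearman_from_ranks(r1, r2):
    """R = 1 - 6 * sum(D^2) / (N^3 - N) for untied ranks."""
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

sum_d2 = sum((a - b) ** 2 for a, b in zip(manager_1, manager_2))
R = spearman_from_ranks(manager_1, manager_2)
print(sum_d2, round(R, 3))
```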
Exercise
Calculate the rank correlation coefficient for the following data of marks of two tests
given to candidates for a clerical job.
Preliminary test 92 89 87 86 83 77 71 63 53 50
Final test 86 83 91 77 68 85 52 82 37 57
number of items whose ranks are common. If there are more than one such group of
items with common rank, this value is added as many times as the number of such
groups. The formula can thus be written as;
\[ R = 1 - \frac{6\left(\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \dots\right)}{N^3 - N} \]
Illustration
An examination of eight applicants for a clerical post was taken by a firm. From the
marks obtained by the applicants in the Accountancy and Statistics papers, compute
rank coefficient of correlation.
Applicant A B C D E F G H
Marks in Accountancy 15 20 28 12 40 60 20 80
Marks in Statistics 40 30 50 30 20 10 30 60
Solution
CALCULATION OF RANK CORRELATION COEFFICIENT
\[ R = 1 - \frac{6\left(\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \dots\right)}{N^3 - N} \]
The item 20 is repeated 2 times in series X and hence m1=2. In series Y, the item 30
occurs 3 times and m2=3. Substituting these values in the above formula;
\[ R = 1 - \frac{6\left(81.5 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(3^3 - 3)\right)}{8^3 - 8} = 1 - \frac{6(81.5 + 0.5 + 2)}{504} = 1 - \frac{6 \times 84}{504} = 0 \]
There is no correlation between the marks obtained in two subjects.
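The tied-rank calculation can be sketched as follows. Tied marks are given the average of the positions they occupy, and the correction term (m^3 - m)/12 is added once for each group of ties. The helper names are hypothetical, and ranking from the smallest mark upwards is an assumption; ranking both series from the largest downwards gives the same differences.

```python
from collections import Counter

accountancy = [15, 20, 28, 12, 40, 60, 20, 80]
statistics_marks = [40, 30, 50, 30, 20, 10, 30, 60]

def average_ranks(values):
    """Rank from the smallest value up; tied values share their mean position."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def spearman_with_ties(xs, ys):
    n = len(xs)
    sum_d2 = sum((a - b) ** 2
                 for a, b in zip(average_ranks(xs), average_ranks(ys)))
    correction = sum((m ** 3 - m) / 12
                     for series in (xs, ys)
                     for m in Counter(series).values() if m > 1)
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)

R = spearman_with_ties(accountancy, statistics_marks)
print(R)
```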
Exercise
Ten ladies in a beauty contest were ranked by three judges in the following order.
Ladies A B C D E F G H I J
1st. judge 1 6 5 10 3 2 4 9 7 8
2nd. judge 3 5 8 4 7 10 2 1 6 9
3rd judge 6 4 9 8 1 2 3 10 5 7
Use the rank correlation coefficient to determine which pair of judges has the nearest
approach to common tastes in beauty.
Solution
In order to find out which pair of judges has the nearest approach to common tastes in
beauty, compare rank correlation between the judgments of;
(i) 1st judge and 2nd judge (ii) 2nd judge and 3rd judge and (iii) 1st judge and 3rd
judge
Since coefficient of correlation is maximum in the judgment of the first and third judges,
we conclude that they have the nearest approach to common tastes in beauty.
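The three pairwise coefficients behind this conclusion can be computed as in the sketch below, which applies the untied rank formula to each pair of judges. The dictionary keys and function name are illustrative.

```python
from itertools import combinations

# Ranks awarded to ladies A..J by each judge
judges = {
    "1st": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "2nd": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "3rd": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

def spearman_from_ranks(r1, r2):
    n = len(r1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

results = {(a, b): spearman_from_ranks(judges[a], judges[b])
           for a, b in combinations(judges, 2)}
closest_pair = max(results, key=results.get)   # pair with the largest coefficient

for pair, value in results.items():
    print(pair, round(value, 3))
print(closest_pair)
```

The first and third judges give the only positive coefficient (about 0.636); the other two pairs are negatively correlated.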
Regression Analysis
Introduction
Regression was first used by Francis Galton (1877) in his fathers’ vs sons’ heights
relationship study. He described the relationship by using a ‘regression line’. The term is
still used to describe a line drawn through a group of points to represent the trend,
although most modern writers use the term estimating line or predicting line instead
of regression line.
Regression analysis establishes the relationship between the dependent variable and the
regressors by obtaining the rate of change of the response variable due to a unit change of
the independent variable(s). It enables the analyst to estimate (or predict) the unknown
values of one variable from known values of another variable. It can also be used to obtain
a measure of the error (standard error) involved in using the regression line as a basis for
estimations. We can use regression analysis to estimate correlation between two variables.
The Linear Bivariate Regression Model (Simple Regression)
The average relationship between X and Y can be adequately described by a linear
equation 𝑌 = 𝑎 + 𝑏𝑋 whose geometrical presentation is a straight line as in the diagram
below:
In this equation a and b are the population regression coefficients (the intercept and the
slope). An individual value of Y in each sub-population may be expressed as
𝑌 = 𝐸(𝑌|𝑋) + 𝑒, where e is the error term or stochastic disturbance term. The errors are
assumed to be independent random variables, because the Y's are independent random
variables, and their expectation is zero; E(e) = 0. Moreover, if the Y's are normal
variables, the errors can also be assumed to be normal with the same variance at every
value of X.
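The average relationship between two variables x and y can be adequately described by a
linear equation y = a + bx whose geometrical presentation is a straight line as in the
diagrams below:
The regression line is the line for which the residuals (the red lines between the observed
points and the fitted line) are as small as possible overall. In least squares fitting, it is
the sum of the squared residuals that is minimal.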
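A minimal Python sketch of fitting y = a + bx by least squares is given here for illustration. The data are hypothetical, and the function names fit_line and sum_squared_residuals are assumptions made for this example. The slope is computed as b = ∑(x − x̄)(y − ȳ)/∑(x − x̄)² and the intercept as a = ȳ − b x̄, which are the standard least squares estimates.

```python
def fit_line(x, y):
    """Least squares estimates (a, b) for the line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


def sum_squared_residuals(x, y, a, b):
    """Sum of squared vertical distances between the points and the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
best = sum_squared_residuals(x, y, a, b)
print(f"a = {a:.3f}, b = {b:.3f}, sum of squared residuals = {best:.3f}")

# Any other line gives a larger sum of squared residuals
print(sum_squared_residuals(x, y, a + 0.1, b) > best)
print(sum_squared_residuals(x, y, a, b - 0.1) > best)
```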
The Standard Error (SE) of a coefficient is a measure of how much the estimated coefficient
would vary if the same regression were applied to many samples from the same population.
A relatively small SE value therefore indicates that the coefficient would remain stable if
the same regression model were fitted to many different samples from that population.
Where;
Terms
Observations – data points, either observed or measured, often indexed with i.
x variable – predictors or independent variables in the model, usually on the right side
of the model equation. Some authors refer to the independent variables as exogenous
variables, predictor variables or regressors.
y variable – outcome, response, or dependent variable in the model, typically the
lone term on the left side of the model equation. Some authors refer to the dependent
variable as the endogenous variable, prognostic variable or regressand.
inputs – also called model terms, these are the items on the right side of the model
equation; note that a two-predictor model could have three inputs if it includes an
interaction term between the two predictors.
The linear component – this is comprised of explanatory variables that have additive
effects. Additive effects mean that predictor effects operate individually, but can be
added together.
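Design Matrix – the matrix X whose rows hold the values of the inputs for each
observation, so that the model can be written as y = Xb + e.
Where;
\[
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
b = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \end{bmatrix}, \quad
e = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}
\]
and
\[
X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}
  = \begin{bmatrix}
    1 & x_{11} & \cdots & x_{m1} & x_{11}^2 & x_{11}x_{21} & \cdots & x_{m-1,1}\,x_{m1} \\
    1 & x_{12} & \cdots & x_{m2} & x_{12}^2 & x_{12}x_{22} & \cdots & x_{m-1,2}\,x_{m2} \\
    \vdots & \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\
    1 & x_{1N} & \cdots & x_{mN} & x_{1N}^2 & x_{1N}x_{2N} & \cdots & x_{m-1,N}\,x_{mN}
  \end{bmatrix}
\]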
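As an illustration of how such a design matrix is assembled, the following is a minimal Python sketch. The function name design_matrix and the data are hypothetical. Which second-order columns to include is a modelling choice; this sketch assumes that every square and every pairwise product of the predictors is included, after the column of ones and the predictors themselves.

```python
from itertools import combinations_with_replacement


def design_matrix(observations):
    """Build one row per observation: 1, the predictors, then all squares and pairwise products."""
    rows = []
    for values in observations:
        # (x1, x2) -> x1*x1, x1*x2, x2*x2
        second_order = [u * v for u, v in combinations_with_replacement(values, 2)]
        rows.append([1] + list(values) + second_order)
    return rows


# Hypothetical data: three observations of two predictors (x1, x2)
observations = [(2, 3), (1, 5), (4, 0)]
X = design_matrix(observations)

for row in X:
    print(row)
```

With two predictors each row has six entries: the constant 1, x1, x2, x1², x1·x2 and x2². The column of ones is what carries the intercept term of the model.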
Residual error – this is the stochastic component in the model. It is typically the
final term on the right side of the model equation and is included to account for any
unexplained information, that is, the observed information that is not accounted for by
the predictors or inputs. This stochastic component is directly related to the response
variable through a distribution. Any observed data are thought to have come from some
underlying distribution. For instance, if the heights of 100 adults are taken, it
would be assumed that the observations are from some distribution that has the mean
and variance of heights from all people. The number of insects observed on a transect is
Coefficient of Determination*
The coefficient of determination is equal to r². It expresses the proportion of the variance
in Y due to X, that is, the ratio of the explained variance to the total variance. For
example, if r = 0.9, r² will be 0.81, and this would mean that 81 per cent of the variation
in the dependent variable has been explained by the independent variable. The maximum
value of r² is 1, because it is possible to explain all of the variation in Y, but it is not
possible to explain more than all of it. The coefficient of determination of a linear
regression model is the quotient of the variances of the fitted values and observed values
of the dependent variable.
\[
r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}}; \qquad
r^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}
\]
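The ratio of explained to total variation can be computed directly from a fitted line. The following is a minimal Python sketch with hypothetical data; the function name fit_line is an assumption made for this example, and the line is fitted by ordinary least squares.

```python
def fit_line(x, y):
    """Least squares estimates (a, b) for the line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    return y_bar - b * x_bar, b


# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
y_bar = sum(y) / len(y)
fitted = [a + b * xi for xi in x]

explained = sum((f - y_bar) ** 2 for f in fitted)            # sum of (y_hat - y_bar)^2
total = sum((yi - y_bar) ** 2 for yi in y)                   # sum of (y - y_bar)^2
unexplained = sum((yi - f) ** 2 for yi, f in zip(y, fitted)) # sum of (y - y_hat)^2

r2 = explained / total
print(f"explained = {explained:.2f}, total = {total:.2f}, r2 = {r2:.2f}")
```

For a least squares line with an intercept, the explained and unexplained variation add up to the total variation, so r² can equally be computed as 1 minus the unexplained share.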
The weakness of the multiple r² is that it never decreases, and usually increases, when a
variable is added, even one that explains almost no variance. Hence, multiple r²
encourages the inclusion of junk variables.
Adjusted R²
This is the coefficient of determination adjusted for the number of independent variables
in the regression model. Unlike the coefficient of determination, the adjusted R² may
decrease if the variables entered in the model do not add significantly to the model fit.
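\[
r^2_{adj} = 1 - \frac{\text{Unexplained variation}/(n-k-1)}{\text{Total variation}/(n-1)}; \qquad
r^2_{adj} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \cdot \frac{(n-1)}{(n-k-1)}
\]
where n is the number of observations and k is the number of independent variables in the
model.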
Thus, the adjusted r² will decrease when variables are added that explain little or even no
variance, while it will increase if variables are added that explain a lot of variance.
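The effect of the adjustment can be illustrated numerically. The following is a minimal Python sketch; the function name adjusted_r2 and all the r² values are hypothetical, chosen only to show the direction of the change. It uses the fact that the ratio of unexplained to total variation equals 1 − r².

```python
def adjusted_r2(r2, n, k):
    """Adjusted r^2 for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


n = 20  # hypothetical number of observations

# One predictor giving r^2 = 0.81
base = adjusted_r2(0.81, n, k=1)

# Three junk variables added: r^2 barely moves (0.81 -> 0.812), k rises to 4
with_junk = adjusted_r2(0.812, n, k=4)

# One useful variable added: r^2 rises clearly (0.81 -> 0.90), k rises to 2
with_useful = adjusted_r2(0.90, n, k=2)

print(f"base        = {base:.4f}")
print(f"with junk   = {with_junk:.4f}")
print(f"with useful = {with_useful:.4f}")
```

In this illustration the junk variables raise the ordinary r² slightly but lower the adjusted value, while the useful variable raises both.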
INTRODUCTION TO PROBABILITY