
SMA 160 - Stds Notes (2025)

The document provides an introduction to probability and statistics, covering key concepts such as descriptive and inferential statistics, sampling methods, and data collection techniques. It distinguishes between probability and non-probability sampling, outlines various scales of measurement, and discusses measures of central tendency. Additionally, it includes examples and exercises to illustrate statistical concepts and methods.

Uploaded by

ephymannasse
Copyright
© © All Rights Reserved

Introduction to Probability & Statistics

SMA_160 INTRODUCTION TO PROBABILITY AND STATISTICS

INTRODUCTION TO STATISTICS

Statistics- Refers to a branch of Mathematics dealing with data collection, organization,
analysis, interpretation and presentation. Descriptive statistics refers to the quantitative or
qualitative description of sample measurements and characteristics. Inferential statistics
refers to the use of sample statistics to draw conclusions about the parameters of the parent
population from which the samples were drawn, subject to the stated significance levels.

Definitions of other statistical basic terms

Population
A collection of items sharing a common characteristic, i.e. all subjects possessing the
common characteristic that is being studied.
Census
An examination or the collection of data from every element in a population.
Sample
A subgroup or subset of the population.
Parameter
Characteristic or measure obtained from a population.

Sampling Error
The difference between a sample result and the true population result that
arises by chance because only a sample, rather than the whole population, is observed.

Non-sampling error

Encompasses all errors that are not due to sampling itself, mostly caused by human
judgement, such as questionnaire wording, data entry errors and biased decisions
(measurement and analysis errors), i.e. sample data that are incorrectly collected,
recorded or analyzed.

Margin of error

This is the amount, usually stated as a percentage, allowed for the difference between
a sample statistic and the actual population parameter.

Undergraduate Lecture Notes-Dr. Kasina MM Page 1


Introduction to Probability & Statistics

Sampling
Sampling is a technique of selecting individual members or a subset of the population to
make statistical inferences from them and estimate characteristics of the whole population
from which the samples were drawn. Gathering the information by examining every item
in the population is referred to as a census.

Methods of sampling

These are the techniques of selecting the items to represent the population of the study.

Sampling Techniques

Probability Sampling
• Simple random
• Stratified random
• Cluster sampling
• Systematic sampling
• Multi-stage sampling

Non-probability Sampling
• Quota sampling
• Snowball sampling
• Judgment sampling
• Convenience sampling

Generally, under probability sampling a sample is chosen based on the theory of probability,
while in non-probability sampling a sample is chosen based on non-random criteria, and not
every member of the population has a chance of being included in the sample. Probability
sampling reduces sample bias, is appropriate for diverse and vast populations, and helps in
planning with accurately predefined sample sizes.

Difference between probability sampling and non-probability sampling methods

To encapsulate the sampling discussion, the significant differences between probability
sampling methods and non-probability sampling methods are as outlined below:


Bases of comparison: Probability Sampling Methods versus Non-Probability Sampling Methods

Definition
Probability sampling: a sampling technique in which samples from a larger population
are chosen using a method based on the theory of probability.
Non-probability sampling: a sampling technique in which the researcher selects samples
based on the researcher's subjective judgment rather than random selection.

Alternatively known as
Probability sampling: random sampling method.
Non-probability sampling: non-random sampling method.

Population selection
Probability sampling: the sample is selected from the population randomly.
Non-probability sampling: the sample is selected from the population arbitrarily.

Nature
Probability sampling: the research is conclusive.
Non-probability sampling: the research is exploratory.

Sample
Probability sampling: since there is a method for deciding the sample, the population
demographics are conclusively represented.
Non-probability sampling: since the sampling method is arbitrary, the representation of
the population demographics is almost always skewed.

Time taken
Probability sampling: takes longer to conduct since the research design defines the
selection parameters before the research study begins.
Non-probability sampling: quick, since neither the sample nor the selection criteria of
the sample are defined in advance.

Results
Probability sampling: the selection is designed to be unbiased, so the results are
largely free of selection bias and conclusive.
Non-probability sampling: the selection is prone to bias and hence so are the results,
rendering the research subjective and speculative.

Hypothesis
Probability sampling: there is an underlying hypothesis before the study begins and the
objective of this method is to test the hypothesis.
Non-probability sampling: the hypothesis is derived after conducting the research study.

Data- this is a set of values of qualitative or quantitative variables. It is categorized
depending on the scale of measurement applied. Statistical analysis is a tool for converting
data into useful information.

[Figure: Data → Statistical analysis → Information]

Scales of Measurement

In statistics, there are four data measurement scales: nominal, ordinal, interval and ratio.

• Nominal- under the nominal scale the items are differentiated by a simple naming
system. Nominal items are categorical, so nominal scales are used only for labeling
variables, without any quantitative value. Answers to questions such as 'What is
your gender?' or 'Where do you live?' are nominal data. Some texts subdivide nominal
data into categories with an order (e.g. cold, warm, hot) and without an order
(e.g. male or female); ordered categories are more commonly treated as ordinal.
• Ordinal- under the ordinal scale the items are set into some kind of order by their
position on the scale. Ordinal items are usually categorical, e.g. teams can be ranked
as first, third and fifth, etc., regardless of the difference in score between
consecutive positions.
• Interval- interval data (sometimes called integer). Just like the ratio scale, it is
measured along a scale in which each position is equidistant from the next, but its
zero point is arbitrary, so values cannot be compared as multiples of one another,
e.g. altitude (the height above sea level) and Celsius temperature, in which the
difference between any two consecutive values is the same.
• Ratio- under the ratio scale, items are measured along a regular scale in which each
position is equidistant from the next and there is an absolute zero (reference point),
therefore numbers can be compared as multiples of one another, e.g. weight and height
(both have an absolute zero such that no values exist below the zero point).

A variable is a characteristic that can change and assume different values,
such as gender, which can be male or female.

• Categorical variables result from a selection from categories. Nominal and ordinal
variables are categorical.
• Continuous variables are numeric variables that can take any value within a range,
so that between any two values there is always another possible value, e.g. weight.

Data sources:

Primary sources- These are firsthand, freshly collected data for a particular use. The
information can be obtained through surveys, interviews or observation, among others.

Secondary sources- These are second-hand data, mainly obtained from published sources
such as print or online reference or research works, abstracts, indexes, finding
aids, publishers' or distributors' brochures or websites and broadcast program schedules.
They are not original in character and have undergone some statistical treatment at least
once, but may attract a secondary use.

Experiments- An experiment is an orderly procedure carried out to test a hypothesis. It may
give in-depth insight into a cause-effect relationship by showing how the response varies
when the regressors (explanatory variables) are manipulated.

METHODS OF DATA COLLECTION

The method to apply is mostly dictated by the resources and time available, the intended form
of data analysis and, finally, where the data reside: in the environment, in files or with people.

Direct Observation

Observation is the process in which a researcher observes what is occurring in some real
life situation, then classifies and records pertinent happenings according to some planned
criteria. The observation method is most useful when the study relates to behavioral science.
It is subject to many controls and checks. The different types of observation are:

• Structured and unstructured observation
• Controlled and uncontrolled observation
• Participant, non-participant and disguised observation

The tools needed for gathering data using this technique include the eyes and other
senses, a microscope, a pen and paper.

Surveys

A survey solicits data from people; it is most appropriate for data elements that are not
easily quantifiable. It can be administered through a personal (physical) interview, a
telephone interview or a self-administered questionnaire. The tools for administering this
technique include an interview guide, a check list, a tape recorder or a questionnaire. The
following are some of the key questionnaire design principles:

• Keep the questionnaire as short and simple as possible (avoid technical terms).
• Ensure clearly worded questions, free from ambiguities.
• Include both closed and open-ended questions.
• Avoid using leading questions.


The response rate, i.e. the proportion of all people selected who complete the survey,
is a key survey parameter.

Experiments

Experiments can be carried out in the laboratory, in the field or using computer numerical
models for the purpose of collecting data. Currently there are several computer codes that
can be used to construct a model, e.g. Finite Element Method (FEM) codes and
Computational Fluid Dynamics (CFD) codes.

Other Methods of Data Collection:

• Case Study
• Focus Groups
• Online tracking

Frequency Distribution

Statistical data obtained by means of a census, sample surveys or experiments are raw,
unorganized and usually contain some errors. Before these are analyzed and used as a
basis for inferences about the phenomenon under investigation or as a basis for decision
making, they must be cleaned and summarized, and the pertinent information extracted. One
way of presenting data for analysis is the construction of frequency distribution tables.
A frequency table or frequency distribution is constructed by dividing the overall range of
values into a number of classes and then counting the number of observations that fall into
each of these classes or intervals. Insofar as possible, equal class intervals are preferred,
but the first and last classes can be open-ended to cater for extreme values.

Exercise 1

A random sample of 100 Machakos University undergraduate students was selected and the time
(hours) each student spent in a gym during a particular semester was recorded as follows:
Table 2: Machakos University students' hours in a gym for a semester

65 22 84 100 88 87 105 44 85 67
80 109 83 89 91 104 90 103 67 52
110 98 86 39 72 66 92 99 60 75
88 112 97 88 49 62 70 66 88 62
72 85 81 78 77 41 105 92 94 74
78 75 87 83 71 99 56 69 78 60
119 39 104 86 67 79 98 102 82 91
46 120 73 125 132 86 48 55 112 28
42 24 130 100 46 57 31 129 137 59
102 51 135 53 105 110 107 46 108 119

By using the class interval 20-39, 40-59 and so forth construct the frequency distribution,
cumulative frequency distribution, relative frequency distribution and relative cumulative
frequency distribution in one table.
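
The tabulation asked for in Exercise 1 can be sketched in Python as follows. This is only an illustrative sketch, not part of the original notes: the function name `frequency_table` and its parameters are hypothetical, and the data are the 100 values of Table 2 typed row by row.

```python
from itertools import accumulate

# Gym hours for the 100 sampled students (Table 2), typed row by row.
hours = [
    65, 22, 84, 100, 88, 87, 105, 44, 85, 67,
    80, 109, 83, 89, 91, 104, 90, 103, 67, 52,
    110, 98, 86, 39, 72, 66, 92, 99, 60, 75,
    88, 112, 97, 88, 49, 62, 70, 66, 88, 62,
    72, 85, 81, 78, 77, 41, 105, 92, 94, 74,
    78, 75, 87, 83, 71, 99, 56, 69, 78, 60,
    119, 39, 104, 86, 67, 79, 98, 102, 82, 91,
    46, 120, 73, 125, 132, 86, 48, 55, 112, 28,
    42, 24, 130, 100, 46, 57, 31, 129, 137, 59,
    102, 51, 135, 53, 105, 110, 107, 46, 108, 119,
]

def frequency_table(data, first_lower=20, width=20):
    """Return rows (class, f, cumulative f, relative f, relative cumulative f)
    for equal-width classes 20-39, 40-59, ... covering the largest value."""
    n_classes = (max(data) - first_lower) // width + 1
    lowers = [first_lower + k * width for k in range(n_classes)]
    freqs = [sum(lo <= x <= lo + width - 1 for x in data) for lo in lowers]
    total = sum(freqs)
    rows = []
    for lo, f, cum in zip(lowers, freqs, accumulate(freqs)):
        rows.append((f"{lo}-{lo + width - 1}", f, cum, f / total, cum / total))
    return rows

table = frequency_table(hours)
for label, f, cum, rel, rel_cum in table:
    print(f"{label:>8} {f:>4} {cum:>5} {rel:>6.2f} {rel_cum:>6.2f}")
```

The last row's cumulative frequency should equal the sample size and its relative cumulative frequency should equal 1, which is a quick check that no observation was missed.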

Definition of terms

Class boundary is the precise point that separates one class from another, rather than being
a value indicated in one of the classes. A class boundary is typically located midway
between the upper limit of a class and the lower limit of the next higher class adjoining it.
Therefore the class boundary separating the class 60-79 and the class 80-99 is halfway
between 79 and 80, that is, at the point 79.5. This is the upper class boundary of the
60-79 class and the lower class boundary of the 80-99 class.

Class interval: is the width of a class. It is computed by subtracting the lower class
boundary from the upper class boundary, e.g. 79.5 - 59.5 = 20 for the class 60-79.

Class midpoint or class mark: is the point dividing the class into equal halves on the basis
of class interval. This point can be obtained by adding the lower and upper limits
(boundaries) of a class and dividing by 2.

Relative frequency of a class: it is the ratio of the frequency of any class to sum of the
frequencies.

Cumulative frequency distribution: shows the number of items of a series that are less
than (or more than) certain specified values.

Measure of Central Tendency

A value that would describe the 'centre' of a distribution would be visually located near the
spot where most of the data seem to be concentrated. Consequently, values that fulfil this
role are called measures of central tendency.
The most common measures of the central tendency of a data set are the arithmetic mean
(or simply the mean), the median and the mode.

Example 1 calculating mean, median and mode for individual (Ungrouped) data

The following table shows the hourly wage rates of eight sampled construction workers.
Table 3: Workers' hourly wage
Worker (i)                 1    2    3    4    5    6    7    8
Hourly wage rate (x_i)    35   46   46   60   65   69   70   72
Arithmetic mean

\[ \bar{x} = \frac{\sum_{i=1}^{8} x_i}{n} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8}{8} = \frac{463}{8} = 57.875 \]
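Other means
i. Geometric mean: \( G = (x_1 x_2 \cdots x_n)^{1/n} \)

ii. Harmonic mean: \( H = \dfrac{n}{\sum \frac{1}{x_i}} \) for ungrouped data, but if grouped \( H = \dfrac{\sum f_i}{\sum \frac{f_i}{x_i}} \)

Location of the median: \( \dfrac{n + 1}{2} = \dfrac{9}{2} = 4.5 \)th position

\[ \text{Median (wage)} = \frac{x_4 + x_5}{2} = \frac{60 + 65}{2} = 62.5 \]

Mode: the most frequently occurring score is 46.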
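
As a quick check of Example 1, the same measures can be computed with Python's standard `statistics` module. This is a minimal sketch, not part of the original notes; the variable names are illustrative.

```python
from statistics import mean, median, mode, geometric_mean, harmonic_mean

# Hourly wage rates of the eight sampled workers (Table 3).
wages = [35, 46, 46, 60, 65, 69, 70, 72]

wage_mean = mean(wages)      # arithmetic mean: 463 / 8
wage_median = median(wages)  # average of the 4th and 5th ordered values
wage_mode = mode(wages)      # most frequently occurring value

print(wage_mean, wage_median, wage_mode)

# The other means listed above, for comparison with the arithmetic mean.
print(geometric_mean(wages), harmonic_mean(wages))
```

For positive data that are not all equal, the harmonic mean is smaller than the geometric mean, which in turn is smaller than the arithmetic mean.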

Example 2 calculating mean, median and mode for grouped data

The following table shows the daily wages of a random sample of construction workers.
Calculate its mean, median and mode.
Table 4: Workers daily wage
Daily Wages Number of Workers
200 - 399 5
400 - 599 15
600 - 799 25
800 - 999 30
1000 - 1199 18
1200 - 1399 7
Total 100

Solution

Daily Wages    Number of Workers (f_i)   Class Mark (x_i)    f_i x_i     Cum. frequency (F)
200 - 399               5                     299.5           1,497.5            5
400 - 599              15                     499.5           7,492.5           20
600 - 799              25                     699.5          17,487.5           45
800 - 999              30                     899.5          26,985.0           75
1000 - 1199            18                   1,099.5          19,791.0           93
1200 - 1399             7                   1,299.5           9,096.5          100
Total                 100                                    82,350.0

\[ \bar{x} = \frac{\sum_{i=1}^{6} f_i x_i}{\sum_{i=1}^{6} f_i} = \frac{82{,}350.0}{100} = 823.5 \]

\[ M_d = L + \left( \frac{\frac{1}{2}\sum f - F_a}{f_w} \right) c_i \]

Where: \(L\) is the lower real limit (lower class boundary) of the median class
\(f_w\) is the frequency of the median class
\(F_a\) is the cumulative frequency of the classes above (preceding) the median class
\(c_i\) is the class interval of the median class

\[ M_d = 799.5 + \frac{0.5(100) - 45}{30}(200) = 832.8 \]

\[ \text{Mode} = L + \left( \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \right) c_i \]

Where: \(L\) is the lower real limit of the modal class
\(f_1\) is the frequency of the modal class
\(f_0\) is the frequency of the class preceding the modal class
\(f_2\) is the frequency of the class succeeding the modal class and
\(c_i\) is the class interval of the modal class

\[ \text{Mode} = 799.5 + \frac{30 - 25}{2(30) - 25 - 18}(200) = 858.3 \]
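
The grouped-data formulas of Example 2 can be checked with a short Python sketch. It is not part of the original notes: the function names `grouped_median` and `grouped_mode` are hypothetical, and the lower class boundary is taken as the lower limit minus 0.5, as in the worked solution.

```python
# (lower limit, upper limit, frequency) for each class of Table 4.
classes = [(200, 399, 5), (400, 599, 15), (600, 799, 25),
           (800, 999, 30), (1000, 1199, 18), (1200, 1399, 7)]
width = 200  # class interval, e.g. 399.5 - 199.5

n = sum(f for _, _, f in classes)
# Mean: sum of f_i * class mark, divided by the total frequency.
grouped_mean = sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n

def grouped_median(classes, width):
    """Interpolate inside the first class whose cumulative frequency reaches n/2."""
    n = sum(f for _, _, f in classes)
    cum = 0  # cumulative frequency of the classes preceding the current one (F_a)
    for lo, _, f in classes:
        if cum + f >= n / 2:
            return (lo - 0.5) + (n / 2 - cum) / f * width
        cum += f

def grouped_mode(classes, width):
    """Mode = L + (f1 - f0) / (2 f1 - f0 - f2) * ci for the class with highest frequency."""
    freqs = [f for _, _, f in classes]
    k = freqs.index(max(freqs))
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0
    f2 = freqs[k + 1] if k < len(freqs) - 1 else 0
    return (classes[k][0] - 0.5) + (f1 - f0) / (2 * f1 - f0 - f2) * width

print(grouped_mean, round(grouped_median(classes, width), 1),
      round(grouped_mode(classes, width), 1))
```

The three printed values can be compared with the 823.5, 832.8 and 858.3 obtained by hand above.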

Advantages and disadvantages of each measure

Mean


Advantages: (i) All values in the distribution are used in its calculation, so it can
be regarded as more representative than the other two measures.

(ii) Its method of calculation is simple and most people understand
the meaning of its result.

(iii) Its result can easily be used in further analysis.

Disadvantages: (i) Its result can be easily distorted by extreme values. As such, its
result may be rather lower or higher than the bulk of the values
and becomes unrepresentative.

(ii) In case of open end classes, mean can be calculated only if their
class marks are determined. If such classes contain a large
proportion of the values, then the mean may be subjected to
substantial error.

Median
Advantage: Its result will not be affected by extreme values and open end
classes.

Disadvantage: It has to be supplemented by other statistics because, unlike the
mean, it does not take all the values in the distribution into account.

Mode
Advantages: (i) Its result will not be affected by extreme values and open end
classes.

(ii) If data are not grouped, it can be determined easily.

Disadvantages: (i) It has to be supplemented by other statistics.

(ii) It is difficult to obtain an accurate estimate of the mode if the
values are classified into a frequency distribution.

How to select a suitable measure

(i) Always select the mean whenever there is no special reason for choosing the other
two measures.

(ii) Select the median if the distribution contains a substantial number of extremely large
or small values.

(iii) Select the mode if an integral result is preferred, as in cases where the data are on
ordinal scales.


Measure of data variation (Dispersion)

The figure below represents frequency distributions with some of the characteristics we
need to understand. The two curves in (a) represent two distributions with the same mean 𝑋̅
but with different variations. The two curves in (b) represent two distributions with the same
variation but with unequal means, 𝑋̅1 and 𝑋̅2. Finally, (c) represents two distributions with
unequal means and unequal variations.

The measures of central tendency are, therefore, insufficient. They must be supported and
supplemented with other measures. A measure of variation is designed to state the extent
to which the individual measures differ on an average from the mean. Hence for an
adequate summary and characteristics description of a set of data we need to determine the
data variation.
The most common measures of variability or dispersion are the range, mean deviation,
interquartile range, deciles, percentiles, variance and standard deviation.

Example 1

Consider the following measurements, in grams, for two samples of strawberry jam bottled
by companies A and B:
Table 5: Strawberry jam contents (grams)
Sample for Company A    31   32   32   33   32
Sample for Company B    28   29   32   35   36

Both samples have the same mean, 32 grams. It is obvious that company A, in comparison
with company B, bottles strawberry jam with a more consistent content. We say that the
variability of the observations is smaller for company A. Therefore in buying strawberry
jam we would feel more confident that the bottle we select will be closer to the advertised
average content if we buy from company A.

The range of a set of numbers is the difference between the largest (L) and the smallest (S)
number in the set. Therefore we have range = L - S and the coefficient of range \( = \dfrac{L - S}{L + S} \).
Though range is simple and can be obtained easily, its result is unstable. This is particularly
true if the sample size is large. So whenever the sample size is over 10, we seldom choose
to use range to indicate variability of the data.

Mean absolute deviation is the average of the absolute deviations of the numerical data
from their mean.
Table 6: Mean Absolute Deviation
Worker (i)                          1        2        3        4       5        6        7        8
Hourly wage rate (x_i)             35       38       46       60      65       69       72       78
|x_i - x̄| = |x_i - 57.875|     22.875   19.875   11.875    2.125   7.125   11.125   14.125   20.125

\[ \text{Mean absolute deviation} = \frac{\sum_{i=1}^{8} |x_i - 57.875|}{8} = \frac{109.25}{8} = 13.656 \]

The mean deviation is a good measure to show the extent of variation of the data in a
distribution. However, when this measurement is used in further analysis, it would give
rise to some unnecessary tedious mathematical problem as a result of its absolute value
term. To avoid this pitfall, we can use the standard deviation instead.
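
The mean absolute deviation of the Table 6 data can be verified with a few lines of Python. This is a minimal sketch added for illustration; the variable names are not from the notes.

```python
# Hourly wage rates from Table 6.
wages = [35, 38, 46, 60, 65, 69, 72, 78]

mean_wage = sum(wages) / len(wages)
# Absolute deviation of each observation from the mean.
abs_deviations = [abs(x - mean_wage) for x in wages]
mean_abs_deviation = sum(abs_deviations) / len(wages)

print(mean_wage, sum(abs_deviations), mean_abs_deviation)
```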

Variance is the average of the squared deviations from the arithmetic mean. For a sample,

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

Standard deviation is the positive square root of the variance.
Using the values in Table 4, determine the variance and standard deviation.

Solution

Daily Wages    Number of Workers (f_i)   Class Mark (x_i)   f_i (x_i - x̄)^2
200 - 399               5                     299.5            1,372,880
400 - 599              15                     499.5            1,574,640
600 - 799              25                     699.5              384,400
800 - 999              30                     899.5              173,280
1000 - 1199            18                   1,099.5            1,371,168
1200 - 1399             7                   1,299.5            1,586,032
Total                 100                                      6,462,400

\[ \text{Variance } (s^2) = \frac{6{,}462{,}400}{99} = 65{,}276.77 \]

\[ \text{Standard deviation} = \sqrt{65{,}276.77} = 255.49 \]

A computational form, which divides by \(\sum f\) rather than by \(n - 1\), is

\[ s = \sqrt{\frac{\sum x^2 f}{\sum f} - \left( \frac{\sum x f}{\sum f} \right)^2} \]

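Combined Average and Standard deviation

\[ \bar{x}_{12} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} \quad \text{and} \quad s_{12} = \sqrt{\frac{n_1 s_1^2 + n_2 s_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2}} \quad \text{where } d_i = \bar{x}_i - \bar{x}_{12} \]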
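
The variance and standard deviation of the Table 4 daily wages can be checked numerically. The sketch below is illustrative only; it uses the class marks as representative values and compares the n - 1 divisor with the computational form that divides by the total frequency.

```python
from math import sqrt

# (class mark, frequency) pairs for the daily wages in Table 4.
classes = [(299.5, 5), (499.5, 15), (699.5, 25),
           (899.5, 30), (1099.5, 18), (1299.5, 7)]

n = sum(f for _, f in classes)
mean = sum(f * x for x, f in classes) / n

# Sum of f_i * (x_i - mean)^2, the total of the last column of the solution table.
sum_sq_dev = sum(f * (x - mean) ** 2 for x, f in classes)

sample_variance = sum_sq_dev / (n - 1)   # divisor n - 1
sample_sd = sqrt(sample_variance)

# Computational form: sqrt(sum(x^2 f)/sum(f) - (sum(x f)/sum(f))^2), divisor n.
sd_divisor_n = sqrt(sum(f * x * x for x, f in classes) / n - mean ** 2)

print(sum_sq_dev, round(sample_variance, 2), round(sample_sd, 2), round(sd_divisor_n, 2))
```

With n = 100 the two divisors give nearly the same standard deviation; the difference matters more for small samples.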

Comparison of the variation of two distributions

The values of the standard deviations cannot be used as the bases of the comparison
because:
(a) units of measurements of the two distributions may be different, and
(b) average values of two distributions may be widely dissimilar.
The correct measure that should be used is the coefficient of variation (CV), which does
not bear any unit of measurement, given as

\[ CV = \frac{s}{\bar{x}} \times 100\% \]
Example 4
The following table shows the summary statistics for the daily wages of two types of
companies.

Company    n     Daily Wages
                 Mean    Standard deviation
I          60    100     20
II         90    150     24

i. Compare these two daily wages distributions and state the company with a higher
distribution variability.
ii. Compute the combined average wage and standard deviation.

Solution
In comparison        Distribution    Reason
Average magnitude    II > I          \( \bar{x}_{II} = 150 > \bar{x}_{I} = 100 \)
Variation            I > II          \( CV_{I} = \frac{20}{100} \times 100\% = 20\% > CV_{II} = \frac{24}{150} \times 100\% = 16\% \)
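
Both parts of Example 4 can be sketched in Python using the summary statistics in the table. The combined standard deviation below follows the formula given under "Combined Average and Standard deviation"; the variable names are illustrative and the numerical results for part ii are computed here rather than quoted from the notes.

```python
from math import sqrt

# Summary statistics for the two companies: size, mean wage, standard deviation.
n1, mean1, sd1 = 60, 100, 20
n2, mean2, sd2 = 90, 150, 24

# Part i: coefficient of variation, in percent.
cv1 = sd1 / mean1 * 100
cv2 = sd2 / mean2 * 100

# Part ii: combined mean, then deviations of each group mean from it.
combined_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)
d1 = mean1 - combined_mean
d2 = mean2 - combined_mean
combined_variance = (n1 * sd1**2 + n2 * sd2**2 + n1 * d1**2 + n2 * d2**2) / (n1 + n2)
combined_sd = sqrt(combined_variance)

print(round(cv1, 2), round(cv2, 2), combined_mean, round(combined_sd, 2))
```

Company I has the larger coefficient of variation even though its standard deviation is smaller, which is why the CV rather than the raw standard deviation is used for the comparison.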

Quartiles- Quartiles divide the data set into 4 equal parts

Lower quartile (Q1) and upper quartile (Q3) are computed as

\[ Q_1 = L + \left( \frac{\frac{1}{4}\sum f - F_a}{f_w} \right) c_i \quad \text{and} \quad Q_3 = L + \left( \frac{\frac{3}{4}\sum f - F_a}{f_w} \right) c_i \quad \text{respectively.} \]

Where: \(L\) is the lower real limit of the class containing the lower/upper quartile score
\(f_w\) is the frequency of the lower/upper quartile class
\(F_a\) is the cumulative frequency of the classes above (preceding) the lower/upper quartile class
\(c_i\) is the class interval of the lower/upper quartile class

Deciles- Divides the data set into 10 equal parts

Percentile- divides the data set into 100 equal parts

The median formula is adjusted to determine deciles and percentiles.
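
One way to see how the median formula generalizes is a single function that takes the required fraction p of the total frequency (1/4 and 3/4 for quartiles, k/10 for deciles, k/100 for percentiles). The sketch below is an illustration under that assumption, applied to the Table 4 daily wages; the function name `grouped_quantile` is hypothetical.

```python
# (lower class limit, frequency) for the daily wages in Table 4.
classes = [(200, 5), (400, 15), (600, 25), (800, 30), (1000, 18), (1200, 7)]
width = 200  # class interval

def grouped_quantile(classes, p, width):
    """Value below which a fraction p (0 < p < 1) of the grouped data lies:
    L + (p * sum(f) - F_a) / f_w * ci, with L the lower class boundary."""
    n = sum(f for _, f in classes)
    target = p * n
    cum = 0  # cumulative frequency of the preceding classes (F_a)
    for lower_limit, f in classes:
        if cum + f >= target:
            return (lower_limit - 0.5) + (target - cum) / f * width
        cum += f
    raise ValueError("p must lie between 0 and 1")

q1 = grouped_quantile(classes, 0.25, width)
q2 = grouped_quantile(classes, 0.50, width)  # the median
q3 = grouped_quantile(classes, 0.75, width)
print(q1, round(q2, 1), q3)
```

Calling the same function with p = 0.1 or p = 0.9 gives the first decile and the 90th percentile.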

Significance of Measuring Variation


Measures of variation are needed for four basic purposes:
i. To determine the reliability of an average,
ii. To serve as a basis for the control of the variability,
iii. To compare two or more series with regard to their variability; and
iv. To facilitate the use of other statistical measures.

Thus by measuring variation we are able to determine the nature and cause of variation in
order to control the variation itself. In matters of health, variation in body temperature,
pulse beat and blood pressure are the basic guides to diagnosis. Prescribed treatment is
designed to control their variation. In industrial production, efficient operation requires
control of quality variation, the causes of which are sought through inspection and quality
control programmes. Thus, measurement of variation is basic to the control of cause of
variation. In engineering problems, measures of variation are often especially important. In
social sciences, a special problem requiring the measurement of variability is the
measurement of “inequality” of the distribution of income and wealth, etc.
Again measures of variations enable comparison to be made of two or more series with
regard to their variability. The study of variation may also be looked upon as a means of
determining uniformity or consistency. A high degree of variation would mean little
uniformity or consistency whereas a low degree of variation would mean greater
uniformity or consistency.
Lastly many powerful analytical tools in statistics such as correlation analysis, the test of
hypothesis, the analysis of fluctuations, techniques of production control, cost control,
among others are based on measures of variations.
Properties of a Good Measure of Variation
A good measure of variation should possess, as far as possible, the following properties:
i. It should be simple to understand and easy to compute
ii. It should be rigidly defined
iii. It should be based on each and every observation of the distribution
iv. It should be amenable to further algebraic treatment
v. It should have sampling stability and
vi. It should not be unduly affected by extreme observations


Moments, Skewness and Kurtosis

a) Moments

In statistics, moments are quantitative measures of the shape of a set of points
representing mass. Denoted by the Greek letter \(\mu\) (mu), moments give a summary
description of a distribution's characteristics. The rth moment (raw moment) is denoted by
\(\mu_r' = E(x^r)\), such that if we have a set of discrete data \(S = \{5, 7, 9\}\) then the
first raw moment is

\[ S_1' = \mu_1' = \frac{5^1 + 7^1 + 9^1}{3} = 7, \]

the second raw moment is

\[ S_2' = \mu_2' = \frac{5^2 + 7^2 + 9^2}{3} = 51.67 \]

and the rth raw moment is

\[ S_r' = \mu_r' = \frac{5^r + 7^r + 9^r}{3}. \]

Determine \(\mu_3'\) and \(\mu_4'\).
Moments about the mean (central moments) are obtained as

\[ \mu_1 = \frac{\sum f(x - \bar{x})}{\sum f} \ \text{(1st moment about the mean)}, \qquad \mu_2 = \frac{\sum f(x - \bar{x})^2}{\sum f} \ \text{(2nd moment about the mean)} \]

and, in general, the rth moment of a variable \(x\) about the mean \(\bar{x}\) is

\[ \mu_r = \frac{\sum f(x - \bar{x})^r}{\sum f}. \]

Using the above set of discrete data \(S = \{5, 7, 9\}\), the first moment about the mean is

\[ \mu_1 = \frac{1(5 - 7)^1 + 1(7 - 7)^1 + 1(9 - 7)^1}{3} = 0. \]

The second moment about the mean is

\[ \mu_2 = \frac{1(5 - 7)^2 + 1(7 - 7)^2 + 1(9 - 7)^2}{3} = 2.67, \]

which is equal to the variance of the data (computed with divisor n).

The first raw moment gives the mean (the first moment about the mean is always zero), the
second moment about the mean gives the variance, the third indicates the skewness, i.e. if
\(\mu_3 \neq 0\) then the data is skewed, and the fourth moment about the mean tells us about
the kurtosis.
NB: The following relationships hold true
a) The moments about the mean (central moments) and the raw moments;

\[ \mu_1 = 0, \]
\[ \mu_2 = \mu_2' - (\mu_1')^2, \]
\[ \mu_3 = \mu_3' - 3\mu_1'\mu_2' + 2(\mu_1')^3 \quad \text{and} \quad \mu_4 = \mu_4' - 4\mu_1'\mu_3' + 6\mu_2'(\mu_1')^2 - 3(\mu_1')^4 \]

Using \(S = \{5, 7, 9\}\) determine \(\mu_3\) and \(\mu_4\).
b) The betas and the central moments;

\[ \beta_1 = \frac{(\mu_3)^2}{(\mu_2)^3} \quad \text{and} \quad \beta_2 = \frac{\mu_4}{(\mu_2)^2} \]

The first beta (\(\beta_1\)) is used to measure the data skewness, while the second beta
(\(\beta_2\)) measures the kurtosis of the plotted data curve, as discussed below.
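
The raw and central moments of the small data set S = {5, 7, 9} can be explored with a short Python sketch. The helper names `raw_moment` and `central_moment` are hypothetical, and each observation is given frequency 1, as in the worked example.

```python
S = [5, 7, 9]

def raw_moment(data, r):
    """r-th raw moment: the average of x**r."""
    return sum(x ** r for x in data) / len(data)

def central_moment(data, r):
    """r-th moment about the mean: the average of (x - mean)**r."""
    m = raw_moment(data, 1)
    return sum((x - m) ** r for x in data) / len(data)

mu2 = central_moment(S, 2)
mu3 = central_moment(S, 3)
mu4 = central_moment(S, 4)

beta1 = mu3 ** 2 / mu2 ** 3   # skewness measure
beta2 = mu4 / mu2 ** 2        # kurtosis measure

# Check one of the stated relationships: mu_2 = mu'_2 - (mu'_1)^2.
identity_gap = mu2 - (raw_moment(S, 2) - raw_moment(S, 1) ** 2)

print([raw_moment(S, r) for r in (1, 2, 3, 4)])
print(mu2, mu3, mu4, beta1, beta2, identity_gap)
```

Because S is symmetric about its mean, the third central moment and hence the first beta come out as zero.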
b) Skewness
Asymmetrical data is said to have a skewed distribution. The distribution is skewed either to
the right or to the left; otherwise it is a symmetrical distribution (for example, the normal
distribution).
Symmetric refers to equal amounts of data on either side of the ‘middle’ of the data, i.e.
the distribution of the data on one side is the mirror image of the distribution on the other
side. Skewness occurs when one ‘side’ of the data spreads out to take on larger values than
the other side. If the mean is much bigger than the median, then there must be large values
on the right-hand side of the distribution, compared to the left hand side (right skewed).

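Positively skewed distribution is tailed to the right such that \( m_o < m_d < \bar{x} \)

[Figure: positively skewed curve with \(m_o\), \(m_d\) and \(\bar{x}\) marked from left to right]

Negatively skewed distribution is tailed to the left and \( \bar{x} < m_d < m_o \)

[Figure: negatively skewed curve with \(\bar{x}\), \(m_d\) and \(m_o\) marked from left to right]

Normally distributed data has the three measures of central tendency equal such that \( \bar{x} = m_o = m_d \)

[Figure: symmetric curve with \(\bar{x} = M_o = M_d\) at the centre]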

Measures of Skewness

i. Karl Pearson's coefficient of skewness (Skp)

\[ Sk_p = \frac{\text{Mean} - \text{Mode}}{\sigma} \quad \text{or} \quad Sk_p = \frac{3(\text{Mean} - \text{Median})}{\sigma} \]

where \(\sigma\) is the standard deviation. The second one is recommended since the
median is a better estimator than the mode.
ii. Bowley's coefficient of skewness. It is based on quartiles

\[ S_{KB} = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2M_d}{Q_3 - Q_1} \]

iii. Kelly's coefficient of skewness. It is based on percentiles and deciles such that

\[ Sk_k = \frac{P_{90} - 2P_{50} + P_{10}}{P_{90} - P_{10}} \ \text{for percentiles and} \ Sk_k = \frac{D_9 - 2D_5 + D_1}{D_9 - D_1} \ \text{for deciles} \]

Bowley's and Kelly's coefficients vary between ±1 (Pearson's median-based coefficient
varies between ±3) and each is interpreted directly; however, the answers may differ
slightly since each formula is based on different assumptions.
iv. Measure of skewness based on moments as given by Karl Pearson

\[ Sk_p = \frac{\sqrt{\beta_1}(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)} \]

This fourth measure of skewness is based on the third moment, such that if \(\mu_3 \neq 0\) then
the distribution is said to be skewed. In measuring variation, we were interested in the
amount of the variation or its degree, while the skewness gives the direction.
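
As an illustration, Pearson's median-based coefficient and Bowley's coefficient can be evaluated for the Table 4 daily wages. The mean (823.5) and standard deviation (255.49) are the values obtained earlier in these notes; the median and the quartiles are obtained here by applying the grouped-data formulas to Table 4, and the function names are hypothetical.

```python
def pearson_skewness(mean, median, sd):
    """Karl Pearson's second coefficient: 3 (mean - median) / sd."""
    return 3 * (mean - median) / sd

def bowley_skewness(q1, median, q3):
    """Bowley's coefficient: (Q3 + Q1 - 2 Md) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

mean, sd = 823.5, 255.49

# Grouped-data median and quartiles for Table 4 (total frequency 100, class interval 200).
median = 799.5 + (50 - 45) / 30 * 200
q1 = 599.5 + (25 - 20) / 25 * 200
q3 = 799.5 + (75 - 45) / 30 * 200

sk_pearson = pearson_skewness(mean, median, sd)
sk_bowley = bowley_skewness(q1, median, q3)
print(round(sk_pearson, 3), round(sk_bowley, 3))
```

Both coefficients come out slightly negative, suggesting a mild skew to the left, although their magnitudes differ because they rest on different summaries of the data.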


c) Kurtosis

This is a non-dimensional measure of the relative peakedness or flatness of a data
distribution, i.e. relative to a normal distribution.

[Figure: leptokurtic (+ kurtosis), mesokurtic (normal) and platykurtic (- kurtosis) curves]

According to Karl Pearson, \(\beta_2\) is used to determine the degree of peakedness of a curve
relative to the normal curve, where \( \beta_2 = \dfrac{\mu_4}{(\mu_2)^2} \), such that when
\(\beta_2 = 3\) the curve is mesokurtic (normal), when \(\beta_2 < 3\) the curve is platykurtic
and when \(\beta_2 > 3\) the curve is leptokurtic.


STATISTICAL DATA REPRESENTATIONS

Data can be represented in the form of bar charts, pie charts, boxplots (box and whisker
plots), histograms, stem and leaf plots and scatter diagrams, among others.

Examples
Stem and leaf
The following record represents the long jump results (in meters) of inter-house
competitions in a certain school within Machakos County:

2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0

And here is the stem-and-leaf plot:


Stem | Leaf
2    | 3 5 5 7 8
3    | 2 6 6
4    | 5
5    | 0

Stem "2" Leaf "3" means 2.3
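
The construction of the stem-and-leaf plot above can be sketched in Python. This is an illustrative sketch only: the function name `stem_and_leaf` is hypothetical, and it assumes values with one decimal place, so that the whole-number part is the stem and the tenths digit is the leaf.

```python
from collections import defaultdict

# Long jump results in metres.
jumps = [2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0]

def stem_and_leaf(values):
    """Group one-decimal values by whole-number stem; leaves are the tenths digits."""
    plot = defaultdict(list)
    for v in sorted(values):
        # round() guards against floating-point noise in v * 10.
        stem, leaf = divmod(round(v * 10), 10)
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(jumps)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```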

Note:

• Say what the stem and leaf mean (Stem "2" Leaf "3" means 2.3)
• In this case each leaf is a decimal
• It is OK to repeat a leaf value
• 5.0 has a leaf of "0"

Box-and-Whisker Plots:

In this exploratory technique the data points are assumed to be clustered around some
central value, represented by the "box". To create a box-and-whisker plot, the data are first
ordered numerically. The box and whiskers divide the entire data set into quarters, bounded
by the "quartiles".


Understanding and interpreting box plots

Box plots enable us to study the distributional characteristics of a group of scores as well
as the level of the scores.

The median (middle quartile) marks the mid-point of the data and is shown by the line that
divides the box into two parts.

Upper quartile-Seventy-five percent of the scores fall below the upper quartile.

Lower quartile-Twenty-five percent of scores fall below the lower quartile

Inter-quartile range-The middle “box” represents the middle 50% of scores for the
group. The range of scores from lower to upper quartile is referred to as the inter-quartile
range.

Whiskers-The upper and lower whiskers represent scores outside the middle 50%.
Whiskers often (but not always) stretch over a wider range of scores than the middle
quartile groups. Any data points outside the range of the whiskers are plotted
individually. These points are often called "outliers", based on the 1.5 × IQR rule of
thumb: a point is flagged if it lies more than 1.5 × IQR below the lower quartile or above
the upper quartile. The term outlier is usually used for unusual or extreme points.
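
The quantities behind a box plot can be sketched as follows. Several conventions exist for computing quartiles; this sketch assumes the "median of each half" convention, which may differ slightly from the method used elsewhere in these notes. The function names and the extra value 7.9 are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]               # values below the median position
    upper_half = data[half + (n % 2):]     # skip the middle value when n is odd
    return data[0], median(lower_half), median(data), median(upper_half), data[-1]


def outliers_by_iqr_rule(values):
    """Flag points more than 1.5 x IQR outside the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Long jump results from the stem-and-leaf example
jumps = [2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0]
summary = five_number_summary(jumps)
print(summary)
print(outliers_by_iqr_rule(jumps))           # no point lies beyond the fences

# Hypothetical extra jump of 7.9 m, added only to show a flagged point
print(outliers_by_iqr_rule(jumps + [7.9]))
```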

Revision Exercise

1. Differentiate between descriptive statistics and inferential statistics


2. Highlight four levels of variable measurement scales in statistics
3. The table below shows the frequency distribution of sales made by 100 shops
Sales (Ksh '000)    Number of shops
100-119             2
120-139             a
140-159             20
160-179             19
180-199             b
200-219             21
220-239             1
Given that the mean is Ksh 177,100, determine
i. The values of “a” and “b”
ii. The Median
iii. The Standard deviation
iv. Karl Pearson’s coefficient of skewness (Skp)

4. The following information relates to the monthly budget of Milani supermarket:
salaries take Ksh 42 million, which is 35% of its total monthly budget, marketing 20%,
legal fees 5%, taxes and rates 25% and reserves 15%. Determine the amount in
Ksh for each sector and present the information in a pie chart.
5. The number of days the college nurse was called for emergencies per month for
the last 10 months were; 2,3,4,0,5,6,7,4,3,2. Determine the
i. Mean
ii. Mean Absolute Deviation.
iii. Variance
iv. Standard deviation
6. Given below are the weekly average air time allowances for the employees of
Company A and B

Company    n      Weekly average allowance    Standard deviation
A          150    2,500                       400
B          100    2,000                       200


Determine
i) The Company with the higher dispersion in awarding the air time
allowance
ii) The combined standard deviation
7. Differentiate the following terms as they apply in scientific research
i. Sample and a population
ii. Skewness and Kurtosis of a data distribution
iii. Sample statistic and Population parameter
iv. Sampling error and Non-Sampling error
8. The table below shows the wages of 80 employees of XYZ Company

Wages (Ksh '000)    Number of employees
10-15               5
15-20               x
20-25               17
25-30               20
30-35               y
35-40               16
40-45               4
Given that the median wage is Ksh 27,000, determine
i. The values of x and y
ii. The mean
iii. The inter-quartile wage
iv. Karl Pearson’s coefficient of Skewness (Skp)

9. Highlight five properties of a good measure of data variation

10. Define the term variable as used in statistics, giving two examples.
11. Explain in words each of the following terms as used in Statistics:
(i) Mean;
(ii) Median.
(iii) Mode
12. Estimate the sample median and quartiles using the box plot given below


13. The data given below represents the age in years of employees of an organisation.
28, 30, 33, 37, 37, 38, 42, 43, 43, 44, 45, 48, 48, 51, 55
Use the data to construct a box-and-whisker plot.

14. The data given below represents the frequency distribution of marks scored in
Mathematics by a random sample of 1000 students who sat for KCSE
examination in the year 2022.

Marks scored    No. of students
00 - 09         15
10 - 19         34
20 - 29         143
30 - 39         169
40 - 49         189
50 - 59         170
60 - 69         125
70 - 79         105
80 - 89         37
90 - 99         13

(a) (i) If the top 65% of the students are supposed to pass this examination, determine
the mark which should be set as the pass-mark to achieve this. (2 marks)
(ii) Grades for results are awarded as follows:
• Fail to the bottom 20%,
• Pass to the next 35%,
• Credit to the next 30%,
• Distinction to the top 15%.

Determine the lower and upper limits of the marks for each grade: Fail, Pass,
Credit and Distinction. (8 marks)

(b) (i) Suppose the pass-mark is set at 45 marks, determine the proportion of the
students who will pass. (2 marks)
(ii) Determine the proportion of the students who will score the grades Fail, Pass,
Credit and Distinction, if the grades are awarded as follows:
• Fail for marks below 40,
• Pass for marks between 40 and 62,
• Credit for marks between 62 and 74,
• Distinction for marks above 74. (8 marks)

RELATIONSHIPS

A distribution in which there is only one variable is referred to as a univariate distribution,
e.g. the ages of the students in a class. A distribution involving two discrete variables is
called a bivariate frequency distribution.

Correlation & Regression Analysis


This involves two quantities such that if a variation in one variable influences the
movement of the other variable, then these quantities are said to be correlated. For
example, there exists some relationship between family income and expenditure on luxury
items, price of a commodity and amount demanded, increase in rainfall up to a point and
agricultural produce, and pressure and volume of a gas. The measure of correlation, called the
coefficient of correlation and denoted by the symbol (𝑟), summarizes in one figure the
direction and degree of correlation. Thus, correlation can be defined as the analysis of the
covariation of two or more variables.
Correlation analysis involves:
1) Determining whether any relationship exists. The coefficient satisfies $-1 \le r \le +1$,
whereby $r = +1$ means perfect positive correlation, $r = -1$ means perfect negative
correlation, and $r = 0$ may mean there is no linear correlation or no correlation at all.
2) Testing its significance to ensure the correlation is not by mere chance due to
pure random sampling or the investigator's bias in selecting the sample; and finally
3) Establishing the cause-and-effect relations. (Caution: correlation does not imply
causation or a functional relationship.) We should reach a conclusion based on
logical reasoning and intelligent investigation of significantly related matters to
avoid nonsense correlation or spurious conclusions.


Types of Correlation
Correlation can be classified in several different ways. Three of the most important
are:
i. Linear or non-linear;
ii. Simple, partial or multiple; and
iii. Positive or negative.

A linear correlation can be either positive or negative. Correlation is positive
(direct) if both variables vary in the same direction; otherwise it is negative
(inverse or indirect). The distinction between simple, partial and multiple
correlation is based on the number of variables studied. When only two variables are
studied it is a simple correlation. When three or more variables are studied it is either
multiple or partial correlation, whereby in multiple correlation the variables are studied
simultaneously. For example, studying the relationship between the yield of rice per acre
and both the amount of rainfall and the amount of fertilizer used is a problem of multiple
correlation. In partial correlation, only two variables are studied at a time, holding the rest
constant. Lastly, the distinction between linear and non-linear correlation is based on the
constancy of the ratio of change between the variables. If the quantity of change in one
variable tends to be proportional to the quantity of change in the other variable, then the
correlation is said to be linear; otherwise it is non-linear or curvilinear.

Methods of Studying Correlation

The following are the important methods of ascertaining whether two variables are
correlated or not:
I. Scatter Diagram Method;
II. Karl Pearson's Coefficient of Correlation; and
III. Spearman's Rank Correlation Coefficient.

I. Scatter Diagram Method

This is a dot chart, also called a dotogram, with one dot for each pair of X and Y values.
It uses dots to represent values for two different numeric variables. The position of each
dot on the horizontal and vertical axes indicates the values for an individual data point. By
looking at the scatter of the various points, it is possible to form an idea as to whether the
variables are related.

By observing the following pairs of variables X and Y, make a scatter diagram for each
data set and state whether they have any correlation.

i.
X: 10 12 11 18 21
Y: 15 20 22 25 27

ii.
X: 10 20 30 40 50
Y: 70 140 210 280 350


II. Karl Pearson's Coefficient of Correlation

The product-moment coefficient of correlation, popularly known as the Pearsonian coefficient
of correlation, is the most widely used in practice. It gives both the degree and direction of
the relationship between two variables. If the two variables under study are X and Y, then

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2}\,\sqrt{\sum (Y - \bar{Y})^2}} \quad \ldots (i)$$

The above formula can be written as:

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} \quad \ldots (ii)$$

where $x = (X - \bar{X})$ and $y = (Y - \bar{Y})$.

This formula is to be used only where the deviations are taken from actual means and not
from assumed means.
The coefficient of correlation can also be calculated from the original set of observations
(i.e., without taking deviations from the mean) by applying the following formula:

$$r = \frac{\sum XY - \frac{\sum X \sum Y}{N}}{\sqrt{\sum X^2 - \frac{(\sum X)^2}{N}}\,\sqrt{\sum Y^2 - \frac{(\sum Y)^2}{N}}} = \frac{N\sum XY - \sum X \sum Y}{\sqrt{N\sum X^2 - (\sum X)^2}\,\sqrt{N\sum Y^2 - (\sum Y)^2}} \quad \ldots (iii)$$

The quantity $\sum (x_i - \bar{x})(y_i - \bar{y})/n$ is the covariance of the two variables x and y.
It measures their joint variation. When x and y are not related its value is close to zero.
The position $(\bar{x}, \bar{y})$ is known as the centroid of all the points.

Illustration 2
Find correlation coefficient between the sales and expenses from the data given below:
Firm:                   A   B   C   D   E   F   G   H   I   J
Sales (Ksh '000):       50  50  55  60  65  65  65  60  60  50
Expenses (Ksh '000):    11  13  14  16  16  15  15  14  13  13


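Solution
Calculating the correlation coefficient:

$$\bar{X} = \frac{\sum X}{N} = \frac{580}{10} = 58; \qquad \bar{Y} = \frac{\sum Y}{N} = \frac{140}{10} = 14$$

Taking deviations $x = X - \bar{X}$ and $y = Y - \bar{Y}$ gives $\sum xy = 70$, $\sum x^2 = 360$ and
$\sum y^2 = 22$, so that

$$r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{70}{\sqrt{360 \times 22}} = \frac{70}{88.994} = 0.787$$

There is a strong positive correlation between X and Y.
The covariance between x and y is 70/10 = 7.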
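
The arithmetic of this illustration can be checked with a short script. It is only a sketch of the deviation-from-the-mean formula applied to the sales and expenses data; the function name `pearson_r` is hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return (r, covariance) using deviations taken from the actual means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_x2 * sum_y2), sum_xy / n


# Sales and expenses (Ksh '000) for firms A to J
sales = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]
expenses = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]

r, covariance = pearson_r(sales, expenses)
print(round(r, 3), covariance)  # 0.787 7.0
```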
Exercise
i. The following data relate to the age of 10 employees from company ABC Ltd
and the number of days which they reported sick in a month:

Age: 20 30 32 35 40 46 52 55 58 62
Sick days: 11 12 10 13 14 16 15 17 18 19
By letting the age and sick days be represented by the variables X and Y respectively, calculate
Karl Pearson's coefficient of correlation and interpret its value.
ii. Find the coefficient of correlation by Karl Pearson’s method between X and Y
and interpret its value.

X 57 42 40 33 42 45 42 44 40 56 44 43
Y 10 60 30 41 29 27 27 19 18 19 31 29

Coefficient of Determination*
The coefficient of determination is equal to $r^2$. It expresses the proportion of the
variance in Y due to X, that is, the ratio of the explained variance to the total variance. For
example, if r = 0.9, $r^2$ will be 0.81, and this would mean that 81 per cent of the variation
in the dependent variable has been explained by the independent variable. The maximum value
of $r^2$ is unity, because it is possible to explain all of the variation in Y, but it is not
possible to explain more than all of it.
III. Rank Correlation Coefficient (R)

This measure is especially useful where quantitative measures of certain factors (such as in
the evaluation of leadership ability or the judgment of female beauty) cannot be fixed, but
the individuals in the group can be arranged in order, thereby obtaining for each
individual a number indicating his (her) rank in the group. The rank correlation
coefficient is therefore applied to a set of N paired ordinal (ranked) numbers. It is defined as

$$R = 1 - \frac{6\sum D^2}{N(N^2 - 1)} = 1 - \frac{6\sum D^2}{N^3 - N}$$

where R denotes the rank coefficient of correlation and D refers to the difference of ranks
between paired items in the two series. It is named Spearman's rank correlation
coefficient in honour of the British psychologist Charles Edward Spearman, who
developed it in 1904. Again, note that $-1 \le R \le +1$.

Illustration
Two managers were asked to rank a group of employees in order of their potential for
eventually being top managers. Given their rankings below, compute R and comment.

Employee    Manager I (R1)    Manager II (R2)    (R1 - R2)^2 = D^2
A           10                9                  1
B           2                 4                  4
C           1                 2                  1
D           4                 3                  1
E           3                 1                  4
F           6                 5                  1
G           5                 6                  1
H           8                 8                  0
I           7                 7                  0
J           9                 10                 1
N = 10                                           ∑D^2 = 14

Solution

$$R = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 14}{990} = 1 - 0.085 = 0.915$$


Thus, we find that there is a high degree of positive correlation in the ranks assigned
by the two managers.
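
The calculation for the two managers can be reproduced with a minimal sketch of the rank formula for untied ranks. The function name `spearman_from_ranks` is hypothetical.

```python
def spearman_from_ranks(ranks_1, ranks_2):
    """R = 1 - 6*sum(D^2) / (N^3 - N), valid when there are no tied ranks."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Rankings of employees A to J by the two managers
manager_1 = [10, 2, 1, 4, 3, 6, 5, 8, 7, 9]
manager_2 = [9, 4, 2, 3, 1, 5, 6, 8, 7, 10]

sum_d2 = sum((a - b) ** 2 for a, b in zip(manager_1, manager_2))
R = spearman_from_ranks(manager_1, manager_2)
print(sum_d2, round(R, 3))  # 14 0.915
```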
Exercise
Calculate the rank correlation coefficient for the following data of marks of two tests
given to candidates for a clerical job.
Preliminary test 92 89 87 86 83 77 71 63 53 50
Final test 86 83 91 77 68 85 52 82 37 57

Equal Ranks or Tie in Ranks.


In some cases, it may be found necessary to assign equal rank to two or more
individuals or entities. In such a case, it is customary to give each individual or entity
an average rank. Thus if two individuals are ranked equal at fifth place, they are each
given the rank $\frac{5+6}{2}$, that is 5.5, while if three are ranked equal at fifth place, they are
each given the rank $\frac{5+6+7}{3} = 6$. However, to determine R an adjustment is made by
adding $\frac{1}{12}(m^3 - m)$ to the value of $\sum D^2$, where m stands for the
number of items whose ranks are common. If there is more than one such group of
items with common rank, this value is added as many times as the number of such
groups. The formula can thus be written as:

$$R = 1 - \frac{6\left(\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \cdots\right)}{N^3 - N}$$
Illustration
An examination of eight applicants for a clerical post was taken by a firm. From the
marks obtained by the applicants in the Accountancy and Statistics papers, compute
rank coefficient of correlation.
Applicant A B C D E F G H
Marks in Accountancy 15 20 28 12 40 60 20 80
Marks in Statistics 40 30 50 30 20 10 30 60
Solution
CALCULATION OF RANK CORRELATION COEFFICIENT

Applicant    Marks in Accountancy (X)    Rank (R1)    Marks in Statistics (Y)    Rank (R2)    (R1 - R2)^2 = D^2
A            15                          2            40                         6            16.00
B            20                          3.5          30                         4            0.25
C            28                          5            50                         7            4.00
D            12                          1            30                         4            9.00
E            40                          6            20                         2            16.00
F            60                          7            10                         1            36.00
G            20                          3.5          30                         4            0.25
H            80                          8            60                         8            0.00
N = 8                                                                                         ∑D^2 = 81.5

$$R = 1 - \frac{6\left(\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \cdots\right)}{N^3 - N}$$

The item 20 is repeated 2 times in series X and hence $m_1 = 2$. In series Y, the item 30
occurs 3 times and hence $m_2 = 3$. Substituting these values in the above formula:

$$R = 1 - \frac{6\left(81.5 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(3^3 - 3)\right)}{8^3 - 8} = 1 - \frac{6(81.5 + 0.5 + 2)}{504} = 1 - \frac{6 \times 84}{504} = 0$$
There is no correlation between the marks obtained in two subjects.
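
The tie adjustment can be sketched in code as follows. The sketch assigns average ranks in ascending order of marks (lowest mark gets rank 1, as in the table above) and adds $\frac{1}{12}(m^3 - m)$ for each group of tied items; the function names are hypothetical.

```python
from collections import Counter


def average_ranks(values):
    """Rank in ascending order, giving tied values the average of their positions."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def spearman_with_ties(xs, ys):
    """Rank correlation with (m^3 - m)/12 added to sum(D^2) for every tied group."""
    n = len(xs)
    ranks_x, ranks_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    correction = sum(
        (m ** 3 - m) / 12
        for series in (xs, ys)
        for m in Counter(series).values()
        if m > 1
    )
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)


# Marks of applicants A to H
accountancy = [15, 20, 28, 12, 40, 60, 20, 80]
statistics_marks = [40, 30, 50, 30, 20, 10, 30, 60]

print(average_ranks(accountancy))       # [2.0, 3.5, 5.0, 1.0, 6.0, 7.0, 3.5, 8.0]
R = spearman_with_ties(accountancy, statistics_marks)
print(round(R, 3))                      # 0.0
```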
Exercise
Ten ladies in a beauty contest were ranked by three judges in the following order.
Ladies A B C D E F G H I J
1st. judge 1 6 5 10 3 2 4 9 7 8
2nd. judge 3 5 8 4 7 10 2 1 6 9
3rd judge 6 4 9 8 1 2 3 10 5 7

Use the rank correlation coefficient to determine which pair of judges has the nearest
approach to common tastes in beauty.


Solution
In order to find out which pair of judges has the nearest approach to common tastes in
beauty, compare rank correlation between the judgments of;
(i) 1st judge and 2nd judge (ii) 2nd judge and 3rd judge and (iii) 1st judge and 3rd
judge

1st judge (R1)    2nd judge (R2)    3rd judge (R3)    (R1 - R2)^2    (R2 - R3)^2    (R1 - R3)^2
1                 3                 6                 4              9              25
6                 5                 4                 1              1              4
5                 8                 9                 9              1              16
10                4                 8                 36             16             4
3                 7                 1                 16             36             4
2                 10                2                 64             64             0
4                 2                 3                 4              1              1
9                 1                 10                64             81             1
7                 6                 5                 1              1              4
8                 9                 7                 1              4              1
N = 10            N = 10            N = 10            ∑D^2 = 200     ∑D^2 = 214     ∑D^2 = 60

$$R_{(I,II)} = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 200}{10^3 - 10} = 1 - \frac{1200}{990} = -0.212$$

$$R_{(II,III)} = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 214}{10^3 - 10} = 1 - \frac{1284}{990} = -0.297$$

$$R_{(I,III)} = 1 - \frac{6\sum D^2}{N^3 - N} = 1 - \frac{6 \times 60}{10^3 - 10} = 1 - \frac{360}{990} = +0.636$$

Since coefficient of correlation is maximum in the judgment of the first and third judges,
we conclude that they have the nearest approach to common tastes in beauty.
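
The pairwise comparison of the judges can be sketched as a short loop over all pairs. The dictionary keys and the function name below are hypothetical labels; the ranks are those given in the exercise.

```python
from itertools import combinations


def spearman_from_ranks(ranks_a, ranks_b):
    """R = 1 - 6*sum(D^2) / (N^3 - N) for untied ranks."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Ranks given to ladies A to J by each judge
judges = {
    "1st": [1, 6, 5, 10, 3, 2, 4, 9, 7, 8],
    "2nd": [3, 5, 8, 4, 7, 10, 2, 1, 6, 9],
    "3rd": [6, 4, 9, 8, 1, 2, 3, 10, 5, 7],
}

# One coefficient per pair of judges
results = {
    pair: spearman_from_ranks(judges[pair[0]], judges[pair[1]])
    for pair in combinations(judges, 2)
}
for pair, value in results.items():
    print(pair, round(value, 3))

# The pair with the largest coefficient agrees most closely
best_pair = max(results, key=results.get)
print("Closest agreement:", best_pair)
```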
Regression Analysis
Introduction
Regression was first used by Francis Galton (1877) in his study of the relationship between
fathers' and sons' heights. He described the relationship by using a 'regression line'. The
term is still used to describe a line drawn through a group of points to represent the trend,
although most modern writers use the term estimating line or predicting line instead
of regression line.
Regression analysis establishes the relationship between the dependent variable and the
regressors by obtaining the rate of change of the response variable due to a unit change of
the independent variable(s). It enables the analyst to estimate (or predict) the unknown
values of one variable from known values of another variable. It can also be used to obtain
a measure of the error (standard error) involved in using the regression line as a basis for
estimations. We can use regression analysis to estimate correlation between two variables.
The Linear Bivariate Regression Model (Simple Regression)
The average relationship between X and Y can be adequately described by a linear
equation 𝑌 = 𝑎 + 𝑏𝑋 whose geometrical presentation is a straight line as in the diagram
below:

In this equation a and b are the population regression coefficients. An individual value of Y
in each sub-population may be expressed as 𝑌 = 𝐸(𝑌|𝑋) + 𝑒, where e is the error term
or stochastic disturbance term. The errors are assumed to be independent random variables,
because the Y's are independent random variables, and the expectations of these errors are
zero: E(e) = 0. Moreover, if the Y's are normal variables, the errors can also be assumed to be
normal with identical variances.


The average relationship between two variables x and y can be adequately described by a
linear equation $y = a + bx$ whose geometrical presentation is a straight line as in the
diagrams below:

The regression line is the line for which the sum of the squared residuals (the vertical
distances between the observed points and the line, drawn in red in the diagram) is minimal.

The standard error (SE) is a measure that tells how much the coefficients would vary if
the same regression were applied to many samples from the same population. A relatively
small SE value therefore indicates that the coefficients will remain very stable if the same
regression model is fitted to many different samples with identical parameters.


Where;

a is the y-intercept, also represented as $\beta_0$

b is the regression coefficient, also denoted as $\beta_1$

x and y are the independent and dependent variables respectively.

Terms
Observations - data points, either observed or measured, often indexed with i
x variable - predictors or independent variables in the model, usually on the right side
of the model equation. Some authors refer to the independent variables as exogenous
variables, predictor variables or regressors.
y variable - outcome, response, or dependent variable in the model, typically the
lone term on the left side of the model equation. Some authors refer to the dependent
variable as the endogenous variable, prognostic variable or regressand.
inputs - also called model terms, these are the items on the right side of the model
equation; note that a two-predictor model could have three inputs, with an interaction term
between the two predictors.

The linear component - this comprises explanatory variables that have additive
effects. Additive effects mean that predictor effects operate individually, but can be
added together.

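Design Matrix

Assuming several observations, we have

$$\mathbf{y} = X\mathbf{b} + \boldsymbol{\varepsilon}$$

Where;

$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \quad \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$

and

$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{m1} & x_{11}^2 & x_{11}x_{21} & \cdots & x_{m-1,1}x_{m1} \\ 1 & x_{12} & \cdots & x_{m2} & x_{12}^2 & x_{12}x_{22} & \cdots & x_{m-1,2}x_{m2} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1N} & \cdots & x_{mN} & x_{1N}^2 & x_{1N}x_{2N} & \cdots & x_{m-1,N}x_{mN} \end{bmatrix}$$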

Residual error - this is the stochastic component in the model and it is typically the
final term on the right side of the model equation, included to account for any
unexplained information, that is, the observed information that is not accounted for by
the predictors or inputs. This stochastic component is directly related to the response
variable through a distribution. Any observed data is thought to have come from some
underlying distribution. For instance, if the heights of 100 adult persons are taken, it
would be assumed that the observations are from some distribution that has the mean
and variance of the heights of all people. The number of insects observed on a transect is
assumed to be a number coming from a distribution of non-negative integers. It
is critical to understand how the response data is generated and sampled, in order to best
match the associated distribution and use appropriate test statistics for inferences. For
example, if the individual observations were successes or failures, then by definition the
appropriate distribution is the Bernoulli or binomial distribution.

Regression Equation Calculation


Determine the regression equation of Y on X from the following data:
X: 1 2 3 4 5
Y: 2 5 3 8 7
The regression equation of Y on X is given by: 𝑌 = 𝑎 + 𝑏𝑋

Using the normal equations:

$$\sum Y = Na + b\sum X$$
$$\sum XY = a\sum X + b\sum X^2$$

Here $N = 5$, $\sum X = 15$, $\sum Y = 25$, $\sum XY = 88$ and $\sum X^2 = 55$. Substituting the values, we get

$$25 = 5a + 15b$$
$$88 = 15a + 55b$$

Solving the equations simultaneously we get $a = 1.1$ and $b = 1.3$.
Hence, the required estimated simple regression equation of Y on X is given by

$$\hat{y} = 1.1 + 1.3x$$

It enables the computation of the fitted values of y based on values of x.
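
The solution of the two normal equations can be sketched in code. The closed-form expressions below are what one obtains by eliminating a from the normal equations; the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Solve the normal equations of Y = a + bX for (a, b)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = (sum_y - b * sum_x) / n                                   # intercept
    return a, b


X = [1, 2, 3, 4, 5]
Y = [2, 5, 3, 8, 7]

a, b = fit_line(X, Y)
print(round(a, 2), round(b, 2))  # 1.1 1.3

# Fitted values of y for each observed x
fitted = [a + b * x for x in X]
print([round(value, 1) for value in fitted])
```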

Coefficient of Determination*
The coefficient of determination is equal to $r^2$. It expresses the proportion of the variance
in Y due to X, that is, the ratio of the explained variance to the total variance. For example,
if r = 0.9, $r^2$ will be 0.81, and this would mean that 81 per cent of the variation in the
dependent variable has been explained by the independent variable. The maximum value of
$r^2$ is unity, because it is possible to explain all of the variation in Y, but it is not
possible to explain more than all of it. The coefficient of determination of a linear
regression model is the quotient of the variances of the fitted values and the observed
values of the dependent variable.

$$r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}}; \qquad r^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$$


The challenge with the multiple $r^2$ is that it will increase even when variables that explain
almost no variance are added. Hence, multiple $r^2$ encourages the inclusion
of junk variables.

Adjusted $R^2$

This is the coefficient of determination adjusted for the number of independent variables
in the regression model. Unlike the coefficient of determination, the adjusted $R^2$ may
decrease if the variables entered in the model do not add significantly to the model fit.

$$r^2_{adj} = 1 - \frac{\text{Unexplained variation}/(n-k-1)}{\text{Total variation}/(n-1)}; \qquad r^2_{adj} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} \cdot \frac{(n-1)}{(n-k-1)}$$

where n is the number of observations and k is the number of independent variables.

Thus, the adjusted $r^2$ will decrease when variables are added that explain little or even no
variance, while it will increase if variables are added that explain a lot of variance.
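
As a rough numerical check of the two formulas, the sketch below computes $r^2$ and the adjusted $r^2$ for the fitted line $\hat{y} = 1.1 + 1.3x$ from the regression example (X = 1 to 5, Y = 2, 5, 3, 8, 7), with k = 1 independent variable. The function names are hypothetical.

```python
def r_squared(ys, fitted):
    """Explained variation divided by total variation."""
    mean_y = sum(ys) / len(ys)
    explained = sum((f - mean_y) ** 2 for f in fitted)
    total = sum((y - mean_y) ** 2 for y in ys)
    return explained / total


def adjusted_r_squared(ys, fitted, k):
    """1 - [unexplained/(n-k-1)] / [total/(n-1)], with k independent variables."""
    n = len(ys)
    mean_y = sum(ys) / n
    unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - (unexplained / (n - k - 1)) / (total / (n - 1))


X = [1, 2, 3, 4, 5]
Y = [2, 5, 3, 8, 7]
fitted = [1.1 + 1.3 * x for x in X]   # fitted values from the regression example

r2 = r_squared(Y, fitted)
r2_adj = adjusted_r_squared(Y, fitted, k=1)
print(round(r2, 3), round(r2_adj, 3))  # 0.65 0.533
```

The adjusted value is smaller than $r^2$ here because the penalty factor $(n-1)/(n-k-1) = 4/3$ is greater than one.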

INTRODUCTION TO PROBABILITY
