Sta101 Lecture Notes-1
1.0 OVERVIEW OF STATISTICS
1.1 INTRODUCTION
The sources of statistical data vary in many ways. The choice of any form
depends on the nature of problems to be solved as well as the expenses
involved in generating the data either by primary or secondary source.
These are important concepts which will be explained fully in due
course.
Education, housing, health, transport and so on for both present and
future.
Secondly, statistics can be regarded as the totality of all the methods that
are used in dealing with the numerical data. This is consistent with the
definition of statistics we gave above. However, statistic, in a singular form
is used to mean a numerical figure that is used to describe a set of data.
E.g. an average is a single number describing the general characteristics of
a set of data. The average mark of students in a class describes the central
mark of the class even though some of the marks may be greater or
smaller than the average mark.
(i) He may group the data in such a way that the overall picture of the
data can be seen at once. This form of classification is known as
frequency description.
(ii) He may like to construct tables, graphs and diagrams that will assist
him in comprehending the result more easily. This is done by
graphical presentation.
(iii) He might convert the raw data into percentages, quartiles, deciles
and other standardised values to help him solve the problem he
intends to.
On the other hand, inductive statistics deals with the method of using
sample results to generalize about the population. It involves treating raw
data leading to predictions or inferences concerning a large group of data.
This makes it possible for us to establish scientific hypotheses by the use of
probability concept.
Apart from economics and business, statistical methods are equally applied
to problems in other disciplines such as Biological and Agricultural
Sciences. The methods are specifically developed and adapted to handle
the problems in these fields to test the stated hypotheses.
1.4 Limitations
1.5 Types of Data
Collecting good data is the foundation on which you gather evidence and
make sense of it. Decide what data you need when you design any
research or project, then you can gather the right information from the
start, and throughout the research or project.
There are two general types of data – quantitative and qualitative and both
are equally important. You use both types to demonstrate effectiveness,
importance or value.
Basically we have two types of sources of data, viz: primary and secondary
sources of data.
the information or data from its record of sales, payments and receipts,
inventories, job cards, time book, output, etc. These records are important
and needed in order to plan a strategy to improve the efficiency,
productivity and increase the volume of work.
a) They are more reliable and more specific to the information concerning
the problems at hand, unlike the information collected by someone else
which may not reflect accurately the problems under study.
c) They frequently state the definition of terms and units that are used.
Besides, the process of generating primary data takes a lot of time and
money. In view of these constraints, firms sometimes
resort to other forms of data collection and compilation.
Sales Statistics: This is collected from the sales Day book of the sales
department. This figure is necessary for planning production so that the
firm will be aware of when to encourage the demand for its product
through sales promotion or when to reduce the supply of the commodity of
the firm.
These data are needed and used by the firm to control the excess supply
or under supply of goods. Furthermore, it could be used to determine the
purchasing policy of the firm.
Financial and Cost Statistics: This information is obtained from the
accounts department. From the firm's account we can gather data on
overhead cost, cost of raw materials, wage and salaries and the cost of
capital or equipment. This kind of data can be used by the management in
budgeting and allocating funds to various units of the firm.
d) Central Bank of Nigeria statistical bulletin, annual report and
statement of accounts
Apart from both primary and secondary sources of data, a firm may
organise special inquiry or market research to seek the opinions of the
consumers regarding the quality of its produce, the value of its goods or
the method of packaging and distribution network of the product. The
purpose of this is to obtain feedback from the consumers so that the firm
can improve on the area where there is deficiency in the marketing
strategy of the product.
1.7 SUMMARY
Basically, statistics is divided into two categories. The first category deals
with descriptive statistics and the other one with inferential statistics.
1.8 EXERCISES
4. What are the major sources of data? State the advantages and
disadvantages of each source.
This method involves the use of necessary books and journals of both past
and present research such as official reports and also the records of
institutions upon which investigations are to be carried. The documentary
sources are more or less published reports and results of experiments.
Advantages:
The cost of collecting the data is highly reduced and at times it is zero
because they are forms of secondary data whose cost is negligible to the
researcher.
It does not require much energy before the information becomes freely
available for use. This is because the whole process of collecting the data
right from the source has been carried out by the original investigator. The
energy usually dissipated in the process of generating the data from the
source is not encountered as this has been done by the institutions that
provide the data.
Disadvantages:
The serious disadvantage of this source of information is that the user may
not be aware of the limitations the data contain. It may not contain the
important feature which is relevant to the problem the user wishes to
consider in his analysis. For instance, the National Bureau of Statistics may
publish gross domestic product (GDP) statistics, and these statistics may not
take into account the goods and services produced at home by full-time
housewives, and some other goods and services that may not pass through
the market system. These variables may be important in determining the
actual GDP figures. Hence an attempt to use these figures for meaningful
decision in the light of apparent omission of these variables may result in
false conclusions.
2.3 Observation
This means that the counting is done if and only if the respondent (the
person to be counted) is physically present.
Advantages
Disadvantages
It is suitable for only a small fraction of the items we want to study.
Advantages
To a large extent, the data collected can be reliable. This is not only because
the data are obtained from the primary source but also the researcher has
the knowledge of the background of the data which is designed to suit the
area of his interest.
Disadvantages
One of the serious dangers of using enumerators for this exercise is that
they can influence the answers or ask misleading questions.
Advantage
This method appears to be simple and small responses are involved. There
is no need to hire the services of enumerators and as such it is regarded as
cost saving device.
Disadvantages
This method is least satisfactory and effective in the sense that only
relatively few of such questionnaires ever get back to the researcher. The
obvious reasons include, firstly, the posted questionnaire might not get to
the respondent due to poor postal situation we experience in the country.
Secondly, the completed questionnaire mailed by the respondents might
even fail to reach the researcher due to the same reason.
(ii) Those who fail to respond by sending back their questionnaire still
have the opportunity of being interviewed by personal contact to
obtain the required information.
This is a method in which telephone calls are used to collect data from a
chosen sample of telephone subscribers. It gives on the spot responses
from the telephone subscribers. This may be in form of radio or television
programme to conduct opinion polls in assessing the success or failure of a
programme or policy already in force.
2.7 Surveys
2.8 Internet
Although it is easy to obtain responses, this method is restricted to only
those that have internet facilities and had the opportunity to visit the
organisation's website at the time of data collection.
2.9 Census
This involves the procedure of counting all the items in the population. The
population could be human or non-human. The national census 2006, for
instance, made it possible for us to know the number of people by sex,
age, education, etc.
2.10 SUMMARY
4. Assess the importance of personal interview method of data
collection in relation to other methods.
Example 3.1:
60 60 61 63 60 68 61 67 64 65
62 70 70 72 62 62 63 69 65 67
64-67    IIII    5
68-71    IIII    4
72-75    I       1
This table is called frequency distribution. It tells us how the weights of the
students have been distributed among the classes or groups.
Step 1: Choosing the classes into which the data are to be grouped.
Step 2: Sorting out data by putting a check for each item into the
appropriate class called the tally method.
There is no general rule about the number of classes or groups to use for
classification, but for practical purposes a rule of thumb is to use a
minimum of 4 classes and a maximum of 15 classes. The larger the number
of classes, the more precise our description of the data, though the more
difficult the calculation process becomes.
It should be noted that it is not all the time necessary that the classes
should have equal intervals, but equal class intervals ease our calculations
from the distribution.
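The two steps above can be sketched in a few lines of Python. This is only an illustration: the function name `tally` is hypothetical, and the class limits are the ones used in Example 3.1 (60-63, 64-67, 68-71, 72-75), applied to the 20 weights listed there.

```python
from collections import Counter

# Weights (kg) of the 20 students in Example 3.1.
weights = [60, 60, 61, 63, 60, 68, 61, 67, 64, 65,
           62, 70, 70, 72, 62, 62, 63, 69, 65, 67]

# Step 1: the classes, as inclusive (lower limit, upper limit) pairs.
classes = [(60, 63), (64, 67), (68, 71), (72, 75)]

def tally(values, class_limits):
    """Step 2: count how many values fall into each class."""
    counts = Counter()
    for v in values:
        for lower, upper in class_limits:
            if lower <= v <= upper:
                counts[(lower, upper)] += 1
                break
    return [counts[c] for c in class_limits]

frequencies = tally(weights, classes)
for (lower, upper), f in zip(classes, frequencies):
    print(f"{lower}-{upper}: {f}")
```

The frequencies add up to the number of observations, which is a quick check that no value fell outside the chosen classes.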
3.2.1 Class Interval and Class Limit
In example 3.1 the first class ranges from 60-63. This is called class
interval. The terminal numbers 60 and 63 are called class limits. The
smaller value, 60 is the lower class limit and the larger number, 63 is the
upper class limit. Hence the class limits are the lower and upper numbers
of a class interval. A class interval which has either no lower class limit or
upper class limit indicated is called an open class interval.
Consider the following frequency distribution table on the height of 30
female students in a mathematics class.
Less than 50    3
51-55           10
56-60           7
61-65           6
66 & Above      4
Class intervals that lack either a lower or an upper class limit, such as the
first and last classes above, are called open class intervals. An open class
interval has the advantage of accommodating a wide range of values;
however, it does not tell us how large or how small the values that fall
into the group are. Furthermore, it makes it difficult to present the
distribution in the form of a graph, let alone to make calculations from it
to describe the data.
3.2.2 Class Boundaries
In example 3.1, the weights are measured to the nearest kg. The class
interval 60-63 theoretically includes all measurements from 59.5 to 63.5
kg. These figures i.e. 59.5 and 63.5 indicated are called class boundaries or
true class limits. The smaller number 59.5 is called the true lower class
limit and the larger number 63.5 the true upper class limit or boundary. In
practice, the class boundaries are obtained by adding the upper limit of
one class interval to the lower limit of the next higher class interval and
dividing by 2. For example 3.1 the class boundaries and frequency are
given as follows.
59.5 - 63.5    10
63.5 - 67.5    5
67.5 - 71.5    4
71.5 - 75.5    1
3.2.3 Class Size or Width
The class size or width is the difference between the lower and upper class
boundaries or true upper and true lower limits. The class mark is the mid-
point of the class interval and is defined as the mid-point between the class
boundaries. It is obtained by adding the true lower and upper limits and
dividing by 2. The class mark is another name for class mid-point. For the
purpose of further mathematical analysis, all observations belonging to a
class interval are assumed to coincide with the class mark.
Table 3.2
50-56    53    3
57-63    60    4
64-70    67    10
71-77    74    3
In making calculations from the above example, where all the intervals
have the same width in the distribution, the following rule may be applied
to find the required class interval width or size.
$$\text{Class width} = \frac{LV - SV}{\text{No. of desired class intervals}}$$
Where LV represents the largest value and SV the smallest value. Since 4
class intervals are required, then the class width is given as:
$$\text{Class width} = \frac{LV - SV}{\text{No. of desired class intervals}} = \frac{77 - 50}{4} = 6.75 \approx 7$$
That is the width is 7. We can then go ahead to construct the class interval
with 7 as the class width while starting from the lowest value of 50.
In this case, the first class interval will be 50 – 56, etc. The class mark is
equally given in the second column. In further calculations involving
frequency distribution, we always assume that observations that fall within
a given interval coincide with the class mid-point.
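The construction just described can be sketched as follows. The helper `build_classes` is a hypothetical name, and the sketch assumes whole-number data recorded to the nearest unit (so that the true limits lie 0.5 on either side of the class limits) and a width that is rounded up, as 6.75 was rounded to 7 above.

```python
import math

def build_classes(smallest, largest, n_classes):
    """Return the class width and, for each class, its limits,
    boundaries (true limits) and class mark."""
    width = math.ceil((largest - smallest) / n_classes)
    rows = []
    lower = smallest
    for _ in range(n_classes):
        upper = lower + width - 1            # inclusive upper class limit
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "mark": (lower + upper) / 2,     # mid-point of the class
        })
        lower = upper + 1
    return width, rows

width, rows = build_classes(50, 77, 4)
for row in rows:
    print(row["limits"], row["boundaries"], row["mark"])
```

With a smallest value of 50, a largest value of 77 and 4 classes, this gives the intervals 50-56, 57-63, 64-70 and 71-77.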
3.2.4 Histograms
Example 3.4: Construct a histogram from the information given in the table
3.3 below.
50-56    53    3
57-63    60    4
64-70    67    10
71-77    74    3
Solution:
1st step: Rewrite the frequency distribution making use of true lower and
upper class limits.
Table 3.4
50-56    49.5    53    3
57-63    56.5    60    4
64-70    63.5    67    10
71-77    70.5    74    3
2nd step: Plot the true lower limit against the frequency thus:
[Figure: histogram of frequency against class boundaries]
To complete the frequency polygon extra classes with zero frequency are
added to both ends of the frequency distributions. This ensures that the
resultant frequency polygon touches the x-axis.
Example 3.6:
Solution:
43-49    42.5    46    0
50-56    49.5    53    3
57-63    56.5    60    4
64-70    63.5    67    10
71-77    70.5    74    3
78-84    77.5    81    0
[Figure: frequency polygon; horizontal axis marked at 42.5, 49.5, 56.5, 63.5, 70.5, 77.5, 84.5]
3.2.6 Relative Frequency Distribution
Example 3.7: Obtain the relative frequency from the data given in the table
below:
21-30    5
31-40    4
41-50    8
51-60    10
61-70    8
71-80    9
81-90    6
Total    50
Solution:
21-30    5     10
31-40    4     8
41-50    8     16
51-60    10    20
61-70    8     16
71-80    9     18
81-90    6     12
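The last column of the solution is each class frequency expressed as a percentage of the total frequency. A minimal Python sketch of that calculation, using the frequencies of Example 3.7 (the variable names are illustrative only):

```python
# Class frequencies from Example 3.7.
frequencies = {
    "21-30": 5, "31-40": 4, "41-50": 8, "51-60": 10,
    "61-70": 8, "71-80": 9, "81-90": 6,
}

total = sum(frequencies.values())

# Relative frequency (%) = class frequency / total frequency x 100.
relative = {cls: 100 * f / total for cls, f in frequencies.items()}

for cls, pct in relative.items():
    print(f"{cls}: {pct:.0f}%")
```

The relative frequencies should always sum to 100 per cent.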
3.2.7 Cumulative Frequency Polygon or ogive
The total frequency of all the scores less than the upper class boundary of
a given class interval is called the cumulative frequency up to and including
the class interval.
Less than Ogive is obtained by plotting the true upper limits against
cumulative frequencies while more than Ogive is the resultant graph when
the true lower limits are plotted against the cumulative frequencies.
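As an illustration, the points for both ogives can be computed as below. The sketch reuses the frequency distribution of Example 3.7 purely as sample data, and assumes values recorded to the nearest unit so that the true limits lie 0.5 beyond the class limits.

```python
from itertools import accumulate

# Class limits and frequencies reused from Example 3.7.
limits = [(21, 30), (31, 40), (41, 50), (51, 60),
          (61, 70), (71, 80), (81, 90)]
freqs = [5, 4, 8, 10, 8, 9, 6]

# "Less than" ogive: true upper limit against running total from the first class.
less_than = list(zip([upper + 0.5 for _, upper in limits],
                     accumulate(freqs)))

# "More than" ogive: true lower limit against running total from the last class.
from_the_top = list(accumulate(reversed(freqs)))[::-1]
more_than = list(zip([lower - 0.5 for lower, _ in limits],
                     from_the_top))

print(less_than)
print(more_than)
```

Each pair is a point (true limit, cumulative frequency) to be plotted and joined to obtain the ogive.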
Example 3.8
Solution
Table 3.8
3.3 ERRORS IN STATISTICS
In statistics, the word 'error' is used to denote the difference between the
true value and the estimated or approximated value. In other words 'error'
refers to the difference between the true value of a population parameter
and its estimate provided by an appropriate sample statistic computed by
some statistical device. Thus, in statistics, the term error is used in a
different and much restricted sense. It should be distinguished
from mistake or inaccuracies which may be committed in the course of
making observation, counting, calculations, etc. These errors in statistics
arise due to a number of factors such as:
(iii) The biases due to faulty collection and analysis of the data and
biases in the presentation and interpretation of the results
RE = AE/a = |a - e|/a
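A small numerical sketch of this ratio may help. It assumes, following the definition of error given earlier, that a is the true value and e its estimate, so that the absolute error is AE = |a - e|; the figures 50 and 48 are made up for illustration.

```python
def absolute_error(a, e):
    """AE = |a - e|, with a the true value and e the estimate (assumed meaning)."""
    return abs(a - e)

def relative_error(a, e):
    """RE = AE / a, the absolute error as a fraction of the true value."""
    return absolute_error(a, e) / a

# Hypothetical figures: true value 50, estimate 48.
print(absolute_error(50, 48))
print(relative_error(50, 48))
```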
3.4 SUMMARY
3.5 EXERCISE
5-9      6
10-14    12
15-19    19
20-29    33
30-34    8
35-39    2
3. The following data shows the lengths of 40 trees recorded to the nearest
millimetre.
4.0 MEASURES OF CENTRAL TENDENCY (OR LOCATION)
4.1 INTRODUCTION
To find the mean for simple (ungrouped) and grouped data; using direct
(coded) method,
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$$
Solution:
$n = 5$, $\sum_{i=1}^{n} X_i = 43$
$$\bar{X} = \frac{9 + 4 + 6 + 13 + 11}{5} = \frac{43}{5} = 8.6$$
4.2.2 Mean for grouped data.
If the numbers $x_1, x_2, \ldots, x_n$ occur with frequencies $f_1, f_2, \ldots, f_n$
respectively, the mean is defined by:
$$\bar{X} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_n X_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i}$$
Example 4.2: The following table shows the number of oranges picked by
twenty students in the school garden.
No of oranges (x) 0 1 2 3 4 5
Solution:
$$\bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i}$$
X     0    1    2     3     4    5
F     2    5    6     4     2    1    $\sum_{i=1}^{n} f_i = 20$
FX    0    5    12    12    8    5    $\sum_{i=1}^{n} f_i X_i = 42$
Therefore, $\bar{X} = \frac{42}{20} = 2.1$
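Both forms of the mean can be checked with a short Python sketch. The function names are illustrative; the data are the five values 9, 4, 6, 13, 11 from the simple-data solution and the orange counts of Example 4.2.

```python
def simple_mean(values):
    """Mean of ungrouped data: sum of the values divided by their number."""
    return sum(values) / len(values)

def grouped_mean(values, freqs):
    """Mean of data given as values with frequencies: sum(f*x) / sum(f)."""
    return sum(f * x for f, x in zip(freqs, values)) / sum(freqs)

# Simple data.
print(simple_mean([9, 4, 6, 13, 11]))

# Example 4.2: number of oranges (x) and number of students (f).
oranges = [0, 1, 2, 3, 4, 5]
students = [2, 5, 6, 4, 2, 1]
print(grouped_mean(oranges, students))
```

For data grouped into class intervals, the same `grouped_mean` can be applied with the class marks in place of the individual values.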
Example 4.3: The table below is the frequency distribution of distances (in
kilometres) from the home of 50 students to their school.
Distances (x) Number of Students (f)
0-4               2
5-9               3
10-14             4
15-19             10
20-24             17
25-29             8
30-34             4
35-39             2
Solution:
The mean can be obtained using the formula,
$$\bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i}$$
Therefore, the mean
$$\bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} = \frac{1035}{50} = 20.7$$
4.3 The Median
The median for simple data is the middle value in an ordered array
of numbers (if the number of values is odd) or the arithmetic mean of
the two middle values (if the number of values is even).
Example 4.6:
Solution:
(i) Arrange the data in order of magnitude, i.e. 4, 5, 7, 10, 11, 12, 17,
18, 33. Since the number of values is odd, the middle value, 11,
is the median.
(ii) Here also we arrange the data in order, that is, 3, 4, 6, 7, 9, 10. The
number of values is even, so the median is the mean of the
two middle values, i.e. $\frac{6 + 7}{2} = \frac{13}{2} = 6.5$ = median.
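The same rule is implemented by `statistics.median` in the Python standard library, which sorts the data and then takes either the middle value or the mean of the two middle values. A quick check on the two data sets of this example:

```python
from statistics import median

odd_set = [4, 5, 7, 10, 11, 12, 17, 18, 33]   # nine values
even_set = [3, 4, 6, 7, 9, 10]                 # six values

# Odd number of values: the single middle value.
print(median(odd_set))

# Even number of values: mean of the two middle values.
print(median(even_set))
```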
4.3.2 The median for grouped data.
If the observations $x_1, x_2, \ldots, x_n$ occur with frequencies $f_1, f_2, \ldots, f_n$
respectively, the median is the $\left(\frac{\sum_{i=1}^{n} f_i + 1}{2}\right)$th ordered observation.
And if the grouped data are in class intervals the median is given by
$$\text{Median} = L_1 + \left(\frac{\frac{N}{2} - \sum f_l}{f_{\text{median}}}\right) c$$
$\sum f_l$ is the sum of frequencies of all classes lower than the
median class.
Also, the median class is the class having the $\left(\frac{\sum_{i=1}^{n} f_i + 1}{2}\right)$th ordered observation.
(i)
X 0 1 2 3 4 5
F 2 5 6 4 2 1
(ii)
Class interval Frequency (f)
0-4               2
5-9               3
10-14             4
15-19             10
20-24             17
25-29             8
30-34             4
35-39             2
Solution:
(i) Here the median is the $\left(\frac{\sum_{i=1}^{n} f_i + 1}{2}\right)$th ordered observation.
X    0    1    2    3    4    5
f    2    5    6    4    2    1    $\sum_{i=1}^{n} f_i = 20$
The median is the $\frac{20 + 1}{2} = \frac{21}{2} = 10.5$th ordered observation and this corresponds to 2
(that is 2 + 5 + 6 = 13 and 10.5 falls in this class), hence median = 2.
(ii)
Class Interval F
0-4 2
5-9 3
10-14 4
15-19 10
20-24 17
25-29 8
30-34 4
35-39 2
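$$\text{Median} = L_1 + \left(\frac{\frac{N}{2} - \sum f_l}{f_{\text{median}}}\right) c$$
The median class is the class containing the $\left(\frac{\sum_{i=1}^{n} f_i + 1}{2}\right)$th ordered observation.
This is the $\frac{50 + 1}{2} = \frac{51}{2} = 25.5$th ordered observation and this corresponds to the
class interval 20-24 (i.e. 2 + 3 + 4 + 10 + 17 = 36).
$L_1 = 19.5$, $\sum f_l = 19$ (i.e. 2 + 3 + 4 + 10),
$f_{\text{median}} = 17$, $N = \sum_{i=1}^{n} f_i = 50$, $c = 24.5 - 19.5 = 5$
$$\text{Median} = 19.5 + \left(\frac{\frac{50}{2} - 19}{17}\right) \times 5 = 19.5 + \left(\frac{25 - 19}{17}\right) \times 5 = 19.5 + \frac{6}{17} \times 5 = 19.5 + 1.76 = 21.26 \approx 21$$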
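The grouped-median formula can be sketched in Python as below. The function name is hypothetical. Two assumptions are made and should be checked against the course convention: the data are whole numbers, so the lower class boundary L1 is 0.5 below the lower class limit, and the class size c is measured between class boundaries (24.5 - 19.5 = 5 for the class 20-24), as in the definition of class size in section 3.2.3.

```python
def grouped_median(limits, freqs):
    """Median = L1 + ((N/2 - sum of lower frequencies) / f_median) * c."""
    n = sum(freqs)
    position = (n + 1) / 2        # locates the median class
    below = 0                     # cumulative frequency below the current class
    for (lower, upper), f in zip(limits, freqs):
        if below + f >= position:
            l1 = lower - 0.5                  # lower class boundary
            c = (upper + 0.5) - l1            # class size between boundaries
            return l1 + (n / 2 - below) / f * c
        below += f
    raise ValueError("empty frequency distribution")

limits = [(0, 4), (5, 9), (10, 14), (15, 19),
          (20, 24), (25, 29), (30, 34), (35, 39)]
freqs = [2, 3, 4, 10, 17, 8, 4, 2]
print(round(grouped_median(limits, freqs), 2))
```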
4.4 The Mode
The mode of a set of numbers is that value which occurs with the greatest
frequency.
4.4.1 The Mode for simple data
The mode for simple data is the most common value among the
observations or the one with the highest frequency.
However, it should be noted that mode may not exist and even if it does
exist it may not be unique.
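The non-uniqueness of the mode is easy to see with `statistics.multimode` from the Python standard library (available from Python 3.8), which returns every value that shares the highest frequency. The two lists below are the data sets 2, 2, 4, 4, 5, 6, 6 and 2, 2, 3, 3, 6, 6, 6, 8, 8, 8, 9.

```python
from statistics import multimode

set_i = [2, 2, 4, 4, 5, 6, 6]
set_ii = [2, 2, 3, 3, 6, 6, 6, 8, 8, 8, 9]

# All values tied for the highest frequency, in order of first appearance.
print(multimode(set_i))
print(multimode(set_ii))
```

In the first set three values occur twice each, and in the second set two values occur three times each, so neither set has a unique mode.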
(i) 2, 2, 4, 4, 5, 6, 6 (ii) 2, 2, 3, 3, 6, 6, 6, 8, 8, 8, 9
Solution:
When our data sets are grouped the mode is obtained using the formula:
$$\text{Mode} = L_1 + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) c$$
c = the class size of the modal class and modal class is the
class with the highest frequency.
Example 4.9: From the following frequency distribution, find the mode of
the distribution.
Class Interval F
0-4               2
5-9               3
10-14             4
15-19             10
20-24             17
25-29             8
30-34             4
35-39             2
Solution:
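$$\text{Mode} = L_1 + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) c$$
The modal class is 20-24, so $L_1 = 19.5$, $\Delta_1 = 17 - 10 = 7$, $\Delta_2 = 17 - 8 = 9$ and $c = 24.5 - 19.5 = 5$.
$$\text{Mode} = 19.5 + \left(\frac{7}{7 + 9}\right) \times 5 = 19.5 + \frac{7}{16} \times 5 = 19.5 + \frac{35}{16} = 19.5 + 2.19 = 21.69 \approx 22$$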
4.5 SUMMARY
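The mean is defined by
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \quad \text{or} \quad \bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i}$$
The median is the middle value in an ordered array of numbers when the
number of values is odd, or the mean of the two middle values when it is
even. For grouped data the median is given by:
$$\text{Median} = L_1 + \left(\frac{\frac{N}{2} - \sum f_l}{f_{\text{median}}}\right) c$$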
The mode of a set of numbers is that value which is the most common, i.e.
the one with highest frequency. And for grouped data, the mode is given
by
$$\text{Mode} = L_1 + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) c$$
4.6 EXERCISES
3. The table below shows the number of people in the families living in
the houses of Phase 1, Gwagwalada.
Size of Family 1 2 3 4 5 6 7
Frequency 4 11 25 37 31 10 2
(a) Find the mean of the distribution directly(and using assumed mean)
CHAPTER 5
MEASURES OF DISPERSION
5.1 INTRODUCTION
(i) 1, 1, 3, 3, 4, 4, 5, 6, 10, 12
(ii)
Class Interval f
10-14             5
15-19             10
20-24             15
25-29             9
30-34             6
Solution:
(ii) Here, Range = upper boundary of the largest class minus lower
boundary of the smallest class = 34.5 - 9.5 = 25
5.3 The Mean deviation
The mean deviation (M.D) of a set of n numbers x1, x2, …, xn is defined by:
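$$\text{M.D.} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$
if the data set is simple, and
$$\text{M.D.} = \frac{\sum_{i=1}^{n} f_i |x_i - \bar{x}|}{k}$$
where $k = \sum_{i=1}^{n} f_i$, the $f_i$ are the frequencies and the $x_i$ their
corresponding values, if the data set is grouped.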
Example 5.2: Find the mean deviation of the following data set:
(i) 1, 1, 3, 3, 4, 4, 5, 6, 10, 12
(ii)
Class Interval    F
10-14             5
15-19             10
20-24             15
25-29             9
30-34             6
Solution:
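(i) $$\text{M.D.} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$
Now, $$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{1 + 1 + 3 + \cdots + 12}{10} = \frac{49}{10} = 4.9$$
$x_i$                 1      1      3      3      4      4      5     6     10    12
$x_i - \bar{x}$      -3.9   -3.9   -1.9   -1.9   -0.9   -0.9   0.1   1.1   5.1   7.1
$|x_i - \bar{x}|$     3.9    3.9    1.9    1.9    0.9    0.9   0.1   1.1   5.1   7.1    Total 26.8
Therefore, $\sum_{i=1}^{n} |x_i - \bar{x}| = 26.8$, $n = 10$, and
$$\text{M.D.} = \frac{26.8}{10} = 2.68$$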
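(ii)
Class Interval   $f_i$   $x_i$   $f_i x_i$   $x_i - \bar{x}$   $|x_i - \bar{x}|$   $f_i |x_i - \bar{x}|$
10-14             5      12      60          -10               10                  50
15-19             10     17      170         -5                5                   50
20-24             15     22      330         0                 0                   0
25-29             9      27      243         5                 5                   45
30-34             6      32      192         10                10                  60
Total             45             995                                               205
Now, $$\bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} = \frac{995}{45} = 22.11 \approx 22$$
$$\text{M.D.} = \frac{\sum_{i=1}^{n} f_i |x_i - \bar{x}|}{\sum_{i=1}^{n} f_i} = \frac{205}{45} = 4.56$$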
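Both parts of this example can be reproduced with one small Python function. The name `mean_deviation` and its optional `centre` argument are illustrative choices: by default the deviations are taken about the exact mean, and `centre` lets us use the rounded mean of 22 that the grouped solution works with.

```python
def mean_deviation(values, freqs=None, centre=None):
    """Mean absolute deviation: sum(f * |x - centre|) / sum(f).

    With freqs omitted every value counts once (simple data).
    With centre omitted the deviations are taken about the mean.
    """
    if freqs is None:
        freqs = [1] * len(values)
    k = sum(freqs)
    if centre is None:
        centre = sum(f * x for f, x in zip(freqs, values)) / k
    return sum(f * abs(x - centre) for f, x in zip(freqs, values)) / k

# (i) simple data
data = [1, 1, 3, 3, 4, 4, 5, 6, 10, 12]
print(round(mean_deviation(data), 2))

# (ii) grouped data: class marks, frequencies, mean rounded to 22
marks = [12, 17, 22, 27, 32]
freqs = [5, 10, 15, 9, 6]
print(round(mean_deviation(marks, freqs, centre=22), 2))
```

Using the unrounded mean 995/45 in part (ii) would give a slightly different figure, since the rounding to 22 affects every deviation.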
These are the most generally used measures of dispersion. The mean of
the squared deviation provides a quantity known as the variance (often
indicated by Var or V). The square root of the variance is known as the
standard deviation. It is usually abbreviated as S.D. or represented by
$\sigma$ (sigma).
5.4.1 Variance and standard deviation of a simple data
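Given a set of n numbers $x_1, x_2, \ldots, x_n$ with mean $\bar{x}$, the variance of the
set of numbers is defined by:
$$\text{Var} = \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$
while the standard deviation is given by:
$$\text{SD} = \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$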
Example 5.3: Find the standard deviation of the following data set. 1, 1, 3,
3, 4, 4, 5, 6, 10, 12.
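Solution: $\bar{x} = 4.9$
$x_i$                  1       1       3      3      4      4      5      6      10      12
$x_i - \bar{x}$       -3.9    -3.9    -1.9   -1.9   -0.9   -0.9    0.1    1.1    5.1     7.1
$(x_i - \bar{x})^2$   15.21   15.21   3.61   3.61   0.81   0.81   0.01   1.21   26.01   50.41    Total 116.9
Therefore, $$\text{Variance} = \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{116.9}{10} = 11.69$$
and the standard deviation is $\sigma = \sqrt{11.69} \approx 3.42$.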
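The result can be cross-checked with the Python standard library. Note that the notes divide by n, which corresponds to the population functions `pvariance` and `pstdev`; `variance` and `stdev` in the same module divide by n - 1 and would give different figures.

```python
from statistics import pstdev, pvariance

data = [1, 1, 3, 3, 4, 4, 5, 6, 10, 12]

# Population variance: mean of the squared deviations about the mean.
var = pvariance(data)

# Population standard deviation: square root of that variance.
sd = pstdev(data)

print(round(var, 2), round(sd, 2))
```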
5.4.2 Variance and standard deviation of a grouped data
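If $x_1, x_2, \ldots, x_n$ occur with frequencies $f_1, f_2, \ldots, f_n$ respectively, the
variance is given by:
$$\text{Var} = \sigma^2 = \frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{k}, \quad \text{where } k = \sum_{i=1}^{n} f_i$$
$$\text{S.D.} = \sigma = \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{k}}, \quad \text{where } k = \sum_{i=1}^{n} f_i$$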
Example 5.4: Given the following frequency distribution table:
Class Interval f
10-14             5
15-19             10
20-24             15
25-29             9
30-34             6
Solution
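Class Interval   $f_i$   $x_i$   $f_i x_i$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$   $f_i (x_i - \bar{x})^2$
10-14             5      12      60          -10               100                   500
15-19             10     17      170         -5                25                    250
20-24             15     22      330         0                 0                     0
25-29             9      27      243         5                 25                    225
30-34             6      32      192         10                100                   600
Total             45             995                                                 1575
Because, $$\bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} = \frac{995}{45} = 22.11 \approx 22$$
$$\text{Variance} = \sigma^2 = \frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} f_i} = \frac{1575}{45} = 35$$
and
$$\text{SD} = \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} f_i}} = \sqrt{35} \approx 5.92$$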
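A short Python sketch of the grouped calculation follows. The function name and the `centre` parameter are illustrative; `centre=22` reproduces the rounded mean used in the solution, while leaving it out uses the exact mean 995/45 and gives a marginally smaller variance.

```python
import math

def grouped_variance(marks, freqs, centre=None):
    """Variance of grouped data: sum(f * (x - centre)^2) / sum(f)."""
    k = sum(freqs)
    if centre is None:
        centre = sum(f * x for f, x in zip(freqs, marks)) / k
    return sum(f * (x - centre) ** 2 for f, x in zip(freqs, marks)) / k

marks = [12, 17, 22, 27, 32]     # class marks of Example 5.4
freqs = [5, 10, 15, 9, 6]

var = grouped_variance(marks, freqs, centre=22)
sd = math.sqrt(var)
print(var, round(sd, 2))
```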
5.5 SKEWNESS AND KURTOSIS
The two measures viz., central tendency (concentration of the observations
about the middle of the distribution) and dispersion (the spread or scatter
of the observations about some measures of central tendency) are
inadequate to characterise a distribution completely. Two distributions may
have the same mean and standard deviation, yet they may give different
histograms. To determine nature and composition of frequency
distributions Skewness and Kurtosis are used. The four measures viz.,
central tendency, dispersion, skewness and kurtosis are sufficient to
describe a frequency distribution completely.
5.5.1 SKEWNESS
Skewness literally means lack of symmetry. It helps us to determine the nature and
extent of the concentration of the observations towards the higher or lower
values of the variable. Generally, a distribution is said to be skewed if the
frequency curve of the distribution or histogram is not a symmetric
bell-shaped curve but is stretched more to one side than to the other, or if the
values of mean, median and mode fall at different points, i.e. they do not
coincide. One measure of skewness (Sk) is Sk = Mean - Median or
Sk = Mean - Mode.
Other measures of skewness include Karl Pearson's Coefficient of
Skewness, Bowley's Coefficient of Skewness, etc.
5.5.2 KURTOSIS
While skewness helps us in identifying the right or left tails of the
frequency curve, kurtosis enables us to have an idea about the shape and
nature of the hump (middle part) of a frequency distribution. In other
words, kurtosis is concerned with the flatness or peakedness of the
frequency curve.
A curve which is neither flat nor peaked is known as the Normal curve and the shape
of its hump is accepted as a standard one. Curves with humps of the form
of the normal curve are said to have normal kurtosis and are termed
meso-kurtic. The curves which are more peaked than the normal curve are
known as lepto-kurtic and are said to possess kurtosis in excess or to have
positive kurtosis. On the other hand, curves which are flatter than the normal curve
are called platy-kurtic and they are said to lack kurtosis or to
have negative kurtosis.
As a measure of kurtosis, Karl Pearson gave the coefficient Beta two ($\beta_2$)
$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{\mu_4}{\sigma^4}$$
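As a rough illustration, the simple skewness measure Sk = Mean - Median and the coefficient $\beta_2$ can be computed as below. The sketch assumes that $\mu_2$ and $\mu_4$ are the second and fourth central moments (average of the squared and fourth-power deviations about the mean), and reuses the data set 1, 1, 3, 3, 4, 4, 5, 6, 10, 12 from Example 5.3; the helper name `central_moment` is hypothetical.

```python
from statistics import mean, median

data = [1, 1, 3, 3, 4, 4, 5, 6, 10, 12]

def central_moment(values, order):
    """Average of (x - mean) ** order over all values."""
    centre = mean(values)
    return sum((x - centre) ** order for x in values) / len(values)

# Simple skewness measure: positive when the mean lies above the median.
sk = mean(data) - median(data)

# Beta two: fourth central moment over the squared second central moment.
beta2 = central_moment(data, 4) / central_moment(data, 2) ** 2

print(round(sk, 2), round(beta2, 2))
```

For the normal curve $\beta_2$ equals 3, so values above 3 point to a lepto-kurtic curve and values below 3 to a platy-kurtic one.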
5.6 Summary
5.7 EXERCISES
(i) Given the following observations: 10, 11, 13, 12, 16, 9, 15, 17, find
1. The range
2. Mean deviation
3. Standard deviation
X 0 1 2 3 4
F 4 11 25 30 35
Find
1. The range
2. Mean deviation
3. Standard deviation
(iii) From the following frequency distribution table
Class Interval f
0-4               1
5-9               4
10-14             6
15-19             7
20-24             10
25-29             12
30-34             6
35-39             2
40-44             1
Find
1. The range
CHAPTER 6
SIMPLE LINEAR REGRESSION
6.1 INTRODUCTION
suspect a relationship between income and expenditure. The more income
earned the more the tendency of spending. Variables of this nature are
often referred to as Independent variable and dependent variable.
Expenditure is regarded as dependent variable because without income it
cannot exist. Whereas income is regarded as independent variable because
it can exist without expenditure, that is, income is independent of
expenditure but expenditure depends on income. Often times independent
variable is represented by x while the dependent variable by y.
6.3 SCATTER DIAGRAM
This is used in plotting the values of the independent and dependent
variables in a two-dimension graph. Each value is plotted at its particular x
and y coordinates.
Example 16.1:
The following table shows the income and expenditure of 10 employees of
the University of Abuja per annum in hundreds of thousands.
Fig 16.1: Scatter diagram
Note:
A quick scan of figure 16.1 appears to indicate that employees with high
income spend more, which suggests a linear relationship between income
and expenditure. The question that will be examined next is how the
existence of a linear relationship can provide a better prediction of the
dependent variable y. This is achieved by the use of regression analysis.
6.4 REGRESSION ANALYSIS
Regression analysis is utilized for the purpose of prediction. In the scatter
diagram plotted in figure 16.1, the relationship between the variables
(income and expenditure) was observed to be roughly a straight line, that
is, a linear relationship. In general, however, the nature of the
relationship can take many forms, ranging from simple mathematical
functions to extremely complicated ones.
The simplest relationship consists of a straight line or linear relationship of
the type in figure 16.1. The simple linear regression model is our major
interest in this chapter.
b1 is the true slope for the population, representing the unit change in Y (∆Y)
per unit change in X (∆X). That is, it represents the amount that Y changes
(either positively or negatively) for a particular unit change in X.
$$\sum_{i=1}^{n} Y_i = n b_0 + b_1 \sum_{i=1}^{n} X_i \qquad (1)$$
$$\sum_{i=1}^{n} X_i Y_i = b_0 \sum_{i=1}^{n} X_i + b_1 \sum_{i=1}^{n} X_i^2 \qquad (2)$$
Since there are two equations with two unknowns we can solve these
equations simultaneously for b0 and b1 as follows:
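$$b_1 = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} \qquad (3)$$
$$b_0 = \frac{\sum_{i=1}^{n} Y_i}{n} - b_1 \frac{\sum_{i=1}^{n} X_i}{n} = \bar{Y} - b_1 \bar{X} \qquad (4)$$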
Example 16.2: The following table shows the marks obtained by ten students
out of a maximum of 10 marks in mathematics (X) and English (Y)
Maths(X) 3 6 4 6 4 7 5 5 4 7
Eng.(Y) 4 6 5 7 4 7 6 6 5 8
(a) Plot the data in a scatter diagram
(b) Fit the least-squares regression equation of Y on X
(c) Predict Y, if X = 2
SOLUTION:
(a) The scatter diagram is given as,
[Figure: scatter diagram "Marks scores (Maths and Eng.)", English score (Y) against Mathematics score (X)]
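$$b_1 = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}$$
and,
$$b_0 = \frac{\sum_{i=1}^{n} Y_i}{n} - b_1 \frac{\sum_{i=1}^{n} X_i}{n} = \bar{Y} - b_1 \bar{X}$$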
Therefore,
S/N X Y XY X2 Y2
1 3 4 12 9 16
2 6 6 36 36 36
3 4 5 20 16 25
4 6 7 42 36 49
5 4 4 16 16 16
6 7 7 49 49 49
7 5 6 30 25 36
8 5 6 30 25 36
9 4 5 20 16 25
10 7 8 56 49 64
Total 51 58 311 277 352
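That is, $\sum X = 51$, $\sum Y = 58$, $\sum XY = 311$, $\sum X^2 = 277$ and $\sum Y^2 = 352$
$$b_1 = \frac{10(311) - (51)(58)}{10(277) - (51)^2} = \frac{152}{169} = 0.899$$
$$b_0 = \frac{58}{10} - 0.899\left(\frac{51}{10}\right) = 5.8 - 0.899(5.1) = 5.8 - 4.585 = 1.215$$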
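The least-squares formulas for b1 and b0 can be checked with a few lines of Python. The function name `least_squares` is illustrative. Because the code keeps b1 unrounded, its intercept differs from the hand calculation in the third decimal place.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the line Y = b0 + b1 * X fitted by least squares."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

maths = [3, 6, 4, 6, 4, 7, 5, 5, 4, 7]      # X
english = [4, 6, 5, 7, 4, 7, 6, 6, 5, 8]    # Y

b0, b1 = least_squares(maths, english)
predicted = b0 + b1 * 2                      # part (c): predicted Y when X = 2

print(round(b0, 3), round(b1, 3), round(predicted, 2))
```

Note that X = 2 lies below the smallest observed mathematics score (3), so the prediction extrapolates beyond the data and should be read with caution.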
6.5 SUMMARY
(1) Scatter diagram is the plot of two variables in a two-dimensional
graph.
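$$b_1 = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}$$
$$b_0 = \bar{Y} - b_1 \bar{X}$$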
6.6 EXERCISE
(i) What do you understand by
(a) Normality
(b) Homoscedasticity
(c) Independence of error
$$\sum_{i=1}^{n} Y_i = n b_0 + b_1 \sum_{i=1}^{n} X_i$$
$$\sum_{i=1}^{n} X_i Y_i = b_0 \sum_{i=1}^{n} X_i + b_1 \sum_{i=1}^{n} X_i^2$$
(iii) The following data is the Height (X) and Weight (Y) of 10 Students
in a statistics class.
Student 1 2 3 4 5 6 7 8 9 10
Height (X) 60 62 61 69 67 63 69 65 61 60
Weight (Y) 115 98 115 125 131 162 140 103 95 125
CHAPTER 7
CORRELATION ANALYSIS
7.1 INTRODUCTION
In the previous chapter the theory of regression was considered. In this
chapter correlation theory is discussed. A correlation problem
differs from a regression problem in that we are concerned with a measure of
the relationship between two or more variables rather than predicting one
variable from knowledge of the independent variables.
7.2 Correlation Theory
The objective of correlation analysis is to evaluate the extent to which
covariance exists among the variables under investigation, that is, to measure
the linear relationship between variables. Two variables are said to be
correlated if a change in the value of one of the variables tends to be
associated with a consistent corresponding change in the value of the other.
There are different types of correlation coefficients that are used to measure
the linear relationship between variables. Prominent among them are:
(i) The product moment correlation coefficient or the coefficient of
total correlation or simply linear correlation coefficient.
(ii) The rank correlation coefficient.
(iii) The intra class correlation coefficient.
(iv) The partial correlation coefficient.
(v) The multiple correlation coefficient.
The linear correlation coefficient has the following properties.
( i ) It is independent of scale and origin.
(ii) It lies between –1 and +1
(iii) If r = –1 or +1 then there is perfect linear relationship between x and y.
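(ii) $r = \frac{ss_{xy}}{\sqrt{ss_x \, ss_y}}$ = Sample correlation coefficient.
Note that the sample correlation coefficient is the most often used.
Therefore,
$$r = \frac{ss_{xy}}{\sqrt{ss_x \, ss_y}} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}}$$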
Therefore
                                                     Total
X      12     10     14     11     12     9          68
Y      18     17     23     19     20     15         112
XY     216    170    322    209    240    135        1292
X2     144    100    196    121    144    81         786
Y2     324    289    529    361    400    225        2128
$$r = \frac{6(1292) - (68)(112)}{\sqrt{\left[6(786) - (68)^2\right]\left[6(2128) - (112)^2\right]}} = 0.947$$
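The computational formula for r translates directly into Python. The function name `pearson_r` is illustrative, and the data are the six (X, Y) pairs tabulated above.

```python
import math

def pearson_r(xs, ys):
    """Product moment correlation coefficient from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) *
                            (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

x = [12, 10, 14, 11, 12, 9]
y = [18, 17, 23, 19, 20, 15]

r = pearson_r(x, y)
print(round(r, 3))
```

A value this close to +1 indicates a strong positive linear relationship between the two variables.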
$$r_{rank} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
Example 17.2: For the data in example 17.1, calculate the Spearman's rank
correlation coefficient.
Solution:
S/NO    X     Y     Rank (Rx)   Rank (Ry)   Difference (d) = Rx - Ry   d2
1       12    18    2.5         4           -1.5                       2.25
2       10    17    5           5           0                          0
3       14    23    1           1           0                          0
4       11    19    4           3           1                          1
5       12    20    2.5         2           0.5                        0.25
6       9     15    6           6           0                          0
Total                                                                  3.5
$$r_{rank} = 1 - \frac{6(3.5)}{6(6^2 - 1)} = 1 - \frac{21}{210} = 1 - 0.1 = 0.9$$
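The ranking and the rank correlation can be sketched in Python as follows. The helper names are hypothetical. As in the table, the largest value receives rank 1 and tied values share the average of the ranks they occupy (the two 12s share ranks 2 and 3, giving 2.5 each).

```python
def average_ranks(values):
    """Rank values from largest (rank 1) to smallest, averaging tied ranks."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1            # first rank occupied by v
        last = first + ordered.count(v) - 1     # last rank occupied by v
        ranks.append((first + last) / 2)
    return ranks

def spearman_rank(xs, ys):
    """r_rank = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2
                    for rx, ry in zip(average_ranks(xs), average_ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = [12, 10, 14, 11, 12, 9]
y = [18, 17, 23, 19, 20, 15]

print(average_ranks(x))
print(average_ranks(y))
print(spearman_rank(x, y))
```

When ties are present, as here, this formula gives only an approximation to the rank correlation, which is the approach the worked example follows.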
7.4 Summary
The objective of the correlation analysis is to measure the linear relationship
between two or more variables.
The linear correlation coefficient is given by the formula:
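$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}},$$
whereas the Spearman's rank correlation coefficient is given by:
$$r_{rank} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$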
7.5 EXERCISES
1. The following data is the height (x) and weight (y) of 10 students.
Student 1 2 3 4 5 6 7 8 9 10
Height 60 62 61 69 67 63 69 65 61 60
Weight 115 98 116 125 131 162 140 103 95 125
Compute and interpret
(i) The product moment correlation coefficient.
(ii) The Spearman's rank correlation coefficient.
2. The grades of a class of 9 students on a C.A test (x) and final examination
(y) are as follows:
X 77 50 71 72 81 94 96 99 67
Y 82 66 78 34 47 85 99 99 68
Compute and interpret the correlation coefficient for the variables X and Y.