Sta101 Lecture Notes-1

This document provides an overview of an introductory statistics course. It discusses topics that will be covered including statistical data collection methods, data presentation techniques, measures of central tendency and dispersion, bivariate data analysis, and introduction to time series analysis. References for further reading on statistics topics are also provided. The document outlines the nature and scope of statistics, distinguishing between descriptive and inferential statistics. It notes some limitations of statistical methods which can produce faulty results if the underlying data is inaccurate or incomplete.

Uploaded by

absalomi088

STA 101 INTRODUCTION TO STATISTICS [3 CREDIT UNITS]

Prerequisite: 'O' Level Mathematics

Statistical data: types, sources, methods of collection. Presentation of data:
tables, charts and graphs. Errors and approximations. Frequency and
cumulative distributions. Measures of location, partition, dispersion. Tree
diagrams. Box-plots and stem-and-leaf displays. Bivariate data: scatter
diagrams and regression lines and their applications; coefficient of
correlation, product moment correlation coefficient, Spearman rank
correlation coefficient. Index numbers: definition and classification of index
numbers, Laspeyres and Paasche indices, types of index numbers,
problems in the construction of index numbers, and uses of index numbers.
Introduction to time series: definition, components, additive and
multiplicative models, and stationarity and invertibility.


1.0 OVERVIEW OF STATISTICS
1.1 INTRODUCTION

Statistics as a field of study has in recent times grown to be very important
in all areas of endeavour. It provides a systematic method of collecting and
analysing information in wide-ranging areas such as medicine, agriculture,
economics and business, from which comprehensive and conclusive results can
be derived for far-reaching decisions. The results, usually in numerical
terms, can provide sufficient information for describing the natural, social
and managerial phenomena under consideration. However, such information
on its own may not be sufficient to solve the problems that researchers
or managers face, because statistical results are subject to some
limitations.

The sources of statistical data vary in many ways. The choice of source
depends on the nature of the problem to be solved as well as the expense
involved in generating the data from a primary or a secondary source.
These are important concepts which will be explained fully in due
course.

1.2 What is Statistics?

Statistics is defined as a scientific method of collecting, organising,
summarising, presenting and analysing data from which valid conclusions
or decisions can be made about the data. Statistics is now widely used in
all areas of study as an aid to decision making.

In medicine, statistics is used to study the influence of a new drug on
patients' recovery from a disease. In psychology, statistics is used to assess
the influence of environment on human behaviour. In agriculture, statistics
is used to determine whether an increase in crop yield is due to the type of
fertilizer or pesticide used.

In business, statistics has brought rapid changes in production, efficient
use of raw materials, marketing of products and other areas of business
research. Similarly, changes in the price level, production capacity
measurement and other economic phenomena are arrived at through
systematic statistical procedures.

Today, statistics is looked upon as a method by which meaningful decisions
in respect of all matters affecting economic activities can be taken. In
short, it is generally used as a tool for reaching sound decisions not only
by organisations but also by the state. For instance, the Federal, State
and Local governments constantly use statistics, because the
government needs statistical information to plan for such services as
education, housing, health, transport and so on, for both the present and
the future.

The word statistics can be used in two senses. Firstly, it is generally
regarded as the collection of facts or data which are expressed as
summary statements. The information (data) may be obtained either from a
source from which several observations are personally collected or from
already prepared data.

For instance, we talk of the number of people employed by firms or
industry annually, the number of vehicles involved in accidents in a country,
or the number of births or marriages registered in a local government. Such
information is obtainable from the records of these various organisations.
On the other hand, we equally talk about the statistics published by the
National Bureau of Statistics, Central Banks or other agencies mandated
with the responsibility of disseminating statistical information.

Secondly, statistics can be regarded as the totality of all the methods that
are used in dealing with numerical data. This is consistent with the
definition of statistics we gave above. However, statistic, in its singular
form, is used to mean a numerical figure that describes a set of data.
For example, an average is a single number describing the general
characteristics of a set of data. The average mark of students in a class
describes the central mark of the class even though some of the marks may be
greater or smaller than the average mark.

1.3 Nature and Scope of Statistics.

The statistician may be interested in various methods of describing a large
amount of raw data (i.e. information in its original form) so that he can
make a decision and draw conclusions about the set of data. Basically, all
problems involving the use of statistical methods can be categorised as either
descriptive (deductive) statistics or inferential (inductive) statistics.

Descriptive statistics involves the procedure of collecting, classifying and
presenting historical data in various forms so as to make it usable and
easily understandable. For instance, the number of births in a state over
some years or the families and their incomes are mere raw information. A
statistician can organise and present these data in a descriptive way using
appropriate methods and rules in the following ways:

(i) He may group the data in such a way that the overall picture of the
data can be seen at once. This form of classification is known as a
frequency distribution.

(ii) He may like to construct tables, graphs and diagrams that will assist
him in comprehending the result more easily. This is done by
graphical presentation.

(iii) He might convert the raw data into percentages, quartiles, deciles
and other standardised values to help him solve the problem at
hand.

(iv) He may calculate the averages so as to know something about the
typical or representative characteristics of the data.
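
As a rough illustration of ways (i), (iii) and (iv), the short Python sketch below groups a small data set into classes, converts the class counts to percentages, and computes the quartiles and the mean. The family incomes, the class width of 20 and the helper name grouped_counts are all invented for the example; they do not come from these notes.

```python
import statistics

# Hypothetical monthly incomes (in thousands of naira) for 12 families.
incomes = [25, 40, 32, 58, 47, 25, 36, 90, 41, 33, 52, 29]


def grouped_counts(values, width):
    """Count how many values fall in each class of the given width.

    Classes are labelled by their lower limit: with width 20, the class
    labelled 20 holds values from 20 up to (but not including) 40.
    """
    counts = {}
    for v in values:
        low = (v // width) * width
        counts[low] = counts.get(low, 0) + 1
    return dict(sorted(counts.items()))


# (i) Group the data so the overall picture can be seen at once.
groups = grouped_counts(incomes, 20)

# (iii) Convert the class counts into percentages of all families.
percentages = {low: 100 * n / len(incomes) for low, n in groups.items()}

# (iii) Quartiles: three cut points that split the ordered data into quarters.
q1, q2, q3 = statistics.quantiles(incomes, n=4)

# (iv) An average that represents the typical income.
mean_income = statistics.mean(incomes)

for low, n in groups.items():
    print(f"{low}-{low + 19}: {n} families ({percentages[low]:.1f}%)")
print(f"quartiles: {q1}, {q2}, {q3}")
print(f"mean income: {mean_income:.2f}")
```

The same summary could be produced by hand with a tally sheet; the code only makes the arithmetic explicit.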

On the other hand, inductive statistics deals with the method of using
sample results to generalise about the population. It involves treating raw
data in a way that leads to predictions or inferences concerning a larger
group of data. This makes it possible for us to establish scientific
hypotheses by the use of the concept of probability.
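
The idea of generalising from a sample to a population can be sketched in a few lines of Python. The population of ages below is simulated, and the sample size of 50 and the seed are arbitrary assumptions; the point is only that a statistic computed from a random sample serves as an estimate of the corresponding population value.

```python
import random

random.seed(101)  # arbitrary seed so that the sketch is repeatable

# Hypothetical population: the ages of 1,000 people (simulated, not real data).
population = [random.randint(0, 90) for _ in range(1000)]

# In practice the population mean is usually unknown; it is computed here
# only so that the sample estimate can be compared with it.
population_mean = sum(population) / len(population)

# Draw a simple random sample of 50 people without replacement.
sample = random.sample(population, 50)
sample_mean = sum(sample) / len(sample)

print(f"population mean age: {population_mean:.1f}")
print(f"sample estimate:     {sample_mean:.1f}")
```

The two figures will generally differ a little; how far a sample result can be trusted as a statement about the population is the question that probability is used to answer.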

The distinction between descriptive and inductive statistics becomes clear if
we consider, for instance, the amount of exports of a particular country for
the past 20 years. Any measure describing the average value of exports
for the past 20 years can be regarded as descriptive statistics. But
if we state that the average value of exports, based on
20 years of information, is N36 million, and then use this annual value to
predict the value of exports beyond the 20 years, we are generalising on
the strength of the information at hand and thereby placing ourselves in the
field of inductive statistics.

Apart from economics and business, statistical methods are equally applied
to problems in other disciplines such as Biological and Agricultural
Sciences. The methods are specifically developed and adapted to handle
the problems in these fields to test the stated hypotheses.

1.4 Limitations

Although statistical methods are powerful instruments in aiding analysis and
decision making, they should be used with caution, because they may
sometimes produce faulty results. This might not
be because the method is wrongly applied, but rather because the data
present wrong information. For instance, economic data collected
from the field may be inadequate or inaccurate because certain problems
were encountered in the collection process. These problems could include
falsification of data, omission of important information which should be
included to facilitate the data collection, etc.

For example, the problems which arise in measuring population growth or
movements in price levels may lead respondents to falsify the data expected
of them. Furthermore, published data may contain information which,
unknown to the users, is not relevant to the problems at hand.
Hence, interpreting a particular mass of data whose
background is not properly stated is a dangerous procedure. It is
capable of leading us to wrong conclusions and inferences.

1.5 Types of Data

Collecting good data is the foundation on which you gather evidence and
make sense of it. Decide what data you need when you design any
research or project, so that you can gather the right information from the
start and throughout the research or project.

There are two general types of data – quantitative and qualitative and both
are equally important. You use both types to demonstrate effectiveness,
importance or value.

(i) QUANTITATIVE DATA

Quantitative data is information that you can measure. It's
numbers, something you can count. Because it's countable it can be
reliable evidence. Examples include:

How many students are present in a lecture?

How much did it cost to buy a calculator?

How far is your house from school?

What is the average attendance at each lecture session?

(ii) QUALITATIVE DATA

Qualitative data is information about qualities; you can't count it.
That is, it's information about how people feel about something. Examples
include:

Sharing what people like about a programme.

How they think it could be improved.

What difference it has made to their lives.

Whether they would recommend the programme to others.

1.6 Sources of Data

Basically we have two types of sources of data, viz: primary and secondary
sources of data.

(i) Primary Data:

Primary data are data which are obtained by an individual, firm or
organisation by setting up a body of inquiry which generates such
information or extracts it from the records. For instance, a firm may obtain
the information or data from its record of sales, payments and receipts,
inventories, job cards, time book, output, etc. These records are important
and needed in order to plan a strategy to improve efficiency and
productivity and to increase the volume of work.

Similarly, primary data can be generated from surveys, experiments, censuses
and other techniques of inquiry. These include the population census,
gross domestic product (GDP) compilation, price index compilation, etc.
The importance of such information cannot be over-emphasised, especially
if the government needs it for planning purposes as well as for assessing
the performance of the economy.

Although primary data are important, they are used to
supplement the data obtained from outside the organisation, especially those
that come from governmental agencies.

The advantages of primary data include the following:

a) They are more reliable and more specific to the information concerning
the problems at hand, unlike the information collected by someone else
which may not reflect accurately the problems under study.

b) They usually provide more detailed information than secondary data.
This is because the researcher knows exactly where to collect his
information.

c) They frequently state the definition of terms and units that are used.

A serious disadvantage of primary data is that they are restrictive in use.
They cannot be used for far-reaching policy making and future planning without
looking outwards to consider relevant data outside the organisation.

Besides, the process of generating primary data involves a lot of time and
money. In view of these constraints, firms sometimes
resort to other forms of data collection and compilation.

1.6 Sources of Primary Data:

Some primary sources of data for an organisation might be the following:

Sales Statistics: These are collected from the sales day book of the sales
department. These figures are necessary for planning production so that the
firm will be aware of when to encourage the demand for its product
through sales promotion or when to reduce the supply of the commodity of
the firm.

Supply Statistics: These are obtained from the purchase department of the
firm. The data are related to the purchase and stock of raw materials.
These data are needed and used by the firm to control the excess supply
or under-supply of goods. Furthermore, they could be used to determine the
purchasing policy of the firm.

Production Statistics: These statistics are available in the production
department of the firm. Such information includes job allocation, progress
reports on workers' performance, number of machine breakdowns, and
quantity and value of spoilt materials. Such information is needed to
control the production level as well as to give appropriate information to the
foreman.

Personnel Statistics: Information on the workers in an
organisation can be obtained from the personnel department, in the employees'
cards showing the number of staff, salaries and wages, hours worked,
labour turnover, age, sex, education, etc. of the employees. Such information
is needed to help management determine personnel policy and decisions
on wages, promotion and pension schemes.

Financial and Cost Statistics: This information is obtained from the
accounts department. From the firm's accounts we can gather data on
overhead cost, cost of raw materials, wages and salaries and the cost of
capital or equipment. This kind of data can be used by the management in
budgeting and allocating funds to various units of the firm.

(ii) Secondary Data:

An organisation may require other kinds of statistics apart from those
derived from its own records. Such statistics are not usually cheap or easy
for the organisation to compile, so it may contact external sources where
the information may be more readily available.

The bulk of such information may be found in the statistics published by
other agencies, such as government departments and international
organisations such as the United Nations. The purpose of external sources
of data is to be able to make policy decisions on a broader basis. Depending
on its purpose, an organisation may need information such as transport
statistics, industrial statistics, manpower statistics, energy and mining
statistics, etc. These statistics are usually not available in the
organisation's own records; hence, there is a need to obtain them from
external sources. Some of the external sources of statistics are:

a) National Bureau of Statistics: Publications such as the annual abstract of
statistics, industrial surveys, economic indicators, quarterly digest
of statistics, etc.

b) Federal Ministry of Labour bulletin.

c) NNPC annual statement

d) Central Bank of Nigeria statistical bulletin, annual report and
statement of accounts

e) Manpower Board of Surveys

Other sources include:

World Bank Publications such as world report, census report, National
Development Bank, United Nations Annual statement of account.

Apart from both primary and secondary sources of data, a firm may
organise special inquiry or market research to seek the opinions of the
consumers regarding the quality of its products, the value of its goods or
the method of packaging and distribution network of the product. The
purpose of this is to obtain feedback from the consumers so that the firm
can improve on the area where there is deficiency in the marketing
strategy of the product.

The advantage of secondary data is that they are easy and less expensive to
obtain. This is because the initial cost of generating the data is not met by
the organisation or agency that uses them.

However, a serious disadvantage may be associated with the data, in that
some features inherent in the data, unknown to the users, may be
irrelevant to the problem they need them for. For this reason,
one should exercise a lot of caution when applying
them.

1.7 SUMMARY

Statistic, used in the singular sense, indicates a numerical value obtained
from the statistical process and used to describe the characteristics of a
mass of data.

Basically, statistics is divided into two categories. The first category deals
with descriptive statistics and the other one with inferential statistics.

To achieve statistical results, the art of collecting data is very important.
In this regard, the sources of data fall under two major types:
primary data and secondary data.

The sources of information in Nigeria include the National Bureau of Statistics
and other government departments and agencies.

Statistics is the scientific method of collecting, organising, summarising,
presenting and analysing data to arrive at far-reaching conclusions and
decisions.

1.8 EXERCISES

1. What is meant by statistics?

2. Mention the various uses of statistical results.

3. Distinguish between deductive statistics and inductive statistics.

4. What are the major sources of data? State the advantages and
disadvantages of each source.

5. Distinguish between primary and secondary data.

6. To what extent do you think the state of statistical data has hampered
the planning process in Nigeria?

2.0 METHODS OF DATA COLLECTION
2.1 INTRODUCTION

In this chapter we shall discuss the various methods by which data
are collected before they are put into useful forms. These methods include
the documentary method, observation, personal interview (by the researcher
or by enumerators), mail questionnaire, the telephone method and the
internet. The degree of reliability and accuracy depends largely on the
method adopted by the researcher. The methods are described in the
following subsections.

2.2 Documentary Method

This method involves the use of relevant books and journals reporting both
past and present research, such as official reports, and also the records of
the institutions on which investigations are to be carried out. The
documentary sources are more or less published reports and results of
experiments.

Advantages:

The cost of collecting the data is greatly reduced, and at times it is zero,
because they are forms of secondary data whose cost is negligible to the
researcher.

It does not require much effort before the information becomes freely
available for use. This is because the whole process of collecting the data
right from the source has been carried out by the original investigator. The
effort usually expended in generating the data from the
source is not incurred, as this has been done by the institutions that
provide the data.

Disadvantages:

The serious disadvantage of this source of information is that the user may
not be aware of the limitations of the data. The data may not contain an
important feature which is relevant to the problem the user wishes to
consider in his analysis. For instance, the National Bureau of Statistics may
publish gross domestic product (GDP) statistics, and these statistics may not
take into account the goods and services produced at home by full-time
housewives, and some other goods and services that do not pass through
the market system. These variables may be important in determining the
actual GDP figures. Hence an attempt to use these figures for meaningful
decisions despite the omission of these variables may result in
false conclusions.

To reduce the chance of drawing wrong conclusions from published data,
therefore, the original researcher involved in the data collection should
provide footnotes to explain fully the nature of such statistics so that the
user is not misled.

2.3 Observation

Observation is a method of scientific inquiry associated with data collection.
It is based on the regular and systematic observation of events or
experiments. In this method, the researcher is personally involved in
counting or observing the object of interest. Examples are counting the
number of cars in a park at a particular period, and the recent National
Census, which was conducted on the de facto method.

The de facto method means that the counting is done if and only if the
respondent (the person to be counted) is physically present.

Advantages

Observation method reduces the chance of incorrect information since the
data collection is carefully supervised in the field.

There is a high degree of reliability in decisions arising from the use of the
information obtained.

Disadvantages

It is usually costly in terms of the money and effort required to collect the
information. In some cases, it requires special training for those to be
involved in the research in order to record a high success in the field.

It is suitable for only a small fraction of the items we want to study.

It is not always easy to combine observation with random samples
(we shall come to random samples in due course). This is because the
researcher may have to move from one place to another in order to
cover the items included in the sample.

It gives too much chance to the influence of individual researchers. There
is no way such influence can be eliminated from the data even if the case
is established.

For instance, an enumerator involved in the national census can
demonstrate his own way of patriotism by altering the figures in favour of
his area for political and economic reasons, as this could be used to secure
positions in the government. To this extent, it is correct to say that the
enumerators have tremendous influence on the statistical figures collected.

2.4 Personal Interview

It is a method of reaching the people from whom information is sought by
personal contact. The method can take two forms: a
personal interview by the researcher or by trained enumerators. It involves
drawing up a questionnaire; the researcher or
enumerator carefully runs through the questions and records the
responses when in contact with the respondents.

The method of personal interview requires a great deal of courtesy on
the part of enumerators if the job is to be well accomplished. For this
reason enumerators are expected to display a lot of maturity in the choice
of language they use, respect for the cultural values of the area, the
traditional way of greeting and appreciation for good reception. Above all,
the enumerators are trained to fill in or complete the forms so as to
record the information from the respondents properly.

Advantages

To a large extent, the data collected can be reliable. This is not only because
the data are obtained from the primary source but also because the researcher has
the knowledge of the background of the data which is designed to suit the
area of his interest.

A high percentage of the information required is obtained from the field
since the refusal rate is drastically minimized by personal contact.

Disadvantages

The cost of the interview method is very high. It requires a considerable
amount of money to train and transport the enumerators to reach the
respondents for their interview.

One of the serious dangers of using enumerators for this exercise is that
they can influence the answers or ask misleading questions.

There is an obvious possibility of enumerators recording the replies wrongly.
All these would lead to inaccurate or false information from the field work
which may have a negative impact on the final result of the research work.

It gives room for personal issues to be discussed in the process of
interviewing which may in turn lead to disclosure of vital information
needed to be included in the respondent's responses.

2.5 Mail Questionnaire

At times, a researcher might choose to mail a questionnaire to the respondents
rather than use enumerators. This method of collecting data is one in
which a simple, unambiguous questionnaire is posted to individuals to be
completed and returned to the researcher.

Advantage

This method appears to be simple and small responses are involved. There
is no need to hire the services of enumerators and as such it is regarded as
a cost-saving device.

Disadvantages

This method is the least satisfactory and effective in the sense that only
relatively few of such questionnaires ever get back to the researcher. The
obvious reasons include, firstly, that the posted questionnaire might not get
to the respondent because of the poor postal service in the country.
Secondly, the completed questionnaire mailed by the respondent might
fail to reach the researcher for the same reason.

Moreover, some respondents might forget or refuse to complete the
questionnaire and return it to the researcher.

Hence, this method is not satisfactory and the information obtained by it is
of little use unless one of the following applies.

(i) Completion of the form is compulsory and even carries a legal
obligation.

(ii) Those who fail to respond by sending back their questionnaire still
have the opportunity of being interviewed by personal contact to
obtain the required information.

2.6 Telephone Method

This is a method in which telephone calls are used to collect data from a
chosen sample of telephone subscribers. It gives on the spot responses
from the telephone subscribers. This may be in the form of a radio or television
programme to conduct opinion polls in assessing the success or failure of a
programme or policy already in force.

This method is useful for special inquiries; however, it is open to
criticism since the sample of subscribers is
discriminatory and limited to those who own telephones. To this extent,
the resultant information is said to be biased.
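
The bias described above can be illustrated with a small simulation. Everything in it is an assumption made for the sketch: the incomes are invented, and telephone ownership is assumed to become more likely as income rises. Under those assumptions, an average computed only from telephone owners overstates the average for the whole population.

```python
import random

random.seed(7)  # arbitrary seed so that the sketch is repeatable

# Hypothetical population of 10,000 people. Each person has an income
# (arbitrary units) and may or may not own a telephone; by assumption,
# the chance of owning one rises with income.
people = []
for _ in range(10_000):
    income = random.uniform(10, 100)
    owns_phone = random.random() < income / 100
    people.append((income, owns_phone))


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


all_incomes = [income for income, _ in people]
phone_incomes = [income for income, owns in people if owns]

population_mean = mean(all_incomes)
telephone_mean = mean(phone_incomes)

print(f"mean income, whole population: {population_mean:.1f}")
print(f"mean income, telephone owners: {telephone_mean:.1f}")
```

The gap between the two means is the bias: it arises from who can be reached, not from how carefully the calls are made.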

2.7 Surveys

Survey is the systematic method of collecting data from a defined area. It
involves collection of necessary information that will present a good picture
of the situation. For instance, we have business surveys which cover all the
variables that normally affect the conduct of a business. If you want to go
into the production of wheat on a large scale, you may wish to know the
business environment that will guarantee the successful production of
wheat.

In this case, you might need to conduct a survey of wheat production
which will, among other things, help you to determine the profitable share of
the wheat market. Hence you may have to look at variables like the existing
price, the level of wheat production already in existence, the consumption
level, the import level and opportunities for financial markets. All these
factors are necessary for serious consideration before embarking on
expansion or establishment of wheat production.

2.8 Internet

Recent developments in information technology have led to the collection of
information through the internet. In this method, whenever an
organisation wants to collect data on a particular problem or issue, it uses
its website to post questions or opinions related to the issue under
investigation, so that any individual who visits the website can answer the
questions or express his/her opinion. This is the method that is currently
used by most media and other organisations to sample the opinions of
respondents.

Although it is easy to obtain responses, this method is restricted to
those who have internet facilities and happen to visit the
organisation's website at the time of data collection.
2.9 Census

This involves the procedure of counting all the items in the population. The
population could be human or non-human. The national census 2006, for
instance, made it possible for us to know the number of people by sex,
age, education, etc.

The National Census, apart from establishing the population of a country, is
most useful as a basis for planning and ensuring an equitable distribution of
the national resources.

Therefore, surveys and censuses are important procedures for generating
primary data, and the methods outlined above can be adopted in surveys or
censuses in an attempt to generate the required information or data for a
particular use.

2.10 SUMMARY

We have discussed the various methods of data collection. These include
documentary, observation, personal interview, mail questionnaire and
telephone method. For each of these methods we discussed the
advantages and disadvantages.

The choice of any method to be used in any research situation depends on
the cost and convenience with which the method can be applied.

Survey and census are distinguished: Survey is the systematic method of
collecting data in respect of an event or place such that a good picture of
the event or place can be brought into immediate focus. On the other hand
a census involves the procedure of counting all the items in the population.
The population could be defined in terms of human beings or non-human
beings.
2.11 EXERCISE

1. In which way is the method of observation superior to the documentary
method?

2. Discuss the advantages and disadvantages of the mail questionnaire
method of data collection.

3. What measures would you suggest to improve the mail questionnaire
method in Nigeria?

4. Assess the importance of personal interview method of data
collection in relation to other methods.

5. Why do you think census as a method of data collection should be
strictly used where possible?

3.0 PRESENTATION OF DATA: TABLES, CHARTS AND GRAPHS
3.1 FREQUENCY DISTRIBUTION TABLE

Frequency distribution presents yet another form of representation of


complex information or data in a clear and orderly manner. It also provides
a means of drawing an Ogive- that is a line graph drawn from a table of
frequency distribution. These methods are important as they are often
useful instruments in the conduct of researches.

Frequency distribution basically is a method of presenting a mass of data in


form of classes or categories and determining the number of items which
belongs to each class (this is also known as frequency).

From the frequency distribution table also, it is possible to obtain the


general characteristics of the data such as mean, mode, median, standard
deviation etc. Let us consider the following example:

Example 3.1:

The following data represent the weights of 20 students in a 100 level


statistics class of a university.

60 60 61 63 60 68 61 67 64 65

62 70 70 72 62 62 63 69 65 67

The data can be represented in a frequency distribution table as shown


below:

Table 3.1: Frequency Distribution.

WEIGHT (kg)    TALLY          NO. OF STUDENTS
60-63          ||||/ ||||/    10
64-67          ||||/          5
68-71          ||||           4
72-75          |              1

This table is called frequency distribution. It tells us how the weights of the
students have been distributed among the classes or groups.

Hence, a tabular arrangement of data by class with the corresponding class


frequency is called frequency distribution or frequency table. The
procedure for constructing a frequency distribution consists of 3 steps.

Step 1: Choosing the classes into which the data are to be grouped.

Step 2: Sorting out data by putting a check for each item into the
appropriate class called the tally method.

Step 3: Counting the number of observations that fall in each class.
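
The three steps above can be sketched in Python for the weights of Example 3.1. This is only an illustrative sketch: the names `weights`, `classes` and `frequencies` are our own, and the classes are the ones chosen in Table 3.1.

```python
# Weights of the 20 students from Example 3.1 (kg).
weights = [60, 60, 61, 63, 60, 68, 61, 67, 64, 65,
           62, 70, 70, 72, 62, 62, 63, 69, 65, 67]

# Step 1: choose the classes as (lower limit, upper limit) pairs.
classes = [(60, 63), (64, 67), (68, 71), (72, 75)]

# Steps 2 and 3: tally each observation into its class, then count.
frequencies = [sum(1 for w in weights if lower <= w <= upper)
               for lower, upper in classes]

for (lower, upper), f in zip(classes, frequencies):
    print(f"{lower}-{upper}: {f}")
```

Because the classes do not overlap, every weight is counted exactly once and the frequencies add up to 20.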

The choice of number of classes or groups is arbitrary and depends largely


on the use of the data. However, the following steps can be followed in
constructing frequency distribution.

1. You should decide on the number of classes you want to use


for the distribution.

2. You should decide on the range of the class. This is relevant


in the case of equal class intervals.

3. You should ensure that there is no overlapping between the


classes. This will make it difficult to trace an observation to a
specific class.

There is no general rule about the number of classes or groups to use for
classification, but as a practical rule of thumb a minimum of 4 classes and a
maximum of 15 classes suffice. The larger the number of classes, the more
precise our description of the data, but the more laborious the calculations
become.

It should be noted that it is not all the time necessary that the classes
should have equal intervals, but equal class intervals ease our calculations
from the distribution.
3.2.1 Class Interval and Class Limit

In example 3.1 the first class ranges from 60-63. This is called class
interval. The terminal numbers 60 and 63 are called class limits. The
smaller value, 60 is the lower class limit and the larger number, 63 is the
upper class limit. Hence the class limits are the lower and upper numbers
of a class interval. A class interval which has either no lower class limit or
upper class limit indicated is called an open class interval.

Consider the following frequency distribution table on the height of 30
female students in a mathematics class.

Height (cm)      Number of Students
Less than 50     3
51-55            10
56-60            7
61-65            6
66 & Above       4

Class intervals such as "Less than 50" and "66 & Above", which lack either a
lower or an upper class limit, are called open class intervals. An open class
interval has the advantage of accommodating a wide range of values. However,
it does not tell us how far below or above the stated limit the values in the
group lie. Furthermore, it makes it difficult to present the distribution in
the form of a graph, let alone to make calculations from it to describe the
data.
3.2.2 Class Boundaries

In example 3.1, the weights are measured to the nearest kg. The class
interval 60-63 theoretically includes all measurements from 59.5 to 63.5
kg. These figures i.e. 59.5 and 63.5 indicated are called class boundaries or
true class limits. The smaller number 59.5 is called the true lower class
limit and the larger number 63.5 the true upper class limit or boundary. In
practice, the class boundaries are obtained by adding the upper limit of
one class interval to the lower limit of the next higher class interval and
dividing by 2. For example 3.1 the class boundaries and frequency are
given as follows.

Table 3.2: Frequency Distribution table using class boundaries.

WEIGHT (Class boundaries)    NO. OF STUDENTS (f)
59.5-63.5                    10
63.5-67.5                    5
67.5-71.5                    4
71.5-75.5                    1

3.2.3 Class Size or Width

The class size or width is the difference between the lower and upper class
boundaries or true upper and true lower limits. The class mark is the mid-
point of the class interval and is defined as the mid-point between the class
boundaries. It is obtained by adding the true lower and upper limits and
dividing by 2. The class mark is another name for class mid-point. For the
purpose of further mathematical analysis, all observations belonging to a
class interval are assumed to coincide with the class mark.
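
As a rough sketch, the class boundaries, class width and class mark defined above can be computed for the classes of Example 3.1 as follows. The variable names are our own, and the code assumes equal gaps between consecutive classes (here 1 kg, because weights are recorded to the nearest kg).

```python
# Class limits from Table 3.1 (weights to the nearest kg).
classes = [(60, 63), (64, 67), (68, 71), (72, 75)]

# Gap between the upper limit of one class and the lower limit of the next.
gap = classes[1][0] - classes[0][1]

summary = []
for lower, upper in classes:
    lower_boundary = lower - gap / 2          # true lower limit
    upper_boundary = upper + gap / 2          # true upper limit
    width = upper_boundary - lower_boundary   # class size
    mark = (lower_boundary + upper_boundary) / 2  # class mid-point
    summary.append((lower_boundary, upper_boundary, width, mark))

for row in summary:
    print(row)
```

For the first class this gives boundaries 59.5 and 63.5, a width of 4 and a class mark of 61.5.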

Example 3.3: The following table shows Scores of 20 students in a


statistics test.

Table 3.2

Scores    Mid-Point    No of Students
50-56     53           3
57-63     60           4
64-70     67           10
71-77     74           3

Define the class size or width.

In making calculations from the above example, where all the intervals
have the same width in the distribution, the following rule may be applied
to find the required class interval width or size.

\[ \text{Class width} = \frac{LV - SV}{\text{No. of desired class intervals}} \]

Where LV represents the largest value and SV the smallest value. Since 4
class intervals are required, then the class width is given as:

\[ \text{Class width} = \frac{77 - 50}{4} = 6.75 \approx 7 \]

That is the width is 7. We can then go ahead to construct the class interval
with 7 as the class width while starting from the lowest value of 50.

In this case, the first class interval will be 50 – 56, etc. The class mark is
equally given in the second column. In further calculations involving
frequency distribution, we always assume that observations that fall within
a given interval coincide with the class mid-point.
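
The width rule can be sketched in Python as below. The function name `class_intervals` is hypothetical, and the sketch assumes whole-number data so that consecutive class limits differ by 1.

```python
import math

def class_intervals(smallest, largest, k):
    """Return the class width and k equal-width classes starting at smallest."""
    # Round the width up so that the largest value is still covered.
    width = math.ceil((largest - smallest) / k)
    intervals = [(smallest + i * width, smallest + (i + 1) * width - 1)
                 for i in range(k)]
    return width, intervals

width, intervals = class_intervals(50, 77, 4)
print(width)      # 7
print(intervals)  # [(50, 56), (57, 63), (64, 70), (71, 77)]
```

Rounding up rather than to the nearest integer is a design choice of this sketch: it guarantees that the last class reaches the largest value.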

3.2.4 Histograms

A histogram is a chart constructed from a frequency distribution table. It is


the most widely used chart for presenting frequency distribution. A
histogram is constructed on the following principles.

(i) A histogram or frequency histogram consists of a set of rectangles
with bases on a horizontal axis (x-axis), centres at the class
marks and widths equal to the class interval size.

(ii) The height of the rectangles is determined by the corresponding


class frequency. Therefore, histogram enables us to obtain a good
picture of a frequency distribution.

Example 3.4: Construct a histogram from the information given in the table
3.3 below.

Table 3.3 Scores of 20 students in an Economics Examination.

Scores    Mid-Point    No of Students
50-56     53           3
57-63     60           4
64-70     67           10
71-77     74           3

Solution:

1st step: Rewrite the frequency distribution making use of true lower and
upper class limits.

Table 3.4

Scores    True Lower Limit    Mid-Points    Frequency
50-56     49.5                53            3
57-63     56.5                60            4
64-70     63.5                67            10
71-77     70.5                74            3

2nd step: Plot the true lower limit against the frequency thus:

Fig. 3.1 Histogram of scores of 20 students in an economics exam.
[Figure: histogram with frequency on the vertical axis; the horizontal axis is
labelled "Class boundaries" and marked at 53, 60, 67 and 74.]

3.2.5 Frequency Polygon

A frequency polygon is a line graph of class frequency plotted against class


marks. It is obtained by the following procedure.

(i) Construct a histogram

(ii) Mark the mid-point on the top of each rectangle

(iii) Join the mid-points with straight lines

To complete the frequency polygon extra classes with zero frequency are
added to both ends of the frequency distributions. This ensures that the
resultant frequency polygon touches the x-axis.

Example 3.6:

Use the data in example 3.4 to construct a frequency polygon.

Solution:

Table 3.5 Scores of 20 students in an Economics Examination

Scores    True Lower Limit    Mid-Point    Frequency
43-49     42.5                46           0
50-56     49.5                53           3
57-63     56.5                60           4
64-70     63.5                67           10
71-77     70.5                74           3
78-84     77.5                81           0

The importance of the frequency polygon or histogram is that it shows how
observations cluster around their central value. It also makes the degree of
spread or dispersion of the observations clearly visible. The
frequency curve tends to be smoother if more observations are included in
the distribution.
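Fig. 3.2 Frequency polygon.
[Figure: frequency polygon with frequency on the vertical axis; the horizontal
axis is marked at 42.5, 49.5, 56.5, 63.5, 70.5, 77.5 and 84.5.]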

3.2.6 Relative Frequency Distribution

The relative frequency of a class is the frequency of the class divided by


the total frequency of all the classes. It is generally expressed in a
percentage. Hence, the sum of the relative frequencies of all the classes
must be equal to 1 or 100%.

Example 3.7: Obtain the relative frequency from the data given in the table
below:

Table 3.6 Scores of 50 students in Mathematics Test.

Scores    No of Students (f)
21-30     5
31-40     4
41-50     8
51-60     10
61-70     8
71-80     9
81-90     6
Total     50

Solution:

Table 3.7 Relative Frequency of table 3.6

Scores    No. of Students (f)    Relative Frequency (%)
21-30     5                      10
31-40     4                      8
41-50     8                      16
51-60     10                     20
61-70     8                      16
71-80     9                      18
81-90     6                      12
Total     50                     100

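
The relative frequencies of Example 3.7 can be reproduced with a short Python sketch. The dictionary layout and the names `frequencies` and `relative` are our own choices for illustration.

```python
# Frequencies from Table 3.6 (scores of 50 students).
frequencies = {"21-30": 5, "31-40": 4, "41-50": 8, "51-60": 10,
               "61-70": 8, "71-80": 9, "81-90": 6}

total = sum(frequencies.values())

# Relative frequency (%) = class frequency / total frequency * 100.
relative = {cls: 100 * f / total for cls, f in frequencies.items()}

for cls, pct in relative.items():
    print(f"{cls}: {pct:.0f}%")
```

The percentages sum to 100, which is a quick check that no class has been left out.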
3.2.7 Cumulative Frequency Polygon or ogive

Cumulative frequency polygon or ogive is a line graph obtained when the


cumulative frequencies of a distribution are plotted on the graph. It can be
more than Ogive or less than Ogive.

The total frequency of all the scores less than the upper class boundary of
a given class interval is called the cumulative frequency up to and including
the class interval.

Less than Ogive is obtained by plotting the true upper limits against
cumulative frequencies while more than Ogive is the resultant graph when
the true lower limits are plotted against the cumulative frequencies.

Example 3.8

Use the data above to construct cumulative frequency polygon (ogive)

Solution

Step 1: Construct the table of less than Ogive as follows:

Table 3.8

Scores Cumulative Frequency

Less than 20.5 0

Less than 30.5 5

Less than 40.5 9

Less than 50.5 17

Less than 60.5 27

Less than 70.5 35

Less than 80.5 44

Less than 90.5 50

The cumulative frequency up to and including the class interval 51-60 is
5 + 4 + 8 + 10 = 27, meaning that 27 students have scores less than 60.5.
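
A minimal sketch of how the cumulative frequencies could be computed in Python is given below. The "more than" column is our own illustration built from the same data; it is not tabulated in these notes.

```python
from itertools import accumulate

# Frequencies of the classes 21-30, 31-40, ..., 81-90 from Table 3.6.
frequencies = [5, 4, 8, 10, 8, 9, 6]
upper_boundaries = [30.5, 40.5, 50.5, 60.5, 70.5, 80.5, 90.5]
lower_boundaries = [20.5, 30.5, 40.5, 50.5, 60.5, 70.5, 80.5]

# "Less than" ogive: running total up to each true upper limit.
less_than = list(accumulate(frequencies))

# "More than" ogive: number of scores above each true lower limit.
total = sum(frequencies)
more_than = [total - c for c in [0] + less_than[:-1]]

for b, c in zip(upper_boundaries, less_than):
    print(f"Less than {b}: {c}")
for b, c in zip(lower_boundaries, more_than):
    print(f"More than {b}: {c}")
```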

3.3 ERRORS IN STATISTICS

In statistics, the word "error" is used to denote the difference between the
true value and the estimated or approximated value. In other words, "error"
refers to the difference between the true value of a population parameter
and its estimate provided by an appropriate sample statistic computed by
some statistical device. Thus, in statistics, the term error is used in a
different and much more restricted sense. It should be distinguished
from mistakes or inaccuracies which may be committed in the course of
making observations, counting, calculations, etc. These errors in statistics
arise due to a number of factors such as:

(i) Approximation in measurements, e.g., the heights of individuals


may be approximated to 10th of a centimetre, age may be
measured correct to the nearest month, weight may be measured
correct to 10th of a kilogram, distance may be measured correct to
the nearest metre and so on. Thus, in all such measurements there
is bound to be a difference between the observed value and the true
value.
(ii) Approximations in rounding of the figures to the nearest
hundreds, thousands, millions, etc., or in the rounding of decimals.

(iii) The biases due to faulty collection and analysis of the data and
biases in the presentation and interpretation of the results

(iv) Personal biases of the investigators, and so on.

3.3.1 MEASURES OF STATISTICAL ERRORS (Absolute and Relative


Errors)

A measure of the statistical errors is provided by absolute and relative


errors.
Absolute error. An absolute error [A.E.] is the difference between the true
value (a) of any particular observed item or variable and its estimated or
approximated value (e). Symbolically, we may write: AE = | a - e |, i.e.,
the modulus value of (a - e). For example, if the value 54.87350 is
approximated to the nearest whole number, it can be taken as 55. Thus,

AE = | a - e | = |54.87350 – 55.00000| = | -0.1265 | = 0.1265

Relative error. A relative error [R.E.] is defined as the ratio of the


absolute error to the true or actual value. Symbolically

RE = AE/a = |a - e|/a

Thus in the above example RE = 0.1265/54.87350 = 0.00230.
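
The two measures can be checked with a small Python sketch. The function names `absolute_error` and `relative_error` are hypothetical and chosen only for this illustration.

```python
def absolute_error(true_value, estimate):
    """AE = |a - e|."""
    return abs(true_value - estimate)

def relative_error(true_value, estimate):
    """RE = AE / a, the absolute error as a fraction of the true value."""
    return absolute_error(true_value, estimate) / abs(true_value)

ae = absolute_error(54.87350, 55)
rel = relative_error(54.87350, 55)
print(round(ae, 4))   # 0.1265
print(round(rel, 4))  # 0.0023
```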

3.4 SUMMARY

Frequency distribution is formed when data are put into classes or


categories with their corresponding frequencies. The terminal numbers of a
given class defines the class limits. The smaller value is the lower class
limit while the larger value is the upper class limit.
A histogram is a chart constructed from a frequency distribution. A histogram
consists of a set of rectangles with bases on a horizontal axis whose
centres coincide with the class marks.
A frequency polygon is a line graph of class frequency plotted against the class
mark of the respective classes, while an ogive is a line graph which results
when cumulative frequencies are plotted against the class boundaries.

3.5 EXERCISE

1. What are the advantages and disadvantages of open class intervals?


2. Given the following frequency distribution, construct a histogram.

Values of Order (Nm)    No. of orders
5-9                     6
10-14                   12
15-19                   19
20-29                   33
30-34                   8
35-39                   2

3. The following data shows the lengths of 40 trees recorded to the nearest
millimetre.

138 164 150 132 144 125 149 157

146 158 140 147 136 148 152 144

168 126 138 176 163 119 154 165

146 173 142 147 135 153 140 135

161 145 135 142 150 156 145 128

(a) Construct a frequency distribution table using five class intervals.


(b) Using the frequency distribution in (a) above construct a histogram.
(c) Also, construct a “less than and more than” ogive using the
frequency distribution in (a).

4.0 MEASURES OF CENTRAL TENDENCY (OR LOCATION)
4.1 INTRODUCTION

The most important characteristic that describes or summarizes a set of


data is its location. The location which is also known as average gives a
value which is typical or representative of a set of data. Such typical values
tend to lie centrally within a set of data arranged according to magnitude.
There are three primary measures of central tendency; they are the
arithmetic mean, median and mode. These would be computed for both
simple (ungrouped) and grouped data.

4.2 The Arithmetic Mean or Mean

The Mean is the measure of location commonly known as the average.

We now find the mean for simple (ungrouped) and for grouped data using the
direct method.

4.2.1 Mean for simple (ungrouped data)

The mean for a set of n numbers x1, x2, …, xn is denoted by \( \bar{X} \)
(read "x bar") and is defined as:

\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n} \]

Example 4.1: Find the mean of the numbers 9, 4, 6, 13, 11.

Solution:

\( n = 5, \quad \sum_{i=1}^{n} X_i = 43 \)

\[ \bar{X} = \frac{9 + 4 + 6 + 13 + 11}{5} = \frac{43}{5} = 8.6 \]

4.2.2 Mean for grouped data.

If the n numbers x1, x2, …, xn occur with frequencies f1, f2, …, fn
respectively, the mean is defined by:

\[ \bar{X} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_n X_n}{f_1 + f_2 + \cdots + f_n} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} \]

Example 4.2: The following table shows the number of oranges picked by
twenty students in the school garden.

No of oranges (x) 0 1 2 3 4 5

No. of students (f) 2 5 6 4 2 1

Find the mean number of oranges picked.

Solution:

\[ \bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} \]

X     0    1    2     3     4    5    Total
f     2    5    6     4     2    1    20
fX    0    5    12    12    8    5    42

Therefore, \( \bar{X} = \frac{42}{20} = 2.1 \)

Example 4.3: The table below is the frequency distribution of distances (in
kilometres) from the home of 50 students to their school.

Distances (x)    Number of Students (f)
0-4              2
5-9              3
10-14            4
15-19            10
20-24            17
25-29            8
30-34            4
35-39            2

Find the mean of the distribution.

Solution:

The mean can be obtained using the formula

\[ \bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} \]

Class Intervals    Mid-points (x)    f     fx
0-4                2                 2     4
5-9                7                 3     21
10-14              12                4     48
15-19              17                10    170
20-24              22                17    374
25-29              27                8     216
30-34              32                4     128
35-39              37                2     74
Total                                50    1035

Therefore, the mean \( \bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} = \frac{1035}{50} = 20.7 \)
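
The grouped-mean calculations of Examples 4.2 and 4.3 can be sketched in Python as follows. The helper name `grouped_mean` is our own; for class intervals, the class mid-points stand in for the individual observations, as assumed throughout these notes.

```python
def grouped_mean(values, frequencies):
    """Mean of values weighted by their frequencies: sum(f*x) / sum(f)."""
    return sum(f * x for x, f in zip(values, frequencies)) / sum(frequencies)

# Example 4.2: oranges picked by twenty students.
oranges_mean = grouped_mean([0, 1, 2, 3, 4, 5], [2, 5, 6, 4, 2, 1])

# Example 4.3: distances in class intervals, represented by their mid-points.
classes = [(0, 4), (5, 9), (10, 14), (15, 19),
           (20, 24), (25, 29), (30, 34), (35, 39)]
midpoints = [(lower + upper) / 2 for lower, upper in classes]
distance_mean = grouped_mean(midpoints, [2, 3, 4, 10, 17, 8, 4, 2])

print(oranges_mean)   # 2.1
print(distance_mean)  # 20.7
```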

4.3 The Median

The median as a measure of central tendency tries to locate the middle of


a distribution that is ordered.
4.3.1 The median for simple data.

The median for simple data is the middle value in an ordered array
of numbers (if the number of observations is odd) or the arithmetic mean of
the two middle values (if the number of observations is even).

Example 4.6:

Find the median of the following observations.

(i) 18, 11, 4, 7, 17, 10, 5, 33, 12 (ii) 9, 3, 6, 10, 7, 4.

Solution:

(i) Arrange the data in order of magnitude, i.e. 4, 5, 7, 10, 11, 12, 17,
18, 33. Since the number of observations is odd, the middle value, 11,
is the median.

(ii) Here also we arrange the data in order, that is, 3, 4, 6, 7, 9, 10. But
the number of observations is even. Therefore the median is the mean of the
two middle values, i.e., \( \frac{6 + 7}{2} = \frac{13}{2} = 6.5 \) = median.
4.3.2 The median for grouped data.

If the n observations x1, x2, …, xn occur with frequencies f1, f2, …, fn
respectively, the median is the \( \frac{\sum_{i=1}^{n} f_i + 1}{2} \)th ordered observation.

And if the grouped data are in class intervals the median is given by

\[ \text{Median} = L_1 + \left( \frac{\frac{N}{2} - \left(\sum f\right)_l}{f_{\text{median}}} \right) c \]

Where, L1 is the lower class boundary of the median class,

N is the number of items in the data (total frequency),

\( \left(\sum f\right)_l \) is the sum of frequencies of all classes lower than the
median class,

f_median is the frequency of the median class,

c is the median class size.

Also, the median class is the class containing the
\( \frac{\sum_{i=1}^{n} f_i + 1}{2} \)th ordered observation.

Example 4.7: Find the median of the following distributions.

(i)
X    0    1    2    3    4    5
F    2    5    6    4    2    1

(ii)
Class interval    Frequency (f)
0-4               2
5-9               3
10-14             4
15-19             10
20-24             17
25-29             8
30-34             4
35-39             2

Solution:

(i) Here the median is the \( \frac{\sum_{i=1}^{n} f_i + 1}{2} \)th ordered observation.

X    0    1    2    3    4    5    Total
f    2    5    6    4    2    1    20

\( \sum_{i=1}^{n} f_i = 20 \)

The median is the \( \frac{20 + 1}{2} = \frac{21}{2} = 10.5 \)th ordered observation, and this
corresponds to 2 (that is, the cumulative frequency 2 + 5 + 6 = 13 first
reaches 10.5 at X = 2), hence median = 2.

(ii)
Class Interval    F
0-4               2
5-9               3
10-14             4
15-19             10
20-24             17
25-29             8
30-34             4
35-39             2

\[ \text{Median} = L_1 + \left( \frac{\frac{N}{2} - \left(\sum f\right)_l}{f_{\text{median}}} \right) c \]

The median class is the class containing the \( \frac{\sum_{i=1}^{n} f_i + 1}{2} \)th ordered observation.

This is the \( \frac{50 + 1}{2} = \frac{51}{2} = 25.5 \)th ordered observation, and this corresponds to the
class interval 20-24 (i.e. 2 + 3 + 4 + 10 + 17 = 36).

Therefore, the median class is 20-24.

L1 = 19.5, \( \left(\sum f\right)_l = 19 \) (i.e. 2 + 3 + 4 + 10),

f_median = 17, \( N = \sum_{i=1}^{n} f_i = 50 \), and c = 24.5 – 19.5 = 5 (the difference
between the class boundaries).

\[ \text{Median} = 19.5 + \left( \frac{\frac{50}{2} - 19}{17} \right) 5 = 19.5 + \left( \frac{25 - 19}{17} \right) 5 = 19.5 + \frac{6}{17} \times 5 = 19.5 + 1.76 = 21.26 \approx 21 \]
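
A minimal Python sketch of the grouped-median formula is given below. The function name `grouped_median` is hypothetical. The class size c is taken as the difference between the class boundaries (24.5 – 19.5 = 5 for the class 20-24), following the definition of class size in section 3.2.3, and equal gaps between consecutive classes are assumed.

```python
def grouped_median(classes, frequencies):
    """Median = L1 + ((N/2 - cumulative f below median class) / f_median) * c."""
    n = sum(frequencies)
    gap = classes[1][0] - classes[0][1]   # gap between consecutive class limits
    cumulative = 0                        # sum of frequencies below current class
    for (lower, upper), f in zip(classes, frequencies):
        if cumulative + f >= n / 2:       # this class contains the middle item
            lower_boundary = lower - gap / 2
            c = (upper + gap / 2) - lower_boundary
            return lower_boundary + (n / 2 - cumulative) / f * c
        cumulative += f

classes = [(0, 4), (5, 9), (10, 14), (15, 19),
           (20, 24), (25, 29), (30, 34), (35, 39)]
frequencies = [2, 3, 4, 10, 17, 8, 4, 2]
median = grouped_median(classes, frequencies)
print(round(median, 2))  # 21.26
```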
4.4 The Mode

The mode of a set of numbers is that value which occurs with the greatest
frequency.
4.4.1 The Mode for simple data

The mode for simple data is the most common value among the
observations or the one with the highest frequency.

However, it should be noted that mode may not exist and even if it does
exist it may not be unique.

Example 4.8: Find the mode of the following data sets:

(i) 2, 2, 4, 4, 5, 6, 6 (ii) 2, 2, 3, 3, 6, 6, 6, 8, 8, 8, 9

Solution:

(i) The values 2, 4 and 6 each occur twice, so the mode is not unique
(the modes are 2, 4 and 6).

(ii) Mode = 6 and 8, i.e. two modes (bimodal).


4.4.2 Mode for grouped data

When our data sets are grouped the mode is obtained using the formula:

\[ \text{Mode} = L_1 + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) c \]

Where L1 = Lower class boundary of the modal class

Δ1 = the difference between modal frequency and


frequency before modal class.

Δ2 = the difference between the modal frequency and


frequency after the modal class

c = the class size of the modal class and modal class is the
class with the highest frequency.
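
The formula can be sketched in Python as shown below, using the distance distribution of Example 4.3 as input. The function name `grouped_mode` is our own, and the class size c is taken as the difference between the class boundaries (24.5 – 19.5 = 5 for the class 20-24), in line with section 3.2.3. A missing neighbouring class is treated as having zero frequency, which is an assumption of this sketch.

```python
def grouped_mode(classes, frequencies):
    """Mode = L1 + (d1 / (d1 + d2)) * c for the class with the highest frequency."""
    i = frequencies.index(max(frequencies))      # position of the modal class
    gap = classes[1][0] - classes[0][1]          # gap between class limits
    lower, upper = classes[i]
    lower_boundary = lower - gap / 2
    c = (upper + gap / 2) - lower_boundary       # class size from boundaries
    before = frequencies[i - 1] if i > 0 else 0
    after = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    d1 = frequencies[i] - before                 # modal minus preceding frequency
    d2 = frequencies[i] - after                  # modal minus following frequency
    return lower_boundary + d1 / (d1 + d2) * c

classes = [(0, 4), (5, 9), (10, 14), (15, 19),
           (20, 24), (25, 29), (30, 34), (35, 39)]
frequencies = [2, 3, 4, 10, 17, 8, 4, 2]
mode = grouped_mode(classes, frequencies)
print(mode)  # 21.6875
```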

Example 4.9: From the following frequency distribution, find the mode of
the distribution.
Class Interval    F
0-4               2
5-9               3
10-14             4
15-19             10
20-24             17
25-29             8
30-34             4
35-39             2

Solution:

\[ \text{Mode} = L_1 + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) c \]

The modal class is 20-24, which has the highest frequency of 17.

Therefore, L1 = 19.5, Δ1 = 17 – 10 = 7, Δ2 = 17 – 8 = 9 and
c = 24.5 – 19.5 = 5 (the difference between the class boundaries).

\[ \text{Mode} = 19.5 + \left( \frac{7}{7 + 9} \right) 5 = 19.5 + \frac{35}{16} = 19.5 + 2.19 = 21.69 \approx 22 \]

4.5 SUMMARY

The mean is defined by

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \quad \text{or} \quad \bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} \]

for simple and grouped data respectively.

The median is the middle value in an ordered array containing an odd number
of values, or the mean of the two middle values if the number of values is
even. For grouped data the median is given by:

\[ \text{Median} = L_1 + \left( \frac{\frac{N}{2} - \left(\sum f\right)_l}{f_{\text{median}}} \right) c \]

The mode of a set of numbers is that value which is the most common, i.e.
the one with the highest frequency. For grouped data, the mode is given
by

\[ \text{Mode} = L_1 + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) c \]
4.6 EXERCISES

1. The mean of the number 5, 2, 3, x, 9 is 4.8. Find the value of x and


state the mode of the five numbers.

2. Given the following observations: 37.5, 25.5, 30.5, 41.5, 52.5,


28.25, and 30.5. calculate:

(i) The mean (ii) Median and (iii) mode.

3. The table below shows the number of people in the families living in
the houses of Phase 1, Gwagwalada.
Size of Family 1 2 3 4 5 6 7
Frequency 4 11 25 37 31 10 2

(a) Find the mean of the distribution

(b) Find the median of the distribution

(c) Find the mode of the distribution

4. Given the following frequency distribution

X 0–4 5–9 10–14 15–19 20–24 25–29 30–34 35–39 40–44


f 1 4 6 7 10 12 6 2 1

(a) Find the mean of the distribution directly(and using assumed mean)

(b) Find the median of the distribution

(c) Find the mode of the distribution

CHAPTER 5
MEASURES OF DISPERSION

5.1 INTRODUCTION

Measures of dispersion measure the degree to which numerical data tend
to spread about an average value or mean. There are several types of
measures of dispersion. The most common ones are (1) Range (2) Mean
deviation (3) Variance and (4) Standard deviation. These are discussed in
the following subsections.
5.2 The Range

This is the simplest and easiest measure of dispersion. For simple data, it is
defined as the difference between the largest and smallest numbers in the
data set. However, when the data are grouped into class
intervals, the range is approximated as the difference between the upper
boundary of the largest class and the lower boundary of the smallest class.

Example 5.1: Find the range of the following data set:

(i) 1, 1, 3, 3, 4, 4, 5, 6, 10, 12

(ii)

Class Interval    f
10-14             5
15-19             10
20-24             15
25-29             9
30-34             6

Solution:

(i) Range = largest number minus smallest number. The largest
number is 12 and the smallest is 1. Therefore, Range = 12 – 1 = 11.

(ii) Here, Range = upper boundary of the largest class (34.5) minus
lower boundary of the smallest class (9.5).

Therefore, Range = 34.5 – 9.5 = 25.

5.3 The Mean deviation

The mean deviation (M.D.) of a set of n numbers x1, x2, …, xn is defined by:

\[ \text{M.D.} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \]

if the data set is simple, where \( \bar{x} \) is the mean and \( |x_i - \bar{x}| \) is the
absolute value (modulus) of the deviation of x_i from \( \bar{x} \).

However, for grouped data the mean deviation is defined by

\[ \text{M.D.} = \frac{\sum_{i=1}^{n} f_i |x_i - \bar{x}|}{k}, \quad \text{where } k = \sum_{i=1}^{n} f_i, \]

the f_i are the frequencies and the x_i their corresponding midpoints. Recall
that \( \bar{x} \) is the mean of the distribution.

Example 5.2: Find the mean deviation of the following data set:

(i) 1, 1, 3, 3, 4, 4, 5, 6, 10, 12

(ii)
Class Interval    F
10-14             5
15-19             10
20-24             15
25-29             9
30-34             6

Solution:

(i) \[ \text{M.D.} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \]

Now, \( \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{1 + 1 + 3 + \cdots + 12}{10} = \frac{49}{10} = 4.9 \)

x_i              1      1      3      3      4      4      5     6     10    12    Total
x_i – x̄         –3.9   –3.9   –1.9   –1.9   –0.9   –0.9   0.1   1.1   5.1   7.1
|x_i – x̄|       3.9    3.9    1.9    1.9    0.9    0.9    0.1   1.1   5.1   7.1   26.8

Therefore, \( \sum_{i=1}^{n} |x_i - \bar{x}| = 26.8 \), n = 10, and so
\( \text{M.D.} = \frac{26.8}{10} = 2.68 \)

(ii)
Class Interval    f_i    x_i    f_i x_i    x_i – x̄    |x_i – x̄|    f_i |x_i – x̄|
10-14             5      12     60         –10         10            50
15-19             10     17     170        –5          5             50
20-24             15     22     330        0           0             0
25-29             9      27     243        5           5             45
30-34             6      32     192        10          10            60
Total             45            995                                  205

Now, \( \bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} = \frac{995}{45} = 22.11 \approx 22 \)

\[ \text{M.D.} = \frac{\sum_{i=1}^{n} f_i |x_i - \bar{x}|}{\sum_{i=1}^{n} f_i} = \frac{205}{45} = 4.56 \]
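
Both parts of Example 5.2 can be checked with a short Python sketch. The function name `mean_deviation` and its `centre` parameter are our own. For the grouped data the mean is passed as 22, the rounded value used in the worked solution; using the unrounded mean 22.11 would give a slightly different answer.

```python
def mean_deviation(values, frequencies=None, centre=None):
    """Mean absolute deviation of values about centre (default: their mean)."""
    if frequencies is None:
        frequencies = [1] * len(values)      # simple data: each value counts once
    total = sum(frequencies)
    if centre is None:
        centre = sum(f * x for x, f in zip(values, frequencies)) / total
    return sum(f * abs(x - centre) for x, f in zip(values, frequencies)) / total

# (i) simple data
md_simple = mean_deviation([1, 1, 3, 3, 4, 4, 5, 6, 10, 12])

# (ii) grouped data: class mid-points with their frequencies, mean rounded to 22
md_grouped = mean_deviation([12, 17, 22, 27, 32], [5, 10, 15, 9, 6], centre=22)

print(round(md_simple, 2))   # 2.68
print(round(md_grouped, 2))  # 4.56
```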

5.4 The Variance and Standard deviation

These are the most generally used measures of dispersion. The mean of
the squared deviations provides a quantity known as the variance (often
indicated by Var or V). The square root of the variance is known as the
standard deviation. It is usually abbreviated as S.D. or represented by σ
(sigma).

5.4.1 Variance and standard deviation of a simple data

Given a set of n numbers x1, x2, …, xn with mean \( \bar{x} \), the variance of the
set of numbers is defined by:

\[ \text{Var} = \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}, \]

while the standard deviation is given by:

\[ \text{SD} = \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \]

Example 5.3: Find the standard deviation of the following data set. 1, 1, 3,
3, 4, 4, 5, 6, 10, 12.

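Solution: \( \bar{x} = 4.9 \)

x_i              1       1       3      3      4      4      5      6      10      12      Total
(x_i – x̄)       –3.9    –3.9    –1.9   –1.9   –0.9   –0.9   0.1    1.1    5.1     7.1
(x_i – x̄)^2     15.21   15.21   3.61   3.61   0.81   0.81   0.01   1.21   26.01   50.41   116.9

Therefore, \( \text{Variance} = \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{116.9}{10} = 11.69 \)

and the standard deviation is \( \sigma = \sqrt{11.69} \approx 3.42 \).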
5.4.2 Variance and standard deviation of a grouped data

If x1, x2, …, xn occur with frequencies f1, f2, …, fn respectively, the
variance is given by:

\[ \text{Var} = \sigma^2 = \frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{k}, \quad \text{where } k = \sum_{i=1}^{n} f_i \]

and the standard deviation is given by:

\[ \text{SD} = \sigma = \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{k}}, \quad \text{where } k = \sum_{i=1}^{n} f_i \]

Example 5.4: Given the following frequency distribution table:

Class Interval    f
10-14             5
15-19             10
20-24             15
25-29             9
30-34             6

Find the standard deviation.

Solution

Class Interval    f_i    x_i    f_i x_i    (x_i – x̄)    (x_i – x̄)^2    f_i (x_i – x̄)^2
10-14             5      12     60         –10           100             500
15-19             10     17     170        –5            25              250
20-24             15     22     330        0             0               0
25-29             9      27     243        5             25              225
30-34             6      32     192        10            100             600
Total             45            995                                      1575

Because \( \bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i} = \frac{995}{45} = 22.11 \approx 22 \),

\[ \text{Variance} = \sigma^2 = \frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} f_i} = \frac{1575}{45} = 35 \]

and

\[ \text{SD} = \sigma = \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} f_i}} = \sqrt{35} \approx 5.92 \]
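
The variance and standard deviation formulas of this section can be sketched in Python as follows. The function name `variance` and its `centre` parameter are hypothetical. For the grouped data of Example 5.4 the mean is passed as 22, the rounded value used in the worked solution.

```python
import math

def variance(values, frequencies=None, centre=None):
    """Population variance: sum of f * (x - centre)^2 divided by sum of f."""
    if frequencies is None:
        frequencies = [1] * len(values)      # simple data: each value counts once
    total = sum(frequencies)
    if centre is None:
        centre = sum(f * x for x, f in zip(values, frequencies)) / total
    return sum(f * (x - centre) ** 2 for x, f in zip(values, frequencies)) / total

# Simple data of Example 5.3
var_simple = variance([1, 1, 3, 3, 4, 4, 5, 6, 10, 12])
sd_simple = math.sqrt(var_simple)

# Grouped data of Example 5.4 (class mid-points, mean rounded to 22)
var_grouped = variance([12, 17, 22, 27, 32], [5, 10, 15, 9, 6], centre=22)
sd_grouped = math.sqrt(var_grouped)

print(round(var_simple, 2), round(sd_simple, 2))    # 11.69 3.42
print(round(var_grouped, 2), round(sd_grouped, 2))  # 35.0 5.92
```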

5.5 SKEWNESS AND KURTOSIS
The two measures viz., central tendency (concentration of the observations
about the middle of the distribution) and dispersion (the spread or scatter
of the observations about some measures of central tendency) are
inadequate to characterise a distribution completely. Two distributions may
have the same mean and standard deviation, yet they may give different
histograms. To determine nature and composition of frequency
distributions Skewness and Kurtosis are used. The four measures viz.,
central tendency, dispersion, skewness and kurtosis are sufficient to
describe a frequency distribution completely.

5.5.1 SKEWNESS
Skewness literally means lack of symmetry. It helps us to determine the nature and
extent of the concentration of the observations towards the higher or lower
values of the variable. Generally, a distribution is said to be skewed if the
frequency curve of the distribution or histogram is not a symmetric bell-shaped
curve but is stretched more to one side than to the other, or if the
values of the mean, median and mode fall at different points, i.e. they do not
coincide. One measure of skewness (Sk) is Sk = Mean – Median or
Sk = Mean – Mode.
Other measures of skewness include Karl Pearson's Coefficient of
Skewness, Bowley's Coefficient of Skewness, etc.

5.5.2 KURTOSIS
While skewness helps us in identifying the right or left tails of the
frequency curve, kurtosis gives us an idea about the shape and
nature of the hump (middle part) of a frequency distribution. In other
words, kurtosis is concerned with the flatness or peakedness of the
frequency curve.
A curve which is neither flat nor peaked is known as a normal curve, and the
shape of its hump is accepted as the standard. Curves with humps of the form
of the normal curve are said to have normal kurtosis and are termed
meso-kurtic. Curves which are more peaked than the normal curve are
known as lepto-kurtic and are said to possess kurtosis in excess, or to have
positive excess kurtosis. On the other hand, curves which are flatter than the
normal curve are called platy-kurtic and are said to lack kurtosis, or to
have negative excess kurtosis.
As a measure of kurtosis, Karl Pearson gave the coefficient Beta two (β2):

\[ \beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{\mu_4}{\sigma^4} \]

where µ2 and µ4 are the second and fourth moments about the mean. For a
normal or meso-kurtic curve, β2 = 3. For a lepto-kurtic curve β2 > 3
and for a platy-kurtic curve β2 < 3.
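
As an illustrative sketch, β2 can be computed from the moments about the mean as below. The function names `central_moment` and `beta_two` are our own, and the data set is the one used in Example 5.3.

```python
def central_moment(data, r):
    """r-th moment about the mean: average of (x - mean)^r."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** r for x in data) / len(data)

def beta_two(data):
    """Pearson's kurtosis coefficient: mu4 / mu2^2."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

data = [1, 1, 3, 3, 4, 4, 5, 6, 10, 12]
b2 = beta_two(data)

# Compare with the normal-curve value of 3 to classify the hump.
if b2 > 3:
    print("lepto-kurtic", b2)
elif b2 < 3:
    print("platy-kurtic", b2)
else:
    print("meso-kurtic", b2)
```

Note that `central_moment(data, 2)` is the variance of section 5.4.1, so the denominator is σ^4 as in the formula.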

5.6 Summary

Measure of dispersion measures the degree to which numerical data tend


to spread about the mean.

5.7 EXERCISES

(i) Given the following observations: 10, 11, 13, 12, 16, 9, 15, 17, find

1. The range

2. Mean deviation

3. Standard deviation

(ii) Given the following Table

X 0 1 2 3 4

F 4 11 25 30 35

Find

1. The range

2. Mean deviation

3. Standard deviation

(iii) From the following frequency distribution table

Class Interval    f
0-4               1
5-9               4
10-14             6
15-19             7
20-24             10
25-29             12
30-34             6
35-39             2
40-44             1

Find

1. The range

2. The mean deviation

3. The standard deviation. Using an assumed mean method and


otherwise.

CHAPTER 6
SIMPLE LINEAR REGRESSION
6.1 INTRODUCTION

In our previous chapters we have primarily focused upon a single variable


of interest such as the distribution of the random variable x (number of
heads in a coin tossing experiment). In this and the following chapter the
problems involving two or more variables as a means of viewing the
relationships that exist between them would be considered. Here the
technique of Regression analysis is discussed.
6.2 Dependent and Independent Variables
In nature certain phenomena of interest (variables) tend to have a
relationship of dependence among them. For example, it is natural to
suspect a relationship between income and expenditure. The more income
earned, the greater the tendency to spend. Variables of this nature are
often referred to as the independent variable and the dependent variable.
Expenditure is regarded as the dependent variable because without income it
cannot exist, whereas income is regarded as the independent variable because
it can exist without expenditure. That is, income is independent of
expenditure but expenditure depends on income. Often the independent
variable is represented by x and the dependent variable by y.
6.3 SCATTER DIAGRAM
This is used in plotting the values of the independent and dependent
variables in a two-dimension graph. Each value is plotted at its particular x
and y coordinates.
Example 16.1:
The following table shows the income and expenditure of 10 employees of
the University of Abuja per annum in hundreds of thousands.

Employee Income Expenditure


1 10 5
2 20 10
3 30 20
4 40 25
5 50 30
6 60 35
7 70 40
8 80 45
9 90 50
10 100 60

Plot the income and expenditure of the employees in a scatter diagram.


Solution:
If we denote the income by x and expenditure by y then we plot the values
on a scatter diagram as follows:

Fig 16.1 Scatter diagram
[Figure: scatter diagram of expenditure (vertical axis, 0 to 60) against
income (horizontal axis, 0 to 100).]

Note:
A quick scan of figure 16.1 appears to indicate that employees with high
income spend higher, that is, there is a linear relationship between income
and expenditure. The question that will be examined next is how the
existence of a linear relationship can provide a better prediction of the
dependent variable y. This is achieved by the use of regression analysis.
6.4 REGRESSION ANALYSIS
Regression analysis is used for the purpose of prediction. In the scatter
diagram plotted in figure 16.1, the relationship between the variables
(income and expenditure) was observed to be roughly a straight line, or
linear, relationship. In general, however, the nature of the
relationship can take many forms, ranging from simple mathematical
functions to extremely complicated ones.
The simplest relationship consists of a straight line or linear relationship of
the type in figure 16.1. The simple linear regression model is our major
interest in this chapter.

6.4.1 The Simple Linear Regression


The straight-line (linear) model can be represented as
Yi = β0 + β1Xi, for i = 1, 2, ..., n
where β0 is the Y intercept for the population, a constant term in the
equation. That is, it represents the value of Y when X = 0.
β1 is the true slope for the population, representing the change in Y (∆Y)
per unit change in X (∆X). That is, it represents the amount by which Y
changes (either positively or negatively) for a one-unit change in X.
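The meaning of the intercept and the slope can be illustrated with a short Python sketch. The coefficient values 2.0 and 0.5 and the function name `line` are hypothetical, chosen only for illustration; they do not come from any data in these notes.

```python
def line(intercept: float, slope: float, x: float) -> float:
    """Value of Y on the straight line Y = intercept + slope * X."""
    return intercept + slope * x


# Hypothetical coefficients, assumed only for this illustration.
intercept, slope = 2.0, 0.5

# The intercept is the value of Y when X = 0.
y_at_zero = line(intercept, slope, 0.0)

# The slope is the change in Y when X increases by one unit (here from 4 to 5).
change_per_unit = line(intercept, slope, 5.0) - line(intercept, slope, 4.0)

print(y_at_zero, change_per_unit)
```

Whatever starting value of X is used, a one-unit increase in X changes Y by the same amount, which is what makes the relationship linear.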

6.4.2 Least-Squares Method


If the following assumptions are valid:
a. Normality
b. Homoscedasticity
c. Independence of errors
then the sample Y intercept (b0) and the sample slope (b1) can be used as
estimates of the respective population parameters (β0 and β1). Thus, the
sample regression equation representing the straight-line regression model
is:

ŷi = b0 + b1Xi, for i = 1, 2, ..., n

where ŷi is the predicted value of Y for observation i.
Predicting Y with this equation requires two coefficients: b0, the Y
intercept, and b1, the slope. Once b0 and b1 are obtained the straight line
is known and, if need be, can be plotted on the scatter diagram.
The mathematical technique that determines the values of b0 and b1 that
best fit the observed data is known as the least-squares method. The fit is
best in the sense that the sum of the squared differences between the
observed Yi and the predicted ŷi is as small as possible.
In using the least-squares method, we obtain the following two equations,
called the normal equations.
$$\sum_{i=1}^{n} Y_i = n b_0 + b_1 \sum_{i=1}^{n} X_i \qquad (1)$$

$$\sum_{i=1}^{n} X_i Y_i = b_0 \sum_{i=1}^{n} X_i + b_1 \sum_{i=1}^{n} X_i^2 \qquad (2)$$

Since there are two equations with two unknowns, we can solve these
equations simultaneously for b0 and b1 as follows:

$$b_1 = \frac{n\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} \qquad (3)$$

$$b_0 = \frac{\sum_{i=1}^{n} Y_i}{n} - b_1 \frac{\sum_{i=1}^{n} X_i}{n} = \bar{Y} - b_1 \bar{X} \qquad (4)$$
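The step from the normal equations (1) and (2) to the solutions (3) and (4) can be sketched as follows. This is the usual elimination argument, written with the same symbols as the notes; every sum runs over i = 1 to n.

```latex
\begin{align*}
\text{From (1):}\quad
  & b_0 = \frac{\sum Y_i}{n} - b_1\frac{\sum X_i}{n} = \bar{Y} - b_1\bar{X} \\
\text{Substituting into (2):}\quad
  & \sum X_i Y_i = \left(\frac{\sum Y_i}{n} - b_1\frac{\sum X_i}{n}\right)\sum X_i
    + b_1 \sum X_i^2 \\
\text{Multiplying by } n:\quad
  & n\sum X_i Y_i = \sum X_i \sum Y_i
    + b_1\left[\,n\sum X_i^2 - \left(\sum X_i\right)^2\right] \\
\text{Hence:}\quad
  & b_1 = \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}
               {n\sum X_i^2 - \left(\sum X_i\right)^2}
\end{align*}
```

In practice b1 is computed first from the last line, and b0 then follows from the first line.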
Example 16.2: The following table shows the marks obtained by ten students,
out of a maximum of 10 marks, in Mathematics (X) and English (Y).
Maths (X)   3   6   4   6   4   7   5   5   4   7
Eng. (Y)    4   6   5   7   4   7   6   6   5   8
(a) Plot the data in a scatter diagram.
(b) Fit the least-squares regression equation of Y on X.
(c) Predict Y if X = 2.
SOLUTION:
(a) The scatter diagram is given as,
[Figure: "Marks scores (Maths and Eng.)", a scatter diagram of English score (Y, vertical axis) against Mathematics score (X, horizontal axis)]

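(b) To find ŷi = b0 + b1Xi, for i = 1, 2, ..., n, we use

$$b_1 = \frac{n\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}$$

and

$$b_0 = \frac{\sum_{i=1}^{n} Y_i}{n} - b_1 \frac{\sum_{i=1}^{n} X_i}{n} = \bar{Y} - b_1 \bar{X}$$

Therefore,
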
S/N     X    Y    XY    X²    Y²
1       3    4    12    9     16
2       6    6    36    36    36
3       4    5    20    16    25
4       6    7    42    36    49
5       4    4    16    16    16
6       7    7    49    49    49
7       5    6    30    25    36
8       5    6    30    25    36
9       4    5    20    16    25
10      7    8    56    49    64
Total   51   58   311   277   352

That is, ΣX = 51, ΣY = 58, ΣXY = 311, ΣX² = 277 and ΣY² = 352.

$$b_1 = \frac{10(311) - (51)(58)}{10(277) - (51)^2} = \frac{3110 - 2958}{2770 - 2601} = \frac{152}{169} = 0.899$$

$$b_0 = \frac{58}{10} - 0.899\left(\frac{51}{10}\right) = 5.8 - 0.899(5.1) = 5.8 - 4.585 = 1.215$$

Therefore, ŷi = 1.215 + 0.899Xi



(c) If X = 2,
Y = 1.215 + 0.899 (2)
= 1.215 + 1.798
= 3.013.
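The arithmetic of Example 16.2 can be checked with a short Python sketch that applies the formulas for b1 and b0 directly to the marks. The function name `least_squares` is an assumption made for this illustration. Because the sketch carries b1 at full precision instead of rounding it to 0.899 first, it gives an intercept of about 1.213 and a prediction of about 3.012, slightly different from the hand-rounded 1.215 and 3.013.

```python
# Marks from Example 16.2.
maths = [3, 6, 4, 6, 4, 7, 5, 5, 4, 7]    # X
english = [4, 6, 5, 7, 4, 7, 6, 6, 5, 8]  # Y


def least_squares(x, y):
    """Return (b0, b1) for the least-squares line of y on x."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Slope from the raw sums, then intercept as mean(Y) - b1 * mean(X).
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = least_squares(maths, english)
prediction = b0 + b1 * 2  # part (c): predicted Y when X = 2

print(f"b1 = {b1:.3f}, b0 = {b0:.3f}, predicted Y at X = 2: {prediction:.3f}")
```

Note that X = 2 lies below the smallest Mathematics mark in the data (3), so the prediction in part (c) extends the fitted line beyond the observed range.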

6.5 SUMMARY
(1) A scatter diagram is a plot of two variables on a two-dimensional
graph.

(2) The least-squares estimates of the parameters in the regression model
Yi = β0 + β1Xi, for i = 1, 2, ..., n are:

$$b_1 = \frac{n\sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}$$

$$b_0 = \bar{Y} - b_1 \bar{X}$$
6.6 EXERCISE
(i) What do you understand by
(a) Normality
(b) Homoscedasticity
(c) Independence of errors

(ii) Use the following normal equations to find b0 and b1.

$$\sum_{i=1}^{n} Y_i = n b_0 + b_1 \sum_{i=1}^{n} X_i$$

$$\sum_{i=1}^{n} X_i Y_i = b_0 \sum_{i=1}^{n} X_i + b_1 \sum_{i=1}^{n} X_i^2$$

(iii) The following data are the heights (X) and weights (Y) of 10 students
in a statistics class.

Student      1    2    3    4    5    6    7    8    9    10
Height (X)   60   62   61   69   67   63   69   65   61   60
Weight (Y)   115  98   115  125  131  162  140  103  95   125

(a) Plot the data in a scatter diagram.
(b) Find the least-squares regression line of Y on X.
(c) What is the estimate of weight if height is 68?
CHAPTER 7

CORRELATION ANALYSIS
7.1 INTRODUCTION
In the previous chapter the theory of regression was considered. In this
chapter the theory of correlation is discussed. A correlation problem
differs from a regression problem in that we are concerned with measuring
the strength of the relationship between two or more variables rather than
with predicting one variable from knowledge of the independent variables.

7.2 Correlation Theory
The objective of correlation analysis is to evaluate the extent to which
covariance exists among the variables under investigation, that is, to
measure the linear relationship between the variables. Two variables are
said to be correlated if a change in the value of one of the variables tends
to be associated with a consistent corresponding change in the value of the
other.
There are different types of correlation coefficients that are used to
measure the relationship between variables. Prominent among them are:
(i) The product moment correlation coefficient, also called the
coefficient of total correlation or simply the linear correlation
coefficient.
(ii) The rank correlation coefficient.
(iii) The intra class correlation coefficient.
(iv) The partial correlation coefficient.
(v) The multiple correlation coefficient.

In this chapter the first two correlation coefficients are discussed, that
is, the linear correlation coefficient and the rank correlation coefficient.
7.2.1 The total correlation coefficient or linear correlation coefficient
The linear correlation coefficient measures the strength of the linear
relationship between any two variables, say x and y. It is very widely used
in measuring relationship, association or dependence between variables. The
linear correlation coefficient takes values from –1 to +1; it cannot be
greater than +1 or less than –1. The nearer its value is to either of these
extremes (–1 and +1), the stronger the correlation between the variables.
If the value of the correlation coefficient is positive, the correlation is
direct, that is, as the independent variable increases the dependent
variable also tends to increase. If the value of the coefficient is
negative, the correlation is inverse, that is, as the independent variable
increases the dependent variable tends to decrease. The closer the value of
the coefficient is to zero, the weaker the linear correlation between the
variables.

The linear correlation coefficient has the following properties.
(i) It is independent of scale and origin.
(ii) It lies between –1 and +1.
(iii) If r = –1 or +1 then there is a perfect linear relationship between x and y.

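The formula for calculating the linear correlation coefficient is given by

(i) Population correlation coefficient:

$$\rho = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}}$$

(ii) Sample correlation coefficient:

$$r = \frac{ss_{xy}}{\sqrt{ss_x\, ss_y}}$$

Note that the sample correlation coefficient is the one most often used.
Therefore,

$$r = \frac{ss_{xy}}{\sqrt{ss_x\, ss_y}} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
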
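The property that r is independent of scale and origin can be checked numerically. The Python sketch below is an illustration under stated assumptions: it uses the marks from Example 16.2, and the change of units (multiplying by 10 and adding 5, and doubling) is hypothetical. The function name `pearson_r` is also an assumption. The invariance holds when both scale factors are positive; a negative factor would reverse the sign of r.

```python
import math


def pearson_r(x, y):
    """Sample linear correlation coefficient computed from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Marks from Example 16.2.
maths = [3, 6, 4, 6, 4, 7, 5, 5, 4, 7]
english = [4, 6, 5, 7, 4, 7, 6, 6, 5, 8]

# Hypothetical change of scale and origin, assumed for illustration only.
maths_rescaled = [10 * m + 5 for m in maths]
english_rescaled = [2 * e for e in english]

r_original = pearson_r(maths, english)
r_rescaled = pearson_r(maths_rescaled, english_rescaled)

print(round(r_original, 4), round(r_rescaled, 4))
```

Both calls print the same value, because shifting and rescaling each variable changes the numerator and the denominator of r by the same factor.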

Example 17.1: Given the following data

X (Height)   12   10   14   11   12   9
Y (Weight)   18   17   23   19   20   15
Compute and interpret the correlation coefficient.

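Solution:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

Therefore
       1     2     3     4     5     6     Total
X      12    10    14    11    12    9     68
Y      18    17    23    19    20    15    112
XY     216   170   322   209   240   135   1292
X²     144   100   196   121   144   81    786
Y²     324   289   529   361   400   225   2128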

$$r = \frac{6(1292) - (68)(112)}{\sqrt{\left[6(786) - (68)^2\right]\left[6(2128) - (112)^2\right]}} = \frac{136}{\sqrt{(92)(224)}} = \frac{136}{\sqrt{20608}} = \frac{136}{143.55} = 0.947$$

A correlation coefficient of 0.947 indicates a strong positive linear
relationship between X and Y.
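As a check on Example 17.1, the following Python sketch computes r from the deviation form (ssxy, ssx, ssy) and compares it with the value obtained from the raw sums used in the worked solution. The variable names are assumptions made for this illustration.

```python
import math

# Data from Example 17.1.
height = [12, 10, 14, 11, 12, 9]   # X
weight = [18, 17, 23, 19, 20, 15]  # Y

n = len(height)
mean_x = sum(height) / n
mean_y = sum(weight) / n

# Deviation form: sums of squares and cross-products about the means.
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(height, weight))
ss_x = sum((x - mean_x) ** 2 for x in height)
ss_y = sum((y - mean_y) ** 2 for y in weight)
r_deviation = ss_xy / math.sqrt(ss_x * ss_y)

# Raw-sum form, using the totals from the worked solution.
r_raw = (6 * 1292 - 68 * 112) / math.sqrt(
    (6 * 786 - 68 ** 2) * (6 * 2128 - 112 ** 2)
)

print(round(r_deviation, 3), round(r_raw, 3))
```

The two forms are algebraically equivalent, so both round to 0.947. The raw-sum form is usually more convenient for hand calculation because it avoids subtracting the mean from every observation.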
7.3 SPEARMAN'S RANK CORRELATION COEFFICIENT (rrank)
There are basically three reasons for calculating the rank correlation
coefficient (rrank). These are:
i. It may be difficult or impossible to measure the variables numerically,
while it is often easy to rank them.
ii. The rrank may be a convenient substitute for r even when numerical
measurements are available, in view of the arithmetic involved.
iii. The rrank has a valid interpretation no matter what the underlying
distribution is.

The rank correlation coefficient is defined by the formula

$$r_{rank} = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

where d = the difference between the ranks of the corresponding values of X
and Y, and n = the number of pairs of values (X, Y) in the data.

Example 17.2: For the data in Example 17.1, calculate Spearman's rank
correlation coefficient.
Solution:
The values are ranked from largest (rank 1) to smallest. The two X values
of 12 are tied for ranks 2 and 3, so each receives the average rank 2.5.

S/NO    X    Y    Rank (Rx)   Rank (Ry)   d = Rx – Ry   d²
1       12   18   2.5         4           –1.5          2.25
2       10   17   5           5           0             0
3       14   23   1           1           0             0
4       11   19   4           3           1             1
5       12   20   2.5         2           0.5           0.25
6       9    15   6           6           0             0
Total                                                   3.5

Since Σd² = 3.5 and n = 6,

$$r_{rank} = 1 - \frac{6(3.5)}{6(6^2 - 1)} = 1 - \frac{21}{6(36 - 1)} = 1 - \frac{21}{210} = 1 - 0.1 = 0.9$$
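The ranking and the calculation in Example 17.2 can be reproduced with the Python sketch below. The function names `average_ranks` and `spearman_rank` are assumptions made for this illustration. Values are ranked from largest to smallest and tied values share the mean of the ranks they occupy, as in the worked solution. When ties are present, the d² formula is commonly treated as an approximation.

```python
def average_ranks(values):
    """Rank from largest (rank 1) to smallest; ties share the mean rank."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    for value in set(values):
        first = ordered.index(value) + 1        # first position held by value
        count = ordered.count(value)            # number of tied observations
        ranks[value] = first + (count - 1) / 2  # mean of the tied positions
    return [ranks[v] for v in values]


def spearman_rank(x, y):
    """Return (r_rank, sum of squared rank differences)."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    r_rank = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
    return r_rank, sum_d2


# Data from Example 17.1.
height = [12, 10, 14, 11, 12, 9]
weight = [18, 17, 23, 19, 20, 15]

r_rank, sum_d2 = spearman_rank(height, weight)
print(average_ranks(height), average_ranks(weight))
print(sum_d2, round(r_rank, 3))
```

The sum of squared differences is 3.5 and the coefficient is 0.9, in agreement with the hand calculation.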

7.4 Summary
The objective of correlation analysis is to measure the linear relationship
between two or more variables.
The linear correlation coefficient is given by the formula:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

whereas Spearman's rank correlation coefficient is given by:

$$r_{rank} = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

7.5 EXERCISES
1. The following data are the heights (x) and weights (y) of 10 students.
Student   1    2    3    4    5    6    7    8    9    10
Height    60   62   61   69   67   63   69   65   61   60
Weight    115  98   116  125  131  162  140  103  95   125
Compute and interpret
(i) The product moment correlation coefficient.
(ii) Spearman's rank correlation coefficient.
2. The grades of a class of 9 students on a C.A. test (x) and final
examination (y) are as follows:
X   77   50   71   72   81   94   96   99   67
Y   82   66   78   34   47   85   99   99   68
Compute and interpret the correlation coefficient for the variables X and Y.

8.0 Rates; Ratio and Index number
