Theme 4 BRM & Stat
March, 2023 G.C.
CHAPTER ONE
Just close your eyes for a minute and utter the word research to yourself. What kinds of images does this
word conjure up for you? Do you visualize a laboratory with scientists at work among Bunsen burners and test tubes, or
an Einstein-like character writing dissertations on some complex subject, or someone collecting data to
study the impact of a newly introduced day-care system on the morale of employees? Certainly, all
these images represent different aspects of research. Research is simply the process of finding solutions
to a problem after a thorough study and analysis of the situational factors. Curiosity, or questioning, is a
distinguishing trait of human beings. As human beings, we may have questions in our minds about
business, economic, social, political, environmental and many other areas of activity. Whenever we
encounter problems in these and other areas of concern, we try to find solutions to them. A systematic
search for such solutions to problems is, in essence, research.
Some people consider research a movement: a movement from the known to the unknown. It is, in effect, a
voyage of discovery. We all possess the vital instinct of inquisitiveness, for when the unknown confronts
us we wonder, and our inquisitiveness makes us probe and attain a fuller and fuller understanding of the
unknown. This inquisitiveness is the mother of all knowledge, and the method that people employ for
obtaining knowledge of the unknown can be termed research.
Definitions of Research
• 'Something that people undertake in order to find things out in a systematic way, thereby
increasing their knowledge' (Saunders et al., 2009).
• Research is an organized enquiry designed and carried out to provide information for
solving a problem (Fred Kerlinger).
• Research is a careful inquiry or examination to discover new information or relationships
and to expand and verify existing knowledge (Francis Rummel).
• Research is a diligent enquiry and careful search for new knowledge through a systematic,
scientific and analytical approach in any branch of knowledge.
• "Research is the application of human intelligence to problems whose solutions are not
available immediately" (Hertz).
• "Research is creative and original intellectual activity carried out in the library, the laboratory or
the field in the light of previous knowledge" (Klopsteg).
• Research is a systematic and refined technique of thinking, employing specialized tools,
instruments and procedures in order to obtain a more adequate solution to a problem than
would be possible under ordinary means (C. C. Crawford).
• Research is a systematic, controlled, empirical and critical method consisting of
enumerating the problem, formulating a hypothesis, collecting the facts or data, analyzing
the facts and reaching certain conclusions, either in the form of solutions to the
problem concerned or in the form of generalizations for some theoretical formulation.
Research is thus:
• An original contribution to the existing stock of knowledge, making for its advancement.
• The search for knowledge through an objective and systematic method of finding solutions to
a problem.
Business research covers a wide range of phenomena. For managers, the purpose of research is to provide
knowledge regarding the organization, the market, the economy, or another area of uncertainty. A financial
manager may ask, "Will the environment for long-term financing be better two years from now?" A
personnel manager may ask, "What kind of training is necessary for production employees?" or "What is
the reason for the company's high employee turnover?" A marketing manager may ask, "How can I
monitor my retail sales and retail trade activities?" Each of these questions requires information about how
the environment, employees, customers, or the economy will respond to executives' decisions. Research is
one of the principal tools for answering these practical questions. Social scientists believe that the
ultimate aim of research must be social benefit; that is, they hold that researchers must help solve the
problems faced by society.
Business research is the application of the scientific method in searching for the truth about business
phenomena. These activities include defining business opportunities and problems, generating and
evaluating alternative courses of action, and monitoring employee and organizational performance.
Business research is more than conducting surveys. The process includes idea and theory development,
problem definition, searching for and collecting information, analyzing data, and communicating the
findings and their implications.
"Business research is a systematic inquiry whose objective is to provide information to
solve managerial problems." (Donald and Pamela)
"Business research is a formalized means of designing, gathering, analyzing, and reporting
information that may be used to solve a specific management problem." (Burns and Bush)
Business research is the application of the scientific method in searching for the truth about
business phenomena. These activities include defining business opportunities and
problems, generating and evaluating alternative courses of action, and monitoring employee
and organizational performance.
"Business research is a function which links the organization, the customer, and the public
through information – information used to identify opportunities and define problems;
generate, evaluate and refine actions; and monitor performance." (American Marketing
Association)
The study of business research provides the knowledge and skills one needs to solve problems and
meet the challenges of a fast-paced decision-making environment. By providing the necessary information
on which to base business decisions, it can decrease the risk of making a wrong decision in each area. It
starts with a problem, collects data or facts, analyzes them critically, and reaches decisions based on the actual
evidence.
The purpose of research is to discover answers to questions through the application of scientific
procedures. The main aim of research is to find out the truth which is hidden and which has not been
discovered as yet. Though each research study has its own specific purpose, we may think of research
objectives as falling into the following broad groupings:
1. To gain familiarity with a phenomenon or to achieve new insights into it (studies with this
object in view are termed as exploratory or formulative research studies);
2. To portray accurately the characteristics of a particular individual, situation or a group
(studies with this object in view are known as descriptive research studies);
3. To determine the frequency with which something occurs or with which it is associated
with something else (studies with this object in view are known as diagnostic research
studies);
4. To test a hypothesis of a causal relationship between variables (such studies are known as
hypothesis-testing research studies).
Generally, the objective of any research study is either to explore a phenomenon, to describe the
characteristics of a particular event/object/individual or group, to diagnose, or to test the relationship
between variables.
From the aforementioned definitions it is clear that research is a process for collecting, analyzing and
interpreting information to answer questions. But to qualify as research, the process must have certain
characteristics: it must, as far as possible, be controlled, rigorous, systematic, valid and verifiable, empirical
and critical. Let us briefly examine these characteristics to understand what they mean:
Controlled – In real life there are many factors that affect an outcome. A particular event is seldom the
result of a one-to-one relationship. Some relationships are more complex than others. Most outcomes are a
sequel to the interplay of a multiplicity of relationships and interacting factors. In a study of cause-and-
effect relationships it is important to be able to link the effect(s) with the cause(s) and vice versa. In the study
of causation, the establishment of this linkage is essential; however, in practice, particularly in the social
sciences, it is extremely difficult, and often impossible, to make the link.
The concept of control implies that, in exploring causality in relation to two variables, you set up your
study in a way that minimizes the effects of other factors affecting the relationship. This can be achieved to
a large extent in the physical sciences, as most of the research is done in a laboratory. However, in the social
sciences it is extremely difficult as research is carried out on issues relating to human beings living in
society, where such controls are impossible. Therefore, in the social sciences, as you cannot control
external factors, you attempt to quantify their impact.
Rigorous – You must be scrupulous in ensuring that the procedures followed to find answers to questions
are relevant, appropriate and justified. Again, the degree of rigor varies markedly between the physical and
the social sciences and within the social sciences.
Systematic – This implies that the procedures adopted to undertake an investigation follow a certain
logical sequence. The different steps cannot be taken in a haphazard way. Some procedures must follow
others.
Valid and verifiable – This concept implies that whatever you conclude on the basis of your findings is
correct and can be verified by you and others.
Empirical – This means that any conclusions drawn are based upon hard evidence gathered from
information collected from real-life experiences or observations.
Critical – Critical scrutiny of the procedures used and the methods employed is crucial to a research
enquiry. The process of investigation must be foolproof and free from any drawbacks. The process adopted
and the procedures used must be able to withstand critical scrutiny.
For a process to be called research, it is imperative that it has the above characteristics.
Why do people undertake research? This is an important question. The possible motives may be one or
more of the following:
1) Desire to get a research degree along with its consequential benefits;
2) Desire to face the challenge of solving unsolved problems;
3) Desire to experience the intellectual joy of doing creative work;
4) Desire to be of service to society;
5) Desire to gain respectability.
The motivation will, however, determine to a considerable extent the nature, quality, depth and duration of
research.
2. CLASSIFICATION OF RESEARCH BASED ON APPLICATION / GOAL OF RESEARCH
Pure research Vs Applied research
A. BASIC/PURE/FUNDAMENTAL RESEARCH
Fundamental research is also called academic, basic or pure research. Pure research involves developing
and testing theories and hypotheses that are intellectually challenging to the researcher but may or may not
have practical application at the present time or in the future. Thus, such work often involves the testing of
hypotheses containing very abstract and specialized concepts. Such research is aimed at investigating or
searching for new principles and laws. It is mainly concerned with generalization and the formulation of theory.
With the change of time and space, it becomes necessary to re-examine the fundamental principles in every branch of
science; thus, this type of research also verifies old established theories, principles and laws.
In general, fundamental research is concerned with the theoretical aspect of science. Its primary objective is
advancement of knowledge and the theoretical understanding of the relations among variables. It is
basically concerned with the formulation of a theory or a contribution to the existing body of knowledge.
Ex. - The relationship between crime and economic status
- Darwin's theory of evolution
Pure research is also concerned with the development, examination, verification and refinement of research
methods, procedures, techniques and tools that form the body of research methodology. Examples of pure
research include developing a sampling technique that can be applied to a particular situation; developing a
methodology to assess the validity of a procedure; developing an instrument, say, to measure the stress level
in people; and finding the best way of measuring people's attitudes. The knowledge produced through pure
research is sought in order to add to the existing body of knowledge of research methods.
B. APPLIED/ACTION RESEARCH
Research aimed at finding a solution for an immediate problem facing a society, a group or an industry (business
organization) is applied research. The results of such research would be used by individuals, groups of
decision-makers or even policy makers. Such research is conducted to solve practical problems or concerns,
for example in policy formulation, administration, and the enhancement of understanding of a phenomenon.
In general, applied research:
• Is conducted in relation to actual problems and under the conditions in which they
are found in practice;
• Employs methodology that is not as rigorous as that of basic research;
• Yields findings that can be evaluated in terms of local applicability and not in
terms of universal validity.
• In fact, most research in the social sciences is applied research.
Ex. The improvement of safety in the workplace.
Research aimed at reaching certain conclusions (say, a solution) about a concrete social or business problem is an
example of applied research. Research to identify social, economic or political trends that may affect a
particular institution is likewise applied research. Thus, the central aim of applied research is to
discover a solution for some pressing practical problem.
3. CLASSIFICATION OF RESEARCH BASED ON THE OBJECTIVES OF THE STUDY
Descriptive vs Correlational vs Explanatory vs Exploratory Research
1. DESCRIPTIVE RESEARCH
A research study classified as a descriptive study attempts to describe systematically a situation, problem,
phenomenon, service or programme, or provides information about, say, the living conditions of a
community, or describes attitudes towards an issue. For example, it may attempt to describe the types of
service provided by an organisation, the administrative structure of an organisation, the living conditions of
Aboriginal people in the outback, the needs of a community, what it means to go through a divorce, how a
child feels living in a house with domestic violence, or the attitudes of employees towards management. The
main purpose of such studies is to describe what is prevalent with respect to the issue/problem under study.
Its major purpose is the description of the state of affairs as it exists at present. It tries to discover
answers to the questions who, what, when and sometimes how. The researcher has no control over
the variables; he or she can only report what has happened or what is happening.
Simply stated, it is a fact-finding investigation. In descriptive research, definite conclusions can be
arrived at, but it does not establish a cause-and-effect relationship. This type of research tries to
describe the characteristics of the respondents in relation to a particular product.
Descriptive research often deals with the demographic characteristics of consumers: for example, trends
in the consumption of soft drinks with respect to socio-economic characteristics such as age, family income
and education level. Another example is the degree of television viewing and how it varies with the age,
income level and profession of respondents, as well as the time of viewing.
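The following minimal sketch (hypothetical data, not from the text) illustrates the kind of descriptive summary the example above describes: average soft-drink consumption reported by age group, with no claim about cause and effect. The use of the pandas library is an assumed tool choice.

# Hypothetical survey records: soft-drink consumption by age group.
import pandas as pd

survey = pd.DataFrame({
    "age_group": ["18-25", "18-25", "26-40", "26-40", "41-60", "41-60"],
    "litres_per_week": [3.5, 4.0, 2.0, 2.5, 1.0, 1.5],
})

# Descriptive research only reports what is prevalent: here, the mean
# consumption and the number of respondents in each age group.
summary = survey.groupby("age_group")["litres_per_week"].agg(["mean", "count"])
print(summary)

Such a summary describes the state of affairs; it says nothing about why the groups differ.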
2. CORRELATIONAL RESEARCH
The main emphasis in a correlational study is to discover or establish the existence of a
relationship/association/interdependence between two or more aspects of a situation. What is the
relationship between stressful living and the incidence of heart attack? What is the relationship between
fertility and mortality? What is the relationship between technology and unemployment? These studies
examine whether there is a relationship between two or more aspects of a situation or phenomenon and,
therefore, are called correlational studies.
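As a minimal illustration (hypothetical figures, not from the text) of how a correlational question can be examined numerically, the sketch below computes Pearson's correlation coefficient between a stress score and a simple health indicator; a value near +1 or -1 suggests a strong association, while a value near 0 suggests little linear relationship.

import statistics

def pearson_r(x, y):
    # Pearson's r = covariance(x, y) / (std dev of x * std dev of y)
    n = len(x)
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    return cov / (statistics.stdev(x) * statistics.stdev(y))

# Hypothetical data: stress scores and number of clinic visits per year.
stress = [2, 5, 3, 8, 7, 1, 6, 4]
visits = [0, 2, 1, 4, 3, 0, 3, 1]
print(f"Pearson r = {pearson_r(stress, visits):.2f}")

Note that correlation describes association only; establishing why and how the variables are related is the task of explanatory research, discussed next.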
3. EXPLANATORY RESEARCH
Explanatory research attempts to clarify why and how there is a relationship between two aspects of a
situation or phenomenon. Analytical (causal or explanatory) research identifies the cause-and-effect
relationship between variables where the research problem has already been narrowly defined. It explains
why and how a phenomenon is happening or has happened. This type of research attempts to explain, for
example, why stressful living results in heart attacks; why a decline in mortality is followed by a fertility
decline; how the home environment affects children's level of academic achievement; or the effect of
advertisement on sales (e.g. which of two advertising strategies is more effective?). An analytical study or
statistical method is a system of procedures and techniques of analysis applied to quantitative data. It may
consist of a system of mathematical models or statistical techniques applicable to numerical data.
4. EXPLORATORY RESEARCH
The fourth type of research, from the viewpoint of the objectives of a study, is called exploratory research.
This is when a study is undertaken with the objective either to explore an area where little is known or to
investigate the possibilities of undertaking a particular research study. When a study is carried out to
determine its feasibility it is also called a feasibility study or a pilot study. It is usually carried out when a
researcher wants to explore areas about which s/he has little or no knowledge. A small-scale study is
undertaken to decide if it is worth carrying out a detailed investigation. On the basis of the assessment
made during the exploratory study, a full study may eventuate. Exploratory studies are also conducted to
develop, refine and/or test measurement tools and procedures.
Exploratory research (pilot survey) is also called preliminary research. As its name implies, such
research is aimed at discovering, identifying and formulating a research problem and hypothesis.
It is needed when there are few or no previous studies that can be referred to. A sales decline in a
company, for instance, may be due to inefficient service, improper pricing, an inefficient sales force, ineffective
promotion or improper quality. The research executive must examine such questions to identify the most
useful avenues for further research. Preliminary investigation of this type is called exploratory research.
Expert surveys, focus groups, case studies and observation methods are used to conduct the exploratory
survey. E.g. "Our sales are declining and we don't know why."
Although, theoretically, a research study can be classified into one of the above categories from the objectives
perspective, in practice most studies are a combination of the first three; that is, they contain elements of
descriptive, correlational and explanatory research.
4. CLASSIFICATION OF RESEARCH BASED ON THE TYPE OF INFORMATION SOUGHT / TYPE OF DATA
Quantitative vs Qualitative Research
A) QUALITATIVE RESEARCH
A study is classified as qualitative if the purpose of the study is primarily to describe a situation,
phenomenon, problem or event; if the information is gathered through the use of variables measured on
nominal or ordinal scales (qualitative measurement scales); and if the analysis is done to establish the
variation in the situation, phenomenon or problem without quantifying it. The description of an observed
situation, the historical enumeration of events, an account of the different opinions people have about an
issue, and a description of the living conditions of a community are examples of qualitative research.
Such research is applicable to phenomena that cannot be expressed in terms of quantity, that is, things related to
quality and kind. Research designed to find out how people feel or what they think about a particular subject or
institution is an example of such research. Being concerned with the quality of information,
qualitative methods attempt to gain an understanding of the underlying reasons and motivations for actions
and to establish how people interpret their experiences and the world around them. Qualitative methods
provide insights into the setting of a problem, generating ideas and/or hypotheses. Qualitative research:
• Is concerned with qualitative phenomena, i.e. phenomena relating to or involving quality or
kind;
• Is especially important in the behavioral sciences, where the aim is to discover the underlying
motives of human behavior.
For instance, qualitative research is appropriate when we are interested in investigating the reasons for
human behavior (i.e., why people think or do certain things). This type of research aims at discovering the
underlying motives and desires, using in-depth interviews for the purpose.
B) QUANTITATIVE RESEARCH
On the other hand, a study is classified as quantitative if you want to quantify the variation in a
phenomenon, situation, problem or issue; if information is gathered using predominantly quantitative
variables; and if the analysis is geared towards ascertaining the magnitude of the variation. Examples of
quantitative aspects of a research study are: How many people have a particular problem? How many
people hold a particular attitude? Quantitative research, as the name suggests, is concerned with trying
to quantify things. It is based on the measurement of quantity or amount and is applicable to
phenomena that can be expressed in terms of quantity. It asks questions such as 'how long',
'how many' or 'the degree to which'.
Quantitative methods seek to quantify data and to generalize results from a sample to the population of
interest. They may, for example, measure the incidence of various views and opinions in a chosen sample,
or aggregate results.
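A minimal sketch (hypothetical responses, not from the text) of the kind of quantification described above: counting how many respondents in a sample hold a particular attitude and expressing it as a percentage.

# Hypothetical attitude responses from a small sample.
responses = ["agree", "disagree", "agree", "neutral", "agree", "disagree", "agree"]

n_agree = sum(1 for r in responses if r == "agree")
share = 100 * n_agree / len(responses)
print(f"{n_agree} of {len(responses)} respondents agree ({share:.1f}%)")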
It is strongly recommended that you do not 'lock yourself' into becoming either solely a quantitative or
solely a qualitative researcher. It is true that there are disciplines that lend themselves predominantly either
to qualitative or to quantitative research. For example, such disciplines as anthropology, history and
sociology are more inclined towards qualitative research, whereas psychology, epidemiology, education,
economics, public health and marketing are more inclined towards quantitative research. However, this
does not mean that an economist or a psychologist never uses the qualitative approach, or that an
anthropologist never uses quantitative information. There is increasing recognition by most disciplines in
the social sciences that both types of research are important for a good research study. The research
problem itself should determine whether the study is carried out using quantitative or qualitative
methodologies.
As both qualitative and quantitative approaches have their strengths and weaknesses, and advantages and
disadvantages, 'neither one is markedly superior to the other in all respects' (Ackroyd & Hughes 1992: 30).
The measurement and analysis of the variables about which information is obtained in a research study
depend upon the purpose of the study. In many studies you need to combine both qualitative and
quantitative approaches. For example, suppose you want to find out the types of service available to victims
of domestic violence in a city and the extent of their utilization. The types of service are the qualitative aspect of
the study, as finding out about them entails a description of the services. The extent of utilization of the
services is the quantitative aspect, as it involves estimating the number of people who use the services and
calculating other indicators that reflect the extent of utilization.
5. Classification based on the Environment
1) Field Research - It is research carried out in the field. Such research is common in
social science, agricultural science, history and archeology.
2) Laboratory Research - It is research carried out in the laboratory and is commonly
experimental. Such research is common in medical science, agriculture and, in general,
the natural sciences.
3) Simulation Research - Such research uses models to represent the real world.
Simulation is common in physical science, economics and mathematics.
6. CLASSIFICATION BASED ON THE TIME REQUIRED TO COMPLETE THE RESEARCH
A. One-time research: research limited to a single time period.
B. Longitudinal research: also called on-going research; research carried out over several
time periods.
7. CLASSIFICATION BASED ON LOGIC
This classification distinguishes research that reasons from the general to the specific or vice versa.
1. Deductive research: a study in which a conceptual and theoretical structure is
developed and then tested by empirical observation; it moves from the general
to the particular.
2. Inductive research: a study in which theory is developed from the
observation of empirical reality; it moves from the particular to the general.
8. OTHER TYPES OF RESEARCH
A. Policy Research
Research conducted for the specific purpose of application, or research with policy
implications, may be treated as policy research. The results of such studies are used as inputs for policy
formulation and implementation. Much management research is policy research, because it is not
merely of theoretical value; it is of practical utility more than theoretical interest.
B. CASE STUDIES VS SURVEYS
A case study is an in-depth comprehensive study of a person, a social group, an episode, a process, a
situation, a program, a community, an institution, or any other social unit. Its purpose may be to understand
the life cycle of the unit under study or the interaction between factors that explain the present states or the
development over a period of time. The examples include social anthropological study of a rural
community, a causative study of a successful co-operative society; a study of the financial health of a
business undertaking; a study of employee participation in management in a particular enterprise, a study
of juvenile delinquency; a study of the lifestyle of working women; a study of life in slums; a study of the urban
poor; a study of economic offenses; or a study of refugees from another country.
A survey is a research method involving the collection of data directly from a population, or a sample thereof, at a
particular time. Data may be collected by observation, interviewing or mailed questionnaires. The analysis
of the data may use simple or complex statistical techniques depending upon the objectives of the
study. In short, the case study and survey methods are compared as follows:
• Case study: intensive investigation. Survey: broad-based investigation of a phenomenon.
• Case study: studies a single unit/group. Survey: covers a large number of units (the units of the
universe or a sample of them).
• Case study: the findings cannot be generalized. Survey: the findings can be generalized based on
the sample.
• Case study: useful for testing hypotheses about the structural and procedural characteristics
(e.g. status relations, interpersonal behavior, managerial style) of a specified social unit
(e.g. an organization, a small group or a community). Survey: useful for testing hypotheses
about large social aggregates.
C. EXPERIMENTAL RESEARCH
There are various phenomena such as motivation, productivity, development, and operational efficiency
which are influenced by various variables. It may become necessary to assess the effect of one particular
variable or one set of variables on a phenomenon. This need has given rise to experimental research.
Experimental research is designed to assess the effect of particular variables on a phenomenon by keeping
the other variables constant or controlled. It aims at determining whether and in what manner variables are
related to each other. The factor which is influenced by other factors is called a dependent variable, and the
other factors which influence it are known as independent variables. For example, agricultural productivity (crop
yield per hectare) is a dependent variable, and factors such as soil fertility, irrigation, seed quality,
manuring and cultural practices which influence the yield are independent variables. The nature of the
relationship between the independent variables and the dependent variable is perceived and stated in the form of
causal hypotheses. A closely controlled procedure is adopted to test them.
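To make the crop-yield example concrete, the sketch below (hypothetical figures, not from the text) treats fertiliser dose as the manipulated independent variable and yield per hectare as the dependent variable, with the other factors assumed to be held constant, and estimates the effect with a simple least-squares slope.

# Hypothetical plot-level data from a controlled trial.
fertiliser_kg = [0, 20, 40, 60, 80]    # independent variable (manipulated)
yield_qt_ha = [18, 22, 27, 31, 34]     # dependent variable (observed)

n = len(fertiliser_kg)
mean_x = sum(fertiliser_kg) / n
mean_y = sum(yield_qt_ha) / n
# Least-squares slope: sum of cross-deviations over sum of squared x-deviations.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(fertiliser_kg, yield_qt_ha)) / \
        sum((x - mean_x) ** 2 for x in fertiliser_kg)
print(f"Estimated effect: {slope:.2f} quintals/ha per additional kg of fertiliser")

Because the other influencing factors are controlled, a change in yield can be attributed to the manipulated variable with more confidence than in a purely correlational study.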
D. HISTORICAL RESEARCH
It is research that utilizes historical sources such as documents, remains, etc., to study events or ideas of the past,
including the philosophy of persons and groups at any remote point in time.
Research method: a research method refers to the methods, techniques and procedures used in conducting
research. Research methods, thus, are the methods researchers use in performing research
operations. In other words, all those methods which are used by the researcher during the course of studying
his or her research problem are termed research methods.
Methods - Research methods are the tools, techniques or processes that we use in our research. These might
be, for example, surveys, interviews, or participant observation. Methods and how they are used are shaped
by methodology. Hence the researcher must decide on exactly the design of the study and how the stated
objectives are to be achieved.
In short, research methods can be put into the following three groups:
a. Those methods which are concerned with the collection of data (i.e., methods
of data collection);
b. Those methods/statistical techniques which are used for establishing relationships
between the data and the unknowns (i.e., methods of analysis); and
c. Those methods which are used to evaluate the accuracy of the results obtained.
Research Methodology: Research methodology is a way to systematically solve the research problem. It
may be understood as a science of studying how research is done scientifically. In it we study the various
steps that are generally adopted by the researcher in studying his research problem along with the logic
behind them. It is necessary for the researcher to know not only the research methods/techniques but also
the methodology. Researchers not only need to know how to develop certain questionnaires, indices or tests,
how to calculate, and how to apply particular research techniques, but they also need to know which of these
methods or techniques are relevant and which are not, what they would mean and indicate, and why.
Researchers also need to understand the assumptions underlying the various methods, and they need to know the
criteria by which they can decide that certain methods/procedures will be applicable to certain problems and
others will not.
Methodology is the study of how research is done, how we find out about things, and how knowledge is
gained. In other words, methodology is about the principles that guide our research practices. Methodology
therefore explains why we are using certain methods or tools in our research. From what has been stated
above, we can say that research methodology has many dimensions, and research methods constitute a
part of research methodology. The scope of research methodology is wider than that of research methods.
Research methodology generally refers to the different approaches to systematic inquiry developed
within a particular paradigm, with their associated epistemological assumptions (e.g., experimental/non-
experimental, action/grounded/…).
Thus, when we talk about research methodology, we not only talk of research methods but also consider
the LOGIC behind the methods we use in the context of our research study, and explain why we are using a
particular method and why we are not using others, so that research results are capable of being evaluated.
Why has the research study been undertaken?
How has the research problem been defined?
Why and in what way have the hypotheses been formulated?
What data have been collected, and what particular method has been adopted? And why not
others?
Why has a particular method of analysis been used?
A host of other similar questions are usually answered when we talk of research methodology concerning a
research study.
We defined research as an organized, systematic, data-based, critical, objective, scientific inquiry into a
specific problem that needs a solution. Decisions based on the results of a well-done scientific study tend to
yield the desired results. It is necessary to understand what the term scientific means. Scientific research
focuses on solving problems and pursues a systematic, logical, organized, and rigorous method to identify the
problems, gather data, analyze them, and draw valid conclusions therefrom. Thus, scientific research is not
based on hunches or intuition (though these may play a part in final decision-making), but is purposive and
rigorous. Because of the rigorous way in which it is done, scientific research enables all those who are
interested in researching and knowing about the same or similar issues to come up with comparable
findings when the data are analyzed.
Scientific research also helps researchers to state their findings with accuracy and confidence. This helps
various other organizations to apply those solutions when they encounter similar problems. Furthermore,
scientific investigation tends to be more objective than subjective, and helps managers to highlight the most
critical factors at the workplace that need specific attention so as to avoid, minimize, or solve problems.
Scientific investigation and managerial decision-making are integral aspects of effective problem solving.
The term scientific research applies to both basic and applied research. Applied research may or may not be
generalizable to other organizations, depending on the extent to which differences exist in such factors as
size, nature of work, characteristics of the employees, and structure of the organization. Nevertheless,
applied research also has to be an organized and systematic process where problems are carefully identified,
data scientifically gathered and analyzed, and conclusions drawn in an objective manner for effective
problem solving.
A manager faced with two or more possible courses of action faces the initial decision of whether or not
research should be conducted. The determination of the need for research centers on (1) time constraints,
(2) the availability of data, (3) the nature of the decision that must be made, and
(4) the value of the business research information in relation to its costs.
TIME CONSTRAINTS
Systematically conducting research takes time. In many instances management concludes that because a
decision must be made immediately, there will be no time for research. As a consequence, decisions are
sometimes made without adequate information or thorough understanding of the situation. Although not
ideal, sometimes the urgency of a situation precludes the use of research.
The benefits of the research must also be weighed against its costs. In general, the more strategically or
tactically important the decision, the more likely it is that research will be conducted.
Ethics in business research refers to a code of conduct or expected societal norm of behavior while
conducting research. Ethical conduct applies to the organization and the members that sponsor the research,
the researchers who undertake the research, and the respondents who provide them with the necessary data.
The observance of ethics begins with the person instituting the research, who should do so in good faith,
pay attention to what the results indicate and, surrendering the ego, pursue organizational rather than self-
interests. Ethical conduct should also be reflected in the behavior of the researchers who conduct the
investigation, the participants who provide the data,
the analysts who provide the results, and the entire research team that presents the interpretation of the
results and suggests alternative solutions.
Thus, ethical behavior pervades each step of the research process: data collection, data analysis, reporting,
and the dissemination of information on the Internet, if such an activity is undertaken. How the subjects are
treated and how confidential information is safeguarded are all guided by business ethics. There are business
journals, such as the Journal of Business Ethics and Business Ethics Quarterly, that are mainly devoted to
the issue of ethics in business. The American Psychological Association has established certain guidelines
for conducting research, to ensure that organizational research is conducted in an ethical manner and the
interests of all concerned are safeguarded.
Researchers in Ethiopia, particularly those engaged in empirical research, are facing several problems.
Some of the important problems are as follows:
1. The lack of scientific training in the methodology of research is a great impediment for
researchers in our country. There is a paucity of competent researchers. Most of the work
which goes in the name of research is not methodologically sound. Research, to many
researchers and even to their guides, is mostly a scissors-and-paste job without any insight
shed on the collated materials. The consequence is obvious, viz., the research results quite
often do not reflect reality. Thus, a systematic study of research
methodology is an urgent necessity. Before undertaking research projects, researchers
should be well equipped with all the methodological aspects. As such, efforts should be
made to provide short duration intensive courses for meeting this requirement.
2. There is insufficient interaction between university research departments on the one side
and business establishments, government departments and research institutions on the
other. A great deal of primary data of a non-confidential nature remains
untouched/untreated by researchers for want of proper contacts. Efforts should be made
to develop satisfactory liaison among all concerned for better and more realistic research.
There is a need to develop some mechanism for a university-industry interaction
programme so that academics can get ideas from practitioners on what needs to be
researched and practitioners can apply the research done by the academics.
3. Most of the business units in our country do not have confidence that the material
supplied by them to researchers will not be misused, and as such they are often reluctant to
supply the needed information. The concept of secrecy seems so sacrosanct to business
organizations in the country that it proves an impermeable barrier to researchers. Thus,
there is a need to build confidence that the information/data obtained from a business unit
will not be misused.
4. Research studies that overlap one another are undertaken quite often for want of
adequate information. This results in duplication and fritters away resources. This problem
can be solved by the proper compilation and revision, at regular intervals, of a list of the subjects
on which, and the places where, research is going on. Due attention should also be given
to the identification of research problems in the various disciplines of applied science which
are of immediate concern to industry.
5. There does not exist a code of conduct for researchers, and inter-university and
interdepartmental rivalries are quite common. Hence, there is a need to develop a
code of conduct for researchers which, if adhered to sincerely, can overcome this problem.
6. Many researchers in our country also face the difficulty of obtaining adequate and timely
secretarial assistance, including computer assistance. This causes unnecessary delays in the
completion of research studies. All possible efforts should be made in this direction so that
efficient secretarial assistance is made available to researchers, and that too well in time.
The University Grants Commission must play a dynamic role in solving this difficulty.
7. Library management and functioning is not satisfactory in many places, and much of the
time and energy of researchers is spent in tracing out books, journals, reports, etc.,
rather than in tracing out relevant material from them.
8. There is also the problem that many of our libraries are not able to obtain copies of old and
new Acts/Rules, reports and other government publications in time. This problem is felt
more acutely in libraries located away from Delhi and/or the state capitals. Thus,
efforts should be made to ensure the regular and speedy supply of all government publications
to our libraries.
9. There is also the difficulty of the timely availability of published data from the various
government and other agencies doing this job in our country. Researchers also face
problems because the published data vary quite significantly owing to differences in
coverage by the agencies concerned.
CHAPTER TWO
In the research process, the first and foremost step is that of selecting and properly defining a research
problem. A researcher must find the problem and formulate it so that it becomes susceptible to research.
Like a medical doctor, a researcher must examine all the symptoms (presented to or observed by him or her)
concerning a problem before a correct diagnosis can be made. To define a problem correctly, a researcher must
know what a problem is. Defining the problem is like identifying a destination before undertaking a journey:
just as it is impossible to choose a route in the absence of a destination, it is
impossible to have a clear and economical plan in the absence of a clear research problem. Research forms a cycle;
it starts with a problem and ends with a solution to the problem and possible implications for future research.
Perhaps the most important step in the research process is selecting and developing the problem for research.
A problem well stated is a problem half solved.
The identification of a research problem is difficult, but it is an important phase of the entire research process. It
requires a great deal of patience and logical thinking on the part of the researcher. Beginners find the task
of identifying a research problem a difficult one. Most of the time, researchers select a problem because of
their own unique needs and purposes. There are, however, some important sources which are helpful to a
researcher in selecting a problem to be investigated. A research problem is like the foundation of a building;
it serves as the foundation of a research study, and if it is well formulated, you can expect a
good study to follow. Accordingly, in this chapter, issues related to the research problem and hypotheses will
be discussed.
The process of formulating research problems in a meaningful way is not at all an easy task. As a newcomer it might seem
easy to formulate a problem, but it requires considerable knowledge of both the subject area and research
methodology. Once you examine a question more closely you will soon realize the complexity of
formulating an idea into a problem which is researchable. 'First identifying and then specifying a research
problem might seem like research tasks that ought to be easy and quickly accomplished. However, such is
often not the case' (Yegidis & Weinback 1991).
It is essential for the problem you formulate to be able to withstand scrutiny in terms of the procedures
required to be undertaken. Hence you should spend considerable time thinking it through. Before starting
your research, you need to have at least some idea of what you want to do. The main function of
formulating a research problem is to decide what you want to find out about.
2.2.1 Meaning of the research problem
A research problem is any significant, perplexing and challenging situation, real or artificial, the solution of
which requires reflective thinking. It is the difficulty experienced by the researcher in a theoretical or
practical situation. A research problem is the situation that causes the researcher to feel apprehensive,
confused and ill at ease. It is the determination of a problem area within a certain context involving the who,
what, where, when and why of the problem situation. Any question that you want answered, and any
assumption or assertion that you want to challenge or investigate, can become a research problem or a
research topic for your study.
A research problem, in general, refers to some difficulty which a researcher experiences in the context of
either a theoretical or a practical situation and to which he or she wants to obtain a solution (Zikmund, 2000). A problem is a
gap between what actually exists and what should exist, and the significance of a problem can be
measured by that gap. A problem does not necessarily mean that something is always seriously wrong with a
current situation that needs to be rectified immediately. A problem could simply indicate an interest in an
issue where finding the right answers might help to improve an existing situation. Thus, it is fruitful to
define a problem as any situation where a gap exists between the actual and the desired ideal states. Basic
researchers usually define their problems for investigation from this perspective. Problems mean gaps: a
problem occurs when there is a difference between the current conditions and a more preferable set of
conditions. In other words, a gap exists between the way things are now and a way that things could be
better.
• Research gap: a research question or problem which has not been answered appropriately,
or at all, in a given field of study.
• Practical gap: a problem to be solved in the selected case area.
• Theoretical gap: a limitation of previously conducted research or existing theories; addressing it
is the contribution of the current research to the existing body of knowledge.
Elements of a research problem: the elements of a research problem are
1. The aim or purpose of the problem for investigation. This answers the question 'why':
why is there an investigation, inquiry or study?
2. The subject matter or topic to be investigated. This answers the question 'what'.
3. The place/locale where the research is to be conducted. This answers the question
'where': where is the study to be conducted?
4. The period or time of the study during which the data are to be gathered. This
answers the question 'when'.
5. The population/universe from whom the data are to be collected. This answers the question
'who' or 'from whom'.
The problem selected for research may initially be a vague topic. The question to be studied or the problem
to be solved may not be known, and the reason why the answer is wanted may not be known either. Hence, the
selected topic should be defined and formulated. If it is to serve as a guide in planning the study and
interpreting its results, it is essential that the problem be stated in precise terms. This is a difficult process. It
requires intensive reading of the related literature in order to understand the nature of the selected problem.
The researcher should read the selected literature, digest it, and think and reflect upon what has been read and
digested. He or she should also discuss it with experienced persons.
Formulation means translating and transforming the selected research problem/topic into a scientifically
researchable question. It is concerned with specifying exactly what the research problem is and why it is
studied. It involves the task of laying down the boundaries within which the researcher shall study the problem
with a predetermined objective in view.
Moreover, problem definition implies the separation of the problem from the complex of difficulties and
needs. It means putting a fence around it, separating it by careful distinctions from like questions found in
related situations of need. To define a problem means to specify it in detail and with precision. Each
question and subordinate question to be answered is to be specified.
Sometimes it is necessary to formulate the point of view on which the investigation is to be based. If certain
assumptions are made, they are explicitly noted.
It is important to define and elucidate the problem as a whole and further define all the technical and
unusual terms employed in the statement. By this, the research worker removes the chance of
misinterpretation of any of these crucial terms. The definition helps to establish the frame of reference with
which the researcher approaches the problem. Three principal components in the progressive formulation
of a problem for research are identified as follows:
1. The originating question (what one wants to know)
2. The rationale (why aspect)
3. The specifying question (possible answers to the originating question)
1) The originating question: This indicates what the problem is. It may be of
different kinds. It may call for discovering new and more decisive facts relating to
the subject matter of the study; it may call into question the adequacy of certain
concepts; it may relate to empirical validity; or it may relate to the structure
of an organization.
2) Rationale for the question: This is the statement of reasons why a particular
question is posed. It states how the answer to the question will contribute to theory
and/or practice. It helps to discriminate between scientifically important
and trivial questions. In short, it "states the case for the question in the court of
scientific opinion".
3) Specifying questions: The originating question is decomposed into several specific
questions in order to identify the observations or data that will provide answers to
them. These questions should be simple, pointed, clear and empirically verifiable.
They are known as 'investigative' questions. It is only these questions that, when
synthesized, can afford the solution to the problem selected for research. This
solution has implications for theory or systematic knowledge and/or for practice.
Research problems/ideas originate from many sources. The main ones are discussed below:
everyday life, practical issues, past research (literature), inference from theory, contacts and
discussions with people, and technological and social change.
1. Everyday life: This is one common source of research problems/ideas. Based on a
questioning and inquisitive approach, you can draw on your experiences and
come up with many research problems. For example, think about what types of
management practices in cooperatives you believe work well or do not work well.
Would you be interested in doing a research study on one or more of those practices?
2. Practical issues: This is one of the most important sources of research problems, especially
when you are a practitioner. What are some current problems facing cooperative
development? What research topics do you think could address some of these
problems? Through this kind of inquisitive approach to practical issues,
you can come up with a research problem.
3. Past research (literature): To use this source, one has to be
very familiar with the literature in the field of one's interest. Past research is
probably the most important source of research ideas/problems, because
research usually generates more questions than it answers. It is also the
best way to come up with a specific idea that will fit into and extend the research
literature.
4. Theory (explanations of phenomena): Inference from theory can be a source of
research problems. The application of the general principles involved in various theories
to specific situations makes an excellent starting point for research. The following
questions illustrate how theory can be a source of research problems:
Can you summarize and integrate a set of past studies into a theory?
Are there any theoretical predictions needing empirical testing?
Do you have any theories that you believe have merit? Test them.
If there is little or no theory in the area of interest to you, then think about
collecting data to help you to generate a theory.
5. Contact and discussion with people: Contacts and discussions with research-
oriented people at conferences, seminars or public lectures serve as important
sources of problems.
6. Technological and social change: Changes in technology or in the social environment,
such as changes in attitudes, preferences, or the policies of a nation…
Considerations in selecting a research problem
When selecting a research problem/topic there are a number of considerations to keep in mind which will
help to ensure that your study is manageable and that you remain motivated. These considerations are:
Interest – Interest should be the most important consideration in selecting a research problem. A research
endeavor is usually time consuming and involves hard work and possibly unforeseen problems. If you
select a topic which does not greatly interest you, it could become extremely
difficult to sustain the required motivation and put in enough time and energy to complete it. One should
therefore select a topic of great interest in order to sustain the required motivation.
Magnitude – You should have sufficient knowledge about the research process to be able to visualize the
work involved in completing the proposed study. Narrow the topic down to something manageable, specific
and clear. It is extremely important to select a topic that you can manage within the time and with the
resources at your disposal. Even if you are undertaking a descriptive study, you need to consider its
magnitude carefully.
Measurement of concepts – If you are using a concept in your study (in quantitative studies), make sure
you are clear about its indicators and their measurement. For example, if you plan to measure the
effectiveness of a health promotion programme, you must be clear as to what determines effectiveness and
how it will be measured. Do not use concepts in your research problem that you are not sure how to
measure. This does not mean you cannot develop a measurement procedure as the study progresses. While
most of the developmental work will be done during your study, it is imperative that you are reasonably
clear about the measurement of these concepts at this stage.
Level of expertise – Make sure you have an adequate level of expertise for the task you are proposing.
Allow for the fact that you will learn during the study and may receive help from your research supervisor
and others, but remember that you need to do most of the work yourself.
Relevance – Select a topic that is of relevance to you as a professional. Ensure that your study adds to the
existing body of knowledge, bridges current gaps or is useful in policy formulation. This will help you to
sustain interest in the study.
Availability of data – If your topic entails the collection of information from secondary sources (office
records, client records, census or other already-published reports, etc.), make sure that these data are available,
and in the format you want, before finalizing your topic.
Ethical issues – Another important consideration in formulating a research problem is the ethical issues
involved. In the course of conducting a research study, the study population may be adversely affected by
some of the questions (directly or indirectly); deprived of an intervention; expected to share sensitive and
private information; or expected to be simply experimental 'guinea pigs'. How ethical issues can affect the
study population and how ethical problems can be overcome should be thoroughly examined at the
problem-formulation stage.
The formulation of a research problem is the most crucial part of the research journey, as the quality and
relevance of your research project depend entirely upon it. As mentioned earlier, every step that
constitutes the how part of the research journey depends upon the way you formulated your research
problem. The process of formulating a research problem consists of a number of steps. Working through
these steps presupposes a reasonable level of knowledge in the broad subject area within which the study is
to be undertaken and the research methodology itself. A brief review of the relevant literature helps
enormously in broadening this knowledge base. Without such knowledge it is difficult to 'dissect' a subject
area clearly and adequately. If you do not know what specific research topic, idea, questions or issue you
want to research (which is not uncommon among students), first go through the following steps:
Step 1: - Identify a broad field or subject area of interest to you. Ask yourself, 'What is it that really
interests me as a professional?' In the author's opinion, it is a good idea to think about the field in which
you would like to work after graduation. This will help you to find an interesting topic, and one which may
be of use to you in the future. For example, if you are a social work student, inclined to work in the area of
youth welfare, refugees or domestic violence after graduation, you might take to research in one of these
areas. Or if you are studying marketing, you might be interested in researching consumer behaviour. Or, as a
student of public health, intending to work with patients who have HIV/AIDS, you might like to conduct
research on a subject area relating to HIV/AIDS. As far as the research journey goes, these are the broad
research areas. It is imperative that you identify one of interest to you before undertaking your research
journey.
Step 2: - Dissect the broad area into subareas. At the outset, you will realize that all the broad areas
mentioned above – youth welfare, refugees, domestic violence, consumer behaviour and HIV/AIDS – have
many aspects. Similarly, you can select any subject area from other fields such as community health or
consumer research and go through this dissection process. In preparing this list of subareas you should also
consult others who have some knowledge of the area and the literature in your subject area. Once you have
developed an exhaustive list of the subareas from various sources, you proceed to the next stage where you
select what will become the basis of your enquiry.
Step 3: - Select what is of most interest to you. It is neither advisable nor feasible to
study all subareas. Out of this list, select issues or subareas about which you are
passionate. This is because
your interest should be the most important determinant for selection, even though there are other
considerations which have been discussed in the previous section, ‗Considerations in selecting a research
problem‘. One way to decide what interests you most is to start with the process of elimination. Go through
your list and delete all those subareas in which you are not very interested. You will find that towards the end
of this process, it will become very difficult for you to delete anything further. You need to continue until
you are left with something that is manageable considering the time available to you, your level of
expertise and other resources needed to undertake the study. Once you are confident that you have selected
an issue you are passionate about and can manage, you are ready to go to the next step.
Step 4: - Raise research questions. At this step ask yourself, ‗What is it that I want to find out about in
this subarea?‘ Make a list of whatever questions come to your mind relating to your chosen subarea and if
you think there are too many to be manageable, go through the process of elimination, as you did in Step 3.
Step 5: - Formulate objectives. Both your main objectives and your subobjectives, which grow out of your
research questions, now need to be formulated. The main difference between objectives and research
questions is the way in which they are written. Research questions are, obviously, phrased as questions. Objectives
transform these questions into behavioral aims by using action-oriented words such as 'to find out', 'to
determine', 'to ascertain' and 'to examine'. Some researchers prefer to reverse the process; that is, they start
from objectives and formulate research questions from them. Some researchers are satisfied only with
research questions, and do not formulate objectives at all. If you prefer to have only research questions or
only objectives, this is fine, but keep in mind the requirements of your institution for research proposals.
Step 6: - Assess your objectives. Now examine your objectives to ascertain the feasibility of achieving
them through your research endeavor. Consider them in the light of the time, resources (financial and
human) and technical expertise at your disposal.
Step 7: - Double-check. Go back and give final consideration to whether or not you are sufficiently
interested in the study, and have adequate resources to undertake it. Ask yourself, 'Am I really enthusiastic
about this study?' and 'Do I really have enough resources to undertake it?' Answer these questions
thoughtfully and realistically. If your answer to one of them is 'no', reassess your objectives.
The formulation of a problem is like the 'input' to a study, and the 'output' – the quality of the contents of
the research report and the validity of the associations or causation established – is entirely dependent upon
it. Hence the famous saying about computers, 'garbage in, garbage out', is equally applicable to a research
problem.
It determines the research destiny. It indicates a way for the researcher. Without it,
a clear and economical plan is impossible.
Research problem is like the foundation of a building. The research problem
serves as the foundation of a research study: if it is well formulated, one can
expect a good study to follow.
The way you formulate your research problem determines almost every step that
follows: the type of study design that can be used; the type of sampling strategy
that can be employed; the research instrument that can be used; and the type of
analysis that can be undertaken.
The quality of the research report (output of the research undertakings) is
dependent on the quality of the problem formulation.
2.3.1. INTRODUCTION
One of the essential preliminary tasks when you undertake a research study is to go through the existing
literature in order to acquaint yourself with the available body of knowledge in your area of interest.
Reviewing the literature can be time consuming, daunting and frustrating, but it is also rewarding. The
literature review is an integral part of the research process and makes a valuable contribution to almost
every operational step. It has value even before the first step; that is, when you are merely thinking about a
research question that you may want to find answers to through your research journey. In the initial stages
of research, it helps you to establish the theoretical roots of your study, clarify your ideas and develop your
research methodology. Later in the process, the literature review serves to enhance and consolidate your
own knowledge base and helps you to integrate your findings with the existing body of knowledge. Since
an important responsibility in research is to compare your findings with those of others, it is here that the
literature review plays an extremely important role. During the write-up of your report, it helps you to
integrate your
findings with existing knowledge – that is, to either support or contradict earlier research. The higher the
academic level of your research, the more important a thorough integration of your findings with existing
literature becomes. In this section, you will learn about what the literature and literature review are, the
purposes and review procedure and related issues.
2.3.2. Meaning of Review of Literature
The phrase ‗review of literature‘ consists of two words: Review and Literature. The word ‗literature‘ has
conveyed different meaning from the traditional meaning. It is used with reference to the languages e.g.,
Amharic literature, English literature, Sanskrit literature. It includes a subject content: prose, poetry, dramas,
novels, stories etc. Here in research methodology the term literature refers to the knowledge of a particular
area of investigation of any discipline which includes theoretical, practical and its research studies.
The term ‗review‘ means to organize the knowledge of the specific area of research to evolve an edifice of
knowledge to show that his study would be an addition to this field. The task of review of literature is
highly creative and tedious because researcher has to synthesize the available knowledge of the field in a
unique way to provide the rationale for his study.
Literature review is a body of text that aims to review the critical points of current knowledge about your
research topic. Literature work is an evolving and ongoing task that is updated and revised throughout the
process of writing the research. A research literature review is a systematic and reproducible method for
identifying, evaluating and synthesizing the existing body of completed and recorded work produced by
researchers, scholars, and practitioners. Literature review is a search and evaluation of the available
literature in your given subject or chosen topic area. It is one of the essential preliminary tasks of a
researcher.
In summary, a literature review has the following functions:
It provides a theoretical background to your study.
It helps you establish the links between what you are proposing to examine and
what has already been studied.
It enables you to show how your findings have contributed to the existing body of
knowledge in your profession.
2.3.3. Reasons for reviewing the literature
1) BRINGING CLARITY AND FOCUS TO YOUR RESEARCH PROBLEM
The literature review involves a paradox. On the one hand, you cannot effectively undertake a literature
search without some idea of the problem you wish to investigate. On the other hand, the literature review
can play an extremely important role in shaping your research problem because the process of reviewing
the literature helps you to understand the subject area better and thus helps you to conceptualize your
research problem clearly and precisely and makes it more relevant and pertinent to your field of enquiry.
When reviewing the literature, you learn what aspects of your subject area have been examined by others,
what they have found out about these aspects, what gaps they have identified and what suggestions they
have made for further research. All these will help you gain a greater insight into your own research
questions and provide you with clarity and focus which are central to a relevant and valid study. In addition,
it will help you to focus your study on areas where there are gaps in the existing body of knowledge,
thereby enhancing its relevance.
2) IMPROVING YOUR RESEARCH METHODOLOGY
Going through the literature acquaints you with the methodologies that have been used by others to find
answers to research questions similar to the one you are investigating. A literature review tells you if others
have used procedures and methods similar to the ones that you are proposing, which procedures and
methods have worked well for them and what problems they have faced with them. By becoming aware of
any problems and pitfalls, you will be better positioned to select a methodology that is capable of providing
valid answers to your research question. This will increase your confidence in the methodology you plan to
use and will equip you to defend its use.
3) BROADENING YOUR KNOWLEDGE BASE IN YOUR RESEARCH AREA
The most important function of the literature review is to ensure you read widely around the subject area in
which you intend to conduct your research study. It is important that you know what other researchers have
found in regard to the same or similar questions, what theories have been put forward and what gaps exist
in the relevant body of knowledge. When you undertake a research project for a higher degree (e.g., an MA
or a PhD) you are expected to be an expert in your area of research. A thorough literature review helps you
to fulfil this expectation. Another important reason for doing a literature review is that it helps you to
understand how the findings of your study fit into the existing body of knowledge (Martin 1985).
4) ENABLING YOU TO CONTEXTUALIZE YOUR FINDINGS
Obtaining answers to your research questions is comparatively easy: the difficult part is examining how your
findings fit into the existing body of knowledge. How do answers to your research questions compare with
what others have found? What contribution have you been able to make to the existing body of knowledge?
How are your findings different from those of others? Undertaking a literature review will enable you to
compare your findings with those of others and answer these questions. It is important to place your
findings in the context of what is already known in your field of enquiry.
2.3.4. PROCEDURES IN REVIEWING THE LITERATURE
If you do not have a specific research problem, you should review the literature in your broad area of interest
with the aim of gradually narrowing it down to what you want to find out about. After that the literature
review should be focused around your research problem. There is a danger in reviewing the literature
without having a reasonably specific idea of what you want to study. It can condition your thinking about
your study and the methodology you might use, resulting in a less innovative choice of research problem
and methodology than otherwise would have been the case. Hence, you should try broadly to conceptualize
your research problem before undertaking your major literature review. Reviewing the literature is a
continuous process. Often it begins before a specific research problem has been formulated and continues
until the report is finished.
There are four steps involved in conducting a literature review:
1. SEARCHING FOR THE EXISTING LITERATURE
To search effectively for the literature in your field of enquiry, it is imperative that you have at least some
ideas of the broad subject area and of the problem you wish to investigate, in order to set parameters for
your search. Next, compile a bibliography for this broad area. There are three sources that you can use to
prepare a bibliography:
books;
journals;
the Internet.
A. Books
Though books are a central part of any bibliography, they have their disadvantages as well as advantages.
The main advantage is that the material published in books is usually important and of good quality, and
the findings are ‗integrated with other research to form a coherent body of
knowledge' (Martin 1985). The main disadvantage is that the material is not completely up to date, as it can
take a few years between the completion of a work and its publication in the form of a book. The best way to
search for a book is to look at your library catalogues. When librarians catalogue a book, they also assign to
it subject headings that are usually based on Library of Congress Subject Headings. If you are not sure, ask
your librarian to help you find the best subject heading for your area. This can save you a lot of time.
Publications such as Book Review Index can help you to locate books of interest.
Use the subject catalogue or keywords option to search for books in your area of interest. Narrow the
subject area searched by selecting the appropriate keywords. Look through these titles carefully and identify
the books you think are likely to be of interest to you. If you think the titles seem appropriate to your topic,
print them out (if this facility is available), as this will save you time, or note them down on a piece of
paper. Be aware that sometimes a title does not provide enough information to help you decide if a book is
going to be of use so you may have to examine its contents too.
When you have selected 10–15 books that you think are appropriate for your topic, examine the
bibliography of each one. It will save time if you photocopy their bibliographies. Go through these
bibliographies carefully to identify the books common to several of them. If a book has been referenced by
a number of authors, you should include it in your reading list. Prepare a final list of books that you
consider essential reading.
Having prepared your reading list, locate these books in your library or borrow them from other sources.
Examine their contents to double-check that they really are relevant to your topic. If you find that a book is
not relevant to your research, delete it from your reading list. If you find that something in a book‘s
contents is relevant to your topic, make an annotated bibliography. An annotated bibliography contains a
brief abstract of the aspects covered in a book and your own notes of its relevance. Be careful to keep track
of your references. To do this you can prepare your own card index or use a computer program such as
Endnotes or Pro-Cite.
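A minimal sketch (in Python) of keeping such annotated records in a structured form is shown below; the fields and the sample entry are hypothetical and simply stand in for whatever card index or reference-management software you use.

# One hypothetical annotated-bibliography record per source.
annotated_bibliography = [
    {
        "author": "Example, A.",
        "year": 2001,
        "title": "A hypothetical book title",
        "abstract": "Brief abstract of the aspects of the book relevant to the study.",
        "relevance": "Your own note on why and where the source will be used.",
    },
]

# Print a compact reading list sorted by author, e.g. before a library visit.
for record in sorted(annotated_bibliography, key=lambda r: r["author"]):
    print(f"{record['author']} ({record['year']}): {record['title']}")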
B. Journals
You need to go through the journals relating to your research in a similar manner. Journals provide you with
the most up-to-date information, even though there is often a gap of two to three years between the
completion of a research project and its publication in a journal. You should select as many journals as you
possibly can, though the number of journals available depends upon the field
of study – certain fields have more journals than others. As with books, you need to prepare a list of the
journals you want to examine for identifying the literature relevant to your study. This can be done in a
number of ways.
You can:
Locate the hard copies of the journals that are appropriate to your study;
Look at citation or abstract indices to identify and/or read the abstracts of such articles;
Search electronic databases.
If you have been able to identify any useful journals and articles, prepare a list of those you want to
examine, by journal. Select one of these journals and, starting with the latest issue, examine its contents
page to see if there is an article of relevance to your research topic. If you feel that a particular article is of
interest to you, read its abstract. If you think you are likely to use it, depending upon your financial
resources, either photocopy it, or prepare a summary and record its reference for later use. There are several
sources designed to make your search for journals easier and these can save you enormous time. They are:
Indices of journals (e.g., Humanities Index);
Abstracts of articles (e.g., ERIC);
Citation indices (e.g., Social Sciences Citation Index).
Each of these indexing, abstracting and citation services is available in print, or accessible through the
Internet. In most libraries, information on books, journals and abstracts is stored on computers. In each case
the information is classified by subject, author and title. You may also have the keywords option
(author/keyword; title/keyword; subject/keyword; expert/keyword; or just keywords). What system you use
depends upon what is available in your library and what you are familiar with.
There are specially prepared electronic databases in a number of disciplines. These can also be helpful in
preparing a bibliography. Select the database most appropriate to your area of study to see if there are any
useful references. Of course, any computer database search is restricted to those journals and articles that are
already on the database. You should also talk to your research supervisor and other available experts to find
out about any additional relevant literature to include in your reading list.
C. The Internet
In almost every academic discipline and professional field, the Internet has become an important tool for
finding published literature. Through an Internet search you can identify published material in books,
journals and other sources with immense ease and speed.
An Internet search is carried out through search engines, of which there are many, though the most
commonly used are Google and Yahoo. Searching through the Internet is very similar to the search for books
and articles in a library using an electronic catalogue, as it is based on the use of keywords. An Internet
search basically identifies all material in the database of a search engine that contains the keywords you
specify, either individually or in combination. It is important that you choose words or combinations of
words that other people are likely to use.
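For example, keywords can be combined with quotation marks and the operators AND/OR, which many search engines and bibliographic databases understand. The short Python sketch below merely assembles such a query string; the keywords themselves are hypothetical.

# Hypothetical keywords: one core concept with its synonyms, and a geographic limiter.
concept = ["consumer behaviour", "buyer behaviour", "purchasing behaviour"]
limiter = "Ethiopia"

# Join synonyms with OR, then combine the two ideas with AND, quoting each phrase.
query = "(" + " OR ".join(f'"{term}"' for term in concept) + f') AND "{limiter}"'
print(query)
# ("consumer behaviour" OR "buyer behaviour" OR "purchasing behaviour") AND "Ethiopia"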
2. REVIEWING THE SELECTED LITERATURE
Now that you have identified several books and articles as useful, the next step is to start reading them
critically to pull together themes and issues that are of relevance to your study. Unless you have a
theoretical framework of themes in mind to start with, use separate sheets of paper for each theme or issue
you identify as you go through selected books and articles.
Once you develop a rough framework, slot the findings from the material so far reviewed into these
themes, using a separate sheet of paper for each theme of the framework so far developed. As you read
further, go on slotting the information where it logically belongs under the themes so far developed. Keep in
mind that you may need to add more themes as you go along. While going through the literature you should
carefully and critically examine it with respect to the following aspects:
⮫ Note whether the knowledge relevant to your theoretical framework has been confirmed
beyond doubt.
⮫ Note the theories put forward, the criticisms of these and their basis, the
methodologies adopted (study design, sample size and its characteristics,
measurement procedures, etc.) and the criticisms of them.
⮫ Examine to what extent the findings can be generalized to other situations.
⮫ Notice where there are significant differences of opinion among researchers and give
your opinion about the validity of these differences.
⮫ Ascertain the areas in which little or nothing is known – the gaps that exist in the
body of knowledge.
3. DEVELOPING A THEORETICAL FRAMEWORK
Examining the literature can be a never-ending task, but as you have limited time it is important to set
parameters by reviewing the literature in relation to some main themes pertinent to your research topic. As
you start reading the literature, you will soon discover that the problem you wish to investigate has its roots
in a number of theories that have been developed from different perspectives. The information obtained
from different books and journals now needs to be sorted under the main themes and theories, highlighting
agreements and disagreements among the authors and identifying the unanswered questions or gaps. You
will also realize that the literature deals with a number of aspects that have a direct or indirect bearing on
your research topic. Use these aspects as a basis for developing your theoretical framework. Your review of
the literature should sort out the information, as mentioned earlier, within this framework. Unless you
review the literature in relation to this framework, you will not be able to develop a focus in your literature
search: that is, your theoretical framework provides you with a guide as you read. This brings us to the
paradox mentioned previously: until you go through the literature you cannot develop a theoretical
framework, and until you have developed a theoretical framework you cannot effectively review the
literature. The solution is to read some of the literature and then attempt to develop a framework, even a
loose one, within which you can organize the rest of the literature you read. As you read more about the
area, you are likely to change the framework. However, without it, you will get bogged down in a great
deal of unnecessary reading and note-taking that may not be relevant to your study.
If you want to study the relationship between mortality and fertility, you should review literature about:
Fertility: trends, theories, some of the indices and critiques of them, factors
affecting fertility, methods of controlling fertility, factors affecting acceptance of
contraceptives, etc.;
Mortality: factors affecting mortality, mortality indices and their sensitivity in
measuring change in mortality levels of a population, trends in mortality, etc.;
and, most importantly,
The relationship between fertility and mortality: theories that have been put
forward to explain the relationship, and implications of the relationship.
Out of this literature review, you need to develop the theoretical framework for your study. Primarily this
should revolve around theories about the relationship between mortality and fertility.
You will discover that a number of theories have been proposed to explain this relationship. For example, it
has been explained from economic, religious, medical, and psychological perspectives. Your literature
review should be written under the following headings, with most of the review involving examining the
relationships between fertility and mortality:
Fertility theories;
The theory of demographic transition;
Trends in fertility (global, then narrow it to national and local levels);
Methods of contraception (their acceptance and effectiveness);
Factors affecting mortality;
Trends in mortality (and their implications);
Measurement of mortality indices (their sensitivity), and
Relationships between fertility and mortality (different theories such as
'insurance', 'fear of non-survival', 'replacement', 'price', 'utility', 'risk',
'hoarding').
Literature pertinent to your study may deal with two types of information:
1. Universal; and
2. More specific, i.e., local trends, or a specific program.
In writing about such information, you should start with the general information, gradually narrowing
it down to the specific as, for example, shown above.
The theoretical framework includes all the theories that have been put forward to explain the
relationship between fertility and mortality. However, out of these, you may be planning
to test only one, say 'the fear of non-survival'. Hence, the conceptual framework
grows out of the theoretical framework and relates to the specific research problem,
concerning the fear of non-survival.
Broaden your understanding from the following examples
Original Passage
During the last two years of my medical course and the period which I spent in the
hospitals as house physician, I found time, by means of serious encroachment on my
night's rest, to bring to completion a work on the history of scientific research into the
thought world of St. Paul, to revise and enlarge the Quest of the Historical Jesus for
the second edition, and together with Widor to prepare an edition of Bach's preludes and
fugues for the organ, giving with each piece directions for its rendering. (Albert
Schweitzer, Out of My Life and Thought. New York: Mentor, 1963, p. 94)
A Poor Paraphrase
Schweitzer said that during the last two years of his medical course and the period he
spent in the hospitals as house physician he found time, by encroaching on his night’s
rest, to bring to completion several works.
(Note: This paraphrase uses too many words and phrases from the original without putting them in
quotation marks and thus is considered plagiarism. (Plagiarism is the unauthorized use of an author's thoughts
or ideas and presenting them as one's own.) Furthermore, many of the ideas of the author have been left
out, making the paraphrase incomplete. Finally, the one who is paraphrasing has neglected to acknowledge
the source through a parenthetical citation.)
A Good Paraphrase
Albert Schweitzer observed that by staying up late at night, first as a medical student and
then as a "house physician," he was able to finish several major works, including a
historical book on the intellectual world of St. Paul, a revised and expanded second
edition of the Quest of the Historical Jesus, and a new edition of Bach's organ preludes
and fugues complete with interpretive notes, written collaboratively with Widor.
(Note: This paraphrase is very complete and appropriate; it does not use the author's own words, except in
one instance, which is acknowledged by quotation marks.)
B. INCORPORATING DIRECT QUOTES
At times you may want to use direct quotes in addition to paraphrases and summaries. To incorporate direct
quotes smoothly, the following general principles hold.
When your quotations are four lines in length or less, surround them with
quotation marks and incorporate them into your text. When your quotations are longer
than four lines, set them off from the rest of the text by indenting five spaces from the left
and right margins and triple spacing above and below them. You do not need to use
quotation marks with such block quotes. Follow the block quote with the punctuation
found in the source. Then skip two spaces before parenthetical citation. Do not include
a period after the parentheses.
Introduce quotes using a verb tense that is consistent with the tense of the quote.
Change a capital letter to a lower-case letter (or vice versa) within the quote if
necessary. Use brackets for explanations or interpretations not in the original quote.
E.g. ("Evidence reveals that boys are higher on conduct disorder (behavior directed
toward the environment) than girls.") Use ellipses (three spaced dots) to indicate that
material has been omitted from the quote. It is not necessary to use ellipses for material
omitted before the quote begins. E.g. ("Fifteen to twenty percent of anorexia victims die
of direct starvation or related illnesses ... [which] their weak body immunity cannot
combat.")
Punctuate a direct quote as the original was punctuated. However, change the
end punctuation to fit the context. (For example, a quotation that ends with a period may
require a comma instead of the period when it is integrated into your own sentence.)
A period, or a comma if the sentence continues after the quote, goes inside the quotation marks.
E.g. (Although Almaz tries to disguise "her innate evil nature, it reveals itself at the slightest
loss of control, as when she has a little alcohol.") When the quote is followed by a
parenthetical citation, omit the punctuation before the quotation mark and follow the
parentheses with a period or comma. E.g. Alemu has "recognized the evil in himself, [and]
is ready to act for good."
If an ellipsis occurs at the end of the quoted material, add a period before the dots.
E.g. (Almaz is "more than a Woman, who not only succumbs to the Serpent, but
becomes the serpent itself… as she triumphs over her victims….")
Place question marks and exclamation points outside the quotation marks if the
entire sentence is a question or an exclamation. E.g. (Has Sara read the article
"Alienation in East of Eden"?)
Place question marks and exclamation points inside the quotation marks if and
only if the quote itself is a question or an exclamation. E.g. (Mary attended the lecture
entitled "Is Cathy Really Eve?")
Use a colon to introduce a quote if the introductory material prior to the quote is
long or if the quote itself is more than one sentence long.
E.g. Taylor puts it this way: (long quote indented from the margins)
Use a comma to introduce a short quote. (Steinbeck explains, "If Cathy were
simply a monster, that would not bring her in the story.")
C. REFERRING TO OTHERS IN THE TEXT
In the Harvard system, at every point in the text at which reference is made to other writers, the name of the
writer and the year of publication should be included. It is also advisable to include the page number.
If the surname of the author is part of the sentence, then the year of the
publication will appear in brackets.
EXAMPLE
Bloom (1963, p 16) describes this…
If the name of the author is not part of the sentence, then both the surname and the
year of publication, with the page number, should be in brackets.
EXAMPLE
In a recent study (Smith, 1990, p36), it is described as…
If there are three or fewer authors, then their family names should be given; if there
are more than three authors, the first author's family name should be given,
followed by et al.
EXAMPLE
Tolera, Barabaran and Jones (1991, p33) suggest that…
The most recent work (Barabaran et al, 1995, p16) shows that…
If the same author has published two or more works in the same year, then each work should be referred to
individually by the year followed by lower case letters (a, b, c, etc). (These different references should be
included in the bibliography).
EXAMPLE
Barabaran (1996a, pp35-7) shows how…
2.3.5. Objectives of Review of Literature
The review of literature serves the following purposes in conducting research work:
It provides theories, ideas, explanations or hypothesis which may prove useful in
the formulation of a new problem.
It indicates whether the evidence already available solves the problem adequately without
requiring further investigation, distinguishing what has been done from what needs to
be done. It thus avoids replication.
It provides the sources for hypothesis. The researcher can formulate research
hypothesis on the basis of available studies.
It suggests method, procedure, sources of data and statistical techniques appropriate to
the solution of the problem.
It locates comparative data and findings useful in the interpretation and discussion of
results. The conclusions drawn in related studies may be compared with, and used in
interpreting, the findings of the study.
It helps in determining the meaning and relevance of the study, and its relationship
with, and deviation from, the available studies.
2.4. Hypothesis Formulation
INTRODUCTION
The formulation of hypotheses or propositions as the possible answers to the research questions is an
important step in the process of formulation of the research problem as explained in the previous section.
Keen observation, creative thinking, hunch, wit, imagination, vision, insight and sound judgment are of
great importance in setting up reasonable hypotheses. When the mind has before it a number of observed
facts about some phenomenon, there is a need to form some generalization relative to the phenomenon
concerned. Having introduced this much, let us now see other aspects of a research hypothesis as follows.
2.4.1. The meaning of Hypotheses
The word hypothesis is made up of two Greek roots which mean that it is some sort of 'sub-statement',
for it is the presumptive statement of a proposition which the investigation seeks to prove. The hypothesis
furnishes the germinal basis of the whole investigation and remains to the end its cornerstone, for the
whole research is directed to testing it out by facts. At the start of the investigation, the hypothesis is a stimulus to
critical thought and offers insights into the confusion of phenomena. At the end it comes to prominence as the
proposition to be accepted or rejected in the light of the findings. The word hypothesis consists of two
words:
Hypo + thesis = Hypothesis
⚫ Hypo means under or below and thesis means a reasoned theory or rational viewpoint.
Thus, hypothesis would mean a theory which is not fully reasoned.
⚫ It is a tentative supposition or provisional guess which seems to explain the
situation under observation. – James E. Creighton
⚫ Hypothesis is a tentative statement of the relationship between two or more
variables. Usually a research hypothesis must contain, at least, one independent
and one dependent variable.
⚫ A research hypothesis may refer to an unproven proposition or supposition that tentatively
explains certain facts or phenomena; a proposition that is empirically testable.
A hypothesis is a tentative assumption drawn from knowledge and theory which is used as a guide in the
investigation of other facts and theories that are yet unknown. It is a guess, supposition or tentative
inference as to the existence of some fact, condition or relationship relative to some phenomenon which
serves to explain such facts as already are known to exist in a given area of research and to guide the search
for new truth.
A hypothesis is a tentative supposition or provisional guess which seeks to explain the situation under
observation.
A hypothesis states what we are looking for. A hypothesis looks forward. It is a proposition, which can be
put to a test to determine its validity. It may prove to be correct or incorrect.
A hypothesis is a tentative generalization the validity of which remains to be seen. In its most elementary
stage, the hypothesis may be any hunch, guess, imaginative idea which becomes the basis for further
investigation.
A hypothesis is an assumption or proposition whose tenability is to be tested on the basis of the
compatibility of its implications with empirical evidence and with previous knowledge.
A hypothesis is, therefore, a shrewd and intelligent guess, a supposition, inference, hunch, provisional
statement or tentative generalization as to the existence of some fact, condition or relationship relative to
some phenomenon which serves to explain already known facts in a given area of research and to guide the
search for new truth on the basis of empirical evidence. The hypothesis is put to test for its tenability and
for determining its validity.
The testing of a hypothesis is an important characteristic of the scientific method. It is a prerequisite of
any successful research, for it enables us to get rid of vague approaches and meaningless interpretations. It
establishes the relationship of concept with theory, and specifies the test to be applied especially in the
context of a meaningful value judgment. The hypothesis, therefore, plays a very pivotal role in the
scientific research method. The formulation of
hypothesis, thus, is very crucial and the success or the failure of a research study depends upon how best it
has been formulated by the researcher. We may conclude by saying that it is hard to conceive modern
science in all its rigorous and disciplined fertility without the guiding power of hypothesis.
2.4.2. SOURCES OF HYPOTHESES
The inspiration for hypotheses comes from a number of sources, which include the following:
1. Professional Experience: The daily life experience or the day-to-day observation
of the relationship (correlation) between different phenomena leads the researcher
to hypothesize a relationship and to conduct a study to see if his/her assumptions are
confirmed.
2. Past Research or Common beliefs: Hypothesis can also be inspired by tracing past
research or by commonly held beliefs.
3. Through direct analysis of data or deduction from existing theory: Hypotheses may
also be generated through direct analysis of data in the field or may be deduced
from a formal theory. Through attentive reading, the researcher may be able to become
acquainted with relevant theories, principles and facts that may alert him or her
to hypotheses valid for his/her study.
4. Technological and social changes: These directly or indirectly exert an influence on the
functioning of an organization. All such changes bring about new problems for
research.
2.4.3. Importance of Hypotheses
The hypothesis has a very important place in research, although it occupies only a small place in the body of a
thesis. It is almost impossible for a research worker not to have one or more hypotheses before proceeding
with his work. If he is not capable of formulating a hypothesis about his problem, he may not be ready to
undertake the investigation. The aimless collection of data is not likely to lead him anywhere. The
importance of hypothesis can be more specifically stated as under:
It provides direction to research. It defines what is relevant and what is
irrelevant. Thus, it prevents the review of irrelevant literature and the collection
of useless or excess data. It not only prevents wastage in the collection of data, but
also ensures the collection of the data necessary to answer the question posed in
the statement of the problem.
It Focuses Research: Without it, research is unfocused and remains like
random empirical wandering. It serves as a necessary link between theory and the
investigation.
It Places Clear and Specific Goals: A well-thought-out set of hypotheses places clear
and specific goals before the research worker and provides him
with a basis for selecting samples and research procedures to meet these goals.
It Prevents Blind Research: "The use of hypothesis prevents a blind search and
indiscriminate gathering of masses of data which may later prove irrelevant to the
problem under study." It provides direction to research and prevents the review
of irrelevant literature and the collection of useless or excess data.
It enables the investigator to understand, with greater clarity, his/her problem and
its ramifications. It further enables a researcher to clarify the procedures and
methods to be used in solving his problem and to rule out methods which are
incapable of providing the necessary data.
It serves as a framework for drawing conclusions. It makes possible the
investigation of data in the light of tentative proposition or provisional guess. It
provides the outline for setting conclusions in a meaningful way.
2.4.4. CHARACTERISTICS OF USABLE HYPOTHESES
A fruitful hypothesis is distinguished by the following characteristics:
1. Hypothesis should be clear and precise. A hypothesis should be conceptually clear. It
should consist of clearly defined and understandable concepts. This means that the
concepts found in the hypothesis should be formally and operationally defined.
Formal definition or explication of the concepts will clarify what a particular concept
stands for, while the operational definition will leave no ambiguity about what would
constitute the empirical evidence or indicator of the concept in the field. If the
hypothesis is not clear and precise, the inferences drawn on its basis cannot be taken
as reliable.
2. Hypothesis should be capable of being tested. A hypothesis should be testable and
should not be a moral judgment. It should be possible to collect empirical evidences to
test the hypothesis. Statements like "Capitalists exploit their workers" or "Bad parents
produce bad children" are commonplace generalizations and cannot be tested, as they
merely express sentiments and their concepts are vague.
3. A hypothesis should be related to the existing body of knowledge. It is important that
your hypothesis emerges from the existing body of knowledge, and that it adds to it,
as this is an
important function of research. This can only be achieved if the hypothesis has its roots in
the existing body of knowledge.
4. Hypothesis should be limited in scope and must be specific. A researcher must
remember that narrower hypotheses are generally more testable and he should
develop such hypotheses. A hypothesis should include a clear statement of the indexes
which are to be used. For example, the concept of social class needs to be explicated
in terms of indexes such as income, occupation, education, etc. Such specific
formulations have the obvious advantage of assuring that research is practicable and
significant. It also helps to increase the validity of the results because the more specific
the statement or prediction, the smaller the probability that it will actually be borne out
as a result of mere accident or chance.
5. Hypothesis should state relationship between variables, if it happens to be a
relational hypothesis.
6. Hypothesis should be stated as far as possible in the simplest terms so that it
is easily understandable by all concerned. A hypothesis should be a simple one,
requiring fewer conditions or assumptions. But "simple" does not mean obvious;
simplicity demands insight. The more insight the researcher has into the problem,
the simpler his hypothesis about it will be. One must remember that the simplicity of
a hypothesis has nothing to do with its significance.
2.4.5. TYPES OF HYPOTHESES
Hypotheses vary in form and, to some extent, form is determined by function. Thus, a working
hypothesis or a tentative hypothesis is described as the best guess or statement derivable from known or
available evidence. The amount of evidence and the certainty or quality of it determine other forms of
hypotheses. In other cases, the type of statistical treatment generates a need for a particular form of
hypothesis. The following kinds of hypotheses and their examples represent an attempt to order the more
commonly observed varieties as well as to provide some general guidelines for hypothesis development.
Question form of Hypotheses: Some writers assert that a hypothesis may be stated as a question; however,
there is no general consensus on this view. At best, it represents the simplest level of empirical observation.
In fact, it fails to fit most definitions of hypothesis. It is included here for two reasons: the first is
simply that it frequently appears in lists of hypothesis forms. The second is that there are cases of simple
investigation and search which can be adequately implemented by raising a question, rather than
dichotomizing hypothesis forms into acceptable/rejectable categories. The following examples of questions
are used to illustrate the various hypothesis forms:
H1: Does the change in curriculum affect the academic status of students in
Arbaminch University?
H2: Will students who learn in small classes perform better in a mathematics test than
those who learn in large classes?
Directional Hypothesis: A hypothesis may be directional, which connotes an expected direction in the
relationship or difference between variables. The above hypotheses have been rewritten in directional
form as follows:
H1: In AMU, the academic status of students who studied the new curriculum is
significantly higher than that of students who studied the old curriculum.
H2: Students who learn in small classes perform better in mathematics tests than
those who learn in large classes.
The developer of this type of hypothesis appears more certain of his anticipated evidence than would be the
case if he had used either of the previous forms. If seeking a tenable hypothesis is the general interest of
the researcher, this kind of hypothesis is less safe than the others because it reveals two possible conditions.
These conditions are a matter of degree. The first condition is that the relationship being sought
between the variables is so obvious that additional evidence is scarcely needed. The second condition arises
because the researcher has examined the variables very thoroughly and the available evidence supports the
statement of a particular anticipated outcome. An example of the obviously safe hypothesis would be
the hypothesis that highly intelligent students learn better than less intelligent students.
Non-Directional Hypothesis: A hypothesis may be stated in a non-directional form, which asserts that a
relationship or a difference exists between or among the variables but does not specify its direction. This kind of
hypothesis states that there is a relationship or a difference without specifying the direction of the relationship
or the difference.
H1: In AMU, there is a difference in academic performance of students who studied
old curriculum and new curriculum.
Null hypotheses: When you construct a hypothesis stipulating that there is no
difference/relationship between two situations, groups, outcomes, or variables, this is called a null
hypothesis and is usually written as Ho. Such a statistical hypothesis, which is under test, is usually a
hypothesis of no difference between a statistic and a parameter. The null hypothesis is a statistical hypothesis
which is used in analyzing the data. It assumes that the observed difference is attributable to sampling error
and that the true difference is zero. The above hypothesis has been written in null form as follows:
Ho: In AMU, there is no significant difference in academic status of students who
studied new curriculum and old curriculum.
Alternate hypotheses: A hypothesis in which a researcher stipulates that there will be a
difference/relationship but does not specify its magnitude is called an alternate hypothesis and is usually
written as Ha. It is true when Ho is false: it is the statement about the population that must be true if the null
hypothesis is false. Any hypothesis which is complementary to the null hypothesis is called an alternative
hypothesis. It is important to state the alternative hypothesis explicitly in respect of any null hypothesis,
because the acceptance or rejection of Ho is meaningful only if it is being tested against a rival hypothesis.
The above hypothesis has been written in alternate form as follows (see also the symbolic summary below):
Ha: In AMU, there is significant difference in academic status of students who studied
new curriculum and old curriculum.
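The null, non-directional and directional forms of this hypothesis can also be written symbolically. The notation below is a brief sketch in LaTeX, assuming \mu_{\text{new}} and \mu_{\text{old}} denote the mean academic status of students who studied the new and the old curriculum respectively:
H_0 : \mu_{\text{new}} = \mu_{\text{old}}
H_a : \mu_{\text{new}} \neq \mu_{\text{old}} \quad (\text{non-directional})
H_a : \mu_{\text{new}} > \mu_{\text{old}} \quad (\text{directional})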
2.4.6. PROCEDURES FOR HYPOTHESES TESTING
To test a hypothesis means to tell (on the basis of the data that the researcher has collected) whether or not the
hypothesis seems to be valid. Procedures in hypothesis testing refer to all those steps that we undertake in
making a choice between the two actions, i.e., rejection and acceptance of a null hypothesis. The first and
foremost problem in any testing procedure is the setting up of the null hypothesis. As the name suggests, it
is always taken as a hypothesis of no difference; the neutral or null attitude of the decision maker or researcher
before drawing the sample is the basis of the null hypothesis. The following points may be borne in mind in
setting and testing the hypothesis.
1) If we want to test significance of the difference between a statistic and the
parameter or between two sample statistics then we set up the null hypothesis, that
the difference is not significant. This means that the difference is just due to
fluctuations of sampling.
2) Setting the level of significance: The hypothesis is examined on a predetermined
level of significance. In other words, the level of significance can be either 5% or 1%
depending upon the purpose, nature of enquiry and size of the sample.
3) The next step in the testing of hypothesis is calculation of Standard Error (SE). The
standard deviation of the sampling distribution of a statistic is known as Standard
Error. The concept of standard error (SE) is extremely useful in the testing of
statistical hypothesis. Note that the SE is calculated differently for different
statistical value.
4) Calculation of the significance ratio: The significance ratio is symbolically described as 't'.
It is calculated by dividing the difference between the parameter and the statistic by the
standard error, i.e., t = (statistic − parameter) / SE.
5) Deriving the inference: Compare the calculated value with the critical value (table
value). If the calculated value is less than the critical value, the difference is insignificant, and
vice versa. A worked sketch of steps 2 to 5 is given after this list.
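As a minimal illustration of steps 2 to 5, the Python sketch below compares mathematics scores of students taught in small and large classes (the class-size example used earlier). The scores, sample sizes and the choice of a pooled two-sample t-test are assumptions made purely for illustration and are not part of the original text; only the procedure (set the level of significance, compute the standard error, compute t, compare with the table value) follows the steps above.

import math
from scipy import stats  # used only to obtain the critical (table) value of t

# Hypothetical mathematics scores; the numbers are invented purely for illustration.
small_class = [78, 85, 90, 72, 88, 95, 81, 84]
large_class = [70, 75, 80, 68, 74, 79, 73, 77]

alpha = 0.05                                    # step 2: level of significance
n1, n2 = len(small_class), len(large_class)
mean1, mean2 = sum(small_class) / n1, sum(large_class) / n2
var1 = sum((x - mean1) ** 2 for x in small_class) / (n1 - 1)
var2 = sum((x - mean2) ** 2 for x in large_class) / (n2 - 1)

# Step 3: standard error of the difference between the two sample means (pooled variance).
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

t_ratio = (mean1 - mean2) / se                  # step 4: significance ratio 't'

df = n1 + n2 - 2
t_critical = stats.t.ppf(1 - alpha / 2, df)     # two-tailed critical value at the 5% level

# Step 5: deriving the inference by comparing the calculated value with the critical value.
if abs(t_ratio) > t_critical:
    print(f"t = {t_ratio:.2f} > {t_critical:.2f}: reject Ho; the difference is significant")
else:
    print(f"t = {t_ratio:.2f} <= {t_critical:.2f}: do not reject Ho; the difference is not significant")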
In hypothesis testing, two kinds of errors are possible, viz., Type I error and Type II error. Type I error
means rejecting the null hypothesis when it happens to be true. Type II error means accepting the null hypothesis
when it is false. The following table explains the types of error.
Position of hypothesis    Accept null hypothesis    Reject null hypothesis
Ho TRUE                   Correct decision          Type I error
Ho FALSE                  Type II error             Correct decision
For instance, if the level of significance is 5%, it means that in five cases out of 100 we would reject a null
hypothesis which is actually true. It is possible to reduce Type I error by lowering the level of significance, but both
types of error cannot be reduced simultaneously; we have to strike a balance between them, as summarized
symbolically below.
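In symbols (a brief sketch in LaTeX of the conventional definitions):
\alpha = P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ is true}), \qquad
\beta = P(\text{Type II error}) = P(\text{accept } H_0 \mid H_0 \text{ is false}),
so a 5% level of significance corresponds to \alpha = 0.05.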
CHAPTER THREE
RESEARCH PROPOSAL
Before any research study is undertaken, there should be an agreement between the person who authorizes
the study (the sponsor or advisor if the study is for academic purpose) and the researcher as to the problem to
be investigated, the methodology to be used, the duration of the study, and its cost. This ensures that there
are no misunderstandings or frustrations later for both parties. This is usually accomplished through the
research proposal, which the researcher submits and gets approved by the sponsor or advisor, who issues a
letter of authorization or allows proceeding with the study. Proposals are informative and persuasive writing
because they attempt to educate the reader and to convince that reader to do something. The goal of the
writer is not only to persuade the reader to do what is being requested, but also to make the reader believe
that the solution is practical and appropriate. A research proposal is usually required when the research
project is to be commissioned and the researcher is expected to compete with other researchers to get
research fund or else when the research proposal is a requirement for partial fulfillment of an academic
degree such as BA, MBA, MSc, or PhD. For example, a senior essay proposal is intended to convince your
advisor that your senior essay is a worthwhile research proposal and that you have the competence and the
work plan to complete it.
Research proposal is a specific kind of document written for a specific purpose. Research involves a series of
actions and therefore it presents all actions in a systematic and scientific way. In this way, Research
proposal is a blueprint of the study which simply outlines the steps that researcher will undertake during the
conduct of his/her study. Proposal is a tentative plan so the researcher has every right to modify his
proposal on the basis of his reading, discussion and experiences gathered in the process of research. Even
with this relaxation available to the researcher, writing of research proposal is a must for the researcher.
Research proposal is a blueprint of a study which outlines all the steps a researcher should follow to
undertake a given research project. The objective in writing a proposal is to describe what you will do, why
it should be done, how you will do it and what results you expect. Being clear about these things from
the beginning will help you complete your research in a timely fashion. A vague, weak or fuzzy proposal
can lead to a long, painful, and often unsuccessful research writing exercise. A clean, well thought-out
proposal forms the backbone for the research itself. A good research proposal hinges on a good idea.
Getting a good idea hinges on familiarity with the topic, and this assumes a longer preparatory period of
reading, observation, discussion, and incubation. Read everything that you can in your area of interest.
A research proposal is an overall plan, scheme, structure and strategy designed to obtain answers to the
research problems or questions. "The academic research proposal is a structured presentation of what you
plan to do in research and how you plan to do it." A research proposal describes why and how you "propose"
to carry out your research idea. A research proposal is a written document of the research plan intended to
convince specific readers. A research proposal is a document of the research design that includes a
statement explaining the purpose of the study and a detailed, systematic outline of a particular research
methodology (Zikmund, 2000). The research proposal is essentially a road map, showing clearly the
location from which a journey begins, the destination to be reached, and the method of getting there.
A proposal tells us:
🖝 What will be done?
🖝 Why it will be done
🖝 How it will be done
🖝 Where it will be done
🖝 To whom it will be done, and
🖝 What is the benefit of doing it?
🖝 What is the time period and budget required for each stage of research work?
These questions should be considered with reference to the researcher's interest, competence,
time and other resources, and the requirements of the sponsoring agency, if any. Thus, the
considerations which enter into making decisions regarding what, where, how much and by
what means constitute a plan of study, or a study design.
Purpose of a Research Proposal:
To present the problem to be researched and its importance
To discuss the research efforts of others who have worked on related problems. (If Any)
To set forth the data necessary for solving the problem
To suggest how the data will be gathered, treated and interpreted
Reduces the possibility of costly mistakes.
The components of a research proposal vary from one type of research proposal to another, and there are no
hard and fast rules as to which format to follow. In addition, for practical reasons many research-funding
agencies prefer their own research proposal format, and many universities, colleges or departments may have
their own formats. A format showing the basic elements that should be presented in a standard,
large-scale research proposal is provided below.
A. THE PRELIMINARIES
i. Title or cover page
ii. Table of Contents, List of tables, graph, chart
iii. Acronyms and abbreviations
iv. Abstract
B. THE BODY
1. Introduction
1.1. Background of the study
1.2. Statement of the problem
1.3. Research question or hypotheses
1.4. Objective of the study
1.4.1. General objective
1.4.2. Specific objectives
1.5. Significance of the study
1.6. Scope/delimitation of the study
1.7. Definition of terms
2. Review of Related Literature
3. Research Methodology
3.1. Research Design
3.2. Sampling design
3.3. Source of data and collection techniques
3.4. Methods of data analysis
3.5. Ethical Consideration
C. THE SUPPLEMENTAL
i. Budget Breakdown
ii. Time Schedule
iii. Bibliography (Reference)
A) The Preliminary/Prefatory Parts of a Proposal
I. TITLE PAGE
The title of your research study captures the main idea or theme of your proposal in a short phrase. It should be neither so short that it says nothing nor so long that a person reading your proposal has to work to determine the point of your study. It should be researchable and should give a clear indication of the variables or the content of the study. It should use the fewest possible words that adequately describe the content of the paper.
In selecting a title for investigation, the researcher should consider the following points:
The title should not be too lengthy: it should be specific to the area of study. For
example, the following topic appears to be too long:
– “A study of the academic achievement of children in pastoral regions whose parents had participated in literacy classes against those whose parents did not”
The title should not be too brief or too short: the following titles are too short:
―Marketing in Japan‖ or ―Unemployment in Ethiopia‖
For example, the research topic ―Determinants of export performance in Ethiopia‖ is good because it is concise and at the same time contains the three basic elements of a topic: the thing that is going to be explained, the thing that explains it, and a geographical scope. The thing that is going to be explained in the aforementioned topic is export performance, because your research is expected to draw conclusions pertaining to export performance. The thing that explains export performance is the word determinants. The actual factors that determine export performance are not stated in the topic because it has to be very concise; hence, the key word ―determinants‖ is used. Finally, the word ―Ethiopia‖ puts a geographical delimitation on the proposed research. Generally, the title of a research study must be as short and clear as possible, but sufficiently descriptive of the nature of the work:
🞉 Have a concise and focused title.
🞉 Be short and clear preferably not more than one line.
🞉 Avoid unnecessary punctuation (commas, colons, semi-colons).
The title is the most widely read part of your proposal. It will be read by many people who may not necessarily read the proposal itself or even its abstract. It should be long enough to be explicit but not so long that it becomes tedious, usually between 5 and 25 words. The title of the proposal should provide sufficient information to permit readers to make an intelligent judgment about the topic and type of study the researcher is proposing to do. The language in the title should be professional in nature. There are three kinds of title:
a) Indicative Title: - This type of title states the subjects of the research (proposal)
rather than the expected outcome. Example ‗The role of entrepreneurial education
for graduates‘ creativity in case of Ethiopia‘.
b) Hanging Title: The hanging title has two parts: a general first part followed by a
more specific second part. It is useful in rewording an otherwise long, clumsy and
complicated indicative title. E.g., ‗Nurturing creativity of graduates in Ethiopia: the
role of entrepreneurial education‘.
c) Question Title: A question title is used less often than indicative or hanging titles. It is,
however, acceptable where it is possible to use few words, say fewer than 15.
E.g., ‗Does entrepreneurial education increase creativity of graduates in Ethiopia?‘
The cover sheet for the proposal contains basic information for the reader:
The proposal title
The name of the researcher and advisor
Institution
Department
The purpose for which the research is conducted
Month, year and place
II. TABLE OF CONTENTS, LIST OF TABLES, GRAPHS AND CHARTS
It should locate each section and major subdivision of the proposal. In most circumstances, the table of contents should remain simple; no division beyond the first subheading is needed. If several illustrations or tables appear in the body of the proposal, they, too, should appear in the list of tables/illustrations, which is incorporated into or follows the table of contents. The table of contents is usually headed simply CONTENTS (in full capitals). It lists all the parts except the title page, which precedes it. No page number appears on the title page.
III. ACRONYMS AND ABBREVIATIONS (ALPHABETICALLY ARRANGED)
There is a great deal of overlap between abbreviations and acronyms. Every acronym is an abbreviation
because the acronym is a shortened form of a word or phrase. However, not every abbreviation is an
acronym, since some abbreviations - those made from words - are not new words formed from the first few
letters of a series of words.
ABBREVIATION
An abbreviation is a shortened form of a word or phrase, as N.Y. for New York. There are millions of common abbreviations used every day. When you write out your address, most people write ―St.‖ or ―Ave.‖ instead of ―street‖ or ―avenue.‖ When you write the date, you may abbreviate both the day of the week (Mon., Tues., Wed., Thurs., Fri., Sat., and Sun.) and the month of the year (Jan., Feb., Aug., Sept., Oct., Nov., Dec.). There are also many industry-specific abbreviations that you may be unaware of unless you are in the industry, such as medical or dental abbreviations. Shortening the word ―Avenue‖ to ―Ave.‖ is an abbreviation, because it is the shortened version of the word. However, it is not an acronym, since ―Ave.‖ is not a new word formed from the first letters of a phrase.
ACRONYM
An acronym is a word formed from the initial letters of the several words in a name. For example, AIDS is an acronym for acquired immune deficiency syndrome. An acronym, technically, must spell out another word; NY, formed from the initial letters of New York, is commonly called an acronym, and since it is a shortened version of the phrase it is, by definition, also an abbreviation. Like abbreviations, acronyms are used daily, and most people can interpret the meaning of common acronyms without much thought. For example, you go to the ATM instead of the automatic teller machine, and you give your time zone as EST, CST or PST instead of Eastern Standard Time, Central Standard Time or Pacific Standard Time. All of these acronyms are also abbreviations because they are shortened versions of phrases that are used frequently. Abbreviations and acronyms are shortened versions of words and phrases that speed up our communication. Be sure to use them correctly, since a misuse can lead to a big miscommunication.
IV. SUMMARY/ABSTRACT
The abstract is a brief description of each element contained in your proposal: a one-page summary of the entire research proposal. Generally, the abstract contains a statement of the purpose of your study or project, the measurable objectives, the procedures for implementation of the project, the anticipated results, their significance, and their beneficiaries. The text of the abstract should be single-spaced and written as a single paragraph. If you are submitting your proposal to an external agency, there may be word limitations (250 to 300 words).
The abstract should be concise and informative, and should provide brief information about the whole problem to be investigated. It needs to show a reasonably informed reader why a particular topic is important to address, where the gap lies for the research you want to undertake, and an indication of what you want to achieve and how. Because the abstract represents an executive summary of the entire project, it should actually be the last section you complete. In general, the abstract should summarize the main idea of the given title in the form of:
Title or topic of the research
Statement of the problem and objective
Methodology of Investigation
Expected result (tentative only if a researcher starts with a formulated hypothesis)
B) Body of Research Proposal
1. Introduction
1.1. Background of the study: deductive order
The purpose of the background of the study is to provide readers with the background information for the research. You need to give a sense of the general field of research of which your area is a part. In the background of the study, the researcher should create reader interest in the topic, lay the broad foundation for the problem that leads to the study, place the study within the larger context of the scholarly literature, and reach out to a specific audience. You then need to narrow down to the specific area of your concern. This should lead logically to the gap in the research that you intend to fill. Its purpose is to establish the issues, concerns or motivations leading to the research questions and objectives, so that readers can understand the significance and rationale underlying the study.
The proposal begins with an introductory statement, which leads like a funnel from a broad view of your
topic to the specific statement of the problem. It provides readers of the proposal the rationale (based on
published sources), for doing the study. It is the part of the proposal that provides readers with the
background information for the research proposal. Its purpose is to establish a framework for the research,
so that readers can understand how it is related to other research. Generally, background of the study
should be in deductive order i.e.
⮫ Global issues and trends about the topic
⮫ Situations in less developed countries or in an industry
⮫ National level/basic facts
⮫ Firm/regional level/basic facts
1.2. Statement of the problem
Having provided a broad introduction to the area under study, now focus on issues relating to its central
theme, identifying some of the gaps in the existing body of knowledge. Identify some of the main
unanswered questions. Here some of the main research questions that you would like to answer through
your study should also be raised, and a rationale and relevance for each should be provided. Knowledge
gained from other studies and the literature about the issues you are proposing to investigate should be an
integral part of this section.
A problem is an issue that exists in the literature, theory, or practice that leads to a need for the study. The researcher should think about what created the need to do the research (problem identification). The statement of the problem describes the heart of your study in a few brief sentences. It identifies the variables the researcher plans to study as well as the type of study he or she intends to do. The research problem should be stated in such a way that it would lead to analytical thinking. Specifically, this section should:
Identify the issues that are the basis of your study;
Specify the various aspects of/perspectives on these issues;
Identify the main gaps in the existing body of knowledge;
Raise some of the main research questions that you want to answer through your study;
Identify what knowledge is available concerning your questions, specifying the
differences of opinion in the literature regarding these questions if differences
exist;
Develop a rationale for your study with particular reference to how your study will
fill the identified gaps.
Refers to the research issue that motivated a need for the study
Serves as a guide in formulating the specific objectives
In a proposal, the problem should stand out for easy recognition:
o ―Why does this research need to be conducted?‖
o Make sure that you can provide clear answers to this question.
It is said that 50% of the research is completed once the problem is well identified. The statement of the problem therefore reflects the gap and justifies that the issue is worth researching. It should illustrate the research gaps, which may be gaps in theories, gaps in the research done by others, or gaps between theory and practice.
Writing about something that is straightforward and unproblematic does not constitute an investigation; mere description is not research. Let us see one example of how not all ―problems‖ are researchable. For example, the government may have a ―problem‖ of not enough money to implement a new policy of low-cost housing. The solution to this ―problem‖ would be simply: more money! However, there may be all sorts of other kinds of researchable problems underlying this issue: Should the government cut back on health provision in order to provide housing? Should housing be the government‘s priority? Could provision of housing be privatized? How can low-cost housing promote economic justice? The answers to these kinds of questions are not obvious and are often much contested. These kinds of questions are, therefore, problematic ones.
In quantitative research, a research question typically asks about a relationship that may exist between or among two or more variables. It should identify the variables being investigated and specify the type of relationship to be investigated. For example, what effect does playing football have on students‘ overall grade point average during the football season? In qualitative research, a research question asks about the specific process, issue, or phenomenon to be explored or described. For example, how does the social context of a school influence perseverant teachers‘ beliefs about teaching? What is the experience of a teacher who becomes a student like?
A hypothesis is a statement about an expected relationship between two or more variables that permits empirical testing. If the research is expected to be based only on descriptive analysis, there will be no need to test a hypothesis. If someone wants to examine the relationship between dependent and independent variables, a hypothesis must be formulated. Before formulating a hypothesis, we should have a clear-cut idea about the dependent and independent variables. The independent variable causes or influences the dependent variable. The dependent variable is the variable affected by the other variable (the independent variable), or it depends on others. For example, women‘s education, age at marriage, occupation of women, religion and the use of contraception can be treated as independent variables that will have a direct or indirect effect on fertility.
Hypotheses and research questions are linked to the speculative proposition of the problem statement. The statement of the research questions/hypotheses describes the expected outcomes of your study. The term research question implies an interrogative statement that can be answered by data, whereas hypotheses are tentative statements or explanations of the formulated problem that will be tested.
1.4.1. GENERAL OBJECTIVE
General statements specifying the desired outcomes of the proposed project.
The main/general objective indicates the central thrust of your study.
It is important to ascertain that the general objective is closely related to your title.
1.4.2. SPECIFIC OBJECTIVES
The specific objectives are the specific aspects of the topic that you want to investigate within the main framework of your study. Sub-objectives should be worded clearly and unambiguously; make sure that each contains only one aspect of the study. The wording of your objectives determines the type of research design (e.g., descriptive, correlational, experimental, or others) you need to adopt to achieve them. Sub-objectives should delineate only one issue. If the objective is to test a hypothesis, you must follow the conventions of hypothesis formulation in wording the specific objectives.
The specific objectives identify the specific issues you propose to examine. The specific objectives
are commonly considered as smaller portions of the general objectives.
It identifies in greater detail the specific aims of the research project, often breaking down what is to
be accomplished into smaller logical components.
Use action-oriented verbs such as ―to determine, to find out, to ascertain, to evaluate, to discover‖
in formulating specific objectives which should be numerically listed.
Specific objectives should also be linked to research questions
The specific objectives are smaller portions of the general objectives
Specific objectives should be consistent with the problem
They may be written in the form of bullets, one for each objective
1.5. SIGNIFICANCE OF THE STUDY
This section shows to what extent the solution of the problem would contribute to the furtherance of human knowledge. The list of the objectives of the study further magnifies its utility and importance.
The following are some of the main components the justification stresses:
The need for new knowledge, techniques or conditions
The need to help address those areas that remain untouched or inadequately treated.
The need to fill the gap in the knowledge pertaining to the given area.
In this section, the researcher indicates the importance of the research and thereby convinces the reader. The researcher is thus required to indicate what his/her research will contribute, whether the research is to provide a solution, to shed light on the nature of the problem, or both; some research extends the frontiers of knowledge. This section therefore enables the researcher to answer questions such as: What is the usefulness of this study? What does this study contribute?
Indicates the importance of your research to the existing knowledge or practice.
It explains why the study is worth doing. What makes the study important to the
researcher‘s field? What tangible contribution will it make?
This section allows you to write about why the research has to be done. In this
section; you describe explicit benefits that will ensue from your study. The
importance of ―doing the study now‖ should be emphasized. Specifically, to:
User organizations
The society/the community/the country
Other researchers
1.6. SCOPE/DELIMITATION OF THE STUDY
In this section, the researcher indicates the boundary of the study. The problem should be reduced to a manageable size; delimitation is done so that the problem can be solved using the available financial, labour and time resources. This does not, however, mean that one should delimit the research topic to a particular issue, organization or place simply because it is less costly and takes less time. Delimiting is not done merely to reduce the scope of the study for the sake of minimizing the effort to be exerted; we should not squeeze the life out of the topic in the name of making it manageable. Thus, there should be a balance between manageability and representativeness of the universe being studied. Delimitation/scope addresses how a study will be narrowed in scope, that is, how it is bounded.
Limit your delimitations to the things that a reader might reasonably expect you to do but that you, for clearly explained reasons, have decided not to do. This is the place to explain the things that you are not doing and why you have chosen not to do them, e.g., the population you are not studying (and why not) and the methodological procedures you will not use (and why you will not use them), with a persuasive justification. Delimitations are restrictions the researcher sets on his/her study. Explain the exact area of your research, for example in terms of time period, subjects, disciplines involved, unit of analysis, etc. The scope provides the boundary or framework and clearly defines what is covered. Four major common delimitations in research are: geographical (area coverage), conceptual (topic coverage), methodological (population/sample coverage) and time frame.
1.7. DEFINITION OF TERMS
Many research works include some technical words. Thus, terms must be defined so that it is possible to know precisely what the terms used in the body of the research mean. Without knowing explicitly what the terms mean, we cannot evaluate the research or determine whether the researcher has carried out what the problem statement announced as the principal thrust of the research. Thus, terms should be defined from the outset. There are nominal and operational definitions of terms.
Nominal definition: a statement assigned to a term, such as its dictionary meaning.
Operational definition: a specification of the term in observable, and hence measurable, characteristics.
Terms must be defined operationally; i.e., the definition must interpret the term as it is employed in relation to the research project. Students sometimes rely on dictionary definitions, which are seldom adequate or helpful. In defining a term, the researcher makes that term mean whatever he/she wishes it to mean within the particular context of the problem or its sub-problems. We must therefore know how the researcher defines the term. We need not necessarily subscribe to such a definition, but so long as we know precisely what the researcher means when employing a particular term, we are able to understand the research and appraise it more objectively.
If you are using words that are operationally defined (i.e., defined by how they are measured, or that have an unusual or restricted meaning in the study), you must define them for the reader. Technical terms or words and phrases having special meaning need to be defined operationally. There is no need to define obvious or commonly used terms.
2. REVIEW OF RELATED LITERATURE
Initially we can say that a review of the literature is important because without it you will not acquire an
understanding of your topic, of what has already been done on it, how it has been researched, and what the
key issues are. In your written project, you will be expected to show that you understand previous research
on your topic. This amounts to showing that you have understood the main theories in the subject area and
how they have been applied and developed, as well as the main criticisms that have been made of work on
the topic. This is where you provide more detail about what others have done in the area, and what you
propose to do. This section will review published research related to the purpose and objectives described
above.
Its purpose is to establish a framework for the research, so that readers can understand how it is related to other research. It includes the major issues, gaps in the literature (in more detail than is provided in the
introduction); research questions and/or hypotheses which are connected carefully to the literature being
reviewed; definitions of key terms, provided either when you introduce each idea, or in a definition sub-
section. It should be noted that references may be found throughout the proposal, but it is preferable for
most of the literature review to be reported in this section. It should summarize the results of previous
studies that have reported relationships among the variables included in the proposed research. An
important function of the literature review is to provide a theoretical explanation of the relationships among
the variables of interest. The review can also provide descriptive information about related problems,
intervention programs, and target populations. A well-structured literature review is characterized by a
logical flow of ideas, current and relevant references with consistent and appropriate referencing style;
proper use of terminology, and an unbiased and comprehensive view of the previous research regarding
your research topic.
Literature review can be broadly classified into theoretical and empirical literature review. The theoretical
literature review builds the detailed theoretical framework for your research that is an elaborated version of
the one in the introduction part. Empirical literature is literature that you obtain from empirical research.
Empirical research refers to research studies that have been undertaken
according to an accepted scientific method, which involves defining a research question, identifying a
method to carry out the study, followed by the presentation of results, and finally a discussion of the
results. Empirical research studies are normally the most important types of literature that will be
incorporated into a literature review. This is because they attempt to address a specific question using a
systematic approach.
Generally, the literature review should be written as follows:
Deductive Order (General to specific)
Concepts and definitions of terminologies directly related to the topic.
Global issue and trends
Regional or continental or industrial facts
Best experiences, if relevant
Problems and challenges related to the topic
IMPORTANT POINTS IN LITERATURE REVIEW
Adequacy- Sufficient to address the statement of the problem and the
specific objectives in detail
Logical flow and organization of the contents
Adequate citations
The variety of issues and ideas gathered from many authors
Exhaustive (complete) - cover the main points
Fair treatment of authors (do not overuse one author)
It should not be outdated
Rely on academic sites (usually .ac or .edu), government sites (.gov), and not-for-profit institutions (.org)
Dictionaries and encyclopedias are not recommended
With proper citation of sources. A source is usually referenced in two parts:
The citation, in your text at the point of use;
Full publication details, in a reference list or bibliography, at the end of your dissertation or report
Use: APA Citation or Harvard referencing Guide
BUT MAKE SURE TO BE CONSISTENT!!!!
Warning
Do not forget the issue of plagiarism. Plagiarism means pretending that we, ourselves, wrote what others actually wrote. Plagiarism might be accidental, such as not using quotation marks for direct quotes or forgetting to cite a source in the text; it might be careless rather than deceitful. Either way, plagiarism is always a serious offence, since it appropriates the efforts of others. Institutions vary in terms of the seriousness with which they view it; punishment can range from resubmission to expulsion, but reputation is always lost.
3. RESEARCH METHODOLOGY
3.1. Introduction
Research methodology is a way to systematically solve the research problem. It is a science of studying
how research is done scientifically. The methodology section of your research proposal answers mainly
―how‖ questions since it provides your work plan and describes the activities necessary for the completion
of your project. Researchers should understand the assumptions underlying various techniques and they
need to know the criteria by which they can decide that certain techniques and procedures will be
applicable to certain problems and others will not. This means that it is necessary for the researcher to
design his/her own methodology for the problem, as the methodology may differ from problem to problem. For example,
an architect, who designs a building, has to consciously evaluate the basis of his/her decisions, i.e., he/she
has to evaluate why and on what basis he/she selects particular size, number and location of doors,
windows and ventilators, uses particular materials and not others and the like. Similarly, in research the
researcher has to expose the research decisions to evaluation before they are implemented. He has to
specify very clearly and precisely what decisions he selects and why he selects them so that they can be
evaluated by others also. In this section, it is vital to include the following subheadings while expanding on
them in as much detail as possible.
3.4. SAMPLING DESIGN
A sample design is a definite plan for obtaining a sample from a given population. It refers to the technique
or the procedure the researcher would adopt in selecting items for the sample. Sampling is the process of
selecting a sufficient number of elements of the population. Sample design may as well lay down the
number of items to be included in the sample i.e., the size of the sample. Sample design is determined
before data are collected. There are many sample designs from which a researcher can choose. Some designs
are relatively more precise and easier to apply than others are. The researcher must select/prepare a sample
design, which should be reliable and appropriate for his research study. Knowledge of the various types of
sampling methods (probability and non-probability sampling) is a prerequisite to developing an appropriate
description of your particular sampling technique. You have to explain the reasons behind the choice of
your sampling technique. Moreover, given considerations of cost and time, you have to reasonably
determine the appropriate sample size (number of participants). In determining sampling, one must clearly
identify
Population/universe
Sample element
The sampling frames
Sample Size
Universe/Population refers to the entire group of people, events, or things of interest that the researcher wishes to investigate. For instance, if the CEO of a computer firm wants to know the kinds of advertising strategies adopted by computer firms in Silicon Valley, then all computer firms situated there will be the population. A research population is generally a large collection of individuals or objects that is the main focus of a scientific query.
3.5. SOURCE AND TYPE OF DATA
Data can be obtained from primary or secondary sources. Primary data refer to information obtained
firsthand by the researcher on the variables of interest for the specific purpose of the study. Secondary data
refer to information gathered from sources already existing. Some examples of sources of primary data are
individuals, focus groups, panels of respondents specifically set up by the researcher. Data can also be
obtained from secondary sources, as, for example, books, journals, company records or archives,
government publications, industry analyses offered by the media, web sites, the Internet, and so on.
Data can also be classified into quantitative and qualitative types. Qualitative data are concerned with information expressed in words or categories rather than numbers, while quantitative data are expressed numerically.
The investigator should now find instruments for collecting the data required by the hypothesis. The investigator may have to construct these instruments himself, or he may have to adapt readily available instruments to suit local conditions. In the latter case, the investigator may make the necessary changes in the format, etc., with the help of the feedback received by conducting a pilot study on a very small sample.
Data can be collected in a variety of ways, in different settings (field or lab) and from different sources. Data collection instruments include interviews (face-to-face interviews, telephone interviews, computer-assisted interviews, and interviews through the electronic media); questionnaires that are personally administered, sent through the mail, or electronically administered; and observation of individuals and events with or without videotaping or audio recording. Interviewing, administering questionnaires, and observing people and phenomena are the three main data collection methods in survey research. Each of these methods has its own advantages and limitations, which will be discussed in detail in the forthcoming chapters.
The analysis of data requires a number of closely related operations such as establishment of categories, the
application of these categories to raw data through coding, tabulation, and then drawing statistical
inferences. Thus, researcher should classify the raw data into some purposeful and usable categories.
Coding operation is usually done at a stage through which the categories of data are transformed into
symbols that may be tabulated and counted. Editing is the procedure that improves the quality of the data for
coding. With coding, the stage is ready for tabulation. Tabulation is a part of the technical procedure
wherein the classified data are put in the form of tables. Computers tabulate a great deal of data, especially in large inquiries. Computers not only save time but also make it possible to study a large number of variables affecting a problem simultaneously.
[Fragment of a sample budget breakdown table: item 8 Transportation (unit: trip), item 9 Telephone cost (unit: number), item 10 Total; the cost figures are placeholders in the original.]
4.1. INTRODUCTION
Up to now, you have stated what the problem is, what your study objectives are, and why it is important for you to do the study. This section should include as many subsections as needed to show the phases of the project. It provides information on your proposed design for tasks such as sample selection and size, data collection method, instrumentation, procedures, and ethical requirements. It describes the way the requisite data can be gathered and analyzed to arrive at a solution. When more than one way exists to approach the design, discuss the methods you have rejected and why your selected approach is superior.
A research design is a procedural plan that is adopted by the researcher to answer questions validly, objectively, accurately and economically. According to Selltiz, Deutsch and Cook (1962), ‗A research design is the arrangement of conditions for collection and analysis of data in a manner that aims to combine relevance to the research purpose with economy in procedure‘. Research design is the conceptual structure within which research is conducted; it constitutes the blueprint/roadmap for the collection, measurement and analysis of data. In other words, it is a master plan specifying the methods and procedures for collecting and analyzing the needed information.
Through a research design you decide for yourself and communicate to others your decisions regarding what study design you propose to use and how you are going to collect and analyse information from your respondents. Some limitations of the before-and-after (pre-test/post-test) study design are worth noting:
In some cases, the time lapse between the two contacts may result in attrition in the study
population. It is possible that some of those who participated in the pre-test may move out of the
area or withdraw from the experiment for other reasons.
One of the main limitations of this design, in its simplest form, is that as it measures total change,
you cannot ascertain whether independent or extraneous variables are responsible for producing
change in the dependent variable. Also, it is not possible to quantify the contribution of
independent and extraneous variables separately.
Sometimes the instrument itself educates the respondents. This is known as the reactive effect of
the instrument. For example, suppose you want to ascertain the impact of a programme designed to
create awareness of drugs in a population. To do this, you design a questionnaire listing various
drugs and asking respondents to indicate whether they have heard of them. At the pre-test stage a
respondent, while answering questions that include the names of the various drugs, is being made
aware of them, and this will be reflected in his/her responses at the post-test stage. Thus, the
research instrument itself has educated the study population and, hence, has affected the dependent
variable. Another example of this effect is a study designed to measure the impact of a family
planning education programme on respondents‘ awareness of contraceptive methods. Most studies
designed to measure the impact of a programme on participants‘ awareness face the difficulty that a
change in the level of awareness, to some extent, may be because of this reactive effect.
If the study population is very young and if there is a significant time lapse between the before-and-
after sets of data collection, changes in the study population may be because it is maturing. This is
particularly true when you are studying young children. The effect of this maturation, if it is
significantly correlated with the dependent variable, is reflected at the ‗after‘ observation and is
known as the maturation effect.
Another disadvantage that may occur when you use a research instrument twice to gauge the
attitude of a population towards an issue is a possible shift in attitude between the two points of
data collection. Sometimes people who place themselves at the extreme positions of a measurement
scale at the pre-test stage may, for a number of reasons, shift towards the mean at the post-test stage.
They might feel that they have been too negative or too positive
The reference period refers to the time-frame in which a study is exploring a phenomenon, situation, event
or problem. Studies are categorized from this perspective as:
I. THE RETROSPECTIVE STUDY DESIGN
Retrospective studies investigate a phenomenon, situation, problem or issue that has happened in the past.
They are usually conducted either on the basis of the data available for that period or on the basis of
respondents‘ recall of the situation (Figure 4.3a). For example, studies conducted on the following topics
are classified as retrospective studies:
⮫ The utilization of land before the Second World War in Western Australia.
⮫ A historical analysis of migratory movements in Eastern Europe between 1915 and 1945.
⮫ The relationship between levels of unemployment and street crime.
II. THE PROSPECTIVE STUDY DESIGN
Prospective studies refer to the likely prevalence of a phenomenon, situation, problem, attitude or outcome
in the future (Figure 4.3b). Such studies attempt to establish the outcome of an event or what is likely to
happen. Experiments are usually classified as prospective studies as the researcher must wait for an
intervention to register its effect on the study population. The following are classified as prospective
studies:
To determine, under field conditions, the impact of maternal and child health
services on the level of infant mortality.
To establish the effects of a counselling service on the extent of marital problems.
To find out the effect of parental involvement on the level of academic
achievement of their children.
To measure the effects of a change in migration policy on the extent of
immigration in Ethiopia.
FIGURE 4.3 (not reproduced here): (a) retrospective study design; (b) prospective study design; (c) retrospective–prospective study design.
Qualitative research involves studies that do not attempt to quantify their results through statistical
summary or analysis. Examples of qualitative designs are as follows:
1) Phenomenology– is the study of phenomena. It is a way of describing something
that exists as part of the world in which we live. Phenomena may be events,
situations, experiences or concepts. We are surrounded by many phenomena,
which we are aware of but do not fully understand.
2) Ethnography- it is a methodology for descriptive studies of cultures and peoples.
3) Grounded theory- focus on development of new theory through the collection
and analysis of data about a phenomenon.
4) Case study- case study research can take a qualitative or quantitative stance. The
qualitative approach to case study focuses on in-depth analysis of a single unit or a small
number of units. Case study research is used to describe an entity that forms a
single unit such as a person, an organization or an institution.
4.6.1. CONCEPT AND VARIABLE
A concept or construct is a generalized idea about a class of objects, attributes, occurrences, or processes
that has been given a name. Concepts abstract reality. That is, concepts express various events or objects in
words. A researcher can operate at two levels: on abstract level of concepts and on the empirical level of
variables. At the empirical level, we ―experience‖ reality—that is, we can observe, measure, or manipulate
objects or events. To move from abstract level to the empirical level, we must clearly define this construct
and identify actual measurements.
Variable- An image, perception or concept that is capable of measurement-hence capable of taking on
different values-is called a variable. In other words, a concept that can be measured is called a variable. A
variable is a property that takes on different values. A concept that can be measured on any one of the four
types (nominal, ordinal, interval and ratio) of measurement scale, which have varying degrees of precision
in measurement, is called a variable.
THE DIFFERENCE BETWEEN A CONCEPT AND A VARIABLE
⚫ Concepts are mental images or perceptions and therefore their meanings vary
markedly from individual to individual, whereas variables are measurable.
Measurability is the main difference between a concept and a variable.
⚫ A concept cannot be measured whereas a variable can be subjected to measurement by
crude/refined or subjective/ objective units of measurement.
⚫ Concepts are subjective impressions; their understanding may differ from person to person.
⚫ It is, therefore, important for concepts to be converted into variables, as they can then be
subjected to measurement, even though the degree of precision with which they can be
measured varies from scale to scale.
CONVERTING CONCEPTS INTO VARIABLES
⚫ If you are using a concept in your study, you need to consider its
operationalization, that is, how it will be measured.
⚫ To operationalize a concept, you first need to go through the process of
identifying indicators
⚫ Indicators are a set of criteria reflective of the concept, which can then be
converted into variables (a brief illustrative sketch follows).
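A minimal sketch of the operationalization process described above is given below; the concept of ‗job satisfaction‘ and its indicators are hypothetical examples chosen for illustration, not definitions taken from the text.

from dataclasses import dataclass

@dataclass
class Indicator:
    """A criterion reflective of a concept, converted into a measurable variable."""
    variable: str      # name of the variable derived from the indicator
    scale: str         # nominal, ordinal, interval or ratio
    measurement: str   # how the variable will actually be measured

# Hypothetical operationalization of the concept 'job satisfaction'.
concept = "job satisfaction"
indicators = [
    Indicator("satisfaction_score", "ordinal",
              "5-point Likert rating of overall satisfaction"),
    Indicator("absenteeism_days", "ratio",
              "number of days absent in the last 12 months"),
    Indicator("intention_to_quit", "nominal",
              "yes/no answer to 'Do you plan to leave within a year?'"),
]

for ind in indicators:
    print(f"{concept} -> {ind.variable} ({ind.scale}): {ind.measurement}")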
TYPES OF VARIABLES
A variable can be classified in a number of ways. The classification developed here results from looking at variables in three different ways:
a) The causal relationship
b) The design of the study and
c) The unit of measurement
THE INTERVAL SCALE
⚫ Quantitative in nature and built on ordinal measurement. Interval scales provide information
about both the order of and the distance between values of variables, with numbers scaled at
equal distances. A mark of 60 is greater than 40 and a mark of 80 is greater than 60, and the
difference between them is equal, i.e., 20.
⚫ There is no absolute zero point; the zero point is arbitrary. Zero degrees Fahrenheit or Celsius does not
represent an absence of temperature, it simply means cold; likewise, a zero test score does not imply
that the student knows nothing. Addition and subtraction are possible, but the lack of an
absolute zero point makes division and multiplication meaningless.
⚫ These variables make it possible to establish by how much one of two objects possesses
a given feature in greater measure than the other. However, interval variables do not
permit one to determine how many times more intensely one object possesses a given
feature than another, e.g., calendar time, temperature scales and grade points, where an
arbitrary zero has been set by convention.
⚫ Examples include temperature measured in Fahrenheit and Celsius.
THE RATIO SCALE
⚫ A ratio scale has all the properties of nominal, ordinal and interval scales, and it
also has a starting point fixed at zero. It is possible to have no (or zero) money, i.e.,
a zero balance in a bank account. Therefore, it is an absolute scale. The difference
between the intervals is always measured from a zero point, and the ratio scale can be
used for mathematical operations. The measurement of income, age, height and
weight are examples of this scale. A person who is 40 years of age is twice as old
as a 20-year-old, and a person earning $60 000 per year earns three times the salary of
a person earning $20 000.
⚫ Ratio variables attribute an absolute value of intensity to a variable, thus permitting
comparison not only of the distances between different values but also of the ratios
between them, e.g., the speed of a vehicle expressed in KMPH has an absolute zero
value when the vehicle is immobile. There is a true zero value, there are equal units,
and ratios are meaningful for ratio variables, so all mathematical operations can
meaningfully be performed on them (see the numerical sketch below).
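The numerical sketch below, using made-up temperature and income figures, shows why differences are meaningful on an interval scale while ratios are meaningful only on a ratio scale.

# Interval scale: Celsius/Fahrenheit have arbitrary zero points, so ratios are not meaningful.
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

c1, c2 = 10.0, 20.0
print(c2 / c1)                                                # 2.0 in Celsius ...
print(celsius_to_fahrenheit(c2) / celsius_to_fahrenheit(c1))  # ... but about 1.36 in Fahrenheit,
# so the 'twice as hot' claim depends on the arbitrary zero point and is not meaningful.
# Differences, however, are preserved up to the scale factor:
print(c2 - c1, celsius_to_fahrenheit(c2) - celsius_to_fahrenheit(c1))  # 10 degrees C spans 18 degrees F

# Ratio scale: income has a true zero, so ratios are meaningful.
income_a, income_b = 20_000, 60_000
print(income_b / income_a)   # 3.0 -- one person genuinely earns three times the other.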
Sampling is a familiar part of daily life. A customer in a bookstore picks up a book, looks at the cover, and
skims a few pages to get a sense of the writing style and content before deciding whether to buy. A high
school student visits a college classroom to listen to a professor‘s lecture. Selecting a university on the basis
of one classroom visit may not be scientific sampling, but in a personal situation, it may be a practical
sampling experience. When measuring every item in a population is impossible, inconvenient, or too
expensive, we intuitively take a sample. Although sampling is commonplace in daily activities, these
familiar samples are seldom scientific. For researchers, the process of sampling can be quite complex.
Sampling is a central aspect of business research, requiring in-depth examination. The basic idea of
sampling is that by selecting some of the elements in the population, we may draw conclusions about the
entire population. This chapter explains the nature of sampling and ways to determine the appropriate
sample design.
Statistics deals with large numbers. It does not study a single figure. All the items under consideration
in any field of inquiry constitute a universe or population. A complete enumeration of all the items in
the ―population‖ is known as a census method of collection of data. In practice, sometimes, it is not
possible to examine every item in the population. But also, a complete enumeration or estimation of all
the items in the ―population‖ may not be necessary. Sometimes it is possible to obtain sufficiently
accurate results by studying only a part of the total ―population‖. In the case of population census,
every household is to be enumerated. But in certain cases, a few items are selected from the population
in such a way that they are representatives of the universe. Such a section of the population is called a
sample and the process of selection is called sampling. A sample design is a definite plan for obtaining a
sample from a given population. It refers to the technique or the procedure the researcher would adopt in
selecting items for the sample. Sample design may as well lay down the number of items to be included in
the sample i.e., the size of the sample. Sample design is determined before data are collected.
Sampling, therefore, is the process of selecting a few (a sample) from a bigger group (the sampling
population) to become the basis for estimating or predicting the prevalence of an unknown piece of
information, situation or outcome regarding the bigger group. A sample is a subgroup of the population you
are interested in. This process of selecting a sample from the total population has advantages and
disadvantages. The advantages are that it saves time as well as financial and human resources. However,
the disadvantage is that you do not find out the information about the population‘s characteristics of interest
to you but only estimate or predict them. Hence, the possibility of an error in your estimation exists.
Sampling, therefore, is a trade-off between certain benefits and disadvantages. While on the one hand you
save time and resources, on the other hand you may compromise the level of accuracy in your findings.
Through sampling you only make an estimate about the actual situation prevalent in the total population
from which the sample is drawn. If you ascertain a piece of information from the total sampling population,
and if your method of enquiry is correct, your findings should be reasonably accurate. However, if you
select a sample and use this as the basis from which to estimate the situation in the total population, an
error is possible. Tolerance of this possibility of error is an important consideration in selecting a sample.
The purpose of sampling in quantitative research is to draw inferences about the group from which you have
selected the sample, whereas in qualitative research it is designed either to gain in-depth knowledge about a
situation/event/episode or to know as much as possible about different aspects of an individual on the
assumption that the individual is typical of the group and hence will provide insight into the group.
Similarly, the determination of sample size in quantitative and qualitative research is based upon the two
different philosophies. In quantitative research you are guided by a predetermined sample size that is based
upon a number of other considerations in addition to the resources available. However, in qualitative
research you do not have a predetermined sample size but during the data collection phase you wait to reach
a point of data saturation. When you are not getting new information or it is negligible, it is assumed you
have reached a data saturation point and you stop collecting additional information.
For some research questions it is possible to collect data from an entire population as it is of a manageable
size. However, you should not assume that a census would necessarily provide more useful results than
collecting data from a sample which represents the entire population. Sampling provides a valid alternative
to a census when:
🞺 It would be impracticable for you to survey the entire population;
🞺 Your budget constraints prevent you from surveying the entire population;
🞺 Your time constraints prevent you from surveying the entire population;
🞺 You have collected all the data but need the results quickly.
For all research questions where it would be impracticable for you to collect data from the entire
population, you need to select a sample. This will be equally important whether you are planning to use
interviews, questionnaires, observation or some other data collection technique. You might be able to
obtain permission to collect data from only two or three organizations.
With other research questions it might be theoretically possible for you to be able to collect data from the
entire population but the overall cost would prevent it. It is obviously cheaper for you to collect, enter (if
you are analyzing the data using a computer) and check data from 250 customers than from 2500, even
though the cost per case for your study (in this example, customer) is likely to be higher than with a census.
Your costs will be made up of new costs such as sample selection, and the fact that overhead costs such as
questionnaire, interview or observation schedule design and setting up computer software for data entry are
spread over a smaller number of cases.
Sampling also saves time, an important consideration when you have tight deadlines. The organisation of
data collection is more manageable as fewer people are involved. As you have fewer data to enter, the
results will be available more quickly. Occasionally, to save time, questionnaires are used to collect data
from the entire population but only a sample of the data collected is analyzed. Fortunately, advances in automated and computer-assisted coding software mean that such situations are increasingly rare.
Many researchers, for example Henry (1990), argue that using sampling makes possible a higher overall
accuracy than a census. The smaller number of cases for which you need to collect data means that more
time can be spent designing and piloting the means of collecting these data. Collecting data from fewer
cases also means that you can collect information that is more detailed. In addition, if you are employing
people to collect the data (perhaps as interviewers) you can afford higher-quality staff. You also can devote
more time to trying to obtain data from more difficult to reach cases. Once your data have been collected,
proportionally more time can be devoted to checking and testing the data for accuracy prior to analysis.
There are several questions to be answered in securing a sample. Each requires unique information. While the
questions presented here are sequential, an answer to one question often forces a revision to an earlier one.
🞺 What is the target population?
🞺 What is the sampling frame?
🞺 What size sample is needed?
🞺 What is the appropriate sampling method?
While developing a sampling design, the researcher must pay attention to the following points:
1. Define the target population/ universe: The first step in developing any sample design is to
clearly define the set of objects, technically called the universe, to be studied. The universe can
be finite or infinite. In finite universe the number of items is certain, but in case of an infinite
universe the number of items is infinite, i.e., we cannot have any idea about the total number
of items. The population of a city, the number of workers in a factory and the like are examples
of finite universes, whereas the number of stars in the sky, listeners of a specific radio
programme, throwing of a dice etc. are examples of infinite universes.
2. Determine the sampling unit: A decision has to be taken concerning a sampling unit before
selecting sample. During the actual sampling process, the elements of the population must be
selected according to a certain procedure. The sampling unit is a single element or group of
elements subject to selection in the sample. Sampling unit may be a geographical one such as
state, district, village, etc., or a construction unit such as house, etc., or it may be a social unit
such as family, club, school, etc., or it may be an individual. The researcher will have to decide
one or more of such units that he has to select for his study
3. Identify the sampling frame (source list): The source list, also known as the sampling
frame, is the list from which the sample is to be drawn. It contains the names of all items
of a universe (in the case of a finite universe only). If a source list is not available, the
researcher has to prepare it. Such a list should be comprehensive, correct, reliable and appropriate.
Students and others often ask: ‗How big a sample should I select?‘, ‗What should be my sample size?‘ and
‗How many cases do I need?‘ Basically, it depends on what you want to do with the findings and what type
of relationships you want to establish. Your purpose in undertaking research is the main determinant of the
level of accuracy required in the results, and this level of accuracy is an important determinant of sample
size. However, in qualitative research, as the main focus is to explore or describe a situation, issue, process
or phenomenon, the question of sample size is less important. You usually collect data till you think you
have reached saturation point in terms of discovering new information. Once you think you are not getting
much new data from your respondents, you stop collecting further information. Of course, the diversity or
heterogeneity in what you are trying to find out about plays an important role in how fast you will reach
saturation point. And remember: the greater the heterogeneity or diversity in what you are trying to find out
about, the greater the number of respondents you need to contact to reach saturation point.
Technically, the size of the sample depends upon the precision the researcher desires in estimating the population parameter at a particular confidence level. There is no single rule that can be used to determine sample size. The best answer to the question of size is to use as large a sample as possible. A larger sample is much more likely to be representative of the population. Furthermore, with a large sample the data are likely to be more accurate and precise. It was pointed out earlier that the larger the sample, the smaller the standard error: the standard error of a sample mean is inversely proportional to the square root of the sample size, so in order to double the precision of one‘s estimate the sample size would need to be roughly quadrupled. Your choice of sample size within this compromise is governed by:
The confidence you need to have in your data
The types of analyses you are going to undertake
The size of the total population from which your sample is being drawn.
Generally, 95 to 99 per cent confidence intervals are acceptable, i.e., 5 to 1 per cent error. Given these competing influences, it is not surprising that the final sample size is almost always a matter of judgement as well as of calculation.
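As one concrete way of turning a confidence level and a tolerated margin of error into a sample size, the sketch below applies the widely used Cochran formula for estimating a proportion, n0 = z^2 p(1-p) / e^2, with a finite-population correction; the population size, confidence level and margin of error are hypothetical inputs, and this is only one of several possible approaches, not a rule prescribed in the text.

import math

def cochran_sample_size(population, margin_of_error=0.05, z=1.96, p=0.5):
    """Estimate the sample size needed to estimate a proportion.

    z = 1.96 corresponds to 95 per cent confidence; p = 0.5 gives the
    most conservative (largest) sample size.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    # Finite-population correction for a known, limited population.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

# Hypothetical example: 2,500 customers, 5% margin of error, 95% confidence.
print(cochran_sample_size(2_500))      # about 334

# The standard error of a mean falls with the square root of n,
# so quadrupling the sample size halves the standard error.
sigma = 10.0
for n in (100, 400):
    print(n, sigma / math.sqrt(n))     # 1.0, then 0.5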
There are two main categories: Random (Probabilistic) and Non-random (Non-Probabilistic).
5.6.1. PROBABILITY SAMPLING
When elements in the population have a known chance of being chosen as subjects in the sample, we resort to a probability sampling design. Probability sampling is also known as random sampling. It is a procedure in which every member of the population has a known, non-zero, equal and independent chance of selection in the sample. Here it is blind chance alone that determines whether one item or another is selected.
Equal implies that the probability of selection of each element in the population is the same; that is, the
choice of an element in the sample is not influenced by other considerations such as personal preference. The
concept of independence means that the choice of one element is not dependent upon the choice of another
element in the sampling; that is, the selection or rejection of one element does not affect the inclusion or
exclusion of another. To explain these concepts let us return to our example of the class.
Suppose there are 80 students in the class. Assume 20 of these refuse to participate in your study. You want
the entire population of 80 students in your study but, as 20 refuse to participate, you can only use a sample
of 60 students. The 20 students who refuse to participate could have strong feelings about the issues you
wish to explore, but your findings will not reflect their opinions. Their exclusion from your study means
that each of the 80 students does not have an equal chance of selection. Therefore, your sample does not
represent the total class.
The same could apply to a community. In a community, in addition to the refusal to participate, let us
assume that you are unable to identify all the residents living in the community. If a significant proportion
of people cannot be included in the sampling population because they either cannot be identified or refuse to
participate, then any sample drawn will not give each element in the sampling population an equal chance of being selected.
To understand the concept of an independent chance of selection, let us assume that there are five students
in the class who are extremely close friends. If one of them is selected but refuses to participate because the
other four are not chosen, and you are therefore forced to select either the five or none, then your sample
will not be considered an independent sample since the selection of one is dependent upon the selection of
others. The same could happen in the community where a small group says that either all of them or none
of them will participate in the study. In these situations where you are forced either to include or to exclude
a part of the sampling population, the sample is not considered to be independent, and hence is not
representative of the sampling population. However, if the number of refusals is fairly small, in practical
terms, it should not make the sample non-representative. In practice there are always some people who do
not want to participate in the study but you only need to worry if the number is significantly large.
A sample can only be considered a random/probability sample (and therefore representative of the
population under study) if both these conditions are met. Otherwise, bias can be introduced into the study.
There are two main advantages of random/probability samples:
As they represent the total sampling population, the inferences drawn from such
samples can be generalized to the total sampling population.
Some statistical tests based upon the theory of probability can be applied only to
data collected from random samples. Some of these tests are important for
establishing conclusive correlations.
Suppose the researcher decides to take a sample of 150. Then the stratum sample sizes will be:
Accounting: 0.31 × 150 = 47
Economics: 0.33 × 150 = 50
Total sample size: 150
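The following Python sketch shows one way of carrying out proportionate allocation and then drawing a simple random sample within each stratum. The third stratum and the stratum population sizes are hypothetical, added only so that the shares sum to the whole, and the largest-remainder rounding is just one reasonable convention.

import random

def proportional_allocation(strata_sizes, total_sample):
    """Largest-remainder proportional allocation; the shares sum exactly to total_sample."""
    population = sum(strata_sizes.values())
    exact = {k: total_sample * v / population for k, v in strata_sizes.items()}
    alloc = {k: int(x) for k, x in exact.items()}
    shortfall = total_sample - sum(alloc.values())
    for k in sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)[:shortfall]:
        alloc[k] += 1                      # hand any leftover units to the largest remainders
    return alloc

# Hypothetical stratum population sizes (shares roughly mirror the worked example above).
sizes = {"Accounting": 310, "Economics": 330, "Management": 360}
allocation = proportional_allocation(sizes, total_sample=150)
print(allocation)                          # e.g. {'Accounting': 47, 'Economics': 49, 'Management': 54}

# Within each stratum the allocated number of elements is then drawn by simple random sampling.
frames = {name: [f"{name}-{i}" for i in range(size)] for name, size in sizes.items()}
sample = {name: random.sample(frames[name], allocation[name]) for name in frames}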
b. Disproportionate stratified random sampling- This method does not give proportionate representation to strata. All strata may be given equal weight even though their shares in the total population vary, and consideration might not be given to the size of each stratum. Strata exhibiting more variability might be sampled more than proportionately to their relative size; conversely, strata that are very homogeneous might be sampled less than proportionately. The choice may depend upon considerations of personal judgement and convenience.
MERITS OF STRATIFIED RANDOM SAMPLING
This method has several advantages, which may be summarized as below:
If a correct stratification has been made, even a small number of units will form a
representative sample
Under stratified random sampling, no significant group is left underrepresented
Stratified random sampling is more precise and to a great extent avoids bias. It also saves
time and cost of data collection since the sample size can be less in this method
It is the only sampling plan which enables us to achieve different degrees of accuracy for
different segments of the population. Replacement of a case is easy in this method if the original case is not accessible to study: if a person refuses to cooperate with the investigator, he may be replaced by another person from the same sub-group.
DEMERITS OF STRATIFIED RANDOM SAMPLING
This method has some demerits, which are listed below:
It is a very difficult task to divide the universe into homogeneous strata
If the strata are over-lapping, unsuitable or disproportionate, the selection of samples may not be representative
Systematic sampling works well only if a complete and up-to-date frame is available and if the units are randomly arranged
Any hidden periodicity in the list will adversely affect the representativeness of the sample
4. CLUSTER SAMPLING
Simple random and stratified sampling techniques are based on a researcher‘s ability to identify each
element in a population. It is easy to do this if the total sampling population is small, but if the population is
large, as in the case of a city, state or country, it becomes difficult and expensive to identify each sampling
unit. In such cases the use of cluster sampling is more appropriate.
Cluster sampling is based on the ability of the researcher to divide the sampling population into groups
(based upon visible or easily identifiable characteristics), called clusters, and then to select elements within
each cluster, using the SRS technique. Clusters can be formed on the basis of geographical proximity or a
common characteristic that has a correlation with the main variable of the study (as in stratified sampling).
Depending on the level of clustering, sometimes sampling may be done at different levels. These levels
constitute the different stages (single, double or multiple) of clustering, which will be explained later.
Imagine you want to investigate the attitude of post-secondary students in Ethiopia towards problems in
higher education in the country. Higher education institutions are found in every state and territory of the country. In
addition, there are different types of institutions, for example universities, science and technology
universities, colleges of technical education. Within each institution various courses are offered at both
undergraduate and postgraduate levels. Each academic course could take three to four years. You can
imagine the magnitude of the task. In such situations cluster sampling is extremely useful in selecting a
random sample.
The first level of cluster sampling could be at the state or territory level. Clusters could be grouped according
to similar characteristics that ensure their comparability in terms of student population. If this is not easy,
you may decide to select all the states and territories and then select a sample at the institutional level. For
example, with a simple random technique, one institution from each category within each state could be
selected (one university, one university of technology and one college of technical education). This is
based upon the assumption that institutions within a category are broadly similar to one another with respect to the issues under study.
The technique involves taking a series of cluster samples, each involving some form of random sampling. In order
to minimize the impact of selecting smaller and smaller sub-groups on the representativeness of your sample, you
can apply stratified sampling techniques (discussed earlier). This technique can be further refined to take account of
the relative size of the sub-groups by allocating the sample size for each sub-group. In this type of sampling
primary sample units are inclusive groups and secondary units are sub-groups, and the successive stages of sampling usually correspond to these levels of grouping. Multi-stage sampling, however, has some demerits:
Errors are likely to be large in this method in comparison to any other method
It is usually less efficient than a suitable single stage sampling of the same
It involves considerable amount of listing of first stage units, second stage units etc. though
complete listing of units may not be necessary
It is a difficult and complex method of samplings
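The basic two-stage logic can be sketched in Python as follows: clusters (here, hypothetical institutions) are selected at random first, and a simple random sample of elements is then drawn within each selected cluster. All names and sizes are invented for illustration.

import random

rng = random.Random(0)

# Hypothetical sampling frame: institutions (clusters) and the students within them.
clusters = {
    "University A": [f"A-student-{i}" for i in range(400)],
    "University B": [f"B-student-{i}" for i in range(350)],
    "College C": [f"C-student-{i}" for i in range(200)],
    "College D": [f"D-student-{i}" for i in range(250)],
}

# Stage 1: randomly select a subset of clusters.
selected_clusters = rng.sample(list(clusters), k=2)

# Stage 2: simple random sample of elements within each selected cluster.
two_stage_sample = {name: rng.sample(clusters[name], k=30) for name in selected_clusters}

for name, students in two_stage_sample.items():
    print(name, len(students))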
5.6.2. NON-PROBABILITY SAMPLING
The techniques for selecting samples discussed earlier have all been based on the assumption that your sample will
be chosen statistically at random. Consequently, it is possible to specify the probability that any case will be
included in the sample. However, within business research, such as market surveys and case study research, this
may either not be possible (as you do not have a sampling frame) or appropriate to answering your research
question. This means your sample must be selected in some other way. Non-probability sampling (or non-random sampling) provides a range of alternative techniques for selecting samples, most of which involve an element of subjective judgement.
Convenience sampling (haphazard sampling) involves selecting haphazardly those cases that are easiest to obtain
for your sample, such as the person interviewed at random in a shopping center for a television programme or the
book about entrepreneurship you find at the airport. The sample selection process is continued until your required
sample size has been reached. Although this technique of sampling is used widely, it is prone to bias and influences
that are beyond your control, as the cases appear in the sample only because of the ease of obtaining them.
Convenience samples are best used for exploratory research when additional research will subsequently be
conducted with a probability sample.
It is the least reliable method but is cheap and easy to carry out. This method of sampling is common among market researchers and newspaper reporters. The term incidental or accidental is applied to those samples that are taken because they are most readily available, i.e., it refers to groups which are used as samples of a population because they are readily available or because the researcher is unable to employ more acceptable sampling methods. This method may be used in the following cases:
The universe is not clearly defined
Sampling unit is not clear
A complete source list is not available
MERITS OF CONVENIENCE SAMPLING
It is a very easy method of sampling.
It reduces time, money and energy, i.e., it is an economical method.
DEMERITS OF CONVENIENCE SAMPLING
It is not representative of the population.
MERITS OF PURPOSIVE SAMPLING
This method is very useful especially when some of the units are very important and their inclusion in the study is necessary
It is a practical method where randomization is not possible
Use of the best available knowledge concerning the sample subjects.
More economical and less time consuming.
DEMERITS OF PURPOSIVE SAMPLING
Under this method, considerable prior knowledge of the universe is necessary which in most
cases is not possible
Control and safeguards adopted under this method are sometimes not effective, and there is every possibility of the selection of biased samples
Under this method, the calculation of sampling errors is not possible; therefore, the hypotheses framed cannot be tested
MERITS OF QUOTA SAMPLING
Quota sampling is the combination of stratified and purposive sampling and thus enjoys the benefits of both methods. It makes the best use of stratification economically. Thus, it is a practical as well as convenient method.
If proper controls/checks are applied, quota sampling is likely to give accurate results
It is a useful method when no sampling frame is available
It is the least expensive way of selecting a sample;
It guarantees the inclusion of the type of people you need.
DEMERITS OF QUOTA SAMPLING
This method suffers from the limitations of both stratified and purposive sampling
The bias may also occur due to substitution of unlike sample units.
This sampling technique (snowball sampling) is useful if you know little about the group or organisation you wish to study, or when you are trying to reach populations that are inaccessible or hard to find, as you need only make contact with a few individuals, who can then direct you to the other members of the group. For example, if you want to study the problems faced by Ethiopians living in another country, you may identify an initial group of Ethiopians through some source such as the Ethiopian Embassy. You can then ask each one of them to supply the names of other Ethiopians known to them, and continue until you get an exhaustive list from which you can draw a sample or make a census survey.
The main problem is making initial contact. Once you have done this, these cases identify further members of the
population, who then identify further members. For such samples the problems of bias are huge, as respondents are
most likely to identify other potential respondents who are similar to themselves, resulting in a homogeneous
sample. The next problem is to find these new cases. However, for populations that are difficult to identify,
snowball sampling may provide the only possibility.
MERITS OF SNOWBALL SAMPLING
It is very useful in studying social groups, informal groups in a formal organization, and the diffusion of information among professionals of various kinds.
It is useful for smaller populations for which no frame is readily available
DEMERITS OF SNOWBALL SAMPLING
It does not allow the use of probability-based statistical methods; the elements included are dependent on the subjective choice of the originally selected respondents.
It is difficult to apply it when the population is very large
It does not ensure the inclusion of all elements in the list.
As the main aim in qualitative enquiries is to explore the diversity, sample size and sampling strategy do not play a
significant role in the selection of a sample. If selected carefully, diversity can be extensively and accurately
described on the basis of information obtained even from one individual. All nonprobability sampling designs –
purposive, judgmental, expert, accidental and snowball – can also be used in qualitative research with two
differences:
1 In quantitative studies you collect information from a predetermined number of people but,
in qualitative research, you do not have a sample size in mind. Data collection based upon
a predetermined sample size and the saturation point distinguishes their use in quantitative
and qualitative research.
2 In quantitative research you are guided by your desire to select a random sample, whereas
in qualitative research you are guided by your judgement as to who is likely to provide you
with the ‗best‘ information.
6.1 Introduction
Most educational research will lead to the gathering of data by means of some standardized test or self-
constructed research tools. It should provide objective data for interpretation of results achieved in the
study. The data may be obtained by administering questionnaires, personal observations, interviews and
many other techniques of collecting quantitative and qualitative evidence. The researcher must know how
much and what kind of data collection will take place and when. He/she must also be sure that the types of
data obtainable from the selected instruments will be usable in whatever statistical model he/she will later use to bring out the significance of the study. Accordingly, in this particular chapter, data types and the rationale for each, sources of data, and the methods of data collection with their pros and cons will be dealt with in depth.
Data serves as a basis of analysis. Without analysis of data, no inference can be drawn on the questions
under study. Otherwise, it would be an arbitrary guess or imagination of the issue under scrutiny and hence
unreliable. Besides, having data doesn‘t guarantee a valid inference – the relevance, adequacy and
reliability of the data determine the quality of the findings of a given study. Note also that not all data are important for your analysis, so be as precise as possible in your data collection. If you plan seriously and design your data collection carefully, this should not be a problem.
Finding out first-hand the attitudes of a community towards health services, ascertaining the health needs of a community, evaluating a social programme, determining the job satisfaction of the employees of an organisation, and ascertaining the quality of service provided by a worker are all examples of information collected from primary sources.
Secondary data means data that are already available, i.e., they refer to data which have already been collected and analyzed by someone else. Secondary data are thus collected by one party and used by another. Any data that were collected earlier for some other purpose are secondary data in the hands of an individual
who is using them. The use of census data to obtain information on the age– sex structure of a population,
the use of hospital records to find out the morbidity and mortality patterns of a community, the use of an
organization's records to ascertain its activities, and the collection of data from sources such as articles,
journals, magazines, books and periodicals to obtain historical and other types of information, are all
classified as secondary sources.
Advantages
It can be found more quickly and cheaply
It improves understanding of the problem
Disadvantages
It may be out of date
It may not be adequate
The information may not meet one's specific needs, since it was collected by others for their own purposes: definitions may differ, units of measurement may differ, and different time periods may be involved.
Several methods can be used to collect primary data. The choice of a method depends upon the purpose of
the study, the resources available and the skills of the researcher. There are times when the method most
appropriate to achieve the objectives of a study cannot be used because of constraints such as a lack of
resources and/or required skills. In such situations you should be aware of the problems that these
limitations impose on the quality of the data.
In selecting a method of data collection, the socioeconomic and demographic characteristics of the study population play an important role: you should know as much as possible about characteristics such as educational level, age structure, socioeconomic status and ethnic background. If possible, choose a method with which the study population is likely to feel comfortable.
Another important determinant of the quality of your data is the way the purpose and relevance of the study
are explained to potential respondents. Whatever method of data collection is used, make sure that
respondents clearly understand the purpose and relevance of the study. This is particularly important when
you use a questionnaire to collect data, because in an interview situation you can answer a respondent‘s
questions but, in a questionnaire, you will not have this opportunity. In the following sections each method
of data collection is discussed from the point of view of its applicability and suitability to a situation, and
the problems and limitations associated with it. There are several methods of collecting primary data. The
most common means of collecting data are the interview and the questionnaire.
6.3.1 Questionnaire
A questionnaire is a written list of questions, the answers to which are recorded by respondents. In a
questionnaire respondents read the questions, interpret what is expected and then write down the answers.
Questionnaires are essentially a specifically noted list of questions that are often defined as a basic form of
acquiring and recording different data or information in relation to a particular topic of study, which are put
together with unambiguous instructions, as well as adequate spacing for details of administration and
answers (Adams & Cox, 2008). This method of data collection is quite popular, particularly in case of big
investigations. It is being adopted by private individuals, research workers, private and public organizations
and even by governments.
In the case of a questionnaire, as there is no one to explain the meaning of questions to respondents, it is
important that the questions are clear and easy to understand. Also, the layout of a questionnaire should be
such that it is easy to read and pleasant to the eye, and the sequence of questions should be easy to follow.
A questionnaire should be developed in an interactive style. This means respondents should feel as if
someone is talking to them. In a questionnaire, a sensitive question or a question that respondents may feel
hesitant about answering should be prefaced by an interactive statement explaining its relevance to the study.
Advantages of a questionnaire
It is less expensive. As you do not interview respondents, you save time, and
human and financial resources. The use of a questionnaire, therefore, is
comparatively convenient and inexpensive. Particularly when it is administered
collectively to a study population, it is an extremely inexpensive method of data
collection. Can reach a large number of people.
It offers greater anonymity. As there is no face-to-face interaction between
respondents and interviewer, this method provides greater anonymity. In some
situations where sensitive questions are asked it helps to increase the likelihood of
obtaining accurate information. It is free from bias of the interviewer
Respondents have adequate time to give well thought out answers.
Can provide information about the participants' internal meanings and ways of thinking
Provide exact information needed by researcher (especially the closed ended questions)
Ease of data analysis (for closed ended)
Disadvantages of a questionnaire
Application is limited. One main disadvantage is that application is limited to a
study population that can read and write. It cannot be used on a population that is
illiterate, very young, very old or handicapped.
Response rate is low. Questionnaires are notorious for their low response rates;
that is, people fail to return them. If you plan to use a questionnaire, keep in mind
that because not everyone will return their questionnaire, your sample size will in
effect be reduced. The response rate depends upon a number of factors: the
interest of the sample in the topic of the study; the layout and length of the
questionnaire; the quality of the letter explaining the purpose and relevance of the
study; and the methodology used to deliver the questionnaire. You should consider
yourself lucky to obtain a 50 per cent response rate and sometimes it may be as
low as 20 per cent. However, as mentioned, the response rate is not a problem
when a questionnaire is administered in a collective situation.
Open-ended questionnaires differ in that they allow the respondents to formulate and record their answers in their own words. These are more qualitative and can produce detailed answers to complex questions.
When deciding whether to use open-ended or closed questions to obtain information about a variable,
visualize how you plan to use the information generated. This is important because the way you frame your
questions determines the unit of measurement which could be used to classify the responses. The unit of
measurement in turn dictates what statistical procedures can be applied to the data and the way the
information can be analyzed and displayed. In closed questions, having developed categories, you cannot
change them; hence, you should be very certain about your categories when developing them. If you ask an
open-ended question, you can develop any number of categories at the time of analysis.
Both open-ended and closed questions have their advantages and disadvantages in different situations. To
some extent, their advantages and disadvantages depend upon whether they are being used in an interview
or in a questionnaire and on whether they are being used to seek information about facts or opinions. As a
rule, closed questions are extremely useful for eliciting factual information and open-ended questions for
seeking opinions, attitudes and perceptions. The choice of open-ended or closed questions should be made
according to the purpose for which a piece of information is to be used, the type of study population from
which information is going to be obtained, the proposed format for communicating the findings and the
socioeconomic background of the readership.
CONTENTS OF A QUESTIONNAIRE
There are three portions of a questionnaire:
The cover letter (it should explain to the respondent the purpose of the survey and motivate him to reply truthfully and quickly)
The instructions (these explain how to complete the survey and where to return it)
The questions
The main advantage of this is that the researcher or a member of the research team can collect all the
completed responses within a short period. Any doubts that the respondents might have on any question
could be clarified on the spot. The researcher is also afforded the opportunity to introduce the research topic
and motivate the respondents to offer their frank answers. Administering questionnaires to large numbers
of individuals at the same time is less expensive and consumes less time than interviewing; nor does it require as much skill to administer a questionnaire as to conduct interviews.
However, organizations are often unable or reluctant to allow work hours to be spent on data collection,
and other ways of getting the questionnaires back after completion may have to be found. In such cases,
employees may be given blank questionnaires to be collected from them personally on completion after a
few days, or mailed back by a certain date in self-addressed, stamped envelopes provided to them for the
purpose.
The questionnaire must intimately relate to the final objective of investigation: One
should make sure that the questionnaire items match with the research objectives.
Understand your research participant: Your participant (Not you) will be filling out the questionnaire.
We should consider the demographic and cultural characteristics of our potential participants, so we can
make it understandable to them. Respondent knowledge of the subject, ability and willingness should be
properly weighted.
Always use simple and everyday language. Your respondents may not be highly educated, and even if
they are they still may not know some of the ‗simple‘ technical jargon that you are used to. Particularly in a
questionnaire, take extra care to use words that your respondents will understand as you will have no
opportunity to explain questions to them. A pre-test should show you what is and what is not understood by
your respondents. If a questionnaire is to be translated for use in several districts/local dialects, the translated version should be retranslated into the original language to check its fidelity.
Do not use double-barrelled questions. A double-barrelled question asks about two issues at once; such questions should be avoided and two or more separate questions asked instead. For example, the question "Do you think there is a good market for the product and that it will sell well?" could bring a "yes" response to the first part (i.e., there is a good market for the product) and a "no" response to the latter part (i.e., it will not sell well for various other reasons). In this case, it would be better to ask two questions: (1) "Do you think there is a good market for the product?" and (2) "Do you think the product will sell well?" The answers might be "yes" to both, "no" to both, "yes" to the first and "no" to the second, or "yes" to the second and "no" to the first.
Do not use ambiguous questions. Even questions that are not double-barreled might be ambiguously
worded and the respondent may not be sure what exactly they mean. An ambiguous question is one that
contains more than one meaning and that can be interpreted differently by different respondents. This will
result in different answers, making it difficult, if not impossible, to draw any valid conclusions from the
information. An example of such a question is "To what extent would you say you are happy?" Respondents
might find it difficult to decide whether the question refers to their state of feelings at the workplace, or at
home, or in general.
Do not ask leading questions. A leading question is one which, by its contents, structure or wording,
leads a respondent to answer in a certain direction. Such questions are judgmental and lead respondents to
answer either positively or negatively. Always remember that you do not want the participant's response to be the result of how you worded the questions. For example,
Write items that are clear, precise and relatively short: If your respondent/participant didn‘t understand
the items, your data will be invalid (i.e., your research study will have the garbage in, garbage out,
syndrome), the items should be short; short items are more easily understood and less stressful than long
items.
Avoid double negatives: does the answer provided by the participant require combining two negatives? (e.g., "I disagree that promoters should not be required to supervise the cooperatives during audit time"; if yes, rewrite it)
Keep the questions short: finally, simple, short questions are preferable to long ones. As a rule of thumb,
a question or a statement in the questionnaire should not exceed 20 words, or exceed one full line in print.
The sequence of questions in the questionnaire should be such that the respondent is led from questions of a
general nature to those that are more specific, and from questions that are relatively easy to answer to those
that are progressively more difficult. An attractive and neat questionnaire with appropriate introduction,
instructions, and well-arrayed set of questions and response alternatives will make it easier for the
respondents to answer them. A good introduction, well- organized instructions, and neat alignment of the
questions are all important. These elements are briefly discussed with examples.
This questionnaire is designed to study aspects of life at work at Arbaminch University. The information
you provide will help us better understand the quality of our work life. Because you are the one who can
give us a correct picture of how you experience your work life, I request you to respond to the questions
frankly and honestly.
Your response will be kept strictly confidential. Only members of the research team will have access to the
information you give. In order to ensure the utmost privacy, we have provided an identification number for
each participant. A summary of the results will be mailed to you after the data are analysed.
Thank you very much for your time and cooperation. I greatly appreciate your organization's help and yours in furthering this research endeavour.
General instructions
Please make a tick mark (✓) in the appropriate box that represents your level of agreement or disagreement with a given statement
If you have any difficulty on how to fill the questionnaire, please don‘t hesitate to
contact me through the following address:
Cordially
Z W
No formal education, but can read and write / Primary education / Secondary education / College diploma / Bachelor degree / Master degree / PhD and above
12. –
13. Etc.
6.3.3 INTERVIEWS
An interview involves an interviewer reading questions to respondents and recording their answers. The
interview is like a conversation and has the purpose of obtaining information relevant to a particular research
topic. The interview method of collecting data involves presentation of oral-verbal stimuli and reply in
terms of oral-verbal responses. The person who asks the questions is the interviewer; the people who respond to the questions are called interviewees or respondents.
Advantages of the interview
The interview is more appropriate for complex situations. It is the most
appropriate approach for studying complex and sensitive areas as the interviewer
has the opportunity to prepare a respondent before asking sensitive questions and
to explain complex ones to respondents in person.
It is useful for collecting in-depth information. In an interview situation it is
possible for an investigator to obtain in-depth information by probing. Hence, in
situations where in- depth information is required, interviewing is the preferred
method of data collection.
Information can be supplemented. An interviewer is able to supplement
information obtained from responses with those gained from observation of non-
verbal reactions.
Questions can be explained. It is less likely that a question will be
misunderstood as the interviewer can either repeat it or put it in a form that is understood by the respondent.
INTERVIEW DESIGNS
A. Structured interviews
In a structured interview the researcher asks a predetermined and ‗standardized‘ or identical set of questions,
using the same wording and order of questions as specified in the interview schedule. You read out each
question and then record the response on a standardized schedule, usually with pre-coded answers. The
interviewer has no freedom to rephrase questions, add extra ones, or change the order in which the questions have to be presented. Thus, a structured interview follows a rigid, predetermined procedure. While there is social interaction between you and the participant, such as the preliminary explanations that you will need to provide, you should read out the questions exactly as written and in the same tone of voice so that you do not indicate any bias. One of the main advantages of structured interviews is that they provide uniform information, which assures the comparability of data.
B. Semi-structured Interviews
In semi-structured interviews the researcher will have a list of themes and questions to be covered,
although these may vary from interview to interview. This means that you may omit some questions in
particular interviews, given a specific organizational context that is encountered in relation to the research
topic. The order of questions may also be varied depending on the flow of the conversation. On the other
hand, additional questions may be required to explore your research question and objectives given the nature
of events within particular organizations. You may formulate questions and raise issues on the spur of the
moment, depending upon what occurs to you in the context of the discussion. The nature of the questions
and the ensuing discussion mean that data will be recorded by audio recording the conversation or perhaps
note taking.
C. Unstructured Interviews
Unstructured interviews are informal. You would use these to explore in depth a general area in which
you are interested. There is no predetermined list of questions to work through in this situation, although
you need to have a clear idea about the aspect or aspects that you want to explore. The interviewee is given
the opportunity to talk freely about events, behavior and beliefs in relation to the topic area. Unstructured
interviews are usually labeled as "focused", "depth" and "non-directive". The focused interview aims at
some particular event or experience rather than on general lines of inquiry about an event. The depth
interview is searching and giving emphasis to psychological and social factors. The non-directive interview
permits much freedom to the interviewees to talk about the problem under investigation.
It has been labelled as an informant interview since it is the interviewee‘s perceptions that guide the
conduct of the interview. In comparison, a participant interview is one where the interviewer directs the
interview and the interviewee responds to the questions of the researcher (Easterby- Smith et al. 2008;
Robson 2002). The unstructured interview is found to be a very important technique of data collection in the case of
exploratory/formulative studies. But in case of descriptive studies, we quite often use structured interview
technique because of its being economical, providing a safe basis for generalization and requiring relatively
less skill/knowledge on the part of the interviewer.
1. FACE-TO-FACE INTERVIEWS
The main advantage of face-to-face interviews is that the researcher can adapt the questions as necessary, clarify
doubts, and ensure that the responses are properly understood, by repeating or rephrasing the questions.
The researcher can also pick up nonverbal cues from the respondent. Any discomfort, stress, or problems
that the respondent experiences can be detected through frowns, nervous tapping, and other body language
unconsciously exhibited by her. This would be impossible to detect in a telephone interview.
The main disadvantages of face-to-face interviews are the geographical limitations they may impose on the
surveys and the vast resources needed if such surveys need to be done nationally or internationally. The
costs of training interviewers to minimize interviewer biases (e.g., differences in questioning methods,
interpretation of responses) are also high. Another drawback is that respondents might feel uneasy about
the anonymity of their responses when they interact face to face with the interviewer.
2. TELEPHONE INTERVIEWS
This method is a non-personal method used to collect data by contacting respondents by telephone. Though it is not a very widely used method, it plays an important part in industrial surveys, particularly in developed
regions. Telephone interviews are best suited when information from a large number of respondents spread
over a wide geographic area is to be obtained quickly, and the likely duration of each interview is, say, 10
minutes or less. Many market surveys, for instance, are conducted through structured telephone interviews.
TYPES OF OBSERVATION
There are two types of observation:
Participant observation, and
Non-participant observation
Participant observation is when a researcher participates in the activities of the group being observed in
the same manner as its members, with or without their knowing that they are being observed. This enables
researchers to share their experiences by not merely observing what is happening but also feeling it.
Example: Suppose you want to examine the reactions of the general population towards people in wheel
chairs. To study their reactions, you could sit in a wheelchair yourself. Or else, if you want to study the life of prisoners, you could pretend to be a prisoner in order to observe them.
Non-participant observation, on the other hand, is when the researcher does not get involved in the
activities of the group but remains a passive observer, watching and listening to its activities and drawing conclusions from this. For example, you might want to study the functions carried out by nurses in a hospital by watching and recording their activities without taking part in them.
6.3.6. KEY INFORMANTS
The use of key informants is another important technique to gain access to potentially available
information. Key informants could be knowledgeable community leaders or administrative staff at various
levels and one or two informative members of the target group of your research. For instance, if you want
to study time series analysis of energy cost efficiency in Arba Minch University, you may collect primary
data using the method of key informants. Method of key informants is good when the types of data you
need are relatively objective - like energy expense of Arba Minch University.
So far, we have discussed the primary sources of data collection where the required data was collected
either by you or by someone else for the specific purpose you have in mind. There are occasions when your
data have already been collected by someone else and you need only to extract the required information for
the purpose of your study. Such data are known as secondary data.
In terms of measurement procedures, therefore, validity is the ability of an instrument to measure what it is
designed to measure: ‗Validity is defined as the degree to which the researcher has measured what he has
set out to measure‘ (Smith 1991). According to Kerlinger, (1973), ‗The commonest definition of validity is
epitomized by the question: Are we measuring what we think we are measuring?‘ Babbie (1989: 133),
writes, ‗validity refers to the extent to which an empirical measure adequately reflects the real meaning of
the concept under consideration‘.
In the social sciences there appear to be two approaches to establishing the validity of a research
instrument. These approaches are based upon either logic that underpins the construction of the research
tool or statistical evidence that is gathered using information generated through the use of the instrument.
Establishing validity through logic implies justification of each question in relation to the objectives of the
study, whereas the statistical procedures provide hard evidence by way of calculating the coefficient of
correlations between the questions and the outcome variables.
Establishing a logical link between the questions and the objectives is both simple and difficult. It is simple
in the sense that you may find it easy to see a link for yourself, and difficult because your justification may
lack the backing of experts and the statistical evidence to convince others.
For example, if you want to find out about age, income, height or weight, it is relatively easy to establish
the validity of the questions, but to establish whether a set of questions is measuring, say, the effectiveness of
a programme, the attitudes of a group of people towards an issue, or the extent of satisfaction of a group of
consumers with the service provided by an organisation is more difficult. When a less tangible concept is
involved, such as effectiveness, attitude or satisfaction, you need to ask several questions in order to cover
different aspects of the concept and demonstrate that the questions asked are actually measuring it. Validity
in such situations becomes more difficult to establish, and especially in qualitative research where you are
mostly exploring feelings, experiences, perceptions, motivations or stories. It is important to remember that
the concept of validity is pertinent only to a particular instrument and it is an ideal state that you as a
researcher aim to achieve.
TYPES OF VALIDITY
1. FACE AND CONTENT VALIDITY
The judgement is based upon subjective logic; hence, no definite conclusions can be drawn. Different people may have different opinions about the face and content validity of an instrument.
2. PREDICTIVE AND CONCURRENT VALIDITY
Predictive validity is judged by the degree to which an instrument can forecast an outcome. Concurrent validity is judged by how well an instrument compares with a second assessment done concurrently: 'It is usually possible to express predictive validity in terms of the correlation coefficient between the predicted status and the criterion. Such a coefficient is called a validity coefficient' (Burns 1997).
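As a small illustration of a validity coefficient, the Python lines below compute Pearson's r between hypothetical selection-test scores and later criterion (job-performance) ratings; the data are invented solely for the example.

import numpy as np

# Hypothetical data: scores on a selection test and later job-performance ratings.
test_scores = np.array([52, 61, 45, 70, 66, 58, 49, 75, 63, 55])
criterion   = np.array([3.1, 3.8, 2.9, 4.4, 4.0, 3.5, 3.0, 4.6, 3.9, 3.3])

# The validity coefficient is simply the correlation between predicted status and the criterion.
validity_coefficient = np.corrcoef(test_scores, criterion)[0, 1]
print(round(validity_coefficient, 2))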
3. CONSTRUCT VALIDITY
Construct validity is a more sophisticated technique for establishing the validity of an instrument. It is based
upon statistical procedures. It is determined by ascertaining the contribution of each construct to the total
variance observed in a phenomenon. Suppose you are interested in carrying out a study to find the degree
of job satisfaction among the employees of an organisation. You consider status, the nature of the job and
remuneration as the three most important factors indicative of job satisfaction, and construct questions to
ascertain the degree to which people consider each factor important for job satisfaction.
After the pre-test or data analysis you use statistical procedures to establish the contribution of each
construct (status, the nature of the job and remuneration) to the total variance (job satisfaction). The
contribution of these factors to the total variance is an indication of the degree of validity of the instrument.
The greater the variance attributable to the constructs, the higher the validity of the instrument.
When you collect the same set of information more than once using the same instrument and get the same
or similar results under the same or similar conditions, an instrument is considered to be reliable. The level
of an instrument‘s reliability is dependent on its ability to produce the same score when used repeatedly.
The reliability of an instrument can be tested using a statistical measure called Cronbach's alpha. A Cronbach's alpha score of 0.7 or above is generally considered acceptable.
Internal consistency measure is used as the measure of reliability of an instrument. The idea behind internal
consistency procedures is that items or questions measuring the same phenomenon, if they are reliable
indicators, should produce similar results irrespective of their number in an instrument. Even if you
randomly select a few items or questions out of the total pool to test the reliability of an instrument, each
segment of questions thus constructed should reflect reliability more or less to the same extent. It is based
upon the logic that if each item or question is an indicator of some aspect of a phenomenon, each segment
constructed will still reflect different aspects of the phenomenon even though it is based upon fewer
items/questions. Hence, even if we reduce the number of items or questions, as long as they reflect some
aspect of a phenomenon, a lesser number of items can provide an indication of the reliability of an
instrument. The internal consistency procedure is based upon this logic.
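A minimal sketch of this idea, assuming a small matrix of invented Likert-type item scores, is the direct computation of Cronbach's alpha below, using the standard formula alpha = k/(k−1) × (1 − Σ item variances / variance of the total score).

import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a respondents-by-items matrix of scores."""
    items = np.asarray(item_scores, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)          # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)      # variance of each respondent's total score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses of six people to four items intended to measure one construct.
scores = [
    [4, 4, 5, 4],
    [3, 3, 3, 2],
    [5, 4, 5, 5],
    [2, 2, 3, 2],
    [4, 5, 4, 4],
    [3, 3, 2, 3],
]
print(round(cronbach_alpha(scores), 2))   # values of about 0.7 or above are usually considered acceptable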
Let us take an example. Suppose you develop a questionnaire to ascertain the prevalence of domestic
violence in a community. You administer this questionnaire and find that domestic violence occurs in a certain proportion of households. If you administer the same questionnaire again under similar conditions and obtain a similar figure, the instrument can be considered reliable.
In the preceding chapter, you learned about data and methods of data collection. But in this chapter, you will
see the next step in a research process, i.e., how to process and make sense of the data collected in the form
of written text. Data analysis is now routinely done with software programs such as SPSS (Statistical
Package for Social Sciences), Excel, and the like. The goal of any research is to provide information from
raw data. The raw data after collection has to be processed and analyzed in line with the plan laid down for
the purpose at the time of developing the research plan. However, before we start analyzing the data some
preliminary steps need to be completed. These help to ensure that the data are reasonably good and of
assured quality for further analysis. Thus, the compiled data must be classified, processed, analyzed, and
interpreted.
A very common phrase that is used by researchers is "garbage in, garbage out." This refers to the idea that if data is collected improperly, or coded incorrectly, your results are "garbage," because that is what was
entered into the data set to begin with. Therefore, like any part of the business research process, care and
attention to detail are important requirements for data processing. Technically speaking processing implies
editing, coding, classification and tabulation of collected data. Data collected during the research is
processed with a view to reducing them to manageable dimensions. A careful and systematic processing
will highlight the important characteristics of the data, facilitate comparisons and render them suitable for further statistical analysis and interpretation. In other words, data processing is an intermediate stage between the collection of data and their analysis and interpretation. Therefore, processing comprises the tasks of editing, coding, classification and tabulation. These stages are discussed hereunder:
1. EDITING
Editing is a process of examining the collected raw data (unedited responses from respondent exactly as
indicated by that respondent) to detect errors and omission (extreme values) and to correct those when
possible. Editing is the process of checking and adjusting data for omissions, consistency, and legibility. It
involves a careful scrutiny of completed questionnaires or interview schedules.
a) Field level editing (where the data is collected): during the time of data collection, the
interviewer often uses ad hoc abbreviations, special symbols and the like. As soon as possible,
after an interview, field workers should review their reporting forms, complete what was
abbreviated, translate personal shorthand and re-write illegible entries. But here attention must be
given that investigator must not correct errors of omission simply by guessing what the
respondent would have said if the question had been asked.
b) Central Editing (In-house editing): at this stage, the research form or schedule should get a
thorough editing and this takes place when all forms and all schedules have been completed and
returned to the office. One editor or team of editors may correct obvious errors such as entry in
wrong place, recordings in wrong units, etc. In case of inappropriate or missing replies, the editor
can sometimes determine the proper answer by reviewing the other information in the
schedule. At times, the respondent can be contacted for clarification.
2. CODING
After editing of the collected data, the next step is coding. Coding refers to the assigning of numbers, letters or both to the various responses so as to make tabulation of the information easy. The purpose of coding is to classify the answers into meaningful categories, which is essential for tabulation. Coding is,
therefore, necessary to carry out the subsequent operations of tabulation and analyzing data.
Coding consists of assigning a number or symbols to each answer, which falls in a predetermined class.
Coding means an operation by which data are organized into classes and number or symbol is given to each
item according to the class in which it falls. For example, a researcher may code Male as 0 and Female as
1. The classes must possess the characteristic of exhaustiveness (i.e., there must be a class for every data
item) and also that of mutual exclusivity, which means that a specific answer can be placed in one and only
one cell in a given category set. Assigning numerical symbols permits the transfer of data from
questionnaires forms to a computer.
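A minimal coding sketch in Python follows. The sex codes come from the example just given and the education categories from the specimen questionnaire earlier in the chapter (slightly abbreviated), while the reserved missing-value code and the helper function are assumptions made for illustration.

# Code book: each response category maps to one and only one numeric code.
SEX_CODES = {"Male": 0, "Female": 1}
EDUCATION_CODES = {
    "No formal education": 1,
    "Primary education": 2,
    "Secondary education": 3,
    "College diploma": 4,
    "Bachelor degree": 5,
    "Master degree": 6,
    "PhD and above": 7,
}
MISSING = 9                                # reserved code for blank or illegible answers

def code_response(raw_answer, code_book, missing=MISSING):
    """Return the numeric code for a raw answer; blank or unknown answers get the missing code."""
    return code_book.get(raw_answer.strip(), missing) if raw_answer else missing

raw_records = [("Female", "Bachelor degree"), ("Male", "Primary education"), ("", "PhD and above")]
coded = [(code_response(sex, SEX_CODES), code_response(edu, EDUCATION_CODES))
         for sex, edu in raw_records]
print(coded)                               # [(1, 5), (0, 2), (9, 7)]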
3. CLASSIFICATION
Data classification implies the processes of arranging data in groups or classes based on common
characteristics. Data having common characteristics are placed in one class, and in this way the entire data are divided into a number of groups or classes. In other words, heterogeneous data is divided into separate
homogeneous classes according to characteristics that exist amongst different individuals or quantities
constituting the data. Thus, fundamentally classification is dependent upon similarities and resemblances
among the items in the data. Depending upon the nature of the phenomenon involved, classification can be of the following two types: classification according to attributes and classification according to class intervals (i.e., according to numerical characteristics).
Classification according to Attributes: Data are classified according to some common characteristics
which can be either descriptive (literacy, sex, honesty, etc.) or numerical (such as weight, height, income,
etc.). Descriptive characteristics refer to qualitative phenomena which cannot be measured quantitatively. In
this case, we classify the data only by noticing the presence of these characteristics. Data obtained in this way
on the basis of certain attributes are known as statistics of attributes and their classification is said to be
classification of attributes. Such classification can be simple or manifold. In simple classification we consider only one attribute and divide the universe into two classes: one consisting of items possessing the attribute and the other consisting of items which do not possess it. In manifold classification we consider two or more attributes simultaneously and divide the data into a number of classes.
4. TABULATION
Tabulation involves the orderly and systematic presentation of numerical data in a form
designed to elucidate the problem under consideration. Data can be presented through
tabulation or graphic/diagrammatic forms.
TABULAR PRESENTATION OF DATA
When a mass of data has been assembled, it becomes necessary for the researcher to arrange the same in
some kind of concise and logical order. This procedure is referred to as tabulation. Tabulation: is the
process of arranging given quantitative data based on similarities and common characteristics in certain
rows and columns so as to present the data vividly for quick intelligibility, easy comparability and visual
appeal. It is an orderly arrangement of data in columns and rows. It presents responses or the observations
on a question-by-question or item-by item basis and provides the most basic form of information. It tells
the researcher how frequently each response occurs. Tabulation is essential because of the following
reasons.
It conserves space and reduces explanatory and descriptive statement to a minimum.
It facilitates the process of comparison.
It facilitates the summation of items and the detection of errors and omissions.
It provides a basis for various statistical computations.
Tabulation can be done by hand or by mechanical or electronic devices. The choice depends on the size and
type of study, cost considerations, time pressures and the availability of tabulating machines or computers.
In relatively large inquiries, we may use mechanical or computer tabulation if other factors are favorable
and necessary facilities are available. Hand tabulation is usually preferred in case of small inquiries where
the number of questionnaires is small and they are of relatively short length. Tabulation may also be
classified as simple and complex tabulation. The former type of tabulation gives information about one or
more groups of independent questions, whereas the latter type of tabulation shows the division of data in two
or more categories and as such is designed to give information concerning one or more sets of inter-related
questions.
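As a small illustration of simple versus complex (cross-) tabulation, the Python sketch below counts hypothetical coded responses using only the standard library; the variables and figures are invented.

from collections import Counter

# Hypothetical coded responses: (sex, job satisfaction) for ten respondents.
records = [("Male", "High"), ("Female", "High"), ("Female", "Low"), ("Male", "Low"),
           ("Female", "High"), ("Male", "High"), ("Female", "High"), ("Male", "Low"),
           ("Female", "Low"), ("Male", "High")]

# Simple tabulation: frequency of each response to a single question.
satisfaction_counts = Counter(sat for _, sat in records)
print(satisfaction_counts)                 # Counter({'High': 6, 'Low': 4})

# Complex (cross-) tabulation: two inter-related questions shown together.
cross_tab = Counter(records)
for sex in ("Male", "Female"):
    row = {sat: cross_tab[(sex, sat)] for sat in ("High", "Low")}
    print(sex, row)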
SPECIMEN OF A TABLE
A specimen table carries a table number, a title and, where necessary, a head note.
Generally, when significant amounts of quantitative data are presented in a report or publication, it is most
effective to use tables and/or graphs. Tables permit the actual numbers to be seen most clearly, while
graphs are superior for showing trends and changes in the data.
Analysis can be classified as qualitative analysis and quantitative analysis, this classification based on the
nature of the data (numerical/ quantitative or kind/ qualitative). Qualitative analysis is the analysis of
qualitative data such as text data from interview transcripts and open-ended questions. Unlike quantitative
analysis, which is statistics driven and largely independent of the researcher, qualitative analysis is heavily
dependent on the researcher‘s analytic and integrative skills and personal knowledge of the social context
where the data is collected. The emphasis in qualitative analysis is "sense making" or understanding a
phenomenon, rather than predicting or explaining. A creative and investigative mindset is needed for
qualitative analysis, based on an ethically enlightened and participant-in-context attitude, and a set of
analytic strategies. Quantitative Analysis: numeric data collected in a research project can be analyzed
quantitatively using statistical tools in two different ways: descriptive analysis and inferential analysis. Statistical analysis may range from portraying a simple frequency distribution to more complex multivariate approaches, such as multiple regression.
One way of analysing data is the use of descriptive statistics such as percentage, measures of central
tendency, and measures of dispersion. Descriptive analysis refers to the transformation of raw data in to a
form that will make them easy to understand and interpret. Unlike inferential statistics, descriptive statistics
do not give results beyond description. Descriptive statistics are used to describe the basic features of data.
Summary descriptive statistics are usually represented using simple graphs such as bar graph, pie chart, and
line graph. Descriptive statistics can be easily calculated and graphs can be generated used MS Excel,
STATA, SPSS, or other statistical packages. Descriptive analysis is the elementary transformation of data in
a way that describes the basic characteristics such as central tendency, distribution, and variability.
Inferential analysis:
In descriptive statistics we are simply describing what is, or what the data show: descriptive statistics describe the basic features of the data in a study and provide simple summaries about the sample and the measures. Inferential statistics, by contrast, are used to make inferences or judgements about a population on the basis of sample information; they are taken up again below, after the most commonly used descriptive measures of dispersion.
The Range: The simplest measure of dispersion is the range, which is the difference between the maximum
value and the minimum value of the data.
Variance: variance measures the dispersion or variability of the possible values about their mean. The variance, denoted by Var(X), is the average of the squared deviations of the individual values from their expected value or mean.
Standard deviation: is defined as the square-root of the average of squares of deviations, when such
deviations for the values of individual items in a series are obtained from the arithmetic average. It shows
the average deviation of the observation from the mean value.
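A minimal sketch of these three measures, computed with NumPy on invented figures, is shown below; the sample versions of the variance and standard deviation (ddof=1) are used.

import numpy as np

# Hypothetical data: monthly sales (in thousands of Birr) for twelve branches.
sales = np.array([48, 52, 55, 47, 60, 58, 49, 51, 63, 45, 57, 50])

data_range = sales.max() - sales.min()     # range = maximum value minus minimum value
variance = sales.var(ddof=1)               # average squared deviation from the mean (sample version)
std_dev = sales.std(ddof=1)                # standard deviation = square root of the variance

print(data_range, round(variance, 2), round(std_dev, 2))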
Univariate analysis involves the examination across cases of one variable at a time. Whenever we deal with data on two or more variables, we are said to have a bivariate or multivariate population. Such situations usually arise when we wish to know the relation of two or more variables in the data with one another.
We use inferential statistics to try to infer from the sample data what the population thinks. We use
inferential statistics to make inferences from our data to more general conditions; we use descriptive
statistics simply to describe what's going on in our data. Inferential statistics: Statistics used to make
inferences or judgment about a population on the basis of sample information. We have to answer two
types of questions in bivariate or multivariate analysis:
1. Does there exist association or correlation between the two (or more) variables?
If yes, of what degree?
2. Is there any cause-and-effect relationship between two variables in case of
bivariate population or between one variable on one side and two or more
variables on the other side in case of multivariate population? If yes, of what
degree and in which direction?
The first question can be answered by the use of correlation technique and the second question by the
technique of regression.
Correlation: the most commonly used relational statistic is correlation, a measure of the strength of the relationship between two variables, not of causality. Interpretation of a correlation coefficient does not allow even the slightest hint of causality. The most a researcher can say is that the variables share something in common; that is, they are related in some way. The more two things have in common, the more strongly they are related. There can also be negative relations, but the important quality of a correlation coefficient is not its sign but its absolute value. A correlation of -0.58 is stronger than a correlation of 0.43, even though the former relationship is negative.
In economic theory and business studies, relationships between various variables are studied. Correlation analysis helps in deriving precisely the degree and direction of such relationships, and prediction based on correlation analysis is more reliable and nearer to reality.
Unlike regression, correlation does not care which variable is the independent one and which is the dependent one; therefore, you cannot infer causality. Researchers usually report the names of the variables in such statements rather than just saying "one variable". A correlation coefficient at zero, or close to zero, indicates no linear relationship.
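As an illustrative sketch only (the paired figures are invented), the Pearson correlation coefficient can be computed directly from its definition:

import math

# Hypothetical paired observations, e.g. advertising spend (x) and sales (y)
x = [2, 4, 6, 8, 10, 12]
y = [30, 45, 55, 70, 78, 95]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Pearson r = covariance(x, y) / (spread of x * spread of y)
cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sx = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
sy = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))

r = cov / (sx * sy)
print(round(r, 3))   # a value close to +1 indicates a strong positive linear relationship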
In simple regression, we have only two variables: one variable (defined as independent) is taken as the cause of the behavior of another (defined as the dependent variable). When there are two or more independent variables, the analysis concerning the relationship is known as multiple regression, and the equation describing such a relationship is the multiple regression equation. This analysis is adopted when the researcher has one dependent variable, which is presumed to be a function of two or more independent variables. The objective of this analysis is to make a prediction about the dependent variable based on its covariance with all the concerned independent variables. If there is more than one dependent variable, multivariate analysis or structural equation modeling (SEM) or related techniques are needed.
Regression is the closest thing to estimating causality in data analysis, because it predicts how well the numbers fit a projected straight line. Regression is one of the most important statistical tools and is extensively used in almost all sciences: natural, social and physical. It is especially used in business and economics to study the relationship between two or more variables that are related causally, and for the estimation of demand and supply curves, cost functions, production and consumption functions, etc. Regression is also very useful in model building.
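A minimal sketch (with the same invented figures as above) of fitting a simple regression line y = a + bx by ordinary least squares:

# Hypothetical data: x = advertising spend, y = sales (illustrative only)
x = [2, 4, 6, 8, 10, 12]
y = [30, 45, 55, 70, 78, 95]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares estimates: slope b = S_xy / S_xx, intercept a = mean_y - b * mean_x
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)

b = s_xy / s_xx            # predicted change in y per unit change in x
a = mean_y - b * mean_x

print(f"fitted line: y = {a:.2f} + {b:.2f}x")
print("prediction at x = 14:", round(a + b * 14, 1))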
Qualitative analysis is the analysis of qualitative data such as text data from interview transcripts and open-
ended questions. Unlike quantitative analysis, which is statistics driven and independent of the researcher,
qualitative analysis is heavily dependent on the researcher‘s analytic and integrative skills and personal
knowledge of the social context where the data is collected. The emphasis in qualitative analysis is ―sense
making‖ or understanding a phenomenon, rather than predicting or explaining. A creative and investigative
mindset is needed for qualitative analysis, based on an ethically enlightened and participant-in-context
attitude, and a set of analytic strategies.
After data collection and analysis, a researcher has to accomplish the task of drawing inferences, followed by report writing. Interpretation has to be done carefully so that misleading conclusions are not drawn and the whole purpose of doing research is not vitiated. Through interpretation, the researcher can expose the relations and processes that underlie his/her findings. All the analytical information and consequential inferences may well be communicated, preferably through a research report, to the users of the research results, who may be individuals, groups, or public or private organizations. Accordingly, in the 8th chapter, issues including the meaning of and rationale for interpretation, techniques of interpretation, precautions in interpretation, the significance of report writing, steps in writing a report, the layout of a research report, and precautions for writing a research report are discussed in detail.
Interpretation refers to the task of drawing inferences from the collected facts. In business research, the interpretation process explains the meaning of the analyzed data. After the statistical analysis of the data, inferences and conclusions about their meaning are developed. A distinction can be made between analysis and interpretation: interpretation is drawing inferences from the analysis results. Inferences drawn from interpretations lead to managerial implications. In other words, each statistical analysis produces results that are interpreted with respect to insight into a particular decision.
The task of interpretation has two major aspects: (i) the effort to establish continuity in research by linking the results of a given study with those of another, and (ii) the establishment of some explanatory concepts. In one sense, interpretation is concerned with relationships within the collected data, partially overlapping analysis. Interpretation also extends beyond the data of the study to include the results of other research, theory, and hypotheses. Thus, interpretation is the device through which the factors that seem to explain what the researcher has observed in the course of the study can be better understood, and it also provides a theoretical conception which can serve as a guide for further research.
TECHNIQUE OF INTERPRETATION
The task of interpretation is not an easy job; rather, it requires great skill and dexterity on the part of the researcher. Interpretation is an art that one learns through practice and experience. The researcher may, at times, seek guidance from experts in accomplishing the task of interpretation. There are no fixed rules to guide the researcher on how to interpret the data; however, the following suggested steps could be helpful:
1. Researcher must give reasonable explanations of the relations, which he has found,
and he must interpret the lines of relationship in terms of the underlying processes and
must try to find out the thread of uniformity that lies under the surface layer of his
diversified research findings. In fact, this is the technique of how generalization should
be done and concepts be formulated.
2. Extraneous information if collected during the study must be considered while
interpreting the results of research study, for it may prove to be a key factor in
understanding the problem.
3. It is advisable, before embarking upon final interpretation, to consult someone having
insight into the study and who is frank and honest and will not hesitate to point out
omissions and errors in logical argumentation. Such a consultation will result in
correct interpretation and, thus, will enhance the utility of research results.
4. Researcher must accomplish the task of interpretation only after considering all relevant factors affecting the problem to avoid false generalization. He must be in no hurry while interpreting results, for quite often the conclusions that appear to be all right at the beginning may not prove to be accurate on closer examination.
Anybody who reads the research report must be told enough about the study so that he can place it in its general scientific context, judge the adequacy of its methods, and thus form an opinion of how seriously the findings are to be taken. For this purpose, a proper layout of the report is needed. The word report comes from the Latin for 'to carry': RE + PORT = to carry information again. A research report is a document giving summarized and interpretive information on the research done, based on factual data, together with observations about the procedures used by the researchers. The layout of the report refers to what the research report should contain.
NB: The layout/components of the research report may differ across universities, colleges, and departments; it is advisable to follow your university, college, or department guideline. Moreover, do not forget to convert future tense into past tense, especially in the proposal parts.
Generally, a comprehensive layout of the research report should comprise (A) preliminary pages
(B) the main parts and (C) Appended parts/ end matter. Let us deal with them separately.
In preliminary pages, the report should carry a title page, followed by acknowledgements and
abbreviations. Then there should be a table of contents followed by list of tables, illustrations and abstract
so that the decision-maker or anybody interested in reading the report can easily locate the required
information in the report.
2. MAIN PARTS
The main part provides the complete outline of the research report along with all details. Each main section
of the report should begin on a new page. The main parts of the report should have the following sections:
1) Introduction
2) Literature review
3) Research methodology
4) Data presentation, analysis and interpretation
5) Conclusions and recommendations
3. END MATTER (APPENDIX)
The appendix, which comes last, is the appropriate place for other materials that substantiate the text of the report.
Components of a Research Report
1. PREFATORY/PRELIMINARIES
i. Title page
Title of the Research
(A Case study of)
Purpose why the Research is conducted
Name and Address of the investigator
Advisor/Reader
Month and Place where the research is written
ii. Acknowledgement
iii. Abbreviations and acronyms: abbreviations alphabetically
iv. Table of contents
v. List of tables
CHAPTER ONE
Introduction
1.1. Background of the study –Deductive order
Global issues and trends about the topic
Situations in Less Developed Countries or in an industry
National level
Firm/Regional level
1.2 Statement of the Problem or (Justification of the study)
Facts that motivated the investigator to conduct the research
Exactly specifying and measuring the gap
Hard facts or quantitative data about the topic for some previous years, for
example three years
1.3 Research Questions: the specific questions the study seeks to answer
1.4 Research Objectives – Ends met by conducting the research
1.4.1 General objective
often one statement directly related to the topic or title of the research
1.4.2 Specific Objectives
What the researcher wanted to achieve
About what s/he collected data
What was analyzed and compared
Often 4-7 in number
CHAPTER TWO
Review of Related Literature
CONSIDERATIONS
CHAPTER FOUR
REFERENCE / BIBLIOGRAPHY
⚫ You must give references to all the information that you obtain from
books, journals, and other sources.
BIBLIOGRAPHY/REFERENCE CITATION
A. Book with a Single Author:
Fleming, T. (1997) Liberty! The American Revolution. New York: Viking.
o Important Elements: Author, date of publication, title of the book, place
of publication, publisher.
B. BOOK WITH TWO OR THREE AUTHORS:
Schwartz, D., Ryan, S., & Westbrook, F. (1995) The Encyclopedia of TV game shows. New York: Facts on
File.
o Note: the commas and full stops in between the authors!
C. BOOK WITH MORE THAN THREE AUTHORS:
Azfar, O. et al. (1999) Decentralization, Governance and Public Services: the Impact of Institutional
Arrangements: A Review of Literature. IRIS Centre: Maryland University Press.
D. ARTICLE WITHIN A BOOK:
Adhana H. (1994) ―Mutation of Statehood and Contemporary Politics‖, in Abebe Z. and S. Abera (eds.)
Ethiopia in Change: Peasantry, Nationalism and Democracy, pp. 12-29. London: British Academic Press.
⚫ Important Elements: Author of the article, date of publication, title of the article,
editor(s) of the book, title of the book, page numbers of the article, place of
publication, and publisher.
E. ARTICLES FROM A PRINTED JOURNAL:
Abbink, J. (1997) ―Ethnicity and Constitutionalism in Contemporary Ethiopia‖, Journal of African Law
41(2): 159-174.
⚫ Important Elements: Author of the article, date of publication, title of the article,
title of journal, volume and issue number of the journal, page numbers of the article.
F. ARTICLE FROM A PRINTED NEWSPAPER:
Holden, S. (1998, May 16) Frank Sinatra dies at 82: Matchless stylist of pop. The New York Times, pp.
A1, A22-A23.
Often the researcher is asked to make an oral presentation of the research process and findings, which is also called a 'briefing'. This presentation exercise is unique for the following reasons:
A small group is to be addressed
Statistical Tables constitute major aspect of the topic
The audience is a core group interested in learning, knowing, analyzing and evaluating.
Presentation is normally followed by questions and answers.
The speaking time may vary from 10 to 20 minutes, or from 20 minutes to 1 hour 30 minutes.
Preparation: The presenter has to carefully jot down an outline of the critical aspects of the research study. While preparing for the presentation, the presenter has to bear in mind: (a) the purpose of the presentation (for instance, is it to inform about the problem, to solve the problem, or to give conclusions and recommendations?) and (b) the time given for the presentation. The oral presentation should cover the following major points:
Opening remarks to explain the nature of the project, the problem identified, and how it was processed to solve it;
Findings and conclusions should be the basis of the presentation; they must be brief and comprehensive; and
Presentation of recommendations, which must be relevant to the conclusions and findings stated earlier.
There are mainly three types of presentations: (a) Memorized speech. As a matter of fact, this is not a preferable method of presentation; it is highly self-centered or speaker-centered. (b) Reading a manuscript. This is also not advisable because over time it becomes dull and lifeless and fails to evoke interest in the audience. (c) Extemporaneous presentation. This is an oral presentation based on minimal notes or an outline of the subject matter. Such a speech appears natural,
conversational and flexible. It is the best choice in an organizational setting. The outline or the important points to be delivered can be noted on cards of 5 x 8 inches or 3 x 5 inches.
If the audience requires, and/or the occasion demands, audio-visual aids can also be used to good effect. The choice of visual aids depends upon several factors, as there are a number of lecture aids such as chalkboards, whiteboards, handouts, flip charts, overhead transparencies, 35 mm slides and computer-drawn visuals.
CHAPTER ONE
1. INTRODUCTION
In the modern world of computers and information technology, the importance of statistics is very well recognized by all disciplines. Statistics originated as a science of statehood and found application slowly and steadily in agriculture, economics, commerce, management, biology, medicine, industry, planning, education and so on. To date there is no walk of human life where statistics cannot be applied. The words 'Statistics' and 'Statistical' are derived from the Latin word status, meaning a political state.
Statistics has been defined differently by different authors over time. In olden days statistics was confined only to state affairs, but in modern days it embraces almost every sphere of human activity. Therefore, a number of old definitions, which were confined to a narrow field of enquiry, were replaced by newer definitions that are much more comprehensive and exhaustive. Secondly, statistics has been defined in two different senses: statistical data and statistical methods. The following are some of the definitions of statistics as numerical data.
Statistics are the classified facts representing the conditions of people in a state. In
particular they are the facts, which can be stated in numbers or in tables of numbers or in
any tabular or classified arrangement.
Statistics are measurements, enumerations or estimates of natural phenomena, usually systematically arranged, analyzed and presented so as to exhibit important interrelationships among them.
Statistics are affected by a number of factors: For example, sale of a product depends
on a number of factors such as its price, quality, competition, the income of consumers,
and so on.
Statistics must be reasonably accurate: if wrong figures are analyzed, they will lead to erroneous conclusions. Hence, it is necessary that conclusions be based on accurate figures.
Statistics must be collected in a systematic manner: if data are collected in a disorganized manner, they will not be reliable and will lead to misleading conclusions.
Finally, statistics should be placed in relation to each other: if one collects data unrelated to each other, such data will be confusing and will not lead to any logical conclusions. Data should be comparable over time and space.
Definition by Lovett: statistics is a science that deals with collection, classification and
tabulation of numerical facts as a basis of the explanation, description and comparison of
phenomena.
1.2. Importance of Statistics in Business
There is an increasing realization of the importance of statistics in various quarters. This is
reflected in the increasing use of statistics in the government, industry, business, agriculture,
mining, transport, education, medicine and so on. As we are concerned with the use of statistics
in business and industry here, description given below is confined to these areas only. There are
three major functions in any business enterprise in which statistical methods are useful.
The planning functions: This may relate to either special projects or to the recurring
activities of the firm over specified period.
The setting up standards: This may relate to the size of employment, volume of sales,
fixation of quality norms for the manufactured products, norms for daily output, and so
forth.
The function of control: This involves comparison of the actual production achieved against the norm or target set earlier. In case production has fallen short of the target, remedial measures are suggested so that such a deficiency does not occur again.
1.3. Types of Statistics
Statisticians commonly classify the subject into two broad categories: descriptive statistics and inferential statistics.
Descriptive statistics: As the name suggests, descriptive statistics includes any treatment designed to describe or summarize the given data, bringing out their important features. Such statistics do not go beyond this; that is, no attempt is made to infer anything that pertains to more than the data themselves. Descriptive statistics describe the data set being analyzed but do not allow us to draw any conclusions or make any inference beyond the data. Example: Arba Minch University graduated 4,000 students in 2009, 4,500 in 2010 and 5,200 in 2011; such a statement belongs to the domain of descriptive statistics.
Inferential statistics: This is a set of methods used to generalize from a sample to a population, i.e., to draw conclusions or inferences about characteristics of populations based on data from a sample. Example: the average per capita income of the whole Ethiopian population may be estimated at, say, $1,000 from figures obtained from a few hundred people (the sample); see the sketch below. A statistical population is the collection of all possible observations of a specified characteristic of interest.
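A minimal sketch of the contrast (not part of the original text; the sample incomes are invented and the 95% interval uses a simple normal approximation):

import math
import statistics

# Hypothetical sample of annual incomes (in USD) drawn from the population
sample = [850, 1020, 990, 1100, 870, 950, 1230, 1080, 910, 1000]

# Descriptive statistics: summarize the sample itself
mean = statistics.mean(sample)
sd = statistics.stdev(sample)          # sample standard deviation (n - 1 divisor)
print("sample mean:", round(mean, 1), "sample sd:", round(sd, 1))

# Inferential statistics: use the sample to estimate the population mean
n = len(sample)
se = sd / math.sqrt(n)                 # standard error of the mean
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"approximate 95% confidence interval for the population mean: ({low:.0f}, {high:.0f})")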
1.4. TYPES OF VARIABLES OR DATA
Variable: an item of interest that can take on many different numerical values. Variables can be categorized as continuous or discrete, or as quantitative or qualitative.
A Continuous Variable can take on any value along a continuum, e.g., age, money, time, height, weight. Consider, for example, that Olympic sprinters are timed to the nearest hundredth of a second, but if the Olympic judges wanted to clock them to the nearest millionth of a second, they could.
A Discrete Variable, on the other hand, is the result of a counting process; it is measured in whole units or categories, so discrete variables are not measured along a continuum.
A Quantitative Variable varies by amount. The variables are measured in numeric units,
and so both continuous and discrete variables can be quantitative.
For example, we can measure food intake in calories (a continuous variable) or we can
count the number of pieces of food consumed (a discrete variable). In both cases, the
variables are measured by amount (in numeric units).
A Qualitative Variable, on the other hand, varies by class. Such variables are often labels for the behaviors we observe, so only discrete variables can fall into this category. For example, socioeconomic class (working class, middle class, upper class) is discrete and qualitative; so are many mental disorders such as depression (unipolar, bipolar) or drug use (none, experimental, abusive).
Qualitative variables are non-numeric variables and can‘t be measured. Examples include
gender, religious affiliation and state of birth.
SCOPE OF STATISTICS
Apart from the methods comprising the scope of descriptive and inferential branches of statistics,
statistics also consists of methods of dealing with a few other issues of specific nature. Since
these methods are essentially descriptive in nature, they have been discussed here as part of the
descriptive statistics. These are mainly concerned with the following:
(i) It often becomes necessary to examine how two paired data sets are related. For example, we
may have data on the sales of a product and the expenditure incurred on its advertisement for a
specified number of years. Given that sales and advertisement expenditure are related to each
other, it is useful to examine the nature of relationship between the two and quantify the degree
of that relationship. As this requires the use of appropriate statistical methods, it falls under the purview of what we call regression and correlation analysis.
(ii) Situations occur quite often when we require averaging (or totaling) of data on prices and/or
quantities expressed in different units of measurement. For example, price of cloth may be
quoted per meter of length and that of wheat per kilogram of weight. Since ordinary methods of
totaling and averaging do not apply to such price/quantity data, special techniques needed for the
purpose are developed under index numbers.
(iii) Many a time, it becomes necessary to examine the past performance of an activity with a
view to determining its future behaviour. For example, when engaged in the production of a
commodity, monthly product sales are an important measure of evaluating performance. This
requires compilation and analysis of relevant sales data over time. The more complex the
activity, the more varied the data requirements. For profit maximizing and future sales planning,
forecast of likely sales growth rate is crucial. This needs careful collection and analysis of past
sales data. All such concerns are taken care of under time series analysis.
(iv) Obtaining the most likely future estimates on any aspect(s) relating to a business or
economic activity has indeed been engaging the minds of all concerned. This is particularly
important when it relates to product sales and demand, which serve the necessary basis of
production scheduling and planning. The regression, correlation, and time series analyses
together help develop the basic methodology to do the needful. Thus, the study of methods and
techniques of obtaining the likely estimates on business/economic variables comprises the scope
of what we do under business forecasting. Keeping in view the importance of inferential
statistics, the scope of statistics may finally be restated as consisting of statistical methods which
facilitate decision-making under conditions of uncertainty. While the term statistical methods is often used to cover the subject of statistics as a whole, in particular it refers to the methods by which statistical data are analyzed and interpreted and inferences are drawn for decision making.
Though generic in nature and versatile in their applications, statistical methods have come to be
widely used, especially in all matters concerning business and economics. These are also being
increasingly used in biology, medicine, agriculture, psychology, and education. The scope of
application of these methods has started opening and expanding in a number of social science
disciplines as well. Even a political scientist finds them of increasing relevance for examining political behaviour, and it is, of course, no surprise to find historians using statistical data, for history is essentially past data presented in a certain factual format.
Statistics is widely used in many industries. In industries, control charts are widely used to
maintain a certain quality level. In production engineering, to find whether the product is
conforming to specifications or not, statistical tools, namely inspection plans, control charts, etc.,
are of extreme importance.
Statistics is the lifeblood of successful commerce. No businessman can afford to lose either by understocking or by overstocking his goods. In the beginning he estimates the demand for his goods and then takes steps to adjust his output or purchases accordingly.
Analysis of variance (ANOVA), a statistical technique developed by Professor R.A. Fisher, plays a prominent role in agricultural experiments. In tests of significance based on small samples, the t-statistic is adequate for testing the significance of the difference between two sample means. In analysis of variance, by contrast, we are concerned with testing the equality of several population means. For example, suppose five fertilizers are applied to five plots each of wheat and the yields of wheat on each of the plots are recorded. In such a situation, we are interested in finding out whether the effect of these fertilizers on the yield is significantly different or not. The answer to this problem is provided by the technique of ANOVA, which is used to test the homogeneity of several population means.
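As an illustrative sketch of the fertilizer idea (the yield figures, the number of fertilizers and the use of the SciPy library are assumptions, not from the original text):

from scipy import stats

# Hypothetical wheat yields (quintals per plot) for three fertilizers, five plots each
fertilizer_a = [20, 22, 19, 24, 23]
fertilizer_b = [28, 30, 27, 26, 29]
fertilizer_c = [21, 23, 22, 20, 24]

# One-way ANOVA: the null hypothesis is that all population means are equal
f_stat, p_value = stats.f_oneway(fertilizer_a, fertilizer_b, fertilizer_c)

print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (for example, below 0.05) suggests the fertilizer means are not all equal.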
Nowadays statistics is used abundantly in almost any economic study. As Alfred Marshall observed, statistical data and statistical techniques are immensely useful in solving many economic problems such as wages, prices, production, distribution of income and wealth, and so on. Statistical tools like index numbers, time series analysis, estimation theory and the testing of statistical hypotheses are extensively used in economics.
Statistics is widely used in education. Research has become a common feature in all branches of activity, and statistics is necessary for the formulation of policies to start new courses, for assessing the facilities available for new courses, and so on.
In order to achieve the above goals, the statistical data relating to production, consumption,
demand, supply, prices, investments, income expenditure etc. and various advanced statistical
techniques for processing, analyzing and interpreting such complex data are of importance.
In Medical sciences, statistical tools are widely used. In order to test the efficiency of a new drug
or medicine, t-test is used or to compare the efficiency of two drugs or two medicines, t-test for
the two samples is used. More and more applications of statistics are at present used in clinical
investigation.
Recent developments in the fields of computer technology and information technology have
enabled statistics to integrate their models and thus make statistics a part of decision-making
procedures of many organizations. There are so many software packages available for solving
design of experiments, forecasting simulation problems etc.
The preceding discussion has highlighted the importance of statistics in business, but it should not lead anyone to conclude that statistics is free from limitations. Statistics has a number of limitations:
1. Statistics has no place in all such cases where quantification is not possible. For example,
beauty, intelligence, courage cannot be quantified.
2. Statistics reveals average behavior, the normal or general trend. Applying the 'average' concept to an individual or a particular situation may lead to a wrong conclusion and can sometimes be disastrous.
3. Since statistics are collected for a particular purpose, such data may not be relevant or
useful in other situations.
4. Statistics is not 100% precise as is mathematics or accountancy.
5. In statistical surveys, sampling is generally used as it is not physically possible to cover
all the units comprising the universe. The results may not be appropriate as far as the
universe is concerned.
Chapter Two
Statistical investigation is comprehensive: it requires systematic collection of data about some group of people or objects, describing and organizing the data, analyzing the data with the help of different statistical methods, summarizing the analysis, and using the results for making judgments, decisions and predictions. When we talk of collection of data, we should be clear about what the word 'data' means. The word datum is a Latin word meaning 'something given'; it denotes a piece of information, which can be either quantitative or qualitative. The term data is the plural of datum and means facts and statistics collected together for reference or analysis.
1.7. Nature of data
It may be noted that different types of data can be collected for different purposes. The data can
be collected in connection with time or geographical location or in connection with time and
location. The following are the three types of data:
1. Time series data,
2. Spatial data and
3. Spacio-temporal data.
i. Time series data
It is a collection of a set of numerical values, collected over a period of time. The data might
have been collected either at regular intervals of time or at irregular intervals of time. Example: the following are the data for three types of expenditure (in Birr) for a family over the four years 2001, 2002, 2003 and 2004.
Year Food Education Others Total
2001 2000 1000 1000 4000
2002 2500 1500 1500 5500
2003 3000 2000 1500 6500
2004 3500 1500 2500 7500
ii. Spatial Data:
If the data collected are connected with a place, they are termed spatial data. Example: assume the following populations of towns in the Southern Nations, Nationalities, and Peoples' Region of Ethiopia in 2006.
City/Town        Population
Arba Minch 1,000,000
Wolayita Sodo 1,586,000
Hawassa 2,000,000
Butajira 1,250,000
iii. Spacio-Temporal Data:
If the data collected are connected to time as well as place, they are known as spacio-temporal data. Example: assume the following populations of the same towns in 2006 and 2007.
City/Town        Population 2006    Population 2007
Arba Minch 1,000,000 1,150,000
Wolayita Sodo 1,586,000 1,690,000
Hawassa 2,000,000 2,200,000
Butajira 1,250,000 1,320,000
Data are the values that variables can take, which may be either numerical or categorical. Levels of measurement are commonly classified into nominal, ordinal, interval and ratio scales.
2. Ordinal scale: an ordinal scale ranks observations; it implies a statement of 'greater than' or 'less than' without our being able to state how much greater or less. The real difference between ranks 1 and 2 may be more or less than the difference between ranks 5 and 6. Since the numbers of this scale have only a rank meaning, the appropriate measure of central tendency is the median, and measures of statistical significance are restricted to non-parametric methods.
3. Interval scale: In the case of interval scale, the intervals are adjusted in terms of some
rule that has been established as a basis for making the units equal. The units are equal
only in so far as one accepts the assumptions on which the rule is based. Interval scales
can have an arbitrary zero, but it is not possible to determine for them what may be
called an absolute zero or the unique origin. The primary limitation of the interval scale is
the lack of a true zero; it does not have the capacity to measure the complete absence of a
trait or characteristic. The Fahrenheit scale is an example of an interval scale and shows
similarities in what one can and cannot do with it. One can say that an increase in
temperature from 30° to 40° involves the same increase in temperature as an increase
from 60° to 70°, but one cannot say that the temperature of 60° is twice as warm as the
temperature of 30° because both numbers are dependent on the fact that the zero on the
scale is set arbitrarily at the temperature of the freezing point of water. Interval scales
provide more powerful measurement than ordinal scales, for the interval scale also incorporates the concept of equality of intervals. The mean is the appropriate measure of central tendency, while the standard deviation is the most widely used measure of dispersion. For tests of statistical significance, the 't' test and 'F' test are widely applied.
4. Ratio scale: Ratio scales have an absolute or true zero of measurement. The term
‗absolute zero‘ is not as precise as it was once believed to be. We can conceive of an
absolute zero of length and similarly we can conceive of an absolute zero of time. For
example, the zero point on a centimeter scale indicates the complete absence of length or
height. But an absolute zero of temperature is theoretically unobtainable and it remains a
concept existing only in the scientist‘s mind. The number of minor traffic-rule violations
and the number of incorrect letters in a page of type script represent scores on ratio
scales. Both these scales have absolute zeros and as such all minor traffic violations and
all typing errors can be assumed to be equal in significance. With ratio scales involved
one can make statements like ―Abie‘s‖ typing performance was twice as good as that of
―Kebie.‖ The ratio involved does have significance and facilitates a kind of comparison
which is not possible in case of an interval scale. Ratio scale represents the actual
amounts of variables. Measures of physical dimensions such as weight, height, distance,
etc. are examples. Generally, all statistical techniques are usable with ratio scales and all
manipulations that one can carry out with real numbers can also be carried out with ratio
scale values. Multiplication and division can be used with this scale but not with other
scales mentioned above. Geometric and harmonic means can be used as measures of
central tendency and coefficients of variation may also be calculated.
1.7.2. Source of Data
Any statistical data can be classified under two categories depending upon the sources utilized.
These categories are:
Primary data
Secondary data
1.7.2.1. Primary data:
Primary data is the one, which is collected by the investigator himself for the purpose of a
specific inquiry or study. Such data is original in character and is generated by survey conducted
by individuals or research institution or any organization. Primary data can be collected through
I. Direct personal interviews: the persons from whom information is collected are known as informants. The investigator personally meets them and asks questions to gather the necessary information. This is a suitable method for intensive rather than extensive field surveys, and it suits best an intensive study of a limited field.
Merits:
The wordings in one or more questions can be altered to suit any informant.
Inconvenience and misinterpretations are thereby avoided.
Limitations:
IV. Mailed questionnaire method: Under this method a list of questions is prepared and is
sent to all the informants by post. The list of questions is technically called questionnaire.
A covering letter accompanying the questionnaire explains the purpose of the
investigation and the importance of correct information and requests the informants to fill
in the blank spaces provided and to return the form within a specified time.
The merits of the mailed questionnaire method: it is relatively cheap, and it is preferable when the informants are spread over a wide area.
The limitations of the mailed questionnaire method: the informants should be literate and able to understand and reply to the questions; it is possible that some of the persons who receive the questionnaires do not return them; and it is difficult to verify the correctness of the information furnished by the respondents.
Merits:
Limitations:
1.7.2.2. Secondary Data:
Secondary data are those data which have been already collected and analyzed by some earlier
agency for its own use; and later the same data are used by a different agency.
The sources of secondary data can broadly be classified under two heads:
Tabulation is the process of summarizing classified or grouped data in the form of a table so that
it is easily understood and an investigator is quickly able to locate the desired information. A
table is a systematic arrangement of classified data in columns and rows.
Thus, a statistical table makes it possible for the investigator to present a huge mass of data in a
detailed and orderly form. It facilitates comparison and often reveals certain patterns in data
which are otherwise not obvious. ‗Classification‘ and ‗Tabulation‘, as a matter of fact, are not
two distinct processes. Actually, they go together. Before tabulation data are classified and then
displayed under different columns and rows of a table.
Advantages of Tabulation
It simplifies complex data and the data presented are easily understood.
It facilitates comparison of related facts, computation of various statistical measures like
averages, dispersion, correlation etc.
It presents facts in minimum possible space and unnecessary repetitions and explanations
are avoided. Moreover, the needed information can be easily located.
Tabulated data are good for references and they make it easier to present the information
in the form of graphs and diagrams.
Preparing a Table
The making of a compact table is itself an art: it should contain all the information needed within the smallest possible space. The purpose of the tabulation and how the tabulated information is to be used are the main points to be kept in mind while preparing a statistical table. An ideal table should consist of the following main parts:
i. Table Number: A table should be numbered for easy reference and identification.
ii. Title of the Table: A good table should have a clearly worded, brief but unambiguous
title explaining the nature of data contained in the table. It should also state arrangement
of data and the period covered.
iii. Captions or Column Headings: captions in a table stand for brief and self-explanatory headings of the vertical columns. Captions may involve headings and sub-headings as well. The unit of the data contained should also be given for each column.
iv. Stubs or Row Designations: stubs stand for brief and self-explanatory headings of the horizontal rows. A variable with a large number of classes is usually represented in rows. For example, rows may stand for score classes and columns for data related to the sex of students. In that case, there will be many rows for the score classes but only two columns, for male and female students.
v. Body: The body of the table contains the numerical information of frequency of
observations in the different cells. This arrangement of data is according to the
description of captions and stubs.
vi. Footnotes: Footnotes are given at the foot of the table for explanation of any fact or
information included in the table which needs some explanation. Thus, they are meant for
explaining or providing further details about the data that have not been covered in title,
captions and stubs.
vii. Sources of data: Lastly one should also mention the source of information from which
data are taken. This may preferably include the name of the author, volume, page and the
year of publication.
Type of Tables:
Tables can be classified according to their purpose, stage of enquiry, nature of data or number of
characteristics used. On the basis of the number of characteristics, tables may be classified as
follows: Simple or One-Way Table, Two-Way Table, and Manifold Table.
A simple or one-way table is the simplest table and contains data on one characteristic only. A simple table is easy to construct and simple to follow. For example, a one-way table may be used to show the number of adults in different occupations in a locality.
2. Two-way Table:
A table which contains data on two characteristics is called a two-way table. In such a case, either the stub or the caption is divided into two coordinate parts. In the table below, for example, the caption is further divided in respect of 'sex'; the two-way table now contains two characteristics, namely occupation and sex.
Number of adults
Occupation Male Female Total
Farmer 200 30 230
Students 100 50 150
Total 300 80 380
3. Manifold Table:
Thus, more and more complex tables can be formed by including other characteristics. For
example, we may further classify the caption sub-headings in the above table in respect of
―marital status‖, ―religion‖ and ―socio-economic status‖ etc. A table, which has more than two
characteristics of data, is considered as a manifold table. For instance, the table below shows
three characteristics namely, occupation, sex and marital status.
Number of adults
Occupation Male Female Total
Married Unmarried Total Married Unmarried Total
Farmer 150 50 200 20 10 30 230
Student 10 90 100 5 45 50 150
Total 160 140 300 25 55 80 380
Manifold tables, though complex, are useful in practice as they enable full information to be incorporated and facilitate analysis of all related facts. Still, as a normal practice, not more than four characteristics should be represented in one table, to avoid confusion. Other related tables may be formed to show the remaining characteristics.
A frequency distribution is a series in which a number of observations with similar or closely related values are put in separate bunches or groups, each group being in order of magnitude. It is simply a table in which the data are grouped into classes and the number of cases falling in each class is recorded. It shows the frequency of occurrence of different values of a single phenomenon. A frequency distribution is constructed mainly:
To estimate frequencies of the unknown population distribution from the distribution of sample data, and
To facilitate the computation of various statistical measures.
1.8.2. Raw data
The statistical data collected are generally raw data or ungrouped data. Let us consider the daily
wages (in birr) of 30 laborers in a factory.
80 70 55 50 60 65 40 30 80 90
75 45 35 65 70 80 82 55 65 80
60 55 38 65 75 85 90 65 45 75
The above figures are nothing but raw or ungrouped data; they are recorded as they occur, without any pre-arrangement. This presentation of the data does not furnish any useful information and is rather confusing to the mind. A better way is to express the figures in ascending or descending order of magnitude, commonly known as an array. But this does not reduce the bulk of the data. The above data, when formed into an array, take the following form:
30 35 38 40 45 45 50 55 55 55
60 60 65 65 65 65 65 70 70 75
75 75 80 80 80 80 82 85 90 90
The array helps us to see at once the maximum and minimum values. It also gives a rough idea of the distribution of the items over the range. When we have a large number of items, however, the formation of an array is very difficult, tedious and cumbersome. Condensation should therefore be directed at better understanding, and it may be done in two ways, depending on the nature of the data.
Discrete frequency distribution: in this form of distribution, the frequency refers to discrete values. Here the data are presented in a way that exact measurements of units are clearly indicated. There are definite differences between the variables of different groups of items; each class is distinct and separate from the other classes. Examples include facts like the number of rooms in a house, the number of companies registered in a country, the number of children in a family, and so on. The process of preparing this type of distribution is very simple. We have just to count the number of times a particular value
is repeated; this count is called the frequency of that value. To facilitate counting, prepare a column of tally marks. In another column, place all possible values of the variable from the lowest to the highest, then put a bar (vertical line) opposite the particular value to which each observation relates. To ease counting, bars are grouped in blocks of five with some space left between blocks. Finally, we count the number of bars against each value to get its frequency.
Example 1: In a survey of 40 families in a village, the number of children per family was
recorded and the following data obtained.
1 0 3 2 1 5 6 2
2 1 0 3 4 2 1 6
3 2 1 5 3 3 2 4
2 2 3 0 2 1 4 5
3 3 4 4 1 2 4 5
Solution:
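The tally-mark table for this example is not reproduced above; as an illustrative sketch (assuming Python is acceptable), the same discrete frequency distribution can be obtained with collections.Counter:

from collections import Counter

# Number of children per family for the 40 surveyed families (data from the example)
children = [1, 0, 3, 2, 1, 5, 6, 2,
            2, 1, 0, 3, 4, 2, 1, 6,
            3, 2, 1, 5, 3, 3, 2, 4,
            2, 2, 3, 0, 2, 1, 4, 5,
            3, 3, 4, 4, 1, 2, 4, 5]

freq = Counter(children)                       # maps each value to its frequency
for value in sorted(freq):
    print(value, freq[value])
print("total families:", sum(freq.values()))   # should be 40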
Continuous frequency distribution: in this form of distribution, the frequency refers to groups of values. This becomes necessary for variables which can take any fractional value and for which an exact point measurement is not practical. A discrete variable may also be presented in the form of a continuous (grouped) frequency distribution.
Nature of class
The following are some basic technical terms when a continuous frequency distribution is
formed or data are classified according to class intervals.
1) Class limits
The class limits are the lowest and the highest values that can be included in the class. For example, take the class 30-40: the lowest value of the class is 30 and the highest value is 40. The two boundaries of the class are known as the lower limit and the upper limit of the class. The lower limit of a class is the value below which there can be no item in the class; the upper limit of a class is the value above which there can be no item in that class. The way in which class limits are stated depends upon the nature of the data. In statistical calculations, the lower class limit is denoted by L and the upper class limit by U.
2) Class Interval:
The class interval may be defined as the size of each grouping of data. For example, 50-75, 75-
100, 100-125… are class intervals. Each grouping begins with the lower limit of a class interval
and ends at the lower limit of the next succeeding class interval.
Number of class intervals: the number of class intervals in a frequency distribution is a matter of importance; it should not be too large. For an ideal frequency distribution, the number of class intervals can vary from 5 to 15. To decide the number of class intervals for the whole data, we find the lowest and the highest values; the difference between them helps us decide on the class intervals. The number of class intervals can be fixed arbitrarily, keeping in view the nature of the problem under study, or it can be decided with the help of Sturges' rule, according to which the number of classes is
K = 1 + 3.322 log10 N
Thus, if the number of observations is 10, the number of class intervals is K = 1 + 3.322 log10 10 = 1 + 3.322 = 4.322, i.e., about 4.
3) Width or size of the class interval:
The difference between the lower- and upper-class limits is called Width or size of class interval
and is denoted by ‗C‘.
Size of the class interval: the size of the class interval is inversely proportional to the number of class intervals in a given distribution. The approximate value of the size (or width or magnitude) of the class interval 'C' is obtained using Sturges' rule as
C = Range / (1 + 3.322 log10 N) = (Largest value - Smallest value) / (1 + 3.322 log10 N)
Types of class intervals:
There are three methods of classifying the data according to class intervals namely
a) Exclusive method:
When the class intervals are fixed so that the upper limit of one class is the lower limit of the next class, the classification is known as the exclusive method. Classes such as 50-75, 75-100, 100-125 are classified on this basis.
b) Inclusive method:
In this method, the overlapping of class limits is avoided: both the lower and the upper limits are included in the class interval. This type of classification may be used for a grouped frequency distribution of a discrete variable, like members in a family or the number of workers in a factory, where the variable takes only integral values. It cannot be used with fractional values like age, height, weight, etc.
This method may be illustrated with classes such as 30-39, 40-49, 50-59, and so on.
c) Open-end classes:
A class limit is missing either at the lower end of the first class interval, at the upper end of the last class interval, or at both ends. The necessity of open-end classes arises in a number of practical situations, particularly relating to economic and medical data, when there are a few very high values or a few very low values that are far apart from the majority of the observations. An example of open-end classes is: 'below 10', 10-20, 20-30, 30-40, '40 and above'.
4) Range: the difference between the largest and smallest values of the observations is called the range and is denoted by 'R', i.e., R = Largest value - Smallest value (R = L - S).
5) Mid-value or mid-point:
The central point of a class interval is called the mid-value or mid-point. It is found by adding the upper and lower limits of a class and dividing the sum by 2:
Mid-value = (Lower limit + Upper limit) / 2
6) Frequency:
The number of observations falling within a particular class interval is called the frequency of that class. For example, one may consider the frequency distribution of the weights of persons working in a company.
The presentation of data in the form of a frequency distribution describes the basic pattern which the data assume in the mass. A frequency distribution gives a better picture of the pattern of the data when the number of items is large. If the identity of the individuals about whom particular information is taken is not relevant, the first step of condensation is to divide the observed range of the variable into a suitable number of class intervals and to record the number of observations in each class. Let us consider the weights in kg of 50 college students.
42 62 46 54 41 37 54 44 30 45
47 50 58 49 51 42 46 37 42 39
54 39 51 58 47 65 43 48 49 48
49 61 41 40 58 49 59 57 57 34
56 38 45 52 46 40 63 41 51 41
Here the size of the class interval, as per Sturges' rule, is obtained as follows:
C = Range / (1 + 3.322 log10 N) = (65 - 30) / (1 + 3.322 log10 50) = 35 / 6.64, which is approximately 5.
Thus, the number of class intervals is 7 and the size of each class is 5. The required frequency distribution is prepared using tally marks as given below (a computational sketch follows the table):
Class Interval    Frequency
30-35 2
35-40 6
40-45 12
45-50 14
50-55 6
55-60 6
60-65 4
Total 50
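A short computational sketch of the same grouping (the class boundaries and the handling of values falling on a boundary are assumptions, so counts on class limits may differ slightly from the table above):

import math

# Weights (kg) of the 50 college students from the example
weights = [42, 62, 46, 54, 41, 37, 54, 44, 30, 45,
           47, 50, 58, 49, 51, 42, 46, 37, 42, 39,
           54, 39, 51, 58, 47, 65, 43, 48, 49, 48,
           49, 61, 41, 40, 58, 49, 59, 57, 57, 34,
           56, 38, 45, 52, 46, 40, 63, 41, 51, 41]

n = len(weights)
print("Sturges' rule:", round(1 + 3.322 * math.log10(n), 2), "-> about 7 classes")

width = 5
top = max(weights)                        # 65
for lower in range(30, 65, width):        # classes 30-35, 35-40, ..., 60-65
    upper = lower + width
    # exclusive method: lower <= w < upper; the final class also admits the maximum value
    count = sum(1 for w in weights if lower <= w < upper or (upper == 65 and w == top))
    print(f"{lower}-{upper}: {count}")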
1.8.3. The Relative Frequency Distribution
When you are comparing two or more groups, knowing the proportion or percentage of the total that falls in each group is often more useful than knowing the frequency count of each group. For such situations, you create a relative frequency distribution or a percentage distribution instead of a frequency distribution. (If your groups have different sample sizes, you must use either a relative frequency distribution or a percentage distribution.)
The proportion, or relative frequency, is the number of values in each class divided by the total number of values; multiplied by 100, it gives the percentage distribution.
A cumulative frequency distribution keeps a running total of the frequencies. It is constructed by adding the frequency of the first class interval to the frequency of the second class interval, then adding that total to the frequency of the third class interval, and continuing until the final total appearing opposite the last class interval equals the total of all frequencies. The cumulation may be downward or upward. A downward cumulation results in a list presenting the number of observations 'less than' any given amount, as revealed by the lower limit of the succeeding class interval, while an upward cumulation results in a list presenting the number of observations 'more than' a given amount, as revealed by the upper limit of the preceding class interval. The table below illustrates these columns, and a short computational sketch follows it.
Income (in birr)   Number of families   Percentage   Cumulative frequency   Cumulative percentage
2000-4000 8 5.7% 8 5.7%
4000-6000 15 10.7% 23 16.4%
6000-8000 27 19.3% 50 35.7%
8000-10000 44 31.4% 94 67.1%
10000-12000 31 22.2% 125 89.3%
12000-14000 12 8.6% 137 97.9%
14000-20000 3 2.1% 140 100.0%
Total 140
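A brief sketch of how the percentage and cumulative columns above can be computed (class labels and counts taken from the table; Python is assumed for illustration):

from itertools import accumulate

classes = ["2000-4000", "4000-6000", "6000-8000", "8000-10000",
           "10000-12000", "12000-14000", "14000-20000"]
counts = [8, 15, 27, 44, 31, 12, 3]

total = sum(counts)                           # 140 families
percentages = [100 * c / total for c in counts]
cum_counts = list(accumulate(counts))         # running total of frequencies
cum_pcts = [100 * c / total for c in cum_counts]

for cls, c, p, cc, cp in zip(classes, counts, percentages, cum_counts, cum_pcts):
    print(f"{cls:12s} {c:3d} {p:5.1f}% {cc:4d} {cp:6.1f}%")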
1.9. Graphic Methods of Data Presentation
A graph is a visual form of presentation of statistical data. A graph is more attractive than a table of figures; even a layperson can understand the message of the data from a graph, and comparisons between two or more phenomena can be made very easily with its help. Here we shall discuss only some of the more popular and important types of graphs:
Histogram
Frequency Polygon
Ogive
Pie-Charts
Bar and Line Graphs
i. Line Diagram:
A line diagram is used where there are many items to be shown and there is not much difference in their values. Such a diagram is prepared by drawing a vertical line for each item according to the scale. The distance between the lines is kept uniform. A line diagram makes comparison easy, but it is less attractive.
No. of children 0 1 2 3 4 5
Frequency 10 14 9 6 4 2
ii. Pie Chart: pie charts are simple diagrams for displaying categorical or grouped data. These charts are
commonly used within industry to communicate simple ideas, for example market share. They
are used to show the proportions of a whole. They are best used when there are only a handful of
categories to display. A pie chart consists of a circle divided into segments, one segment for each
category. The size of each segment is determined by the relative frequency of the category and
measured by the angle of the segment.
Example 8: Draw a Pie charts /diagram for the following data of production of sugar in quintals
of various countries.
Country Production of
Sugar (in quintals)
Ethiopia 62,000,000
Kenya 47,000,000
Sudan 35,000,000
Djibouti 16,000,000
Egypt 6,000,000
The pie chart is constructed by first drawing a circle and then dividing it up into segments.
[Figure: pie chart showing the share of sugar production for Ethiopia, Kenya, Sudan, Djibouti and Egypt]
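A brief sketch of the underlying arithmetic: each segment's angle is that country's share of the total multiplied by 360 degrees (production figures taken from the table above):

# Sugar production in quintals (from the example table)
production = {
    "Ethiopia": 62_000_000,
    "Kenya": 47_000_000,
    "Sudan": 35_000_000,
    "Djibouti": 16_000_000,
    "Egypt": 6_000_000,
}

total = sum(production.values())
for country, qty in production.items():
    share = 100 * qty / total        # share of the whole, in percent
    angle = 360 * qty / total        # size of the pie segment in degrees
    print(f"{country:8s} {share:5.1f}%  {angle:6.1f} degrees")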
iii. Bar Chart: bar charts are a commonly used and clear way of presenting categorical data or any ungrouped discrete frequency observations.
Mode Frequency
Car 10
Walk 7
Bike 4
Bus 4
Metro 4
Train 1
Total 30
We can then present this information as a bar chart, by following the five-step process shown
below:
1. First decide what goes on each axis of the chart. By convention the variable being
measured goes on the horizontal (x–axis) and the frequency goes on the vertical (y–axis).
2. Next decide on a numeric scale for the frequency axis. This axis represents the frequency
in each category by its height. It must start at zero and include the largest frequency. It is
common to extend the axis slightly above the largest value so you are not drawing to the
edge of the graph.
3. Having decided on a range for the frequency axis, we need to decide on a suitable number scale to label this axis. This should have sensible values, for example 0, 1, 2, ... or 0, 10, 20, ..., or other such values as make sense given the data.
4. Draw the axes and label them appropriately.
5. Draw a bar for each category. When drawing the bars, it is essential to ensure the
following:
The width of each bar is the same;
The bars are separated from each other by equally sized gaps.
This bar chart clearly shows that the most popular mode of transport is the car and that the metro,
bus and cycling are all equally popular (in our small sample). Bar charts provide a simple
method of quickly spotting simple patterns of popularity within a discrete data set.
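A minimal sketch of drawing such a bar chart, assuming the matplotlib plotting library is available:

import matplotlib.pyplot as plt

# Mode-of-transport frequencies from the example table
modes = ["Car", "Walk", "Bike", "Bus", "Metro", "Train"]
freq = [10, 7, 4, 4, 4, 1]

plt.bar(modes, freq, width=0.6)        # separate bars of equal width
plt.xlabel("Mode of transport")        # the variable goes on the horizontal axis
plt.ylabel("Frequency")                # counts go on the vertical axis
plt.ylim(0, max(freq) + 1)             # start at zero, extend slightly above the largest value
plt.title("How students travel")       # illustrative title
plt.show()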
iv. Histogram
Bar charts have their limitations; for example, they cannot be used to present continuous data.
When dealing with continuous random variables a different kind of graph is required. This is
called a histogram. At first sight these look similar to bar charts. There are, however, two critical
differences:
The horizontal (x-axis) is a continuous scale. As a result of this there are no gaps between
the bars (unless there are no observations within a class interval);
The height of the rectangle is only proportional to the frequency if the class intervals are
all equal.
Producing a histogram is much like producing a bar chart and in many respects can be
considered to be the next stage after producing a grouped frequency table. In reality, it is often
best to produce a frequency table first which collects all the data together in an ordered format.
Once we have the frequency table, the process is very similar to drawing a bar chart.
Find the maximum frequency and draw the vertical (y–axis) from zero to this value,
including a sensible numeric scale.
The range of the horizontal (x–axis) needs to include not only the full range of
observations but also the full range of the class intervals from the frequency table.
Draw a bar for each group in your frequency table. These should be the same width and
touch each other (unless there are no data in one particular class).
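The construction steps can also be sketched in Python with matplotlib (assumed to be installed); the download-time data below are purely illustrative, since no data set accompanies this passage.

import matplotlib.pyplot as plt

times = [7.2, 8.1, 9.5, 6.8, 7.9, 8.4, 10.2, 7.5, 9.1, 8.8,
         6.5, 7.7, 8.2, 9.9, 7.1, 8.6, 9.3, 7.8, 8.0, 9.0]

# Equal-width class intervals; the bars touch because the x-axis is continuous.
plt.hist(times, bins=[6, 7, 8, 9, 10, 11], edgecolor="black")
plt.xlabel("Time (seconds)")
plt.ylabel("Frequency")
plt.title("Histogram of download times")
plt.show()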
v. Frequency Polygon
If we mark the midpoints of the top horizontal sides of the rectangles in a histogram and join them by straight line segments, the figure so formed is called a Frequency Polygon. This is done under
the assumption that the frequencies in a class interval are evenly distributed throughout the class.
The area of the polygon is equal to the area of the histogram, because the area left outside is just
equal to the area included in it.
Example: Draw a frequency polygon for the following data.
vi. Ogive
For a set of observations, we know how to construct a frequency distribution. In some cases, we may require the number of observations less than a given value or more than a given value. This is obtained by accumulating (adding) the frequencies up to (or above) the given value. This accumulated frequency is called cumulative frequency. The curve obtained by plotting cumulative frequencies is called a cumulative frequency curve or an Ogive. There are two methods of constructing an Ogive, namely the 'less than Ogive' method and the 'more than Ogive' method. In the less than Ogive method we start with the upper limits of the classes and go on adding the frequencies. When these frequencies are plotted, we get a rising curve. In the more than Ogive method, we start with the lower limits of the classes and from the total frequency we subtract the frequency of each class. When these frequencies are plotted, we get a declining curve.
Example: Draw the Ogives for the following data.
Solution:
Class interval   Frequency    Class limit   Less than Ogive   More than Ogive
20-30                4             20              0               110
30-40                6             30              4               106
40-50               13             40             10               100
50-60               25             50             23                87
60-70               32             60             48                62
70-80               19             70             80                30
80-90                8             80             99                11
90-100               3             90            107                 3
                                  100            110                 0
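A minimal Python sketch (not part of the original module) of the cumulative-frequency calculation behind the two Ogives in the table above:

from itertools import accumulate

frequencies  = [4, 6, 13, 25, 32, 19, 8, 3]            # classes 20-30, 30-40, ..., 90-100
upper_limits = [30, 40, 50, 60, 70, 80, 90, 100]
lower_limits = [20, 30, 40, 50, 60, 70, 80, 90]

less_than = list(accumulate(frequencies))               # 4, 10, 23, 48, 80, 99, 107, 110
total = sum(frequencies)                                 # 110
more_than = [total - c for c in [0] + less_than[:-1]]    # 110, 106, 100, 87, 62, 30, 11, 3

for u, lt in zip(upper_limits, less_than):
    print(f"Less than {u}: {lt}")
for l, mt in zip(lower_limits, more_than):
    print(f"More than {l}: {mt}")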
Chapter Three
Measures of Central Tendency and Dispersion
A measure of central tendency is a typical value around which other figures congregate. An average stands for the whole group of which it forms a part and thus represents the whole. One of the most widely used sets of summary figures is known as measures of location.
1. Arithmetic mean or mean
The arithmetic mean, or simply the mean, of a variable is defined as the sum of the observations divided by the number of observations. If the variable x assumes n values x1, x2, …, xn then the mean is given by
X̄ = (x1 + x2 + … + xn) / n = Σx / n
This formula is for ungrouped or raw data.
Example: A student's marks in 5 subjects are 2, 4, 6, 8, and 10. Find his average mark.
X̄ = Σx / n = (2 + 4 + 6 + 8 + 10) / 5 = 30 / 5 = 6
The mean for an ungrouped frequency distribution is obtained from the following formula:
X̄ = Σfx / Σf
Example: Given the following frequency distribution, calculate the arithmetic mean
Marks 64 63 62 61 60 59
Number of Students 8 18 12 9 7 6
Solution:
X̄ = Σfx / Σf = 3713 / 60 = 61.9 marks
Example: Calculate the arithmetic mean of the following distribution of income.
Income Birr (100)    0-10  10-20  20-30  30-40  40-50  50-60  60-70
Number of persons      6     8     10     12      7      4      3
Solution:
Income C.I.   Number of persons (f)   Mid value (X)    fX
0-10                   6                    5           30
10-20                  8                   15          120
20-30                 10                   25          250
30-40                 12                   35          420
40-50                  7                   45          315
50-60                  4                   55          220
60-70                  3                   65          195
Total                 50                              1550

X̄ = Σfx / Σf = 1550 / 50 = 31
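The grouped-mean calculation above can be checked with a short Python sketch (illustrative, not part of the original module):

class_intervals = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
frequencies = [6, 8, 10, 12, 7, 4, 3]

midpoints = [(low + high) / 2 for low, high in class_intervals]
total_f = sum(frequencies)
total_fx = sum(f * x for f, x in zip(frequencies, midpoints))

mean = total_fx / total_f
print(f"Sum of f = {total_f}, Sum of fx = {total_fx}, Mean = {mean}")   # 50, 1550.0, 31.0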
Merits and demerits of Arithmetic mean:
Merits:
It is rigidly defined.
It is easy to understand and easy to calculate.
If the number of items is sufficiently large, it is more accurate and more reliable.
It is a calculated value and is not based on its position in the series.
It is possible to calculate even if some of the details of the data are lacking.
It provides a good basis for comparison.
Demerits:
Harmonic Mean:
The harmonic mean of a set of observations is defined as the reciprocal of the arithmetic average of the reciprocals of the given values. If X1, X2, …, Xn are n observations, then
H.M. = n / (1/X1 + 1/X2 + … + 1/Xn)
Example: The marks secured by some students of a class are given below. Calculate the harmonic mean.
Marks            20  21  22  23  24  25
No. of students   4   2   7   1   3   1
Solution:
X      f     1/X      f(1/X)
20     4    0.0500    0.2000
21     2    0.0476    0.0952
22     7    0.0454    0.3178
23     1    0.0435    0.0435
24     3    0.0417    0.1251
25     1    0.0400    0.0400
Total 18              0.8216

H.M. = Σf / Σf(1/X) = 18 / 0.8216 = 21.91
Merits of H.M:
It is rigidly defined.
It is based on all the observations.
It is amenable to further algebraic treatment.
It is the most suitable average when it is desired to give greater weight to smaller
observations and less weight to the larger ones.
Demerits of H.M:
Geometric Mean:
The geometric mean of a series containing n observations is the nth root of the product of the values. If x1, x2, …, xn are the observations, then
G.M. = (x1 × x2 × … × xn)^(1/n), so that log G.M. = Σ log x / n and G.M. = Antilog(Σ log x / n)
This is for ungrouped data.
For grouped data the geometric mean is computed as
G.M. = Antilog(Σ f log x / Σ f)
Example: Calculate the geometric mean of the following series of monthly income of a batch of
families 180, 250, 490, 1400, 1050
X Log x
180 2.2553
250 2.3979
490 2.6902
1400 3.1461
1050 3.0212
Total 13.5107
G.M. = Antilog(Σ log x / n) = Antilog(13.5107 / 5) = Antilog 2.7021 = 503.6
Example: Calculate the average income per head from the data given below. Use geometric
mean.
Class of people    Number of families (f)   Monthly income per head (Birr) X   Log X    f(log X)
Landlords                   2                        5000                      3.6990      7.398
Cultivators               100                         400                      2.6021    260.210
Landless labor             50                         200                      2.3010    115.050
Money-lenders               4                        3750                      3.5740     14.296
Office assistants           6                        3000                      3.4771     20.863
Shop keepers                8                         750                      2.8751     23.001
Carpenters                  6                         600                      2.7782     16.669
Weavers                    10                         300                      2.4771     24.771
Total                     186                                                            482.258

G.M. = Antilog(Σ f log X / Σ f) = Antilog(482.258 / 186) = Antilog 2.5928 = 391.4 birr (approximately)
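A short Python sketch (assuming the same data as in the table above) that reproduces the grouped geometric mean via logarithms:

import math

incomes     = [5000, 400, 200, 3750, 3000, 750, 600, 300]   # monthly income per head (X)
frequencies = [   2, 100,  50,    4,    6,   8,   6,  10]   # number of families (f)

total_f = sum(frequencies)
sum_f_log_x = sum(f * math.log10(x) for f, x in zip(frequencies, incomes))

gm = 10 ** (sum_f_log_x / total_f)          # antilog of the weighted mean of log X
print(f"Sum f = {total_f}, Sum f*log(X) = {sum_f_log_x:.3f}, G.M. = {gm:.1f} birr")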
Merits of G.M.:
It is rigidly defined
It is based on all items
It is very suitable for averaging ratios, rates and percentages
It is capable of further mathematical treatment.
Unlike AM, it is not affected much by the presence of extreme values
Demerits of G.M.:
It cannot be used when the values are negative or if any of the observations is zero
It is difficult to calculate particularly when the items are very large or when there is a
frequency distribution
It brings out the property of the ratio of the change and not the absolute difference of
change as the case in arithmetic mean.
The GM may not be the actual value of the series.
6. Positional Averages:
These averages are based on the position of the given observation in a series arranged in ascending or descending order. The magnitude or the size of the values does not matter as it did in the case of the arithmetic mean. It is because of this basic difference that the median and mode are called the positional measures of an average.
A. Median:
The median is that value of the variant which divides the group into two equal parts, one part
comprising all values greater, and the other, all values less than median. Median is defined as the
value of the middle item or the mean of the values of the two middle items when the data are
arranged in an ascending or descending order of magnitude.
Median (Md) = the value of the ((n + 1) / 2)th item
Example 1: When an odd number of values is given. Find the median for the following data: 25, 18, 27, 10, 8, 30, 42, 20, 53.
Solution: Arranging the data in increasing order: 8, 10, 18, 20, 25, 27, 30, 42, 53
The middle value is the 5th item, i.e., 25 is the median.
Using the formula, Median (Md) = ((n + 1) / 2)th item = ((9 + 1) / 2)th item = 5th item = 25
Example 2: When even number of values are given. Find median for the following data 5, 8, 12,
30, 18, 10, 2, 22
Solution: Arranging the data in the increasing order 2, 5, 8, 10, 12, 18, 22, 30
Here median is the mean of the middle two items (i.e.) mean of (10, 12) i.e.
10 +12 = 11
2
Using the formula,
Median (Md) = ((n + 1) / 2)th item = ((8 + 1) / 2)th item = 4.5th item
            = 4th item + (1/2)(5th item − 4th item)
            = 10 + (1/2)(12 − 10) = 10 + 1 = 11
Grouped Data:
In a grouped distribution, values are associated with frequencies. Grouping can be in the form of
a discrete frequency distribution or a continuous frequency distribution. Whatever may be the
type of distribution, cumulative frequencies have to be calculated to know the total number of
items. In the case of a grouped series, the median is calculated by linear interpolation with the
help of the following formula:
M = l1 + ((l2 − l1) / f) × (m − c)
Where, M = the median
l1 = the lower limit of the class in which the median lies
l2 = the upper limit of the class in which the median lies
f = the frequency of the class in which the median lies
m = the middle item, i.e., the ((n + 1) / 2)th item
c = the cumulative frequency of the class preceding the one in which the median lies.
Example: The following table gives the frequency distribution of 325 workers of a factory,
according to their average monthly income in a certain year. Calculate median income
Monthly income (birr)              Frequency (f)   Cumulative frequency
Classes below 200 (not shown)           63                 63
200-250                                 55                118
250-300 (l1 and l2)                     62 (f)             180
300-350                                 45                225
350-400                                 30                255
400-450                                 25                280
450-500                                 15                295
500-550                                 18                313
550-600                                 10                323
600 and above                            2                325
Total                                  325
m = (n + 1) / 2 = (325 + 1) / 2 = 326 / 2 = 163
This means the median lies in the class interval birr 250-300.
M = l1 + ((l2 − l1) / f) × (m − c)
  = 250 + ((300 − 250) / 62) × (163 − 118)
  = 250 + (50 / 62) × 45
  = 286.29 birr
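A minimal Python sketch of the interpolation formula applied to the table above; the classes below birr 200, which are not listed individually in the text, are represented here by their combined frequency of 63 (and the open class "600 and above" by an assumed width of 50, which does not affect the result).

classes = [(200, 250, 55), (250, 300, 62), (300, 350, 45), (350, 400, 30),
           (400, 450, 25), (450, 500, 15), (500, 550, 18), (550, 600, 10), (600, 650, 2)]
below_200 = 63

n = below_200 + sum(f for _, _, f in classes)          # 325 workers in total
m = (n + 1) / 2                                        # position of the middle item = 163

cumulative = below_200
for l1, l2, f in classes:
    if cumulative + f >= m:                            # median class found
        c = cumulative                                 # cumulative frequency before the median class
        median = l1 + (l2 - l1) / f * (m - c)
        print(f"Median income = {median:.2f} birr")    # 286.29 birr
        break
    cumulative += f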
Merits of Median:
Median is not influenced by extreme values because it is a positional average.
Median can be calculated in case of distribution with open end intervals.
Median can be located even if the data are incomplete.
Median can be located even for qualitative factors such as ability, honesty etc.
Demerits of Median:
A slight change in the series may bring drastic change in median value.
In case of even number of items or continuous series, median is an estimated value other
than any value in the series.
It is not suitable for further mathematical treatment except its use in mean deviation.
It does not take into account all the observations.
B. Mode:
The mode refers to that value in a distribution which occurs most frequently. It is an actual value, which has the highest concentration of items in and around it. According to Croxton and Cowden, "The mode of a distribution is the value at the point around which the items tend to be most heavily concentrated. It may be regarded as the most typical of a series of values."
It shows the center of concentration of the frequency in and around a given value. Therefore, where the purpose is to know the point of highest concentration, the mode is preferred. It is, thus, a positional measure. Its importance is very great in marketing studies where a manager is interested in knowing the size which has the highest concentration of items. For example, in placing an order for shoes or ready-made garments, the modal size helps because this size and the sizes around it are in common demand.
For a grouped frequency distribution, the mode is calculated by the formula
Mode = l + ((f1 − f0) / (2f1 − f0 − f2)) × c
Where, l = the lower limit of the class in which the mode lies
f1 = the frequency of the class in which the mode lies
f0 = the frequency of the class preceding the modal class
f2 = the frequency of the class succeeding the modal class
c = the class interval (width) of the modal class
While applying the above formula, we should ensure that the class intervals are uniform
throughout. If the class intervals are not uniform, then they should be made uniform on the
assumption that the frequencies are evenly distributed throughout the class. In the case of
unequal class intervals, the application of the above formula will give misleading results.
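A minimal Python sketch of the grouping formula just described; the class data used here are illustrative, because the data for the worked example that follows are not reproduced in the text.

classes = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 7), (40, 50, 3)]  # (lower, upper, frequency)

# locate the modal class (highest frequency)
i = max(range(len(classes)), key=lambda k: classes[k][2])
l, upper, f1 = classes[i]
c = upper - l
f0 = classes[i - 1][2] if i > 0 else 0                   # frequency of the preceding class
f2 = classes[i + 1][2] if i < len(classes) - 1 else 0    # frequency of the succeeding class

mode = l + (f1 - f0) / (2 * f1 - f0 - f2) * c
print(f"Modal class {l}-{upper}, Mode = {mode:.2f}")      # 20-30, 24.44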
Example: Calculate mode for the following:
At this stage, one may ask which of these three measures of central tendency is the best. There is no simple answer to this question, because these three measures are based upon different
concepts. The arithmetic mean is the sum of the values divided by the total number of
observations in the series. The median is the value of the middle observation that divides the
series into two equal parts. Mode is the value around which the observations tend to concentrate.
As such the use of a particular measure will largely depend on the purpose of the study and the
nature of the data. For example, when we are interested in knowing the consumers‘ preferences
for different brands of television sets or different kinds of advertising, the choice should go in
favor of mode. The use of mean and median would not be proper. However, the median can
sometimes be used in the case of qualitative data when such data can be arranged in an ascending
or descending order. Let us take another example: suppose we invite applications for a certain vacancy in our company. A large number of candidates apply for that post. We are now interested in knowing which age or age group has the largest concentration of applicants. Here,
obviously the mode will be the most appropriate choice. The arithmetic mean may not be
appropriate as it may be influenced by some extreme values. However, the mean happens to be
the most commonly used measure of central tendency.
Measures of Dispersion
Dispersion (also known as scatter, spread or variation) measures the extent to which the items vary from some central value. It may be noted that measures of dispersion measure only the degree (i.e., the amount of variation) but not the direction of variation. The measures of dispersion are also called averages of second order, because these measures give an average of the differences of the various items from an average.
What are the properties of a good measure of dispersion?
Since a measure of dispersion is the average of the deviations of items from an average, it should
also possess all the qualities of a good measure of an average. According to Yule and Kendall,
the qualities of good measure of dispersion are as follows:
This list is not a complete list of the properties of a good measure of dispersion. But these are the
most important characteristics which a good measure of dispersion should possess.
An absolute measure of dispersion is one which is expressed in the same statistical unit in which the original data are given, such as kilograms, tonnes, kilometers, birr, etc. For example, when the rainfall on different days is available in mm, any absolute measure of dispersion gives the variation in rainfall in mm. These measures are suitable for comparing the variability of two distributions having variables expressed in the same units and of the same average size. They are not suitable for comparing the variability of two distributions having variables expressed in different units. The following are the absolute measures of dispersion: range, inter-quartile range, mean deviation, and standard deviation.
Range is defined as the difference between the value of largest item and the value of smallest
item included in the distribution.
Interpretation of range; If the average of the two distributions is almost same, the distribution
with smaller range is said to have less dispersion and the distribution with larger range is said to
have more dispersion.
i. Range
This is the simplest possible measure of dispersion and is defined as the difference between the
largest and smallest values of the variable.
In symbols, Range = L – S Where, L = Largest value and S = Smallest value.
In individual observations and discrete series, L and S are easily identified. In continuous series,
the following two methods are followed.
Method 1: L = Upper boundary of the highest class
S = Lower boundary of the lowest class.
Method 2: L = mid value of the highest class.
S = mid value of the lowest class.
ii. Co-efficient of Range
Co-efficient of Range = (L − S) / (L + S)
Example: Find the value of range and it‘s co-efficient for the following data.
7, 9, 6, 8, 11, 10, 4
Solution:
L=11, S = 4.
Range = L − S = 11 − 4 = 7
Co-efficient of Range = (L − S) / (L + S) = (11 − 4) / (11 + 4) = 7 / 15 = 0.4667
Example 2:
Calculate range and its co efficient from the following distribution.
Size: 60-63 63-66 66-69 69-72 72-75
Number: 5 18 42 27 8
Solution:
L = Upper boundary of the highest class = 75
S = Lower boundary of the lowest class = 60
Range = L – S = 75 – 60 = 15
Co-efficient of Range = (L − S) / (L + S) = (75 − 60) / (75 + 60) = 15 / 135 = 0.1111
Merits:
It is simple to understand.
It is easy to calculate.
In certain types of problems like quality control, weather forecasts, share price analysis,
etc., range is most widely used.
Demerits:
It is very much affected by the extreme items.
It is based on only two extreme observations.
It cannot be calculated from open-end class intervals.
It is not suitable for mathematical treatment.
It is a very rarely used measure.
II. Standard Deviation and Coefficient of variation:
i. Standard Deviation:
Karl Pearson introduced the concept of standard deviation in 1893. It is the most important measure of dispersion and is widely used in many statistical formulas. Standard deviation is also called root-mean-square deviation, because it is the square root of the mean of the squared deviations from the arithmetic mean. It provides accurate results. The square of the standard deviation is called the variance.
Definition: It is defined as the positive square-root of the arithmetic mean of the Square of the
deviations of the given observation from their arithmetic mean. The standard deviation is
denoted by the Greek letter σ (sigma).
Example: Calculate the standard deviation from the following data. 14, 22, 9, 15, 20, 17, 12, 11
Solution: Deviations from the actual mean (X̄ = 120 / 8 = 15).
Values (X)   X − X̄   (X − X̄)²
14             −1        1
22              7       49
 9             −6       36
15              0        0
20              5       25
17              2        4
12             −3        9
11             −4       16
120             0      140

σ = √(Σ(X − X̄)² / n) = √(140 / 8) = √17.5 = 4.18
Example: Calculate the standard deviation for the following discrete frequency distribution.
X   20  22  25  31  35  40  42  45
f    5  12  15  20  25  14  10   6
Solution: Deviations from assumed mean
X    f    d = X − A (A = 31)    d²    fd    fd²
20 5 -11 121 -55 605
22 12 -9 81 -108 972
25 15 -6 36 -90 540
31 20 0 0 0 0
35 25 4 16 100 400
40 14 9 81 126 1134
42 10 11 121 110 1210
45 6 14 196 84 1176
N = 107, Σfd = 167, Σfd² = 6037
σ = √(Σfd²/N − (Σfd/N)²) = √(6037/107 − (167/107)²) = √(56.42 − 2.44) = √53.98 = 7.35
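The assumed-mean calculation can be verified with a short Python sketch (illustrative, not part of the original module):

import math

x = [20, 22, 25, 31, 35, 40, 42, 45]
f = [ 5, 12, 15, 20, 25, 14, 10,  6]
A = 31                                              # assumed mean

N = sum(f)
sum_fd  = sum(fi * (xi - A)      for fi, xi in zip(f, x))
sum_fd2 = sum(fi * (xi - A) ** 2 for fi, xi in zip(f, x))

variance = sum_fd2 / N - (sum_fd / N) ** 2
sigma = math.sqrt(variance)
print(f"N = {N}, sum fd = {sum_fd}, sum fd^2 = {sum_fd2}, sigma = {sigma:.2f}")   # 107, 167, 6037, 7.35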
Merits and Demerits of Standard Deviation:
Merits:
It is rigidly defined and its value is always definite and based on all the observations and
the actual signs of deviations are used.
As it is based on arithmetic mean, it has all the merits of arithmetic mean.
It is the most important and widely used measure of dispersion.
It is possible for further algebraic treatment.
It is less affected by the fluctuations of sampling and hence stable.
It is the basis for measuring the coefficient of correlation and sampling.
Demerits:
It is not easy to understand and it is difficult to calculate.
It gives more weight to extreme values because the values are squared up.
As it is an absolute measure of variability, it cannot be used for the purpose of
comparison.
ii. Coefficient of Variation
Coefficient of variation (CV) = (Standard deviation / Mean) × 100%
Example: In two factories A and B located in the same industrial area, the average weekly wages (in birr) and the standard deviations are as follows:
Factory   Average weekly wage (birr)   Standard deviation
A                  34.5                        5
B                  28.5                        4.5
In which factory are the wages more consistent?
Solution:
CV(A) = (5 / 34.5) × 100 = 14.49%
CV(B) = (4.5 / 28.5) × 100 = 15.79%
Since CV(A) < CV(B), the wages in factory A are more consistent (less variable).
Chapter Four
4. Probability Theory
4.1 Introduction
Life is full of uncertainties. 'Probably', 'likely', 'possibly', 'chance', etc. are some of the most commonly used terms in our day-to-day conversation. All these terms more or less convey the same sense: "the situation under consideration is uncertain and commenting on the future with certainty is impossible". Decision-making in such areas is facilitated through formal and precise expressions for the uncertainties involved. For example, product demand is uncertain, but a study of demand spelled out in a form amenable to analysis may go a long way to help analyze and facilitate decisions on sales planning and inventory management. Intuitively, we see that if there
is a high chance of a high demand in the coming year, we may decide to stock more. We may
also take some decisions regarding the price increase, reducing sales expenses etc. to manage the
demand. However, in order to make such decisions, we need to quantify the chances of different
quantities of demand in the coming year. Probability theory provides us with the ways and means
to quantify the uncertainties involved in such situations.
a. Random Experiment
A random experiment is an activity or a process whose outcome cannot be known with certainty until its completion. This phenomenon has the following properties:
b. Sample Space
Sample space: the set of all sample points (simple events) for an experiment is called a sample
space; or set of all possible outcomes for an experiment. Sample space is denoted by the capital
letter S.
Example: Consider the experiment of tossing two coins. If we ask whether each coin will fall on its head (H) or tail (T), then there are four possible outcomes. These are
S = {HH, HT, TH, TT}
c. Events
An event is defined as a set (or collection) of individual outcomes (or elements) within the sample space having a specific characteristic. For example, for the sample space S defined above, the subset {HT, TH} is the event that one coin shows a head and the other a tail.
An event is a subset of the sample space.
If two or more events cannot occur simultaneously in a single trial of an experiment, then such events are called mutually exclusive events. In other words, two events are mutually exclusive if the occurrence of one of them prevents or rules out the occurrence of the other. Two events are said to be mutually exclusive (or disjoint) if their intersection is empty (i.e., A ∩ B = φ).
For two mutually exclusive events A and B, P(A ∪ B) = P(A) + P(B).
Example: in tossing a coin, there are two possible outcomes, head and tail, but not both. Therefore, the events head and tail on a single toss are mutually exclusive.
For any two events A and B (not necessarily mutually exclusive),
P(A ∪ B) = n(A ∪ B) / n(S)
         = [n(A) + n(B) − n(A ∩ B)] / n(S)
         = n(A)/n(S) + n(B)/n(S) − n(A ∩ B)/n(S)
         = P(A) + P(B) − P(A ∩ B)
The intersection of A and B, A∩B, is the event containing all sample points that
are both in A and B. Sometimes we use AB or A and B for intersection.
P (A ∪ B ∪ C) =P (A) + P (B) + P (C) − P (A ∩ B) − P (B ∩ C) − P (A ∩ C) + P (A ∩ B ∩ C)
The complement of A, denoted Ac (or 'not A'), is the event containing all sample points that are not in A.
In other words, A ∪ Ac = S
So, P(A ∪ Ac) = P(S) = 1
Or P(A) + P(Ac) = 1
Or P(Ac) = 1 − P(A)
A set of events is said to be collectively exhaustive if, between them, they account for every possible outcome. Symbolically, a set of events {E1, E2, …, En} is collectively exhaustive if the union of these events is identical with the sample space S. That is,
E1 ∪ E2 ∪ … ∪ En = S
A general definition of probability states that probability is a numerical measure (between 0 and
1 inclusively) of the likelihood or chance of occurrence of an uncertain event.
i. Classical Approach
This approach of defining probability is based on the assumption that all outcomes of an experiment are mutually exclusive and equally likely. It states that, during a random experiment, if there are 'a' possible outcomes where the favorable event A occurs and 'b' possible outcomes where the event A does not occur, and all these possible outcomes are mutually exclusive, exhaustive, and equi-probable, then the probability that event A will occur is defined as
P(A) = a / (a + b) = number of favorable outcomes / total number of possible outcomes
For example, if a fair die is rolled, then on any trial each event is equally likely to occur; since there are six equally likely exhaustive events, each will occur 1/6 of the time, and therefore the probability of any one event occurring is 1/6.
ii. Relative Frequency Approach
This approach of computing probability is based on the assumption that a random experiment can be repeated a large number of times under identical conditions, where the trials are independent of each other. While conducting a random experiment, we may or may not observe the desired event. But as the experiment is repeated many times, the event occurs in roughly the same proportion of the trials. Thus, the approach calculates the proportion of times (i.e., the relative frequency) with which the event occurs over a very large number of repetitions of the experiment under identical conditions.
This approach of using statistical data is now called the relative frequency approach. Here, probability is defined as the proportion of times an event occurs in the long run when the conditions are stable.
Example: consider an experiment of tossing a fair coin. There are two possible outcomes, head and tail. If this experiment is repeated 300 times, which is a fairly large number, then the relative frequency tends to be stable. Initially there are large fluctuations, but as the experiment continues the fluctuations decrease.
iii. Subjective Approach
This approach is based on the degree of belief, conviction, and experience concerning the likelihood of occurrence of a random event. The probability assigned to the occurrence of an event may be based on just a guess or on having some idea about the relative frequency of past occurrences of the event.
Probability rules are more in the nature of assumptions. We shall continue to denote events by
means of capital letters such as A, B, C. we shall write the probability of event A as P(A), the
probability of event B as P(B), and so forth. In addition, we shall follow the common practice of
denoting the set of all possible outcomes that is the sample space by the letter S.
Rule 1: the probability of any event is a positive real number or zero; it cannot be negative.
Symbolically, P(A) ≥ 0.
Rule 2: the sum of the probabilities of all possible mutually exclusive events of an experiment is unity.
Symbolically, P(A1) + P(A2) + … + P(An) = 1.
Rule 3: the probability of either of two mutually exclusive events, say A and B,
occurring is equal to the sum of their probabilities: P (A or B) = P (A) + P (B).
Example 1: suppose we have a box with 3 red, 2 black and 5 white balls. Each time a ball is
drawn, it is returned to the box. What is the probability of drawing?
Example 2: The grades of 100 students are distributed as shown below. If a student is selected at random, find the probability that the student obtained:
Either grade A or B
Either grade C or D
Solution:
Grade      Frequency   Relative frequency
Grade A       20            0.20
Grade B       25            0.25
Grade C       20            0.20
Grade D       35            0.35
Since the four grades are mutually exclusive events,
P(A or B) = 0.20 + 0.25 = 0.45 and P(C or D) = 0.20 + 0.35 = 0.55
The foregoing discussion related to mutually exclusive events. There are situations, however, in which two events can occur together. Let us take an example to explain the method of calculating probability in such a case.
Example: in a group of 200 university students, 140 are full-time (80 females and 60 males)
students and 60 part-time (40 females and 20 males) students. The break-up of students is shown
below.
             Full-time   Part-time   Total
Males            60          20         80
Females          80          40        120
Total           140          60        200
Event A: the student selected is full-time
Event B: the student selected is part-time and male
Solution
These two events A and B are mutually exclusive, as no student can be both full-time and part-time: either he or she is a full-time or a part-time student. Show these events in a Venn diagram. Now introduce another event C, defined as 'the student selected is female'. Are the events A and C mutually exclusive? Show this in another Venn diagram.
[Venn diagram: event A (full-time students) contains 140 sample points and event B (part-time male students) contains 20; the remaining 40 students lie outside both events.]
It will be seen from the above figure that the two events A and B are shown in a sample space. As the total in the sample is 200 and the two events account for 160 students (140 full-time and 20 part-time male), the remaining 40 are shown separately.
Since there are 80 full-time female students, the two events A and C are not mutually exclusive. The next figure shows the Venn diagram with the intersection of events A and C.
Using the sample space and events as defined in the preceding section; find the probability that
the student selected is full-time or female that is P (A or C).
Solution:
Referring to the sample space, we find that P(A) = 140/200 = 0.7, and the probability of selecting a female is P(C) = 120/200 = 0.6. Adding these two probabilities together, we get 1.3, which exceeds 1. At the same time, we know from the basic properties of probability mentioned earlier that probability numbers cannot be more than one. How has this happened? If we look more closely at the sample space, we will see that there was double counting: we counted 80 of the 200 students twice. There are only 180 (140 + 40) students who are full-time or female. Thus, the probability of A or C is:
P(A or C) = n(A or C) / n(S) = 180/200 = 0.9
Now we can generalize the addition rule: let A and B be two events defined in a sample space S. Then
P(A or B) = P(A) + P(B) − P(A and B)
It may be noted that for two mutually exclusive events, we have just to add the probabilities of the events A and B in order to calculate the probability that one or the other occurs. Thus,
P(A or B) = P(A) + P(B)
This can be expanded to cover more than two mutually exclusive events:
P(A or B or C or …) = P(A) + P(B) + P(C) + …
Example: X is a registered contractor with the government. Recently, X has submitted his tender
for two contracts, A and B. the probability of getting the contract A is ¼, the contract B is ½ and
both contracts A and B is 1/8. Find the probability that X will get contract A or B.
Solution:
As the events of getting contract A and getting contract B are not mutually exclusive, the required probability is:
P(A or B) = P(A) + P(B) − P(A and B) = 1/4 + 1/2 − 1/8 = 5/8 = 0.625
When the occurrence of an event does not affect and is not affected by the probability of occurrence of any other event, the events are said to be statistically independent. There are three types of probabilities under statistical independence: marginal, joint, and conditional.
For two independent events A and B, the joint probability is
P(AB) = P(A ∩ B) = P(A) × P(B)
Example: Suppose we toss a fair coin twice. The probability that the coin turns up heads on both tosses is
P(H1 and H2) = P(H1) × P(H2) = 1/2 × 1/2 = 1/4
Example: The following table classifies 125 persons according to TV ownership, telephone subscription and household income.
                 Income ≤ birr 800              Income > birr 800
              Telephone     Not a            Telephone     Not a        Total
              subscriber    subscriber       subscriber    subscriber
Own TV set        27           20                18           10          75
No TV set         18           10                12           10          50
Total             45           30                30           20         125
Solution:
(b) There are 30 (18 + 12) persons whose household income is above birr 800 and who are also telephone subscribers. Out of these, 18 own TV sets. Hence the probability for this group is 18/30 = 0.6.
(d) Let A and B be the events representing TV owners and telephone subscribers respectively. The probability of a person owning a TV set is P(A) = 75/125 = 0.6.
The probability of a person being a telephone subscriber as well as a TV owner is
P(A and B) = 45/125 = 9/25 = 0.36
Similarly, the probability of being a telephone subscriber is P(B) = 75/125 = 0.6, so that P(A) × P(B) = 0.6 × 0.6 = 0.36.
Since P(AB) = P(A) × P(B), we conclude that the events 'ownership of a TV set' and 'telephone subscriber' are statistically independent.
When the probability of an event is dependent on or affected by the occurrence of another event, the events are said to be statistically dependent.
i) Joint probability: If A and B are dependent events, then the joint probability is no longer equal to the product of their respective probabilities as it was under statistical independence. That is, for dependent events,
P(A and B) = P(A ∩ B) ≠ P(A) × P(B)
The joint probability of events A and B occurring together or in succession under statistical dependence is given by
P(A ∩ B) = P(A) × P(B/A)
so that the conditional probabilities are
P(B/A) = P(A ∩ B) / P(A)
P(A/B) = P(A ∩ B) / P(B)
Example: The data for the promotion and academic qualification of a company is given below.
a) Calculate the conditional probability of promotion after an MBA has been identified.
b) Calculate the conditional probability that is an MBA when a promoted employee has
been chosen
Solution:
It is given that P(A) = 0.35, P(Ā) = 0.65, P(B) = 0.40, P(B̄) = 0.60, and P(A ∩ B) = 0.14, where A denotes 'holds an MBA' and B denotes 'promoted'.
a) P(B/A) = P(A ∩ B) / P(A) = 0.14 / 0.35 = 0.40
b) P(A/B) = P(A ∩ B) / P(B) = 0.14 / 0.40 = 0.35
In business, at times one finds that estimates of probabilities were made on the limited information that was available at the time. Subsequently, however, some additional information becomes available. This additional information necessitates a revision of the prior estimates of probability. The new probabilities are known as revised or posterior probabilities.
The origin of the concept of obtaining posterior probabilities from limited information is attributed to Reverend Thomas Bayes, and it rests on the basic formula for conditional probability under dependence.
Bayes' theorem is an important statistical method which is used in evaluating new information as well as in revising prior estimates of probabilities in the light of that information. Bayes' theorem, if properly used, makes it unnecessary to collect huge amounts of data over a long period in order to make good decisions on the basis of probabilities.
Example 1: suppose we have two machines, I and II, which are used in the manufacture of
shoes. Let E1 be the event of shoes produced by machine I and E2 be the event that they are
produced by machine II. Machine I produces 60 percent of the shoes and machine II 40 percent.
It is also reported that 10 percent of the shoes produced by machine I are defective as against the
20 percent by machine II. What is the probability that a non-defective shoe was manufactured by
machine I?
Solution:
If E1 be the event of the shoe being produced by machine I and A be the event of a non-defective
shoe, our problem in symbolic term is: P (E1/A). That is, given anon-detective shoe, what is the
probability that it was produced by machine I?
P(A) = Σ P(A/Ei) P(Ei), and by Bayes' theorem
P(E1/A) = P(A/E1) P(E1) / Σ P(A/Ei) P(Ei)
This may also be written:
It may be noted that P (E1) is the probability of a shoe being manufactured by machine I,
whereas P (E1/A) is the probability of a shoe being produced by machine I, given that it is a non-
defective shoe. The probability P (E1) is called prior probability and P (E1/A) is called posterior
probability.
Let us set up a table to calculate the probability that a non-defective shoe was produced by
machine I.
On the basis of the above table, we can say that given a non – defective shoe, the probability that
it was produced by machine I is 0.63 and the probability it was produced by machine II is 0.37.
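A minimal Python sketch of the Bayes' theorem calculation for the shoe example (the machine names are used here only as dictionary keys):

priors = {"Machine I": 0.60, "Machine II": 0.40}        # P(Ei)
p_good = {"Machine I": 0.90, "Machine II": 0.80}        # P(non-defective | Ei) = 1 - defective rate

# total probability of drawing a non-defective shoe
p_a = sum(priors[m] * p_good[m] for m in priors)        # 0.6*0.9 + 0.4*0.8 = 0.86

# posterior probability that a non-defective shoe came from each machine
for m in priors:
    posterior = priors[m] * p_good[m] / p_a
    print(f"P({m} | non-defective) = {posterior:.2f}")  # 0.63 and 0.37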
The above problem related to two elementary events. Let us take a problem having three
elementary events.
Example 2: a manufacturing firm is engaged in the production of steel pipes in its three plants
with a daily production of 1000, 1500 and 2500 units respectively. According to the past
experience, it is known that the fractions of defective pipes produced by the three plants are
respectively 0.04, 0.09 and 0.07. If a pipe is selected from a day‘s total production and found to
be defective, find out:
P(E1) = 1000 / (1000 + 1500 + 2500) = 0.2 = P(plant A); similarly, P(E2) = 1500/5000 = 0.3 and P(E3) = 2500/5000 = 0.5.
Let P (D) be the probability that a defective pipe is drawn. Given that the proportions of the
defective pipes coming from the three plants are 0.04, 0.09 and 0.07 respectively. These are in
fact the conditional probabilities: P (D/E1) = 0.04, P(D/E2) = 0.09 and P(D/E3) = 0.07.
Now we can multiply prior probabilities and conditional probabilities in order to obtain the joint
probabilities.
Event   Prior P(Ei)   Conditional P(D/Ei)   Joint P(D ∩ Ei)   Posterior P(Ei/D)
E1          0.2              0.04                0.008          0.008/0.07 = 0.11
E2          0.3              0.09                0.027          0.027/0.07 = 0.39
E3          0.5              0.07                0.035          0.035/0.07 = 0.50
Total                                     P(D) = 0.070                      1.00
Exercise 3: An Economist believes that during periods of high economic growth, the Ethiopian
Birr appreciates with probability 0.70; in periods of moderate economic growth, it appreciates
with probability 0.40; and during periods of low economic growth, the Birr appreciates with
probability 0.20. During any period of time the probability of high economic growth is 0.30; the
probability of moderate economic growth is 0.50 and the probability of low economic growth is
0.20. Suppose the Birr value has been appreciating during the present period. What is the
probability that we are experiencing the period of (a) high, (b) moderate, and (c) low, economic
growth?
CHAPTER FIVE
Probability Distribution
Probability distribution describes how the probability is spread over the possible numerical
values associated with the outcomes. Earlier, a numerical variable was defined as a variable
that yields numerical responses, such as the number of magazines you subscribe to or your
height. Numerical variables are either discrete or continuous. Continuous numerical variables
produce outcomes that come from a measuring process (e.g., your height). Discrete numerical
variables produce outcomes that come from a counting process (e.g., the number of magazines
you subscribe to).
A probability distribution for a discrete random variable is a mutually exclusive list of all the possible numerical outcomes along with the probability of occurrence of each outcome.
A random variable (R.V.) is a rule that assigns a numerical value to each possible outcome of a
random experiment.
Random: the value of the R.V. is unknown until the outcome is observed
Variable: it takes a numerical value
The discrete R.V arises in situations when the populations (or possible outcomes) are discrete (or
qualitative).
Example: toss three fair coins and let the variable of interest, X, be the number of heads observed. Then the relevant events would be
{X = 0} = {TTT}
{X = 1} = {HTT, THT, TTH}
{X = 2} = {HHT, HTH, THH}
{X = 3} = {HHH}
The probability function p(x) of a discrete random variable X must satisfy:
i. 0 ≤ p(x) ≤ 1, and
ii. Σx p(x) = 1, where the summation is over all possible values of x.
The mean of discrete random variable X is the mean of its probability distribution. The mean of a
discrete random variable is also called its expected value. It is denoted by E(X). When we
perform an experiment a number of times, then what is our expectation from that experiment?
The mean is the value we expect to observe per repetition.
The expected value measures the central tendency of a probability distribution, while the variance measures the dispersion or variability of the possible values of the random variable around it. The variance, denoted by Var(X), is the expected squared deviation of the individual values from their expected value or mean.
Account A Account B
X 15,000 40,000
Solution:
= 9000 + 30000
= 39,000
= (-24,000)2 + (1000) 2
= 576,000,000 +1,000,000
= 577,000,000
Exercise: Suppose we are given the following data relating to breakdown of a machine in a
certain company during a given week. Where in x represents the number of breakdowns of a
machine and p (x) represents probability of value of X.
X 0 1 2 3 4
Binomial Distribution
The binomial distribution applies when the following conditions hold:
Each observation is classified into two categories, such as success and failure. For example, a supply of raw material received can be classified as defective or non-defective on the basis of its normal quality.
The probability of success (or failure) must remain the same for each observation in each trial. Thus the probability of getting a head or a tail must remain the same in each toss of the experiment. In other words, if the probability of success (or failure) changes from trial to trial, or if the results of each trial are classified into more than two categories, then it is not possible to use the binomial distribution.
The trials or individual observations must be independent of each other. In other words, no trial should influence the outcome of another trial.
The probability of r successes in n trials is then given by
P(X = r) = nCr p^r q^(n−r)
where p is the probability of success, q = 1 − p, and nCr is the number of ways in which we can get r successes and n − r failures out of n trials.
Example: find the chance of getting 3 successes in 5 trails when the chance of getting a success
in one trail is 2/3.
Solution:
P(X = 3) = nCr p^r q^(n−r) = 5C3 (2/3)^3 (1/3)^2 = [5! / (3!(5 − 3)!)] × (8/27) × (1/9) = 10 × 8/243 = 80/243 = 0.33
The expected number of successes, i.e., the long-run average, is calculated as E(X) = np.
Where q = (1 − p), the variance of the number of successes can also be computed directly:
V(X) = npq
and the standard deviation is σ = √V(X) = √npq
Example: The probability that a randomly chosen sales prospect will make a purchase is 0.2.If a
sales representative calls on 15 prospects, what is the expected number of sales (as a long run
average), the variance and the standard deviation associated with making calls on 15 prospects?
Solution:
E(X) = np = 15 × 0.2 = 3 sales
V(X) = npq = 15 × 0.2 × 0.8 = 2.40
σ = √V(X) = √2.40 = 1.55 sales
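A short Python sketch (using the standard-library function math.comb, available in Python 3.8+) that reproduces both binomial calculations above:

import math

def binom_pmf(r, n, p):
    """P(X = r) = nCr * p^r * (1-p)^(n-r)"""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

# Probability of 3 successes in 5 trials with p = 2/3
print(f"P(X = 3) = {binom_pmf(3, 5, 2/3):.4f}")            # about 0.3292, i.e. 0.33

# Expected sales, variance and standard deviation for n = 15 calls, p = 0.2
n, p = 15, 0.2
mean, var = n * p, n * p * (1 - p)
print(f"E(X) = {mean}, V(X) = {var}, sd = {math.sqrt(var):.2f}")   # 3.0, 2.4, 1.55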
Statisticians often use the hypergeometric distribution to complement the types of analyses that can be made by using the binomial distribution. Recall that the binomial distribution applies, in theory, only to those experiments in which the trials are done with replacement (independent events). The hypergeometric distribution applies only to those experiments in which the trials are done without replacement.
The hypergeometric distribution, like the binomial distribution, consists of two possible outcomes: success and failure. However, the user must know the size of the population and the proportion of successes and failures in the population in order to apply the hypergeometric distribution.
Its main characteristics are:
It is a discrete distribution
Each outcome consists of a success or a failure
Sampling is done without replacement
The population N is finite and known
The number of success in the population is known
P(x) = [rCx × (N − r)C(n − x)] / NCn
Where:
N = population size
r = number of successes in the population
n = sample size
x = number of successes in the sample
Example: Twenty – Four people, of whom 8 are women, have applied for a job, if 5 of the
applicants are randomly sampled, what is the probability that exactly 3 of those sampled are
women?
Solution:
N = 24, r = 8, N − r = 24 − 8 = 16, n = 5, x = 3
P(x = 3) = [8C3 × 16C2] / 24C5 = (56 × 120) / 42,504 = 0.1581
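A minimal Python sketch of the hypergeometric calculation above:

import math

def hypergeom_pmf(x, N, r, n):
    """P(X = x) when sampling n items without replacement from a population of N
    containing r 'successes'."""
    return math.comb(r, x) * math.comb(N - r, n - x) / math.comb(N, n)

# 24 applicants, 8 of them women; sample 5; probability that exactly 3 are women
print(f"P(X = 3) = {hypergeom_pmf(3, N=24, r=8, n=5):.4f}")   # 0.1581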
Exercise: Suppose that you are forming a team of 8 managers from different departments within your company. Your company has a total of 30 managers, and 10 of these people are from the finance department. If you are to randomly select members of the team, what is the probability that the team will contain 2 managers from the finance department? Here, N = 30, the population of managers within the company, is finite.
Exercise: In how many ways can 3 men and 4 women be selected from a group of 7 men and 10 women?
Poisson Distribution
The Poisson distribution can be used to determine the probability of a designated number of events occurring when the events occur in a continuum of time or space. Such a process is called a Poisson process. It is similar to the binomial process except that the events occur over a continuum and there are no fixed trials. It measures the probability of exactly x successes over a given continuous interval; that is, it is a discrete probability distribution with which we measure the occurrence of a discrete random variable over the given interval. The Poisson random variable arises when counting the number of events that occur in an interval of time when the events are occurring at a constant rate.
The Poisson distribution has one characteristic, called λ (the Greek lowercase letter lambda), which is the mean or expected number of events per unit. The variance of a Poisson distribution is also equal to λ and the standard deviation is equal to √λ. The number of events, X, of the Poisson random variable ranges from 0 to infinity.
The Poisson probability function is
P(X = x) = (e^−λ λ^x) / x!
where x = the number of events (x = 0, 1, 2, …), λ = the expected number of events per interval, and e = 2.71828…
Example: suppose that the mean number of customers who arrive per minute at the bank during
the noon-to-1 P.M. hour is equal to 3.0. What is the probability that in a given minute, exactly
two customers will arrive? And what is the probability that more than two customers will arrive
in a given minute?
Given:
λ=3
e = 2.71828
x=2
Solution:
P(X = 2) = (e^−λ λ^x) / x! = (e^−3 × 3^2) / 2! = 0.2240
And what is the probability that more than two customers will arrive in a given minute?
P(X > 2) = 1 − P(X ≤ 2) = 1 − [P(X = 0) + P(X = 1) + P(X = 2)]
         = 1 − [(e^−3 3^0)/0! + (e^−3 3^1)/1! + (e^−3 3^2)/2!]
         = 1 − (0.0498 + 0.1494 + 0.2240)
         = 1 − 0.4232 = 0.5768
Thus, there is a 57.68% chance that more than two customers will arrive in the same minute.
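A short Python sketch of the Poisson calculations above:

import math

def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x!"""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 3.0                                                          # mean arrivals per minute
print(f"P(X = 2) = {poisson_pmf(2, lam):.4f}")                     # 0.2240

p_at_most_2 = sum(poisson_pmf(x, lam) for x in range(3))           # P(X <= 2)
print(f"P(X > 2) = {1 - p_at_most_2:.4f}")                         # 0.5768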
The normal distribution (sometimes referred to as the Gaussian distribution) is the most common
continuous distribution used in statistics. The normal distribution is vitally important in statistics
for three main reasons:
In the normal distribution, you can calculate the probability that values occur within certain
ranges or intervals. However, because probability for continuous variables is measured as an area
under the curve, the exact probability of a particular value from a continuous distribution such as
the normal distribution is zero. As an example, time (in seconds) is measured and not counted.
Therefore, you can determine the probability that the download time for a video on a web
browser is between 7 and 10 seconds, or the probability that the download time is between 8 and
9 seconds, or the probability that the download time is between 7.99 and 8.01 seconds. However,
the probability that the download time is exactly 8 seconds is zero.
The normal curve is bell shaped and is symmetric about the mean.
The total area under the normal curve is equal to one.
The normal curve approaches, but never touches, the x-axis as it extends farther and
farther away from the mean.
Between μ − δ and μ + δ (in the center of the curve) the graph curves downward. The graph curves upward to the left of μ − δ and to the right of μ + δ. The points at which the curve changes from curving upward to curving downward are called inflection points.
There are infinitely many normal distributions, each with its own mean and standard deviation.
The normal distribution with a mean of 0 and a standard deviation of 1 is called the standard
normal distribution. The horizontal scale of the graph of the standard normal distribution
corresponds to Z- scores.
To compute normal probabilities, you first convert a normally distributed random variable, X, to
a standardized normal random variable, Z, using the transformation formula: The Z value is
equal to the difference between X and the mean, μ divided by the standard deviation, δ.
Z = (X − μ) / δ
Where:
X = the value of interest
μ = the mean of the distribution
δ = the standard deviation of the distribution
If necessary, we can then convert back to the original units of Measurement. To do this, simply
note that, if we take the formula for Z, multiply both sides by σ, and then add μ to both sides, we
get:
X=Zσ+μ
Example: The speeds of vehicles along a stretch of highway are normally distributed, with a
mean of 56 miles per hour and a standard deviation of 4 miles per hour. Find the speeds x
corresponding to scores of 1.96,-2.33 and 0. Interpret your results.
Solution:
The x-value that corresponds to each standard score is calculated using the formula X = Zσ + μ:
Z = 1.96:  X = 1.96 × 4 + 56 = 63.84 miles per hour
Z = −2.33: X = −2.33 × 4 + 56 = 46.68 miles per hour
Z = 0:     X = 0 × 4 + 56 = 56 miles per hour
Interpretation: 63.84 miles per hour is above the mean, 46.68 is below the mean, and 56 is equal to the mean.
Example: suppose the time to download a video is normally distributed, with a mean μ = 7
seconds and a standard deviation δ = 2 seconds. What is the Z value of the download time is
equal to 9 seconds?
Solution:
Z=X–μ=9–7=1
δ 2
Z = (1 – 7)/2 = -3
With the Z value computed, you look up the normal probability using a table of values from the cumulative standardized normal distribution. Suppose you wanted to find the probability that the download time for the Our Campus! site is less than X = 9 seconds. Recall from the example above that transforming to standardized Z units, given a mean μ = 7 seconds and a standard deviation δ = 2 seconds, leads to a Z value of +1.00. With this value, you use the cumulative standardized normal table to find the area under the normal curve less than (to the left of) Z = +1.00. To read the probability or area under the curve less than Z = +1.00, you scan down the Z column in the table until you locate the Z value of interest (in 10ths) in the row for Z = 1.0.
Next, you read across this row until you intersect the column that contains the 100ths place of the
Z value. Therefore, in the body of the table, the probability for Z = 1.00 corresponds to the
intersection of the row Z = 1.0 with the column Z = .00. The probability listed at the intersection
is 0.8413, which means that there is an 84.13% chance that the download time will be less than 9
seconds.
The probability that the download time will be less than 9 seconds is 0.8413. Thus, the
probability that the download time will be at least 9 seconds is the complement of less than 9
seconds, 1 - 0.8413 = 0.1587. The next Figure illustrates this result.
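Instead of reading the table, the cumulative probability can be sketched in Python using the error function from the standard library (an assumption of this sketch, not a method described in the text):

import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 7, 2                                   # download-time example
p_less_9 = normal_cdf(9, mu, sigma)
print(f"P(X < 9)  = {p_less_9:.4f}")               # 0.8413
print(f"P(X >= 9) = {1 - p_less_9:.4f}")           # 0.1587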
5.2.4.3 Normal Approximation to Binomial Probabilities
The normal distribution gives a good approximation to binomial probabilities when p is close to 0.5 and n is large. The approximation is quite good when np and nq are both greater than 5. When the normal distribution is used to approximate the binomial distribution, the mean (µ) and standard deviation (δ) of the normal distribution are based on the expected value (µ = np) and standard deviation (δ = √npq) of the binomial distribution.
Continuity correction factor (± 0.5)
When you use a continuous normal distribution to approximate a binomial probability, you need
to move 0.5 unit to the left and right of the midpoint to include all possible x-values in the
interval. When you do this, you are making a correction for continuity.
It is the addition or subtraction of 0.5 to or from a discrete random variable. If some value c is being estimated, the relevant continuity corrections are as follows:
P(X = c):         use c − 0.5 < X < c + 0.5
P(X > c):         use X > c + 0.5
P(X ≥ c):         use X > c − 0.5
P(X < c):         use X < c − 0.5
P(X ≤ c):         use X < c + 0.5
P(c1 ≤ X ≤ c2):   use c1 − 0.5 < X < c2 + 0.5
Example: Use a correction for continuity to convert each of the following binomial intervals to a
normal distribution interval.
Solution:
1. The discrete midpoint values are 270, 271, …, 310. The corresponding interval for the continuous normal distribution is 269.5 < X < 310.5.
2. The discrete midpoint values are 158, 159, 160, …. The corresponding interval for the continuous normal distribution is X > 157.5.
3. The discrete midpoint values are …, 60, 61, 62. The corresponding interval for the continuous normal distribution is X < 62.5.
Example 1: Thirty-eight percent of people in the United States admit that they snoop in other
people‘s medicine cabinets. You randomly select 200 people in the United States and ask each if
he or she snoops in other people‘s medicine cabinets. What is the probability that at least 70 will
say yes?
Solution:
Because np = 200(0.38) = 76 and nq = 200(0.62) = 124, the binomial variable x is approximately normally distributed with
µ = np = 76 and δ = √npq = √(200 × 0.38 × 0.62) = 6.86.
Using the correction for continuity, you can rewrite the discrete probability P(x ≥ 70) as the continuous probability P(x ≥ 69.5). The graph shows a normal curve with µ = 76 and δ = 6.86 and a shaded area to the right of 69.5. The z-score that corresponds to 69.5 is z = (69.5 − 76) / 6.86 = −0.95.
P(x ≥ 69.5) = P(z ≥ −0.95)
            = 1 − P(z ≤ −0.95)
            = 1 − 0.1711 = 0.8289
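A minimal Python sketch of the normal approximation with the continuity correction (the normal_cdf helper is a name introduced here for illustration):

import math

def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p = 200, 0.38
mu = n * p                                # 76
sigma = math.sqrt(n * p * (1 - p))        # about 6.86

# P(X >= 70) with the continuity correction becomes P(X >= 69.5)
z = (69.5 - mu) / sigma                   # about -0.95
print(f"P(at least 70 say yes) = {1 - normal_cdf(z):.4f}")   # about 0.83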
Exercises: A survey reports that 95% of Internet users use Microsoft Internet Explorer as their
browser. You randomly select 200 Internet users and ask each whether he or she uses Microsoft
Internet Explorer as his or her browser. What is the probability that exactly 194 will say yes?
Statistics For Management-II
Mohammedareb S. (MBA)
Melaku Beshaw. (MBA)
ARBA MINCH UNIVERSITY
COLLEGE OF BUSINESS AND ECONOMICS
DEPARTMENT OF MANAGEMENT
FEB,2023 G.C.
CHAPTER ONE
Sampling and Sampling Distributions
Statistical data are collected through different data collection methods: questionnaire, interview,
focused group discussion, field observation, controlled experiment, and so on. These techniques
are used to gather data either from the entire population or from the part of it based on the
manageability of the population size and the required amount and relevance of data. If the survey
covers the entire population, then it is known as the census survey. In contrast, if the survey
covers only a part of a population, or a subset from a set of units with the objective of
investigating the properties of the population, it is known as a sample survey. The process of
selecting sample is known as sampling. In this chapter, the following concepts will be discussed:
Definitions of terminologies
The importance of sampling
Different sampling methods
Sampling error
The concept of the sampling distribution
Sampling distribution of the mean and of the proportion
Sampling distribution of the difference between two means and between two proportions
1.1.Definitions of terminologies
Population: population is a complete set of all possible observations of the type which is
to be investigated. Total numbers of students studying in a school or college, total
number of books in a library, total number of houses in a village or town are some
examples of population. A population is said to be finite if it consists of finite number of
units. The number of workers in a factory and the production of articles on a particular day for a company are examples of finite populations. A population is said to be infinite if it has an infinite number of units, for example, the number of stars in the sky or the number of people watching television programmes.
Sampling: is a process or the selection of a small number of elements from a population
to make judgment about population.
Sample: is defined as an aggregate of sampling units actually chosen in obtaining a
representative subset from which inferences about the population are drawn. The term 'sample' thus describes a portion chosen from the population. A finite subset of statistical individuals defined in a population is called a sample.
Sampling unit: The constituents of a population which are individuals to be sampled
from the population and cannot be further subdivided for the purpose of the sampling at a
time are called sampling units.
Sampling frame: a list or directory that defines all the sampling units in the universe to be covered. It is the list of elements from which the sample will be selected.
Sample Size: is the total number of the items/persons selected as a sample. Sample size
is denoted by n.
Parameter: is the numerical descriptive characteristic of the population (μ).
Statistic: is the numerical descriptive characteristic of the sample ( 𝑥̅ ).
Advantages of Sampling
Limitation of Sampling
There is the possibility of sampling errors.
The sample may not be sufficient to represent the entire population.
1.2. Types of Sampling
1.2.1. Probability Sampling Techniques
A probability sample is one for which the inclusion or exclusion of any individual element of the
population depends upon the application of probability methods and not on a personal judgment.
It is so designed and drawn that the probability of inclusion of an element is known. The
essential feature of drawing such a sample is the randomness. In a probability sampling, it is
possible to estimate the error in the estimates and they can be minimized also. It is also possible
to evaluate the relative efficiency of the various probability sampling designs.
Mixed Sampling: here samples are selected partly according to some probability and partly according to a fixed sampling rule; they are termed mixed samples and the technique of selecting such samples is known as mixed sampling. The classification of the various probability and non-probability methods is shown below:
i. Simple Random Sampling
A simple random sample from a finite population is a sample selected such that each possible
sample combination has equal probability of being chosen. It is also called unrestricted random
sampling. Simple random sampling may be with or without replacement.
Simple random sampling without replacement: In this method the population elements can
enter the sample only once (i.e.) the units once selected is not returned to the population before
the next draw.
Simple random sampling with replacement: In this method the population units may enter the
sample more than once.
Lottery Method: This is the most popular and simplest method. In this method all the
items of the population are numbered on separate slips of paper of same size, shape and
color. They are folded and mixed up in a container. The required numbers of slips are
selected at random for the desired sample size. For example, if we want to select 5
students, out of 50 students, then we must write their names or their roll numbers of all
the 50 students on slips and mix them. Then we make a random selection of 5 students.
This method is mostly used in lottery draws. If the universe is infinite this method is
inapplicable.
Table of Random numbers: As the lottery method cannot be used, when the population
is infinite, the alternative method is that of using the table of random numbers. A random
number table is so constructed that all digits 0 to 9 appear independent of each other with
equal frequency. If we have to select a sample from population of size N= 100, then the
numbers can be combined three by three to give the numbers from 001 to 100.
Units of the population from which a sample is required are assigned with equal number of
digits. When the size of the population is less than thousand, three-digit number 000,001,002,
….. 999 are assigned. We may start at any place and may go on in any direction such as column
wise or row- wise in a random number table. But consecutive numbers are to be used. On the
basis of the size of the population and the random number table available with us, we proceed
according to our convenience. If any random number is greater than the population size N, then N is subtracted from the random number drawn. This is repeated until the resulting number is less than or equal to N.
Example 1: In an area there are 500 families. Using the following extract from a table of random
numbers select a sample of 15 families to find out the standard of living of those families in that
area.
ii. Stratified Random Sampling
This technique is mainly used to reduce the population heterogeneity and to increase the
efficiency of the estimates. Stratification means division into groups. In this method the
population is divided into a number of subgroups or strata. The strata should be so formed that
each stratum is homogeneous as far as possible. Then from each stratum a simple random sample
may be selected and these are combined together to form the required sample from the
population. There are two types of stratified sampling. They are proportional and non-
proportional.
Example 2:
Solution:
There are two strata in this case with sizes N1 = 200 and N2 = 300, and the total population is N = N1 + N2 = 500. The sample size is 50. Under proportional allocation the sample is distributed in proportion to the stratum sizes, so
n1 = n × N1 / N = 50 × 200 / 500 = 20 and n2 = n × N2 / N = 50 × 300 / 500 = 30.
Merits:
It is more representative.
It ensures greater accuracy.
It is easy to administer as the universe is sub - divided.
Greater geographical concentration reduces time and expenses.
Limitations:
Dividing the population into homogeneous strata requires more money, time and
statistical experience, which makes it difficult.
Improper stratification leads to bias; if the different strata overlap, such a sample
will not be a representative one.
iii. Systematic Sampling
This method is widely employed because of its ease and convenience. A frequently used method
of sampling when a complete list of the population is available is systematic sampling. It is also
called Quasi-random sampling.
Selection procedure:
The whole sample selection is based on just a random start: the first unit is selected with the
help of random numbers and the rest are selected automatically according to some pre-designed
pattern. With systematic random sampling every Kth element
in the frame is selected for the sample, with the starting point among the first K elements
determined at random.
For example, if we want to select a sample of 50 students from 500 students under this method
Kth item is picked up from the sampling frame and K is called the sampling interval.
K = 500/50 = 10
K = 10 is the sampling interval. A systematic sample is obtained by selecting a random number, say i
≤ K, and then every Kth unit subsequently. Suppose the random number i is 5; then we select 5, 15, 25,
35, 45, … The random number i is called the random start. The technique will generate K
systematic samples, each with equal probability.
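The selection rule can be sketched in a few lines of Python; the seed is only there to make the illustration reproducible.

    import random

    # Systematic sampling: interval K = N / n, random start i in 1..K,
    # then every Kth unit thereafter (N = 500 students, n = 50 as in the text).
    N, n = 500, 50
    K = N // n                    # sampling interval, K = 10

    random.seed(1)
    start = random.randint(1, K)  # random start i, 1 <= i <= K

    sample = list(range(start, N + 1, K))
    # prints the random start, the first five selected units, and the sample size
    print(start, sample[:5], len(sample))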
Merits:
Limitations:
Systematic sampling may not represent the whole population.
There is a chance of personal bias of the investigators.
Systematic sampling is preferably used when the information is to be collected from trees in a
forest, houses in blocks, entries in a register which are in serial order, etc.
iv. Multi-stage Sampling
Under this method, the random selection is made of primary, intermediate and final (or ultimate)
units from a given population or stratum. There are several stages in which the sampling
process is carried out. At first, the first stage units are sampled by some suitable method, such as
simple random sampling. Then, a sample of second stage unit is selected from each of the
selected first stage units, again by some suitable method which may be same as or different from
the method employed for the first stage units. Further stages may be added as required.
For example: Suppose we want to take a sample of 5,000 households from the city of Arba
Minch. At the first stage, the city may be divided into a number of sub-cities and a few sub-
cities are selected at random. At the second stage, each sub-city may be sub-divided into a
number of villages and a sample of villages may be taken at random. At the third stage, a number
of households may be selected from each of the villages selected at second stage.
Merits:
Limitations:
However, a multi-stage sample is in general less accurate than a sample containing the
same number of final stage units which have been selected by some suitable single stage
process.
1.2.2. Non- Probability sampling techniques
This is also known as non-random sample. Each element in the population does not have an
equal chance of being selected. Thus, the investigator does not consider the chance of the
elements in selecting the sample units. The following techniques are non-probability sampling:
a. Convenience sampling
In this scheme, a sample is obtained by selecting ‗convenient‘ population elements. For example,
a sample selected from the readily available sources or lists such as telephone directory or a
register of the small scale industrial units, etc. will give us a convenient sample. In these cases,
even if a random approach is used for identifying the units, the scheme will not be considered as
simple random sampling. For example, if one studies the wage structure in a nearby textile
industry by interviewing a few selected workers, then the scheme adopted is convenience
sampling. The results obtained by the convenience sampling method can hardly be said to be
representative of the population parameters. Therefore, the results obtained are generally biased
and unsatisfactory. However, the convenience sampling approach is generally used for pilot
studies, particularly for testing a questionnaire and obtaining preliminary information about the
population.
b. Quota sampling
In this method of sampling, the basic parameters which describe the population are identified
first. Then a sample is selected which conforms to these parameters. Thus, in a quota sample,
quotas are fixed according to these parameters, and each field investigator is assigned a
quota of the number of units to be interviewed. Within the pre-assigned quotas, the selection of
the sample elements depends on the personal judgment of the investigator. Quota sampling is generally
used in public opinion studies, election forecast polls, as there is not sufficient time to adopt a
probability sampling scheme.
c. Judgment sampling
Judgment sampling can also be called sampling by opinion. In this method, someone
who is well acquainted with the population decides which members (elementary units) in his or
her judgment would constitute a proper cross-section representing the parameters of relevance to
the study. This method of sampling is generally used in studies involving the performance of
personnel. This, of course, is not a scientific method, but in the absence of better evidence, such
a judgment method may have to be used.
d. Snowball sampling
With this approach, you initially contact a few potential respondents and then ask them whether
they know of anybody with the same characteristics that you are looking for in your research.
A sample is selected because it is simpler, less costly, and more efficient. However, it is unlikely
that the sample statistic would be identical to the population parameter. Thus, the measures
computed from the sample would probably not be exactly equal to the corresponding population
value. Therefore, one expects some difference between a sample statistic and the corresponding
population values or parameter. The difference between a sample statistic and a population
parameter is called sampling error.
E.g. suppose a population of five production employees had efficiency ratings of 97, 103, 96, 99,
and 105. The population mean is (97 + 103 + 96 + 99 + 105)/5 = 100. The production manager wants
to estimate the mean efficiency rating of the population using a sample of two ratings. Suppose the two
employees with efficiencies of 97 and 105 are selected; the sample mean is 101, so the sampling error
is 101 − 100 = 1.
Suppose another sample of two ratings, 103 and 96, is selected; the sample mean is 99.5 and the
sampling error is 99.5 − 100 = −0.5.
A sampling error occurs due to chance. However, the total error is not only due to
chance; there is also non-sampling error.
Non-sampling error would exist because of other factors such as errors in data collection,
editing, coding, analyzing or other biases. Thus, the difference between the sample statistic and
the population parameter consists of both sampling and the non-sampling errors.
Exercise 1.
The following table shows the total points scored in the 10 National Football League games
played during week 1 of the 2016 session.
23 25 33 51 62 24 31 58 47 49
In reality, of course we do not have all possible samples and all possible values of the statistic.
We have only one sample and one value of the statistic. This value is interpreted with respect to
all other outcomes that might have happened, as represented by the sampling distribution of the
statistic. In this lesson, we will refer to the sampling distributions of only the commonly used
sample statistics like sample mean, sample proportion, sample variance etc., which have a role in
making inferences about the population.
The sampling distribution of a statistic is the probability distribution of all possible values the
statistic may take when computed from random samples of the same size drawn from a specified
population. A sampling distribution is the distribution of the results if you actually selected all
possible samples. The single result you obtain in practice is just one of the results in the sampling
distribution.
Sampling Distribution of the sample means is a probability distribution of all possible sample
means of a given sample size. For instance, for the above five production employees' efficiency
ratings (97, 103, 96, 99, and 105), the possible samples of two ratings can be organized into a
probability distribution as follows.
First, find the possible sample of two for the population of five.
5C2 = 5!/[(5 − 2)! 2!] = 10, so ten different samples of 2 units are possible from the five units.

Sample no.    Sample units    Sample mean (x̄)
1             (97, 103)       100
2             (97, 96)        96.5
3             (97, 99)        98
4             (97, 105)       101
5             (103, 96)       99.5
6             (103, 99)       101
7             (103, 105)      104
8             (96, 99)        97.5
9             (96, 105)       100.5
10            (99, 105)       102
The sample mean is a random variable which can assume different values. It is random because the
selection of the sample units is by chance. The probability distribution of the sample means is called
the sampling distribution of sample mean.
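The table above can be reproduced with a short Python sketch:

    from itertools import combinations
    from statistics import mean

    # All 5C2 = 10 possible samples of size 2 from the five efficiency ratings,
    # and the sampling distribution of the sample mean built from them.
    ratings = [97, 103, 96, 99, 105]

    sample_means = [mean(pair) for pair in combinations(ratings, 2)]
    print(sample_means)  # 100, 96.5, 98, 101, 99.5, 101, 104, 97.5, 100.5, 102

    # The mean of all possible sample means equals the population mean (100),
    # which is what makes the sample mean an unbiased estimator.
    print(mean(sample_means), mean(ratings))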
The mean of these ten sample means is (100 + 96.5 + … + 102)/10 = 100, which equals the
population mean because we have considered all possible samples. The subscript x̄ in μx̄ indicates
that it is the mean of the sampling distribution. Therefore, the sample mean x̄ is an unbiased estimator of the
population mean (μ).
Knowledge of this sampling distribution and its properties will enable us to make probability
statements about how close the sample mean is to the population mean μ.
The mean of the sample means is equal to the mean of the population. When the expected value
of a sampling distribution of sample means equals the population parameter, we say that the
point estimator is unbiased.
The standard deviation of the sampling distribution of sample means equals the population
standard deviation divided by the square root of the sample size. It is also known as the standard
error of the sample mean (σx̄). If the population standard deviation is known, then the standard error is

σx̄ = σ / √n

where σx̄ = standard deviation of the sampling distribution of the sample means (the standard error),
σ = population standard deviation, and n = sample size.
When the population standard deviation is unknown, the sample standard deviation is used to compute
the standard error:

sx̄ = s / √n
The standard error of x̄ shows the spread in the distribution of the sample means. The spread in the
distribution of the sample means is less than the spread in the population values: the sample
means ranged from 96.5 to 104 while the population values vary from 96 to 105. Thus, the standard
error of the mean is smaller than the population standard deviation.
Standard error is affected by two values; standard deviation and sample size. If the standard
deviation is large, then the standard error will be large. As the sample size increases, the standard
error decreases, indicating that there is less variability in the distribution of sample means. As the
standard error becomes smaller the dispersion of the sample means tends to concentrate around
the population mean. Thus, as the standard error decreases, the precision increases; the difference
between the sample mean and the population mean narrows down.
This simple formula holds when the population is large and the sample is a small fraction of it.
If, however, a large fraction of a finite population is sampled (n ≥ 5% of N), we apply the finite
population correction factor (multiplier). This correction reduces the standard error of the
sample means.
σx̄ = (σ / √n) · √( (N − n) / (N − 1) )

where σ = population standard deviation, n = sample size, N = population size, and
√( (N − n) / (N − 1) ) is the finite population correction factor.
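A small Python sketch of the standard error calculation with and without the finite population correction; the numeric inputs (σ = 12, n = 40, N = 400) are hypothetical.

    import math

    def standard_error(sigma, n, N=None):
        """Standard error of the sample mean. The finite population correction
        sqrt((N - n) / (N - 1)) is applied when a population size N is given and
        the sample is at least 5% of the population."""
        se = sigma / math.sqrt(n)
        if N is not None and n >= 0.05 * N:
            se *= math.sqrt((N - n) / (N - 1))
        return se

    print(standard_error(12, 40))         # infinite-population formula
    print(standard_error(12, 40, N=400))  # with the correction factor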
iii. The distribution of sample means tends to be bell-shaped and to approximate
the normal distribution. As the sample size increases, the sampling distribution of
sample means approaches the normal distribution. Whenever the sample size is large
(n ≥ 30), the sampling distribution of sample means will be close to the normal distribution.
Then we can use the normal distribution for estimating the population parameter.
Now that the concept of a sampling distribution has been introduced and the standard error of the
mean has been defined, what distribution will the sample mean follow? If you are sampling
from a population that is normally distributed with mean μ and standard deviation σ, then regardless
of the sample size n, the sampling distribution of the mean is normally distributed, with mean
μx̄ = μ and standard error σx̄ = σ/√n.
Sometimes you need to find the interval that contains a fixed proportion of the sample means. To
do so, determine a distance below and above the population mean containing a specific area of
the normal curve.
If the sample is selected from normal population distribution, the sampling distribution of the
mean is also normal. However, when the sample is selected from a non-normally distributed
population, the shape of the sampling distribution of the mean is described by the central limit
theorem.
Central Limit Theorem states that if all samples of a particular size are selected from any
population, the sampling distribution of the sample means is approximately a normal
distribution. The approximation is more accurate for large samples (n≥30) than small samples.
The central limit theorem states that, for large samples (typically n ≥ 30), the shape of the
sampling distribution of the sample means is close to a normal distribution with mean μ and
standard deviation σ/√n. The dispersion in the sampling distribution of the sample means is
smaller than the dispersion in the population, and as the sample size gets larger, the standard
error of the sample mean becomes smaller and smaller. This shows that the sample means will
get closer and closer to the mean of the population. Hence, the shape of the sampling distribution
of sample means will be normal.
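A simulation sketch of the theorem (not part of the original text): sample means drawn from a strongly skewed exponential population still pile up around the population mean, and their spread shrinks roughly like σ/√n as n grows.

    import random
    from statistics import mean, stdev

    random.seed(0)

    def mean_of_sample(n):
        # one sample mean from an exponential population with mean 1
        return mean(random.expovariate(1.0) for _ in range(n))

    for n in (2, 10, 30, 100):
        means = [mean_of_sample(n) for _ in range(2000)]
        # average of the sample means stays near 1; their spread keeps shrinking
        print(n, round(mean(means), 3), round(stdev(means), 3))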
Often the comparison of two different populations is practical and important. For this purpose,
the study of Sampling distribution of the difference between two means is very much important.
Sampling distribution of the difference between two means is concerned with finding the
difference between sample means drawn from two populations. Thus, it helps in determining whether
the means of two populations are equal or not.
The following sample statistics characterizes the sampling distribution of the difference between
two sample means.
But if n≥5%N and sampling is done without replacement standard error of the difference
between the two samples means is:
σ(x̄1 − x̄2) = √[ (σ1²/n1)·((N1 − n1)/(N1 − 1)) + (σ2²/n2)·((N2 − n2)/(N2 − 1)) ]

The distribution of the difference between two sample means is generally derived on the assumption that
the two populations are normally distributed, with mean μ(x̄1 − x̄2) = μ1 − μ2 and standard error
σ(x̄1 − x̄2) = √( σ1²/n1 + σ2²/n2 ). The corresponding standardized statistic is

Z = [ (x̄1 − x̄2) − (μ1 − μ2) ] / σ(x̄1 − x̄2)
Suppose we take a random sample of n persons from a population; if x of these persons are
smokers, then the sample proportion is p = x/n. A proportion is the number of successes relative to the
total sample size. p is the point estimate of the population proportion (π).
If the population is "large" relative to the sample size (n/N is less than or equal to 0.05), then the
standard deviation of the sampling distribution of the sample proportion is

σp = √( π(1 − π) / n )
When several samples are taken for a large size population, the sampling distribution of the
sample proportion can be approximated by a normal distribution. The confidence interval for the
population proportion, using the sample proportion, is

p ± Z·σp = p ± Z·√( p(1 − p) / n )
For a finite population, where the total number of objects is N and the sample is a sizeable
fraction of it (n ≥ 5% of N), the standard error needs adjustment. The adjustment reduces the size of
the standard error, which yields a smaller range of values in estimating the population proportion.

σp = √( p(1 − p) / n ) · √( (N − n) / (N − 1) )
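As an illustration, a minimal Python sketch of the normal approximation to the sampling distribution of a sample proportion; the figures (π = 0.40, n = 200) are hypothetical.

    from math import sqrt
    from statistics import NormalDist

    # Probability that the sample proportion falls between 0.40 and 0.43 when the
    # population proportion is 0.40 and n = 200 (normal approximation).
    pi, n = 0.40, 200
    se = sqrt(pi * (1 - pi) / n)   # standard error of the proportion

    z_low = (0.40 - pi) / se
    z_high = (0.43 - pi) / se
    prob = NormalDist().cdf(z_high) - NormalDist().cdf(z_low)
    print(round(se, 4), round(prob, 4))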
Statistics problems often involve comparisons between two independent sample proportions. The
sampling distribution of the difference between two sample proportions is concerned with
determining whether two samples from different populations have equal proportions or not. Hence
the mean and standard error of the sampling distribution of the difference between the two
sample proportions are:

μ(p̂1 − p̂2) = π1 − π2

σ(p̂1 − p̂2) = √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )

If n1 and n2 are large (each ≥ 30), then the distribution of the difference between the sample proportions is
closely approximated by the normal distribution.
Exercise
1. The amount of time a bank teller spends with each customer has a population mean of 3.10
minutes and a standard deviation of 0.40 minute. If a random sample of 16 customers is
selected, what is the probability that the average time spent per customer will be at least 3
minutes?
2. The following table shows the total points scored in the 10 National Football League games
played during week 1 of the 2016 session.
23 25 33 51 62
24 31 58 47 49
3. The age of customers for a particular retail store follows a normal distribution with a mean of
37.5 years and standard deviation of 15 years. Given that the sample size is 36.
A. Compute the standard error.
B. What is the probability that the next customer who enters the store will be more than 31
years old?
C. What is the probability that the next customer who enters the store will be less than 42
years old?
4. The manager of the local branch of a savings bank has determined that 40% of all depositors
have multiple accounts at the bank. If a random sample of 200 depositors is selected, what is
the probability that the sample proportion of depositors with multiple accounts will be
between 0.40 and 0.43?
5. A population proportion has been estimated at 0.32. Calculate the following with a sample
size of 160.
A. Find the probability of getting a sample proportion of at most 0.30.
B. Find the probability of getting a sample proportion of at least 0.36.
6. The American Council of Life Insurance and the Life Insurance Marketing and Research
Association have reported that insured households with heads 35 to 44 years old had an
average of $186,100 of life insurance coverage. Assuming a normal distribution and a
standard deviation of $40,000, what is the probability that a random sample of 64 households
with heads in this age group had a mean of less than $195,000 in life insurance coverage?
7. A sample of 125 is drawn from a population with proportion equal to 0.65. Determine the
probability of observing
A. 80 or fewer successes
B. 82 or fewer successes
C. 75 or more successes
8. According to Smith Travel Research, the average hotel price in the United States in 2009 was
$97.86. Assume the population standard deviation is $18.00 and that a random sample of
35 hotels was selected.
Tilahun Dessie 8
Kedir Husien 6
Mengistu Gebremariam 4
Karo Algase 10
Gemechu Bedaso 6
had a bathtub. According to the U.S. Census Bureau, the mean completion time for the long
form is 38 minutes. Assuming a standard deviation of 5 minutes and a simple random sample
of 50 persons who filled out the long form, what is the probability that their average time for
completion of the form was more than 45 minutes?
11. A diameter of a component produced on a semi-automatic machine is known to be
distributed normally with a mean of 10 mm and standard deviation of 0.1 mm. If a random
sample of size 5 is picked up, what is the probability that the sample mean will be between
9.95 mm and 10.05 mm?
12. The strength of the wire produce by company A has a mean of 4,500 kg and a standard
deviation of 200 kg. Company B has a mean of 4000 kg and a standard deviation of 300 kg.
If 50 wires of company A and 100 wires of company B are selected at random and tested
for strength, what is the probability that the sample mean strength of A will be at least 600kg
more than that of B?
13. Assume that 2% of the items produced in an assembly line operation are defective, but that
the firm‘s production manager is not aware of this situation. What is the probability that in a
lot of 400 such items, 3% or more will be defective?
14. A manufacturer of bottles has found that on average 0.04 of the bottles produced are
defective. A random sample of 400 bottles is examined for the proportion of defective
bottles. Find the probability that the proportion of defective bottles in the sample is
between 0.02 and 0.05.
CHAPTER TWO
STATISTICAL ESTIMATION
INTRODUCTION
The sampling process is used to draw statistical inference about the characteristics of a
population or process of interest. On many occasions we do not have enough information to
calculate an exact value of population parameters (such as μ, σ and P) and therefore make the
best estimate of this value from the corresponding sample statistics (such as x, s, and ̅𝑝). The
need to use the sample statistic to draw conclusions about the population characteristic is one of
the fundamental applications of statistical inference in business and economics.
Definition of Terms
Estimation: is the process of making judgment or opinion about the population characteristics
from the information obtained from the scientifically selected sample.
Estimate: is a specific value/opinion made on the bases of sample information about population.
Estimator: a rule that tells us how to estimate a value for a population parameter using sample
data. It is the sample statistic used to make decision or opinion about population parameter.
Types of Estimates
There are two types of estimates that we can make about a population: a point estimate and an
interval estimate. A point estimate is a single number, which is used to estimate an unknown
population parameter. Although a point estimate may be the most common way of expressing an
estimate, it suffers from a major limitation since it fails to indicate how close it is to the quantity
it is supposed to estimate.
In other words, a point estimate does not give any idea about the reliability or precision of the
method of estimation used. For instance, if someone claims that 40 percent of all children in a
certain town do not go to the school and are devoid of education, it would not be very helpful if
this claim is based on a small number of households, say, 20. However, as the number of
households interviewed for this purpose increases from 20 to 100, 500 or even 5,000, the claim
that 40 percent of children have no school education would become more and more meaningful
and reliable. This makes it clear that a point estimate should always be accompanied by some
relevant information so that it is possible to judge how far it is reliable.
The second type of estimate is known as the interval estimate. It is a range of values used to
estimate an unknown population parameter. In case of an interval estimate, the error is indicated
in two ways: first by the extent of its range; and second, by the probability of the true population
parameter lying within that range. Taking our previous example of 40 percent children not
having a school education, the statistician may say that actual percentage of such children in that
town may lie between 35 percent and 45 percent. Thus, he will have a better idea of the
reliability of such an estimate as compared to the point estimate of 40 percent.
1. Point Estimation
In point estimation, a single sample statistic (such as x̄, s, and p) is calculated from the sample to
provide a best estimate of the true value of the corresponding population parameter (such as μ, σ
and p). Such a single relevant statistic is termed as point estimator, and the value of the statistic
is termed as point estimate.
There are four criteria by which we can evaluate the quality of a statistic as an estimator. These
are: Unbiasedness, efficiency, consistency and sufficiency.
i. Unbiasedness
This is a very important property that an estimator should possess. If we take all possible
samples of the same size from a population and calculate their means, the mean of all these
sample means will be equal to the mean μ of the population. This means that the sample mean x̄ is an
unbiased estimator of the population mean μ:

E(x̄) = μ.
ii. Consistency
Another important characteristic that an estimator should possess is consistency. Let us take the
case of the standard deviation of the sampling distribution of sample mean. The standard
deviation of the sampling distribution of the sample mean is computed by the following formula:

σx̄ = σ / √n

The formula states that the standard deviation of the sampling distribution of x̄ decreases as the
sample size increases, and vice versa. When the sample size n increases, the population standard
deviation σ is divided by a larger denominator. This results in a reduced value of the
standard error σx̄.
iii. Efficiency
An estimator is efficient if, among the unbiased estimators of a parameter, it has the smallest
variance (standard error); other things being equal, the more efficient estimator is preferred.
iv. Sufficiency
The fourth property of a good estimator is that it should be sufficient. A sufficient statistic
utilizes all the information a sample contains about the parameter to be estimated. The sample mean x̄, for
example, is a sufficient estimator of the population mean μ. It implies that no other estimator of
μ, such as the sample median, can provide any additional information about the parameter μ.
The sample mean x̄ is an unbiased, consistent, and efficient estimator of the population mean (μ).
The estimator of the population proportion (π) is the sample proportion (p). The estimator of the
population variance of a normal distribution is the sample variance.
Example 1: Values of six sample measurements of the diameter of a sphere were recorded by a
scientist as 5.35, 6.27, 6.50, 5.86, 6.32 and 5.70 mm. Determine unbiased and efficient estimates
of:
a. the population mean
b. the population variance
Solution
a. x̄ = Σx/n = (5.35 + 6.27 + 6.50 + 5.86 + 6.32 + 5.70)/6 = 36.00/6 = 6.00 mm. The sample mean is an
unbiased and efficient estimate of the population mean.
b. s² = Σ(x − x̄)²/(n − 1) = 0.9574/5 ≈ 0.1915 mm². The sample variance with the n − 1 divisor is an
unbiased estimate of the population variance.
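The same estimates can be checked with a few lines of Python (statistics.variance uses the unbiased n − 1 divisor):

    from statistics import mean, variance

    diameters = [5.35, 6.27, 6.50, 5.86, 6.32, 5.70]

    x_bar = mean(diameters)          # 6.00 mm
    s_squared = variance(diameters)  # about 0.1915 mm^2 (n - 1 divisor)
    print(x_bar, round(s_squared, 4))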
The sample proportion is the convenient estimator of the population proportion. The point estimate for
the population proportion is found by dividing the number of successes in the sample by the total
number sampled. If a sample of n units is selected from a population of N units and it is found that,
out of the sample, x units are unfavorable items, then the proportion of unfavorable items is
computed as p = x/n.
Example 2: In a company, there are 1600 employees. A random sample of 400 employees was
taken to ask them their views on the proposed productivity incentive scheme. Out of these 184
expressed their dissatisfaction. Determine a point estimate of this proportion.
p = x/n = 184/400 = 0.46.
Thus, we can say that the proportion of employees against the incentive scheme would be 0.46.
2. Interval Estimation
Generally, a point estimate does not provide information about ‗how close is the estimate‘ to the
population parameter unless accompanied by a statement of possible sampling errors involved
based on the sampling distribution of the statistic. It is therefore important to know the precision
of an estimate before relying on it to make a decision. Thus, decision-makers prefer to use an
interval estimate that is likely to contain the population parameter value.
However, it is also important to state ‗how confident‘ he is that the interval estimate actually
contains the parameter value. Hence an interval estimate of a population parameter is therefore a
confidence interval with a statement of confidence that the interval contains the parameter value.
The confidence interval estimate of a population parameter is obtained by applying the formula:

Point estimate ± zα/2 × (standard error of the point estimate)

where zα/2 is the critical value of the standard normal variable for the chosen confidence level
(probability of being correct), such as 0.90, 0.95, or 0.99.
Suppose the population mean μ is unknown and the true population standard deviation σ is
known. Then for a large sample size (n ≥ 30), the interval estimate of the population mean μ is
given by

x̄ ± zα/2 · σ/√n

where zα/2 is the z-value representing an area of α/2 in the right and left tails of the standard normal
probability distribution, and (1 − α) is the level of confidence.
For example, if a 95 percent level of confidence is desired to estimate the mean, then 95 percent
of the area under the normal curve would be divided equally on either side of the mean, leaving an
area equal to 47.5 percent between the mean and each limit.
If n = 100 and σ = 25, then σx̄ = σ/√n = 25/√100 = 2.5. Using a table of areas for the standard
normal probability distribution, 95 percent of the values of a normally distributed population are
within ±1.96 standard errors, i.e. within ±1.96(2.5) = ±4.90.
Hence 95 percent of the sample means will be within ±4.90 of the population mean μ. In other
words, there is a 0.95 probability that the sample mean will provide a sampling error |x̄ −
μ| of 4.90 or less. The value 0.95 is called the confidence coefficient, and the interval estimate x̄ ± 4.90
is called a 95 percent confidence interval.
Example 3: The average monthly electricity consumption for a sample of 100 families is 1250
units. Assuming the standard deviation of electric consumption of all families is 150 units;
construct a 95 percent, confidence interval estimates of the actual mean electric consumption.
Solution:
The information given is: ̅ =1250, σ =150, n= 100 and confidence level (1-α) = 95 percent.
Using the standard normal curve, we find that half of 0.95 (i.e. 0.475 on each side of the mean) yields a
confidence coefficient zα/2 = 1.96. Thus, the confidence limits for 95 percent confidence are given by

x̄ ± zα/2 · σ/√n = 1250 ± 1.96(150/√100) = 1250 ± 29.40

Thus, for a 95 percent level of confidence, the population mean μ is likely to fall between 1220.60
units and 1279.40 units, that is, 1220.60 ≤ μ ≤ 1279.40.
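A minimal Python sketch of the interval in Example 3:

    from math import sqrt
    from statistics import NormalDist

    # 95% confidence interval for the mean: x_bar ± z(alpha/2) * sigma / sqrt(n)
    x_bar, sigma, n, conf = 1250, 150, 100, 0.95

    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # about 1.96
    margin = z * sigma / sqrt(n)                  # about 29.4 units
    print(round(x_bar - margin, 2), round(x_bar + margin, 2))  # about 1220.6, 1279.4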
Example 4: The quality control manager at a factory manufacturing light bulb is interested to
estimate the average life of a large shipment of light bulbs. The standard deviation is known to
be 100 hours. A random sample of 50 light bulbs gave a sample average life of 350 hours.
a) Setup a 95 percent confidence interval estimate of the true average life of light bulbs in
the shipment.
b) Does the population of light bulb life have to be normally distributed? Explain.
a) Using the standard normal curve, we have zα/2 = ±1.96 for a 95 percent confidence
level. Thus, the confidence limits are given by

x̄ ± zα/2 · σ/√n = 350 ± 1.96(100/√50) = 350 ± 27.72

Hence, for a 95 percent level of confidence, the population mean μ is likely to fall between
322.28 hours and 377.72 hours.
b) No. Since σ is known and n = 50, the central limit theorem lets us assume that x̄ is
normally distributed, so the population of bulb lives does not itself have to be normal.
b. Interval estimate of a population mean: small sample
When the sample size is small (n < 30) and σ is unknown, the sampling distribution is no longer treated as
normal. In such a case, Student's t-distribution is used. Both the normal and the t-distribution are
symmetrical, but the t-distribution is flatter than the normal distribution.
There is a different t-distribution for each sample size: for a sample of size n there are n − 1 degrees
of freedom, the number of values that are free to vary. Because s is used in place of σ, the t critical
values are larger than the corresponding z values, which widens the confidence interval so that it
still has the stated chance of containing the population parameter.
Example 5: A firm has appointed a large number of dealers all over the country to sell its
bicycles. It is interested in knowing the average sales per dealer. A random sample of 25 dealers
is selected for this purpose. The sample mean is 50,000 birr and the sample standard deviation is 20,000
birr. Construct an interval estimate with 95% confidence.
For a sample of size 25, at the 5% level of significance, the t-value from the table is t(0.025, 24) = 2.064.

x̄ ± t(0.025, 24) · s/√n = 50,000 ± 2.064(20,000/√25) = 50,000 ± 8,256

This can be interpreted as: we are 95% confident that the interval from 41,744
birr to 58,256 birr contains the population mean.
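A sketch of the same t-interval, assuming SciPy is available for the t critical value:

    from math import sqrt
    from scipy import stats

    # Small-sample 95% interval (Example 5): x_bar ± t(alpha/2, n - 1) * s / sqrt(n)
    x_bar, s, n, conf = 50_000, 20_000, 25, 0.95

    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)   # about 2.064
    margin = t_crit * s / sqrt(n)                        # about 8,256 birr
    print(round(x_bar - margin), round(x_bar + margin))  # about 41,744 and 58,256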
You know that the normal distribution as an approximation of the sampling distribution of the sample
proportion p = x/n is based on the large-sample conditions np > 5 and nq = n(1 − p) > 5, where p is
the population proportion. The confidence interval estimate for a population proportion at the 1 − α
confidence coefficient is given by

p ± zα/2 · √( p(1 − p) / n )

where zα/2 is the z-value providing an area of α/2 in the right tail of the standard normal
probability distribution, and the quantity zα/2·σp is the margin of error.
Example 6: Suppose we want to estimate the proportion of families in a town, which have two
or more children. A random sample of 144 families shows that 48 families have two or more
children. Setup a 95 percent confidence interval estimate of the population proportion of families
having two or more children.
Using the information n = 144, p = 48/144 = 1/3, and zα/2 = 1.96 at the 95 percent confidence coefficient, we
have

p ± zα/2·√(p(1 − p)/n) = 0.333 ± 1.96·√(0.333 × 0.667/144) = 0.333 ± 0.077
Hence the population proportion of families who have two or more children is likely to be
between 25.6 to 41 per cent, that is, 0.256 ≤ p ≤ 0.410.
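The interval in Example 6 can be verified with a short sketch:

    from math import sqrt
    from statistics import NormalDist

    # 95% confidence interval for a proportion: p ± z(alpha/2) * sqrt(p(1 - p)/n)
    x, n, conf = 48, 144, 0.95

    p = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    margin = z * sqrt(p * (1 - p) / n)
    print(round(p - margin, 3), round(p + margin, 3))  # about 0.256 and 0.410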
In the business world, sample sizes are determined prior to data collection to ensure that the
confidence interval is narrow enough to be useful in making decisions. Determining the proper
sample size is a complicated procedure, subject to the constraints of budget, time, and the
amount of acceptable sampling error. From previous sections we understand that the standard error
decreases as the sample size increases, which in turn narrows the confidence interval.
Obviously, the width or range of the confidence interval can be decreased by increasing the
sample size n. The decision regarding the appropriate size of the sample, however, depends on (i)
deciding in advance how good an estimate is required, and (ii) the availability of funds, time, and
ease of sample selection.
To develop an equation for determining the appropriate sample size needed when constructing a
confidence interval estimate for the mean, recall the interval

x̄ ± zα/2 · σ/√n

The amount added to or subtracted from x̄ is equal to half the width of the interval. This quantity
represents the amount of imprecision in the estimate that results from sampling error. The
sampling error, e, is defined as

e = zα/2 · σ/√n

Solving for n gives the sample size needed to construct the appropriate confidence interval
estimate for the mean:

n = zα/2² · σ² / e²

"Appropriate" means that the resulting interval will have an acceptable
amount of sampling error.
Therefore, you should select a sample of 97 insulators because the general rule for determining
sample size is to always round up to the next integer value in order to slightly over satisfy the
criteria desired.
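A short sketch of the sample-size formula for the mean, n = (zα/2·σ/e)²; the insulator example's inputs are not shown in the text, so the planning values below are illustrative.

    from math import ceil
    from statistics import NormalDist

    # Sample size for estimating a mean to within sampling error e:
    # n = (z * sigma / e) ** 2, rounded UP to the next whole number.
    sigma, e, conf = 25.0, 5.0, 0.95   # illustrative planning values

    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    n = ceil((z * sigma / e) ** 2)
    print(n)  # 97 for these particular inputs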
So far in this section, you have learned how to determine the sample size needed for estimating
the population mean. Now suppose that you want to determine the sample size necessary for
estimating a population proportion. To determine the sample size needed to estimate a population
proportion, you use a method similar to the method for a population mean. Recall that in
developing the sample size for a confidence interval for the mean, the sampling error is defined
by

e = zα/2 · σ/√n

When estimating a proportion, you replace σ with √(π(1 − π)). Thus, the sampling error is

e = zα/2 · √( π(1 − π) / n )

Solving for n, you have the sample size necessary to develop a confidence interval estimate for a
proportion:

n = zα/2² · π(1 − π) / e²
In practice, selecting these quantities requires some planning. Once you determine the desired
level of confidence, you can find the appropriate Za/2 value from the standardized normal
distribution. The sampling error, e, indicates the amount of error that you are willing to tolerate
in estimating the population proportion. The third quantity, π, is the population
parameter that you want to estimate; since it is unknown, a planning value based on past data (or the
conservative value 0.5) is used in its place.
Example 7: Suppose that the auditing procedures require you to have 95% confidence in
estimating the population proportion of sales invoices with errors to within ±0.07. The results
from past months indicate that the largest proportion has been no more than 0.15. Determine
the sample size.
Given: e = 0.07, zα/2 = 1.96 (95% confidence), π = 0.15
Solution: n = zα/2²·π(1 − π)/e² = (1.96)²(0.15)(0.85)/(0.07)² ≈ 99.96
Because the general rule is to round the sample size up to the next whole integer to slightly over-
satisfy the criteria, a sample size of 100 is needed.
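A short sketch of the sample-size calculation in Example 7:

    from math import ceil
    from statistics import NormalDist

    # Sample size for estimating a proportion: n = z^2 * p * (1 - p) / e^2,
    # rounded UP to the next whole number.
    e, conf, p = 0.07, 0.95, 0.15

    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    n = ceil(z ** 2 * p * (1 - p) / e ** 2)
    print(n)  # 100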
Confidence Intervals for the Difference between Two Means Using the Normal Distribution
There is often a need to estimate the difference between two population means, such as the
difference between the wage levels in two firms. The unbiased point estimate of (µ 1 - µ 2)
is (X 1 – X 2). The confidence interval is constructed in a manner similar to that used for
estimating the mean, except that the relevant standard error for the sampling distribution is
the standard error of the difference between means. Use of the normal distribution is based
on the same conditions as for the sampling distribution of the mean, except that two
samples are involved. The formula used for estimating the difference between two
population means with confidence intervals is

(x̄1 − x̄2) ± z · σ(x̄1 − x̄2)

When the standard deviations of the two populations are known, the standard error of the
difference between means is

σ(x̄1 − x̄2) = √( σ1²/n1 + σ2²/n2 )

When the standard deviations of the populations are not known, the estimated standard error of
the difference between means (given that use of the normal distribution is appropriate) is

s(x̄1 − x̄2) = √( s1²/n1 + s2²/n2 )
Example 8: The mean weekly wage for a sample of n1 = 30 employees in a large manufacturing
firm is x̄1 = $280.00 with a sample standard deviation of s1 = $14.00. In another large firm a
random sample of n2 = 40 hourly employees has a mean weekly wage of x̄2 = $270.00 with a sample
standard deviation of s2 = $10.00. The 99 percent confidence interval for estimating the difference
between the mean weekly wage levels in the two firms is

(x̄1 − x̄2) ± z·√(s1²/n1 + s2²/n2) = (280 − 270) ± 2.58·√(14²/30 + 10²/40) = 10 ± 2.58(3.01) ≈ 10 ± 7.77

Thus, we can state that the average weekly wage in the first firm is greater than the average in the
second firm by an amount somewhere between $2.23 and $17.77, with 99 percent confidence in this
interval estimate. Note that the sample sizes are large enough to permit the use of z to approximate
the t value.
The t Distribution and Confidence Intervals for the Difference between Two Means
The t distribution is used when the following conditions hold:
1. The population standard deviations are not known.
2. Samples are small (n < 30). If samples are large, then t values can be approximated by the
standard normal z.
3. Populations are assumed to be approximately normally distributed (note that the central
limit theorem cannot be invoked for small samples).
4. In addition to the above, when the t distribution is used to define confidence intervals for
the difference between two means, rather than for inference concerning only one
population mean, an additional assumption usually required is that the two population
variances are equal: σ1² = σ2².
Because of the above equality assumption, the first step in determining the standard error of the
difference between means when the t distribution is to be used typically is to pool the two sample
variances:

sp² = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2)
The standard error of the difference between means based on the pooled variance estimate sp² is

s(x̄1 − x̄2) = √( sp²/n1 + sp²/n2 ) = √( sp²(1/n1 + 1/n2) )
Note: Some computer software does not require that the two population variances be assumed to be
equal. Instead, a corrected value for the degrees of freedom is determined that results in reduced df,
and thus in a somewhat larger value of t and somewhat wider confidence interval.
Example 9: For a random sample of n1 = 10 bulbs, the mean bulb life is x̄1 = 4,600 hr with
s1 = 250 hr. For another brand of bulbs, the mean bulb life and standard deviation for a sample of
n2 = 8 bulbs are x̄2 = 4,000 hr and s2 = 200 hr. The bulb life for both brands is assumed to
be normally distributed. The 90 percent confidence interval for estimating the difference between
the mean operating life of the two brands of bulbs is

(x̄1 − x̄2) ± t(0.05, 16)·s(x̄1 − x̄2) = (4,600 − 4,000) ± 1.746(108.8) ≈ 600 ± 190

Thus, we can state with 90 percent
confidence that the first brand of bulbs has a mean life that is greater than that of the second
brand by an amount between 410 and 790 hr.
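A sketch of the pooled-variance interval in Example 9, assuming SciPy is available:

    from math import sqrt
    from scipy import stats

    # 90% CI for the difference between two means with a pooled variance (Example 9)
    n1, x1, s1 = 10, 4600, 250
    n2, x2, s2 = 8, 4000, 200

    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
    se = sqrt(sp2 * (1 / n1 + 1 / n2))                           # about 108.8
    t_crit = stats.t.ppf(0.95, df=n1 + n2 - 2)                   # about 1.746 (90% CI)

    diff = x1 - x2
    print(round(diff - t_crit * se), round(diff + t_crit * se))  # about 410 and 790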
The normal approximation is used on the basis of the same large-sample conditions as for a single
proportion, and the requirements apply to each of the two samples. The confidence interval for
estimating the difference between two population proportions is

(p̂1 − p̂2) ± z · s(p̂1 − p̂2)

The standard error of the difference between proportions is determined by the following formula,
wherein the value of each respective standard error of the proportion is calculated from its own sample:

s(p̂1 − p̂2) = √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )
Example 10: In Example 3 it was reported that a proportion of 0.40 men out of a random
sample of 100 in a large community preferred the client firm's razor blades to all others. In
another large community, 60 men out of a random sample of 200 men prefer the client firm's
blades (p̂2 = 0.30). The 90 percent confidence interval for the difference in the proportion of men in the two
communities preferring the firm's blades is

(0.40 − 0.30) ± 1.645·√(0.40(0.60)/100 + 0.30(0.70)/200) = 0.10 ± 1.645(0.059) ≈ 0.10 ± 0.097

that is, from about 0.003 to 0.197.
Estimation summary
Estimation is the process of making a statement about an unknown population characteristic from
information gathered from a sample. There are two types of estimates we can make about a
population: point and interval estimates.
Point Estimation: a single value that best describes the population of interest.
In this method, the sample statistic is used directly as the value of the population parameter:
the sample mean and sample proportion are taken as the population mean and proportion,
respectively.
In this case the margin of error around the estimate is treated as zero, even though the
actual sampling error (x̄ − µ) is usually not zero.
Interval Estimation: a range of values that best describes the population of interest.
In this method, the population parameter is estimated by the sample statistic plus or
minus a margin of error.
This range is what we call the confidence interval (lower confidence limit and upper
confidence limit).
Confidence Interval = Point estimate ± Margin of error.
In the point estimate method the margin of error is treated as zero.
Margin of error: represents the half-width of the confidence interval, i.e. the distance between the
sample mean and its upper limit, or between the sample mean and its lower limit.

ME = zα/2 · σ/√n

CI = Point estimate ± ME

A confidence interval is an interval estimate around a sample mean that provides us with a range of
values within which the true population mean is likely to lie.
There are different cases that we need to consider in computing a confidence interval.
Case 1: When σ is known
In some cases, the population standard deviation σ is known even if we do not know the value of the
population mean.

Margin of error: ME = zα/2 · σ/√n
UCI = x̄ + zα/2 · σ/√n
LCI = x̄ − zα/2 · σ/√n

Case 2: When σ is unknown and n ≥ 30
What happens if σ is unknown? As long as n ≥ 30, we can substitute S, the sample
standard deviation, for σ, the population standard deviation, and follow the same procedure as
before.
Case 3: When the sample size is small (n < 30)
When the sample size is less than 30 and σ is known, the procedure reverts to the large-
sample case, i.e. the procedure is the same as already discussed. We can do this
because we are now assuming the population is normally distributed.
More often, we do not know the value of σ. Here, we make an adjustment similar to the one made
earlier and substitute S. However, because of the small sample size, this substitution forces us to
use a new probability distribution known as the t-distribution.
Exercises
1. A simple random sample of 50 items from a population with σ =6 resulted in a sample mean
of 32.
a. Provide a 90% confidence interval for the population mean.
b. Provide a 95% confidence interval for the population mean.
c. Provide a 99% confidence interval for the population mean.
2. A simple random sample of 60 items resulted in a sample mean of 80. The population
standard deviation is σ =15.
a. Compute the 95% confidence interval for the population mean.
b. Assume that the same sample mean was obtained from a sample of 120 items. Provide a
95% confidence interval for the population mean.
c. What is the effect of a larger sample size on the interval estimate?
3. The following data are from a simple random sample. 5, 8, 10, 7, 10 and 14
4. A survey question for a sample of 150 individuals yielded 75 Yes responses, what is the
point estimate of the proportion in the population who respond Yes?
5. The undergraduate grade point average (GPA) for students admitted to the top graduate
business schools was 3.37. Assume this estimate was based on a sample of 120 students
admitted to the top schools. Using past years‘ data, the population standard deviation can be
assumed known with σ = .28. What is the 95% confidence interval estimate of the mean
undergraduate GPA for students admitted to the top graduate business schools?
6. A survey of small businesses with Web sites found that the average amount spent on a site
was $11,500 per year. Given a sample of 60 businesses and a population standard deviation
of σ = $4000, what is the margin of error? Use 95% confidence.
7. A simple random sample of 25 has been collected respondents have the sample mean 342
and the sample standard deviation is 14.9. Construct and interpret the 95% and 99%
confidence intervals for the population mean.
8. A National Retail Foundation survey found households intended to spend an
average of $649 during the December holiday season. Assume that the survey
included 600 households and that the sample standard deviation was $175.
a. With 95% confidence, what is the margin of error?
b. What is the 95% confidence interval estimate of the population
mean?
9. A machine that stuffs a cheese-filled snack product can be adjusted for the amount of cheese
injected into each unit. A simple random sample of 50 units is selected, and the average
amount of cheese injected is found to be 3.5 grams. If the process standard deviation is
known to be 0.25 grams, construct the 95% confidence interval for population mean of
cheese being injected by the machine.
10. A pharmaceutical company found that 46% of 1000 U.S. adults surveyed knew
neither their blood pressure nor their cholesterol levels. Assuming the persons surveyed to be
a simple random sample of U.S. adults, construct a 95% confidence interval for Population
Proportion of U.S. adults who would have given the same answer if a census had been taken
instead of a survey.
11. A survey of 611 office workers investigated telephone answering
practices, including how often each office worker was able to answer
incoming telephone calls and how often incoming telephone calls went
directly to voice mail. A total of 281 office workers indicated that they
never need voice mail and are able to take every telephone call.
a) What is the point estimate of the proportion of the population of
office workers who are able to take every telephone call?
b) At 90% confidence, what is the margin of error?
c) What is the 90% confidence interval for the proportion of the
population of office workers who are able to take every telephone
call?
12. An airline has surveyed a simple random sample of air travelers to find out whether they
would be interested in paying a higher fare in order to have access to e-mail during their
flight. Of the 400 travelers surveyed, 80 said e-mail access would be worth a slight extra
cost. Construct a 95% confidence interval for the population proportion of air travelers who
are in favor of the airline‘s e-mail idea.
13. Based on a preliminary study, the population standard deviation has been estimated as 11.2
watts for these sets. In undertaking a larger study, and using a simple random sample, how
many sets must be tested for the firm to be 95% confident that its margin of error is no more than 3.0 watts?
14. A national political candidate has commissioned a study to determine the percentage of
registered voters who intend to vote for him in the upcoming election. To have 95%
confidence that the sample percentage will be within 3 percentage points of the actual
population percentage, how large a simple random sample is required?
15. In reporting the results of their survey of a simple random sample of U.S. registered voters,
pollsters claim 95% confidence that their sampling error (margin of error) is 0.04. Given this
information only, what sample size was used?
16. The following data show the number of hours per day 12 adults spent in front of screens
watching television content; assume the times come from a normal distribution:
2 5 4 4 6 7
4 2 3 1 2 3
Construct a 95% confidence
interval to estimate the average number of hours per day adults spend in watching
television.
17. The Chevrolet dealers of a large county are conducting a study to determine the proportion of
car owners in the county who are considering the purchase of a new car within the next year.
If the population proportion is believed to be no more than 0.15, how many owners must be
included in a simple random sample if the dealers want to be 90% confident that the
maximum likely error will be no more than 0.02?
18. Ford Motor Company introduced a new minibus which has greater fuel economy than the
regular sized minibus. A random sample of 50 minibuses averaged 30 miles per gallon, and
had standard deviation of 3 miles per gallon. Construct a 95 percent confidence interval for
the mean miles per gallon for all minibuses.
19. Interviewers called a random sample of 300 homes while ―Ehud Mezinanya‖ is being aired.
105 respondents said they were watching the program. Construct a 95% confidence interval
for the proportion of all homes where the program was being watched.
20. A cattle raiser selected a random sample of 10 steers, all of the same age, and fed them a special
mixture of grains and other ingredients. After a period of time, weight gains were recorded.
The sample mean weight gain, per steer, was 142.6 pounds and standard deviation was 10.4
pounds. Suppose weight gains are normally distributed. Construct a 90% confidence interval
for the population mean weight gain per steer.
21. The diameters of ball bearings made by an automatic machine are normally distributed and
have a standard deviation of 0.02 mm. The mean of a random sample of four ball bearings is
6.01 mm. Construct the 95 percent confidence interval for the mean diameter of all ball bearings being
made by the machine.
22. The proportion of all consumers favoring a new product might be as low as 0.20 or as high as
0.60. A random sample is to be used to estimate the proportion of the consumers who favor
the new product to within ±0.05, with a confidence coefficient of 90%. To be on the safe
(larger sample) side, what sample size should be used?
23. A 95% confidence interval for a population mean was reported to be 152 to 160. If σ =15,
what sample size was used in this study?
24. A sample of 16 ten-year-old girls gave a mean weight of 71.5 and a standard deviation of 12
pounds. Assuming normality, find the 90, 95, and 99 percent confidence intervals for the
population mean weight.
CHAPTER THREE
TESTING OF HYPOTHESES
1. INTRODUCTION
Closely related to Statistical Estimation discussed in the preceding lesson, Testing of Hypotheses
is one of the most important aspects of the theory of decision-making. In the present lesson, we
will study a class of problems where the decision made by a decision-maker depends primarily
on the strength of the evidence thrown up by a random sample drawn from a population. We can
elaborate this by an example where the operations manager of a cola company has to decide
whether the bottling operation is under statistical control or it has gone out of control (and needs
some corrective action). Imagine that the company sells cola in bottles labeled 1-liter, filled by
an automatic bottling machine. The implied claim that on the average each bottle contains 1,000
cm3 of cola may or may not be true. If the claim is true, the process is said to be under statistical
control. It is in the interest of the company to continue the bottling process. If the claim is not
true, i.e. the average is either more than or less than 1,000 cm3, the process is said to have gone out
of control. It is then in the interest of the company to halt the bottling process and set right the error.
Therefore, to decide about the status of the bottling operation, the operations manager needs a
tool, which allows him to test such a claim. Testing of Hypotheses provides such a tool to the
decision-maker. If the operations manager were to use this tool, he would collect a sample of
filled bottles from the on-going bottling process. The sample of bottles will be evaluated and
based on the strength of the evidence produced by the sample; the operations manager will
accept or reject the implied claim and accordingly make the decision. The implied claim (μ =
1,000 cm3) is a hypothesis that needs to be tested and the statistical procedure, which allows us to
perform such a test, is called Hypothesis Testing or Testing of Hypotheses.
What is a Hypothesis?
Is a hypothesis something that has been proven to be true? No. A hypothesis is something that
has not yet been proven to be true. It is a statement about a population parameter or
about a population distribution.
Our hypothesis for the example of the bottling process could be: "on the average, each bottle contains
1,000 cm3 of cola (μ = 1,000 cm3)".
This statement is tentative, as it implies an assumption which may or may not be found
valid on verification.
Hypothesis testing is the process of determining whether or not a given hypothesis is true.
If the population is large, there is no way of analyzing the population or of testing the hypothesis
directly. Instead, the hypothesis is tested on the basis of the outcome of a random sample. The
hypothesis being tested, called the null hypothesis and denoted H0, is in our example:
H0: μ = 1,000
The alternative hypothesis is the negation of the null hypothesis. For the null hypothesis
H0: μ =1,000, the alternative hypothesis is μ ≠ 1000. We will write it as:
H1: μ ≠ 1,000
We use the symbol H1 (or Ha) to denote the alternative hypothesis.
The null and alternative hypotheses assert exactly opposite statements. Obviously, both H0 and
H1 cannot be true and one of them will always be true. Thus, rejecting one is equivalent to
accepting the other. At the end of our testing procedure, if we come to the conclusion that H0
should be rejected, this also amounts to saying that H1 should be accepted and vice versa.
To better understand the role of null and alternative hypotheses, we can compare the process of
hypothesis testing with the process by which an accused person is judged to be innocent or
guilty. The person before the bar is assumed to be "innocent until proven guilty". Using
the language of hypothesis testing, we have:
Accepting H0 of innocence: when there was not enough evidence to convict. However, it
does not prove that the person is truly innocent.
Rejecting H0 and accepting H1 of guilt: when there is enough evidence to rule out
innocence as a possibility and to strongly establish guilt.
…if the null hypothesis is true, then no corrective action would be necessary. If the alternative
hypothesis is true, then some corrective action would be necessary.
After the null and alternative hypotheses are spelled out, the next step is to gather evidence from
a random sample of the population. An important limitation of making inferences from the
sample data is that we cannot be 100% confident about it. Since variations from one sample to
another can never be eliminated until the sample is as large as the population itself, it is possible
that the conclusion drawn is incorrect which leads to an error. There are two types of error:
                                 States of Population
Decision based on Sample       H0 True               H0 False
Accept H0                      Correct decision      Type II error
Reject H0                      Type I error          Correct decision
Type I Error
In the context of statistical testing, the wrong decision of rejecting a true null hypothesis is
known as Type I Error. If the operations manager rejects H0 and conclude that the process has
gone out of control, when in reality it is under control, he would be making a type I error.
Type II Error
The wrong decision of accepting (not rejecting, to be more accurate) a false null hypothesis is
known as Type II Error. If the operations manager does not reject H0 and concludes that the
process is under control, when in reality it has gone out of control, he would be making a type II
error.
The two hypotheses must be stated so that they are mutually exclusive: if one is rejected, the
other must be accepted, and vice versa. A hypothesis test about a population parameter takes one
of the following three forms:

1. H0: μ = μ0   against   H1: μ ≠ μ0
2. H0: μ ≥ μ0   against   H1: μ < μ0
3. H0: μ ≤ μ0   against   H1: μ > μ0

i.e., the equality sign is always associated with the null hypothesis and never appears in H1.
The decision maker must also specify the level of significance, α. The level of significance represents
the probability of making a Type I error.
Step 3. Collect the sample data and compute the value of the test statistic
We take and measure a random sample to determine whether the claim is true or not, and
compute the test statistic:

z = (x̄ − μ0) / (σ/√n)    when the population standard deviation σ is known
t = (x̄ − μ0) / (s/√n)    when σ is unknown and s, the sample standard deviation, is used
i.e., the population distribution is required to be normal when the sample size is < 30.
For the t-test statistic, the σ value is replaced by s, the sample standard deviation.
The critical value is used to separate the critical (rejection) region from the non-critical region. The level
of α specifies the critical value for the sampling distribution. The distribution may have either a z
or a t critical value. The placement of the critical value depends on the type of hypothesis test:

Left-tail test:  zα or tα placed on the left side of the distribution
Right-tail test: zα or tα placed on the right side of the distribution
Two-tail test:   zα/2 or tα/2 placed on either side of the distribution
This comparison is used to decide whether to reject or fail to reject the null hypothesis. The
decision rule, for all types of hypothesis test, is:

If |z computed| ≤ |z critical|, fail to reject H0; if |z computed| > |z critical|, reject H0.
i.e., the same decision rule works for comparing the computed t statistic with its critical value tα.
P-value Approach
The p-value approach uses the value of the computed test statistic (z or t) to compute a probability called a
p-value, sometimes called the observed significance level. The p-value for the three types of
hypothesis test is computed as follows:

Left-tail test:  p-value = P(Z ≤ z computed)
Right-tail test: p-value = P(Z ≥ z computed)
Two-tail test:   p-value = 2 × P(Z ≥ |z computed|)
i.e., for the t-test statistic the p-value read from the table is not a precise value, but we can approximate it.
If the p-value is less than the significance level, the test is said to be significant; that is, the null
hypothesis is rejected. The rejection rule is the same for all hypothesis tests:

Reject H0 if p-value ≤ α; otherwise, fail to reject H0.
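As an illustration of the critical-value and p-value approaches together, here is a minimal sketch of a two-tailed one-sample z test; the numbers (μ0 = 1,000, σ = 10, n = 36, x̄ = 996) are hypothetical.

    from math import sqrt
    from statistics import NormalDist

    # Two-tailed one-sample z test: H0: mu = 1000 against H1: mu != 1000, sigma known
    mu0, sigma, n = 1000, 10, 36
    x_bar, alpha = 996, 0.05

    z_calc = (x_bar - mu0) / (sigma / sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)       # about 1.96
    p_value = 2 * (1 - NormalDist().cdf(abs(z_calc)))  # two-tailed p-value

    # Both rules give the same decision: reject H0 when |z_calc| > z_crit,
    # equivalently when p_value <= alpha.
    print(round(z_calc, 2), round(p_value, 4), abs(z_calc) > z_crit)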
The last, but not least, step of hypothesis testing is to draw a conclusion based on what you
obtained from the computation.
In some hypothesis tests, the null hypothesis is rejected if the sample statistic is either too far
above or too far below the population parameter. The rejection area lies on both sides of the
parameter; tests of this type are called two-tailed tests. In contrast, when the rejection area lies
entirely on one extreme of the curve, either the right or the left tail, the test is known as a one-
tailed test.
One-tail Test is a hypothesis test with one rejection region on either side. When the test has a
rejection region on the left side, then the test is known as the left-tail test. If the rejection region
is on the right side of the curve, then the test is known as the right-tail test.
Consider the case where the null and alternative hypotheses are:
H0: μ ≥ 1,000
H1: μ < 1,000
In this case, we will reject H0 only when X is significantly less than 1,000 or only when Z falls
significantly below zero. Thus, the rejection occurs only when Z takes a significantly low value
in the left tail of its distribution. Such a case where rejection occurs in the left tail of the
distribution of the test statistic is called a left-tailed test.
A Left-tailed Test
In the case of a left-tailed test, the p-value is the area to the left of the calculated value of the test
statistic.
Now consider the case where the null and alternative hypotheses are:
H0: μ ≤ 1,000
H1: μ > 1,000
In this case, we will reject H0 only when X is significantly more than 1,000 or only when Z is
significantly greater than zero. Thus, the rejection occurs only when Z takes a significantly high
value in the right tail of its distribution. Such a case where rejection occurs in the right tail of the
distribution of the test statistic is called a right-tailed test.
A Right-tailed Test
In the case of a right-tailed test, the p-value is the area to the right of the calculated value of the
test statistic. In left-tailed and right-tailed tests, rejection occurs only on one tail. Hence each of
them is called a one-tailed test.
Two-tail Test: a hypothesis test with two rejection regions and an acceptance region in between
the two rejection regions. When the alternative hypothesis does not show a direction, that is, it is
non-directional, then the test is a two-tailed test.
Consider the case where the null and alternative hypotheses are:
H0: μ = 1,000
H1: μ ≠ 1,000
In this case, we have to reject H0 in both cases, that is, whether X is significantly less than or
greater than 1,000. Thus, rejection occurs when Z is significantly less than or greater than zero,
which is to say that rejection occurs on both tails. Therefore, this case is called a two-tailed test.
A Two-tailed Test
In the case of a two-tailed test, the p-value is twice the tail area. If the calculated value of the test
statistic falls on the left tail, then we take the area to the left of the calculated value and multiply
it by 2. If the calculated value of the test statistic falls on the right tail, then we take the area to
the right of the calculated value and multiply it by 2. For example, if the calculated Z = +1.75,
the area to the right of it is 0.0401. Multiplying that by 2, we get the p-value as 0.0802.
When the null hypothesis is about a population mean, the test statistic can be either Z or t. If we
use μ0 to denote the claimed population mean, the null hypothesis can take any of the three usual
forms:
H0: μ = μ0,   H0: μ ≤ μ0,   or   H0: μ ≥ μ0
The population is normal and the population standard deviation, σ, is known. The formula for the
test statistic Z is:
Z = (x̄ − μ0) / (σ / √n)
The population is normal and the population standard deviation, σ, is unknown, but the sample
standard deviation, S, is known and the sample size, n, is large enough. The formula for
calculating the test statistic Z in this case is:
Z = (x̄ − μ0) / (S / √n)
Example 1: A company manufacturing automobile tires finds that tire-life is normally
distributed with a mean of 40,000 km and a standard deviation of 3,000 km. It is believed that a
change in the production process will result in a better product and the company has developed a
new tire. A sample of 100 new tires has been selected. The company has found that the mean life
of these new tires is 40,900km. Can it be concluded that the new tire is significantly better than
the old one at a 1% level of significance?
Solution
In this example, we are interested to test whether the mean life of a new tire has increased
beyond 40,000km. To test this, we follow different steps in hypothesis testing:
i. State hypotheses
Ho: μ≤40,000
H1: μ>40,000.
This is the right-tail test. Thus, the rejection region is located on the right side of the curve.
ii. Select the significance level (α = 0.01). We want to be 99% confident in the decision
about the mean life of the new tire; this means that in 1 out of every 100 such situations
we risk being wrong in rejecting or failing to reject the hypothesis.
iii. Select the suitable test criteria or test statistic. Since the population of tire-life is
normally distributed Z-test is used as test criteria.
iv. Formulate decision rule: At the significance level of 0.01, the z-value from the table is
to be used as a critical value to set our decision rule. The alternative hypothesis shows
the right-tail test so the rejection region is found only to the right side.
The table value at 0.01 level of significance, z0.01 = 2.33. Therefore, the decision rule is that
rejecting the null hypothesis if the calculated value is greater than the table value.
v. Computation for comparison: compute the Z-value for the sample mean of 40,900:
Z = (40,900 − 40,000) / (3,000/√100) = 900/300 = 3
Since 3 > 2.33, the computed value falls in the rejection region. Hence, we reject the Ho
and accept H1.
vi. Conclusion: The new tire has a significantly better life than the old one.
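The calculation in Example 1 can be reproduced with a minimal Python sketch, assuming the
scipy library is available; the variable names are illustrative.

    # One-sample Z-test for Example 1 (tire life), right-tailed.
    from math import sqrt
    from scipy.stats import norm

    mu0, sigma = 40000, 3000      # claimed mean and population std. deviation (km)
    xbar, n = 40900, 100          # sample mean and sample size
    alpha = 0.01

    z_cal = (xbar - mu0) / (sigma / sqrt(n))   # = 3.0
    z_crit = norm.ppf(1 - alpha)               # right-tail critical value, about 2.33
    p_value = 1 - norm.cdf(z_cal)              # right-tail p-value, about 0.0013

    print(z_cal, z_crit, p_value)
    if z_cal > z_crit:
        print("Reject H0: the new tire lasts significantly longer than 40,000 km")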
The population is normal and the population standard deviation, σ, is unknown, but the
sample standard deviation, S, is known and the sample size, n, is small.
The formula for calculating the test statistic t in both these cases is:
t = (x̄ − μ0) / (s / √n)
The degrees of freedom for this t is (n-1)
Example 2:
A manufacturer of electric batteries claims that the average capacity of a certain type of battery
that the company produces is at least 140 ampere-hours. An independent sample of 20 batteries
gave a mean of 138.47 ampere-hours and a standard deviation of 2.66 ampere-hours. Test at 5%
significance level that the mean life is less than 140 ampere-hours.
Solution
Ho: μ ≥ 140
H1: μ < 140
2. Significance level: α = 0.05
3. Test statistic: since σ is unknown and n = 20 is small, the t-test is used.
4. Decision rule: find the t-value from the table of the Student t-distribution for a sample size
of 20. The degrees of freedom are df = 20 − 1 = 19.
t(0.05, 19) = 1.729. Because this is a left-tailed test, the decision rule is to reject the null
hypothesis if |tcal| > ttab (equivalently, if tcal < −1.729).
5. Computation: t = (138.47 − 140) / (2.66/√20) ≈ −2.572, so |tcal| = 2.572.
Since 2.572 > 1.729, we reject the null hypothesis and accept the alternative hypothesis.
6. Conclusion: The mean life of batteries produced by the company is significantly less
than 140 ampere-hours.
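A short Python sketch of this left-tailed one-sample t-test, worked from the summary statistics
given above (scipy assumed available):

    # One-sample t-test for Example 2 (battery capacity), left-tailed.
    from math import sqrt
    from scipy.stats import t

    mu0 = 140                        # claimed minimum mean capacity (ampere-hours)
    xbar, s, n = 138.47, 2.66, 20
    alpha = 0.05

    t_cal = (xbar - mu0) / (s / sqrt(n))   # about -2.57
    t_crit = t.ppf(alpha, df=n - 1)        # left-tail critical value, about -1.729

    print(t_cal, t_crit)
    if t_cal < t_crit:
        print("Reject H0: mean capacity is significantly below 140 ampere-hours")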
1.6. Hypothesis Test Concerning the Difference between Two Populations Mean
Sometimes it may be claimed that there is no difference between two population means or
proportions. In this case, we need a sample from each group, and the test is known as a two-
sample test. If two samples of sizes n1 and n2 are selected, the hypothesis about the difference
between the two means takes one of the following three forms, and the test statistic is computed
from the observed sample data accordingly:
H0: μ1 − μ2 = 0,   H0: μ1 − μ2 ≤ 0,   or   H0: μ1 − μ2 ≥ 0
The population standard deviations; σ1 and σ2; are known and both the populations are
normal.
The population standard deviations; σ1 and σ2; are known and the sample sizes; n1 and n2;
are both at least 30 (The population need not be normal).
The formula for calculating the test statistic Z in both these cases is:
Z = [(x̄1 − x̄2) − (μ1 − μ2)0] / √(σ1²/n1 + σ2²/n2)
(with the sample standard deviations S1 and S2 replacing σ1 and σ2 when only they are known
and both samples are large).
Example 3:
A sample of 65 observations is selected from one population. The sample mean is 2.67, and the
sample standard deviation is 0.75. Another sample of 50 observations is selected from the same
population. The sample mean is 2.59, and the sample standard deviation is 0.66. Test whether the
mean of the first population is less than or equal to the mean of the second population at the 5%
significance level.
Solution
H0: μ1 ≤ μ2;  H1: μ1 > μ2. At α = 0.05 the critical value is Z0.05 = 1.645.
Zcal = (2.67 − 2.59) / √(0.75²/65 + 0.66²/50) ≈ 0.61, which is less than 1.645, so we fail to
reject H0.
6. Conclusion: The sample results do not provide sufficient evidence that the null hypothesis
is false. Thus, the mean of population one is less than or equal to the mean of the second
population.
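A minimal Python sketch of this two-sample Z-test, using the reported summary statistics (scipy
assumed available):

    # Two-sample Z-test for Example 3, right-tailed.
    from math import sqrt
    from scipy.stats import norm

    x1, s1, n1 = 2.67, 0.75, 65
    x2, s2, n2 = 2.59, 0.66, 50
    alpha = 0.05

    z_cal = (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)   # about 0.61
    z_crit = norm.ppf(1 - alpha)                        # about 1.645

    print(z_cal, z_crit)
    if z_cal > z_crit:
        print("Reject H0")
    else:
        print("Fail to reject H0: no evidence that mean 1 exceeds mean 2")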
The populations are normal; the population standard deviations, σ1 and σ2, are unknown, but the
sample standard deviations, S1 and S2, are known. The formula for calculating the test statistic t
depends on two subcases:
Subcase I: the population variances are assumed equal (σ1² = σ2²). Then
t = (x̄1 − x̄2) / √(SP²(1/n1 + 1/n2)),  with  SP² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2),
where SP² is the pooled variance of the two samples, which serves as the estimator of the
common population variance.
Subcase II: the population variances are not assumed equal (σ1² ≠ σ2²). Then
t = (x̄1 − x̄2) / √(S1²/n1 + S2²/n2).
Subcase III: The population standard deviations; σ1 and σ2; are known and the sample sizes; n1
and n2; are <30.
Example 4: The following information relates to the prices (in birr) of a product in two cities A
and B.
City A City B
Mean price 22 17
Standard deviation 5 6
The observations on prices were made for 9 months in city A and for 11 months in city B. Test at
the 0.01 level whether there is any significant difference between prices in the two cities,
assuming:
Solution:
H0: μ1 − μ2 = 0;  H1: μ1 − μ2 ≠ 0
The degrees of freedom for this t is given by
(c): The population standard deviations; σ1 and σ2; are known and the sample sizes; n1 and n2;
are <30.
df = 8.93 ≈ 9
6. Conclusion:
We cannot reject the null hypothesis at α = 0.01 when σ1² = σ2², since t = 1.99 < t0.005 = 2.88.
We cannot reject the null hypothesis at α = 0.01 when σ1² ≠ σ2², since t = 2.03 < t0.005 = 2.88.
We cannot reject the null hypothesis at α = 0.01 when σ1² = σ2², since t = 2.03 < t0.005 = 3.25.
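For the equal-variance case of Example 4, the pooled-variance t-test can be sketched in Python as
follows (scipy assumed available; variable names are illustrative):

    # Pooled-variance two-sample t-test for Example 4 (prices in two cities).
    from math import sqrt
    from scipy.stats import t

    x1, s1, n1 = 22, 5, 9      # city A: mean, std. deviation, months observed
    x2, s2, n2 = 17, 6, 11     # city B
    alpha = 0.01

    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)   # pooled variance
    t_cal = (x1 - x2) / sqrt(sp2 * (1 / n1 + 1 / n2))             # about 1.99
    df = n1 + n2 - 2
    t_crit = t.ppf(1 - alpha / 2, df)                             # two-tailed, about 2.88

    print(t_cal, t_crit)
    print("Reject H0" if abs(t_cal) > t_crit else "Fail to reject H0")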
1.7. Hypothesis Tests of Population Proportion
When the null hypothesis is about a population proportion, the test statistic can be either the
Binomial random variable or its Poisson or Normal approximation. If we use p0 to denote the
claimed population proportion, the null hypothesis can take any of the three usual forms:
H0: p = p0,   H0: p ≤ p0,   or   H0: p ≥ p0
The Binomial distribution can be used whenever we are able to calculate the necessary binomial
probabilities. When the Binomial distribution is used, the numbers of successes X serves as the
test statistic. It is conveniently applicable to problems where sample size, n, is small and p0 is
neither very close to 0 nor to 1.
The Normal approximation of the Binomial distribution is conveniently applicable to problems
where the sample size, n, is large and p0 is neither very close to 0 nor to 1. When the normal
distribution is used, the test statistic Z is calculated as:
Z = (p̂ − p0) / √(p0(1 − p0)/n),  where p̂ = x/n is the sample proportion.
Solution
1. State hypotheses:
Ho: p ≥ 0.8
H1: p < 0.8
4. Decision rule: first find the Z table value at 5% from the standard normal distribution table.
The test is a one-tailed (left-tailed) test, so the area between the mean and the critical value is
0.5 − 0.05 = 0.45, and Z0.45 = 1.645. The rule: reject the null hypothesis if the calculated Z lies
beyond 1.645 in the left tail (i.e., if Zcal < −1.645).
5. Computation for sample data:
Since |Zcal| is greater than Ztab, we reject the null hypothesis and accept the alternative
hypothesis. We therefore conclude that the company's claim that the medicine is 80% effective is
not justified; the medicine provides at least 12 hours of relief to fewer than 80% of users.
1.8. Hypothesis Test about the Difference between Two Population Proportions
We will consider the large-sample tests for the difference between population proportions.
For 'large enough' sample sizes the distribution of the two sample proportions and also the
distribution of the difference between the two sample proportions is approximated well by a
normal distribution. This gives rise to Z-test for comparing the two population proportions.
We will use (P1 – P2)0 to denote the claimed difference between the two population proportions.
Then the null hypothesis can take any of the three usual forms:
H0: P1 − P2 = (P1 − P2)0,   H0: P1 − P2 ≤ (P1 − P2)0,   or   H0: P1 − P2 ≥ (P1 − P2)0
The formula for calculating the test statistic Z depends on two cases.
Case I: When (P1 – P2)0 = 0, i.e. the claimed difference between the two population proportions
is zero, the two samples are pooled:
Z = (p̂1 − p̂2) / √( p̄(1 − p̄)(1/n1 + 1/n2) ),  where p̄ = (x1 + x2)/(n1 + n2).
Case II: When (P1 – P2)0 ≠ 0, i.e. the claimed difference between the two population proportions
is some number other than zero:
Z = [(p̂1 − p̂2) − (P1 − P2)0] / √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ).
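Both cases can be sketched in a small Python function. The counts x1, n1, x2, n2 below are
hypothetical placeholders, not the survey figures of Example 6:

    # Two-proportion Z-test covering Case I (pooled) and Case II (claimed difference).
    from math import sqrt

    def two_prop_z(x1, n1, x2, n2, claimed_diff=0.0):
        p1, p2 = x1 / n1, x2 / n2
        if claimed_diff == 0.0:
            # Case I: pool the two samples to estimate the common proportion
            p_bar = (x1 + x2) / (n1 + n2)
            se = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
        else:
            # Case II: claimed difference is non-zero, keep separate estimates
            se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        return (p1 - p2 - claimed_diff) / se

    # Hypothetical counts for illustration only
    print(two_prop_z(120, 400, 90, 350))          # Case I
    print(two_prop_z(120, 400, 90, 350, 0.07))    # Case II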
Example 6: A sample survey of tax-payers belonging to the business class and the professional
class yielded the following results. Test whether:
(a) the defaulters' rate is the same for the two classes of tax-payers;
(b) the defaulters' rate in the case of the business class is more than that in the case of the
professional class by 0.07.
Solution:
The defaulter’s rate is the same for the two classes of tax-payers
1. The null and alternative hypotheses:
H0: p1 − p2 = 0;  H1: p1 − p2 ≠ 0
5. Computations:
6. Conclusion: We cannot reject the null hypothesis at α = 0.05 since Z =1.87 < Z0.005
=2.58
The defaulter’s rate in the case of business class is more than that in the case of the
professional class by 0.07.
1. The null and alternative hypotheses:
H0: P1 – P2 = 0.07
6. Conclusion: We cannot reject the null hypothesis at α = 0.05 since Z = -0.76 > -Z0.01 =
-2.58
Exercise
2. Individuals filing federal income tax returns prior to March 31 received an average refund of
$1056. Consider the population of "last-minute" filers who mail their tax return during the
last five days of the income tax period (typically April 10 to April 15).
A. A researcher suggests that a reason individual wait until the last five days is that on
average these individuals receive lower refunds than do early filers. Develop
appropriate null and alternative hypotheses.
B. For a sample of 400 individuals who filed a tax return between April 10 and 15, the
sample mean refund was $910. Based on prior experience a population standard
deviation of σ = $1600 may be assumed. What is the p-value?
C. At α = .05, what is your conclusion? Repeat the preceding hypothesis test using the
critical value approach.
3. Suppose the average life span of n = 100 persons was 71.8 years. According to earlier studies,
the population standard deviation is assumed to be 8.9 years. According to this information,
could it be concluded that the average life span of the population is less than 73 years using
0.025 significance level? The life span is supposed to be normally distributed.
C. At α = .01, what is your conclusion?
D. What is the rejection rule using the critical value? What is your conclusion?
5. The school nurse thinks the average height of 7th graders has increased. The average height
of a 7th grader five years ago was 145 cm with a standard deviation of 20 cm. She takes a
random sample of 200 students and finds that the average height of her sample is 147 cm.
Are 7th graders now taller than they were before? Conduct a hypothesis test using a .05
significance level.
6. Suppose the dean of Chamo-campus claims that students' grade point averages have
improved dramatically in recent years. The graduating seniors' mean GPA over the last five
years is 2.75. The dean randomly samples 101 seniors from the last graduating class and
finds that their mean GPA is 2.85. Assume the population standard deviation is 0.65. Is there
enough evidence to support the dean's claim using α = 0.1?
8. The scores on an aptitude test required for entry into a certain job position have a mean of 500
and a standard deviation of 120. If a random sample of 36 applicants has a mean of 546, is
there evidence that their mean score is different from the mean that is expected from all
applicants? Use α = 0.1.
9. Your company sells exercise clothing and equipment on the Internet. To design the clothing,
you collect data on the physical characteristics of your different types of customers. We take a
sample of 24 male runners and find their mean weight to be 61.79 kilograms. Your company
believes that the average weight of these customers is 63 kilograms; assume that the
population standard deviation is σ = 4.5. Using α = 0.02, conduct a hypothesis test.
10. You have just taken ownership of a pizza shop. The previous owner told you that you would
save money if you bought the mozzarella cheese in a 4.5-pound slab. Each time you purchase
a slab of cheese, you weigh it to ensure that you are receiving 72 ounces of cheese. The
results of 7 random measurements are 70, 69, 73, 68, 71, 69 and 71 ounces. Are these
differences due to chance, or is the distributor giving you less cheese than you deserve? Use
α = 10% and α = 5%.
12. A drug manufacturer claims that fewer than 10% of patients who take its new drug for
treating Alzheimer's disease will experience nausea. In a random sample of 250 patients, 23
experienced nausea. Perform a significance test at the 5% significance level to test this
claim.
13. The National Academy of Science reported in a 1997 study that 40% of research in
mathematics is published by US authors. The mathematics chairperson of a prestigious
university wishes to test the claim that this percentage is no longer 40%. He surveys a
simple random sample of 130 recent articles published by research journals and finds that 62
of these articles have US authors. Does this evidence support the mathematics chairperson's
claim that the percentage is no longer 40%? Use a 0.10 level of significance.
14. The National Center for Health Statistics released a report that stated 70% of adults do not
exercise regularly (Associated Press, April 7, 2002). A researcher decided to conduct a study
to see whether the claim made by the National Center for Health Statistics differed on a state-
by-state basis.
A. State the null and alternative hypotheses assuming the intent of the researcher is to
identify states that differ from the 70% reported by the National Center for Health
Statistics.
B. At α = .05, what is the research conclusion for the following states:
Wisconsin: 252 of 350 adults did not exercise regularly
California: 189 of 300 adults did not exercise regularly
15. Virtual call centers are staffed by individuals working out of their homes. Most home agents
earn $10 to $15 per hour without benefits versus $7 to $9 per hour with benefits at a
traditional call center (BusinessWeek, January 23, 2006). Regional Airways is considering
employing home agents, but only if a level of customer satisfaction greater than 80% can be
maintained. A test was conducted with home service agents. In a sample of 300 customers
252 reported that they were satisfied with service.
A. Develop hypotheses for a test to determine whether the sample data support the
conclusion that customer service with home agents meets the Regional Airways criterion.
B. What is your point estimate of the percentage of satisfied customers?
C. What is the p-value provided by the sample data?
D. What is your hypothesis testing conclusion? Use α = .05 as the level of significance
16. Suppose ABC Drug Company develops a new drug, designed to prevent colds. The company
states that the drug is more effective for women than for men. To test this claim, they choose
a simple random sample of 100 women and 200 men from a population of 100,000
volunteers. At the end of the study, 38% of the women caught a cold; and 51% of the men
caught a cold. Based on these findings, can we conclude that the drug is more effective for
women than for men? Use a 0.01 level of significance.
17. Two types of batteries are tested for their length of life and the following data are obtained:
Is there a significant difference between the two means at a 95% confidence level?
18. The annual per capita consumption of milk is 21.6 gallons. You believe milk consumption is
higher in the Borena area, Oromia regional state and wish to support your opinion. A sample
of 16 individuals from the Borena area showed a sample mean annual consumption of 24.1
gallons with a standard deviation of s=4.8.
a. Develop a hypothesis test that can be used to determine whether the mean annual
consumption in Borena is higher than the national mean.
b. What is a point estimate of the difference between mean annual consumption in
Borena and the national mean?
c. At α=0.05, test for a significant difference. What is your conclusion?
19. Second-year management students were categorized into four sections. Two were assigned to
Mr. Demis and the remaining two sections to Mr. Yared for the course Statistics for
Management II. In Mr. Demis's sections there were 87 students, and in Mr. Yared's sections
there were 92 students. At the end of the semester, all sections took the same standardized
exam. Mr. Demis's students had an average test score of 78, with a standard deviation of 10,
and Mr. Yared's students had an average test score of 85, with a standard deviation of 15.
Test the hypothesis that Mr. Demis and Mr. Yared are equally effective teachers at a 0.10
level of significance.
20. An advertising company feels that 20% of the population in the age group of 18 to 25 years
in a town watches a specific serial. To test this assumption, a random sample of 890
individuals in the same age group was taken of which 440 watched the serial. At a 5% level
of significance, can we accept the assumption laid down by the company?
21. Consider the following hypothesis test:
23. The label on a 3-quart container of orange juice claims that the orange juice contains an
average of 1 gram of fat or less. Answer the following questions for a hypothesis test that
could be used to test the claim on the label.
A. Describe a Type I error.
B. Describe a Type II error.
C. If the null hypothesis is rejected, what could be the conclusion?
D. If we fail to reject the null hypothesis, what could be the conclusion?
24. An engineer hypothesizes that the mean number of defects can be decreased in a
manufacturing process of compact disks by using robots instead of humans for certain tasks.
The mean number of defective disks per 1000 is 18.
A. Describe a Type I error.
B. Describe a Type II error.
C. If the null hypothesis is rejected, what could be the conclusion?
D. If we fail to reject the null hypothesis, what could be the conclusion?
Summary of test statistics:
Mean, independent samples, σ1 and σ2 unknown and assumed equal:
t = (x̄1 − x̄2) / √( sp²(1/n1 + 1/n2) ),   df (v) = n1 + n2 − 2
Mean, independent samples, σ1 and σ2 unknown and unequal:
t = (x̄1 − x̄2) / √( s1²/n1 + s2²/n2 ),
df (v) = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
Proportion, independent samples (degrees of freedom not applicable):
Z = (p̂1 − p̂2) / √( p̄(1 − p̄)(1/n1 + 1/n2) ),  where p̄ = (x1 + x2)/(n1 + n2)
Decision Tree for Deciding Which Hypothesis Test to Use
CHAPTER FOUR
CHI SQUARE (X2) DISTRIBUTION
Introduction
In the previous chapter, we learnt about testing hypotheses made about population parameters
under an assumption about the distribution of the population from which the samples are taken.
In this chapter, we will discuss hypothesis testing regarding sample characteristics. Sometimes an
assumption is made to test whether the sample follows a certain population distribution, to test
whether different attributes are independent, or to test a single variance. Here, there is no need
for an assumption regarding the distribution of the parent population from which the samples are
taken. For all these situations, the X2-test is used, as we will explain in this chapter.
A chi-square distribution can be used to test whether a population follows one or another
distribution (goodness of fit test). Chi-square distribution can also be used to test if two variables
are independent. Any statistical test that uses the chi-square distribution can be called a chi-
square test. It is applicable for both large and small samples, depending on the context. This
distribution is not defined for negative real numbers and is not applicable when observations
assume such values.
A Chi-square test is designed to analyze categorical data. That means that the data has been
counted and divided into categories. It will not work with parametric or continuous data (such as
height in inches). For example, if you want to test whether attending class influences how
students perform on an exam, using test scores (from 0-100) as data would not be appropriate for
a Chi-square test. However, arranging students into the categories "Pass" and "Fail" would.
Additionally, the data in a Chi-square grid should not be in the form of percentages, or anything
other than frequency (count) data. Thus, by dividing a class of 54 into groups according to
whether they attended class and whether they passed the exam, you might construct a data set
like this:
             Pass    Fail
Attended     25      6
Skipped      8       15
Be very careful when constructing your categories. A Chi-square test can tell you information
based on how you divide up the data. However, it cannot tell you whether the categories you
constructed are meaningful.
Thus, the χ2 distribution depends on the degrees of freedom: its shape changes with the change
in df, and as df becomes greater, χ2 is approximated by the normal distribution.
The degrees of freedom when working with a single population variance are n − 1. As n becomes
large, the distribution approaches a normal distribution.
4.3. Steps in X2 hypothesis testing
Describe H0 and Ha: here, the hypotheses are not represented in mathematical symbols but as
statements (in words), such as 'the attributes are independent' (H0) or 'the attributes are
related' (Ha).
Select an appropriate significance level (α): in the X2 distribution, α is located in the right tail,
which is the rejection region.
Determine the suitable test statistic, X2 is appropriate for testing independence.
Set decision rule about the condition for rejecting the null hypothesis and accepting the
alternative hypothesis. Determining critical value is the main activity at this step. The value
of the chi-square random variable χ2 with df = k that cuts off a right tail of area c is denoted
χc2 and is called a critical value.
Compute the test statistic for the sample observations: after computing it for the sample, we
compare it with the critical value in order to decide whether to reject or fail to reject the null
hypothesis.
Conclusion: at this stage, we infer something about the variables under the study.
4.4. Application of X2 Tests
4.4.1. Test of Independence
Suppose N observations are considered and classified according to two characteristics, say A and
B. We may be interested in testing whether the two characteristics are independent. In such a case, we
can use Chi square test for independence of two attributes. It has to be noted that the Chi square
goodness of fit test and test for independence of attributes depend only on the set of observed
and expected frequencies and degrees of freedom. This test does not need any assumption
regarding distribution of the parent population from which the samples are taken. Since these
tests do not involve any population parameters or characteristics, they are also termed as non-
parametric or distribution free tests. An additional important fact on these two tests is they are
sample size independent and can be used for any sample size as long as the assumption on
minimum expected cell frequency is met.
When items are classified according to two or more criteria, it is often of interest to decide
whether these criteria act independently of one another. The hypotheses that have to do with
whether or not two random variables take their values independently, or whether the value of one
has a relation to the value of the other, can be tested using the X2 test. Do you remember what an
independent event is and how to compute the joint probability of independent events? If A and B
are independent events, what is the probability that both occur?
Two events are said to be independent if information about one tells nothing about the occurrence
of the other. In other words, the outcome of one event does not affect, and is not affected by, the
other event. The outcome of each successive toss of a coin is independent of the preceding toss.
For instance, what is the probability that both trials come up tails in an experiment of tossing a fair
coin twice? Let us put the possible outcomes of the two trials in a table as follows.
P(TT) = (# of observations) / (total observations) = 1/4 = P(T) × P(T)

                      Trial 2
                      T2        H2        Total
Trial 1     T1        T1T2      T1H2      2
            H1        H1T2      H1H2      2
Total                 2         2         4
The table shows whether the outcomes of the first trial affect, or are affected by, the outcomes of
the second trial. Here, each trial has two possible outcomes. While the numerical values of the
cell probabilities are unspecified, each cell probability will equal the product of its respective
row and column probabilities, e.g. P(T1T2) = P(T1) × P(T2) = 1/2 × 1/2 = 1/4.
This condition implies independence of the two trials. The main issue that we need to examine is
whether trial 1 affects or is affected by trial 2, or whether each outcome independently assumes
its probability of occurrence.
The categorical data should first be organized into an m × n contingency table, where m
represents the number of rows (levels of the first variable) and n represents the number of
columns (levels of the second variable).
H0: The probability of each cell equals the product of the probabilities of its respective row and
column (the two attributes are independent).
The test statistic for this kind of hypothesis is the X2-test with (r − 1) × (c − 1) degrees of
freedom, where r is the number of rows and c is the number of columns.
Example 1
Suppose we wish to classify defects found in wafers produced in a manufacturing plant, first
according to the type of defect and, second, according to the production shift during which the
wafers were produced. A total of 309 wafer defects were recorded and the defects were classified
as being one of four types, A, B, C, or D. At the same time each wafer was identified according
to the production shift in which it was manufactured, 1, 2, or 3.
Is there independence between wafer defect types and production shifts at the 99%
confidence level?
Table 1: Contingency table classifying wafers defects according to type and production shift
Type of Defects
Shift A B C D Total
1 15 21 45 13 94
2 26 31 34 5 96
3 33 17 49 20 119
Solution
1. Hypothesis
H0: wafer defects classification by defect types is independent of classification by
production shifts.
H1: wafer defects classification by defect types is dependent on classification by
production shifts.
2. Significance level: α = 0.01.
3. The X2-test is the suitable test statistic since this is a test of independence.
4. Set the decision rule based on the critical value and the type of test tail. The test is non-
directional, so the critical value Xc2 is determined for (r − 1)(c − 1) degrees of freedom:
(r − 1)(c − 1) = (3 − 1)(4 − 1) = 6
X2(0.01, 6) = 16.812
Our decision rule is to reject H0 if the calculated X2 is greater than 16.812.
5. Compute the test statistic using the observed and expected values, X2 = ∑ (O − E)2 / E, and
then compare it with the critical value.
Expected value (E) for each cell is obtained through dividing the product of row total and
column total by the grand total of the contingency table.
E(ij) = (Row Total × Column Total) / Grand Total
For example, for shift 1 and defect type A: E(1A) = (94 × 74) / 309 ≈ 22.5.
Now we get the X2 =19.196 which is greater than 16.812. Therefore, we reject the null
hypothesis
6. Conclusion: we conclude that there is significant evidence that the proportions of the
different defect types vary from shift to shift.
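The expected-frequency calculation and the X2 statistic for Example 1 can be sketched in Python
as follows (scipy assumed available):

    # Chi-square test of independence for the wafer-defect table in Example 1.
    from scipy.stats import chi2

    observed = [
        [15, 21, 45, 13],   # shift 1, defect types A-D
        [26, 31, 34, 5],    # shift 2
        [33, 17, 49, 20],   # shift 3
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)

    chi_sq = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            chi_sq += (obs - expected) ** 2 / expected

    df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r-1)(c-1) = 6
    critical = chi2.ppf(0.99, df)                       # 16.812
    print(chi_sq, critical)                             # about 19.2 > 16.812
    print("Reject H0" if chi_sq > critical else "Fail to reject H0")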
4.4.2. Test of Equality of Several Proportions
A contingency table can also be used to compare two independent groups by identifying the
items of interest and the items not of interest. Thus, to test the null hypothesis that there is no
difference between several population proportions, meaning the proportions are equal, we can
use X2 tests.
Example 2: A researcher studying customer retention at two hotels (A and B) wants to identify
whether there is a difference between customer repurchase intention at Hotel A and at Hotel B.
He/she organizes the responses of sample customers of the two hotels to a single question: 'Are
you likely to choose this hotel again?' From a sample of 240 customers of Hotel A, only 180
replied yes, and out of 290 customers of Hotel B, 215 responded yes. The analysis is to be done
at the 5% significance level, in order to determine whether there is evidence of a significant
difference in customer repurchase intention between the two hotels.
Solution
1. Hypotheses: H0: there is no difference in customer repurchase intention between the two
hotels (π1 = π2); Ha: there is a difference (π1 ≠ π2).
Responses     Hotel A    Hotel B    Total
Yes           180        215        395
No            60         75         135
Total         240        290        530
2. Significance level: α = 0.05.
3. The test statistic is equal to the squared difference between the observed and expected
frequencies, divided by the expected frequency in each cell of the table, summed over all
cells of the table:
X2 = ∑ (O − E)2 / E
The critical value is Xc2 = 3.841, with df = (2 − 1)(2 − 1) = 1 and α = 0.05.
4. Decision rule: reject H0 if the calculated X2 is greater than Xc2 = 3.841.
5. We need to compute expected value for each of the cells to calculate for the test statistic.
To compute the expected frequency, in any cell, you need to understand that if the null
hypothesis is true, the proportion of items of interest in the two populations will be equal.
Then the sample proportions you compute from each of the two groups would differ from
each other only by chance. Each would provide an estimate of the common population
parameter (i.e. P).
A statistic that combines these two separate estimates together into one overall estimate
of the population parameter provides more information than either of the two separate
estimates could provide by itself. This statistic, denoted p̄, represents the estimated overall
proportion of items of interest for the two groups combined (i.e., the total number of items
of interest divided by the total sample size).
The complement, 1 − p̄, represents the estimated overall proportion of items that are not of
interest in the two groups. p̄ is computed as:
p̄ = x / n = (x1 + x2) / (n1 + n2)
where x is the number of items of interest from the two groups combined and n is the total
sample size of the two groups combined.
In our example, the items of interest (x) are the customers who showed their intention to
purchase again from the hotels, and the items not of interest are the customers who do not
want to repurchase from the hotels.
x1 = 180, n1 = 240
x2 = 215, n2 = 290
p̄ = (x1 + x2) / (n1 + n2) = (180 + 215) / (240 + 290) = 395/530 ≈ 0.745
1 − p̄ ≈ 0.255, which is the estimated proportion of items not of interest.
To compute the expected frequency (E) for cells that involve items of interest (i.e., the cells
in the first row of the contingency table), you multiply the sample size (or column total) for
a group by p̄. To compute the expected frequency (E) for cells that involve items that are not
of interest (i.e., the cells in the second row of the contingency table), you multiply the
sample size (or column total) for a group by 1 − p̄.
Responses     Hotel A           Hotel B           Total
Yes           E(1A) ≈ 178.9     E(1B) ≈ 216.1     395
No            E(2A) ≈ 61.1      E(2B) ≈ 73.9      135

Expected value of each cell:
E(1A) = p̄ × 240 ≈ 178.9,   E(1B) = p̄ × 290 ≈ 216.1,
E(2A) = (1 − p̄) × 240 ≈ 61.1,   E(2B) = (1 − p̄) × 290 ≈ 73.9
Using the observed and expected frequencies, compute the test statistic X2:
X2 = ∑ (O − E)2 / E ≈ 0.0512
Plotting the calculated X2 value on the curve, it lies within the acceptance region:
0.0512 < 3.841. Therefore, we do not reject H0.
6. Conclusion:
We conclude that the sample does not provide enough evidence that the null hypothesis is
false. Therefore, the customers' repurchase intentions for the two hotels are equal.
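A short Python sketch of the Hotel A / Hotel B comparison, building the expected frequencies
from the combined proportion p̄ as described above (scipy assumed available):

    # 2x2 chi-square comparison of two proportions for Example 2 (hotels).
    from scipy.stats import chi2

    x_a, n_a = 180, 240     # Hotel A: "yes" responses, sample size
    x_b, n_b = 215, 290     # Hotel B
    p_bar = (x_a + x_b) / (n_a + n_b)        # about 0.745

    observed = [[x_a, x_b], [n_a - x_a, n_b - x_b]]          # yes / no rows
    expected = [[p_bar * n_a, p_bar * n_b],
                [(1 - p_bar) * n_a, (1 - p_bar) * n_b]]

    chi_sq = sum((o - e) ** 2 / e
                 for o_row, e_row in zip(observed, expected)
                 for o, e in zip(o_row, e_row))

    critical = chi2.ppf(0.95, df=1)          # 3.841
    print(chi_sq, critical)                  # about 0.05 < 3.841 -> fail to reject H0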
B. The Difference Between More than Two Population Proportions
In this section, the test is extended to compare more than two independent populations. The same
procedures are employed to test the hypothesis about the equality of several population
proportions. A 2 × c contingency table is used, where c is the number of independent populations
and the two rows are the items of interest and the items not of interest.
If the null hypothesis is true, the expected frequencies in each cell are obtained using the
combined proportion p̄ (a better estimator of the common population parameter):
p̄ = (x1 + x2 + … + xc) / (n1 + n2 + … + nc)
and the expected frequencies for the second row use 1 − p̄.
Example 3 (equality of the proportions of four groups): More shoppers do the majority of their
shopping on Saturday than on any other day of the week. However, is there a difference among
the various age groups in the proportion of people who do the majority of their shopping on
Saturday? A study of 200 shoppers from each age group showed the following results:
Age groups
Is there evidence of a significant difference among the age groups with respect to major
shopping day at 0.05 level of significance?
Solution
1. Describe the hypotheses
H0: There is no difference among the various age groups in the proportion of people who do
the majority of their shopping on Saturday (π1 = π2 = π3 = π4).
H1: There is a difference among the various age groups in the proportion of people who do
the majority of their shopping on Saturday (not all πj are equal).
2. The level of significance is 0.05.
3. The test statistic is the X2 test, with degrees of freedom (c − 1) = 4 − 1 = 3.
X2(0.05, 3) = 7.815 is the critical value.
4. Set decision rule about the rejection or acceptance of null hypothesis
Critical value Xc2=7.815, if the calculated value is greater than the critical value then we reject
the H0.
X2 = ∑ (O − E)2 / E
In this example, we have to calculate both observed and expected values. To simplify let us
represent each cell by number of rows and number of columns as C11, C12, C13, C14, C21, C22, C23,
and C24. The observed value for each cell is computed by multiplying the cell proportion by the
sample size of each age group. Example, C11=24%x200=48, C12= 34%x200=68, etc.…
Then, the expected frequency for each cell in the first row is computed by multiplying the
combined proportion p̄ by the sample size of each age group, where p̄ is the common proportion
of shoppers of all age groups who do the majority of their shopping on Saturday: p̄ = 0.22.
Example: E(C11) = Po(n1) = 0.22x200 =44, E(C21) = (1-0.22) (n1) =0.78(200) =156
X2 is 34.4986, which is greater than critical value 7.815. Graphically presented, the sample result
is located in rejection region. Therefore, we reject the null hypothesis.
6. Finally, we conclude that there is a significant difference between the proportions of shoppers
of different age groups. However, identifying which proportions differ significantly requires
other procedures.
Example 4: Different age groups use different media sources for news. A study on this issue
explored the use of cell phones for accessing news. The study reported that 47% of users under
age 50 and 15% of users age 50 and over accessed news on their cell phones. Suppose that the
survey consisted of 1,000 users under age 50, of whom 470 accessed news on their cell phones,
and 891 users age 50 and over, of whom 134 accessed news on their cell phones. Construct a
contingency table. Is there evidence of a significant difference in the proportion that accessed
the news on their cell phones between users under age 50 and users 50 years and older?
(Use α = 0.05.)
Solution
a. Contingency table
Accessed news on cell phone     < 50      ≥ 50     Total
Yes                             470       134      604
No                              530       757      1,287
Total                           1,000     891      1,891
Ho: There is no significant difference between the two age groups in the proportion accessing
news on their cell phones.
Ha: There is a significant difference between the two age groups in the proportion accessing
news on their cell phones.
ii. Level of significance: α = 0.05
iii. Computation: X2 = ∑ (fo − fe)2 / fe = 221.42
v. Conclusion:
Since the calculated X2 (221.42) is greater than the X2 table value (3.84146), the decision is
to reject Ho; this implies that there is a significant difference between the two age groups in
the proportion accessing news on their cell phones.
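Example 4 can also be checked with scipy's chi2_contingency; a minimal Python sketch using
the table constructed above:

    # Chi-square test for the 2x2 cell-phone news table of Example 4.
    from scipy.stats import chi2_contingency

    table = [[470, 1000 - 470],    # under 50: accessed, did not access
             [134, 891 - 134]]     # 50 and over

    chi_sq, p_value, df, expected = chi2_contingency(table, correction=False)
    print(chi_sq, p_value)         # about 221.4, p-value near zero
    # chi_sq far exceeds the critical value 3.841 at alpha = 0.05, so H0 is rejected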
4.4.3. Goodness-of-fit Test
Goodness of fit is a statistical term referring to how far apart the expected values of a variable
are from the actual values. It measures the compatibility of the sample evidence with the
hypothesized population distribution. For example, if the population distribution is assumed to
be normal and the sample selected from this population is normally distributed, then the sample
fits the population distribution.
The chi-square goodness of fit test is appropriate when the following conditions are met:
Applied when you have one categorical variable from a single population.
The expected value of the number of sample observations in each level of the variable is
at least 5.
The chi-square goodness-of-fit test can be applied to discrete distributions such as the binomial
and the Poisson and continuous distribution like normal distribution and uniform distribution.
Ef = N / n = (1/n) × N,  where N is the total number of observations and n is the number of
categories.
Example-5: The table below contains random sample data on the number of workers absent
from Commercial Bank of Ethiopia. The number of absences of the week is expected to be
equally distributed for each day of the week. Does it appear that the number of workers absent is
uniformly distributed over days of the week? Perform a goodness-of-fit test at the 5 percent
level.
Day Number of Workers absent
Monday 15
Tuesday 9
Wednesday 9
Thursday 11
Friday 16
Total 60
Solution
Ho: The number of workers absent is uniformly distributed over days of the week.
H1: The number of workers absent is not uniformly distributed over the days of the week.
fe = (total number of workers absent) / (days of the week) = 60 / 5 = 12
Day          Observed (fo)   Expected (fe)   fo − fe   (fo − fe)2   (fo − fe)2/fe
Monday       15              12              3         9            0.75
Tuesday      9               12              −3        9            0.75
Wednesday    9               12              −3        9            0.75
Thursday     11              12              −1        1            0.08
Friday       16              12              4         16           1.33
Total                                                               3.67
3. Decision rule:
Reject H0, if sample X2> 9.49
4. Sample analysis:
X2 = ∑ (fo − fe)2/fe = 3.67
5. Conclusion:
Do not reject H0, because X2 < 9.49 (3.67 < 9.49); therefore, the distribution is uniform.
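A minimal Python sketch of this goodness-of-fit test, using scipy's chisquare with equal
expected frequencies:

    # Uniform goodness-of-fit test for Example 5 (worker absences).
    from scipy.stats import chisquare, chi2

    observed = [15, 9, 9, 11, 16]            # Monday ... Friday
    # chisquare defaults to a uniform expected frequency (60 / 5 = 12 per day)
    chi_sq, p_value = chisquare(observed)

    critical = chi2.ppf(0.95, df=len(observed) - 1)   # 9.49
    print(chi_sq, p_value, critical)                  # about 3.67 < 9.49
    print("Fail to reject H0" if chi_sq < critical else "Reject H0")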
Z = (x − μ) / σ
Example 6: Given the data below, perform a goodness-of-fit test at the 1% level to determine
whether the distribution follows the normal distribution.
740 or more 7
Solution
1. State hypotheses:
Ho: The distribution of test scores follows the normal distribution.
H1: The distribution of test scores does not follow the normal distribution.
∑ (fo − fe)2/fe = 27
4. Sample analysis:
X2 = ∑ (fo − fe)2/fe = 27
5. Conclusion:
Since the calculated X2 (27) is greater than 15.08627, the decision is to reject Ho; this
implies the distribution does not follow the normal distribution.
P(r) = [ n! / (r!(n − r)!) ] × p^r × q^(n−r)
To calculate the degrees of freedom, we need to check the values in the expected frequency
column. Each expected frequency must be greater than or equal to five. If some expected
frequencies are less than five, it is better to merge them so that they exceed five. Then the
degrees of freedom can be calculated as V = K − 1 − g, where K is the number of classes after
merging and g is the number of population parameters estimated from the sample.
Example-7: Jacob, a mail order firm, sends out special item advertisements in batches of 10 at a
time. Sampson's sales manager believes that the probability of receiving an order as a result of
any one advertisement is 0.5. The manager wants to test the hypothesis that the distribution of
number of orders in batch of 10 is a binomial distribution with p=0.5. Data for random sample
1000 mailings are given below. Perform a goodness of fit test at the 5 % level.
Number of orders received from a mailing list of 50 advertisements Observed frequency
0 5
1 10
2 50
3 120
4 210
5 240
6 200
7 110
8 40
9 15
10 0
Solution
P(r) = [ n! / (r!(n − r)!) ] × p^r × q^(n−r)
Where n = 10
P = 0.5, q= 0.5
N = 1000
fe = N x P
r      Observed (fo)    P(r) = nCr·p^r·q^(n−r)    fe = N × P(r)
0      5                0.001                     1
1      10               0.010                     10
2      50               0.044                     44
3      120              0.117                     117
4      210              0.205                     205
5      240              0.246                     246
6      200              0.205                     205
7      110              0.117                     117
8      40               0.044                     44
9      15               0.010                     10
10     0                0.001                     1
To calculate the degree of freedom, we need to check for the values in expected
frequency column. Expected frequency must be greater than or equal to five.
However, in the above table some expected frequencies i.e. the first and the last row
are less than five. Therefore, we need to merge them to make them more than five.
Merging can be done by combining the first two rows and the last two rows together.
r        Observed (fo)    P(r)     fe = N × P(r)
0 & 1    15               0.011    11
2        50               0.044    44
3        120              0.117    117
4        210              0.205    205
5        240              0.246    246
6        200              0.205    205
7        110              0.117    117
8        40               0.044    44
9 & 10   15               0.011    11
V = k – 1- g
V = 9 – 1- 0 = 8
3. Decision rule:
X20.05, 8 = 15.507, accept Ho, if X2 calculated < 15.507
4. Sample analysis:
X2 = ∑ (fo − fe)2 / fe
fo      fe      fo − fe    (fo − fe)2    (fo − fe)2/fe
15      11      4          16            1.45
50      44      6          36            0.82
120     117     3          9             0.08
210     205     5          25            0.12
240     246     −6         36            0.15
200     205     −5         25            0.12
110     117     −7         49            0.42
40      44      −4         16            0.36
15      11      4          16            1.45
∑ (fo − fe)2/fe = 4.97
5. Conclusion:
Since X2calculated (4.97) is less than 15.507, the decision is to accept Ho, this implies the
distribution follows binomial distribution.
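The binomial goodness-of-fit calculation in Example 7, including the merging of small expected
frequencies, can be sketched in Python as follows (scipy assumed available; using exact binomial
probabilities rather than the rounded values in the table above, so the statistic differs slightly):

    # Binomial goodness-of-fit test for Example 7 (orders per batch of 10).
    from scipy.stats import binom, chi2

    observed = [5, 10, 50, 120, 210, 240, 200, 110, 40, 15, 0]   # orders 0..10
    n, p, N = 10, 0.5, 1000
    expected = [N * binom.pmf(k, n, p) for k in range(11)]

    # Merge classes whose expected frequency is below 5 (here 0&1 and 9&10)
    obs = [observed[0] + observed[1]] + observed[2:9] + [observed[9] + observed[10]]
    exp = [expected[0] + expected[1]] + expected[2:9] + [expected[9] + expected[10]]

    chi_sq = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    critical = chi2.ppf(0.95, df=len(obs) - 1)        # df = 9 - 1 = 8, about 15.51
    print(chi_sq, critical)
    print("Fail to reject H0" if chi_sq < critical else "Reject H0")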
Example 8; The number of boys in 500 families with 5 children is investigated. There were 20
families with no boy, 75 with 1, 145 with 2, 140 with 3, 85 with 4, and 35 with 5 boys. Decide
(with level of significance α = 0.05) whether the number of boys in a 5-children family follows
binomial distribution.
0 20
1 75
2 145
3 140
4 85
5 35
Total 500
Solution
Ho: the number of boys in a 5-children family follows a binomial distribution.
Ha: the number of boys in a 5-children family does not follow a binomial distribution.
2. Significance level: α = 0.05.
3. Test statistic: since the assumption is about goodness of fit test, the appropriate test statistic is
X2.
The X2 critical value depends on the expected frequency of each group. Hence, we need to first
calculate the expected frequencies (Ef).
Ef = P(X)·N, where P(X) is the binomial probability of the value of the random variable X and
N is the total number of families.
To calculate the probability of each value of the random variable, we first determine the
probability of success and the probability of failure from the expected value (mean).
The mean of the binomial probability distribution equals np. Now, we can find the probability
of success as follows:
mean = Σ(x·f)/N = 1300/500 = 2.6, so p = mean/n = 2.6/5 = 0.52
Then, Probability of failure (q) = 1-p = 0.48. Using binomial probability distribution function
nCx.px.qn-x we can compute the expected frequency. (n=5, P =0.52, q =0.48)
Number of boys (x)    Observed (fo)    P(x)     Ef = P(x) × N
0                     20               0.025    12.7
1                     75               0.138    69
2                     145              0.299    149.6
3                     140              0.324    162.2
4                     85               0.175    87.7
5                     35               0.038    19
All expected frequencies are greater than 5. Therefore, Df = n-1 = 5-1 = 4. The critical value is
9.49
x    fo    fe    fo − fe    (fo − fe)2    (fo − fe)2/fe
1    75    69    6          36            0.52
5    35    19    16         256           13.47
∑ (fo − fe)2/fe = 21.69
The calculated X2 equals 21.69, which is greater than the critical value 9.49. Hence, our decision
is to reject H0.
Conclusion
Given the 95% confidence level the number of boys in 5-children family does not follow
binomial distribution.
P(x) = (λ^x · e^−λ) / x!,  where x is the number of occurrences, λ is the mean number of
occurrences, and n is the number of observations.
Example-9: When a beer bottle filling machine breaks a bottle, the machine must be shut down
while the broken glass is removed. The production manager at Bedele Brewery has been using
Poisson distribution with the average (λ=3) shut downs per day to determine the probabilities of
0, 1, 2, 3… Shut downs in a day. The manager has tabulated the number of shutdowns in a
random sample of 120 operating days, as shown in the table given below. We want to test, at the
1% level, the hypothesis number of shutdowns is a day has a Poisson distribution with λ = 3.
0 3
1 20
2 29
3 22
4 23
5 10
6 or more 13
Solution
No. of days Fo Probability(e ) fe(p x n) (fo – fe) (fo – fe)2 (fo – fe)2/fe
0 3 0.0498 6 -3 9 1.5
1 20 0.149 18 2 4 0.22
2 29 0.224 27 2 4 0.15
3 22 0.224 27 -5 25 0.93
4 23 0.168 20 3 9 0.45
5 10 0.101 12 -2 4 0.33
6 or more 13 0.050 6 7 49 8.17
∑ (fo − fe)2/fe = 11.75
X2 = ∑ (fo − fe)2/fe = 11.75
5. Conclusion
Since the calculated X2 (11.75) is less than 16.812, the decision is to accept Ho; this
implies the distribution follows the Poisson distribution.
Exercise
1. In regard to wine tasting competitions, many experts claim that the first glass of wine served
sets a reference taste and that a different reference wine may alter the relative ranking of the
other wines in competition. To test this claim, three wines, A, B and C, were served at a wine
tasting event. Each person was served a single glass of each wine, but in different orders for
different guests. At the close, each person was asked to name the best of the three. One
hundred seventy-two people were at the event and their top picks are given in the table
provided.
A B C
A 12 31 27
B 15 40 21
C 10 9 7
Test, at the 1% level of significance, whether there is sufficient evidence in the data to support
the claim that wine experts' preference is dependent on the first served wine.
2. Is being left-handed hereditary? To answer this question, 250 adults are randomly selected
and their handedness and their parents‘ handedness are noted. The results are summarized in
the table provided.
0 1 2
Handedness Left 8 10 12
Right 178 21 21
Test, at the 5% level of significance, whether there is sufficient evidence in the data to conclude
that there is a hereditary element in handedness.
3. The following contingency table shows the distribution of grades earned by students taking a
midterm exam in an MBA class, categorized by the number of hours the student spent
studying for exam: Using α = 0.05, perform a chi-square test to determine if the student time
spent in studying may affect the student's grade on the exam. (4 pt)
Grade
3-5 hours 15 14 6 35
Total 37 44 19 100
4. A sample of 500 shoppers was selected to determine various information concerning whether
they enjoy shopping the clothing. Their responses are summarized in the following
contingency table;
Female Male
Yes 158 87
No 125 130
Is there evidence of a significant difference between the proportion of males and females who
enjoy shopping for clothing at the 0.01 level of significance?
5. Arba Minch Tourist Hotel adopts new service delivery system in order to increase the
customer satisfaction and the profits. A sample of 200 tourists, 145 business customers and
320 other customers are selected to enquire whether they are satisfied with the new system.
Assume that all sample units know the hotel before the new system was adopted. Is there
significant difference between satisfactions of the three types of customers with new service
delivery system at 5% level of significance?
Customer satisfaction Customer type Total
with new system
Tourists Business customers Other customers
No 95 55 116 266
6. A company is considering three areas as possible locations for manufacturing plant. Each
area has about the same number of workers. The company needs skilled workers and wants
to determine whether the proportions of skilled workers in the areas are the same. Random
sample of data are given in the following table. At the 5 % level, perform a test of hypothesis
that the three areas have the same proportion of skilled manpower. (3 pt.)
Area
Number of workers A B C
7. A national economic analyst conducted a study of the daily net income of small-scale
entrepreneurs. Consider the observed frequencies for the following set of grouped daily
net incomes of entrepreneurs. Perform a chi-square test using α = 0.05 to determine if the
daily net income follows the normal probability distribution with µ = 100 and σ = 20.
CHAPTER FIVE
- Analysis of Variance
- One-way ANOVA
- Two-way ANOVA
In this chapter, we will discuss another approach to hypothesis testing: the analysis of variance.
Analysis of
It may seem odd that the technique is called "Analysis of Variance" rather than "Analysis of
Means." As you will see, the name is appropriate because inferences about means are made by
analyzing variance. ANOVA is used to test general rather than specific differences among
means. ANOVA was developed by Ronald Fisher in 1918 and is the extension of the t and
the z test. Before the use of ANOVA, the t-test and z-test were commonly used. The t-test is used
when the population standard deviations are unknown and two separate groups are being
compared.
Example 1:
Do males and females differ in terms of their exam scores? Take a sample of 20 males with mean
test score of 26.7 and standard deviation of 3.63, and a separate sample of 19 females measuring
mean test score of 27.1 with standard deviation 2.57 to determine if there is a significant
difference in scores between the groups at 5% significance level.
1. Hypotheses: H0: μ1 − μ2 = 0;  H1: μ1 − μ2 ≠ 0
2. α = 0.05
3. The t-test is used since the population means and variances are unknown. df = n1 + n2 − 2 = 37
Critical value t0.025, 37 = 2.026
t = [(x̄1 − x̄2) − (μ1 − μ2)] / s(x̄1 − x̄2)
sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)
s(x̄1 − x̄2) = √(sp²/n1 + sp²/n2)
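A short Python sketch of this pooled two-sample t-test, worked directly from the summary
statistics (scipy assumed available):

    # Two-sample t-test from summary statistics for Example 1 (exam scores).
    from scipy.stats import ttest_ind_from_stats

    t_stat, p_value = ttest_ind_from_stats(
        mean1=26.7, std1=3.63, nobs1=20,     # males
        mean2=27.1, std2=2.57, nobs2=19,     # females
        equal_var=True)                      # pooled-variance t-test, df = 37

    print(t_stat, p_value)   # |t| is well below the critical value 2.026,
                             # so the difference in mean scores is not significant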
In comparing the means of more than two populations, the use of t-test or Z-test increases the
size of the Type I error. If you set the Type I error to be 0.05, and you had several groups, each
time you tested a mean against another there would be a 0.05 probability of having a type I error
rate. This would mean that with six t-tests you would have a 0.30 (0.05×6) probability of having
a type one error rate. This is much higher than the desired 0.05. However, ANOVA helps to
retain Type I error to be 0.05 by comparing the variability of each population. If the means for
the three populations are equal, we would expect the three samples means to be close together. In
fact, the closer the three sample means are to one another, the more evidence we have for the
conclusion that the population means are equal. In other words, if the variability among the
sample means is 'small,' it supports the H0; if the variability among the sample means is 'large,'
it supports Ha. Therefore, ANOVA helps to compare means of more than two
groups/populations by using their variances. Thus, if H0 is true, the sample variances of each
group estimate the common population variance (σ²).
Assumptions for ANOVA
Three assumptions are required to use analysis of variance:
1. The populations from which the samples are drawn are normally distributed.
2. The populations have equal variances.
3. The samples are randomly and independently selected.
The test for the difference between the variances of two independent populations is based on the
ratio of the two sample variances. If you assume that each population is normally distributed,
then the ratio follows the F-distribution. The critical values of the F-distribution depend on
the degrees of freedom in the two samples. The degrees of freedom in the numerator of the ratio
are for the first sample, and the degrees of freedom in the denominator are for the second sample.
The first sample taken from the first population is defined as the sample that has the larger
sample variance. The second sample taken from the second population is the sample with the
smaller sample variance.
Properties of the F-distribution
• F values are non-negative; the distribution starts at 0.
• The distribution is positively skewed.
• Its shape depends on two parameters: the numerator and denominator degrees of freedom.
The test statistic is equal to the variance of sample 1 (the larger sample variance) divided by
the variance of sample 2 (the smaller sample variance):
F = S1² / S2²
where S1² is the larger sample variance (from population 1) and S2² is the smaller sample
variance (from population 2).
Degrees of freedom for sample 1 = n1 − 1
Degrees of freedom for sample 2 = n2 − 1
Purpose of ANOVA
ANOVA is used to compare means of more than two groups or populations. Thus, the samples
variances are used to determine whether the means of the parent populations are equal or the
same. This can be done in different ways based on levels and factor/s of influence of measured
observation (dependent variable). Here, we need to understand what are these two terms mean:
factor and level.
A factor is the characteristic under consideration that is thought to affect the measured
observation, whereas a level is one of the categories of a factor.
Based on the number of factors, we have different ways of ANOVA. In this section we will
discuss only one-way ANOVA and Two-way ANOVA.
Example 2.
Suppose we need to know the effect of assessment on student marks. We apply four assessments,
each of different types, on four classes of the same level. The mark of each class is recorded and
the difference in mark within and between the classes is observed.
In this example, assessment is factor and types of assessments are level. Thus, there is one factor
(assessment) and four levels (types of assessment). The measured observation is student mark.
The statistical analysis tool for this kind of problem is known as one-way ANOVA.
Let us see another example: assume we need to see the effect of assessment and class size on
student marks. Here, there are two factors, each with different levels. For this type of problem,
two-way ANOVA is used.
One-Way ANOVA
One-way refers to a single factor (independent variable) with different levels. One-way ANOVA
is used to study the effect of the different levels of a single factor on a measured (dependent)
variable. To determine whether the different levels affect the dependent variable differently, the
following hypotheses are tested:
H0: μ1 = μ2 = … = μk;   Ha: not all population means are equal.
The One-way analysis of variance is used to test the assumption regarding the equality of
different means through comparing the variability between the samples (treatments) and the
variability within the samples (treatments)
Between-treatments variability is expressed by the mean square deviation among the samples
or treatments, so it is called the between-samples mean square (MSB), with degrees of freedom
v1 = k − 1, where k is the number of samples or groups. It is calculated by determining the means
of each group and the combined (overall) mean.
SSB, called the sum of squares between groups (SSB), measures the between group variation by
summing the squared differences between the sample mean of each group, and the grand mean,
weighted by the sample size, in each group.
SSB = Σ nj (x̄j − x̿)²,  where x̿ is the grand mean, k is the number of groups, and nj is the number
of observations in group j.  MSB = SSB / (k − 1).
Within the treatments Variability is measured by the mean square deviation within the
samples so it is called within sample/treatment mean square (MSW) with degrees of freedom
(v2=n-k) where, n is number of items in sample.
SSW, called the sum of squares within groups (SSW), measures the within-group variation. It
measures the difference between each value and the mean of their own group and sums the
squares of these differences over all groups.
SSW = Σj Σi (xij − x̄j)²,  and  MSW = SSW / (n − k).
Hence, the F-statistic for ANOVA is the ratio of the between-treatment mean square to the
within-treatment mean square: F = MSB / MSW. If the null hypothesis that the population
treatment means are equal were true, this ratio would tend to equal 1, that is, MSB ≈ MSW.
The total variation, also called the sum of squares total (SST), is a measure of the variation
among all the values: SST = SSB + SSW.
The above concepts will become clear after solving the following problem.
Example 3.
Assume that the lifetimes of electric light bulbs are normally distributed with common variance.
A sample of 5 bulbs of 60W of three different brands showed the following lifetime hours in
excess of 1000 hours.
Sample Brands
unit
A B C
1 16 18 26
2 15 22 31
3 13 20 24
4 21 16 30
5 15 24 24
Test the hypothesis that there is no difference between the three brands with respect to mean
lifetime at 1% significance level.
Solution
In solving this problem, you will gain insight into how to compute a one-way analysis of variance for testing a hypothesis about the equality of more than two population means. The same six steps of hypothesis testing are used here.
1. H0: there is no difference between the mean lifetimes of the three brands.
   Ha: there is a difference between the mean lifetimes of the three brands.
2. Significance level: α = 0.01.
3. Test statistic: the F-test. Degrees of freedom for the numerator are v1 = k − 1 = 3 − 1 = 2 and for the denominator v2 = n − k = 5 + 5 + 5 − 3 = 12. The critical value F(0.01, 2, 12) = 6.93, read from a standard table of the F-distribution.
[Sketch of the F-distribution with the rejection region to the right of the critical value F = 6.93]
4. Setting the decision rule: since the F critical value is 6.93, we reject H0 if the calculated value of F is greater than 6.93.
5. Compute for the samples: first we need the sample mean x̄j of each brand, the combined (grand) mean x̿, and the sum of squared deviations within each sample.
   x̄A = 80/5 = 16,  x̄B = 100/5 = 20,  x̄C = 135/5 = 27,  x̿ = (16 + 20 + 27)/3 = 21
   Σ(x − x̄A)² = 36,  Σ(x − x̄B)² = 40,  Σ(x − x̄C)² = 44
   SSW = 36 + 40 + 44 = 120,  MSW = SSW/(n − k) = 120/12 = 10
   SSB = 5(16 − 21)² + 5(20 − 21)² + 5(27 − 21)² = 310,  MSB = SSB/(k − 1) = 310/2 = 155
   F = MSB/MSW = 155/10 = 15.5
   Fcal > Fcritical, 15.5 > 6.93, so it lies in the rejection region. Therefore, we reject the null hypothesis and conclude as follows.
6. Conclusion: there is evidence, at the 1% significance level, that the true mean lifetimes of the three brands of bulbs do differ.
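As a cross-check (my own addition, assuming SciPy is installed), scipy.stats.f_oneway returns the same F statistic together with the corresponding p-value:

```python
from scipy import stats

brand_a = [16, 15, 13, 21, 15]
brand_b = [18, 22, 20, 16, 24]
brand_c = [26, 31, 24, 30, 24]

f_stat, p_value = stats.f_oneway(brand_a, brand_b, brand_c)
print(round(f_stat, 2))   # 15.5
print(p_value)            # well below 0.01, consistent with rejecting H0
```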
Example 4.
In a comparison of the cleaning action of four detergents, 20 pieces of white cloth were first
soiled with ink. The cloths were then washed under controlled conditions with 5 pieces washed
by each of the detergents. Unfortunately three pieces of cloth were 'lost' in the course of the
experiment. Whiteness readings, made on the 17 remaining pieces of cloth, are shown below.
                Brand of Detergent
No.      A       B       C       D
1        77      74      73      76
2        81      66      78      85
3        61      58      57      77
4        76      lost    69      64
5        69      lost    63      lost
Total    364     198     340     302
Assuming all whiteness readings to be normally distributed with a common variance, test the hypothesis that there is no difference between the four brands with respect to mean whiteness readings after washing, at the 5% level of significance.
Solution
Dependent variable: whiteness of the color
Independent variable: types of detergents
1. H0: The four brands of detergent do not differ in the mean whiteness readings of the cloths.
   Ha: At least one brand of detergent differs in the mean whiteness readings of the cloths.
2. α = 0.05
3. Test statistic: F-test with degrees of freedom v1 = k − 1 = 4 − 1 = 3 and v2 = n − k = 17 − 4 = 13. The critical value is F(0.05, 3, 13) = 3.41.
4. Decision rule: reject H0, if the calculated value of F is greater than F critical value (3.41).
5. Computation
                  Brand of Detergent
                A        B        C        D
Sample size     5        3        5        4        n = 17
Mean            72.8     66       68       75.5
Variance        62.2     64       68       75
Σ(x − x̄j)²      249      128      272      225      SSW = 249 + 128 + 272 + 225 = 874

MSW = SSW/(n − k) = 874/13 = 67.2
Using the rounded group means (73, 66, 68, 75.5) and the grand mean x̿ ≈ 70.9:
SSB = 5(73 − 70.9)² + 3(66 − 70.9)² + 5(68 − 70.9)² + 4(75.5 − 70.9)² ≈ 221
MSB = SSB/(k − 1) = 221/3 = 73.67
F = MSB/MSW = 73.67/67.2 = 1.096
Fcal < Fcritical: 1.096 < 3.41. Hence, we cannot reject the null hypothesis.
6. Conclusion: there is no evidence, at the 0.05 significance level, that H0 is false. Therefore, the four brands of detergent do not differ in the mean whiteness readings of the cloths.
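The same SciPy routine also handles groups of unequal size, so Example 4 can be checked in a few lines (a sketch of my own; the small difference from 1.096 arises because the worked solution rounds the group means):

```python
from scipy import stats

a = [77, 81, 61, 76, 69]
b = [74, 66, 58]
c = [73, 78, 57, 69, 63]
d = [76, 85, 77, 64]

f_stat, p_value = stats.f_oneway(a, b, c, d)
print(round(f_stat, 2))   # about 1.07; H0 is not rejected at the 0.05 level
```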
Example 5.
A randomized block design has three different income groups as blocks, and members of each
block have been randomly assigned to the treatment groups shown here. For these data, use the
0.01 level in determining whether the treatment effects could both be zero. Using the 0.01 level,
evaluate the effectiveness of the blocking variable.
              Treatment
Block         1       2
A             46      31
B             37      26
C             44      35
Solution
i. Describe the hypothesis:
Ho: There is no difference between the means of the two treatment groups.
Ha: There is a significant difference between the means of the two treatment groups.
ii. Level of significance:
=0.01
iii. Determine test statistics:
The F-test is the relevant statistic for testing the equality of the treatment means.
V1 = K – 1 = 2 – 1 = 1
V2 = n – k = 6 – 2 = 4
F 0.01,1,4 = 21.20
iv. Setting the decision rule:
Since F critical value is 21.20, we reject H0 if calculated value of F is greater
than 21.20.
v. Sample analysis:
First, we compute the sample means x̄j and the combined mean x̿:
x̄1 = (46 + 37 + 44)/3 = 42.3,  x̄2 = (31 + 26 + 35)/3 = 30.7
x̿ = (42.3 + 30.7)/2 = 36.5
SSW = (46 − 42.3)² + (37 − 42.3)² + (44 − 42.3)² + (31 − 30.7)² + (26 − 30.7)² + (35 − 30.7)² = 85.34
SSB = 3(42.3 − 36.5)² + 3(30.7 − 36.5)² = 201.84
MSW = SSW/(n − k) = 85.34/4 = 21.34
MSB = SSB/(k − 1) = 201.84/1 = 201.84
F = MSB/MSW = 201.84/21.34 = 9.46
vi. Conclusion:
Since Fcal < Ftable (9.46 < 21.20), the decision is to accept Ho. This implies that there is no difference in the means of the two treatment groups.
Two-Way ANOVA
Two-way ANOVA is the extension of one-way ANOVA. There are two independent factors (factor A and factor B), each of which operates on two or more levels.
The first factor is known as the principal factor and the second factor is called the blocking factor, since it creates homogeneous groups (blocks) of observations. The two-way ANOVA examines:
(i) the effect of factor A on the dependent variable,
(ii) the effect of factor B on the dependent variable, and
(iii) the interaction effect of factors A and B on the dependent variable.
These three effects have to be tested in two-way ANOVA. Therefore, there are three sets of null and alternative hypotheses to be tested, presented as follows:
1. Ho: None of the levels of factor A has an effect.  Ha: At least one level of factor A has an effect.
2. Ho: None of the levels of factor B has an effect.  Ha: At least one level of factor B has an effect.
3. Ho: Factors A and B do not interact.  Ha: Factors A and B interact.
The data can be listed in tabular form, with each cell identified as the combination of the ith level of factor A with the jth level of factor B. Each cell contains r observations, or replications. For example, consider the following layout (e.g. a 3×4 design):

                               Factor B
                  j = 1     j = 2     …     j = b     Total of ith level
Factor A   i = 1
           i = 2              Xijk
           …
           i = a
Total of jth level                                     grand total (abr observations)

Each of the cells should contain an equal number of observations for the combination of the levels of the two factors.
The number of values (replicates, or the sample size) for each cell (the combination of a particular level of factor A and a particular level of factor B) is r.
Xijk = value of the kth observation for level i of factor A and level j of factor B
Example 6.
The effective life (in hours) of batteries is compared by material type (1, 2 or 3) and operating
temperature: Low (-10˚C), Medium (20˚C) or High (45˚C). Twelve batteries are randomly
selected from each material type and are then randomly allocated to each temperature level. The
resulting life of all 36 batteries is shown below:
Table: Life (in hours) of batteries by material type and temperature
                                    Temperature (˚C)
Material type      Low (−10˚C)            Medium (20˚C)           High (45˚C)
1                  130, 155, 74, 180      34, 40, 80, 75          20, 70, 82, 58
2                  150, 188, 159, 126     136, 122, 106, 115      25, 70, 58, 45
3                  138, 110, 168, 160     174, 120, 150, 139      96, 104, 82, 60
Is there a difference in the mean life of the batteries for differing material types and operating temperature levels? Use α = 0.05.
1. Hypotheses
The number of replicates in each cell is r = 4: the twelve batteries of each material type are equally allocated to the three temperature levels.
Ho: there is no difference in the mean life of the batteries for differing material types and operating temperature levels.
Ha: there is a difference in the mean life of the batteries for at least one material type or operating temperature level.
2. Significance level: α = 0.05
3. The Calculations for Test statistic and Critical value in Two-Way ANOVA Design
The analysis proceeds through specific computations, each quantity being associated with a specific source of variation within the sample data. Thus, the test statistic F is determined from the between-group variability and the within-group variability.
The Sum of Squares Terms: Quantifying the Sources of Variation
As stated earlier, there are different sources of variation: factor A, factor B, the interaction of factors A and B, and random error.
The two independent factors (A and B) may be sources of variation between the means of the different groups. They may, independently or in combination, create variation in the dependent variable. Variation due to the factors includes the following.
SSA is the sum of squares reflecting variation caused by the levels of factor A:
SSA = r·b·Σi(x̄i − x̿)²
Its degrees of freedom are one less than the number of levels of factor A (VA = a − 1), and the mean square of factor A is
MSA = SSA / (a − 1)
SSB is the sum of squares reflecting variation caused by the levels of factor B:
SSB = r·a·Σj(x̄j − x̿)²
Its degrees of freedom are one less than the number of levels of factor B (VB = b − 1), and the mean square of factor B is
MSB = SSB / (b − 1)
SSE is the sum of squares reflecting variation due to sampling error. In this calculation, each data value is compared with the mean of its own cell. Its degrees of freedom are VE = ab(r − 1):
SSE = Σi Σj Σk (xijk − x̄ij)²
MSE = SSE / [ab(r − 1)]
Total variation is the sum of all these variations. This calculation compares each observation with the grand mean, with the differences squared and summed. The degrees of freedom for the total sum of squares are abr − 1:
SST = Σi Σj Σk (xijk − x̿)²
SSAB is the sum of squares reflecting variation caused by the interaction between the levels of factors A and B. It is most easily calculated by first computing the other sum of squares terms, then
SSAB = SST − SSA − SSB − SSE
The degrees of freedom for the interaction sum of squares are the product of the degrees of freedom for the levels of factor A (a − 1) and for the levels of factor B (b − 1): VAB = (a − 1)(b − 1), and
MSAB = SSAB / [(a − 1)(b − 1)]
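Before applying these formulas to the battery data, here is a minimal NumPy sketch (my own illustration, assuming a balanced design with r replicates per cell) that computes every sum of squares directly from a 3-dimensional array. Its output uses exact cell and level means, so the figures differ slightly from the worked example below, which rounds the means.

```python
import numpy as np

# data[i, j, k] = k-th replicate for level i of factor A and level j of factor B
data = np.array([
    [[130, 155, 74, 180],  [34, 40, 80, 75],     [20, 70, 82, 58]],   # material 1
    [[150, 188, 159, 126], [136, 122, 106, 115], [25, 70, 58, 45]],   # material 2
    [[138, 110, 168, 160], [174, 120, 150, 139], [96, 104, 82, 60]],  # material 3
], dtype=float)

a, b, r = data.shape
grand = data.mean()
ssa = r * b * ((data.mean(axis=(1, 2)) - grand) ** 2).sum()   # factor A (rows)
ssb = r * a * ((data.mean(axis=(0, 2)) - grand) ** 2).sum()   # factor B (columns)
sse = ((data - data.mean(axis=2, keepdims=True)) ** 2).sum()  # within-cell error
sst = ((data - grand) ** 2).sum()                             # total variation
ssab = sst - ssa - ssb - sse                                  # interaction

print(round(ssa, 1), round(ssb, 1), round(ssab, 1), round(sse, 1), round(sst, 1))
# about 10683.7, 39118.7, 9613.8, 18230.8, 77647.0
```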
Variation source   Sum of squares                    Degrees of freedom     Mean square                      F-ratio
Factor A           SSA = r·b·Σi(x̄i − x̿)²            VA = a − 1             MSA = SSA/(a − 1)                FA = MSA/MSE
Factor B           SSB = r·a·Σj(x̄j − x̿)²            VB = b − 1             MSB = SSB/(b − 1)                FB = MSB/MSE
Interaction A×B    SSAB = SST − SSA − SSB − SSE      VAB = (a − 1)(b − 1)   MSAB = SSAB/[(a − 1)(b − 1)]     FAB = MSAB/MSE
Error              SSE = ΣΣΣ(xijk − x̄ij)²            VE = ab(r − 1)         MSE = SSE/[ab(r − 1)]
Total              SST = ΣΣΣ(xijk − x̿)²              abr − 1
Now we can calculate the variability for the example on the mean life of batteries.
                                      Temperature (˚C)
Material type        Low (−10˚C)            Medium (20˚C)           High (45˚C)          Mean of level of A
1                    130, 155, 74, 180      34, 40, 80, 75          20, 70, 82, 58       x̄1 = 83.17
2                    150, 188, 159, 126     136, 122, 106, 115      25, 70, 58, 45       x̄2 = 108.33
3                    138, 110, 168, 160     174, 120, 150, 139      96, 104, 82, 60      x̄3 = 125.08
Mean of level of B   x̄.1 = 144.83           x̄.2 = 107.58            x̄.3 = 64.17          x̿ = 105.53

Sum of squares of factor A (material type):
SSA = r·b·Σi(x̄i − x̿)² = 4(3)[(83.17 − 105.53)² + (108.33 − 105.53)² + (125.08 − 105.53)²] = 10680.15
Mean square of factor A:
MSA = SSA/(a − 1) = 10680.15/2 = 5340.07
Sum of squares of factor B (temperature):
SSB = r·a·Σj(x̄j − x̿)² = 4(3)[(144.83 − 105.53)² + (107.58 − 105.53)² + (64.17 − 105.53)²] = 39112.11
Mean square of factor B:
MSB = SSB/(b − 1) = 39112.11/2 = 19556.05
Error Sum of Squares (SSE)
This is the sum of the squared deviations of each observation in a cell from that cell's mean. Therefore, for each cell we calculate the cell mean and then subtract it from each observation in the cell:
SSE = Σi Σj Σk (xijk − x̄ij)² = 18230.75
Mean Square of the Error (MSE)
MSE = SSE/[ab(r − 1)] = 18230.75/[3 × 3 × (4 − 1)] = 18230.75/27 = 675.21
Total Sum of Squares (SST)
SST = Σi Σj Σk (xijk − x̿)² = 77646.97
Interaction Sum of Squares (SSAB)
SSAB = SST − SSA − SSB − SSE = 77646.97 − 10680.15 − 39112.11 − 18230.75 = 9623.96
MSAB = SSAB/[(a − 1)(b − 1)] = 9623.96/4 = 2405.99
4. Critical values and decision rules
Test for factor A (material type): at the given level of significance α = 0.05, with numerator degrees of freedom VA = a − 1 = 3 − 1 = 2 and denominator degrees of freedom VE = ab(r − 1) = 27, the critical value is F(0.05, 2, 27) = 3.35. The decision rule is: reject Ho if Fcal is greater than Fcri.
5. Calculation and decision
FA = MSA/MSE = 5340.07/675.21 = 7.91. Since Fcal = 7.91 is greater than Fcr = 3.35, the decision is to reject the null hypothesis.
6. Conclusion: we conclude that the levels of material type affect the mean life of the batteries.
Test for factor B (temperature): VB = b − 1 = 2 and VE = 27, so the critical value is again F(0.05, 2, 27) = 3.35 and the same decision rule applies. FB = MSB/MSE = 19556.05/675.21 = 28.96, which is greater than 3.35, so Ho is rejected.
Conclusion: the mean lifetime of the batteries differs across the differing levels of temperature.
Testing for the interaction effect of factors A and B: VAB = (a − 1)(b − 1) = 4 and VE = 27, so the critical value is F(0.05, 4, 27) ≈ 2.73. FAB = MSAB/MSE = 2405.99/675.21 = 3.56. Hence Fcal is greater than Fcri and the decision is to reject Ho.
Conclusion: the mean life of the batteries differs with varying combinations of material type and operating temperature level.
Example 7.
A magazine publisher is studying the influence of type style and darkness on the readability of
her publication. Each of 12 persons has been randomly assigned to one of the cells in the
experiment, and the data are the number of seconds each person requires to read a brief test item.
For these data, use the 0.05 level of significance in drawing conclusions about the main and
interactive effects in the experiment.
                     Darkness
Type style     Light       Medium      Dark        Total    Mean
1              29, 32      23, 28      26, 30      168      28
2              29, 31      26, 23      23, 24      156      26
Total          121         100         103         324      27
Solution
i. Hypotheses (three sets):
Ho: There is no difference between the means of the two type styles.
Ha: There is a difference between the means of the two type styles.
Ho: There is no difference between the means of the three degrees of darkness.
Ha: There is a difference between the means of the three degrees of darkness.
Ho: Type style and darkness do not interact.
Ha: Type style and darkness interact.
ii. Level of significance: α = 0.05
iii. Determine the test statistics:
F-tests, with numerator degrees of freedom (V1) of (a − 1) for factor A and (b − 1) for factor B; the denominator (V2) of each F-ratio is based on MSE, which has ab(r − 1) degrees of freedom.
iv. Decision rule: reject H0 if the calculated value of F is greater than the F critical value.
For factor A: F(0.05, 1, 6) = 5.99
For factor B: F(0.05, 2, 6) = 5.14
The critical F for the interaction effect of factors A and B at the 0.05 level is
F[0.05, (a − 1)(b − 1), ab(r − 1)] = F(0.05, 2, 6) = 5.14
v. Computation:
There are r = 2 replications within each cell and b = 3 levels for factor B. SSA is based on the differences between the grand mean (x̿ = 27) and the respective means for the a = 2 levels of factor A:
SSA = r·b·Σi(x̄i − x̿)²
    = 2(3)[(28 − 27)² + (26 − 27)²]
    = 6(2)
    = 12
Factor B Sum of Squares, SSB
There are r = 2 replications within each cell and a = 2 levels for factor A. SSB is based on the differences between the grand mean (x̿ = 27) and the respective means for the b = 3 levels of factor B (121/4 = 30.25, 100/4 = 25 and 103/4 = 25.75):
SSB = r·a·Σj(x̄j − x̿)²
    = 2(2)[(30.25 − 27)² + (25 − 27)² + (25.75 − 27)²]
    = 4(16.12)
    = 64.48
Error Sum of Squares, SSE
In this calculation, each observation is compared with the mean of its own cell. For example, the mean of the (i = 2, j = 3) cell is x̄23 = (23 + 24)/2 = 23.5.
SSE = Σi Σj Σk (xijk − x̄ij)²
    = [(29 − 30.5)² + (32 − 30.5)²] + [(23 − 25.5)² + (28 − 25.5)²] + [(26 − 28)² + (30 − 28)²]
    + [(29 − 30)² + (31 − 30)²] + [(26 − 24.5)² + (23 − 24.5)²] + [(23 − 23.5)² + (24 − 23.5)²]
SSE = 32
Total Sum of Squares, SST
This calculation compares each observation with the grand mean (x̿ = 27), with the differences squared and summed:
SST = Σi Σj Σk (xijk − x̿)²
    = [(29 − 27)² + (32 − 27)²] + [(23 − 27)² + (28 − 27)²] + [(26 − 27)² + (30 − 27)²]
    + [(29 − 27)² + (31 − 27)²] + [(26 − 27)² + (23 − 27)²] + [(23 − 27)² + (24 − 27)²]
SST = 118
Interaction Sum of Squares, SSAB
Having calculated the other sum of squares terms, we obtain SSAB by subtracting them from SST:
SSAB = SST − SSA − SSB − SSE = 118 − 12 − 64.48 − 32 = 9.52
As we saw from the above table, each sum of squares term is divided by the
number of degrees of freedom with which it is associated. There are a= 2 levels
for factor A, b =3 levels for factor B, and r= 2 replications per cell, and the mean
square terms are as follows:
Factor A:
MSA = SSA/(a − 1) = 12/(2 − 1) = 12
Factor B:
MSB = SSB/(b − 1) = 64.48/(3 − 1) = 32.24
Interaction, A×B:
MSAB = SSAB/[(a − 1)(b − 1)] = 9.52/[(2 − 1)(3 − 1)] = 4.76
Error, E:
MSE = SSE/[ab(r − 1)] = 32/[2(3)(2 − 1)] = 5.33
The summary findings for the preceding analysis are shown in the following table.
Variation source    Sum of squares    Degrees of freedom    Mean square    F-ratio
Factor A            12                1                     12             2.25
Factor B            64.48             2                     32.24          6.05
Interaction A×B     9.52              2                     4.76           0.89
Error               32                6                     5.33
Total               118               11
vi. Conclusion:
FA = MSA/MSE = 12/5.33 = 2.25
FB = MSB/MSE = 32.24/5.33 = 6.05
FAB = MSAB/MSE = 4.76/5.33 = 0.89
o Regarding factor A: the calculated value of F (2.25) is less than the critical value (5.99), so H0 cannot be rejected. Our conclusion is that type style has no effect on the readability of the publication.
o Regarding factor B: the calculated F (6.05) is greater than the critical value (5.14), so H0 is rejected. Our conclusion is that at least one of the degrees of darkness has an effect on the readability of the publication.
o In the test for interaction effects, the calculated F (0.89) is less than the critical value (5.14), so H0 is not rejected. The factors operate independently; there is no interaction between type style (factor A) and darkness (factor B) in determining the readability of the publication.
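For readers who prefer a library routine, the same analysis can be reproduced with an OLS fit and ANOVA table from statsmodels (a sketch of my own; the column names 'seconds', 'style' and 'dark' are arbitrary choices, and pandas and statsmodels are assumed to be installed). For this balanced design the sum-of-squares column should come out as 12, 64.5, 9.5 and 32, matching the table above up to the rounding of SSB and SSAB, with F-ratios of about 2.25, 6.05 and 0.89.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# (type style, darkness) -> the two replicate reading times in seconds
readings = {
    (1, 'light'): [29, 32], (1, 'medium'): [23, 28], (1, 'dark'): [26, 30],
    (2, 'light'): [29, 31], (2, 'medium'): [26, 23], (2, 'dark'): [23, 24],
}
rows = [{'style': s, 'dark': d, 'seconds': v}
        for (s, d), values in readings.items() for v in values]
df = pd.DataFrame(rows)

model = ols('seconds ~ C(style) * C(dark)', data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # sum_sq, df, F and p-value for each source
```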
Exercise
1. Three racquetball players, one from each skill level, have been randomly selected from
the membership list of a health club. Using the same ball, each person hits five serves,
one with each of five racquets, and using the racquets in a random order. Each serve is
clocked with a radar gun, and the results are shown here. With player skill level as a
blocking variable, use the 0.025 level of significance in determining whether the
treatment effects of the five racquets could all be zero.
              Player Skill Level
Racquet    Beginner    Intermediate    Advanced
A 73 64 83
B 63 72 89
C 51 54 72
D 56 81 86
F 69 90 97
2. Given the following data for a two-way ANOVA, identify the sets of null and alternative
hypotheses, and then use the 0.05 level in testing each null hypothesis.
                        Factor B
                 1            2            3
Factor A    1    152, 151     158, 154     160, 160
            2    158, 154     164, 158     152, 155
3. The percent returns achieved by clients of three investment advisory firms were recorded for the past year, as given below. Perform an ANOVA test at the α = 0.05 level to determine if the mean returns for the three advisory firms are equal.
Percent returns
A B C
7.0 8.7 3.4
2.8 5.2 8.1
5.1 4.9 4.2
4.6 7.0 2.6
4. Instruments for correcting a power plant malfunction are mounted on a control panel. Three panels were designed, with the instruments arranged differently on the different panels. Then three random samples of four control engineers per panel were selected, and each sample was assigned to one panel. The times in seconds taken by the engineers to correct a simulated malfunction are given below. Perform an ANOVA test at the 0.05 level to determine if the mean times to correct the malfunction are the same for the three panels.
Time (seconds)
Panel A Panel B Panel C
17 9 13
12 16 8
15 11 14
20 12 9
5. Three methods for assembling a product are to be tested at the 0.05 level to determine
whether mean times per assembly for the methods are equal. Random sample assembly
times in minutes are given below. Perform the ANOVA test.
Method one Method two Method three
11 19 19
13 25 14
19 16 13
18 22 14
14 18 20
6. A stock analyst thinks four stock mutual funds generate about the same return. She collected the accompanying rate-of-return data on the four mutual funds during the last 5 years.
Conduct a two-way ANOVA to decide whether the funds give different performances. Use a 5% level of significance.
A B C D
1988 12 11 13 15
1989 12 17 19 11
1990 13 18 15 12
1991 18 20 25 11
1992 12 19 19 10
7. The following table gives data on the sales made by four salesmen in four zones of Ethiopia. At the 5% level of significance, conduct a two-way ANOVA (analysis of variance) to test whether the mean sales among the salesmen are the same.
              North    East    West    South
Salesman A    8        6       5       4
Salesman B    6        6       7       6
Salesman C    5        6       8       9
Salesman D    4        8       7       9
CHAPTER – 6
CORRELATION AND REGRESSION ANALYSIS
Definitions:
Correlation is an analysis of the co-variation between two or more variables.
Correlation expresses the inter-dependence of two sets of variables upon each other. One variable may be called the subject (independent) variable and the other the relative (dependent) variable; the relative variable is measured in terms of the subject.
Correlation is classified into various types. The most important ones are positive and negative correlation, linear and non-linear correlation, and simple, partial and multiple correlation.
i. Positive and Negative Correlation:
It depends upon the direction of change of the variables. If the two variables tend to move
together in the same direction (i.e.) an increase in the value of one variable is accompanied by an
increase in the value of the other, (or) a decrease in the value of one variable is accompanied by a
decrease in the value of other, then the correlation is called positive or direct correlation. Price
and supply, height and weight, yield and rainfall, are some examples of positive correlation.
If the two variables tend to move together in opposite directions so that increase (or) decrease in
the value of one variable is accompanied by a decrease or increase in the value of the other
variable, then the correlation is called negative (or) inverse correlation. Price and demand, yield
of crop and price, are examples of negative correlation.
ii. Linear and Non-linear Correlation:
If the ratio of change between the two variables is constant, then there will be linear correlation between them. Consider the following.
X 2 4 6 8 10 12
Y 3 6 9 12 15 18
[Figure: plot of Y against X for the values above; the points lie exactly on a straight line.]
Here the ratio of change between the two variables is the same. If we plot these points on a graph, we get a straight line. If the amount of change in one variable does not bear a constant ratio to the amount of change in the other, then the relation is called curvilinear (or non-linear) correlation; the graph will be a curve.
iii. Simple, Partial and Multiple Correlation:
When we study only two variables, the relationship is simple correlation; for example, quantity of money and price level, or demand and price. In multiple correlation we study more than two variables simultaneously; the relationship of price, demand and supply of a commodity is an example of multiple correlation.
The study of two variables excluding the effect of some other variable is called partial correlation; for example, we study price and demand while eliminating the supply side. In total correlation all the facts are taken into account.
When there is some relationship between two variables, we have to measure the degree of the relationship. This measure is called the measure of correlation, or the correlation coefficient, and it is denoted by 'r'.
Co-variation: Covariance is a descriptive measure of the linear association between two
variables X (independent) and Y (dependent).
Cov(X, Y) or SXY = Σ(x − x̄)(y − ȳ)/(n − 1), for sample data,
where x̄ and ȳ are respectively the means of X and Y, and n is the number of pairs of observations selected as a sample. If SXY is positive, there is a direct linear relationship between the two variables (an increase in X corresponds to an increase in Y). If SXY is negative, there is an inverse linear relationship between the two variables (an increase in X corresponds to a decrease in Y). If the value of SXY is zero, there is no linear relationship between the two variables.
However, the magnitude of SXY depends on the units of measurement of the two variables, so on its own it is not a good measure of the strength of the relationship. For this, the best measure of the strength of the relationship is the Pearson correlation coefficient.
Karl Pearson, a great biometrician and statistician, suggested a mathematical method for
measuring the magnitude of the linear relationship between two variables. It is the most widely used method in practice and is known as the Pearsonian coefficient of correlation. The correlation coefficient is a descriptive measure of the strength of the linear association between two variables, X and Y. Values of the correlation coefficient are always between −1 and +1. A value of −1
indicates that the two variables X and Y are perfectly related in a negative linear sense. That is,
all data points are on a straight line that has a negative slope.
A value of +1 indicates that X and Y are perfectly related in a positive linear sense, with all data
points on a straight line that has a positive slope. Values of the correlation coefficient close to
zero indicate that X and Y are not linearly related. Correlation coefficient quantifies the
direction and strength of the linear association between the two variables. The sign of the
correlation coefficient indicates the direction of the association. The magnitude of the
correlation coefficient indicates the strength of the association.
r = Cov(X, Y) / (SX · SY)
where Cov(X, Y) is the sample covariance of X and Y, and SX and SY are the standard deviations of the two variable series. Substituting the definitions,
r = [Σ(x − x̄)(y − ȳ)/(n − 1)] / [√(Σ(x − x̄)²/(n − 1)) · √(Σ(y − ȳ)²/(n − 1))]
and the (n − 1) terms cancel, so that
r = Σ(x − x̄)(y − ȳ) / [√Σ(x − x̄)² · √Σ(y − ȳ)²]
Steps:
i. Find the means x̄ and ȳ of the two series.
ii. Take the deviations of X and Y from their means: x = X − x̄ and y = Y − ȳ.
iii. Square the deviations and get the total sum of the respective squares of deviations of X and Y, denoted by Σx² and Σy² respectively.
iv. Multiply the deviations of X and Y, get the total Σxy, and divide it by the square root of the product of Σx² and Σy², as in the formula
r = Σxy / √(Σx² · Σy²)
v. Substitute the values in the formula.
Example 1. Find Karl Pearson's coefficient of correlation from the following data on the heights of fathers (X) and sons (Y). Comment on the result.
X 64 65 66 67 68 69 70
Y 66 67 65 68 70 68 72
Solution
X      Y      x = X − x̄    x²    y = Y − ȳ    y²    xy
64     66     −3            9     −2           4     6
65     67     −2            4     −1           1     2
66     65     −1            1     −3           9     3
67     68      0            0      0           0     0
68     70      1            1      2           4     2
69     68      2            4      0           0     0
70     72      3            9      4           16    12
469    476     0            28     0           34    25

x̄ = 469/7 = 67,  ȳ = 476/7 = 68
r = Σxy / √(Σx² · Σy²) = 25 / √(28 × 34) = 25/30.85 = 0.81
Since r = +0.81, the variables are strongly positively correlated; i.e., tall fathers tend to have tall sons.
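A quick numerical cross-check of this example (my own addition, using NumPy's built-in correlation matrix):

```python
import numpy as np

father = [64, 65, 66, 67, 68, 69, 70]
son    = [66, 67, 65, 68, 70, 68, 72]

r = np.corrcoef(father, son)[0, 1]   # off-diagonal entry is the Pearson r
print(round(r, 2))                   # 0.81
```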
The computational (short-cut) form of the correlation coefficient can be derived as follows. Starting from
r = [Σ(x − x̄)(y − ȳ)/(n − 1)] / [√(Σ(x − x̄)²/(n − 1)) · √(Σ(y − ȳ)²/(n − 1))]
the numerator is the sample covariance,
SXY = Σ(x − x̄)(y − ȳ)/(n − 1)
The deviation product is formed for each individual pair of observations and then summed over all the paired observations:
Σ(x − x̄)(y − ȳ) = (x1 − x̄)(y1 − ȳ) + (x2 − x̄)(y2 − ȳ) + … + (xn − x̄)(yn − ȳ), which is equivalent to ΣXY − (ΣX)(ΣY)/n.
The (n − 1) in the denominator of the covariance is cancelled because it is multiplied by its reciprocal coming from the standard deviations. Then we find
r = [nΣXY − ΣXΣY] / [√(nΣX² − (ΣX)²) · √(nΣY² − (ΣY)²)]
Note: In the above method we need not find mean or standard deviation of variables separately.
Example 2: Calculate coefficient of correlation for the following data.
X 1 2 3 4 5 6 7 8 9
Y 9 8 10 12 11 13 14 16 15
Solution
X Y XY X2 Y2
1 9 9 1 81
2 8 16 4 64
3 10 30 9 100
4 12 48 16 144
5 11 55 25 121
6 13 78 36 169
7 14 98 49 196
8 16 128 64 256
9 15 135 81 225
ΣX = 45,  ΣY = 108,  ΣXY = 597,  ΣX² = 285,  ΣY² = 1356
r = [nΣXY − ΣXΣY] / [√(nΣX² − (ΣX)²) · √(nΣY² − (ΣY)²)]
r = (9 × 597 − 45 × 108) / [√(9 × 285 − 45²) · √(9 × 1356 − 108²)]
  = (5373 − 4860) / [√(2565 − 2025) · √(12204 − 11664)]
  = 513 / (√540 × √540) = 513/540 = 0.95
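The short-cut formula translates directly into code. The small sketch below (my own illustration, standard library only) reproduces r = 0.95 for the data of Example 2 without computing means or deviations explicitly:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r via the raw-sums (short-cut) formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / (sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [9, 8, 10, 12, 11, 13, 14, 16, 15]
print(round(pearson_r(x, y), 2))   # 0.95
```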
1.2.2. Rank Correlation:
It is studied when no assumption about the parameters of the population is made. This method is
based on ranks. It is useful to study the qualitative measure of attributes like honesty, colour,
beauty, intelligence, character, morality etc. The individuals in the group can be arranged in
order and there on, obtaining for each individual a number showing his/her rank in the group.
This method was developed by Edward Spearman in 1904. It is defined as
r = 1 − 6ΣD² / (n³ − n)
The value of r lies between –1 and +1. If r = +1, there is complete agreement in order of ranks
and the direction of ranks is also same. If r = -1, then there is complete disagreement in order of
ranks and they are in opposite directions. Computation for tied observations: there may be two or more items having equal values. In such a case the same rank is to be given and the ranking is said to be tied. In such circumstances an average rank is given to each of the tied items. For example, if a value is repeated twice at the 5th rank, the common rank assigned to each item is (5 + 6)/2 = 5.5, which is the average of the ranks 5 and 6 that the two items would otherwise occupy.
Example 3: In a marketing survey the prices of tea and coffee in a town, based on quality, were found as shown below. Can you find any relation between the tea and coffee prices?
Price of tea       88     90     95     70     60     75     50
Price of coffee    120    134    150    115    110    140    100
Solution
Price of tea    Rank    Price of coffee    Rank    D     D²
88              3       120                4       −1    1
90              2       134                3       −1    1
95              1       150                1        0    0
70              5       115                5        0    0
60              6       110                6        0    0
75              4       140                2        2    4
50              7       100                7        0    0
                                                          ΣD² = 6
r = 1 − (6 × 6)/(7³ − 7) = 1 − 36/336
  = 1 − 0.1071
  = 0.8929
The relation between the price of tea and the price of coffee is positive, at about 0.89. Based on quality, the association between the price of tea and the price of coffee is highly positive.
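SciPy's spearmanr performs the ranking (including the averaging of tied ranks) automatically, so Example 3 can be verified as follows (my own sketch, assuming SciPy is available):

```python
from scipy import stats

tea    = [88, 90, 95, 70, 60, 75, 50]
coffee = [120, 134, 150, 115, 110, 140, 100]

rho, p_value = stats.spearmanr(tea, coffee)
print(round(rho, 4))   # 0.8929
```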
2. REGRESSION ANALYSIS
Regression is the measure of the average relationship between two or more variables in terms of
the original units of the data. After knowing the relationship between two variables we may be
interested in estimating (predicting) the value of one variable given the value of another. The variable predicted on the basis of other variables is called the 'dependent' or 'explained' variable, and the other is the 'independent' or 'predicting' variable. The prediction is based on the average relationship derived statistically by regression analysis. The equation, linear or otherwise, is called the regression equation or the explaining equation.
For example, if we know that advertising and sales are correlated, we may find out expected
number of sales for a given advertising expenditure or the required amount of expenditure for
attaining a given number of sales.
The relationship between two variables can be considered between, say, rainfall and agricultural
production, price of an input and the overall cost of product, consumer expenditure and
disposable income. Thus, regression analysis reveals average relationship between two variables
and this makes possible estimation or prediction.
Simple and Multiple:
In the case of a simple relationship only two variables are considered, for example the influence of advertising expenditure on sales turnover. In the case of multiple relationships, more than two variables are involved: one variable is the dependent variable and the remaining variables are independent ones. For example, turnover (y) may depend on advertising expenditure (x) and the income of the people (z). Then the functional relationship can be expressed as y = f(x, z).
Linear relationships are based on a straight-line trend, the equation of which has no power higher than one. Remember that a linear relationship can be either simple or multiple. Normally a linear relationship is assumed because, besides its simplicity, it has better predictive value; a linear trend can easily be projected into the future. In the case of a non-linear relationship, curved trend lines are derived; the equations of these are parabolic.
In the case of total relationships all the important variables are considered. Normally, they take the form of multiple relationships because most economic and business phenomena are affected by a multiplicity of causes. In the case of a partial relationship one or more variables are considered, but not all, thus excluding the influence of those not found relevant for the given purpose.
Simple Linear Regression
It is called simple because only two variables are involved: one dependent variable and one independent variable. If the two variables have a linear relationship, then as the independent variable (X) changes, the dependent variable (Y) also changes. If the different values of X and Y are plotted, the straight line of the linear equation that provides the best fit passes through the plotted points. This line is known as the regression line. The equation gives the best estimate of one variable (the dependent, Y) for a known value of the other (the independent, X). A typical purpose of this type of analysis is to estimate or predict what Y will be for a given value of X.
The simple linear regression model is a linear equation having a y-intercept and a slope, with
estimates of these population parameters based on sample data and determined by standard
formulas. The model is described in terms of the population parameters as follows:
yi = β0 + β1xi + εi
where yi = the value of the dependent variable Y
β0 = the intercept of y
β1 = the slope of the regression equation
xi = the value of the independent variable X
εi = the random error
- For a given value of x, the expected value of y is given by the linear equation
E(y) = β0 + β1x
The term E(y) can be stated as "the mean of y, given a specific value of x."
- The difference between the actual value of y and the expected value of y is the error, or residual,
ε = y − (β0 + β1x),
i.e. the difference between the actual value (yi) and the estimated value.
The assumptions of the model are:
1. For any given value of X, the Y values are normally distributed with a mean that is on the regression line, β0 + β1x.
2. Regardless of the value of X, the standard deviation of the distribution of Y values about the regression line is the same. The assumption of equal standard deviations about the regression line is called homoscedasticity.
3. The Y values are statistically independent of each other. For example, if a given Y value happens to exceed β0 + β1x, this does not affect the probability that the next Y value observed will also exceed β0 + β1x.
Estimation of Regression Equation
Based on the sample data, the y-intercept and slope of the population regression line can be
estimated. The result is the sample regression line:
ŷ = b0 + b1x
where ŷ = the estimated average value of the dependent variable (Y) for a given value of x
b0 = the y-intercept; this is the value of ŷ where the line intersects the y-axis, i.e. when x = 0
b1 = the slope of the regression line
x = a value of the independent variable
In the estimated equation the sample statistics b0 and b1 provide estimates of the unknown parameters β0 and β1, respectively. The sign of b1 shows the direction of the relationship between X and Y. If b1 is negative, the two variables have an inverse relationship: as X increases, the value of Y decreases. If b1 is positive, the two variables have a direct relationship: whenever the value of X increases, the value of Y also increases, and vice versa.
The estimated simple linear regression equation represents the straight line that best fits the paired points sketched on the scatter diagram. The following are the possible regression lines of a simple linear regression:
[Figure: possible simple linear regression lines: (a) a positive linear relationship, with slope b1 positive; (b) a negative linear relationship, with slope b1 negative; and no linear relationship, with slope b1 = 0.]
In developing the estimated simple linear regression equation from a set of data, the regression line should provide the best fit. This means that the difference between the actual value of the dependent variable Y and the estimated value of the dependent variable should be minimal for each given value of the independent variable X. In other words, the error should be minimal: the minimum sum of squared deviations of the actual values from the estimated values. Therefore, the least squares criterion is
min Σ(yi − ŷi)²
where yi is the observed value and ŷi the estimated value of the dependent variable for the ith observation.
Based on the methods of differential calculus, values for b0 and b1 can be determined such that the least-squares criterion is met. The least-squares regression line may also be referred to as the least-squares regression equation or simply the regression line.
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
b0 = ȳ − b1x̄
Hence, for a set of data we can develop the least-squares regression equation
ŷ = b0 + b1x
Example 4. For a sample of 8 employees, a production director has collected the following data on the number of units produced per hour by each worker versus years with the firm.
Years (X)                6     12    14    6     9     13    15    9
Units per hour (Y)       30    40    56    25    28    65    63    52
Solution
(a)
1. Dependent variable (Y) = number of units produced per hour by an employee with a given number of years of experience
Independent variable (X) = number of years of experience
The estimated regression equation will be of the form ŷ = b0 + b1x.
2. Determine the mean of each series:
x̄ = ΣX/n = 84/8 = 10.5
ȳ = ΣY/n = 359/8 = 44.875
3. Compute the required sums:
X     Y     XY     X²     Y²
6     30    180    36     900
12    40    480    144    1600
14    56    784    196    3136
6     25    150    36     625
9     28    252    81     784
13    65    845    169    4225
15    63    945    225    3969
9     52    468    81     2704
84    359   4104   968    17943
4. Substitute into the formulas for b1 and b0:
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = (ΣXY − n·x̄·ȳ) / (ΣX² − n·x̄²)
   = (4104 − 8 × 10.5 × 44.875) / (968 − 8 × 10.5²)
   = 334.5 / 86 = 3.89, the slope.
Interpretation of the slope: for each additional year of stay within the firm, the average number of units that an employee can produce per hour increases by 3.89 units.
b0 = ȳ − b1x̄ = 44.875 − 3.89(10.5) = 4.03
Hence the estimated regression equation is
ŷ = 4.03 + 3.89x
Substituting x = 10 (an employee with 10 years in the firm),
ŷ = 4.03 + 3.89(10) ≈ 42.9 units per hour.
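The least-squares formulas for b1 and b0 are easy to verify numerically. The following NumPy sketch (my own addition) reproduces the slope, intercept and the prediction at x = 10 for this example:

```python
import numpy as np

x = np.array([6, 12, 14, 6, 9, 13, 15, 9], dtype=float)     # years with the firm
y = np.array([30, 40, 56, 25, 28, 65, 63, 52], dtype=float) # units produced per hour

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
print(round(b1, 2), round(b0, 2))   # about 3.89 and 4.03
print(round(b0 + b1 * 10, 1))       # predicted productivity at 10 years: about 42.9
```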
Coefficient of Determination
The other measure of the strength of the relationship between two variables is coefficient of
determination. Coefficient of determination provides a measure of the goodness of fit for the
estimated regression equation. It determines the percent of the variability in dependent variable
that can be explained by the linear relationship between the dependent and independent
variables. It is denoted by r².
r² = SSR/SST
In estimating the simple regression equation, the total variation in the dependent variable can be separated into variation due to the regression and variation due to error. The total variation is the variation of the actual values of the dependent variable measured from the mean of the observed values; it is the sum of squared deviations of each value from the mean of the observed values (SST):
SST = Σ(yi − ȳ)²
The total sum of squares is divided into two parts: the sum of squares due to error and the sum of squares due to regression.
The error part is the difference between the actual values (yi) of the dependent variable and the estimated values (ŷi) of the dependent variable. Its sum of squares is the sum of squares due to error, or residual (SSE):
SSE = Σ(yi − ŷi)²
The sum of squares due to regression (SSR) measures how much the values on the estimated regression line (ŷi) deviate from the mean of the observed values (ȳ):
SSR = Σ(ŷi − ȳ)²
The estimated regression equation would provide a perfect fit if every value of the dependent variable yi happened to lie on the estimated regression line. In that case of a perfect fit, SSE = 0, SST = SSR and r² = 1. Poorer fits result in larger values of SSE. Solving for SSE in the identity SST = SSR + SSE, we see that SSE = SST − SSR. Hence, the largest value of SSE (and the poorest fit) occurs when SSR = 0 and SSE = SST.
For the data of Example 4, first compute SST, the sum of squared deviations of each observed Y value from ȳ = 44.88:
X     Y     Y − ȳ      (Y − ȳ)²
6     30    −14.88     221.41
12    40    −4.88      23.81
14    56    11.12      123.65
6     25    −19.88     395.21
9     28    −16.88     284.93
13    65    20.12      404.81
15    63    18.12      328.33
9     52    7.12       50.69
84    359              SST = Σ(yi − ȳ)² = 1832.88
For the sum of squares due to error (SSE), for each given value of the independent variable we find the estimated value of the dependent variable using the estimated regression equation ŷi = 4.03 + 3.89xi, and sum the squared residuals:
SSE = Σ(yi − ŷi)² ≈ 531.7
SSR, the sum of squares due to regression, is the total sum of squares minus the sum of squares due to error (residual):
SSR = SST − SSE ≈ 1832.88 − 531.7 = 1301.2
r² = SSR/SST = 1301.2/1832.88 = 0.71
About 71% of the variability in employee productivity is explained by the linear relationship between an employee's years of experience and his/her productivity.
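The decomposition of the total variation can be checked with a few more lines (again my own sketch; the figures are approximate because the text rounds b1 and b0):

```python
import numpy as np

x = np.array([6, 12, 14, 6, 9, 13, 15, 9], dtype=float)
y = np.array([30, 40, 56, 25, 28, 65, 63, 52], dtype=float)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x                    # fitted values on the regression line

sst = ((y - y.mean()) ** 2).sum()      # total variation
sse = ((y - y_hat) ** 2).sum()         # residual (unexplained) variation
ssr = ((y_hat - y.mean()) ** 2).sum()  # variation explained by the regression
print(round(sst, 1), round(sse, 1), round(ssr, 1), round(ssr / sst, 2))
# about 1832.9, 531.8, 1301.1 and r^2 = 0.71
```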
Interpretation of r²
- The value of the coefficient of determination lies between 0 and 1. It is the percent of the variability in the value of the dependent variable that can be explained by the linear relationship with the independent variable.
- It does not indicate a cause-and-effect relationship.
Exercise
1. Given the following pair of values
X 1 2 3 4 6 9 10
Y 2 4 5 7 8 12 13
2. GDP (Y) 5 6 6 7 8 7 8 8 9 9
A. Is the linear regression model appropriate for the relationship between work force and
GDP?
B. Find the regression equation of Yi on Xi
C. What will be the production output if the labour force is 10?
3. The following data show the annual advertising expenditure in millions of dollars and the
market share for six automobile companies;
Company                   Advertising cost ($ millions)    Market share (%)
Mercedes-Benz             1590                             14.9
Ford Motor Co.            1568                             18.6
General Motors Corp.      3004                             26.2
Honda Motor Co.           854                              8.6
Nissan Motor Co.          1023                             6.3
Toyota Motor Corp.        1075                             13.3
A. Develop a scatter diagram for these data with the advertising expenditure as the
independent variable and the market share as the dependent variable.
B. What does the scatter diagram developed in part (a) indicate about the relationship
between the two variables?
C. Use the least squares method to develop the estimated regression equation.
D. Provide an interpretation for the slope of the estimated regression equation.
E. Suppose that Honda Motor Co. believes that the estimated regression equation developed
in part (c) is applicable for developing an estimate of market share for next year. Predict
Honda‘s market share if they decide to increase their advertising expenditure to $1200
million next year.
F. Determine the coefficient of determination for the relationship between advertising and
market share for the companies.
4. Consider the following data on the number of vehicles (Xi) and the gasoline sales (Yi) in 5
regions.
Region Number of vehicles in (000) Gasoline sold in (000) birr
1         3                           2
2         7                           4
3         4                           2
4         1                           1
5         5                           3