103 SM - All - in - One
1.1 Introduction
Statistics plays an important role in almost every facet of human life. In a
business context, managers are required to justify decisions on the basis of
data, and they need statistical models to support these decisions. Statistical
skills enable managers to collect, analyse and interpret data in order to take
suitable decisions.
Statistical concepts and statistical thinking enable them to:
Solve problems in almost every domain
Support their decisions
Reduce guesswork
In this unit, you will study about Statistics, which deals with gathering,
organising, presenting and analysing data.
Objectives:
After studying this unit, you should be able to:
describe the scope and applications of statistics
explain the characteristics of statistics
recognise the functions of statistics
identify the limitations of statistics
analyse statistical software
1.1.1 Relevance
Nature created variation and thereby gave rise to the subject of statistics.
The subject exists only because of variation in data, be it the height or
weight of newborn babies, the facial features, height or weight of persons,
the growth of companies or market prices. Indeed, the capital Greek letter ∑
(sigma), used to indicate the total or sum of numbers, and the small Greek
letter σ (also sigma), used to measure deviation, could be labelled the
lifeblood of statisticians.
Although nature believes in variation, it also believes in symmetrical
variation, such as in the weight of newborn babies or the height of
individuals, without any bias. Examples of man-made asymmetrical variation
are educational qualification and household income. The study of Statistics
helps in studying variation in data in order to find patterns and draw
conclusions.
(Source: Adapted from T. N. Srivastava & Shailaja Rejo (2008) Statistics for
Management 5th ed.TMH)
Economics
Economists are frequently asked to provide forecasts about the future of the
economy. They use a variety of statistical information in making such
forecasts. For example, in forecasting the inflation index, economists use
statistical information on indicators such as the producer price index, the
unemployment rate and manufacturing capacity utilisation.
Caselet 1
The new General Manager Mr. Ravi of a manufacturing company is
concerned about the dwindling profits of the company. The Marketing and
Production Managers identify the reason as the guarantee period given to
customers, since the product has to be replaced if it fails within the
guarantee period. This replacement lowers the company’s profits and also
causes loss of reputation. The General Manager wants to reduce the
percentage of failure of units within a year. This means that he should
take action to improve the life of the unit. After preliminary studies he
decides to:
i) Estimate the average life of the units and their variation.
ii) Take action to improve the life of the unit.
iii) Lower the replacement cost as much as possible.
As you can see, the General Manager is using Statistics to solve a problem
and to increase profits. Decision making is a key part of our day-to-day life.
Even when we wish to purchase a television, we want to know the price,
quality, durability, and maintainability of various brands and models before
buying one. In this scenario, data is collected and an optimum decision is
made. In other words, we are using Statistics.
When a company wishes to introduce a new product, it has to collect
data on market potential, consumer preferences, availability of raw materials
and the feasibility of producing the product. Hence, data collection is the
backbone of any decision-making process.
Many organisations find themselves rich in data but poor at drawing
information out of it. It is therefore important to develop the ability to
extract meaningful information from raw data in order to make better
decisions. Statistics plays an important role in this respect.
[Figure: Statistics is divided into Descriptive Statistics and Inferential Statistics]
Caselet 2
In a firm, the Human Resource Manager (HR Manager) calculates the
average salary of employees of the production department. The statistical
data collected is related to the production department and does not give
any information about the other departments of the firm. Here, the HR
Manager is using descriptive statistics. In this example, the HR Manager
displays the summarised numerical data in the form of tables, charts, and
diagrams, which come under descriptive statistics.
Inferential Statistics
Inferential Statistics is used to make valid inferences from the data for
effective decision making among managers or professionals. Statistical
Caselet 3
In a firm, the Human Resources Manager (HR Manager) uses the
average salary of employees of the production department, along with the
salary details of other departments, to estimate/project the average salary
of employees for all other departments in the firm. Here, the HR Manager
is using inferential statistics as the estimation of averages deals with
inferential statistics.
Activity
Place the number of the appropriate definition next to the item it describes.
Items:
A. Statistic
B. Parameter
C. Discrete
E. Mutually exclusive
F. Zero
G. Continuous
H. Inferential statistic
Definitions:
1. Do not contain the same outcome.
2. The use of sample statistics to draw conclusions concerning the population.
3. A numerical characteristic of a sample.
4. Only finite values can exist on the X axis.
5. Sum of deviation around a mean.
6. Measurement may assume any value associated with an uninterrupted scale.
7. A numerical characteristic of a population.
Solution
A. 3, B. 7, C. 4, E. 1, F. 5, G. 6, H. 2
2. Out of the following, which one does not refer to a mass of data?
a) Banking Statistics
b) Mathematical Statistics
c) Agricultural Statistics
d) Income Statistics
3. Which of the following statements is most appropriate?
a) Nature believed in statistics
b) Nature created statistics
c) Nature believed in variation
d) Nature believed in symmetrical variation
4. Which of the following statements is true?
a) Statistics enlarges physical vision
b) Statistics helps in estimation
c) Statistics quantifies uncertainty
d) Statistics is of no use to humanity.
5. The origin of statistics can be traced to
a) State
b) Commerce
c) Economics
d) Industry
(1Source: Agarwal B L (2006) Basic Statistics 4th ed. Pg 1 New Age International Publishers)
(2Source: Agarwal B L (2006) Basic Statistics 4th ed. Pg 2 New Age International Publishers)
(3Source: Agarwal B L (2006) Basic Statistics 4th ed. Pg 2 New Age International Publishers)
Figure 1.2 depicts four different components of Statistics as per Croxton and
Cowden.
1. Collection of data
Careful planning is required while collecting data. Two methods used for
collecting data are census method and sampling method. The investigator
has to take care while selecting an appropriate collection method.
In the census method, every unit or object of the population is included in
the investigation. For example, if we want to study the average annual income
of 500 families in a given area using the census method, we must study the
income of all 500 families. When the population is large, applying the census
method is difficult.
Sometimes a sample of units or objects is taken from the population to
describe the overall characteristics of that population. This method of
collecting data is called sampling. The sampling method is helpful when the
population is large or when the results are needed in a short time.
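The difference between the two collection methods can be illustrated with a
minimal Python sketch. The income figures are randomly generated and purely
hypothetical, and the function names census_mean and sample_mean are chosen
only for this illustration. The census figure uses all 500 families, while
the sample figure is an estimate based on 50 families drawn at random.

```python
import random
import statistics

def census_mean(incomes):
    """Census method: use every unit in the population."""
    return statistics.mean(incomes)

def sample_mean(incomes, sample_size, seed=0):
    """Sampling method: estimate the mean from a simple random sample."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sample = rng.sample(incomes, sample_size)
    return statistics.mean(sample)

# Hypothetical annual incomes for 500 families (made-up values).
_rng = random.Random(42)
population = [_rng.randint(200_000, 1_200_000) for _ in range(500)]

print("Census mean:", census_mean(population))
print("Sample mean:", sample_mean(population, sample_size=50))
```

The sample mean will usually differ a little from the census mean; that
difference is the price paid for the saving in time and cost.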
2. Presentation of data
The collected data is condensed, summarised and presented in a tabular,
diagrammatic or graphical form for further analysis.
Tabulation is a systematic arrangement of classified data in rows and
columns. For the representation of data in diagrams, we use different types
of diagrams such as one-dimensional, two-dimensional and three-
dimensional diagrams.
Line diagrams and bar diagrams are one-dimensional diagrams. (Refer to
figure 1.3 and figure 1.4 for illustrations of line diagrams and bar
diagrams respectively.)
Fig. 1.6: Pie-chart of Prasad’s Family Expenses
3. Analysis of data
The data presented has to be carefully analysed before any inference is made
from it. The analysis can use various measures, for example, measures of
central tendency, dispersion, correlation or regression.
Measures of central tendency identify the central value around which the data
cluster. When computed for a population, these measures are parameters; when
computed for a sample, they are statistics that serve as estimates of the
population parameters. The three most common measures of the centre of a
distribution are the mean, the median and the mode.
Measures of dispersion quantify the spread of the distribution. Range,
interquartile range, mean deviation and standard deviation are four measures
of dispersion.
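As an illustration, the measures named above can be computed with Python's
standard statistics module. This is a minimal sketch on a small hypothetical
data set; the function name describe is an assumption made for this example.
The standard deviation is computed in its population form, and the quartiles
follow the module's default convention (other textbooks may use a slightly
different quartile rule).

```python
import statistics

def describe(values):
    """Return common measures of central tendency and dispersion."""
    mean = statistics.mean(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "interquartile_range": q3 - q1,
        "mean_deviation": sum(abs(x - mean) for x in values) / len(values),
        "standard_deviation": statistics.pstdev(values),  # population form
    }

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
for name, value in describe(scores).items():
    print(f"{name}: {value}")
```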
4. Interpretation of data
The final step is to draw conclusions from the analysed data. Interpretation
requires a high degree of skill and experience.
Thus, Statistics contains the tools and techniques required for collection,
presentation, analysis and interpretation of data. We can conclude that
this definition is precise and comprehensive.
The data in table 1.1 can be condensed using statistical concepts such as
frequency and frequency distribution, and is presented as a frequency table in
table 1.1a, from which conclusions can be drawn. In
this example, from the bulk data consisting of 50 rating scores, the
Example 5
The curves in figure 1.7 and figure 1.8 show the
profits of CBA Company and ZYX Company respectively, for the ten years
from 1998 to 2008. The timeline in years is plotted on the X-axis and the
profits on the Y-axis. From the graphs, we can compare the profits of
the two companies and conclude that the profits of CBA Company in the
year 2008 are higher than those of ZYX Company.
The profit curve in figure 1.7 shows that the profits of CBA
Company are increasing, whereas figure 1.8 shows that the profits of ZYX
Company have remained constant since the middle of the decade (1998-2008).
Minitab
Minitab is a statistical software package that was designed especially for
the teaching of introductory statistics courses. It is an easy-to-use
statistical software package and is a vital and significant component of
such a course. This permits the student to focus on statistical concepts
and thinking, rather than computations or the learning of a statistical
package. The main aim of any introductory statistics course should
always be the why of statistics rather than technical details that do little to
stimulate the majority of students and do little to reinforce the key
concepts. (Source: http://www.minitab.com)
EViews
EViews is a statistical software tool, which offers academic researchers,
corporations, government agencies, and students the access to powerful
statistical, forecasting, and modelling tools through an innovative, easy-to-
use object-oriented interface.
EViews is the ideal package for anyone who works with time series, cross-
section, or longitudinal data. EViews offers an extensive array of powerful
features for data handling, statistics and econometric analysis, forecasting
and simulation, data presentation, and programming. EViews generates
forecasts or model simulations and produces high-quality graphs and
tables. (Source: http://www.eviews.com/)
JMP Software
JMP is statistical discovery software. JMP helps you explore data, fit
models, discover patterns, and discover points that don’t fit patterns.
JMP is best for data analysis; JMP aims to present a graph with every
statistic.
Table 1.1b depicts the statistical techniques and their application.
Table 1.1b: Illustrative List of Statistical Techniques and Their Application
Area | Decision | Statistical Techniques Applicable
Marketing | Assessment of Demand of Product, Customer Profiling and Market Research | Time Series, Correlation and Regression
Retail Management | Identifying Customer Buying Behaviour and Patterns | Cluster Analysis, Correlation and Regression
Finance and Banking | Evaluation of Investment, Derivatives and Predicting EPS | Correlation Analysis and Regression Analysis, Probability, Hypothesis, Time Series
Insurance | Determining the Premium, Impact of Different Factors on Health and Life | Probability, Hypothesis, Time Series, Correlation Analysis and Regression Analysis
Operations | Controlling and Improving Production Process and Quality | Statistical Quality Control, Six Sigma, Sampling Inspection
HRD | Performance Appraisal and Reward System | Normal Distribution, Correlation Analysis, Conjoint Analysis
1.9 Summary
Let us now summarise the key learnings of this unit:
The decision-making process becomes more efficient with the help of
Statistics. Statistics deals with an aggregate of facts.
Statistics is applied in all fields of our activities. Statistical interpretation
requires skilled and experienced statisticians. Statistical data is
numerical data or quantitative data but not qualitative data.
Statistics is broadly divided into Descriptive and Inferential Statistics.
1.10 Glossary
Data: Data are the facts and figures that are collected, analysed and
interpreted.
Descriptive Statistics: Descriptive statistics comprises the tabular,
graphical and numerical methods used to summarise data.
Element: Elements are the entities on which data are collected.
Qualitative Data: Data that are labels or names used to identify an attribute
of each element.
Quantitative Data: Quantitative data describes data in terms of quantity
using the numerical figure accompanied by a measurement unit.
Sample: Sample is a subset of the population.
Statistical Inference: This is the process of using data obtained from a
sample to make estimates about the characteristics of a population.
Statistics: Statistics is the art and science of collecting, analysing,
presenting and interpreting data.
Population: Population is the set of all elements of interest in a particular
study.
1.12 Answers
Terminal Questions
1. Refer to section 1.5
2. Refer to section 1.3
3. Refer to section 1.7
4. Refer to section 1.6
5. Refer to section 1.1.3
6. Refer to section 1.4
products. Using the warranty cards submitted after purchases, the manager
was planning to survey these customers.
a. As a researcher in this case, how would you proceed with descriptive
statistics?
b. Should the manager of the customer service division of a consumer
electronics company use inferential statistics? Justify
your answer.
c. Describe the population and sample for this survey.
d. Develop three categorical and numerical questions that you feel would
be appropriate for the study.
References:
Agarwal, B. L. (2006). Basic Statistics, 4th Ed. New Age International
Publishers.
Bowerman, B. L. & R. T. O'Connell (1996). Applied Statistics: Improving
Business Processes. Irwin.
Anderson, David R., Dennis J. Sweeney & Thomas A. Williams. Thomson
Business Information Pvt Ltd, 5th Ed.
Freedman, D., R. Pisani & R. Purves (1997). Statistics, 3rd Ed. W.W. Norton.
Wilcox, Rand R. (2009). Basic Statistics: Understanding Conventional
Methods and Modern Insights. Oxford University Press.
Levin, Richard I. & David S. Rubin (2008). Statistics for Management, 7th
Ed. PHI Learning Private Limited.
Srivastava, T. N. & Shailaja Rejo (2008). Statistics for Management, 5th
Ed. TMH.
Tanur, J. M. (2002). Statistics: A Guide to the Unknown, 4th Ed.
Brooks/Cole.
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley.
E-References:
http://www.textbooksonline.tn.nic.in/Books/11/Stat-EM/Chapter-1.pdf.
2.1 Introduction
In the previous unit, ‘Introduction to Statistics’, we were introduced to
the definition and functions of statistics. We also studied the broad divisions
of statistics. We now have an idea about the characteristics of statistics and
the limitations of statistics. In this unit, we will study statistical surveys
and the collection and analysis of numerical data.
When the population is large, it is hard to conduct a survey. In such
situations, a sample is drawn and studied to determine the characteristics of
the entire population. The primary purpose of conducting a sample survey is
to obtain certain information about the population.
We define the term ‘survey’ as a measurement procedure to gather people’s
opinions. Surveys differ from each other as their purpose, field of study,
scope, and the source of information differ. Surveys are used by companies
to assess the level of their customer satisfaction, to find out what products
their customers choose and to determine which section of the population is
buying their products. The following are some examples of activities, which
require collection and analysis of data in a systematic manner.
Formulation of a theory such as “Tobacco Consumption Leads to
Cancer”
Framing of policies according to the existing nature of a population
Finding the relationship between characteristics of units in the
population
In other words, a search for knowledge by analysing numerical data is
known as Statistical Survey or Statistical Investigation.
Objectives:
After studying this unit, you should be able to:
recall the definition of statistical survey
describe the activities involved in planning a statistical survey
recall the definition of terms used in statistics
differentiate between sample and population
differentiate between quantitative and qualitative characteristics
describe various methods of data collection
distinguish between primary and secondary data
explain various measurement scales
2.1.1 Relevance
Relevance, timeliness and accuracy of data are the standard requirements of
any statistical study. The quality of the information and conclusions derived
from data depends on these characteristics. The consequence of their absence
is summed up in the popular phrase “Garbage in, garbage out”, abbreviated as
GIGO, which is mostly used in the field of computer science. The phrase is
equally applicable to statistical data, so utmost care has to be taken to
collect the right data by the right process and from the right source.
2.1.2 Statistics in practice
Recent CASs (Country Assistance Strategies) for Kenya and Armenia provide
good examples of assessing statistical capacity and proposing appropriate
action. The Kenya CAS takes a comprehensive approach towards statistical
capacity building, based on the implementation of a national statistical
development strategy supported by IDA (International Development
Association) and a number of other development partners. The CAS for
Armenia finds that Armenia’s capacity for poverty monitoring and analysis is
reasonably good, as the National Statistical Service (NSS) has conducted
regular household surveys for a number of years. The CAS identifies steps
to further improve capacity, including ‘strengthening the linkages between
different household surveys’ and ‘improving questionnaires to reflect current
policies (for example, on social assistance) and to provide better information
(for example, on employment and earnings).’
(Source: http://siteresources.worldbank.org)
[Figure: A Statistical Survey consists of two stages, Planning and Execution]
Key Statistics
A parameter is a measure of a characteristic of the population.
A population can have many parameters.
A statistic is a measure of a characteristic of the sample.
A sample can have many statistics.
2.3.3 Sample
A sample is a part or a subset of the population. By studying the sample,
you can predict or comment on the characteristics of the entire population
from which the sample is taken. A measure that describes a characteristic
of a sample is known as a statistic.
If the population is large, it is hard to collect data corresponding to the entire
population. Hence, a part of the population is chosen to study the
characteristics of the entire population. The size of the sample is always
smaller than the size of the population. Proper care must be taken while
choosing the sample. In figure 2.3, a sample of three consumers is
drawn from the entire population of eight consumers.
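The distinction between a parameter and a statistic can be illustrated with
a short Python sketch that mirrors the situation of three consumers drawn
from a population of eight. The spending figures are hypothetical and are
not taken from figure 2.3.

```python
import random
import statistics

# Hypothetical monthly spend (in rupees) of a population of eight consumers.
population = [1200, 950, 1800, 1500, 700, 2100, 1300, 1650]

# Parameter: computed from every unit in the population.
population_mean = statistics.mean(population)

# Statistic: computed from a sample of three consumers drawn at random.
rng = random.Random(1)  # fixed seed so the draw is repeatable
sample = rng.sample(population, 3)
sample_mean = statistics.mean(sample)

print("Population mean (parameter):", population_mean)
print("Sample:", sample)
print("Sample mean (statistic):", sample_mean)
```

A different sample would generally give a different value of the statistic,
whereas the parameter stays fixed for a given population.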
Let us understand the basic terminologies of statistical survey with the help
of a Caselet.
Caselet
Consider the survey of the average number of children below 16 years in
a ward of a municipality. The number of houses in the ward is finite and
therefore, the population is finite. The objects are households. The
characteristic measured is number of children below 16 years in a
household. It is numerically measurable and hence quantitative. On the
other hand, in a survey to find the total number of blind people in a
locality, the characteristic ‘blindness’ is qualitative.
2.3.6 Variable
In a population, some characteristics remain the same for all units and some
others vary from unit to unit. A quantitative characteristic that varies from
unit to unit is called a variable. It is a measurable characteristic, for
example, age, height or income. A qualitative characteristic that varies from
unit to unit is called an attribute. It is a non-measurable characteristic,
for example, religion, nationality or occupation.
Key statistic
A census is the procedure of systematically acquiring and recording
information about the members of a given population
Merits
1. We get the original data, which is more accurate and reliable.
2. Satisfactory information can be extracted by the investigator through
indirect questions.
Demerits
1. This method is not cost efficient.
2. This method consumes more time.
2. Indirect oral interview – Indirect oral interview is used when the area to
be covered is large. The investigator collects the data from a third party or a
witness or the head of an institution. This method is generally used by the
police department in cases related to enquiries on the cause of fires, thefts
or murders.
In this method, the investigator contacts witnesses or neighbours or friends or
some other third parties who are capable of supplying the necessary
information. Enquiry committees appointed by governments use this method
to get people’s views and every possible detail regarding the enquiry. This
method suits best when direct sources do not exist or cannot be relied upon
or would be unwilling to take part in the survey. Table 2.2 shows the merits
and demerits of indirect oral interview.
Table 2.2: Merits and Demerits of Indirect Oral Interview Method
Merits
1. Economical in terms of time, cost and manpower.
2. Confidential information can be collected.
3. Information is likely to be unbiased and reliable.
Demerits
1. The degree of accuracy of the information is less.
Questionnaire design
Initial considerations
  Type of information required
  Type/nature of respondents
  Type and method by which survey is to be undertaken
Question content
  Relevance of a question
  Clarity of a question
  Avoid ambiguous, leading, double-barrelled questions
  Ability and willingness of a respondent to answer the questions
Question phrasing
  Style appropriate to target population
  Short, clear and unambiguous questions
  Avoid biased words and leading questions
  Avoid negative questions
  Discourage guessing
  Do not take anything for granted on the part of the respondents
Types of questions
  Closed ended questions
    Dichotomous
    Multiple choice (4 to 5 options; neutral point)
    Likert scale (Agree or disagree)
    Semantic differential (scale connecting bipolar words)
    Importance scale (importance of some attribute)
    Rating scale (Excellent to poor)
  Open ended questions
    Completely unstructured
    Word association (first word that comes to mind …)
    Sentence completion
    Story completion
    Picture completion (filling balloons)
    Thematic Apperception Test (relate story to picture)
Question sequence
  Logical order
  Avoid questions which suggest answers to later questions (bias)
Questionnaire layout
  Good quality paper
  As short as possible (20-30 questions)
  Use lines, boxes, pictures, etc.
  Instructions kept to a minimum but user-friendly
  Purpose of survey explained at the beginning and guarantee of
  confidentiality
  What is to be done with the completed questionnaire?
Example 2
The sun rises in which direction?
East [ ] West [ ]
North [ ] South [ ]
Example 3
Read the following statement and then indicate by a tick whether you
Strongly Agree, Agree, Disagree or Strongly Disagree with the
statement.
“Organised and prioritised tasks take less time to complete.”
1. Strongly Agree [ ]
2. Agree [ ]
3. Disagree [ ]
4. Strongly Disagree [ ]
Open ended questions are those questions for which the respondent
provides their own answer without any fixed set of possible responses.
Examples of the types of open ended questions are:
Sentence completion: In these, respondents complete an incomplete
sentence.
Example 4
Complete the sentence below.
“I like the management courses offered by Manipal University
Jaipur because ...”
Story completion: In these, respondents complete an incomplete story.
Picture completion: In these, respondents fill in an empty conversation
balloon.
Thematic Apperception Test: In these, respondents explain a picture
or make up a story about what they think is happening in the picture.
Activity
Design a questionnaire to gauge consumer response to Facebook versus Twitter
on the Internet.
5. Information through schedule filled by investigator – Information can
be collected through schedules filled by investigators through personal
contact. In order to get reliable information, the investigator should be well
trained, tactful, unbiased and hard working.
A schedule is suitable for an extensive area of investigation through
investigator’s personal contact. The problem of non-response is minimised.
There is a difference between a schedule and a questionnaire. A schedule
is a form that the investigator fills personally, while surveying the units or
individuals from the sample (respondent). A questionnaire is a form sent
(usually mailed) by an investigator to respondents. The respondent has to fill
it and then send it back to the investigator.
The differences between primary and secondary data are listed in
table 2.4.
Table 2.4: Differences between Primary Data and Secondary Data
Clearly, the numbers associated with the options above have no numerical
significance. The values cannot be compared in magnitude, and descriptive
statistics such as the mean and standard deviation would make no sense if
calculated.
Ordinal data
Ordinal variables allow us to rank the items we measure in terms of
which has less and which has more of the quality represented by the
variable; however, they do not allow us to say how much more. A typical
example of an ordinal variable is the socioeconomic status of families. For
example, we know that upper-middle is higher than middle but we cannot
say that it is, for example, 18% higher. Also, this very distinction between
nominal, ordinal, and interval scales itself represents a good example of an
ordinal variable. For example, we can say that nominal measurement
provides less information than ordinal measurement, but we cannot say how
much less or how this difference compares to the difference between ordinal
and interval scales.
Example 2
Employee’s performance
1. Excellent [ ]
2. Good [ ]
3. Average [ ]
4. Poor [ ]
5. Very poor [ ]
It can be easily deduced that ‘Excellent’ is better than ‘Poor’, that is, there is
a latent scale on which comparison can be made among the various values.
Ordinal data can sometimes be treated as interval for the sake of statistical
analysis, provided the assumption is well founded. In this case, the values of
the variable are mathematically considered to be ‘equidistant’ on its scale.
The numbers associated with each value start to acquire some numerical
significance, so that the mean, though not very convincingly, may be
statistically interpreted.
The variable ‘Employee’s performance’ in Example 2 can be regarded as
interval if we assume that the ‘distance’ between any pair of successive
values is equal (for example, the distance between ‘Excellent’ and ‘Good’ is
the same as that between ‘Average’ and ‘Poor’). In this case, if the average
performance score of 100 employees is calculated and found to be, say,
3.2, we may, with some caution, conclude that the overall
performance of employees is close to ‘Average’ (coded 3), leaning slightly
towards ‘Poor’ (coded 4).
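The calculation described above can be sketched in Python. The ratings are
invented for illustration (ten employees rather than 100) and are chosen so
that the mean code works out to 3.2; the function name mean_score is an
assumption made for this example. The result is meaningful only under the
assumption that the codes are equidistant.

```python
from collections import Counter

# Codes from Example 2: a lower code means a better rating.
LABELS = {1: "Excellent", 2: "Good", 3: "Average", 4: "Poor", 5: "Very poor"}

def mean_score(ratings):
    """Mean of the numeric codes, treating the ordinal codes as equidistant."""
    return sum(ratings) / len(ratings)

# Hypothetical ratings for 10 employees (invented for illustration).
ratings = [1, 2, 3, 3, 3, 4, 4, 4, 4, 4]

counts = Counter(ratings)
for code in sorted(LABELS):
    print(f"{LABELS[code]:<10} {counts.get(code, 0)}")

print("Mean score:", mean_score(ratings))
```

The frequency table printed before the mean remains valid even if the
equidistance assumption is rejected, which is why it is often reported
alongside the mean for ordinal data.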
Example 3
Educational level
1. None [ ]
2. Primary [ ]
3. Vocational [ ]
4. Secondary [ ]
5. Diploma [ ]
6. Degree [ ]
7. Postgraduate [ ]
8. Professional [ ]
It is clear that the ‘distance’ between ‘None’ and ‘Primary’ is not equal to that
between ‘Diploma’ and ‘Degree’.
2.5.2 Quantitative (numerical) data
Quantitative data can be easily measured on a numerical scale; variables
which can be quantified in terms of units are all quantitative. Examples of
quantitative variables are number of students per class and height
(measured in centimetres). Again, these two variables differ in their nature;
the first is said to be discrete whereas the second is continuous.
Discrete data
Discrete data occur as definite and separate values; a discrete variable
assumes values which are countable so that there are gaps between its
successive values. For example, when counting the number of children in a
class, we use numbers (0, 1, 2… n).
Continuous data
Continuous data occur as the whole set of real numbers or a subset of it. In
other words, there are no gaps between successive values so that a
continuous variable assumes all the values (including all the decimals)
between given boundaries. Temperature is a good example of a continuous
variable: though thermometer readings are recorded to the nearest tenth of
a degree (Centigrade or Fahrenheit), temperature does not ‘jump’ from, for
example, 17.1 °C to 17.2 °C. It passes through all the real numbers between
these two values. Height, weight and speed are also continuous variables.
Continuous data can be measured on interval and ratio scales.
Interval scale
Interval variables allow us not only to rank the items that are
measured, but also to quantify and compare the sizes of differences
between them. For example, temperature, as measured in degrees
Fahrenheit or Celsius, constitutes an interval scale. We can say that a
temperature of 40 degrees is higher than a temperature of 30 degrees, and
that an increase from 20 to 40 degrees is twice as much as an increase
from 30 to 40 degrees. However, interval scale variables do not have an
absolute zero. If the temperatures in Singapore and London are 30 °C and
15 °C respectively, we cannot say that it is twice as hot in Singapore as in
London. This is simply because it would not be the case if these
temperatures were measured in degrees Fahrenheit: 86 °F and 59 °F
respectively.
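The Singapore and London example can be checked numerically. The sketch
below uses the standard Celsius to Fahrenheit conversion; the function name
is chosen only for this illustration. It shows that the ratio of two
temperatures changes with the unit, while the ratio of two differences does
not.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

singapore_c, london_c = 30, 15
singapore_f = celsius_to_fahrenheit(singapore_c)
london_f = celsius_to_fahrenheit(london_c)

# Ratios of values change with the unit, so "twice as hot" is not valid.
print("Ratio in Celsius   :", singapore_c / london_c)
print("Ratio in Fahrenheit:", singapore_f / london_f)

# Ratios of differences are preserved under the change of unit.
diff_ratio_c = (40 - 20) / (40 - 30)
diff_ratio_f = (
    (celsius_to_fahrenheit(40) - celsius_to_fahrenheit(20))
    / (celsius_to_fahrenheit(40) - celsius_to_fahrenheit(30))
)
print("Difference ratio in Celsius   :", diff_ratio_c)
print("Difference ratio in Fahrenheit:", diff_ratio_f)
```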
Ratio scale
Ratio variables are very similar to interval variables. In addition to all the
properties of interval variables, they feature an identifiable absolute zero
point, and thus allow for statements such as x is twice as much as y.
Typical examples of ratio scales are measures of time or space. For
example, as the Kelvin temperature scale is a ratio scale, a temperature of
200 kelvin is not only higher than a temperature of 100 kelvin, it is twice
as high. Interval scales do not have the ratio property. Most statistical
data analysis procedures do not distinguish between the interval and ratio
properties of the measurement scales. Height is also a ratio scale variable
since, if a person is twice as tall as another, he/she will remain so,
irrespective of the units used (centimetres, inches, etc.).
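The unit-independence of ratios on a ratio scale can be verified with a
small Python sketch. The two heights are hypothetical, and the function name
cm_to_inches is an assumption made for this example.

```python
CM_PER_INCH = 2.54  # exact by definition of the inch

def cm_to_inches(centimetres):
    """Convert a length from centimetres to inches."""
    return centimetres / CM_PER_INCH

tall_cm, short_cm = 180.0, 90.0  # hypothetical heights

ratio_cm = tall_cm / short_cm
ratio_inches = cm_to_inches(tall_cm) / cm_to_inches(short_cm)

# Changing the unit only rescales both values, so the ratio is unchanged.
print("Height ratio in centimetres:", ratio_cm)
print("Height ratio in inches     :", ratio_inches)
```

The ratio survives because converting between units of a ratio scale is a
pure multiplication; converting between interval scales such as Celsius and
Fahrenheit also shifts the zero point, which is what breaks the ratio.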
Figure 2.7 depicts the various categories of measurement scales.
2.7 Summary
Let us recapitulate the important concepts discussed in this unit:
A statistical survey is a search for knowledge. There are two main
stages in any statistical survey - Planning and Execution.
Planning a statistical survey encompasses the following issues:
i) The nature of a problem
ii) The objectives
iii) The scope
iv) Statistical units
v) The degree of accuracy
vi) The time period
vii) The source of information and
viii) The organisation
The collected data should be edited, analysed and interpreted for
completeness, accuracy and consistency.
2.8 Glossary
Interval scale: An interval scale is a scale of measurement where the
distance between any two adjacent units of measurement (or 'intervals') is
the same but the zero point is arbitrary.
Nominal data: A set of data is said to be nominal if the values/observations
belonging to it can be assigned a code in the form of a number where the
numbers are simply labels.
Ordinal data: A set of data is said to be ordinal if the values/observations
belonging to it can be ranked (put in order) or have a rating scale attached.
You can count and order, but not measure, ordinal data.
Population: The set of all elements of interest in a particular study.
Primary data: Data collected for the first time keeping in view the objective
of the survey.
Qualitative variables: A variable with qualitative data.
Quantitative variables: A variable with quantitative data.
Ratio scale: Ratio variables are very similar to interval variables; in addition
to all the properties of interval variables, they feature an identifiable absolute
zero point.
Sample: A subset of the population.
Secondary data: Any information, which is used for the current
investigation collected by some other agency or person in a separate
investigation.
Statistical survey: A scientific process of collection and analysis of
numerical data.
2.10 Answers
8. i) Quantitative data
9. i) Statistics
10. i) Primary data, ii) Primary data, iii) Secondary data
11. i) sample method
12. i) True ii) True iii) False iv) True v) False
13. i) Quantitative ii) Qualitative iii) Qualitative iv) Quantitative
14. i) Quantitative, Ratio, ii) Qualitative, Nominal, iii) Qualitative, Ordinal,
iv) Quantitative, Ratio, v) Qualitative, Nominal
Terminal Questions
1. Refer section 2.1.3.
2. Refer section 2.2.1.
3. It refers to the unit of the population on which measurements are made,
for example, the height of employees in an office. Employees are
individuals or units. Height is the measurement made on them.
4. a) Data collected for the first time by the investigator is primary data.
Data collected by some other persons but used by the investigator
for his/her study is known as secondary data.
b) Direct investigations are carried out directly by the investigator.
Investigation conducted through mail questionnaire is called indirect
investigation.
c) Questionnaires contain simple questions and are filled by
respondents. Schedules also contain questions but responses are
recorded directly by the investigator.
Discussion Questions:
1. What is the population of the study?
2. What is the sample for this study?
3. Why would a sample be used in this situation? Explain.
Case Study 2
An AMC (Annual Maintenance Contract) company provides onsite IT
hardware support services to clients. At one client’s establishment,
the hardware comprises 500 personal computers (PCs) and ten servers
connected by a local area network. The AMC covers, inter alia, the
maintenance of servers, PCs and the network on a 24/7 basis.
The company has a team of 10 technical engineers and a coordinator
posted at the client’s establishment. The company is faced with the problem
of too many complaints about the promptness and quality of service. The
company wants to analyse the problem, for arriving at some appropriate
solution.
Discussion Questions:
Design a questionnaire that would help the company in collecting relevant
data and initiate remedial action. The questionnaire may cover the following
aspects and also any other relevant issues.
Technical competency
Promptness
Behavioural
Case Study 3
A telecom company wanted to understand the perception of consumers
about the value added services of mobile companies, with a view to adding
some new services in this segment. A consultant was hired, and a survey was
planned. The following questionnaire was designed by the consultant.
Questionnaire for Consumers
1) Demographic Profile
2) Name Sex: Male/ Female
3) Occupation: Employed / Self Employed/ Student/Housewife/Retired
a) SMS
b) Voice Mail
c) Messenger Services
d) Ringtones
e) GPRS
f) MMS
g) Roaming
h) Internet
References:
Agarwal, B.L. (2006). Basic Statistics, 4TH Ed, New Age International
Publishers.
Bowerman, B. L & Connel, R.T.O. Applied Statistics: Improving Business
Processes, Irwin 1996.
Levin, R. I. & Rubin, D.S. (2008). Statistics for Management, 7th Ed, PHI
Learning Private Limited.
Lipschutz, S. & Schiller, J.J. Schaum's Outline of Introduction to
Probability and Statistics (Schaum's Outline Series) (Sep 7, 2011).
Pisani D.R. Freedman & R. Purves, Statistics, 3rd Ed, W.W Norton 1997.
Schiller, J, Srinivasan, R. Alu, and Spiegel, Murray, Schaum's Outline of
Probability and Statistics, 3rd Ed. (Schaum's Outline Series) (Aug 26,
2008).
Spiegel, M. & Stephens, L. Schaums Outline of Statistics, Fourth Edition
(Schaum's Outline Series) (Jan 31, 2011).
Sternstein, M. Barron's AP Statistics with CD-ROM (Barron's AP
Statistics (W/CD)) (Feb 1, 2010).
Tanur, J.M, Statistics: A Guide to the unknown, 4th Ed, Brooks /cole,
2002.
Voelkar, D.H, Orton, P. Z. & Adams, S. Statistics (Cliffs Quick Review)
(Jun 15, 2001).
Wilcox, R.R. (2009). Basic Statistics – Understanding Conventional
Methods and Modern Insights, Oxford University Press.
E-References:
http://www.textbooksonline.tn.nic.in/Books/11/Stat-EM/Chapter-3.pdf.
3.8 Glossary
3.9 Terminal Questions
3.10 Answers
3.11 Case Study
3.1 Introduction
In the previous unit, Statistical Survey, we studied surveys, different
methods of collecting data and the analysis of numerical data. In
this unit, we will learn about the classification, tabulation and presentation of
data. We will learn how collected data is simplified and study some
methods of graphical summarisation that reveal certain characteristics
of the data.
Collected data in raw form is voluminous and incomprehensible.
Therefore, it should be condensed and simplified for better
understanding and usefulness.
Classification is the first stage in simplification. It can be defined as a
systematic grouping of the units according to their common characteristics.
Each group is called a class.
For example, in a survey of industrial workers of a particular industry,
workers can be classified as unskilled, semi-skilled and skilled, each of
which forms a class.
Objectives:
After studying this unit, you should be able to:
describe the methods of classification
identify the parts of a table
describe the functions of tabulation
calculate the frequency and frequency distribution for the data
illustrate the numerical data as a graphical representation
3.1.1 Relevance
A picture is worth a thousand words. The same is true of graphs and
charts, which present data in a form that can be easily
comprehended. Graphs and charts reflect our performance. There is
ample scope for making effective use of graphs and charts in managerial
functions.
Region Sales
North 285
South 300
East 185
West 235
Example 6
Figure 3.2 depicts the number of students who have secured more than
60% in various sub-modules of statistics. This can be classified using
the one-way classification method.
Example 7
Figure 3.3 depicts the classification, according to gender, of students who
have secured more than 60% in the respective sub-modules of statistics. In
the sub-module titled ‘Basic Concepts’, ten students got more than 60%.
Out of these ten students, four are males and six are females.
Example 8
Figure 3.4 depicts the classification of employees according to skill, sex
and education.
Example 9:
Figure 3.5 depicts manifold classification of population.
Example 10:
Table 3.5 depicts the educational qualification of hotel employees.
Table 3.5: Manifold Classification of Males and Females Based on Qualification
Educational        Yes          No           Total
Qualification      M     F      M     F      M     F
MBA Degree         12    9      3     6      15    15
B.Sc. H and HA     15    15     0     0      15    15
3.3 Tabulation
Tabulation follows classification. It is a logical or systematic listing of related
data in rows and columns. The row of a table represents the horizontal
arrangement of data and column represents the vertical arrangement of
data. The presentation of data in tables should be simple, systematic and
unambiguous.
The objectives of tabulation are to:
Simplify complex data
Highlight important characteristics
Present data in minimum space
Facilitate comparison
Bring out trends and tendencies
Facilitate further analysis
3.3.1 Basic differences between Classification and Tabulation
Table 3.6 lists a few differences between classification and tabulation.
Table 3.6: Differences between Classification and Tabulation
Classification                              Tabulation
It is the basis for tabulation              It is the basis for further analysis
It is the basis for simplification          It is the basis for presentation
Data is divided into groups and sub-groups  Data is listed according to a logical
on the basis of similarities and            sequence of related characteristics
dissimilarities.
[Figure: layout of a table with its parts marked by numbered tabs; the source note appears at the bottom]
Tab 2: Title
Title indicates the scope and the nature of contents in a concise form. In
other words, title of a table gives information about the data contained in the
body of the table. Title should not be lengthy.
Tab 3 and Tab 4: Captions
Captions are the headings and subheadings describing the data present in
the columns.
Tab 5 and Tab 6: Stubs
Stubs are the headings and subheadings of rows.
Tab 7: Body of the table
Body of the table contains numerical information.
Tab 8: Totals
The sub-totals for each separate classification and a general total for all
combined classes should be given at the bottom or right side of the figures
whose totals are taken. Ruling and spacing separate columns and rows.
However, totals are separated from main body by thick lines.
Tab 9: Head note
Head note is given below the title of the table to indicate the units of
measurement of the data and is enclosed in brackets.
Tab 10: Source note
Source note indicates the source from which data is taken. The source note
is placed at the bottom left-hand corner of the table.
3.3.3 Types of tables
Tables are classified on the basis of:
a. Purpose of investigation
b. The nature of presented figures
c. Construction
a. Purpose of investigation: Tables under this classification are
of two types. They are:
General purpose table – A general purpose table, or reference table,
facilitates easy reference to the collected data. Such tables are formed
without a specific objective, but can be used for any specific purpose. They
contain a large mass of data, for example, census data.
c. Construction
Different types of tables under this classification of tables are:
Simple table – Simple table presents only one characteristic. Table 3.10
depicts a simple table.
Complex table – Complex table presents two or more characteristics.
Table 3.11 depicts a complex table.
Cross-classified table – In the cross-classified table, the entries are
classified in both directions. Table 3.12 depicts an example of a cross-
classified table.
Table 3.10: Defectives Produced by Batches
Batches No. of defectives
1 15
2 20
3 40
4 50
Table 3.11: Distribution of Defectives According to Batch and Nature of
Defects
Defects
Batch Major Minor
I 8 7
II 15 5
III 25 15
IV 32 18
Total 80 45
Table 3.12: Population of a City According to Age, Sex and Education During 2003 to 2005
                   Educated                                   Not Educated
Years   Sex        Below 20 yrs  20 - 40  Above 40  Total     Below 20 yrs  20 - 40  Above 40  Total
2003    Male
        Female
2004    Male
        Female
2005    Male
        Female
Solved Problem 1
1.1 When the collected data is grouped with reference to time, we have:
a) Quantitative classification b) Qualitative classification
c) Geographical classification d) Chronological classification
Solution – Chronological classification
1.2 Most quantitative classifications are:
a) Chronological b) Geographical
c) Frequency distribution d) None of these
Solution – Frequency distribution
1.3 Caption stands for:
a) A numerical information b) The column headings
c) The row headings d) The table headings
Solution – The column headings
1.4 A simple table contains data on:
a) Two characteristics b) Several characteristics
c) One characteristic d) Three characteristics
Solution – One characteristic
1.5 The headings of the rows given in the first column of a table are called:
a) Stubs b) Captions
c) Titles d) Reference notes
Solution – Stubs
1.6 Geographical classification means, classification of data according to
_______.
Solution – Geographical regions
1.7 The data recorded according to standard of education like illiterate,
primary, secondary, graduate, technical, etc, will be known as _______
classification.
Solution – Qualitative
1.8 An arrangement of data into rows and columns is known as _______.
Solution – Tabulation
1.9 Tabulation follows ______.
Solution – Classification
1.10 In a manifold table we have data on _______.
Solution – More than two characteristics
Example 11
Table 3.13: Data on marks obtained in statistics paper
Roll No.    Marks obtained in statistics paper
1           83
2           80
3           75
4           92
5           65
The above list is raw data. Presented in this form, the data does not
readily reveal any information. If the data is arranged in ascending or
descending order of magnitude, it is called arraying of data, and it gives
a better presentation.
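Arraying can be sketched in a few lines of Python. The example below uses the marks from Table 3.13; the variable names are chosen here only for illustration.

```python
# Roll number -> marks obtained in the statistics paper (Table 3.13).
marks = {1: 83, 2: 80, 3: 75, 4: 92, 5: 65}

# Arraying: arrange the raw values in ascending or descending order.
ascending = sorted(marks.values())
descending = sorted(marks.values(), reverse=True)

# Once arrayed, the smallest and largest values can be read off directly.
value_range = ascending[-1] - ascending[0]

print("Ascending array: ", ascending)
print("Descending array:", descending)
print("Range:", value_range)
```

An array makes the lowest value, the highest value and the range visible at a glance, which the raw list does not.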
3.4.2 Discrete frequency distribution
If the data series is presented indicating its exact measurement of units,
then it is called a discrete frequency distribution. A discrete variable is one
where the variates differ from each other by definite amounts.
Solved Problem 2
Assume that a survey has been made to find the number of post-graduates
in 10 families selected at random; the resulting raw data could be as
follows:
0, 1, 3, 1, 0, 2, 2, 2, 2, 4
Solution
This data can be classified into an ungrouped frequency distribution.
Table 3.14: Discrete Frequency Distribution
Number of post-graduates (X)    Frequency (f)
0 2
1 2
2 4
3 1
4 1
The number of post-graduates becomes the variable X for which we can list
the frequency of occurrence f in a tabular form. Table 3.14 depicts a discrete
frequency distribution, where the variables have discrete numerical values.
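The tally in Solved Problem 2 can be reproduced with a short Python sketch. It uses `collections.Counter` from the standard library; the variable names are illustrative.

```python
from collections import Counter

# Raw data: number of post-graduates in each of the 10 families.
postgraduates = [0, 1, 3, 1, 0, 2, 2, 2, 2, 4]

# Count how often each distinct value of X occurs.
frequency = Counter(postgraduates)

print("X  f")
for x in sorted(frequency):
    print(x, "", frequency[x])
print("Total", sum(frequency.values()))
```

Each distinct value of X gets its own row because the variable is discrete; no grouping into classes is needed.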
18 23 28 29 44 28 48 33 32 43
24 29 32 39 49 42 27 33 28 29
Table 3.16 depicts how the frequency distribution table can be formed by
grouping the marks into class width of 5.
Table 3.16: Continuous Frequency Distribution
Marks No. of students
0-5 0
5 – 10 0
10 – 15 0
15 – 20 1
20 – 25 2
25 – 30 7
30 – 35 4
35 – 40 1
40 – 45 3
45 – 50 2
A continuous frequency distribution is divided into mutually exclusive
sub-ranges called classes. Classes have lower and upper limits known as lower
class limits and upper class limits respectively. The difference between
the upper class limit and the lower class limit is termed the class width. The
middle value of a class interval is called the mid-value of the class. It is the
average of the class limits.
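As an illustration, the grouping shown in Table 3.16 can be sketched in Python. The sketch assumes the exclusive method (a mark equal to the upper limit is counted in the next class); the variable names are illustrative.

```python
marks = [18, 23, 28, 29, 44, 28, 48, 33, 32, 43,
         24, 29, 32, 39, 49, 42, 27, 33, 28, 29]
class_width = 5

distribution = {}
for lower in range(0, 50, class_width):
    upper = lower + class_width
    # Exclusive method: the lower limit is included, the upper limit is not.
    distribution[(lower, upper)] = sum(1 for m in marks if lower <= m < upper)

for (lower, upper), count in distribution.items():
    mid_value = (lower + upper) / 2  # average of the two class limits
    print(f"{lower:>2} - {upper:<2}  mid-value {mid_value:>4}  students {count}")
```

The counts printed for each class can be compared with the "No. of students" column of Table 3.16.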
Example 12
In the class 0 – 10, the lowest value is zero and the highest value is 10. The
two boundaries of the class are called the lower and upper limits of the class.
Class limits are also called class boundaries.
b) Class intervals: The difference between the upper and lower limit of the
class is known as class interval.
Example 13
In the class 0 – 10, the class interval is (10 – 0) = 10.
Example 14
If the marks of 60 students in a class vary between 40 and 100 and if we
want to form 6 classes, the class interval would be:
The formula to find class interval is given as follows:
$i = \frac{L - S}{R}$
where
L = Largest value
S = Smallest value
R = the no. of classes
Here L = 100, S = 40 and R = 6, so
$i = \frac{L - S}{R} = \frac{100 - 40}{6} = \frac{60}{6} = 10$
Therefore, class intervals would be 40 – 50, 50 – 60, 60 – 70, 70 – 80,
80 – 90 and 90 – 100.
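The calculation in Example 14 can be sketched in Python as follows. The variable names are illustrative, and the class limits are generated by stepping up from the smallest value in units of the class interval.

```python
largest, smallest, number_of_classes = 100, 40, 6

# i = (L - S) / R
class_interval = (largest - smallest) / number_of_classes

# Build the classes by stepping up from the smallest value.
classes = [
    (smallest + k * class_interval, smallest + (k + 1) * class_interval)
    for k in range(number_of_classes)
]

print(f"Class interval: {class_interval:g}")
for lower, upper in classes:
    print(f"{lower:g} - {upper:g}")
```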
Key Statistic
Class intervals are of two types; exclusive and inclusive. The class
interval that does not include upper class limit is called an exclusive type
of class interval. The class interval that includes the upper class limit is
called an inclusive type of class interval.
ii) Inclusive method (non-overlapping): The class interval that includes
the upper class limit is called an inclusive type of class interval.
Example 15
Table 3.18: Marks versus Students
In table 3.19, the class ‘0 – 9’ includes the value ‘9’. In table 3.20, the class
‘0 – 10’ does not include the value ‘10’. If the value of ‘10’ occurs, it is
included in the class ‘10 – 20’.
Table 3.19: Inclusive Type of Class Interval
Note: Under this formula, number of classes cannot be less than 4 and not
greater than 20.
f) Class mid point or class marks: The mid value or central value of the
class is called mid point.
$\text{Mid point of a class} = \frac{\text{lower limit of class} + \text{upper limit of class}}{2}$
Solved Problem 5
For the class 10–20; find the lower class limit, the upper class limit, the
width of the class and the mid value of the class.
Solution
For the class 10-20, the lower class limit and the upper class limit are 10
and 20 respectively. The width of the class is 20 - 10 = 10. The mid value of
the class is calculated as:
$\text{Mid value of the class} = \frac{10 + 20}{2} = 15$
Therefore, mid value of the class is 15.
g) Sturges formula to find the size of class interval
$\text{Size of class interval } (h) = \frac{\text{Range}}{1 + 3.322 \log N}$
Solved Problem 6
In a group of 20 workers, the highest wage is Rs. 175 and the lowest wage is
Rs. 42 per day. Find the size of the interval.
Solution
$K = 1 + 3.322 \log_{10} N = 1 + 3.322 \log_{10} 20 = 1 + 3.322 \times 1.3010 = 5.3219 \approx 6$
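A minimal Python sketch of Sturges' formula for this problem is given below. The function name `sturges_number_of_classes` is introduced only for illustration, and the size of the interval is taken as Range divided by (1 + 3.322 log N), as in the formula stated above.

```python
import math


def sturges_number_of_classes(n):
    """Number of classes K = 1 + 3.322 * log10(N)."""
    return 1 + 3.322 * math.log10(n)


n_workers = 20
highest_wage, lowest_wage = 175, 42

k = sturges_number_of_classes(n_workers)
wage_range = highest_wage - lowest_wage

# Size of class interval h = Range / (1 + 3.322 log N).
class_size = wage_range / k

print(f"K = {k:.4f} (about {math.ceil(k)} classes)")
print(f"Range = {wage_range}")
print(f"h = {class_size:.2f}")
```

In practice the computed size is usually rounded to a convenient figure before the class limits are written down.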
b) The number of classes should be neither too large nor too small.
Too few classes result in greater interval width with loss of
accuracy. Too many classes result in complexity.
c) All intervals should be of the same width. This is preferred for easy
computations.
$\text{The width of interval} = \frac{\text{Range}}{\text{Number of classes}}$
d) Open end classes should be avoided since they create difficulty in
analysis and interpretation. (An open end class is one in which either the
lower limit of the first class or the upper limit of the last class is not specified.)
e) Intervals should be continuous throughout the distribution. This is
important for continuous distribution.
f) The lower limits of the class intervals should be simple multiples of the
interval.
Example 17
From Table 3.21, we can see that ten students got 90 marks in
mathematics, five students got 82 and five got 75.
Table 3.21: Marks Secured by Students in Mathematics
Marks secured in mathematics    Number of Students
90                              10
82                              5
75                              5
Solved Problem 7
The following problem will explain how raw data can be converted to
frequency distribution.
5 14 10 16 8 15 1 14 9 6
11 3 8 12 6 4 11 17 7 10
18 10 15 9 8 14 8 5 15 4
10 13 4 18 2 6 10 7 13 8
16 7 14 11 9 4 11 9 3 7
1 8 10 5 13 7 15 8 19 16
6 17 11 15 6 3 18 12 9 4
14 11 9 4 14 12 8 7 19 10
15 8 19 11 7 16 10 3 6 14
10 19 3 20 8 11 20 14 9 19
Solution
Frequency table for the above data is as follows
Table 3.22a: Frequency table
Key Statistic
If the class interval does not prescribe lower limit for first class or upper
limit for the last class, then it is known as open-end class interval.
Solved Problem 8
In a survey, it was found that 64 families bought milk in the following
quantity in a particular month.
16 22 9 22 12 39 19 14 23 6
24 16 18 17 20 25 28 18 10 24
20 21 10 7 18 28 24 20 14 23
25 34 22 5 33 23 26 29 13 36
11 26 11 37 30 13 8 15 22 21
32 21 31 17 16 23 12 9 15 27
17 21 19 7
Solved Problem 9
In a country music band of 48 members, 22 play guitar, 12 play brass,
14 play piano. Create a tabular display of the frequency and Relative
frequency distribution for the type of instruments.
Solution
Table 3.24 depicts the frequency and relative frequency distribution for the
type of instruments in a country music band.
Table 3.24: Frequency Distribution of the Type of Instruments
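The frequencies and relative frequencies for Solved Problem 9 can be computed with the following Python sketch. The relative frequency of a class is its frequency divided by the total number of observations (f/N); the variable names are illustrative.

```python
# Frequencies given in Solved Problem 9.
instrument_counts = {"Guitar": 22, "Brass": 12, "Piano": 14}

total_members = sum(instrument_counts.values())

# Relative frequency = f / N for each type of instrument.
relative_frequency = {
    instrument: count / total_members
    for instrument, count in instrument_counts.items()
}

print(f"{'Instrument':<12}{'Frequency':>10}{'Relative frequency':>20}")
for instrument, count in instrument_counts.items():
    print(f"{instrument:<12}{count:>10}{relative_frequency[instrument]:>20.3f}")
print(f"{'Total':<12}{total_members:>10}{sum(relative_frequency.values()):>20.3f}")
```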
Solved Problem 10
Table 3.25 depicts the frequency distribution of marks. Calculate the derived
frequency distributions, less than and more than cumulative frequency
distribution.
Table 3.25: Frequency Distribution of Marks
Marks No of students
0-20 15
20-40 20
40-60 28
60-80 22
80-100 15
Total 100
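The two derived distributions asked for in Solved Problem 10 can be sketched in Python as follows. The sketch uses `itertools.accumulate` from the standard library; the variable names are illustrative.

```python
from itertools import accumulate

# Frequency distribution of marks (Table 3.25).
classes = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 100)]
frequencies = [15, 20, 28, 22, 15]

# "Less than" cumulative frequencies: running total from the first class,
# read against the upper limit of each class.
less_than = list(accumulate(frequencies))

# "More than" cumulative frequencies: running total from the last class,
# read against the lower limit of each class.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for (lower, upper), lt, mt in zip(classes, less_than, more_than):
    print(f"Less than {upper:>3}: {lt:>3}    More than {lower:>2}: {mt:>3}")
```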
Solved problem 11
Table 3.27 depicts the data related to the height and weight of 20 people.
Construct a bivariate frequency table with class interval of height as 62-64,
64-66…and weight as 115-125,125-135 and write down the marginal
distribution of X and Y.
Table 3.27: Height and Weight of 20 People
S.No. Height Weight S.No. Height Weight
1 70 170 11 70 163
2 65 135 12 67 139
3 65 136 13 63 122
4 64 137 14 68 134
5 69 148 15 67 140
6 63 121 16 69 132
7 65 117 17 65 120
8 70 128 18 68 148
9 71 143 19 67 129
10 62 129 20 67 152
Solution
Table 3.27a depicts the bivariate frequency table showing height and weight
of people.
Table 3.27a: Bivariate Frequency Table
Height(X)
Weight(Y) 62-64 64-66 66-68 68-70 70-72 Total
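The cell counts and marginal distributions for Solved Problem 11 can be tallied with the Python sketch below. It assumes exclusive classes (a value equal to an upper limit falls in the next class); the helper `class_of` and the other names are introduced only for illustration.

```python
from collections import Counter

# Height and weight of the 20 people in Table 3.27, in serial-number order.
heights = [70, 65, 65, 64, 69, 63, 65, 70, 71, 62,
           70, 67, 63, 68, 67, 69, 65, 68, 67, 67]
weights = [170, 135, 136, 137, 148, 121, 117, 128, 143, 129,
           163, 139, 122, 134, 140, 132, 120, 148, 129, 152]


def class_of(value, start, width):
    """Return the (lower, upper) class containing value, upper limit excluded."""
    lower = start + ((value - start) // width) * width
    return (lower, lower + width)


# Each cell is keyed by (weight class, height class).
bivariate = Counter(
    (class_of(w, 115, 10), class_of(h, 62, 2))
    for h, w in zip(heights, weights)
)

# Marginal distributions are the row and column totals.
height_marginal = Counter(class_of(h, 62, 2) for h in heights)
weight_marginal = Counter(class_of(w, 115, 10) for w in weights)

print("Marginal distribution of height (X):")
for cls in sorted(height_marginal):
    print(f"  {cls[0]}-{cls[1]}: {height_marginal[cls]}")
print("Marginal distribution of weight (Y):")
for cls in sorted(weight_marginal):
    print(f"  {cls[0]}-{cls[1]}: {weight_marginal[cls]}")
```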
Solved Problem 12
Draw the line diagram for the following data
Table 3.28: Data for line diagram
Solution
Figure 3.7 depicts line diagram.
[Figure 3.7: Line diagram. x-axis: Year (2001 to 2006); y-axis: No. of students passed in FCD; data labels on the plot: (15), (13), (12), (7), (5), (5)]
Solution
Figure 3.8 is a simple bar diagram which depicts the yield of paddy in
Karnataka.
Solution
Figure 3.9 depicts the annual expenses of various cars in a vertical bar
diagram.
[Figure 3.9: Vertical bar diagram of annual expenses. Bars: Maruthi Udyog, Hyundai, Tata Motors; data labels: 63270, 59230, 47533]
Steel maker       Prodn. in million tonnes
Arcelor Mittal    110
Nippon            32
POSCO             31
JFE               30
BAO Steel         24
US Steel          20
NUCOR             18
RIVA              18
Thyssen-krupp     17
Tangshan          16
Solution
Figure 3.10 depicts production of steel by top ten steel makers.
[Figure 3.10: Horizontal bar diagram of the Top - 10 Steel Makers. x-axis: Production of Steel (Million Tonnes)]
Solution
[Figure: Bar diagram. x-axis: Model of Car (1, 2, 3); y-axis: Value in Rs.; legend: Santro, Zen, Wagnor; data labels: 296, 302, 278, 274, 261, 252, 240, 248, 208]
Solved Problem 17
Table 3.33 depicts the cost of manufacturing/unit and the revenue/unit from
2002-2005. Create a multiple bar diagram for this data.
Solution: The multiple bar diagram in figure 3.12 depicts the cost and
revenue per unit.
Fig. 3.12: Multiple Bar Diagram showing the Cost and Revenue per Unit
Solved Problem 18
The following table gives the details of monthly expenditure of two families
A and B. Represent the data by percentage bar chart.
                 Family A                       Family B
Item             Percentage of   Cumulative     Percentage of   Cumulative
                 expenditure     percentage     expenditure     percentage
Food             28              28             20              20
House rent       24              52             20              40
Fuel             14              66             10              50
Miscellaneous    20              86             20              70
Savings          14              100            30              100
Income           100                            100
2. Two-dimensional diagram
In a two-dimensional diagram, both breadth and length of the diagram
i.e. area of the diagram are considered, as area of the diagram represents
the data. The important two dimensional diagrams are:
Rectangular diagram
Square diagram
Rectangular diagram: Rectangular diagrams are used to depict two or
more variables. This diagram helps for direct comparison. The area of
rectangles is kept in proportion to the values. It may be of two types:
Percentage sub-divided rectangular diagram
Sub-divided rectangular diagram
In the former, the width of the rectangles is proportional to the values, the
various components of the values are converted into percentages and the
rectangles are divided according to them. In the latter case, rectangles are
used to show some related phenomenon like cost per unit, quality of
production, etc.
Solved Problem 19
Table 3.35 depicts the expenditure of items by family A and family B. Draw
the rectangle diagram.
Table 3.35: Expenditure of Items in Rupees
Item Expenditure      Expenditure in Rs.
                      Family A    Family B
Provisional stores    1000        2000
Education             250         500
Electricity           300         700
House Rent            1500        2800
Vehicle Fuel          450         1000
Total                 3500        7000
Solution
Total expenditure will be taken as 100 and the expenditure on individual
items are expressed in percentage. The width of two rectangles is in
proportion to the total expenses of the two families i.e. 3500: 7000 or 1: 2.
The height of rectangles is according to the percentage of expenses. Table
Monthly Expenditure
                      Family A (Rs. 3500)     Family B (Rs. 7000)
Item Expenditure      Rs.     Percentage      Rs.     Percentage
Provisional stores    1000    28.57           2000    28.57
Education             250     7.14            500     7.14
Electricity           300     8.57            700     10
House Rent            1500    42.85           2800    40
Vehicle Fuel          450     12.85           1000    14.28
Total                 3500    100             7000    100
[Figure: Rectangular diagram. x-axis: Family (A, B); y-axis: % of Expenditure]
63.245 : 89.44
1.26 : 1.79
We draw two circles with radii 1.3 cm and 1.8 cm (where 1 cm = 50 units).
Table 3.36a depicts the determined angles at the centre.
Graphs are used mainly for frequency distributions. Some of the types of
graphs are:
i) Histogram
ii) Frequency polygon
iii) Frequency curve
iv) Ogives [cumulative frequency curves]
Advantages of graphic presentation
It provides an attractive and impressive view
Simplifies complexity of data
Helps for direct comparison
It helps for further statistical analysis
It is the simplest method of presentation of data
It shows trend and pattern of data
Table 3.37 depicts the difference between graph and diagram
Table 3.37: Differences between Diagrams and Graphs
Diagram                                      Graph
1. Ordinary paper can be used                1. Graph paper is required
2. It is attractive and easily               2. It is not easily understandable
   understandable
3. It is appropriate and effective for       3. It creates problems when more
   representing more variables                  variables are represented
4. It cannot be used for further             4. Can be used for further analysis
   analysis
5. It gives comparison                       5. It shows relationship between
                                                variables
6. Data are represented by bars and          6. Points and lines are used to
   rectangles                                   represent data
3.6.1 Histogram
In this type of representation the given data are plotted in the form of series
of rectangles. Class intervals are marked along the x-axis and the
frequencies are along the y-axis according to the suitable scale. Unlike the
bar chart, which is one-dimensional, a histogram is two-dimensional in
which the length and width are both important. A histogram is constructed
from a frequency distribution of grouped data, where the height of rectangle
Solved Problem 22
Table 3.39 depicts the distribution of age. Draw a histogram for this data.
Table 3.39: Distribution of Age
Solution: The figure 3.18 depicts the histogram for the distribution of age
data.
We join the upper left corner of highest rectangle to the right adjacent
rectangle’s left corner and right upper corner of highest rectangle to left
adjacent rectangle’s right corner. From the intersecting point of these lines
we draw a perpendicular to the x-axis. The x-reading at that point gives the
mode of the distribution.
If the widths of the rectangles are not equal then we make areas of the
rectangles proportional and draw the histogram.
3.6.2 Frequency polygon
A frequency polygon is a line chart of a frequency distribution in which either
the values of discrete variables or the mid-points of class intervals are plotted
against the frequencies, and the plotted points are joined together by straight
lines. Since the frequencies do not start at zero or end at zero, this
diagram as such would not touch the horizontal axis. However, since the
area under the entire curve is the same as that of a histogram, which is 100%,
the curve must be ‘enclosed’. The beginning of the curve touches the
horizontal axis and the last mid-point is joined with a ‘fictitious’ succeeding
mid-point, whose value is also zero, so that the curve will end at the
horizontal axis. This enclosed diagram is known as ‘frequency polygon’.
Solved Problem 23
Table 3.40 depicts the frequencies of the marks obtained. Construct a
frequency polygon for this data.
Table 3.40: Frequencies of the marks obtained
Marks (CI)    Frequency (f)    Mid-point
15 – 25 5 20
25 – 35 3 30
35 – 45 7 40
45 – 55 5 50
55 – 65 3 60
65 – 75 7 70
Solution
Figure 3.19 depicts a frequency polygon.
[Figure 3.19: A frequency polygon. y-axis: Frequency; x-axis scale: 0 to 100]
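The points that make up the frequency polygon in Solved Problem 23 can be sketched in Python as follows. To enclose the polygon, the sketch assumes one fictitious class of the same width, with zero frequency, before the first class and after the last class; the variable names are illustrative.

```python
# Class intervals and frequencies from Table 3.40.
class_intervals = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75)]
frequencies = [5, 3, 7, 5, 3, 7]

# Mid-point of each class = average of its limits.
mid_points = [(lower + upper) / 2 for lower, upper in class_intervals]
class_width = class_intervals[0][1] - class_intervals[0][0]

# Fictitious mid-points with zero frequency close the polygon on the x-axis.
polygon_x = [mid_points[0] - class_width] + mid_points + [mid_points[-1] + class_width]
polygon_y = [0] + frequencies + [0]

for x, y in zip(polygon_x, polygon_y):
    print(f"mid-point {x:>5}: frequency {y}")
```

Joining these points in order by straight lines gives the enclosed polygon; the coordinate lists can be passed to any plotting tool.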
3.6.4 Ogives
Ogive is obtained by drawing the graph of a cumulative frequency
distribution. Hence, ogives are also called cumulative frequency curves.
Since a cumulative frequency distribution can be of 'less than' or 'greater
than' type, we have less than and greater than types of ogives.
Less than Ogive – Variables are taken along x-axis and less than
cumulative frequencies are taken along y-axis. Less than cumulative
frequencies are plotted against the upper limit of class interval and joined by
a smooth-curve.
More than Ogive – More than cumulative frequencies are plotted against
lower limit of the class-interval and joined by a smooth-curve.
From the meeting point of these two ogives, if we draw a perpendicular line
to the x-axis, the point where it meets x-axis gives the median of distribution.
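The meeting point of the two ogives can also be found numerically. The Python sketch below uses the marks distribution of Table 3.25 as illustrative data and assumes that the points of each ogive are joined by straight lines, so that the two curves cross where the cumulative frequency equals N/2. The variable names are chosen for illustration.

```python
from itertools import accumulate

# Illustrative data: frequency distribution of marks (Table 3.25).
lower_limits = [0, 20, 40, 60, 80]
upper_limits = [20, 40, 60, 80, 100]
frequencies = [15, 20, 28, 22, 15]

total = sum(frequencies)

# Less than ogive: cumulative frequencies plotted against upper limits.
less_than = list(accumulate(frequencies))
# More than ogive: cumulative frequencies plotted against lower limits.
more_than = [total - below for below in [0] + less_than[:-1]]

# The ogives meet where the cumulative frequency is N/2.
half = total / 2
median_index = next(i for i, cum in enumerate(less_than) if cum >= half)
cum_before = less_than[median_index - 1] if median_index > 0 else 0

# Linear interpolation inside the median class.
width = upper_limits[median_index] - lower_limits[median_index]
median = lower_limits[median_index] + (half - cum_before) / frequencies[median_index] * width

print("Less than cumulative frequencies:", less_than)
print("More than cumulative frequencies:", more_than)
print(f"Median: {median:.2f}")
```

With smooth hand-drawn curves the graphical reading may differ slightly from this interpolated value.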
Solved Problem 25
Construct an Ogive curve for the data depicted in table 3.41.
Table 3.41: Data for Ogive Curve
[Figure: Ogive curves. y-axis label recovered: Less than Cumulative Frequency]
Activity:
1. A friend of yours heard that you were taking statistics and has
presented you with the following table from which he wants you to
construct a histogram.
Table 3.42: Frequency table
Age Relative Frequency (%)
00-14 28.4
15-44 50.5
45+ 21.1
100.0
0 5/100
1 25/100
2 30/100
3 25/100
4 15/100
Activity Solution
1. Open ended interval, too few intervals to give meaningful results and
intervals are of unequal length.
2. a) Table 3.45: Frequency table
Class Frequency Relative Frequency
1-4 4 4/25
5-8 5 5/25
9-12 6 6/25
13-16 6 6/25
17-20 4 4/25
Totals 25 1
2) The real or exact limits of the lowest interval are 4.5 – 8.5.
b) Relative Frequency (%)
   30 +              +-----+
      |        +-----+     +-----+
   20 +        |     |     |     |
      |        |     |     |     +-----+
   10 +        |     |     |     |     |
      |  +-----+     |     |     |     |
    0 +--+-----+-----+-----+-----+-----+--> Y
            0     1     2     3     4
c) Median = 1.5 + (20/30)(1.0) = 2.17
Mode = 2
Mean = [0(5) + 1(25) + 2(30) + 3(25) + 4(15)]/100 = 220/100 = 2.2
d) Probability (Y >= 2) = (30/100) + (25/100) + (15/100) = 70/100 =
0.70 = 70%
5. a) Both will have the same values on the horizontal axis.
6. d) Will have the exact same shape regardless of what units are used
on the axis.
7. a) The value of the measurement and the number of individuals with
that value.
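The calculations in parts c) and d) of the activity solution can be checked with the Python sketch below. It takes the relative frequencies of Y from the activity (5, 25, 30, 25 and 15 per cent for Y = 0 to 4) and, for the median, treats each value y as the class y - 0.5 to y + 0.5, as the solution does. The variable names are illustrative.

```python
# Relative frequency (per cent) of each value of Y.
relative_frequency = {0: 5, 1: 25, 2: 30, 3: 25, 4: 15}
total = sum(relative_frequency.values())

# Mean = sum of y * f divided by the total frequency.
mean = sum(y * f for y, f in relative_frequency.items()) / total

# Mode = value with the highest frequency.
mode = max(relative_frequency, key=relative_frequency.get)

# Median by interpolation, treating y as the class (y - 0.5, y + 0.5).
half = total / 2
cumulative = 0
for y in sorted(relative_frequency):
    if cumulative + relative_frequency[y] >= half:
        median = (y - 0.5) + (half - cumulative) / relative_frequency[y] * 1.0
        break
    cumulative += relative_frequency[y]

# Probability that Y is at least 2.
prob_at_least_2 = sum(f for y, f in relative_frequency.items() if y >= 2) / total

print(f"Mean = {mean:.2f}, Median = {median:.2f}, Mode = {mode}")
print(f"P(Y >= 2) = {prob_at_least_2:.2f}")
```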
3.7 Summary
Let us recapitulate the important concepts discussed in this unit:
For better understanding and usefulness, the collected data is classified
in a systematic manner according to common characteristics.
Classification simplifies and makes data more comprehensible and
renders the data ready for statistical analysis.
Classified data is tabulated in rows and columns for presentation, using
various types of classification. The tabulated data should be simple and
unambiguous, which should be understood and interpreted easily.
Frequency distribution is a special type of tabulation. In more concise
form, it brings out the salient features of the distribution.
Data presented in a diagram or graphical form is more appealing and
gives a rough idea of the situation to busy executives.
Graphical presentation is the visual representation of data in the form of line
diagrams, pie-charts, histograms, frequency polygons, frequency curves,
or ogives.
In a pie chart, different segments of a circle represent percentage
contribution of various components to the total. It brings out the relative
importance of various components of data.
The graph of cumulative frequency distribution is the ogive curve.
3.8 Glossary
Bar graph: A graphical device for depicting data that have been
summarised in a frequency distribution.
Bivariate distribution: If the number of variables is only two, then it is
called bivariate frequency distribution.
Cross tabulation: A tabular summary of data for two variables.
Frequency distribution: A tabular summary of data showing the number of observations in each class.
Histogram: A graphical presentation of a frequency distribution.
Multivariate frequency distribution: Frequency distribution of more than
two variables is known as multivariate frequency distribution.
Ogive: A graph of a cumulative distribution.
Pie chart: A graphical device for depicting data summaries based on the
subdivision of a circle into sectors that correspond to the relative
frequency of each class.
3. ABC Ice Cream Company attempts to keep all of its ten flavours of ice
cream in stock at each of its stores. In-charge of stores operation
collects data on the daily amount of each flavour to the nearest half
gallon.
i. Is the flavour classification discrete or continuous? Open or closed?
ii. Data collected, is it qualitative or quantitative?
iii. Is the amount collected on each flavour discrete or continuous?
4. Table 3.50 depicts certain data. Construct histogram for this data.
Table 3.50: Frequency Table
3.10 Answers
iv) Two
v) Sturge’s
vi) f/N
12. i) False ii) True iii) False
13. iv) Histogram
14. iii) A Pie diagram
15. iv) All the above
16. iv) Ogive
Terminal Questions
1. Table 3.52 depicts the solution for terminal question 1.
Table 3.52: Frequency Distribution Table
Class Interval Frequency
50-55 7
55-60 10
60-65 18
65-70 8
70-75 6
75-80 1
Total 50
2. The table 3.53 depicts the data required to construct the pie-chart (figure
3.22) for the budget data of XYZ Company.
Table 3.53: Budget of XYZ Company
5. Figure 3.25 is the ogive curve for the data given in terminal
question 5.
i. 16% ii. 57%
Yes 194
No 121
Not Sure 73
No response 422
Discussion Questions:
a) Convert the data to percentages and construct
i) A bar chart
ii) A pie chart
Which of these charts do you prefer to use and why?
Bajaj ICICI
Tata
HDFC unit Allianz Kotak Safe Prulife SBI Unit Birla Sun
AIG
linked New Investment Time Plus II life
Invest
Endowment Unit Plan Super Regular Premier
Assurell
Gain Regular
1 year 6.4 4.8 7.4 3.7 5.4 5.6 3.5
10 years 2.7 2.9 3.6 2.4 3.5 2.9 2.2
15 year 1.8 2.4 2.8 2.1 3.0 2.3 1.9
20 years 1.5 2.2 2.4 1.9 2.8 2.0 1.7
25 years 1.3 2.1 2.2 1.8 2.6 1.9 1.6
30 years 1.2 2.0 2.1 1.7 2.6 1.8 1.5
(Source: Economic Times dt 23rd October 2006)
References:
Agarwal B.L. (2006). Basic Statistics, 4th Ed, New Age International
Publishers.
Bowerman B. L., & R.T. O Connel. Applied Statistics: Improving
Business Processes, Irwin 1996.
Levin R.I., & Rubin, L.D.S. (2008). Statistics for Management, 7th Ed,
PHI Learning Private Limited.
Pisani F.D.R., & Purves R., Statistics, 3rd Ed, W.W Norton 1997.
Srivastava T.N., & Rejo, Shailaja (2008). Statistics for Management, 5th
Ed. TMH.
Tanur J.M, Statistics: A Guide to the unknown, 4th Ed, Brooks /cole,
2002.
Tukey J.W., Exploratory Data Analysis, Addison –Wesley, 1977.
Wilcox R.R. (2009). Basic Statistics – Understanding Conventional
Methods and Modern Insights, Oxford University Press.
E-References:
http://www.textbooksonline.tn.nic.in/Books/11/Stat-EM/Chapter-1.pdf
Quartile deviation
Mean deviation
4.11 Standard Deviation
Properties of Standard Deviation
Combined Standard Deviation
4.12 Coefficient of Variation
4.13 Summary
4.14 Glossary
4.15 Terminal Questions
4.16 Answers
4.17 Case Study
4.1 Introduction
In the previous unit, we studied data classification and the
representation of data in tables and graphs. In this unit, we will study the
measures of Central tendency and Dispersion.
Graphical representation is a good way to represent summarised data.
However, graphs provide only an overview and thus may not be suitable for
further analysis. Hence, we use summary statistics, such as averages,
to analyse the data. Mass data, which is collected, classified, tabulated and
presented systematically, is analysed further to reduce it to a single
representative figure. This single figure is found at the central part of the
range of all values and represents the entire data set. Hence, it is called
the measure of central tendency.
In other words, the tendency of data to cluster around a centrally located
figure is known as central tendency. A measure of central tendency,
or average of the first order, describes the concentration of a large number
of values around a particular value. It is a single value which represents all units.
Objectives:
After studying this unit, you should be able to:
describe the concept of Average (Measures of Central tendency) and
Measures of Dispersion
explain Arithmetic Mean for discrete and continuous data
explain Median and Mode of statistical data
explain Quartiles, Deciles and Percentiles for statistical data
explain Coefficient of Variation for statistical data
Manipal University Jaipur Page No. 124
Statistics for Management Unit 4
4.1.1 Relevance
Small Fry Design
Founded in 1997, Small Fry Design is a toy and accessories company that
designs and imports products for infants. The company's product line
includes teddy bears, mobiles, musical toys, rattles and security blankets,
and features high-quality soft toy designs with an emphasis on colour,
texture and sound. The products are designed in the United States and
manufactured in China.
Small Fry Design uses independent representatives to sell the products to
infant furnishing retailers, children’s accessory and apparel stores, gift
shops, upscale department stores, and major catalogue companies.
Currently, Small Fry Design products are distributed in more than 1000 retail
outlets throughout the United States.
Cash flow management is one of the most critical activities in the day-to-day
operation of this young company. Ensuring sufficient incoming cash to meet
both current and ongoing debt obligations can mean the difference between
business success and failure. A critical factor in cash flow management is
the analysis and control of accounts receivable. By measuring the average
age and dollar values of outstanding invoices, management can predict
cash availability and monitor changes in the status of accounts receivable.
The company has set the following goals: the average age for outstanding
invoices should not exceed 45 days and the dollar value of invoices more
than 60 days old should not exceed 5% of the dollar value of all accounts
receivable.
In a recent summary of accounts receivable status, the following descriptive
statistics were provided for the age of outstanding invoices.
Mean 40 days
Median 35 days
Mode 31 days
Interpretation of these statistics shows that the mean or average age of an
invoice is 40 days. The median shows that half of the invoices have been
outstanding 35 days or more. The mode of 31 days shows that the most common
length of time an invoice has been outstanding is 31 days. The statistical
summary also showed that only 3% of the dollar value of all accounts
$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$, where $i = 1, 2, \dots, n$
$\sum_{i=1}^{n} X_i$ = Sum of the values of the observations of a series
n = Number of observations.
Step 2: Divide this total by the number of observations n. This will give the
value of Arithmetic Mean
Solved Problem 1
Calculate the mean for following data. Marks obtained by 6 students are
given below:
20, 15, 23, 22, 25, 20
Solution
Mean marks
$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$, where $i = 1, 2, \dots, n$
$\bar{X} = \frac{20 + 15 + 23 + 22 + 25 + 20}{6} = \frac{125}{6} = 20.83$
Solved Problem 2
Find the arithmetic mean of 15, 17, 22, 21, 19, 26 and 20.
Solution
The arithmetic mean $\bar{X}$ is given by:
$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$, where $i = 1, 2, \dots, n$
$\bar{X} = \frac{15 + 17 + 22 + 21 + 19 + 26 + 20}{7} = \frac{140}{7} = 20$
Therefore, the arithmetic mean is 20.
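The two steps for an individual series (add the observations, then divide by their number) can be sketched in a few lines of Python. This is only an illustration; the function name `arithmetic_mean` is our own choice, and the data are the values of Solved Problems 1 and 2.

```python
def arithmetic_mean(values):
    """Return the sum of the observations divided by their count."""
    if not values:
        raise ValueError("need at least one observation")
    return sum(values) / len(values)


marks = [20, 15, 23, 22, 25, 20]        # Solved Problem 1
series = [15, 17, 22, 21, 19, 26, 20]   # Solved Problem 2

print(round(arithmetic_mean(marks), 2))  # 20.83
print(arithmetic_mean(series))           # 20.0
```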
Solved Problem 3
Six months income of a departmental store is depicted in table 4.1. Find
mean income of a departmental store.
Table 4.1: Six Months income of departmental store
Month Jan Feb Mar Apr May June
Income (Rs.) 25000 30000 45000 20000 25000 20000
Solution
Total income = $\sum X$ = (25000 + 30000 + 45000 + 20000 + 25000 + 20000)
= 165000
Mean income $\bar{X} = \frac{\sum X}{n} = \frac{165000}{6}$ = Rs. 27500
The above example shows that when the figures are large, computing the
mean takes considerable effort. A short-cut method reduces the
computation. The method is illustrated as follows.
2. Shortcut method
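When the number of observations is large, the Arithmetic Mean can be
calculated using the short-cut method. The following formula is used:
$\bar{X} = A + \frac{\sum d}{N}$
where A = assumed mean, d = X – A (deviation of each value from the
assumed mean) and N = number of observations.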
Discrete Series: Each value is multiplied by its frequency, the products
are added, and the total is divided by the total frequency to obtain the
Arithmetic Mean for a discrete series. This can be done by two methods:
1. Direct Method
2. Shortcut Method
1. Direct Method: When the direct method is used, the formula is
$\bar{X} = \frac{\sum fX}{\sum f} = \frac{\sum fX}{N}$
Value (X) 1 2 3 4 5
Frequency (f) 10 15 10 9 5
Solution
By direct method
Table 4.2a: Calculation of Mean using Direct Method
$\bar{X} = \frac{\sum fX}{N} = \frac{131}{49} = 2.67$
By short-cut method
Let A = 3, (Assumed mean = 3)
Table 4.2b: Calculation of Mean using Short-cut method
$\bar{X} = A + \frac{\sum fd}{N} = 3 + \frac{(-16)}{49} = 2.67$
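As a cross-check, the direct and short-cut methods for a discrete series can be compared in Python. This is a minimal sketch: the function names are hypothetical, and the values, frequencies and assumed mean A = 3 are those used in the solution above.

```python
def mean_direct(values, freqs):
    """Direct method: sum of f*X divided by the total frequency N."""
    n = sum(freqs)
    return sum(f * x for x, f in zip(values, freqs)) / n


def mean_shortcut(values, freqs, assumed_mean):
    """Short-cut method: A + sum(f*d)/N, where d = X - A."""
    n = sum(freqs)
    sum_fd = sum(f * (x - assumed_mean) for x, f in zip(values, freqs))
    return assumed_mean + sum_fd / n


values = [1, 2, 3, 4, 5]
freqs = [10, 15, 10, 9, 5]

print(round(mean_direct(values, freqs), 2))       # 2.67
print(round(mean_shortcut(values, freqs, 3), 2))  # 2.67
```

Both methods give the same result for any choice of A; a value of A near the centre of the data simply keeps the deviations small.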
Solved Problem 5
The data in table 4.3 depicts the number of students with respect to age.
Calculate the arithmetic mean of the students age.
Table 4.3: Number of Students with Respect to Age
Solution
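The arithmetic mean $\bar{X}$ is given by:
$\bar{X} = \frac{\sum fX}{\sum f} = \frac{\sum fX}{N}$
$\bar{X} = \frac{20 \times 3 + 23 \times 5 + 25 \times 10 + 28 \times 6 + 30 \times 1}{3 + 5 + 10 + 6 + 1} = \frac{623}{25} = 24.92$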
Solution
Table 4.4a: Calculation of Arithmetic Mean
Age group    No. of persons (f)    Mid point (m)    fm
0 – 10       5                     5                25
10 – 20      15                    15               225
20 – 30      25                    25               625
30 – 40      8                     35               280
40 – 50      7                     45               315
Total        ∑f = N = 60                            ∑fm = 1470
$\bar{X} = \frac{\sum fm}{\sum f} = \frac{\sum fm}{N} = \frac{1470}{60} = 24.5$
2. Short cut method
When this method is used, Arithmetic Mean is computed applying the
formula
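$\bar{X} = A + \frac{\sum fd}{N}$
where A = Assumed mean, d = deviation of the mid value from the
assumed mean, i.e., d = m – A, and N = Total frequency.
In the step deviation method, the deviations are further divided by the
class width:
$d' = \frac{m - \text{Assumed Mean}}{\text{Width of Class Interval}} = \frac{m - A}{i}$
where
m = mid value of the class,
i = common magnitude of the class intervals (width of Class Interval),
A = Assumed mean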
Steps of the step deviation method:
1. Find out the mid value ‘m’
2. Select the arbitrary mean (assumed mean) ‘A’
3. Find the deviation (d) of mid value of each class from ‘A’
Solution
Table 4.5a: Calculation of Arithmetic Mean
Age        No. of persons 'f'    Mid value 'm'    d' = (m – A)/i = (m – 25)/10    fd'
0 – 10     5                     5                -2                              -10
10 – 20    15                    15               -1                              -15
20 – 30    25                    25               0                               0
30 – 40    8                     35               1                               8
40 – 50    7                     45               2                               14
Total      ∑f = N = 60                                                            ∑fd' = -3
Solution
Let A = 25 and i = 10.
$\bar{X} = A + \frac{\sum fd'}{N} \times i$
$\bar{X} = 25 + \frac{(-3)}{60} \times 10$
$\bar{X} = 25 - \frac{1}{2} = 24.5$
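The step deviation calculation can be reproduced with a short Python sketch. The function name and argument order are our own assumptions; the mid values, frequencies, A = 25 and i = 10 come from table 4.5a.

```python
def mean_step_deviation(midpoints, freqs, assumed_mean, width):
    """Step deviation method: A + (sum(f*d') / N) * i, with d' = (m - A) / i."""
    n = sum(freqs)
    sum_fd = sum(f * (m - assumed_mean) / width
                 for m, f in zip(midpoints, freqs))
    return assumed_mean + (sum_fd / n) * width


midpoints = [5, 15, 25, 35, 45]
freqs = [5, 15, 25, 8, 7]

mean_age = mean_step_deviation(midpoints, freqs, assumed_mean=25, width=10)
print(round(mean_age, 2))  # 24.5
```

The result agrees with the direct method (1470/60 = 24.5), as it should.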
Key statistic
For Individual series, the Arithmetic Mean is given by:
$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$, where $i = 1, 2, \dots, n$
n = no of observations
Key statistic
For discrete series, the Arithmetic Mean is given by:
$\bar{X} = \frac{\sum fX}{\sum f} = \frac{\sum fX}{N}$
∑f = N = total frequency
Key Statistic
For continuous series, the arithmetic mean is given by:
$\bar{X} = A + \frac{\sum fd'}{N} \times i$
$d' = \frac{m - \text{Assumed Mean}}{\text{Width of Class Interval}} = \frac{m - A}{i}$
m is the mid value of the class
A is the Assumed Mean
i is the common magnitude of the class intervals (width of the Class
Interval)
$\sum (X - \bar{X})^2$ is less than $\sum (X - A)^2$
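$\bar{X} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2 + \dots + N_n \bar{X}_n}{N_1 + N_2 + \dots + N_n}$
Let $\bar{X}_1$ and $\bar{X}_2$ be the means of the first and second groups of
data containing $N_1$ and $N_2$ items, respectively.
Then, combined mean $\bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2}$
If there are 3 groups, then $\bar{X}_{123} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2 + N_3 \bar{X}_3}{N_1 + N_2 + N_3}$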
Solved problem 8
a) Find the mean for the entire group of workers from the following data
depicted in table 4.6.
Table 4.6
Group – 1 Group – 2
Mean wages 75 60
No. of workers 1000 1500
Solution
Given data
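$N_1 = 1000$, $N_2 = 1500$
$\bar{X}_1 = 75$ and $\bar{X}_2 = 60$
Combined Mean $\bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2}$
$= \frac{1000 \times 75 + 1500 \times 60}{1000 + 1500}$
$\bar{X}_{12}$ = Rs. 66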
Solved Problem 9
If average height of 30 men is 158 cm and average height of another group
of 40 men is 162 cm, find the average height of the combined group.
Solution
Given that,
$N_1 = 30$, $\bar{X}_1 = 158$, $N_2 = 40$, $\bar{X}_2 = 162$
$\bar{X}_{12} = \frac{30 \times 158 + 40 \times 162}{30 + 40} = 160.29$ cm
The average height of the combined group is 160.29 cm.
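The combined mean is simply a size-weighted average of the group means. A minimal Python sketch (the function name `combined_mean` is our own) applied to Solved Problems 8 and 9:

```python
def combined_mean(sizes, means):
    """Combined mean of several groups: sum(N_k * mean_k) / sum(N_k)."""
    total = sum(n * m for n, m in zip(sizes, means))
    return total / sum(sizes)


wages = combined_mean([1000, 1500], [75, 60])    # Solved Problem 8
heights = combined_mean([30, 40], [158, 162])    # Solved Problem 9

print(wages)              # 66.0
print(round(heights, 2))  # 160.29
```

Because the function accepts lists, the same call covers three or more groups.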
Solved Problem 10
Solution
On substituting the given values in the following equation, we get,
Combined Mean $\bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2}$
$\frac{30 \bar{X}_1 + 40 \times 162}{30 + 40} = 160.28$
$\bar{X}_2 = \frac{1685}{150} = 11.23$ gm
Therefore, the average weight of screws of box ‘B’ is 11.23 gm.
Solved Problem 12
A clerk calculated arithmetic mean of 50 values as 39.2. However, it was
found that instead of taking two values as 25 and 32, he took them as 52
and 23. Find the corrected arithmetic mean.
Solution
Given that,
$N = 50$, $\bar{X} = 39.2$
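The correction can be sketched in Python under the usual approach: recover the total from the reported mean, remove the wrongly recorded values, add the right ones, and divide by N again. The function name is hypothetical and the figure printed below is our own arithmetic from the data of the problem, not a value quoted from the text.

```python
def corrected_mean(n, reported_mean, wrong_values, right_values):
    """Fix a mean after some observations were recorded incorrectly."""
    total = n * reported_mean                  # total actually used by the clerk
    total = total - sum(wrong_values) + sum(right_values)
    return total / n


fixed = corrected_mean(50, 39.2, wrong_values=[52, 23], right_values=[25, 32])
print(round(fixed, 2))  # 38.84
```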
$\bar{X}_w = \frac{X_1 W_1 + X_2 W_2 + X_3 W_3 + \dots + X_n W_n}{W_1 + W_2 + W_3 + \dots + W_n} = \frac{\sum_{i=1}^{n} X_i W_i}{\sum_{i=1}^{n} W_i} = \frac{\sum XW}{\sum W}$
Solved Problem 13
Compute the simple and weighted arithmetic means and comment on them.
Table 4.7: Weighted Arithmetic Mean
Designation        Monthly salary (Rs) X    Strength of cadre W    XW
General Manager    25000                    10                     250000
Managers           19000                    20                     380000
Supervisors        14000                    10                     140000
Office Assistant   10000                    50                     500000
Helpers            8000                     25                     200000
Total (N = 5)      ∑X = 76000               ∑W = 115               ∑XW = 1470000
a. Simple arithmetic mean = $\frac{\sum X}{N} = \frac{76000}{5}$ = Rs. 15200
b. Weighted arithmetic mean = $\frac{\sum XW}{\sum W} = \frac{1470000}{115}$ = Rs. 12782.6087
In this example, the simple arithmetic mean ignores the number of staff at
each salary level and gives every designation equal importance. The high
salaries of the General Manager and Managers inflate the simple mean. The
weighted mean gives each salary level importance in proportion to the
number of staff who earn it.
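The contrast between the two averages is easy to reproduce. The sketch below uses the salary and cadre-strength figures of table 4.7; the function name `weighted_mean` is our own.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(X*W) / sum(W)."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


salaries = [25000, 19000, 14000, 10000, 8000]
strengths = [10, 20, 10, 50, 25]

simple = sum(salaries) / len(salaries)
weighted = weighted_mean(salaries, strengths)

print(simple)              # 15200.0
print(round(weighted, 2))  # 12782.61
```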
Solved Problem 14
Comment on the performance of students of two universities depicted in
table 4.8.
Table 4.8: Weighted Arithmetic Mean
             Bombay                                      Madras
Course       % of pass    No. of students    XW          % of pass    No. of students    XW
             X            (000) W                        X            (000) W
MBA          71           3                  213         81           5                  405
MCA          83           2                  166         76           3                  228
MA           73           5                  365         58           3                  174
M.Sc.        75           2                  150         76           1                  76
M.Com.       70           2                  140         81           2                  162
Total (∑)    372          14                 1034        372          14                 1045
Solution
a. Since ∑X is the same, the simple arithmetic average is the same for both
universities:
$\frac{\sum X}{N} = \frac{372}{5} = 74.4$
b. Weighted mean for Bombay University = $\frac{\sum XW}{\sum W} = \frac{1034}{14} = 73.86$
c. Weighted mean for Madras University = $\frac{\sum XW}{\sum W} = \frac{1045}{14} = 74.64$
Comment: On the weighted mean, the performance of Madras University
students is better than that of Bombay University students.
Solved Problem 15
The data in table 4.9 is a reflection of the marks scored by students of a
class in an examination. Calculate the mean of the marks scored by the
students in an examination.
Table 4.9: Marks Scored by Students
Solution
Table 4.9a: Calculation of Arithmetic Mean
Marks      Less than Cum. Freq    Frequency f    Mid Point X    d' = (X – 35)/10    fd'
0 – 10     4                      4              5              –3                  –12
10 – 20    16                     12             15             –2                  –24
20 – 30    20                     4              25             –1                  –4
30 – 40    65                     45             35             0                   0
40 – 50    85                     20             45             1                   20
50 – 60    97                     12             55             2                   24
60 – 70    100                    3              65             3                   9
                                  N = 100                                           ∑fd' = 13
In table 4.9, the values given in the column 'number of students' are
cumulative frequencies. We first convert them into a frequency distribution
by subtracting each cumulative frequency from the next one. The calculated
values are depicted in table 4.9a.
The mean $\bar{X}$ is given by:
$\bar{X} = A + \frac{\sum fd'}{N} \times i$
$\bar{X} = 35 + \frac{13}{100} \times 10 = 36.3$
Therefore, the mean score of the students is 36.3.
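The conversion from "less than" cumulative frequencies to ordinary frequencies is a running difference. The following Python sketch (the function name is hypothetical) rebuilds the frequency column of table 4.9a and recomputes the mean directly from the mid points as a check.

```python
def frequencies_from_cumulative(cum_freqs):
    """Turn 'less than' cumulative frequencies into class frequencies."""
    freqs = []
    previous = 0
    for c in cum_freqs:
        freqs.append(c - previous)  # frequency = this total minus the one before
        previous = c
    return freqs


cum_freqs = [4, 16, 20, 65, 85, 97, 100]
midpoints = [5, 15, 25, 35, 45, 55, 65]

freqs = frequencies_from_cumulative(cum_freqs)
mean_marks = sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)

print(freqs)       # [4, 12, 4, 45, 20, 12, 3]
print(mean_marks)  # 36.3
```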
Solved Problem 16
Find the missing frequency for the distribution in table 4.10, given the mean
value as 129 and N=80.
Table 4.10: Distribution Table
Class Interval    80-100    100-120    120-140    140-160    160-180
Frequency         8         –          26         14         10
Solution
Let the missing frequency be ‘f’. Then,
Table 4.10a: Frequency Distribution Table
that is, $129 = 130 + \frac{18 - f}{80} \times 20$
$-1 = \frac{360 - 20f}{80}$
$20f = 360 + 80$
$f = 22$
Hence, the missing frequency is 22.
Merits Demerits
It is simple to calculate and easy to It is affected by extreme values.
understand.
3. If X1, X2, X3, …, Xn are a set of n values of a variate, then the
mean is given by
i) $N / \sum X_i$ ii) $\sum X_i / n$
iii) $N \sum X_i$ iv) $\bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2}$
4. (a) Find the Arithmetic mean 68,41,75,91,53,86,59
i) 67.57 ii) 47.57
iii) 37.57 iv) 27.57
(b) The average computed by considering the relative importance of each
of values to the total value, is called
i) Arithmetic mean ii) Geometric mean
iii) Weighted arithmetic mean iv) Harmonic average.
The geometric mean is never larger than the arithmetic mean. If there are
zeroes or negative numbers in the series, the geometric mean cannot be
used. Logarithms can be used to find the geometric mean, which keeps the
numbers small and saves time.
In the field of business management, various problems arise relating to
the average percentage rate of change over a period of time. In such cases,
the arithmetic mean is not an appropriate average to employ, and the
geometric mean is used instead. GM is highly useful in the construction of
index numbers. Table 4.12 displays the merits and demerits of
Geometric Mean.
Solution
Table 4.13: Calculation of Geometric Mean
X    log X
2    0.301
4    0.602
8    0.903
     ∑log X = 1.806
Geometric Mean (GM) = Antilog $\left[\frac{\sum \log X}{N}\right]$
GM = Antilog $\left[\frac{1.806}{3}\right]$
GM = Antilog [0.6020] = 3.9997
GM ≈ 4
Solved Problem 19
Overhead (OH) expenses went up by 32% in 2003 compared with the previous
year, then increased by 40% in the next year and by 50% in the
following year. Calculate the average increase in overhead expenses.
Let the OH expenses in the base year be 100%.
Solution
Table 4.14: Calculation of Geometric Mean
Year    OH Expenses X    log X
2002    Base year        –
2003    132              2.121
2004    140              2.146
2005    150              2.176
                         ∑log X = 6.443
Geometric Mean (GM) = Antilog $\left[\frac{\sum \log X}{N}\right]$
GM = Antilog $\left[\frac{6.443}{3}\right]$
GM = Antilog [2.1477] = 140.49
Solved Problem 20
The growth in bad-debt expense for Das Office Supplies Company, over the
last few years is as depicted in table 4.15. Calculate the average percentage
increase in bad-debt expense over this time period.
Table 4.15: Bad-debt Expense Growth for Das Office Supplies Company
Year 1992 1993 1994 1995 1996 1997 1998
Expense Rate 1.110 1.090 1.075 1.080 1.095 1.080 1.200
Solution
The Geometric Mean is given by:
$GM = \sqrt[7]{(1.110)(1.090)(1.075)(1.080)(1.095)(1.080)(1.200)} = 1.10$
Therefore, the average increase is (1.10 – 1) = 0.10, that is, about 10% per year.
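The logarithm route to the geometric mean (average the logs, then take the antilog) can be sketched in Python as follows. The function name is our own; the inputs are the three values 2, 4, 8 used earlier and the seven growth factors of table 4.15.

```python
import math


def geometric_mean(values):
    """Antilog of the mean of the base-10 logs; needs strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean is not defined for zero or negative values")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log


print(round(geometric_mean([2, 4, 8]), 4))  # 4.0

growth = [1.110, 1.090, 1.075, 1.080, 1.095, 1.080, 1.200]
gm = geometric_mean(growth)
print(round(gm, 2))              # 1.1
print(round((gm - 1) * 100, 1))  # average percentage increase, about 10.4
```

Subtracting 1 from the mean growth factor and multiplying by 100 gives the average percentage increase per year.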
4.4.2 Geometric Mean for discrete series
Geometric Mean for discrete series is given as:
Geometric Mean (GM) = Antilog $\left[\frac{\sum f \log X}{N}\right]$
Solved Problem 21
Find the Geometric Mean for the data depicted in table 4.16
Table 4.16: Frequency Table
Marks 130 135 140 145 150
No. of students 3 4 6 6 3
Solution
Table 4.16a: Calculation of Geometric Mean
Marks X    No. of students f    log X    f log X
130        3                    2.113    6.339
135        4                    2.130    8.52
140        6                    2.146    12.876
145        6                    2.161    12.966
150        3                    2.176    6.528
           ∑f = N = 22                   ∑f log X = 47.23
Geometric Mean (GM) = Antilog $\left[\frac{\sum f \log X}{N}\right]$
GM = Antilog $\left[\frac{47.23}{22}\right]$
GM = Antilog [2.1468]
GM = 140.222
Solved Problem 22
The share-price of a particular company was moving up and down. The data
depicted in table 4.17 consolidates its movement for past 6 months. Find the
appropriate average share-price.
Table 4.17: Frequency Table of Share Price
Solution
The data in table 4.17 is obtained from the data in table 4.18a.
Table 4.17a: Calculation of Geometric Mean of Share Prices
GM = Antilog $\left[\frac{\sum f \log x}{N}\right]$
GM = Antilog $\left[\frac{90.9546}{44}\right]$
GM = Antilog [2.0672] = 116.70
The appropriate average share price is Rs. 116.70.
Geometric Mean (GM) = Antilog $\left[\frac{\sum f \log m}{N}\right]$
GM = Antilog $\left[\frac{68.6828}{52}\right]$
GM = Antilog [1.3208]
GM = 20.93
Key statistic
Whenever data deals with rates, ratios, growth rates, etc., the geometric
mean is the best measure.
Geometric mean is not defined even if one of the values is zero or
negative.
Key statistic
Suppose the values X1, X2, …, Xn are assigned the weights W1, W2, …,
Wn. Then their weighted average is given by:
$\bar{X}_w = \frac{\sum XW}{\sum W}$
and their weighted Geometric Mean is given by:
$G_w = \text{Antilog}\left[\frac{\sum W \log X}{\sum W}\right]$ where 'W' acts as frequency.
4.5 Harmonic Mean
It is the total number of items divided by the sum of the reciprocals of the
values of the variable. It is a specialised average for problems involving
variables expressed as rates, such as 'time rates' that vary according to time.
E.g.: Speed in km/hr, min/day, price/unit.
Key statistic
For Individual series, Harmonic Mean is given by:
$H.M. = \frac{N}{\sum (1/X)}$
Key statistic
For discrete series and continuous series, the Harmonic Mean is given
by:
$H.M. = \frac{N}{\sum (f/X)}$
The table 4.19 displays the merits and demerits of Harmonic Mean.
Table 4.19: Merits and Demerits of Harmonic Mean
Merits Demerits
It is based on all observations. It is not easy to compute.
It is rigidly defined It cannot be used when one of the
items is zero.
It is suitable in case of series having It cannot represent distribution
wide dispersion.
It is suitable for further mathematical
treatment.
Solved Problem 24
Calculate the harmonic mean of 9.7, 9.8, 9.5, 9.4, 9.7
Solution
The Harmonic Mean (HM) is calculated as:
Table 4.19: Calculation of Harmonic Mean
X 1/X
9.7 0.1031
9.8 0.1020
9.5 0.1053
9.4 0.1064
9.7 0.1031
∑1/X = 0.5199
Harmonic Mean (H.M) = $\frac{N}{\sum (1/X)}$
HM = $\frac{5}{0.5199}$ = 9.6172
Therefore, the Harmonic Mean is 9.6172.
Solved Problem 25
A man travelled by a car for 3 days. He covered 480 km each day. On the
first day he drove for 10 hrs at the rate of 48 KMPH, on the second day for
12 hrs at the rate of 40 KMPH and on the third day for 15 hrs at the rate of
32 KMPH.
X      48         40         32         Total
1/X    0.02083    0.02500    0.03125    0.07708
Harmonic Mean (H.M) = $\frac{N}{\sum (1/X)} = \frac{3}{0.07708}$ ≈ 38.92
Data: 10 hrs @ 48 KMPH
12 hrs @ 40 KMPH
15 hrs @ 32 KMPH
W (hours)    X (KMPH)    XW
10           48          480
12           40          480
15           32          480
∑W = 37                  ∑XW = 1440
Weighted Mean = $\bar{X}_w = \frac{\sum XW}{\sum W} = \frac{1440}{37}$
$\bar{X}_w$ ≈ 38.92
The Harmonic mean and the Weighted mean (with hours as weights) are the same.
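The agreement between the two averages can be checked numerically. In this sketch (function names are our own) the harmonic mean of the three speeds is compared with the mean of the speeds weighted by hours driven; working with unrounded reciprocals, both come to roughly 38.92 KMPH.

```python
def harmonic_mean(values):
    """Number of items divided by the sum of their reciprocals."""
    return len(values) / sum(1 / v for v in values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(X*W) / sum(W)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


speeds = [48, 40, 32]   # KMPH on each day
hours = [10, 12, 15]    # hours driven on each day

hm = harmonic_mean(speeds)
wm = weighted_mean(speeds, hours)

print(round(hm, 2))  # 38.92
print(round(wm, 2))  # 38.92
```

The two agree because the distance is the same each day, so the hours driven are proportional to the reciprocals of the speeds.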
Solved Problem 26
Calculate the Harmonic Mean for the following data
Table 4.21: Frequency table
Harmonic mean = $\frac{N}{\sum (f/X)} = \frac{70}{1.3009}$ = 53.81
4.5.1 Relationship between Arithmetic mean, Geometric mean and
Harmonic mean
The relationship between Arithmetic mean, Geometric mean and Harmonic
mean can be summarised as follows:
1. If all the items in a variable are the same, the Arithmetic mean (AM),
Geometric mean (GM) and Harmonic mean (HM) are equal, i.e., AM = GM = HM.
2. If the size of the items varies, the Arithmetic mean will be greater than
the Geometric mean and the Geometric mean will be greater than the
Harmonic mean. This is because the Geometric mean gives larger weight to
smaller items and the Harmonic mean gives the largest weight to the
smallest items. Hence AM > GM > HM.
Thus, we have discussed about Arithmetic Mean, Geometric Mean and
Harmonic Mean.
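A quick numerical check of this ordering, as a sketch only: the three helper functions below are our own, and the data sets (2, 4, 8 and three equal values of 5) are chosen purely for illustration.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # n-th root of the product; values must be positive
    return math.prod(values) ** (1 / len(values))


def harmonic_mean(values):
    return len(values) / sum(1 / v for v in values)


varied = [2, 4, 8]
am, gm, hm = arithmetic_mean(varied), geometric_mean(varied), harmonic_mean(varied)
print(am > gm > hm)  # True

same = [5, 5, 5]
means = (arithmetic_mean(same), geometric_mean(same), harmonic_mean(same))
print(all(math.isclose(m, 5) for m in means))  # True
```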
4.6 Median
In this section, we will discuss the median of distribution. Median of
distribution is that value of the variate, which divides it into two equal parts.
In terms of frequency curve, the ordinate drawn at median divides the area
under the curve into two equal parts. Median is a positional average
because its value depends upon the position of an item and not on its
magnitude.
Median of a set of values is the value which is the middle most value when
they are arranged in the ascending or descending order of magnitude.
Median is denoted by ‘M’.
4.6.1 Median for Individual series
The formula used for calculating median for individual series is
Median = size of the $\left(\frac{N+1}{2}\right)^{th}$ item
Solution
Arranging in ascending order, we get:
13, 15, 16, 17, 18, 19, 20, 22, 23
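we have, N = 9
Median = size of the $\left(\frac{N+1}{2}\right)^{th}$ item = $\left(\frac{9+1}{2}\right)^{th}$ item = 5th item = 18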
Median = $\frac{37 + 40}{2} = 38.5$
The median for the given set of values is 38.5.
Solved Problem 29
In a class of 15 students, 5 students failed in a test. The marks of the 10
students who passed were 9, 6, 7, 8, 9, 6, 5, 4, 7, 8. Find the median
marks of the 15 students.
Solution
The marks of 10 students who passed when arranged in ascending order of
magnitude are: 4,5,6,6,7,7,8,8,9,9.
Since the five students who failed must have scored less than 4 marks,
the marks of the 15 students arranged in ascending order will be as
follows:
0,0,0,0,0, 4,5,6,6,7,7,8,8,9,9.
Median = size of the $\left(\frac{N+1}{2}\right)^{th}$ item = $\left(\frac{15+1}{2}\right)^{th}$ item = 8th item = 6
4. The value for which the cumulative frequency includes the
$\left(\frac{N+1}{2}\right)^{th}$ item will be taken as the Median.
Solved Problem 30
Find the median value for the data depicted in table 4.22
Table 4.22: Frequency table
X 12 16 10 14 17 20 15
f 4 9 3 5 4 2 10
Solution
In this problem, we have, N = 37
Table 4.22a: Computation of Median
X     f     Less than Cumulative frequency (LCF)
10    3     3
12    4     7
14    5     12
15    10    22
16    9     31
17    4     35
20    2     37
Median = size of the $\left(\frac{N+1}{2}\right)^{th}$ item = 19th item
This value lies in cumulative frequency (22) for the value 15.
Therefore, the Median is 15.
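For a discrete series, the search through the cumulative frequencies can be sketched in Python. The function name is hypothetical; it follows the rule used above (take the first value whose cumulative frequency reaches the (N+1)/2-th item) and is applied to the data of table 4.22.

```python
def median_discrete(values, freqs):
    """Median of a discrete series via 'less than' cumulative frequencies."""
    pairs = sorted(zip(values, freqs))   # arrange the values in ascending order
    n = sum(freqs)
    position = (n + 1) / 2               # the (N+1)/2-th item
    cumulative = 0
    for value, freq in pairs:
        cumulative += freq
        if cumulative >= position:       # first cumulative frequency covering it
            return value
    raise ValueError("empty series")


x = [12, 16, 10, 14, 17, 20, 15]
f = [4, 9, 3, 5, 4, 2, 10]

print(median_discrete(x, f))  # 15
```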
4.6.3 Median for Continuous series
The procedure to get a median is different in continuous series. The class
intervals are already in the form of an array and the frequencies are
recorded against each class interval. For determining the size, we should
take the $\left(\frac{N}{2}\right)^{th}$ item, and the median class is located
accordingly with reference to the cumulative frequency which first covers
this size. When the median class is located, the median value is
interpolated using the formula given below.
Median = $l + \frac{h}{f}\left(\frac{N}{2} - c.f\right)$
Where l = lower limit of the median class
h = Class width,
f = frequency of median class
c.f = Cumulative frequency of class preceding the median class.
Key statistic
To solve problems on median:
i) Arrange the data in ascending order or descending order
ii) Make class-interval as exclusive type
Solved Problem 31
Find the median of the data in table 4.23
Table 4.23: Distribution of Weight Data
Solution
As it is an exclusive type of interval, we organise the data as shown in the
table 4.23a.
$\frac{N}{2} = \frac{100}{2} = 50$
Table 4.23a Cumulative Frequency Table
$\frac{N}{2} = \frac{50}{2} = 25$
Cum. frequency just above 25 is 33 and hence, 20 – 25 is median class.
l = 20
h = 20 – 15 = 5
f = 9
c.f = 24
Median = $l + \frac{h}{f}\left(\frac{N}{2} - c.f\right)$
Median = $20 + \frac{5}{9}(25 - 24) = 20 + \frac{5}{9}$
Median = 20.556
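The interpolation formula translates directly into a small function. This is a sketch with parameter names of our own choosing; the inputs are the values identified in the solution above (l = 20, h = 5, f = 9, c.f = 24, N = 50).

```python
def median_grouped(lower, width, class_freq, cum_before, n):
    """Median = l + (h / f) * (N/2 - c.f) for a continuous series."""
    return lower + (width / class_freq) * (n / 2 - cum_before)


m = median_grouped(lower=20, width=5, class_freq=9, cum_before=24, n=50)
print(round(m, 2))  # 20.56
```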
Solved Problem 33
Find the missing frequency for the data depicted in table 4.25, given that its
median is 34.
Table 4.25: Frequency table
Class interval Frequency
0 – 10 4
10 – 20 9
20 – 30 -
30 – 40 20
40 – 50 18
50 – 60 7
60 – 70 3
Solution
Since median is 34, it falls in the class-interval 30-40. Let ‘f’ be the missing
frequency. Therefore, we have the data shown in table 4.25a
Median = $l + \frac{h}{f}\left(\frac{N}{2} - c.f\right)$
Table 4.25a: Cumulative Frequency Distribution for Data
Class interval    Frequency    Less than Cumulative frequency (LCF)
0 – 10            4            4
10 – 20           9            13
20 – 30           f            13 + f
30 – 40           20           33 + f
40 – 50           18           51 + f
50 – 60           7            58 + f
60 – 70           3            61 + f
                  N = 61 + f
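$34 = 30 + \frac{10}{20}\left[\frac{(61 + f)}{2} - (13 + f)\right]$
$34 = 30 + \frac{1}{2}\left[\frac{(61 + f) - 2(13 + f)}{2}\right]$
$34 = 30 + \frac{1}{2}\left[\frac{61 + f - 26 - 2f}{2}\right]$
$34 = 30 + \frac{35 - f}{4}$
$34 = \frac{120 + 35 - f}{4}$
$136 = 155 - f$
$f = 19$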
Therefore, the missing frequency is 19.
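The answer can be verified by brute force: try each candidate frequency, locate the median class from the cumulative frequencies, and keep the candidates whose median is 34. This is only an illustrative check, with a function name and a search range (1 to 60) that are our own assumptions.

```python
import math


def median_from_table(lower_limits, width, freqs):
    """Locate the median class and interpolate: l + (h/f) * (N/2 - c.f)."""
    half = sum(freqs) / 2
    cumulative = 0
    for lower, freq in zip(lower_limits, freqs):
        if freq > 0 and cumulative + freq >= half:
            return lower + (width / freq) * (half - cumulative)
        cumulative += freq
    raise ValueError("empty table")


lower_limits = [0, 10, 20, 30, 40, 50, 60]

matches = [
    f for f in range(1, 61)
    if math.isclose(median_from_table(lower_limits, 10, [4, 9, f, 20, 18, 7, 3]), 34)
]
print(matches)  # [19]
```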
Merits Demerits
It can be easily understood and It is not based on all values.
computed.
It is not affected by extreme values. It is not capable of further algebraic
treatment.
It can be determined graphically
(Ogives).
Key statistic
In case of continuous series, Median M is given by:
Median = $l + \frac{h}{f}\left(\frac{N}{2} - c.f\right)$
Where
l = lower limit of the median class
h = Class width,
f = frequency of median class
c.f = Cumulative frequency of class preceding the median class.
4.7 Mode
In this section, we will discuss the Mode. Mode is the value which occurs
with the maximum frequency. It is the most typical or common value, the
one that receives the highest frequency. It represents fashion and is often
used in business. Thus, it corresponds to the value of the variable which
occurs most frequently. The modal class of a frequency distribution is the
class with the highest frequency. It is denoted by 'z'.
Mode is the value of the variable which is repeated the greatest number of
times in the series. It is the usual, not casual, size of item in the series.
It lies at the position of greatest density.
E.g.: If we say the modal marks obtained by students in a class test is 42,
it means that the largest number of students have secured 42 marks.
If each observation occurs the same number of times, we can say that there
is 'no mode'. If two observations share the highest frequency, we can
say that the distribution is 'Bi-modal'. If 3 or more observations share the
highest frequency, we say it is a 'multi-modal' case.
Modal value is most useful for business people. For example, shoe and
readymade garment manufacturers will like to know the modal size of the
people to plan their operations. For individual and discrete series, it is that
value corresponding to highest frequency.
Key statistic
In case of continuous series, mode is given by:
Mode = $l + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$
Where,
l = lower limit of the modal class
Solution
We note that the intervals are exclusive type and the highest frequency is
25. Therefore, the corresponding interval is 1200-1400, which is called the
modal class.
Mode = $l + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$
Where,
l = lower limit of the modal class = 1200
$f_1$ = frequency of the modal class = 25
$f_0$ = frequency of the class preceding the modal class = 15
$f_2$ = frequency of the class succeeding the modal class = 12
i = width of the class interval = 200
Therefore, the mode is calculated as:
Mode = $1200 + \frac{25 - 15}{2 \times 25 - 15 - 12} \times 200 = 1200 + \frac{2000}{23} = 1286.96$
Hence, the modal plinth area is 1286.96 square feet.
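The mode formula for a continuous series can be sketched as a small Python function. The parameter names are our own; the inputs are the modal-class figures listed in the solution above.

```python
def mode_grouped(lower, width, f1, f0, f2):
    """Mode = l + (f1 - f0) / (2*f1 - f0 - f2) * i for a continuous series."""
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("formula is undefined when 2*f1 - f0 - f2 is zero")
    return lower + (f1 - f0) / denominator * width


plinth_mode = mode_grouped(lower=1200, width=200, f1=25, f0=15, f2=12)
print(round(plinth_mode, 2))  # 1286.96
```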
Solved Problem 36
Find the mode for data depicted in table 4.29
Table 4.29: Frequency table
51 – 60 8
Total f = N = 100
We first have to convert the inclusive series into an exclusive series for
calculating the mode. To convert a discontinuous distribution to a continuous
distribution, subtract 0.5 from each lower limit and add 0.5 to each upper limit.
51 – 60 50.5 – 60.5 8
Total f = N = 100
We shall identify the modal class being the class of maximum frequency,
i.e., 30.5 – 40.5
Mode = $l + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$
Where,
l = lower limit of the modal class = 30.5
$f_1$ = frequency of the modal class = 31
$f_0$ = frequency of the class preceding the modal class = 26
$f_2$ = frequency of the class succeeding the modal class = 16
i = width of the class interval = 10
Therefore, the mode is calculated as:
Mode = $30.5 + \frac{31 - 26}{2 \times 31 - 26 - 16} \times 10 = 33$
Key statistic
The empirical relationship between Mean, Median and Mode:
Mean – Mode = 3 (Mean – Median)
which is same as, Mode = 3 Median – 2 Mean.
Merits Demerits
In many cases it can be found by It is not based on all values.
inspection.
It is not affected by extreme values. It is not capable of further
mathematical treatment.
It can be calculated for distributions with It is much affected by sampling
open end classes. fluctuations.
It can be located graphically.
It can be used for qualitative data.
Key statistic
Quartiles: When distribution is divided into four equal portions, then we
get first quartile (Q1), second quartile (Q2 = Median) and third quartile (Q3)
as the positional averages.
For Individual and Discrete series Q1 and Q3 are given by:
$Q_1$ = Size of the $\left(\frac{N+1}{4}\right)^{th}$ item
$Q_3$ = Size of the $\left(\frac{3(N+1)}{4}\right)^{th}$ item
For continuous distribution Q1 and Q3 are given by:
$Q_1 = l + \frac{N/4 - c.f}{f} \times i$
$Q_3 = l + \frac{3N/4 - c.f}{f} \times i$
Where,
l = lower limit of the quartile class
i = class width
f = frequency of quartile class
N = total frequency
c.f = cumulative frequency of class preceding the quartile class
Measures of quartiles
The quartile values are located on the principle similar to locating the
median value.
Table 4.31 depicts the procedure of locating quartiles.
Table 4.31: Procedure of locating quartiles
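Measure    For Individual and Discrete series             Continuous series                         Formula to be used for Continuous series
Q1         $\left(\frac{N+1}{4}\right)^{th}$ item         $\left(\frac{N}{4}\right)^{th}$ item      $Q_1 = l + \frac{N/4 - c.f}{f} \times i$
Q2         $\left(\frac{2(N+1)}{4}\right)^{th}$ item      $\left(\frac{2N}{4}\right)^{th}$ item     $Q_2 = l + \frac{2N/4 - c.f}{f} \times i$
Q3         $\left(\frac{3(N+1)}{4}\right)^{th}$ item      $\left(\frac{3N}{4}\right)^{th}$ item     $Q_3 = l + \frac{3N/4 - c.f}{f} \times i$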
Individual Series:
Solved Problem 37
Weekly sales of a product on 8 different shops are as follows. Calculate the
quartiles.
Sales in units: 309, 312, 305, 307, 310, 308, 308, 306
Solution
Arranging the data in ascending order
Sales in units: 305, 306, 307, 308, 308, 309, 310, 312
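$Q_1 = \left(\frac{N+1}{4}\right)^{th}$ item = $\left(\frac{8+1}{4}\right)^{th}$ item = 2.25th item
$Q_2 = \left(\frac{2(N+1)}{4}\right)^{th}$ item = $\left(\frac{2(8+1)}{4}\right)^{th}$ item = 4.5th item
= 4th value + 0.5 (5th value – 4th value)
= 308 + 0.5 (308 - 308) = 308
$Q_3 = \left(\frac{3(N+1)}{4}\right)^{th}$ item = $\left(\frac{3(8+1)}{4}\right)^{th}$ item = 6.75th item
= 6th value + 0.75 (7th value – 6th value)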
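The positional rule with interpolation between neighbouring items can be sketched in Python as follows. The function names are our own, and the printed quartiles are our own arithmetic on the eight weekly sales figures of this problem using the k(N+1)/4 positions.

```python
def item_at(sorted_values, position):
    """Value at a 1-based, possibly fractional, position by linear interpolation."""
    if position < 1 or position > len(sorted_values):
        raise ValueError("position falls outside the data")
    whole = int(position)                # e.g. 6 for the 6.75th item
    fraction = position - whole          # e.g. 0.75
    lower = sorted_values[whole - 1]
    if fraction == 0:
        return lower
    upper = sorted_values[whole]
    return lower + fraction * (upper - lower)


def quartiles(values):
    """Q1, Q2, Q3 as the k(N+1)/4-th items of the ordered data."""
    data = sorted(values)
    n = len(data)
    return tuple(item_at(data, k * (n + 1) / 4) for k in (1, 2, 3))


sales = [309, 312, 305, 307, 310, 308, 308, 306]
print(quartiles(sales))  # (306.25, 308.0, 309.75)
```

Library routines may use slightly different position rules, so their quartiles need not match these exactly.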
$Q_1$ = Size of the $\left(\frac{N+1}{4}\right)^{th}$ item
$Q_1$ = Size of the $\left(\frac{320+1}{4}\right)^{th}$ item
Q1 = 80.25th item
Just above 80.25, the c.f (Cumulative Frequency) is 100. Against 100 c.f,
value is 5.
Q1 = 5
$Q_2$ = Size of the $\left(\frac{2(N+1)}{4}\right)^{th}$ item
Q2 = 160.5th item
Just above 160.5, the c.f (Cumulative Frequency) is 230. Against 230 c.f,
value is 6.
Q2 = 6= median
$Q_3$ = Size of the $\left(\frac{3(N+1)}{4}\right)^{th}$ item
$Q_3$ = Size of the $\left(\frac{3(320+1)}{4}\right)^{th}$ item
Q3 = 240.75th item
Just above 240.75, the c.f (Cumulative Frequency) is 260. Against 260
c.f, value is 6.5.
Q3 = 6.5
4.9.2 Deciles
The deciles divide the arrayed set of variates into ten portions of equal
frequency and they are sometimes used to characterise the data for some
specific purpose. In this process, we get nine decile values. The fifth decile
is nothing but a median value. We can calculate other deciles by following
the procedure which is used in computing the quartiles.
Table 4.33: Formula to compute Deciles
Solved Problem 39
Find the 7th Decile for the data given below:
Table 4.34: Frequency table
Class 13 – 18 18-20 20-21 21-22 22-23 23-25 25-30
interval
Frequency 22 27 51 42 32 16 10
Solution
Table 4.34a: Computation of 7th decile
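Measure | For Individual and Discrete series | For Continuous series | Formula to be used for Continuous series
P1 | $\frac{N+1}{100}$th item | $\frac{N}{100}$th item | $P_1 = l + \frac{N/100 - c.f}{f} \times i$
P25 | $\frac{25(N+1)}{100}$th item | $\frac{25N}{100}$th item | $P_{25} = l + \frac{25N/100 - c.f}{f} \times i$
P99 | $\frac{99(N+1)}{100}$th item | $\frac{99N}{100}$th item | $P_{99} = l + \frac{99N/100 - c.f}{f} \times i$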
Middle Decile
$D_5 = \frac{5(N+1)}{10}$th item = $\frac{5(155+1)}{10}$th item
Solved Problem 41
For the data provided below, find the 20th percentile.
Table 4.37: Frequency table
Class
13 – 18 18-20 20-21 21-22 22-23 23-25 25-30
interval
Frequency 22 27 51 42 32 16 10
Solution
Table 4.37a: Computation of 20th percentile
4.10 Dispersion
In this section, we will discuss dispersion.
Definition: A measure of dispersion may be defined as a statistic
signifying the extent of the scattering of items around a measure of central
tendency.
It describes another characteristic of a distribution. Consider the two
distributions of weights of a product produced by two machines, depicted in
Table 4.38.
Table 4.38: Distribution of Weights of a Product
Machine A B
Sample size 1000 1000
Average weight 80 80
Minimum weight 20 40
Maximum weight 140 100
Machine ‘B’ produces products with weights much closer to the average
than Machine ‘A’. As a manufacturer or customer, we would choose
Machine ‘B’. In other words, we choose that machine whose spread is
smaller.
The property of deviations of values from the average is called Dispersion or
Variation. The degree of variation is found by the measures of variation.
They are as follows:
1. Range (R)
2. Quartile Deviation (Q.D)
3. Mean Deviation (M.D)
4. Standard Deviation (S.D)
They have units of measurement attached to them. Therefore, they are
known as absolute measures of variation. However, we may want to
compare two different distributions whose measurements are in terms of
kilograms and in terms of centimetres. Then, we use the following relative
measures that do not have any units attached to them. The relative
measures are as follows:
1. Coefficient of Range
2. Coefficient of Quartile Deviation
3. Coefficient of Mean Deviation
4. Coefficient of Variation
They are known as relative measures. In this unit, we study both measures
of variation and coefficients of variation simultaneously.
The prerequisites of a good measure of variation are as follows:
1. It should be easy to understand and simple to calculate.
2. It should be based on all values.
3. It should be rigidly defined.
4. It should not be affected by extreme values.
5. It should not be affected by sampling fluctuations.
6. It should be capable of further algebraic treatment.
4.10.1 Range
‘Range’ represents the differences between the values of the extremes.
The range of any sample is the difference between the highest and the
lowest values in the series.
The values in between two extremes are not taken into consideration. The
range is a simple indicator of the variability of a set of observations. It is
denoted by ‘R’. In a frequency distribution, the range is taken to be the
difference between the lower limit of the class at the lower extreme of the
distribution and the upper limit of the class at the upper extreme of the
distribution. Range can be computed using the following equations.
Range = Largest value – Smallest value = L – S
$\text{Coefficient of Range} = \frac{\text{Largest value} - \text{Smallest value}}{\text{Largest value} + \text{Smallest value}} = \frac{L - S}{L + S}$
Solved Problem 42
Find the Range of the following series 26, 28, 28, 26, 28, 30, 27, 29, 26, 24
Solution
The range ‘R’ is calculated as follows:
R= Range = Largest value – Smallest value = L - S
R = 30 – 24 = 6
Therefore, the range is 6.
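The calculation in Solved Problem 42 can be sketched in Python as follows. This is only an illustrative sketch and not part of the original solution; the function name `range_and_coefficient` is an assumption chosen for this example.

```python
def range_and_coefficient(values):
    """Return (range, coefficient of range) for an individual series."""
    largest, smallest = max(values), min(values)
    value_range = largest - smallest                    # R = L - S
    coefficient = value_range / (largest + smallest)    # (L - S) / (L + S)
    return value_range, coefficient


# Data of Solved Problem 42
series = [26, 28, 28, 26, 28, 30, 27, 29, 26, 24]
r, cr = range_and_coefficient(series)
print(r, round(cr, 3))
```

The range agrees with the value R = 6 obtained above; the coefficient of range is the same range divided by L + S = 54.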
Solved Problem 43
Compute range and coefficient of range for the following discrete series of
data.
Table 4.40: Frequency table
X: 6 12 18 24 30 36 42
f: 20 130 16 14 20 15 40
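Solution:
R = L – S = 42 – 6 = 36
$\text{Coefficient of Range} = \frac{L - S}{L + S} = \frac{42 - 6}{42 + 6} = \frac{36}{48} = 0.75$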
Solution
Range R is calculated as follows:
R = 25 – 0 = 25
Therefore, the range of the given continuous series is 25.
Solved problem 45
Compute the Range and also the co-efficient of Range of the given series
Table 4.42
Solution:
Table 4.42a: Computation of the Range and also the co-efficient of Range
R = L – S = 21 – 9 = 12, $CR = \frac{L - S}{L + S} = \frac{12}{21 + 9} = \frac{12}{30} = 0.4$
R = L – S = 29 – 1 = 28, $CR = \frac{L - S}{L + S} = \frac{28}{30} = 0.933$
Solved Problem 46
Find Range and co-efficient of Range from following data and state which is
more dispersed and which is more uniform:
Table 4.43
A 10 11 12 13 14
B 40 41 42 43 44
C 100 101 102 103 104
Key statistic
Range is not defined if the class intervals are open.
Key statistic
1. Q3 – Q1 is called the inter-quartile range.
2. Q3 – Q1 spans the middle 50% of the readings. Q3 and Q1 are also known as
the upper and lower limits of the middle 50% of the readings.
3. The quartile range is not capable of further algebraic treatment.
Solved Problem 47
Find the Quartile Deviation and the Co-efficient of Quartile Deviation, from
the marks of 12 students depicted in table 4.44.
Table 4.44
Sl. No 1 2 3 4 5 6 7 8 9 10 11 12
Marks 25 30 37 43 48 54 61 67 72 80 84 89
Solution
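$Q_1 = \frac{N+1}{4}$th item = $\frac{12+1}{4}$th item = 3.25th item
= 3rd item + 0.25 (4th item – 3rd item) = 37 + 0.25 (43 – 37)
Q1 = 38.5
$Q_3 = \frac{3(N+1)}{4}$th item = $\frac{3(12+1)}{4}$th item = 9.75th item
= 9th item + 0.75 (10th item – 9th item) = 72 + 0.75 (80 – 72)
Q3 = 78
Quartile Deviation = $\frac{1}{2}(Q_3 - Q_1)$
= $\frac{1}{2}(78 - 38.5)$
Quartile Deviation = 19.75
Coefficient of Quartile Deviation = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{78 - 38.5}{78 + 38.5} = 0.339$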
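The positional method used in Solved Problem 47 can be sketched in Python as below. This is a minimal illustration, assuming the $\frac{k(N+1)}{4}$th-item rule with linear interpolation between neighbouring items; the helper names `value_at_position` and `quartile` are hypothetical and do not come from the text.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based (possibly fractional) position, interpolating linearly."""
    whole = int(position)            # e.g. 3 for the 3.25th item
    fraction = position - whole      # e.g. 0.25
    if whole < 1:
        return sorted_values[0]
    if whole >= len(sorted_values):
        return sorted_values[-1]
    lower = sorted_values[whole - 1]
    upper = sorted_values[whole]
    return lower + fraction * (upper - lower)


def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) taken as the [k(N + 1)/4]th item."""
    ordered = sorted(values)
    return value_at_position(ordered, k * (len(ordered) + 1) / 4)


# Marks of the 12 students in Table 4.44
marks = [25, 30, 37, 43, 48, 54, 61, 67, 72, 80, 84, 89]
q1, q3 = quartile(marks, 1), quartile(marks, 3)
quartile_deviation = (q3 - q1) / 2
coefficient_qd = (q3 - q1) / (q3 + q1)
print(q1, q3, quartile_deviation, round(coefficient_qd, 3))
```

Note that statistical packages often use other quartile conventions, so their results may differ slightly from this rule.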
Solved Problem 48
Compute quartile deviation and its coefficient for the data depicted in the
table 4.45:
Table 4.45: Frequency table
X 58 59 60 61 62 63 64 65 66
f 15 20 32 35 33 22 20 10 8
Solution
Table 4.45a: Computation of Quartile deviation and its coefficient
X | f | Less than Cumulative frequency (LCF)
58 | 15 | 15
59 | 20 | 35
60 (Q1) | 32 | 67
61 | 35 | 102
62 | 33 | 135
63 (Q3) | 22 | 157
64 | 20 | 177
65 | 10 | 187
66 | 8 | 195
N = 195
$Q_1 = \frac{N+1}{4}$th item = $\frac{195+1}{4}$th item = 49th item
$Q_3 = \frac{3(N+1)}{4}$th item = $\frac{3(195+1)}{4}$th item = 147th item
Solved Problem 49
Find Quartile Deviation and Coefficient of Quartile Deviation for the given
data and also compute middle quartile.
Table 4.46: Frequency table
Class 1 – 10 11 – 20 21 – 30 31 – 40 41 – 50 51 – 60
Interval
Frequency 3 16 26 31 16 8
Solution
Table 4.46a: Computation of Quartile Deviation and its coefficient
$Q_2 = l + \frac{2N/4 - c.f}{f} \times i$
$Q_2 = 30.5 + \frac{2(100)/4 - 45}{31} \times 10$
Q2 = 32.11
The third Quartile Q3 is given by
(3N/4) = (3 x 100/4) = 75
The cumulative frequency just above 75 is 76, so Q3 lies in the class 30.5 – 40.5
$Q_3 = l + \frac{3N/4 - c.f}{f} \times i$
$Q_3 = 30.5 + \frac{3(100)/4 - 45}{31} \times 10$
Q3 = 40.17
Quartile Deviation = $\frac{1}{2}(Q_3 - Q_1)$ = $\frac{1}{2}(40.17 - 22.80)$ = 8.685
Co-efficient of Quartile Deviation = $\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{40.17 - 22.80}{40.17 + 22.80} = \frac{17.37}{62.97} = 0.276$
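The interpolation formula for a continuous series can be sketched in Python as follows. This is an illustrative sketch only: the function name `grouped_quartile` is an assumption, and the class boundaries 0.5, 10.5, ... are the exclusive-form lower limits of the classes in Table 4.46. The results agree with the worked values to about two decimal places; small differences in the last digit come from rounding.

```python
def grouped_quartile(lower_limits, frequencies, width, k):
    """k-th quartile of a continuous series: l + ((kN/4 - c.f) / f) * i."""
    n = sum(frequencies)
    target = k * n / 4
    cumulative = 0                      # c.f of the classes before the current one
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f >= target:    # this is the quartile class
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("quartile class not found")


# Table 4.46 with classes converted to 0.5-10.5, 10.5-20.5, ...
lower_limits = [0.5, 10.5, 20.5, 30.5, 40.5, 50.5]
frequencies = [3, 16, 26, 31, 16, 8]
q1 = grouped_quartile(lower_limits, frequencies, 10, 1)
q2 = grouped_quartile(lower_limits, frequencies, 10, 2)
q3 = grouped_quartile(lower_limits, frequencies, 10, 3)
print(round(q1, 2), round(q2, 2), round(q3, 2), round((q3 - q1) / 2, 2))
```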
Table 4.47 depicts the merits and demerits of quartile deviation.
Table 4.47: Merits and Demerits of Quartile Deviation
Merits | Demerits
It is easy to understand and to compute. | It is not based on all values.
It is rigidly defined. | It is affected by sampling fluctuations.
It is not affected by extreme values. | It is not capable of further algebraic treatment.
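For individual series, Mean deviation from Mean is calculated as:
$M.D(\bar{X}) = \frac{\sum \lvert X_i - \bar{X} \rvert}{\sum f}$
For discrete and continuous series, Mean deviation from Mean is calculated
as:
$M.D(\bar{X}) = \frac{\sum f \lvert X_i - \bar{X} \rvert}{\sum f}$
For individual series, Mean deviation from Median is calculated as:
$M.D(\text{Median}) = \frac{\sum \lvert X_i - M \rvert}{\sum f}$
For discrete and continuous series, Mean deviation from Median is
calculated as:
$M.D(\text{Median}) = \frac{\sum f \lvert X_i - M \rvert}{\sum f}$
In the case of a continuous series, ‘X’ represents the mid-value of the class interval.
Mean deviation is the least when it is taken from the median.
The corresponding relative measures are the coefficients of Mean Deviation.
$\text{Coefficient of M.D.}(\bar{X}) = \frac{M.D.(\bar{X})}{\bar{X}}$
$\text{Coefficient of M.D.(Median)} = \frac{M.D.(\text{Median})}{\text{Median}}$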
Solved Problem 50
Compute Mean deviation and its coefficient from Mean and Median for the
data.
X: 21, 32, 38, 41,49, 54, 59, 66, 68
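$X_i$ | $\lvert X_i - \bar{X} \rvert = \lvert X_i - 47.55 \rvert$ | $\lvert X_i - M \rvert = \lvert X_i - 49 \rvert$
21 | 26.55 | 28
32 | 15.55 | 17
38 | 9.55 | 11
41 | 6.55 | 8
49 | 1.45 | 0
54 | 6.45 | 5
59 | 11.45 | 10
66 | 18.45 | 17
68 | 20.45 | 19
$\sum X = 428$ | $\sum \lvert X_i - \bar{X} \rvert = 116.45$ | $\sum \lvert X_i - M \rvert = 115$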
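$\bar{X} = \frac{\sum X}{N} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{N} = \frac{428}{9} = 47.55$
$M.D(\bar{X}) = \frac{\sum \lvert X_i - \bar{X} \rvert}{\sum f} = \frac{116.45}{9}$
$M.D(\bar{X})$ = 12.938
$\text{Coefficient of M.D.}(\bar{X}) = \frac{M.D.(\bar{X})}{\bar{X}} = \frac{12.938}{47.55} = 0.272$
Median = $\frac{N+1}{2}$th item = $\frac{9+1}{2}$th item = 5th item = 49
$M.D(\text{Median}) = \frac{\sum \lvert X_i - M \rvert}{\sum f} = \frac{115}{9} = 12.778$
$\text{Coefficient of M.D.(Median)} = \frac{M.D.(\text{Median})}{\text{Median}} = \frac{12.778}{49} = 0.2608$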
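The two mean deviations of Solved Problem 50 can be checked with a short Python sketch. It is offered only as an illustration; the function name `mean_deviation` is an assumption, while `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median


def mean_deviation(values, centre):
    """Average of the absolute deviations of the values from a chosen centre."""
    return sum(abs(x - centre) for x in values) / len(values)


# Data of Solved Problem 50
data = [21, 32, 38, 41, 49, 54, 59, 66, 68]
md_mean = mean_deviation(data, mean(data))
md_median = mean_deviation(data, median(data))
coefficient_mean = md_mean / mean(data)
coefficient_median = md_median / median(data)
print(round(md_mean, 3), round(md_median, 3))
print(round(coefficient_mean, 3), round(coefficient_median, 4))
```

The mean deviation about the median comes out smaller than the one about the mean, in line with the remark that mean deviation is least when taken from the median.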
Solved Problem 51
Compute Mean deviation and its coefficient from Mean and Median for the
data.
Table 4.49: Frequency table
Marks (X) 5 10 15 20 25
Students (f) 6 7 8 11 8
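$X_i$ | f | $fX_i$ | $\lvert X_i - \bar{X} \rvert = \lvert X_i - 16 \rvert$ | $f \lvert X_i - \bar{X} \rvert$
5 | 6 | 30 | 11 | 66
10 | 7 | 70 | 6 | 42
15 | 8 | 120 | 1 | 8
20 | 11 | 220 | 4 | 44
25 | 8 | 200 | 9 | 72
 | N = 40 | $\sum fX = 640$ | | $\sum f \lvert X_i - \bar{X} \rvert = 232$
$\bar{X} = \frac{\sum fX}{\sum f} = \frac{\sum fX}{N} = \frac{640}{40} = 16$
$M.D(\bar{X}) = \frac{\sum f \lvert X_i - \bar{X} \rvert}{\sum f} = \frac{232}{40} = 5.8$ marks
$\text{Coefficient of M.D.}(\bar{X}) = \frac{M.D.(\bar{X})}{\bar{X}} = \frac{5.8}{16} = 0.363$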
Table 4.49b: Mean Deviation about Median
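$X_i$ | f | Cumulative Frequency (LCF) | $\lvert X_i - M \rvert = \lvert X_i - 15 \rvert$ | $f \lvert X_i - M \rvert$
5 | 6 | 6 | 10 | 60
10 | 7 | 13 | 5 | 35
15 | 8 | 21 | 0 | 0
20 | 11 | 32 | 5 | 55
25 | 8 | 40 | 10 | 80
 | N = 40 | | | $\sum f \lvert X_i - M \rvert = 230$
Median = $\frac{40+1}{2}$th item = 20.5th item
The 20.5th item falls within the cumulative frequency 21, which corresponds to the value 15.
Therefore, the Median is 15.
$M.D(\text{Median}) = \frac{\sum f \lvert X_i - M \rvert}{\sum f} = \frac{230}{40} = 5.75$ marks
$\text{Coefficient of M.D.(Median)} = \frac{M.D.(\text{Median})}{\text{Median}} = \frac{5.75}{15} = 0.383$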
Solved Problem 52
Compute Mean Deviation about its Mode and its coefficient from the data
depicted in table 4.50
Table 4.50: Mean Deviation about Mode
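X | f | $\lvert X_i - \text{Mode} \rvert$ | $f \lvert X_i - \text{Mode} \rvert$
20 | 6 | 100 | 600
40 | 19 | 80 | 1520
60 | 40 | 60 | 2400
80 | 23 | 40 | 920
100 | 65 | 20 | 1300
120 (Mode) | 83 | 0 | 0
140 | 55 | 20 | 1100
160 | 20 | 40 | 800
180 | 9 | 60 | 540
 | $\sum f = 320$ | | $\sum f \lvert X_i - \text{Mode} \rvert = 9180$
Solution
The highest frequency is 83 and hence Mode = 120
$M.D(\text{Mode}) = \frac{\sum f \lvert X_i - \text{Mode} \rvert}{\sum f} = \frac{9180}{320} = 28.69$
$\text{Coefficient of M.D.(Mode)} = \frac{M.D.(\text{Mode})}{\text{Mode}} = \frac{28.69}{120} = 0.239$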
Solved Problem 53
Find out the Mean Deviation from the data depicted below about its Median
and its coefficient.
Table 4.51: Frequency table
Size 0-10 10-20 20-30 30-40 40-50 50-60 60-70
Frequency 7 12 18 25 16 14 8
$M.D(\text{Median}) = \frac{\sum f \lvert X_i - M \rvert}{\sum f} = \frac{1314.8}{100} = 13.148$
$\text{Coefficient of M.D.(Median)} = \frac{M.D.(\text{Median})}{\text{Median}} = \frac{13.148}{35.2} = 0.3735$
Table 4.52 depicts the merits and demerits of Mean Deviation.
Table 4.52: Merits and Demerits of Mean Deviation
Merits | Demerits
It is based on all values. | It is not capable of further algebraic treatment.
It is less affected by extreme values. | It does not take into account negative signs.
It is not affected much by sampling fluctuations. |
$\text{Variance} = \frac{\sum (X - \bar{X})^2}{N}$
Standard Deviation ($\sigma$) = $\sqrt{\text{Variance}}$
ii) Deviation Taken from Assumed Mean: When the Arithmetic Mean is a
fractional value, the method explained in (i) will be tedious and
time-consuming. Hence we use the following formula.
$\text{Variance} = \frac{\sum d^2}{\sum f} - \left(\frac{\sum d}{\sum f}\right)^2$
Standard Deviation ($\sigma$) = $\sqrt{\text{Variance}}$
Where d stands for the deviation from the assumed mean, $d = X - A$, A is the
assumed mean, and $\sum f = N$
Discrete series:
i) Actual Mean Method:
$\text{Variance} = \frac{\sum f (X - \bar{X})^2}{\sum f}$
Standard Deviation ($\sigma$) = $\sqrt{\text{Variance}}$
ii) Assumed Mean Method:
$\text{Variance} = \frac{\sum f d^2}{\sum f} - \left(\frac{\sum f d}{\sum f}\right)^2$
Standard Deviation ($\sigma$) = $\sqrt{\text{Variance}}$
Where $d = X - A$, A = assumed mean, $\sum f = N$
Continuous series:
In a continuous series, the mid-values of the class intervals are to be found.
Here, ‘X’ is the mid-value of the class interval.
$\text{Variance} = \left[\frac{\sum f d^2}{\sum f} - \left(\frac{\sum f d}{\sum f}\right)^2\right] \times i^2$
$\sigma = \sqrt{\text{Variance}}$
Where $d = \frac{X - A}{i}$, A = assumed mean, $\sum f = N$, i = class width
Key statistic
The square of standard deviation is called variance. It is denoted by $\sigma^2$.
Merits | Demerits
It is rigidly defined. | It is difficult to understand.
It is based on all values. | It gives undue weightage for extreme values.
It is capable of further algebraic treatment. | It cannot be calculated for classes with open end interval.
It is not very much affected by sampling fluctuations. |
Solved Problem 54
Find the Standard Deviation of (Rs.) 7, 9, 16, 24, 26
Solution:
Table 4.54: Computation of Standard Deviation
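Variate (Rs.) X | $(X - \bar{X}) = (X - 16.40)$ | $(X - \bar{X})^2$
7 | -9.4 | 88.36
9 | -7.4 | 54.76
16 | -0.4 | 0.16
24 | 7.6 | 57.76
26 | 9.6 | 92.16
$\sum X = 82$ | | $\sum (X - \bar{X})^2 = 293.20$
$\bar{X} = \frac{\sum X}{N} = \frac{82}{5}$ = Rs. 16.40
$\text{Variance} = \frac{\sum (X - \bar{X})^2}{N} = \frac{293.20}{5} = 58.64$
Standard Deviation ($\sigma$) = $\sqrt{\text{Variance}}$ = $\sqrt{58.64}$ = Rs. 7.66
Hence the standard deviation is Rs. 7.66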
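The actual-mean method of Solved Problem 54 can be sketched in Python as below. This is an illustrative sketch, not part of the original solution; the variable names are assumptions. The standard library function `statistics.pstdev` (population standard deviation, dividing by N) is used only as a cross-check.

```python
from math import sqrt
from statistics import pstdev

# Data of Solved Problem 54 (Rs.)
values = [7, 9, 16, 24, 26]
n = len(values)
mean_value = sum(values) / n                                 # 16.40
variance = sum((x - mean_value) ** 2 for x in values) / n    # divide by N, as in the text
sd = sqrt(variance)
print(round(mean_value, 2), round(variance, 2), round(sd, 2))
print(round(pstdev(values), 2))                              # cross-check
```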
Solved Problem 55
Find the Standard Deviation from the following:
Table 4.55: Frequency table
Weight (K.g) 44-46 46-48 48-50 50-52 52-54 Total
No. of persons 3 24 27 21 5 80
Solution
Table 4.55a: Computation of Standard Deviation
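Weight (kg) | Mid Point X | Frequency f | $d = \frac{X - 49}{2}$ | fd | $fd^2$
44-46 | 45 | 3 | -2 | -6 | 12
46-48 | 47 | 24 | -1 | -24 | 24
48-50 | 49 | 27 | 0 | 0 | 0
50-52 | 51 | 21 | 1 | 21 | 21
52-54 | 53 | 5 | 2 | 10 | 20
 | | $\sum f = 80$ | | $\sum fd = 1$ | $\sum fd^2 = 77$
$\text{Variance} = \left[\frac{\sum f d^2}{\sum f} - \left(\frac{\sum f d}{\sum f}\right)^2\right] \times i^2 = \left[\frac{77}{80} - \left(\frac{1}{80}\right)^2\right] \times 2^2 = 3.8494$
Standard Deviation ($\sigma$) = $\sqrt{\text{Variance}}$ = $\sqrt{3.8494}$ = 1.96 kg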
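The step-deviation (assumed mean) calculation of Solved Problem 55 can be sketched in Python as follows. This is a minimal sketch under the same choices as the worked solution (assumed mean A = 49, class width i = 2); the function name `step_deviation_sd` is hypothetical.

```python
from math import sqrt


def step_deviation_sd(midpoints, frequencies, assumed_mean, width):
    """Variance and S.D. of a continuous series using d = (X - A) / i."""
    n = sum(frequencies)
    d = [(x - assumed_mean) / width for x in midpoints]
    sum_fd = sum(f * di for f, di in zip(frequencies, d))
    sum_fd2 = sum(f * di ** 2 for f, di in zip(frequencies, d))
    variance = (sum_fd2 / n - (sum_fd / n) ** 2) * width ** 2
    return variance, sqrt(variance)


# Table 4.55: class mid-points and frequencies
midpoints = [45, 47, 49, 51, 53]
frequencies = [3, 24, 27, 21, 5]
variance, sd = step_deviation_sd(midpoints, frequencies, assumed_mean=49, width=2)
print(round(variance, 4), round(sd, 2))
```

Any other assumed mean gives the same variance; A = 49 simply keeps the deviations small.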
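Combined mean = $\bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2} = \frac{100 \times 150 + 200 \times 200}{100 + 200} = \frac{15000 + 40000}{300} = \frac{55000}{300}$ = 183.33 gm
$d_1 = \bar{X}_1 - \bar{X}_{12}$ = (150 – 183.33) = -33.33, $d_1^2$ = (-33.33)² = 1110.889
$d_2 = \bar{X}_2 - \bar{X}_{12}$ = (200 – 183.33) = 16.67, $d_2^2$ = (16.67)² = 277.8889
$\sigma^2$ = 738.8889
$\sigma$ = 27.18
Hence, the combined standard deviation is 27.18.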
Solved Problem 57
The mean of two samples of sizes 50 and 100 respectively are 54.1 and
50.3 and their standard deviations are 8 and 7 respectively. Obtain the
Standard Deviation for the combined group.
Solution
$N_1 = 50$, $\bar{X}_1 = 54.1$, $\sigma_1 = 8$
$N_2 = 100$, $\bar{X}_2 = 50.3$, $\sigma_2 = 7$
$\bar{X}_{12}$ = 51.56
$150\sigma^2$ = 8580.5
$\sigma^2$ = 57.2033
$\sigma$ = 7.5632
Hence the standard deviation for the combined group is 7.5632.
Solved Problem 58
The mean wage is Rs. 75 per day, SD wage is Rs. 5 per day for a group of
1000 workers and the same is Rs. 60 and Rs. 4.5 for the other group of
1500 workers. Find mean and standard deviation for the entire group.
Solution
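We have by data, $\bar{X}_1 = 75$, $\sigma_1 = 5$, $N_1 = 1000$
and $\bar{X}_2 = 60$, $\sigma_2 = 4.5$, $N_2 = 1500$
Combined Mean:
$\bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2} = \frac{1000 \times 75 + 1500 \times 60}{1000 + 1500} = 66$
Combined Standard deviation:
$d_1 = \bar{X}_1 - \bar{X}_{12}$ = (75 – 66) = 9, $d_1^2$ = 81
$d_2 = \bar{X}_2 - \bar{X}_{12}$ = (60 – 66) = -6, $d_2^2$ = 36
$(N_1 + N_2)\sigma^2 = N_1(\sigma_1^2 + d_1^2) + N_2(\sigma_2^2 + d_2^2)$
$(1000 + 1500)\sigma^2 = 1000(5^2 + 81) + 1500(4.5^2 + 36)$
$2500\sigma^2 = 190375$
$\sigma^2$ = 76.15
$\sigma$ = 8.73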
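The combined mean and combined standard deviation can be sketched in Python as below, using the relation $(N_1 + N_2)\sigma^2 = N_1(\sigma_1^2 + d_1^2) + N_2(\sigma_2^2 + d_2^2)$ from the solution above. The function name `combine_groups` is an assumption made for this illustration.

```python
from math import sqrt


def combine_groups(n1, mean1, sd1, n2, mean2, sd2):
    """Combined mean and S.D. of two groups from their sizes, means and S.D.s."""
    total = n1 + n2
    combined_mean = (n1 * mean1 + n2 * mean2) / total
    d1 = mean1 - combined_mean
    d2 = mean2 - combined_mean
    combined_variance = (n1 * (sd1 ** 2 + d1 ** 2) + n2 * (sd2 ** 2 + d2 ** 2)) / total
    return combined_mean, sqrt(combined_variance)


# Wage data: 1000 workers (mean 75, S.D. 5) and 1500 workers (mean 60, S.D. 4.5)
combined_mean, combined_sd = combine_groups(1000, 75, 5, 1500, 60, 4.5)
print(combined_mean, round(combined_sd, 2))
```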
Solution:
Table 4.56a
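Prices of X shares (X) | $(X - \bar{X}) = (X - 53)$ | $(X - \bar{X})^2$ | Prices of Y shares (Y) | $(Y - \bar{Y}) = (Y - 105)$ | $(Y - \bar{Y})^2$
55 | 2 | 4 | 108 | 3 | 9
54 | 1 | 1 | 107 | 2 | 4
52 | -1 | 1 | 105 | 0 | 0
53 | 0 | 0 | 105 | 0 | 0
56 | 3 | 9 | 106 | 1 | 1
58 | 5 | 25 | 107 | 2 | 4
52 | -1 | 1 | 104 | -1 | 1
50 | -3 | 9 | 103 | -2 | 4
51 | -2 | 4 | 104 | -1 | 1
49 | -4 | 16 | 101 | -4 | 16
$\sum X = 530$ | | $\sum (X - \bar{X})^2 = 70$ | $\sum Y = 1050$ | | $\sum (Y - \bar{Y})^2 = 40$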
Solution
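$\bar{X} = \frac{\sum X}{N} = \frac{530}{10} = 53$
$\bar{Y} = \frac{\sum Y}{N} = \frac{1050}{10} = 105$
$\sigma_x = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} = \sqrt{\frac{70}{10}} = 2.64$
$\sigma_y = \sqrt{\frac{\sum (Y - \bar{Y})^2}{N}} = \sqrt{\frac{40}{10}} = 2$
$CV_x = \frac{\sigma_x}{\bar{X}} \times 100 = \frac{2.64}{53} \times 100 = 4.98\%$
$CV_y = \frac{\sigma_y}{\bar{Y}} \times 100 = \frac{2}{105} \times 100 = 1.90\%$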
Y shares are more stable in value than X shares since the coefficient of
variation of Y shares is lower than the coefficient of variation of X shares.
Solved Problem 60
Goals scored by two teams A & B in a football season are as depicted in
table 4.57. By calculating Coefficient of Variation in each, find which team
may be considered as more consistent.
Table 4.57: Goals scored by 2 teams A and B
No. of goals (X) | No. of matches, A-team (f) | No. of matches, B-team (f)
0 | 27 | 17
1 | 9 | 9
2 | 8 | 6
3 | 5 | 5
4 | 4 | 3
 | N = 53 | N = 40
Solution
With the help of the given data and data depicted in table 4.57, we can
calculate the more consistent team.
Table 4.57a: Computation of Coefficient of Variation
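fX, Team (A) | fX, Team (B) | $fX^2$, Team (A) | $fX^2$, Team (B)
0 | 0 | 0 | 0
9 | 9 | 9 | 9
16 | 12 | 32 | 24
15 | 15 | 45 | 45
16 | 12 | 64 | 48
$\sum fX = 56$ | $\sum fX = 48$ | $\sum fX^2 = 150$ | $\sum fX^2 = 126$
$\bar{X}_A = \frac{\sum fX}{N} = \frac{56}{53} = 1.056$
$\bar{X}_B = \frac{\sum fX}{N} = \frac{48}{40} = 1.2$
$\sigma_A^2 = \frac{\sum fX^2}{N} - (\bar{X}_A)^2 = \frac{150}{53} - (1.056)^2 = 1.715$
$\sigma_A = \sqrt{1.715} = 1.30$
$\sigma_B^2 = \frac{\sum fX^2}{N} - (\bar{X}_B)^2 = \frac{126}{40} - (1.2)^2 = 1.71$
$\sigma_B = \sqrt{1.71} = 1.30$
$CV_A = \frac{\sigma_A}{\bar{X}_A} \times 100 = \frac{1.30}{1.056} \times 100 = 123.1\%$
$CV_B = \frac{\sigma_B}{\bar{X}_B} \times 100 = \frac{1.30}{1.2} \times 100 = 108.33\%$
Team B has the lower coefficient of variation and may therefore be considered the more consistent team.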
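The comparison in Solved Problem 60 can be sketched in Python as follows. This is an illustrative sketch; the function name `coefficient_of_variation` is an assumption. Because the code does not round the standard deviations to 1.30 before dividing, its percentages differ slightly from the hand-computed figures, but the ranking of the two teams is unchanged.

```python
from math import sqrt


def coefficient_of_variation(values, frequencies):
    """C.V. (%) of a discrete frequency distribution, using sum(fX^2)/N - mean^2."""
    n = sum(frequencies)
    mean_value = sum(f * x for x, f in zip(values, frequencies)) / n
    variance = sum(f * x * x for x, f in zip(values, frequencies)) / n - mean_value ** 2
    return sqrt(variance) / mean_value * 100


# Table 4.57: goals per match and number of matches for each team
goals = [0, 1, 2, 3, 4]
cv_a = coefficient_of_variation(goals, [27, 9, 8, 5, 4])
cv_b = coefficient_of_variation(goals, [17, 9, 6, 5, 3])
more_consistent = "A" if cv_a < cv_b else "B"
print(round(cv_a, 1), round(cv_b, 1), more_consistent)
```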
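Let us consider
$\sigma^2 = \frac{\sum X^2}{N} - (\bar{X})^2$
$5.1^2 = \frac{\sum X^2}{100} - 40^2$
i.e., $\frac{\sum X^2}{100} = 40^2 + 5.1^2$, or $\frac{\sum X^2}{100} = 1626.01$
$\sum X^2$ (Incorrect) = 100 x 1626.01 = 162601
Correct $\sigma^2 = \frac{\text{Correct } \sum X^2}{n} - (\text{Correct } \bar{X})^2$
i.e., Correct $\sigma^2 = \frac{161701}{100} - 39.9^2 = 25$, $\sigma = 5$
Coefficient of variation = $\frac{\sigma}{\bar{X}} \times 100$
= $\frac{5}{39.9} \times 100 = 12.53\%$
Hence Correct Coefficient of variation = 12.53%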
Activity
1. The average rainfall of a city from Monday to Saturday is 0.3 inches.
Due to heavy rainfall on Sunday, the average rainfall for the week
increased to 0.5 inches. What is the rainfall on Sunday?
2. The average salary of male employees in a firm was Rs. 520 and that
of females Rs. 420. The mean of salary of all the employees as a
whole is Rs. 500. Find the percentage of male and female
employees.
3. For a given frequency table, find out the missing data. The average
accident is 1.46 and N= 200.
Activity Solution
Solution 1
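Given: Average rainfall from Monday to Saturday = 0.3 inches
Average rainfall for the whole week (Monday to Sunday) = 0.5 inches
$\bar{X} = \frac{\sum fX_1}{N}$, $0.3 = \frac{\sum fX_1}{6}$, $\sum fX_1 = 1.8$
$\bar{X} = \frac{\sum fX_2}{N}$, $0.5 = \frac{\sum fX_2}{7}$, $\sum fX_2 = 3.5$
Rainfall on Sunday = $\sum fX_2 - \sum fX_1$
= 3.5 – 1.8
= 1.7 inches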
Solution 2
Given: $\bar{X}_1 = 520$, $\bar{X}_2 = 420$, $\bar{X}_{12} = 500$
N1 = No. of male employees, N2 = No. of female employees
$\bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2}$, so $500 = \frac{(N_1 \times 520) + (N_2 \times 420)}{N_1 + N_2}$
500N1 + 500N2 = 520N1 + 420N2
500N2 - 420N2 = 520N1 – 500N1
80N2 = 20N1
N1 = 4N2
Let N1 + N2 = 100
4N2 + N2 = 100
5N2 = 100
N2 = 20% Female
N1 = 80% Male
Therefore, 80% of the employees in the firm are male and 20% are female.
Solution 3
Table 4.59
No. of accidents (X) | Frequency (f) | fX
0 | 46 | 0
1 | f1 | f1
2 | f2 | 2f2
3 | 25 | 75
4 | 10 | 40
5 | 5 | 25
 | N = 200 | $\sum fX$ = 140 + f1 + 2f2
$\bar{X} = \frac{\sum fX}{N}$
$1.46 = \frac{140 + f_1 + 2f_2}{200}$
292 = 140 + f1 + 2f2
f1 + 2f2 = 152 ---- (1)
We know that N = $\sum f$
200 = 86 + f1 + f2
f1 + f2 = 114 ---- (2)
Subtracting (2) from (1):
f1 + 2f2 = 152 ---- (1)
f1 + f2 = 114 ---- (2)
----------------------------------
f2 = 38
----------------------------------
Substituting f2 = 38 in (2):
f1 + 38 = 114, f1 = 114 – 38, f1 = 76
Solution 4
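Given:
AM = 25, GM = 15, HM = ?
$\bar{X} = \frac{a+b}{2}$, so $25 = \frac{a+b}{2}$ and a + b = 50
$GM = \sqrt{ab}$, so $15 = \sqrt{ab}$, $(15)^2 = (\sqrt{ab})^2$ and ab = 225
$HM = \frac{2}{\frac{1}{a} + \frac{1}{b}} = \frac{2ab}{a+b} = \frac{2 \times 225}{50}$
HM = 9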
Solution 5
GM = 60, HM = 28.24, AM = ?
$60 = \sqrt{ab}$, so $60^2 = ab$ and ab = 3600
$28.24 = \frac{2ab}{a+b}$, so $a + b = \frac{2 \times 3600}{28.24}$ and a + b = 254.95
$\bar{X} = \frac{a+b}{2} = \frac{254.95}{2} = 127.475$
4.13 Summary
Let us recapitulate the important concepts discussed in this unit:
The measures of central tendency and measures of dispersion
summarise mass data in terms of its two important features. They are as
follows:
i. With respect to nature of data to cluster around a central value
ii. With respect to their spread from their central value
Arithmetic mean is defined as the sum of all values divided by number of
values.
Median of a set of values is the middle most value when the values are
arranged in the ascending order of magnitude.
Mode is the value which has the highest frequency.
4.14 Glossary
Arithmetic mean: Sum of observations divided by number of observations.
Coefficient of variation: The ratio of standard deviation to the mean,
usually expressed in % form.
Quartiles: Measures which divide an array into four equal parts are known
as Quartiles.
Geometric mean: The nth root of the product of ‘n’ observation.
Harmonic Mean: The reciprocal of the arithmetic mean of the reciprocals of
observations.
Inter-quartile range (IQR): The difference between the third quartile and
first quartile.
Mean or average deviation: Sum of the absolute deviations of the values
from their mean or median divided by number of values.
Median: The middle-most value of a series when arranged in
ascending or descending order of magnitude.
Mode: The value which has, or tends to have, the maximum number of
observations as compared to any other value.
Percentiles: Percentile values divide the distribution into 100 parts of equal
frequency.
Range: The difference between the maximum and the minimum values of
the observations.
Semi inter-quartile range: The difference between the third quartile and
first quartile divided by 2.
3. For the distribution shown in table 4.62, find the median and mode.
Table 4.62: Distribution Data for Terminal Question 4
% Marks 0 - 10 10 - 20 20 - 30 30 - 40 40 – 50 50 - 60 60 - 70
No. of 4 9 19 20 18 7 8
Smokers
5. Find the harmonic mean of the following distribution given in table 4.64.
Table 4.64: Distribution Data for Terminal Question 6
6. Given that, sum of upper and lower quartiles is 122 and their difference
is 23; find the quartile deviation of the series.
7. If Coefficient of Variation = 22 and S.D. = 4, find the mean.
8. The table 4.65 shows the distribution of age at the time of first delivery
of 65 women. Find mean deviation from mean and median.
Table 4.65: Distribution of Age at the Time of First Delivery of 65 Women
Age 18 – 22 22 – 26 26 – 30 30 – 34 34 – 38
Frequency 20 30 11 3 1
9. Read the data given below and find the combined mean, S.D. and
coefficient of variation.
$n_1 = 15$, $n_2 = 20$
$\bar{X}_1 = 40$, $\bar{X}_2 = 50$
$\sigma_1 = 3$, $\sigma_2 = 5$
10. Mean and Standard deviation of lengths of tails of 8 rats were found to
be 4.7 cm and 0.8 cm respectively. However, one reading was taken as
3.6 cm instead of 6.3 cm; find the corrected mean and standard
deviation.
4.16 Answers
6. iii) median
7. ii) coincide
8. i) Mode = 3 Median – 2 Mean
9. ii) 24
10. ii) mean=31.35
11. i) AM>GM>HM
12. i) True, ii) True, iii) False
13. i) True, ii) True
14. i) True, ii) False, iii) False, iv) True
Terminal Questions
1. Rs. 84.69
2. Mean = 31.64, Standard Deviation = 13.36, Coefficient of Variation
=42.225
3. Median = 35.25, Mode = 33.33
4. 116.7 cm
5. 123.33
6. 11.5
7. 18.18
8. 2.462
9. Combined Mean = 45.7
Combined S.D = 6.53,
Coefficient of Variation = 14.29
10. Corrected Mean = 5.0375 cm
Corrected S.D = 0.8336 cm
interested in learning about how a new credit card payment option was
related to the customer’s purchase amounts.
Managerial report
Use the methods of descriptive statistics to summarise the sample data.
Provide summaries of the dollar purchase amounts for cash customers,
personal check customers and credit card customers separately. Your
report should contain the following summaries and discussion:
1. A comparison and interpretation of mean and median.
2. A comparison and interpretation of measures of variability such as range
and standard deviation.
3. The identification and the interpretation of the five-number summaries
for each method of payment.
Use the summary section of your report to provide a discussion of what you
have learnt about the method and amount of payments for customers of
Consolidated Foods.
Purchase amount and method of payment for a sample of 16 Consolidated
Foods’ customers are:
Table 4.66
Customer Amount ($) Method of Payment
1 28.58 Check
2 52.04 Check
3 7.41 Cash
4 11.71 Cash
5 43.79 Credit Card
6 48.95 Check
7 57.59 Check
8 27.60 Check
9 26.91 Credit Card
10 9.00 Cash
11 18.09 Cash
12 54.84 Check
13 41.10 Check
14 43.14 Check
15 3.31 Cash
16 69.77 Credit Card
Case Study 2
All insurance companies, offering unit linked insurance policies, charge
certain amount of money by way of meeting initial expenses. However, the
percentages of such expenses ratio vary from company to company. The
following table gives the expenses ratio including allocation charges for
some companies and for various maturity periods.
HDFC unit Bajaj Tata AIG Kotak Safe ICICI SBI Unit Birla Sun
linked Allianz Invest Investment Prulife Plus II life
Endowment New Assurell Plan Time Regular Premier
Unit Super
Gain Regular
1 year 6.4 4.8 7.4 3.7 5.4 5.6 3.5
10 years 2.7 2.9 3.6 2.4 3.5 2.9 2.2
15 years 1.8 2.4 2.8 2.1 3.0 2.3 1.9
20 years 1.5 2.2 2.4 1.9 2.8 2.0 1.7
25 years 1.3 2.1 2.2 1.8 2.6 1.9 1.6
30 years 1.2 2.0 2.1 1.7 2.6 1.8 1.5
E-References:
http://www.textbooksonline.tn.nic.in/Books/11/Stat-EM/Chapter-3.pdf
5.1 Introduction
In the previous unit, you studied the measures of central tendency and
the measures of dispersion. In this unit, you will study how probability
theory is used to measure the uncertainty involved in our day-to-day lives.
Every human activity has an element of uncertainty. Uncertainty affects the
decision making process. In your daily life, you very often use the word
‘probably’, like, probably it may rain today; probably the share price may go
up in the next week. Therefore, there is a need to handle uncertainty
systematically and scientifically.
Key statistic
The probability of event A [denoted P(A)], must lie within the interval from
0 to 1.
b) Random experiment
When the outcome of an experiment cannot be predicted with certainty, then
it is called random experiment or stochastic experiment.
There are two types of experiments. They are –
(i) Deterministic experiment and (ii) Random experiment.
A deterministic experiment, when repeated under the same conditions,
results in the same outcome. It has a unique outcome.
Random experiment is an experiment which may not result in the same
outcome when repeated under the same conditions. It is an experiment
which does not have a unique outcome.
Example 1
The experiment of 'toss of a coin' is a random experiment. It is so
because when a coin is tossed the result may be 'Head' or it may be
'Tail'.
Example 2
The experiment of 'drawing a card randomly from a pack of playing
cards' is a random experiment. Here, the result of the draw may be any
one of the 52 cards.
c) Sample space
The set of all possible outcomes of a random experiment is the sample
space. The sample space is denoted by S. The outcomes of the random
experiment (elements of the sample space) are called sample points or
outcomes or cases.
A sample space with finite number of outcomes is a finite sample space. A
sample space with infinite number of outcomes is an infinite sample space.
Example 3
In tossing a coin, the outcomes are head and tail. The head is denoted as ‘H’
and the tail as ‘T’. In tossing a coin, the sample space ‘S’ is given by:
S = {H, T}. In tossing two coins, the sample space ‘S’ is given by:
S = {HH, HT, TH, TT}
Example 4
While throwing a die, the sample space is S = {1, 2, 3, 4, 5, 6}. This is a
finite sample space
Example 5
Consider the toss of a coin successively until a head is obtained. Let the
number of tosses be noted. Here, the sample space is S= {1, 2, 3, 4....}.
This is an infinite sample space
Key statistic
If the number of outcomes is finite then it is called as finite sample space,
otherwise it is called as an infinite sample space.
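The finite sample spaces in Examples 3 and 4 can be enumerated with a short Python sketch. It is included only as an illustration; the variable names are assumptions, and `itertools.product` from the standard library is used to list the ordered outcomes of repeated tosses.

```python
from itertools import product

# Sample space for one coin and for two coins (ordered outcomes)
one_coin = ["H", "T"]
two_coins = ["".join(outcome) for outcome in product("HT", repeat=2)]

# Sample space for one throw of a die
die = list(range(1, 7))

print(one_coin)
print(two_coins)
print(die, len(die))
```

The infinite sample space of Example 5 cannot be listed in full in this way; only its first few sample points could be generated.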
d) Event
An event is a subset of the sample space. Events are denoted by A, B, C, etc.
An event which does not contain any outcome is a null event (impossible
event). It is denoted by Φ. An event which has only one outcome is an
elementary event or simple event. An event which has more than one
outcome is a compound event. An event which contains all the outcomes is
equal to the sample space and is called the sure event or certain event.
Example 6
While throwing a die, A= {2, 4, 6} is an event. It is the event that the throw
results in an even number. Here, A is a compound event.
Example 7
While tossing two coins, A= {TT} is an event. It is the event that the toss
results in two tails. Here, A is a simple event.
The outcomes which belong to an event are said to be favourable to that
event. The event happens whenever the experiment results in a favourable
outcome. Otherwise, the event does not happen
While throwing a die, the event A = {2, 4, 6} has three favourable
outcomes, namely, 2, 4 and 6. When the throw results in 2, 4 or 6, event A
occurs.
Example 8
While tossing a fair coin, the outcomes 'Head' and 'Tail' are equally likely.
Example 9
While throwing a fair die, the events A={2,4,6}, B = {1,3, 5} and C={ 1,2, 3}
are equally likely.
A sample space is called an equiprobable space if the outcomes are
equally likely. For instance, the sample space S = {1, 2, 3, 4, 5, 6} of throw
of a fair die is equiprobable space because the six outcomes are equally
likely.
Example 10
In tossing an unbiased coin, getting head and tail are equally likely.
Example 14
While throwing a die, the six outcomes together are exhaustive. But here,
if any one of these outcomes is left out, the remaining five outcomes are
not exhaustive.
Example 15
While throwing a die, events A = {2, 4, 6}, B = {3, 6} and C = {1, 5, 6}
together are exhaustive.
h) Complementation of an event
Let A be an event. Then, complement of A is the event of non-occurrence of
A. It is the event constituted by the outcomes which are not favourable to A.
The complement of A is denoted by A′ or Ā or A^c.
The complement of an event is the event that consists of those possible
outcomes that are not outcomes of the given event.
While throwing a die, if A = {2, 4, 6}, its complement is A′ = {1, 3, 5}. Here, A
is the event that the throw results in an even number. A′ is the event that the
throw does not result in an even number, i.e., A′ is the event that the throw
results in an odd number.
i) Independent events
Two events are said to be independent of each other if the occurrence of
one is not affected by the occurrence of other or the occurrence of one does
not affect the occurrence of the other.
Example 16
Consider tossing of three fair coins as shown in figure 5.2. Then,
S = { HHH, HHT, HTH, THH, TTH, THT, HTT, TTT}
Let:
A be the event of getting three heads
B be the event of getting two heads
C be the event of getting one head
D be the event of not getting a head
Example 17
While tossing two coins simultaneously, let A = {HH} and B = {TT} be two
events. Then, their union is A ∪ B = {HH, TT}. Here, A is the event of
occurrence of two heads and B is the event of occurrence of two tails.
Example 18
While throwing a die, let A = {2, 4, 6}, B = {3, 6} and C = {4, 5, 6} be three
events. Then, their union is A ∪ B ∪ C = {2, 3, 4, 5, 6}.
l) Intersection of events
Intersection of two or more events is the event of simultaneous occurrence
of all these events. Thus, intersection of two events A and B is the event of
occurrence of both of them. The intersection of A and B is denoted by A∩B
or AB or (A and B).
Example 19
While tossing two coins, let A = {HH, TT}, B = {HH, HT, TH} be two
events. Then, their intersection is A∩B = {HH}.
Example 20
While throwing a die, let A = {2, 4, 6}, B = {3, 6} and C = {4, 5, 6} be three
events. Then, their intersection is A∩B∩C = {6}.
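Union, intersection and complement of events correspond directly to set operations, which can be sketched in Python as follows. This is an illustrative sketch using the die events from the examples above; the variable names are assumptions.

```python
# Sample space for one throw of a die and the three events of the examples
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {3, 6}
C = {4, 5, 6}

union_abc = A | B | C            # at least one of A, B, C occurs
intersection_abc = A & B & C     # A, B and C all occur
complement_a = S - A             # A does not occur

print(sorted(union_abc), sorted(intersection_abc), sorted(complement_a))
```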
Solution
The bag has a total of 9 balls. Since the ball drawn can be any one of them,
there are 9 equally likely, mutually exclusive and exhaustive outcomes. Let
events A, B and C be
A: selected ball is white
B: selected ball is non-white
C: selected ball is white or green
(i) There are 3 white balls in the bag. Therefore, out of the 9 outcomes, 3
are favourable to event A.
∴P [white ball] = P (A) = 3/9 = 1/3
(ii) Event B is the complement of event A. Therefore,
∴ P[non-white ball] = P(B) = 1 - P(A) = 1 – 1/3 = 2/3
(iii) There are 3 white balls and 2 green balls in the bag. Therefore, out of
9 outcomes, 5 are either white or green.
∴ P[white ball or green ball] = P(C) = 5/9
Solved Problem 4
One card is drawn from a well-shuffled pack of playing cards. Find the
probability that the card drawn (i) is a heart (ii) is a king (iii) belongs to red
suit (iv) is a king or a queen (v) is a king or a heart.
Solution
A pack of playing cards has 52 cards. There are four suits, namely, spade,
club, heart and diamond (dice). In each suit, there are thirteen
denominations - ace (1), 2, 3,…, 10, jack (knave), queen and king.
A card selected at random may be any one of the 52 cards. Therefore, there
are 52 equally likely, mutually exclusive and exhaustive outcomes. Let
events A, B, C, D and E be —
A: selected card is a heart
B: selected card is a king
C: selected card belongs to a red suit
D: selected card is a king or a queen
E: selected card is a king or a heart
(i) There are 13 hearts in a pack. Therefore, 13 outcomes are favourable
to event A.
∴ P [Heart] = P(A) =13/52 = 1/4
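The classical approach used in Solved Problem 4 (favourable outcomes divided by the total number of equally likely outcomes) can be sketched in Python for all five events. This is only an illustrative sketch: the deck representation and the helper name `probability` are assumptions, and `fractions.Fraction` is used so that the results stay as exact fractions.

```python
from fractions import Fraction
from itertools import product

suits = ["spade", "club", "heart", "diamond"]
ranks = ["ace"] + [str(n) for n in range(2, 11)] + ["jack", "queen", "king"]
deck = list(product(ranks, suits))       # 52 equally likely (rank, suit) outcomes


def probability(is_favourable):
    """Classical probability: favourable outcomes / total outcomes."""
    favourable = sum(1 for card in deck if is_favourable(card))
    return Fraction(favourable, len(deck))


p_heart = probability(lambda c: c[1] == "heart")
p_king = probability(lambda c: c[0] == "king")
p_red = probability(lambda c: c[1] in ("heart", "diamond"))
p_king_or_queen = probability(lambda c: c[0] in ("king", "queen"))
p_king_or_heart = probability(lambda c: c[0] == "king" or c[1] == "heart")

print(p_heart, p_king, p_red, p_king_or_queen, p_king_or_heart)
```

The king of hearts is counted only once in the last event, which is why that event has 16 favourable outcomes rather than 17.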
Solution
A: selected number is even
B: selected number is a multiple of 3
(i) Four of the selections, namely, 2, 4, 6 and 8 are favourable to event A
∴ P [even number] = P(A) = 4/8 = 1/2
(ii) Two of the selections, namely, 3 and 6 are favourable to event B
∴ P[multiple of 3] = P(B) = 2/8 = 1/4
2) Statistical / relative frequency / empirical / posteriori approach
Under this approach, the probability of an event is arrived at after conducting
an experiment. If we want to know the probability that a particular household
in an area will have two earning members, then we have to gather data on
all households in that area and then arrive at the probability. The greater the
number of households surveyed, the greater will be the accuracy of the
probability arrived at.
iii) Axiom of addition, $P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i)$, where $A_1, A_2, \ldots, A_n$ are
mutually exclusive events.
Example 19
A sales manager may like to know the probability that he will exceed the
target for product A or product B. Sometimes, he would like to know the
probability that the sales of product A and B will exceed the target. The
first type of probability is answered by addition rule. The second type of
probability is answered by multiplication rule.
5.2.1 Addition rule
The addition rule of probability states that:
i) If ‘A’ and ‘B’ are any two events then the probability of the occurrence
of either ‘A’ or ‘B’ is given by:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
ii) If ‘A’ and ‘B’ are two mutually exclusive events then the probability of
occurrence of either ‘A’ or ‘B’ is given by:
P(A ∪ B) = P(A) + P(B)
iii) If ‘A’, ‘B’ and ‘C’ are any three events then the probability of occurrence
of either ‘A’ or ‘B’ or ‘C’ is given by:
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) - P(A ∩ B) - P(B ∩ C) - P(A ∩ C) + P(A ∩ B ∩ C)
In terms of Venn diagram, from the figure 5.3, we can calculate the
probability of occurrence of either event ‘A’ or event ‘B’, given that event ‘A’
and event ‘B’ are dependent events. From the figure 5.4, we can calculate
the probability of occurrence of either ‘A’ or ‘B’, given that, events ‘A’ and ‘B’
are independent events. From the figure 5.5, we can calculate the
probability of occurrence of either ‘A’ or ‘B’ or ‘C’, given that, events ‘A’, ‘B’
and ‘C’ are dependent events.
Fig. 5.3: A ∪ B for Two Dependent Events A and B
Fig. 5.4: A ∪ B for Two Independent Events A or B
Fig. 5.5: A ∪ B ∪ C for Three Dependent Events A, B and C
iv) If A1, A2, A3, ..., An are ‘n’ mutually exclusive and exhaustive events
then the probability of occurrence of at least one of them is given by:
P(A1 ∪ A2 ∪ ... ∪ An) = P(A1) + P(A2) + ... + P(An).
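Rule (i) can be checked by direct enumeration. The sketch below builds a standard 52-card pack and compares both sides of the addition rule for the events "king" and "heart" (event E of Solved Problem 4). The names deck, prob, kings and hearts are illustrative choices, not part of the text.

```python
from fractions import Fraction
from itertools import product

# Build a 52-card pack as (denomination, suit) pairs.
suits = ["spade", "club", "heart", "diamond"]
ranks = ["ace"] + [str(n) for n in range(2, 11)] + ["jack", "queen", "king"]
deck = set(product(ranks, suits))

def prob(event):
    """Classical probability: favourable outcomes / total outcomes."""
    return Fraction(len(event), len(deck))

kings = {card for card in deck if card[0] == "king"}
hearts = {card for card in deck if card[1] == "heart"}

# Left side: P(A or B) counted directly from the union of the two events.
lhs = prob(kings | hearts)
# Right side: P(A) + P(B) - P(A and B), as stated in rule (i).
rhs = prob(kings) + prob(hearts) - prob(kings & hearts)

print(lhs, rhs)  # both sides equal 4/13
```

Because the king of hearts belongs to both events, simply adding P(king) and P(heart) would count it twice; subtracting P(A ∩ B) removes the double count.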
P(A/B) = P(A ∩ B) / P(B)
P(B/A) = P(A ∩ B) / P(A)
For any bivariate distribution, there exist two marginal distributions and
‘m + n’ conditional distributions, where ‘m’ and ‘n’ are the number of
classifications/characteristics studied on two variables.
Example 20
Consider the example of a librarian who analysed the type of visitors and
their choice of library section. The data is represented in table 5.1
Table 5.1: Bivariate Distribution
Type of visitors                           Sections
(Level of education)   News Paper   Magazine   Novel (story)   Subject   Total
Under Graduates            50          100          120            50      320
Graduates                  70           90           50           100      310
Post Graduates            100           60           30           150      340
Total                     220          250          200           300      970
iii) The table 5.1c represents the distribution of people in sections given that
they are under graduate. Therefore, it is a conditional distribution.
Table 5.1c: Conditional Distribution
Level of education   News paper   Magazine   Novels   Subjects   Total
Under graduate           50          100       120       50       320
Thus, for any bivariate distributions having ‘m’ and ‘n’ classifications there
exist two marginal distributions and ‘m + n’ conditional distributions. In this
case there are 3 + 4 = 7 conditional distributions.
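The marginal and conditional distributions of Table 5.1 can be derived mechanically from the cell counts. The following is a minimal sketch; the dictionary layout and the variable names are assumptions made for illustration.

```python
from fractions import Fraction

sections = ["News Paper", "Magazine", "Novel", "Subject"]
# Cell counts of Table 5.1, one row per level of education.
table = {
    "Under Graduates": [50, 100, 120, 50],
    "Graduates": [70, 90, 50, 100],
    "Post Graduates": [100, 60, 30, 150],
}

grand_total = sum(sum(row) for row in table.values())

# The two marginal distributions: row totals and column totals.
row_marginal = {level: sum(row) for level, row in table.items()}
col_marginal = {
    name: sum(row[j] for row in table.values())
    for j, name in enumerate(sections)
}

# One of the conditional distributions: section chosen, given "Under Graduates".
ug = table["Under Graduates"]
cond_given_ug = {name: Fraction(n, sum(ug)) for name, n in zip(sections, ug)}

# m + n conditional distributions: one per row plus one per column.
n_conditionals = len(table) + len(sections)

print(row_marginal, col_marginal, grand_total)
print(cond_given_ug, n_conditionals)
```

Dividing each count by its row total turns the conditional frequencies of Table 5.1c into conditional probabilities, for example P(Novel / Under Graduate) = 120/320 = 3/8.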
Solved problem 9
Calculate nCr for the following values of ‘n’ and ‘r’:
i. n = 10 and r = 2
ii. n =16 and r = 3
Solution
i. 10C2 = (10 × 9) / (1 × 2) = 45
ii. 16C3 = (16 × 15 × 14) / (1 × 2 × 3) = 560
Key statistic
nCr = nC(n-r)
nC0 = nCn = 1
0! = 1
Solved Problem 10
Calculate 16C13.
Solution
16C13 = 16C(16-13) = 16C3 = 560
The value of 16C13 is 560.
Solved Problem 11
Find the probability of getting a head when a coin is tossed.
Solution
Let ‘A’ be the event of getting a head.
S = {H, T}, n(S) = 2
n(A) = 1
∴ P(A) = n(A) / n(S) = 1/2
Therefore, the probability of getting at least two heads when three coins are
tossed is 1/2.
Solved Problem 13
What is the probability of getting a sum of at least ‘nine’ when two dice are
thrown?
Solution
Let ‘A’ be the event of getting a sum of at least ‘nine’.
S = { (1,1), (1,2), (1,3), (1,4), (1,5), (1,6),
(2,1), (2,2), (2,3), (2,4), (2,5), (2,6),
(3,1), (3,2), (3,3), (3,4), (3,5), (3,6),
(4,1), (4,2), (4,3), (4,4), (4,5), (4,6),
(5,1), (5,2), (5,3), (5,4), (5,5), (5,6),
(6,1), (6,2), (6,3), (6,4), (6,5), (6,6) }
n(S) = 6^2 = 36
A is the event of combination of mutually exclusive events of getting a sum 9
or 10 or 11 or 12.
A = {(6,3), (3,6), (5,4), (4,5), (6,4), (4,6), (5,5), (6,5), (5,6), (6,6)}, n(A) = 10
∴ P(A) = n(A) / n(S) = 10/36 = 5/18
Therefore, the probability of getting at least a sum of ‘nine’ when two dice
are thrown is 5/18.
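The count of favourable outcomes can be confirmed by listing the sample space programmatically. This is a small sketch; the variable names are illustrative. It also shows, for comparison, the probability of a sum of exactly nine, which is a different event from the one used in the solution (a sum of 9, 10, 11 or 12).

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of throwing two dice.
sample_space = list(product(range(1, 7), repeat=2))

at_least_nine = [o for o in sample_space if sum(o) >= 9]
exactly_nine = [o for o in sample_space if sum(o) == 9]

p_at_least_nine = Fraction(len(at_least_nine), len(sample_space))
p_exactly_nine = Fraction(len(exactly_nine), len(sample_space))

print(len(at_least_nine), p_at_least_nine)  # 10 outcomes, probability 5/18
print(len(exactly_nine), p_exactly_nine)    # 4 outcomes, probability 1/9
```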
Solved Problem 15
A number is selected at random from the numbers 1 to 30. What is the
probability that:
i. It is divisible by either 3 or 7
ii. It is divisible by 5 or 13
Solution
i) Let ‘A’ be the event of selecting a number divisible by 3. Let ‘B’ be the
event of selecting a number divisible by 7.
n(S) = 30C1 = 30
A = {3, 6, 9, 12, 15, 18, 21, 24, 27, 30}
n(A) = 10
B = {7, 14, 21, 28}
n(B) = 4
A ∩ B = {21}, n(A ∩ B) = 1
= 6/30 + 2/30 = 8/30 = 4/15
Therefore, the probability that a number is divisible by 5 or 13 is 4/15.
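Both parts of this problem can be verified by counting the favourable numbers directly. The helper name p_divisible_by_either is hypothetical and used only for this sketch.

```python
from fractions import Fraction

numbers = range(1, 31)  # the numbers 1 to 30

def p_divisible_by_either(a, b):
    """Probability that a randomly selected number is divisible by a or by b."""
    favourable = [n for n in numbers if n % a == 0 or n % b == 0]
    return Fraction(len(favourable), len(numbers))

p_3_or_7 = p_divisible_by_either(3, 7)
p_5_or_13 = p_divisible_by_either(5, 13)

print(p_3_or_7)   # 13/30, i.e. (10 + 4 - 1)/30 by the addition rule
print(p_5_or_13)  # 4/15, i.e. (6 + 2)/30 since no number up to 30 is divisible by both
```

In part (i) the number 21 is divisible by both 3 and 7, so it is subtracted once; in part (ii) the two events are mutually exclusive and the probabilities are simply added.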
Solved Problem 16
The Board of Directors of a company wants to form a quality management
committee to monitor quality of their products. The company has 5
scientists, 4 engineers and 6 accountants. Find the probability that the
committee will contain 2 scientists, 1 engineer and 2 accountants?
Solution
Let ‘A’ be the event of selecting 2 scientists, 1 engineer and 2 accountants.
Then,
n(S) = 15C5 = (15 × 14 × 13 × 12 × 11) / (1 × 2 × 3 × 4 × 5) = 3003
n(A) = 5C2 × 4C1 × 6C2 = (5 × 4) / (1 × 2) × 4 × (6 × 5) / (1 × 2) = 10 × 4 × 15 = 600
∴ P(A) = n(A) / n(S) = 600/3003
Therefore, the probability that the committee will contain 2 scientists,
1 engineer and 2 accountants is 600/3003.
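The same counts follow from Python's built-in combination function. The sketch assumes, as the solution does, a committee of five chosen from all fifteen employees; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

# Total ways to choose 5 members from 5 + 4 + 6 = 15 employees.
n_total = comb(15, 5)

# Ways to choose 2 of 5 scientists, 1 of 4 engineers and 2 of 6 accountants.
n_favourable = comb(5, 2) * comb(4, 1) * comb(6, 2)

p_committee = Fraction(n_favourable, n_total)

print(n_total, n_favourable)  # 3003 600
print(p_committee)            # 200/1001, the reduced form of 600/3003
```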
Solved Problem 17
The odds favouring the event of a person hitting a target are 3 to 5. The
odds against the event of another person hitting the target are 3 to 2. If each
of them fire once at the target, find the probability that:
i) Both of them hit the target
ii) At least one of them hit the target
Solution
i) Let ‘A’ be event of first person hitting a target. Odds in favour means,
∴ P(A) = 3 / (3 + 5) = 3/8 (1st ratio)
Let ‘B’ be event of second person hitting a target. Odds against means,
∴ P(B) = 2 / (3 + 2) = 2/5 (2nd ratio)
Both hitting the target means A ∩ B, and A and B are independent
∴ P(A ∩ B) = P(A) P(B) = 3/8 × 2/5 = 6/40 = 3/20
Therefore, the probability that both persons hit the target is 3/20.
ii) Let ‘A’ be the event of the first person hitting the target. Therefore,
P(A) = 3/8
Let ‘B’ be the event of the second person hitting the target. Therefore,
P(B) = 2/5
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
= 3/8 + 2/5 - 3/20 = (15 + 16 - 6) / 40 = 25/40 = 5/8
Therefore, the probability that at least one of the persons hit the target
is 5/8.
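The conversion from odds to probabilities used above can be written as two small helper functions. Their names are hypothetical and chosen only for this sketch; independence of the two shooters is assumed, as in the solution.

```python
from fractions import Fraction

def prob_from_odds_in_favour(a, b):
    """Odds of a to b in favour of an event give P = a / (a + b)."""
    return Fraction(a, a + b)

def prob_from_odds_against(a, b):
    """Odds of a to b against an event give P = b / (a + b)."""
    return Fraction(b, a + b)

p_a = prob_from_odds_in_favour(3, 5)   # first person hits
p_b = prob_from_odds_against(3, 2)     # second person hits

p_both = p_a * p_b                     # multiplication rule, independent events
p_at_least_one = p_a + p_b - p_both    # addition rule

print(p_a, p_b, p_both, p_at_least_one)  # 3/8 2/5 3/20 5/8
```

The same answer for part (ii) is obtained from the complement: 1 - P(both miss) = 1 - (5/8)(3/5) = 5/8.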
Solved Problem 18
The probabilities that drivers A, B and C will drive home safely after
consuming liquor are 2/5, 3/7 and 3/4, respectively. What is the probability
that all of them will drive home safely after consuming liquor?
Solution
Let ‘A’ be the event of driver ‘A’ driving safely after consuming liquor. Let ‘B’
be the event of driver ‘B’ driving safely after consuming liquor. Let ‘C’ be the
event of driver ‘C’ driving safely after consuming liquor.
Given P(A) = 2/5, P(B) = 3/7, P(C) = 3/4
The events A, B, and C are independent. Therefore,
P(A ∩ B ∩ C) = P(A) P(B) P(C)
= 2/5 × 3/7 × 3/4 = 18/140 = 9/70
Therefore, the probability that all the drivers will drive home safely after
consuming liquor is 9/70.
Solved Problem 19
The probabilities that ‘A’ and ‘B’ will tell the truth are 2/3 and 4/5
respectively. What is the probability that:
i) They agree with each other
ii) They contradict each other while giving a testimony in the court.
Solution
i) Let ‘A’ be the event of A telling truth. Let ‘B’ be the event of B telling
truth.
Given P(A) = 2/3, P(A^c) = 1 - P(A) = 1 - 2/3 = 1/3
P(B) = 4/5, P(B^c) = 1 - P(B) = 1 - 4/5 = 1/5
Both will agree if both tell the truth or both lie, that is,
A ∩ B or A^c ∩ B^c
These two cases are mutually exclusive, and the events A and B are
independent. Therefore,
P(A ∩ B) + P(A^c ∩ B^c) = P(A) P(B) + P(A^c) P(B^c)
= 2/3 × 4/5 + 1/3 × 1/5 = 9/15 = 3/5
Therefore, the probability that both A and B agree with each other is 3/5.
ii) They will contradict if A tells truth and B tells lies or B tells truth and A
tells lies.
A ∩ B^c or A^c ∩ B
These two cases are mutually exclusive, and A and B are independent.
Therefore,
P(A ∩ B^c) + P(A^c ∩ B) = P(A) P(B^c) + P(A^c) P(B)
= 2/3 × 1/5 + 1/3 × 4/5 = 6/15 = 2/5
Therefore, the probability that A and B contradict each other is 2/5.
Solved Problem 20
A box contains five red and four blue similar shaped balls. Two balls are
drawn at random from the box. Find the probability that both of them are red
if:
Solution
A ball drawn from box I and transferred to box II could be either red or blue.
Let ‘A’ be the event of drawing a red ball from box I. Let ‘B’ be the event of
drawing a blue ball from box I. Let ‘C’ be the event of drawing red ball from
box II.
The required events are A ∩ C or B ∩ C.
The event ‘B’ is made up of four mutually exclusive and exhaustive events.
P(B) = Σ P(Ai ∩ B), i = 1, 2, 3, 4 …….(1) [by using the law of marginal probability]
We know that:
P(A1/B) = P(A1) P(B/A1) / Σ P(Ai ∩ B)
(by substituting (1) in the denominator and (3) in the numerator)
In general, Bayes’ theorem states that if A1, A2, ..., An are ‘n’
mutually exclusive and exhaustive events with prior probabilities
P(A1), P(A2), ..., P(An) respectively, and ‘B’ is an event for which the
conditional probabilities of B given A1, B given A2, ..., B given An are
P(B/A1), P(B/A2), ..., P(B/An) respectively, then the posterior probability
of occurrence of A1, given that ‘B’ has already occurred, is given by:
P(A1/B) = P(A1) P(B/A1) / Σ P(Ai) P(B/Ai), where the sum runs over i = 1 to n
A1   0.4    0.10   0.0400   0.0400/0.1425 = 0.2807
A2   0.35   0.15   0.0525   0.0525/0.1425 = 0.3684
A3   0.25   0.20   0.0500   0.0500/0.1425 = 0.3509
The probability that Mr Anand introduces new product by becoming the Vice
president is 0.3684.
Solved problem 25
A factory has three machines M1, M2 and M3. They produce 4000, 10,000
and 6,000 products per day. From past records, it is known that M1, M2, and
M3 produce 5%, 4%, and 8% defectives. A product is selected at random
from the day’s production and is found to be defective. What is the
probability that it was not produced by machine M3?
Solution
Let us consider the following:
Let ‘A1’ be the event that the product was produced by M1
Let ‘A2’ be the event that the product was produced by M2
Let ‘A3’ be the event that the product was produced by M3
Let ‘B’ be the event that the product is defective
Then, we are given:
P(A1) = 4000/20000 = 0.20
P(A2) = 10000/20000 = 0.5
P(A3) = 6000/20000 = 0.3
P(B/A1) = 0.05 P(B/A2) = 0.04 P(B/A3) = 0.08
The above information is represented in table 5.4.
Table 5.4: Required Probabilities for the Data in Solved Problem 25
Event   Prior         Conditional   Joint          Posterior
Ai      Probability   Probability   Probability    Probability
        P(Ai)         P(B/Ai)       P(Ai ∩ B)      P(Ai/B)
A1      0.2           0.05          0.010          0.010/0.054 = 0.1852
A2      0.5           0.04          0.020          0.020/0.054 = 0.3704
A3      0.3           0.08          0.024          0.024/0.054 = 0.4444
Total   1.00                        P(B) = 0.054   1.0000
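The columns of Table 5.4 can be reproduced with a few lines of Python. This is a minimal sketch using exact fractions; the function name posterior and the variable names are assumptions made for illustration. The last value answers the question as posed (the item was not produced by M3), taken as the complement of P(A3/B).

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return (posterior probabilities, P(B)) using Bayes' theorem."""
    joints = [p * l for p, l in zip(priors, likelihoods)]  # P(Ai and B)
    p_b = sum(joints)                                       # law of marginal probability
    return [j / p_b for j in joints], p_b

daily_output = [4000, 10000, 6000]  # machines M1, M2, M3
priors = [Fraction(n, sum(daily_output)) for n in daily_output]
defect_rates = [Fraction(5, 100), Fraction(4, 100), Fraction(8, 100)]

post, p_defective = posterior(priors, defect_rates)
p_not_m3 = 1 - post[2]

print(float(p_defective))                  # 0.054
print([round(float(p), 4) for p in post])  # [0.1852, 0.3704, 0.4444]
print(round(float(p_not_m3), 4))           # 0.5556
```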
(iii) Σ P(Xi) = 1
Example 22
Let ‘X’ denote the number of heads obtained, while tossing two fair coins.
Then, X is a random variable which takes the values 0, 1 and 2 with
respective probabilities ¼, ½ and ¼. Here, X is a discrete random variable.
Example 23
Let ‘X’ denote the number obtained while throwing a fair die. Then, ‘X’ is a
discrete random variable taking values 1, 2, 3, 4, 5 and 6 with probability 1/6
each.
Example 24
Let ‘X’ denote the weight of apples. Then, ‘X’ is a continuous random
variable.
For example, let us consider the tossing of three coins. The table 5.5
displays the probabilities of getting heads when three coins are tossed.
Table 5.5: Probabilities of Getting Heads when Three Coins are Tossed
No. of Heads (X)    P(X)
3                   ⅛
2                   ⅜
1                   ⅜
0                   ⅛
Total               1
Σ P(Xi) = 1
Where, E(X^2) = Σ Xi^2 P(Xi)
Its standard deviation is:
S.D.(X) = √Var(X) = √(E(X^2) - [E(X)]^2)
Solved Problem 26
A random variable takes the values -3, -2, 1, 0, 4, 6 with probabilities 1/12,
2/12, 3/12, 4/12, 1/12, 1/12 respectively. Find its mean (expected value)
and variance.
Solution
The table 5.6 represents the values required to calculate expectation and
variance for the data in solved problem 26.
Table 5.6: Required Values for Calculating Mean and Variance for the Data
Xi      P(Xi)   Xi P(Xi)   Xi^2 P(Xi)
-3      1/12    -3/12      9/12
-2      2/12    -4/12      8/12
1       3/12    3/12       3/12
0       4/12    0          0
4       1/12    4/12       16/12
6       1/12    6/12       36/12
Total           6/12       72/12 = 6
∴ E(X) = Σ Xi P(Xi) = 6/12 = 1/2
Var(X) = E(X^2) - [E(X)]^2 = 6 - 1/4 = 23/4
Where, E(X^2) = Σ Xi^2 P(Xi) = 72/12 = 6
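The column totals of Table 5.6 and the resulting mean and variance can be recomputed as follows. This is a sketch with illustrative variable names; exact fractions are used so that the results match the hand calculation.

```python
from fractions import Fraction
from math import sqrt

values = [-3, -2, 1, 0, 4, 6]
probs = [Fraction(k, 12) for k in (1, 2, 3, 4, 1, 1)]

# A valid probability distribution must sum to 1.
total_probability = sum(probs)

mean = sum(x * p for x, p in zip(values, probs))       # E(X)
e_x2 = sum(x * x * p for x, p in zip(values, probs))   # E(X^2)
variance = e_x2 - mean ** 2                            # Var(X) = E(X^2) - [E(X)]^2
std_dev = sqrt(float(variance))                        # S.D.(X)

print(total_probability, mean, e_x2, variance)  # 1 1/2 6 23/4
print(round(std_dev, 3))
```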
Solved Problem 28
The table 5.8 displays the distribution of random variable X. Find the
following probabilities:
i) P(Xi ≥ 3)
ii) P(Xi = 0)
iii) P(1 ≤ Xi ≤ 3)
iv) P(Xi ≥ 4)
Xi -3 -2 0 1 2 3 4 5
P(Xi) K 2K 2K 3K 3K 2K K K
Solution
Since Xi is a random variable, Σ P(Xi) = 1
K + 2K + 2K + 3K + 3K + 2K + K + K = 1
15K = 1 ∴ K = 1/15
i) P(Xi ≥ 3) = P(Xi = 3) + P(Xi = 4) + P(Xi = 5)
= 2K + K + K = 4K = 4/15
ii) P(Xi = 0) = 2K = 2/15
iii) P(1 ≤ Xi ≤ 3)
= P(Xi = 1) + P(Xi = 2) + P(Xi = 3)
= 3K + 3K + 2K = 8K = 8/15
iv) P(Xi ≥ 4) = P(Xi = 4) + P(Xi = 5)
= K + K = 2K = 2/15
Solved Problem 29
Two fair coins are tossed once. Find the mathematical expectation of the
number of heads obtained.
Solution
Let Xi denote the number of heads obtained. Then, Xi is a random variable
which takes the values 0, 1 and 2 with respective probabilities ¼, ½ and ¼,
that is,
Table 5.9
Xi 0 1 2
P(Xi) ¼ ½ ¼
∴ E(X) = Σ Xi P(Xi) = 0 × 1/4 + 1 × 1/2 + 2 × 1/4 = 1
Key Statistics
1. For a random variable Xi, the arithmetic mean is E(X) = Σ Xi P(Xi)
2. The variance is Var(X) = E(X^2) - [E(X)]^2
Where, E(X^2) = Σ Xi^2 P(Xi)
The standard deviation is the square root of the variance.
Solved Problem 30
A bag has 3 white and 4 red balls. Two balls are randomly drawn from the
bag. Find the expected number of white balls in the draw.
Solution
Let ‘Xi’ denote the number of white balls obtained in the draw. Then, Xi is a
random variable which takes the values 0, 1 and 2 with respective
probabilities –
P(0) = P[both red] = 4C2 / 7C2 = 6/21 = 2/7
P(1) = P[one white and one red] = (3C1 × 4C1) / 7C2 = 12/21 = 4/7
P(2) = P[both white] = 3C2 / 7C2 = 3/21 = 1/7
The probability distribution of X is –
Table 5.10
Xi 0 1 2
P(Xi) 2/7 4/7 1/7
∴ E(X) = Σ Xi P(Xi) = 0 × 2/7 + 1 × 4/7 + 2 × 1/7 = 6/7
≈ 1 (approximately)
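The distribution in Table 5.10 and its expectation can be generated from the combination counts. The sketch below is illustrative; the variable names white, red and drawn are assumptions.

```python
from fractions import Fraction
from math import comb

white, red, drawn = 3, 4, 2
total_ways = comb(white + red, drawn)  # 7C2 = 21

# P(k white balls) = (ways to pick k white) x (ways to pick the rest red) / total ways
pmf = {
    k: Fraction(comb(white, k) * comb(red, drawn - k), total_ways)
    for k in range(drawn + 1)
}

expected_white = sum(k * p for k, p in pmf.items())

print(pmf)             # probabilities 2/7, 4/7 and 1/7 for 0, 1 and 2 white balls
print(expected_white)  # 6/7
```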
5.7 Summary
Let us recapitulate the important concepts discussed in this unit:
Probability plays an important role in the decision-making process.
Probability is a numerical measure which indicates the chance of
occurrence of an event ‘A’. It is denoted by P(A). It is the ratio between
the favourable outcomes of an event ‘A’ (m) to the total outcomes of the
experiment (n).
When multiple events are involved in an experiment, the concerned
probabilities are calculated using addition and multiplication rules of
probability.
Bayes’ theorem relates the probability of the occurrence of an event to
the occurrence or non-occurrence of an associated event. It is an
important theorem that helps managers in business decisions.
A random variable is not a variable. It is a function. It can be discrete or
continuous.
5.8 Glossary
Equally likely events (equiprobable events): Two or more events are
equally likely if they have equal chance of occurrence.
Event: An event is a subset of the sample space.
Exhaustive set of events: A set of events is exhaustive if one or the other
of the events in the set occurs whenever the experiment is conducted.
Experiment: An operation that results in a definite outcome is called an
experiment.
Mutually exclusive events (disjoint events): Two or more events are
mutually exclusive if only one of them can occur at a time.
Activity
Problem 1
The probability that a contractor will get a plumbing contract is 2/3 and
probability that he will not get an electrical contract is 5/9. If the probability
of getting at least one of these contracts is 4/5, what is the probability that
he will get both?
Problem 2
A can solve 90 percent of the problems given in a book and B can solve
70 percent. What is the probability that at least one of them will solve a
problem selected at random?
Problem 3
The probability that a trainee will remain with a company is 0.6. The
probability that an employee earns more than Rs.10,000 per year is 0.5. The
probability that an employee is a trainee who remained with the company or
who earns more than Rs.10,000 per year is 0.7. What is the probability that a
trainee earns more than Rs.10,000 per year, given that he is a trainee
who stayed with the company?
Problem 4
Suppose that one of the three men, a politician, a bureaucrat and an
educationist will be appointed as VC of the university. The probabilities of
their appointment are respectively 0.3, 0.25 and 0.45. The probability that
these people will promote research activities if they are appointed is 0.4,
0.7 and 0.8 respectively. What is the probability that research will be
promoted by the new VC.
Problem 5
A box contains 4 green and 6 white balls; another box contains 7 green
and 8 white balls. Two balls are transferred from box 1 to box 2 and then
a ball is drawn from box 2. What is the probability that it is white?
event A: transferred balls are green
event B: transferred balls are white
event C: Among transferred balls one green and 1 white
event D: selection of a white ball from box 2
5.10 Answers
Terminal Questions
1. Refer section 5.1.4
2. 13/16
3. 0.92, Yes
4. i) 4/9, ii) 5/9
5. 3/4
6. 0.92
7. 0.28
8. 21/29
Activity Solution
Solution 1
Let, A: Contractor gets a plumbing contract
B: Contractor gets an electrical contract
Then, P(A) = 2/3, P(B^c) = 5/9 and P(A ∪ B) = 4/5
Therefore, P(B) = 1 - P(B^c) = 4/9
By addition theorem we have,
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
That is, P(A ∩ B) = P(A) + P(B) - P(A ∪ B)
Therefore, P [he gets both plumbing and electrical contract] =
P(A ∩ B) = 2/3 + 4/9 - 4/5 = 14/45
Solution 2
Event A: Student A solves the problem
Event B: Student B solves the problem.
P(at least one solves the problem) = 1 - P(none solves the problem)
= 1 - P(A^c ∩ B^c)
= 1 - P(A^c) P(B^c)   [assuming A and B work independently]
= 1 - (0.10)(0.30)
= 0.97
Solution 3
Event A: a trainee will remain with the company
Event B: a trainee earns more than Rs. 10,000.
Given P(A) = 0.6, P(B) = 0.5, P(A ∪ B) = 0.7
We need to find the probability that a trainee earns more than Rs.10,000 per
year, given that he is a trainee who stayed with the company:
P(B/A) = P(A ∩ B) / P(A) = [P(A) + P(B) - P(A ∪ B)] / P(A)
= (0.6 + 0.5 - 0.7) / 0.6 = 0.4 / 0.6 = 0.67
Solution 4
Event A: politician appointed as VC
Event B: bureaucrat appointed as VC
Event C: educationist appointed as VC
Event D: promotion of research activities
Probability that the research will be promoted by the new VC:
P(A ∩ D) + P(B ∩ D) + P(C ∩ D)
= P(D/A) P(A) + P(D/B) P(B) + P(D/C) P(C)
= (0.3)(0.4) + (0.25)(0.7) + (0.45)(0.8) = 0.655
Solution 5
Event A: transferred balls are green
Event B: transferred balls are white
Event C: among transferred balls one green and 1 white
Event D: selection of a white ball from box 2
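One possible way to carry these events through to a numerical answer is shown below. It is offered as a sketch under the stated event definitions, not as the source's own working: the two transferred balls are treated as a random draw without replacement from box 1, and the law of total probability is applied over events A, B and C. All variable names are illustrative.

```python
from fractions import Fraction
from math import comb

g1, w1 = 4, 6   # box 1: green, white
g2, w2 = 7, 8   # box 2: green, white

ways = comb(g1 + w1, 2)              # ways to pick the two transferred balls
p_a = Fraction(comb(g1, 2), ways)    # A: both transferred balls are green
p_b = Fraction(comb(w1, 2), ways)    # B: both transferred balls are white
p_c = Fraction(g1 * w1, ways)        # C: one green and one white

size2 = g2 + w2 + 2                  # box 2 holds 17 balls after the transfer

# P(D) = P(A) P(D/A) + P(B) P(D/B) + P(C) P(D/C)
p_d = (p_a * Fraction(w2, size2)
       + p_b * Fraction(w2 + 2, size2)
       + p_c * Fraction(w2 + 1, size2))

print(p_a + p_b + p_c)  # 1, so A, B and C are exhaustive
print(p_d)              # 46/85
```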
Assuming hypothetical figures, analyse the same and present the findings to
the CEO
(Source: T N Srivastava and Shailaja Rejo (2008) Statistics for Management, 5th
ed., TMH)
References:
Agarwal, B. L. (2006), Basic Statistics, Fourth Edition, New Age
International Publishers.
Anderson, David R. Sweeney, Dennis J. & Williams, Thomas A. 5th ed.,
Thomson Business Information Pvt. Ltd.
Bowerman, B. L. & O'Connell, R. T. (1996), Applied Statistics: Improving
Business Processes, Irwin.
Freedman, D., Pisani, R. and Purves, R.(1997), Statistics, 3rd ed., W. W.
Norton.
Levin, Richard I. & Rubin, David S. (2008), Statistics for Management,
Seventh Edition, PHI Learning Private Limited.
Srivastava, T. N. & Rejo, Shailaja (2008), Statistics for Management, 5th
ed., TMH.
Tanur, J. M. (2002), Statistics: A Guide to the Unknown, 4th ed.,
Brooks/Cole.
Tukey, J. W. (1977), Exploratory Data Analysis, Addison-Wesley.
Wilcox, Rand R. (2009), Basic Statistics – Understanding Conventional
Methods and Modern Insights, Oxford University Press.
E-References:
http://www.textbooksonline.tn.nic.in/Books/11/Stat-EM/Chapter-1.pdf
6.1 Introduction
In the previous unit, we studied the basic concepts of probability theory
and the application of probability rules to problems drawn from real-life
situations. We ended the unit with an introduction to random variables. In
this unit, we will discuss the probability distributions of random
variables, both discrete and continuous. Before studying this unit, you
should refresh the concept of random variables covered in the previous unit.
Individuals and corporations generate data that often resemble certain
theoretical distributions. Many characteristics of these theoretical
distributions have been derived mathematically, and we can use them for a
quick analysis of the observed distributions.
The examples of observed distributions are:
i. Number of male children in a family
ii. Number of defectives produced per production run
iii. Number of employees drawing salary in some brackets
The theoretical distributions are formed under certain assumptions. They
are classified into two types:
i. Discrete probability distributions
ii. Continuous probability distributions
The figure 6.1 depicts the two groups of theoretical distributions.
Objectives:
After studying this unit, you should be able to:
differentiate between Bernoulli process and Binomial distribution
evaluate the probabilities using the Binomial distribution
evaluate the probabilities using the Poisson distribution
analyse the probabilities using the Normal distribution
6.1.1 Relevance
Good Health, a pharmaceutical firm, set up a plant to fill bottles with
100 ml of costly medicines. The production manager observed that the filling
machine does not fill the bottles with exactly the set volume; each filling
differs from 100 ml, though by a small amount, sometimes less, sometimes
more. While drug regulation stipulates a heavy fine if a bottle is found to
have less than 100 ml, the management is concerned with the wastage of
medicines that occurs when the filling is more than 100 ml.
The dilemma for the production manager is that if he tries to reduce
wastage, he might incur the risk of being fined by the regulatory authority.
He has to decide the level at which he should set the filling machine so that
the wastage and the risk of getting penalised are minimised. This is where
the statistician comes into the picture: the issue could be resolved with the
help of a statistical distribution, to the satisfaction of the production
manager.
(Source: Srivastava T. N. & Rejo, Shailaja (2008), Statistics for Management, 5th
ed.,TMH)
number of customers arriving at a CBC during any time period and make
decisions concerning the number of ATMs needed.
(Source: Anderson, David R., Sweeney, Dennis J., & Williams, Thomas A.,
5th ed., Thomson Business Information Pvt. Ltd.)
X 1 0
P(X) p q
Example 1:
When a fair coin is tossed as shown in figure 6.2, the outcome is either
head or tail. The variable ‘X’ assumes ‘1’ or ‘0’.
Result:
If X1, X2, …Xn are independent and identically distributed Bernoulli variates
with common parameter p, their sum X = X1 + X2 + ….+ Xn is a Binomial
variate with parameters n and p.
6.3.2 Mean & variance of Bernoulli distribution
Let ‘Xi’ be a Bernoulli variate with parameter p. Then, the probability
distribution of Xi is
Xi 1 0
P(Xi) p q
E(X) = Σ Xi P(Xi) = 1 × p + 0 × q = p
E(X^2) = Σ Xi^2 P(Xi) = 1^2 × p + 0^2 × q = p
Var(X) = E(X^2) - [E(X)]^2 = p - p^2 = p(1 - p) = pq
Thus, mean of Bernoulli distribution is E(X) = p
Variance of Bernoulli distribution is Var(X) = p(1 - p) = pq
Solved Problem 1
In an interview conducted by a company, the probability that an interviewed
person is male is 2/3 and female is 1/3. Find the mean and variance of the
distribution.
Solution
Let, ‘X’ denote gender of the interviewed person. If interviewed person is male
then X takes value 1 and if interviewed person is a female X takes value 0,
with probabilities 2/3 and 1/3 respectively (i.e., p+q=2/3 +1/3=1). And X follows
Bernoulli distribution as shown in the following table:
X 1 0
P(X) 2/3 1/3
Key statistic
The mean and variance of a Bernoulli distribution are ‘p’ and ‘pq’
respectively.
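Applied to the data of Solved Problem 1 (p = 2/3, q = 1/3), the key statistic gives the mean and variance directly. The sketch below computes them both from the formulas p and pq and from the definition of expectation, so the two routes can be compared; the function name bernoulli_mean_var is hypothetical.

```python
from fractions import Fraction

def bernoulli_mean_var(p):
    """Mean p and variance pq of a Bernoulli variate with parameter p."""
    q = 1 - p
    return p, p * q

p = Fraction(2, 3)  # P(interviewed person is male), coded as X = 1
mean, variance = bernoulli_mean_var(p)

# The same quantities from the definition, using the table X = 1, 0.
values, probs = [1, 0], [p, 1 - p]
mean_def = sum(x * px for x, px in zip(values, probs))
var_def = sum(x * x * px for x, px in zip(values, probs)) - mean_def ** 2

print(mean, variance)  # 2/3 2/9
```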
Solved Problem 2
An unbiased coin is tossed six times. What is the probability that the tosses
will result in:
i) Exactly two heads
ii) At least five heads
iii) At most two heads
iv) Not greater than one head
v) Not less than five heads
vi) At least one head
Solution
Let ‘A’ be the event of getting head. Given that:
p = 1/2, q = 1/2, n = 6
Therefore, the probability that the tosses will result in exactly two
heads is 15/64.
ii) The probability that the tosses will result in at least five heads is given
by:
P(X ≥ 5) = P(X = 5) + P(X = 6)
= 6C5 (1/2)^(6-5) (1/2)^5 + 6C6 (1/2)^(6-6) (1/2)^6
= 6 (1/2)^6 + (1/2)^6 = 7/64
Therefore, the probability that the tosses will result in at least five
heads is 7/64.
iii) The probability that the tosses will result in at most two heads is given
by:
P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2)
= (1/2)^6 + 6C1 (1/2)^(6-1) (1/2)^1 + 6C2 (1/2)^(6-2) (1/2)^2
= 1/64 + 6/64 + (6 × 5)/(1 × 2) × 1/64 = 1/64 + 6/64 + 15/64 = 22/64 = 11/32
Therefore, the probability that the tosses will result in at most two
heads is 11/32.
iv) The probability that the tosses will result in not greater than one head
is given by:
vi) The probability that the tosses will result in at least one head is given
by:
Therefore, the probability that the tosses will result in at least one head
is 63/64.
The graph depicted in figure 6.3 illustrates the binomial distribution of
probability of ‘x’ number of heads occurring when a coin is tossed 6 times.
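All six parts of Solved Problem 2 follow from the binomial probability function with n = 6 and p = 1/2. The sketch below evaluates them with exact fractions; the function name binom_pmf and the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) = nCx p^x q^(n-x) for a binomial variate."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 6, Fraction(1, 2)
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

exactly_two = pmf[2]                  # i)   15/64
at_least_five = pmf[5] + pmf[6]       # ii)  7/64
at_most_two = sum(pmf[:3])            # iii) 11/32
not_greater_than_one = sum(pmf[:2])   # iv)  7/64
not_less_than_five = at_least_five    # v)   same event as ii)
at_least_one = 1 - pmf[0]             # vi)  63/64

print(exactly_two, at_least_five, at_most_two)
print(not_greater_than_one, not_less_than_five, at_least_one)
```

Because p = q = 1/2 the distribution is symmetric, which is why parts (iv) and (v) give the same value.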
Solved Problem 3
The probability that an employee will get an occupational disease is 20%. In
a firm having five employees, what is the probability that:
i) None of the employees get the disease
ii) Exactly two will get the disease
iii) More than four will contract the disease
Solution
Given that:
p = 20/100 = 0.2
q = 1 - 0.2 = 0.8
n = 5
Therefore, by binomial distribution, P(X = x) = 5Cx (0.8)^(5-x) (0.2)^x
i) The probability that none of the employees get the disease is given by:
P(X = 0) = (0.8)^5 = 0.3277
Therefore, the probability that none of the employees get the disease
is 0.3277.
ii) The probability that exactly two employees will get the disease is given
by:
Therefore, the probability that exactly two employees will get the
disease is 0.2048.
iii) The probability that more than four employees will get the disease is
given by:
P(X > 4) = P(X = 5) = (0.2)^5 = 0.00032
Therefore, the probability that more than four employees will get the
disease is 0.00032.
Solved Problem 4
The probability that a bomb dropped on a bridge, will hit the bridge is 0.5.
Eight bombs are dropped on the bridge. The bridge will be destroyed if any
two bombs fall on it. Find the probability that:
i) All bombs hit the bridge
ii) The bridge is destroyed
Solution
Let the probability that the bomb will hit the bridge be p. Given that:
p = 0.5 and n = 8
q = 1 - 0.5 = 0.5
Therefore by binomial distribution, P(X = x) = 8Cx (0.5)^(8-x) (0.5)^x
i) The probability that all the bombs hit the bridge is given by:
P(X = 8) = (0.5)^8 = (1/2)^8 = 1/256
Therefore, the probability that all the bombs hit the bridge is 1/256.
ii) Bridge is destroyed if two or more bombs fall on it. The required
probability is given by:
P(X ≥ 2) = 1 - [P(X = 0) + P(X = 1)]
= 1 - [(1/2)^8 + 8C1 (1/2)^8] = 1 - (1/256 + 8/256) = 247/256
Therefore, the probability that the bridge is destroyed is 247/256.
Type ii: Finding the expected values
Solved Problem 5
A random sample of 5 sachets of coconut oil was examined and two were
found to be leaking. A wholesaler receives six hundred and twenty five
packets, each containing 5 sachets. Find the expected number of packets to
contain exactly one sachet leaking?
Solution
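Given that:
n = 5, N = 625
Probability of leaking p is given by:
p = 2/5
q = 1 - 2/5 = 3/5
Therefore by binomial distribution, P(X = x) = 5Cx (3/5)^(5-x) (2/5)^x
P(X = 1) = 5C1 (3/5)^(5-1) (2/5)^1
= 5 × 81/625 × 2/5 = 162/625
Expected number of packets with exactly one leaking sachet
= N × P(X = 1) = 625 × 162/625 = 162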
Solved Problem 6
For a binomial distribution with n = 5 and p = 0.2.
Find:
i) P(X=3)
ii) P(X<4)
Solution
Given that:
n = 5, p = 0.2, and q = 1 - p = 0.8
npq / np = 5/4 ……………… (3)
q = 5/4
Since, q > 1, the statement: ‘The mean of a binomial distribution is 4 and its
variance is 5’, is wrong.
Solved Problem 8
The incidence of an occupational disease in an industry is such that the
workers have 25% chances of suffering from it. What is the probability that
out of 5 workers, at the most two contract that disease?
Solution
Let X: number of workers contracting the diseases among 5 workers
Then, X is a binomial variate with parameter n = 5
p = P [a worker contracts the disease] = 25/100 = 0.25
Therefore by binomial distribution,
P(X = x) = 5Cx (0.25)^x (0.75)^(5-x), x = 0, 1, 2, …, 5
The probability that at the most two workers contract the disease is
P[X ≤ 2] = P(X = 0) + P(X = 1) + P(X = 2)
= 5C0 (0.25)^0 (0.75)^5 + 5C1 (0.25)^1 (0.75)^4 + 5C2 (0.25)^2 (0.75)^3
= 0.2373 + 0.3955 + 0.2637
= 0.8965
Solved Problem 9
In a large consignment of electric lamps, 5% are defective. A random
sample of 8 lamps is taken for inspection. What is the probability that it has
one or more defectives?
Solution
Given n = 8, p = 5/100 = 0.05
X: number of defective lamps
Therefore by binomial distribution,
P(X = x) = 8Cx (0.05)^x (0.95)^(8-x), x = 0, 1, 2, ..., 8
P [sample has one or more defectives] = 1 - P [no defectives]
= 1 - P(X = 0)
= 1 - 8C0 (0.05)^0 (0.95)^8
= 1 - 0.6634 = 0.3366
Case Study 1
Vinay is the operations manager of the books section of a large
department store. He has calculated that 0.4 is the probability that a
customer who is just browsing will buy something. Suppose that six
customers browse in the books section each hour. Vinay wants to
calculate the following probabilities.
What is the probability that:
i) Exactly four browsing customers will buy something during a
specified hour
ii) At least two browsing customers will buy something during a specified
hour
iii) None of the browsing customers will buy anything during a specified
hour
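One way Vinay's three probabilities might be computed is sketched below. It assumes that the six customers decide independently and that each buys with probability 0.4, so that the number of buyers is a binomial variate with n = 6 and p = 0.4; the function and variable names are illustrative.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for a binomial variate with parameters n and p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 6, 0.4

p_exactly_four = binom_pmf(4, n, p)               # i)
p_none = binom_pmf(0, n, p)                       # iii)
p_at_least_two = 1 - p_none - binom_pmf(1, n, p)  # ii) via the complement

print(round(p_exactly_four, 4))  # 0.1382
print(round(p_at_least_two, 4))  # 0.7667
print(round(p_none, 4))          # 0.0467
```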
Key statistic
The probability distribution of a Poisson random variable ‘X’ is given by:
P(X = x) = e^(-m) m^x / x!, x = 0, 1, 2, …
The mean and the variance of the distribution are both ‘m’. Its standard
deviation is √m, and ‘m’ is called the parameter of the Poisson distribution.
Key statistic
The mean of the Poisson distribution is also given by:
m = np
where, ‘p’ is the probability of success and ‘n’ is the number of trials.
Solved Problem 10
Suppose two houses in a thousand catch fire in a year and there are 2000
houses in a village. What is the probability that:
i) None of the houses catches fire
ii) At least one house catches fire
iii) Not more than two houses catches fire
Solution
Given the probability of a house catching fire is:
p = 2/1000 = 0.002, n = 2000
The probability that the carton contains 3 or more defective bulbs is given
by:
P(X ≥ 3) = 1 - [e^(-2) 2^0 / 0! + e^(-2) 2^1 / 1! + e^(-2) 2^2 / 2!]
= 1 - e^(-2) (1 + 2 + 2)
= 1 - 0.13534 × 5 = 1 - 0.6767 = 0.3233
Therefore, the probability that the carton contains 3 or more defective bulbs
is 0.3233.
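The Poisson calculations in this section all use the same probability function, so a single helper covers them. The sketch below applies it to the data of Solved Problem 10 (m = np = 2000 × 0.002 = 4) and re-checks the carton result above with m = 2; the function name poisson_pmf is an illustrative choice.

```python
from math import exp, factorial

def poisson_pmf(x, m):
    """P(X = x) = e^(-m) m^x / x! for a Poisson variate with parameter m."""
    return exp(-m) * m ** x / factorial(x)

# Solved Problem 10: houses catching fire, m = np.
m = 2000 * 0.002
p_none = poisson_pmf(0, m)                                 # i)
p_at_least_one = 1 - p_none                                # ii)
p_at_most_two = sum(poisson_pmf(x, m) for x in range(3))   # iii)

# Carton of bulbs: 3 or more defectives with m = 2.
p_three_or_more = 1 - sum(poisson_pmf(x, 2) for x in range(3))

print(round(p_none, 4), round(p_at_least_one, 4), round(p_at_most_two, 4))
print(round(p_three_or_more, 4))  # 0.3233
```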
Solved Problem 12
On an average, there are three mistakes on a page of a book. The book
contains 200 pages. What is the probability that a randomly selected page
has exactly one mistake?
Solution
The probability function for the Poisson Distribution is
P(X = x) = e^(-m) m^x / x!, where x = 0, 1, 2, …, ∞.
Given that m = 3, the required probability is calculated as:
P(X = 1) = e^(-3) 3^1 / 1! = 0.04979 × 3 = 0.14937
Hence, the probability that a randomly selected page has exactly one
mistake is 0.14937
Solved Problem 13
From the data given in solved problem 12, how many pages would you
expect to be free from mistakes?
Solution
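Given that:
m = 3, N = 200
P(X = 0) = e^(-3) = 0.04979
Expected number of pages free from mistakes = N × P(X = 0)
= 200 × 0.04979 = 9.958, that is, about 10 pages.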
Solved Problem 14
If X is a Poisson variate such that P(X = 1) = P(X = 2), find P(X = 0).
Solution
Let ‘m’ be the parameter of the distribution, and P(X = 1) = P(X = 2)
e^(-m) m^1 / 1! = e^(-m) m^2 / 2!
m / 1 = m^2 / 2
2m = m^2, so m = 2
P(X = 0) = e^(-2) = 0.13534
Solved Problem 15
The following data relates to the number of mistakes in each page of a book
containing 400 pages.
Table 6.2: Data relating to the number of mistakes in each page of a book
No. of mistakes per page:   0     1     2    3    4   Total
No. of pages:               138   161   69   27   5   400
m = mean of X = Σfx / N
= (138 × 0 + 161 × 1 + 69 × 2 + 27 × 3 + 5 × 4) / 400
= 400/400 = 1, so m = 1
The probability function for the Poisson Distribution is
P(X = x) = e^(-m) m^x / x!, where x = 0, 1, 2, …, ∞.
P(X = x) = e^(-1) 1^x / x!
Table 6.2a: Calculation of expected frequencies
No. of mistakes   Probability function                 N     Expected frequency
per page (x)      P[X = x]                                   N × P[X = x]
0                 $e^{-1}1^{0}/0! = 0.3679$            400   400 × 0.3679 = 147.16
1                 $e^{-1}1^{1}/1! = 0.3679$            400   400 × 0.3679 = 147.16
2                 $e^{-1}1^{2}/2! = 0.1839$            400   400 × 0.1839 = 73.56
3                 $e^{-1}1^{3}/3! = 0.0613$            400   400 × 0.0613 = 24.52
4                 $e^{-1}1^{4}/4! = 0.0153$            400   400 × 0.0153 = 6.12
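The fitting steps in Tables 6.2 and 6.2a (estimate the mean from the frequency table, then multiply each Poisson probability by the total number of pages) can be sketched in Python as follows. The variable and function names are illustrative assumptions, not part of the text.

```python
import math

# Observed data from Table 6.2: mistakes per page and number of pages.
mistakes = [0, 1, 2, 3, 4]
pages = [138, 161, 69, 27, 5]

n_pages = sum(pages)
# Sample mean of the frequency table, used as the Poisson parameter m.
mean = sum(x * f for x, f in zip(mistakes, pages)) / n_pages


def poisson_pmf(x: int, m: float) -> float:
    """Return P(X = x) for a Poisson variate with mean m."""
    return math.exp(-m) * m ** x / math.factorial(x)


# Expected frequency for each x is N * P(X = x).
expected = [n_pages * poisson_pmf(x, mean) for x in mistakes]

for x, observed, exp_freq in zip(mistakes, pages, expected):
    print(x, observed, round(exp_freq, 2))
```

Because the code does not round the probabilities to four decimals before multiplying, its expected frequencies can differ from the tabulated ones in the second decimal place.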
Solved Problem 16
The average number of telephone calls booked at an exchange between
10-00 A.M. and 10-10 A.M. is 4. Find the probability that on a randomly
selected day 2 or more calls are booked between 10-00 A.M. and
10-10 A.M. On how many days of a year would you expect booking of
2 or more calls during that time gap?
Solution
Let X: number of telephone calls booked at the exchange during 10-00 A.M.
to 10-10 A.M. The mean is m=4.
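The probability function is
$$P(X = x) = \frac{e^{-m}\,m^{x}}{x!}, \quad \text{where } x = 0, 1, 2, \ldots$$
With $m = 4$:
$$P(X = x) = \frac{e^{-4}\,4^{x}}{x!}$$
P[2 or more calls] = 1 - P[less than 2 calls]
$$= 1 - [P(X = 0) + P(X = 1)] = 1 - \left[\frac{e^{-4}\,4^{0}}{0!} + \frac{e^{-4}\,4^{1}}{1!}\right]$$
$$= 1 - e^{-4}\left[\frac{4^{0}}{0!} + \frac{4^{1}}{1!}\right]$$
= 1 – 0.0183 [1 + 4]
= 1 – 0.0915 = 0.9085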
A year has 365 days. Out of these N = 365 days, the number of days on
which there will be 2 or more calls is:
N × P[2 or more calls] = 365 × 0.9085 = 331.6 ≈ 332 days
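The complement rule used above, P[2 or more] = 1 – P[less than 2], can be sketched in Python as follows. This is a minimal illustration with assumed variable names, using only the standard library.

```python
import math

m = 4.0  # mean number of calls in the ten-minute interval

# P(X < 2) = P(X = 0) + P(X = 1) for a Poisson variate with mean m.
p_fewer_than_two = sum(
    math.exp(-m) * m ** x / math.factorial(x) for x in range(2)
)
p_two_or_more = 1 - p_fewer_than_two

# Expected number of days in a 365-day year with 2 or more calls.
expected_days = 365 * p_two_or_more

print(round(p_two_or_more, 4), round(expected_days))
```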
Solved Problem 17
2 percent of the fuses manufactured by a firm are expected to be defective.
Find the probability that a box containing 200 fuses contains
i) defective fuses
ii) 3 or more defective fuses.
Solution
2 percent of the fuses are defective. Therefore, the probability that a fuse is
defective is $p = \frac{2}{100} = 0.02$, and $n = 200$.
Let ‘X’ denote the number of defective fuses in the box of 200 fuses.
Then, X is B (n = 200, p = 0.02) i.e., binomial with parameters n and p.
Here, p is very small and n is very large. Therefore, X can be treated as a
Poisson variate with parameter m = np = 200 × 0.02 = 4.
$$P(X = x) = \frac{e^{-4}\,4^{x}}{x!}, \quad \text{where } x = 0, 1, 2, \ldots$$
P[box has defective fuses] = 1 - P[no defective fuses]
$$= 1 - P(X = 0) = 1 - \frac{e^{-4}\,4^{0}}{0!}$$
= 1 – 0.0183 = 0.9817
P[3 or more defective fuses] = 1 - P[less than 3 defective fuses]
$$= 1 - [P(X = 0) + P(X = 1) + P(X = 2)] = 1 - \left[\frac{e^{-4}\,4^{0}}{0!} + \frac{e^{-4}\,4^{1}}{1!} + \frac{e^{-4}\,4^{2}}{2!}\right]$$
$$= 1 - e^{-4}\,[1 + 4 + 8]$$
= 1 – 0.0183 × 13
= 1 – 0.2379 = 0.7621
Solved Problem 18
The probability that a razor blade manufactured by a firm is defective is
1/500. Blades are supplied in packets of 5 each. In a lot of 10,000 packets,
how many packets would:
i) Be free from defective blades?
ii) Contain exactly one defective blade? (Given: $e^{-0.01} = 0.99$)
Manipal University Jaipur Page No. 282
Statistics for Management Unit 6
Solution
Let ‘X’ be the number of defective blades in a packet of 5 blades.
Then, ‘X’ is B (n = 5, p = 1/500)
Since p is very small and n is sufficiently large, X is treated as a Poisson
variate with parameter $m = np = 5 \times \frac{1}{500} = 0.01$
$$P(X = x) = \frac{e^{-0.01}\,(0.01)^{x}}{x!}, \quad x = 0, 1, 2, 3, \ldots$$
i) $P[\text{no defective blades}] = P(X = 0) = \frac{e^{-0.01}\,(0.01)^{0}}{0!} = 0.99$
The number of packets which will be free of defective blades is
N × P[no defective blades] = 10000 × 0.99 = 9900
ii) $P[\text{one defective blade}] = P(X = 1) = \frac{e^{-0.01}\,(0.01)^{1}}{1!} = 0.0099$
The number of packets which will contain exactly one defective blade is
N × P[one defective blade] = 10000 × 0.0099 = 99.
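To see how close the Poisson approximation is to the exact binomial model in this problem, the two can be compared side by side. The sketch below is illustrative only: the function names are assumptions, and it uses the unrounded value of $e^{-0.01}$ rather than the 0.99 given in the problem.

```python
import math

n, p = 5, 1 / 500   # blades per packet, probability a blade is defective
m = n * p           # Poisson parameter m = np = 0.01
packets = 10_000


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Exact binomial probability of x defectives among n blades."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


def poisson_pmf(x: int, m: float) -> float:
    """Poisson approximation with mean m."""
    return math.exp(-m) * m ** x / math.factorial(x)


for x in (0, 1):
    exact = packets * binomial_pmf(x, n, p)
    approx = packets * poisson_pmf(x, m)
    print(x, round(exact, 1), round(approx, 1))
```

Both models give expected packet counts that are very close to the 9900 and 99 obtained by hand.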
Solved Problem 19
On an average, a typist makes 3 mistakes while typing one page. What is the
probability that a randomly observed page is free of mistakes? Among 200
pages, in how many pages would you expect mistakes?
Solution
Let X: number of mistakes in a page.
Then, X is a Poisson variate with parameter m = 3.
$$P(X = x) = \frac{e^{-3}\,3^{x}}{x!}, \quad x = 0, 1, 2, 3, \ldots$$
$$P[\text{Page is free of mistakes}] = P(X = 0) = \frac{e^{-3}\,3^{0}}{0!} = e^{-3} = 0.0498$$
P[Page has mistakes] = 1 - P[Page has no mistakes] = 1 - 0.0498 = 0.9502
Therefore, among 200 pages, the expected number of pages with mistakes is
200 × 0.9502 = 190.04 ≈ 190.
Activity 1
1. In a binomial distribution the mean is 6 and the variance is 1.5.
Find (i) P[X=2] and (ii) P[X≤2].
2. In a Poisson distribution P[X=2] = P[X=3]. Find P[X=4].
Case Study 2
Read the information and find the required probability.
On average, four pigeons hit the India Gate and are killed each week.
Ramesh, an official of the Archaeological Survey of India, requested the
Central Government to provide funds to buy equipment to scare pigeons
away from the monument. The concerned official from the Central
Government replied that unless the probability of more than two birds
being killed in any week exceeds 0.7, funds cannot be allocated.
Calculate and find out if the Central Government allocates the funds.
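$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty, \quad \sigma > 0$$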
The continuous random variables which can take all values in any given
interval such as the measure of heights, weights, temperatures, amount of
rainfall, etc. are all the examples of Normal random variables.
The following are some of the characteristics of Normal distribution:
1. Normal distribution is a Continuous probability distribution
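2. Its probability density function is given by:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty, \quad \sigma > 0$$
3. Its mean is $\mu$ and standard deviation is $\sigma$, where $\mu$ and $\sigma$ are the
parameters of the distribution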
4. It is a bell-shaped curve and is symmetric about its mean, as depicted
in figure 6.4
Key statistic
The normal distribution is the limiting form of binomial distribution.
$$f(z) = \frac{1}{\sqrt{2\pi}}\; e^{-\frac{z^{2}}{2}}, \quad -\infty < z < \infty$$
The graph of standard normal distribution is depicted in the figure 6.6.
The shaded area in figure 6.6 depicts the probability that the variate takes a
value between 0 and z. This area can be read from the table of areas under
standard normal curve. Corresponding to positive z, the area from 0 to z can
be read from this table.
Let, ‘X’ be a normal variate with mean µ and standard deviation σ.
Then $Z = \frac{X - \mu}{\sigma}$ is a standard normal variate.
Therefore, to find any probability regarding X, the standard normal variate
can be made use of.
Note:
1. The standard normal variate (SNV) is denoted by N (0,1).
2. The standard normal table values are given in annexure (Table 1)
Key statistic
Any Normal distribution can be converted into a Standard normal
distribution by the transformation:
The Standard normal variate, ‘Z’, is given by: $Z = \frac{X - \mu}{\sigma}$, where ‘Z’ is
called the Standard normal variate, which gives the number of Standard
deviations from X to the mean of this distribution
$\mu$ is the mean of the distribution
$\sigma$ is the standard deviation of this distribution
Z varies from $-\infty$ to $+\infty$
The mean of its distribution is ‘0’ and standard deviation is ‘1’. The
statisticians have developed a Standard normal table. The table gives the
probability that ‘z’ will lie between ‘0’ and ‘Z’. Therefore, to solve any
problem with a normal distribution, we convert it to Standard normal
distribution to calculate ‘z’ and then refer to the table, which gives the area
under the normal curve between mean and any value of the normally
distributed random variable.
Key statistic
The mean of Standard normal distribution is ‘0’ and the standard
deviation is ‘1’.
Solved Problem 20
The weight of Cocavito packs packed by the filling machine follows a normal
distribution with mean weight of 500 gm and standard deviation of 10 gm. A
pack is selected at random. What is the probability that:
i) The pack’s weight will exceed 515 gm?
ii) The pack’s weight will lie within 480 to 520 gm?
iii) The pack’s weight will be less than 480 gm or greater than
520 gm?
If 10,000 packs are supplied, how many packs will be rejected, given that
480 gm and 520 gm are lower and upper limit for acceptance?
Solution
X is a normal variate with parameters µ = 500 and σ = 10
Therefore, $Z = \frac{X - \mu}{\sigma} = \frac{X - 500}{10}$ is a standard normal variate.
i) The probability that the pack’s weight will exceed 515 gm is given by:
$$P(X > 515) = P\left(Z > \frac{515 - 500}{10}\right) = P(Z > 1.5) = 0.5 - 0.4332 = 0.0668$$
Therefore, the probability that the pack’s weight will exceed 515 gm is
0.0668.
ii) The probability that the pack’s weight lies between 480 gm and 520 gm, as
depicted in figure 6.8, is given by:
$$P(480 < X < 520) = P(-2 < Z < 2) = 2 \times 0.4772 = 0.9544$$
Therefore, the probability that the pack’s weight lies between 480 gm and 520
gm is 0.9544.
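The two results quoted in parts (i) and (ii) can be reproduced without a printed table. The sketch below mimics the table of areas from 0 to z using the error function from Python's standard library; the function name `area_0_to_z` is an illustrative assumption.

```python
import math


def area_0_to_z(z: float) -> float:
    """Area under the standard normal curve between 0 and z (z >= 0)."""
    return 0.5 * math.erf(z / math.sqrt(2))


mu, sigma = 500.0, 10.0  # mean and standard deviation of pack weight in gm

# (i) P(X > 515): area to the right of z = 1.5.
z_515 = (515 - mu) / sigma
p_over_515 = 0.5 - area_0_to_z(z_515)

# (ii) P(480 < X < 520): by symmetry, twice the area from 0 to z = 2.
z_520 = (520 - mu) / sigma
p_within = 2 * area_0_to_z(z_520)

print(round(p_over_515, 4), round(p_within, 4))
```

The second value can differ from the table-based 0.9544 in the fourth decimal place, since printed tables round each area to four decimals.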
iii) The probability of acceptance is as found in (ii),
Solution
X is a normal variate with parameters, µ = 42 and σ = 4.
Therefore, $Z = \frac{X - \mu}{\sigma} = \frac{X - 42}{4}$ is a Standard normal variate.
(i) $P(X < 50) = P\left(\frac{X - 42}{4} < \frac{50 - 42}{4}\right) = P(Z < 2)$
(ii) $P(X > 50) = P\left(\frac{X - 42}{4} > \frac{50 - 42}{4}\right) = P(Z > 2)$
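(iii) $P(X < 40) = P\left(\frac{X - 42}{4} < \frac{40 - 42}{4}\right) = P[Z < -0.5]$
(iv) $P(X > 40) = P\left(\frac{X - 42}{4} > \frac{40 - 42}{4}\right) = P[Z > -0.5]$
(v) $P(40 < X < 44) = P\left(\frac{40 - 42}{4} < \frac{X - 42}{4} < \frac{44 - 42}{4}\right) = P[-0.5 < Z < 0.5]$
(vi) $P(37 < X < 41) = P\left(\frac{37 - 42}{4} < \frac{X - 42}{4} < \frac{41 - 42}{4}\right) = P[-1.25 < Z < -0.25]$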
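The standardisation step used in parts (iii) to (vi) can be sketched as follows. This is a minimal illustration, assuming Python 3.8 or later for `statistics.NormalDist`; the helper names `z_score` and `prob_between` are hypothetical.

```python
from statistics import NormalDist

MU, SIGMA = 42.0, 4.0
STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_score(x: float) -> float:
    """Convert a value of X to the standard normal variate Z = (X - mu) / sigma."""
    return (x - MU) / SIGMA


def prob_between(lower: float, upper: float) -> float:
    """P(lower < X < upper), computed on the standardised scale."""
    return STANDARD_NORMAL.cdf(z_score(upper)) - STANDARD_NORMAL.cdf(z_score(lower))


p_below_40 = STANDARD_NORMAL.cdf(z_score(40))   # P[Z < -0.5]
p_above_40 = 1 - p_below_40                     # P[Z > -0.5]
p_40_to_44 = prob_between(40, 44)               # P[-0.5 < Z < 0.5]
p_37_to_41 = prob_between(37, 41)               # P[-1.25 < Z < -0.25]

print(round(p_below_40, 4), round(p_above_40, 4),
      round(p_40_to_44, 4), round(p_37_to_41, 4))
```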
Solution
Let, ‘X’ denote the height of students. Then, X is a Normal variate with
parameters µ = 165 cm. and σ =5 cm.
$Z = \frac{X - \mu}{\sigma} = \frac{X - 165}{5}$ is a Standard normal variate.
i) Probability that the student is more than 177 cm tall is
Solved Problem 24
The mean and standard deviation of marks scored by a group of students in
an examination are 47 and 10 respectively. If only 20% of the students have
to be promoted, what should be the marks limit for promotion?
Solution
Let, ‘X’ denote marks. Then, X is a Normal variate with parameters µ=47
and σ = 10.
$$Z = \frac{X - \mu}{\sigma} = \frac{X - 47}{10}$$
Let ‘a’ be the marks above which a student would be promoted.
Then, since only 20% of the students have to be promoted, the probability of
a student getting promotion should be 20/100 = 0.2
Therefore,
$$P(X \ge a) = 0.2$$
$$P\left(\frac{X - 47}{10} \ge \frac{a - 47}{10}\right) = 0.2$$
$$P\left(Z \ge \frac{a - 47}{10}\right) = 0.2$$
And so, P [Z ≥ z] = 0.2 where $z = \frac{a - 47}{10}$
That is, [area from z to ∞] = 0.2
That is, [area from 0 to z] = 0.3
From the table of areas, the value of z for which [area from 0 to z] = 0.3 is
z = 0.84.
And so,
$$\frac{a - 47}{10} = 0.84$$
a – 47 = 8.4
a = 55.4
Thus, the marks limit for promotion is a = 55.4.
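The reverse table look-up in this problem (finding z from a known area) corresponds to the inverse of the cumulative distribution function. A minimal sketch, assuming Python 3.8 or later for `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 47.0, 10.0
promoted_share = 0.20  # only the top 20% are promoted

# The cut-off z has 80% of the area to its left (0.5 + 0.3 in table terms).
z = NormalDist().inv_cdf(1 - promoted_share)

# Undo the standardisation: a = mu + z * sigma.
cutoff = mu + z * sigma

print(round(z, 2), round(cutoff, 1))
```

The unrounded z is slightly above the table value of 0.84, so the cut-off agrees with 55.4 to one decimal place.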
6.7 Summary
Let us recapitulate the important concepts discussed in this unit:
Quick analysis of observed data can be done if it is identified with the
theoretical distribution.
The probabilities associated with random variate of the distribution help
us to know the chances of occurrence of several events within specified
values.
Binomial distribution is applied when you run a series of finite
independent Bernoulli trials and the probability of success remains
same for every trial. In this distribution, ‘1’ represents the occurrence of
success and ‘0’ represents the occurrence of failure.
Poisson distribution is a unimodal distribution with mean ‘m’ and
standard deviation $\sqrt{m}$. This distribution is the limiting form of
binomial distribution as ‘n’ tends to infinity.
Normal distribution is a continuous probability distribution with
probability density function f(x) given by:
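$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty, \quad \sigma > 0$$
Any normal distribution can be converted into the standard normal
distribution with the transformation
$Z = \frac{X - \mu}{\sigma}$, where ‘Z’ is called the Standard normal variate.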
6.8 Glossary
Bernoulli variate: A random variable, which assumes values ‘1’ and ‘0’ with
probabilities ‘p’ and ‘q’, (where, q = 1-p) is called Bernoulli variable.
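Binomial distribution: A probability distribution which has the following
probability mass function (p.m.f) is called binomial distribution.
$$P(X = x) = {}^{n}C_{x}\; p^{x} q^{n - x}, \quad x = 0, 1, 2, \ldots, n$$
Normal distribution: A continuous probability distribution which has the
following probability density function is called normal distribution.
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty, \quad \sigma > 0$$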
Poisson distribution: A Poisson process is obtained when the Binomial
experiment is conducted a large number of times.
Probability distributions: The listing of all the probable outcomes in a
random experiment along with their respective probabilities is called the
Probability distribution.
Random variables: A real valued function that associates result or outcome
of an experiment with a real number is known as random variable.
Standard normal distribution: A normal variate with mean µ=0 and
standard deviation σ =1 is called Standard normal variate.
6.10 Answers
Terminal Questions
1. Refer section 6.4.1.
2. 14899/15625
3. $^{16}C_{2}\,(0.75)^{14}\,(0.25)^{2}$
4. Refer section 6.5.2.
5. $e^{-0.6} = 0.5488$
6. 0.27068
7. 8/27
8. i) 0.1587 ii) 0.3830 iii) 0.0668
9. Refer section 6.6.
10. Mean = 165.89, S.D = 11.03
Activity Solution
Solution 1:
Let n and p be the parameters. Then,
Mean = np = 6
Variance = npq = 1.5
Therefore, q = ¼ and p = ¾
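Solution 2:
Let m be the parameter. Since P[X = 2] = P[X = 3],
$$\frac{e^{-m}\,m^{2}}{2!} = \frac{e^{-m}\,m^{3}}{3!}$$
$$\frac{1}{2} = \frac{m}{6}$$
$$m = 3$$
The probability mass function of the Poisson Distribution is
$$P(X = x) = \frac{e^{-3}\,3^{x}}{x!}, \quad x = 0, 1, 2, \ldots$$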
what proportion of tubes sold will have to be replaced free of cost? If the
intention is not to replace more than 2% of the failed tubes, what period of
guarantee should be set? Given the cost of replacing a picture tube is Rs
5,000, and the cost for increasing the life of a tube by one year is Rs 1,000,
discuss the options of:
i) Increasing life by 1 year
ii) Increasing guarantee period by 1 year
iii) Revising the replacement policy so that not more than 1% of failed
tubes are replaced free of cost
Case study 2
Credit Cards
As an incentive for customers to spend more money on its credit card, a
bank has decided to award high spending customers with an offer of free
stay of 3 days at one of the holiday resorts in India. However, it doesn’t want
to give the offer to more than 1% of customers. If the mean spending per
customer is Rs. 20,000 with a standard deviation of Rs. 5,000, what amount
of spending should the company specify as a cut-off? However, at the end
of the first month, it was found that 5% of customers qualified for the offer.
What could have happened? Assuming that the standard deviation has not
changed, calculate the new mean spending per customer, and assuming the
mean has not changed, calculate the new standard deviation.
(Source: TN Srivastava & Shailaja Rejo (2008) Statistics for Management 5th
edition, TMH)
References:
Bowerman B. L., & O Connel, R.T., (1996), Applied Statistics: Improving
Business Processes, Irwin.
David H. Voelker, Peter Z. Orton and Scott Adams (Jun 15, 2001),
Statistics (Cliffs Quick Review)
Freedman D. R. Pisani, and Purves, R., (1997), Statistics, 3rd edition,
W.W. Norton.
John Schiller, R. Alu Srinivasan and Murray Spiegel, (Aug 26, 2008),
Schaum's Outline of Probability and Statistics, 3rd Ed. (Schaum's
Outline Series)
Levin, Richard I., & Rubin, David S., (2008), Statistics for Management,
Seventh Edition, PHI Learning Private Limited
Martin Sternstein (Feb 1, 2010), Barron's AP Statistics with CD-ROM
(Barron's AP Statistics (W/CD))
Martin Sternstein, Barron's AP Statistics, 6th Edition
Murray R. Spiegel, John J. Schiller and R. Alu Srinivasan (Mar 17,
2000), Schaum's Outline of Probability and Statistics
Murray Spiegel and Larry Stephens (Jan 31, 2011), Schaums Outline of
Statistics, Fourth Edition (Schaum's Outline Series)
Seymour Lipschutz and John J. Schiller (Sep 7, 2011), Schaum's
Outline of Introduction to Probability and Statistics (Schaum's Outline
Series)
Tanur, J. M., (2002), Statistics: A Guide to the Unknown, 4th edition,
Brooks/Cole.
Tukey J. W., (1977), Exploratory Data Analysis, Addison–Wesley.
Wilcox, Rand R., (2009), Basic Statistics – Understanding Conventional
Methods and Modern Insights, Oxford University Press.
E-References:
http://www.textbooksonline.tn.nic.in/Books/11/Stat-EM/Chapter-3.pdf
7.1 Introduction
In the previous unit, ‘Theoretical Probability Distributions’, you studied
both discrete and continuous random variables along with the
probability distributions of random variables. You studied the
Binomial, Poisson and Normal distributions, which were explained with the
help of solved problems.
In this unit, we will discuss statistical sampling and sampling
designs. You will study different types of sampling theories and also
the laws of sampling. We will end this unit with an important theorem called
the central limit theorem.
In different fields of human activity, the decision making process is based on
the observations of few units which form a portion of the total population.
The process of studying only a portion of the population and making
decisions involves risk, the risk of making wrong decisions. This unit deals
with the various techniques of drawing samples from the population.
Evaluation of risk will be discussed in unit 9, ‘Testing of Hypothesis in case
of Large and Small Samples’.
When sampling design is not done properly, the estimation or the inferences
drawn from the sample can go wrong and the managerial decisions taken
on the wrong conclusions may lead to loss of time, money and human
resources. This may badly affect the reputation of their organisation. Hence,
the risks involved in using the incorrect sampling design are of primary
concern to investigators.
Objectives:
After studying this unit, you should be able to:
differentiate between population and a sample
define the laws of sampling theory
identify the various sampling errors
recognise the types of sampling available
determine the sample size
define the central limit theorem
7.1.1 Relevance
The new chairman of the Flourish bank continued with the existing system
of conducting quarterly meeting with a selected group of regional and
branch managers that were deemed to be an elite band by the previous
management. However, after a few meetings, he observed that the quality
of ideas at the meetings was limited because the same executives were
participating in every meeting. He was also informed by his secretary that
there was a general discontentment among the other regional and branch
managers, since they felt that they were being denied the opportunity of
meeting and sharing their ideas for growth with the chairman. Consequently,
he asked the head of department for Management Information Systems to
classify the regions into three categories and branches into four categories.
He then developed a system of selecting the participants for the meetings
from each category of regional & branch managers on a random basis, in
such a manner that no regional or branch manager was invited again unless
all others have been invited for the meetings. Thus, with the help of random
sampling, coupled with the other managerial and behavioural initiatives, the
chairman was able to create an atmosphere of involvement, trust and
commitment. This contributed significantly to the accelerated growth of the
bank.
(Source: TN Srivastava & Shailaja Rejo(2008) Statistics for Management 5th edition,
TMH.)
can develop plans for the future including long–term planting and harvesting
schedules for the trees.
These sample data were entered into the company’s continuous forest
inventory (CFI) computer system. Reports from CFI system include a
number of frequency distribution summaries containing statistics on types of
trees, present forest volume, past forest growth rate and projected future
forest growth and volume. Sampling and the associated statistical summaries of
the sample data provide the reports that are essential for the effective
management of Mead’s forests and timberlands.
(Source: David R Anderson, Dennis J Sweeney & Thomas A Williams 5th edition,
Thomson Business Information Pvt. Ltd.)
Example 1
In the statistical survey aimed at determining average per capita income
of the people in the city, all earning individuals in the city form the
population.
[Figure: Population and Sample]
Mean (X)    Frequency (f)    fX
1.5         1                1.5
2           1                2.0
2.5         2                5.0
3           2                6.0
3.5         2                7.0
4           1                4.0
4.5         1                4.5
Total       N = 10           ΣfX = 30
Let us understand about each of the error types and the factors causing
those errors.
1. Sampling errors
The sample results are bound to differ from population results, since a sample
is only a small portion of the population. This error is also known as inherent
error and cannot be avoided. It is not worthwhile to try to eliminate such
errors completely. These errors may be due to the following factors:
Faulty selection of sample
Substitution of units to be studied
Faulty demarcation of sampling units
Error due to bias in estimation
However, the sampling errors follow random or chance variations and tend
to cancel out each other on averaging.
2. Non-sampling errors
Non-sampling errors are attributed to factors that can be controlled and
eliminated by suitable actions. They are due to the following factors:
Faulty planning, faulty definitions
Defective methods of interviewing
Personal bias of investigator
Lack of trained and qualified investigators
Respondents’ failure to answer
Improper coverage
Compiling errors
Publication errors
It is worthwhile to eliminate these errors.
3. Biased errors
Biased errors arise in both census and sampling methods. These errors
occur due to personal bias of the investigator and the instruments used for
measuring. They are also due to faulty collection of data, respondent’s bias
and bias due to non-response. Biased errors have a tendency to grow with
sample size. Therefore, they are also known as cumulative errors. The
magnitude of biased errors is directly proportional to the sample size.
4. Unbiased errors
The errors that are due to over-estimation and under-estimation, such that
they are equal are known as unbiased errors. They are also known as
compensatory errors. They do not increase with sample size.
7.6.1 Measures of statistical errors
Key statistic
Absolute error is the difference between the true value, ‘t’, and the observed
value, ‘a’. Symbolically, Absolute Error ‘AE’ is represented as:
$$AE = t - a$$
It is independent of magnitude of the actual value.
Key statistic
Relative error is the ratio of the absolute error to the actual value. It is
symbolically represented as:
$$RE = \frac{AE}{a} = \frac{t - a}{a}$$
It provides a degree of error for comparison purposes between different
sets of data.
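The two measures can be sketched in Python as follows. The function names and the numbers in the example are hypothetical; the formulas follow the definitions above, with the relative error divided by ‘a’ as in the text.

```python
def absolute_error(t: float, a: float) -> float:
    """AE = t - a, where t is the true value and a is the observed value."""
    return t - a


def relative_error(t: float, a: float) -> float:
    """RE = AE / a = (t - a) / a, following the formula in the text."""
    return absolute_error(t, a) / a


# Hypothetical illustration: true value 200, observed value 190.
ae = absolute_error(200, 190)
re = relative_error(200, 190)
print(ae, round(re, 4))
```

Because the relative error is a ratio, it allows errors from data sets of very different magnitudes to be compared, which the absolute error alone does not.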
Example 4
The items produced by factories located at three cities ‘X’, ‘Y’ and ‘Z’ are
200, 300 and 500, respectively. We wish to draw a sample of 20 items
under proportional stratified sampling. We number the units from 0 to 999.
Then we refer to the random numbers table and select the numbers as depicted
in table 7.4.
Table 7.4: Stratified Random Sampling
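The proportional allocation in Example 4 can be sketched in Python. This is an illustration under stated assumptions: a pseudo-random generator stands in for the random numbers table, and the numbering of units (0 to 199 for ‘X’, 200 to 499 for ‘Y’, 500 to 999 for ‘Z’) is assumed, since the text does not say how the numbers are assigned to cities.

```python
import random

stratum_sizes = {"X": 200, "Y": 300, "Z": 500}  # items produced per city
sample_size = 20
total = sum(stratum_sizes.values())

# Proportional allocation: each stratum gets sample_size * (stratum size / total).
# The division is exact for these figures; in general, rounding would be needed.
allocation = {
    city: sample_size * size // total for city, size in stratum_sizes.items()
}

rng = random.Random(0)  # fixed seed so the illustration is repeatable
sample = {}
start = 0
for city, size in stratum_sizes.items():
    units = range(start, start + size)  # assumed numbering of this stratum
    sample[city] = sorted(rng.sample(units, allocation[city]))
    start += size

print(allocation)
```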
This implies ‘nK = N’ or ‘K = N/n’. From the first group, we select a unit at
random. Suppose the unit selected is the 6th unit; thereafter we select every
Kth unit, that is, the (6 + K)th, (6 + 2K)th units and so on. If ‘K’ is 20,
‘n’ is 5 and ‘N’ is 100, then the units selected are 6, 26, 46, 66, 86.
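The selection rule above can be sketched in Python. The function name and its `start` parameter are illustrative assumptions; when no start is supplied, a random unit from the first group of K units is used.

```python
import random
from typing import List, Optional


def systematic_sample(population_size: int, sample_size: int,
                      start: Optional[int] = None) -> List[int]:
    """Select every K-th unit, where K = N / n, beginning from a random start."""
    k = population_size // sample_size          # sampling interval K
    if start is None:
        start = random.randint(1, k)            # random unit in the first group
    return [start + i * k for i in range(sample_size)]


# The case in the text: N = 100, n = 5, so K = 20, with the 6th unit drawn first.
print(systematic_sample(100, 5, start=6))
```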
The table 7.5 displays the merits and demerits of systematic sampling.
Table 7.5: Merits and Demerits of Systematic Sampling
Merits                                      Demerits
1. Very easy to operate.                    1. In many cases we do not get an
                                               up-to-date list.
2. It saves time and labour.                2. It gives biased results if a periodic
                                               feature exists in the data.
3. More efficient than simple random
   sampling if we have an up-to-date
   frame.
8. Multi-stage sampling
The total population is divided into several stages. The sampling process is
carried out through several stages. It is as depicted in figure 7.8.
Example 5
We want to select 1000 colleges from southern states. In the first stage
we may select any three states. In the second stage we may select some
districts in each selected state. In the third stage, we may select the
colleges in each district. We may adopt any sampling technique at each stage.
The table 7.6 depicts the merits and demerits of multi-stage sampling.
Table 7.6: Merits and Demerits of Multi Stage Sampling
Merits                                      Demerits
Greater flexibility in this sampling        Estimates are less accurate
method
Existing division can be used               Investigator should have knowledge of
                                            the entire population that will be
                                            sampled
1. Judgment sampling
The choice of sample items depends exclusively on the judgment of the
investigator. The investigator’s experience and knowledge about the
population will help to select the sample units. It is the most suitable method
if the population size is small. The table 7.7 depicts the merits and demerits
of judgement sampling.
Table 7.7: Merits and Demerits of Judgement Sampling
Merits                                      Demerits
1. Most useful for small population.        1. It is not a scientific method.
2. Most useful to study some unknown        2. It has a risk of investigator’s
   traits of a population some of whose        bias being introduced.
   characteristics are known.
3. Helpful in solving day-to-day
   problems.
2. Convenience sampling
The sample units are selected according to the convenience of the
investigator. It is also called “chunk” which refers to the fraction of the
population being investigated, which is selected neither by probability nor by
judgment.
Moreover, a list or framework should be available for the selection of the
sample. It is used to make pilot studies. However, there is a high chance of
bias being introduced.
3. Quota sampling
It is a type of judgment sampling. Under this design, quotas are set up
according to some specified characteristic such as age groups or income
groups. From each group a specified number of units are sampled
according to the quota allotted to the group. Within the group the selection
of sample units depends on personal judgment. It has a risk of personal
prejudice and bias entering the process. This method is often used in public
opinion studies.
Caselet
Read the information and answer the questions.
You have been given 5 boxes of biscuits. There are orange, brown and
yellow colour biscuits. You are asked to sample the biscuits. The target
population here is all of the biscuits and the sampling unit is the biscuit.
Answer the following questions.
i) How would you apply simple random sampling?
ii) How would you apply stratified sampling?
iii) How would you apply cluster sampling?
Key statistic
The formula used for calculating the sample size when the research is
concerned with a population proportion and a finite population is given by:
$$z = \frac{\hat{p} - p}{\sqrt{\dfrac{N - n}{N - 1}} \times \sqrt{\dfrac{pq}{n}}} \quad \text{(For Finite population)}$$
where ‘N’ is population size.
$$n = \frac{z^{2}\,p\,q\,N}{e^{2}(N - 1) + z^{2}\,p\,q}$$
z = value corresponding to the degree of confidence desired
$p$ = population proportion
$\hat{p}$ = sample proportion
e = acceptable error (the precision)
q = 1 – p
n = sample size.
Key statistic
The formula used for calculating the sample size when the research is
concerned with a population proportion and an infinite population is given by:
$$z = \frac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}} \quad \text{(For Infinite population)}$$
$$n = \frac{z^{2}\,p\,q}{e^{2}}$$
z = value corresponding to the degree of confidence desired
$p$ = population proportion
$\hat{p}$ = sample proportion
e = acceptable error (the precision)
q = 1 – p
n = sample size.
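The two proportion-based formulas can be sketched in Python. The function names and the figures in the example (z = 1.96, p = 0.5, e = 0.05, N = 2000) are hypothetical and chosen only for illustration.

```python
import math


def sample_size_proportion_infinite(z: float, p: float, e: float) -> float:
    """n = z^2 * p * q / e^2 for an infinite population, with q = 1 - p."""
    q = 1 - p
    return z ** 2 * p * q / e ** 2


def sample_size_proportion_finite(z: float, p: float, e: float, N: int) -> float:
    """n = z^2 * p * q * N / (e^2 * (N - 1) + z^2 * p * q) for a finite population."""
    q = 1 - p
    return z ** 2 * p * q * N / (e ** 2 * (N - 1) + z ** 2 * p * q)


# Hypothetical illustration: z = 1.96, p = 0.5, acceptable error e = 0.05.
n_infinite = sample_size_proportion_infinite(1.96, 0.5, 0.05)
n_finite = sample_size_proportion_finite(1.96, 0.5, 0.05, N=2000)

# Rounding up ensures the achieved error does not exceed e.
print(math.ceil(n_infinite), math.ceil(n_finite))
```

For the same z, p and e, the finite-population formula gives a smaller sample size, and it approaches the infinite-population value as N grows.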
Key statistic
The formula used for calculating the sample size for infinite population,
when population mean and sample mean are given, is:
$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \quad \text{(For Infinite population)}$$
$$n = \frac{z^{2}\,\sigma^{2}}{e^{2}}$$
z = standard variate at a given confidence level
$\mu$ = population mean
$\bar{X}$ = sample mean
e = acceptable error (the precision)
$(\bar{X} - \mu) = e$ is the error we admit between the true value of parameter and
the statistic (estimated value).
$\sigma$ = standard deviation of population
n = sample size
Key statistic
The formula used for calculating the sample size for finite population,
when population mean and sample mean are given, is:
$$z = \frac{\bar{X} - \mu}{\dfrac{\sigma}{\sqrt{n}} \sqrt{\dfrac{N - n}{N - 1}}} \quad \text{(For Finite population)}$$
$$n = \frac{z^{2}\,\sigma^{2}\,N}{(N - 1)\,e^{2} + z^{2}\,\sigma^{2}}$$
z = standard variate at a given confidence level
$\mu$ = population mean
$\bar{X}$ = sample mean
e = acceptable error (the precision)
$(\bar{X} - \mu) = e$ is the error we admit between the true value of parameter
and the statistic (estimated value).
$\sigma$ = standard deviation of population
n = sample size
N = size of population
Solved Problem 2
The mean expenditure per customer at a tyre store is
Rs. 85.00, with a standard deviation of Rs. 9.00. If the mean expenditure of
the sample is Rs. 87, what is the required sample size? (z-value is 1.41)
Solution
Here the acceptable error is e = 87 – 85 = 2.
$$n = \frac{z^{2}\,\sigma^{2}}{e^{2}} = \frac{(1.41)^{2} \times 9^{2}}{2^{2}} = \frac{1.9881 \times 81}{4} = 40.26 \approx 40$$
Hence the required sample size is 40.
Solved Problem 3
A production company has 350 hourly employees having average 37.6
years of age, with a standard deviation of 8.3. If the sample average is 40
years of age and z-value is 2.07, calculate the required sample size.
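Solution
Here the acceptable error is e = 40 – 37.6 = 2.4.
$$n = \frac{z^{2}\,\sigma^{2}\,N}{(N - 1)\,e^{2} + z^{2}\,\sigma^{2}}$$
$$= \frac{(2.07)^{2}\,(8.3)^{2} \times 350}{(350 - 1)\,(2.4)^{2} + (2.07)^{2}\,(8.3)^{2}}$$
$$= \frac{103315.4}{2305.4} = 44.8 \approx 45$$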
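Solved Problems 2 and 3 can be checked with a short Python sketch of the two mean-based sample size formulas. The function names are illustrative assumptions; the inputs are the figures given in the problems.

```python
def sample_size_mean_infinite(z: float, sigma: float, e: float) -> float:
    """n = z^2 * sigma^2 / e^2 for an infinite population."""
    return z ** 2 * sigma ** 2 / e ** 2


def sample_size_mean_finite(z: float, sigma: float, e: float, N: int) -> float:
    """n = z^2 * sigma^2 * N / ((N - 1) * e^2 + z^2 * sigma^2) for a finite population."""
    return z ** 2 * sigma ** 2 * N / ((N - 1) * e ** 2 + z ** 2 * sigma ** 2)


# Solved Problem 2: z = 1.41, sigma = 9, e = 87 - 85 = 2.
n_tyre_store = sample_size_mean_infinite(1.41, 9.0, 87 - 85)

# Solved Problem 3: z = 2.07, sigma = 8.3, e = 40 - 37.6 = 2.4, N = 350.
n_employees = sample_size_mean_finite(2.07, 8.3, 40 - 37.6, N=350)

print(round(n_tyre_store, 2), round(n_employees, 2))
```

The text rounds the first result down to 40 and the second up to 45; rounding up in both cases would be the more cautious choice if the acceptable error must not be exceeded.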
7.10 Summary
Let us recapitulate the important concepts discussed in this unit:
Statistical surveys or enquiries deal with studying various characteristics
of units belonging to a group. The group consisting of all the units is
called the Universe or Population.
There are two methods of studying the characteristics of population:
census and sampling.
Sample is a finite subset of a population. A sample is drawn from a
population to estimate the characteristics of the population.
There are two methods of sampling namely probability sampling and
non-probability sampling.
Probability sampling provides a scientific technique of drawing samples
from the population.
In non-probability sampling method, the selection of sample units
depends entirely upon the personal convenience, biases, prejudices and
beliefs of the investigator.
7.11 Glossary
Biased errors: Biased errors arise in both census and sampling method.
These errors occur due to personal bias of the investigator and the
instruments used for measuring.
Cluster sample: Cluster sample is the one in which the items in the
population are divided into various clusters, so that each cluster is the
representative of the population. A random sample of clusters is taken, and
the clusters selected are analysed.
Convenience sampling: The sample units are selected according to the
convenience of the investigator.
Judgment sampling: The choice of sample items depends exclusively on
the judgment of the investigator.
Non-probability sample: A sample in which items are chosen without
knowing their probability of selection.
Non-sampling errors: Non-sampling errors are attributed to factors that
can be controlled and eliminated by suitable actions.
Probability sampling: Probability sampling provides a scientific technique
of drawing samples from the population.
Sample mean: An unbiased estimate of the mean of the population from
which it was drawn.
Sampling distribution: Sampling distribution consists of all the possible
values of a statistic and their respective probabilities for a given sample
size.
Sampling error: Sampling error is the difference between the sample
statistic and the actual population parameter.
Simple random sample: A sample in which ‘n’ elements are selected from
a population in such a way that every set of ‘n’ elements in the population
has an equal probability of being selected.
Quota sampling: It is a type of judgment sampling. Under this design,
quotas are set up according to some specified characteristic such as age
groups or income groups.
Unbiased errors: The errors that are due to over-estimation and under-
estimation such that they are equal are known as unbiased errors.
Plant A B C
Number of employees 100 200 200
Activity
1. The process of obtaining information about an entire population by
examining only a part of it is known as
i) statistics ii) sampling
iii) survey iv) selection
2. Non Sampling errors include
i) bias ii) mistakes
iii) both bias & mistakes iv) none of these
3. The simplest way of increasing the accuracy of a sample is to
increase its
i) Size ii) interviewer
iii) population iv) universe.
4. The term error in statistics is
i) mistakes
ii) bias
iii) both bias & mistakes
iv) difference between the value of a statistic and that of the
corresponding parameter
7.13 Answers
Terminal Questions
1. Refer section 7.6
2. Refer section 7.7.1
3. Refer section 7.7.1
4. Refer section 7.7.2
5. Refer section 7.4
6. Refer section 7.5
The above data was published in October 02, 2007 issue of Business world,
based on Internet and online association of India (IOAI) report 2006. It was
reported that major online sites were yahoo mail, hot mail and rediff mail. It
was reported that during 2005-06, Indian consumers spent Rs 1,280 crore –
more than double the Rs 670 crore netted during 2004-05. Design a survey
References:
Agarwal, B. L., (2006), Basic Statistics, Fourth Edition, New Age
International Publishers.
Anderson, David R., Sweeney, Dennis J. & Williams, Thomas A., 5th
edition, Thomson Business Information Pvt. Ltd.
Bowerman, B. L. & O Connel, R. T., (1996), Applied Statistics:
Improving Business Processes, Irwin.
Freedman, D., Pisani, R., and Purves, R., (1997), Statistics, 3rd edition, W.
W. Norton.
Levin, Richard I. & Rubin, David S. (2008), Statistics for Management,
Seventh Edition, PHI Learning Private Limited.
Srivastava, T. N., & Rejo, Shailaja (2008), Statistics for Management, 5th
edition, TMH.
Tukey J. W., (1977), Exploratory Data Analysis, Addison–Wesley.
Wilcox, Rand R., (2009), Basic Statistics – Understanding Conventional
Methods and Modern Insights, Oxford University Press.
E-References:
http://www.textbooksonline.tn.nic.in/Books/11/Stat-EM/Chapter-1.pdf
Unit 8 Estimation
Structure:
8.1 Introduction
Objectives
Relevance
Statistics in practice
8.2 Reasons for Making Estimates
8.3 Making Statistical Inference
8.4 Types of Estimates
Point estimate
Interval estimate
8.5 Criteria of a Good Estimator
8.6 Point Estimator for Mean and Variance
8.7 Interval Estimates
Case study on calculating estimates
Making the interval estimate
8.8 Interval Estimates and Confidence Intervals
Interval estimates of the Mean
Interval estimates of the Proportion
Interval estimates using the Student’s ‘t’ distribution
8.9 Summary
8.10 Glossary
8.11 Terminal Questions
8.12 Answers
8.13 Case Study
8.1 Introduction
In the previous unit, ‘Sampling and Sampling Distributions’, you studied
sampling design, different theories of sampling, and the sampling errors in
sampling distributions. In this unit, you will study estimation and the
different types of estimates. You will also study how to calculate confidence
intervals for the population mean when the standard deviation is unknown.
Finally, you will study the methods to calculate the sample size for
estimating a parameter with a certain level of confidence for a given
measure of accuracy.
Everyone makes estimates. When you are ready to cross a street, you
estimate the speed of any car that is approaching, the distance between you
and that car, and your own speed. Having made these quick estimates, you
decide whether to wait, walk or run. With the knowledge of inferential
statistics, you can make estimates about a population using random
samples drawn from that population.
Objectives:
After studying this unit, you should be able to:
describe the types of estimates
distinguish between a Point estimate and an Interval estimate
evaluate the confidence interval
describe interval estimates and confidence intervals
evaluate the sample size if the confidence interval and permissible error
are given
8.1.1 Relevance
The new general manager of Ever Bright Light Company, manufacturing
tube lights is concerned about the dwindling profits of the company. The
main reason is that the company provides a guarantee of 1 year of life and
undertakes to replace a tube light if it fails within 1 year. Since a good
number of tube lights are failing in less than a year and are being replaced
free of cost, they are lowering the company’s profitability and also causing
loss of reputation. The general manager intuitively feels that the guaranteed
life must be such that the percentage of tube lights failing within that period
is quite small; say 5% or 10%, so as to keep the cost of replacement low.
Since it may not be appropriate to reduce the guarantee, the only
alternative is to increase the life of the tube light. After careful consideration,
he outlines the following steps:
Estimate the average life of tube lights, as well as the variation in their
lives.
Take action to increase the life of the tube light with the help of improved
technology and better management of the production process.
Test whether the actions taken have increased the life, and by how
much?
Example 1
Suppose, we choose a sample of a given size and must decide whether
to use the sample mean or the sample weighted mean to estimate the
population mean.
If we calculate the standard error of the sample mean and found it to be
1.05 and then, calculated the standard error of the sample weighted
mean and found it to be 1.6, we would say that the sample mean is a
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
We can use the sample variance ‘$s^2$’ to estimate the population
variance:
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
where ‘n’ is the sample size. In many cases, such as in the case of interval
estimation of the mean, we need to know the value of σ, the population
standard deviation. If σ is not known, we use ‘s’ in its place and proceed
with computations.
Solved Problem 1
Table 8.1 depicts the total income, in thousand rupees per year,
of 10 randomly selected persons from a particular class of people.
Table 8.1: Total Income of Ten People
Income (in thousand Rs): 6.5 7.6 5.4 12.7 8.0 5.5 4.5 9.0 10.1 6.8
On the basis of the data, find the mean income of a person in this class and
also find the sample standard deviation.
Solution
Let income be denoted by X, with n = 10.
Table 8.1a: Calculation of variation from the mean
Income $X_i$    $X_i - \bar{X}$    $(X_i - \bar{X})^2$
 6.5            -1.11               1.2321
 7.6            -0.01               0.0001
 5.4            -2.21               4.8841
12.7             5.09              25.9081
 8.0             0.39               0.1521
 5.5            -2.11               4.4521
 4.5            -3.11               9.6721
 9.0             1.39               1.9321
10.1             2.49               6.2001
 6.8            -0.81               0.6561
Total 76.1       0.00              55.089
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{76.1}{10} = 7.61$$
The point estimate of population mean μ is $\bar{X}$ = 7.61.
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{55.089}{10 - 1} = 6.121$$
Sample standard deviation is given by
$$s = \sqrt{s^2} = \sqrt{6.121} = 2.474$$
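The point estimates for the income data can be checked with a few lines of Python. This is only an illustrative sketch: the function name `point_estimates` is an assumption introduced here, and only the standard library is used.

```python
import math

def point_estimates(values):
    """Return the sample mean, sample variance (n - 1 divisor) and sample S.D."""
    n = len(values)
    mean = sum(values) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in values)
    variance = sum_sq_dev / (n - 1)
    return mean, variance, math.sqrt(variance)

# Incomes (in thousand Rs) of the ten randomly selected persons
incomes = [6.5, 7.6, 5.4, 12.7, 8.0, 5.5, 4.5, 9.0, 10.1, 6.8]
mean, variance, std_dev = point_estimates(incomes)
print(round(mean, 2), round(variance, 3), round(std_dev, 3))  # 7.61 6.121 2.474
```

Dividing by n - 1 rather than n follows the sample variance formula used in this unit.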
Example 2
The table 8.2 depicts the results of samples of 35 boxes which contain
bolts.
Table 8.2: Results of Samples of 35 Boxes of Bolts (Bolts per Box)
101 103 112 102 98 97 93
105 100 97 107 93 94 97
97 100 110 106 110 103 99
93 98 106 100 112 105 100
114 97 110 102 98 112 99
$$\bar{X} = \frac{\sum X}{n} = \frac{3570}{35} = 102$$
Thus, using the sample mean $\bar{X}$ as the estimator, we have a point
estimate of the population mean ‘µ’.
If we select and plot a large number of sample means from a population, the
distribution of these means will approximate a normal curve. Furthermore,
the mean of the sample means will be the same as the population mean.
8.7.1 Case study on calculating estimates
Case Study
The marketing research director needs an estimate of the average life in
months, for car batteries manufactured by his company. We select a
random sample of 200 batteries with a mean life of 36 months. If we use
the point estimate of the sample mean ‘ X ’ as the best estimator of the
population mean ‘µ’, we would report that the mean life of the company’s
batteries is 36 months.
The director also asks for a statement about the uncertainty that is likely
to accompany this estimate, that is, a statement about the range within
which the unknown population mean is likely to lie. To provide such a
statement, we need to find the standard error of the mean. Our sample
size of 200 is large enough that we can apply the central limit theorem.
Suppose we have already estimated the standard deviation of the
population of the batteries and reported that it is 10 months.
Using this standard deviation of the population, we can calculate the
standard error of the mean, in the case of a large population, by using the
formula
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{200}} = 0.707$$
Case Study
(Cont. from topic ‘Interval Estimates’)
We can tell the director that our estimate of the life of the company’s
batteries is 36 months, and the standard error that accompanies this
estimate is 0.707. In other words, the actual mean life for all the batteries
may lie somewhere in the interval estimate of 35.293 to 36.707 months.
This is helpful but insufficient information for the director.
Next, we need to calculate the chances that the actual life will lie in this
interval or in other intervals of different widths that we might choose.
The probability is 0.955 that the mean of a sample of size 200 will be
within ±2 standard errors of the population mean. Stated differently,
95.5 percent of all sample means are within ±2 standard
errors of the population mean ‘µ’. Equivalently, the population mean ‘µ’
will be located within ±2 standard errors of the sample mean 95.5 percent
of the time.
Hence, we can now report to the director that the best estimate of the life
of the company’s batteries is 36 months, and we are 68.3 percent
confident that the life lies in the interval from 35.293 to 36.707 months
($36 \pm 1\sigma_{\bar{x}}$). Similarly, we are 95.5 percent confident that the life
falls within the interval of 34.586 to 37.414 months ($36 \pm 2\sigma_{\bar{x}}$), and
we are 99.7 percent confident that battery life falls within the interval of
33.879 to 38.121 months ($36 \pm 3\sigma_{\bar{x}}$).
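The three interval estimates for the battery example can be reproduced with a short Python sketch. The helper names `interval_estimate` and `coverage` are assumptions made for illustration. The coverage probabilities come from the standard normal distribution through `math.erf`, which gives about 0.6827, 0.9545 and 0.9973; these are the values usually quoted as 68.3, 95.5 and 99.7 percent.

```python
import math

def interval_estimate(sample_mean, std_error, k):
    """Return (lower, upper) for sample_mean plus or minus k standard errors."""
    return sample_mean - k * std_error, sample_mean + k * std_error

def coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

# Standard error of the mean: sigma = 10 months, n = 200, rounded to 0.707 as in the case study
std_error = round(10 / math.sqrt(200), 3)

intervals = {k: interval_estimate(36, std_error, k) for k in (1, 2, 3)}
for k, (low, high) in intervals.items():
    print(k, round(low, 3), round(high, 3), round(coverage(k), 4))
```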
Key statistic
The probability that we associate with an interval estimate is called the
confidence level.
This probability indicates how confident we are about the fact that the
interval estimate will include the population parameter. A higher probability
means more confidence. In estimation, the most commonly used confidence
levels are 90 percent, 95 percent, and 99 percent, but we are free to apply
any confidence level. The confidence interval is the range of the estimate
we are making.
Example 3
If we report that we are 90 percent confident that the mean of the
population of incomes of people in a certain community will lie between
Rs. 8,000 and Rs. 24,000, then the range Rs. 8,000 - Rs. 24,000 is our
confidence interval.
Often, however, we will express the confidence interval in standard errors
rather than in numerical values. Thus, we will often express confidence
intervals like this:
$\bar{X} + z\sigma_{\bar{x}}$ = upper limit of the confidence interval
Thus, confidence limits are the upper and lower limits of the confidence
interval. In this case, $\bar{X} + z\sigma_{\bar{x}}$ is called the upper confidence limit (UCL)
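$$\sigma_{\bar{x}} = \frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}} = \frac{1.3}{\sqrt{10}} \sqrt{\frac{100 - 10}{100 - 1}} = 0.4111 \times 0.9535 = 0.392$$
At 95% level of confidence, we know from the ‘z’ table that ‘z’ is 1.96.
UCL = $\bar{X} + z\sigma_{\bar{x}}$ = 5.2 + 1.96 × 0.392 = 5.2 + 0.7683 = 5.9683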
$$\sigma_p = \sqrt{\frac{pq}{n}}$$ (large population case)
Using the above estimated standard error of proportion, we can work out the
confidence interval for population proportion thus:
$$p \pm z \sqrt{\frac{pq}{n}}$$
Also, for the standard error of proportion in the case of a finite population, we have:
$$\sigma_p = \sqrt{\frac{pq}{n}} \sqrt{\frac{N - n}{N - 1}}$$ (in case of finite population).
Solved Problem 3
In a very large organisation, the director wanted to find out what proportion
of the employees prefer to provide their own retirement benefits in lieu of a
company-sponsored plan. A simple random sample of 75 employees was
taken. It was found that 40%, that is, 0.4 of them are interested in providing
their own retirement plans. The management requests that we use this
sample to find an interval about which they can be 99 percent confident that
it contains the true population proportion.
Solution
Here, n = 75,
p = 0.4,
q = 1- p = 1 – 0.4 = 0.6
Therefore, standard error of the proportion = $\sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.4)(0.6)}{75}} = 0.057$
Confidence interval is given by
$$p \pm z \sqrt{\frac{pq}{n}}$$
At 99 % level of confidence, we know from the ‘z’ table that ‘z’ is 2.58
UCL= 0.4 + 2.58 (0.057) = 0.547
LCL= 0.4 - 2.58 (0.057) = 0.253
Therefore, the interval estimate for 99% level of confidence is
0.4 ± 2.58 (0.057), that is, 0.253 to 0.547.
Hence, we are 99 percent confident that the proportion of the total
population of employees who wish to establish their own retirement plans
lies between 0.253 and 0.547.
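The same interval can be sketched in Python. The function name `proportion_interval` is an assumption for illustration. Because the code carries the standard error at full precision (about 0.0566) instead of rounding it to 0.057, the limits it prints differ from the hand calculation in the third decimal place.

```python
import math

def proportion_interval(p, n, z):
    """Confidence limits p plus or minus z * sqrt(pq/n) for a large population."""
    q = 1 - p
    std_error = math.sqrt(p * q / n)
    return p - z * std_error, p + z * std_error

# n = 75 employees, sample proportion 0.4, z = 2.58 for 99% confidence
lcl, ucl = proportion_interval(p=0.4, n=75, z=2.58)
print(round(lcl, 3), round(ucl, 3))
```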
Key statistic
We can define degrees of freedom as the number of values that we can
freely choose. We will use degrees of freedom when we select a ‘t’
distribution to estimate a population mean, and we will use ‘n-1’ degrees
of freedom, where ‘n’ is the sample size.
For example, if we use a sample of 20 to estimate the mean of population,
we will use 19 degrees of freedom in order to select the appropriate ‘t’
distribution. With two sample values, we have one degree of freedom
(2-1 = 1), and with seven sample values, we have six degrees of freedom
(7-1 = 6). In each of these two examples, then, we had ‘n-1’ degrees of
freedom; assuming ‘n’ is the sample size. Similarly, a sample of 23 would
give us 22 degrees of freedom.
Key statistic
In any estimation problem in which the sample size is 30 or less and the
standard deviation of the population is unknown and the underlying
population can be assumed to be normal or approximately normal, use
the ‘t’ distribution.
values for only a few percentages (10, 5, 2, and 1 Percent). Because there
is a different ‘t’ distribution for each number of degrees of freedom, a more
complete table would be quite lengthy.
A second difference in the ‘t’ table is that it does not focus on the chance
that the population parameter being estimated will fall within our confidence
interval. Instead, it measures the chance that the population parameter we
are estimating will not be within our confidence interval (that is, it will lie
outside the confidence interval).
Table 8.4: Formulae Concerning Estimation
Estimating the population mean (µ) when we do not know $\sigma_p$ and use s, and the sample is small (n ≤ 30):
Infinite population: $\bar{X} \pm t \frac{s}{\sqrt{n}}$
Finite population: $\bar{X} \pm t \frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$
Solved Problem 4
A random sample of 14 items is taken, producing a sample mean of 2.14
and a sample standard deviation of 1.29. Find the 99% confidence interval
for the population mean. (The ‘t’ table value is 3.012.)
Solution
Given n = 14, $\bar{X}$ = 2.14 and s = 1.29. Since the sample size is less than 30,
we use the ‘t’ distribution to compute the confidence interval.
Confidence interval for μ is given by
UCL = $\bar{X} + t \frac{s}{\sqrt{n}}$ and LCL = $\bar{X} - t \frac{s}{\sqrt{n}}$.
Table value for t at 99% confidence level and n – 1 = 14 – 1 = 13 degrees of
freedom is 3.012. Therefore, we have
UCL = $\bar{X} + t \frac{s}{\sqrt{n}} = 2.14 + 3.012 \times \frac{1.29}{\sqrt{14}} = 2.14 + 1.04 = 3.18$ and
LCL = $\bar{X} - t \frac{s}{\sqrt{n}} = 2.14 - 3.012 \times \frac{1.29}{\sqrt{14}} = 2.14 - 1.04 = 1.10$
Therefore, the confidence interval for μ is 1.10 ≤ μ ≤ 3.18.
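A minimal Python sketch of this small-sample interval follows. The function name `t_interval` is an assumption for illustration, and the ‘t’ value is passed in as read from the table rather than computed.

```python
import math

def t_interval(sample_mean, s, n, t_value):
    """Confidence limits sample_mean plus or minus t * s / sqrt(n)."""
    margin = t_value * s / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# n = 14, mean 2.14, s = 1.29, table value t = 3.012 (99% confidence, 13 d.f.)
lcl, ucl = t_interval(sample_mean=2.14, s=1.29, n=14, t_value=3.012)
print(round(lcl, 2), round(ucl, 2))  # 1.1 3.18
```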
$z\sigma_{\bar{x}} = 500$
At 95% level of confidence, we know from the ‘z’ table that ‘z’ is 1.96.
Therefore,
$1.96\,\sigma_{\bar{x}} = 500$
$\sigma_{\bar{x}} = 500 / 1.96 = 255$
Now, if the standard error of the mean is 255, that leads us to:
$\sigma_{\bar{x}} = \sigma / \sqrt{n} = 255$
$1500 / \sqrt{n} = 255$
Therefore,
$$n = \left(\frac{1500}{255}\right)^2 = 34.6$$
This implies that ‘n’ should be greater than 34.6, that is, at least 35, if the
university wants to achieve the precision with which it wants to conduct the
survey.
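The sample size calculation above can be expressed as a small Python function. The name `required_sample_size` is an assumption for illustration; the inputs are the population standard deviation (1500), the permissible error (500) and the ‘z’ value (1.96).

```python
import math

def required_sample_size(sigma, permissible_error, z):
    """Smallest n for which z * sigma / sqrt(n) does not exceed the permissible error."""
    std_error = permissible_error / z      # largest acceptable standard error
    return math.ceil((sigma / std_error) ** 2)

n = required_sample_size(sigma=1500, permissible_error=500, z=1.96)
print(n)  # 35
```

Rounding up with `math.ceil` reflects the fact that a sample size must be a whole number at least as large as the computed value.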
8.9 Summary
Let us recapitulate the important concepts discussed in this unit:
The point estimates and interval estimates are the foundations for
inferential statistics in estimation and hypothesis testing.
Point estimate is a single number that is used to estimate an unknown
population parameter.
Interval estimate is a range of values used to estimate a population
parameter.
If the sample size is less than 30 and the population standard deviation
is not known, we use the Student’s ‘t’ distribution for estimations.
8.10 Glossary
Confidence interval estimate: An interval estimate for a parameter,
constructed from a set of data, that provides a range of values
around an estimate to show how precise the estimate is. The confidence
level associated with the interval, usually 90%, 95%, or 99%, is the
percentage of times in repeated sampling that the intervals will contain the
true value of the unknown parameter.
Degrees of freedom: The number of values in the final calculation of a
statistic that are free to vary, frequently referred to in the organisation of
tables of statistical distributions used in undertaking significance tests, for
example, the t-distribution.
Estimation: The process of using a sample to estimate features of a
population.
Interval estimate: Interval estimate is a range of values used to estimate a
population parameter.
Interval estimates of the proportion: Statisticians often use a sample to
estimate the proportion of occurrences in a population.
Activity:
1. Which of the following is not a desirable property of a point
estimator:
i) Consistency
ii) Efficiency
iii) Sufficiency
iv) Bias
8.12 Answers
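$$\sigma_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{615}{\sqrt{125}} = 55.01$$
i) $\bar{X} \pm 1\sigma_{\bar{x}}$ = 3250 ± 55.01 = 3194.99 and 3305.01 to be 68.3%
certain.
ii) 95.5% certain means $\bar{X} \pm 2\sigma_{\bar{x}}$ = 3250 ± 110.02, giving a range
between 3139.98 and 3360.02.
3. The required lower and upper class intervals are:
i) $\bar{X} \pm 0.74\sigma_{\bar{x}}$ ii) $\bar{X} \pm 1.15\sigma_{\bar{x}}$
iii) $\bar{X} \pm 1.88\sigma_{\bar{x}}$ iv) $\bar{X} \pm 2.33\sigma_{\bar{x}}$
4.
i. $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}$ as $\frac{n}{N} > 0.05$
$$\sigma_{\bar{x}} = \frac{1.368}{\sqrt{60}} \sqrt{\frac{540 - 60}{540 - 1}} = 0.167$$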
Terminal Questions
1. The mean and standard deviation are 296.583 and 40.751.
2. i) 0.181
ii) 6.019, 6.381
3. i) 112.4 ± 1.697
ii) 112.4 ± 2.234
Activity Solution
1. iv) Bias
2. i) Sample size
Will the conclusion change if the data for marketing and finance are
interchanged?
Case Study 2 – Study Consumer Behaviour
An advertisement company is interested in studying the consumers’
behaviour in the context of purchase decision of jeans in the Lee market.
This company is aiming to be a major player in the Lee market that is
characterised by intense competition. It would like to know in particular
whether the income level of the consumer influences their choice of
brand. Currently there are four brands in the market. ‘A’ and ‘B’ are the
premium brands while ‘C’ and ‘D’ are the economy brands.
A stratified random sampling procedure was adopted to cover the entire
market using income as the basis of selection. The categories that were
used in classifying income level were: lower, upper, middle and high. A
sample of 700 consumers participated in this study. The data depicted in the
following table 8.8, emerged from the study.
Analyse the above data to test independence of brand and income level.
Further, the marketing manager faces a dilemma in selecting the appropriate
colours for jeans. For this, he wishes to compare five different colours of
jeans. He is interested in knowing the most preferred colour. A random
sample of 500 consumers reveals the following observations.
Table 8.9: Preference of consumers on colour of Jeans
Does the consumer preference for jeans colours show any significant
difference?
9.1 Introduction
In the previous unit, estimation, we have studied about the estimation of the
parameter from the samples and the methods of estimation. In this unit,
Testing of hypothesis, we will study about hypothesis and the testing of
hypothesis. Estimation is about estimating the parameters and finding out
9.1.2 Assumptions
Although hypothesis testing sounds like a formal statistical term that is
completely unrelated to business decision making, managers in fact
propose and test hypotheses all the time. For example, “if we drop the price
of this car model by Rs.1,500, we will sell 50,000 cars this year” is a
hypothesis. To test this hypothesis, total car sales till the end of the year
have to be counted.
Managerial hypotheses are based on intuition; the marketplace decides
whether the manager’s intuitions were correct. Hypothesis testing is about
making inferences about a population from only a small sample. The bottom
line in hypothesis testing is to ask ourselves (and then decide)
whether a population like this one would be likely to produce a sample like
the one we are looking at.
Example 1
We want to test the hypothesis, that the population mean is equal to 500.
We would symbolise it as follows and read it as,
The null hypothesis is that the population mean = 500, which is written as,
H0: µ = 500
Example 3
If we want to test if the attribute of educational qualification has any
influence on the income of an individual, we make null hypothesis as:
H0: Educational qualification has no influence on the income of an
individual
and alternative hypothesis is
H1: Educational qualification has an influence on the income of the
individual
Type I error
The combinations are:
If the null hypothesis is true, and the test result leads us to accept it, then
we have made a right decision.
If the null hypothesis is true, and the test result leads us to reject it, then we
have made a wrong decision (Type I error). It is also known as
producer’s risk, denoted by α.
If the null hypothesis is false, and the test result leads us to accept it, then we
have made a wrong decision (Type II error). It is known as consumer’s
risk, denoted by β. 1 – β is called the power of the test.
If the null hypothesis is false, and the test result leads us to reject it, then we
have made a right decision.
Table 9.2: Conditions for Using the Normal and ‘t’ Distributions in
Testing Hypothesis about Means
Condition | When the Population Standard Deviation is known | When the Population Standard Deviation is not known
Sample size ‘n’ is larger than 30. | Normal distribution, z–table | Normal distribution, z–table
Sample size ‘n’ is 30 or less and we assume the population is normal or approximately so. | Normal distribution, z–table | ‘t’ distribution, ‘t’ table
One more rule has to be kept in mind, when testing the hypothesised values
of a mean. As in estimation, use the finite population multiplier whenever the
population is finite in size, sampling is done without replacement, and the
sample is more than five percent of the population.
A left-tailed test is one of two kinds of one-tailed tests. The other kind of
one-tailed test is a right-tailed test (or an upper-tailed test). An upper-tailed
test is used when the hypothesis is H1: µ > µ0. Only values of the sample
mean that are significantly above the hypothesised population mean will
cause us to reject the null hypothesis in favour of the alternative hypothesis.
Figure 9.3 depicts an upper-tailed test where the rejection region is in the
upper tail of the distribution of the sample mean.
Tests for proportion and other parameters are similarly discussed; rejection
regions are similarly identified with reference to the given level of
significance and appropriate distribution.
In each example of hypothesis testing, when we accept a null hypothesis on
the basis of sample information, we are really saying that there is no
statistical evidence to reject it. We are not saying that the null hypothesis is
true. The only way to prove a null hypothesis is to know the exact value of
the population parameter or the population distribution and that is not
possible with just sampling. Thus, we accept the null hypothesis and behave
as if it is true simply because we can find no evidence to reject it.
Example 4
The hypothesis to be tested is H0: µ = 100, against the alternative
hypothesis H1: µ ≠ 100, with sample size n = 20 and population standard
deviation σ = 2.5. Here the sample size is smaller than 30 but the population
standard deviation is given; hence, to test the hypothesis, the probability
distribution used is the ‘normal distribution’.
Example 5
The hypothesis to be tested is H0: µ = 10 against the alternative
hypothesis H1: µ > 10, with sample size n = 20; the population standard
deviation is not known. Here the sample size is smaller than 30 and the
population standard deviation is not given; hence, to test the hypothesis,
the probability distribution used is the ‘t-distribution’.
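The rule applied in Examples 4 and 5 can be written as a small decision function. This is only a sketch of the conditions in Table 9.2; the function name `choose_distribution` and its parameter names are assumptions introduced for illustration.

```python
def choose_distribution(n, sigma_known, population_normal=True):
    """Pick the distribution for testing a hypothesis about a mean (Table 9.2)."""
    if n > 30:
        # Large sample: normal distribution whether or not sigma is known
        return "normal distribution, z-table"
    if not population_normal:
        # Table 9.2 covers small samples only when the population is (approximately) normal
        raise ValueError("small sample from a non-normal population is not covered")
    if sigma_known:
        return "normal distribution, z-table"
    return "t distribution, t-table"

print(choose_distribution(n=20, sigma_known=True))   # Example 4
print(choose_distribution(n=20, sigma_known=False))  # Example 5
```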
difference in
p1 = first sample proportion
( p1 p 2 )
proportions of Z
Test 4: Test between proportions – when populations are similar with respect to a given attribute
Test statistic:
$$Z = \frac{p_1 - p_2}{\sqrt{p_0 q_0 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
Notes: $p_1$ = first sample proportion, $p_2$ = second sample proportion,
$n_1$ = first sample size, $n_2$ = second sample size,
$$p_0 = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$$
and $q_0 = 1 - p_0$
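Table 9.3b: Statistics for Testing the Hypothesis on Mean; Large Sample
Case
Test 5: Test for specified mean – infinite population, n>30 and population variance(s) known
Test statistic:
$$Z = \frac{\bar{X} - \mu}{\sigma_p / \sqrt{n}}$$
Notes: µ = Population mean, $\bar{X}$ = Sample mean, $\sigma_p$ = Population S.D.
In case $\sigma_p$ is not known, we use s in its place, calculating
$$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$
Test 6: Test for specified mean – finite population, n>30 and population variance(s) known
Test statistic:
$$Z = \frac{\bar{X} - \mu}{\frac{\sigma_p}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}}$$
Notes: µ = Population mean, $\bar{X}$ = Sample mean, $\sigma_p$ = Population S.D.
In case $\sigma_p$ is not known, we use s in its place, calculating
$$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$
Test 7: Test for difference in means – different population, n>30 and
Test statistic:
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_{p1}^2}{n_1} + \frac{\sigma_{p2}^2}{n_2}}}$$
Notes: $\bar{X}_1$ and $\bar{X}_2$ are sample means for the first and second
samples respectively. $n_1$ = first sample size, $n_2$ = second sample size,
$$s_1 = \sqrt{\frac{\sum (X_{1i} - \bar{X}_1)^2}{n_1 - 1}}, \qquad s_2 = \sqrt{\frac{\sum (X_{2i} - \bar{X}_2)^2}{n_2 - 1}}$$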
$d_1 = \bar{X}_1 - \bar{X}_{12}$
$d_2 = \bar{X}_2 - \bar{X}_{12}$
$$\bar{X}_{12} = \frac{N_1 \bar{X}_1 + N_2 \bar{X}_2}{N_1 + N_2}$$
Solved problem 1
XYZ Press hypothesises that the average life of its latest web-offset press is
14,500 hours. They know the standard deviation of the press life is 2,100
hours. From a sample of 25 presses, the company finds a sample mean of
13,000 hours. At a significance level of 0.01, should the company conclude
that the average life of the presses is less than the hypothesised 14,500
hours?
Solution
The procedure is described here:
1. Null hypothesis H0: µ = 14,500
Alternate hypothesis H1: µ < 14,500 (one-tailed test)
2. Level of significance α = 0.01, Ztab = –2.33 and R: Z < –2.33
3. Test statistic
$$Z = \frac{\bar{X} - \mu}{\sigma_p / \sqrt{n}}$$
4. Given µ = 14,500, $\bar{X}$ = 13,000, $\sigma_p$ = 2,100 and n = 25,
$$\frac{\sigma_p}{\sqrt{n}} = \frac{2100}{\sqrt{25}} = \frac{2100}{5} = 420$$
$$Z_{cal} = \frac{13000 - 14500}{420} = -3.57$$
5. Conclusion: Since Zcal (–3.57) < Ztab (–2.33) and is in the rejection region,
H0 is rejected. In other words, we accept that the average life of the
press is significantly less than 14,500 hrs at 1% level of significance.
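The steps of this large-sample test can be sketched in Python. The function names `z_statistic` and `reject_null` are assumptions for illustration; the critical value is supplied from the ‘z’ table as a positive number and the tail of the test is named explicitly.

```python
import math

def z_statistic(sample_mean, mu0, sigma, n):
    """Z = (sample mean - hypothesised mean) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

def reject_null(z_cal, z_tab, tail):
    """Decide whether z_cal falls in the rejection region; z_tab is the positive table value."""
    if tail == "left":
        return z_cal < -z_tab
    if tail == "right":
        return z_cal > z_tab
    if tail == "two":
        return abs(z_cal) > z_tab
    raise ValueError("tail must be 'left', 'right' or 'two'")

# Press life: mu0 = 14,500, sigma = 2,100, n = 25, sample mean = 13,000, alpha = 0.01
z_cal = z_statistic(sample_mean=13000, mu0=14500, sigma=2100, n=25)
print(round(z_cal, 2), reject_null(z_cal, z_tab=2.33, tail="left"))  # -3.57 True
```

The same two functions cover the two-tailed case by passing `tail="two"` with the corresponding table value.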
Solved problem 2
Theatre owners in India know that a hit movie ran for an average of 84 days,
with a standard deviation of 10 days in each city the movie was screened. A
particular movie distributor was interested in comparing the popularity of the
movie in his/her region with that of the population. The distributor randomly
chose 75 theatres in the region and found a popular movie ran for 81.5
days.
1) State appropriate hypothesis for testing whether there was a significant
difference between theatres in the distributor’s region and the
population.
2) At 1% significance level, test this hypothesis.
Solution
The procedure is explained in the form of steps:
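1. Null hypothesis H0: µ = 84
Alternate hypothesis H1: µ ≠ 84 (two-tailed test)
2. Level of significance α = 0.01, Ztab = 2.58 and R: |Z| > 2.58
3. Test statistic
$$Z = \frac{\bar{X} - \mu}{\sigma_p / \sqrt{n}}$$
4. Given µ = 84, $\bar{X}$ = 81.5, $\sigma_p$ = 10 and n = 75, so that $\sigma_p / \sqrt{n} = 10 / \sqrt{75} = 1.1547$
$$Z_{cal} = \frac{81.5 - 84}{1.1547} = -2.165$$
5. Conclusion: Since |Zcal| (2.165) < Ztab (2.58), Zcal is not in the rejection
region, and H0 is accepted at 1% level of significance.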
Solved Problem 3
A ketchup manufacturer is in the process of deciding whether to produce a
new extra spicy brand of ketchup. In a survey of 6000 households, the
company’s market research team found that, 355 households would buy the
extra spicy brand. A more extensive study carried out 2 years ago showed
that 5% of the households would buy the brand then. At 2% level of
significance, should the company conclude that there is an increased
interest in the extra spicy flavour?
Solution
The procedure is explained in the following steps:
1. Null hypothesis H0: p = 0.05
Alternate hypothesis H1: p > 0.05 (one-tailed test)
2. Level of significance α = 0.02, Ztab = 2.05 and R: Z > 2.05
3. Test statistic
$$Z = \frac{\hat{p} - p}{\sqrt{\frac{pq}{n}}}$$
4. Given p = 0.05, $\hat{p}$ = 355/6000 = 0.0592, n = 6000, q = 1 – p = 1 – 0.05 =
0.95
$$Z_{cal} = \frac{0.0592 - 0.05}{\sqrt{\frac{0.05 \times 0.95}{6000}}} = 3.27$$
5. Conclusion: Since Zcal (3.27) > Ztab (2.05), Zcal is in the rejection
region; H0 is rejected and we conclude that there is an increase in the
proportion of the population having an interest in the new flavour.
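A short Python sketch of the proportion test follows. The function name `proportion_z` and the optional `population_size` argument (which applies the finite population multiplier) are assumptions for illustration. Carrying the sample proportion 355/6000 at full precision gives a Z value of about 3.26, slightly different from a hand calculation with rounded intermediate values; the conclusion is unaffected.

```python
import math

def proportion_z(p_hat, p0, n, population_size=None):
    """Z statistic for a sample proportion p_hat against a hypothesised proportion p0."""
    std_error = math.sqrt(p0 * (1 - p0) / n)
    if population_size is not None:
        # Finite population multiplier sqrt((N - n) / (N - 1))
        std_error *= math.sqrt((population_size - n) / (population_size - 1))
    return (p_hat - p0) / std_error

# Ketchup survey: 355 of 6000 households, hypothesised proportion 0.05, table value 2.05
z_cal = proportion_z(p_hat=355 / 6000, p0=0.05, n=6000)
print(round(z_cal, 2), z_cal > 2.05)  # 3.26 True
```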
Solved Problem 4
Microsoft estimated that out of 10,000 potential software buyers, 35% wait
to purchase the new OS Windows Vista, until an upgrade has been
released. After an advertising campaign to reassure the public was
released, Microsoft surveyed 3000 buyers and found 950 who are still
skeptical. At 5% level of significance, can the company conclude that the
proportion of skeptical people has decreased?
Solution
The procedure is explained in the following steps:
1. Null hypothesis H0: p = 0.35
Alternate hypothesis H1: p < 0.35 (one-tailed test)
2. Level of significance α = 0.05, Ztab = –1.645 and R: Z < –1.645
3. Test statistic
$$Z = \frac{\hat{p} - p}{\sqrt{\frac{pq}{n}} \sqrt{\frac{N - n}{N - 1}}}$$
4. Given $\hat{p}$ = 950/3000 = 19/60 = 0.317, p = 0.35, q = 1 – p = 1 – 0.35 = 0.65,
N = 10,000, n = 3000
Solved problem 5
A machine is designed to pack 200ml of a medicine with a standard
deviation of 5ml. A sample of 100 bottles when measured had a mean
content of 201.3ml. Test whether the machine is functioning properly (use
5% level of significance).
Solution
The procedure is explained in the following steps:
1. Null hypothesis H0: µ = 200
Alternate hypothesis H1: µ ≠ 200 (two-tailed test)
2. Level of significance α = 0.05, Ztab = 1.96 and R: |Z| > 1.96.
3. Test statistic
$$Z = \frac{\bar{X} - \mu}{\sigma_p / \sqrt{n}}$$
4. Given µ = 200, $\bar{X}$ = 201.3, $\sigma_p$ = 5 and n = 100,
$$\frac{\sigma_p}{\sqrt{n}} = \frac{5}{\sqrt{100}} = 0.5$$
$$Z_{cal} = \frac{201.3 - 200}{0.5} = \frac{1.3}{0.5} = 2.60$$
5. Conclusion: Since Zcal (2.60) > Ztab (1.96) and Zcal is in the rejection
region, H0 is rejected. Hence at 5% level of significance, we reject the null
hypothesis and conclude that the machine is not functioning properly.
It should be noted that the methods and theory of small samples are
applicable to large samples, but the reverse is not true.
$$f(t) = C \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$$
where,
C = Constant required to make the area under the curve equal to unity.
ν = n – 1, the degrees of freedom.
4. The value of ‘t’ ranges from –∞ to +∞
5. ‘ν’ is called the parameter of the distribution
6. It is symmetrical about the mean
7. Its mean is zero
8. Variance of the distribution is greater than one
9. It has larger areas at the tails compared to the normal distribution and
lower height at the mean.
10. It tends to a normal distribution as n → ∞.
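Properties 9 and 10 can be checked numerically with the sketch below. It assumes the standard closed form of the constant, C = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)), which is a well-known result and is not derived in this unit; the function names `t_pdf` and `normal_pdf` are also assumptions for illustration.

```python
import math

def t_pdf(t, nu):
    """Density of the 't' distribution with nu degrees of freedom."""
    # log of C, using lgamma to avoid overflow for large nu
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log(1 + t * t / nu))

def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

# Height at the mean (t = 0) and in the tail (t = 3) for increasing degrees of freedom
for nu in (5, 30, 1000):
    print(nu, round(t_pdf(0, nu), 4), round(t_pdf(3, nu), 4))
print("normal", round(normal_pdf(0), 4), round(normal_pdf(3), 4))
```

For small ν the height at zero is below the normal height and the tail value is above it; as ν grows, both approach the normal values.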
9.7.1 Uses of ‘t’ test
The ‘t’ test is used:
To test a specified value
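Test No. / Description of Test / Test Statistics / Notes
Test 1: Test for specified value – infinite population with d.f. = n – 1, population variance unknown
Test statistic:
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$
Notes: $\bar{X}$ is the sample mean, µ = hypothesised value of population mean,
$$s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$
Test 2: Test for specified value – finite population with d.f. = n – 1, population variance unknown
Test statistic:
$$t = \frac{\bar{X} - \mu}{\frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}}$$
Notes: N = Population size
Test 3: Test between values – independent samples with d.f. = n1 + n2 – 2, population variances not known but assumed to be equal
Test statistic:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sum (X_{1i} - \bar{X}_1)^2 + \sum (X_{2i} - \bar{X}_2)^2}{n_1 + n_2 - 2} \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
or, equivalently,
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
Notes: $\bar{X}_1$ = first sample mean, $\bar{X}_2$ = second sample mean,
$n_1$ and $n_2$ are sizes of the first and second sample respectively.
Test 4: Paired ‘t’ test (dependent samples) with d.f. = n – 1
Test statistic:
$$t = \frac{\bar{D}}{\sigma_{diff} / \sqrt{n}}, \text{ where } \sigma_{diff} = \sqrt{\frac{\sum D_i^2 - \bar{D}^2 \cdot n}{n - 1}}$$
Notes: $\bar{D}$ = Mean of difference, n = sample size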
Solved problem 6
A random sample of 10 bags of fertilisers is found to have the following
weight (kg):
45, 49, 50, 49, 44, 52, 48, 45, 46, 45
Test at 5% level of significance whether the average packing weight can be
taken as 50 kg.
Solution: Table 9.6 depicts the frequency table for solved
problem 6.
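Table 9.6: Frequency Table
$X_i$    $X_i - \bar{X}$    $(X_i - \bar{X})^2$
45       -2.3                5.29
49        1.7                2.89
50        2.7                7.29
49        1.7                2.89
44       -3.3               10.89
52        4.7               22.09
48        0.7                0.49
45       -2.3                5.29
46       -1.3                1.69
45       -2.3                5.29
∑ $X_i$ = 473               ∑ $(X_i - \bar{X})^2$ = 64.1
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{473}{10} = 47.3$$
The sample variance is given by,
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{64.1}{10 - 1} = 7.12$$
s = 2.6687
$$t = \frac{\bar{X} - \mu}{\frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}}$$
Given n = 10, N = 1000, $\bar{X}$ = 47.3, $\frac{s}{\sqrt{n}}$ = 0.8439
$$t_{cal} = \frac{\bar{X} - \mu}{\frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}} = \frac{47.3 - 50}{0.8439 \sqrt{\frac{1000 - 10}{1000 - 1}}} = -3.2138$$
4. Conclusion: Since |tcal (–3.2138)| > |ttab (2.262)|, tcal is in the rejection
region and thus H0 is rejected.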
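The calculation with the finite population multiplier can be sketched in Python using the ten bag weights. The function name `one_sample_t` is an assumption for illustration, and the table value 2.262 (5% level, 9 degrees of freedom) is supplied rather than computed.

```python
import math

def one_sample_t(values, mu0, population_size=None):
    """t statistic for a specified mean; applies the finite population multiplier if N is given."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    std_error = s / math.sqrt(n)
    if population_size is not None:
        std_error *= math.sqrt((population_size - n) / (population_size - 1))
    return (mean - mu0) / std_error

weights = [45, 49, 50, 49, 44, 52, 48, 45, 46, 45]
t_cal = one_sample_t(weights, mu0=50, population_size=1000)
print(round(t_cal, 2), abs(t_cal) > 2.262)  # -3.21 True
```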
Solved Problem 8
Average tensile strength of nine samples of paper is found to be 15.8 units
and variance is 10.3. Can we say at 1% level of significance that it is a
random sample drawn from a population whose mean tensile strength is
17.5?
Solution
The steps are described as follows:
1. Null hypothesis H0: $\mu = 17.5$
Alternate hypothesis H1: $\mu \neq 17.5$
2. Level of significance 1% and degrees of freedom (d.f.) = n - 1 = 9 - 1 = 8
ttab = 3.36 and R: |t| > 3.36
3. Test statistics
$$t = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$
Given $n = 9$, $\bar{X} = 15.8$, $\mu = 17.5$, $s^2 = 10.3$,
$$\frac{s}{\sqrt{n}} = \frac{3.2094}{\sqrt{9}} = 1.0698$$
$$t_{cal} = \frac{15.8 - 17.5}{1.0698} = -1.5891$$
4. Conclusion: Since |tcal (-1.5891)| < |ttab (3.36)|, Ho is accepted.
The sample can be considered a random sample drawn from a population with mean tensile strength 17.5 at 1% level of significance.
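When only summary figures (mean, variance and sample size) are available, as in Solved Problem 8, the statistic can be computed directly from them. The following is a minimal Python sketch; the function name `t_from_summary` is an illustrative assumption, and the table value 3.36 is the one quoted in step 2.

```python
import math


def t_from_summary(sample_mean, mu0, sample_variance, n):
    """t statistic for H0: mu = mu0 from summary statistics."""
    standard_error = math.sqrt(sample_variance / n)
    return (sample_mean - mu0) / standard_error


t_cal = t_from_summary(sample_mean=15.8, mu0=17.5, sample_variance=10.3, n=9)

T_TAB = 3.36  # two-tailed 1% table value for 8 d.f., as quoted in the text
in_rejection_region = abs(t_cal) > T_TAB
print(f"t_cal={t_cal:.4f}, in rejection region: {in_rejection_region}")
```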
Solved Problem 9
A sales manager wants to know whether a special promotional campaign is
a success. Table 9.7 depicts the data. Test at 5% level of significance,
whether it is a success?
Table 9.7: Sales Data Before and After the Campaign
Retail Outlets 1 2 3 4 5 6
Sales before campaign 50 48 31 42 28 53
Sales after campaign 56 55 30 45 29 58
Solution
Table 9.7a depicts the frequency table calculated for the sales data before
and after the campaign.
Table 9.7a: Frequency Table for the Sales Data Before and After the Campaign
Before Campaign (Xi)    After Campaign (Yi)    Di = Yi - Xi    Di²
50                      56                     6               36
48                      55                     7               49
31                      30                     -1              1
42                      45                     3               9
28                      29                     1               1
53                      58                     5               25
                                               ∑ Di = 21       ∑ Di² = 121
Mean of differences:
$$\bar{D} = \frac{\sum D_i}{n} = \frac{21}{6} = 3.5$$
$$\sigma_{diff} = \sqrt{\frac{\sum D_i^2 - \bar{D}^2 \cdot n}{n - 1}} = \sqrt{\frac{121 - (3.5)^2 \cdot 6}{6 - 1}} = 3.08$$
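The paired-test quantities above can be reproduced as follows. This is a minimal Python sketch under the formulas of row 4 of the table of t tests; the function name `paired_t` is an illustrative assumption. The worked solution above stops at the standard deviation of the differences, so the t value is computed here only by applying the row 4 formula.

```python
import math


def paired_t(before, after):
    """Return (mean difference, sigma_diff, t) for a paired t test."""
    diffs = [a - b for b, a in zip(before, after)]  # D_i = after - before
    n = len(diffs)
    d_bar = sum(diffs) / n
    # sigma_diff = sqrt((sum D_i^2 - n * D_bar^2) / (n - 1))
    sigma_diff = math.sqrt((sum(d * d for d in diffs) - n * d_bar ** 2) / (n - 1))
    t = d_bar / (sigma_diff / math.sqrt(n))
    return d_bar, sigma_diff, t


sales_before = [50, 48, 31, 42, 28, 53]
sales_after = [56, 55, 30, 45, 29, 58]
d_bar, sigma_diff, t_cal = paired_t(sales_before, sales_after)
print(f"D_bar={d_bar:.1f}, sigma_diff={sigma_diff:.2f}, t_cal={t_cal:.2f}, d.f.={len(sales_before) - 1}")
```

For these data the sketch gives a t value of about 2.78 on 5 degrees of freedom, which is then compared with the table value appropriate to the alternative hypothesis chosen.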
Activity:
1. A random sample of 200 tins of vanaspathi has a mean weight 4.97
kgs and a standard deviation of 0.2kgs. Test at 1% level of
significance, that the tins have 5 kgs. vanaspathi
2. A random sample of 100 rods drawn from a lot of rods has a mean
length 32.7cms. and a standard deviation of 1.3cms. Can it be
concluded that the lot has a mean of 32 cms?
Solution
1. H0: µ = 5 kg
H1: µ ≠ 5 kg
Level of significance = 0.01, Ztab = 2.58 and R: |Z| > 2.58
Test statistics
$$Z = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$
Given $\mu = 5$, $\bar{X} = 4.97$, $s = 0.2$, $n = 200$
$$Z_{cal} = \frac{4.97 - 5}{0.2 / \sqrt{200}} = -2.12$$
Conclusion: Since |Zcal| < Ztab, we accept H0 at 1% level of
significance and conclude that the tins have 5 kgs of vanaspathi.
2. H0: µ = 32
H1: µ ≠ 32
Level of significance = 0.05, Ztab = 1.96 and R: |Z| > 1.96
Test statistics
$$Z = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$
Given $\mu = 32$, $\bar{X} = 32.7$, $s = 1.3$, $n = 100$
$$Z_{cal} = \frac{32.7 - 32}{1.3 / \sqrt{100}} = 5.38$$
Conclusion: Since |Zcal| > Ztab, we reject H0 at 5% level of significance
and conclude that the lot does not have a mean of 32 cms.
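Both activity solutions follow the same large-sample Z calculation, which can be sketched in Python as below. The function names are illustrative assumptions; the two-tailed P-value is obtained from the standard normal distribution through `math.erf`, which is an addition to the critical-value approach used in the solutions.

```python
import math


def z_statistic(sample_mean, mu0, s, n):
    """Large-sample Z statistic for H0: mu = mu0."""
    return (sample_mean - mu0) / (s / math.sqrt(n))


def two_tailed_p_value(z):
    """Two-tailed P-value of a Z statistic under the standard normal."""
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))  # normal CDF at |z|
    return 2.0 * (1.0 - phi)


z_tins = z_statistic(sample_mean=4.97, mu0=5, s=0.2, n=200)
z_rods = z_statistic(sample_mean=32.7, mu0=32, s=1.3, n=100)
p_tins = two_tailed_p_value(z_tins)

print(f"tins: Z={z_tins:.2f}, reject at 1%: {abs(z_tins) > 2.58}")
print(f"rods: Z={z_rods:.2f}, reject at 5%: {abs(z_rods) > 1.96}")
print(f"tins: two-tailed P-value={p_tins:.4f}")
```

The P-value for the tins lies between 0.01 and 0.05, so H0 is accepted at the 1% level used in the solution but would be rejected at the 5% level.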
9.8 Summary
Let us recapitulate the important concepts discussed in this unit:
Hypothesis testing begins with an assumption about a population parameter
and uses sample data to judge whether that assumed value may or may not
lie in the confidence interval derived from the sample.
In hypothesis testing, we must state the assumed or hypothesised value
of the population parameter before we begin sampling. The assumption
we wish to test is called the null hypothesis and is symbolised by ‘H0’.
If our sample results fail to support the null hypothesis, we must
conclude that something else is true. Whenever we reject the null
hypothesis, the conclusion we do accept is called the alternative
hypothesis and is symbolised by ‘H1’.
If the null hypothesis is true, and the test result makes us accept it, then
we have made a right decision.
If the null hypothesis is true, and the test result makes us reject it, then we
have made a wrong decision (Type I error). Its probability is known as the
producer’s risk, denoted by α.
If the null hypothesis is false, and the test result makes us accept it, then we
have made a wrong decision (Type II error). Its probability is known as the
consumer’s risk, denoted by β. 1 – β is called the power of the test.
If the null hypothesis is false, and the test result makes us reject it, we have
made a right decision.
‘t’ tests can be used for small samples (n ≤ 30) whose
population standard deviations are not known.
9.9 Glossary
Level of significance: The probability of rejecting the null hypothesis
when it is true (Type I error), fixed before the test is carried out, usually
at 0.05 (5%) or 0.01 (1%). The null hypothesis is rejected in favour of the
alternative when the P-value of the test, that is, the chance of getting a
sample like the one being analysed if the null hypothesis were true, is less
than the level of significance. A small P-value implies that getting such a
sample was highly unlikely, suggesting that the null hypothesis is probably
not true.
Null distribution: The distribution of the test statistic assuming the null
hypothesis is true.
One-tailed test: A test in which the alternative hypothesis specifies that the
population parameter is strictly greater, or strictly lesser, than a specified
value. A test in which the alternative hypothesis specifies that the parameter
is on "one side" of the null hypothesis value; a test in which H1 contains >
or <.
P-value: The value that indicates how unusual a computed test statistic
is compared with what would be expected under the null hypothesis. A small
value indicates that the null hypothesis should be rejected at any
significance level above the calculated value. For example, if the P value
equals 0.0246, we would reject the null hypothesis at the 5% significance
level, but would not reject it at the 1% significance level.
Two-tailed test: The rejection region in a two-tailed test is split between the
two tails of the distribution.
Type I error: Rejecting a true null hypothesis. The probability of a type I
error is indicated by alpha (α).
Type II error: Not rejecting a false null hypothesis. The probability of a type
II error is indicated by beta (β).
Z test for a population mean: Tests a hypothesis pertaining to the
population mean by using a z-test statistic to evaluate the magnitude of
the difference between the sample mean and the hypothesised population mean.
Z test for a population proportion: Tests a hypothesis pertaining to the
population proportion by using a z-test statistic to evaluate the magnitude of
the difference between sample proportion and hypothesised population
proportion.
9. The table 9.10 depicts the results related to the memory capacity of 10
students before and after training. Test at 5% level of significance
whether training is effective.
Roll No 1 2 3 4 5 6 7 8 9 1
Before Training 1 14 11 8 7 1 3 0 5 6
After Training 1 16 10 7 5 1 10 2 3 8
9.11 Answers
Terminal Questions
1. Zcal = 1.9457, H0 accepted
2. Zcal = 0.71, H0 accepted
3. Zcal = 0.50, H0 accepted
4. Zcal = 1.54, H0 accepted
5. Zcal = 18.75, H0 rejected
6. Zcal = 1.30, H0 accepted
7. tcal = 2.397, H0 is rejected
8. tcal = 2.21, H0 is rejected
9. tcal = 1.365, H0 is rejected
References:
Bevington, P.R. & Robinson, D.K. Data Reduction and Error Analysis for
the Physical Sciences (3rd Edition), (Paperback).
Cowan, G. Statistical Data Analysis (Oxford Science Publications),
(Paperback).
Devore, J.L. Probability and Statistics for Engineering and the Sciences,
Enhanced Review Edition. (Hardcover - Jan. 29, 2008).
James, F. Statistical Methods in Experimental Physics (2nd Edition).
(Hardcover - Nov. 29, 2006).
Levin, R.I. & Rubin, D.S. (2008) Statistics for Management, Seventh
Edition, PHI Learning Private Limited.
Lyons, L. Statistics for Nuclear and Particle Physicists. (Paperback,
1989).
Mandel, J. The Statistical Analysis of Experimental Data, (Paperback).
Meyer, S.L. Data Analysis for Scientists and Engineers, (Paperback).
DeGroot, M.H. & Schervish, M.J. Probability and Statistics
(3rd Edition). (Paperback - Jan. 31, 2002).
Press, W.H., Teukolsky, S.A., Vetterling, W.T. & Flannery, B.P.
Numerical Recipes: The Art of Scientific Computing, 3rd Edition.
10.1 Introduction
In the previous unit, Testing of Hypothesis, we discussed how to test
hypotheses concerning parameters such as the mean and the proportion, using
data from either one or two samples. We used one-sample tests to determine
whether a mean or a proportion was significantly different from a
hypothesised value. In the two-sample tests, we examined the difference
between either two means or two proportions, and we tried to learn whether
this difference was significant.
Suppose, however, that we have proportions from five populations instead of
only two. In such cases, the methods for comparing proportions described for
two-sample tests do not apply; we must use the Chi-Square test (χ² test).
In this unit, Chi-Square, we will discuss the Chi-Square tests which enable
us to test whether more than two population proportions can be considered
equal. The Chi-Square test is a non-parametric test which can be applied to
categorical or qualitative data. This test can be applied when we have few
or no assumptions about the population parameter.
Actually, Chi-Square tests allow us to do a lot more than just test for the
equality of several proportions. If we classify a population into several
categories with respect to two attributes (such as age and job performance),
we can then use a Chi-Square test to determine whether the two attributes
are independent of each other. So, Chi-Square tests can be applied on a
contingency table.
Objectives:
After studying this unit, you should be able to:
describe the non-parametric method of testing hypotheses
describe the Chi-Square characteristics
identify the conditions required for applying Chi-Square test for a given
population distribution
recognise the applications of Chi-Square test
describe the steps in solving problems related to Chi-Square test
10.1.1 Relevance
Case-let
Women still earn less than men
On 27 February 2006, the Women and Work Commission (WWC) published
its report on the causes of the “gender pay gap”, or the difference between
men’s and women’s hourly pay. According to the report, British women
working full-time currently earn 17% less per hour than men. In February, the
European Commission also brought out its own report on the pay gap across
the European Union. Its findings were similar in that, on an hourly basis,
women earn 15% less than men for the same work.
In the United States, the difference in median pay between men and women
is around 20%. According to the WWC report, the gender pay gap opens
early. Boys and girls study different subjects in school, and boys’ subjects
lead to more lucrative careers. They then work in different sorts of jobs. As a
result, average hourly pay for a woman at the start of her working life is only
91% of a man’s, even though nowadays she is probably better qualified.
How do we compile this type of statistical information? We can use
Chi-Square testing for more than one type of population.
(Source: Derek L Waller Published by Elsevier Inc Ed 2008).
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \frac{(O_3 - E_3)^2}{E_3} + \dots + \frac{(O_n - E_n)^2}{E_n}$$
Where, O1, O2, O3….On are the observed frequencies and E1, E2, E3…En
are the corresponding expected or theoretical frequencies.
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
where, ‘Oi’ is the observed frequency and ‘Ei’ is the expected frequency.
Key Statistic
The observed frequencies are the frequencies obtained from the
observation, which are sample frequencies. The expected frequencies
are the calculated frequencies.
10.2.2 Steps in solving problems related to Chi-Square test
Figure 10.1 depicts the steps required for solving the problems related to
Chi-Square test.
[Figure 10.1: Steps in solving problems related to the Chi-Square test]
Key Statistic
The results of the Chi-Square test may not be reliable if the expected cell
frequencies in a contingency table are less than 5.
10.2.5 Practical applications of Chi-Square test
In inferential statistics, the Chi-Square test can also be applied for the
discrete distributions. In using Chi-Square test, we need no assumptions
regarding the shape of sampling distributions. The applications of Chi-
Square test include testing:
Example 1
Suppose we are asked to write any four numbers. We are then free to choose
all four numbers. If a restriction is imposed on the choice, namely that the
sum of these numbers should be 50, then only three numbers can be chosen
freely, because the fourth is fixed by the sum. The degrees of freedom would
now be 3.
Key Statistic
The Chi-Square curve will be on the positive side of x-axis because the
Chi-Square values are always positive.
(Number of rows − 1) × (Number of columns − 1)
= (3 − 1) × (2 − 1) = 2
Hence, a contingency table with three rows and two columns has two
degrees of freedom.
Solved Problem 2
Table 10.1 depicts the production in three shifts and the number of defective
goods that turned out in three weeks. Test at 5% level of significance
whether weeks and shifts are independent.
Table 10.1: Production of Defective Goods in Three Shifts
Shift    1st Week    2nd Week    3rd Week    Total
I 15 5 20 40
II 20 10 20 50
III 25 15 20 60
Total 60 30 60 150
Solution: Table 10.1a depicts the observed and expected values required
to calculate χ².
Table 10.1a: Observed and Expected Values
Observed Value (Oi)    Expected Value Ei = (Row Total × Column Total) / Grand Total    (Oi – Ei)²    (Oi – Ei)² / Ei
3. Test statistics
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
χ²cal = 3.6459
4. Conclusion: Since χ²cal (3.6459) < χ²tab (9.49), ‘Ho’ is accepted. Hence,
the attributes ‘week’ and ‘shifts’ are independent.
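The expected frequencies and the Chi-Square value for a contingency table can be computed as sketched below. This is a minimal Python illustration rather than the text's own procedure; the function name `chi_square_independence` is an assumption. Each expected frequency is (row total × column total) / grand total, as in the heading of Table 10.1a.

```python
def chi_square_independence(table):
    """Return (chi2, degrees of freedom) for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Rows are shifts I, II, III; columns are the three weeks (Table 10.1)
defectives = [
    [15, 5, 20],
    [20, 10, 20],
    [25, 15, 20],
]
chi2_cal, df = chi_square_independence(defectives)

CHI2_TAB = 9.49  # 5% table value for 4 d.f., as quoted in the text
print(f"chi2_cal={chi2_cal:.4f}, d.f.={df}, reject H0: {chi2_cal > CHI2_TAB}")
```

The same function applies to a 2 × 2 table, where the degrees of freedom reduce to 1.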
Solved Problem 3
Out of 1000 people surveyed, 600 belonged to urban areas and rest to rural
areas. Among 500 who visited other states, 400 belonged to urban areas.
Test at 5% level of significance whether area and visiting other states are
dependent.
Solution: Table 10.2 depicts the information given in solved problem 3 in a
tabulated form.
Table 10.2: People Belonging to Urban and Rural Areas
Other States Urban Rural Total
Visited 400 100 500
Not Visited 200 300 500
Total 600 400 1000
Table 10.2a depicts the observed and expected values for the calculation of χ².
Table 10.2a: Observed and Expected Values
Observed Value (Oi)    Expected Value Ei = (Row Total × Column Total) / Grand Total    (Oi – Ei)²    (Oi – Ei)² / Ei
400                    300                                                             10000         33.33
200                    300                                                             10000         33.33
100                    200                                                             10000         50.00
300                    200                                                             10000         50.00
                                                                                                     χ²cal = 166.66
χ²tab = 3.84
3. Test statistics
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
χ²cal = 166.66
4. Conclusion: Since χ²cal (166.66) > χ²tab (3.84), ‘Ho’ is rejected. Hence, the
‘area’ and ‘visit’ are dependent.
10.3.2 Test of goodness of fit
The test of goodness of fit measures how well a statistical model fits a set
of observations. This test measures and summarises the differences, if any,
between the observed values and the values expected under the considered
statistical model. These test results help us to know whether the sample is
drawn from the hypothesised distribution or not. The degrees of freedom are
‘n-1’, where ‘n’ is the number of categories. When the hypothesised
distribution is uniform, the expected value for each category is equal to
the average of the observed values.
Solved Problem 4
A personnel manager is interested in trying to determine whether
absenteeism is greater on one day of the week than on another day of the
week. The record for the past years is available. Table 10.3 depicts the
absenteeism for each working day over a week. Test whether absenteeism
is uniformly distributed over the week.
Table 10.3: Comparison of Data about Absenteeism
Days of Week           Monday    Tuesday    Wednesday    Thursday    Friday
Number of absentees    66        57         54           48          75
$$E_i = \frac{66 + 57 + 54 + 48 + 75}{5} = 60$$
Table 10.3a depicts the calculated expected values required for
calculation of χ² for the data related to problem 4.
Table 10.3a: Observed and Expected Values for Calculation of χ²
χ²tab = 9.49
4. Test statistics
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
χ²cal = 7.50
5. Conclusion: Since χ²cal (7.5) < χ²tab (9.49), ‘Ho’ is accepted. In other
words, we conclude at 5% level of significance that absenteeism is
uniformly distributed and is independent of the days of the week.
Solved Problem 5
According to a theory in Genetics, the proportion of beans of A, B, C and D
types in a generation should be 9:3:3:1. In an experiment with 1600 beans,
the frequency of bean of A, B, C and D type was observed to be 882, 313,
287 and 118 respectively. Does the result support the theory?
Solution: The steps for calculation of Chi-Square are described as follows:
1. Null hypothesis ‘Ho’: The result supports theory
Alternate hypothesis ‘H1’: The result does not support theory
2. Level of significance is 5% and degrees of freedom (d.f.) = (4 – 1) = 3
χ²tab = 7.81
3. Test statistics
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
Table 10.4 depicts the observed and expected values for calculation of χ²
for solved problem 5.
Table 10.4: Observed and Expected Values for Calculation of χ²
χ²cal = 4.72
4. Conclusion: Since χ²cal (4.72) < χ²tab (7.81), ‘Ho’ is accepted. Therefore,
the result supports the theory.
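Solved Problems 4 and 5 differ only in how the expected frequencies are obtained: equal shares for the uniform case and shares in the ratio 9:3:3:1 for the genetics case. The following minimal Python sketch covers both; the function name `goodness_of_fit` and its `ratios` parameter are illustrative assumptions.

```python
def goodness_of_fit(observed, ratios=None):
    """Return (expected, chi2, d.f.) for a Chi-Square goodness-of-fit test.

    ratios gives the hypothesised proportions (for example [9, 3, 3, 1]);
    when omitted, a uniform distribution over the categories is assumed.
    """
    if ratios is None:
        ratios = [1] * len(observed)
    total = sum(observed)
    ratio_sum = sum(ratios)
    expected = [total * r / ratio_sum for r in ratios]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, chi2, len(observed) - 1


# Solved Problem 4: absentees from Monday to Friday, uniform hypothesis
exp_abs, chi2_abs, df_abs = goodness_of_fit([66, 57, 54, 48, 75])

# Solved Problem 5: beans of types A, B, C, D in the ratio 9:3:3:1
exp_beans, chi2_beans, df_beans = goodness_of_fit([882, 313, 287, 118], ratios=[9, 3, 3, 1])

print(exp_abs, round(chi2_abs, 2), df_abs)
print(exp_beans, round(chi2_beans, 2), df_beans)
```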
Solved problem 6
The following table gives the classification of 100 workers according to
gender and the nature of work. Test whether nature of work is independent
of the gender of the worker.
Table 10.5
Skilled Unskilled Total
Males 40 20 60
Females 10 30 40
Total 50 50 100
χ²tab = 3.84
3. Test statistics
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$
Table 10.5a depicts the observed and expected values for calculation of χ²
for solved problem 6.
Table 10.5a: Observed and Expected Values for Calculation of χ²
χ²cal = 16.666
4. Conclusion: Since χ²cal (16.666) > χ²tab (3.84), ‘Ho’ is rejected. Therefore,
the null hypothesis that gender and nature of work are independent is
rejected; the nature of work depends on the gender of the worker.
10.3.3 Test for comparing variance
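When we have to use χ² as a test of population variance, then,
Ho: $\sigma_s^2 = \sigma_p^2$ and HA: $\sigma_s^2 \neq \sigma_p^2$
$$\chi^2 = \frac{\sigma_s^2}{\sigma_p^2} (n - 1)$$
where the subscripts ‘s’ and ‘p’ refer to the sample and the population respectively.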
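As an illustration of the variance test, the statistic can be computed as in the sketch below. The function name and the figures used (a sample of 21 items with variance 12, tested against a hypothesised population variance of 8) are hypothetical and are not taken from the text.

```python
def chi_square_variance(sample_variance, hypothesised_variance, n):
    """Return (chi2, d.f.) for testing a specified population variance.

    chi2 = (n - 1) * sample variance / hypothesised population variance
    """
    chi2 = (n - 1) * sample_variance / hypothesised_variance
    return chi2, n - 1


# Hypothetical figures, for illustration only
chi2_cal, df = chi_square_variance(sample_variance=12, hypothesised_variance=8, n=21)
print(f"chi2_cal={chi2_cal:.1f} on {df} degrees of freedom")
```

The calculated value is then compared with the table value of χ² for n – 1 degrees of freedom at the chosen level of significance.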
Activity
Objective Questions:
1. What is the appropriate test to use if you want to determine whether
there is evidence that the proportion of successes is higher in group 1
than in group 2 and we have obtained independent samples from the
two groups?
i) The Z test
ii) The Chi-Square test
iii) Both of the above
iv) None of the above
2. Which of the following values cannot occur in a Chi-Square
distribution?
i) 100.0
ii) 38.4
iii) 0.61
iv) -2.45
3. What test would you use to determine whether a set of observed
frequencies differ from their corresponding expected frequencies?
i) The t test for dependent samples
ii) The Chi-Square test
iii) The t test for independent samples
iv) The F test
4. When using the chi-square test for differences in two proportions with
a contingency table that has r rows and c columns, how many degrees
of freedom will the test statistic have?
i) n – 1
ii) n1 + n2 - 2
iii) (r - 1) x (c - 1)
iv) (r - 1) + (c – 1)
5. When testing for independence in a contingency table with 3 rows
and 4 columns, how many degrees of freedom will the test statistic
have?
i) 5
ii) 6
iii) 7
iv) 12
10.4 Summary
Let us recapitulate the important concepts discussed in this unit:
Chi-Square test is a non-parametric test. The important applications of
Chi-Square test are the tests for independence of attributes, the test of
goodness of fit and the test for specified variance.
χ² describes the magnitude of discrepancy between the observed and the
expected frequencies. The value of χ² is calculated as:
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \frac{(O_3 - E_3)^2}{E_3} + \dots + \frac{(O_n - E_n)^2}{E_n}$$
Where, O1, O2, O3….On are the observed frequencies and E1, E2,
E3…En are the corresponding expected or theoretical frequencies.
An important criterion for applying the Chi-Square test is that the sample
size should be very large.
10.5 Glossary
Chi-Square test: It is a non-parametric test, in which no rigid assumptions
about the parameters of the population are required.
Level of significance: The probability of rejecting the null hypothesis
when it is true (Type I error), fixed before the test is carried out, usually
at 0.05 (5%) or 0.01 (1%). The null hypothesis is rejected in favour of the
alternative when the P-value of the test, that is, the chance of getting a
sample like the one being analysed if the null hypothesis were true, is less
than the level of significance. A small P-value implies that getting such a
sample was highly unlikely, suggesting that the null hypothesis is probably
not true.
10.7 Answers
Terminal Questions
Discussion Questions:
i. Indicate the appropriate null and alternative hypothesis to test if the
make of automobile purchased is dependent on an individual’s
nationality?
ii. Using the critical value approach of the Chi-Square test at a 1%
significant level, does it appear that there is a relationship between
automobile purchase and nationality?
iii. Verify the result to Question 2 by using the p-value approach of the
Chi-Square test
iv. What has to be the significance level in order that there appears a
breakeven situation between dependency of nationality and
automobile preference?
v. What is your comment about the results?
References:
Bevington, P. R. & Robinson, D. K. Data Reduction and Error Analysis
for the Physical Sciences (3rd Edition). (Paperback).
Cowan, G. Statistical Data Analysis (Oxford Science Publications).
(Paperback).
Devore, J. L. Probability and Statistics for Engineering and the Sciences
Enhanced Review Edition. (Hardcover - Jan. 29, 2008).
Frodesen, A. G., Skjeggestad, O. & Tofte, H. Probability and Statistics
in Particle Physics. (Hardcover, 1979 – out of print).
James, F. Statistical Methods in Experimental Physics (2nd Edition).
(Hardcover - Nov. 29, 2006).
Levin, R. I. & Rubin, D. S. (2008) Statistics for Management, Seventh
Edition, PHI Learning Private Limited.
Lyons, L. Statistics for Nuclear and Particle Physicists. (Paperback, 1989).
11.1 Introduction
In the previous unit, we dealt with Chi-Square as a test of independence
and with the applications of the Chi-Square test. We studied the
characteristics of Chi-Square and its properties. We also discussed how to
find the Chi-Square test results for a given sampling distribution, and how
to calculate Chi-Square values for either rejecting or not rejecting the
null hypothesis. In this unit, we will deal with Analysis of Variance
(ANOVA), assumptions for F-test, and classification of ANOVA.
In the previous unit, we studied that the Chi-Square test is used for
testing the differences among sample proportions and to make inferences
whether they are from the same population distribution or not. When we want
to compare the means of more than two populations, we use the analysis of
variance to evaluate the differences between the means.
Key Statistic
The technique of analysis of variance is referred to as ANOVA.
Initially, the technique was applied in the field of Zoology and Agriculture,
but in a later stage, it was applied to other fields also. In ANOVA, the degree
of variance between two or more data as well as the factors contributing
towards the variance is studied.
In fact, ANOVA is the classification and cross-classification of statistical data
with the view of testing whether the means of specific classification differ
significantly or whether they are homogeneous.
ANOVA is a method of splitting the total variation of data into constituent
parts which measure the different sources of variations.
The total variation is split up into the following two components:
Variance within the subgroups of samples
Variation between the subgroups of the samples
Hence, the total variance is the sum of variance between the samples and
the variance within the samples. After obtaining the above two variations,
these are tested for their significance by F-test which is also known as
variance ratio test.
test the differences between variances, that is, whether two populations can
be considered to have the same variance or not. As you studied in Unit 10,
to test a specified variance we used the χ² test.
$$s_1^2 = \frac{1}{n_1 - 1} \sum (X - \bar{X})^2 \quad \text{and} \quad s_2^2 = \frac{1}{n_2 - 1} \sum (Y - \bar{Y})^2$$
where,
‘n1’ is the size of the first sample
‘n2’ is the size of the second sample
$\bar{X}$ and $\bar{Y}$ denote the sample means of the random variables ‘X’ and ‘Y’
respectively
It is also known as variance ratio test. It has two degrees of freedom, one for
numerator of the ratio and another for denominator. They are represented
by:
$\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$.
Where, ‘$\nu_1$’ and ‘$\nu_2$’ are degrees of freedom in numerator and denominator
respectively.
The degree of freedom for greater variance is represented as $\nu_1$ and for
smaller variance as $\nu_2$. By comparing the observed value of F with the
corresponding table value, we can infer whether the difference between the
variances of the samples could have arisen due to sampling fluctuations.
The parent populations from which they are drawn are normally
distributed.
The assumption that all the populations should have normal distribution is
hardly achieved in practical cases. Hence, it can be considered as a
limitation.
Solved Problem 1
Table 11.1 depicts the time taken to do a job by method I and method II by
workers. Can we conclude that the variance of time distribution for method I
and method II are the same?
Table 11.1: Time Taken by Workers to Finish a Job by Two Different Methods
Method I 27 23 16 20 26 22
Method II 33 35 34 27 42 32 38
Solution:
Step 1: Null hypothesis ‘H0’: $s_1^2 = s_2^2$, that is, the sample variances of
the two methods are equal.
$\nu_1 = 7 - 1 = 6$ and $\nu_2 = 6 - 1 = 5$
Step 3: Test Statistics
Tables 11.1a and 11.1b depict the frequency table required for the
calculation of sample means for the data given for two different methods.
Table 11.1a: Required Values of the Method I to Calculate Sample Mean
X d = X - 22 d2
27 5 25
23 1 1
16 -6 36
20 -2 4
26 4 16
22 0 0
Σd = 2 Σd 2 = 82
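$$s_1^2 = \frac{1}{n_1 - 1} \left[ \sum d^2 - \frac{(\sum d)^2}{n_1} \right] = \frac{1}{5} \left[ 82 - \frac{4}{6} \right] = 16.266$$
$$s_2^2 = \frac{1}{n_2 - 1} \left[ \sum d^2 - \frac{(\sum d)^2}{n_2} \right] = \frac{1}{6} \left[ 136 - \frac{16}{7} \right] = 22.286$$
$$F_{cal} = \frac{s_2^2}{s_1^2} = \frac{22.286}{16.266} = 1.37$$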
Step 4: Conclusion: Since Fcal (1.37) < Ftab (4.95), ‘H0’ is accepted. Hence,
there is no significant difference and the sample variances of two
methods are equal.
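The variance ratio for the two methods can be verified with the following minimal Python sketch. The function names are illustrative assumptions; the larger sample variance is placed in the numerator, as described in the text, and the table value 4.95 is the one quoted in step 4.

```python
def sample_variance(values):
    """Sample variance with divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def variance_ratio(sample_a, sample_b):
    """Return (F, nu1, nu2) with the larger variance in the numerator."""
    var_a, var_b = sample_variance(sample_a), sample_variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1


method_1 = [27, 23, 16, 20, 26, 22]
method_2 = [33, 35, 34, 27, 42, 32, 38]
f_cal, nu1, nu2 = variance_ratio(method_1, method_2)

F_TAB = 4.95  # 5% table value for (6, 5) d.f., as quoted in the text
print(f"F_cal={f_cal:.2f} with ({nu1}, {nu2}) d.f., reject H0: {f_cal > F_TAB}")
```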
Key Statistic
A table showing the source of variance, the sum of squares, degrees of
freedom, mean square (variance), and the formula for the F-ratio is
known as ANOVA table.
Key Statistic
The means of samples will not be the same if the variation between the
samples is large when compared to the variance within each group.
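$$SSC = \frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \frac{(\sum X_3)^2}{n_3} + \frac{(\sum X_4)^2}{n_4} + \dots + \frac{(\sum X_n)^2}{n_n} - \frac{T^2}{N}$$
$$MSE = \frac{SSE}{(n - k)}$$
8. Test statistics $F = \dfrac{MSC}{MSE}$; when MSE > MSC we take $F = \dfrac{MSE}{MSC}$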
9. Decision: If the computed value of F > Table (critical) value of F for
degrees of freedom (k-1, n - k) at α% (5% or 1%), then we reject H0 and
conclude that the population means are not all equal. Otherwise we accept H0
and conclude that the population means do not differ significantly.
Table 11.2 depicts the specimen of ANOVA table.
Table 11.2: ANOVA Table in one-way ANOVA
Source of Variation Sum of Squares Degree of Freedom Mean Square
Between Samples SSC k–1 MSC
Within Samples SSE n–k MSE
Total SST n-1
8 7 12
10 5 9
7 10 13
14 9 12
11 9 14
$$SSC = \frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \frac{(\sum X_3)^2}{n_3} + \frac{(\sum X_4)^2}{n_4} + \dots + \frac{(\sum X_n)^2}{n_n} - \frac{T^2}{N}$$
$$= \frac{50^2}{5} + \frac{40^2}{5} + \frac{60^2}{5} - 1500 = 1540 - 1500 = 40$$
Sum of the squares of the Error within columns (samples):
SSE = SST – SSC = 100 – 40 = 60
Variance between samples:
$$MSC = \frac{SSC}{k - 1} = \frac{40}{3 - 1} = \frac{40}{2} = 20$$
Variance within the samples:
$$MSE = \frac{SSE}{(n - k)} = \frac{60}{(15 - 3)} = 5$$
The degree of freedom = (k – 1, n – k) = (2, 12).
[k is the number of columns and n is the total number of observations]
Table 11.3b: ANOVA Table
Source of variation    Sum of squares    df    Mean square    F-value
Between                SSC = 40          2     MSC = 20       Fcal = 20/5 = 4
Within                 SSE = 60          12    MSE = 5
Total                  SST = 100         14
Table 11.4
A B C
20 18 25
21 20 28
23 17 22
16 25 28
20 15 32
Solution
Null hypothesis ‘H0’: The average yields of land under different varieties of
seed do not differ significantly.
T = Sum of all observations = 330
$$\text{Correction factor} = \frac{T^2}{N} = \frac{330^2}{15} = 7260$$
$$SST \text{ (Total Sum of the Squares)} = \text{Sum of squares of all observations} - \frac{T^2}{N}$$
$$= 20^2 + 18^2 + 25^2 + 21^2 + \dots + 32^2 - 7260 = 7590 - 7260 = 330$$
Sum of the Squares of Error between the columns (samples):
$$SSC = \frac{(\sum X_1)^2}{n_1} + \frac{(\sum X_2)^2}{n_2} + \frac{(\sum X_3)^2}{n_3} + \frac{(\sum X_4)^2}{n_4} + \dots + \frac{(\sum X_n)^2}{n_n} - \frac{T^2}{N}$$
$$SSC = \frac{100^2}{5} + \frac{95^2}{5} + \frac{135^2}{5} - 7260 = 190$$
Sum of the squares of the Error within columns (samples):
SSE = SST – SSC = 330 – 190 = 140
Variance between samples:
$$MSC = \frac{SSC}{k - 1} = \frac{190}{3 - 1} = \frac{190}{2} = 95$$
Variance within the samples:
$$MSE = \frac{SSE}{(n - k)} = \frac{140}{(15 - 3)} = \frac{140}{12} = 11.67$$
Table 11.4b depicts the ANOVA table for the solved problem 3.
Table 11.4b: ANOVA Table for Solved Problem 3
$$F_{cal} = \frac{MSC}{MSE} = \frac{95}{11.67} = 8.14$$
The table value of ‘F’, at 5% level of significance for (2, 12) [$\nu_1 = 2$, $\nu_2 = 12$]
degrees of freedom (df), is 3.88 which is less than the calculated value of ‘F’
i.e. 8.14. Therefore, the null hypothesis is rejected and we conclude that the
average yields of land under different varieties of seed show differences.
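The one-way ANOVA computations above follow a fixed sequence (correction factor, SST, SSC, SSE, mean squares, F), which can be sketched in Python as below. The function name `one_way_anova` and the dictionary keys are illustrative assumptions. The sketch always reports MSC/MSE and does not apply the text's convention of inverting the ratio when MSE > MSC.

```python
def one_way_anova(groups):
    """One-way ANOVA by the correction-factor method used in the text."""
    k = len(groups)                          # number of samples (columns)
    n = sum(len(g) for g in groups)          # total number of observations
    grand_total = sum(sum(g) for g in groups)
    correction = grand_total ** 2 / n        # T^2 / N
    sst = sum(x * x for g in groups for x in g) - correction
    ssc = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    sse = sst - ssc
    msc = ssc / (k - 1)
    mse = sse / (n - k)
    return {"SST": sst, "SSC": ssc, "SSE": sse, "MSC": msc, "MSE": mse,
            "F": msc / mse, "df": (k - 1, n - k)}


# Yields for seed varieties A, B and C (Table 11.4)
seed_a = [20, 21, 23, 16, 20]
seed_b = [18, 20, 17, 25, 15]
seed_c = [25, 28, 22, 28, 32]
result = one_way_anova([seed_a, seed_b, seed_c])

F_TAB = 3.88  # 5% table value for (2, 12) d.f., as quoted in the text
print(result)
print("reject H0:", result["F"] > F_TAB)
```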
11.4.2 Two-way ANOVA
In the two-way classification, data are classified on the basis of two factors.
For example, the agricultural output may be classified on the basis of
different varieties of seeds and also on the basis of different varieties of
fertilizers used.
Procedure for carrying out the Two-way ANOVA
1. a) Assume the means of all columns are equal. That is, the effects of all
factors in the first kind of treatment are equal.
α1 = α 2 = α 3 =..........α c
b) Assume the means of all rows are equal. That is, the effects of all
factors in the second kind of treatment are equal.
β1 = β 2 = β 3 = β 4 .......β r
$$SSC = \frac{(\sum X_{1i})^2}{n_1} + \frac{(\sum X_{2i})^2}{n_2} + \frac{(\sum X_{3i})^2}{n_3} + \frac{(\sum X_{4i})^2}{n_4} + \dots + \frac{(\sum X_{ni})^2}{n_n} - \frac{T^2}{N}$$
where $\sum X_{1i}$, $\sum X_{2i}$, $\sum X_{3i}$ ….. are Column totals.
5. For rows, SSR is calculated as:
$$SSR = \frac{(\sum X_{j1})^2}{n_1} + \frac{(\sum X_{j2})^2}{n_2} + \frac{(\sum X_{j3})^2}{n_3} + \frac{(\sum X_{j4})^2}{n_4} + \dots + \frac{(\sum X_{jn})^2}{n_n} - \frac{T^2}{N}$$
where, $\sum X_{j1}$, $\sum X_{j2}$, $\sum X_{j3}$ …… are Row totals.
8. $F_c = \dfrac{MSC}{MSE}$ and $F_r = \dfrac{MSR}{MSE}$
Degrees of freedom for Fc = {c-1, (c-1) (r-1)}
Degrees of freedom for Fr = {r-1, (c-1) (r-1)}
If MSE > MSC then we take $F_c = \dfrac{MSE}{MSC}$
If MSE > MSR then we take $F_r = \dfrac{MSE}{MSR}$
Fc is for column wise comparison
Fr is for row wise comparison
If Fc < table value of F then $\alpha_1 = \alpha_2 = \alpha_3 = \dots$
If Fr < table value of F then $\beta_1 = \beta_2 = \beta_3 = \dots$
Solved Problem 4
Three varieties of crops ‘A’, ‘B’, and ‘C’ are tested in a randomised block
design with four replications. The yields are depicted in table 11.6. Test at
0.05 level of significance whether there is a difference between replications.
Test also, at the same level of significance, whether the varieties differ
significantly.
Table 11.6: Yields of Three Crops Tested with Four Replications
Variety    Replication 1    Replication 2    Replication 3    Replication 4
A          6                4                8                6
B          7                6                6                9
C          8                5                10               9
Solution
The null hypothesis ‘H0’ is given as:
Η 0 : There is no difference between the replications or the varieties
Table 11.6a depicts the totals of yields of three crops tested with four
replications.
Table 11.6a: Totals of Yields of Three Crops Tested with Four Replications
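Variety    Replication 1    Replication 2    Replication 3    Replication 4    Total
A          6                4                8                6                24
B          7                6                6                9                28
C          8                5                10               9                32
Total      21               15               24               24               84
$$SSC = \frac{(\sum X_{1i})^2}{n_1} + \frac{(\sum X_{2i})^2}{n_2} + \frac{(\sum X_{3i})^2}{n_3} + \frac{(\sum X_{4i})^2}{n_4} + \dots + \frac{(\sum X_{ni})^2}{n_n} - \frac{T^2}{N}$$
$$SSC = \frac{21^2}{3} + \frac{15^2}{3} + \frac{24^2}{3} + \frac{24^2}{3} - 588 = 18$$
$$MSC = \frac{SSC}{(c - 1)} = \frac{18}{(4 - 1)} = \frac{18}{3} = 6$$
$$SSR = \frac{24^2}{4} + \frac{28^2}{4} + \frac{32^2}{4} - 588 = 8$$
$$MSR = \frac{SSR}{(r - 1)} = \frac{8}{(3 - 1)} = \frac{8}{2} = 4$$
$$MSE = \frac{SSE}{(r - 1)(c - 1)} = \frac{10}{(3 - 1)(4 - 1)} = \frac{10}{6} = 1.667$$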
Table 11.6b depicts the ANOVA table for data of solved problem 4.
Between columns:
Table value of ‘F’ = 4.757 at α = 0.05 and degrees of freedom (3,6).
Calculated value of ‘F’ = 3.6
Calculated value of ‘F’ < Table value of ‘F’. Therefore, we accept the
hypothesis that there is no significant difference between replications.
Between rows:
Table value of ‘F’ = 5.143 at α = 0.05 and degrees of freedom (2,6)
Calculated ‘F’ value is 2.4
Calculated ‘F’ value < Table value of ‘F’. Therefore, we accept the
hypothesis that there is no significant difference between the varieties.
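The two-way computations for Solved Problem 4 can be reproduced with the minimal Python sketch below. The function name `two_way_anova` and the dictionary keys are illustrative assumptions; rows are taken as varieties and columns as replications, with one observation per cell. The sketch reports MSC/MSE and MSR/MSE as they stand, so the text's convention of inverting a ratio when MSE is the larger mean square is left to the reader.

```python
def two_way_anova(table):
    """Two-way ANOVA (one observation per cell) by the correction-factor method."""
    r = len(table)           # number of rows
    c = len(table[0])        # number of columns
    grand_total = sum(sum(row) for row in table)
    correction = grand_total ** 2 / (r * c)           # T^2 / N
    sst = sum(x * x for row in table for x in row) - correction
    ssc = sum(sum(col) ** 2 for col in zip(*table)) / r - correction
    ssr = sum(sum(row) ** 2 for row in table) / c - correction
    sse = sst - ssc - ssr
    msc = ssc / (c - 1)
    msr = ssr / (r - 1)
    mse = sse / ((r - 1) * (c - 1))
    return {"SST": sst, "SSC": ssc, "SSR": ssr, "SSE": sse,
            "MSC": msc, "MSR": msr, "MSE": mse,
            "Fc": msc / mse, "Fr": msr / mse}


# Rows: varieties A, B, C; columns: replications 1 to 4 (Table 11.6)
yields = [
    [6, 4, 8, 6],
    [7, 6, 6, 9],
    [8, 5, 10, 9],
]
anova = two_way_anova(yields)
print({name: round(value, 3) for name, value in anova.items()})
print("replications differ:", anova["Fc"] > 4.757)  # table value for (3, 6) d.f.
print("varieties differ:", anova["Fr"] > 5.143)     # table value for (2, 6) d.f.
```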
Solved Problem 5
A performance study was conducted by the Sales Manager of an NML
Manufacturing Company on four salesmen during three seasons and the
data is depicted in table 11.7.
i) Do the salesmen significantly differ in performance?
ii) Is there a significant difference between seasons?
Use 0.05 level of significance.
Table 11.7: Performance Study of Four Salesmen
Season    Salesman-I    Salesman-II    Salesman-III    Salesman-IV
Summer    36            36             21              35
Rainy     28            29             31              32
Winter    26            28             29              29
Solution
The null hypothesis ‘H0’ is given as:
Τ 2 0
2
Correction factor = = =0
N 12
T2
Total sum of squares: SST = Sum of squares of all observations
N
= 6 + 6 + - 9 + 5 + - 2 + 1 + 1 + 2 + - 4 (2)2 (1)2 (1)2 0
2 2 2 2 2 2 2 2 2
SST = 210
SSC (between salesmen):
( X ) 2 ( X ) 2 ( X ) 2 ( X ) 2 ( X ) 2 T 2
SSC 1 i 2 i 3 i 4 i ..... ni
n n n n n N
1 2 3 4 n
SSC =
02 + 32 + - 92 62
Τ2
3 3 3 3 N
= 0 + 3 + 27 + 12 – 0 = 42
Degrees of freedom = c-1) = (4 -1) = 3
SSC 42
MSC= = = 14
(c 1) 3
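$$ SSR = \frac{(\sum X_{j1})^2}{n_1} + \frac{(\sum X_{j2})^2}{n_2} + \frac{(\sum X_{j3})^2}{n_3} + \frac{(\sum X_{j4})^2}{n_4} + \dots + \frac{(\sum X_{jn})^2}{n_n} - \frac{T^2}{N} $$
$$ SSR = \frac{8^2}{4} + \frac{0^2}{4} + \frac{(-8)^2}{4} - \frac{T^2}{N} $$
= 16 + 0 + 16 - 0 = 32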
Degrees of freedom = (r-1) = (3 -1) = 2
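$$ MSR = \frac{SSR}{(r-1)} = \frac{32}{2} = 16 $$
The error sum of squares follows by subtraction: SSE = SST - SSC - SSR = 210 - 42 - 32 = 136, with (r - 1)(c - 1) = 6 degrees of freedom, so that MSE = 136/6 = 22.67. Since MSE is larger than both MSC and MSR here, each F ratio below is formed with MSE in the numerator.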
For salesmen:
The calculated value of FC is 1.62. The table value of F for (6,3) df at 5%
level of significance is 8.94. Since the calculated value of F is less than the
table value, we accept the null hypothesis and conclude that the sales of
different salesmen do not differ significantly.
For seasons
The calculated value of FR is 1.42. The table value of F for (6,2) df at 5%
level of significance is 19.3. Since the calculated value of F is less than the
table value, we accept the null hypothesis and conclude that there is no
significant difference in the seasons so far as sales are concerned.
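The arithmetic of Solved Problem 5 can be checked with a short script. The following is a minimal sketch, not a definitive implementation: the function name `two_way_anova` and the `origin` parameter are illustrative assumptions, the data are those of table 11.7, and the F ratios follow the text's convention of dividing the larger mean square by the smaller.

```python
# Sales from table 11.7: rows are seasons, columns are the four salesmen.
sales = [
    [36, 36, 21, 35],  # Summer
    [28, 29, 31, 32],  # Rainy
    [26, 28, 29, 29],  # Winter
]


def two_way_anova(table, origin=0):
    """Return sums of squares and mean squares for a two-way layout.

    `origin` (hypothetical helper parameter) is subtracted from every value
    before the calculation; coding does not change any sum of squares.
    """
    coded = [[x - origin for x in row] for row in table]
    r, c = len(coded), len(coded[0])
    n = r * c
    total = sum(sum(row) for row in coded)
    cf = total ** 2 / n                                        # correction factor T^2 / N
    sst = sum(x ** 2 for row in coded for x in row) - cf
    ssc = sum(sum(col) ** 2 for col in zip(*coded)) / r - cf   # between columns
    ssr = sum(sum(row) ** 2 for row in coded) / c - cf         # between rows
    sse = sst - ssc - ssr                                      # residual
    return {
        "SST": sst, "SSC": ssc, "SSR": ssr, "SSE": sse,
        "MSC": ssc / (c - 1),
        "MSR": ssr / (r - 1),
        "MSE": sse / ((r - 1) * (c - 1)),
    }


result = two_way_anova(sales, origin=30)

# Larger mean square over the smaller one, as done in the text.
f_salesmen = max(result["MSC"], result["MSE"]) / min(result["MSC"], result["MSE"])
f_seasons = max(result["MSR"], result["MSE"]) / min(result["MSR"], result["MSE"])

print(result)
print(round(f_salesmen, 2), round(f_seasons, 2))
```

Running the same function with `origin=0` gives identical sums of squares, which illustrates why subtracting a constant is a safe simplification.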
11.5 Summary
Let us recapitulate the important concepts discussed in this unit:
ANOVA is a statistical technique used to compare three or more sample means by analysing variances. It helps to judge whether the samples come from populations having the same mean or not.
ANOVA is classified into one-way ANOVA and two-way ANOVA.
ANOVA is a parametric test, as it assumes normality of the population distributions and deals with means.
The F-test is conducted for performing ANOVA. The F-test is used to test the equality of two variances, whereas ANOVA is used to test the equality of several means. The F-distribution has a pair of degrees of freedom.
The assumptions for applying the F-test are that the random samples must be independent of each other and drawn from normally distributed populations.
11.6 Glossary
Analysis of Variance (ANOVA): The process of splitting the variation of a group of observations into assignable causes and setting up various significance tests.
One-way or one-factor ANOVA: When the source of variation in the observations is primarily due to one factor.
Post hoc test: A test carried out based on the result of the earlier test.
Two-way or two-factor ANOVA: When there are two factors as sources of variation in the observations.
Employee 1 15 17 14 12
Employee 2 12 10 13 17
Employee 3 11 14 13 15 12
Employee 4 13 12 12 14 10 9
2. Four makes of bulbs were tested for their length of life (in ‘000 hours)
and the data obtained is depicted in table 11.9. Test whether the length
of their life is significantly different.
Table 11.9: Four Different Makes of Bulbs with Their Length of Life
Make I Make II Make III Make IV
20 19 21 15
23 15 19 17
18 17 20 16
17 20 17 18
16 16
3. Table 11.10 depicts the data on production rate by five workmen on four
machines. Test whether the rate is significantly different due to workers
and machines.
Table 11.10: Production Rate of Five Workmen on Four Machines
Machines Workmen
I II III IV V
1 46 48 36 35 40
2 40 42 38 40 44
3 49 54 46 48 51
4 38 45 34 35 41
11.8 Answers
Terminal Questions
1. Fcal = 1.47, not significant
2. Fcal = 1.67 not significant
3. Fcal = 8.20 for workmen
Fcal = 19.20 for machines
Both are significant
4. Fcal = 4.08, not significant
5. Fcal = 6.91 for students ( significant)
Fcal = 4.02 for consignments (not significant)
References:
Bevington P. R.; Robinson, D. K., Data Reduction and Error Analysis for
the Physical Sciences, 3rd Edition.
Cowan G., Statistical Data Analysis, Oxford Science Publications.
Devore J. L., (2008) Probability and Statistics for Engineering and the
Sciences, Enhanced Review Edition.
Froedesen A. G., Skjeggestad D. and Tøfte H., (1979), Probability and
Statistics in Particle Physics, out of print.
James F., (2006), Statistical Methods in Experimental Physics, 2nd
Edition.
Levin R. I., and Rubin D. S., (2008) Statistics for Management, Seventh
Edition, PHI Learning Private Limited.
Lyons L., (1989) Statistics for Nuclear and Particle Physicists .
Mandel J., The Statistical Analysis of Experimental Data.
Meyer S. L., Data Analysis for Scientists and Engineers.
Morris H., Schervish M. J., and DeGroot, (2002) Probability and
Statistics.
Press W. H., Teukolsky S. A., Vetterling, W. T., and Flannery, B. P.,
Numerical Recipes: The Art of Scientific Computing, 3rd Edition..
Ross S. M., (2009), Introduction to Probability and Statistics for
Engineers and Scientists, Fourth Edition.
Taylor J. R., An Introduction to Error Analysis: The Study of
Uncertainties in Physical Measurements.
12.1 Introduction
In the previous unit, we dealt with analysis of variance (ANOVA),
assumptions for F-test, and classification of ANOVA. In this unit, we will
deal with correlation, methods of correlation, measures of correlation,
probable error, Spearman’s rank correlation coefficient, partial correlation,
multiple correlations, regression, standard error of estimate, multiple
regression analysis, and application of multiple regressions.
Both correlation and regression are used to study relationships between variables. These statistical tools are widely used to measure the relationship between the variables analysed in social science research.
Objectives:
After studying this unit, you should be able to:
define correlation and regression
discuss the types and measures of correlation
calculate the Karl Pearson’s correlation coefficient
calculate the coefficient for partial and multiple correlation
apply the method of estimating unknown values from known values
through regression equations
12.1.1 Relevance
The new CEO of a health care pharmaceutical company called for a
meeting of all heads of various departments to discuss the future strategy of
the company. While he expressed satisfaction over the growing sales of the company, he also emphasised the need to give a further boost to the sales and image of the company. The head of the R and D unit suggested investing more funds in the innovation of new products and the improvement of existing ones. He pointed out that R and D made the most significant contribution to the sales of the company. The head of the Marketing department emphasised the importance of marketing strategy for boosting the sales of the company. He, therefore, wanted more funds to be made available for the purpose. The head of the HRD department suggested the need for more staff and also new training programmes for improving the sales significantly. The CEO agreed in person with them and was expecting some analysis of quantitative facts and figures to evaluate the claims of the heads of departments and commit funds for the new strategies. The job was entrusted to a consultant who analysed the data using statistical techniques
Manipal University Jaipur Page No. 439
Statistics for Management Unit 12
12.2 Correlation
When two or more variables move in sympathy with one another, they are said to be correlated. If both variables move in the same direction, then they are said to be positively correlated. If the variables move in opposite directions, then they are said to be negatively correlated. If they move haphazardly, then there is no correlation between them. Correlation analysis deals with the following:
Measuring the relationship between variables.
Testing the relationship for its significance.
Giving a confidence interval for the population correlation measure.
[Figure: Methods of correlation. Graphic: scatter diagram. Algebraic: covariance method, rank correlation, concurrent deviation method.]
If the dots lie close to a straight line that runs from left bottom to right top, then the variables are said to be positively correlated. Figure 12.3 depicts the scatter diagram for positively correlated variables.
If the dots lie exactly on a straight line that runs from left top to right bottom, then the variables are said to be perfectly or exactly negatively correlated. Figure 12.4 depicts the scatter diagram for perfectly negatively correlated variables.
If the dots lie very close to a straight line that runs from left top to right bottom, then the variables are said to be negatively correlated. Figure 12.5 depicts the scatter diagram for negatively correlated variables.
If the dots lie all over the graph paper, then the variables have zero correlation. Figure 12.6 depicts the scatter diagram of variables with zero correlation.
A scatter diagram shows the direction in which the variables are related, but it does not give any quantitative measure for comparison between data sets.
12.4.2 Karl Pearson’s correlation coefficient
The correlation coefficient is a mathematical method for measuring the intensity or magnitude of the linear relationship between two variable series. To study the “degree of variation” between the variables in a bivariate distribution, we can use the correlation coefficient.
Key Statistic
Karl Pearson’s correlation coefficient is defined as:
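$$ r = \frac{\mathrm{Cov}(X, Y)}{\mathrm{S.D.}(X) \cdot \mathrm{S.D.}(Y)} $$
$$ \text{i)} \quad r = \frac{\sum xy}{N \sigma_x \sigma_y} \qquad \text{(A)} $$
where, $x = X - \bar{X}$ and $y = Y - \bar{Y}$,
$$ \sigma_x = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} \quad \text{and} \quad \sigma_y = \sqrt{\frac{\sum (Y - \bar{Y})^2}{N}} $$
where, ‘N’ is the number of paired observations and $\frac{\sum xy}{N}$ is called the covariance of ‘x’ and ‘y’.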
Key Statistic
The other forms of Karl Pearson’s correlation coefficient formula are:
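$$ \text{ii)} \quad r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} \qquad \text{(B)} $$
$$ r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{N\sum X^2 - (\sum X)^2}\,\sqrt{N\sum Y^2 - (\sum Y)^2}} \qquad \text{(C)} $$
$$ r = \frac{N\sum dx\,dy - \sum dx \sum dy}{\sqrt{N\sum dx^2 - (\sum dx)^2}\,\sqrt{N\sum dy^2 - (\sum dy)^2}} \qquad \text{(D)} $$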
For all practical purposes, form D is the most convenient. When only summary information is given, choose the appropriate form from A to C.
12.4.4 Factors influencing the size of correlation coefficient
The size of ‘r’ is very much dependent upon the variability of measured
values in the correlation sample. The greater the variability, the higher will
be the correlation, everything else being equal. The size of ‘r’ is altered
when researchers select extreme groups of subjects in order to compare
these groups with respect to certain behaviours. Selecting extreme groups
on one variable increases the size of ‘r’ over what would be obtained with
more random sampling.
Combining two groups which differ in their mean values on one of the
variables is not likely to faithfully represent the true situation as far as the
correlation is concerned.
Inclusion of an extreme case (and similarly dropping of an extreme case)
can lead to changes in the amount of correlation.
$$ r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} $$
Solved Problem 1
Find Karl Pearson’s correlation coefficient for the data depicted in table
12.1.
Table 12.1: Data Related to Solved Problem 1
X 20 16 12 8 4
Y 22 14 4 12 8
Solution:
Table 12.1a depicts the sums calculated for the data depicted in table 12.1.
Table 12.1a: Sums Related to Solved Problem 1
X          Y          X2          Y2          XY
20         22         400         484         440
16         14         256         196         224
12         4          144         16          48
8          12         64          144         96
4          8          16          64          32
∑X = 60    ∑Y = 60    ∑X2 = 880   ∑Y2 = 904   ∑XY = 840
Applying the formula for ‘r’ and substituting the respective values from the
table we get r as:
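$$ r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{N\sum X^2 - (\sum X)^2}\,\sqrt{N\sum Y^2 - (\sum Y)^2}} $$
$$ r = \frac{5(840) - (60)(60)}{\sqrt{[5(880) - (60)^2][5(904) - (60)^2]}} $$
$$ r = 0.70 $$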
Hence, Karl Pearson’s Correlation Coefficient is 0.70.
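The substitution in Solved Problem 1 can be reproduced in a few lines. This is a minimal sketch of form C (raw sums); the function name `pearson_r` is an illustrative choice and is not taken from any library.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Karl Pearson's r computed from raw sums (form C)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


# Data of table 12.1
X = [20, 16, 12, 8, 4]
Y = [22, 14, 4, 12, 8]

r = pearson_r(X, Y)
print(round(r, 4))
```

The intermediate sums inside the function correspond to the column totals of table 12.1a, so each one can be compared with the table while debugging a hand calculation.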
Solved Problem 2
Calculate the correlation coefficient from the data depicted in table 12.2.
Table 12.2: Data Related to Solved Problem 2
X 50 60 58 47 49 33 65 43 46 68
Y 48 65 50 48 55 58 63 48 50 70
Solution:
Table 12.2a depicts the frequency table of the data related to solved
problem 2.
Table 12.2a: Frequency Table Data for Solved Problem 2
X      dx = X-50   dx2     Y      dy = Y-55   dy2    dx dy
50       0           0     48       -7         49        0
60     +10         100     65      +10        100     +100
58      +8          64     50       -5         25      -40
47      -3           9     48       -7         49      +21
49      -1           1     55        0          0        0
33     -17         289     58       +3          9      -51
65     +15         225     63       +8         64     +120
43      -7          49     48       -7         49      +49
46      -4          16     50       -5         25      +20
68     +18         324     70      +15        225     +270
∑X = 519   ∑dx = 19   ∑dx2 = 1077   ∑Y = 555   ∑dy = 5   ∑dy2 = 595   ∑dxdy = 489
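$$ r = \frac{N\sum dx\,dy - \sum dx \sum dy}{\sqrt{N\sum dx^2 - (\sum dx)^2}\,\sqrt{N\sum dy^2 - (\sum dy)^2}} $$
And substituting values we get
$$ r = \frac{10 \times 489 - 19 \times 5}{\sqrt{10 \times 1077 - 19^2}\,\sqrt{10 \times 595 - 5^2}} = 0.611 $$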
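Form D works with deviations from any convenient assumed means, and the result does not depend on which assumed means are chosen. The sketch below illustrates this with the data of table 12.2; the function name `pearson_r_assumed_mean` and its parameters `ax` and `ay` are hypothetical names introduced only for this example.

```python
from math import sqrt


def pearson_r_assumed_mean(xs, ys, ax, ay):
    """Karl Pearson's r from deviations dx = X - ax and dy = Y - ay (form D)."""
    n = len(xs)
    dx = [x - ax for x in xs]
    dy = [y - ay for y in ys]
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    numerator = n * sum_dxdy - sum(dx) * sum(dy)
    denominator = sqrt(
        (n * sum(a * a for a in dx) - sum(dx) ** 2)
        * (n * sum(b * b for b in dy) - sum(dy) ** 2)
    )
    return numerator / denominator


# Data of table 12.2
X = [50, 60, 58, 47, 49, 33, 65, 43, 46, 68]
Y = [48, 65, 50, 48, 55, 58, 63, 48, 50, 70]

r_text = pearson_r_assumed_mean(X, Y, 50, 55)   # assumed means used in the solution
r_raw = pearson_r_assumed_mean(X, Y, 0, 0)      # no shift at all, i.e. form C

print(round(r_text, 3), round(r_raw, 3))
```

Both calls return the same value, which is why the assumed means can be picked purely to keep the hand arithmetic small.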
Solved Problem 3
In a bivariate data on ‘x’ and ‘y’, variance of ‘x’ = 49, variance of ‘y’ = 9 and
covariance Cov(x, y) = -17.5. Find coefficient of correlation between ‘x’ and
‘y’.
Solution:
We know that:
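$$ r = \frac{\sum xy}{N \sigma_x \sigma_y} $$
$$ \text{Given } \mathrm{Cov}(x, y) = \frac{\sum xy}{N} = -17.5, \quad \sigma_x = \sqrt{49} = 7, \quad \sigma_y = \sqrt{9} = 3 $$
$$ r = \frac{-17.5}{7 \times 3} = -0.833 $$
Hence, there is a high negative correlation.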
Solved Problem 4
Ten observations on Weight (x) and Height (y) of a particular age group gave the following data.
Solution:
We know that:
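$$ r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{N\sum X^2 - (\sum X)^2}\,\sqrt{N\sum Y^2 - (\sum Y)^2}} $$
Given N = 10, ∑X = 56, ∑Y = 138, ∑X2 = 1357, ∑Y2 = 2136, ∑XY = 836
$$ r = \frac{10 \times 836 - (56)(138)}{\sqrt{10 \times 1357 - (56)^2}\,\sqrt{10 \times 2136 - (138)^2}} = 0.1286 $$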
$$ P.E. = 0.6745 \, \frac{1 - r^2}{\sqrt{n}} $$
where, ‘r’ is measured from sample of size ‘n’.
Probable error is used to:
i) Interpret the value of ‘r’,
If r < P.E, then it is not at all significant
If r > 6 P.E, then ‘r’ is highly significant
If P.E < r < 6 P.E, we cannot say anything about the significance of
‘r’
ii) Construct confidence limits within which correlation in the population
is expected to lie.
$$ SE(r) = \frac{1 - r^2}{\sqrt{n}} $$
$$ PE(r) = SE(r) \times 0.6745 $$
The reason for taking the factor 0.6745 is that in a normal distribution 50% of the distribution lies in the range μ ± 0.6745 σ.
Solved Problem 5
If r = 0.6 and n = 64, then:
a) Interpret ‘r’
b) Find the limits within which ‘ρ’ is supposed to lie
Solution:
$$ P.E. = 0.6745 \, \frac{1 - (0.6)^2}{\sqrt{64}} = 0.054 $$
a) 6 P.E. = 6 × 0.054 = 0.324
Since r = 0.6 > 6 P.E., r is highly significant.
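The probable error rules can be written as a small helper. This is a sketch under two assumptions that should be read as such: the comparison is made on the absolute value of r (the text states the rule for r itself), and the limits for the population correlation are taken as r ± P.E., the usual convention for part (b). The function names are illustrative.

```python
from math import sqrt


def probable_error(r, n):
    """P.E. = 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)


def interpret(r, n):
    """Apply the three rules of thumb to |r| (assumption: sign is ignored)."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        return "not significant"
    if abs(r) > 6 * pe:
        return "highly significant"
    return "inconclusive"


r, n = 0.6, 64
pe = probable_error(r, n)
limits = (r - pe, r + pe)   # assumed convention: r - P.E. to r + P.E.

print(round(pe, 3), interpret(r, n))
print(tuple(round(v, 3) for v in limits))
```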
Key Statistic
Spearman’s Rank correlation coefficient is defined as:
$$ \rho = 1 - \frac{6\sum D^2}{N^3 - N} $$
where, D is the difference between ranks assigned to the variables.
N is the number of observations.
Value of ‘ρ’ lies between ‘-1’ and ‘+1’ and its interpretation is the same as that of Karl Pearson’s correlation coefficient.
There are three types of problems. Table 12.3 depicts the types of problems
involved in calculating rank correlation coefficient.
Table 12.3: Types of Problems
Type i Ranks are assigned
Type ii Ranks are not assigned
Type iii When ranks are repeated
Type i: Ranks are assigned: When ranks are already assigned, take the
difference between the ranks of the variables and denote it by D. Then the
rank correlation is computed using the formula
$$ \rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)} $$
Solved Problem 6
In a singing competition, two judges assigned the ranks for seven
candidates which is depicted in table 12.4. Find Spearman’s rank correlation
coefficient.
Table 12.4: Ranks of Seven Candidates
Competitor 1 2 3 4 5 6 7
Judge I 5 6 4 3 2 7 1
Judge II 6 4 5 1 2 7 3
Solution:
Table 12.4a depicts the data of solved problem 6.
$$ \rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)} $$
$$ = 1 - \frac{6(14)}{7(7^2 - 1)} = 1 - \frac{6 \times 14}{7 \times 48} = 0.75 $$
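When the ranks are already given, the whole calculation reduces to one sum. A minimal sketch using the ranks of table 12.4 follows; the function name `spearman_from_ranks` is illustrative.

```python
def spearman_from_ranks(ranks_1, ranks_2):
    """Spearman's rank correlation when ranks are given and there are no ties."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Ranks of table 12.4
judge_1 = [5, 6, 4, 3, 2, 7, 1]
judge_2 = [6, 4, 5, 1, 2, 7, 3]

rho = spearman_from_ranks(judge_1, judge_2)
print(rho)
```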
Solved Problem 7
Find the rank difference coefficient of correlation (in case of no ties) for the
data depicted in table 12.5.
Table 12.5: Scores of Students on Test I and Test II
Relation between ‘x’ and ‘y’ is very high and inverse. Relationship between
score on Test I and II is very high and inverse.
Solved Problem 8
Table 12.6 depicts the sales statistics of six sales representatives in two
different localities. Find whether there is a relationship between the buying
habits of the people in the localities.
Table 12.6: Sales Data of Six Representatives
Representative 1 2 3 4 5 6
Locality I 70 40 65 110 60 20
Locality II 70 30 80 100 90 20
Solution:
Table 12.6a depicts the calculated values of correlation coefficient of data in
solved problem 8.
Table 12.6a: Calculating the Coefficient of Correlation
Representative   Rank of sales in Locality I (R1)   Rank of sales in Locality II (R2)   D = R1-R2   D2
1                2                                  4                                   -2          4
2                5                                  5                                    0          0
3                3                                  3                                    0          0
4                1                                  1                                    0          0
5                4                                  2                                    2          4
6                6                                  6                                    0          0
N = 6                                                                                               ∑D2 = 8
$$ \rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)} $$
$$ = 1 - \frac{6(8)}{6(6^2 - 1)} = 1 - \frac{8}{35} = 0.7714 $$
Therefore, there is a high positive correlation between the buying habits of the people in the two localities.
Solved Problem 9
Find the rank correlation coefficient for the data depicted in table 12.7.
Table 12.7: Scores of Student in Test I and Test II
Student A B C D E F G H I J
Score on Test I 20 30 22 28 32 40 20 16 14 18
Score on Test II 32 32 48 36 44 48 28 20 24 28
Solution:
Table 12.7a depicts the required data for calculating the correlation
coefficient.
Table 12.7a: Ranks of Test I and Test II
X = score on Test I, Y = score on Test II, R1 = rank on Test I, R2 = rank on Test II, D = difference between ranks, D2 = difference squared.
Student   X     Y     R1     R2     D       D2
A         20    32    6.5    5.5     1.0     1.00
B         30    32    3      5.5    -2.5     6.25
C         22    48    5      1.5     3.5    12.25
D         28    36    4      4       0       0
E         32    44    2      3      -1.0     1.00
F         40    48    1      1.5    -0.5     0.25
G         20    28    6.5    7.5    -1.0     1.00
H         16    20    9      10     -1.0     1.00
I         14    24    10     9       1.0     1.00
J         18    28    8      7.5     0.5     0.25
N = 10                                       ∑D2 = 24
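Each set of m tied values adds a correction of (m^3 - m)/12 to ∑D2. Here there are four tied pairs (20 on Test I; 32, 48 and 28 on Test II), each with m = 2.
$$ \rho = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m_1^3 - m_1) + \frac{1}{12}(m_2^3 - m_2) + \frac{1}{12}(m_3^3 - m_3) + \frac{1}{12}(m_4^3 - m_4)\right]}{N(N^2 - 1)} $$
$$ = 1 - \frac{6\left[24 + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2)\right]}{10(10^2 - 1)} $$
$$ = 1 - \frac{6(24 + 0.5 + 0.5 + 0.5 + 0.5)}{10 \times 99} = 1 - \frac{156}{990} = 0.8424 $$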
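Assigning average ranks to tied scores and adding the tie corrections is easy to get wrong by hand, so a short check is useful. The sketch below uses the scores of table 12.7 and the standard form of the formula, in which the factor 6 multiplies both ∑D2 and the tie corrections. The helper names `average_ranks`, `tie_correction` and `spearman_with_ties` are illustrative.

```python
from collections import Counter


def average_ranks(values):
    """Rank from highest (rank 1) to lowest; tied values share their mean rank."""
    ordered = sorted(values, reverse=True)
    positions = {}
    for position, value in enumerate(ordered, start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every group of m tied values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)


def spearman_with_ties(xs, ys):
    n = len(xs)
    r1, r2 = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    correction = tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))


# Scores of table 12.7
test_1 = [20, 30, 22, 28, 32, 40, 20, 16, 14, 18]
test_2 = [32, 32, 48, 36, 44, 48, 28, 20, 24, 28]

print(average_ranks(test_1))
print(average_ranks(test_2))
rho_ties = spearman_with_ties(test_1, test_2)
print(round(rho_ties, 4))
```

The two printed rank lists can be compared directly with the R1 and R2 columns of table 12.7a.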
Activity:
Find the rank correlation from the following distribution
Cost 39 65 62 90 82 75 25 98 36 78
Sales 47 53 58 86 62 68 60 91 51 54
Activity Solution
Cost    Sales
X       Y       R1    R2    D     D2
39      47      8     10    -2     4
65      53      6     8     -2     4
62      58      7     6      1     1
90      86      2     2      0     0
82      62      3     4     -1     1
75      68      5     3      2     4
25      60      10    5      5    25
98      91      1     1      0     0
36      51      9     9      0     0
78      54      4     7     -3     9
                                 ∑D2 = 48
$$ \rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)} $$
$$ = 1 - \frac{6 \times 48}{10(10^2 - 1)} = 1 - \frac{288}{990} = 0.71 $$
Key Statistic
Partial correlation is denoted by the symbol r12.3. Here the correlation between variables 1 and 2, keeping the 3rd variable constant, is:
$$ r_{12.3} = \frac{r_{12} - r_{13} \, r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} $$
where,
r12.3 = Partial correlation between variables 1 and 2 keeping 3rd constant
r12 = correlation between variables 1 and 2
r13 = correlation between variables 1 and 3
r23 = correlation between variables 2 and 3
Similarly,
$$ r_{13.2} = \frac{r_{13} - r_{12} \, r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}} \quad \text{and} \quad r_{23.1} = \frac{r_{23} - r_{12} \, r_{13}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{13}^2}} $$
Solved problem 10
Given r12 = 0.8, r13 = 0.5 and r23 = 0.4, calculate all partial correlations.
Solution:
(i) The correlation between variables 1 and 2 keeping the 3rd constant is
given by:
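$$ r_{12.3} = \frac{r_{12} - r_{13} \, r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} = \frac{0.8 - 0.5 \times 0.4}{\sqrt{1 - 0.5^2}\,\sqrt{1 - 0.4^2}} = \frac{0.6}{0.794} = 0.756 $$
(ii) The correlation between variables 1 and 3 keeping the 2nd constant is given by:
$$ r_{13.2} = \frac{r_{13} - r_{12} \, r_{23}}{\sqrt{1 - r_{12}^2}\,\sqrt{1 - r_{23}^2}} = \frac{0.5 - 0.8 \times 0.4}{\sqrt{1 - 0.8^2}\,\sqrt{1 - 0.4^2}} = \frac{0.18}{0.550} = 0.327 $$
(iii) The correlation between variables 2 and 3 keeping the 1st constant is given by:
$$ r_{23.1} = \frac{r_{23} - r_{21} \, r_{13}}{\sqrt{1 - r_{21}^2}\,\sqrt{1 - r_{13}^2}} = \frac{0.4 - 0.8 \times 0.5}{\sqrt{1 - 0.8^2}\,\sqrt{1 - 0.5^2}} = 0 $$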
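All three partial correlations use the same pattern, so one function covers them. This is a minimal sketch; `partial_corr` and its argument names are illustrative, with `r_ab` the correlation of interest and `r_ac`, `r_bc` the correlations of each variable with the one held constant.

```python
from math import sqrt


def partial_corr(r_ab, r_ac, r_bc):
    """Correlation between a and b with the third variable c held constant."""
    return (r_ab - r_ac * r_bc) / (sqrt(1 - r_ac ** 2) * sqrt(1 - r_bc ** 2))


# Zero-order correlations of Solved Problem 10
r12, r13, r23 = 0.8, 0.5, 0.4

r12_3 = partial_corr(r12, r13, r23)   # 1 and 2, holding 3 constant
r13_2 = partial_corr(r13, r12, r23)   # 1 and 3, holding 2 constant
r23_1 = partial_corr(r23, r12, r13)   # 2 and 3, holding 1 constant

print(round(r12_3, 3), round(r13_2, 3), round(r23_1, 3))
```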
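$$ R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\, r_{12} r_{13} r_{23}}{1 - r_{23}^2}} $$
$$ R_{2.13} = \sqrt{\frac{r_{12}^2 + r_{23}^2 - 2\, r_{12} r_{13} r_{23}}{1 - r_{13}^2}} $$
$$ R_{3.12} = \sqrt{\frac{r_{13}^2 + r_{23}^2 - 2\, r_{12} r_{13} r_{23}}{1 - r_{12}^2}} $$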
Solved Problem 11
The following are the zero order correlation coefficients.
r12 = 0.98; r13 = 0.44 r23 = 0.54
Solution:
The first variable is dependent. The second and third variables are
independent. Using the formula for multiple correlation coefficients for R1.23
we get:
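$$ R_{1.23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\, r_{12} r_{13} r_{23}}{1 - r_{23}^2}} = \sqrt{\frac{0.98^2 + 0.44^2 - 2(0.98)(0.44)(0.54)}{1 - 0.54^2}} $$
= 0.986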
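The multiple correlation coefficient in Solved Problem 11 can be verified with a one-line function. In this sketch the name `multiple_corr` is illustrative; its first two arguments are the correlations of the dependent variable with each independent variable, and the third is the correlation between the two independent variables.

```python
from math import sqrt


def multiple_corr(r_dep_a, r_dep_b, r_ab):
    """Multiple correlation of a dependent variable on two independent ones."""
    numerator = r_dep_a ** 2 + r_dep_b ** 2 - 2 * r_dep_a * r_dep_b * r_ab
    return sqrt(numerator / (1 - r_ab ** 2))


# Zero-order correlations of Solved Problem 11
r12, r13, r23 = 0.98, 0.44, 0.54

R1_23 = multiple_corr(r12, r13, r23)   # variable 1 dependent
print(round(R1_23, 3))
```

Passing the arguments in a different order gives the other two coefficients, for example `multiple_corr(r12, r23, r13)` for R2.13.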
12.9 Regression
According to M. M. Blair, Regression is defined as, “the measure of the
average relationship between two or more variables in terms of the original
units of the data”.
Correlation analysis attempts to study the relationship between the two variables ‘X’ and ‘Y’. Regression attempts to quantify the dependence of one variable on the other. For example, if there are two variables ‘X’ and ‘Y’ and ‘Y’ depends on ‘X’, then the dependence is expressed in the form of an equation.
12.9.1 Regression analysis
Regression analysis is used to estimate the values of the dependent
variables from the values of the independent variables. Regression analysis
is used to get a measure of the error involved while using the regression line
as a basis for estimation. The regression coefficient Y on X is the coefficient
of the variable ‘X’ in the line of regression Y on X. Regression coefficients
are used to calculate the correlation coefficient. The square of correlation is
the product of regression coefficients.
12.9.2 Regression lines
For a set of paired observations, there exist two straight lines. The line
drawn in such a way that the sum of vertical deviation is zero and the sum of
their squares is minimum, is called regression line of ‘Y’ on ‘X’. It is used to
estimate ‘Y’ values for given ‘X’ values. The line drawn in such a way that
the sum of horizontal deviation is zero and sum of their squares is minimum,
is called regression line of ‘X’ on ‘Y’. It is used to estimate the ‘X’ values for
the given ‘Y’ values. The smaller the angle between these lines, the higher
is the correlation between the variables. The regression lines always
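intersect at $(\bar{X}, \bar{Y})$.
i) The regression equation of ‘Y’ on ‘X’ is given by:
$$ Y - \bar{Y} = b_{yx}(X - \bar{X}) $$
ii) The regression equation of ‘X’ on ‘Y’ is given by:
$$ X - \bar{X} = b_{xy}(Y - \bar{Y}) $$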
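where,
$$ b_{xy} = \frac{N\sum dx\,dy - (\sum dx)(\sum dy)}{N\sum dy^2 - (\sum dy)^2} \quad \text{or} \quad b_{xy} = r\,\frac{\sigma_x}{\sigma_y} $$
$$ b_{yx} = \frac{N\sum dx\,dy - (\sum dx)(\sum dy)}{N\sum dx^2 - (\sum dx)^2} \quad \text{or} \quad b_{yx} = r\,\frac{\sigma_y}{\sigma_x} $$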
Solved Problem 12
Find the regression equations from the data depicted in table 12.9. Then calculate the correlation coefficient.
Table 12.9: Data of Ages of Husband and Wife
Age of Husband 18 19 20 21 22 23 24 25 26 27
Age of Wife 17 17 18 18 19 19 19 20 21 22
Solution:
Table 12.9a depicts the data required for calculation of correlation and
regression coefficients.
Table 12.9a: Data Required for Calculation of Correlation and Regression
Coefficients
Age of husband (X)   dx = X-22   dx2        Age of wife (Y)   dy = Y-19   dy2        dx dy
18                   -4          16         17                -2          4          8
19                   -3          9          17                -2          4          6
20                   -2          4          18                -1          1          2
21                   -1          1          18                -1          1          1
22                    0          0          19                 0          0          0
23                    1          1          19                 0          0          0
24                    2          4          19                 0          0          0
25                    3          9          20                 1          1          3
26                    4          16         21                 2          4          8
27                    5          25         22                 3          9          15
∑X = 225             ∑dx = 5     ∑dx2 = 85  ∑Y = 190          ∑dy = 0     ∑dy2 = 24  ∑dxdy = 43
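$$ \bar{X} = \frac{225}{10} = 22.5, \qquad \bar{Y} = \frac{190}{10} = 19 $$
Regression equation of Y on X is:
$$ Y - \bar{Y} = b_{yx}(X - \bar{X}) $$
$$ b_{yx} = \frac{N\sum dx\,dy - (\sum dx)(\sum dy)}{N\sum dx^2 - (\sum dx)^2} = \frac{10 \times 43 - (5)(0)}{10 \times 85 - (5)^2} = \frac{430}{825} = 0.521 $$
$$ Y - 19 = 0.521(X - 22.5) $$
$$ Y = 0.521X + 7.2775 $$
Regression equation of X on Y is:
$$ X - \bar{X} = b_{xy}(Y - \bar{Y}) $$
$$ b_{xy} = \frac{N\sum dx\,dy - (\sum dx)(\sum dy)}{N\sum dy^2 - (\sum dy)^2} = \frac{10 \times 43 - (5)(0)}{10 \times 24 - (0)^2} = \frac{430}{240} = 1.792 $$
$$ X - 22.5 = 1.792(Y - 19) $$
$$ X = 1.792Y - 11.548 $$
$$ r = \sqrt{b_{yx} \cdot b_{xy}} $$
$$ r = \sqrt{0.521 \times 1.792} = 0.966 $$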
Hence, the Correlation Coefficient ‘r’ is 0.966.
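Both regression lines and the correlation coefficient of Solved Problem 12 can be obtained from the raw ages. The following sketch works with deviations from the actual means, which is equivalent to the assumed-mean formulas used in the solution; the function name `regression_lines` and the dictionary keys are illustrative.

```python
from math import copysign, sqrt


def regression_lines(xs, ys):
    """Slopes and intercepts of Y on X and X on Y, plus r = sqrt(b_yx * b_xy)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    b_yx = s_xy / s_xx
    b_xy = s_xy / s_yy
    return {
        "b_yx": b_yx, "a_yx": mean_y - b_yx * mean_x,   # Y = a_yx + b_yx * X
        "b_xy": b_xy, "a_xy": mean_x - b_xy * mean_y,   # X = a_xy + b_xy * Y
        "r": copysign(sqrt(b_yx * b_xy), s_xy),         # r takes the sign of the slopes
    }


# Data of table 12.9
husband = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
wife = [17, 17, 18, 18, 19, 19, 19, 20, 21, 22]

lines = regression_lines(husband, wife)
for name, value in lines.items():
    print(name, round(value, 4))
```

Because the script does not round the slopes before computing the intercepts, its intercepts (about 7.27 and -11.54) differ slightly from the hand-calculated 7.2775 and -11.548.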
Solved Problem 13
Table 12.10 depicts the results that were worked out from scores in
statistics and mathematics in a certain examination.
Table 12.10: Scores in Statistics and Mathematics
Scores in Statistics Scores in Mathematics
X Y
Mean 40 48
Standard Deviation 10 15
Karl Pearson’s correlation coefficient between ‘X’ and ‘Y’ is +0.42. Find
the regression lines ‘X’ on ‘Y’ and ‘Y’ on ‘X’. Use the regression lines to find
the value of ‘Y’ when X = 50 and value of ‘X’ when Y = 30.
Solution:
Given the following data:
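$$ \bar{X} = 40; \quad \bar{Y} = 48; \quad \sigma_x = 10; \quad \sigma_y = 15; \quad r = 0.42 $$
The regression line X on Y is:
$$ X - \bar{X} = b_{xy}(Y - \bar{Y}) $$
$$ b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = 0.42 \times \frac{10}{15} = 0.28 $$
$$ X - 40 = 0.28(Y - 48) $$
$$ X = 0.28Y + 26.56 $$
The regression line Y on X is given as:
$$ Y - \bar{Y} = b_{yx}(X - \bar{X}) $$
$$ b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = 0.42 \times \frac{15}{10} = 0.63 $$
$$ Y - 48 = 0.63(X - 40) $$
$$ Y = 0.63X + 22.8 $$
Therefore,
when Y = 30; X = 0.28Y + 26.56; X = 34.96
when X = 50; Y = 0.63X + 22.8; Y = 54.3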
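When only the means, standard deviations and r are known, each regression line follows from the slope r × (S.D. of dependent) / (S.D. of independent). A minimal sketch with the figures of table 12.10 is given below; `line_from_summary` is an illustrative name.

```python
def line_from_summary(mean_dep, mean_ind, sd_dep, sd_ind, r):
    """Return (slope, intercept) of the regression of the dependent variable
    on the independent variable, from summary statistics only."""
    slope = r * sd_dep / sd_ind
    intercept = mean_dep - slope * mean_ind
    return slope, intercept


# Summary statistics of table 12.10
mean_x, mean_y = 40, 48
sd_x, sd_y = 10, 15
r = 0.42

b_yx, a_yx = line_from_summary(mean_y, mean_x, sd_y, sd_x, r)   # Y on X
b_xy, a_xy = line_from_summary(mean_x, mean_y, sd_x, sd_y, r)   # X on Y

y_at_50 = a_yx + b_yx * 50   # estimate of Y when X = 50
x_at_30 = a_xy + b_xy * 30   # estimate of X when Y = 30

print(round(b_yx, 2), round(a_yx, 2), round(y_at_50, 2))
print(round(b_xy, 2), round(a_xy, 2), round(x_at_30, 2))
```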
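The standard error of estimate of X values from Y is:
$$ S_{xy} = \sqrt{\frac{\sum (X - X_c)^2}{N}} $$
The standard error of estimate of Y values from X is:
$$ S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N}} $$
where Yc and Xc are the estimated values of the Y and X variables from the lines of regression of Y on X and X on Y respectively.
The following simpler formulae are used for calculating Sxy and Syx, where ‘a’ and ‘b’ are the constants of the corresponding regression line:
$$ S_{xy} = \sqrt{\frac{\sum X^2 - a\sum X - b\sum XY}{N}} $$
$$ S_{yx} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{N}} $$
To make the standard error an unbiased estimate of the actual variance of the X or Y values, we divide the variability by (N - 2):
$$ S_{xy} = \sqrt{\frac{\sum (X - X_c)^2}{N - 2}} $$
$$ S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N - 2}} $$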
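The definition and the shortcut formula for the standard error of estimate give the same value when ‘a’ and ‘b’ are the least-squares constants. The sketch below checks this for the regression of Y on X using the husband and wife ages of table 12.9; all function names and the `lost_df` parameter are illustrative assumptions.

```python
from math import sqrt

# Data of table 12.9 (X = age of husband, Y = age of wife)
X = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
Y = [17, 17, 18, 18, 19, 19, 19, 20, 21, 22]


def fit_y_on_x(xs, ys):
    """Least-squares constants of Y = a + b X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b


def std_error_by_definition(xs, ys, a, b, lost_df=0):
    """sqrt(sum (Y - Yc)^2 / (N - lost_df)); use lost_df=2 for the N - 2 version."""
    residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return sqrt(residual_ss / (len(xs) - lost_df))


def std_error_shortcut(xs, ys, a, b):
    """sqrt((sum Y^2 - a sum Y - b sum XY) / N)."""
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return sqrt((sum_yy - a * sum(ys) - b * sum_xy) / len(xs))


a, b = fit_y_on_x(X, Y)
s_definition = std_error_by_definition(X, Y, a, b)
s_shortcut = std_error_shortcut(X, Y, a, b)
s_unbiased = std_error_by_definition(X, Y, a, b, lost_df=2)

print(round(s_definition, 4), round(s_shortcut, 4), round(s_unbiased, 4))
```

The shortcut involves the difference of large sums, so with rounded values of ‘a’ and ‘b’ it can drift noticeably from the definition; keeping full precision avoids this.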
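$$ \sum X_{1i} Y_i = b_0 \sum X_{1i} + b_1 \sum X_{1i}^2 + b_2 \sum X_{1i} X_{2i} $$
$$ \sum X_{2i} Y_i = b_0 \sum X_{2i} + b_1 \sum X_{1i} X_{2i} + b_2 \sum X_{2i}^2 $$
The values of b0, b1 and b2 are estimated with the help of the principle of least squares.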
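To illustrate how b0, b1 and b2 come out of the normal equations, the sketch below builds the three equations for a model with two independent variables and solves them by elimination. It assumes the usual first normal equation, ∑Y = N b0 + b1 ∑X1 + b2 ∑X2, alongside the two shown above. The data are hypothetical and were constructed so that Y = 1 + 2 X1 + 3 X2 exactly; the function names are illustrative.

```python
def solve_linear(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(n):
            if i != col:
                factor = m[i][col] / m[col][col]
                m[i] = [u - factor * v for u, v in zip(m[i], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def fit_two_predictors(x1, x2, y):
    """Set up and solve the three normal equations for Y = b0 + b1 X1 + b2 X2."""
    n = len(y)
    matrix = [
        [n,       sum(x1),     sum(x2)],
        [sum(x1), dot(x1, x1), dot(x1, x2)],
        [sum(x2), dot(x1, x2), dot(x2, x2)],
    ]
    rhs = [sum(y), dot(x1, y), dot(x2, y)]
    return solve_linear(matrix, rhs)


# Hypothetical data, built so that Y = 1 + 2*X1 + 3*X2 holds exactly.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * p + 3 * q for p, q in zip(x1, x2)]

b0, b1, b2 = fit_two_predictors(x1, x2, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

With real data the fitted line will not pass through every point, and a statistical package would normally be used; the hand-rolled solver here only keeps the example self-contained.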
12.11.1 Application of Multiple Regression
Multiple regression analysis can be applied to test how factors such as export elasticity, import elasticity, and structural change (the contribution of the manufacturing sector towards GDP) influence employment. Here, employment is the dependent variable.
Similarly, researchers can use multiple regression in their own research work where appropriate.
12.12 Summary
Let us recapitulate the important concepts discussed in this unit:
When two or more variables move in sympathy with one another, they are said to be correlated. If both variables move in the same direction,
then they are said to be positively correlated. If the variables move in the
opposite direction, then they are said to be negatively correlated. If they
move haphazardly, then there is no correlation between them.
Regression helps us to study unknown variables with the help of known
variables. It also establishes a reliability measure for estimated values.
Regression analysis helps to quantify the dependence of one variable
on the other. Some of the regression types are simple and multiple
regressions, linear and non linear regression.
Regression analysis is useful in business and economic scenarios in the
decision making process.
12.13 Glossary
Correlation: When two or more variables move in sympathy with one another, they are said to be correlated.
Correlation coefficient: Critical statistic which indicates the direction and
intensity of a relationship between two continuous variables. Domain
extends from -1 through 0 to +1. Significance can be determined via
statistical testing. Both parametric and nonparametric correlation coefficients
are possible.
Coefficient of variation: A relative measure of variation, expressed as a
percentage; useful in comparing the variability of data sets with different
units of measure.
3. For the data in table 12.13, obtain the two lines of regression and estimate the blood pressure when the age is 50 years.
Table 12.13: Data for Terminal Question 3
Age in yrs (X) 56 42 72 39 63 47 52 49 40 42 68 60
B P (Y) 127 112 140 118 129 116 130 125 115 120 135 133
4. Table 12.14 depicts the results that were worked out from scores in
statistics and mathematics in a certain examination.
Table 12.14: Results of Scores in Statistics and Mathematics Examination
Scores in Statistics Scores in Mathematics
(X) (Y)
Mean 39.5 47.5
Standard Deviation 10.8 17.8
Karl Pearson’s correlation coefficient between X and Y = 0.42. Find both the
regression lines. Use these lines to estimate the value of Y when X = 50 and
the value of X when Y = 30.
12.15 Answers
Terminal Questions
1. 0.903
2. 0.967
3. X = -95 + 1.184Y
Y = 87.2 + 0.724X
4. X = 27.62 + 0.25Y
Y = 20.24 + 0.69X
By calculating the rank correlation, find out as to which of the indicators viz.
life expectancy, literacy, and GDP affects the HDI to the maximum extent.
To what extent does the life expectancy in a nation depend on the percentage of its urban population?
(Source: Srivastava, T. N. and Rejo, S. (2008) Statistics for Management, 5th edition, TMH)
References
Agarwal, B. L. (2006), Basic Statistics, 4th Edition, New Age International Publishers.
Bowerman, B. L. and Connel, R. T. O. (1996), Applied Statistics: Improving Business Processes, Irwin.
Levin, R. I. and Rubin, D. S. (2008), Statistics for Management, 7th Edition, PHI Learning Private Limited.
Pisani, F. D. R. and Purves, R. (1997), Statistics, 3rd Edition, W. W. Norton.
Srivastava, T. N. and Rejo, S. (2008), Statistics for Management, 5th Edition, TMH.
Tanur, J. M. (2002), Statistics: A Guide to the Unknown, 4th Edition, Brooks/Cole.
E-Reference
http://www.textbooksonline.tn.nic.in/Books/11/Stat-EM/Chapter-1.pdf
13.1 Introduction
In the previous unit, we studied Correlation and Regression
techniques, which are used for investigating the relationship between two or
more variables. In this unit we will discuss about business forecasting, the
To a very large extent, success or failure would depend upon the ability to
successfully forecast the future course of events. Without some element of
continuity between past, present and future, there would be little possibility
of successful prediction. But history is not likely to repeat itself and we would
hardly expect economic conditions next year or over the next 10 years to
follow a clear cut prediction. Yet, past patterns prevail sufficiently to justify
using the past as a basis for predicting the future.
A businessman cannot afford to base his decisions on guesses. Forecasting
helps a businessman in reducing the areas of uncertainty that surround
management decision making with respect to costs, sales, production,
profits, capital investment, pricing, expansion of production, extension of
credit, development of markets, increase of inventories and curtailment of
loans. These decisions are to be based on present indications of future
conditions.
However, we know that it is impossible to forecast the future precisely.
There is a possibility of occurrence of some range of error in the forecast.
Statistical forecasts are the methods in which we can use the mathematical
theory of probability to measure the risks of errors in predictions.
13.2.2 Prediction, Projection and Forecasting
A great amount of confusion seems to have grown up in the use of words
‘forecast’, ‘prediction’ and ‘projection’.
Key Statistic
A prediction is an estimate based solely on past data of the series under
investigation. It is purely a statistical extrapolation.
A projection is a prediction, where the extrapolated values are subject to
certain numerical assumptions.
A forecast is an estimate which relates the series in which we are interested to external factors.
4. Regression analysis
5. Modern econometric methods
6. Exponential smoothing method
13.3.1 Business Barometers
Business indices are constructed to study and analyse business activities, on the basis of which future conditions are anticipated. As business indices are indicators of future conditions, they are also known as ‘business barometers’ or ‘economic barometers’. With the help of these business barometers, the trend of fluctuations in business conditions is understood and a decision relating to the problem can be taken by forecasting.
The construction of business barometer consists of gross national product,
wholesale prices, consumer prices, industrial production, stock prices, bank
deposits etc. These quantities may be converted into relatives on a certain
base. The relatives so obtained may be weighted and their average
computed.
There are three types of business barometers. They are barometers for:
1. General business activities
2. Specific business or industry
3. Individual business firm
1. Barometers relating to general business activities:
Barometers relating to general business activities are also known as general
indices of business activities which refer to weighted or composite indices of
individual index business activities. With the help of general index of
business activity, long term trends and cyclical fluctuations in the economic
activities of a country are measured. However, in some specific cases, the
long term trends can be different from general trends. These types of
indices help in the formation of a country’s economic policies.
2. Business barometers for specific business or industry
These barometers are used as the supplement of general index of business
activity and are constructed to measure future variations in a specific
business or industry.
Merits
1. The business barometer method is scientific and reliable and is used by management for the purpose of various business decisions at different levels.
2. Business barometer method helps in forecasting future trends of a business.
3. Business barometers are the indicators of future business trends and help to forecast the speed of fluctuations.
4. This method helps to find solutions of various business problems such as development of market, capital investment, exploration of new consumer market etc.
Demerits
1. It is very difficult to construct indices of business activities.
2. In most of the cases, the business barometers provide inaccurate, incomplete and inconclusive forecasting due to index numbers prepared on the basis of incorrect and inadequate data.
3. The business barometers are the indicators of past conditions and the forecasting based on these conditions may be erroneous.
4. Separate indices are calculated for individual industry and firm which are entirely different from general indices.
Table 13.2 depicts the merits and demerits of time series analysis.
Table 13.2: Merits and Demerits of Time Series Analysis
Merits
1. It is an easy method of forecasting.
2. By this method a comparative study of variations can be made.
3. Reliable results of forecasting are obtained as this method is based on a mathematical model.
Demerits
1. This method is expensive, difficult and time consuming.
2. This method deals with past data only.
3. This method can only be used when the data for several years are available.
13.3.3 Extrapolation
Extrapolation is the simplest method of business forecasting. By
extrapolation, a businessman finds out the possible trend of demand of his
goods and also about the future price trends. The accuracy of extrapolation
depends on two factors:
Knowledge about the fluctuations of the figures
Knowledge about the course of events relating to the problem under
consideration
Thus, extrapolation is based on two assumptions:
1. There is no sudden jump in figures from one period to another
2. There is regularity in fluctuations and the rise and fall is uniform
In extrapolation, we assume that the variable will follow the established
pattern of growth. For the purpose of business forecasting, one needs to
determine accurately the appropriate trend curve and the values of its
parameters.
Some of these trend curves are explained below.
Arithmetic trend
The straight line arithmetic trend assumes that growth will be a constant
amount each year.
Semi-log trend
It assumes a constant percentage increase each year. As the annual
increment is constant in logarithm, this line will become a straight line when
drawn on semi-log paper.
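The difference between the arithmetic trend and the semi-log trend can be illustrated with a short sketch. The starting value of 100, the yearly increment of 10 units and the yearly growth rate of 10% are hypothetical figures chosen only for illustration.

```python
import math

# Hypothetical figures, used only to illustrate the two trend curves.
start = 100.0
years = range(6)

# Arithmetic trend: a constant amount (10 units) is added each year.
arithmetic = [start + 10.0 * t for t in years]

# Semi-log trend: a constant percentage (10%) is added each year.
semi_log = [start * 1.10 ** t for t in years]

# Year-to-year steps of the arithmetic trend are constant in absolute terms.
arithmetic_steps = [b - a for a, b in zip(arithmetic, arithmetic[1:])]

# Year-to-year steps of the semi-log trend are constant in logarithms,
# which is why the curve plots as a straight line on semi-log paper.
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(semi_log, semi_log[1:])]

print(arithmetic_steps)
print([round(s, 6) for s in log_steps])
```

The first list shows equal absolute increments, while the second shows equal logarithmic increments.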
Gompertz curve
It is given by:
$$Y_c = a\,b^{c^{X}}$$
Merits:
1. This method is very useful to forecast the future demand and production.
2. This method is widely used for the forecasting of business events.
3. We get pure and reliable results by this method, because it is a
mathematical method.
Demerits:
1. This method can be used under its own assumptions only.
2. This method is not simple but technical, because of its mathematical
formulation.
3. The selection of trend curve is very difficult.
Merits:
1. Accurate and reliable results are obtained under this method.
2. It is a scientific method where computer technology is used.
3. This method explains in detail and in quantitative terms the way in which
various aspects of the economy are interrelated.
Demerits:
1. This method is difficult and complicated.
2. This method can be used only when adequate series of data is available.
3. It is very difficult to construct growth model for every business activity.
Example 1
When the government makes use of deficit financing, it leads to inflationary
pressures; the purchasing power of people goes up. Therefore, the
wholesale prices and retail prices start rising. With the rise in retail prices,
the cost of living goes up and with it there is a demand for increased
wages. Thus, one factor, that is, more money in circulation, has affected
various fields of economic activity not simultaneously but successively.
Table 13.5 depicts the merits and demerits of sequence or time-lag theory.
Table 13.5: Merits and Demerits of Sequence or Time-lag Theory
Merits:
1. This method is largely used for business forecasting.
2. Though this theory is based on statistical techniques, it is easy to
understand.
Demerits:
1. This method studies only the action and not the reaction.
2. This method cannot be regarded as accurate, because statistical
techniques can bring the results close to the truth but cannot make them
exact.
2. Decline
3. Depression
4. Improvement
Table 13.6 depicts the merits and demerits of Action and Reaction theory.
Table 13.6: Merits and Demerits of Action and Reaction Theory
Merits:
1. This theory is better than other theories.
2. By this theory more reliable results can be obtained because this theory
gives attention to action and reaction of an event.
Demerits:
1. The determination of normal level is very difficult.
2. It is not necessary that reaction is equal to the action.
Merits:
1. Forecasting is made on the basis of past conditions, hence the forecasts
are more reliable.
Demerits:
1. The business events are not strictly periodic and prediction of business
cycle on the basis of statistical method is not satisfactory.
Merits:
1. It is an easy method.
2. As the future is forecasted on the basis of past business conditions, the
forecasting is more reliable.
Demerits:
1. In this theory, forecasting is based on guess work, not on a scientific
method, because the past and present conditions are rarely found to be similar.
2. It is very difficult to select a past period with the same business
conditions as the present.
Merits:
1. Present conditions are preferred to past conditions.
2. The effect of each factor is studied independently.
3. The forecast is closer to accurate as it is based on present conditions.
Demerits:
1. Independent analysis of individual facts is very difficult.
2. Past facts are equally important for the purpose of forecasting, but in this
method no importance is given to past facts.
3. The forecasting made on the basis of this technique cannot be regarded as
reliable.
Activity
1. Which of the following is not a forecasting model?
i) Trend method
ii) End-use method
iii) Correlation Method
iv) Exponential Method
2. The basic assumption in Linear Trend Method of forecasting is:
i) Rate of growth is constant from year to year
ii) Absolute growth is constant from year to year
iii) Rate of change is constant from year to year
iv) Absolute growth changes is constant from year to year
3. Which of the following methods is most suited for forecasting of
capital goods and machinery?
i) Trend
ii) Correlation & Regression
iii) End-Use Method
iv) Time Series Analysis
4. The answer to which of the following questions is not related to a planning or
budgeting exercise?
i) Where we are?
ii) How did we reach here?
iii) Why we reached here?
iv) Where we ought to reach?
Solution
1. iv) Exponential Method
2. ii) Absolute growth is constant from year to year
3. iii) End-Use Method
4. iv) Where we ought to reach?
13.6 Summary
Let us recapitulate the important concepts discussed in this unit:
Business forecasting refers to the analysis of past and present economic
conditions with the object of drawing inferences about probable future
business conditions.
To forecast the future, various data, information and facts concerning the
economic condition of the business, past and present, are analysed.
Business forecasting helps businessmen and industrialists to form
the policies and plans related to their activities. On the basis of the
forecasting, businessmen can forecast the demand of the product, price
of the product, condition of the market and so on.
The following are the main methods of business forecasting: Business
barometers, Time series analysis, Extrapolation, Regression analysis,
Modern econometric methods, Exponential smoothing method.
There are a few theories that are followed while making business
forecasts. Some of them are: Sequence or time-lag theory, Action and
reaction theory, Economic rhythm theory, Specific historical analogy,
Cross-cut analysis theory.
13.7 Glossary
Arithmetic trend: The straight line arithmetic trend assumes that growth will
be a constant amount each year.
Estimation of future: The business forecasting is to forecast the future
regarding probable economic conditions.
Extrapolation: Extrapolation is the simplest method of business
forecasting. By extrapolation, a businessman finds out the possible trend of
demand of his goods and also about the future price trends.
Exponential smoothing: A type of moving average technique applied to
time series data and used in forecasting.
Period: The forecasting can be made for long term, short term, medium
term or any specific period.
Semi-log trend: It assumes a constant percentage increase each year. As
the annual increment is constant in logarithm, this line will become a straight
line when drawn on semi-log paper.
Manipal University Jaipur Page No. 491
Statistics for Management Unit 13
13.9 Answers
14.1 Introduction
In the previous unit ‘Business Forecasting’, you have studied about the
ways of forecasting business events successfully. You also studied about
the different methods available for forecasting. In this unit, you will study
about the time series analysis and different components of time series. You
will also study about the forecasting methods using time series.
A time series is a set of numerical values of a given variable listed at
successive intervals of time, which means that, data regarding the variable
is listed in chronological order. Usually, the interval of time is taken as
uniform.
Yearly production of wheat in the country, hourly temperature of a city,
bimonthly electricity bills are all examples of time series. Almost all data like
industrial production, agricultural production, exports, imports, dairy
products can be arranged in chronological order.
Objectives:
After studying this unit, you should be able to
analyse the time series
describe different components of time series
describe the forecasting methods
apply time series analysis in business scenarios
14.1.1 Relevance
The 1990s brought a heightened awareness of, and increased concern over,
pollution in various forms in the United States. Air pollution is one of the
main areas of environmental concern. The U.S. Environmental Protection
Agency (EPA) monitors the quality of air around the country. Some of the air
pollutants monitored include carbon monoxide emissions, nitrogen oxide
emissions, volatile organic compounds, sulphur dioxide emissions, etc. The
substances in these pollutants cause cancer and respiratory problems. If the
data is given for a period of 15 years (1985-1999), then the question is to find
whether the air quality in the U.S. has been improving or deteriorating over time.
Managerial and statistical questions:
1. Is it possible to forecast the emissions of carbon monoxide or nitrogen
oxides for the years 2004-2007, or 2020, using the available data?
2. What techniques best forecast the emissions of carbon monoxide or
nitrogen oxides in the future?
Let us analyse the above data and give some trends regarding the sales.
For example, the company would like to know why sales dropped in 1998
and 1999 and why they increased. In other words, the company would like to
analyse the various forces that affect the sales.
There can be changes in the values of the variable recorded over different
points of time due to various forces. Analysing the effect of all such forces
on the values of the variable is generally known as the analysis of time
series. Broadly, the following are the four types of changes in the values of
the variable:
i) Changes which generally occur due to general tendency of the data to
increase or decrease
ii) Changes which occur due to change in climate, weather conditions
and festivals
iii) Changes which occur due to booms and depressions
iv) Changes which occur due to some unpredictable forces like floods,
famines and earthquakes
ii. Forecasting can be done using the time series. By studying the
variations and other behaviour of the variables over a sufficiently long
period of time, it may be possible to forecast the future behaviour of the
variables. However, such a forecast has meaning only if the period of
forecast is a normal period. For example, various five-year plans by the
government of India are formulated by studying the time series and
forecasting.
iii. Study of the time series helps in analysing the past behaviour of the
variables. This helps in identifying the various forces that affect its
behaviour.
ii. The prices of cooking oils reduce after the harvesting of oil seeds and
go up after some time.
Solved problem 1
Find the trend with the help of free hand curve method for the data depicted
in table 14.2
Table 14.2: Production Data from 1991 to 2001
Year Production Data (in Lakh ton)
1991 15
1992 18
1993 16
1994 22
1995 19
1996 24
1997 20
1998 28
1999 22
2000 30
2001 26
Solution: Figure 14.1 depicts free hand curve of the production data versus
the time period. In the graph, we have taken production data values on
Y-axis and values of time on X-axis.
The trend of the export of sugar from India can be found by the semi
averages method as follows:
Here the number of years is 10 i.e., it is even. The series is divided into two
halves consisting of the first five years and last five years.
Merits:
1. The semi average method is simple.
2. The trend line can be extended on either side in order to obtain past or
future estimates.
Demerits:
1. The semi average method assumes a straight line relationship between the
plotted points, regardless of the fact whether such relationship exists or not.
2. This method has an in built limitation of arithmetic mean. This method is
not suitable in case of very low or very large extreme values.
Fig. 14.2: Procedure for Determining the Trend when Moving Average is Odd
Solved Problem 3
Calculate the 3 yearly Moving Averages of the data depicted in table 14.5.
Table 14.5: Production Data from 1988 to 1996
Year                      1988 1989 1990 1991 1992 1993 1994 1995 1996
Production (in Lakh ton)  21   22   23   25   24   22   25   27   26
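The calculation asked for in Solved Problem 3 can be sketched in a few lines. The data are taken from Table 14.5; the function name `moving_average` is an illustrative choice and not part of the original text.

```python
# Production data from Table 14.5 (in lakh ton).
years = list(range(1988, 1997))
production = [21, 22, 23, 25, 24, 22, 25, 27, 26]


def moving_average(values, period):
    """Return the simple moving averages of the given period."""
    return [
        sum(values[i:i + period]) / period
        for i in range(len(values) - period + 1)
    ]


trend = moving_average(production, 3)

# For an odd period, each average is placed against the middle year of its
# window, so the first and the last year have no trend value.
centre_years = years[1:-1]

for year, value in zip(centre_years, trend):
    print(year, round(value, 2))
```

With a period of three years, the series of nine observations yields seven trend values, for the years 1989 to 1995.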
Fig. 14.3: Procedure for Determining the Trend when Moving Average is Even
Table 14.6 depicts the merits and demerits of the moving averages method.
Merits:
1. This is a simple method.
2. This method is objective in the sense that anybody working on a problem
with this method will get the same results.
3. This method is used for determining seasonal, cyclic and irregular
variations besides the trend values.
4. This method is flexible enough to add more figures to the data because the
entire calculations are not changed.
5. If the period of moving averages coincides with the period of cyclic
fluctuations in the data, such fluctuations are automatically eliminated.
Demerits:
1. No functional relationship between the values and time. Thus, this method
is not helpful in forecasting and predicting the values on the basis of time.
2. No trend values for some years in the beginning and some in the end.
3. In case of non-linear trend, the values obtained by this method are biased
in one or the other direction.
4. The period selection of moving average is a difficult task. Hence, great
care has to be taken in period selection, particularly when there is no
business cycle during that time.
Solved problem 4
The following table gives the average monthly production
(in thousands) of new passenger cars in the period 1976-1985. Calculate
four yearly moving averages.
Table 14.6: Average Monthly Production of Cars
Year Average monthly production of new cars
1976 708
1977 767
1978 764
1979 702
1980 533
1981 521
1982 421
1983 562
1984 635
1985 667
Solution:
The following table depicts four yearly moving averages.
Table 14.6a: Four yearly moving averages
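Year   Production   4-yearly   4-yearly   Total of two   Centred moving
                    moving     moving     adjacent       average
                    total      average    averages       (trend)
1976   708
1977   767
                    2941       735.25
1978   764                                1426.75        713.375
                    2766       691.5
1979   702                                1321.5         660.75
                    2520       630
1980   533                                1174.25        587.125
                    2177       544.25
1981   521                                1053.5         526.75
                    2037       509.25
1982   421                                1044           522
                    2139       534.75
1983   562                                1106           553
                    2285       571.25
1984   635
1985   667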
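The four yearly moving averages of Solved problem 4 can be checked with a short sketch. The data are taken from Table 14.6; because the period is even, each pair of adjacent averages is averaged again so that the trend value falls against a year. The variable names are illustrative only.

```python
# Average monthly production of new cars (in thousands), Table 14.6.
years = list(range(1976, 1986))
cars = [708, 767, 764, 702, 533, 521, 421, 562, 635, 667]
period = 4

# Four yearly moving totals and moving averages (these fall between two years).
totals = [sum(cars[i:i + period]) for i in range(len(cars) - period + 1)]
averages = [total / period for total in totals]

# Centring: the mean of each pair of adjacent averages falls against a year.
centred = [(a + b) / 2 for a, b in zip(averages, averages[1:])]
centred_years = years[period // 2: period // 2 + len(centred)]

for year, value in zip(centred_years, centred):
    print(year, value)
```

The centred values run from 1978 to 1983; the first two and the last two years have no trend value.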
Key Statistic
Let ‘Y’ be the actual values and ‘Yc’ be the computed values of ‘Y’ for
a given value of ‘X’.
Let ‘Y = a + bX’ be a straight line to be fitted for trend. The values of
‘a’ and ‘b’ are found such that the sum of squares of differences of the
actual and computed values of ‘Y’ is least, that is,
$$\sum (Y - Y_c)^2 \text{ is least}$$
When the origin of ‘X’ is chosen such that $\sum X = 0$, the normal equations
give:
$$\sum Y = Na, \text{ therefore, } a = \frac{\sum Y}{N}$$
$$\sum XY = b\sum X^2, \text{ therefore, } b = \frac{\sum XY}{\sum X^2}$$
Solved problem 5
The production of pig iron and ferro alloys in thousand metric tons in India is
as given below. Find the trend line by the method of least squares.
Table 14.7: Production data
Year Production
1976 672
1977 824
1978 967
1979 1204
1980 1464
1981 1758
1982 2057
Solution:
The trend line can be fitted by using the method of least squares for the
given data.
Table 14.7a: Calculation for trend line
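Year    Production (Y)   X = Year - 1979   XY            X²
1976    672              -3                -2016         9
1977    824              -2                -1648         4
1978    967              -1                -967          1
1979    1204             0                 0             0
1980    1464             1                 1464          1
1981    1758             2                 3516          4
1982    2057             3                 6171          9
Total   ∑Y = 8946        ∑X = 0            ∑XY = 6520    ∑X² = 28
$$\sum Y = Na, \text{ therefore, } a = \frac{\sum Y}{N} = \frac{8946}{7} = 1278$$
$$\sum XY = b\sum X^2, \text{ therefore, } b = \frac{\sum XY}{\sum X^2} = \frac{6520}{28} = 232.9$$
$$Y = a + bX = 1278 + 232.9X$$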
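The least squares calculation of Solved problem 5 can be reproduced with a short sketch. The data are taken from Table 14.7 and the origin is placed at 1979 so that the X values sum to zero; the variable names are illustrative only.

```python
# Production of pig iron and ferro alloys (thousand metric tons), Table 14.7.
years = [1976, 1977, 1978, 1979, 1980, 1981, 1982]
production = [672, 824, 967, 1204, 1464, 1758, 2057]

# Shift the origin to the middle year so that the sum of X is zero.
origin = 1979
x = [year - origin for year in years]
assert sum(x) == 0

n = len(years)
sum_y = sum(production)
sum_xy = sum(xi * yi for xi, yi in zip(x, production))
sum_x2 = sum(xi * xi for xi in x)

# With sum(X) = 0 the normal equations reduce to these two expressions.
a = sum_y / n
b = sum_xy / sum_x2

# Trend values for every year of the series.
trend = [a + b * xi for xi in x]

print(a, round(b, 1))
```

The same two coefficients give a trend estimate for any other year by substituting the corresponding value of X, for example X = 4 for 1983.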
Merits:
1. This method is a completely objective method.
2. This method gives the trend values for the entire time period.
3. This method can be used to forecast future trend because trend line
establishes a functional relationship between the value and the time.
Demerits:
1. It requires many calculations and is tedious and complicated.
2. If even a single item is added to the series a new equation has to be formed.
3. Future forecasts made by this method are based only on trend values.
Seasonal, cyclical or irregular variations are ignored.
Non-linear trend
When the time series data do not conform to a linear trend, we
fit a non-linear trend. We do so by fitting a parabolic curve or another
non-linear curve by the method of least squares. For this we use an equation
of the form
$$Y = a + bX + cX^2 + dX^3 + \dots + kX^n$$
which is known as a polynomial of degree ‘n’ in ‘X’, k ≠ 0.
Let the parabolic curve be
$$Y = a + bX + cX^2$$
The values of a, b, and c can be determined by solving the normal
equations:
$$\sum Y = Na + b\sum X + c\sum X^2$$
$$\sum XY = a\sum X + b\sum X^2 + c\sum X^3$$
$$\sum X^2 Y = a\sum X^2 + b\sum X^3 + c\sum X^4$$
If we can change the origin to a suitable point, such that ‘∑X = 0’ (and
‘∑X³ = 0’), then the normal equations reduce to:
$$\sum Y = Na + c\sum X^2$$
$$\sum XY = b\sum X^2$$
$$\sum X^2 Y = a\sum X^2 + c\sum X^4$$
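The reduced normal equations for a parabolic trend can be solved directly. The sketch below reuses the production data of Table 14.7 purely as an illustration (the text itself fits only a straight line to these data) and assumes the origin is placed at the middle year, so that the sums of X and X³ are zero.

```python
# Illustrative data: production figures of Table 14.7, origin at 1979.
years = list(range(1976, 1983))
y = [672, 824, 967, 1204, 1464, 1758, 2057]
x = [year - 1979 for year in years]

n = len(x)
sum_y = sum(y)
sum_x2 = sum(xi ** 2 for xi in x)
sum_x4 = sum(xi ** 4 for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2y = sum(xi ** 2 * yi for xi, yi in zip(x, y))

# The reduced equations are valid only when sum(X) and sum(X^3) are zero.
assert sum(x) == 0 and sum(xi ** 3 for xi in x) == 0

# Second reduced equation gives b directly.
b = sum_xy / sum_x2

# First and third reduced equations form a 2 x 2 system in a and c:
#   sum_y   = n * a      + c * sum_x2
#   sum_x2y = a * sum_x2 + c * sum_x4
det = n * sum_x4 - sum_x2 * sum_x2
a = (sum_y * sum_x4 - sum_x2 * sum_x2y) / det
c = (n * sum_x2y - sum_x2 * sum_y) / det

print(round(a, 3), round(b, 3), round(c, 3))
```

A positive value of c indicates that the yearly increase itself grows over the period, which a straight line cannot capture.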
Key Statistic
The additive model assumes that the observed value is the sum of four
components of time series, that is,
Y=T+S+C+I
where,
Y = original data
T = trend value
S = seasonal component
C = cyclical component
I = irregular component
The additive model for decomposition of time series assumes that all the
four components of the time series operate independently of one another. It
also assumes that the behaviour of components is additive in character. It is
to be noted that only absolute values are added or deducted from the trend
value to arrive at the observed value.
Key Statistic
The multiplicative model assumes that the observed value is obtained by
multiplying the trend (T) by the rates of three other components, that is,
Y = T × S × C × I
where,
Y = original data
T = trend value
S = seasonal component
C = cyclical component
I = irregular component
The multiplicative model assumes that the components, although due to
different causes, are not necessarily independent and they can affect one
another. It also assumes that the behaviour of components is of
multiplicative character. It is to be noted that except the value of trend, all
the other values on the right hand side are rates or index numbers.
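The difference between the two models can be seen in a small numerical sketch. All component values below are hypothetical: in the additive model the seasonal, cyclical and irregular components are absolute amounts, whereas in the multiplicative model they are rates applied to the trend.

```python
# Hypothetical trend value for one period.
trend = 200.0

# Additive model: components are absolute amounts in the units of the data.
seasonal_amount, cyclical_amount, irregular_amount = 15.0, -5.0, 2.0
y_additive = trend + seasonal_amount + cyclical_amount + irregular_amount

# Multiplicative model: components are rates (index number / 100).
seasonal_rate, cyclical_rate, irregular_rate = 1.10, 0.95, 1.00
y_multiplicative = trend * seasonal_rate * cyclical_rate * irregular_rate

print(y_additive, round(y_multiplicative, 2))
```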
3. Price changes
Adjustment for price changes becomes necessary wherever we have real
value changes. Current values are to be deflated by the ratio of current
prices to base year prices.
4. Comparability
The data which gets analysed should be comparable in order to have a valid
conclusion. When we deal with the analysis of time series it involves data
relating to the past which must be homogeneous and comparable.
Therefore, efforts should be made to make the data as homogeneous and
comparable as possible.
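The adjustment for price changes described above, in which current values are deflated by the ratio of current prices to base year prices, can be sketched as follows. The function name `deflate` and all figures are hypothetical and serve only as an illustration.

```python
def deflate(current_value, current_price, base_price):
    """Express a current value at base year prices."""
    price_ratio = current_price / base_price
    return current_value / price_ratio


# Hypothetical figures: sales of 1200 at a price level that is
# 1.5 times the base year price level.
real_sales = deflate(current_value=1200.0, current_price=150.0, base_price=100.0)

print(real_sales)
```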
Symbolically, seasonal index for first term is given by:
$$I_1 = \frac{S_1}{\bar{S}} \times 100$$
Where, $S_1$ = Average of first term
$\bar{S}$ = Average of all terms = $\sum S_j / k$, where j = 1, 2, 3, 4……..k
k = 12 for monthly data
k = 4 for quarterly data
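The simple average calculation can be sketched for quarterly data (k = 4). The figures for the three years below are hypothetical and are used only to show the arithmetic.

```python
# Hypothetical quarterly data for three years (rows: years, columns: quarters).
data = [
    [30, 40, 36, 34],
    [34, 52, 50, 44],
    [40, 58, 54, 48],
]
k = 4  # number of seasons (quarters)

# Average of each quarter over all the years.
quarter_averages = [
    sum(year[q] for year in data) / len(data) for q in range(k)
]

# Average of all the quarterly averages.
overall_average = sum(quarter_averages) / k

# Seasonal index of each quarter as a percentage of the overall average.
seasonal_indices = [avg / overall_average * 100 for avg in quarter_averages]

print([round(index, 2) for index in seasonal_indices])
```

By construction the seasonal indices add up to 100 × k, that is, 400 for quarterly data.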
The merits and demerits of the simple average method are depicted in
Table 14.9.
Table 14.9: Merits and Demerits of Simple Average Method
Merits:
1. This method is the simplest one.
Demerits:
1. Most economic time series have trends and therefore, the seasonal index
computed by this method is really an index of trends and seasons.
ii) Under additive model, from each original value, the corresponding
moving average is deducted to find out short time fluctuations, which is
given as:
Y–T=S+C+I
iii) By preparing a separate table, monthly (or quarterly) short time
fluctuations are added for each month (or quarter) over all the years
and their average is obtained. These averages are known as seasonal
variations for each month or quarter.
iv) If we want to isolate / measure irregular variations, the mean of the
respective month or quarter is deducted from the short time
fluctuations.
14.8.3 Chain or link relative method
The steps involved in the chain or link relative method are described below.
i) Each quarterly or monthly value is divided by the preceding quarterly
or monthly value and the result is multiplied by 100. These
percentages are known as ‘Link Relatives’ of the seasonal values.
Thus:
$$\text{Chain Relative} = \frac{\text{Average Link Relative of current year} \times \text{Chain Relative of previous year}}{100}$$
iv) The second chain relative of first is computed on the basis of the chain
relative for the last:
$$\text{The Chain Relative of first quarter} = \frac{\text{Average Link Relative of the first quarter} \times \text{Chain Relative of the last}}{100}$$
This chain relative may or may not be 100. It is not equal to 100 due
to secular trend. If it is 100, go to ‘step vi’, if it is not 100, go to ‘step
v’ and then go to ‘step vi’.
v) Compute the difference ‘d’ between the new chain relative of the first
quarter obtained in ‘step iv’ and the chain relative assumed as 100. ‘d’ is divided
by the number of seasons and the resulting figure is multiplied by
1, 2, 3 and the product is deducted respectively from the chain
relatives of 2nd, 3rd, and 4th quarters. These are called corrected
relatives.
vi) The seasonal indices are obtained when the corrected chain relatives
are expressed as percentage of their relative averages.
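The chain relative steps can be traced with a short sketch for quarterly data. The average link relatives below are hypothetical, and the sketch assumes that the chain relative of the first quarter is initially taken as 100 and that each later quarter is chained to the one before it.

```python
# Hypothetical average link relatives for quarters 1 to 4 (illustration only).
avg_link_relatives = [80.0, 120.0, 110.0, 95.0]
seasons = len(avg_link_relatives)

# Chain relatives: the first quarter is taken as 100 and every later quarter
# is linked to the chain relative of the quarter before it.
chain = [100.0]
for link in avg_link_relatives[1:]:
    chain.append(link * chain[-1] / 100)

# Step iv: second chain relative of the first quarter, based on the last one.
new_first = avg_link_relatives[0] * chain[-1] / 100

# Step v: the difference d is spread over the seasons; 1, 2 and 3 times
# d / seasons is deducted from the 2nd, 3rd and 4th chain relatives.
d = new_first - 100.0
corrected = [cr - (d / seasons) * i for i, cr in enumerate(chain)]

# Step vi: corrected chain relatives as percentages of their average.
average = sum(corrected) / seasons
seasonal_indices = [cr / average * 100 for cr in corrected]

print([round(index, 2) for index in seasonal_indices])
```

Because the indices are percentages of their own average, they always average to 100.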
14.8.4 Ratio to trend method
The following steps are considered to determine seasonal indices by this
method:
i) Determine the trend values by the method of least squares.
ii) To find ratio to trend, divide the original data by the corresponding
trend values and multiply these ratios by 100, that is,
$$\text{Ratio to Trend} = \frac{\text{Original Data}}{\text{Trend Value}} \times 100$$
iii) Calculate the arithmetic mean of the trend ratios obtained in ‘step ii’.
iv) Finally, all the trend ratios will be converted into seasonal indices. Add
all averages obtained in ‘step iii’ and find their general average.
Seasonal indices are calculated by using the following formula:
$$\text{Seasonal Indices} = \frac{\text{Quarterly Averages}}{\text{General Averages}} \times 100$$
$$\hat{Y}_t = \bar{Y}$$
In this method the trend effect and cyclic effects do not come into account.
14.9.2 Naive forecast
In this method we forecast the value, for the time period t, to be equal to the
actual value observed in the previous period, that is, time period (t-1). This
is given as:
$$\hat{Y}_t = Y_{t-1}$$
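The naive forecast can be compared with a forecast based on the mean in a short sketch. The production figures of Table 14.5 are reused here only as an illustration, and the mean forecast is assumed to be the arithmetic mean of all past observations.

```python
# Production data from Table 14.5 (in lakh ton), 1988 to 1996.
production = [21, 22, 23, 25, 24, 22, 25, 27, 26]

# Mean forecast (assumed form): the average of all observed values.
mean_forecast = sum(production) / len(production)

# Naive forecast: the value observed in the previous period (t - 1).
naive_forecast = production[-1]

print(round(mean_forecast, 2), naive_forecast)
```

For a series with a rising trend, such as this one, the naive forecast stays closer to the recent level than the mean of the whole series.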
Activity
Find seasonal variations by the ratio to trend method from the data given
below:
Activity Solution
Year    Yearly Total   Quarterly Average (Y)   X        X²         XY          Trend Values
1994    280            70                      -2       4          -140        64
1995    360            90                      -1       1          -90         88
1996    400            100                     0        0          0           112
1997    520            130                     1        1          130         136
1998    680            170                     2        4          340         160
Total                  ∑Y = 560                ∑X = 0   ∑X² = 10   ∑XY = 240
$$\sum XY = b\sum X^2, \text{ therefore, } b = \frac{\sum XY}{\sum X^2} = \frac{240}{10} = 24$$
14.10 Summary
Let us recapitulate the important concepts discussed in this unit:
A time series is a set of numerical values of a given variable listed at
successive intervals of time.
The time series is classified into the following four components: Long
term trend or secular trend, Seasonal variations, Cyclic variations and
Random variations.
The methods of measuring the trend of a time series are: Free hand or
graphic methods, Semi averages method, Moving average method and
Method of least squares.
The forecasting methods using time series are: Mean forecast, Naïve
forecast, Linear trend forecast, Non-linear trend forecast and Forecasting
with exponential smoothing.
14.11 Glossary
Chain Relative: Ratio of seasonal index for the quarter to the seasonal
index of the previous quarter
Link Relative: Ratio of value of the variable for the quarter to the value of
the variable of the previous quarter.
Random: Changes in data value due to the factors other than trend and
seasonal.
Seasonal: Periodical changes in data values over regular intervals of time.
Time series: A set of observations recorded over a period of time.
Trend: The tendency in the data values either to increase or decrease.
14.13 Answers
Terminal Questions
1. Refer section 14.2
2. Refer section 14.4.2 and section 14.4.3
3. Refer section 14.5
4. Refer section 14.5.3
5. Refer section 14.4 and section 14.5
6. Refer section 14.4
7. Refer section 14.8
8. The equation of the straight line is given as: Y = 90 + 2X
The trend values are 84, 86, 88, 90, 92, 94, 96.
9. The seasonal values obtained are 98.66, 110.74, 95.30, 95.30.
Table 14.12: The number of people visiting a hotel’s webpage per month
Jan Feb Mar Apr May Jun July Aug Sep Oct Nov Dec
2006-07 420 100 300 344 300 200 344 766 899 900 788 455
2007-08 620 399 345 455 677 355 766 500 799 800 880 555
2008-09 520 289 400 644 566 677 500 800 899 900 680 666
Using suitable methods with justification, forecast the number of visitors
to the web page for all the months in the academic year Jan 2010-2011.
References:
Agarwal B.L. (2006) Basic Statistics. 4th Ed., New Age International
Publishers.
Anderson, David R., Sweeney. Dennis J. & Williams, Thomas A.
5th ed, Thomson Business Information Pvt Ltd.
Bowerman, B. L. & O Connel, R.T. Applied Statistics: Improving
Business Processes. (1996) Irwin.
Levin, Richard I. , Rubin, David S. (2008) Statistics for Management.
7th Ed. PHI Learning Private Limited.
Manipal University Jaipur Page No. 522
Statistics for Management Unit 14
Pisani, Freedman D.R. & Purves, R. Statistics. (1997) 3rd Ed. W.W
Norton.
Srivastava & Shailaja Rejo, T.N. (2008) Statistics for Management
5th Ed.TMH.
Tanur, J.M. (2002) Statistics: A Guide to the unknown. 4th Ed. Brooks
/cole..
Tukey, J.W. (1977) Exploratory Data Analysis. Addison –Wesley.
Wilcox, Rand R. (2009) Basic Statistics – Understanding Conventional
Methods and Modern Insights. Oxford University Press.
E-References:
http://www.textbooksonline.tn.nic.in/Books/11/Stat-EM/Chapter-1.pdf
15.1 Introduction
In the previous unit, we studied about the definition and components of time
series. We also studied about different forecasting methods using time
series analysis. In this unit, we will discuss about the meaning and definition
of index numbers. We will also study the different kinds of index numbers
and their limitations.
We know that almost all values change, and so we wish to know the
changes that have taken place over a period of time. For example, we may want to
know how much the prices of various essential household items increase or
decrease, so as to make the necessary adjustments to the monthly budget.
Consequently, in all such situations, an average measure needs to be
defined to compare such difference over a time period. Index numbers are
yardsticks for describing such differences. These differences may have to
do with the physical quantities of the goods, the prices of the commodities,
or such concepts as ‘efficiency’, ‘intelligence’ or ‘beauty’. The comparison
may be between the periods of time, between places, between categories,
etc.
We can have index numbers comparing the cost of living at different times
or in different localities or countries. Index numbers are used in comparison
of the physical volume of production in different years. However, we must
confine most of our attention to the construction of index numbers
measuring changes over time.
Objectives:
After studying this unit, you should be able to:
represent a data set in an index number form
describe how much the economic variables have changed over time
describe three principal types of indices: price indices, quantity indices,
and value indices
calculate various kinds of index numbers
15.1.1 Relevance
The CEO of Bestview television was quite happy with the excellent growth in
the company’s television sales. However, he was confronted by the
employees union and the officers association to increase the pay package
every quarter in order to compensate for the cost of living, as reflected by
the wholesale and consumer price indices, released by the government of
India. After being apprised of the calculation of these indices, the CEO felt
that these indices, which were for all of India, might not be relevant for his
company, as all of its operations were at one location near Mumbai. He
therefore asked the HRD department to coordinate with the management
Key statistic
An index number is a statistical measure which is designed to express
changes or differences in a variable or a group of related variables. It is
usually expressed in percentage form.
Solved Problem 1:
The price of a commodity in India in 2001 was Rs. 95 per kg and in 2000 it
was Rs.80 per kg. Calculate the price relative for the year 2001.
Solution: The price relative for 2001, (using 2000 as base) is calculated as:
$$\text{Price relative for 2001} = \frac{95}{80} \times 100 = 118.75\%$$
Hence, the price relative for 2001 is 118.75 %.
II. Production relative
Let us understand production relative with an example.
Solved Problem 2:
If the wheat production in India in 2002 was 5,82,000 metric tons and in
2004, it was 6,96,000 metric tons, then calculate the production relative for
2004.
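Both relatives follow the same arithmetic: the current value divided by the base value, multiplied by 100. The sketch below applies it to the figures of Solved Problems 1 and 2, taking 2002 as the base year for the production relative (an assumption, since it is the only other year given); the function name `relative` is illustrative only.

```python
def relative(current, base):
    """Return the current value as a percentage of the base value."""
    return current / base * 100


# Solved Problem 1: price per kg of Rs. 95 in 2001 against Rs. 80 in 2000.
price_relative_2001 = relative(95, 80)

# Solved Problem 2: wheat production of 6,96,000 metric tons in 2004
# against 5,82,000 metric tons in 2002 (assumed base year).
production_relative_2004 = relative(696000, 582000)

print(round(price_relative_2001, 2), round(production_relative_2004, 2))
```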
Example 1:
If the prices of 2005 are compared with the prices of 2004, then 2005 is
the current year and 2004 is the base year. The index number of 2005
based on 2004, is denoted by ‘Q01’ or ‘P01’, where subscript ‘0’ stands for
the year 2004, and subscript ‘1’ stands for the year 2005.
4. Specified averages
An index number represents a special case of an average, in general known as
a weighted average. It is a special type of average because, in a simple
average, the data is homogenous and has the same unit of measurement,
whereas in an index number the variables being averaged have different units
of measurement.
5. Basis of comparison
Index numbers by their very nature are comparative. They compare
changes over time or between places or similar categories.
15.2.5 Main steps in the construction of index numbers
Many problems and difficulties are encountered in following the steps
involved in the construction of index numbers, and they need to be
considered carefully. The steps are discussed in detail below.
1. Purpose of index number
The steps which are taken in the construction of index numbers generally
depend on the purpose of the index number. Hence, the purpose of index
numbers must be defined clearly and precisely. For example, the purpose of
the general wholesale price index number is to know the
general price level. On the other hand, the purpose of the consumer price
index number is to give an idea of the effect of the change in retail prices on
the cost of living of different classes of people.
2. Selection of base period
The base period of an index number is the period of time against which the
comparisons are made. There are three types of base periods.
Fixed base (a single period)
Fixed base (an average of selected periods)
Chain base
While selecting the base, a decision has to be made whether to use a
fixed base or a chain base.
Fixed base (a single period): In a fixed base (a single period), the base
period must be a normal period. A normal period means that the period
must be free from all sorts of abnormalities or random causes such as
financial crisis, floods, famines, earth quakes, strikes of labourers, wars, etc.
The base period should be a period for which reliable figures are available.
The base period should not be too distant in the past.
Fixed base (an average of selected periods): When it is difficult to choose
just one single period as the normal, then a better choice is an average of
several periods.
Chain base: If the comparisons are required from year to year, a system of
chain base is used.
3. Selection of commodities
The following problems can occur while selecting the commodities.
The first problem is deciding how many commodities to include, because it is
not feasible to include all commodities. The purpose of the index number
helps in deciding the number of commodities.
Another problem is to decide which commodities are to be included. The
commodities must be selected carefully so that each one meets the following
conditions:
It should represent the real tastes, habits and customs of the people.
It should be of a standard quality and there must be no significant
variation in the quality.
It must be easily recognisable.
It should be a tangible commodity, not an intangible one such as a personal
service.
4. Selection of the representative prices
In the collection of price quotations we have to consider the following points:
The method of quoting prices of the commodities
The type of quotations - whether wholesale prices or retail prices
The place from where the quotations are to be obtained
5. Assignment of Weights
The term ‘weight’ refers to the relative importance of the different
commodities included in the construction of index numbers. There are two
methods of assigning weights. They are:
Implicit method: In this method, several varieties of a certain type of
commodity under study are used. Such weights are called implicit weights.
Explicit method: In this method, the weights are laid down on the basis of
one outward evidence of importance of commodities. One of the problems in
the selection of appropriate weight is to decide this evidence. Another
Unweighted index numbers can be further divided into two categories. They
are:
I. Simple aggregative method
II. Simple average of relatives method
Weighted index numbers can also be further divided into two categories.
They are:
I. Weighted aggregative method
II. Weighted average of relatives method
15.3.1 Unweighted index numbers
i. Simple aggregative method
To construct a price index by simple aggregative method, we proceed by
doing the following:
Add the prices of all commodities in the current year, that is, find $\sum P_1$
Add the prices of all commodities in the base year, that is, find $\sum P_0$
Divide the total of current year prices by the total of base year prices and
multiply the quotient by 100, that is,
$$P_{01} = \frac{\sum P_1}{\sum P_0} \times 100$$
where, ‘P01’ is the simple price index number of the current year based on
the base year.
Solved Problem 3:
Find the simple aggregative price index from the data displayed in table
15.1.
Table 15.1: Price of Commodities for the Years 2000 and 2004
Commodity    Unit            Price in Rs. per unit
                             2000      2004
A            One kilogram    10        15
B            One kilogram    40        30
C            One dozen       10        12
D            One litre       5         13
Total                        65        70
Solution: The price index number of 2004 is based on 2000. Using the
formula:
$$P_{01} = \frac{\sum P_1}{\sum P_0} \times 100$$
Therefore,
$$P_{01} = \frac{70}{65} \times 100 = 107.7$$
This implies that the prices had increased by 7.7% in the year 2004 as
compared to the year 2000.
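The calculation in Solved Problem 3 can be checked with a short sketch. The function name is an illustrative choice, not part of the original text; the prices are those of Table 15.1.

```python
def simple_aggregative_index(base_prices, current_prices):
    """Return 100 * (sum of current-year prices) / (sum of base-year prices)."""
    return sum(current_prices) / sum(base_prices) * 100


# Prices from Table 15.1 for commodities A, B, C and D (Rs. per unit).
prices_2000 = [10, 40, 10, 5]
prices_2004 = [15, 30, 12, 13]

p01 = simple_aggregative_index(prices_2000, prices_2004)
print(round(p01, 1))  # 107.7
```

Note that the prices are simply added even though the units differ (kilogram, dozen, litre), so a commodity with a large price per unit dominates the result.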
Table 15.1 depicts the merits and demerits of simple aggregative method.
Solved Problem 4:
The prices of three different commodities for 2002 and 2003 are displayed in
table 15.4. The prices are given per ton of the commodity. Taking the
year 2002 as base, calculate the price index by the simple average of
relatives method, using both the arithmetic mean and the geometric mean.
Key Statistic
For the construction of the price index number, quantity weights are used.
If ‘w’ is the weight attached to a commodity, then the price index is given
by:
$$\text{Price Index } P_{01} = \frac{\sum P_1 w}{\sum P_0 w} \times 100$$
Solved Problem 5:
Compute Laspeyre’s price and quantity index number for the following data:
Table 15.6: Price and Quantity of commodities in the base year and current
year
             Base year            Current year
Commodity    Price    Quantity    Price    Quantity
A            3        25          5        28
B            1        50          3        60
C            2        30          1        30
D            5        15          6        12
Solution: Base year price and quantity are denoted by P0 and Q0 and
current year price and quantity are denoted by P1 and Q1 respectively.
Table 15.6a
$$Q_{01}^{L} = \frac{\sum Q_1 P_0}{\sum Q_0 P_0} \times 100 = \frac{264}{260} \times 100 = 101.54$$
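As a check on Solved Problem 5, the following sketch computes both Laspeyre's indices from the data in Table 15.6. The function and variable names are illustrative assumptions; the formulas use base year quantities (for the price index) and base year prices (for the quantity index) as weights.

```python
def laspeyres_price_index(p0, q0, p1):
    """Laspeyre's price index: prices weighted by base-year quantities."""
    numerator = sum(p * q for p, q in zip(p1, q0))    # sum of P1 * Q0
    denominator = sum(p * q for p, q in zip(p0, q0))  # sum of P0 * Q0
    return numerator / denominator * 100


def laspeyres_quantity_index(p0, q0, q1):
    """Laspeyre's quantity index: quantities weighted by base-year prices."""
    numerator = sum(q * p for q, p in zip(q1, p0))    # sum of Q1 * P0
    denominator = sum(q * p for q, p in zip(q0, p0))  # sum of Q0 * P0
    return numerator / denominator * 100


# Data from Table 15.6 for commodities A, B, C and D.
p0 = [3, 1, 2, 5]      # base year prices
q0 = [25, 50, 30, 15]  # base year quantities
p1 = [5, 3, 1, 6]      # current year prices
q1 = [28, 60, 30, 12]  # current year quantities

print(round(laspeyres_price_index(p0, q0, p1), 2))     # 151.92
print(round(laspeyres_quantity_index(p0, q0, q1), 2))  # 101.54
```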
2. Paasche’s method
Paasche’s method is based on the current year’s quantities, which are used
as weights. Paasche’s Price Index is given by:
$$P_{01}^{P} = \frac{\sum P_1 Q_1}{\sum P_0 Q_1} \times 100$$
Solution
Paasche’s Price Index is given by:
$$P_{01}^{P} = \frac{\sum P_1 Q_1}{\sum P_0 Q_1} \times 100$$
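As an illustration of the formula, the sketch below applies Paasche's price index to the data of Table 15.6. Using that table here is an assumption made for illustration, and the function name is hypothetical.

```python
def paasche_price_index(p0, p1, q1):
    """Paasche's price index: prices weighted by current-year quantities."""
    numerator = sum(p * q for p, q in zip(p1, q1))    # sum of P1 * Q1
    denominator = sum(p * q for p, q in zip(p0, q1))  # sum of P0 * Q1
    return numerator / denominator * 100


# Data from Table 15.6 for commodities A, B, C and D.
p0 = [3, 1, 2, 5]      # base year prices
p1 = [5, 3, 1, 6]      # current year prices
q1 = [28, 60, 30, 12]  # current year quantities

paasche = paasche_price_index(p0, p1, q1)
print(round(paasche, 2))  # 159.85
```

Here the numerator is 422 and the denominator is 264, so the index differs from Laspeyre's only through the choice of quantities used as weights.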
Solved problem 7:
For the solved example 5, compute Dorbish and Bowley’s and Fisher index
numbers.
Solution: Dorbish and Bowley’s index number is given by
$$P_{01}^{D} = \frac{\dfrac{\sum P_1 Q_0}{\sum P_0 Q_0} + \dfrac{\sum P_1 Q_1}{\sum P_0 Q_1}}{2} \times 100$$
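The two index numbers asked for in Solved Problem 7 can be sketched as follows for the data of Table 15.6. The names are illustrative. Fisher's index is taken here as the geometric mean of Laspeyre's and Paasche's price indices, and Dorbish and Bowley's index as their arithmetic mean, in line with the formula above.

```python
import math


def weighted_sum(prices, quantities):
    """Sum of price * quantity over all commodities."""
    return sum(p * q for p, q in zip(prices, quantities))


# Data from Table 15.6 for commodities A, B, C and D.
p0 = [3, 1, 2, 5]      # base year prices
q0 = [25, 50, 30, 15]  # base year quantities
p1 = [5, 3, 1, 6]      # current year prices
q1 = [28, 60, 30, 12]  # current year quantities

laspeyres = weighted_sum(p1, q0) / weighted_sum(p0, q0)  # base-year weights
paasche = weighted_sum(p1, q1) / weighted_sum(p0, q1)    # current-year weights

dorbish_bowley = (laspeyres + paasche) / 2 * 100   # arithmetic mean of the two
fisher = math.sqrt(laspeyres * paasche) * 100      # geometric mean of the two

print(round(dorbish_bowley, 2))  # 155.89
print(round(fisher, 2))          # 155.84
```

Because the geometric mean never exceeds the arithmetic mean, Fisher's index is never larger than Dorbish and Bowley's index for the same data.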
Table 15.7 depicts the merits and demerits of Fisher’s Index Number.
Table 15.7: Merits and Demerits of Fisher’s Index Number
Merits:
It is free from bias, upward as well as downward.
This formula takes into account both current year and base year prices and
quantities.
It satisfies both the ‘time reversal test’ and the ‘factor reversal test’.
This is why it is called an ideal index number.
Demerits:
This formula is difficult to interpret.
It is not a practical index to compute because it is excessively laborious.
It requires the prices and quantities for both the base year and the current
year.
where,
‘Q1’ and ‘Q0’ are the quantities for the current and base period respectively
‘Pn’ and ‘Qn’ are the prices and quantities that determine values that we use
for weights.
15.3.4 Value index numbers
The value index numbers are easy to calculate. Value is the product of price
and quantity. A simple value index number is equal to the value of the
current year divided by the value of the base year. If this value is multiplied
by 100 we get the value index number. The required formula is:
$$V = \frac{\sum P_1 Q_1}{\sum P_0 Q_0} \times 100$$
Simple value index number is given by:
$$V = \frac{\sum V_1}{\sum V_0} \times 100$$
where, V1 = value of the current year.
V0 = value of the base year.
Such index numbers are not weighted, because they do not take into
account either the price or the quantity. These index numbers are not very
popular because the situation revealed by prices and quantities is not fully
revealed by the values.
Solved problem 8:
For the solved example 5, compute Value index number.
Solution: The formula to compute Value index number is
$$V = \frac{\sum P_1 Q_1}{\sum P_0 Q_0} \times 100$$
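A minimal sketch of this calculation for the data of Table 15.6 (the table used in solved example 5) is given below. The function name is an illustrative assumption.

```python
def value_index(p0, q0, p1, q1):
    """Value index: total current-year value over total base-year value, times 100."""
    current_value = sum(p * q for p, q in zip(p1, q1))  # sum of P1 * Q1
    base_value = sum(p * q for p, q in zip(p0, q0))     # sum of P0 * Q0
    return current_value / base_value * 100


# Data from Table 15.6 for commodities A, B, C and D.
p0 = [3, 1, 2, 5]      # base year prices
q0 = [25, 50, 30, 15]  # base year quantities
p1 = [5, 3, 1, 6]      # current year prices
q1 = [28, 60, 30, 12]  # current year quantities

v = value_index(p0, q0, p1, q1)
print(round(v, 2))  # 162.31
```

The total value is 260 in the base year and 422 in the current year, so the index reflects the combined effect of the changes in prices and in quantities.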
Symbolically,
$$P_{01} \times P_{12} \times P_{20} = 1$$
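The circular test can be illustrated with the simple aggregative index, which satisfies it because each index is a ratio of price totals and the totals cancel around the circle. In the sketch below the prices for years 0 and 1 are taken from Table 15.1, while the prices for year 2 and the function name are hypothetical. The indices are kept as ratios, that is, without multiplying by 100.

```python
import math


def aggregative_ratio(prices_from, prices_to):
    """Simple aggregative index as a ratio (not multiplied by 100)."""
    return sum(prices_to) / sum(prices_from)


year0 = [10, 40, 10, 5]   # Table 15.1, year 2000
year1 = [15, 30, 12, 13]  # Table 15.1, year 2004
year2 = [18, 36, 15, 11]  # hypothetical third year, for illustration only

p01 = aggregative_ratio(year0, year1)
p12 = aggregative_ratio(year1, year2)
p20 = aggregative_ratio(year2, year0)

product = p01 * p12 * p20
print(math.isclose(product, 1.0))  # True
```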
Key statistic
Cost of living price index measures average change over time in the
prices paid by the consumer for specific basket of goods and services.
The cost of living price index numbers are designed to measure the
average change in the price paid by the ultimate consumers for specified
quantities of goods and services over a period of time.
Same goods: The goods consumed in the base year and the current year remain
unchanged.
No change in quantity of goods: It is also assumed that the quantity of goods
consumed remains the same in the base year and the current year.
Price quotations are same: It is assumed that the prices at different places
are the same and that they do not change frequently.
True on the average: Cost of living index numbers are true on the average.
Representative goods: The commodities included in the cost of living index
number represent the consumption of the class of people concerned.
15.5.3 Steps in construction of cost of living index numbers
There are 5 steps involved in construction of cost of living index numbers.
$$P_{01} = \frac{\sum PW}{\sum W}, \quad \text{where } P = \frac{P_1}{P_0} \times 100 \text{ for each item and}$$
W = value weight, i.e., P0Q0
Solved problem 9:
Calculate the cost of living index for the current year on the basis of the
base year from the following data, using
(a) Aggregate expenditure method and (b) Family budget method
Table 15.8: Data for problem 9
Article    Quantity consumed    Unit    Price in rupees
           in the base year             Base year    Current year
Wheat      8                    Kgs     16           17.2
Rice       4                    Kgs     12.2         13.5
Pulses     2                    Kgs     7.25         8.50
Milk       16                   Lts     2            2
Sugar      3                    Kgs     24           25
Solution:
Table 15.8a: Table for calculating cost of living index
Article    Quantity consumed    Price in rupees               P0Q0     P1Q0
           in the base year     Base year    Current year
           Q0                   P0           P1
Wheat      8                    16           17.2             128      137.6
Rice       4                    12.2         13.5             48.8     54
Pulses     2                    7.25         8.50             14.5     17
Milk       16                   2            2                32       32
Sugar      3                    24           25               72       75
Total                                                         295.3    315.6
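The totals in Table 15.8a lead directly to the index under the aggregate expenditure method, which is taken here to be the ratio of current year expenditure to base year expenditure on the base year quantities, multiplied by 100. The sketch below is a minimal check under that assumption; the function name is illustrative.

```python
def aggregate_expenditure_index(q0, p0, p1):
    """Cost of living index: 100 * sum(P1 * Q0) / sum(P0 * Q0)."""
    current_cost = sum(p * q for p, q in zip(p1, q0))  # cost of base basket now
    base_cost = sum(p * q for p, q in zip(p0, q0))     # cost of base basket then
    return current_cost / base_cost * 100


# Data from Table 15.8: wheat, rice, pulses, milk, sugar.
q0 = [8, 4, 2, 16, 3]             # quantities consumed in the base year
p0 = [16, 12.2, 7.25, 2, 24]      # base year prices in rupees
p1 = [17.2, 13.5, 8.50, 2, 25]    # current year prices in rupees

index = aggregate_expenditure_index(q0, p0, p1)
print(round(index, 2))  # 106.87
```

This corresponds to 315.6 / 295.3 x 100, using the column totals of Table 15.8a.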
Solution
Table 15.9a: Family Budget Method
Item                 W     Price P0    Price P1    P = (P1/P0) x 100    PW
Food                 35    150         140         93.33                3266.55
Rent                 20    75          90          120.00               2400.00
Clothing             10    25          30          120.00               1200.00
Fuel and lighting    15    50          60          120.00               1800.00
Miscellaneous        20    60          80          133.33               2666.60
                     ∑W = 100                                           ∑PW = 11333.15
$$P_{01} = \frac{\sum PW}{\sum W} = \frac{11333.15}{100} = 113.33$$
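The family budget calculation in Table 15.9a can be reproduced with a short sketch. The function name is an illustrative assumption. Because the code does not round the price relatives to two decimals, the intermediate total differs slightly from 11333.15, but the final index agrees to two decimal places.

```python
def family_budget_index(weights, p0, p1):
    """Weighted average of price relatives: sum(P * W) / sum(W)."""
    relatives = [cur / base * 100 for base, cur in zip(p0, p1)]  # P for each item
    weighted_total = sum(p * w for p, w in zip(relatives, weights))
    return weighted_total / sum(weights)


# Data from Table 15.9a: food, rent, clothing, fuel and lighting, miscellaneous.
weights = [35, 20, 10, 15, 20]
p0 = [150, 75, 25, 50, 60]
p1 = [140, 90, 30, 60, 80]

index = family_budget_index(weights, p0, p1)
print(round(index, 2))  # 113.33
```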
Activity
1. Calculate Fisher’s Ideal index and show that it satisfies the Time
and Factor Reversal Tests.
Table 15.11: Data for Activity
Commodities    Base Year            Current Year
               Price    Quantity    Price    Quantity
               P0       Q0          P1       Q1
A              6.5      500         10.8     560
B              2.8      124         2.9      148
C              4.7      69          8.2      78
D              10.9     38          13.4     24
E              8.6      49          10.8     27
Activity Solution
Table 15.11a: Table for calculation of index numbers
Commodities    P0Q0      P1Q0      P0Q1      P1Q1
A              3250      5400      3640      6048
B              347.2     359.6     414.4     429.2
C              324.3     565.8     366.6     639.6
D              414.2     509.2     261.6     321.6
E              421.4     529.2     232.2     291.6
Total          4757.1    7363.8    4914.8    7730
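Fisher’s ideal price index number (without multiplying by 100):
$$P_{01} = \sqrt{\frac{\sum P_1 Q_0}{\sum P_0 Q_0} \times \frac{\sum P_1 Q_1}{\sum P_0 Q_1}} = \sqrt{\frac{7363.8}{4757.1} \times \frac{7730}{4914.8}}$$
Time reversal test:
$$P_{10} = \sqrt{\frac{\sum P_0 Q_1}{\sum P_1 Q_1} \times \frac{\sum P_0 Q_0}{\sum P_1 Q_0}} = \sqrt{\frac{4914.8}{7730} \times \frac{4757.1}{7363.8}}$$
$$P_{01} \times P_{10} = \sqrt{\frac{7363.8}{4757.1} \times \frac{7730}{4914.8} \times \frac{4914.8}{7730} \times \frac{4757.1}{7363.8}} = \sqrt{1} = 1,$$
so the time reversal test is satisfied.
Factor reversal test:
$$P_{01} \times Q_{01} = \frac{\sum P_1 Q_1}{\sum P_0 Q_0} \quad \text{(without multiplying by 100)}$$
$$Q_{01} = \sqrt{\frac{\sum Q_1 P_0}{\sum Q_0 P_0} \times \frac{\sum Q_1 P_1}{\sum Q_0 P_1}} = \sqrt{\frac{4914.8}{4757.1} \times \frac{7730}{7363.8}}$$
$$P_{01} \times Q_{01} = \sqrt{\frac{7363.8}{4757.1} \times \frac{7730}{4914.8} \times \frac{4914.8}{4757.1} \times \frac{7730}{7363.8}} = \frac{7730}{4757.1} = \frac{\sum P_1 Q_1}{\sum P_0 Q_0}$$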
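The two tests in the activity can also be checked numerically. The sketch below recomputes the totals of Table 15.11a from the data of Table 15.11 and then verifies both tests, with the indices kept as ratios (not multiplied by 100). The helper name `total` and the variable names are illustrative assumptions.

```python
import math

# Table 15.11: (P0, Q0, P1, Q1) for commodities A to E.
data = [
    (6.5, 500, 10.8, 560),
    (2.8, 124, 2.9, 148),
    (4.7, 69, 8.2, 78),
    (10.9, 38, 13.4, 24),
    (8.6, 49, 10.8, 27),
]


def total(price_col, qty_col):
    """Sum of price * quantity using the given column positions."""
    return sum(row[price_col] * row[qty_col] for row in data)


p0q0 = total(0, 1)
p1q0 = total(2, 1)
p0q1 = total(0, 3)
p1q1 = total(2, 3)

# Fisher's ideal price index, its time-reversed form, and the quantity index.
p01 = math.sqrt((p1q0 / p0q0) * (p1q1 / p0q1))
p10 = math.sqrt((p0q1 / p1q1) * (p0q0 / p1q0))
q01 = math.sqrt((p0q1 / p0q0) * (p1q1 / p1q0))

time_reversal_ok = math.isclose(p01 * p10, 1.0)
factor_reversal_ok = math.isclose(p01 * q01, p1q1 / p0q0)
print(time_reversal_ok, factor_reversal_ok)  # True True
```

The comparisons use `math.isclose` rather than `==` because floating-point arithmetic reproduces the exact cancellations of the algebra only approximately.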
15.9 Summary
Let us recapitulate the important concepts discussed in this unit:
An index number is a number which is used as a device for comparison
between the price, quantity or value of a group of articles in different
situations.
In the computation of an index number the given year whose values are
to be compared is called a current year and the specified year whose
values are taken as standard is called a base year.
The various methods of constructing index numbers can be classified
into two groups. They are: unweighted index numbers and weighted
index numbers.
Unit test, time reversal test, factor reversal test and circular test are the
tests of adequacy of index numbers.
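The cost of living price index numbers, also known as consumer price
index numbers, are designed to measure the average change in the
price paid by the ultimate consumers for specified quantities of goods and
services over a period of time.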
15.10 Glossary
Explicit method: In this method, the weights are laid down on the basis of
one outward evidence of importance of commodities.
Implicit method: In this method, several varieties of a certain type of
commodity under study are used. Such weights are called implicit weights.
Index number: An index number is a number which is used to measure the
level of a certain phenomenon as compared to the level of the same
phenomenon at some standard period.
Price relative: The price of commodity in a given year expressed as a
percentage of the price of the same commodity in a specified year is called
price relative.
Relative: The value of a variable in a given year (or place) divided by the
value of the same variable in a specified year (or place) is called a relative.
It is generally expressed in percentage.
Table 15.12: Price of Commodities for the Years 1997 and 2005
Commodity    Base year 1997    Current year 2005
             Price    Qty      Price    Qty
A            16       110      25       132
B            5        220      5        264
C            10       132      15       165
D            25       66       30       55
5. The table 15.13 depicts the price of commodities along with the weights
of respective commodities. Calculate index number for 2000 based on
the year 1995.
Table 15.13: Price of Commodities along with the Weights
Commodity    1995    2000     Weights
A            0.50    0.75     2
B            0.60    0.75     5
C            2.00    2.40     4
D            1.80    2.10     8
E            8.00    10.00    1
15.12 Answers
$$P_{01} = \frac{\sum PW}{\sum W} = \frac{18995}{100} = 189.95$$
Hence, the consumer price index number by family budget method is
189.95.
Terminal Questions
1. Refer section 15.2, section 15.5.1, section 15.8
2. Refer section 15.2.5
3. Refer section 15.2.4
4. The Fisher ideal index is equal to 134.69
5. The required index number for the year 2000 is 123.3
ii. Using the index for India developed in question i, how would you
describe the percentage of 15- and 16-year-olds who admitted to being
drunk four times or more in a 30-day period in Bhutan, Bangladesh and
Sri Lanka?
iii. Using Pakistan as the base, develop a relative regional index for the
percentage of 15- and 16-year-olds who admitted to being drunk four
times or more in a 30-day period.
iv. Using the index for Pakistan developed in question iii, how would you
describe the percentage of 15- and 16-year-olds who admitted to being
drunk four times or more in a 30-day period in Bhutan, Bangladesh and
Sri Lanka?
v. Using China as the base, develop a relative regional index for the
percentage of 15- and 16-year-olds who admitted to being drunk four
times or more in a 30-day period.
vi. Using the index for China developed in question v, how would you
describe the percentage of 15- and 16-year-olds who admitted to being
drunk four times or more in a 30-day period in Bhutan, Bangladesh and
Sri Lanka?
vii. Based on the data, what general conclusion can you draw?