Business Tool For Decision Making

Statistics refers to numerical information used to analyze and interpret data related to various subjects and phenomena. It encompasses methods for collecting, organizing, and drawing conclusions from data, which can be classified as quantitative or qualitative. The importance of statistics in business includes planning operations, setting standards, and controlling processes, though it also has limitations in measuring non-quantifiable concepts.

Unit I

INTRODUCTION TO STATISTICS

INTRODUCTION

'Statistics' means numerical information expressed in quantitative terms. This information may relate to objects, subjects, activities, phenomena, or regions of space. As a matter of fact, data have no limits as to their reference, coverage, and scope. At the macro level, these are data on gross national product and the shares of agriculture, manufacturing, and services in Gross Domestic Product. At the micro level, individual firms, howsoever small or large, produce extensive statistics on their operations. The annual reports of companies contain a variety of data on sales, production, expenditure, inventories, capital employed, and other activities. These data are often field data, collected by employing scientific survey techniques. Unless regularly updated, such data are the product of a one-time effort and have limited use beyond the situation that may have called for their collection. A student knows statistics more intimately as a subject of study like economics, mathematics, chemistry, physics, and others. It is a discipline which scientifically deals with data, and is often described as the science of data. In dealing with statistics as data, the discipline has developed appropriate methods of collecting, presenting, summarising, and analysing data, and thus consists of a body of these methods.

MEANING AND DEFINITIONS OF STATISTICS

At the outset, it may be noted that the word 'statistics' is used rather curiously in two senses: plural and singular. In the plural sense, it refers to a set of figures or data. In the singular sense, statistics refers to the whole body of tools that are used to collect data, organise and interpret them and, finally, to draw conclusions from them. It should be noted that both aspects of statistics are important if quantitative data are to serve their purpose. If statistics, as a subject, is inadequate and consists of poor methodology, we cannot know the right procedure to extract from the data the information they contain. Similarly, if our data are defective, inadequate, or inaccurate, we cannot reach the right conclusions even though our subject is well developed.

A.L. Bowley has defined statistics as: (i) statistics is the science of counting, (ii) statistics may rightly be called the science of averages, and (iii) statistics is the science of measurement of the social organism regarded as a whole in all its manifestations. Boddington defined it as: statistics is the science of estimates and probabilities. Further, W.I. King has defined statistics in a wider context: the science of statistics is the method of judging collective, natural or social phenomena from the results obtained by the analysis, enumeration or collection of estimates.

Seligman explained that statistics is a science that deals with the methods of collecting, classifying, presenting, comparing and interpreting numerical data collected to throw some light on any sphere of enquiry. Spiegel defines statistics, highlighting its role in decision-making particularly under uncertainty, as follows: statistics is concerned with scientific methods for collecting, organising, summarising, presenting and analysing data as well as drawing valid conclusions and making reasonable decisions on the basis of such analysis.

According to Prof. Horace Secrist, statistics is the aggregate of facts, affected to a marked extent by a multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a systematic manner for a pre-determined purpose, and placed in relation to each other.

From the above definitions, we can highlight the major characteristics of statistics as follows:

1. Statistics are aggregates of facts. A single figure is not statistics. For example, the national income of a country for a single year is not statistics, but the same for two or more years is.

2. Statistics are affected by a number of factors. For example, the sale of a product depends on a number of factors such as its price, quality, competition, the income of the consumers, and so on.

3. Statistics must be reasonably accurate. Wrong figures, if analysed, will lead to erroneous conclusions. Hence, it is necessary that conclusions be based on accurate figures.

4. Statistics must be collected in a systematic manner. If data are collected in a haphazard manner, they will not be reliable and will lead to misleading conclusions.

5. Statistics must be collected for a pre-determined purpose.

6. Lastly, statistics should be placed in relation to each other. If one collects data unrelated to each other, such data will be confusing and will not lead to any logical conclusions. Data should be comparable over time and over space.

TYPES OF DATA AND DATA SOURCES

Statistical data are the basic raw material of statistics. Data may relate to an activity of our interest, a phenomenon, or a problem situation under study. They derive as a result of the process of measuring, counting and/or observing. Statistical data, therefore, refer to those aspects of a problem situation that can be measured, quantified, counted, or classified. Any object, subject, phenomenon, or activity that generates data through this process is termed a variable. In other words, a variable is one that shows a degree of variability when successive measurements are recorded. In statistics, data are classified into two broad categories: quantitative data and qualitative data. This classification is based on the kind of characteristics that are measured.

Quantitative data are those that can be quantified in definite units of measurement. These refer to characteristics whose successive measurements yield quantifiable observations. Depending on the nature of the variable observed for measurement, quantitative data can be further categorised as continuous and discrete data.

Obviously, a variable may be a continuous variable or a discrete variable.

(i) Continuous data represent the numerical values of a continuous variable. A continuous variable is one that can assume any value between any two points on a line segment, thus representing an interval of values. The values are quite precise and close to each other, yet distinguishably different. Characteristics such as weight, length, height, thickness, velocity, temperature, tensile strength, etc., represent continuous variables. Thus, the data recorded on these and similar other characteristics are called continuous data. It may be noted that a continuous variable assumes the finest unit of measurement, finest in the sense that it enables measurements to the maximum degree of precision.

(ii) Discrete data are the values assumed by a discrete variable. A discrete variable is one whose outcomes are measured in fixed numbers. Such data are essentially count data. These are derived from a process of counting, such as the number of items possessing or not possessing a certain characteristic. The number of customers visiting a departmental store every day, the incoming flights at an airport, and the defective items in a consignment received for sale are all examples of discrete data.

Qualitative data refer to qualitative characteristics of a subject or an object. A characteristic is qualitative in nature when its observations are defined and noted in terms of the presence or absence of a certain attribute in discrete numbers. These data are further classified as nominal and rank data.

(i) Nominal data are the outcome of classification into two or more categories of items or units comprising a sample or a population according to some quality characteristic. Classification of students according to sex (as males and females), of workers according to skill (as skilled, semi-skilled, and unskilled), and of employees according to the level of education (as matriculates, undergraduates, and post-graduates) all result in nominal data. Given any such basis of classification, it is always possible to assign each item to a particular class and make a summation of items belonging to each class. The count data so obtained are called nominal data.

(ii) Rank data, on the other hand, are the result of assigning ranks to specify order in terms of the integers 1, 2, 3, ..., n. Ranks may be assigned according to the level of performance in a test, a contest, a competition, an interview, or a show. The candidates appearing in an interview, for example, may be assigned ranks in integers ranging from 1 to n, depending on their performance in the interview. Ranks so assigned can be viewed as the continuous values of a variable involving performance as the quality characteristic.

Data sources could be seen as of two types, viz., secondary and primary. The two can be defined as under:

(i) Secondary data: They already exist in some form, published or unpublished, in an identifiable secondary source. They are generally available from published source(s), though not necessarily in the form actually required.

(ii) Primary data: Those data which do not already exist in any form, and thus have to be collected for the first time from the primary source(s). By their very nature, these data require fresh and first-time collection covering the whole population or a sample drawn from it.

SCOPE OF STATISTICS

Apart from the methods comprising the scope of the descriptive and inferential branches of statistics, statistics also consists of methods for dealing with a few other issues of a specific nature. Since these methods are essentially descriptive in nature, they have been discussed here as part of descriptive statistics. These are mainly concerned with the following:

(i) It often becomes necessary to examine how two paired data sets are related. For example, we may have data on the sales of a product and the expenditure incurred on its advertisement for a specified number of years. Given that sales and advertisement expenditure are related to each other, it is useful to examine the nature of the relationship between the two and quantify the degree of that relationship. As this requires the use of appropriate statistical methods, these fall under the purview of what we call regression and correlation analysis.


(ii) Situations occur quite often when we require averaging (or totalling) of data on prices and/or quantities expressed in different units of measurement. For example, the price of cloth may be quoted per meter of length and that of wheat per kilogram of weight. Since ordinary methods of totalling and averaging do not apply to such price/quantity data, special techniques needed for the purpose are developed under index numbers.

(iii) Many a time, it becomes necessary to examine the past performance of an activity with a view to determining its future behaviour. For example, when engaged in the production of a commodity, monthly product sales are an important measure for evaluating performance. This requires compilation and analysis of relevant sales data over time. The more complex the activity, the more varied the data requirements. For profit maximising and future sales planning, a forecast of the likely sales growth rate is crucial. This needs careful collection and analysis of past sales data. All such concerns are taken care of under time series analysis.

(iv) Obtaining the most likely future estimates on any aspect(s) relating to a business or economic activity has indeed been engaging the minds of all concerned. This is particularly important when it relates to product sales and demand, which serve as the necessary basis for production scheduling and planning. Regression, correlation, and time series analyses together help develop the basic methodology to do the needful. Thus, the study of methods and techniques of obtaining likely estimates on business/economic variables comprises the scope of what we do under business forecasting.

Keeping in view the importance of inferential statistics, the scope of statistics may finally be restated as consisting of statistical methods which facilitate decision-making under conditions of uncertainty. While the term statistical methods is often used to cover the subject of statistics as a whole, in particular it refers to methods by which statistical data are analysed, interpreted, and inferences drawn for decision-making.

Though generic in nature and versatile in their applications, statistical methods have come to be widely used, especially in all matters concerning business and economics. They are also being increasingly used in biology, medicine, agriculture, psychology, and education. The scope of application of these methods has started opening up and expanding in a number of social science disciplines as well. Even a political scientist finds them of increasing relevance for examining political behaviour, and it is, of course, no surprise to find even historians using statistical data, for history is essentially past data presented in a certain format.

IMPORTANCE OF STATISTICS IN BUSINESS

There are three major functions in any business enterprise in which statistical methods are useful. These are as follows:

(i) The planning of operations: This may relate either to special projects or to the recurring activities of a firm over a specified period.

(ii) The setting up of standards: This may relate to the size of employment, volume of sales, fixation of quality norms for the manufactured product, norms for the daily output, and so forth.

(iii) The function of control: This involves comparison of the actual production achieved against the norm or target set earlier. In case production has fallen short of the target, remedial measures are suggested so that such a deficiency does not occur again.

A point worth noting is that although these three functions (planning of operations, setting standards, and control) are separate, in practice they are very much interrelated. Different authors have highlighted the importance of statistics in business. For instance, Croxton and Cowden give numerous uses of statistics in business such as project planning, budgetary planning and control, inventory planning and control, quality control, marketing, production and personnel administration. Within these they have also specified certain areas where statistics is very relevant.

Another author, Irving W. Burr, dealing with the place of statistics in an industrial organisation, specifies a number of areas where statistics is extremely useful. These are: customer wants and market research, development design and specification, purchasing, production, inspection, packaging and shipping, sales and complaints, inventory and maintenance, costs, management control, and industrial engineering and research.

Statistical problems arising in the course of business operations are multitudinous. As such, one may do no more than highlight some of the more important ones to emphasise the relevance of statistics to the business world. In the sphere of production, for example, statistics can be useful in various ways. Statistical quality control methods are used to ensure the production of quality goods; this is achieved by identifying and rejecting defective or substandard goods. Sales targets can be fixed on the basis of sales forecasts, which are made using various methods of forecasting. Analysis of sales effected against the targets set earlier would indicate any deficiency in achievement, which may be on account of several causes: (i) targets were too high and unrealistic, (ii) salesmen's performance has been poor, (iii) emergence of increased competition, (iv) poor quality of the company's product, and so on. These factors can be further investigated.


Another sphere in business where statistical methods can be used is personnel management. Here, one is concerned with the fixation of wage rates, incentive norms and performance appraisal of individual employees. The concept of productivity is very relevant here. On the basis of measurements of productivity, productivity bonuses are awarded to the workers. Comparisons of wages and productivity are undertaken in order to ensure increases in industrial productivity.

Statistical methods could also be used to ascertain the efficacy of a certain product, say, a medicine. For example, suppose a pharmaceutical company has developed a new medicine for the treatment of bronchial asthma. Before launching it on a commercial basis, it wants to ascertain the effectiveness of this medicine. It undertakes an experiment involving the formation of two comparable groups of asthma patients. One group is given the new medicine for a specified period and the other is treated with the usual medicines. Records are maintained for the two groups for the specified period. This record is then analysed to ascertain if there is any significant difference in the recovery of the two groups. If the difference is really significant statistically, the new medicine is launched commercially.

LIMITATIONS OF STATISTICS

Statistics has a number of limitations, pertinent among them the following:

(i) There are certain phenomena or concepts where statistics cannot be used. This is because these phenomena or concepts are not amenable to measurement. For example, beauty, intelligence and courage cannot be quantified. Statistics has no place in all such cases where quantification is not possible.

(ii) Statistics reveal the average behaviour, the normal or the general trend. An application of the 'average' concept to an individual or a particular situation may lead to a wrong conclusion and sometimes may be disastrous. For example, one may be misguided when told that the average depth of a river from one bank to the other is four feet, when there may be some points in between where its depth is far more than four feet. On this understanding, one may enter at those points of greater depth, which may be hazardous.

(iii) Since statistics are collected for a particular purpose, such data may not be relevant or useful in other situations or cases. For example, secondary data (i.e., data originally collected by someone else) may not be useful for another person.

(iv) Statistics are not 100 per cent precise as is Mathematics or Accountancy. Those who use statistics should be aware of this limitation.

(v) In statistical surveys, sampling is generally used as it is not physically possible to cover all the units or elements comprising the universe. The results may not be appropriate as far as the universe is concerned. Moreover, different surveys based on the same size of sample but different sample units may yield different results.

(vi) At times, association or relationship between two or more variables is studied in statistics, but such a relationship does not indicate a 'cause and effect' relationship. It simply shows the similarity or dissimilarity in the movement of the two variables. In such cases, it is the user who has to interpret the results carefully, pointing out the type of relationship obtained.

(vii) A major limitation of statistics is that it does not reveal all that pertains to a certain phenomenon. There is some background information that statistics does not cover. Similarly, there are some other aspects related to the problem at hand which are also not covered. The user of statistics has to be well informed and should interpret statistics keeping in mind all other aspects having relevance to the given problem.

Apart from the limitations of statistics mentioned above, there are also misuses of it. Many people, knowingly or unknowingly, use statistical data in a wrong manner. Let us see what the main misuses of statistics are so that they can be avoided when one has to use statistical data. The misuse of statistics may take several forms, some of which are explained below.

(i) Sources of data not given: At times, the source of data is not given. In the absence of the source, the reader does not know how far the data are reliable. Further, if he wants to refer to the original source, he is unable to do so.

(ii) Defective data: Another misuse is that sometimes one gives defective data. This may be done knowingly in order to defend one's position or to prove a particular point. Apart from this, the definition used to denote a certain phenomenon may be defective. For example, in the case of data relating to unemployed persons, the definition may include even those who are employed, though partially. The question here is how far it is justified to include partially employed persons amongst the unemployed.

(iii) Unrepresentative sample: In statistics, one often has to conduct a survey, which necessitates choosing a sample from the given population or universe. The sample may turn out to be unrepresentative of the universe. One may choose a sample just on the basis of convenience. He may collect the desired information from either his friends or nearby respondents in his neighbourhood even though such respondents do not constitute a representative sample.

(iv) Inadequate sample: Earlier, we have seen that a sample that is unrepresentative of the universe is a major misuse of statistics. Apart from this, at times one may conduct a survey based on an extremely inadequate sample. For example, in a city we may find that there are 1,00,000 households. When we have to conduct a household survey, we may take a sample of merely 100 households, comprising only 0.1 per cent of the universe. A survey based on such a small sample may not yield the right information.


(v) Unfair comparisons: An important misuse of statistics is making unfair comparisons from the data collected. For instance, one may construct an index of production choosing as base year a year in which production was much less, and then compare subsequent years' production against this low base. Such a comparison will undoubtedly give a rosy picture of the production though in reality it is not so. Another source of unfair comparison arises when one makes absolute comparisons instead of relative ones. An absolute comparison of two figures, say, of production or export, may show a good increase, but in relative terms it may turn out to be negligible. Another example of unfair comparison is when the populations of two cities are different, but a comparison of overall death rates and deaths by a particular disease is attempted. Such a comparison is wrong. Likewise, when data are not properly classified, or when changes in the composition of the population in the two years are not taken into consideration, comparisons of such data would be unfair as they would lead to misleading conclusions.

(vi) Unwarranted conclusions: Another misuse of statistics may be on account of unwarranted conclusions. This may be a result of making false assumptions. For example, while making projections of population for the next five years, one may assume a lower rate of growth though the past two years indicate otherwise. Sometimes one may not be sure about the changes in the business environment in the near future. In such a case, one may use an assumption that turns out to be wrong. Another source of unwarranted conclusions is the use of the wrong average. Suppose a series contains extreme values, one too high and the other too low, such as 800 and 50. The use of the arithmetic average in such a case may give a wrong idea. Instead, the harmonic mean would be proper in such a case.

(vii) Confusion of correlation and causation: In statistics, one often has to examine the relationship between two variables. A close relationship between the two variables may not establish a cause-and-effect relationship in the sense that one variable is the cause and the other the effect. It should be taken as something that measures the degree of association rather than as a finding of a causal relationship.

CHARTS

A chart is a graphical representation of data, in which "the data is represented by symbols, such as bars in a bar chart, lines in a line chart, or slices in a pie chart". A chart can represent tabular numeric data, functions or some kinds of qualitative structure, and conveys different information. The term "chart" as a graphical representation of data has multiple meanings:

A data chart is a type of diagram or graph that organises and represents a set of numerical or qualitative data. Maps that are adorned with extra information (map surround) for a specific purpose are often known as charts, such as a nautical chart or aeronautical chart, typically spread over several map sheets. Other domain-specific constructs are sometimes called charts, such as the chord chart in music notation or a record chart for album popularity.

Charts are often used to ease understanding of large quantities of data and the relationships between parts of the data. Charts can usually be read more quickly than the raw data. They are used in a wide variety of fields, and can be created by hand (often on graph paper) or by computer using a charting application. Certain types of charts are more useful for presenting a given data set than others. For example, data that present percentages in different groups (such as "satisfied, not satisfied, and unsure") are often displayed in a pie chart, but may be more easily understood when presented in a horizontal bar chart. On the other hand, data that represent numbers that change over a period of time (such as "annual revenue from 1990 to 2000") might be best shown as a line chart.

GRAPHS

One goal of statistics is to present data in a meaningful way. Often, data sets involve millions (if not billions) of values. This is far too many to print out in a journal article or sidebar of a magazine story. That is where graphs can be invaluable, allowing statisticians to provide a visual interpretation of complex numerical stories. Several types of graphs are commonly used in statistics.

Good graphs convey information quickly and easily to the user. Graphs highlight salient features of the data. They can show relationships that are not obvious from studying a list of numbers. They can also provide a convenient way to compare different sets of data. Different situations call for different types of graphs, and it helps to have a good knowledge of what types are available. The type of data often determines what graph is appropriate to use. Qualitative data, quantitative data, and paired data each use different types of graphs.

Types of Graphs and Charts

The graphical presentation of statistical data in a chart is normally described as a statistical graph or chart. There are many kinds of graphs and charts which are used to represent a set of data; the data is either continuous or discrete. These graphs are very helpful in understanding statistical data.

Line graphs
Pie charts
Bar graphs
Scatter plots
Stem and leaf plots
Histograms
Frequency polygons
Frequency curves
Cumulative frequency curves

BAR DIAGRAM

Bar diagrams are the most common type of diagrams used in practice. A bar is a thick line whose width is shown merely for attention. They are called one-dimensional because it is only the length of the bar that matters and not the width. When the number of items is large, lines may be drawn instead of bars to economise space. The special merits of bar diagrams are the following:

(i) They are readily understood even by those unaccustomed to reading charts or those who are not chart-minded.

(ii) They possess the outstanding advantage of being the simplest and the easiest to make.

(iii) When a large number of items are to be compared, they are the only form that can be used effectively.

While constructing bar diagrams, the following points should be kept in mind:

(i) The width of the bars should be uniform throughout the diagram.

(ii) The gap between one bar and another should be uniform throughout.

(iii) Bars may be either horizontal or vertical; vertical bars should be preferred. Figures should be placed at the end of each bar so that the reader can know the precise value without looking at the scale. This is particularly so where the scale is too narrow; for example, 1" on paper may represent 10 crore people.

Types of Bar Diagrams:

Simple bar diagrams
Sub-divided bar diagrams
Multiple bar diagrams
Percentage bar diagrams
Deviation bars

PIE DIAGRAM

A pie chart (or circle chart) is a circular statistical graphic which is divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice (and consequently its central angle and area) is proportional to the quantity it represents. While it is named for its resemblance to a pie which has been sliced, there are variations on the way it can be presented. The earliest known pie chart is generally credited to William Playfair's Statistical Breviary of 1801. Pie charts are very widely used in the business world and the mass media. However, they have been criticised, and many experts recommend avoiding them, pointing out that research has shown it is difficult to compare different sections of a given pie chart, or to compare data across different pie charts. Pie charts can be replaced in most cases by other plots such as the bar chart, box plot or dot plot.

A pie chart displays data, information, and statistics in an easy-to-read 'pie-slice' format, with varying slice sizes telling you how much of one data element exists. The bigger the slice, the more of that particular data was gathered. Take, for example, a pie chart representing the percentage of people who own various pets. If the 'dog ownership' slice is by far the largest, this means that most people represented in the chart own a dog as opposed to a cat, fish, or other animal.

USES OF A PIE CHART

The main use of a pie chart is to show comparison. When items are presented on a pie chart, you can easily see which item is the most popular and which is the least popular. Various applications of pie charts can be found in business, school, and at home. For business, pie charts can be used to show the success or failure of certain products or services. They can also be used to show the market reach of a business compared to similar businesses.

At school, pie chart applications include showing how much time is allotted to each subject. They can also be used to show the ratio of girls to boys in various classes. At home, pie charts can be useful when figuring out your diet. You can also use pie charts to see how much money you spend in different areas. There are many applications of pie charts, and all are designed to help you grasp a body of information quickly and visually.

TYPES OF DIAGRAM

There are various diagrammatic devices by which statistical data can be presented. We shall discuss a few of them which are most used. The following are the common types of diagrams:

1. One-dimensional diagrams (line and bar)
2. Two-dimensional diagrams (rectangle, square, circle, etc.)
3. Three-dimensional diagrams (cube, sphere, cylinder, etc.)
4. Pictograms
5. Cartograms

1. One-dimensional diagrams:

In a one-dimensional diagram, only the length of the lines or bars is considered; the width of the bars is not taken into consideration. The term bar means a thick wide line. The following are the main types:

(a) Line diagram:

This is the simplest of all the diagrams. Lines are drawn with heights proportional to the size of the figures, and the distance between lines is kept uniform. It makes comparison easy. This diagram is not attractive; hence it is less important.

Problem No. 1:

The following data show the number of accidents sustained by 100 drivers of a company in a particular year. Draw a suitable diagram.

Number of accidents: 1  2  3  4  5  6  7  8
Number of drivers:   2 18 15 10 13 22  9 11
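As a minimal illustrative sketch, the diagram could also be produced programmatically; the use of Python with the matplotlib library is an assumption here, since the original exercise expects a hand-drawn chart:

    # Sketch: a line diagram of the accident data (matplotlib assumed).
    import matplotlib.pyplot as plt

    accidents = [1, 2, 3, 4, 5, 6, 7, 8]
    drivers = [2, 18, 15, 10, 13, 22, 9, 11]

    # One thin vertical line per category, as a line diagram requires.
    plt.vlines(accidents, ymin=0, ymax=drivers)
    plt.xlabel("Number of accidents")
    plt.ylabel("Number of drivers")
    plt.title("Accidents sustained by 100 drivers")
    plt.show()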

(b) Simple bar diagram:

A simple bar diagram can be drawn on either a horizontal or a vertical base; bars on a horizontal base are more common. A bar diagram is simple to draw and easy to understand, and it is commonly used in business and economics.

Problem No. 2:

Draw a suitable bar diagram showing the following data.

Year    Profits ('000)
2005    16,000
2006    13,000
2007    17,000

(a) Vertical bar diagram
(b) Horizontal bar diagram

(c) Multiple bar diagram (compound bar diagram):

Multiple bar diagrams are used to denote more than one phenomenon, e.g., import and export trade. Multiple bars are useful for direct comparison between two values. The bars are drawn side by side. In order to distinguish the bars, different colours, shades, etc., may be used, and a key or index to this effect should be given so that the different bars can be understood.
Practical Exercises:

Problem No. 2:

The data below give the yearly profits of two companies, A and B.

Year    Profits A    Profits B
2005    10,000       15,000
2006    8,000        13,000
2007    13,000       14,000

Represent the data by means of a multiple bar diagram.
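A hedged sketch of the same multiple bar diagram in Python (matplotlib assumed; the exercise itself expects a hand-drawn chart):

    # Sketch: side-by-side (multiple) bars for the two companies.
    import matplotlib.pyplot as plt

    years = ["2005", "2006", "2007"]
    profits_a = [10000, 8000, 13000]
    profits_b = [15000, 13000, 14000]

    x = range(len(years))   # one slot per year
    width = 0.35            # bar width; the pair sits side by side

    plt.bar([i - width / 2 for i in x], profits_a, width, label="Co. A")
    plt.bar([i + width / 2 for i in x], profits_b, width, label="Co. B")
    plt.xticks(list(x), years)
    plt.ylabel("Profits")
    plt.legend()            # the key/index distinguishing the bars
    plt.show()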

(d) Sub-divided bar diagram (component bar diagram):

The bar is subdivided into various parts in proportion to the values given in the data, and may be drawn on absolute figures or percentages. Each component occupies a part of the bar proportional to its share in the total. To distinguish different components from one another, different colours or shades may be used.

Practical Exercises:

Problem No. 3:

Represent the following data in a suitable diagram.

Districts            A      B      C
Population  Male     1,000  1,200  1,300
            Female   500    800    900
            Total    1,500  2,000  2,200

Solution:
Percentage sub-divided bar diagram:

The diagrams mentioned above have been used to represent absolute values. But comparison is often made on a relative basis. The various components are expressed as percentages of the total, and for dividing the bars these percentages are cumulated. In this case, the bars are all of equal height; each segment shows its percentage of the total.

Problem No. 4:

Represent by a percentage bar diagram the following data on investment for the first and second five-year plans:

Investment in the Public Sector

Item             First five year plan    Second five year plan
Agriculture      357                     768
Irrigation       492                     990
Industry         261                     909
Transport        654                     1,485
Social services  306                     945
Miscellaneous    90                      300
Solution:

Percentage Bar

Item             First five year plan        Second five year plan
                 Investment   Percentage     Investment   Percentage
Agriculture      357          16.53          768          14.22
Irrigation       492          22.78          990          18.33
Industry         261          12.08          909          16.83
Transport        654          30.28          1,485        27.50
Social services  306          14.17          945          17.50
Miscellaneous    90           4.17           300          5.56
Total            2,160        100            5,400        100
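The percentage columns can be verified with a short computation; a minimal sketch in Python (the language choice is an assumption, as the source shows no code):

    # Sketch: percentages of total investment for a percentage bar diagram.
    def percentages(values):
        total = sum(values)
        return [round(100 * v / total, 2) for v in values]

    # Order: Agriculture, Irrigation, Industry, Transport, Social, Misc.
    first_plan = [357, 492, 261, 654, 306, 90]
    second_plan = [768, 990, 909, 1485, 945, 300]

    print(percentages(first_plan))   # [16.53, 22.78, 12.08, 30.28, 14.17, 4.17]
    print(percentages(second_plan))  # [14.22, 18.33, 16.83, 27.5, 17.5, 5.56]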

(e) Other bar diagrams:

(a) Deviation bars:

Deviation bar diagrams are used to depict net deviations in different values, i.e., surplus or deficit, profit or loss, net import or export, etc., which have both positive and negative values. Positive values are shown above the base line and negative values below it.

(b) Broken bars:

In certain cases we may come across data which contain very wide variations in values, some very small and others very large. In order to provide adequate and reasonable space to the smaller bars, the larger bars may be broken at the top, with the value of each bar written at the top of the bar.

2. Two-dimensional diagrams (area or surface diagrams):

In a one-dimensional diagram, only length is taken into account. In a two-dimensional diagram, the area of the diagram represents the data, i.e., both the length and the breadth are considered. The important types are:

(a) Rectangles: Rectangles are used when two or more magnitudes with different components have to be compared. The areas of the rectangles are kept in proportion to the values. They may be of two types: (i) percentage sub-divided rectangular diagrams, in which the widths of the rectangles are kept according to the proportion of the values, while the various components of the values are converted into percentages and the rectangles divided according to them; and (ii) sub-divided rectangles, which are used to show related phenomena, e.g., cost per unit, quantity of production, etc.

Practical Exercises:

Problem No. 5:

Draw a two-dimensional diagram to represent the following data:

Items of expenditure    Expenditure in Rupees
                        Family A    Family B
Food                    200         300
Clothing                48          75
Education               32          40
House rent              40          75
Miscellaneous           80          110
Total                   400         600

Solution:

The total expenditure will be taken as 100 and the expenditure on each item will be expressed as a percentage. The widths of the two rectangles will be in proportion to the total expenditures of the two families, i.e., 400 : 600 or 2 : 3. The height of each rectangle will be the same, as it represents 100 per cent.

Items of       Monthly expenditure
expenditure    Family A (Rs. 400)          Family B (Rs. 600)
               Rs.   %     Cumulative %    Rs.   %      Cumulative %
Food           200   50    50              300   50     50
Clothing       48    12    62              75    12.5   62.5
Education      32    8     70              40    6.67   69.17
House rent     40    10    80              75    12.5   81.67
Miscellaneous  80    20    100             110   18.33  100
Total          400   100                   600   100

(b) Square diagram: While preparing squares, we have to bear in mind that the ratio is to be maintained according to the areas of the squares. To draw a square diagram, the square root is taken of the value of each item to be shown in the diagram; then a suitable scale may be adopted to draw it.

Practical Exercises:

Problem No. 6:

Draw a square diagram to represent the following data:

8100    4900    2500

Solution:

First we find the square roots of the figures: they are 90, 70 and 50. These roots are then divided by 10, giving sides of 9, 7 and 5.
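The same arithmetic, sketched in Python (assumed language):

    # Sketch: sides of the squares are proportional to the square roots.
    import math

    values = [8100, 4900, 2500]
    sides = [math.sqrt(v) / 10 for v in values]  # roots 90, 70, 50, scaled by 10
    print(sides)  # [9.0, 7.0, 5.0]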

(c) Circle: Circle diagrams are an alternative to square diagrams. The steps are similar to the above; the side of the square becomes the radius of the circle.

(d) Angular or pie diagram:

The pie diagram ranks high in understanding. Just as we divided a bar or a rectangle to show its components, a circle can also be divided into sectors. As there are 360 degrees at the centre, proportionate sectors are cut, taking the whole of the data as equal to 360 degrees. This will be clear from the following illustration.

Practical Exercises:

Problem No. 7:

The following table shows the areas, in millions of square kilometres, of the oceans of the world:

Ocean       Area (million sq. km)
Pacific     70.8
Atlantic    41.2
Indian      28.5
Antarctic   7.6
Arctic      4.8

Draw a pie diagram to represent the data.

Solution:

Calculation for pie diagram

Ocean       Area     Degrees
Pacific     70.8     70.8/152.9 * 360 = 167
Atlantic    41.2     41.2/152.9 * 360 = 97
Indian      28.5     28.5/152.9 * 360 = 67
Antarctic   7.6      7.6/152.9 * 360 = 18
Arctic      4.8      4.8/152.9 * 360 = 11
Total       152.9    360°
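A short sketch of the degree calculation in Python (assumed language):

    # Sketch: converting each ocean's area into degrees of the circle.
    areas = {"Pacific": 70.8, "Atlantic": 41.2, "Indian": 28.5,
             "Antarctic": 7.6, "Arctic": 4.8}

    total = sum(areas.values())          # 152.9 million sq. km
    for ocean, area in areas.items():
        degrees = area / total * 360     # this ocean's share of 360 degrees
        print(ocean, round(degrees))     # 167, 97, 67, 18, 11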

3. Three-dimensional diagrams:

The square, circle, rectangle, etc., may fail to represent the data if the quantities to be represented are extremely diverse. In such cases three-dimensional diagrams are drawn. They are so called because length, height and width (or depth) are considered; they comprise cubes, spheres, prisms, cylinders, blocks, etc. Of all these, cubes are the easiest to draw, as the side of the cube can easily be found by taking the cube root of the data.

4. Pictograms and cartograms:

A pictogram is a device for representing statistical data in pictures. Pictograms are very useful in attracting attention and are easily understood. For the purpose of propaganda, pictorial presentations of facts are quite popular and find a place in exhibitions. They are extensively used by government organisations as well as by private institutions.

In cartograms, statistical facts are presented through maps accompanied by various types of diagrammatic representation. A cartogram presents numerical facts in pictorial form over a geographical or spatial distribution. Cartograms are simple and easy to understand. They are generally used when regional or geographic comparisons are to be made.

Choice or selection of a diagram

There are many methods of depicting statistical data diagrammatically. No single diagram is suited to all purposes. The choice or selection of a particular diagram, out of many, to suit a given set of data is not an easy task but requires skill, experience and intelligence. Primarily, the choice depends upon (a) the nature of the data and (b) the purpose of the presentation and the audience for whom it is meant. The nature of the data will help in deciding between a one-dimensional, two-dimensional or three-dimensional diagram. It is then important to know the level of knowledge of the audience for whom the diagram is intended.

Practical Exercises:

Problem No. 8:

Represent the following by a suitable diagram.

Profits for 2007
Company A    Rs. 12,500
Company B    Rs. 6,400
Company C    Rs. 2,700

Solution:

Dividing each figure by 100 gives 125, 64 and 27, whose cube roots are 5, 4 and 3.

             Cube roots    Sides in centimetres
Company A    5             5
Company B    4             4
Company C    3             3
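A sketch of the cube-root step in Python (the divide-by-100 scaling is an inference from the stated answer of 5, 4 and 3):

    # Sketch: cube sides from the cube roots of the scaled profits.
    profits = [12500, 6400, 2700]
    sides = [round((p / 100) ** (1 / 3)) for p in profits]  # 125, 64, 27
    print(sides)  # [5, 4, 3]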

CENTRAL TENDENCY

The description of statistical data may be quite elaborate or quite brief depending on two factors: the nature of the data and the purpose for which the data have been collected. While describing data statistically or verbally, one must ensure that the description is neither too brief nor too lengthy. The measures of central tendency enable us to compare two or more distributions pertaining to the same time period, or within the same distribution over time. For example, the average consumption of tea in two different territories for the same period, or in one territory for two years, say 2003 and 2004, can be compared by means of an average.

In statistics, a central tendency (or measure of central tendency) is a central or typical value for a probability distribution. It may also be called a centre or location of the distribution. Colloquially, measures of central tendency are often called averages. The term central tendency dates from the late 1920s. The most common measures of central tendency are the arithmetic mean, the median and the mode. A central tendency can be calculated for either a finite set of values or for a theoretical distribution, such as the normal distribution. Occasionally authors use central tendency to denote "the tendency of quantitative data to cluster around some central value." The central tendency of a distribution is typically contrasted with its dispersion or variability; dispersion and central tendency are often-characterised properties of distributions. Analysts may judge whether data has a strong or a weak central tendency based on its dispersion.

A measure of central tendency is a summary statistic that represents the centre point or typical value of a dataset. These measures indicate where most values in a distribution fall and are also referred to as the central location of a distribution. You can think of it as the tendency of data to cluster around a middle value. In statistics, the three most common measures of central tendency are the mean, median, and mode. Each of these measures calculates the location of the central point using a different method.

ARITHMETIC MEAN

In statistics, the arithmetic mean, or simply the mean or average when the context is clear, is the sum of a collection of numbers divided by the number of numbers in the collection. The collection is often a set of results of an experiment or an observational study, or frequently a set of results from a survey. The term "arithmetic mean" is preferred in some contexts in mathematics and statistics because it helps distinguish it from other means, such as the geometric mean and the harmonic mean. In addition to mathematics and statistics, the arithmetic mean is used frequently in many diverse fields such as economics, anthropology, and history, and it is used in almost every academic field to some extent. For example, per capita income is the arithmetic average income of a nation's population.

While the arithmetic mean is often used to report central tendencies, it is not a robust statistic, meaning that it is greatly influenced by outliers (values that are very much larger or smaller than most of the values). Notably, for skewed distributions, such as the distribution of income for which a few people's incomes are substantially greater than most people's, the arithmetic mean may not coincide with one's notion of "middle", and robust statistics, such as the median, may be a better description of central tendency.

The arithmetic mean (or mean or average) is the most commonly used and readily understood measure of central tendency in a data set. In statistics, the term average refers to any of the measures of central tendency. The arithmetic mean of a set of observed data is defined as being equal to the sum of the numerical values of each and every observation divided by the total number of observations. Symbolically, if we have a data set consisting of the values a1, a2, ..., an, then the arithmetic mean A is defined by the formula:

A = (a1 + a2 + ... + an) / n

CHARACTERISTICS OF THE ARITHMETIC MEAN

1. The sum of the deviations of the individual items from the arithmetic mean is always zero, i.e., Σ(x - x̄) = 0, where x is the value of an item and x̄ is the arithmetic mean. Since the sum of the deviations in the positive direction is equal to the sum of the deviations in the negative direction, the arithmetic mean is regarded as a measure of central tendency.

2. The sum of the squared deviations of the individual items from the arithmetic mean is always a minimum. In other words, the sum of the squared deviations taken from any value other than the arithmetic mean will be higher.

3. As the arithmetic mean is based on all the items in a series, a change in the value of any item will lead to a change in the value of the arithmetic mean.

4. In the case of a highly skewed distribution, the arithmetic mean may get distorted on account of a few items with extreme values. In such a case, it may cease to be the representative characteristic of the distribution.

MEDIAN

The median is defined as the value of the middle item (or the mean of the values of the two middle items) when the data are arranged in ascending or descending order of magnitude. Thus, in an ungrouped frequency distribution, if the n values are arranged in ascending or descending order of magnitude, the median is the middle value if n is odd. When n is even, the median is the mean of the two middle values.

The median is the value separating the higher half from the lower half of a data sample (a population or a probability distribution). For a data set, it may be thought of as the "middle" value. For example, in the data set {1, 3, 3, 6, 7, 8, 9}, the median is 6, the fourth largest, and also the fourth smallest, number in the sample. For a continuous probability distribution, the median is the value such that a number is equally likely to fall above or below it.
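A minimal sketch using Python's standard library on the sample from the text (the language choice is an assumption):

    # Sketch: median for odd and even numbers of observations.
    import statistics

    print(statistics.median([1, 3, 3, 6, 7, 8, 9]))  # 6 (odd n: the middle value)
    print(statistics.median([1, 3, 6, 8]))           # 4.5 (even n: mean of middle two)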

The median is a commonly used measure of the properties of a data set in statistics and probability theory. The basic advantage of the median in describing data, compared to the mean (often simply described as the "average"), is that it is not skewed so much by extremely large or small values, and so may give a better idea of a "typical" value. For example, in understanding statistics like household income or assets, which vary greatly, a mean may be skewed by a small number of extremely high or low values. Median income, for example, may be a better way to suggest what a "typical" income is.

Because of this, the median is of central importance in robust statistics, as it is the most resistant statistic, having a breakdown point of 50%: so long as no more than half the data are contaminated, the median will not give an arbitrarily large or small result.

CHARACTERISTICS OF THE MEDIAN

1. Unlike the arithmetic mean, the median can be computed from open-ended distributions. This is because it is located in the median class interval, which would not be an open-ended class.

2. The median can also be determined graphically, whereas the arithmetic mean cannot be ascertained in this manner.

3. As it is not influenced by extreme values, it is preferred in the case of a distribution having extreme values.

4. In the case of qualitative data, where the items are not counted or measured but are scored or ranked, it is the most appropriate measure of central tendency.

MODE

The mode of a set of data values is the value that appears most often. It is the value x at which the probability mass function takes its maximum value; in other words, it is the value that is most likely to be sampled. Like the statistical mean and median, the mode is a way of expressing, in a (usually) single number, important information about a random variable or a population. The numerical value of the mode is the same as that of the mean and median in a normal distribution, and it may be very different in highly skewed distributions.

The mode is not necessarily unique in a given discrete distribution, since the probability mass function may take the same maximum value at several points x1, x2, etc. The most extreme case occurs in uniform distributions, where all values occur equally frequently. When the probability density function of a continuous distribution has multiple local maxima, it is common to refer to all of the local maxima as modes of the distribution. Such a continuous distribution is called multimodal (as opposed to unimodal). A mode of a continuous probability distribution is often considered to be any value x at which its probability density function has a locally maximum value, so any peak is a mode.

In symmetric unimodal distributions, such as the normal distribution, the mean (if defined), median and mode all coincide. For samples, if it is known that they are drawn from a symmetric unimodal distribution, the sample mean can be used as an estimate of the population mode. The mode is another measure of central tendency: it is the value at the point around which the items are most heavily concentrated. As an example, consider the following series: 8, 9, 11, 15, 16, 12, 15, 3, 7, 15. There are ten observations in the series, wherein the figure 15 occurs the maximum number of times, namely three. The mode is therefore 15. The series given above is a discrete series; as such, the variable cannot be in fractions. If the series were continuous, we could say that the mode is approximately 15, without further computation. In the case of grouped data, the mode is determined by the standard interpolation formula:

Mode = L + ((f1 - f0) / (2f1 - f0 - f2)) * h

where L is the lower limit of the modal class, f1 the frequency of the modal class, f0 the frequency of the preceding class, f2 the frequency of the succeeding class, and h the width of the class interval.
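For the ungrouped case, a minimal sketch using Python's standard library (the grouped formula above would instead be applied to a frequency table):

    # Sketch: the mode of the discrete series from the text.
    import statistics

    series = [8, 9, 11, 15, 16, 12, 15, 3, 7, 15]
    print(statistics.mode(series))       # 15 (occurs three times)
    print(statistics.multimode(series))  # [15]; lists every mode (Python 3.8+)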

RELATIONSHIPS OF THE MEAN, MEDIAN AND MODE

Having discussed the mean, median and mode, we now turn to the relationships amongst these three measures of central tendency. We shall discuss the relationships assuming that there is a unimodal frequency distribution.

(i) When a distribution is symmetrical, the mean, median and mode are the same. In case a distribution is skewed to the right, then mean > median > mode. Generally, income distribution is skewed to the right, where a large number of families have relatively low income and a small number of families have extremely high income. In such a case, the mean is pulled up by the extremely high incomes, and we find that mean > median > mode.

(ii) When a distribution is skewed to the left, then mode > median > mean. This is because here the mean is pulled down below the median by extremely low values.

(iii) Given the mean and median of a unimodal distribution, we can determine whether it is skewed to the right or left. When mean > median, it is skewed to the right; when median > mean, it is skewed to the left. It may be noted that the median is always in the middle between the mean and the mode.

GEOMETRIC MEAN

Apart from the three measures of central tendency discussed above, there are two other means that are sometimes used in business and economics: the geometric mean and the harmonic mean. The geometric mean is the more important of the two. We discuss both below, taking up the geometric mean first. The geometric mean is defined as the nth root of the product of n observations of a distribution.

A geometric mean is often used when comparing different items, finding a single "figure of merit" for these items, when each item has multiple properties that have different numeric ranges. For example, the geometric mean can give a meaningful "average" with which to compare two companies which are each rated from 0 to 5 for their environmental sustainability and from 0 to 100 for their financial viability. If an arithmetic mean were used instead of a geometric mean, the financial viability would be given more weight because its numeric range is larger: a small percentage change in the financial rating (e.g., going from 80 to 90) makes a much larger difference in the arithmetic mean than a large percentage change in environmental sustainability (e.g., going from 2 to 5). The use of a geometric mean "normalizes" the ranges being averaged, so that no range dominates the weighting, and a given percentage change in any of the properties has the same effect on the geometric mean. So, a 20% change in environmental sustainability from 4 to 4.8 has the same effect on the geometric mean as a 20% change in financial viability from 60 to 72.
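A small sketch of this "figure of merit" example in Python; statistics.geometric_mean requires Python 3.8 or later:

    # Sketch: a 20% change in either rating moves the geometric mean equally.
    import statistics

    base = statistics.geometric_mean([4, 60])      # sustainability 4, finance 60
    env_up = statistics.geometric_mean([4.8, 60])  # +20% sustainability
    fin_up = statistics.geometric_mean([4, 72])    # +20% finance

    print(base, env_up, fin_up)  # the last two are equal (4.8*60 == 4*72 == 288)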

The geometric mean can be understood in terms of geometry. The geometric mean of two numbers, a and b, is the length of one side of a square whose area is equal to the area of a rectangle with sides of lengths a and b. Similarly, the geometric mean of three numbers, a, b, and c, is the length of one edge of a cube whose volume is the same as that of a cuboid with sides whose lengths are equal to the three given numbers.

HARMONIC MEAN

The main advantage of the harmonic mean is that it is based on all

observations in a distribution and is amenable to further algebraic treatment.


W hen we desire to give greater weight to smaller observations and less weight to

the larger observations, then the use of harmonic mean will be more suitable. As

against these advantages, there are certain limitations of the harmonic mean.

First, it is difficult to understand as well as difficult to compute. Second, it cannot

be calculated if any of the observations is zero or negative. Third, it is only a

summary figure, which may not be an actual observation in the distribution. It is

worth noting that the harmonic mean is always lower than the geometric mean,

which is lower than the arithmetic mean. This is because the harmonic mean

assigns lesser importance to higher values. Since the harmonic mean is based on

reciprocals, it becomes clear that as reciprocals of higher values are lower than

those of lower values, it is a lower average than the arithmetic mean as well as the
geometric mean.

The harmonic mean is an average calculated by dividing the number of

observations by the sum of the reciprocals of the numbers in the series. Thus, the

harmonic mean is the reciprocal of the arithmetic mean of the reciprocals. The

harmonic mean of 1, 4, and 4 is:

HM = 3 / (1/1 + 1/4 + 1/4) = 3 / 1.5 = 2
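
A short Python sketch confirming the value above and the ordering HM ≤ GM ≤ AM:

    import math

    def arithmetic_mean(xs):
        return sum(xs) / len(xs)

    def geometric_mean(xs):
        return math.prod(xs) ** (1 / len(xs))

    def harmonic_mean(xs):
        # number of observations divided by the sum of the reciprocals
        return len(xs) / sum(1 / x for x in xs)

    xs = [1, 4, 4]
    print(harmonic_mean(xs))    # 2.0
    print(geometric_mean(xs))   # ~2.52
    print(arithmetic_mean(xs))  # 3.0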

Arithmetic Mean

The arithmetic average is also called the mean. It is the most common and

widely used measure of central tendency. The arithmetic average of a series is the

figure obtained by dividing the total value of the various items by their number.

There are two types of arithmetic average:

1. Simple arithmetic average; and

2. Weighted arithmetic average.

1. Simple arithmetic average: individual observations

(a) Direct method:

The arithmetic mean is frequently referred to simply as the "mean", and we speak

of such values as mean income, mean tonnage, mean marks, etc. As opposed
to certain other averages, which are found in terms of their position in a series, the

mean has to be computed by taking every value in the series into consideration.

Hence the mean cannot be found by mere inspection or observation of the items.

The simple arithmetic mean of a series is equal to the sum of the variables divided by

their number.

Step 1. Add up all the values of the variable x to find ∑x.

Step 2. Divide ∑x by the number of observations (N).

x̄ = (x1 + x2 + x3 + … + xn) / N, or x̄ = ∑x / N

x̄ = the mean
∑x = the sum of the variables
N = the number of observations

Practical Exercises:

Problem No. 1.

Calculate the mean from the following data.

R.Nos: 1  2  3  4  5  6  7  8  9  10

Marks: 40 50 55 78 58 60 73 35 43 48

Solution:
Calculation of Mean

R.Nos Marks

1 40

2 50

3 55

4 78

5 58

6 60

7 73

8 35

9 43

10 48

N = 10    ∑x = 540

x̄ = ∑x / N

  = 540 / 10
  = 54
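
The direct method of Problem 1 in Python (a sketch; the variable names are ours):

    marks = [40, 50, 55, 78, 58, 60, 73, 35, 43, 48]
    mean = sum(marks) / len(marks)  # x̄ = ∑x / N
    print(mean)  # 54.0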

(b) Short-cut method:

The arithmetic mean can also be calculated by the short-cut method. This

method reduces the amount of calculation. It involves the following steps:


Steps:

1. Assume any one value as an assumed mean, also known as the

working mean or arbitrary average (A = assumed mean).

2. Find the deviation of each value from the assumed mean (d = x - A).

3. Add all the deviations (∑d).

4. Apply the formula:

x̄ = A ± ∑d / N

x̄ = arithmetic mean
A = assumed mean

∑d = sum of the deviations
N = number of observations


Note:
Any value, whether existing in the data or not, can be taken as the assumed
mean, and the final answer will be the same: the answer is not affected by
the value selected as the assumed mean. However, in order to simplify the

calculation, the mid-point of one of the centrally located classes in the given
distribution should be selected as the assumed mean.
Practical Exercises:

Problem No. 2.
(Solving the previous problem by the short-cut method.)

R.Nos   Marks   d = (x - 50)

1       40      -10
2       50      0
3       55      5
4       78      28
5       58      8
6       60      10
7       73      23
8       35      -15
9       43      -7
10      48      -2

N = 10  ∑x = 540   ∑d = 40

Let the assumed mean be 50.

x̄ = A ± ∑d / N

  = 50 + 40 / 10

  = 50 + 4

  = 54
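
The short-cut method is easy to verify in Python (a sketch, with A = 50 as above):

    marks = [40, 50, 55, 78, 58, 60, 73, 35, 43, 48]
    A = 50  # assumed mean
    deviations = [x - A for x in marks]      # d = x - A
    mean = A + sum(deviations) / len(marks)  # x̄ = A + ∑d / N
    print(mean)  # 54.0, the same as the direct method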
Mathematical characteristics

1. The algebraic sum of the deviations of all the items from their

arithmetic mean is zero, i.e., ∑(x - x̄) = 0.

2. The sum of the squared deviations of the items from the mean is a
minimum; that is, it is less than the sum of the squared deviations of the items
from any other value: ∑d² = a minimum.

3. Since x̄ = ∑x / N, if any two of the three values (x̄, ∑x, N) are given, the third can be
computed.

4. If all the items of a series are increased (or decreased) by any

constant number, the arithmetic mean will also increase (or decrease) by
the same constant.
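
A small sketch verifying characteristics 1 and 2 numerically on the data of Problem 1:

    marks = [40, 50, 55, 78, 58, 60, 73, 35, 43, 48]
    xbar = sum(marks) / len(marks)

    # Characteristic 1: deviations from the mean sum to zero
    print(sum(x - xbar for x in marks))  # 0.0 (up to rounding)

    # Characteristic 2: squared deviations are smallest about the mean
    def sum_sq_dev(center):
        return sum((x - center) ** 2 for x in marks)

    print(sum_sq_dev(xbar) < sum_sq_dev(50))  # True
    print(sum_sq_dev(xbar) < sum_sq_dev(60))  # True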
2. Weighted Arithmetic Mean:
One of the limitations of the simple arithmetic mean is that it gives equal

importance to all the items of the distribution. In certain cases, where the items

differ in relative importance, it is essential to allocate weights to them. The

weightage applied may vary from case to case according to
the relative importance of the items; thus a weight is a number standing for the relative

importance of an item. A weighted average can be defined as an

average whose component items are multiplied by certain values

(weights), the aggregate of the products being divided by the total of the
weights.
Practical Exercises:

Problem No. 3.
Comment on the performance of the students of the three universities given
below, using simple and weighted averages:

Course    Bombay               Calcutta             Madras
          % of    No. of       % of    No. of       % of    No. of
          pass    students     pass    students     pass    students
                  ('00)                ('00)                ('00)
M.A       71      3            82      2            81      2
          83      4            76      3            76      3.5
B.Sc      65      2            65      3            70      7
M.Sc      66      3            60      7            73      2

Solution:

Computation of Simple and Weighted Averages

Course    Bombay              Calcutta            Madras
          x    w    wx        x    w    wx        x    w     wx
M.A       71   3    213       82   2    164       81   2     162
          83   4    332       76   3    228       76   3.5   266
B.Sc      65   2    195       65   3    195       70   7     490
M.Sc      66   3    198       60   7    420       73   2     146
          ∑x = 432            ∑x = 432            ∑x = 432
          ∑w = 20             ∑w = 28             ∑w = 21
          ∑wx = 1451          ∑wx = 1977          ∑wx = 1513

∑x is the same for all three universities:

x̄ = 432 / 6
  = 72
But the number of students (the weights) differs, so we calculate the

weighted means:

Bombay:   x̄w = ∑wx / ∑w = 1451 / 20 = 72.55

Calcutta: x̄w = ∑wx / ∑w = 1977 / 28 = 70.60

Madras:   x̄w = ∑wx / ∑w = 1513 / 21 = 72.05

Bombay University shows the best performance, because its weighted mean is greater
than that of the other two universities.
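
A minimal sketch of the weighted-mean formula x̄w = ∑wx / ∑w in Python (the sample values here are hypothetical, not the full table above):

    def weighted_mean(values, weights):
        # x̄w = ∑wx / ∑w
        return sum(x * w for x, w in zip(values, weights)) / sum(weights)

    # hypothetical pass percentages weighted by student counts
    print(weighted_mean([70, 80, 60], [1, 2, 3]))  # 68.33...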
Discrete Series

Direct Method:
To find the total of the items in a discrete series, the frequency of each value is

multiplied by the respective size. The values so obtained are totalled, and this

total is then divided by the total of the frequencies to obtain the arithmetic
mean. The steps involved in the calculation of the mean are as follows.

Steps:
1. Multiply each size of item by its frequency (fx).

2. Add all the fx (∑fx).

3. Divide ∑fx by the total frequency (N).

The formula is x̄ = ∑fx / N

x̄ = arithmetic mean; ∑fx = the sum of the products; N = total number of items.


Problem No. 4.

Calculate the mean from the following data.

Value:     1  2  3  4  5  6  7  8  9  10

Frequency: 21 30 28 40 26 34 40 9  15 57

Solution:

Calculation of Mean

X    F    Fx
1    21   21

2    30   60

3    28   84

4    40   160

5    26   130

6    34   204

7    40   280

8    9    72

9    15   135

10   57   570

     N = 300   ∑fx = 1716

x̄ = ∑fx / N
  = 1716 / 300

  = 5.72
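
The discrete-series direct method in Python (a sketch of Problem 4):

    values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    freqs  = [21, 30, 28, 40, 26, 34, 40, 9, 15, 57]

    N = sum(freqs)                                      # 300
    sum_fx = sum(x * f for x, f in zip(values, freqs))  # 1716
    print(sum_fx / N)  # 5.72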

Short-Cut Method

Steps:

1. Take any value as the assumed mean.

2. Find the deviation of each variable from the assumed mean.

3. Multiply the deviations by the respective frequencies.

4. Add up the products.

5. Apply the formula:

x̄ = A ± ∑fd / N

x̄ = mean
A = assumed mean
∑fd = sum of the total deviations; N = total frequency

Problem No. 5.

Calculate the mean from the following data (short-cut method, with assumed mean A = 5).

Value:     1  2  3  4  5  6  7  8  9  10

Frequency: 21 30 28 40 26 34 40 9  15 57

Solution:

Calculation of Mean

X     f     d = (x - A)    fd

1     21    -4             -84

2     30    -3             -90

3     28    -2             -56

4     40    -1             -40

5     26    0              0

6     34    1              34

7     40    2              80

8     9     3              27

9     15    4              60

10    57    5              285

      N = 300              ∑fd = +216

x̄ = A ± ∑fd / N
A = 5; ∑fd = +216; N = 300

x̄ = 5 + 216 / 300

  = 5 + 0.72

  = 5.72 (the same answer as in the previous illustration)

Continuous Series

In a continuous frequency distribution, the individual values of the variable

are unknown. An assumption is therefore made to render them precise: the

frequency of each class interval is taken to be concentrated at its centre, so

the mid-point of each class interval has to be found. In a continuous

frequency distribution, the mean can be calculated by any of the following methods:

1. Direct method
2. Short-cut method

3. Step deviation method


1. Direct method:

The following procedure is adopted for calculating the arithmetic

mean in a continuous series.

Steps:
1. Find the mid value of each group or class. The mid value is
obtained by adding the lower limit and the upper limit of the class and

dividing the total by two. For example, in a class interval of 10 - 20,

the mid value is (10 + 20)/2 = 30/2 = 15

(symbol: m)
2. Multiply the mid value of each class by the frequency of the class; in

other words, m is multiplied by f.

3. Add up all the products (∑fm).

4. Divide ∑fm by N and

apply the formula x̄ = ∑fm / N.

Practical Exercises

Problem No. 6.

From the following, find the mean profits.

Profits per shop (Rs.)    Number of shops

100 - 200                 10

200 - 300                 18

300 - 400                 20

400 - 500                 26

500 - 600                 30

600 - 700                 28

700 - 800                 18
Solution:

Profits (Rs.)   Mid point (m)   No. of shops (f)   fm

100 - 200       150             10                 1500

200 - 300       250             18                 4500

300 - 400       350             20                 7000

400 - 500       450             26                 11700

500 - 600       550             30                 16500

600 - 700       650             28                 18200

700 - 800       750             18                 13500

                                ∑f = 150           ∑fm = 72900

x̄ = ∑fm / N
  = 72900 / 150

  = 486
The average profit is Rs. 486
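
A Python sketch of the continuous-series direct method (Problem 6), using class mid-points:

    classes = [(100, 200, 10), (200, 300, 18), (300, 400, 20),
               (400, 500, 26), (500, 600, 30), (600, 700, 28),
               (700, 800, 18)]  # (lower limit, upper limit, frequency)

    N = sum(f for _, _, f in classes)                           # 150
    sum_fm = sum(((lo + hi) / 2) * f for lo, hi, f in classes)  # 72900
    print(sum_fm / N)  # 486.0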

2. Short-cut method

Steps:

1. Find the mid value of each class or group (m).

2. Assume any one of the mid values as an average (A).

3. Find the deviation of the mid value of each class from the assumed mean

(d).
4. Multiply the deviation of each class by its frequency (fd).

5. Apply the formula:

x̄ = A ± ∑fd / N
A = assumed mean

∑fd = sum of the total deviations

N = number of items.

Solution: Calculation of Mean

Profits (Rs.)   m     d = m - 450   f     fd

100 - 200       150   -300          10    -3000

200 - 300       250   -200          18    -3600

300 - 400       350   -100          20    -2000

400 - 500       450   0             26    0

500 - 600       550   100           30    3000

600 - 700       650   200           28    5600

700 - 800       750   300           18    5400

                                    ∑f = 150   ∑fd = 5400

x̄ = A ± ∑fd / N
A = 450; ∑fd = 5400; N (∑f) = 150

x̄ = 450 + 5400 / 150

  = 450 + 36

  = 486
Therefore the average profit is Rs. 486

3. Step deviation method:

The short-cut method discussed above can be simplified further, and the
calculations reduced to a great extent, by adopting the step deviation
method. After finding the deviations from the assumed mean, they are,
where possible, divided by a common factor. Scaling down the

deviations by this "step" keeps the arithmetic to a minimum. In such a

case, the frequencies are multiplied by the step deviations, not by the full
deviations. The reduction introduced by scaling down is counter-
balanced by multiplying the average of the step deviations by the same
common factor. This is done before adding to the assumed mean.

The steps in brief:

1. Find the mid value of each class or group (m).

2. Assume any one of the mid values as the average (A).

3. Find the deviation of each mid value from the assumed

mean (d).

4. Divide the deviations by a common factor (d').

5. Multiply the d' of each class by its frequency (fd').

6. Add up the products of step 5 (∑fd').

7. Then apply the formula:

x̄ = A ± (∑fd' / N) × c

x̄ = mean; A = assumed mean; ∑fd' = sum of the step deviations;

N = number of items; c = common factor
Practical Exercises
Problem No. 7.

(Solving Illustration No. 6 by the step deviation method.)

Profits (Rs.)   m     f     d = m - 450   d' = (m - 450)/100   fd'

100 - 200       150   10    -300          -3                   -30

200 - 300       250   18    -200          -2                   -36

300 - 400       350   20    -100          -1                   -20

400 - 500       450   26    0             0                    0

500 - 600       550   30    100           1                    30

600 - 700       650   28    200           2                    56

700 - 800       750   18    300           3                    54

                      ∑f = 150                                 ∑fd' = 54

x̄ = A + (∑fd' / N) × c
  = 450 + (54 / 150) × 100
  = 450 + (0.36 × 100)
  = 450 + 36
  = 486

Therefore the average profit is Rs. 486
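
A sketch of the step deviation method in Python, with A = 450 and c = 100 as above:

    mids  = [150, 250, 350, 450, 550, 650, 750]
    freqs = [10, 18, 20, 26, 30, 28, 18]
    A, c = 450, 100  # assumed mean and common factor

    sum_fd = sum(f * (m - A) / c for m, f in zip(mids, freqs))  # ∑fd' = 54
    mean = A + (sum_fd / sum(freqs)) * c
    print(mean)  # 486.0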

Cumulative Series

A cumulative series can be of either the "more than" type or the "less than"
type. In the former, the frequencies are cumulated upwards, so that the first class
interval has the highest cumulative frequency and it goes on declining in
subsequent classes. In the "less than" type, the cumulation is done

downwards, so that the first class has the lowest cumulative frequency and the
subsequent classes have higher cumulative frequencies. In both types of
cumulative series, the data are first converted into a simple series,

either exclusive or inclusive. After that, the mean is calculated in the manner

shown in the earlier illustrations.
The following illustrations make these points clear.
Problem No. 8.

Calculate the mean from the following data.

Values          Frequency

Less than 10    4

Less than 20    10

Less than 30    15

Less than 40    25

Less than 50    30

Less than 60    35

Less than 70    45

Less than 80    65

Solution:

In this problem, cumulative frequencies and classes are given. We first

convert the data into a simple series from the given cumulative frequencies.

After this, the mean is calculated. This is illustrated below:

Computation of Mean

Value     Individual frequency   m    d' = (m - 35)/10   fd'
0 - 10    4                      5    -3                 -12

10 - 20   10 - 4 = 6             15   -2                 -12

20 - 30   15 - 10 = 5            25   -1                 -5

30 - 40   25 - 15 = 10           35   0                  0

40 - 50   30 - 25 = 5            45   1                  5

50 - 60   35 - 30 = 5            55   2                  10

60 - 70   45 - 35 = 10           65   3                  30

70 - 80   65 - 45 = 20           75   4                  80

          N = 65                                         ∑fd' = 96

x̄ = A + (∑fd' / N) × c
A = 35, ∑fd' = 96, N = 65, c = 10

x̄ = 35 + (96 / 65) × 10 = 35 + 14.77

= 49.77
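
A sketch converting a less-than cumulative series into simple frequencies and then computing the mean (Problem 8):

    bounds = [10, 20, 30, 40, 50, 60, 70, 80]
    cum_f  = [4, 10, 15, 25, 30, 35, 45, 65]  # "less than" cumulative frequencies

    freqs = [cum_f[0]] + [cum_f[i] - cum_f[i - 1] for i in range(1, len(cum_f))]
    mids  = [b - 5 for b in bounds]  # class width is 10, so mid = upper bound - 5

    mean = sum(m * f for m, f in zip(mids, freqs)) / sum(freqs)
    print(round(mean, 2))  # 49.77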

Problem No. 9.
From the following information pertaining to 150 workers, calculate the
average wage paid to workers.

Wages (Rs.)      No. of workers

More than 75     150

More than 85     140

More than 95     115

More than 105    95

More than 115    70

More than 125    60

More than 135    40

More than 145    25

Solution:

There are no workers who receive wages of less than Rs. 75, so the lower limit of

the first class is 75. The class intervals are 75 - 85, 85 - 95, and so on.

Wages       m     No. of workers (f)   d' = (m - 110)/10   fd'

75 - 85     80    150 - 140 = 10       -3                  -30

85 - 95     90    140 - 115 = 25       -2                  -50

95 - 105    100   115 - 95 = 20        -1                  -20

105 - 115   110   95 - 70 = 25         0                   0

115 - 125   120   70 - 60 = 10         1                   10

125 - 135   130   60 - 40 = 20         2                   40

135 - 145   140   40 - 25 = 15         3                   45

145 - 155   150   25                   4                   100

            N = 150                                        ∑fd' = 95

x̄ = A + (∑fd' / N) × c
A = 110, ∑fd' = 95, N = 150, c = 10

x̄ = 110 + (95 / 150) × 10 = 110 + 6.33

= 116.33

Therefore the average wage is Rs. 116.33


Note:

Inclusive class intervals: while calculating the mean in a continuous series

with inclusive class intervals, it is not necessary to convert the series into an

exclusive class interval series by adjusting the class limits. It is also not

necessary to re-arrange the series in ascending or descending order,

as is done in the case of the median.

Problem No. 10.

Find the mean of the following data:

Class interval:  50 - 59  40 - 49  30 - 39  20 - 29  10 - 19  0 - 9

Frequency:       1        3        9        10       15       2

Solution:

In this illustration, it is neither necessary to convert the data

into an exclusive class interval series (49.5 - 59.5, 39.5 - 49.5, and so on) nor to

arrange the data in ascending order beginning with 0 - 9.

Class interval   Mid value m   Frequency f   d' = (m - 34.5)/10   fd'

50 - 59          54.5          1             2                    2

40 - 49          44.5          3             1                    3

30 - 39          34.5          9             0                    0

20 - 29          24.5          10            -1                   -10

10 - 19          14.5          15            -2                   -30

0 - 9            4.5           2             -3                   -6

                               N = 40                             ∑fd' = -41

x̄ = A + (∑fd' / N) × c
A = 34.5, ∑fd' = -41, N = 40, c = 10

x̄ = 34.5 + (-41 / 40) × 10 = 34.5 - 10.25

= 24.25

Merits of the Arithmetic Mean

The arithmetic mean is the simplest measure of central tendency of a

series. It is widely used because:
1. It is easy to understand.

2. It is easy to calculate.
3. It is used in further calculations.

4. It is rigidly defined.
5. It is based on the value of every item in the series.
6. It provides a good basis for comparison.

7. The arithmetic average can be calculated if we know the number of items and

the aggregate; if the average and the number of items are known, we can find the
aggregate.
8. Its formula is rigidly defined: the mean is the same for a series,
whoever calculates it.
9. It can be used for further analysis and algebraic treatment.

10. The mean is a comparatively stable measure of central tendency (an ideal average).
Demerits (limitations) of the Arithmetic Mean
1. The mean is unduly affected by extreme items.
2. It may be unrealistic, in that it need not coincide with any actual observation.
3. It may lead to a false conclusion.

4. It cannot be accurately determined if even one of the values is not

known.

5. It is not useful for the study of qualities like intelligence, honesty

and character.

6. It cannot be located by observation or by the graphic method.

Uses of the Arithmetic Mean

Even though the arithmetic average is subject to various demerits, it is

considered the best of all averages. It is familiar to everyone.

The arithmetic mean is called the ideal average. It is used in social, economic

and business problems. When we speak of the average cost of production,

average income or average price, we mean the arithmetic average.

Correcting an Incorrect Mean (misread items)

To err is human: it may happen that wrong items are included instead of the

correct ones through mistake or oversight. We then get a wrong mean,

calculated from a wrong sum of the variables, and it has to be corrected.

The process for finding the correct mean is this: from the incorrect ∑x,

deduct the wrong items and add the correct items to obtain the corrected ∑x;

the corrected ∑x is then divided by the number of observations. This gives us the

correct mean.

Problem No. 11.
The average mark secured by 36 students was 52. It was later

discovered that an item 64 was misread as 46. Find the correct mean of the

marks.

Solution:

N = 36, x̄ = 52

x̄ = ∑x / N, so ∑x = x̄ × N = 52 × 36 = 1872

Wrong ∑x = 1872
Correct ∑x = incorrect ∑x - wrong item + correct item
           = 1872 - 46 + 64

           = 1890

x̄ = ∑x / N = 1890 / 36 = 52.5 marks
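
The correction of Problem 11 in Python (a sketch):

    N, wrong_mean = 36, 52
    wrong_sum = wrong_mean * N         # 1872

    correct_sum = wrong_sum - 46 + 64  # drop the misread item, add the true one
    print(correct_sum / N)             # 52.5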


Practical Exercises

Problem No. 12.

The mean of 100 items was 46. Later it was discovered that an

item 16 was misread as 61 and another item 43 was misread as 34. It was
also found that the number of items was 90, not 100. Find the correct mean.
Solution:

Wrong aggregate of 100 items = 100 × 46            = 4600

Less wrong values included (61 and 34 = 95)        = 4505

Add correct values to be included (16 and 43 = 59) = 4564

Correct aggregate ∑x = 4564

Correct number of items = 90

x̄ = ∑x / N
  = 4564 / 90

  = 50.71

Combined Arithmetic Mean

If we know the means and the numbers of items in two or more related

groups, the combined or composite mean can be computed with the help of

the following formulas:

x̄12 = (N1x̄1 + N2x̄2) / (N1 + N2)

x̄123 = (N1x̄1 + N2x̄2 + N3x̄3) / (N1 + N2 + N3)

x̄12, x̄123 = the combined means

x̄1, x̄2, x̄3 = the arithmetic means of the first, second and third groups

N1, N2, N3 = the numbers of items in the first, second and third groups


Problem No. 13.

There are two branches of a company, employing 100 and 80

persons respectively. If the arithmetic means of the monthly salaries paid by

the two branches are Rs. 275 and Rs. 225 respectively, find the arithmetic
mean of the salaries of the employees of the company as a whole.

x̄12 = (N1x̄1 + N2x̄2) / (N1 + N2)
    = (100 × 275 + 80 × 225) / (100 + 80)
    = (27500 + 18000) / 180
    = 45500 / 180

    = 252.78
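
The combined mean in Python (a sketch of Problem 13):

    groups = [(100, 275), (80, 225)]  # (number of employees, mean salary)

    total_salary = sum(n * mean for n, mean in groups)  # N1x̄1 + N2x̄2
    total_people = sum(n for n, _ in groups)            # N1 + N2
    print(round(total_salary / total_people, 2))        # 252.78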

Finding an Unknown Value (x)

Problem No. 14.

Find the missing value of the variate for the following

distribution, whose mean is 31.87.

x: 12 20 27 33 ?  54

f: 8  16 48 90 30 8

Solution:

Computation of the missing value.

x     f     fx

12    8     96

20    16    320

27    48    1296

33    90    2970

x     30    30x

54    8     432

      N = 200   ∑fx = 5114 + 30x

x̄ = ∑fx / N
31.87 = (5114 + 30x) / 200
31.87 × 200 = 5114 + 30x

6374 = 5114 + 30x

6374 - 5114 = 30x

1260 = 30x

x = 1260 / 30 = 42

Hence, the missing value of the variate is 42.
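
Solving Problem 14 for the missing value in Python (a sketch):

    known_fx = 12*8 + 20*16 + 27*48 + 33*90 + 54*8  # 5114
    N, f_missing, mean = 200, 30, 31.87

    # mean = (known_fx + f_missing * x) / N, solved for x
    x = (mean * N - known_fx) / f_missing
    print(x)  # 42.0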

Algebraic properties of the arithmetic mean

1. The sum of the deviations of the items from the arithmetic mean,

taking into account plus and minus signs, is always zero. That is, ∑(x - x̄) = 0,
i.e., ∑d = 0.

2. The sum of the squared deviations of the items from the mean is a minimum.

That is, ∑(x - x̄)², i.e., ∑d², is a minimum.

3. If any two of the three values (the arithmetic mean x̄, the number of items
N, and the total of the values ∑x) are known, the third can be found.
4. If the arithmetic means and the numbers of items of two or more
related groups are known, their combined mean can be computed.

5. If a constant is added to or subtracted from each item in a series, the

mean will increase or decrease by the same amount. If the items in a series
are multiplied by, say, 2, the mean will also become two times the original.
6. If the values of some of the items in a series are changed, then
whatever change has taken place in the total value, divided by the number

of items, should be added to or deducted from the old mean to find the new mean:

New x̄ = old x̄ ± net increase (+) or decrease (-) / N


UNIT II

MEASURES OF DISPERSION

While measures of central tendency are used to estimate "normal" values of

a dataset, measures of dispersion are important for describing the spread of the

data, or its variation around a central value. Two distinct samples may have the

same mean or median but completely different levels of variability, or vice versa. A

proper description of a set of data should include both of these characteristics.

There are various methods that can be used to measure the dispersion of a

dataset, each with its own set of advantages and disadvantages.

Range

The range is defined as the difference between the largest and smallest sample values.

It is one of the simplest measures of variability to calculate, but it depends only on

the extreme values and provides no information about how the

remaining data are distributed.
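
A one-line computation in Python (using the student-weight data from a later illustration in this unit):

    weights = [27, 30, 35, 36, 38, 40, 43]
    print(max(weights) - min(weights))  # range = 16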

Example: Find the range of global observed sea surface temperatures at each

grid point over the time period December 1981 to the present.

Locate Dataset and Variable

Select the "Datasets by Catagory" link in the blue banner on the Data Library page.
Click on the "Air- Sea Interface" link.

Select the NOAA NCEP EMC CMB GLOBAL Reyn_Smith dataset.

Click on the "Reyn_SmithOIv2" link.

Scroll down the page and select the "monthly" link under the Datasets and

Variables subheading.

Choose the "Sea Surface Temperature" link again located under the Datasets and

Variables subheading. CHECK

Find Maximum Value

Click on the "Filters" link in the function bar.

To the right, you will see a selection of grids from which you may select any one or

combination.
Select the Maximum over "T" command.

This operation finds the maximum SST for each grid point over the time grid T.
View Maximum Values

To see the results of this operation, choose the viewer window with land drawn in

black.

Maximum Observed Sea Surface Temperatures

Find Minimum Values and Subtract from Maximum Values


Return to the dataset page by clicking on the right- most link on the blue source

bar.

Click on the "Expert Mode" link in the function bar.

Enter the following lines below the text already there:

SOU RCES .NOAA .NCEP .EMC .CMB .GLOBAL .Reyn_SmithOIv2 .monthly .sst

[T]minover

sub

Press the OK button.

The above command subtracts the monthly minimum SST from the monthly

maximum SST. The result is a range of SST values for each spatial grid point.

View Range
To see your results, choose the viewer with land shaded in black.

Range of Observed Sea Surface Temperatures

Generally, there is a larger range of sea- surface temperatures near the

coasts and in smaller, sheltered bodies of water compared to the open ocean. For

example, the Caspian Sea has a sea surface temperature range of over 25°C, while

the sea surface temperature range of the non- coastal Atlantic Ocean at a

comparable latitude does not exceed 12°C. This image also illustrates relatively

large ranges off the west coast of South America, which is related to the El Niño

Southern Oscillation (ENSO).

QUARTILE

A quartile is a type of quantile. The first quartile (Q1) is defined as the


middle number between the smallest number and the median of the data

set. The second quartile (Q2) is the median of the data. The third quartile (Q3) is
the middle value between the median and the highest value of the data set.

In applications of statistics such as epidemiology, sociology and finance, the

quartiles of a ranked set of data values are the four subsets whose boundaries are

the three quartile points. Thus an individual item might be described as being "in

the upper quartile".


DECILES

The deciles are the nine values of the variable that divide an ordered data

set into ten equal parts.

The deciles determine the values for 10 % , 20 % ... and 90 % of the data.

D5 coincides with the median.

Calculating Deciles

1. Order the data from smallest to largest.

2. Find the position of each decile in the cumulative frequency table using the expression:

Dk = Li + ((kN/10 - Fi-1) / fi) × ai,  for k = 1, 2, ..., 9

Li is the lower limit of the decile class.

N is the sum of the absolute frequencies.

Fi-1 is the cumulative absolute frequency immediately below the decile class.

fi is the absolute frequency of the decile class.

ai is the width of the decile class.

The deciles are independent of the widths of the other classes.

Example
Calculate the deciles of the distribution in the following table:

              fi    Fi
[50, 60)      8     8

[80, 90)      14    48

[90, 100)     10    58

[100, 110)    5     63

[110, 120)    2     65

              N = 65

Each decile D1 to D9 is then found by locating the class whose cumulative frequency Fi first reaches kN/10 and applying the formula above.

QUARTILE DEVIATION AND ITS COEFFICIENT

Quartile Deviation
Quartile deviation is based on the lower quartile Q1 and the upper quartile

Q3. The difference Q3 - Q1 is called the inter-quartile range. The difference Q3 - Q1
divided by 2 is called the semi-inter-quartile range, or the quartile deviation. Thus:

Q.D. = (Q3 - Q1) / 2

The quartile deviation is a slightly better measure of absolute dispersion

than the range, but it ignores the observations in the tails. If we take different

samples from a population and calculate their quartile deviations, the values are

quite likely to differ considerably. This is called sampling fluctuation, and for this reason

the quartile deviation is not a popular measure of dispersion. The quartile deviation

calculated from sample data does not help us to draw any conclusion (inference) about the
quartile deviation of the population.
Coefficient of Quartile Deviation

A relative measure of dispersion based on the quartile deviation is called the

coefficient of quartile deviation. It is defined as:

Coefficient of Quartile Deviation = (Q3 - Q1) / (Q3 + Q1)

It is a pure number, free of any units of measurement. It can be used for comparing

the dispersion of two or more sets of data.
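
A sketch computing the quartile deviation and its coefficient with the standard library (the data are hypothetical; note that quartile conventions vary slightly between textbooks and libraries):

    import statistics

    data = [20, 25, 30, 35, 40, 45, 50, 55, 60]
    q1, _, q3 = statistics.quantiles(data, n=4)  # Q1, median, Q3

    qd = (q3 - q1) / 2             # semi-inter-quartile range
    coeff = (q3 - q1) / (q3 + q1)  # coefficient of quartile deviation
    print(qd, round(coeff, 3))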


Standard Deviation

The standard deviation is the square root of the sample variance. It is

defined so that it can be used to make inferences about the population variance,

and is calculated using the formula:

s = sqrt( ∑(xi - x̄)² / (n - 1) )

The values computed in the squared term, xi - x̄, are anomalies, which are

discussed in another section. Unlike the root mean square

anomaly discussed later in this section, the standard deviation is not restricted to
large-sample datasets.

It provides significant insight into the distribution of data around the mean,
which is approximately normal in shape:

The mean ± one standard deviation contains approximately 68% of the

measurements in the series.

The mean ± two standard deviations contains approximately 95% of the
measurements in the series.

The mean ± three standard deviations contains approximately 99.7% of the

measurements in the series.
Climatologists often use standard deviations to help classify abnormal climatic

conditions. The chart below describes the abnormality of a data value by how
many standard deviations it lies away from the mean. The probabilities in the
third column assume the data are normally distributed.
Standard deviations from mean   Abnormality               Probability of occurrence

beyond -3 sd                    extremely subnormal       0.15%

-3 to -2 sd                     greatly subnormal         2.35%

-2 to -1 sd                     subnormal                 13.5%

-1 to +1 sd                     normal                    68.0%

+1 to +2 sd                     above normal              13.5%

+2 to +3 sd                     greatly above normal      2.35%

beyond +3 sd                    extremely above normal    0.15%
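
A sketch computing the standard deviation (population form, divisor n, matching the illustration later in this unit) and expressing a value's distance from the mean in standard deviations:

    import statistics

    data = [14, 22, 9, 15, 20, 17, 12, 11]
    mean = statistics.mean(data)      # 15
    sd = statistics.pstdev(data)      # population SD, ~4.18

    z = (22 - mean) / sd              # 22 lies ~1.67 SDs above the mean
    print(round(sd, 2), round(z, 2))  # falls in the "above normal" band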

MEASURE OF SKEWNESS

In statistics we generally deal with the management, observation and calculation

of large numerical datasets. In the statistical analysis of a survey or

research study, a researcher is required to know about the distribution, central

tendency, dispersion, and so on.

It is also necessary to know the variability and location of the

given data set. This includes the measurement of the skewness of the data, since not all

given data distributions are symmetric.

Skewness measures how asymmetric the distribution is. We can say that

skewness is the measure of the asymmetry of the data. It also determines whether the data are

skewed to the left or to the right.

The measure of skewness is utilized in many areas. We know that data

which are normally distributed are symmetric about the mean, with

skewness equal to zero. But usually, distributions are not symmetric.

Thus, the analysis of skewness becomes essential, as it describes the deviations

from the symmetric position. Asymmetric or skewed data are not a perfect mirror

image about the mean. By measuring skewness, one can determine how the
mean, median and mode are connected to one another. Let us go ahead in this

page and learn about skewness, its calculation and its applications in detail.

Definition

The measure of skewness determines the extent of asymmetry, or lack of

symmetry. A distribution is said to be asymmetric if its graph does not appear

similar to the right and to the left of the central position. In more statistical

language, skewness measures the asymmetry of the probability

distribution of a given real-valued random variable about its mean. Skewness

can be observed by inspection when the number of observations is small.

For example, when the numbers 9, 10, 11 are given, we may easily see

that the values are equally distributed about the mean 10. But if we add the number

5, so that the data become 5, 9, 10, 11, then we can say that the distribution is no longer

symmetric; it is skewed.

The skewness can be seen by looking at the graph of the distribution. Skewness
can be of two types: positive skew and negative skew.

Positive Skew: When the distribution is concentrated on the left side of
the graph and the right tail is longer, it is said to be positively skewed. This is also

called a right-tailed or right-skewed distribution.

Negative Skew: When the concentration of the curve is higher on
the right side and the left tail is longer, the distribution is said to be left-

tailed, left-skewed or negatively skewed.

Relative Measures of Skewness

There are three important measures of relative skewness:

1. Karl Pearson's coefficient of skewness

2. Bowley's coefficient of skewness

3. Kelly's coefficient of skewness

1. Karl Pearson's coefficient of skewness

According to Karl Pearson, absolute skewness = mean - mode. This measure is not

suitable for making valid comparisons of the skewness in two or more distributions,

because (a) the unit of measurement may be different in different series, and (b)

the same amount of skewness has different significance in series with small or large variation.

Therefore, to avoid these difficulties, a relative measure is adopted.

This is done by dividing the difference between the mean and the mode by the
standard deviation. The resulting coefficient is called the Pearsonian coefficient of

skewness. Thus:

Coefficient of skewness (Skp) = (Mean - Mode) / Standard Deviation

2. Bowley's coefficient of skewness

In the above method of measuring skewness, the whole of the series is needed.

Prof. Bowley has suggested a formula based on the relative positions of the quartiles. In a

symmetrical distribution, the quartiles are equidistant from the value of the median;

i.e., Median - Q1 = Q3 - Median. This means the value of the median is the mean of

Q1 and Q3. But in a skewed distribution, the quartiles will not be equidistant from

the median. Hence Bowley has suggested the following formula:

Absolute Sk = (Q3 - Median) - (Median - Q1)

            = Q3 + Q1 - 2 Median

Coefficient of Sk = (Q3 + Q1 - 2 Median) / (Q3 - Q1)

Illustration:

Find the range of the weights of 7 students from the following: 27, 30, 35, 36, 38,

40 and 43.

Solution:

Range = L - S

      = 43 - 27

      = 16

Coefficient of range = (L - S) / (L + S)
                     = 16 / 70
                     = 0.23
Illustration:

Calculate the semi inter quartile range and quartile coefficient from the

following

Age in years No. of members

20 3

30 61

40 132

50 153

60 140

70 51

80 3

Solution

Calculation of Quartiles

Age in years No. of members c.ƒ .

20 3 3

30 61 64

40 132 196
50 153 349

60 140 489

70 51 540

80 3 543

Q1 = value of the (N + 1)/4 th item

   = value of the (543 + 1)/4 th item

   = value of the 136th item

   = 40 years

Q3 = value of the 3(N + 1)/4 th item

   = value of the 3 × 136 th item

   = value of the 408th item, which is 60 years

Q.D. = (Q3 - Q1) / 2
     = (60 - 40) / 2
     = 10 years

Coefficient of Q.D. = (Q3 - Q1) / (Q3 + Q1)

                    = 20 / 100 = 0.2

Illustration:

Calculate the mean deviation from the mean and from the median for the following data:

100, 150, 200, 250, 360, 490, 500, 600, 671

Also calculate the coefficients of mean deviation.

Solution:
Calculation of Mean Deviation

X          |D| = |X - 369|   |D| = |X - 360|

100        269               260

150        219               210

200        169               160

250        119               110

360        9                 0

490        121               130

500        131               140

600        231               240

671        302               311

∑X = 3321  ∑|D| = 1570       ∑|D| = 1561

Mean:

X̄ = ∑X / N = 3321 / 9

  = 369

M.D. from mean = ∑|D| / N

               = 1570 / 9

               = 174.44

Coefficient of M.D.
= 174.44 / 369 = 0.47

Median:
Median = value of the (N + 1)/2 th item

       = value of the (9 + 1)/2 th item

       = value of the 5th item

       = 360

M.D. from median = ∑|D| / N

                 = 1561 / 9

                 = 173.44

Coefficient of M.D.

= 173.44 / 360 = 0.48

Illustration:

Calculate the standard deviation of the following data:

14, 22, 9, 15, 20, 17, 12, 11

Solution:

Calculation of S.D. from the actual mean

Values (X)   X - X̄ (X - 15)   (X - X̄)²

14           -1                1

22           7                 49

9            -6                36

15           0                 0

20           5                 25
17           2                 4

12           -3                9

11           -4                16

∑X = 120                       ∑(X - X̄)² = 140

X̄ = ∑X / N = 120 / 8

  = 15

σ = sqrt( ∑(X - X̄)² / N )
  = sqrt( 140 / 8 )

  = 4.18

Illustration:

The index numbers of prices of cotton and coal shares in 2007 were as under:

Month       Index number of prices    Index number of prices
            of cotton shares          of coal shares

January     188                       131

February    178                       130

March       173                       130

April       164                       129

May         172                       129

June        183                       120

July        184                       127

August      185                       127

September   211                       130

October     217                       137

November    232                       140

December    240                       142

Which of the two shares do you consider more variable in price?

Solution:

Computation of the coefficient of variation

            Cotton (X series, A = 184)    Coal (Y series, A = 130)
Month       X     dX    d²X               Y     dY    d²Y
January     188   +4    16                131   +1    1
February    178   -6    36                130   0     0
March       173   -11   121               130   0     0
April       164   -20   400               129   -1    1
May         172   -12   144               129   -1    1
June        183   -1    1                 120   -10   100
July        184   0     0                 127   -3    9
August      185   +1    1                 127   -3    9
September   211   +27   729               130   0     0
October     217   +33   1089              137   +7    49
November    232   +48   2304              140   +10   100
December    240   +56   3136              142   +12   144
            ∑dX = 119   ∑d²X = 7977       ∑dY = 12    ∑d²Y = 414

For the cotton shares (X):

X̄ = A + ∑dX / N = 184 + 119 / 12 = 184 + 9.9 = 193.9

σx = sqrt( ∑d²X / N - (∑dX / N)² )
   = sqrt( 7977 / 12 - (9.9)² )
   = sqrt( 664.75 - 98.01 )
   = 23.81

C.V.(X) = σx / X̄ × 100 = 23.81 / 193.9 × 100 = 12.28%

For the coal shares (Y):

Ȳ = A + ∑dY / N = 130 + 12 / 12 = 130 + 1 = 131

σy = sqrt( ∑d²Y / N - (∑dY / N)² )
   = sqrt( 414 / 12 - (1)² )
   = sqrt( 34.5 - 1 )
   = 5.79

C.V.(Y) = σy / Ȳ × 100 = 5.79 / 131 × 100 = 4.42%

Hence cotton shares are more variable in price than the coal shares.
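
A sketch verifying the two coefficients of variation with the standard library:

    import statistics

    cotton = [188, 178, 173, 164, 172, 183, 184, 185, 211, 217, 232, 240]
    coal   = [131, 130, 130, 129, 129, 120, 127, 127, 130, 137, 140, 142]

    def cv(series):
        # coefficient of variation = (population SD / mean) x 100
        return statistics.pstdev(series) / statistics.mean(series) * 100

    print(round(cv(cotton), 2), round(cv(coal), 2))  # ~12.28 and ~4.42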
Unit III

INTRODU CTION TO REGRESSION AND CORRELATION

The statistical methods discussed so far are used to analyze data involving

only one variable. Often an analysis of data concerning two or more variables is

needed to look for any statistical relationship or association between them.


A few instances where knowledge about an association or relationship between

two variables would be vital to making a decision are:

Family income and expenditure on luxury items

Sales revenue and expenses incurred on advertising

Yield of a crop and quantity of fertilizer applied

The following aspects are considered when examining the statistical relationship

between two or more variables:

Is there an association between two or more variables? If yes, what is the form

and degree of that relationship?

Is the relationship strong or significant enough to arrive at a desirable

conclusion?
Can the relationship be used for predictive purposes, that is, to predict the most

likely value of a dependent variable corresponding to the given value of the


independent variable or variables?

There are two different techniques which are used for the study of two or

more variables: regression and correlation. Both study the behavior of the

variables but they differ in their end results.

Regression studies the relationship where dependence is necessarily


involved. One variable is dependent on a certain number of variables. Regression

can be used for predicting the values of a variable which depends upon other

variables. The term regression was introduced by the English biometrician Sir

Francis Galton (1822 - 1911).

Correlation attempts to study the strength of the mutual relationship

between two variables. In correlation we assume that the variables are random

and dependence of any nature is not involved.

Correlation

Correlation is a statistical technique that can show whether and how

strongly pairs of variables are related. For example, height and weight are related;

taller people tend to be heavier than shorter people. The relationship isn't perfect.
People of the same height vary in weight, and you can easily think of two people

you know where the shorter one is heavier than the taller one. Nonetheless, the

average weight of people 5'5'' is less than the average weight of people 5'6'', and

their average weight is less than that of people 5'7'', etc. Correlation can tell you

just how much of the variation in peoples' weights is related to their heights.

Although this correlation is fairly obvious your data may contain

unsuspected correlations. You may also suspect there are correlations, but don't

know which are the strongest. An intelligent correlation analysis can lead to a

greater understanding of your data.

Techniques in Determining Correlation

There are several different correlation techniques. The Survey System's


optional Statistics Module includes the most common type, called the

Pearson or product-moment correlation. The module also includes a variation on


this type called partial correlation. The latter is useful when you want to look at the

relationship between two variables while removing the effect of one or two other

variables.

Like all statistical techniques, correlation is only appropriate for certain kinds

of data. Correlation works for quantifiable data in which numbers are meaningful,
usually quantities of some sort. It cannot be used for purely categorical data, such

as gender, brands purchased, or favorite color.

Rating Scales

Rating scales are a controversial middle case. The numbers in rating scales

have meaning, but that meaning isn't very precise. They are not like quantities.

With a quantity (such as dollars), the difference between 1 and 2 is exactly the

same as between 2 and 3. With a rating scale, that isn't really the case. You can

be sure that your respondents think a rating of 2 is between a rating of 1 and a

rating of 3, but you cannot be sure they think it is exactly halfway between. This is

especially true if you labeled the mid-points of your scale (you cannot assume

"good" is exactly half way between "excellent" and "fair").


Most statisticians say you cannot use correlations with rating scales,

because the mathematics of the technique assume the differences between

numbers are exactly equal. Nevertheless, many survey researchers do use

correlations with rating scales, because the results usually reflect the real world.

Our own position is that you can use correlations with rating scales, but you should

do so with care. When working with quantities, correlations provide precise

measurements. When working with rating scales, correlations provide general

indications.

Correlation Coefficient

The main result of a correlation is called the correlation coefficient (or "r"). It

ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two
variables are related.

If r is close to 0, it means there is no relationship between the variables. If r is


positive, it means that as one variable gets larger the other gets larger. If r is

negative it means that as one gets larger, the other gets smaller (often called an

"inverse" correlation).

While correlation coefficients are normally reported as r = (a value between -1 and

+1), squaring them makes them easier to understand. The square of the coefficient
(or r squared) is equal to the percentage of the variation in one variable that is related to

the variation in the other. After squaring r, ignore the decimal point: an r of .5

means 25% of the variation is related (.5 squared = .25), and an r value of .7 means

49% of the variance is related (.7 squared = .49).

A correlation report can also show a second result of each test - statistical

significance. In this case, the significance level will tell you how likely it is that the

correlations reported may be due to chance in the form of random sampling error.

If you are working with small sample sizes, choose a report format that includes the

significance level. This format also reports the sample size.

A key thing to remember when working with correlations is never to assume

a correlation means that a change in one variable causes a change in another.


Sales of personal computers and athletic shoes have both risen strongly in the last

several years and there is a high correlation between them, but you cannot

assume that buying computers causes people to buy athletic shoes (or vice versa).

The second caveat is that the Pearson correlation technique works best

with linear relationships: as one variable gets larger, the other gets larger (or

smaller) in direct proportion. It does not work well with curvilinear relationships (in

which the relationship does not follow a straight line). An example of a curvilinear

relationship is age and health care. They are related, but the relationship doesn't

follow a straight line. Young children and older people both tend to use much more

health care than teenagers or young adults. Multiple regression (also included in

the Statistics Module) can be used to examine curvilinear relationships, but it is


beyond the scope of this article.

Correlations are useful because if you can find out what relationship
variables have, you can make predictions about future behavior. Knowing what the

future holds is very important in the social sciences like government and

healthcare. Businesses also use these statistics for budgets and business plans.

The Correlation Coefficient

A correlation coefficient is a way to put a value to the relationship.


Correlation coefficients have a value of between -1 and 1. A "0" means there is no

relationship between the variables at all, while -1 or 1 means that there is a perfect

negative or positive correlation (negative or positive correlation here refers to the

type of graph the relationship will produce).

Types

The most common correlation coefficient is the Pearson Correlation

Coefficient. It’s used to test for linear relationships between data. In AP stats or

elementary stats, the Pearson is likely the only one you’ll be working with. However,

you may come across others, depending upon the type of data you are working

with. For example, Goodman and Kruskal’s lambda coefficient is a fairly common

coefficient. It can be symmetric, where you do not have to specify which variable is
dependent, and asymmetric where the dependent variable is specified.

Simple Correlation

In a bivariate distribution, we are interested in finding out whether there is any

relationship between two variables. Correlation is a statistical technique which

studies the relationship between two or more variables, and correlation analysis

involves the various methods and techniques used for studying and measuring the

extent of the relationship between the two variables. When two variables are related in

such a way that a change in the value of one is accompanied either by a direct

change or by an inverse change in the values of the other, the two variables are

said to be correlated. In correlated variables, an increase in one variable is

accompanied by an increase or decrease in the other variable. For instance,

a relationship exists between the price and demand of a commodity because,

other things being equal, an increase in the price of a commodity causes a

decrease in the demand for that commodity. Relationships might exist between the

heights and weights of students, and between the amount of rainfall in a city and

the sales of raincoats in that city.

Some important definitions of correlation follow. Croxton and

Cowden say, "When the relationship is of a quantitative nature, the appropriate

statistical tool for discovering and measuring the relationship and expressing it in a

brief formula is known as correlation."

A.M. Tuttle says, "Correlation is an analysis of the covariation between two

or more variables." W.A. Neiswanger says, "Correlation analysis contributes to the

understanding of economic behavior, aids in locating the critically important

variables on which others depend, may reveal to the economist the connections by

which disturbances spread and suggest to him the paths through which

stabilizing forces may become effective."

Utility of Correlation

The study of correlation is very useful in practical life, as revealed by these

points.
1. With the help of correlation analysis, we can measure in one figure the degree of

relationship existing between variables like price, demand, supply, income,

expenditure, etc. Once we know that two variables are correlated, we can

easily estimate the value of one variable, given the value of the other.

2. Correlation analysis is of great use to economists and businessmen: it reveals to

the economist the disturbing factors and suggests to him the stabilizing forces. In

business, it enables the executive to estimate costs, sales, etc. and plan

accordingly.

3. Correlation analysis is helpful to scientists, for nature has been found to be a

multiplicity of interrelated forces.

Difference between Correlation and Causation

The term correlation should not be misunderstood as causation. If

correlation exists between two variables, it must not be assumed that a change in
one variable is the cause of the change in the other. In simple words, a change

in one variable may be associated with a change in another variable, but this

change need not necessarily be the cause of the change in the other variable. When

there is no cause-and-effect relationship between two variables but a correlation is

nevertheless found between them, such correlation is known as "spurious

correlation" or "nonsense correlation".

Methods of Studying Correlation

There are different methods which help us to find out whether the

variables are related or not:

1. Scatter Diagram Method

2. Graphic Method

3. Karl Pearson's Coefficient of Correlation

4. Rank Method

We shall discuss these methods one by one.

(1) Scatter Diagram: A scatter diagram is drawn to visualise the relationship

between two variables. The values of the more important variable are plotted on the X-
axis, while the values of the other variable are plotted on the Y-axis. On the graph, dots

are plotted to represent different pairs of data. When dots are plotted for

all the pairs, we get a scatter diagram. The way the dots scatter gives an indication

of the kind of relationship that exists between the two variables. While drawing a

scatter diagram, it is not necessary to take the zero values of the X

and Y variables as the origin; the minimum values of the variables considered may be taken instead.

When there is a positive correlation between the variables, the dots on the

scatter diagram run from the bottom left-hand corner to the upper right-hand corner. In the case

of perfect positive correlation, all the dots lie on a straight line.

When a negative correlation exists between the variables, the dots on the

scatter diagram run from the upper left-hand corner to the bottom right-hand

corner. In the case of perfect negative correlation, all the dots lie on a straight

line.
(2) Graphic Method. In this method, the individual values of the two variables

are plotted on graph paper, giving two curves: one for the X

variable and another for the Y variable. The closeness with which the two curves move together indicates the degree of correlation.

(3) Karl Pearson's Coefficient of Correlation. Karl Pearson's method,

popularly known as the Pearsonian coefficient of correlation, is most widely applied in

practice to measure correlation. The Pearsonian coefficient of correlation is

represented by the symbol r.

According to Karl Pearson's method, the coefficient of correlation between two

variables is obtained by dividing the sum of the products of the corresponding

deviations of the various items of the two series from their respective means by the

product of their standard deviations and the number of pairs of observations.

Symbolically,

r = ∑xy / (N σx σy)    ...(i)

where r stands for the coefficient of correlation; x1, x2, x3,

x4, ..., xn are the deviations of the various items of the first variable from its

mean; y1, y2, y3, ..., yn are the deviations of all the items of the second

variable from its mean; ∑xy is the sum of the products of these corresponding deviations;

N stands for the number of pairs; σx stands for the standard deviation of the X

variable; and σy stands for the standard deviation of the Y variable. Here

σx = sqrt( ∑x² / N ) and σy = sqrt( ∑y² / N ).

If we substitute these values of σx and σy into formula (i), we get

r = ∑xy / sqrt( ∑x² × ∑y² ).

The degree of correlation varies between +1 and -1; the result will be +1

in the case of perfect positive correlation and -1 in the case of perfect negative

correlation.

The computation of the correlation coefficient can be simplified by dividing the

given data by a common factor. In such a case, the final result is not multiplied back by

the common factor, because the coefficient of correlation is independent of change of

scale and origin.
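
A sketch of formula (i) above, also checking the independence of origin and scale just mentioned (the data are hypothetical):

    import math

    def pearson_r(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        return sxy / math.sqrt(sxx * syy)

    X = [10, 20, 30, 40, 50]
    Y = [12, 19, 33, 38, 55]

    r1 = pearson_r(X, Y)
    # shifting the origin and rescaling leaves r unchanged
    r2 = pearson_r([(x - 30) / 10 for x in X], [(y - 12) / 5 for y in Y])
    print(round(r1, 4), round(r2, 4), round(r1 ** 2, 4))  # r, r again, r²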

Karl Pearson's Coefficient of Correlation

Definition: Karl Pearson's Coefficient of Correlation is a widely used

mathematical method in which a numerical expression is used to calculate

the degree and direction of the relationship between linearly related variables.
Pearson's method, popularly known as the Pearsonian Coefficient of

Correlation, is the most extensively used quantitative method in practice. The

coefficient of correlation is denoted by "r".

If the relationship between two variables X and Y is to be ascertained, then the

following formula is used:

r = ∑(X - X̄)(Y - Ȳ) / sqrt( ∑(X - X̄)² × ∑(Y - Ȳ)² )
Properties of the Coefficient of Correlation

The value of the coefficient of correlation (r) always lies between ±1, such that:

r = +1: perfect positive correlation

r = -1: perfect negative correlation

r = 0: no correlation

The coefficient of correlation is independent of origin and scale. Independence of origin

means that subtracting any non-zero constant from the given values of X and Y leaves the

value of "r" unchanged; independence of scale means that there is no effect on the value of

"r" if the values of X and Y are divided or multiplied by any constant.

The coefficient of correlation is the geometric mean of the two regression coefficients.

Symbolically it is represented as:

r = ±sqrt( bxy × byx )

The coefficient of correlation is zero when the variables X and Y are

independent. However, the converse is not true.
Assumptions of Karl Pearson's Coefficient of Correlation
The relationship between the variables is "linear", which means that when the two

variables are plotted, the points plotted form a straight line.

There are a large number of independent causes affecting the variables

under study, so as to form a normal distribution. Variables like price,

demand, supply, etc. are affected by so many such factors that a normal distribution is

formed.

The variables are independent of each other.

Spearman's rank correlation coefficient

In statistics, Spearman's rank correlation coefficient, or Spearman's rho,

named after Charles Spearman and often denoted by the Greek letter ρ (rho) or as rs,

is a nonparametric measure of rank correlation (statistical dependence between

the rankings of two variables). It assesses how well the relationship between two

variables can be described using a monotonic function.
The Spearman correlation between two variables is equal to the Pearson

correlation between the rank values of those two variables; while Pearson's

correlation assesses linear relationships, Spearman's correlation assesses


monotonic relationships (whether linear or not). If there are no repeated data

values, a perfect Spearman correlation of +1 or −1 occurs when each of the

variables is a perfect monotone function of the other.

Intuitively, the Spearman correlation between two variables will be high

when observations have a similar (or identical for a correlation of 1) rank (i.e. relative

position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between

the two variables, and low when observations have a dissimilar (or fully opposed

for a correlation of −1) rank between the two variables.

Spearman's coefficient is appropriate for both continuous and discrete

ordinal variables. Both Spearman's ρ and Kendall's τ can be formulated as special

cases of a more general correlation coefficient.
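
A sketch of Spearman's rho via the rank-difference formula ρ = 1 - 6∑d² / (n(n² - 1)), which holds when there are no tied ranks (the scores are hypothetical):

    def rank(values):
        # rank 1 for the smallest value (no ties assumed)
        order = sorted(values)
        return [order.index(v) + 1 for v in values]

    def spearman_rho(xs, ys):
        rx, ry = rank(xs), rank(ys)
        d_sq = sum((a - b) ** 2 for a, b in zip(rx, ry))
        n = len(xs)
        return 1 - 6 * d_sq / (n * (n * n - 1))

    # hypothetical marks given by two judges to five contestants
    print(spearman_rho([86, 60, 72, 45, 90], [80, 55, 60, 50, 85]))  # 1.0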


Regression

Regression is a statistical measure used in finance, investing and other


disciplines that attempts to determine the strength of the relationship between

one dependent variable (usually denoted by Y) and a series of other changing

variables (known as independent variables). Regression helps investment and

financial managers to value assets and understand the relationships between

variables, such as commodity prices and the stocks of businesses dealing in those
commodities.

The two basic types of regression are linear regression and multiple linear

regression, although there are non-linear regression methods for more

complicated data and analysis. Linear regression uses one independent variable

to explain or predict the outcome of the dependent variable Y, while multiple

regression uses two or more independent variables to predict the outcome.

Regression can help finance and investment professionals as well as

professionals in other businesses. Regression can help predict sales for a

company based on weather, previous sales, GDP growth or other conditions. The

capital asset pricing model (CAPM) is an often-used regression model in finance

for pricing assets and discovering costs of capital. The general form of each type
of regression is:

Linear Regression: Y = a + bX + u

Multiple Regression: Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u

Where:

Y = the variable that you are trying to predict (dependent variable)

X = the variable that you are using to predict Y (independent variable)

a = the intercept

b = the slope

u = the regression residual
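
A sketch of simple linear regression, estimating a and b by least squares (the data are hypothetical):

    xs = [1, 2, 3, 4, 5]            # independent variable X
    ys = [2.1, 4.3, 5.9, 8.2, 9.8]  # dependent variable Y

    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n

    # least-squares estimates for Y = a + bX
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    print(round(a, 2), round(b, 2))  # intercept and slope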

Regression takes a group of random variables, thought to be predicting Y,

and tries to find a mathematical relationship between them. This relationship is


typically in the form of a straight line (linear regression) that best

approximates all the individual data points. In multiple regression, the separate
variables are differentiated by using numbers with subscript.

Regression in Investing

Regression is often used to determine how many specific factors such as

the price of a commodity, interest rates, particular industries or sectors influence

the price movement of an asset. The aforementioned CAPM is based on


regression, and it is utilized to project the expected returns for stocks and to

generate costs of capital. A stock's returns are regressed against the returns of a

broader index, such as the S&P 500, to generate a beta for the particular stock.

Beta is the stock's risk in relation to the market or index and is reflected as the

slope in the CAPM model. The expected return for the stock in question would be

the dependent variable Y, while the independent variable X would be the market

risk premium.

Additional variables such as the market capitalization of a stock, valuation ratios

and recent returns can be added to the CAPM model to get better estimates for

returns. These additional factors are known as the Fama-French factors, named

after the professors who developed the multiple linear regression model to better
explain asset returns.

Regression Analysis

In statistical modeling, regression analysis is a set of statistical processes

for estimating the relationships among variables. It includes many techniques for

modeling and analyzing several variables, when the focus is on the relationship

between a dependent variable and one or more independent variables (or

'predictors'). More specifically, regression analysis helps one understand how the

typical value of the dependent variable (or 'criterion variable') changes when any

one of the independent variables is varied, while the other independent variables

are held fixed.

Regression analysis is widely used for prediction and forecasting, where its
use has substantial overlap with the field of machine learning. Regression

analysis is also used to understand which among the independent variables are
related to the dependent variable, and to explore the forms of these relationships.

In restricted circumstances, regression analysis can be used to infer causal

relationships between the independent and dependent variables. However this

can lead to illusions or false relationships, so caution is advisable; for example,

correlation does not prove causation.


Classical assumptions for regression analysis include:

The sample is representative of the population for the inference prediction.

The error is a random variable with a mean of zero conditional on the explanatory

variables.

The independent variables are measured with no error. (Note: if this is not so,

modeling may be done instead using errors-in-variables model techniques.)

The independent variables (predictors) are linearly independent, i.e. it is not

possible to express any predictor as a linear combination of the others.

The errors are uncorrelated, that is, the variance–covariance matrix of the errors is diagonal and each non-zero element is the variance of the error.

The variance of the error is constant across observations (homoscedasticity). If


not, weighted least squares or other methods might instead be used.

These are sufficient conditions for the least-squares estimator to possess

desirable properties; in particular, these assumptions imply that the parameter

estimates will be unbiased, consistent, and efficient in the class of linear unbiased

estimators. It is important to note that actual data rarely satisfies the assumptions.

That is, the method is used even though the assumptions are not true. Variation

from the assumptions can sometimes be used as a measure of how far the model

is from being useful. Many of these assumptions may be relaxed in more advanced

treatments. Reports of statistical analyses usually include analyses of tests on the

sample data and methodology for the fit and usefulness of the model.

Independent and dependent variables often refer to values measured at


point locations. There may be spatial trends and spatial autocorrelation in

the variables that violate statistical assumptions of regression. Geographically weighted regression is one technique to deal with such data. Also, variables may

include values aggregated by areas. With aggregated data the modifiable areal unit problem can cause extreme variation in regression parameters. When analyzing data aggregated by political boundaries, postal codes or census areas, results may be very distinct with a different choice of units.


Interpolation and Extrapolation

Regression models predict a value of the Y variable given known values of

the X variables. Prediction within the range of values in the dataset used for model fitting is known informally as interpolation. Prediction outside this range of the data

is known as extrapolation. Performing extrapolation relies strongly on the

regression assumptions. The further the extrapolation goes outside the data, the

more room there is for the model to fail due to differences between the

assumptions and the sample data or the true values.

It is generally advised that when performing extrapolation, one should accompany

the estimated value of the dependent variable with a prediction interval that

represents the uncertainty. Such intervals tend to expand rapidly as the values of the independent variable(s) move outside the range covered by the observed data.

However, this does not cover the full set of modeling errors that may be

made: in particular, the assumption of a particular form for the relation between Y

and X. A properly conducted regression analysis will include an assessment of how

well the assumed form is matched by the observed data, but it can only do so

within the range of values of the independent variables actually available. This

means that any extrapolation is particularly reliant on the assumptions being made

about the structural form of the regression relationship. Best-practice advice here is that a linear-in-variables and linear-in-parameters relationship should not be chosen simply for computational convenience, but that
all available knowledge should be deployed in constructing a regression

model. If this knowledge includes the fact that the dependent variable cannot go
outside a certain range of values, this can be made use of in selecting the model –

even if the observed dataset has no values particularly near such bounds. The

implications of this step of choosing an appropriate functional form for the

regression can be great when extrapolation is considered. At a minimum, it can

ensure that any extrapolation arising from a fitted model is "realistic" (or in accord
with what is known).
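To illustrate the danger, the sketch below fits a straight line to data generated from a quadratic relation; inside the fitted range the error is modest, while far outside it the linear extrapolation fails badly (all values invented for illustration):

    xs = list(range(0, 11))            # model fitted on x in [0, 10]
    ys = [x * x for x in xs]           # true relation: y = x^2

    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx

    predict = lambda x: a + b * x
    print("inside range,  x = 5 :", predict(5), "vs true", 25)    # 35 vs 25
    print("outside range, x = 20:", predict(20), "vs true", 400)  # 185 vs 400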

Example 1

Compute the coefficient of correlation between X (Advertisement Expenditure) and Y (Sales).

X : 10 12 18 8 13 20 22 15 5 17

Y : 88 90 94 86 87 92 96 94 88 85

Solution:

The values of X and Y are small, so the following raw-totals formula is used instead of the one used in the previous example.

X       Y       XY      X²      Y²
10      88      880     100     7744
12      90      1080    144     8100
18      94      1692    324     8836
8       86      688     64      7396
13      87      1131    169     7569
20      92      1840    400     8464
22      96      2112    484     9216
15      94      1410    225     8836
5       88      440     25      7744
17      85      1445    289     7225
ƩX = 140   ƩY = 900   ƩXY = 12718   ƩX² = 2224   ƩY² = 81130

Correlation coefficient,

r = (NƩXY − ƩX·ƩY) / √[(NƩX² − (ƩX)²)(NƩY² − (ƩY)²)]
  = (10 × 12718 − 140 × 900) / √[(10 × 2224 − 140²)(10 × 81130 − 900²)]
  = 1180 / √(2640 × 1300)
  = 0.6370
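This arithmetic can be checked with a short Python sketch (a verification aid, not part of the original solution):

    from math import sqrt

    X = [10, 12, 18, 8, 13, 20, 22, 15, 5, 17]
    Y = [88, 90, 94, 86, 87, 92, 96, 94, 88, 85]
    N = len(X)

    # totals used in the product-moment formula
    sx, sy = sum(X), sum(Y)
    sxy = sum(x * y for x, y in zip(X, Y))
    sxx = sum(x * x for x in X)
    syy = sum(y * y for y in Y)

    r = (N * sxy - sx * sy) / sqrt((N * sxx - sx ** 2) * (N * syy - sy ** 2))
    print(round(r, 4))   # 0.637, i.e. 0.6370 as in the text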

Example 2

Calculate product moment correlation coefficient from the following bivariate

frequency table

        Y      1      3      5
  X
 -1            1      1      4
  0            3      7      1
  2            6      2      0

Solution

X and Y values are small and hence the following formula is used:

r = (NƩfXY − ƩfX·ƩfY) / √[(NƩfX² − (ƩfX)²)(NƩfY² − (ƩfY)²)]

      Y        1      3      5    Total (f)    fX       fX²       fXY
  X
 -1  f         1      1      4       6         -6        6        -24
     fXY      -1     -3    -20
  0  f         3      7      1      11          0        0          0
     fXY       0      0      0
  2  f         6      2      0       8         16       32         24
     fXY      12     12      0
 Total (f)    10     10      5     N = 25   ƩfX = 10  ƩfX² = 38  ƩfXY = 0
 fY           10     30     25    ƩfY = 65
 fY²          10     90    125    ƩfY² = 225
 fXY          11      9    -20    ƩfXY = 0

r = (25 × 0 − 10 × 65) / √[(25 × 38 − 10²)(25 × 225 − 65²)]
  = −650 / √(850 × 1400)
  = −0.5959

Note: fXY values have to be found by multiplying each cell frequency f, the X value for the row and the Y value for the column.

Example 3

The following table gives the frequency, according to age group, of marks

obtained by 67 students in an intelligence test. Measure the degree of relationship between age and intelligence.

Test marks          Age in years
                18     19     20     21
200-250          4      4      2      1
250-300          3      5      4      2
300-350          2      6      8      5
350-400          1      4      6     10

Solution

Let X be test marks. Corresponding to the mid values 225, 275, 325, and 375, the u values are −1, 0, 1, and 2 respectively, where u = (X − a)/c with a = 275 and c = 50. Let Y be age. The v values are −1, 0, 1, and 2, where v = (Y − b)/d with b = 19 and d = 1.

fuv values have to be found by multiplying each cell frequency f, the u value for the row and the v value for the column.


      v       -1      0      1      2    Total (f)    fu       fu²       fuv
  u
 -1  f         4      4      2      1      11        -11       11         0
     fuv       4      0     -2     -2
  0  f         3      5      4      2      14          0        0         0
     fuv       0      0      0      0
  1  f         2      6      8      5      21         21       21        16
     fuv      -2      0      8     10
  2  f         1      4      6     10      21         42       84        50
     fuv      -2      0     12     40
 Total (f)    10     19     20     18    N = 67   Ʃfu = 52  Ʃfu² = 116  Ʃfuv = 66
 fv          -10      0     20     36   Ʃfv = 46
 fv²          10      0     20     72   Ʃfv² = 102
 fuv           0      0     18     48   Ʃfuv = 66

r = (NƩfuv − Ʃfu·Ʃfv) / √[(NƩfu² − (Ʃfu)²)(NƩfv² − (Ʃfv)²)]
  = (67 × 66 − 52 × 46) / √[(67 × 116 − 52²)(67 × 102 − 46²)]
  = 2030 / √(5068 × 4718)
  = 0.4151
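A quick Python check of this result, reusing the step-deviation totals from the table above:

    from math import sqrt

    # totals taken from the solution table
    N, su, sv = 67, 52, 46
    suu, svv, suv = 116, 102, 66

    r = (N * suv - su * sv) / sqrt((N * suu - su ** 2) * (N * svv - sv ** 2))
    print(round(r, 4))   # 0.4151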

Example 4
With the following data on 6 cities, calculate the coefficient of correlation by Pearson's method between the density of population and the death rate.

Cities    Area in sq. miles    Population in '000    No. of deaths
A               150                    30                  300
B               180                    90                 1440
C               100                    40                  560
D                60                    42                  840
E               120                    72                 1224
F                80                    24                  312

Solution:

Density is population per unit area and the death rate is the number of deaths per 1,000 people.

For City A, Density = 30,000/150 = 200 and

Death rate = (300/30,000) × 1,000 = 10

City   Area   Pop. in '000   No. of deaths   Density X   Death rate Y = v   u = (X − a)/c, a = 400, c = 100     uv      u²      v²
A       150        30             300           200            10                   -2                          -20      4      100
B       180        90            1440           500            16                    1                           16      1      256
C       100        40             560           400            14                    0                            0      0      196
D        60        42             840           700            20                    3                           60      9      400
E       120        72            1224           600            17                    2                           34      4      289
F        80        24             312           300            13                   -1                          -13      1      169
Total    --        --              --            --         Ʃv = 90               Ʃu = 3                      Ʃuv = 77  Ʃu² = 19  Ʃv² = 1410

r = (NƩuv − Ʃu·Ʃv) / √[(NƩu² − (Ʃu)²)(NƩv² − (Ʃv)²)]
  = (6 × 77 − 3 × 90) / √[(6 × 19 − 3²)(6 × 1410 − 90²)]
  = 192 / √(105 × 360)
  = 0.9875

Note: When b = 0 and d = 1, v = (Y − b)/d gives v = Y.

Example 5
Calculate the coefficient of correlation between expenditure on advertisement

in Rs.’0 0 0 (X) and Sales in Rs. Lakhs (Y) after allowing a time lag of two months.

Mon. Jan. Feb. Mar. Apr. May June July Aug. Sep. Oct.

X 40 45 47 50 53 60 57 51 48 45

Y 75 69 65 64 70 71 75 83 90 92

Solution:

As a time lag of two months is to be allowed, the following pairs of values are

available.

X       Y       XY      X²      Y²
40      65      2600    1600    4225
45      64      2880    2025    4096
47      70      3290    2209    4900
50      71      3550    2500    5041
53      75      3975    2809    5625
60      83      4980    3600    6889
57      90      5130    3249    8100
51      92      4692    2601    8464
ƩX = 403   ƩY = 610   ƩXY = 31097   ƩX² = 20593   ƩY² = 47340

r = (NƩXY − ƩX·ƩY) / √[(NƩX² − (ƩX)²)(NƩY² − (ƩY)²)]
  = (8 × 31097 − 403 × 610) / √[(8 × 20593 − 403²)(8 × 47340 − 610²)]
  = 2946 / √(2335 × 6620)
  = 0.7493

Example 6

From the following data compute the coefficient of correlation between X and

Y.

                                                          X       Y
Sum of squares of deviations from the arithmetic mean   8250     724
Sum of products of deviations of X and Y
from their respective means                                 2350
No. of pairs of observations                                  10

Solution:

Given: Ʃx² = Ʃ(X − X̅)² = 8250;
Ʃy² = Ʃ(Y − Y̅)² = 724;
Ʃxy = Ʃ(X − X̅)(Y − Y̅) = 2350;
N = 10.

r = Ʃxy / √(Ʃx² × Ʃy²) = 2350 / √(8250 × 724) = 2350 / 2443.97 = 0.9615

Example 7

Calculate correlation coefficient from the following results.


N = 10, ƩX = 140, ƩY = 150, Ʃ(X − 10)² = 180, Ʃ(Y − 15)² = 215, Ʃ(X − 10)(Y − 15) = 60

Solution

Let u = X − a, where a = 10 and c = 1.

v = Y − b, where b = 15 and d = 1.

Ʃu = Ʃ(X − a) = ƩX − Na = 140 − 10 × 10 = 40

Ʃv = Ʃ(Y − b) = ƩY − Nb = 150 − 10 × 15 = 0

Given: Ʃu² = 180; Ʃv² = 215; Ʃuv = 60.

Hence, r = (NƩuv − Ʃu·Ʃv) / √[(NƩu² − (Ʃu)²)(NƩv² − (Ʃv)²)]
         = (10 × 60 − 40 × 0) / √[(10 × 180 − 40²)(10 × 215 − 0²)]
         = 600 / √(200 × 2150)
         = 0.9150

Note: ƩX², ƩY² and ƩXY can be found to be 1980, 2465 and 2160.

Then, r = (NƩXY − ƩX·ƩY) / √[(NƩX² − (ƩX)²)(NƩY² − (ƩY)²)]

also gives r = 0.9150, but it is tedious.


Unit IV

TIME SERIES ANALYSIS


Time Series is a sequence of well- defined data points measured at

consistent time intervals over a period of time. Data collected on an ad hoc basis

or irregularly does not form a time series. Time series analysis is the use of

statistical methods to analyze time series data and extract meaningful statistics

and characteristics about the data.


Time series analysis helps us understand the underlying forces leading to a particular trend in the time series data points, and helps us in forecasting and monitoring the data points by fitting appropriate models to them.

Historically speaking, time series analysis has been around for centuries, and its evidence can be seen in the field of astronomy, where it was used to study the movements of the planets and the sun in ancient ages. Today, it is used in

practically every sphere around us –from day to day business issues (say monthly

sales of a product or daily closing value of NASDAQ) to complicated scientific

research and studies (evolution or seasonal changes).

Benefits and Applications of Time Series Analysis

Time series analysis aims to achieve various objectives and the tools and
models used vary accordingly. The various types of time series analysis include –

Descriptive analysis - to determine the trend or pattern in a time series using

graphs or other tools. This helps us identify cyclic patterns, overall trends, turning

points and outliers.

Spectral analysis - also referred to as frequency domain analysis, it aims to separate

periodic or cyclical components in a time series. For example, identifying cyclical

changes in sales of a product.

Forecasting - used extensively in business forecasting, budgeting, etc., based on historical trends.

Intervention analysis - is used to determine if an event can lead to a change

in the time series, for example, an employee’s level of performance has improved or
not after an intervention in the form of training –to determine the

effectiveness of the training program.


Explanative analysis - studies the cross correlation or relationship between two

time series and the dependence of one on another. For example the study of

employee turnover data and employee training data to determine if there is any

dependence of employee training programs on employee turnover rates over time.

The biggest advantage of using time series analysis is that it can be used to
understand the past as well as predict the future. Further, time series analysis is

based on past data plotted against time which is rather readily available in most

areas of study.

For instance, a financial services provider may want to predict future gold

price movements for its clients. It can use historically available data to conduct

Time series analysis and forecast the gold rates for a certain future period.

There are various other practical applications of time series analysis

including economic forecasting, census analysis and yield projections. Further, it is

used by investment analysts and consultants for stock market analysis and

portfolio management. Business managers use time series analysis on a regular

basis for sales forecasting, budgetary analysis, inventory management and quality
control.

Terms and concepts:

Dependence: Dependence refers to the association of two observations

with the same variable, at prior time points.

Stationarity: Shows the mean value of the series that remains constant over

a time period; if past effects accumulate and the values increase toward infinity,

then stationarity is not met.

Differencing: Used to make the series stationary, to de-trend, and to control the auto-correlations; however, some time series analyses do not require differencing, and over-differenced series can produce inaccurate estimates.

Specification: May involve the testing of the linear or non-linear relationships of dependent variables by using models such as ARIMA, ARCH, GARCH, VAR, co-integration, etc.


Exponential smoothing in time series analysis: This method predicts the next period's value based on past and current values. It involves averaging the data in such a way that the nonsystematic components of the individual observations cancel each other out. The exponential smoothing method is used for short-term prediction. Alpha, Gamma, Phi, and Delta are the parameters that estimate the effect of the time series data. Alpha is used when seasonality is not present in the data. Gamma is used when a series has a trend in the data. Delta is used when seasonality cycles are present in the data. A model is applied according to the pattern of the data.
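As a minimal sketch of the simplest case (a single smoothing parameter Alpha, no trend or seasonality), the following Python code uses an assumed alpha and made-up demand figures:

    def exp_smooth(series, alpha=0.3):
        # next forecast = alpha * latest actual + (1 - alpha) * previous forecast
        forecast = [series[0]]                   # seed with the first observation
        for y in series[:-1]:
            forecast.append(alpha * y + (1 - alpha) * forecast[-1])
        return forecast

    demand = [120, 132, 128, 140, 150, 147]      # hypothetical monthly demand
    print([round(f, 1) for f in exp_smooth(demand)])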

Curve fitting in time series analysis: Curve fitting regression is used when the data are in a non-linear relationship, i.e. when the trend follows a curve, for example a polynomial such as y = a + bx + cx², rather than a straight line.

The Components of Time Series

The factors that are responsible for bringing about changes in a time series,

also called the components of time series, are as follows:

Secular Trends (or General Trends)

Seasonal Movements
Cyclical Movements

Irregular Fluctuations

Secular Trends

The secular trend is the main component of a time series which results from

long term effects of socio-economic and political factors. This trend may show the

growth or decline in a time series over a long period. This is the type of tendency

which continues to persist for a very long period. Prices and export and import

data, for example, reflect obviously increasing tendencies over time.

Seasonal Trends

These are short term movements occurring in data due to seasonal factors.

The short term is generally considered as a period in which changes occur in a


time series with variations in weather or festivities. For example, it is

commonly observed that the consumption of ice-cream during summer is generally high and hence an ice-cream dealer's sales would be higher in some

months of the year while relatively lower during winter months. Employment,

output, exports, etc., are subject to change due to variations in weather. Similarly,

the sale of garments, umbrellas, greeting cards and fireworks are subject to large

variations during festivals like Valentine’s Day, Eid, Christmas, New Year's, etc.
These types of variations in a time series are isolated only when the series is

provided biannually, quarterly or monthly.

Cyclic Movements

These are long term oscillations occurring in a time series. These oscillations

are mostly observed in economics data and the periods of such oscillations are

generally extended from five to twelve years or more. These oscillations are

associated with the well known business cycles. These cyclic movements can be

studied provided a long series of measurements, free from irregular fluctuations, is

available.

Irregular Fluctuations

These are sudden changes occurring in a time series which are unlikely to
be repeated. They are components of a time series which cannot be explained by

trends, seasonal or cyclic movements. These variations are sometimes called

residual or random components. These variations, though accidental in nature, can

cause a continual change in the trends, seasonal and cyclical oscillations during

the forthcoming period. Floods, fires, earthquakes, revolutions, epidemics, strikes

etc., are the root causes of such irregularities.

Analysing the Secular Trend

A number of different methods are available to estimate the trend; however,

the suitability of these methods largely depends on the nature of the data and the

purpose of the analysis. To measure a trend which can be represented as a


straight line or some type of smooth curve, the following are the commonly

employed methods:
(1) Freehand smooth curves

(2) Semi-average method

(3) Moving average method

(4) Mathematical curve fitting

Generally speaking, when the time series is available for a short span of time
in which seasonal variation might be important, the freehand and semi-average

methods are employed. If the available series is spread over a long time span and

has annual data where long term cycles might be important, the moving average

method and the mathematical curve fitting are generally employed.

Method of the Free-Hand Curve

This is a familiar concept, and is briefly described for drawing frequency

curves. In case of a time series a scatter diagram of the given observations is

plotted against time on the horizontal axis and a freehand smooth curve is drawn

through the plotted points. The curve is drawn so that most of the points concentrate around it; however, smoothness should not be sacrificed in trying to let the points fall exactly on the curve. It is better to draw a straight line through the plotted points instead of a curve, if possible.

One of the major disadvantages of this method is that different individuals

draw curves or lines that differ in slope and intercept, and hence no two

conclusions are identical. However, it is the most simple and quickest method of

isolating the trend. This method is generally employed in situations where the

scatter diagram of the original data conforms to some well define trends.

Advantages

This method is very simple and easy to understand. It is applicable to linear

and non- linear trends. It gives us an idea about the rise and fall of the time series.

For every long time series, the graph of the original data enables us to decide on

the application of more mathematical models for the measurement of a trend.


Monthly data from 5 years has 60 values. A graph of these values may

suggest that the trend is linear for the first two years (24 values) and for the next
3 years, it is non- linear. We accordingly apply the linear approach to the first 24

values and the curvilinear technique to the next 36 values.

Disadvantages

This method is not mathematical in nature, so different people may draw a

different trend. The method does not appeal to the common man because it
seems rough and crude.

Method of Semi-Averages

This method is as simple and relatively objective as the free hand method.

The data is divided in two equal halves and the arithmetic mean of the two sets of

values of Y is plotted against the center of the relative time span. If the number of

observations is even, the division into halves will be straightforward; however, if the number of observations is odd, then the middle-most item, i.e., the ((n + 1)/2)th, is dropped.

The two points so obtained are joined through a straight line which shows the

trend. The trend values of Y, i.e., Yˆ , can then be read from the graph

corresponding to each time period.

Since the arithmetic mean is greatly affected by extreme values, it is subject to misleading values, and hence the trend obtained by plotting the means might be distorted. However, if extreme values are not apparent, this method may be successfully employed. The estimation of trends using the above two methods is illustrated in the practical exercises later in this unit.

Advantages

This method is very simple and easy to understand, and also it does not

require many calculations.

Disadvantages

The method is used only when the trend is linear or almost linear. For non-

linear trends this method is not applicable. It is used for the calculation of

averages, and averages are affected by extreme values. Thus if there is some very
large value or very small value in the time series, that extreme value should

either be omitted or this method should not be applied. We can also write the
equation of the trend line.

Method of Moving Averages

Suppose that there are n time periods denoted by t1,t2,t3,…,tn and the

corresponding values of the Y variable are Y1,Y2,Y3,…,Yn. First of all we have to

decide the period of the moving averages. For a short time series we use a period
of 3 or 4 values, and for a long time series the period may be 7, 10 or more. For a

quarterly time series we always calculate averages taking 4- quarters at a time,

and in a monthly time series, 12- monthly moving averages are calculated. Suppose

the given time series is in years and we have decided to calculate 3- year moving

averages. The moving averages denoted by a1,a2,a3,…,an−2 are calculated as

below:

The average of the first 3 values is (Y1 + Y2 + Y3)/3 and is denoted by a1. It is written against the middle year t2. We leave out the first value Y1 and calculate the average for the next three values. This average is a2 = (Y2 + Y3 + Y4)/3 and is written against the middle year t3. The process is carried out to calculate the remaining moving averages.
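A short sketch of this procedure for a 3-period moving average; the series below is illustrative:

    def moving_average(ys, k=3):
        # average of each run of k consecutive values, written against the middle period
        return [sum(ys[i:i + k]) / k for i in range(len(ys) - k + 1)]

    Y = [20, 22, 25, 26, 25, 27, 30]     # illustrative annual values
    print(moving_average(Y))             # first average belongs to the 2nd period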
Advantages

Moving averages can be used for measuring the trend of any series. This

method is applicable to linear as well as non- linear trends.

Disadvantages

The trend obtained by moving averages generally is neither a straight line

nor a standard curve. For this reason the trend cannot be extended for

forecasting future values. Trend values are not available for some periods at the

start and some values at the end of the time series. This method is not applicable

to short time series.

Seasonal Component Multiplicative Model

Using the multiplicative model, i.e. Y = T × S × R, the ratio-detrended series may be obtained by dividing the actual observations by the corresponding trend values:

Y / T = S × R

The remainder now consists of the seasonal and the residual components. The seasonal component may be isolated from the ratio-detrended series by averaging the detrended ratios for each month or quarter. The adjusted seasonal totals are, however, obtained by multiplying the seasonal totals by the following adjustment factor:

Adjustment Factor = Total Number of Observations / Sum of Detrended Ratios

These adjusted seasonal totals are then averaged over the number of detrended ratios in each quarter or month. The obtained averages represent the seasonal component. After having determined the seasonal component S, the de-seasonalised series may be obtained by dividing the actual observations Y by the corresponding seasonal component. The de-seasonalised series so obtained contains the trend and the residual, for

Y / S = T × R

The residual component may now be separated by a further division of the de-seasonalised series by the trend, for

Y / (S × T) = R

The entire analysis described above may be briefly summarized by these three ratios.
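A hedged sketch of these steps on made-up quarterly data, taking a centred 4-quarter moving average as the trend T (a common choice, assumed here rather than prescribed by the text):

    Y = [10, 20, 30, 20, 12, 24, 36, 24]          # hypothetical quarterly series

    # Centred 4-quarter moving average as the trend T (undefined at the ends)
    ma4 = [sum(Y[i:i + 4]) / 4 for i in range(len(Y) - 3)]
    T = [None, None] + [(a + b) / 2 for a, b in zip(ma4, ma4[1:])] + [None, None]

    # Detrend: Y / T = S x R, then average the ratios quarter by quarter to isolate S
    ratios = {}
    for i, (y, t) in enumerate(zip(Y, T)):
        if t:
            ratios.setdefault(i % 4, []).append(y / t)
    S = {q: sum(r) / len(r) for q, r in ratios.items()}

    print(S)   # seasonal indices; the deseasonalised series would be Y[i] / S[i % 4]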

Least Squares Method

The least squares method is a form of mathematical regression analysis that

finds the line of best fit for a dataset, providing a visual demonstration of the

relationship between the data points. Each point of data is representative of the

relationship between a known independent variable and an unknown dependent

variable.


The least squares method provides the overall rationale for the placement of

the line of best fit among the data points being studied. The most common
application of the least squares method, referred to as linear or ordinary least squares, aims to create a straight line that minimizes the sum of the squares of the errors
generated by the results of the associated equations, such as the squared

residuals resulting from differences in the observed value and the value

anticipated based on the model.

This method of regression analysis begins with a set of data points to be

graphed. An analyst using the least squares method will seek a line of best fit that
explains the potential relationship between an independent variable and a

dependent variable. In regression analysis, dependent variables are designated on

the vertical Y axis and independent variables are designated on the horizontal X-

axis. These designations will form the equation for the line of best fit, which is

determined from the least squares method.

Example of Least Squares Method

For example, an analyst may want to test the relationship between a

company’s stock returns and the returns of the index for which the stock is a

component. In this example, the analyst seeks to test the dependence of the stock

returns on the index returns. To do this, all of the returns are plotted on a chart.

The index returns are then designated as the independent variable, and the stock
returns are the dependent variable. The line of best fit provides the analyst with

coefficients explaining the level of dependence.

Line of Best Fit Equation

The line of best fit determined from the least squares method has an

equation that tells the story of the relationship between the data points. Computer

software models are used to determine the line of best fit equation, and these

software models include a summary of outputs for analysis. The least squares

method can be used for determining the line of best fit in any regression analysis.

The coefficients and summary outputs explain the dependence of the variables

being tested.

Interpolation
In the mathematical field of numerical analysis, interpolation is a method of

constructing new data points within the range of a discrete set of known data
points.

In engineering and science, one often has a number of data points,

obtained by sampling or experimentation, which represent the values of a function

for a limited number of values of the independent variable. It is often required to

interpolate, i.e., estimate the value of that function for an intermediate value of the
independent variable.

A closely related problem is the approximation of a complicated function by

a simple function. Suppose the formula for some given function is known, but too

complicated to evaluate efficiently. A few data points from the original function can

be interpolated to produce a simpler function which is still fairly close to the

original. The resulting gain in simplicity may outweigh the loss from interpolation

error.

Interpolation is an estimation of a value within two known values in a

sequence of values. Polynomial interpolation is a method of estimating values

between known data points. When graphical data contains a gap, but data is

available on either side of the gap or at a few specific points within the gap,
interpolation allows us to estimate the values within the gap.

Newton’s Method

In the mathematical field of numerical analysis, a Newton polynomial, named

after its inventor Isaac Newton, is an interpolation polynomial for a

given set of data points. The Newton polynomial is sometimes called Newton's

divided differences interpolation polynomial because the coefficients of the

polynomial are calculated using Newton's divided difference method.

As with other difference formulas, the degree of a Newton interpolating

polynomial can be increased by adding more terms and points without discarding

existing ones. Newton's form has the simplicity that the new points are always

added at one end: Newton's forward formula can add new points to the right, and
Newton's backward formula can add new points to the left.

The accuracy of polynomial interpolation depends on how close the


interpolated point is to the middle of the x values of the set of points used.

Obviously, as new points are added at one end, that middle becomes farther and

farther from the first data point. Therefore, if it isn't known how many points will be

needed for the desired accuracy, the middle of the x- values might be far from

where the interpolation is done.


Gauss's formula alternately adds new points at the left and right ends,

thereby keeping the set of points centred near the same place (near the evaluated

point). When doing so, it uses terms from Newton's formula, with data points and x

values renamed in keeping with one's choice of what data point is designated as

the x0 data point.

Stirling's formula remains centred about a particular data point, for use

when the evaluated point is nearer to a data point than to a middle of two data

points.

Bessel's formula remains centred about a particular middle between two data

points, for use when the evaluated point is nearer to a middle than to a data point.

Bessel and Stirling achieve that by sometimes using the average of two
differences, and sometimes using the average of two products of binomials in x,

where Newton's or Gauss's would use just one difference or product. Stirling's

uses an average difference in odd- degree terms (whose difference uses an even

number of data points); Bessel's uses an average difference in even- degree terms

(whose difference uses an odd number of data points).

Here are the formulas:

Gregory-Newton or Newton Forward Difference Interpolation

P(x0 + hs) = f0 + s∆f0 + [s(s − 1)/2!]∆²f0 + … + [s(s − 1)(s − 2)…(s − n + 1)/n!]∆ⁿf0

where

s = (x − x0)/h;   f0 = f(x0);   ∆ᵏfᵢ = Ʃ_{j=0}^{k} (−1)^j [k!/(j!(k − j)!)] f_{i+k−j}

Gregory-Newton or Newton Backward Difference Interpolation

P(xn + hs) = fn + s∇fn + [s(s + 1)/2!]∇²fn + … + [s(s + 1)(s + 2)…(s + n − 1)/n!]∇ⁿfn

where

s = (x − xn)/h;   fn = f(xn);   ∇ᵏfᵢ = Ʃ_{j=0}^{k} (−1)^j [k!/(j!(k − j)!)] f_{i−j}


A finite difference is a mathematical expression of the form f(x + b) − f(x + a).

If a finite difference is divided by b − a, one gets a difference quotient. The


approximation of derivatives by finite differences plays a central role in finite

difference methods for the numerical solution of differential equations, especially

boundary value problems.

Certain recurrence relations can be written as difference equations by

replacing iteration notation with finite differences. Today, the term "finite

difference" is often taken as synonymous with finite difference approximations of

derivatives, especially in the context of numerical methods. Finite difference

approximations are finite difference quotients in the terminology employed above.

Finite differences have also been the topic of study as abstract self-

standing mathematical objects, e.g. in works by George Boole (1860 ), L. M. Milne-

Thomson (1933), and Károly Jordan (1939), tracing its origins back to one of Jost
Bürgi's algorithms (ca. 1592) and others including Isaac Newton. In this viewpoint,

the formal calculus of finite differences is an alternative to the calculus of

infinitesimals.

Practical Exercises

Problem.

Draw the trend line by the graphic method and estimate the production in 2003.

Year          1995   1996   1997   1998   1999   2000   2001
Production      20     22     25     26     25     27     30

Solution:

Year is represented on the x-axis. Production is represented on the y-axis. The points (1995, 20), (1996, 22), (1997, 25), (1998, 26), (1999, 25), (2000, 27) and (2001, 30) are marked on a graph sheet.

1. Method of Semi-Averages:

Practical Exercises

Problem.

The sales in tonnes of a commodity varied from 1990 to 2001 as under:

280  300  280  280  270  240  230  230  220  200  210  200

Fit a trend line by the method of semi-averages. Estimate the sales in 2002.

Solution:

Year    Sales in tonnes    Middle-most year    Mean sales
1990         280
1991         300
1992         280                1992.5         1650/6 = 275.0
1993         280
1994         270
1995         240
1996         230
1997         230
1998         220                1998.5         1290/6 = 215.0
1999         200
2000         210
2001         200
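The two semi-averages define the trend line, and extending that line gives the required estimate; a small sketch of the arithmetic (the linear-trend assumption is the method's own):

    x1, y1 = 1992.5, 275.0        # centre of the first half, mean sales
    x2, y2 = 1998.5, 215.0        # centre of the second half, mean sales

    slope = (y2 - y1) / (x2 - x1)             # -10 tonnes per year
    def trend(year):
        return y1 + slope * (year - x1)

    print(trend(2002))                        # estimated sales in 2002 = 180.0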

2. Method of Moving Averages

Practical Exercises

Problem.

Calculate the 5-yearly moving averages of the number of students in a commerce college, as shown by the following figures:

Year    No. of students        Year    No. of students
1987         332               1992         405
1988         311               1993         410
1989         357               1994         427
1990         392               1995         405
1991         402               1996         438

Solution

Year    No. of students    5-yearly moving totals    5-yearly moving averages
1987         332                     -                          -
1988         311                     -                          -
1989         357                   1794                       358.8
1990         392                   1867                       373.4
1991         402                   1966                       393.2
1992         405                   2036                       407.2
1993         410                   2049                       409.8
1994         427                   2085                       417.0
1995         405                     -                          -
1996         438                     -                          -

3. Method of least squares:

Practical Exercises

Problem.

Fit a straight-line trend equation to the following data by the method of least squares and estimate the value of sales for the year 1985.

Year              1979   1980   1981   1982   1983
Sales (in Rs.)     100    120    140    160    180

Solution

Let y = a + bX be the equation of the trend line, where X is the year and y the sales. As the X values are large, take x = X − 1981, i.e. deviations from the middle year.

Let the resulting equation be y = A + Bx. For finding the values of A and B, the normal equations are

∑y = NA + B∑x
∑xy = A∑x + B∑x²

Year    Sales (in Rs.) y    x = Year − 1981      xy      x²    Trend Yt
1979         100                  −2            −200      4       100
1980         120                  −1            −120      1       120
1981         140                   0               0      0       140
1982         160                   1             160      1       160
1983         180                   2             360      4       180
Total     ∑y = 700             ∑x = 0       ∑xy = 200  ∑x² = 10  ∑Yt = 700

Since ∑x = 0, A = ∑y/N = 700/5 = 140 and B = ∑xy/∑x² = 200/10 = 20. The trend equation is therefore y = 140 + 20x, with x = year − 1981. For 1985, x = 4, so the estimated sales are 140 + 20 × 4 = Rs. 220.
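The fit and the 1985 estimate can be verified with a few lines of Python (a checking aid, not part of the prescribed solution):

    years = [1979, 1980, 1981, 1982, 1983]
    sales = [100, 120, 140, 160, 180]

    x = [yr - 1981 for yr in years]      # deviations from the middle year, so sum(x) == 0
    A = sum(sales) / len(sales)          # 700 / 5 = 140
    B = sum(xi * yi for xi, yi in zip(x, sales)) / sum(xi * xi for xi in x)   # 200/10 = 20

    print(f"trend: y = {A:.0f} + {B:.0f}x")
    print("estimate for 1985:", A + B * (1985 - 1981))   # 140 + 20*4 = 220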

Interpolation

Assumption:

The function is assumed to increase or decrease steadily; there is no jump. The basis of the interpolation formulae is that the function is a polynomial of the relevant degree in x. When (n + 1) pairs of values are known, the function is assumed to be a polynomial of degree n. It is of the form

y = a0 + a1x + a2x² + … + anxⁿ

For example, for 4 pairs of values, y = a0 + a1x + a2x² + a3x³, and for 5 pairs of values, y = a0 + a1x + a2x² + a3x³ + a4x⁴.

Newton’s Method Of Forward Differences

Practical Exercises

Problem.

From the following series, obtain the missing value for the 12th year using Newton's method.

Year      5    10    12    15    20
Price     4    14     ?    24    34

Solution

Arguments 5, 10, 15 and 20 have equal differences. Common difference, h = 5. X = 12 lies in the first half. u = (X − x0)/h = (12 − 5)/5 = 1.4

Year (x)    Price (y)    ∆y                   ∆²y                  ∆³y
x0 = 5      y0 = 4       ∆y0 = 14 − 4 = 10    ∆²y0 = 10 − 10 = 0   ∆³y0 = 0 − 0 = 0
x1 = 10     y1 = 14      ∆y1 = 24 − 14 = 10   ∆²y1 = 10 − 10 = 0
x2 = 15     y2 = 24      ∆y2 = 34 − 24 = 10
x3 = 20     y3 = 34

By Newton's forward difference formula,

Y = y0 + (u/1!)∆y0 + [u(u − 1)/2!]∆²y0 + [u(u − 1)(u − 2)/3!]∆³y0
  = 4 + 1.4 × 10 + [1.4 × 0.4 / 2] × 0 + [1.4 × 0.4 × (−0.6) / 6] × 0
  = 4 + 14 + 0 + 0
  = 18

Price in the 12th year = 18.
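A small Python sketch of the same forward-difference evaluation (the function name newton_forward is ours, for illustration):

    def newton_forward(xs, ys, x):
        h = xs[1] - xs[0]                 # common difference of the arguments
        u = (x - xs[0]) / h
        # build the forward-difference table column by column
        diffs, col = [ys[:]], ys[:]
        while len(col) > 1:
            col = [b - a for a, b in zip(col, col[1:])]
            diffs.append(col)
        term, total = 1.0, 0.0
        for k, column in enumerate(diffs):
            total += term * column[0]     # u(u-1)...(u-k+1)/k! times the k-th difference
            term *= (u - k) / (k + 1)
        return total

    print(newton_forward([5, 10, 15, 20], [4, 14, 24, 34], 12))   # 18.0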

Unit V

INDEX NUMBERS

Index numbers are intended to measure the degree of economic changes

over time. These numbers are values stated as a percentage of a single base
figure. Index numbers are important in economic statistics. In simple terms,

an index (or index number) is a number displaying the level of a variable relative to its level (set equal to 100) in a given base period.

Index numbers are intended to study the change in the effects of such factors

which cannot be measured directly. Bowley stated that "Index numbers are used

to gauge the changes in some quantity which we cannot observe directly". It can

be explained through example in which changes in business activity in a nation are


not capable of direct measurement but it is possible to study relative changes in

business activity by studying the variations in the values of some such factors

which affect business activity, and which are capable of direct measurement.

Index numbers are usually applied as a statistical device to measure the combined fluctuations in a group of related variables. If a statistician or researcher wants to compare the price level of consumer items today with that prevailing ten years ago, they are not interested in comparing the prices of only one item, but in comparing some sort of average price levels (Srivastava, 1989). With the

support of index numbers, the average price of several articles in one year may be

compared with the average price of the same quantity of the same articles in a

number of different years. There are several sources of 'official' statistics that
contain index numbers for quantities such as food prices, clothing prices, housing,

and wages.

Index numbers may be categorized in terms of the variables that they are

planned to measure. In business, different groups of variables in the measurement

of which index number techniques are normally used are price, quantity, value, and

business activity.

Types of Index Numbers

Simple Index Number: A simple index number is a number that measures a

relative change in a single variable with respect to a base. These type of Index

numbers are constructed from a single item only.

Composite Index Number: A composite index number is a number that measures the average relative change in a group of related variables with respect to a base. A composite index number is built from changes in a number of


different items.

Price index Numbers: Price index numbers measure the relative changes in

prices of a commodity between two periods. Prices can be either retail or

wholesale. Price index numbers are useful to comprehend and interpret varying

economic and business conditions over time.


Quantity Index Numbers: These types of index numbers are considered to

measure changes in the physical quantity of goods produced, consumed or sold

of an item or a group of items.

Methods of constructing index numbers: There are two methods to construct

index numbers: Price relative and aggregate methods (Srivastava, 1989).

In aggregate methods, the aggregate price of all items in a given year is expressed as a percentage of the same in the base year, giving the index number.

Relative method: The price of each item in the current year is expressed as a percentage of its price in the base year. This is called a price relative and is expressed by the following formula:

Price relative = (P1 / P0) × 100

In the simple average of relatives method, the current year price is expressed as a price relative of the base year price. These price relatives are then averaged to get the index number. The average used could be the arithmetic mean, the geometric mean or even the median.

Weighted index numbers: These are index numbers in which rational weights are assigned to the various items in an explicit fashion.

Weighted aggregative index numbers: These index numbers are the simple

aggregative type with the fundamental difference that weights are assigned to the

various items included in the index.

Characteristics of index numbers:


Index numbers are specialised averages.

Index numbers measure the change in the level of a phenomenon.


Index numbers measure the effect of changes over a period of time.

Uses of index numbers: Index numbers have practical significance in measuring changes in the cost of living, production trends, trade, and income variations. Index numbers are used to measure changes in the value of money. A

study of the rise or fall in the value of money is essential for determining the
direction of production and employment to facilitate future payments and to know

changes in the real income of different groups of people at different places and
times (Srivastava, 1989). Crowther observed, "By using the technical

device of an index number, it is thus possible to measure changes in different


aspects of the value of money, each particular aspect being relevant to a different

purpose." Basically, index numbers are applied to frame appropriate policies. They

reveal trends and tendencies and Index numbers are beneficial in deflating.

Problems associated with index numbers (Srivastava, 1989):

Choice of the base period.


Choice of an average.

Choice of index.

Selection of commodities.

Data collection.

Price Index Number

Price index, measure of relative price changes, consisting of a series of

numbers arranged so that a comparison between the values for any two periods

or places will show the average change in prices between periods or the average

difference in prices between places. Price indexes were first developed to

measure changes in the cost of living in order to determine the wage increases

necessary to maintain a constant standard of living. They continue to be used


extensively to estimate changes in prices over time and are also used to measure

differences in costs among different areas or countries. See also consumer price

index; wholesale price index.

Some notable price indices include:

Consumer price index

Producer price index

Employment cost index

Export price index

Import price index

GDP deflator

Data
The central problem of price- data collection is to gather a sample of prices

representative of the various price quotations for each of the commodities under
study. Sampling is almost always necessary. The larger and the more complex the

universe of prices to be covered by the index, the more complex the sampling

pattern will have to be. An index of prices paid by consumers in a large and

geographically varied country, for example, ideally should be based on a sample

representative of price changes in different cities and localities, in different types


of outlets (supermarkets, department stores, neighbourhood shops, etc.), and for

different commodities. The number of prices chosen to represent each type of city

(or metropolitan area), type of outlet, and category of commodity would ideally be

proportionate to its relative importance in the expenditures of the nation. Most

price indexes are based on some approximation to such a sampling design.

Once the commodity sample has been chosen, the collection of prices must be

planned so that differences between the prices of any two dates will reflect

changes in price and price alone. Ideally one would collect the prices of exactly

the same items at each date. To this end, commodity prices are sometimes

collected in accordance with detailed specifications such as “wheat, no. 2 red

winter, bulk, carlots, f.o.b. Chicago, spot market price, average of high and low, per
bushel.” If all commodities were as standardized as wheat, the making of price

indexes would be much simpler than it is. In fact, except for a limited range of

goods consisting mainly of primary products, it is very difficult to describe a

product completely enough so that different pricing agents can go into stores and

price an identical item on the basis of description alone. In view of this difficulty,

price- collection agencies sometimes rely upon each respondent, usually a

business firm, to report prices in successive periods for the same variant of a

product (say, men’s shoes); the variant chosen by each respondent may be

different, but valid data will be obtained as long as each provides prices for the

same variant he originally chose. Because a product may vary in quality from one

observation to another, even though it retains the same general specification, the
usual procedure is to avoid the computation of average observed prices for each

commodity for each date. Instead, each price received from each source is
converted to a percentage of the corresponding price reported for the previous

period from the same source. These percentages are called “price relatives.”

Weighting

The next step is to combine the price relatives in such a way that the

movement of the whole group of prices from one period to another is accurately
described. Usually, one begins by averaging the price relatives for the same

specification (e.g., men’s high work shoes, elk upper, Goodyear welt, size range 6

to 11) from different reporters. Sometimes separate averages for each commodity

are calculated for each city, and the city averages are combined.

A more difficult problem arises in combining the price relatives for different

commodities. They must be given different weights, of course, because not all the

commodities for which the prices or price relatives have been obtained are of

equal importance. The price of wheat, for example, should be given more weight in

an index of wholesale prices than the price of pepper. The difficulty is that the

relative importance of commodities changes over time. Some commodities even

drop out of use, while new ones appear, and often an item changes so much in
composition and design that it is doubtful whether it can properly be considered

the same commodity. Under these conditions, the pattern of weights selected can

be accurate in only one of the periods for which the index numbers have been

calculated. The greater the lapse of time between that period and other periods in

the index, the less meaningful the price comparisons become. Price indexes thus

can give relatively accurate measures of price change only for periods close

together in time.

Adjusting for biases

Another problem of price index number construction that cannot be

completely resolved is the problem of quality change. In a dynamic world, the


qualities of goods are continually changing to such a degree that it is

doubtful whether anyone living in an industrialized economy buys many products


that are identical in physical and technical characteristics to those purchased by

his grandfather. There is no fully satisfactory way to handle quality changes. One

way would be to make price comparisons between two periods solely in terms of

goods that are identical in both periods. If one systematically deletes goods that

change in quality, the price index will tend to be biased upward if quality is
improving on the average and downward if it is deteriorating on the average. A

better approach is to attempt to measure the extent to which an observed change

in the quoted price represents a change in quality. It is possible, for example, to

obtain from manufacturers estimates of the increase or decrease in cost of

production entailed in the main changes in automobiles from one model year to the

next. The amount added or subtracted from the cost by the changes can then be

regarded as a measure of the quality change; any change in the quoted price not

accounted for in this way is taken as solely a change in price. The disadvantage of

this method is that it cannot take account of improvements that are not associated

with an increase in costs.

Whether or not a failure to make sufficient allowance for improvements in


the quality of goods causes most price indexes to be biased upward is a matter of

dispute. An expert committee appointed to review the price statistics of the U.S.

government (the Stigler Committee) declared in 1961 that most economists felt

that there were systematic upward biases in the U.S. price indexes on this

account. Because the U.S. indexes are usually thought to be relatively good, this

view would seem to apply by extension to those of most other countries. The

official position of the U.S. Bureau of Labor Statistics has been that errors owing to

quality changes have probably tended to offset each other, at least in its index of

consumer prices.

Another possible source of error in price indexes is that they may be based

on list prices rather than actual transactions prices. List prices probably are
changed less frequently than the actual prices at which goods are sold;

they may represent only an initial base of negotiation, a seller’s asking price rather
than an actual price. One study has shown that actual prices paid by the

purchasing departments of government agencies were lower and were

characterized by more frequent and wider fluctuations than were the prices for

the same products reported for the price index.

So far we have discussed various formulae for construction of weighted &


unweighted index numbers.

However the problem still remains of selecting an appropriate method for

the construction of an index number in a given situation. The following tests can be

applied to find out the adequacy of an index number.

(1) Unit Test

(2) Time Reversal Test

(3) Factor Reversal Test

(4) Circular Test

1. Unit Test - This test requires that the index number formulae should be

independent of the units in which prices or quantities of various commodities are

quoted. For example in a group of commodities, while the price of wheat might be
in kgs., that of vegetable oil may be quoted in per liter & toilet soap may be per

unit.

Except for the simple (unweighted) aggregative index, all other formulae discussed

above satisfy this test.

2. Time Reversal Test - The time reversal test is used to test whether a

given method will work both backwards & forwards with respect to time. The test

is that the formula should give the same ratio between one point of comparison &

another no matter which of the two is taken as base.

The time reversal test may be stated more precisely as follows: if the time subscripts of a price (or quantity) index number formula are interchanged, the resulting price (or quantity) formula should be the reciprocal of the original formula. That is, if P01 is the index for the current year '1' based on the base year '0', and P10 is the index for year '0' based on year '1', the following relation should be satisfied:

P01 × P10 = 1, omitting the factor 100 from both indices.

The methods which satisfy the following test are:-

(1) Simple aggregate index


(2) Simple geometric mean of price relative

(3) Weighted geometric mean of price relative with fixed weights

(4) Kelly’s fixed weight formula

(5) Fisher’s ideal formula

(6) Marshall- Edgeworth formula

3. Factor Reversal Test - An index number formula satisfies this test if the product of the price index and the quantity index gives the true value ratio, omitting the factor 100 from each index. This test is satisfied if the change in price multiplied by the change in quantity is equal to the change in value. Speaking precisely, if the p and q factors in a price (or quantity) index formula are interchanged, so that a quantity (or price) index formula is obtained, the product of the two indices should give the true value ratio.

Symbolically,

P01 × Q01 = Ʃp1q1 / Ʃp0q0 = The True Value Ratio (TVR)

Consider the Laspeyres price index, P01 = Ʃp1q0 / Ʃp0q0. Interchanging p with q gives the quantity index Q01 = Ʃq1p0 / Ʃq0p0. Their product is not, in general, equal to Ʃp1q1 / Ʃp0q0, so the Laspeyres formula does not satisfy the factor reversal test; Fisher's ideal formula does.

4. Circular Test - The circular test is an extension of the time reversal test for more than two periods and is based on the shiftability of the base period. For example, if an index is constructed for the year 2012 with the base of 2011 and another index for 2011 with the base of 2010, then it should be possible for us to directly get an index for the year 2012 with the base of 2010. If the index calculated directly gives a consistent value, the circular test is said to be satisfied.

This test is satisfied if P01 × P12 × P20 = 1.

This test is satisfied only by the following index number formulae:

(1) Simple aggregative index

(2) Simple geometric mean of price relatives

(3) Kelly's fixed base method

When the test is applied to the simple aggregative method, the formula is found to satisfy the circular test; similarly, it is satisfied when applied to the fixed-weight Kelly's method.

Cost of Living Index

Price and index number statistics are important economic statistics associated with the daily lives of individuals. They provide the necessary information to identify the general trend of price movements through the construction of index numbers. The General Authority for Statistics began publishing the price and index numbers of the cost of living bulletins more than 50 years ago, to provide data on consumer prices. Price data are collected from 16 major cities, of which 13 represent the centers of administrative regions, as follows: (Riyadh, Makkah, Madinah, Buraydah, Dammam, Abha, Tabuk, Hail, Arar, Jazan, Najran, Baha, Sakaka), in addition to three cities, namely (Jeddah, Taif, Hofuf), based on the components of the consumer basket of goods and services derived from the 2007 household expenditure & income survey, to provide monthly data and time series on the index number of the cost of living for the purpose of making comparisons and following price developments over time.

There are two methods to compute consumer price index numbers: (1) Aggregate Expenditure Method (2) Family Budget Method

Aggregate Expenditure Method

In this method, the quantities of commodities consumed by the particular

group in the base year are estimated and these figures or their proportions are
used as weights. Then the total expenditure of each commodity for each

year is calculated. The price of the current year is multiplied by the quantity or
weight of the base year. These products are added. Similarly, for the base year the

total expenditure of each commodity is calculated by multiplying the quantity

consumed by its price in the base year. These products are also added. The total

expenditure of the current year is divided by the total expenditure of the base year

and the resulting figure is multiplied by 100 to get the required index numbers. In
this method, the current period quantities are not used as weights because these

quantities change from year to year.

Pon = (ƩPnqo / ƩPoqo) × 100

Here, Pn represents the price of the current year, Po represents the price of the base year and qo represents the quantities consumed in the base year.

Family Budget Method

In this method, the family budgets of a large number of people are carefully

studied and the aggregate expenditure of the average family for various items is

estimated. These values are used as weights. The current year’s prices are

converted into price relatives on the basis of the base year’s prices, and these

price relatives are multiplied by the respective values of the commodities in the
base year. The total of these products is divided by the sum of the weights and

the resulting figure is the required index numbers.

Consumer Price Index Numbers

Consumer price index numbers measure the changes in the prices paid by

consumers for a special “basket” of goods and services during the current year as

compared to the base year. The basket of goods and services will contain items

like (1) Food (2) Rent (3) Clothing (4) Fuel and Lighting (5) Education (6)

Miscellaneous like cleaning, transport, newspapers, etc. Consumer price index

numbers are also called cost of living index numbers or retail price index numbers.

Construction of Consumer Price Index Numbers

The following steps are involved in the construction of consumer price index
numbers.

(1) Class of People


The first step in the construction of the consumer price index (CPI) is that

the class of people should be defined clearly. It should be decided whether the

cost of living index number is being prepared for industrial workers, or middle or

lower class salaried people living in a particular area. It is therefore necessary to

specify the class of people and locality where they reside.


(2) Family Budget Inquiry

The next step in the construction of a consumer price index number is that

some families should be selected randomly. These families provide information

about the cost of food, clothing, rent, miscellaneous, etc. The inquiry includes

questions on family size, income, the quality and quantity of resources consumed

and the money spent on them, and the weights are assigned in proportions to the

expenditure on different items.

(3) Price Data

The next step is to collect data on the retail prices of the selected commodities for

the current period and the base period when these prices should be obtained

from the shops situated in the locality for which the index numbers are prepared.
(4) Selection of Commodities

The next step is the selection of the commodities to be included. We should

select those commodities which are most often used by that class of people.

Simple average of price relatives method:

Compute a price index for the following by (a) the simple aggregative method and (b) the average of price relatives method, using both the arithmetic mean and the geometric mean.

Commodity              A     B     C     D     E     F
Price in 2005 (Rs)    20    30    10    25    40    50
Price in 2006 (Rs)    25    30    15    35    45    55

Solution:

(a) Simple Aggregative Index = (Ʃp1 / Ʃp0) × 100

Ʃp0 = 175, Ʃp1 = 205

= (205/175) × 100 = 117.14

Calculation of price index:

Commodity   Price in 2005 (p0)   Price in 2006 (p1)   Price relative P01 = (p1/p0) × 100   Log P01
A                 20                   25                      125                           2.0969
B                 30                   30                      100                           2.0000
C                 10                   15                      150                           2.1761
D                 25                   35                      140                           2.1461
E                 40                   45                      112.5                         2.0511
F                 50                   55                      110                           2.0414
Total            175                  205                      737.5                        12.5116

(b) (i) Arithmetic mean of price relatives = ƩP01 / N = 737.5 / 6 = 122.92

(ii) Geometric mean of price relatives = Antilog (ƩLog P01 / N)
    = Antilog (12.5116 / 6)
    = Antilog 2.0853
    = 121.7
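These three results can be verified with a short Python sketch (math.prod needs Python 3.8 or later):

    from math import prod

    p0 = [20, 30, 10, 25, 40, 50]     # 2005 prices
    p1 = [25, 30, 15, 35, 45, 55]     # 2006 prices
    n = len(p0)

    simple_aggregative = sum(p1) / sum(p0) * 100
    relatives = [b / a * 100 for a, b in zip(p0, p1)]
    am = sum(relatives) / n                    # arithmetic mean of relatives
    gm = prod(relatives) ** (1 / n)            # geometric mean of relatives

    print(round(simple_aggregative, 2))        # 117.14
    print(round(am, 2))                        # 122.92
    print(round(gm, 1))                        # 121.7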

Weighted aggregative index numbers


1. Laspeyres method

2. Paasche’s method

3. Fisher’s ideal method

4. Marshall - Edgeworth method

Practical Exercises

Problem

Calculate index numbers from the following data.

             Base year                  Current year
Commodity    Kilo (q0)   Rate (Rs) (p0)   Kilo (q1)   Rate (Rs) (p1)
Bread           10             3              8            3.25
Meat            20            15             15           20
Tea              2            25              3           23

Solution:

Construction of price index:

Commodity   q0    p0    q1    p1        p1q0     p0q0    p1q1    p0q1
Bread       10     3     8    3.25       32.5      30      26      24
Meat        20    15    15   20         400.0     300     300     225
Tea          2    25     3   23          46.0      50      69      75
Total                                   478.5     380     395     324

(a) Laspeyres' method

P01 = (Ʃp1q0 / Ʃp0q0) × 100 = (478.5/380) × 100 = 125.9

(b) Paasche's method

P01 = (Ʃp1q1 / Ʃp0q1) × 100 = (395/324) × 100 = 121.9

(c) Bowley's method

P01 = [(Ʃp1q0/Ʃp0q0) + (Ʃp1q1/Ʃp0q1)] / 2 × 100
    = [(478.5/380) + (395/324)] / 2 × 100
    = 123.9

(d) Fisher's ideal method

P01 = √(L × P) = √[(Ʃp1q0/Ʃp0q0) × (Ʃp1q1/Ʃp0q1)] × 100
    = √(478.5/380 × 395/324) × 100
    = √(1.259 × 1.219) × 100
    = 1.239 × 100 = 123.9

(e) Marshall-Edgeworth method

P01 = [(Ʃp1q0 + Ʃp1q1) / (Ʃp0q0 + Ʃp0q1)] × 100
    = [(478.5 + 395) / (380 + 324)] × 100
    = (873.5 / 704) × 100
    = 124.1
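All five indices can be checked with a short Python sketch reusing the aggregates above:

    from math import sqrt

    q0, p0 = [10, 20, 2], [3, 15, 25]          # base-year quantities and prices
    q1, p1 = [8, 15, 3], [3.25, 20, 23]        # current-year quantities and prices

    s = lambda a, b: sum(x * y for x, y in zip(a, b))
    p1q0, p0q0 = s(p1, q0), s(p0, q0)          # 478.5, 380
    p1q1, p0q1 = s(p1, q1), s(p0, q1)          # 395, 324

    laspeyres = p1q0 / p0q0 * 100                          # 125.9
    paasche   = p1q1 / p0q1 * 100                          # 121.9
    bowley    = (p1q0 / p0q0 + p1q1 / p0q1) / 2 * 100      # 123.9
    fisher    = sqrt((p1q0 / p0q0) * (p1q1 / p0q1)) * 100  # 123.9
    marshall  = (p1q0 + p1q1) / (p0q0 + p0q1) * 100        # 124.1

    for name, v in [("Laspeyres", laspeyres), ("Paasche", paasche),
                    ("Bowley", bowley), ("Fisher", fisher),
                    ("Marshall-Edgeworth", marshall)]:
        print(name, round(v, 1))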

Tests of consistency of index numbers:


Time Reversal Test

Factor Reversal Test


1. Time Reversal Test:

Practical Exercises

Problem

The time reversal test is satisfied when P01 × P10 = 1. Using Fisher's formula with the aggregates Ʃp1q0 = 505, Ʃp0q0 = 425, Ʃp1q1 = 530, Ʃp0q1 = 470:

P01 = √[(Ʃp1q0/Ʃp0q0) × (Ʃp1q1/Ʃp0q1)] = √(505/425 × 530/470)

P10 = √[(Ʃp0q1/Ʃp1q1) × (Ʃp0q0/Ʃp1q0)] = √(470/530 × 425/505)

P01 × P10 = √(505/425 × 530/470 × 470/530 × 425/505) = √1 = 1

2. Factor Reversal Test:

Practical Excises

Problem

The factor reversal test is satisfied when P01 × Q01 = Ʃp1q1 / Ʃp0q0. With the same aggregates,

P01 = √[(Ʃp1q0/Ʃp0q0) × (Ʃp1q1/Ʃp0q1)] = √(505/425 × 530/470)

Q01 = √[(Ʃq1p0/Ʃq0p0) × (Ʃq1p1/Ʃq0p1)] = √(470/425 × 530/505)

P01 × Q01 = √(505/425 × 530/470 × 470/425 × 530/505)
          = 530/425, i.e., Ʃp1q1 / Ʃp0q0

Hence P01 × Q01 = Ʃp1q1 / Ʃp0q0 and the test is satisfied.
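A numerical check that Fisher's formula passes both tests with these aggregates:

    from math import sqrt, isclose

    p1q0, p0q0, p1q1, p0q1 = 505, 425, 530, 470

    P01 = sqrt((p1q0 / p0q0) * (p1q1 / p0q1))   # Fisher price index, 0 -> 1
    P10 = sqrt((p0q1 / p1q1) * (p0q0 / p1q0))   # same formula with times reversed
    Q01 = sqrt((p0q1 / p0q0) * (p1q1 / p1q0))   # quantity index (p and q swapped)

    print(isclose(P01 * P10, 1.0))              # time reversal: True
    print(isclose(P01 * Q01, p1q1 / p0q0))      # factor reversal: True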

Cost of living index numbers


Methods Of Constructing Consumer Price Index

1. Aggregate expenditure method


2. Family budget method

1. Aggregate expenditure method

Practical Exercises

Problem

Calculate the index number using the aggregate expenditure method for the year 2007 with 2006 as base year, from the following data:

Commodity   Quantity in units   Price per unit in 2006 (Rs.)   Price per unit in 2007 (Rs.)
A                 100                        8                           12.00
B                  25                        6                            7.50
C                  10                        5                            5.25
D                  20                       48                           52.00
E                  65                       15                           16.50
F                  30                       19                           27.00

Solution:

Calculation of index number by using aggregate expenditure method

Commodity     q0     p0     p1        p1q0      p0q0
A            100      8    12.00    1200.00      800
B             25      6     7.50     187.50      150
C             10      5     5.25      52.50       50
D             20     48    52.00    1040.00      960
E             65     15    16.50    1072.50      975
F             30     19    27.00     810.00      570
Total                               4362.50     3505

P01 = (Ʃp1q0 / Ʃp0q0) × 100

    = (4362.50 / 3505) × 100

    = 124.47

Family Budget Method:

Calculate the index number of prices for 2007 on the basis of 2006 from the data given below:

Commodity   Weight   Price per unit in 2006 (Rs.)   Price per unit in 2007 (Rs.)
A             40              16.00                          20.00
B             25              40.00                          60.00
C              5               0.50                           0.50
D             20               5.12                           6.25
E             10               2.00                           1.50

Solution:

Calculation of index numbers

Commodity   Weights V   Price per unit 2006   Price per unit 2007   Price relative P = (p1/p0) × 100   Weighted relatives PV
A               40             16.00                 20.00                    125                           5000
B               25             40.00                 60.00                    150                           3750
C                5              0.50                  0.50                    100                            500
D               20              5.12                  6.25                    122.1                         2442
E               10              2.00                  1.50                     75                            750
Total        ƩV = 100                                                                                  ƩPV = 12442

Index number of prices for 2007 = ƩPV / ƩV = 12442/100 = 124.42

Price index number for 2007 = 124.42
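Both consumer price index methods can be verified with a short sketch; note that commodity E's 2007 price is taken as 16.50 (consistent with the solution table above), and that the text rounds D's price relative to 122.1, so the family budget figure here comes out a hundredth lower:

    q0 = [100, 25, 10, 20, 65, 30]                  # aggregate expenditure method
    p0 = [8, 6, 5, 48, 15, 19]
    p1 = [12.00, 7.50, 5.25, 52.00, 16.50, 27.00]

    agg = (sum(a * q for a, q in zip(p1, q0)) /
           sum(a * q for a, q in zip(p0, q0)) * 100)
    print(round(agg, 2))          # 124.47

    V  = [40, 25, 5, 20, 10]                        # family budget method
    r0 = [16.00, 40.00, 0.50, 5.12, 2.00]
    r1 = [20.00, 60.00, 0.50, 6.25, 1.50]

    P  = [b / a * 100 for a, b in zip(r0, r1)]      # price relatives
    fb = sum(p * v for p, v in zip(P, V)) / sum(V)
    print(round(fb, 2))           # 124.41 (text: 124.42 after rounding D to 122.1)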
