
University of Agricultural Sciences, Dharwad

STATISTICAL METHODS
(AST-202)

By:
Dr. Ashalatha K. V.
Ms. Jyoti B Bagalkoti
Ms. Megha J
Mr. Anand P
DEPARTMENT OF AGRICULTURAL STATISTICS
COLLEGE OF AGRICULTURE, DHARWAD –5
STATISTICS

Statistics: its meaning and definition:


Statistics is a branch of applied mathematics and is concerned with observational data.
The word statistics is generally used in two different senses:
1. When used in the plural, it means "quantitative data affected to a marked extent
by a multiplicity of causes". When we say "collect statistics", we mean collect the
numerical data which are to be analyzed and interpreted, e.g.,
i. wheat production as affected by various causes;
ii. data on the heights of second-year Agriculture students.
2. When used in the singular, it means "the science of collecting, classifying and using
data for further statistical treatments". It covers the methods used in the analysis and
interpretation of data, which are known as statistical methods.
Definitions:
1. R.A. Fisher: It is the study of populations, of variation and of methods for the
reduction of data.
2. A.L. Bowley: It is the science of counting and of averages.
3. A.L. Boddington: It is the science of estimates and probabilities.
All these definitions are not satisfactory because they cover only a part of the subject.
In general, Statistics may be defined as the science and art of collection, organization,
presentation, analysis and interpretation of numerical data. In other words, Statistics is
concerned with scientific methods for collecting, organizing, summarizing, presenting and
analyzing data, as well as drawing valid conclusions and making reasonable decisions on the
basis of such analysis.
However, it is not used for all these purposes in every field. For example, in
administrative and executive departments, statisticians are interested only in the collection
and presentation of data, such as crop yields, birth and death rates, etc. A researcher, on the
other hand, employs the methods which relate to the design of experiments and the analysis
of experimental results.
Data:
The information collected through censuses and surveys, in a routine manner, or from
other sources is called raw data. When the raw data are grouped into groups or classes, they
are known as grouped data.
Example for data:
1. No. of farmers in a block.
2. The rainfall over a period of time.
3. Area under paddy crop in a state.
Kinds of Data:
There are 2 kinds of data:
1. Primary Data:
Data collected directly from the individuals, which have never been used for any
purpose earlier.
2. Secondary Data:
Data collected from secondary sources of information, like newspapers, trade
journals, statistical bulletins, etc.
Classification of Data:
1. Geographical:
Classification is according to place, region or area, such as states, towns etc.
Data on area under crop in India can be classified as shown below:

Region Area (in hectares)


Central India 200
West 110
North 230
East 140
South 135

2. Chronological:
Classification is according to the lapse of time, such as monthly, yearly, etc.
Data on Production of food grains can be classified as shown below:
Year Tonnes
1990-91 300
1991-92 230
1992-93 410
1993-94 175
1994-95 110
3. Qualitative:
Classification is according to the attributes of the subjects or items, such as sex,
colour, qualification, etc.
Type of farmers Number of farmers
Marginal 907
Medium 1041
Large 1948
Total 3896

4. Quantitative:
Classification is according to the magnitude of numerical values. Such as age,
income, height, weight, etc
The data on land holdings by farmers in a block:
Land holding (hectare) Number of Farmers

<1 442
1-2 908
2-5 471
>5 124

Variables and Attributes:


Attributes: Qualitative characteristics of an individual which show variability are known as
attributes.
Variables: The characteristics which show variation or variability are called variables or
variates, e.g., cabbage yield, wheat yield per hectare of growers.
The variable can be of two types:
I. Qualitative: The characteristics which cannot be measured numerically or in terms of
magnitude. E.g.: flower color, nature of surface.
II. Quantitative: The characteristics which can be measured numerically or in terms of
magnitude. The quantitative characteristics are of two types:
i. Discrete: A character which takes only integer (whole number) values; there
is a definite gap between two values. E.g.: number of students in a class,
number of bacteria in a given area.
ii. Continuous: The quantity which can take any numerical value within a
certain range. E.g.: Height, weight, yield.

TYPES OF STATISTICS:
1. Descriptive Statistics:
It consists of methods for organizing, displaying, and describing data
by using tables, graphs, and summary measures.
2. Inferential Statistics:
It is another branch of statistics. It provides the procedures to draw an
inference about conditions that exist in a larger set of observations from study of a part of that
set.

Aims of studying statistics:


1. To study a population of any kind on the basis of sample data.
2. To understand the nature of variability, e.g., the height of plants. The biological
phenomena observed under one set of conditions are never duplicated exactly under
another set of similar conditions. Therefore, repetition of an experiment is necessary to
account for all the factors causing variation. In biological phenomena, where variation is
the rule rather than the exception, it is this function that has wide application.
3. To express the facts in summary form. e.g., it is not possible for one to form a
“precise idea about the income position of the population of India from the records of
individuals”. However, figure of per capita income can be easily drawn.
4. To provide the correct method(s) for taking sample (sampling).
5. To provide proper method for comparison of two or more things.
6. It helps in prediction/ forecasting the yield of a particular crop for a particular year on
the basis of the past record.

Limitations of statistics:
Statistics, with its wide application in almost every sphere of human activity, is not
without limitations. The following are its limitations:
1. It does not deal with individuals.
Statistics deals with an aggregate of objects and does not give any specific recognition to
the individual items of a series. E.g., the individual figures of agricultural production of any
country for a particular year are meaningless unless, to facilitate comparison, similar figures
of other countries or of the same country for different years are given. The statement that the
height of Mr. X is 5'8" is not a statistical statement, whereas the statement that the average
height of an Indian is 5'8" is.
2. It deals only with quantitative characters.
Qualitative characters such as efficiency, honesty and intelligence cannot be studied
directly. Such factors can be measured only indirectly, e.g., the efficiency of a selling agent
can be judged by studying the number of articles sold by him.
3. Statistical results are true only on an average.
E.g., the average consumption of milk per head in a certain locality may be 0.5 litre, but
this gives no idea of the shortage of milk faced by the poor. Conclusions obtained statistically
are not universally true; they are true only under certain conditions. This is because statistics,
as a science, is less exact than the natural sciences.
4. Statistics can be misused.
If a conclusion is based on incomplete information, statistics can prove anything; hence
the saying that there are three kinds of lies: lies, damned lies and statistics. Statistics are like
clay, of which one can make a god or a devil as one pleases.
5. Expert knowledge is a must to handle statistical data.

Importance in Agriculture:
1) It helps to understand nature of variability.
2) To arrive at the meaningful conclusion on the basis of sample study in the field.
3) Express the data/result of the field experiment in summary form.
4) Sampling.
i. In state agricultural surveys for the estimation of areas and yields of crops.
ii. In the price fixation policy of various agricultural commodities.
iii. In agricultural extension surveys to study the impact of programmes.
iv. In agricultural economics surveys to study demand-supply policy, the growth
rate of population and the cost of production of various crops.
5) In Agril. Meteorology for weather forecasting and to correlate weather parameter with
crop production.
FREQUENCY DISTRIBUTION
In statistics, a frequency distribution is a tabulation of the values that one or more
variables take in a sample. Each entry in the table contains the frequency or count of the
occurrences of values within a particular group or interval; in this way, the table summarizes
the distribution of values in the sample.
Terms used in Frequency Distribution Table:
1. Class: The arrangement of data into groups based on some common criterion; these
groups are called classes. A class is denoted by X.
2. Frequency: The number of times a category or class occurs. It is denoted by f.
3. Class limits: The boundary figures of the classes are called class limits.
4. Upper limit: Upper bound value of the class.
5. Lower limit: Lower bound value of the class.
6. Class interval: Difference between upper limit and lower limit of a class.
7. Frequency Distribution: Arrangement of data along with their frequencies is called
frequency distribution.
Construction of frequency distribution table:
The following steps are used for construction of frequency distribution table:
Step 1: The number of classes is to be decided.
The appropriate number of classes may be decided by Yule's formula:
Number of classes = 2.5 × n^(1/4), where n is the total number of observations.
Step 2: The class interval (CI) is to be determined:

CI = (highest value − lowest value) / number of classes
Step 3: The frequencies are counted by using tally marks.


Step 4: Now class is represented along with their frequencies which forms the frequency
distribution table.

Frequency distribution table can be made by two methods:


1. Exclusive method:
In this method, the lower limit of class interval is included in the same class
but upper limit is included in the next class. There is no gap between upper limit of
one class and lower limit of another class. It is continuous distribution.
Ex:
Class (X) Frequency (f)

25-30 5
30-35 6
35-40 7

2. Inclusive method:
In this method, the lower limit and upper limit of class interval are included in
the same class. It is discontinuous distribution.
Ex:
Class (X) Frequency (f)
65-84 3
85-104 5
105-124 7
125-144 12
145-164 8
To convert discontinuous distribution into continuous distribution, subtract 0.5 from
lower limit and add 0.5 to upper limit.

EXAMPLE: Construct a frequency distribution table for following data:


23, 50, 38, 42, 63, 75, 12, 33, 26, 39, 35, 47, 43, 52, 56, 59, 64, 77, 15, 21, 51, 54, 72, 68, 36,
65, 52, 60, 27, 34, 47, 48, 55, 58, 59, 62, 51, 48, 50, 41, 57, 65, 54, 43, 56, 44, 30, 46, 67, 53.

Solution:
No. of observation (n) = 50
Number of classes = 2.5 × n^(1/4)
= 2.5 × 50^(1/4)
= 6.648 ≈ 7.0

Class Interval = (highest value − lowest value) / number of classes
= (77 − 12)/7 = 9.286 ≈ 9.0
Inclusive method:
C.I Tally marks Frequency (f)
10-19 II 2
20-29 IIII 4
30-39 IIII II 7
40-49 IIII IIII 10
50-59 IIII IIII IIII I 16
60-69 IIII III 8
70-79 III 3
TOTAL 50

Exclusive method:
C.I Tally marks Frequency (f)
9.5-19.5 II 2
19.5-29.5 IIII 4
29.5-39.5 IIII II 7
39.5-49.5 IIII IIII 10
49.5-59.5 IIII IIII IIII I 16
59.5-69.5 IIII III 8
69.5-79.5 III 3
TOTAL 50
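The construction above can be checked with a short program. Below is a minimal Python sketch (not part of the original notes) that applies Yule's formula to the 50 observations of this example and counts the inclusive-class frequencies; the class limits are hard-coded to match the table above.

```python
data = [23, 50, 38, 42, 63, 75, 12, 33, 26, 39, 35, 47, 43, 52, 56, 59, 64, 77,
        15, 21, 51, 54, 72, 68, 36, 65, 52, 60, 27, 34, 47, 48, 55, 58, 59, 62,
        51, 48, 50, 41, 57, 65, 54, 43, 56, 44, 30, 46, 67, 53]

n = len(data)
k = round(2.5 * n ** 0.25)            # Yule's formula: 6.648 -> 7 classes
ci = (max(data) - min(data)) / k      # (77 - 12) / 7 = 9.286 -> about 9

# Inclusive classes 10-19, 20-29, ..., 70-79, as in the table above
for lo in range(10, 80, 10):
    hi = lo + 9
    f = sum(lo <= x <= hi for x in data)
    print(f"{lo}-{hi}: {f}")          # prints 2, 4, 7, 10, 16, 8, 3
```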

Cumulative frequency:
Total frequency up to and including a given class is called cumulative frequency. There
are two types of cumulative frequency: less than and more than cumulative frequency.
1. Less than cumulative frequency:
It is obtained on adding successively the frequency of all previous values (or
classes), including the frequency of variable (against) which the totals are written,
provided the values (classes) are arranged in ascending order of magnitude.
Ex:
Marks Frequency Less than cf
25-30 5 5
30-35 6 5+6=11
35-40 6 11+6=17
40-45 4 17+4=21
45-50 4 21+4=25

2. More than cumulative frequency:


It is obtained by finding the cumulative totals of frequencies starting from the
highest value of the variable(class) and subtracting frequency of all previous value (or
classes).
Ex:
Marks Frequency More than cf
25-30 5 25
30-35 6 25-5=20
35-40 6 20-6=14
40-45 4 14-6=8
45-50 4 8-4=4

Example: Form a cumulative frequency distribution (less than and more than) table for the
following data:
C.I Frequency (f) Less than cf More than cf
9.5-19.5 2 2 50
19.5-29.5 4 2+4=6 50-2=48
29.5-39.5 7 6+7=13 48-4=44
39.5-49.5 10 13+10=23 44-7=37
49.5-59.5 16 23+16=39 37-10=27
59.5-69.5 8 39+8=47 27-16=11
69.5-79.5 3 47+3=50 11-8=3
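The same cumulative totals can be computed mechanically. A small Python sketch (our own illustration, using the frequencies from the table above):

```python
from itertools import accumulate

freq = [2, 4, 7, 10, 16, 8, 3]        # class frequencies from the table above
less_than = list(accumulate(freq))    # running totals from the lowest class
total = sum(freq)
more_than = [total - c for c in [0] + less_than[:-1]]  # subtract previous classes
print(less_than)                      # [2, 6, 13, 23, 39, 47, 50]
print(more_than)                      # [50, 48, 44, 37, 27, 11, 3]
```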
Graphical Representation of Data
Graphs are charts consisting of points, lines and curves. Charts are drawn on graph
sheets. Suitable scales are to be chosen for both x and y axes, so that the entire data can be
presented in the graph sheet. Graphical representations are used for grouped quantitative data.

Different types of graphs (Statistical graphs):


1. Histogram
2. Frequency polygon
3. Frequency curve
4. Ogive
Histogram:
The histogram is the most popular chart and is commonly used for presenting a
continuous frequency distribution. It is just like a simple bar diagram, with minor differences:
there is no gap between the bars, since the classes are continuous, and the bars are drawn only
in outline, without colouring or marking as in the case of a simple bar diagram. It is a suitable
form to represent a frequency distribution.
Class intervals are to be represented in X-axis and the bases of the bars are the
respective class intervals. Frequencies are to be represented on Y-axis. The heights of the
bars are equal to the corresponding frequencies.
Frequency Polygon:
In a frequency polygon, the frequency of each class is plotted against the mid value of
the class (on the X-axis). The adjacent points are then joined by straight lines. The resulting
graph is known as a frequency polygon.
Example: Draw frequency polygon for following data:
Seed Yield (gms) No. of Plants
2.5-3.5 4
3.5-4.5 6
4.5-5.5 10
5.5-6.5 26
6.5-7.5 24
7.5-8.5 15
8.5-9.5 10
9.5-10.5 5
Frequency curve:
Construction is similar to that of the frequency polygon, but the plotted points are
connected by a free-hand smooth curve. The frequency curve also begins and ends on the
base line. It gives a smoother appearance of the data than the frequency polygon.
Example: Draw a frequency curve for the following data:

Seed Yield (gms) No. of Plants


2.5-3.5 4
3.5-4.5 6
4.5-5.5 10
5.5-6.5 26
6.5-7.5 24
7.5-8.5 15
8.5-9.5 10
9.5-10.5 5

Ogives:
An ogive is a cumulative frequency graph: a free-hand graph showing the curve of a
cumulative frequency distribution.
There are two types of ogives:
1. Less than ogive:
Less than ogive is the graph of the less than cumulative frequency distribution which
shows the number of observations LESS THAN the upper-class limit.

2. More than ogive:


More than ogive is the graph of the greater than cumulative frequency distribution
which shows the number of observations GREATER THAN the lower-class limit.

NOTE: Less than and more than ogives intersect at Median.


Measures of Central Tendency
In the study of a population with respect to a characteristic in which we are interested,
we may get a large number of observations. It is not possible to grasp any idea about the
characteristic when we look at all the observations. So, it is better to get one number for the
group. That
number must be a good representative one for all the observations to give a clear picture of
that characteristic. Such representative number can be a central value for all these
observations. This central value is called a measure of central tendency or an average or a
measure of locations. There are five averages. Among them mean, median and mode are
called simple averages and the other two averages geometric mean and harmonic mean are
called special averages.

Requisite/characteristics of an ideal measures of central tendency:


Since an average is a single value representing a group of values, it is expected that such
a value should satisfy the following conditions:
1. It should be based on all the observation of a set of data.
2. It should be rigidly defined.
3. It should be easily computable.
4. It should be least affected by the extreme values.
5. It should be least affected by fluctuations of sampling.
6. It should be amenable for further mathematical treatments.
Different measures of central tendency:
1) Arithmetic mean.
2) Median.
3) Mode.
4) Geometric mean.
5) Harmonic mean.

Arithmetic mean:
It is the most common and ideal measure of central tendency. It is defined as "the sum
of the observed values of the character (or variable) divided by the total number of
observations". It is denoted by the symbol x̄.
For ungrouped data:
If the variable x assumes n values x1, x2, …, xn then the mean is given by:

A.M. (x̄) = (x1 + x2 + … + xn)/n = Σxi / n

Example 1: Calculate the mean for the pH levels of soil 6.8, 6.6, 5.2, 5.6, 5.8.
Solution:
x̄ = (6.8 + 6.6 + 5.2 + 5.6 + 5.8)/5 = 30/5 = 6

Example 2: A variable takes the values given below. Calculate the arithmetic mean of
110, 117, 129, 195, 95, 100, 100, 175, 250 and 750.
x̄ = (110 + 117 + 129 + 195 + 95 + 100 + 100 + 175 + 250 + 750)/10 = 2021/10 = 202.1

For grouped data:


1. Direct Method:
Let x1, x2, …, xk be the values of x and f1, f2, …, fk their corresponding frequencies;
then the mean for grouped data is obtained from the following formula:

x̄ = (f1x1 + f2x2 + … + fkxk)/(f1 + f2 + … + fk) = Σfixi / N, where N = Σfi

NOTE: In the case of a grouped or continuous frequency distribution, x is taken as the mid
value of the corresponding class.
2. Assumed mean method:

x̄ = A + (Σfidi / N) × C, where di = (xi − A)/C, A is the assumed mean, C is the class
interval and x is the class mid value.


Example 1: The distribution of age at first marriage of 130 males was as given bellow.
Calculated the average age.
Age in 18 19 20 21 22 23 24 25 26 27 28 29
year(x)
No of 2 1 4 8 10 12 17 19 18 14 13 12
males(f)

The average age can be calculated as follows:

x̄ = Σfixi / N = 3240/130 = 24.92

i.e., the mean age of males at first marriage is 24.92 years.
Example 2: The distribution of the size of the holding of cultivated land in an area, was as
follows:
Size of holdings Mid points(x) No of
holdings(f)
0-2 1 48
2-4 3 19
4-6 5 10
6-8 7 14
8-10 9 11
10-20 15 9
20-40 30 2
40-60 50 1

The average size of holding in the area can be calculated as follows; the midpoints of the
class intervals are shown in the middle column along with the data. Hence,

A.M. = Σfixi / N = 597/114 = 5.237

i.e., the average size of holding is 5.237 hectares.

Example: Calculate the mean for the following frequency distribution:

Class 0-8 8-16 16-24 24-32 32-40 40-48


interval
frequency 8 7 16 24 15 7

Solution:

Here we take A = 28 and C = 8.

Class interval Mid-value (x) Frequency (f) di = (xi − A)/C fdi

0-8 4 8 -3 -24
8-16 12 7 -2 -14
16-24 20 16 -1 -16
24-32 28 24 0 0
32-40 36 15 1 15
40-48 44 7 2 14
Total 77 -25

x̄ = A + (Σfidi / N) × C

= 28 + (−25/77) × 8

= 28 − 2.597 = 25.403
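Both methods can be verified with a few lines of Python. This sketch (not part of the original notes) reproduces the direct and assumed-mean calculations for the table above:

```python
mid  = [4, 12, 20, 28, 36, 44]        # class mid-values
freq = [8, 7, 16, 24, 15, 7]          # class frequencies
N = sum(freq)                         # 77

# Direct method: x̄ = Σ f x / N
mean_direct = sum(f * x for f, x in zip(freq, mid)) / N

# Assumed mean method: x̄ = A + (Σ f d / N) * C, with d = (x - A) / C
A, C = 28, 8
d = [(x - A) / C for x in mid]
mean_assumed = A + sum(f * di for f, di in zip(freq, d)) / N * C

print(round(mean_direct, 3), round(mean_assumed, 3))  # both print 25.403
```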

Merits of Arithmetic mean:


1. It is rigidly defined.
2. It is easy to understand and easy to calculate.
3. It is based upon all the observation.
4. It is amenable to algebraic treatment.

5. Of all the averages, the arithmetic mean is affected least by fluctuations of sampling.
This property is sometimes described by saying that the arithmetic mean is a stable
average.
Thus, we see that the arithmetic mean satisfies all the properties laid down by Prof. Yule
for an ideal average.

Demerits of Arithmetic mean:


1. It cannot be determined by inspection nor can it be located graphically.
2. Arithmetic mean cannot be used if we are dealing with qualitative characteristics
which cannot be measured quantitatively; such as, intelligence, honesty, beauty, etc.
3. Arithmetic mean cannot be obtained even if a single observation is missing or lost or
is illegible unless we drop it out and compute the arithmetic mean of the remaining
values.
4. It is affected very much by extreme values. In case of extreme items, it gives a
distorted picture of the distribution and no longer remains representative of the
distribution.
5. Arithmetic mean may lead to wrong conclusions if the details of the data from which it
is computed are not given. Consider the following marks obtained by two students A
and B in three tests, viz., a terminal test (I), a half-yearly examination (II) and an
annual examination (III), respectively.

Marks I test II test III test Average

A 50% 60% 70% 60%

B 70% 60% 50% 60%


Thus, the average marks obtained by each of the two students at the end of the
year are 60%. If we are given the average marks alone, we conclude that the
performance of the two students at the end of the year is the same. This is a fallacious
conclusion, since student A improved consistently while student B deteriorated
consistently.
6. Arithmetic mean cannot be calculated for open end class, e.g.: below 10 or above 90.
7. In extremely asymmetrical (skewed) distribution, usually arithmetic mean is not a
suitable measure of location.
Note: It is dependent on change of origin and change of scale.
Properties of arithmetic mean:
1) Algebraic sum of the deviations of a set of values from their arithmetic mean is
always zero.
2) The sum of the squares of the deviations of values is minimum when taken about
mean.
3) The arithmetic mean is increased or decreased by a constant value added to or
subtracted from each observation, respectively. Also, it is 1/C times the original mean
if each observation is divided by C, and C times the original mean if each observation
is multiplied by C.
4) Amenability of the arithmetic mean to further mathematical calculation:
Let x̄1 be the mean of n1 observations, x̄2 the mean of n2 observations, …, x̄k the
mean of nk observations. Then the combined mean of the N observations (where
N = n1 + n2 + … + nk) is given by:

x̄ = (n1x̄1 + n2x̄2 + … + nkx̄k) / N

Uses:
It is most popular and simple estimate and used widely in almost all fields of studies
such as social science, economics, business, agriculture, medical science, engineering and
such other science.

Weighted mean:

When different observations are to be given different weights, the arithmetic mean does
not prove to be a good measure of central tendency; in such cases the weighted mean is
calculated. If x1, x2, x3, …, xn are the observations and W1, W2, W3, …, Wn are the
respective weights, then:

W.M. = (W1x1 + W2x2 + … + Wnxn)/(W1 + W2 + … + Wn) = ΣWixi / ΣWi

Merits and demerits:


The weighted mean is an arithmetic mean; hence its merits and demerits are the same as
those of the arithmetic mean.
Uses:
1. Used when the numbers of individuals in the different classes of a group vary widely.
2. Used when the importance of all items in a series is not the same.
3. Used when ratios, percentages or rates, e.g., Rs/kg, Rs/mt, etc., are to be averaged.
4. W.M. is particularly used in calculating birth rates, death rates, index numbers,
average yields, etc.
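A minimal sketch of the weighted mean formula in Python; the prices and weights below are hypothetical values chosen purely for illustration:

```python
# Weighted mean: W.M. = Σ Wi xi / Σ Wi
prices  = [20.0, 35.0, 50.0]   # hypothetical prices (Rs/kg)
weights = [10, 4, 1]           # hypothetical quantities purchased (kg)

wm = sum(w * x for w, x in zip(weights, prices)) / sum(weights)
print(round(wm, 2))            # 26.0
```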
Geometric mean:
Geometric mean of n values is the nth root of the product of the values. It is denoted by
G. Thus, the geometric mean of the values x1, x2, x3, …, xn is given by:

G = (x1 × x2 × … × xn)^(1/n)

If logarithms are used,

log G = (Σ log xi)/n, so that G = antilog[(Σ log xi)/n]

Example: Calculate the geometric mean of 2, 4, 8.
G = (2 × 4 × 8)^(1/3) = 64^(1/3) = 4

Example: Find the geometric mean for the following data:

X 110 115 118 119 120


F 4 11 21 6 2

Solution:

X F log x f (log x)
110 4 2.0414 8.1656
115 11 2.0607 22.6677
118 21 2.0719 43.5099
119 6 2.0755 12.4530
120 2 2.0792 4.1584
Total 44 90.9546

The geometric mean is given by:
G = antilog[(Σ f log x)/N] = antilog(90.9546/44) = antilog(2.0672) = 116.7
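The logarithmic computation can be checked in Python. A minimal sketch using the data of this example (the slight difference from 116.7 comes from carrying more decimal places than the table):

```python
import math

x = [110, 115, 118, 119, 120]
f = [4, 11, 21, 6, 2]
N = sum(f)                                    # 44

# log G = Σ f log10(x) / N; G = antilog(log G)
log_g = sum(fi * math.log10(xi) for fi, xi in zip(f, x)) / N
print(round(10 ** log_g, 2))                  # 116.72

# Ungrouped case from the first example: G = (2*4*8)^(1/3)
print(round((2 * 4 * 8) ** (1 / 3), 2))       # 4.0
```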

Merits /Advantages:

1. Geometric mean is rigidly defined.
2. Its calculation is based on all observations.
3. It is less affected by extreme values than the arithmetic mean.
4. It is not much affected by fluctuations of sampling.
5. It is capable of further mathematical treatment.

Disadvantages:

1. If any one of the observations is zero, the G.M. does not exist.
2. If any observation is negative, the G.M. may not exist (it becomes imaginary).

Uses of geometric mean:

1. It is used to find average ratios, average proportions, average percentages and rates of
change or relative change (i.e., increase or decrease), e.g., the rate of population
growth, the growth rate of interest, death rates, etc.
2. Used in construction of index numbers.

Harmonic mean:

The harmonic mean of n values is the reciprocal of the arithmetic mean of the
reciprocals of the given values. It is denoted by H.

The harmonic mean of the values x1, x2, x3, …, xn is:

H.M. = n / Σ(1/xi)

In the case of tabulated data, the harmonic mean is:

H.M. = N / Σ(fi/xi), where N = Σfi.
Example: Calculate the Harmonic mean of 9.7, 9.8, 9.5, 9.4 and 9.7.

Solution:

X 1/x
9.7 0.1031
9.8 0.1020
9.5 0.1053
9.4 0.1064
9.7 0.1031
Total 0.5199

Harmonic mean = 5/0.5199 = 9.6172
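A one-line check in Python (the exact value 9.6177 differs slightly from 9.6172 above because the table rounds the reciprocals to four decimals):

```python
x = [9.7, 9.8, 9.5, 9.4, 9.7]

# H.M. = n / Σ(1/x)
h = len(x) / sum(1 / xi for xi in x)
print(round(h, 4))   # 9.6177
```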


Merits:

1. H.M. satisfies the test of rigid definition; its definition is precise and its value is
always definite.
2. Like A.M. and G.M., this average is based on all the observations of the series. It
cannot be calculated in the absence of even a single figure.
3. H.M. is capable of further algebraic treatment.
4. It is little affected by fluctuations of sampling.
5. It gives greater importance to small items, so a single big item cannot push up
its value.

Draw backs:

1. H.M. is not readily understood, nor can it be calculated with ease.
2. It gives high weightage to small items.
3. It cannot be calculated if any one item is zero.
4. It is usually a value which does not exist in the given data.

Applications:

1. It is considered an appropriate average for averaging ratios and rates.
2. In population genetics, it is used for calculating the effect of fluctuations in
generation size on the effective breeding population.
3. In finance, it is the preferred method for averaging multiples such as price/earnings
ratios (in which price is in the numerator).
4. In computer science, specifically in information retrieval and machine learning, the
harmonic mean of precision and recall is often used as an aggregated performance
score for the evaluation of algorithms and systems.

Median:

Median of a set of values is the middle-most value when they are arranged in
ascending order of magnitude (such an arrangement is called an array). It is a value that is
greater than half of the values and less than the remaining half. The median is denoted by M.
In the case of raw data, and also of a discrete frequency distribution, the median is:

M = ((n+1)/2)th value in the arrayed series.

In the case of ungrouped data, if the number of observations is odd, the median is the
middle value after the values have been arranged in ascending or descending order of
magnitude. In the case of an even number of observations, there are two middle terms, and
the median is obtained by taking the arithmetic mean of the middle terms.
Example 1: The median of the values 25, 20, 15, 35, 18, i.e., of 15, 18, 20, 25, 35, is 20.
Example 2: The median of 8, 20, 50, 25, 15, 30, i.e., of 8, 15, 20, 25, 30, 50, is
(20 + 25)/2 = 22.5.

In the case of a discrete frequency distribution, the median is obtained by considering
the cumulative frequencies. The steps for calculating the median are given below:

• Find N/2, where N = Σfi.

• See the less than cumulative frequency (c.f.) just greater than N/2.

• The corresponding value of x is the median.

Example: Find the median for following data:


Age (in years) 12 10 14 15 8
Number of students 5 3 2 6 4
Solution:
N/2 = 20/2 = 10
Age (in years) Number of students CF
8 4 4
10 3 7
12 5 12 ← Median
14 2 14
15 6 20

The value of x corresponding to the c.f. just greater than 10 is 12. Hence the median is 12.

Median for continuous frequency distribution:


In the case of a continuous frequency distribution, the class corresponding to the c.f.
just greater than N/2 is called the median class, and the value of the median is obtained by
using the following formula:

Md = L + [(N/2 − cf) / f] × CI

Where, L is the lower limit of the median class,
f the frequency of the median class,
CI the magnitude (width) of the median class, and
cf the cumulative frequency of the class preceding the median class.
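The formula can be wrapped in a small Python function. A sketch (the helper name grouped_median is our own), applied to the continuous distribution built in the frequency-distribution chapter:

```python
def grouped_median(limits, freq):
    """Md = L + ((N/2 - cf) / f) * CI for (lower, upper) class bounds."""
    N = sum(freq)
    cf = 0
    for (L, U), f in zip(limits, freq):
        if cf + f >= N / 2:               # first class whose c.f. reaches N/2
            return L + (N / 2 - cf) / f * (U - L)
        cf += f

limits = [(9.5, 19.5), (19.5, 29.5), (29.5, 39.5), (39.5, 49.5),
          (49.5, 59.5), (59.5, 69.5), (69.5, 79.5)]
freq = [2, 4, 7, 10, 16, 8, 3]
print(round(grouped_median(limits, freq), 2))  # 50.75
```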
Properties of median:

i. Sum of the absolute deviations taken from median is minimum.


ii. It can be located graphically through ogive curve.
iii. It is not based on all observations. For example, the median of 10, 25, 50, 60 and 65 is
50. We can replace the observations 10 and 25 by any two values smaller
than 50, and the observations 60 and 65 by any two values greater than 50, without
affecting the value of the median. This property is described by saying that the median
is insensitive.

Merits of Median:

1. It is rigidly defined.
2. It is easily understood.
3. It is simple to calculate.
4. It is not affected by extreme values.
5. It can be calculated for open end classes.

Demerits of median:

1. In the case of an even number of observations, the median cannot be determined
exactly.
2. It is not based on all the observations.
3. It is not amenable to further algebraic treatment.
4. It is affected much by fluctuations of sampling.


Applications of Median:

1. Median is the only average to be used while dealing with qualitative data.
2. It is to be used for determining the typical value in problems concerning wages,
distribution of wealth, etc.
3. The median is the most commonly quoted figure used to measure property prices.
4. The use of the median avoids the problem of the mean property price which is
affected by a few expensive properties that are not representative of the general
property market.

Mode:

Mode is the value which occurs most frequently in a set of observations and around
which the other items of the set cluster densely. In other words, mode is the value of the
variable which is predominant in the series. Thus, in the case of discrete frequency
distribution, mode is the value of x corresponding to maximum frequency.

Mode for the continuous frequency distribution:


STEPS:
1. Find the modal class: the class having the highest frequency is the modal class.
2. Find the mode by applying the following formula:

Mode = L + [(f1 − f0) / (2f1 − f0 − f2)] × C

Where,
L = lower limit of the modal class,
f1 = frequency of the modal class,
f2 = frequency of the class succeeding the modal class,
f0 = frequency of the class preceding the modal class,
C = class interval.
The modal class is the class which has the highest frequency.
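As with the median, the formula is easy to sketch in Python (the helper name grouped_mode is our own), again using the distribution from the frequency-distribution chapter:

```python
def grouped_mode(lowers, freq, C):
    """Mode = L + (f1 - f0) / (2*f1 - f0 - f2) * C."""
    i = freq.index(max(freq))                     # modal class index
    f1 = freq[i]
    f0 = freq[i - 1] if i > 0 else 0              # preceding class frequency
    f2 = freq[i + 1] if i < len(freq) - 1 else 0  # succeeding class frequency
    return lowers[i] + (f1 - f0) / (2 * f1 - f0 - f2) * C

lowers = [9.5, 19.5, 29.5, 39.5, 49.5, 59.5, 69.5]
freq = [2, 4, 7, 10, 16, 8, 3]
print(round(grouped_mode(lowers, freq, 10), 2))   # 53.79
```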
Ungrouped Data:
The following are the numbers of children of 20 couples. Find the mode.
No. of children per couple: 2, 3, 6, 3, 4, 0, 5, 2, 2, 4, 3, 2, 1, 0, 4, 2, 2, 1, 1, 3.

Here the value 2 appears the greatest number of times (6 times). Therefore, the mode is 2.

Mode for discrete data:

Age (X) No. of boys (f)
12 5
10 3
14 2
15 6 ← Modal value

The highest frequency is 6. Hence 15 is the mode.

Advantages:

1. Very quick and easy to determine


2. Is an actual value of the data
3. Not affected by extreme scores

Disadvantages:

1. Sometimes not very informative (e.g. cigarettes smoked in a day)


2. Can change dramatically from sample to sample
3. Might be more than one (which is more representative?)

Applications of Mode:

1. The mode has applications in printing. For example, it is important to print more of
the most popular books; because printing different books in equal numbers would
cause a shortage of some books and an oversupply of others.
2. Likewise, the mode has applications in manufacturing. For example, it is important to
manufacture more of the most popular shoes; because manufacturing different shoes
in equal numbers would cause a shortage of some shoes and an oversupply of others.
Relationship between mean, median and mode for a slightly skewed distribution:

Mean − Mode = 3 (Mean − Median), i.e.,

Mode = 3 Median − 2 Mean.

For a symmetrical distribution, Mean = Median = Mode.


When to use various Averages.
The appropriate situations where various averages can be used are given below:
Arithmetic Mean:
1. The average is required for further statistical calculations.
2. The variable is continuous.
3. The variable is additive in nature.
4. The data are on the interval / ratio scale.

Median:
1. The variable is discrete.
2. Some of the extreme values are missing.
3. There are abnormal extreme values.
4. Mode is ill-defined.
5. The characteristic under study is qualitative.
6. The data are on the ordinal scale.
Mode:
1. Modal value has very high frequency compared to other frequencies.
2. Some of the extreme values are missing.
3. The variable is discrete.
4. There are abnormal extreme values.
5. The characteristic under study is qualitative.
6. The data are on nominal scale.
Geometric mean:
1. The variable is multiplicative in nature.
Harmonic mean:
1. The reciprocal of the values is additive in nature.
Measures of Dispersion:
Though the mean is an important concept in statistics, it does not give a clear picture
of how the different observations are distributed in a given distribution or series under study.
Consider the following series:
Series Observations Mean
1 2,3,4,7 4
2 4,4,4,4 4
3 1,1,2,12 4
4 3,4,4,5 4

In the above series, the mean is the same, i.e., 4, but the observations are spread about the
mean in different manners. Hence, after locating a measure of central tendency, the next step
is to find out how the observations are scattered about it. This can be done by measuring the
spread, which is also called the scatter, variation or dispersion of the variate values.

Definition:
Dispersion may be defined as the extent of the scatter of observations around a
measure of central tendency, and a measure of such scatter is called a measure of dispersion.
The different measures of dispersion are as follows:
1. Range.
2. Quartile deviation
3. Absolute mean deviation or absolute deviation (A.D or A.M.D).
4. Standard deviation(S).
Characteristics of satisfactory measures of dispersion:

Measure of dispersion should possess all those characteristics which are considered
essential for measures of central tendency viz.
1. It should be based on all the observations.
2. It should be readily comprehensible.
3. It should be fairly easily calculated.
4. It should be simple to understand.
5. It should not be affected by extreme values.
6. It should not be affected by fluctuations of sampling.
7. It should be amenable to algebraic treatments.
Measures of dispersion:
• Absolute:
Measure the dispersion in the original unit of the data. Variability in two or more
distributions can be compared only if they are given in the same unit and have the same
average.
• Relative:
A measure of dispersion free from the unit of measurement of the data. It is also
called a coefficient of dispersion.

Range:

The range is the difference between the two extreme observations of the distribution. If A
and B are the greatest and smallest observations respectively in a distribution, then the range
is given by:

Range = Xmax − Xmin = A − B

Range is the simplest but a crude measure of dispersion: it is easy to calculate and
involves almost no computation. We can use the range when the number of observations is
less than five and the data are on an ordinal scale. Since it is based on two extreme
observations, which are themselves subject to chance fluctuations, it is not at all a reliable
measure of dispersion.

Example: For the following distribution of ages of 10 pre-university students, find the range
and the coefficient of range.

16, 18, 18, 16, 18, 20, 17, 19, 16, 24.

Solution:

The highest and lowest values are H = 24 years, and L= 16 years.

Therefore, the range is 24 − 16 = 8 years.

The coefficient of range = (H − L)/(H + L) = (24 − 16)/(24 + 16) = 8/40 = 0.2


Quartile deviation or Semi- inter quartile range:

The semi-interquartile range (or SIR) is defined as half the difference between the third
and first quartiles. The first quartile is the 25th percentile and the third quartile is the
75th percentile.

Q.D. = (Q3 − Q1) / 2

It is definitely a better measure than the range as it makes use of central 50% of data.
But since it ignores other 50% of the data, it cannot be regarded as a reliable measure.

Example: For the following data, find the Q.D and Coefficient of Q.D:

36, 43, 30, 37, 38, 35, 29, 38, 35, 32, 35, 36.

Solution:

To find the Q.D, firstly the lower and the upper quartiles should be obtained.

Array: 29, 30, 32, 35, 35, 35, 36, 36, 37, 38, 38, 43.

The lower quartile is Q1 = [(n+1)/4]th value in the array = 3.25th value
= 32 + 0.25(35 − 32) = 32.75.

The upper quartile is Q3 = [3(n+1)/4]th value in the array = 9.75th value
= 37 + 0.75(38 − 37) = 37.75.

Thus Q.D = (37.75 – 32.75)/2 = 2.5.

Coefficient of Q.D = (37.75 – 32.75)/(37.75 + 32.75) = 0.0709.
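The interpolation used above can be reproduced in Python; the helper quartile is our own illustration:

```python
data = sorted([36, 43, 30, 37, 38, 35, 29, 38, 35, 32, 35, 36])
n = len(data)

def quartile(arr, pos):
    """Linear interpolation at a fractional 1-based position."""
    i = int(pos) - 1                    # index of the value below the position
    frac = pos - int(pos)
    return arr[i] + frac * (arr[i + 1] - arr[i])

q1 = quartile(data, (n + 1) / 4)        # 3.25th value -> 32.75
q3 = quartile(data, 3 * (n + 1) / 4)    # 9.75th value -> 37.75
print((q3 - q1) / 2)                    # Q.D. = 2.5
print(round((q3 - q1) / (q3 + q1), 4))  # coefficient of Q.D. = 0.0709
```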

Mean deviation:
The mean deviation is an average of absolute deviations of individual observations
from the central value of a series.
M.D.(x̄) = Σ|xi − x̄| / n, about the mean;

M.D.(M) = Σ|xi − M| / n, about the median.

Calculate the mean deviation for the data: 72, 85, 87, 89, 90, 93.
Here x̄ = 516/6 = 86.

xi x̄ xi − x̄ |xi − x̄|
72 86 -14 14
85 86 -1 1
87 86 1 1
89 86 3 3
90 86 4 4
93 86 7 7
Total 516 0 30

M.D.(x̄) = Σ|xi − x̄| / n = 30/6 = 5

Example: Calculate mean deviation from median for the given data.

18, 16, 16, 19, 12, 14, 20.

Solution:

Array (x) |x-M|

12 4
14 2

16 0
16 0

18 2

19 3
20 4

Total 15

M = ((n+1)/2)th value in the array = 4th value = 16.

Mean deviation from median = Σ|x − M| / n = 15/7 = 2.14.

Note: Mean deviation about median is less than or equal to M.D about mean.
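Both mean deviations can be computed in a few lines of Python (a sketch using the data of the last example); the output also illustrates the note above:

```python
data = [18, 16, 16, 19, 12, 14, 20]
n = len(data)

mean = sum(data) / n                             # 16.43
md_mean = sum(abs(x - mean) for x in data) / n

arr = sorted(data)
median = arr[n // 2]                             # middle value (n odd) = 16
md_median = sum(abs(x - median) for x in data) / n

print(round(md_mean, 2), round(md_median, 2))    # 2.2 and 2.14: median M.D. is smaller
```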
Standard deviation(S):
The standard deviation or “root of mean square deviation” is the most common and
efficient estimator used in statistics. It is based on deviation from arithmetic mean and is
denoted by S (Standard deviation of sample) or σ (Standard deviation of population).
Definition:
It is a square root of a ratio of sum of square of deviation calculated from arithmetic
mean to the total number of observations.
Method of calculations:
A. Ungrouped data:
1) Deviation method:

S = √[Σ(xi − x̄)² / (n − 1)]

2) Variable square method:

S = √{[Σxi² − (Σxi)²/n] / (n − 1)}

3) Assumed mean method:

S = √{[Σdi² − (Σdi)²/n] / (n − 1)}, where di = xi − A and A is the assumed mean.

B. Grouped data or frequency distribution:


1) Deviation method:

S = √[Σfi(xi − x̄)² / (N − 1)]

2) Variable square method:

S = √{[Σfixi² − (Σfixi)²/N] / (N − 1)}

3) Assumed mean method:

S = √{[Σfidi² − (Σfidi)²/N] / (N − 1)}, where di = xi − A, A is the assumed mean,
fi is the frequency of the ith class and N = Σfi.

Example:
1. Find the standard deviation for the given data:
Family No. 1 2 3 4 5 6 7 8 9 10

Size (xi) 3 3 4 4 5 5 6 6 7 7

Family no 1 2 3 4 5 6 7 8 9 10 Total
xi 3 3 4 4 5 5 6 6 7 7 50
xi − x̄ -2 -2 -1 -1 0 0 1 1 2 2 0
(xi − x̄)² 4 4 1 1 0 0 1 1 4 4 20

Here x̄ = 50/10 = 5.

S = √[Σ(xi − x̄)² / (n − 1)] = √(20/9) = 1.49

2. Find the standard deviation for the following data:

xi 3 5 7 8 9
fi 2 3 2 2 1

xi fi xifi xi − x̄ (xi − x̄)² fi(xi − x̄)²
3 2 6 -3 9 18
5 3 15 -1 1 3
7 2 14 1 1 2
8 2 16 2 4 8
9 1 9 3 9 9
Total 10 60 - - 40

Here x̄ = 60/10 = 6.

S = √[Σfi(xi − x̄)² / (N − 1)] = √(40/9) = 2.11
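A quick check of Example 1 in Python, using the sample formula with divisor (n − 1):

```python
data = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]
n = len(data)
mean = sum(data) / n                     # 5.0

ss = sum((x - mean) ** 2 for x in data)  # Σ(x - x̄)² = 20
s = (ss / (n - 1)) ** 0.5
print(round(s, 2), round(s * s, 2))      # S = 1.49, variance S² = 2.22
```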

Mathematical properties:
1) The sum of square of the deviations of items in the series from their arithmetic mean
is minimum. This is the reason why standard deviation is always computed from
arithmetic mean.
2) Addition or subtraction of a constant from the group of an observation will not change
the value of S.D.
3) Multiplying or dividing each observation of a given series by a constant value will
multiply or divide the standard deviation by the same constant.
Note: Standard deviation is independent of change of origin but dependent on change of
scale.

Variance:
Variance is the square of the standard deviation. It is also called the "mean square
deviation". It is used very extensively in the analysis of variance of results from field
experiments. Symbolically, S² denotes the sample variance and σ² the population variance.
Methods of calculation:
1. Ungrouped data:
1) Deviation method:

S² = Σ(xi − x̄)² / (n − 1)

2) Variable square method:

S² = [Σxi² − (Σxi)²/n] / (n − 1), where xi is the variate value and n is the number of
observations.

3) Assumed mean method:

S² = [Σdi² − (Σdi)²/n] / (n − 1), where di = xi − A and A is the assumed mean.

2. Grouped data or frequency distribution:

1) Deviation method:

S² = Σfi(xi − x̄)² / (N − 1)

2) Variable square method:

S² = [Σfixi² − (Σfixi)²/N] / (N − 1)

3) Assumed mean method:

S² = [Σfidi² − (Σfidi)²/N] / (N − 1), where di = xi − A, A is the assumed mean and
fi is the frequency of the ith class.
Properties of variance:
1) If V(X) represents the variance of the X series and V(Y) represents the variance of the
Y series, and X and Y are independent, then V(X ± Y) = V(X) + V(Y).
2) Multiplying or dividing each observation by a constant will multiply or divide the
variance by the square of that constant, e.g., V(aX) = a²V(X).
3) Addition or subtraction of a constant to or from each observation will not change the
value of the variance.
Coefficient of variation:
It is a relative measure of variation and widely used to compare two or more statistical
series.
The statistical series may differ from one another with respect to their mean or
standard deviation or both. Sometimes they may also differ with respect to their units and
then their comparison is not possible. To have a comparable idea about the variability present
in them C.V% is used. It was developed by Karl Pearson.

Definition:
"It is the percentage ratio of the standard deviation to the arithmetic mean of a given
series", i.e., C.V.% = (S / x̄) × 100. It is a unitless measure.
The series for which the C.V.% is greater is said to be more variable, i.e., less
consistent, less homogeneous or less stable, while the series having the lower C.V.% is said
to be more consistent or more homogeneous.
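A minimal Python sketch of C.V.% for two hypothetical yield series (the numbers are invented for illustration); the series with the larger C.V.% is the less consistent one:

```python
def cv(data):
    """C.V.% = (S / x̄) * 100, with sample S.D. (divisor n - 1)."""
    n = len(data)
    mean = sum(data) / n
    s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
    return s / mean * 100

a = [48, 50, 52, 49, 51]                 # consistent hypothetical series
b = [30, 70, 45, 65, 40]                 # variable hypothetical series
print(round(cv(a), 2), round(cv(b), 2))  # 3.16 and 33.91
```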
Moments, Skewness and Kurtosis

The rth moment of a set of values about any constant is the mean of the rth powers of
the deviations of the values from the constant.
Moments about any constant can be found. The moments about the arithmetic mean
are called central moments; the moments about any other constant are called raw moments.
The central moments are denoted by µ1, µ2, … and the raw moments by µ1′, µ2′, ….

In the case of raw data, the rth central moment and the rth raw moment about a
constant a are:

µr = Σ(xi − x̄)^r / n and µr′ = Σ(xi − a)^r / n

In the case of frequency data, they are:

µr = Σfi(xi − x̄)^r / N and µr′ = Σfi(xi − a)^r / N, where N = Σfi.

The first four moments of a distribution are useful in the study of the distribution.

1. The first moment about zero is the arithmetic mean.
2. The second central moment is the variance (the square of the standard deviation).
3. The third central moment is a measure of skewness.
4. The fourth central moment is a measure of kurtosis.

For a frequency distribution, four constants based on the central moments are defined. They
are:

β1 = µ3² / µ2³, β2 = µ4 / µ2²

γ1 = ±√β1, γ2 = β2 − 3.

Here β1 and γ1 are measures of skewness; β2 and γ2 are measures of kurtosis.

Skewness:
In a frequency distribution, the spread of the values may be symmetrical around the
center or it may not be so. If the values are not distributed symmetrically around the center,
the distribution is said to be skew. Thus, skewness means asymmetry or non-symmetry (lack
of symmetry).
Coefficient of skewness is a measure which indicates the degree of skewness, it may
be positive, zero or negative. It will be positive if the right tail of the distribution is longer
than the left tail. It will be negative if the left tail is longer than right tail. For a symmetrical
distribution, the coefficient of skewness will be zero. According as coefficient of skewness is
positive or negative, the distribution is said to be positively or negatively skew.
The mean, median and mode of a symmetrical distribution are equal. For such a
distribution, the lower and upper quartiles are equidistant from the median. For a positively
skewed distribution, the median and the mode are less than the mean; for a negatively
skewed distribution, the median and the mode are greater than the mean.

[Figures: a symmetrical distribution; a left-skewed (negatively skewed) distribution; a right-skewed (positively skewed) distribution.]

Coefficient of skewness:

I. Karl Pearson's coefficient of skewness:

Sk = (Mean − Mode) / S.D.

When the mode is ill-defined, Karl Pearson's coefficient is:

Sk = 3(Mean − Median) / S.D.

II. Bowley's coefficient of skewness, which is based on the quartiles, is:

Sk = (Q3 + Q1 − 2 Median) / (Q3 − Q1)

III. The coefficient based on the moments:

β1 = µ3² / µ2³

This coefficient is more exact in indicating the degree of skewness of a distribution.
Usually, this measure is used in the advanced study of distributions.

Example-1 Calculate the Karl Pearson’s coefficient of skewness for the distribution having
mean 83.8, mode 82.64 and S.D 2.336.

Solution:

Skp = (Mean − Mode) / S.D. = (83.8 − 82.64) / 2.336 = 0.496.

Thus, the coefficient of skewness is positive. Therefore, the distribution is positively skewed.

Example-2 For a frequency distribution, the sum of upper and the lower quartiles is 25. Their
difference is 13. The median is 10. Find the coefficient of skewness.

Solution:

Here, Q3 + Q1 = 25, Q3 − Q1 = 13 and Median = 10.

Therefore, Bowley's coefficient of skewness is:

Sb = (Q3 + Q1 − 2 Median) / (Q3 − Q1) = (25 − 20) / 13 = 0.3846.

Thus, the coefficient of skewness is positive. Therefore, the distribution is positively skewed.
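Both coefficients can be evaluated directly. A short Python sketch using the figures of the two examples above (the division is carried to more decimal places than the text):

```python
# Karl Pearson's coefficient: (mean - mode) / S.D.
sk_pearson = (83.8 - 82.64) / 2.336
print(round(sk_pearson, 4))                  # 0.4966

# Bowley's coefficient: (Q3 + Q1 - 2*median) / (Q3 - Q1)
q_sum, q_diff, median = 25, 13, 10           # Q3 + Q1, Q3 - Q1, median
sk_bowley = (q_sum - 2 * median) / q_diff
print(round(sk_bowley, 4))                   # 0.3846
```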

Kurtosis:

A frequency distribution may show high concentration at the center compared to that
at the extremities. On the other hand, another frequency distribution may show almost equal
concentration throughout the range. Here, the first distribution is said to have high kurtosis
compared to the latter.
Kurtosis means peakedness (non-flatness). The coefficient of kurtosis is a measure
which indicates the degree of peakedness of the distribution. The constant β2 = µ4 / µ2² is
considered to be the coefficient of kurtosis.


The normal distribution is taken as a standard for the measurement of kurtosis. A
distribution which is as peaked as the normal distribution is called as meso-kurtic
distribution. A distribution which is more peaked than normal distribution is called as lepto-
kurtic distribution. A distribution which is less peaked than the normal distribution is called
platy-kurtic distribution.
For a meso-kurtic distribution β2 = 3, for a lepto-kurtic distribution β2 > 3 and for a
platy-kurtic distribution β2 < 3. The value γ2 = β2 − 3 can very well be considered to be a
measure of kurtosis.

Example: In a frequency distribution, the first four central moments are 0, 4, -2 and 2.4.
Comment on the skewness and kurtosis of the distribution.

Solution:
Here, µ1 = 0, µ2 = 4, µ3 = −2 and µ4 = 2.4.
Therefore,

β1 = µ3² / µ2³ = (−2)² / 4³ = 4/64 = 0.0625.

β2 = µ4 / µ2² = 2.4 / 16 = 0.15.

Since µ3 is negative, the distribution is negatively skewed; and since β1 = 0.0625 is
very small, the distribution is only slightly skewed. Also, since β2 = 0.15 < 3, the
distribution is platy-kurtic.
Probability

The concept of probability is difficult to define in precise terms. In ordinary
language, the word probable means likely (or) chance. Generally, the word probability is
used to denote the likelihood of the occurrence of a certain event, based on past experience.
By looking at a clear sky, one will say that there will not be any rain today; on the other
hand, by looking at a cloudy or overcast sky, one will say that there will be rain today. In the
former case we assert that there will not be rain and, in the latter, we expect rain. A
mathematician says that the probability of rain is 0 in the first case and 1 in the second
case. In between 0 and 1, there are fractions denoting the chance of the event occurring. In
ordinary language, the word probability means uncertainty about happenings; in Mathematics
and Statistics, a numerical measure of uncertainty is provided by the important branch of
statistics called the theory of probability. Thus, we can say that the theory of probability
describes certainty by 1 (one), impossibility by 0 (zero) and uncertainties by coefficients
which lie between 0 and 1.

Trial and Event: An experiment which, though repeated under essentially identical (or
same) conditions, does not give unique results but may result in any one of several possible
outcomes. Performing such an experiment is known as a trial, and the outcomes of the
experiment are known as events.

Example:

1. Seed germination: the seed either germinates or does not germinate; both are events.

2. In a lot of 5 seeds, none may germinate (0), or 1, 2, 3, 4 or all 5 may germinate.

Random experiment:
Random experiment is an experiment which may not result in the same outcome when
repeated under the same conditions. It is an experiment which does not have a unique
outcome.
For example:
1. The experiment of Toss of coin is a random experiment. It is so because when a coin
is tossed the result may be head or it may be tail.
2. The experiment of drawing a card randomly from a pack of playing cards is a random
experiment. Here, the result of the draw may be any one of the 52 cards.
Sample space (S): The set of all possible outcomes of an experiment is called the sample
space. For example, if a set of five seeds is sown in a plot, none may germinate, or 1, 2, 3, 4
or all five may germinate, i.e., the possible outcomes are {0, 1, 2, 3, 4, 5}. This set of
numbers is called the sample space. Each possible outcome (or element) in a sample space is
called a sample point.

Exhaustive Events: The total number of possible outcomes in any trial is known as
exhaustive events (or) exhaustive cases.

Example:

1. When pesticide is applied a pest may survive or die. There are two exhaustive cases
namely (survival, death)

2. In throwing of a die, there are six exhaustive cases, since anyone of the 6 faces 1, 2, 3, 4, 5,
6 may come uppermost.

Favourable Events: The number of cases favourable to an event in a trial is the number of
outcomes which entail the happening of the event.

Example:

1. When a seed is sown, if we are observing non-germination, then non-germination is the
favourable event; if we are interested in germination of the seed, then germination is the
favourable event.

Equally likely events:


Two or more events are equally likely if they have equal chance of occurrence. That
is, equally likely events are such that none of them has greater chance of occurrence than
others.
Example-1: While tossing a fair coin, head and tail are equally likely event.
Example-2: While throwing a fair die, the events A={2,4,6}, B={1,3,5} and C={1,2,3} are
equally likely events.

Mutually exclusive events:


Two or more events are mutually exclusive if only one of them can occur at a time.
That is, the occurrence of any of these events totally excludes the occurrence of the other
events. Mutually exclusive events cannot occur together.
Example-1: While tossing a coin, the outcomes head and tail are mutually exclusive because
when the coin is tossed once, the result cannot be head as well as tail.
Example-2: while throwing a die, the events A= {2,4,6}, B= {3,5} and C= {1} are mutually
exclusive.
Independent Events: Several events are said to be independent if the happening of an event
is not affected by the happening of one or more events.
Example: When two seeds are sown in a pot and one seed germinates, this does not affect the
germination or non-germination of the second seed: one event does not affect the other event.
Dependent Events: If the happening of one event is affected by the happening of one or
more events, then the events are called dependent events.
Example: If we draw a card from a pack of well shuffled cards, if the first card drawn is not
replaced then the second draw is dependent on the first draw.
Note: In the case of independent (or) dependent events, the joint occurrence is possible.

Definition of Probability Mathematical (or) Classical (or) a-priori Probability


If an experiment results in 'n' exhaustive cases which are mutually exclusive and
equally likely, out of which 'm' cases are favourable to the happening of an event 'A',
then the probability 'p' of the happening of 'A' is given by:

p = P(A) = m/n = (number of favourable cases) / (total number of exhaustive cases)

Note:
1. If m = 0 ⇒ P(A) = 0, then 'A' is called an impossible event; i.e., also P(φ) = 0.
2. If m = n ⇒ P(A) = 1, then 'A' is called a sure (or certain) event.
3. The probability is a non-negative real number and cannot exceed unity, i.e., it lies
between 0 and 1.
4. The probability of the non-happening of the event 'A' is P(Ā), denoted by 'q':

P(Ā) = (n − m)/n = 1 − m/n = 1 − P(A)

⇒ q = 1 − p
⇒ p + q = 1, or
P(A) + P(Ā) = 1.

Statistical (or) Empirical Probability (or) a-posteriori Probability


If an experiment is repeated a number (n) of times and an event 'A' happens 'm' times,
then the statistical probability of 'A' is given by:
P = P(A) = m/n, as n becomes indefinitely large.
Axioms for Probability:
1. The probability of an event ranges from 0 to 1. If the event cannot take place, its
probability shall be 0; if it is certain, its probability shall be 1.
Let E1, E2, …, En be any events; then P(Ei) ≥ 0.
2. The probability of the entire sample space is 1, i.e., P(S) = 1.
Total probability = 1.

3. If A and B are mutually exclusive (or) disjoint events then the probability of
occurrence of either A (or) B denoted by P(AUB) shall be given by,
P(A∪B) = P(A) + P(B)
P (E1∪E2∪…. ∪En) = P(E1) + P(E2) +……+ P(En)
If E1, E2, …., En are mutually exclusive events.

Example 1: Two dice are tossed. What is the probability of getting


(i) Sum 6 (ii) Sum 9?
Solution: When 2 dice are tossed, the exhaustive number of cases is 36.
(i) Sum 6 = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}
∴ Favourable number of cases = 5

P(Sum 6) = 5/36

(ii) Sum 9 = {(3, 6), (4, 5), (5, 4), (6, 3)}
∴ Favourable number of cases = 4
P(Sum 9) = 4/36 = 1/9
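These counts can be verified by enumerating the sample space. A minimal Python sketch:

```python
from itertools import product

# All 36 equally likely outcomes of tossing two dice
outcomes = list(product(range(1, 7), repeat=2))
for target in (6, 9):
    favourable = sum(a + b == target for a, b in outcomes)
    print(target, favourable, "/", len(outcomes))   # 6: 5/36, 9: 4/36
```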

Example 2: A card is drawn from a pack of cards. What is a probability of getting


(i) a king (ii) a spade (iii) a red card (iv) a numbered card?
Solution:
There are 52 cards in a pack.
One can be selected in 52C1 ways.
∴ Exhaustive number of cases = 52C1 = 52.
(i) A king
There are 4 kings in a pack.
One king can be selected in 4C1 ways.
∴ Favourable number of cases = 4C1 = 4

Hence the probability of getting a king = 4/52 = 1/13

(ii) A spade
There are 13 spades in a pack.
One spade can be selected in 13C1 ways.
∴ Favourable number of cases = 13C1 = 13

Hence the probability of getting a spade = 13/52 = 1/4

(iii) A red card
There are 26 red cards in a pack.
One red card can be selected in 26C1 ways.
∴ Favourable number of cases = 26C1 = 26

Hence the probability of getting a red card = 26/52 = 1/2

(iv) A numbered card
There are 36 numbered cards (2 to 10 in each of the four suits) in a pack.
One numbered card can be selected in 36C1 ways.
∴ Favourable number of cases = 36C1 = 36

Hence the probability of getting a numbered card = 36/52 = 9/13

Conditional Probability:
Two events A and B are said to be dependent when B can occur only when A is
known to have occurred (or vice versa). The probability attached to such an event is called
the conditional probability, denoted by P(A/B) (read as: A given B); in other words, it is the
probability of A given that B has occurred.

P(A/B) = P(A∩B) / P(B), provided P(B) > 0.

If two events A and B are dependent, then the conditional probability of B given A is:

P(B/A) = P(A∩B) / P(A), provided P(A) > 0.

Addition theorem on Probability:


Let A and B be any two events with respective probabilities P(A) and P(B). Then the
probability of occurrence of at least one of these events is:
1. If the two events are not mutually exclusive:
P(A∪B) = P(A) + P(B) − P(A∩B)
2. If the two events are mutually exclusive:
P(A∪B) = P(A) + P(B)

Example-1: If A is the event “drawing an ace from a deck of cards” and B is the event
“drawing a King”. Find the probability of getting either ace or king.

Solution:

P(A) = 4/52, P(B) = 4/52

Then the probability of drawing either an ace or a king in a single draw is:

P(A∪B) = P(A) + P(B) = 4/52 + 4/52 = 8/52 = 2/13

Example-2: If A is the event “drawing an ace from a deck of cards” and B is the event
“drawing a spade”. Find the probability of getting either an ace or spade.

Solution:

A and B are not mutually exclusive, since the ace of spades can be drawn. Thus,
the probability of drawing either an ace or a spade is:

P(A∪B) = P(A) + P(B) − P(A∩B) = 4/52 + 13/52 − 1/52 = 16/52 = 4/13

Example-3: A class contains 10 men and 20 women, of whom half the men and half the
women have brown eyes. Find the probability P that a person chosen at random is a man or
has brown eyes.
Solution:
Let A = the person is a man and B = the person has brown eyes.

We know that P(A) = 10/30 = 1/3, P(B) = 15/30 = 1/2 and P(A∩B) = 5/30 = 1/6.

Therefore, P(A∪B) = P(A) + P(B) − P(A∩B) = 1/3 + 1/2 − 1/6 = 2/3.
Multiplication theorem on Probability:
Let A and B be two events with respective probability P(A) and P(B). let P(B|A) be
the conditional probability of event B given the event A has happened. Then, the probability
of simultaneous occurrence of A and B is given by:
1. If A and B be any two events which are not independent, (i.e.) dependent.
P (A and B) = P (A∩B) = P (AB) = P (A). P (B/A)
2. If A and B be any two events which are independent.
P (A and B) = P (A∩B) = P (AB) = P (A) X P (B)
Example-1: If A is the event "getting a head in the second toss" and B is the event "getting
a head in the third toss" of a coin, find the probability of getting heads on both the 2nd and
3rd tosses.
Solution:
P(A∩B) = P(A) × P(B) = 1/2 × 1/2 = 1/4
Example-2: If the probability that A will be alive in 20 years is 0.7 and the probability that B
will be alive in 20 years is 0.5, find the probability that they will both be alive in 20 years.
P(A∩B) = P(A) × P(B) = 0.7 × 0.5 = 0.35.
Theoretical Distributions
Random Variable
Random variable is a function which assigns a real number to every sample point in
the sample space. The set of such real values is the range of the random variable.
There are two types of random variables:
1. Discrete random variable:
A variable X which takes values in the whole numbers is called a discrete random variable.
E.g.: number of students in a class, number of fruits per plant, etc.
2. Continuous random variable:
A random variable whose range is uncountably infinite is a continuous random
variable.
E.g.: plant height, weight of a person, etc.

Probability mass function:


Let X be a discrete random variable, and let p(x) be a function such that p(x) =
P[X = x]. Then, p(x) is the probability mass function of X.
Here, (i). p(x) ≥ 0 for all x;
(ii). ∑ p(x)=1
A similar function is defined for a continuous random variable X is called as
probability density function.
Mathematical expectation:
Let X be a discrete random variable with probability mass function p(x). then,
mathematical expectation of X is given by:
E(X) = Σ x p(x), the sum being taken over all values x of X.

Example: Two coins are tossed once. Find the mathematical expectation of the number of
heads obtained.
Solution:
Let X denotes the number of heads obtained. Then, X is a random variable which
takes the values 0, 1 and 2 with respective probabilities 0.25, 0.5 and 0.25. That is
x 0 1 2

p(x) ¼ ½ ¼
The mathematical expectation of the number of heads is:
E(X) = Σ x p(x) = 0 × 0.25 + 1 × 0.5 + 2 × 0.25 = 1.
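A minimal sketch (an illustration, not from the original notes; plain Python) of this computation:

```python
# E(X) = Σ x·p(x) for the number of heads in two coin tosses.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}              # p(x) for x = 0, 1, 2

expectation = sum(x * p for x, p in pmf.items())
print(expectation)                             # 1.0 head on average
```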
Theoretical distributions are:
1. Binomial distribution (a discrete probability distribution)
2. Poisson distribution (a discrete probability distribution)
3. Normal distribution (a continuous probability distribution)
Discrete Probability distribution


Bernoulli distribution
A random variable X which takes the two values 1 and 0 with probabilities p and q, i.e., P(X=1) =
p and P(X=0) = q, where q = 1 − p, is called a Bernoulli variate and is said to follow the Bernoulli
distribution, where p and q are the probabilities of success and failure. It was given by the Swiss
mathematician James Bernoulli (1654-1705).
Example
• Tossing a coin (head or tail)
• Germination of seed (germinate or not)
Binomial distribution
Binomial distribution was discovered by James Bernoulli (1654-1705). Let a random
experiment be performed repeatedly and let the occurrence of an event in a trial be called a
success and its non-occurrence a failure. Consider a set of n independent trials (n being
finite), in which the probability p of success in any trial is constant for each trial. Then q = 1 − p
is the probability of failure in any trial.
Consider the probability of x successes, and consequently n − x failures, in n independent
trials. The x successes in n trials can occur in nCx ways, and the probability for each of these
ways is p^x q^(n−x).
Definition:
A random variable x is said to follow binomial distribution if it assumes nonnegative values
and its probability mass function is given by
P(X = x) = p(x) = nCx p^x q^(n−x),  x = 0, 1, 2, …, n;  q = 1 − p
         = 0, otherwise
The two independent constants n and p in the distribution are known as the
parameters of the distribution.
Conditions for Binomial distribution
We get the binomial distribution under the following experimentation conditions
1. The number of trial n is finite
2. The trials are independent of each other.
3. The probability of success p is constant for each trial.
4. Each trial must result in a success or failure.
5. The events are discrete events.
Properties of binomial distribution:
1. Binomial distribution has two parameters, n and p (or q).
2. The mean of the binomial distribution is np and variance is npq.
3. The shape of the distribution depends on the values of p and q. If p = q, the shape is
symmetrical; otherwise it is asymmetrical, but the asymmetry decreases as n increases.
4. β-coefficients:

a. β1 = (q − p)² / (npq)

b. β2 = 3 + (1 − 6pq) / (npq)

5. Binomial distribution tends to the normal distribution as n increases. The normal
approximation is good enough if the mean np is greater than 15 for p = 0.5.
Example: Consider a simple trial of tossing a perfectly balanced coin six
times. Then calculate the probability of getting:
1. Exactly three heads.
2. At least three heads.
3. Not more than two heads.
Solution: For the given problem, n = 6, p = q = 1/2 = 0.5.

1. P(exactly three heads) = P(X = 3) = 6C3 (0.5)³(0.5)³ = 20/64 = 0.3125

2. P(at least three heads) = P(X ≥ 3)
   = P(X = 3) + P(X = 4) + P(X = 5) + P(X = 6)
   = (20 + 15 + 6 + 1)/64 = 42/64 = 0.6563

3. P(not more than two heads) = P(X ≤ 2)
   = P(X = 0) + P(X = 1) + P(X = 2)
   = (1 + 6 + 15)/64 = 22/64 = 0.3438
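These binomial probabilities are easy to reproduce. A minimal sketch (an illustration, not from the original notes; standard-library Python):

```python
# Binomial p.m.f.: P(X = x) = nCx p^x q^(n-x), for the coin-tossing example.
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 6, 0.5
print(binom_pmf(3, n, p))                             # exactly three heads  -> 0.3125
print(sum(binom_pmf(x, n, p) for x in range(3, 7)))   # at least three heads -> 0.65625
print(sum(binom_pmf(x, n, p) for x in range(0, 3)))   # not more than two    -> 0.34375
```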

Application
1. Quality control measures and sampling process in industries to classify items as
defectives or non-defective.
2. Medical applications such as success or failure, cure or no-cure.

Poisson Distribution:
A random variable which counts occurrences of an event such that at most one occurrence
can take place in an interval of time, howsoever small, is known as a Poisson variable.
The Poisson distribution is the limiting form of the binomial probability distribution
when n becomes infinitely large and p approaches 0 in such a way that np = λ remains
constant. Such situations are fairly common; that is to say, a Poisson distribution may be
expected in cases where the chance of any individual event being a success is rare. Some
examples of Poisson variables are:
1. Number of mistakes in a typed page;
2. Number of cars parked at a place in an hour, say between 10.00 AM and 11.00
AM;
3. Number of defects in the insulation of a fifty-meter length of wire;
4. Number of suicides in a certain period in a city or town;
5. Occurrence of rare events such as serious floods, droughts etc.
Like binomial distribution, the variate of the Poisson distribution is also a discrete
one. The probability function is given by:

P(x) = (e^(−λ) λ^x) / x!,  for x = 0, 1, 2, . . .

= 0 , otherwise.
Where, λ is the average number of occurrences per unit of time
λ = np
Condition for Poisson distribution
Poisson distribution is the limiting case of binomial distribution under the following
assumptions.
1. The number of trials n should be indefinitely large, i.e., n → ∞.
2. The probability of success p for each trial should be indefinitely small.
3. np = λ should be finite, where λ is a constant.
Properties of Poisson distribution:
1. Poisson distribution has mean λ and its variance is also λ. It is the only distribution
known so far, of which the mean and variance are equal.
2. Poisson distribution possesses only one parameter, λ.
3. β-coefficients:
a. β1 = 1/λ

b. β2 = 3 + 1/λ

Application
1. It is used in quality control statistics to count the number of defects of an item.
2. In biology, to count the number of bacteria.
3. In determining the number of deaths in a district in a given period, by rare disease.
4. The number of error per page in typed material.
5. The number of plants infected with a particular disease in a plot of field.
6. Number of weeds of a particular species in different plots of a field.
Example-1
Suppose at a particular place, the average number of cars parked per hour is 3. Under the Poisson
model, calculate the probability of 5 cars parked in a particular hour.

Solution: P(X = 5) = e⁻³ 3⁵ / 5! = (0.0498 × 243)/120 = 0.101

Example-2
The number of mistakes counted in one hundred typed pages of a typist revealed that he made
2.8 mistakes on an average per page. Calculate the probability that, in a page typed by him,
1. There is no mistake.
2. There are two or fewer mistakes.
Solution: Given that λ = 2.8.

1. P(X = 0) = e^(−2.8) = 0.061.

2. P(X ≤ 2) = P(X=0) + P(X=1) + P(X=2) = e^(−2.8)(1 + 2.8 + 2.8²/2) = 0.061(1 + 2.8 + 3.92) = 0.471
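A minimal sketch (an illustration, not from the original notes; standard-library Python) of the Poisson p.m.f. applied to this example:

```python
# Poisson p.m.f.: P(X = x) = e^(-λ) λ^x / x!, for the typing-mistakes example.
from math import exp, factorial

def poisson_pmf(x, lam):
    """Probability of exactly x events when the mean rate is lam."""
    return exp(-lam) * lam**x / factorial(x)

lam = 2.8
print(poisson_pmf(0, lam))                           # no mistake   -> 0.0608
print(sum(poisson_pmf(x, lam) for x in range(3)))    # two or fewer -> 0.4695
# (the text gets 0.471 because it rounds e^(-2.8) to 0.061 first)
```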


Normal Distribution
Normal distribution is the most popular and commonly used distribution. It was
discovered by A De Moivre in 1718; about twenty years after Bernoulli gave binomial
distribution. A random variable X is said to follow normal distribution, if and only if, its
probability density function (p.d.f) is given by:

f(x) = (1/(σ√(2π))) e^(−(x − µ)²/(2σ²));  −∞ < x < ∞, −∞ < µ < ∞, σ > 0.

The variable X (a continuous random variable) is said to be distributed normally with

mean µ and variance σ², i.e. X ~ N(µ, σ²). The density function has two parameters, namely µ
and σ². Here µ can take any real value in the range −∞ to ∞, whereas σ is any positive real value,
i.e., σ > 0. Since the probability can never be negative, f(x) ≥ 0 for all x.
Condition of Normal Distribution:
1. Normal distribution is a limiting form of the binomial distribution under the following
conditions.
a. n, the number of trials is indefinitely large
b. Neither p nor q is very small.
2. Constants of the normal distribution are: mean = µ, variance = σ², standard deviation = σ.

Properties of normal distribution:


1. If we plot the curve of the standard normal variate against the corresponding
probability densities, it is bell-shaped and symmetrical about the vertical
line at z = 0.
2. The area under the normal curve within the limits -∞ to ∞ is unity.
3. On either side of the mean µ, the frequency decreases rapidly within the range
(µ ± σ) and more and more slowly as it goes away from the mean. The frequencies are
extremely small beyond the distance of ±3σ. As a matter of fact, 99.73 percent of the units of
the population lie within the range µ ± 3σ.
4. The curve is asymptotic to the x-axis.
5. As the value of σ increases, the curve becomes more and more flat and vice versa.
6. If the variable does not follow the normal distribution, it can be made to follow after
making suitable transformation like square root, arcsine, logarithm etc.
Standard Normal distribution:
Let X be a random variable which follows the normal distribution with mean µ and
variance σ². The standard normal variate is defined as Z = (X − µ)/σ, which follows the standard

normal distribution with mean 0 and standard deviation 1, i.e., Z ~ N(0, 1). The standard
normal distribution is given by:

φ(z) = (1/√(2π)) e^(−z²/2);  −∞ < z < ∞

The advantage of the above function is that it doesn’t contain any parameter. This
enables us to compute the area under the normal probability curve.
Example 1: In a normal distribution whose mean is 12 and standard deviation is 2. Find the
probability for the interval from x = 9.6 to x = 13.8
Solution:
Given that X ~ N(12, 2²), i.e., µ = 12 and σ = 2.
P(9.6 ≤ X ≤ 13.8) = P((9.6 − 12)/2 ≤ Z ≤ (13.8 − 12)/2) = P(−1.2 ≤ Z ≤ 0.9)

= P (-1.2 ≤ Z ≤ 0) + P (0 ≤ Z ≤ 0.9)
= P (0≤ Z ≤ 1.2) +P (0 ≤ Z ≤ 0.9) [by using symmetric property]
=0.3849 +0.3159
=0.7008
Converted to a percentage, about 70% of the observations lie between 9.6 and 13.8.
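The same area can be obtained from the standard normal c.d.f. A minimal sketch (an illustration, not from the original notes; standard-library Python, using the error function):

```python
# P(9.6 <= X <= 13.8) for X ~ N(12, 2^2), via the standard normal c.d.f.
from math import erf, sqrt

def std_normal_cdf(z):
    """P(Z <= z) for Z ~ N(0, 1), expressed through the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 12, 2
z_low, z_high = (9.6 - mu) / sigma, (13.8 - mu) / sigma
print(std_normal_cdf(z_high) - std_normal_cdf(z_low))   # ≈ 0.7008
```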
Sampling Method and Sampling Distribution

Sampling: It is defined as the method of collection of samples from the population.

Sample: Sample is a part or fraction of a population selected on some basis. Sample consists
of a few items of a population. In principle a sample should be such that it is a true
representative of the population.

Sampling method: By sampling method we mean the manner or scheme through which the
required number of units is selected in a sample from a population.

Sampling unit: The constituents of a population, which are the individuals to be sampled from
the population and cannot be further subdivided for the purpose of sampling, are called
sampling units. For instance, to know the average income per family, the head of the family is
a sampling unit. To know the average yield of wheat, each farm owner's field of wheat is a
sampling unit.

Sampling distribution of sample mean:


When different random samples are drawn and the sample mean or variance is computed,
it will not be the same for all samples. Consider an artificial example, where the population
has 4 units 1, 2, 3, 4 possessing the values 2, 3, 4, 6 for the study variable. Then we will have
6 possible samples of size 2 without replacement, and 16 with replacement. The possible samples
with replacement, along with their sample means, are given in the table below.
Serial number | Possible sample | Sample mean || Serial number | Possible sample | Sample mean
1 | (1,1) | 2    || 9  | (3,1) | 3
2 | (1,2) | 2.5  || 10 | (3,2) | 3.5
3 | (1,3) | 3    || 11 | (3,3) | 4
4 | (1,4) | 4    || 12 | (3,4) | 5
5 | (2,1) | 2.5  || 13 | (4,1) | 4
6 | (2,2) | 3    || 14 | (4,2) | 4.5
7 | (2,3) | 3.5  || 15 | (4,3) | 5
8 | (2,4) | 4.5  || 16 | (4,4) | 6


Though the sample means are not the same from sample to sample by either method,
the average of the sample means in both cases is 3.75, which is the population mean. The
variance of the sample means when the sampling is with replacement is 1.09, and it is 0.73
when the sampling is without replacement. This example shows that the sample means for
different samples are clustered closer when the sampling is without replacement rather than
with replacement. Thus one may get smaller sampling fluctuations in the sample
mean when the sampling is done without replacement, and this procedure is normally
recommended.
The theory of sampling with replacement is the same as that of sampling from an infinite
population. Let a simple random sample of n units be selected from the population and let the
observations on the sampled units be x1, x2, . . ., xn. Let x̄ = Σxi/n be the sample mean and
s² = Σ(xi − x̄)²/(n − 1) be the sample variance. Let µ be the population mean and σ² the
population variance based on all N observations. Then the sample mean x̄ is an unbiased
estimator of µ, in the sense that the average of the sample means on repeated drawing equals
the population mean µ. The variance of the sample means is σ²/n if the sampling is with
replacement, and it is ((N − n)/(N − 1))(σ²/n) if the sampling is without replacement. The
square root of the variance of the sample means is called the standard error of the sample
mean and is denoted by S.E.(x̄).

If the sample is taken from a normal population, the distribution of the sample mean is
normal, even for small values of n. When the population from which the sample is drawn is
non-normal, for large values of n, the central limit theorem ensures that the sample mean will
be normally distributed. For all practical purposes, we can treat x̄ to be normally distributed
with mean µ and standard deviation σ/√n.
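This sampling distribution can be enumerated directly. A minimal sketch (an illustration, not from the original notes; standard-library Python):

```python
# Enumerate all samples of size 2 from the population {2, 3, 4, 6} and verify
# the mean and variance of the sample means quoted above.
from itertools import product, permutations
from statistics import mean, pvariance

population = [2, 3, 4, 6]

with_repl = [mean(s) for s in product(population, repeat=2)]     # 16 samples
without_repl = [mean(s) for s in permutations(population, 2)]    # ordered pairs, no repeats

print(mean(with_repl), mean(without_repl))   # both 3.75 = the population mean
print(pvariance(with_repl))                  # 1.09375 (with replacement)
print(pvariance(without_repl))               # 0.7292  (without replacement)
```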

Different sampling techniques or sampling plans or sampling designs:

1. Simple random sampling


2. Systematic sampling
3. Stratified sampling
4. Cluster sampling
Simple random sampling:
In simple random sampling, the sample is selected in such a way that every member of
the population has an equal and independent chance of being selected in the sample. It
implies that the selection of a sample from all possible samples that could be chosen is equally
likely.
It is used when the population is homogeneous in nature.

SRS is of two types:


1. SRS with replacement method.
2. SRS without replacement method.

If the units are selected or drawn one by one in such a way that a unit drawn at a time is
replaced back to the population before the subsequent draw, it is known as simple random
sampling with replacement method. In this type of sampling from a population of size N, the
probability of selection of a unit at each draw remains 1/N. In this method, a unit can be
included more than once in a sample. Therefore, if the required sample size is n, the effective
sample size is sometimes less than n due to the inclusion of one or more units more than
once.
In the SRS without replacement method, a unit selected once is not returned to the population for
any subsequent draw. Hence, the probability of drawing a unit from a population of N units at the
rth draw is 1/(N − r + 1).
Random selection of units is done using any one of the following methods:
1. Using tickets, tags, etc.
2. Using random number table.

Using tickets, tags etc.:


To give an example, suppose that an experimenter wishes to draw a random sample of
10 individuals (say 10 plants for measuring heights) from a finite population of, say,
200 plants. A method of doing this would be to assign a number to each member of the
population, put the numbered tickets or tags into a box, and mix them thoroughly. Draw ten
tickets or tags from the box. The numbers on these ten tickets or tags correspond to the plants
to be selected. This method is costly, time-consuming and labor-intensive.

Random number table method:


The above procedure can be shortened by the use of a table of random numbers. Such a
table consists of numbers chosen in a fashion similar to drawing numbered tickets or tags out
of a box. This table is so made that all digits 0, 1, 2, . . ., 9 appear with approximately the same
frequency. By combining the digits in groups of three we have three-digit numbers from 000 to 999,
etc.
The table should be entered in a random manner. Put a pencil aimlessly on a page of
the table. The point thus obtained on page is the starting point for selecting the numbers,
record the numbers until the required number of random digits is obtained.

Examples of SRS:

1. Impact of T & V system on socio-economic status of summer groundnut growers in a


specific taluk.
Here population is summer groundnut growers registered under T & V system of a
given taluk. One can easily prepare the frame for this finite population and select the random
sample of summer groundnut growers.

2. Constraints analysis for milk: productivity in a village


Here the population is milk producers in a given village, the frame for which can be
prepared easily, and a random sample can be taken to find out the factors responsible for low
productivity.
Advantages:
1. Equal chance of selection.
2. Easy method
3. If applied appropriately, simple random sampling is associated with the minimum
amount of sampling bias compared to other sampling methods.
4. Given the large sample frame is available, the ease of forming the sample group
i.e. selecting samples is one of the main advantages of simple random sampling.
5. Research findings resulting from the application of simple random sampling can
be generalized due to representativeness of this sampling technique and a little
relevance of bias.
Disadvantages:
1. Costly and time-consuming for large studies.
2. Complete accounting of population needed.
3. Cumbersome to provide unique designations to every population member.
4. Very inefficient when applied to skewed population distribution.

Stratified random sampling:


In this scheme the population is sub-divided into several groups called strata and
then samples are drawn independently (at random) from each stratum.
As the sampling variance of the estimate of the mean depends on the within-strata
variation, the stratification of a heterogeneous population into homogeneous strata helps in
increasing the precision of the estimates. For example, while studying the average income of
the staff members of Gujarat Agricultural University, one has to employ stratified random
sampling, because simple random sampling may result in under- or over-estimation of the
income of the staff members. Suppose only professors are selected for the sample; then the
average income will be above the true average value. Similarly, if only helpers are selected in
the sample, then the result will be on the lower side.
When such a heterogeneous population is required to be sampled, one has to utilize
stratified sampling instead of simple random sampling.
Example:
1. Adoption level of improved agro technology by cultivators.
2. A survey status of school going student.
3. Socio-economic survey for rural and urban people of Valsad district.
Advantages:
1. Provides greater precision than a simple random sampling.
2. Assures representation of all groups in sample population.
3. Characteristics of each stratum can be estimated and comparisons made.
Disadvantages:
1. Requires accurate information on proportions of each stratum.
2. Costly method.

Systematic sampling:
In this method of sampling, the first unit is selected with the help of simple random
sampling and the remaining units are selected automatically according to a predetermined
pattern. This method is known as systematic sampling. It is a way to select a probability-
based sample from a directory or list. This method is more efficient than simple random
sampling. Because of its simplicity, systematic sampling is popular with researchers.
Advantages:
1. It is easier to draw a sample and often easier to execute it without mistakes.
2. This is more advantageous when the drawing is done in fields and offices as there
may be substantial saving in time.
3. The cost is low and the selection of units is simple.
4. Much less training is needed for surveyors to collect units through systematic
sampling.
5. The systematic sample is spread more evenly over the population.
6. More precise than simple random sampling.
Disadvantages:
1. Systematic sampling can be applied only if the complete list of population is
available.
2. Greater risk of data manipulation.
3. Can be imprecise and inefficient if the population being sampled is heterogeneous.

Cluster sampling:
Method by which the population is divided into groups (clusters), any of which can be
considered a representative sample. These clusters are mini-populations and therefore are
heterogeneous. Once clusters are established a random draw is done to select one (or more)
clusters to represent the population.
Steps for cluster sampling:
1. Divide the whole population into clusters according to some well defined rule.
2. Treat the clusters as sampling units.
3. Choose a sample of clusters according to some procedure.
4. Carry out a complete enumeration of the selected clusters, i.e., collect information on
all the sampling units available in selected clusters.
Advantages:
1. Economic efficiency.
2. Faster and less expensive than SRS
3. Does not require a list of all members of the universe
Disadvantage:
1. Commonly has higher sampling error

Sampling and non-sampling errors:


The inaccuracies or errors in any statistical investigation, i.e. in the collection, processing,
analysis and interpretation of the data, may be broadly classified as follows:
1. Sampling errors.
2. Non-sampling errors.

Sampling errors:
In a sample survey, since only a small portion of the population is studied, its
results are bound to differ from the census results and hence have a certain amount of error.
This error would be there even if the sample is drawn at random and is highly
representative. This error is attributed to fluctuations of sampling. Sampling error is due
to the fact that only a subset of the population (i.e. a sample) has been used to estimate the
population parameters and draw inferences about the population. Thus, sampling error is
present only in a sample survey, while it is completely absent in census surveys.
Sampling error may be due to following reasons:
1. Faulty selection of the sample.
2. Substitution.
3. Faulty demarcation of sampling units.
4. Error due to bias in the estimation method.
5. Variability of population.
Non-sampling errors:
Non-sampling errors are not attributed to chance and are a consequence of certain
factors which are within human control. In other words, they are due to certain causes which
can be traced and may arise at any stage of the inquiry, viz. planning and execution of the
survey and collection, processing and analysis of the data. This error is present in both
sample surveys and census surveys.

Some of the important factors responsible for non-sampling errors are as under:

1. Faulty planning including vague and faulty definition of the population of the
statistical units to be used, incomplete list of population members.
2. Vague and imperfect questionnaire which might result in incomplete or wrong
information.
3. Defective methods of interviewing and asking questions.
4. Vagueness about the type of the data to be collected.
5. Personal bias of the investigator.
6. Lack of trained and qualified investigators and lack of supervisory staff.
7. Failure of respondent’s memory to recall the events or happenings in the past.
8. Non response and inadequate response.
9. Improper coverage.
10. Compiling errors.
Testing of hypothesis.
A hypothesis is an assertion or conjecture about the parameter(s) of population
distribution(s). (or) Hypothesis is the tentative statement about something.
Parameter: Constant of the population is called as Parameter.
Statistic: Constant of a sample is called as statistic.
Types of hypothesis:
Null hypothesis: Ho
A hypothesis which is to be actually tested for acceptance or rejection is termed as
null hypothesis. Also, hypothesis of no difference is called as null hypothesis.
Alternative hypothesis:
It is a statement about the population parameter or parameters, which gives an
alternative to the null hypothesis, within the range of pertinent values of the parameter, i.e.,
if Ho is accepted, what hypothesis is to be rejected and vice versa. An alternative hypothesis
is denoted by H1 or HA.
Two types of error:
After applying a test, a decision is taken about the acceptance or rejection of null
hypothesis vis – a – vis the alternative hypothesis. There is always some possibility of
committing an error in taking a decision about the hypothesis. These errors are of two types.
1. Type-I error: Rejecting the null hypothesis (Ho), when it is true.
2. Type-II error: Accepting the null hypothesis (Ho), when it is false.

                   Ho is True           Ho is False
Do not reject Ho   Correct decision     Type II error
Reject Ho          Type I error         Correct decision

For an example:

                       Charge him   Release him
Butler did it          Correct      Error
Butler did not do it   Error        Correct
Level of significance:
It is the quantity of risk of type-I error which we are ready to tolerate in making a
decision about Ho. In other words, it is the probability of type-I error which is tolerable. The
level of significance is denoted by α and is conventionally chosen as 0.05 or 0.01 for
moderate and high precision respectively.

Critical region:
A statistic is used to test the hypothesis Ho. The test statistic follows some known
distribution. In a test, the area under probability density curve is divided into two regions,
viz., the region of acceptance and the region of rejection. The region of rejection is the region
in which Ho is rejected. It means that if the value of test statistic lies in this region. Ho will be
rejected. The region of rejection is called critical region. Moreover, the area of critical region
is equal to level of significance. The critical region is always on the tail of the distribution
curve. It may be on both the tails or on one tail, depending upon the alternative hypothesis.

One tail test:


If the alternative hypothesis is of the type H1: µ > µo (or H1: µ < µo), the critical
region lies on only one tail of the probability density curve. In this situation the test is called
a one-tailed test.

[Figure: one-tailed test, region of rejection on one tail of the curve]
Two tail test:
If the alternative hypothesis is of the type H1: µ ≠ µo, the critical region lies on
both the tails. In this situation the test is called a two-tailed test.
[Figure: two-tailed test, regions of rejection on both tails of the curve]

Degrees of freedom:
The number of independent observations on which a test is based is known as the degrees of
freedom of the test statistic.

Steps in testing of hypothesis:


The process of testing a hypothesis involves following steps.
1. Formulation of null & alternative hypothesis.
2. Specification of level of significance.
3. Selection of test statistic and its computation.
4. Finding out the critical value from tables using the level of significance, sampling
distribution and its degrees of freedom.
5. Determination of the significance of the test statistic.
6. Decision about the null hypothesis based on the significance of the test statistic.
7. Writing the conclusion in such a way that it answers the question on hand.
Test of Significance:
The theory of test of significance consists of various test statistic. The theory had been
developed under two broad headings.
1. Test of significance for large samples (n ≥ 30):
Large sample test or Asymptotic test or Z-test.
2. Test of significance for small samples (n < 30):
Small sample test or Exact test: t, F and χ².
It may be noted that small sample tests can be used in case of large samples also.
Z-test (Standard Normal Deviate Test):
The shape of the normal curve changes as the value of µ and/or σ vary. To cope with
this problem, a transformation is made as:

Z = (X − µ)/σ

The transformed variable Z is always distributed normally with mean 0 and variance
1, i.e. Z ~ N(0, 1). In this way, whatever may be the parameters of X, Z always has the same
normal distribution N(0, 1), and hence only one normal curve is enough after the transformation,
irrespective of the distribution of X. The variable Z is called the standard normal deviate
(SND). After the transformation, the probability density function of the SND Z is:

φ(z) = (1/√(2π)) e^(−z²/2)
Note: Table for area under standard normal curve is given.


Z-test is carried out when the sample size is large (>30) or small sample with known
population SD.
One sample test:
Case-1:
Assumptions:
1. Population is normal.
2. The sample is drawn at random.
Condition:
1. Population S.D is known.
2. Size of the sample may be small or large

Null hypothesis: µ = µo

Test statistic:

Z = (x̄ − µo) / (σ/√n)

Example:
The average number of mango fruits per tree in a particular region was known from
considerable experience as 520, with a standard deviation of 4.0. A sample of 20 trees gives an
average number of fruits of 450 per tree. Test whether the average number of fruits per tree
selected in the sample is in agreement with the average production in that region.
Solution: Null hypothesis: µ = µo = 520

Z = |450 − 520| / (4/√20) = 70/0.894 = 78.26

Conclusion: Z (calculated) > Z (tabulated), 1.96 at 5 % level of significance. Therefore, it


can be concluded that there is significant difference between sample mean and population
mean with respect to average performance.
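A minimal sketch (an illustration, not from the original notes; standard-library Python) of this one-sample Z-test:

```python
# One-sample Z-test with known population S.D.: Z = |x̄ - µo| / (σ/√n).
from math import sqrt

xbar, mu0, sigma, n = 450, 520, 4.0, 20
z = abs(xbar - mu0) / (sigma / sqrt(n))
print(round(z, 2))        # 78.26 -> far beyond 1.96, so reject Ho at the 5% level
```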

Case-2: If the S.D. of the population is not known, we can still use the standard normal deviate
test.
Assumption:
1. Population is normal.
2. Sample is drawn at random.
Conditions:
1. σ is not known.
2. Size of the sample is large (>30).
Null hypothesis: µ = µo
Test statistic:

Z = (x̄ − µo) / (S/√n),

Where,

S = √( Σ(xi − x̄)² / (n − 1) ),

x̄ is the sample mean and n is the size of the sample.


Example: The average daily milk production of a particular variety of buffalo was given as
12kgs. The distribution of daily milk yield in farm as follows:

Daily milk yield (kgs) | 6-8 | 8-10 | 10-12 | 12-14 | 14-16 | 16-18
No. of buffaloes       | 9   | 20   | 35    | 42    | 17    | 7

Test whether the performance of the dairy farm was in agreement with the record.
Solution:
Null hypothesis: µ = µo = 12.
Using the mean formula we have x̄ = 11.91, and using the standard deviation
formula we have S = 2.49 (n = 130 buffaloes).
Therefore,

Z = |11.91 − 12| / (2.49/√130) = 0.41

Conclusion: The calculated Z is less than the table Z, 1.96, at the 5 % level of significance.
Therefore, the null hypothesis is accepted. That is, there is no significant difference between
the average daily milk yield of the dairy farm and the previous record.

Two sample case


Case-1:
Assumptions:
1. Populations are normal.
2. Samples are drawn independently and at random.
Conditions:
1. σ is known.
2. Size of sample may be small or large.
Null hypothesis: µ1 = µ2, where µ1, µ2 are the population means for 1st and 2nd populations
respectively.

Test statistic: Z = (x̄1 − x̄2) / ( σ√(1/n1 + 1/n2) ), where x̄1 and x̄2 are the means of the 1st and

2nd samples respectively, of sizes n1 and n2.


Conclusion: If Z(calculated) ≥ Z(tabulated), the null hypothesis is rejected. There is
significant difference between two sample means. In other words, the two samples have come
from two different populations having two different means. Otherwise, the null hypothesis is
accepted.

Case-2: In this case common population S.D is not known.


Assumption:
1. Populations are normal.
2. Samples are drawn independently and at random.
Conditions:
1. σ is not known.
2. Sizes of samples are large.
Null hypothesis: µ1 = µ2
Test statistic:

Z = (x̄1 − x̄2) / √( S1²/n1 + S2²/n2 ),

Where,
x̄1, x̄2 are the means of the 1st and 2nd samples with sizes n1 and n2 respectively,

S1² = Σ(x1i − x̄1)² / (n1 − 1) and

S2² = Σ(x2i − x̄2)² / (n2 − 1)

Conclusion: If Z (calculated) ≥ Z (tabulated) at chosen level of significance, the null


hypothesis is rejected. Otherwise it is accepted.

Example: A random sample of 90 poultry farms of one variety gave an average production of
240 eggs per bird/year with a S.D. of 18 eggs. Another random sample of 60 poultry farms of
another variety gave an average of 195 eggs per bird/year with a S.D. of 15 eggs.
Distinguish between the two varieties of birds with respect to their egg production.
Null hypothesis: µ1 = µ2;

Z = (240 − 195) / √(18²/90 + 15²/60) = 45/√(3.6 + 3.75) = 45/2.71 = 16.61

Conclusion: Z (calculated) > Z (tabulated), 1.96, at the 5% level of significance. Hence there is a
significant difference between the two varieties of birds with respect to egg production.

S.N.D test for Proportions:


Sometimes there is a need to have tests of hypothesis for the proportion of individuals
(or objects) having a particular attribute. For example, to know whether the proportion of
disease-infected plants in the sample is in conformity with the proportion in the entire field
(or population).
Here the number of plants in the sample is identical to n independent trials
with constant probability of success, p. The probabilities of 0, 1, 2, … successes are the
successive terms of the binomial expansion (q + p)^n, where q = 1 − p. For the
binomial distribution, the mean and variance of the number of successes are np and
npq respectively.
Mean of the proportion of successes = P

S.E. of the proportion of successes = √(PQ/n)

Test for single proportion:


In a sample of large size n, we may examine whether the sample would have come
from a population having a specified proportion P=Po. For testing we may proceed as
follows:
1. Null Hypothesis (Ho)
Ho: The given sample would have come from a population with specified proportion
P=Po.
2. Alternative Hypothesis(H1)
H1: The given sample may not be from a population with specified proportion
P≠Po (Two Sided)
P>Po (One sided-right sided)
P<Po (One sided-left sided)
3. Test statistic

Z = (p − Po) / √(PoQo/n), where p is the sample proportion and Qo = 1 − Po.

It follows a standard normal distribution with µ = 0 and σ² = 1.


4. Level of Significance:
The level of significance may be fixed at either 5% or 1%.
5. Expected value or critical value
In the case of the test statistic Z, the expected value is:
Two-tailed test: Ze = 1.96 at the 5% level; 2.58 at the 1% level
One-tailed test: Ze = 1.65 at the 5% level; 2.33 at the 1% level
6. Inference
If the observed value of the test statistic Zo exceeds the table value Ze we
reject the Null Hypothesis Ho otherwise accept it.
Example: For a particular variety of wheat crop it was estimated that 5 percent of plants were
attacked by a disease. A sample of 600 plants of the same variety of wheat was observed and
it was found that 50 plants were infected with the disease. Test whether the sample results
were in conformity with the population.

Null hypothesis: P = Po = 0.05; sample proportion p = 50/600 = 0.083.

Z = (0.083 − 0.05) / √(0.05 × 0.95/600) = 0.033/0.0089 = 3.74

Conclusion: Here Z (calculated) > Z (tabulated), 1.96, at the 5 percent level of significance, so the
null hypothesis is rejected. Therefore, there is a significant difference between the proportion
of diseased plants in the sample and the population.

Test for equality of two proportions:


Given two sets of sample data of large sizes n1 and n2 on attributes, we may
examine whether the two samples come from populations having the same proportion.
We may proceed as follows:
1. Null Hypothesis (Ho)
Ho: The given two samples would have come from populations having the
same proportion, P1 = P2.
2. Alternative Hypothesis (H1)
H1: The given two samples may not be from populations with the same proportion.
P1 ≠ P2 (Two-sided)
P1 > P2 (One-sided, right-sided)
P1 < P2 (One-sided, left-sided)
3. Test statistic

When P1 and P2 are not known, then

Z = (p1 − p2) / √( p1q1/n1 + p2q2/n2 )  for a heterogeneous population,

where q1 = 1 − p1 and q2 = 1 − p2;

Z = (p1 − p2) / √( pq(1/n1 + 1/n2) )  for a homogeneous population,

where p = (n1p1 + n2p2)/(n1 + n2) is the combined or pooled estimate and q = 1 − p.

4. Level of Significance:
The level may be fixed at either 5% or 1%.
5. Expected value: The expected value is given by:
Two-tailed test: Ze = 1.96 at the 5% level; 2.58 at the 1% level
One-tailed test: Ze = 1.65 at the 5% level; 2.33 at the 1% level
6. Inference:
If the observed value of the test statistic Z exceeds the table value Ze we may reject
the Null Hypothesis Ho otherwise accept it.

Example 1: In an investigation, it was found that 4 percent of the farmers accepted the
improved seeds for a barley crop in a particular state. On conducting a survey in two
panchayat samithis, 340 farmers accepted out of 1500 in the 1st samithi and 200 out of
1000 in the 2nd samithi. Test whether the difference between the two samithis is significant.

Null hypothesis: P=P0

P0=4/100=0.04, Q0=1-0.04=0.96,
P1=340/1500=0.23,
P2=200/1000=0.2.

Z = 1.19.
Conclusion: Z (calculated) < Z (tabulated), 1.96, at the 5 percent level of significance. Therefore,
the null hypothesis is accepted, i.e., there is no significant difference between the
proportions of the two samithis with regard to acceptance of the improved seeds.

Example 2: In the previous example if P is not known, test the significance of the difference
between the proportions of the two samples.

Null hypothesis: P1=P2=P where P1 and P2 are the proportions in the 1st and 2nd populations
respectively.
p = (340 + 200)/(1500 + 1000) = 0.22, q = 0.78,

Z = (0.23 − 0.20) / √( 0.22 × 0.78 × (1/1500 + 1/1000) ) = 1.75

Conclusion: Here Z (calculated) < Z (tabulated), 1.96, at the 5 percent level of significance. There
is no significant difference between the two samithis with regard to the proportions of farmers
accepting the improved seeds.
Student’s t - Test
In case of small samples drawn from a normal population, the ratio of difference
between sample and population means to its estimated standard error follows a distribution
known as t-distribution, where

t = (x̄ − µ) / (s/√n), where s² = Σ(xi − x̄)² / (n − 1)

Note: t-test is carried out when the sample size is small (i.e when it is less than 30).
One sample t-test:
Assumption:
1. Population is normal.
2. Sample is drawn at random.
Conditions:
1. σ is not known.
2. Size of sample is small.
Null hypothesis: µ = µ0
Test statistic:

t = (x̄ − µo) / (s/√n)

Where,
s² = Σ(xi − x̄)² / (n − 1) and n is the sample size.

Conclusion: If t (calculated) < t (tabulated) with (n-1) d.f at chosen level of significance, the
null hypothesis is accepted. That is, there is no significant difference between sample mean
and population mean. Otherwise, null hypothesis is rejected.

Example: The heights of plants in a particular field were assumed to follow a normal
distribution. A random sample of 10 plants was selected and their heights (in cms) were
recorded as 96, 100, 102, 99, 104, 105, 99, 98, 100 and 101. Discuss, in the light of the above
data, whether the mean height of plants in the population is 100.

Null hypothesis: µ = µo = 100

x      di = (xi − A), A = 100   di²
96     −4                       16
100     0                        0
102     2                        4
99     −1                        1
104     4                       16
105     5                       25
99     −1                        1
98     −2                        4
100     0                        0
101     1                        1
Total   4                       68

x̄ = A + Σdi/n = 100 + 4/10 = 100.4

s = √( (Σdi² − (Σdi)²/n) / (n − 1) ) = √( (68 − 1.6)/9 ) = 2.72

t = (100.4 − 100) / (2.72/√10) = 0.46

Conclusion: t (calculated) < t (tabulated), 2.262, with 9 d.f. at the 5 percent level of
significance. Therefore, the null hypothesis is accepted. In other words, the sample may
belong to the population whose mean height is 100 cm.
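A minimal sketch (an illustration, not from the original notes; standard-library Python) of this one-sample t-test:

```python
# One-sample t-test: t = (x̄ - µo) / (s/√n), s computed with the (n - 1) divisor.
from math import sqrt
from statistics import mean, stdev

heights = [96, 100, 102, 99, 104, 105, 99, 98, 100, 101]
mu0 = 100

xbar, s, n = mean(heights), stdev(heights), len(heights)
t = (xbar - mu0) / (s / sqrt(n))
print(round(t, 2))     # 0.47 (0.46 with the text's rounding) < 2.262, accept Ho
```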

Two sample t-test:


Assumptions:
1. Populations are normal.
2. Samples are drawn independently and at random.

Conditions:
1. S.D’s in the populations are same and are not known.
2. Sizes of samples are small.

Null hypothesis: µ1 = µ2 where µ1, µ2 are the means of 1st and 2nd populations respectively.
Test statistic:

t = (x̄1 − x̄2) / √( Sc²(1/n1 + 1/n2) )

Where,

Sc² = ( Σ(x1i − x̄1)² + Σ(x2i − x̄2)² ) / (n1 + n2 − 2),

x̄1 = Σx1i/n1 and x̄2 = Σx2i/n2

Conclusion: If t (calculated) ≤ t (tabulated) with (n1+n2-2) d.f at chosen level of significance,


the null hypothesis is accepted. That is there is no significant difference between the two
samples mean. Otherwise, the null hypothesis is rejected.

Example: Two types of diets were administered to two groups of school going children for
increase in weight and the following increases in weight (100gm) were recorded after a
month.
Diet A 4 3 2 2 1 0 5 6 3

Diet B 5 4 4 2 3 2 6 1

Test whether there is any significant difference between the two diets with respect to increase
in weight.

Null hypothesis: µ1 = µ2

X1 (Diet A): 4, 3, 2, 2, 1, 0, 5, 6, 3    (n1 = 9, ΣX1 = 26, ΣX1² = 104)
X2 (Diet B): 5, 4, 4, 2, 3, 2, 6, 1       (n2 = 8, ΣX2 = 27, ΣX2² = 111)

x̄1 = 2.89, x̄2 = 3.38, Sc² = 3.25

t = (3.38 − 2.89) / √( 3.25 × (1/9 + 1/8) ) = 0.56

Conclusion: t (calculated) < t (tabulated), 2.131, with 15 d.f. at the 5 percent level of
significance. Therefore, the null hypothesis is accepted. That is, there is no significant
difference between the two diets with respect to increase in weight.
Paired t-test:
When two small samples of equal size are drawn from two populations and the
samples are dependent on each other, then the paired t-test is used in preference to
the independent t-test. The same patients for the comparison of two drugs with some time
interval; the neighboring plots of a field for comparison of two fertilizers with respect to yield,
assuming that the neighboring plots have the same soil composition; rats from the same
litter for comparison of two diets; branches of the same plant for comparison of nitrogen
uptake, etc., are some of the situations where the paired t-test can be used.
In the paired t-test the testing of the difference between two treatments means was
made more efficient by keeping all the other experimental conditions same.

Assumptions:
1. Populations are normal.
2. Samples are drawn independently and at random.
Conditions:
1. Samples are related with each other.
2. Sizes of the samples are small and equal.
3. S.D’s in the population are equal and not known.
Null hypothesis: µ1 = µ2
Test statistic:

t = d̄ / (Sd/√n), where di = (X1i − X2i), d̄ = Σdi/n, n is the sample size and

Sd² = ( Σdi² − (Σdi)²/n ) / (n − 1)

Conclusion: If t (calculated) < t (tabulated) with (n-1) d.f at 5 percent level of significance,
the null hypothesis is accepted. That is, there is no significant difference between the means
of the two samples. In other words, the two samples may belong to the same population.
Otherwise, the null hypothesis is rejected.

Example: The following data are from an experiment conducted on the agronomy farm at the
College of Agriculture, UAS, Dharwad, for comparing two types of grasses on neighboring
plots of size 5 × 2 meters in each replication. The weights of grass per plot (in kgs) at
harvesting time were recorded on 7 replicates:
Replication                   | 1    | 2    | 3    | 4    | 5    | 6    | 7
Cenchrus ciliaris (Grass I)   | 1.96 | 2.10 | 1.64 | 1.78 | 1.95 | 1.70 | 2.00
Lasiurus sindicus (Grass II)  | 2.13 | 2.10 | 2.14 | 2.08 | 2.20 | 2.12 | 2.05

Test the significant difference between the two grasses with respect to their yield.

Null hypothesis: µ1 = µ2

X1i    X2i    di      di²
1.96   2.13   −0.17   0.0289
2.10   2.10    0      0
1.64   2.14   −0.50   0.25
1.78   2.08   −0.30   0.09
1.95   2.20   −0.25   0.0625
1.70   2.12   −0.42   0.1764
2.00   2.05   −0.05   0.0025
Total         −1.69   0.6103

d̄ = −1.69/7 = −0.24,

Sd² = ( 0.6103 − (−1.69)²/7 ) / 6 = 0.0337

t = |−0.24| / (√0.0337/√7) = 3.46

Conclusion: t (calculated) > t (tabulated), 2.447, with 6 d.f. at the 5 percent level of
significance. The null hypothesis is rejected. There is a significant difference between the two
grasses with respect to yield.
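A minimal sketch (an illustration, not from the original notes; standard-library Python) of this paired t-test:

```python
# Paired t-test on plot-wise differences: t = d̄ / (Sd/√n).
from math import sqrt
from statistics import mean, stdev

grass1 = [1.96, 2.10, 1.64, 1.78, 1.95, 1.70, 2.00]
grass2 = [2.13, 2.10, 2.14, 2.08, 2.20, 2.12, 2.05]

d = [a - b for a, b in zip(grass1, grass2)]    # paired differences
t = mean(d) / (stdev(d) / sqrt(len(d)))
print(round(abs(t), 2))    # 3.48 (3.46 with the text's rounding) > 2.447, reject Ho
```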


Variance Ratio Test (F-Test)

In the variance ratio test, we have to compare two variances, i.e., σ1² and σ2². In the case of
testing two means, we assume that the population variance is the same. But this assumption
may not always hold good. We may have to draw two samples from different populations,
where the variances are not the same. In such a situation, we cannot use the t-test directly for
testing the equality of two means. Therefore, we have to test whether these two variances are
the same or not. For testing the equality of two variances, we use the F-test.

Null hypothesis: σ1² = σ2²;

Alternative hypothesis: σ1² ≠ σ2²;

Test criterion:

F = S1² / S2² (with the larger variance in the numerator)

This is F with (n1 − 1) and (n2 − 1) d.f.

Conclusion: If F(cal) > F(table) at the 5 percent level, the test is significant; reject the null
hypothesis and conclude that the two variances are not the same. Otherwise, if F(cal) < F(table),
the test is not significant; we accept the null hypothesis and conclude that both variances are
the same.

Example 1: The marks in two different subjects are given below. Test whether the variances of
marks in the two subjects are the same or not.

Subject-X: 15 25 30 10 12 40 45

Subject-Y 19 20 25 30 18
Solution:

Null hypothesis: σ1² = σ2²; Alternative hypothesis: σ1² ≠ σ2²

x̄ = 177/7 = 25.28 for subject X and ȳ = 112/5 = 22.4 for subject Y.

Subject X | (x − x̄)  | (x − x̄)²  | Subject Y | (y − ȳ) | (y − ȳ)²
15        | −10.28   | 105.68    | 19        | −3.4    | 11.56
25        | −0.28    | 0.078     | 20        | −2.4    | 5.76
30        | 4.72     | 22.27     | 25        | 2.6     | 6.76
10        | −15.28   | 233.47    | 30        | 7.6     | 57.76
12        | −13.28   | 176.35    | 18        | −4.4    | 19.36
40        | 14.72    | 216.68    |           |         |
45        | 19.72    | 388.87    |           |         |
Total     |          | 1143.40   |           |         | 101.20

Test criterion:

S1² = Σ(x − x̄)²/(n1 − 1) = 1143.40/6 = 190.57 and S2² = Σ(y − ȳ)²/(n2 − 1) = 101.20/4 = 25.30

Therefore,
F = S1²/S2² = 190.57/25.30 = 7.53  (F(table) at (6, 4) d.f., 5% level, is 6.16)

Conclusion: F(cal) is more than F(table) at the 5% level of significance, so we conclude
that the two variances are not the same.
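A minimal sketch (an illustration, not from the original notes; standard-library Python) of the variance-ratio test:

```python
# F-test for equality of two variances: F = larger sample variance / smaller.
from statistics import variance

x = [15, 25, 30, 10, 12, 40, 45]    # marks in subject X
y = [19, 20, 25, 30, 18]            # marks in subject Y

s1_sq, s2_sq = variance(x), variance(y)     # sample variances, (n - 1) divisor
F = max(s1_sq, s2_sq) / min(s1_sq, s2_sq)
print(round(F, 2))    # ≈ 7.53 with (6, 4) d.f. -> exceeds the 5% table value, reject Ho
```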
Chi-Square Distribution

So far, we have discussed various tests of significance such as t, F and Z. These tests
were based on the assumption that the samples were drawn from normally distributed
populations. Since the testing procedure requires assumptions about the type of population or
parameters, i.e. population values, these tests are known as "parametric tests".
There are many situations in which it is not possible to make any rigid assumption
about the distribution of the population from which samples are being drawn. This limitation
has led to the development of a group of alternative techniques known as non-parametric or
distribution-free methods. When non-parametric tests are used, no assumption about the
parameters of the population or populations from which we draw our samples is made. The
chi-square test of goodness of fit and test of independence is a prominent example of the use
of non-parametric tests.
χ2 test as a goodness of fit:
The chi-square test is popularly known as the test of goodness of fit, for the reason that it
enables us to ascertain how well theoretical distributions such as the Binomial, Poisson,
Normal etc. fit empirical distributions, i.e., those obtained from sample data.
The expression for the χ² test for goodness of fit is:

χ² = Σ (Oi − Ei)² / Ei,  summed over i = 1, 2, …, k,

Where, Oi is the observed frequencies of the ith class,


Ei is the expected frequencies of ith class and ‘K’ the number of cells.
Here the term ‘cell’ is used for class interval in the case of frequency distribution and
compartment (or category) in the case of enumeration data.
Conclusion: if χ 2(cal) ≤ χ 2(table) at (k-1) df, then the hypothesis that the observed frequencies
are in agreement with the expected frequencies is accepted. Otherwise, the hypothesis is
rejected.
Example 1: A set of 5coins is tossed 3200 times and the number of heads appearing each
time is noted. The results are given below:
No. of heads 0 1 2 3 4 5

Frequency 80 570 1100 900 500 50

Test the hypothesis that the coins are unbiased.


Solution:
Let us take the hypothesis that the coins are unbiased. If this is true, the probabilities of
getting 0, 1, 2, 3, 4, 5 heads in a toss of 5 coins are the successive terms of the binomial
expansion (1/2 + 1/2)⁵. So the theoretical frequencies out of 3200 tosses are the terms in the
expansion 3200 × (1/2 + 1/2)⁵, as follows:

No. of heads       | 0   | 1   | 2    | 3    | 4   | 5
Expected frequency | 100 | 500 | 1000 | 1000 | 500 | 100

Applying the χ² test:

Oi | Ei | (Oi − Ei) | (Oi − Ei)² | (Oi − Ei)²/Ei

80 100 -20 400 4

570 500 70 4900 9.8

1100 1000 100 10000 10

900 1000 -100 10000 10

500 500 0 0 0

50 100 -50 2500 25

Total: 58.80

χ² = Σ (Oi − Ei)² / Ei = 58.80

Conclusion: The χ 2(cal) is much greater than χ 2(table) value. Hence the hypothesis is rejected.
Therefore, we can conclude that the coins are biased.
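A minimal sketch (an illustration, not from the original notes; plain Python) of this goodness-of-fit computation:

```python
# Chi-square goodness of fit: χ² = Σ (Oi - Ei)² / Ei for the 3200 coin tosses.
observed = [80, 570, 1100, 900, 500, 50]
expected = [100, 500, 1000, 1000, 500, 100]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_sq)    # 58.8 -> well above the 5% table value with 5 d.f. (11.07), reject Ho
```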
Example-2: Genetic theory states that children having one parent of blood type A and the
other of type B will always be one of the three types A, AB, B, and that the proportions of the
three types will on average be 1:2:1. A report states that out of 300 children having one A
parent and one B parent, 30% were found to be of type A, 45% of type AB and the remainder
of type B. Test the hypothesis by the χ² test.
Chi-square test as a test of independence:
With the help of the χ² test we can find out whether two or more attributes are associated or
not. Suppose we have N observations classified according to some attributes. We may ask
whether the attributes are related or independent. Thus, we can find out whether quinine is
effective in controlling fever or not, whether there is any association between colour and
intelligence, etc. In order to test whether or not the attributes are associated, we take the null
hypothesis that there is no association between the attributes under study or, in other words,
that the two attributes are independent.
If the χ²(cal) is less than the χ²(table) value at a certain level of significance, generally the 5%
level, we say that the results of the experiment provide no evidence for doubting the
hypothesis.
2 X 2 contingency table:
When the individuals (or objects) are classified into two categories with respect to
each of the two attributes then the table showing frequency distributed over 2 X 2 classes
called 2 X 2 contingency table.
Suppose the individuals are classified according to two attributes, say colour (B) and
intelligence (A). The distribution of frequencies over the cells is shown below:

                 Intelligence
B\A        | A1    | A2    | Total
Colour B1  | a     | b     | a + b
Colour B2  | c     | d     | c + d
Total      | a + c | b + d | N

Where a, b, c, d are frequencies of the different cells.

Null hypothesis: the two attributes are independent.

χ² = Σ (O − E)² / E

The expected frequencies are obtained as:

E(a) = (a + b)(a + c)/N;  E(b) = (a + b)(b + d)/N;  E(c) = (c + d)(a + c)/N;  E(d) = (c + d)(b + d)/N

Substituting the expected frequencies in the χ² expression and simplifying, we have:

χ² = N(ad − bc)² / ( (a + b)(c + d)(a + c)(b + d) )

Conclusion: If χ 2(cal) ≤ χ 2(table) with (r-1)x(c-1) df at chosen level of significance, the null
hypothesis is accepted i.e., two attributes are independent. Otherwise, null hypothesis is
rejected.

Example: One hundred individuals of a particular race were tested with an intelligence test
and classified into two classes. Another group of one hundred and twenty individuals belonging
to another race were administered the same intelligence test and classified into the same two
classes. The following are the observed frequencies for the two races:

Race\Intelligence | Intelligent | Non-intelligent | Total
Race-1            | 42          | 58              | 100
Race-2            | 55          | 65              | 120
Total             | 97          | 123             | 220

Test whether the intelligence is anything to do with the race.

Solution:
Ho: Intelligence and race are two independent attributes.
We know,

χ² = N(ad − bc)² / ( (a + b)(c + d)(a + c)(b + d) )

   = 220 × (42 × 65 − 58 × 55)² / (100 × 120 × 97 × 123) = 0.325

Conclusion: Since χ²(cal) < χ²(table), 3.841, with (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1 d.f. at the
5 percent level of significance, the null hypothesis is accepted. Therefore, there is evidence to
conclude that race and intelligence may be independent.
Example: A cross between two varieties of sorghum, one giving high yield and the other a high
amount of fodder, was made. The numbers of plants in the F2 generation were observed as
79, 160 and 85. Test whether this sample data is in agreement with the Mendelian ratio 1:2:1
or not.
Solution:
Null hypothesis: The sample ratio is in agreement with 1:2:1
Observed frequency (Oi) | Expected frequency (Ei) | (Oi − Ei) | (Oi − Ei)²/Ei
79                      | 324 × 1/4 = 81          | −2        | 0.0494
160                     | 324 × 2/4 = 162         | −2        | 0.0247
85                      | 324 × 1/4 = 81          | 4         | 0.1975
Total: 324              | 324                     |           | 0.2716

Conclusion: χ²(cal) < χ²(tab), 5.991, with (3 − 1) = 2 d.f. at the 5 percent level of significance.
Therefore, the null hypothesis is accepted, i.e., the plants are segregating according to the
Mendelian ratio 1:2:1 in the F2 generation.
Correlation:
Correlation is a measure of extent or degree of mutual dependence between two
variables.
In the study of two variables jointly, many times an investigator is interested to know
the degree or extent of dependence between them. Actually, one wants to know whether the
relation between two variables is of high, moderate or low degree. If the two variables have
no relation, it means the change in one variable has no impact about the change in the other.
In this case, two variables are said to be independent.
Methods of determining correlation:
Graphical method:
The extent of relation between two variables can roughly be judged by plotting the
pairs of observations as points on graph paper. These points are spread in different patterns,
and as such the plots are called scatter diagrams. The closer the points lie to a straight line,
the greater is the degree of relationship between the variables. In the following scatter
diagrams, the 1st shows that there is a perfect positive linear relationship between X and Y,
i.e., X is proportional to Y and vice-versa. In this case the line runs from the lower left side to
the upper right side, and all the points lie on the line.
The second diagram shows the same phenomenon but in the opposite direction, i.e., if
X increases, then Y decreases. In this case, the line runs from the upper left to the bottom
right side.
The next two diagrams depict high positive and high negative correlation respectively, as
most of the points lie near the straight lines or on them.
The following diagrams show the same phenomena as the previous ones, except that the
points lie farther from the lines, indicating a low degree of correlation between the variables.
If hardly any line can be drawn about which the points concentrate, it means there is no
correlation between the variables.

[Figures: scatter diagrams showing positive correlation, negative correlation and no correlation]

Mathematical measure:
A graph provides a rough idea about the type and extent of correlation between two
variables. But the correlation can be measured more exactly numerically by calculating the
coefficient of correlation. This is known as Pearson's coefficient of correlation and the
formula for it was developed by Karl Pearson. It is based on three assumptions:
1. The variables X and Y are distributed normally.
2. The relationship between X and Y is linear.
3. There is a cause and effect relationship between X and Y.

rxy = Cov(X, Y) / (σx σy)

If from a bivariate population there are n pairs of values of the variables X and Y as (x1, y1),
(x2, y2), (x3, y3), . . ., (xn, yn), then the formula is given as:

rxy = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² Σ(yi − ȳ)² ),  for i = 1, 2, . . ., n,

where rxy is the correlation between X and Y; mostly the suffix xy is omitted for convenience.

Alternative formula:

r = ( Σxy − (Σx)(Σy)/n ) / √( (Σx² − (Σx)²/n)(Σy² − (Σy)²/n) )

Test of significance of Simple Correlation Coefficient.

Null hypothesis: ρ = 0, where ρ is the population correlation coefficient.

Test statistic: t = r√(n − 2) / √(1 − r²), which follows the t-distribution with (n − 2) d.f.
Conclusion: If t (Cal) >t(Tab) with (n-2) d.f. at chosen level of significance, the null
hypothesis is rejected. That is, there may be significant correlation between the two variates.
Otherwise, the null hypothesis is accepted.

Properties of correlation coefficient:


1. The value of coefficient of correlation lies between -1 to +1.
2. The value of r indicates, high, moderate, low positive or negative and nil degree of
correlation as per the values of r given in the table below:

Degree of correlation | Positive corr. coeff. | Negative corr. coeff.
Perfect               | r = +1                | r = −1
High                  | 0.75 ≤ r ≤ 1          | −1 ≤ r ≤ −0.75
Moderate              | 0.25 ≤ r ≤ 0.75       | −0.75 ≤ r ≤ −0.25
Low                   | 0 ≤ r ≤ 0.25          | −0.25 ≤ r ≤ 0
Nil                   | 0                     | 0

3. It has no units as it is a pure number.


4. If a constant value ‘a’ is added or subtracted from each value of x and ‘b’ from each
value of y and also each value of x is divided (multiplied) by a constant ‘c’ and y by
‘d’, the value of correlation coefficient calculated from coded value is same as that of
original value. It means coding of data does not affect the value of r.
5. If X and Y are independent, then the correlation coefficient between them is zero, but
the converse need not be true.
Example-1: Marks of seven students in physics and mathematics in an hourly test out of 10
were as follows:

Students 1 2 3 4 5 6 7
Marks in Math’s (X) 7 9 10 6 5 4 8
Marks in Physics (Y) 9 6 5 4 3 2 6

Find the correlation coefficient between the marks scored in the two subjects.
Solution:
From the table we can calculate x̄ = 7.0 and ȳ = 5.0.
Σ(x − x̄)(y − ȳ) = (7−7)(9−5) + (9−7)(6−5) + . . . + (8−7)(6−5) = 17
Σ(x − x̄)² = (7−7)² + (9−7)² + . . . + (8−7)² = 28
Σ(y − ȳ)² = (9−5)² + (6−5)² + . . . + (6−5)² = 32

r = 17 / √(28 × 32) = 17/29.93 = 0.568

Since the value of r is a little more than 0.5, it can be interpreted that the correlation
between marks in mathematics and physics is moderate.
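A minimal sketch (an illustration, not from the original notes; standard-library Python) of this correlation computation:

```python
# Pearson's r = Σ(x - x̄)(y - ȳ) / √( Σ(x - x̄)² · Σ(y - ȳ)² ).
from math import sqrt
from statistics import mean

x = [7, 9, 10, 6, 5, 4, 8]    # marks in mathematics
y = [9, 6, 5, 4, 3, 2, 6]     # marks in physics

xbar, ybar = mean(x), mean(y)
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)
print(round(sxy / sqrt(sxx * syy), 3))    # 0.568 -> moderate positive correlation
```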
Regression
Regression is the measure of the average relationship between two or more variables in
terms of the original units of the data.
If two variables are correlated, unknown value of one of the variables can be
estimated by using the known value of the other variable. The so estimated value may
not be equal to the actually observed value, but it will be close to the actual value.
The property of the tendency of actual value to lie close to the estimated value is
called regression. In a wider usage, regression is the theory of estimation of unknown
value of a variable with the help of known values of the variables. The regression theory
was first introduced by Sir Francis Galton in the field of genetics.
When data on two variables are known, by assuming one of the variables to be
dependent on the other, we fit a linear equation to the data by the method of least square.
The linear equation is called regression equation.
For bivariate data on x and y, the regression equation obtained with the
assumption that x is dependent on y is called the regression of x on y. The regression of x on
y is:

(x − x̄) = bxy (y − ȳ)

The regression equation obtained with the assumption that y is dependent on x is

called the regression of y on x. The regression of y on x is:

(y − ȳ) = byx (x − x̄)

Here, the constants bxy and byx are the regression coefficients. They are:

bxy = Σ(x − x̄)(y − ȳ) / Σ(y − ȳ)²  and  byx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

The regression equation of x on y is used for the estimation of x values and the
regression equation of y on x is used for the estimation of y values.
The graphical representations of the regression equations are called regression lines.

Properties of Regression Coefficients:


Regression coefficients are the coefficient of the independent variables in the
regression equations.
1. The regression coefficient bxy is the change occurring in x for a unit change in y.
The regression coefficient byx is the change occurring in y for a unit change in x.
2. The regression coefficients are independent of the origins of measurement of the
variables. But, they are dependent on the scale.
3. The geometric mean of the regression coefficients is equal to the coefficient of
correlation.
4. The regression coefficient cannot be of opposite signs.
If r is positive, both the regression coefficient will be positive. If r is negative, both
the regression coefficient will be negative. If r is zero, both the regression
coefficient will be zero.
5. Since the coefficient of correlation, numerically, cannot be greater than 1, the
product of the regression coefficients cannot be greater than 1.
Properties of regression lines:
There are two regression lines.
1. The regression lines intersect at (x̄, ȳ).
2. The regression lines have positive slope if the variables are positively correlated.
They have negative slope if the variables are negatively correlated.
3. If there is perfect correlation, the regression lines coincide (there will be only
one regression line).
Example:
The following are the heights of 8 fathers and one son of each. From the data, estimate
the height of a son whose father is 150 cm tall.
Height of father 164 176 178 184 175 167 173 180
(cms)
Height of sons 168 174 175 181 173 166 173 179
(cms)

Solution:
Let x and y respectively denote the heights of the fathers and of the sons.
Then, the value of y corresponding to x = 150 has to be estimated.
For this, the regression of y on x should be found and the estimate made from it.
x      y      u = x - 170   v = y - 170    u²     uv
164    168    -6            -2             36     12
176    174     6             4             36     24
178    175     8             5             64     40
184    181    14            11            196    154
175    173     5             3             25     15
167    166    -3            -4              9     12
173    173     3             3              9      9
180    179    10             9            100     90
Total         37            29            475    356

x̄ = 170 + Σu/n = 170 + 37/8 = 174.63 cms and ȳ = 170 + Σv/n = 170 + 29/8 = 173.63 cms.

Since regression coefficients are independent of the origins, the required regression coefficient is

byx = [Σuv - (Σu)(Σv)/n] / [Σu² - (Σu)²/n] = [356 - (37 × 29)/8] / [475 - (37)²/8] = 221.875 / 303.875 = 0.7302.

Thus, the regression of y on x is: y - ȳ = byx (x - x̄).

On substitution we get:
y - 173.63 = 0.7302 (x - 174.63)
y = 0.7302x + 46.12

At x = 150, y = 0.7302 × 150 + 46.12 = 155.65. Thus, the estimate of the son's height is 155.65 cms.
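
The whole estimation can be reproduced with a minimal Python sketch (illustrative only; the variable names are our own):

fathers = [164, 176, 178, 184, 175, 167, 173, 180]   # x: heights of fathers (cms)
sons    = [168, 174, 175, 181, 173, 166, 173, 179]   # y: heights of sons (cms)

n = len(fathers)
x_bar = sum(fathers) / n   # 174.625
y_bar = sum(sons) / n      # 173.625

# byx = sum of products of deviations / sum of squared deviations of x
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(fathers, sons))
sxx = sum((x - x_bar) ** 2 for x in fathers)
byx = sxy / sxx            # 0.7302

# regression of y on x: y - y_bar = byx (x - x_bar); estimate at x = 150
y_hat = y_bar + byx * (150 - x_bar)
print(round(byx, 4), round(y_hat, 2))   # 0.7302 155.65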
Analysis of Variance (one-way)

Analysis of variance is a technique for analyzing data from designed or controlled
experiments. A controlled experiment consists of a set of experimental conditions called
treatments. Every treatment is applied repeatedly to a set of experimental units, resulting in
replications. For example, an insecticide when sprayed on five plants results in five
observations. When repeated observations are taken on a set of treatments, the resulting data are
analyzed using the technique of ANOVA. It splits the total variability in the data into variability
due to treatments and variability due to factors extraneous to treatments.

Definition: ANOVA is a technique of partitioning the total variation present in the
experiment into different components, of which some are attributable to known sources and
some are completely unknown.
ANOVA Table for One-Way:

Source of variation | Degrees of freedom | Sum of squares       | Mean sum of squares             | Cal-F
Between treatments  | t - 1              | SS due to treatments | MSS(treat.) = SS(treat.)/(t-1)  | MSS(treat.)/MSS(error)
Within treatments   | t(r - 1)           | By subtraction       | MSS(error) = SS(error)/[t(r-1)] |
Total               | N - 1              | Total SS             |                                 |

Here t is the number of treatments, r is the number of replications per treatment, and N = tr is the total number of observations.

Computation of these components of variation is explained with an example below.

Example: The following data are from a controlled experiment in which 5 insecticides were
applied to four cabbage plants each and the number of insect larvae was counted.

Replication |  A |  B |  C |  D |  E
1           |  5 |  7 | 10 | 19 | 14
2           | 11 |  6 |  8 | 14 |  7
3           |  4 |  4 |  6 | 27 |  7
4           |  4 |  3 |  4 |  8 | 12
Total       | 24 | 20 | 28 | 68 | 40
Mean        |  6 |  5 |  7 | 17 | 10
Note: The number of larvae varies from plant to plant. It also varies among plants which
have been sprayed with the same insecticide. The technique of analysis of variance splits the
total variation in the 20 observations into two components: one attributable to possible
differences among the insecticides, and the other to differences among plants treated with the same insecticide.
The computation is as follows:

Step 1: Find the totals and means corresponding to the insecticides.

Insecticide A B C D E

Total 24 20 28 68 40

Mean 6 5 7 17 10

Step 2: The overall variation in the data is measured by what is called the total sum of
squares, which is computed as follows:

First compute

C.F = (Grand total)²/N = (180)²/20 = 1620.

This quantity is called the correction factor (C.F). Then compute the total S.S as:

Total S.S = Σx² - C.F = 2292 - 1620 = 672.


This can also be computed as the sum of squares of deviations of the 20 observations
from the grand mean (180/20 = 9):

Total S.S = Σ(x - x̄)² = 672.

It may be recalled that the total S.S divided by 20 would give the variance of the 20
observations. Thus, the total S.S is a measure of overall variation.
Step 3: Compute the “between treatments” sum of squares:
To compute the between treatments S.S., the treatment totals (or means) are used:

Between S.S = (24² + 20² + 28² + 68² + 40²)/4 - C.F = 1996 - 1620 = 376,

where 4 is the number of replications for each insecticide.
Step 4: Within treatment sum of squares or Error sum of squares.
Error S.S measures variability due to factors other than insecticides. Error S.S is
computed by subtracting the “between insecticides” S.S from the total S.S:
Error S.S = Total S.S - Between treatment S.S = 672 - 376 = 296.
This is because the “between treatment” S.S and the “within treatment” S.S add up to the “total” S.S,
i.e., between treatment S.S + within treatment S.S = Total S.S.
This is known as the ANOVA identity.
Step 5: Set up the analysis of variance (ANOVA) table:

Sources of variation   | Degrees of freedom (df) | Sum of squares (SS) | Mean sum of squares | Cal-F
“Between” insecticides | 5 - 1 = 4               | 376                 | 376/4 = 94          | 94/19.73 = 4.76
“Within” insecticides  | 5(4 - 1) = 15           | 296                 | 296/15 = 19.73      |
Total                  | 20 - 1 = 19             | 672                 |                     |

It may be noted that, of the total variation of 672, 376 is attributed to possible differences
among the insecticides. It can be tested whether 376 is substantial enough to conclude that the
insecticides differ. This technique is known as testing of hypothesis and is beyond the scope
of this course.
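
The computations in Steps 1-5 can be reproduced with a minimal Python sketch (illustrative only; plain Python, no statistical library):

# observations for each insecticide (4 replications each)
data = {
    "A": [5, 11, 4, 4],
    "B": [7, 6, 4, 3],
    "C": [10, 8, 6, 4],
    "D": [19, 14, 27, 8],
    "E": [14, 7, 7, 12],
}

all_obs = [x for plot in data.values() for x in plot]
N = len(all_obs)                                  # 20 observations
cf = sum(all_obs) ** 2 / N                        # correction factor = 1620
total_ss = sum(x * x for x in all_obs) - cf       # total S.S = 672

r = 4                                             # replications per treatment
between_ss = sum(sum(obs) ** 2 for obs in data.values()) / r - cf   # 376
error_ss = total_ss - between_ss                  # 296 (by subtraction)

t = len(data)                                     # 5 treatments
f_cal = (between_ss / (t - 1)) / (error_ss / (t * (r - 1)))
print(total_ss, between_ss, error_ss, round(f_cal, 2))   # 672.0 376.0 296.0 4.76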
Two-way Analysis of Variance

Two-way ANOVA technique is used when data are classified on the basis of two
factors. For example, agricultural output may be classified on the basis of different
varieties of seeds and also on the basis of different varieties of fertilizers used. Such a two-
way design may or may not have repeated measurements for each combination of the two
factors. We shall now explain the two-way ANOVA technique in the context of a two-way design when
repeated values are not there.
As we do not have repeated values, we cannot directly compute the sum of squares
within samples as we did in the case of one-way ANOVA. Therefore, we have to
calculate this residual or error variation by subtraction, once we have calculated (just on the
same lines as we did in the case of one-way ANOVA) the sum of squares for total variance,
for variance between columns and for variance between rows.

The various steps involved are as follows:


1. Use a coding device, if it simplifies the task.
2. Take the total of the individual items (or of their coded values, as the case may be) in all
the samples and call it the grand total (G.T).
3. Work out the correction factor as under:

   C.F = (G.T)²/n

4. Find the square of each of the item values (or of their coded values, as the case may be)
one by one and take the total. Subtract the correction factor from this total to obtain
the sum of squares of deviations for total variance. Symbolically, we write it as:

   Sum of squares of deviations for total variance, or Total SS = Σx² - C.F.
5. Take the total of each column, square each column total, divide each squared value by
the number of items in the corresponding column, and take the total of the results thus
obtained. Finally, subtract the correction factor from this total to obtain the sum of
squares of deviations for variance between columns (SS between columns).
6. Take the total of each row, square each row total, divide each squared value by the
number of items in the corresponding row, and take the total of the results thus
obtained. Finally, subtract the correction factor from this total to obtain the sum of
squares of deviations for variance between rows (SS between rows).

7. The sum of squares of deviations for residual or error variance can be worked out by
subtracting the sum of the results of steps (5) and (6) from the result of step (4) stated
above. In other words,
   Error SS = Total SS - (SS between columns + SS between rows).
8. Degrees of freedom (d.f) can be worked out as under:
   d.f for total variance = (c × r - 1)
   d.f for variance between columns = (c - 1)
   d.f for variance between rows = (r - 1)
   d.f for residual or error variance = (c - 1)(r - 1)
Where c is the number of columns and r is the number of rows.
9. ANOVA table can be set up in the usual fashion as shown below:

Source of variation       | Degrees of freedom (d.f) | Sum of squares (SS) | Mean sum of squares (MS)   | F-ratio
Between columns treatment | (c - 1)                  | SS between columns  | SS between columns/(c - 1) | MS between columns / MS error
Between rows treatment    | (r - 1)                  | SS between rows     | SS between rows/(r - 1)    | MS between rows / MS error
Error                     | (c - 1)(r - 1)           | By subtraction      | SS error/[(c - 1)(r - 1)]  |
Total                     | (c × r - 1)              | Total SS            |                            |

In the table, c is the number of columns and r is the number of rows.


SS due to error = Total SS – (SS between columns + SS between rows).

Example: Set up the ANOVA table for the following two-way design:

Fertilizer | Varieties
           |  A   B   C
W          |  6   5   5
X          |  7   5   4
Y          |  3   3   3
Z          |  8   7   4

Step-1: G.T = 60, n = 12; therefore C.F = (G.T)²/n = (60)²/12 = 300.

Step-2: Total SS = (6² + 5² + 5² + . . . + 4²) - 300 = 332 - 300 = 32.

Step-3: SS between columns treatment (varieties) = (24²/4 + 20²/4 + 16²/4) - 300 = 308 - 300 = 8.

Step-4: SS between rows treatment (fertilizers) = (16²/3 + 16²/3 + 9²/3 + 19²/3) - 300 = 318 - 300 = 18.

Step-5: SS error = Total SS - (SS between columns + SS between rows) = 32 - (8 + 18) = 6.
After making all these computations, the ANOVA table can be set up for drawing conclusions:

Source of variation       | d.f | SS | MS       | F-ratio | Table-F @ 5%
Between columns treatment |  2  |  8 | 8/2 = 4  | 4/1 = 4 | F(2,6) = 5.14
Between rows treatment    |  3  | 18 | 18/3 = 6 | 6/1 = 6 | F(3,6) = 4.76
Error                     |  6  |  6 | 6/6 = 1  |         |
Total                     | 11  | 32 |          |         |
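
These two-way computations can be reproduced with a minimal Python sketch (illustrative only; the variable names are our own):

# rows: fertilizers W, X, Y, Z; columns: varieties A, B, C
table = [
    [6, 5, 5],
    [7, 5, 4],
    [3, 3, 3],
    [8, 7, 4],
]

r = len(table)          # 4 rows
c = len(table[0])       # 3 columns
n = r * c               # 12 observations
gt = sum(sum(row) for row in table)   # grand total = 60
cf = gt ** 2 / n                      # correction factor = 300

total_ss = sum(x * x for row in table for x in row) - cf                      # 32
row_ss = sum(sum(row) ** 2 for row in table) / c - cf                         # 18
col_ss = sum(sum(row[j] for row in table) ** 2 for j in range(c)) / r - cf    # 8
error_ss = total_ss - (col_ss + row_ss)                                       # 6 (by subtraction)

ms_error = error_ss / ((c - 1) * (r - 1))      # 1.0
f_columns = (col_ss / (c - 1)) / ms_error      # 4.0, compare with F(2,6) = 5.14
f_rows = (row_ss / (r - 1)) / ms_error         # 6.0, compare with F(3,6) = 4.76
print(f_columns, f_rows)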
