AST-202-Statistical Methods
STATISTICAL METHODS
(AST-202)
By:
Dr. Ashalatha K. V.
Ms. Jyoti B Bagalkoti
Ms. Megha J
Mr. Anand P
DEPARTMENT OF AGRICULTURAL STATISTICS
COLLEGE OF AGRICULTURE, DHARWAD –5
STATISTICS
2. Chronological:
Classification is according to the lapse of time, such as monthly, yearly, etc.
Data on Production of food grains can be classified as shown below:
Year Tonnes
1990-91 300
1991-92 230
1992-93 410
1993-94 175
1994-95 110
3. Qualitative:
Classification is according to the attributes of the subjects or items, such as sex,
colour, qualification, etc.
Type of farmers Number of farmers
Marginal 907
Medium 1041
Large 1948
Total 3896
4. Quantitative:
Classification is according to the magnitude of numerical values, such as age,
income, height, weight, etc.
The data on land holdings by farmers in a block:
Land holding (hectare) Number of Farmers
<1 442
1-2 908
2-5 471
>5 124
TYPES OF STATISTICS:
1. Descriptive Statistics:
It consists of methods for organizing, displaying, and describing data
by using tables, graphs, and summary measures.
2. Inferential Statistics:
It is another branch of statistics. It provides the procedures to draw an
inference about conditions that exist in a larger set of observations from study of a part of that
set.
Limitation of statistics:
Statistics, with its wide application in almost every sphere of human activity, is not
without limitations. The following are the limitations:
1. It does not deal with individuals.
Statistics deals with an aggregate of objects and does not give any specific recognition to
the individual items of a series. E.g., the individual figures of agricultural production of any
country for a particular year are meaningless unless, to facilitate comparison, similar figures
of other countries or of the same country for different years are given. The statement “the
height of Mr. X is 5’8”” does not by itself constitute statistics, whereas “the average height
of an Indian is 5’8”” does.
2. It deals only with quantitative characters.
Statistics deals directly only with quantitative characters; qualitative characters such as
efficiency, honesty and intelligence cannot be measured directly. These factors can be
measured indirectly, e.g., the efficiency of a selling agent can be judged by studying the
number of articles sold by him.
3. Statistical results are true only on an average.
Statistical results are true only on an average. E.g., the average consumption of milk per
head in a certain locality may be 0.5 litre, but this gives no idea of the shortage of milk
faced by the poor. Conclusions obtained statistically are not universally true; they are true
only under certain conditions. This is because statistics, as a science, is less exact than the
natural sciences.
4. Statistics can be misused.
Statistics can be misused, because if a conclusion is based on incomplete information,
statistics can prove anything. As the saying goes, there are three types of lies: lies, damned
lies and statistics. Statistics are like clay, of which one can make a god or a devil as one
pleases.
5. Expert knowledge is a must to handle statistical data.
Importance in Agriculture:
1) It helps to understand the nature of variability.
2) To arrive at meaningful conclusions on the basis of a sample study in the field.
3) To express the data/results of field experiments in summary form.
4) Sampling.
i. In state Agril. Survey for estimation of areas and yield of crops.
ii. In price fixation policy of various Agril. Commodities.
iii. In Agril Extension survey to study the impact of programs.
iv. In Agril. Economics survey to study the demand and supply policy, the growth
rate of population and the cost of production of various crops.
5) In Agril. Meteorology for weather forecasting and to correlate weather parameter with
crop production.
FREQUENCY DISTRIBUTION
In statistics, a frequency distribution is a tabulation of the values that one or more
variables take in a sample. Each entry in the table contains the frequency or count of the
occurrences of values within a particular group or interval, and in this way the table
summarizes the distribution of values in the sample.
Terms used in Frequency Distribution Table:
1. Class: The arrangement of data into groups based on some common criterion. These
groups are called classes. A class is denoted by X.
2. Frequency: The number of times a category or class occurs. It is denoted by f.
3. Class limits: The boundary figures of the classes are called class limits.
4. Upper limit: Upper bound value of the class.
5. Lower limit: Lower bound value of the class.
6. Class interval: Difference between upper limit and lower limit of a class.
7. Frequency Distribution: Arrangement of data along with their frequencies is called
frequency distribution.
Construction of frequency distribution table:
The following steps are used for construction of frequency distribution table:
Step 1: The number of classes is to be decided.
The appropriate number of classes may be decided by Yule’s formula:
No. of classes = 2.5 × n^(1/4), where ‘n’ is the total number of observations.
Step 2: The class interval (CI) is to be determined.
CI = Range / No. of classes = (Largest observation − Smallest observation) / No. of classes
1. Exclusive method:
In this method, the upper limit of one class is the lower limit of the next class; an
observation equal to the upper limit is counted in the next class. It gives a continuous
distribution.
Ex:
Class (X) Frequency (f)
25-30 5
30-35 6
35-40 7
2. Inclusive method:
In this method, the lower limit and upper limit of class interval are included in
the same class. It is discontinuous distribution.
Ex:
Class (X) Frequency (f)
65-84 3
85-104 5
105-124 7
125-144 12
145-164 8
To convert discontinuous distribution into continuous distribution, subtract 0.5 from
lower limit and add 0.5 to upper limit.
Solution:
No. of observation (n) = 50
Number of classes = 2.5 × n^(1/4)
= 2.5 × 50^(1/4)
= 6.648 ≈ 7
Class Interval = Range / No. of classes
= 65 / 7 = 9.286 ≈ 9
Inclusive method:
C.I Tally marks Frequency (f)
10-19 II 2
20-29 IIII 4
30-39 IIII II 7
40-49 IIII IIII 10
50-59 IIII IIII IIII I 16
60-69 IIII III 8
70-79 III 3
TOTAL 50
Exclusive method:
C.I Tally marks Frequency (f)
9.5-19.5 II 2
19.5-29.5 IIII 4
29.5-39.5 IIII II 7
39.5-49.5 IIII IIII 10
49.5-59.5 IIII IIII IIII I 16
59.5-69.5 IIII III 8
69.5-79.5 III 3
TOTAL 50
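The two construction steps (Yule's formula for the number of classes, then the class interval and tally counts) can be sketched in Python. This is an illustrative sketch only; the data below are hypothetical and not the 50 observations of the worked example:

```python
import math

def frequency_table(data, num_classes=None):
    """Inclusive-method frequency table built from raw observations."""
    n = len(data)
    if num_classes is None:
        # Yule's formula: number of classes = 2.5 * n^(1/4)
        num_classes = math.ceil(2.5 * n ** 0.25)
    lo, hi = min(data), max(data)
    width = math.ceil((hi - lo + 1) / num_classes)   # class interval
    table = []
    for k in range(num_classes):
        lower = lo + k * width
        upper = lower + width - 1                    # inclusive upper limit
        freq = sum(lower <= x <= upper for x in data)
        table.append((lower, upper, freq))
    return table

# Hypothetical sample of 20 plant heights (cm)
data = [12, 15, 18, 22, 25, 27, 31, 33, 35, 36,
        38, 41, 44, 47, 52, 55, 58, 61, 65, 70]
for lower, upper, f in frequency_table(data):
    print(f"{lower}-{upper}: {f}")
```

Every observation falls in exactly one inclusive class, so the frequencies always add up to n.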
Cumulative frequency:
Total frequency up to and including a class is called the cumulative frequency. There are
two types of cumulative frequency: less than and more than cumulative frequency.
1. Less than cumulative frequency:
It is obtained by successively adding the frequencies of all the previous values (or
classes), including the frequency of the value (class) against which the totals are written,
provided the values (classes) are arranged in ascending order of magnitude.
Ex:
Marks Frequency Less than cf
25-30 5 5
30-35 6 5+6=11
35-40 6 11+6=17
40-45 4 17+4=21
45-50 4 21+4=25
Example: Form a cumulative frequency distribution (less than and more than) table for the
following data:
C.I Frequency (f) Less than cf More than cf
9.5-19.5 2 2 50
19.5-29.5 4 2+4=6 50-2=48
29.5-39.5 7 6+7=13 48-4=44
39.5-49.5 10 13+10=23 44-7=37
49.5-59.5 16 23+16=39 37-10=27
59.5-69.5 8 39+8=47 27-16=11
69.5-79.5 3 47+3=50 11-8=3
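As a quick cross-check of the table above, the two cumulative series can be generated in Python from the class frequencies alone:

```python
from itertools import accumulate

freqs = [2, 4, 7, 10, 16, 8, 3]            # class frequencies from the example
less_than_cf = list(accumulate(freqs))     # running total downwards
total = sum(freqs)
# more-than cf for a class = total minus all frequencies above it
more_than_cf = [total - cf + f for cf, f in zip(less_than_cf, freqs)]
print(less_than_cf)    # [2, 6, 13, 23, 39, 47, 50]
print(more_than_cf)    # [50, 48, 44, 37, 27, 11, 3]
```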
Graphical Representation of Data
Graphs are charts consisting of points, lines and curves. Charts are drawn on graph
sheets. Suitable scales are to be chosen for both x and y axes, so that the entire data can be
presented in the graph sheet. Graphical representations are used for grouped quantitative data.
Ogives:
Ogive is a cumulative frequency graph, i.e., a free-hand curve showing the cumulative
frequency.
There are two types of ogives:
1. Less than ogive:
Less than ogive is the graph of the less than cumulative frequency distribution which
shows the number of observations LESS THAN the upper-class limit.
Arithmetic mean:
It is the most common and ideal measure of central tendency. It is defined as “the sum
of the observed values of the character (or variable) divided by the total number of
observations”. It is denoted by the symbol x̄ (read as “x bar”).
For ungrouped data:
If the variable x assumes n values x1, x2, …, xn then the mean is given by,
A.M (x̄) = (x1 + x2 + … + xn) / n = Σxi / n
Example 1: Calculate the mean for pH levels of soil 6.8, 6.6, 5.2, 5.6, 5.8
Solution,
x̄ = (6.8 + 6.6 + 5.2 + 5.6 + 5.8) / 5 = 30 / 5 = 6
Example 2: A variable takes the values as given below. Calculate the arithmetic mean of
110, 117, 129, 195, 95, 100, 100, 175, 250 and 750.
x̄ = (110 + 117 + 129 + 195 + 95 + 100 + 100 + 175 + 250 + 750) / 10 = 2021 / 10 = 202.1
For grouped data: A.M (x̄) = Σfixi / Σfi
NOTE: In the case of a grouped or continuous frequency distribution, x is taken as the mid
value of the corresponding class.
2. Assumed mean method:
x̄ = A + (Σfidi / N), where A is the assumed mean and di = xi − A.
For the given data this works out to x̄ = 24.92.
i.e. the mean age of males at first marriage is 24.92 years.
Example 2: The distribution of the size of the holding of cultivated land in an area, was as
follows:
Size of holdings Mid points(x) No of
holdings(f)
0-2 1 48
2-4 3 19
4-6 5 10
6-8 7 14
8-10 9 11
10-20 15 9
20-40 30 2
40-60 50 1
Average size of holding in the area can be calculated as follows, midpoint of the class
intervals are shown in the middle column along with the data. Hence,
A.M = Σfixi / Σfi = 597 / 114
= 5.237
i.e. the average size of holding is 5.237 hectares.
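The grouped-mean arithmetic of this example is easy to verify in Python:

```python
midpoints = [1, 3, 5, 7, 9, 15, 30, 50]    # mid values of the classes
freqs     = [48, 19, 10, 14, 11, 9, 2, 1]  # number of holdings

# A.M. = sum(f * x) / sum(f)
mean = sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)
print(round(mean, 3))   # 5.237
```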
Solution:
x̄ = A + (Σfd / N)
= 28 − 2.596 = 25.404
Uses:
It is most popular and simple estimate and used widely in almost all fields of studies
such as social science, economics, business, agriculture, medical science, engineering and
such other science.
Weighted mean:
When different observations are to be given different weights, arithmetic mean does
not prove to be good measure of central tendency. In such cases weighted mean is calculated.
If x1, x2, x3, …, xn are the different observations and W1, W2, W3, …, Wn are the
respective weights, then
W.M = (W1x1 + W2x2 + … + Wnxn) / (W1 + W2 + … + Wn) = ΣWixi / ΣWi
Solution:
X f log x f (log x)
110 4 2.0414 8.1656
115 11 2.0607 22.6677
118 21 2.0719 43.5099
119 6 2.0755 12.4530
120 2 2.0792 4.1584
Total 44 90.9546
G.M = Antilog (Σf log x / Σf) = Antilog (90.9546 / 44) = Antilog (2.0672) = 116.7
Merits /Advantages:
Disadvantages:
1. If any one of the observations is zero, then G.M does not exist.
2. If any one of the observations is negative, G.M is not defined (for real values).
Harmonic mean:
The Harmonic mean of n values is the reciprocal of the arithmetic mean of the
reciprocals of the given values. It is denoted by H.
H.M = n / Σ(1/xi)
Example: Calculate the Harmonic mean of 9.7, 9.8, 9.5, 9.4 and 9.7.
Solution:
X 1/x
9.7 0.1031
9.8 0.1020
9.5 0.1053
9.4 0.1064
9.7 0.1031
Total 0.5199
H.M = n / Σ(1/x) = 5 / 0.5199 = 9.617
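Python's statistics module computes the harmonic mean directly, which can be used to verify the table above:

```python
import statistics

values = [9.7, 9.8, 9.5, 9.4, 9.7]
# H.M. = n / sum(1 / x)
manual = len(values) / sum(1 / x for x in values)
hm = statistics.harmonic_mean(values)
print(round(hm, 3))   # 9.618
```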
1. HM satisfies the test of rigid definition. Its definition is precise and its value is always
definite.
2. Like AM and GM, this average is also based on all the observations of the series. It
cannot be calculated in the absence of even a single figure.
3. HM is capable of further algebraic treatment.
4. It is least affected by fluctuations of sampling.
5. It gives greater importance to small items, so that a single big item cannot push up
its value.
Drawbacks:
Applications:
Median:
Median of a set of values is the middle most value when they are arranged in the
ascending order of magnitude. (Such an arrangement is called an array). It is a value that is
greater than half of the values and lesser than the remaining half. The median is denoted by
M.
In the case of raw data and also a discrete frequency distribution, if the number of
observations is odd then the median is the middle value after the values have been arranged
in ascending or descending order of magnitude. In case of an even number of observations,
there are two middle terms and the median is obtained by taking the arithmetic mean of the
two middle terms.
Example 1: The median of the values 25, 20, 15, 35, 18, i.e., of 15, 18, 20, 25, 35, is 20.
Example 2: The median of 8, 20, 50, 25, 15, 30, i.e., of 8, 15, 20, 25, 30, 50, is
(20 + 25)/2 = 22.5.
For a grouped frequency distribution:
• See the less than cumulative frequency (c.f.) just greater than N/2; the corresponding
class is the median class.
Md = L + ((N/2 − c.f.) / f) × C
where L = lower limit of the median class, c.f. = cumulative frequency of the class
preceding the median class, f = frequency of the median class and C = class interval.
Merits of Median:
1. It is rigidly defined.
2. It is easily understood.
3. It is simple to calculate.
4. It is not affected by extreme values.
5. It can be calculated for open end classes.
Demerits of median:
Properties of Median:
Applications of Median:
1. Median is the only average to be used while dealing with qualitative data.
2. It is to be used for determining the typical value in problems concerning wages,
distribution of wealth, etc.
3. The median is the most commonly quoted figure used to measure property prices.
4. The use of the median avoids the problem of the mean property price which is
affected by a few expensive properties that are not representative of the general
property market.
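Both median rules (odd and even n) are implemented by Python's statistics.median, so the two worked examples can be checked directly:

```python
import statistics

# Odd number of observations: the middle value of the array
print(statistics.median([25, 20, 15, 35, 18]))     # 20
# Even number: arithmetic mean of the two middle values, (20 + 25) / 2
print(statistics.median([8, 20, 50, 25, 15, 30]))  # 22.5
```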
Mode:
Mode is the value which occurs most frequently in a set of observations and around
which the other items of the set cluster densely. In other words, mode is the value of the
variable which is predominant in the series. Thus, in the case of discrete frequency
distribution, mode is the value of x corresponding to maximum frequency.
Mode = L + ((f1 − f0) / (2f1 − f0 − f2)) × C
Where,
L = lower limit of modal class.
f1 = frequency of modal class.
f2 = frequency of the succeeding class.
f0 = frequency of the preceding class.
C = class interval
Modal class is the class which has got highest frequency.
Ungrouped Data:
The following are the number of children for 20 couples. Find the mode
No. of children per couple: 2, 3, 6, 3, 4, 0, 5, 2, 2, 4, 3, 2, 1, 0, 4, 2, 2, 1, 1, 3.
Here the value 2 appears the most times (highest frequency). Therefore, the mode is 2.
Discrete frequency distribution:
X Frequency (f)
12 5
10 3
14 2
15 6 (modal value)
The highest frequency is 6. Hence 15 is the mode.
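For ungrouped data the mode is simply the most frequent value; statistics.mode confirms the children example:

```python
import statistics

children = [2, 3, 6, 3, 4, 0, 5, 2, 2, 4, 3, 2, 1, 0, 4, 2, 2, 1, 1, 3]
print(statistics.mode(children))   # 2 (occurs 6 times)
```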
Advantages:
Disadvantages:
Applications of Mode:
1. The mode has applications in printing. For example, it is important to print more of
the most popular books; because printing different books in equal numbers would
cause a shortage of some books and an oversupply of others.
2. Likewise, the mode has applications in manufacturing. For example, it is important to
manufacture more of the most popular shoes; because manufacturing different shoes
in equal numbers would cause a shortage of some shoes and an oversupply of others.
Relationship between mean, median and mode for a slightly skew distribution:
Mean − Mode = 3 (Mean − Median).
Median:
1. The variable is discrete.
2. Some of the extreme values are missing.
3. There are abnormal extreme values.
4. Mode is ill-defined.
5. The characteristic under study is qualitative.
6. The data are on the ordinal scale.
Mode:
1. Modal value has very high frequency compared to other frequencies.
2. Some of the extreme values are missing.
3. The variable is discrete.
4. There are abnormal extreme values.
5. The characteristic under study is qualitative.
6. The data are on nominal scale.
Geometric mean:
1. The variable is multiplicative in nature.
Harmonic mean:
1. The reciprocal of the variable is additive in nature.
Measures of Dispersion:
Though mean is the important concept in statistics, it does not give a clear picture as
to how the different observation are distributed in a given distribution or the series under the
study. Consider the following series:
Series observations Mean
1 2,3,4,7 4
2 4,4,4,4 4
3 1,1,2,12 4
4 3,4,4,5 4
In the above series, the mean is the same, i.e. 4, but the observations are spread about the
mean in different manners. Hence, after locating a measure of central tendency, the next
step is to find out how the observations are scattered around it. This can be done by
measuring the spread, which is also called the scatter, variation or dispersion of the variate
values.
Definition:
Dispersion may be defined as the extent of the scatter of observations around a measure of
central tendency, and a measure of such scatter is called a measure of dispersion.
The different measures of dispersion are as follows:
1. Range.
2. Quartile deviation
3. Absolute mean deviation or absolute deviation (A.D or A.M.D).
4. Standard deviation(S).
Characteristics of satisfactory measures of dispersion:
Measure of dispersion should possess all those characteristics which are considered
essential for measures of central tendency viz.
1. It should be based on all the observations.
2. It should be readily comprehensible.
3. It should be fairly easily calculated.
4. It should be simple to understand.
5. It should not be affected by extreme values.
6. It should not be affected by sampling fluctuations.
7. It should be amenable to algebraic treatments.
Measure of dispersion:
Absolute:
Measures the dispersion in the original unit of the data. Variability in two or more
distributions can be compared only if they are given in the same unit and have the
same average.
Relative:
A measure of dispersion that is free from the unit of measurement of the data. It is
also called a coefficient of dispersion.
Range:
The range is the difference between the two extreme observations of the distribution. If A
and B are the greatest and smallest observations respectively in a distribution, then the
range is given by:
Range = A − B, and Coefficient of Range = (A − B) / (A + B).
Range is the simplest but a crude measure of dispersion; it is easy to obtain and involves
almost no calculation. We can use the range when the number of observations is less than
five or when the data are on an ordinal scale. Since it is based on two extreme observations,
which are themselves subject to chance fluctuations, it is not at all a reliable measure of
dispersion.
Example: for the following distribution of age of 10 pre-university students, find the range
and the coefficient of range.
16, 18, 18, 16, 18, 20, 17, 19, 16, 24.
Solution:
Range = A − B = 24 − 16 = 8
Coefficient of Range = (24 − 16) / (24 + 16) = 8 / 40 = 0.2
Quartile deviation:
The semi-interquartile range (SIR), or quartile deviation (Q.D), is defined as half the
difference between the third and first quartiles:
Q.D = (Q3 − Q1) / 2
The first quartile is the 25th percentile and the third quartile is the 75th percentile.
It is definitely a better measure than the range as it makes use of central 50% of data.
But since it ignores other 50% of the data, it cannot be regarded as a reliable measure.
Example: For the following data, find the Q.D and Coefficient of Q.D:
36, 43, 30, 37, 38, 35, 29, 38, 35, 32, 35, 36.
Solution:
To find the Q.D, firstly the lower and the upper quartiles should be obtained.
Array: 29, 30, 32, 35, 35, 35, 36, 36, 37, 38, 38, 43.
The lower quartile is Q1 = [(n+1)/4]th value in the array = 3.25th value = 32 + 0.25(35 − 32) =
32.75.
The upper quartile is Q3 = [3(n+1)/4]th value in the array = 9.75th value = 37 + 0.75(38 − 37) =
37.75.
Q.D = (Q3 − Q1) / 2 = (37.75 − 32.75) / 2 = 2.5
Coefficient of Q.D = (Q3 − Q1) / (Q3 + Q1) = 5 / 70.5 = 0.0709
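The (n + 1)/4 positional rule with linear interpolation used above can be written out in Python. This is a sketch of this particular rule; statistical packages often use slightly different quartile conventions:

```python
def quartile(sorted_data, q):
    """q-th quartile at position (n + 1) * q / 4 with linear interpolation."""
    pos = (len(sorted_data) + 1) * q / 4
    i = int(pos)                      # integer part of the position (1-based)
    frac = pos - i                    # fractional part
    lower = sorted_data[i - 1]
    upper = sorted_data[min(i, len(sorted_data) - 1)]
    return lower + frac * (upper - lower)

data = sorted([36, 43, 30, 37, 38, 35, 29, 38, 35, 32, 35, 36])
q1 = quartile(data, 1)     # 32.75
q3 = quartile(data, 3)     # 37.75
qd = (q3 - q1) / 2         # quartile deviation = 2.5
print(q1, q3, qd)
```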
Mean deviation:
The mean deviation is an average of absolute deviations of individual observations
from the central value of a series.
MD (x̄) = Σ|xi − x̄| / n, about mean
MD (Md) = Σ|xi − Md| / n, about median.
Calculate the mean deviation for the data: 72, 85, 87, 89, 90, 93
Solution: Mean x̄ = 516 / 6 = 86
xi x̄ (xi − x̄) |xi − x̄|
72 86 -14 14
85 86 -1 1
87 86 1 1
89 86 3 3
90 86 4 4
93 86 7 7
Total 516 0 30
MD (x̄) = Σ|xi − x̄| / n = 30 / 6 = 5
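The mean-deviation computation in the table reduces to two lines of Python:

```python
data = [72, 85, 87, 89, 90, 93]
mean = sum(data) / len(data)                         # 516 / 6 = 86
md = sum(abs(x - mean) for x in data) / len(data)    # 30 / 6 = 5
print(mean, md)
```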
Example: Calculate the mean deviation from the median for the data: 12, 14, 16, 16, 18, 19, 20.
Solution: n = 7 (odd), so the median is the middle (4th) value, Md = 16.
x |x − Md|
12 4
14 2
16 0
16 0
18 2
19 3
20 4
Total 15
MD (Md) = Σ|x − Md| / n = 15 / 7 = 2.143
Note: Mean deviation about median is less than or equal to M.D about mean.
Standard deviation(S):
The standard deviation or “root of mean square deviation” is the most common and
efficient estimator used in statistics. It is based on deviation from arithmetic mean and is
denoted by S (Standard deviation of sample) or σ (Standard deviation of population).
Definition:
It is a square root of a ratio of sum of square of deviation calculated from arithmetic
mean to the total number of observations.
Method of calculations:
A. Ungrouped data:
1) Deviation method:
S = √[ Σ(xi − x̄)² / (n − 1) ]
2) Direct (short-cut) method:
S = √[ (Σxi² − (Σxi)²/n) / (n − 1) ]
Example:
1. Find the standard deviation for the given data:
Family No. 1 2 3 4 5 6 7 8 9 10
Size (xi) 3 3 4 4 5 5 6 6 7 7
Family no 1 2 3 4 5 6 7 8 9 10 Total
xi 3 3 4 4 5 5 6 6 7 7 50
xi − x̄ -2 -2 -1 -1 0 0 1 1 2 2 0
(xi − x̄)² 4 4 1 1 0 0 1 1 4 4 20
Mean x̄ = 50 / 10 = 5
S = √[ Σ(xi − x̄)² / (n − 1) ] = √(20/9) = 1.49
2. Find the standard deviation for the following grouped data:
xi fi xifi (xi − x̄) (xi − x̄)² fi(xi − x̄)²
3 2 6 -3 9 18
5 3 15 -1 1 3
7 2 14 1 1 2
8 2 16 2 4 8
9 1 9 3 9 9
Total 10 60 - - 40
Mean x̄ = 60 / 10 = 6
S = √[ Σfi(xi − x̄)² / (N − 1) ] = √(40/9) = 2.11
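Both standard-deviation examples can be verified in Python using the sample (n − 1) divisor, assuming that is the convention intended here:

```python
import math
import statistics

# Ungrouped: the ten family sizes
sizes = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]
s = statistics.stdev(sizes)                  # sqrt(20 / 9) ~ 1.49

# Grouped: frequencies weight the squared deviations
xs = [3, 5, 7, 8, 9]
fs = [2, 3, 2, 2, 1]
n = sum(fs)
mean = sum(f * x for f, x in zip(fs, xs)) / n            # 60 / 10 = 6
ss = sum(f * (x - mean) ** 2 for f, x in zip(fs, xs))    # 40
s_grouped = math.sqrt(ss / (n - 1))          # sqrt(40 / 9) ~ 2.11
print(round(s, 2), round(s_grouped, 2))
```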
Mathematical properties:
1) The sum of square of the deviations of items in the series from their arithmetic mean
is minimum. This is the reason why standard deviation is always computed from
arithmetic mean.
2) Addition or subtraction of a constant from the group of an observation will not change
the value of S.D.
3) Multiplying or dividing each observation of a given series by a constant value will
multiply or divide the standard deviation by the same constant.
Note: Standard deviation is independent of change of origin but dependent on change of
scale.
Variance:
Variance is the square of standard deviation. It is also called the “mean square
deviation”. It’s being used very extensively in analysis of variance of result from field
experiment. Symbolically denoted by S2 is sample variance and σ2 is population variance.
Methods of calculation:
1. Ungrouped data:
1) Deviation method:
S² = Σ(xi − x̄)² / (n − 1)
2. Grouped data:
S² = Σfi(xi − x̄)² / (N − 1), where xi is the mid value of the ith class.
Properties of variance:
1) If V(X) represents the variance of an X series and V(Y) represents the variance of a Y
series, then V(X ± Y) = V(X) + V(Y), provided X and Y are independent.
2) Multiplying or dividing each observation by a constant will multiply or divide the
variance by square of that constant.
E.g. V (ax)=a2V(x)
3) Addition or subtraction of a constant from each observation will not change the
value of the variance.
Coefficient of variation:
It is a relative measure of variation and widely used to compare two or more statistical
series.
The statistical series may differ from one another with respect to their mean or
standard deviation or both. Sometimes they may also differ with respect to their units and
then their comparison is not possible. To have a comparable idea about the variability present
in them C.V% is used. It was developed by Karl Pearson.
Definition:
“It is the percentage ratio of the standard deviation to the arithmetic mean of a given
series”:
C.V% = (S / x̄) × 100
It is a unitless measure.
The series for which the C.V% is greater is said to be more variable, i.e. less consistent,
less homogeneous or less stable, while the series having the lower C.V% is more consistent
or more homogeneous.
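A small sketch of how C.V% is used to compare the consistency of two series; the yield figures below are hypothetical:

```python
import statistics

def cv_percent(series):
    """Coefficient of variation: (S / mean) * 100, using the sample S.D."""
    return statistics.stdev(series) / statistics.mean(series) * 100

# Hypothetical yields (q/ha) of two varieties with the same mean
variety_a = [22, 25, 24, 23, 26]
variety_b = [18, 30, 20, 28, 24]
# The series with the smaller C.V% is the more consistent one
print(round(cv_percent(variety_a), 2), round(cv_percent(variety_b), 2))
```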
Moments, Skewness and Kurtosis
The rth moment of a set of values about any constant is the mean of the rth powers of
the deviations of the values from the constant.
Moments about any constant can be found. The moments about the arithmetic mean
are called central moments. The moments about any other constant are called raw moments.
The central moments are denoted by µ1, µ2, . . . and the raw moments are denoted by
µ1′, µ2′, . . . .
In case of raw data, the rth central moment and the rth raw moment about a constant a are:
µr = Σ(xi − x̄)^r / n and µr′ = Σ(xi − a)^r / n
For a frequency distribution:
µr = Σfi(xi − x̄)^r / N and µr′ = Σfi(xi − a)^r / N
The first four moments of a distribution are useful in the study of the distribution.
For a frequency distribution, four constants based on the central moments are defined. They
are:
β1 = µ3² / µ2³, β2 = µ4 / µ2²
γ1 = ±√β1 (taking the sign of µ3), γ2 = β2 − 3.
Skewness:
In a frequency distribution, the spread of the values may be symmetrical around the
center or it may not be so. If the values are not distributed symmetrically around the center,
the distribution is said to be skew. Thus, skewness means asymmetry or non-symmetry (lack
of symmetry).
Coefficient of skewness is a measure which indicates the degree of skewness, it may
be positive, zero or negative. It will be positive if the right tail of the distribution is longer
than the left tail. It will be negative if the left tail is longer than right tail. For a symmetrical
distribution, the coefficient of skewness will be zero. According as coefficient of skewness is
positive or negative, the distribution is said to be positively or negatively skew.
The mean, median and mode of a symmetrical distribution are equal. For such a
distribution, the lower and upper quartiles are equidistant from the median. For a positively
skew distribution, the median and the mode are less than the mean. For a negatively skew
distribution, the median and the mode are greater than the mean.
(Figure: symmetrical distribution)
Coefficient of skewness:
Karl Pearson’s coefficient: Skp = (Mean − Mode) / S.D
Moment coefficient: β1 = µ3² / µ2³
Example-1 Calculate the Karl Pearson’s coefficient of skewness for the distribution having
mean 83.8, mode 82.64 and S.D 2.336.
Solution:
Skp = (Mean − Mode) / S.D = (83.8 − 82.64) / 2.336 = 0.496.
Thus, the coefficient of skewness is positive. Therefore, the distribution is positively skew.
Example-2 For a frequency distribution, the sum of upper and the lower quartiles is 25. Their
difference is 13. The median is 10. Find the coefficient of skewness.
Solution:
Bowley’s coefficient: Sb = (Q3 + Q1 − 2 Median) / (Q3 − Q1) = (25 − 2 × 10) / 13 = 5/13 = 0.3846.
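Both coefficients from the two examples can be verified in Python:

```python
# Karl Pearson's coefficient: (mean - mode) / S.D.
skp = (83.8 - 82.64) / 2.336      # ~ 0.4966

# Bowley's coefficient: (Q3 + Q1 - 2 * median) / (Q3 - Q1),
# with Q3 + Q1 = 25, Q3 - Q1 = 13 and median = 10
sb = (25 - 2 * 10) / 13
print(round(skp, 3), round(sb, 4))
```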
Kurtosis:
A frequency distribution may show high concentration at the center compared to that
at the extremities. On the other hand, another frequency distribution may show almost equal
concentration throughout the range. Here, the first distribution is said to have high kurtosis
compared to the latter.
Kurtosis means peakedness (non-flatness). The coefficient of kurtosis is a measure
which indicates the degree of peakedness of the distribution. The constant β2 = µ4 / µ2² is
the measure of kurtosis: for a normal (mesokurtic) distribution β2 = 3, for a leptokurtic
(more peaked) distribution β2 > 3, and for a platykurtic (flatter) distribution β2 < 3.
Example: In a frequency distribution, the first four central moments are 0, 4, -2 and 2.4.
Comment on the skewness and kurtosis of the distribution.
Solution:
Here, µ1 = 0, µ2 = 4, µ3 = −2 and µ4 = 2.4.
Therefore,
β1 = µ3² / µ2³ = 4 / 64 = 0.0625.
β2 = µ4 / µ2² = 2.4 / 16 = 0.15.
Since µ3 is negative, the distribution is negatively skew. Also, since β1 = 0.0625 is
very small, the distribution is only slightly skew. Since β2 = 0.15 < 3, the distribution
is platykurtic.
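The β-coefficients of this example in Python:

```python
# Central moments from the example: mu2 = 4, mu3 = -2, mu4 = 2.4
mu2, mu3, mu4 = 4, -2, 2.4
beta1 = mu3 ** 2 / mu2 ** 3   # 4 / 64 = 0.0625
beta2 = mu4 / mu2 ** 2        # 2.4 / 16 = 0.15
gamma2 = beta2 - 3            # negative, i.e. flatter than the normal curve
print(beta1, beta2)
```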
Probability
Trial and Event: An experiment which, though repeated under essentially identical (or)
same conditions does not give unique results but may result in any one of the several possible
outcomes. Performing an experiment is known as a trial and the outcomes of the experiment
are known as events.
Example:
Random experiment:
Random experiment is an experiment which may not result in the same outcome when
repeated under the same conditions. It is an experiment which does not have a unique
outcome.
For example:
1. The experiment of Toss of coin is a random experiment. It is so because when a coin
is tossed the result may be head or it may be tail.
2. The experiment of drawing a card randomly from a pack of playing cards is a random
experiment. Here, the result of the draw may be any one of the 52 cards.
Sample space (S): A set of all possible outcomes from an experiment is called sample space.
For example, if a set of five seeds is sown in a plot, none may germinate, or 1, 2, 3, 4 or all
five may germinate; i.e. the possible outcomes are {0, 1, 2, 3, 4, 5}. This set is called the
sample space. Each possible outcome (or element) in a sample space is called a sample point.
Exhaustive Events: The total number of possible outcomes in any trial is known as
exhaustive events (or) exhaustive cases.
Example:
1. When pesticide is applied a pest may survive or die. There are two exhaustive cases
namely (survival, death)
2. In throwing of a die, there are six exhaustive cases, since anyone of the 6 faces 1, 2, 3, 4, 5,
6 may come uppermost.
Favourable Events: The number of cases favourable to an event in a trial is the number of
outcomes which entail the happening of the event.
Example:
If an event A can happen in m ways out of n exhaustive, mutually exclusive and equally
likely ways, then the probability of A is:
P = P(A) = m / n = (number of favourable cases) / (total number of exhaustive cases)
Note:
1. If m = 0 ⇒ P(A) = 0, then ‘A’ is called an impossible event. (i.e.) also by P(φ) = 0.
2. If m = n ⇒ P(A) = 1, then ‘A’ is called a sure (or certain) event.
3. The probability is a non-negative real number and cannot exceed unity (i.e.) lies
between 0 to 1.
4. The probability of non-happening of the event ‘A’, i.e. P(Ā), is denoted by ‘q’.
P(Ā) = (n − m) / n = 1 − m/n = 1 − P(A)
⇒q=1–p
⇒ p + q = 1 (or)
P (A) + P( ) = 1.
3. If A and B are mutually exclusive (or) disjoint events then the probability of
occurrence of either A (or) B denoted by P(AUB) shall be given by,
P(A∪B) = P(A) + P(B)
P (E1∪E2∪…. ∪En) = P(E1) + P(E2) +……+ P(En)
If E1, E2, …., En are mutually exclusive events.
Example: Two dice are thrown. Find the probability that the sum of the points is (i) 6, (ii) 9.
(i) Sum 6 = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}
∴ Favourable number of cases = 5
P (Sum 6) = 5/36
(ii) Sum 9 = {(3, 6), (4, 5), (5, 4), (6, 3)}
∴ Favourable number of cases = 4
P (Sum 9) = 4/36 = 1/9
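The dice probabilities follow from simply enumerating the 36 equally likely outcomes:

```python
from fractions import Fraction

# All 36 equally likely outcomes of throwing two dice
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]

def p_sum(s):
    """Classical probability: favourable cases / exhaustive cases."""
    favourable = sum(1 for a, b in outcomes if a + b == s)
    return Fraction(favourable, len(outcomes))

print(p_sum(6))   # 5/36
print(p_sum(9))   # 1/9  (i.e. 4/36)
```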
Conditional Probability:
Two events A and B are said to be dependent, when B can occur only when A is
known to have occurred (or vice versa). The probability attached to such an event is called
the conditional probability and is denoted by P (A/B) (read it as: A given B) or, in other
words, probability of A given that B has occurred.
P(A/B) = P(A∩B) / P(B), provided P(B) > 0.
If two events A and B are dependent, then the conditional probability of B given A is,
P(B/A) = P(A∩B) / P(A), provided P(A) > 0.
Example-1: If A is the event “drawing an ace from a deck of cards” and B is the event
“drawing a King”. Find the probability of getting either ace or king.
Solution:
P(A) = 4/52 and P(B) = 4/52. Since A and B are mutually exclusive,
P(A∪B) = P(A) + P(B) = 4/52 + 4/52 = 8/52 = 2/13
Example-2: If A is the event “drawing an ace from a deck of cards” and B is the event
“drawing a spade”. Find the probability of getting either an ace or spade.
Solution:
Here A and B are not mutually exclusive, since the ace of spades can be drawn. Thus,
the probability of drawing either an ace or a spade is:
P(A∪B) = P(A) + P(B) − P(A∩B) = 4/52 + 13/52 − 1/52 = 16/52 = 4/13
Example-3: A class contains 10 men and 20 women, of which half the men and half the
women have brown eyes. Find the probability “P” that a person chosen at random is a man or
has brown eyes.
Solution:
Let A = person is a man and B = person has brown eyes. Then P(A) = 10/30,
P(B) = 15/30 and P(A∩B) = 5/30.
Therefore, P(A∪B) = P(A) + P(B) − P(A∩B) = 10/30 + 15/30 − 5/30 = 20/30 = 2/3
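The general addition rule P(A∪B) = P(A) + P(B) − P(A∩B) for both overlapping-event examples, computed with exact fractions:

```python
from fractions import Fraction

# Ace or spade: the events overlap in the ace of spades
p_ace, p_spade, p_both = Fraction(4, 52), Fraction(13, 52), Fraction(1, 52)
p_ace_or_spade = p_ace + p_spade - p_both
print(p_ace_or_spade)      # 4/13

# Man or brown eyes: 10 men, 15 brown-eyed persons, 5 brown-eyed men (of 30)
p_man_or_brown = Fraction(10, 30) + Fraction(15, 30) - Fraction(5, 30)
print(p_man_or_brown)      # 2/3
```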
Multiplication theorem on Probability:
Let A and B be two events with respective probability P(A) and P(B). let P(B|A) be
the conditional probability of event B given the event A has happened. Then, the probability
of simultaneous occurrence of A and B is given by:
1. If A and B be any two events which are not independent, (i.e.) dependent.
P (A and B) = P (A∩B) = P (AB) = P (A). P (B/A)
2. If A and B be any two events which are independent.
P (A and B) = P (A∩B) = P (AB) = P (A) X P (B)
Example-1: If A is the event “getting heads in second toss” and B is the event “getting heads
in third toss”. Find the probability of getting heads on both the 2nd and 3rd tosses.
Solution:
P(A∩B) =P(A) x P(B)= 1/2*1/2 = 1/4
Example-2: If the probability that A will be alive in 20 years is 0.7 and the probability that B
will be alive in 20 years is 0.5, Find the probability of that they will both be alive in 20 years.
P(A∩B) =P(A) x P(B) = 0.7*0.5 = 0.35.
Theoretical Distributions
Random Variable
Random variable is a function which assigns a real number to every sample point in
the sample space. The set of such real values is the range of the random variable.
There are two types of random variables.
1. Discrete random variable:
A random variable X which takes only countable (e.g., whole-number) values is called a
discrete random variable.
Eg: number of students in a class, number of fruits per plant etc.
2. Continuous random variable:
A random variable whose range is uncountably infinite is a continuous random
variable.
Eg: plant height, weight of a person etc.
Example: Two coins are tossed once. Find the mathematical expectation of the number of
heads obtained.
Solution:
Let X denotes the number of heads obtained. Then, X is a random variable which
takes the values 0, 1 and 2 with respective probabilities 0.25, 0.5 and 0.25. That is
x 0 1 2
p(x) ¼ ½ ¼
The mathematical expectation of the number of heads is:
E(X) = Σ x p(x) = 0 × 0.25 + 1 × 0.5 + 2 × 0.25 = 1.
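The expectation sum above, written out:

```python
# Number of heads in two tosses and the corresponding probabilities
probs = {0: 0.25, 1: 0.5, 2: 0.25}
ev = sum(x * p for x, p in probs.items())   # E(X) = sum of x * p(x)
print(ev)   # 1.0
```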
The important theoretical distributions are:
1. Binomial distribution
2. Poisson distribution (discrete probability distributions)
β-coefficients of the binomial distribution:
a. β1 = (1 − 2p)² / npq
b. β2 = 3 + (1 − 6pq) / npq
Application
1. Quality control measures and sampling process in industries to classify items as
defectives or non-defective.
2. Medical applications such as success or failure, cure or no-cure.
Poisson Distribution:
A random variable which can take only one discrete value in an interval of time,
howsoever small, is known as a Poisson variable.
The Poisson distribution is the limiting form of the binomial probability distribution
when ‘n’ becomes infinitely large and ‘p’ approaches 0 in such a way that np = λ remains
constant. Such situations are fairly common. That is to say, a Poisson distribution may be
expected in cases where the chance of any individual event being a success is rare. Some
examples of Poisson variables are:
1. Number of mistakes in a typed page;
2. Number of cars parked at a place in an hour, say between 10.00 AM and 11.00
AM;
3. Number of defects in the insulation of a fifty-meter length of wire;
4. Number of suicides in a certain period in a city or town;
5. Occurrence of rare events such as serious floods, drought etc.
Like binomial distribution, the variate of the Poisson distribution is also a discrete
one. The probability function is given by:
P(x) = e^(−λ) λ^x / x!, for x = 0, 1, 2, . . .
= 0, otherwise.
Where λ is the average number of occurrences per unit of time,
λ = np
Condition for Poisson distribution
Poisson distribution is the limiting case of binomial distribution under the following
assumptions.
1. The number of trials n should be indefinitely large, i.e., n → ∞.
2. The probability of success p for each trial should be indefinitely small.
3. np = λ should be finite, where λ is a constant.
Properties of Poisson distribution:
1. Poisson distribution has mean λ and its variance is also λ. It is the only distribution
known so far of which the mean and variance are equal.
2. Poisson distribution possesses only one parameter, λ.
3. β-coefficients:
a. β1 = 1/λ
b. β2 = 3 + 1/λ
Application
1. It is used in quality control statistics to count the number of defects of an item.
2. In biology, to count the number of bacteria.
3. In determining the number of deaths in a district in a given period, by rare disease.
4. The number of errors per page in typed material.
5. The number of plants infected with a particular disease in a plot of a field.
6. The number of weeds of a particular species in different plots of a field.
Example-1
Suppose at a particular place, the average number of cars parked per hour is 3. Under the
Poisson model, calculate the probability of 5 cars parked in a particular hour.
Example-2
The number of mistakes counted in one hundred typed pages of a typist revealed that he made
2.8 mistakes on an average per page. Calculate the probability that, in a page typed by him,
1. There is no mistake.
2. There are two or fewer mistakes.
Solution: Given that λ = 2.8.
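Both Poisson examples can be checked numerically. The short sketch below (Python is used only for illustration; it is not part of the original notes) simply re-applies the probability function P(x) = e^(−λ)λ^x/x! given above:

```python
import math

def poisson_pmf(x, lam):
    # P(X = x) = e^(-lam) * lam^x / x!
    return math.exp(-lam) * lam ** x / math.factorial(x)

# Example-1: lambda = 3 cars per hour; probability of exactly 5 cars
p5 = poisson_pmf(5, 3.0)
print(round(p5, 4))                                   # -> 0.1008

# Example-2: lambda = 2.8 mistakes per page
p0 = poisson_pmf(0, 2.8)                              # no mistake
p_le2 = sum(poisson_pmf(k, 2.8) for k in range(3))    # two or fewer mistakes
print(round(p0, 4), round(p_le2, 4))                  # -> 0.0608 0.4695
```

The answers for Example-2 are P(0) ≈ 0.0608 and P(X ≤ 2) ≈ 0.4695.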
normal distribution with mean 0 and standard deviation 1, i.e., Z ~ N(0, 1). The standard
normal distribution is given by:
f(z) = (1/√(2π)) e^(−z²/2), −∞ < z < ∞
The advantage of the above function is that it doesn’t contain any parameter. This
enables us to compute the area under the normal probability curve.
Example 1: In a normal distribution whose mean is 12 and standard deviation is 2. Find the
probability for the interval from x = 9.6 to x = 13.8
Solution:
Given that X ~ N(12, 4), i.e., µ = 12 and σ = 2.
P(9.6 ≤ X ≤ 13.8) = P((9.6 − 12)/2 ≤ Z ≤ (13.8 − 12)/2) = P(−1.2 ≤ Z ≤ 0.9)
= P(−1.2 ≤ Z ≤ 0) + P(0 ≤ Z ≤ 0.9)
= P(0 ≤ Z ≤ 1.2) + P(0 ≤ Z ≤ 0.9) [by the symmetry property]
= 0.3849 + 0.3159
= 0.7008
Converted to a percentage, about 70% of the observations lie between 9.6 and 13.8.
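The same area can be computed without tables, using the relation Φ(z) = ½(1 + erf(z/√2)) for the standard normal CDF. A sketch (not part of the original notes):

```python
import math

def phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 12.0, 2.0
# P(9.6 <= X <= 13.8) = Phi(0.9) - Phi(-1.2)
p = phi((13.8 - mu) / sigma) - phi((9.6 - mu) / sigma)
print(round(p, 4))   # -> 0.7009 (the four-figure tables give 0.7008)
```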
Sampling Method and Sampling Distribution
Sample: Sample is a part or fraction of a population selected on some basis. Sample consists
of a few items of a population. In principle a sample should be such that it is a true
representative of the population.
Sampling method: By sampling method we mean the manner or scheme through which the
required number of units is selected in a sample from a population.
Sampling unit: The constituents of a population which are the individuals to be sampled from
the population and cannot be further subdivided for the purpose of sampling are called
sampling units. For instance, to know the average income per family, the head of the family is
a sampling unit. To know the average yield of wheat, each farm owner’s field of wheat is a
sampling unit.
[Table of all possible samples of size 2 and their means; partially lost in extraction. Surviving rows: sample 1 (1,1) mean 2; sample 3 (1,3) mean 3; sample 4 (1,4) mean 4; sample 9 (3,1) mean 3; sample 11 (3,3) mean 4; sample 12 (3,4) mean 5.]
The average of the sample means on repeated drawing equals the population mean µ. The
variance of the sample mean is σ²/n for sampling with replacement, and (σ²/n)(N − n)/(N − 1)
if the sampling is without replacement. The square root of the variance of the
sample means is called the standard error of the sample mean and is denoted by SE(x̄).
If the sample is taken from a normal population, the distribution of the sample mean is
normal, even for small values of n. When the population from which the sample is drawn is
non-normal, for large values of n, the central limit theorem ensures that the sample mean will
be normally distributed. For all practical purposes, we can treat x̄ as normally distributed
with mean µ and standard deviation σ/√n.
If the units are selected or drawn one by one in such a way that a unit drawn at a time is
replaced back to the population before the subsequent draw, it is known as simple random
sampling with replacement method. In this type of sampling from a population of size N, the
probability of selection of a unit at each draw remains 1/N. In this method, a unit can be
included more than once in a sample. Therefore, if the required sample size is n, the effective
sample size is sometimes less than n due to the inclusion of one or more units more than
once.
In the SRS without replacement method, a unit selected once is not included in the population
at any subsequent draw. Hence, the probability of drawing a unit from a population of N units
at the r-th draw is 1/(N − r + 1).
Random selection of units is done using any one of the following methods:
1. Using tickets, tags, etc.
2. Using random number table.
Examples of SRS:
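As an illustration of the two SRS schemes, the sketch below draws a sample of n = 10 from a hypothetical frame of N = 50 units (the frame and the random seed are invented for the example):

```python
import random

population = list(range(1, 51))   # hypothetical sampling frame, N = 50
random.seed(1)                    # fixed seed so the illustration is reproducible

# SRS without replacement: each unit can enter the sample only once
srswor = random.sample(population, 10)

# SRS with replacement: a unit may be drawn more than once, so the
# effective sample size can be smaller than n
srswr = [random.choice(population) for _ in range(10)]

print(srswor)
print(srswr, "effective size:", len(set(srswr)))
```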
Systematic sampling:
In this method of sampling, the first unit is selected with the help of simple random
sampling and the remaining units are selected automatically according to a predetermined
pattern. This method is known as systematic sampling. It is a way to select a probability-
based sample from a directory or list. This method is more efficient than simple random
sampling. Because of its simplicity, systematic sampling is popular with researchers.
Advantages:
1. It is easier to draw a sample and often easier to execute it without mistakes.
2. This is more advantageous when the drawing is done in fields and offices as there
may be substantial saving in time.
3. The cost is low and the selection of units is simple.
4. Much less training is needed for surveyors to collect units through systematic
sampling.
5. The systematic sample is spread more evenly over the population.
6. More precise than simple random sampling.
Disadvantages:
1. Systematic sampling can be applied only if the complete list of population is
available.
2. Greater risk of data manipulation.
3. Can be imprecise and inefficient if the population being sampled is heterogeneous.
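The selection rule (random start, then every k-th unit) can be sketched as follows; the frame of 100 units is hypothetical:

```python
import random

def systematic_sample(frame, n):
    # sampling interval k = N // n; random start among the first k units,
    # then every k-th unit thereafter
    k = len(frame) // n
    start = random.randrange(k)
    return [frame[start + i * k] for i in range(n)]

random.seed(7)
frame = list(range(1, 101))          # hypothetical list of N = 100 units
sample = systematic_sample(frame, 10)
print(sample)                        # 10 units spaced k = 10 apart
```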
Cluster sampling:
Method by which the population is divided into groups (clusters), any of which can be
considered a representative sample. These clusters are mini-populations and therefore are
heterogeneous. Once clusters are established a random draw is done to select one (or more)
clusters to represent the population.
Steps for cluster sampling:
1. Divide the whole population into clusters according to some well defined rule.
2. Treat the clusters as sampling units.
3. Choose a sample of clusters according to some procedure.
4. Carry out a complete enumeration of the selected clusters, i.e., collect information on
all the sampling units available in selected clusters.
Advantages:
1. Economic efficiency.
2. Faster and less expensive than SRS
3. Does not require a list of all members of the universe
Disadvantage:
1. Commonly has higher sampling error
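The four steps can be sketched on a hypothetical frame of six villages (clusters) with three households each; all numbers are invented for the illustration:

```python
import random

# Steps 1-2: clusters (villages) as sampling units, with the study
# variable recorded for every household in each village
clusters = {
    "V1": [12, 15, 11], "V2": [9, 14, 10], "V3": [13, 12, 16],
    "V4": [8, 11, 10],  "V5": [15, 13, 14], "V6": [10, 9, 12],
}

random.seed(3)
chosen = random.sample(sorted(clusters), 2)                # step 3: sample of clusters
observations = [y for c in chosen for y in clusters[c]]    # step 4: full enumeration
print(chosen, observations)
```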
Sampling errors:
In a sample survey, since only a small portion of the population is studied, its results are
bound to differ from the census results and will have a certain amount of error. This error
would be there no matter how carefully the sample is drawn at random and however
representative it is. This error is attributed to fluctuations of sampling. Sampling error is due
to the fact that only a subset of the population (i.e., a sample) has been used to estimate the
population parameters and draw inferences about the population. Thus, sampling error is
present only in a sample survey, while it is completely absent in census surveys.
Sampling error may be due to following reasons:
1. Faulty selection of the sample.
2. Substitution.
3. Faulty demarcation of sampling units.
4. Error due to bias in the estimation method.
5. Variability of population.
Non-sampling errors:
Non-sampling errors are not attributed to chance and are a consequence of certain
factors which are within human control. In other words, they are due to certain causes which
can be traced and may arise at any stage of the inquiry viz. planning and execution of the
survey and collection, processing and analysis of the data. This error is present in sample and
census.
Some of the important factors responsible for non-sampling errors are as under:
1. Faulty planning including vague and faulty definition of the population of the
statistical units to be used, incomplete list of population members.
2. Vague and imperfect questionnaire which might result in incomplete or wrong
information.
3. Defective methods of interviewing and asking questions.
4. Vagueness about the type of the data to be collected.
5. Personal bias of the investigator.
6. Lack of trained and qualified investigators and lack of supervisory staff.
7. Failure of respondent’s memory to recall the events or happenings in the past.
8. Non response and inadequate response.
9. Improper coverage.
10. Compiling errors.
Testing of hypothesis.
A hypothesis is an assertion or conjecture about the parameter(s) of population
distribution(s). (or) Hypothesis is the tentative statement about something.
Parameter: Constant of the population is called as Parameter.
Statistic: Constant of a sample is called as statistic.
Types of hypothesis:
Null hypothesis: Ho
A hypothesis which is to be actually tested for acceptance or rejection is termed as
null hypothesis. Also, hypothesis of no difference is called as null hypothesis.
Alternative hypothesis:
It is a statement about the population parameter or parameters which gives an
alternative to the null hypothesis, within the range of pertinent values of the parameter, i.e.,
if Ho is accepted, what hypothesis is to be rejected and vice versa. An alternative hypothesis
is denoted by H1 or HA.
Two types of error:
After applying a test, a decision is taken about the acceptance or rejection of null
hypothesis vis – a – vis the alternative hypothesis. There is always some possibility of
committing an error in taking a decision about the hypothesis. These errors are of two types.
1. Type-I error: Rejecting null hypothesis (Ho), when it is true.
2. Type-II error: Accepting the null hypothesis (Ho), when it is false.
                      Ho is True            Ho is False
Do not reject Ho      Correct decision      Type II error
Reject Ho             Type I error          Correct decision
For example, with Ho: "the accused is innocent", convicting an innocent person is a Type I error, while acquitting a guilty person is a Type II error.
Critical region:
A statistic is used to test the hypothesis Ho. The test statistic follows some known
distribution. In a test, the area under probability density curve is divided into two regions,
viz., the region of acceptance and the region of rejection. The region of rejection is the region
in which Ho is rejected; that is, if the value of the test statistic lies in this region, Ho will be
rejected. The region of rejection is called the critical region. Moreover, the area of the critical region
is equal to level of significance. The critical region is always on the tail of the distribution
curve. It may be on both the tails or on one tail, depending upon the alternative hypothesis.
[Figure: distribution curve showing the region of rejection]
Two tail test:
If the alternative hypothesis is of the type H1: µ ≠ µo, the critical region lies on
both the tails. In this situation the test is called two-tailed.
[Figure: two-tailed test showing regions of rejection on both tails]
Degrees of freedom:
The number of independent observations on which a test is based is known as the degrees of
freedom of the test statistic.
The transformed variable Z is always distributed normally with mean 0 and variance
1, i.e., Z ~ N(0, 1). In this way, whatever may be the parameters of X, Z has always the same
normal distribution N(0, 1), and hence only one normal curve is enough after transformation,
irrespective of the distribution of X. The variable Z is called the standard normal deviate
(SND). After the transformation, the probability density function of the SND Z is:
f(z) = (1/√(2π)) e^(−z²/2)
Case-1: When the S.D. (σ) of the population is known:
Null hypothesis: µ = µo
Test statistic:
Z = (x̄ − µo) / (σ/√n)
Example:
The average number of mango fruits per tree in a particular region was known from
considerable experience to be 520, with a standard deviation of 4.0. A sample of 20 trees gave
an average of 450 fruits per tree. Test whether the average number of fruits per tree in the
sample is in agreement with the average production in that region.
Solution: Null hypothesis: µ = µo = 520
Z = |450 − 520| / (4/√20) = 78.26. Since this is far greater than the table value 1.96 at the 5 percent level of significance, the null hypothesis is rejected.
Case-2: If the S.D. of the population is not known, we can still use the standard normal
deviate test.
Assumption:
1. Population is normal.
2. Sample is drawn at random.
Conditions:
1. σ is not known.
2. Size of the sample is large (>30).
Null hypothesis: µ = µo
Test statistic:
Z = (x̄ − µo) / (S/√n),
Where,
S = √(Σ(xi − x̄)² / (n − 1))
Example: [The statement and the milk-yield classes of this example were lost in extraction; the record average is µo = 12 and the observed frequencies are:]
No. of buffaloes: 9 20 35 42 17 7
Test whether the performance of dairy farm was in agreement with the record.
Solution:
Null hypothesis: µ = µo = 12.
Using the mean formula we have x̄ = 11.91, and using the standard deviation formula we
have S = 2.49.
Therefore,
Z = |11.91 − 12| / (2.49/√130) = 0.41.
Conclusion: The calculated Z is less than the table Z, 1.96 at 5 % level of significance.
Therefore, the null hypothesis is accepted. That is, there is no significant difference between
the average daily milk yield of the dairy farm and the previous record.
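The arithmetic of this test can be verified with a short sketch that re-applies Z = (x̄ − µo)/(S/√n), with n = 130 taken as the sum of the frequencies:

```python
import math

n = 9 + 20 + 35 + 42 + 17 + 7        # total number of buffaloes = 130
xbar, mu0, s = 11.91, 12.0, 2.49

z = abs(xbar - mu0) / (s / math.sqrt(n))
print(round(z, 2))                   # -> 0.41
```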
Test statistic:
Z = (x̄1 − x̄2) / √(S1²/n1 + S2²/n2),
Where,
x̄1, x̄2 are the means of the 1st and 2nd samples with sizes n1 and n2 respectively,
S1² = Σ(x1i − x̄1)² / (n1 − 1) and
S2² = Σ(x2i − x̄2)² / (n2 − 1)
Example: A random sample of 90 poultry farms of one variety gave an average production of
240 eggs per bird per year with a S.D. of 18 eggs. Another random sample of 60 poultry farms
of another variety gave an average of 195 eggs per bird per year with a S.D. of 15 eggs.
Distinguish between the two varieties of birds with respect to their egg production.
Null hypothesis: µ1 = µ2;
Z = (240 − 195) / √(18²/90 + 15²/60) = 45/√7.35 = 16.60. Since this far exceeds the table value 1.96 at the 5 percent level of significance, the null hypothesis is rejected; the two varieties differ significantly in egg production.
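A sketch of the two-sample computation (re-applying Z = (x̄1 − x̄2)/√(S1²/n1 + S2²/n2); not part of the original notes):

```python
import math

n1, xbar1, s1 = 90, 240.0, 18.0      # first variety
n2, xbar2, s2 = 60, 195.0, 15.0      # second variety

z = (xbar1 - xbar2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
print(round(z, 2))                   # -> 16.6
```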
[The statement of the following example, on the proportion of diseased plants, was lost in extraction.]
Z = 3.74.
Conclusion: Here Z (calculated) > Z (tabulated), 1.96 at the 5 percent level of significance, so
the null hypothesis is rejected. Therefore, there is a significant difference between the
proportion of diseased plants in the sample and the population.
4. Level of Significance:
The level may be fixed at either 5% or 1%.
5. Expected value: The expected value is given by:
Ze = 1.96 at the 5% level, 2.58 at the 1% level (two-tailed test)
Ze = 1.65 at the 5% level, 2.33 at the 1% level (one-tailed test)
6. Inference:
If the observed value of the test statistic Z exceeds the table value Ze, we may reject
the null hypothesis Ho; otherwise we accept it.
Example 1: In an investigation, it was found that 4 percent of the farmers accepted the
improved seeds for a barley crop in a particular state. On conducting a survey in two
panchayath samithies, 340 farmers accepted out of 1500 in the 1st samithi and 200 out of
1000 in the 2nd samithi. Test whether the difference between the two samithies is significant.
P0=4/100=0.04, Q0=1-0.04=0.96,
P1=340/1500=0.23,
P2=200/1000=0.2.
Z = 1.19.
Conclusion: Z (calculated) < Z (tabulated), 1.96 at the 5 percent level of significance. Therefore,
the null hypothesis is accepted, i.e., there is no significant difference between the
proportions of the two samithies with regard to acceptability of the improved seeds.
Example 2: In the previous example if P is not known, test the significance of the difference
between the proportions of the two samples.
Null hypothesis: P1=P2=P where P1 and P2 are the proportions in the 1st and 2nd populations
respectively.
P = (340 + 200)/(1500 + 1000) ≈ 0.22, Q = 0.78,
Z = (P1 − P2) / √(PQ(1/n1 + 1/n2)) = 1.75.
t = (x̄ − µ0) / (s/√n), where s² = Σ(xi − x̄)² / (n − 1)
Note: t-test is carried out when the sample size is small (i.e when it is less than 30).
One sample t-test:
Assumption:
1. Population is normal.
2. Sample is drawn at random.
Conditions:
1. σ is not known.
2. Size of sample is small.
Null hypothesis: µ = µ0
Test statistic:
t = (x̄ − µ0) / (s/√n)
Where,
s² = Σ(xi − x̄)² / (n − 1) and n is the sample size.
Conclusion: If t (calculated) < t (tabulated) with (n-1) d.f at chosen level of significance, the
null hypothesis is accepted. That is, there is no significant difference between sample mean
and population mean. Otherwise, null hypothesis is rejected.
Example: The heights of plants in a particular field were assumed to follow a normal
distribution. A random sample of 10 plants was selected and their heights (in cm) were
recorded as 96, 100, 102, 99, 104, 105, 99, 98, 100 and 101. Discuss, in the light of the above
data, whether the mean height of plants in the population is 100.
Solution: Null hypothesis: µ = µ0 = 100.
From the data, x̄ = 100.4 and s = √(Σ(xi − x̄)²/(n − 1)) = 2.72.
t = (100.4 − 100) / (2.72/√10) = 0.46.
Conclusion: t (calculated) < t (tabulated) with 9 d.f. at the 5 percent level of significance, so
the null hypothesis is accepted; the mean height of plants in the population may be taken as 100.
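The whole computation can be checked with a sketch that recomputes x̄, s and t from the raw heights (Python is used only for verification):

```python
import math

heights = [96, 100, 102, 99, 104, 105, 99, 98, 100, 101]
n, mu0 = len(heights), 100.0

xbar = sum(heights) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in heights) / (n - 1))
t = (xbar - mu0) / (s / math.sqrt(n))
# -> 100.4 2.72 0.47 (the text, truncating, reports t = 0.46)
print(round(xbar, 1), round(s, 2), round(t, 2))
```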
Two sample t-test:
Conditions:
1. S.D’s in the populations are same and are not known.
2. Sizes of samples are small.
Null hypothesis: µ1 = µ2 where µ1, µ2 are the means of 1st and 2nd populations respectively.
Test statistic:
t = (x̄1 − x̄2) / √(sp²(1/n1 + 1/n2))
Where,
sp² = [Σ(x1i − x̄1)² + Σ(x2i − x̄2)²] / (n1 + n2 − 2),
x̄1 = Σx1i/n1 and x̄2 = Σx2i/n2
Example: Two types of diets were administered to two groups of school going children for
increase in weight and the following increases in weight (100gm) were recorded after a
month.
Diet A 4 3 2 2 1 0 5 6 3
Diet B 5 4 4 2 3 2 6 1
Test whether there is any significant difference between the two diets with respect to increase
in weight.
Null hypothesis: µ1 = µ2
From the data, n1 = 9, n2 = 8, Σx1 = 26, Σx1² = 104, Σx2 = 27, Σx2² = 111, so that
x̄1 = 2.89, x̄2 = 3.38 and
sp² = [(104 − 26²/9) + (111 − 27²/8)] / (9 + 8 − 2) = (28.89 + 19.88)/15 = 3.25.
t = |2.89 − 3.38| / √(3.25(1/9 + 1/8)) = 0.56.
Conclusion: t (calculated) < t (tabulated) (2.131) with 15 d.f. at the 5 percent level of
significance. Therefore, the null hypothesis is accepted. That is, there is no significant
difference between the two diets with respect to increases in weight.
Paired t-test:
When two small samples of equal size are drawn from two populations and the
samples are dependent on each other, then the paired t-test is used in preference to the
independent t-test. The same patients for the comparison of two drugs with some time
interval; the neighboring plots of a field for comparison of two fertilizers with respect to yield
assuming that the neighboring plots will have the same soil composition; rats from the same
litter for comparison of two diets; branches of same plant for comparison of the nitrogen
uptake, etc., are some of the situations where paired-t can be used.
In the paired t-test the testing of the difference between two treatments means was
made more efficient by keeping all the other experimental conditions same.
Assumptions:
1. Populations are normal.
2. Samples are drawn independently and at random.
Conditions:
1. Samples are related with each other.
2. Sizes of the samples are small and equal.
3. S.D’s in the population are equal and not known.
Null hypothesis: µ1 = µ2
Test statistic:
t = d̄ / (Sd/√n), where d̄ = Σdi/n and Sd² = [Σdi² − (Σdi)²/n] / (n − 1)
Conclusion: If t (calculated) < t (tabulated) with (n-1) d.f at 5 percent level of significance,
the null hypothesis is accepted. That is, there is no significant difference between the means
of the two samples. In other words, the two samples may belong to the same population.
Otherwise, the null hypothesis is rejected.
Example: The following data are from an experiment conducted on the agronomy farm at the
College of Agriculture, UAS, Dharwad for comparing two types of grasses on neighbouring
plots of size 5 × 2 metres in each replication. The weights of grasses per plot (in kg) at
harvesting time were recorded on 7 replicates:
Replication: 1 2 3 4 5 6 7
[The data table was lost in extraction; only the first row survives: X1 = 1.96, X2 = 2.13, d = −0.17, d² = 0.0289.]
Test the significant difference between the two grasses with respect to their yield.
Null hypothesis: µ1 = µ2
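Since only the first row of the data table survives, the sketch below uses invented paired yields, chosen so that the first pair matches the surviving row (1.96, 2.13); it only illustrates the formulas t = d̄/(Sd/√n) and Sd² = [Σd² − (Σd)²/n]/(n − 1):

```python
import math

# hypothetical paired yields (kg/plot) for 7 replications; only the first
# pair (1.96, 2.13) comes from the text, the rest are invented
grass1 = [1.96, 2.12, 1.80, 2.45, 2.10, 1.75, 2.30]
grass2 = [2.13, 2.20, 1.95, 2.40, 2.25, 1.90, 2.35]

d = [a - b for a, b in zip(grass1, grass2)]
n = len(d)
dbar = sum(d) / n
sd2 = (sum(x * x for x in d) - sum(d) ** 2 / n) / (n - 1)

t = dbar / math.sqrt(sd2 / n)
print(round(t, 2))   # -> -3.33 for these invented data
```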
In the variance ratio test, we have to test the equality of two variances, σ1² and σ2². In
testing two means, we assume that the population variance is the same. But this assumption
may not always hold good. We may have to draw two samples from different populations,
where the variances are not the same. In such a situation, we cannot use the t-test directly for
testing equality of two means. Therefore, we have to test whether these two variances are the
same or not. For testing equality of two variances, we use the F-test.
Null hypothesis: σ1² = σ2²;
Alternative hypothesis: σ1² ≠ σ2²;
Test criterion: F = S1²/S2², where S1² is the larger of the two sample variances.
Conclusion: If F(cal) > F(table) at the 5 percent level, the test is significant; we reject the null
hypothesis and conclude that the two variances are not the same. Otherwise, if F(cal) < F(table),
the test is not significant; we accept the null hypothesis and conclude that both variances are
the same.
Example 1: The marks in different subjects are given below. Test whether the variances of
marks in the two subjects are the same or not.
Subject-X: 15 25 30 10 12 40 45
Subject-Y 19 20 25 30 18
Solution:
Null hypothesis: σ1² = σ2²; Alternative hypothesis: σ1² ≠ σ2².
Subject-X: x̄ = 25.28, S1² = Σ(x − x̄)²/(n1 − 1) = 1143.43/6 = 190.57.
Subject-Y: ȳ = 22.40, S2² = Σ(y − ȳ)²/(n2 − 1) = 101.20/4 = 25.30.
Test criterion:
F = S1²/S2² = 190.57/25.30 = 7.53 (F(table) at the 5% level is 4.53).
Conclusion: F(cal) is greater than F(table) at the 5% level of significance, so we conclude
that the two variances are not the same.
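The variance ratio can be verified directly from the marks (a sketch, not part of the original notes):

```python
x = [15, 25, 30, 10, 12, 40, 45]   # Subject-X marks
y = [19, 20, 25, 30, 18]           # Subject-Y marks

def sample_var(data):
    m = sum(data) / len(data)
    return sum((v - m) ** 2 for v in data) / (len(data) - 1)

s1, s2 = sample_var(x), sample_var(y)
F = max(s1, s2) / min(s1, s2)      # larger variance in the numerator
print(round(s1, 1), round(s2, 1), round(F, 2))   # -> 190.6 25.3 7.53
```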
Chi-Square Distribution
So far, we have discussed various tests of significance such as t, F and Z. These tests
were based on the assumption that the samples were drawn from normally distributed
populations. Since the testing procedure requires assumptions about the type of population or
parameters, i.e., population values, these tests are known as "parametric tests".
There are many situations in which it is not possible to make any rigid assumption
about the distribution of the population from which samples being drawn. This limitation has
led to the development of a group of alternative techniques known as non-parametric or
distribution-free methods. When non-parametric tests are used, no assumption about the
parameters of the population or populations from which we draw our samples is made. The
chi-square test of goodness of fit and the test of independence are prominent examples of
non-parametric tests.
χ2 test as a goodness of fit:
Chi-square test is popularly known as the test of goodness of fit for the reason that it
enables us to ascertain how well theoretical distributions such as the Binomial, Poisson,
Normal, etc., fit empirical distributions, i.e., those obtained from sample data.
The expression for the χ2 test for goodness of fit is:
χ2 = Σ(Oi − Ei)²/Ei,
where Oi is the observed frequency and Ei is the expected frequency.
The probabilities of getting 0, 1, 2, . . ., 5 heads in a toss of 5 coins are the successive terms
in the binomial expansion of (1/2 + 1/2)^5. So the theoretical frequencies in 3200 tosses are
the terms in the expansion 3200(1/2 + 1/2)^5, as follows:
No. of heads:        0    1    2     3     4    5
Expected frequency:  100  500  1000  1000  500  100
Applying the test [the observed frequencies of this example were lost in extraction],
χ2 = Σ(Oi − Ei)²/Ei = 58.80.
Conclusion: The χ 2(cal) is much greater than χ 2(table) value. Hence the hypothesis is rejected.
Therefore, we can conclude that the coins are biased.
Example-2: Genetic theory states that children having one parent of blood type A and the
other of type B will always be one of the three types A, AB, B, and that the proportions of the
three types will on an average be 1:2:1. A report states that out of 300 children having one A
parent and one B parent, 30% were found to be of type A, 45% of type AB and the remainder
of type B. Test the hypothesis by the χ2 test.
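Example-2 can be worked out directly: 30%, 45% and 25% of 300 children give observed frequencies 90, 135 and 75, while the 1:2:1 ratio gives expected frequencies 75, 150 and 75. A sketch of the computation:

```python
observed = [90, 135, 75]                             # 30%, 45%, 25% of 300
expected = [300 * 1 / 4, 300 * 2 / 4, 300 * 1 / 4]   # 1:2:1 -> 75, 150, 75

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# -> 4.5; chi-square table with 2 d.f. at the 5% level is 5.991, so Ho is accepted
print(chi2)
```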
Chi-square test as a test of independence:
With the help of the χ2 test we can find out whether two or more attributes are associated or
not. Suppose we have N observations classified according to some attributes. We may ask
whether the attributes are related or independent. Thus, we can find out whether quinine is
effective in controlling fever or not, whether there is any association between colour and
intelligence, etc. In order to test whether or not the attributes are associated, we take the null
hypothesis that there is no association between the attributes under study, or, in other words,
that the two attributes are independent.
If the χ 2(cal) is less than the χ 2 (table) value at a certain level of significance generally 5%
level, we say that the results of the experiment provide no evidence for doubting the
hypothesis.
2 X 2 contingency table:
When the individuals (or objects) are classified into two categories with respect to
each of the two attributes, then the table showing the frequencies distributed over the 2 × 2
classes is called a 2 × 2 contingency table.
Suppose the individuals are classified according to two attributes, say colour (B) and
intelligence (A). The distribution of frequencies over the cells is shown below:
B\A               Intelligence
                  A1       A2       Total
Colour    B1      a        b        a+b
          B2      c        d        c+d
Total             a+c      b+d      N
χ2 = N(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)], with (2 − 1)(2 − 1) = 1 d.f.
Conclusion: If χ 2(cal) ≤ χ 2(table) with (r-1)x(c-1) df at chosen level of significance, the null
hypothesis is accepted i.e., two attributes are independent. Otherwise, null hypothesis is
rejected.
Example: One hundred individuals of a particular race were tested with an intelligence test
and classified into two classes. Another group of one hundred and twenty individuals belonging
to another race were administered the same intelligence test and classified into the same two
classes. The following are the observed frequencies for the two races:
          Class-1   Class-2   Total
Race-1    42        58        100
Race-2    55        65        120
Total     97        123       220
Solution:
Ho: Intelligence and race are two independent attributes.
We know,
χ2 = N(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)]
= 220(42 × 65 − 58 × 55)² / (100 × 120 × 97 × 123) = 0.325
Conclusion: χ2(cal) < χ2(table) = 3.841 with (r−1)(c−1) = (2−1)(2−1) = 1 d.f. at the 5 percent
level of significance. Therefore, the null hypothesis is accepted, i.e., race and intelligence
may be independent attributes.
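The 2 × 2 shortcut formula can be checked on the same frequencies (a sketch, not part of the original notes):

```python
a, b = 42, 58    # Race-1
c, d = 55, 65    # Race-2
N = a + b + c + d

chi2 = N * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(chi2, 3))   # -> 0.325
```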
Example: A cross between the two varieties of sorghum one giving high yield and the other
for high amount of fodder was made. The number of plants in F2 generation was observed as
79, 160 and 85. Test whether this sample data is in agreement with the Mendelian ratio 1:2:1
or not.
Solution:
Null hypothesis: The sample ratio is in agreement with 1:2:1
Observed frequency (Oi)   Expected frequency (Ei)   (Oi − Ei)   (Oi − Ei)²/Ei
79                        324 × 1/4 = 81            −2          0.0494
160                       324 × 2/4 = 162           −2          0.0247
85                        324 × 1/4 = 81            4           0.1975
χ2 = Σ(Oi − Ei)²/Ei = 0.27
Conclusion: χ 2(Cal) < χ 2(Tab), (5.991) with (3-1) d.f. at 5 percent level of significance.
Therefore, the null hypothesis is accepted, i.e., the plants are segregating according to
Mendelian ratio, 1:2:1 in F2 generation.
Correlation:
Correlation is a measure of extent or degree of mutual dependence between two
variables.
In the study of two variables jointly, many times an investigator is interested to know
the degree or extent of dependence between them. Actually, one wants to know whether the
relation between two variables is of high, moderate or low degree. If the two variables have
no relation, it means the change in one variable has no impact about the change in the other.
In this case, two variables are said to be independent.
Methods of determining correlation:
Graphical method:
The extent of relation between two variables can be roughly judged by plotting the
pairs of observations as points on graph paper. These points spread in different patterns,
and such plots are called scatter diagrams. The more closely the points lie along a straight
line, the greater is the degree of relationship between the variables. In the following scatter
diagrams, the first shows a perfect positive linear relationship between X and Y, i.e., X is
proportional to Y and vice-versa. In this case the line runs from the lower left side to the
upper right side, and all the points lie on the line.
The second diagram shows the same phenomenon but in the opposite direction, i.e., if
X increases, then Y decreases. In this case, the line runs from the upper left to the bottom right side.
The next diagrams depict high positive and high negative correlation respectively, as
most of the points lie near the straight lines or on them.
The following diagrams show the same phenomena as the previous ones, except that in
these figures the points lie farther from the lines, indicating a low degree of correlation between
the variables. If hardly any line can be drawn about which the points concentrate, it means
there is no correlation between the variables.
[Scatter diagrams: positive correlation, negative correlation, no correlation]
Mathematical measure:
A graph provides a rough idea about the type and extent of correlation between two
variables. But more exactly the correlation can be measured numerically by calculating
coefficient of correlation. This is known as Pearson's coefficient of correlation, and the
formula for it was developed by Karl Pearson. It is based on three assumptions.
1. The variable X and Y are distributed normally.
2. The relationship between X and Y is linear.
3. There is a cause and effect relationship between X and Y.
If from a bivariate population there are n pairs of values of the variables X and Y as (x1, y1),
(x2, y2), (x3, y3), . . ., (xn, yn), then the formula is given as:
rxy = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² Σ(yi − ȳ)²]
Alternative formula:
r = [Σxiyi − (Σxi)(Σyi)/n] / √{[Σxi² − (Σxi)²/n][Σyi² − (Σyi)²/n]}
Test statistic: t = r√(n − 2) / √(1 − r²), which follows the t-distribution with (n − 2) d.f.
under the null hypothesis of no correlation.
Conclusion: If t (Cal) >t(Tab) with (n-2) d.f. at chosen level of significance, the null
hypothesis is rejected. That is, there may be significant correlation between the two variates.
Otherwise, the null hypothesis is accepted.
Degree of correlation: Perfect: r = +1 (positive) or r = −1 (negative); Nil: r = 0.
Example:
Students 1 2 3 4 5 6 7
Marks in Maths (X) 7 9 10 6 5 4 8
Marks in Physics (Y) 9 6 5 4 3 2 6
Find the correlation coefficient between the marks scored in the two subjects.
Solution:
From the table we can calculate x̄ = 7.0 and ȳ = 5.0.
Σ(xi − x̄)(yi − ȳ) = (7−7)(9−5) + (9−7)(6−5) + . . . + (8−7)(6−5) = 17
Σ(xi − x̄)² = (7−7)² + (9−7)² + . . . + (8−7)² = 28
Σ(yi − ȳ)² = (9−5)² + (6−5)² + . . . + (6−5)² = 32
r = 17/√(28 × 32) = 17/29.93 = 0.568.
Since the value of r is little more than 0.5, it can be interpreted that the correlation
between marks in mathematics and physics is moderate.
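The coefficient can be recomputed from the raw marks with a short sketch (Python is used only for verification):

```python
import math

x = [7, 9, 10, 6, 5, 4, 8]   # marks in Maths
y = [9, 6, 5, 4, 3, 2, 6]    # marks in Physics
n = len(x)

mx, my = sum(x) / n, sum(y) / n
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)
print(sxy, sxx, syy, round(r, 3))   # -> 17.0 28.0 32.0 0.568
```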
Regression
Regression is the measure of the average relationship between two or more variables in
terms of the original units of the data.
If two variables are correlated, unknown value of one of the variables can be
estimated by using the known value of the other variable. The so estimated value may
not be equal to the actually observed value, but it will be close to the actual value.
The property of the tendency of actual value to lie close to the estimated value is
called regression. In a wider usage, regression is the theory of estimation of unknown
value of a variable with the help of known values of the variables. The regression theory
was first introduced by Sir Francis Galton in the field of genetics.
When data on two variables are known, by assuming one of the variables to be
dependent on the other, we fit a linear equation to the data by the method of least square.
The linear equation is called regression equation.
For a bivariate data on x and y, the regression equation obtained with the
assumption that x is dependent on y is called the regression of x on y. The regression of x on
y is (x − x̄) = bxy (y − ȳ), and the regression of y on x is (y − ȳ) = byx (x − x̄).
Here, the constants bxy and byx are the regression coefficients. They are:
bxy = Σ(x − x̄)(y − ȳ) / Σ(y − ȳ)²  and  byx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
The regression equation of x on y is used for the estimation of x values and the
regression equation of y on x is used for the estimation of y values.
Graphical representation of the regression equation is called regression lines.
Solution:
Let x and y respectively denote the heights of father and the sons.
Then, the value of y corresponding to x =150 has to be estimated.
For this, regression of y on x should be found and the estimation should be made.
x        y        u = x − 170   v = y − 170   u²     uv
164 168 -6 -2 36 12
176 174 6 4 36 24
178 175 8 5 64 40
184 181 14 11 196 154
175 173 5 3 25 15
167 166 -3 -4 9 12
173 173 3 3 9 9
180 179 10 9 100 90
Total:   Σu = 37,  Σv = 29,  Σu² = 475,  Σuv = 356
Since regression coefficients are independent of the change of origin, the required regression
coefficient is
byx = [Σuv − (Σu)(Σv)/n] / [Σu² − (Σu)²/n] = (356 − 37 × 29/8) / (475 − 37²/8) = 0.7302.
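The coefficient can be verified from the raw heights (a sketch, not part of the original notes):

```python
x = [164, 176, 178, 184, 175, 167, 173, 180]   # fathers' heights
y = [168, 174, 175, 181, 173, 166, 173, 179]   # sons' heights
n = len(x)

# change of origin (u = x - 170, v = y - 170) leaves b_yx unchanged
u = [xi - 170 for xi in x]
v = [yi - 170 for yi in y]

byx = (sum(ui * vi for ui, vi in zip(u, v)) - sum(u) * sum(v) / n) / \
      (sum(ui ** 2 for ui in u) - sum(u) ** 2 / n)
print(round(byx, 4))   # -> 0.7302
```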
Example: The following data are from a controlled experiment in which 5 insecticides were
applied to four cabbage plants each and the number of insect larvae was counted.
Replication      A      B      C      D      E
1                5      7      10     19     14
2                11     6      8      14     7
3                4      4      6      27     7
4                4      3      4      8      12
Total            24     20     28     68     40
Mean             6      5      7      17     10
Note: The number of larvae varies from plant to plant. It also varies among plants which
have been sprayed with the same insecticide. The technique of analysis of variance splits the
total variation in the 20 observations into two components: one attributable to possible
differences among the insecticides and the other to differences among plants treated with the
same insecticide. The computation is as follows:
Insecticide A B C D E
Total 24 20 28 68 40
Mean 6 5 7 17 10
Step 2: The overall variation in the data is measured by what is called the total sum of
squares, which is computed as follows:
First compute
C.F = (grand total)²/n = (180)²/20 = 1620.
This quantity is called the correction factor (C.F). Then compute the total S.S as:
Total S.S = ΣΣxij² − C.F = 2292 − 1620 = 672.
It may be recalled that the total S.S divided by 20 would give the variance of the 20
observations. Thus, the total S.S is a measure of overall variation.
Step 3: Compute the "between treatments sum of squares":
To compute the between-treatments S.S., the treatment totals are used:
Between-treatments S.S = ΣTi²/r − C.F = (24² + 20² + 28² + 68² + 40²)/4 − 1620 = 1996 − 1620 = 376.
It may be noted that of the total variation of 672, 376 is attributed to possible differences
among the insecticides. It can be tested whether 376 is substantial enough to conclude that the
insecticides differ. This technique is known as testing of hypothesis, and is beyond the scope
of this course.
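The sums of squares above can be verified with a short sketch. Note that the first value in each list below is the replication-1 value inferred from the column totals, since that row of the printed table did not survive:

```python
# larvae counts per insecticide (4 plants each); first entries inferred
# from the column totals 24, 20, 28, 68, 40
data = {
    "A": [5, 11, 4, 4],
    "B": [7, 6, 4, 3],
    "C": [10, 8, 6, 4],
    "D": [19, 14, 27, 8],
    "E": [14, 7, 7, 12],
}

values = [v for plot in data.values() for v in plot]
n, grand = len(values), sum(values)

cf = grand ** 2 / n                                                  # correction factor
total_ss = sum(v * v for v in values) - cf                           # step 2
between_ss = sum(sum(p) ** 2 / len(p) for p in data.values()) - cf   # step 3
within_ss = total_ss - between_ss

print(cf, total_ss, between_ss, within_ss)   # -> 1620.0 672.0 376.0 296.0
```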
Two –way Analysis of Variance
Two-way ANOVA technique is used when data are classified on the basis of two
factors. For example, the agricultural output may be classified on the basis of different
varieties of seeds and also on the basis of different varieties of fertilizers used. Such a two-
way design may have repeated measurements of each factor or may not have repeated values. We
shall now explain the two-way ANOVA technique in context of two-way design when
repeated values are not there.
As we do not have repeated values, we cannot directly compute the sum of squares
within samples as we did in the case of one-way ANOVA. Therefore, we have to
calculate this residual or error variation by subtraction, once we have calculated (on the
same lines as in one-way ANOVA) the sums of squares for total variance, for variance
between columns and for variance between rows.
4. Find out the square of all the item values (or either coded values as the case may be)
one by one then take its total. Subtract the correction factor from this total to obtain
the sum of squares of deviations for total variance. Symbolically, we write it as:
Sum of squares of deviations for total variance or total SS:
Total SS = ΣΣxij² − C.F, where C.F = (ΣΣxij)²/(c·r).
5. Take the totals of the different columns, square each column total, divide each squared
value by the number of items in the corresponding column, and take the total of the
results thus obtained. Finally, subtract the correction factor from this total to obtain the
sum of squares of deviations for variance between columns (or SS between columns).
6. Take the totals of the different rows, square each row total, divide each squared value
by the number of items in the corresponding row, and take the total of the results thus
obtained. Finally, subtract the correction factor from this total to obtain the sum of
squares of deviations for variance between rows (or SS between rows).
7. The sum of squares of deviations for residual or error variance can be worked out by
subtracting the sum of the results of steps (5) and (6) from the result of step (4). In
other words,
Total SS − (SS between columns + SS between rows) = SS for residual or error variance.
8. Degrees of freedom (d.f) can be worked out as under:
d.f for total variance= (c*r-1)
d.f for variance between columns = (c-1).
d.f for variance between rows = (r-1).
d.f for residual or error variance = (c-1)(r-1).
Where c is the number of columns and r is the number of rows.
9. The ANOVA table can be set up in the usual fashion as shown below:
Source of variation    d.f           SS                MS
Between columns        (c−1)         SS columns        SS columns/(c−1)
Between rows           (r−1)         SS rows           SS rows/(r−1)
Residual (error)       (c−1)(r−1)    SS error          SS error/[(c−1)(r−1)]
Total                  (c·r−1)       Total SS
Example: Set up the ANOVA table for the following two-way design:
Fertilizer Varieties
A B C
W 6 5 5
X 7 5 4
Y 8 3 3
Z 3 7 4
[The worked ANOVA table of this example was largely lost in extraction; the surviving rows read: Error: 6 d.f., SS = 6, MS = 6/6 = 1; Total: 11 d.f., SS = 32.]
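Steps (4) to (7) above can be sketched as a single function. Applied to the fertilizer and variety table of the example, the total S.S agrees with the surviving "Total" row of the text (SS = 32); the remaining rows of the printed ANOVA table did not survive intact, so only that figure can be checked:

```python
def two_way_ss(table):
    # steps (4)-(7): total SS, SS between columns, SS between rows, error SS
    r, c = len(table), len(table[0])
    grand = sum(sum(row) for row in table)
    cf = grand ** 2 / (r * c)                                  # correction factor

    total_ss = sum(v * v for row in table for v in row) - cf   # step 4
    col_totals = [sum(row[j] for row in table) for j in range(c)]
    col_ss = sum(t * t / r for t in col_totals) - cf           # step 5
    row_ss = sum(sum(row) ** 2 / c for row in table) - cf      # step 6
    error_ss = total_ss - col_ss - row_ss                      # step 7
    return total_ss, col_ss, row_ss, error_ss

# rows = fertilizers W, X, Y, Z; columns = varieties A, B, C
table = [[6, 5, 5], [7, 5, 4], [8, 3, 3], [3, 7, 4]]
print([round(v, 2) for v in two_way_ss(table)])   # -> [32.0, 8.0, 1.33, 22.67]
```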