Unit 3 - Statistics
Introduction:
In the modern world of computers and information technology, the importance of statistics is
well recognized across all disciplines. Statistics originated as a science of statehood and
steadily found applications in agriculture, economics, commerce, biology, medicine, industry,
planning, education and so on. Today there is hardly any walk of life where statistics cannot
be applied.
Meaning of Statistics:
Statistics is concerned with scientific methods for collecting, organizing, summarizing,
presenting and analyzing data as well as deriving valid conclusions and making reasonable
decisions on the basis of this analysis. Statistics is concerned with the systematic collection of
numerical data and its interpretation.
The word ‘statistic’ is used to refer to
Definition:
Statistics may be defined as the science of collection, presentation, analysis and interpretation of
numerical data through logical analysis.
Data for interpretation are of two types:
(i) Population
(ii) Sample
Population: A population is the set of all possible data values for the subject under consideration.
Sample: A sample is a set of data values drawn from a much larger population.
Descriptive Statistics
$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + x_4 + \dots + x_n$
The Greek letter $\sum$ is the summation sign.
Example: Find the mean weight of 10 students whose weights in kilograms are as follows:
32, 26, 41, 35, 28, 42, 36, 40, 33, 42
Solution:
$\bar{x} = \frac{32 + 26 + 41 + 35 + 28 + 42 + 36 + 40 + 33 + 42}{10} = \frac{355}{10} = 35.5$ kilograms
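The same calculation can be sketched in a few lines of Python. This is only an illustration using the ten weights from the example; the variable names are chosen here and are not part of the original notes.

```python
# Weights of the 10 students from the example above, in kilograms.
weights = [32, 26, 41, 35, 28, 42, 36, 40, 33, 42]

total = sum(weights)   # sum of all x_i
n = len(weights)       # number of observations
mean = total / n       # arithmetic mean

print(total, n, mean)
```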
Grouped Data
If the data are given as a grouped frequency distribution, the formula for the mean is:
$\bar{x} = \frac{\sum x_i f_i}{n}$
Where $x_i$ = mid value of the class
$n$ = total frequency = $\sum f_i$
x_i    0    1    2    3    4    5
f_i    3   20   15    8    3    1
Solution:
x_i     f_i     x_i f_i
0        3        0
1       20       20
2       15       30
3        8       24
4        3       12
5        1        5
Total   n = 50   $\sum x_i f_i$ = 91
$\bar{x} = \frac{\sum x_i f_i}{n} = \frac{91}{50} = 1.82$
Example: Find the mean of the following data:
Class 0-10 10-20 20-30 30-40 40-50
fi 5 8 15 16 6
Solution:
Class    f_i    x_i    x_i f_i
0-10      5      5       25
10-20     8     15      120
20-30    15     25      375
30-40    16     35      560
40-50     6     45      270
Total    n = 50         $\sum x_i f_i$ = 1350
$\bar{x} = \frac{\sum x_i f_i}{n} = \frac{1350}{50} = 27$
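The grouped-data mean can be checked with a short Python sketch. It takes the class limits and frequencies from the example above and computes the mid values itself; the names used are illustrative assumptions.

```python
# Class limits and frequencies from the example above.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 8, 15, 16, 6]

# Mid value of each class: x_i = (lower limit + upper limit) / 2
mids = [(lo + hi) / 2 for lo, hi in classes]

n = sum(freqs)                                     # total frequency
sum_xf = sum(x * f for x, f in zip(mids, freqs))   # sum of x_i * f_i
mean = sum_xf / n

print(n, sum_xf, mean)
```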
Median: The median is another measure of central location. The median is the value in the
middle when the data are arranged in ascending order (smallest value to largest value) or
descending order (largest value to smallest value). The median is denoted by M.
If there is an odd number of observations, the median is found as:
$M = \left(\frac{n+1}{2}\right)^{th}$ observation
Where n = number of observations
Example: For the data 2, 3, 5, 6, 7, 8, 9 we have n = 7, so the median is the 4th observation, which is 6.
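x_i    0    1    2    3    4
f_i    4    1    6   11    3
Solution:
x_i    f_i    cf
0       4      4
1       1      5
2       6     11
3      11     22
4       3     25
Total  n = 25
Here, n = 25
$M = \left(\frac{n+1}{2}\right)^{th}$ observation $= \left(\frac{25+1}{2}\right)^{th}$ observation = 13th observation
The cumulative frequency first reaches 13 at x_i = 3 (cf = 22), so the median = 3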
Solution:
Class    f_i    cf
0-10      4      4
10-20     8     12
20-30    12     24
30-40    20     44
40-50    24     68
50-60    15     83
60-70     7     90
Total    n = 90
Here, n = 90
$M = \left(\frac{n}{2}\right)^{th}$ observation $= \left(\frac{90}{2}\right)^{th}$ observation = 45th observation
In the cumulative frequency (cf) column, the 45th observation lies between 44 and 68, so here
Median class = 40-50, L = 40, n = 90, cf = 44, f = 24, class width c = 10
Median, $M = L + \left(\frac{\frac{n}{2} - cf}{f}\right) \times c = 40 + \left(\frac{\frac{90}{2} - 44}{24}\right) \times 10 = 40.42$
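The grouped median formula can be sketched as a small Python function. It assumes classes of equal width given by their lower limits; the function name `grouped_median` is hypothetical and introduced only for illustration.

```python
def grouped_median(lowers, freqs, width):
    """Median of grouped data: M = L + ((n/2 - cf) / f) * c.

    lowers: lower limit of each class, freqs: class frequencies,
    width: common class width c (equal-width classes are assumed).
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lowers, freqs):
        if cf + f >= half:          # this is the median class
            return lower + (half - cf) / f * width
        cf += f
    raise ValueError("median class not found")


# Data from the solution above.
lowers = [0, 10, 20, 30, 40, 50, 60]
freqs = [4, 8, 12, 20, 24, 15, 7]
median = grouped_median(lowers, freqs, 10)
print(round(median, 2))
```

The loop stops at the first class whose cumulative frequency reaches n/2, which is how the median class is read off the cf column by hand.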
Mode: A third measure of location is the mode. The mode is defined as follows.
The mode is the value that occurs with the greatest frequency. The mode is denoted by Z.
Example: 1,2,3,4,5,6,4,5,2,4,1,2,2
In the above data, 2 occurs the maximum number of times (four times, against three times for 4), so the mode is 2.
Solution: Here we can see that the maximum frequency is 20 and the corresponding value is 1.
So, the mode is 1.
If the data are given as a grouped frequency distribution, the formula for the mode is:
$Z = L + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times c$
Where,
L = lower limit of the modal class
$f_0$ = frequency of the class preceding the modal class
$f_1$ = frequency of the modal class
$f_2$ = frequency of the class succeeding the modal class
c = width of the class
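Solution:
In the given data the maximum frequency is 13, so the modal class is 30-40
L = 30, $f_0$ = 11, $f_1$ = 13, $f_2$ = 10, c = 10
Mode, $Z = L + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times c = 30 + \left(\frac{13 - 11}{(2 \times 13) - 11 - 10}\right) \times 10 = 30 + \frac{2}{5} \times 10 = 34$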
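A minimal Python sketch of the grouped mode formula, using the values L, f0, f1, f2 and c from the solution above. The function name `grouped_mode` is an illustrative assumption.

```python
def grouped_mode(L, f0, f1, f2, c):
    """Mode of grouped data: Z = L + (f1 - f0) / (2*f1 - f0 - f2) * c."""
    return L + (f1 - f0) / (2 * f1 - f0 - f2) * c


# Modal class 30-40 with f0 = 11, f1 = 13, f2 = 10 and class width 10.
mode = grouped_mode(30, 11, 13, 10, 10)
print(mode)
```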
Example: Find the mean, median and mode of the following data:
Solution:
Class    Class boundaries    f_i    x_i     f_i x_i    cf
10-19    9.5-19.5             2     14.5     29         2
20-29    19.5-29.5            9     24.5    220.5      11
30-39    29.5-39.5           15     34.5    517.5      26
40-49    39.5-49.5           14     44.5    623        40
50-59    49.5-59.5           10     54.5    545        50
Total                        n = 50         $\sum x_i f_i$ = 1935
Mean: $\bar{x} = \frac{\sum x_i f_i}{n} = \frac{1935}{50} = 38.7$
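Median:
$M = \left(\frac{n}{2}\right)^{th}$ observation $= \left(\frac{50}{2}\right)^{th}$ observation = 25th observation
In the cumulative frequency (cf) column, the 25th observation lies between 11 and 26, so the median class is 29.5-39.5
L = 29.5, cf = 11, f = 15, n = 50, c = 10
Median, $M = L + \left(\frac{\frac{n}{2} - cf}{f}\right) \times c = 29.5 + \left(\frac{\frac{50}{2} - 11}{15}\right) \times 10 = 29.5 + 9.33 = 38.83$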
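Mode:
In the given data the maximum frequency is 15, so the modal class is 29.5-39.5
L = 29.5, $f_0$ = 9, $f_1$ = 15, $f_2$ = 14, c = 10
Mode, $Z = L + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times c = 29.5 + \left(\frac{15 - 9}{(2 \times 15) - 9 - 14}\right) \times 10 = 29.5 + \frac{6}{7} \times 10 = 38.07$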
Negative Skewness: The left tail is longer; the mass of the distribution is concentrated on the
right of the figure. The distribution is said to be left-skewed, left tailed or skewed to the left.
Positive Skewness: The right tail is longer; the mass of the distribution is concentrated on the
left of the figure. The distribution is said to be right-skewed, right tailed or skewed to the
right.
Measure of Deviation:
Variance: The variance is a measure of variability that utilizes all the data. The variance is
based on the difference between the value of each observation ($x_i$) and the mean. The difference
between each $x_i$ and the mean ($\bar{x}$ for a sample, $\mu$ for a population) is called a deviation about the
mean. For a sample, a deviation about the mean is written $(x_i - \bar{x})$; for a population, it is
written $(x_i - \mu)$.
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
NOTE: If the sample size is greater than 30, n - 1 is often replaced by n, because the change has
little effect on the numerical value of the variance.
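Standard deviation:
$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$
OR
$\sigma = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n} - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2}$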
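If the data are given as a grouped frequency distribution, the formula for the standard deviation is:
$\sigma = \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} f_i}}$
OR
$\sigma = \sqrt{\frac{\sum_{i=1}^{n} f_i x_i^2}{\sum_{i=1}^{n} f_i} - \left(\frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}\right)^2}$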
Example: Find the variance and the standard deviation from the following table:
X    6    7    8    9   10   11   12
F    3    6    9   13    8    5    4
Solution:
X      F      FX     $(X - \bar{X})^2$    $F(X - \bar{X})^2$
6      3      18      9       27
7      6      42      4       24
8      9      72      1        9
9     13     117      0        0
10     8      80      1        8
11     5      55      4       20
12     4      48      9       36
Total 48     432              124
$\bar{X} = \frac{\sum FX}{\sum F} = \frac{432}{48} = 9$
Variance $\sigma^2 = \frac{\sum F(X - \bar{X})^2}{\sum F} = \frac{124}{48} = 2.5833$
Std. Deviation $\sigma = \sqrt{\frac{\sum F(X - \bar{X})^2}{\sum F}} = \sqrt{\frac{124}{48}} = 1.6073$
Coefficient of variation (c.v.) $= \frac{\text{standard deviation}}{\text{mean}} \times 100$
A series with a larger coefficient of variation is said to be more variable, or less consistent.
A series with a smaller coefficient of variation is said to be less variable, or more consistent.
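The grouped variance, standard deviation and coefficient of variation can be sketched in Python as follows. The X and F values are those of the table above, with squared deviations taken about the mean; both forms of the standard deviation formula are evaluated so that they can be compared. Variable names are illustrative.

```python
import math

# X values and frequencies F from the table above.
values = [6, 7, 8, 9, 10, 11, 12]
freqs = [3, 6, 9, 13, 8, 5, 4]

n = sum(freqs)                                           # total frequency
mean = sum(f * x for x, f in zip(values, freqs)) / n

# Form 1: average squared deviation about the mean.
variance = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n

# Form 2: mean of squares minus square of the mean (should agree with form 1).
variance_alt = sum(f * x * x for x, f in zip(values, freqs)) / n - mean ** 2

sd = math.sqrt(variance)
cv = sd / mean * 100                                     # coefficient of variation, in percent

print(mean, round(variance, 4), round(sd, 4), round(cv, 2))
```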
Correlation Analysis: So far we have studied problems relating to one variable only. In practice we
come across a large number of problems involving two or more variables. If two
quantities vary in such a way that a change in one variable is accompanied by a change in the value of
the other, the quantities are said to be correlated.
(i) Positive or Negative correlation: If two variables are changing in the same
direction, correlation is said to be positive or direct correlation. If two variables are
changing in the opposite direction, correlation is said to be negative or inverse
correlation.
For example: The correlation between heights and weights of group of people is
positive and the correlation between pressure and volume of a gas is negative.
(ii) Simple, partial or multiple: The difference between simple, partial and multiple
correlation is based on the number of variables studied. When only two variables are
studied, the correlation is said to be simple correlation. When three or more variables are
involved, the problem may be one of either partial or multiple correlation.
(iii) Linear or Non-linear correlation: If the amount of change in one variable tends to
bear a constant ratio to the amount of change in the other variable, then the correlation
is said to be linear correlation.
For example: consider two variables X and Y
X     5    10    15    20    25    30
Y    50   100   150   200   250   300
This clearly shows that the ratio of change between the two variables is the same throughout.
If the amount of change in one variable does not tend to bear a constant ratio to the
amount of change in the other variable, then the correlation is said to be non-linear
correlation or curvilinear correlation.
[Figure: two scatter plots, each with X-AXIS (0 to 40) and Y-AXIS]
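$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}}$
$r = \frac{\sum dx\,dy}{\sqrt{\sum dx^2}\,\sqrt{\sum dy^2}}$
Where, $dx = x - \bar{x}$, $dy = y - \bar{y}$
This formula can also be written as follows:
$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$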
Correlation coefficient for the grouped data the formula can be written as follows:
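Solution:
x      y      $x - \bar{x}$    $y - \bar{y}$    $(x - \bar{x})^2$    $(y - \bar{y})^2$    $(x - \bar{x})(y - \bar{y})$
100    98       1       3       1       9       3
101    99       2       4       4      16       8
102    99       3       4       9      16      12
102    97       3       2       9       4       6
100    95       1       0       1       0       0
99     92       0      -3       0       9       0
97     95      -2       0       4       0       0
98     94      -1      -1       1       1       1
96     90      -3      -5       9      25      15
95     91      -4      -4      16      16      16
Total: $\sum x$ = 990, $\sum y$ = 950, $\sum (x - \bar{x})$ = 0, $\sum (y - \bar{y})$ = 0, $\sum (x - \bar{x})^2$ = 54, $\sum (y - \bar{y})^2$ = 96, $\sum (x - \bar{x})(y - \bar{y})$ = 61
$\bar{x} = \frac{\sum x}{n} = \frac{990}{10} = 99$
$\bar{y} = \frac{\sum y}{n} = \frac{950}{10} = 95$
Correlation Coefficient, $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} = \frac{61}{\sqrt{54}\,\sqrt{96}} = 0.85$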
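The correlation coefficient for the data above can be checked with a short Python sketch. It evaluates both the deviation form and the raw-sums form of the formula; the variable names are chosen here for illustration.

```python
import math

# x and y values from the solution table above.
x = [100, 101, 102, 102, 100, 99, 97, 98, 96, 95]
y = [98, 99, 99, 97, 95, 92, 95, 94, 90, 91]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Deviation form: r = sum(dx*dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
dx = [a - mean_x for a in x]
dy = [b - mean_y for b in y]
sum_dxdy = sum(a * b for a, b in zip(dx, dy))
sum_dx2 = sum(a * a for a in dx)
sum_dy2 = sum(b * b for b in dy)
r = sum_dxdy / (math.sqrt(sum_dx2) * math.sqrt(sum_dy2))

# Raw-sums form of the same coefficient.
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)
r_alt = (n * sum_xy - sum(x) * sum(y)) / (
    math.sqrt(n * sum_x2 - sum(x) ** 2) * math.sqrt(n * sum_y2 - sum(y) ** 2)
)

print(round(r, 2), round(r_alt, 2))
```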
When some values are repeated, so that ranks are tied, the rank correlation formula is
written as:
$r = 1 - \frac{6\left\{\sum d^2 + \frac{m}{12}(m^2 - 1) + \frac{m}{12}(m^2 - 1) + \dots\right\}}{n(n^2 - 1)}$
In $\sum d^2$, the term $\frac{m}{12}(m^2 - 1)$ is added for each repeated value, where m is the number of times an item is repeated.
The value of the correlation coefficient by Spearman's method also lies between -1 and +1. If the
ranks are the same for each pair of the two series, then each value of d = 0. Hence $\sum d^2 = 0$ and the
value of r = +1, which shows perfect positive correlation between the two variables. If the
ranks are exactly in reverse order for each pair of the two series, then the value of r = -1, which
shows perfect negative correlation between the variables.
Example: Two judges have given ranks to 10 students for their honesty. Find the rank
correlation coefficient of the following data:
1st judge    3    5    8    4    7   10    2    1    6    9
2nd judge    6    4    9    8    1    2    3   10    5    7
Solution:
Rank given by 1st judge    Rank given by 2nd judge    Difference in ranks d    $d^2$
3                           6                          -3                        9
5                           4                           1                        1
8                           9                          -1                        1
4                           8                          -4                       16
7                           1                           6                       36
10                          2                           8                       64
2                           3                          -1                        1
1                          10                          -9                       81
6                           5                           1                        1
9                           7                           2                        4
$\sum d^2$ = 214
Rank Correlation, $r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 214}{10(100 - 1)} = 1 - \frac{1284}{990} = 1 - 1.30 = -0.30$
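The rank correlation for the two judges can be reproduced with a few lines of Python. This sketch assumes there are no tied ranks, so it uses the formula r = 1 - 6Σd²/(n(n² - 1)) without correction terms; the variable names are illustrative.

```python
# Ranks given by the two judges in the example above.
judge_1 = [3, 5, 8, 4, 7, 10, 2, 1, 6, 9]
judge_2 = [6, 4, 9, 8, 1, 2, 3, 10, 5, 7]

n = len(judge_1)
d = [a - b for a, b in zip(judge_1, judge_2)]   # difference in ranks
sum_d2 = sum(v * v for v in d)                  # sum of squared differences

r = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, round(r, 2))
```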
x 35 40 42 43 40 53 54 49 41 55
y 102 101 97 98 38 101 97 92 95 95
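Solution:
x      y      Rank in x    Rank in y    Difference d    $d^2$
35     102    10           1             9              81
40     101    8.5          2.5           6              36
42     97     6            5.5           0.5            0.25
43     98     5            4             1              1
40     38     8.5          10           -1.5            2.25
53     101    3            2.5           0.5            0.25
54     97     2            5.5          -3.5            12.25
49     92     4            9            -5              25
41     95     7            7.5          -0.5            0.25
55     95     1            7.5          -6.5            42.25
$\sum d^2$ = 200.50
The value 40 is repeated twice in x, and the values 101, 97 and 95 are each repeated twice in y. So m = 2 for each of the four ties, and each correction term is $\frac{m}{12}(m^2 - 1) = \frac{2}{12}(4 - 1) = 0.5$.
Rank Correlation, $r = 1 - \frac{6\left\{\sum d^2 + \frac{m}{12}(m^2 - 1) + \frac{m}{12}(m^2 - 1) + \frac{m}{12}(m^2 - 1) + \frac{m}{12}(m^2 - 1)\right\}}{n(n^2 - 1)}$
$= 1 - \frac{6(200.50 + 0.5 + 0.5 + 0.5 + 0.5)}{990} = 1 - \frac{1215}{990} = -0.227$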
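The tied-rank calculation can be sketched in Python as below. The helper names `average_ranks` and `tie_correction` are hypothetical; the sketch assumes, as in the table above, that rank 1 goes to the largest value and that tied values share the average of the ranks they occupy.

```python
from collections import Counter


def average_ranks(values):
    """Rank from largest (rank 1) to smallest; tied values share the average rank."""
    ordered = sorted(values, reverse=True)
    positions = {}
    for pos, v in enumerate(ordered, start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def tie_correction(values):
    """Sum of m(m^2 - 1)/12 over every value repeated m > 1 times."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)


x = [35, 40, 42, 43, 40, 53, 54, 49, 41, 55]
y = [102, 101, 97, 98, 38, 101, 97, 92, 95, 95]

rank_x = average_ranks(x)
rank_y = average_ranks(y)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
correction = tie_correction(x) + tie_correction(y)

n = len(x)
r = 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))
print(sum_d2, correction, round(r, 3))
```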
Regression Analysis: By studying correlation we can learn the existence, degree and
direction of the relationship between two variables, but we cannot answer questions of the
type: if there is a certain amount of change in one variable, what will be the corresponding
change in the other variable? Questions of this type can be answered if we can establish
a quantitative relationship between the two related variables.
The statistical tool by which it is possible to predict or estimate the unknown values of one
variable from known values of another variable is called regression. A line of regression is
a straight line.
The regression line of Y on X is
$(y - \bar{y}) = b_{yx}(x - \bar{x})$
where $b_{yx}$ is called the regression coefficient of Y on X and is computed as
$b_{yx} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$
This equation can be used to compute the value of y for a given value of x.
Similarly, the regression line of X on Y is
$(x - \bar{x}) = b_{xy}(y - \bar{y})$
where $b_{xy}$ is called the regression coefficient of X on Y and is computed as
$b_{xy} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum y^2 - (\sum y)^2}$
This equation can be used to compute the value of x for a given value of y.
NOTE:
(1) $b_{xy}$ and $b_{yx}$ can also be computed using the following formulas:
$b_{xy} = \frac{r\sigma_x}{\sigma_y}$ and $b_{yx} = \frac{r\sigma_y}{\sigma_x}$
(2) The angle between the two regression lines is:
$\theta = \tan^{-1}\left|\left(\frac{r^2 - 1}{r}\right)\frac{\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2}\right|$
When $r = 0$, $\theta = \frac{\pi}{2}$; in this case the two regression lines are perpendicular to each
other. When $r = \pm 1$, $\theta = 0$; in this case the two regression lines coincide, because the
point $(\bar{x}, \bar{y})$ is common to both.
$b_{xy} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum y^2 - (\sum y)^2} = 3.6$
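The regression coefficient formulas above can be illustrated with a short Python sketch. As an assumption made purely for illustration, it reuses the ten (x, y) pairs from the correlation example earlier in this unit; the helper names `predict_y` and `predict_x` are hypothetical.

```python
# Illustrative data: the (x, y) pairs of the correlation example (an assumption,
# chosen only so that the numbers can be compared with r = 0.85 found there).
x = [100, 101, 102, 102, 100, 99, 97, 98, 96, 95]
y = [98, 99, 99, 97, 95, 92, 95, 94, 90, 91]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Regression coefficients of Y on X and of X on Y.
b_yx = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b_xy = (n * sum_xy - sum_x * sum_y) / (n * sum_y2 - sum_y ** 2)

mean_x, mean_y = sum_x / n, sum_y / n


def predict_y(x0):
    """Regression line of Y on X: y - mean_y = b_yx * (x - mean_x)."""
    return mean_y + b_yx * (x0 - mean_x)


def predict_x(y0):
    """Regression line of X on Y: x - mean_x = b_xy * (y - mean_y)."""
    return mean_x + b_xy * (y0 - mean_y)


# Since b_xy = r*sigma_x/sigma_y and b_yx = r*sigma_y/sigma_x, their product is r^2.
r_squared = b_yx * b_xy
print(round(b_yx, 4), round(b_xy, 4), round(r_squared, 4))
```

Both lines pass through the point of means, so `predict_y(mean_x)` returns `mean_y` and `predict_x(mean_y)` returns `mean_x`.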
Q: Where does this given function come from in the first place?
• Analytical models of phenomena (e.g. equations from physics)
• Create an equation from observed data
1) Connecting the data points: this has limited use as a general function. Since it is really a group
of small functions, each connecting one point to the next, it does not work very well for data
that has built-in random error (scatter).
2) Curve fitting: capturing the trend in the data by assigning a single function across the entire
range. The example below uses a straight line function
$f(x) = ax + b$
How can we pick the coefficients that best fit the line
to the data?
First question: What makes a particular straight line a
‘good’ fit?
Why does the blue line appear to us to fit the trend
better?
• Consider the distance between the data and points
on the line
• Add up the lengths of all the red and blue vertical
lines
• This is an expression of the ‘error’ between data and
fitted line
• The one line that provides a minimum error is then
the ‘best’ straight line
Quantifying errors in a curve fit
Assumption:
(1) Positive and negative errors have the same
value (whether the data point is above or below the line)
$err = \sum_{i=1}^{n} d_i^2 = (y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 + \dots + (y_n - f(x_n))^2$
$err = \sum_{i=1}^{n} (y_i - (ax_i + b))^2$
Setting the derivatives of the error with respect to a and b equal to 0 and simplifying gives
two equations in a and b.
Solve the equations
$\sum_{i=1}^{n} y_i = a\sum_{i=1}^{n} x_i + nb$    (1)
$\sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i$    (2)
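$\sum x_i = 2.5$, $\sum y_i = 7.5$, $\sum x_i^2 = 6.25$, $\sum x_i y_i = 18.75$
Substituting these values in
$\sum_{i=1}^{n} y_i = a\sum_{i=1}^{n} x_i + nb$    (1)
$\sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i$    (2)
gives
7.5 = 2.5a + 6b
18.75 = 6.25a + 2.5b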
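Solving equations (1) and (2) for a and b can be sketched in Python as follows. The function name `fit_line` and the five data points are hypothetical, made up only to show the mechanics; they are not data from these notes.

```python
def fit_line(xs, ys):
    """Least-squares straight line f(x) = a*x + b from the two normal equations.

    (1) sum(y)   = a*sum(x)   + n*b
    (2) sum(x*y) = a*sum(x^2) + b*sum(x)
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    a = (n * sum_xy - sum_x * sum_y) / denom   # slope
    b = (sum_y - a * sum_x) / n                # intercept, from equation (1)
    return a, b


# Hypothetical, slightly scattered data lying close to y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
a, b = fit_line(xs, ys)

# Sum of squared errors between the data and the fitted line.
err = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
print(round(a, 4), round(b, 4), round(err, 4))
```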
So, what do we do if a straight line is not suitable for the data?
A straight line will not predict the diminishing returns that the data show.
Curve fitting - higher order polynomials
We started the linear curve fit by choosing a generic form of the straight line, f(x) = ax + b.
This is just one kind of function. There are an infinite number of generic forms we could choose
from, for almost any shape we want. Let's start with a simple extension to the linear regression
concept. Recall the examples of sampled data.
Is a straight line suitable for each of these cases? Top left and bottom right don't look linear in
trend, so why fit a straight line? There is no reason to, so let's consider other options. There are lots of
functions with lots of different shapes that depend on coefficients. We can choose a form
based on experience and trial and error. Let's develop a few options for non-linear curve fitting.
We'll start with a simple extension to linear regression: higher order polynomials.
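For a second-order polynomial $f(x) = a + bx + cx^2$, the error is
$err = \sum_{i=1}^{n} d_i^2 = (y_1 - (a + bx_1 + cx_1^2))^2 + (y_2 - (a + bx_2 + cx_2^2))^2 + \dots + (y_n - (a + bx_n + cx_n^2))^2$
$err = \sum_{i=1}^{n} (y_i - (a + bx_i + cx_i^2))^2$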
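To minimize the error, set the derivatives with respect to a, b and c equal to 0.
$\frac{\partial err}{\partial a} = -2\sum_{i=1}^{n} (y_i - (a + bx_i + cx_i^2)) = 0$
$\frac{\partial err}{\partial b} = -2\sum_{i=1}^{n} x_i (y_i - (a + bx_i + cx_i^2)) = 0$
$\frac{\partial err}{\partial c} = -2\sum_{i=1}^{n} x_i^2 (y_i - (a + bx_i + cx_i^2)) = 0$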
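Simplifying these equations, we get
$\sum_{i=1}^{n} y_i = an + b\sum_{i=1}^{n} x_i + c\sum_{i=1}^{n} x_i^2$
$\sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2 + c\sum_{i=1}^{n} x_i^3$
$\sum_{i=1}^{n} x_i^2 y_i = a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i^3 + c\sum_{i=1}^{n} x_i^4$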
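Solution:
x_i     y_i      x_i^2     x_i^3      x_i^4       x_i y_i     x_i^2 y_i
0       0        0         0          0           0           0
0.5     0.25     0.25      0.125      0.0625      0.125       0.0625
1       1        1         1          1           1           1
1.5     2.25     2.25      3.375      5.0625      3.375       5.0625
2       4        4         8          16          8           16
2.5     6.25     6.25      15.625     39.0625     15.625      39.0625
Total: $\sum x_i$ = 7.5, $\sum y_i$ = 13.75, $\sum x_i^2$ = 13.75, $\sum x_i^3$ = 28.125, $\sum x_i^4$ = 61.1875, $\sum x_i y_i$ = 28.125, $\sum x_i^2 y_i$ = 61.1875
With n = 6, these totals are substituted in
$\sum_{i=1}^{n} y_i = an + b\sum_{i=1}^{n} x_i + c\sum_{i=1}^{n} x_i^2$
$\sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2 + c\sum_{i=1}^{n} x_i^3$
$\sum_{i=1}^{n} x_i^2 y_i = a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i^3 + c\sum_{i=1}^{n} x_i^4$
to give
13.75 = 6a + 7.5b + 13.75c
28.125 = 7.5a + 13.75b + 28.125c
61.1875 = 13.75a + 28.125b + 61.1875c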
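The three normal equations can be built and solved with a short Python sketch. It uses the six (x, y) points from the table above and a small hand-written Gaussian elimination routine, so it needs no external library; the names `solve_linear_system` and `fit_quadratic` are hypothetical and chosen only for this illustration.

```python
def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):                       # back substitution
        s = sum(M[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def fit_quadratic(xs, ys):
    """Least-squares fit of f(x) = a + b*x + c*x^2 via the normal equations."""
    n = len(xs)
    s = [sum(x ** p for x in xs) for p in range(5)]      # s[p] = sum of x^p
    t = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in range(3)]
    A = [[n,    s[1], s[2]],
         [s[1], s[2], s[3]],
         [s[2], s[3], s[4]]]
    return solve_linear_system(A, t)


# Data from the table above.
xs = [0, 0.5, 1, 1.5, 2, 2.5]
ys = [0, 0.25, 1, 2.25, 4, 6.25]
a, b, c = fit_quadratic(xs, ys)
print(round(a, 6), round(b, 6), round(c, 6))
```

Because every tabulated y equals x squared, the fitted coefficients come out (up to rounding) as a = 0, b = 0 and c = 1.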
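(1) $y = Ce^{Ax}$
Taking natural logarithms gives $\ln y = \ln C + Ax$, that is $Y = aX + b$ with $Y = \ln y$, $X = x$, $a = A$ and $b = \ln C$. The normal equations are
$\sum_{i=1}^{n} Y_i = a\sum_{i=1}^{n} X_i + nb$    (1)
$\sum_{i=1}^{n} X_i Y_i = a\sum_{i=1}^{n} X_i^2 + b\sum_{i=1}^{n} X_i$    (2)
After getting the values of a and b, $A = a$ and $C = \text{antilog}\, b = e^{b}$.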
Example: An experiment gave the following values:
X     1     5     7     9    12
Y    10    15    12    15    21
Fit an exponential curve $y = Ce^{Ax}$
Solution:
X_i    y_i    Y_i = ln y_i    X_i^2    X_i Y_i
1      10     2.302585        1        2.302585
5      15     2.70805         25       13.54025
7      12     2.484906        49       17.39435
9      15     2.70805         81       24.37245
12     21     3.044522        144      36.53427
Total: $\sum X_i$ = 34, $\sum Y_i$ = 13.24811, $\sum X_i^2$ = 300, $\sum X_i Y_i$ = 94.1439 (n = 5)
The normal equations become
13.24811 = 34a + 5b
94.1439 = 300a + 34b
Solving, a = 0.05896 and b = 2.248664
A = a = 0.05896
C = antilog(2.248664) = $e^{2.248664}$ = 9.475068
So the fitted curve is $y = 9.475\,e^{0.05896x}$
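The exponential fit can be sketched in Python by fitting a straight line to (x, ln y). The sketch assumes the five data points shown in the solution table and natural logarithms, so that A is the slope and C is the exponential of the intercept; the function name `fit_exponential` is hypothetical.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = C * exp(A*x) by least squares on Y = ln(y) = A*x + ln(C)."""
    n = len(xs)
    Y = [math.log(y) for y in ys]            # natural logarithm of each y
    sum_x = sum(xs)
    sum_Y = sum(Y)
    sum_x2 = sum(x * x for x in xs)
    sum_xY = sum(x * v for x, v in zip(xs, Y))

    A = (n * sum_xY - sum_x * sum_Y) / (n * sum_x2 - sum_x ** 2)   # slope
    ln_C = (sum_Y - A * sum_x) / n                                  # intercept
    return A, math.exp(ln_C)


# Data points from the solution table above.
xs = [1, 5, 7, 9, 12]
ys = [10, 15, 12, 15, 21]
A, C = fit_exponential(xs, ys)
print(round(A, 4), round(C, 3))
```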
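(2) $y = bx^a$
Taking logarithms gives $\log y = \log b + a\log x$, that is $Y = B + AX$ with $Y = \log y$, $X = \log x$, $B = \log b$ and $A = a$. The normal equations are
$\sum_{i=1}^{n} Y_i = nB + A\sum_{i=1}^{n} X_i$    (1)
$\sum_{i=1}^{n} X_i Y_i = B\sum_{i=1}^{n} X_i + A\sum_{i=1}^{n} X_i^2$    (2)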
It is known that v and t are connected by the relation $v = bt^a$. Find the best possible values of a and b.
v      t      Y = log v    X = log t    X^2            XY
350    61     2.544068     1.78533      3.18740262     4.542001
400    26     2.60206      1.414973     2.002149575    3.681846
500    7      2.69897      0.845098     0.714190697    2.280894
600    2.6    2.778151     0.414973     0.17220288     1.152859
Total: $\sum Y_i$ = 10.62325, $\sum X_i$ = 4.460375, $\sum X_i^2$ = 6.075945772, $\sum X_i Y_i$ = 11.6576 (n = 4)
Substituting in
$\sum_{i=1}^{n} Y_i = nB + A\sum_{i=1}^{n} X_i$    (1)
$\sum_{i=1}^{n} X_i Y_i = B\sum_{i=1}^{n} X_i + A\sum_{i=1}^{n} X_i^2$    (2)
gives
10.62325 = 4B + 4.460375A
11.6576 = 4.460375B + 6.075945772A
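The two equations above can be solved numerically with a short Python sketch. It assumes base-10 logarithms, as in the table, so that a = A and b = 10^B; the function name `fit_power` is hypothetical and introduced only for illustration.

```python
import math


def fit_power(ts, vs):
    """Fit v = b * t**a by least squares on log10(v) = log10(b) + a*log10(t)."""
    n = len(ts)
    X = [math.log10(t) for t in ts]
    Y = [math.log10(v) for v in vs]
    sum_X = sum(X)
    sum_Y = sum(Y)
    sum_X2 = sum(x * x for x in X)
    sum_XY = sum(x * y for x, y in zip(X, Y))

    A = (n * sum_XY - sum_X * sum_Y) / (n * sum_X2 - sum_X ** 2)   # slope, a = A
    B = (sum_Y - A * sum_X) / n                                     # intercept, B = log10(b)
    return A, 10 ** B


# v and t values from the table above.
vs = [350, 400, 500, 600]
ts = [61, 26, 7, 2.6]
a, b = fit_power(ts, vs)
print(round(a, 4), round(b, 1))
```

The negative exponent reflects that v falls as t grows in the tabulated data.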