Unit 3 - Statistics

The document provides an overview of statistics, focusing on descriptive statistics, correlation, and regression as part of a B.Tech program at Parul University. It defines key concepts such as population, sample, measures of central tendency (mean, median, mode), and their calculations, along with examples. Additionally, it discusses measures of variability, including skewness and variance, highlighting the importance of statistics across various fields.

Uploaded by

princekhandelwal2611


PARUL UNIVERSITY

FACULTY OF ENGINEERING AND TECHNOLOGY

DEPARTMENT OF APPLIED SCIENCE AND HUMANITIES
3rd SEMESTER B.TECH PROGRAMME (MECHANICAL)
PDE PROBABILITY AND STATISTICS (303191204)
ACADEMIC YEAR 2024-25

UNIT 3 STATISTICS

DESCRIPTIVE STATISTICS & CORRELATION AND REGRESSION

Introduction:
In the modern world of computers and information technology, the importance of statistics is
well recognized by all disciplines. Statistics originated as a science of statehood and
gradually found applications in agriculture, economics, commerce, biology,
medicine, industry, planning, education and so on. Today there is hardly any walk of human
life where statistics cannot be applied.

Meaning of Statistics:
Statistics is concerned with scientific methods for collecting, organizing, summarizing,
presenting and analyzing data, as well as deriving valid conclusions and making reasonable
decisions on the basis of this analysis. In short, statistics is concerned with the systematic
collection of numerical data and its interpretation.
The word 'statistics' is used to refer to:

1. Numerical facts, such as the number of people living in a particular area.

2. The study of ways of collecting, analyzing and interpreting the facts.

Definition:
Statistics may be defined as the science of collection, presentation, analysis and interpretation of
numerical data.
The data to be interpreted are of two types:
(i) Population
(ii) Sample

Population: A population is the set of all possible data values for a subject under consideration.
Sample: A sample is a set of data values drawn from a much larger population.

Descriptive Statistics

Measures of Central Tendency: There are three measures:

(i) Mean, (ii) Median, (iii) Mode
Mean: The most important measure of location is the mean, or average value, of a variable. The
mean provides a measure of central location for the data. If the data are for a sample, the mean is
denoted by $\bar{x}$. If the data are for a population, the mean is denoted by the Greek letter $\mu$.
In statistical formulas, it is customary to denote the value of variable x for the first observation
by $x_1$, the value of variable x for the second observation by $x_2$, and so on. In general, the value of
variable x for the ith observation is denoted by $x_i$. For a sample with n observations, the formula
for the sample mean is as follows.
$$\bar{x} = \frac{\sum x_i}{n}$$
In the preceding formula, the numerator is the sum of the values of the n observations. That is,
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + x_4 + \dots + x_n$$
The Greek letter $\sum$ is the summation sign.

Example: Find the mean weight of 10 students whose weights in kilograms are as follows:
32, 26, 41, 35, 28, 42, 36, 40, 33, 42

Solution:
$$\bar{x} = \frac{32 + 26 + 41 + 35 + 28 + 42 + 36 + 40 + 33 + 42}{10} = \frac{355}{10} = 35.5 \text{ kilograms}$$
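As a quick cross-check, the sample mean formula can be evaluated in a few lines of Python. This is only an illustrative sketch; the function name `sample_mean` is an assumption made for the example and is not part of the notes.

```python
# Weights of the 10 students from the example, in kilograms.
weights = [32, 26, 41, 35, 28, 42, 36, 40, 33, 42]

def sample_mean(values):
    """Return the arithmetic mean: sum of the observations divided by n."""
    return sum(values) / len(values)

mean_weight = sample_mean(weights)
print(mean_weight)  # 35.5
```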

Grouped Data
If the data are given as a frequency distribution, the formula for the mean is as follows:

$$\bar{x} = \frac{\sum x_i f_i}{n}$$
Where $x_i$ = mid value of the class (or the value itself when the data are not in classes)
$n$ = total frequency = $\sum f_i$

Example: Find the mean of the following:

xi   0    1    2    3    4    5
fi   3    20   15   8    3    1
Solution:
xi    fi    xi fi
0     3     0
1     20    20
2     15    30
3     8     24
4     3     12
5     1     5
n = 50      $\sum x_i f_i = 91$
$$\bar{x} = \frac{\sum x_i f_i}{n} = \frac{91}{50} = 1.82$$
Example: Find the mean of the following data:
Class   0-10   10-20   20-30   30-40   40-50
fi      5      8       15      16      6
Solution:
Class    fi    xi    xi fi
0-10     5     5     25
10-20    8     15    120
20-30    15    25    375
30-40    16    35    560
40-50    6     45    270
n = 50               $\sum x_i f_i = 1350$
$$\bar{x} = \frac{\sum x_i f_i}{n} = \frac{1350}{50} = 27$$
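The grouped-data calculation above can be sketched in Python as follows. This is a minimal illustration, assuming each class is given as a (lower, upper) pair and is represented by its mid value; the name `grouped_mean` is hypothetical.

```python
def grouped_mean(classes, freqs):
    """Mean of grouped data: sum(x_i * f_i) / sum(f_i), with x_i the class mid value."""
    mids = [(lo + hi) / 2 for lo, hi in classes]
    total = sum(freqs)  # n = total frequency
    return sum(x * f for x, f in zip(mids, freqs)) / total

classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 8, 15, 16, 6]
print(grouped_mean(classes, freqs))  # 27.0
```

For a simple frequency table without class intervals, the same function can be used by passing each value as a degenerate class, for example (2, 2) for the value 2.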
Median: The median is another measure of central location. The median is the value in the
middle when the data are arranged in ascending order (smallest value to largest value) or
descending order (largest value to smallest value). The median is denoted by M.
If there is an odd number of observations, the median is found as:
$$M = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ observation}$$
Where, n = number of observations
Example: For the data 2, 3, 5, 6, 7, 8, 9 the median is 6.

If there is an even number of observations, the median is found as:

$$M = \frac{1}{2}\left[\left(\frac{n}{2}\right)^{\text{th}} \text{ observation} + \left(\frac{n}{2}+1\right)^{\text{th}} \text{ observation}\right]$$
Where, n = number of observations

Example: For the data 3, 4, 5, 6, 7, 8 the median is 5.5.

If the data are given as a grouped frequency distribution, the formula for the median is as follows:
$$M = L + \left(\frac{\frac{n}{2} - cf}{f}\right) \times c$$
Where,
L = lower limit of the median class
n = total frequency
cf = cumulative frequency of the class preceding the median class
f = frequency of the median class
c = class width
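The grouped-median formula can be sketched in Python as follows. This is an illustrative sketch, not part of the original notes: the function name `grouped_median` and its arguments are assumptions, and all classes are assumed to have the same width. It is applied here to the 0-10, ..., 60-70 frequency table used in the worked example on grouped data that follows.

```python
def grouped_median(lower_limits, freqs, width):
    """Median of grouped data: M = L + ((n/2 - cf) / f) * c."""
    n = sum(freqs)
    cum = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cum + f >= n / 2:  # first class whose cumulative frequency reaches n/2
            return lower + (n / 2 - cum) / f * width
        cum += f
    raise ValueError("empty frequency table")

lower_limits = [0, 10, 20, 30, 40, 50, 60]
freqs = [4, 8, 12, 20, 24, 15, 7]
median = grouped_median(lower_limits, freqs, 10)
print(round(median, 2))  # 40.42
```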

Example: Find the median of the following data:

xi   0   1   2   3    4
fi   4   1   6   11   3

Solution:
xi    fi    cf
0     4     4
1     1     5
2     6     11
3     11    22
4     3     25
n = 25
Here, n = 25 (odd), so
$$M = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ observation} = \left(\frac{25+1}{2}\right)^{\text{th}} \text{ observation} = 13^{\text{th}} \text{ observation}$$

In the cumulative frequency (cf) column, the 13th observation lies between 11 and 22, which corresponds to $x_i = 3$.

So the median = 3

Example: Find the median of the following:

Class   0-10   10-20   20-30   30-40   40-50   50-60   60-70

fi      4      8       12      20      24      15      7

Solution:

Class    fi    cf
0-10     4     4
10-20    8     12
20-30    12    24
30-40    20    44
40-50    24    68
50-60    15    83
60-70    7     90
n = 90
Here, n = 90, so
$$M = \left(\frac{n}{2}\right)^{\text{th}} \text{ observation} = \left(\frac{90}{2}\right)^{\text{th}} \text{ observation} = 45^{\text{th}} \text{ observation}$$
In the cumulative frequency (cf) column, the 45th observation lies between 44 and 68.
So here
Median class = 40-50, L = 40, n = 90, cf = 44, f = 24, class width c = 10
$$M = L + \left(\frac{\frac{n}{2} - cf}{f}\right) \times c = 40 + \left(\frac{\frac{90}{2} - 44}{24}\right) \times 10 = 40.42$$

Mode: A third measure of location is the mode. The mode is defined as follows.
The mode is the value that occurs with the greatest frequency. The mode is denoted by Z.

Example: 1, 2, 3, 4, 5, 6, 4, 5, 2, 4, 1, 2, 2
In the data given above, 2 occurs the maximum number of times (four times, against three times for 4), so the mode is 2.

Example: Find the mode of the following:

xi   0    1    2    3   4
fi   12   20   10   6   2

Solution: Here the maximum frequency is 20 and the value corresponding to the maximum
frequency is 1. So, the mode is 1.

If the data are given as a grouped frequency distribution, the formula for the mode is as follows:
$$Z = L + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times c$$
Where,
L = lower limit of the modal class
$f_0$ = frequency of the class preceding the modal class
$f_1$ = frequency of the modal class
$f_2$ = frequency of the class succeeding the modal class
c = width of the class
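A minimal Python sketch of the grouped-mode formula is given below. The function name `grouped_mode` is hypothetical, equal class widths are assumed, and a missing neighbouring class is treated as having frequency 0 (an assumption of this sketch, not something stated in the notes). The data are the class table of the next worked example.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode of grouped data: Z = L + (f1 - f0) / (2*f1 - f0 - f2) * c."""
    k = freqs.index(max(freqs))                      # index of the modal class
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0                # class before the modal class
    f2 = freqs[k + 1] if k + 1 < len(freqs) else 0   # class after the modal class
    return lower_limits[k] + (f1 - f0) / (2 * f1 - f0 - f2) * width

lower_limits = [0, 10, 20, 30, 40, 50, 60]
freqs = [5, 9, 11, 13, 10, 7, 2]
mode = grouped_mode(lower_limits, freqs, 10)
print(round(mode, 2))  # 34.0
```

If the modal class and both of its neighbours have the same frequency, the denominator is zero and the formula does not apply.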

Example: Find the mode of the following:

Class   0-10   10-20   20-30   30-40   40-50   50-60   60-70
fi      5      9       11      13      10      7       2

Solution:
In the given data the maximum frequency is 13, so the modal class is 30-40.
L = 30, $f_0$ = 11, $f_1$ = 13, $f_2$ = 10, c = 10

$$Z = L + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times c = 30 + \left(\frac{13 - 11}{(2 \times 13) - 11 - 10}\right) \times 10 = 30 + \left(\frac{2}{5}\right) \times 10 = 34$$

Example: Find the mean, median and mode of the following data:

Class   10-19   20-29   30-39   40-49   50-59

fi      2       9       15      14      10

Solution:
Class    Class boundaries   fi    xi     fi xi    cf
10-19    9.5-19.5           2     14.5   29       2
20-29    19.5-29.5          9     24.5   220.5    11
30-39    29.5-39.5          15    34.5   517.5    26
40-49    39.5-49.5          14    44.5   623      40
50-59    49.5-59.5          10    54.5   545      50
n = 50                                   $\sum x_i f_i = 1935$

Mean:
$$\bar{x} = \frac{\sum x_i f_i}{n} = \frac{1935}{50} = 38.7$$
Median:
$$M = \left(\frac{n}{2}\right)^{\text{th}} \text{ observation} = \left(\frac{50}{2}\right)^{\text{th}} \text{ observation} = 25^{\text{th}} \text{ observation}$$
In the cumulative frequency (cf) column, the 25th observation lies between 11 and 26, so the median class is 29.5-39.5.
L = 29.5, cf = 11, f = 15, n = 50, c = 10
$$M = L + \left(\frac{\frac{n}{2} - cf}{f}\right) \times c = 29.5 + \left(\frac{\frac{50}{2} - 11}{15}\right) \times 10 = 29.5 + 9.3 = 38.8$$
Mode:
In the given data the maximum frequency is 15, so the modal class is 29.5-39.5.
L = 29.5, $f_0$ = 9, $f_1$ = 15, $f_2$ = 14, c = 10

$$Z = L + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) \times c = 29.5 + \left(\frac{15 - 9}{(2 \times 15) - 9 - 14}\right) \times 10 = 29.5 + \left(\frac{6}{7}\right) \times 10 = 38.07$$

The relationship among the mean ($\bar{X}$), median ($M$) and mode ($Z$) can be written as follows:
$$Z = 3M - 2\bar{X}$$

Measures of variability: Skewness is a measure of symmetry, or more precisely, the lack of
symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the
center point.

Negative Skewness: The left tail is longer; the mass of the distribution is concentrated on the
right of the figure. The distribution is said to be left-skewed, left-tailed or skewed to the left.

Positive Skewness: The right tail is longer; the mass of the distribution is concentrated on the
left of the figure. The distribution is said to be right-skewed, right-tailed or skewed to the
right.
Measure of Deviation:

Variance: The variance is a measure of variability that utilizes all the data. The variance is
based on the difference between the value of each observation ($x_i$) and the mean. The difference
between each $x_i$ and the mean ($\bar{x}$ for a sample, $\mu$ for a population) is called a deviation about the
mean. For a sample, a deviation about the mean is written $(x_i - \bar{x})$; for a population, it is
written $(x_i - \mu)$.
Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$
Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$

NOTE: If the sample size is greater than 30, n-1 is often replaced by n because the change
has little effect on the numerical value of the variance.

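Standard Deviation: Standard deviation (s.d.) is the square root of the variance.

Standard deviation can be defined as follows:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}$$

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n} - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2}$$

Where, n is the number of observations.

If the data are given as a frequency distribution, the formula for the standard deviation is as follows:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} f_i}}$$

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} f_i x_i^2}{\sum_{i=1}^{n} f_i} - \left(\frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}\right)^2}$$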

Example: Find the variance and the standard deviation.

X   6   7   8   9    10   11   12
F   3   6   9   13   8    5    4
Solution:

X       F     XF     (X - X̄)²   F(X - X̄)²
6       3     18     9           27
7       6     42     4           24
8       9     72     1           9
9       13    117    0           0
10      8     80     1           8
11      5     55     4           20
12      4     48     9           36
Total   48    432                124

$$\bar{X} = \frac{\sum FX}{\sum F} = \frac{432}{48} = 9$$

$$\text{Variance } \sigma^2 = \frac{\sum F(X - \bar{X})^2}{\sum F} = \frac{124}{48} = 2.5833$$

$$\text{Std. Deviation } \sigma = \sqrt{\frac{\sum F(X - \bar{X})^2}{\sum F}} = \sqrt{\frac{124}{48}} = 1.6073$$
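The frequency-table formulas for the variance and standard deviation can be checked with a short Python sketch. The function name `freq_variance` is an assumption made for this illustration; it divides by the total frequency, as in the grouped formula given in these notes, and is run on the X and F values of the example above.

```python
import math

def freq_variance(values, freqs):
    """Return (mean, variance, standard deviation) of a frequency table.

    Uses sigma^2 = sum(f * (x - mean)^2) / sum(f), i.e. division by the total frequency.
    """
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    var = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n
    return mean, var, math.sqrt(var)

X = [6, 7, 8, 9, 10, 11, 12]
F = [3, 6, 9, 13, 8, 5, 4]
mean, var, sd = freq_variance(X, F)
print(mean, round(var, 4), round(sd, 4))
```

Each deviation is measured from the mean before squaring, so the squared deviations are small numbers here (0, 1, 4 or 9), not the squares of the X values themselves.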

Example: Find the variance and the standard deviation from the following table:

Class       0-10   10-20   20-30   30-40   40-50

Frequency   5      8       15      16      6
Coefficient of Variation
In some situations we may be interested in a descriptive statistic that indicates how large the
standard deviation is relative to the mean. This measure is called the coefficient of variation and
is usually expressed as a percentage.

$$\text{coefficient of variation (c.v.)} = \frac{\text{standard deviation}}{\text{mean}} \times 100$$
A data set with a larger coefficient of variation is said to be more variable, or less consistent.
A data set with a smaller coefficient of variation is said to be less variable, or more consistent.
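As an illustration, the coefficient of variation can be computed with Python's standard `statistics` module. The sketch below reuses the ten student weights from the mean example earlier in these notes; the function name `coefficient_of_variation` is hypothetical, and the population form of the standard deviation (division by n) is assumed.

```python
import statistics

def coefficient_of_variation(values):
    """Return the coefficient of variation in percent: (standard deviation / mean) * 100."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

weights = [32, 26, 41, 35, 28, 42, 36, 40, 33, 42]
cv = coefficient_of_variation(weights)
print(round(cv, 2))  # about 15.44
```

To compare the consistency of two data sets, compute this value for each; the set with the smaller value is the more consistent one.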

Correlation Analysis: So far we have studied problems relating to one variable only. In practice we
come across a large number of problems involving two or more variables. If two
quantities vary in such a way that a change in one is accompanied by a change in the value of
the other, the quantities are said to be correlated.

Types of correlation: There are three types of correlation.

(i) Positive or Negative correlation: If two variables change in the same
direction, the correlation is said to be positive or direct. If two variables
change in opposite directions, the correlation is said to be negative or inverse.
For example: The correlation between the heights and weights of a group of people is
positive, and the correlation between the pressure and volume of a gas is negative.
(ii) Simple, partial or multiple: The distinction between simple, partial and multiple
correlation is based on the number of variables studied. When only two variables are
studied, the correlation is said to be simple. When three or more variables are
involved, the problem may be one of either partial or multiple correlation.

(iii) Linear or Non-linear correlation: If the amount of change in one variable tends to
bear a constant ratio to the amount of change in the other variable, the correlation
is said to be linear.
For example: consider two variables X and Y
X   5    10    15    20    25    30
Y   50   100   150   200   250   300
This clearly shows that the ratio of change between the two variables is constant.

If the amount of change in one variable does not tend to bear a constant ratio to the
amount of change in the other variable, the correlation is said to be non-linear
or curvilinear.

Methods of studying correlation: There are three main methods.

(i) Scatter diagram
(ii) Karl Pearson's method
(iii) Spearman's method of rank correlation
(i) Scatter diagram: This is a very simple method of studying the relationship between
two variables. In this method one variable is taken on the X-axis and the other variable is
taken on the Y-axis, and for each pair of values a point is plotted, as follows:

[Figure: two scatter diagrams, "Perfectly positive" and "Perfectly negative", each plotting Y-AXIS against X-AXIS]

(ii) Karl Pearson's coefficient of correlation: Of the several mathematical methods of
measuring correlation, Karl Pearson's method, popularly known as Pearson's coefficient of
correlation, is the most widely used. It is denoted by r. The formula for computing the
coefficient of correlation is as follows:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}}$$

$$r = \frac{\sum d_x d_y}{\sqrt{\sum d_x^2}\,\sqrt{\sum d_y^2}}$$

Where, $d_x = x - \bar{x}$ and $d_y = y - \bar{y}$
This formula can also be written as follows:

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$$

For grouped data, the formula for the correlation coefficient can be written as follows:

$$r = \frac{n\sum f u v - \sum u f_u \sum v f_v}{\sqrt{n\sum u^2 f_u - (\sum u f_u)^2}\,\sqrt{n\sum v^2 f_v - (\sum v f_v)^2}}$$
OR

$$r = \frac{n\sum f d_x d_y - (\sum f d_x)(\sum f d_y)}{\sqrt{n\sum f d_x^2 - (\sum f d_x)^2}\,\sqrt{n\sum f d_y^2 - (\sum f d_y)^2}}$$

Properties of the coefficient of correlation:

(1) The coefficient of correlation always lies between -1 and 1 including -1 and 1.
i.e. −1 ≤ 𝑟 ≤ 1
(2) The correlation coefficient is independent of change of origin and scale.
(3) The correlation coefficient is an absolute number and it is independent of units of
measurements.
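The deviation form of Pearson's formula and property (2) can be illustrated with a short Python sketch. The function name `pearson_r` is an assumption made for this example. The data are the X and Y values from the linear-correlation illustration earlier in these notes; the shifted and rescaled variables U and V are hypothetical choices used only to show that r is unchanged by a change of origin and scale.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r = sum(dx*dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

X = [5, 10, 15, 20, 25, 30]
Y = [50, 100, 150, 200, 250, 300]
r = pearson_r(X, Y)

# Change of origin and scale: u = (x - 15) / 5, v = (y - 150) / 50
U = [(x - 15) / 5 for x in X]
V = [(y - 150) / 50 for y in Y]
r_shifted = pearson_r(U, V)
print(round(r, 6), round(r_shifted, 6))  # 1.0 1.0
```

Because Y is an exact multiple of X, the correlation is perfectly positive, which is the upper limit allowed by property (1).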

Example: Find Pearson's correlation coefficient for the following data:

x   100   101   102   102   100   99   97   98   96   95

y   98    99    99    97    95    92   95   94   90   91

Solution:

Here $d_x = x - \bar{x}$ and $d_y = y - \bar{y}$.

x      y      dx     dy     dx²    dy²    dx·dy
100    98     1      3      1      9      3
101    99     2      4      4      16     8
102    99     3      4      9      16     12
102    97     3      2      9      4      6
100    95     1      0      1      0      0
99     92     0      -3     0      9      0
97     95     -2     0      4      0      0
98     94     -1     -1     1      1      1
96     90     -3     -5     9      25     15
95     91     -4     -4     16     16     16
Σ=990  Σ=950  Σ=0    Σ=0    Σ=54   Σ=96   Σ=61

$$\bar{x} = \frac{\sum x}{n} = \frac{990}{10} = 99$$
$$\bar{y} = \frac{\sum y}{n} = \frac{950}{10} = 95$$
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} = \frac{61}{\sqrt{54}\,\sqrt{96}} = 0.85$$

(iii) Spearman's method of rank correlation: The rank correlation coefficient is calculated by the following formula:

$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

Where, d = difference between the two ranks of a pair and n = number of pairs

When some ranks are tied (an item is repeated), the above
formula is written as:

$$r = 1 - \frac{6\left\{\sum d^2 + \frac{m}{12}(m^2 - 1) + \frac{m}{12}(m^2 - 1) + \dots\right\}}{n(n^2 - 1)}$$

That is, $\frac{m}{12}(m^2 - 1)$ is added to $\sum d^2$ for each repeated item, where m is the number of times the item is repeated.

The value of the correlation coefficient by Spearman's method also lies between -1 and +1. If the
ranks are the same for each pair of the two series, then each value of d = 0. Hence $\sum d^2 = 0$ and the
value of r = +1, which shows perfect positive correlation between the two variables. If the
ranks are exactly in reverse order for each pair of the two series, then the value of r = -1, which
shows perfect negative correlation between the variables.
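The two limiting cases described above, and the tie correction, can be sketched in Python as follows. All function names here are assumptions made for illustration, and the five-item rankings are hypothetical data. Values are ranked with 1 for the largest, and tied values share the average of the ranks they occupy.

```python
from collections import Counter

def average_ranks(values):
    """Rank values with 1 = largest; tied values share the average of their ranks."""
    positions = {}
    for pos, v in enumerate(sorted(values, reverse=True), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def tie_correction(values):
    """Sum of (m/12)*(m^2 - 1) over every value that is repeated m times."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)

def spearman_r(xs, ys):
    """Rank correlation with the tie correction added to sum(d^2)."""
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    d2 += tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * d2 / (n * (n * n - 1))

ranks = [1, 2, 3, 4, 5]                   # hypothetical ranking of five items
print(spearman_r(ranks, ranks))           # 1.0  (identical ranks)
print(spearman_r(ranks, ranks[::-1]))     # -1.0 (exactly reversed ranks)
```

When the inputs are already ranks without ties, re-ranking them changes only the sign of each difference d, so the sum of d² and the value of r are unaffected.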

Example: Two judges have given ranks to 10 students for their honesty. Find the rank
correlation coefficient of the following data:
1st judge   3   5   8   4   7   10   2   1    6   9
2nd judge   6   4   9   8   1   2    3   10   5   7

Solution:
Rank given by 1st judge   Rank given by 2nd judge   Difference in ranks d   d²
3                         6                         -3                      9
5                         4                         1                       1
8                         9                         -1                      1
4                         8                         -4                      16
7                         1                         6                       36
10                        2                         8                       64
2                         3                         -1                      1
1                         10                        -9                      81
6                         5                         1                       1
9                         7                         2                       4
                                                                            $\sum d^2 = 214$
$$r = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 214}{10(100 - 1)} = 1 - \frac{1284}{990} = 1 - 1.30 = -0.30$$

Example: Find the coefficient of rank correlation of the following data:

x   35    40    42   43   40   53    54   49   41   55
y   102   101   97   98   38   101   97   92   95   95

Solution:
x     y      Rank in x   Rank in y   Difference d   d²
35    102    10          1           9              81
40    101    8.5         2.5         6              36
42    97     6           5.5         0.5            0.25
43    98     5           4           1              1
40    38     8.5         10          -1.5           2.25
53    101    3           2.5         0.5            0.25
54    97     2           5.5         -3.5           12.25
49    92     4           9           -5             25
41    95     7           7.5         -0.5           0.25
55    95     1           7.5         -6.5           42.25
                                                    $\sum d^2 = 200.5$

In x the value 40 is repeated twice, and in y the values 101, 97 and 95 are each repeated twice, so m = 2 for each of the four ties and each correction term is $\frac{2}{12}(2^2 - 1) = 0.5$.

$$r = 1 - \frac{6\left\{\sum d^2 + \frac{m}{12}(m^2 - 1) + \frac{m}{12}(m^2 - 1) + \frac{m}{12}(m^2 - 1) + \frac{m}{12}(m^2 - 1)\right\}}{n(n^2 - 1)}$$
$$= 1 - \frac{6(200.5 + 0.5 + 0.5 + 0.5 + 0.5)}{990} = 1 - \frac{1215}{990} = -0.227$$

Regression Analysis: By studying correlation we can learn the existence, degree and
direction of the relationship between two variables, but we cannot answer questions of the
type: if there is a certain amount of change in one variable, what will be the corresponding
change in the other variable? Questions of this type can be answered if we can establish
a quantitative relationship between the two related variables.
The statistical tool by which it is possible to predict or estimate the unknown values of one
variable from known values of another variable is called regression. A line of regression is a
straight line.
The regression line of Y on X is

$$(y - \bar{y}) = b_{yx}(x - \bar{x})$$

where $b_{yx}$ is called the regression coefficient of Y on X and is computed as
$$b_{yx} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$$
This equation can be used to compute the value of y for a given value of x.
Similarly, the regression line of X on Y is

$$(x - \bar{x}) = b_{xy}(y - \bar{y})$$

where $b_{xy}$ is called the regression coefficient of X on Y and is computed as
$$b_{xy} = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum y^2 - (\sum y)^2}$$
This equation can be used to compute the value of x for a given value of y.

NOTE:
(1) $b_{xy}$ and $b_{yx}$ can also be computed using the following formulas:
$$b_{xy} = \frac{r\sigma_x}{\sigma_y} \quad \text{and} \quad b_{yx} = \frac{r\sigma_y}{\sigma_x}$$
(2) The angle between the two regression lines is given by:

$$\theta = \tan^{-1}\left|\left(\frac{r^2 - 1}{r}\right)\frac{\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2}\right|$$
When $r = 0$, $\theta = \frac{\pi}{2}$; in this case the two regression lines are perpendicular to each
other. When $r = \pm 1$, $\theta = 0$; in this case the two regression lines coincide, because
both pass through the common point $(\bar{x}, \bar{y})$.

Properties of regression coefficients:

(1) $r = \pm\sqrt{b_{xy}\, b_{yx}}$; the sign taken before the square root is that of the
regression coefficients.
(2) Since $b_{xy}\, b_{yx} = r^2 \le 1$, both regression coefficients cannot be greater than
unity (1) at the same time.
(3) The arithmetic mean of the regression coefficients is greater than or equal to the coefficient of
correlation, i.e.
$$\frac{b_{xy} + b_{yx}}{2} \ge r$$
(4) Regression coefficients are independent of change of origin but not of scale.
Example: The following data regarding the heights (y) and weights (x) of 100 college students
are given: $\sum x = 15000$, $\sum x^2 = 2272500$, $\sum xy = 1022250$, $\sum y = 6800$, $\sum y^2 = 463025$. Find
the coefficient of correlation between height and weight and also the equations of regression of
height and weight.
Solution:
Here, n = 100
$$b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{225000}{2250000} = 0.1$$

$$b_{xy} = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - (\sum y)^2} = \frac{225000}{62500} = 3.6$$

$$r = \sqrt{b_{xy} \times b_{yx}} = \sqrt{3.6 \times 0.1} = 0.6$$

$$\bar{x} = \frac{\sum x}{n} = \frac{15000}{100} = 150$$
$$\bar{y} = \frac{\sum y}{n} = \frac{6800}{100} = 68$$
The equation of the line of regression of y on x is:
$$y - \bar{y} = b_{yx}(x - \bar{x})$$
$$y - 68 = 0.1(x - 150)$$
$$y = 0.1x + 53$$
The equation of the line of regression of x on y is:
$$x - \bar{x} = b_{xy}(y - \bar{y})$$
$$x - 150 = 3.6(y - 68)$$
$$x = 3.6y - 94.8$$
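The computation in this example uses only the five sums and n, so it can be sketched as a small Python function. The name `regression_from_sums` and its argument order are assumptions made for this illustration.

```python
import math

def regression_from_sums(n, sx, sy, sxx, syy, sxy):
    """Return (b_yx, b_xy, r) from n, sum x, sum y, sum x^2, sum y^2 and sum xy."""
    num = n * sxy - sx * sy
    b_yx = num / (n * sxx - sx ** 2)
    b_xy = num / (n * syy - sy ** 2)
    # r takes the common sign of the two regression coefficients
    r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)
    return b_yx, b_xy, r

n = 100
b_yx, b_xy, r = regression_from_sums(n, 15000, 6800, 2272500, 463025, 1022250)
x_bar, y_bar = 15000 / n, 6800 / n

# Intercepts of the two regression lines: y = b_yx*x + c_y and x = b_xy*y + c_x
c_y = y_bar - b_yx * x_bar
c_x = x_bar - b_xy * y_bar
print(round(b_yx, 4), round(b_xy, 4), round(r, 4))  # 0.1 3.6 0.6
print(round(c_y, 1), round(c_x, 1))                 # 53.0 -94.8
```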
Example: Find the equation of the regression line from the following data and also estimate y for
x = 1 and x for y = 4.
Curve Fitting

Q: Where does a given function come from in the first place?
• Analytical models of phenomena (e.g. equations from physics)
• Create an equation from observed data

1) Interpolation (connect the data-dots)
If the data are reliable, we can plot them and connect the dots. This is piece-wise
linear interpolation.

This has limited use as a general function. Since it is really a group of small functions, connecting
one point to the next, it doesn't work very well for data that have built-in random error (scatter).

2) Curve fitting - capturing the trend in the data by assigning a single function across the entire
range. The example below uses a straight line function.

A straight line is described by $f(x) = ax + b$

The goal is to identify the coefficients 'a' and 'b' such that f(x) fits the data well.
Other examples of data sets that we can fit a function to:

Is a straight line suitable for each of these cases?

No. But we're not stuck with just straight line fits. We'll start with straight lines, then expand
the concept.

Linear curve fitting (linear regression)

Given the general form of a straight line

$$f(x) = ax + b$$
How can we pick the coefficients that best fit the line
to the data?
First question: What makes a particular straight line a
'good' fit?
Why does the blue line appear to us to fit the trend
better?
• Consider the distance between the data and points
on the line
• Add up the length of all the red and blue vertical
lines
• This is an expression of the 'error' between data and
fitted line
• The one line that provides a minimum error is then
the 'best' straight line
Quantifying errors in a curve fit

Assumptions:
(1) Positive and negative errors have the same
value (whether the data point is above or below the line)

(2) Greater errors are weighted more heavily

We can do both of these things by squaring the
distance. Denote data values as (x, y) and
denote points on the fitted line as (x, f(x)).
Sum the squared error over the data points (four in the figure):

$$err = \sum_{i=1}^{n} d_i^2 = (y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 + \dots + (y_n - f(x_n))^2$$

$$= (y_1 - (ax_1 + b))^2 + (y_2 - (ax_2 + b))^2 + \dots + (y_n - (ax_n + b))^2$$

$$= \sum_{i=1}^{n} (y_i - (ax_i + b))^2$$

The error is minimum when the first-order partial derivatives are zero:

$$\frac{\partial\, err}{\partial a} = -2\sum_{i=1}^{n} x_i (y_i - ax_i - b) = 0, \qquad \frac{\partial\, err}{\partial b} = -2\sum_{i=1}^{n} (y_i - ax_i - b) = 0$$
$$\sum_{i=1}^{n} x_i y_i - a\sum_{i=1}^{n} x_i^2 - b\sum_{i=1}^{n} x_i = 0 \quad \text{and} \quad \sum_{i=1}^{n} y_i - a\sum_{i=1}^{n} x_i - b\sum_{i=1}^{n} 1 = 0$$
$$\sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i, \qquad \sum_{i=1}^{n} y_i = a\sum_{i=1}^{n} x_i + nb$$
Solve the equations
$$\sum_{i=1}^{n} y_i = a\sum_{i=1}^{n} x_i + nb \qquad (1)$$
$$\sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i \qquad (2)$$

Example: Fit a straight line using the least squares method

xi   0   0.5   1   1.5   2   2.5

yi   0   1.5   3   4.5   6   7.5
Solution:
xi     yi     xi²     xi yi
0      0      0       0
0.5    1.5    0.25    0.75
1      3      1       3
1.5    4.5    2.25    6.75
2      6      4       12
2.5    7.5    6.25    18.75
$\sum x_i = 7.5$, $\sum y_i = 22.5$, $\sum x_i^2 = 13.75$, $\sum x_i y_i = 41.25$

Now, solve the equations

$$\sum_{i=1}^{n} y_i = a\sum_{i=1}^{n} x_i + nb \qquad (1)$$
$$\sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i \qquad (2)$$

Substitute the values from the table; here n = 6.

22.5 = 7.5a + 6b
41.25 = 13.75a + 7.5b

a = 3 and b = 0

Hence, the best fit line is y = 3x, which passes through every data point because the data satisfy y = 3x exactly.
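The two normal equations can be solved in closed form, which gives a compact Python sketch of the least squares line. The function name `fit_line` is an assumption made for this illustration; the data are the six (x, y) pairs of the example.

```python
def fit_line(xs, ys):
    """Solve the two normal equations for f(x) = a*x + b and return (a, b)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope
    b = (sy - a * sx) / n                          # intercept, from equation (1)
    return a, b

xs = [0, 0.5, 1, 1.5, 2, 2.5]
ys = [0, 1.5, 3, 4.5, 6, 7.5]
a, b = fit_line(xs, ys)
print(a, b)  # 3.0 0.0
```

Since every y value here equals 3 times its x value, the fitted line has slope 3 and intercept 0, and the squared error is zero.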

So, what do we do if a straight line is not suitable for the data?

A straight line will not predict the diminishing returns that the data show.
Curve fitting - higher order polynomials
We started the linear curve fit by choosing a generic form of the straight line, f(x) = ax + b.
This is just one kind of function. There are an infinite number of generic forms we could choose
from for almost any shape we want. Let's start with a simple extension to the linear regression
concept. Recall the examples of sampled data.

Is a straight line suitable for each of these cases? Top left and bottom right don't look linear in
trend, so why fit a straight line? No reason to; let's consider other options. There are lots of
functions with lots of different shapes that depend on coefficients. We can choose a form
based on experience and trial and error. Let's develop a few options for non-linear curve fitting.
We'll start with a simple extension to linear regression: higher order polynomials.

Curve fitting - Quadratic polynomial

Let the general form of the second order polynomial be $f(x) = a + bx + cx^2$.
Just as was the case for linear regression, we ask:
How can we pick the coefficients that best fit the
curve to the data? We can use the same idea:
The curve that gives the minimum error between the y
data and the fit f(x) is 'best'.
Quantify the error for these two second order
curves...
• Add up the length of all the red and blue
vertical lines
• Pick the curve with minimum total error

Error - Least squares approach

$$err = \sum_{i=1}^{n} d_i^2 = (y_1 - f(x_1))^2 + (y_2 - f(x_2))^2 + \dots + (y_n - f(x_n))^2$$

$$= (y_1 - (a + bx_1 + cx_1^2))^2 + (y_2 - (a + bx_2 + cx_2^2))^2 + \dots + (y_n - (a + bx_n + cx_n^2))^2$$

$$= \sum_{i=1}^{n} (y_i - (a + bx_i + cx_i^2))^2$$
To minimize the error, set the derivatives with respect to a, b and c equal to 0.
$$\frac{\partial\, err}{\partial a} = -2\sum_{i=1}^{n} (y_i - a - bx_i - cx_i^2) = 0$$

$$\frac{\partial\, err}{\partial b} = -2\sum_{i=1}^{n} x_i (y_i - a - bx_i - cx_i^2) = 0$$

$$\frac{\partial\, err}{\partial c} = -2\sum_{i=1}^{n} x_i^2 (y_i - a - bx_i - cx_i^2) = 0$$
Simplifying these equations, we get
$$\sum_{i=1}^{n} y_i = an + b\sum_{i=1}^{n} x_i + c\sum_{i=1}^{n} x_i^2$$
$$\sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2 + c\sum_{i=1}^{n} x_i^3$$
$$\sum_{i=1}^{n} x_i^2 y_i = a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i^3 + c\sum_{i=1}^{n} x_i^4$$

Example: Fit a second order polynomial equation to the following data

xi   0   0.5    1.0   1.5    2.0   2.5

yi   0   0.25   1.0   2.25   4.0   6.25

Solution:
xi     yi      xi²     xi³      xi⁴       xi yi    xi² yi
0      0       0       0        0         0        0
0.5    0.25    0.25    0.125    0.0625    0.125    0.0625
1      1       1       1        1         1        1
1.5    2.25    2.25    3.375    5.0625    3.375    5.0625
2      4       4       8        16        8        16
2.5    6.25    6.25    15.625   39.0625   15.625   39.0625
$\sum x_i = 7.5$, $\sum y_i = 13.75$, $\sum x_i^2 = 13.75$, $\sum x_i^3 = 28.125$, $\sum x_i^4 = 61.1875$, $\sum x_i y_i = 28.125$, $\sum x_i^2 y_i = 61.1875$

Substitute these values in the equations
$$\sum_{i=1}^{n} y_i = an + b\sum_{i=1}^{n} x_i + c\sum_{i=1}^{n} x_i^2$$
$$\sum_{i=1}^{n} x_i y_i = a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2 + c\sum_{i=1}^{n} x_i^3$$
$$\sum_{i=1}^{n} x_i^2 y_i = a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i^3 + c\sum_{i=1}^{n} x_i^4$$
With n = 6 this gives
13.75 = 6a + 7.5b + 13.75c
28.125 = 7.5a + 13.75b + 28.125c
61.1875 = 13.75a + 28.125b + 61.1875c
Solving, a = 0, b = 0 and c = 1.

Hence, $y = x^2$ is the required equation which fits the data.
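The three normal equations form a 3 x 3 linear system, which can be solved numerically. The sketch below uses a small Gaussian elimination routine written for this illustration; the names `solve_linear` and `fit_quadratic` are assumptions, not functions from any library, and the data are the six points of the example.

```python
def solve_linear(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def fit_quadratic(xs, ys):
    """Least-squares coefficients [a, b, c] of f(x) = a + b*x + c*x**2."""
    S = [sum(x ** k for x in xs) for k in range(5)]                  # S[k] = sum of x^k
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # sum of x^k * y
    A = [[S[0], S[1], S[2]],
         [S[1], S[2], S[3]],
         [S[2], S[3], S[4]]]
    return solve_linear(A, T)

xs = [0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [0, 0.25, 1.0, 2.25, 4.0, 6.25]
a, b, c = fit_quadratic(xs, ys)
print(round(a, 6), round(b, 6), round(c, 6))
```

Up to floating-point rounding, the result is a = 0, b = 0 and c = 1, since each y value in the table is the square of its x value.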

Curve fitting - Other nonlinear fits (exponential)

Q: Will a polynomial of any order necessarily fit any set of data?
A: No, lots of phenomena don't follow a polynomial form. They may be, for example,
exponential.

(1) General exponential equation $f(x) = Ce^{Ax}$

Now, taking the natural log on both sides of $y = Ce^{Ax}$, we get
$$\ln y = \ln C + Ax$$
$$Y = b + aX; \text{ where } Y = \ln y,\ X = x,\ b = \ln C \text{ and } a = A$$
This is the equation of a line; the original data in the xy-plane are mapped into the XY-plane. This is called linearization.
The data $(x, y)$ are transformed to $(x, \ln y)$.
To find the values of a and b we use the equations
$$\sum_{i=1}^{n} Y_i = a\sum_{i=1}^{n} X_i + nb \qquad (1)$$
$$\sum_{i=1}^{n} X_i Y_i = a\sum_{i=1}^{n} X_i^2 + b\sum_{i=1}^{n} X_i \qquad (2)$$
After getting the values of a and b, $A = a$ and $C = \text{antilog } b = e^{b}$.
Example: An experiment gave the following values:
x   1    5    7    9    12
y   10   15   12   15   21
Fit an exponential curve $y = Ce^{Ax}$

Solution:

Xi = xi   yi    Yi = ln yi   Xi²    Xi Yi
1         10    2.302585     1      2.302585
5         15    2.70805      25     13.54025
7         12    2.484906     49     17.39435
9         15    2.70805      81     24.37245
12        21    3.044522     144    36.53427
$\sum X_i = 34$, $\sum Y_i = 13.24811$, $\sum X_i^2 = 300$, $\sum X_i Y_i = 94.1439$

With n = 5 and Y = b + aX, the normal equations are
13.24811 = 34a + 5b
94.1439 = 300a + 34b

Solving, a = 0.05896 and b = 2.248664
A = a = 0.05896

C = antilog(2.248664) = $e^{2.248664}$ = 9.475068

Hence, the best fit curve is $y = 9.475068\, e^{0.05896x}$
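The linearization step can be sketched in Python: take the natural log of y, fit a straight line to (x, ln y), then transform the intercept back. The function name `fit_exponential` is an assumption made for this illustration, and the data are the five (x, y) pairs listed in the solution table.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = C * exp(A*x) by a least-squares line through (x, ln y); return (C, A)."""
    Y = [math.log(y) for y in ys]  # Y = ln y
    n = len(xs)
    sx, sy = sum(xs), sum(Y)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * v for x, v in zip(xs, Y))
    A = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope of the line, equal to A
    ln_c = (sy - A * sx) / n                       # intercept of the line, equal to ln C
    return math.exp(ln_c), A

xs = [1, 5, 7, 9, 12]
ys = [10, 15, 12, 15, 21]
C, A = fit_exponential(xs, ys)
print(round(C, 3), round(A, 5))
```

The slope of the fitted line is A itself, so only the intercept needs the antilog. Note that this minimizes the squared error in ln y, not in y, which is the usual consequence of linearization.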

(2) $y = bx^{a}$

Taking $\log_{10}$ on both sides

$$\log_{10} y = \log_{10} b + a\log_{10} x$$

$$Y = B + AX; \text{ where } Y = \log_{10} y,\ X = \log_{10} x,\ A = a \text{ and } B = \log_{10} b$$

$$\sum_{i=1}^{n} Y_i = nB + A\sum_{i=1}^{n} X_i \qquad (1)$$
$$\sum_{i=1}^{n} X_i Y_i = B\sum_{i=1}^{n} X_i + A\sum_{i=1}^{n} X_i^2 \qquad (2)$$

Example: An experiment gave the following values:

v (ft/min)   350   400   500   600
t (min)      61    26    7     2.6

It is known that v and t are connected by the relation $v = bt^{a}$. Find the best possible values of a and b.

v      t      Y = log v   X = log t   X²            XY
350    61     2.544068    1.78533     3.18740262    4.542001
400    26     2.60206     1.414973    2.002149575   3.681846
500    7      2.69897     0.845098    0.714190697   2.280894
600    2.6    2.778151    0.414973    0.17220288    1.152859
$\sum Y_i = 10.62325$, $\sum X_i = 4.460375$, $\sum X_i^2 = 6.075945772$, $\sum X_i Y_i = 11.6576$

Substitute in the equations

$$\sum_{i=1}^{n} Y_i = nB + A\sum_{i=1}^{n} X_i \qquad (1)$$
$$\sum_{i=1}^{n} X_i Y_i = B\sum_{i=1}^{n} X_i + A\sum_{i=1}^{n} X_i^2 \qquad (2)$$
10.62325 = 4B + 4.460375A
11.6576 = 4.460375B + 6.075945772A

On solving these equations, B = 2.845 and A = a = -0.17.

b = antilog(2.845) = 699.842
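The same linearization idea applies to the power law: fit a straight line to (log t, log v). The Python sketch below is an illustration only; the function name `fit_power` is an assumption. Because it carries full precision throughout, its value of b comes out a little above the hand calculation, which rounds B to 2.845 before taking the antilog.

```python
import math

def fit_power(ts, vs):
    """Fit v = b * t**a by a least-squares line through (log10 t, log10 v); return (a, b)."""
    X = [math.log10(t) for t in ts]
    Y = [math.log10(v) for v in vs]
    n = len(X)
    sx, sy = sum(X), sum(Y)
    sxx = sum(x * x for x in X)
    sxy = sum(x * y for x, y in zip(X, Y))
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope A = a
    B = (sy - a * sx) / n                          # intercept B = log10 b
    return a, 10 ** B

t = [61, 26, 7, 2.6]
v = [350, 400, 500, 600]
a, b = fit_power(t, v)
print(round(a, 2), round(b))  # a is about -0.17, b is about 700
```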

No ratings yet
1431364846L02.EE3121.Review of Measures of Central Tendency
7 pages
The Egyptian Culture PowerPoint
No ratings yet
The Egyptian Culture PowerPoint
29 pages
Measures of Central Tendency: Presentation By: DR Dharuv
No ratings yet
Measures of Central Tendency: Presentation By: DR Dharuv
44 pages
Bookshop Business Quick Guide: by Crack A Business Kenya
No ratings yet
Bookshop Business Quick Guide: by Crack A Business Kenya
19 pages
Bio L3 Measures of Central Tendency For Group and Ungroup Data
No ratings yet
Bio L3 Measures of Central Tendency For Group and Ungroup Data
18 pages
Lesson 3 Numerical and Descriptive Measures
No ratings yet
Lesson 3 Numerical and Descriptive Measures
16 pages
المحاضرة الثالثة
No ratings yet
المحاضرة الثالثة
16 pages
Chapter 3 Descriptive Statistics
No ratings yet
Chapter 3 Descriptive Statistics
78 pages
Ict SSS One, Two and Three
No ratings yet
Ict SSS One, Two and Three
8 pages
Central Tendency - Fall 20
No ratings yet
Central Tendency - Fall 20
38 pages
Energy Day: From The Content Group To The Climate Champions
No ratings yet
Energy Day: From The Content Group To The Climate Champions
3 pages
Republic of The Philippines City of Taguig Taguig City University Gen. Santos Avenue, Central Bicutan, Taguig City
No ratings yet
Republic of The Philippines City of Taguig Taguig City University Gen. Santos Avenue, Central Bicutan, Taguig City
7 pages
C22 P04 Statistical Averages
No ratings yet
C22 P04 Statistical Averages
41 pages
1 Measures of Central Tendency
No ratings yet
1 Measures of Central Tendency
32 pages
Liturgy of St. John (Eliz. English) - Staff Notation
100% (2)
Liturgy of St. John (Eliz. English) - Staff Notation
99 pages
2023 Palarong Pampaaralan Dance Sports Guidelines
No ratings yet
2023 Palarong Pampaaralan Dance Sports Guidelines
4 pages
Math in The Modern World Stat Lecture
No ratings yet
Math in The Modern World Stat Lecture
3 pages
Modules Week 1 8 2nd Quarter
No ratings yet
Modules Week 1 8 2nd Quarter
11 pages
English 4 - Third Quarter - Teachers Guide - ENG4 - TG - U3
No ratings yet
English 4 - Third Quarter - Teachers Guide - ENG4 - TG - U3
139 pages
DOCS
No ratings yet
DOCS
9 pages
Central Tendency: Mode, Median, and Mean
No ratings yet
Central Tendency: Mode, Median, and Mean
15 pages
3-Measure of Central Tendency
No ratings yet
3-Measure of Central Tendency
11 pages
Educ 98 - MEASURES OF CENTRAL TENDENCY
No ratings yet
Educ 98 - MEASURES OF CENTRAL TENDENCY
24 pages
Javascript - Javascript Tutorial: Javascript Tutorials Is One of The Best Quick Reference To The Javascript. in This
100% (1)
Javascript - Javascript Tutorial: Javascript Tutorials Is One of The Best Quick Reference To The Javascript. in This
39 pages
History of Fifth Philippine Republic
No ratings yet
History of Fifth Philippine Republic
5 pages
Introduction To Statistics
No ratings yet
Introduction To Statistics
18 pages
Central Tendency
No ratings yet
Central Tendency
105 pages
Measure of Locations
No ratings yet
Measure of Locations
6 pages
Finance (Pay Cell) Department
No ratings yet
Finance (Pay Cell) Department
2 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


DESCRIPTIVE STATISTICS & CORRELATION AND REGRESSION

1. Numerical facts, such as the number of people living in a particular area.

Measures of Central Tendency: There are three types

Example: Find the mean of the following:

If there is an even number of observations, the median can be found as:

Example: For the data 3, 4, 5, 6, 7, 8, the median is 5.5.
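The even-count rule can be checked with a minimal sketch. It assumes the usual convention that the median of an even number of observations is the mean of the two middle values of the sorted data; the helper name `median_of` is hypothetical and is compared against Python's standard `statistics.median`.

```python
from statistics import median


def median_of(values):
    """Return the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle observation.
        return ordered[mid]
    # Even count: average the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [3, 4, 5, 6, 7, 8]
result = median_of(data)
print(result)  # 5.5
```

Here the two middle observations are 5 and 6, so the result agrees with the value stated in the example.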

Example: Find the median of the following data:

Looking at the cumulative frequency (cf) column, the 13th observation lies between 11 and 22,

Example: Find the median of the following:

Class 0-10 10-20 20-30 30-40 40-50 50-60 60-70

Example: Find the mode of the following:

Example: Find the mode of the following:

Class 10-19 20-29 30-39 40-49 50-59

Mean ($\bar{x}$), median ($M$) and mode ($Z$) can be denoted as follows:

Measures of variability: Skewness is a measure of symmetry, or more precisely, the lack of symmetry.

Standard Deviation: Standard deviation (s.d.) is the square root of variance.

where n is the number of observations.
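The relationship between variance and standard deviation can be sketched as follows. The formula itself is not shown in this excerpt, so the sketch assumes the population form, which divides by n; the function names and the sample data are hypothetical.

```python
import math


def variance(values):
    """Population variance: mean squared deviation from the mean (divides by n)."""
    n = len(values)
    if n == 0:
        raise ValueError("variance of empty data is undefined")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def std_dev(values):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(variance(values))


# Hypothetical observations, chosen only for illustration.
observations = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(observations))  # 4.0
print(std_dev(observations))   # 2.0
```

If the sample form (dividing by n - 1) is intended instead, only the divisor in `variance` changes.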

Example: Find the variance and the standard deviation.

Class 0-10 10-20 20-30 30-40 40-50

Types of correlation: There are three types of correlation.

Methods of studying correlation: There are mainly three types of methods.

Perfectly positive    Perfectly negative

(ii) Karl Pearson’s coefficient of correlation: The several mathematical methods of

$n \sum f u v - \sum u f_u \sum v f_v$

$n \sum f d_x d_y - \left(\sum f d_x\right)\left(\sum f d_y\right)$

Properties of the coefficient of correlation:

Example: Find Pearson's correlation coefficient for the following data:

x 100 101 102 102 100 99 97 98 96 95
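A minimal sketch of Pearson's coefficient, assuming the standard raw-sums form $r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{(n\sum x^2 - (\sum x)^2)(n\sum y^2 - (\sum y)^2)}}$. The y values of this example are not shown in the excerpt, so instead of guessing them the sketch pairs the x values above with two constructed series (`y_up` and `y_down`, both hypothetical) that illustrate the perfectly positive and perfectly negative cases.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient from raw sums."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    if denominator == 0:
        raise ValueError("a constant series has no defined correlation")
    return numerator / denominator


x = [100, 101, 102, 102, 100, 99, 97, 98, 96, 95]
y_up = [2 * v + 1 for v in x]  # constructed: exact increasing linear function of x
y_down = [-v for v in x]       # constructed: exact decreasing linear function of x

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(r_up, r_down)
```

Any real data set gives a value between these two extremes, since r always lies in the interval from -1 to 1.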

It is calculated by the following formula:

where n = number of pairs.
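The rank-correlation computation can be sketched as below. The formula is not visible in this excerpt, so the sketch assumes the standard Spearman form for untied ranks, $\rho = 1 - \dfrac{6\sum d^2}{n(n^2 - 1)}$, where d is the difference between the two ranks of a pair. The function names and the marks are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (largest) to n. Assumes there are no ties."""
    if len(set(values)) != len(values):
        raise ValueError("this sketch does not handle tied values")
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman's rank correlation for untied data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical marks of five students in two subjects.
subject_a = [10, 20, 30, 40, 50]
same_order = [15, 25, 35, 45, 55]
reverse_order = [55, 45, 35, 25, 15]

print(spearman_rho(subject_a, same_order))     # 1.0
print(spearman_rho(subject_a, reverse_order))  # -1.0
```

Tied ranks need a correction term that this sketch deliberately leaves out.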

Example: Find the coefficient of rank correlation for the following data:

Properties of regression coefficient:

$r = \sqrt{b_{xy} \cdot b_{yx}} = \sqrt{3.6 \times 0.1} = 0.6$
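The arithmetic in this step can be sketched in a few lines, using the property that the correlation coefficient is the geometric mean of the two regression coefficients and carries their common sign. The function name `r_from_regression` is hypothetical.

```python
import math


def r_from_regression(b_xy, b_yx):
    """Correlation coefficient as the signed geometric mean of b_xy and b_yx."""
    product = b_xy * b_yx
    if product < 0:
        raise ValueError("regression coefficients must have the same sign")
    magnitude = math.sqrt(product)
    # r takes the sign shared by both regression coefficients.
    return -magnitude if b_xy < 0 else magnitude


r = r_from_regression(3.6, 0.1)
print(round(r, 4))  # 0.6
```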

Interpolation (connect the data dots): if the data is reliable,

A straight line is described by $f(x) = ax + b$

Is a straight line suitable for each of these cases?

Linear curve fitting (linear regression)

(2) Weight greater errors more heavily

$(y_1 - ax_1 - b)^2 + (y_2 - ax_2 - b)^2 + \dots + (y_n - ax_n - b)^2$

The error is minimum when the first-order partial derivatives are zero.

Example: Fit a straight line using the least squares method

xi 0 0.5 1 1.5 2 2.5

Now, solve the equations

Substitute the values from the table; here $n = 6$.

$a = 3.561$ and $b = 0.975$

Hence, the best fit line is $y = 3.561x + 0.975$.
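The least squares procedure for a straight line $y = ax + b$ can be sketched as follows. It solves the two normal equations $a\sum x^2 + b\sum x = \sum xy$ and $a\sum x + nb = \sum y$ in closed form. The y values of the worked example are not shown in this excerpt, so the sketch uses the x values above with hypothetical y values lying exactly on $y = 2x + 1$; it does not reproduce the coefficients quoted above. The function name `fit_line` is also hypothetical.

```python
def fit_line(xs, ys):
    """Least squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs")
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    a = (n * sxy - sx * sy) / det
    b = (sy * sxx - sx * sxy) / det
    return a, b


xs = [0, 0.5, 1, 1.5, 2, 2.5]
ys = [1, 2, 3, 4, 5, 6]  # hypothetical: generated from y = 2x + 1
a, b = fit_line(xs, ys)
print(a, b)  # 2.0 1.0
```

Replacing `ys` with the observed values from the table gives the fitted slope and intercept for the example.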

Curve fitting – Quadratic polynomial

Error - Least squares approach

Example: Fit a second-order polynomial to the following data

xi 0 0.5 1.0 1.5 2.0 2.5

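$\sum y_i = 13.75$, $\sum x_i^2 = 13.75$, $\sum x_i^3 = 28.125$, $\sum x_i^4 = 61.1875$, $\sum x_i y_i = 28.125$, $\sum x_i^2 y_i = 61.1875$

Hence, $y = x^2$ is the required equation which fits the data.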
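The quadratic fit can be sketched by building the three normal equations for $y = a + bx + cx^2$ and solving them by Gaussian elimination. The y row of the table is not shown in this excerpt; the sketch assumes $y_i = x_i^2$, which is consistent with the sums listed above. The function names are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_quadratic(xs, ys):
    """Least squares fit of y = a + b*x + c*x**2; returns [a, b, c]."""
    s = [sum(x ** p for x in xs) for p in range(5)]  # s[p] = sum of x^p, s[0] = n
    t = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in range(3)]
    matrix = [[s[0], s[1], s[2]],
              [s[1], s[2], s[3]],
              [s[2], s[3], s[4]]]
    return solve_linear_system(matrix, t)


xs = [0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [x ** 2 for x in xs]  # assumed y values, consistent with the listed sums
a, b, c = fit_quadratic(xs, ys)
print(round(a, 6), round(b, 6), round(c, 6))
```

With these inputs the constant and linear coefficients come out as zero to rounding error and the quadratic coefficient as one, matching the stated result.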

Curve fitting - Other nonlinear fits (exponential)

(1) General exponential equation: $f(x) = C e^{Ax}$

Hence, the best fit curve is $y = 9.475068\, e^{2.248664x}$.
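An exponential fit of the form $y = C e^{Ax}$ can be sketched by taking natural logarithms, $\ln y = \ln C + Ax$, and fitting a straight line to the points $(x, \ln y)$. The data of the worked example are not shown in this excerpt, so the sketch uses hypothetical points generated from $C = 2$ and $A = 0.5$; the function names are hypothetical too.

```python
import math


def fit_line(xs, ys):
    """Least squares fit of y = slope*x + intercept; returns (slope, intercept)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    return (n * sxy - sx * sy) / det, (sy * sxx - sx * sxy) / det


def fit_exponential(xs, ys):
    """Fit y = C*exp(A*x) via a line through (x, ln y); returns (C, A)."""
    if any(y <= 0 for y in ys):
        raise ValueError("log-linearisation requires positive y values")
    slope, intercept = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(intercept), slope


xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]  # hypothetical data from C=2, A=0.5
C, A = fit_exponential(xs, ys)
print(round(C, 6), round(A, 6))
```

Note that this minimises the squared error in $\ln y$ rather than in $y$, which is the usual trade-off of the linearised approach.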

Taking $\log_{10}$ on both sides

$\log_{10} y = \log_{10} b + a \log_{10} x$
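The log-log linearisation can be sketched as below, assuming the underlying model is the power curve $y = b x^a$ (the form whose base-10 logarithm gives the line above). The data are hypothetical, generated from $a = 2$ and $b = 3$, and the function names are hypothetical as well.

```python
import math


def fit_line(xs, ys):
    """Least squares fit of y = slope*x + intercept; returns (slope, intercept)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    return (n * sxy - sx * sy) / det, (sy * sxx - sx * sxy) / det


def fit_power(xs, ys):
    """Fit y = b * x**a via a line through (log10 x, log10 y); returns (a, b)."""
    if any(x <= 0 for x in xs) or any(y <= 0 for y in ys):
        raise ValueError("log-linearisation requires positive x and y values")
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    slope, intercept = fit_line(log_x, log_y)
    # slope is a; intercept is log10(b).
    return slope, 10 ** intercept


xs = [1, 2, 3, 4, 5]
ys = [3 * x ** 2 for x in xs]  # hypothetical data from a=2, b=3
a, b = fit_power(xs, ys)
print(round(a, 6), round(b, 6))
```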

Example: An experiment gave the following values:

Substitute in given equation,

On solving these equations, $B = 2.845$ and $A = a = -0.17$.
