Revised Correlation and Regression

The document discusses correlation and regression, explaining the types of correlation (positive, negative, and zero) and their significance in analyzing relationships between quantitative variables. It details the calculation of the correlation coefficient and its applications in real-life scenarios, such as predicting outcomes and analyzing data patterns. Additionally, the document covers regression analysis, including fitting regression lines and testing the significance of regression coefficients.

Correlation and Regression

MAT 3103: Computational Statistics and Probability


Chapter 11: Correlation and Regression

Correlation:
A measure of linear relationship between two quantitative variables (for example, age and
weight). Correlation is a statistical technique that can show whether and how strongly pairs of
variables are related.
Types of correlation:
a) Positive correlation, b) Negative correlation, c) No (Zero) correlation

Positive correlation:
If the values of one variable increase as the values of the other variable increase, and decrease
as the values of the other variable decrease, the two variables are positively correlated. The
points lie close to a straight line, which has a positive gradient.
Example:
Relation between training and performance of employees in a company
Relation between price and supply of a product

Negative correlation:
If the values of one variable increase as the values of the other variable decrease, and decrease
as the values of the other variable increase, the two variables are negatively correlated.
The points lie close to a straight line, which has a negative gradient.
Example:
Relation between television viewing and exam grades
Relation between price and demand of a product

No (Zero) correlation:
If a change in one variable is not associated with any change in the other variable, the two
variables have no (zero) correlation. There is no pattern to the points.
Example:
Relation between height and exam grades


Correlation coefficient:
Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be n pairs of observations of the variables X and Y
observed from n sample points. The linear relationship of X and Y is called simple correlation.
The degree of linear relationship of X and Y is estimated by a quantity r, where

$$ r = \frac{SP(xy)}{\sqrt{SS(x)\,SS(y)}} $$

with

$$ SS(x) = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad SS(y) = \sum y^2 - \frac{(\sum y)^2}{n}, \qquad SP(xy) = \sum xy - \frac{\sum x \sum y}{n}. $$

The values of r range from −1 to 1. r = −1 means a perfect negative correlation, r = 1 means
a perfect positive correlation, and r = 0 means no linear relationship between the variables.

Real-life applications of correlation analysis:


A negative correlation can be seen between education and the number of children a woman
has: the more education, the fewer children on average; the less education, the more children
on average. Such negative correlations are easy to see on a global map, as well as among
populations of a given country.
Correlation of pseudorandom binary codes is what makes GPS work, along with many radar
systems and many CDMA (code division multiple access) systems. This is why GPS is power
hungry: it takes a lot of processing to find the correct correlation, with the correct satellite
(correct code) and the correct delay.
Correlation is one of the promising methods for analyzing network-based intrusion
alerts to find significant relationships among alerts that have been triggered by multiple
intrusion detection sensors. The security administrator (SA) needs to understand and study these
alerts. They are meaningless if analyzed individually; they must be 'connected' with previous
or future alerts so that the SA can figure out the sequences of attacks that have been launched
on the network. This is important for identifying preventive measures for the future.
Once a correlation is known, we can use it to make predictions. If we know a score on
one measure, we can make a more accurate prediction of another measure that is highly related
to it. The stronger the relationship between or among variables, the more accurate the prediction.
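
The GPS application above depends on finding the delay at which a known code correlates best with the received signal. A minimal sketch of that search, assuming a noise-free signal and a made-up random ±1 code (the code length, the seed, the delay and the function names are all hypothetical, not an actual GPS code):

```python
import random

def circular_correlation(received, code, delay):
    """Sum of products between the received signal and the code shifted by `delay` chips."""
    n = len(code)
    return sum(received[i] * code[(i - delay) % n] for i in range(n))

def find_delay(received, code):
    """Try every possible delay and return the one with the highest correlation."""
    return max(range(len(code)), key=lambda d: circular_correlation(received, code, d))

rng = random.Random(0)                                   # fixed seed so the run is repeatable
code = [rng.choice((-1, 1)) for _ in range(127)]         # hypothetical pseudorandom code
true_delay = 40                                          # hypothetical delay, in chips
received = [code[(i - true_delay) % 127] for i in range(127)]  # delayed copy, no noise

estimated_delay = find_delay(received, code)
print(estimated_delay, circular_correlation(received, code, estimated_delay))
```

Every candidate delay costs one full pass over the code, which illustrates why the search is processing-intensive when it has to be repeated for each satellite code.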

Show by example, r = 1
Let x = 1, 2, 3 and y = 1, 2, 3.
$$ SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 14 - \frac{(6)^2}{3} = 2 $$
$$ SS(y) = \sum y^2 - \frac{(\sum y)^2}{n} = 14 - \frac{(6)^2}{3} = 2 $$
$$ SP(xy) = \sum xy - \frac{\sum x \sum y}{n} = 14 - \frac{6 \times 6}{3} = 2 $$
$$ r = \frac{SP(xy)}{\sqrt{SS(x)\,SS(y)}} = \frac{2}{\sqrt{2 \times 2}} = 1 $$
r = 1 indicates that X and Y are perfectly and positively correlated. It happens if both X and Y
change uniformly in the same direction. Here X increases by 1 unit and Y increases by 1 unit.
Show by example, r = −1
Let x = 1, 2, 3 and y = 9, 6, 3.
$$ SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 14 - \frac{(6)^2}{3} = 2 $$
$$ SS(y) = \sum y^2 - \frac{(\sum y)^2}{n} = 126 - \frac{(18)^2}{3} = 18 $$
$$ SP(xy) = \sum xy - \frac{\sum x \sum y}{n} = 30 - \frac{6 \times 18}{3} = -6 $$
$$ r = \frac{SP(xy)}{\sqrt{SS(x)\,SS(y)}} = \frac{-6}{\sqrt{2 \times 18}} = -1 $$
r = −1 indicates that X and Y are perfectly negatively correlated. It happens when both X and
Y change uniformly but in opposite directions. Here X increases by 1 unit and Y decreases by 3
units.
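
The two small examples can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `ss`, `sp` and `pearson_r` are illustrative choices that mirror the SS and SP quantities used in the text.

```python
import math

def ss(values):
    """Corrected sum of squares: sum(v^2) - (sum(v))^2 / n."""
    n = len(values)
    return sum(v * v for v in values) - sum(values) ** 2 / n

def sp(x, y):
    """Corrected sum of products: sum(xy) - sum(x) * sum(y) / n."""
    n = len(x)
    return sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

def pearson_r(x, y):
    """Correlation coefficient r = SP(xy) / sqrt(SS(x) * SS(y))."""
    return sp(x, y) / math.sqrt(ss(x) * ss(y))

r_pos = pearson_r([1, 2, 3], [1, 2, 3])   # y rises by 1 for each unit of x
r_neg = pearson_r([1, 2, 3], [9, 6, 3])   # y falls by 3 for each unit of x
print(r_pos, r_neg)
```

The size of the step in Y (1 unit or 3 units) does not affect the magnitude of r; only the direction of the change sets its sign.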

Problem 11.1: The following data represent individuals' ages and programming skill levels
on a scale of 1 to 100.
Age (x, in years)             25   30   35   30   32   40   45    40   36   35
Programming Skill Level (y)   75   80   85   90   95   85   100   90   85   80

a) Compute the correlation coefficient between age and programming skill level.
b) Do you think that programming skill level increases significantly with age?
Solution:
a)
x      y      xy      x²      y²
25     75     1875    625     5625
30     80     2400    900     6400
35     85     2975    1225    7225
30     90     2700    900     8100
32     95     3040    1024    9025
40     85     3400    1600    7225
45     100    4500    2025    10000
40     90     3600    1600    8100
36     85     3060    1296    7225
35     80     2800    1225    6400
Σx = 348    Σy = 865    Σxy = 30350    Σx² = 12420    Σy² = 75325

$$ SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 12420 - \frac{348^2}{10} = 309.6 $$
$$ SS(y) = \sum y^2 - \frac{(\sum y)^2}{n} = 75325 - \frac{865^2}{10} = 502.5 $$
$$ SP(xy) = \sum xy - \frac{\sum x \sum y}{n} = 30350 - \frac{348 \times 865}{10} = 248 $$
$$ r = \frac{SP(xy)}{\sqrt{SS(x)\,SS(y)}} = \frac{248}{\sqrt{309.6 \times 502.5}} = 0.63 $$
The variables X (age) and Y (programming skill level) are positively correlated.
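
The column totals and the resulting r for Problem 11.1 can be reproduced with a short script. This is a sketch under the assumption that only the Python standard library is available; the variable names are illustrative.

```python
import math

age = [25, 30, 35, 30, 32, 40, 45, 40, 36, 35]        # x, in years
skill = [75, 80, 85, 90, 95, 85, 100, 90, 85, 80]     # y, scale of 1 to 100
n = len(age)

# Column totals of the working table
sum_x, sum_y = sum(age), sum(skill)
sum_xy = sum(x * y for x, y in zip(age, skill))
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in skill)

# Corrected sums of squares and products
ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
sp_xy = sum_xy - sum_x * sum_y / n

r = sp_xy / math.sqrt(ss_x * ss_y)
print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(ss_x, 1), round(ss_y, 1), round(sp_xy, 1), round(r, 2))
```

The printed totals should agree with the last row of the table, and the second line with the hand-computed SS(x), SS(y), SP(xy) and r.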

Test of the significance of the correlation coefficient:


We perform a hypothesis test of the significance of the correlation coefficient to decide
whether the linear relationship in the sample data is strong enough to use to model the
relationship in the population.
Performing the hypothesis test:
H0: ρ = 0 against HA: ρ ≠ 0

Test statistic:
$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2} $$
Decision rule: With α = 0.05 and df = n − 2, the critical value of t is found from the t table.
We reject H0 if $|t| > t_{0.05,\,(n-2)}$.
If the test concludes that the correlation coefficient is significantly different from zero, we say
that the correlation coefficient is significant. There is a significant linear relationship between x
and y.
If the test concludes that the correlation coefficient is not significantly different from zero (it is
close to zero), we say that the correlation coefficient is not significant. There is not a significant
linear relationship between x and y.

b) We need to test
H0: ρ = 0 vs H1: ρ ≠ 0.
Test statistic:
$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.63\sqrt{10-2}}{\sqrt{1-0.63^2}} = 2.29 $$
Since $|t| = 2.29 < t_{n-2} = t_8 = 2.306$, H0 cannot be rejected.
We can conclude that the programming skill level of the investigated persons is not significantly
correlated with their age.
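
The test in part b) can be sketched in a few lines of Python. The function name `correlation_t` is an illustrative choice, and the critical value 2.306 is simply the t-table value for 8 degrees of freedom used in the text, hard-coded here rather than computed.

```python
import math

def correlation_t(r, n):
    """Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2) for H0: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

T_CRITICAL = 2.306            # t-table value for df = 8, as used in the text

t = correlation_t(0.63, 10)   # r and n from Problem 11.1
significant = abs(t) > T_CRITICAL
print(round(t, 2), significant)
```

For a different sample size, the critical value has to be looked up again for df = n − 2.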
Regression:
Regression is a method of fitting a function of the dependent variable y on the independent
variable x so that, for any value of x, the value of y can be estimated. Mathematically, the linear
regression model is given by
$$ Y = \alpha + \beta x + \epsilon, $$
where
α = the value of y when x = 0 (the intercept).
β = regression coefficient of y on x. It measures the rate of change of y for a unit change in x.
ϵ = random error. It is used in the model to account for the influences of other variables which
are not included in the model.
The problem is to fit the regression equation in such a way that the sum of squares due to error is
minimum. Let the fitted model be
$$ \hat{y} = a + bx, $$
where a is the estimate of α and b is the estimate of β. Here,
$$ b = \frac{SP(xy)}{SS(x)}, \qquad a = \bar{y} - b\bar{x}. $$
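Show, by example, b = 1
Let x = 1, 2, 3 and y = 1, 2, 3.
$$ SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 14 - \frac{(6)^2}{3} = 2 $$
$$ SP(xy) = \sum xy - \frac{\sum x \sum y}{n} = 14 - \frac{6 \times 6}{3} = 2 $$
$$ b = \frac{SP(xy)}{SS(x)} = \frac{2}{2} = 1 $$
Show, by example, b = −2
Let x = 1, 2, 3 and y = 8, 6, 4.
$$ SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 14 - \frac{(6)^2}{3} = 2 $$
$$ SP(xy) = \sum xy - \frac{\sum x \sum y}{n} = 32 - \frac{6 \times 18}{3} = -4 $$
$$ b = \frac{SP(xy)}{SS(x)} = \frac{-4}{2} = -2 $$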
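
The slope examples above can be verified with a small helper that also returns the intercept. This is a minimal sketch; `fit_line` is a hypothetical name, and the intercept is taken as the mean of y minus b times the mean of x.

```python
def fit_line(x, y):
    """Least-squares line y_hat = a + b*x, with b = SP(xy)/SS(x) and a = mean(y) - b*mean(x)."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    sp_xy = sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y) / n
    b = sp_xy / ss_x
    a = sum(y) / n - b * sum(x) / n
    return a, b

a1, b1 = fit_line([1, 2, 3], [1, 2, 3])   # first example: slope 1
a2, b2 = fit_line([1, 2, 3], [8, 6, 4])   # second example: slope -2
print(a1, b1)
print(a2, b2)
```

In the second example the intercept works out to 10, so the fitted line 10 − 2x passes exactly through all three points.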

Real-life applications of regression analysis:


To estimate the impact of CGPA on university admissions.
To estimate the impact of the amount of rainfall on the number of fruits yielded.
To predict the sale of products in the future based on past buying behavior, etc.

Problem 11.2: The following data give the battery life (y, in hours) of phones of different
ages (x, in years):
x: 3, 7, 2, 4, 0, 3, 5, 1, 6, 2
y: 9, 5, 8, 6, 10, 7, 3, 11, 6, 8
a) Fit a regression line of y on x.
b) Estimate the battery life of a phone that is 8 years old.
c) Test the significance of regression.

Solution:
a) From the data, n = 10, Σx = 33, Σy = 73, Σx² = 153, Σy² = 585 and Σxy = 201.
$$ SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 153 - \frac{(33)^2}{10} = 44.1 $$
$$ SP(xy) = \sum xy - \frac{\sum x \sum y}{n} = 201 - \frac{33 \times 73}{10} = -39.9 $$
$$ b = \frac{SP(xy)}{SS(x)} = \frac{-39.9}{44.1} = -0.905 $$
$$ a = \bar{y} - b\bar{x} = \frac{\sum y}{n} - b\,\frac{\sum x}{n} = \frac{73}{10} - (-0.905)\,\frac{33}{10} = 10.29 $$
Fitted line: $\hat{y} = a + bx = 10.29 - 0.905x$

b) If x = 8 years, then $\hat{y} = 10.29 - 7.24 = 3.05$ hours.

c) We need to test H0: β = 0 vs H1: β ≠ 0.
$$ SS(y) = \sum y^2 - \frac{(\sum y)^2}{n} = 585 - \frac{73^2}{10} = 52.1 $$
$$ s^2 = \frac{SS(y) - b\,SP(xy)}{n-2} = \frac{52.1 - (-0.905)(-39.9)}{10-2} = 2.00 $$
Test statistic:
$$ t = \frac{b}{\sqrt{s^2 / SS(x)}} = \frac{-0.905}{\sqrt{2.00 / 44.1}} = -4.25 $$
Since $|t| = 4.25 > t_{10-2} = t_8 = 2.306$, H0 is rejected. Hence, the regression is statistically
significant: battery life decreases significantly as phone age increases.
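
All three parts of Problem 11.2 can be worked through in one short script computed directly from the listed data. This is a sketch using only the standard library; the variable names are illustrative and the critical value 2.306 is the t-table value for 8 degrees of freedom, hard-coded rather than computed.

```python
import math

x = [3, 7, 2, 4, 0, 3, 5, 1, 6, 2]      # phone age, in years
y = [9, 5, 8, 6, 10, 7, 3, 11, 6, 8]    # battery life, in hours
n = len(x)

ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
ss_y = sum(v * v for v in y) - sum(y) ** 2 / n
sp_xy = sum(p * q for p, q in zip(x, y)) - sum(x) * sum(y) / n

# a) fitted line y_hat = a + b*x
b = sp_xy / ss_x
a = sum(y) / n - b * sum(x) / n

# b) predicted battery life for an 8-year-old phone
y_hat_8 = a + b * 8

# c) t test of H0: beta = 0
s2 = (ss_y - b * sp_xy) / (n - 2)        # residual variance
t = b / math.sqrt(s2 / ss_x)
T_CRITICAL = 2.306                       # t-table value for df = 8
reject_h0 = abs(t) > T_CRITICAL

print(round(b, 3), round(a, 2), round(y_hat_8, 2), round(s2, 2), round(t, 2), reject_h0)
```

Because the script carries full precision for b throughout, its values can differ in the last digit from a hand calculation that rounds b early.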

SPSS command:
1. Pearson Correlation (between two variables)
Steps in SPSS (GUI):
1. Go to the top menu and click on Analyze
2. Then click on Correlate → Bivariate...
3. In the dialog box:
o Move your two variables (e.g., StudyHours and ExamScore) into the Variables box
o Make sure Pearson is checked (it is by default)
o Also check Two-tailed under "Test of Significance"
o Optionally, check Flag significant correlations
4. Click OK

2. Simple Linear Regression (two variables)


Steps in SPSS (GUI):
1. Go to Analyze
2. Then click Regression → Linear...
3. In the dialog box:
o Move the dependent variable (e.g., ExamScore) into the Dependent box
o Move the independent variable (e.g., StudyHours) into the Independent(s) box
o Method should be Enter (default)
4. Click OK

Exercise 11
11.1 The following data are given for the amount of effort (in person-hours, y) required to
develop a software module based on its size (e.g., lines of code (LOC), x).
x y
500 20
1000 35
1500 50
2000 65
2500 80
3000 95

a. Compute correlation coefficient.


b. Do you think that the development effort increases significantly with the increase of LOC?
c. Fit a regression line of y on x .
d. Estimate the development effort when the LOC is 1800.
e. Test the significance of regression.


11.2 The following data are given for the network latency (ms, y) and packet size (KB, x).
x y
10 15
20 25
30 35
40 45
50 55
60 65

a. Compute correlation coefficient.


b. Do you think the network latency increases significantly with the packet size?
c. Fit a regression line of y on x .
d. Estimate the network latency when the packet size is 35.
e. Test the significance of regression.


11.3 The following data are given for the day temperature (in ℃) (x) of Dhaka and the
corresponding humidity (in %) (y)
x y
30 90
32 78
34 84
36 73
38 88
40 72

a. Compute correlation coefficient.


b. Do you think that humidity increases significantly with the increase of temperature?
c. Fit a regression line of y on x .
d. Estimate the humidity of a day with a temperature of 37 ℃.
e. Test the significance of regression.


11.4 The following data are given for the website load time (s) (y) and the number of users (x)
accessing it simultaneously
x y
50 1.2
100 1.5
150 1.8
200 2.1
250 2.4
300 2.7
a. Compute correlation coefficient.
b. Do you think that the website load time changes significantly with the number of users?
c. Fit a regression line of y on x .
d. Estimate the website load time when the number of concurrent users is 75.
e. Test the significance of regression.
