
DEPARTMENT OF OPERATIONAL RESEARCH
UNIVERSITY OF DELHI
Course: MSc (Applied Operational Research)

PROJECT WORK ON AIR QUALITY

SEM-2
ROLL NO.: 61652
SUBMITTED TO: MR. KAUSHAL KUMAR
SUBMITTED BY: MONIKA SIDHU
AIM OF STUDY
To study the dependence of air quality on various factors.

Dependent variable:
1. CO(GT)

Independent variables:
PT08.S1(CO), NMHC(GT), C6H6(GT), PT08.S2(NMHC), NOx(GT), PT08.S3(NOx), NO2(GT), PT08.S4(NO2), PT08.S5(O3), T, RH, AH
OBJECTIVE
• Our objective in this project is to establish a linear relationship between the gases present in the air and the CO(GT) level provided in the air quality data.
DATA

The dataset contains 119 observations on 12 independent variables.
Data link: https://archive.ics.uci.edu/ml/datasets/Air+Quality
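
As an illustration, the data can be loaded and cleaned in Python (a minimal sketch, not the procedure used in the original SPSS analysis; the file name, separator, decimal mark and the -200 missing-value code follow the UCI description of this dataset, and the column subset matches the variable list above, while the original project's 119-observation subset is not reproduced here):

```python
import pandas as pd
import numpy as np

# Load the UCI Air Quality file (semicolon-separated, comma as the decimal mark).
df = pd.read_csv("AirQualityUCI.csv", sep=";", decimal=",")

cols = ["CO(GT)", "PT08.S1(CO)", "NMHC(GT)", "C6H6(GT)", "PT08.S2(NMHC)",
        "NOx(GT)", "PT08.S3(NOx)", "NO2(GT)", "PT08.S4(NO2)", "PT08.S5(O3)",
        "T", "RH", "AH"]

# The dataset codes missing readings as -200; treat them as NaN and drop such rows.
data = df[cols].replace(-200, np.nan).dropna()

y = data["CO(GT)"]   # dependent variable
X = data[cols[1:]]   # the 12 independent variables
print(X.shape)
```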
ASSUMPTIONS
• The regression model is linear in terms of its parameters.
• Xi and Ui are uncorrelated.
• E(Ui) = 0 and V(Ui) = σ² for all i = 1, 2, 3, ..., n.
• There is no relationship between any of the independent variables, i.e. no multicollinearity.
• The error variance is constant for all i = 1, 2, 3, ..., n, i.e. no heteroscedasticity.
• The errors are uncorrelated, i.e. no autocorrelation.
• The functional form of the model is correct, i.e. no specification bias.
• The errors are normally distributed as N(0, σ²).
OUR INITIAL REGRESSION MODEL IS AS FOLLOWS:

CO(GT) = B0 + B1*PT08.S1(CO) + B2*NMHC(GT) + B3*C6H6(GT) + B4*PT08.S2(NMHC) + B5*NOx(GT) + B6*PT08.S3(NOx) + B7*NO2(GT) + B8*PT08.S4(NO2) + B9*PT08.S5(O3) + B10*T + B11*RH + B12*AH
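
A minimal sketch of fitting this model in Python with statsmodels (the project itself used SPSS, so this is only an illustrative equivalent; it assumes the `X` and `y` objects from the data-loading sketch above):

```python
import statsmodels.api as sm

# Add the intercept term B0 and fit the ordinary least squares model.
X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()

# Coefficient estimates, standard errors, t statistics and p-values,
# analogous to the SPSS "Coefficients" table below.
print(model.summary())
```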
Coefficients (a)
(Unstandardized coefficients: B, Std. Error; standardized coefficients: Beta)

Model 1        B          Std. Error    Beta      t         Sig.
(Constant)    -90.048     363.039                 -.248     .805
PT08S1           .147        .071        .743     2.068     .041
NMHC             .341        .098        .896     3.488     .001
AH             79.966     221.387        .168      .361     .719
RH             -3.078       2.649       -.910    -1.162     .248
T              -8.377       8.172       -.844    -1.025     .308
PT08S5           -.053       .051       -.439    -1.034     .303
PT08S4            .103       .169        .640      .611     .543
NO2              -.199       .191       -.335    -1.041     .300
PT08S3           -.031       .090       -.200     -.344     .731
PT08S2            .188       .260       1.047      .723     .471
Nox               .142       .173        .380      .823     .412
C6H6           -18.935      9.316      -2.921    -2.033     .045

Our fitted regression equation is

CO(GT) = -90.048 + 0.1468*PT08.S1(CO) + 0.34061*NMHC(GT) - 18.93544*C6H6(GT) + 0.18880*PT08.S2(NMHC) + 0.14231*NOx(GT) - 0.3183*PT08.S3(NOx) - 0.19951*NO2(GT) + 0.10310*PT08.S4(NO2) - 0.05382*PT08.S5(O3) - 8.37796*T - 3.07821*RH + 79.966*AH
Model Significance

Variables Entered/Removed (a)

Model 1
  Variables Entered: C6H6, AH, RH, NO2, PT08S3, NMHC, PT08S5, PT08S1, Nox, T, PT08S4, PT08S2 (b)
  Variables Removed: (none)
  Method: Enter

a. Dependent Variable: CO
b. All requested variables entered.
Histogram of Residuals

The histogram of the residuals shows a bell-shaped curve, which implies that the residuals follow a normal distribution and that our normality assumption is valid.
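
A hedged sketch of the corresponding check in Python (it assumes the fitted `model` from the earlier statsmodels sketch; the Shapiro-Wilk test is added here as one common formal complement to the histogram, not the check used in the original SPSS analysis):

```python
import matplotlib.pyplot as plt
from scipy import stats

residuals = model.resid

# Visual check: the histogram of residuals should look roughly bell shaped.
plt.hist(residuals, bins=20, edgecolor="black")
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.title("Histogram of Residuals")
plt.show()

# Formal check: Shapiro-Wilk test of normality (H0: residuals are normal).
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk W = {stat:.3f}, p-value = {p_value:.3f}")
```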
Model Summary (b)

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .523a   .274       .191                40.0512

a. Predictors: (Constant), C6H6, AH, RH, NO2, PT08S3, NMHC, PT08S5, PT08S1, Nox, T, PT08S4, PT08S2
b. Dependent Variable: CO

ANOVA (a)

Model 1       Sum of Squares    df     Mean Square    F       Sig.
Regression        64026.040      12      5335.503     3.326   .000b
Residual         170034.083     106      1604.095
Total            234060.123     118


INTERPRETATION

Multiple R: This is the correlation coefficient. It tells us how strong the linear relationship is; here it is the correlation between the observed and predicted values. In our result the multiple R value is .523, which indicates a moderate positive linear relationship between the dependent variable and the independent variables.

R squared: This is R², the coefficient of determination. It gives the proportion of the variance in Y that is explained by the regression model. In our model, .274 means that 27.4% of the variation of the y-values around their mean is explained by the x-values. In other words, 27.4% of the values fit the model. It is a statistical measure of how close the data are to the fitted regression line.

Adjusted R square: The adjusted R square adjusts for the number of terms in the model. It is a modified version of R² that accounts for the number of predictors: it increases only if a new term improves the model more than would be expected by chance, and it decreases when a predictor improves the model by less than expected by chance. Its value here is .191.
F value: The F-value for the model can be seen from the ANOVA table: the calculated F = 3.326. The tabulated critical value at the 5% level of significance is F(0.05; 12, 106) = 1.8792.

Clearly, F(calc) > F(tab). This can also be seen from the p-value in the table, .000, which is < 0.05; hence we reject H0 and conclude that our model is significant.
ANOVA (the analysis of variance technique) is used to test the significance of the model. The table reports:
1. Sum of Squares (SS).
2. Regression Mean Square (MS) = Regression SS / Regression degrees of freedom.
3. Residual Mean Square (MS) = mean squared error = Residual SS / Residual degrees of freedom.
4. F: the overall F test of the null hypothesis (Regression MS / Residual MS).
5. Significance F: the p-value associated with the F statistic.

This table indicates that the regression model predicts the dependent variable significantly well. Here p < 0.05, which indicates that, overall, the regression model statistically significantly predicts the outcome variable (i.e., it is a good fit for the data).
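
For illustration, the overall F test can be reproduced from the fitted model (a sketch assuming the `model` object from the statsmodels fit above; scipy's F distribution supplies the critical value):

```python
from scipy.stats import f

# Overall F statistic and its p-value from the fitted model.
print("F =", round(model.fvalue, 3), "p-value =", round(model.f_pvalue, 4))

# Critical value F(0.05; k, n-k-1); here k = 12 regressors and n - k - 1 = 106.
f_crit = f.ppf(0.95, dfn=int(model.df_model), dfd=int(model.df_resid))
print("Critical F at 5% =", round(f_crit, 3))

# The model is significant if F exceeds the critical value (equivalently, p < 0.05).
print("Reject H0:", model.fvalue > f_crit)
```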
Residuals Statistics (a)

                         Minimum      Maximum    Mean      Std. Deviation   N
Predicted Value          -177.092      33.919    -7.629    23.2936          119
Residual                 -174.1841     56.4026    .0000    37.9601          119
Std. Predicted Value       -7.275       1.784     .000      1.000           119
Std. Residual              -4.349       1.408     .000       .948           119
Our model explains only 19.1% of the variance (adjusted R²), so not all the data points lie on the fitted regression line. The differences between the regression line and the observed points are the disturbance terms (errors).
Heteroscedasticity

When the error terms for different sample observations have the same variance, the situation is called homoscedasticity; the opposite situation is called heteroscedasticity. To detect whether heteroscedasticity is present in our model we use Spearman's rank correlation.

In this method we first calculate Spearman's rank correlation coefficient between the predicted dependent variable (Ŷ) and the absolute residuals (|ε̂|):

rs = 1 − 6 Σ di² / (n(n² − 1)),

where di is the difference between the rank of Ŷi and the rank of |ε̂i|.

We set up the hypotheses as follows:
Null hypothesis H0: ρ = 0 (absence of heteroscedasticity)
Alternative hypothesis H1: ρ ≠ 0 (presence of heteroscedasticity)

where ρ is the population rank correlation between the predicted dependent variable (Ŷ) and the absolute residuals (|ε̂|).
Detection of Heteroscedasticity

We use Spearman's rank correlation test to check for the presence of heteroscedasticity. The results are shown below.
Spearman's rho (N = 119)

                                              Unstandardized      Unstandardized
                                              Predicted Value     Residual
Unstandardized     Correlation Coefficient         1.000              -.771**
Predicted Value    Sig. (2-tailed)                     .               .000
Unstandardized     Correlation Coefficient         -.771**             1.000
Residual           Sig. (2-tailed)                  .000                   .
Since the p-value (.000) < 0.05, we reject the null hypothesis and conclude that heteroscedasticity is present in our model.
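
A minimal sketch of the same test with scipy (assuming the fitted `model` from the earlier sketch; `spearmanr` returns the rank correlation between the fitted values and the absolute residuals together with its two-sided p-value):

```python
import numpy as np
from scipy.stats import spearmanr

fitted = model.fittedvalues          # predicted dependent variable (Y-hat)
abs_resid = np.abs(model.resid)      # absolute residuals |e|

rho, p_value = spearmanr(fitted, abs_resid)
print(f"Spearman's rho = {rho:.3f}, p-value = {p_value:.4f}")

# A small p-value (< 0.05) leads us to reject H0 and conclude heteroscedasticity is present.
```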
Removal of Heteroscedasticity

For the removal of heteroscedasticity we can transform our regression model under various assumptions, i.e.:
1. If the variance of the residuals is proportional to the square of an explanatory variable;
2. Or if the variance of the residuals is proportional to an explanatory variable;
3. If the variance of the residuals is proportional to the square of the expected value of the dependent variable;
4. Or we can transform our model by taking the log on both sides.
Two of these remedies are sketched below.
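
Illustrative sketches of remedies 3 and 4 under stated assumptions (the log transform assumes the dependent variable is strictly positive after cleaning, and the weighted least squares weights 1/Ŷ² correspond to assumption 3; neither is necessarily the transformation applied in the original project):

```python
import numpy as np
import statsmodels.api as sm

# Remedy 4: log-transform the dependent variable (requires y > 0).
log_model = sm.OLS(np.log(y), sm.add_constant(X)).fit()

# Remedy 3: weighted least squares with weights 1 / Y-hat^2,
# assuming Var(u_i) is proportional to the square of the expected value of Y.
weights = 1.0 / (model.fittedvalues ** 2)
wls_model = sm.WLS(y, sm.add_constant(X), weights=weights).fit()

print(wls_model.summary())
```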
AUTOCORRELATION

One of the basic assumptions in the linear regression model is that the random error components, or disturbances, are identically and independently distributed. So in the model y = Xβ + u, it is assumed that the correlation between successive disturbances is zero.

We use the Durbin-Watson test to detect the presence of autocorrelation.
Detection of Autocorrelation

Model Summary (b)

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .523a   .274       .191                40.0512                      2.236

a. Predictors: (Constant), C6H6, AH, RH, NO2, PT08S3, NMHC, PT08S5, PT08S1, Nox, T, PT08S4, PT08S2
b. Dependent Variable: CO
To test for positive autocorrelation at significance α, the test statistic d is compared to
lower and upper critical values (dL,α and dU,α):

• If d < dL,α, there is statistical evidence that the error terms are positively
autocorrelated.
• if d > dU,α, there is no statistical evidence that the error terms are positively
autocorrelated.
• If dL,α < d < dU,α, the test is inconclusive.
Positive serial correlation is serial correlation in which a positive error for one
observation increases the chances of a positive error for another observation.

To test for negative autocorrelation at significance α, the test statistic (4 − d) is


compared to lower and upper critical values (dL,α and dU,α):

• If (4 − d) < dL,α, there is statistical evidence that the error terms are negatively
autocorrelated.
• If (4 − d) > dU,α, there is no statistical evidence that the error terms are negatively
autocorrelated.
• If dL,α < (4 − d) < dU,α, the test is inconclusive.
INTERPRETATION

The value of d under the Durbin-Watson test is d = 2.236.
We take k = number of independent variables + 1; therefore k = 12 + 1 = 13.
Here n = number of observations in our data; therefore n = 119.

Using the significance table of critical values, we find dL = 1.48579 and dU = 1.92641.

Here d > dU, so there is no statistical evidence that the error terms are positively autocorrelated.

Since our Durbin-Watson statistic is greater than dU, we do not reject the null hypothesis and conclude that there is no evidence of autocorrelation.
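
The Durbin-Watson statistic itself can be computed directly from the residuals (a sketch assuming the fitted `model` from the earlier sketches; the critical values dL and dU still have to be read from a Durbin-Watson table, as done above):

```python
from statsmodels.stats.stattools import durbin_watson

d = durbin_watson(model.resid)
print(f"Durbin-Watson d = {d:.3f}")

# Rough reading: d close to 2 suggests no autocorrelation, d well below 2 suggests
# positive and d well above 2 suggests negative autocorrelation; the formal decision
# compares d (or 4 - d) with dL and dU from the table.
```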
Removal of Autocorrelation

Autocorrelation can be removed by the Cochrane-Orcutt iterative method, which adjusts a linear model for serial correlation in the error term.
REMOVAL:
Here we make use of the Cochrane-Orcutt iterative method.
Consider the model

Yt = b0 + b1*Xt1 + b2*Xt2 + … + bk*Xtk + Ut        (1)

where Ut is autocorrelated under an AR(1) scheme and is given by

Ut = ρ*U(t−1) + Vt

Rewriting eqn (1) for period (t−1):

Y(t−1) = b0 + b1*X(t−1)1 + b2*X(t−1)2 + … + bk*X(t−1)k + U(t−1)        (2)

Multiplying eqn (2) by ρ and subtracting it from (1), i.e. (3) = (1) − ρ*(2):

Yt − ρ*Y(t−1) = b0(1 − ρ) + b1(Xt1 − ρ*X(t−1)1) + b2(Xt2 − ρ*X(t−1)2) + … + bk(Xtk − ρ*X(t−1)k) + (Ut − ρ*U(t−1))        (3)

Hence, this transformed model is free from autocorrelation.
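
statsmodels does not expose a function literally named Cochrane-Orcutt, but its GLSAR class with iterative fitting performs an equivalent AR(1) feasible-GLS procedure; a minimal sketch (assumes `y` and `X` from the earlier sketches):

```python
import statsmodels.api as sm

# GLSAR with an AR(1) error term; iterative_fit re-estimates rho and the
# regression coefficients in turn, in the spirit of the Cochrane-Orcutt procedure.
glsar_model = sm.GLSAR(y, sm.add_constant(X), rho=1)
results = glsar_model.iterative_fit(maxiter=10)

print("Estimated rho:", glsar_model.rho)
print(results.summary())
```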


Multicollinearity

Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy.
Consequences of Multicollinearity

In cases of near or high multicollinearity, one is likely to encounter the following consequences:
1. Although BLUE, the OLS estimators have large variances and covariances, making precise estimation difficult.
2. The confidence intervals tend to be much wider, leading to the acceptance of the "zero null hypothesis" (i.e., that the true population coefficient is zero) more readily.
Detection of Multicollinearity

Variance Inflation Factor (VIF):
It is one factor by which multicollinearity is detected. It signifies how much the variance of a coefficient is inflated due to the presence of multicollinearity, and if VIF > 2.5 multicollinearity is considered present.

In our analysis we started by selecting the best regression model using a sequential (step-wise) approach. This removed the variables exhibiting multicollinearity, i.e. the variables with VIF > 2.5, keeping only the independent variables whose VIF value is less than 2.5. Hence, our best fitted model was free from multicollinearity.
Coefficients (a) – Collinearity Statistics

Model 1    Tolerance   VIF
PT08S1     .053        18.838
NMHC       .104        9.632
AH         .032        31.510
RH         .011        89.515
T          .010        99.024
PT08S5     .038        26.290
PT08S4     .006        160.094
NO2        .066        15.110
PT08S3     .020        49.340
PT08S2     .003        305.819
Nox        .032        31.043
C6H6       .003        301.351

From the table, it can be observed that the VIF values for some variables of our model are greater than 2.5.
• This suggests multicollinearity is present in the data between these variables.
• Now, to see the extent of correlation between the variables, we will show the correlation matrix.
We can see from the correlation matrix
that the variables are highly correlated
with each other.
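
A sketch of computing these VIF values and the correlation matrix in Python (assumes the `X` design matrix from the data-loading sketch; `variance_inflation_factor` regresses each predictor on all the others):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Include the constant so that each VIF matches the usual regression setting.
X_const = sm.add_constant(X)

vif = pd.DataFrame({
    "variable": X_const.columns[1:],
    "VIF": [variance_inflation_factor(X_const.values, i)
            for i in range(1, X_const.shape[1])],
})
print(vif)

# Correlation matrix of the predictors, to see the extent of the correlation.
print(X.corr().round(2))
```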
Removal of Multicollinearity
• Remove one of the highly correlated independent variables from the model. If you have two or more factors with a high VIF, remove one of them.
• Principal Component Analysis (PCA): it cuts the number of interdependent variables down to a smaller set of uncorrelated components. Instead of using the highly correlated variables, use in the model the components that have an eigenvalue greater than 1.
• Run PROC VARCLUS and choose the variable that has the minimum (1 − R²) ratio within a cluster.
• Ridge Regression: a technique for analysing multiple regression data that suffer from multicollinearity.
Two of these approaches are sketched below.
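
Illustrative sketches of the ridge and PCA remedies with scikit-learn (hypothetical settings: the ridge penalty alpha=1.0 and the choice to standardise the predictors are assumptions for the sketch, not values from the original project):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Ridge regression: shrinks the coefficients to stabilise them under multicollinearity.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X, y)

# PCA: keep only the components with eigenvalue (explained variance) greater than 1
# and use them as uncorrelated regressors in place of the original variables.
pca = make_pipeline(StandardScaler(), PCA())
scores = pca.fit_transform(X)
eigenvalues = pca.named_steps["pca"].explained_variance_
keep = eigenvalues > 1
print("Components retained:", keep.sum())
```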
Conclusion

The final fitted regression equation of our model to determine the CO(GT) level is

CO(GT) = -90.048 + 0.1468*PT08.S1(CO) + 0.34061*NMHC(GT) - 18.93544*C6H6(GT) + 0.18880*PT08.S2(NMHC) + 0.14231*NOx(GT) - 0.3183*PT08.S3(NOx) - 0.19951*NO2(GT) + 0.10310*PT08.S4(NO2) - 0.05382*PT08.S5(O3) - 8.37796*T - 3.07821*RH + 79.966*AH

ECONOMETRICS Regression Analysis

From the previous tests, we observed that:
• Heteroscedasticity is present in our data.
• Multicollinearity exists among the variables.
• There is no evidence of autocorrelation.
Self-made project
Software used:
• SPSS

Thank You
