Lecture - 34
Diagnostics to Improve Linear Model Fit
So, in this case you should not attempt to fit a linear model to the
data. Instead, you should ask the experimenter to go and collect data at
different values of x, then come back and check whether a linear model
is valid. Unfortunately, when we actually apply linear regression to
these data sets and estimate the slope and the intercept parameters, we
find that in all 4 cases we get the same intercept value of 3, and we
also get the same slope parameter, which is 0.5, in all 4 cases.
So, whichever of these 4 data sets you fit the regression model to,
you will get the same estimates of the intercept and slope.
Furthermore, you get the same standard errors of fit, which are 1.12 for
the intercept and 0.1 for the slope. And if you construct a confidence
interval for the slope parameter, you may end up accepting the slope
in all 4 cases, and you may incorrectly conclude that the linear model
is adequate. If you compute the R-squared value, it will be the same for
all 4 data sets. You can run a hypothesis test of whether a reduced
model is acceptable compared to a model with the slope parameter;
again, you will reject the null hypothesis using the F statistic. So,
for all 4 cases you will get the same identical result, namely that a
linear model is a good fit. Clearly it is not so.
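As a rough sketch, one can verify this numerically in Python; the x and y values below are the standard published Anscombe quartet (not taken from the lecture slides). All four fits return essentially the same slope, intercept, and R-squared:

```python
# Fit ordinary least squares to each of the four Anscombe data sets and
# show that the slope, intercept, and R-squared are (nearly) identical.
import numpy as np

x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4   = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
ys = {
    1: np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for k, y in ys.items():
    x = x4 if k == 4 else x123
    slope, intercept = np.polyfit(x, y, 1)      # least-squares line
    yhat = intercept + slope * x
    r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)
    print(f"data set {k}: slope={slope:.3f}, intercept={intercept:.3f}, R^2={r2:.3f}")
```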
One can of course do scatter plots and try to judge it in this
particular case, because it is a univariate example. But when you have
many independent variables, you have to examine several such
scatter plots, and that may not be very easy.
So, if you assume there are 100 independent variables, you have to
examine 100 such plots of y versus x, and it may not be possible for
you to draw a visual conclusion from them. So, we will use other kinds
of plots, called residual plots, which enable us to do this whether it
is a univariate regression problem or a multivariate regression problem.
We will see what these are.
So, the main question that we are asking now is whether a
linear model is adequate. We have seen some measures, but
they are not sufficient, so we will use additional tools. When we did
the linear regression, we made additional assumptions, although
they were not stated explicitly: we assumed that the errors that
corrupt the dependent variable are normally distributed and have
identical variance. Only under these assumptions can you use the
least squares method to perform linear regression; that is, only then
can you prove that the least squares method has some nice properties.
But the residuals which result from the fit will not have the same
variance for all data points. In fact, one can show that the variance of
the ith residual is σ²(1 - p_ii), where σ² represents the variance of
the error in the measured value of the dependent variable, and, for
simple linear regression, p_ii = 1/n + (x_i - x̄)²/Σ_j (x_j - x̄)².
Notice that the numerator of the second term depends on the ith sample;
therefore p_ii depends on the ith sample and varies from sample to
sample.
So, the variance of the residuals will not be identical for all
samples; it is given by this quantity. We can also show that the
residuals are not independent, even though we assume that the errors
corrupting the measurements are all independent. The residuals have a
nonzero covariance, which can be shown to be cov(e_i, e_j) = -σ² p_ij,
where p_ij = 1/n + (x_i - x̄)(x_j - x̄)/Σ_k (x_k - x̄)². The reason the
variances of the residuals are not identical, and the residuals are
correlated, is that the predicted value ŷ_i is a result of the
regression and depends on all the measurements; it does not depend only
on the ith measurement. Because the predicted value is a function of all
the observations, it introduces correlations between the different
residuals and also imparts different variances to different residuals.
And so, having derived this, notice that even if we do not have a priori
knowledge of σ², the variance of the error in the measurements, we can
estimate this quantity.
We have already seen this in the previous lecture: we can replace σ²
by SSE/(n - 2), which is an estimate of σ², and substitute it to get an
estimated variance for each residual.
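As a minimal sketch of these formulas in Python, assuming simple linear regression and using Anscombe data set 1 as placeholder data, one can compute the leverages p_ii from the hat matrix, estimate σ² by SSE/(n - 2), and scale each residual by its own estimated standard deviation:

```python
# Residual variances sigma^2*(1 - p_ii) for simple linear regression,
# with sigma^2 estimated by SSE/(n - 2).
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

n = len(x)
X = np.column_stack([np.ones(n), x])          # design matrix [1, x]
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix; p_ii = H[i, i]
p_ii = np.diag(H)

beta = np.linalg.lstsq(X, y, rcond=None)[0]   # intercept and slope
e = y - X @ beta                              # residuals
sigma2_hat = np.sum(e ** 2) / (n - 2)         # SSE/(n - 2), estimate of sigma^2

var_e = sigma2_hat * (1 - p_ii)               # estimated residual variances
standardized = e / np.sqrt(var_e)             # residuals on a common scale
print(var_e)
print(standardized)
```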
So, we will plot what we call a residual plot: we plot the residuals
against the predicted or fitted value of the dependent variable.
Remember, there is only one dependent variable even if there are
multiple independent variables, so we can always plot the residuals
against the predicted value of the dependent variable, which you obtain
after the regression.
So, this is called the residual plot, the plot of the residuals versus
the fitted or predicted values, and it is very useful in testing the
validity of the linear model, in checking whether the assumption of
normally distributed errors holds, and in checking whether the variances
of all the errors are identical or not. The case where the variance of
the error is the same for all measured values is called the
homoscedastic case; the case where the variances of the errors in
different measured values are non-identical is called the
heteroscedastic case, or heteroscedastic errors. So, let us see how the
plot looks for each of these cases, starting from the sketch below.
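A minimal sketch of such a residual plot in Python, again using Anscombe data set 1 as placeholder data:

```python
# Residual plot: residuals against the fitted values of the dependent
# variable, with a zero reference line.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

slope, intercept = np.polyfit(x, y, 1)
yhat = intercept + slope * x                 # fitted values
resid = y - yhat                             # residuals

plt.scatter(yhat, resid)
plt.axhline(0.0, linestyle="--")             # zero reference line
plt.xlabel("fitted value")
plt.ylabel("residual")
plt.show()
```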
(Refer Slide Time: 11:28)
Now, let us plot the residual plots for the 4 data sets provided by
Anscombe. Notice that we have fitted the regression model and computed
all the measures, the R-squared value, the confidence intervals, and so
on. They all turned out to be identical and gave us no clue whether the
linear model is good for each of the 4 data sets or not. Basically, they
would all say that the linear model is adequate.
But when we do the residual plot, here I have plotted the residuals
against the independent variable. Because it is a univariate case this
is fine, but technically you should plot the residuals against the
predicted value of the dependent variable. Since we presume that the
predicted value is linearly dependent on x, in this case it does not
matter; the pattern will look the same, as you can verify for yourself
by plotting the residuals against the predicted values of the dependent
variable. You will get this kind of pattern of the residuals for the 4
data sets. The first data set exhibits no pattern: the residuals seem to
be randomly distributed, in this case between -3 and +3. Whereas for the
second data set there is a distinct pattern: the residuals look
quadratic, like a parabola. Therefore, there exists a pattern in data
set 2. For the third data set, basically you can say that there is no
pattern, except that a small bias, a more or less linear trend caused by
the slope, is left in the residuals. Data set 4, as we saw before, is a
poorly designed experimental data set: all the y values were obtained at
a single x value, and that is what the residuals are also showing. The
10 data points obtained at the same x value show different residuals,
and the one remaining point at a different x value shows a single
residual.
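As a sketch, the four residual plots can be reproduced in Python, reusing the x123, x4, and ys arrays defined in the earlier Anscombe snippet (the residuals are plotted against x here, as in the lecture's univariate case):

```python
# Residual plots for the four Anscombe data sets, one panel per set;
# assumes x123, x4, and ys from the earlier snippet are in scope.
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2)
for ax, (k, y) in zip(axes.ravel(), ys.items()):
    x = x4 if k == 4 else x123
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    ax.scatter(x, resid)                     # residual vs x (univariate case)
    ax.axhline(0.0, linestyle="--")
    ax.set_title(f"data set {k}")
plt.show()
```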
So, from this you cannot judge anything. All you can say is that the
experimental data set is very poorly designed, and you need to go back
to the experimenter and ask him to provide a different data set. Now,
based on this, we can safely conclude that for data set 1 a linear model
is clearly adequate: all the previous measures were satisfied, and the
residual plot also shows a random pattern, or rather no pattern, so the
linear model is adequate. Whereas for data set 2, by looking at the
residual plot we can conclude that a linear model is inadequate and
should not be used for this data set.
For the third data set, however, we know there is one data point
lying far away, and perhaps that is the one causing the slight linear
pattern here. If we remove this outlier and retry, this problem may get
resolved, and a linear model may be adequate for data set 3. For data
set 4, again there is a distinct pattern, and therefore we can conclude
that a linear model should not be used.
So, in a Q-Q plot we check whether the residual quantiles match the
quantiles of a normal distribution. If we find significant deviation of
these quantiles from the 45-degree line, then the normal distribution
assumption is incorrect, which means we have to modify the least squares
objective function to accommodate this. It may or may not have a
significant effect on the regression coefficients, but there are ways of
dealing with it which I will not go into.
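A minimal sketch of such a Q-Q plot, assuming the resid array from the earlier residual-plot snippet is in scope:

```python
# Normal Q-Q plot of the residuals: points close to the 45-degree line
# support the normality assumption on the errors.
import matplotlib.pyplot as plt
from scipy import stats

stats.probplot(resid, dist="norm", plot=plt)   # sample vs normal quantiles
plt.show()
```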
So, we can remove all these 4 outliers at the same time if you want;
as I said, that is not a good idea. Perhaps we should remove only sample
number 35, which is furthest from the boundary, that is, the residual
with the largest magnitude, and then redo the analysis. Because of lack
of time, I have just removed all 4 at the same time and then done the
analysis. My suggestion is that you do it one at a time and then repeat
this exercise for yourself. Here we have removed these 4 samples, 4, 13,
35, and one more, and rerun the regression analysis, along the lines
sketched below.
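As a rough sketch of the drop-and-refit step in Python: the lecture's actual data set (coupon rates and prices) is not reproduced here, so synthetic placeholder data is used, and the positions of the injected outliers are purely illustrative; only the mechanics of removing flagged samples and refitting are the point.

```python
# Remove flagged outliers and refit the least-squares line.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 40)                           # placeholder regressor
y = 3 + 0.5 * x + rng.normal(0, 0.3, size=40)        # placeholder response
y[[3, 12, 34]] += 5.0                                # inject outliers at samples 4, 13, 35

outliers_1based = [4, 13, 35]                        # samples flagged by the residual plot
mask = np.ones(len(x), dtype=bool)
mask[np.array(outliers_1based) - 1] = False          # drop them (0-based indexing)

slope_all, icept_all = np.polyfit(x, y, 1)           # fit retaining all samples
slope_cln, icept_cln = np.polyfit(x[mask], y[mask], 1)  # refit after removal
print("all samples:    ", slope_all, icept_all)
print("outliers removed:", slope_cln, icept_cln)
```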
(Refer Slide Time: 27:02)
Again, you can see the regression analysis retaining all the samples
shown on the right-hand side of the plot, with the corresponding
intercept, the coupon rate (the slope), the F test statistic, the
R-squared value, and so on. Once we remove these 4 samples, which were
outliers, and rerun it, the fit seems to be much better. It is also seen
on the left-hand side that the fit is much better: the R-squared value
has gone up to 0.99 from 0.775. Again, the tests on the intercept and
the coupon rate, or slope, show that they are significant, and
therefore you should not assume that they are close to 0.
It also shows that the F statistic has a low p value, which means you
reject the null hypothesis that the reduced model is adequate; the
linear model with the slope parameter is a much better fit. So, all of
these indicators seem to show that a linear model is adequate and the
fit seems to be good. But we should do a residual plot again with this
data, and if it shows no pattern we can stop there: we can say that
there are no outliers, and therefore we can conclude that the regression
model we have fitted for this data is a reasonably good one.
In the next class we will see how to extend all of these ideas to
multiple linear regression, which involves many independent variables
and one dependent variable.
Thank you.