4.Introduction to Multiple Linear Regression_QAns
Department Of Computer Engineering
Class: TE Question Answer Sub: QA
Multiple linear regression can be used to estimate:
• How strong the relationship is between two or more independent variables and one dependent variable (e.g., how rainfall, temperature, and amount of fertilizer added affect crop growth).
• The value of the dependent variable at certain values of the independent variables (e.g., the expected yield of a crop at given levels of rainfall, temperature, and fertilizer addition).
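As a toy illustration of the second use, here is a sketch of predicting expected crop yield at given levels of rainfall, temperature, and fertilizer; the coefficient values below are invented for illustration, not estimates from real data.

```python
# Hypothetical fitted equation:
#   yield = b0 + b_rain*rainfall + b_temp*temperature + b_fert*fertilizer
# All coefficient values are invented for illustration only.
def predict_yield(rainfall, temperature, fertilizer,
                  b0=2.0, b_rain=0.05, b_temp=0.3, b_fert=1.2):
    return b0 + b_rain * rainfall + b_temp * temperature + b_fert * fertilizer

# Expected yield at 100 mm rainfall, 25 degrees, 5 kg fertilizer:
print(predict_yield(100, 25, 5))  # 2.0 + 5.0 + 7.5 + 6.0 = 20.5
```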
Key Concepts:
• Dependent Variable: the outcome variable that the regression predicts.
• Independent Variables: the predictor variables used to explain the dependent variable.
To find the best-fit line for each independent variable, multiple linear regression calculates three
things:
• The regression coefficients that lead to the smallest overall model error.
• The t statistic of the overall model.
• The associated p value (how likely it is that the t statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables were true).
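These three quantities can be sketched numerically. The following is a minimal illustration on simulated crop data (variable names and true coefficients are invented), assuming NumPy and SciPy are available:

```python
# Sketch: estimate coefficients, t statistics, and p values for a
# multiple linear regression on simulated (invented) crop data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
rainfall = rng.uniform(50, 150, n)
temperature = rng.uniform(10, 30, n)
fertilizer = rng.uniform(0, 10, n)
# True model (chosen for illustration): yield depends on all three predictors.
yield_ = (2.0 + 0.05 * rainfall + 0.3 * temperature + 1.2 * fertilizer
          + rng.normal(0, 1, n))

# Design matrix with an intercept column; least squares minimizes model error.
X = np.column_stack([np.ones(n), rainfall, temperature, fertilizer])
beta, *_ = np.linalg.lstsq(X, yield_, rcond=None)

# Residual variance estimate: MSE = SSE / (n - k - 1), with k predictors.
resid = yield_ - X @ beta
k = X.shape[1] - 1
mse = resid @ resid / (n - k - 1)

# Standard errors from the diagonal of MSE * (X'X)^-1, then t and p values.
se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))
t_stats = beta / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k - 1)
```

The per-coefficient t statistics test each βi individually; the overall model is assessed with the F test discussed in these notes.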
• Invalid Model. There is no relationship between the dependent variable and the set of independent variables. In this case, all of the regression coefficients βi in the population model are zero. This is the claim for the null hypothesis in the overall model test: H0: β1 = β2 = ⋯ = βk = 0.
• Valid Model. There is a relationship between the dependent variable and the set of
independent variables. In this case, at least one of the regression coefficients βi in the
population model is not zero. This is the claim for the alternative hypothesis in the overall
model test: Ha: at least one βi ≠ 0.
The logic behind the overall model test is based on two independent estimates of the variance
of the errors:
• One estimate of the variance of the errors, MSR, is based on the mean amount of explained
variation in the dependent variable y.
• One estimate of the variance of the errors, MSE, is based on the mean amount of unexplained
variation in the dependent variable y.
The overall model test compares these two estimates of the variance of the errors to determine if
there is a relationship between the dependent variable and the set of independent
variables. Because the overall model test involves the comparison of two estimates of variance,
an F-distribution is used to conduct the overall model test, where the test statistic is the ratio of
the two estimates of the variance of the errors.
The mean square due to regression, MSR, is one of the estimates of the variance of the
errors. The MSR is the estimate of the variance of the errors determined by the variance of the
predicted ŷ-values from the regression model about the mean of the y-values in the sample, ȳ. If
there is no relationship between the dependent variable and the set of independent variables, then
the MSR provides an unbiased estimate of the variance of the errors. If there is a relationship
between the dependent variable and the set of independent variables, then the MSR provides an
overestimate of the variance of the errors.
The overall model test depends on the fact that the MSR is influenced by the explained variation in
the dependent variable, which results in the MSR being either an unbiased estimate or an overestimate of the
variance of the errors. Because the MSE is based on the unexplained variation in the dependent
variable, the MSE is not affected by the relationship between the dependent variable and the set of
independent variables, and is always an unbiased estimate of the variance of the errors.
The null hypothesis in the overall model test is that there is no relationship between the dependent
variable and the set of independent variables. The alternative hypothesis is that there is a relationship
between the dependent variable and the set of independent variables. The F-score for the overall
model test is the ratio of the two estimates of the variance of the errors, F = MSR/MSE, with
df1 = k and df2 = n − k − 1. The p-value for the test is the area in the right tail
of the F-distribution to the right of the F-score.
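The F-score computation can be sketched as follows; the data here are simulated purely for illustration:

```python
# Sketch: overall model F test, F = MSR / MSE, on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = 1.0 + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(0, 1, n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation
sse = np.sum((y - y_hat) ** 2)          # unexplained variation
msr = ssr / k                           # df1 = k
mse = sse / (n - k - 1)                 # df2 = n - k - 1

f_score = msr / mse
p_value = stats.f.sf(f_score, k, n - k - 1)  # right-tail area of the F-distribution
```

A small p-value leads to rejecting H0, i.e. concluding that at least one βi differs from zero.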
First, multiple linear regression requires the relationship between the independent and dependent
variables to be linear. The linearity assumption can best be tested with scatterplots. The following
two examples depict a curvilinear relationship (left) and a linear relationship (right).
Second, the multiple linear regression analysis requires that the errors between observed and
predicted values (i.e., the residuals of the regression) should be normally distributed. This
assumption may be checked by looking at a histogram or a Q-Q plot. Normality can also be
checked with a goodness of fit test (e.g., the Kolmogorov-Smirnov test), though this test must be
conducted on the residuals themselves.
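As a sketch, the residual-normality check on simulated data (which, by construction, has normal errors); note that applying a standard Kolmogorov-Smirnov test to residuals standardized with estimated parameters is only approximate:

```python
# Sketch: test whether regression residuals look normally distributed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 3.0 + 0.5 * X[:, 1] + rng.normal(0, 1, n)  # errors are normal by construction

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Kolmogorov-Smirnov test on the standardized residuals: a large p-value
# means no evidence against normality.
z = (resid - resid.mean()) / resid.std(ddof=1)
ks_stat, ks_p = stats.kstest(z, "norm")
```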
Third, multiple linear regression assumes that there is no multicollinearity in the data.
Multicollinearity occurs when the independent variables are too highly correlated with each other.
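A common way to quantify multicollinearity is the variance inflation factor (VIF), computed by regressing each predictor on the others; a rule of thumb flags VIF values above about 10. A minimal sketch with invented data:

```python
# Sketch: variance inflation factors, VIF_j = 1 / (1 - R_j^2), where R_j^2
# comes from regressing predictor j on the remaining predictors.
import numpy as np

def vif(X):
    """X: (n, p) matrix of predictors (no intercept column)."""
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(3)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = a + 0.1 * rng.normal(size=100)  # c is nearly collinear with a
X = np.column_stack([a, b, c])
v = vif(X)  # v[0] and v[2] should be large (collinear pair); v[1] near 1
```

When a VIF is high, common remedies are dropping one of the correlated predictors or combining them into a single variable.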