MLR: Multicollinearity, Categorical Variables
New Consideration
• Adding more independent variables to a multiple
regression does not mean the regression will
be "better" or offer better predictions; in fact, it can
make things worse. This is called OVERFITTING.
• The addition of more independent variables creates
more relationships among them. So not only are the
independent variables potentially related to the
dependent variable, they are also potentially related to
each other. When this happens, it is called
MULTICOLLINEARITY.
• The ideal is for all of the independent variables to
be correlated with the dependent variable but
NOT with each other.
New Consideration
• Because of multicollinearity and overfitting, there is a
fair amount of pre-work to do BEFORE conducting
multiple regression analysis if one is to do it properly.
• Correlations
• Scatter plots
• Simple regressions
Multicollinearity
• Travel time is the dependent variable and miles traveled
and number of deliveries are independent variables.
• Some independent variables, or sets of independent
variables, are better at predicting the dependent
variable than others; some contribute nothing.
Multiple Regression Model
• y = β0 + β1x1 + β2x2 + … + βkxk + ε
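As a minimal sketch of what this equation computes, a multiple regression prediction is just the intercept plus a weighted sum of the independent variables (the coefficient values below are placeholders, not fitted estimates):

```python
def predict(intercept, coefs, xs):
    """Multiple regression prediction: y-hat = b0 + b1*x1 + ... + bk*xk.
    The error term epsilon is unobserved and omitted from predictions."""
    return intercept + sum(b * x for b, x in zip(coefs, xs))

# Placeholder coefficients for a hypothetical two-IV model
y_hat = predict(2.0, [0.5, 1.5], [10, 4])  # 2.0 + 5.0 + 6.0 = 13.0
```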
Coefficient Interpretation

MULTIPLE REGRESSION
Conducting multiple regression analysis requires a fair amount of
pre-work before actually running the regression. Here are the
steps:
1. Generate a list of potential variables: independent(s) and dependent
2. Collect data on the variables
3. Check the relationships between each independent variable and the
dependent variable using scatterplots and correlations
4. Check the relationships among the independent variables using
scatterplots and correlations
5. (Optional) Conduct simple linear regressions for each IV/DV pair
6. Use the non-redundant independent variables in the analysis to find the
best fitting model
7. Use the best fitting model to make predictions about the dependent
variable
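Steps 3 and 4 above can be sketched with a small helper that computes pairwise Pearson correlations and flags redundant IV pairs (a minimal sketch; the data values below are hypothetical, with names mirroring the delivery example):

```python
import math

# Hypothetical pre-work data; variable names mirror the delivery example
ivs = {
    "milesTraveled": [10, 20, 30, 40, 50],
    "numDeliveries": [1, 2, 3, 4, 6],
    "gasPrice": [3.1, 2.9, 3.3, 3.0, 3.2],
}

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def check_multicollinearity(data, threshold=0.8):
    """Flag IV pairs whose |r| exceeds the threshold (redundancy candidates)."""
    names = list(data)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson_r(data[names[i]], data[names[j]])
            if abs(r) > threshold:
                flagged.append((names[i], names[j], round(r, 3)))
    return flagged

flagged = check_multicollinearity(ivs)
```

With this toy data, only the milesTraveled/numDeliveries pair is flagged as highly correlated; the threshold of 0.8 is a common rule of thumb, not a fixed standard.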
RDS DATA AND VARIABLE NAMING
To conduct your analysis you use the variables milesTraveled (x1),
numDeliveries (x2), gasPrice (x3), and travelTime(hrs) (y).
Multicollinearity Check
• Independent variable vs. independent variable
• numDeliveries (x2) APPEARS highly correlated with
milesTraveled (x1); this is multicollinearity
• milesTraveled (x1) does not appear highly correlated with
gasPrice (x3)
• gasPrice (x3) does not appear correlated with
numDeliveries (x2)
• Since numDeliveries is HIGHLY CORRELATED with
milesTraveled, we would NOT use BOTH in the multiple
regression; they are redundant
• Note: for now, we will keep both in and then take one
out later for learning purposes
Correlations

Correlation           milesTraveled (x1)   numDeliveries (x2)   gasPrice (x3)
numDeliveries (x2)    0.956
travelTime(hrs) (y)   0.928                0.916                0.267
Scatterplot (DV vs IV)

IV Scatterplots (Multicollinearity)
Understanding
• Correlation analysis confirms the conclusions reached by
visual examination of the scatterplots
• Redundant multicollinear variables: milesTraveled and
numDeliveries are highly correlated with each other and
therefore redundant; only one should be used in the
multiple regression analysis
• Non-contributing variables: gasPrice is NOT correlated
with the dependent variable (travelTime)
• For the sake of learning, we are going to break the rules
and include all three independent variables in the
regression at first
• Then we will remove the problematic independent
variables, as we should, and watch what happens to the
regression results
• We will also perform simple regressions with the
dependent variable to use as a baseline
Example

BP    Weight  Age  BSA   Dur   Pulse  Stress
105   85.4    47   1.75  5.1   63     33
115   94.2    49   2.10  3.8   70     14
116   95.3    49   1.98  8.2   72     10
117   94.7    50   2.01  5.8   73     99
112   89.4    51   1.89  7.0   72     95
121   99.5    48   2.25  9.3   71     10
121   99.8    49   2.25  2.5   69     42
110   90.9    47   1.90  6.2   66     8
110   89.2    49   1.83  7.1   69     62
114   92.7    48   2.07  5.6   64     35
114   94.4    47   2.07  5.3   74     90
115   94.1    49   1.98  5.6   71     21
114   91.6    50   2.05  10.2  68     47
106   87.1    45   1.92  5.6   67     80
125   101.3   52   2.19  10.0  76     98
114   94.5    46   1.98  7.4   69     95
106   87.0    46   1.87  3.6   62     18
113   94.5    46   1.90  4.3   70     12
110   90.5    48   1.88  9.0   71     99
122   95.7    56   2.09  7.0   75     99

Correlation matrix:

        BP        Age       Weight    BSA       Dur       Pulse     Stress
BP      1
Age     0.659093  1
Weight  0.950068  0.407349  1
BSA     0.865879  0.378455  0.875305  1
Dur     0.292834  0.343792  0.200650  0.130540  1
Pulse   0.721413  0.618764  0.659340  0.464819  0.401514  1
Stress  0.163901  0.368224  0.034355  0.018446  0.311640  0.506310  1
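Any entry of this correlation matrix can be reproduced directly from the data table. As a quick check in plain Python, the Weight/BSA correlation (0.875 in the matrix) comes out of the standard Pearson formula:

```python
import math

# Weight and BSA columns, copied from the data table above
weight = [85.4, 94.2, 95.3, 94.7, 89.4, 99.5, 99.8, 90.9, 89.2, 92.7,
          94.4, 94.1, 91.6, 87.1, 101.3, 94.5, 87.0, 94.5, 90.5, 95.7]
bsa = [1.75, 2.10, 1.98, 2.01, 1.89, 2.25, 2.25, 1.90, 1.83, 2.07,
       2.07, 1.98, 2.05, 1.92, 2.19, 1.98, 1.87, 1.90, 1.88, 2.09]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r_weight_bsa = pearson_r(weight, bsa)  # close to the 0.875 in the matrix
```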
Variance Inflation Factor (VIF)
• A variance inflation factor (VIF) provides a measure of
multicollinearity among the independent variables in a
multiple regression model.
• Detecting multicollinearity is important because while
multicollinearity does not reduce the explanatory power
of the model, it does reduce the statistical significance
of the independent variables.
• A large VIF on an independent variable indicates a
highly collinear relationship to the other variables that
should be considered or adjusted for in the structure of
the model and selection of independent variables.
Variance Inflation Factor (VIF)

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.998073
R Square           0.996150
Adjusted R Square  0.994373
Standard Error     0.407229
Observations       20

ANOVA
            df   SS        MS        F        Significance F
Regression  6    557.8441  92.97402  560.641  6.4E-15
Residual    13   2.155858  0.165835
Total       19   560

            Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  VIF       R Squared
Intercept   -12.8705      2.55665         -5.03412  0.000229  -18.3938   -7.34717
Age         0.703259      0.049606        14.17696  2.76E-09  0.596093   0.810426   1.76
Weight      0.96992       0.063108        15.36909  1.02E-09  0.833582   1.106257   8.417035  0.88119332
BSA         3.776491      1.580151        2.389956  0.032694  0.362783   7.190199   5.33
Dur         0.068383      0.048441        1.411663  0.181534  -0.03627   0.173035   1.24
Pulse       -0.08448      0.051609        -1.63702  0.125594  -0.19598   0.02701    4.41
Stress      0.005572      0.003412        1.63277   0.126491  -0.0018    0.012943   1.834845  0.45499493

Several of the VIFs (8.42, 5.33, and 4.41) are fairly large. The VIF for
the predictor Weight, for example, tells us that the variance of the
estimated coefficient of Weight is inflated by a factor of 8.42 because
Weight is highly correlated with at least one of the other predictors in
the model.
Variance Inflation Factor (VIF): Weight as Dependent Variable

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.938719
R Square           0.881193
Adjusted R Square  0.838762
Standard Error     1.724594
Observations       20

ANOVA
            df   SS        MS        F        Significance F
Regression  5    308.8389  61.76777  20.7677  5.05E-06
Residual    14   41.63913  2.974223
Total       19   350.478
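The VIF reported for Weight in the earlier output follows directly from the R Square of this auxiliary regression, via the standard formula VIF_j = 1 / (1 - R_j²). A quick check:

```python
# R Square from regressing Weight on the other five predictors (table above)
r_squared_weight = 0.881193

# Standard VIF formula: VIF_j = 1 / (1 - R_j^2)
vif_weight = 1 / (1 - r_squared_weight)  # ≈ 8.417, matching the earlier output
```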
Understanding
• We see that the predictors Weight and BSA are highly correlated (r =
0.875). We can choose to remove either predictor from the model. The
decision of which one to remove is often a scientific or practical one.
• For example, if the researchers here are interested in using their final
model to predict the blood pressure of future individuals, their choice
should be clear. Which of the two measurements — body surface area or
weight — do you think would be easier to obtain?
• If weight is an easier measurement to obtain than body surface area,
then the researchers would be well-advised to remove BSA from the
model and leave Weight in the model.
• Reviewing again the above pairwise correlations, we see that the
predictor Pulse also appears to exhibit fairly strong marginal correlations
with several of the predictors, including Age (r = 0.619), Weight (r =
0.659), and Stress (r = 0.506). Therefore, the researchers could also
consider removing the predictor Pulse from the model.
After Removal

BP    Weight  Age  Dur   Stress
105   85.4    47   5.1   33
115   94.2    49   3.8   14
116   95.3    49   8.2   10
117   94.7    50   5.8   99
112   89.4    51   7.0   95
121   99.5    48   9.3   10
121   99.8    49   2.5   42
110   90.9    47   6.2   8
110   89.2    49   7.1   62
114   92.7    48   5.6   35
114   94.4    47   5.3   90
115   94.1    49   5.6   21
114   91.6    50   10.2  47
106   87.1    45   5.6   80
125   101.3   52   10.0  98
114   94.5    46   7.4   95
106   87.0    46   3.6   18
113   94.5    46   4.3   12
110   90.5    48   9.0   99
122   95.7    56   7.0   99

Correlation matrix:

        BP        Weight    Age       Dur       Stress
BP      1
Weight  0.950068  1
Age     0.659093  0.407349  1
Dur     0.292834  0.200650  0.343792  1
Stress  0.163901  0.034355  0.368224  0.311640  1

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.995934
R Square           0.991884
Adjusted R Square  0.989719
Standard Error     0.550462
Observations       20

ANOVA
            df   SS        MS        F         Significance F
Regression  4    555.4549  138.8637  458.2834  1.76E-15
Residual    15   4.545126  0.303008
Total       19   560

            Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  VIF       R squared
Intercept   -15.8698      3.195296        -4.96662  0.000169  -22.6804   -9.05922
Weight      1.034128      0.032672        31.65228  3.76E-15  0.964491   1.103766   1.234653  0.190056
Age         0.683741      0.061195        11.17309  1.14E-08  0.553306   0.814176   1.47
Dur         0.039889      0.064486        0.618561  0.545485  -0.09756   0.177339   1.2
Stress      0.002184      0.003794        0.575777  0.573304  -0.0059    0.01027    1.24
• The remaining variance inflation factors are quite
satisfactory! That is, hardly any variance inflation
remains. Incidentally, in terms of the adjusted R squared
we did not seem to lose much by dropping the two
predictors BSA and Pulse from our model: the adjusted
R squared decreased only to 98.97% from the original
99.44%.
Dummy Variable
Scenario – Dummy Variable
• You are an analyst for a small company that develops
house pricing models for independent realtors. To
generate your models you use publicly available data
such as list price, square footage, number of bedrooms,
number of bathrooms, etc.
• You are interested in another question: Is the public high
school in the neighborhood rated "Exemplary" (the highest
rating), and how is that rating related to home price?
• High school rating is not quantitative, it is qualitative
(categorical). For each home the high school is either
exemplary or not; yes or no.
Price1000s(y)  sqrt(x1)  exempHS          exempHS(x2, coded)
145            1872      Not exemplary    0
69.9           1954      Not exemplary    0
315            4104      Exemplary        1
144.9          1524      Not exemplary    0
134.9          1297      Not exemplary    0
369            3278      Exemplary        1
95             1192      Not exemplary    0
228.9          2252      Exemplary        1
149            1620      Not exemplary    0
295            2466      Exemplary        1
388.5          3188      Exemplary        1
75             1061      Not exemplary    0
130            1195      Not exemplary    0
174            1552      Exemplary        1
334.9          2901      Exemplary        1
DUMMY VARIABLES
• In many situations we must work with categorical
independent variables
• In regression analysis we call these dummy or indicator
variables
• For a variable with n categories there are always n-1
dummy variables
• Exemplary/Not exemplary there are 2 categories, so 2 − 1 = 1
dummy variable
• North/South/East/West there are 4 categories, so 4 - 1 = 3
dummy variables
Region Variable Coding
Region x1 x2 x3
North 1 0 0
South 0 1 0
East 0 0 1
West 0 0 0
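The n − 1 coding in the table above (West as the baseline, so its rows are all zeros) can be sketched in plain Python. `dummy_code` is an illustrative helper, not a specific library API:

```python
def dummy_code(values, baseline):
    """One-hot encode a categorical list, dropping the baseline level
    so that n categories yield n - 1 dummy columns."""
    levels = [v for v in sorted(set(values)) if v != baseline]
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

regions = ["North", "South", "East", "West"]
dummies = dummy_code(regions, baseline="West")
# Three dummy columns for four categories; the West row is 0 in every column
```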
Interpretation
• ŷ = b0 + b1x1 + b2x2
• Expected value of home price given the high school is Not
exemplary (x2 = 0): E(y) = b0 + b1x1
Regression Analysis

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.925716
R Square           0.856950
Adjusted R Square  0.833109
Standard Error     44.65395
Observations       15

ANOVA
            df   SS   MS   F   Significance F

            Coefficients  Standard Error  t Stat   P-value   Lower 95%  Upper 95%
Intercept   27.07494      33.68996        0.80365  0.437229  -46.3292   100.479
sqrt(x1)    0.062066      0.020324        3.05383  0.010013  0.017784   0.106348

• sqrt(x1): Every square foot is related to an increase in price of
0.0621 ($1000s), or $62.10 per square foot.
• exempHS(x2): On average, a home in an area with an exemplary high
school is related to a $98,600 higher price.
Interpretation
• ŷ = 27.07 + 0.0621x1 + 98.6x2
• Expected value of home price given the high school is Not
exemplary (x2 = 0): ŷ = 27.07 + 0.0621x1
• Given the high school is Exemplary (x2 = 1), the expected
price is higher by 98.6 ($1000s).
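Plugging the fitted coefficients into the model gives concrete predictions. A quick sketch (the 2,000 sq ft home is a hypothetical example, not from the data):

```python
# Fitted model from the summary output:
# price ($1000s) = 27.07494 + 0.062066*sqft + 98.6*exempHS
def predict_price(sqft, exemp_hs):
    """Predicted price in $1000s; exemp_hs is 1 for an exemplary district, else 0."""
    return 27.07494 + 0.062066 * sqft + 98.6 * exemp_hs

p_not = predict_price(2000, 0)   # ≈ 151.2 ($1000s)
p_yes = predict_price(2000, 1)   # ≈ 249.8
premium = p_yes - p_not          # 98.6: the exemplary-district premium
```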
• You are an analyst for a small company that develops house pricing
models for independent realtors. To generate your models you use
publicly available data such as; list price, square footage, number of
bedrooms, number of bathrooms, etc.
• You are interested in two questions:
1. Is the public high school in the neighborhood "Exemplary"
(the highest rating) and how is that rating related to home price?
2. In what region (N, S, E, W) of the city is the home located,
and how is that related to home price?
• This data is not quantitative, it is qualitative (categorical) so we will
need to use dummy variables in our regression.
Price1000s(y)  sqrt(x1)  exempHS        Region  exempHS(x2)  South(x3)  West(x4)  North(x5)
145            1872      Not exemplary  South   0            1          0         0
69.9           1954      Not exemplary  North   0            0          0         1
315            4104      Exemplary      South   1            1          0         0
144.9          1524      Not exemplary  North   0            0          0         1
134.9          1297      Not exemplary  East    0            0          0         0
369            3278      Exemplary      South   1            1          0         0
95             1192      Not exemplary  West    0            0          1         0
228.9          2252      Exemplary      North   1            0          0         1
149            1620      Not exemplary  West    0            0          1         0
295            2466      Exemplary      South   1            1          0         0
388.5          3188      Exemplary      South   1            1          0         0
75             1061      Not exemplary  South   0            1          0         0
130            1195      Not exemplary  East    0            0          0         0
174            1552      Exemplary      South   1            1          0         0
334.9          2901      Exemplary      South   1            1          0         0
For a home with 2,500 square feet, the price is at a premium when
the high school is exemplary, and lower when it is not.
The price is also lower for the west region and higher for the
south region, and moderate for the other two.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.940779
R Square 0.885066
Adjusted R Square 0.821214
Standard Error 46.2179
Observations 15
ANOVA
df SS MS F Significance F
Regression 5 148043.6 29608.71743 13.861149 0.000527941
Residual 9 19224.85 2136.094023
Total 14 167268.4
Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept 51.30076 42.29369 1.212964705 0.25601839 -44.37422483 146.9757412 -44.37422483 146.9757412
sqrt(x1) 0.065128 0.021546 3.022764609 0.01441461 0.016387876 0.113867729 0.016387876 0.113867729
exempHS(x2) 102.1551 40.13882 2.545046236 0.03144932 11.3548324 192.9554514 11.3548324 192.9554514
South(x3) -32.1221 44.4748 -0.72225415 0.4884789 -132.7311102 68.48688566 -132.7311102 68.48688566
West(x4) -20.8704 46.34628 -0.450315462 0.66313455 -125.7130273 83.9721305 -125.7130273 83.9721305
North(x5) -61.8466 43.87802 -1.409511619 0.19228908 -161.1055453 37.41239569 -161.1055453 37.41239569
Interpretation
• ŷ = 51.30 + 0.0651x1 + 102.16x2 - 32.12x3 - 20.87x4 - 61.85x5
• Expected value of home price given the high school is Not
exemplary and the home is in the west (x2 = 0, x4 = 1):
ŷ = 51.30 + 0.0651x1 - 20.87
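Using the fitted coefficients from the summary output, the claims about a 2,500 sq ft home can be checked numerically (a sketch; the square footage is the example value from the text, and East is the baseline region with all dummies zero):

```python
# Fitted coefficients from the summary output above
COEF = {"intercept": 51.30076, "sqft": 0.065128, "exempHS": 102.1551,
        "South": -32.1221, "West": -20.8704, "North": -61.8466}

def predict_price(sqft, exemp_hs, region):
    """Predicted price in $1000s; East is the baseline region (all dummies 0)."""
    price = COEF["intercept"] + COEF["sqft"] * sqft + COEF["exempHS"] * exemp_hs
    if region in ("South", "West", "North"):
        price += COEF[region]
    return price

# The exemplary-district premium is the same in every region
premium = predict_price(2500, 1, "South") - predict_price(2500, 0, "South")
```

Note that in this additive model the region dummies shift every prediction by a constant, so the exemplary premium of about 102.16 ($1000s) does not depend on square footage or region.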
• It appears from this data and its analysis:
• Higher square footage is a significant predictor of higher home
price
• Being in a district with an exemplary high school is a
significant predictor of higher home price
• Region is NOT a significant predictor of higher home price.
Any Queries?

Thank you