
ICS422 Applied Predictive Analytics [3-0-0-3]

Multiple Linear Regression

Class 19
Presented by Dr. Selvi C, Assistant Professor, IIIT Kottayam
REGIONAL DELIVERY SERVICE
• Let's assume that you are the owner of Regional Delivery Service, Inc. (RDS), a small business that offers same-day delivery for letters, packages, and other small cargo.
• You are able to use Google Maps to group individual deliveries into one trip to reduce time and fuel costs. Therefore, some trips will have more than one delivery.
• As the owner, you would like to be able to estimate how long a delivery will take based on two factors: 1) the total distance of the trip in miles and 2) the number of deliveries that must be made during the trip.
RDS DATA AND VARIABLE NAMING
• To conduct your analysis you take a random sample of 10 past trips and record three pieces of information for each trip: 1) total miles traveled, 2) number of deliveries, and 3) total travel time in hours.
• Remember that in this case, you would like to be able to predict the total travel time using both the miles traveled and the number of deliveries on each trip.
• In what way does travel time DEPEND on the first two measures?
• Travel time is the dependent variable; miles traveled and number of deliveries are the independent variables.

milesTraveled (x1)   numDeliveries (x2)   travelTime (hrs) (y)
89                   4                    7
66                   1                    5.4
78                   3                    6.6
111                  6                    7.4
44                   1                    4.8
77                   3                    6.4
80                   3                    7
66                   2                    5.6
109                  5                    7.3
76                   3                    6.4
Multiple Linear Regression
• Simple linear regression: one to one, IV -> DV
• Multiple linear regression is an extension of simple linear regression: many to one
• IV1 + IV2 + … -> DV
New Consideration
• Adding more independent variables to a multiple regression procedure does not mean the regression will be "better" or offer better predictions; in fact, it can make things worse. This is called OVERFITTING.
• The addition of more independent variables creates more relationships among them. So not only are the independent variables potentially related to the dependent variable, they are also potentially related to each other. When this happens, it is called MULTICOLLINEARITY.
• The ideal is for all of the independent variables to be correlated with the dependent variable but NOT with each other.
New Consideration
• Because of multicollinearity and overfitting, there is a
fair amount of pre-work to do BEFORE conducting
multiple regression analysis if one is to do it properly.
• Correlations
• Scatter plots
• Simple regressions

Multicollinearity
• Travel time is the dependent variable and miles traveled
and number of deliveries are independent variables.
• Some independent variables, or sets of independent variables, are better at predicting the dependent variable than others. Some contribute nothing.
Multiple Regression Model
• Multiple regression model:
  y = β0 + β1x1 + β2x2 + … + βpxp + ε
• Multiple regression equation:
  E(y) = β0 + β1x1 + β2x2 + … + βpxp
• Estimated multiple regression equation:
  ŷ = b0 + b1x1 + b2x2 + … + bpxp
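The estimated equation above maps directly onto code. Here is a minimal sketch, assuming Python with pandas and statsmodels (neither appears in the original slides), fitting the 10-trip RDS sample shown earlier:

import pandas as pd
import statsmodels.api as sm

# RDS sample of 10 past trips, as tabulated in the slides
rds = pd.DataFrame({
    "milesTraveled": [89, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "numDeliveries": [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
    "travelTime":    [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4],
})

# Estimated multiple regression equation: y-hat = b0 + b1*x1 + b2*x2
X = sm.add_constant(rds[["milesTraveled", "numDeliveries"]])
model = sm.OLS(rds["travelTime"], X).fit()
print(model.params)    # b0 (const), b1, b2
print(model.summary())  # full regression output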
Coefficient Interpretation

INTERPRETING COEFFICIENTS

• x1 = capital investment ($1000s)


• x2 = marketing expenditures ($1000s)
• ŷ = predicted sales ($1000s)
• In multiple regression, each coefficient is interpreted as the
estimated change in y corresponding to a one unit change
in a variable, when all other variables are held constant.
• So in this example, $9000 is an estimate of the expected
increase in sales y, corresponding to a $1000 increase in
capital investment (x1) when marketing expenditures (x2)
are held constant.
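In symbols, that interpretation is just the partial effect of x1 (a small worked restatement; only the coefficient b1 = 9 is given on this slide, so no other part of the fitted equation is assumed):

\Delta \hat{y} = b_1 \cdot \Delta x_1 = 9 \times 1 = 9 \;(\$1000\text{s}) = \$9000, \quad \text{with } x_2 \text{ held constant}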
• Multiple regression is an extension of simple linear regression
• Two or more independent variables are used to predict / explain the
variance in one dependent variable
• Two problems may arise:
• Overfitting
• Multicollinearity
• Overfitting is caused by adding too many independent variables; they
account for more variance but add nothing to the model
• Multicollinearity happens when some/all of the independent variables
are correlated with each other
• In multiple regression, each coefficient is interpreted as the estimated
change in y corresponding to a one unit change in a variable, when all
other variables are held constant.
REGIONAL DELIVERY SERVICE
• Let's assume that you are the owner of Regional Delivery Service, Inc. (RDS), a small business that offers same-day delivery for letters, packages, and other small cargo.
• You are able to use Google Maps to group individual deliveries into one trip to reduce time and fuel costs. Therefore, some trips will have more than one delivery.
• As the owner, you would like to be able to estimate how long a delivery will take based on three factors: 1) the total distance of the trip in miles, 2) the number of deliveries that must be made during the trip, and 3) the daily price of gas/petrol in U.S. dollars.
Multicollinearity

MULTIPLE REGRESSION
Conducting multiple regression analysis requires a fair amount of
pre-work before actually running the regression. Here are the
steps:
1. Generate a list of potential variables; independent(s) and dependent
2. Collect data on the variables
3. Check the relationships between each independent variable and the
dependent variable using scatterplots and correlations
4. Check the relationships among the independent variables using
scatterplots and correlations
5. (Optional) Conduct simple linear regressions for each IV/DV pair
6. Use the non-redundant independent variables in the analysis to find the
best fitting model
7. Use the best fitting model to make predictions about the dependent variable
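Steps 3 and 4 are easy to script. A minimal sketch, assuming Python with pandas and matplotlib (using the three-IV RDS data tabulated on the next slide):

import pandas as pd
import matplotlib.pyplot as plt

rds = pd.DataFrame({
    "milesTraveled": [89, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "numDeliveries": [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
    "gasPrice":      [3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25],
    "travelTime":    [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4],
})

# Steps 3-4: correlations of every IV with the DV and with each other
print(rds.corr().round(3))

# Scatterplots for the visual checks
pd.plotting.scatter_matrix(rds, figsize=(8, 8))
plt.show()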
RDS DATA AND VARIABLE NAMING
• To conduct your analysis you take a random sample of 10 past trips and record four pieces of information for each trip: 1) total miles traveled, 2) number of deliveries, 3) the daily gas price, and 4) total travel time in hours.

milesTraveled (x1)   numDeliveries (x2)   gasPrice (x3)   travelTime (hrs) (y)
89                   4                    3.84            7
66                   1                    3.19            5.4
78                   3                    3.78            6.6
111                  6                    3.89            7.4
44                   1                    3.57            4.8
77                   3                    3.57            6.4
80                   3                    3.03            7
66                   2                    3.51            5.6
109                  5                    3.54            7.3
76                   3                    3.25            6.4
SCATTERPLOT SUMMARY
• Dependent variable vs independent variables
• travelTime(y) appears highly correlated with milesTraveled (x1)
• travelTime(y) appears highly correlated with numDeliveries (x2)
• travelTime(y) DOES NOT appear highly correlated with
gasPrice(x3)
• Since gasPrice (x3) does NOT APPEAR CORRELATED with
the dependent variable we would NOT use that variable in
the multiple regression
• Note: for now, we will keep gasPrice in and then take it out
later for learning purposes

Multicollinearity Check
• Independent variable vs independent variable
• numDeliveries (x2) APPEARS highly correlated with milesTraveled (x1); this is multicollinearity
• milesTraveled (x1) does not appear highly correlated with gasPrice (x3)
• gasPrice (x3) does not appear correlated with numDeliveries (x2)
• Since numDeliveries is HIGHLY CORRELATED with
milesTraveled, we would NOT use BOTH in the multiple
regression; they are redundant
• Note: for now, we will keep both in and then take one
out later for learning purposes
Correlations

Correlation            milesTraveled (x1)   numDeliveries (x2)   gasPrice (x3)
numDeliveries (x2)     0.956
gasPrice (x3)          0.356                0.498
travelTime (hrs) (y)   0.928                0.916                0.267
Scatterplot (DV vs IV): travelTime vs milesTraveled, r = 0.928; travelTime vs numDeliveries, r = 0.916; travelTime vs gasPrice, r = 0.267
IV Scatterplots (multicollinearity): milesTraveled vs numDeliveries, r = 0.956; milesTraveled vs gasPrice, r = 0.356; numDeliveries vs gasPrice, r = 0.498
Understanding
• Correlation analysis confirms the conclusions reached by visual examination of the scatterplots.
• Redundant multicollinear variables: milesTraveled and numDeliveries are both highly correlated with each other and therefore are redundant; only one should be used in the multiple regression analysis.
• Non-contributing variables: gasPrice is NOT correlated with the dependent variable.
• For the sake of learning, we are going to break the rules and include all three independent variables in the regression at first.
• Then we will remove the problematic independent variables, as we should, and watch what happens to the regression results.
• We will also perform simple regressions with the dependent variable to use as a baseline.
Example

BP    Weight   Age   BSA    Dur    Pulse   Stress
105   85.4     47    1.75   5.1    63      33
115   94.2     49    2.1    3.8    70      14
116   95.3     49    1.98   8.2    72      10
117   94.7     50    2.01   5.8    73      99
112   89.4     51    1.89   7      72      95
121   99.5     48    2.25   9.3    71      10
121   99.8     49    2.25   2.5    69      42
110   90.9     47    1.9    6.2    66      8
110   89.2     49    1.83   7.1    69      62
114   92.7     48    2.07   5.6    64      35
114   94.4     47    2.07   5.3    74      90
115   94.1     49    1.98   5.6    71      21
114   91.6     50    2.05   10.2   68      47
106   87.1     45    1.92   5.6    67      80
125   101.3    52    2.19   10     76      98
114   94.5     46    1.98   7.4    69      95
106   87       46    1.87   3.6    62      18
113   94.5     46    1.9    4.3    70      12
110   90.5     48    1.88   9      71      99
122   95.7     56    2.09   7      75      99

Pairwise correlations:
         BP         Age        Weight     BSA        Dur        Pulse     Stress
BP       1
Age      0.659093   1
Weight   0.950068   0.407349   1
BSA      0.865879   0.378455   0.875305   1
Dur      0.292834   0.343792   0.20065    0.13054    1
Pulse    0.721413   0.618764   0.65934    0.464819   0.401514   1
Stress   0.163901   0.368224   0.034355   0.018446   0.31164    0.50631   1
Variance Inflation Factor (VIF)
• A variance inflation factor (VIF) provides a measure of
multicollinearity among the independent variables in a
multiple regression model.
• Detecting multicollinearity is important because while
multicollinearity does not reduce the explanatory power
of the model, it does reduce the statistical significance
of the independent variables.
• A large VIF on an independent variable indicates a
highly collinear relationship to the other variables that
should be considered or adjusted for in the structure of
the model and selection of independent variables.
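A minimal sketch of how these VIFs can be computed, assuming Python with pandas and statsmodels (the slides themselves show Excel output), using the blood-pressure data from the Example slide:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Blood-pressure data from the Example slide
bp = pd.DataFrame({
    "BP":     [105, 115, 116, 117, 112, 121, 121, 110, 110, 114,
               114, 115, 114, 106, 125, 114, 106, 113, 110, 122],
    "Weight": [85.4, 94.2, 95.3, 94.7, 89.4, 99.5, 99.8, 90.9, 89.2, 92.7,
               94.4, 94.1, 91.6, 87.1, 101.3, 94.5, 87.0, 94.5, 90.5, 95.7],
    "Age":    [47, 49, 49, 50, 51, 48, 49, 47, 49, 48,
               47, 49, 50, 45, 52, 46, 46, 46, 48, 56],
    "BSA":    [1.75, 2.10, 1.98, 2.01, 1.89, 2.25, 2.25, 1.90, 1.83, 2.07,
               2.07, 1.98, 2.05, 1.92, 2.19, 1.98, 1.87, 1.90, 1.88, 2.09],
    "Dur":    [5.1, 3.8, 8.2, 5.8, 7.0, 9.3, 2.5, 6.2, 7.1, 5.6,
               5.3, 5.6, 10.2, 5.6, 10.0, 7.4, 3.6, 4.3, 9.0, 7.0],
    "Pulse":  [63, 70, 72, 73, 72, 71, 69, 66, 69, 64,
               74, 71, 68, 67, 76, 69, 62, 70, 71, 75],
    "Stress": [33, 14, 10, 99, 95, 10, 42, 8, 62, 35,
               90, 21, 47, 80, 98, 95, 18, 12, 99, 99],
})

predictors = ["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]
X = sm.add_constant(bp[predictors])

# VIF for each predictor; index 0 is the constant, so start at 1
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=predictors,
)
print(vif.round(2))  # Weight (~8.42), BSA (~5.33) and Pulse (~4.41) are large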
Variance Inflation Factor (VIF)

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.998073
R Square            0.99615
Adjusted R Square   0.994373
Standard Error      0.407229
Observations        20

ANOVA
             df   SS         MS         F         Significance F
Regression   6    557.8441   92.97402   560.641   6.4E-15
Residual     13   2.155858   0.165835
Total        19   560

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   VIF        R Squared
Intercept   -12.8705       2.55665          -5.03412   0.000229   -18.3938    -7.34717
Age         0.703259       0.049606         14.17696   2.76E-09   0.596093    0.810426    1.76
Weight      0.96992        0.063108         15.36909   1.02E-09   0.833582    1.106257    8.417035   0.88119332
BSA         3.776491       1.580151         2.389956   0.032694   0.362783    7.190199    5.33
Dur         0.068383       0.048441         1.411663   0.181534   -0.03627    0.173035    1.24
Pulse       -0.08448       0.051609         -1.63702   0.125594   -0.19598    0.02701     4.41
Stress      0.005572       0.003412         1.63277    0.126491   -0.0018     0.012943    1.834845   0.45499493

The three largest VIFs — 8.42 (Weight), 5.33 (BSA), and 4.41 (Pulse) — are fairly large. The VIF for the predictor Weight, for example, tells us that the variance of the estimated coefficient of Weight is inflated by a factor of 8.42 because Weight is highly correlated with at least one of the other predictors in the model.
Variance Inflation Factor (VIF) - Weight as Dependent Variable

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.938719
R Square            0.881193
Adjusted R Square   0.838762
Standard Error      1.724594
Observations        20

ANOVA
             df   SS         MS         F         Significance F
Regression   5    308.8389   61.76777   20.7677   5.05E-06
Residual     14   41.63913   2.974223
Total        19   350.478

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   19.67444       9.464742         2.078708   0.056513   -0.62541    39.97429
Age         -0.14464       0.206491         -0.70048   0.495103   -0.58752    0.298235
BSA         21.42165       3.464586         6.183034   2.38E-05   13.99086    28.85245
Dur         0.008696       0.205134         0.042394   0.966783   -0.43127    0.448665
Pulse       0.557697       0.159853         3.488813   0.003615   0.214847    0.900548
Stress      -0.023         0.013079         -1.75834   0.10052    -0.05105    0.005054
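This auxiliary regression is exactly where the VIF for Weight comes from. As a worked check using the R Square above:

\mathrm{VIF}_{\text{Weight}} = \frac{1}{1 - R^2_{\text{Weight}}} = \frac{1}{1 - 0.881193} \approx 8.42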
Understanding
• We see that the predictors Weight and BSA are highly correlated (r =
0.875). We can choose to remove either predictor from the model. The
decision of which one to remove is often a scientific or practical one.
• For example, if the researchers here are interested in using their final
model to predict the blood pressure of future individuals, their choice
should be clear. Which of the two measurements — body surface area or
weight — do you think would be easier to obtain?!
• If weight is an easier measurement to obtain than body surface area,
then the researchers would be well-advised to remove BSA from the
model and leave Weight in the model.
• Reviewing again the above pairwise correlations, we see that the
predictor Pulse also appears to exhibit fairly strong marginal correlations
with several of the predictors, including Age (r = 0.619), Weight (r =
0.659), and Stress (r = 0.506). Therefore, the researchers could also
consider removing the predictor Pulse from the model.
After removal

Data: the same 20 observations as in the earlier Example table, with the BSA and Pulse columns dropped (columns BP, Weight, Age, Dur, Stress).

Pairwise correlations:
         BP         Weight     Age        Dur       Stress
BP       1
Weight   0.950068   1
Age      0.659093   0.407349   1
Dur      0.292834   0.20065    0.343792   1
Stress   0.163901   0.034355   0.368224   0.31164   1

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.995934
R Square            0.991884
Adjusted R Square   0.989719
Standard Error      0.550462
Observations        20

ANOVA
             df   SS         MS         F          Significance F
Regression   4    555.4549   138.8637   458.2834   1.76E-15
Residual     15   4.545126   0.303008
Total        19   560

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%   VIF        R Squared
Intercept   -15.8698       3.195296         -4.96662   0.000169   -22.6804    -9.05922
Weight      1.034128       0.032672         31.65228   3.76E-15   0.964491    1.103766    1.234653   0.190056
Age         0.683741       0.061195         11.17309   1.14E-08   0.553306    0.814176    1.47
Dur         0.039889       0.064486         0.618561   0.545485   -0.09756    0.177339    1.2
Stress      0.002184       0.003794         0.575777   0.573304   -0.0059     0.01027     1.24
• The remaining variance inflation factors are quite satisfactory! That is, hardly any variance inflation remains. Incidentally, in terms of adjusted R squared we did not seem to lose much by dropping the two predictors BSA and Pulse from our model: the adjusted R squared decreased only to 98.97% from the original 99.44%.
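As a quick check of the 98.97% figure (using the standard adjusted R squared formula, with n = 20 observations and p = 4 predictors from the output above):

R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.991884) \times \frac{19}{15} \approx 0.9897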
Dummy Variable

Scenario – Dummy Variable
• You are an analyst for a small company that develops house pricing models for independent realtors. To generate your models you use publicly available data such as list price, square footage, number of bedrooms, number of bathrooms, etc.
• You are interested in another question: Is the public high school in the neighborhood "Exemplary" (the highest rating), and how is that rating related to home price?
• High school rating is not quantitative, it is qualitative (categorical). For each home the high school is either exemplary or not; yes or no.
Price1000s (y)   sqft (x1)   exempHS          exempHS (x2, coded)
145              1872        Not exemplary    0
69.9             1954        Not exemplary    0
315              4104        Exemplary        1
144.9            1524        Not exemplary    0
134.9            1297        Not exemplary    0
369              3278        Exemplary        1
95               1192        Not exemplary    0
228.9            2252        Exemplary        1
149              1620        Not exemplary    0
295              2466        Exemplary        1
388.5            3188        Exemplary        1
75               1061        Not exemplary    0
130              1195        Not exemplary    0
174              1552        Exemplary        1
334.9            2901        Exemplary        1
DUMMY VARIABLES
• In many situations we must work with categorical
independent variables
• In regression analysis we call these dummy or indicator
variables
• For a variable with n categories there are always n - 1 dummy variables
• Exemplary/Not exemplary: there are 2 categories, so 2 - 1 = 1 dummy variable
• North/South/East/West: there are 4 categories, so 4 - 1 = 3 dummy variables (see the pandas sketch after the coding table below)
Region Variable Coding

Region x1 x2 x3
North 1 0 0
South 0 1 0
East 0 0 1
West 0 0 0

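One way to produce this kind of n - 1 coding programmatically is a sketch assuming pandas: pd.get_dummies with drop_first=True drops one category, which becomes the baseline. Note it drops the first category alphabetically, so the baseline below is East rather than the West baseline shown in the table above.

import pandas as pd

# Hypothetical region labels, just to illustrate the coding
homes = pd.DataFrame({
    "Region": ["North", "South", "East", "West", "South", "North"],
})

# 4 categories -> 3 dummy columns; the dropped category is the baseline
dummies = pd.get_dummies(homes["Region"], prefix="Region", drop_first=True).astype(int)
print(dummies)
# Columns: Region_North, Region_South, Region_West (East is all zeros, the baseline)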
Interpretation
• ŷ = b0 + b1x1 + b2x2
• Expected value of home price given the high school is Not exemplary (x2 = 0):
  ŷ = b0 + b1x1
• Expected value of home price given the high school is Exemplary (x2 = 1):
  ŷ = (b0 + b2) + b1x1
Regression Analysis

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.925716
R Square            0.85695
Adjusted R Square   0.833109
Standard Error      44.65395
Observations        15

ANOVA
             df   SS         MS         F          Significance F
Regression   2    143340.7   71670.36   35.94346   8.57E-06
Residual     12   23927.7    1993.975
Total        14   167268.4

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept     27.07494       33.68996         0.80365    0.437229   -46.3292    100.479
sqft (x1)     0.062066       0.020324         3.05383    0.010013   0.017784    0.106348
exempHS (x2)  98.64787       35.96319         2.743023   0.017831   20.2908     177.0049

• sqft: every additional square foot is related to an increase in price of 0.0621 ($1000s), or $62.10 per square foot.
• exempHS: on average, a home in an area with an exemplary high school is related to a $98,600 higher price.
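A minimal sketch reproducing this fit, assuming pandas and statsmodels (the column names here are mine), with the 15 homes from the earlier table:

import pandas as pd
import statsmodels.api as sm

homes = pd.DataFrame({
    "price":   [145, 69.9, 315, 144.9, 134.9, 369, 95, 228.9,
                149, 295, 388.5, 75, 130, 174, 334.9],
    "sqft":    [1872, 1954, 4104, 1524, 1297, 3278, 1192, 2252,
                1620, 2466, 3188, 1061, 1195, 1552, 2901],
    "exempHS": [0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1],
})

X = sm.add_constant(homes[["sqft", "exempHS"]])
model = sm.OLS(homes["price"], X).fit()
print(model.params.round(4))
# Expect roughly: const 27.07, sqft 0.0621, exempHS 98.65, matching the slide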
Interpretation
• ŷ = 27.07 + 0.0621x1 + 98.65x2
• Expected value of home price given the high school is Not exemplary (x2 = 0):
  ŷ = 27.07 + 0.0621x1
• Expected value of home price given the high school is Exemplary (x2 = 1):
  ŷ = (27.07 + 98.65) + 0.0621x1 = 125.72 + 0.0621x1
• You are an analyst for a small company that develops house pricing models for independent realtors. To generate your models you use publicly available data such as list price, square footage, number of bedrooms, number of bathrooms, etc.
• You are interested in two questions:
1. Is the public high school in the neighborhood "Exemplary" (the highest rating), and how is that rating related to home price?
2. In what region (N, S, E, W) of the city is the home located, and how is that related to home price?
• This data is not quantitative, it is qualitative (categorical), so we will need to use dummy variables in our regression.
Price1000s (y)   sqft (x1)   exempHS          Region   exempHS (x2)   South (x3)   West (x4)   North (x5)
145              1872        Not exemplary    South    0              1            0           0
69.9             1954        Not exemplary    North    0              0            0           1
315              4104        Exemplary        South    1              1            0           0
144.9            1524        Not exemplary    North    0              0            0           1
134.9            1297        Not exemplary    East     0              0            0           0
369              3278        Exemplary        South    1              1            0           0
95               1192        Not exemplary    West     0              0            1           0
228.9            2252        Exemplary        North    1              0            0           1
149              1620        Not exemplary    West     0              0            1           0
295              2466        Exemplary        South    1              1            0           0
388.5            3188        Exemplary        South    1              1            0           0
75               1061        Not exemplary    South    0              1            0           0
130              1195        Not exemplary    East     0              0            0           0
174              1552        Exemplary        South    1              1            0           0
334.9            2901        Exemplary        South    1              1            0           0

(East is the baseline region: all three region dummies are 0.)
For a home with 2,500 square feet, the predicted price is at a premium when the high school is exemplary, and lower when the high school is not exemplary.

Also, the predicted price is lower for the west region and higher for the south region, and moderate for the other two.
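A worked version of the 2,500-square-foot comparison (my arithmetic, using the rounded coefficients of the earlier two-predictor fit, with ŷ in $1000s):

\hat{y}_{\text{exemplary}} = 27.07 + 0.0621 \times 2500 + 98.65 \approx 280.97 \;(\approx \$281{,}000)

\hat{y}_{\text{not exemplary}} = 27.07 + 0.0621 \times 2500 \approx 182.32 \;(\approx \$182{,}300)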
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.940779
R Square 0.885066
Adjusted R Square 0.821214
Standard Error 46.2179
Observations 15

ANOVA
df SS MS F Significance F
Regression 5 148043.6 29608.71743 13.861149 0.000527941
Residual 9 19224.85 2136.094023
Total 14 167268.4

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 51.30076 42.29369 1.212964705 0.25601839 -44.37422483 146.9757412 -44.37422483 146.9757412
sqft(x1) 0.065128 0.021546 3.022764609 0.01441461 0.016387876 0.113867729 0.016387876 0.113867729
exempHS(x2) 102.1551 40.13882 2.545046236 0.03144932 11.3548324 192.9554514 11.3548324 192.9554514
South(x3) -32.1221 44.4748 -0.72225415 0.4884789 -132.7311102 68.48688566 -132.7311102 68.48688566
West(x4) -20.8704 46.34628 -0.450315462 0.66313455 -125.7130273 83.9721305 -125.7130273 83.9721305
North(x5) -61.8466 43.87802 -1.409511619 0.19228908 -161.1055453 37.41239569 -161.1055453 37.41239569

Interpretation
• ŷ = 51.30 + 0.0651x1 + 102.16x2 - 32.12x3 - 20.87x4 - 61.85x5
• Expected value of home price given the high school is Not exemplary and the home is in the West (x2 = 0, x3 = 0, x4 = 1, x5 = 0):
  ŷ = (51.30 - 20.87) + 0.0651x1 = 30.43 + 0.0651x1
• It appears from this data and its analysis:
• Higher square footage is a significant predictor of higher home
price
• Being in a district with an exemplary high school is a
significant predictor of higher home price
• Region is NOT a significant predictor of home price.

Any Queries?
Thank you
