Multiple Linear Regression 13112023 063212pm

This document provides an introduction to multiple linear regression (MLR). MLR is used to estimate the relationship between two or more independent variables and one dependent variable. MLR can help determine how strong the relationship is between independent and dependent variables and predict dependent variable values given independent variable values. MLR makes assumptions similar to simple linear regression, including normality and independence of errors. Strategies for selecting independent variables include all-in, backward elimination, forward selection, and bidirectional elimination. Python code demonstrates implementing MLR, including one-hot encoding, training/testing a model, and using backward elimination for variable selection.

Uploaded by

AHSAN HAMEED

Multiple Linear Regression
Introduction to Machine Learning
Contents
1. What is multiple linear regression (MLR)?
2. What multiple linear regression can help you do
3. Assumptions of multiple linear regression
4. How to perform a multiple linear regression
   i. T-test
   ii. P-value
   iii. The model
   iv. Selecting the independent variables
   v. Python code
   vi. Example of backward elimination for selection of independent variables
What is MLR
Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable.
What multiple linear regression can help you do
• You can use multiple linear regression when you want to know:
 How strong the relationship is between two or more independent
variables and one dependent variable (e.g. how rainfall,
temperature, and amount of fertilizer added affect crop growth).
 The value of the dependent variable at a certain value of the
independent variables (e.g. the expected yield of a crop at
certain levels of rainfall, temperature, and fertilizer addition).
Assumptions of multiple linear regression
• Multiple linear regression makes all of the same assumptions as
simple linear regression:
 The probability distribution of the error e is normal.
 The mean of e is zero: E(e) = 0.
 The standard deviation of e is a constant σe for all values of X.
 The errors associated with different values of Y are all independent.
Design Requirements
 Two or more independent variables (predictor variables).
 Sample size: >= 50 (at least 10 times as many cases as independent variables).
The formula for a multiple linear regression is:

y = B0 + B1X1 + B2X2 + … + BnXn + e
• y = the predicted value of the dependent variable
• B0 = the y-intercept (value of y when all other parameters are set to 0)
• B1X1= the regression coefficient (B1) of the first independent variable (X1) (a.k.a.
the effect that increasing the value of the independent variable has on the predicted
y value)
• … = do the same for however many independent variables you are testing
• BnXn = the regression coefficient of the last independent variable
• e = model error (a.k.a. how much variation there is in our estimate of y)
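As a minimal sketch (not from the slides), the formula can be evaluated directly with NumPy; the coefficient values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical coefficients for a two-predictor model:
# y = B0 + B1*X1 + B2*X2 (+ e, which is zero for a point prediction)
B0 = 5.0                      # y-intercept
B = np.array([2.0, -0.5])     # regression coefficients B1, B2

X = np.array([3.0, 4.0])      # one observation of (X1, X2)
y_hat = B0 + B @ X            # 5 + 2*3 + (-0.5)*4 = 9.0
print(y_hat)
```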
Best-fit line
• To find the best-fit line for each independent variable, multiple
linear regression calculates three things:
 The regression coefficients that lead to the smallest overall
model error.
 The t-statistic of the overall model.
 The associated p-value (how likely it is that the t-statistic would
have occurred by chance if the null hypothesis of no relationship
between the independent and dependent variables was true).
T-test
• In statistics, the t-statistic is the ratio of the departure of the
estimated value of a parameter from its hypothesized value to its
standard error.
• It is used to evaluate whether two sets of data are statistically
significantly different from each other.
• Q.1: Find the t-test value for the following given two sets of values:
• A = 7, 2, 9, 8 and
• B = 1, 2, 3, 4?
• Solution: For the first data set, the number of terms is n1 = 4, the mean is
(7 + 2 + 9 + 8)/4 = 6.5, and the sample variance is 29/3 ≈ 9.67. For the second
data set, n2 = 4, the mean is 2.5, and the sample variance is 5/3 ≈ 1.67. The
pooled variance is (29 + 5)/(4 + 4 − 2) ≈ 5.67, so
t = (6.5 − 2.5) / √(5.67 × (1/4 + 1/4)) ≈ 2.376.
• Higher values of the t-value, also called t-score, indicate that a large difference exists
between the two sample sets. The smaller the t-value, the more similarity exists
between the two sample sets.
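The arithmetic in Q.1 can be checked with a short pooled (equal-variance) two-sample t-test in plain NumPy; scipy.stats.ttest_ind with its default equal_var=True gives the same t-value:

```python
import numpy as np

# Pooled two-sample t-test for the slide's data: A = 7, 2, 9, 8 and B = 1, 2, 3, 4
A = np.array([7.0, 2.0, 9.0, 8.0])
B = np.array([1.0, 2.0, 3.0, 4.0])
n_A, n_B = len(A), len(B)

var_A = A.var(ddof=1)   # sample variance (n - 1 denominator)
var_B = B.var(ddof=1)
pooled_var = ((n_A - 1) * var_A + (n_B - 1) * var_B) / (n_A + n_B - 2)

t = (A.mean() - B.mean()) / np.sqrt(pooled_var * (1 / n_A + 1 / n_B))
print(round(t, 3))      # ≈ 2.376
```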
P-value
• P-value is the lowest significance level that results in rejecting the null hypothesis.
Example: Coin toss
 Two possible outcomes
 H0 = This is a fair coin
 H1 = This is not a fair coin
• The P-value test will assume that the H0
hypothesis is true i.e., the coin is fair
• Let us assume our threshold value to be 5%
i.e., 0.05
• Let us assume the outputs are:
 First toss is Tails (probability = 0.5)
 First two tosses are Tails (probability = 0.25)
 First three tosses are Tails (probability = 0.125)
 First four tosses are Tails (probability = 0.0625)
 First five tosses are Tails (probability = 0.03125)
 First six tosses are Tails (probability = 0.015625)
 At the fifth consecutive Tails the statistical test becomes significant:
a P-value of less than 5% means the hypothesis H0 is rejected and the
hypothesis H1 is accepted, i.e., the coin is not fair.
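Under H0 the run-of-Tails probabilities above are just (1/2)^n, which a few lines of Python confirm:

```python
# Probability of observing n Tails in a row if H0 (the coin is fair) holds
probs = [0.5 ** n for n in range(1, 7)]
for n, p in zip(range(1, 7), probs):
    print(f"{n} Tails in a row: p = {p}")

# First run length whose probability drops below the 5% significance level
significant_at = next(n for n, p in zip(range(1, 7), probs) if p < 0.05)
print(significant_at)   # 5: the fifth consecutive Tails makes the test significant
```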
Selecting the independent variables to be used
• Five strategies are available for selecting the independent variables
 All in
 Backward Elimination
 Forward Selection
 Bi-directional elimination
 Score Comparison (All possible combinations)
All in
• Use all features.
• Prior knowledge (a data domain expert) tells you which features to
keep and which to discard.
Backward Elimination
1. Select a significance level (SL) for the P-value, e.g. 5% (0.05)
2. Fit the model with all predictors
3. Consider the predictor with the highest P-value. If P > SL, go to
step 4; otherwise the model is ready
4. Remove the variable with P > SL
5. Fit the model without that variable and go to step 3
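A sketch of these steps in Python; the helper function is my own (not the slides' code), with p-values computed from the OLS fit via SciPy's t-distribution:

```python
import numpy as np
from scipy import stats

def backward_elimination(X, y, sl=0.05):
    """Drop the predictor with the highest p-value until all p <= sl.

    X must already contain a leading column of ones for beta_0.
    Returns the indices of the retained columns of X.
    """
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        Xc = X[:, cols]
        n, k = Xc.shape
        beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
        resid = y - Xc @ beta
        sigma2 = resid @ resid / (n - k)            # residual variance
        cov = sigma2 * np.linalg.inv(Xc.T @ Xc)     # covariance of beta-hat
        t_stats = beta / np.sqrt(np.diag(cov))
        p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k)
        worst = int(np.argmax(p_values))
        if p_values[worst] <= sl:
            break                                   # every predictor is significant
        cols.pop(worst)                             # steps 4-5: remove and refit
    return cols

# Toy data: y depends only on x1; the noise column x2 should usually be dropped
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=100), rng.normal(size=100)
y = 3.0 + 2.0 * x1 + rng.normal(scale=0.1, size=100)
X = np.column_stack([np.ones(100), x1, x2])
kept = backward_elimination(X, y)
print(kept)
```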
Forward selection
1. Select a significance level (SL) for the P-value, e.g. 5% (0.05)
2. Fit all the predictors y -> xn one at a time and select the one with the lowest P-value
3. Keep this variable and fit all possible models with one extra predictor, i.e. add one predictor to the variables you already have
4. Consider the predictor with the lowest P-value. If P < SL, go to step 3; otherwise finish (keep the previous model)
Bi-directional Elimination
1. Select a significance level to enter (SL_enter) and to stay (SL_stay) in the model
2. Perform the next step of forward selection (new variables must have P < SL_enter to enter)
3. Perform all steps of backward elimination (old variables must have P < SL_stay to stay in the model)
4. Repeat steps 2-3 until no new variables can enter and no old variables can exit
5. FIN: the model is ready
All possible models
1. Select a criterion of goodness of fit
2. Construct all possible models: with N variables there are 2^N − 1 of them
3. Select the model with the best criterion
4. The model is ready

• Very computationally intensive!

• We will be using the backward elimination strategy
MLR Implementation in Python

Importing the dataset
Dataset
• Total 50 samples

• Three independent variables:
 Administration
 Marketing spend
 State (categorical data)
 One-hot encoding
 Three categories, so three dummy variables

• One dependent variable:
 Profit
Code
• One-hot encoding to be applied on column 3
• 80/20 train/test split
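A sketch of those two steps with scikit-learn; the rows below are made-up stand-ins for the 50-sample dataset, with the categorical State in column 3:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# Made-up rows standing in for the dataset; column 3 is the categorical State
X = np.array([
    [165349.2, 136897.8, 471784.1, "New York"],
    [162597.7, 151377.6, 443898.5, "California"],
    [153441.5, 101145.6, 407934.5, "Florida"],
    [144372.4, 118671.9, 383199.6, "New York"],
    [142107.3,  91391.8, 366168.4, "Florida"],
], dtype=object)
y = np.array([192261.8, 191792.1, 191050.4, 182902.0, 166187.9])

# One-hot encode column 3; pass the numeric columns through unchanged
ct = ColumnTransformer([("state", OneHotEncoder(), [3])],
                       remainder="passthrough")
X_enc = ct.fit_transform(X)
print(X_enc.shape)        # (5, 6): 3 dummy columns + 3 numeric columns

# 80/20 train/test split, as in the slides
X_train, X_test, y_train, y_test = train_test_split(
    X_enc, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)
```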
Training and testing the model
Evaluating the model
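The original training and evaluation code screenshots are not reproduced here; a self-contained sketch on synthetic data (an assumed setup, not the slides' dataset) looks like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the encoded features (assumed, for illustration)
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression()      # fits beta_0 (the intercept) automatically
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = model.score(X_test, y_test)   # R^2 on the held-out 20%
print(r2)
```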
Some points to remember

Q: Do we need to normalize the data in MLR?
• A: No, we do not need to perform normalization for MLR, since the
coefficients b0, b1, b2, … in the MLR model automatically adjust to
the scale of each variable.

Q: Do we need to check the assumptions of linear regression?
• A: Not necessarily; for a new dataset, play and experiment with it.
If there are redundant features, the model will perform poorly.

Q: Do we need to use some strategy to avoid the dummy variable trap?
• A: The class used here in Python will automatically handle that.

Q: Do we have to use techniques such as backward elimination before
applying MLR?
• A: No, because the class we use will automatically handle that.
Example of Backward Elimination
1. Select a significance level (SL) for the P-value, e.g. 5% (0.05)
2. Fit the model with all predictors
3. Consider the predictor with the highest P-value. If P > SL, go to
step 4; otherwise the model is ready
4. Remove the variable with P > SL
5. Fit the model without that variable and go to step 3
Code
• Importing the dataset
• Dividing the dataset into independent and dependent variables
• One-hot encoding for the categorical data
• We do not need to cater for missing values, as there are none

Inserting beta_0 (a column of ones for the intercept)

x6 has the highest P-value, so it is removed

x5 now has the highest P-value, so it is removed

Now no independent variable has a P-value > 0.05, so we keep the remaining variables
Comparison between the two approaches using RMSE

The regression model with backward elimination shows lower RMSE
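The RMSE comparison itself is one line per model; a sketch with hypothetical predictions (illustrative numbers only, not the slides' results):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical test-set targets and predictions from the two models
y_true    = np.array([10.0, 12.0, 9.0, 14.0])
pred_full = np.array([11.0, 10.0, 9.5, 15.0])   # model with all predictors
pred_back = np.array([10.2, 11.5, 9.1, 14.3])   # after backward elimination

rmse_full = np.sqrt(mean_squared_error(y_true, pred_full))   # 1.25
rmse_back = np.sqrt(mean_squared_error(y_true, pred_back))   # ≈ 0.312
print(rmse_full, rmse_back)   # lower RMSE = better fit on the test set
```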


Plotting the output
Example of Forward Selection(Optional)
1. Select a significance level (SL) for P-value e.g. 5% (0.05)
2. Fit all the predictors y->xn one at a time and select one with the
lowest P-value
3. Keep this variable and fit all possible models with one extra
predictor i.e., add one predictor to the variables you already
have.
4. Consider the predictor with the least P-value. If P<SL, go to
step 3, otherwise finish. (keep the previous model)
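These steps can be sketched the same way as backward elimination; the helpers below are my own NumPy + SciPy code, not the slides' implementation:

```python
import numpy as np
from scipy import stats

def p_value_of_last(X, y):
    """Two-sided p-value of the most recently added column in an OLS fit."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t = beta[-1] / np.sqrt(cov[-1, -1])
    return 2 * stats.t.sf(abs(t), df=n - k)

def forward_selection(X, y, sl=0.05):
    """Repeatedly add the candidate predictor with the lowest p-value."""
    remaining = list(range(X.shape[1]))
    chosen = []
    ones = np.ones((len(y), 1))     # intercept column (beta_0)
    while remaining:
        pvals = {j: p_value_of_last(np.hstack([ones, X[:, chosen + [j]]]), y)
                 for j in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= sl:
            break                   # step 4: finish, keep the previous model
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: y depends only on column 0, which should be selected
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 1.0 + 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
picked = forward_selection(X, y)
print(picked)   # column 0 enters; the noise columns usually stay out
```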
