Multiple Linear Regression
Regression
Introduction to Machine Learning
Contents
1. What is multiple linear regression (MLR)
Design
Requirements
Sample size: >= 50 (at least 10 times as many cases as independent variables)
The formula for a multiple linear regression is:
y = B0 + B1X1 + B2X2 + … + BnXn + e
where:
• y = the predicted value of the dependent variable
• B0 = the y-intercept (value of y when all other parameters are set to 0)
• B1X1= the regression coefficient (B1) of the first independent variable (X1) (a.k.a.
the effect that increasing the value of the independent variable has on the predicted
y value)
• … = do the same for however many independent variables you are testing
• BnXn = the regression coefficient of the last independent variable
• e = model error (a.k.a. how much variation there is in our estimate of y)
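As a small illustration of how the formula produces a prediction, the sketch below plugs made-up coefficient values into y = B0 + B1X1 + … + BnXn; the numbers are purely illustrative, not taken from any fitted model:

import numpy as np

# Hypothetical coefficients and one observation (illustrative values only)
b0 = 2.0                         # intercept B0
b = np.array([0.5, -1.2, 3.0])   # B1, B2, B3
x = np.array([4.0, 1.5, 2.0])    # X1, X2, X3 for one observation

# Predicted value y = B0 + B1*X1 + B2*X2 + B3*X3 (the error term e is unobserved)
y_hat = b0 + b @ x
print(y_hat)   # 2.0 + 2.0 - 1.8 + 6.0 = 8.2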
Best-fit line
• To find the best-fit line for each independent variable, multiple linear regression calculates three things:
The regression coefficients that lead to the smallest overall model error.
The t-statistic of the overall model.
The associated p-value (how likely it is that the t-statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables was true).
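As a sketch of how these three quantities can be inspected in Python (assuming the statsmodels library; the data below is randomly generated for illustration only):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                 # two independent variables
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=50)

X_const = sm.add_constant(X)       # adds the intercept column (B0)
model = sm.OLS(y, X_const).fit()   # least-squares fit: coefficients with the smallest model error

print(model.params)    # regression coefficients B0, B1, B2
print(model.tvalues)   # t-statistic of each coefficient
print(model.pvalues)   # associated p-values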
T-test
• In statistics, the t-statistic is the ratio of the departure of the estimated value of a parameter from its hypothesized value to its standard error.
• It is used to evaluate whether two sets of data are statistically significantly different from each other.
• Q.1: Find the t-test value for the following given two sets of values:
• A = 7, 2, 9, 8 and
• B = 1, 2, 3, 4?
• Solution:
• For the first data set (A): number of terms n_1 = 4, mean = (7 + 2 + 9 + 8) / 4 = 6.5, sample variance s_1^2 = 29 / 3 ≈ 9.67
• For the second data set (B): n_2 = 4, mean = (1 + 2 + 3 + 4) / 4 = 2.5, sample variance s_2^2 = 5 / 3 ≈ 1.67
• Pooled variance: s^2 = ((n_1 - 1)s_1^2 + (n_2 - 1)s_2^2) / (n_1 + n_2 - 2) = (29 + 5) / 6 ≈ 5.67
• t = (mean_A - mean_B) / sqrt(s^2 (1/n_1 + 1/n_2)) = 4 / 1.68 ≈ 2.38
• Higher values of the t-value, also called t-score, indicate that a large difference exists
between the two sample sets. The smaller the t-value, the more similarity exists
between the two sample sets.
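The worked example above can be cross-checked in Python; this is a sketch assuming SciPy and a pooled two-sample t-test with equal variances:

from scipy import stats

A = [7, 2, 9, 8]
B = [1, 2, 3, 4]

# Two-sample t-test with pooled (equal) variances
t_stat, p_value = stats.ttest_ind(A, B, equal_var=True)
print(round(t_stat, 4))    # approximately 2.3764, matching the hand calculation
print(round(p_value, 4))   # two-sided p-value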
P-value
• P-value is the lowest significance level that results in rejecting the null hypothesis.
Example: Coin toss
Two possible outcomes
H0 = This is a fair coin
H1 = This is not a fair coin
• The P-value test will assume that the H0
hypothesis is true i.e., the coin is fair
• Let us assume our threshold value to be 5%
i.e., 0.05
• Let us assume the observed outputs are:
First toss output is Tail (probability = 0.5)
First toss output is Tail and second toss output is also Tail (probability = 0.25)
First two outputs same as before, third toss output is also Tail (probability = 0.125)
First three outputs same as before, fourth toss output is also Tail (probability = 0.0625)
First four outputs same as before, fifth toss output is also Tail (probability = 0.03125)
First five outputs same as before, sixth toss output is also Tail (probability = 0.015625)
After the fifth output the statistical test becomes significant, since a P-value of less than 5% indicates that the hypothesis H0 is rejected and the hypothesis H1 is accepted, i.e., the coin is not fair.
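The same reasoning can be reproduced with a few lines of Python, computing the probability of seeing only Tail outcomes after each toss under H0 (a fair coin):

alpha = 0.05                        # significance threshold (5%)
for n_tosses in range(1, 7):
    p = 0.5 ** n_tosses             # probability of n Tails in a row under H0
    print(n_tosses, round(p, 5), "significant" if p < alpha else "not significant")

# The probability first drops below 0.05 at the fifth toss (0.03125), so H0 is rejected there.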
Selecting the independent variables being used
• Five strategies are available for selecting the independent variables
All in
Backward Elimination
Forward Selection
Bi-directional elimination
Score Comparison (All possible combinations)
All in
• Use all features
• Prior knowledge (a data domain expert) tells you which features to keep and which to discard
Backward Elimination
1. Select a significance level (SL) for the P-value, e.g. 5% (0.05)
2. Fit the model with all predictors
3. Consider the predictor with the highest P-value. If P > SL, go to step 4; otherwise finish (all remaining predictors stay in the feature set)
4. Remove the predictor with P > SL
5. Fit the model without this variable and return to step 3 (terminate if no predictors remain)
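A minimal sketch of these steps using statsmodels p-values; the data is synthetic and the significance level of 0.05 follows the steps above:

import numpy as np
import statsmodels.api as sm

SL = 0.05
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                  # four candidate predictors
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=50)

cols = list(range(X.shape[1]))                # predictors currently in the model
while cols:
    model = sm.OLS(y, sm.add_constant(X[:, cols])).fit()   # steps 2/5: fit with current predictors
    pvals = model.pvalues[1:]                 # skip the intercept
    worst = int(np.argmax(pvals))             # step 3: predictor with the highest P-value
    if pvals[worst] > SL:
        cols.pop(worst)                       # step 4: remove it and refit
    else:
        break                                 # all remaining predictors are significant
print("Remaining predictors:", cols)          # predictors 0 and 2 are likely to survive here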
Forward Selection
1. Select a significance level (SL) for the P-value, e.g. 5% (0.05)
2. Fit simple regression models y -> xn, one predictor at a time, and select the one with the lowest P-value
3. Keep this variable and fit all possible models with one extra predictor, i.e., add one predictor to the variables you already have
4. Consider the new predictor with the lowest P-value. If P < SL, go to step 3; otherwise finish (keep the previous model)
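A comparable sketch of forward selection on synthetic data (again assuming statsmodels; column indices stand in for feature names):

import numpy as np
import statsmodels.api as sm

SL = 0.05
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = 1.0 + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=50)

selected, remaining = [], list(range(X.shape[1]))
while remaining:
    pvals = {}
    for j in remaining:                       # steps 2/3: try adding each candidate in turn
        model = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
        pvals[j] = model.pvalues[-1]          # p-value of the newly added predictor
    best = min(pvals, key=pvals.get)          # step 4: candidate with the lowest P-value
    if pvals[best] < SL:
        selected.append(best)
        remaining.remove(best)
    else:
        break                                 # no candidate enters; keep the previous model
print("Selected predictors:", selected)       # predictor 1 is likely to be selected here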
Bi-directional elimination
1. Select a significance level to enter (SL_Enter) and a significance level to stay (SL_Stay) in the model
2. Perform the next step of Forward Selection (new predictors must have P < SL_Enter to enter)
3. Perform all steps of Backward Elimination (existing predictors must have P < SL_Stay to stay)
4. Repeat steps 2 and 3 until no new predictors can enter and no existing predictors can leave
Python Implementation
Importing the dataset
Dataset
• Total of 50 samples
• 80/20 train/test split
Training and testing the model
Evaluating the model
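The code from the original slides is not reproduced here, so below is a minimal sketch of the workflow they describe: importing a dataset, an 80/20 train/test split, training a multiple linear regression model, and evaluating it. The file name "dataset.csv" and the assumption that the last column is the target are placeholders, not the actual dataset used in the lecture:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Importing the dataset (assumed layout: all columns numeric, last column is the target)
data = pd.read_csv("dataset.csv")
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# 80/20 split (with 50 samples: 40 for training, 10 for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Training the model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Testing and evaluating the model
y_pred = regressor.predict(X_test)
print(r2_score(y_test, y_pred))   # R^2 score on the test set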
Q: Do we need to normalize the data in MLR?
• A: No, we do not need to perform normalization for MLR, since the coefficients b0, b1, b2, … in the MLR model adapt to the scale of each feature and compensate for it automatically.
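As a quick illustration of this point, the sketch below (synthetic data, scikit-learn assumed) shows that standardizing the features changes the fitted coefficients but leaves the MLR predictions unchanged:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(loc=[10.0, 1000.0], scale=[2.0, 300.0], size=(50, 2))   # features on very different scales
y = 5.0 + 0.8 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(scale=0.5, size=50)

raw = LinearRegression().fit(X, y)
X_scaled = StandardScaler().fit_transform(X)
scaled = LinearRegression().fit(X_scaled, y)

print(raw.coef_, scaled.coef_)                                  # coefficients absorb the feature scales
print(np.allclose(raw.predict(X), scaled.predict(X_scaled)))    # True: identical predictions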