
Regression Analysis

Contents

• What is Regression?
• Why Regression?
• Linear Regression
  – Linear regression algorithm using the least squares method
  – Evaluation of the method
• Multiple Linear Regression
• Logistic Regression
What is Regression?

• Regression is a supervised learning algorithm in machine learning terminology.
• It is an important tool in predictive analytics.
• Regression analysis is a predictive modeling technique that investigates the relationship between a dependent variable and one or more independent variables.
• It amounts to graphing a line over a set of data points that most closely fits the overall shape of the data.
• The regression shows how changes in the dependent variable on the Y axis relate to changes in the explanatory variable on the X axis.
What is Regression?

• Regression is a tool for finding the existence of an association relationship between a dependent variable (Y) and one or more independent variables (X1, X2, ..., Xn) in a study.
• The relationship can be linear or non-linear.
• A dependent variable (response variable) "measures an outcome of a study (also called outcome variable)".
• An independent variable (explanatory variable) "explains changes in a response variable".
Types of Regression

• Simple regression: one independent variable.
• Multiple regression: more than one independent variable.
Most Common Regression Algorithms

● Simple linear regression
● Multiple linear regression
● Polynomial regression
● Multivariate adaptive regression splines
● Logistic regression
● Maximum likelihood estimation (least squares)
Use Cases of Regression

• Predictive analytics
• Operational efficiency
• Supporting decisions
• Correcting errors
• New insights
• House price prediction
• Trend forecasting
  – e.g., what will be the price of gold in the next six months?
• Finding associations among attributes
  – e.g., mediclaim agencies: the effect of age on claims
Linear Regression

• Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables (also known as the dependent and independent variables).
• The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression.
• In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data.
• Linear regression models are often fitted using the least squares approach.
Simple Linear Regression

• One of the easiest algorithms in machine learning.
• Simple linear regression is a statistical model that attempts to show the relationship between two variables through a linear equation.
• Data is modeled using a straight line (Y = mX + c).
• It captures the correlation between the X and Y variables.
Simple Linear Regression: Understanding a Positive Relationship

Figure: a straight line Y = mX + c fitted over data points relating the speed of a vehicle to the distance travelled in a fixed duration of time; m is the slope of the line and c is its y-intercept.
Slopes of the Simple Linear Regression Model

The slope of the relationship can be linear positive, linear negative, curvilinear positive, or curvilinear negative.

Slope = (Y2 − Y1) / (X2 − X1) = ΔY / ΔX = change in Y / change in X

Example:
(X1, Y1) = (−3, −2) and (X2, Y2) = (2, 2)
Rise = (Y2 − Y1) = (2 − (−2)) = 2 + 2 = 4
Run = (X2 − X1) = (2 − (−3)) = 2 + 3 = 5
Slope = Rise/Run = 4/5 = 0.8
Relations in Regression

Four typical relationships: linear positive slope, linear negative slope, curvilinear positive slope, and curvilinear negative slope.

Simple Linear Regression:
Least Squares Method

• How do we find the best regression line y = mx + c?
• Our challenge is to determine the values of m and c that give the minimum error for the given dataset. We do this using the least squares method.
• Loss function:

L(m, c) = Σ (yi − (m·xi + c))²

• For minimum loss, we take the partial derivatives of L with respect to m and c, equate them to 0, and solve for m and c, as derived below.
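Setting both partial derivatives to zero gives two normal equations whose solution is the familiar closed form. A short reconstruction of that derivation in standard notation (x̄ and ȳ are the sample means):

$$
\frac{\partial L}{\partial m} = -2\sum_{i=1}^{n} x_i \left(y_i - m x_i - c\right) = 0, \qquad
\frac{\partial L}{\partial c} = -2\sum_{i=1}^{n} \left(y_i - m x_i - c\right) = 0
$$

Solving these simultaneously:

$$
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
c = \bar{y} - m\,\bar{x}
$$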
Simple Linear Regression:
Least Squares Method (Example)

• A method to predict the best-fit line.
Simple Linear Regression

• Measure of goodness of fit: the R² method. R² = 1 − SSE/SST measures the proportion of the variation in Y that is explained by the regression model.
OLS Algorithm

● Step 1: Calculate the mean of X and Y.
● Step 2: Calculate the deviations of X and Y from their means.
● Step 3: Get the product of the paired deviations.
● Step 4: Get the summation of the products.
● Step 5: Square the deviations of X.
● Step 6: Get the sum of the squared deviations.
● Step 7: Divide the output of step 4 by the output of step 6 to calculate the slope 'b'.
● Step 8: Calculate the intercept 'a' using the value of 'b' (a = Mean(Y) − b·Mean(X)).

A sketch implementing these steps follows.
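Below is a minimal sketch of the eight steps in plain Python. The dataset is hypothetical, chosen only for illustration:

```python
# A minimal sketch of the OLS steps above; the data is hypothetical.
X = [16, 18, 19, 20, 21, 22, 23]
Y = [40, 50, 54, 58, 60, 64, 68]

n = len(X)

# Step 1: means of X and Y
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Step 2: deviations of X and Y from their means
dev_x = [x - mean_x for x in X]
dev_y = [y - mean_y for y in Y]

# Steps 3-4: product of the paired deviations, then their sum
sum_products = sum(dx * dy for dx, dy in zip(dev_x, dev_y))

# Steps 5-6: squared deviations of X, then their sum
sum_sq_x = sum(dx ** 2 for dx in dev_x)

# Step 7: slope b
b = sum_products / sum_sq_x

# Step 8: intercept a from b and the means
a = mean_y - b * mean_x

print(f"Best-fit line: Y = {a:.2f} + {b:.2f} X")
```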
Example of Simple Linear Regression

Calculation summary:
Sum of X = 299
Sum of Y = 852
Mean of X, Mx = 19.93
Mean of Y, My = 56.8
Error in Simple Regression

Y = (a + bX) + ε

A residual is the distance between the predicted point (on the regression line) and the actual point, as depicted in the scatter plot of the example data with the fitted regression line.

Sum of squares of residuals (SSE):

SSE = Σ (yi − ŷi)²
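A minimal sketch computing residuals, SSE, and R² for a fitted line; the data and the coefficients a and b are hypothetical:

```python
# Residuals, SSE, and R^2 for a fitted line y_hat = a + b*x.
# The data and coefficients are hypothetical, for illustration only.
X = [16, 18, 19, 20, 21, 22, 23]
Y = [40, 50, 54, 58, 60, 64, 68]
a, b = -22.0, 4.0  # hypothetical intercept and slope

y_hat = [a + b * x for x in X]                   # predictions on the line
residuals = [y - yh for y, yh in zip(Y, y_hat)]  # actual minus predicted

sse = sum(r ** 2 for r in residuals)             # sum of squared residuals
mean_y = sum(Y) / len(Y)
sst = sum((y - mean_y) ** 2 for y in Y)          # total sum of squares
r_squared = 1 - sse / sst                        # goodness of fit

print(f"SSE = {sse:.2f}, R^2 = {r_squared:.3f}")
```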
Multiple Linear Regression

• Two or more independent variables (predictors) are involved in the model.

• In the example of simple linear regression, we considered the Price of a Property as the dependent variable and the Area of the Property (in sq. m.) as the predictor variable.

• If we instead consider the Price of a Property (in $) as the dependent variable and the Area of the Property (in sq. m.), location, floor, number of years since purchase, and amenities available as the independent variables, we can form a multiple regression equation as shown below:

Price = a + b1·Area + b2·Location + b3·Floor + b4·Years + b5·Amenities
• The simple linear regression model and the multiple regression model assume that the dependent variable is continuous.

• The following expression describes the equation involving the relationship with two predictor variables, namely X1 and X2:

Ŷ = a + b1·X1 + b2·X2

• The model describes a plane in the three-dimensional space of Ŷ, X1, and X2.

• Parameter 'a' is the intercept of this plane. Parameters 'b1' and 'b2' are referred to as partial regression coefficients.

• Parameter b1 represents the change in the mean response corresponding to a unit change in X1 when X2 is held constant.

• Parameter b2 represents the change in the mean response corresponding to a unit change in X2 when X1 is held constant.
• Consider the following example of a multiple linear regression model with two predictor variables, namely X1 and X2, sketched below.
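A minimal sketch of fitting such a two-predictor model with NumPy's least squares solver; the data here is hypothetical:

```python
import numpy as np

# Hypothetical data: two predictors X1, X2 and a response Y.
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y = np.array([6.1, 6.9, 11.2, 11.8, 16.3, 16.9])

# Design matrix with a column of ones for the intercept 'a'.
A = np.column_stack([np.ones_like(X1), X1, X2])

# Least squares solution for [a, b1, b2].
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b1, b2 = coef

print(f"Y_hat = {a:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")
```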
The multiple regression estimating equation when there are 'n' predictor variables is as follows:

Ŷ = a + b1·X1 + b2·X2 + ... + bn·Xn

While finding the best fit, we can also fit a polynomial or a curvilinear function instead of a straight line; these are known as polynomial and curvilinear regression, respectively.
Assumptions in Regression Analysis

1. The dependent variable (Y) can be calculated/predicted as a linear function of a specific set of independent variables (X's) plus an error term (ε).
2. The number of observations (n) is greater than the number of parameters (k) to be estimated, i.e. n > k.
3. Relationships determined by regression are only relationships of association based on the data set, and not necessarily of cause and effect of the defined class.
4. The regression line is valid only over a limited range of data. If the line is extrapolated outside that range, it may only lead to wrong predictions.
5. If the business conditions change and the business assumptions underlying the regression model are no longer valid, then the past data set will no longer be able to predict future trends.
6. The variance of the error term is the same for all values of X (homoskedasticity).
7. The error term (ε) is normally distributed. This also means that the mean of the error (ε) has an expected value of 0.
8. The values of the error (ε) are independent and are not related to any values of X. This means that there are no relationships between a particular X, Y that are related to another specific value of X, Y.

Given the above assumptions, the OLS estimator is the Best Linear Unbiased Estimator (BLUE); this result is known as the Gauss–Markov theorem.
Main Problems in Regression Analysis

• Two primary problems: multicollinearity and heteroskedasticity.

Multicollinearity
• Two variables are perfectly collinear if there is an exact linear relationship between them.
• Multicollinearity is the situation in which there is not only correlation between the dependent variable and the independent variables, but also strong correlation within (among) the independent variables themselves.
• A multiple regression equation can still make good predictions when there is multicollinearity, but it is difficult for us to determine how the dependent variable will change if each independent variable is changed one at a time.
• When multicollinearity is present, it increases the standard errors of the coefficients.
• One way to gauge multicollinearity is to calculate the Variance Inflation Factor (VIF), which assesses how much the variance of an estimated regression coefficient increases if the predictors are correlated (see the sketch at the end of this section).
• If no factors are correlated, the VIFs will all be equal to 1.
• The assumption of no perfect collinearity states that there is no exact linear relationship among the independent variables.
• This assumption implies two aspects of the data on the independent variables.
• First, none of the independent variables, other than the variable associated with the intercept term, can be a constant.
• Second, variation in the X's is necessary.
• In general, the more variation in the independent variables, the better the OLS estimates will be in terms of identifying the impacts of the different independent variables on the dependent variable.
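A minimal sketch of computing VIFs with NumPy, using the standard definition VIF_j = 1 / (1 − R_j²), where R_j² is the R² from regressing predictor j on the remaining predictors; the data is hypothetical:

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS regression of y on X (X includes an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def vif(predictors):
    """VIF_j = 1 / (1 - R_j^2), regressing each predictor on all the others."""
    n, k = predictors.shape
    out = []
    for j in range(k):
        y = predictors[:, j]
        others = np.delete(predictors, j, axis=1)
        X = np.column_stack([np.ones(n), others])
        out.append(1.0 / (1.0 - r_squared(X, y)))
    return out

# Hypothetical predictors; x2 is strongly correlated with x1, so both
# should show inflated VIFs, while x3 stays near 1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)
x3 = rng.normal(size=100)
print([round(v, 2) for v in vif(np.column_stack([x1, x2, x3]))])
```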

Heteroskedasticity
• Refers to changing variance of the error term.
• If the variance of the error term is not constant across observations, the model will make erroneous predictions.
• In general, for a regression equation to make accurate predictions, the error terms should be independent and identically (normally) distributed (i.i.d.).
Improving Accuracy of the Linear Regression Model

• Accuracy refers to how close the estimate is to the actual value, whereas prediction refers to the continuous estimation of the value.

High bias = low accuracy (not close to the real value)
High variance = low prediction (values are scattered)
Low bias = high accuracy (close to the real value)
Low variance = high prediction (values are close to each other)

• If we have a regression model that is both highly accurate and highly predictive, the overall error of the model will be low, implying low bias (high accuracy) and low variance (high prediction). This is highly preferable.
Improving Accuracy of the Linear Regression Model

The accuracy of linear regression can be improved using the following three methods:

1. Shrinkage approach (see the sketch below)
2. Subset selection
3. Dimensionality (variable) reduction
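A minimal sketch of the shrinkage approach using ridge regression (one common shrinkage method) via scikit-learn's Ridge estimator; the data is hypothetical:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data: 100 samples, 5 predictors, noisy linear response.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=100)

# Ridge adds an L2 penalty (alpha * sum of squared coefficients) to the
# least squares loss, shrinking coefficients toward zero to reduce variance.
model = Ridge(alpha=1.0)
model.fit(X, y)

print("shrunken coefficients:", np.round(model.coef_, 2))
```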
Polynomial Regression Model

• An extension of the simple linear model, obtained by adding extra predictors formed by raising each of the original predictors to a power (e.g., squaring or cubing).
• This approach provides a simple way to yield a non-linear fit to data. For example, for degree 3:

Y = a + b1·X + b2·X² + b3·X³

• Let us fit a degree 3 polynomial to a sample data set of (X, Y) points.
• As you can observe, the regression line is slightly curved for a polynomial of degree 3 with the 15 data points.
• The regression line will curve further if we increase the polynomial degree.
• At extreme degrees, the regression line will overfit all of the original values of X, as illustrated below.
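A minimal sketch contrasting a degree 3 fit with an extreme-degree fit using numpy.polyfit; the 15 points are hypothetical:

```python
import numpy as np

# 15 hypothetical (X, Y) points following a noisy cubic trend.
rng = np.random.default_rng(7)
x = np.linspace(-3, 3, 15)
y = 0.5 * x**3 - x + rng.normal(scale=1.0, size=15)

# Degree 3: a gently curved fit.
p3 = np.polyfit(x, y, deg=3)

# Degree 14: with 15 points the curve can pass through every point,
# i.e. overfit. (NumPy may warn that this fit is poorly conditioned.)
p14 = np.polyfit(x, y, deg=14)

for name, p in [("degree 3", p3), ("degree 14", p14)]:
    sse = np.sum((np.polyval(p, x) - y) ** 2)
    print(f"{name}: SSE on training data = {sse:.4f}")
```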
What is Logistic Regression?

• Logistic regression is a classification algorithm.
• Logistic regression is all about predicting binary variables, not predicting continuous variables.
• Logistic regression models estimate how the probability of an event may be affected by one or more explanatory variables.
• Logistic regression is a technique used for predicting "class probability", that is, the probability that a case belongs to a particular class.
Use Cases of Logistic Regression

• Mail [Spam / Not Spam]
• Transaction [Fraudulent / Normal]
• Tumor [Malignant / Benign]
• Sentiment analysis [Positive / Negative]
• Weather prediction [Rain / No Rain]
• Medical diagnosis [Fit / Ill]
Linear and Logistic Regression

Whereas linear regression fits a straight line, logistic regression fits a logistic curve: a sigmoid (S-shaped) curve that maps any real-valued input to a probability between 0 and 1.
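A minimal sketch of the sigmoid, σ(z) = 1 / (1 + e^(−z)), which maps any real input into (0, 1):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic (sigmoid) function: maps any real z to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The S-shape: large negative z -> near 0, zero -> 0.5, large positive z -> near 1.
for z in [-6, -2, 0, 2, 6]:
    print(f"sigmoid({z:+d}) = {sigmoid(z):.4f}")
```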
Some Fundamental Terms of Logistic Regression

• The probability that an event will occur is the fraction of times you expect to see that event in many trials. If the probability of an event occurring is Y, then the probability of the event not occurring is 1 − Y. Probabilities always range between 0 and 1.
• The odds are defined as the probability that the event will occur divided by the probability that the event will not occur. Unlike probability, the odds are not constrained to lie between 0 and 1, but can take any value from zero to infinity.
• If the probability of success is P, then the odds of that event are:

odds = P / (1 − P)

• The logit function is the logarithmic transformation of the logistic function. It is defined as the natural logarithm of the odds:

logit(P) = ln(P / (1 − P))
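The logit and the logistic (sigmoid) function are inverses of each other, so the model can be written in either form; a short reconstruction in standard notation:

$$
\text{logit}(p) = \ln\frac{p}{1-p} = z
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{1 + e^{z}}
$$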
Math behind Logistic Regression

In standard form, logistic regression models the log-odds of the positive class as a linear function of the input:

ln(P / (1 − P)) = a + bX

Solving for P gives the class probability:

P = e^(a + bX) / (1 + e^(a + bX)) = 1 / (1 + e^(−(a + bX)))
• Let us say we have a model that can predict whether a person is male or female on the basis of their height.
• Given a height of 150 cm, we need to predict whether the person is male or female.
• We know that the coefficients are a = −100 and b = 0.6.
• Using the above equation, we can calculate the probability of male given a height of 150 cm, or more formally P(male | height = 150):

a + b·height = −100 + 0.6 × 150 = −10
P(male | height = 150) = 1 / (1 + e^10) ≈ 0.0000454

or a probability of near zero that the person is a male.
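A minimal sketch checking this arithmetic in Python:

```python
import math

# Coefficients from the example: a = -100, b = 0.6.
a, b = -100.0, 0.6
height = 150.0

z = a + b * height                    # linear part: -100 + 0.6*150 = -10
p_male = 1.0 / (1.0 + math.exp(-z))   # sigmoid of z

print(f"P(male | height=150) = {p_male:.7f}")  # ~0.0000454, near zero
```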


Linear vs Logistic Regression

Basis | Linear Regression | Logistic Regression
Core concept (modeling of data) | Data is modeled using a straight line. | Data is modeled using a logistic (sigmoid) function.
Used with | Continuous variable | Categorical variable
Output/prediction | Value of the variable | Probability of occurrence of an event
Problem solved | Regression | Classification
Accuracy (goodness of fit) | Loss, R², adjusted R², etc. | Accuracy, precision, recall, F1 score, ROC curve, confusion matrix, etc.

• The basic difference is the type of function used for mapping:
  – Linear: continuous X → continuous Y
  – Logistic: continuous X → binary Y, used for deciding category or true/false decisions about the data
Parameter Estimation by
Maximum Likelihood Method

● The coefficients in a logistic regression are estimated using a process called Maximum Likelihood Estimation (MLE).

● Likelihood function (the probability of the observed labels y1, ..., yn given the predicted probabilities p1, ..., pn):

L(β0, β1) = ∏ pi^yi · (1 − pi)^(1 − yi)
Parameter Estimation by
Maximum Likelihood Method

• The probability density function for binary logistic regression is given by:

P(Y = yi) = pi^yi · (1 − pi)^(1 − yi),  where pi = 1 / (1 + e^(−(β0 + β1·xi)))
Parameter Estimation by
Maximum Likelihood Method

Setting the derivatives of the log-likelihood with respect to β0 and β1 to zero gives the score equations:

Σ (yi − pi) = 0  and  Σ xi·(yi − pi) = 0

The above system of equations is solved iteratively to estimate β0 and β1, as sketched below.
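A minimal sketch of one such iterative scheme, plain gradient ascent on the log-likelihood (textbook treatments often use Newton–Raphson instead); the data is hypothetical:

```python
import math

# Hypothetical 1-D data: label 1 becomes likelier as x grows.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0,   0,   0,   1,   0,   1,   1,   1]

b0, b1 = 0.0, 0.0   # initial coefficients
lr = 0.01           # learning rate (step size)

for _ in range(20000):
    # Gradient of the log-likelihood: sum(y_i - p_i) and sum(x_i * (y_i - p_i)),
    # i.e. the left-hand sides of the score equations above.
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        g0 += y - p
        g1 += x * (y - p)
    b0 += lr * g0   # ascend the log-likelihood
    b1 += lr * g1

print(f"beta0 = {b0:.3f}, beta1 = {b1:.3f}")
```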
References

• Coursera tutorial: Linear Regression, Logistic Regression
• SimpliLearn tutorial: Logistic Regression
• Wikipedia: Linear regression
• Business Analytics, by U. Dinesh Kumar

Thank You
