
Machine Learning

Logistic Regression

Mohamed Farah

2024-2025



Outline

1 Classification and Logistic Regression

2 Perceptron Learning Algorithm

3 Multi-class Logistic Regression or Softmax Regression



Classification and Logistic Regression



Classification Problem

In classification, the target variable y takes on discrete values.


Binary classification: y ∈ {0, 1}.
0 is called the negative class, 1 is the positive class.
They are also denoted by the symbols "−" and "+".
Example: Spam classification (spam = 1, not spam = 0).
Logistic regression is a common algorithm for binary classification.
Why not use linear regression? Linear regression can produce values outside the [0, 1] range, so its output cannot be interpreted as a probability, making it unsuitable for classification.



Logistic Regression

Linear regression is not suitable for classification.


Instead, use logistic regression, also known in the literature as logit regression, maximum-entropy classification (MaxEnt), or the log-linear classifier.
In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic (sigmoid) function:

$$g(z) = \frac{1}{1 + e^{-z}}$$



Plot of the Sigmoid Function

The sigmoid function smoothly transitions from 0 to 1.


It is commonly used in logistic regression for binary classification.
The midpoint of the sigmoid function is at z = 0, where g(0) = 0.5.
Properties of g(z):
g(z) → 1 as z → ∞.
g(z) → 0 as z → −∞.
g(z) is bounded between 0 and 1.
Logistic Regression

Hypothesis function for logistic regression:


$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

where $g(z) = \frac{1}{1 + e^{-z}}$.
The sigmoid function maps any real-valued number into the range [0,
1], making it suitable for probability estimation.
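
A minimal NumPy sketch of this hypothesis (the function names sigmoid and h_theta are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x), read as P(y = 1 | x; theta)."""
    return sigmoid(np.dot(theta, x))

# Example: a 3-dimensional input (including an intercept term x_0 = 1)
theta = np.array([-1.0, 2.0, 0.5])
x = np.array([1.0, 0.3, -0.2])
print(h_theta(theta, x))  # a value strictly between 0 and 1
```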



Derivative of the Sigmoid Function

Derivative of g(z):

$$g'(z) = \frac{d}{dz}\left(\frac{1}{1 + e^{-z}}\right)
= \frac{e^{-z}}{(1 + e^{-z})^2}
= \frac{1}{1 + e^{-z}} \cdot \left(1 - \frac{1}{1 + e^{-z}}\right)
= g(z)\,(1 - g(z)).$$

This property is useful for gradient-based optimization.


The derivative is maximized at z = 0, where g(0) = 0.5 and g'(0) = 0.25.
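
A quick numerical check of the identity g'(z) = g(z)(1 − g(z)) using a central finite difference; a small sketch, with the test point and step size chosen arbitrarily:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative: g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

z = 0.7
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
print(sigmoid_grad(z), numeric)  # the two values agree closely
print(sigmoid_grad(0.0))         # maximum value 0.25, attained at z = 0
```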



Probabilistic Assumptions
Assume $y \mid x; \theta \sim \mathrm{Bernoulli}(\phi)$ with

$$\phi = P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

or, in compact form,

$$p(y \mid x; \theta) = (h_\theta(x))^{y} \, (1 - h_\theta(x))^{1-y}$$

Likelihood of the parameters:

$$L(\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)$$

Log likelihood:

$$\ell(\theta) = \sum_{i=1}^{n} y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right)$$
Maximizing the log likelihood is equivalent to minimizing the
cross-entropy loss.
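
A vectorized sketch of ℓ(θ), assuming a design matrix X of shape (n, d) and a 0/1 label vector y (shapes and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)).

    X: (n, d) design matrix, y: (n,) vector of 0/1 labels.
    """
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Maximizing ell(theta) is the same as minimizing the mean cross-entropy loss:
def cross_entropy(theta, X, y):
    return -log_likelihood(theta, X, y) / X.shape[0]
```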
Compute θj maximizing the likelihood

$$\frac{\partial}{\partial \theta_j} \ell(\theta)
= \sum_{i=1}^{n} \left( y^{(i)} \frac{1}{h_\theta(x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - h_\theta(x^{(i)})} \right) \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})$$

$$= \sum_{i=1}^{n} \left( y^{(i)} \frac{1}{h_\theta(x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - h_\theta(x^{(i)})} \right) h_\theta(x^{(i)}) \left(1 - h_\theta(x^{(i)})\right) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}$$

$$= \sum_{i=1}^{n} \left( y^{(i)} \left(1 - h_\theta(x^{(i)})\right) - (1 - y^{(i)}) \, h_\theta(x^{(i)}) \right) x_j^{(i)}$$

$$= \sum_{i=1}^{n} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}.$$



Gradient Ascent

Gradient ascent is used to maximize the log-likelihood function.


Update rule for (batch) gradient ascent:

$$\theta := \theta + \alpha \nabla_\theta \ell(\theta)$$

Update rule for stochastic gradient ascent:

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

Similar in form to the LMS update rule, but here $h_\theta(x^{(i)})$ is a non-linear function of $\theta^T x^{(i)}$.


The learning rate α controls the step size of the updates.
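
A minimal batch gradient-ascent loop implementing θ := θ + α∇θℓ(θ); the 1/n scaling, step size, and iteration count are arbitrary choices for the sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient ascent on the log-likelihood.

    X: (n, d) design matrix (include a column of ones for the intercept),
    y: (n,) vector of 0/1 labels.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)      # predictions h_theta(x^(i)) for all i
        grad = X.T @ (y - h)        # gradient of the log-likelihood
        theta += alpha * grad / n   # ascent step (scaled by 1/n for stability)
    return theta

def predict(theta, X, threshold=0.5):
    return (sigmoid(X @ theta) >= threshold).astype(int)
```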



Perceptron Learning Algorithm



Perceptron Learning Algorithm

The perceptron algorithm is a simplified version of logistic regression.


The hypothesis function is:

$$h_\theta(x) = g(\theta^T x)$$

where $g(z)$ is the threshold function:

$$g(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}$$

The SGD update rule is:

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

The perceptron algorithm converges if the data is linearly separable.
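
A sketch of the perceptron learning algorithm, applying the update one example at a time; the epoch limit and early-stopping check are illustrative additions:

```python
import numpy as np

def threshold(z):
    """Hard threshold g(z): 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def fit_perceptron(X, y, alpha=1.0, n_epochs=100):
    """Perceptron learning: theta_j += alpha * (y_i - h(x_i)) * x_ij.

    Stops making updates only if the data is linearly separable.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_epochs):
        updated = False
        for i in range(n):
            pred = threshold(np.dot(theta, X[i]))
            if pred != y[i]:
                theta += alpha * (y[i] - pred) * X[i]
                updated = True
        if not updated:   # no mistakes in a full pass: converged
            break
    return theta
```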



Perceptron vs. Logistic Regression

The perceptron algorithm does not output probabilities.


It directly predicts 0 or 1 based on the threshold.
The perceptron is less flexible than logistic regression.
It is difficult to give probabilistic interpretations to perceptron
predictions.
Logistic regression provides a probabilistic framework, while the
perceptron is a deterministic algorithm.



Multi-class Logistic Regression
or Softmax Regression



Multi-class Classification

In multi-class classification, y can take on k values: y ∈ {1, 2, ..., k}.

Example: Classifying emails into spam, personal, and work-related.

We model p(y | x; θ) as a multinomial distribution.

One-vs-all (OvA) and softmax regression are common approaches.



One-vs-All in Multiclass Classification



One-vs-All in Multiclass Classification I

The one-vs-all (or one-vs-rest) strategy is used to extend binary classifiers to handle multiple classes.

Given a set of K classes, the approach involves training K binary classifiers, one for each class.

Training
For each class k ∈ {1, 2, ..., K}:
Treat class k as the positive class.
Treat all other classes {1, 2, ..., k − 1, k + 1, ..., K} as the negative class.
Train a binary classifier $f_k(x)$ to distinguish between class k and the rest.



One-vs-All in Multiclass Classification II

Prediction
For a new input x, the predicted class ŷ is determined by:

$$\hat{y} = \arg\max_{k \in \{1, 2, \ldots, K\}} f_k(x),$$

where $f_k(x)$ is the score or probability output by the k-th classifier.



One-vs-All in Multiclass Classification III

Example
Consider a multiclass problem with three classes: A, B, and C.
Classifier 1: $f_A(x)$ distinguishes A (positive) vs. B and C (negative).
Classifier 2: $f_B(x)$ distinguishes B (positive) vs. A and C (negative).
Classifier 3: $f_C(x)$ distinguishes C (positive) vs. A and B (negative).
The final prediction is the class with the highest score:

$$\hat{y} = \arg\max_{k \in \{A, B, C\}} f_k(x).$$



Softmax Regression



Multinomial Distribution

A multinomial distribution involves k numbers $\phi_1, \ldots, \phi_k$, specifying the probability of each outcome:

$$\phi_j = P(y = j \mid x; \theta).$$

These probabilities must satisfy the constraint:

$$\sum_{j=1}^{k} \phi_j = 1.$$

The multinomial distribution generalizes the Bernoulli distribution to multiple classes.



Softmax Regression

We need to design a parameterized model that outputs $\phi_1, \ldots, \phi_k$ satisfying this constraint given the input x.

We introduce k parameters $\theta_1, \ldots, \theta_k$, each being a vector in $\mathbb{R}^d$.

Intuitively, we want to use $\theta_1^\top x, \ldots, \theta_k^\top x$ to represent $\phi_1, \ldots, \phi_k$:

$$\phi_j = P(y = j \mid x; \theta)$$

However, there are two issues with this direct approach:

1. $\theta_j^\top x$ is not necessarily within [0, 1].
2. The summation $\sum_{j=1}^{k} \theta_j^\top x$ is not necessarily 1.



Softmax Function I

To address these issues, we use the softmax function to transform $(\theta_1^\top x, \ldots, \theta_k^\top x)$ into a valid probability vector $(\phi_1, \ldots, \phi_k)$:

$$\phi_j = \frac{\exp(\theta_j^\top x)}{\sum_{s=1}^{k} \exp(\theta_s^\top x)}.$$

The softmax function ensures:

$$\phi_j \in [0, 1] \text{ for all } j, \qquad \sum_{j=1}^{k} \phi_j = 1.$$

This gives us a valid probability distribution over the k classes.


The softmax function is a generalization of the sigmoid function for
multi-class classification.
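
A small sketch of the softmax transformation; the max-subtraction step is a standard numerical-stability trick rather than something stated on the slide:

```python
import numpy as np

def softmax(scores):
    """Map a vector of scores (theta_j^T x for each class j) to probabilities.

    Subtracting the max score leaves the result unchanged mathematically
    but avoids overflow in exp() for large scores.
    """
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

# Example: three class scores
phi = softmax(np.array([2.0, 1.0, -0.5]))
print(phi, phi.sum())  # probabilities in [0, 1] that sum to 1
```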
Softmax Function II

The Softmax Gradient


1. Let $z_j = \theta_j^T x$. Then, the softmax probability can be written as:

$$P(y = j \mid x; \theta) = \frac{\exp(z_j)}{\sum_{s=1}^{k} \exp(z_s)}.$$

2. The derivative of $P(y = j \mid x; \theta)$ with respect to $z_j$ is:

$$\frac{\partial P(y = j \mid x; \theta)}{\partial z_j} = P(y = j \mid x; \theta) \cdot \left(1 - P(y = j \mid x; \theta)\right).$$

3. Using the chain rule, the gradient of $P(y = j \mid x; \theta)$ with respect to $\theta_j$ is:

$$\nabla_{\theta_j} P(y = j \mid x; \theta) = \frac{\partial P(y = j \mid x; \theta)}{\partial z_j} \cdot \nabla_{\theta_j} z_j.$$



Softmax Function III

4. Since $z_j = \theta_j^T x$, we have:

$$\nabla_{\theta_j} z_j = x.$$

5. Combining these results, the gradient of the softmax probability is:

$$\nabla_{\theta_j} P(y = j \mid x; \theta) = P(y = j \mid x; \theta) \cdot \left(1 - P(y = j \mid x; \theta)\right) \cdot x.$$



Cost function I

The probability of y = j given x is:

$$P(y = j \mid x; \theta) = \frac{\exp(\theta_j^T x)}{\sum_{s=1}^{k} \exp(\theta_s^T x)}$$

The cost function $J(\theta)$:

$$J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} I\{y^{(i)} = j\} \log P(y^{(i)} = j \mid x^{(i)}; \theta)$$

where $I\{y^{(i)} = j\}$ is the indicator function, which is 1 if $y^{(i)} = j$ and 0 otherwise.
The term $I\{y^{(i)} = j\}$ ensures that only the probability of the correct class j (for the i-th example) contributes to the cost.
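
A sketch of this cost with a parameter matrix Theta holding one row per class; shapes, names, and the stability shift are assumptions of the sketch. The indicator double-sum simply picks out the log-probability of each example's true class:

```python
import numpy as np

def softmax_probs(Theta, X):
    """Class probabilities P(y = j | x; theta) for every example.

    Theta: (k, d) matrix with one parameter vector per class, X: (n, d).
    """
    scores = X @ Theta.T                             # (n, k) scores theta_j^T x
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def cost(Theta, X, y):
    """J(Theta) = -(1/n) * sum_i log P(y_i | x_i); y holds class indices 0..k-1."""
    n = X.shape[0]
    probs = softmax_probs(Theta, X)
    return -np.mean(np.log(probs[np.arange(n), y]))
```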



Cost function II

Gradient of the Cost Function: The gradient of the cost function $J(\theta)$ with respect to $\theta_j$ is:

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta_j} \left[ \sum_{l=1}^{k} I\{y^{(i)} = l\} \log P(y^{(i)} = l \mid x^{(i)}; \theta) \right]$$

Using the chain rule and the gradient of the softmax probability (every term depends on $\theta_j$ through the normalizing denominator), we get:

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} x^{(i)} \left( I\{y^{(i)} = j\} - P(y^{(i)} = j \mid x^{(i)}; \theta) \right).$$

N.B.: The difference $I\{y^{(i)} = j\} - P(y^{(i)} = j \mid x^{(i)}; \theta)$ represents the error between the true label and the predicted probability.
This gradient is used to update the parameters $\theta_j$ during training.



Cost function III

The update rule for gradient descent is:

$$\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)$$
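
Putting the gradient and the update rule together, a minimal gradient-descent loop for softmax regression; the softmax_probs helper is the illustrative one from the cost-function sketch, repeated here so the block stands alone, and alpha and n_iters are arbitrary:

```python
import numpy as np

def softmax_probs(Theta, X):
    """P(y = j | x; theta) for every example; Theta is (k, d), X is (n, d)."""
    scores = X @ Theta.T
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, y, k, alpha=0.5, n_iters=1000):
    """Gradient descent on J(Theta); y holds integer class labels in {0, ..., k-1}."""
    n, d = X.shape
    Theta = np.zeros((k, d))
    Y = np.eye(k)[y]                      # one-hot matrix: Y[i, j] = I{y_i = j}
    for _ in range(n_iters):
        probs = softmax_probs(Theta, X)   # (n, k) predicted probabilities
        grad = -(Y - probs).T @ X / n     # gradient of J w.r.t. each theta_j, stacked
        Theta -= alpha * grad             # theta_j := theta_j - alpha * grad_j
    return Theta

def predict_classes(Theta, X):
    return np.argmax(X @ Theta.T, axis=1)
```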

