
Machine Learning

Logistic Regression

Mohamed Farah

2024-2025



Outline

1 Classification and Logistic Regression

2 Perceptron Learning Algorithm

3 Multi-class Logistic Regression or Softmax Regression



Classification and Logistic Regression



Classification Problem

In classification, the target variable y takes on discrete values.


Binary classification: y ∈ {0, 1}.
0 is called the negative class, 1 is the positive class.
They are also denoted by the symbols "−" and "+".
Example: Spam classification (spam = 1, not spam = 0).
Logistic regression is a common algorithm for binary classification.
Why not use linear regression? Linear regression can produce values outside the [0, 1] range, so its output cannot be interpreted as a probability, making it unsuitable for classification.



Logistic Regression

Linear regression is not suitable for classification.


Instead, use logistic regression, also known in the literature as logit regression, maximum-entropy classification (MaxEnt), or the log-linear classifier.
In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic (sigmoid) function:

$$g(z) = \frac{1}{1 + e^{-z}}$$



Plot of the Sigmoid Function

The sigmoid function smoothly transitions from 0 to 1.


It is commonly used in logistic regression for binary classification.
The midpoint of the sigmoid function is at z = 0, where g(0) = 0.5.
Properties of g(z):
g(z) → 1 as z → ∞.
g(z) → 0 as z → −∞.
g(z) is bounded between 0 and 1.
Logistic Regression

Hypothesis function for logistic regression:


$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

where $g(z) = \frac{1}{1 + e^{-z}}$.
The sigmoid function maps any real-valued number into the range [0,
1], making it suitable for probability estimation.
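
A minimal NumPy sketch of this hypothesis (the function names sigmoid and h_theta are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x), read as P(y = 1 | x; theta)."""
    return sigmoid(np.dot(theta, x))

# Example: a 3-dimensional input (including an intercept term x_0 = 1)
theta = np.array([-1.0, 2.0, 0.5])
x = np.array([1.0, 0.3, -0.2])
print(h_theta(theta, x))  # a value strictly between 0 and 1
```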



Derivative of the Sigmoid Function

Derivative of g(z):

$$g'(z) = \frac{d}{dz}\left(\frac{1}{1 + e^{-z}}\right)
= \frac{e^{-z}}{(1 + e^{-z})^2}
= \frac{1}{1 + e^{-z}} \cdot \left(1 - \frac{1}{1 + e^{-z}}\right)
= g(z)\,(1 - g(z)).$$

This property is useful for gradient-based optimization.


The derivative is maximized at z = 0, where g(0) = 0.5 and g'(0) = 0.25.
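
A quick numerical check of the identity g'(z) = g(z)(1 − g(z)) using a central finite difference; a small sketch, with the test point and step size chosen arbitrarily:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Analytic derivative: g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

z = 0.7
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
print(sigmoid_grad(z), numeric)  # the two values agree closely
print(sigmoid_grad(0.0))         # maximum value 0.25, attained at z = 0
```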



Probabilistic Assumptions
Assume $y \mid x; \theta \sim \mathrm{Bernoulli}(\phi)$ with

$$\phi = P(y = 1 \mid x; \theta) = h_\theta(x), \qquad P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

or, in compact form,

$$p(y \mid x; \theta) = (h_\theta(x))^{y} \, (1 - h_\theta(x))^{1-y}$$

Likelihood of the parameters:

$$L(\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta)$$

Log likelihood:

$$\ell(\theta) = \sum_{i=1}^{n} y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right)$$
Maximizing the log likelihood is equivalent to minimizing the
cross-entropy loss.
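
A vectorized sketch of ℓ(θ), assuming a design matrix X of shape (n, d) and a 0/1 label vector y (shapes and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)).

    X: (n, d) design matrix, y: (n,) vector of 0/1 labels.
    """
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Maximizing ell(theta) is the same as minimizing the mean cross-entropy loss:
def cross_entropy(theta, X, y):
    return -log_likelihood(theta, X, y) / X.shape[0]
```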
Compute θj maximizing the likelihood

$$\frac{\partial}{\partial \theta_j} \ell(\theta)
= \sum_{i=1}^{n} \left( y^{(i)} \frac{1}{h_\theta(x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - h_\theta(x^{(i)})} \right) \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})$$

$$= \sum_{i=1}^{n} \left( y^{(i)} \frac{1}{h_\theta(x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - h_\theta(x^{(i)})} \right) h_\theta(x^{(i)}) \left(1 - h_\theta(x^{(i)})\right) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}$$

$$= \sum_{i=1}^{n} \left( y^{(i)} \left(1 - h_\theta(x^{(i)})\right) - (1 - y^{(i)}) \, h_\theta(x^{(i)}) \right) x_j^{(i)}$$

$$= \sum_{i=1}^{n} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}.$$



Gradient Ascent

Gradient ascent is used to maximize the log-likelihood function.


Update rule for (batch) gradient ascent:

$$\theta := \theta + \alpha \nabla_\theta \ell(\theta)$$

Update rule for stochastic gradient ascent:

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

Similar in form to the LMS update rule, but here $h_\theta(x^{(i)})$ is a non-linear function of $\theta^T x^{(i)}$.


The learning rate α controls the step size of the updates.
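
A minimal batch gradient-ascent loop implementing θ := θ + α∇θℓ(θ); the 1/n scaling, step size, and iteration count are arbitrary choices for the sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient ascent on the log-likelihood.

    X: (n, d) design matrix (include a column of ones for the intercept),
    y: (n,) vector of 0/1 labels.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)      # predictions h_theta(x^(i)) for all i
        grad = X.T @ (y - h)        # gradient of the log-likelihood
        theta += alpha * grad / n   # ascent step (scaled by 1/n for stability)
    return theta

def predict(theta, X, threshold=0.5):
    return (sigmoid(X @ theta) >= threshold).astype(int)
```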



Perceptron Learning Algorithm



Perceptron Learning Algorithm

The perceptron algorithm is a simplified version of logistic regression.


The hypothesis function is:

$$h_\theta(x) = g(\theta^T x)$$

where $g(z)$ is the threshold function:

$$g(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}$$

The SGD update rule is:

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

The perceptron algorithm converges if the data is linearly separable.
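
A sketch of the perceptron learning algorithm, applying the update one example at a time; the epoch limit and early-stopping check are illustrative additions:

```python
import numpy as np

def threshold(z):
    """Hard threshold g(z): 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def fit_perceptron(X, y, alpha=1.0, n_epochs=100):
    """Perceptron learning: theta_j += alpha * (y_i - h(x_i)) * x_ij.

    Stops making updates only if the data is linearly separable.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_epochs):
        updated = False
        for i in range(n):
            pred = threshold(np.dot(theta, X[i]))
            if pred != y[i]:
                theta += alpha * (y[i] - pred) * X[i]
                updated = True
        if not updated:   # no mistakes in a full pass: converged
            break
    return theta
```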



Perceptron vs. Logistic Regression

The perceptron algorithm does not output probabilities.


It directly predicts 0 or 1 based on the threshold.
The perceptron is less flexible than logistic regression.
It is difficult to give probabilistic interpretations to perceptron
predictions.
Logistic regression provides a probabilistic framework, while the
perceptron is a deterministic algorithm.



Multi-class Logistic Regression
or Softmax Regression



Multi-class Classification

In multi-class classification, y can take on k values: y ∈ {1, 2, ..., k}.

Example: Classifying emails into spam, personal, and work-related.

We model p(y | x; θ) as a multinomial distribution.

One-vs-all (OvA) and softmax regression are common approaches.



One-vs-All in Multiclass Classification



One-vs-All in Multiclass Classification I

The one-vs-all (or one-vs-rest) strategy is used to extend binary classifiers to handle multiple classes.

Given a set of K classes, the approach involves training K binary classifiers, one for each class.

Training
For each class k ∈ {1, 2, ..., K}:
Treat class k as the positive class.
Treat all other classes {1, 2, ..., k − 1, k + 1, ..., K} as the negative class.
Train a binary classifier $f_k(x)$ to distinguish between class k and the rest.



One-vs-All in Multiclass Classification II

Prediction
For a new input x, the predicted class ŷ is determined by:

$$\hat{y} = \arg\max_{k \in \{1, 2, \ldots, K\}} f_k(x),$$

where $f_k(x)$ is the score or probability output by the k-th classifier.



One-vs-All in Multiclass Classification III

Example
Consider a multiclass problem with three classes: A, B, and C.
Classifier 1: $f_A(x)$ distinguishes A (positive) vs. B and C (negative).
Classifier 2: $f_B(x)$ distinguishes B (positive) vs. A and C (negative).
Classifier 3: $f_C(x)$ distinguishes C (positive) vs. A and B (negative).
The final prediction is the class with the highest score:

$$\hat{y} = \arg\max_{k \in \{A, B, C\}} f_k(x).$$



Softmax Regression



Multinomial Distribution

A multinomial distribution involves k numbers $\phi_1, \ldots, \phi_k$, specifying the probability of each outcome:

$$\phi_j = P(y = j \mid x; \theta).$$

These probabilities must satisfy the constraint:

$$\sum_{j=1}^{k} \phi_j = 1.$$

The multinomial distribution generalizes the Bernoulli distribution to multiple classes.



Softmax Regression

We need to design a parameterized model that outputs $\phi_1, \ldots, \phi_k$ satisfying this constraint given the input x.

We introduce k parameters $\theta_1, \ldots, \theta_k$, each being a vector in $\mathbb{R}^d$.

Intuitively, we want to use $\theta_1^\top x, \ldots, \theta_k^\top x$ to represent $\phi_1, \ldots, \phi_k$:

$$\phi_j = P(y = j \mid x; \theta)$$

However, there are two issues with this direct approach:

1. $\theta_j^\top x$ is not necessarily within [0, 1].
2. The summation $\sum_{j=1}^{k} \theta_j^\top x$ is not necessarily 1.



Softmax Function I

To address these issues, we use the softmax function to transform $(\theta_1^\top x, \ldots, \theta_k^\top x)$ into a valid probability vector $(\phi_1, \ldots, \phi_k)$:

$$\phi_j = \frac{\exp(\theta_j^\top x)}{\sum_{s=1}^{k} \exp(\theta_s^\top x)}.$$

The softmax function ensures:

$$\phi_j \in [0, 1] \text{ for all } j, \qquad \sum_{j=1}^{k} \phi_j = 1.$$

This gives us a valid probability distribution over the k classes.


The softmax function is a generalization of the sigmoid function for
multi-class classification.
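
A small sketch of the softmax transformation; the max-subtraction step is a standard numerical-stability trick rather than something stated on the slide:

```python
import numpy as np

def softmax(scores):
    """Map a vector of scores (theta_j^T x for each class j) to probabilities.

    Subtracting the max score leaves the result unchanged mathematically
    but avoids overflow in exp() for large scores.
    """
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

# Example: three class scores
phi = softmax(np.array([2.0, 1.0, -0.5]))
print(phi, phi.sum())  # probabilities in [0, 1] that sum to 1
```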
Softmax Function II

The Softmax Gradient


1. Let $z_j = \theta_j^T x$. Then, the softmax probability can be written as:

$$P(y = j \mid x; \theta) = \frac{\exp(z_j)}{\sum_{s=1}^{k} \exp(z_s)}.$$

2. The derivative of $P(y = j \mid x; \theta)$ with respect to $z_j$ is:

$$\frac{\partial P(y = j \mid x; \theta)}{\partial z_j} = P(y = j \mid x; \theta) \cdot \left(1 - P(y = j \mid x; \theta)\right).$$

3. Using the chain rule, the gradient of $P(y = j \mid x; \theta)$ with respect to $\theta_j$ is:

$$\nabla_{\theta_j} P(y = j \mid x; \theta) = \frac{\partial P(y = j \mid x; \theta)}{\partial z_j} \cdot \nabla_{\theta_j} z_j.$$



Softmax Function III

4. Since $z_j = \theta_j^T x$, we have:

$$\nabla_{\theta_j} z_j = x.$$

5. Combining these results, the gradient of the softmax probability is:

$$\nabla_{\theta_j} P(y = j \mid x; \theta) = P(y = j \mid x; \theta) \cdot \left(1 - P(y = j \mid x; \theta)\right) \cdot x.$$



Cost function I

The probability of y = j given x is:

$$P(y = j \mid x; \theta) = \frac{\exp(\theta_j^T x)}{\sum_{s=1}^{k} \exp(\theta_s^T x)}$$

The cost function $J(\theta)$:

$$J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} I\{y^{(i)} = j\} \log P(y^{(i)} = j \mid x^{(i)}; \theta)$$

where $I\{y^{(i)} = j\}$ is the indicator function, which is 1 if $y^{(i)} = j$ and 0 otherwise.
The term $I\{y^{(i)} = j\}$ ensures that only the probability of the correct class j (for the i-th example) contributes to the cost.
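
A sketch of this cost with a parameter matrix Theta holding one row per class; shapes, names, and the stability shift are assumptions of the sketch. The indicator double-sum simply picks out the log-probability of each example's true class:

```python
import numpy as np

def softmax_probs(Theta, X):
    """Class probabilities P(y = j | x; theta) for every example.

    Theta: (k, d) matrix with one parameter vector per class, X: (n, d).
    """
    scores = X @ Theta.T                             # (n, k) scores theta_j^T x
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def cost(Theta, X, y):
    """J(Theta) = -(1/n) * sum_i log P(y_i | x_i); y holds class indices 0..k-1."""
    n = X.shape[0]
    probs = softmax_probs(Theta, X)
    return -np.mean(np.log(probs[np.arange(n), y]))
```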



Cost function II

Gradient of the Cost Function: The gradient of the cost function $J(\theta)$ with respect to $\theta_j$ is:

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta_j} \left[ \sum_{l=1}^{k} I\{y^{(i)} = l\} \log P(y^{(i)} = l \mid x^{(i)}; \theta) \right]$$

Using the chain rule and the gradient of the softmax probability (every term depends on $\theta_j$ through the normalizing denominator), we get:

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} x^{(i)} \left( I\{y^{(i)} = j\} - P(y^{(i)} = j \mid x^{(i)}; \theta) \right).$$

N.B.: The difference $I\{y^{(i)} = j\} - P(y^{(i)} = j \mid x^{(i)}; \theta)$ represents the error between the true label and the predicted probability.
This gradient is used to update the parameters $\theta_j$ during training.



Cost function III

The update rule for gradient descent is:

$$\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta)$$
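
Putting the gradient and the update rule together, a minimal gradient-descent loop for softmax regression; the softmax_probs helper is the illustrative one from the cost-function sketch, repeated here so the block stands alone, and alpha and n_iters are arbitrary:

```python
import numpy as np

def softmax_probs(Theta, X):
    """P(y = j | x; theta) for every example; Theta is (k, d), X is (n, d)."""
    scores = X @ Theta.T
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, y, k, alpha=0.5, n_iters=1000):
    """Gradient descent on J(Theta); y holds integer class labels in {0, ..., k-1}."""
    n, d = X.shape
    Theta = np.zeros((k, d))
    Y = np.eye(k)[y]                      # one-hot matrix: Y[i, j] = I{y_i = j}
    for _ in range(n_iters):
        probs = softmax_probs(Theta, X)   # (n, k) predicted probabilities
        grad = -(Y - probs).T @ X / n     # gradient of J w.r.t. each theta_j, stacked
        Theta -= alpha * grad             # theta_j := theta_j - alpha * grad_j
    return Theta

def predict_classes(Theta, X):
    return np.argmax(X @ Theta.T, axis=1)
```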

