Logistic Regression
Mohamed Farah
2024-2025
Derivative of g(z):
\[
g'(z) = \frac{d}{dz}\left(\frac{1}{1+e^{-z}}\right)
      = \frac{e^{-z}}{(1+e^{-z})^{2}}
      = \frac{1}{1+e^{-z}}\left(1-\frac{1}{1+e^{-z}}\right)
      = g(z)\bigl(1-g(z)\bigr).
\]
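As a quick sanity check on this identity, here is a minimal NumPy sketch (not from the slides; the names sigmoid and sigmoid_grad are illustrative) that compares the closed form g(z)(1 − g(z)) with a central finite-difference estimate:

    import numpy as np

    def sigmoid(z):
        # g(z) = 1 / (1 + exp(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        # closed form from the derivation: g'(z) = g(z) * (1 - g(z))
        g = sigmoid(z)
        return g * (1.0 - g)

    z = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
    eps = 1e-6
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    print(np.max(np.abs(numeric - sigmoid_grad(z))))  # tiny gap, on the order of 1e-11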
\[
\begin{aligned}
\frac{\partial}{\partial\theta_j}\,\ell(\theta)
&= \sum_{i=1}^{n}\left(\frac{y^{(i)}}{h_\theta(x^{(i)})} - \frac{1-y^{(i)}}{1-h_\theta(x^{(i)})}\right)\frac{\partial}{\partial\theta_j}\, h_\theta(x^{(i)}) \\
&= \sum_{i=1}^{n}\left(\frac{y^{(i)}}{h_\theta(x^{(i)})} - \frac{1-y^{(i)}}{1-h_\theta(x^{(i)})}\right) h_\theta(x^{(i)})\bigl(1-h_\theta(x^{(i)})\bigr)\,\frac{\partial}{\partial\theta_j}\,\theta^{\mathsf T}x^{(i)} \\
&= \sum_{i=1}^{n}\Bigl(y^{(i)}\bigl(1-h_\theta(x^{(i)})\bigr) - \bigl(1-y^{(i)}\bigr)h_\theta(x^{(i)})\Bigr)\, x_j^{(i)} \\
&= \sum_{i=1}^{n}\bigl(y^{(i)} - h_\theta(x^{(i)})\bigr)\, x_j^{(i)}.
\end{aligned}
\]
Since we maximize the log-likelihood, the parameters are updated by gradient ascent:
\[
\theta := \theta + \alpha\,\nabla_{\theta}\,\ell(\theta).
\]
\[
h_{\theta}(x) = g(\theta^{\mathsf T} x)
\]
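To make the hypothesis and the ascent update concrete, here is a minimal batch gradient-ascent sketch in NumPy (illustrative only; X, y, alpha and n_iters are assumed names, and dividing the summed gradient by n is purely a scaling choice, not something the slides prescribe):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic(X, y, alpha=0.1, n_iters=1000):
        # X: (n, d) design matrix, y: (n,) labels in {0, 1}
        n, d = X.shape
        theta = np.zeros(d)
        for _ in range(n_iters):
            h = sigmoid(X @ theta)        # h_theta(x^(i)) for every example
            grad = X.T @ (y - h)          # sum_i (y^(i) - h_theta(x^(i))) x^(i)
            theta += alpha * grad / n     # gradient ascent step (scaled by 1/n)
        return theta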
Training
For each class k ∈ {1, 2, . . . , K }:
Treat class k as the positive class.
Treat all other classes {1, 2, . . . , k − 1, k + 1, . . . , K} as the negative class.
Train a binary classifier fk(x) to distinguish between class k and the rest.
Prediction
For a new input x, the predicted class ŷ is determined by:
\[
\hat{y} = \arg\max_{k \in \{1,\dots,K\}} f_k(x).
\]
Example
Consider a multiclass problem with three classes: A, B, and C.
Classifier 1: fA(x) distinguishes A (positive) vs. B and C (negative).
Classifier 2: fB(x) distinguishes B (positive) vs. A and C (negative).
Classifier 3: fC(x) distinguishes C (positive) vs. A and B (negative).
The final prediction is the class with the highest score:
\[
\hat{y} = \arg\max_{k \in \{A, B, C\}} f_k(x).
\]
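A compact one-vs-rest wrapper along these lines (a sketch that assumes the fit_logistic and sigmoid helpers from the earlier illustrative snippet are available; the function names here are not from the slides):

    import numpy as np

    def fit_one_vs_rest(X, y, classes, alpha=0.1, n_iters=1000):
        # train one binary classifier f_k per class k: k is positive, the rest negative
        return {k: fit_logistic(X, (y == k).astype(float), alpha, n_iters)
                for k in classes}

    def predict_one_vs_rest(X, thetas):
        # score every class on every example and pick the arg max
        classes = list(thetas)
        scores = np.column_stack([sigmoid(X @ thetas[k]) for k in classes])
        return np.array(classes)[np.argmax(scores, axis=1)]

For the A, B, C example above, classes would be ['A', 'B', 'C'], and the returned label is the one whose classifier assigns the highest probability.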
Let ϕj = P(y = j | x; θ) denote the probability of class j. The softmax model sets
\[
\phi_j = \frac{\exp(\theta_j^{\mathsf T} x)}{\sum_{s=1}^{k} \exp(\theta_s^{\mathsf T} x)},
\qquad
\sum_{j=1}^{k} \phi_j = 1.
\]
Writing zj = θj⊤x, this becomes
\[
P(y = j \mid x; \theta) = \frac{\exp(z_j)}{\sum_{s=1}^{k} \exp(z_s)}.
\]
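The formula translates directly into code; below is a numerically stable NumPy sketch (softmax is an assumed helper name; subtracting the row maximum before exponentiating leaves the ratio unchanged):

    import numpy as np

    def softmax(Z):
        # Z: (n, k) scores, Z[i, j] = theta_j^T x^(i)
        Z = Z - Z.max(axis=1, keepdims=True)      # stability shift, cancels in the ratio
        expZ = np.exp(Z)
        return expZ / expZ.sum(axis=1, keepdims=True)

    P = softmax(np.array([[2.0, 1.0, 0.1]]))
    print(P, P.sum())    # the probabilities of one example sum to 1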
\[
\frac{\partial P(y = j \mid x; \theta)}{\partial z_j}
  = P(y = j \mid x; \theta)\,\bigl(1 - P(y = j \mid x; \theta)\bigr).
\]
By the chain rule,
\[
\nabla_{\theta_j} P(y = j \mid x; \theta)
  = \frac{\partial P(y = j \mid x; \theta)}{\partial z_j}\,\nabla_{\theta_j} z_j,
\qquad
\nabla_{\theta_j} z_j = x.
\]
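A quick finite-difference check of the diagonal derivative ∂ϕj/∂zj = ϕj(1 − ϕj), reusing the illustrative softmax sketch above (the score values and index j are arbitrary):

    import numpy as np

    z = np.array([[0.5, -1.2, 2.0]])
    j, eps = 0, 1e-6
    P = softmax(z)[0]

    z_plus, z_minus = z.copy(), z.copy()
    z_plus[0, j] += eps
    z_minus[0, j] -= eps
    numeric = (softmax(z_plus)[0, j] - softmax(z_minus)[0, j]) / (2 * eps)

    print(numeric, P[j] * (1 - P[j]))  # the two values agree to about 1e-10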
Recall that
\[
P(y = j \mid x; \theta) = \frac{\exp(\theta_j^{\mathsf T} x)}{\sum_{s=1}^{k} \exp(\theta_s^{\mathsf T} x)}.
\]
Using the chain rule and the gradient of the softmax probability, we get:
\[
\nabla_{\theta_j} J(\theta)
  = -\frac{1}{n}\sum_{i=1}^{n} x^{(i)}\Bigl(I\{y^{(i)} = j\} - P(y^{(i)} = j \mid x^{(i)}; \theta)\Bigr).
\]
N.B.: The difference I{y^(i) = j} − P(y^(i) = j | x^(i); θ) represents the error between the true label and the predicted probability.
This gradient is used to update the parameters θj during training:
\[
\theta_j := \theta_j - \alpha\,\nabla_{\theta_j} J(\theta).
\]
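Putting the pieces together, here is a minimal softmax-regression training loop that applies this update (a sketch, not the slides' code: it assumes the softmax helper above, integer labels 0..k−1, and illustrative names X, y, Theta):

    import numpy as np

    def fit_softmax(X, y, k, alpha=0.1, n_iters=1000):
        # X: (n, d) design matrix, y: (n,) integer labels in {0, ..., k-1}
        n, d = X.shape
        Theta = np.zeros((k, d))           # row j holds theta_j
        Y = np.eye(k)[y]                   # one-hot targets: Y[i, j] = I{y^(i) = j}
        for _ in range(n_iters):
            P = softmax(X @ Theta.T)       # P[i, j] = P(y = j | x^(i); theta)
            grad = -(X.T @ (Y - P)).T / n  # row j is nabla_{theta_j} J(theta)
            Theta -= alpha * grad          # theta_j := theta_j - alpha * grad_j
        return Theta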