Binary Logistic Regression 2
Matt Gormley
Lecture 10
Feb. 17, 2020
1
Reminders
• Midterm Exam 1
– Tue, Feb. 18, 7:00pm – 9:00pm
• Homework 4: Logistic Regression
– Out: Wed, Feb. 19
– Due: Fri, Feb. 28 at 11:59pm
• Today’s In-Class Poll
– http://p10.mlcourse.org
• Reading on Probabilistic Learning is reused
later in the course for MLE/MAP
3
MLE
Suppose we have data D = {x^(i)}_{i=1}^N

Principle of Maximum Likelihood Estimation:
Choose the parameters that maximize the likelihood of the data.

MLE:  θ_MLE = argmax_θ ∏_{i=1}^N p(x^(i) | θ)

MAP:  θ_MAP = argmax_θ ∏_{i=1}^N p(x^(i) | θ) p(θ)
Maximum Likelihood Estimate (MLE)
[Figures: the likelihood L(θ) plotted over a single parameter and as a surface L(θ1, θ2) over two parameters, each maximized at θ_MLE]
5
MLE
What does maximizing likelihood accomplish?
• There is only a finite amount of probability
mass (i.e. sum-to-one constraint)
• MLE tries to allocate as much probability
mass as possible to the things we have
observed…
6
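As a concrete (made-up) illustration of this point, consider a Bernoulli model of coin flips: the MLE is the empirical fraction of heads, which is exactly the parameter that puts as much probability mass as possible on what was observed. A minimal Python sketch with hypothetical data:

import numpy as np

# Hypothetical data: 10 coin flips, 7 heads (1) and 3 tails (0).
x = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])

def log_likelihood(phi, x):
    # log p(x | phi) for i.i.d. Bernoulli(phi) observations
    return np.sum(x * np.log(phi) + (1 - x) * np.log(1 - phi))

# Closed-form MLE: the fraction of observed heads.
phi_mle = x.mean()

# Numerical check: the closed form matches the grid-search maximizer.
grid = np.linspace(0.01, 0.99, 99)
phi_grid = grid[np.argmax([log_likelihood(p, x) for p in grid])]
print(phi_mle, phi_grid)   # both approximately 0.7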
MOTIVATION:
LOGISTIC REGRESSION
7
Example: Image Classification
• ImageNet LSVRC-2010 contest:
– Dataset: 1.2 million labeled images, 1000 classes
– Task: Given a new image, label it with the correct class
– Multiclass classification problem
• Examples from http://image-net.org/
10
[Slides 11–13: example images from ImageNet]
Example: Image Classification
CNN for Image Classification
(Krizhevsky, Sutskever & Hinton, 2012)
17.5% error on ImageNet LSVRC-2010 contest
• Input: image (pixels)
• Five convolutional layers (w/ max-pooling)
• Three fully connected layers
• Output: 1000-way softmax
[Figure 2 from the paper: "An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities…"]
14
LOGISTIC REGRESSION
16
Logistic Regression
Data: Inputs are continuous vectors of length M. Outputs
are discrete.
We are back to
classification.
17
Recall…
Linear Models for Classification
Key idea: Try to learn this hyperplane directly
Half-spaces:
Using gradient ascent for linear
classifiers
Key idea behind today’s lecture:
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient descent to learn
parameters
4. Predict the class with highest probability under
the model
20
Using gradient ascent for linear
classifiers
This decision function isn't differentiable:
    h(x) = sign(θ^T x)
Use a differentiable function instead:
    p_θ(y = 1 | x) = 1 / (1 + exp(−θ^T x))
where the logistic function is
    logistic(u) ≡ 1 / (1 + e^(−u))
[Plots: the step function sign(x) and the smooth logistic curve]
21
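A minimal sketch of these two functions in code (the names are mine, not from the slides; a bias term can be handled by appending a constant 1 feature to x):

import numpy as np

def logistic(u):
    # logistic(u) = 1 / (1 + e^(-u))
    return 1.0 / (1.0 + np.exp(-u))

def predict_proba(theta, x):
    # p_theta(y = 1 | x) = logistic(theta^T x)
    return logistic(np.dot(theta, x))

def predict_label(theta, x):
    # Predict the class with the highest probability under the model.
    return 1 if predict_proba(theta, x) >= 0.5 else 0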
Logistic Regression
Data: Inputs are continuous vectors of length M. Outputs
are discrete.
24
Learning for Logistic Regression
Whiteboard
– Partial derivative for Logistic Regression
– Gradient for Logistic Regression
25
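The whiteboard derivation is not reproduced in the slides. As a hedged sketch: for the negative log conditional likelihood objective (defined a few slides below) with y^(i) ∈ {0, 1}, the partial derivatives take the standard form ∂J(θ)/∂θ_m = Σ_i (logistic(θ^T x^(i)) − y^(i)) x_m^(i), which vectorizes as follows:

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gradient(theta, X, y):
    # X: N x M design matrix; y: length-N vector of {0,1} labels.
    # Gradient of the negative log conditional likelihood:
    #   dJ/dtheta = X^T (sigmoid(X theta) - y)
    return X.T @ (sigmoid(X @ theta) - y)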
LOGISTIC REGRESSION ON
GAUSSIAN DATA
26
Logistic Regression
[Slides 27–29: example plots of logistic regression fit to Gaussian data]
LEARNING LOGISTIC REGRESSION
30
Maximum Conditional
Likelihood Estimation
Learning: finds the parameters that minimize some
objective function.
θ* = argmin_θ J(θ)

We minimize the negative log conditional likelihood:

J(θ) = −∑_{i=1}^N log p_θ(y^(i) | x^(i))
Why?
1. We can’t maximize likelihood (as in Naïve Bayes)
because we don’t have a joint model p(x,y)
2. It worked well for Linear Regression (least squares is
MCLE)
31
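A minimal sketch of this objective in code (assuming y^(i) ∈ {0, 1} and the logistic parameterization from earlier; no numerical safeguards, for illustration only):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def neg_log_conditional_likelihood(theta, X, y):
    # J(theta) = - sum_i log p_theta(y^(i) | x^(i))
    p = sigmoid(X @ theta)   # p_theta(y = 1 | x^(i)) for each example i
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))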
Maximum Conditional
Likelihood Estimation
Learning: Four approaches to solving θ* = argmin_θ J(θ)
32
Maximum Conditional
Likelihood Estimation
Learning: Four approaches to solving θ* = argmin_θ J(θ)
Answer:
At each step (i.e. iteration) of SGD for Logistic Regression we…
A. (1) compute the gradient of the log-likelihood for all examples (2) update all
the parameters using the gradient
B. (1) ask Matt for a description of SGD for Logistic Regression, (2) write it down,
(3) report that answer
C. (1) compute the gradient of the log-likelihood for all examples (2) randomly
pick an example (3) update only the parameters for that example
D. (1) randomly pick a parameter, (2) compute the partial derivative of the log-
likelihood with respect to that parameter, (3) update that parameter for all
examples
E. (1) randomly pick an example, (2) compute the gradient of the log-likelihood
for that example, (3) update all the parameters using that gradient
F. (1) randomly pick a parameter and an example, (2) compute the gradient of
the log-likelihood for that example with respect to that parameter, (3) update
that parameter using that gradient
34
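For reference, a hedged sketch of a single SGD step for binary logistic regression (hypothetical names; rng is a numpy Generator such as np.random.default_rng(0), lr is the learning rate):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sgd_step(theta, X, y, lr, rng):
    # (1) randomly pick a training example
    i = rng.integers(len(y))
    x_i, y_i = X[i], y[i]
    # (2) compute the gradient of the negative log-likelihood for that example
    g = (sigmoid(x_i @ theta) - y_i) * x_i
    # (3) update all the parameters using that gradient
    return theta - lr * g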
Recall…
Gradient Descent
Algorithm 1 Gradient Descent
1: procedure GD(D, θ^(0))
2:     θ ← θ^(0)
3:     while not converged do
4:         θ ← θ − λ ∇_θ J(θ)
5:     return θ

In order to apply GD to Logistic Regression all we need is the gradient of the objective function (i.e. vector of partial derivatives):

∇_θ J(θ) = [ ∂J(θ)/∂θ_1, ∂J(θ)/∂θ_2, …, ∂J(θ)/∂θ_M ]^T
35
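A minimal sketch of this procedure in code (hypothetical names grad_J, lr; the norm-based convergence test is one common choice, not the only one):

import numpy as np

def gradient_descent(grad_J, theta0, lr=0.1, max_iters=1000, tol=1e-6):
    # Repeatedly step in the direction of the negative gradient of J.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        g = grad_J(theta)
        theta = theta - lr * g
        if np.linalg.norm(g) < tol:   # "not converged" check
            break
    return theta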
Recall…
Stochastic Gradient Descent (SGD)
37
Matching Game
Goal: Match the Algorithm to its Update Rule
[Matching table partially lost in extraction; surviving fragments: "3. Perceptron", h_θ(x) = sign(θ^T x), and update rule "6.": θ_k ← θ_k + (h_θ(x^(i)) − y^(i)) x_k^(i)]
39
Mini-Batch SGD
• Gradient Descent:
Compute true gradient exactly from all N
examples
• Stochastic Gradient Descent (SGD):
Approximate true gradient by the gradient
of one randomly chosen example
• Mini-Batch SGD:
Approximate true gradient by the average
gradient of K randomly chosen examples
40
Mini-Batch SGD
41
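A minimal sketch of mini-batch SGD for binary logistic regression under this description (hypothetical names; K is the batch size, lr the learning rate):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def minibatch_sgd(X, y, K=16, lr=0.1, epochs=10, seed=0):
    # Approximate the true gradient by the average gradient of K random examples.
    rng = np.random.default_rng(seed)
    N, M = X.shape
    theta = np.zeros(M)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(N), max(1, N // K)):
            Xb, yb = X[idx], y[idx]
            g = Xb.T @ (sigmoid(Xb @ theta) - yb) / len(idx)
            theta -= lr * g
    return theta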
Summary
1. Discriminative classifiers directly model the
conditional, p(y|x)
2. Logistic regression is a simple linear classifier that retains a probabilistic semantics
3. Parameters in LR are learned by iterative
optimization (e.g. SGD)
50
Logistic Regression Objectives
You should be able to…
• Apply the principle of maximum likelihood estimation (MLE) to
learn the parameters of a probabilistic model
• Given a discriminative probabilistic model, derive the conditional
log-likelihood, its gradient, and the corresponding Bayes
Classifier
• Explain the practical reasons why we work with the log of the
likelihood
• Implement logistic regression for binary or multiclass
classification
• Prove that the decision boundary of binary logistic regression is
linear
• For linear regression, show that the parameters which minimize
squared error are equivalent to those that maximize conditional
likelihood
51
MULTINOMIAL LOGISTIC
REGRESSION
54
55
Multinomial Logistic Regression
Chalkboard
– Background: Multinomial distribution
– Definition: Multi-class classification
– Geometric intuitions
– Multinomial logistic regression model
– Generative story
– Reduction to binary logistic regression
– Partial derivatives and gradients
– Applying Gradient Descent and SGD
– Implementation w/ sparse features
56
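The chalkboard derivation is not reproduced in the slides. As a hedged sketch, the model is commonly written p_θ(y = k | x) = exp(θ_k^T x) / Σ_j exp(θ_j^T x); in code (hypothetical names, Theta a K x M matrix of per-class parameter vectors):

import numpy as np

def softmax(scores):
    # Numerically stable softmax over the K class scores.
    scores = scores - np.max(scores)
    e = np.exp(scores)
    return e / e.sum()

def predict_proba(Theta, x):
    # p_theta(y = k | x) = exp(theta_k^T x) / sum_j exp(theta_j^T x)
    return softmax(Theta @ x)

def predict_label(Theta, x):
    # Predict the class with the highest probability under the model.
    return int(np.argmax(predict_proba(Theta, x)))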
Debug that Program!
In-Class Exercise: Think-Pair-Share
Debug the following program which is (incorrectly)
attempting to run SGD for multinomial logistic regression
Buggy Program:
while not converged:
    for i in shuffle([1,…,N]):
        for k in [1,…,K]:
            theta[k] = theta[k] - lambda * grad(x[i], y[i], theta, k)
57
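The exercise itself is left to the reader. For comparison only, a self-contained sketch of SGD for multinomial logistic regression that a debugged program could be checked against (all names, defaults, and conventions here are hypothetical, not the official solution):

import numpy as np

def softmax(scores):
    scores = scores - np.max(scores)
    e = np.exp(scores)
    return e / e.sum()

def sgd_multinomial_lr(X, y, K, lr=0.1, epochs=10, seed=0):
    # X: N x M features; y: length-N labels in {0, ..., K-1}.
    rng = np.random.default_rng(seed)
    N, M = X.shape
    Theta = np.zeros((K, M))
    for _ in range(epochs):
        for i in rng.permutation(N):
            p = softmax(Theta @ X[i])        # p_theta(y = k | x^(i)) for all k
            p[y[i]] -= 1.0                   # gradient of -log p_theta(y^(i) | x^(i)) w.r.t. the scores
            Theta -= lr * np.outer(p, X[i])  # update every class's parameter vector
    return Theta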