
10-601 Introduction to Machine Learning
Machine Learning Department
School of Computer Science
Carnegie Mellon University

Binary Logistic Regression
+
Multinomial Logistic Regression

Matt Gormley
Lecture 10
Feb. 17, 2020
Reminders
• Midterm Exam 1
– Tue, Feb. 18, 7:00pm – 9:00pm
• Homework 4: Logistic Regression
– Out: Wed, Feb. 19
– Due: Fri, Feb. 28 at 11:59pm
• Today’s In-Class Poll
– http://p10.mlcourse.org
• Reading on Probabilistic Learning is reused later in the course for MLE/MAP
MLE
Suppose we have data $D = \{x^{(i)}\}_{i=1}^{N}$.

Principle of Maximum Likelihood Estimation:
Choose the parameters that maximize the likelihood of the data.

$\theta^{\text{MLE}} = \operatorname{argmax}_{\theta} \prod_{i=1}^{N} p(x^{(i)} \mid \theta)$

(Compare with MAP estimation, which also weights by a prior:
$\theta^{\text{MAP}} = \operatorname{argmax}_{\theta} \prod_{i=1}^{N} p(x^{(i)} \mid \theta)\, p(\theta)$)

Maximum Likelihood Estimate (MLE)
[Figure: contour plot of the likelihood $L(\theta_1, \theta_2)$ over parameters $\theta_1$ and $\theta_2$, with the maximizer $\theta^{\text{MLE}}$ marked.]
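A short worked example (an illustration added here, not from the slides), assuming the familiar Bernoulli coin-flip model with $N$ observations $x^{(i)} \in \{0,1\}$, of which $k$ are heads, and parameter $\phi = p(x = 1)$:

$L(\phi) = \prod_{i=1}^{N} \phi^{x^{(i)}} (1-\phi)^{1-x^{(i)}} = \phi^{k}(1-\phi)^{N-k}$
$\log L(\phi) = k \log \phi + (N-k) \log(1-\phi)$
$\frac{d}{d\phi} \log L(\phi) = \frac{k}{\phi} - \frac{N-k}{1-\phi} = 0 \;\Rightarrow\; \phi^{\text{MLE}} = \frac{k}{N}$

So the MLE puts probability mass exactly where the data is: the empirical fraction of heads.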
MLE
What does maximizing likelihood accomplish?
• There is only a finite amount of probability mass (i.e. sum-to-one constraint)
• MLE tries to allocate as much probability mass as possible to the things we have observed…

…at the expense of the things we have not observed
MOTIVATION:
LOGISTIC REGRESSION

Example: Image Classification
• ImageNet LSVRC-2010 contest:
– Dataset: 1.2 million labeled images, 1000 classes
– Task: Given a new image, label it with the correct class
– Multiclass classification problem
• Examples from http://image-net.org/

Example: Image Classification
CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2011)
17.5% error on ImageNet LSVRC-2010 contest
• Five convolutional layers (w/ max-pooling)
• Three fully connected layers
Pipeline: input image (pixels) → convolutional and fully connected layers → 1000-way softmax
[Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities]
Example: Image Classification
CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2011), continued:
• This “softmax” layer is Logistic Regression!
• The rest is just some fancy feature extraction (discussed later in the course)
[Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities]
LOGISTIC REGRESSION

Logistic Regression
Data: Inputs are continuous vectors of length M. Outputs are discrete.

(We are back to classification, despite the name “logistic regression.”)
Recall… Linear Models for Classification
Key idea: Try to learn this hyperplane directly

• We’ll see a number of commonly used Linear Classifiers
• These include:
  – Perceptron
  – Logistic Regression
  – Naïve Bayes (under certain conditions)
  – Support Vector Machines

Looking ahead: Directly modeling the hyperplane would use a decision function:
$h(x) = \operatorname{sign}(\theta^T x)$
for:
$y \in \{-1, +1\}$
Recall… Background: Hyperplanes
Hyperplane (Definition 1): $H = \{x : w^T x = b\}$
Hyperplane (Definition 2): written in terms of a single parameter vector $\theta$ (see the notation trick)
Notation Trick: fold the bias b and the weights w into a single vector θ by prepending a constant to x and increasing dimensionality by one!
Half-spaces: [figure: the two half-spaces on either side of the hyperplane]
Using gradient ascent for linear classifiers
Key idea behind today’s lecture:
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient descent to learn parameters
4. Predict the class with highest probability under the model
Using gradient ascent for linear classifiers
This decision function isn’t differentiable:
$h(x) = \operatorname{sign}(\theta^T x)$
Use a differentiable function instead:
$p_\theta(y = 1 \mid x) = \frac{1}{1 + \exp(-\theta^T x)}$
[Figure: plot of $\operatorname{sign}(x)$ next to the logistic curve, $\operatorname{logistic}(u) \equiv \frac{1}{1 + e^{-u}}$]
Logistic Regression
Data: Inputs are continuous vectors of length M. Outputs are discrete.

Model: Logistic function applied to dot product of parameters with input vector.
$p_\theta(y = 1 \mid x) = \frac{1}{1 + \exp(-\theta^T x)}$

Learning: finds the parameters that minimize some objective function.
$\theta^{*} = \operatorname{argmin}_{\theta} J(\theta)$

Prediction: Output is the most probable class.
$\hat{y} = \operatorname{argmax}_{y \in \{0,1\}} p_\theta(y \mid x)$
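A minimal Python sketch of this model (an illustration, not code from the course; the helper names logistic, predict_proba, and predict are my own, and x is assumed to already have a leading 1 folded in for the bias term):

import numpy as np

def logistic(u):
    # logistic(u) = 1 / (1 + e^{-u})
    return 1.0 / (1.0 + np.exp(-u))

def predict_proba(theta, x):
    # p_theta(y = 1 | x) = logistic(theta^T x)
    return logistic(np.dot(theta, x))

def predict(theta, x):
    # y_hat = argmax over y in {0, 1} of p_theta(y | x)
    return 1 if predict_proba(theta, x) >= 0.5 else 0

For example, with theta = np.zeros(3) every input gets predict_proba equal to 0.5, so prediction falls back to class 1 under the tie-breaking rule above.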
Logistic Regression
Whiteboard
– Bernoulli interpretation
– Logistic Regression Model
– Decision boundary

Learning for Logistic Regression
Whiteboard
– Partial derivative for Logistic Regression
– Gradient for Logistic Regression

LOGISTIC REGRESSION ON
GAUSSIAN DATA

Logistic Regression
[Figures spanning three slides: logistic regression fit to data drawn from Gaussians; plots not included in this extraction.]
LEARNING LOGISTIC REGRESSION

Maximum Conditional Likelihood Estimation
Learning: finds the parameters that minimize some objective function.
$\theta^{*} = \operatorname{argmin}_{\theta} J(\theta)$
We minimize the negative log conditional likelihood:
$J(\theta) = -\sum_{i=1}^{N} \log p_\theta(y^{(i)} \mid x^{(i)})$
Why?
1. We can’t maximize likelihood (as in Naïve Bayes) because we don’t have a joint model p(x,y)
2. It worked well for Linear Regression (least squares is MCLE)
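A minimal sketch of this objective and its gradient in Python (my own illustration, assuming labels y in {0,1}, a design matrix X of shape (N, M), and the standard gradient sum_i (p_theta(y=1|x^(i)) - y^(i)) x^(i); not code distributed with the course):

import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def nll(theta, X, y):
    # J(theta) = -sum_i log p_theta(y^(i) | x^(i))
    p = logistic(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def nll_gradient(theta, X, y):
    # dJ/dtheta = sum_i (p_theta(y=1 | x^(i)) - y^(i)) * x^(i)
    p = logistic(X @ theta)
    return X.T @ (p - y)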
Maximum Conditional Likelihood Estimation
Learning: Four approaches to solving $\theta^{*} = \operatorname{argmin}_{\theta} J(\theta)$

Approach 1: Gradient Descent
(take larger – more certain – steps opposite the gradient)

Approach 2: Stochastic Gradient Descent (SGD)
(take many small steps opposite the gradient)

Approach 3: Newton’s Method
(use second derivatives to better follow curvature)

Approach 4: Closed Form???
(set derivatives equal to zero and solve for parameters)
Maximum Conditional Likelihood Estimation
Learning: Four approaches to solving $\theta^{*} = \operatorname{argmin}_{\theta} J(\theta)$ (same four as above)

Logistic Regression does not have a closed form solution for MLE parameters.
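For Approach 3, a rough sketch of Newton's method on this objective (my own illustration, assuming the standard logistic-regression Hessian X^T diag(p(1-p)) X; this derivation is not part of the slides):

import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def newtons_method(X, y, num_iters=10):
    N, M = X.shape
    theta = np.zeros(M)
    for _ in range(num_iters):
        p = logistic(X @ theta)
        grad = X.T @ (p - y)                              # gradient of J(theta)
        hessian = X.T @ (X * (p * (1 - p))[:, None])      # X^T diag(p(1-p)) X
        theta -= np.linalg.solve(hessian, grad)           # Newton step: -H^{-1} grad
    return theta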
SGD for Logistic Regression
Question:
Which of the following is a correct description of SGD for Logistic Regression?

Answer:
At each step (i.e. iteration) of SGD for Logistic Regression we…
A. (1) compute the gradient of the log-likelihood for all examples (2) update all the parameters using the gradient
B. (1) ask Matt for a description of SGD for Logistic Regression, (2) write it down, (3) report that answer
C. (1) compute the gradient of the log-likelihood for all examples (2) randomly pick an example (3) update only the parameters for that example
D. (1) randomly pick a parameter, (2) compute the partial derivative of the log-likelihood with respect to that parameter, (3) update that parameter for all examples
E. (1) randomly pick an example, (2) compute the gradient of the log-likelihood for that example, (3) update all the parameters using that gradient
F. (1) randomly pick a parameter and an example, (2) compute the gradient of the log-likelihood for that example with respect to that parameter, (3) update that parameter using that gradient
Recall… Gradient Descent

Algorithm 1 Gradient Descent
1: procedure GD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     θ ← θ − λ ∇_θ J(θ)
5:   return θ

In order to apply GD to Logistic Regression all we need is the gradient of the objective function (i.e. vector of partial derivatives):
$\nabla_\theta J(\theta) = \left[ \frac{dJ(\theta)}{d\theta_1}, \frac{dJ(\theta)}{d\theta_2}, \ldots, \frac{dJ(\theta)}{d\theta_M} \right]^T$
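A minimal Python sketch of Approach 1 (illustrative names, reusing the nll_gradient helper from the earlier sketch; the fixed step size and iteration budget stand in for the "while not converged" test):

import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def nll_gradient(theta, X, y):
    p = logistic(X @ theta)
    return X.T @ (p - y)

def gradient_descent(X, y, lr=0.1, num_iters=1000):
    theta = np.zeros(X.shape[1])                 # theta^(0) = all zeros
    for _ in range(num_iters):                   # stand-in for "while not converged"
        theta -= lr * nll_gradient(theta, X, y)  # step opposite the gradient
    return theta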
Recall… Stochastic Gradient Descent (SGD)

We can also apply SGD to solve the MCLE problem for Logistic Regression.
We need a per-example objective:
Let $J(\theta) = \sum_{i=1}^{N} J^{(i)}(\theta)$
where $J^{(i)}(\theta) = -\log p_\theta(y^{(i)} \mid x^{(i)})$.
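A minimal Python sketch of SGD with this per-example objective (illustrative, not course code; the per-example gradient is (p_theta(y=1|x^(i)) - y^(i)) x^(i), and the learning rate and epoch count are arbitrary choices):

import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def sgd(X, y, lr=0.1, num_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    N, M = X.shape
    theta = np.zeros(M)
    for _ in range(num_epochs):
        for i in rng.permutation(N):                         # randomly pick an example
            grad_i = (logistic(X[i] @ theta) - y[i]) * X[i]  # gradient of J^(i)(theta)
            theta -= lr * grad_i                             # update ALL the parameters
    return theta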
Logistic Regression vs. Perceptron
Question:
True or False: Just like Perceptron, one step (i.e. iteration) of SGD for Logistic Regression will result in a change to the parameters only if the current example is incorrectly classified.

Answer:
Matching Game
Goal: Match the Algorithm to its Update Rule

Algorithms:
1. SGD for Logistic Regression: $h_\theta(x) = p_\theta(y \mid x)$
2. Least Mean Squares: $h_\theta(x) = \theta^T x$
3. Perceptron: $h_\theta(x) = \operatorname{sign}(\theta^T x)$

Update Rules:
4. $\theta_k \leftarrow \theta_k + \lambda\,(h_\theta(x^{(i)}) - y^{(i)})$
5. $\theta_k \leftarrow \theta_k + \lambda \cdot \frac{1}{1 + \exp\left(\lambda\,(h_\theta(x^{(i)}) - y^{(i)})\right)}$
6. $\theta_k \leftarrow \theta_k + \lambda\,(h_\theta(x^{(i)}) - y^{(i)})\, x_k^{(i)}$

A. 1=5, 2=4, 3=6    E. 1=6, 2=6, 3=6
B. 1=5, 2=6, 3=4    F. 1=6, 2=5, 3=5
C. 1=6, 2=4, 3=4    G. 1=5, 2=5, 3=5
D. 1=5, 2=6, 3=6    H. 1=4, 2=5, 3=6
OPTIMIZATION METHOD #4:
MINI-BATCH SGD

Mini-Batch SGD
• Gradient Descent:
  Compute true gradient exactly from all N examples
• Stochastic Gradient Descent (SGD):
  Approximate true gradient by the gradient of one randomly chosen example
• Mini-Batch SGD:
  Approximate true gradient by the average gradient of K randomly chosen examples
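A minimal Python sketch of Mini-Batch SGD for this model (illustrative; batch_size plays the role of K above, and the learning rate and epoch count are arbitrary choices):

import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def minibatch_sgd(X, y, batch_size=32, lr=0.1, num_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    N, M = X.shape
    theta = np.zeros(M)
    for _ in range(num_epochs):
        order = rng.permutation(N)
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]                  # K randomly chosen examples
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (logistic(Xb @ theta) - yb) / len(idx)   # average gradient over the batch
            theta -= lr * grad
    return theta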
Mini-Batch SGD
Three variants of first-order optimization:
[Figure: comparison of the three variants (GD, SGD, Mini-Batch SGD).]
Summary
1. Discriminative classifiers directly model the conditional, p(y|x)
2. Logistic regression is a simple linear classifier that retains a probabilistic semantics
3. Parameters in LR are learned by iterative optimization (e.g. SGD)
Logistic Regression Objectives
You should be able to…
• Apply the principle of maximum likelihood estimation (MLE) to learn the parameters of a probabilistic model
• Given a discriminative probabilistic model, derive the conditional log-likelihood, its gradient, and the corresponding Bayes Classifier
• Explain the practical reasons why we work with the log of the likelihood
• Implement logistic regression for binary or multiclass classification
• Prove that the decision boundary of binary logistic regression is linear
• For linear regression, show that the parameters which minimize squared error are equivalent to those that maximize conditional likelihood
MULTINOMIAL LOGISTIC
REGRESSION

Multinomial Logistic Regression
Chalkboard
– Background: Multinomial distribution
– Definition: Multi-class classification
– Geometric intuitions
– Multinomial logistic regression model
– Generative story
– Reduction to binary logistic regression
– Partial derivatives and gradients
– Applying Gradient Descent and SGD
– Implementation w/ sparse features

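A minimal Python sketch of the multinomial (softmax) logistic regression model that the chalkboard items above cover (illustrative names only, not the chalkboard derivation; theta is a K x M matrix with one weight vector per class and x is a length-M feature vector):

import numpy as np

def softmax(scores):
    # numerically stable softmax: p_k = exp(s_k) / sum_j exp(s_j)
    scores = scores - np.max(scores)
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)

def predict_proba(theta, x):
    # p_theta(y = k | x) = softmax(theta @ x)[k] for k = 1..K
    return softmax(theta @ x)

def predict(theta, x):
    # y_hat = argmax_k p_theta(y = k | x)
    return int(np.argmax(predict_proba(theta, x)))

Binary logistic regression is the K = 2 special case, which is the "reduction to binary logistic regression" item above.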
Debug that Program!
In-Class Exercise: Think-Pair-Share
Debug the following program which is (incorrectly) attempting to run SGD for multinomial logistic regression

Buggy Program:
while not converged:
    for i in shuffle([1,…,N]):
        for k in [1,…,K]:
            theta[k] = theta[k] - lambda * grad(x[i], y[i], theta, k)

Assume: grad(x[i], y[i], theta, k) returns the gradient of the negative log-likelihood of the training example (x[i],y[i]) with respect to vector theta[k]. lambda is the learning rate. N = # of examples. K = # of output classes. M = # of features. theta is a K by M matrix.
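One possible reading of the bug (my interpretation, not the official exercise answer): the gradient for class k depends on every row of theta through the softmax normalizer, so updating theta[k] inside the inner loop means later classes see a partially modified theta. A hypothetical corrected sketch in runnable Python, with softmax and the per-example gradient written out explicitly:

import numpy as np

def softmax(scores):
    scores = scores - np.max(scores)
    e = np.exp(scores)
    return e / e.sum()

def sgd_multinomial(X, y, K, lr=0.1, num_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    N, M = X.shape
    theta = np.zeros((K, M))                              # K by M parameter matrix
    for _ in range(num_epochs):                           # stand-in for "while not converged"
        for i in rng.permutation(N):
            p = softmax(theta @ X[i])                     # p_theta(y = k | x^(i)) for all k
            grad = np.outer(p - np.eye(K)[y[i]], X[i])    # gradient w.r.t. ALL rows, using the current theta
            theta -= lr * grad                            # update every theta[k] together
    return theta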
