Logistic Regression (Probability Concepts) and Perceptron

The document discusses classification algorithms, particularly focusing on binary classification, where outcomes are discrete values, as in disease diagnosis or spam detection. It explains the limitations of applying linear regression to classification problems and introduces the logistic function as a better alternative for predicting probabilities between 0 and 1. It also outlines the process of fitting parameters using maximum likelihood estimation and gradient ascent for logistic regression, contrasting this with the perceptron algorithm for binary outputs.

Classification Algorithms
Recall that:
Regression problems are those where the variables you're trying to predict are continuous values.
In classification problems, the values you're trying to predict are discrete; they take on only a small number of values. Here I'll talk about binary classification, where the output takes on only two values.
Classification
One classification problem is medical diagnosis: based on some features, decide whether the patient has a disease or does not have a disease.
Or, in the housing example, maybe you're trying to decide whether this house will be sold in the next six months, and the answer is either yes or no.
Another example: if you want to build a spam filter, is this e-mail spam or not? It's yes or no.
So it's a yes-or-no answer.
So there's X, there's Y, and Y ∈ {0, 1}.

[Figure: training examples for a binary classification problem; inputs to the left of a point q have y = 0 and inputs to the right have y = 1.]
One thing you could do is take linear regression, as we've described it so far, and apply it to this problem.
Given this data set you can fit a straight line to it, and maybe you get a straight line that makes this an easy classification problem.
So you apply linear regression to this data set, you get a reasonable fit, and you can then take your linear regression hypothesis (the straight line) and threshold it at 0.5.

[Figure: the same data with a straight line fit by linear regression; the line crosses 0.5 near the mid-point q.]
If you do that, you'll certainly get the right answer: you predict Y = 1 if X is to the right of the mid-point q, and Y = 0 if X is to the left of that mid-point.
So some people actually apply linear regression to classification problems, and sometimes it will work well, but in general it's a bad idea to apply linear regression to classification problems like these.
Here is why: let's say I change the training set by giving you just one more training example, all the way off to the right.

[Figure: the same data with one additional positive example far to the right of q.]
With this training set it is still entirely obvious what the relationship between X and Y is: take the value q, and if X is greater than q then Y = 1, while if X is less than q then Y = 0.
Adding this extra training example really shouldn't change anything; there's no surprise that it corresponds to Y = 1.
But if you now fit linear regression to this data set, you end up with a quite different line.
The predictions of your hypothesis have changed completely if you threshold the hypothesis at 0.5.
What is the relationship between X and Y now?

[Figure: linear regression re-fit to the data including the far-right example; the line is tilted, so thresholding it at 0.5 no longer splits the data at q.]
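As a small illustration, here is a minimal NumPy sketch of this effect. The one-dimensional data set and the position of the extra example are made up for the illustration; the point is only that the thresholded least-squares line changes its predictions when one far-away positive example is added.

```python
import numpy as np

def fit_least_squares(x, y):
    """Fit y ≈ theta0 + theta1 * x by ordinary least squares."""
    X = np.column_stack([np.ones_like(x), x])        # add an intercept column
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def predict_class(theta, x):
    """Threshold the linear hypothesis at 0.5 to get a 0/1 prediction."""
    return (theta[0] + theta[1] * x >= 0.5).astype(int)

# Five negative examples (x = 1..5) and five positive ones (x = 6..10).
x = np.arange(1, 11, dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

theta = fit_least_squares(x, y)
print(predict_class(theta, x))          # [0 0 0 0 0 1 1 1 1 1]: every label is recovered

# Add one more, obviously positive, example far to the right.
x_extra = np.append(x, 50.0)
y_extra = np.append(y, 1.0)
theta_extra = fit_least_squares(x_extra, y_extra)
print(predict_class(theta_extra, x))    # the line tilts and the prediction for x = 6 flips to 0
```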
So what shall we do?
If Y ∈ {0, 1}, let's start by changing the form of our hypothesis so that the hypothesis always lies in the unit interval between 0 and 1:
hϴ(x) ∈ [0, 1]
If I know Y is either 0 or 1, then at least the hypothesis shouldn't predict values much larger than 1 or much smaller than 0.
And so, instead of choosing a linear function for the hypothesis, we are going to choose this function:
hϴ(x) = g(ϴᵀx),  where  g(z) = 1 / (1 + e^(−z))

So,
hϴ(x) = g(ϴᵀx) = 1 / (1 + e^(−ϴᵀx))
This g is the sigmoid function (also called the logistic function).
[Figure: the sigmoid function g(z); it approaches 0 as z becomes very negative, crosses 0.5 at z = 0, and approaches 1 as z becomes very large.]
So g(z) tends towards 0 as z becomes very negative, tends towards 1 as z becomes very large, and crosses the vertical axis at 0.5.
As z → −∞, g(z) → 0
As z → +∞, g(z) → 1
So this is the sigmoid function, also called the logistic function, and the values output by my hypothesis will always be between 0 and 1.
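As a quick sketch of these properties, here is the sigmoid in NumPy (the particular test values are arbitrary):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-10.0))   # ~0.000045, close to 0
print(sigmoid(0.0))     # exactly 0.5
print(sigmoid(10.0))    # ~0.999955, close to 1
```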
Furthermore, just like we did for linear regression, I'm going to endow the outputs of the hypothesis with a probabilistic interpretation. I'm going to assume that the probability that y = 1, given x and parameterized by ϴ, is
P(y = 1 | x; ϴ) = hϴ(x)
So, in other words, we imagine that the hypothesis is outputting numbers that lie between zero and one, and we think of the hypothesis as trying to estimate the probability that y = 1.
And because y has to be either 0 or 1, the probability that y equals zero is
P(y = 0 | x; ϴ) = 1 − hϴ(x)
We can take the previous two equations and write them more compactly as:
P(y | x; ϴ) = (hϴ(x))^y (1 − hϴ(x))^(1−y)
Given this model and the data, how do I fit the parameters ϴ of my model?
The likelihood of the parameters is, as before, just the probability of the data:
L(ϴ) = P(Y | X; ϴ) = ∏_{i=1}^{m} P(y^(i) | x^(i); ϴ)
Plugging the previous compact form into this equation yields:
L(ϴ) = ∏_{i=1}^{m} (hϴ(x^(i)))^{y^(i)} (1 − hϴ(x^(i)))^{1 − y^(i)}

So, as before, let's say we want to find a maximum likelihood estimate of the parameters ϴ.
It turns out that, when you work through the derivations, it is often much easier to maximize the log of the likelihood rather than the likelihood itself:
l(ϴ) = log L(ϴ) = Σ_{i=1}^{m} [ y^(i) log hϴ(x^(i)) + (1 − y^(i)) log(1 − hϴ(x^(i))) ]

And so, to fit the parameters ϴ of our model, we'll find the value of ϴ that maximizes this log likelihood.
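As a small sketch, this is what the log likelihood looks like in NumPy, assuming each row of X already contains the intercept term as its first entry (the data below are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y^(i) log h(x^(i)) + (1 - y^(i)) log(1 - h(x^(i))) ]."""
    h = sigmoid(X @ theta)                     # h_theta(x^(i)) for every training example
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 0.5],
              [1.0, 2.0],
              [1.0, 3.5]])                     # first column is the intercept term
y = np.array([0.0, 0.0, 1.0])

print(log_likelihood(np.zeros(2), X, y))       # 3 * log(0.5) ≈ -2.079 when theta = 0
```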
So, to maximize this function, we can actually apply the same gradient algorithm that we learned earlier, which was the first algorithm we used to minimize the quadratic error function J(ϴ); we can use essentially the same algorithm to maximize the log likelihood.
That algorithm repeatedly takes the value of ϴ and replaces it with the previous value of ϴ plus a learning rate α times the gradient of the objective function:
ϴ := ϴ + α ∇ϴ l(ϴ)
One small change: previously we were trying to minimize the quadratic error term, whereas today we're trying to maximize rather than minimize, so rather than a minus sign we have a plus sign.
So this is gradient ascent rather than gradient descent, used for maximization rather than minimization, but it's really the same algorithm.
What you need to do is compute the partial derivatives of your objective function with respect to each of your parameters ϴ_j. If you take the derivatives and work through the algebra, it turns out the update simplifies down to this formula:
ϴ_j := ϴ_j + α Σ_{i=1}^{m} (y^(i) − hϴ(x^(i))) x_j^(i)
We end up with exactly the same learning rule as for least squares regression.
So, is this the same learning algorithm as the least squares regression we declared before to be a bad idea for classification problems?
Actually, it is not the same: in logistic regression, hϴ(x) is no longer the linear function ϴᵀx.
hϴ(x) = g(ϴᵀx)
It is the logistic function applied to ϴᵀx, so this is actually a totally different learning algorithm, even though the update rule looks identical. It is, though, one of the most elegant of the generalized learning models.
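Putting the pieces together, here is a minimal batch gradient ascent sketch for logistic regression in NumPy. The toy data set, learning rate, and iteration count are illustrative choices, not values from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.005, n_iters=20000):
    """Batch gradient ascent on the log likelihood l(theta).

    Implements theta_j := theta_j + alpha * sum_i (y^(i) - h_theta(x^(i))) * x_j^(i).
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = sigmoid(X @ theta)              # h_theta(x^(i)) for every training example
        theta += alpha * (X.T @ (y - h))    # gradient of the log likelihood
    return theta

# Toy one-dimensional data: an intercept column plus one feature.
x = np.array([0.0, 1.0, 2.0, 8.0, 9.0, 10.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

theta = logistic_gradient_ascent(X, y)
print(sigmoid(X @ theta))   # probabilities near 0 for the first three inputs, near 1 for the last three
```

Note the plus sign in the update: because we are maximizing the log likelihood rather than minimizing a squared error, this is gradient ascent, exactly as in the formula above.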
Perceptron Algorithm
What if you want to force g(z) to output a value that is either exactly 0 or exactly 1, rather than using the logistic function, which outputs values in between 0 and 1?
The perceptron algorithm defines g(z) to be the step function:
g(z) = 1 if z ≥ 0, and g(z) = 0 otherwise
[Figure: the step function g(z), which jumps from 0 to 1 at z = 0.]
and hϴ(x) = g(ϴᵀx).
The learning rule is:
ϴ_j := ϴ_j + α Σ_{i=1}^{m} (y^(i) − hϴ(x^(i))) x_j^(i)
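For comparison, here is the same kind of sketch for the perceptron, reusing the toy data from the previous example (again an illustrative choice, not data from the lecture):

```python
import numpy as np

def step(z):
    """Perceptron threshold function: g(z) = 1 if z >= 0, else 0."""
    return (np.asarray(z) >= 0).astype(float)

def perceptron_train(X, y, alpha=0.1, n_iters=100):
    """Batch perceptron updates: theta_j := theta_j + alpha * sum_i (y^(i) - h_theta(x^(i))) * x_j^(i)."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = step(X @ theta)
        theta += alpha * (X.T @ (y - h))
    return theta

x = np.array([0.0, 1.0, 2.0, 8.0, 9.0, 10.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

theta = perceptron_train(X, y)
print(step(X @ theta))   # [0. 0. 0. 1. 1. 1.] -- the hard 0/1 predictions match the labels here
```

The only difference from the logistic regression sketch is the choice of g: a hard threshold instead of the sigmoid, so the hypothesis outputs exactly 0 or 1.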
Recall that:
In the logistic regression model, to find the value of ϴ that maximizes the log likelihood, gradient ascent (or, equivalently, gradient descent on the negative log likelihood) is a perfectly fine algorithm to use.
P(y = 1 | x; ϴ) = hϴ(x)
hϴ(x) = g(ϴᵀx) = 1 / (1 + e^(−ϴᵀx))
l(ϴ) = log L(ϴ) = Σ_{i=1}^{m} [ y^(i) log hϴ(x^(i)) + (1 − y^(i)) log(1 − hϴ(x^(i))) ]
ϴ_j := ϴ_j + α (y^(i) − hϴ(x^(i))) x_j^(i)   (the update for a single training example)
