
CS 188: Artificial Intelligence

Perceptrons and Logistic Regression

Spring 2023
University of California, Berkeley
Linear Classifiers
Feature Vectors

Example: spam filtering. The email

  "Hello, Do you want free printr cartriges? Why pay more for when you can
  get them ABSOLUTELY FREE! Just ..."

maps to the feature vector

  # free      : 2
  YOUR_NAME   : 0
  MISSPELLED  : 2
  FROM_FRIEND : 0
  ...

and is labeled SPAM (the positive class).

Example: digit recognition. An image of a handwritten "2" maps to the feature vector

  PIXEL-7,12 : 1
  PIXEL-7,13 : 0
  ...
  NUM_LOOPS  : 1

and is labeled "2".
Some (Simplified) Biology
§ Very loose inspiration: human neurons
Linear Classifiers

§ Inputs are feature values
§ Each feature has a weight
§ Sum is the activation:
  activation_w(x) = Σ_i w_i · f_i(x) = w · f(x)
§ If the activation is:
  § Positive, output +1
  § Negative, output -1

[Figure: features f1, f2, f3 are multiplied by weights w1, w2, w3, summed (Σ), and passed through a ">0?" test]
Weights
Dot product positive means the positive class (spam)

  w (weights)         f (email 1)         f (email 2)
  # free      :  4    # free      : 2     # free      : 0
  YOUR_NAME   : -1    YOUR_NAME   : 0     YOUR_NAME   : 1
  MISSPELLED  :  1    MISSPELLED  : 2     MISSPELLED  : 1
  FROM_FRIEND : -3    FROM_FRIEND : 0     FROM_FRIEND : 1
  ...                 ...                 ...

Do these weights make sense for spam classification?
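To make the rule concrete, here is a minimal Python sketch (not from the lecture; the weight and feature values are copied from the table above) that computes the two dot products and applies the sign rule:

# Sketch only: values from the slide above, code is illustrative.
weights = {"# free": 4, "YOUR_NAME": -1, "MISSPELLED": 1, "FROM_FRIEND": -3}
email_1 = {"# free": 2, "YOUR_NAME": 0, "MISSPELLED": 2, "FROM_FRIEND": 0}
email_2 = {"# free": 0, "YOUR_NAME": 1, "MISSPELLED": 1, "FROM_FRIEND": 1}

def dot(w, f):
    # w . f over the named features
    return sum(w.get(k, 0) * v for k, v in f.items())

for name, f in [("email 1", email_1), ("email 2", email_2)]:
    z = dot(weights, f)
    print(name, z, "SPAM (+1)" if z > 0 else "HAM (-1)")
# email 1: 4*2 - 1*0 + 1*2 - 3*0 = 10 > 0  -> SPAM
# email 2: 4*0 - 1*1 + 1*1 - 3*1 = -3 < 0  -> HAM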


Review: Vectors
§ A tuple like (2, 3) can be interpreted two different ways:
  § A point on a coordinate grid
  § A vector in space (notice we are not on a coordinate grid)
§ A tuple with more elements like (2, 7, -3, 6) is a point or vector in higher-dimensional space (hard to visualize)
Review: Vectors
§ Definition of dot product:
  § a · b = |a| |b| cos(θ)
  § θ is the angle between the vectors a and b
§ Consequences of this definition:
  § Vectors closer together = "similar" vectors = smaller angle θ between vectors = larger (more positive) dot product
  § If θ < 90°, then the dot product is positive
  § If θ = 90°, then the dot product is zero
  § If θ > 90°, then the dot product is negative

[Figure: four vector pairs illustrating a · b large and positive, small and positive, zero, and negative]
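A quick numeric check (not from the slides): for a = (2, 3) and b = (4, 1), a · b = 2·4 + 3·1 = 11 > 0, so θ < 90°. Indeed |a| = √13 ≈ 3.61, |b| = √17 ≈ 4.12, and cos(θ) = 11 / (3.61 · 4.12) ≈ 0.74, i.e. θ ≈ 42°.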
Weights
§ Binary case: compare features to a weight vector
§ Learning: figure out the weight vector from examples

  w (weights)         f (email 1)         f (email 2)
  # free      :  4    # free      : 2     # free      : 0
  YOUR_NAME   : -1    YOUR_NAME   : 0     YOUR_NAME   : 1
  MISSPELLED  :  1    MISSPELLED  : 2     MISSPELLED  : 1
  FROM_FRIEND : -3    FROM_FRIEND : 0     FROM_FRIEND : 1
  ...                 ...                 ...

Dot product positive means the positive class
Decision Rules
Binary Decision Rule
§ In the space of feature vectors
  § Examples are points
  § Any weight vector is a hyperplane (divides space into two sides)
  § One side corresponds to Y = +1, the other corresponds to Y = -1
§ In the example, w = (free : 4, money : 2):
  § f · w > 0 when 4·free + 2·money > 0
  § f · w < 0 when 4·free + 2·money < 0
    These two inequalities correspond to the two halves of the feature space
  § f · w = 0 when 4·free + 2·money = 0
    This equation corresponds to the decision boundary (a line in 2D, a hyperplane in higher dimensions)

[Figure: the free / money plane, with the decision boundary separating the +1 = SPAM half-space from the -1 = HAM half-space]
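For instance (a worked check, not on the slide): an email with free = 2 and money = 1 gives f · w = 4·2 + 2·1 = 10 > 0, so it lands on the +1 = SPAM side of the boundary 4·free + 2·money = 0; a point with f · w below 0 would land on the -1 = HAM side.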
Weight Updates
Learning: Binary Perceptron
§ Start with weights = 0
§ For each training instance:
  § Classify with current weights:
    y = +1 if w · f(x) ≥ 0, otherwise y = -1
  § If correct (i.e., y = y*), no change!
  § If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1:
    w = w + y* · f
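A minimal Python sketch of this training loop (illustrative only: the tie-breaking at w · f(x) = 0 and the small 2D data set below are assumptions, not from the slides):

import numpy as np

def train_binary_perceptron(data, num_features, passes=10):
    """data: list of (f, y_star) pairs, with f a feature vector and y_star in {+1, -1}."""
    w = np.zeros(num_features)                  # start with weights = 0
    for _ in range(passes):                     # sweep over the training instances
        for f, y_star in data:
            y = 1 if np.dot(w, f) >= 0 else -1  # classify with current weights
            if y != y_star:                     # if wrong: w = w + y* * f
                w = w + y_star * f
    return w

# Made-up, linearly separable 2D data.
data = [(np.array([2.0, 1.0]), 1),
        (np.array([-1.0, 0.5]), -1),
        (np.array([1.5, -0.5]), 1)]
print(train_binary_perceptron(data, num_features=2))   # converges quickly for this data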
Learning: Binary Perceptron
§ Misclassification, Case I:
§ w · f > 0, so we predict +1
§ True class is -1
§ We want to modify w to w' such that dot product w' · f is lower
§ Update if we misclassify a true class -1 sample: w' = w – f
§ Proof: w' · f = (w − f) · f = (w · f) − (f · f) = (w · f) − |f|²
Note that |f|² is always positive
§ Misclassification, Case II:
§ w · f < 0, so we predict -1
§ True class is +1
§ We want to modify w to w' such that dot product w' · f is higher
§ Update if we misclassify a true class +1 sample: w' = w + f
§ Proof: w' · f = (w + f) · f = (w · f) + (f · f) = (w · f) + |f|²
Note that |f|² is always positive
§ Write update compactly as w' = w + y* · f, where y* = true class
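A numeric instance of Case I (numbers made up): if w · f = 3 and |f|² = 5, then the update w' = w − f gives w' · f = 3 − 5 = −2, so the same example would now be scored on the −1 side.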
Examples: Perceptron
§ Separable Case
Multiclass Decision Rule

§ If we have multiple classes:

  § A weight vector for each class: w_y

  § Score (activation) of a class y: w_y · f(x)

  § Prediction: the class with the highest score wins, y = argmax_y w_y · f(x)

§ Binary = multiclass where the negative class has weight zero


Learning: Multiclass Perceptron

§ Start with all weights = 0

§ Pick up training examples one by one
§ Predict with current weights: y = argmax_y w_y · f(x)

§ If correct, no change!
§ If wrong: lower the score of the wrong answer, raise the score of the right answer:
  w_y  = w_y  − f(x)   (wrong answer)
  w_y* = w_y* + f(x)   (right answer)
Example: Multiclass Perceptron

Training sentences: "win the vote", "win the election", "win the game"

Initial weights, one vector per class (three classes):

  BIAS : 1    BIAS : 0    BIAS : 0
  win  : 0    win  : 0    win  : 0
  game : 0    game : 0    game : 0
  vote : 0    vote : 0    vote : 0
  the  : 0    the  : 0    the  : 0
  ...         ...         ...
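A runnable Python sketch of multiclass perceptron training on these three sentences (everything beyond the sentences is an assumption: the class names politics/sports/tech, the true labels, and the all-zero initialization are illustrative choices, not given on this slide):

from collections import defaultdict

def features(sentence):
    # BIAS feature plus one count per word, as in the table above.
    f = defaultdict(float)
    f["BIAS"] = 1.0
    for word in sentence.split():
        f[word] += 1.0
    return f

def score(w, f):
    return sum(w[k] * v for k, v in f.items())

def train_multiclass_perceptron(data, classes, passes=10):
    weights = {c: defaultdict(float) for c in classes}            # one weight vector per class
    for _ in range(passes):
        for sentence, y_star in data:
            f = features(sentence)
            y = max(classes, key=lambda c: score(weights[c], f))  # highest score wins
            if y != y_star:
                for k, v in f.items():
                    weights[y][k] -= v         # lower the score of the wrong answer
                    weights[y_star][k] += v    # raise the score of the right answer
    return weights

# Hypothetical labels for the slide's sentences.
data = [("win the vote", "politics"),
        ("win the election", "politics"),
        ("win the game", "sports")]
learned = train_multiclass_perceptron(data, classes=["politics", "sports", "tech"])
print({c: dict(w) for c, w in learned.items()})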
Properties of Perceptrons
§ Separability: true if some parameters get the training set perfectly correct

§ Convergence: if the training data are separable, the perceptron will eventually converge (binary case)

§ Mistake Bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability

[Figure: a separable point set vs. a non-separable one]
Problems with the Perceptron

§ Noise: if the data isn't separable, weights might thrash
  § Averaging weight vectors over time can help (averaged perceptron)

§ Mediocre generalization: finds a "barely" separating solution

§ Overtraining: test / held-out accuracy usually rises, then falls
  § Overtraining is a kind of overfitting
Improving the Perceptron
Non-Separable Case: Deterministic Decision
Even the best linear boundary makes at least one mistake
Non-Separable Case: Probabilistic Decision
[Figure: a probabilistic boundary, with class probabilities 0.9 | 0.1, 0.7 | 0.3, 0.5 | 0.5, 0.3 | 0.7, 0.1 | 0.9 shown in bands across the feature space]
How to get deterministic decisions?
§ Perceptron scoring: z = w · f(x)
§ If z = w · f(x) is positive → classifier says: 1.0 probability this is class +1
§ If z = w · f(x) is negative → classifier says: 0.0 probability this is class +1
§ Step function:
  H(z) = 1 for z > 0, H(z) = 0 for z < 0 (a step at z = 0)
§ z = output of perceptron
  H(z) = probability the class is +1, according to the classifier
How to get probabilistic decisions?
§ Perceptron scoring: z = w · f(x)
§ If z = w · f(x) is very positive → probability of class +1 should approach 1.0
§ If z = w · f(x) is very negative → probability of class +1 should approach 0.0

§ Sigmoid function:
  φ(z) = 1 / (1 + e^(-z))
§ z = output of perceptron
  φ(z) = probability the class is +1, according to the classifier
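In Python, the sigmoid is one line (a sketch, not lecture code):

import numpy as np

def sigmoid(z):
    # phi(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(4.0))    # ~0.982: very positive z, P(+1) near 1.0
print(sigmoid(0.0))    # 0.5: on the decision boundary
print(sigmoid(-4.0))   # ~0.018: very negative z, P(+1) near 0.0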
A 1D Example
The probability of the red class is a sigmoid of the score, P(red | x) = φ(w · x),
where w is some weight constant (1D vector) we have to learn
(assume w is positive in this example)

[Figure: definitely blue (x negative), not sure (x near 0), definitely red (x positive)]
Best w?
§ Recall maximum likelihood estimation: Choose the w value that
maximizes the probability of the observed (training) data
Best w?
§ Maximum likelihood estimation:

  max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

with:
  P(y^(i) = +1 | x^(i); w) = 1 / (1 + e^(-w · f(x^(i))))
  P(y^(i) = -1 | x^(i); w) = 1 − 1 / (1 + e^(-w · f(x^(i))))

= Logistic Regression
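As a concrete sketch of this objective in Python (the data and weights below are made up; how to actually maximize ll(w) is the topic of the next lecture):

import numpy as np

def log_likelihood(w, X, y):
    """w: weights; X: rows are feature vectors f(x^(i)); y: labels in {+1, -1}."""
    z = X @ w
    # log P(y^(i) | x^(i); w) = log sigmoid(y^(i) * w . f(x^(i))),
    # since P(-1 | x; w) = 1 - sigmoid(w . f(x)) = sigmoid(-w . f(x)).
    return np.sum(np.log(1.0 / (1.0 + np.exp(-y * z))))

X = np.array([[2.0, 1.0], [-1.0, 0.5], [1.5, -0.5]])
y = np.array([1, -1, 1])
print(log_likelihood(np.array([1.0, 0.0]), X, y))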
Separable Case: Deterministic Decision – Many Options
Separable Case: Probabilistic Decision – Clear Preference

[Figure: the same separable data under several candidate boundaries, with class probabilities such as 0.7 | 0.3, 0.5 | 0.5, 0.3 | 0.7 shown along bands of the feature space]
Multiclass Logistic Regression
§ Recall Perceptron:
  § A weight vector for each class: w_y
  § Score (activation) of a class y: w_y · f(x)
  § Prediction: the class with the highest score wins, y = argmax_y w_y · f(x)

§ How to make the scores into probabilities?

  z1, z2, z3  →  e^(z1) / (e^(z1) + e^(z2) + e^(z3)),  e^(z2) / (e^(z1) + e^(z2) + e^(z3)),  e^(z3) / (e^(z1) + e^(z2) + e^(z3))
  original activations  →  softmax activations
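A small Python sketch of the softmax transformation above (the activation values are made up):

import numpy as np

def softmax(z):
    # e^(z_i) / sum_j e^(z_j); subtracting max(z) avoids overflow without changing the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, -1.0])    # original activations z1, z2, z3
print(softmax(z))                 # softmax activations, roughly [0.705, 0.259, 0.035]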
Best w?
§ Recall maximum likelihood estimation: Choose the w value that
maximizes the probability of the observed (training) data
Best w?
§ Maximum likelihood estimation:

  max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

with:
  P(y^(i) | x^(i); w) = e^(w_y^(i) · f(x^(i))) / Σ_y e^(w_y · f(x^(i)))

= Multi-Class Logistic Regression
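A Python sketch of this multiclass objective (the weights, data, and class indices below are made up):

import numpy as np

def multiclass_log_likelihood(W, X, y):
    """W: one weight vector per class (rows); X: rows are f(x^(i)); y: integer class indices."""
    scores = X @ W.T                                 # w_y . f(x^(i)) for every class y
    scores -= scores.max(axis=1, keepdims=True)      # shift scores for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(y)), y].sum()     # sum_i log P(y^(i) | x^(i); w)

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # 3 classes, 2 features
X = np.array([[2.0, 1.0], [0.5, 2.0]])
y = np.array([0, 1])
print(multiclass_log_likelihood(W, X, y))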


Softmax with Different Bases
Softmax and Sigmoid
§ Recall: Binary perceptron is a special case of multi-class perceptron
  § Multi-class: compute w_y · f(x) for each class y, pick the class with the highest activation
  § Binary case:
    Let the weight vector of +1 be w (which we learn).
    Let the weight vector of -1 always be 0 (constant).
§ Binary classification as a multi-class problem:
  Activation of the negative class is always 0.
  If w · f is positive, then the activation of +1 (namely w · f) is higher than that of -1 (namely 0).
  If w · f is negative, then the activation of -1 (namely 0) is higher than that of +1 (namely w · f).
Softmax
  P(+1 | x) = e^(w · f(x)) / (e^(w · f(x)) + e^(w_red · f(x)))
with w_red = 0 (the fixed weight vector of the negative class) becomes:
Sigmoid
  P(+1 | x) = e^(w · f(x)) / (e^(w · f(x)) + 1) = 1 / (1 + e^(-w · f(x)))
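A quick numeric check of this identity in Python (the score value is arbitrary):

import numpy as np

z = 1.7                                                # some value of w . f(x)
softmax_pos = np.exp(z) / (np.exp(z) + np.exp(0.0))    # two-class softmax, negative class weights fixed at 0
sigmoid_pos = 1.0 / (1.0 + np.exp(-z))
print(softmax_pos, sigmoid_pos)                        # both ~0.8455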
Next Lecture

§ Optimization

§ i.e., how do we solve:


  max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)
