The Perceptron
First of all, the coolest algorithm name! (Well, maybe "neocognitron," also the name of a real ML algorithm, is cooler.) It is based on the 1943 model of neurons made by McCulloch and Pitts and by Hebb. It was developed by Rosenblatt in 1962. At the time, it was not interpreted as attempting to optimize any particular criterion; it was presented directly as an algorithm. There has, since, been a huge amount of study and analysis of its convergence properties and other aspects of its behavior.
1 Algorithm
Recall that we have a training dataset Dn with x ∈ Rd and y ∈ {−1, +1}. The Perceptron algorithm trains a binary classifier h(x; θ, θ0) using the following algorithm to find θ and θ0 using τ iterative steps. (We use the Greek letter τ here instead of T so we don't confuse it with transpose!)

PERCEPTRON(τ, Dn)
1   θ = [0 0 ··· 0]T
2   θ0 = 0
3   for t = 1 to τ
4       for i = 1 to n
5           if y(i) (θT x(i) + θ0) ≤ 0
6               θ = θ + y(i) x(i)
7               θ0 = θ0 + y(i)
8   return θ, θ0
Intuitively, on each step, if the current hypothesis θ, θ0 classifies example x(i) correctly, then no change is made. If it classifies x(i) incorrectly, then it moves θ, θ0 so that it is "closer" to classifying x(i), y(i) correctly. (Let's check dimensions: remember that θ is d × 1, x(i) is d × 1, and y(i) is a scalar. Does everything match?)
Note that if the algorithm ever goes through one iteration of the loop on line 4 without
making an update, it will never make any further updates (verify that you believe this!)
and so it should just terminate at that point.
Study Question: What is true about En if that happens?
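Although the notes present the Perceptron only as pseudocode, a minimal NumPy sketch may help make the loop concrete. It assumes Dn is given as an n × d array X and a length-n array y of ±1 labels; the Python names here are ours, not part of the notes. It also includes the early-termination check just described.

import numpy as np

def perceptron(X, y, tau):
    # Sketch of PERCEPTRON(tau, Dn). X: (n, d) array of examples; y: (n,) array of +/-1 labels.
    n, d = X.shape
    theta, theta_0 = np.zeros(d), 0.0
    for t in range(tau):
        made_update = False
        for i in range(n):
            if y[i] * (theta @ X[i] + theta_0) <= 0:   # mistake (or exactly on the boundary)
                theta = theta + y[i] * X[i]            # pseudocode line 6
                theta_0 = theta_0 + y[i]               # pseudocode line 7
                made_update = True
        if not made_update:   # a full pass with no update: no further updates can ever happen
            return theta, theta_0
    return theta, theta_0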
Example: Let h be the linear classifier defined by θ(0) = [1 −1]T, θ0(0) = 1. The diagram below shows several points classified by h. However, in this case, h (represented by the bold line) misclassifies the point x(1) = [1 3]T, which has label y(1) = 1. Indeed,

y(1) (θ(0)T x(1) + θ0(0)) = [1 −1] [1 3]T + 1 = −1 < 0 .

By running an iteration of the Perceptron algorithm, we update

θ(1) = θ(0) + y(1) x(1) = [2 2]T
θ0(1) = θ0(0) + y(1) = 2
The new classifier (represented by the dashed line) now correctly classifies that
point, but now makes a mistake on the negatively labeled point.
[Diagram: the original separator θ(0)T x + θ0(0) = 0 (the bold line), the updated separator θ(1)T x + θ0(1) = 0 (the dashed line), the misclassified point x(1), and the normal vectors θ(0) and θ(1).]
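For a quick numeric check (not part of the notes), the single update in this example can be reproduced in a few lines of NumPy; the variable names are ours.

import numpy as np

theta = np.array([1.0, -1.0])      # theta^(0)
theta_0 = 1.0                      # theta_0^(0)
x1, y1 = np.array([1.0, 3.0]), 1   # x^(1), y^(1)

print(y1 * (theta @ x1 + theta_0))   # -1.0: nonpositive, so x^(1) is misclassified
theta = theta + y1 * x1              # theta^(1) = [2, 2]
theta_0 = theta_0 + y1               # theta_0^(1) = 2
print(y1 * (theta @ x1 + theta_0))   # 10.0: x^(1) is now classified correctly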
A really important fact about the perceptron algorithm is that, if there is a linear classi-
fier with 0 training error, then this algorithm will (eventually) find it! We’ll look at a proof
of this in detail, next.
2 Offset
Sometimes, it can be easier to implement or analyze classifiers of the form

h(x; θ) = +1 if θT x > 0
          −1 otherwise.
Without an explicit offset term (θ0 ), this separator must pass through the origin, which may
appear to be limiting. However, we can convert any problem involving a linear separator
with offset into one with no offset (but of higher dimension)!
Consider the d-dimensional linear separator defined by θ = [θ1 θ2 ··· θd]T and offset θ0.

• to each data point x ∈ Rd, append a coordinate with value +1, forming
  xnew = [x1 ··· xd 1]T ;
• define
  θnew = [θ1 ··· θd θ0]T .

Then,

θnew · xnew = θ1 x1 + · · · + θd xd + θ0 · 1
            = θT x + θ0

Thus, θnew is an equivalent ((d + 1)-dimensional) separator to our original, but with no offset.
Consider the 1D dataset of four points at x = 1, 2, 3, 4, with the points at 1 and 2 labeled +1 and the points at 3 and 4 labeled −1. It is linearly separable in d = 1 with θ = [−1] and θ0 = 2.5. But it is not linearly separable through the origin! Now, let

Xnew = [ 1 2 3 4 ]
       [ 1 1 1 1 ]

where each column is one of the original points with a coordinate of value 1 appended. This new dataset is separable through the origin, with θnew = [−1, 2.5]T.
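A small NumPy sketch of this conversion, under the assumption that the data are stored as an array with one row per example (the variable names are ours):

import numpy as np

# The 1D example: points 1..4, the first two labeled +1, the last two labeled -1.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1, 1, -1, -1])

# Append a constant coordinate of 1 to every point (the x_new construction).
X_new = np.hstack([X, np.ones((X.shape[0], 1))])

theta_new = np.array([-1.0, 2.5])   # [theta; theta_0] from the text
print(np.sign(X_new @ theta_new))   # [ 1.  1. -1. -1.] -- agrees with y, with no offset term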
We can make a simplified version of the perceptron algorithm if we restrict ourselves to separators through the origin. (We list it here because this is the version of the algorithm we'll study in more detail.)

PERCEPTRON-THROUGH-ORIGIN(τ, Dn)
1   θ = [0 0 ··· 0]T
2   for t = 1 to τ
3       for i = 1 to n
4           if y(i) θT x(i) ≤ 0
5               θ = θ + y(i) x(i)
6   return θ
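Here is a corresponding NumPy sketch of the through-origin version, again assuming X is an n × d array and y a ±1 label vector (names ours). Running it on the converted 1D dataset from above recovers a separator through the origin.

import numpy as np

def perceptron_through_origin(X, y, tau):
    # Sketch of PERCEPTRON-THROUGH-ORIGIN(tau, Dn); X is (n, d), y holds +/-1 labels.
    theta = np.zeros(X.shape[1])
    for t in range(tau):
        mistakes_this_pass = 0
        for i in range(X.shape[0]):
            if y[i] * (theta @ X[i]) <= 0:    # pseudocode line 4
                theta = theta + y[i] * X[i]   # pseudocode line 5
                mistakes_this_pass += 1
        if mistakes_this_pass == 0:           # theta now separates the training data
            break
    return theta

X_new = np.array([[1., 1.], [2., 1.], [3., 1.], [4., 1.]])
y = np.array([1, 1, -1, -1])
theta = perceptron_through_origin(X_new, y, tau=1000)
print(np.sign(X_new @ theta))   # matches y: [ 1.  1. -1. -1.]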
3 Theory of the Perceptron

We say that a training set Dn is linearly separable if there exist θ, θ0 such that, for all i = 1, . . . , n,

y(i) (θT x(i) + θ0) > 0 .

Another way to say this is that all predictions on the training set are correct:

h(x(i); θ, θ0) = y(i) .

And, another way to say this is that the training error is zero:

En(h) = 0 .

Next, define the margin of a labeled point (x, y) with respect to the hyperplane θ, θ0 to be

y · (θT x + θ0) / ‖θ‖ .

This quantity will be positive if and only if the point x is classified as y by the linear classifier represented by this hyperplane.
Study Question: What sign does the margin have if the point is incorrectly classi-
fied? Be sure you can explain why.
Now, the margin of a dataset Dn with respect to the hyperplane θ, θ0 is the minimum margin of any point with respect to θ, θ0:

min_i  y(i) · (θT x(i) + θ0) / ‖θ‖ .
The margin is positive if and only if all of the points in the data-set are classified correctly.
In that case (only!) it represents the distance from the hyperplane to the closest point.
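These two definitions translate directly into code. A minimal sketch, assuming the same array conventions as before (the function names are ours):

import numpy as np

def point_margin(x, y, theta, theta_0):
    # Margin of a labeled point (x, y) with respect to the hyperplane theta, theta_0.
    return y * (theta @ x + theta_0) / np.linalg.norm(theta)

def dataset_margin(X, y, theta, theta_0):
    # Minimum point margin over the dataset; positive iff every point is classified correctly.
    return min(point_margin(X[i], y[i], theta, theta_0) for i in range(X.shape[0]))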
Example: Let h be the linear classifier defined by θ = [1 −1]T, θ0 = 1. The diagram below shows several points classified by h, one of which is misclassified. We compute the margin for each point:

[Diagram: the separator θT x + θ0 = 0 and the points x(1), x(2), x(3).]

y(1) · (θT x(1) + θ0) / ‖θ‖ = 1 · (−2 + 1) / √2 = −√2 / 2
y(2) · (θT x(2) + θ0) / ‖θ‖ = 1 · (1 + 1) / √2 = √2
y(3) · (θT x(3) + θ0) / ‖θ‖ = −1 · (−3 + 1) / √2 = √2
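These numbers can be checked numerically. The notes give x(1) = [1 3]T explicitly, but not the coordinates of x(2) and x(3), so the points used for x(2) and x(3) below are hypothetical, chosen only to be consistent with the dot products shown above (θT x(2) = 1, θT x(3) = −3).

import numpy as np

theta, theta_0 = np.array([1.0, -1.0]), 1.0
points = [
    (np.array([1.0, 3.0]),  1),   # x^(1), y^(1) = +1 (given in the notes)
    (np.array([2.0, 1.0]),  1),   # hypothetical x^(2) with theta.x = 1, y^(2) = +1
    (np.array([0.0, 3.0]), -1),   # hypothetical x^(3) with theta.x = -3, y^(3) = -1
]
for x, y in points:
    print(y * (theta @ x + theta_0) / np.linalg.norm(theta))
# prints approximately -0.707 (-sqrt(2)/2), 1.414 (sqrt(2)), 1.414 (sqrt(2))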
Theorem 3.1 (Perceptron Convergence). For simplicity, we consider the case where the linear
separator must pass through the origin. If the following conditions hold:
(a) there exists θ∗ such that y(i) θ∗T x(i) / ‖θ∗‖ ≥ γ for all i = 1, . . . , n and for some γ > 0, and

(b) all the examples have bounded magnitude: ‖x(i)‖ ≤ R for all i = 1, . . . , n,

then the perceptron algorithm will make at most (R/γ)² mistakes. At this point, its hypothesis will be a linear separator of the data.
Proof. We initialize θ(0) = 0, and let θ(k) define our hyperplane after the perceptron algorithm has made k mistakes. We are going to think about the angle between the hypothesis we have now, θ(k), and the assumed good separator θ∗. Since they both go through the origin, if we can show that the angle between them is decreasing usefully on every iteration, then we will get close to that separator.
So, let's think about the cosine of the angle between them, and recall, by the definition of dot product:

cos(θ(k), θ∗) = (θ(k) · θ∗) / (‖θ(k)‖ ‖θ∗‖)

We'll divide this up into two factors,

cos(θ(k), θ∗) = ( (θ(k) · θ∗) / ‖θ∗‖ ) · ( 1 / ‖θ(k)‖ ) ,    (3.1)

and work on each factor separately. For the first factor, suppose the kth mistake is made on example (x(i), y(i)), so that θ(k) = θ(k−1) + y(i) x(i). Then,
(θ(k) · θ∗) / ‖θ∗‖ = ( (θ(k−1) + y(i) x(i)) · θ∗ ) / ‖θ∗‖
                   = (θ(k−1) · θ∗) / ‖θ∗‖ + (y(i) x(i) · θ∗) / ‖θ∗‖
                   ≥ (θ(k−1) · θ∗) / ‖θ∗‖ + γ
                   ≥ kγ
where we have first applied the margin condition from (a) and then applied simple induc-
tion.
Now, we'll look at the second factor in equation 3.1. We note that since (x(i), y(i)) is classified incorrectly, y(i) θ(k−1)T x(i) ≤ 0. Thus,

‖θ(k)‖² = ‖θ(k−1) + y(i) x(i)‖²
        = ‖θ(k−1)‖² + 2 y(i) θ(k−1)T x(i) + ‖x(i)‖²
        ≤ ‖θ(k−1)‖² + R²
        ≤ kR²
where we have additionally applied the assumption from (b) and then again used simple
induction.
Returning to the definition of the dot product, we have

cos(θ(k), θ∗) = ( (θ(k) · θ∗) / ‖θ∗‖ ) · ( 1 / ‖θ(k)‖ ) ≥ (kγ) · ( 1 / (√k R) ) = √k · (γ / R) .

Since the cosine of an angle is at most 1, this gives 1 ≥ √k · (γ / R), and therefore k ≤ (R/γ)²: the algorithm can make at most (R/γ)² mistakes.
This result endows the margin γ of Dn with an operational meaning: when using the
Perceptron algorithm for classification, at most (R/γ)2 classification errors will be made,
where R is an upper bound on the magnitude of the training vectors.
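As an illustration (not from the notes), a small NumPy experiment on synthetic, linearly separable data can confirm that the mistake count stays below (R/γ)². The data-generating choices here are assumptions made only for the sake of the demo.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data that is linearly separable through the origin with a strictly positive margin.
theta_star = np.array([3.0, -2.0])
X = rng.uniform(-1, 1, size=(200, 2))
X = X[np.abs(X @ theta_star) > 0.2]   # discard points too close to the separator
y = np.sign(X @ theta_star)

# Run the through-origin perceptron, counting every mistake it makes.
theta, mistakes = np.zeros(2), 0
for t in range(1000):
    errs = 0
    for i in range(X.shape[0]):
        if y[i] * (theta @ X[i]) <= 0:
            theta = theta + y[i] * X[i]
            errs += 1
    mistakes += errs
    if errs == 0:   # a clean pass: theta separates the data
        break

gamma = np.min(y * (X @ theta_star) / np.linalg.norm(theta_star))   # margin of theta_star
R = np.max(np.linalg.norm(X, axis=1))                               # magnitude bound
print(mistakes, (R / gamma) ** 2)   # the mistake count never exceeds the (R/gamma)^2 bound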