Lecture Notes - COL341 - 2010
Contents
1 Lecture 1 : Introduction to Machine Learning
2 Lecture 2
  2.1 Solving Least Squares in General (for Linear models)
3 Lecture 3 : Regression
  3.1 Regression
  3.2 Linear regression
  3.3 Least square solution
  3.4 Geometrical interpretation of least squares
7 Lecture 7 : Probability
  7.1 Note
  7.2 Part of speech (POS) example
  7.3 Probability mass function (pmf) and probability density function (pdf)
    7.3.1 Joint distribution function
    7.3.2 Marginalization
  7.4 Example
  7.5 Conditional Density
  7.6 Expectation
    7.6.1 Properties of E(X)
  7.7 Variance
  7.8 Covariance
    7.8.1 Properties of Covariance
  7.9 Chebyshev's Inequality
  7.10 Bernoulli Random Variable
  7.11 Binomial Random Variable
  7.12 Central Limit Theorem
  7.13 Maximum Likelihood Estimator
  7.14 Bayesian estimator
8 Lecture 8
  8.1 Bernoulli Distribution
  8.2 Bayesian Estimation
11 Lecture 11
  11.1 Recall
  11.2 Bayesian Linear Regression
  11.3 Pure Bayesian Regression
  11.4 Sufficient Statistic
  11.5 Lasso
13 Lecture 13
  13.1 Conclude Bias-Variance
    13.1.1 Summary
    13.1.2 Bayesian Linear Regression (BLR)
    13.1.3 General Problems with Standard Distribution
  13.2 Empirical Bayes
    13.2.1 First Approach: Approximate the posterior
    13.2.2 Second Approach: Empirical Bayes
    13.2.3 Solve the eigenvalue equation
16 Lecture 16
  16.1 Introduction
  16.2 Problems of linear regression
    16.2.1 Sensitivity to outliers
    16.2.2 Masking
  16.3 Possible solutions
  16.4 Summary
17 Lecture 17
18 Lecture 18 : Perceptron
  18.1 Fisher's discriminant
  18.2 Perceptron training
    18.2.1 Intuition
19 Lecture 19
  19.1 Introduction
  19.2 Margin
  19.3 Support Vector Machines
  19.4 Support Vectors
  19.5 Objective Design in SVM
    19.5.1 Step 1: Perfect Separability
    19.5.2 Step 2: Optimal Separating Hyperplane for Perfectly Separable Data
    19.5.3 Step 3: Separating Hyperplane for Overlapping Data
23 Lecture 23
  23.1 Sequential Minimal Optimization (SMO)
  23.2 Probabilistic models
25 Lecture 25
  25.1 Exponential Family Distribution
  25.2 Discrete Feature Space
  25.3 Naive Bayes Assumption
  25.4 Graphical Models
  25.5 Graphical Representation of Naive Bayes
  25.6 Graph Factorisation
  25.7 Naive Bayes Text Classification
2 Lecture 2
2.1 Solving Least Squares in General (for Linear models)
Xplot = [0, 0.1, . . . , 2]
Why noise ?
– Since observations are not perfect
– Owing to quantization / precision / rounding
Curve Fitting
Learn f : X → Y such that E(f, X, Y) is minimized. Here the error function E and the form of the function f to be learnt are chosen by the modeler.
Consider one such form of f,
f(x) = w_0 + w_1 x + w_2 x^2 + ... + w_t x^t
If there are m data points, then a polynomial of degree m − 1 can exactly fit the data, since the polynomial has m degrees of freedom (degrees of freedom = number of coefficients). As the degree of the polynomial increases beyond m, the curve becomes more and more wobbly, while still passing through the points. Contrast the degree 10 fit against the degree 5 fit in Figure 2.1. This is due to the problem of overfitting (overspecification).
Now E is a convex function. To optimize it, we need to set ∇w E = 0. The ∇ operator is also
called gradient.
The solution is given by
w = (φ^T φ)^{-1} φ^T Y
If m << t then
φ becomes singular and the solution cannot be found OR
The column vectors in φ become nearly linearly dependent
RMS (root mean square) error is given by:
RMS = √(2E / k)
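As a concrete illustration of the closed-form solution w = (φ^T φ)^{-1} φ^T Y and the RMS error above, here is a minimal numpy sketch. The sine-plus-noise data, the seed, and the chosen degree are illustrative assumptions, not part of the notes.

import numpy as np

# Illustrative data: noisy samples of an assumed underlying curve.
rng = np.random.default_rng(0)
x = np.linspace(0, 2, 21)                  # like Xplot = [0, 0.1, ..., 2]
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.shape)

t = 5                                       # polynomial degree (illustrative)
phi = np.vander(x, t + 1, increasing=True)  # columns 1, x, x^2, ..., x^t

# Least squares solution w = (phi^T phi)^{-1} phi^T y via the normal equations
w = np.linalg.solve(phi.T @ phi, phi.T @ y)

E = 0.5 * np.sum((phi @ w - y) ** 2)        # squared-error objective
rms = np.sqrt(2 * E / len(x))               # RMS = sqrt(2E/k) with k = number of points
print(w, rms)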
Generally, some test data (which potentially could have been part of the training data) is held
out for evaluating the generalized performance of the model. Another held out fraction of the
training data, called the validation dataset is typically used to find the most appropriate degree
tbest for f .
Figure 2: Fit for degree 10 polynomial. Note how wobbly this fit is.
3 Lecture 3 : Regression
This lecture was about regression. It started with formally defining a regression problem. Then a
simple regression model called linear regression was discussed. Different methods for learning the
parameters in the model were next discussed. It also covered least square solution for the problem
and its geometrical interpretation.
3.1 Regression
Suppose there are two sets of variables x ∈ ℝ^n and y ∈ ℝ^k such that x is independent and y is dependent. The regression problem is concerned with determining y in terms of x. Let us assume that we are given m data points D = ⟨x_1, y_1⟩, ⟨x_2, y_2⟩, ..., ⟨x_m, y_m⟩. Then the problem is to determine a function f* such that f*(x) is the best predictor for y, with respect to D. Suppose ε(f, D) is an error function, designed to reflect the discrepancy between the predicted value f(x_0) of y_0 and the actual value y_0 for any ⟨x_0, y_0⟩ ∈ D, then
f* = argmin_{f ∈ F} ε(f, D)
where F denotes the class of functions over which the optimization is performed.
The minimum value of the squared loss is zero. Is it possible to achieve this value? In other words, is ∀j, Σ_{i=1}^p w_i φ_i(x_j) = y_j possible?
Figure 3: Least square solution ŷ is the orthogonal projection of y onto the column space C(φ).
ŷ^T φ = y^T φ
i.e., (φw)^T φ = y^T φ
i.e., w^T φ^T φ = y^T φ
i.e., φ^T φ w = φ^T y
∴ w = (φ^T φ)^{-1} φ^T y
In the last step, note that φ^T φ is invertible only if φ has full column rank.
Theorem: If φ has full column rank, φ^T φ is invertible. A matrix is said to have full column rank if all its column vectors are linearly independent. A set of vectors v_i is said to be linearly independent if Σ_i α_i v_i = 0 ⇒ α_i = 0 for all i.
Proof: Given that φ has full column rank and hence its columns are linearly independent, we have that φx = 0 ⇒ x = 0.
Assume on the contrary that φ^T φ is not invertible. Then ∃ x ≠ 0 such that φ^T φ x = 0.
⇒ x^T φ^T φ x = 0
⇒ (φx)^T φx = ||φx||^2 = 0
⇒ φx = 0. This is a contradiction. Hence the theorem is proved.
4 Lecture 4 : Least Squares Linear Regression
w* = argmin_w Σ_{j=1}^m ( f(x_j, w) − y_j )^2
   = argmin_w Σ_{j=1}^m ( Σ_{i=1}^p w_i φ_i(x_j) − y_j )^2
Define
φ = [ φ_1(x_1) φ_2(x_1) ... φ_p(x_1) ; ... ; φ_1(x_m) φ_2(x_m) ... φ_p(x_m) ],
y = [ y_1, y_2, ..., y_m ]^T,   w = [ w_1, w_2, ..., w_p ]^T
Then
ε = min_w Σ_{j=1}^m ( φ^T(x_j) w − y_j )^2
  = min_w ||φw − y||^2
  = min_w (φw − y)^T (φw − y)
  = min_w ( w^T φ^T φ w − 2 y^T φ w + y^T y )    (2)
Figure 4: 10 level curves for the function f (x1 , x2 ) = x1 ex2 (Figure 4.12 from [1])
Level surfaces are similarly defined for any n-dimensional function f(x_1, x_2, ..., x_n) as the collection of points in the argument space on which the function takes the same value, i.e.,
f(x_1, x_2, ..., x_n) = c
where c is a constant.
Figure 5: 3 level surfaces for the function f(x_1, x_2, x_3) = x_1^2 + x_2^2 + x_3^2 with c = 1, 3, 5. The gradient at (1, 1, 1) is orthogonal to the level surface f(x_1, x_2, x_3) = x_1^2 + x_2^2 + x_3^2 = 3 at (1, 1, 1) (Fig. 4.14 from [1]).
The direction of the gradient vector gives the direction of maximum rate of change of the value of the function at a point. Also, the magnitude of the gradient vector gives that maximum rate of change.
D_v(f) = lim_{h→0} [ f(x + hv) − f(x) ] / h,   with ||v|| = 1
Note: The maximum value of the directional derivative of a function f at any point is always the magnitude of its gradient vector at that point.
Figure 6: The level curves from Figure 4 along with the gradient vector at (2, 0). Note that the gradient vector is perpendicular to the level curve x_1 e^{x_2} = 2 at (2, 0) (Figure 4.13 from [1]).
Refer to Definition 22 and Theorem 58 in the class notes [1] for more details.
4.5 Hyperplane
A hyperplane is a set of points whose direction w.r.t. a point p is orthogonal to a vector v. It can be formally defined as:
H_{v,p} = { q | (p − q)^T v = 0 }
The tangent hyperplane at a point x* on a level surface can be characterized as:
1. The plane consisting of all tangent lines at x* to any parametric curve c(t) on the level surface.
T H_{x*} = { p | (p − x*)^T ∇f(x*) = 0 }
Refer to Definition 24 and Theorem 59 in class notes [1] for more details.
The idea of gradient descent algorithm is based on the fact that if a real-valued function f (x) is
defined and differentiable at a point xk , then f (x) decreases fastest when you move in the direction
of the negative gradient of the function at that point, which is −∇f (x).
Here we describe the method of Gradient Descent Algorithm to find the parameter vector w
which minimizes the error function, ε(w, D)
−∇_w ε = −( 2φ^T φ w − 2φ^T y )
       = 2( φ^T y − φ^T φ w )
so we get the update w^{(k+1)} = w^{(k)} + 2 t^{(k)} ( φ^T y − φ^T φ w^{(k)} ).
In general, the step size is chosen as
t^{(k)} = argmin_t f( w^{(k+1)} )
Refer to section 4.5.1 in the class notes [1] for more details.
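A minimal numpy sketch of the gradient descent update just described, applied to ε(w) = ||φw − y||^2; the synthetic data, the fixed step size, and the iteration count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
m, p = 50, 4
phi = rng.standard_normal((m, p))           # design matrix (illustrative)
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = phi @ w_true + 0.1 * rng.standard_normal(m)

w = np.zeros(p)
# A safe fixed step size: inverse of the largest eigenvalue of the Hessian 2 phi^T phi.
eta = 1.0 / np.linalg.eigvalsh(2 * phi.T @ phi).max()
for k in range(500):
    grad = 2 * (phi.T @ phi @ w - phi.T @ y)   # gradient of the squared error
    w = w - eta * grad                         # move along the negative gradient

print(w)   # approaches the closed-form solution (phi^T phi)^{-1} phi^T y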
If ∇f(x*) = 0 then x* can be a point of local minimum (or maximum). [Necessary Condition]
If ∇²f(x*) is positive (negative) definite then x* is a point of local minimum (maximum). [Sufficient Condition]
∇²f(x*) is positive definite iff
∀x ≠ 0,  x^T ∇²f(x*) x > 0
or, equivalently, all eigenvalues of ∇²f(x*) are positive.
Refer to definition 27, theorem 61 and fig. 4.23, 4.24 in the class notes [1] for more details.
Figure 7: Plot of f(x_1, x_2) = 3x_1^2 − x_1^3 − 2x_2^2 + x_2^4, showing the various local maxima and minima of the function (fig. 4.16 from [1]).
5 Lecture 5 : Convex Functions
5.1 Recap
We recall that the problem was to find w such that
w* = argmin_w ||φw − y||^2    (3)
   = argmin_w ( w^T φ^T φ w − 2 w^T φ^T y + y^T y )    (4)
5.2 Point 1
If ∇f(x*) is defined and x* is a local minimum/maximum, then ∇f(x*) = 0
(a necessary condition) (cf. Theorem 60 [2]).
Given that f(w) = ||φw − y||^2 and ∇f(w) = 2φ^T φ w − 2φ^T y, we would have
∇f(w*) = 0    (7)
⟹ 2φ^T φ w* − 2φ^T y = 0    (8)
⟹ w* = (φ^T φ)^{-1} φ^T y    (9)
5.3 Point 2
Is ∇²f(w*) positive definite?
We have ∇²f(w*) = 2φ^T φ, and if φ has full column rank,
φx = 0 iff x = 0    (14)
∴ if x ≠ 0, x^T ∇²f(w*) x = 2||φx||^2 > 0
An example where φ does not have full column rank: one feature column is an exact multiple of another. This is the simplest form of linear correlation of features, and it is not at all desirable.
5.4 Point 3
Definition of convex sets and convex functions (Cite : Definition 32 and 35)[2]
A hyperplane H is defined as {x | a^T x = b, a ≠ 0}. Let x and y be vectors that belong to the hyperplane. Since they belong to the hyperplane, a^T x = b and a^T y = b. In order to prove the convexity of the set we must show that:
θx + (1 − θ)y ∈ H, where θ ∈ [0, 1] (17)
We have,
∇²f(w) = 2φ^T φ    (23)
So, ||φw − y||^2 is convex, since the domain for w is R^n, which is convex.
Q. Is ||φw − y||^2 strictly convex?
A. Iff φ has full column rank.
5.5 Point 4
To prove: If a function is convex, any point of local minima ≡ point of global minima
Proof - (Cite : Theorem 69)[2]
Thus, w∗ is a point of global minimum. One can also find a solution to (φT φw = φT y) by Gauss
elimination.
5.5.1 Overfitting
Too many bends in the curve (t = 9 onwards) ≡ high values of some w_i's
5.6 Point 5
Q. How to solve constrained problems of the above-mentioned type?
A. General problem format :
Minimize f(w)   s.t.   g(w) ≤ 0    (27)
Figure 10: p-Norm curves for constant norm value and different p
Intuition: If the above didn’t hold, then we would have ∇f (w∗ ) = α1 ∇g(w∗ )+α2 ∇⊥ g(w∗ ), where
by moving in direction ±∇⊥ g(w∗ ), we remain on boundary g(w∗ ) = 0, while decreasing/increasing
value of f, which is not possible at the point of optimality.
6 Lecture 6 : Regularized Solution to Regression Problem
As observed in the last lecture, the objective function, namely f(w) = (Φw − Y)^T(Φw − Y), is strictly convex. Further, the constraint function g(w) = ||w||_2^2 − ξ is also a convex function. For convex g(w), the set S = {w | g(w) ≤ 0} can be proved to be a convex set by taking two elements w_1 ∈ S and w_2 ∈ S such that g(w_1) ≤ 0 and g(w_2) ≤ 0. Since g(w) is a convex function, we have the following inequality:
g(θw_1 + (1 − θ)w_2) ≤ θ g(w_1) + (1 − θ) g(w_2) ≤ 0;   ∀θ ∈ [0, 1], w_1, w_2 ∈ S    (32)
As g(θw_1 + (1 − θ)w_2) ≤ 0 for all θ ∈ [0, 1] and all w_1, w_2 ∈ S, we have θw_1 + (1 − θ)w_2 ∈ S, which is both sufficient and necessary for S to be a convex set. Hence, the function g(w) imposes a convex constraint over the solution space.
This fact can be easily visualized from Figure 12. As we can see, the first condition occurs when the minimum lies on the boundary of the constraint function g. In this case, the gradient vectors of f and g point in the same direction (up to a constant).
Figure 12: Two plausible scenarios for the minimum to occur: a) when the minimum is on the constraint function boundary, in which case the gradients point in the same direction up to a constant, and b) when the minimum is inside the constraint region (shown in yellow shade), in which case ∇f(w*) = 0.
The Lagrange dual function is given as the infimum of the Lagrangian over w, for given λ ∈ R^m, µ ∈ R^p:
z(λ, µ) = inf_w L(w, λ, µ)    (36)
The dual function always yields a lower bound on the optimal value of the primal formulation. The dual function is used to characterize the duality gap, which reflects the suboptimality of a solution. The duality gap is the gap between the primal and dual objectives, f(w) − z(λ, µ). When
functions f and gi , ∀i ∈ [1, m] are convex and hj , ∀j ∈ [1, p] are affine, Karush-Kuhn-Tucker (KKT)
conditions are both necessary and sufficient for points to be both primal and dual optimal with zero
duality gap. For above mentioned formulation of the problem, KKT conditions for all differentiable
functions (i.e. f, gi , hj ) with ŵ primal optimal and (λ̂, µ̂) dual optimal point may be given in the
following manner:
1. ∇f(ŵ) + Σ_{i=1}^m λ̂_i ∇g_i(ŵ) + Σ_{j=1}^p μ̂_j ∇h_j(ŵ) = 0
2. gi (ŵ) ≤ 0; i = 1, 2, . . . , m
3. λ̂i ≥ 0; i = 1, 2, . . . , m
4. λ̂i gi (ŵ) = 0; i = 1, 2, . . . , m
5. hj (ŵ) = 0; j = 1, 2, . . . , p
Thus values of w∗ and λ which satisfy all these equations would yield an optimum solution.
Consider equation (39),
w∗ = (ΦT Φ + λI)−1 ΦT y
Premultiplying with (ΦT Φ + λI) on both sides we have,
(ΦT Φ + λI)w∗ = ΦT y
∴ (ΦT Φ)w∗ + (λI)w∗ = ΦT y
∴ k(ΦT Φ)w∗ + (λI)w∗ k = kΦT yk
By the triangle inequality,
||(Φ^T Φ)w*|| + λ||w*|| ≥ ||(Φ^T Φ)w* + (λI)w*|| = ||Φ^T y||    (43)
Now, (Φ^T Φ) is an n×n matrix which can be determined since Φ is known, so ||(Φ^T Φ)w*|| ≤ α||w*|| for some finite α. Substituting in the previous equation,
(α + λ)||w*|| ≥ ||Φ^T y||
i.e.
λ ≥ ||Φ^T y|| / ||w*|| − α    (44)
Note that when ||w*|| → 0, λ → ∞. This is expected, as a higher value of λ focuses more on reducing ||w*|| than on minimizing the error function. Since
||w*||^2 ≤ ξ,
eliminating ||w*|| from equation (44) we get
λ ≥ ||Φ^T y|| / √ξ − α    (45)
This is not the exact solution for λ, but the bound (45) proves the existence of λ for some ξ and Φ.
Figure 13: RMS error Vs degree of polynomial for test and train data.
For the same λ these two solutions are the same. This form of regression is known as ridge regression. If we use the L1 norm instead, it is called the 'Lasso'. Note that the closed form for w* that we derived is valid only for the L2 norm.
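A minimal numpy sketch of the ridge-regression solution w* = (Φ^T Φ + λI)^{-1} Φ^T y discussed above; the data and the value of λ are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.shape)

degree = 9
Phi = np.vander(x, degree + 1, increasing=True)

def ridge(Phi, y, lam):
    # w* = (Phi^T Phi + lam I)^{-1} Phi^T y
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)

w_unreg = ridge(Phi, y, 0.0)    # ordinary least squares (may be ill-conditioned)
w_ridge = ridge(Phi, y, 1e-3)   # regularized: coefficients stay much smaller
print(np.abs(w_unreg).max(), np.abs(w_ridge).max())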
Figure 14: RMS error Vs 10λ for test and train data (at Polynomial degree = 6).
Let B1 , B2 , ..., Bn be a set of mutually exclusive events that together form the sample space S. Let
A be any event from the same sample space, such that P (A) > 0. Then,
Pr(B_i | A) = Pr(B_i ∩ A) / [ Pr(B_1 ∩ A) + Pr(B_2 ∩ A) + · · · + Pr(B_n ∩ A) ]    (48)
Using the relation Pr(B_i ∩ A) = Pr(B_i) · Pr(A | B_i),
Pr(B_i | A) = Pr(B_i) · Pr(A | B_i) / Σ_{j=1}^n Pr(B_j) · Pr(A | B_j)    (49)
Example 1. A lab test is 99% effective in detecting a disease when in fact it is present. However,
the test also yields a false positive for 0.5% of the healthy patients tested. If 1% of the population
has that disease, then what is the probability that a person has the disease given that his/her test is
positive?
Pr(D | T) = (0.01 × 0.99) / (0.01 × 0.99 + 0.99 × 0.005) = 2/3
What does this mean? It means that there is only a 2/3 (about 66.7%) chance that a person with a positive test result actually has the disease. For a test to be good we would have expected higher certainty.
So, despite the fact that the test is 99% effective for a person actually having the disease, the false
positives reduce the overall usefulness of the test.
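The posterior in the example above can be checked numerically with Bayes' rule (49); a minimal sketch:

# Bayes' rule for the lab-test example: Pr(D|T) = Pr(T|D)Pr(D) / sum_j Pr(T|B_j)Pr(B_j)
p_disease = 0.01              # prior Pr(D)
p_pos_given_disease = 0.99    # sensitivity
p_pos_given_healthy = 0.005   # false positive rate

numerator = p_pos_given_disease * p_disease
evidence = numerator + p_pos_given_healthy * (1 - p_disease)
print(numerator / evidence)   # 0.666..., i.e. 2/3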
Two events E1 and E2 are called independent iff their probabilities satisfy
P(E_1 E_2) = P(E_1) · P(E_2)    (51)
where P(E_1 E_2) means P(E_1 ∩ E_2).
In general, events belonging to a set are called mutually independent iff, for every finite subset E_1, · · · , E_n of this set,
Pr( ∩_{i=1}^n E_i ) = Π_{i=1}^n Pr(E_i)    (52)
7 Lecture 7 : Probability
This lecture gives an overview of probability theory. It discusses distribution functions; the notions of expectation and variance; Bernoulli and binomial random variables; and the central limit theorem.
7.1 Note
Pr - probability in general of an event
Solution
Let A_k be the event that the set contains POS type k. Then
Pr(A_k) = 1 − (1 − p_k)^n
where (1 − p_k)^n is the probability that none of the n words is of POS type k.
Pr(A_noun | A_verb) = Pr(A_noun ∩ A_verb) / Pr(A_verb)
pdf: A probability density function of a continuous random variable is a function that describes the relative likelihood of this random variable taking a value at a given point in the observation space (source: Wikipedia).
Pr(X ∈ D) = ∫_D p(x) dx
where D is a set of reals and p(x) is the density function.
For the discrete case, i.e. when p(x, y) is a joint pmf:
F(a, b) = Σ_{x ≤ a} Σ_{y ≤ b} p(x, y)
7.3.2 Marginalization
Marginal probability is the unconditional probability P(A) of the event A; that is, the probability of A regardless of whether event B did or did not occur. If B can be thought of as the event of a random variable X having a given outcome, the marginal probability of A can be obtained by summing (or, more generally, integrating) the joint probabilities over all outcomes of X. For example, if there are two possible outcomes for X with corresponding events B and B', this means that P(A) = P(A ∩ B) + P(A ∩ B'). This is called marginalization.
Discrete case:
P(X = a) = Σ_y p(a, y)
Continuous case:
p_X(a) = ∫_{−∞}^{∞} p(a, y) dy
7.4 Example
Statement: X and Y are independent continuous random variables with the same density function
p(x) = e^{−x} if x > 0;  0 otherwise.
Find the density of X/Y.
Note: X and Y are independent.
Solution
F_{X/Y}(a) = Pr(X/Y ≤ a)
           = ∫_0^∞ ∫_0^{ya} p(x, y) dx dy
           = ∫_0^∞ ∫_0^{ya} e^{−x} e^{−y} dx dy
           = 1 − 1/(a + 1)
           = a/(a + 1)
7.5 Conditional Density
Continuous case:
p_{X|Y}(x | Y = y) = p_{X,Y}(x, y) / p_Y(y),   where p_Y(y) = ∫_{−∞}^{∞} p_{X,Y}(x, y) dx
7.6 Expectation
Discrete case: Expectation is the probability-weighted sum of possible values.
E(X) = Σ_i x_i Pr(x_i), where X is a random variable.
Continuous case:
E(X) = ∫_{−∞}^{∞} x p(x) dx
7.6.1 Properties of E(X)
E[cX] = cE[X]
7.7 Variance
For any random variable X, variance is defined as follows:
V ar[X] = E[(X − µ)2 ]
⇒ V ar[X] = E[X 2 ] − 2µE[X] + µ2
⇒ V ar[X] = E[X 2 ] − (E[X])2
V ar[αX + β] = α2 V ar[X]
7.8 Covariance
For random variables X and Y, covariance is defined as:
Cov[X, Y ] = E[(X − E(X))(Y − E(Y ))] = E[XY ] − E[X]E[Y ]
If X and Y are independent then their covariance is 0, since in that case
E[XY ] = E[X]E[Y ]
However, covariance being 0 does not necessarily imply that the variables are independent.
Cov[Σi Xi , Y ] = Σi Cov[Xi , Y ]
Cov[X, X] = V ar[X]
If n tends to infinity, then the data mean tends to converge to µ, giving rise to the weak law
of large numbers.
Pr[X = k] = (n choose k) q^k (1 − q)^{n−k}
An example of Binomial distribution can be a coin tossed n times and counting the number of
times heads shows up.
By the central limit theorem,
( X_1 + X_2 + ... + X_n − nµ ) / √(nσ^2)  ∼  N(0, 1)
Sample mean: µ̂ ∼ N(µ, σ^2/m)
µ̂ is an estimator for µ.
L(X̂_1, X̂_2, ..., X̂_n | q) = q^{X̂_1} (1 − q)^{1−X̂_1} ... q^{X̂_n} (1 − q)^{1−X̂_n}
For a Bernoulli random variable,
p(x) = µ^x (1 − µ)^{1−x}
D = X_1, X_2, ..., X_m is a random sample
L(D | µ) = Π_i µ^{x_i} (1 − µ)^{1−x_i}
GOAL:
µ̂_MLE = argmax_µ L(D | µ)   (equivalently, argmax_µ LL(D | µ))
By the CLT,
( Σ_i X_i − mµ ) / √(mσ^2) ∼ N(0, 1)   and   µ̂_MLE ∼ N(µ, σ^2/m)
8 Lecture 8
8.1 Bernoulli Distribution
The general formula for the probability of a Bernoulli random variable x is
p(x) = µ^x (1 − µ)^{1−x}
The likelihood of the data given µ is
L(D | µ) = Π_{i=1}^m µ^{x_i} (1 − µ)^{1−x_i}
Our goal is to find the maximum likelihood estimate µ̂_MLE = argmax_µ L(D | µ).
Thus,
( Σ_{i=1}^m X_i − mµ ) / (σ√m) ∼ N(0, 1)
and
µ̂_MLE ∼ N(µ, σ^2/m)
Question. Given an instantiation (x_1, x_2, ..., x_m) of (X_1, X_2, ..., X_m), called training data D, the maximum likelihood estimate is
µ̂_MLE = Σ_i x_i / m
(a point estimate). How confident can you be that the actual µ is within Σ_i x_i / m ± Z for some Z?
Pr( µ ∈ ( µ̂_MLE ± Z √( µ̂(1 − µ̂)/m ) ) ) = 1 − 2α
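A minimal sketch of the point estimate Σ_i x_i / m and the normal-approximation interval µ̂ ± Z√(µ̂(1 − µ̂)/m); the synthetic data and the confidence level are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
mu_true = 0.3
x = rng.binomial(1, mu_true, size=200)    # Bernoulli training data (illustrative)

m = len(x)
mu_mle = x.sum() / m                      # maximum likelihood estimate
z = 1.96                                  # Z for a 95% interval (alpha = 0.025)
half_width = z * np.sqrt(mu_mle * (1 - mu_mle) / m)
print(mu_mle, (mu_mle - half_width, mu_mle + half_width))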
Assume a uniform prior p(µ) = θ over [0, 1]. Then
∫_0^1 p(µ) dµ = 1  ⇒  θ = 1
The posterior (Bayesian posterior) is:
p(µ | X_1, X_2, ..., X_m) = L(x_1, x_2, ..., x_m | µ) p(µ) / ∫_0^1 L(x_1, x_2, ..., x_m | µ) p(µ) dµ
 = µ^{Σx_i} (1 − µ)^{Σ(1−x_i)} · 1 / ∫_0^1 µ^{Σx_i} (1 − µ)^{Σ(1−x_i)} dµ
 = (m + 1)! / ( (Σx_i)! (m − Σx_i)! ) · µ^{Σx_i} (1 − µ)^{Σ(1−x_i)}
The expectation under the posterior is
E(µ | x_1, x_2, ..., x_m) = ∫ µ p(µ | x_1, x_2, ..., x_m) dµ = ( Σ_{i=1}^m x_i + 1 ) / (m + 2)
For example, E(µ | 1, 1) = (2 + 1)/(2 + 2) = 3/4
µ̂_B = E[µ | D] = ∫ µ p(µ | D) dµ
Beta(µ | a, b) = Γ(a + b) / ( Γ(a)Γ(b) ) · µ^{a−1} (1 − µ)^{b−1}
Beta is the conjugate prior to the Bernoulli distribution, and Γ(a+b)/(Γ(a)Γ(b)) is a normalization constant.
L(x_1 ... x_n | µ) = µ^{Σ_{i=1}^n x_i} (1 − µ)^{Σ_{i=1}^n (1 − x_i)}
The prior should have the same form as the likelihood, up to a normalization constant:
p(µ | D) ∝ L(D | µ) p(µ)
If the prior is µ ∼ Beta(µ | a, b), then
p(µ | x_1 ... x_n) ∝ L(x_1 ... x_n | µ) Beta(µ | a, b)
 = Γ(n + a + b) / ( Γ(Σ_{i=1}^n x_i + a) Γ(Σ_{i=1}^n (1 − x_i) + b) ) · µ^{Σ_{i=1}^n x_i + a − 1} (1 − µ)^{Σ_{i=1}^n (1 − x_i) + b − 1}
 ≈ Beta( Σ_{i=1}^n x_i + a, Σ_{i=1}^n (1 − x_i) + b )
E_{Beta(a,b)}[µ] = a / (a + b)
E_{Beta(a + Σ_{i=1}^n x_i, b + Σ_{i=1}^n (1 − x_i))}[µ | x_1 ... x_n] = ( Σ_{i=1}^n x_i + a ) / ( Σ_{i=1}^n x_i + Σ_{i=1}^n (1 − x_i) + a + b )
For large n (n ≫ a and b),
µ̂_Bayes → µ̂_MLE
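A minimal sketch of the conjugate Beta update derived above: the posterior is Beta(a + Σx_i, b + Σ(1 − x_i)), and its mean approaches the MLE as the sample grows. The prior values a, b and the data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)
x = rng.binomial(1, 0.7, size=50)      # Bernoulli observations (illustrative)

a, b = 2.0, 2.0                        # Beta(a, b) prior (illustrative)
a_post = a + x.sum()                   # a + sum_i x_i
b_post = b + (1 - x).sum()             # b + sum_i (1 - x_i)

posterior_mean = a_post / (a_post + b_post)
mle = x.mean()
print(posterior_mean, mle)             # posterior mean shrinks the MLE toward a/(a+b)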
Question: What will be the conjugate prior (with parameters α_i) for the parameters of the multinomial?
Answer: A joint distribution over (µ_1, ..., µ_n):
P(µ_1, ..., µ_n | α_1, ..., α_n) ∝ Π_{i=1}^n µ_i^{α_i − 1}    (53)
In general, a density can be normalized as P(x) = f(x) / ∫ f(x) dx. Since
∫ P(µ_1, ..., µ_n | α_1, ..., α_n) dµ = 1,
by integrating we get the normalisation constant:
P(µ_1, ..., µ_n | α_1, ..., α_n) = Γ(Σ_{i=1}^n α_i) / Π_{i=1}^n Γ(α_i) · Π_{i=1}^n µ_i^{α_i − 1}    (54)
which is the Dirichlet distribution Dir(α_1, ..., α_n).
9.0.2 Summary
The posterior is Dir(..., α_i + Σ_{k=1}^m X_{k,i}, ...).
The expectation of µ for Dir(α_1, ..., α_n) is given by:
E[µ]_{Dir(α_1...α_n)} = ( α_1 / Σ_i α_i, ..., α_n / Σ_i α_i )    (57)
The expectation of µ for Dir(..., α_i + Σ_{k=1}^m X_{k,i}, ...) is given by:
E[µ]_{Dir(...α_i + Σ_k X_{k,i}...)} = ( (α_1 + Σ_k X_{k,1}) / (Σ_i α_i + m), ..., (α_j + Σ_k X_{k,j}) / (Σ_i α_i + m), ... )    (58)
Observations:
α_1 = ... = α_n = 1 ⇒ uniform distribution
As m → ∞, µ̂_Bayes → µ̂_MLE
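A minimal sketch of the Dirichlet update in (58): the posterior pseudo-counts are the α_i plus the observed counts, and the posterior mean is their normalized value. The prior α and the counts are illustrative assumptions.

import numpy as np

alpha = np.array([1.0, 1.0, 1.0])      # Dir(1,1,1): uniform prior over the simplex
counts = np.array([12, 5, 3])          # observed counts sum_k X_{k,i} (illustrative)

alpha_post = alpha + counts            # posterior is Dir(alpha + counts)
posterior_mean = alpha_post / alpha_post.sum()   # (alpha_i + counts_i) / (sum alpha + m)
mle = counts / counts.sum()
print(posterior_mean, mle)             # the two coincide as the counts grow large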
Let us denote I(X=x) as the measure of information conveyed in knowing value of X=x.
Figure 15: Figure showing curve where Information is not distributed all along.
Question: Consider the following two graphs. Say you know the probability function p(x); in which case is knowing the value of X more useful (i.e., carries more information)?
Ans: It is more useful in case (2), because more information is conveyed in Figure 15 than in Figure 16.
I(p(x))<I(p(y))
There is only one function which satisfies the above two properties.
9.1.3 Observations:
For Discrete random variable(∼ countable domain), the information is maximum for Uniform
distribution.
For a continuous random variable (with finite mean and finite variance), the information is maximum for the Gaussian distribution:
p(x) = 1/√(2πσ^2) · e^{ −(x−µ)^2 / (2σ^2) }
If X ∼ N(µ, σ^2),
p(x) = 1/(σ√(2π)) · e^{ −(x−µ)^2 / (2σ^2) },   −∞ < x < ∞
Recall that for the moment generating function φ(t) = E[e^{tX}],
E(X) = φ'(0) and E(X^2) = φ''(0), from which Var(X) = φ''(0) − φ'(0)^2.
For a linear transformation,
E_{N(µ,σ^2)}[ e^{t(w_1 x + w_0)} ] = exp( w_1 µ t + w_0 t + (σt)^2 w_1^2 / 2 ),  i.e.,  w_1 X + w_0 ∼ N(w_1 µ + w_0, w_1^2 σ^2)
The sum of i.i.d. X_1, X_2, ..., X_n ∼ N(µ, σ^2) is also normal (Gaussian):
X_1 + X_2 + ... + X_n ∼ N(nµ, nσ^2)
In general, if X_i ∼ N(µ_i, σ_i^2) then Σ_{i=1}^n X_i ∼ N( Σ µ_i, Σ σ_i^2 )
(take w_1 = 1/σ and w_0 = −µ/σ to standardize)
µ̂_MLE = argmax_µ Π_{i=1}^m [ 1/(σ√(2π)) e^{ −(X_i−µ)^2 / (2σ^2) } ]
       = argmax_µ ( 1/(σ√(2π)) )^m e^{ − Σ_i (X_i−µ)^2 / (2σ^2) }
µ̂_MLE = Σ_{i=1}^m X_i / m = sample mean
(without relying on the central limit theorem, using properties (1) and (2))
Similarly,
σ̂^2_MLE = Σ_{i=1}^m (X_i − µ̂_MLE)^2 / m
follows a (scaled) χ^2 distribution; for standard normal X_i, Σ_i X_i^2 ∼ χ^2_m with m degrees of freedom.
⇒ µ ∼ N(µ_0, σ_0^2)
⇒ σ^2 ∼ Γ
µ̂_ML = Σ_{i=1}^m X_i / m
σ̂^2_ML = Σ_{i=1}^m (X_i − µ)^2 / m
Here σ̂^2_ML follows the chi-squared distribution.
Figure 10: Figure showing the nature of the chi-squared distribution of σ̂^2_ML.
LL(D | µ, σ) = − Σ_{i=1}^m (X_i − µ)^2 / (2σ^2) − m ln(σ√(2π))
i.e., L(D | µ, σ) = ( 1/(σ√(2π)) )^m Π_{i=1}^m exp( −(X_i − µ)^2 / (2σ^2) )
(Note: In pattern recognition, (x − µ)^T Σ^{-1} (x − µ) is called the Mahalanobis distance between x and µ. If Σ = I, it is the Euclidean distance.)
ii) If Σ is symmetric,
Σ = Σ_{i=1}^n λ_i q_i q_i^T,  where the q_i's are orthogonal.
Then Σ^{-1} = Σ_{i=1}^n (1/λ_i) q_i q_i^T
Here the joint distribution has been decomposed as a product of marginals in a shifted and ro-
tated co-ordinate system.
Set ∇_µ LL = 0, ∇_Σ LL = 0.
∇_µ LL = Σ^{-1} Σ_i (x_i − µ) = 0
Since Σ is invertible,
Σ_i (x_i − µ) = 0
i.e., µ̂_ML = Σ_i x_i / m
Σ̂_ML = (1/m) Σ_{i=1}^m (x_i − µ̂_ML)(x_i − µ̂_ML)^T
Here Σ̂_ML is called the empirical covariance matrix in statistics.
µ̂_ML ∼ N(µ, Σ/m),  so E[µ̂_ML] = µ
Question: If ε ∼ N(0, σ^2) and y = w^T φ(x) + ε, then
p(y | x, w) = 1/√(2πσ^2) · exp( −(y − φ^T(x) w)^2 / (2σ^2) )
E[Y | w, x] = w^T φ(x) = w_0 + w_1 φ_1(x) + ... + w_n φ_n(x)
The conjugate prior for the Gaussian distribution with known σ^2 is itself Gaussian. If
P(x) ∼ N(µ, σ^2) and the prior is µ ∼ N(µ_0, σ_0^2), then
P(µ | x_1 ... x_m) = N(µ_m, σ_m^2)
µ_m = ( σ^2 / (mσ_0^2 + σ^2) ) µ_0 + ( mσ_0^2 / (mσ_0^2 + σ^2) ) µ̂_ML
1/σ_m^2 = 1/σ_0^2 + m/σ^2
For the multivariate case,
P(w) ∼ N(µ_0, Σ_0)
P(w | x_1 ... x_m) = N(µ_m, Σ_m)
µ_m = Σ_m ( Σ_0^{-1} µ_0 + σ^{-2} φ^T y )
Σ_m^{-1} = Σ_0^{-1} + σ^{-2} φ^T φ
11 Lecture 11
11.1 Recall
For Bayesian estimation in the univariate case with fixed σ, where µ ∼ N(µ_0, σ_0^2) and x ∼ N(µ, σ^2):
1/σ_m^2 = m/σ^2 + 1/σ_0^2
µ_m/σ_m^2 = (m/σ^2) µ̂_mle + µ_0/σ_0^2
such that p(x | D) ∼ N(µ_m, σ_m^2). The m/σ^2 term is due to noise in the observations, while 1/σ_0^2 is due to uncertainty in µ. For the Bayesian setting in the multivariate case with fixed Σ:
y = w^T φ(x) + ε
ε ∼ N(0, σ^2)
w ∼ N(0, Σ_0)
ŵ_MLE = (φ^T φ)^{-1} φ^T y
Finding µ_m and Σ_m:
Σ_m^{-1} µ_m = Σ_0^{-1} µ_0 + φ^T y / σ^2
Σ_m^{-1} = Σ_0^{-1} + (1/σ^2) φ^T φ
Setting Σ_0 = αI and µ_0 = 0,
Σ_m^{-1} µ_m = φ^T y / σ^2
Σ_m^{-1} = (1/α) I + φ^T φ / σ^2
µ_m = ( (1/α) I + φ^T φ / σ^2 )^{-1} φ^T y / σ^2
But since σ^2/α is nothing but the λ in ridge regression, this can be written as
µ_m = ( λI + φ^T φ )^{-1} φ^T y
Σ_m^{-1} = (1/α) I + φ^T φ / σ^2
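A minimal numpy sketch of the posterior computation above with Σ_0 = αI and µ_0 = 0: Σ_m^{-1} = I/α + φ^Tφ/σ^2 and µ_m = Σ_m φ^T y/σ^2, whose mean coincides with ridge regression for λ = σ^2/α. The data, α, and σ are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(6)
m, p = 30, 3
phi = rng.standard_normal((m, p))
w_true = np.array([0.5, -1.0, 2.0])
sigma = 0.3
y = phi @ w_true + sigma * rng.standard_normal(m)

alpha = 10.0                                   # prior scale, Sigma_0 = alpha I
Sm_inv = np.eye(p) / alpha + phi.T @ phi / sigma**2
Sm = np.linalg.inv(Sm_inv)
mu_m = Sm @ (phi.T @ y) / sigma**2             # posterior mean of w

lam = sigma**2 / alpha                         # equivalent ridge parameter
w_ridge = np.linalg.solve(phi.T @ phi + lam * np.eye(p), phi.T @ y)
print(mu_m, w_ridge)                           # the two coincide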
Estimators and the corresponding predictive distribution p(x | D):
MLE:             θ̂_MLE = argmax_θ LL(D | θ);   predict with p(x | θ̂_MLE)
Bayes estimator: θ̂_B = E_{p(θ|D)}[θ];          predict with p(x | θ̂_B)
MAP:             θ̂_MAP = argmax_θ p(θ | D);    predict with p(x | θ̂_MAP)
Pure Bayesian:   p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ,  with p(D | θ) = Π_{i=1}^m p(x_i | θ);
                 predict with p(x | D) = ∫ p(x | θ) p(θ | D) dθ
Thus, we see that for the normal distribution, p(D|µ) = g(s, µ)h(D).
11.5 Lasso
We have Y = w^T φ(x) + ε where ε ∼ N(0, σ^2). Here w follows a Laplace prior. It then turns out that the MAP estimate with a Laplace prior is
ŵ_MAP = argmin_w Σ_{i=1}^m ||w^T φ(x_i) − y_i||^2 + λ ||w||_{l1}
Here λ||w||_{l1} is the penalty function; this formulation is called the lasso. Recall that ŵ_MAP with a Gaussian prior is
ŵ_MAP = argmin_w Σ_{i=1}^m ||w^T φ(x_i) − y_i||^2 + λ ||w||_2^2
The minimum expected loss can be obtained when f(x) equals E[y|x]. The minimum expected loss is given by
E_{f,P(x,y)}[L] = ∫∫ ( E[y|x] − y )^2 P(x, y) dx dy    (67)
This is the minimum loss we can expect for given training data. Now let us find the expected loss over different training data. Let
E_{TD}[f(x, TD)] = ∫_{TD} f(x, TD) p(TD) dTD
Earlier we found that the only tweakable component in the expected loss is (f(x) − E[y|x])^2. Now we will find the expected loss over all training data by taking the expectation of this expression over the distribution of training data:
∫_{TD} ( f(x, TD) − E[y|x] )^2 p(TD) dTD = E_{TD}[ ( f(x, TD) − E[y|x] )^2 ]
 = E_{TD}[ ( f(x, TD) − E_{TD}[f(x, TD)] + E_{TD}[f(x, TD)] − E[y|x] )^2 ]
 = E_{TD}[ ( f(x, TD) − E_{TD}[f(x, TD)] )^2 + ( E_{TD}[f(x, TD)] − E[y|x] )^2
           − 2 ( f(x, TD) − E_{TD}[f(x, TD)] ) ( E_{TD}[f(x, TD)] − E[y|x] ) ]
Since E_{TD}[ f(x, TD) ] = E_{TD}[f(x, TD)] and the other factors are independent of TD, the third term vanishes. Finally,
∫_{TD} ( f(x, TD) − E[y|x] )^2 p(TD) dTD = E_{TD}[ ( f(x, TD) − E_{TD}[f(x, TD)] )^2 ] + ( E_{TD}[f(x, TD)] − E[y|x] )^2
 = Variance + Bias^2
where the variance of f(x, TD) is E_{TD}[ ( f(x, TD) − E_{TD}[f(x, TD)] )^2 ] and the bias is E_{TD}[f(x, TD)] − E[y|x].
Putting this back in (66), the expected loss = Variance + Bias^2 + Noise.
Let us try to understand what this means. Consider the case of regression. The loss of the prediction depends on many factors, such as the complexity of the model (linear, ...), the parameters, the measurements, etc. Noise in the measurements can cause loss in prediction; that is given by the third term. Similarly, the complexity of the model can contribute to the loss.
If we were to use linear regression with a low-degree polynomial, we are introducing a bias: the assumption that the dependency of the predicted variable is simple. Similarly, when we add a regularizer term, we are implicitly saying that the weights are not big, which is also a kind of bias. The predictions we obtain may not be accurate, but they do not have much correlation with the particular sample points we took. So the predictions remain more or less the same over different samples; that is, for different samples the predicted values do not vary much. The prediction is more generalizable over samples.
Suppose we complicate our regression model by increasing the degree of the polynomial used. As we have seen in previous classes, we then obtain a highly wobbly curve which passes through almost all points in the training data. This is an example of low bias. For a given training dataset our prediction could be very good. If we were to take another sample of data we would obtain another curve which passes through all the new points but differs drastically from the current curve. Our predictions are accurate for the training sample chosen, but at the same time they are highly correlated with the sample we have chosen. Across different training datasets, the variance of the prediction is very high. So the model is not generalizable over samples.
We saw that when we decrease the bias the variance increases, and vice versa. The more complex the model is, the less bias and the more variance we have. The two are contrary to each other. The ideal complexity of the model should be related to the complexity of the actual relation between the dependent and independent variables.
I recommend the reference [4] for a good example.
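A minimal simulation sketch of the bias-variance behaviour described above: fit many training samples with a low-degree and a high-degree polynomial and compare the bias and spread of the predictions at a fixed test point. The data-generating function, noise level, degrees, and test point are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(7)
x_train = np.linspace(0, 1, 12)
x_test = 0.25
f_true = lambda x: np.sin(2 * np.pi * x)        # assumed true regression function

def fit_predict(degree):
    preds = []
    for _ in range(200):                         # 200 different training sets TD
        y = f_true(x_train) + 0.3 * rng.standard_normal(x_train.shape)
        Phi = np.vander(x_train, degree + 1, increasing=True)
        w = np.linalg.lstsq(Phi, y, rcond=None)[0]
        preds.append(np.polyval(w[::-1], x_test))
    preds = np.array(preds)
    bias = preds.mean() - f_true(x_test)
    variance = preds.var()
    return bias, variance

print("degree 1:", fit_predict(1))   # larger bias, small variance
print("degree 9:", fit_predict(9))   # small bias, large variance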
13 Lecture 13
These are the topics discussed in today’s lecture:
1. Conclude Bias-Variance
2. Shrinkage - Best Subset
3. Mixture models
4. Empirical Bayes
In the above equation, TD represents the random sample, and x, y represent the data distribution. E[y|x] is the optimal predictor with respect to least squares. E_{TD}[f(x, TD)] plays a role similar to an expectation under a posterior distribution, but there is no notion of f(x, TD) itself being a posterior estimate.
Σ_m^{-1} = (1/α) I + (1/σ^2) φ^T φ
From this equation, if two points are far away from the mean but close to each other, then φ^T φ will increase and Σ_m^{-1} will also increase, so the variance will decrease. This can also be interpreted as: points which are far apart contribute positively by giving less variance, i.e., less uncertainty.
Mixture of Gaussians
p(x) = Σ_{i=1}^n α_i p_i(x | z = i)    (each p_i(x) is a different distribution)
Σ_{i=1}^n α_i = 1
Example: a mixture of Gaussians:
X ∼ N(µ_i, Σ_i),   p(x) = Σ_i α_i p(x | µ_i, Σ_i)
Issues
1. Number of components K
2. Estimating the µ_i's and Σ_i's
3. Estimating the α_i's
Classification Perspective
Assume the data is labeled:
(X_1, 1), (X_2, 1), ..., (X_m, 3)
(z is a class label)
⟹ If the z values are given, it is a classification problem.
If only the data
X_1, X_2, ..., X_m
is given, we have to estimate using hidden variables, as z is not explicitly given (the EM algorithm, which can be shown to converge).
Target
1. Implicit estimation of z via E[z | x_i]
2. Estimate the µ_i's with E[z | x_i] in place of z_i, i.e., using
(X_1, E[z | x_1]), (X_2, E[z | x_2]), ..., (X_m, E[z | x_m])
With initial guesses for the parameters of our mixture model, a "partial membership" of each data point in each constituent distribution is computed by calculating expectation values for the membership variables of each data point. That is, for each data point x_j and distribution Y_i, the membership value y_{i,j} is:
y_{i,j} = a_i f_{Y_i}(x_j; θ_i) / f_X(x_j)
With expectation values in hand for group membership, "plug-in estimates" are recomputed for the distribution parameters. The mixing coefficients a_i are the arithmetic means of the membership values over the N data points:
a_i = (1/N) Σ_{j=1}^N y_{i,j}
The component model parameters θ_i are also recomputed by expectation maximization, using data points x_j weighted by the membership values. For example, if θ is a mean µ,
µ_i = Σ_j y_{i,j} x_j / Σ_j y_{i,j}
With new estimates for a_i and the θ_i's, the expectation step is repeated to recompute new membership values. The entire procedure is repeated until the model parameters converge.
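A minimal numpy sketch of the EM iteration just described, for a two-component 1-D Gaussian mixture; the synthetic data and the initial guesses are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
N = len(x)

# Initial guesses for mixing weights, means, variances (illustrative).
a = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: membership values y_{i,j} = a_i f_i(x_j) / f(x_j)
    f = np.stack([a[i] * gauss(x, mu[i], var[i]) for i in range(2)])   # shape (2, N)
    y_mem = f / f.sum(axis=0, keepdims=True)
    # M-step: plug-in estimates weighted by memberships
    Nk = y_mem.sum(axis=1)
    a = Nk / N
    mu = (y_mem * x).sum(axis=1) / Nk
    var = (y_mem * (x - mu[:, None]) ** 2).sum(axis=1) / Nk

print(a, mu, var)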
Pr(y | D) = Pr(y | ⟨y_1, φ(x_1)⟩, ⟨y_2, φ(x_2)⟩, ..., ⟨y_n, φ(x_n)⟩)
          = ∫∫∫ Pr(y | w, σ^2) Pr(w | ȳ, α, σ^2) Pr(α, σ^2 | ȳ) dw dα dσ^2
where ȳ is the data D.
The first approach involves approximating the posterior, i.e., the second term Pr(w | ȳ, α, σ^2), by w_MAP, i.e., by the mode of the posterior distribution of w, which is Gaussian. Note that as the number of data points keeps increasing, φ^T φ keeps increasing; hence from the relation
Σ_m^{-1} = Σ_0^{-1} + (1/σ^2) φ^T φ
it is clear that the posterior variance decreases, hence the distribution of w peaks.
≈ ∫ Pr(y | w, σ^2) Pr(w | ȳ, α̂, σ̂^2) dw,  for the chosen α̂ and σ̂^2
≈ N(φ^T µ_m, σ_m^2) N(µ_m, Σ_m)
Empirical Bayes finds the α̂ and σ̂^2 such that Π_i Pr(y_i | TD), i.e., the conditional likelihood, is maximised.
( (1/σ^2) φ^T φ ) u_i = λ_i u_i
Define the parameter γ as γ = Σ_i λ_i / (α + λ_i)
f (x) : Rn → D
There are many techniques of performing the task of classification. The two main
types are
2. Probabilistic Classification
(a) Discriminative models: Here we model Pr(y | x) directly. For e.g., we can say that Pr(y = + | data) comes from a multinomial distribution.
Examples: Logistic Regression, Maximum Entropy models, Con-
ditional Random Fields
(b) Generative models: Here we model Pr(y | x) by modeling Pr(x | y).
For Example:
P r(x | y = c1 ) ∼ N(µ1 , Σ1 )
P r(x | y = c2 ) ∼ N(µ2 , Σ2 )
We can find Pr(y | x) as
Pr(y | x) = Pr(x | y) Pr(y) / Σ_y Pr(x | y) Pr(y)
Here, P r(y = c1 ), P r(y = c2 ) are called priors or mixture components.
Examples: Naive Bayes, Bayes Nets, Hidden Markov Models
3. The third method is to model P (C+ |x) and P (C− |x) directly. These types of
models are called discriminative models. In this case P (C+ |x) = P (C− |x) gives
the required decision boundary.
15.2.1 Examples
Figure 19: Data from two classes classified by least squares (magenta) and logistic (green)
Even though the least-squares approach gives a closed form solution for the discriminant function parameters, it suffers from problems such as lack of robustness to outliers. This is illustrated in figures 3 and 4, where we see that the introduction of additional data points in figure 4 produces a significant change in the location of the decision boundary, even though these points would be correctly classified by the original boundary in figure 3. For comparison, the least squares approach is contrasted with logistic regression, which remains unaffected by the additional points.
16 Lecture 16
16.1 Introduction
We will discuss the problems of the linear regression model for classification. We will also look at some possible solutions to these problems. Our main focus is on the two-class classification problem.
Outliers : They are points which have noise and adversely affect the classification.
In the right hand figure , the separating hyperplane has changed because of the
outliers.
16.2.2 Masking
It is seen empirically that a linear regression classifier may mask a given class. This is shown in the left hand figure. We had 3 classes, one in between the other two. The points of the in-between class are not classified correctly.
The equation of the classifier between class C1(red dots) and class C2(green dots)
is
(ω1 − ω2 )T φ(x) = 0
and the equation of the classifier between the classes C2(green dots) and C3(blue
dots) is
(ω2 − ω3 )T φ(x) = 0
Here we try to determine transformations φ'_1 and φ'_2 such that we can get a linear classifier in this new space. When we map back to the original dimensions, the separators may not remain linear.
Figure 23: Mapping back to original dimension class separator not linear
Problem: Exponential blowup of the number of parameters (w's), of order O(n^{k−1}).
2. Decision surface perpendicular bisector to the mean connector.
Decision surface is the perpendicular bisector of the line joining mean of class
C1 (m1 ) and mean of class C2 (m2 ).
m_1 = (1/N_1) Σ_{n∈C_1} x_n, where m_1 is the mean of class C_1 and N_1 is the number of points in class C_1.
m_2 = (1/N_2) Σ_{n∈C_2} x_n, where m_2 is the mean of class C_2 and N_2 is the number of points in class C_2.
Comment : This is solving the masking problem but not the sensitivity
problem as this does not capture the orientation(eg: spread of the data points)
of the classes.
3. Fisher Discrimant Analysis.
Here we consider the mean of the classes , within class covariance and global
covariance.
Aim : To increase the separation between the class means and to minimize
within class variance. Considering two classes.
SB is Inter class covariance and SW is Intra class covariance.
m_1 = (1/N_1) Σ_{n∈C_1} x_n and m_2 = (1/N_2) Σ_{n∈C_2} x_n, as above.
S_B = (m_2 − m_1)(m_2 − m_1)^T
S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T
w ∝ S_W^{-1} (m_2 − m_1)   (see the sketch below)
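A minimal sketch of the Fisher direction w ∝ S_W^{-1}(m_2 − m_1) computed on synthetic two-class data; the data itself is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(9)
C1 = rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 2.0]], size=100)
C2 = rng.multivariate_normal([3, 2], [[2.0, 1.5], [1.5, 2.0]], size=100)

m1, m2 = C1.mean(axis=0), C2.mean(axis=0)
SW = ((C1 - m1).T @ (C1 - m1)) + ((C2 - m2).T @ (C2 - m2))   # within-class scatter
w = np.linalg.solve(SW, m2 - m1)                             # w proportional to S_W^{-1}(m2 - m1)
w = w / np.linalg.norm(w)
print(w)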
16.4 Summary
17 Lecture 17
Not submitted
18 Lecture 18:Perceptron
Was Fisher’s discriminant robust to noise?
Perceptron training
Consider the hyperplane w^T φ(x) = 0. For a point φ(x) and its projection φ(x_0) onto the hyperplane, the distance is
D = w^T ( φ(x) − φ(x_0) )
Since w^T φ(x_0) = 0, we get distance = w^T φ(x) (up to normalization by ||w||).
The perceptron works for two classes only. We label them as y = 1 and y = −1. A point is misclassified if y w^T φ(x) < 0.
Perceptron Algorithm:
INITIALIZE: w = ones()
REPEAT: pick a misclassified point (x, y) and update w ← w + y φ(x), until no misclassified points remain.
18.2.1 Intuition
y w_{k+1}^T φ(x) = y ( w_k + y φ(x) )^T φ(x)
               = y w_k^T φ(x) + y^2 ||φ(x)||^2
               > y w_k^T φ(x)
Note: we applied the update for this point since y w_k^T φ(x) ≤ 0. Since y w_{k+1}^T φ(x) > y w_k^T φ(x), there is more hope that this point is classified correctly now.
More formally, the perceptron tries to minimize the error function
E = − Σ_{x∈M} y φ^T(x) w
where M is the set of misclassified examples.
The perceptron algorithm is similar (though not exactly equivalent) to a gradient descent algorithm, which can be seen as follows. Since ∇E is given by
∇E = − Σ_{x∈M} y φ(x)
we have
w_{k+1} = w_k − η ∇E
        = w_k + η Σ_{x∈M} y φ(x)    (this takes all misclassified points at a time)
But what we are doing in the standard perceptron algorithm is basically stochastic gradient descent:
∇E = Σ_{x∈M} ∇E(x),  where E(x) = − y φ^T(x) w
w_{k+1} = w_k − η ∇E(x)
        = w_k + η y φ(x)    (for any x ∈ M)
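A minimal sketch of the stochastic perceptron updates above on linearly separable synthetic data; the data and the feature map φ(x) = (1, x) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(10)
X = np.vstack([rng.normal([2, 2], 0.5, size=(50, 2)),
               rng.normal([-2, -2], 0.5, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
phi = np.hstack([np.ones((100, 1)), X])          # phi(x) = (1, x1, x2)

w = np.ones(3)                                   # INITIALIZE: w = ones()
eta = 1.0
for _ in range(1000):                            # REPEAT
    mis = np.where(y * (phi @ w) <= 0)[0]        # misclassified: y w^T phi(x) <= 0
    if len(mis) == 0:
        break
    i = rng.choice(mis)                          # stochastic: pick one misclassified point
    w = w + eta * y[i] * phi[i]                  # w <- w + eta y phi(x)

print(w, "misclassified:", np.sum(y * (phi @ w) <= 0))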
Proof:
lim_{k→∞} ||w_{k+1} − ρ w*||^2 = 0    (68)
Observations:
1. y w_k^T φ(x) ≤ 0 (since x was misclassified)
2. Γ^2 = max_{x∈D} ||φ(x)||^2
Taking ρ = 2Γ^2 / (−δ), then
0 ≤ ||w_{k+1} − ρ w*||^2 ≤ ||w_k − ρ w*||^2 − Γ^2
Hence we get Γ^2 = θ^2, which is what we were looking for in eq. (3).
∴ ||w_{k+1} − ρ w*||^2 decreases by at least Γ^2 at every iteration.
19 Lecture 19
19.1 Introduction
In this lecture, we extend the concept of margin towards our goal of classification and introduce Support Vector Machines (SVM).
19.2 Margin
Given w*, the unsigned minimum distance of x from the hyperplane
w*^T φ(x) = 0    (71)
is given by
min_{x∈D} y w*^T φ(x)    (72)
where y = ±1. Here y is the corresponding target value for the 2-class classifier. Note that multiplication by y makes the distance unsigned. This classification is greedy.
Figure 25: H3(green) doesn’t separate the 2 classes. H1(blue) does, with a small margin and
H2(red) with the maximum margin. [5]
Intuitively, a good separation is achieved by the hyperplane that has the largest
distance to the nearest training data points of any class, since in general the larger
the margin the lower the generalization error of the classifier. [5]
w^T φ(x) = 0    (73)
The factor y, which is also the target value, ensures that the unsigned distance is positive semi-definite. Posing this as an optimization problem:
Figure 26: Maximum margin hyperplanes and margins for an SVM trained with 2 classes. Samples on the margin are called the support vectors. [5]
max_{w,w_0} δ    (79)
max_{w,w_0} δ    (83)
max_{w,w_0'} δ    (88)
subject to  (y_i/||w||) ( w^T φ(x_i) + w_0' ) > δ,   δ ≥ 0    (89), (90)
or equivalently,
max_{w,w_0'} δ    (91)
max_{w,w_0'} 1/||w||    (94)
min_{w,w_0'} ||w||^2    (96)
Unlike the previous case, wherein data was either "black" or "white", here we have a region of "gray". Earlier, we implicitly used an error function that gave infinite error if a data point was misclassified and zero error if it was classified correctly. We now modify this approach so that data points are allowed to be on the "wrong side" of the margin boundary, but with a penalty that increases with the distance from that boundary. Thus, the objective must account for the noise. Hence, we introduce slack variables ζ_i.
min_{w,w_0'} ||w||^2/2 + c Σ_{i=1}^N ζ_i^2    (101)
subject to  y_i ( w^T φ(x_i) + w_0' ) > 1 − ζ_i    (102)
            ζ_i ≥ 0    (103)
Thus, the objective is analogous to the minimization of error subject to a regularizer. Here, the error is:
E = c Σ_{i=1}^N ζ_i^2    (104)
and the regularizer is
R = ||w||^2 / 2    (105)
Both E and R can be proved to be convex. Partially differentiating E w.r.t. ζ_j:
∂E/∂ζ_j = 2cζ_j    (106)
∂^2E/∂ζ_j^2 = 2c > 0    (107)
Thus E is convex w.r.t. ζ_j for all 1 ≤ j ≤ N.
Consider R:
∇R = w    (108)
∇^2 R = I    (109)
Proof.
Figure 27: Margins in SVM
20.1 Recap
The expression from the previous day:
y_i ( φ^T(x_i) w + w_0' ) ≥ ||w|| / θ    (110)
So, any multiple of w and w_0' would not change the inequality.
The distance between the points x1 and x2 in 20.2 turns out to be:
||φ(x1 ) − φ(x2 )|| = ||rw|| (111)
We have,
wT φ(x1 ) + w0 = −1 (112)
and
wT φ(x2 ) + w0 = 1 (113)
Subtracting Equation 113 from Equation 112 we get,
KKT 1.a
∇_w L = 0    (142)
⟹ w − Σ_{j=1}^m α_j y_j φ(x_j) = 0    (143)
KKT 1.b
∇_{ξ_i} L = 0    (144)
⟹ c − α_i − λ_i = 0    (145)
KKT 1.c
∇_{w_0} L = 0    (146)
⟹ Σ_{i=1}^m α_i y_i = 0    (147)
KKT 2
∀i:  y_i ( φ^T(x_i) w + w_0 ) ≥ 1 − ξ_i,   ξ_i ≥ 0    (148)-(150)
KKT 3
α_j ≥ 0 and λ_k ≥ 0,  ∀ j, k = 1, . . . , m    (151), (152)
KKT 4
α_j [ y_j ( φ^T(x_j) w + w_0 ) − 1 + ξ_j ] = 0    (153)
λ_k ξ_k = 0    (154)
(a)
w* = Σ_{j=1}^m α_j y_j φ(x_j)    (155)
w* is a weighted linear combination of the points φ(x_j).
(b)
K = [ φ^T(x_i) φ(x_j) ]_{i,j=1,...,n},  i.e.,  K_ij = φ^T(x_i) φ(x_j).
The SVM dual can now be re-written as
max_α { − (1/2) α^T K_y α + α^T 1 },   where (K_y)_{ij} = y_i y_j K_ij,
subject to the constraints
Σ_i α_i y_i = 0
0 ≤ α_i ≤ c
Thus, for α_i ∈ (0, C),
w_0* = y_i − φ^T(x_i) w
     = y_i − Σ_{j=1}^m α_j* y_j φ^T(x_i) φ(x_j)
     = y_i − Σ_{j=1}^m α_j* y_j K_ij
K can be written as K = V Λ V^T, where V is the eigenvector matrix (an orthogonal matrix) and Λ is the diagonal matrix of eigenvalues. Hence K must be
1. Symmetric,
2. Positive semi-definite,
3. Having non-negative diagonal entries.
The following operations preserve kernel validity:
1. K_ij = K'_ij + K''_ij
2. K_ij = α K'_ij (α ≥ 0)
3. K_ij = K'_ij K''_ij
For rule 3, define φ(x_i) as the tensor product of the two feature maps, φ'(x'_i) ⊗ φ''(x''_i). Then K_ij = φ(x_i)^T φ(x_j). Hence, K is a valid kernel.
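A small numerical sketch of the closure rules above: build two valid Gram matrices and check that their sum, a non-negative scaling, and their element-wise product remain symmetric positive semi-definite. The data and kernel choices are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(11)
X = rng.standard_normal((20, 3))

K1 = X @ X.T                                      # linear kernel Gram matrix
sq = np.sum(X**2, axis=1)
d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
K2 = np.exp(-d2 / 2.0)                            # Gaussian (RBF) kernel Gram matrix

def is_psd(K, tol=1e-8):
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

print(is_psd(K1 + K2))        # rule 1: sum of kernels
print(is_psd(3.0 * K1))       # rule 2: non-negative scaling
print(is_psd(K1 * K2))        # rule 3: element-wise (Schur) product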
22 Lecture 22 : SVR and Optimization Techniques
As long as points lie inside the margin, they do not contribute to the error.
We can define the ε-insensitive loss function L(x, y, f) as follows.
Linear:
L_ε(x, y, f) = |y − f(x)|_ε = max(0, |y − f(x)| − ε)
Quadratic:
L^2_ε(x, y, f) = |y − f(x)|^2_ε
We can optimise the generalisation of our regressor by minimising the sum of the quadratic ε-insensitive losses. For 2-norm SVR:
min  (1/2)||w||^2 + C Σ_i ( ξ_i^2 + ξ_i'^2 )
s.t., ∀i:  ( Φ^T(x_i) w + w_0 ) − y_i ≤ ε + ξ_i
           y_i − ( Φ^T(x_i) w + w_0 ) ≤ ε + ξ_i'
also ξ_i ξ_i' = 0
and for the 1-norm case we instead have the penalty C Σ_i ( ξ_i + ξ_i' ) with ξ_i ≥ 0, ξ_i' ≥ 0.
For the 2-norm case, the dual problem can be derived using the standard method, taking into account that ξ_i ξ_i' = 0 and therefore that the same relation α_i α_i' = 0 holds for the corresponding Lagrange multipliers:
maximise  Σ_{i=1}^M y_i (α_i' − α_i) − ε Σ_{i=1}^M (α_i' + α_i) − (1/2) Σ_{i=1}^M Σ_{j=1}^M (α_i' − α_i)(α_j' − α_j) ( K_ij + δ_ij / C )
subject to:  Σ_{i=1}^M (α_i' − α_i) = 0
             α_i' ≥ 0,  α_i ≥ 0,  α_i α_i' = 0
The corresponding Karush-Kuhn-Tucker complementarity conditions are [7]
α_i ( ⟨w · φ(x_i)⟩ + b − y_i − ε − ξ_i ) = 0
α_i' ( y_i − ⟨w · φ(x_i)⟩ − b − ε − ξ_i' ) = 0
ξ_i ξ_i' = 0,   α_i α_i' = 0
By substituting β = α' − α and using the relation α_i α_i' = 0, it is possible to rewrite the dual problem in a way that more closely resembles the classification case:
maximise  Σ_{i=1}^M y_i β_i − ε Σ_{i=1}^M |β_i| − (1/2) Σ_{i,j=1}^M β_i β_j ( K(φ(x_i), φ(x_j)) + δ_ij / C )
subject to  Σ_{i=1}^M β_i = 0
Notes:
Ridge regression has 1 parameter, λ; 2-norm SVR has 2 parameters, C (∼ 1/λ) and ε.
22.3 L1 SVM
Let the training data be x_i (i = 1, ..., M), with label y_i = 1 if x_i belongs to Class 1 and y_i = −1 if it belongs to Class 2. In SVMs, to enhance linear separability, the input space is mapped into a high-dimensional feature space using the mapping function g(x). To obtain the optimal separating hyperplane of the L1-SVM in the feature space, we consider the following optimization problem:
minimize  (1/2) ||W||^2 + C Σ_{i=1}^M ξ_i
subject to  y_i ( W^T g(x_i) + b ) ≥ 1 − ξ_i,   for i = 1, . . . , M,
where W is a weight vector, C is the margin parameter that determines the tradeoff
between the maximization of the margin and the minimization of the classification
error, ξi (i = 1, ..., M ) are the nonnegative slack variables and b is a bias term.
Introducing the Lagrange multipliers αi , we obtain the following dual problem:
maximize
Q(α) = Σ_{i=1}^M α_i − (1/2) Σ_{i=1}^M Σ_{j=1}^M α_i α_j y_i y_j g(x_i)^T g(x_j)
subject to  Σ_{i=1}^M y_i α_i = 0,   0 ≤ α_i ≤ C
We use a mapping function that satisfies K_ij = K(x_i, x_j) = g(x_i)^T g(x_j) (a kernel), so that the problem becomes
minimize  (1/2) Σ_i Σ_j α_i α_j K_ij y_i y_j − Σ_i α_i,
subject to  Σ_i α_i y_i = 0,   0 ≤ α_i ≤ C
23 Lecture 23
Key terms : SMO, Probabilistic models, Parzen window
In previous classes we wrote the convex formulation for maximum margin classification, the Lagrangian of the formulation, etc. The dual of the program was then obtained by first minimizing the Lagrangian with respect to the weights w. The dual is a maximization with respect to α, which is the same as minimizing the negative of the objective function under the same constraints. The dual problem, given in equation (156), is an optimization with respect to α, and its solution corresponds to the optimum of the original objective when the KKT conditions are satisfied. We are interested in solving the dual of the objective because we have already seen that most of the dual variables will be zero in the solution, and hence it will give a sparse solution (based on the KKT conditions).
Dual:  min_α  − Σ_i α_i + (1/2) Σ_i Σ_j α_i α_j y_i y_j K_ij    (156)
s.t.  Σ_i α_i y_i = 0
      α_i ∈ [0, c]
The above program is a quadratic program. Any quadratic solver can be used for solving (156), but a generic solver will not take into account the special structure of the solution and may not be efficient. One way to solve (156) is by using projection methods (also called the kernel adatron). The problem can also be solved in two ways: chunking methods and decomposition methods.
The chunking method is as follows:
1. Initialize the α_i's arbitrarily.
2. Choose points (i.e., the components α_i) that violate the KKT conditions.
3. Consider only the working set (WS) and solve the dual for the variables in the working set, ∀α in the working set:
min_α  − Σ_{i∈WS} α_i + (1/2) Σ_{i∈WS} Σ_{j∈WS} α_i α_j y_i y_j K_ij    (157)
s.t.  Σ_{i∈WS} α_i y_i = − Σ_{j∉WS} α_j y_j
      α_i ∈ [0, c]
4. Set α_new = [ α^new_WS, α^old_nonWS ]
Decomposition methods follow almost the same procedure, except that in step 2 we always take a fixed number of points which violate the KKT conditions the most. In SMO the working set has just two variables,
α_1, α_2 ∈ [0, c]
From the second last constraint, we can write α_1 in terms of α_2:
α_1 = −α_2 (y_2/y_1) + α_1^old + α_2^old (y_2/y_1)
Then the objective is just a function of α_2; let the objective be −D(α_2). Now the program reduces to
min_{α_2}  − D(α_2)
s.t.  α_2 ∈ [0, c]
Find α_2* such that ∂D(α_2)/∂α_2 = 0. We have to ensure that α_1 ∈ [0, c]. So based on that we will have to clip α_2, i.e., shift it to a certain interval. The condition is as follows:
0 ≤ −α_2 (y_2/y_1) + α_1^old + α_2^old (y_2/y_1) ≤ c
Case 1: y_1 = y_2
α_2 ∈ [ max(0, −c + α_1^old + α_2^old), min(c, α_1^old + α_2^old) ]
Case 2: y_1 = −y_2
α_2 ∈ [ max(0, α_2^old − α_1^old), min(c, c − α_1^old + α_2^old) ]
If α_2 is already in the interval then there is no problem. If it is more than the maximum limit, reset it to the maximum limit; this ensures the optimum value of the objective subject to this condition. Similarly, if α_2 goes below the lower limit, reset it to the lower limit.
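A minimal sketch of the clipping step above: given old values of α_1, α_2 and the labels, compute the feasible interval [L, H] for the new α_2 and clip an unconstrained candidate into it. The numbers used are illustrative assumptions.

import numpy as np

def clip_alpha2(alpha2_new, alpha1_old, alpha2_old, y1, y2, c):
    # Feasible interval for alpha2 keeping alpha1 in [0, c] and the sum constraint.
    if y1 == y2:
        L = max(0.0, alpha1_old + alpha2_old - c)
        H = min(c, alpha1_old + alpha2_old)
    else:
        L = max(0.0, alpha2_old - alpha1_old)
        H = min(c, c - alpha1_old + alpha2_old)
    return float(np.clip(alpha2_new, L, H))

# Example usage (illustrative numbers):
print(clip_alpha2(1.7, alpha1_old=0.4, alpha2_old=0.9, y1=1, y2=1, c=1.0))    # clipped to H
print(clip_alpha2(-0.2, alpha1_old=0.4, alpha2_old=0.9, y1=1, y2=-1, c=1.0))  # clipped to L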
For a discriminative model the function is g_k(x) = ln p(C_k | x), i.e., it models the conditional probability of the class variable given the input.
For a generative model, g_k(x) = ln( p(x | C_k) p(C_k) ) − ln p(x) (obtained by Bayes rule). Generative models model a joint distribution of the input and class variables.
P = ∫_R p(x) dx    (159)
Suppose that n i.i.d. samples are randomly drawn according to the probability distribution p(x). The probability that k of these n fall in R is
P_k = (n choose k) P^k (1 − P)^{n−k}    (160)
The expected value of k is
E[k] = nP    (161)
If n is very large then k/n will be a good estimate for the probability P.
∫_R p(x') dx' ≃ p(x) V    (162)
p(x) ≃ k / (nV)    (163)
Kernel Density Estimation
One technique for nonparametric density estimation is kernel density estimation, where, effectively, V is held fixed while K, the number of sample points falling in the region, is determined by the data:
P(x) = Σ_{x'∈D} K(x, x') / ( d^n · |D| )    (164)
K(x, x') = 1 if ||x_i − x'_i|| ≤ d ∀ i ∈ [1 : n];  0 if ||x_i − x'_i|| > d for some i ∈ [1 : n]    (165)
For smooth kernel density estimation, we could use
K(x, x') = e^{ −||x − x'||^2 / (2σ^2) }    (166)
P(x) = (1/|D|) Σ_{x'∈D} K(x, x')    (167)
P(C_i) = |D_i| / |D|
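A minimal sketch of the smooth (Gaussian-kernel) Parzen estimate P(x) = (1/|D|) Σ_{x'∈D} K(x, x') in one dimension; the data and the bandwidth σ are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(12)
D = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 0.5, 100)])  # samples

def parzen(x, D, sigma=0.3):
    # P(x) = (1/|D|) sum_{x' in D} exp(-||x - x'||^2 / (2 sigma^2)), as in (166)-(167);
    # the kernel is unnormalized here, matching the notes (divide by sqrt(2 pi sigma^2)
    # to obtain a proper density).
    K = np.exp(-(x - D) ** 2 / (2 * sigma ** 2))
    return K.mean()

print(parzen(0.0, D), parzen(4.0, D), parzen(10.0, D))   # high, high, near zero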
Parametric methods assume a form for the probability distribution that gen-
erates the data and estimate the parameters of the distribution. Generally
parametric methods make more assumptions than non-parametric methods.
– Gaussian Discriminant:
P(x | C_i) = N(φ(x); µ_i, Σ_i) = 1 / ( (2π)^{n/2} |Σ_i|^{1/2} ) · exp( −(1/2) (φ(x) − µ_i)^T Σ_i^{-1} (φ(x) − µ_i) )    (168)
Given a point x, classify it to the class C_i such that
C_i = argmax_i log[ P(x | C_i) P(C_i) ]
Let a_i = log P(x | C_i) + log P(C_i) = w_i^T φ(x) + w_{i0}
where
w_i = Σ_i^{-1} µ_i
w_{i0} = −(1/2) µ_i^T Σ_i^{-1} µ_i + ln P(C_i) − (1/2) ln[ (2π)^n |Σ_i| ] − (1/2) φ(x)^T Σ_i^{-1} φ(x)
(µ_i^MLE, Σ_i^MLE) = argmax_{µ,Σ} LL(D, µ, Σ) = argmax_{µ,Σ} Σ_i Σ_{x∈D_i} log N(x; µ_i, Σ_i)    (169)
1. The Σ's are common across all classes, i.e., Σ_i = Σ ∀i.
Maximum likelihood estimates using (169) are:
µ_i^MLE = Σ_{x∈D_i} φ(x) / |D_i|
Σ^MLE = (1/|D|) Σ_{i=1}^k Σ_{x∈D_i} (φ(x) − µ_i)(φ(x) − µ_i)^T
2. The Σ_i's are also (separate) parameters.
Maximum likelihood estimates are:
µ_i^MLE = Σ_{x∈D_i} φ(x) / |D_i|
Σ_i^MLE = (1/|D_i|) Σ_{x∈D_i} (φ(x) − µ_i)(φ(x) − µ_i)^T
We could do this for exponential family as well.
– Exponential Family:
For a given vector of functions φ(x) = [φ_1(x), . . . , φ_k(x)] and a parameter vector η ∈ R^k, the exponential family of distributions is defined as
P(x; η) = h(x) exp{ η^T φ(x) − A(η) }    (170)
where h(x) is a conventional reference function and A(η) is the log normalization constant defined as
A(η) = log[ ∫_{x∈Range(x)} exp{η^T φ(x)} h(x) dx ]
Example:
* Gaussian Distribution: A Gaussian X ∼ N(µ, σ) can be expressed with
η = [ µ/σ^2, −1/(2σ^2) ]
φ(x) = [x, x^2]
* Multivariate Gaussian:
η = Σ^{-1} µ
p(x; Σ^{-1}, η) = exp{ η^T x − (1/2) x^T Σ^{-1} x + Z }
where Z = −(1/2) ( n log(2π) − log(|Σ^{-1}|) )
* Bernoulli:
The Bernoulli distribution is defined on a binary (0 or 1) random variable using parameter µ, where µ = Pr(X = 1). The Bernoulli distribution can be written as
p(x | µ) = exp{ x log( µ/(1−µ) ) + log(1 − µ) }
⇒ φ(x) = [x] and η = [ log( µ/(1−µ) ) ]
25 Lecture 25
25.1 Exponential Family Distribution
Considering The Exponential Family Distribution:
Let φ(x) have n attributes, where each of the n attributes can take c different (discrete) values. The total number of possible values of φ(x) is c^n.
Listing the configurations φ^(1)(x), . . . , φ^(c^n)(x) against the n attribute values gives a table with c^n rows (173). For example, with n = 3 binary attributes:
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1    (174)
The class-conditional distribution p(φ(x) | η_k) is then a table assigning a probability to each configuration:
p_1 for φ^(1)(x), p_2 for φ^(2)(x), ..., p_{c^n} for φ^(c^n)(x)    (175)
where Σ_i p_i = 1    (176)
25.3 Naive Bayes Assumption
Under the Naive Bayes assumption the class-conditional distribution factorizes as
p(φ(x) | η_k) = Π_{j=1}^n p(φ_j(x) | η_k)    (177)
where φ_j(x) denotes the j-th attribute of φ(x) in the feature space.
Thus, what Naive Bayes Assumption essentially says is that each attribute is inde-
pendent of other attributes given the class.
So the original probability distribution table of p(φ(x)|η_k), of size c^n, gets replaced by n tables (one per attribute), each of size c × 1, as follows:
p(φ_j(x) | η_k):  p_{j1} for φ_j(x) = µ_{j1},  p_{j2} for φ_j(x) = µ_{j2},  ...,  p_{jc} for φ_j(x) = µ_{jc}    (178)
where µ_{j1}, ..., µ_{jc} are the c discrete values that φ_j(x) can take, and Σ_i p_{ji} = 1    (179)
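A minimal sketch of Naive Bayes with discrete attributes: per class, estimate one c × 1 table per attribute (as in (178)) and classify with argmax_k p(C_k) Π_j p(φ_j(x) | C_k). The tiny data set and the add-one (Laplace) smoothing are illustrative assumptions.

import numpy as np

# Toy data: 4 binary attributes, 2 classes (illustrative).
X = np.array([[1, 0, 1, 1],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])
n_classes, n_values = 2, 2

prior = np.array([(y == k).mean() for k in range(n_classes)])
# tables[k][j][v] = p(phi_j(x) = v | C_k), with add-one smoothing
tables = np.zeros((n_classes, X.shape[1], n_values))
for k in range(n_classes):
    Xk = X[y == k]
    for j in range(X.shape[1]):
        for v in range(n_values):
            tables[k, j, v] = (np.sum(Xk[:, j] == v) + 1) / (len(Xk) + n_values)

def predict(x):
    scores = np.log(prior).copy()
    for k in range(n_classes):
        scores[k] += sum(np.log(tables[k, j, x[j]]) for j in range(len(x)))
    return int(np.argmax(scores))

print(predict(np.array([1, 0, 1, 0])))   # likely class 1
print(predict(np.array([0, 1, 0, 1])))   # likely class 0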
The fact that there is no Edge between φ1 (x) and φ2 (x) denotes Conditional
Independence.
p(φ, c) = p(φ | c) · p(c)
        = p(φ_2 | φ_1, φ_3, c) · p(φ_1 | φ_3, c) · p(φ_3 | c) · p(c)
        ≈ p(φ_2 | φ_1) · p(φ_1 | c) · p(φ_3 | c) · p(c)    (from Figure 29)
        = Π_x p(x | π(x)),   where π(x) is the set of parents of x
References
[1] "Class Notes: Basics of Convex Optimization, Chapter 4."
[2] "Convex Optimization."
[3] "Linear Algebra."
[4] "Bias variance tradeoff." [Online]. Available: http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_bias_variance.htm
[5] Steve Renals, "Support Vector Machines."
[6] Christopher M. Bishop, "Pattern Recognition and Machine Learning."
[7] Nello Cristianini and John Shawe-Taylor, "An Introduction to Support Vector Machines and Other Kernel-based Learning Methods."
[8] Class Notes, "Graphical Models."
[9] Class Notes, "Graphical Case Study of Probabilistic Models."
[10] Andrew McCallum and Kamal Nigam, "A Comparison of Event Models for Naive Bayes Text Classification."
[11] Steve Renals, "Naive Bayes Text Classification."