
Lecture notes on CS725 : Machine learning

Contents

1 Lecture 1 : Introduction to Machine Learning

2 Lecture 2
  2.1 Solving Least Squares in General (for Linear models)

3 Lecture 3 : Regression
  3.1 Regression
  3.2 Linear regression
  3.3 Least square solution
  3.4 Geometrical interpretation of least squares

4 Lecture 4 : Least Squares Linear Regression
  4.1 Least Square Linear Regression Model
  4.2 Level Curves and Surfaces
  4.3 Gradient Vector
  4.4 Directional Derivative
  4.5 Hyperplane
  4.6 Tangential Hyperplane
  4.7 Gradient Descent Algorithm
  4.8 Local Minimum and Local Maximum

5 Lecture 5 : Convex functions
  5.1 Recap
  5.2 Point 1
  5.3 Point 2
  5.4 Point 3
  5.5 Point 4
    5.5.1 Overfitting
    5.5.2 Next problem
  5.6 Point 5

6 Lecture 6 : Regularized Solution to Regression Problem
  6.1 Problem formulation
  6.2 Duality and KKT conditions
  6.3 Bound on λ in the regularized least square solution
  6.4 RMS Error variation
  6.5 Alternative objective function
  6.6 A review of probability theory
    6.6.1 The three axioms of probability
    6.6.2 Bayes' theorem
    6.6.3 Independent events

7 Lecture 7 : Probability
  7.1 Note
  7.2 Part of speech (POS) example
  7.3 Probability mass function (pmf) and probability density function (pdf)
    7.3.1 Joint distribution function
    7.3.2 Marginalization
  7.4 Example
  7.5 Conditional Density
  7.6 Expectation
    7.6.1 Properties of E(x)
  7.7 Variance
  7.8 Covariance
    7.8.1 Properties of Covariance
  7.9 Chebyshev's Inequality
  7.10 Bernoulli Random Variable
  7.11 Binomial Random Variable
  7.12 Central Limit Theorem
  7.13 Maximum Likelihood and Estimator
  7.14 Bayesian estimator

8 Lecture 8
  8.1 Bernoulli Distribution
  8.2 Bayesian Estimation

9 Lecture 9 : Multinomial Distribution
    9.0.1 Posterior probability
    9.0.2 Summary
  9.1 Gaussian Distribution
    9.1.1 Information Theory
    9.1.2 Expectation for I(X=x)
    9.1.3 Observations
    9.1.4 Properties of the Gaussian univariate distribution

10 Lecture 10 : Multivariate Gaussian Distribution
  10.1 Multivariate Gaussian Variable
    10.1.1 Unbiased Estimator
  10.2 Dealing with Conjugate Priors for Multivariate Gaussian

11 Lecture 11
  11.1 Recall
  11.2 Bayes Linear Regression
  11.3 Pure Bayesian - Regression
  11.4 Sufficient Statistic
  11.5 Lasso

12 Lecture 12 : Bias-Variance tradeoff
  12.1 Expected Loss

13 Lecture 13
  13.1 Conclude Bias-Variance
    13.1.1 Summary
    13.1.2 Bayesian Linear Regression (BLR)
    13.1.3 General Problems with Standard Distribution
  13.2 Empirical Bayes
    13.2.1 First Approach: Approximate the posterior
    13.2.2 Second Approach: Empirical Bayes
    13.2.3 Solve the eigenvalue equation

14 Lecture 14 : Introduction to Classification

15 Lecture 15 : Linear Models for Classification
  15.1 Generalized linear models
  15.2 Three broad types of classifiers
    15.2.1 Examples
  15.3 Handling Multiclasses
    15.3.1 Avoiding ambiguities
  15.4 Least Squares approach for classification
    15.4.1 Limitations of Least Squares

16 Lecture 16
  16.1 Introduction
  16.2 Problems of linear regression
    16.2.1 Sensitivity to outliers
    16.2.2 Masking
  16.3 Possible solutions
  16.4 Summary

17 Lecture 17

18 Lecture 18 : Perceptron
  18.1 Fisher's discriminant
  18.2 Perceptron training
    18.2.1 Intuition

19 Lecture 19
  19.1 Introduction
  19.2 Margin
  19.3 Support Vector Machines
  19.4 Support Vectors
  19.5 Objective Design in SVM
    19.5.1 Step 1: Perfect Separability
    19.5.2 Step 2: Optimal Separating Hyperplane For Perfectly Separable Data
    19.5.3 Step 3: Separating Hyperplane For Overlapping Data

20 Lecture 20 : Support Vector Machines (SVM)
  20.1 Recap
  20.2 Distance between the points
  20.3 Formulation of the optimization problem
  20.4 Soft Margin SVM
    20.4.1 Three types of g points
  20.5 Primal and Dual Formulation
    20.5.1 Primal formulation
    20.5.2 Dual Formulation
  20.6 Duality theory applied to KKT

21 Lecture 21 : The SVM dual
  21.1 SVM dual
  21.2 Kernel Matrix
    21.2.1 Generation of φ space
  21.3 Requirements of Kernel
    21.3.1 Examples of Kernels
  21.4 Properties of Kernel Functions

22 Lecture 22 : SVR and Optimization Techniques
  22.1 Other occurrences of kernels
    22.1.1 Some variants of SVMs
  22.2 Support Vector Regression
  22.3 L1 SVM
  22.4 Kernel Adatron

23 Lecture 23
  23.1 Sequential minimization algorithm - SMO
  23.2 Probabilistic models

24 Lecture 24 : Prob. Classifiers
  24.1 Non Parametric Density Estimation
  24.2 Parametric Density Estimation

25 Lecture 25
  25.1 Exponential Family Distribution
  25.2 Discrete Feature Space
  25.3 Naive Bayes Assumption
  25.4 Graphical Models
  25.5 Graphical Representation of Naive Bayes
  25.6 Graph Factorisation
  25.7 Naive Bayes Text Classification
1 Lecture 1 : Introduction to Machine Learning


This lecture was an introduction to machine learning. ...
2 Lecture 2
2.1 Solving Least Squares in General (for Linear models)
[Figure: a fitted curve over the sample inputs Xplot = [0, 0.1, ..., 2]]

• Why noise?
  – Since observations are not perfect
  – Owing to quantization / precision / rounding

Curve Fitting
Learn f : X → Y such that E(f, X, Y) is minimized. Here the error function E and the form of the function f to be learnt are chosen by the modeler.
Consider one such form of f:

f(x) = w_0 + w_1 x + w_2 x^2 + ... + w_t x^t

The sum of squares error is given by

E = (1/2) Σ_{i=1}^{m} (f(x_i) − y_i)^2

So the expression is

argmin_{w = [w_0, w_1, ..., w_t]} (1/2) Σ_{i=1}^{m} [(w_0 + w_1 x_i + w_2 x_i^2 + ... + w_t x_i^t) − y_i]^2

If there are m data points, then a polynomial of degree m − 1 can exactly fit the data, since the polynomial has m degrees of freedom (degrees of freedom = number of coefficients). As the degree of the polynomial increases beyond m, the curve becomes more and more wobbly, while still passing through the points. Contrast the degree 10 fit in Figure 2 against the degree 5 fit in Figure 1. This is due to the problem of overfitting (overspecification).
Now E is a convex function. To optimize it, we need to set ∇_w E = 0. The ∇ operator is also called the gradient.
The solution is given by

w = (φ^T φ)^{-1} φ^T Y

If m << t then
• φ^T φ becomes singular and the solution cannot be found, OR
• the column vectors in φ become nearly linearly dependent.

The RMS (root mean square) error is given by:

RMS = √(2E/k)

where k is the number of points over which E is computed.
Figure 1: Fit for degree 5 polynomial.

Generally, some test data (which could potentially have been part of the training data) is held out for evaluating the generalization performance of the model. Another held-out fraction of the training data, called the validation dataset, is typically used to find the most appropriate degree t_best for f.
Figure 2: Fit for degree 10 polynomial. Note how wobbly this fit is.
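The overfitting effect is easy to reproduce numerically. Below is a minimal sketch in Python/NumPy (not from the original notes; the target function, noise level, and sample sizes are illustrative assumptions) that fits degree-5 and degree-10 polynomials and compares train and test RMS errors:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: noisy samples of a smooth target on [0, 2].
def target(x):
    return np.sin(np.pi * x)

x_train = np.linspace(0, 2, 12)
y_train = target(x_train) + rng.normal(0, 0.1, x_train.shape)
x_test = np.linspace(0, 2, 100)
y_test = target(x_test) + rng.normal(0, 0.1, x_test.shape)

def rms(f, x, y):
    # RMS error of the fitted polynomial f on data (x, y)
    return np.sqrt(np.mean((f(x) - y) ** 2))

for degree in (5, 10):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
    f = np.poly1d(coeffs)
    print(degree, rms(f, x_train, y_train), rms(f, x_test, y_test))

Typically the degree-10 fit drives the train RMS down while the test RMS grows, mirroring the contrast between Figures 1 and 2.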
3 Lecture 3 : Regression
This lecture was about regression. It started by formally defining a regression problem. Then a simple regression model called linear regression was discussed, along with different methods for learning the parameters of the model. It also covered the least squares solution for the problem and its geometrical interpretation.

3.1 Regression
Suppose there are two sets of variables x ∈ R^n and y ∈ R^k such that x is independent and y is dependent. The regression problem is concerned with determining y in terms of x. Let us assume that we are given m data points D = {⟨x_1, y_1⟩, ⟨x_2, y_2⟩, ..., ⟨x_m, y_m⟩}. Then the problem is to determine a function f* such that f*(x) is the best predictor for y, with respect to D. Suppose ε(f, D) is an error function, designed to reflect the discrepancy between the predicted value f(x_0) of y_0 and the actual value y_0 for any ⟨x_0, y_0⟩ ∈ D. Then

f* = argmin_{f ∈ F} ε(f, D)    (1)

where F denotes the class of functions over which the optimization is performed.

3.2 Linear regression


Depending on the function class we consider, there are many types of regression problems. In linear regression we consider only linear functions, i.e., functions that are linear in the basis functions. Here F is of the form {Σ_{i=1}^{p} w_i φ_i(x)}, where φ_i : R^n → R^k. The φ_i's are called the basis functions (for example, we can consider φ_i(x) = x^i, i.e., polynomial basis functions).
Any function in F is characterized by its parameters, the w_i's. Thus, in (1) we have to find f(w*) where

w* = argmin_w ε(w, D)

3.3 Least square solution


The error function ε plays a major role in the accuracy and tractability of the optimization problem. The error function is also called the loss function. The squared loss is a commonly used loss function. It is the sum of squares of the differences between the actual value and the predicted value:

ε(f, D) = Σ_{⟨x_i, y_i⟩ ∈ D} (f(x_i) − y_i)^2

So the least squares solution for linear regression is given by

w* = argmin_w Σ_{j=1}^{m} (Σ_{i=1}^{p} w_i φ_i(x_j) − y_j)^2

The minimum value of the squared loss is zero. Is it possible to achieve this value? In other words, is ∀j, Σ_{i=1}^{p} w_i φ_i(x_j) = y_j possible?
Figure 3: Least squares solution ŷ is the orthogonal projection of y onto the column space C(φ) of φ

The above equality can be written as ∀u, φ^T(x_u) w = y_u, or equivalently φw = y, where φ is the m×p matrix whose j-th row is (φ_1(x_j), ..., φ_p(x_j)) and y = (y_1, ..., y_m)^T.
The system has a solution if y is in the column space (the subspace of R^m spanned by the column vectors) of φ. It is possible that no w satisfies these conditions; in such situations we can solve the least squares problem.

3.4 Geometrical interpretation of least squares


Let ŷ be a solution in the column space of φ. The least squares solution is such that the distance between ŷ and y is minimized. From Figure 3 it is clear that for the distance to be minimized, the line joining ŷ to y should be orthogonal to the column space. This can be summarized as:

1. φw = ŷ
2. ∀v ∈ {1, ..., p}, (y − ŷ)^T φ_v = 0, i.e., (ŷ − y)^T φ = 0

      ŷ^T φ = y^T φ
i.e., (φw)^T φ = y^T φ
i.e., w^T φ^T φ = y^T φ
i.e., φ^T φ w = φ^T y
∴ w = (φ^T φ)^{-1} φ^T y

In the last step, note that φ^T φ is invertible only if φ has full column rank.
Theorem: If φ has full column rank, then φ^T φ is invertible. A matrix is said to have full column rank if all its column vectors are linearly independent. A set of vectors {v_i} is said to be linearly independent if Σ_i α_i v_i = 0 ⇒ α_i = 0 for all i.
Proof: Given that φ has full column rank and hence its columns are linearly independent, we have that φx = 0 ⇒ x = 0.
Assume on the contrary that φ^T φ is not invertible. Then ∃x ≠ 0 such that φ^T φ x = 0.
⇒ x^T φ^T φ x = 0
⇒ (φx)^T φx = ||φx||^2 = 0
⇒ φx = 0. This is a contradiction. Hence the theorem is proved.
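A minimal numerical sketch of this solution (assuming NumPy; the data and polynomial basis are illustrative, not from the notes). It builds the design matrix φ, solves the normal equations φ^T φ w = φ^T y, and verifies that the residual y − ŷ is orthogonal to the columns of φ:

import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([0.1, 0.9, 1.8, 3.2, 4.1])

# Polynomial basis phi_i(x) = x^i: columns 1, x, x^2 (full column rank here).
Phi = np.vander(x, 3, increasing=True)

# w = (Phi^T Phi)^{-1} Phi^T y, computed via solve rather than an explicit inverse.
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

y_hat = Phi @ w
print(Phi.T @ (y - y_hat))   # ~ zero vector: residual orthogonal to C(phi)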
4 Lecture 4 : Least Squares Linear Regression


In this lecture we discussed how to minimize the error function ε(w, D) that we used for the
least square linear regression model in the last lecture. To do this, some basic concepts regarding
minimization of a function were discussed and we applied these to our error function.

4.1 Least Square Linear Regression Model


In the least squares regression model, we determine the value of w for which our error function ε attains its minimum value. Here, D = {⟨x_1, y_1⟩, ⟨x_2, y_2⟩, ..., ⟨x_m, y_m⟩} is the training data set, and the φ_i's are the basis functions.

w* = argmin_w Σ_{j=1}^{m} (f(x_j, w) − y_j)^2
   = argmin_w Σ_{j=1}^{m} (Σ_{i=1}^{p} w_i φ_i(x_j) − y_j)^2

where φ is the m×p matrix whose j-th row is (φ_1(x_j), φ_2(x_j), ..., φ_p(x_j)), y = (y_1, y_2, ..., y_m)^T, and w = (w_1, w_2, ..., w_p)^T.

ε = min_w Σ_{j=1}^{m} (φ^T(x_j) w − y_j)^2
  = min_w ||φw − y||^2
  = min_w (φw − y)^T (φw − y)
  = min_w (w^T φ^T φ w − 2y^T φ w + y^T y)    (2)
How to minimize a function?


Following are some basic concepts which help in minimizing or maximizing a function:

4.2 Level Curves and Surfaces


A level curve of a function f(x) is defined as a curve along which the value of the function remains unchanged while we change the value of its argument x. Note that there can be as many level curves for a function as the number of different values it can attain.

Figure 4: 10 level curves for the function f(x_1, x_2) = x_1 e^{x_2} (Figure 4.12 from [1])

Level surfaces are similarly defined for any n-dimensional function f(x_1, x_2, ..., x_n) as a collection of points in the argument space on which the value of the function is the same, while the argument values change.

Formally we can define a level curve as:

L_c(f) = {x | f(x) = c}

where c is a constant.

Refer to Fig. 4.15 in class notes [1] for example.


Figure 5: 3 level surfaces for the function f(x_1, x_2, x_3) = x_1^2 + x_2^2 + x_3^2 with c = 1, 3, 5. The gradient at (1, 1, 1) is orthogonal to the level surface f(x_1, x_2, x_3) = x_1^2 + x_2^2 + x_3^2 = 3 at (1, 1, 1) (Fig. 4.14 from [1]).

4.3 Gradient Vector


The gradient vector of a function f at a point x is defined as follows:

∇f_x = (∂f(x)/∂x_1, ∂f(x)/∂x_2, ..., ∂f(x)/∂x_n)^T ∈ R^n

The direction of the gradient vector gives the direction of the maximum rate of change of the function value at a point, and the magnitude of the gradient vector gives that maximum rate of change.

Refer to Definition 23 in the class notes [1] for more details.

4.4 Directional Derivative


The directional derivative gives the rate of change of the function value in a given direction at a point. The directional derivative of a function f in the direction of a unit vector v (||v|| = 1) at a point x is defined as:

D_v(f) = lim_{h→0} [f(x + hv) − f(x)] / h

Note: The maximum value of the directional derivative of a function f at any point is always the magnitude of its gradient vector at that point.
Figure 6: The level curves from Figure 4 along with the gradient vector at (2, 0). Note that the gradient vector is perpendicular to the level curve x_1 e^{x_2} = 2 at (2, 0) (Figure 4.13 from [1])

Refer to Definition 22 and Theorem 58 in the class notes [1] for more details.

4.5 Hyperplane
A hyperplane is a set of points q whose direction w.r.t. a point p is orthogonal to a vector v. It can be formally defined as:

H_{v,p} = {q | (p − q)^T v = 0}

4.6 Tangential Hyperplane


There are two definitions of the tangential hyperplane (TH_{x*}) to the level surface L_{f(x*)}(f) of f at x*:

1. The plane consisting of all tangent lines at x* to any parametric curve c(t) on the level surface.

2. The plane orthogonal to the gradient vector at x*:

TH_{x*} = {p | (p − x*)^T ∇f(x*) = 0}

Note: By definition, TH_{x*} is (n − 1)-dimensional.

Refer to Definition 24 and Theorem 59 in class notes [1] for more details.

4.7 Gradient Descent Algorithm


The gradient descent algorithm is used to find the minimum value attained by a real-valued function f(x). We first start at an initial point x^{(0)} and make a sequence of steps proportional to the negative of the gradient of the function at the current point. Finally we stop at a point x^{(*)} where a desired convergence criterion (see notes on Convex Optimization) is attained.

The idea of the gradient descent algorithm is based on the fact that if a real-valued function f(x) is defined and differentiable at a point x_k, then f(x) decreases fastest when one moves in the direction of the negative gradient of the function at that point, which is −∇f(x).

Here we describe the gradient descent algorithm for finding the parameter vector w which minimizes the error function ε(w, D). From equation (2),

Δw^{(k)} = −∇ε(w^{(k)})
         = −∇(w^T φ^T φ w − 2y^T φ w + y^T y)|_{w = w^{(k)}}
         = −(2φ^T φ w^{(k)} − 2φ^T y)
         = 2(φ^T y − φ^T φ w^{(k)})

so we get

w^{(k+1)} = w^{(k)} + 2t^{(k)} (φ^T y − φ^T φ w^{(k)})

Gradient Descent Algorithm:

Find a starting point w^{(0)}
repeat
1. Δw^{(k)} = −∇ε(w^{(k)})
2. Choose a step size t^{(k)} > 0 using exact or backtracking ray search.
3. Obtain w^{(k+1)} = w^{(k)} + t^{(k)} Δw^{(k)}.
4. Set k = k + 1.
until the stopping criterion (such as ||∇ε(w^{(k+1)})|| ≤ ϵ, for a small tolerance ϵ) is satisfied

Exact Line Search Algorithm:

t^{(k)} = argmin_t ε(w^{(k)} + 2t (φ^T y − φ^T φ w^{(k)}))

In general,

t^{(k)} = argmin_t f(w^{(k)} + t Δw^{(k)})

Refer to section 4.5.1 in the class notes [1] for more details.
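A sketch of this update rule (assuming NumPy; a small fixed step size stands in for the exact/backtracking ray search, and the data are synthetic illustrations):

import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))                    # illustrative design matrix
y = Phi @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 50)

w = np.zeros(3)
t = 0.005                                         # fixed step size
for k in range(10000):
    grad = 2 * (Phi.T @ Phi @ w - Phi.T @ y)      # gradient of ||Phi w - y||^2
    if np.linalg.norm(grad) <= 1e-8:              # stopping criterion
        break
    w = w - t * grad                              # w_{k+1} = w_k - t * grad

# Compare with the closed-form solution (phi^T phi)^{-1} phi^T y:
print(w)
print(np.linalg.solve(Phi.T @ Phi, Phi.T @ y))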

4.8 Local Minimum and Local Maximum


Critical Point: x is called a critical point w.r.t. a function f if ∇f(x) = 0, i.e., the gradient vanishes at x, or if the gradient fails to exist at x.

Local Minimum (or Maximum):


If ∇f(x*) = 0 then x* can be a point of local minimum (or maximum). [Necessary Condition]

If ∇²f(x*) is positive (negative) definite then x* is a point of local minimum (maximum). [Sufficient Condition]

Note: ∇²f(x*) being positive definite means:

∀x ≠ 0, x^T ∇²f(x*) x > 0

or equivalently

λ_i(∇²f(x*)) > 0

i.e., the eigenvalues of the matrix are positive.

Refer to definition 27, theorem 61 and fig. 4.23, 4.24 in the class notes [1] for more details.

Figure 7: Plot of f(x_1, x_2) = 3x_1^2 − x_1^3 − 2x_2^2 + x_2^4, showing the various local maxima and minima of the function (fig. 4.16 from [1])
5 Lecture 5 : Convex functions


In this lecture the concepts of convex sets and functions were introduced.

5.1 Recap
We recall that the problem was to find w* such that

w* = argmin_w ||φw − y||^2    (3)
   = argmin_w (w^T φ^T φ w − 2w^T φ^T y + y^T y)    (4)

5.2 Point 1
If ∇f(x*) is defined and x* is a local minimum/maximum, then ∇f(x*) = 0
(a necessary condition) (Cite: Theorem 60)[2]

Given that

f(w) = w^T φ^T φ w − 2w^T φ^T y + y^T y    (5)
⇒ ∇f(w) = 2φ^T φ w − 2φ^T y    (6)

we would have

∇f(w*) = 0    (7)
⇒ 2φ^T φ w* − 2φ^T y = 0    (8)
⇒ w* = (φ^T φ)^{-1} φ^T y    (9)

5.3 Point 2
Is ∇²f(w*) positive definite?
i.e., ∀x ≠ 0, is x^T ∇²f(w*) x > 0? (A sufficient condition for a local minimum)

(Note: Any positive definite matrix is also positive semi-definite)

(Cite: Section 3.12 & 3.12.1)[3]

∇²f(w*) = 2φ^T φ    (10)
⇒ x^T ∇²f(w*) x = 2x^T φ^T φ x    (11)
                = 2(φx)^T φx    (12)
                = 2||φx||^2 ≥ 0    (13)

And if φ has full column rank,

φx = 0 iff x = 0    (14)
∴ If x ≠ 0, x^T ∇²f(w*) x > 0.
An example where φ does not have full column rank:

φ = [ x_1   x_1^2   x_1^2   x_1^3
      x_2   x_2^2   x_2^2   x_2^3
      ...   ...     ...     ...
      x_n   x_n^2   x_n^2   x_n^3 ]    (15)

Here the second and third columns are identical. This is the simplest form of linear correlation of features, and it is not at all desirable.

5.4 Point 3
Definition of convex sets and convex functions (Cite : Definition 32 and 35)[2]

Figure 8: A sample convex function

A function f is convex iff

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y), ∀θ ∈ [0, 1]    (16)

Some convex functions: (Cite: Table 4.1, pg-54)[2]

To prove : Verify that a hyperplane is a convex set.


Proof :
A hyperplane H is defined as {x | a^T x = b, a ≠ 0}. Let x and y be vectors that belong to the hyperplane; then a^T x = b and a^T y = b. In order to prove the convexity of the set we must show that

θx + (1 − θ)y ∈ H, where θ ∈ [0, 1]    (17)

In particular, this point belongs to the hyperplane if

a^T (θx + (1 − θ)y) = b    (18)
⇒ a^T θx + a^T (1 − θ)y = b    (19)
⇒ θ a^T x + (1 − θ) a^T y = b    (20)

And since a^T x = b and a^T y = b, the left-hand side equals θb + (1 − θ)b = b. [Hence proved]

So a hyperplane is a convex set.

Q. Is ||φw − y||^2 convex?

A. To check this, we could use (Cite: Theorem 75)[2], but it is not very practical. We instead use (Cite: Theorem 79)[2] to check for the convexity of our function. So the condition that has our focus is:

f is convex if ∇²f(w) is positive semi-definite, i.e., if ∀x ≠ 0, x^T ∇²f(w) x ≥ 0    (22)

We have

∇²f(w) = 2φ^T φ    (23)

So ||φw − y||^2 is convex, since ∇²f(w) is positive semi-definite and the domain R^n for w is convex.

Q. Is ||φw − y||^2 strictly convex?
A. Iff φ has full column rank.

(Side note: Weka, http://www.cs.waikato.ac.nz/ml/weka/)
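These conditions can be checked numerically through the eigenvalues of the Hessian 2φ^T φ. A sketch (assuming NumPy), using a φ with a duplicated column as in example (15), so that convexity holds but strict convexity fails:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
# Duplicated x^2 column, as in the rank-deficient example (15):
Phi = np.column_stack([x, x**2, x**2, x**3])

H = 2 * Phi.T @ Phi                      # Hessian of ||phi w - y||^2
print(np.linalg.eigvalsh(H))             # all >= 0 (PSD: convex); one is ~0,
                                         # so not positive definite (not strictly convex)
print(np.linalg.matrix_rank(Phi))        # 3 < 4 columns: no full column rank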

5.5 Point 4
To prove: If a function is convex, any point of local minima ≡ point of global minima
Proof - (Cite : Theorem 69)[2]

To prove : If a function is strictly convex, it has a unique point of global minima


Proof - (Cite : Theorem 70)[2]
Since ||φw − y||^2 is strictly convex when φ has full column rank,

∇f(w*) = 0 for w* = (φ^T φ)^{-1} φ^T y    (24)
Thus, w* is a point of global minimum. One can also find a solution to φ^T φ w = φ^T y by Gaussian elimination.

5.5.1 Overfitting

Figure 9: Train-RMS and test-RMS values vs. t (degree of polynomial)

• Too many bends (t = 9 onwards) in the curve ≡ high values of some w_i's

• Train and test errors differ significantly

5.5.2 Next problem


Find

w* = argmin_w ||φw − y||^2  s.t. ||w||_p ≤ ζ    (25)

where

||w||_p = (Σ_{i=1}^{n} |w_i|^p)^{1/p}    (26)

5.6 Point 5
Q. How to solve constrained problems of the above-mentioned type?
A. The general problem format is:

Minimize f(w) s.t. g(w) ≤ 0    (27)
Figure 10: p-Norm curves for constant norm value and different p

Figure 11: Level curves and constraint regions

At the point of optimality,

either g(w*) < 0 and ∇f(w*) = 0    (28)
or g(w*) = 0 and ∇f(w*) = α∇g(w*)    (29)

If w* is on the border of g, i.e., g(w*) = 0,

∇f(w*) = α∇g(w*)    (30)
(Duality Theory) (Cite : Section 4.4, pg-72)[2]

Intuition: If the above did not hold, then we would have ∇f(w*) = α_1 ∇g(w*) + α_2 ∇^⊥ g(w*), where by moving in the direction ±∇^⊥ g(w*) we would remain on the boundary g(w*) = 0 while decreasing/increasing the value of f, which is not possible at a point of optimality.
6 Lecture 6 : Regularized Solution to Regression Problem


In the last lecture, we derived the solution for the regression problem formulated in the least-squares sense, which aims to minimize the RMS error over observed data points. We also analysed conditions under which the obtained solution is guaranteed to be a global minimum. However, as we observed, increasing the order of the model yielded larger RMS error over test data, due to large fluctuations in the learnt model and consequently very high values of the model coefficients (weights). In this lecture, we discuss how the optimization problem can be modified to counter very large magnitudes of coefficients. Subsequently, a solution to this problem is provided through the Lagrange dual formulation, followed by a discussion of the obtained solution and its impact on test data. Towards the end of the lecture, a very gentle introduction to axiomatic probability is provided.

6.1 Problem formulation


To keep coefficients from becoming too large in magnitude, we may modify the problem to be a constrained optimization problem. Intuitively, we may impose a constraint on the magnitude of the coefficients. Any norm might give a good working solution for this purpose; however, for mathematical convenience, we start with the Euclidean (L2) norm. The overall problem with objective function and constraint is:

minimize_w (Φw − Y)^T (Φw − Y)
such that ||w||_2^2 ≤ ξ    (31)

As observed in the last lecture, the objective function, namely f(w) = (Φw − Y)^T (Φw − Y), is strictly convex. Further, the constraint function g(w) = ||w||_2^2 − ξ is also a convex function. For convex g(w), the set S = {w | g(w) ≤ 0} can be proved to be a convex set by taking two elements w_1 ∈ S and w_2 ∈ S, so that g(w_1) ≤ 0 and g(w_2) ≤ 0. Since g(w) is a convex function, we have the inequality

g(θw_1 + (1 − θ)w_2) ≤ θg(w_1) + (1 − θ)g(w_2) ≤ 0; ∀θ ∈ [0, 1], w_1, w_2 ∈ S    (32)

As g(θw_1 + (1 − θ)w_2) ≤ 0 for all θ ∈ [0, 1] and all w_1, w_2 ∈ S, we have θw_1 + (1 − θ)w_2 ∈ S, which is both sufficient and necessary for S to be a convex set. Hence, the function g(w) imposes a convex constraint over the solution space.

6.2 Duality and KKT conditions


Given convex objective and constraint functions, the minimum w* can occur in one of the following two ways:

1. g(w*) = 0 and ∇f(w*) = α∇g(w*)

2. g(w*) < 0 and ∇f(w*) = 0

This fact can be easily visualized from Figure 12. The first condition occurs when the minimum lies on the boundary of the function g; in this case, the gradient vectors of f and g at w* point in the same direction, up to multiplication by a constant α ∈ R.

Figure 12: Two plausible scenarios for the minimum: a) the minimum is on the constraint function boundary, in which case the gradients point in the same direction up to a constant, and b) the minimum is inside the constraint space (shown in yellow shade), in which case ∇f(w*) = 0.

The second condition depicts the case when the minimum lies inside the constraint space (the interior of the feasible region {w | g(w) ≤ 0}). This space is shown shaded in Figure 12. Clearly, in this case ∇f(w*) = 0 for the minimum to occur. This primal problem can be converted to a dual using a Lagrange multiplier: the problem is converted to the objective function augmented by a weighted sum of the constraint functions, giving the corresponding Lagrangian:

L(w, λ) = f(w) + λg(w); λ ∈ R_+    (33)

Here, we wish to penalize higher-magnitude coefficients, hence we wish g(w) to be negative while minimizing the Lagrangian. To maintain this direction, we must have λ ≥ 0. Also, for the solution w* to be feasible, g(w*) ≤ 0. Due to the complementary slackness condition, we further have λg(w*) = 0, which roughly says that the Lagrange multiplier is zero unless the constraint is active at the minimum point. As w* minimizes the Lagrangian L(w, λ), the gradient must vanish at this point, and hence we have ∇f(w*) + λ∇g(w*) = 0. In general, an optimization problem with inequality and equality constraints can be written as:

minimize f(w)
subject to g_i(w) ≤ 0; i = 1, 2, ..., m    (34)
           h_j(w) = 0; j = 1, 2, ..., p

Here, w ∈ R^n and the domain is the intersection of the domains of all the functions. The Lagrangian for this case is:

L(w, λ, µ) = f(w) + Σ_{i=1}^{m} λ_i g_i(w) + Σ_{j=1}^{p} µ_j h_j(w)    (35)
The Lagrange dual function is the infimum of the Lagrangian over w, for given λ ∈ R^m and µ ∈ R^p:

z(λ, µ) = inf_w L(w, λ, µ)    (36)

The dual function always yields a lower bound on the optimum of the primal formulation. The dual function is used in characterizing the duality gap, which depicts the suboptimality of a solution; the duality gap is the gap between the primal and dual objectives, f(w) − z(λ, µ). When the functions f and g_i, ∀i ∈ [1, m], are convex and the h_j, ∀j ∈ [1, p], are affine, the Karush-Kuhn-Tucker (KKT) conditions are both necessary and sufficient for points to be both primal and dual optimal with zero duality gap. For the above formulation of the problem, the KKT conditions for differentiable functions f, g_i, h_j, with ŵ primal optimal and (λ̂, µ̂) dual optimal, are:

1. ∇f(ŵ) + Σ_{i=1}^{m} λ̂_i ∇g_i(ŵ) + Σ_{j=1}^{p} µ̂_j ∇h_j(ŵ) = 0

2. g_i(ŵ) ≤ 0; i = 1, 2, ..., m

3. λ̂_i ≥ 0; i = 1, 2, ..., m

4. λ̂_i g_i(ŵ) = 0; i = 1, 2, ..., m

5. h_j(ŵ) = 0; j = 1, 2, ..., p

6.3 Bound on λ in the regularized least square solution


As discussed earlier, we need to minimize the error function subject to the constraint ||w||^2 ≤ ξ. Applying the KKT conditions to this problem, if w* is a global optimum then from the first KKT condition we get

∇_{w*}(f(w) + λg(w)) = 0    (37)

where f(w) = (Φw − Y)^T (Φw − Y) and g(w) = ||w||^2 − ξ.
Solving, we get

2(Φ^T Φ)w* − 2Φ^T y + 2λw* = 0

i.e.

w* = (Φ^T Φ + λI)^{-1} Φ^T y    (39)

From the second KKT condition we get

||w*||^2 ≤ ξ    (40)

From the third KKT condition,

λ ≥ 0    (41)

From the fourth condition,

λ||w*||^2 = λξ    (42)
Thus values of w* and λ which satisfy all these equations yield an optimum solution.
Consider equation (39),

w* = (Φ^T Φ + λI)^{-1} Φ^T y

Premultiplying both sides with (Φ^T Φ + λI) we have

(Φ^T Φ + λI)w* = Φ^T y
∴ (Φ^T Φ)w* + (λI)w* = Φ^T y
∴ ||(Φ^T Φ)w* + (λI)w*|| = ||Φ^T y||

By the triangle inequality,

||(Φ^T Φ)w*|| + λ||w*|| ≥ ||(Φ^T Φ)w* + (λI)w*|| = ||Φ^T y||    (43)

Now, Φ^T Φ is an n×n matrix which can be determined since Φ is known, and ||(Φ^T Φ)w*|| ≤ α||w*|| for some finite α. Substituting into the previous inequality,

(α + λ)||w*|| ≥ ||Φ^T y||

i.e.

λ ≥ ||Φ^T y|| / ||w*|| − α    (44)

Note that when ||w*|| → 0, λ → ∞. This is expected, as a higher value of λ focuses more on reducing ||w*|| than on minimizing the error function. Since

||w*||^2 ≤ ξ

eliminating ||w*|| from equation (44) we get

∴ λ ≥ ||Φ^T y|| / √ξ − α    (45)

This is not the exact solution for λ, but the bound (45) proves the existence of a suitable λ for some ξ and Φ.

6.4 RMS Error variation


Recall the polynomial curve fitting problem considered in earlier lectures. Figure 13 shows the RMS error variation as the degree of the polynomial (assumed to fit the points) is increased. We observe that as the degree of the polynomial is increased up to 5, both train and test errors decrease. For degree > 7, the test error shoots up. This is attributed to the overfitting problem (the training set has 8 points).
Now see Figure 14, where the variation of the RMS error with the Lagrange multiplier λ has been explored (keeping the polynomial degree constant at 6). Given this analysis, what is the optimum value of λ that must be chosen? We have to choose the value for which the test error is minimum (identified as "optimum" in the figure).

6.5 Alternative objective function


Consider equation (37). If we substitute g(w) = ||w||^2 − ξ, we get

∇_{w*}(f(w) + λ·(||w||^2 − ξ)) = 0    (46)

This is equivalent to finding

min_w (||Φw − y||^2 + λ||w||^2)    (47)
Figure 13: RMS error vs. degree of polynomial for test and train data.

For the same λ these two solutions are the same. This form of regression is known as ridge regression. If we use the L1 norm instead, it is called the lasso. Note that the closed form of w* that we derived is valid only for the L2 norm.
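A sketch of the regularized (ridge) solution w* = (Φ^T Φ + λI)^{-1} Φ^T y (assuming NumPy; the data, basis, and λ values are illustrative). It shows how the coefficient magnitudes shrink as λ grows:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 2, 10)
y = np.sin(np.pi * x) + rng.normal(0, 0.1, x.shape)
Phi = np.vander(x, 7, increasing=True)            # degree-6 polynomial basis

for lam in (0.0, 1e-3, 1.0):
    # Ridge regression closed form: (Phi^T Phi + lam I)^{-1} Phi^T y
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
    print(lam, np.linalg.norm(w))                 # ||w|| shrinks as lam grows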

6.6 A review of probability theory


Let us now review some basics of probability theory. More details will be covered in the next lecture.
Definition 1. Sample space (S): A sample space is defined as the set of all possible outcomes of an experiment. An example of an experiment would be a coin pair toss. In this case S = {HH, HT, TH, TT}.
Definition 2. Event (E): An event is defined as any subset of the sample space. The total number of distinct events possible is 2^{|S|}, where |S| is the number of elements in the sample space. For a coin pair toss experiment, some examples of events are:
for at least one head, E = {HH, HT, TH}
for all tails, E = {TT}
for either a head or a tail or both, E = {HH, HT, TH, TT}
Definition 3. Random variable (X): A random variable is a mapping (or function) from the sample space to a set of real numbers. A continuous random variable is defined as

X : S → R

On the other hand, a discrete random variable maps outcomes to a countable set (e.g., discrete real numbers):

X : S → discrete R
Figure 14: RMS error vs. 10λ for test and train data (at polynomial degree = 6).

6.6.1 The three axioms of probability

Probability Pr is a number assigned to events. It satisfies the following three axioms:

Axiom 1. For every event E, Pr(E) ∈ [0, 1]
Axiom 2. Pr(S) = 1 (equivalently, Pr(∅) = 0)
Axiom 3. If E_1, E_2, ..., E_n is a set of pairwise disjoint events, then

Pr(∪_{i=1}^{n} E_i) = Σ_{i=1}^{n} Pr(E_i)

6.6.2 Bayes’ theorem

Let B_1, B_2, ..., B_n be a set of mutually exclusive events that together form the sample space S. Let A be any event from the same sample space, such that Pr(A) > 0. Then

Pr(B_i/A) = Pr(B_i ∩ A) / [Pr(B_1 ∩ A) + Pr(B_2 ∩ A) + ... + Pr(B_n ∩ A)]    (48)

Using the relation Pr(B_i ∩ A) = Pr(B_i) · Pr(A/B_i),

Pr(B_i/A) = [Pr(B_i) · Pr(A/B_i)] / [Σ_{j=1}^{n} Pr(B_j) · Pr(A/B_j)]    (49)

Example 1. A lab test is 99% effective in detecting a disease when in fact it is present. However,
the test also yields a false positive for 0.5% of the healthy patients tested. If 1% of the population
has that disease, then what is the probability that a person has the disease given that his/her test is
positive?
Solution 1. Let H be the event that a tested person is actually healthy, D the event that a tested person does have the disease, and T the event that the test comes out positive for a person.
We want to find Pr(D/T). H and D are disjoint events; together they form the sample space. Using Bayes' theorem,

Pr(D/T) = Pr(D) · Pr(T/D) / [Pr(D) · Pr(T/D) + Pr(H) · Pr(T/H)]    (50)

Now, Pr(D) = 0.01 (given). Since Pr(D) + Pr(H) = 1, Pr(H) = 0.99.
The lab test is 99% effective when the disease is present. Hence, Pr(T/D) = 0.99.
There is a 0.5% chance that the test will give a false positive for a healthy person. Hence, Pr(T/H) = 0.005.
Plugging these values into equation (50) we get

Pr(D/T) = (0.01 × 0.99) / (0.01 × 0.99 + 0.99 × 0.005) = 2/3
What does this mean? It means that there is a 66.66% chance that a person with a positive test result actually has the disease. For a test to be good we would have expected higher certainty. So, despite the fact that the test is 99% effective for a person actually having the disease, the false positives reduce the overall usefulness of the test.
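As a quick sanity check, the computation in equation (50) is a one-liner in Python (a sketch; the numbers are those of the example):

# Pr(D) = 0.01, Pr(T|D) = 0.99, Pr(H) = 0.99, Pr(T|H) = 0.005
p_d, p_t_d, p_h, p_t_h = 0.01, 0.99, 0.99, 0.005
print((p_d * p_t_d) / (p_d * p_t_d + p_h * p_t_h))   # 0.666... = 2/3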

6.6.3 Independent events

Two events E_1 and E_2 are called independent iff their probabilities satisfy

P(E_1 E_2) = P(E_1) · P(E_2)    (51)

where P(E_1 E_2) means P(E_1 ∩ E_2).
In general, events belonging to a set are called mutually independent iff, for every finite subset E_1, ..., E_n of this set,

Pr(∩_{i=1}^{n} E_i) = Π_{i=1}^{n} Pr(E_i)    (52)
7 Lecture 7 : Probability
This lecture gives an overview of probability theory. It discusses distribution functions; the notions of expectation and variance; Bernoulli and binomial random variables; and the central limit theorem.

7.1 Note
• Pr – probability of an event in general
• F – cumulative distribution function
• p – probability density function (pdf) or probability mass function (pmf)
• pdf – continuous random variable case
• pmf – discrete random variable case

7.2 Part of speech (POS) example

Problem Statement:
A set of n words, each of a particular part of speech (noun/verb/etc.), is picked. The probability that a word is of part of speech type k is p_k. Assuming the picking of words is done independently, find the probability that the set contains a noun given that it contains a verb.

Solution
Let A_k be the event that the set contains POS type k. Then

Pr(A_k) = 1 − (1 − p_k)^n

where (1 − p_k)^n is the probability that all n words are not of POS type k.

Pr(A_noun/A_verb) = Pr(A_noun ∩ A_verb) / Pr(A_verb)

Pr(A_{k1} ∩ A_{k2}) = 1 − (1 − p_{k1})^n − (1 − p_{k2})^n + (1 − p_{k1} − p_{k2})^n

Pr(A_noun/A_verb) = [1 − (1 − p_noun)^n − (1 − p_verb)^n + (1 − p_noun − p_verb)^n] / [1 − (1 − p_verb)^n]
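The closed-form answer is easy to verify by simulation. A sketch (assuming NumPy; the POS probabilities and n are illustrative assumptions, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
n = 5
p = {"noun": 0.3, "verb": 0.2, "adj": 0.5}        # illustrative POS probabilities
tags = list(p)

def closed_form(pn, pv, n):
    # Pr(A_noun | A_verb) from the formula above
    return (1 - (1-pn)**n - (1-pv)**n + (1-pn-pv)**n) / (1 - (1-pv)**n)

trials = 200_000
words = rng.choice(tags, size=(trials, n), p=[p[t] for t in tags])
has_verb = (words == "verb").any(axis=1)
has_noun = (words == "noun").any(axis=1)
print(closed_form(p["noun"], p["verb"], n))
print((has_noun & has_verb).sum() / has_verb.sum())  # should agree closely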

7.3 Probability mass function (pmf) and probability density function (pdf)

pmf: a function that gives the probability that a discrete random variable is exactly equal to some value (source: Wikipedia):

p_X(a) = Pr(X = a)

Cumulative distribution function (discrete case):

F(a) = Pr(X ≤ a)

pdf: a probability density function of a continuous random variable describes the relative likelihood of this random variable occurring at a given point in the observation space (source: Wikipedia):

Pr(X ∈ D) = ∫_D p(x) dx

where D is a set of reals and p(x) is the density function.

Cumulative distribution function (continuous case):

F(a) = Pr(X ≤ a) = ∫_{−∞}^{a} p(x) dx
p(a) = dF(x)/dx |_{x=a}

7.3.1 Joint distribution function

If p(x, y) is a joint pdf (continuous case):

F(a, b) = Pr(X ≤ a, Y ≤ b) = ∫_{−∞}^{b} ∫_{−∞}^{a} p(x, y) dx dy
p(a, b) = ∂²F(x, y)/∂x∂y |_{(a,b)}

For the discrete case, i.e., p(x, y) is a joint pmf:

F(a, b) = Σ_{x ≤ a} Σ_{y ≤ b} p(x, y)

7.3.2 Marginalization

The marginal probability is the unconditional probability P(A) of the event A; that is, the probability of A regardless of whether event B did or did not occur. If B can be thought of as the event of a random variable X having a given outcome, the marginal probability of A can be obtained by summing (or integrating, more generally) the joint probabilities over all outcomes for X. For example, if there are two possible outcomes for X with corresponding events B and B', this means that P(A) = P(A ∩ B) + P(A ∩ B'). This is called marginalization.

Discrete case:
P(X = a) = Σ_y p(a, y)

Continuous case:
p_X(a) = ∫_{−∞}^{∞} p(a, y) dy

7.4 Example
Statement: X and Y are independent continuous random variables with the same density function:

p(x) = e^{−x} if x > 0; 0 otherwise.
Find the density of X/Y.
Note: they are independent.

Solution

F_{X/Y}(a) = Pr(X/Y ≤ a)
           = ∫_0^∞ ∫_0^{ya} p(x, y) dx dy
           = ∫_0^∞ ∫_0^{ya} e^{−x} e^{−y} dx dy
           = 1 − 1/(a + 1)
           = a/(a + 1)

f_{X/Y}(a) = derivative of F_{X/Y}(a) w.r.t. a = 1/(a + 1)^2 > 0

7.5 Conditional Density


Discrete case:

p_{X|Y}(x|y) = P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)

Continuous case:

p_{X|Y}(x|y) = p_{X,Y}(x, y) / p_Y(y) = p_{X,Y}(x, y) / ∫_{−∞}^{∞} p(x, y) dx

7.6 Expectation
Discrete case: expectation is the probability-weighted sum of possible values:

E(X) = Σ_i x_i Pr(x_i), where X is a random variable

Continuous case: expectation is the probability-density-weighted integral of possible values:

E(X) = ∫_{−∞}^{∞} x p(x) dx

If we take a function f of the random variable X, then:

Discrete case: E(f(X)) = Σ_i f(x_i) Pr(x_i)

Continuous case: E(f(X)) = ∫_{−∞}^{∞} f(x) p(x) dx
7.6.1 Properties of E(x)

E[X + Y] = E[X] + E[Y]

For any constant c and any random variable X,

E[(X − c)^2] ≥ E[(X − µ)^2], where µ = E[X]

E[cX] = cE[X]

7.7 Variance
For any random variable X, the variance is defined as follows:

Var[X] = E[(X − µ)^2]
⇒ Var[X] = E[X^2] − 2µE[X] + µ^2
⇒ Var[X] = E[X^2] − (E[X])^2

Var[αX + β] = α^2 Var[X]

7.8 Covariance
For random variables X and Y, the covariance is defined as:

Cov[X, Y] = E[(X − E(X))(Y − E(Y))] = E[XY] − E[X]E[Y]

If X and Y are independent then their covariance is 0, since in that case E[XY] = E[X]E[Y]. However, covariance being 0 does not necessarily imply that the variables are independent.

7.8.1 Properties of Covariance

Cov[X + Z, Y] = Cov[X, Y] + Cov[Z, Y]

Cov[Σ_i X_i, Y] = Σ_i Cov[X_i, Y]

Cov[X, X] = Var[X]

7.9 Chebyshev’s Inequality


Chebyshev's inequality states that if X is any random variable with mean µ and variance σ^2, then ∀k > 0,

Pr[|X − µ| ≥ k] ≤ σ^2/k^2

As n tends to infinity, the sample mean tends to converge to µ, giving rise to the weak law of large numbers:

Pr[|(X_1 + X_2 + ... + X_n)/n − µ| ≥ k] tends to 0 as n tends to ∞

7.10 Bernoulli Random Variable


A Bernoulli random variable is a discrete random variable taking values 0, 1.
Say Pr[X_i = 0] = 1 − q, where q ∈ [0, 1]. Then Pr[X_i = 1] = q.

E[X] = (1 − q) · 0 + q · 1 = q
Var[X] = q − q^2 = q(1 − q)

7.11 Binomial Random Variable


A binomial random variable is a discrete variable counting the number of ones in a series of n independent experiments with 0/1 outcomes, the probability that the outcome of a particular experiment is 1 being q.

Pr[X = k] = (n choose k) q^k (1 − q)^{n−k}

E[X] = Σ_i E[Y_i] = nq, where the Y_i are Bernoulli random variables

Var[X] = Σ_i Var[Y_i] = nq(1 − q) (since the Y_i's are independent)

An example of a binomial distribution: a coin tossed n times, counting the number of times heads shows up.

7.12 Central Limit Theorem


If X_1, X_2, ..., X_m is a sequence of i.i.d. random variables each having mean µ and variance σ^2, then for large m, X_1 + X_2 + ... + X_m is approximately normally distributed with mean mµ and variance mσ^2.
If X ∼ N(µ, σ^2), then

P[x] = (1/(σ√(2π))) e^{−(x−µ)^2/(2σ^2)}

It can be shown by the CLT that

• (X_1 + X_2 + ... + X_n − nµ)/√(σ^2 n) ∼ N(0, 1)
• Sample mean: µ̂ ∼ N(µ, σ^2/m)
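A sketch of the theorem in action (assuming NumPy; the underlying distribution is an illustrative choice): standardized sums of i.i.d. uniform variables behave like N(0, 1).

import numpy as np

rng = np.random.default_rng(0)
m = 1000
mu, var = 0.5, 1.0 / 12                    # mean and variance of Uniform(0, 1)

# 10000 replications of X_1 + ... + X_m with X_i ~ Uniform(0, 1):
sums = rng.uniform(0, 1, size=(10000, m)).sum(axis=1)
z = (sums - m * mu) / np.sqrt(m * var)     # standardize as in the CLT
print(z.mean(), z.var())                   # ~ 0 and ~ 1, as N(0, 1) predicts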
7.13 Maximum Likelihood and Estimator


An estimator is a function of the random sample that is meant to approximate a parameter.

• µ̂ is an estimator for µ
• The maximum likelihood estimator q̂_MLE for q (the parameter of Bern(q) from which we get the sample data):

L(X̂_1, X̂_2, ..., X̂_n | q) = q^{X̂_1}(1 − q)^{1−X̂_1} ... q^{X̂_n}(1 − q)^{1−X̂_n}

q̂_MLE = argmax_q L(X̂_1, X̂_2, ..., X̂_n | q)

E.g.: Bernoulli random variable

p(x) = µ^x (1 − µ)^{1−x}
D = X_1, X_2, ..., X_m is a random sample
L(D|µ) = Π_i µ^{x_i} (1 − µ)^{1−x_i}

GOAL:
µ̂_MLE = argmax_µ L(D|µ)
Equivalently:
µ̂_MLE = argmax_µ log L(D|µ)

Setting d log L(D|µ)/dµ = 0 gives µ̂_MLE = Σ X_i / m

Summary:

1. µ̂_MLE is a function of the random sample
2. It is called an estimator in terms of the X_i's
3. It is called an estimate in terms of the x_i's
4. It coincides with the sample mean

Recall from the CLT that for large m, Σ_i X_i is approximately N(mµ, mσ^2) if each X_i has

E(X_i) = µ
V(X_i) = σ^2

Thus:

(Σ_i X_i − mµ)/√(σ^2 m) ∼ N(0, 1) and µ̂_MLE ∼ N(µ, σ^2/m)

Question: Given an instantiation x_1, x_2, ..., x_m of X_1, X_2, ..., X_m (called the data D), you have the MLE estimate ΣX_i/m, which is a point estimate. How confident can you be that the actual µ is within Σx_i/m ± z for some z? This is called an interval estimate.

7.14 Bayesian estimator


L(X_1, X_2, ..., X_n | µ) = Π_i µ^{X_i} (1 − µ)^{1−X_i}

Take the uniform prior p(µ) = θ for all µ ∈ (0, 1), θ ≥ 0;

∫_0^1 p(µ) dµ = 1 implies that θ = 1

Posterior = P(µ | x_1, x_2, ..., x_n) (Bayesian posterior)

= L(x_1, x_2, ..., x_n | µ) p(µ) / ∫_0^1 L(x_1, x_2, ..., x_n | µ) p(µ) dµ
= µ^{Σ_i x_i} (1 − µ)^{Σ_i (1−x_i)} / ∫_0^1 µ^{Σ_i x_i} (1 − µ)^{Σ_i (1−x_i)} dµ

Bayes estimate: E(µ | x_1, x_2, ..., x_n) = ∫ µ P(µ | x_1, x_2, ..., x_n) dµ

Expected µ under the posterior = (Σ_{i=1}^{m} x_i + 1)/(m + 2)

The expected value of the parameter µ under the posterior is called the Bayes estimate.

The Beta distribution is the conjugate prior for the Bernoulli distribution:

β(µ | a, b) = Γ(a + b) µ^{a−1} (1 − µ)^{b−1} / (Γ(a)Γ(b))

Note: the prior and likelihood should have the same form for the posterior to have the same form as the prior. If so, the chosen prior is called a conjugate prior.
8 Lecture 8
8.1 Bernoulli Distribution
The general formula for the probability of a Bernoulli random variable x is

p(x) = µ^x (1 − µ)^{1−x}

The likelihood of the data given µ is

L(D|µ) = Π_{i=1}^{m} µ^{x_i} (1 − µ)^{1−x_i}
Our goal is to find the maximum likelihood estimate µ̂_MLE = argmax_µ L(D|µ).

Since log is a monotonically increasing function, we can equivalently write

µ̂_MLE = argmax_µ LL(D|µ)

where LL represents the log of the likelihood of the data given µ. This can also be written as

µ̂_MLE = argmax_µ log(L(D|µ))

⇒ µ̂_MLE = argmax_µ Σ_{i=1}^{m} [X_i ln µ + (1 − X_i) ln(1 − µ)]
⇒ µ̂_MLE = argmax_µ [ln µ · Σ_{i=1}^{m} X_i + ln(1 − µ) · Σ_{i=1}^{m} (1 − X_i)]

such that 0 ≤ µ ≤ 1.
To find the maximum of LL(D|µ), we set d LL(D|µ)/dµ = 0:

⇒ µ̂_MLE = Σ_{i=1}^{m} X_i / (Σ_{i=1}^{m} X_i + Σ_{i=1}^{m} (1 − X_i)) = Σ_{i=1}^{m} X_i / m

Thus, we know that:

(1) µ̂_MLE is a function of the random sample.
(2) It is called an estimator in terms of the X_i's.
(3) It is called an estimate in terms of the x_i's.
(4) It coincides with the sample mean.

From the central limit theorem, we know that for large m,

Σ_{i=1}^{m} X_i ∼ N(mµ, mσ^2)

if each X_i has E[X_i] = µ and V[X_i] = σ^2.
Thus,

(Σ_{i=1}^{m} X_i − mµ)/(σ√m) ∼ N(0, 1)

and

µ̂_MLE ∼ N(µ, σ^2/m)
Question. Given an instantiation (x_1, x_2, ..., x_m) of (X_1, X_2, ..., X_m), called the training data D, you have the maximum likelihood estimate Σ_i x_i / m (a point estimate). How confident can you be that the actual µ is within (Σ_i x_i)/m ± Z for some Z?

Answer: Here we are looking for an interval estimate:

µ̂_MLE ± Z √(µ̂(1 − µ̂)/m)

where we look up Z from a table for the standard normal distribution, choosing Z for a given α such that Pr(x ≥ Z) = α. Then the probability that µ ∈ (µ̂_MLE ± Z √(µ̂(1 − µ̂)/m)) is (1 − 2α).
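A sketch of the point and interval estimates (assuming NumPy and SciPy for the normal quantile; the sample is simulated with an illustrative true parameter):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, mu_true = 500, 0.3
x = rng.binomial(1, mu_true, size=m)        # Bernoulli(mu_true) sample

mu_mle = x.mean()                           # point estimate: sum(x_i) / m
alpha = 0.025                               # (1 - 2*alpha) = 95% interval
Z = norm.ppf(1 - alpha)                     # Z such that Pr(X >= Z) = alpha
half = Z * np.sqrt(mu_mle * (1 - mu_mle) / m)
print(mu_mle, (mu_mle - half, mu_mle + half))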

8.2 Bayesian Estimation


Likelihood: L(X_1, X_2, ..., X_m | µ) = Π_{i=1}^{m} µ^{x_i} (1 − µ)^{1−x_i}

Uniform prior: p(µ) = θ, θ ≥ 0, for all µ ∈ [0, 1]; since

∫_0^1 p(µ) dµ = 1  ⇒  θ = 1
The posterior (Bayesian posterior) is:

p(µ | x_1, x_2, ..., x_m) = L(x_1, x_2, ..., x_m | µ) p(µ) / ∫_0^1 L(x_1, x_2, ..., x_m | µ) p(µ) dµ

= µ^{Σx_i} (1 − µ)^{Σ(1−x_i)} · 1 / ∫_0^1 µ^{Σx_i} (1 − µ)^{Σ(1−x_i)} dµ

= (m + 1)! µ^{Σx_i} (1 − µ)^{Σ(1−x_i)} / [(Σx_i)! (m − Σx_i)!]

Expectation: E(µ | x_1, x_2, ..., x_m) = ∫ µ p(µ | x_1, x_2, ..., x_m) dµ

Expected µ under the posterior = (Σ_{i=1}^{m} x_i + 1)/(m + 2)

Thus if we tossed a coin 2 times and X_1 = 1, X_2 = 1, then

E(µ | 1, 1) = (2 + 1)/(2 + 2) = 3/4
µ̂_B = E[µ|D], the expected value of µ under the posterior p(µ|D).

Beta(µ | a, b) = [Γ(a + b)/(Γ(a)Γ(b))] µ^{a−1} (1 − µ)^{b−1}

Beta is the conjugate prior to the Bernoulli distribution, and Γ(a + b)/(Γ(a)Γ(b)) is a normalization constant.

L(x_1 ... x_m | µ) = µ^{Σ_{i=1}^{m} x_i} (1 − µ)^{Σ_{i=1}^{m} (1−x_i)}

The prior should have the same form as the likelihood, up to a normalization constant, since

p(µ|D) ∝ L(D|µ) p(µ)

If µ_prior ∼ Beta(µ | a, b), then

p(µ | x_1 ... x_m) ∝ L(x_1 ... x_m | µ) Beta(µ | a, b)
= Γ(m + a + b) µ^{Σ_i x_i + a − 1} (1 − µ)^{Σ_i (1−x_i) + b − 1} / [Γ(Σ_i x_i + a) Γ(Σ_i (1−x_i) + b)]
∼ Beta(Σ_i x_i + a, Σ_i (1−x_i) + b)

E_{Beta(a,b)}[µ] = a/(a + b)

E_{Beta(a+Σ_i x_i, b+Σ_i (1−x_i))}[µ | x_1 ... x_m] = (Σ_i x_i + a)/(Σ_i (1−x_i) + Σ_i x_i + a + b) = (Σ_i x_i + a)/(m + a + b)

For large m (with a, b << m),

µ̂_Bayes → µ̂_MLE

Let us say we make k more observations y_1, ..., y_k. The posterior becomes

Beta(Σ_{i=1}^{m} x_i + Σ_{j=1}^{k} y_j + a, Σ_{i=1}^{m} (1 − x_i) + Σ_{j=1}^{k} (1 − y_j) + b)
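A sketch of this conjugate update (assuming NumPy; the prior hyperparameters and true coin bias are illustrative):

import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 2.0                              # illustrative Beta(a, b) prior
x = rng.binomial(1, 0.7, size=100)           # 100 Bernoulli(0.7) observations

a_post = a + x.sum()                         # a + sum x_i
b_post = b + (1 - x).sum()                   # b + sum (1 - x_i)
print(a_post / (a_post + b_post))            # posterior mean ~ MLE for large m

# k more observations y_1..y_k just update the counts again:
y = rng.binomial(1, 0.7, size=20)
a_post, b_post = a_post + y.sum(), b_post + (1 - y).sum()
print(a_post / (a_post + b_post))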
Multinomial
Say the observations are dice tosses; there are n possible outcomes, each modeled by a vector x^k = (0, 0, ..., 1, ..., 0) whose k-th component is 1 (x^k_k = 1 and x^k_i = 0 for all i ≠ k), so Σ_i x^k_i = 1.

Also, p(X = x^k) = µ_k, such that Σ_{k=1}^{n} µ_k = 1.

Then p(X = x) = Π_{k=1}^{n} µ_k^{x_k}, and

E(x) = Σ_{k=1}^{n} x^k p(x^k) = (µ_1, µ_2, ..., µ_n)^T
9 Lecture 9 : Multinomial Distribution


This lecture was a continuation of the previous discussion of the multinomial distribution. Here we discuss the conjugate prior and posterior of the multinomial distribution. Then we extend the discussion to the Gaussian distribution.

Question: What is the conjugate prior for the multinomial parameters, with hyperparameters α_i?
Answer: A joint distribution of the form

P(µ_1, ..., µ_n | α_1, ..., α_n) ∝ Π_{i=1}^{n} µ_i^{α_i − 1}    (53)

Note: for the normalising constant, if P(x) ∝ f(x), then

P(x) = f(x) / ∫ f(x) dx

Since ∫ P(µ_1, ..., µ_n | α_1, ..., α_n) dµ = 1, integrating gives the normalisation constant:

P(µ_1, ..., µ_n | α_1, ..., α_n) = [Γ(Σ_{i=1}^{n} α_i) / Π_{i=1}^{n} Γ(α_i)] Π_{i=1}^{n} µ_i^{α_i − 1}    (54)

which follows Dir(α_1, ..., α_n).

The Dirichlet distribution is a generalisation of the Beta distribution, just as the multinomial is a generalisation of the Bernoulli distribution.

9.0.1 Posterior probability


P(µ_1, ..., µ_n | X_1, ..., X_m) = P(X_1, ..., X_m | µ_1, ..., µ_n) P(µ_1, ..., µ_n) / P(X_1, ..., X_m)

⇒ P(µ_1, ..., µ_n | X_1, ..., X_m) = [Γ(Σ_{i=1}^{n} α_i + m) / Π_{i=1}^{n} Γ(α_i + Σ_{k=1}^{m} X_{k,i})] Π_{i=1}^{n} µ_i^{α_i − 1 + Σ_{k=1}^{m} X_{k,i}}    (55)

9.0.2 Summary

- For the multinomial, the maximum likelihood estimate of the mean is:

  µ̂_i,MLE = ( Σ_{k=1}^m X_{k,i} ) / m    (56)

- The conjugate prior follows Dir(α_1 ... α_n).

- The posterior is Dir(..., α_i + Σ_{k=1}^m X_{k,i}, ...).

- The expectation of µ for Dir(α_1 ... α_n) is:

  E[µ]_{Dir(α_1...α_n)} = ( α_1/Σ_i α_i, ..., α_n/Σ_i α_i )    (57)

- The expectation of µ for Dir(..., α_i + Σ_{k=1}^m X_{k,i}, ...) is:

  E[µ]_{Dir(...,α_i+Σ_k X_{k,i},...)} = ( (α_1 + Σ_k X_{k,1})/(Σ_i α_i + m), ..., (α_j + Σ_k X_{k,j})/(Σ_i α_i + m), ... )    (58)

Observations:

- α_1 = ... = α_n = 1 ⇒ uniform prior
- As m → ∞, µ̂_Bayes → µ̂_MLE
- If m = 0, µ̂_Bayes = µ̂_prior
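These updates can be checked numerically. A minimal sketch (the simulated dice rolls and the uniform prior are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(0, 6, size=50)   # hypothetical dice rolls, n = 6 outcomes
X = np.eye(6)[rolls]                  # X[k] is the one-hot vector of roll k

alpha = np.ones(6)                    # Dir(1,...,1), i.e. a uniform prior
counts = X.sum(axis=0)                # sum_k X_{k,i}

mu_mle = counts / len(rolls)                               # eq. (56)
mu_bayes = (alpha + counts) / (alpha.sum() + len(rolls))   # eq. (58)
print(np.round(mu_mle, 3), np.round(mu_bayes, 3))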

9.1 Gaussian Distribution


9.1.1 Information Theory

Let I(X = x) denote the amount of information conveyed in learning that X = x.

Figure 15: Figure showing curve where Information is not distributed all along.

Figure 16: Figure showing curve where Information is distributed.

Question: Consider the two curves above. Given the probability function p(x), when does knowing the value of X carry more information?
Answer: More information is conveyed in the case of Figure 15 than in Figure 16.

9.1.2 Expectation of I(X = x):

- If X and Y are independent random variables from the same distribution,

  I(X = x, Y = y) = I(X = x) + I(Y = y)    (59)

  The above equation can be equivalently stated as:

  I(P(x)P(y)) = I(P(x)) + I(P(y))    (60)

  where P(x), P(y) are the respective probability functions.

- If p(x) > p(y), then I(p(x)) < I(p(y)).

There is only one family of functions which satisfies the above two properties:

I(p(x)) = −c log(p(x))    (61)

- The entropy in the case of a discrete random variable can be defined as:

  E_P[I(p(x))] = −c Σ_x p(x) log p(x)    (62)

- In the case of a continuous random variable it is:

  E_P[I(p(x))] = −c ∫_x p(x) log p(x) dx    (63)

The constant c in the two equations above is conventionally taken to be 1.
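As a quick numerical illustration of (62) (a sketch, with c = 1 and two hypothetical distributions):

import numpy as np

def entropy(p):
    # E_P[I] = -sum_x p(x) log p(x), with c = 1 (natural log)
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                 # the term 0 log 0 is taken as 0
    return -np.sum(p * np.log(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform: log 4 ~ 1.386 (maximal)
print(entropy([0.97, 0.01, 0.01, 0.01]))   # peaked: much smaller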

9.1.3 Observations:

- For a discrete random variable (∼ countable domain), entropy is maximised by the uniform distribution.

- For a continuous random variable (∼ finite mean and finite variance), entropy is maximised by the Gaussian distribution:

  Finding argmax_p E_p[I] over an infinite domain, subject to

  ∫ x p(x) dx = µ,   ∫ (x − µ)² p(x) dx = σ²,

  the solution is

  p(x) = ( 1/√(2πσ²) ) e^{−(x−µ)²/(2σ²)}

9.1.4 Properties of the univariate Gaussian distribution

- If X ∼ N(µ, σ²),

  p(x) = ( 1/(σ√(2π)) ) e^{−(x−µ)²/(2σ²)},  −∞ < x < ∞,

  then w_1 X + w_0 ∼ N(w_1 µ + w_0, w_1² σ²).

  (One can prove this using the moment generating function:)

  Φ_{N(µ,σ²)}(t) = E_{N(µ,σ²)}[e^{tX}] = e^{µt + (σt)²/2}

  Recall that moments come from derivatives of Φ at t = 0:
  E(X) = dΦ/dt |_{t=0},  E(X²) = d²Φ/dt² |_{t=0},  var(X) = E(X²) − E(X)².

  E[e^{t(w_1 X + w_0)}] = e^{(w_1 µ + w_0)t + (w_1 σ t)²/2} ⇒ w_1 X + w_0 ∼ N(w_1 µ + w_0, w_1² σ²)

- The sum of i.i.d. X_1, X_2, ..., X_n ∼ N(µ, σ²) is also normal (Gaussian):

  X_1 + X_2 + ... + X_n ∼ N(nµ, nσ²)

  In general, if X_i ∼ N(µ_i, σ_i²), then Σ_{i=1}^n X_i ∼ N(Σ µ_i, Σ σ_i²).

- Corollary from (1): if X ∼ N(µ, σ²),

  z = (X − µ)/σ ∼ N(0, 1)  (useful in setting interval estimates)

  (take w_1 = 1/σ and w_0 = −µ/σ)

- Maximum likelihood estimate for µ and σ²:

  Given a random sample X_1, X_2, ..., X_m,

  µ̂_MLE = argmax_µ Π_{i=1}^m ( 1/(σ√(2π)) ) e^{−(X_i−µ)²/(2σ²)}
        = argmax_µ ( 1/(σ√(2π)) )^m e^{−Σ_i (X_i−µ)²/(2σ²)}

  µ̂_MLE = ( Σ_{i=1}^m X_i ) / m = sample mean

- Without relying on the central limit theorem, properties (2) and (1) give, for i.i.d. X_1, X_2, ..., X_m ∼ N(µ, σ²):

  µ̂_MLE ∼ N(µ, σ²/m)

  Similarly,

  σ̂²_MLE = ( Σ_{i=1}^m (X_i − µ̂_MLE)² ) / m follows a (scaled) χ² distribution.

  Note: if X_1, X_2, ..., X_m ∼ N(0, 1), then Σ_i X_i² ∼ χ²_m (m degrees of freedom).

- Coming up with a conjugate prior for N(µ, σ²):

  Case (1): σ² is fixed and the prior is on µ ⇒ µ ∼ N(µ_0, σ_0²)

  Case (2): µ is fixed and σ² has a prior ⇒ σ² ∼ Γ (more precisely, the precision 1/σ² is given a Gamma prior)

  Case (3): both µ and σ² have priors ⇒ (µ, σ²) ∼ Normal-Gamma distribution; the resulting predictive distribution is Student's t.



10 Lecture 10 : Multivariate Gaussian Distribution


We start the lecture by discussing the question given in the previous lecture and then move over to
Multivariate Gaussian Distribution.

The question was: if X_1 ... X_m ∼ N(µ, σ²), then, assuming σ² is known,

µ̂_ML = ( Σ_{i=1}^m X_i ) / m

σ̂²_ML = ( Σ_{i=1}^m (X_i − µ)² ) / m

Here σ̂²_ML follows the chi-squared distribution.

Figure: the nature of the (chi-squared) distribution of σ̂²_ML.

L(D | µ, σ) = Π_{i=1}^m ( 1/(σ√(2π)) ) exp( −(X_i − µ)²/(2σ²) )

i.e., LL(D | µ, σ) = −Σ_{i=1}^m (X_i − µ)²/(2σ²) − m ln(σ√(2π))

Question : What is the conjugate prior p(µ) if σ 2 is known?

Answer : Gaussian distribution.


Question : What is the conjugate prior p(σ 2 ) if µ is known?
Answer : Gamma distribution.

Question : What is the conjugate prior p(µ, σ 2 ) if both are unknown?


Answer: If X_i ∼ N(0, 1), then

Σ_{i=1}^m X_i² ∼ χ²_m, and

y = z / √( (Σ_i X_i²)/m ) ∼ t_m  (where z ∼ N(0, 1))

Here y follows the Student's t distribution.

Figure 10: Figure showing the Student's t distribution of y

10.1 Multivariate Gaussian Variable


Definition: If X ∼ N(µ, Σ) (where x ∈ R^n), then

p(x) = ( 1 / ( (2π)^{n/2} |Σ|^{1/2} ) ) exp( −(x − µ)^T Σ⁻¹ (x − µ) / 2 )

(Note: in pattern recognition, (x − µ)^T Σ⁻¹ (x − µ) is called the Mahalanobis distance between x and µ. If Σ = I, it reduces to the Euclidean distance.)

i) Σ can be assumed to be symmetric. Any matrix decomposes as A = A_sym + A_antisym, and the antisymmetric part contributes nothing to the quadratic form.

ii) If Σ is symmetric,

Σ = Σ_{i=1}^n λ_i (q_i q_i^T)  where the q_i are orthogonal.

Here Σ⁻¹ = Σ_{i=1}^n (1/λ_i) q_i q_i^T.

Let x′ = Q(x − µ)  (Q is the matrix with the q_i as column vectors). Then

p(x′) = Π_{j=1}^n ( 1/√(2πλ_j) ) exp( −(x′_j)²/(2λ_j) )

Here the joint distribution has been decomposed into a product of marginals in a shifted and rotated co-ordinate system, i.e., the x′_i are independent.


LL(x_1 ... x_m | µ, Σ) = −(nm/2) ln(2π) − (m/2) ln|Σ| − (1/2) Σ_{i=1}^m (x_i − µ)^T Σ⁻¹ (x_i − µ)

Set ∇_µ LL = 0, ∇_Σ LL = 0.

∇_µ LL = Σ⁻¹ Σ_i (x_i − µ) = 0

Since Σ is invertible,

Σ_i (x_i − µ) = 0,  i.e.,  µ̂_ML = ( Σ_i x_i ) / m

Σ̂_ML = (1/m) Σ_{i=1}^m (x_i − µ̂_ML)(x_i − µ̂_ML)^T

Here Σ̂_ML is called the empirical covariance matrix in statistics.

µ̂_ML ∼ N(µ, Σ/m),  E[µ̂_ML] = µ

Hence µ̂_ML is an unbiased estimator.

10.1.1 Unbiased Estimator

An estimator e(θ) is called an unbiased estimator of θ if E[e(θ)] = θ.

If e_1(θ), e_2(θ), ..., e_k(θ) are unbiased estimators and Σ_{i=1}^k λ_i = 1, then Σ_{i=1}^k λ_i e_i(θ) is also an unbiased estimator.

Since E(Σ̂_ML) = ((m − 1)/m) Σ, Σ̂_ML is a biased estimator.

An unbiased estimator for Σ is therefore (1/(m − 1)) Σ_{i=1}^m (x_i − µ̂_ML)(x_i − µ̂_ML)^T.
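The bias can be checked by simulation. A small sketch (the true covariance, sample size, and trial count are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 2
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # assumed true covariance
trials = 5000

biased = np.zeros((n, n))
unbiased = np.zeros((n, n))
for _ in range(trials):
    X = rng.multivariate_normal(np.zeros(n), Sigma, size=m)
    mu_hat = X.mean(axis=0)
    S = (X - mu_hat).T @ (X - mu_hat)
    biased += S / m            # Sigma_hat_ML
    unbiased += S / (m - 1)    # corrected estimator

print(np.round(biased / trials, 3))    # ~ (m-1)/m * Sigma
print(np.round(unbiased / trials, 3))  # ~ Sigma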

Question: If ε ∼ N(0, σ²) and

y = w^T φ(x) + ε,  where w, φ(x) ∈ R^n,

then y ∼ N(w^T φ(x), σ²):

p(y | x, w) = ( 1/√(2πσ²) ) exp( −(y − φ^T(x)w)²/(2σ²) )

E[y(w, x)] = w^T φ(x) = w_0 + w_1 φ_1(x) + ... + w_n φ_n(x)

φ(x) = [1 φ_1(x) ... φ_n(x)]

Given a random sample D of pairs:

D = ( (y_1, φ(x_1)), (y_2, φ(x_2)), ..., (y_m, φ(x_m)) )

LL(y_1 ... y_m | x_1 ... x_m, w, σ²) = −(m/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^m (w^T φ(x_i) − y_i)²

Given σ²,

ŵ_ML = argmax_w LL(y_1 ... y_m | x_1 ... x_m, w, σ²)
     = argmin_w Σ_{i=1}^m (w^T φ(x_i) − y_i)²

10.2 Dealing with Conjugate Priors for Multivariate Gaussian

The conjugate prior for the mean of a Gaussian, with σ² known, is itself Gaussian:

P(µ) = N(µ_0, σ_0²),  P(x) ∼ N(µ, σ²)

P(µ | x_1 ... x_m) = N(µ_m, σ_m²)

µ_m = ( σ²/(mσ_0² + σ²) ) µ_0 + ( mσ_0²/(mσ_0² + σ²) ) µ̂_ML

1/σ_m² = 1/σ_0² + m/σ²

If, for linear regression, y(x, w) ∼ N(φ^T(x)w, σ²) and

P(w) ∼ N(µ_0, Σ_0),

then

P(w | x_1 ... x_m) = N(µ_m, Σ_m)

µ_m = Σ_m ( Σ_0⁻¹ µ_0 + (1/σ²) φ^T y )

Σ_m⁻¹ = Σ_0⁻¹ + (1/σ²) φ^T φ

11 Lecture 11
11.1 Recall
For Bayesian estimation in the univariate case with fixed σ, where µ ∼ N(µ_0, σ_0²) and x ∼ N(µ, σ²):

1/σ_m² = m/σ² + 1/σ_0²

µ_m/σ_m² = (m/σ²) µ̂_mle + (1/σ_0²) µ_0

such that p(µ | D) ∼ N(µ_m, σ_m²). The m/σ² term is due to noise in the observations, while 1/σ_0² is due to uncertainty in µ. For the Bayesian setting in the multivariate case with fixed Σ:

x ∼ N(µ, Σ),  µ ∼ N(µ_0, Σ_0)  &  p(µ | D) ∼ N(µ_m, Σ_m)

Σ_m⁻¹ = mΣ⁻¹ + Σ_0⁻¹

Σ_m⁻¹ µ_m = mΣ⁻¹ µ̂_mle + Σ_0⁻¹ µ_0

11.2 Bayes Linear Regression


The Bayesian interpretation of probability can be seen as an extension of logic that enables reasoning
with uncertain statements. Bayesian linear regression is a Bayesian alternative to ordinary least
squares regression.

y = w^T φ(x) + ε
ε ∼ N(0, σ²)
w ∼ N(0, Σ_0)
ŵ_MLE = (φ^T φ)⁻¹ φ^T y

Finding µ_m and Σ_m:

Σ_m⁻¹ µ_m = Σ_0⁻¹ µ_0 + φ^T y / σ²

Σ_m⁻¹ = Σ_0⁻¹ + (1/σ²) φ^T φ

Setting Σ_0 = αI and µ_0 = 0:

Σ_m⁻¹ µ_m = φ^T y / σ²

Σ_m⁻¹ = (1/α) I + φ^T φ / σ²

µ_m = ( (1/α) I + φ^T φ/σ² )⁻¹ φ^T y / σ² = ( (σ²/α) I + φ^T φ )⁻¹ φ^T y

But since σ²/α is nothing but the λ in ridge regression, this can be written as

µ_m = (λI + φ^T φ)⁻¹ φ^T y

which are similar to the results obtained in ridge regression.

What is the Bayes estimator here?

ŵ_Bayes = E_{p(w|X_1...X_m)}[w] = ( φ^T φ + (σ²/α) I )⁻¹ φ^T y

which is the least squares solution for ridge regression.

ŵ_MAP = argmax_w p(w | X_1 ... X_n)

which is the point w at which the posterior distribution peaks (the mode).
For a Gaussian distribution, the mode is the same as the mean, so

ŵ_MAP = ŵ_Bayes.    (64)
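The ridge equivalence above can be verified numerically. A minimal sketch (the design matrix, noise level, and prior scale are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 4
Phi = rng.normal(size=(m, n))               # design matrix, rows phi(x_i)^T
w_true = np.array([1.0, -2.0, 0.5, 0.0])
sigma2, alpha = 0.25, 10.0                  # noise variance and prior variance
y = Phi @ w_true + rng.normal(scale=np.sqrt(sigma2), size=m)

lam = sigma2 / alpha                        # ridge parameter lambda
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)

Sigma_m_inv = np.eye(n) / alpha + Phi.T @ Phi / sigma2
mu_m = np.linalg.solve(Sigma_m_inv, Phi.T @ y / sigma2)   # posterior mean

print(np.allclose(w_ridge, mu_m))   # True: Bayes/MAP mean == ridge solution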

Find p(y | X_1 ... X_m) for linear regression.

In the context of the multivariate Gaussian:

p(x | D) = p(x | X_1 ... X_m) = ∫_µ p(x | µ) p(µ | D) dµ ∼ N(µ_m, Σ + Σ_m)

The estimators, and the predictions p(x | D) they give:

MLE:             θ̂_MLE = argmax_θ LL(D | θ)   →  p(x | θ̂_MLE)
Bayes estimator: θ̂_B = E_{p(θ|D)}[θ]          →  p(x | θ̂_B)
MAP:             θ̂_MAP = argmax_θ p(θ | D)    →  p(x | θ̂_MAP)
Pure Bayesian:   p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ

with

p(D | θ) = Π_{i=1}^m p(x_i | θ)

p(x | D) = ∫ p(x | θ) p(θ | D) dθ

where θ is the parameter vector.



11.3 Pure Bayesian - Regression


p(y | X_1 ... X_m) = ∫_w p(y | w) p(w | D) dw

where

y = w^T φ(X) + ε
ε ∼ N(0, σ²)
w ∼ N(0, αI)

This gives p(y | X_1 ... X_m) ∼ N(µ_m^T φ(x), σ_m²), where

σ_m² = σ² + φ^T Σ_m φ

Σ_m⁻¹ = (1/α) I + φ^T φ / σ²

µ_m = Σ_m φ^T y / σ² = ( φ^T φ + (σ²/α) I )⁻¹ φ^T y

Since x ∼ N(µ, σ²) implies αx + β ∼ N(αµ + β, α²σ²), and since we know that ε ∼ N(0, σ²), using y = w^T x + ε we get

y ∼ N(w^T x, σ²)

11.4 Sufficient Statistic


A statistic s is called sufficient for θ if p(D | s, θ) is independent of θ. It can be proved that a statistic s is sufficient for θ iff p(D | θ) can be written as p(D | θ) = g(s, θ) h(D). For the case of the Gaussian, with s = Σ_i x_i, we have

g(s, µ) = exp( −(m/2) µ^T Σ⁻¹ µ + µ^T Σ⁻¹ Σ_{i=1}^m x_i )

h(x_1, x_2, ..., x_m) = ( 1/( (2π)^{nm/2} |Σ|^{m/2} ) ) exp( −(1/2) Σ_{i=1}^m x_i^T Σ⁻¹ x_i )

Thus, we see that for the normal distribution, p(D | µ) = g(s, µ) h(D).

11.5 Lasso

We have y = w^T φ(x) + ε where ε ∼ N(0, σ²), and w follows a Laplace prior. It then turns out that ŵ_MAP with the Laplace prior is

ŵ_MAP = argmin_w Σ_{i=1}^m ||w^T φ(x_i) − y_i||² + λ||w||_{l1}

Here λ||w||_{l1} is the penalty function; this formulation is called the lasso. Recall that ŵ_MAP with a Gaussian prior is

ŵ_MAP = argmin_w Σ_{i=1}^m ||w^T φ(x_i) − y_i||² + λ||w||²_{l2}

Here λ||w||²_{l2} is the penalty function of ridge regression.

Refer to Section 7 from Sir's notes for more details on this topic.
Lasso generally yields sparser solutions. What this means is that if you have φ_1(x), φ_2(x), ..., φ_k(x), each with weights w_1, w_2, ..., w_k, we may want to reduce the number of non-zero weights. In general, with ridge regression the parameters which are irrelevant may retain some small weights, whereas lasso tends to set them exactly to zero.
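This sparsity effect is easy to see empirically. A short sketch with scikit-learn (the data, regularization strengths, and number of relevant features are hypothetical choices):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
m, k = 100, 10
Phi = rng.normal(size=(m, k))
w_true = np.zeros(k)
w_true[:3] = [3.0, -2.0, 1.5]              # only 3 relevant features
y = Phi @ w_true + 0.1 * rng.normal(size=m)

ridge = Ridge(alpha=1.0).fit(Phi, y)
lasso = Lasso(alpha=0.1).fit(Phi, y)

print("ridge:", np.round(ridge.coef_, 3))  # irrelevant weights small but nonzero
print("lasso:", np.round(lasso.coef_, 3))  # irrelevant weights exactly zero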

12 Lecture 12 : Bias-Variance tradeoff


Key terms: Bias, Variance, Expected loss, Noise

When we design a particular machine learning model (function), we wish that it should predict the output accurately, and that it should do so independently of the sample training data it has been trained on. In the following section we show that there is a trade-off between these two objectives. Let us first understand the terms bias and variance. The variance of a machine learning model is the variance in its prediction of a value when trained over different training data. The bias is related to how much the prediction differs from the actual observed values in the training data.

12.1 Expected Loss


Suppose we are interested in finding the expected loss of the learned function. We are given training data T_D, a distribution P(x, y) over inputs x and target variable y, and a learned function f(x) (here by f(x) we actually mean f(x, T_D), i.e., a function trained with respect to the training data T_D) which approximates y. We want to find the expected loss of the model with respect to all the data in the distribution. Usually squared error is chosen as the measure of loss. First we derive the loss for a fixed trained function, and then average over training data.

L(y, x, f(x)) = (f(x) − y)²

We want to find a function f(·) whose expected loss over instances drawn from P(x, y) is minimum:

argmin_f E_{f,P(x,y)}[L(y, x, f(x))]

where E_{f,P(x,y)}[L(y, x, f(x))] denotes the expectation of the loss for a fixed f(·) and P(·,·):

E_{f,P(x,y)}[L] = ∫_y ∫_x L(y, x, f) P(x, y) dx dy
= ∫_y ∫_x (f(x) − y)² P(x, y) dx dy
= ∫_y ∫_x (f(x) − E[y|x] + E[y|x] − y)² P(x, y) dx dy
= ∫_y ∫_x (f(x) − E[y|x])² P(x, y) dx dy + ∫_y ∫_x (E[y|x] − y)² P(x, y) dx dy
  + ∫_y ∫_x 2 (f(x) − E[y|x]) (E[y|x] − y) P(x, y) dx dy

Let us rewrite the third term:

∫_y ∫_x 2 (f(x) − E[y|x]) (E[y|x] − y) P(x, y) dx dy = ∫_x 2 (f(x) − E[y|x]) [ ∫_y (E[y|x] − y) P(y|x) dy ] P(x) dx    (65)

Since ∫_y y P(y|x) dy = E[y|x], the inner integral in (65) is 0. So

E_{f,P(x,y)}[L] = ∫_y ∫_x (f(x) − E[y|x])² P(x, y) dx dy + ∫_y ∫_x (E[y|x] − y)² P(x, y) dx dy    (66)

Here the second term is independent of f(·). We need to consult only the first term to find the f(·) which minimizes the expectation. From the first term, it is clear that the minimum loss is obtained when f(x) equals E[y|x]. The minimum expected loss is then

E_{f,P(x,y)}[L] = ∫_y ∫_x (E[y|x] − y)² P(x, y) dx dy    (67)

This is the minimum loss we can expect for a given training data. Now let us find the expected loss over different training data. Let

E_{T_D}[f(x, T_D)] = ∫_{T_D} f(x, T_D) p(T_D) dT_D

Earlier we found that the only tweakable component in the expected loss is (f(x) − E[y|x])². Now we find its expectation over the distribution of training data:

∫_{T_D} (f(x, T_D) − E[y|x])² p(T_D) dT_D = E_{T_D}[ (f(x, T_D) − E[y|x])² ]

= E_{T_D}[ ( f(x, T_D) − E_{T_D}[f(x, T_D)] + E_{T_D}[f(x, T_D)] − E[y|x] )² ]

= E_{T_D}[ ( f(x, T_D) − E_{T_D}[f(x, T_D)] )² ] + ( E_{T_D}[f(x, T_D)] − E[y|x] )²
  + 2 E_{T_D}[ f(x, T_D) − E_{T_D}[f(x, T_D)] ] ( E_{T_D}[f(x, T_D)] − E[y|x] )

Since E_{T_D}[ f(x, T_D) − E_{T_D}[f(x, T_D)] ] = 0, and the other factor is independent of T_D, the third term vanishes. Finally,

∫_{T_D} (f(x, T_D) − E[y|x])² p(T_D) dT_D = E_{T_D}[ ( f(x, T_D) − E_{T_D}[f(x, T_D)] )² ] + ( E_{T_D}[f(x, T_D)] − E[y|x] )²
= Variance + Bias²

where Variance of f(x, T_D) = E_{T_D}[ ( f(x, T_D) − E_{T_D}[f(x, T_D)] )² ] and Bias = E_{T_D}[f(x, T_D)] − E[y|x].

Putting this back in (66), the Expected loss = Variance + Bias² + Noise.
Let us try to understand what this means. Consider the case of regression. The loss of the prediction depends on many factors, such as the complexity of the model (linear, ...), the parameters, the measurements, etc. Noise in the measurements can cause prediction loss; that is given by the third term. Similarly, the complexity of the model contributes to the loss.

If we were to fit a linear regression with a low-degree polynomial, we would be introducing a bias: an assumption that the dependency of the predicted variable is simple. Similarly, when we add a regularizer term, we are implicitly saying that the weights are not big, which is also a kind of bias. The predictions we obtain may not be accurate, but in these cases the predicted values do not have much correlation with the particular sample points we took, so the predictions remain more or less the same over different samples. That is, for different samples the predicted values do not vary much: the prediction is more generalizable across samples.

Suppose we complicate our regression model by increasing the degree of the polynomial used. As we have seen in previous classes, we then obtain a highly wobbly curve which passes through almost all points in the training data. This is an example of low bias. For a given training data our prediction could be very good, but if we were to take another sample we would obtain another curve, one which passes through all the new points yet differs drastically from the current curve. Our predictions are accurate for the chosen training sample, but at the same time they are highly correlated with that sample. Across different training data, the variance of the prediction is very high, so the model does not generalize across samples.

We saw that when we decrease the bias, the variance increases, and vice versa. The more complex the model is, the lower the bias and the higher the variance. The two objectives are contrary to each other. The ideal complexity of the model should be matched to the complexity of the actual relation between the dependent and independent variables.

I recommend the reference [4] for a good example.
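The trade-off can be simulated directly. A sketch that fits polynomials of increasing degree on many hypothetical training sets and measures bias² and variance at one test point (the target function, noise, and sizes are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

def experiment(degree, n_datasets=200, m=15, noise=0.3, x0=0.5):
    # Fit a polynomial of the given degree on many training sets and
    # record the prediction at a fixed test point x0.
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, m)
        y = true_f(x) + noise * rng.normal(size=m)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x0))
    preds = np.array(preds)
    bias2 = (preds.mean() - true_f(x0)) ** 2
    var = preds.var()
    return bias2, var

for d in [1, 3, 9]:
    b2, v = experiment(d)
    print(f"degree {d}: bias^2 = {b2:.4f}, variance = {v:.4f}")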

13 Lecture 13
These are the topics discussed in today’s lecture:

1. Conclude Bias-Variance
2. Shrinkage - Best Subset
3. Mixture models
4. Empirical Bayes

13.1 Conclude Bias-Variance


13.1.1 Summary
∫_{T_D} ∫_y ∫_x (f(x, T_D) − y)² dx dy dT_D decomposes into:

(1) ∫_x ( E_{T_D}[f(x, T_D)] − E[y|x] )² dx   (the bias² term)

(2) ∫_x E_{T_D}[ ( f(x, T_D) − E_{T_D}[f(x, T_D)] )² ] dx   (the variance term)

(3) ∫_y ∫_x ( E[y|x] − y )² dx dy   (the intrinsic noise)

In the above equation, T_D represents the random sample and x, y the data distribution. E[y|x] is the optimal predictor with respect to least squares.

(1) represents the bias, (2) represents the variance, and (3) represents the intrinsic noise. The more flexible the hypothesis function, the less the bias and the more the variance.

Now, does this analysis apply to Bayesian regression?

In the Bayesian setting there is something like a posterior distribution and its expected value, but there is no single point estimate f(x, T_D) to plug into the decomposition.

13.1.2 Bayesian Linear Regression (BLR)

P(y | x, y_T, α, σ²) = N(µ_m^T φ(x), σ_m²)

where

µ_m = (1/σ²) Σ_m φ^T y_T and

Σ_m⁻¹ = (1/α) I + (1/σ²) φ^T φ

The basis function matrix has rows φ(x_i)^T:

φ = [ φ_1(x_1) φ_2(x_1) ... ; φ_1(x_2) φ_2(x_2) ... ; ... ]

If two points are far away from the mean but close to each other, then φ^T φ will increase and Σ_m⁻¹ will also increase; therefore the variance will decrease.

This can also be interpreted as: points which are far apart contribute positively by giving less variance, i.e., less uncertainty.

Also, the uncertainty from the prior is represented by (1/α) I, and the uncertainty in the estimator that we are explicitly modelling is represented by (1/σ²) φ^T φ.

Assume the peak points of two Gaussian distributions are at φ(x_1) and φ(x_2). A point lying between them (shown in red in the figure) has equal contributions from both Gaussians.

Now, if φ(x_1)^T φ(x_2) is large, assuming φ is normalized, the standard deviation increases. In a Gaussian distribution, at regions away from the mean, φ(x_i) will be large.

13.1.3 General Problems with a Standard Distribution

1. Single mode: N(µ, Σ)

2. A non-trivial Σ has O(n²) parameters: as the dimension n of the data grows, the number of entries in Σ grows as n².

3. The search space is sparse.

Mixture of Gaussians

p(x) = Σ_{i=1}^n α_i p_i(x | z = i)   (each p_i(x) is a different distribution)

Σ_{i=1}^n α_i = 1

i.e., p is a convex combination of the individual distributions.

What is the form of the distribution?

The component indicator z follows a multinomial distribution with parameters (α_1, ..., α_n).

Ex: Mixture of Gaussians:

p(x) = Σ_{i=1}^n α_i N(x | µ_i, Σ_i)

where each component is X ∼ N(µ_i, Σ_i) with density p(x | µ_i, Σ_i).

Issues:

1. The number of components K
2. Estimating the µ_i's and Σ_i's
3. Estimating the α_i's

Classification Perspective

Assume data of the form (x_i, z_i), e.g.:

(X_1, 1), (X_2, 1), ..., (X_m, 3)

(z is a class label)

Question: What is the MLE?

⇒ It is a classification problem.

If the z value is given:

P(z = t | x) = P(x | z = t) P(z = t) / P(x)

It is a supervised classification problem.

If instead we are given only the data X_1, X_2, ..., X_m, we have to estimate using hidden variables, as z is not explicitly given (the EM algorithm, which can be shown to converge).

Target:
1. Implicit estimation of z via E[z | x_i]
2. Estimate the µ_i's with E[z | x_i] in place of z_i, i.e., with data

(X_1, E[z|x_1]), (X_2, E[z|x_2]), ..., (X_m, E[z|x_m])

Estimating the number of components K is hard. It can also be seen as a modelling problem, and thus the number of components depends on the model we choose.

EM Algorithm (Source: Wikipedia)

The expectation-maximization algorithm can be used to compute the parameters of a parametric mixture model distribution (the a_i's and θ_i's). It is an iterative algorithm with two steps: an "expectation step" and a "maximization step".

The expectation step

With initial guesses for the parameters of our mixture model, the "partial membership" of each data point in each constituent distribution is computed by calculating expectation values for the membership variables of each data point. That is, for each data point x_j and distribution Y_i, the membership value y_{i,j} is:

y_{i,j} = a_i f_Y(x_j; θ_i) / f_X(x_j)

The maximization step

With expectation values in hand for group membership, "plug-in estimates" are recomputed for the distribution parameters. The mixing coefficients a_i are the arithmetic means of the membership values over the N data points:

a_i = (1/N) Σ_{j=1}^N y_{i,j}

The component model parameters θ_i are also calculated by expectation maximization using data points x_j weighted by the membership values. For example, if θ is a mean µ:

µ_i = Σ_j y_{i,j} x_j / Σ_j y_{i,j}

With new estimates for the a_i's and θ_i's, the expectation step is repeated to compute new membership values. The entire procedure is repeated until the model parameters converge.
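A compact numerical sketch of these two steps for a one-dimensional mixture of two Gaussians (the data, initial values, and iteration count are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 100)])
N, K = len(x), 2

a = np.full(K, 1.0 / K)                 # mixing coefficients a_i
mu = np.array([-1.0, 1.0])              # initial means
var = np.array([1.0, 1.0])              # initial variances

for _ in range(50):
    # E-step: membership y_{i,j} = a_i f(x_j; mu_i, var_i) / f(x_j)
    f = (a[:, None] / np.sqrt(2 * np.pi * var[:, None])
         * np.exp(-(x[None, :] - mu[:, None]) ** 2 / (2 * var[:, None])))
    y = f / f.sum(axis=0, keepdims=True)          # shape (K, N)

    # M-step: plug-in estimates weighted by memberships
    Nk = y.sum(axis=1)
    a = Nk / N
    mu = (y * x).sum(axis=1) / Nk
    var = (y * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / Nk

print(np.round(a, 2), np.round(mu, 2), np.round(var, 2))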

13.2 Empirical Bayes

There are two approaches to solving the equation

Pr(y | D) = Pr(y | ⟨y_1, φ(x_1)⟩, ⟨y_2, φ(x_2)⟩, ..., ⟨y_n, φ(x_n)⟩)

= ∫∫∫ Pr(y | w, σ²) Pr(w | ȳ, α, σ²) Pr(α, σ² | ȳ) dw dα dσ²

where ȳ is the data D.

13.2.1 First Approach: Approximate the posterior

The first approach involves approximating the posterior, i.e., the second term Pr(w | ȳ, α, σ²), by w_MAP, the mode of the posterior distribution of w, which is Gaussian. Note that as the number of data points keeps increasing, φ^T φ keeps increasing; hence from the relation

Σ_m⁻¹ = Σ_0⁻¹ + (1/σ²) φ^T φ

it is clear that the posterior variance decreases, and hence the distribution of w peaks.

13.2.2 Second Approach: Empirical Bayes

The second approach is to empirically assume some value of the hyperparameters α and σ², say α̂ and σ̂², at which the posterior peaks. i.e., we have

Pr(y | D) ≈ ∫ Pr(y | w, σ̂²) Pr(w | ȳ, α̂, σ̂²) dw  for the chosen α̂ and σ̂²

= ∫ N(y | φ^T w, σ̂²) N(w | µ_m, Σ_m) dw = N(φ^T µ_m, σ_m²)

Empirical Bayes finds the α̂ and σ̂² such that Π_i Pr(y_i | T_D), i.e., the conditional likelihood, is maximised.

13.2.3 Solve the eigenvalue equation

( (1/σ²) φ^T φ ) u_i = λ_i u_i

Define the parameter γ as γ = Σ_i λ_i/(α + λ_i).

Then the empirical α̂ is α̂ = γ / (µ_m^T µ_m)

and the empirical σ̂² is σ̂² = ( 1/(m − γ) ) Σ_{i=1}^m (y_i − µ_m^T φ(x_i))²

14 Lecture 14 : Introduction to Classification


The goal in classification is to take an input vector x and assign it to one of a set D of discrete classes:

f(x) : R^n → D

There are many techniques for performing the task of classification. The two main types are:

1. Using a discriminant function. For e.g., consider D = {+, −}:

y = + if f(x) ≥ 0;  − if f(x) < 0

Here {x : f(x) = 0} is called the discriminant / decision surface (decision boundary).

Examples: Least squares, Fisher discriminant, Support Vector Machines, Perceptron, Neural Networks

2. Probabilistic classification

(a) Discriminative models: here we model Pr(y ∈ D | x) directly. For e.g., we can say that Pr(D = + | data) comes from a multinomial distribution.
Examples: Logistic Regression, Maximum Entropy models, Conditional Random Fields

(b) Generative models: here we model Pr(y ∈ D | x) by modeling Pr(x | y ∈ D). For example:

Pr(x | y = c_1) ∼ N(µ_1, Σ_1)
Pr(x | y = c_2) ∼ N(µ_2, Σ_2)

We can find Pr(y | x) as

Pr(y | x) = Pr(x | y) Pr(y) / Σ_y Pr(x | y) Pr(y)

Here, Pr(y = c_1), Pr(y = c_2) are called priors or mixture components.
Examples: Naive Bayes, Bayes Nets, Hidden Markov Models

15 Lecture 15: Linear Models for Classification


The goal in classification is to take an input vector x and assign it to one of K
discrete classes Ck where k = 1, ..., K. In most cases, the classes are taken to be
disjoint, so that each input is assigned to one and only one class. The input space is
thus divided into decision regions whose boundaries are called decision boundaries
or decision surfaces. We consider only linear models for classification in this lecture,
which means that the decision surfaces are linear functions of the input vector x
and hence are defined by (D − 1)-dimensional hyperplanes within the D-dimensional
input space.
The simplest method of classification (for 2 classes) is to design a function f such that

f(x_i) = v_{c+} if x_i ∈ C₊;  v_{c−} if x_i ∈ C₋

15.1 Generalized linear models


In these models we adopt linear regression to model the classification problem. This
is done by modeling a function f as follows:
f (x) = g(wT φ(x))
where g is known as the activation function and φ is the vector of basis functions. Classification is achieved by:

g(θ) = v_{c+} if θ > 0;  v_{c−} if θ < 0
The decision surface in this case is given by wT φ(x) = 0

15.2 Three broad types of classifiers


1. The first method involves explicit construction of w for wT φ(x) = 0 as the
decision surface.
2. The second method is to model P (x|C+ ) and P (x|C− ) together with the prior
probabilities P (Ck ) for the classes, from which we can compute the posterior
probabilities using Bayes’ theorem
P (x|Ck )P (Ck )
P (Ck |x) =
P (x)
These types of models are called generative models.

3. The third method is to model P (C+ |x) and P (C− |x) directly. These types of
models are called discriminative models. In this case P (C+ |x) = P (C− |x) gives
the required decision boundary.

15.2.1 Examples

An example of generative model is as follows:


P (x|C+ ) = N (µ+ , Σ)
P (x|C− ) = N (µ− , Σ)
With the prior probabilities P(C₊) and P(C₋) known, we can derive P(C₊|x) and P(C₋|x). In this case it can be shown that the decision boundary P(C₊|x) = P(C₋|x) is a hyperplane.

An example of a discriminative model is

P(C₊|x) = e^{w^T φ(x)} / ( 1 + e^{w^T φ(x)} )

P(C₋|x) = 1 / ( 1 + e^{w^T φ(x)} )

Examples of the first model (which directly constructs the classifier) include:

- Linear Regression
- Perceptron
- Fisher's Discriminant

15.3 Handling Multiclasses


We now consider the extension of linear discriminants to K > 2 classes. One solution is to build a K-class discriminant by combining a number of two-class discriminant functions.

- one-versus-the-rest: In this approach, K − 1 classifiers are constructed, each of which separates the points in a particular class C_k from points not in that class.
- one-versus-one: In this method, K(K − 1)/2 binary discriminant functions are introduced, one for every possible pair of classes.

Attempting to construct a K-class discriminant from a set of two-class discriminants leads to ambiguous regions. The problems with these two approaches are illustrated in Figures 17 and 18, where the ambiguous regions are marked with '?'.

Figure 17: Illustrates the ambiguity in one-versus-rest case

Figure 18: Illustrates the ambiguity in one-versus-one case



15.3.1 Avoiding ambiguities

We can avoid the above-mentioned difficulties by considering a single K-class discriminant comprising K functions g_{C_k}(x). Then x is assigned to the class C_k that has the maximum value of g_{C_k}(x).

If g_{C_k}(x) = w_{C_k}^T φ(x), the decision boundary between class C_j and class C_k is given by g_{C_k}(x) = g_{C_j}(x) and hence corresponds to

(w_{C_k} − w_{C_j})^T φ(x) = 0

15.4 Least Squares approach for classification


We now apply the Least squares method to the classification problem. Consider a
classification problem with K classes. Then the target values are represented by a
K component target vector t. Each class is described by its own model
yk (x) = wkT φ(x)
where k ∈ {1, ...K}. We can conveniently group these together using vector notation
so that
y(x) = WT φ(x)
where W is a matrix whose k th column comprises the unknown parameters wk and
φ(x) is the vector of basis function values evaluated at the input vector x. The
procedure for classification is then to assign a new input vector x to the class for
which the output yk = wkT φ(x) is largest.
We now determine the parameter matrix W by minimizing a sum-of-squares error function. Consider a training data set {x_n, t_n}, n ∈ {1, ..., N}, where x_n is an input and t_n the corresponding target vector. We define a matrix Φ whose n-th row is given by φ(x_n)^T:

Φ = [ φ_0(x_1) φ_1(x_1) ... φ_{K−1}(x_1) ;
      φ_0(x_2) φ_1(x_2) ... φ_{K−1}(x_2) ;
      ... ;
      φ_0(x_N) φ_1(x_N) ... φ_{K−1}(x_N) ]

We further define a matrix T whose n-th row is given by the vector t_n^T. Now, the sum-of-squares error function can be written as

err(W) = (1/2) Tr{ (ΦW − T)^T (ΦW − T) }

We can now minimize the error by setting the derivative with respect to W to zero. The solution we obtain for W is then of the form

W = (Φ^T Φ)⁻¹ Φ^T T
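A minimal numerical sketch of this procedure (the synthetic data and the basis φ(x) = [1, x_1, x_2] are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
K, per_class = 3, 40
means = np.array([[0, 0], [4, 0], [2, 3]])
X = np.vstack([rng.normal(mu, 0.7, size=(per_class, 2)) for mu in means])
labels = np.repeat(np.arange(K), per_class)

Phi = np.hstack([np.ones((len(X), 1)), X])    # phi(x) = [1, x1, x2]
T = np.eye(K)[labels]                         # 1-of-K target vectors t_n

W = np.linalg.solve(Phi.T @ Phi, Phi.T @ T)   # W = (Phi^T Phi)^{-1} Phi^T T

pred = np.argmax(Phi @ W, axis=1)             # assign to class with largest y_k
print("training accuracy:", (pred == labels).mean())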

Figure 19: Data from two classes classified by least squares (magenta) and logistic (green)

Figure 20: Response of the classifiers to addition of outlier points

15.4.1 Limitations of Least Squares

Even though the least-squares approach gives a closed-form solution for the discriminant function parameters, it suffers from problems such as lack of robustness to outliers. This is illustrated in Figures 19 and 20, where we see that the introduction of additional data points in Figure 20 produces a significant change in the location of the decision boundary, even though these points would be correctly classified by the original boundary in Figure 19. For comparison, the least squares approach is contrasted with logistic regression, which remains unaffected by the additional points.

16 Lecture 16
16.1 Introduction
We will discuss the problems of the linear regression model for classification and look at some possible solutions. Our main focus is on the two-class classification problem.

16.2 Problems of linear regression


The following are the problems with the linear regression model for classification:
1. Sensitivity to outliers
2. Masking

16.2.1 Sensitivity to outliers

Outliers : They are points which have noise and adversely affect the classification.

Figure 21: Outliers

In the right-hand figure, the separating hyperplane has changed because of the outliers.

16.2.2 Masking

It is seen empirically that a linear regression classifier may mask a given class. This is shown in the left-hand figure: we had 3 classes, one in between the other two, and the points of the in-between class are never assigned.

Figure 22: Masking

The right-hand figure shows the desired classification.

The equation of the classifier between class C1(red dots) and class C2(green dots)
is
(ω1 − ω2 )T φ(x) = 0
and the equation of the classifier between the classes C2(green dots) and C3(blue
dots) is
(ω2 − ω3 )T φ(x) = 0

16.3 Possible solutions


1. Mapping to a new space

We transform the original dimensions to new dimensions, where the new dimensions are functions of the original ones. This is a work-around solution.

φ′_1(x) = σ_1(φ_1, φ_2)
φ′_2(x) = σ_2(φ_1, φ_2)

Here we try to determine the transformations φ′_1 and φ′_2 such that we get a linear classifier in this new space. When we map back to the original dimensions, the separators may not remain linear.

Figure 23: Mapping back to the original dimensions; the class separator is not linear

Problem: exponential blow-up in the number of parameters (w's), of order O(n^{k−1}).
2. Decision surface as perpendicular bisector of the line joining the means.

Figure 24: Class separator perpendicular to the line joining the means

The decision surface is the perpendicular bisector of the line joining the mean of class C_1 (m_1) and the mean of class C_2 (m_2).

m_1 = (1/N_1) Σ_{n∈C_1} x_n, where m_1 is the mean of class C_1 and N_1 is the number of points in class C_1.

m_2 = (1/N_2) Σ_{n∈C_2} x_n, where m_2 is the mean of class C_2 and N_2 is the number of points in class C_2.

||φ(x) − m_1|| < ||φ(x) − m_2|| ⇒ x ∈ C_1
||φ(x) − m_2|| < ||φ(x) − m_1|| ⇒ x ∈ C_2

Comment: This solves the masking problem but not the sensitivity problem, as it does not capture the orientation (e.g., the spread of the data points) of the classes.
3. Fisher Discriminant Analysis.

Here we consider the means of the classes, the within-class covariance, and the global covariance.
Aim: to increase the separation between the class means while minimizing the within-class variance. Considering two classes:
S_B is the inter-class covariance and S_W the intra-class covariance.

m_1 = (1/N_1) Σ_{n∈C_1} x_n, where m_1 is the mean of class C_1 and N_1 is the number of points in class C_1.

m_2 = (1/N_2) Σ_{n∈C_2} x_n, where m_2 is the mean of class C_2 and N_2 is the number of points in class C_2.

N_1 + N_2 = N, where N is the total number of training points.

S_B = (m_2 − m_1)(m_2 − m_1)^T

S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T

J(w) = (w^T S_B w) / (w^T S_W w)

By maximizing J(w) we get

w ∝ S_W⁻¹ (m_2 − m_1)
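This direction is a one-line linear solve. A short sketch (the two-class data here are synthetic, with deliberately correlated spread):

import numpy as np

rng = np.random.default_rng(0)
C1 = rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 2.0]], 100)
C2 = rng.multivariate_normal([3, 3], [[2.0, 1.5], [1.5, 2.0]], 100)

m1, m2 = C1.mean(axis=0), C2.mean(axis=0)
Sw = (C1 - m1).T @ (C1 - m1) + (C2 - m2).T @ (C2 - m2)  # within-class scatter

w = np.linalg.solve(Sw, m2 - m1)   # w proportional to S_W^{-1} (m2 - m1)
w /= np.linalg.norm(w)
print("Fisher direction:", np.round(w, 3))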

16.4 Summary

Method                                     | Sensitivity to outliers | Masking
Perpendicular bisector of means connector  | Does not solve          | Solves
Fisher discriminant                        | Does not solve          | Solves

We have seen that Fisher discriminant analysis is better than the other two possible solutions. Fisher discriminant analysis for K classes is discussed in the next lecture.

17 Lecture 17
Not submitted

18 Lecture 18: Perceptron

- Was Fisher's discriminant robust to noise?
- Perceptron training

18.1 Fisher's discriminant

From the figure, it can be seen that the Fisher discriminant method is not robust to noise. The difference in inclination of the blue (without noise) and magenta (with noise) lines shows that the outlying points affect the classifier more than desired. This is because the Fisher discriminant does not take into account the distance of the data points from the hyperplane. The perceptron, unlike Fisher, considers this distance and is thus robust to noise.

18.2 Perceptron training

- Explicitly account for the signed distance of (misclassified) points from the hyperplane

  w^T φ(x) = 0

  The distance from the hyperplane can be calculated as follows: for a point φ(x_0) on the hyperplane,

  [Figure: a point φ(x), its projection φ(x_0) on the hyperplane, and the distance D]

  D = w^T (φ(x) − φ(x_0))

  Since w^T φ(x_0) = 0, we get distance = w^T φ(x) (up to scaling by ||w||).

- The perceptron works for two classes only. We label them as y = 1 and y = −1. A point is then misclassified if y · w^T φ(x) ≤ 0.

Perceptron Algorithm:

- INITIALIZE: w = ones()
- REPEAT:
  - If, for a given ⟨x, y⟩, w^T Φ(x)·y ≤ 0
  - then w = w + Φ(x)·y
  - endif
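A runnable sketch of the update rule above (the data are hypothetical, linearly separable, with a bias feature appended):

import numpy as np

def perceptron(Phi, y, max_epochs=100):
    # Phi: (m, n) matrix of feature vectors phi(x_i); y: labels in {+1, -1}
    w = np.ones(Phi.shape[1])
    for _ in range(max_epochs):
        updated = False
        for phi_x, yi in zip(Phi, y):
            if yi * (w @ phi_x) <= 0:     # misclassified point
                w = w + yi * phi_x        # w <- w + y * phi(x)
                updated = True
        if not updated:                   # all points classified correctly
            return w
    return w

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)), rng.normal([3, 3], 0.5, (50, 2))])
Phi = np.hstack([np.ones((100, 1)), X])
y = np.array([-1] * 50 + [1] * 50)
w = perceptron(Phi, y)
print("mistakes:", int(np.sum(y * (Phi @ w) <= 0)))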

18.2.1 Intuition

y w_{k+1}^T φ(x) = y (w_k + y φ(x))^T φ(x)
               = y w_k^T φ(x) + y² ||φ(x)||²
               > y w_k^T φ(x)

Note: we applied the update for this point since y w_k^T φ(x) ≤ 0. We now have y w_{k+1}^T φ(x) > y w_k^T φ(x), so we have more hope that this point is classified correctly now.

More formally, the perceptron tries to minimize the error function

E = − Σ_{x∈M} y φ^T(x) w

where M is the set of misclassified examples.
The perceptron algorithm is similar (though not exactly equivalent) to a gradient descent algorithm, which can be shown as follows:

Gradient Descent (Batch Perceptron) Algorithm

Since ∇E is given by

∇E = − Σ_{x∈M} y φ(x)

we have

w_{k+1} = w_k − η ∇E
       = w_k + η Σ_{x∈M} y φ(x)   (this takes all misclassified points at a time)

But what we do in the standard perceptron algorithm is basically stochastic gradient descent:

∇E = − Σ_{x∈M} y φ(x) = Σ_{x∈M} ∇E(x),  where E(x) = −y φ^T(x) w

w_{k+1} = w_k − η ∇E(x)
       = w_k + η y φ(x)   (for any x ∈ M)

Earlier it was intuition; now, formally:

If there exists an optimal separating hyperplane with parameters w*, i.e., y φ^T(x) w* ≥ 0 for all training points, then the perceptron algorithm converges.

Proof:

lim_{k→∞} ||w_{k+1} − ρw*||² = 0    (68)

(If this happens for some constant ρ, we are fine.)

||w_{k+1} − ρw*||² = ||w_k − ρw*||² + ||y φ(x)||² + 2y (w_k − ρw*)^T φ(x)    (69)

Now, we want the L.H.S. to be less than ||w_k − ρw*||² at every step, even if only by some small value, so that the perceptron converges over time. So, if we can obtain an expression of the form

||w_{k+1} − ρw*||² ≤ ||w_k − ρw*||² − θ²    (70)

then ||w_{k+1} − ρw*||² reduces by at least θ² at every iteration. So, from expressions (69) and (70), we need to find θ such that

||φ(x)||² + 2y (w_k − ρw*)^T φ(x) ≤ −θ²

(Here ||yφ(x)||² = ||φ(x)||² because |y| = 1: y is either +1 or −1.)

The number of iterations is then O( ||w_0 − ρw*||² / θ² ).

Observations:

1. y w_k^T φ(x) ≤ 0 (since x was misclassified)

2. Γ² = max_{x∈D} ||φ(x)||²

3. δ = max_{x∈D} ( −2y w*^T φ(x) ) = −2 min_{x∈D} y w*^T φ(x)

Here min_{x∈D} y w*^T φ(x) is the margin, the distance of the closest point from the hyperplane, and D is the set of all points, not just the misclassified ones.

Since y w*^T φ(x) ≥ 0, we have δ ≤ 0; what we are interested in is the 'least negative' value of δ.

From the observations and eq. (69), we have

0 ≤ ||w_{k+1} − ρw*||² ≤ ||w_k − ρw*||² + Γ² + ρδ

Taking ρ = 2Γ²/(−δ), we get

0 ≤ ||w_{k+1} − ρw*||² ≤ ||w_k − ρw*||² − Γ²

Hence we got θ² = Γ², which is what we were looking for in eq. (70). Therefore ||w_{k+1} − ρw*||² decreases by at least Γ² at every iteration.

Here is the notion of convergence: w_k converges to ρw* by making at least some fixed decrement at each step. Thus, for k → ∞, ||w_k − ρw*|| → 0, and eq. (68) is proved.

19 Lecture 19
19.1 Introduction
In this lecture, we extend the margin concept towards our goal of classification and introduce Support Vector Machines (SVMs).

19.2 Margin

Given w*, the unsigned minimum distance of x from the hyperplane

w*^T φ(x) = 0    (71)

is given by

min_{x∈D} y w*^T φ(x)    (72)

where y = ±1. Here, y is the corresponding target value for the 2-class classifier; note that multiplication by y makes the distance unsigned. This classification is greedy.

Figure 25: H3 (green) doesn't separate the 2 classes. H1 (blue) does, with a small margin, and H2 (red) with the maximum margin. [5]

Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class, since in general the larger the margin, the lower the generalization error of the classifier. [5]

19.3 Support Vector Machines

The idea in a Support Vector Machine (SVM) is to maximize the minimum unsigned distance y w^T φ(x) of a point x from the hyperplane

w^T φ(x) = 0    (73)

The factor y, which is the target value, ensures that the unsigned distance is non-negative. We pose this as an optimization problem where

φ = [1, φ_1, ..., φ_n]    (74)

w = [w_0, w_1, ..., w_n]    (75)

where 1 and w_0 together give the bias, and [w_1 ... w_n] the slope.

Optimization goal: adjust the slope and bias to maximize the minimum unsigned distance from the hyperplane.

Figure 26: Maximum margin hyperplanes and margins for an SVM trained on 2 classes. Samples on the margin are called the support vectors. [5]

19.4 Support Vectors

The separating hyperplane depends only on the support vectors. The points which are not support vectors are irrelevant.

19.5 Objective Design in SVM


Keeping the Bias(w0 ) and Slope(w) separate,

y(x) = wT φ(x) + w0 (76)

y(x) represents the Separating Hyperplane.

19.5.1 Step 1: Perfect Separability

The condition for the existence of a separating hyperplane (perfect separability) is:

∃ w, w_0 such that ∀ (x_i, y_i) ∈ D:

w^T φ(x_i) + w_0 > 0 if y_i = +1    (77)

w^T φ(x_i) + w_0 < 0 if y_i = −1    (78)

19.5.2 Step 2: Optimal Separating Hyperplane for Perfectly Separable Data

max_{w,w_0} δ    (79)

w^T φ(x_i) + w_0 > δ if y_i = +1    (80)
w^T φ(x_i) + w_0 < −δ if y_i = −1    (81)
δ ≥ 0    (82)

But for two distinct w_1, w_2 defining the same hyperplane, the signed distance can be different for the same point, depending on the value of ||w||. Thus, fixing ||w||, the optimization objective is:

max_{w,w_0} δ    (83)
y_i (w^T φ(x_i) + w_0) > δ    (84)
δ ≥ 0    (85)
||w|| = θ    (86)

Without the restriction on ||w||, δ goes unbounded. We can get rid of the assumption ||w|| = θ by replacing the conditions with the following, redefining w_0′ = θw_0/||w||:

max_{w,w_0′} δ    (88)
y_i ( (θ/||w||) w^T φ(x_i) + w_0′ ) > δ    (89)
δ ≥ 0    (90)

or equivalently,

max_{w,w_0′} δ    (91)
y_i (w^T φ(x_i) + w_0′) > δ ||w||/θ    (92)
δ ≥ 0    (93)

Since for any w and w_0′ satisfying these inequalities any positive multiple satisfies them too, we can arbitrarily set ||w|| = θ/δ. Thus, the equivalent problem is:

max_{w,w_0′} 1/||w||    (94)
y_i (w^T φ(x_i) + w_0′) > 1    (95)

Or equivalently:

min_{w,w_0′} ||w||²    (96)
y_i (w^T φ(x_i) + w_0′) > 1    (97)

The above transformation is fruitful, as the determination of the model parameters reduces to a convex optimization problem, and hence any locally optimal solution is globally optimal. [6]

19.5.3 Step 3: Separating Hyperplane for Overlapping Data

Unlike the previous case, wherein data was either "black" or "white", here we have a region of "gray". Earlier, we implicitly used an error function that gave infinite error if a data point was misclassified and zero error if it was classified correctly. We now modify this approach so that data points are allowed to be on the "wrong side" of the margin boundary, but with a penalty that increases with the distance from that boundary. Thus, the objective must account for the noise; hence we introduce a slack variable ζ_i.

Now the optimization objective is:

min_{w,w_0′} ||w||²/2    (98)
y_i (w^T φ(x_i) + w_0′) > 1 − ζ_i    (99)
ζ_i ≥ 0    (100)

or equivalently:

min_{w,w_0′} ||w||²/2 + c Σ_{i=1}^N ζ_i²    (101)
y_i (w^T φ(x_i) + w_0′) > 1 − ζ_i    (102)
ζ_i ≥ 0    (103)

Thus, the objective is analogous to the minimization of an error subject to a regularizer. Here the error is

E = c Σ_{i=1}^N ζ_i²    (104)

and the regularizer is

R = ||w||²/2    (105)

Both E and R can be proved to be convex. Partially differentiating E w.r.t. ζ_j:

∂E/∂ζ_j = 2cζ_j    (106)
∂²E/∂ζ_j² = 2c > 0    (107)

Thus E is convex w.r.t. ζ_j for all 1 ≤ j ≤ n. Consider R:

∇R = w    (108)
∇²R = I    (109)

I is clearly positive definite, with eigenvalues 1. Thus R is convex. Therefore E + R is convex, and the optimization again reduces to a convex optimization problem.
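For reference, this soft-margin objective is what standard SVM solvers implement. A short sketch using scikit-learn's SVC (the data are synthetic; SVC's parameter C plays the role of c above, with a 1-norm slack penalty rather than the squared one used here):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([3, 3], 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("support vectors per class:", clf.n_support_)
print("w  =", clf.coef_[0])       # slope of the separating hyperplane
print("w0 =", clf.intercept_[0])  # bias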

Figure 27: Margins in SVM

20 Lecture 20: Support Vector Machines (SVM)


This lecture formulates the primal expression of SVM. Then it applies KKT condi-
tions on top of that. Then we proceed to form the dual problem.

20.1 Recap

The expression from the previous day:

y_i (φ^T(x_i)w + w_0′) ≥ ||w||/θ    (110)

So, any multiple of w and w_0′ would not change the inequality.

20.2 Distance between the points

Important Result 1. The distance between the points x_1 and x_2 in Figure 27 is 2/||w||.

Let φ(x_1) − φ(x_2) = rw for some scalar r, so the distance is

||φ(x_1) − φ(x_2)|| = ||rw||    (111)

We have

w^T φ(x_1) + w_0 = −1    (112)

and

w^T φ(x_2) + w_0 = 1    (113)

Subtracting Equation 113 from Equation 112 we get

w^T (φ(x_1) − φ(x_2)) = −2    (114)

⇒ w^T (rw) = −2    (115)

⇒ r = −2/||w||²    (116)

⇒ ||rw|| = 2/||w||    (117)

Hence proved.

20.3 Formulation of the optimization problem

max 1/||w||    (118)
s.t. ∀i    (119)
y_i (φ^T(x_i)w + w_0′) ≥ 1    (120)

This means that if we maximize the separation between the margin planes and at the same time ensure that the respective points do not cross their corresponding margin plane (the constraint), we are done.

It can be proved that if we maximize the p-norm distance of the closest point from the hyperplane, then the problem becomes

max 1/||w||_q
s.t. ∀i: y_i (φ^T(x_i)w + w_0′) ≥ 1

where 1/p + 1/q = 1. In our case p = 2, so q = 2. For p = 1, q = ∞ (the max norm).

Figure 28: Different types of points

20.4 Soft Margin SVM

min_{w,w_0} ||w||² + c Σ_i ξ_i    (121)
s.t. ∀i    (122)
y_i (φ^T(x_i)w + w_0′) ≥ 1 − ξ_i    (123)
where    (124)
∀i: ξ_i ≥ 0    (125)

In the soft margin we account for the errors. The above formulation is one of the many formulations of soft SVMs. In the above formulation, a large value of c means overfitting.

20.4.1 Three types of points

In Figure 28 we can see three types of points:

1. Correctly classified, but ξ_i > 0, i.e., violating the margin
2. Correctly classified, with ξ_i = 0, i.e., on the margin
3. Incorrectly classified, with ξ_i > 1

20.5 Primal and Dual Formulation

20.5.1 Primal formulation

p* = min_{x∈D} f(x)    (126)
s.t. g_i(x) ≤ 0,  i = 1, ..., m    (128)

20.5.2 Dual formulation

d* = max_{λ∈R^m} min_{x∈D} ( f(x) + Σ_{i=1}^m λ_i g_i(x) )    (130)
s.t. λ_i ≥ 0    (131)

Equation 130 is a convex optimization problem. Also, d* ≤ p*, and (p* − d*) is called the duality gap.

If, for some (x*, λ*), where x* is primal feasible and λ* is dual feasible, the KKT conditions are satisfied, and f and all the g_i are convex, then x* is an optimal solution to the primal and λ* to the dual.

Also, the dual optimization problem becomes

d* = max_{λ∈R^m} L*(λ)    (132)
s.t. λ_i ≥ 0 ∀i    (134)
where
L(x, λ) = f(x) + Σ_{i=1}^m λ_i g_i(x)    (136)
L*(λ) = min_{x∈D} L(x, λ)    (137)

It happens that here

p* = d*    (140)

20.6 Duality theory applied to KKT

L(w̄, ξ̄, w_0, ᾱ, λ̄) = (1/2)||w||² + c Σ_{i=1}^m ξ_i + Σ_{i=1}^m α_i ( 1 − ξ_i − y_i (φ^T(x_i)w + w_0) ) − Σ_{i=1}^m λ_i ξ_i    (141)

Now we check the KKT conditions at the point of optimality.

KKT 1.a:
∇_w L = 0    (142)
⇒ w − Σ_{j=1}^m α_j y_j φ(x_j) = 0    (143)

KKT 1.b:
∇_{ξ_i} L = 0    (144)
⇒ c − α_i − λ_i = 0    (145)

KKT 1.c:
∇_{w_0} L = 0    (146)
⇒ Σ_{i=1}^m α_i y_i = 0    (147)

KKT 2 (primal feasibility):
∀i:
y_i (φ^T(x_i)w + w_0) ≥ 1 − ξ_i    (149)
ξ_i ≥ 0    (150)

KKT 3 (dual feasibility):
α_j ≥ 0 and λ_k ≥ 0, ∀ j, k = 1, ..., m    (151)

KKT 4 (complementary slackness):
α_j ( y_j (φ^T(x_j)w + w_0) − 1 + ξ_j ) = 0    (153)
λ_k ξ_k = 0    (154)

(a)

w* = Σ_{j=1}^m α_j y_j φ(x_j)    (155)

w* is a weighted linear combination of the points φ(x_j).

(b)

If 0 < α_j < c then, by Equation 145, 0 < λ_j < c, and by Equation 154, ξ_j = 0 and y_j (φ^T(x_j)w + w_0) = 1.

If, however, α_j = c, then λ_j = 0, ξ_j ≥ 0, and y_j (φ^T(x_j)w + w_0) ≤ 1.

If α_j = 0, then λ_j = c, hence ξ_j = 0, and we get y_j (φ^T(x_j)w + w_0) ≥ 1.

21 Lecture 21:The SVM dual


21.1 SVM dual

SVM can be formulated as the following optimization problem:

min_w { (1/2)||w||² + C Σ_{i=1}^m ξ_i }

subject to the constraint

∀i: y_i (φ^T(x_i)w + w_0) ≥ 1 − ξ_i

The dual of the SVM optimization problem can be stated as

max_α { −(1/2) Σ_{i=1}^m Σ_{j=1}^m y_i y_j α_i α_j φ^T(x_i)φ(x_j) + Σ_{j=1}^m α_j }

subject to the constraints

Σ_i α_i y_i = 0
∀i: 0 ≤ α_i ≤ C

The duality gap = f(x*) − L*(λ*) = 0, as shown in the last lecture. Thus, as is evident from the solution of the dual problem,

w* = Σ_{i=1}^m α_i* y_i φ(x_i)

To obtain w_0*, we can use the fact (shown in the last lecture) that, if α_i ∈ (0, C), then y_i (φ^T(x_i)w + w_0) = 1. Thus, for any point x_i such that α_i ∈ (0, C), that is, a point on the margin,

w_0* = ( 1 − y_i φ^T(x_i)w* ) / y_i = y_i − φ^T(x_i)w*

The decision function:

g(x) = φ^T(x)w* + w_0* = Σ_{i=1}^m α_i y_i φ^T(x)φ(x_i) + w_0*

21.2 Kernel Matrix


A kernel matrix:

K = [ φ^T(x_1)φ(x_1)  φ^T(x_1)φ(x_2)  ...  φ^T(x_1)φ(x_n) ;
      φ^T(x_2)φ(x_1)  φ^T(x_2)φ(x_2)  ...  φ^T(x_2)φ(x_n) ;
      ... ;
      φ^T(x_n)φ(x_1)  φ^T(x_n)φ(x_2)  ...  φ^T(x_n)φ(x_n) ]

In other words, K_ij = φ^T(x_i)φ(x_j). The SVM dual can now be re-written as

max_α { −(1/2) α^T K_y α + α^T ones(m, 1) }

where (K_y)_ij = y_i y_j K_ij, subject to the constraints

Σ_i α_i y_i = 0
0 ≤ α_i ≤ c

Thus, for α_i ∈ (0, C),

w_0* = y_i − φ^T(x_i)w = y_i − Σ_{j=1}^m α_j* y_j K_ij

21.2.1 Generation of the φ space

For a given x = [x_1, x_2, ..., x_n], let φ(x) consist of all degree-d monomials: φ(x) = [x_1^d, x_1^{d−1}x_2, ...].

For n = 2, d = 2, φ(x) = [x_1², x_1x_2, x_2x_1, x_2²]; thus

φ^T(x)·φ(x̄) = Σ_{i=1}^n Σ_{j=1}^n x_i x_j · x̄_i x̄_j
            = ( Σ_i x_i x̄_i ) ( Σ_j x_j x̄_j )
            = ( Σ_{i=1}^n x_i x̄_i )²
            = (x^T x̄)²

In general, for n ≥ 1 and d ≥ 1, φ^T(x)·φ(x̄) = (x^T x̄)^d.
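This identity is easy to verify numerically. A sketch that builds the explicit degree-d monomial feature map and compares it to the kernel evaluation (the vectors and d are arbitrary):

import numpy as np
from itertools import product

def phi(x, d):
    # Explicit feature map: all ordered degree-d monomials x_{i1} * ... * x_{id}
    return np.array([np.prod(x[list(idx)])
                     for idx in product(range(len(x)), repeat=d)])

rng = np.random.default_rng(0)
x, x_bar, d = rng.normal(size=3), rng.normal(size=3), 3

lhs = phi(x, d) @ phi(x_bar, d)   # explicit inner product in feature space
rhs = (x @ x_bar) ** d            # kernel evaluation
print(np.isclose(lhs, rhs))       # True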



A polynomial kernel, in general, is defined as Kij = (xTi xj )d .

21.3 Requirements of a Kernel

1. Since

K_ij = φ^T(x_i)φ(x_j) = φ^T(x_j)φ(x_i),

K should be a symmetric matrix.

2. The Cauchy-Schwarz inequality:

(φ^T(x)φ(x̄))² ≤ ||φ(x)||² ||φ(x̄)||²

⇒ K_ij² ≤ K_ii K_jj

3. Positivity of the diagonal:

K = V Λ V^T

where V is the eigenvector matrix (an orthogonal matrix) and Λ is the diagonal matrix of eigenvalues.

The goal is to construct a φ, which can be done as

φ(x_i) = √λ_i V_i  (this requires λ_i ≥ 0)

K_ii = λ_i ||V_i||²

Hence K must be:

1. Symmetric.
2. Positive semi-definite.
3. Having non-negative diagonal entries.

21.3.1 Examples of Kernels

1. K_ij = (x_i^T x_j)^d

2. K_ij = (x_i^T x_j + 1)^d

3. Gaussian or Radial Basis Function (RBF):

K_ij = e^{−||x_i − x_j||²/(2σ²)}  (σ ∈ R, σ ≠ 0)

4. The hyperbolic tangent function:

K_ij = tanh(σ x_i^T x_j + c)

21.4 Properties of Kernel Functions

If K′ and K″ are kernels, then K is also a kernel if any of the following holds:

1. K_ij = K′_ij + K″_ij
2. K_ij = αK′_ij (α ≥ 0)
3. K_ij = K′_ij K″_ij

Proof: (1) and (2) are left as an exercise.

(3)

K_ij = K′_ij K″_ij = ( φ′^T(x_i)φ′(x_j) ) ( φ″^T(x_i)φ″(x_j) )

Define φ(x_i) = φ′(x_i) ⊗ φ″(x_i), the tensor product of the two feature vectors. Then φ^T(x_i)φ(x_j) = (φ′^T(x_i)φ′(x_j))(φ″^T(x_i)φ″(x_j)) = K_ij.
Hence, K is a valid kernel.

22 Lecture 22: SVR and Optimization Techniques

Topics covered in this lecture:

- Variants of SVM (other occurrences of kernels)
- Support Vector Regression
- L1 SVM
- Projection method (kernel Adatron)

22.1 Other occurrences of kernels

- For regression, we have

  w = (φ^T φ)⁻¹ φ^T y

  f(x) = w^T φ(x) = φ^T(x) (φ^T φ)⁻¹ φ^T y

- For the perceptron:

  w = Σ_{i=1}^M α_i φ(x_i) y_i

  where α_i = the number of times an update on w was made for ⟨x_i, y_i⟩.

  g(x_j) = φ^T(x_j)·w = Σ_i α_i φ^T(x_j)φ(x_i) y_i = Σ_i α_i k_ij y_i

22.1.1 Some variants of SVMs

We have considered some variants of SVM:

min_{w,ξ} (1/2)||w||² + C Σ_i ξ_i
s.t. ∀i: y_i (φ^T(x_i)w + w_0) ≥ 1 − ξ_i and ξ_i ≥ 0

This is the 1-norm SVM, i.e., with the 1-norm of the slack (ξ). We can also formulate it without ξ ≥ 0 as

min_{w,ξ} (1/2)||w||² + C Σ_i ξ_i²

If C = 0, w = 0 is a trivial solution.

The dual of this problem:

max − (1/2) Σ_i Σ_j α_i α_j ( k_ij + (1/C) δ_ij ) y_i y_j + Σ_i α_i

s.t. Σ_i α_i y_i = 0

where δ_ij = 1 if i = j and 0 otherwise; also α_i ≥ 0.

The constraint Σ_i α_i y_i = 0 keeps appearing again and again. It appears because w_0 is kept explicit; it would disappear if w_0 were absorbed into w.

In linear regression (error only):

w = (φ^T φ)⁻¹ φ^T y

In ridge regression (error + λ||w||²):

w = (φ^T φ + λI)⁻¹ φ^T y

22.2 Support Vector Regression


The Support Vector method can also be applied to the case of regression (apart from the classification problem), maintaining all the main features that characterise the maximal margin algorithm and preserving the property of sparseness. A non-linear function is learned by a linear learning machine in a kernel-induced feature space, while the capacity of the system is controlled by a parameter that does not depend on the dimensionality of the space. The figure below shows the situation for a non-linear regression function.

[Figure: the ε-insensitive band (slackness) for a non-linear regression function.]

As long as points lie inside the ε band, they do not contribute to the error. We can define the ε-insensitive loss function L_ε(x, y, f) as follows.

Linear:

L_ε(x, y, f) = |y − f(x)|_ε = max(0, |y − f(x)| − ε)

[Figure: the linear ε-insensitive loss for zero and non-zero ε.]

Quadratic:

L²_ε(x, y, f) = |y − f(x)|²_ε

[Figure: the quadratic ε-insensitive loss for zero and non-zero ε.]

In adapted ridge regression, when slackness is introduced we have

min (1/2)||w||² + C Σ ξ_i²
s.t. (φ^T(x_i)w + w_0) − y_i = ξ_i

We can optimise the generalisation of our regressor by minimising the sum of the quadratic ε-insensitive losses. For 2-norm SVR:

min (1/2)||w||² + C Σ (ξ_i² + ξ_i′²)
s.t. ∀i: (φ^T(x_i)w + w_0) − y_i ≤ ε + ξ_i
and y_i − (φ^T(x_i)w + w_0) ≤ ε + ξ_i′
also ξ_i ξ_i′ = 0

and for the 1-norm case we have the penalty C Σ (ξ_i + ξ_i′), with ξ_i ≥ 0, ξ_i′ ≥ 0.

For the 2-norm case, the dual problem can be derived using the standard method, taking into account that ξ_i ξ_i′ = 0 and therefore that the same relation α_i α_i′ = 0 holds for the corresponding Lagrange multipliers:

maximise Σ_{i=1}^M y_i (α_i′ − α_i) − ε Σ_{i=1}^M (α_i′ + α_i) − (1/2) Σ_{i=1}^M Σ_{j=1}^M (α_i′ − α_i)(α_j′ − α_j)( K_ij + (1/C) δ_ij )

subject to: Σ_{i=1}^M (α_i′ − α_i) = 0
α_i′ ≥ 0, α_i ≥ 0, α_i′ α_i = 0

The corresponding Karush-Kuhn-Tucker complementarity conditions are [7]:

α_i ( ⟨w·φ(x_i)⟩ + b − y_i − ε − ξ_i ) = 0
α_i′ ( y_i − ⟨w·φ(x_i)⟩ − b − ε − ξ_i′ ) = 0
ξ_i ξ_i′ = 0,  α_i α_i′ = 0

By substituting β = α′ − α and using the relation α_i α_i′ = 0, it is possible to rewrite the dual problem in a way that more closely resembles the classification case:

maximise Σ_{i=1}^M y_i β_i − ε Σ_{i=1}^M |β_i| − (1/2) Σ_{i,j=1}^M β_i β_j ( K_ij + (1/C) δ_ij )

subject to Σ_{i=1}^M β_i = 0
Notes:

- Ridge regression has 1 parameter: λ.
  2-norm SVR has 2 parameters: C (∼ 1/λ) and ε.

- 2-norm SVR with ε = 0 ≡ ridge regression.

- If ε = 0 and C → ∞, 2-norm SVR → linear regression.

- 2-norm SVR and 2-norm SVM both have (1/C)δ_ij added to the kernel matrix in the dual.
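For reference, a short sketch using scikit-learn's SVR, which implements the standard 1-norm ε-insensitive variant (the data and hyperparameter values are hypothetical):

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 4, 60))[:, None]
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)

# epsilon is the width of the insensitive band; C ~ 1/lambda
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

# Only points outside (or on) the epsilon band become support vectors.
print("support vectors:", len(reg.support_), "of", len(X), "points")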

22.3 L1 SVM
Let the training data be x_i (i = 1, ..., M), with label y_i = 1 if x_i belongs to Class 1 and y_i = −1 if Class 2. In SVMs, to enhance linear separability, the input space is mapped into a high-dimensional feature space using a mapping function g(x). To obtain the optimal separating hyperplane of the L1-SVM in the feature space, we consider the following optimization problem:

minimize (1/2)||W||² + C Σ_{i=1}^M ξ_i
subject to y_i (W^T g(x_i) + b) ≥ 1 − ξ_i for i = 1, ..., M,

where W is a weight vector, C is the margin parameter that determines the trade-off between the maximization of the margin and the minimization of the classification error, the ξ_i (i = 1, ..., M) are the non-negative slack variables, and b is a bias term. Introducing the Lagrange multipliers α_i, we obtain the following dual problem:

maximize Q(α) = Σ_{i=1}^M α_i − (1/2) Σ_{i,j=1}^M α_i α_j y_i y_j g(x_i)^T g(x_j)

subject to Σ_{i=1}^M y_i α_i = 0, 0 ≤ α_i ≤ C

We use a mapping function that satisfies

H(x, x′) = g(x)^T g(x′)

where H(x, x′) is a kernel function. With this choice, we need not treat the variables in the feature space explicitly. Solving the above dual problem, we obtain the decision function:

D(x) = Σ_{i=1}^M α_i* y_i H(x_i, x) + b*

The L1 SVM is like the lasso: it gives sparse solutions.

22.4 Kernel Adatron


The ”Perceptron with optimal stability” has been the object of extensive theoretical
and exper- imental work, and a number of simple iterative procedures have been
proposed, aimed at finding hyperplanes which have ”optimal stability” or maximal
margin. One of them, the Adatron, comes with theoretical guarantees of convergence
to the optimal solution, and of a rate of convergence exponentially fast in the num-
ber of iterations, provided that a solution exists. Such models can be adapted,
with the introduction of kernels, to operate in a high- dimensional feature space,
and hence to learning non- linear decision boundaries. This provides a procedure
which emulates SV machines but doesn’t need to use the quadratic programming
toolboxes.The Adatron is a an on-line algorihm for learning perceptrons which has an
attractive xed point cor- responding to the maximal-margin consistent hyper- plane,
when this exists.By writing the Adatron in the data-dependent repre- sentation, and
by substituting the dot products with kernels, we obtain the following algorithm:
Kernel Adatron Algorithm
1. Initialise α_i = 1.
2. Starting from pattern i = 1, for the labeled points (x_i, y_i) calculate z_i = \sum_{j=1}^{p} \alpha_j y_j K(x_i, x_j).
3. For all patterns i calculate γ_i = y_i z_i and execute steps 4 and 5 below.
4. Let δα_i = η(1 − γ_i) be the proposed change to the multiplier α_i.
5.1. If (α_i + δα_i) ≤ 0, the proposed change would make α_i negative; to avoid this, set α_i = 0.
5.2. Otherwise, update the multiplier by adding δα_i, i.e., α_i ← α_i + δα_i.
6. Calculate the bias b from b = \frac{1}{2}(\min(z_i^+) + \max(z_i^-)), where the z_i^+ come from the patterns i with class label +1 and the z_i^- from those with class label −1.
7. If a maximum number of presentations of the pattern set has been exceeded, stop; otherwise return to step 2.
Every stable point of the Adatron algorithm is a maximal-margin point and vice versa. The algorithm converges in a finite number of steps to a stable point if a solution exists. The optimization problem addressed by the Adatron (the dual of the margin-maximization problem) is given below:

minimize   \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K_{ij} - \sum_i \alpha_i,

subject to   \sum_i \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C
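A compact vectorised sketch of the algorithm (ours, not from the notes): batch updates and a fixed iteration budget replace the per-pattern sweep of steps 2-7, and the box constraint 0 ≤ α_i ≤ C from the optimization problem above is enforced with a clip.

    import numpy as np

    def kernel_adatron(K, y, eta=0.1, C=np.inf, max_iter=1000):
        """K: (M, M) kernel matrix, y: (M,) labels in {-1, +1}."""
        M = len(y)
        alpha = np.ones(M)                        # step 1: alpha_i = 1
        for _ in range(max_iter):                 # step 7: bounded number of passes
            z = K @ (alpha * y)                   # step 2: z_i = sum_j alpha_j y_j K(x_i, x_j)
            gamma = y * z                         # step 3: margins gamma_i = y_i z_i
            alpha = np.clip(alpha + eta * (1.0 - gamma), 0.0, C)  # steps 4-5
        z = K @ (alpha * y)
        b = 0.5 * (z[y == 1].min() + z[y == -1].max())            # step 6: bias
        return alpha, b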

23 Lecture 23
Key terms: SMO, probabilistic models, Parzen window

In previous classes we wrote the convex formulation for maximum margin classification, the Lagrangian of the formulation, etc. The dual of the program was then obtained by first minimizing the Lagrangian with respect to the weights w; the dual is a maximization with respect to α, which is the same as minimizing the negative of the objective function under the same constraints. The dual problem, given in equation (156), is an optimization over α, and its solution corresponds to the optimum of the original objective when the KKT conditions are satisfied. We are interested in solving the dual because, as we have already seen, most of the dual variables will be zero in the solution, and hence it gives a sparse solution (based on the KKT conditions).
Dual:   \min_{\alpha} \; -\sum_i \alpha_i + \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K_{ij}   (156)

s.t.   \sum_i \alpha_i y_i = 0, \qquad \alpha_i \in [0, C]
The above program is a quadratic program. Any quadratic solver can be used for solving (156), but a generic solver will not exploit the special structure of the solution and may not be efficient. One way to solve (156) is by using projection methods (also called the kernel Adatron). Two further ways of solving it are chunking methods and decomposition methods.
The chunking method is as follows:
1. Initialize the α_i's arbitrarily.
2. Choose the points (i.e., the components α_i) that violate the KKT conditions.
3. Consider only the working set WS of those variables and solve the dual restricted to them:

\min_{\alpha_i, \, i \in WS} \; -\sum_{i \in WS} \alpha_i + \frac{1}{2} \sum_{i \in WS} \sum_{j \in WS} \alpha_i \alpha_j y_i y_j K_{ij}   (157)

s.t.   \sum_{i \in WS} \alpha_i y_i = -\sum_{j \notin WS} \alpha_j y_j, \qquad \alpha_i \in [0, C]

4. Set \alpha^{new} = [\alpha^{new}_{WS}, \alpha^{old}_{nonWS}].

Decomposition methods follow almost the same procedure, except that in step 2 we always take a fixed number of points that violate the KKT conditions the most. A sketch of the KKT test used in step 2 is given below.
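This is a sketch (ours) of the violation test on dual (156): at optimality, α_i = 0 requires y_i f(x_i) ≥ 1, 0 < α_i < C requires y_i f(x_i) = 1, and α_i = C requires y_i f(x_i) ≤ 1. The tolerance tol and all names are assumptions.

    import numpy as np

    def kkt_violators(alpha, y, K, b, C, tol=1e-3):
        """Return the indices i whose alpha_i violate the KKT conditions."""
        f = K @ (alpha * y) + b          # f(x_i) = sum_j alpha_j y_j K_ij + b
        m = y * f                        # margins y_i f(x_i)
        viol = ((alpha < tol) & (m < 1 - tol)) \
             | ((alpha > tol) & (alpha < C - tol) & (np.abs(m - 1) > tol)) \
             | ((alpha > C - tol) & (m > 1 + tol))
        return np.where(viol)[0]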

23.1 Sequential minimal optimization (SMO)


We cannot take just one point at a time into the working set and optimize with respect to it, because the equality constraint could not then be satisfied. So we choose 2 points and optimize with respect to them. This is what is done in SMO: SMO is a decomposition method with 2 points taken at a time for optimization. The details are given below.
Without loss of generality, take the points α_1 and α_2 in the working set. Then the program (157) can be rewritten as
\min_{\alpha_1, \alpha_2} \; -\alpha_1 - \alpha_2 - \sum_{i \ne 1,2} \alpha_i + \frac{1}{2}\big( \alpha_1^2 K_{11} + \alpha_2^2 K_{22} \big) + \alpha_1 \alpha_2 y_1 y_2 K_{12} + \alpha_1 y_1 \sum_{i \ne 1,2} K_{1i} \alpha_i y_i + \alpha_2 y_2 \sum_{i \ne 1,2} K_{2i} \alpha_i y_i   (158)

s.t.   \alpha_1 y_1 + \alpha_2 y_2 = -\sum_{j \ne 1,2} \alpha_j y_j = \alpha_1^{old} y_1 + \alpha_2^{old} y_2

\alpha_1, \alpha_2 \in [0, C]
From the equality constraint, we can write α_1 in terms of α_2:

\alpha_1 = -\alpha_2 \frac{y_2}{y_1} + \alpha_1^{old} + \alpha_2^{old} \frac{y_2}{y_1}
Then the objective is just a function of α_2; let the objective be −D(α_2). Now the program reduces to

\min_{\alpha_2} \; -D(\alpha_2)

s.t.   \alpha_2 \in [0, C]
Find α_2^* such that \frac{\partial D(\alpha_2)}{\partial \alpha_2} = 0. We must also ensure that α_1 ∈ [0, C], so we will have to clip α_2, i.e., shift it into a certain interval. The condition is

0 \le -\alpha_2 \frac{y_2}{y_1} + \alpha_1^{old} + \alpha_2^{old} \frac{y_2}{y_1} \le C
- Case 1: y_1 = y_2:   \alpha_2 \in \big[ \max(0, -C + \alpha_1^{old} + \alpha_2^{old}), \; \min(C, \alpha_1^{old} + \alpha_2^{old}) \big]
- Case 2: y_1 = -y_2:   \alpha_2 \in \big[ \max(0, \alpha_2^{old} - \alpha_1^{old}), \; \min(C, C - \alpha_1^{old} + \alpha_2^{old}) \big]
If α_2 is already in the interval, there is no problem. If it is above the upper limit, reset it to the upper limit; similarly, if α_2 falls below the lower limit, reset it to the lower limit. This ensures the optimum value of the objective subject to this condition.
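A sketch of the clipping rule just described (ours; a2_unc denotes the unconstrained stationary point of D, and all names are assumptions):

    def clip_alpha2(a2_unc, a1_old, a2_old, y1, y2, C):
        """Clip the unconstrained optimum of alpha_2 to its feasible interval."""
        if y1 == y2:     # case 1
            lo, hi = max(0.0, a1_old + a2_old - C), min(C, a1_old + a2_old)
        else:            # case 2: y1 = -y2
            lo, hi = max(0.0, a2_old - a1_old), min(C, C - a1_old + a2_old)
        # alpha_1 is then recovered from the equality constraint
        return min(max(a2_unc, lo), hi)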

23.2 Probabilistic models


In one of the previous lectures, probabilistic models were mentioned. They are of two types, conditional and generative, based on the variable over which the distribution is defined. Conditional models define a distribution over the class given the input; generative models define a joint distribution over both the dependent and the independent variables.
The classification models can again be divided into two kinds: parametric and non-parametric. Parametric forms assume a distribution, controlled by parameters, over the class or the input; for example, the class output ∼ N(w^T φ(x), σ), where σ and w are the parameters. During the training phase the parameters are learned.
For a classification task we have a scoring function g_k(x) based on which we do classification: the point x is classified to \operatorname{argmax}_k g_k(x).

For a discriminative model, the function is g_k(x) = \ln p(C_k|x), i.e., it models the conditional probability of the class variable given the input.
For a generative model, g_k(x) = \ln[p(x|C_k) \, p(C_k)] - \ln p(x) (obtained by Bayes' rule). Generative models model a joint distribution of the input and class variables.

24 Lecture 24: Prob. Classifiers


24.1 Non-Parametric Density Estimation
Models in which the form of the model is not specified a priori but is instead determined from the data are called non-parametric models. The term non-parametric does not imply that such models completely lack parameters, but that the number and nature of the parameters are flexible and not fixed in advance. Non-parametric models are generally generative methods.
The probability P that a vector x will fall in a region R is given by

P = \int_R p(x) \, dx   (159)

Suppose that n i.i.d. samples are randomly drawn according to the probability distribution p(x). The probability that k of these n fall in R is

P_k = \binom{n}{k} P^k (1-P)^{n-k}   (160)
The expected value of k is

E[k] = nP   (161)

If n is very large, then \frac{k}{n} will be a good estimate of the probability P.

\int_R p(x') \, dx' \simeq p(x) \, V   (162)

where x is a point within R and V is the volume enclosed by R. Combining Eqs. (159) and (162) we get:

p(x) \simeq \frac{k}{nV}   (163)
- Kernel Density Estimation
  One technique for nonparametric density estimation is kernel density estimation where, effectively, V is held fixed while K, the number of sample points lying within V, is estimated. The density is obtained by placing a kernel function, e.g., a Gaussian, centered on each data point and then adding the functions together. The quality of the estimate depends crucially on the kernel function. For the Gaussian case, the non-parametric density estimator is known as the Parzen-window density estimator.
In Eq. (163) we could keep V fixed and estimate K. For example, we could consider V to be a hypercube of side length d in dimension n:

P(x) = \frac{\sum_{x' \in D} K(x, x')}{d^n \, |D|}   (164)

K(x, x') = \begin{cases} 1 & \text{if } \|x_i - x'_i\| \le d \;\; \forall \, i \in [1:n] \\ 0 & \text{if } \|x_i - x'_i\| > d \text{ for some } i \in [1:n] \end{cases}   (165)
For smooth kernel density estimation, we could use

K(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}}   (166)

Since K is smooth, we need not specify the volume V; σ implicitly defines V.

P(x) = \frac{1}{|D|} \sum_{x' \in D} K(x, x')   (167)

Parzen Window Classifier:

– Given r classes, build a density model for each class, with D = D_1 \cup D_2 \cup \ldots \cup D_r:
  P(x|C_i) = P_i(x) = \frac{1}{|D_i|} \sum_{x' \in D_i} K(x, x')
– Estimate P(C_i) = \frac{|D_i|}{|D|}.
– The class chosen is the one that maximizes the posterior distribution, i.e., \operatorname{argmax}_i a_i(x) = \log[P(x|C_i) \, P(C_i)] (assuming the same σ for all classes). A sketch is given after this list.
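A minimal sketch (ours) of the Parzen-window classifier with the Gaussian kernel of Eq. (166); the argument class_data, the shared σ, and the small additive guard against log(0) are assumptions:

    import numpy as np

    def parzen_classify(x, class_data, sigma=1.0):
        """class_data: list of arrays D_i, each of shape (|D_i|, n)."""
        N = sum(len(Di) for Di in class_data)
        scores = []
        for Di in class_data:
            kvals = np.exp(-np.sum((Di - x) ** 2, axis=1) / (2.0 * sigma ** 2))
            p_x_given_c = kvals.mean()          # P(x|C_i), Eq. (167)
            prior = len(Di) / N                 # P(C_i) = |D_i| / |D|
            scores.append(np.log(p_x_given_c * prior + 1e-300))  # a_i(x)
        return int(np.argmax(scores))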
A potential problem with kernel density classifiers is that they can be biased toward the more populated classes, owing to the class prior P(C_i) = \frac{|D_i|}{|D|}. A more severe issue is that σ is fixed for all classes.

- K-Nearest Neighbours (K-NN)
  The idea behind this class of kernel density estimators is to hold K constant and instead determine the volume V of the tightest sphere that encompasses the K samples. The volume is a non-decreasing function of K. K-NN is non-smooth.

  P(C_i|x) = \frac{K_i}{K}

  where K_i is the number of points that belong to class C_i out of the K points nearest to a given point x. The steps in K-NN density estimation are as follows:
  – Given a point x, find the set D_k of its K nearest neighbours.
  – For each class C_i compute K_i = |\{x' \mid x' \in D_k \text{ and } x' \in C_i\}|, that is, K_i is the number of points from class C_i that belong to the nearest neighbour set D_k.
  – Classify x into C_j = \operatorname{argmax}_{C_i} \frac{K_i}{K}.

  The decision boundaries of K-NN are highly non-linear; a sketch of the rule follows.
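This is a minimal sketch (ours; Euclidean distance and all names are assumptions):

    import numpy as np

    def knn_classify(x, X_train, y_train, K=5):
        """Classify x by majority vote among its K nearest training points."""
        d2 = np.sum((X_train - x) ** 2, axis=1)   # squared distances to x
        nearest = np.argsort(d2)[:K]              # indices of the K nearest neighbours
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]          # argmax_{C_i} K_i / K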

24.2 Parametric Density Estimation

Parametric methods assume a form for the probability distribution that generates the data and estimate the parameters of the distribution. Generally, parametric methods make more assumptions than non-parametric methods.

– Gaussian Discriminant:

P(x|C_i) = \mathcal{N}(\phi(x); \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{(\phi(x) - \mu_i)^T \Sigma_i^{-1} (\phi(x) - \mu_i)}{2} \right)   (168)

Given a point x, classify it to the class C_i such that
C_i = \operatorname{argmax}_i \log[P(x|C_i) \, P(C_i)]
Let a_i = \log P(x|C_i) + \log P(C_i) = w_i^T \phi(x) + w_{i0},
where

w_i = \Sigma_i^{-1} \mu_i
w_{i0} = -\frac{1}{2} \mu_i^T \Sigma_i^{-1} \mu_i + \ln P(C_i) - \frac{1}{2} \ln[(2\pi)^n |\Sigma_i|] - \frac{1}{2} \phi(x)^T \Sigma_i^{-1} \phi(x)

Maximum Likelihood Estimation:

(\mu_i^{MLE}, \Sigma_i^{MLE}) = \operatorname{argmax}_{\mu, \Sigma} LL(D, \mu, \Sigma) = \operatorname{argmax}_{\mu, \Sigma} \sum_i \sum_{x \in D_i} \log \mathcal{N}(x; \mu_i, \Sigma_i)   (169)

1. The Σ's are common across all classes, i.e., \Sigma_i = \Sigma \; \forall i. The maximum likelihood estimates using (169) are:

\mu_i^{MLE} = \frac{1}{|D_i|} \sum_{x \in D_i} \phi(x)

\Sigma^{MLE} = \frac{1}{|D|} \sum_{i=1}^{k} \sum_{x \in D_i} (\phi(x) - \mu_i)(\phi(x) - \mu_i)^T

2. The Σ_i's are also class-specific parameters. The maximum likelihood estimates are:

\mu_i^{MLE} = \frac{1}{|D_i|} \sum_{x \in D_i} \phi(x)

\Sigma_i^{MLE} = \frac{1}{|D_i|} \sum_{x \in D_i} (\phi(x) - \mu_i)(\phi(x) - \mu_i)^T
We could do this for exponential family as well.
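This is a sketch (ours) of the maximum likelihood estimates for case 2 (class-specific covariances), taking φ(x) to be the identity feature map; the argument class_data and all names are assumptions:

    import numpy as np

    def fit_gaussian_classes(class_data):
        """class_data: list of arrays D_i of shape (|D_i|, n).
        Returns the per-class MLE means and covariances."""
        params = []
        for Di in class_data:
            mu = Di.mean(axis=0)                     # mu_i = (1/|D_i|) sum phi(x)
            centered = Di - mu
            Sigma = centered.T @ centered / len(Di)  # Sigma_i = (1/|D_i|) sum (phi(x)-mu)(phi(x)-mu)^T
            params.append((mu, Sigma))
        return params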

– Exponential Family:
For a given vector of functions \phi(x) = [\phi_1(x), \ldots, \phi_k(x)] and a parameter vector \eta \in \mathbb{R}^k, the exponential family of distributions is defined as

P(x, \eta) = h(x) \exp\{\eta^T \phi(x) - A(\eta)\}   (170)

where h(x) is a conventional reference function and A(\eta) is the log normalization constant, defined as

A(\eta) = \log\left[ \int_{x \in Range(x)} \exp\{\eta^T \phi(x)\} \, h(x) \, dx \right]

Examples:
* Gaussian Distribution: X \sim \mathcal{N}(\mu, \sigma) can be expressed as

\eta = \left[ \frac{\mu}{\sigma^2}, \frac{-1}{2\sigma^2} \right], \qquad \phi(x) = [x, x^2]

* Multivariate Gaussian:

\eta = \Sigma^{-1} \mu

p(x; \Sigma^{-1}, \eta) = \exp\left\{ \eta^T x - \frac{1}{2} x^T \Sigma^{-1} x + Z \right\}

where Z = -\frac{1}{2}\big( n \log(2\pi) - \log|\Sigma^{-1}| \big)

* Bernoulli:
The Bernoulli distribution is defined on a binary (0 or 1) random variable using the parameter \mu = Pr(X = 1). It can be written as

p(x|\mu) = \exp\left\{ x \log\left( \frac{\mu}{1-\mu} \right) + \log(1-\mu) \right\}

\Rightarrow \phi(x) = [x] \text{ and } \eta = \left[ \log\left( \frac{\mu}{1-\mu} \right) \right]

The Bernoulli case is important when \phi(x) contains discrete values.


Say for each class C_k we have the parameter \eta_k:

a_k(x) = \ln(p(x|C_k) \, p(C_k)) = \eta_k^T \phi(x) - A(\eta_k) + \ln(h(x)) + \ln(p(C_k))

\Rightarrow a_i(x) = a_j(x) gives us a linear discriminant.

25 Lecture 25
25.1 Exponential Family Distribution
Consider the exponential family distribution:

p(\phi(x)|\eta_k) = h(\phi(x)) \exp(\eta_k^T \phi(x) - A(\eta_k)) \quad \text{(for class } k\text{)}   (171)

25.2 Discrete Feature Space


φ(x) is the feature space; consider the case when it is discrete-valued:

\phi(x) = [attr_1, attr_2, \ldots]   (172)

Let φ(x) have n attributes, each of which can take c different (discrete) values. The total number of possible values of φ(x) is c^n.

 
The table of all possible configurations has one row per configuration \phi^{(i)}(x) and one column per attribute:

\phi(x) \in \begin{bmatrix} \phi^{(1)}(x) \\ \phi^{(2)}(x) \\ \vdots \\ \phi^{(c^n)}(x) \end{bmatrix}   (173)

The size of the above table is c^n.


For example, when n = 3 and c = 2 ({0, 1} for all attributes), the table showing all possible values of φ(x) is:

\begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}   (174)

Thus the size of the table of p(\phi^{(i)}(x)|\eta_k) will also be c^n.

 
p(\phi(x)|\eta_k) = \begin{bmatrix} \text{Probability} & | & \text{Configuration} \\ p_1 & | & \phi^{(1)}(x) \\ p_2 & | & \phi^{(2)}(x) \\ \vdots & | & \vdots \\ p_{c^n} & | & \phi^{(c^n)}(x) \end{bmatrix}   (175)

where \sum_i p_i = 1   (176)

It is clear that p(\phi(x)|\eta_k) \equiv p(x|\eta_k).

25.3 Naive Bayes Assumption


As the size (c^n) is exponential in the dimension of the feature space, it is not feasible to work with the full table even in a moderately large dimensional feature space. One possible way out is to approximate the probability distribution so that this size is reduced considerably. Naive Bayes is one such approximation, in which the assumption is:

p(\phi(x)|\eta_k) = p(\phi_1(x)|\eta_k) \, p(\phi_2(x)|\eta_k) \ldots p(\phi_n(x)|\eta_k)   (177)

where \phi_i(x) denotes the i-th attribute of φ(x) in the feature space. Thus, what the Naive Bayes assumption essentially says is that each attribute is independent of the other attributes given the class. So the original probability distribution table for p(\phi(x)|\eta_k), of size c^n, gets replaced by n tables (one per attribute), each of size c \times 1, as follows:

 
p(\phi_j(x)|\eta_k) = \begin{bmatrix} p_{j1} & | & \phi_j(x) = \mu_{j1} \\ p_{j2} & | & \phi_j(x) = \mu_{j2} \\ \vdots & | & \vdots \\ p_{jc} & | & \phi_j(x) = \mu_{jc} \end{bmatrix}   (178)

where \mu_{j1}, \ldots, \mu_{jc} are the c discrete values that \phi_j(x) can take,

and \sum_i p_{ji} = 1   (179)

NOTE: This assumption does NOT say that p(\phi(x)) = p(\phi_1(x)) \ldots p(\phi_n(x)).
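This is a sketch (ours) of Naive Bayes estimation and classification for discrete attributes; the Laplace smoothing used to avoid zero counts, and all names, are our additions:

    import numpy as np

    def fit_naive_bayes(X, y, c):
        """X: (N, n) integer attributes in {0..c-1}; y: (N,) labels in {0..r-1}.
        Builds the n per-attribute tables p(phi_j(x) | C_k) of Eq. (178)."""
        r, n = int(y.max()) + 1, X.shape[1]
        tables = np.ones((r, n, c))                 # Laplace smoothing (our choice)
        priors = np.zeros(r)
        for k in range(r):
            Xk = X[y == k]
            priors[k] = len(Xk) / len(X)
            for j in range(n):
                tables[k, j] += np.bincount(Xk[:, j], minlength=c)
            tables[k] /= tables[k].sum(axis=1, keepdims=True)   # rows sum to 1
        return priors, tables

    def nb_classify(x, priors, tables):
        # a_k(x) = log p(C_k) + sum_j log p(phi_j(x) = x_j | C_k)
        logp = np.log(priors) + sum(np.log(tables[:, j, x[j]]) for j in range(len(x)))
        return int(np.argmax(logp))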

25.4 Graphical Models


Discussion on Graphical Models was done from the slides [8], [9].

25.5 Graphical Representation of Naive Bayes

In the graphical model, the class node c is the parent of each attribute node \phi_1(x), \phi_2(x), \ldots, \phi_n(x)   (180)

[Figure 29: Graphical model representation of Naive Bayes]


The fact that there is no edge between \phi_1(x) and \phi_2(x) denotes conditional independence.

p(\phi, c) = p(\phi|c) \, p(c)
           = p(\phi_2|\phi_1, \phi_3, c) \, p(\phi_1|\phi_3, c) \, p(\phi_3|c) \, p(c)
           \approx p(\phi_2|\phi_1) \, p(\phi_1|c) \, p(\phi_3|c) \, p(c) \quad \text{(from Figure 29)}
           = \prod_x p(x|\pi(x)) \qquad (\pi(x) \equiv \text{set of parents of } x)

25.6 Graph Factorisation


Think of p(\phi(x), c) (factorised) as specifying a family of distributions.
a \perp b \mid c (a is conditionally independent of b given c) means:

p(a|b, c) = p(a|c)   (181)

p(b|a, c) = p(b|c)   (182)

25.7 Naive Bayes Text Classification


Naive Bayes Text Classification was discussed from the slides [10], [11].

References
[1] Class notes, "Basics of Convex Optimization," Chapter 4.
[2] "Convex Optimization."
[3] "Linear Algebra."
[4] "Bias variance tradeoff." [Online]. Available: http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_bias_variance.htm
[5] Steve Renals, "Support Vector Machine."
[6] Christopher M. Bishop, "Pattern Recognition and Machine Learning."
[7] Nello Cristianini and John Shawe-Taylor, "An Introduction to Support Vector Machines and Other Kernel-based Learning Methods."
[8] Class notes, "Graphical Models."
[9] Class notes, "Graphical Case Study of Probabilistic Models."
[10] Andrew McCallum and Kamal Nigam, "A Comparison of Event Models for Naive Bayes Text Classification."
[11] Steve Renals, "Naive Bayes Text Classification."
