
Engineering Tripos Part IIB FOURTH YEAR

Module 4F10: STATISTICAL PATTERN RECOGNITION


Examples Paper 1
Straightforward questions are marked †
Tripos standard (but not necessarily Tripos length) questions are marked ∗

Bayes Risk
1. In many pattern classification problems, one has the option either to assign the
pattern to one of the c classes, or to reject it as being unrecognizable. If the cost to
reject is not too high, rejection may be a desirable action. Let the cost of classification
be defined as
\[
\lambda(\omega_i|\omega_j) =
\begin{cases}
0 & \omega_i = \omega_j \quad \text{(correct classification)}\\
\lambda_r & \omega_i = \omega_0 \quad \text{(rejection)}\\
\lambda_s & \text{otherwise} \quad \text{(substitution error)}
\end{cases}
\]
Show that for minimum risk classification, the decision rule should associate a test
vector x with class ωi , if P (ωi |x) ≥ P (ωj |x) for all j and P (ωi |x) ≥ 1 − λr /λs , and
reject otherwise.
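The following Python sketch is not part of the original paper; it simply evaluates the decision rule stated above on some made-up posterior values, assuming the posteriors P(ωi|x) are already available.

```python
import numpy as np

def classify_with_reject(posteriors, lam_r, lam_s):
    """Minimum-risk decision with a reject option.

    posteriors : array of P(omega_i | x) for the c classes
    lam_r      : cost of rejection
    lam_s      : cost of a substitution error
    Returns the chosen class index, or None for rejection.
    """
    best = int(np.argmax(posteriors))
    # Reject when even the most probable class falls below 1 - lam_r/lam_s.
    if posteriors[best] < 1.0 - lam_r / lam_s:
        return None
    return best

# Illustrative posteriors and costs (hypothetical values, not from the question).
print(classify_with_reject(np.array([0.55, 0.30, 0.15]), lam_r=0.5, lam_s=1.0))  # -> 0
print(classify_with_reject(np.array([0.40, 0.35, 0.25]), lam_r=0.5, lam_s=1.0))  # -> None (reject)
```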
EM and Mixture Models
2. † For d-dimensional data compare the computational cost of calculating the log-likelihood
with a diagonal covariance matrix Gaussian distribution, a full covariance matrix Gaussian
distribution and an M-component diagonal covariance matrix Gaussian mixture model.
Clearly state any assumptions made.
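As a rough companion to the counting argument (not required by the question), the Python sketch below evaluates the per-sample log-likelihood under the three model types with made-up parameters; the operation counts follow the same pattern as the calculation itself: roughly O(d) for a diagonal Gaussian, O(d²) for a full-covariance Gaussian with a precomputed Cholesky factor, and O(Md) for an M-component diagonal mixture.

```python
import numpy as np

d, M = 5, 3
rng = np.random.default_rng(0)
x = rng.normal(size=d)

# Diagonal-covariance Gaussian: O(d) operations per sample.
mu_d = rng.normal(size=d)
var_d = rng.uniform(0.5, 2.0, size=d)
ll_diag = -0.5 * np.sum(np.log(2 * np.pi * var_d) + (x - mu_d) ** 2 / var_d)

# Full-covariance Gaussian: O(d^2) per sample once the Cholesky factor is cached.
mu_f = rng.normal(size=d)
B = rng.normal(size=(d, d))
Sigma = B @ B.T + d * np.eye(d)        # a valid (positive definite) covariance
L = np.linalg.cholesky(Sigma)          # O(d^3), but done once per model, not per sample
z = np.linalg.solve(L, x - mu_f)       # so that z'z = (x - mu)' Sigma^{-1} (x - mu)
ll_full = -0.5 * (d * np.log(2 * np.pi) + 2 * np.sum(np.log(np.diag(L))) + z @ z)

# M-component diagonal-covariance mixture: O(Md) per sample.
w = np.full(M, 1.0 / M)
mu_m = rng.normal(size=(M, d))
var_m = rng.uniform(0.5, 2.0, size=(M, d))
comp_ll = -0.5 * np.sum(np.log(2 * np.pi * var_m) + (x - mu_m) ** 2 / var_m, axis=1)
ll_mix = np.log(np.sum(w * np.exp(comp_ll)))

print(ll_diag, ll_full, ll_mix)
```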
3. A 1-dimensional 2-component mixture distribution has a common fixed known variance
of 1, initial mean values µ1 = 0, µ2 = 2 and mixture weights c1 = c2 = 0.5.
A data set of 9 training data points is provided:

−1.5, −0.5, 0.1, 0.3, 0.9, 1.3, 1.9, 2.3, 3.0

(a) Calculate the log likelihood of the training data for the mixture distribution with
the initial parameters.
(b) Calculate updated values for the mean and mixture weights for 1 iteration of the
E-M algorithm.
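The sketch below is a numerical check (not part of the paper) of parts (a) and (b), written in Python; it should reproduce the values quoted in the Answers section at the end of this paper.

```python
import numpy as np

x = np.array([-1.5, -0.5, 0.1, 0.3, 0.9, 1.3, 1.9, 2.3, 3.0])
mu = np.array([0.0, 2.0])          # initial means
c = np.array([0.5, 0.5])           # initial mixture weights
var = 1.0                          # common, fixed variance

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# (a) log-likelihood under the initial parameters
px = c[0] * gauss(x, mu[0], var) + c[1] * gauss(x, mu[1], var)
print("log-likelihood:", np.sum(np.log(px)))        # approximately -15.302

# (b) one EM iteration
resp = np.stack([c[m] * gauss(x, mu[m], var) for m in range(2)])  # unnormalised responsibilities
resp /= resp.sum(axis=0)                                          # E-step: P(omega_m | x_k)
mu_new = (resp * x).sum(axis=1) / resp.sum(axis=1)                # M-step: updated means
c_new = resp.sum(axis=1) / len(x)                                 # M-step: updated weights
print("means:", mu_new, "weights:", c_new)
```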
4. Consider an M component mixture model of d-dimensional binary data x of the form

\[
p(x) = \sum_{m=1}^{M} P(\omega_m)\, p(x|\omega_m)
\]

where the jth component PDF has parameters λj1 , . . . , λjd and

\[
p(x|\omega_j) = \prod_{i=1}^{d} \lambda_{ji}^{x_i} (1 - \lambda_{ji})^{1 - x_i}
\]

A set of training samples x1 , . . . , xn are used to train the mixture model. Using
the standard form of EM with mixture models show that the maximum likelihood
estimate for the “new” parameters, λ̂ji , is given by
\[
\hat{\lambda}_{ji} = \frac{\sum_{k=1}^{n} P(\omega_j|x_k)\, x_{ki}}{\sum_{k=1}^{n} P(\omega_j|x_k)}
\]

where P (ωj |xk ) is obtained using the “old” model parameters.
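A minimal Python sketch of one EM iteration for this Bernoulli mixture follows; it is an illustration of the update to be shown, run on made-up binary data and made-up starting parameters, not part of the paper.

```python
import numpy as np

def em_step_bernoulli_mixture(X, priors, lam):
    """One EM iteration for an M-component mixture of Bernoullis.

    X      : (n, d) binary data
    priors : (M,) current P(omega_m)
    lam    : (M, d) current Bernoulli parameters lambda_{mi}
    """
    # E-step: P(omega_j | x_k) under the "old" parameters.
    log_px = X @ np.log(lam).T + (1 - X) @ np.log(1 - lam).T   # (n, M)
    log_joint = log_px + np.log(priors)
    log_joint -= log_joint.max(axis=1, keepdims=True)          # for numerical stability
    post = np.exp(log_joint)
    post /= post.sum(axis=1, keepdims=True)                    # (n, M)

    # M-step: posterior-weighted relative frequencies, as in the result to be shown.
    new_lam = (post.T @ X) / post.sum(axis=0)[:, None]
    new_priors = post.mean(axis=0)
    return new_priors, new_lam

# Tiny illustrative run with made-up data and starting values.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(20, 4)).astype(float)
priors = np.array([0.5, 0.5])
lam = rng.uniform(0.3, 0.7, size=(2, 4))
print(em_step_bernoulli_mixture(X, priors, lam))
```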


5. ∗ A series of n independent, noisy measurements are taken, x1 , . . . , xn . The noise
is known to be Gaussian distributed with zero mean and unit variance. The “true”
data is also known to be Gaussian distributed.
(a) Find the maximum likelihood estimates of the mean, µ, and variance, σ 2 , of the
“true” data by equating the gradient to zero.
(b) A latent variable zi is introduced. It is the value of the noise for observation xi .
Show that the posterior probability of zi given the current model parameters is
\[
p(z_i|x_i, \theta) = \mathcal{N}\!\left(z_i;\ \frac{x_i - \mu}{1 + \sigma^2},\ \frac{\sigma^2}{1 + \sigma^2}\right)
\]
Using the expectation-maximisation algorithm derive re-estimation formulae for
the mean, µ, and variance, σ². Show that the iterative estimation scheme for
the mean converges to the correct answer; you may assume that the variance of
the true data is known and fixed at σ².
Discuss the merits of the two optimisation schemes for this task and for optimisation
tasks in general.
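The sketch below is one possible numerical check, not the required derivation: it assumes the E-step posterior quoted in part (b) and an M-step that sets µ to the average of the de-noised observations xi − E[zi|xi] (an assumption of this sketch), and shows the iteration settling on the sample mean, which is the direct maximum likelihood answer from part (a). All data values are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
true_mu, true_var = 1.5, 0.8                     # made-up "true" data parameters
n = 200
# Noisy observations: true data plus zero-mean, unit-variance Gaussian noise.
x = rng.normal(true_mu, np.sqrt(true_var), n) + rng.normal(0.0, 1.0, n)

var = true_var                                   # variance assumed known and fixed
mu = 0.0                                         # initial guess for the mean
for it in range(20):
    Ez = (x - mu) / (1.0 + var)                  # E-step: posterior mean of the noise z_i
    mu = np.mean(x - Ez)                         # M-step: average of de-noised observations
    print(it, mu)

print("direct ML estimate (sample mean):", x.mean())
```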
Product of Experts
6. ∗ For parts of this question it is useful to use matlab/octave. A product of experts
system is to be used for speech synthesis. The data is known to be generated from
two classes ω1 and ω2 . Four Gaussian experts are to be used. These experts are:
p(xt |ω1 ) = N (xt ; 1, 1) Expert 1
p(xt − xt−1 |ω1 ) = N (xt − xt−1 ; 1, 1) Expert 2
p(xt |ω2 ) = N (xt ; 2, 1) Expert 3
p(xt − xt−1 |ω2 ) = N (xt − xt−1 ; −1, 1) Expert 4
A sequence of 3 samples are to be generated. The first two are known to come from
class ω1 , the final sample from class ω2 . The data is known to start in silence, which
has a value of 0.

(a) Show that the overall sequence of observations can be written in the following
form

\[
Ax = A \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} x_1 \\ x_1 - 0 \\ x_2 \\ x_2 - x_1 \\ x_3 \\ x_3 - x_2 \end{pmatrix}
\]
(b) The transformed data, Ax, is Gaussian distributed, so

\[
p(x|\theta) = \frac{1}{Z}\, p(Ax|\theta) = \frac{1}{Z}\, \mathcal{N}(Ax; \mu, \Sigma)
\]

where Z is the appropriate normalisation term to ensure a valid PDF. Find
expressions for µ and Σ.
(c) By using the following expression (or otherwise)

\[
\exp\left(-\frac{1}{2}(Ax - \mu)'(Ax - \mu)\right) = \exp\left(-\frac{1}{2}\left(x'A'Ax - 2\mu'Ax + \mu'\mu\right)\right)
\]

find the mean of the distribution of x. How can this approach be used for speech
synthesis? What does x look like if experts 2 and 4 are not used (set A = I, an
identity matrix)?
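The question suggests matlab/octave may be helpful; the sketch below uses Python/NumPy instead (an assumption, not part of the paper). It builds A from part (a), stacks the expert means into µ with Σ = I, and finds the mean of p(x|θ) by least squares, i.e. by minimising (Ax − µ)′(Ax − µ) as in part (c).

```python
import numpy as np

# Rows of A produce [x1, x1-0, x2, x2-x1, x3, x3-x2]; the starting silence has value 0.
A = np.array([[ 1,  0, 0],
              [ 1,  0, 0],
              [ 0,  1, 0],
              [-1,  1, 0],
              [ 0,  0, 1],
              [ 0, -1, 1]], dtype=float)

# Expert means in the same order: static/delta for t = 1, 2 (class w1) and t = 3 (class w2).
mu = np.array([1, 1, 1, 1, 2, -1], dtype=float)
Sigma = np.eye(6)                      # every expert has unit variance

# Mean of p(x|theta): minimise (Ax - mu)'(Ax - mu) over x (Sigma = I here).
x_hat, *_ = np.linalg.lstsq(A, mu, rcond=None)
print("mean with all four experts:", x_hat)

# Dropping experts 2 and 4 leaves only the static experts, i.e. A = I on [x1, x2, x3],
# so the mean is just the static expert means.
print("mean with only static experts:", np.array([1.0, 1.0, 2.0]))
```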

Restricted Boltzmann Machine

7. A restricted Boltzmann machine is to be built where the observations, x, are
continuous variables and the hidden units, h, are binary. The energy function has the
following form:

\[
G(x, h|\theta) = \sum_{i=1}^{d} \frac{(x_i - a_i)^2}{2\sigma_i^2} - \sum_{j=1}^{J} b_j h_j - \sum_{i,j} \frac{x_i}{\sigma_i} h_j w_{ij}
\]

Show that the posterior probability of the hidden and observed variables can be
expressed as

\[
P(h_j = 1|x, \theta) = \frac{1}{1 + \exp\left(-b_j - \sum_{i=1}^{d} \frac{x_i}{\sigma_i} w_{ij}\right)}
\]

\[
p(x_i|h, \theta) = \mathcal{N}\!\left(x_i;\ a_i + \sigma_i \sum_{j} h_j w_{ij},\ \sigma_i^2\right)
\]

Why is this form of expression important when training Restricted Boltzmann machines?
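The Python sketch below is not part of the paper; it evaluates the two conditional expressions above with made-up parameters and draws one alternating sample from each, illustrating that each conditional factorises and is cheap to sample, which is one reason this form matters during training.

```python
import numpy as np

rng = np.random.default_rng(3)
d, J = 4, 3                                     # made-up sizes
a, b = rng.normal(size=d), rng.normal(size=J)   # visible and hidden biases
sigma = rng.uniform(0.5, 1.5, size=d)
W = rng.normal(scale=0.1, size=(d, J))

def sample_h_given_x(x):
    # P(h_j = 1 | x) = sigmoid(b_j + sum_i (x_i / sigma_i) w_ij)
    p = 1.0 / (1.0 + np.exp(-(b + (x / sigma) @ W)))
    return (rng.random(J) < p).astype(float), p

def sample_x_given_h(h):
    # p(x_i | h) = N(x_i; a_i + sigma_i * sum_j h_j w_ij, sigma_i^2)
    mean = a + sigma * (W @ h)
    return rng.normal(mean, sigma), mean

# One alternating (block Gibbs) step between the two conditionals.
x = rng.normal(size=d)
h, _ = sample_h_given_x(x)
x_new, _ = sample_x_given_h(h)
print(h, x_new)
```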

Single Layer Perceptrons

8. The standard single layer perceptron is used to discriminate between two classes.
There are two simple techniques for generalising this to a K class problem. The
first is to build a set of pairwise classifiers i.e. ωi versus ωj , j ̸= i. The second
is to build a set of classifiers of each class versus all other classes i.e. ωi versus
{ω1 , . . . , ωi−1 , ωi+1 , . . . , ωK }. Compare the two forms of classifier in terms of training and
testing computational cost. By drawing a specific example with K = 3 show that
both forms of classifier can result in an “ambiguous” region i.e. no decision can be
made. Describe how multiple binary classifiers may be trained so that no ambiguous
regions exist.
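The toy Python sketch below (weights chosen by hand for illustration, not trained perceptrons, and not part of the paper) shows how the one-versus-rest construction with K = 3 can leave points claimed by no classifier or by more than one classifier, i.e. ambiguous regions.

```python
import numpy as np

# Three hand-picked one-vs-rest linear discriminants for K = 3:
# class k is claimed when w_k . [x, y, 1] > 0.
W = np.array([[ 1.0, 0.0, -1.0],    # omega_1 vs rest: x > 1
              [-1.0, 0.0, -1.0],    # omega_2 vs rest: x < -1
              [ 0.0, 1.0, -1.0]])   # omega_3 vs rest: y > 1

def one_vs_rest_decision(point):
    scores = W @ np.append(point, 1.0)
    claimed = np.flatnonzero(scores > 0)
    return claimed[0] if len(claimed) == 1 else None   # None marks an ambiguous point

print(one_vs_rest_decision(np.array([2.0, 0.0])))   # claimed by omega_1 only
print(one_vs_rest_decision(np.array([0.0, 0.0])))   # claimed by nobody -> ambiguous
print(one_vs_rest_decision(np.array([2.0, 2.0])))   # claimed by omega_1 and omega_3 -> ambiguous
```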

Answers

3. (a) total log-likelihood of data (natural log (ln)) -15.302 (likelihood 2.262e-07); (b)
µ̂1 = −0.0426 ; µ̂2 = 1.878 ; ĉ1 = 0.5266; ĉ2 = 0.4734

M.J.F. Gales
P.C. Woodland
Oct 2003 - Jan 2007
