
CS 189/289A Introduction to Machine Learning

Fall 2024 Jennifer Listgarten, Saeed Saremi DIS1

1 Maximizing Likelihood & Minimizing Cost


Maximum Likelihood Estimation (MLE) is a method for estimating the parameters of a statistical
model given observations.
Data Suppose we obtain n discrete observations belonging to B := {1, 2, 3, 4}. Our dataset looks
something like the following.

r_1 = 1
r_2 = 1
r_3 = 3
⋮
r_n = 1

Assumptions Suppose we aim to estimate the occurrence probabilities of each class in B based
on the observed data. We additionally assume that observations are independent and identically
distributed (i.i.d.). In particular, this assumption implies that the order of the data does not matter.
Model Based on these assumptions, a natural model for our data is the multinomial distribution. In a multinomial distribution, the order of the data does not matter, and we can equivalently represent our dataset as (y, c_y)_{y ∈ B}, where c_y is the number of items of class y.
The probability mass function (PMF) of the multinomial distribution, that is, the probability in n trials of obtaining each class i exactly x_i times, is

P(x_1, \ldots, x_k) = n! \prod_{i=1}^{k} \frac{p_i^{x_i}}{x_i!}.
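
For concreteness, here is a minimal sketch (in Python, with hypothetical observations and class probabilities that are not part of the worksheet) of converting raw observations into the count representation (y, c_y) and evaluating this PMF:

```python
from math import factorial, prod

# Hypothetical raw observations r_1, ..., r_n in B = {1, 2, 3, 4} (illustrative only).
r = [1, 1, 3, 2, 1, 4, 1, 3]
n = len(r)

# Equivalent count representation (y, c_y): number of items of each class y.
counts = {y: r.count(y) for y in [1, 2, 3, 4]}

# Assumed class probabilities p_y (the parameters of the multinomial model).
p = {1: 0.5, 2: 0.1, 3: 0.25, 4: 0.15}

# Multinomial PMF: n! * prod_y p_y^{c_y} / c_y!
pmf = factorial(n) * prod(p[y] ** counts[y] / factorial(counts[y]) for y in counts)
print(counts, pmf)
```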

(a) Derive an expression for the likelihood for this problem. What are the observations? What are
the parameters? What parameters are we trying to estimate with MLE?

(b) Typically, the log-likelihood ℓ(θ) = log L(θ) is used instead of L(θ). Write down the expression
for ℓ(θ). Why might this be a good idea?
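
As a rough illustration of one practical reason (not part of the problem statement): with many i.i.d. observations, the product of per-observation probabilities underflows in floating point, while the sum of their logs stays well-scaled. A small sketch, assuming hypothetical data and probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.1, 0.25, 0.15])            # assumed class probabilities
obs = rng.choice([1, 2, 3, 4], size=2000, p=p)  # hypothetical i.i.d. observations

per_obs = p[obs - 1]               # probability of each observation under the model
print(np.prod(per_obs))            # product of 2000 numbers < 1: underflows to 0.0
print(np.sum(np.log(per_obs)))     # log-likelihood: a finite, well-scaled number
```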

(c) Another idea might be to minimize the cross-entropy based on raw observations, corresponding
to the following program
argmin_{p ∈ R^4_+, ∥p∥_1 = 1}  −\sum_{i=1}^{n} \sum_{y ∈ B} δ_{r_i y} \log p_y

where p = [p_1 \; p_2 \; p_3 \; p_4]^⊤ is the vector of probabilities per class, and δ_{r_i y} is the Kronecker
delta that outputs 1 if r_i = y and 0 otherwise.
Show that this program is equivalent to the MLE program.
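
Before proving the equivalence, it can help to sanity-check it numerically. A minimal sketch, assuming hypothetical observations and an arbitrary candidate p (this is a numerical check, not the proof):

```python
import numpy as np

r = np.array([1, 1, 3, 2, 1, 4, 1, 3])   # hypothetical observations
p = np.array([0.5, 0.1, 0.25, 0.15])     # a candidate probability vector (sums to 1)

# Cross-entropy objective: -sum_i sum_y delta_{r_i y} log p_y
ce = -sum(np.log(p[r_i - 1]) for r_i in r)

# Negative log-likelihood of the same data, up to an additive constant (from the
# factorial terms) that does not depend on p and so does not affect the argmin.
counts = np.array([(r == y).sum() for y in [1, 2, 3, 4]])
nll_variable_part = -np.sum(counts * np.log(p))

print(ce, nll_variable_part)   # identical: both programs penalize p the same way
```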

2 Independence and Multivariate Gaussians
As described in lecture, the covariance matrix Σ ∈ R^{N×N} of a random variable X ∈ R^N has entries given by cov(X_i, X_j) = E[(X_i − µ_i)(X_j − µ_j)], the covariance between the i-th and j-th elements of the random vector X:

Σ = \begin{bmatrix} cov(X_1, X_1) & \cdots & cov(X_1, X_N) \\ \vdots & \ddots & \vdots \\ cov(X_N, X_1) & \cdots & cov(X_N, X_N) \end{bmatrix}.    (1)


Recall that the density of an N-dimensional Multivariate Gaussian Distribution N(µ, Σ) is defined as follows when Σ is positive definite:

f(x) = \frac{1}{\sqrt{(2π)^N |Σ|}} \exp\left( −\frac{1}{2} (x − µ)^⊤ Σ^{−1} (x − µ) \right).    (2)

Here, |Σ| denotes the determinant of the matrix Σ.
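
A minimal numerical sketch (with an illustrative µ and Σ that are not part of the worksheet) that evaluates equation (2) directly and compares it against scipy's implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

N = 2
mu = np.array([0.0, 1.0])                    # assumed mean (illustrative)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # assumed positive-definite covariance
x = np.array([0.3, 0.7])

# Density from equation (2): (2*pi)^{-N/2} |Sigma|^{-1/2} exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu))
diff = x - mu
quad = diff @ np.linalg.solve(Sigma, diff)
f = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** N * np.linalg.det(Sigma))

print(f, multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should match
```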

(a) Consider the random variables X and Y in R with the following conditions.

(i) X and Y can take values in {−1, 0, 1}.


(ii) When X is 0, Y takes values 1 and −1 with equal probability (1/2). When Y is 0, X takes values 1 and −1 with equal probability (1/2).
(iii) Either X is 0 with probability (1/2), or Y is 0 with probability (1/2).

Are X and Y uncorrelated? Are X and Y independent? Prove your assertions. Hint: Write
down the joint probability of (X, Y) for each possible pair of values they can take.
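
If it helps to experiment before writing the proof, here is a hedged sketch that tabulates the joint probabilities implied by conditions (i)-(iii) and compares them with the products of the marginals (the argument itself still needs to be written out):

```python
from itertools import product

# Joint probabilities P(X = x, Y = y) implied by conditions (i)-(iii):
# X = 0 with prob 1/2 (then Y is ±1 equally), or Y = 0 with prob 1/2 (then X is ±1 equally).
joint = {(0, 1): 0.25, (0, -1): 0.25, (1, 0): 0.25, (-1, 0): 0.25}

E_X  = sum(prob * x     for (x, y), prob in joint.items())
E_Y  = sum(prob * y     for (x, y), prob in joint.items())
E_XY = sum(prob * x * y for (x, y), prob in joint.items())
cov = E_XY - E_X * E_Y

# Independence check: compare P(X=x, Y=y) with P(X=x) * P(Y=y) for every pair.
P_X = {x: sum(prob for (xx, _), prob in joint.items() if xx == x) for x in [-1, 0, 1]}
P_Y = {y: sum(prob for (_, yy), prob in joint.items() if yy == y) for y in [-1, 0, 1]}
for x, y in product([-1, 0, 1], repeat=2):
    print(x, y, joint.get((x, y), 0.0), P_X[x] * P_Y[y])
print("cov(X, Y) =", cov)
```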

(b) For X = [X_1, · · · , X_n]^⊤ ∼ N(µ, Σ), verify that if X_i, X_j are independent (for all i ≠ j), then Σ must be diagonal, i.e., X_i, X_j are uncorrelated.
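
As an empirical illustration only (not a proof), sampling independent coordinates with different scales and computing the sample covariance shows the off-diagonal entries shrinking toward zero; a minimal sketch with illustrative distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical independent coordinates with different scales (illustrative only).
X = np.column_stack([rng.normal(0, 1, 100_000),
                     rng.normal(2, 3, 100_000),
                     rng.uniform(-1, 1, 100_000)])

# Empirical covariance: off-diagonal entries are close to 0 for independent coordinates.
print(np.round(np.cov(X, rowvar=False), 3))
```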

(c) Let N = 2, µ = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, and Σ = \begin{bmatrix} α & β \\ β & γ \end{bmatrix}. Suppose X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} ∼ N(µ, Σ). Show that X_1, X_2 are independent if β = 0. Recall that two continuous random variables W, Y with joint density f_{W,Y} and marginal densities f_W, f_Y are independent if f_{W,Y}(w, y) = f_W(w) f_Y(y).

(d) Consider a data point x drawn from an N-dimensional zero-mean Multivariate Gaussian distribution N(0, Σ), as shown above. Assume that Σ^{−1} exists. Prove that there exists a matrix A ∈ R^{N×N} such that x^⊤ Σ^{−1} x = ∥Ax∥_2^2 for all vectors x. What is the matrix A?
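
If you want to check a candidate A numerically, here is a hedged sketch with an illustrative Σ; it uses a Cholesky-based candidate, which may or may not be the A you derive:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])   # assumed positive-definite covariance
x = rng.normal(size=2)

# One candidate: A = L^{-1}, where Sigma = L L^T is the Cholesky factorization,
# so that A^T A = Sigma^{-1}. Whether this is the A you derive is left to the proof.
L = np.linalg.cholesky(Sigma)
A = np.linalg.inv(L)

print(x @ np.linalg.solve(Sigma, x))   # x^T Sigma^{-1} x
print(np.sum((A @ x) ** 2))            # ||A x||_2^2  (should match)
```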

3 Least Squares (using vector calculus)

In ordinary least-squares linear regression, we typically have n > d so that there is no w such that
Xw = y (these are typically overdetermined systems — too many equations given the number of
unknowns). Hence, we need to find an approximate solution to this problem. The residual vector
will be r = Xw − y and we want to make it as small as possible. The most common case is to measure the residual error with the standard Euclidean ℓ_2-norm. So the problem becomes:

min_w ∥Xw − y∥_2^2,

where X ∈ R^{n×d}, w ∈ R^d, y ∈ R^n.
Assume that X is full rank.
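
As a concrete illustration (with hypothetical data, and np.linalg.lstsq used only as a reference solver), the sketch below sets up an overdetermined system and minimizes the residual numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3                    # n > d: overdetermined (illustrative sizes)
X = rng.normal(size=(n, d))     # hypothetical full-rank design matrix
y = rng.normal(size=n)          # hypothetical targets

# Generically there is no w with Xw = y exactly, so we minimize ||Xw - y||_2^2.
w_hat, residual_ss, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print("rank(X) =", rank)                               # d, since X is full rank
print("||Xw - y||_2^2 =", np.sum((X @ w_hat - y)**2))  # the minimized residual
```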

(a) How do we know that X⊤ X is invertible?

(b) Derive using vector calculus an expression for an optimal estimate for w for this problem.

(c) What should we do if X is not full rank?

DIS1, ©UCB CS 189/289A, Fall 2024. All Rights Reserved. This may not be publicly shared without explicit permission.
