Dis 1
r_1 = 1,  r_2 = 1,  r_3 = 3,  ...,  r_n = 1
Assumptions Suppose we aim to estimate the occurrence probabilities of each class in B based on the observed data. We additionally assume that observations are independent and identically distributed (i.i.d.). In particular, this assumption implies that the order of the data does not matter.
Model Based on these assumptions, a natural model for our data is the multinomial distribution. In a multinomial distribution, the order of the data does not matter, and we can equivalently represent our dataset as (y, c_y)_{y∈B}, where c_y is the number of items of class y.
The probability mass function (PMF) of the multinomial distribution, that is, the probability in n trials of obtaining each class i exactly x_i times, is

$$P(x_1, \dots, x_k) = n! \prod_{i=1}^{k} \frac{p_i^{x_i}}{x_i!}.$$
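As a sanity check of this formula, here is a minimal sketch using only the standard library (the helper name `multinomial_pmf` is ours, not from any package):

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(x_1, ..., x_k) = n! * prod_i p_i^{x_i} / x_i!  (n = sum of counts)."""
    n = sum(counts)
    return factorial(n) * prod(p**x / factorial(x) for x, p in zip(counts, probs))

# Two heads out of three fair-coin flips: C(3,2) * 0.5^3 = 0.375
print(multinomial_pmf([2, 1], [0.5, 0.5]))  # 0.375
```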
(a) Derive an expression for the likelihood for this problem. What are the observations? What are
the parameters? What parameters are we trying to estimate with MLE?
DIS1, ©UCB CS 189/289A, Fall 2024. All Rights Reserved. This may not be publicly shared without explicit permission. 1
(b) Typically, the log-likelihood ℓ(θ) = log L(θ) is used instead of L(θ). Write down the expression
for ℓ(θ). Why might this be a good idea?
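One reason the log is a good idea is numerical: the likelihood is a product of many factors below 1 and underflows double precision, while the log-likelihood is a well-scaled sum. The sketch below uses hypothetical counts and probabilities, and omits the multinomial coefficient n!/∏ x_i! since it does not depend on p:

```python
import math

# Hypothetical class counts c_y and probabilities p_y (illustrative only)
counts = [600, 250, 100, 50]
p = [0.6, 0.25, 0.1, 0.05]

# Raw likelihood (up to the constant coefficient): underflows to exactly 0.0
L = math.prod(pi**c for pi, c in zip(p, counts))
print(L)    # 0.0 -- floating-point underflow

# Log-likelihood: a sum of moderate terms, numerically stable
ell = sum(c * math.log(pi) for pi, c in zip(p, counts))
print(ell)  # roughly -1033
```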
(c) Another idea might be to minimize the cross-entropy based on the raw observations, corresponding to the following program:

$$\operatorname*{argmin}_{\substack{p \in \mathbb{R}^4_+ \\ \|p\|_1 = 1}} \; -\sum_{i=1}^{n} \sum_{y \in B} \delta_{r_i y} \log p_y,$$

where p = [p_1  p_2  p_3  p_4]^⊤ is the vector of probabilities per class, and δ_{r_i y} is the Kronecker delta that outputs 1 if r_i = y and 0 otherwise.
Show that this program is equivalent to the MLE program.
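To see the equivalence concretely, the sketch below (with made-up observations r_i and a fixed p) evaluates the cross-entropy objective per observation and the negative log-likelihood per class count; grouping the sum over i by class value shows they are the same function of p once the constant term log n! − Σ_y log c_y! is dropped:

```python
import math
from collections import Counter

# Hypothetical observations r_i taking values in B = {1, 2, 3, 4}
r = [1, 1, 3, 2, 1, 4, 2, 3, 1, 1]
p = {1: 0.5, 2: 0.2, 3: 0.2, 4: 0.1}

# Cross-entropy objective: -sum_i sum_y delta_{r_i y} log p_y
ce = -sum(math.log(p[ri]) for ri in r)

# Negative log-likelihood, dropping the constant log(n!/prod_y c_y!)
# which does not depend on p
counts = Counter(r)
nll = -sum(c * math.log(p[y]) for y, c in counts.items())

print(ce, nll)  # equal up to floating-point rounding
```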
2 Independence and Multivariate Gaussians
As described in lecture, a covariance matrix Σ ∈ R^{N×N} for a random variable X ∈ R^N takes the following values, where cov(X_i, X_j) = E[(X_i − µ_i)(X_j − µ_j)] is the covariance between the i-th and j-th elements of the random vector X:

$$\Sigma = \begin{bmatrix} \operatorname{cov}(X_1, X_1) & \cdots & \operatorname{cov}(X_1, X_N) \\ \vdots & \ddots & \vdots \\ \operatorname{cov}(X_N, X_1) & \cdots & \operatorname{cov}(X_N, X_N) \end{bmatrix}. \qquad (1)$$
Recall that the density of an N dimensional Multivariate Gaussian Distribution N(µ, Σ) is defined
as follows when Σ is positive definite:
$$f(x) = \frac{1}{\sqrt{(2\pi)^N |\Sigma|}} \, e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)}. \qquad (2)$$
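A small sketch evaluating this density directly (the helper name `mvn_density` is ours, not from any library); as a sanity check, with N = 1 it reduces to the familiar univariate Gaussian:

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, for positive definite Sigma (Eq. 2)."""
    N = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    norm = np.sqrt((2 * np.pi) ** N * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# N = 1 check against exp(-x^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)
x, mu, var = np.array([1.0]), np.array([0.0]), np.array([[2.0]])
manual = np.exp(-0.5 * 1.0 / 2.0) / np.sqrt(2 * np.pi * 2.0)
print(mvn_density(x, mu, var), manual)  # equal
```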
(a) Consider the random variables X and Y in R with the following conditions.
Are X and Y uncorrelated? Are X and Y independent? Prove your assertions. Hint: Write
down the joint probability of (X, Y) for each possible pair of values they can take.
(b) For X = [X_1, ⋯, X_N]^⊤ ∼ N(µ, Σ), verify that if X_i, X_j are independent for all i ≠ j, then Σ must be diagonal, i.e., X_i, X_j are uncorrelated.
(c) Let N = 2, µ = [0, 0]^⊤, and $\Sigma = \begin{bmatrix} \alpha & \beta \\ \beta & \gamma \end{bmatrix}$. Suppose X = [X_1, X_2]^⊤ ∼ N(µ, Σ). Show that X_1, X_2 are independent if β = 0. Recall that two continuous random variables W, Y with joint density f_{W,Y} and marginal densities f_W, f_Y are independent if f_{W,Y}(w, y) = f_W(w) f_Y(y).
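The factorization in part (c) can also be checked numerically for a diagonal Σ (β = 0); the values of α, γ and the evaluation point below are arbitrary:

```python
import numpy as np

# With beta = 0 the density should factor: f(x1, x2) = f(x1) f(x2)
alpha, gamma = 2.0, 0.5
Sigma = np.diag([alpha, gamma])

def joint(x1, x2):
    """Bivariate zero-mean Gaussian density (Eq. 2 with N = 2)."""
    x = np.array([x1, x2])
    q = x @ np.linalg.solve(Sigma, x)
    return np.exp(-0.5 * q) / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(Sigma))

def marginal(x, var):
    """Univariate zero-mean Gaussian density with variance var."""
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)

x1, x2 = 0.7, -1.3
print(np.isclose(joint(x1, x2), marginal(x1, alpha) * marginal(x2, gamma)))  # True
```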
(d) Consider a data point x drawn from an N-dimensional zero mean Multivariate Gaussian dis-
tribution N(0, Σ), as shown above. Assume that Σ−1 exists. Prove that there exists a matrix
A ∈ RN×N such that x⊤ Σ−1 x = ∥Ax∥22 for all vectors x. What is the matrix A?
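One valid choice, assuming only that Σ is positive definite, is A = L^{-1}, where Σ = LL^⊤ is the Cholesky factorization (so Σ^{-1} = L^{-⊤}L^{-1}). The sketch below checks the identity numerically on a random example:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random positive definite covariance (illustrative)
M = rng.standard_normal((3, 3))
Sigma = M @ M.T + 3 * np.eye(3)

# Cholesky factor: Sigma = L L^T, hence Sigma^{-1} = L^{-T} L^{-1}
L = np.linalg.cholesky(Sigma)
A = np.linalg.inv(L)                 # one valid choice of A

x = rng.standard_normal(3)
lhs = x @ np.linalg.solve(Sigma, x)  # x^T Sigma^{-1} x
rhs = np.linalg.norm(A @ x) ** 2     # ||A x||_2^2
print(np.isclose(lhs, rhs))          # True
```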
3 Least Squares (using vector calculus)
In ordinary least-squares linear regression, we typically have n > d so that there is no w such that
Xw = y (these are typically overdetermined systems — too many equations given the number of
unknowns). Hence, we need to find an approximate solution to this problem. The residual vector
will be r = Xw − y, and we want to make it as small as possible. The most common choice is to measure the residual error with the standard Euclidean ℓ2-norm, so the problem becomes

$$\min_{w \in \mathbb{R}^d} \|Xw - y\|_2^2,$$

where X ∈ R^{n×d}, w ∈ R^d, y ∈ R^n.
Assume that X is full rank.
(b) Using vector calculus, derive an expression for an optimal estimate of w for this problem.
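As a numerical sanity check of whatever expression you derive: setting the gradient of ∥Xw − y∥²₂ to zero gives the normal equations X^⊤Xw = X^⊤y, and the sketch below (random synthetic X and y) compares that closed form against NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3                      # overdetermined: n > d
X = rng.standard_normal((n, d))   # full rank with probability 1
y = rng.standard_normal(n)

# Normal equations: X^T X w = X^T y  =>  w* = (X^T X)^{-1} X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_normal, w_lstsq))  # True
```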