(b) (2 pts) Derive the maximum likelihood estimate for θ, i.e., the value of θ that maximizes
the function of part (a). (Hint: the likelihood function is monotonic, so the maximizing
solution lies at an extreme of the feasible range; there is no need to take the derivative in this case.)
2. Weighted linear regression. [10pt] In class, when discussing linear regression, we assumed that the
Gaussian noise is i.i.d. (independent and identically distributed). In practice, we may have extra
information regarding the fidelity of each data point. For example, we may know that some examples
have higher noise variance than others. To model this, we can treat the noise variables ε₁, ε₂, · · · , εₙ
as distinct Gaussians, i.e., εᵢ ∼ N(0, σᵢ²) with known variance σᵢ². How will this influence our linear
regression model? Let's work it out.
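For reference, the linear model itself is assumed from lecture rather than restated in the problem; the setup presumably being worked with is:

$$y_i = w^\top x_i + \epsilon_i,\quad \epsilon_i \sim \mathcal{N}(0, \sigma_i^2) \;\Longrightarrow\; p(y_i \mid x_i; w) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\!\Big(-\frac{(y_i - w^\top x_i)^2}{2\sigma_i^2}\Big)$$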
(a) (3 pts) Write down the log-likelihood function of w under this new modeling assumption.
(c) (3 pts) Take the gradient of the loss function J(w) and provide the batch gradient descent update
rule for optimizing w.
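Part (b) is not reproduced above, so the following sketch assumes the usual weighted least-squares loss J(w) = ½ Σᵢ aᵢ(wᵀxᵢ − yᵢ)² with aᵢ = 1/σᵢ²; under that assumption, a minimal batch gradient descent loop in MATLAB might look like this (the data, step size, and iteration count are all illustrative):

% Hedged sketch: batch gradient descent for weighted linear regression,
% assuming J(w) = 0.5 * sum_i a_i * (w'*x_i - y_i)^2 with a_i = 1/sigma_i^2.
n = 100; d = 3;
X = randn(n, d);                  % illustrative design matrix (rows = examples)
sigma2 = 0.1 + rand(n, 1);        % known per-example noise variances
w_true = [1; -2; 0.5];            % illustrative ground-truth weights
y = X * w_true + sqrt(sigma2) .* randn(n, 1);
a = 1 ./ sigma2;                  % per-example weights (assumed form)
A = diag(a);                      % diagonal weight matrix, A(i,i) = a_i
w = zeros(d, 1);                  % initial weights
eta = 1e-3;                       % illustrative step size
for t = 1:5000
    grad = X' * A * (X * w - y);  % gradient of J(w) in matrix form
    w = w - eta * grad;           % batch gradient descent update rule
end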
(d) (3 pts) Derive a closed-form solution to this optimization problem. Hint: begin by rewriting the
objective in matrix form using a diagonal matrix A with A(i, i) = aᵢ.
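As a consistency check on the hint (still under the weighted least-squares assumption used in the sketch above), the matrix form and its stationary point would be:

$$J(w) = \tfrac{1}{2}(Xw - y)^\top A (Xw - y), \qquad \nabla_w J = X^\top A (Xw - y) = 0 \;\Longrightarrow\; w^\ast = (X^\top A X)^{-1} X^\top A y,$$

provided XᵀAX is invertible. In MATLAB this is computed more stably as w = (X' * A * X) \ (X' * A * y), reusing the variables from the sketch above.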
Here we will go through some questions to help you figure out how to use the probability and misclassification costs to make predictions.
(a) (2 pts) You received an email for which the spam filter predicts that it is spam with p = 0.8. We
want to make the decision that minimizes the expected cost.
Question: Should you classify this particular email as spam or non-spam? [Hint: Compare the
expected cost of classifying the email as spam versus non-spam, and choose the classification that
results in the lower expected cost.]
(b) (2 pts) The MAP decision rule would classify an email as spam if p > 0.5, but this rule does not
minimize expected cost in this case. We need a new rule that compares p to a different threshold
θ, whose value should be chosen to minimize the expected cost based on the costs in the table.
Question: What is the value of θ that works for the costs specified in Table 1? [Hint: To find
the threshold θ, set up the decision rule by comparing the expected cost of each decision, as you
did in (a), then solve for p in terms of the costs.]
(c) (2 pts) Now, imagine that the optimal decision rule would use θ = 1/5 as the threshold for
classifying an email as spam. Question: Can you provide a new cost table where this would be
the case? [Hint: Use the relationship between the costs and θ that you derived in part (b). Based
on this relationship, adjust the misclassification costs in the table to achieve θ = 1/5.]
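Since Table 1 is not reproduced here, the sketch below uses hypothetical placeholder costs c_fp (classifying non-spam as spam) and c_fn (classifying spam as non-spam), with correct decisions costing 0; under those assumptions the expected-cost comparison in (a) and the threshold in (b) work out as follows:

% Hedged sketch with placeholder costs (Table 1 is not shown here):
% c_fp = cost of classifying a non-spam email as spam (false positive),
% c_fn = cost of classifying a spam email as non-spam (false negative).
c_fp = 1;  c_fn = 4;               % placeholders chosen so theta = 1/5
p = 0.8;                           % filter's predicted probability of spam
cost_as_spam    = (1 - p) * c_fp;  % expected cost of the "spam" decision
cost_as_nonspam = p * c_fn;        % expected cost of the "non-spam" decision
theta = c_fp / (c_fp + c_fn);      % classify as spam whenever p > theta
fprintf('theta = %.2f, classify as spam: %d\n', theta, p > theta);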
4. Maximum A-Posteriori Estimation. [8pt] Suppose we observe the values of n IID random vari-
ables X1 , . . . , Xn drawn from a single Bernoulli distribution with parameter θ. In other words, for
each Xi , we know that P (Xi = 1) = θ and P (Xi = 0) = 1 − θ. In the Bayesian framework, we
treat θ as a random variable, and use a prior probability distribution over θ to express our prior
knowledge/preference about θ. In this framework, X1 , . . . , Xn can be viewed as generated by:
• First, the value of θ is drawn from a given prior probability distribution
• Second, X1 , . . . , Xn are drawn independently from a Bernoulli distribution with this θ value.
In this setting, Maximum A-Posteriori (MAP) estimation is a way to estimate θ by finding the value
that maximizes the posterior probability, given both its prior distribution and the observed data. The
MAP estimate of θ is given by:

$$\hat\theta_{\mathrm{MAP}} = \arg\max_{\hat\theta}\; L(\hat\theta)\, p(\hat\theta)$$

where L(θ̂) is the likelihood function of the data given θ, and p(θ̂) is the prior distribution over θ.
Now consider using a beta distribution as the prior: θ ∼ Beta(α, β), whose PDF is

$$p(\hat\theta) = \frac{\hat\theta^{\alpha-1}(1-\hat\theta)^{\beta-1}}{B(\alpha, \beta)}$$
(a) (3 pts) Derive the posterior distribution p(θ̂ | X₁, . . . , Xₙ, α, β). Comparing the form of the posterior
with that of the beta distribution, you will see that the posterior is also a beta distribution.
What are the updated α and β parameters for the posterior?
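For reference, a sketch of the standard conjugacy computation (writing k = Σᵢ Xᵢ for the number of ones observed; this is the usual argument, not a substitute for the full derivation asked for above):

$$p(\hat\theta \mid X_1, \dots, X_n, \alpha, \beta) \;\propto\; \hat\theta^{k}(1-\hat\theta)^{n-k} \cdot \hat\theta^{\alpha-1}(1-\hat\theta)^{\beta-1} \;=\; \hat\theta^{(\alpha+k)-1}(1-\hat\theta)^{(\beta+n-k)-1},$$

i.e., the posterior is Beta(α + k, β + n − k).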
(b) (2 pts) Suppose we use Beta(2, 2) as the prior. What Beta distribution do we get for the posterior
after we observe 5 coin tosses, 2 of which are heads? What is the posterior distribution of θ after
we observe 50 coin tosses, 20 of which are heads? (You don't need to write out the distributions;
simply providing the α and β parameters would suffice.)
(c) (1 pt) Plot the PDF of the prior Beta(2, 2) and the two posterior distributions. You can
use any software (e.g., R, Python, MATLAB) for this plot.
(d) (2 pts) Assume that θ = 0.4 is the true probability. How would the shape of the posterior change
as more and more coin tosses from this coin are observed? Will the MAP estimate converge
toward the true value?
5. Perceptron. [3pt] Assume a data set consists of only a single data point {(x, +1)}. How many times
would the Perceptron algorithm misclassify this point x before convergence? What if the initial weight
vector w₀ were initialized randomly rather than as the all-zero vector? Derive the number of mistakes
as a function of w₀ and x.
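One way to sanity-check the derivation is to simulate the standard Perceptron update w ← w + x, applied whenever wᵀx ≤ 0 (the label is +1), on the single point; the data point and initialization below are illustrative:

% Hedged sketch: count Perceptron mistakes on the single point (x, +1),
% using the standard update w <- w + x whenever w'*x <= 0 (label is +1).
x = [2; -1];             % illustrative data point
w = randn(size(x));      % random initial weight vector w0
mistakes = 0;
while w' * x <= 0        % the point is currently misclassified
    w = w + x;           % Perceptron update for a positive example
    mistakes = mistakes + 1;
end
fprintf('Number of mistakes: %d\n', mistakes);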
6. Bonus: MLE for multi-class logistic regression. [6 pts] Consider the maximum likelihood
estimation problem for multi-class logistic regression using the soft-max function defined below:
$$p(y = k \mid x) = \frac{\exp(w_k^\top x)}{\sum_{j=1}^{K} \exp(w_j^\top x)}$$
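As a small numerical companion (not part of the required derivation), the soft-max probabilities can be computed with the usual max-subtraction trick for numerical stability; the weights and input below are illustrative:

% Hedged sketch: soft-max class probabilities p(y = k | x).
% W is d-by-K with the weight vectors w_1, ..., w_K as columns; x is d-by-1.
W = [1 0 -1; 0 2 1];        % illustrative weights (d = 2, K = 3)
x = [0.5; -1];              % illustrative input
s = W' * x;                 % scores w_k' * x, one per class
s = s - max(s);             % subtract the max before exponentiating (stability)
p = exp(s) / sum(exp(s));   % p(y = k | x) for k = 1, ..., K; sums to 1
disp(p');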
Code.m (the plotting script for question 4(c)):
clear;
close all;
clc;
% Vector of x values
x = 0:0.001:1;
% Prior: Beta(2, 2)
prior = betapdf(x, 2, 2);
% Posterior after 5 tosses with 2 heads: Beta(2+2, 2+3) = Beta(4, 5)
after1 = betapdf(x, 4, 5);
% Posterior after 50 tosses with 20 heads: Beta(2+20, 2+30) = Beta(22, 32)
after2 = betapdf(x, 22, 32);
% Plot
figure;
plot(x, prior, 'DisplayName', 'Prior Beta(2,2)', 'LineWidth', 1.5);
hold on;
plot(x, after1, 'DisplayName', 'Posterior Beta(4,5)', 'LineWidth', 1.5);
plot(x, after2, 'DisplayName', 'Posterior Beta(22,32)', 'LineWidth', 1.5);
xlabel('Theta');
ylabel('Density');
legend('show');
grid on;
hold off;
[Figure: PDFs of the prior Beta(2,2) and the posteriors Beta(4,5) and Beta(22,32); x-axis: Theta, y-axis: Density.]