Assignment 6 Solutions
Policies:
For all multiple-choice questions, note that multiple correct answers may exist. However, selecting
an incorrect option will cancel out a correct one. For example, if you select two answers, one
correct and one incorrect, you will receive zero points for that question. Similarly, if the number
of incorrect answers selected exceeds the correct ones, your score for that question will be zero.
Please note that it is not possible to receive negative marks. You must select all the correct
options to get full marks for the question.
While the syllabus initially indicated the need to submit a paragraph explaining the use of AI or
other resources in your assignments, this requirement no longer applies as we are now utilizing
eClass quizzes instead of handwritten submissions. Therefore, you are not required to submit any
explanation regarding the tools or resources (such as online tools or AI) used in completing this
quiz.
This PDF version of the questions has been provided for your convenience should you wish to print
them and work offline.
Only answers submitted through the eClass quiz system will be graded. Please do not
submit a written copy of your responses.
Question 1. [1 mark]
Consider the predictor f(x) = xw, where w ∈ R is a one-dimensional parameter, and x represents the feature with no bias term. Suppose you are given a dataset of n data points D = ((x_1, y_1), (x_2, y_2), ..., (x_n, y_n)), where each y_i is the target variable corresponding to feature x_i. Let the loss function be the scaled squared loss ℓ(f(x), y) = c(f(x) − y)² where c ∈ R. The estimate of the expected loss for a parameter w ∈ R is defined as the following convex function:

L̂(w) = (1/n) Σ_{i=1}^n c(x_i w − y_i)²

What is the closed form solution for ŵ = argmin_{w ∈ R} L̂(w)?
Solution:
Answer: d.
Explanation: To find the closed-form solution for ŵ, we need to minimize L̂(w). This is equivalent to minimizing the function:

L̂(w) = (c/n) Σ_{i=1}^n (x_i w − y_i)²

Setting the derivative to zero, (2c/n) Σ_{i=1}^n x_i(x_i w − y_i) = 0, and solving for w gives ŵ = (Σ_{i=1}^n x_i y_i) / (Σ_{i=1}^n x_i²); the constant c cancels.
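As a quick numerical sanity check, here is a minimal NumPy sketch (the synthetic data, the grid range, and the value c = 7.3 are made up for illustration): the grid-search minimizer of L̂(w) does not depend on the positive constant c and matches the closed form Σ x_i y_i / Σ x_i².

```python
import numpy as np

# Minimal sketch (synthetic data assumed): the minimizer of the scaled
# squared loss does not depend on c and matches sum(x*y)/sum(x^2).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.1, size=100)

w_closed = np.sum(x * y) / np.sum(x ** 2)

def loss(w, c):
    return np.mean(c * (x * w - y) ** 2)

# Coarse grid search over w for two different positive values of c.
ws = np.linspace(-5, 5, 10001)
for c in (1.0, 7.3):
    w_grid = ws[np.argmin([loss(w, c) for w in ws])]
    assert abs(w_grid - w_closed) < 1e-2
print(w_closed)  # close to 2.0
```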
Question 2. [1 mark]
Let everything be defined as in the previous question. Suppose we consider the multivariate case where f(x) = x^T w, and w ∈ R^{d+1}. What is the closed form solution for ŵ = argmin_{w ∈ R^{d+1}} L̂(w)?
c. ŵ = (1/n) Σ_{i=1}^n x_i
d. ŵ = (Σ_{i=1}^n c x_i y_i) / (Σ_{i=1}^n c x_i²)
Solution:
Answer: a.
Explanation: To find the closed-form solution for ŵ, we minimize the expected loss defined as:

L̂(w) = (1/n) Σ_{i=1}^n c(x_i^T w − y_i)²
Setting the gradient to zero yields a linear system

Aw = b

where A = Σ_{i=1}^n x_i x_i^T and b = Σ_{i=1}^n x_i y_i (the constants c and 1/n cancel). Thus, assuming A is invertible, we can write:

ŵ = A⁻¹b
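Below is a minimal NumPy sketch of this closed form (the synthetic data and dimensions are made up); it solves Aw = b with np.linalg.solve rather than forming A⁻¹ explicitly, which is the numerically preferable route.

```python
import numpy as np

# Minimal sketch (synthetic data assumed): build A = sum_i x_i x_i^T and
# b = sum_i x_i y_i, then solve A w = b. The scale c cancels.
rng = np.random.default_rng(1)
n, d = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # bias column -> R^{d+1}
w_true = rng.normal(size=d + 1)
y = X @ w_true + rng.normal(scale=0.1, size=n)

A = X.T @ X            # sum of outer products x_i x_i^T
b = X.T @ y            # sum of x_i y_i
w_hat = np.linalg.solve(A, b)   # preferred over computing A^{-1} explicitly
print(np.allclose(w_hat, w_true, atol=0.1))  # True
```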
Question 3. [1 mark]
Let g(w) = −ln(w) Σ_{i=1}^n y_i − ln(1 − w) Σ_{i=1}^n (1 − y_i) where w ∈ R. We can rewrite this a bit more simply as g(w) = −s ln w − (n − s) ln(1 − w) where s = Σ_{i=1}^n y_i. What is the derivative g′(w) and the first order gradient descent update rule with a constant step size η?
a. g′(w) = −s/(1−w) + (n−s)/w and update rule w ← w − η(−s/(1−w) + (n−s)/w)
b. g′(w) = −s/w + (n−s)/(1−w) and update rule w ← w − η(−s/(1−w) + (n−s)/w)
c. g′(w) = −s/w + (n−s)/(1−w) and update rule w ← w − η(−s/w + (n−s)/(1−w))
d. g′(w) = −s/(1−w) − (n−s)/w and update rule w ← w − η(−s/(1−w) − (n−s)/w)
Solution:
Answer: c
Explanation: To find the derivative g′(w), we differentiate g(w):

g(w) = −s ln w − (n − s) ln(1 − w)

Taking the derivative:

g′(w) = −s(1/w) + (n − s)(1/(1 − w))

The gradient descent update rule with a constant step size η is given by:

w ← w − ηg′(w) = w − η(−s/w + (n−s)/(1−w))

Thus, the first-order gradient descent update rule becomes:

w ← w + η(s/w − (n−s)/(1−w))
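For illustration, here is a minimal Python sketch of this update (the values of n, s, η, and the starting point are made up); the iterates approach s/n, anticipating Question 5.

```python
import numpy as np

# Minimal sketch: first-order gradient descent on
# g(w) = -s*ln(w) - (n - s)*ln(1 - w), using the update from option c.
# The values of n, s, eta and the starting point are made up for illustration.
n, s = 20, 7
eta = 0.005
w = 0.5
for _ in range(2000):
    grad = -s / w + (n - s) / (1 - w)
    w = w - eta * grad
print(w)  # approaches s / n = 0.35
```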
Question 4. [1 mark]
Let everything be defined as in the previous question. What is the second derivative g 00 (w) and the
second order gradient descent update rule?
a. g″(w) = s/w² − (n−s)/(1−w)² and update: w ← w − (−s/w + (n−s)/(1−w)) / (s/w² − (n−s)/(1−w)²)
b. g″(w) = s/w² + (n−s)/(1−w)² and update: w ← w − (−s/w + (n−s)/(1−w)) / (s/w² + (n−s)/(1−w)²)
c. g″(w) = −s/w² + (n−s)/(1−w)² and update: w ← w − (−s/w + (n−s)/(1−w)) / (−s/w² + (n−s)/(1−w)²)
d. g″(w) = s/w² + (n−s)/(1−w)² and update: w ← w + (−s/w + (n−s)/(1−w)) / (s/w² + (n−s)/(1−w)²)
Solution:
Answer: b.
Explanation: To find the second derivative:

g″(w) = s/w² + (n−s)/(1−w)²

The first derivative is given by:

g′(w) = −s/w + (n−s)/(1−w)

Thus, the second-order gradient descent update rule is:

w ← w − g′(w)/g″(w) = w − (−s/w + (n−s)/(1−w)) / (s/w² + (n−s)/(1−w)²)
Question 5. [1 mark]
Let everything be defined as in the previous question. What is the closed form solution for the minimizer w∗ of g(w)?
a. w∗ = n/s
b. w∗ = s/(n − s)
c. w∗ = s/(s − n)
d. w∗ = s/n
Solution:
Answer: d.
Explanation: Set derivative of g(w) to zero and solve for w:
−s/w + (n−s)/(1−w) = 0 ⟹ s/w = (n−s)/(1−w) ⟹ w = s/n.
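The two previous questions can be checked together with a minimal Python sketch (n, s, and the starting point are made up): the second-order update from Question 4 converges to the closed-form minimizer s/n.

```python
import numpy as np

# Minimal sketch (n, s and the starting point are made up): the second-order
# (Newton) update from Question 4 converges to the closed form w* = s/n.
n, s = 20, 7
w = 0.5
for _ in range(10):
    g1 = -s / w + (n - s) / (1 - w)        # g'(w)
    g2 = s / w**2 + (n - s) / (1 - w)**2   # g''(w)
    w = w - g1 / g2
print(w, s / n)  # both ~0.35
```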
Question 6. [1 mark]
Let g(w) = w4 + e−w where w ∈ R. What is the derivative g 0 (w) and the first order gradient
descent update rule with a constant step size η?
Solution:
Answer: a.
Explanation: To find the derivative g′(w):

g′(w) = d/dw(w⁴) + d/dw(e^{−w}) = 4w³ − e^{−w}.

The first-order gradient descent update rule is given by:

w ← w − η(4w³ − e^{−w}).
Question 7. [1 mark]
Let everything be defined as in the previous question. What is the second derivative g 00 (w) and the
second order gradient descent update rule?
a. g″(w) = 12w² − e^{−w} and update: w ← w − (4w³ − e^{−w})/(12w² − e^{−w})
b. g″(w) = 12w² + e^{−w} and update: w ← w + (4w³ − e^{−w})/(12w² + e^{−w})
c. g″(w) = 12w² + e^{−w} and update: w ← w − (4w³ − e^{−w})/(12w² + e^{−w})
d. g″(w) = 12w² − e^{−w} and update: w ← w + (4w³ − e^{−w})/(12w² − e^{−w})
Solution:
Answer: c.
Explanation: To find the second derivative g″(w), differentiate the first derivative g′(w) = 4w³ − e^{−w}:

g″(w) = d/dw(4w³) − d/dw(e^{−w}) = 12w² + e^{−w}.

The second-order gradient descent update rule is given by:

w ← w − g′(w)/g″(w) = w − (4w³ − e^{−w})/(12w² + e^{−w}).
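Since sign mistakes are easy here, a minimal Python sketch (the evaluation point w = 0.7 is arbitrary) compares both derivatives against central finite differences.

```python
import numpy as np

# Minimal sketch: check the derivatives of g(w) = w^4 + e^{-w} against
# central finite differences at an arbitrary point (w = 0.7 is made up).
g = lambda w: w**4 + np.exp(-w)
g1 = lambda w: 4 * w**3 - np.exp(-w)       # first derivative (Question 6)
g2 = lambda w: 12 * w**2 + np.exp(-w)      # second derivative (Question 7)

w, h = 0.7, 1e-5
fd1 = (g(w + h) - g(w - h)) / (2 * h)
fd2 = (g(w + h) - 2 * g(w) + g(w - h)) / h**2
print(abs(fd1 - g1(w)) < 1e-6, abs(fd2 - g2(w)) < 1e-4)  # True True
```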
Question 8. [1 mark]
Let everything be defined as in the previous question. For the second order update rule, calculate
w(1) if w(0) = 0.
Solution:
Answer: 1.
Explanation: Let g(w) = w⁴ + e^{−w} where w ∈ R. We want to compute w^(1) using the second-order gradient descent update rule, given w^(0) = 0.

The second-order update rule is given by:

w^(1) = w^(0) − g′(w^(0))/g″(w^(0))

Step 1: Calculate g′(w). The first derivative is g′(w) = 4w³ − e^{−w}. Substituting w^(0) = 0:

g′(0) = 4(0)³ − e⁰ = −1

Step 2: Calculate g″(w). The second derivative is g″(w) = 12w² + e^{−w}. Substituting w^(0) = 0:

g″(0) = 12(0)² + e⁰ = 1

Step 3: Update the value of w. Now we can compute w^(1):

w^(1) = 0 − (−1)/1 = 0 + 1 = 1
Question 9. [1 mark]
Let everything be defined as in the previous question. Change the step size to be calculated using the normalized gradient. For the first order update rule, calculate w^(1) if w^(0) = 0 and η = 1. Only for this problem, set ε = 0.
Solution:
Answer: 1.
Explanation: We know that the derivative at w^(0) = 0 is g′(0) = −1. The normalized gradient step size is given by

η^(0) = η / |g′(w^(0))| = 1

Therefore

w^(1) = w^(0) − η^(0) g′(w^(0)) = 0 − 1 × (−1) = 1.
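A minimal Python check of Questions 8 and 9 (with ε taken as 0, as stated above): starting from w^(0) = 0, both the second-order update and the normalized-gradient update give w^(1) = 1.

```python
import numpy as np

# Minimal sketch reproducing Questions 8 and 9: starting from w = 0,
# both the second-order update and the normalized-gradient update give w = 1.
g1 = lambda w: 4 * w**3 - np.exp(-w)   # g'(w)
g2 = lambda w: 12 * w**2 + np.exp(-w)  # g''(w)

w0 = 0.0
w1_newton = w0 - g1(w0) / g2(w0)                 # = 0 - (-1)/1 = 1
eta = 1.0
w1_normed = w0 - (eta / abs(g1(w0))) * g1(w0)    # epsilon set to 0 as in Question 9
print(w1_newton, w1_normed)  # 1.0 1.0
```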
b. w^(t+1) = w^(t) − η(2w_1^(t)(w_2^(t))² − e^{−w_1^(t)}, 2w_2^(t)(w_1^(t))² − e^{−w_2^(t)})^T
c. w^(t+1) = w^(t) − η(2w_1^(t)(w_2^(t))² + e^{−w_1^(t)}, 2w_2^(t)(w_1^(t))² + e^{−w_2^(t)})^T
d. w^(t+1) = w^(t) − η(−2w_1^(t)(w_2^(t))² + e^{−w_1^(t)}, −2w_2^(t)(w_1^(t))² + e^{−w_2^(t)})^T
Solution:
Answer: b
Explanation: Let g(w) = g(w_1, w_2) = w_1²w_2² + e^{−w_1} + e^{−w_2} where w ∈ R².

Step 1: Calculate the gradient ∇g(w). The gradient is given by:

∇g(w) = (∂g/∂w_1, ∂g/∂w_2)^T

Calculating the partial derivatives:

∂g/∂w_1 = 2w_1w_2² − e^{−w_1}
∂g/∂w_2 = 2w_2w_1² − e^{−w_2}

Step 2: Write the gradient. Thus, the gradient is:

∇g(w) = (2w_1w_2² − e^{−w_1}, 2w_2w_1² − e^{−w_2})^T

Step 3: Gradient descent update rule. The first-order gradient descent update rule is given by:

w^(t+1) = w^(t) − η∇g(w^(t)) = w^(t) − η(2w_1^(t)(w_2^(t))² − e^{−w_1^(t)}, 2w_2^(t)(w_1^(t))² − e^{−w_2^(t)})^T
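Here is a minimal Python sketch (the evaluation point and step size are made up) that verifies this gradient with central finite differences and then takes one descent step of the update above.

```python
import numpy as np

# Minimal sketch: verify the gradient of g(w1, w2) = w1^2 w2^2 + e^{-w1} + e^{-w2}
# with central finite differences at an arbitrary (made-up) point.
def g(w):
    w1, w2 = w
    return w1**2 * w2**2 + np.exp(-w1) + np.exp(-w2)

def grad_g(w):
    w1, w2 = w
    return np.array([2 * w1 * w2**2 - np.exp(-w1),
                     2 * w2 * w1**2 - np.exp(-w2)])

w = np.array([0.3, -1.2])
h = 1e-5
fd = np.array([(g(w + h * e) - g(w - h * e)) / (2 * h) for e in np.eye(2)])
print(np.allclose(fd, grad_g(w)))  # True

# One gradient descent step with the update from option b (eta is made up):
eta = 0.1
w_next = w - eta * grad_g(w)
```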
Solution:
Answer: True.
Explanation: For any f ∈ G we have L̂(f) ≥ min_{g∈G} L̂(g), since the right-hand side is the minimum over G. Because F ⊂ G, this inequality holds in particular for every f ∈ F. Taking the minimum over F on the left-hand side gives min_{f∈F} L̂(f) ≥ min_{g∈G} L̂(g), which is the result.
Solution:
False. It's (2+4 choose 4) = 6 · 5/2 = 15.
φ(x) = (x_1, x_2, x_1², x_1x_2, x_2², x_1³, x_1²x_2, x_1x_2², x_2³, x_1⁴, x_1³x_2, x_1²x_2², x_1x_2³, x_2⁴).
True or False?
Solution:
False. The constant term 1 is missing.
F̄_p = {f | f : R^{d+1} → R, and f(x) = log(φ_p(x)^T w), for some w ∈ R^{p̄}}.
Solution:
True. As we increase the degree the function class becomes more expressive.
Solution:
True. Irreducible error can be reduced by adding more features that are relevant to the prediction
task.
Solution:
False. Estimation error can be reduced by adding more data points or by using a simpler model.
Solution:
True. Approximation error can be reduced by using a more complex model.
Solution:
False. We need to make p smaller.
Solution:
Answer: c. Variance large, bias small, overfit.
Solution:
Answer: b. Variance small, bias large, underfit.
Solution:
Answer: c. Variance small, bias small, neither overfit nor underfit.
Solution:
True. Decreasing the value of λ will reduce the regularization strength and allow the model to fit the data better.
a. α_MLE = (1/n) Σ_{i=1}^n α_i
b. α_MLE = (1/n) Σ_{i=1}^n z_i
c. α_MLE = (1/(n−1)) Σ_{i=1}^n z_i
d. α_MLE = (1/n) Σ_{i=1}^{n−1} z_i
Solution:
Answer: b.
Explanation: The probability of each flip z_i is p(z_i | α) = α^{z_i}(1 − α)^{1−z_i}. The likelihood is:

p(D | α) = Π_{i=1}^n α^{z_i}(1 − α)^{1−z_i}

Differentiating and setting d/dα(−log p(D | α)) = 0, we find:

d/dα(−log p(D | α)) = −(Σ_{i=1}^n z_i)/α + (n − Σ_{i=1}^n z_i)/(1 − α) = 0
⟹ (Σ_{i=1}^n z_i)/α = (n − Σ_{i=1}^n z_i)/(1 − α)
⟹ (1 − α) Σ_{i=1}^n z_i = α(n − Σ_{i=1}^n z_i)
⟹ Σ_{i=1}^n z_i − α Σ_{i=1}^n z_i = αn − α Σ_{i=1}^n z_i
⟹ Σ_{i=1}^n z_i = αn
⟹ α = (1/n) Σ_{i=1}^n z_i = α_MLE
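For illustration, a minimal Python sketch (the coin flips are simulated with a made-up bias of 0.3): a grid search over the Bernoulli log-likelihood lands on the sample mean, matching option b.

```python
import numpy as np

# Minimal sketch (the coin flips are simulated): the sample mean of the z_i
# maximizes the Bernoulli log-likelihood, matching alpha_MLE = (1/n) sum z_i.
rng = np.random.default_rng(2)
z = rng.binomial(1, 0.3, size=500)

def log_lik(alpha):
    return np.sum(z * np.log(alpha) + (1 - z) * np.log(1 - alpha))

alphas = np.linspace(0.001, 0.999, 999)
alpha_grid = alphas[np.argmax([log_lik(a) for a in alphas])]
print(z.mean(), alpha_grid)  # the two agree up to the grid resolution
```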
where p(·|·) is the density of the above Gaussian distribution. What is the partial derivative of g with respect to w_1?
a. ∂g/∂w_1 = Σ_{i=1}^n x_i(y_i − x_iw_1)/exp(x_iw_2)
b. ∂g/∂w_1 = Σ_{i=1}^n (y_i − x_iw_1)²/(2 exp(x_iw_2))
c. ∂g/∂w_1 = −Σ_{i=1}^n x_i(y_i − x_iw_1)/exp(x_iw_2)
d. ∂g/∂w_1 = −Σ_{i=1}^n (y_i − x_iw_1)²/exp(x_iw_2)
Solution:
Answer: c.
Explanation: To find ∂g/∂w_1, note that the density of Y | X is

p(y_i | x_i, w) = (1/√(2π exp(x_iw_2))) exp(−(y_i − x_iw_1)²/(2 exp(x_iw_2)))

The negative log-likelihood term g_i(w) is:

g_i(w) = (y_i − x_iw_1)²/(2 exp(x_iw_2)) + (1/2) ln(2π exp(x_iw_2))

Differentiating g_i(w) with respect to w_1 and summing over i gives:

∂g/∂w_1 = −Σ_{i=1}^n x_i(y_i − x_iw_1)/exp(x_iw_2).
b. Σ_{i=1}^n [(y_i − x_iw_1)²/(2 exp(x_iw_2)) + x_i/2]
c. Σ_{i=1}^n [x_i(y_i − x_iw_1)²/(2 exp(x_iw_2)) − x_i/2]
d. Σ_{i=1}^n [−x_i(y_i − x_iw_1)²/(2 exp(x_iw_2)) + x_i/2]
Solution:
Answer: d.
Explanation: To find ∂g/∂w_2, we start with the expression for g_i(w):

g_i(w) = (y_i − x_iw_1)²/(2 exp(x_iw_2)) + (1/2) ln(2π exp(x_iw_2))

Differentiating g_i(w) with respect to w_2 and summing over i gives:

∂g/∂w_2 = Σ_{i=1}^n [−x_i(y_i − x_iw_1)²/(2 exp(x_iw_2)) + x_i/2].
Solution:
Answer: b.
Explanation: Plugging in the partial derivatives from the previous questions into the gradient
descent update rule, we get:
w_1 ← w_1 + η Σ_{i=1}^n x_i(y_i − x_iw_1)/exp(x_iw_2),

w_2 ← w_2 + η Σ_{i=1}^n [x_i(y_i − x_iw_1)²/(2 exp(x_iw_2)) − x_i/2].
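To close, a minimal Python sketch of these updates on simulated data (the data-generating values, initial point, and step size are all made up); gradient descent roughly recovers the parameters used to generate the data.

```python
import numpy as np

# Minimal sketch (data, initial point, and step size are made up): run the
# first-order updates above on the negative log-likelihood of the model
# y_i ~ N(x_i w1, exp(x_i w2)).
rng = np.random.default_rng(3)
n = 500
x = rng.uniform(0.5, 2.0, size=n)
w1_true, w2_true = 1.5, 0.4
y = x * w1_true + rng.normal(scale=np.sqrt(np.exp(x * w2_true)))

w1, w2, eta = 0.0, 0.0, 1e-4
for _ in range(5000):
    resid = y - x * w1
    var = np.exp(x * w2)
    w1 = w1 + eta * np.sum(x * resid / var)                   # update for w1
    w2 = w2 + eta * np.sum(x * resid**2 / (2 * var) - x / 2)  # update for w2
print(w1, w2)  # roughly recovers (1.5, 0.4) up to sampling noise
```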