
Homework Assignment 6

Due: Friday, November 8, 2024, 11:59 p.m. Mountain time


Total marks: 26

Policies:
For all multiple-choice questions, note that multiple correct answers may exist. However, selecting
an incorrect option will cancel out a correct one. For example, if you select two answers, one
correct and one incorrect, you will receive zero points for that question. Similarly, if the number
of incorrect answers selected exceeds the correct ones, your score for that question will be zero.
Please note that it is not possible to receive negative marks. You must select all the correct
options to get full marks for the question.
While the syllabus initially indicated the need to submit a paragraph explaining the use of AI or
other resources in your assignments, this requirement no longer applies as we are now utilizing
eClass quizzes instead of handwritten submissions. Therefore, you are not required to submit any
explanation regarding the tools or resources (such as online tools or AI) used in completing this
quiz.
This PDF version of the questions has been provided for your convenience should you wish to print
them and work offline.
Only answers submitted through the eClass quiz system will be graded. Please do not
submit a written copy of your responses.

Question 1. [1 mark]
Consider the predictor f(x) = xw, where w ∈ R is a one-dimensional parameter, and x represents the feature with no bias term. Suppose you are given a dataset of n data points D = ((x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)), where each y_i is the target variable corresponding to feature x_i. Let the loss function be the scaled squared loss ℓ(f(x), y) = c(f(x) − y)^2 where c ∈ R. The estimate of the expected loss for a parameter w ∈ R is defined as the following convex function:

\hat{L}(w) = \frac{1}{n} \sum_{i=1}^{n} c(x_i w - y_i)^2

What is the closed form solution for ŵ = arg min_{w ∈ R} L̂(w)?


a. \hat{w} = \frac{\sum_{i=1}^{n} c x_i y_i}{\sum_{i=1}^{n} x_i^2}

b. \hat{w} = \frac{\sum_{i=1}^{n} y_i}{n}

c. \hat{w} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i}

d. \hat{w} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}

Solution:
Answer: d.
Explanation: To find the closed-form solution for ŵ, we need to minimize L̂(w). This is equivalent
to minimizing the function:

\hat{L}(w) = \frac{c}{n} \sum_{i=1}^{n} (x_i w - y_i)^2

Taking the derivative with respect to w and setting it to zero gives:


\frac{\partial \hat{L}}{\partial w} = \frac{c}{n} \sum_{i=1}^{n} 2(x_i w - y_i) x_i = 0

Simplifying leads to:


\sum_{i=1}^{n} x_i (x_i w) = \sum_{i=1}^{n} x_i y_i

Thus, solving for w yields:


\hat{w} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}
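
As a quick illustration, here is a minimal NumPy sketch of this closed form (the arrays x and y are hypothetical example data, and the scale c is assumed positive so that it does not change the argmin):

import numpy as np

# Hypothetical 1-D data (feature and target), purely for illustration.
x = np.array([0.5, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.0, 3.9, 6.2])

# Closed-form minimizer of (1/n) * sum_i c * (x_i * w - y_i)^2;
# a positive scale c does not change the minimizer.
w_hat = np.sum(x * y) / np.sum(x ** 2)
print(w_hat)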

Question 2. [1 mark]
Let everything be defined as in the previous question. Suppose we consider the multivariate case where f(x) = x^⊤ w, and w ∈ R^{d+1}. What is the closed form solution for ŵ = arg min_{w ∈ R^{d+1}} L̂(w)?

a. \hat{w} = A^{-1} b where A = \sum_{i=1}^{n} x_i x_i^\top and b = \sum_{i=1}^{n} x_i y_i (assume that A is invertible).

b. \hat{w} = A x where A = \sum_{i=1}^{n} x_i x_i^\top

c. \hat{w} = \frac{1}{n} \sum_{i=1}^{n} x_i

d. \hat{w} = \frac{\sum_{i=1}^{n} c x_i y_i}{\sum_{i=1}^{n} c x_i^2}

Solution:
Answer: a.
Explanation: To find the closed-form solution for ŵ, we minimize the expected loss defined as:
\hat{L}(w) = \frac{1}{n} \sum_{i=1}^{n} c(x_i^\top w - y_i)^2

Taking the derivative with respect to w and setting it to zero gives:


\frac{\partial \hat{L}}{\partial w} = \frac{c}{n} \sum_{i=1}^{n} 2(x_i^\top w - y_i) x_i = 0

This simplifies to:


\sum_{i=1}^{n} x_i (x_i^\top w) = \sum_{i=1}^{n} y_i x_i

We can express this using the definitions A and b:


Aw = b
Thus, we can write:

\hat{w} = A^{-1} b
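
A minimal NumPy sketch of the multivariate closed form (the design matrix X, whose rows are the x_i, and the targets y are hypothetical; solving the linear system A w = b is used instead of forming A^{-1} explicitly):

import numpy as np

# Hypothetical design matrix (first column is the bias feature) and targets.
X = np.array([[1.0, 0.5, 2.0],
              [1.0, 1.5, 0.3],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 0.7]])
y = np.array([1.0, 2.1, 3.2, 4.8])

A = X.T @ X                    # sum_i x_i x_i^T
b = X.T @ y                    # sum_i x_i y_i
w_hat = np.linalg.solve(A, b)  # solves A w = b, assuming A is invertible
print(w_hat)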

Question 3. [1 mark]
Let g(w) = -\ln(w) \sum_{i=1}^{n} y_i - \ln(1 - w) \sum_{i=1}^{n} (1 - y_i), where w ∈ R. We can rewrite this a bit more simply as g(w) = -s \ln w - (n - s) \ln(1 - w), where s = \sum_{i=1}^{n} y_i. What is the derivative g'(w) and the first-order gradient descent update rule with a constant step size η?
 
a. g'(w) = -\frac{s}{1-w} + \frac{n-s}{w} and update rule w \leftarrow w - \eta\left(-\frac{s}{1-w} + \frac{n-s}{w}\right)

b. g'(w) = -\frac{s}{w} + \frac{n-s}{1-w} and update rule w \leftarrow w - \eta\left(-\frac{s}{1-w} + \frac{n-s}{w}\right)

c. g'(w) = -\frac{s}{w} + \frac{n-s}{1-w} and update rule w \leftarrow w - \eta\left(-\frac{s}{w} + \frac{n-s}{1-w}\right)

d. g'(w) = -\frac{s}{1-w} - \frac{n-s}{w} and update rule w \leftarrow w - \eta\left(-\frac{s}{1-w} - \frac{n-s}{w}\right)

Solution:
Answer: c.
Explanation: To find the derivative g'(w), we differentiate g(w):

g(w) = -s \ln w - (n - s) \ln(1 - w)

Taking the derivative:

g'(w) = -s \cdot \frac{1}{w} + (n - s) \cdot \frac{1}{1 - w}

The gradient descent update rule with a constant step size η is given by:

w \leftarrow w - \eta g'(w) = w - \eta\left(-\frac{s}{w} + \frac{n - s}{1 - w}\right)

Thus, the first-order gradient descent update rule becomes:

w \leftarrow w + \eta\left(\frac{s}{w} - \frac{n - s}{1 - w}\right)
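
A small sketch of this first-order update on the Bernoulli negative log-likelihood (the values of n, s, the step size, the starting point, and the iteration count are arbitrary choices for illustration):

# Hypothetical data summary: s ones out of n observations.
n, s = 20, 7
eta = 0.001    # constant step size
w = 0.5        # initial guess, kept strictly inside (0, 1)

for _ in range(1000):
    grad = -s / w + (n - s) / (1 - w)   # g'(w)
    w = w - eta * grad                  # first-order update
print(w)   # approaches s / n = 0.35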

Question 4. [1 mark]
Let everything be defined as in the previous question. What is the second derivative g''(w) and the second-order gradient descent update rule?
a. g''(w) = \frac{s}{w^2} - \frac{n-s}{(1-w)^2} and update: w \leftarrow w - \frac{-\frac{s}{w} + \frac{n-s}{1-w}}{\frac{s}{w^2} - \frac{n-s}{(1-w)^2}}

b. g''(w) = \frac{s}{w^2} + \frac{n-s}{(1-w)^2} and update: w \leftarrow w - \frac{-\frac{s}{w} + \frac{n-s}{1-w}}{\frac{s}{w^2} + \frac{n-s}{(1-w)^2}}

c. g''(w) = -\frac{s}{w^2} + \frac{n-s}{(1-w)^2} and update: w \leftarrow w - \frac{-\frac{s}{w} + \frac{n-s}{1-w}}{-\frac{s}{w^2} + \frac{n-s}{(1-w)^2}}

d. g''(w) = \frac{s}{w^2} + \frac{n-s}{(1-w)^2} and update: w \leftarrow w + \frac{-\frac{s}{w} + \frac{n-s}{1-w}}{\frac{s}{w^2} + \frac{n-s}{(1-w)^2}}

Solution:
Answer: b.
Explanation: To find the second derivative:
g''(w) = \frac{s}{w^2} + \frac{n - s}{(1 - w)^2}

The first derivative is given by:

g'(w) = -\frac{s}{w} + \frac{n - s}{1 - w}

Thus, the second-order gradient descent update rule is:

w \leftarrow w - \frac{g'(w)}{g''(w)} = w - \frac{-\frac{s}{w} + \frac{n - s}{1 - w}}{\frac{s}{w^2} + \frac{n - s}{(1 - w)^2}}
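
The second-order (Newton) version of the same update, again as a hedged sketch with the same hypothetical n and s as above:

n, s = 20, 7
w = 0.5
for _ in range(10):
    g1 = -s / w + (n - s) / (1 - w)        # g'(w)
    g2 = s / w**2 + (n - s) / (1 - w)**2   # g''(w)
    w = w - g1 / g2                        # second-order update
print(w)   # converges to s / n = 0.35 in a few iterations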

Question 5. [1 mark]
Let everything be defined as in the previous question. What is the closed form solution for

w^* = \arg\min_{w \in R} g(w)

a. w∗ = n/s

b. w∗ = s/(n − s)

c. w∗ = s/(s − n)

d. w∗ = s/n

Solution:
Answer: d.
Explanation: Set derivative of g(w) to zero and solve for w:
-\frac{s}{w} + \frac{n - s}{1 - w} = 0 \implies \frac{s}{w} = \frac{n - s}{1 - w} \implies w = \frac{s}{n}.
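
A quick numerical sanity check that the minimizer is s/n (the values of n and s and the grid are arbitrary illustrations):

import numpy as np

n, s = 20, 7
w_grid = np.linspace(1e-4, 1 - 1e-4, 100001)
g = -s * np.log(w_grid) - (n - s) * np.log(1 - w_grid)
print(w_grid[np.argmin(g)], s / n)   # both approximately 0.35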

Question 6. [1 mark]
Let g(w) = w^4 + e^{-w} where w ∈ R. What is the derivative g'(w) and the first-order gradient descent update rule with a constant step size η?

a. g'(w) = 4w^3 - e^{-w} and update: w \leftarrow w - \eta(4w^3 - e^{-w})

b. g'(w) = 4w^3 + e^{-w} and update: w \leftarrow w - \eta(4w^3 + e^{-w})

c. g'(w) = 4w^3 + e^{-w} and update: w \leftarrow w + \eta(4w^3 + e^{-w})

d. g'(w) = 4w^3 - e^{-w} and update: w \leftarrow w + \eta(4w^3 - e^{-w})

Solution:
Answer: a.
Explanation: To find the derivative g'(w):

g'(w) = \frac{d}{dw}(w^4) + \frac{d}{dw}(e^{-w}) = 4w^3 - e^{-w}.

The first-order gradient descent update rule is given by:

w \leftarrow w - \eta g'(w) = w - \eta(4w^3 - e^{-w}).
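
A minimal sketch of this first-order update for g(w) = w^4 + e^{-w} (step size, starting point, and iteration count are arbitrary choices):

import math

eta = 0.1
w = 0.0
for _ in range(200):
    grad = 4 * w**3 - math.exp(-w)   # g'(w) = 4w^3 - e^{-w}
    w = w - eta * grad               # first-order update
print(w)   # approaches the stationary point where 4w^3 = e^{-w} (about 0.53)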

Question 7. [1 mark]
Let everything be defined as in the previous question. What is the second derivative g''(w) and the second-order gradient descent update rule?

a. g''(w) = 12w^2 - e^{-w} and update: w \leftarrow w - \frac{4w^3 - e^{-w}}{12w^2 - e^{-w}}

b. g''(w) = 12w^2 + e^{-w} and update: w \leftarrow w + \frac{4w^3 - e^{-w}}{12w^2 + e^{-w}}

c. g''(w) = 12w^2 + e^{-w} and update: w \leftarrow w - \frac{4w^3 - e^{-w}}{12w^2 + e^{-w}}

d. g''(w) = 12w^2 - e^{-w} and update: w \leftarrow w + \frac{4w^3 - e^{-w}}{12w^2 - e^{-w}}

Solution:
Answer: c.
Explanation: To find the second derivative g''(w):
1. First, we have the first derivative:

g'(w) = 4w^3 - e^{-w}.

2. Next, we differentiate g'(w) to get g''(w):

g''(w) = \frac{d}{dw}(4w^3) - \frac{d}{dw}(e^{-w}) = 12w^2 + e^{-w}.

The second-order gradient descent update rule is given by:

w \leftarrow w - \frac{g'(w)}{g''(w)} = w - \frac{4w^3 - e^{-w}}{12w^2 + e^{-w}}.

Question 8. [1 mark]
Let everything be defined as in the previous question. For the second-order update rule, calculate w^{(1)} if w^{(0)} = 0.


Solution:
Answer: 1.
Explanation: Let g(w) = w^4 + e^{-w} where w ∈ R. We want to compute w^{(1)} using the second-order gradient descent update rule, given w^{(0)} = 0.
The second-order update rule is given by:

w^{(1)} = w^{(0)} - \frac{g'(w^{(0)})}{g''(w^{(0)})}

Step 1: Calculate g'(w). The first derivative is:

g'(w) = 4w^3 - e^{-w}

Substituting w^{(0)} = 0:

g'(0) = 4(0)^3 - e^0 = -1

Step 2: Calculate g''(w). The second derivative is:

g''(w) = 12w^2 + e^{-w}

Substituting w^{(0)} = 0:

g''(0) = 12(0)^2 + e^0 = 1

Step 3: Update the value of w. Now we can compute w^{(1)}:

w^{(1)} = 0 - \frac{-1}{1} = 0 + 1 = 1
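
The arithmetic of this single Newton step can be checked directly; a small illustrative sketch:

import math

def g1(w): return 4 * w**3 - math.exp(-w)    # g'(w)
def g2(w): return 12 * w**2 + math.exp(-w)   # g''(w)

w0 = 0.0
w1 = w0 - g1(w0) / g2(w0)   # 0 - (-1)/1
print(w1)                   # 1.0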

Question 9. [1 mark]
Let everything be defined as in the previous question. Change the step size to be calculated using
the normalized gradient. For the first-order update rule, calculate w^{(1)} if w^{(0)} = 0 and η = 1. Only for
this problem, set ε = 0.

Solution:
Answer: 1.
Explanation: We know that the derivative at w^{(0)} = 0 is g'(0) = -1. The normalized-gradient step size is given by

\eta^{(0)} = \frac{\eta}{|g'(w^{(0)})|} = 1

Therefore

w^{(1)} = w^{(0)} - \eta^{(0)} g'(w^{(0)}) = 0 - 1 \times (-1) = 1.
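
The normalized-gradient step (with ε = 0, as the question specifies) can be checked the same way; a small illustrative sketch:

import math

eta = 1.0
w0 = 0.0
grad = 4 * w0**3 - math.exp(-w0)   # g'(0) = -1
step = eta / abs(grad)             # normalized step size, with epsilon = 0
w1 = w0 - step * grad              # 0 - 1 * (-1)
print(w1)                          # 1.0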

Question 10. [1 mark]


Let g(w) = g(w_1, w_2) = w_1^2 w_2^2 + e^{-w_1} + e^{-w_2} where w ∈ R^2. What is the gradient of g(w) and the first-order gradient descent update rule with a constant step size η?

a. w^{(t+1)} = w^{(t)} - \eta \left( 2w_1^{(t)} (w_2^{(t)})^2, \; 2w_2^{(t)} (w_1^{(t)})^2 \right)^\top

b. w^{(t+1)} = w^{(t)} - \eta \left( 2w_1^{(t)} (w_2^{(t)})^2 - e^{-w_1^{(t)}}, \; 2w_2^{(t)} (w_1^{(t)})^2 - e^{-w_2^{(t)}} \right)^\top

c. w^{(t+1)} = w^{(t)} - \eta \left( 2w_1^{(t)} (w_2^{(t)})^2 + e^{-w_1^{(t)}}, \; 2w_2^{(t)} (w_1^{(t)})^2 + e^{-w_2^{(t)}} \right)^\top

d. w^{(t+1)} = w^{(t)} - \eta \left( -2w_1^{(t)} (w_2^{(t)})^2 + e^{-w_1^{(t)}}, \; -2w_2^{(t)} (w_1^{(t)})^2 + e^{-w_2^{(t)}} \right)^\top

Solution:
Answer: b.
Explanation: Let g(w) = g(w_1, w_2) = w_1^2 w_2^2 + e^{-w_1} + e^{-w_2} where w ∈ R^2.
Step 1: Calculate the gradient ∇g(w). The gradient is given by:

\nabla g(w) = \left( \frac{\partial g}{\partial w_1}, \frac{\partial g}{\partial w_2} \right)^\top

Calculating the partial derivatives:

\frac{\partial g}{\partial w_1} = 2w_1 w_2^2 - e^{-w_1}

\frac{\partial g}{\partial w_2} = 2w_2 w_1^2 - e^{-w_2}

Step 2: Write the gradient. Thus, the gradient is:

\nabla g(w) = \left( 2w_1 w_2^2 - e^{-w_1}, \; 2w_2 w_1^2 - e^{-w_2} \right)^\top

Step 3: Gradient descent update rule. The first-order gradient descent update rule is given by:

w^{(t+1)} = w^{(t)} - \eta \nabla g(w^{(t)})

Plugging in the gradient:

w^{(t+1)} = w^{(t)} - \eta \left( 2w_1^{(t)} (w_2^{(t)})^2 - e^{-w_1^{(t)}}, \; 2w_2^{(t)} (w_1^{(t)})^2 - e^{-w_2^{(t)}} \right)^\top
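
A short sketch of this two-dimensional update (the step size, starting point, and iteration count are arbitrary choices for illustration):

import numpy as np

def grad_g(w):
    w1, w2 = w
    return np.array([2 * w1 * w2**2 - np.exp(-w1),
                     2 * w2 * w1**2 - np.exp(-w2)])

eta = 0.05
w = np.array([1.0, 1.0])
for _ in range(500):
    w = w - eta * grad_g(w)   # first-order update
print(w, grad_g(w))           # the gradient is approximately zero at the end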

Question 11. [1 mark]


If F ⊂ G, then is it true that min_{f∈F} L̂(f) ≥ min_{g∈G} L̂(g)?

Solution:
Answer: True.
Explanation: For any f ∈ G, we know that L̂(f) ≥ min_{g∈G} L̂(g), since the right-hand side is the minimum value over G. Since F ⊂ G, this holds in particular for every f ∈ F. Taking the minimum over f ∈ F on the left-hand side gives min_{f∈F} L̂(f) ≥ min_{g∈G} L̂(g).

Question 12. [1 mark]


Consider the setting of polynomial regression. Let d = 2, such that x = (x_0 = 1, x_1, x_2), and p = 4, then p̄ = 10. True or False?


Solution:
False. It's \binom{2+4}{4} = \frac{6 \cdot 5}{2} = 15.

Question 13. [1 mark]


Let everything be defined as in the previous question. The expression for φ_p(x) is given by

φ(x) = \left( x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3, x_1^4, x_1^3 x_2, x_1^2 x_2^2, x_1 x_2^3, x_2^4 \right).


True or False?

Solution:
False. The constant term 1 is missing.
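
A quick way to sanity-check the counts in the last two questions (the variable names are arbitrary): enumerate all exponent pairs (a, b) with a + b ≤ p, each corresponding to the monomial x_1^a x_2^b.

import itertools
import math

d, p = 2, 4
# All monomials x1^a * x2^b of total degree at most p, including the constant term (a, b) = (0, 0).
terms = [(a, b) for a, b in itertools.product(range(p + 1), repeat=d) if a + b <= p]
print(len(terms))           # 15
print(math.comb(d + p, p))  # 15, i.e. C(2+4, 4)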

Question 14. [1 mark]


Suppose that

F̄_p = { f | f : R^{d+1} → R, and f(x) = log(φ_p(x)^⊤ w), for some w ∈ R^{p̄} }.

Is it true that F̄1 ⊂ F̄2 ?

Solution:
True. Any f ∈ F̄_1 can also be written as an element of F̄_2 by setting the weights on the extra higher-degree features to zero, so as we increase the degree the function class becomes more expressive.

Question 15. [1 mark]


You are predicting house prices. Suppose you want to make the irreducible error smaller. If you gather a new feature about houses (that you didn't already have), such as the number of swimming pools in the backyard, is it likely to decrease the irreducible error? True or False?

Solution:
True. Irreducible error can be reduced by adding more features that are relevant to the prediction
task.

Question 16. [1 mark]


Consider the same setting as the previous problem. The estimation error can be reduced by reducing
the number of data points. True or False?

Solution:
False. Estimation error can be reduced by adding more data points or by using a simpler model.

Question 17. [1 mark]


Consider the same setting as the previous problem. The approximation error can be reduced by
using a larger function class. True or False?

Solution:
True. Approximation error can be reduced by using a more complex model.


Question 18. [1 mark]


You notice your predictor is overfitting. To reduce overfitting, we should make the degree p of the
polynomial function class larger. True or False?

Solution:
False. We need to make p smaller.

Question 19. [1 mark]


Suppose that you have a small dataset, but a large function class. Would the variance be large or small? Would you expect the bias to be large or small? Would you expect the predictor f̂_D to be underfitting or overfitting the data or neither?

a. variance large, bias large, overfit.

b. variance small, bias large, overfit.

c. variance large, bias small, overfit.

d. variance small, bias small, underfit.

Solution:
Answer: c. Variance large, bias small, overfit.

Question 20. [1 mark]


Suppose that you have a large dataset, but a small function class, and f_Bayes is much more complex than any function in the function class. Would the variance be large or small? Would you expect the bias to be large or small? Would you expect the predictor f̂_D to be underfitting or overfitting the data or neither?

a. variance large, bias large, underfit.

b. variance small, bias large, underfit.

c. variance small, bias small, neither overfitting nor underfitting.

d. variance large, bias large, neither overfitting nor underfitting.

Solution:
Answer: b. Variance small, bias large, underfit.

Question 21. [1 mark]


Suppose that you have a large dataset, a small function class F, and f_Bayes ∈ F. Would the variance be large or small? Would you expect the bias to be large or small? Would you expect the predictor f̂_D to be underfitting or overfitting the data or neither?

a. variance large, bias large, overfitting.

b. variance small, bias small, overfitting.


c. variance small, bias small, neither overfitting nor underfitting.

d. variance large, bias large, neither overfitting nor underfitting.

Solution:
Answer: c. Variance small, bias small, neither overfitting nor underfitting.

Question 22. [1 mark]


You are using regularization. You notice you are underfitting. You should decrease the value of
lambda to reduce underfitting and get a smaller test loss. True or False?

Solution:
True. Decreasing the value of lambda will reduce the regularization strength and allow the model
to fit the data better.

Question 23. [1 mark]


Suppose you have a dataset D = (z_1, . . . , z_n) containing n i.i.d. flips of a coin. Since the flips are i.i.d., you know they all follow the distribution Bernoulli(α*). However, you do not know what α* is, so you would like to estimate it using MLE. Which of the following is the maximum likelihood estimate α_MLE?

a. α_MLE = \frac{1}{n} \sum_{i=1}^{n} α_i

b. α_MLE = \frac{1}{n} \sum_{i=1}^{n} z_i

c. α_MLE = \frac{1}{n-1} \sum_{i=1}^{n} z_i

d. α_MLE = \frac{1}{n} \sum_{i=1}^{n-1} z_i

Solution:
Answer: b.
Explanation: The probability of each flip z_i is p(z_i | α) = α^{z_i} (1 - α)^{1 - z_i}. The likelihood is:

p(D | α) = \prod_{i=1}^{n} α^{z_i} (1 - α)^{1 - z_i}

The negative log-likelihood is:

-\log p(D | α) = -\sum_{i=1}^{n} \log\left( α^{z_i} (1 - α)^{1 - z_i} \right)
             = -\sum_{i=1}^{n} \left( z_i \log α + (1 - z_i) \log(1 - α) \right)
             = -\left( \sum_{i=1}^{n} z_i \right) \log α - \left( n - \sum_{i=1}^{n} z_i \right) \log(1 - α)

Differentiating and setting \frac{d}{dα}\left( -\log p(D | α) \right) = 0, we find:

\frac{d}{dα}\left( -\log p(D | α) \right) = -\frac{\sum_{i=1}^{n} z_i}{α} + \frac{n - \sum_{i=1}^{n} z_i}{1 - α} = 0
\implies \frac{\sum_{i=1}^{n} z_i}{α} = \frac{n - \sum_{i=1}^{n} z_i}{1 - α}
\implies (1 - α) \sum_{i=1}^{n} z_i = α \left( n - \sum_{i=1}^{n} z_i \right)
\implies \sum_{i=1}^{n} z_i - α \sum_{i=1}^{n} z_i = αn - α \sum_{i=1}^{n} z_i
\implies \sum_{i=1}^{n} z_i = αn
\implies α = \frac{1}{n} \sum_{i=1}^{n} z_i = α_MLE
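
A small simulation that illustrates the result (the true parameter, the seed, and the sample size are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
alpha_true = 0.3
z = rng.binomial(1, alpha_true, size=10_000)   # n i.i.d. Bernoulli(alpha_true) flips

alpha_mle = z.mean()   # closed-form MLE: (1/n) * sum_i z_i
print(alpha_mle)       # close to 0.3 for large n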

Question 24. [1 mark]


Assume that Y | X follows a Gaussian distribution with mean µ = x w_1 and variance σ^2 = exp(x w_2) for all x ∈ R and w = (w_1, w_2) where w_1, w_2 ∈ R. The negative log-likelihood can be written as follows for a dataset D = ((x_1, y_1), · · · , (x_n, y_n)):

g(w) = \sum_{i=1}^{n} g_i(w) where g_i(w) = -\ln p(y_i | x_i, w),

where p(· | ·) is the density of the above Gaussian distribution. What is the partial derivative of g with respect to w_1?
a. \frac{\partial g}{\partial w_1} = \sum_{i=1}^{n} \frac{x_i (y_i - x_i w_1)}{\exp(x_i w_2)}

b. \frac{\partial g}{\partial w_1} = \sum_{i=1}^{n} \frac{(y_i - x_i w_1)^2}{2 \exp(x_i w_2)}

c. \frac{\partial g}{\partial w_1} = -\sum_{i=1}^{n} \frac{x_i (y_i - x_i w_1)}{\exp(x_i w_2)}

d. \frac{\partial g}{\partial w_1} = -\sum_{i=1}^{n} \frac{(y_i - x_i w_1)^2}{\exp(x_i w_2)}

Solution:
Answer: c.
Explanation: To find \frac{\partial g}{\partial w_1}, note that the density of Y | X is

p(y_i | x_i, w) = \frac{1}{\sqrt{2\pi \exp(x_i w_2)}} \exp\left( -\frac{(y_i - x_i w_1)^2}{2 \exp(x_i w_2)} \right)

The negative log-likelihood term g_i(w) is:

g_i(w) = \frac{(y_i - x_i w_1)^2}{2 \exp(x_i w_2)} + \frac{1}{2} \ln(2\pi \exp(x_i w_2))

Differentiating g_i(w) with respect to w_1 gives:

\frac{\partial g}{\partial w_1} = -\sum_{i=1}^{n} \frac{x_i (y_i - x_i w_1)}{\exp(x_i w_2)}.

Question 25. [1 mark]


Let everything be defined as in the previous question. What is the partial derivative of g with respect to w_2?
a. \sum_{i=1}^{n} \left( -\frac{(y_i - x_i w_1)^2}{2 \exp(x_i w_2)} + x_i \right)

b. \sum_{i=1}^{n} \left( \frac{(y_i - x_i w_1)^2}{2 \exp(x_i w_2)} + \frac{x_i}{2} \right)

c. \sum_{i=1}^{n} \left( \frac{x_i (y_i - x_i w_1)^2}{2 \exp(x_i w_2)} - \frac{x_i}{2} \right)

d. \sum_{i=1}^{n} \left( -\frac{x_i (y_i - x_i w_1)^2}{2 \exp(x_i w_2)} + \frac{x_i}{2} \right)

Solution:
Answer: d.
Explanation: To find \frac{\partial g}{\partial w_2}, we start with the expression for g_i(w):

g_i(w) = \frac{(y_i - x_i w_1)^2}{2 \exp(x_i w_2)} + \frac{1}{2} \ln(2\pi \exp(x_i w_2))

Differentiating g_i(w) with respect to w_2 gives:

\frac{\partial g}{\partial w_2} = \sum_{i=1}^{n} \left( -\frac{x_i (y_i - x_i w_1)^2}{2 \exp(x_i w_2)} + \frac{x_i}{2} \right).

Question 26. [1 mark]


Let everything be defined as in the previous question. You want to solve for w_MLE using gradient descent. Using the partial derivatives you calculated in the previous questions, what would the gradient update rule look like with a constant step size η?

a. w_1 \leftarrow w_1 - \eta \sum_{i=1}^{n} \frac{x_i (y_i - x_i w_1)}{\exp(x_i w_2)}, \quad w_2 \leftarrow w_2 - \eta \sum_{i=1}^{n} \left( \frac{x_i (y_i - x_i w_1)^2}{2 \exp(x_i w_2)} - \frac{x_i}{2} \right)

b. w_1 \leftarrow w_1 + \eta \sum_{i=1}^{n} \frac{x_i (y_i - x_i w_1)}{\exp(x_i w_2)}, \quad w_2 \leftarrow w_2 + \eta \sum_{i=1}^{n} \left( \frac{x_i (y_i - x_i w_1)^2}{2 \exp(x_i w_2)} - \frac{x_i}{2} \right)

c. w_1 \leftarrow w_1 - \eta \sum_{i=1}^{n} \frac{y_i - x_i w_1}{2}, \quad w_2 \leftarrow w_2 - \eta \sum_{i=1}^{n} \left( \frac{(y_i - x_i w_1)^2}{2 \exp(x_i w_2)} - \frac{x_i}{2} \right)

d. w_1 \leftarrow w_1 - \eta \sum_{i=1}^{n} \frac{y_i - x_i w_1}{\exp(x_i w_2)}, \quad w_2 \leftarrow w_2 + \eta \sum_{i=1}^{n} \left( \frac{(y_i - x_i w_1)^2}{2} - \frac{x_i}{2} \right)


Solution:
Answer: b.
Explanation: Plugging in the partial derivatives from the previous questions into the gradient
descent update rule, we get:
w_1 \leftarrow w_1 + \eta \sum_{i=1}^{n} \frac{x_i (y_i - x_i w_1)}{\exp(x_i w_2)},

w_2 \leftarrow w_2 + \eta \sum_{i=1}^{n} \left( \frac{x_i (y_i - x_i w_1)^2}{2 \exp(x_i w_2)} - \frac{x_i}{2} \right).
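
A hedged sketch of this update on synthetic data (the data-generating parameters, seed, step size, and iteration count are arbitrary illustrations, and convergence is not carefully tuned):

import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(-1.0, 1.0, size=n)
w1_true, w2_true = 2.0, 0.5
y = x * w1_true + rng.normal(0.0, np.sqrt(np.exp(x * w2_true)))

eta = 0.001
w1, w2 = 0.0, 0.0
for _ in range(5000):
    resid = y - x * w1
    inv_var = np.exp(-x * w2)   # 1 / exp(x_i * w_2)
    w1_step = np.sum(x * resid * inv_var)
    w2_step = np.sum(x * resid**2 * inv_var / 2 - x / 2)
    w1, w2 = w1 + eta * w1_step, w2 + eta * w2_step
print(w1, w2)   # roughly recovers the hypothetical (2.0, 0.5)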
