
CS 771A: Introduction to Machine Learning Endsem Exam (18 Nov 2019)

Name SAMPLE SOLUTIONS 80 marks


Roll No Dept. Page 1 of 4

Instructions:
1. This question paper contains 2 pages (4 sides of paper). Please verify.
2. Write your name, roll number, department in block letters neatly with ink on each page of this question paper.
3. If you don’t write your name and roll number on all pages, pages may get lost when we unstaple them for scanning.
4. Write your final answers neatly with a blue/black pen. Pencil marks may get smudged.
5. Don’t overwrite/scratch answers especially in MCQ and T/F. We will entertain no requests for leniency.

Q1. Write T or F for True/False (write only in the box on the right hand side) (16x2=32 marks)

1. If 𝑓, 𝑔: ℝ² → ℝ are two convex fns, then the fn ℎ defined as ℎ(𝐱) = 𝑓(𝐱) ⋅ 𝑔(𝐱) can never be a convex fn no matter which two convex functions 𝑓, 𝑔 we choose.  [F]
2. It is possible to derive a Lagrangian dual problem for the 𝐿₂-regularized logistic regression problem even though there are no constraints in the primal formulation.  [T]
3. If 𝑋, 𝑌 are two real-valued r.v.s (not necessarily independent) such that at least one of them has zero variance, i.e. 𝕍[𝑋] = 0 or 𝕍[𝑌] = 0, then Cov(𝑋, 𝑌) = 0.  [T]
4. The LwP algorithm, when used on a binary classification problem, results in a linear decision boundary no matter how many prototypes we use per class.  [F]
5. The time it takes to make a prediction for a test data point with a decision tree with 𝑛 leaf nodes is always 𝒪(log 𝑛) no matter what the structure of the tree.  [F]
6. If we have 10000 red and 20 green points, then the best option to deal with imbalance is to find the 20 red points closest to the green points and throw the rest 9980 away.  [F]
7. Reinf. learning is a good technique to build a RecSys if we suspect that the tastes of users are changing (possibly due to our own recommendations to them).  [T]
8. Bandit algorithms are named so since they operate in settings where a malicious adversary can sometimes corrupt the feedback/response given to the algorithm.  [F]
9. The binary relevance method in recommendation systems is best suited (in terms of prediction time/model size) when the number of items/labels is extremely large.  [F]
10. A NN with three hidden layers and a single output node, with all nodes except input layer nodes using sigmoid activation, will always learn a continuous function.  [T]
11. If our goal in RecSys is to quickly find out the most liked item(s) by a certain user, then we should adopt the UCB method rather than the pure exploration method.  [T]
12. The EM algorithm is a special case of Q-learning (recall Q-learning is used in reinf. learning) since the EM algorithm also optimizes a function known as the Q function.  [F]
13. If we are training an ensemble of 𝑘 classifiers, then it is very simple to train all of them in parallel when using bagging but not that simple when using boosting.  [T]
14. If we have 𝑛 data points with 𝑑-dimensional feature vectors, then kernel PCA with the Gaussian kernel can learn only at most 𝑑 components from this data if 𝑑 < 𝑛.  [F]
15. If 𝐴 ∈ ℝⁿˣⁿ is an orthonormal matrix, i.e. 𝐴⊤𝐴 = 𝐼ₙ = 𝐴𝐴⊤, then it can never be the case that 𝐴 is symmetric, i.e. we must have 𝐴⊤ ≠ 𝐴.  [F]
16. Let 𝑋 be a real-valued r.v. that always takes values in the interval [−1, 1]. Then we must have 𝕍[𝔼[𝑋]] = 0, i.e. if we define 𝑌 = 𝔼[𝑋] then we must have 𝕍[𝑌] = 0.  [T]
Q2 Consider the NN with 2 hidden layers – all nodes use the identity activation function. This NN
is clearly equivalent to a network with no hidden layers since all activation functions are linear.
Find the weights of this new network and write them down in the space provided. (4 marks)
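Although the weights of the Q2 network appear only in the figure (not reproduced here), the collapse itself is easy to verify: with identity activations every layer is a purely linear map, so the equivalent no-hidden-layer network simply has the product of the layer weight matrices as its weight. A minimal NumPy sketch with placeholder matrices:

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # input layer (3 units) -> hidden layer 1 (4 units), placeholder weights
W2 = rng.standard_normal((4, 4))   # hidden layer 1 -> hidden layer 2
W3 = rng.standard_normal((1, 4))   # hidden layer 2 -> single output node

def two_hidden_layer_net(x):
    # identity activation at every node, so each layer is just a matrix product
    return W3 @ (W2 @ (W1 @ x))

W_eq = W3 @ W2 @ W1                # weight matrix of the equivalent no-hidden-layer network

x = rng.standard_normal(3)
assert np.allclose(two_hidden_layer_net(x), W_eq @ x)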

Q3 Define 𝑓: ℝ² × ℝ²ˣ³ × ℝ³ → ℝ as 𝑓(𝐱, 𝑊, 𝐲) = 𝐱⊤𝑊𝐲 where 𝐱 ∈ ℝ², 𝐲 ∈ ℝ³, 𝑊 ∈ ℝ²ˣ³. Let
𝐱⁰ = [1, 2]⊤, 𝐲⁰ = [3, 4, 5]⊤, 𝑊⁰ = [[1, 2, 1], [2, 1, 2]]. Define 𝑝: ℝ² → ℝ as 𝑝(𝐱) = 𝐱⊤𝑊⁰𝐲⁰, 𝑞: ℝ²ˣ³ → ℝ as
𝑞(𝑊) = (𝐱⁰)⊤𝑊𝐲⁰ and 𝑟: ℝ³ → ℝ as 𝑟(𝐲) = (𝐱⁰)⊤𝑊⁰𝐲. Write the Jacobians of 𝑝, 𝑞, 𝑟 below.
Note that to avoid clutter, we are asking you to write 𝐽𝑞 as a 2 × 3 matrix. (2+3+3=8 marks)
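The boxed answers for Q3 are not reproduced in this extract; purely as an illustration, the standard computation gives 𝐽𝑝 = (𝑊⁰𝐲⁰)⊤, 𝐽𝑞 = 𝐱⁰(𝐲⁰)⊤ (written as a 2 × 3 matrix, as asked) and 𝐽𝑟 = (𝐱⁰)⊤𝑊⁰, which the NumPy sketch below evaluates and sanity-checks:

import numpy as np

x0 = np.array([1.0, 2.0])
y0 = np.array([3.0, 4.0, 5.0])
W0 = np.array([[1.0, 2.0, 1.0],
               [2.0, 1.0, 2.0]])

J_p = W0 @ y0           # gradient of p(x) = x^T W0 y0 w.r.t. x; here [16, 20]
J_q = np.outer(x0, y0)  # gradient of q(W) = x0^T W y0 w.r.t. W; here [[3, 4, 5], [6, 8, 10]]
J_r = x0 @ W0           # gradient of r(y) = x0^T W0 y w.r.t. y; here [5, 4, 5]

# finite-difference sanity check of J_p (p is linear, so the match is exact up to rounding)
eps = 1e-6
p = lambda x: x @ W0 @ y0
fd = np.array([(p(x0 + eps * e) - p(x0 - eps * e)) / (2 * eps) for e in np.eye(2)])
assert np.allclose(J_p, fd)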

Q4 We wish to use 𝐶-SVM to learn a binary classifier. We have 100000 train points half of which
are red and the other half green. Briefly outline a way to tune the 𝐶 parameter and justify your
reasons for the same. You may use the 100000 training points in any way you wish. (4 marks)
Since the dataset is balanced, we need not resort to class-weighted classification
tactics. We may set aside a fair number of randomly chosen points (say 30000) as
a held-out validation set, then perform a grid search over a reasonable range of
values of 𝐶, say 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, and choose the value for which the
SVM trained using that value of 𝐶 gives us maximum classification accuracy on the
validation dataset. Note that other methods like k-fold cross-validation are also
admissible. Also, we are able to use classification accuracy as a performance
measure on the validation dataset only because the dataset is balanced. Had the
dataset been unbalanced, we should have used the F-measure or a similar measure instead.
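A sketch of this tuning procedure in scikit-learn, under the assumption that LinearSVC is an acceptable stand-in for the C-SVM and with synthetic data standing in for the 100000 training points (which are not provided here):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# synthetic balanced dataset standing in for the 100000 red/green points
X, y = make_classification(n_samples=100000, n_features=20, random_state=0)

# set aside 30000 randomly chosen points as a held-out validation set
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=30000, random_state=0)

best_C, best_acc = None, -1.0
for C in [0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50]:
    clf = LinearSVC(C=C, max_iter=10000).fit(X_tr, y_tr)
    acc = accuracy_score(y_val, clf.predict(X_val))   # accuracy is fine here: classes are balanced
    if acc > best_acc:
        best_C, best_acc = C, acc

final_model = LinearSVC(C=best_C, max_iter=10000).fit(X, y)   # retrain on all points with the chosen C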

Q5 Let 𝐾1 : 𝒳 × 𝒳 → ℝ be a Mercer kernel with feature map 𝜙1 : 𝒳 → ℝ𝐷 for some finite 𝐷 > 0.
Define a new kernel 𝐾2 = 𝐾12 i.e. 𝐾2 (𝐱, 𝐲) = 𝐾1 (𝐱, 𝐲)2 for all 𝐱, 𝐲 ∈ 𝒳. Design a feature map for
𝐾2 i.e. 𝜙2 : 𝒳 → ℝ𝐿 for some 𝐿 > 0 s.t. 𝐾2 (𝐱, 𝐲) = 〈𝜙2 (𝐱), 𝜙2 (𝐲)〉 for all 𝐱, 𝐲 ∈ 𝒳. (6 marks)

The properties of trace tell us that 𝜙₁(𝐱)⊤𝜙₁(𝐲) = trace(𝜙₁(𝐱)⊤𝜙₁(𝐲)) = trace(𝜙₁(𝐱)𝜙₁(𝐲)⊤). Also, 𝑐 ⋅ trace(𝑋) = trace(𝑐 ⋅ 𝑋) for all 𝑐 ∈ ℝ. Thus we write
𝐾₂(𝐱, 𝐲) = 𝐾₁(𝐱, 𝐲)² = (𝜙₁(𝐱)⊤𝜙₁(𝐲))² = trace(𝜙₁(𝐱)𝜙₁(𝐱)⊤𝜙₁(𝐲)𝜙₁(𝐲)⊤).
If we use 𝜙₂(𝐱) = 𝜙₁(𝐱)𝜙₁(𝐱)⊤ ∈ ℝᴰˣᴰ, then we have 𝐾₂(𝐱, 𝐲) = ⟨𝜙₂(𝐱), 𝜙₂(𝐲)⟩ for all 𝐱, 𝐲 ∈ 𝒳, where ⟨⋅, ⋅⟩ is the trace (Frobenius) inner product on matrices.
Instead of a matrix-valued feature map, we may use a vector feature map as well, 𝜙₂(𝐱) ∈ ℝᴰ², i.e. 𝐿 = 𝐷², by creating coordinates of the form 𝐯ᵢ𝐯ⱼ, 𝑖, 𝑗 ∈ [𝐷], where we denote 𝐯 = 𝜙₁(𝐱) (note that this essentially stretches out the 𝐷 × 𝐷 matrix we created earlier as a long vector).
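A quick numerical check of this construction, using an arbitrary placeholder feature map 𝜙₁ (any finite-dimensional map works): vectorising the outer product 𝜙₁(𝐱)𝜙₁(𝐱)⊤ gives a 𝐷²-dimensional 𝜙₂ whose inner product reproduces 𝐾₁(𝐱, 𝐲)²:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))        # fixes a placeholder phi_1 : R^3 -> R^5, so D = 5

def phi1(x):
    return np.tanh(A @ x)

def phi2(x):
    v = phi1(x)
    return np.outer(v, v).ravel()      # stretch the D x D matrix phi1(x) phi1(x)^T into a D^2 vector

x, y = rng.standard_normal(3), rng.standard_normal(3)
K1 = phi1(x) @ phi1(y)                           # K_1(x, y) = <phi_1(x), phi_1(y)>
assert np.isclose(K1 ** 2, phi2(x) @ phi2(y))    # K_2(x, y) = K_1(x, y)^2 = <phi_2(x), phi_2(y)>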

Q6 Derive the Lagrangian dual for the following weighted C-SVM problem (for use in AdaBoost):
min_{𝐰∈ℝᵈ} ½‖𝐰‖₂² + ∑ᵢ₌₁ⁿ 𝑐ᵢ ⋅ [1 − 𝑦ⁱ ⋅ 𝐰⊤𝐱ⁱ]₊

Write down the problem as a constrained opt. problem, write down the Lagrangian, and show
main steps in the derivation of the dual. Assume 𝑦ⁱ ∈ {−1, 1}, 𝐱ⁱ ∈ ℝᵈ, 𝑐ᵢ > 0. (3+1+2=6 marks)
Constrained prob: min_{𝐰,𝛏} ½‖𝐰‖₂² + ∑ᵢ₌₁ⁿ 𝑐ᵢ ⋅ 𝜉ᵢ  s.t.  𝑦ⁱ ⋅ 𝐰⊤𝐱ⁱ ≥ 1 − 𝜉ᵢ and 𝜉ᵢ ≥ 0 ∀𝑖
Lagrangian: ℒ(𝐰, 𝛏, 𝛂, 𝛃) = ½‖𝐰‖₂² + ∑ᵢ₌₁ⁿ (𝑐ᵢ ⋅ 𝜉ᵢ + 𝛼ᵢ(1 − 𝜉ᵢ − 𝑦ⁱ ⋅ 𝐰⊤𝐱ⁱ) − 𝛽ᵢ𝜉ᵢ)
Setting ∂ℒ/∂𝜉ᵢ = 0 gives us 𝛼ᵢ + 𝛽ᵢ = 𝑐ᵢ, whereas ∂ℒ/∂𝐰 = 𝟎 gives us 𝐰 = ∑ᵢ₌₁ⁿ 𝛼ᵢ 𝑦ⁱ 𝐱ⁱ.
Simplifying gives: max_𝛂 ∑ᵢ₌₁ⁿ 𝛼ᵢ − ½ ∑ᵢ,ⱼ₌₁ⁿ 𝛼ᵢ𝛼ⱼ 𝑦ⁱ𝑦ʲ ⟨𝐱ⁱ, 𝐱ʲ⟩  s.t.  𝛼ᵢ ∈ [0, 𝑐ᵢ] ∀𝑖 ∈ [𝑛]
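A minimal NumPy sketch, on toy placeholder data, of solving this dual by projected gradient ascent: since there is no bias term, the only constraint is the box 𝛼ᵢ ∈ [0, 𝑐ᵢ], and the primal solution is recovered as 𝐰 = ∑ᵢ 𝛼ᵢ𝑦ⁱ𝐱ⁱ:

import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 2
X = rng.standard_normal((n, d))                                         # toy feature vectors x^i
y = np.where(X[:, 0] + 0.3 * rng.standard_normal(n) > 0, 1.0, -1.0)     # toy labels in {-1, +1}
c = np.full(n, 1.0)                                                     # per-point weights c_i > 0 (placeholders)

Yx = y[:, None] * X
Q = Yx @ Yx.T                                     # Q_ij = y^i y^j <x^i, x^j>

alpha = np.zeros(n)
eta = 1.0 / np.linalg.norm(Q, 2)                  # step size from the spectral norm of Q
for _ in range(2000):
    grad = 1.0 - Q @ alpha                        # gradient of sum_i alpha_i - (1/2) alpha^T Q alpha
    alpha = np.clip(alpha + eta * grad, 0.0, c)   # ascent step, then project onto the box [0, c_i]

w = (alpha * y) @ X                               # primal solution w = sum_i alpha_i y^i x^i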
Q7 Consider the valley distribution with three parameters 𝒱(𝑎, 𝑏, 𝑐) where 𝑎 < 𝑏 and 𝑎 ≤ 𝑐 ≤ 𝑏
(no other restrictions on 𝑎, 𝑏, 𝑐). The PDF of this distribution, with ℎ = 4/(3(𝑏 − 𝑎)), is
ℙ[𝑥 | 𝑎, 𝑏, 𝑐] = 𝒱(𝑥; 𝑎, 𝑏, 𝑐) ≜
  0                                if 𝑥 < 𝑎
  ℎ − ℎ(𝑥 − 𝑎)/(2(𝑐 − 𝑎))          if 𝑎 ≤ 𝑥 < 𝑐
  ℎ − ℎ(𝑏 − 𝑥)/(2(𝑏 − 𝑐))          if 𝑐 ≤ 𝑥 ≤ 𝑏
  0                                if 𝑥 > 𝑏
Given 𝑛 indep. samples 𝑥¹, …, 𝑥ⁿ ∈ ℝ (not all samples are the same) we wish to learn a valley
distribution as a generative distribution using MLE, i.e. find arg max_{𝑎<𝑏, 𝑎≤𝑐≤𝑏} ℙ[𝑥¹, …, 𝑥ⁿ | 𝑎, 𝑏, 𝑐]. Give a
brief description + derivation of an algorithm to find 𝑎̂MLE, 𝑏̂MLE, 𝑐̂MLE. (5+5+10=20 marks)
Observation 1: let 𝑚 ≜ minᵢ 𝑥ⁱ and 𝑀 ≜ maxᵢ 𝑥ⁱ. Then if 𝑎 > 𝑚 or 𝑏 < 𝑀, the
likelihood would vanish and thus we must have 𝑎 ≤ 𝑚, 𝑏 ≥ 𝑀.
Observation 2: if 𝑐 ∈ [𝑚, 𝑀] and 𝑎 < 𝑚 or 𝑏 > 𝑀 (or both), then we can
increase the likelihood by keeping 𝑐 the same and setting 𝑎 = 𝑚, 𝑏 = 𝑀. This is
because doing so causes (𝑏 − 𝑎) ↓ so ℎ ↑, and the factors (1 − (𝑥 − 𝑎)/(2(𝑐 − 𝑎))) and
(1 − (𝑏 − 𝑥)/(2(𝑏 − 𝑐))) also do not decrease when the interval shrinks, so the PDF goes up
on the entire interval [𝑎, 𝑏] = [𝑚, 𝑀], i.e. the likelihood of every data point goes up.
Observation 3: if 𝑐 < 𝑚, then we can similarly see that setting 𝑎 = 𝑐 = 𝑚 will
strictly increase likelihood. Similarly if 𝑐 > 𝑀, we may set 𝑏 = 𝑐 = 𝑀.
The above observations tell us that 𝑎̂MLE = 𝑚, 𝑏̂MLE = 𝑀 and 𝑐̂MLE ∈ [𝑚, 𝑀]. In
general there need not be a closed form solution for 𝑐̂MLE . A sensible workaround
is to perform a search in the interval [𝑚, 𝑀]. W.l.o.g. assume that 𝑥¹ ≤ 𝑥² ≤ ⋯ ≤ 𝑥ⁿ.
Then for all values of 𝑐 ∈ [𝑥ⁱ, 𝑥ⁱ⁺¹], we have the NLL expression as
ℓⁱ(𝑐) = − ∑ⱼ₌₁ⁱ ln(1 − (𝑥ʲ − 𝑎)/(2(𝑐 − 𝑎))) − ∑ⱼ₌ᵢ₊₁ⁿ ln(1 − (𝑏 − 𝑥ʲ)/(2(𝑏 − 𝑐)))
Note that we removed terms involving ℎ above as they do not affect the optimum.
The above function may be (approximately) minimized in the range 𝑐 ∈ [𝑥 𝑖 , 𝑥 𝑖+1 ]
using GD. The same process needs to be repeated for all 𝑖 ∈ [𝑛 − 1] to obtain an
(approximation) of the globally optimal value of 𝑐.
Pseudo Algo for estimating 𝑐̂MLE:
For 𝑖 = 1, …, 𝑛 − 1, let 𝑐̂ⁱ = arg min_{𝑐∈[𝑥ⁱ, 𝑥ⁱ⁺¹]} ℓⁱ(𝑐), approximated using GD.
Output 𝑐̂ᵏ where 𝑘 = arg min_{𝑗∈[𝑛−1]} ℓʲ(𝑐̂ʲ).
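A minimal NumPy sketch of this estimator on toy data: 𝑎̂MLE and 𝑏̂MLE come directly from the sample minimum and maximum, while a dense grid search over (𝑚, 𝑀) stands in for the per-interval gradient descent of the pseudo-algorithm:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)        # toy samples standing in for x^1, ..., x^n

a_hat, b_hat = x.min(), x.max()             # a_MLE = m, b_MLE = M (Observations 1-3)
h = 4.0 / (3.0 * (b_hat - a_hat))

def nll(c):
    # negative log-likelihood of V(a_hat, b_hat, c); for m < c < M the selected branch is positive at every sample
    pdf = np.where(x <= c,
                   h - h * (x - a_hat) / (2.0 * (c - a_hat)),
                   h - h * (b_hat - x) / (2.0 * (b_hat - c)))
    return -np.sum(np.log(pdf))

# search strictly inside (m, M) so that c - a and b - c never vanish
candidates = np.linspace(a_hat, b_hat, 2001)[1:-1]
c_hat = min(candidates, key=nll)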
