ES Key

The document is an end-semester exam for the course CS 771A: Intro to Machine Learning at IIT Kanpur, consisting of multiple questions related to machine learning concepts such as confusion matrices, Mercer kernels, support vector data description, and one-class SVMs. Each question requires students to provide specific answers or derivations related to the topics covered in the course. The exam includes instructions for writing and formatting answers, as well as marks allocation for each question.


CS 771A: Intro to Machine Learning, IIT Kanpur        Endsem Exam (14 July 2023)

Name: MELBO        Roll No: 230007        Dept.: AWSM        40 marks

Instructions:
1. This question paper contains 2 pages (4 sides of paper). Please verify.
2. Write your name, roll number, department in block letters with ink on each page.
3. Write your final answers neatly with a blue/black pen. Pencil marks may get smudged.
4. Don’t overwrite/scratch answers especially in MCQ – ambiguous cases may get 0 marks.

Q1. (Total confusion) The confusion matrix is a very useful tool for evaluating classification models.
For a 𝐶-class problem, this is a 𝐶 × 𝐶 matrix that tells us, for any two classes 𝑐, 𝑐′ ∈ [𝐶], how many
instances of class 𝑐 were classified as 𝑐′ by the model. In the example below, 𝐶 = 2 and there were
𝑃 + 𝑄 + 𝑅 + 𝑆 points in the test set, where 𝑃, 𝑄, 𝑅, 𝑆 are strictly positive integers. The matrix tells
us that there were 𝑄 points that were in class +1 but (incorrectly) classified as −1 by the model,
𝑆 points that were in class −1 and were (correctly) classified as −1 by the model, etc. Give expressions
for the specified quantities in terms of 𝑃, 𝑄, 𝑅, 𝑆. No derivations needed. Note that 𝑦 denotes the
true class of a test point and 𝑦̂ is the predicted class for that point. (5 x 1 = 5 marks)
Confusion Matrix (rows: true class 𝑦, columns: predicted class 𝑦̂):

              𝑦̂ = +𝟏    𝑦̂ = −𝟏
  𝑦 = +𝟏        𝑃          𝑄
  𝑦 = −𝟏        𝑅          𝑆

Accuracy (ACC)               ℙ[𝑦̂ = 𝑦]             (𝑃 + 𝑆)⁄(𝑃 + 𝑄 + 𝑅 + 𝑆)
Precision (PRE)              ℙ[𝑦 = 1 | 𝑦̂ = 1]     𝑃⁄(𝑃 + 𝑅)
Recall (REC)                 ℙ[𝑦̂ = 1 | 𝑦 = 1]     𝑃⁄(𝑃 + 𝑄)
False discovery rate (FDR)   ℙ[𝑦 = −1 | 𝑦̂ = 1]    𝑅⁄(𝑃 + 𝑅)
False omission rate (FOR)    ℙ[𝑦 = 1 | 𝑦̂ = −1]    𝑄⁄(𝑄 + 𝑆)
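As a quick sanity check (not part of the required answer), here is a minimal Python sketch with hypothetical counts 𝑃, 𝑄, 𝑅, 𝑆 that evaluates the five expressions above:

# Hypothetical confusion-matrix counts: P, Q, R, S are assumed example values
P, Q, R, S = 40, 10, 5, 45

acc = (P + S) / (P + Q + R + S)   # accuracy: fraction of points predicted correctly
pre = P / (P + R)                 # precision: P[y = +1 | y_hat = +1]
rec = P / (P + Q)                 # recall:    P[y_hat = +1 | y = +1]
fdr = R / (P + R)                 # false discovery rate = 1 - precision
fom = Q / (Q + S)                 # false omission rate
print(acc, pre, rec, fdr, fom)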

Q2. (Kernel Smash) Melbi has created two Mercer kernels 𝐾1, 𝐾2: ℝ × ℝ → ℝ with the feature
map for the kernel 𝐾𝑖 being 𝜙𝑖: ℝ → ℝ². Thus, for any 𝑥, 𝑦 ∈ ℝ, we have 𝐾𝑖(𝑥, 𝑦) = 〈𝜙𝑖(𝑥), 𝜙𝑖(𝑦)〉
for 𝑖 ∈ {1,2}. Melbi knows that 𝜙1(𝑥) = (𝑥, 𝑥³) and 𝜙2(𝑥) = (1, 𝑥²). Melbo has created a new
kernel 𝐾3 using Melbi’s kernels so that for any 𝑥, 𝑦 ∈ ℝ, 𝐾3(𝑥, 𝑦) = (𝐾1(𝑥, 𝑦) + 3 ⋅ 𝐾2(𝑥, 𝑦))².
Design a feature map 𝜙3: ℝ → ℝ⁷ for the kernel 𝐾3. Write your answer only in the space given
below. No derivation needed. Note that 𝝓𝟑 must not use more than 7 dimensions. If your solution
does not require 7 dimensions, leave the rest of the dimensions blank. (5 marks)

𝜙3(𝑥) = ( 3 , √6 𝑥 , √19 𝑥² , √12 𝑥³ , √11 𝑥⁴ , √6 𝑥⁵ , 𝑥⁶ )
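A quick numerical check of this feature map (not part of the required answer), a minimal Python sketch assuming the definitions of 𝐾1, 𝐾2 above; it verifies 〈𝜙3(𝑥), 𝜙3(𝑦)〉 = (𝐾1(𝑥, 𝑦) + 3 ⋅ 𝐾2(𝑥, 𝑦))² at random points:

import numpy as np

def phi3(x):
    # proposed 7-dimensional feature map for K3
    return np.array([3.0, np.sqrt(6) * x, np.sqrt(19) * x**2, np.sqrt(12) * x**3,
                     np.sqrt(11) * x**4, np.sqrt(6) * x**5, x**6])

def K3(x, y):
    K1 = x * y + x**3 * y**3      # <phi1(x), phi1(y)> with phi1(x) = (x, x^3)
    K2 = 1 + x**2 * y**2          # <phi2(x), phi2(y)> with phi2(x) = (1, x^2)
    return (K1 + 3 * K2) ** 2

for x, y in np.random.randn(5, 2):
    assert np.isclose(phi3(x) @ phi3(y), K3(x, y))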


Q3 (Opt to Prob) Melbo enrolled in an advanced ML course and learnt an unsupervised learning
technique called support vector data description (SVDD). Given a set of data points, say in 2D for
sake of simplicity, 𝐱1 , … , 𝐱 𝑛 ∈ ℝ2 , SVDD solves the following optimization problem:
min_{𝐜∈ℝ², 𝑟∈ℝ}  𝑟²   s.t.  ‖𝐱ᵢ − 𝐜‖₂² ≤ 𝑟²  for all 𝑖 ∈ [𝑛]

Melbo’s friend Melba saw this and exclaimed that this is just an MLE solution. To convince Melbo,
create a likelihood distribution ℙ[𝐱 | 𝐜, 𝑟] over the 2D space ℝ² with parameters 𝐜 ∈ ℝ², 𝑟 ≥ 0 s.t.

[arg max_{𝐜∈ℝ², 𝑟≥0} { ∏_{𝑖∈[𝑛]} ℙ[𝐱ᵢ | 𝐜, 𝑟] }]  =  [arg min_{𝐜∈ℝ², 𝑟≥0} 𝑟² s.t. ‖𝐱ᵢ − 𝐜‖₂² ≤ 𝑟² for all 𝑖 ∈ [𝑛]].

Your solution must be a proper distribution i.e., ℙ[𝐱 | 𝐜, 𝑟] ≥ 0 and ∫_{𝐱∈ℝ²} ℙ[𝐱 | 𝐜, 𝑟] 𝑑𝐱 = 1. Give
calculations to show why your distribution is correct. Hint: area of a circle of radius 𝑟 is 𝜋𝑟². (4 + 6 = 10 marks)
ℙ[𝐱 | 𝐜, 𝑟] = 1⁄(𝜋𝑟²)  if ‖𝐱 − 𝐜‖₂ ≤ 𝑟,   and   ℙ[𝐱 | 𝐜, 𝑟] = 0  if ‖𝐱 − 𝐜‖₂ > 𝑟
Using the above likelihood distribution expression yields the following likelihood value:

∏_{𝑖∈[𝑛]} ℙ[𝐱ᵢ | 𝐜, 𝑟] = (1⁄(𝜋𝑟²))ⁿ  if ‖𝐱ᵢ − 𝐜‖₂² ≤ 𝑟² for all 𝑖 ∈ [𝑛],   and   = 0  if ∃𝑖 such that ‖𝐱ᵢ − 𝐜‖₂ > 𝑟
Thus, the likelihood drops to 0 if any data point is outside the circle. Since we wish to maximize the
likelihood, we are forced to ensure that ‖𝐱ᵢ − 𝐜‖₂ > 𝑟 does not happen for any 𝑖 ∈ [𝑛]. This yields
the following optimization problem for the MLE:
arg max_{𝐜∈ℝ², 𝑟≥0} (1⁄(𝜋𝑟²))ⁿ   s.t.  ‖𝐱ᵢ − 𝐜‖₂² ≤ 𝑟²  for all 𝑖 ∈ [𝑛]

Since 𝑓(𝑥) ≝ (1⁄(𝜋𝑥))ⁿ is a decreasing function of 𝑥 for all 𝑥 > 0 (as 𝑛, 𝜋 are constants), maximizing
𝑓(𝑥) is the same as minimizing 𝑥. This yields the following problem, concluding the argument:
arg min_{𝐜∈ℝ², 𝑟≥0} 𝑟²   s.t.  ‖𝐱ᵢ − 𝐜‖₂² ≤ 𝑟²  for all 𝑖 ∈ [𝑛]
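As an aside (not part of the required answer), a minimal Python sketch of the SVDD problem itself, assuming the cvxpy library and synthetic 2D data; it finds the smallest circle (𝐜, 𝑟) enclosing all points:

import cvxpy as cp
import numpy as np

X = np.random.randn(50, 2)                 # synthetic 2D data points (assumed example)
c = cp.Variable(2)                         # centre of the enclosing circle
r = cp.Variable(nonneg=True)               # radius

# SVDD: minimize r^2 subject to every point lying inside the circle of radius r around c
constraints = [cp.norm(X[i] - c, 2) <= r for i in range(X.shape[0])]
cp.Problem(cp.Minimize(cp.square(r)), constraints).solve()
print("centre:", c.value, "radius:", r.value)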

Q4. (A one-class SVM?) For anomaly detection tasks, the “one-class” approach is often used. Given
a set of data points 𝐱1 , … , 𝐱 𝑛 ∈ ℝ𝑑 , the 1CSVM solves the following problem:
min_{𝐰∈ℝᵈ, 𝛏∈ℝⁿ, 𝜌∈ℝ}  { ½ ‖𝐰‖₂² − 𝜌 + ∑_{𝑖∈[𝑛]} 𝜉ᵢ }   s.t.  𝐰⊤𝐱ᵢ ≥ 𝜌 − 𝜉ᵢ  and  𝜉ᵢ ≥ 0  for all 𝑖 ∈ [𝑛]

1. Write down the Lagrangian for this optimization problem by introducing dual variables.
2. Write down the dual problem as a max-min problem (no need to simplify it at this stage).
3. Now simplify the dual problem (eliminate 𝐰, 𝛏, 𝜌). Show major steps. (3 + 2 + 5 = 10 marks)

Introducing dual variables 𝛼𝑖 , 𝛽𝑖 , 𝑖 ∈ [𝑛] for the first and second set of constraints respectively
(styled as vectors 𝛂, 𝛃 ∈ ℝ𝑛 for notational brevity) and using 𝟏 ∈ ℝ𝑛 to denote the all-ones vector
and 𝑋 ∈ ℝ𝑛×𝑑 to denote the feature matrix allows us to write the Lagrangian in a compact form.
ℒ(𝐰, 𝛏, 𝜌, 𝛂, 𝛃) = ½ ‖𝐰‖₂² − 𝜌 + 𝛏⊤𝟏 + 𝛂⊤(𝜌 ⋅ 𝟏 − 𝛏 − 𝑋𝐰) − 𝛃⊤𝛏

The dual problem is simply  max_{𝛂,𝛃≥𝟎} { min_{𝐰∈ℝᵈ, 𝛏∈ℝⁿ, 𝜌∈ℝ} ℒ(𝐰, 𝛏, 𝜌, 𝛂, 𝛃) }

To simplify the dual, we eliminate 𝐰, 𝛏, 𝜌 by using first-order optimality to get

∂ℒ⁄∂𝐰 = 𝟎 ⇒ 𝐰 = 𝑋⊤𝛂        ∂ℒ⁄∂𝛏 = 𝟎 ⇒ 𝛂 + 𝛃 = 𝟏        ∂ℒ⁄∂𝜌 = 0 ⇒ 𝛂⊤𝟏 = 1
Putting these back into the dual gives us the following form of the dual with constraints:

max_{𝛂,𝛃∈ℝⁿ} { −½ 𝛂⊤𝑋𝑋⊤𝛂 }   s.t.  𝛂, 𝛃 ≥ 𝟎  and  𝛂 + 𝛃 = 𝟏  and  𝛂⊤𝟏 = 1
We now eliminate 𝛃 by setting 𝛃 = 𝟏 − 𝛂. Note that this introduces a new constraint 𝛂 ≤ 𝟏 (i.e.,
𝛼𝑖 ≤ 1 for all 𝑖 ∈ [𝑛]) since 𝛃 ≥ 𝟎. This simplifies the dual further to
min_{𝛂∈ℝⁿ} ½ 𝛂⊤𝑋𝑋⊤𝛂   s.t.  𝟎 ≤ 𝛂 ≤ 𝟏  and  𝛂⊤𝟏 = 1

Actually, the constraint 𝛂 ≤ 𝟏 is vacuous since 𝟎 ≤ 𝛂 and 𝛂⊤𝟏 = 1 together ensure 𝛂 ≤ 𝟏. Thus,
an even more simplified version of the dual is
min_{𝛂∈ℝⁿ} ½ 𝛂⊤𝑋𝑋⊤𝛂   s.t.  𝛂 ≥ 𝟎  and  𝛂⊤𝟏 = 1
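As an illustration (not part of the required answer), a minimal Python sketch that solves this simplified dual numerically, assuming the cvxpy library and a synthetic feature matrix 𝑋; the primal solution is then recovered via 𝐰 = 𝑋⊤𝛂:

import cvxpy as cp
import numpy as np

X = np.random.randn(30, 5)               # synthetic feature matrix (assumed example): n = 30, d = 5
alpha = cp.Variable(30, nonneg=True)     # dual variables, alpha >= 0

# 0.5 * ||X^T alpha||^2 equals 0.5 * alpha^T X X^T alpha
objective = cp.Minimize(0.5 * cp.sum_squares(X.T @ alpha))
cp.Problem(objective, [cp.sum(alpha) == 1]).solve()   # constraint: alpha^T 1 = 1

w = X.T @ alpha.value                    # recover the primal solution w = X^T alpha
print(w)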
Q5 (Kernelized Anomaly Detection?) Let’s kernelize the 1CSVM. Suppose 𝑑 is large and instead of
receiving 𝐱1 , … , 𝐱 𝑛 ∈ ℝ𝑑 , we receive pairwise dot products of the features as an 𝑛 × 𝑛 matrix 𝐺 =
[𝑔𝑖𝑗 ] ∈ ℝ𝑛×𝑛 where 𝑔𝑖𝑗 ≝ 〈𝐱 𝑖 , 𝐱𝑗 〉 for all 𝑖, 𝑗 ∈ [𝑛]. Rewrite the (simplified) dual that you derived
in Q4 but using only the dot products 𝑔𝑖𝑗 . No derivations required – just rewrite the dual using the
dot products. Note: your rewritten dual must not use feature vectors 𝐱 𝒊 at all. (2 marks)
Note that 𝑋𝑋 ⊤ = 𝐺. This allows us to rewrite the simplified dual in terms of just the dot products.
min_{𝛂∈ℝⁿ} ½ 𝛂⊤𝐺𝛂   s.t.  𝟎 ≤ 𝛂 ≤ 𝟏  and  𝛂⊤𝟏 = 1

or the further simplified form


min_{𝛂∈ℝⁿ} ½ 𝛂⊤𝐺𝛂   s.t.  𝛂 ≥ 𝟎  and  𝛂⊤𝟏 = 1
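To emphasize that only the dot products are needed, a minimal Python sketch (same assumptions as the Q4 sketch): the solver is handed only the Gram matrix 𝐺 and works with a factorization 𝐺 = 𝐿𝐿⊤, never with the feature vectors:

import cvxpy as cp
import numpy as np

n = 30
X = np.random.randn(n, 100)              # used only to fabricate G for this illustration
G = X @ X.T                              # n x n matrix of g_ij = <x_i, x_j>; the solver sees nothing else

# Factor G = L L^T so that 0.5 * ||L^T alpha||^2 = 0.5 * alpha^T G alpha
L = np.linalg.cholesky(G + 1e-9 * np.eye(n))

alpha = cp.Variable(n, nonneg=True)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(L.T @ alpha)),
           [cp.sum(alpha) == 1]).solve()
print(alpha.value)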

Q6 (Delta Likelihood) Melbo has 𝑛 data points {𝐱 𝑖 , 𝑦𝑖 } with 𝐱 𝑖 ∈ ℝ𝑑 , 𝑦𝑖 ∈ {−1, +1}. The likelihood
of a model 𝐰 ∈ ℝ𝑑 w.r.t. data point 𝑖 is 𝑠𝑖 ≝ 1⁄(1 + exp(−𝑦𝑖 ⋅ 𝐰 ⊤ 𝐱 𝑖 )) and w.r.t. the entire data
is ℒ(𝐰) ≝ ∏𝑖∈[𝑛] 𝑠𝑖 . Notice that if the label of the 𝑗-th point is flipped (for any single 𝑗 ∈ [𝑛]), then
the likelihood of the same model 𝐰 changes to ℒ̃𝑗 (𝐰) ≝ 1⁄(1 + exp(𝑦𝑗 ⋅ 𝐰 ⊤ 𝐱𝑗 )) ⋅ (∏𝑖≠𝑗 𝑠𝑖 ).

i. Given a fixed model 𝐰, 𝑗 ∈ [𝑛], give an expression for Δ𝑗 (𝐰) ≝ ℒ̃𝑗 (𝐰)⁄ℒ(𝐰), i.e., the
factor by which likelihood of 𝐰 changes if 𝑗-th label is flipped. Give brief derivation.
ii. If 𝑛 = 5 and 𝑠1 = 0.1, 𝑠2 = 0.3, 𝑠3 = 0.9, 𝑠4 = 0.6, 𝑠5 = 0.2, find the point 𝑗 ∗ ∈ [5] for
which Δ𝑗 ∗ (𝐰) is the largest and value of Δ𝑗 ∗ (𝐰). Give brief justification.
iii. If 𝑛 = 5 and 𝑠1 = 0.4, 𝑠2 = 0.6, 𝑠3 = 0.2, 𝑠4 = 0.7, 𝑠5 = 0.8, find 𝑘 ∗ ∈ [5] for which
Δ𝑘 ∗ (𝐰) is the smallest and value of Δ𝑘 ∗ (𝐰). Give brief justification. (2 + 3 + 3 = 8 marks)

Δ𝑗(𝐰) = exp(−𝑦𝑗 ⋅ 𝐰⊤𝐱𝑗),  or equivalently,  Δ𝑗(𝐰) = (1 + exp(−𝑦𝑗 ⋅ 𝐰⊤𝐱𝑗)) ⁄ (1 + exp(𝑦𝑗 ⋅ 𝐰⊤𝐱𝑗))

𝑗∗ = 1,  Δ𝑗∗(𝐰) = 9        𝑘∗ = 5,  Δ𝑘∗(𝐰) = 0.25


Give brief derivation for part i and justification for parts ii and iii below.
We have Δ𝑗(𝐰) = (1 + exp(−𝑦𝑗 ⋅ 𝐰⊤𝐱𝑗)) ⁄ (1 + exp(𝑦𝑗 ⋅ 𝐰⊤𝐱𝑗)) = exp(−𝑦𝑗 ⋅ 𝐰⊤𝐱𝑗) = 1⁄𝑠𝑗 − 1.
Thus, Δ𝑗(𝐰) is the largest when 𝑠𝑗 is the smallest, giving us 𝑗∗ = 1 and Δ𝑗∗(𝐰) = 1⁄0.1 − 1 = 9.
Also, Δ𝑘(𝐰) is the smallest when 𝑠𝑘 is the largest, giving us 𝑘∗ = 5 and Δ𝑘∗(𝐰) = 1⁄0.8 − 1 = 0.25.

Note that this makes sense since in part ii, point 1 is indeed the worst classified point (misclassified
with a large margin) and thus, flipping 𝑦1 will increase the likelihood the most.
Similarly in part iii, point 5 is the best classified point (correctly classified with a large margin) and
thus, flipping 𝑦5 will decrease the likelihood the most.
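A quick check of parts ii and iii (not part of the required answer), a minimal Python sketch using Δ𝑗(𝐰) = 1⁄𝑠𝑗 − 1 with the given likelihood values:

import numpy as np

s_ii = np.array([0.1, 0.3, 0.9, 0.6, 0.2])    # part ii: per-point likelihoods s_1, ..., s_5
delta = 1.0 / s_ii - 1.0                      # Delta_j(w) = 1/s_j - 1
print(delta.argmax() + 1, delta.max())        # -> j* = 1, Delta = 9.0

s_iii = np.array([0.4, 0.6, 0.2, 0.7, 0.8])   # part iii
delta = 1.0 / s_iii - 1.0
print(delta.argmin() + 1, delta.min())        # -> k* = 5, Delta = 0.25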
