CS 771A: Intro to Machine Learning, IIT Kanpur
Endsem Exam (14 July 2023): Answer Key
Instructions:
1. This question paper contains 2 sheets of paper (4 printed sides). Please verify.
2. Write your name, roll number, and department in block letters with ink on each page.
3. Write your final answers neatly with a blue/black pen. Pencil marks may get smudged.
4. Do not overwrite or scratch out answers, especially in MCQs; ambiguous cases may get 0 marks.
Q1. (Total confusion) The confusion matrix is a very useful tool for evaluating classification models.
For a 𝐶-class problem, this is a 𝐶 × 𝐶 matrix that tells us, for any two classes 𝑐, 𝑐 ′ ∈ [𝐶], how many
instances of class 𝑐 were classified as 𝑐′ by the model. In the example below, 𝐶 = 2 and there were
𝑃 + 𝑄 + 𝑅 + 𝑆 points in the test set, where 𝑃, 𝑄, 𝑅, 𝑆 are strictly positive integers. The matrix tells
us that there were 𝑄 points that were in class +1 but (incorrectly) classified as −1 by the model,
𝑆 points were in class −1 and were (correctly) classified as −1 by the model, etc. Give expressions
for the specified quantities in terms of 𝑃, 𝑄, 𝑅, 𝑆. No derivations needed. Note that 𝑦 denotes the
true class of a test point and 𝑦̂ is the predicted class for that point. (5 x 1 = 5 marks)
Confusion Matrix (rows: true class 𝑦; columns: predicted class 𝑦̂):

                𝑦̂ = +𝟏    𝑦̂ = −𝟏
    𝑦 = +𝟏        𝑃          𝑄
    𝑦 = −𝟏        𝑅          𝑆

Accuracy (ACC)               ℙ[𝑦̂ = 𝑦]          = (𝑃 + 𝑆)/(𝑃 + 𝑄 + 𝑅 + 𝑆)
Precision (PRE)              ℙ[𝑦 = 1 | 𝑦̂ = 1]   = 𝑃/(𝑃 + 𝑅)
Recall (REC)                 ℙ[𝑦̂ = 1 | 𝑦 = 1]   = 𝑃/(𝑃 + 𝑄)
False discovery rate (FDR)   ℙ[𝑦 = −1 | 𝑦̂ = 1]  = 𝑅/(𝑃 + 𝑅)
False omission rate (FOR)    ℙ[𝑦 = 1 | 𝑦̂ = −1]  = 𝑄/(𝑄 + 𝑆)
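For illustration, a short Python sketch that computes the five quantities above from a 2×2 confusion matrix (the values of 𝑃, 𝑄, 𝑅, 𝑆 below are made up):

    # Made-up confusion matrix entries (P, Q, R, S must be strictly positive)
    P, Q, R, S = 40, 10, 5, 45

    ACC = (P + S) / (P + Q + R + S)   # P[y_hat = y]
    PRE = P / (P + R)                 # P[y = 1 | y_hat = 1]
    REC = P / (P + Q)                 # P[y_hat = 1 | y = 1]
    FDR = R / (P + R)                 # P[y = -1 | y_hat = 1]
    FOR = Q / (Q + S)                 # P[y = 1 | y_hat = -1]
    print(ACC, PRE, REC, FDR, FOR)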
Q2. (Kernel Smash) Melbi has created two Mercer kernels 𝐾1 , 𝐾2 : ℝ × ℝ → ℝ with the feature
map for the kernel 𝐾𝑖 being 𝜙𝑖 : ℝ → ℝ2 . Thus, for any 𝑥, 𝑦 ∈ ℝ, we have 𝐾𝑖 (𝑥, 𝑦) = 〈𝜙𝑖 (𝑥), 𝜙𝑖 (𝑦)〉
for 𝑖 ∈ {1, 2}. Melbi knows that 𝜙1(𝑥) = (𝑥, 𝑥³) and 𝜙2(𝑥) = (1, 𝑥²). Melbo has created a new
kernel 𝐾3 using Melbi's kernels so that for any 𝑥, 𝑦 ∈ ℝ, 𝐾3(𝑥, 𝑦) = (𝐾1(𝑥, 𝑦) + 3 ⋅ 𝐾2(𝑥, 𝑦))².
Design a feature map 𝜙3: ℝ → ℝ⁷ for the kernel 𝐾3. Write your answer only in the space given
below. No derivation needed. Note that 𝝓𝟑 must not use more than 7 dimensions. If your solution
does not require 7 dimensions, leave the rest of the dimensions blank. (5 marks)
𝜙3 (𝑥 ) =
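One valid answer, with the derivation sketched (using the given 𝜙1, 𝜙2):
𝐾1(𝑥, 𝑦) = 𝑥𝑦 + 𝑥³𝑦³ and 𝐾2(𝑥, 𝑦) = 1 + 𝑥²𝑦², so writing 𝑡 = 𝑥𝑦,
𝐾1(𝑥, 𝑦) + 3 ⋅ 𝐾2(𝑥, 𝑦) = 𝑡³ + 3𝑡² + 𝑡 + 3, and hence
𝐾3(𝑥, 𝑦) = (𝑡³ + 3𝑡² + 𝑡 + 3)² = 𝑡⁶ + 6𝑡⁵ + 11𝑡⁴ + 12𝑡³ + 19𝑡² + 6𝑡 + 9.
Since each monomial 𝑡ᵏ = 𝑥ᵏ𝑦ᵏ, one feature map using exactly 7 dimensions is
𝜙3(𝑥) = (3, √6 𝑥, √19 𝑥², √12 𝑥³, √11 𝑥⁴, √6 𝑥⁵, 𝑥⁶),
for which ⟨𝜙3(𝑥), 𝜙3(𝑦)⟩ = 𝐾3(𝑥, 𝑦).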
Melbo’s friend Melba saw this and exclaimed that this is just an MLE solution. To convince Melbo,
create a likelihood distribution ℙ[𝐱 | 𝐜, 𝑟] over the 2D space ℝ² with parameters 𝐜 ∈ ℝ², 𝑟 ≥ 0 s.t.
arg max_{𝐜∈ℝ², 𝑟≥0} {∏_{𝑖∈[𝑛]} ℙ[𝐱_𝑖 | 𝐜, 𝑟]} = arg min_{𝐜∈ℝ², 𝑟≥0} {𝑟² s.t. ‖𝐱_𝑖 − 𝐜‖₂² ≤ 𝑟² for all 𝑖 ∈ [𝑛]}.
Your solution must be a proper distribution, i.e., ℙ[𝐱 | 𝐜, 𝑟] ≥ 0 and ∫_{𝐱∈ℝ²} ℙ[𝐱 | 𝐜, 𝑟] 𝑑𝐱 = 1. Give calculations to
show why your distribution is correct. Hint: the area of a circle of radius 𝑟 is π𝑟². (4 + 6 = 10 marks)
ℙ[𝐱 | 𝐜, 𝑟] = { 1/(π𝑟²)   if ‖𝐱 − 𝐜‖₂ ≤ 𝑟
             { 0          if ‖𝐱 − 𝐜‖₂ > 𝑟
Using the above likelihood distribution expression yields the following likelihood value
∏_{𝑖∈[𝑛]} ℙ[𝐱_𝑖 | 𝐜, 𝑟] = { (1/(π𝑟²))ⁿ   if ‖𝐱_𝑖 − 𝐜‖₂² ≤ 𝑟² for all 𝑖 ∈ [𝑛]
                          { 0             if ∃𝑖 such that ‖𝐱_𝑖 − 𝐜‖₂ > 𝑟
Thus, the likelihood drops to 0 if any data point lies outside the circle. Since we wish to maximize the
likelihood, we are forced to ensure that ‖𝐱_𝑖 − 𝐜‖₂ > 𝑟 does not happen for any 𝑖 ∈ [𝑛]. This yields
the following optimization problem for the MLE:
arg max_{𝐜∈ℝ², 𝑟≥0} (1/(π𝑟²))ⁿ   s.t. ‖𝐱_𝑖 − 𝐜‖₂² ≤ 𝑟² for all 𝑖 ∈ [𝑛]
Since 𝑓(𝑥) ≝ (1/(π𝑥))ⁿ is a decreasing function of 𝑥 for all 𝑥 > 0 (as 𝑛 and π are positive constants), maximizing
𝑓(𝑥) is the same as minimizing 𝑥. This yields the following problem, concluding the argument.
arg min_{𝐜∈ℝ², 𝑟≥0} 𝑟²   s.t. ‖𝐱_𝑖 − 𝐜‖₂² ≤ 𝑟² for all 𝑖 ∈ [𝑛]
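For illustration, a short Python sketch of the likelihood above on made-up points: among discs that cover all the points, the likelihood grows as the radius shrinks, and it drops to 0 as soon as any point falls outside the disc.

    import numpy as np

    def likelihood(X, c, r):
        # Product of uniform-disc densities; 0 if any point lies outside the disc
        if np.any(np.linalg.norm(X - c, axis=1) > r):
            return 0.0
        return (1.0 / (np.pi * r**2)) ** len(X)

    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # made-up data points
    c = np.array([0.5, 0.5])                             # a candidate centre

    print(likelihood(X, c, 2.00))   # loose covering disc: small positive likelihood
    print(likelihood(X, c, 0.75))   # tighter covering disc: larger likelihood
    print(likelihood(X, c, 0.50))   # too small, a point falls outside: likelihood 0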
Q4. (A one-class SVM?) For anomaly detection tasks, the “one-class” approach is often used. Given
a set of data points 𝐱1 , … , 𝐱 𝑛 ∈ ℝ𝑑 , the 1CSVM solves the following problem:
min_{𝐰∈ℝᵈ, 𝛏∈ℝⁿ, 𝜌∈ℝ} { (1/2)‖𝐰‖₂² − 𝜌 + ∑_{𝑖∈[𝑛]} 𝜉_𝑖 }   s.t. 𝐰⊤𝐱_𝑖 ≥ 𝜌 − 𝜉_𝑖 and 𝜉_𝑖 ≥ 0 for all 𝑖 ∈ [𝑛]
1. Write down the Lagrangian for this optimization problem by introducing dual variables.
2. Write down the dual problem as a max-min problem (no need to simplify it at this stage).
3. Now simplify the dual problem (eliminate 𝐰, 𝛏, 𝜌). Show major steps. (3 + 2 + 5 = 10 marks)
Introducing dual variables 𝛼𝑖 , 𝛽𝑖 , 𝑖 ∈ [𝑛] for the first and second set of constraints respectively
(styled as vectors 𝛂, 𝛃 ∈ ℝ𝑛 for notational brevity) and using 𝟏 ∈ ℝ𝑛 to denote the all-ones vector
and 𝑋 ∈ ℝ𝑛×𝑑 to denote the feature matrix allows us to write the Lagrangian in a compact form.
ℒ(𝐰, 𝛏, 𝜌, 𝛂, 𝛃) = (1/2)‖𝐰‖₂² − 𝜌 + 𝛏⊤𝟏 + 𝛂⊤(𝜌 ⋅ 𝟏 − 𝛏 − 𝑋𝐰) − 𝛃⊤𝛏
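A sketch of parts 2 and 3, following from the Lagrangian above by the standard steps:

Part 2 (dual as a max-min problem):
max_{𝛂 ≥ 𝟎, 𝛃 ≥ 𝟎} min_{𝐰∈ℝᵈ, 𝛏∈ℝⁿ, 𝜌∈ℝ} ℒ(𝐰, 𝛏, 𝜌, 𝛂, 𝛃)

Part 3 (simplification): setting the partial derivatives of ℒ w.r.t. the primal variables to zero gives
∂ℒ/∂𝐰 = 𝐰 − 𝑋⊤𝛂 = 𝟎  ⇒  𝐰 = 𝑋⊤𝛂,
∂ℒ/∂𝛏 = 𝟏 − 𝛂 − 𝛃 = 𝟎  ⇒  𝛃 = 𝟏 − 𝛂, hence 0 ≤ 𝛼_𝑖 ≤ 1 for all 𝑖 ∈ [𝑛],
∂ℒ/∂𝜌 = −1 + 𝟏⊤𝛂 = 0  ⇒  ∑_{𝑖∈[𝑛]} 𝛼_𝑖 = 1.
Substituting these back, the 𝛏 and 𝜌 terms cancel and the dual reduces to
max_{𝛂} −(1/2) 𝛂⊤𝑋𝑋⊤𝛂   s.t. 0 ≤ 𝛼_𝑖 ≤ 1 and ∑_{𝑖∈[𝑛]} 𝛼_𝑖 = 1,
i.e., min_{𝛂} (1/2) 𝛂⊤𝐺𝛂 over the same constraints, where 𝐺 = 𝑋𝑋⊤ with 𝐺_𝑖𝑗 = 𝐱_𝑖⊤𝐱_𝑗.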
Q6 (Delta Likelihood) Melbo has 𝑛 data points {𝐱 𝑖 , 𝑦𝑖 } with 𝐱 𝑖 ∈ ℝ𝑑 , 𝑦𝑖 ∈ {−1, +1}. The likelihood
of a model 𝐰 ∈ ℝ𝑑 w.r.t. data point 𝑖 is 𝑠𝑖 ≝ 1⁄(1 + exp(−𝑦𝑖 ⋅ 𝐰 ⊤ 𝐱 𝑖 )) and w.r.t. the entire data
is ℒ(𝐰) ≝ ∏𝑖∈[𝑛] 𝑠𝑖 . Notice that if the label of the 𝑗-th point is flipped (for any single 𝑗 ∈ [𝑛]), then
the likelihood of the same model 𝐰 changes to ℒ̃𝑗 (𝐰) ≝ 1⁄(1 + exp(𝑦𝑗 ⋅ 𝐰 ⊤ 𝐱𝑗 )) ⋅ (∏𝑖≠𝑗 𝑠𝑖 ).
i. Given a fixed model 𝐰, 𝑗 ∈ [𝑛], give an expression for Δ𝑗 (𝐰) ≝ ℒ̃𝑗 (𝐰)⁄ℒ(𝐰), i.e., the
factor by which likelihood of 𝐰 changes if 𝑗-th label is flipped. Give brief derivation.
ii. If 𝑛 = 5 and 𝑠1 = 0.1, 𝑠2 = 0.3, 𝑠3 = 0.9, 𝑠4 = 0.6, 𝑠5 = 0.2, find the point 𝑗 ∗ ∈ [5] for
which Δ𝑗 ∗ (𝐰) is the largest and value of Δ𝑗 ∗ (𝐰). Give brief justification.
iii. If 𝑛 = 5 and 𝑠1 = 0.4, 𝑠2 = 0.6, 𝑠3 = 0.2, 𝑠4 = 0.7, 𝑠5 = 0.8, find 𝑘 ∗ ∈ [5] for which
Δ𝑘 ∗ (𝐰) is the smallest and value of Δ𝑘 ∗ (𝐰). Give brief justification. (2 + 3 + 3 = 8 marks)
Δ_𝑗(𝐰) = exp(−𝑦_𝑗 ⋅ 𝐰⊤𝐱_𝑗), or equivalently, Δ_𝑗(𝐰) = (1 + exp(−𝑦_𝑗 ⋅ 𝐰⊤𝐱_𝑗)) / (1 + exp(𝑦_𝑗 ⋅ 𝐰⊤𝐱_𝑗))
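A worked computation for parts ii and iii (using the same expression): since 𝑠_𝑗 = 1/(1 + exp(−𝑦_𝑗 ⋅ 𝐰⊤𝐱_𝑗)), we equivalently have Δ_𝑗(𝐰) = (1 − 𝑠_𝑗)/𝑠_𝑗 = 1/𝑠_𝑗 − 1, which is strictly decreasing in 𝑠_𝑗. Thus, Δ is largest for the point with the smallest 𝑠_𝑗 and smallest for the point with the largest 𝑠_𝑗.
Part ii: 𝑗* = 1 since 𝑠_1 = 0.1 is the smallest, giving Δ_𝑗*(𝐰) = (1 − 0.1)/0.1 = 9.
Part iii: 𝑘* = 5 since 𝑠_5 = 0.8 is the largest, giving Δ_𝑘*(𝐰) = (1 − 0.8)/0.8 = 0.25.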
Note that this makes sense since in part ii, point 1 is indeed the worst classified point (misclassified
with a large margin) and thus, flipping 𝑦1 will increase the likelihood the most.
Similarly in part iii, point 5 is the best classified point (correctly classified with a large margin) and
thus, flipping 𝑦5 will decrease the likelihood the most.
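For illustration, a short Python sketch that checks parts ii and iii numerically using Δ_𝑗(𝐰) = (1 − 𝑠_𝑗)/𝑠_𝑗 (the 𝑠 values are the ones given in the question):

    # Per-point likelihoods s_j from parts ii and iii of the question
    s_part_ii  = [0.1, 0.3, 0.9, 0.6, 0.2]
    s_part_iii = [0.4, 0.6, 0.2, 0.7, 0.8]

    def deltas(s):
        # Delta_j = (1 - s_j) / s_j, the factor by which the likelihood changes
        # when the j-th label is flipped
        return [(1 - sj) / sj for sj in s]

    d_ii, d_iii = deltas(s_part_ii), deltas(s_part_iii)
    j_star = max(range(5), key=lambda j: d_ii[j]) + 1    # 1-indexed as in the question
    k_star = min(range(5), key=lambda k: d_iii[k]) + 1
    print(j_star, max(d_ii))     # 1, ~9.0
    print(k_star, min(d_iii))    # 5, 0.25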