ES Key
Instructions:
1. This question paper contains 2 pages (4 sides of paper). Please verify.
2. Write your name, roll number, department in block letters with ink on each page.
3. Write your final answers neatly with a blue/black pen. Pencil marks may get smudged.
4. Don’t overwrite/scratch answers especially in MCQ – ambiguous cases may get 0 marks.
Q1. (True-False) Write T or F for True/False (write only in the box on the right-hand side). You
must also give a brief justification for your reply in the space provided below. (3 x (1+2) = 9 marks)
2. The difference of two Mercer kernels can never be Mercer. If True, give a proof. If
False, construct two Mercer kernels 𝐾1 , 𝐾2 with maps 𝜙1 , 𝜙2 s.t. the difference
𝐾3 ≝ 𝐾1 − 𝐾2 is also a Mercer kernel with map 𝜙3 . Give maps 𝜙1 , 𝜙2 , 𝜙3 explicitly.
Answer: F
Let 𝐾1 , 𝐾2 : ℝ × ℝ → ℝ be defined as 𝐾1 (𝑥, 𝑦) ≝ 25𝑥𝑦 and 𝐾2 (𝑥, 𝑦) ≝ 16𝑥𝑦. The corresponding
feature maps are 𝜙1 (𝑥) = [5𝑥] and 𝜙2 (𝑥) = [4𝑥]. Note the feature maps are unidimensional.
We have 𝐾3 (𝑥, 𝑦) = 9𝑥𝑦 for which the feature map 𝜙3 (𝑥) = [3𝑥] works.
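The construction can be sanity-checked numerically. The following is a minimal Python sketch; the names `k1`, `k2`, `k3`, `phi3` and `check_difference_kernel`, as well as the sampling range, are illustrative assumptions and not part of the exam solution.

```python
import random

def k1(x, y):
    # K1(x, y) = 25xy, i.e. feature map phi1(x) = [5x]
    return 25 * x * y

def k2(x, y):
    # K2(x, y) = 16xy, i.e. feature map phi2(x) = [4x]
    return 16 * x * y

def k3(x, y):
    # The difference kernel K3 = K1 - K2
    return k1(x, y) - k2(x, y)

def phi3(x):
    # Claimed one-dimensional feature map for K3
    return [3 * x]

def check_difference_kernel(num_trials=1000, seed=0):
    """Compare K3(x, y) with <phi3(x), phi3(y)> on random points (assumed range [-10, 10])."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        x, y = rng.uniform(-10, 10), rng.uniform(-10, 10)
        inner = sum(a * b for a, b in zip(phi3(x), phi3(y)))
        if abs(k3(x, y) - inner) > 1e-9:
            return False
    return True
```

Such a check only confirms the identity on sampled points; the argument above is what establishes it for all inputs.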
3. For convex differentiable $f: \mathbb{R} \to \mathbb{R}$, if $f\left(\frac{x+y}{2}\right) > 1$ for some $x, y \in \mathbb{R}$, then we must
have $\max\{f(x), f(y)\} > 1$. Justify either using a proof or counter example.
Answer: T
Convex functions satisfy $f\left(\frac{x+y}{2}\right) \le \frac{f(x)+f(y)}{2}$. If $f(x) \le 1$ as well as $f(y) \le 1$ then we will have
$f\left(\frac{x+y}{2}\right) \le \frac{1+1}{2}$ i.e. $f\left(\frac{x+y}{2}\right) \le 1$ which is a contradiction. Thus, at least one of $f(x)$ or $f(y)$ must
be strictly greater than 1 which implies that $\max\{f(x), f(y)\} > 1$.
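To see the role convexity plays, the claim can be probed on random pairs. This is only an illustrative Python sketch: `claim_holds`, `find_violation`, the sampling range and the example functions are assumptions chosen for the demonstration, and random search is not a proof.

```python
import math
import random

def claim_holds(f, x, y, threshold=1.0):
    """True unless (x, y) violates: f(midpoint) > threshold implies max(f(x), f(y)) > threshold."""
    midpoint = (x + y) / 2
    if f(midpoint) <= threshold:
        return True  # premise is false, so there is nothing to check
    return max(f(x), f(y)) > threshold

def find_violation(f, num_trials=2000, seed=0, low=-3.0, high=3.0):
    """Search random pairs in [low, high] for a violation; return the pair or None."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        x, y = rng.uniform(low, high), rng.uniform(low, high)
        if not claim_holds(f, x, y):
            return (x, y)
    return None

# Two convex examples: no violation is expected.
print(find_violation(lambda t: t * t))
print(find_violation(math.exp))

# A concave function (not covered by the claim) fails at x = -2, y = 2.
print(claim_holds(lambda t: 2 - t * t, -2.0, 2.0))
```

The concave example shows that the statement genuinely depends on convexity: its midpoint value is 2 while both endpoint values are -2.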
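Q2 (Almost Uniform) Melbo is constructing a distribution $\mathcal{D}$ with
support over 2D vectors of length up to 1 i.e. $\{\mathbf{x} \in \mathbb{R}^2 : \|\mathbf{x}\|_2 \le 1\}$.
$\mathcal{D}$ has two parameters $\mathbf{c} \in \mathbb{R}^2$, $\epsilon \in [0,1]$ and assigns a high
density of $\frac{1}{\epsilon}$ in a “dense ball” of radius $\epsilon$ centered at $\mathbf{c}$ i.e., in
$\{\mathbf{x} \in \mathbb{R}^2 : \|\mathbf{x}\|_2 \le 1, \|\mathbf{x} - \mathbf{c}\|_2 \le \epsilon\}$ and a low density of $\frac{1}{2\pi}$ in the
rest of the support i.e., in $\{\mathbf{x} \in \mathbb{R}^2 : \|\mathbf{x}\|_2 \le 1, \|\mathbf{x} - \mathbf{c}\|_2 > \epsilon\}$. We
have $\|\mathbf{c}\|_2 \le 1 - \epsilon$ i.e., the dense ball stays within the support.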
a. For which values of $\epsilon$ will $\mathcal{D}$ be a proper distribution? Find them and show calculations. You
may find the fact that $\pi - \sqrt{\pi^2 - 2} \in [0,1]$ and $\pi - \sqrt{\pi^2 - 1} \in [0,1]$ to be useful.
b. Find out the mean vector $\boldsymbol{\mu} \in \mathbb{R}^2$ of this distribution. Show calculations. (5 + 7 = 12 marks)
Hint: the mean of a uniform distribution over a circle is its centre.
Find value(s) of 𝜖 for which 𝒟 is a proper distribution.
Distributions are normalized i.e., $\frac{1}{\epsilon} \cdot \pi\epsilon^2 + \frac{1}{2\pi} \cdot (\pi - \pi\epsilon^2) = 1$ i.e. $\epsilon^2 - 2\pi\epsilon + 1 = 0$. Solving the
quadratic gives us the candidate values as $\pi \pm \sqrt{\pi^2 - 1}$. However, the larger root would result in
$\epsilon = \pi + \sqrt{\pi^2 - 1} > \pi > 1$ which is absurd since that would in turn force $\|\mathbf{c}\|_2 \le 1 - \epsilon < 0$. Thus,
the only value $\epsilon$ can take is $\pi - \sqrt{\pi^2 - 1}$. Note that this value satisfies $\epsilon \in [0,1]$ using the fact
provided in the question statement.
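The normalization argument can be checked numerically. Below is a small Python sketch; the helper names `candidate_epsilons` and `total_mass` are assumptions introduced here for illustration.

```python
import math

def candidate_epsilons():
    """Both roots of eps^2 - 2*pi*eps + 1 = 0, smaller root first."""
    root = math.sqrt(math.pi ** 2 - 1)
    return math.pi - root, math.pi + root

def total_mass(eps, low_density=1 / (2 * math.pi)):
    """Mass of the density: 1/eps on the dense ball, low_density on the rest of the unit ball."""
    dense_area = math.pi * eps ** 2
    rest_area = math.pi - dense_area
    return (1 / eps) * dense_area + low_density * rest_area

small, large = candidate_epsilons()
print(small, total_mass(small))  # the admissible root and its total mass
print(large)                     # exceeds 1, so no dense ball of this radius fits
```

Note that the larger root also satisfies the quadratic algebraically; it is ruled out only by the constraint that the dense ball must fit inside the support.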
Find out the mean vector of the distribution 𝒟.
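Let $\mathcal{U}$ denote the unit ball $\{\mathbf{x} \in \mathbb{R}^2 : \|\mathbf{x}\|_2 \le 1\}$ and $\mathcal{H}$ be the dense ball $\{\mathbf{x} \in \mathbb{R}^2 : \|\mathbf{x} - \mathbf{c}\|_2 \le \epsilon\}$.
We have
$$\boldsymbol{\mu} = \int_{\mathcal{U}} \mathbf{x} \cdot \mathcal{D}(\mathbf{x})\, d\mathbf{x} = \underbrace{\int_{\mathcal{H}} \mathbf{x} \cdot \mathcal{D}(\mathbf{x})\, d\mathbf{x}}_{(A)} + \underbrace{\int_{\mathcal{U} \setminus \mathcal{H}} \mathbf{x} \cdot \mathcal{D}(\mathbf{x})\, d\mathbf{x}}_{(B)}.$$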
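$(A) = \frac{1}{\epsilon} \int_{\mathcal{H}} \mathbf{x}\, d\mathbf{x}$. Now $\int_{\mathcal{H}} \mathbf{x}\, d\mathbf{x} = \pi\epsilon^2 \cdot \int_{\mathcal{H}} \mathbf{x} \cdot \mathcal{P}(\mathbf{x})\, d\mathbf{x}$ where $\mathcal{P}(\mathbf{x}) = \frac{1}{\pi\epsilon^2}$ is the (conditional)
uniform distribution inside the dense ball. As the mean of a uniform distribution over a circle is its
centre, we have $\int_{\mathcal{H}} \mathbf{x} \cdot \mathcal{P}(\mathbf{x})\, d\mathbf{x} = \mathbf{c}$ which gives us $(A) = \frac{1}{\epsilon} \cdot \pi\epsilon^2 \cdot \mathbf{c} = \pi\epsilon \cdot \mathbf{c}$.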
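$$(B) = \frac{1}{2\pi} \int_{\mathcal{U} \setminus \mathcal{H}} \mathbf{x}\, d\mathbf{x} = \frac{1}{2\pi} \Big( \underbrace{\int_{\mathcal{U}} \mathbf{x}\, d\mathbf{x}}_{(C)} - \underbrace{\int_{\mathcal{H}} \mathbf{x}\, d\mathbf{x}}_{(D)} \Big).$$
Using the same argument as above, we get
$(C) = \pi \cdot 1^2 \cdot \mathbf{0}$ and $(D) = \pi\epsilon^2 \cdot \mathbf{c}$ which gives us $(B) = -\frac{\epsilon^2}{2} \cdot \mathbf{c}$ giving us $\boldsymbol{\mu} = \left(\pi\epsilon - \frac{\epsilon^2}{2}\right) \cdot \mathbf{c}$.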
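However, recall that $\epsilon$ satisfies $\epsilon^2 - 2\pi\epsilon + 1 = 0$ i.e. $2\pi\epsilon - \epsilon^2 = 1$, which means $\boldsymbol{\mu} = \frac{1}{2} \cdot \mathbf{c}$.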
Can you simplify these calculations? What if the low density is some general value $p_l \ne \frac{1}{2\pi}$?
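The value of the mean can be cross-checked by brute-force integration on a grid. The following Python sketch uses the midpoint rule; the function name `mean_by_grid`, the grid resolution and the example centre (0.5, 0.3) are assumptions made for illustration, and the result is only approximate because both the density and the support have sharp boundaries.

```python
import math

def mean_by_grid(c, eps, n=400):
    """Approximate the total mass and the first moment of the density on an n x n midpoint grid."""
    h = 2.0 / n                      # cell width on the square [-1, 1]^2
    high = 1 / eps                   # density inside the dense ball
    low = 1 / (2 * math.pi)          # density in the rest of the unit ball
    mass = mean_x = mean_y = 0.0
    for i in range(n):
        x = -1 + (i + 0.5) * h
        for j in range(n):
            y = -1 + (j + 0.5) * h
            if x * x + y * y > 1:
                continue             # outside the support
            inside = (x - c[0]) ** 2 + (y - c[1]) ** 2 <= eps * eps
            weight = (high if inside else low) * h * h
            mass += weight
            mean_x += weight * x
            mean_y += weight * y
    return (mean_x, mean_y), mass

eps = math.pi - math.sqrt(math.pi ** 2 - 1)
c = (0.5, 0.3)                       # hypothetical centre with ||c|| <= 1 - eps
mean, mass = mean_by_grid(c, eps)
print(mass)                          # should be close to 1
print(mean)                          # should be close to c / 2 = (0.25, 0.15)
```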
CS 771A: Intro to Machine Learning, IIT Kanpur Endsem Exam (16 July 2024)
Name MELBO 40 marks
Roll No 24007 Dept. AWSM
1. Write the Lagrangian for this problem by introducing dual variables (no derivation needed).
2. Simplify the dual problem (eliminate 𝐰) – show major steps. Assume 𝑋 ⊤ 𝑋 is invertible.
3. Give a coordinate descent/ascent algorithm to solve the dual. (2 + 4 + 6 = 12 marks)
Write down the Lagrangian here (you will need to introduce dual variables and give them names)
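$$\mathcal{L}(\mathbf{w}, \boldsymbol{\alpha}) = \frac{1}{2} \|X\mathbf{w} - \mathbf{y}\|_2^2 - \boldsymbol{\alpha}^\top \mathbf{w}$$
which can be rewritten for convenience as
$$\mathcal{L}(\mathbf{w}, \boldsymbol{\alpha}) = \frac{1}{2} \mathbf{w}^\top X^\top X \mathbf{w} - \mathbf{w}^\top X^\top \mathbf{y} - \mathbf{w}^\top \boldsymbol{\alpha} + \frac{1}{2} \|\mathbf{y}\|_2^2$$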
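The dual is $\max_{\boldsymbol{\alpha} \ge \mathbf{0}} \left\{ \min_{\mathbf{w}} \left\{ \mathcal{L}(\mathbf{w}, \boldsymbol{\alpha}) \right\} \right\}$. Solving the inner problem by applying first-order optimality (since
it is an unconstrained problem) gives us $\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{0} \Rightarrow X^\top (X\mathbf{w} - \mathbf{y}) - \boldsymbol{\alpha} = \mathbf{0}$ i.e. $\mathbf{w} = (X^\top X)^{-1} (X^\top \mathbf{y} + \boldsymbol{\alpha})$. Putting this in the
Lagrangian, neglecting constant terms and negating the objective (which turns the maximization into a minimization) gives us
$$\min_{\boldsymbol{\alpha} \ge \mathbf{0}} \left\{ \frac{1}{2} \boldsymbol{\alpha}^\top C \boldsymbol{\alpha} + \boldsymbol{\alpha}^\top \mathbf{s} \right\}$$
where $C = (X^\top X)^{-1}$ and $\mathbf{s} = (X^\top X)^{-1} X^\top \mathbf{y}$.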
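The elimination of 𝐰 can be checked numerically on a tiny instance. The Python sketch below assumes two features (so a hand-written 2 x 2 inverse suffices) and takes 𝐶 and 𝐬 to be the inverse of 𝑋⊤𝑋 and that inverse times 𝑋⊤𝐲, as obtained from the stationarity condition; all function names and the example data are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def mat_vec(A, v):
    return [dot(row, v) for row in A]

def transpose_times(X, v):
    """Return X^T v for X given as a list of rows."""
    return [sum(row[k] * vi for row, vi in zip(X, v)) for k in range(len(X[0]))]

def inverse_2x2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def lagrangian(X, y, w, alpha):
    residual = [dot(row, w) - yi for row, yi in zip(X, y)]
    return 0.5 * dot(residual, residual) - dot(alpha, w)

def dual_pieces(X, y):
    """Assumed definitions: C = (X^T X)^{-1}, s = C X^T y (two features only)."""
    gram = [[sum(row[p] * row[q] for row in X) for q in range(2)] for p in range(2)]
    C = inverse_2x2(gram)
    return C, mat_vec(C, transpose_times(X, y))

def inner_minimum(X, y, alpha):
    """Value of the Lagrangian at the stationary point w = C (X^T y + alpha)."""
    C, _ = dual_pieces(X, y)
    b = transpose_times(X, y)
    w_star = mat_vec(C, [bi + ai for bi, ai in zip(b, alpha)])
    return lagrangian(X, y, w_star, alpha)

def dual_objective(X, y, alpha):
    C, s = dual_pieces(X, y)
    return 0.5 * dot(alpha, mat_vec(C, alpha)) + dot(alpha, s)

# Hypothetical data: the sum below should not depend on alpha.
X = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
y = [1.0, -1.0, 2.0]
for alpha in ([0.0, 0.0], [1.0, 2.0], [0.5, 3.0]):
    print(inner_minimum(X, y, alpha) + dual_objective(X, y, alpha))
```

If the derivation is right, the inner minimum equals a constant minus the dual objective, so the printed values should agree up to rounding.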
Consider a single coordinate of the dual variable, say $\alpha_i$ (the coordinate may have been chosen
cyclically or via random permutation, etc.). The optimization problem restricted to $\alpha_i$ is
$$\min_{\alpha_i \ge 0} \frac{1}{2} c_{ii} \alpha_i^2 + \alpha_i \Big( s_i + \sum_{j \ne i} c_{ij} \alpha_j \Big)$$
Using the QUIN trick tells us that the optimal value is $\max\left\{0, -\frac{1}{c_{ii}} \Big( s_i + \sum_{j \ne i} c_{ij} \alpha_j \Big)\right\}$.
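The coordinate-wise update can be turned into a short routine. The following is a minimal Python sketch with cyclic coordinate selection; the function name `coordinate_descent`, the stopping rule, the zero initialization and the example inputs are assumptions made here, and 𝐶 is assumed symmetric with a positive diagonal.

```python
def coordinate_descent(C, s, num_sweeps=100, tol=1e-12):
    """Minimise 0.5 * a^T C a + a^T s over a >= 0 by cyclic coordinate updates."""
    n = len(s)
    alpha = [0.0] * n                          # assumed starting point
    for _ in range(num_sweeps):
        largest_change = 0.0
        for i in range(n):
            # Linear coefficient seen by coordinate i with the others held fixed
            off_diag = sum(C[i][j] * alpha[j] for j in range(n) if j != i)
            # Minimiser of the one-dimensional quadratic, clipped at zero
            new_value = max(0.0, -(s[i] + off_diag) / C[i][i])
            largest_change = max(largest_change, abs(new_value - alpha[i]))
            alpha[i] = new_value
        if largest_change < tol:               # assumed stopping rule
            break
    return alpha

# Hypothetical example: the unconstrained minimiser has a negative second
# coordinate, so the constrained solution clips it to zero.
print(coordinate_descent([[2.0, 1.0], [1.0, 2.0]], [-2.0, 1.0]))  # [1.0, 0.0]
```

Choosing coordinates by random permutation instead would only change the order of the inner loop.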
Q4. (Kernel Smash) $K_1, K_2, K_3: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ are Mercer kernels i.e., for any $x, y \in \mathbb{R}$, we have
$K_i(x, y) = \langle \phi_i(x), \phi_i(y) \rangle$ with $\phi_1(x) = (1, x)$, $\phi_2(x) = (x, x^2)$, $\phi_3(x) = (x^2, x^4, x^6)$. Design a
map $\phi_4: \mathbb{R} \to \mathbb{R}^7$ for kernel $K_4$ s.t. $K_4(x, y) = \left(K_1(x, y) - K_2(x, y)\right)^2 + 3K_3(x, y)$ for all $x, y \in \mathbb{R}$.
No derivation needed. Note that $\phi_4$ must not use more than 7 dimensions. If your solution does
not require 7 dimensions then leave the rest of the dimensions blank or fill with zero. (7 marks)
$\phi_4(x) = \left(1, x^2, 2x^4, \sqrt{3}\, x^6, 0, 0, 0\right)$
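Since no derivation is asked for, a numerical check is a convenient way to confirm the map. The Python sketch below compares the kernel value with the inner product of the 7-dimensional features on random points; the function names, the sampling range and the tolerances are illustrative assumptions.

```python
import math
import random

def k1(x, y):
    return 1 + x * y                                        # phi1(x) = (1, x)

def k2(x, y):
    return x * y + x ** 2 * y ** 2                          # phi2(x) = (x, x^2)

def k3(x, y):
    return x ** 2 * y ** 2 + x ** 4 * y ** 4 + x ** 6 * y ** 6   # phi3(x) = (x^2, x^4, x^6)

def k4(x, y):
    return (k1(x, y) - k2(x, y)) ** 2 + 3 * k3(x, y)

def phi4(x):
    # The 7-dimensional map from the answer; the last three slots are padding
    return [1.0, x ** 2, 2 * x ** 4, math.sqrt(3) * x ** 6, 0.0, 0.0, 0.0]

def check_phi4(num_trials=1000, seed=0):
    """Compare K4(x, y) with <phi4(x), phi4(y)> on random points (assumed range [-2, 2])."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        x, y = rng.uniform(-2, 2), rng.uniform(-2, 2)
        inner = sum(a * b for a, b in zip(phi4(x), phi4(y)))
        if not math.isclose(k4(x, y), inner, rel_tol=1e-9, abs_tol=1e-9):
            return False
    return True
```

The check works because expanding the kernel leaves only the monomials 1, 𝑥²𝑦², 𝑥⁴𝑦⁴ and 𝑥⁶𝑦⁶, with coefficients 1, 1, 4 and 3.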