
Machine Learning Techniques - End semester Papers

December 20, 2024


Contents

1 Mock 1
2 Mock 2 2024
List of Figures
1 Deep Trees
2 Covariance Matrix
3 Sigmoid
4 Kernel
5 Logistic Regression
6 Logistic Regression
7 Logistic Regression
8 Logistic Regression
9 AdaBoost
10 Training Error
11 K-Means clustering
12 Information Gain
13 Linear regression
14 Linear regression

1 Mock 1
1. Consider the prediction of the label for a data point x in a logistic regression model:

ŷ = 1 if P(y = 1 | x) ≥ T, and ŷ = 0 otherwise.

Here:
• T is called the threshold and is a real number in the interval (0, 1).
• ŷ stands for the predicted label.
• The equation of the decision boundary is:

w^T x − u = 0

If T = e/(1 + e), find the value of the unknown quantity u. Enter the closest integer as your answer.

Theory and Explanation


Logistic regression predicts the probability of a label y = 1 as:

P(y = 1 | x) = σ(w^T x) = 1 / (1 + e^{−w^T x})

Here:
• σ(·) is the sigmoid function.
• w^T x represents the linear transformation of the feature vector x.

The decision boundary is determined where P(y = 1 | x) = T. Substituting the given threshold T = e/(1 + e):

1 / (1 + e^{−w^T x}) = e / (1 + e)

Rewriting:

1 + e^{−w^T x} = (1 + e) / e

e^{−w^T x} = (1 + e)/e − 1 = (1 + e − e)/e = 1/e

Taking the natural logarithm on both sides:

−w^T x = ln(1/e) = −1

w^T x = 1

From the decision boundary equation w^T x − u = 0, we know that on the boundary w^T x = u. Thus:

u = 1

Solution
The value of u is:
1

Multiple Choice Options


a) 0 b) 1
c) 2 d) -1

The correct answer is 1, as derived above.
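As a quick numerical sanity check (a minimal sketch in Python, not part of the original paper), the snippet below verifies that σ(1) equals the threshold e/(1 + e), so the prediction switches to class 1 exactly where w^T x = 1:

```python
import math

def sigmoid(z):
    # Standard logistic sigmoid: 1 / (1 + e^{-z})
    return 1.0 / (1.0 + math.exp(-z))

T = math.e / (1.0 + math.e)   # the given threshold e / (1 + e)

# At w^T x = 1 the predicted probability equals the threshold exactly.
print(sigmoid(1.0), T)        # both print 0.7310585786300049
```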


2. The loss function for a modified linear regression model is given as:

L(w) = (1/2) Σ_{i=1}^{n} r_i (w^T x_i − y_i)^2

Here:
• n: Number of data points in the training dataset.
• r_i: A constant in the interval [0, 1] associated with each data point i.
• w: The weight vector.
• x_i: Feature vector of the i-th data point.
• y_i: True label of the i-th data point.
Find the gradient of L(w) with respect to w.

Theory and Explanation


The gradient of the loss function L(w) with respect to w is

∇_w L(w) = ∂L(w)/∂w.

Substituting the given loss function:

L(w) = (1/2) Σ_{i=1}^{n} r_i (w^T x_i − y_i)^2

Step 1: Differentiate with respect to w

Since the derivative of a sum is the sum of the derivatives:

∂L(w)/∂w = (1/2) Σ_{i=1}^{n} ∂/∂w [ r_i (w^T x_i − y_i)^2 ]

Step 2: Differentiate each term

For each term r_i (w^T x_i − y_i)^2, apply the chain rule:

∂/∂w [ r_i (w^T x_i − y_i)^2 ] = r_i · 2 (w^T x_i − y_i) · ∂/∂w (w^T x_i)

The derivative of w^T x_i with respect to w is x_i. Substituting:

∂/∂w [ r_i (w^T x_i − y_i)^2 ] = 2 r_i (w^T x_i − y_i) x_i

Step 3: Combine terms

Substituting back into the sum and cancelling the factor of 1/2:

∇_w L(w) = Σ_{i=1}^{n} r_i (w^T x_i − y_i) x_i

Final Expression for the Gradient

The gradient of L(w) with respect to w is:

∇_w L(w) = Σ_{i=1}^{n} r_i (w^T x_i − y_i) x_i

Multiple Choice Options

a) Σ_{i=1}^{n} r_i (w^T x_i − y_i) x_i
b) Σ_{i=1}^{n} r_i (w^T x_i − y_i)
c) Σ_{i=1}^{n} (w^T x_i − y_i) x_i
d) Σ_{i=1}^{n} r_i (w^T x_i − y_i)^2 x_i

The correct answer is:

Σ_{i=1}^{n} r_i (w^T x_i − y_i) x_i
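To make the result concrete, here is a small numerical check (a sketch using NumPy with made-up data, not part of the original solution) comparing the analytic gradient Σ r_i (w^T x_i − y_i) x_i with a finite-difference approximation of L(w):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))        # feature vectors x_i as rows
y = rng.normal(size=n)             # true labels y_i
r = rng.uniform(0, 1, size=n)      # per-point constants r_i in [0, 1]
w = rng.normal(size=d)

def loss(w):
    # L(w) = (1/2) * sum_i r_i (w^T x_i - y_i)^2
    return 0.5 * np.sum(r * (X @ w - y) ** 2)

# Analytic gradient: sum_i r_i (w^T x_i - y_i) x_i
grad_analytic = X.T @ (r * (X @ w - y))

# Central finite-difference gradient for comparison
eps = 1e-6
grad_numeric = np.array([
    (loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
    for e in np.eye(d)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```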

3. A hard-margin linear SVM is trained for a 2D problem. The optimal weight vector is:

w = (2, −1)^T.

Consider a unit square whose corners are at:

(0, 0), (1, 0), (0, 1), (1, 1).

A point is picked at random from the square. What is the probability that this point is predicted as belonging to class +1 by the model?


Solution

Step 1: SVM Decision Rule

The decision boundary of a linear SVM is defined as:

w^T x + b = 0,

where:
• w is the weight vector,
• b is the bias term,
• x = (x1, x2)^T is a point in R^2.
The decision rule predicts:

Class +1 if w^T x + b > 0, Class −1 if w^T x + b < 0.

Substituting w = (2, −1)^T:

2x1 − x2 + b = 0.

The decision boundary equation becomes:

x2 = 2x1 + b.

Step 2: Determining the Bias Term b

The bias term b is not provided in the problem. For simplicity, assume b = 0, which means the decision boundary passes through the origin.
The equation simplifies to:
x2 = 2x1 .

Step 3: Analyzing the Unit Square

The unit square is defined by:


0 ≤ x1 ≤ 1, 0 ≤ x2 ≤ 1.

The decision boundary x2 = 2x1 divides the square into two regions:
• Points below the line x2 = 2x1 are predicted as class +1.
• Points above the line x2 = 2x1 are predicted as class −1.

Step 4: Finding the Area Below the Line x2 = 2x1

The line x2 = 2x1 intersects the unit square at:

• (x1, x2) = (0, 0) (bottom-left corner),
• (x1, x2) = (0.5, 1) (intersection with the top edge at x2 = 1).

The part of the square above the line (predicted class −1) is the triangle with vertices (0, 0), (0.5, 1), (0, 1), whose area is

Area above = (1/2) · base · height = (1/2) · 1 · 0.5 = 0.25.

Everything else in the square lies below the line and is therefore predicted as class +1:

Area below = 1 − 0.25 = 0.75.

Step 5: Probability of Class +1

The probability of a point being predicted as class +1 is the ratio of the area below the line to the total area of the square:

Probability = (Area below the line) / (Total area of the square) = 0.75 / 1 = 0.75.

Conclusion
The probability that a randomly chosen point from the square is predicted as class +1 is:

0.75.
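A quick Monte Carlo check (a sketch, assuming b = 0 as in the solution above) confirms this value:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([2.0, -1.0])

# Sample points uniformly from the unit square and apply the decision rule sign(w^T x).
points = rng.uniform(0.0, 1.0, size=(1_000_000, 2))
predicted_positive = points @ w > 0

print(predicted_positive.mean())   # ≈ 0.75
```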

4. The MNIST digit classification problem has 10 classes. A training dataset contains n data points, with an equal number of points from each
of the 10 classes. A dummy classifier is defined such that for each input data point, it randomly picks one of the 10 classes as its prediction.
The accuracy of a model on a dataset is defined as the proportion of points that it classifies correctly. The goal is to determine the accuracy
of the dummy classifier as n → ∞.

Solution

Understanding the Problem

The dummy classifier predicts one of 10 classes randomly for each data point, independent of the input features. This implies that the classifier assigns each class with equal probability 1/10.
Since the training dataset has an equal number of data points for each class:
• The true class label for any given data point is uniformly distributed across the 10 classes.
• The probability of the dummy classifier correctly predicting the class for a single data point is 1/10.

Expected Accuracy

The accuracy of the dummy classifier is the proportion of data points classified correctly. For n data points, the number of correct classifications C follows a Binomial distribution:

C ∼ Binomial(n, p)

where p = 1/10 is the probability of a correct classification.
The accuracy A is given by:

A = C / n

Taking the expectation of A:

E[A] = E[C / n] = E[C] / n

Since E[C] = n · p:

E[A] = (n · (1/10)) / n = 1/10

As n → ∞, the Law of Large Numbers ensures that the observed accuracy converges to the expected accuracy. Thus:

A = 1/10

Final Answer

The accuracy of the dummy classifier as n becomes very large is:

0.1
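The following short simulation (a sketch with synthetic labels, not part of the original solution) illustrates the convergence to 0.1 as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (100, 10_000, 1_000_000):
    true_labels = rng.integers(0, 10, size=n)    # balanced in expectation
    dummy_preds = rng.integers(0, 10, size=n)    # random guesses by the dummy classifier
    accuracy = np.mean(true_labels == dummy_preds)
    print(n, round(accuracy, 4))                  # tends to 0.1 as n grows
```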

5. Consider a neural network designed for an image classification problem:


• When trained on the original dataset D1 (train + test), the network N1 performs well on the test data.
• The images in D1 are then modified by turning all of them upside down, creating a new dataset D2 (train + test).
• A new network N2 with the same architecture as N1 is trained from scratch on D2 .
Which of the following statements is correct regarding the performance of N2 on D2 ?


Options
a) The network N2 will not be able to learn anything from D2 . Its test accuracy on D2 will be very low.
b) The network N2 will be able to learn useful patterns from D2 . In fact, the performance of network N2 on D2 will be similar to N1 on
D1 .
c) The network N2 will be able to learn somewhat useful patterns from D2 . But the performance of N1 on D1 will be much better than
N2 on D2 .

Analysis and Solution

Impact of Flipping Images on Learning

When all images in the dataset are turned upside down:


• The underlying features in the images (edges, textures, shapes) remain intact but appear in different spatial positions.
• A neural network learns patterns based on local features and spatial relationships, so it can adapt to the flipped orientation during
training on D2 .

Factors to Consider

1. Learning useful patterns: A convolutional neural network (CNN), which is typically used for image classification, has no built-in preference for a particular orientation; it learns whatever orientation is present in its training data. When trained from scratch on D2, N2 can learn to recognize patterns in the upside-down orientation, achieving performance comparable to N1 on D1.
2. Performance on D2: Flipping every image is a fixed, information-preserving transformation, so the distribution of images and labels in D2 carries the same information as D1. There is therefore no reason to expect N2 to perform worse than N1, assuming sufficient training.

Conclusion

The network N2 will be able to learn useful patterns from D2 , and its performance on D2 will be similar to N1 on D1 .

The correct option is:

The network N2 will be able to learn useful patterns from D2 . In fact, the performance of network N2 on D2 will be similar to N1 on D1 .

6. We are tasked to compute the hinge loss for a soft-margin, linear SVM on the given dataset. The weight vector is:

w = (0, 1)^T

The dataset is:

x1   x2   y
 2    1    1
−2    1    1
−1    2    1
 0    2   −1
 1   −1   −1
 2   −2   −1
−2    0   −1

Theory and Equations


For a soft-margin SVM, the total hinge loss over the dataset is defined as:

L_hinge = Σ_{i=1}^{n} max(0, 1 − y_i (w^T x_i))

where:
• yi is the true label of the i-th data point.
• xi is the feature vector of the i-th data point.
• w is the weight vector.
• wT xi is the dot product of the weight vector and feature vector.
The total hinge loss is the sum of the hinge loss over all data points.


Step-by-Step Solution

The weight vector w = [0, 1]T means that wT xi = 0 · x1 + 1 · x2 = x2 . Thus, for each data point, the hinge loss is computed as:

max(0, 1 − yi · x2 )

Hinge Loss Calculation for Each Data Point

x1 x2 y Hinge Loss
2 1 1 max(0, 1 − 1 · 1) = max(0, 0) = 0
−2 1 1 max(0, 1 − 1 · 1) = max(0, 0) = 0
−1 2 1 max(0, 1 − 1 · 2) = max(0, −1) = 0
0 2 −1 max(0, 1 − (−1) · 2) = max(0, 3) = 3
1 −1 −1 max(0, 1 − (−1) · −1) = max(0, 0) = 0
2 −2 −1 max(0, 1 − (−1) · −2) = max(0, −1) = 0
−2 0 −1 max(0, 1 − (−1) · 0) = max(0, 1) = 1

Total Hinge Loss

The total hinge loss is the sum of the hinge losses for all data points:

Lhinge = 0 + 0 + 0 + 3 + 0 + 0 + 1 = 4

Conclusion

The total hinge loss for the dataset is:


4
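A short NumPy check (a sketch reproducing the table above):

```python
import numpy as np

w = np.array([0.0, 1.0])
X = np.array([[2, 1], [-2, 1], [-1, 2], [0, 2], [1, -1], [2, -2], [-2, 0]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1, -1], dtype=float)

# Hinge loss per point: max(0, 1 - y_i * w^T x_i)
losses = np.maximum(0.0, 1.0 - y * (X @ w))
print(losses)        # [0. 0. 0. 3. 0. 0. 1.]
print(losses.sum())  # 4.0
```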

7. Consider a logistic regression model for a binary classification problem with two features x1 and x2 , and labels 1 and 0. The horizontal axis
represents x1 and the vertical axis represents x2 .
You are given two feature vectors:

x1 = (1, 3)^T,   x2 = (−1, 3)^T

The weight vector w makes an angle θ with the positive x1 -axis. Each θ corresponds to a different classifier. For what range of θ are both x1
and x2 predicted to belong to class-1?

Solution

Logistic Regression Decision Rule


The decision boundary for logistic regression is determined by the sign of w^T x:

y = 1 if w^T x > 0, and y = 0 if w^T x ≤ 0.

Here:

w = (w1, w2)^T and w^T x = w1 x1 + w2 x2

Let w = (cos θ, sin θ)^T, where θ is the angle w makes with the x1-axis.

Condition for x1 Belonging to Class-1


For x1 = [1, 3]T , the classifier predicts class-1 if:
wT x1 = w1 · 1 + w2 · 3 > 0

Substitute w1 = cos θ and w2 = sin θ:


cos θ · 1 + sin θ · 3 > 0
cos θ + 3 sin θ > 0


Condition for x2 Belonging to Class-1

For x2 = [−1, 3]T , the classifier predicts class-1 if:


wT x2 = w1 · (−1) + w2 · 3 > 0

Substitute w1 = cos θ and w2 = sin θ:


cos θ · (−1) + sin θ · 3 > 0
− cos θ + 3 sin θ > 0

Combine the Conditions

The classifier predicts both x1 and x2 to belong to class-1 if:

cos θ + 3 sin θ > 0   (Condition 1)

−cos θ + 3 sin θ > 0   (Condition 2)

Each condition can be rewritten in the form R sin(θ ± α). With α = tan⁻¹(1/3) ≈ 18.43°:

cos θ + 3 sin θ = √10 sin(θ + α) > 0  ⟹  −α < θ < 180° − α

−cos θ + 3 sin θ = √10 sin(θ − α) > 0  ⟹  α < θ < 180° + α

Final Range of θ
The intersection of the two conditions gives:

tan⁻¹(1/3) < θ < 180° − tan⁻¹(1/3), i.e., approximately 18.43° < θ < 161.57°.

Geometrically, w^T x > 0 exactly when w makes an angle of less than 90° with x. The vector x1 = (1, 3) points at 71.57° and x2 = (−1, 3) at 108.43°, so w must lie within 90° of both directions, which again gives 18.43° < θ < 161.57°.

Answer
The range of θ for which both x1 and x2 are predicted to belong to class-1 is:

tan⁻¹(1/3) < θ < 180° − tan⁻¹(1/3)   (approximately 18.43° < θ < 161.57°).
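A brute-force scan over θ (a sketch, not part of the original solution) confirms this range:

```python
import numpy as np

x1 = np.array([1.0, 3.0])
x2 = np.array([-1.0, 3.0])

thetas = np.arange(0.0, 360.0, 0.01)   # angles in degrees
w = np.column_stack([np.cos(np.radians(thetas)), np.sin(np.radians(thetas))])

# Both points are predicted class-1 when both dot products are positive.
both_class1 = (w @ x1 > 0) & (w @ x2 > 0)
valid = thetas[both_class1]
print(valid.min(), valid.max())   # ≈ 18.44 and ≈ 161.56 degrees
```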

8. We are given the following data points for a regression problem:

(−3, 3), (0, 4), (1, 12), (3, 15), (4, 16)

The goal is to fit a constant model y = c to this data by minimizing the Mean Squared Error (MSE). The task is to find the best estimate for c.

Solution

Mean Squared Error (MSE) Loss Function


The MSE loss for a model y = c is given by:

MSE = (1/n) Σ_{i=1}^{n} (y_i − c)^2
where:
• yi are the true labels,
• c is the constant model, and
• n is the number of data points.


Minimization of the MSE

To find the value of c that minimizes the MSE, take the derivative of the loss function with respect to c:

∂/∂c MSE = ∂/∂c [ (1/n) Σ_{i=1}^{n} (y_i − c)^2 ] = (1/n) Σ_{i=1}^{n} −2 (y_i − c)

Set the derivative to zero to find the optimal c:

(1/n) Σ_{i=1}^{n} −2 (y_i − c) = 0

Simplify:

Σ_{i=1}^{n} (y_i − c) = 0

Σ_{i=1}^{n} y_i − n · c = 0

c = (1/n) Σ_{i=1}^{n} y_i

Thus, the optimal c is the mean of the y_i values.

Calculation of c

The labels y_i are: 3, 4, 12, 15, 16. Compute their mean:

c = (3 + 4 + 12 + 15 + 16) / 5 = 50 / 5 = 10

Final Answer

The best estimate for c is:


10
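A one-line check of the arithmetic (a sketch):

```python
ys = [3, 4, 12, 15, 16]
c = sum(ys) / len(ys)
print(c)   # 10.0, the MSE-optimal constant prediction
```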

9. We are tasked with determining if the given data points in R^2 are linearly separable. The data points (rows of X) are:

X =
[  1   2 ]
[  3  −4 ]
[  5   0 ]
[ −1  −2 ]
[ −3   4 ]
[ −2  −6 ]

with corresponding labels:

y = (1, 1, 1, −1, −1, −1)^T


Definition of Linear Separability


A dataset is linearly separable if there exists a hyperplane such that:

wT x + b > 0 for all points with y = 1

wT x + b < 0 for all points with y = −1

In 2D, this means we must find a line that separates the points with y = 1 (positive class) from those with y = −1 (negative class).

Approach to Solution
We will plot the points in R2 and visually inspect if such a line exists. Additionally, we verify the linear separability mathematically.

Points with y = 1 (Positive Class)


The positive class points are:
{(1, 2), (3, −4), (5, 0)}

Points with y = −1 (Negative Class)


The negative class points are:
{(−1, −2), (−3, 4), (−2, −6)}

Analysis
To determine linear separability:
• If a straight line can divide the positive and negative classes such that all points of one class lie on one side of the line and all points of the other class lie on the other side, the dataset is linearly separable.
• If no such line exists, the dataset is not linearly separable.
Inspecting the coordinates, every positive-class point has x1 > 0 (x1 ∈ {1, 3, 5}) and every negative-class point has x1 < 0 (x1 ∈ {−1, −3, −2}). The vertical line x1 = 0 (i.e., w = (1, 0)^T, b = 0) therefore places all positive points on one side and all negative points on the other.

Conclusion

The given set of data points is:

Linearly separable (for example, by the line x1 = 0).
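A direct check (a sketch) that the separator x1 = 0 classifies every point correctly:

```python
import numpy as np

X = np.array([[1, 2], [3, -4], [5, 0], [-1, -2], [-3, 4], [-2, -6]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

w = np.array([1.0, 0.0])          # separator x1 = 0, i.e., predict sign(x1)
predictions = np.sign(X @ w)
print(np.all(predictions == y))   # True: the data is linearly separable
```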

10. In the context of Naive Bayes classification, the feature vector is:

x = (x1, …, xd)^T

where x represents the features, and y is the label.

Which of the following corresponds to the "naive assumption" in Naive Bayes classification?

a) P(x1, …, xd | y) = P(x1, …, xd) · P(y)
b) P(x1, …, xd | y) = ∏_{i=1}^{d} P(xi | y)
c) P(y | x1, …, xd) = P(y)
d) P(x1, …, xd, y) = P(x1, …, xd) · P(y)

Theory and Explanation

Naive Bayes is a probabilistic classification method that relies on the following "naive assumption":
• The features x1, x2, …, xd are conditionally independent given the class y.
This assumption simplifies the joint conditional probability P(x1, …, xd | y) as:

P(x1, …, xd | y) = ∏_{i=1}^{d} P(xi | y)

This is the "naive assumption" that allows efficient computation in Naive Bayes models.

Correct Answer

The "naive assumption" corresponds to:

P(x1, …, xd | y) = ∏_{i=1}^{d} P(xi | y)

11. In the context of Naive Bayes classification, the feature vector is:

x = (x1, …, xd)^T

where x represents the features, and y is the label.

Which of the following corresponds to the "naive assumption" in Naive Bayes classification?

a) P(x1, …, xd | y) = P(x1, …, xd) · P(y)
b) P(x1, …, xd | y) = ∏_{i=1}^{d} P(xi | y)
c) P(y | x1, …, xd) = P(y)
d) P(x1, …, xd, y) = P(x1, …, xd) · P(y)

Theory and Explanation
Naive Bayes is a probabilistic classification model that relies on the "naive assumption":
• The features x1, x2, …, xd are conditionally independent given the class label y.
This assumption simplifies the joint conditional probability P(x1, x2, …, xd | y), which would otherwise involve modelling complex dependencies between the features.

Mathematical Representation of the Naive Assumption

Under the naive assumption, the joint probability P(x1, x2, …, xd | y) can be factorized as:

P(x1, x2, …, xd | y) = ∏_{i=1}^{d} P(xi | y)

This assumption greatly reduces the computational complexity of calculating probabilities.

Correct Answer

From the given options, the "naive assumption" corresponds to:

P(x1, …, xd | y) = ∏_{i=1}^{d} P(xi | y)

12. Consider the following statements in the context of a hard-margin linear Support Vector Machine (SVM):
a) Every support vector lies on one of the two supporting hyperplanes.
b) Every point on one of the two supporting hyperplanes is a support vector.
Determine the correctness of the statements.


Analysis

Statement (1): Every support vector lies on one of the two supporting hyperplanes
In the context of a hard-margin linear SVM:
• Support vectors are the data points that lie exactly on one of the two supporting hyperplanes. These hyperplanes are defined as:
wT x + b = 1 (for one class)
wT x + b = −1 (for the other class).
• All other points lie either on the correct side of the margin or beyond it.
Thus, every support vector lies on one of the two supporting hyperplanes.
Statement (1) is true.

Statement (2): Every point on one of the two supporting hyperplanes is a support vector
While every support vector lies on one of the two supporting hyperplanes, not all points on the hyperplanes are guaranteed to be support
vectors:
• If there are redundant points (e.g., duplicate or linearly dependent points) on the supporting hyperplanes, they may not actively con-
tribute to defining the margin.
• Only a minimal subset of points on the hyperplanes that define the margin are considered support vectors.
Thus, not every point on the supporting hyperplanes is necessarily a support vector.
Statement (2) is false.

Conclusion
From the analysis, the correct answer is:
Only (1) is true.

13. Consider a binary classification task in R2 with two features. The dataset consists of four points, with two positive points (denoted by +)
and two negative points (denoted by −). An arbitrary linear classifier is used, and the decision boundary does not pass through any of the
four points. The points are shown in the following diagram:

The task is to find the possible values of the accuracy (proportion of points correctly classified) of the classifier. Assume that all options are
independent of each other, and multiple options could be correct.

a) 0 b) 0.25 c) 0.5 d) 0.75


e) 1


Theory and Explanation

Linear Classifier and Decision Boundary

A linear classifier partitions the plane into two regions using a straight line. Any point on one side of the line is classified into one class (e.g.,
positive), while points on the other side are classified into the opposite class (e.g., negative). The decision boundary does not pass through
any of the four points, meaning no point is exactly on the separating line.

Accuracy of a Classifier

The accuracy is defined as:

Accuracy = (Number of correctly classified points) / (Total number of points)

Here, the total number of points is 4.

Solution
We analyze the possible scenarios step by step:
a) If all points are misclassified:
• The classifier predicts the wrong class for all four points.
• Accuracy = 0/4 = 0.
• This is possible if the decision boundary separates the two classes but labels the two regions the wrong way round.
b) If exactly one point is correctly classified:
• Take any boundary that classifies exactly three points correctly (see case (d) below) and swap which side is labelled positive; the same boundary now classifies exactly one point correctly.
• Accuracy = 1/4 = 0.25.
• This is a valid scenario for a linear classifier.
c) If exactly two points are correctly classified:
• This occurs when the decision boundary gets one positive and one negative point right while misclassifying the other two points.
• Accuracy = 2/4 = 0.5.
• This is a valid scenario for a linear classifier.
d) If exactly three points are correctly classified:
• This occurs when the decision boundary correctly separates three points (e.g., two positives and one negative, or two negatives and one positive) while misclassifying the fourth point.
• Accuracy = 3/4 = 0.75.
• This is a valid scenario for a linear classifier.
e) If all points are correctly classified:
• This occurs when the decision boundary separates all positive points into one region and all negative points into the other region.
• Accuracy = 4/4 = 1.
• This is a valid scenario for a linear classifier.

Analysis of Option 0.25

For an arbitrary linear classifier, flipping which side of the boundary is labelled +1 turns a classifier with accuracy a into one with accuracy 1 − a. Since an accuracy of 0.75 is achievable, an accuracy of 0.25 is achievable as well.

Conclusion

The possible values of the accuracy for the given classifier are:

0, 0.25, 0.5, 0.75, 1


14. The decision tree shown partitions the feature space R2 into four regions, corresponding to the leaves L1 , L2 , L3 , L4 (from left to right).
Assume that x, y ≥ 0 for all points.


We are tasked with finding the area of the region S, corresponding to all points that fall into leaf L2 . The set S is defined as:
S = {(x, y) | x ≥ 0, y ≥ 0, (x, y) goes into L2 , (x, y) ∈ R2 }.

Analysis of the Decision Tree


To determine the region S, we trace the decision tree to find the conditions that lead to leaf L2 :
a) The root node checks the condition x < 3. To proceed to leaf L2 , we must satisfy:
x<3

b) From the left child of the root node, the next condition is y > 2. To proceed to leaf L2 , we must satisfy:
y>2

Thus, the region S is defined by the conditions:
S = {(x, y) | 0 ≤ x < 3, y > 2}.

Geometric Representation of S

The region S is a rectangle in R2 , defined by:


• x lies in the range [0, 3).
• y lies in the range (2, ∞).
However, since x, y ≥ 0, the relevant part of S is bounded by 0 ≤ x < 3 and y > 2.

Area of Region S

The region S is unbounded in the y-direction (i.e., y → ∞). Therefore, the "area" of S is infinite. However, if we restricted y to a finite range (e.g., 0 ≤ y ≤ M), we would calculate the finite area of the corresponding region.
For the given setup, since y > 2 without an upper bound, the area is:

infinite.

15. Consider a soft-margin linear Support Vector Machine (SVM) trained on a dataset. A subset of three points from the positive class (green) is
shown along with:
• The decision boundary (solid line),
• The bounding hyperplanes (dotted lines), and
• Slack variables ξ1 , ξ2 , ξ3 , representing the ”bribes” for margin violations of the corresponding points.


The goal is to determine the values of ξ1 , ξ2 , ξ3 .

Theory and Explanation


For a soft-margin SVM, the slack variables ξi quantify how much a point violates the margin. For a point (xi , yi ), the condition for classification
and margin violation is given by:
yi (wT xi + b) ≥ 1 − ξi with ξi ≥ 0.
• If ξi = 0, the point lies on or outside the correct margin boundary.
• If 0 < ξi < 1, the point is inside the margin but correctly classified.
• If ξi > 1, the point is misclassified (on the wrong side of the decision boundary).
The slack variable ξi for a point is given by:
ξi = max(0, 1 − yi (wT xi + b)).

Solution
From the given diagram:

a) Point ξ1: This point is correctly classified and lies exactly on the bounding hyperplane, so it satisfies y_i (w^T x_i + b) = 1. Hence:

ξ1 = max(0, 1 − 1) = 0.

b) Point ξ2: This point is correctly classified but lies inside the margin. Its margin distance satisfies 0 < y_i (w^T x_i + b) < 1. For such a point:

ξ2 = max(0, 1 − y_i (w^T x_i + b)),

where 1 − y_i (w^T x_i + b) is the margin violation. Assume y_i (w^T x_i + b) = 0.8; then:

ξ2 = max(0, 1 − 0.8) = 0.2.

c) Point ξ3: This point is misclassified and lies on the wrong side of the decision boundary, so y_i (w^T x_i + b) < 0. Assume y_i (w^T x_i + b) = −0.5; then:

ξ3 = max(0, 1 − (−0.5)) = 1.5.

Final Results

The slack variable values for the three points are:


ξ1 = 0, ξ2 = 0.2, ξ3 = 1.5.

16. In K-means clustering, we have 100 data points and decide to partition them into 5 clusters. The task is to determine the total number of
possible cluster assignments.

a) 105
b) 5^100
c) 500
d) 105


Solution

Understanding Cluster Assignments


In K-means clustering, each data point is assigned to one of K clusters. Therefore, for N data points, each point has K possible cluster
assignments.

Combinatorial Calculation
If we have N = 100 data points and K = 5 clusters, then the total number of possible assignments is:

K^N = 5^100

This is because each of the 100 data points can independently belong to any of the 5 clusters.

Answer
The total number of possible cluster assignments is:
5^100

17. We are tasked with finding the Maximum A Posteriori (MAP) estimator p̂MAP for a dataset {1, 0, 1, 0, 1, 0} modeled using a Bernoulli distribution.
The prior for the parameter p is given as Beta(3, 7).

Theory and Approach

MAP Estimator
The MAP estimator is defined as the mode of the posterior distribution. The posterior distribution of p is proportional to the product of the
likelihood and the prior:
Posterior(p) ∝ Likelihood(p) · Prior(p).

Likelihood for the Dataset

The dataset has 3 successes (x = 1) and 3 failures (x = 0). The likelihood for a Bernoulli distribution is:

Likelihood(p) = p^{n1} (1 − p)^{n0},

where n1 is the number of successes and n0 is the number of failures. For this dataset:

n1 = 3, n0 = 3.

Thus:

Likelihood(p) = p^3 (1 − p)^3.

Prior Distribution

The prior for p is Beta(3, 7), which has the density:

Prior(p) ∝ p^{α−1} (1 − p)^{β−1},

where α = 3 and β = 7.

Posterior Distribution

The posterior distribution is proportional to the product of the likelihood and the prior:

Posterior(p) ∝ p^3 (1 − p)^3 · p^2 (1 − p)^6.

Simplifying:

Posterior(p) ∝ p^{3+2} (1 − p)^{3+6} = p^5 (1 − p)^9.


Finding the MAP Estimator

The mode of a Beta(α, β) distribution, where α, β > 1, is given by:

p̂_MAP = (α − 1) / (α + β − 2).

Here, the posterior is Beta(5 + 1, 9 + 1) = Beta(6, 10). Thus:

p̂_MAP = (6 − 1) / (6 + 10 − 2) = 5/14.

Final Answer
The MAP estimator is:
5/14 ≈ 0.357
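A numerical check of the posterior mode (a sketch that maximizes the unnormalized posterior p^5 (1 − p)^9 on a grid, rather than using the closed-form expression):

```python
import numpy as np

# Unnormalized posterior density p^5 (1 - p)^9 evaluated on a fine grid of p values.
p = np.linspace(0.0, 1.0, 1_000_001)
posterior = p**5 * (1 - p)**9

p_map = p[np.argmax(posterior)]
print(p_map, 5 / 14)   # both ≈ 0.357
```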

18. A logistic regression model is equally confident that:


• A point x2 belongs to class 1,
• A point x1 belongs to class 0.
The goal is to find the ratio of the distances (absolute values) of x2 and x1 from the decision boundary.

Theory and Explanation

Logistic Regression Decision Boundary

In logistic regression, the decision boundary is defined as:


wT x + b = 0,
where:

• w is the weight vector,
• b is the bias term,
• x is the input feature vector.
The probability that a point x belongs to class 1 is given by:

P(y = 1 | x) = σ(w^T x + b) = 1 / (1 + e^{−(w^T x + b)}).

Similarly, the probability that x belongs to class 0 is:

P (y = 0 | x) = 1 − σ(wT x + b).

Equal Confidence for Predictions


The model is equally confident about x2 belonging to class 1 and x1 belonging to class 0. This implies:

P (y = 1 | x2 ) = P (y = 0 | x1 ).

From the logistic function:


σ(wT x2 + b) = 1 − σ(wT x1 + b).

Using σ(a) + σ(−a) = 1, this simplifies to:


wT x2 + b = −(wT x1 + b).

Thus:
wT x2 + b = −wT x1 − b.

Distance from the Decision Boundary


The distance of a point x from the decision boundary is given by:

Distance = |w^T x + b| / ‖w‖.

For x2:

Distance(x2) = |w^T x2 + b| / ‖w‖.

For x1:

Distance(x1) = |w^T x1 + b| / ‖w‖.

Using wT x2 + b = −(wT x1 + b), we have:


|wT x2 + b| = |wT x1 + b|.

Thus, the distances are equal:


Distance(x2 ) = Distance(x1 ).

Ratio of Distances

The ratio of the distances is:

Ratio = Distance(x2) / Distance(x1) = |w^T x2 + b| / |w^T x1 + b| = 1.

Final Answer

The ratio of the distances of x2 and x1 from the decision boundary is:

1 .

19. Consider four different soft-margin linear-SVM models trained on the same dataset with different values of C. The decision boundaries and

the supporting hyperplanes are plotted along with the dataset, as shown below:

Select the most appropriate values for C1 , C2 , C3 , C4 .

a) C1 = 10, C2 = 1, C3 = 0.1, C4 = 0.01
b) C1 = 0.01, C2 = 0.1, C3 = 1, C4 = 10
c) C1 = C2 = C3 = C4 = 1
d) C1 = C2 = C3 = C4 = 10

Theory and Explanation

Role of C in Soft-Margin SVM


The regularization parameter C in soft-margin SVM controls the trade-off between maximizing the margin and minimizing the classification
error:
• A high C value prioritizes minimizing classification errors, resulting in a tighter decision boundary with fewer margin violations. However, this may lead to overfitting.
• A low C value allows more margin violations, creating a smoother decision boundary with a larger margin, which may generalize better to unseen data.

Analysis of the Plots


We analyze each plot to determine the corresponding value of C:

a) Top Left (C1):
• This plot shows the widest margin and the largest allowance for misclassified points.
• A wider margin and higher tolerance for margin violations suggest a low C value.
• Therefore, C1 = 0.01.
b) Top Right (C2):
• This plot has a moderately wide margin, allowing fewer margin violations compared to C1.
• A moderate margin corresponds to a slightly higher C value.
• Therefore, C2 = 0.1.
c) Bottom Left (C3):
• This plot shows a relatively narrow margin and fewer misclassified points compared to C2.
• A narrower margin corresponds to a higher C value.
• Therefore, C3 = 1.
d) Bottom Right (C4):
• This plot has the narrowest margin and the tightest decision boundary.
• A very narrow margin corresponds to the highest C value.
• Therefore, C4 = 10.

Final Answer
The most appropriate values for C1 , C2 , C3 , C4 are:

C1 = 0.01, C2 = 0.1, C3 = 1, C4 = 10.

2 Mock 2 2024

Question 1: Deep Trees

1. Which of these statements are true in general?


a) Deep trees certainly perform well on the training data.
b) Deep trees perform well on both training and test data.
c) Deep trees perform well on the training data but may not perform well on the test data.
d) Deep trees perform poorly on both training and test data.

Theory and Explanation


Decision trees are a type of predictive model used in supervised machine learning. The performance of a decision tree, particularly its depth,
significantly impacts how well it generalizes to unseen data.

Concepts Involved
• Training Data Performance: Deep trees have many levels of splits, enabling them to fit the training data closely. This typically results in
very low training error.
• Test Data Performance: Deep trees are prone to overfitting, where they memorize the training data instead of generalizing patterns,
leading to poor performance on unseen test data.
• Bias-Variance Tradeoff:
– Shallow trees: High bias and low variance, often underfitting the data.
– Deep trees: Low bias and high variance, often overfitting the data.


Equations Related to the Theory

a) Training error (E_train):

E_train = (1/N) Σ_{i=1}^{N} L(y_i, ŷ_i)

where y_i is the true label, ŷ_i is the predicted label, N is the number of training samples, and L is the loss function.
b) Test error (E_test):

E_test = (1/M) Σ_{j=1}^{M} L(y_j, ŷ_j)

where M is the number of test samples.

Deep trees minimize E_train, but E_test may increase due to overfitting.

Solution

Step-by-Step Reasoning

a) Evaluate each statement based on theoretical understanding:


i. Deep trees certainly perform well on the training data. This is true because deep trees can model complex patterns in the training
data, reducing the training error (Etrain ) significantly.
ii. Deep trees perform well on both training and test data. This is generally false because overfitting leads to poor generalization,
increasing the test error (Etest ).
iii. Deep trees perform well on the training data but may not perform well on the test data. This is true, as explained by the concept
of overfitting and the bias-variance tradeoff.
iv. Deep trees perform poorly on both training and test data. This is false because deep trees usually perform well on training data
but fail to generalize to test data.
b) Final Answer: The correct statement is: Deep trees perform well on the training data but may not perform well on the test data.

Conclusion
Deep trees exhibit excellent performance on the training data due to their capacity to model intricate patterns but often fail to generalize to
unseen data due to overfitting. The correct statement is:
Deep trees perform well on the training data but may not perform well on the test data.
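As an illustration (a minimal sketch using scikit-learn with synthetic noisy data; the dataset, noise rate, and hyperparameters are made up and not part of the original paper), a fully grown tree typically reaches near-perfect training accuracy while its test accuracy lags behind:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
# Noisy labels: the true rule depends on one feature, with 20% label noise.
y = (X[:, 0] > 0).astype(int)
flip = rng.random(1000) < 0.2
y[flip] = 1 - y[flip]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)
print("train accuracy:", deep_tree.score(X_tr, y_tr))  # ≈ 1.0 (memorizes the noise)
print("test accuracy:", deep_tree.score(X_te, y_te))   # noticeably lower
```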

Question 2: Covariance Matrix

2. Which of the following are valid covariance matrices for centered datasets in R^3 (each matrix written row by row)?

C1 = [1 0 0; 4 3 0; 1 9 2],  C2 = [4 0 0; 0 3 0; 0 0 2],  C3 = [5 0 0; 0 0 0; 0 0 0],  C4 = [1 0 1; 0 1 0; 1 0 0]

a) Only C1
b) Only C2
c) C2 and C3
d) C2, C3 and C4
e) All four are valid covariance matrices
f) None of them

Theory and Explanation


A covariance matrix for a centered dataset must satisfy the following properties:
• Symmetry: The covariance matrix C must be symmetric, i.e., C_ij = C_ji for all i, j.
• Positive Semi-Definiteness (PSD): For any vector x ∈ R^n, the quadratic form x^T C x ≥ 0. This ensures that the variance of any linear combination of the features is non-negative.

Solution
We will evaluate each matrix based on the two criteria above.


Step-by-Step Reasoning
a) Matrix C1 = [1 0 0; 4 3 0; 1 9 2]:
- Symmetry: C1 is not symmetric (C12 ≠ C21, C13 ≠ C31).
- Conclusion: C1 is not a valid covariance matrix.
b) Matrix C2 = [4 0 0; 0 3 0; 0 0 2]:
- Symmetry: C2 is symmetric.
- PSD check: For any x = (x1, x2, x3)^T,

x^T C2 x = 4x1^2 + 3x2^2 + 2x3^2 ≥ 0

(all eigenvalues are positive).
- Conclusion: C2 is a valid covariance matrix.
c) Matrix C3 = [5 0 0; 0 0 0; 0 0 0]:
- Symmetry: C3 is symmetric.
- PSD check: For any x = (x1, x2, x3)^T,

x^T C3 x = 5x1^2 ≥ 0

(all eigenvalues are non-negative).
- Conclusion: C3 is a valid covariance matrix.
d) Matrix C4 = [1 0 1; 0 1 0; 1 0 0]:
- Symmetry: C4 is symmetric.
- PSD check: Compute x^T C4 x for x = (1, 0, −1)^T:

x^T C4 x = −1

Since x^T C4 x < 0, C4 is not PSD.
- Conclusion: C4 is not a valid covariance matrix.

Final Answer
The valid covariance matrices are:
C2 and C3
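A quick eigenvalue check (a sketch with NumPy) reproduces this conclusion: a matrix is a valid covariance matrix exactly when it is symmetric and all its eigenvalues are non-negative:

```python
import numpy as np

candidates = {
    "C1": np.array([[1, 0, 0], [4, 3, 0], [1, 9, 2]], dtype=float),
    "C2": np.array([[4, 0, 0], [0, 3, 0], [0, 0, 2]], dtype=float),
    "C3": np.array([[5, 0, 0], [0, 0, 0], [0, 0, 0]], dtype=float),
    "C4": np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0]], dtype=float),
}

for name, C in candidates.items():
    symmetric = np.allclose(C, C.T)
    # Eigenvalues of the symmetrized matrix; valid only if symmetric and all >= 0.
    eigvals = np.linalg.eigvalsh((C + C.T) / 2)
    valid = symmetric and np.all(eigvals >= -1e-12)
    print(name, "valid" if valid else "invalid")   # only C2 and C3 are valid
```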

Question 3: Sigmoid

3. The following is the vector output by some hidden layer in a neural network after the activation function has been applied:

v^T = (0.1, 0.8, 0.4, 0.5, 0.7, 0.9)

Which of the following could be the activation function used in this layer?

a) Only ReLU
b) Only Sigmoid
c) Either ReLU or Sigmoid
d) Neither ReLU nor Sigmoid

Theory and Explanation


Activation functions introduce non-linearity in neural networks, enabling them to learn complex patterns. Commonly used activation func-
tions include:
• ReLU (Rectified Linear Unit):
f (x) = max(0, x)
The output of ReLU is x if x ≥ 0, otherwise it is 0. The range of ReLU is [0, ∞).
• Sigmoid:

f(x) = 1 / (1 + e^{−x})

The output of the sigmoid function is bounded in (0, 1), making it useful for probabilities or normalized outputs.
• Properties of outputs:
– Sigmoid outputs always lie strictly inside (0, 1).
– ReLU outputs lie in [0, ∞); they can be 0, greater than 1, or anywhere in between, including values in (0, 1).

Solution

Step-by-Step Reasoning

We analyze the given vector:

v^T = (0.1, 0.8, 0.4, 0.5, 0.7, 0.9)

a) All of the values lie strictly in the range (0, 1), which is consistent with the output of the sigmoid activation function.
b) The values are also all non-negative, which is consistent with ReLU: nothing forces ReLU outputs to include 0 or to exceed 1, so a layer using ReLU could also have produced exactly these values.
c) Given this information, the observed vector could have been produced by either activation; neither can be ruled out.

Final Answer
The activation function used in this layer could be: Either ReLU or Sigmoid.

Question 4: Kernel

4. Consider the following dataset in R^2 for a binary classification problem. The red and blue points belong to two different classes. Each data-point is of the form ((x1, x2), y_i). The data-points are explicitly transformed, resulting in a new set of features. A linear classifier with weight vector w has been learned on this transformed data which perfectly separates these two classes.

Which of the following could be the transformed feature vector x that was used?

a) (1, x1, x2)^T
b) (1, x1, x2, x1^2, x2^2)^T
c) (1, x1^2, x2^2)^T
d) (x1, x2, x1^2, x2^2)^T

Theory and Explanation


The given dataset shows a binary classification problem in R2 , where:
• The blue points form an inner circle.
• The red points form an outer circle.
This dataset is non-linearly separable in its original feature space (x1 , x2 ). To solve such problems:


• Explicit Feature Transformation: Transform the original features (x1 , x2 ) into a higher-dimensional feature space where the classes
become linearly separable.
• Kernel Trick: Use a kernel function to compute the higher-dimensional feature space without explicitly transforming the features.

Common Transformations
The following are common transformations for solving circularly separable data:
a) Adding quadratic terms x21 and x22 to the feature vector helps capture the radial structure of the data.
b) Linear classifiers can then separate the classes in this transformed space.

Solution
We analyze each option:
a) x = (1, x1, x2)^T: This feature vector is linear and does not include quadratic terms like x1^2 or x2^2. It cannot capture the circular structure, so it does not work.
b) x = (1, x1, x2, x1^2, x2^2)^T: This feature vector includes both the original features and the quadratic terms x1^2 and x2^2. The quadratic terms allow the classifier to separate the circular classes using a linear decision boundary in the transformed space. This works.
c) x = (1, x1^2, x2^2)^T: This feature vector includes quadratic terms but lacks x1 and x2. While it captures the radial symmetry and may still allow perfect separation for circular data, it is less general.
d) x = (x1, x2, x1^2, x2^2)^T: This feature vector includes quadratic terms but lacks the bias term 1. While the classifier could still work, the absence of a bias term might reduce flexibility.

Conclusion
The most suitable transformed feature vector that could perfectly separate the two classes using a linear classifier is:

x = (1, x1, x2, x1^2, x2^2)^T.


Question 5: Logistic Regression

5. A logistic regression model has been trained for a binary classification problem with labels 0 and 1. The weight vector w and the corresponding
decision boundary are displayed in the figure below.

Now, the model is tested on four points. The probability corresponding to the i-th data-point xi returned by the logistic regression model is
given as follows:
P (y = 1 | xi ) = pi


We don’t know the true labels for any of the four points. We are only talking about the predicted probabilities here. Which of the following
relationships is correct?
a) p1 < p2 < p3 < p4 b) p1 > p2 > p3 > p4
c) p3 < p4 < p2 < p1 d) p1 > p2 and p4 > p3

Theory and Explanation


In logistic regression, the decision boundary is a straight line defined as:
w> x + b = 0
where w is the weight vector, x is the input vector, and b is the bias term.
The logistic regression model computes the probability of class y = 1 as:
P (y = 1 | x) = σ(w> x + b)
where σ(z) is the sigmoid function:

σ(z) = 1 / (1 + e^{−z})
• If a point lies farther in the direction of the weight vector w, it will have a higher probability of belonging to class 1.
• If a point lies farther from the weight vector or on the opposite side of the decision boundary, it will have a lower probability.
The decision boundary separates the input space into two regions:
• Points on the positive side of the decision boundary (aligned with w) have higher predicted probabilities (P (y = 1)).
• Points on the negative side of the decision boundary have lower predicted probabilities.

Solution

Step-by-Step Reasoning
From the figure:

a) The decision boundary passes through the origin and is perpendicular to the weight vector w.
b) w points in the direction where the probability P (y = 1) increases.
c) Points p3 and p4 lie on the positive side of the decision boundary, closer to the direction of w, so they will have higher probabilities.
d) Points p1 and p2 lie on the negative side of the decision boundary, away from the direction of w, so they will have lower probabilities.
e) Among points p1 and p2, p2 lies farther from the decision boundary (deeper on the negative side) than p1, so:

p1 > p2

f) Among points p3 and p4, p4 is farther in the direction of w than p3, so:

p4 > p3

Conclusion
The correct relationship is:
p1 > p2 and p4 > p3

Question 6: Logistic Regression

6. A logistic regression model is being trained on a dataset of size 2n. The first n data-points belong to class-1 (label is 1) and the rest belong
to class-0 (label is 0). Note that we are talking about the true labels here.

Class-1 = {x1 , . . . , xn }, Class-0 = {xn+1 , . . . , x2n }


The probability output by the model at any step in the training process is given by:

P(y = 1 | x_i) = p_i

Which of the following expressions is the negative log-likelihood of the model on this dataset? This is also called the binary cross-entropy loss.

a) Σ_{i=1}^{n} −log p_i + Σ_{i=n+1}^{2n} −log(1 − p_i)
b) Σ_{i=1}^{2n} −log p_i
c) Σ_{i=1}^{2n} −log(1 − p_i)
d) Σ_{i=1}^{2n} −p_i log p_i


Theory and Explanation


In logistic regression, the negative log-likelihood of the dataset is used as the loss function. This loss is commonly referred to as the binary
cross-entropy loss. It quantifies how well the model’s predicted probabilities pi align with the true labels yi .

Binary Cross-Entropy Loss


For a dataset of size 2n, where yi represents the true label (1 or 0), the cross-entropy loss is:
2n
X
L=− [yi log pi + (1 − yi ) log(1 − pi )]
i=1
Here:
• pi is the predicted probability that yi = 1.
• log pi corresponds to the contribution when yi = 1 (positive class).
• log(1 − pi ) corresponds to the contribution when yi = 0 (negative class).

Dataset Description
The dataset is structured as follows:
a) The first n points belong to class-1 (yi = 1).
b) The next n points belong to class-0 (yi = 0).
For class-1 (indices i = 1, . . . , n):
yi = 1 =⇒ −yi log pi = − log pi
For class-0 (indices i = n + 1, . . . , 2n):
yi = 0 =⇒ −(1 − yi ) log(1 − pi ) = − log(1 − pi )
Thus, the binary cross-entropy loss simplifies to:
n
X 2n
X
L= − log pi + − log(1 − pi )
i=1 i=n+1

Solution
We evaluate the given options:
a) Σ_{i=1}^{n} −log p_i + Σ_{i=n+1}^{2n} −log(1 − p_i): This correctly represents the binary cross-entropy loss as derived above.
b) Σ_{i=1}^{2n} −log p_i: This assumes all points belong to class-1, which is incorrect for the given dataset.
c) Σ_{i=1}^{2n} −log(1 − p_i): This assumes all points belong to class-0, which is incorrect.
d) Σ_{i=1}^{2n} −p_i log p_i: This is not the binary cross-entropy loss and does not represent a correct likelihood calculation.

Conclusion
The correct expression for the negative log-likelihood (binary cross-entropy loss) is:

Σ_{i=1}^{n} −log p_i + Σ_{i=n+1}^{2n} −log(1 − p_i)
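A small NumPy sketch (with made-up predicted probabilities, not part of the original solution) showing that the general binary cross-entropy formula and the split form above agree:

```python
import numpy as np

n = 4
y = np.array([1] * n + [0] * n)                          # first n labels 1, next n labels 0
p = np.array([0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.4, 0.1])   # model outputs P(y=1 | x_i)

# General form: -sum( y*log p + (1-y)*log(1-p) )
nll_general = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Split form: -sum over class-1 of log p  +  -sum over class-0 of log(1-p)
nll_split = -np.sum(np.log(p[:n])) - np.sum(np.log(1 - p[n:]))

print(np.isclose(nll_general, nll_split))   # True
```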

Question 7: Logistic Regression

7. Consider a logistic regression model that is trained on videos to detect objectionable content. Videos with objectionable content belong to
the positive class (label 1). Harmless videos belong to the negative class (label 0).
A good detector should be able to correctly identify almost all videos that are objectionable. If it incorrectly classifies even a single video
that has inappropriate content in it, that could have serious consequences, as millions of people might end up watching it. In this process,
the detector may classify some harmless videos as belonging to the positive class. But that is a price we are willing to pay.


How should we choose the threshold (for inference) of this logistic regression model?
a) The threshold should be a low value.
b) The threshold should be a high value.
c) The performance of the classifier is independent of the threshold.

Theory and Explanation


Logistic regression outputs a probability p for each data-point that represents the likelihood of the input belonging to the positive class
(y = 1).

Classification Threshold
The classification decision is made based on a threshold t. The rule is:

ŷ = 1 if p ≥ t, and ŷ = 0 if p < t.

• A low threshold means that the model classifies a point as positive (ŷ = 1) even if the predicted probability p is relatively small.
• A high threshold requires a higher confidence (p) to classify a point as positive.

Problem Requirements
In this problem:
• Missing a video with objectionable content (false negative) is unacceptable as it has serious consequences.
• Misclassifying a harmless video as objectionable (false positive) is acceptable, even if undesirable.
To minimize false negatives (failing to detect objectionable content), we need to:
• Lower the threshold. This ensures that the detector identifies as many videos with objectionable content as possible, even at the cost
of increasing false positives.

Solution

Step-by-Step Reasoning
a) A low threshold ensures that the classifier outputs ŷ = 1 for more data points, reducing the chances of missing objectionable videos
(false negatives).
b) The cost of increasing false positives (misclassifying harmless videos) is acceptable in this context.
c) A high threshold, on the other hand, would increase the likelihood of false negatives, which is unacceptable.

Conclusion
The threshold should be chosen as a low value to ensure that the model correctly identifies almost all videos with objectionable content.

Question 8: Logistic Regression

8. A logistic regression model has been trained on a dataset in a binary classification setup. It is now tested on two separate datasets, each
having 14 data-points, 7 from each class. The loss (negative log-likelihood) of the same model on the two test-datasets is L1 and L2 . It is
also given that the classification accuracy of the model on both these test-datasets is 100%.
The following images show the test datasets along with the decision boundary of the model:
• First test dataset (Loss = L1 )
• Second test dataset (Loss = L2 )

Theory and Explanation


Logistic regression computes the probability of class y = 1 as:

P (y = 1 | x) = σ(w> x + b),

where σ(z) = 1 / (1 + e^{−z}) is the sigmoid function.
The negative log-likelihood (binary cross-entropy loss) for a dataset is:

L = −(1/N) Σ_{i=1}^{N} [ y_i log p_i + (1 − y_i) log(1 − p_i) ],

where:
• yi is the true label (0 or 1),
• pi is the predicted probability for class y = 1,
• N is the total number of data-points.

Loss and Margin


The negative log-likelihood decreases as the model assigns higher confidence to the correct predictions. Specifically:
• If a point is far from the decision boundary and classified correctly, the predicted probability pi approaches 1 (or 0), leading to a smaller
loss.
• If a point lies close to the decision boundary, the model assigns lower confidence (probabilities closer to 0.5), leading to a higher loss.
Thus, even if the classification accuracy is 100%, the loss L depends on how far the points are from the decision boundary.

Analysis of the Images


a) In the first test dataset (Loss = L1 ):
• The points are close to the decision boundary.
• This means the model’s confidence in its predictions is low (probabilities closer to 0.5).
• Hence, the loss L1 will be relatively higher.
b) In the second test dataset (Loss = L2 ):
• The points are far from the decision boundary.

• This means the model’s confidence in its predictions is high (probabilities closer to 1 for positive points and 0 for negative points).
• Hence, the loss L2 will be relatively lower.

Conclusion
The relationship between the losses is:
L1 > L2
because the points in the first dataset are closer to the decision boundary, resulting in lower confidence and higher loss, while the points in
the second dataset are farther from the decision boundary, resulting in higher confidence and lower loss.
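A tiny numerical illustration (a sketch with hypothetical probabilities) of why confident, far-from-boundary predictions give a lower cross-entropy loss even when the accuracy is 100% in both cases:

```python
import numpy as np

def mean_nll(y, p):
    # Mean binary cross-entropy: lower when correct predictions are more confident.
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 1, 0, 0])

p_near_boundary = np.array([0.6, 0.55, 0.45, 0.4])        # correct but low confidence
p_far_from_boundary = np.array([0.99, 0.97, 0.02, 0.03])  # correct and confident

print(mean_nll(y, p_near_boundary) > mean_nll(y, p_far_from_boundary))  # True, i.e. L1 > L2
```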

Question 9: AdaBoost

9. Consider a training dataset of n data-points for a binary classification problem that satisfies the following condition for all i in 1, …, n:

x_ij = a if y_i = 1, and x_ij = b if y_i = −1,

where x_ij is the j-th feature for the i-th data-point, and a and b are real numbers with a ≠ b. If an AdaBoost model is fit on this dataset, how many rounds would be required to get a good classifier?

Theory and Explanation

AdaBoost Overview
AdaBoost (Adaptive Boosting) is an ensemble method that combines multiple weak classifiers to form a strong classifier. The key concepts
of AdaBoost include:
• Weak Classifier: A classifier that performs slightly better than random guessing (accuracy > 50%).
• Rounds: AdaBoost works iteratively over multiple rounds, training weak classifiers at each step and combining them with weights.


• Error and Weights: Misclassified samples are given higher weights in subsequent rounds so that the next weak classifier focuses more
on these points.

Dataset Structure

The dataset satisfies:

x_ij = a if y_i = 1, and x_ij = b if y_i = −1,

where a ≠ b. This means:
• For any feature x_j, the values a and b are sufficient to perfectly separate the two classes.
• A single decision stump (a weak classifier that splits the data based on one feature) can perfectly classify the data because it can set a threshold t such that:

x_j ≥ t ⟹ y = 1, x_j < t ⟹ y = −1

(assuming a > b; otherwise the inequality is flipped), where t is any value strictly between a and b.

Solution
Since a single weak classifier (e.g., a decision stump) can perfectly separate the data in one round, AdaBoost does not require additional rounds to achieve a strong classifier. Thus, the number of rounds required is:

1

Conclusion
The AdaBoost model requires only 1 round to get a good classifier because the dataset is linearly separable using a single feature threshold.
Question 10: Training Error

10. Which of the following models has the potential to achieve zero training error on every possible training dataset in R^2?

a) Decision Tree
b) Logistic Regression
c) Soft Margin Linear-SVM
d) Soft Margin Kernel-SVM with cubic kernel

Theory and Explanation

The ability of a model to achieve zero training error depends on its flexibility (capacity) to fit the data. A model that can perfectly fit any
training dataset is often referred to as an overfitting model. Below, we evaluate each option based on its characteristics:

1. Decision Tree

• A Decision Tree can achieve zero training error on any dataset because it can recursively split the feature space until each data point belongs to its own leaf node.
• This gives Decision Trees high flexibility and the ability to memorize training data.
• Therefore, a Decision Tree can perfectly fit any dataset.

2. Logistic Regression

• Logistic Regression is a linear model that assumes a linear decision boundary.
• For a dataset in R^2 that is not linearly separable, Logistic Regression cannot achieve zero training error.
• Hence, Logistic Regression does not have the potential to fit all datasets perfectly.

3. Soft Margin Linear-SVM

• A Soft Margin Linear-SVM allows for a margin and some misclassifications, enabling it to work with linearly non-separable data.
• However, since it still assumes a linear decision boundary, it cannot achieve zero training error on datasets that are not linearly separable.
• Therefore, a Soft Margin Linear-SVM does not achieve zero error on every dataset.


4. Soft Margin Kernel-SVM with Cubic Kernel


• A Soft Margin Kernel-SVM with a cubic kernel can learn non-linear decision boundaries by projecting the data into a higher-dimensional space using a kernel function.
• The cubic kernel provides significant flexibility to fit complex datasets.
• However, because of the soft margin and the fixed degree of the kernel, zero training error is not guaranteed on every possible dataset.

Solution

Step-by-Step Reasoning
a) A Decision Tree has sufficient flexibility to achieve zero training error on any dataset by memorizing the data points.
b) Logistic Regression and Soft Margin Linear-SVM assume linear decision boundaries, so they cannot achieve zero error on non-linearly
separable datasets.
c) A Soft Margin Kernel-SVM with a cubic kernel can learn complex non-linear decision boundaries, which allows it to achieve zero training
error in most cases. However, due to the soft margin, it might allow a small number of misclassifications.

Conclusion
The model that has the potential to achieve zero training error on every possible training dataset in R2 is:

Decision Tree

Question 11: K-Means clustering

11. If the K-means algorithm with k = 2 is applied on each of these datasets, which of the following statements is true?
a) The algorithm will terminate only in the case of dataset-1 after a certain number of iterations.
b) The algorithm will never terminate in the case of dataset-2 and will keep oscillating between different cluster configurations.
c) The algorithm will terminate for both datasets.
d) The algorithm will not terminate for both datasets.

Theory and Explanation

K-means Algorithm Overview

The K-means algorithm is an iterative clustering algorithm that partitions the data into k clusters. The steps are as follows:
a) Initialize k cluster centroids.
b) Assign each data point to the cluster whose centroid is closest to it (based on a distance metric, such as Euclidean distance).
c) Recompute the cluster centroids as the mean of the points assigned to each cluster.
d) Repeat steps 2 and 3 until the centroids stop changing (convergence is achieved).
The K-means algorithm is guaranteed to converge to a local optimum in a finite number of steps: each iteration can only decrease the within-cluster sum of squares, and there are only finitely many possible assignments of points to clusters, so the assignments must eventually stop changing. This termination guarantee does not depend on the shape of the data distribution; only the quality of the resulting clusters does. A minimal sketch of these steps is given below.
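The sketch below (Python with NumPy; centroids initialized by sampling k data points, and assuming no cluster ever becomes empty) mirrors steps 1-4 and stops as soon as the assignments no longer change:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: initialize centroids by sampling k distinct data points.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        labels = None
        for _ in range(n_iter):
            # Step 2: assign each point to its nearest centroid (Euclidean distance).
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            # Step 4: stop once the assignments (and hence the centroids) no longer change.
            if labels is not None and np.array_equal(new_labels, labels):
                break
            labels = new_labels
            # Step 3: recompute each centroid as the mean of its assigned points
            # (this sketch assumes no cluster becomes empty).
            centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        return centroids, labels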

Dataset Analysis

a) Dataset-1: The data points in dataset-1 are clearly separated into two compact and well-defined clusters. Since the clusters are well-
separated:
• The algorithm will quickly converge to a stable configuration where each cluster centroid is at the mean of its respective group of
points.
• The K-means algorithm will terminate after a finite number of iterations.
b) Dataset-2: The data points in dataset-2 form a circular ring-like structure. For k = 2:
      • There is no natural separation into two compact clusters, so the final clustering is of poor quality: the ring is typically split into two halves along some diameter, and which diameter is chosen depends on the initialization.
      • Nevertheless, each iteration still reduces the K-means objective, so the assignments stop changing after a finite number of iterations.
      • The algorithm therefore still converges and terminates; it simply converges to a poor local optimum.


Solution

Step-by-Step Reasoning

a) For Dataset-1, the algorithm will terminate quickly, as the data has two well-separated clusters.
b) For Dataset-2, the algorithm will also terminate: the circular structure leads to a poor, initialization-dependent split of the ring, but the objective still decreases monotonically until the assignments stop changing.

Conclusion

The correct statement is:

The algorithm will terminate for both datasets.
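A hedged empirical illustration (a sketch, assuming scikit-learn; the synthetic ring below merely stands in for dataset-2, whose exact points are not reproduced here): K-means with k = 2 still terminates after a small, finite number of iterations — it just settles on a low-quality split of the ring.

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic ring-like dataset standing in for dataset-2.
    theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
    ring = np.column_stack([np.cos(theta), np.sin(theta)])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ring)
    print(km.n_iter_)   # a small finite number -> the algorithm terminated
    print(km.inertia_)  # the (poor) within-cluster sum of squares it settled on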

Question 12: Information Gain

12. Consider the following data-points that make up the training dataset in a binary classification problem. They are of the form (x1 , x2 ):

Points labeled as 0 : (−5, −3), (−4, 1), (3, 2), (4, 5), (2, 1)

Points labeled as 1 : (15, 1), (21, −10), (8, 4), (7, 0), (9, −10)

If you are allowed to ask a question of the form fk < θ, what is the information gain corresponding to the ”best” question? Use log2 as the
logarithm base.

Theory and Explanation

Information Gain
Information gain is used to determine the ”best” question (feature split) in decision trees. It measures the reduction in uncertainty (entropy)
after splitting the data. The formula for information gain is:
Information Gain (IG) = H(P) − Σ_i (|P_i| / |P|) · H(P_i),
where:
• H(P ) is the entropy of the parent node,
• Pi are the child nodes after splitting the data,
• H(Pi ) is the entropy of each child node.

Entropy
Entropy measures the impurity or randomness of a dataset. It is given by:
H(P) = − Σ_c p_c log₂(p_c),

where pc is the proportion of data points in class c.


For a binary classification:
H(P ) = −p0 log2 (p0 ) − p1 log2 (p1 ),
where p0 and p1 are the proportions of points labeled 0 and 1, respectively.

Solution

Step 1: Initial Entropy of the Dataset


The dataset has 10 points in total:


• 5 points belong to class 0,


• 5 points belong to class 1.
The proportions are:
p0 = 5/10 = 0.5,   p1 = 5/10 = 0.5.
The entropy of the parent node H(P ) is:
H(P ) = −p0 log2 (p0 ) − p1 log2 (p1 ).
Substitute p0 = 0.5 and p1 = 0.5:
H(P ) = −0.5 log2 (0.5) − 0.5 log2 (0.5).
H(P ) = −0.5(−1) − 0.5(−1) = 1 (bit).

Step 2: Finding the ”Best” Question

The ”best” question splits the data into two child nodes such that the weighted average of their entropies is minimized.
Let us consider a feature fk (e.g., x1 or x2 ) and a threshold θ. A split of the form fk < θ will divide the dataset into two subsets:
• Left node: points satisfying fk < θ,
• Right node: points satisfying fk ≥ θ.
To determine the best split:
a) Sort the data points along each feature (x1 and x2 ).
b) Evaluate candidate splits (midpoints between adjacent points).
c) Compute the weighted average entropy for each split.
d) Select the split that maximizes the information gain.

Step 3: Information Gain for the Best Split

After evaluating candidate splits on both x1 and x2:
• The best split is on x1 — for example, x1 < 5.5 (any threshold between 4 and 7 works) — and it perfectly separates the data into two child nodes, one containing all five points labeled 0 (which have x1 ≤ 4) and the other containing all five points labeled 1 (which have x1 ≥ 7).
In this case:
• Entropy of the left child H(Pleft ) = 0 (pure node),
• Entropy of the right child H(Pright ) = 0 (pure node).
The weighted average entropy after the split is:
Average Entropy = (5/10)·0 + (5/10)·0 = 0.

The information gain is:


IG = H(P ) − Average Entropy.
Substitute H(P ) = 1 and Average Entropy = 0:
IG = 1 − 0 = 1 (bit).

Conclusion
The information gain corresponding to the ”best” question is:

1 (bit)
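The result can be verified numerically (a sketch using plain NumPy, with the ten training points hard-coded from the question):

    import numpy as np

    # (x1, x2, label) for the ten training points.
    data = np.array([
        [-5, -3, 0], [-4, 1, 0], [3, 2, 0], [4, 5, 0], [2, 1, 0],
        [15, 1, 1], [21, -10, 1], [8, 4, 1], [7, 0, 1], [9, -10, 1],
    ], dtype=float)
    X, y = data[:, :2], data[:, 2]

    def entropy(labels):
        if len(labels) == 0:
            return 0.0
        p = np.bincount(labels.astype(int), minlength=2) / len(labels)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    parent = entropy(y)
    best_gain = 0.0
    for k in range(2):                                   # feature x1 or x2
        values = np.sort(np.unique(X[:, k]))
        for theta in (values[:-1] + values[1:]) / 2:     # midpoints as candidate thresholds
            left, right = y[X[:, k] < theta], y[X[:, k] >= theta]
            child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            best_gain = max(best_gain, parent - child)

    print(best_gain)  # 1.0 bit, achieved e.g. by the split x1 < 5.5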

Question 13: Linear regression

13. Consider the following data-points for a regression problem:


{(c_1·u, y_1), . . . , (c_n·u, y_n)},

where each c_i is a real number, u ∈ R^d, and u ≠ 0. Fit a linear regression model for this dataset and find the predicted value for the test-point x_test = 5·u. You are also given:

Σ_i c_i·y_i = 20,   Σ_i c_i² = 100.


Theory and Explanation

Linear Regression Model

In linear regression, the predicted value ŷ is given by:


ŷ = w^T x + b,

where w is the weight vector, x is the input feature vector, and b is the bias term.

For the given dataset:


xi = ci · u and yi are the targets.

The linear regression model assumes:


y_i = w^T x_i + b + ε_i,

where ε_i is the residual error.

Simplified Assumption

We observe that the feature xi = ci · u is a scaled version of u. Thus:

• All data points lie along the direction of u in the feature space, so only the component of w along u affects the predictions w^T x_i.

• Hence, the weight vector can be taken along the direction of u (the derivation below also works without a bias term, so b is dropped).

We can express the weight vector w as:


w = α · u,

where α is a scalar.

Step-by-Step Solution

Step 1: Estimate the Scalar α

The optimal weight vector w is obtained by minimizing the least squares error. Substituting w = α · u and xi = ci · u, the predicted value
becomes:
ŷ_i = w^T x_i = (α·u)^T (c_i·u) = α · c_i · ||u||².

The least squares loss function is:


L(α) = Σ_{i=1}^n (y_i − ŷ_i)² = Σ_{i=1}^n (y_i − α · c_i · ||u||²)².

To minimize L(α), we compute the derivative with respect to α and set it to zero:
∂L/∂α = −2 Σ_{i=1}^n c_i · ||u||² · (y_i − α · c_i · ||u||²).

Simplifying:
Σ_{i=1}^n c_i·y_i = α · ||u||² · Σ_{i=1}^n c_i².

Solve for α:

α = (Σ_{i=1}^n c_i·y_i) / (||u||² · Σ_{i=1}^n c_i²).

Given:
Σ_{i=1}^n c_i·y_i = 20,   Σ_{i=1}^n c_i² = 100.

Substitute into the equation for α:


α = 20 / (||u||² · 100) = 1 / (5·||u||²).


Step 2: Predict for the Test Point


The test point is x_test = 5·u. The predicted value is:

ŷ_test = w^T x_test = (α·u)^T (5·u) = α · 5 · ||u||².

Substitute α = 1/(5·||u||²):

ŷ_test = (1/(5·||u||²)) · 5 · ||u||² = 1.

Conclusion
The predicted value for the test point xtest = 5 · u is:
1 .
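A numerical spot check (a sketch, assuming NumPy; the specific u, c_i, and y_i below are made up but satisfy Σ c_i·y_i = 20 and Σ c_i² = 100):

    import numpy as np

    u = np.array([1.0, 2.0])          # any nonzero direction works
    c = np.array([6.0, 8.0])          # sum(c**2) = 100
    y = np.array([2.0, 1.0])          # sum(c*y)  = 20

    X = np.outer(c, u)                # rows are the data points c_i * u
    # Minimum-norm least-squares fit (no intercept); the solution lies along u.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    x_test = 5 * u
    print(x_test @ w)                 # ≈ 1.0, matching the derivation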

Question 14: Linear regression

14. Let f (w) be the loss function for a linear regression problem. X is a d × n data matrix, w ∈ Rd , and y ∈ Rn . Consider the following steps:
a) Step-1: f is a convex function.
b) Step-2: Every local minimum of f is a global minimum.
c) Step-3: There is a unique solution w0 that minimizes f .
d) Step-4: If XX^T is invertible, the unique solution to the minimization problem is given by:

w0 = (XX^T)^{-1} X y.

Which step is incorrect?

a) Step-1 b) Step-2
c) Step-3 d) Step-4
e) All steps are correct. There are no incorrect steps.

Theory and Explanation


We analyze each step systematically for the linear regression loss function. The linear regression problem involves minimizing the squared
loss:
f(w) = ||X^T w − y||₂².
Here:
• X is a d × n matrix (features),
• w ∈ Rd (weights),
• y ∈ Rn (labels).
The closed-form solution to the linear regression problem is derived using the normal equation:

w0 = (XX^T)^{-1} X y   if XX^T is invertible.

Step-by-Step Analysis

a) Step-1: f is a convex function. The loss function f(w) = ||X^T w − y||₂² is a quadratic function of w whose Hessian, 2XX^T, is positive semi-definite. Thus, f is convex. Step-1 is correct.
b) Step-2: Every local minimum of f is a global minimum. For convex functions, any local minimum is also a global minimum. Since f
is convex, this property holds true. Step-2 is correct.
c) Step-3: There is a unique solution w0 that minimizes f . The minimizer is unique only if XX^T is invertible. If XX^T is not invertible (for example, when the rows of X are linearly dependent or n < d), the normal equation XX^T w = Xy has infinitely many solutions, all attaining the same minimum loss. Since Step-3 asserts uniqueness unconditionally, Step-3 is incorrect.


d) Step-4: If XX^T is invertible, the unique solution is given by w0 = (XX^T)^{-1} Xy. With the stated convention that X is d × n (each column is a data point) and f(w) = ||X^T w − y||₂², setting the gradient ∇f(w) = 2X(X^T w − y) to zero yields the normal equation XX^T w = Xy. When XX^T is invertible, this has the unique solution w0 = (XX^T)^{-1} Xy. (The familiar formula w0 = (X^T X)^{-1} X^T y corresponds to the other convention, where X is n × d with one data point per row.) Step-4 is correct.

Conclusion
The incorrect step is:
Step-3
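A quick numerical check under the stated d × n convention (a sketch, assuming NumPy): the Step-4 closed form (XX^T)^{-1}Xy matches the least-squares solution of X^T w ≈ y, while a rank-deficient X shows why the unconditional uniqueness claim of Step-3 fails.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 3, 20
    X = rng.normal(size=(d, n))              # d x n data matrix (columns are data points)
    y = rng.normal(size=n)

    w_closed = np.linalg.inv(X @ X.T) @ X @ y            # Step-4 formula
    w_lstsq, *_ = np.linalg.lstsq(X.T, y, rcond=None)    # minimize ||X^T w - y||^2
    print(np.allclose(w_closed, w_lstsq))                # True

    # Rank-deficient case: XX^T is singular, so the minimizer is not unique (Step-3 fails).
    X_bad = np.vstack([X[0], X[0], X[1]])                # two identical rows -> rank 2
    print(np.linalg.matrix_rank(X_bad @ X_bad.T))        # 2 < 3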

15. The accuracy of a classifier on a dataset is defined as the proportion of points in the dataset that are correctly classified by it. If w is the weight of a linear classifier that has an accuracy of 0.85 on a dataset, what is the accuracy of a linear classifier whose weight is w/2?

Theory and Explanation

Linear Classifier Decision Rule

The decision rule of a linear classifier is given by:


f(x) = sign(w^T x + b),
where:
• w is the weight vector,
• x is the input data point,
• b is the bias term.
The decision depends only on the sign of w^T x + b. Scaling w by any positive constant c > 0 (together with the bias, or with the bias absorbed into the weight vector, as is implied when the classifier is specified by its weight alone) does not affect this sign. Thus, the predictions of the classifier remain unchanged.

Impact of Scaling w

If w is scaled by 1/2, the classifier becomes:

f'(x) = sign((1/2)·(w^T x + b))   (with the bias scaled along with w, or absorbed into it).

Since 1/2 > 0, the sign of w^T x + b remains unchanged. Therefore:
• The predictions of the classifier with weight w/2 are identical to those of the classifier with weight w.
• The accuracy of the classifier remains the same.

Solution
The accuracy of the linear classifier with weight w/2 is:

0.85.
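A one-line check with made-up data (a sketch, assuming NumPy and a classifier of the form sign(w^T x), with any bias absorbed into w): halving the weight vector leaves every prediction, and hence the accuracy, unchanged.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                 # made-up data points
    w = rng.normal(size=3)                        # made-up weight vector

    pred_full = np.sign(X @ w)
    pred_half = np.sign(X @ (w / 2))
    print(np.array_equal(pred_full, pred_half))   # True -> identical predictions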

16. In a hard-margin SVM problem, let the optimal weight vector be:

w = (1, 2, 3)^T.

What is the distance of the closest point in the dataset from the decision boundary? Provide your answer correct to three decimal places.

Theory and Explanation


For a hard-margin SVM, the decision boundary is defined as:

w^T x + b = 0,

where w is the weight vector, x is the input vector, and b is the bias term.


The distance of a point x from the decision boundary is given by:

Distance = |w^T x + b| / ||w||.

In SVM, the closest points to the decision boundary are the support vectors. For a hard-margin SVM, these points satisfy:

|w^T x + b| = 1.

Thus, the distance of the closest point (support vector) from the decision boundary is:

Distance = 1 / ||w||.

Step-by-Step Solution

a) Compute the norm of the weight vector w:

||w|| = √(1² + 2² + 3²) = √(1 + 4 + 9) = √14.

b) The distance of the closest point is:

Distance = 1/||w|| = 1/√14.

c) Approximate the value: √14 ≈ 3.742. Therefore:

Distance ≈ 1/3.742 ≈ 0.267.

Conclusion
The distance of the closest point in the dataset from the decision boundary is:

0.267 .
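The same arithmetic in code (a sketch, assuming NumPy):

    import numpy as np

    w = np.array([1.0, 2.0, 3.0])
    print(round(1 / np.linalg.norm(w), 3))  # 1/sqrt(14) ≈ 0.267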

17. Let the eigenvectors of a covariance matrix for a centered dataset in R3 be w1 , w2 , and w3 . These three directions specify a new coordinate
system to represent the data. If the entire dataset is represented in terms of these new coordinates, what can you say about the covariance
matrix in this new coordinate system?

a) The covariance matrix is invariant to change of coordinates.


b) The covariance matrix becomes diagonal.
c) The covariance matrix becomes identity.
d) Insufficient information to answer this question.

Theory and Explanation


The covariance matrix Σ of a dataset represents the pairwise covariances of its features. In a new coordinate system defined by the eigen-
vectors of Σ:
• The eigenvectors w1 , w2 , w3 form an orthogonal basis that diagonalizes Σ.
• The corresponding eigenvalues λ1 , λ2 , λ3 represent the variances of the data along the directions of w1 , w2 , w3 .

Key Property of Eigenvectors and Eigenvalues


When the data is transformed into the new coordinate system (defined by the eigenvectors of Σ):
• The off-diagonal entries of the covariance matrix become zero, as the eigenvectors are orthogonal.
• The diagonal entries of the covariance matrix correspond to the eigenvalues of Σ, which represent the variances along the principal
directions.


Thus, the covariance matrix in the new coordinate system becomes diagonal:
 
Σ' = diag(λ1, λ2, λ3),

where λ1 , λ2 , λ3 are the eigenvalues of the original covariance matrix.

Analysis of Options

a) The covariance matrix is invariant to change of coordinates: This is incorrect. The covariance matrix changes under a change of
coordinates and becomes diagonal in the new basis defined by the eigenvectors.
b) The covariance matrix becomes diagonal: This is correct. In the new coordinate system, the covariance matrix is diagonal with
eigenvalues as its entries.
c) The covariance matrix becomes identity: This is incorrect. The diagonal entries of the covariance matrix are the eigenvalues, which
generally are not all equal to 1.
d) Insufficient information to answer this question: This is incorrect. The information provided (eigenvectors and eigenvalues of the
covariance matrix) is sufficient to conclude that the covariance matrix becomes diagonal.

Conclusion

The correct answer is:


The covariance matrix becomes diagonal.
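The claim can be verified empirically (a sketch, assuming NumPy): generate a centered, correlated dataset in R3, project it onto the eigenvectors of its covariance matrix, and the covariance matrix of the projected data comes out (numerically) diagonal.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(3, 3))
    data = rng.normal(size=(500, 3)) @ A          # correlated data in R^3
    data -= data.mean(axis=0)                     # center it

    cov = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # columns of eigvecs are w1, w2, w3

    new_coords = data @ eigvecs                   # represent the data in the eigenbasis
    new_cov = np.cov(new_coords, rowvar=False)
    print(np.round(new_cov, 6))                   # diagonal, with the eigenvalues on the diagonal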

18. Consider the covariance matrix:

Σ = [ 4  2  0
      2  3  0
      0  0  2 ].

Step 1: Eigenvalues and Eigenvectors

To diagonalize Σ, we first compute its eigenvalues and eigenvectors.

Characteristic Equation

The eigenvalues λ are found by solving:


det(Σ − λI) = 0,
where I is the identity matrix. Substituting Σ:

det [ 4−λ   2     0
      2     3−λ   0
      0     0     2−λ ] = 0.

Expanding the determinant along the third row:

(2 − λ)[(4 − λ)(3 − λ) − 2·2] = 0.

Simplify:

(2 − λ)(12 − 7λ + λ² − 4) = 0
(2 − λ)(λ² − 7λ + 8) = 0.

The roots of this cubic equation are λ₁ = 2 and λ₂,₃ = (7 ± √17)/2, i.e. λ₂ ≈ 5.562 and λ₃ ≈ 1.438.

Eigenvectors

For each eigenvalue λ, solve (Σ − λI)v = 0 to find the eigenvector v.


• For λ₁ = 2:

  (Σ − 2I) = [ 2  2  0
               2  1  0
               0  0  0 ].

  The first two rows force the first two components of v to be zero, so the eigenvector is v₁ = (0, 0, 1)^T.

• For λ₂ = (7 + √17)/2 ≈ 5.562: the third row forces the third component to be zero, and the first row of (Σ − λ₂I)v = 0 gives (4 − λ₂)x + 2y = 0, i.e. y = (λ₂ − 4)x/2 ≈ 0.781·x. An eigenvector is v₂ ≈ (1, 0.781, 0)^T.

• For λ₃ = (7 − √17)/2 ≈ 1.438: similarly y = (λ₃ − 4)x/2 ≈ −1.281·x, so an eigenvector is v₃ ≈ (1, −1.281, 0)^T.

As expected for a symmetric matrix, the eigenvectors are mutually orthogonal (for instance, v₂ · v₃ = 1 + (0.781)(−1.281) ≈ 0).

Step 2: Change of Basis

The eigenvectors v₁, v₂, v₃ form an orthogonal basis. Normalizing them to unit length and placing them as the columns of the change-of-basis matrix P (say in the order v₂, v₃, v₁), the covariance matrix in the new coordinate system is:

Σ' = P^T Σ P.

Step 3: Diagonalization
Performing the matrix multiplication (with entries rounded to three decimals):

Σ' = [ 5.562  0      0
       0      1.438  0
       0      0      2 ].

This confirms that the covariance matrix in the new coordinate system becomes diagonal, with the eigenvalues as its diagonal entries.

Conclusion
When the data is represented in terms of the eigenvectors of the covariance matrix, the covariance matrix becomes diagonal. This is consistent
with the theoretical result.
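The eigenvalues and the diagonalization above can be double-checked numerically (a sketch, assuming NumPy):

    import numpy as np

    Sigma = np.array([[4.0, 2.0, 0.0],
                      [2.0, 3.0, 0.0],
                      [0.0, 0.0, 2.0]])

    eigvals, P = np.linalg.eigh(Sigma)        # orthonormal eigenvectors in the columns of P
    print(np.round(eigvals, 3))               # [1.438 2.    5.562], i.e. (7 ± sqrt(17))/2 and 2
    print(np.round(P.T @ Sigma @ P, 3))       # diagonal matrix with the eigenvalues on the diagonal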

Index

AdaBoost, 53
Covariance Matrix, 39
Decision Tree, 54
Decision Trees
    Deep Trees, 37
Information Gain, 58
K-Means clustering, 56
Kernel, 42
Linear regression, 61, 64
Logistic Regression, 44, 47, 49, 51
Sigmoid, 41

