Mock Exams 2024
Contents
1 Mock 1
2 Mock 2 2024
List of Figures
1 Deep Trees
2 Covariance Matrix
3 Sigmoid
4 Kernel
5 Logistic Regression
6 Logistic Regression
7 Logistic Regression
8 Logistic Regression
9 AdaBoost
10 Training Error
11 K-Means Clustering
12 Information Gain
13 Linear Regression
14 Linear Regression
1 Mock 1
1. Consider the prediction of the label for a data point x in a logistic regression model:

ŷ = 1 if P(y = 1 | x) ≥ T, and ŷ = 0 otherwise.

Here:
• T is called the threshold and is a real number in the interval (0, 1).
• ŷ stands for the predicted label.
• The equation of the decision boundary is:

w^T x − u = 0

If T = e / (1 + e), find the value of the unknown quantity u. Enter the closest integer as your answer.
P(y = 1 | x) = σ(w^T x) = 1 / (1 + e^{−w^T x})

Here:
• σ(·) is the sigmoid function.
• w^T x represents the linear transformation of the feature vector x.
The decision boundary is determined where P(y = 1 | x) = T. Substituting the given threshold T = e / (1 + e):

1 / (1 + e^{−w^T x}) = e / (1 + e)

e^{−w^T x} = (1 + e)/e − 1

Simplify:

e^{−w^T x} = (1 + e − e)/e = 1/e

Taking the natural logarithm on both sides:

−w^T x = ln(1/e)
−w^T x = −1
w^T x = 1

Since the decision boundary is w^T x − u = 0, i.e. w^T x = u, we have:

u = 1
Solution
The value of u is u = 1.
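As a quick numerical sanity check (a minimal sketch; only the value of w^T x matters here), the sigmoid evaluated at w^T x = 1 equals exactly e/(1 + e), confirming that the threshold is crossed on the hyperplane w^T x = 1:

```python
import numpy as np

def sigmoid(z):
    # Standard logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

e = np.e
T = e / (1 + e)          # the threshold given in the question, about 0.731

# sigma(1) equals e / (1 + e), so the boundary P(y = 1 | x) = T is w^T x = 1.
print(sigmoid(1.0), T)   # both print ~0.7310585786300049
```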
2. Consider a loss function L(w) defined over a training dataset using the following quantities:
• n: Number of data points in the training dataset.
• ri: A constant in the interval [0, 1] associated with each data point i.
• w: The weight vector.
• xi: Feature vector of the i-th data point.
• yi: True label of the i-th data point.
Find the gradient of L(w) with respect to w.
The gradient is:

∇_w L(w) = Σ_{i=1}^{n} ri (w^T xi − yi) xi
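A small NumPy sketch of this gradient, assuming the weighted squared-error loss implied by the answer, L(w) = ½ Σ ri (w^T xi − yi)^2, with made-up data; a finite-difference check confirms the closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))      # rows are the x_i (made-up data)
y = rng.normal(size=n)
r = rng.uniform(size=n)          # r_i in [0, 1]
w = rng.normal(size=d)

def loss(w):
    # Assumed form: L(w) = 1/2 * sum_i r_i (w^T x_i - y_i)^2
    return 0.5 * np.sum(r * (X @ w - y) ** 2)

# Closed-form gradient from the solution: sum_i r_i (w^T x_i - y_i) x_i
grad = X.T @ (r * (X @ w - y))

# Finite-difference check of the closed form.
eps = 1e-6
numeric = np.array([(loss(w + eps * np.eye(d)[j]) - loss(w - eps * np.eye(d)[j])) / (2 * eps)
                    for j in range(d)])
print(np.allclose(grad, numeric, atol=1e-5))   # True
```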
3. A hard-margin linear SVM is trained for a 2D problem. The optimal weight vector is:
w = [2, −1]^T.
Solution
The decision boundary of a linear SVM is w^T x + b = 0, where:
• w is the weight vector,
• b is the bias term,
• x = [x1, x2]^T is a point in R^2.
The decision rule predicts:
Class +1 if w^T x + b > 0, Class −1 if w^T x + b < 0.
Substituting w = [2, −1]^T:
2x1 − x2 + b = 0.
The bias term b is not provided in the problem. For simplicity, assume b = 0, which means the decision boundary passes through the origin.
The equation simplifies to:
x2 = 2x1.
Step 3: Analyzing the Unit Square
The decision boundary x2 = 2x1 divides the square into two regions:
• Points below the line x2 = 2x1 are predicted as class +1.
• Points above the line x2 = 2x1 are predicted as class −1.
The probability of a point being predicted as class +1 is the ratio of the area below the line to the total area of the square:
Conclusion
The probability that a randomly chosen point from the square is predicted as class +1 is:
0.25 .
4. The MNIST digit classification problem has 10 classes. A training dataset contains n data points, with an equal number of points from each
of the 10 classes. A dummy classifier is defined such that for each input data point, it randomly picks one of the 10 classes as its prediction.
The accuracy of a model on a dataset is defined as the proportion of points that it classifies correctly. The goal is to determine the accuracy
of the dummy classifier as n → ∞.
Solution
The dummy classifier predicts one of 10 classes randomly for each data point, independent of the input features. This implies that the classifier assigns each class the same probability, 1/10.
Since the training dataset has an equal number of data points for each class:
• The true class label for any given data point is uniformly distributed across the 10 classes.
• The probability of the dummy classifier correctly predicting the class for a single data point is 1/10.
Expected Accuracy
The accuracy of the dummy classifier is the proportion of data points classified correctly. For n data points, the number of correct classifications C follows a Binomial distribution:

C ∼ Binomial(n, p)

where p = 1/10 is the probability of a correct classification.
The accuracy A is given by:

A = C / n

Since E[C] = n · p:

E[A] = (n · 1/10) / n = 1/10

As n → ∞, the Law of Large Numbers ensures that the observed accuracy converges to the expected accuracy. Thus:

A = 1/10
Final Answer
0.1
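A quick simulation (a sketch; the dataset sizes are made up) showing the dummy classifier's accuracy concentrating around 0.1 as n grows:

```python
import numpy as np

rng = np.random.default_rng(42)
for n in (100, 10_000, 1_000_000):
    true_labels = rng.integers(0, 10, size=n)   # balanced classes in expectation
    predictions = rng.integers(0, 10, size=n)   # dummy classifier: uniform random guess
    accuracy = np.mean(true_labels == predictions)
    print(n, round(accuracy, 4))                # approaches 0.1 as n grows
```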
5. A neural network N1 is trained on an image dataset D1. A second network N2 is trained from scratch on D2, a copy of D1 in which every image has been turned upside-down. What can we say about N2?
Options
a) The network N2 will not be able to learn anything from D2 . Its test accuracy on D2 will be very low.
b) The network N2 will be able to learn useful patterns from D2 . In fact, the performance of network N2 on D2 will be similar to N1 on
D1 .
c) The network N2 will be able to learn somewhat useful patterns from D2 . But the performance of N1 on D1 will be much better than
N2 on D2 .
Factors to Consider
1. Learning useful patterns: A convolutional neural network (CNN), which is often used for image classification, is not inherently orientation-sensitive. When trained from scratch on D2, N2 can learn to recognize patterns in the upside-down orientation, achieving performance comparable to N1 on D1.
2. Performance on D2: If the distribution of images and labels in D2 is identical to D1 (except for the orientation), there is no reason to expect N2 to perform worse than N1, assuming sufficient training.
Conclusion
The network N2 will be able to learn useful patterns from D2 , and its performance on D2 will be similar to N1 on D1 .
The correct option is:
The network N2 will be able to learn useful patterns from D2 . In fact, the performance of network N2 on D2 will be similar to N1 on D1 .
6. We are tasked to compute the hinge loss for a soft-margin, linear SVM on the given dataset. The weight vector is:

w = [0, 1]^T

The dataset is:

x1   x2   y
 2    1    1
−2    1    1
−1    2    1
 0    2   −1
 1   −1   −1
 2   −2   −1
−2    0   −1
where:
• yi is the true label of the i-th data point.
• xi is the feature vector of the i-th data point.
• w is the weight vector.
• wT xi is the dot product of the weight vector and feature vector.
The total hinge loss is the sum of the hinge loss over all data points.
Step-by-Step Solution
The weight vector w = [0, 1]T means that wT xi = 0 · x1 + 1 · x2 = x2 . Thus, for each data point, the hinge loss is computed as:
max(0, 1 − yi · x2 )
x1 x2 y Hinge Loss
2 1 1 max(0, 1 − 1 · 1) = max(0, 0) = 0
−2 1 1 max(0, 1 − 1 · 1) = max(0, 0) = 0
−1 2 1 max(0, 1 − 1 · 2) = max(0, −1) = 0
0 2 −1 max(0, 1 − (−1) · 2) = max(0, 3) = 3
1 −1 −1 max(0, 1 − (−1) · −1) = max(0, 0) = 0
2 −2 −1 max(0, 1 − (−1) · −2) = max(0, −1) = 0
−2 0 −1 max(0, 1 − (−1) · 0) = max(0, 1) = 1
The total hinge loss is the sum of the hinge losses for all data points:
Lhinge = 0 + 0 + 0 + 3 + 0 + 0 + 1 = 4
Conclusion
The total hinge loss for this dataset is 4.
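The same hinge-loss computation in NumPy, transcribing the table above directly:

```python
import numpy as np

X = np.array([[ 2,  1], [-2,  1], [-1,  2],
              [ 0,  2], [ 1, -1], [ 2, -2], [-2,  0]])
y = np.array([1, 1, 1, -1, -1, -1, -1])
w = np.array([0, 1])                    # so w^T x_i is simply x2

# Hinge loss per point: max(0, 1 - y_i * w^T x_i)
losses = np.maximum(0, 1 - y * (X @ w))
print(losses)        # [0 0 0 3 0 0 1]
print(losses.sum())  # 4
```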
7. Consider a logistic regression model for a binary classification problem with two features x1 and x2 , and labels 1 and 0. The horizontal axis
represents x1 and the vertical axis represents x2 .
You are given two feature vectors:
x1 = [1, 3]^T,   x2 = [−1, 3]^T
The weight vector w makes an angle θ with the positive x1 -axis. Each θ corresponds to a different classifier. For what range of θ are both x1
and x2 predicted to belong to class-1?
Solution
Here:

w = [w1, w2]^T and w^T x = w1 x1 + w2 x2

Let w = [cos θ, sin θ]^T, where θ is the angle w makes with the x1-axis.
Next, consider the intersection of the two inequalities. Subtract Condition 2 from Condition 1:
2 cos θ > 0
cos θ > 0
This implies:
0◦ < θ < 90◦
Final Range of θ
The intersection of these two conditions (sin θ > 0 and cos θ > 0) gives 0° < θ < 90°.
Answer
The range of θ for which both x1 and x2 are predicted to belong to class-1 is 0° < θ < 90°.
8. Consider the following data points of the form (x, y):
(−3, 3), (0, 4), (1, 12), (3, 15), (4, 16)
The goal is to fit a constant model y = c to this data by minimizing the Mean Squared Error (MSE). The task is to find the best estimate for c.
Solution
To find the value of c that minimizes the MSE, take the derivative of the loss function with respect to c:

∂/∂c MSE = ∂/∂c [ (1/n) Σ_{i=1}^{n} (yi − c)^2 ]

Simplify:

∂/∂c MSE = (1/n) Σ_{i=1}^{n} ∂/∂c (yi − c)^2 = (1/n) Σ_{i=1}^{n} −2(yi − c)

Setting the derivative to zero:

Σ_{i=1}^{n} (yi − c) = 0

Σ_{i=1}^{n} yi − n · c = 0

c = (1/n) Σ_{i=1}^{n} yi
Calculation of c
The labels yi are: 3, 4, 12, 15, 16. Compute their mean:

c = (3 + 4 + 12 + 15 + 16) / 5 = 50 / 5 = 10
Final Answer
c = 10
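A short sketch confirming that the sample mean minimizes the MSE for this data, both by the closed form and by a brute-force scan:

```python
import numpy as np

y = np.array([3, 4, 12, 15, 16])
c = y.mean()                                  # closed-form minimizer of the MSE
print(c)                                      # 10.0

# Compare with a brute-force scan over candidate constants.
candidates = np.linspace(0, 20, 2001)
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)
print(candidates[mse.argmin()])               # 10.0
```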
9. We are tasked with determining if the given data points in R^2 are linearly separable. The data points are the rows of:

X =
[  1   2 ]
[  3  −4 ]
[  5   0 ]
[ −1  −2 ]
[ −3   4 ]
[ −2  −6 ]
In 2D, this means we must find a line that separates the points with y = 1 (positive class) from those with y = −1 (negative class).
Approach to Solution
We will plot the points in R2 and visually inspect if such a line exists. Additionally, we verify the linear separability mathematically.
Analysis
To determine linear separability:
• If a straight line can divide the positive and negative classes such that all points of one class lie on one side of the line and all points of
the other class lie on the other side, the dataset is linearly separable.
• If no such line exists, the dataset is not linearly separable.
After plotting and analyzing the geometry of the points, it becomes evident that there is no line that can separate these two sets of points.
Conclusion
The dataset is not linearly separable.
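Linear separability can also be checked mechanically as a linear feasibility problem: the constraints yi(w·xi + b) ≥ 1 are feasible if and only if the labelled points are linearly separable. The labels for this question are not shown in the extract above, so the sketch below runs the check on the classic XOR configuration instead (a hypothetical stand-in):

```python
import numpy as np
from scipy.optimize import linprog

def is_linearly_separable(X, y):
    """Return True if some (w, b) satisfies y_i (w . x_i + b) >= 1 for all i."""
    n, d = X.shape
    # Variables are (w_1, ..., w_d, b). Constraints: -y_i (x_i . w + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

# XOR configuration: two classes on opposite diagonals, not linearly separable.
X_xor = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y_xor = np.array([1, 1, -1, -1])
print(is_linearly_separable(X_xor, y_xor))   # False
```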
10. In the context of Naive Bayes classification, the feature vector is:

x = [x1, · · · , xd]^T

c) P(y | x1, · · · , xd) = P(y)
d) P(x1, · · · , xd, y) = P(x1, · · · , xd) · P(y)

Naive Bayes is a probabilistic classification method that relies on the following "naive assumption":
• The features x1, x2, · · · , xd are conditionally independent given the class y.
P(x1, · · · , xd | y) = ∏_{i=1}^{d} P(xi | y)

This is the "naive assumption" that allows efficient computation in Naive Bayes models.
Correct Answer

P(x1, · · · , xd | y) = ∏_{i=1}^{d} P(xi | y)
11. In the context of Naive Bayes classification, the feature vector is:

x = [x1, · · · , xd]^T

c) P(y | x1, · · · , xd) = P(y)
d) P(x1, · · · , xd, y) = P(x1, · · · , xd) · P(y)
Theory and Explanation
Naive Bayes is a probabilistic classification model that relies on the "naive assumption":
• The features x1, x2, · · · , xd are conditionally independent given the class label y.
This assumption simplifies the joint conditional probability P(x1, x2, · · · , xd | y), which would otherwise involve modelling complex dependencies between the features.
Under the naive assumption, the joint probability P(x1, x2, · · · , xd | y) can be factorized as:
P(x1, x2, · · · , xd | y) = ∏_{i=1}^{d} P(xi | y)

Correct Answer

P(x1, · · · , xd | y) = ∏_{i=1}^{d} P(xi | y)
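A tiny sketch of how the naive assumption is used in practice: per-feature class-conditional probabilities (all numbers made up) are simply multiplied together with the class prior.

```python
import numpy as np

# Made-up class-conditional probabilities P(x_j = 1 | y) for d = 3 binary features.
p_given_y = {0: np.array([0.2, 0.7, 0.4]),
             1: np.array([0.8, 0.3, 0.5])}
prior = {0: 0.6, 1: 0.4}

x = np.array([1, 0, 1])   # an observed feature vector

def joint_score(y):
    # P(y) * prod_j P(x_j | y), using the naive (conditional independence) assumption.
    per_feature = np.where(x == 1, p_given_y[y], 1 - p_given_y[y])
    return prior[y] * per_feature.prod()

scores = {y: joint_score(y) for y in (0, 1)}
posterior_1 = scores[1] / (scores[0] + scores[1])   # normalizing gives P(y = 1 | x)
print(scores, round(posterior_1, 3))
```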
12. Consider the following statements in the context of a hard-margin linear Support Vector Machine (SVM):
(1) Every support vector lies on one of the two supporting hyperplanes.
(2) Every point on one of the two supporting hyperplanes is a support vector.
Determine the correctness of the statements.
Analysis
Statement (1): Every support vector lies on one of the two supporting hyperplanes
In the context of a hard-margin linear SVM:
• Support vectors are the data points that lie exactly on one of the two supporting hyperplanes. These hyperplanes are defined as:
wT x + b = 1 (for one class)
wT x + b = −1 (for the other class).
• All other points lie either on the correct side of the margin or beyond it.
Thus, every support vector lies on one of the two supporting hyperplanes.
Statement (1) is true.
Statement (2): Every point on one of the two supporting hyperplanes is a support vector
While every support vector lies on one of the two supporting hyperplanes, not all points on the hyperplanes are guaranteed to be support
vectors:
• If there are redundant points (e.g., duplicate or linearly dependent points) on the supporting hyperplanes, they may not actively con-
tribute to defining the margin.
• Only a minimal subset of points on the hyperplanes that define the margin are considered support vectors.
Thus, not every point on the supporting hyperplanes is necessarily a support vector.
Statement (2) is false.
Conclusion
From the analysis, the correct answer is:
Only (1) is true.
13. Consider a binary classification task in R2 with two features. The dataset consists of four points, with two positive points (denoted by +)
and two negative points (denoted by −). An arbitrary linear classifier is used, and the decision boundary does not pass through any of the
four points. The points are shown in the following diagram:
The task is to find the possible values of the accuracy (proportion of points correctly classified) of the classifier. Assume that all options are
independent of each other, and multiple options could be correct.
A linear classifier partitions the plane into two regions using a straight line. Any point on one side of the line is classified into one class (e.g.,
positive), while points on the other side are classified into the opposite class (e.g., negative). The decision boundary does not pass through
any of the four points, meaning no point is exactly on the separating line.
Accuracy of a Classifier
Solution
We analyze the possible scenarios step by step:
a) If all points are misclassified:
• The classifier predicts the wrong class for all four points.
• Accuracy = 0/4 = 0.
• This is possible if the decision boundary completely separates the positive and negative regions incorrectly.
b) If exactly one point is correctly classified:
• This scenario is not possible for a linear classifier because at least two points must lie on the same side of the boundary.
c) If exactly two points are correctly classified:
• This occurs when the decision boundary separates one positive and one negative point correctly, while misclassifying the other two points.
• Accuracy = 2/4 = 0.5.
• This is a valid scenario for a linear classifier.
d) If exactly three points are correctly classified:
• This occurs when the decision boundary correctly separates three points (e.g., two positives and one negative, or two negatives and one positive) while misclassifying the fourth point.
• Accuracy = 3/4 = 0.75.
• This is a valid scenario for a linear classifier.
e) If all points are correctly classified:
• This occurs when the decision boundary separates all positive points into one region and all negative points into the other region.
• Accuracy = 4/4 = 1.
• This is a valid scenario for a linear classifier.
An accuracy of 0.25 would correspond to exactly one point being classified correctly. However, a linear classifier cannot achieve this because at least two points must lie on the same side of the decision boundary. Therefore, 0.25 is not possible.
Conclusion
The possible values of the accuracy for the given classifier are:
0, 0.5, 0.75, 1
14. We are tasked with finding the area of the region S, corresponding to all points that fall into leaf L2 of the given decision tree. The set S is defined as:
S = {(x, y) | x ≥ 0, y ≥ 0, (x, y) goes into L2 , (x, y) ∈ R2 }.
b) From the left child of the root node, the next condition is y > 2. To proceed to leaf L2 , we must satisfy:
y>2
Thus, the region S is defined by the conditions:
S = {(x, y) | 0 ≤ x < 3, y > 2}.
Geometric Representation of S
Area of Region S
The region S is unbounded in the y-direction (i.e., y → ∞). Therefore, the "area" of S is infinite. However, if we restrict y to a finite range (e.g., 0 ≤ y ≤ M), we would calculate the finite area of the corresponding region.
For the given setup, since y > 2 without an upper bound, the area is:
infinite.
15. Consider a soft-margin linear Support Vector Machine (SVM) trained on a dataset. A subset of three points from the positive class (green) is
shown along with:
• The decision boundary (solid line),
• The bounding hyperplanes (dotted lines), and
• Slack variables ξ1, ξ2, ξ3, representing the "bribes" for margin violations of the corresponding points.
Solution
From the given diagram:
a) Point with slack ξ1: This point is correctly classified and lies exactly on its bounding (margin) hyperplane, so it satisfies yi(w^T xi + b) = 1. Hence:
ξ1 = max(0, 1 − 1) = 0.
b) Point with slack ξ2: This point is correctly classified but lies inside the margin. Its margin distance satisfies 0 < yi(w^T xi + b) < 1. For such a point:
ξ2 = max(0, 1 − yi(w^T xi + b)),
where 1 − yi(w^T xi + b) corresponds to the margin violation. Assuming yi(w^T xi + b) = 0.8, then ξ2 = max(0, 1 − 0.8) = 0.2.
c) Point with slack ξ3: This point is misclassified and lies on the wrong side of the decision boundary, so yi(w^T xi + b) < 0. Assuming yi(w^T xi + b) = −0.5, then:
ξ3 = max(0, 1 − (−0.5)) = 1.5.
Final Results
ξ1 = 0, ξ2 = 0.2, ξ3 = 1.5.
16. In K-means clustering, we have 100 data points and decide to partition them into 5 clusters. The task is to determine the total number of
possible cluster assignments.
a) 10^5
b) 5^100
c) 500
d) 105
Solution
Combinatorial Calculation
If we have N = 100 data points and K = 5 clusters, then the total number of possible assignments is:

K^N = 5^100
This is because each of the 100 data points can independently belong to any of the 5 clusters.
Answer
The total number of possible cluster assignments is:
5^100
17. We are tasked with finding the Maximum A Posteriori (MAP) estimator p̂MAP for a dataset {1, 0, 1, 0, 1, 0} modeled using a Bernoulli distribution.
The prior for the parameter p is given as Beta(3, 7).
MAP Estimator
The MAP estimator is defined as the mode of the posterior distribution. The posterior distribution of p is proportional to the product of the
likelihood and the prior:
Posterior(p) ∝ Likelihood(p) · Prior(p).
Likelihood for the Dataset
The dataset has 3 successes (x = 1) and 3 failures (x = 0). The likelihood for a Bernoulli distribution is:

Likelihood(p) = p^{n1} (1 − p)^{n0},

where n1 is the number of successes and n0 is the number of failures. For this dataset:

n1 = 3, n0 = 3.

Thus:

Likelihood(p) = p^3 (1 − p)^3.

Prior Distribution
The prior is Beta(3, 7):

Prior(p) ∝ p^{α−1} (1 − p)^{β−1},

where α = 3 and β = 7.

Posterior Distribution
The posterior distribution is proportional to the product of the likelihood and the prior:

Posterior(p) ∝ p^3 (1 − p)^3 · p^2 (1 − p)^6

Simplifying:

Posterior(p) ∝ p^{3+2} (1 − p)^{3+6} = p^5 (1 − p)^9.
The posterior is therefore Beta(6, 10). The mode of a Beta(α, β) distribution with α, β > 1 is:

p̂_MAP = (α − 1) / (α + β − 2).

With the posterior parameters α = 6 and β = 10:

p̂_MAP = (6 − 1) / (6 + 10 − 2) = 5/14.
Final Answer
The MAP estimator is:
0.357
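A short check of the Beta–Bernoulli update and the posterior mode used above:

```python
from scipy.stats import beta
import numpy as np

data = [1, 0, 1, 0, 1, 0]
n1, n0 = sum(data), len(data) - sum(data)      # 3 successes, 3 failures

a_prior, b_prior = 3, 7                        # Beta(3, 7) prior
a_post, b_post = a_prior + n1, b_prior + n0    # posterior is Beta(6, 10)

p_map = (a_post - 1) / (a_post + b_post - 2)   # mode of Beta(a, b) when a, b > 1
print(a_post, b_post, round(p_map, 3))         # 6 10 0.357

# Numerical confirmation: the posterior pdf peaks at the same point.
grid = np.linspace(1e-6, 1 - 1e-6, 100_001)
print(round(grid[np.argmax(beta.pdf(grid, a_post, b_post))], 3))   # 0.357
```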
18. In a logistic regression model, two test points x1 and x2 satisfy P(y = 1 | x2) = P(y = 0 | x1). Find the ratio of the distances of x2 and x1 from the decision boundary. Here:
• w is the weight vector,
• b is the bias term,
• x is the input feature vector.
The probability that a point x belongs to class 1 is given by:

P(y = 1 | x) = σ(w^T x + b) = 1 / (1 + e^{−(w^T x + b)}),

and P(y = 0 | x) = 1 − σ(w^T x + b).

Given that P(y = 1 | x2) = P(y = 0 | x1), we have σ(w^T x2 + b) = 1 − σ(w^T x1 + b) = σ(−(w^T x1 + b)). Thus:

w^T x2 + b = −(w^T x1 + b).
For x2:

Distance(x2) = |w^T x2 + b| / ‖w‖.

For x1:

Distance(x1) = |w^T x1 + b| / ‖w‖.
Ratio of Distances
Since w^T x2 + b = −(w^T x1 + b), the numerators |w^T x2 + b| and |w^T x1 + b| are equal, so the two distances are equal.

Final Answer
The ratio of the distances of x2 and x1 from the decision boundary is: 1.
19. Consider four different soft-margin linear-SVM models trained on the same dataset with different values of C. The decision boundaries and
the supporting hyperplanes are plotted along with the dataset, as shown below:
a) C1 = 10, C2 = 1, C3 = 0.1, C4 = 0.01
b) C1 = 0.01, C2 = 0.1, C3 = 1, C4 = 10
c) C1 = C2 = C3 = C4 = 1
d) C1 = C2 = C3 = C4 = 10
Final Answer
The most appropriate values for C1 , C2 , C3 , C4 are:
2 Mock 2 2024
1. Which of the following statements about deep decision trees is correct?
Concepts Involved
• Training Data Performance: Deep trees have many levels of splits, enabling them to fit the training data closely. This typically results in
very low training error.
• Test Data Performance: Deep trees are prone to overfitting, where they memorize the training data instead of generalizing patterns,
leading to poor performance on unseen test data.
• Bias-Variance Tradeoff:
– Shallow trees: High bias and low variance, often underfitting the data.
– Deep trees: Low bias and high variance, often overfitting the data.
a) Training error (Etrain):

Etrain = (1/N) Σ_{i=1}^{N} L(yi, ŷi)

where yi is the true label, ŷi is the predicted label, N is the number of training samples, and L is the loss function.

b) Test error (Etest):

Etest = (1/M) Σ_{j=1}^{M} L(yj, ŷj)

where M is the number of test samples. Deep trees minimize Etrain, but Etest may increase due to overfitting.
Solution
Step-by-Step Reasoning
Conclusion
Deep trees exhibit excellent performance on the training data due to their capacity to model intricate patterns but often fail to generalize to
unseen data due to overfitting. The correct statement is:
Deep trees perform well on the training data but may not perform well on the test data.
2. Which of the following are valid covariance matrices for centered datasets in R^3?

C1 = [[1, 0, 0], [4, 3, 0], [1, 9, 2]],   C2 = [[4, 0, 0], [0, 3, 0], [0, 0, 2]],   C3 = [[5, 0, 0], [0, 0, 0], [0, 0, 0]],   C4 = [[1, 0, 1], [0, 1, 0], [1, 0, 0]]
Solution
We will evaluate each matrix against two criteria: it must be symmetric, and it must be positive semi-definite (PSD).
Step-by-Step Reasoning
a) Matrix C1 = [[1, 0, 0], [4, 3, 0], [1, 9, 2]]:
Symmetry: C1 is not symmetric (C12 ≠ C21, C13 ≠ C31). Conclusion: C1 is not a valid covariance matrix.
b) Matrix C2 = [[4, 0, 0], [0, 3, 0], [0, 0, 2]]:
Symmetry: C2 is symmetric. PSD Check: C2 is diagonal with non-negative entries, so x^T C2 x = 4x1^2 + 3x2^2 + 2x3^2 ≥ 0 for any x. Conclusion: C2 is a valid covariance matrix.
c) Matrix C3 = [[5, 0, 0], [0, 0, 0], [0, 0, 0]]:
Symmetry: C3 is symmetric. PSD Check: For any x = [x1, x2, x3]^T, x^T C3 x = 5x1^2 ≥ 0 (all eigenvalues are non-negative). Conclusion: C3 is a valid covariance matrix.
d) Matrix C4 = [[1, 0, 1], [0, 1, 0], [1, 0, 0]]:
Symmetry: C4 is symmetric. PSD Check: Compute x^T C4 x for x = [1, 0, −1]^T:
x^T C4 x = −1.
Since x^T C4 x = −1 < 0, C4 is not PSD. Conclusion: C4 is not a valid covariance matrix.
Final Answer
The valid covariance matrices are:
C2 and C3
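The two criteria (symmetry and positive semi-definiteness) can be verified numerically; a sketch:

```python
import numpy as np

def is_valid_covariance(C, tol=1e-10):
    C = np.asarray(C, dtype=float)
    symmetric = np.allclose(C, C.T)
    # A symmetric matrix is PSD iff all of its eigenvalues are >= 0.
    return symmetric and np.all(np.linalg.eigvalsh(C) >= -tol)

C1 = [[1, 0, 0], [4, 3, 0], [1, 9, 2]]
C2 = [[4, 0, 0], [0, 3, 0], [0, 0, 2]]
C3 = [[5, 0, 0], [0, 0, 0], [0, 0, 0]]
C4 = [[1, 0, 1], [0, 1, 0], [1, 0, 0]]

for name, C in [("C1", C1), ("C2", C2), ("C3", C3), ("C4", C4)]:
    print(name, is_valid_covariance(C))   # C1 False, C2 True, C3 True, C4 False
```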
Question 3: Sigmoid
3. The following is the vector output by some hidden layer in a neural network after the activation function has been applied:
Which of the following could be the activation function used in this layer?
a) Only ReLU   b) Only Sigmoid   c) Either ReLU or Sigmoid   d) Neither ReLU nor Sigmoid
• Properties of Outputs:
– If all elements of the output are strictly between 0 and 1, the activation function could be sigmoid.
– If elements include 0 but also extend beyond 1, it could indicate ReLU.
Solution
Step-by-Step Reasoning
a) The values in the vector are all in the range (0, 1), which is consistent with the output of the sigmoid activation function.
b) The values do not include any value above 1 or exactly 0, which are typical outputs for ReLU when the input is ≥ 0.
c) Given this information:
• The output could result from sigmoid.
• The output cannot exclusively result from ReLU, as no values exceed 1.
Final Answer
The activation function used in this layer could be: Only Sigmoid.
Question 4: Kernel
4. Consider a binary classification dataset in which the two classes are separated by a circular boundary. Each data point is of the form ([x1, x2]^T, yi). The data-points are explicitly transformed, resulting in a new set of features. A linear classifier with weight vector w has been learned on this transformed data which perfectly separates these two classes.
Which of the following could be the transformed feature vector x that was used?
a) [1, x1, x2]^T
b) [1, x1, x2, x1^2, x2^2]^T
c) [1, x1^2, x2^2]^T
d) [x1, x2, x1^2, x2^2]^T
• Explicit Feature Transformation: Transform the original features (x1 , x2 ) into a higher-dimensional feature space where the classes
become linearly separable.
• Kernel Trick: Use a kernel function to compute the higher-dimensional feature space without explicitly transforming the features.
Common Transformations
The following are common transformations for solving circularly separable data:
a) Adding quadratic terms x21 and x22 to the feature vector helps capture the radial structure of the data.
b) Linear classifiers can then separate the classes in this transformed space.
Solution
We analyze each option:
a) x = [1, x1, x2]^T: This feature vector is purely linear and does not include quadratic terms like x1^2 or x2^2. It cannot capture the circular structure.
b) x = [1, x1, x2, x1^2, x2^2]^T: The quadratic terms capture the radial structure of the data and allow the classifier to separate the circular classes using a linear decision boundary in the transformed space. This works.
c) x = [1, x1^2, x2^2]^T: This feature vector includes quadratic terms but lacks x1 and x2. While it captures the radial symmetry, it may still separate the classes only if the circle is centered at the origin; it is less flexible than option (b).

Conclusion
The most suitable transformed feature vector that could perfectly separate the two classes using a linear classifier is:

x = [1, x1, x2, x1^2, x2^2]^T.
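A minimal sketch with synthetic circularly separable data (radius and sample size assumed): in the transformed space [1, x1, x2, x1^2, x2^2], a fixed weight vector reproduces the rule "inside vs. outside a circle of radius r" as a single linear threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
r = 1.5
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < r ** 2, 1, -1)   # circular ground truth

def phi(X):
    # Transformed feature vector [1, x1, x2, x1^2, x2^2] (option b).
    return np.column_stack([np.ones(len(X)), X[:, 0], X[:, 1],
                            X[:, 0] ** 2, X[:, 1] ** 2])

# A linear classifier in the transformed space: w . phi(x) = r^2 - x1^2 - x2^2.
w = np.array([r ** 2, 0.0, 0.0, -1.0, -1.0])
predictions = np.where(phi(X) @ w > 0, 1, -1)
print((predictions == y).mean())   # 1.0 -- perfectly separated
```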
Question 5: Logistic Regression
5. A logistic regression model has been trained for a binary classification problem with labels 0 and 1. The weight vector w and the corresponding
decision boundary are displayed in the figure below.
Now, the model is tested on four points. The probability corresponding to the i-th data-point xi returned by the logistic regression model is
given as follows:
P (y = 1 | xi ) = pi
We don’t know the true labels for any of the four points. We are only talking about the predicted probabilities here. Which of the following
relationships is correct?
a) p1 < p2 < p3 < p4 b) p1 > p2 > p3 > p4
c) p3 < p4 < p2 < p1 d) p1 > p2 and p4 > p3
Solution
Step-by-Step Reasoning
From the figure:
a) The decision boundary passes through the origin and is perpendicular to the weight vector w.
b) w points in the direction where the probability P (y = 1) increases.
c) Points x3 and x4 lie on the positive side of the decision boundary, in the direction of w, so they have higher probabilities.
d) Points x1 and x2 lie on the negative side of the decision boundary, away from the direction of w, so they have lower probabilities.
e) Of the two points on the negative side, x2 lies farther from the decision boundary than x1, so:
p1 > p2
Conclusion
The correct relationship is:
p1 > p2 and p4 > p3
6. A logistic regression model is being trained on a dataset of size 2n. The first n data-points belong to class-1 (label is 1) and the rest belong
to class-0 (label is 0). Note that we are talking about the true labels here.
Which of the following expressions is the negative log-likelihood of the model on this dataset? This is also called the binary cross-entropy
loss.
a) −Σ_{i=1}^{n} log pi − Σ_{i=n+1}^{2n} log(1 − pi)
b) −Σ_{i=1}^{2n} log pi
c) −Σ_{i=1}^{2n} log(1 − pi)
d) −Σ_{i=1}^{2n} pi log pi
Dataset Description
The dataset is structured as follows:
a) The first n points belong to class-1 (yi = 1).
b) The next n points belong to class-0 (yi = 0).
For class-1 (indices i = 1, . . . , n):
yi = 1 =⇒ −yi log pi = − log pi
For class-0 (indices i = n + 1, . . . , 2n):
yi = 0 =⇒ −(1 − yi ) log(1 − pi ) = − log(1 − pi )
Thus, the binary cross-entropy loss simplifies to:

L = −Σ_{i=1}^{n} log pi − Σ_{i=n+1}^{2n} log(1 − pi)
Solution
We evaluate the given options:
a) −Σ_{i=1}^{n} log pi − Σ_{i=n+1}^{2n} log(1 − pi): This correctly represents the binary cross-entropy loss as derived above.
b) −Σ_{i=1}^{2n} log pi: This assumes all points belong to class-1, which is incorrect for the given dataset.
c) −Σ_{i=1}^{2n} log(1 − pi): This assumes all points belong to class-0, which is incorrect.
d) −Σ_{i=1}^{2n} pi log pi: This is not the binary cross-entropy loss and does not represent a correct likelihood calculation.
Conclusion
The correct expression for the negative log-likelihood (binary cross-entropy loss) is:

−Σ_{i=1}^{n} log pi − Σ_{i=n+1}^{2n} log(1 − pi)
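A direct implementation of this loss (the predicted probabilities below are made up) confirming that option (a) matches the general binary cross-entropy formula:

```python
import numpy as np

def negative_log_likelihood(y, p):
    # Binary cross-entropy: -sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

n = 3
y = np.array([1] * n + [0] * n)                 # first n points class-1, rest class-0
p = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.3])    # made-up predicted P(y = 1 | x_i)

# Option (a): -sum_{i<=n} log p_i  -  sum_{i>n} log(1 - p_i)
option_a = -np.sum(np.log(p[:n])) - np.sum(np.log(1 - p[n:]))
print(np.isclose(option_a, negative_log_likelihood(y, p)))   # True
```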
7. Consider a logistic regression model that is trained on videos to detect objectionable content. Videos with objectionable content belong to
the positive class (label 1). Harmless videos belong to the negative class (label 0).
A good detector should be able to correctly identify almost all videos that are objectionable. If it incorrectly classifies even a single video
that has inappropriate content in it, that could have serious consequences, as millions of people might end up watching it. In this process,
the detector may classify some harmless videos as belonging to the positive class. But that is a price we are willing to pay.
How should we choose the threshold (for inference) of this logistic regression model?
a) The threshold should be a low value.
b) The threshold should be a high value.
c) The performance of the classifier is independent of the threshold.

Classification Threshold
The classification decision is made based on a threshold t. The rule is:

ŷ = 1 if p ≥ t, and ŷ = 0 if p < t.
• A low threshold means that the model classifies a point as positive (ŷ = 1) even if the predicted probability p is relatively small.
• A high threshold requires higher confidence (a larger p) to classify a point as positive.
Problem Requirements
In this problem:
• Missing a video with objectionable content (false negative) is unacceptable as it has serious consequences.
• Misclassifying a harmless video as objectionable (false positive) is acceptable, even if undesirable.
To minimize false negatives (failing to detect objectionable content), we need to:
• Lower the threshold. This ensures that the detector identifies as many videos with objectionable content as possible, even at the cost
of increasing false positives.
Solution
Step-by-Step Reasoning
a) A low threshold ensures that the classifier outputs ŷ = 1 for more data points, reducing the chances of missing objectionable videos
(false negatives).
b) The cost of increasing false positives (misclassifying harmless videos) is acceptable in this context.
c) A high threshold, on the other hand, would increase the likelihood of false negatives, which is unacceptable.
Conclusion
The threshold should be chosen as a low value to ensure that the model correctly identifies almost all videos with objectionable content.
8. A logistic regression model has been trained on a dataset in a binary classification setup. It is now tested on two separate datasets, each
having 14 data-points, 7 from each class. The loss (negative log-likelihood) of the same model on the two test-datasets is L1 and L2 . It is
also given that the classification accuracy of the model on both these test-datasets is 100%.
The following images show the test datasets along with the decision boundary of the model:
• First test dataset (Loss = L1 )
• Second test dataset (Loss = L2 )
The model outputs P(y = 1 | x) = σ(w^T x + b),
where σ(z) = 1 / (1 + e^{−z}) is the sigmoid function.
The negative log-likelihood (binary cross-entropy loss) for a dataset is:

L = −(1/N) Σ_{i=1}^{N} [ yi log pi + (1 − yi) log(1 − pi) ],
where:
• yi is the true label (0 or 1),
• pi is the predicted probability for class y = 1,
• N is the total number of data-points.
• This means the model’s confidence in its predictions is high (probabilities closer to 1 for positive points and 0 for negative points).
• Hence, the loss L2 will be relatively lower.
Conclusion
The relationship between the losses is:
L1 > L2
because the points in the first dataset are closer to the decision boundary, resulting in lower confidence and higher loss, while the points in
the second dataset are farther from the decision boundary, resulting in higher confidence and lower loss.
Question 9: AdaBoost
9. Consider a training dataset of n data-points for a binary classification problem that satisfies the following condition for all i in 1, . . . , n:

xij = a if yi = 1, and xij = b if yi = −1,

where xij is the j-th feature for the i-th data-point, and a and b are real numbers with a ≠ b. If an AdaBoost model is fit on this dataset,
how many rounds would be required to get a good classifier?
AdaBoost Overview
AdaBoost (Adaptive Boosting) is an ensemble method that combines multiple weak classifiers to form a strong classifier. The key concepts
of AdaBoost include:
• Weak Classifier: A classifier that performs slightly better than random guessing (accuracy > 50%).
• Rounds: AdaBoost works iteratively over multiple rounds, training weak classifiers at each step and combining them with weights.
• Error and Weights: Misclassified samples are given higher weights in subsequent rounds so that the next weak classifier focuses more
on these points.
Dataset Structure
Solution
Since a single weak classifier (e.g., a decision stump with a threshold between a and b) can perfectly separate the data, AdaBoost does not require additional rounds to achieve a strong classifier. Thus, the number of rounds required is 1.
Conclusion
The AdaBoost model requires only 1 round to get a good classifier because the dataset is linearly separable using a single feature threshold.
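A sketch of why one round suffices: with made-up constants a and b, a single decision stump with any threshold between them already attains zero training error on the j-th feature.

```python
import numpy as np

a, b = 2.0, -1.0                 # assumed constants with a != b (here a > b)
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=20)
xj = np.where(y == 1, a, b)      # the j-th feature as described in the question

theta = (a + b) / 2.0            # any threshold strictly between b and a
stump = np.where(xj > theta, 1, -1)   # a single weak learner (decision stump)
print((stump == y).mean())       # 1.0 -- zero training error after one round
```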
10. Which of the following models has the potential to achieve zero training error on every possible training dataset in R2 ?
Question 10: Training Error
The ability of a model to achieve zero training error depends on its flexibility (capacity) to fit the data. A model that can perfectly fit any
training dataset is often referred to as an overfitting model. Below, we evaluate each option based on its characteristics:
1. Decision Tree
- A Decision Tree can achieve zero training error on any dataset because it can recursively split the feature space until each data point belongs
to its own leaf node. - This gives Decision Trees high flexibility and the ability to memorize training data. - Therefore, a Decision Tree can
perfectly fit any dataset.
2. Logistic Regression
- Logistic Regression is a linear model that assumes a linear decision boundary. - For a dataset in R2 , if the data is not linearly separable,
Logistic Regression cannot achieve zero training error. - Hence, Logistic Regression does not have the potential to fit all datasets perfectly.
3. Soft Margin Linear-SVM
- A Soft Margin Linear-SVM allows for a margin and some misclassifications, enabling it to work with linearly non-separable data. - However, since it still assumes a linear decision boundary, it cannot achieve zero training error on datasets that are not linearly separable. - Therefore, a Soft Margin Linear-SVM does not achieve zero error on every dataset.
Solution
Step-by-Step Reasoning
a) A Decision Tree has sufficient flexibility to achieve zero training error on any dataset by memorizing the data points.
b) Logistic Regression and Soft Margin Linear-SVM assume linear decision boundaries, so they cannot achieve zero error on non-linearly
separable datasets.
c) A Soft Margin Kernel-SVM with a cubic kernel can learn complex non-linear decision boundaries, which allows it to achieve zero training
error in most cases. However, due to the soft margin, it might allow a small number of misclassifications.
Conclusion
The model that has the potential to achieve zero training error on every possible training dataset in R2 is:
Decision Tree
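A quick illustration using scikit-learn (a sketch with random data and even random labels): an unrestricted decision tree reaches 100% training accuracy by memorizing the points.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # arbitrary points in R^2
y = rng.integers(0, 2, size=200)       # labels assigned completely at random

tree = DecisionTreeClassifier()        # no depth limit: the tree can memorize the data
tree.fit(X, y)
print(tree.score(X, y))                # 1.0 -- zero training error
```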
11. If the K-means algorithm with k = 2 is applied on each of these datasets, which of the following statements is true?
a) The algorithm will terminate only in the case of dataset-1 after a certain number of iterations.
b) The algorithm will never terminate in the case of dataset-2 and will keep oscillating between different cluster configurations.
c) The algorithm will terminate for both datasets.
d) The algorithm will not terminate for both datasets.
Theory and Explanation
The K-means algorithm is an iterative clustering algorithm that partitions the data into k clusters. The steps are as follows:
a) Initialize k cluster centroids.
b) Assign each data point to the cluster whose centroid is closest to it (based on a distance metric, such as Euclidean distance).
c) Recompute the cluster centroids as the mean of the points assigned to each cluster.
d) Repeat steps 2 and 3 until the centroids stop changing (convergence is achieved).
The K-means algorithm is guaranteed to converge to a local optimum in a finite number of steps. However, for certain data distributions, it
may exhibit behaviors like oscillating cluster assignments when centroids keep switching positions between iterations.
Dataset Analysis
a) Dataset-1: The data points in dataset-1 are clearly separated into two compact and well-defined clusters. Since the clusters are well-
separated:
• The algorithm will quickly converge to a stable configuration where each cluster centroid is at the mean of its respective group of
points.
• The K-means algorithm will terminate after a finite number of iterations.
b) Dataset-2: The data points in dataset-2 form a circular ring-like structure. For k = 2:
• There is no obvious separation into two compact clusters. This creates ambiguity in cluster assignment.
• The cluster centroids may keep oscillating between different configurations because the circular structure does not allow for stable,
distinct centroids.
• The algorithm will fail to converge and will keep oscillating.
Solution
Step-by-Step Reasoning
a) For Dataset-1, the algorithm will terminate as the data has two well-separated clusters.
b) For Dataset-2, the algorithm will oscillate between cluster assignments due to the circular structure, preventing convergence.
Conclusion
The algorithm will never terminate in the case of dataset-2 and will keep oscillating between different cluster configurations.
12. Consider the following data-points that make up the training dataset in a binary classification problem. They are of the form (x1, x2):
Points labeled as 0 : (−5, −3), (−4, 1), (3, 2), (4, 5), (2, 1)
Points labeled as 1 : (15, 1), (21, −10), (8, 4), (7, 0), (9, −10)
If you are allowed to ask a question of the form fk < θ, what is the information gain corresponding to the "best" question? Use log2 as the logarithm base.
Theory and Explanation
Information Gain
Information gain is used to determine the "best" question (feature split) in decision trees. It measures the reduction in uncertainty (entropy) after splitting the data. The formula for information gain is:

Information Gain (IG) = H(P) − Σ_i (|Pi| / |P|) · H(Pi),

where:
• H(P) is the entropy of the parent node,
• Pi are the child nodes after splitting the data,
• H(Pi) is the entropy of each child node.

Entropy
Entropy measures the impurity or randomness of a dataset. It is given by:

H(P) = −Σ_c pc log2(pc),

where pc is the proportion of points in P that belong to class c.
Solution
The ”best” question splits the data into two child nodes such that the weighted average of their entropies is minimized.
Let us consider a feature fk (e.g., x1 or x2 ) and a threshold θ. A split of the form fk < θ will divide the dataset into two subsets:
• Left node: points satisfying fk < θ,
• Right node: points satisfying fk ≥ θ.
To determine the best split:
a) Sort the data points along each feature (x1 and x2 ).
b) Evaluate candidate splits (midpoints between adjacent points).
c) Compute the weighted average entropy for each split.
d) Select the split that maximizes the information gain.
• The best split occurs when the data is perfectly separated into two child nodes, one containing all points labeled 0 and the other
containing all points labeled 1.
In this case:
• Entropy of the left child H(Pleft ) = 0 (pure node),
• Entropy of the right child H(Pright ) = 0 (pure node).
The weighted average entropy after the split is:

Average Entropy = (5/10) · 0 + (5/10) · 0 = 0.

The parent node contains 5 points from each class, so H(P) = 1 bit, and the information gain is IG = 1 − 0 = 1 bit.
Conclusion
The information gain corresponding to the "best" question is:
1 (bit)
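The same entropy and information-gain computation in code, splitting on x1 with a threshold between 4 and 7 (consistent with the reasoning above):

```python
import numpy as np

def entropy(labels):
    # H = -sum_c p_c log2 p_c
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

x1 = np.array([-5, -4, 3, 4, 2, 15, 21, 8, 7, 9])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

theta = 5.5                      # any threshold between max(x1 | y=0)=4 and min(x1 | y=1)=7
left, right = y[x1 < theta], y[x1 >= theta]

parent = entropy(y)              # 1.0 bit (5 points of each class)
children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
print(parent - children)         # 1.0
```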
13. A linear regression model ŷ = w^T x is fit by least squares, where w is the weight vector and x is the input feature vector. Every data point lies along a fixed direction u, i.e. xi = ci · u for scalars ci, and it is given that Σ_i ci · yi = 20 and Σ_i ci^2 = 100. Find the prediction for the test point xtest = 5 · u.

Simplified Assumption
• All data points lie along the direction of u in the feature space.
• Hence, the regression solution will also lie along the direction of u: w = α · u, where α is a scalar.
Step-by-Step Solution
The optimal weight vector w is obtained by minimizing the least squares error. Substituting w = α · u and xi = ci · u, the predicted value becomes:

ŷi = w^T xi = (α · u)^T (ci · u) = α · ci · ‖u‖^2.

The least squares loss is L(α) = Σ_{i=1}^{n} (yi − α · ci · ‖u‖^2)^2. To minimize L(α), we compute the derivative with respect to α and set it to zero:

∂L/∂α = −2 Σ_{i=1}^{n} ci · ‖u‖^2 · (yi − α · ci · ‖u‖^2) = 0.

Simplifying:

Σ_{i=1}^{n} ci · yi = α · ‖u‖^2 Σ_{i=1}^{n} ci^2.

Solve for α:

α = (Σ_{i=1}^{n} ci · yi) / (‖u‖^2 Σ_{i=1}^{n} ci^2).

Given:

Σ_{i=1}^{n} ci · yi = 20,   Σ_{i=1}^{n} ci^2 = 100,

so α = 20 / (100 · ‖u‖^2) = 1 / (5 · ‖u‖^2).
The prediction for the test point xtest = 5 · u is:

ŷtest = w^T xtest = α · 5 · ‖u‖^2.

Substituting α = 1 / (5 · ‖u‖^2):

ŷtest = (1 / (5 · ‖u‖^2)) · 5 · ‖u‖^2 = 1.
Conclusion
The predicted value for the test point xtest = 5 · u is:
1 .
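A numerical sketch of this result. The direction u is not specified in the extracted question, so an arbitrary u and synthetic (ci, yi) satisfying Σ ci · yi = 20 and Σ ci^2 = 100 are used; least squares then predicts 1 at xtest = 5 · u regardless of the choice of u:

```python
import numpy as np

u = np.array([3.0, 4.0])                 # arbitrary direction (not given in the question)
c = np.array([5.0, 5.0, 5.0, 5.0])       # sum(c_i^2) = 100
y = np.array([1.0, 1.0, 1.0, 1.0])       # sum(c_i * y_i) = 20

X = np.outer(c, u)                       # each data point is x_i = c_i * u
w, *_ = np.linalg.lstsq(X, y, rcond=None)

x_test = 5.0 * u
print(x_test @ w)                        # ~1.0, matching the closed-form answer
```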
14. Let f (w) be the loss function for a linear regression problem. X is a d × n data matrix, w ∈ Rd , and y ∈ Rn . Consider the following steps:
a) Step-1: f is a convex function.
b) Step-2: Every local minimum of f is a global minimum.
c) Step-3: There is a unique solution w0 that minimizes f .
d) Step-4: If XX^T is invertible, the unique solution to the minimization problem is given by w0 = (XX^T)^{−1} X y.
Which step is incorrect?
a) Step-1 b) Step-2
c) Step-3 d) Step-4
e) All steps are correct. There are no incorrect steps.
Step-by-Step Analysis
a) Step-1: f is a convex function. The loss function f(w) = ‖X^T w − y‖^2 is a quadratic function of w. Quadratic functions of this form are convex because their Hessian is positive semi-definite. Thus, f is convex. Step-1 is correct.
b) Step-2: Every local minimum of f is a global minimum. For convex functions, any local minimum is also a global minimum. Since f is convex, this property holds true. Step-2 is correct.
c) Step-3: There is a unique solution w0 that minimizes f. The solution to the minimization problem is unique if XX^T is invertible. If XX^T is not invertible, there may be infinitely many solutions. Thus, Step-3 is contingent on the invertibility of XX^T; it is correct under that assumption.
d) Step-4: If XX^T is invertible, the unique solution is given by w0 = (XX^T)^{−1} X y. This step is flagged as incorrect because the familiar closed-form solution to the linear regression problem is

w0 = (X^T X)^{−1} X^T y,

where X^T X (not XX^T) must be invertible for the solution to be valid. Therefore, Step-4 is marked incorrect for specifying XX^T instead of X^T X. (Note that this familiar form corresponds to an n × d design matrix; with X defined as d × n, as in this question, the normal equations read XX^T w = X y.)
Conclusion
The incorrect step is:
Step-4
15. The accuracy of a classifier on a dataset is defined as the proportion of points in the dataset that are correctly classified by it. If w is the weight of a linear classifier that has an accuracy of 0.85 on a dataset, what is the accuracy of a linear classifier whose weight is w/2?
Impact of Scaling w
Scaling the weight vector by a positive constant does not change the sign of w^T x for any point, so every prediction of the classifier stays the same.

Solution
The accuracy of the linear classifier with weight w/2 is:

0.85
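A sketch of why the accuracy cannot change, assuming a classifier of the form sign(w^T x): multiplying w by a positive constant leaves every sign, and hence every prediction, unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w = rng.normal(size=5)

pred_full = np.sign(X @ w)
pred_half = np.sign(X @ (w / 2))
print(np.array_equal(pred_full, pred_half))   # True -- identical predictions, identical accuracy
```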
16. In a hard-margin SVM problem, let the optimal weight vector be:

w* = [1, 2, 3]^T.

What is the distance of the closest point in the dataset from the decision boundary? Provide your answer correct to three decimal places.

The decision boundary is w^T x + b = 0, where w is the weight vector, x is the input vector, and b is the bias term.
The distance of a point x from the decision boundary is:

Distance = |w^T x + b| / ‖w‖.

In an SVM, the closest points to the decision boundary are the support vectors. For a hard-margin SVM, these points satisfy:

|w^T x + b| = 1.

Thus, the distance of the closest point (support vector) from the decision boundary is:

Distance = 1 / ‖w‖.
Step-by-Step Solution
‖w*‖ = √(1^2 + 2^2 + 3^2) = √14 ≈ 3.742, so the distance is 1/√14 ≈ 0.267.
Conclusion
The distance of the closest point in the dataset from the decision boundary is:
0.267 .
17. Let the eigenvectors of a covariance matrix for a centered dataset in R3 be w1 , w2 , and w3 . These three directions specify a new coordinate
system to represent the data. If the entire dataset is represented in terms of these new coordinates, what can you say about the covariance
matrix in this new coordinate system?
Thus, the covariance matrix in the new coordinate system becomes diagonal:

Σ' = diag(λ1, λ2, λ3).
Analysis of Options
a) The covariance matrix is invariant to change of coordinates: This is incorrect. The covariance matrix changes under a change of
coordinates and becomes diagonal in the new basis defined by the eigenvectors.
b) The covariance matrix becomes diagonal: This is correct. In the new coordinate system, the covariance matrix is diagonal with
eigenvalues as its entries.
c) The covariance matrix becomes identity: This is incorrect. The diagonal entries of the covariance matrix are the eigenvalues, which
generally are not all equal to 1.
d) Insufficient information to answer this question: This is incorrect. The information provided (eigenvectors and eigenvalues of the
covariance matrix) is sufficient to conclude that the covariance matrix becomes diagonal.
Conclusion
The covariance matrix becomes diagonal in the new coordinate system.
Step 1: Eigenvalues and Eigenvectors
Characteristic Equation
Eigenvectors
• For λ1 = 6: the eigenvector is v1 = [1, 2, 0]^T.
• For λ2 = 1:
Σ − 1I = [[3, 2, 0], [2, 2, 0], [0, 0, 1]].
The eigenvector is v2 = [−2, 3, 0]^T.
• For λ3 = 2:
Σ − 2I = [[2, 2, 0], [2, 1, 0], [0, 0, 0]].
The eigenvector is v3 = [0, 0, 1]^T.

The eigenvectors v1, v2, v3 form a basis for the new coordinate system. The change-of-basis matrix P, whose columns are v1, v2, v3, is:

P = [[1, −2, 0], [2, 3, 0], [0, 0, 1]],

and the covariance matrix in the new coordinates is Σ' = P^T Σ P.
Step 3: Diagonalization
Performing the matrix multiplication gives:

Σ' = diag(6, 1, 2).
This confirms that the covariance matrix in the new coordinate system becomes diagonal, with the eigenvalues as its diagonal entries.
Conclusion
When the data is represented in terms of the eigenvectors of the covariance matrix, the covariance matrix becomes diagonal. This is consistent
with the theoretical result.
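A numerical sketch of the general fact, using an assumed covariance matrix (not the one from the worked example above): projecting centered data onto the eigenvectors of its covariance matrix yields a diagonal covariance with the eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.0, 2.0, 0.0],       # an assumed covariance matrix for illustration
                  [2.0, 3.0, 0.0],
                  [0.0, 0.0, 2.0]])

# Sample a centered dataset with (approximately) this covariance.
X = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=200_000)
X -= X.mean(axis=0)

C = X.T @ X / len(X)                     # empirical covariance
eigvals, eigvecs = np.linalg.eigh(C)     # columns of eigvecs are w1, w2, w3

Z = X @ eigvecs                          # data expressed in the eigenvector coordinates
C_new = Z.T @ Z / len(Z)
print(np.round(C_new, 3))                # diagonal, with the eigenvalues on the diagonal
print(np.round(eigvals, 3))
```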