ML Midsem 2022
Total Points: 25
Instructions
• There is no partial marking for multiple choice questions, and there could be multiple correct
answers. Choose the right option(s) and justify your choice.
1. (1 point) Given training and test sets
X_{train} = (x^{(1)}, x^{(2)}, \ldots, x^{(m_{train})}), \quad Y_{train} = (y^{(1)}, y^{(2)}, \ldots, y^{(m_{train})})
X_{test} = (x^{(1)}, x^{(2)}, \ldots, x^{(m_{test})}), \quad Y_{test} = (y^{(1)}, y^{(2)}, \ldots, y^{(m_{test})})
you want to normalize your data before training your model. Which of the following propositions are true?
1. The normalizing mean and variance computed on the training set, and used to train the
model, should be used to normalize test data.
2. Test data should be normalized with its own mean and variance before being fed to
the network at test time because the test distribution might be different from the train
distribution.
3. Normalizing the input impacts the landscape of the loss function.
4. In imaging, just like for structured data, normalization consists in subtracting the mean
from the input and multiplying the result by the standard deviation.
Solution: Options (1) and (3). We should compute the normalizing mean and variance on the training
data, and then normalize the test instances using those same training statistics. This way we evaluate
whether the model generalizes to new, unseen data points under the same preprocessing. Normalizing the
input also changes the scaling of the features and therefore the landscape (conditioning) of the loss
function. 0.5 marks for each correct answer, 1 mark total.
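A minimal NumPy sketch of the intended procedure (the arrays and their shapes are illustrative, not taken from the question):

    import numpy as np

    # Illustrative stand-ins for X_train / X_test; rows are samples, columns are features.
    rng = np.random.default_rng(0)
    X_train = rng.normal(loc=2.0, scale=5.0, size=(100, 3))
    X_test = rng.normal(loc=2.0, scale=5.0, size=(20, 3))

    # Normalizing statistics are computed on the training set only ...
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8      # small epsilon guards against zero variance

    X_train_norm = (X_train - mu) / sigma
    # ... and the SAME mu and sigma are reused for the test set at evaluation time.
    X_test_norm = (X_test - mu) / sigma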
2. (1 point) Having multiple perceptrons can actually solve the XOR problem satisfactorily: this
is because each perceptron can partition off a linear part of the space itself, and they can then
combine their results? State your reasoning clearly for choosing the correct option.
1. True – this works always, and these multiple perceptrons learn to classify even complex
problems.
2. False – perceptrons are mathematically incapable of solving linearly inseparable functions,
no matter what you do.
3. True – perceptrons can do this but are unable to learn to do it – they have to be explicitly
hand-coded.
4. False – just having a single perceptron is enough.
Solution: (3) True – perceptrons can do this but are unable to learn to do it; the multi-layer
combination has to be explicitly hand-coded, since the perceptron learning rule only trains a single
layer. 1 mark for correct answer.
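To illustrate option (3), here is a hand-wired (not learned) arrangement of threshold perceptrons that computes XOR; the OR/NAND/AND decomposition and all weights are chosen by hand for this sketch, not produced by the perceptron learning rule:

    import numpy as np

    def perceptron(x, w, b):
        # Step-activation perceptron: outputs 1 if w.x + b > 0, else 0.
        return int(np.dot(w, x) + b > 0)

    def xor_net(x1, x2):
        x = np.array([x1, x2])
        h_or = perceptron(x, np.array([1.0, 1.0]), -0.5)      # OR of the inputs
        h_nand = perceptron(x, np.array([-1.0, -1.0]), 1.5)   # NAND of the inputs
        return perceptron(np.array([h_or, h_nand]), np.array([1.0, 1.0]), -1.5)  # AND of the two

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_net(a, b))   # last column prints 0, 1, 1, 0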
3. (1 point) Suppose you are given an EM algorithm that finds maximum likelihood estimates
for a model with latent variables. You are asked to modify the algorithm so that it finds MAP
estimates instead. Which step or steps do you need to modify:
1. Expectation
2. Maximization
3. No modification necessary
4. Both
Solution: (2) Maximization. For MAP estimation the M-step maximizes the expected complete-data
log-likelihood plus the log-prior log p(θ); the E-step, which computes the posterior over the latent
variables given the current parameters, is unchanged. [For a detailed solution refer to
https://www.jmlr.org/papers/volume1/meila00a/html/node16.html]
1 mark for correct answer and justification.
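For reference, a sketch of the two steps in generic EM notation (θ^{(t)} is the current estimate; this is standard material, not taken from the linked reference):

    \text{E-step:}\quad Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \sim p(Z \mid X, \theta^{(t)})}\big[\log p(X, Z \mid \theta)\big]
    \text{M-step (ML):}\quad \theta^{(t+1)} = \arg\max_{\theta}\; Q(\theta \mid \theta^{(t)})
    \text{M-step (MAP):}\quad \theta^{(t+1)} = \arg\max_{\theta}\; \big\{ Q(\theta \mid \theta^{(t)}) + \log p(\theta) \big\}

Only the maximization objective changes; the posterior over the latent variables computed in the E-step does not involve the prior on θ.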
4. (1 point) Given
n_{\max} = \frac{\beta\,\|w_0\|^2}{\alpha^2}
where nmax is the maximum number of iterations for a perceptron to converge and w0 is the
optimal weights at convergence, which of the following is true:
1. There is no unique solution for w0 but unique solution for nmax
2. There is no unique solution for nmax but unique solution for w0 exists
3. There is no unique solution for nmax and w0
4. Unique solution for nmax and w0 exists
Solution: (3) There is no unique solution for nmax and w0 . The data may be separated by many different
hyperplanes, so w0 is not unique, and since the bound nmax is computed from w0 (together with α and β),
it is not unique either.
1 mark for correct answer.
5. (1 point) Which of the following is/are true:
1. Logistic loss is better than L2 loss in classification tasks.
2. In terms of feature selection, L2 regularization is preferred since it comes up with sparse
solutions
3. A classifier that attains 100% accuracy on the training set and 70% accuracy on test set
is better than a classifier that attains 70% accuracy on the training set and 75% accuracy
on test set.
4. MSE is the preferred loss function for logistic regression
Solution: (1) Logistic loss is better than L2 loss in classification tasks. Squared-error loss applied to
a sigmoid output yields a non-convex objective and vanishing gradients for confidently wrong predictions,
whereas the logistic (cross-entropy) loss is convex in the parameters and penalizes confident
misclassifications strongly; this also rules out option (4). Option (2) is false because it is L1, not
L2, regularization that produces sparse solutions. 1 mark for correct answer and justification.
Section B: Short Answers
6. (2 points) Given below are two versions of the perceptron learning algorithm. Identify the correct
implementation and justify.
Solution: There is no difference in the perceptron update rule itself; the difference is in the test for
convergence. In Option (A) the test for convergence is done after all training samples have been seen (a
full pass), while in Option (B) it is done after each sample. Option (A) is the right implementation:
convergence should be tested only once the entire training set has been seen. Otherwise, if the very
first sample happens to be classified correctly, the per-sample convergence criterion is immediately
satisfied and the PLA stops, resulting in improper training.
1 mark for identifying the right implementation. 1 mark for justifying how it helps in the convergence
of the PLA.
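A minimal sketch of the correct variant, with the convergence test placed as in Option (A); the data format and the max_epochs cap are illustrative assumptions:

    import numpy as np

    def pla_train(X, y, max_epochs=1000):
        """Perceptron learning. X has a leading bias column of 1s; y takes values in {-1, +1}."""
        w = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * np.dot(w, xi) <= 0:   # misclassified (or on the boundary)
                    w += yi * xi              # perceptron update
                    mistakes += 1
            # Option (A): convergence is tested only after the full pass over the data,
            # not after each individual sample as in Option (B).
            if mistakes == 0:
                break
        return w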
7. (2 points) The given plot shows the loss functions of logistic regression and the perceptron. Compare
and contrast the nature of the loss functions given in the figure (any 2 observations).
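The figure itself is not reproduced here. Assuming the standard forms, logistic loss log(1 + exp(−z)) and perceptron loss max(0, −z), plotted against the margin z = y(w·x), a short sketch to regenerate such a plot:

    import numpy as np
    import matplotlib.pyplot as plt

    z = np.linspace(-4, 4, 400)                  # margin z = y * (w . x)
    logistic_loss = np.log(1.0 + np.exp(-z))     # smooth, strictly positive for every z
    perceptron_loss = np.maximum(0.0, -z)        # piecewise linear, exactly 0 for z >= 0

    plt.plot(z, logistic_loss, label="logistic loss")
    plt.plot(z, perceptron_loss, label="perceptron loss")
    plt.xlabel("margin z = y(w.x)")
    plt.ylabel("loss")
    plt.legend()
    plt.show()

Under these assumed forms, typical observations are that the logistic loss is smooth (differentiable everywhere) and remains positive even for correctly classified points, continuing to push the margin, while the perceptron loss is piecewise linear, non-differentiable at z = 0, and exactly zero for every correctly classified point.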
8. (2 points) Calculate the bias and variance. Evaluate the performance of the model in terms of bias
and variance and comment.
Solution:
0.5 mark each for the bias and variance calculation and for the right observations.
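The data table for this question is not reproduced here. For reference, a sketch of the standard quantities to compute, with f(x) the true value and \hat{f}(x) the model's predictions over the given samples:

    \mathrm{Bias} = \mathbb{E}\big[\hat{f}(x)\big] - f(x), \qquad
    \mathrm{Variance} = \mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^{2}\Big]

High bias indicates underfitting; high variance indicates sensitivity to the particular training sample, i.e. overfitting.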
9. (2 points) You would like to train a dog/cat image classifier using mini-batch gradient descent. You
have already split your dataset into train, validation and test sets. The classes are balanced. You
realize that within the training set, the images are ordered in such a way that all the dog images come
first and all the cat images come after. Your test set (Xtest, Ytest) is such that the first m1 images
are of dogs and the remaining images are of cats. After shuffling Xtest and Ytest, you evaluate your
model on it to obtain a classification accuracy a1%. You also evaluate your model on Xtest and Ytest
without shuffling to obtain accuracy a2%. What is the relationship between a1 and a2 (>, <, =, ≤, ≥)?
Explain.
Solution: a1 = a2. When evaluating on the test set, the only computation performed is a single metric
(e.g. accuracy) over the entire test set; no parameters are updated, so the value of the metric does not
depend on the ordering of the examples.
0.5 mark for the right observation. 1.5 marks for justification.
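A quick sanity check of this invariance (the labels and predictions below are synthetic, not from the question):

    import numpy as np

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)                          # synthetic test labels
    y_pred = np.where(rng.random(1000) < 0.8, y_true, 1 - y_true)   # roughly 80%-accurate predictions

    perm = rng.permutation(1000)                                    # shuffle the test set
    acc_unshuffled = np.mean(y_pred == y_true)
    acc_shuffled = np.mean(y_pred[perm] == y_true[perm])
    assert acc_unshuffled == acc_shuffled                           # accuracy is order-independent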
Section C: Descriptive
10. (4 points) To exploit the desirable properties of decision tree classifiers and perceptrons,
Adam came up with a new algorithm called the “perceptron tree” that combines features from
both. Perceptron trees are similar to decision trees, but each leaf node contains a perceptron
rather than a majority vote. To create a perceptron tree, the first step is to follow a regular
decision tree learning algorithm (such as ID3) and perform splitting on attributes until the
specified maximum depth is reached. Once maximum depth has been reached, at each leaf
node, a perceptron is trained on the remaining attributes which have not yet been used in that
branch. Classification of a new example is done via a similar procedure. The example is first
passed through the decision tree based on its attribute values. When it reaches a leaf node, the
final prediction is made by running the corresponding perceptron at that node. Assume that
you have a dataset with 6 binary attributes {A, B, C, D, E, F } and two output labels {−1, 1}.
A perceptron tree of depth 2 on this dataset is given below. Weights of the perceptron are
given in the leaf nodes. Assume bias b = 1 for each perceptron.
1. What would the given perceptron tree predict as the output label for the sample x =
[1, 1, 0, 1, 0, 1]? (2 marks)
2. True or False? “The decision boundary of a perceptron tree will always be linear.”
Justify. (1 mark)
3. “For small values of max depth, decision trees are more likely to underfit the data than
perceptron trees.” True or False? Justify. (1 mark)
Solution:
(a) A=1 and D=1 so the point is sent to the right-most leaf node, where the perceptron output
is (1*1)+(0*0)+((-1)*0)+(1*1)+1 = 1 + 0 + 0 +1 +1 = 3. Prediction = sign(3) = 1.
(b) False, since decision tree boundaries need not be linear.
(c) True. For small values of max depth, decision trees essentially degenerate into majority-vote
classifiers at the leaves. Perceptron trees, on the other hand, can still make use of the “unused”
attributes at the leaves to predict the correct class. (Desirable properties combined: decision trees
contribute non-linear decision boundaries; the leaf perceptrons contribute the ability to gracefully
handle attribute values unseen in training and better generalization at the leaf nodes.)
2 marks for part (a). 1 mark for part (b), to be awarded only for a correct answer with correct
justification (binary marking). 1 mark for part (c), to be awarded only for a correct answer with
correct justification (binary marking).
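A sketch of the leaf computation used in part (a); the routing on A and D, the leaf weights [1, 0, −1, 1] and the remaining attributes (B, C, E, F) are read off from the figure as described in the solution above:

    import numpy as np

    def leaf_perceptron(features, w, b=1.0):
        # Leaf node of the perceptron tree: sign(w . x + b) over the attributes
        # not used for splitting on this branch.
        return 1 if np.dot(w, features) + b > 0 else -1

    # Sample x = [A, B, C, D, E, F] = [1, 1, 0, 1, 0, 1].
    # A = 1 and D = 1 route it to the right-most leaf; its perceptron acts on [B, C, E, F].
    remaining = np.array([1, 0, 0, 1])     # values of B, C, E, F
    w_leaf = np.array([1, 0, -1, 1])       # leaf weights from the figure
    print(leaf_perceptron(remaining, w_leaf))   # 1 + 0 + 0 + 1 + 1 = 3 -> sign = +1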
11. (4 points) We are building a random forest for a 2-class classification problem with n decision
trees RF = {T1, T2, ..., Tn} and bagging. Each tree generated in bagging is identically distributed
(i.d.) but not necessarily independent, and the expectation of an average of n such trees is the same as
the expectation of any one of them. Find the variance of the average of the n trees, given that the
trees with odd indices 2i + 1, where i ∈ {0, 1, 2, ..., (n−1)/2}, are independent of each other and the
positive pairwise correlation between the rest of them is ρ.
Solution:
\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2
Given that the trees with odd indices are independent of each other, their correlation, and hence their
covariance, is 0. Considering the rest of the trees, each correlated pair contributes a covariance of ρσ².
Equation for the variance of the average of trees with variance σ² (1 mark). Deriving the final equation
for the variance of the average of trees (1 mark). Identifying the constraints and finding that the
covariance is 0 due to the independence constraint (1 mark). Applying it in the variance formula and the
final formula (1 mark).
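A sketch of the general decomposition behind the marking scheme (standard variance algebra, with σ² the common variance of each tree):

    \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
    = \frac{1}{n^2}\left[\sum_{i=1}^{n}\mathrm{Var}(X_i) + \sum_{i \neq j}\mathrm{Cov}(X_i, X_j)\right]
    = \frac{\sigma^2}{n} + \frac{1}{n^2}\sum_{i \neq j}\mathrm{Cov}(X_i, X_j)

The covariance terms for the independent (odd-indexed) pairs are 0, while each remaining correlated pair contributes ρσ²; if every pair were correlated, the sum would collapse to the expression ρσ² + (1 − ρ)σ²/n quoted above.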