Exam 1 Practice Solutions
Instructions:
• Fill in your name and Andrew ID above. Be sure to write neatly, or you may not
receive credit for your exam.
• Clearly mark your answers in the allocated space on the front of each page. If
needed, use the back of a page for scratch space, but you will not get credit for anything
written on the back of a page. If you have made a mistake, cross out the invalid parts
of your solution, and circle the ones which should be graded.
• No electronic devices may be used during the exam.
• Please write all answers in pen.
• You have N/A to complete the exam. Good luck!
For "Select One" questions, please fill in the appropriate bubble completely:
Select One: Who taught this course?
● Matt Gormley
○ Marie Curie
○ Noam Chomsky
If you need to change your answer, you may cross out the previous answer and bubble in
the new answer:
Select One: Who taught this course?
⊗ Matt Gormley
○ Marie Curie
● Noam Chomsky
For “Select all that apply” questions, please fill in all appropriate squares completely:
Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
□ I don't know
Again, if you need to change your answer, you may cross out the previous answer(s) and
bubble in the new answer(s):
Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
⊠ I don't know
For questions where you must fill in a blank, please make sure your final answer is fully
included in the given space. You may cross out answers or parts of answers, but the final
answer must still be within the given space.
Fill in the blank: What is the course number?
10-601
1 Decision Trees
1. Perceptron Trees: To exploit the desirable properties of decision tree classifiers
and perceptrons, Adam came up with a new algorithm called “perceptron trees”, which
combines features from both. Perceptron trees are similar to decision trees; however,
each leaf node holds a perceptron instead of a majority vote.
To create a perceptron tree, the first step is to follow a regular decision tree learning
algorithm (such as ID3) and perform splitting on attributes until the specified maximum
depth is reached. Once maximum depth has been reached, at each leaf node, a perceptron
is trained on the remaining attributes which have not been used up in that branch.
Classification of a new example is done via a similar procedure. The example is first
passed through the decision tree based on its attribute values. When it reaches a leaf
node, the final prediction is made by running the corresponding perceptron at that node.
Assume that you have a dataset with 6 binary attributes (A, B, C, D, E, F) and
two output labels (-1 and 1). A perceptron tree of depth 2 on this dataset is given
below. Weights of the perceptrons are given in the leaf nodes. Assume bias = 1 for each
perceptron:
(a) Numerical answer: Given a sample x = [1, 1, 0, 1, 0, 1], predict the output label
for this sample
1. Explanation: A = 1 and D = 1, so the point is sent to the right-most leaf node,
where the perceptron output is (1*1)+(0*0)+((-1)*0)+(1*1)+1 = 3. Prediction =
sign(3) = 1.
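As a quick sanity check on that computation, here is a minimal sketch (not part of the exam) of the right-most leaf's prediction. The splits on A and D, the leaf weights [1, 0, -1, 1] over the unused attributes B, C, E, F, and the bias of 1 are read off the worked answer above; since the tree figure is not reproduced here, treat them as assumptions.

import numpy as np

def predict_rightmost_leaf(x):
    # x = [A, B, C, D, E, F]; this sketch covers only the branch with A = 1 and D = 1.
    A, B, C, D, E, F = x
    w = np.array([1, 0, -1, 1])                    # assumed leaf weights over B, C, E, F
    activation = w @ np.array([B, C, E, F]) + 1    # assumed bias = 1
    return 1 if activation >= 0 else -1

print(predict_rightmost_leaf([1, 1, 0, 1, 0, 1]))  # activation = 3, so the prediction is 1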
(b) True or False: The decision boundary of a perceptron tree will always be linear.
True
False
False, since decision tree boundaries need not be linear.
(c) True or False: For small values of max depth, decision trees are more likely to
underfit the data than perceptron trees
True
False
True. For smaller values of max depth, decision trees essentially degenerate into
majority-vote classifiers at the leaves. On the other hand, perceptron trees have
the capacity to make use of “unused” attributes at the leaves to predict the correct
class. Decision trees contribute non-linear decision boundaries; perceptrons contribute
the ability to gracefully handle unseen attribute values in the training data, i.e.,
better generalization at the leaf nodes.
2. (2 points) Select all that apply: Given a set of input features x, where x ∈ Rn , you
are tasked with predicting a label for y, where y = 1 or y = −1. You have no knowledge
about the distribution of x and of y. Which of the following methods are appropriate?
□ Perceptron
□ k-Nearest Neighbors
□ Linear Regression
□ Decision Tree with unlimited depth
□ None of the Above
k-Nearest Neighbors and Decision Tree with unlimited depth, since these two methods
do not make the assumption of linear separability.
3. (2 points) The ID3 algorithm is a greedy algorithm for growing a decision tree, and it suffers
the same problem as any other greedy algorithm: it finds only locally optimal trees.
Which of the following method(s) can make ID3 "less greedy"? Select all that apply:
Use a subset of attributes to grow the decision tree
Use different subsets of attributes to grow many decision trees
Change the criterion for selecting attributes from information gain (mutual informa-
tion) to information gain ratio (mutual information divided by entropy of splitting
attributes) to avoid selecting attributes with high degree of randomness
Keep using mutual information, but select 2 attributes instead of one at each step,
and grow two separate subtrees. If there are more than 2 subtrees in total, keep
only the top 2 with the best performance (e.g., top 2 with lowest training errors at
the current step)
The 2nd and 4th choices. The 1st choice is wrong because the best performance will be
determined by the deepest tree; any shallower tree will make more mistakes, so this kind
of ensemble learning can only make performance worse, and it won't change the local
optimality of the forest.
4. [2 pts] The ID3 algorithm is guaranteed to find the optimal decision tree.
Circle one: True False
False.
5. [2 pts] One advantage of the decision tree algorithm is that it is not easy to overfit
compared to naive Bayes.
Circle one: True False
False.
2 K Nearest Neighbors
1. True or False: Consider a binary (two classes) classification problem using k-nearest
neighbors. We have n 1-dimensional training points {x1 , x2 , ..., xn } with xi ∈ R, and
their corresponding labels {y1 , y2 , ..., yn } with yi ∈ {0, 1}.
Assume the data points x1, x2, ..., xn are sorted in ascending order, we use Euclidean
distance as the distance metric, and a point can be its own neighbor. True or False:
We CAN build a decision tree (with decisions at each node of the form "x ≥ t" and
"x < t", for t ∈ R) that behaves exactly the same as the 1-nearest neighbor classifier
on this dataset.
True
False
True, we can build a decision tree by setting the internal nodes at the mid-points between
each pair of adjacent training points.
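A minimal sketch of that construction on made-up 1-D data (not from the exam): the internal-node thresholds are the midpoints between adjacent training points, and the resulting tree agrees with 1-NN everywhere, including at the midpoints if ties are broken consistently.

import numpy as np

x_train = np.array([0.0, 1.0, 3.0, 7.0])          # sorted 1-D training inputs (made up)
y_train = np.array([0, 1, 1, 0])
thresholds = (x_train[:-1] + x_train[1:]) / 2.0   # midpoints between adjacent points

def predict_tree(x):
    # Walking the tree amounts to finding the interval containing x; every point
    # in that interval shares the same nearest training point.
    return y_train[np.searchsorted(thresholds, x)]

def predict_1nn(x):
    return y_train[np.argmin(np.abs(x_train - x))]

assert all(predict_tree(x) == predict_1nn(x) for x in np.linspace(-1, 8, 50))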
2. Select all that apply: Please select all that apply about kNN in the following options:
Assume a point can be its own neighbor.
□ k-NN works great with a small amount of data, but struggles when the amount of data becomes large.
□ k-NN is sensitive to outliers; therefore, in general we decrease k to avoid overfitting.
□ k-NN can only be applied to classification problems, but it cannot be used to solve regression problems.
□ We can always achieve zero training error (perfect classification) with k-NN, but it may not generalize well in testing.
True: A, Curse of dimensionality; D, by setting k = 1
False: B, we increase k to avoid overfitting; C, KNN regression
3. (1 point) Select one: A k-Nearest Neighbor model with a large value of K is analogous
to...
A short Decision Tree with a low branching factor
A short Decision Tree with a high branching factor
A long Decision Tree with a low branching factor
A long Decision Tree with a high branching factor
A short Decision Tree with a low branching factor
4. (1 point) Select one. Imagine you are using a k-Nearest Neighbor classifier on a data
set with lots of noise. You want your classifier to be less sensitive to the noise. Which
is more likely to help and with what side-effect?
Increase the value of k => Increase in prediction time
Decrease the value of k => Increase in prediction time
Increase the value of k => Decrease in prediction time
Decrease the value of k => Decrease in prediction time
Increase the value of k => Increase in prediction time
5. (1 point) Select all that apply: Identify the correct relationship between bias, vari-
ance, and the hyperparameter k in the k-Nearest Neighbors algorithm:
□ Increasing k leads to increase in bias
□ Decreasing k leads to increase in bias
□ Increasing k leads to increase in variance
□ Decreasing k leads to increase in variance
A and D
6. Consider a training dataset for a regression task as follows.
h(x) = (1/k) Σ_{i ∈ N(x,D)} y^(i)
The red X’s denote training points and the black semi-circles A, B, C denote test points
of unknown output values. For convenience, all training data points have integer input
and output values.
Any ties are broken by selecting the point with the lower x value.
(a) With k = 1, what is the mean squared error on the training set?
0 ± 0.00001
(b) With k = 2, what is the predicted value at A?
4 ± 0.00001
(c) With k = 2, what is the predicted value at B?
5 ± 0.00001
(d) With k = 3, what is the predicted value at C?
7 ± 0.00001
(e) With k = 8, what is the predicted value at C?
5.375 ± 0.1
(f) With k = N , for any dataset D with the form specified in the beginning of this
question, write down a mathematical expression for the predicted value ŷ = h(x).
Your response shouldn’t include a reference to the neighborhood function N ().
ŷ = ȳ = (1/N) Σ_{i=1}^{N} y^(i)
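The sub-questions above depend on the (omitted) figure, but the prediction rule itself is easy to sketch. Below is a minimal k-NN regression implementation on made-up 1-D data, with ties broken toward the lower x value via a stable sort; with k = N it reduces to the overall mean, matching part (f).

import numpy as np

x_train = np.array([1.0, 2.0, 4.0, 5.0, 7.0, 8.0, 9.0, 10.0])   # made-up inputs
y_train = np.array([3.0, 4.0, 5.0, 6.0, 7.0, 6.0, 5.0, 7.0])    # made-up outputs

def knn_regress(x, k):
    dists = np.abs(x_train - x)
    nearest = np.argsort(dists, kind="stable")[:k]   # stable sort favors the lower x on ties
    return y_train[nearest].mean()

print(knn_regress(3.0, k=2))              # mean of the two nearest outputs (4.0 and 5.0)
print(knn_regress(3.0, k=len(x_train)))   # k = N: just the overall mean of y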
4. [1 pts] Say you plot the train and test errors as a function of the model complexity.
Which of the following two plots is your plot expected to look like?
[Figure: two plots, (a) and (b), of training and test error as a function of model complexity — not reproduced here]
B. When model complexity increases, the model can fit the training data more closely, so
training error will decrease. But when it overfits too much, test error will increase.
2. Training Sample Size: In this problem, we will consider the effect of training sample
size n on a logistic regression classifier with d features. The classifier is trained by opti-
mizing the conditional log-likelihood. The optimization procedure stops if the estimated
parameters perfectly classify the training data or they converge.
The following plot shows the general trend for how the training and testing error change
as we increase the sample size n = |S|. Your task in this question is to analyze this plot
and identify which curve corresponds to the training and test error. Specifically:
1. Which curve represents the training error? Please provide 1–2 sentences of
justification. Curve (ii) is the training error. Training error increases as the training
set grows in size (more points to account for), but the increase tapers off when the
model generalizes well. Curve (i) is therefore the test error, since larger training sets
yield better-generalizing models, which reduces test error.
2. In one word, what does the gap between the two curves represent? Overfitting
3. What are the effects of the following on overfitting? Choose the best answer.
(a) Increasing decision tree max depth.
Less likely to overfit
More likely to overfit
More likely to overfit
values of γ that we can include in our search if we also want to include 8 values of
ω? Assume that any computations other than training are negligible.
More data used for validation, giving a better estimate of performance on held-out
data.
4 Perceptron
1. Select all that apply: Let S = {(x^(1), y^(1)), ..., (x^(n), y^(n))} be n points in R^d that are
linearly separable by a separator through the origin. Let S′ be generated from S as S′ =
{(cx^(1), y^(1)), ..., (cx^(n), y^(n))}, where c > 1 is a constant. Suppose that we would like
to run the perceptron algorithm on both data sets separately, and that the perceptron
algorithm converges on S. Which of the following statements are true?
□ The mistake bound of perceptron on S′ is larger than the mistake bound on S
□ The perceptron algorithm when run on S and S′ returns the same classifier,
modulo constant factors (i.e., if w_S and w_{S′} are outputs of the perceptron for
S and S′, then w_S = c₁ w_{S′} for some constant c₁).
□ The perceptron algorithm converges on S′.
B and C are true. Simply follow the perceptron update rule and we see that the updates
on w_S and w_{S′} are identical up to the constant c. A is false: the margin from the points
to the decision hyperplane is scaled up by c, as is the radius of the data, so the mistake
bound is unchanged.
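For reference, the scaling argument in one line, using the standard perceptron mistake bound with data radius R = max_i ||x^(i)|| and margin γ (standard course material, not quoted from this exam):

\[
k \le \left(\frac{R}{\gamma}\right)^{2},
\qquad R' = cR,\ \gamma' = c\gamma
\;\Longrightarrow\;
\left(\frac{R'}{\gamma'}\right)^{2} = \left(\frac{cR}{c\gamma}\right)^{2} = \left(\frac{R}{\gamma}\right)^{2}.
\]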
2. True or False: We know that if the samples are linearly separable, the perceptron
algorithm finds a separating hyperplane in a finite number of steps. Given such a dataset
with linearly separable samples, select whether the following statement is True or False:
The running time of the perceptron algorithm depends on the sample size n.
True
False
False. For a linearly separable dataset, the runtime of the perceptron algorithm does
not depend on the size of the training data. The proof can be found on slide 34 of
http://www.cs.cmu.edu/~10701/slides/8_Perceptron.pdf
3. (1 point) Select all that apply: Which of the following are considered inductive
biases of the perceptron?
□ Assume that most of the cases in a small neighborhood in feature space belong
to the same class
□ Decision boundary should be linear
□ Prefer to correct the most recent mistakes
□ Prefer the smallest hypothesis that explains the data
BC
4. (1 point) True or False: If the training data is linearly separable and representative
of the true distribution, the perceptron algorithm always finds the optimal decision
boundary for the true distribution.
True
False
False.
5. (1 point) True or False: Consider two datasets D^(1) and D^(2), where
D^(1) = {(x_1^(1), y_1^(1)), ..., (x_n^(1), y_n^(1))} and D^(2) = {(x_1^(2), y_1^(2)), ..., (x_m^(2), y_m^(2))},
such that x_i^(1) ∈ R^{d_1} and x_i^(2) ∈ R^{d_2}. Suppose d_1 > d_2 and n > m. Then the maximum
number of mistakes the perceptron algorithm will make is always higher on dataset D^(1)
than on dataset D^(2).
True
False
False.
Example Number X1 X2 Y
1 -1 2 -1
2 -2 -2 +1
3 1 -1 +1
4 -3 1 -1
You wish to perform the Batch Perceptron algorithm on this data. Assume you start with
initial weights θ = [0, 0]ᵀ, bias b = 0, and that we pass all of our examples through in order
of their example number.
1. (1 point) Numerical answer: What would the updated weight vector θ be after we
pass example 1 through the perceptron algorithm?
[1, −2]
2. (1 point) Numerical answer: What would the updated bias b be after we pass
example 1 through the Perceptron algorithm?
−1
3. (1 point) Numerical answer: What would the updated weight vector θ be after we
pass example 2 through the Perceptron algorithm?
[1, −2]
4. (1 point) Numerical answer: What would the updated bias b be after we pass
example 2 through the Perceptron algorithm?
−1
5. (1 point) Numerical answer: What would the updated weight vector θ be after we
pass example 3 through the Perceptron algorithm?
[1, −2]
6. (1 point) Numerical answer: What would the updated bias b be after we pass
example 3 through the Perceptron algorithm?
−1
7. (1 point) True or False: Your friend stops you here and tells you that you do not need
to update the Perceptron weights or the bias anymore. Is this true or false?
True
False
True, all points are classified correctly.
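A minimal sketch (not part of the exam) of the single pass traced above, using the standard perceptron update w ← w + y·x, b ← b + y on a mistake. Treating sign(0) as a mistake is an assumption here, but it matches the worked answers.

import numpy as np

X = np.array([[-1, 2], [-2, -2], [1, -1], [-3, 1]], dtype=float)
y = np.array([-1, +1, +1, -1])

w = np.zeros(2)
b = 0.0
for x_i, y_i in zip(X, y):
    if y_i * (w @ x_i + b) <= 0:   # mistake (or on the boundary): update
        w = w + y_i * x_i
        b = b + y_i
    print(w, b)
# After example 1: w = [1, -2], b = -1; examples 2-4 are classified correctly, so no further updates.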
8. (2 points) True or False: Data (X,Y) has a non-linear decision boundary. Fortunately
there is a function F that maps (X,Y) to (F(X),Y) such that (F(X),Y) is linearly sep-
arable. We have tried to build a modified perceptron to classify (X,Y). Is the given
(modified) perceptron update rule correct?
if sign(w·F(x^(i)) + b) != y^(i):
    w′ = w + y^(i) F(x^(i))
    b′ = b + y^(i)
True
False
True
5 Linear Regression
1. (1 point) Select one: The closed form solution for linear regression is θ = (XᵀX)⁻¹Xᵀy.
Suppose you have n = 35 training examples and m = 5 features (excluding the bias
term). Once the bias term is included, what are the dimensions of X, y, θ in the
closed form equation?
X is 35 × 6, y is 35 × 1, θ is 6 × 1
X is 35 × 6, y is 35 × 6, θ is 6 × 6
X is 35 × 5, y is 35 × 1, θ is 5 × 1
X is 35 × 5, y is 35 × 5, θ is 5 × 5
A.
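A minimal dimension check of that answer on random placeholder data (assuming a bias column is prepended to X):

import numpy as np

n, m = 35, 5
X = np.hstack([np.ones((n, 1)), np.random.randn(n, m)])   # bias column + 5 features -> 35 x 6
y = np.random.randn(n, 1)

theta = np.linalg.solve(X.T @ X, X.T @ y)   # closed form (X^T X)^{-1} X^T y, solved stably
print(X.shape, y.shape, theta.shape)        # (35, 6) (35, 1) (6, 1)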
2. (1 point) (True or False) A multi-layer perceptron model with linear activation is equiv-
alent to a linear regression model.
True
False
True
3. Short answer: Assume we have data X ∈ R^{n×d} with labels y ∈ R^n. If the underlying
distribution of the data is y = Xβ* + ε, where ε ∼ N(0, I). Assuming the closed form
solution β̂ for mean squared error linear regression exists for this data, write out β̂'s
distribution:
β̂ = (XᵀX)⁻¹Xᵀy = β* + (XᵀX)⁻¹Xᵀε, so β̂ ∼ N(β*, (XᵀX)⁻¹Xᵀ I X (XᵀX)⁻¹) = N(β*, (XᵀX)⁻¹).
5. Please circle True or False for the following questions, providing brief explanations to
support your answer.
(i) [3 pts] Consider a linear regression model with only one parameter, the bias, i.e.,
y = β0 . Then given n data points (xi , yi ) (where xi is the feature and yi is the
output), minimizing the sum of squared errors results in β0 being the median of
the yi values.
Circle one: True False
Brief explanation:
False. The training cost is Σ_{i=1}^{n} (y_i − β₀)², which when differentiated and set to zero
gives β₀ = (1/n) Σ_{i=1}^{n} y_i, the mean of the y_i values.
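Writing that derivative check out in full (standard calculus, added here for completeness):

\[
\frac{d}{d\beta_0}\sum_{i=1}^{n}(y_i-\beta_0)^2
  = -2\sum_{i=1}^{n}(y_i-\beta_0) = 0
\;\Longrightarrow\;
\beta_0 = \frac{1}{n}\sum_{i=1}^{n} y_i .
\]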
(ii) [3 pts] Given data D = {(x1 , y1 ), ..., (xn , yn )}, we obtain ŵ, the parameters that
minimize the training error cost for the linear regression model y = wᵀx we learn
from D.
Consider a new dataset D_new generated by duplicating the points in D and adding
10 points that lie along y = ŵᵀx. Then the ŵ_new that we learn for y = wᵀx from
D_new is equal to ŵ.
Circle one: True False
Brief explanation:
True. The new squared error can be written as 2k + m, where k is the old squared
error. m = 0 for the 10 points that lie along the line, the lowest possible value for
m. And 2k is least when k is least, which is when the parameters don’t change.
6. Consider the regularized (ridge regression) cost function J(w) = (1/2)||Xw − y||₂² + (λ/2)||w||₂²,
where λ ≥ 0. It is possible to derive a closed form expression for the parameter vector
ŵ that minimizes this cost function. The gradient, using matrix notation, is
∇J(w) = [∂J(w)/∂w_1, ..., ∂J(w)/∂w_d]ᵀ = XᵀXw − Xᵀy + λw,
where X ∈ Rn×d is the matrix with the the training input instances on the rows (xi on
row i), and y ∈ Rn is the vector of training output instances.
(i) [2 pts.] What is the closed form expression for ŵ that we obtain by solving ∇J(w) =
0? Hint: use λw = λIw, where I is the identity matrix.
∇J(w) = 0 =⇒ ŵ = (XᵀX + λI)⁻¹Xᵀy.
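A minimal numerical sketch of that closed form on random placeholder data, checking it against the gradient expression above (the data and λ value are arbitrary):

import numpy as np

n, d, lam = 50, 4, 0.1
X = np.random.randn(n, d)
y = np.random.randn(n)

w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # ridge closed form

grad = X.T @ X @ w_hat - X.T @ y + lam * w_hat   # should vanish at the minimizer
print(np.allclose(grad, 0))                      # True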
(ii) [3 pts.] What is the meaning of the λ parameter? What kind of tradeoff between
training error and model complexity we have when λ approaches zero? What about
when λ goes to infinity?
When λ → 0, we prefer models that achieve the smallest possible training error,
irrespective of their complexity (squared ℓ2 norm). When λ → +∞, we prefer models
that are the simplest possible (i.e., that have a small norm). We expect ŵ to approach
the zero vector in this case.
(iii) [4 pts.] Answer true or false to each of the following questions and provide a brief
justification for your answer:
• [1 pt.] T or F: In ridge regression, when solving for the linear regressor that
minimizes the training cost function, it is always preferable to use the closed
form expression rather than using an iterative method like gradient descent,
even when the number of parameters is very high.
False. Using the closed form expression to solve for ŵ requires O(d3 ) operations
(the cost of solving the linear system or computing the inverse), while using a
gradient descent approach requires O(d2 ) (the cost of matrix multiplication) per
step. If the number of parameters is very high, it may not be computationally
feasible to use the closed form expression. Also, if the required precision for
the solution ŵ is moderate, as is typically the case in machine learning
applications, gradient descent may be preferable.
• [1 pt.] T or F: Using a non-linear feature map is never useful as all regression
problems have linear input to output relations.
False. We very rarely believe that the true regression function is linear in the
initial input space. Many times, the choice of a linear model is done out of
mathematical convenience.
• [1 pt.] T or F: In linear regression, minimizing a tradeoff between training
error and model complexity, usually allows us to obtain a model with lower
test error than just minimizing training error.
True. Not using regularization leads us to choose more complex models, which
have higher variance. In practice it is typically better to optimize a tradeoff
between data fitting and model complexity. This is especially true when the
ratio between the size of the training set and the number of parameters is not
very big.
• [1 pt.] T or F: When λ = 0, we recover ordinary least squares (OLS). If n < d,
then X T X does not have an inverse and the estimator ŵ is not well-defined.
True. This is a problem with the ordinary least squares estimator for the case
where the number of parameters is bigger than the number of data points.
7. (a) Which of the following are valid expressions for the mean squared error objective
function for linear regression with dataset D = {(x^(i), y^(i))}_{i=1}^{N}, with each x^(i) ∈ R^M
and the design matrix X ∈ R^{N×(M+1)}? y and θ are column vectors.
Select all that apply:
□ J(θ) = (1/N) ||y − θX||₂²
□ J(θ) = (1/N) ||yᵀ − θX||₂²
□ J(θ) = (1/N) ||yᵀ − Xθ||₂²
□ J(θ) = (1/N) ||Xθ − y||₂²
□ None of the Above
J(θ) = (1/N) ||Xθ − y||₂²
(b) Your friend accidentally solved linear regression using the wrong objective function
for mean squared error; specifically, they used the following objective function that
contains two mistakes: 1) they forgot the 1/N and 2) they have one sign error.
J(w, b) = Σ_{i=1}^{N} ( y^(i) − ( Σ_{j=1}^{M} w_j x_j^(i) − b ) )²
You realize that you can still use the parameters that they learned, w and b, to
correctly predict y given x.
Write the equation that implements this corrected prediction function h(x, w, b)
using your friend’s learned parameters, w and b. The w and x vectors are column
vectors.
h(x, w, b) = wᵀx − b
(c) We have 2 data points:
x^(1) = [2, 1]ᵀ, y^(1) = 7
We know that for linear regression with a bias/intercept term and mean squared
error, there are infinite solutions with these two points.
Give a specific third point (x^(3), y^(3)) such that, when it is included with the first two,
linear regression will still have infinite solutions. Your x^(3) should not equal
x^(1) or x^(2) and your y^(3) should not equal y^(1) or y^(2).
x_1^(3) = ____   x_2^(3) = ____   y^(3) = ____
Any x^(3) that is collinear with the first two x's; the value of y^(3) doesn't matter.
After adding your third point, if we then double the output of just the first point
such that now y (1) = 14, will this change the number of solutions for linear regres-
sion?
Yes
No
No
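A minimal sketch of the "infinite solutions" claim: when all inputs are collinear, the design matrix (with a bias column) is rank-deficient, so X^T X is singular and the least-squares minimizer is not unique. The points below are made up (the second exam point is not shown in this excerpt); they simply all lie on one line.

import numpy as np

X_raw = np.array([[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]])   # collinear made-up inputs
y = np.array([7.0, 3.0, 5.0])
X = np.hstack([np.ones((3, 1)), X_raw])                   # add the bias column

print(np.linalg.matrix_rank(X.T @ X))          # 2 < 3, so X^T X is not invertible
theta1, *_ = np.linalg.lstsq(X, y, rcond=None) # one least-squares solution
theta2 = theta1 + np.array([0.0, 1.0, -2.0])   # shift along the null space of X
print(np.allclose(X @ theta1, X @ theta2))     # True: same predictions, same error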
8. Given that we have an input x and we want to estimate an output y, in linear regression
we assume the relationship between them is of the form y = wx + b + ε, where w and
b are real-valued parameters we estimate and ε represents the noise in the data. When
the noise is Gaussian, maximizing the likelihood of a dataset S = {(x1 , y1 ), . . . , (xn , yn )}
to estimate the parameters w and b is equivalent to minimizing the squared error:
arg min_{w,b} Σ_{i=1}^{n} (y_i − (wx_i + b))².
Consider the dataset S plotted in Fig. 3 along with its associated regression line. For each
of the altered data sets S_new plotted in Fig. 5, indicate which regression line (relative to
the original one) in Fig. 4 corresponds to the regression line for the new data set. Write
your answers in the table below.
(a) Adding one outlier to the original data set.
(b) Adding two outliers to the original data set.
(c) Adding three outliers to the original data set, two on one side and one on the other side.
(d) Duplicating the original data set.
(e) Duplicating the original data set and adding four points that lie on the trajectory of the original regression line.
6 Optimization
1. Select all that apply: Which of the following are correct regarding Gradient Descent
(GD) and Stochastic Gradient Descent (SGD)?
□ Each update step in SGD pushes the parameter vector closer to the parameter
vector that minimizes the objective function.
□ The gradient computed in SGD is, in expectation, equal to the gradient computed
in GD.
□ The gradient computed in GD has a higher variance than that computed in SGD,
which is why in practice SGD converges faster in time than GD.
B.
A is incorrect: SGD updates have high variance and may not move in the direction of the
true gradient. C is incorrect for the same reason. D is incorrect since they can converge
if the function is convex, not just strongly convex.
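A minimal sketch (not from the exam) contrasting one GD step with one SGD step on the mean squared error J(θ) = (1/N)||Xθ − y||²; the single-example gradient used by SGD equals the full gradient in expectation. Data are random placeholders.

import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
theta = np.zeros(d)
lr = 0.01

def full_gradient(theta):
    return (2.0 / N) * X.T @ (X @ theta - y)

def stochastic_gradient(theta):
    i = rng.integers(N)                          # one uniformly sampled example
    return 2.0 * X[i] * (X[i] @ theta - y[i])

theta_gd = theta - lr * full_gradient(theta)          # one GD step
theta_sgd = theta - lr * stochastic_gradient(theta)   # one noisier SGD step

# Averaging many stochastic gradients gets close to the full gradient.
avg = np.mean([stochastic_gradient(theta) for _ in range(20000)], axis=0)
print(np.round(avg, 2), np.round(full_gradient(theta), 2))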
2. (a) Determine if the following 1-D functions are convex. Assume that the domain of
each function is R. The definition of a convex function is as follows:
f (x) is convex ⇐⇒ f (αx + (1 − α)z) ≤ αf (x) + (1 − α)f (z), ∀α ∈ [0, 1] and ∀x, z.
□ α = 1
□ α = 2
Give the range of all values for α ≥ 0 such that lim_{t→∞} f(z^(t)) = 0, assuming the
initial value of z is z^(0) = 1. Be specific.
(0, 1).