Exam 1 Practice Solutions
Instructions:
• Fill in your name and Andrew ID above. Be sure to write neatly, or you may not
receive credit for your exam.
• Clearly mark your answers in the allocated space on the front of each page. If
needed, use the back of a page for scratch space, but you will not get credit for anything
written on the back of a page. If you have made a mistake, cross out the invalid parts
of your solution, and circle the ones which should be graded.
• No electronic devices may be used during the exam.
• Please write all answers in pen.
• You have N/A to complete the exam. Good luck!
For "Select One" questions, please fill in the appropriate bubble completely:
Select One: Who taught this course?
● Matt Gormley
○ Marie Curie
○ Noam Chomsky
If you need to change your answer, you may cross out the previous answer and bubble in
the new answer:
Select One: Who taught this course?
⊗ Matt Gormley
○ Marie Curie
● Noam Chomsky
For “Select all that apply” questions, please fill in all appropriate squares completely:
Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
□ I don't know
Again, if you need to change your answer, you may cross out the previous answer(s) and
bubble in the new answer(s):
Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
⊠ I don't know
For questions where you must fill in a blank, please make sure your final answer is fully
included in the given space. You may cross out answers or parts of answers, but the final
answer must still be within the given space.
Fill in the blank: What is the course number?
10-601
1 Decision Trees
1. Perceptron Trees: To exploit the desirable properties of decision tree classifiers
and perceptrons, Adam came up with a new algorithm called “perceptron trees”, which
combines features from both. Perceptron trees are similar to decision trees; however,
each leaf node holds a perceptron instead of a majority vote.
To create a perceptron tree, the first step is to follow a regular decision tree learning
algorithm (such as ID3) and perform splitting on attributes until the specified maximum
depth is reached. Once maximum depth has been reached, at each leaf node, a perceptron
is trained on the remaining attributes which have not been used up in that branch.
Classification of a new example is done via a similar procedure. The example is first
passed through the decision tree based on its attribute values. When it reaches a leaf
node, the final prediction is made by running the corresponding perceptron at that node.
Assume that you have a dataset with 6 binary attributes (A, B, C, D, E, F) and
two output labels (-1 and 1). A perceptron tree of depth 2 on this dataset is given
below. Weights of the perceptrons are given in the leaf nodes. Assume bias = 1 for each
perceptron:
(a) Numerical answer: Given a sample x = [1, 1, 0, 1, 0, 1], predict the output label
for this sample
1. Explanation: A = 1 and D = 1, so the point is sent to the right-most leaf node,
where the perceptron output is (1*1)+(0*0)+((-1)*0)+(1*1)+1 = 3. Prediction =
sign(3) = 1.
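As a quick sanity check on that computation, here is a minimal sketch (not part of the exam) of the right-most leaf's prediction. The splits on A and D, the leaf weights [1, 0, -1, 1] over the unused attributes B, C, E, F, and the bias of 1 are read off the worked answer above; since the tree figure is not reproduced here, treat them as assumptions.

import numpy as np

def predict_rightmost_leaf(x):
    # x = [A, B, C, D, E, F]; this sketch covers only the branch with A = 1 and D = 1.
    A, B, C, D, E, F = x
    w = np.array([1, 0, -1, 1])                    # assumed leaf weights over B, C, E, F
    activation = w @ np.array([B, C, E, F]) + 1    # assumed bias = 1
    return 1 if activation >= 0 else -1

print(predict_rightmost_leaf([1, 1, 0, 1, 0, 1]))  # activation = 3, so the prediction is 1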
(b) True or False: The decision boundary of a perceptron tree will always be linear.
True
False
False, since decision tree boundaries need not be linear.
(c) True or False: For small values of max depth, decision trees are more likely to
underfit the data than perceptron trees
True
False
True. For smaller values of max depth, decision trees essentially degenerate into
majority-vote classifiers at the leaves. On the other hand, perceptron trees have
the capacity to make use of “unused” attributes at the leaves to predict the correct
class. Decision trees contribute non-linear decision boundaries; perceptrons contribute
the ability to gracefully handle unseen attribute values in the training data, i.e.,
better generalization at the leaf nodes.
2. (2 points) Select all that apply: Given a set of input features x, where x ∈ Rn , you
are tasked with predicting a label for y, where y = 1 or y = −1. You have no knowledge
about the distribution of x and of y. Which of the following methods are appropriate?
□ Perceptron
□ k-Nearest Neighbors
□ Linear Regression
□ Decision Tree with unlimited depth
□ None of the Above
k-Nearest Neighbors and Decision Tree with unlimited depth, since these two methods
do not make the assumption of linear separability.
3. (2 points) The ID3 algorithm is a greedy algorithm for growing a decision tree, and it suffers
the same problem as any other greedy algorithm: it finds only locally optimal trees.
Which of the following method(s) can make ID3 "less greedy"? Select all that apply:
Use a subset of attributes to grow the decision tree
Use different subsets of attributes to grow many decision trees
Change the criterion for selecting attributes from information gain (mutual informa-
tion) to information gain ratio (mutual information divided by entropy of splitting
attributes) to avoid selecting attributes with high degree of randomness
Keep using mutual information, but select 2 attributes instead of one at each step,
and grow two separate subtrees. If there are more than 2 subtrees in total, keep
only the top 2 with the best performance (e.g., top 2 with lowest training errors at
the current step)
The 2nd and 4th choices. The 1st choice is wrong because the best performance will be
determined by the deepest tree; any shallower tree will make more mistakes, so this kind
of ensemble learning can only make performance worse, and it won't change the local
optimality of the forest.
4. [2 pts] The ID3 algorithm is guaranteed to find the optimal decision tree.
Circle one: True False
False.
5. [2 pts] One advantage of the decision tree algorithm is that it is not easy to overfit
compared to naive Bayes.
Circle one: True False
False.
2 K Nearest Neighbors
1. True or False: Consider a binary (two classes) classification problem using k-nearest
neighbors. We have n 1-dimensional training points {x1 , x2 , ..., xn } with xi ∈ R, and
their corresponding labels {y1 , y2 , ..., yn } with yi ∈ {0, 1}.
Assume the data points x1, x2, ..., xn are sorted in ascending order, we use Euclidean
distance as the distance metric, and a point can be its own neighbor. True or False:
We CAN build a decision tree (with decisions at each node of the form "x ≥ t" and
"x < t", for t ∈ R) that behaves exactly the same as the 1-nearest neighbor classifier
on this dataset.
True
False
True, we can build a decision tree by setting the internal nodes at the mid-points between
each pair of adjacent training points.
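A minimal sketch of that construction on made-up 1-D data (not from the exam): the internal-node thresholds are the midpoints between adjacent training points, and the resulting tree agrees with 1-NN everywhere, including at the midpoints if ties are broken consistently.

import numpy as np

x_train = np.array([0.0, 1.0, 3.0, 7.0])          # sorted 1-D training inputs (made up)
y_train = np.array([0, 1, 1, 0])
thresholds = (x_train[:-1] + x_train[1:]) / 2.0   # midpoints between adjacent points

def predict_tree(x):
    # Walking the tree amounts to finding the interval containing x; every point
    # in that interval shares the same nearest training point.
    return y_train[np.searchsorted(thresholds, x)]

def predict_1nn(x):
    return y_train[np.argmin(np.abs(x_train - x))]

assert all(predict_tree(x) == predict_1nn(x) for x in np.linspace(-1, 8, 50))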
2. Select all that apply: Please select all that apply about kNN in the following options:
Assume a point can be its own neighbor.
□ k-NN works great with a small amount of data, but struggles when the amount of data becomes large.
□ k-NN is sensitive to outliers; therefore, in general we decrease k to avoid overfitting.
□ k-NN can only be applied to classification problems, but it cannot be used to solve regression problems.
□ We can always achieve zero training error (perfect classification) with k-NN, but it may not generalize well in testing.
True: A, Curse of dimensionality; D, by setting k = 1
False: B, we increase k to avoid overfitting; C, KNN regression
3. (1 point) Select one: A k-Nearest Neighbor model with a large value of K is analogous
to...
A short Decision Tree with a low branching factor
A short Decision Tree with a high branching factor
A long Decision Tree with a low branching factor
A long Decision Tree with a high branching factor
A short Decision Tree with a low branching factor
4. (1 point) Select one. Imagine you are using a k-Nearest Neighbor classifier on a data
set with lots of noise. You want your classifier to be less sensitive to the noise. Which
is more likely to help and with what side-effect?
Increase the value of k => Increase in prediction time
Decrease the value of k => Increase in prediction time
Increase the value of k => Decrease in prediction time
Decrease the value of k => Decrease in prediction time
Increase the value of k => Increase in prediction time
5. (1 point) Select all that apply: Identify the correct relationship between bias, vari-
ance, and the hyperparameter k in the k-Nearest Neighbors algorithm:
□ Increasing k leads to increase in bias
□ Decreasing k leads to increase in bias
□ Increasing k leads to increase in variance
□ Decreasing k leads to increase in variance
A and D
6. Consider a training dataset for a regression task as follows.
h(x) = (1/k) Σ_{i ∈ N(x,D)} y^(i)
The red X’s denote training points and the black semi-circles A, B, C denote test points
of unknown output values. For convenience, all training data points have integer input
and output values.
Any ties are broken by selecting the point with the lower x value.
(a) With k = 1, what is the mean squared error on the training set?
0 ± 0.00001
(b) With k = 2, what is the predicted value at A?
4 ± 0.00001
(c) With k = 2, what is the predicted value at B?
5 ± 0.00001
(d) With k = 3, what is the predicted value at C?
7 ± 0.00001
(e) With k = 8, what is the predicted value at C?
5.375 ± 0.1
(f) With k = N , for any dataset D with the form specified in the beginning of this
question, write down a mathematical expression for the predicted value ŷ = h(x).
Your response shouldn’t include a reference to the neighborhood function N ().
ŷ = ȳ = (1/N) Σ_{i=1}^{N} y^(i)
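The sub-questions above depend on the (omitted) figure, but the prediction rule itself is easy to sketch. Below is a minimal k-NN regression implementation on made-up 1-D data, with ties broken toward the lower x value via a stable sort; with k = N it reduces to the overall mean, matching part (f).

import numpy as np

x_train = np.array([1.0, 2.0, 4.0, 5.0, 7.0, 8.0, 9.0, 10.0])   # made-up inputs
y_train = np.array([3.0, 4.0, 5.0, 6.0, 7.0, 6.0, 5.0, 7.0])    # made-up outputs

def knn_regress(x, k):
    dists = np.abs(x_train - x)
    nearest = np.argsort(dists, kind="stable")[:k]   # stable sort favors the lower x on ties
    return y_train[nearest].mean()

print(knn_regress(3.0, k=2))              # mean of the two nearest outputs (4.0 and 5.0)
print(knn_regress(3.0, k=len(x_train)))   # k = N: just the overall mean of y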
4. [1 pts] Say you plot the train and test errors as a function of the model complexity.
Which of the following two plots is your plot expected to look like?
[Figure: two plots, (a) and (b), of training and test error as a function of model complexity — not reproduced here]
B. When model complexity increases, the model can fit the training data more closely, so
training error will decrease. But when it overfits too much, test error will increase.
2. Training Sample Size: In this problem, we will consider the effect of training sample
size n on a logistic regression classifier with d features. The classifier is trained by opti-
mizing the conditional log-likelihood. The optimization procedure stops if the estimated
parameters perfectly classify the training data or they converge.
The following plot shows the general trend for how the training and testing error change
as we increase the sample size n = |S|. Your task in this question is to analyze this plot
and identify which curve corresponds to the training and test error. Specifically:
1. Which curve represents the training error? Please provide 1–2 sentences of
justification. Curve (ii) is the training error. Training error increases as the training
set grows in size (more points to account for), but the increase tapers off when the
model generalizes well. Curve (i) is therefore the test error, since larger training sets
yield better-generalizing models, which reduces test error.
2. In one word, what does the gap between the two curves represent? Overfitting
3. What are the effects of the following on overfitting? Choose the best answer.
(a) Increasing decision tree max depth.
Less likely to overfit
More likely to overfit
More likely to overfit
values of γ that we can include in our search if we also want to include 8 values of
ω? Assume that any computations other than training are negligible.
More data used for validation, giving a better estimate of performance on held-out
data.
4 Perceptron
1. Select all that apply: Let S = {(x^(1), y^(1)), ..., (x^(n), y^(n))} be n points in R^d that are
linearly separable by a separator through the origin. Let S′ be generated from S as S′ =
{(cx^(1), y^(1)), ..., (cx^(n), y^(n))}, where c > 1 is a constant. Suppose that we would like
to run the perceptron algorithm on both data sets separately, and that the perceptron
algorithm converges on S. Which of the following statements are true?
□ The mistake bound of perceptron on S′ is larger than the mistake bound on S
□ The perceptron algorithm when run on S and S′ returns the same classifier,
modulo constant factors (i.e., if w_S and w_{S′} are outputs of the perceptron for
S and S′, then w_S = c₁ w_{S′} for some constant c₁).
□ The perceptron algorithm converges on S′.
B and C are true. Simply follow the perceptron update rule and we see that the updates
on w_S and w_{S′} are identical up to the constant c. A is false: the margin from the points
to the decision hyperplane is scaled up by c, as is the radius of the data, so the mistake
bound is unchanged.
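For reference, the scaling argument in one line, using the standard perceptron mistake bound with data radius R = max_i ||x^(i)|| and margin γ (standard course material, not quoted from this exam):

\[
k \le \left(\frac{R}{\gamma}\right)^{2},
\qquad R' = cR,\ \gamma' = c\gamma
\;\Longrightarrow\;
\left(\frac{R'}{\gamma'}\right)^{2} = \left(\frac{cR}{c\gamma}\right)^{2} = \left(\frac{R}{\gamma}\right)^{2}.
\]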
2. True or False: We know that if the samples are linearly separable, the perceptron
algorithm finds a separating hyperplane in a finite number of steps. Given such a dataset
with linearly separable samples, select whether the following statement is True or False:
The running time of the perceptron algorithm depends on the sample size n.
True
False
False. For a linearly separable dataset, the runtime of the perceptron algorithm does
not depend on the size of the training data. The proof can be found on slide 34 of
http://www.cs.cmu.edu/~10701/slides/8_Perceptron.pdf
3. (1 point) Select all that apply: Which of the following are considered inductive
biases of the perceptron?
□ Assume that most of the cases in a small neighborhood in feature space belong
to the same class
□ Decision boundary should be linear
□ Prefer to correct the most recent mistakes
□ Prefer the smallest hypothesis that explains the data
BC
4. (1 point) True or False: If the training data is linearly separable and representative
of the true distribution, the perceptron algorithm always finds the optimal decision
boundary for the true distribution.
True
False
False.
5. (1 point) True or False: Consider two datasets D^(1) and D^(2), where
D^(1) = {(x_1^(1), y_1^(1)), ..., (x_n^(1), y_n^(1))} and D^(2) = {(x_1^(2), y_1^(2)), ..., (x_m^(2), y_m^(2))},
such that x_i^(1) ∈ R^{d_1} and x_i^(2) ∈ R^{d_2}. Suppose d_1 > d_2 and n > m. Then the maximum
number of mistakes the perceptron algorithm will make is always higher on dataset D^(1)
than on dataset D^(2).
True
False
False.
Example Number X1 X2 Y
1 -1 2 -1
2 -2 -2 +1
3 1 -1 +1
4 -3 1 -1
You wish to perform the Batch Perceptron algorithm on this data. Assume you start with
initial weights θ = [0, 0]ᵀ, bias b = 0, and that we pass all of our examples through in order
of their example number.
1. (1 point) Numerical answer: What would the updated weight vector θ be after we
pass example 1 through the perceptron algorithm?
[1, −2]
2. (1 point) Numerical answer: What would the updated bias b be after we pass
example 1 through the Perceptron algorithm?
−1
3. (1 point) Numerical answer: What would the updated weight vector θ be after we
pass example 2 through the Perceptron algorithm?
[1, −2]
4. (1 point) Numerical answer: What would the updated bias b be after we pass
example 2 through the Perceptron algorithm?
−1
5. (1 point) Numerical answer: What would the updated weight vector θ be after we
pass example 3 through the Perceptron algorithm?
[1, −2]
6. (1 point) Numerical answer: What would the updated bias b be after we pass
example 3 through the Perceptron algorithm?
−1
7. (1 point) True or False: Your friend stops you here and tells you that you do not need
to update the Perceptron weights or the bias anymore. Is this true or false?
True
False
True, all points are classified correctly.
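A minimal sketch (not part of the exam) of the single pass traced above, using the standard perceptron update w ← w + y·x, b ← b + y on a mistake. Treating sign(0) as a mistake is an assumption here, but it matches the worked answers.

import numpy as np

X = np.array([[-1, 2], [-2, -2], [1, -1], [-3, 1]], dtype=float)
y = np.array([-1, +1, +1, -1])

w = np.zeros(2)
b = 0.0
for x_i, y_i in zip(X, y):
    if y_i * (w @ x_i + b) <= 0:   # mistake (or on the boundary): update
        w = w + y_i * x_i
        b = b + y_i
    print(w, b)
# After example 1: w = [1, -2], b = -1; examples 2-4 are classified correctly, so no further updates.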
8. (2 points) True or False: Data (X,Y) has a non-linear decision boundary. Fortunately
there is a function F that maps (X,Y) to (F(X),Y) such that (F(X),Y) is linearly sep-
arable. We have tried to build a modified perceptron to classify (X,Y). Is the given
(modified) perceptron update rule correct?
if sign(w·F(x^(i)) + b) != y^(i):
    w′ = w + y^(i) F(x^(i))
    b′ = b + y^(i)
True
False
True
5 Linear Regression
1. (1 point) Select one: The closed form solution for linear regression is θ = (XᵀX)⁻¹Xᵀy.
Suppose you have n = 35 training examples and m = 5 features (excluding the bias
term). Once the bias term is included, what are the dimensions of X, y, θ in the
closed form equation?
X is 35 × 6, y is 35 × 1, θ is 6 × 1
X is 35 × 6, y is 35 × 6, θ is 6 × 6
X is 35 × 5, y is 35 × 1, θ is 5 × 1
X is 35 × 5, y is 35 × 5, θ is 5 × 5
A.
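A minimal dimension check of that answer on random placeholder data (assuming a bias column is prepended to X):

import numpy as np

n, m = 35, 5
X = np.hstack([np.ones((n, 1)), np.random.randn(n, m)])   # bias column + 5 features -> 35 x 6
y = np.random.randn(n, 1)

theta = np.linalg.solve(X.T @ X, X.T @ y)   # closed form (X^T X)^{-1} X^T y, solved stably
print(X.shape, y.shape, theta.shape)        # (35, 6) (35, 1) (6, 1)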
2. (1 point) (True or False) A multi-layer perceptron model with linear activation is equiv-
alent to a linear regression model.
True
False
True
3. Short answer: Assume we have data X ∈ R^{n×d} with labels y ∈ R^n. If the underlying
distribution of the data is y = Xβ* + ε, where ε ∼ N(0, I). Assuming the closed form
solution β̂ for mean squared error linear regression exists for this data, write out β̂'s
distribution:
β̂ = (XᵀX)⁻¹Xᵀy = β* + (XᵀX)⁻¹Xᵀε, so β̂ ∼ N(β*, (XᵀX)⁻¹Xᵀ I X (XᵀX)⁻¹) = N(β*, (XᵀX)⁻¹).
5. Please circle True or False for the following questions, providing brief explanations to
support your answer.
(i) [3 pts] Consider a linear regression model with only one parameter, the bias, i.e.,
y = β0 . Then given n data points (xi , yi ) (where xi is the feature and yi is the
output), minimizing the sum of squared errors results in β0 being the median of
the yi values.
Circle one: True False
Brief explanation:
False. The training cost is Σ_{i=1}^{n} (y_i − β₀)², which when differentiated and set to zero
gives β₀ = (1/n) Σ_{i=1}^{n} y_i, the mean of the y_i values.
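Writing that derivative check out in full (standard calculus, added here for completeness):

\[
\frac{d}{d\beta_0}\sum_{i=1}^{n}(y_i-\beta_0)^2
  = -2\sum_{i=1}^{n}(y_i-\beta_0) = 0
\;\Longrightarrow\;
\beta_0 = \frac{1}{n}\sum_{i=1}^{n} y_i .
\]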
(ii) [3 pts] Given data D = {(x1 , y1 ), ..., (xn , yn )}, we obtain ŵ, the parameters that
minimize the training error cost for the linear regression model y = wᵀx we learn
from D.
Consider a new dataset D_new generated by duplicating the points in D and adding
10 points that lie along y = ŵᵀx. Then the ŵ_new that we learn for y = wᵀx from
D_new is equal to ŵ.
Circle one: True False
Brief explanation:
True. The new squared error can be written as 2k + m, where k is the old squared
error. m = 0 for the 10 points that lie along the line, the lowest possible value for
m. And 2k is least when k is least, which is when the parameters don’t change.
6. Consider the regularized (ridge regression) cost function J(w) = (1/2)||Xw − y||₂² + (λ/2)||w||₂²,
where λ ≥ 0. It is possible to derive a closed form expression for the parameter vector
ŵ that minimizes this cost function. The gradient, using matrix notation, is
∇J(w) = [∂J(w)/∂w_1, ..., ∂J(w)/∂w_d]ᵀ = XᵀXw − Xᵀy + λw,
where X ∈ Rn×d is the matrix with the the training input instances on the rows (xi on
row i), and y ∈ Rn is the vector of training output instances.
(i) [2 pts.] What is the closed form expression for ŵ that we obtain by solving ∇J(w) =
0? Hint: use λw = λIw, where I is the identity matrix.
∇J(w) = 0 =⇒ ŵ = (XᵀX + λI)⁻¹Xᵀy.
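A minimal numerical sketch of that closed form on random placeholder data, checking it against the gradient expression above (the data and λ value are arbitrary):

import numpy as np

n, d, lam = 50, 4, 0.1
X = np.random.randn(n, d)
y = np.random.randn(n)

w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # ridge closed form

grad = X.T @ X @ w_hat - X.T @ y + lam * w_hat   # should vanish at the minimizer
print(np.allclose(grad, 0))                      # True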
(ii) [3 pts.] What is the meaning of the λ parameter? What kind of tradeoff between
training error and model complexity we have when λ approaches zero? What about
when λ goes to infinity?
When λ → 0, we prefer models that achieve the smallest possible training error,
irrespective of their complexity (squared ℓ2 norm). When λ → +∞, we prefer models
that are the simplest possible (i.e., that have a small norm). We expect ŵ to approach
the zero vector in this case.
(iii) [4 pts.] Answer true or false to each of the following questions and provide a brief
justification for your answer:
• [1 pt.] T or F: In ridge regression, when solving for the linear regressor that
minimizes the training cost function, it is always preferable to use the closed
form expression rather than using an iterative method like gradient descent,
even when the number of parameters is very high.
False. Using the closed form expression to solve for ŵ requires O(d3 ) operations
(the cost of solving the linear system or computing the inverse), while using a
gradient descent approach requires O(d2 ) (the cost of matrix multiplication) per
step. If the number of parameters is very high, it may not be computationally
feasible to use the closed form expression. Also, if the required precision for
the solution ŵ is moderate, as is typically the case in machine learning
applications, gradient descent may be preferable.
• [1 pt.] T or F: Using a non-linear feature map is never useful as all regression
problems have linear input to output relations.
False. We very rarely believe that the true regression function is linear in the
initial input space. Many times, the choice of a linear model is done out of
mathematical convenience.
• [1 pt.] T or F: In linear regression, minimizing a tradeoff between training
error and model complexity, usually allows us to obtain a model with lower
test error than just minimizing training error.
True. Not using regularization leads us to choose more complex models, which
have higher variance. In practice it is typically better to optimize a tradeoff
between data fitting and model complexity. This is especially true when the
ratio between the size of the training set and the number of parameters is not
very big.
• [1 pt.] T or F: When λ = 0, we recover ordinary least squares (OLS). If n < d,
then X T X does not have an inverse and the estimator ŵ is not well-defined.
True. This is a problem with the ordinary least squares estimator for the case
where the number of parameters is bigger than the number of data points.
7. (a) Which of the following are valid expressions for the mean squared error objective
function for linear regression with dataset D = {(x^(i), y^(i))}_{i=1}^{N}, with each x^(i) ∈ R^M
and the design matrix X ∈ R^{N×(M+1)}? y and θ are column vectors.
Select all that apply:
□ J(θ) = (1/N) ||y − θX||₂²
□ J(θ) = (1/N) ||yᵀ − θX||₂²
□ J(θ) = (1/N) ||yᵀ − Xθ||₂²
□ J(θ) = (1/N) ||Xθ − y||₂²
□ None of the Above
J(θ) = (1/N) ||Xθ − y||₂²
(b) Your friend accidentally solved linear regression using the wrong objective function
for mean squared error; specifically, they used the following objective function that
contains two mistakes: 1) they forgot the 1/N and 2) they have one sign error.
J(w, b) = Σ_{i=1}^{N} ( y^(i) − ( Σ_{j=1}^{M} w_j x_j^(i) − b ) )²
You realize that you can still use the parameters that they learned, w and b, to
correctly predict y given x.
Write the equation that implements this corrected prediction function h(x, w, b)
using your friend’s learned parameters, w and b. The w and x vectors are column
vectors.
h(x, w, b) = wᵀx − b
(c) We have 2 data points:
x^(1) = [2, 1]ᵀ, y^(1) = 7
We know that for linear regression with a bias/intercept term and mean squared
error, there are infinite solutions with these two points.
Give a specific third point (x^(3), y^(3)) such that, when it is included with the first two,
linear regression will still have infinite solutions. Your x^(3) should not equal
x^(1) or x^(2) and your y^(3) should not equal y^(1) or y^(2).
x_1^(3) = ____   x_2^(3) = ____   y^(3) = ____
Any x^(3) that is collinear with the first two x's; the value of y^(3) doesn't matter.
After adding your third point, if we then double the output of just the first point
such that now y (1) = 14, will this change the number of solutions for linear regres-
sion?
Yes
No
No
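A minimal sketch of the "infinite solutions" claim: when all inputs are collinear, the design matrix (with a bias column) is rank-deficient, so X^T X is singular and the least-squares minimizer is not unique. The points below are made up (the second exam point is not shown in this excerpt); they simply all lie on one line.

import numpy as np

X_raw = np.array([[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]])   # collinear made-up inputs
y = np.array([7.0, 3.0, 5.0])
X = np.hstack([np.ones((3, 1)), X_raw])                   # add the bias column

print(np.linalg.matrix_rank(X.T @ X))          # 2 < 3, so X^T X is not invertible
theta1, *_ = np.linalg.lstsq(X, y, rcond=None) # one least-squares solution
theta2 = theta1 + np.array([0.0, 1.0, -2.0])   # shift along the null space of X
print(np.allclose(X @ theta1, X @ theta2))     # True: same predictions, same error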
8. Given that we have an input x and we want to estimate an output y, in linear regression
we assume the relationship between them is of the form y = wx + b + ε, where w and
b are real-valued parameters we estimate and ε represents the noise in the data. When
the noise is Gaussian, maximizing the likelihood of a dataset S = {(x1 , y1 ), . . . , (xn , yn )}
to estimate the parameters w and b is equivalent to minimizing the squared error:
arg min_{w,b} Σ_{i=1}^{n} (y_i − (wx_i + b))².
Consider the dataset S plotted in Fig. 3 along with its associated regression line. For each
of the altered data sets S_new plotted in Fig. 5, indicate which regression line (relative to
the original one) in Fig. 4 corresponds to the regression line for the new data set. Write
your answers in the table below.
(a) Adding one outlier to the original data set.
(b) Adding two outliers to the original data set.
(c) Adding three outliers to the original data set, two on one side and one on the other side.
(d) Duplicating the original data set.
(e) Duplicating the original data set and adding four points that lie on the trajectory of the original regression line.
6 Optimization
1. Select all that apply: Which of the following are correct regarding Gradient Descent
(GD) and Stochastic Gradient Descent (SGD)?
□ Each update step in SGD pushes the parameter vector closer to the parameter
vector that minimizes the objective function.
□ The gradient computed in SGD is, in expectation, equal to the gradient computed
in GD.
□ The gradient computed in GD has a higher variance than that computed in SGD,
which is why in practice SGD converges faster in time than GD.
B.
A is incorrect: SGD updates have high variance and may not move in the direction of the
true gradient. C is incorrect for the same reason. D is incorrect since they can converge
if the function is convex, not just strongly convex.
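A minimal sketch (not from the exam) contrasting one GD step with one SGD step on the mean squared error J(θ) = (1/N)||Xθ − y||²; the single-example gradient used by SGD equals the full gradient in expectation. Data are random placeholders.

import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
theta = np.zeros(d)
lr = 0.01

def full_gradient(theta):
    return (2.0 / N) * X.T @ (X @ theta - y)

def stochastic_gradient(theta):
    i = rng.integers(N)                          # one uniformly sampled example
    return 2.0 * X[i] * (X[i] @ theta - y[i])

theta_gd = theta - lr * full_gradient(theta)          # one GD step
theta_sgd = theta - lr * stochastic_gradient(theta)   # one noisier SGD step

# Averaging many stochastic gradients gets close to the full gradient.
avg = np.mean([stochastic_gradient(theta) for _ in range(20000)], axis=0)
print(np.round(avg, 2), np.round(full_gradient(theta), 2))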
2. (a) Determine if the following 1-D functions are convex. Assume that the domain of
each function is R. The definition of a convex function is as follows:
f (x) is convex ⇐⇒ f (αx + (1 − α)z) ≤ αf (x) + (1 − α)f (z), ∀α ∈ [0, 1] and ∀x, z.
□ α = 1
□ α = 2
Give the range of all values for α ≥ 0 such that lim_{t→∞} f(z^(t)) = 0, assuming the
initial value of z is z^(0) = 1. Be specific.
(0, 1).