
EECS 445 — Introduction to Machine Learning: Sample Midterm

Winter 2024

Name: uniqname: (1pt)

You have 110 minutes to complete this exam (from the time you turn past this cover page to the
time you make the last mark on any page other than this cover page). As indicated above, filling
in your name and uniqname is worth 1 point of the exam total. This is a closed everything
exam (including books, web, class notes, etc.) except for one double-sided 8.5×11 inch piece of
paper with notes prepared by you. No electronic devices are allowed (this includes calculators,
cellphones, etc.).
When you are finished, sign the honor code statement below.

“I have neither given nor received aid on this examination, nor have I concealed
any violations of the honor code.”

Signed:

Additional instructions:
1. DO NOT DETACH PAGES FROM THE EXAM. Failure to comply may result in
point deductions.
2. This exam is printed double-sided. Please be sure to answer ALL questions including all
subparts (check all pages).
3. Mark your answers ON THE EXAM ITSELF, in the space provided. If you make a mess,
clearly indicate your final answer (box it).
4. In general, you only have to provide a narrative answer when requested. In these cases,
be succinct (1-3 sentences should typically be sufficient). However, if a narrative answer
is not requested, you may, if you wish, provide a brief explanation for partial credit where
appropriate.
5. In most cases, any numerical computation is simple enough that it can be done by hand.
Otherwise, simplify as far as possible without a calculator. In the appendix at the end of the
exam (before the blank pages), we have provided some properties and formulas that might
be helpful for you.
6. If you think something about a question is open to interpretation, feel free to ask the course
staff or make a note on the exam.
7. Please flip through the exam to make sure you have all pages. The total number of pages is
indicated in the footer. Note: we have provided extra pages for your rough work at
the end.
8. MAKE SURE TO WRITE CLEARLY AND LEGIBLY. If we are unable to read your writing,
we reserve the right to make point deductions.
9. If you are still in the exam room within the last 10 minutes of the exam, you must remain
seated inside the classroom until the end of the exam time.
10. Before you leave the exam room, be sure to turn in your exam to the proctor and
sign the sheet provided with your uniqname to confirm your submission.
Q Problem

0 Writing your name and uniqname

1 (Stochastic) Gradient Descent & Regression

2 Kernels and Feature Maps

3 Classification and Regression

4 SVM

5 Perceptron

6 Decision Trees and Random Forests

7 AdaBoost

1 (Stochastic) Gradient Descent and Regression
1. Suppose you use gradient descent to minimize the quartic function f(z) = z^4/4 with z ∈ R.
You initialize at z^(0) = 1 and set a constant learning rate η = 2 in all iterations.

(a) Calculate z (1) and z (2) . Show your work.

Solution: The gradient descent update step for the parameter z is:

z^(t+1) = z^(t) − η ∇f(z^(t))

The gradient of f(z) = z^4/4 with respect to z is:

∇f(z) = z^3
Now, we can use this gradient to find z^(1) and z^(2) given that z^(0) = 1 and η = 2.

Calculation for z^(1):

z^(1) = z^(0) − η ∇f(z^(0))
      = 1 − 2 × 1^3
      = −1

Calculation for z^(2):

z^(2) = z^(1) − η ∇f(z^(1))
      = −1 − 2 × (−1)^3
      = 1

So, z^(1) = −1 and z^(2) = 1.

(b) Calculate z (k) for all k = 0, 1, 2, . . .. Simplify your answer as much as possible.

Solution:
Given z^(0) = 1, z^(1) = −1, and z^(2) = 1, we observe that the sequence of z values
oscillates between 1 and −1 starting from k = 0.

In general terms, the sequence z^(k) can be described as follows:

z^(k) = 1 if k is even, and z^(k) = −1 if k is odd; equivalently, z^(k) = (−1)^k for all k = 0, 1, 2, . . .

(c) Based on the results from (a) and (b) above, would you increase, decrease, or keep
the same learning rate? Please select an answer and give a brief explanation.

⃝ Increase Decrease ⃝ Keep the same
Explain your answer below.

Solution: Decrease.
The current learning rate (η = 2) leads to oscillations in the value of z between
1 and −1, showing that the algorithm is not converging to a minimum. The high
learning rate is causing the algorithm to overshoot the minimum and bounce back
and forth across it. A lower learning rate could potentially allow the algorithm to
converge to the minimum by taking smaller steps.
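
To see this behavior numerically (an illustrative sketch, not part of the exam; the smaller rate 0.5 is an arbitrary choice), a few lines of Python comparing η = 2 with a smaller learning rate on f(z) = z^4/4:

    # Gradient descent on f(z) = z**4 / 4; the gradient is z**3.
    def grad_descent(z0, eta, steps):
        z, traj = z0, [z0]
        for _ in range(steps):
            z = z - eta * z**3
            traj.append(z)
        return traj

    print(grad_descent(1.0, eta=2.0, steps=6))   # 1, -1, 1, -1, ... (oscillates forever)
    print(grad_descent(1.0, eta=0.5, steps=6))   # steps shrink toward the minimum at z = 0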

2. You are given a regression dataset {(x̄^(i), y^(i))}_{i=1}^{n} with x̄^(i) ∈ R^d and y^(i) ∈ R. Instead of
the squared loss function Loss(z) = (1/2)z^2 we saw in class, you would like to try the following
quartic loss function:

Loss_4(z) = z^4/4

You would like to learn a linear model (without offset) parameterized by θ̄ ∈ R^d by minimizing
the empirical risk with respect to the quartic loss.

(a) Suppose you are minimizing the empirical risk with respect to the quartic loss using
stochastic gradient descent. When updating the model parameter from θ̄(k) to θ̄(k+1) ,
you use a learning rate ηk and the training example (x̄(i) , y (i) ). Derive the update for-
mula from θ̄(k) to θ̄(k+1) .

Solution: Use the update formula for SGD:

θ̄^(k+1) = θ̄^(k) − η_k ∇_θ̄ Loss_4(y^(i) − θ̄ · x̄^(i)) |_{θ̄ = θ̄^(k)}
         = θ̄^(k) − η_k ∇_θ̄ [ (y^(i) − θ̄ · x̄^(i))^4 / 4 ] |_{θ̄ = θ̄^(k)}
         = θ̄^(k) − η_k [ (y^(i) − θ̄^(k) · x̄^(i))^3 / 4 ] (−4 x̄^(i))
         = θ̄^(k) + η_k (y^(i) − θ̄^(k) · x̄^(i))^3 x̄^(i)

(b) Suppose you are minimizing the empirical risk with respect to the quartic loss using
gradient descent. When updating the model parameter from θ̄(k) to θ̄(k+1) , you use a
learning rate ηk . Derive the update formula from θ̄(k) to θ̄(k+1) .

Solution: Use the update formula for GD:

θ̄^(k+1) = θ̄^(k) − η_k ∇_θ̄ R_n(θ̄) |_{θ̄ = θ̄^(k)}
         = θ̄^(k) − η_k ∇_θ̄ [ (1/n) Σ_{i=1}^{n} (y^(i) − θ̄ · x̄^(i))^4 / 4 ] |_{θ̄ = θ̄^(k)}
         = θ̄^(k) − η_k (1/n) Σ_{i=1}^{n} [ (y^(i) − θ̄^(k) · x̄^(i))^3 / 4 ] (−4 x̄^(i))
         = θ̄^(k) + (η_k / n) Σ_{i=1}^{n} (y^(i) − θ̄^(k) · x̄^(i))^3 x̄^(i)
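
A compact numpy rendering of these two updates (an illustrative sketch, not part of the exam; the function names and the convention that X stores the x̄^(i) as rows are our assumptions):

    import numpy as np

    # Updates for a linear model without offset under the quartic loss.
    # X: (n, d) matrix with rows x_i; y: (n,) labels; theta: (d,) parameters.

    def sgd_update(theta, x_i, y_i, eta):
        # theta <- theta + eta * (y_i - theta . x_i)^3 * x_i
        residual = y_i - theta @ x_i
        return theta + eta * residual**3 * x_i

    def gd_update(theta, X, y, eta):
        # theta <- theta + (eta / n) * sum_i (y_i - theta . x_i)^3 * x_i
        residuals = y - X @ theta
        return theta + (eta / len(y)) * (X.T @ residuals**3)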

(c) Given below is a dataset with feature dimension d = 1 and number of examples n = 11,
where the data points are represented as circles. You tried linear regression with both
the squared loss and the quartic loss on this dataset, and plotted the resulting solutions
as Line A (dashed) and Line B (solid) below.

Unfortunately, you have forgotten which line corresponds to which loss function. Which of
the following scenarios is more likely? Select one of them and provide a justification.

Line A corresponds to the squared loss, and Line B corresponds to the
quartic loss.
⃝ Line B corresponds to the squared loss, and Line A corresponds to the quartic loss.

Justify your answer below.

Solution: The first option is correct: Line A corresponds to the squared loss, and
Line B corresponds to the quartic loss. This is because the quartic loss penalizes large
residuals more heavily than the squared loss, so the model trained with the quartic
loss will be much more influenced by the outlier data point (x^(i), y^(i)) = (10, 20),
resulting in a fit that predicts this data point better at the expense of fitting the
other data points worse. Line B exhibits this pattern compared to Line A, so Line B
corresponds to the quartic loss and Line A to the squared loss.

2 Kernels and Feature Maps
1. Consider the following binary classification dataset {(x̄^(i), y^(i))}_{i=1}^{8}, with x̄^(i) ∈ R^2 and y^(i) ∈
{+1, −1}. Circles represent data points with a class label −1, and crosses represent data
points with a class label +1.

(a) Are the data points linearly separable in the original feature space?
⃝ Yes    ✓ No
(b) Suppose you use a linear classifier without offset on this dataset. What is the smallest
training error you can possibly get? Explain your answer.

Solution: The smallest possible training error is 4/8 = 1/2, because the best possible linear
classifier without offset will misclassify 2 data points from each class.

(c) Your friend suggests that you use a feature map corresponding to the kernel K(x̄, z̄) =
2x1 z1 + 3x2 z2 + 4 where x̄, z̄ ∈ R2 . Would these data points be linearly separable after
this feature map? Why or why not?

Solution: No. This kernel corresponds to a linear feature mapping. The given
dataset is not linearly separable in the original feature space, so applying only linear
transformations will not make this dataset linearly separable in the mapped feature
space. To make this dataset linearly separable in the mapped feature space, we
would need to use a nonlinear kernel.

(d) Propose a feature map ϕ : R2 → Rp with the smallest possible p such that these data
points are linearly separable after the feature map ϕ(x̄).

Solution: One possible feature mapping is ϕ(x̄) = x_1^2. This is a mapping onto R^1.

2. For x̄, z̄ ∈ R^2, consider the kernel function

K(x̄, z̄) = e^{x_1^2 + z_1^2 + 4} + 3(x̄ · z̄ + 2)^2.

Derive a feature map ϕ(x̄) that corresponds to this kernel. You may refer to Appendix for
useful properties of the exponential function.

Solution:

K(x̄, z̄) = e^{x_1^2 + z_1^2 + 4} + 3(x̄ · z̄ + 2)^2
         = (e^{x_1^2 + 2})(e^{z_1^2 + 2}) + 3(x_1 z_1 + x_2 z_2 + 2)^2
         = (e^2 e^{x_1^2})(e^2 e^{z_1^2}) + 3(x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 x_2 z_2 + 4 x_1 z_1 + 4 x_2 z_2 + 4)
         = (e^2 e^{x_1^2})(e^2 e^{z_1^2}) + 3 x_1^2 z_1^2 + 3 x_2^2 z_2^2 + 6 x_1 z_1 x_2 z_2 + 12 x_1 z_1 + 12 x_2 z_2 + 12

Thus, the feature map that corresponds to this kernel is:

ϕ(x̄) = [e^2 e^{x_1^2}, √3 x_1^2, √3 x_2^2, √6 x_1 x_2, 2√3 x_1, 2√3 x_2, 2√3]^T
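
As a sanity check (illustrative, not part of the exam), a short numpy sketch that verifies ϕ(x̄) · ϕ(z̄) = K(x̄, z̄) at a couple of arbitrary points:

    import numpy as np

    def K(x, z):
        # Kernel from the problem: exp(x1^2 + z1^2 + 4) + 3*(x.z + 2)^2
        return np.exp(x[0]**2 + z[0]**2 + 4) + 3 * (x @ z + 2)**2

    def phi(x):
        # Feature map derived above.
        r3, r6 = np.sqrt(3), np.sqrt(6)
        return np.array([np.exp(2) * np.exp(x[0]**2),
                         r3 * x[0]**2, r3 * x[1]**2, r6 * x[0] * x[1],
                         2 * r3 * x[0], 2 * r3 * x[1], 2 * r3])

    x, z = np.array([0.5, -1.0]), np.array([2.0, 0.3])
    print(np.isclose(phi(x) @ phi(z), K(x, z)))   # expected: True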

3 Classification and Regression
For this question, consider a data set S with five U.S. cities of different latitude and longitude.
The horizontal lines that cross the earth are the lines of latitude. The vertical lines that cross the
earth are the lines of longitude. In this question, assume that each city is represented by
a two dimensional vector, described by the city’s latitude and longitude. We would like
to classify these cities as North and South. For this question, consider Chicago and DC as cities
classified as North and the other three cities as South.

LA: Latitude: 34.1°N, Longitude: 118.2°W


DC: Latitude: 38.9°N, Longitude: 77.1°W
Orlando: Latitude: 28.5°N, Longitude: 81.4°W
Miami: Latitude: 25.8°N, Longitude: 80.2°W
Chicago: Latitude: 41.8°N, Longitude: 87.8°W

(a) What is the training error at convergence achieved by the Perceptron algorithm (with offset)
on this dataset? Why?

Solution: The minimal training error is 0.

Yes, the perceptron algorithm will converge here: the given plot of the five data points
shows that they are linearly separable with an offset.

(b) Do(es) there exist any decision tree(s) that can perfectly classify this dataset? If so, draw
(one of) the smallest of such decision tree(s). If not, explain why this is not possible.

Solution: Yes there is a decision tree that can perfectly classify this dataset, consider the
example below.
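
The tree drawing from the original solution is not reproduced in this text version. A minimal example of such a tree (with an illustrative threshold of 36 degrees latitude, one of many valid choices between DC at 38.9 and LA at 34.1): a single split on latitude x_1 suffices, since both North cities lie above the threshold and all three South cities lie below it.

    x_1 ≥ 36 → North
    x_1 < 36 → South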

(c) Propose a feature mapping ϕ(x̄) (where x̄ = [x1 , x2 ]T and x1 represents latitude and x2
represents longitude) to the lowest dimension possible such that these mapped data points
can be perfectly classified by a linear classifier.

Solution: e.g.,
ϕ : R2 → R
and
ϕ(x1 , x2 ) = x1

4 SVM
1. Suppose that you have a linearly separable dataset and train a hard-margin SVM on it. Is
it possible that all the data points turn out to be support vectors?

Yes ⃝ No
Optional justification

2. Suppose you wish to use a hard-margin SVM without offset to solve a binary classification
problem where some training data points are more important than others. Formally, you are
given a training dataset {(x̄^(i), y^(i))}_{i=1}^{n} where x̄^(i) ∈ R^d, y^(i) ∈ {+1, −1}, and you are also
given a set of known weights p_1, . . . , p_n, where p_i indicates the importance of the i-th data
point and 0 ≤ p_i ≤ 1. You formulate the modified SVM problem as follows:

min_θ̄  (1/2) ∥θ̄∥_2^2
subject to  y^(i) θ̄ · x̄^(i) ≥ p_i,   i = 1, 2, . . . , n.


Now you wish to find the dual form of this modified SVM problem. You may assume that
strong duality holds for this problem and that the duality gap is 0.

(a) Write out the Lagrangian.

List your dual variables here: α1 , . . . , αn

Solution:

L(θ̄, ᾱ) = (1/2)∥θ̄∥_2^2 + Σ_{i=1}^{n} α_i (p_i − y^(i) (θ̄ · x̄^(i)))

(b) Swap the order of the optimization and find the optimal value of θ̄ in terms of the dual
variables.

Solution:

∇_θ̄ L(θ̄, ᾱ) = θ̄ − Σ_{i=1}^{n} α_i y^(i) x̄^(i)

Setting this gradient to zero gives:

θ̄ = Σ_{i=1}^{n} α_i y^(i) x̄^(i)

(c) Plug in the optimal value of θ̄ you obtained in part (b) and derive the dual formulation
of the modified SVM problem.

Solution: The dual formulation of this modified SVM problem is

max_{ᾱ, α_i ≥ 0} min_θ̄  (1/2)∥θ̄∥_2^2 + Σ_{i=1}^{n} α_i (p_i − y^(i) (θ̄ · x̄^(i)))

Using the fact from (b), we get the equivalent optimization problem

max_{ᾱ, α_i ≥ 0}  (1/2) ∥Σ_{i=1}^{n} α_i y^(i) x̄^(i)∥_2^2 + Σ_{i=1}^{n} α_i [ p_i − y^(i) ((Σ_{j=1}^{n} α_j y^(j) x̄^(j)) · x̄^(i)) ]

which simplifies to the following dual formulation of the modified SVM problem

max_{ᾱ, α_i ≥ 0}  Σ_{i=1}^{n} α_i p_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y^(i) y^(j) (x̄^(i) · x̄^(j))
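
As a companion to this derivation, a small numpy sketch (illustrative; the function names and array layout are our assumptions) that evaluates the modified dual objective for given dual variables α, data matrix X with rows x̄^(i), labels y, and importance weights p, and recovers θ̄ from α:

    import numpy as np

    def modified_dual_objective(alpha, X, y, p):
        # sum_i alpha_i p_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)
        Yx = y[:, None] * X                 # rows y_i * x_i
        G = Yx @ Yx.T                       # G[i, j] = y_i y_j (x_i . x_j)
        return alpha @ p - 0.5 * alpha @ G @ alpha

    def primal_theta(alpha, X, y):
        # Recover theta from the dual variables: theta = sum_i alpha_i y_i x_i
        return X.T @ (alpha * y)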

5 Perceptron
You are running the perceptron algorithm without offset on a dataset {(x̄^(i), y^(i))}_{i=1}^{n}, with x̄^(i) ∈
R^d and y^(i) ∈ {+1, −1}. Assume that the feature vectors in your dataset are orthonormal, which
means that they all have unit L2-norm and every two different vectors are orthogonal:

∥x̄^(i)∥_2 = 1,   i = 1, 2, . . . , n,
x̄^(i) · x̄^(j) = 0,   1 ≤ i ≠ j ≤ n.

Suppose you initialize θ̄(0) = 0̄. After making k updates, you arrive at parameter θ̄(k) . Now, you
count the number of times each data point has been used to update the parameter, and let that
number be mi for data point (x̄(i) , y (i) ).

1. Write down the expression of θ̄(k) . You may use k and x̄(i) , y (i) , mi for i ∈ {1, 2, . . . , n} in
your expression.

Solution:

θ̄^(k) = Σ_{i=1}^{n} m_i y^(i) x̄^(i)


2. What is the value of y^(i) θ̄^(k) · x̄^(i) for each i ∈ {1, 2, . . . , n}? Show your calculation and
simplify your answer as much as possible.

Solution: Using part 1 and the fact that the x̄^(i) are orthonormal, we have

y^(i) θ̄^(k) · x̄^(i) = y^(i) (Σ_{j=1}^{n} m_j y^(j) x̄^(j)) · x̄^(i)
                   = y^(i) Σ_{j=1}^{n} m_j y^(j) (x̄^(j) · x̄^(i))
                   = y^(i) (m_i y^(i))
                   = m_i      (since (y^(i))^2 = 1)

3. For the dataset described in this problem, is it possible that m2 = 10? If it is possible, state
the corresponding value of k. If not, list the possible values that m2 can take throughout the
algorithm. Explain your answer.

Solution:
It is not possible for m_2 to be 10.
The possible values m_2 can take are 0 and 1.
Before the algorithm makes any updates on the dataset, all m_i's are 0.
Then, we know from part 2:

y^(i) θ̄^(k) · x̄^(i) = m_i

which means that the algorithm will make one and only one update on each point in the
dataset: after one update on point i, we will have

y^(i) θ̄^(k) · x̄^(i) = m_i = 1 > 0,

so point i is classified correctly from then on (updates on other points do not change this
value, by orthogonality), and m_i never increases past 1.




4. Will the perceptron algorithm eventually terminate on this dataset? Why or why not?

Solution:
Yes, the algorithm will eventually terminate.
As shown in parts 2 and 3, the updates using different points have no influence on each
other.
Therefore, the algorithm will make one and only one update on each point in the dataset
and then terminate.
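
A quick simulation of this argument (an illustrative sketch, not part of the exam), running the perceptron without offset on an orthonormal dataset built from standard basis vectors; it updates each point exactly once and then terminates:

    import numpy as np

    n = 5
    X = np.eye(n)                                  # x_i are orthonormal
    y = np.array([1, -1, 1, -1, 1])                # arbitrary labels for illustration
    theta = np.zeros(n)
    counts = np.zeros(n, dtype=int)                # m_i: number of updates using point i

    converged = False
    while not converged:
        converged = True
        for i in range(n):
            if y[i] * (theta @ X[i]) <= 0:         # misclassified (or on the boundary)
                theta += y[i] * X[i]
                counts[i] += 1
                converged = False

    print(counts)   # expected: [1 1 1 1 1] -- each point is updated exactly once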

6 Decision Trees and Random Forests
1. Consider the following training dataset with 4 datapoints, where X = {0, 1}3 and Y = {0, 1}:

X1 X2 X3 Y
1 1 1 1
1 0 0 1
1 1 0 0
0 0 1 0

You may use 0 log_2(0) = 0, log_2(1) = 0, as well as the following approximations:

a  b   −(a/b) log_2(a/b)
1  2   0.5
1  3   0.53
2  3   0.39
1  4   0.5
3  4   0.31

(a) Calculate the conditional entropy for each feature. Show your work and simplify your
answer to a numerical value.
i. H(Y |X1 )

Solution:

H(Y | X_1) = P(X_1 = 1) H(Y | X_1 = 1) + P(X_1 = 0) H(Y | X_1 = 0)
           = (3/4) [ −(2/3) log_2(2/3) − (1/3) log_2(1/3) ] + (1/4) [ −0 log_2 0 − 1 log_2 1 ]
           = (3/4)(0.39 + 0.53) + 0
           = 0.69

ii. H(Y |X2 )

Solution:

H(Y | X_2) = P(X_2 = 1) H(Y | X_2 = 1) + P(X_2 = 0) H(Y | X_2 = 0)
           = (1/2) [ −(1/2) log_2(1/2) − (1/2) log_2(1/2) ] + (1/2) [ −(1/2) log_2(1/2) − (1/2) log_2(1/2) ]
           = 1

iii. H(Y |X3 )

Solution:

H(Y | X_3) = P(X_3 = 1) H(Y | X_3 = 1) + P(X_3 = 0) H(Y | X_3 = 0)
           = (1/2) [ −(1/2) log_2(1/2) − (1/2) log_2(1/2) ] + (1/2) [ −(1/2) log_2(1/2) − (1/2) log_2(1/2) ]
           = 1
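
For reference, a short Python sketch (illustrative; the helper names are ours, not the exam's) that reproduces these conditional entropies from the 4-point dataset above:

    import numpy as np

    # Dataset from the problem: columns X1, X2, X3 and label Y.
    X = np.array([[1, 1, 1],
                  [1, 0, 0],
                  [1, 1, 0],
                  [0, 0, 1]])
    Y = np.array([1, 1, 0, 0])

    def entropy(labels):
        # H(Y) = -sum_v P(Y=v) log2 P(Y=v), with 0 log2 0 treated as 0.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    def conditional_entropy(feature, labels):
        # H(Y|X) = sum_v P(X=v) H(Y | X=v)
        return sum((feature == v).mean() * entropy(labels[feature == v])
                   for v in np.unique(feature))

    for j in range(3):
        print(f"H(Y|X{j+1}) = {conditional_entropy(X[:, j], Y):.2f}")
    # expected: H(Y|X1) = 0.69, H(Y|X2) = 1.00, H(Y|X3) = 1.00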

(b) Suppose you run the greedy algorithm to build a decision tree by maximizing informa-
tion gain at each split. Which feature would you split on at the root?

Solution: X1

(c) Now you continue running the greedy algorithm, but in order to keep the tree simple,
you set the maximum depth to be 2 (that is, you only allow at most 2 splits before
reaching a leaf). Specifically, whenever you reach depth 2, instead of going on with the
recursion, you stop and make a leaf according to the majority label.
i. Draw the final decision tree obtained from this procedure.
(Note: Whenever there is a tie when you are choosing between different features
or labels, you may break the tie arbitrarily.)

Solution: Any of the following trees:

Tree 1:  split on X1
         X1 = 0 → Y = 0
         X1 = 1 → split on X2
                  X2 = 0 → Y = 1
                  X2 = 1 → Y = 0

Tree 2:  split on X1
         X1 = 0 → Y = 0
         X1 = 1 → split on X2
                  X2 = 0 → Y = 1
                  X2 = 1 → Y = 1

Tree 3:  split on X1
         X1 = 0 → Y = 0
         X1 = 1 → split on X3
                  X3 = 0 → Y = 0
                  X3 = 1 → Y = 1

Tree 4:  split on X1
         X1 = 0 → Y = 0
         X1 = 1 → split on X3
                  X3 = 0 → Y = 1
                  X3 = 1 → Y = 1

ii. What is the training error of this decision tree on the given dataset?

Solution: 1/4

(d) Is it possible to find a decision tree of depth 2 that attains a smaller training error than
the tree you obtained in part (c)? Justify your answer.

Solution: Yes, it is possible to find such a tree with 0 training error:

split on X2
X2 = 0 → split on X3
         X3 = 0 → Y = 1
         X3 = 1 → Y = 0
X2 = 1 → split on X3
         X3 = 0 → Y = 0
         X3 = 1 → Y = 1

2. Choose the correct answer from the choices provided. You may optionally provide a brief
justification for your answers.

(a) Consider the greedy algorithm using information gain maximization. For any dataset,
among all the decision trees that minimize the training error, this algorithm finds a
decision tree with the smallest number of total nodes.

⃝ True False

Solution: False

(b) Consider the greedy algorithm using information gain maximization. For any dataset,
among all the decision trees that minimize the training error, this algorithm finds a
decision tree with the minimum depth.

⃝ True False

Solution: False

(c) Random forest decreases structural error (bias) relative to decision trees.

⃝ True False

Solution: False

(d) Increasing the number of decision trees in a random forest classifier is typically beneficial
for generalization.

True ⃝ False

Solution: True

7 AdaBoost
In class, we discussed the AdaBoost algorithm with decision stumps as the set of weak classifiers. In
this question, we will focus on the feature space X = R2 and instead use axis-aligned rectangles
as the set of weak classifiers.
An axis-aligned rectangle classifier predicts one label for all points inside the axis-aligned rectan-
gle (including boundary), and predicts the other label for all points outside the rectangle. Formally,
it can be parameterized by a, b, c, d ∈ R and q ∈ {+1, −1} and implements the function

h(x̄; a, b, c, d, q) = q if a ≤ x_1 ≤ b and c ≤ x_2 ≤ d, and −q otherwise     (x̄ ∈ R^2).
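
A direct Python rendering of this weak classifier (an illustrative sketch; the function name is ours):

    def rectangle_classifier(x, a, b, c, d, q):
        # Predict q inside the axis-aligned rectangle [a, b] x [c, d] (boundary included),
        # and -q outside of it.
        inside = (a <= x[0] <= b) and (c <= x[1] <= d)
        return q if inside else -q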

You are given the following binary classification dataset {(x̄^(i), y^(i))}_{i=1}^{10}, with x̄^(i) ∈ R^2 and
y^(i) ∈ {+1, −1}. Circles represent data points with a class label −1, and crosses represent data
points with a class label +1. The number next to each data point is its index i in the dataset
(i ∈ {1, 2, . . . , 10}).

1. What is the first weak classifier found by AdaBoost? Describe it using the 5 parameters
a, b, c, d, and q. If there are multiple possible answers, you can just write down any one of
them.
a= ,b= ,c= ,d= ,q= .
Optional justification

Solution: The first weak classifier is an axis-aligned rectangle that contains points 1, 2,
3, 4, 5, 8, and 9. For example,

• a = −1.
• b = 5.
• c = 0.
• d = 4.
• q = −1.

2. What is the weighted error ϵ̂1 of the classifier from part 1?

Solution:

ϵ̂_1 = Σ_{i=1}^{n} w̃_0^(i) [[y^(i) ≠ h(x̄^(i); θ̄_1)]] = (1/10) × 1 + (1/10) × 1 = 2/10 = 1/5

3. What is the updated normalized weight w̃_1^(i) for each data point (x̄^(i), y^(i)) after the first
iteration of AdaBoost?

w̃_1^(1) = ____, w̃_1^(2) = ____, w̃_1^(3) = ____, w̃_1^(4) = ____, w̃_1^(5) = ____,

w̃_1^(6) = ____, w̃_1^(7) = ____, w̃_1^(8) = ____, w̃_1^(9) = ____, w̃_1^(10) = ____.

Optional justification

Solution:

α_1 = (1/2) ln((1 − ϵ̂_1)/ϵ̂_1) = (1/2) ln(4)

Correctly classified points:

w̃_1^(i) = (1/Z_1) w̃_0^(i) exp(−y^(i) α_1 h(x̄^(i); θ̄_1)) = (1/Z_1) × (1/10) × exp(−(1/2) ln(4)) = (1/Z_1) × (1/10) × (1/2) = (1/Z_1) × (1/20)

Incorrectly classified points:

w̃_1^(i) = (1/Z_1) w̃_0^(i) exp(−y^(i) α_1 h(x̄^(i); θ̄_1)) = (1/Z_1) × (1/10) × exp((1/2) ln(4)) = (1/Z_1) × (1/10) × 2 = (1/Z_1) × (1/5)

Finding Z_1 and the new weights:

Z_1 = Σ_{i=1}^{10} w̃_0^(i) exp(−y^(i) α_1 h(x̄^(i); θ̄_1)) = (1/20) × 8 + (1/5) × 2 = 4/5

Plugging Z_1 into the formula for the updated normalized weights, you get 1/16 for all
correctly classified points and 1/4 for all incorrectly classified points.
Correct weights:

w̃_1^(1) = 1/16, w̃_1^(2) = 1/16, w̃_1^(3) = 1/16, w̃_1^(4) = 1/16, w̃_1^(5) = 1/16,
w̃_1^(6) = 1/16, w̃_1^(7) = 1/16, w̃_1^(8) = 1/4, w̃_1^(9) = 1/4, w̃_1^(10) = 1/16.

Alternate Solution: Due to the boosting property,

Σ_{i=1}^{n} w̃_m^(i) [[y^(i) ≠ h(x̄^(i); θ̄_m)]] = 0.5

We will take advantage of the fact that, since this is the first iteration of AdaBoost,
the new normalized weight of all misclassified points will be the same and the new nor-
malized weight of all correctly classified points will be the same. Since there are 2
misclassifications,

Σ_{i=1}^{n} w̃_1^(i) [[y^(i) ≠ h(x̄^(i); θ̄_1)]] = 0.5

w̃_misclassified = 0.5 / 2 = 1/4

Since all the weights are normalized, they add up to 1. If the weights of all the misclas-
sified points add up to 0.5, then the weights of all the correctly classified points add up
to 0.5. Therefore, since there are 8 correctly classified points,

w̃_correctly classified = 0.5 / 8 = 1/16
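
For a numerical cross-check (an illustrative sketch, not part of the exam), the following Python snippet reproduces this first-round weight update, assuming uniform initial weights over 10 points and taking points 8 and 9 as the two misclassified ones, consistent with the solution above:

    import numpy as np

    n = 10
    w = np.full(n, 1.0 / n)                       # initial normalized weights w~_0
    misclassified = np.zeros(n, dtype=bool)
    misclassified[[7, 8]] = True                  # points 8 and 9 in the exam's 1-based indexing

    eps = w[misclassified].sum()                  # weighted error: 2/10
    alpha = 0.5 * np.log((1 - eps) / eps)         # alpha_1 = 0.5 * ln(4)

    # Multiply by exp(+alpha) on mistakes, exp(-alpha) on correct points, then normalize.
    w_new = w * np.exp(np.where(misclassified, alpha, -alpha))
    w_new /= w_new.sum()                          # normalizer Z_1 = 4/5

    print(eps, w_new)   # expected: 0.2, weights 1/16 (correct) and 1/4 (misclassified)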

Appendix
1. The primal formulation of a soft-margin SVM with offset, where C > 0 is a hyperparameter,
is given below.

minimize_{θ̄, b, ξ̄}  (1/2)∥θ̄∥^2 + C Σ_{i=1}^{n} ξ_i
subject to  y^(i) (θ̄ · x̄^(i) + b) ≥ 1 − ξ_i,
            ξ_i ≥ 0,   ∀i = 1, ..., n.

The corresponding dual form is given below.

maximize_ᾱ  Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y^(i) y^(j) x̄^(i) · x̄^(j)
subject to  Σ_{i=1}^{n} α_i y^(i) = 0,
            0 ≤ α_i ≤ C,   ∀i = 1, ..., n.

Recall that the optimal value of the primal variable θ̄ may be expressed in terms of the dual
as θ̄ = Σ_{i=1}^{n} α_i y^(i) x̄^(i).

2. The RBF (Radial Basis Function) kernel with hyperparameter γ is given by:

K_RBF(x̄, z̄) = exp(−γ ∥x̄ − z̄∥^2)

3. The closed form expression for a Linear Regression model with squared loss is given by

θ̄* = (X^T X)^{−1} X^T ȳ

where X = [x̄^(1), . . . , x̄^(n)]^T and ȳ = [y^(1), . . . , y^(n)]^T.
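
An equivalent one-liner in numpy (illustrative; it uses a linear solve rather than forming the inverse explicitly):

    import numpy as np

    def fit_linear_regression(X, y):
        # theta* = (X^T X)^{-1} X^T y, computed via a linear solve.
        return np.linalg.solve(X.T @ X, X.T @ y)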

4. You might find the following properties of the logarithmic and exponential functions useful.

• log(ab) = log(a) + log(b)     • log_b(x) = log_a(x) / log_a(b)     • e^{ln(a)} = a
• log(a^b) = b · log(a)         • log_2(2^a) = a                     • e^{a+b} = e^a e^b

5. Recall the general form for entropy of a random variable Y that can take on values in
{y_1, . . . , y_k}:

H(Y) = − Σ_{i=1}^{k} P(Y = y_i) log_2 P(Y = y_i)

The conditional entropy is given by (assuming X takes values in {x_1, . . . , x_m})

H(Y|X) = Σ_{j=1}^{m} P(X = x_j) H(Y|X = x_j)

where

H(Y|X = x_j) = − Σ_{i=1}^{k} P(Y = y_i|X = x_j) log_2 P(Y = y_i|X = x_j)

6. The greedy algorithm for building decision trees is:

Algorithm 1: Build Decision Tree
BuildTree(DS)
    if y^(i) == y for all examples in DS then
        return y
    else if x̄^(i) == x̄ for all examples in DS then
        return majority label
    else
        j, t = arg min_{j,t} H(Y | [[X_j ≥ t]])
        DS_g = {examples in DS where X_j ≥ t}
        BuildTree(DS_g)
        DS_l = {examples in DS where X_j < t}
        BuildTree(DS_l)
    end

7. The AdaBoost algorithm:

Algorithm 2: AdaBoost
(a) Initialize the observation weights w̃_0^(i) = 1/n, for all i ∈ [1 . . . n]
(b) For m = 1 to M:
    i. Find: θ̄_m = arg min_θ̄ Σ_{i=1}^{n} w̃_{m−1}^(i) [[y^(i) ≠ h(x̄^(i); θ̄)]]
    ii. Given θ̄_m, compute: ϵ̂_m = Σ_{i=1}^{n} w̃_{m−1}^(i) [[y^(i) ≠ h(x̄^(i); θ̄_m)]].
    iii. Compute α_m = (1/2) ln((1 − ϵ̂_m)/ϵ̂_m).
    iv. Update un-normalized weights for all i ∈ [1 . . . n]:
        w_m^(i) ← w̃_{m−1}^(i) · exp(−y^(i) α_m h(x̄^(i); θ̄_m)),
    v. Normalize weights to sum to 1:
        w̃_m^(i) ← w_m^(i) / Z_m,  where Z_m := Σ_i w_m^(i)
(c) Output the final classifier: h_M(x̄) = Σ_{m=1}^{M} α_m h(x̄; θ̄_m)

(Blank page for rough work)

(Blank page for rough work)

