
EECS 445 — Introduction to Machine Learning: Sample Midterm

Winter 2024

Name: uniqname: (1pt)

You have 110 minutes to complete this exam (from the time you turn past this cover page to the
time you make the last mark on any page other than this cover page). As indicated above, filling
in your name and uniqname is worth 1 point of the exam total. This is a closed everything
exam (including books, web, class notes, etc.) except for one double-sided 8.5×11 inch piece of
paper with notes prepared by you. No electronic devices are allowed (this includes calculators,
cellphones, etc.).
When you are finished, sign the honor code statement below.

“I have neither given nor received aid on this examination, nor have I concealed
any violations of the honor code.”

Signed:

Additional instructions:
1. DO NOT DETACH PAGES FROM THE EXAM. Failure to comply may result in
point deductions.
2. This exam is printed double-sided. Please be sure to answer ALL questions including all
subparts (check all pages).
3. Mark your answers ON THE EXAM ITSELF, in the space provided. If you make a mess,
clearly indicate your final answer (box it).
4. In general, you only have to provide a narrative answer when requested. In these cases,
be succinct (1-3 sentences should typically be sufficient). However, if a narrative answer
is not requested, you may, if you wish, provide a brief explanation for partial credit where
appropriate.
5. In most cases, any numerical computation is simple enough that it can be done by hand.
Otherwise, simplify as far as possible without a calculator. In the appendix at the end of the
exam (before the blank pages), we have provided some properties and formulas that might
be helpful for you.
6. If you think something about a question is open to interpretation, feel free to ask the course
staff or make a note on the exam.
7. Please flip through the exam to make sure you have all pages. The total number of pages is
indicated in the footer. Note: we have provided extra pages for your rough work at
the end.
8. MAKE SURE TO WRITE CLEARLY AND LEGIBLY. If we are unable to read your writing,
we reserve the right to make point deductions.
9. If you are still in the exam room within the last 10 minutes of the exam, you must remain
seated inside the classroom until the end of the exam time.
10. Before you leave the exam room, be sure to turn in your exam to the proctor and
sign the sheet provided with your uniqname to confirm your submission.
Q Problem

0 Writing your name and uniqname

1 (Stochastic) Gradient Descent & Regression

2 Kernels and Feature Maps

3 Classification and Regression

4 SVM

5 Perceptron

6 Decision Trees and Random Forests

7 AdaBoost

1 (Stochastic) Gradient Descent and Regression
1. Suppose you use gradient descent to minimize the quartic function f(z) = z^4/4 with z ∈ R.
You initialize at z^(0) = 1 and set a constant learning rate η = 2 in all iterations.

(a) Calculate z (1) and z (2) . Show your work.

Solution: The gradient descent update step for the parameter z is:

z^(t+1) = z^(t) − η ∇f(z^(t))

The gradient of f(z) = z^4/4 with respect to z is:

∇f(z) = z^3
Now, we can use this gradient to find z^(1) and z^(2) given that z^(0) = 1 and η = 2.

Calculation for z^(1):

z^(1) = z^(0) − η ∇f(z^(0))
      = 1 − 2 × 1^3
      = −1

Calculation for z^(2):

z^(2) = z^(1) − η ∇f(z^(1))
      = −1 − 2 × (−1)^3
      = 1

So, z^(1) = −1 and z^(2) = 1.

(b) Calculate z (k) for all k = 0, 1, 2, . . .. Simplify your answer as much as possible.

Solution:
Given z^(0) = 1, z^(1) = −1, and z^(2) = 1, we observe that the sequence of z values
oscillates between 1 and −1 starting from k = 0.

In general terms, the sequence z^(k) can be described as follows:

z^(k) = 1 if k is even, and z^(k) = −1 if k is odd; equivalently, z^(k) = (−1)^k for all k = 0, 1, 2, . . .

(c) Based on the results from (a) and (b) above, would you increase, decrease, or keep
the same learning rate? Please select an answer and give a brief explanation.

⃝ Increase Decrease ⃝ Keep the same
Explain your answer below.

Solution: Decrease.
The current learning rate (η = 2) leads to oscillations in the value of z between
1 and −1, showing that the algorithm is not converging to a minimum. The high
learning rate is causing the algorithm to overshoot the minimum and bounce back
and forth across it. A lower learning rate could potentially allow the algorithm to
converge to the minimum by taking smaller steps.
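
To see this behavior numerically (an illustrative sketch, not part of the exam; the smaller rate 0.5 is an arbitrary choice), a few lines of Python comparing η = 2 with a smaller learning rate on f(z) = z^4/4:

    # Gradient descent on f(z) = z**4 / 4; the gradient is z**3.
    def grad_descent(z0, eta, steps):
        z, traj = z0, [z0]
        for _ in range(steps):
            z = z - eta * z**3
            traj.append(z)
        return traj

    print(grad_descent(1.0, eta=2.0, steps=6))   # 1, -1, 1, -1, ... (oscillates forever)
    print(grad_descent(1.0, eta=0.5, steps=6))   # steps shrink toward the minimum at z = 0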

2. You are given a regression dataset {(x̄^(i), y^(i))}_{i=1}^{n} with x̄^(i) ∈ R^d and y^(i) ∈ R. Instead of
the squared loss function Loss(z) = (1/2)z^2 we saw in class, you would like to try the following
quartic loss function:

Loss_4(z) = z^4/4

You would like to learn a linear model (without offset) parameterized by θ̄ ∈ R^d by minimizing
the empirical risk with respect to the quartic loss.

(a) Suppose you are minimizing the empirical risk with respect to the quartic loss using
stochastic gradient descent. When updating the model parameter from θ̄(k) to θ̄(k+1) ,
you use a learning rate ηk and the training example (x̄(i) , y (i) ). Derive the update for-
mula from θ̄(k) to θ̄(k+1) .

Solution: Use the update formula for SGD:

θ̄^(k+1) = θ̄^(k) − η_k ∇_θ̄ Loss_4(y^(i) − θ̄ · x̄^(i)) |_{θ̄ = θ̄^(k)}
         = θ̄^(k) − η_k ∇_θ̄ [ (y^(i) − θ̄ · x̄^(i))^4 / 4 ] |_{θ̄ = θ̄^(k)}
         = θ̄^(k) − η_k [ (y^(i) − θ̄^(k) · x̄^(i))^3 / 4 ] (−4 x̄^(i))
         = θ̄^(k) + η_k (y^(i) − θ̄^(k) · x̄^(i))^3 x̄^(i)

(b) Suppose you are minimizing the empirical risk with respect to the quartic loss using
gradient descent. When updating the model parameter from θ̄(k) to θ̄(k+1) , you use a
learning rate ηk . Derive the update formula from θ̄(k) to θ̄(k+1) .

Solution: Use the update formula for GD:

θ̄^(k+1) = θ̄^(k) − η_k ∇_θ̄ R_n(θ̄) |_{θ̄ = θ̄^(k)}
         = θ̄^(k) − η_k ∇_θ̄ [ (1/n) Σ_{i=1}^{n} (y^(i) − θ̄ · x̄^(i))^4 / 4 ] |_{θ̄ = θ̄^(k)}
         = θ̄^(k) − η_k (1/n) Σ_{i=1}^{n} [ (y^(i) − θ̄^(k) · x̄^(i))^3 / 4 ] (−4 x̄^(i))
         = θ̄^(k) + (η_k / n) Σ_{i=1}^{n} (y^(i) − θ̄^(k) · x̄^(i))^3 x̄^(i)
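
A compact numpy rendering of these two updates (an illustrative sketch, not part of the exam; the function names and the convention that X stores the x̄^(i) as rows are our assumptions):

    import numpy as np

    # Updates for a linear model without offset under the quartic loss.
    # X: (n, d) matrix with rows x_i; y: (n,) labels; theta: (d,) parameters.

    def sgd_update(theta, x_i, y_i, eta):
        # theta <- theta + eta * (y_i - theta . x_i)^3 * x_i
        residual = y_i - theta @ x_i
        return theta + eta * residual**3 * x_i

    def gd_update(theta, X, y, eta):
        # theta <- theta + (eta / n) * sum_i (y_i - theta . x_i)^3 * x_i
        residuals = y - X @ theta
        return theta + (eta / len(y)) * (X.T @ residuals**3)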

(c) Given below is a dataset with feature dimension d = 1 and number of examples n = 11,
where the data points are represented as circles. You tried linear regression with both
the squared loss and the quartic loss on this dataset, and plotted the resulting solutions
as Line A (dashed) and Line B (solid) below.

Unfortunately, you have forgotten which line corresponds to which loss function. Which of
the following scenarios is more likely? Select one of them and provide a justification.

Line A corresponds to the squared loss, and Line B corresponds to the
quartic loss.
⃝ Line B corresponds to the squared loss, and Line A corresponds to the quartic loss.

Justify your answer below.

Solution: The first option is correct: Line A corresponds to the squared loss, and
Line B corresponds to the quartic loss. This is because the quartic loss penalizes large
residuals more heavily than the squared loss, so the model trained with the quartic
loss will be much more influenced by the outlier data point (x^(i), y^(i)) = (10, 20),
resulting in a fit that predicts this data point better at the expense of fitting the
other data points worse. Line B exhibits this pattern compared to Line A, so Line B
corresponds to the quartic loss and Line A to the squared loss.

2 Kernels and Feature Maps
1. Consider the following binary classification dataset {(x̄^(i), y^(i))}_{i=1}^{8}, with x̄^(i) ∈ R^2 and y^(i) ∈
{+1, −1}. Circles represent data points with a class label −1, and crosses represent data
points with a class label +1.

(a) Are the data points linearly separable in the original feature space?
⃝ Yes    ✓ No
(b) Suppose you use a linear classifier without offset on this dataset. What is the smallest
training error you can possibly get? Explain your answer.

Solution: The smallest possible training error is 4/8 = 1/2, because the best possible linear
classifier without offset will misclassify 2 data points from each class.

(c) Your friend suggests that you use a feature map corresponding to the kernel K(x̄, z̄) =
2x1 z1 + 3x2 z2 + 4 where x̄, z̄ ∈ R2 . Would these data points be linearly separable after
this feature map? Why or why not?

Solution: No. This kernel corresponds to a linear feature mapping. The given
dataset is not linearly separable in the original feature space, so applying only linear
transformations will not make this dataset linearly separable in the mapped feature
space. To make this dataset linearly separable in the mapped feature space, we
would need to use a nonlinear kernel.

(d) Propose a feature map ϕ : R2 → Rp with the smallest possible p such that these data
points are linearly separable after the feature map ϕ(x̄).

Solution: One possible feature mapping is ϕ(x̄) = x_1^2. This is a mapping onto R^1.

2. For x̄, z̄ ∈ R^2, consider the kernel function

K(x̄, z̄) = e^{x_1^2 + z_1^2 + 4} + 3(x̄ · z̄ + 2)^2.

Derive a feature map ϕ(x̄) that corresponds to this kernel. You may refer to Appendix for
useful properties of the exponential function.

Solution:

K(x̄, z̄) = e^{x_1^2 + z_1^2 + 4} + 3(x̄ · z̄ + 2)^2
         = (e^{x_1^2 + 2})(e^{z_1^2 + 2}) + 3(x_1 z_1 + x_2 z_2 + 2)^2
         = (e^2 e^{x_1^2})(e^2 e^{z_1^2}) + 3(x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 x_2 z_2 + 4 x_1 z_1 + 4 x_2 z_2 + 4)
         = (e^2 e^{x_1^2})(e^2 e^{z_1^2}) + 3 x_1^2 z_1^2 + 3 x_2^2 z_2^2 + 6 x_1 z_1 x_2 z_2 + 12 x_1 z_1 + 12 x_2 z_2 + 12

Thus, the feature map that corresponds to this kernel is:

ϕ(x̄) = [e^2 e^{x_1^2}, √3 x_1^2, √3 x_2^2, √6 x_1 x_2, 2√3 x_1, 2√3 x_2, 2√3]^T
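
As a sanity check (illustrative, not part of the exam), a short numpy sketch that verifies ϕ(x̄) · ϕ(z̄) = K(x̄, z̄) at a couple of arbitrary points:

    import numpy as np

    def K(x, z):
        # Kernel from the problem: exp(x1^2 + z1^2 + 4) + 3*(x.z + 2)^2
        return np.exp(x[0]**2 + z[0]**2 + 4) + 3 * (x @ z + 2)**2

    def phi(x):
        # Feature map derived above.
        r3, r6 = np.sqrt(3), np.sqrt(6)
        return np.array([np.exp(2) * np.exp(x[0]**2),
                         r3 * x[0]**2, r3 * x[1]**2, r6 * x[0] * x[1],
                         2 * r3 * x[0], 2 * r3 * x[1], 2 * r3])

    x, z = np.array([0.5, -1.0]), np.array([2.0, 0.3])
    print(np.isclose(phi(x) @ phi(z), K(x, z)))   # expected: True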

3 Classification and Regression
For this question, consider a data set S with five U.S. cities of different latitude and longitude.
The horizontal lines that cross the earth are the lines of latitude. The vertical lines that cross the
earth are the lines of longitude. In this question, assume that each city is represented by
a two dimensional vector, described by the city’s latitude and longitude. We would like
to classify these cities as North and South. For this question, consider Chicago and DC as cities
classified as North and the other three cities as South.

LA: Latitude: 34.1°N, Longitude: 118.2°W


DC: Latitude: 38.9°N, Longitude: 77.1°W
Orlando: Latitude: 28.5°N, Longitude: 81.4°W
Miami: Latitude: 25.8°N, Longitude: 80.2°W
Chicago: Latitude: 41.8°N, Longitude: 87.8°W

(a) What is the training error at convergence achieved by the Perceptron algorithm (with offset)
on this dataset? Why?

Solution: The minimal training error is 0.

Yes, the perceptron algorithm will converge here: the given plot of the five data points
shows that they are linearly separable with an offset.

(b) Do(es) there exist any decision tree(s) that can perfectly classify this dataset? If so, draw
(one of) the smallest of such decision tree(s). If not, explain why this is not possible.

Solution: Yes there is a decision tree that can perfectly classify this dataset, consider the
example below.
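
The tree drawing from the original solution is not reproduced in this text version. A minimal example of such a tree (with an illustrative threshold of 36 degrees latitude, one of many valid choices between DC at 38.9 and LA at 34.1): a single split on latitude x_1 suffices, since both North cities lie above the threshold and all three South cities lie below it.

    x_1 ≥ 36 → North
    x_1 < 36 → South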

(c) Propose a feature mapping ϕ(x̄) (where x̄ = [x1 , x2 ]T and x1 represents latitude and x2
represents longitude) to the lowest dimension possible such that these mapped data points
can be perfectly classified by a linear classifier.

Solution: e.g.,
ϕ : R2 → R
and
ϕ(x1 , x2 ) = x1

4 SVM
1. Suppose that you have a linearly separable dataset and train a hard-margin SVM on it. Is
it possible that all the data points turn out to be support vectors?

Yes ⃝ No
Optional justification

2. Suppose you wish to use a hard-margin SVM without offset to solve a binary classification
problem where some training data points are more important than others. Formally, you are
given a training dataset {(x̄^(i), y^(i))}_{i=1}^{n} where x̄^(i) ∈ R^d, y^(i) ∈ {+1, −1}, and you are also
given a set of known weights p_1, . . . , p_n, where p_i indicates the importance of the i-th data
point and 0 ≤ p_i ≤ 1. You formulate the modified SVM problem as follows:

min_θ̄  (1/2) ∥θ̄∥_2^2
subject to  y^(i) θ̄ · x̄^(i) ≥ p_i,   i = 1, 2, . . . , n.


Now you wish to find the dual form of this modified SVM problem. You may assume that
strong duality holds for this problem and that the duality gap is 0.

(a) Write out the Lagrangian.

List your dual variables here: α1 , . . . , αn

Solution:

L(θ̄, ᾱ) = (1/2)∥θ̄∥_2^2 + Σ_{i=1}^{n} α_i (p_i − y^(i) (θ̄ · x̄^(i)))

(b) Swap the order of the optimization and find the optimal value of θ̄ in terms of the dual
variables.

Solution:

∇_θ̄ L(θ̄, ᾱ) = θ̄ − Σ_{i=1}^{n} α_i y^(i) x̄^(i)

Setting this gradient to zero gives:

θ̄ = Σ_{i=1}^{n} α_i y^(i) x̄^(i)

(c) Plug in the optimal value of θ̄ you obtained in part (b) and derive the dual formulation
of the modified SVM problem.

Solution: The dual formulation of this modified SVM problem is

max_{ᾱ, α_i ≥ 0} min_θ̄  (1/2)∥θ̄∥_2^2 + Σ_{i=1}^{n} α_i (p_i − y^(i) (θ̄ · x̄^(i)))

Using the fact from (b), we get the equivalent optimization problem

max_{ᾱ, α_i ≥ 0}  (1/2) ∥Σ_{i=1}^{n} α_i y^(i) x̄^(i)∥_2^2 + Σ_{i=1}^{n} α_i [ p_i − y^(i) ((Σ_{j=1}^{n} α_j y^(j) x̄^(j)) · x̄^(i)) ]

which simplifies to the following dual formulation of the modified SVM problem

max_{ᾱ, α_i ≥ 0}  Σ_{i=1}^{n} α_i p_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y^(i) y^(j) (x̄^(i) · x̄^(j))
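
As a companion to this derivation, a small numpy sketch (illustrative; the function names and array layout are our assumptions) that evaluates the modified dual objective for given dual variables α, data matrix X with rows x̄^(i), labels y, and importance weights p, and recovers θ̄ from α:

    import numpy as np

    def modified_dual_objective(alpha, X, y, p):
        # sum_i alpha_i p_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)
        Yx = y[:, None] * X                 # rows y_i * x_i
        G = Yx @ Yx.T                       # G[i, j] = y_i y_j (x_i . x_j)
        return alpha @ p - 0.5 * alpha @ G @ alpha

    def primal_theta(alpha, X, y):
        # Recover theta from the dual variables: theta = sum_i alpha_i y_i x_i
        return X.T @ (alpha * y)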

5 Perceptron
You are running the perceptron algorithm without offset on a dataset {(x̄^(i), y^(i))}_{i=1}^{n}, with x̄^(i) ∈
R^d and y^(i) ∈ {+1, −1}. Assume that the feature vectors in your dataset are orthonormal, which
means that they all have unit L2-norm and every two different vectors are orthogonal:

∥x̄^(i)∥_2 = 1,   i = 1, 2, . . . , n,
x̄^(i) · x̄^(j) = 0,   1 ≤ i ≠ j ≤ n.

Suppose you initialize θ̄(0) = 0̄. After making k updates, you arrive at parameter θ̄(k) . Now, you
count the number of times each data point has been used to update the parameter, and let that
number be mi for data point (x̄(i) , y (i) ).

1. Write down the expression of θ̄(k) . You may use k and x̄(i) , y (i) , mi for i ∈ {1, 2, . . . , n} in
your expression.

Solution:

θ̄^(k) = Σ_{i=1}^{n} m_i y^(i) x̄^(i)


2. What is the value of y^(i) θ̄^(k) · x̄^(i) for each i ∈ {1, 2, . . . , n}? Show your calculation and
simplify your answer as much as possible.

Solution: Using part 1 and the fact that the x̄^(i) are orthonormal, we have

y^(i) θ̄^(k) · x̄^(i) = y^(i) (Σ_{j=1}^{n} m_j y^(j) x̄^(j)) · x̄^(i)
                   = y^(i) Σ_{j=1}^{n} m_j y^(j) (x̄^(j) · x̄^(i))
                   = y^(i) (m_i y^(i))
                   = m_i      (since (y^(i))^2 = 1)

3. For the dataset described in this problem, is it possible that m2 = 10? If it is possible, state
the corresponding value of k. If not, list the possible values that m2 can take throughout the
algorithm. Explain your answer.

Solution:
It is not possible for m_2 to be 10.
The possible values m_2 can take are 0 and 1.
Before the algorithm makes any updates on the dataset, all m_i's are 0.
Then, we know from part 2:

y^(i) θ̄^(k) · x̄^(i) = m_i

which means that the algorithm will make one and only one update on each point in the
dataset: after one update on point i, we will have

y^(i) θ̄^(k) · x̄^(i) = m_i = 1 > 0,

so point i is classified correctly from then on (updates on other points do not change this
value, by orthogonality), and m_i never increases past 1.




4. Will the perceptron algorithm eventually terminate on this dataset? Why or why not?

Solution:
Yes, the algorithm will eventually terminate.
As shown in parts 2 and 3, the updates using different points have no influence on each
other.
Therefore, the algorithm will make one and only one update on each point in the dataset
and then terminate.
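
A quick simulation of this argument (an illustrative sketch, not part of the exam), running the perceptron without offset on an orthonormal dataset built from standard basis vectors; it updates each point exactly once and then terminates:

    import numpy as np

    n = 5
    X = np.eye(n)                                  # x_i are orthonormal
    y = np.array([1, -1, 1, -1, 1])                # arbitrary labels for illustration
    theta = np.zeros(n)
    counts = np.zeros(n, dtype=int)                # m_i: number of updates using point i

    converged = False
    while not converged:
        converged = True
        for i in range(n):
            if y[i] * (theta @ X[i]) <= 0:         # misclassified (or on the boundary)
                theta += y[i] * X[i]
                counts[i] += 1
                converged = False

    print(counts)   # expected: [1 1 1 1 1] -- each point is updated exactly once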

6 Decision Trees and Random Forests
1. Consider the following training dataset with 4 datapoints, where X = {0, 1}3 and Y = {0, 1}:

X1 X2 X3 Y
1 1 1 1
1 0 0 1
1 1 0 0
0 0 1 0

You may use 0 log_2(0) = 0, log_2(1) = 0, as well as the following approximations:

a  b   −(a/b) log_2(a/b)
1  2   0.5
1  3   0.53
2  3   0.39
1  4   0.5
3  4   0.31

(a) Calculate the conditional entropy for each feature. Show your work and simplify your
answer to a numerical value.
i. H(Y |X1 )

Solution:

H(Y | X_1) = P(X_1 = 1) H(Y | X_1 = 1) + P(X_1 = 0) H(Y | X_1 = 0)
           = (3/4) [ −(2/3) log_2(2/3) − (1/3) log_2(1/3) ] + (1/4) [ −0 log_2 0 − 1 log_2 1 ]
           = (3/4)(0.39 + 0.53) + 0
           = 0.69

ii. H(Y |X2 )

Solution:

H(Y | X_2) = P(X_2 = 1) H(Y | X_2 = 1) + P(X_2 = 0) H(Y | X_2 = 0)
           = (1/2) [ −(1/2) log_2(1/2) − (1/2) log_2(1/2) ] + (1/2) [ −(1/2) log_2(1/2) − (1/2) log_2(1/2) ]
           = 1

iii. H(Y |X3 )

Solution:

H(Y | X_3) = P(X_3 = 1) H(Y | X_3 = 1) + P(X_3 = 0) H(Y | X_3 = 0)
           = (1/2) [ −(1/2) log_2(1/2) − (1/2) log_2(1/2) ] + (1/2) [ −(1/2) log_2(1/2) − (1/2) log_2(1/2) ]
           = 1
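
For reference, a short Python sketch (illustrative; the helper names are ours, not the exam's) that reproduces these conditional entropies from the 4-point dataset above:

    import numpy as np

    # Dataset from the problem: columns X1, X2, X3 and label Y.
    X = np.array([[1, 1, 1],
                  [1, 0, 0],
                  [1, 1, 0],
                  [0, 0, 1]])
    Y = np.array([1, 1, 0, 0])

    def entropy(labels):
        # H(Y) = -sum_v P(Y=v) log2 P(Y=v), with 0 log2 0 treated as 0.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    def conditional_entropy(feature, labels):
        # H(Y|X) = sum_v P(X=v) H(Y | X=v)
        return sum((feature == v).mean() * entropy(labels[feature == v])
                   for v in np.unique(feature))

    for j in range(3):
        print(f"H(Y|X{j+1}) = {conditional_entropy(X[:, j], Y):.2f}")
    # expected: H(Y|X1) = 0.69, H(Y|X2) = 1.00, H(Y|X3) = 1.00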

(b) Suppose you run the greedy algorithm to build a decision tree by maximizing informa-
tion gain at each split. Which feature would you split on at the root?

Solution: X1

(c) Now you continue running the greedy algorithm, but in order to keep the tree simple,
you set the maximum depth to be 2 (that is, you only allow at most 2 splits before
reaching a leaf). Specifically, whenever you reach depth 2, instead of going on with the
recursion, you stop and make a leaf according to the majority label.
i. Draw the final decision tree obtained from this procedure.
(Note: Whenever there is a tie when you are choosing between different features
or labels, you may break the tie arbitrarily.)

Solution: Any of the following trees:

Tree 1:  split on X1
         X1 = 0 → Y = 0
         X1 = 1 → split on X2
                  X2 = 0 → Y = 1
                  X2 = 1 → Y = 0

Tree 2:  split on X1
         X1 = 0 → Y = 0
         X1 = 1 → split on X2
                  X2 = 0 → Y = 1
                  X2 = 1 → Y = 1

Tree 3:  split on X1
         X1 = 0 → Y = 0
         X1 = 1 → split on X3
                  X3 = 0 → Y = 0
                  X3 = 1 → Y = 1

Tree 4:  split on X1
         X1 = 0 → Y = 0
         X1 = 1 → split on X3
                  X3 = 0 → Y = 1
                  X3 = 1 → Y = 1

ii. What is the training error of this decision tree on the given dataset?

Solution: 1/4

(d) Is it possible to find a decision tree of depth 2 that attains a smaller training error than
the tree you obtained in part (c)? Justify your answer.

Solution: Yes, it is possible to find such a tree with 0 training error:

split on X2
X2 = 0 → split on X3
         X3 = 0 → Y = 1
         X3 = 1 → Y = 0
X2 = 1 → split on X3
         X3 = 0 → Y = 0
         X3 = 1 → Y = 1

2. Choose the correct answer from the choices provided. You may optionally provide a brief
justification for your answers.

(a) Consider the greedy algorithm using information gain maximization. For any dataset,
among all the decision trees that minimize the training error, this algorithm finds a
decision tree with the smallest number of total nodes.

⃝ True False

Solution: False

(b) Consider the greedy algorithm using information gain maximization. For any dataset,
among all the decision trees that minimize the training error, this algorithm finds a
decision tree with the minimum depth.

⃝ True False

Solution: False

(c) Random forest decreases structural error (bias) relative to decision trees.

⃝ True False

Solution: False

(d) Increasing the number of decision trees in a random forest classifier is typically beneficial
for generalization.

True ⃝ False

Solution: True

7 AdaBoost
In class, we discussed the AdaBoost algorithm with decision stumps as the set of weak classifiers. In
this question, we will focus on the feature space X = R2 and instead use axis-aligned rectangles
as the set of weak classifiers.
An axis-aligned rectangle classifier predicts one label for all points inside the axis-aligned rectan-
gle (including boundary), and predicts the other label for all points outside the rectangle. Formally,
it can be parameterized by a, b, c, d ∈ R and q ∈ {+1, −1} and implements the function

h(x̄; a, b, c, d, q) = q if a ≤ x_1 ≤ b and c ≤ x_2 ≤ d, and −q otherwise     (x̄ ∈ R^2).
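
A direct Python rendering of this weak classifier (an illustrative sketch; the function name is ours):

    def rectangle_classifier(x, a, b, c, d, q):
        # Predict q inside the axis-aligned rectangle [a, b] x [c, d] (boundary included),
        # and -q outside of it.
        inside = (a <= x[0] <= b) and (c <= x[1] <= d)
        return q if inside else -q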

You are given the following binary classification dataset {(x̄^(i), y^(i))}_{i=1}^{10}, with x̄^(i) ∈ R^2 and
y^(i) ∈ {+1, −1}. Circles represent data points with a class label −1, and crosses represent data
points with a class label +1. The number next to each data point is its index i in the dataset
(i ∈ {1, 2, . . . , 10}).

1. What is the first weak classifier found by AdaBoost? Describe it using the 5 parameters
a, b, c, d, and q. If there are multiple possible answers, you can just write down any one of
them.
a= ,b= ,c= ,d= ,q= .
Optional justification

Solution: The first weak classifier is an axis-aligned rectangle that contains points 1, 2,
3, 4, 5, 8, and 9. For example,

• a = −1.
• b = 5.
• c = 0.
• d = 4.
• q = −1.

2. What is the weighted error ϵ̂1 of the classifier from part 1?

Solution:

ϵ̂_1 = Σ_{i=1}^{n} w̃_0^(i) [[y^(i) ≠ h(x̄^(i); θ̄_1)]] = (1/10) × 1 + (1/10) × 1 = 2/10 = 1/5

3. What is the updated normalized weight w̃_1^(i) for each data point (x̄^(i), y^(i)) after the first
iteration of AdaBoost?

w̃_1^(1) = ____, w̃_1^(2) = ____, w̃_1^(3) = ____, w̃_1^(4) = ____, w̃_1^(5) = ____,

w̃_1^(6) = ____, w̃_1^(7) = ____, w̃_1^(8) = ____, w̃_1^(9) = ____, w̃_1^(10) = ____.

Optional justification

Solution:

α_1 = (1/2) ln((1 − ϵ̂_1)/ϵ̂_1) = (1/2) ln(4)

Correctly classified points:

w̃_1^(i) = (1/Z_1) w̃_0^(i) exp(−y^(i) α_1 h(x̄^(i); θ̄_1)) = (1/Z_1) × (1/10) × exp(−(1/2) ln(4)) = (1/Z_1) × (1/10) × (1/2) = (1/Z_1) × (1/20)

Incorrectly classified points:

w̃_1^(i) = (1/Z_1) w̃_0^(i) exp(−y^(i) α_1 h(x̄^(i); θ̄_1)) = (1/Z_1) × (1/10) × exp((1/2) ln(4)) = (1/Z_1) × (1/10) × 2 = (1/Z_1) × (1/5)

Finding Z_1 and the new weights:

Z_1 = Σ_{i=1}^{10} w̃_0^(i) exp(−y^(i) α_1 h(x̄^(i); θ̄_1)) = (1/20) × 8 + (1/5) × 2 = 4/5

Plugging Z_1 into the formula for the updated normalized weights, you get 1/16 for all
correctly classified points and 1/4 for all incorrectly classified points.
Correct weights:

w̃_1^(1) = 1/16, w̃_1^(2) = 1/16, w̃_1^(3) = 1/16, w̃_1^(4) = 1/16, w̃_1^(5) = 1/16,
w̃_1^(6) = 1/16, w̃_1^(7) = 1/16, w̃_1^(8) = 1/4, w̃_1^(9) = 1/4, w̃_1^(10) = 1/16.

Alternate Solution: Due to the boosting property,

Σ_{i=1}^{n} w̃_m^(i) [[y^(i) ≠ h(x̄^(i); θ̄_m)]] = 0.5

We will take advantage of the fact that, since this is the first iteration of AdaBoost,
the new normalized weight of all misclassified points will be the same and the new nor-
malized weight of all correctly classified points will be the same. Since there are 2
misclassifications,

Σ_{i=1}^{n} w̃_1^(i) [[y^(i) ≠ h(x̄^(i); θ̄_1)]] = 0.5

w̃_misclassified = 0.5 / 2 = 1/4

Since all the weights are normalized, they add up to 1. If the weights of all the misclas-
sified points add up to 0.5, then the weights of all the correctly classified points add up
to 0.5. Therefore, since there are 8 correctly classified points,

w̃_correctly classified = 0.5 / 8 = 1/16
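
For a numerical cross-check (an illustrative sketch, not part of the exam), the following Python snippet reproduces this first-round weight update, assuming uniform initial weights over 10 points and taking points 8 and 9 as the two misclassified ones, consistent with the solution above:

    import numpy as np

    n = 10
    w = np.full(n, 1.0 / n)                       # initial normalized weights w~_0
    misclassified = np.zeros(n, dtype=bool)
    misclassified[[7, 8]] = True                  # points 8 and 9 in the exam's 1-based indexing

    eps = w[misclassified].sum()                  # weighted error: 2/10
    alpha = 0.5 * np.log((1 - eps) / eps)         # alpha_1 = 0.5 * ln(4)

    # Multiply by exp(+alpha) on mistakes, exp(-alpha) on correct points, then normalize.
    w_new = w * np.exp(np.where(misclassified, alpha, -alpha))
    w_new /= w_new.sum()                          # normalizer Z_1 = 4/5

    print(eps, w_new)   # expected: 0.2, weights 1/16 (correct) and 1/4 (misclassified)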

Appendix
1. The primal formulation of a soft-margin SVM with offset, where C > 0 is a hyperparameter,
is given below.

minimize_{θ̄, b, ξ̄}  (1/2)∥θ̄∥^2 + C Σ_{i=1}^{n} ξ_i
subject to  y^(i) (θ̄ · x̄^(i) + b) ≥ 1 − ξ_i,
            ξ_i ≥ 0,   ∀i = 1, ..., n.

The corresponding dual form is given below.

maximize_ᾱ  Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y^(i) y^(j) x̄^(i) · x̄^(j)
subject to  Σ_{i=1}^{n} α_i y^(i) = 0,
            0 ≤ α_i ≤ C,   ∀i = 1, ..., n.

Recall that the optimal value of the primal variable θ̄ may be expressed in terms of the dual
as θ̄ = Σ_{i=1}^{n} α_i y^(i) x̄^(i).

2. The RBF (Radial Basis Function) kernel with hyperparameter γ is given by:

K_RBF(x̄, z̄) = exp(−γ ∥x̄ − z̄∥^2)

3. The closed form expression for a Linear Regression model with squared loss is given by

θ̄* = (X^T X)^{−1} X^T ȳ

where X = [x̄^(1), . . . , x̄^(n)]^T and ȳ = [y^(1), . . . , y^(n)]^T.
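
An equivalent one-liner in numpy (illustrative; it uses a linear solve rather than forming the inverse explicitly):

    import numpy as np

    def fit_linear_regression(X, y):
        # theta* = (X^T X)^{-1} X^T y, computed via a linear solve.
        return np.linalg.solve(X.T @ X, X.T @ y)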

4. You might find the following properties of the logarithmic and exponential functions useful.

• log(ab) = log(a) + log(b)     • log_b(x) = log_a(x) / log_a(b)     • e^{ln(a)} = a
• log(a^b) = b · log(a)         • log_2(2^a) = a                     • e^{a+b} = e^a e^b

5. Recall the general form for entropy of a random variable Y that can take on values in
{y_1, . . . , y_k}:

H(Y) = − Σ_{i=1}^{k} P(Y = y_i) log_2 P(Y = y_i)

The conditional entropy is given by (assuming X takes values in {x_1, . . . , x_m})

H(Y|X) = Σ_{j=1}^{m} P(X = x_j) H(Y|X = x_j)

where

H(Y|X = x_j) = − Σ_{i=1}^{k} P(Y = y_i|X = x_j) log_2 P(Y = y_i|X = x_j)

6. The greedy algorithm for building decision trees is:

Algorithm 1: Build Decision Tree
BuildTree(DS)
    if y^(i) == y for all examples in DS then
        return y
    else if x̄^(i) == x̄ for all examples in DS then
        return majority label
    else
        j, t = arg min_{j,t} H(Y | [[X_j ≥ t]])
        DS_g = {examples in DS where X_j ≥ t}
        BuildTree(DS_g)
        DS_l = {examples in DS where X_j < t}
        BuildTree(DS_l)
    end

7. The AdaBoost algorithm:

Algorithm 2: AdaBoost
(a) Initialize the observation weights w̃_0^(i) = 1/n, for all i ∈ [1 . . . n]
(b) For m = 1 to M:
    i. Find: θ̄_m = arg min_θ̄ Σ_{i=1}^{n} w̃_{m−1}^(i) [[y^(i) ≠ h(x̄^(i); θ̄)]]
    ii. Given θ̄_m, compute: ϵ̂_m = Σ_{i=1}^{n} w̃_{m−1}^(i) [[y^(i) ≠ h(x̄^(i); θ̄_m)]].
    iii. Compute α_m = (1/2) ln((1 − ϵ̂_m)/ϵ̂_m).
    iv. Update un-normalized weights for all i ∈ [1 . . . n]:
        w_m^(i) ← w̃_{m−1}^(i) · exp(−y^(i) α_m h(x̄^(i); θ̄_m)),
    v. Normalize weights to sum to 1:
        w̃_m^(i) ← w_m^(i) / Z_m,  where Z_m := Σ_i w_m^(i)
(c) Output the final classifier: h_M(x̄) = Σ_{m=1}^{M} α_m h(x̄; θ̄_m)

(Blank page for rough work)

(Blank page for rough work)

