Output 25
(b) Provide examples of how optimization techniques are applied in the training of models such
as linear regression and logistic regression.
Answer : In linear regression, the goal is to find the parameters of the model that minimize the
error between the predicted and actual target values. The model is typically represented as:
\[ y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_p x_p + \epsilon \]
Where:
• y is the target variable.
• x1 , x2 , . . . , xp are the input features.
• θ0 , θ1 , . . . , θp are the model parameters (coefficients).
• ϵ is the error term, usually assumed to be Gaussian noise.
The objective is to minimize the MSE loss function:
\[ L(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
Where:
• $\hat{y}_i = \theta_0 + \theta_1 x_{i1} + \dots + \theta_p x_{ip}$ is the predicted value for the $i$-th data point.
• n is the total number of data points.
The optimization is typically performed using Gradient Descent, which iteratively updates the
parameters θ in the direction of the negative gradient of the loss function. The update rule is:
\[ \theta_j := \theta_j - \alpha \frac{\partial L(\theta)}{\partial \theta_j} \]
Where:
• α is the learning rate.
• $\frac{\partial L(\theta)}{\partial \theta_j}$ is the partial derivative of the loss function with respect to the parameter $\theta_j$.
The gradient descent algorithm continues until the loss function L(θ) converges to a minimum.
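As a concrete illustration (not part of the original answer), the following is a minimal NumPy sketch of batch gradient descent for the MSE loss above; the learning rate, iteration count, and synthetic data are illustrative assumptions.

```python
import numpy as np

def gradient_descent_linear(X, y, alpha=0.01, n_iters=1000):
    """Minimize the MSE loss L(theta) = (1/n) * sum((y - y_hat)**2) by batch gradient descent."""
    n, p = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # prepend a column of ones for the intercept theta_0
    theta = np.zeros(p + 1)
    for _ in range(n_iters):
        residual = Xb @ theta - y          # (y_hat_i - y_i) for every data point
        grad = (2.0 / n) * Xb.T @ residual # dL/dtheta_j = (2/n) * sum_i (y_hat_i - y_i) * x_ij
        theta -= alpha * grad              # update: theta_j := theta_j - alpha * dL/dtheta_j
    return theta

# Illustrative usage on synthetic data (assumed, not from the original answer).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.normal(size=100)
print(gradient_descent_linear(X, y))       # should be close to [1.0, 2.0, -3.0]
```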
In logistic regression, the goal is to predict the probability of a binary outcome (0 or 1) based
on input features. The model is similar to linear regression, but it applies the sigmoid function to
the output:
\[ h_\theta(x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \dots + \theta_p x_p)}} \]
Where:
• hθ (x) is the predicted probability that y = 1.
• x1 , x2 , . . . , xp are the input features.
• θ0 , θ1 , . . . , θp are the model parameters.
The objective in logistic regression is to minimize the cross-entropy loss, which is given by:
\[ L(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(h_\theta(x_i)) + (1 - y_i) \log(1 - h_\theta(x_i)) \right] \]
Where:
• yi is the actual class label for the i-th data point.
• hθ (xi ) is the predicted probability for the i-th data point.
Similar to linear regression, Gradient Descent is used to optimize the parameters in logistic regression. The update rule is:
\[ \theta_j := \theta_j - \alpha \frac{\partial L(\theta)}{\partial \theta_j}, \]
where $\frac{\partial L(\theta)}{\partial \theta_j}$ is the partial derivative of the loss function with respect to the parameter $\theta_j$.
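As with linear regression, a minimal NumPy sketch (not from the original answer; the learning rate and iteration count are illustrative assumptions) shows how the same update rule trains logistic regression on the cross-entropy loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_logistic(X, y, alpha=0.1, n_iters=2000):
    """Minimize the cross-entropy loss by batch gradient descent.

    For the sigmoid model, dL/dtheta_j = (1/n) * sum_i (h_theta(x_i) - y_i) * x_ij,
    so the update has exactly the same form as in linear regression.
    """
    n, p = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # intercept column for theta_0
    theta = np.zeros(p + 1)
    for _ in range(n_iters):
        h = sigmoid(Xb @ theta)            # predicted probabilities h_theta(x_i)
        grad = Xb.T @ (h - y) / n          # gradient of the cross-entropy loss
        theta -= alpha * grad
    return theta
```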
(c) Discuss the role of the loss (or cost) function in this context and how it guides the optimization
process.
Answer:
The loss function plays a crucial role in the optimization process of machine learning models. It
quantifies how well the model’s predictions match the actual target values. The primary objective
in training any machine learning model is to minimize loss, which reflects the error between the
predicted and true values.
By evaluating the predictions against the actual values, the loss function provides feedback to
guide the optimization algorithm in adjusting the model parameters.
The optimization process involves iteratively updating the model parameters (such as θ0 , θ1 , . . . , θp )
in such a way that the loss function is minimized. This can be done using Gradient Descent or other
optimization algorithms. The gradient of the loss function with respect to the model parameters
provides the direction in which the parameters should be updated to reduce the error. The optimiza-
tion process stops when the loss function reaches its minimum, indicating that the model parameters
have been optimized.
In both linear and logistic regression, the loss function serves as a guiding signal for the optimization process, directing the model toward minimizing error. By optimizing the loss function, we improve the model's performance and make it more accurate in predicting future data.
Assume the observed data $X = \{x_1, x_2, \dots, x_n\}$ are drawn independently from a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$, so each observation has density:
\[ f(x_i; \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \]
Thus, the likelihood function L(µ) is:
\[ L(\mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right) \]
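Taking the logarithm turns the product into a sum (an intermediate step filled in here for completeness); up to a constant that does not depend on $\mu$, the log-likelihood is:
\[ \log L(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \]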
Take the derivative of log L(µ) with respect to µ and set it equal to zero:
\[ \frac{d}{d\mu} \log L(\mu) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) \]
Setting the derivative equal to zero to maximize:
\[ \sum_{i=1}^{n} (x_i - \mu) = 0, \]
which is solved by the sample mean, $\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
The MAP estimator instead maximizes the posterior distribution of $\mu$ given the data,
\[ \hat{\mu}_{MAP} = \arg\max_{\mu} p(\mu \mid X), \]
where:
• $p(\mu \mid X)$ is the posterior probability of the parameter $\mu$ given the observed data $X$,
• $p(X \mid \mu)$ is the likelihood of the data given the parameter,
• $p(\mu)$ is the prior probability of the parameter.
Using Bayes' theorem, the posterior can be expressed as:
\[ p(\mu \mid X) = \frac{p(X \mid \mu) \cdot p(\mu)}{p(X)}. \]
Here, $p(X)$ is the evidence (a normalizing constant) which does not depend on $\mu$. Thus, the MAP estimate becomes:
\[ \hat{\mu}_{MAP} = \arg\max_{\mu} \, p(X \mid \mu) \cdot p(\mu). \]
Taking the logarithm, the MAP estimate $\hat{\mu}_{MAP}$ is found by maximizing the log-posterior:
\[ \hat{\mu}_{MAP} = \arg\max_{\mu} \log p(\mu \mid X) = \arg\max_{\mu} \left[ \log p(X \mid \mu) + \log p(\mu) \right]. \]
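Assuming, as implied by the expression differentiated below and by part (c), a Gaussian prior $p(\mu) = \mathcal{N}(\mu_0, \tau^2)$, the log-posterior becomes, up to additive constants:
\[ \log p(X \mid \mu) + \log p(\mu) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 - \frac{1}{2\tau^2} (\mu - \mu_0)^2 + \text{const}. \]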
To maximize this, we differentiate with respect to µ:
\[ \frac{\partial}{\partial \mu} \left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 - \frac{1}{2\tau^2} (\mu - \mu_0)^2 \right) = 0. \]
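Carrying out the differentiation and solving for $\mu$ (a step not written out above) gives:
\[ \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) - \frac{1}{\tau^2} (\mu - \mu_0) = 0 \quad\Longrightarrow\quad \hat{\mu}_{MAP} = \frac{\frac{1}{\sigma^2} \sum_{i=1}^{n} x_i + \frac{\mu_0}{\tau^2}}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}, \]
which is the estimator quoted in part (c) below.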
(c) Compare the MLE and MAP estimators. Discuss how the choice of µ0 and τ 2 affects the MAP
estimator. Answer:
The MLE estimates $\mu$ by maximizing the likelihood function $p(X \mid \mu)$ based solely on the observed data. The MLE for the mean of a normal distribution is:
\[ \hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} x_i. \]
The MAP estimator incorporates prior information by maximizing the posterior distribution $p(\mu \mid X)$, which combines the likelihood $p(X \mid \mu)$ and the prior $p(\mu)$. The MAP estimate for $\mu$ with a Gaussian prior is:
\[ \hat{\mu}_{MAP} = \frac{\frac{1}{\sigma^2} \sum_{i=1}^{n} x_i + \frac{\mu_0}{\tau^2}}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}. \]
The prior mean $\mu_0$ pulls the MAP estimate towards itself: if $\mu_0$ is close to the true mean, the MAP estimate improves over the MLE. A smaller $\tau^2$ (a stronger prior) gives more weight to the prior, pushing the estimate closer to $\mu_0$. As $\tau^2 \to \infty$, the MAP estimate converges to the MLE.
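As a quick numerical illustration (not part of the original answer; the synthetic data, prior mean $\mu_0$, and prior variance $\tau^2$ below are arbitrary assumptions), the two estimators can be compared directly:

```python
import numpy as np

def mle_mean(x):
    """MLE of the mean of a normal distribution: the sample mean."""
    return x.mean()

def map_mean(x, sigma2, mu0, tau2):
    """MAP estimate of the mean under a N(mu0, tau2) prior with known variance sigma2."""
    n = x.size
    return (x.sum() / sigma2 + mu0 / tau2) / (n / sigma2 + 1.0 / tau2)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=20)          # true mean 5, known sigma^2 = 4 (assumed)

print(mle_mean(x))                                   # sample mean
print(map_mean(x, sigma2=4.0, mu0=0.0, tau2=1.0))    # strong prior at 0 pulls the estimate toward mu0
print(map_mean(x, sigma2=4.0, mu0=0.0, tau2=1e6))    # very weak prior: close to the MLE
```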
In text classification, each document is treated as a set of words, and the goal is to classify it
into one of the categories, such as "Sports" or "Politics" in this case. The Naive Bayes classifier
computes the probability of a document belonging to each class and selects the class with the highest
probability. Specifically, for a given document with words X = {x1 , x2 , . . . , xn }, the probability of
the document belonging to class C is given by Bayes’ Theorem:
\[ P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)} \]
If we have a dataset with two classes, "Sports" and "Politics", and words like "win", "team", "election", and "vote" in the vocabulary, we can compute the likelihood of the document belonging
to either class based on the frequencies of these words in the training data. The classifier will
compute the likelihood for each class, multiply it by the prior probability of the class, and select the
class with the highest posterior probability.
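A minimal Python sketch of this computation follows (not part of the original answer). The word counts for the Politics class below are made up purely for illustration; only the Sports totals 50, 60, 15, 10 appear in the text, and the function simply mirrors the smoothed-likelihood-times-prior recipe described above.

```python
import math

def naive_bayes_log_posterior(doc_words, class_word_counts, class_prior, alpha=1):
    """Unnormalized log-posterior: log P(C) + sum_w log P(w | C), with Laplace smoothing."""
    total = sum(class_word_counts.values())
    vocab_size = len(class_word_counts)
    log_post = math.log(class_prior)
    for w in doc_words:
        count = class_word_counts.get(w, 0)
        log_post += math.log((count + alpha) / (total + alpha * vocab_size))
    return log_post

# Hypothetical count table: the Politics counts are invented for illustration only.
counts = {
    "Sports":   {"win": 50, "team": 60, "election": 15, "vote": 10},
    "Politics": {"win": 10, "team": 5,  "election": 60, "vote": 55},
}
doc = ["win", "vote"]
scores = {c: naive_bayes_log_posterior(doc, counts[c], class_prior=0.5) for c in counts}
print(max(scores, key=scores.get), scores)   # class with the highest posterior wins
```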
(b) Using the data above, calculate the probability that a document containing the words win
and vote belongs to the Sports category versus the Politics category. Assume uniform class priors
and apply Laplace smoothing with α = 1.
Answer:
The total number of words in each class is calculated as:
Total words in Sports = 50 + 60 + 15 + 10 = 135
For Politics:
\[ P(\text{Politics} \mid X) = \frac{1}{P(X)} \cdot \frac{1}{2} \cdot 0.065 \cdot 0.479 \]
Since $P(\text{Politics} \mid X)$ is greater than $P(\text{Sports} \mid X)$, the document is more likely to belong to the Politics category.
(c) Interpret the results and discuss any limitations of the Naive Bayes classifier in this context.
Answer:
Naive Bayes assumes that the presence of each word is independent of the others, which may not hold in real-world data, where words are often correlated. Words that rarely or never appear in the training data are handled poorly: without Laplace smoothing they would drive the class probability to zero, and even with smoothing their estimated probabilities are unreliable. The classifier also relies on a predefined vocabulary, which may not capture all important words in the documents.
4 Logistic Regression
Consider a binary classification problem where the goal is to predict whether a student will pass
or fail an exam based on the number of hours spent studying and sleeping. Formulate the logistic regression model for this problem. Answer:
https://github.com/lamvu0607/MLH W 1
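The full working is in the linked repository. As a brief sketch consistent with the sigmoid model defined earlier (the two feature names below simply label the predictors named in the question), the model can be written as:
\[ P(\text{pass} \mid x) = h_\theta(x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_{\text{study}} + \theta_2 x_{\text{sleep}})}}, \]
where $x_{\text{study}}$ and $x_{\text{sleep}}$ are the hours spent studying and sleeping, the parameters $\theta_0, \theta_1, \theta_2$ are fit by minimizing the cross-entropy loss (for example by gradient descent, as above), and a student is predicted to pass when $h_\theta(x) \ge 0.5$.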
6 Regularization Techniques
Regularization is a technique used to prevent overfitting in machine learning models.
(a) Explain the difference between L1 (Lasso) and L2 (Ridge) regularization in the context of linear
regression.
Answer:
Regularization is an important technique in machine learning for preventing overfitting. Mathematically, it adds a penalty term to the loss function that discourages the coefficients from fitting the training data too closely.
The difference between L1 and L2 regularization is that the L2 penalty is the sum of the squares of the weights, while the L1 penalty is the sum of their absolute values.
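Concretely, using the MSE loss defined earlier and writing $\lambda > 0$ for the regularization strength (a symbol introduced here for illustration), the two penalized linear-regression objectives can be written as:
\[ L_{\text{Ridge}}(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \theta_j^2, \qquad L_{\text{Lasso}}(\theta) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\theta_j|. \]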
Solution Uniqueness: L2-norm (Ridge) always provides a unique solution. The penalty is the
sum of the squares of coefficients, leading to a smooth, differentiable function and a single global
minimum. L1-norm (Lasso) solution is not always unique. The penalty is the sum of the absolute
values of coefficients, which can lead to sparse solutions (some coefficients set to zero), but multiple
valid solutions can exist when features are correlated.
Sparsity: L1-norm has the property of producing many coefficients with zero values or very
small values and few large coefficients, allowing it to perform feature selection. L2-norm keeps all
features.
Computational Efficiency: L1-norm does not have an analytical solution. However, its spar-
sity properties allow it to be used with sparse algorithms, improving computational efficiency. L2-
norm has an analytical solution, making its computation more straightforward and efficient.
Stability: The L1-norm solution is sensitive to small changes in the data, especially when features are correlated. The L2-norm shrinks all coefficients smoothly and retains every feature, so its solution is less sensitive to small perturbations of the data.
(b) Given a dataset with multiple features that are highly correlated, discuss which regularization
method would be more appropriate and why. Answer:
In cases of high feature correlation, L2 (Ridge) regularization is typically the better choice: it handles multicollinearity by shrinking correlated coefficients together rather than arbitrarily zeroing out some of them, providing stable, interpretable results without discarding important features.
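A small numerical sketch of this behaviour (assuming scikit-learn is available; the synthetic data and regularization strengths are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Two highly correlated features that carry the same signal.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)        # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 3.0 * x2 + rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients:", ridge.coef_)    # typically splits the weight across both features
print("Lasso coefficients:", lasso.coef_)    # typically keeps one feature and shrinks the other toward zero
```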