
BASIC LOSS FUNCTIONS

THIS BOOK
This book is written (typed) by
Ari, who hails from the South
and has keen interests in
Computer Science, Biology, and
Tamil Literature. Occasionally,
he updates his website, where
you can reach out to him.
https://arihara-sudhan.github.io
LOSS FUNCTION
Loss function serves as a measure of how well a model is
performing. It quantifies the difference between the predicted
values and the target values. We have learned of it a little bit in
MLP book. The objective of our training is to reduce the loss. We
also learned of optimization algorithms such as Gradient Descent,
SGD, Mini Batch Gradient Descent, SGD with Momentum,
AdaGrad, RMSProp and Adam. There are even more
optimization algorithms focused to reduce loss. Obviously, loss
function is the feedback-giver for the network by means of which
the parameters are tuned. Remember, loss function is not an
evaluation metric. Another term used is Cost Function. The loss
function is to capture the difference between the actual and
predicted values for a single datum whereas cost functions
aggregate the difference for the entire training dataset.
☆ MEAN SQUARED ERROR
Mean Squared Error calculates the average of the squared
differences between predicted and actual values:

MSE = (1/n) Σ (yᵢ − ŷᵢ)²

In the last MLP topic, we had a short introduction to error; the
model performs well on the given data when this error is reduced.
It is okay to have (y − ŷ) as the error. But why do we square it?
And why do we average the sum? Squaring the differences between actual
and predicted values makes larger errors stand out more. For
instance, an error of 4 becomes 16 after squaring, while an error of
1 remains 1. This helps the model focus on reducing large errors,
which are often more impactful on overall performance. Without
squaring, positive and negative errors would cancel each other
out. Squaring makes all errors positive, so we get a more accurate
representation of total error regardless of direction
(underestimation or overestimation). Averaging divides the total
error by the number of data points n, resulting in a mean value that
doesn’t change simply because there are more (or fewer) data
points. This makes the error measure comparable across datasets
of any size.
We can either implement it ourselves or use the built-in one.
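A minimal from-scratch sketch in plain Python (in practice, frameworks provide this as a built-in, e.g. PyTorch's nn.MSELoss):

```python
def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared differences."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

# Example: errors of 2 and 4 become 4 and 16 after squaring.
print(mse([0, 0], [2, 4]))  # (4 + 16) / 2 = 10.0
```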

MSE is a convex function, meaning that it has a single global


minimum and no local minima. This property makes it easier to
optimize using gradient-based optimization techniques like
gradient descent. MSE is differentiable everywhere, meaning its
gradient (derivative) can be computed at every point. This is
crucial for gradient-based optimization algorithms (such as
stochastic gradient descent). MSE squares the error for each data
point, which means that larger errors (outliers) have a
disproportionately large impact on the loss. A single large error
can significantly increase the MSE value. Thus, it becomes
sensitive to outliers.
The curve of the MSE loss against the error is a parabola: it grows
quadratically on both sides of zero.
☆ MEAN ABSOLUTE ERROR
Mean Absolute Error calculates the average of the absolute
differences between predicted and actual values:

MAE = (1/n) Σ |yᵢ − ŷᵢ|

It gives linear penalty. A linear penalty means that the error term
grows in direct proportion to the deviation between the predicted
and actual values. This is because the MAE calculates the
absolute difference between each predicted value and the actual
value, rather than squaring the difference as in MSE. We can also
say, each error contributes to the overall loss directly as it is, no
matter how large or small the error, it’s added linearly without
amplification. If the error (difference between the predicted and
actual values) is 3, then it contributes exactly 3 units to the loss. If
the error is 6, it contributes 6 units to the loss. This makes MAE
less sensitive to large errors and outliers. While MAE applies a
Linear Penalty, MSE applies a Quadratic Penalty: in MSE, an error
of 6 incurs a penalty of 6 × 6 = 36. The graph of MAE versus error
is symmetrical and forms a V-shape around zero, with the minimum
achieved when the error is zero. The problem with MAE is that it is
not differentiable at zero.
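A minimal plain-Python sketch of MAE (PyTorch exposes this as nn.L1Loss):

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average of the absolute differences."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# Each error contributes linearly: errors of 3 and -3 average to 3.
print(mae([0, 0], [3, -3]))  # 3.0
```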
☆ HUBER LOSS
Huber Loss combines the essence of both Mean Squared Error
and Mean Absolute Error: it is quadratic for small errors and
linear for large errors, and it is differentiable everywhere. A
parameter delta (δ) controls the transition between the MSE-like
and MAE-like regions:

L(e) = ½ e²             if |e| ≤ δ
L(e) = δ (|e| − ½ δ)    otherwise

Huber Loss is smooth, as there is no abrupt transition between


the quadratic and linear regions.
The smoothness property allows for more stable training
compared to non-smooth loss functions like MAE. Huber Loss is
convex, meaning that it has a single global minimum. This
property ensures that optimization algorithms like gradient
descent will converge to the optimal solution.
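A plain-Python sketch of the piecewise definition above (PyTorch provides this as nn.HuberLoss):

```python
def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        e = abs(t - p)
        if e <= delta:
            total += 0.5 * e ** 2                # quadratic (MSE-like) region
        else:
            total += delta * (e - 0.5 * delta)   # linear (MAE-like) region
    return total / len(y_true)

print(huber([0.0], [0.5]))  # small error: 0.5 * 0.25 = 0.125
print(huber([0.0], [3.0]))  # large error: 1.0 * (3 - 0.5) = 2.5
```

Note that at |e| = δ both branches agree in value and slope, which is what makes the transition smooth.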
☆ CROSS ENTROPY LOSS
Cross-Entropy Loss measures how well the predicted probability
distribution matches the true distribution. When the predicted
probability for the true class is high, the loss is low. If the
predicted probability is low, the loss is high, indicating that the
model's prediction is far from the true label. For a true class t and
a predicted distribution p, the loss for one sample is L = −log(pₜ).
Cross-Entropy Loss is non-linear but convex in the logits when
combined with a softmax output and one-hot targets, which makes it
suitable for optimization with gradient-based methods like
stochastic gradient descent (SGD).
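A minimal sketch for a single sample, assuming the model already outputs a probability distribution (PyTorch's nn.CrossEntropyLoss instead takes raw logits and applies softmax internally):

```python
import math

def cross_entropy(p_pred, true_class):
    """Cross-entropy for one sample: -log of the predicted
    probability assigned to the true class."""
    return -math.log(p_pred[true_class])

# Confident and correct -> low loss; uncertain -> higher loss.
print(cross_entropy([0.1, 0.8, 0.1], 1))   # -log(0.8), small
print(cross_entropy([0.1, 0.8, 0.1], 0))   # -log(0.1), large
```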
☆ BINARY CROSS ENTROPY LOSS
Binary Cross-Entropy Loss is a specific form of Cross-Entropy
Loss used for binary classification tasks. It measures the
dissimilarity between the predicted probability of the positive class
and the true label. If the model predicts a probability close to 1 for
the correct class, the loss is minimal. Conversely, if the prediction
is far from the correct class, the loss increases, indicating that the
model's prediction deviates from the true label. Binary Cross-
Entropy Loss is convex and works well with gradient-based
optimization methods like stochastic gradient descent (SGD) for
training binary classifiers. For binary classification, we typically
want to model the probability of one class (usually the positive
class) using a single output neuron with a sigmoid activation
function. This gives us a probability value between 0 and 1, which
we then compare against the true label (0 or 1). In contrast, Cross-
Entropy Loss for multi-class classification assumes that each class
has its own output neuron, and it compares the predicted
probability distribution across all classes. This would not be
suitable for binary classification where only two outcomes (0 or 1)
are considered.
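A plain-Python sketch of BCE over sigmoid outputs (PyTorch provides nn.BCELoss, or nn.BCEWithLogitsLoss when working with raw logits):

```python
import math

def bce(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy, averaged over samples.
    y_true holds labels in {0, 1}; p_pred holds sigmoid outputs."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A coin-flip prediction (p = 0.5) costs log(2) per sample.
print(bce([1, 0], [0.5, 0.5]))
```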
☆ BALANCED CROSS ENTROPY
Balanced Cross-Entropy Loss is an extension of Binary Cross-
Entropy Loss that takes class imbalance into account. It assigns
different weights to the positive and negative classes, allowing the
model to focus more on the underrepresented class:

L = −[w₁ · y · log(p) + w₂ · (1 − y) · log(1 − p)]

The weights w1 and w2 are typically set inversely proportional to


the class frequencies to balance the impact of each class. For
example, if the dataset is highly imbalanced with more negatives
than positives, you might set w1 (positive class weight) higher than
w2 (negative class weight). We can implement it like the following:
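A minimal sketch in plain Python, with w_pos and w_neg playing the roles of w1 and w2 (PyTorch's nn.BCEWithLogitsLoss offers a related pos_weight argument):

```python
import math

def balanced_bce(y_true, p_pred, w_pos, w_neg, eps=1e-12):
    """Binary cross-entropy with per-class weights:
    w_pos scales the positive (y=1) term, w_neg the negative (y=0) term."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(w_pos * y * math.log(p)
                   + w_neg * (1 - y) * math.log(1 - p))
    return total / len(y_true)

# With w_pos = 2, a mistake on a positive sample costs twice as much.
print(balanced_bce([1], [0.5], w_pos=2.0, w_neg=1.0))  # 2 * log(2)
print(balanced_bce([0], [0.5], w_pos=2.0, w_neg=1.0))  # log(2)
```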

☆ FOCAL LOSS
Balanced Cross-Entropy Loss adjusts the loss by assigning a
higher weight to the underrepresented class, which helps mitigate
class imbalance to some extent. However, the problem is that it
still treats both easy and hard examples in the same way.
Focal Loss modifies the standard cross-entropy loss by adding a
factor that down-weights the loss for well-classified examples and
focuses more on the misclassified examples, especially the hard
ones. It is defined as:

FL(pₜ) = −αₜ (1 − pₜ)^γ log(pₜ)

where pₜ is the predicted probability of the true class, γ ≥ 0 is
the focusing parameter, and αₜ is an optional class-balancing weight.

Focal Loss achieves this by focusing more on hard examples and


down-weighting easy examples, allowing the model to better learn
from rare or difficult examples, thus handling extreme class
imbalance more effectively.
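A plain-Python sketch of the binary form, using the common defaults γ = 2 and α = 0.25 (libraries such as torchvision ship a ready-made version as sigmoid_focal_loss):

```python
import math

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss: the (1 - p_t)^gamma factor shrinks the
    loss of well-classified samples, leaving hard samples dominant."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        p_t = p if y == 1 else 1 - p         # prob. of the true class
        a_t = alpha if y == 1 else 1 - alpha  # class-balancing weight
        total += -a_t * (1 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)

# An easy example (p_t = 0.9) is heavily down-weighted
# compared with a hard one (p_t = 0.1).
print(focal_loss([1], [0.9]), focal_loss([1], [0.1]))
```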
☆ CONTRASTIVE LOSS
The core idea behind Contrastive Loss is to encourage the model
to output a small distance for similar pairs and a large distance for
dissimilar pairs. This is typically used in problems like face
verification, where you want the model to learn to distinguish
between similar and dissimilar instances (e.g., whether two faces
belong to the same person).
L = yᵢ · D² + (1 − yᵢ) · max(0, m − D)²

where D is the Euclidean distance between the two samples. For
similar pairs (yᵢ = 1), the loss is proportional to the square of
the Euclidean distance, i.e., we want to reduce the distance for
similar pairs. For dissimilar pairs (yᵢ = 0), the loss is
proportional to the square of the difference between the margin m
and the Euclidean distance, ensuring the distance between dissimilar
pairs exceeds the margin m. If the distance already exceeds m, the
loss is zero.

In training models with Contrastive Loss, the key idea is to select


pairs of data points — these are typically positive pairs (similar)
and negative pairs (dissimilar). Proper pair selection is crucial for
the model's performance because the learning task is built around
how well the model learns to distinguish between similar and
dissimilar instances.
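A minimal sketch for a single pair of precomputed embeddings, in plain Python (in a real pipeline the embeddings would come from the model):

```python
import math

def contrastive_loss(x1, x2, y, margin=1.0):
    """Contrastive loss for one pair of embeddings.
    y = 1 marks a similar pair, y = 0 a dissimilar pair."""
    d = math.dist(x1, x2)  # Euclidean distance between embeddings
    return y * d ** 2 + (1 - y) * max(0.0, margin - d) ** 2

# Dissimilar pair closer than the margin -> penalized.
print(contrastive_loss([0, 0], [0.5, 0], y=0, margin=1.0))  # 0.25
# Dissimilar pair already beyond the margin -> zero loss.
print(contrastive_loss([0, 0], [2, 0], y=0, margin=1.0))    # 0.0
```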
☆ TRIPLET LOSS
Triplet Loss is another powerful loss function used in metric
learning, particularly to learn an embedding space where similar
items are close together, and dissimilar items are far apart. It's
commonly used in tasks like face verification, image retrieval, and
few-shot learning. The goal of Triplet Loss is to minimize the
distance between an anchor sample and a positive sample (same
class), while maximizing the distance between the anchor and a
negative sample (different class). This is achieved by ensuring that
the anchor-positive pair is closer in the embedding space than the
anchor-negative pair by a given margin.
Given a triplet (A, P, N):
A: Anchor sample
P: Positive sample (same class as the anchor)
N: Negative sample (different class from the anchor)

the loss is:

L = max(0, ∥f(A) − f(P)∥ − ∥f(A) − f(N)∥ + α)

f(x) represents the embedding function of the model for a sample


x. ∥f(A)−f(P)∥ is the Euclidean distance between the anchor and
positive samples. ∥f(A)−f(N)∥ is the Euclidean distance between
the anchor and negative samples. alpha is a margin that ensures
the negative sample is sufficiently far away from the anchor (this
margin is typically a hyperparameter).
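A plain-Python sketch on precomputed embeddings f(A), f(P), f(N) (PyTorch provides this as nn.TripletMarginLoss):

```python
import math

def triplet_loss(a, p, n, alpha=0.2):
    """Triplet loss on embeddings a (anchor), p (positive), n (negative)."""
    d_ap = math.dist(a, p)  # anchor-positive distance
    d_an = math.dist(a, n)  # anchor-negative distance
    return max(0.0, d_ap - d_an + alpha)

# Negative already far beyond the margin -> zero loss.
print(triplet_loss([0, 0], [0, 0], [5, 0], alpha=0.2))  # 0.0
# Positive and negative equally far -> loss equals the margin.
print(triplet_loss([0, 0], [1, 0], [1, 0], alpha=0.2))  # 0.2
```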

Training with Triplet Loss should produce an embedding space in
which samples of the same class cluster together while different
classes are pushed apart.


MERCI
