Cross Entropy Loss Intro, Applications
In simple terms, entropy indicates the amount of uncertainty of an event. Let’s take
the problem of determining the fair coin toss outcome as an example.
For a fair coin, we have two outcomes, with P[X=H] = P[X=T] = 1/2. Using the Shannon entropy equation:

H(X) = −∑_i P(x_i) log₂ P(x_i)

For the fair coin, H(X) = −(½ log₂ ½ + ½ log₂ ½) = 1 bit, the maximum uncertainty for two outcomes. For a heavily biased coin that lands almost always H or almost always T, both terms approach 0, so the entropy is close to 0.
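To make this concrete, here is a minimal Python sketch (standard library only, illustrative probabilities) that evaluates the Shannon entropy for a fair coin and for a heavily biased one:

import math

def shannon_entropy(probs):
    # H(X) = -sum_i p_i * log2(p_i); terms with p_i == 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))    # fair coin -> 1.0 bit
print(shannon_entropy([0.99, 0.01]))  # heavily biased coin -> ~0.08 bits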
In the data science domain, the cross entropy between two discrete probability distributions is closely related to the Kullback-Leibler (KL) divergence, a measure of how much one distribution differs from another.
Given a true distribution t and a predicted distribution p, the cross entropy between them is given by the following equation:

H(t, p) = −∑_i t_i log(p_i)

For a three-element support S, if t = [t1, t2, t3] and p = [p1, p2, p3], it is not necessary that t_i = p_i for i in {1, 2, 3}.
In the real world, however, the predicted distribution differs from the true one; this mismatch is what the KL divergence measures. As a result, cross-entropy decomposes into the sum of the entropy of the true distribution and the KL divergence between the two:

H(t, p) = H(t) + D_KL(t ‖ p)
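As a quick sanity check, the sketch below (plain Python, with illustrative probabilities) computes cross-entropy, entropy, and KL divergence for two small distributions and confirms the decomposition above:

import math

t = [0.7, 0.2, 0.1]   # "true" distribution (illustrative)
p = [0.5, 0.3, 0.2]   # predicted distribution (illustrative)

cross_entropy = -sum(ti * math.log(pi) for ti, pi in zip(t, p))
entropy       = -sum(ti * math.log(ti) for ti in t)
kl_divergence =  sum(ti * math.log(ti / pi) for ti, pi in zip(t, p))

print(cross_entropy)            # ~0.887
print(entropy + kl_divergence)  # same value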
Now let’s understand how cross-entropy fits in the deep neural network paradigm
using a classification example.
Every classification example has a known class label with a probability of 1.0, whereas every other label has a probability of 0. The model estimates the probability that a given example belongs to each class. Cross-entropy then measures how far these predicted probabilities are from the true label distribution.
Cross-entropy loss is used when adjusting model weights during training. The aim
is to minimize the loss—the smaller the loss, the better the model.
Regression loss functions deal with continuous values, which can take any value between two limits, such as when predicting a country's GDP per capita given its population growth rate, urbanization, historical GDP trends, etc.
Classification loss functions deal with discrete values, like classifying an object
with a confidence value. For instance, image classification into two labels: cat and
dog.
Ranking loss functions predict the relative distances between inputs. An example would be face verification, where we want to know which face images belong to a particular person; candidate faces are ranked by how closely they approximate the target face scan.
Loss landscape during model optimization (source)
Before we jump into the loss functions, let’s discuss activation functions and their
applications. Output activation functions are transformations we apply to the vectors coming out of Convolutional Neural Networks (CNNs) before the loss computation.
💡 Pro tip: Read this detailed piece on PyTorch Loss Functions and start
training your ML models.
Sigmoid
Sigmoid squashes a vector in the range (0, 1). It is applied independently to each
input element in the batch during training. It’s also called the logistic function.
Sigmoid function graph (source)
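For reference, a minimal Python sketch of the sigmoid function (the input values below are purely illustrative):

import math

def sigmoid(x):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(-2.0), sigmoid(0.0), sigmoid(3.0))  # ~0.119, 0.5, ~0.953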
Softmax
Softmax is a function, not a loss. It squashes a vector in the range (0, 1), and all the
resulting elements add up to 1. It is applied to the output scores s.
As noted above, cross-entropy loss is minimized when adjusting model weights during training, and a perfect model has a cross-entropy loss of 0. It typically serves multi-class and multi-label classification.
The cross-entropy between the two distributions is written H(P, Q), where P is the true distribution of the data and Q is the model's predicted distribution. Cross-entropy can be calculated using the probabilities of the events from P and Q:

H(P, Q) = −∑_x P(x) log(Q(x))
With Softmax, the model predicts a vector of probabilities [0.7, 0.2, 0.1]. The sum of
70%, 20%, and 10% is 100%, and the first entry is the most likely one.
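Below is a small sketch (PyTorch, with made-up scores) showing raw scores being converted into such a probability vector via softmax; the exact numbers are illustrative:

import torch

scores = torch.tensor([2.0, 0.75, 0.05])  # raw output scores s (illustrative)
probs = torch.softmax(scores, dim=0)      # squashed into (0, 1), summing to 1

print(probs)        # ~[0.70, 0.20, 0.10]
print(probs.sum())  # 1.0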
Cross-entropy loss formula (source)
The image below illustrates the input parameters to the cross entropy loss function:
Cross-entropy loss parameters
The Probability Mass Function (PMF), which returns a probability, is used when dealing with discrete quantities. For continuous values, where Mean Squared Error is typically used, the Probability Density Function (PDF), which returns a density, applies instead.
As seen in the plots above, in the interval (0, 1), log(x) is negative and −log(x) is positive. Observe how −log(x) approaches 0 as x approaches 1. This observation is useful when parsing the expression for cross-entropy loss.
Since we want to maximize the probability of the output falling into a specific category, the value of μ (the probability of the desired outcome) has to be found from the log-likelihood below:

ℓ(μ) = ∑_i [ x(i) log(μ) + (1 − x(i)) log(1 − μ) ]

Calculating the partial derivative of this log-likelihood with respect to μ and setting it to zero gives:

∂ℓ/∂μ = ∑_i [ x(i)/μ − (1 − x(i))/(1 − μ) ] = 0, which yields μ = (1/n) ∑_i x(i)
For example, in the coin toss, if we are looking for heads, then x(i) is 1 when a head appears and 0 otherwise. This way, the equation above calculates the probability of the desired outcome across all the events.
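As a quick illustration (plain Python, made-up coin flips), the maximum-likelihood estimate of μ is simply the fraction of heads, exactly what setting the derivative above to zero predicts:

flips = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # 1 = heads, 0 = tails (illustrative)

# MLE from the derivative above: mu = (1/n) * sum of x(i)
mu_hat = sum(flips) / len(flips)
print(mu_hat)  # 0.7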
Maximizing the likelihood and minimizing the negative log-likelihood (which measures the error between the predicted and actual values) lead to the same outcome.
If there are n samples in the dataset, then the total cross-entropy loss is the sum of the loss values over all the samples. So the binary cross entropy (BCE) loss to minimize can be formulated as:

BCE = −∑_i [ t_i log(p_i) + (1 − t_i) log(1 − p_i) ], often averaged over the n samples,

where t_i is the true label and p_i is the predicted probability for sample i.
When the true label t is 1, the cross-entropy loss approaches 0 as the predicted probability p approaches 1; when the true label t is 0, the loss approaches 0 as p approaches 0.
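The sketch below (PyTorch, with illustrative targets and probabilities) computes the BCE formula by hand and checks it against nn.BCELoss:

import torch
import torch.nn as nn

t = torch.tensor([1.0, 0.0, 1.0])  # true labels (illustrative)
p = torch.tensor([0.9, 0.2, 0.6])  # predicted probabilities (illustrative)

# Manual binary cross-entropy, averaged over the samples.
manual = -(t * torch.log(p) + (1 - t) * torch.log(1 - p)).mean()
library = nn.BCELoss()(p, t)

print(manual, library)  # both ~0.28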
Multi-class classification
Categorical Cross Entropy is also known as Softmax Loss. It’s a softmax activation
plus a Cross-Entropy loss used for multiclass classification. Using this loss, we
can train a Convolutional Neural Network to output a probability over the N classes
for each image.
In multiclass classification, the raw outputs of the neural network are passed
through the softmax activation, which then outputs a vector of predicted
probabilities over the input classes.
In the specific (and usual) case of multi-class classification, the labels are one-hot, so only the positive class keeps its term in the loss; only one element of the target vector is non-zero. Discarding the elements of the summation that are zero due to the target labels, we can write:

CE = −log( exp(s_p) / ∑_j exp(s_j) )

where s_p is the score of the positive class.
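Here is a minimal sketch (PyTorch, illustrative scores) of this reduced form, checked against the framework's combined softmax-plus-cross-entropy:

import torch
import torch.nn.functional as F

scores = torch.tensor([[2.0, 0.5, -1.0]])  # raw scores s for one sample (illustrative)
target = torch.tensor([0])                 # index of the positive class

# Reduced form: CE = -log( exp(s_p) / sum_j exp(s_j) )
manual = -torch.log(torch.exp(scores[0, 0]) / torch.exp(scores[0]).sum())
library = F.cross_entropy(scores, target)

print(manual, library)  # same value, ~0.241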
Moreover, the loss formulation depends on how you load the dataset: loading the labels as integer class indices instead of one-hot vectors provides greater memory and computation efficiency.
PyTorch
1. Define a dummy input and target to test the PyTorch cross entropy loss function.
2. Import the built-in CrossEntropyLoss() function from the torch.nn module.
3. Define the loss variable and pass in the input and target.
4. Call backward() on the output to compute the gradients used to improve the loss in the next training iteration (a sketch of these steps follows below).
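A minimal sketch following these four steps; the random values are illustrative, so the printed numbers will differ from the sample output shown underneath.

import torch
import torch.nn as nn

input = torch.rand(3, 5, requires_grad=True)  # 3 samples, 5 classes
target = torch.tensor([1, 0, 4])              # integer class indices

cross_entropy_loss = nn.CrossEntropyLoss()
output = cross_entropy_loss(input, target)
output.backward()                             # compute gradients

print(target.shape)
print("Input:", input)
print("Target:", target)
print("Cross Entropy Loss:", output)
print("Input grads:", input.grad)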
torch.Size([3])
Input:
tensor([[0.8671, 0.0189, 0.0042, 0.1619, 0.9805],
[0.1054, 0.1519, 0.6359, 0.6112, 0.9417],
[0.9968, 0.3285, 0.9185, 0.0315, 0.9592]],
requires_grad=True)
Target:
tensor([1, 0, 4])
Cross Entropy Loss:
tensor(1.8338, grad_fn=)
Input grads:
tensor([[ 0.0962, -0.2921, 0.0406, 0.0475, 0.1078],
[-0.2901, 0.0453, 0.0735, 0.0717, 0.0997],
[ 0.0882, 0.0452, 0.0815, 0.0336, -0.2484]])
Binary cross entropy is the special case where the number of classes is 2. In PyTorch, there are nn.BCELoss and nn.BCEWithLogitsLoss. The former expects inputs already normalized with a sigmoid into probabilities, and the latter can take raw, unnormalized logits.
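A brief sketch (illustrative logits) showing the two are equivalent when the sigmoid is applied explicitly before nn.BCELoss:

import torch
import torch.nn as nn

logits = torch.tensor([1.2, -0.8, 0.3])  # raw, unnormalized scores (illustrative)
targets = torch.tensor([1.0, 0.0, 1.0])

loss_from_probs  = nn.BCELoss()(torch.sigmoid(logits), targets)
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_from_probs, loss_from_logits)  # matching values, up to floating-point precision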
TensorFlow
1. Define a dummy input and target to test TensorFlow's cross-entropy loss function.
2. Import the built-in BinaryCrossentropy() function from the tf.keras.losses module.
3. Define the loss variable binary_cross_entropy and pass in the inputs and target.
4. Call the loss object on the targets and predictions to compute the loss value; during training, gradients are computed with tf.GradientTape (see the follow-up sketch after the code).
import tensorflow as tf

# Input labels.
y_true = [[0., 1.],
          [0., 0.]]
y_pred = [[0.5, 0.4],
          [0.6, 0.3]]

binary_cross_entropy = tf.keras.losses.BinaryCrossentropy()
binary_cross_entropy(y_true=y_true, y_pred=y_pred).numpy()
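In TensorFlow, gradients come from tf.GradientTape rather than a backward() call; here is a minimal sketch with the same illustrative data:

import tensorflow as tf

y_true = tf.constant([[0., 1.],
                      [0., 0.]])
y_pred = tf.Variable([[0.5, 0.4],
                      [0.6, 0.3]])

loss_fn = tf.keras.losses.BinaryCrossentropy()

with tf.GradientTape() as tape:
    loss = loss_fn(y_true, y_pred)

grads = tape.gradient(loss, y_pred)  # gradients w.r.t. the predictions
print(loss.numpy(), grads.numpy())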
Key takeaways
Here’s a short recap of what we’ve learned about cross-entropy loss.
Deval Shah
Deval is a senior software engineer at Eagle Eye Networks and a computer vision
enthusiast. He writes about complex topics related to machine learning and deep
learning.