
What Is Cross Entropy Loss? A Tutorial With Code

A tutorial covering cross entropy loss, with code samples for implementing the cross entropy loss function in PyTorch and TensorFlow, plus interactive visualizations.
One of the most common loss functions used for training neural networks is cross-entropy. In this article, we'll go over its derivation and implementation using PyTorch and TensorFlow and learn how to log and visualize them using Weights & Biases.
Quick Start: TensorFlow Colab | PyTorch Colab



Let's dive in!

What is Cross Entropy Loss?

Cross entropy loss is a metric used in machine learning to measure how well a classification model performs. The loss (or error) is a non-negative number, with 0 indicating a perfect model; there is no upper bound, and the loss grows without limit as the model's predictions become more confidently wrong. The goal is generally to get your model's loss as close to 0 as possible.
Cross entropy loss is often treated as interchangeable with logistic loss (or log loss, sometimes called binary cross entropy loss), but this isn't always correct.
Cross entropy loss measures the difference between the true probability distribution of the labels and the probability distribution predicted by a machine learning classification model. The predicted distribution assigns a probability to every possible outcome: for example, for a fair coin toss it would store 0.5 and 0.5 (heads and tails).
Binary cross entropy loss, on the other hand, stores only one of the two probabilities; the other is implied. It would store only the 0.5 for heads, with the remaining 0.5 for tails assumed. If the first probability were 0.7, it would assume the other was 0.3. It also uses a logarithm (hence "log loss").
This is why binary cross entropy loss (or log loss) is used in scenarios where there are only two possible outcomes, and it's easy to see how it breaks down as soon as there are three or more. That's where categorical cross entropy loss is used: in models with three or more classification possibilities.
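To make the distinction concrete, here's a minimal NumPy sketch (the 0.7/0.3 coin numbers are purely illustrative): for two classes, the binary and categorical forms give the same number.

import numpy as np

# True outcome: "heads" (y = 1); the model predicts P(heads) = 0.7
y, p = 1, 0.7

# Binary cross entropy: only the single probability p is needed
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Categorical cross entropy: the full distribution over both outcomes
target = np.array([0.0, 1.0])   # one-hot target: [tails, heads]
pred = np.array([0.3, 0.7])     # predicted distribution: [tails, heads]
cce = -np.sum(target * np.log(pred))

print(bce, cce)   # both ≈ 0.357 -- identical for two classes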

The Theory Behind Cross Entropy Loss

Let's start from the basics. In deep learning, we typically use a gradient-based optimization strategy to train a model (say f(x)) using some loss function l(f(x_i), y_i), where (x_i, y_i) is an input-output pair. A loss function helps the model determine how "wrong" it is and, based on that "wrongness," improve itself. It's a measure of error. Our goal throughout training is to minimize this error/loss.
The role of a loss function is an important one. If it doesn't penalize wrong outputs in proportion to how wrong they are, it can delay convergence and hurt learning.
There's a learning paradigm called Maximum Likelihood Estimation (MLE), which estimates the model's parameters so that the model best fits the underlying data distribution. Thus, we use a loss function to evaluate how well the model fits that distribution.
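To make that connection concrete, maximizing the likelihood of the training data is equivalent to minimizing the average negative log-likelihood (the notation here, θ for the model parameters and M for the number of training examples, is ours, not from the original text):

\arg\max_\theta \prod_{i=1}^{M} p_\theta(y_i \mid x_i) \;=\; \arg\min_\theta \; -\frac{1}{M} \sum_{i=1}^{M} \log p_\theta(y_i \mid x_i)

The right-hand side is exactly the average cross-entropy between the true labels and the model's predicted distribution, which is why minimizing cross entropy loss performs maximum likelihood estimation.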
Using cross-entropy, we can measure the error (or difference) between two probability distributions.
For example, in the case of Binary Classification, cross-entropy is given by:
l = -\,(y \log(p) + (1-y) \log(1-p))

where:
  • p is the predicted probability, and
  • y is the indicator (0 or 1 in the case of binary classification)
Let's walk through what happens for a particular data point. Let's say the correct indicator is y = 1. In this case,

l = -\,(1 \times \log(p) + (1-1) \log(1-p))

l = -\,(1 \times \log(p)) = -\log(p)

The value of the loss l thus depends only on the probability p assigned to the correct class. The loss function rewards the model for a confident, correct prediction (a high value of p) with a low loss. If p is low, however, -log(p) becomes large, so the model is heavily penalized for a wrong (or under-confident) prediction.
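To see that penalty in action, here's a quick NumPy sketch (the probabilities are arbitrary) of -log(p) for a few values of p:

import numpy as np

# Loss for the positive class (y = 1) is simply -log(p)
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p = {p:<4}  ->  loss = {-np.log(p):.4f}")

# p = 0.99  ->  loss = 0.0101   (confident and correct: tiny loss)
# p = 0.01  ->  loss = 4.6052   (confident and wrong: large loss)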
This extends naturally to a multi-class classification problem with, say, N classes:

l = -\sum_{c=1}^{N} y_c \log(p_c)

where y_c is 1 for the correct class and 0 otherwise, and p_c is the predicted probability for class c.
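As a quick sanity check of the formula, here's a NumPy sketch for a made-up 3-class example:

import numpy as np

# One-hot target for a 3-class problem: the true class is index 2
y = np.array([0.0, 0.0, 1.0])

# Predicted probabilities (e.g. a softmax output), summing to 1
p = np.array([0.1, 0.2, 0.7])

# Only the term for the true class survives the sum
loss = -np.sum(y * np.log(p))
print(loss)   # ≈ 0.357, i.e. -log(0.7)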


Coding The Cross Entropy Loss Function

In this section, we'll go over how to use the cross entropy loss function in both TensorFlow and PyTorch, and how to log it to Weights & Biases.

Coding The Cross Entropy Loss Function With TensorFlow

import wandb
import tensorflow as tf
from wandb.keras import WandbCallback

def build_model():
    ...

    # Define the Model Architecture
    model = tf.keras.Model(inputs=..., outputs=...)

    # Define the Loss Function -> BinaryCrossentropy or CategoricalCrossentropy
    fn_loss = tf.keras.losses.BinaryCrossentropy()

    model.compile(optimizer=..., loss=[fn_loss], metrics=...)

    return model

model = build_model()

# Create a W&B Run
run = wandb.init(...)

# Train the model, letting the callback automatically sync the loss
model.fit(..., callbacks=[WandbCallback()])

# Finish the run and sync metrics
run.finish()
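If you want to sanity-check the loss values outside of model.fit, the Keras loss objects can also be called directly on tensors. Here's a minimal sketch with made-up labels and predictions:

import tensorflow as tf

# Binary case: true labels and predicted probabilities
y_true = tf.constant([[1.0], [0.0], [1.0]])
y_pred = tf.constant([[0.9], [0.2], [0.6]])
bce = tf.keras.losses.BinaryCrossentropy()
print(bce(y_true, y_pred).numpy())   # mean loss over the batch, ≈ 0.28

# Multi-class case: one-hot labels and softmax probabilities
y_true = tf.constant([[0.0, 1.0, 0.0]])
y_pred = tf.constant([[0.1, 0.8, 0.1]])
cce = tf.keras.losses.CategoricalCrossentropy()
print(cce(y_true, y_pred).numpy())   # ≈ 0.223, i.e. -log(0.8)

Both losses default to from_logits=False, meaning they expect probabilities; if your model outputs raw logits, pass from_logits=True when constructing the loss.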

Coding The Cross Entropy Loss Function With PyTorch

import wandb
import torch.nn as nn

# Define the Loss Function
criterion = nn.CrossEntropyLoss()

# Create a W&B Run
run = wandb.init(...)

def train_step(...):
    ...
    loss = criterion(output, target)

    # Back-propagation
    loss.backward()

    # Log to Weights & Biases
    wandb.log({"Training Loss": loss.item()})

# Finish the run and sync metrics
run.finish()
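One detail worth noting: nn.CrossEntropyLoss combines a log-softmax with a negative log-likelihood loss, so it expects raw, unnormalized logits and integer class indices rather than probabilities and one-hot vectors. A minimal standalone sketch with made-up values:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Raw logits for a batch of 2 samples and 3 classes (no softmax applied)
logits = torch.tensor([[2.0, 0.5, 0.3],
                       [0.1, 0.2, 3.0]])

# Integer class indices, one per sample
targets = torch.tensor([0, 2])

loss = criterion(logits, targets)
print(loss.item())   # mean cross entropy over the batch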



Summing Up

And that wraps up our short tutorial on cross entropy loss. To see the full suite of Weights & Biases features, please check out this short 5-minute guide.
  • If you're wondering why we should use negative log probabilities, check out this video 🎥
  • If you want a more rigorous mathematical explanation, check out these:
