
What Is Cross Entropy Loss? A Tutorial With Code

A tutorial covering cross entropy loss, with code samples for implementing the cross entropy loss function in PyTorch and TensorFlow, plus interactive visualizations.
One of the most common loss functions used for training neural networks is cross-entropy. In this article, we'll go over its derivation and implementation using PyTorch and TensorFlow and learn how to log and visualize them using Weights & Biases.
Quick Start: TensorFlow Colab | PyTorch Colab



Let's dive in!

What is Cross Entropy Loss?

Cross entropy loss is a metric used in machine learning to measure how well a classification model performs. The loss (or error) is a non-negative number, with 0 indicating a perfect model; there is no upper bound, and the loss grows without limit as the model's predictions become more confidently wrong. The goal is generally to get your model's loss as close to 0 as possible.
Cross entropy loss is often treated as interchangeable with logistic loss (or log loss, sometimes called binary cross entropy loss), but this isn't always correct.
Cross entropy loss measures the difference between the true probability distribution of the labels and the probability distribution predicted by a machine learning classification model. The predicted distribution assigns a probability to every possible outcome: for example, for a fair coin toss it would store 0.5 and 0.5 (heads and tails).
Binary cross entropy loss, on the other hand, stores only one of the two probabilities; the other is implied. It would store only the 0.5 for heads, with the remaining 0.5 for tails assumed. If the first probability were 0.7, it would assume the other was 0.3. It also uses a logarithm (hence "log loss").
This is why binary cross entropy loss (or log loss) is used in scenarios where there are only two possible outcomes, and it's easy to see how it breaks down as soon as there are three or more. That's where categorical cross entropy loss is used: in models with three or more classification possibilities.
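To make the distinction concrete, here's a minimal NumPy sketch (the 0.7/0.3 coin numbers are purely illustrative): for two classes, the binary and categorical forms give the same number.

import numpy as np

# True outcome: "heads" (y = 1); the model predicts P(heads) = 0.7
y, p = 1, 0.7

# Binary cross entropy: only the single probability p is needed
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Categorical cross entropy: the full distribution over both outcomes
target = np.array([0.0, 1.0])   # one-hot target: [tails, heads]
pred = np.array([0.3, 0.7])     # predicted distribution: [tails, heads]
cce = -np.sum(target * np.log(pred))

print(bce, cce)   # both ≈ 0.357 -- identical for two classes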

The Theory Behind Cross Entropy Loss

Let's start from the basics. In deep learning, we typically use a gradient-based optimization strategy to train a model (say f(x)) using some loss function l(f(x_i), y_i), where (x_i, y_i) is an input-output pair. A loss function helps the model determine how "wrong" it is and, based on that "wrongness," improve itself. It's a measure of error. Our goal throughout training is to minimize this error/loss.
The role of a loss function is an important one. If it doesn't penalize wrong outputs in proportion to how wrong they are, it can delay convergence and hurt learning.
There's a learning paradigm called Maximum Likelihood Estimation (MLE), which estimates the model's parameters so that the model best fits the underlying data distribution. Thus, we use a loss function to evaluate how well the model fits that distribution.
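To make that connection concrete, maximizing the likelihood of the training data is equivalent to minimizing the average negative log-likelihood (the notation here, θ for the model parameters and M for the number of training examples, is ours, not from the original text):

\arg\max_\theta \prod_{i=1}^{M} p_\theta(y_i \mid x_i) \;=\; \arg\min_\theta \; -\frac{1}{M} \sum_{i=1}^{M} \log p_\theta(y_i \mid x_i)

The right-hand side is exactly the average cross-entropy between the true labels and the model's predicted distribution, which is why minimizing cross entropy loss performs maximum likelihood estimation.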
Using cross-entropy, we can measure the error (or difference) between two probability distributions.
For example, in the case of Binary Classification, cross-entropy is given by:
l = -\,(y \log(p) + (1-y) \log(1-p))

where:
  • p is the predicted probability, and
  • y is the indicator (0 or 1 in the case of binary classification)
Let's walk through what happens for a particular data point. Let's say the correct indicator is y = 1. In this case,

l = -\,(1 \times \log(p) + (1-1) \log(1-p))

l = -\,(1 \times \log(p)) = -\log(p)

The value of the loss l thus depends only on the probability p assigned to the correct class. The loss function rewards the model for a confident, correct prediction (a high value of p) with a low loss. If p is low, however, -log(p) becomes large, so the model is heavily penalized for a wrong (or under-confident) prediction.
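To see that penalty in action, here's a quick NumPy sketch (the probabilities are arbitrary) of -log(p) for a few values of p:

import numpy as np

# Loss for the positive class (y = 1) is simply -log(p)
for p in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"p = {p:<4}  ->  loss = {-np.log(p):.4f}")

# p = 0.99  ->  loss = 0.0101   (confident and correct: tiny loss)
# p = 0.01  ->  loss = 4.6052   (confident and wrong: large loss)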
This extends naturally to a multi-class classification problem with, say, N classes:

l = -\sum_{c=1}^{N} y_c \log(p_c)

where y_c is 1 for the correct class and 0 otherwise, and p_c is the predicted probability for class c.
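As a quick sanity check of the formula, here's a NumPy sketch for a made-up 3-class example:

import numpy as np

# One-hot target for a 3-class problem: the true class is index 2
y = np.array([0.0, 0.0, 1.0])

# Predicted probabilities (e.g. a softmax output), summing to 1
p = np.array([0.1, 0.2, 0.7])

# Only the term for the true class survives the sum
loss = -np.sum(y * np.log(p))
print(loss)   # ≈ 0.357, i.e. -log(0.7)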


Coding The Cross Entropy Loss Function

In this section, we'll go over how to use the cross entropy loss function in both TensorFlow and PyTorch, and how to log it to Weights & Biases.

Coding The Cross Entropy Loss Function With TensorFlow

import wandb
import tensorflow as tf
from wandb.keras import WandbCallback

def build_model():
    ...

    # Define the Model Architecture
    model = tf.keras.Model(inputs=..., outputs=...)

    # Define the Loss Function -> BinaryCrossentropy or CategoricalCrossentropy
    fn_loss = tf.keras.losses.BinaryCrossentropy()

    model.compile(optimizer=..., loss=[fn_loss], metrics=...)

    return model

model = build_model()

# Create a W&B Run
run = wandb.init(...)

# Train the model, letting the callback automatically sync the loss
model.fit(..., callbacks=[WandbCallback()])

# Finish the run and sync metrics
run.finish()
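If you want to sanity-check the loss values outside of model.fit, the Keras loss objects can also be called directly on tensors. Here's a minimal sketch with made-up labels and predictions:

import tensorflow as tf

# Binary case: true labels and predicted probabilities
y_true = tf.constant([[1.0], [0.0], [1.0]])
y_pred = tf.constant([[0.9], [0.2], [0.6]])
bce = tf.keras.losses.BinaryCrossentropy()
print(bce(y_true, y_pred).numpy())   # mean loss over the batch, ≈ 0.28

# Multi-class case: one-hot labels and softmax probabilities
y_true = tf.constant([[0.0, 1.0, 0.0]])
y_pred = tf.constant([[0.1, 0.8, 0.1]])
cce = tf.keras.losses.CategoricalCrossentropy()
print(cce(y_true, y_pred).numpy())   # ≈ 0.223, i.e. -log(0.8)

Both losses default to from_logits=False, meaning they expect probabilities; if your model outputs raw logits, pass from_logits=True when constructing the loss.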

Coding The Cross Entropy Loss Function With PyTorch

import wandb
import torch.nn as nn

# Define the Loss Function
criterion = nn.CrossEntropyLoss()

# Create a W&B Run
run = wandb.init(...)

def train_step(...):
    ...
    loss = criterion(output, target)

    # Back-propagation
    loss.backward()

    # Log to Weights & Biases
    wandb.log({"Training Loss": loss.item()})

# Finish the run and sync metrics
run.finish()
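One detail worth noting: nn.CrossEntropyLoss combines a log-softmax with a negative log-likelihood loss, so it expects raw, unnormalized logits and integer class indices rather than probabilities and one-hot vectors. A minimal standalone sketch with made-up values:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Raw logits for a batch of 2 samples and 3 classes (no softmax applied)
logits = torch.tensor([[2.0, 0.5, 0.3],
                       [0.1, 0.2, 3.0]])

# Integer class indices, one per sample
targets = torch.tensor([0, 2])

loss = criterion(logits, targets)
print(loss.item())   # mean cross entropy over the batch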



Summing Up

And that wraps up our short tutorial on cross entropy loss. To see the full suite of Weights & Biases features, please check out this short 5-minute guide.
  • If you're wondering why we should use negative log probabilities, check out this video 🎥
  • If you want a more rigorous mathematical explanation, check out these:
