What is Backpropagation?

A guide to one of the most fundamental tools in machine learning.
Backpropagation, developed by David Rumelhart, Geoffrey Hinton, and Ronald Williams, is a fundamental component of modern deep learning. Despite some uncertainty about whether backpropagation is the path to human-level AI, it remains the most promising option we have for optimizing large neural networks. Although the derivations behind backpropagation do not fully explain the remarkable performance we've seen when combining it with large transformers and large amounts of data, something compelling happens when millions of neurons work together to learn from complex data, in a seemingly similar fashion to the brain. To build a better intuition for how backpropagation works, we will walk through the equations that allow it to function, and then program a simple example that shows exactly how the algorithm works.


What We'll Cover

The Perceptron
The Multi-Layer Perceptron
Backpropagation
Gradient Descent
The Chain Rule
Writing Backpropagation in NumPy
Algorithm Verification

The Perceptron

Starting out, it’s helpful to understand the overarching idea of a perceptron. The perceptron was created by Frank Rosenblatt in 1957 and represents the simplest form of a neural network, intended to simulate the decision-making process of the human brain. It can be considered a foundational building block of neural networks, representing a single neuron with adjustable weights that are multiplied by each input. The perceptron takes a set of inputs, applies a dot product between the inputs and weights (a weighted sum), and passes this weighted sum through a non-linear activation function to produce an output. This model, while simple, was inspired by biological neural networks and set the stage for more complex architectures. In the diagram below, we denote our inputs by 'x', which are fed into a single perceptron that multiplies them by the weight values (denoted by 'w') and adds an additive bias. This intermediate output is denoted by 'v', and it is then fed through an activation function, which produces the final output of the perceptron.
[1]
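To make this concrete, below is a minimal NumPy sketch of a single perceptron. The input, weight, and bias values are made up purely for illustration, and we use the original hard-threshold (step) activation:

import numpy as np

# Made-up inputs, weights, and bias for illustration
x = np.array([0.5, -1.0, 2.0])  # inputs
w = np.array([0.7, 0.2, 0.4])   # one adjustable weight per input
b = -0.3                        # additive bias

v = np.dot(w, x) + b            # weighted sum (the 'v' in the diagram)
output = 1 if v > 0 else 0      # step activation: fire only above the threshold

print(v, output)                # v = 0.65, so the perceptron outputs 1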

The Multi-Layer Perceptron

The multilayer perceptron (MLP), often referred to as a vanilla neural network, extends this concept by adding multiple layers of perceptrons, allowing for the modeling of more complex functions. Each layer's output serves as the input to the next layer, with the final layer producing the network's output. The introduction of multiple layers allows the network to learn complex non-linear relationships, which a single perceptron cannot do. Below is a diagram of a multi-layer perceptron, which uses the Greek letter phi (φ) to denote the activation function. The letter 'd' is used to denote the correct label, and 'e' is used to represent the error between the predicted output and the ground truth. Don't worry too much about the notational use of 'i' and 'j'; just know that j is used to index a specific neuron, and i is used as an index for the weights of a certain neuron j. Also, note that k is simply the index for the current iteration of training, so it is not important for understanding the overall concept of backpropagation.
[1]
The original form of the perceptron was based on a threshold activation function, which is a type of step function. This can be considered a non-linear function, and it outputs only two values (for example, 0 and 1) depending on whether the input exceeds a certain threshold.
The goal of backpropagation is to optimize the neural network by adjusting its weights and biases to minimize the error. Specifically, it aims to find the gradient of the error with respect to the weights. This process involves two main phases: the forward pass, where inputs are passed through the network to get the output, and the backward pass, where the gradient of the error is computed through each layer, starting from the output and moving backward through the network. This information is then used to update the weights in a way that minimizes the error, typically through a gradient descent optimization algorithm.

Backpropagation

Backpropagation is a cornerstone of neural network training, encapsulating the essence of how these models learn from data to improve their predictions. By delving into the mechanics of backpropagation, we gain invaluable insights into the inner workings of neural networks, shedding light on the intricate processes that empower deep learning models to tackle complex tasks.

Gradient Descent

The journey to mastering backpropagation begins with an appreciation for the role of gradient descent in the optimization process. Gradient descent is a methodical approach where we compute derivatives at various points of a function to guide our search for the function's minimum. Imagine observing a parabola, and calculating gradients at several points along its curve. Each gradient points us in the direction we need to move to find the lowest point of the parabola, the point where the function attains its minimum value.
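As a minimal illustration of this idea, here is gradient descent on the parabola f(x) = x²; the starting point and learning rate are arbitrary values chosen for this sketch:

# Gradient descent on f(x) = x^2, whose minimum is at x = 0
x = 5.0              # arbitrary starting point
learning_rate = 0.1

for step in range(50):
    gradient = 2 * x               # derivative of x^2 at the current point
    x -= learning_rate * gradient  # step against the gradient, toward the minimum

print(x)  # very close to 0, the minimum of the parabola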


Applying this concept to neural networks involves a critical step: calculating a specific gradient for each neuron, known as the local gradient. Determining the local gradients is pivotal, as these gradients, when combined with their corresponding inputs, inform us on how to adjust the weights. Subsequently, adjusting the weights based on these calculations propels the network towards minimizing error. This iterative process of adjustment continues until the network optimizes its weights to minimize error effectively.
The relationship between gradient descent and backpropagation is that backpropagation computes the gradients of the cost function that are necessary for the gradient descent algorithm to update the weights and biases in the direction that minimizes the cost.

The Chain Rule

We’ll begin with understanding backpropagation's derivation using the chain rule. The chain rule is a fundamental principle in calculus that provides a method to compute the derivative of a composite function. In the context of a neural network, it enables us to calculate the gradient of the loss with respect to the weights.
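As a quick refresher, for a composite function the chain rule states:

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$$

For example, differentiating a sigmoid applied to a weighted input gives $\frac{d}{dw}\varphi(wx) = \varphi'(wx) \cdot x$, which is exactly the pattern we are about to apply, term by term, to the full network.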
In simple terms, we are ultimately looking to find the gradient of the total error with respect to the weights, which is shown below:

$$\frac{\partial E(k)}{\partial w_{ji}(k)}$$
In order to solve for this value, we can use the chain rule to write an equation that will allow us to calculate it based on the results from the forward pass and the loss calculation. The equation is shown below:
Equation 1:
$$\frac{\partial E(k)}{\partial w_{ji}(k)} = \frac{\partial E(k)}{\partial e_j(k)} \cdot \frac{\partial e_j(k)}{\partial y_j(k)} \cdot \frac{\partial y_j(k)}{\partial v_j(k)} \cdot \frac{\partial v_j(k)}{\partial w_{ji}(k)}$$
So right now, you may be dreading this technical math jargon. However, I will go through this equation step-by-step and break it down into simple pieces, and hopefully it will give you a deep understanding of how backpropagation works.
Now that we have the equation written out, we can relate each term in the equation to functions used in a neural network. The first term is the gradient of the total error with respect to the individual error of each neuron:

$$\frac{\partial E(k)}{\partial e_j(k)}$$
So essentially, we are differentiating the following equation (the mean squared error) with respect to e_j(k).
Equation 2:
$$E(k) = \frac{1}{2} \sum_{j} e_j^2(k)$$
Based on the power rule, we see that the derivative simply comes out to the following, which is the instantaneous error for the neuron:
Equation 3:
$$\frac{\partial E(k)}{\partial e_j(k)} = e_j(k)$$
Where e(k) is described as the following:
Equation 4:
$$e_j(k) = d_j(k) - y_j(k)$$

Next, we need to solve for the next term in the original backpropagation formula:

$$\frac{\partial e_j(k)}{\partial y_j(k)}$$
Now, we need to differentiate e with respect to y. This is the derivative of Equation 4 with respect to y_j(k), which simply comes out to -1!
Equation 5:
$$\frac{\partial e_j(k)}{\partial y_j(k)} = -1$$
Next, we are ready to move on to solving for the third term in the backpropagation formula, which is the following:

$$\frac{\partial y_j(k)}{\partial v_j(k)}$$
This term is the gradient of the output of the neuron after being passed through the activation function (y), with respect to the output of the neuron before being passed through the activation function (v). This is simply the derivative of the activation function, which is denoted by the Greek letter phi (φ)! As long as we choose an activation function that can easily be differentiated in closed form, this is a very easy value to obtain.
Equation 6:
$$\frac{\partial y_j(k)}{\partial v_j(k)} = \varphi_j'(v_j(k))$$
In our network, we will use the sigmoid activation function, which has a very simple derivative formula that can be computed easily. We will cover this activation function along with its derivative in the code later on in this tutorial.
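For reference, here are the sigmoid and its derivative, written in the notation of the equations above:

$$\varphi(v) = \frac{1}{1 + e^{-v}}, \qquad \varphi'(v) = \varphi(v)\big(1 - \varphi(v)\big)$$

Notice that once the forward pass has produced $y = \varphi(v)$, the derivative comes essentially for free as $y(1 - y)$.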
We are finally at the last term of the equation, which is the gradient of the output of the neuron before going through the activation function (v) with respect to the weights (w):

$$\frac{\partial v_j(k)}{\partial w_{ji}(k)}$$
Since the output of the neuron before being passed to the activation function is calculated by multiplying the weights by the inputs to the neuron (v = w.dot(input)), the derivative of v with respect to w is simply the inputs to the neuron! Note that the input here is denoted as y, which is a bit confusing; this is because the input to neuron j is just the output y of neuron i in the previous layer. Confusing notation is par for the course when it comes to calculus.

Equation 7:
$$\frac{\partial v_j(k)}{\partial w_{ji}(k)} = y_i(k)$$
Now that we have successfully broken down our original backpropagation equation into simpler parts, we can rewrite Equation 1 as the following:
Equation 8:
$$\frac{\partial E(k)}{\partial w_{ji}(k)} = -e_j(k)\, \varphi_j'(v_j(k))\, y_i(k)$$

In practice, we take the negative of this gradient (so that we move downhill on the error surface), multiply it by a small scalar known as the learning rate, and add the result to our existing weights. This is also known as the 'delta rule' and is shown below, where the Greek letter alpha is our learning rate:
Equation 9:
$$w_{ji}(k+1) = w_{ji}(k) + \alpha\, e_j(k)\, \varphi_j'(v_j(k))\, y_i(k)$$
One of the tricky parts about the backpropagation algorithm is that the equations for the weight updates are slightly different depending on whether you are calculating the updates for the output layer or for a hidden layer. The equations below show the difference between the two:
Equation 10:
$$\Delta w_{ji}(k) = \alpha\, e_j(k)\, \varphi_j'(v_j(k))\, y_i(k) \qquad \text{(output layer)}$$
$$\Delta w_{ji}(k) = \alpha\, \varphi_j'(v_j(k)) \Big[ \sum_{l} \delta_l(k)\, w_{lj}(k) \Big]\, y_i(k) \qquad \text{(hidden layer)}$$
Note the subscript l in the bottom equation: it indexes the neurons in the layer above the current layer.
The reason behind this difference is that the output-layer equation relies on the error of the individual neuron. The problem is that this error cannot easily be calculated for hidden layers (since there is no established target for the outputs of these hidden layers). To solve this, we calculate the error recursively, given the error signals of the neurons to which the hidden neuron is connected (the next layer). This recursive calculation is shown in the bottom equation above, and I'll show it below for clarity:

$$\delta_j(k) = \varphi_j'(v_j(k)) \sum_{l} \delta_l(k)\, w_{lj}(k)$$
Having derived the weight update equation, it's beneficial to pause and clarify a few aspects that might cause confusion. We established the equation for weight updates that involve multiplying the local gradients by the inputs to the neurons. This multiplication is crucial because the input serves as a scaling factor in the weight update equation. When we multiply the local gradient—the measure of how changes in the weight affect the error—by the input, we're essentially adjusting the weight's update in proportion to the input's magnitude. This process ensures that the adjustment is appropriately scaled, reflecting the input's influence on the neuron's activation and, subsequently, on the network's overall error.
I'll add the equations for the local gradients down below, and you will see that they are very similar to the equations listed above, simply without the scaling by the inputs:
Equation 11:
$$\delta_j(k) = e_j(k)\, \varphi_j'(v_j(k)) \qquad \text{(output layer)}$$
$$\delta_j(k) = \varphi_j'(v_j(k)) \sum_{l} \delta_l(k)\, w_{lj}(k) \qquad \text{(hidden layer)}$$
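To make the delta rule concrete, here is a tiny numeric sketch of a single weight update for one output neuron; all of the values are made up for illustration:

import numpy as np

# Made-up values for one output neuron with a sigmoid activation
y_in = np.array([0.6, 0.1])  # inputs to the neuron (outputs of the previous layer)
w = np.array([0.5, -0.4])    # current weights
d = 1.0                      # desired output
alpha = 0.1                  # learning rate

v = np.dot(w, y_in)          # weighted sum: 0.5*0.6 + (-0.4)*0.1 = 0.26
y = 1 / (1 + np.exp(-v))     # sigmoid output
e = d - y                    # error signal
delta = e * y * (1 - y)      # local gradient for an output neuron (Equation 11)
w += alpha * delta * y_in    # delta rule: local gradient scaled by the inputs
print(w)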


Writing Backpropagation in NumPy

Ok, so we have worked out the fundamental equations for backpropagation, and we are now ready to implement them in Python! I will do a simple example using a toy dataset and build a very small neural network using only NumPy.
First we will initialize a new Weights and Biases project, then we will create our toy dataset, which will have 4 samples, with 3 features each:
import numpy as np
import wandb

wandb.init(project="backprop", entity="byyoung3")

# Inputs (4 samples, 3 features each)
inputs = np.array([[0.1, 0.2, 0.7],
                   [0.3, 0.7, 0.2],
                   [0.6, 0.9, 0.1],
                   [0.5, 0.3, 0.8]])

# Expected outputs (4 samples, 1 output each)
targets = np.array([[1], [0], [1], [0]])
Next, we will create our activation function, which is a sigmoid. Additionally, we will create a function that implements the closed-form derivative of the activation function:
# Define the sigmoid activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Note: this takes the sigmoid *output* a = sigmoid(x), since
# sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
def sigmoid_prime(a):
    return a * (1 - a)
Ok, now we are ready to define the weights for our neural network, which will have 2 layers and a hidden layer of 4 neurons:
# Initialize weights for the 2-layer network (for simplicity, we omit bias terms)
inputSize = 3 # Number of input neurons (features)
hiddenSize = 4 # Number of hidden neurons
outputSize = 1 # Number of output neurons

# Weights for input to hidden layer
W1 = np.random.uniform(size=(inputSize, hiddenSize))
# Weights for hidden to output layer
W2 = np.random.uniform(size=(hiddenSize, outputSize))
Ok, now I will write out a simple training loop to run the backpropagation algorithm. Additionally, we will log the error for our model using Weights & Biases.
learning_rate = 0.1

# Training through forward and backward passes
for epoch in range(1000000):  # Number of training iterations
    ##### Forward pass
    v1 = np.dot(inputs, W1)
    y1 = sigmoid(v1)

    v2 = np.dot(y1, W2)
    y2 = sigmoid(v2)

    ##### Backward pass
    # Error in output
    output_error = targets - y2
    output_delta = output_error * sigmoid_prime(y2)
    dEdW2 = y1.T.dot(output_delta)

    # Recursive calculation of hidden error using error from output layer
    hidden_error = output_delta.dot(W2.T)
    hidden_delta = hidden_error * sigmoid_prime(y1)
    dEdW1 = inputs.T.dot(hidden_delta)

    # Weight updates (delta rule)
    W2 += dEdW2 * learning_rate
    W1 += dEdW1 * learning_rate

    if not epoch % 10:
        # Print and log the mean squared error to Weights & Biases
        print(0.5 * np.mean(np.square(output_error)))
        wandb.log({'loss': np.mean(np.square(output_error))}, step=epoch)

wandb.finish()
In the forward pass of the network, the inputs are multiplied by the first layer's weights W1 and passed through a sigmoid activation to produce y1; this is then processed by the second layer's weights W2 and another sigmoid to produce the final output y2, which represents the network's prediction.
In the backward pass of the training loop, the difference between the targets and the predictions (output_error) is calculated first. This error is used to compute the gradients for the output layer (output_delta) by multiplying it with the derivative of the sigmoid function (sigmoid_prime). The error is then propagated back through the network, computing hidden_error and hidden_delta for the hidden layer. Finally, the weights W2 and W1 are updated by scaling these gradients by the learning rate and adding the result to the existing weights.
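Before reaching for a full framework, one quick sanity check you can run on the NumPy code above is a finite-difference comparison: nudge a single weight, recompute the loss, and compare the numerical slope to the analytic gradient. Below is a minimal sketch that assumes the inputs, targets, W1, W2, and sigmoid defined earlier; it should be run right after dEdW2 is computed and before the weights are updated:

# Finite-difference check on one weight of W2 (run before the weight update)
eps = 1e-5
i, j = 0, 0  # index of the weight to test

def loss_fn():
    y1 = sigmoid(np.dot(inputs, W1))
    y2 = sigmoid(np.dot(y1, W2))
    return 0.5 * np.sum(np.square(targets - y2))

W2[i, j] += eps
loss_plus = loss_fn()
W2[i, j] -= 2 * eps
loss_minus = loss_fn()
W2[i, j] += eps  # restore the original weight

numeric_grad = (loss_plus - loss_minus) / (2 * eps)
# dEdW2 points in the descent direction, so it should match the negative gradient
print(numeric_grad, -dEdW2[i, j])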
Now if we run this, we will see the loss curve looks great! Here are the logs for this training run:

[W&B loss chart for run: sparkling-moon-2]


Algorithm Verification

I've found it's really easy to make a small error when writing backpropagation from scratch, so I will write another script to verify the integrity of our code. We will use PyTorch and W&B to verify that our algorithm is correctly implemented.
To achieve this, the following steps outline a method to compare the loss values from your custom implementation with those obtained from a PyTorch model. By ensuring both implementations yield similar loss values for the same inputs and model parameters, you can verify the correctness of your custom implementation.
First, define a PyTorch model that has the same architecture as your custom model. This includes the same number of layers, the same activation functions, and the same initial weights, if possible. In the provided code, a simple two-layer neural network is defined with sigmoid activation functions.
import numpy as np
import torch
import torch.nn as nn
import wandb

# Initialize Weights and Biases logging
wandb.init(project="bp_verification", entity="byyoung3")

# Sample inputs and expected outputs
inputs = np.array([[0.1, 0.2, 0.7],
                   [0.3, 0.7, 0.2],
                   [0.6, 0.9, 0.1],
                   [0.5, 0.3, 0.8]])
targets = np.array([[1], [0], [1], [0]])

# Sigmoid and its derivative
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Takes the sigmoid output a = sigmoid(z)
def sigmoid_prime(a):
    return a * (1 - a)

# Network architecture
num_inputs = 3
num_hidden = 4
num_outputs = 1

# Weight initialization
W1 = np.random.uniform(size=(num_inputs, num_hidden))
W2 = np.random.uniform(size=(num_hidden, num_outputs))

# Convert numpy arrays to torch tensors
inputs_torch = torch.tensor(inputs, dtype=torch.float32)
targets_torch = torch.tensor(targets, dtype=torch.float32)

# Neural Network in PyTorch
class NeuralNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(NeuralNetwork, self).__init__()
        self.hidden = nn.Linear(input_size, hidden_size)
        self.output = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.hidden(x)
        x = self.sigmoid(x)
        x = self.output(x)
        x = self.sigmoid(x)
        return x

# Initialize the neural network
net = NeuralNetwork(num_inputs, num_hidden, num_outputs)

# Copy weights from numpy to torch model
# (nn.Linear stores weights as (out_features, in_features), hence the transpose;
# biases are set to zero to match the bias-free NumPy network)
net.hidden.weight.data = torch.tensor(W1.T, dtype=torch.float32)
net.hidden.bias.data = torch.zeros(num_hidden, dtype=torch.float32)
net.output.weight.data = torch.tensor(W2.T, dtype=torch.float32)
net.output.bias.data = torch.zeros(num_outputs, dtype=torch.float32)

# Loss and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
Now we can write two training loops, one for each implementation, and verify that both algorithms produce nearly identical outputs (note that due to some of the low-level implementation details of torch, it's not always feasible to get exactly the same results, but as long as they are relatively close, it's fair to say your algorithm has been implemented correctly). Here are the training loops for both implementations:
# Training loop for NumPy network
for epoch in range(100):  # Reduced number of iterations for example
    # Forward pass
    v1 = np.dot(inputs, W1)
    y1 = sigmoid(v1)
    v2 = np.dot(y1, W2)
    y2 = sigmoid(v2)

    # Compute error
    output_error_np = targets - y2
    # loss_np = 0.5 * np.mean(np.square(output_error_np))
    # We drop the 0.5 to remain consistent with the torch implementation;
    # this is fine in practice, since it's a constant scalar value
    loss_np = np.mean(np.square(output_error_np))

    # Backward pass
    output_delta = output_error_np * sigmoid_prime(y2)
    dEdW2 = np.dot(y1.T, output_delta)
    hidden_error = np.dot(output_delta, W2.T)
    hidden_delta = hidden_error * sigmoid_prime(y1)
    dEdW1 = np.dot(inputs.T, hidden_delta)

    # Update weights
    W1 += dEdW1 * 0.1
    W2 += dEdW2 * 0.1

    if not epoch % 10:
        # Log to Weights & Biases
        wandb.log({'loss_np': loss_np}, step=epoch)

wandb.finish()

wandb.init(project="bp_verification", entity="byyoung3")

# Training loop for PyTorch network
for epoch in range(100):  # Reduced number of iterations for example
    # Forward pass
    outputs_torch = net(inputs_torch)
    # Compute loss
    loss_torch = criterion(outputs_torch, targets_torch)

    # Backward pass and optimization
    optimizer.zero_grad()
    loss_torch.backward()
    optimizer.step()

    if not epoch % 10:
        # Log to Weights & Biases
        wandb.log({'loss_torch': loss_torch.item()}, step=epoch)

# Finish Weights & Biases logging session
wandb.finish()

Below are the logs for the run. You will see that the losses are nearly identical, indicating that our backpropagation algorithm updates the weights in a very similar fashion to PyTorch!

[W&B loss charts for the two runs]

The exploration of backpropagation, from its theoretical underpinnings to practical implementation, underscores its foundational role in the evolution and current successes of deep learning. Despite emerging criticisms and alternative proposals for achieving artificial general intelligence (AGI), backpropagation remains a linchpin in our toolkit for training neural networks. Its ability to efficiently compute gradients across layers of complex architectures has enabled the deep learning field to tackle problems once thought insurmountable, pushing the boundaries of what machines can learn. I hope you enjoyed this tutorial, and feel free to drop any questions in the comments below. Also, if you want to learn more about the equations used in this tutorial, I highly recommend the book in the sources below [1]!

Sources:

[1] Keller, J. M., Liu, D., & Fogel, D. B. (2016). Fundamentals of Computational Intelligence: Neural Networks, Fuzzy Systems, and Evolutionary Computation. Wiley-IEEE Press. https://onlinelibrary.wiley.com/doi/book/10.1002/9781119214403