
An Introduction to Transformer Networks

This article provides an A-to-Z guide to how transformer networks function and discusses why they outperform earlier neural network models such as LSTMs and RNNs.
Created with Stable Diffusion. Prompt: optimus prime transformer on a stage, introduced by human
In this article, we will explore a wide range of transformer concepts.
Specifically, we'll start with the origins of transformers and their dramatic impact on the machine learning world, then move on to a brief explanation of how transformers work and why they often outperform other neural network models such as RNN and LSTM.
We'll then introduce some of the major uses of transformers in ML, from NLP tasks such as speech recognition to computer vision tasks such as high-accuracy object detection. We'll also look at how transformers relate to deep learning and how they make use of the encoder-decoder architecture.

A Brief History of Transformers

Transformers were first introduced to the machine learning world by Google in 2017 in the groundbreaking “Attention Is All You Need” paper (we'll come back to this paper plenty of times). In the years since, transformers have significantly outperformed previous NLP models, largely displacing recurrent architectures such as RNNs in NLP and, later, in computer vision.
So have transformers been improved since they were first introduced? The answer is yes. As researchers got to grips with the new architecture, they proposed improvements to its capacity, speed, and range of applications, clearing the way for transformers to enter machine learning fields beyond NLP.
When transformers were first introduced, they were not well suited to speech recognition and search tasks: audio inputs contain far more frames than the corresponding output text, and closing that length gap was essential before transformers could succeed in speech recognition. Thankfully, a team of researchers in China proposed a way to bridge this gap not long after transformers first appeared, removing the limitation and bringing transformers into speech recognition.
And that's not the end of the story. In 2021, Google released a paper proposing a method to significantly boost the number of parameters in a model while keeping the number of floating-point operations (a standard measure of computational cost in ML) roughly constant, improving the model's capacity and ability to learn.

What is a Transformer Network?

As proposed in the “Attention Is All You Need” paper, a transformer is a neural network with a novel architecture designed to solve complex sequence-to-sequence language tasks such as translation, question answering, and dialogue, all while managing long-range dependencies.
When the input is text, a transformer works by tracking the relationships between the different words in a given sentence. Moreover, transformers retain information over much longer spans, which is what lets them handle long-range dependencies (more on that later in the article).


A transformer uses an encoder-decoder architecture; in the original model, the encoder and decoder each stack six layers. Each layer processes its input and passes its output on to the next layer, and each encoder layer is further divided into two sublayers: the first performs self-attention, and the second is a feed-forward network.
In a feed-forward network, data moves in only one direction (forward), from the input nodes through the hidden nodes to the output nodes. Note that all positions in the sequence are processed and passed to the next layer in parallel, allowing for high-speed performance.
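To make this layer structure concrete, here is a minimal PyTorch sketch of a single encoder layer as described above. The class name MiniEncoderLayer and the layer sizes (which follow the original paper's defaults) are ours for illustration only, not a reference implementation:

```python
import torch
import torch.nn as nn

class MiniEncoderLayer(nn.Module):
    """One encoder layer: a self-attention sublayer followed by a feed-forward sublayer."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sublayer 1: self-attention, with a residual connection and layer normalization
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sublayer 2: position-wise feed-forward, again with residual + normalization
        return self.norm2(x + self.feed_forward(x))

# Stack six of these layers, as in the original encoder
encoder = nn.Sequential(*[MiniEncoderLayer() for _ in range(6)])
tokens = torch.randn(4, 10, 512)   # (batch, sequence length, embedding size)
print(encoder(tokens).shape)       # torch.Size([4, 10, 512])
```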

What Are Transformer Networks Used For?

Transformers have been developed to handle a diverse range of tasks. In this section, we'll explain the two main reasons that allowed transformers to largely replace RNN and LSTM models.
The first reason on our list is that transformers resolve the vanishing gradient problem, an issue typically associated with older neural network models such as RNNs. So what exactly is a vanishing gradient?
To put it in simple terms, the vanishing gradient problem is directly tied to how long a model can retain information. For example, say an RNN is trying to predict the final word in the sentence “I have been studying for a bachelor's in physics for the past couple of years, and tomorrow I have to submit my last assignment in .” The problem with an RNN is that it has a short memory: even though “physics” appears at the start of the sentence, we would not expect the model to remember it by the time it has to guess what kind of assignment is due at the end.
With the introduction of LSTMs, the old RNN design gained a longer memory. Still, even LSTM models stop retaining context once sentences grow past a certain length (in most cases a couple of hundred words), which causes a large drop in effectiveness in NLP subfields such as speech recognition. This is where transformers come in handy, allowing for much longer memory retention.
The second big issue with RNNs and their improved LSTM variant is slow training. Models such as RNNs consume their input as sequential data, one step at a time, which leaves the GPU used to train the model underutilized. GPUs excel at performing large tensor computations and complement deep learning models best when fed input in massive batches, allowing for much higher training speed. Here again transformers come in handy by removing this incompatibility between older neural network models and modern GPUs: unlike RNNs and LSTMs, transformers process a whole batch of input data at once, fully utilizing the GPU and delivering much better performance, as the sketch below illustrates.
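Here is a small, hedged PyTorch sketch of the difference: the RNN must step through the sequence one position at a time, while a self-attention layer processes every position of the batch in a single call (the sizes are arbitrary and only for demonstration):

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 32, 128, 512
x = torch.randn(batch, seq_len, d_model)

# RNN-style processing: one time step at a time, each step waiting on the previous one
rnn_cell = nn.RNNCell(d_model, d_model)
h = torch.zeros(batch, d_model)
for t in range(seq_len):        # sequential loop, hard to parallelize across time
    h = rnn_cell(x[:, t, :], h)

# Transformer-style processing: the whole sequence is handled in one batched call
attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)
out, _ = attention(x, x, x)     # every position attends to every other in parallel
```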

Is a Transformer a Deep Learning Algorithm?

The short answer is yes: at their core, transformers are a deep learning algorithm (or, in more technical terms, a deep learning model). To answer this question properly, we first need to ask what a deep learning algorithm actually is.
Deep learning is simply a branch of machine learning. What sets it apart is that deep learning algorithms run data through several “layers” of neural networks, each of which passes a transformed representation of the data to the next layer. So if a deep learning algorithm has, say, three or four layers, each layer contains multiple nodes that take a batch of values as input and process them into new values; these values are then passed on as input to the next layer, until the output layer returns the model's final result.
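As a toy illustration of that layered flow (the sizes here are arbitrary and not taken from any particular model), a small stack of fully connected layers in PyTorch might look like this:

```python
import torch
import torch.nn as nn

# Each layer transforms the previous layer's representation and hands it on
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),   # layer 1
    nn.Linear(128, 64), nn.ReLU(),    # layer 2
    nn.Linear(64, 10),                # output layer: the model's final result
)

batch = torch.randn(32, 784)          # a batch of 32 flattened 28x28 inputs
print(model(batch).shape)             # torch.Size([32, 10])
```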

Neural network image with multiple layers (Source)
Transformers work in this layered fashion too, much like RNNs. There are still differences between the two models, such as how much data each can process in a given amount of time, but both pass values through a network of many nodes that gradually funnels down to a final output.

How Do Transformers Work?

Since computers only understand numerical values, most transformer models begin with a transformation step that converts textual data into numbers.
In the deep learning world, this process is called word embedding. Word embeddings represent each word as a real-valued vector that encodes the word's meaning, such that words that are close together in the vector space are expected to be similar in meaning. After converting our data into this more usable format, we pass the embedded data into the next stage, known as the self-attention layer.
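A minimal sketch of this embedding step in PyTorch (the toy vocabulary and the embedding size of 8 are made up purely for illustration):

```python
import torch
import torch.nn as nn

vocab = {"i": 0, "love": 1, "physics": 2}            # toy vocabulary
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

token_ids = torch.tensor([[vocab["i"], vocab["love"], vocab["physics"]]])
vectors = embedding(token_ids)                        # one 8-dimensional vector per word
print(vectors.shape)                                  # torch.Size([1, 3, 8])
```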
The self-attention layer is one of the main processing layers in a transformer, and it is the central mechanism the model depends on for its final results. By using a technique known as attention (here, self-attention, because the sequence attends to itself), a transformer can detect relationships between distant parts of the input and avoid vanishing gradients. In other words, the model can still learn the relationship between two related words even if they are far apart in the text.
Self-attention measures how relevant a specific word is to its neighboring words in a given sentence, and this relation is captured in what we call an attention vector, with a unique attention vector created for each word. One problem with this approach is that each word tends to weigh itself more heavily than the other words when creating its own attention vector. To correct for this, multiple attention vectors are created per word and averaged to determine the final weighting.
Three further types of vectors are created in the self-attention layer: query, key, and value vectors. Each is produced by multiplying the input embedding by a learned weight matrix; the queries and keys are compared to produce attention weights, which are then used to form a weighted sum of the values.
This process is followed by a feed-forward neural network that takes each attention output and transforms it into a form that is easier for the next layer to use. The result is then passed to the decoder, which predicts the model's output.
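Here is a compact sketch of that query/key/value computation, written from the standard scaled dot-product attention formula rather than any particular library's internals (shapes and sizes are arbitrary):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of word embeddings x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # query, key, and value vectors
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)              # how much each word attends to the others
    return weights @ v                               # weighted sum of the value vectors

d = 8
x = torch.randn(5, d)                                # 5 words, each an 8-dimensional embedding
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # torch.Size([5, 8])
```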

How Much Data Do Transformers Need?

Transformers are well optimized for GPU and TPU computation, which also makes them well suited to processing large datasets, but they do require large datasets to train properly. The most common rule of thumb for deciding whether a dataset is sufficient is the “10 times rule”: the amount of input data (i.e., the number of examples) should be at least ten times the number of degrees of freedom the model has.
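As a rough worked example (the numbers here are purely illustrative), this heuristic would suggest that a model with around 100,000 trainable parameters calls for on the order of one million training examples, though in practice the requirement varies widely with the task and architecture.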
It is worth noting that different models may require very differently sized datasets, so the best way to get a feel for how much data a transformer needs is to look at real examples of already-trained transformer models:
Facebook's well-known Data-efficient Image Transformer (DeiT) achieved state-of-the-art image classification performance using the ImageNet dataset of around 1.2 million images, a dataset large enough to train the model to high accuracy.
That said, datasets of this size are not always required: in many cases, around a hundred thousand data points will be sufficient. Even so, more data is frequently a recipe for better model performance (think of the recent explosion of billion-parameter LLMs).

How Do Transformers Use Encoders and Decoders?

Created with the goal of outperforming previous neural network models, transformers were introduced with an encoder-decoder architecture. So how does the encoder-decoder model work, and how does this design allow transformers to outperform RNNs, LSTMs, and other neural network models?


The encoder is the part of the transformer where self-attention is performed. It implements a multi-head attention mechanism, which lets the model draw on multiple embedding subspaces at once. Further transformations are then applied in the feed-forward layer, which takes in the attended vectors for additional processing. The decoder then takes the encoder's processed output and uses it to predict the model's output.

To put it simply, the encoder extracts features from a given sentence, while the decoder uses those features to produce the output.
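As a rough sketch of how the two halves fit together, PyTorch ships a built-in nn.Transformer module that wires up an encoder stack and a decoder stack. The sizes below simply mirror the original paper's defaults, and the random tensors stand in for already-embedded sentences:

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(2, 20, 512)   # embedded source sentence (batch, source length, d_model)
tgt = torch.randn(2, 15, 512)   # embedded target so far   (batch, target length, d_model)

# The encoder extracts features from src; the decoder attends to those features
# to predict the next step of the output sequence.
out = model(src, tgt)
print(out.shape)                # torch.Size([2, 15, 512])
```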

What Are the Applications of Transformers?

Since their release, transformers have been heavily used by companies around the world, including Google and Facebook.

Google uses transformers in its state-of-the-art Google Translate service. Transformers are exceptionally useful for this task because they allow the model to hold on to relationships across long stretches of text, such as long paragraphs. Currently, Google Translate can handle up to 5,000 characters in a single translation.

Facebook has created its own vision transformer models. Vision transformers use multi-head self-attention layers, which rely on the attention mechanism's queries, keys, and values to “pay attention” to information from different representations at different positions. Such a model can distinguish and classify images across many categories, and with Facebook receiving millions upon millions of photos daily, a fast and accurate model is of utmost importance.
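A vision transformer's key trick is to treat an image as a sequence of patches, so the same attention machinery used for words can be applied to pixels. The sketch below shows only that patch-embedding step, with sizes loosely following common ViT configurations; it is not Facebook's actual model:

```python
import torch
import torch.nn as nn

# Split a 224x224 image into 16x16 patches and project each patch to an embedding,
# so a standard transformer encoder can treat the patches like "words" in a sentence.
patch_size, d_model = 16, 768
to_patches = nn.Conv2d(3, d_model, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)           # one RGB image
patches = to_patches(image)                   # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)   # (1, 196, 768): 196 patch embeddings
print(tokens.shape)
```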

Tasks Transformers Can Perform

As we mentioned earlier, transformers are primarily used in natural language processing and computer vision applications.

Natural Language Processing Applications


Speech recognition is a subfield of machine learning concerned with recognizing spoken language and converting it into text that a computer can process.
Language translation is the process of translating a source language into another language quickly and effectively. Translation models built on transformers offer both speed and accuracy.
Sentiment analysis uses natural language processing techniques to quantify and study the subjective tone of the words in a sentence. One example is detecting racist or otherwise abusive statements: social media platforms such as Facebook and Twitter use it to find and remove statements aimed at a given group of people (see the sketch after this list).
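To give a feel for how accessible this has become, here is a small example using the Hugging Face transformers library's sentiment-analysis pipeline. It assumes the library is installed and downloads a default pretrained model on first use; the output shown is only indicative:

```python
from transformers import pipeline

# Load a pretrained sentiment classifier (a small transformer fine-tuned for this task)
classifier = pipeline("sentiment-analysis")

print(classifier("I really enjoyed this introduction to transformers!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.999...}]
```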

Computer Vision Applications


Image classification is the process of categorizing images based on the traits they contain. Transformers have been applied successfully in medical settings, for example to classify diseases from medical images (see the sketch after this list).
Object detection is the process of identifying objects of a given class in a digital image or video; it underpins tasks such as face recognition and pedestrian detection.
Image compression is a data compression technique applied to digital images. Its main purpose is to reduce the cost of storing the image, although it may cause a slight decrease in image quality.
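As with the sentiment example above, pretrained vision transformers are readily available. The sketch below uses the Hugging Face image-classification pipeline with one publicly available ViT checkpoint; the image path is a placeholder you would swap for your own file:

```python
from transformers import pipeline

# Load a pretrained vision transformer for image classification
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")

# Pass a local image path (or a URL); the pipeline returns the top predicted labels
print(classifier("path/to/your_image.jpg"))
```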

Conclusion

In this article, we covered quite a range of concepts related to transformers: their inception in the famous “Attention Is All You Need” paper, how they actually work, the architectures they improved upon, and finally a few of their applications.
