Exploring Google’s T5 Text-To-Text Transformer Model
In this article, we'll explore the architecture and mechanisms behind Google’s T5 Transformer model, from the unified text-to-text framework to how its results compare with other models.
The field of Natural Language Processing (NLP) is constantly changing and growing. New models are being released every quarter, achieving state-of-the-art (SOTA) results. All these models are trained using different objective functions, different training procedures, and different datasets.
With such varying features, there's a real challenge when comparing models to each other and coming to meaningful conclusions. Luckily for us, this is where the authors of "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" come in.
In this article, we'll explore Google's T5 Text-to-Text Transformer model, to understand the framework and to compare its results.
Here's what we'll be covering:
Table of Contents
- The T5 Paper
- What Is Transfer Learning?
- Why Transfer Learn At All?
- What Is The T5 Transformer Model?
- How Does The T5 Transformer Model Work?
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- 1. Unsupervised Objective
- 2. Training
- 3. Models
- Results
- T5 vs LongT5
- When Would You Use The T5 Model?
- 0. Importing Libraries
- Summary
The T5 Paper
"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" is a revolutionary paper that introduced Google’s T5 architecture and the unified “Text-to-Text” framework.
The gist of the proposed methodology is that we can treat every NLP task as a “text-to-text” problem. In other words, we'll use text as our input and produce text as our output. They also introduced a huge dataset called C4, which contains about 750 GB of clean English text. As an added bonus, both the model and the dataset are open-sourced.
In this post, we will explore the T5 model architecture, compare it to other models, and test it out ourselves.

But before we dive into the T5 model, it's important that we first understand a bit about transfer learning.
What Is Transfer Learning?
Transfer learning occurs when a model is pre-trained on a data-rich dataset for one task and then fine-tuned for specific downstream tasks. Essentially, what the model learns on one task is transferred to another, generally similar, task or domain.
This is a very powerful technique that promotes collaboration and growth. The pinnacle of transfer learning is when a single standalone model pre-trained on a mixture of data-rich tasks can be used for all text-processing tasks.
T5 pushes strongly toward this goal, handling a wide range of tasks (language translation, text summarization, text classification, etc.) with a single model.
Why Transfer Learn At All?
You may be asking, "Why would one want to transfer the learning from one task to another rather than simply training a model on the downstream task?"
Training large models is computationally and financially expensive, often requiring advanced GPUs and TPUs, and it also carries a substantial carbon footprint and ecological impact.
Essentially, transfer learning allows everyone, from hobbyists to big corporations, to piggy-back off massive, expensive models and fit them to their specific tasks.
What Is The T5 Transformer Model?
The T5 Transformer Model was introduced in 2020 by the Google AI team and stands for Text-To-Text Transfer Transformer (5 Ts, or, in our case, T5). The main problem T5 addresses is the lack of systematic studies comparing best practices in the field of NLP.
Most of the current SOTA models are derived from the Transformer architecture. The transformer was introduced in the legendary paper “Attention Is All You Need” by Vaswani et al. and had two main architectural blocks, namely Encoders and Decoders.
All the subsequent models had some sort of relation to these architectural blocks. Google’s BERT only had Encoder blocks, OpenAI’s GPT-2 only had Decoder blocks, etc.
With varying architectures, and various training datasets (Wikipedia, Wikipedia + Toronto Book Corpus), we cannot objectively compare these models and their SOTA results. After all, every model has varying pre-training objectives, unlabelled datasets, transfer approaches, and architectures.
T5 introduced the “Text-to-Text” framework, in which every NLP task (Translation, Classification, etc.) has the same underlying structure in which text is fed as input to the model and text is produced as output. This means we can use the same model, the same hyperparameters, and the same loss function across all the tasks.
How Does The T5 Transformer Model Work?
The T5 transformer model uses the same standard encoder-decoder structure as the original Transformer. The base model stacks 12 encoder blocks and 12 decoder blocks; each block contains self-attention, a feed-forward network, and, in the decoder blocks, an additional encoder-decoder attention layer.
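If you'd like to confirm these block counts yourself, here is a minimal sketch (assuming the Hugging Face transformers library and the t5-base checkpoint) that reads them straight from the published model configuration:
from transformers import T5Config

# Load the published configuration for the base checkpoint
config = T5Config.from_pretrained("t5-base")

print(config.num_layers)          # number of encoder blocks -> 12
print(config.num_decoder_layers)  # number of decoder blocks -> 12
print(config.d_model)             # hidden size -> 768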
If it has the same architecture as the original Transformer, how is it able to achieve SOTA?
To understand that, we need to first understand two unique features of the T5 model:
- Input/Output Representation: Text-to-Text Framework
- Training dataset: C4 dataset
1. Input/Output Representation: Text-to-Text Framework
As discussed earlier, we are feeding text as the input and getting text as the output. This allows us to use the same model, same hyperparameters, and same loss function across all the tasks.
This is done by adding a task-specific prefix to the input sequence and pre-training the model to get prefix-specific outputs.
Let us dive into how it is done for a few NLP tasks. We will be generating prompts in this section and will illustrate how this works with code further below.
Before doing this, let me share the theme of the tasks we'll be working with as illustrated using Stable Diffusion.

Now let's get to it.
1.1 Text Summarization
Text Summarization is the NLP task in which a model, given a long text sequence, produces a summarized version of the input. Since I'm a huge fan of the One Piece manga, we will be summarizing a passage from a One Piece Wikipedia article.
For summarization, we need to add the “summarize: ” prefix to the input sequence.
1.2 Language Translation
Language Translation is the NLP task in which a model, given text in one language, produces a translated version of that text in another language. While the C4 pre-training corpus itself is English-only, T5's training mixture also includes supervised translation tasks covering English, German, French, and Romanian.
Using T5, we can translate between these languages.
Below we are going to translate from English to French. For translation, we need to add the “translate English to French: ” prefix to the input sequence.
Our prompt: “translate English to French: You should definitely watch 'One Piece', it is so good, you will love the comic book.”
1.3 Text Classification: Textual Entailment
Textual Entailment is an NLP task in which a model is given two sentences, one being the premise and the other the hypothesis. Based on these two sentences, the output is classified into one of three classes: entailment, contradiction, or neutral.
For textual entailment, we need to add “mnli premise: ” and “hypothesis: ” to the sentence pairs.
Our prompt: “mnli premise: I love One Piece. hypothesis: My feelings towards One Piece are filled with love.”
1.4 Linguistic Acceptability
Linguistic Acceptability is an NLP task in which a model, given a sentence, checks whether it is grammatically acceptable.
For linguistic acceptability, we need to add “cola sentence: ” to the sentence. CoLA (the Corpus of Linguistic Acceptability) is the dataset that maps sentences to acceptability judgments.
Our prompt: “cola sentence: Luffy is a great pirate.”
1.5 Sentence Similarity
Sentence similarity is an NLP task in which a model, given a pair of sentences, rates their similarity on a 1–5 scale. T5 treats the score as a string rounded to the nearest increment of 0.2, which means we can treat this as a text classification task with 21 classes: 1.0, 1.2, …, 5.0.
For sentence similarity, we need to add “stsb sentence 1: ” and “sentence 2: ” to the sentence pairs.
Our prompt: “stsb sentence 1: Luffy was fighting in the war. sentence 2: Luffy's fighting style is comical.”

Text Processing Tasks, Image By Author
2. Training dataset: C4 dataset
The Colossal Clean Crawled Corpus (C4) is about 750 GB of clean English text scraped from the internet. It was collected from Common Crawl, a publicly available web archive.
After extracting the raw text from Common Crawl, the authors filtered out pages containing offensive words, placeholder filler text (“lorem ipsum”), code (lines containing “{”), duplicate content, and lines that don’t end with a terminal punctuation mark.
Because the dataset is both clean and huge, the model can be pre-trained without having to repeat the same data.
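To make these cleaning heuristics concrete, here is a minimal sketch of C4-style line filtering; the function name, thresholds, and deduplication strategy are simplifications of my own, not the official C4 pipeline:
def clean_lines(text, bad_words=()):
    """Rough C4-style line filtering (illustrative, not the official pipeline)."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith((".", "!", "?", '"')):   # keep only lines ending in terminal punctuation
            continue
        if "{" in line:                               # drop lines that look like code
            continue
        if "lorem ipsum" in line.lower():             # drop placeholder filler text
            continue
        if any(word in line.lower() for word in bad_words):  # drop lines with listed offensive words
            continue
        kept.append(line)
    return "\n".join(dict.fromkeys(kept))             # naive deduplication, keeps first occurrence

sample = "One Piece is a great story.\n{ \"json\": true }\nLorem ipsum dolor sit amet."
print(clean_lines(sample))   # only the first line survives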
These are the most important features of T5; we will see how they perform in the next section.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
In this section, we will summarize the “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” paper. We have already looked at the most important features of the T5 model and architecture.
We will cover three main sections: Unsupervised Objective, Training, and Models.
1. Unsupervised Objective:
Up until this paper, most models used standard denoising objectives, in which each input token is independently chosen to be corrupted or left alone. Through experimentation, the authors arrived at a BERT-style denoising objective with span corruption.
Consecutive corrupted tokens are treated as a span; each span is then replaced by a single unique sentinel token. This results in shorter sequences.
Original text: One Piece is the greatest story ever told in human history.
Corrupted input: One Piece <X> story ever <Y> in human history.
Target: <X> is the greatest <Y> told <Z>
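Below is a minimal sketch of this span-corruption idea. The sentinel naming follows T5's <extra_id_N> convention, but the independent per-token masking here is a simplification of my own rather than the exact span-length sampling described in the paper:
import random

def span_corrupt(tokens, corrupt_prob=0.3, seed=1):
    """Replace runs of corrupted tokens with sentinel tokens, T5-style (simplified)."""
    rng = random.Random(seed)
    corrupted = [rng.random() < corrupt_prob for _ in tokens]

    inputs, targets, sentinel, i = [], [], 0, 0
    while i < len(tokens):
        if corrupted[i]:
            # a run of corrupted tokens becomes one sentinel in the input
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            while i < len(tokens) and corrupted[i]:
                targets.append(tokens[i])   # the span itself moves to the target
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")  # closing sentinel ends the target
    return " ".join(inputs), " ".join(targets)

inp, tgt = span_corrupt("One Piece is the greatest story ever told in human history .".split())
print(inp)   # original sentence with each corrupted span replaced by a sentinel
print(tgt)   # the corrupted spans, each preceded by its sentinel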
2. Training:
As discussed earlier, T5 was trained on the C4 dataset. Through experimentation, the authors concluded that the best performance comes from pre-training the model for 1 million steps with a batch size of 2^11 (2,048) sequences of length 512.
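To put those numbers in perspective, a quick back-of-the-envelope calculation shows how many tokens that pre-training schedule covers:
steps = 1_000_000          # pre-training steps
batch_sequences = 2 ** 11  # 2,048 sequences per batch
sequence_length = 512      # tokens per sequence

tokens_per_batch = batch_sequences * sequence_length  # 1,048,576 (about a million tokens)
total_tokens = steps * tokens_per_batch               # roughly a trillion tokens in total
print(f"{tokens_per_batch:,} tokens per batch, {total_tokens:,} tokens overall")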
3. Models:
There are 5 T5 variants with varying parameters and model sizes.
- Base: The baseline model, comparable in size to BERT-Base, with about 220 million parameters.
- Small: A scaled-down version of Base with about 60 million parameters and only 6 encoder and 6 decoder layers.
- Large: A scaled-up version of Base with 770 million parameters.
- 3B: A scaled-up version of Base with 3 billion parameters.
- 11B: A scaled-up version of Base with 11 billion parameters.
Results
T5 (11B) achieved state-of-the-art results on most NLP benchmarks, specifically 18 out of 24 tasks. The main factors behind this result were model scale and the size and cleanliness of the pre-training dataset.

T5 Results, Image by Authors of T5 Paper
T5 vs LongT5
There are two main drawbacks to the T5 model:
- Fixed Input Length
- Model Size
With T5, we can only pass in relatively short input sequences (typically fewer than 512 tokens). This is because of quadratic computation growth: the cost of self-attention grows quadratically with the input sequence length, which drives up training time and memory consumption.
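A quick calculation illustrates the problem: self-attention computes one score per (query, key) pair, so doubling the sequence length roughly quadruples the work per attention layer.
for seq_len in (512, 1024, 2048, 4096):
    attention_scores = seq_len ** 2   # one score per token pair, per head, per layer
    print(f"length {seq_len:>4}: {attention_scores:,} attention scores")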
These two concerns are addressed by LongT5, which extends T5 with a Transient Global (TGlobal) attention mechanism, though we won't be covering TGlobal here.
LongT5 achieves better results than T5 across various long-input NLP tasks such as summarization and question answering.
The results are shown here:

Comparing T5 vs LongT5, Image by Authors of LongT5 paper
When Would You Use The T5 Model?
Finally, we can test out the T5 model ourselves!
We are going to feed the model the prompts we designed in the previous sections.
We will follow the same structure for every task; only the prefix changes.
First, we need to import the necessary libraries as well as set up the tokenizer and model.
For every task, we will gather the input sequence, encode it, generate output with the model, and decode that output.
Remember, you can view the outputs and try the code yourself in this ...
0. Importing Libraries
!pip install transformers
!pip install sentencepiece

# import libraries
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# set up tokenizer and model
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
1. Text Summarization
one_piece_sequence = (
    "The series focuses on Monkey D. Luffy, a young man made of rubber, who, inspired by his childhood idol, "
    "the powerful pirate Red-Haired Shanks, sets off on a journey from the East Blue Sea to find the mythical treasure, "
    "the One Piece, and proclaim himself the King of the Pirates. In an effort to organize his own crew, the Straw Hat Pirates, "
    "Luffy rescues and befriends a pirate hunter and swordsman named Roronoa Zoro, and they head off in search of the "
    "titular treasure. They are joined in their journey by Nami, a money-obsessed thief and navigator; Usopp, a sniper "
    "and compulsive liar; and Sanji, a perverted but chivalrous cook. They acquire a ship, the Going Merry, and engage in confrontations "
    "with notorious pirates of the East Blue. As Luffy and his crew set out on their adventures, others join the crew later in the series, "
    "including Tony Tony Chopper, an anthropomorphized reindeer doctor; Nico Robin, an archaeologist and former Baroque Works assassin; "
    "Franky, a cyborg shipwright; Brook, a skeleton musician and swordsman; and Jimbei, a fish-man helmsman and former member of the Seven "
    "Warlords of the Sea. Once the Going Merry is damaged beyond repair, Franky builds the Straw Hat Pirates a new ship, the Thousand Sunny. "
    "Together, they encounter other pirates, bounty hunters, criminal organizations, revolutionaries, secret agents, and soldiers of the "
    "corrupt World Government, and various other friends and foes, as they sail the seas in pursuit of their dreams."
)
inputs = tokenizer.encode("summarize: " + one_piece_sequence, return_tensors='pt', max_length=512, truncation=True)
summarization_ids = model.generate(inputs, max_length=80, min_length=40, length_penalty=5., num_beams=2)
summarization = tokenizer.decode(summarization_ids[0])
print(summarization)
This gives us:

2. Language Translation
language_sequence = ("You should definitely watch 'One Piece', it is so good, you will love the comic book")
input_ids = tokenizer("translate English to French: " + language_sequence, return_tensors="pt").input_ids
language_ids = model.generate(input_ids)
language_translation = tokenizer.decode(language_ids[0], skip_special_tokens=True)
print(language_translation)
This gives us:

3. Text Classification: Textual Entailment
entailment_premise = ("I love One Piece.")
entailment_hypothesis = ("My feelings towards One Piece are filled with love.")
input_ids = tokenizer("mnli premise: " + entailment_premise + " hypothesis: " + entailment_hypothesis, return_tensors="pt").input_ids
entailment_ids = model.generate(input_ids)
entailment = tokenizer.decode(entailment_ids[0], skip_special_tokens=True)
print(entailment)
That gives us:

4. Linguistic Acceptability
sentence = ("Luffy is a great pirate.")
input_ids = tokenizer("cola sentence: " + sentence, return_tensors="pt").input_ids
sentence_ids = model.generate(input_ids)
sentence = tokenizer.decode(sentence_ids[0], skip_special_tokens=True)
print(sentence)
That gives us:

5. Sentence Similarity
stsb_sentence_1 = ("Luffy was fighting in the war.")
stsb_sentence_2 = ("Luffy's fighting style is comical.")
input_ids = tokenizer("stsb sentence 1: " + stsb_sentence_1 + " sentence 2: " + stsb_sentence_2, return_tensors="pt").input_ids
stsb_ids = model.generate(input_ids)
stsb = tokenizer.decode(stsb_ids[0], skip_special_tokens=True)
print(stsb)
That gives us:

Summary
In this blog post, we sought to introduce you to the T5 model, the Text-to-text framework, and various NLP tasks.
We also had hands-on experience with testing out T5 for Summarization, Language Translation, Textual Entailment, Linguistic Acceptability, and Sentence Similarity.
If you want to see walkthroughs of other models, feel free to comment below and I'll see what I can do!