
Dialogue Summarization with Flan-T5

In this article, we explore model recycling and fine-tune a T5 text-to-text transformer model to summarize dialogue - with some unexpected results.
Recently, natural language processing (NLP) has undergone significant transformations in how it 'understands' and interacts with the world. NLP tasks like summarization were particularly difficult or costly just half a decade ago. These tasks often required hundreds of hours of subject-matter expertise to produce annotated training datasets – expert-authored summaries of long documents such as court cases or medical records – or countless hours spent validating and auditing the machine-produced summaries to ensure that they had high fidelity to the original, long-form texts.
Much as transformer models revolutionized NLP more generally, they have also become a powerful new tool for generating abstractive and extractive summaries of complicated, long-form documents in domain-expert fields such as law, medicine, and STEM. With the attention mechanism at the core of these models, abstractive and extractive summarization can now be performed with unprecedented accuracy, even in jargon-rich domains.
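As a quick refresher, the scaled dot-product attention that powers these models (from the original Transformer paper) computes

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

where Q, K, and V are the query, key, and value projections of the token representations and d_k is the key dimension. In summarization, this is what lets every generated token weigh its relevance against every token of the source document.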
To get started working with a transformer-based summarization model, see the Colab notebook linked below. It takes a T5-base model and fine-tunes it on a dialogue dataset produced by Samsung Research. The fine-tuned model is often more performant than the original transformer model - and, as we'll see later, it holds up on other tasks as well, rather than overfitting to its fine-tuning data. See if you can get your model to score higher than the baseline (be sure to evaluate your model across several metrics, such as ROUGE-L and ROUGE-S).

[Embedded W&B panel: training and evaluation curves for the run set (33 runs)]


The Text-to-Text Transfer Transformer (T5)

The 🍮Flan-T5🍮 model referenced above can summarize dialogue, such as a podcast in which a host interviews a guest and a back-and-forth conversation takes place. If you'd like an overview of the T5 model architecture, the datasets used to train the model, or other attributes of the training process, you will find that here.
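As a quick illustration of what that looks like in practice, here is a minimal sketch of prompting a Flan-T5 checkpoint through the Hugging Face pipeline API. This is not the exact code from the Colab: the checkpoint name, prompt wording, dialogue, and generation settings are all illustrative.

from transformers import pipeline

# Load an instruction-tuned Flan-T5 checkpoint (illustrative choice of size)
summarizer = pipeline("text2text-generation", model="google/flan-t5-base")

dialogue = (
    "Host: Welcome back to the show! Today we're joined by a pastry chef.\n"
    "Guest: Thanks for having me. I mostly work on laminated doughs like croissants.\n"
    "Host: What's the hardest part?\n"
    "Guest: Keeping the butter cold so the layers stay distinct."
)

# Flan-T5 responds to a plain-language instruction prepended to the input
result = summarizer("Summarize the following dialogue:\n" + dialogue, max_new_tokens=60)
print(result[0]["generated_text"])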
For this project, we fine-tuned the T5 model on a human-annotated set of dialogue data, the SAMSum Corpus. Since our training process made use of the transformers library from Hugging Face, we pushed our model weights and scores to the Hugging Face Hub, and we used the report_to="wandb" one-liner to capture and log, in real time as our model was training, the performance curves that you see above.
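For context, a condensed sketch of what that training setup might look like is below. This is not the exact notebook code - the hyperparameters, output directory, and preprocessing are illustrative - but report_to="wandb" and push_to_hub are the two switches that handle the live logging and the upload to the Hub.

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

raw = load_dataset("samsum")  # the SAMSum Corpus
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

def preprocess(batch):
    # T5 is a text-to-text model, so we prepend a task prefix to each dialogue
    inputs = tokenizer(["summarize: " + d for d in batch["dialogue"]],
                       max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="t5-base-samsum",       # illustrative repo/output name
    per_device_train_batch_size=8,
    num_train_epochs=3,
    predict_with_generate=True,
    report_to="wandb",                 # stream loss and eval curves to W&B
    push_to_hub=True,                  # upload the fine-tuned weights to the Hub
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()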
The assorted ROUGE scores (Recall-Oriented Understudy for Gisting Evaluation), a common way of measuring an NLP model's ability to 'get the gist' of, or effectively summarize, a text, looked good. Some weeks later, IBM's AI Research group reached out to us: they were surveying more than 2,000 publicly posted, fine-tuned models as part of the model-recycling effort.
Our fine-tuned model beat the performance of the google/t5-v1_1-base model and took second place among all of the fine-tuned google/t5-v1_1-base models the researchers had surveyed, when evaluated on the project's suite of tasks - mnli_lp, the Twenty Newsgroups task, the Amazon Reviews Multilingual text classification task, and many more:
Previous studies observed that finetuned models may be better base models than the vanilla pretrained model. Such a model, finetuned on some source dataset, may provide a better starting point for a new finetuning process on a desired target dataset. Here, we perform a systematic analysis of this intertraining scheme, over a wide range of English classification tasks. Surprisingly, our analysis suggests that the potential intertraining gain can be analyzed independently for the target dataset under consideration, and for a base model being considered as a starting point. This is in contrast to current perception that the alignment between the target dataset and the source dataset used to generate the base model is a major factor in determining intertraining success. We analyze different aspects that contribute to each. [....]
The Average column shows the average score, across evaluation tasks, of a fine-tuned model - a model developed in the community - whereas the Pretrained Average column shows the score of the baseline pretrained model.
In general, there's about a two percent spread between the models as trained by Facebook AI (RoBERTa), Google researchers (the original BERT model), and so on. While two percent may not sound like much, a quick review of the SOTA leaderboards at Papers with Code shows that fractions of a percent often separate first place from tenth place on worldwide DL model leaderboards. Hence, these model-recycling statistics are quite promising for the future power of recycled models! Additionally, recent work from researchers at AllenAI in early 2023 has shown that by recycling embeddings we can also achieve good performance without the costly and time-consuming embedding fine-tuning cycles that we used to engage in.
model-recycling leaderboard for main classes of Transformer models
We would encourage you to fine-tune models yourself and see where you rank on the leaderboard. You can use the evaluate library to compute many of your ROUGE scores, or fork the notebook we've presented above, which already uses the evaluate library. Jump right into fine-tuning using that notebook, or continue reading to learn how we used to summarize documents in the years leading up to the advent of Transformer-based summarization architectures.
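As a minimal sketch of that scoring step with the evaluate library (the prediction and reference strings below are placeholders, not SAMSum data):

import evaluate

rouge = evaluate.load("rouge")

predictions = ["The guest explains how to keep croissant layers flaky."]
references = ["The guest says keeping the butter cold keeps croissant layers distinct."]

# Returns rouge1, rouge2, rougeL, and rougeLsum F-scores by default
scores = rouge.compute(predictions=predictions, references=references)
print(scores)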

Summarization Before Transformers

Summarization methods: extractive and abstractive

Extractive Summarization in the Past

Before the advent of Transformers, extractive summarization worked in three main stages: an intermediate representation of the input document, capturing only its key aspects, was generated; the document's sentences were scored and ranked based on that representation; and, finally, a summary was assembled from several of the top-ranked sentences (Nenkova & McKeown, 2012). We'll cover those steps in a bit more detail later on, so don't worry if they seem a bit opaque right now.
Intermediate representations were often captured via simple frequency counts, term frequency-inverse document frequency (TF-IDF), or other topic-weight assignments. Suppose our document repeatedly uses the terms croissant, pâte à choux/choux pastry, and lamination, and our topic modeling efforts tell us that the document is about "baking."
However, a more exact, slightly narrower topic would be pâtisserie, which encompasses the baking of pastries (like croissants) and other sweet baked goods (but not breads or other savory baked items). The document's author, though, never explicitly uses the word pâtisserie; thus, our 'old school' summarization model wouldn't know to use that term unless we employed a lexical chain approach (see Barzilay and Elhadad's 1997 work, linked in the References section, for a foundational paper on lexical chain summarization). Extractive summarization, as it was commonly practiced, consisted of the following steps:
  • Sentences are intermediately represented: said succinctly, WordNet or some other thesaurus was used to collate topics or concepts of semantically-related tokens. Then, weights were assigned to those tokens, after which latent semantic analysis (LSA) was utilized to find word co-occurrence patterns. Those co-occurring tokens were considered to be 'topics', and weights were assigned to each topic.
  • Sentences are scored: Sentences are then scored in order of 'importance'. The score is commonly related to how well a sentence expresses some of the most important topics in the document or to what extent it combines information about different topics. Güneş Erkan and Dragomir Radev's LexRank (and the TextRank model, developed independently by Mihalcea and Tarau) represent all of the sentences in a document as vertices in a graph, with the edges representing a measure of semantic similarity (TextRank using word overlap between sentences; LexRank using cosine similarity of TF-IDF vectors).
  • Summary sentences are selected: In the olden days of extractive summarization, we'd tell our model that we wanted the summary to be X% of the original document length or that the summary should be Y sentences long. Two common selection methods were employed in traditional NLP summarization workflows: maximal marginal relevance (MMR) and global selection. (A simplified sketch of all three stages follows this list.)
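To make those stages concrete, here is a simplified, LexRank-flavored sketch: TF-IDF vectors as the intermediate representation, graph centrality over cosine similarities for scoring, and a fixed sentence budget for selection. The example sentences and the use of PageRank as the centrality measure are illustrative, and scikit-learn and networkx are assumed to be installed.

import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The croissant relies on lamination to build its flaky layers.",
    "Choux pastry, by contrast, is cooked twice and puffs from steam.",
    "Keeping the butter cold is the key to clean lamination.",
    "Many bakers practice lamination for years before mastering it.",
]

# 1. Intermediate representation: one TF-IDF vector per sentence
tfidf = TfidfVectorizer().fit_transform(sentences)

# 2. Scoring: build a similarity graph and rank sentences by centrality
sim = cosine_similarity(tfidf)
np.fill_diagonal(sim, 0.0)
graph = nx.from_numpy_array(sim)
scores = nx.pagerank(graph)

# 3. Selection: keep the top-N sentences, restored to their original order
top_n = 2
selected = sorted(sorted(scores, key=scores.get, reverse=True)[:top_n])
print(" ".join(sentences[i] for i in selected))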
Thus, extractive summarizers work by selecting important sentences or phrases directly from the source material without making any modifications whatsoever, which is why they are often referred to as "extractors" rather than "summarizers" per se. An advantage of extractive summarization is that the meaning is extracted from the original corpus, which allows us to control the final output, since only relevant sections will make it into the 'final product'. Extractive summarization is thus very useful when dealing with sensitive topics, where even the smallest alterations may change the meaning or interpretation of a document.

Abstractive Summarization

Abstractive summarization is a process that involves generating new sentences based on the original document's content. If you read through Erkan and Radev's LexRank paper from 2004, they state that:
[...] the problems in abstractive summarization, such as semantic representation, inference and natural language generation, are relatively harder compared to a data-driven approach such as sentence extraction. In fact, truly abstractive summarization has not reached to a mature stage today.
All is not lost, however! The late 2000s saw little advancement in high-quality abstractive summarization methods, but by the mid-2010s a new front-runner emerged. A research paper from the Harvard SEAS lab and Facebook AI Research, A Neural Attention Model for Abstractive Sentence Summarization, made great headway in the abstractive summarization space by building upon Bahdanau, Cho, and Bengio's 2014 paper that introduced the attention mechanism, a deep learning 'building block' that would go on to revolutionize computer vision, NLP, protein folding, and countless other fields.
The SEAS and FAIR paper uses an attention-based neural network, a forerunner of the Transformer architectures that are commonplace in today's deep learning, to generate a summary by 'understanding' the context of the words used in a text rather than simply extracting key phrases from it, as an extractive summarizer would do.

Wrapping Things Up

In this article, we've learned that abstractive and extractive summarization are two important techniques for generating summaries of larger texts, and that large-scale, Transformer-based models can act as multi-task learners. We learned that extractive summarization involves selecting and concatenating the most relevant sentences from the original text, while abstractive summarization involves generating a new summary whose sentences may not appear verbatim in the original text.
Various algorithms and techniques have been developed for both extractive and abstractive summarization, including LexRank and TF-IDF-based ranking. In recent years, thanks to their power to learn word representations in context, Transformer-based models have taken the lead in a wide range of tasks, not just in summarization workflows. While extractive methods are generally simpler and more transparent, abstractive methods can produce more fluent and concise summaries. Thankfully, a Transformer model can capture the best of both worlds by summarizing text while also quoting short snippets from the source.
Ultimately, the choice of summarization technique depends on the specific requirements of the task at hand, and researchers continue to explore new approaches to improve the quality and effectiveness of automatic summarization.
We also learned that we can produce high-quality multi-task learners by fine-tuning foundation models and large language models, which bring a plethora of capabilities in natural language as well as in programming, science, and more. These multi-task learners not only perform well on the task they were instructed to learn (summarizing text, in our case); they also retain, and even improve upon, their baseline performance on other tasks, as the model-recycling project shows.

References

In order of their mention in the blog post above:

  • Raffel, C., et al. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5).
  • Gliwa, B., Mochol, I., Biesek, M., & Wawer, A. (2019). SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization.
  • Nenkova, A., & McKeown, K. (2012). A Survey of Text Summarization Techniques.
  • Barzilay, R., & Elhadad, M. (1997). Using Lexical Chains for Text Summarization.
  • Erkan, G., & Radev, D. R. (2004). LexRank: Graph-based Lexical Centrality as Salience in Text Summarization.
  • Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing Order into Text.
  • Carbonell, J., & Goldstein, J. (1998). The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries.
  • Rush, A. M., Chopra, S., & Weston, J. (2015). A Neural Attention Model for Abstractive Sentence Summarization.
  • Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate.
