Compressing the Story: The Magic of Text Summarization

In this article, we explore the benefits, challenges, and future of text summarization technology, including the most popular algorithms and their limitations.
Mostafa Ibrahim
Created on February 15|Last edited on July 15
In this article, we'll delve into the world of text summarization and its applications. Unlike other natural language processing techniques, text summarization focuses on condensing large volumes of information into shorter, more manageable summaries. 
We'll provide a comprehensive overview of text summarization, explore the most popular algorithms, and discuss the challenges and limitations of this technique.
Here's what we'll be covering: 
Table of Contents
What Is Text Summarization?
Can AI Summarize Text?
How Does Text Summarization AI Work?
Is Text Summarization Supervised or Unsupervised?
Is Text Summarization an NLP Use Case?
How Does BERT Summarizer Work?
Which Algorithm Is Best for Text Summarization?
An Example of Text Summarization in Python With Code
Conclusion
Let's dive in! 
What Is Text Summarization?
Text summarization is a process of extracting the most essential information from a larger text and presenting it in a concise and comprehensive manner. It involves the elimination of redundant or less significant details while retaining the most critical information.
This can be achieved through various techniques, including deletion, generalization, and abstraction. The objective of summarization is to communicate the primary meaning of the original text in a simplified and straightforward form without sacrificing the integrity of the information being conveyed.
Can AI Summarize Text?
Yes. Thanks to recent advances, several apps and APIs now use artificial intelligence to summarize text for you, making it much easier to pull the key takeaways from a lengthy document.
Whether you're a student trying to comprehend a dense academic article or a busy professional who needs to stay on top of important information, these AI-powered tools are here to help.
By providing quick, focused summaries, they let readers grasp the core ideas of a long text without working through every page.
In fact, we're going to look at a few popular APIs that can help with just this task:
Aylien Text Summarization API
The Aylien Text Summarization API is an online service that uses natural language processing technology to analyze text and provide content processing capabilities, including the ability to generate summaries. 
This API employs advanced machine learning algorithms to determine which sentences are most important in a given text and produces a summary that captures the main points of the original document. 
Additionally, Aylien's text summarization API supports multi-lingual summarization, allowing for the processing of text in different languages. It also provides greater customizability when it comes to the final summary.
Google Cloud Natural Language API
Google's Cloud Natural Language API utilizes cutting-edge machine learning techniques to automatically analyze text and identify the most important sentences, generating a summary that captures the most critical points of the original document.
Google's API also supports other useful features, such as sentiment analysis and entity recognition, to provide users with the most comprehensive analysis possible. 
With its powerful and accurate analysis capabilities, the Cloud Natural Language API is a highly regarded and sought-after tool for any business or organization looking to optimize its text summarization process.
OpenAI's GPT-3 API
The GPT-3 API from OpenAI is built on advanced natural language processing techniques, which enable it to produce summaries that are more coherent and natural-sounding compared to those generated by traditional summarization algorithms. 
It's worth noting, however, that GPT-3's text summarization capabilities are still relatively new and may not yet be as refined as those offered by other established APIs. Nonetheless, given GPT-3's remarkable capabilities, I believe it is worth a try, as it holds great potential in the field of NLP and will likely continue to improve with time.
IBM Watson's Natural Language Understanding API
Created by one of the giants of the computer industry, IBM Watson's Natural Language Understanding API offers domain-specific models, multi-language support, customization capabilities, and additional NLP features, making it a good choice for businesses and organizations that require specialized text analysis and summarization within specific industries and domains.
How Does Text Summarization AI Work?
There are two main approaches to text summarization: extractive and abstractive. Extractive summarization selects sentences from the original text and arranges them into a summary. Abstractive summarization creates new sentences that aren't in the original but still convey its meaning.
In extractive summarization, the AI looks for the most important sentences, taking into consideration factors like the frequency of words, sentence length, and relevance to the topic.
By contrast, in abstractive summarization, the AI uses advanced NLP techniques such as deep learning and neural networks to understand the context and meaning of the text and then generate new, concise sentences.
Both types of text summarization AI are trained on vast amounts of data, so they can learn to identify important information and generate accurate summaries. These tools can be very useful for people who need to get a quick understanding of a text without having to read the entire document.
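To make the extractive idea concrete, here is a minimal sketch that scores sentences by how frequent their words are across the text. It is only an illustration under simple assumptions: the stop-word list is a tiny hypothetical one, the sentence splitter is a naive regular expression, and the function names (split_sentences, tokenize, frequency_summary) are made up for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only)
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "on"}

def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Lowercase word tokens with stop words removed
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]

def frequency_summary(text, n=2):
    sentences = split_sentences(text)
    # Count how often each word appears in the whole text
    word_freq = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for s in sentences:
        words = tokenize(s)
        # Average word frequency, so long sentences are not favored automatically
        scores.append(sum(word_freq[w] for w in words) / len(words) if words else 0.0)
    # Take the n highest-scoring sentences, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    top = sorted(ranked[:n])
    return " ".join(sentences[i] for i in top)

doc = ("Cats sleep a lot. Cats and dogs are popular pets. "
       "The weather is nice today. Dogs love to play with cats.")
print(frequency_summary(doc, n=2))
# prints "Cats sleep a lot. Cats and dogs are popular pets."
```

The off-topic weather sentence scores lowest because none of its words recur elsewhere, which is the intuition behind frequency-based extraction.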
Is Text Summarization Supervised or Unsupervised?
Text summarization can be either, depending on the method. Most modern neural summarizers are supervised: the algorithm is trained on a dataset of texts paired with summaries created by people, and through this training it learns how to identify the most important information in a text and create a concise summary. Many classic extractive methods, including the TF-IDF example later in this article, are unsupervised: they rank sentences using statistics of the text itself and need no reference summaries.
With more and better training data, a supervised model generally produces more accurate summaries. As a refresher, here are quick definitions of supervised and unsupervised learning:
Supervised learning is a type of machine learning where the algorithm is given input data and the corresponding correct output. The goal is to make the algorithm's predictions as accurate as possible. During training, the algorithm compares its predictions with the correct outputs and adjusts itself to reduce the error.
Unsupervised learning is different because the algorithm is not given an output at all. Instead, the goal is to identify patterns and relationships within the data. The algorithm works to uncover hidden structures in the data without being told what the answer should be.
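The difference can be sketched on a toy extractive task. In the snippet below, the sentences and the 0/1 labels are invented for illustration; the supervised half learns from those labels with scikit-learn's LogisticRegression, while the unsupervised half scores sentences without any labels. Treat it as a rough sketch of the two settings, not as a recommended model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The council approved the new budget on Monday.",
    "Reporters waited outside for an hour.",
    "The budget raises school funding by ten percent.",
    "Coffee was served in the lobby.",
]
# Hypothetical human labels: 1 = belongs in the summary, 0 = does not
labels = [1, 0, 1, 0]

features = TfidfVectorizer().fit_transform(sentences)

# Supervised: learn a mapping from sentence features to the human labels
clf = LogisticRegression().fit(features, labels)
supervised_scores = clf.predict_proba(features)[:, 1]  # probability of label 1

# Unsupervised: no labels; score each sentence by its similarity to all sentences
unsupervised_scores = cosine_similarity(features).sum(axis=0)

print(supervised_scores)
print(unsupervised_scores)
```

In practice the supervised model would be trained on many labeled documents and applied to new ones; scoring the training sentences here just keeps the sketch short.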
Is Text Summarization an NLP Use Case?
Text summarization is a clear use case for Natural Language Processing (NLP). NLP is a field of study focusing on interactions between computers and humans using natural language. Text summarization is a way to automatically condense and simplify text into a shorter, more digestible form.
This involves analyzing the content of the text and using NLP techniques to understand the meaning and context of the words and sentences in the text. 
This information is then used to identify the most important information and generate a summary that captures the essence of the original text. The use of NLP in text summarization makes it a powerful tool for reducing the time and effort required to understand the main points of a long document while still retaining the key information and insights.
How Does BERT Summarizer Work?
BERT is a pre-trained language model that uses a transformer-based architecture to capture the context of words in a sentence. Because it reads the context on both sides of each word, BERT can model the relationships between words and their meanings in a much deeper and more nuanced way.
To use BERT for summarization, we fine-tune the model on a dataset of text and human-created summaries. The goal of this process is to adapt the pre-trained BERT to the specific task of summarization so that it can create more accurate summaries.
During the fine-tuning, the model is presented with an input text and its corresponding human-created summary. The model then uses this labeled data to learn how to identify the most important information in a text and create a concise summary. The fine-tuned BERT is then used to generate summaries for new, unseen texts.
One of the key strengths of BERT is its ability to handle complex language structures and truly understand the meaning and context behind words in a sentence. This allows it to create much more accurate summaries than traditional NLP models that may not have the same level of understanding. 
In conclusion, BERT is a powerful tool for summarizing texts, thanks to its bidirectional context-aware architecture and its understanding of the relationships between words and their meanings.
Which Algorithm Is Best for Text Summarization?
There is no one "best" algorithm for text summarization, as the choice often depends on the specific requirements and constraints of the task at hand. Some popular algorithms include extractive summarization using TF-IDF and cosine similarity, as well as abstractive summarization using deep learning models such as recurrent neural networks (RNNs) and transformers.
Ultimately, the best algorithm will depend on the quality of the summarization desired, the size and nature of the text being summarized, and the computational resources available. In the following example, we will build a simple extractive text summarization model using TF-IDF (term frequency-inverse document frequency).
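For reference, the two quantities this approach relies on can be written out. The formulas below follow what we understand to be the default settings of scikit-learn's TfidfVectorizer (smoothed IDF, with each vector then scaled to unit length); other libraries define the weights slightly differently. Here each "document" d is one sentence, N is the number of sentences, tf(t, d) is the number of times term t occurs in d, and df(t) is the number of sentences that contain t.

```latex
\begin{aligned}
\mathrm{idf}(t) &= \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1 \\
\mathrm{tfidf}(t, d) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}(t) \\
\cos(\mathbf{a}, \mathbf{b}) &= \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
\end{aligned}
```

A word that appears in few sentences gets a high weight. Two sentences that share many highly weighted words have a cosine similarity close to 1, while sentences with no words in common score 0.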
An Example of Text Summarization in Python With Code
Step 1: Import the Necessary Libraries
The imports include:
1- The nltk library, which is a popular Python library for NLP.
2- The Python regular expression (re) library, which provides functions for pattern matching and substitution.
3- TfidfVectorizer from sklearn.feature_extraction.text, which is part of the scikit-learn library. It transforms text data into numerical vectors in a high-dimensional space and is used here for feature extraction and representation of the text.
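import nltk
import re
import numpy as np
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# One-time downloads of the NLTK data used below: the sentence tokenizer model
# and the stop-word list (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('stopwords')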
Step 2: Define the preprocess_text() Function
The preprocess_text function cleans and prepares a string of text for summarization. First, it removes punctuation marks and converts all characters to lowercase. Next, it eliminates stop words using the stopwords corpus from the Natural Language Toolkit (nltk) library. Stop words are common words such as "the", "and", and "is", which carry little meaning on their own and aren't useful for summarization. Finally, it joins the remaining words back into a single string, separated by spaces.
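def preprocess_text(text):
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()
    # Build the stop-word set once instead of reloading the list for every word
    stop_words = set(stopwords.words('english'))
    words = [word for word in text.split() if word not in stop_words]
    return " ".join(words)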
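To see what these steps do to a sentence, here is a small stand-alone sketch. It assumes a tiny hypothetical stop-word set in place of NLTK's English list so that it runs without any downloads, and preprocess_demo is an illustrative name, not part of the article's code.

```python
import re

# Hypothetical mini stop-word list; the article's function uses NLTK's English list
DEMO_STOP_WORDS = {"a", "the", "they", "on", "to", "his"}

def preprocess_demo(text, stop_words=DEMO_STOP_WORDS):
    # Strip punctuation, lowercase, then drop stop words
    text = re.sub(r"[^\w\s]", "", text).lower()
    return " ".join(w for w in text.split() if w not in stop_words)

print(preprocess_demo("On the ship, they meet a one-legged pirate named Long John Silver."))
# prints "ship meet onelegged pirate named long john silver"
print(preprocess_demo("He pays Jim's father."))
# prints "he pays jims father"
```

Note that the punctuation pattern also removes hyphens and apostrophes, so "one-legged" becomes "onelegged" and "Jim's" becomes "jims".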
Step 3: Define the summarize_text() Function
The summarize_text function summarizes a given string of text. The optional parameter n sets how many sentences the summary contains (10 by default). The text is first split into individual sentences, each of which is preprocessed with the preprocess_text function. The TfidfVectorizer from the scikit-learn library converts these preprocessed sentences into numerical vectors, and the cosine_similarity function calculates the similarity between every pair of sentences, producing a similarity matrix.
Summing each column of this matrix gives an importance score for each sentence. The n sentences with the highest scores are selected, put back in their original order, and joined to form the final summary. Because the function joins the preprocessed sentences, the summary comes out lowercased, with punctuation and stop words removed.
def summarize_text(text, n=10):
    # Split the text into sentences, then clean each one
    sentences = [preprocess_text(sentence) for sentence in nltk.sent_tokenize(text)]
    tfidf = TfidfVectorizer().fit_transform(sentences)
    # Pairwise cosine similarity between the sentence vectors
    similarity_matrix = np.asarray(cosine_similarity(tfidf))
    # A sentence's importance is its total similarity to all sentences
    sentence_importance = np.sum(similarity_matrix, axis=0)
    ranked_sentence_indexes = [item[0] for item in sorted(enumerate(sentence_importance), key=lambda item: item[1], reverse=True)]
    # Keep the top n sentences, restored to document order
    selected_sentences = sorted(ranked_sentence_indexes[:n])
    # Note: the summary is built from the preprocessed sentences
    summarized_text = ' '.join([sentences[i] for i in selected_sentences])
    return summarized_text
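To make the scoring step in summarize_text concrete, here is a toy example with a made-up 3x3 similarity matrix. The numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical cosine similarities between three sentences
# (symmetric, with 1.0 on the diagonal because each sentence matches itself)
similarity_matrix = np.array([
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

# Column sums: how similar each sentence is to all sentences, itself included
sentence_importance = similarity_matrix.sum(axis=0)

# Sentence indexes from most to least important
ranking = np.argsort(-sentence_importance)

# Keep the top two, then sort so they appear in document order
selected = sorted(ranking[:2].tolist())

print(sentence_importance)  # sums are 1.7, 1.9 and 1.4
print(ranking.tolist())     # [1, 0, 2]
print(selected)             # [0, 1]
```

The second sentence ranks first because it overlaps with both of the others, which is the sense in which a "central" sentence is treated as important.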
Step 4: Create a Test String
Below is the paragraph that we will use to test the model.
text = "A young man named Jim Hawkins is the narrator of the story. He lives in England with his mother and father. They run the Admiral Benbow Inn. One day, an old sailor named Billy Bones comes to stay at the inn. He pays Jim's father to keep a large sea chest for him. Later, Billy Bones dies and Jim finds a map in the chest. The map shows the location of a treasure on a mysterious island. Jim, Dr. Livesey, and Squire Trelawney decide to go to the island and look for the treasure. They set sail on the ship called the Hispaniola. On the ship, they meet a one-legged pirate named Long John Silver. They soon discover that Long John Silver and his men plan to steal the treasure. Jim and the other men must find a way to stop them."
Step 5: Run the summarize_text() Function and Print the Output
print(summarize_text(text))
Output with n=5
one day old sailor named billy bones comes stay inn later billy bones dies jim finds map chest jim dr livesey squire trelawney decide go island look treasure ship meet onelegged pirate named long john silver soon discover long john silver men plan steal treasure
Output with n=10
young man named jim hawkins narrator story lives england mother father one day old sailor named billy bones comes stay inn pays jims father keep large sea chest later billy bones dies jim finds map chest map shows location treasure mysterious island jim dr livesey squire trelawney decide go island look treasure ship meet onelegged pirate named long john silver soon discover long john silver men plan steal treasure jim men must find way stop
Output with n=15
young man named jim hawkins narrator story lives england mother father run admiral benbow inn one day old sailor named billy bones comes stay inn pays jims father keep large sea chest later billy bones dies jim finds map chest map shows location treasure mysterious island jim dr livesey squire trelawney decide go island look treasure set sail ship called hispaniola ship meet onelegged pirate named long john silver soon discover long john silver men plan steal treasure jim men must find way stop
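The outputs above are lowercased and stripped of stop words and punctuation because summarize_text joins the preprocessed sentences. If a readable summary is preferred, one possible variant ranks the sentences in the same way but returns the original ones. The sketch below is an assumption-laden alternative, not the article's function: summarize_original is a hypothetical name, sentences are split with a naive regular expression rather than nltk.sent_tokenize (so abbreviations such as "Dr." would be split incorrectly), and stop-word removal is delegated to TfidfVectorizer's built-in English list.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize_original(text, n=3):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # The vectorizer lowercases and removes English stop words internally
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # Importance = total cosine similarity of a sentence to all sentences
    importance = cosine_similarity(tfidf).sum(axis=0)
    # Top n sentence indexes, restored to document order
    top = sorted(np.argsort(-importance)[:n].tolist())
    # Join the untouched sentences, so capitalization and punctuation survive
    return " ".join(sentences[i] for i in top)

sample = ("Billy Bones stays at the inn. Billy Bones keeps a sea chest at the inn. "
          "Jim finds a map in the sea chest. The weather was cold.")
print(summarize_original(sample, n=2))
```

The ranking logic is unchanged; only the text that gets joined at the end differs.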
Conclusion
In conclusion, text summarization is a rapidly growing field that is transforming the way we process and analyze vast amounts of information. By automating the summarization process, we are able to save time and effort in extracting the most important points from lengthy texts.
With the advent of AI, the potential for text summarization has reached new heights, as machines are now able to analyze complex language structures and understand the meaning and context behind words. This has paved the way for powerful summarization APIs that can provide us with quick and accurate summaries of even the most complex texts.
The future of text summarization is bright: with continued advancements in NLP and models like BERT, we can expect to see improved summarization technologies in the near future. However, it is important to use these technologies ethically and responsibly.