
NLP Basics: What is Tokenization and How Do You Do It?

A tutorial covering Tokenization, with code samples showing how to use tokenizers from popular libraries such as nltk and TensorFlow.


Introduction

While there's been an incredible amount of NLP research in the past decade, a key and often under-explored step is Tokenization, which is simply the process of converting natural text into smaller parts known as "tokens."
A "token" here is a deliberately loose term: it can refer to words, subwords, or even individual characters. Tokenization simply refers to breaking a sentence down into its constituent parts.
For instance, in the statement:
"Coffee is superior than chai"
One way to split the statement into smaller chunks is to separate out words from the sentence i.e.
"Coffee", "is", "superior", "than", "chai"
That's the essence of Tokenization.
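
To make the word/subword/character distinction above concrete, here's a minimal sketch using only plain Python (the subword comment is purely illustrative; real subword vocabularies are learned from data by libraries such as Hugging Face's tokenizers, which this report doesn't cover):

text = "Coffee is superior to chai"

# Word-level tokens: split on whitespace
word_tokens = text.split()
print(word_tokens)      # ['Coffee', 'is', 'superior', 'to', 'chai']

# Character-level tokens: every character (including spaces) becomes a token
char_tokens = list(text)
print(char_tokens[:8])  # ['C', 'o', 'f', 'f', 'e', 'e', ' ', 'i']

# Subword tokens sit in between: a learned vocabulary might split a word like
# "blinking" into pieces such as "blink" + "ing"; there is no one-line built-in for this.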


Now, of course, this is not the end of things 😜. Real-world applications require much smarter thinking and more nuanced ways to tokenize text.
For instance, in the example above we didn't even consider how to handle punctuation like a "?" or a ",". How do we tokenize words that come attached to punctuation marks? Is "Hi" different from "Hi,"? Not particularly, but how do we convey that to the model? And do we consider "talk" and "talking" different tokens? They convey the same core meaning, but we also want our models to have some understanding of verb tenses, right?
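To see how a real tokenizer handles a couple of these cases, here's a small sketch using nltk (the token lists in the comments are what I'd expect from word_tokenize and PorterStemmer; stemming isn't covered in this report, but it's one standard answer to the talk/talking question):

from nltk import word_tokenize
from nltk.stem import PorterStemmer

# Punctuation is split off into its own tokens, so "Hi" and "Hi," yield the
# same word token plus a separate "," token.
word_tokenize("Hi, how are you?")
# > ['Hi', ',', 'how', 'are', 'you', '?']

# Stemming is one (crude) way to map "talking" and "talk" onto the same token.
stemmer = PorterStemmer()
stemmer.stem("talking")
# > 'talk'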
There are myriad edge cases to think about, and we'll cover them in future reports. For now, let's see how we can tokenize arbitrary sentences with functions from a few popular libraries.

Tokenizing in Python

“Each star was like a distant eye blinking at the human world from the depths of the universe.” ― Liu Cixin, Supernova Era
Let's work with this statement for the purposes of this tutorial.


Let's consider the easiest approach first, i.e. word-by-word. The simplest and most logical way to do that is to split the statement on the spaces between words (this assumes our text is grammatically clean with no stray whitespace, but let's make that assumption for now).
Python strings come with a built-in method, split(), which lets us split a sentence on a given separator. In our case the separator is a simple space.
sentence = "Each star was like a distant eye blinking at the human world from the depths of the universe"
sentence.split(" ") # <-- built-in Python string method
This gives us the output:
['Each', 'star', 'was', 'like', 'a', 'distant', 'eye', 'blinking', 'at', 'the', 'human', 'world', 'from', 'the', 'depths', 'of', 'the', 'universe']
But that was an easy statement; let's consider something more complex.
"The man who passes the sentence should swing the sword. If you would take a man's life, you owe it to him to look into his eyes and hear his final words. And if you cannot bear to do that, then perhaps the man does not deserve to die." - Eddard Stark, head of House Stark, the Lord of Winterfell, Lord Paramount and Warden of the North.
Now this is a longer passage with multiple sentences and clauses. For a statement like this, it's often better to split the text into sentences first and then process it sentence-by-sentence, or break each sentence down further into words.
For this task we could use sent_tokenize from nltk:
from nltk import sent_tokenize

# Note: you may need to run nltk.download('punkt') once to fetch the sentence tokenizer models.
got_quote = "The man who passes the sentence should swing the sword. If you would take a man's life, you owe it to him to look into his eyes and hear his final words. And if you cannot bear to do that, then perhaps the man does not deserve to die."

sent_tokenize(got_quote)

> ['The man who passes the sentence should swing the sword.',
"If you would take a man's life, you owe it to him to look into his eyes and hear his final words.",
'And if you cannot bear to do that, then perhaps the man does not deserve to die.']
Or we could use word_tokenize from nltk, which applies NLTK's recommended word tokenizer to split each sentence into word and punctuation tokens.
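For instance, here's a minimal sketch of what that looks like on the first sentence of the quote (note how the final period becomes its own token):

from nltk import word_tokenize

word_tokenize("The man who passes the sentence should swing the sword.")

> ['The', 'man', 'who', 'passes', 'the', 'sentence', 'should', 'swing', 'the', 'sword', '.']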

Summary

This is meant to be a quick introduction to tokenizing, which is a key concept that allows models to better understand large blocks of text by breaking them up into smaller, more digestible pieces. If you want more reports covering the math and "from-scratch" code implementations let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.
