
NLP Basics: What is Tokenization and How Do You Do It?

A tutorial covering Tokenization, with code samples showing how to use tokenizers from popular libraries such as nltk and TensorFlow.


Introduction

While there's been an incredible amount of NLP research in the past decade, a key and often under-explored step is Tokenization, which is simply the process of converting natural text into smaller parts known as "tokens."
A "token" here is a deliberately loose term: it can refer to words, subwords, or even individual characters. Tokenization simply refers to breaking a sentence down into its constituent parts.
For instance, in the statement:
"Coffee is superior than chai"
One way to split the statement into smaller chunks is to separate out words from the sentence i.e.
"Coffee", "is", "superior", "than", "chai"
That's the essence of Tokenization.
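
To make the word/subword/character distinction above concrete, here's a minimal sketch using only plain Python (the subword comment is purely illustrative; real subword vocabularies are learned from data by libraries such as Hugging Face's tokenizers, which this report doesn't cover):

text = "Coffee is superior to chai"

# Word-level tokens: split on whitespace
word_tokens = text.split()
print(word_tokens)      # ['Coffee', 'is', 'superior', 'to', 'chai']

# Character-level tokens: every character (including spaces) becomes a token
char_tokens = list(text)
print(char_tokens[:8])  # ['C', 'o', 'f', 'f', 'e', 'e', ' ', 'i']

# Subword tokens sit in between: a learned vocabulary might split a word like
# "blinking" into pieces such as "blink" + "ing"; there is no one-line built-in for this.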


Now, of course, this is not the end of things 😜. Real-world applications require much smarter thinking and more nuanced ways to tokenize text.
For instance, in the example above we didn't even consider how to handle punctuation like a "?" or a ",". How do we tokenize words that come attached to punctuation marks? Is "Hi" different from "Hi,"? Not particularly, but how do we convey that to the model? And do we consider "talk" and "talking" different tokens? They convey the same core meaning, but we also want our models to have some understanding of verb tenses, right?
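To see how a real tokenizer handles a couple of these cases, here's a small sketch using nltk (the token lists in the comments are what I'd expect from word_tokenize and PorterStemmer; stemming isn't covered in this report, but it's one standard answer to the talk/talking question):

from nltk import word_tokenize
from nltk.stem import PorterStemmer

# Punctuation is split off into its own tokens, so "Hi" and "Hi," yield the
# same word token plus a separate "," token.
word_tokenize("Hi, how are you?")
# > ['Hi', ',', 'how', 'are', 'you', '?']

# Stemming is one (crude) way to map "talking" and "talk" onto the same token.
stemmer = PorterStemmer()
stemmer.stem("talking")
# > 'talk'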
There are myriad edge cases to think about, and we'll cover them in future reports. For now, let's see how we can tokenize arbitrary sentences with functions from a few popular libraries.

Tokenizing in Python

“Each star was like a distant eye blinking at the human world from the depths of the universe.” ― Liu Cixin, Supernova Era
Let's work with this statement for the purposes of this tutorial.


Let's consider the easiest approach first, i.e. word-by-word. The simplest and most logical way to do that is to split the statement on the spaces between words (this assumes our text is grammatically clean with no stray whitespace, but let's make that assumption for now).
Python strings come with a built-in method, split(), which lets us split a sentence on a given separator. In our case the separator is a simple space.
sentence = "Each star was like a distant eye blinking at the human world from the depths of the universe"
sentence.split(" ") # <-- built-in Python string method
This gives us the output:
['Each', 'star', 'was', 'like', 'a', 'distant', 'eye', 'blinking', 'at', 'the', 'human', 'world', 'from', 'the', 'depths', 'of', 'the', 'universe']
But that was an easy statement; let's consider something more complex.
"The man who passes the sentence should swing the sword. If you would take a man's life, you owe it to him to look into his eyes and hear his final words. And if you cannot bear to do that, then perhaps the man does not deserve to die." - Eddard Stark, head of House Stark, the Lord of Winterfell, Lord Paramount and Warden of the North.
Now this is a longer passage with multiple sentences and clauses. For a statement like this, it's often better to split the text into sentences first and then process it sentence-by-sentence, or break each sentence down further into words.
For this task we could use sent_tokenize from nltk:
from nltk import sent_tokenize

# Note: you may need to run nltk.download('punkt') once to fetch the sentence tokenizer models.
got_quote = "The man who passes the sentence should swing the sword. If you would take a man's life, you owe it to him to look into his eyes and hear his final words. And if you cannot bear to do that, then perhaps the man does not deserve to die."

sent_tokenize(got_quote)

> ['The man who passes the sentence should swing the sword.',
"If you would take a man's life, you owe it to him to look into his eyes and hear his final words.",
'And if you cannot bear to do that, then perhaps the man does not deserve to die.']
Or we could use word_tokenize from nltk, which applies NLTK's recommended word tokenizer to split each sentence into word and punctuation tokens.
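For instance, here's a minimal sketch of what that looks like on the first sentence of the quote (note how the final period becomes its own token):

from nltk import word_tokenize

word_tokenize("The man who passes the sentence should swing the sword.")

> ['The', 'man', 'who', 'passes', 'the', 'sentence', 'should', 'swing', 'the', 'sword', '.']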

Summary

This is meant to be a quick introduction to tokenizing, which is a key concept that allows models to better understand large blocks of text by breaking them up into smaller, more digestible pieces. If you want more reports covering the math and "from-scratch" code implementations let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other fundamental development topics like GPU Utilization and Saving Models.
