# Enriching Word Vectors with Sub-word Information

The next step for word embeddings. Made by Devjyoti Chakraborty using Weights & Biases

## Introduction

### Check out the Code | Check out the Repository

Intelligence has sometimes been defined as the ability to reason, the capacity to understand logic. This holds true for most scenarios, until the question of "teaching" computers intelligence comes into the fray.
Teaching a computer with a knowledge base, in my opinion, would amount to pseudo-intelligence. The computer will always be bound by the rules and paradigms we have set down in our database of knowledge; it will never go out of its way to infer and understand logic. To put it in perspective, it is no better than a circus lion taught to jump through a flaming hoop in return for food when, if it looked hard enough, it would find food all around it.
So how do we answer the question of teaching computers intelligence without explicitly stating everything? For that, let's change the process a little bit. Instead of defining everything in the knowledge base, how about we let the computer comprehend the meaning of objects through numbers (vectors) in n-dimensional space?
Let us take an example. Suppose we want our computer to understand the concept of a light bulb 💡. We take a 2-dimensional space, where one dimension depicts the intensity of light while the other depicts the size of the object. With such a simple setup, we can represent the light bulb as a point in the 2-dimensional space. This space allows the computer to comprehend not only a light bulb but also a tube light. With vector representations of objects, the computer can now tweak the points and understand different objects through them. Is this an easier way for a computer to understand things? If you agree, you'd be on the same page as Geoffrey Hinton, one of the godfathers of AI. So what do you have to say about the statement, "Intelligence is a vector"?
Representation of words with vectors is a long-conceived concept. In our previous articles on Word2Vec and GloVe we got into the nitty-gritty of the idea. There we not only talked about the intuition behind the idea but also coded our way through an Embedding layer. This article serves as a follow-up report. Here we try deciphering the paper Enriching Word Vectors with Subword Information by Piotr Bojanowski et al., considered the immediate successor of Word2Vec. In it, the authors consider the morphology of individual words for a better vectorial representation.

## Intuition

Let us talk a bit about Word2Vec. Its central proposal is simple yet powerful:
The meaning of a word depends on its surrounding words.
This led to two strategies, Skip-Gram and CBOW, which use this concept to encode words into vectors. The most interesting insight about word vectors is that linear algebra on well-trained word vectors directly translates to logic. A very famous example was King - Man + Woman ≈ Queen. A keen reader can head over to the articles on Word2Vec and GloVe for a better grasp of the topic.
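As a toy illustration of this "vector arithmetic as logic" idea, here is a sketch with hand-crafted 3-d vectors (invented for this example, not trained embeddings; the dimensions loosely stand for royalty, masculinity, and femininity):

```python
import numpy as np

# Hypothetical, hand-crafted 3-d vectors, not trained embeddings.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.8]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen.
query = vecs["king"] - vecs["man"] + vecs["woman"]
nearest = max(vecs, key=lambda w: cosine(query, vecs[w]))
print(nearest)  # → queen
```

With trained embeddings the same arithmetic works in hundreds of dimensions, which is exactly what made the King/Queen example famous.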
With the help of vector representations and associations (surrounding words), the ingenious idea of word2vec let us teach a machine the meaning of words through vectors in an N-dimensional space. But since we teach each word as a single vector, we completely ignore the morphology and etymology behind the words. This discards a vital part of the information: how a word came to be, and how the sub-words of one word can be linked to another word.
When an erudite essay is written, or a complex idea or emotion is expressed, rare and complex words are used. Models trained purely through word association (the Word2Vec and GloVe family) may fail in such cases. On the other hand, such rare words can often be deciphered through morphology, since they are constructed from parts of other words or carry etymologically decipherable characteristics. Languages like Sanskrit and its modern Indian derivatives are particularly apt for such training, as their words are systematically derivable from root syllables.
Understanding morphology
An excerpt from the paper:
> For example, in French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns. These languages contain many word forms that occur rarely (or not at all) in the training corpus, making it difficult to learn good word representations. Because many word formations follow rules, it is possible to improve vector representations for morphologically rich languages by using character level information.
The authors of the paper suggest an extension of the Skip-Gram model. The main idea behind the skip-gram model was to predict multiple context words given a single center/target word.
For example, consider the sentences:
1. "The color red pops out"
2. "What is green in color?"
3. "Shades of the color yellow suits you"
If we train the model on the sentences above we would eventually see that the word "color" would be associated with the words "red", "green" and "yellow". The basic objective of finding the meaning of the word 'color' through word associations is indeed accomplished.
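To make the association concrete, here is a minimal sketch (plain Python, with a hypothetical window size of 2) of how skip-gram extracts (target, context) pairs from the first sentence:

```python
def skipgram_pairs(sentence, window=2):
    """Yield (target, context) pairs within a fixed window around each word."""
    tokens = sentence.lower().split()
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs("The color red pops out")
# 'color' is trained to predict its neighbours:
print([c for t, c in pairs if t == "color"])  # → ['the', 'red', 'pops']
```

Training on all three sentences would produce pairs linking "color" to "red", "green", and "yellow", which is precisely the association described above.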
On the other hand, language as an entity has a much deeper and more defined meaning. There are an estimated 7,000 human languages in the world, each with distinctive dialects and a defined vocabulary. To understand how certain words came to be in a particular language, we must break the words down into defined sub-words. Delegating the task to word2vec models means choosing to ignore this subtle beauty of languages. Take the following example:
An example of word morphology
The word 'unreadable' can easily be broken down into its constituent sub-words, and thus its meaning can be derived. The paper incorporates this very information into the skip-gram model by replacing the target/center word with its constituent sub-words. This way, we can also correlate the sub-words of 'unreadable' with those of a different word like 'unstoppable'. Even though the two words carry different meanings, they share common sub-words, which implies their formation followed a similar idea. Another benefit of this approach is that the meaning of a word now depends on two factors: word associations and morphology. This means rarely occurring words can be deciphered despite their lack of occurrences.
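A quick sketch of that sub-word overlap, using boundary-marked character trigrams in the spirit of the paper (the helper function name is ours):

```python
def char_ngrams(word, n=3):
    """Boundary-marked character n-grams of a word, plus the full word itself."""
    w = f"<{word}>"
    return {w[i:i + n] for i in range(len(w) - n + 1)} | {w}

# 'unreadable' and 'unstoppable' mean different things but share sub-words.
shared = char_ngrams("unreadable") & char_ngrams("unstoppable")
print(sorted(shared))  # → ['<un', 'abl', 'ble', 'le>']
```

The shared trigrams '<un', 'abl', 'ble', and 'le>' are exactly the prefix and suffix pieces that encode the "un-...-able" construction common to both words.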

## The Objective

The method proposed by Piotr Bojanowski et al. in this paper is a direct extension of the Skip-Gram model. Before we dive into the subword space, let us briefly revisit Skip-Grams.
Given a word vocabulary of size W, where a word is identified by its index w\in \{1,...,W\} the goal is to learn a vectorial representation for each word w. Given a large training corpus represented as a sequence of words w_{1},...,w_{T} the objective of the skip-gram model is to maximize the following log-likelihood:
J(\theta)=\sum ^{T}_{t=1}\sum _{c\in C_{t}}\log p( w_{c} |w_{t})
Where the context C_t is the set of indices of words surrounding the target word w_t.
The question arises about the parameterization of the log-likelihood function, to be specific "What do we tweak to maximize the log-likelihood?". The answer to this lies in the probability function.
p( w_{c} |w_{t}) =\frac{\exp( s( w_{t} ,w_{c}))}{\sum ^{W}_{j=1}\exp( s( w_{t} ,j))}
The probability function is indeed a softmax. Here s(x, y) is the scoring function that calculates the similarity between the vectors x and y. The parameters tweaked while maximizing the log-likelihood are the vector representations of the words themselves: the objective is constructed so that the loss decreases as the vectorial representations of words improve.
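As a numerical sketch of the softmax above (random toy vectors, with the scalar product standing in for the scoring function s):

```python
import numpy as np

rng = np.random.default_rng(0)
W, dim = 6, 4                             # toy vocabulary of 6 words
target_vecs = rng.normal(size=(W, dim))   # one vector per target word
context_vecs = rng.normal(size=(W, dim))  # one vector per context word

def p_context_given_target(c, t):
    """Softmax over dot-product scores s(w_t, j) across the vocabulary."""
    scores = context_vecs @ target_vecs[t]   # s(w_t, j) for every word j
    exp = np.exp(scores - scores.max())      # numerically stable softmax
    return float(exp[c] / exp.sum())

probs = [p_context_given_target(c, t=2) for c in range(W)]
print(sum(probs))   # the probabilities sum to 1 over the vocabulary
```

Note that the denominator runs over the entire vocabulary, which is why a full softmax is expensive for realistic vocabulary sizes.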
A softmax implementation, however, poses a problem: it frames the task as picking out a single context word over the whole vocabulary.
> However, such a model is not adapted to our case as it implies that, given a word w_t , we only predict one context word w_c .
This led to framing the probability function differently: the problem of predicting context words is recast as a set of independent binary classification tasks, predicting the presence or absence of each context word. With negative sampling in the picture, this binary classification uses two kinds of context words, positive and negative. The positive context words are the ones lying in the same window as the target word, while the negative context words are sampled from outside that window.
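A minimal sketch of the resulting negative-sampling objective (toy NumPy vectors and a logistic loss; an illustration of the idea, not the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(target, positive, negatives):
    """Binary classification: score the positive context high, negatives low."""
    loss = -np.log(sigmoid(positive @ target))              # positive pair
    loss += -np.sum(np.log(sigmoid(-(negatives @ target))))  # negative pairs
    return float(loss)

rng = np.random.default_rng(1)
t = rng.normal(size=4)

# A configuration where the positive context aligns with the target (and the
# negatives anti-align) should incur less loss than one with the roles swapped.
good = negative_sampling_loss(t, positive=t, negatives=-np.stack([t, t]))
bad = negative_sampling_loss(t, positive=-t, negatives=np.stack([t, t]))
print(good < bad)
```

Because each term involves only the sampled words rather than the full vocabulary, this objective sidesteps the expensive softmax normalization.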

### Subword space

> By using a distinct vector representation for each word, the skipgram model ignores the internal structure of words.
To combat this issue, the authors propose a different scoring function. To go deeper into it, one needs to understand the setup they propose as well. Each word w is considered as a bag of character n-grams. Special boundary symbols < and > are added at the beginning and end of each word, demarcating prefixes and suffixes from other character sequences. The word w itself is also added to the set of its n-grams. Let's understand this through a snippet of code.
```python
>>> word = "where"
>>> word = f"<{word}>"
>>> n_grams = [word[i:i+3] for i in range(len(word)-2)] + [word]
>>> n_grams
['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```
Now, our scoring function s( w_{t} , w_c) takes two vectors as parameters, namely the target and the context vector. With subwords in the frame, the scoring function is modified so that the target word is represented by its subwords: the score is computed as the sum of the scalar products between the context vector and each of the target's n-gram vectors.
Suppose that you are given a dictionary of n-grams of size G. Given a word w, let us denote by G_w \subset \{1,...,G\} the set of n-grams appearing in w. We associate a vector representation z_g to each n-gram g. We represent a word by the sum of the vector representations of its n-grams. We thus obtain the scoring function:
s( w,\ c) = \sum\limits _{g\in G_{w}} z^{T}_{g} v_{c}
The most important thing to note here is that we consider the vector representation of a target word to be the sum of all vectors of its n-grams.
Here z_{g} refers to the vector of each of the n-grams of the target word. For example, if our target word is '<where>', z_g would be the vectors corresponding to '<wh', 'whe', 'her', 'ere', 're>' and '<where>'. The key thing to remember is that a sub-word appearing in one word can also appear in another word; this is how mutual information sharing takes place. This simple model enables the architecture to share information about sub-words across words.
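A sketch of the sub-word scoring function under this setup (toy random vectors; the n-gram set is the one from the "where" example above):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5

# One hypothetical vector z_g per n-gram of the word "where".
ngrams = ["<wh", "whe", "her", "ere", "re>", "<where>"]
ngram_vecs = {g: rng.normal(size=dim) for g in ngrams}
v_c = rng.normal(size=dim)   # a context word's vector

def score(word_ngrams, v_c):
    """s(w, c): dot the summed n-gram vectors of w with the context vector."""
    word_vec = np.sum([ngram_vecs[g] for g in word_ngrams], axis=0)
    return float(word_vec @ v_c)

s = score(ngrams, v_c)
# Summing the vectors first is equivalent to summing the per-n-gram dot
# products, matching the formula s(w, c) = sum_g z_g^T v_c.
```

By linearity, summing the n-gram vectors before the dot product gives exactly the sum over per-n-gram scores in the equation above.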

## Code

### Check out the Code | Check out the Repository

In this section, we look into the TensorFlow implementation of the paper. The code is heavily influenced by the official TensorFlow Word2Vec guide.
The most important part of the code is data preparation. The data that we take is a text file that has a lot of sentences that are separated by carriage returns.
```python
# Create a tf.data.Dataset with all the non-empty sentences
>>> text_ds = tf.data.TextLineDataset(path_to_file).filter(
...     lambda x: tf.cast(tf.strings.length(x), bool))
>>> for text in text_ds.take(5):
...     print(text)
tf.Tensor(b'First Citizen:', shape=(), dtype=string)
tf.Tensor(b'Before we proceed any further, hear me speak.', shape=(), dtype=string)
tf.Tensor(b'All:', shape=(), dtype=string)
tf.Tensor(b'Speak, speak.', shape=(), dtype=string)
tf.Tensor(b'First Citizen:', shape=(), dtype=string)
```
We then tokenize and standardize each of the sentences.
```python
# We create a custom standardization function to lowercase the text and
# remove punctuation.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    return tf.strings.regex_replace(lowercase,
                                    '[%s]' % re.escape(string.punctuation), '')

# Define the vocabulary size and number of words in a sequence.
vocab_size = 4096
sequence_length = 10

# Use the text vectorization layer to normalize, split, and map strings to
# integers. Set output_sequence_length to pad all samples to the same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

# Build the vocabulary.
vectorize_layer.adapt(text_ds.batch(1024))
```
Once we have the tokens, we need to arrange them into a supervised setup for learning.
Source: Word2Vec
For our training process, we do not use tf.random.log_uniform_candidate_sampler; instead, we customize the process to include better negative samples. See the StackOverflow thread to follow the discussion.
For our subwords, the setup changes a little. We do not have a single target word index in the initial position as shown in the figure. Instead, we have the indices of the subwords of the target word.
With a batch size of 1000, the shapes in the dataset look like this:
• (1000, None): 1000 ragged lists of target sub-word (n-gram) indices.
• (1000, 5, 1): 1000 groups of 5 context words each.
• (1000, 5): 1000 groups of 5 labels each.
The model itself is rather simple:
```python
class Word2Vec(Model):
    def __init__(self, subword_vocab_size, vocab_size, embedding_dim):
        super(Word2Vec, self).__init__()
        self.target_embedding = Embedding(subword_vocab_size + 1,
                                          embedding_dim,
                                          input_length=None,
                                          name="w2v_embedding")
        # num_ns (negative samples per positive) is defined earlier.
        self.context_embedding = Embedding(vocab_size + 1,
                                           embedding_dim,
                                           input_length=num_ns + 1)
        self.dots = Dot(axes=(3, 1))
        self.flatten = Flatten()

    def call(self, pair):
        target, context = pair
        # The target word vector is the sum of its subword vectors.
        we = tf.math.reduce_sum(self.target_embedding(target), axis=1)
        ce = self.context_embedding(context)
        dots = self.dots([ce, we])
        return self.flatten(dots)
```
We define the two embedding layers. The context and the target embeddings are then scored by performing a dot product. This score is then evaluated and the loss is back-propagated to tweak the embeddings.
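To see what the dot-product scoring computes, the forward pass can be mirrored in NumPy (toy shapes and random values invented for this sketch, not the training code itself):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_sub, num_ctx, dim = 2, 3, 5, 8

# Hypothetical embedding outputs with the shapes discussed above:
target_emb = rng.normal(size=(batch, n_sub, dim))    # subword vectors per example
context_emb = rng.normal(size=(batch, num_ctx, dim)) # candidate context words

we = target_emb.sum(axis=1)                       # (batch, dim): summed subwords
dots = np.einsum('bcd,bd->bc', context_emb, we)   # (batch, num_ctx) scores
```

Each row of `dots` holds one score per candidate context word, which the binary labels then push up (positive) or down (negatives) during training.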

## Results

The loss and the accuracy of the model are shown here. Both the metrics seem to be doing well here.

### Embedding Projections

Another way of looking at the embeddings is to view them on a projector. TensorFlow has a great tool for visualizing just that. One can create the vectors.tsv and metadata.tsv files for an embedding and load them into the projector. The projector applies dimensionality reduction techniques like PCA to coalesce our data into a visual vector space while still retaining important information. For quick access, we have uploaded the tsv files as an artifact to the wandb project. Please feel free to download the artifacts and use them. We have also uploaded the files to our GitHub repository.
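A minimal sketch of producing the two .tsv files (the vocabulary and weight matrix here are illustrative stand-ins for the trained values):

```python
import io

# Hypothetical stand-ins for the trained embedding matrix and vocabulary:
vocab = ["the", "color", "red"]
weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# vectors.tsv holds one tab-separated embedding per line;
# metadata.tsv holds the matching word on each line.
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
    for word, vec in zip(vocab, weights):
        out_v.write('\t'.join(str(x) for x in vec) + '\n')
        out_m.write(word + '\n')
```

Both files can then be uploaded directly to the TensorFlow Embedding Projector.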

#### Context

After the loss stabilized substantially, we decided to see how our context embeddings had shaped up. We chose "for" as our exploratory word. Notice how words like "the", "a", "of", and "with" appear nearest. This suggests our model learned to group words with predominantly grammatical meaning together.
The word searched: "for"
Next, we chose a more rarely occurring word, "secrets". We see that our model has picked up on words like "safeguard", "Signal", and "strangely". Keeping in mind that our data was just some Shakespearean passages, the model has done a comparatively stellar job!
The word searched: "secrets"

### Target

The target embeddings contain all the sub-words conceived from the given data. We first search the bracketed token "<the>".
The word searched: "<the>"
We see a similar result when searching for the word "the". Words with similar sub-words appear as its nearest neighbors. We can conclude that we have captured the morphology of words through our model. One can experiment with n-grams of different sizes to capture more semantic meaning.
The word searched: "the"

## Conclusion

The work presented in this paper served as the acclaimed successor to word2vec and went on to become one of the three foundations of fastText, which is considered a benchmark in natural language processing since it utilizes many such concepts to generate efficient word representations.
Our experiments show that the paper uses simple methods to learn word representations while also learning sub-word information in the process. An ideal successor to the skip-gram method, the model trains much faster than other legacy models and outperforms baselines built without sub-word information and morphological analysis.
To conclude, working with sub-word information takes us one step closer to letting computers harness the beauty and power of language to its full potential.