Word2Vec

This is the first part of a series on learning word embeddings. Made by Devjyoti Chakraborty using Weights & Biases

Meaning

-> XKCD: I could care less <-

Humans are intelligent partly because we have a network of other human beings and a shared communication medium: language. Language is a relatively new development in the history of humanity; the senses, and learning through them, came long before it. With the advent of language, humans set themselves apart from other animals. Animals communicate, but they cannot transfer knowledge through that communication. Theoretical knowledge transfer is unique to us humans, and the key to it is language.

Which carries more information: a few megabytes of an image, or a sentence describing that image? This question startled me at first. With fast internet, megabytes of images can be transferred in the blink of an eye, while we humans speak and convey ideas rather slowly. That would suggest that language is a slow form of information transfer. Information theory has a different viewpoint: language is one of the best compression algorithms we have. Used well, a few words can describe not only some megabytes of image but also a very complex idea. The internet may be fast, but thanks to the high compression achieved by language, we can safely say the sentence carries more information than the image.

The most important question to address is: how do we represent the meaning of words? Words can be seen as symbols, and their meaning as ideas. Language is a powerful construct: it can convey an idea through mere patterns of letters (words). Now, take a moment and think. Communication through words is an abstract system designed by humans. Teaching language to other humans is hard enough; making computers understand it is a different ball game altogether. For entities that only understand 0s and 1s at their core, grasping words and their semantic meaning sounds like an impossible task.

How Computers Understand Words

The first attempt to capture the meaning of words for machines was WordNet. WordNet is a thesaurus of words and their synonyms. Such a dictionary helps identify words that mean the same thing in a given context. It is manually curated, which takes a lot of labour. This approach does seem to encode some meaning, but it is subjective: the word good may be similar to many words for one person and not for another. We need a better approach to encode the meaning of words.

Discrete representation, or localist representation, represents words as one-hot vectors. This lets us feed words to a computer, but the cons of this approach are unending. We cannot represent the meaning of words this way. A corpus can have many words, so the one-hot vectors would be vast and inefficient. To encode meaning, one could build a lookup table of pairs of similar words (through WordNet) and keep their one-hot representations in the table. Can you calculate the size of that table if there are 50,000 words in the corpus? 😖
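
To make this concrete, here is a minimal sketch of one-hot (localist) vectors for a toy vocabulary. Note that every vector is orthogonal to every other, so no similarity is encoded, and with 50,000 words each vector would need 50,000 entries.

import numpy as np

# A toy vocabulary; a real corpus could easily have 50,000 unique words.
vocab = ["good", "great", "king", "queen", "teacup"]
word_to_ix = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size):
    """Return the one-hot (localist) vector for a word."""
    vec = np.zeros(vocab_size)
    vec[word_to_ix[word]] = 1.0
    return vec

print(one_hot("king", len(vocab)))  # [0. 0. 1. 0. 0.]
# Every vector is orthogonal to every other one, so "good" and "great"
# look exactly as unrelated as "good" and "teacup".
print(one_hot("good", len(vocab)) @ one_hot("great", len(vocab)))  # 0.0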

Distributional semantics: here, a word's meaning is given by the words that frequently appear close to it. “You shall know a word by the company it keeps” - J.R. Firth. In this approach, we model a word's context by noting the words that appear near it. We accumulate these contexts and then generalize to obtain the meaning of the word. Come to think of it, similar words do indeed share similar contexts. This is called distributional semantics because the meaning of a word depends not only on the word itself but also on the words beside it; the meaning is distributed. Another way to see this is that the vectors built to represent words with this approach are no longer one-hot: they are dense, and the meaning is distributed across each dimension of the vector.
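
A minimal sketch of the "company it keeps" idea: count which words occur within a small window of a target word. The sentences below are made up purely for illustration.

from collections import Counter

# Toy corpus, purely illustrative.
sentences = [
    "the king ruled the kingdom".split(),
    "the queen ruled the kingdom".split(),
    "the teacup sat on the table".split(),
]

def context_counts(target, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            if word != target:
                continue
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    counts[sent[j]] += 1
    return counts

# "king" and "queen" end up with identical context profiles here,
# which is exactly why a distributional model will place them close together.
print(context_counts("king"))
print(context_counts("queen"))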

Word2Vec

Let’s discuss distributional semantics some more. Suppose you have a dataset that assigns a negative or a positive sentiment to a set of words. After training, your classifier is bound to group similar words according to their target sentiments. This is a straightforward way of getting a faux classifier that “understands” words. However, that does not achieve the primary objective of a computer algorithm that can communicate with you. Until 2013, many promising ideas and algorithms were introduced that slowly but steadily raised the bar. In 2013, however, the NLP community was swept off its feet by the introduction of word2vec. Before getting into the intricacies, let us visualize the concept together.

-> Word embeddings, an intuition <-

Let us say we have a corpus of four words: Man, King, Queen, and Teacup. What if we could create a table, keeping the words in the rows and each column referring to a property that defines those words? Refer to the table above: the words Man and King are given the same value in the gender column, while Queen gets the exact opposite, meaning Man and King share a gender while Queen has the opposite one. Teacup gets a neutral value since it is genderless. Similarly, in the alive column, Man, King, and Queen have the value 1, while Teacup has the value -1 since it is a non-living thing.

If we imagine the table as a vector space and its columns as the dimensions, each word becomes a vector with some weight in each dimension. This ingenious idea is precisely what word2vec envisions. Your classifier learns word associations from the corpus by expressing them in a matrix where each column is an embedding dimension, a direction in which the word has some weight. Together, these weights constitute the meaning of the word for your classifier with respect to the given corpus. However, a table like the one above is the ideal scenario, where the embeddings are crystal clear and decipherable. In most cases, the classifier decides the embeddings itself, and it becomes difficult to decode the meaning and reasoning behind each one. To sum up, word2vec is a technique that uses neural networks to learn vector representations of words, forming associations with other words with respect to a given corpus.
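
If we write the table out as actual vectors, the similarity structure falls out of simple dot products. A minimal NumPy sketch follows, with made-up values mirroring the description above; in real word2vec the dimensions are learned and usually not interpretable.

import numpy as np

# Hand-crafted 2-dimensional "embeddings": [gender, alive].
# The values are illustrative, mirroring the table described above.
words = {
    "man":    np.array([ 1.0,  1.0]),
    "king":   np.array([ 1.0,  1.0]),
    "queen":  np.array([-1.0,  1.0]),
    "teacup": np.array([ 0.0, -1.0]),
}

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(words["man"], words["king"]))    # 1.0  -> very similar
print(cosine(words["king"], words["teacup"])) # negative -> dissimilar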

Likelihood

In this section, we talk about the maths behind the problem statement and the solution. The problem is to find some way to derive meaning from words. The meaning must be encoded in such a way that it is easy for a computer to work with. In a distributed representation, the idea is to understand the meaning of a word from the words beside it (the context words).

The solution can be approached in two complementary ways. In the Continuous Bag of Words (CBOW) model, we predict the probability of a centre word given its context words. In the Skip-gram model, we predict the probability of the context words given the centre word.
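
To make the two setups concrete, here is a small sketch (with an arbitrary sentence and window size) of the training examples each model extracts from the same text.

tokens = "the quick brown fox jumps over the lazy dog".split()
m = 2  # context window size on each side

cbow_examples, skipgram_examples = [], []
for t, center in enumerate(tokens):
    context = [tokens[t + j] for j in range(-m, m + 1)
               if j != 0 and 0 <= t + j < len(tokens)]
    # CBOW: predict the centre word from its context words.
    cbow_examples.append((context, center))
    # Skip-gram: predict each context word from the centre word.
    skipgram_examples.extend((center, c) for c in context)

print(cbow_examples[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_examples[:3])  # [('the', 'quick'), ('the', 'brown'), ...]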

CBOW: For each position $t=1,2,\dots,T$ we predict the centre word in a window of fixed size $m$ given the context words $w_{t+j}$, where $-m\leqslant j\leqslant m$ and $j\neq 0$. $ L(\theta) = \prod_{t=1}^{T}\prod_{-m\leqslant j\leqslant m;\ j\neq 0} P(w_{t} \mid w_{t+j};\theta) $

Skip-gram: For each position $t=1,2,\dots,T$ we predict the context words in a window of fixed size $m$ given the centre word $w_{t}$. $ L(\theta) = \prod_{t=1}^{T}\prod_{-m\leqslant j\leqslant m;\ j\neq 0} P(w_{t+j} \mid w_{t};\theta) $

The symbols and their meaning:

  1. $\theta$: The parameter that is learned. Here it refers to the vector embeddings of each word.
  2. $t$: The variable that indexes each word position in the corpus. There are $T$ words in the corpus.
  3. $m$: The size of the context window.
  4. $j$: The variable that indexes the context words relative to the centre word.

The likelihood tells us how well our model is doing: how likely it is that our probabilistic model predicts the right words given a particular context.

Objective function

With the likelihood function in place, we can express its maximization as the minimization of the average negative log-likelihood. This average negative log-likelihood serves as our objective function.

$ J( \theta ) =-\frac{1}{T}\log L( \theta ) $

By minimizing $J(\theta)$, we are maximizing the likelihood of the model and hence succeed in modelling the language better.
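
Taking the log turns the product of probabilities into a sum of log-probabilities, which is exactly the quantity the training code later minimizes one window at a time. For the Skip-gram likelihood above, the objective expands to:

$ J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{-m\leqslant j\leqslant m;\ j\neq 0}\log P(w_{t+j}\mid w_{t};\theta) $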

The probability function

We have a likelihood function that needs to be maximized; equivalently, we minimize the objective function. The one missing ingredient is the probability function itself. We need an equation that models the probability of a word occurring in a given context.

We maintain two vector representations for each word in the vocabulary. For a word $w$, the two vectors are:

  1. $v_w$ - when the word is the **centre** word
  2. $u_w$ - when the word is the **context** word

$ P(w_{a} \mid w_{b}) = \frac{\exp(u_{a}^{T} v_{b})}{\sum_{i=1}^{V}\exp(u_{i}^{T} v_{b})} $

This looks like the softmax function, doesn't it? Let us break the equation down.

In the numerator, we have the term $\exp(u_{a}^{T}v_b)$. The dot product between the two word vectors signifies how close the two words are. Exponentiating it has a useful effect: it amplifies large values and diminishes small ones. Hence the numerator measures how close (similar) the two words are.

In the denominator we have the normalization term $\sum_{i=1}^{V}\exp(u_{i}^{T} v_{b})$. This normalizes the numerator and gives us a relative similarity. The formula translates directly into the probability of the word $w_a$ given that the word $w_b$ is present.
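
Here is a minimal NumPy sketch of this probability, using random toy vectors for $u$ and $v$ rather than trained embeddings.

import numpy as np

V, d = 6, 4                  # toy vocabulary size and embedding size
rng = np.random.default_rng(0)
u = rng.normal(size=(V, d))  # context ("outside") vectors, one per word
v = rng.normal(size=(V, d))  # centre vectors, one per word

def prob(a, b):
    """P(w_a | w_b): softmax of the dot products of every u_i with v_b."""
    scores = u @ v[b]                    # shape (V,): u_i^T v_b for every word i
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp[a] / exp.sum()

print(prob(2, 3))                         # probability of word 2 given centre word 3
print(sum(prob(a, 3) for a in range(V)))  # ~1.0, as probabilities should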

CBOW Tutorial


Word2vec comes in two flavours: CBOW (Continuous Bag of Words) and the Skip-gram model. This section takes the reader through a CBOW implementation in TensorFlow. The sections above should give readers the intuition needed to understand CBOW and Skip-gram; both are probabilistic models and are best understood through the objective function described earlier. In the following section, we go over a simple CBOW tutorial step by step. First, the required imports: we mainly work with NumPy arrays and tensors while utilizing TensorFlow's gradient tape, use Matplotlib to visualize the embeddings, and tqdm to track the time taken by our operations.

import numpy as np
import tensorflow as tf
from tqdm import tqdm
import matplotlib.pyplot as plt

For our data, we have taken reference from a comprehension exercise website.

Next, we tokenize our corpus by splitting it into words and removing punctuation and capitalization, making it easier for our architecture to group words. The tokens are then collected into a sorted set of unique words to form the vocabulary.

# Converts the data into tokens
tokenized_text = tf.keras.preprocessing.text.text_to_word_sequence(data)
tokenized_text_size = len(tokenized_text)
# Creates a vocab of unique words
vocab = sorted(set(tokenized_text))
vocab_size = len(vocab)
print('Vocab Size: {}'.format(len(vocab)))

This is a very important part of our code. Here we create a dictionary called vocab_to_ix, which maps each word to its index in the vocab set. ix_to_vocab is a NumPy array version of vocab, while text_as_int is a NumPy array of the tokenized text, containing the indices of the words instead of the words themselves.

# Map the vocab words to individual indices
vocab_to_ix = {c:ix for ix,c in enumerate(vocab)}
# Map the indices to the words in vocab
ix_to_vocab = np.array(vocab)
# Convert the data into numbers
text_as_int = np.array([vocab_to_ix[c] for c in tokenized_text])
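
A quick sanity check of these mappings, assuming the variables defined above:

# The word -> index -> word round trip should return the original word.
sample_word = tokenized_text[0]
idx = vocab_to_ix[sample_word]
print(sample_word, '->', idx, '->', ix_to_vocab[idx])
# The first ten tokens of the corpus, encoded as vocabulary indices.
print(text_as_int[:10])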

In the following code block, we define the embedding size, which, as explained above, is the number of dimensions used to represent each word in our corpus. We choose a window size of 5, which means a single centre word for every 4 context words. Here, context_vector and center_vector are our embedding matrices, in which our words will acquire meaning with respect to the given corpus.

EMBEDDING_SIZE = 9
WINDOW_SIZE = 5
opt = tf.optimizers.Adam()
iterations = 1000
# Here the word vectors are represented as a row
context_vector =  tf.Variable(np.random.rand(vocab_size, EMBEDDING_SIZE))
center_vector = tf.Variable(np.random.rand(vocab_size, EMBEDDING_SIZE))

In the most critical step, we define our train_step function, which houses the gradient tape. It takes a list of indices and a running list of losses as its parameters. The line with tf.GradientTape() as tape starts tracking the operations applied to the variables inside it (for computing derivatives). u_avg is the average of the context vectors in the window. The output is computed as the matrix multiplication between center_vector and u_avg, followed by a softmax. As in the paper, this gives the softmax distribution over all centre words given the set of context words. From it, we pick the target centre word, compute the negative log (softmax) loss, and feed it to tape.gradient. tape.gradient computes the derivatives of the loss with respect to center_vector and context_vector, and the optimizer updates both embedding matrices. So, to sum it up, each step nudges the values of our embedding matrices.

def train_step(indices, loss_list):
  """The training step

  Arguments:
    indices (list): The indices of the vocab words in the window
    loss_list (list): Running list that collects the loss at each step
  """
  with tf.GradientTape() as tape:
    # Context
    u_avg = 0
    for count,index in enumerate(indices):
      if count != WINDOW_SIZE//2:
        u_avg += context_vector[index,:]
    u_avg /= WINDOW_SIZE-1
    # Center
    output = tf.matmul(center_vector, tf.expand_dims(u_avg ,1))
    soft_out = tf.nn.softmax(output, axis=0)
    loss = soft_out[indices[WINDOW_SIZE//2]]
    log_loss = -tf.math.log(loss)
  loss_list.append(log_loss.numpy())
  grad = tape.gradient(log_loss, [context_vector, center_vector])
  opt.apply_gradients(zip(grad, [context_vector, center_vector]))
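
For completeness, here is a minimal sketch of a loop that drives train_step, assuming the variables defined earlier (text_as_int, tokenized_text_size, WINDOW_SIZE, iterations). It samples one random window per step; this is only one reasonable way to sweep the corpus, not necessarily how the original experiments were run.

loss_list = []
for step in tqdm(range(iterations)):
  # Pick a random window; the middle index is the centre word,
  # the remaining indices are its context.
  start = np.random.randint(0, tokenized_text_size - WINDOW_SIZE)
  indices = text_as_int[start:start + WINDOW_SIZE]
  train_step(indices, loss_list)

plt.plot(loss_list)
plt.xlabel("training step")
plt.ylabel("negative log-likelihood")
plt.show()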

Finally, after training this architecture for the desired number of iterations, we can use t-SNE to visualize the words on a 2D plane according to the embeddings learned by each matrix. This way, we can see which words have formed clusters with each other.

from sklearn.manifold import TSNE

TSNE_embedd = TSNE(n_components=2).fit_transform(center_vector.numpy())
TSNE_decod = TSNE(n_components=2).fit_transform(context_vector.numpy())

cnt = 0
plt.figure(figsize=(25,5))
for x,y in TSNE_embedd:
  plt.scatter(x,y)
  plt.annotate(ix_to_vocab[cnt], (x,y))
  cnt += 1
plt.show()

Here, we show a 2D visualization for a small set of words. The relations are not clear enough yet, but some clusters of meaning have formed, for example, the years. We will look at these visualizations more closely in a later section.

-> CBOW visualization <-

A final wrap-up: in CBOW, a given set of input context vectors remains fixed while we iterate through all possible centre words and compute the softmax loss over the target centre word.

CBOW implementation

-> Computation involved in CBOW <-

Skip-Gram Tutorial


In this section, we talk about Skip-gram. The algorithm is quite similar to the one above; the main difference is that the context vectors are now the output. Skip-gram should be seen not as the opposite of CBOW, but as an algorithm that harnesses the same contextual information, arguably a bit better. Moving on to the Skip-gram tutorial, we import the required packages as we did for CBOW and perform the same data processing.

The embedding size is again set to 9, and the window size to 5.

EMBEDDING_SIZE = 9
WINDOW_SIZE = 5
opt = tf.optimizers.Adam()
iterations = 1000
# Here the word vectors are represented as row
context_vector =  tf.Variable(np.random.rand(vocab_size, EMBEDDING_SIZE))
center_vector = tf.Variable(np.random.rand(vocab_size, EMBEDDING_SIZE))

The training step is where the difference between Skip-gram and CBOW becomes clear. In CBOW, the set of context vectors was fixed while the centre vectors were iterated over to calculate the loss. In Skip-gram, the centre vector is fixed and is compared against all the context vectors. In our train step, we take the centre word of the given window, select its centre vector, and matrix-multiply it with the context embedding matrix to get an output over all possible words. After taking the softmax of the output, the softmax loss is calculated over the current set of target context words. The loss is then fed to tape.gradient, and the two embedding matrices are updated according to the derivatives.

def train_step(indices, loss_list):
  """The training step

  Arguments:
    indices (list): The indices of the vocab words in the window
    loss_list (list): Running list that collects the loss at each step
  """
  
  with tf.GradientTape() as tape:
    # Center
    loss = 0
    # Shapes: (vocab_size, EMBEDDING_SIZE) x (EMBEDDING_SIZE, 1) -> (vocab_size, 1)
    v_center = center_vector[ indices[WINDOW_SIZE//2],:]
    output = tf.matmul(context_vector, tf.expand_dims(v_center ,1))

    soft_out = tf.nn.softmax(output, axis=0)
    
    for count,index in enumerate(indices):
      if count != WINDOW_SIZE//2:
        loss += soft_out[index]
        
    log_loss = -tf.math.log(loss)
  loss_list.append(np.array(log_loss))
  grad = tape.gradient(log_loss, [context_vector, center_vector])
  opt.apply_gradients(zip(grad, [context_vector, center_vector]))

We use TSNE again to visualize the words on a 2D plane.

The visualization shows a stark improvement over CBOW, with similar words slowly forming clusters. We can see that the years have formed their own cluster. Similarly, words representing different parts of speech have also formed their little cluster.

-> Skip-gram visualization <-

To sum up Skip-gram: our centre vector embedding V remains constant as it is compared against all possible words from the context vector embedding U, as shown in the GIF. The softmax loss is calculated over the given set of target context words from the softmax distribution over all possible words.

Skip-Gram implementation

-> Computations involved in Skip-Gram <-
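
Beyond the t-SNE plots shown later, a quick way to inspect the learned associations is a cosine nearest-neighbour query against the centre-vector matrix. Here is a minimal sketch, assuming the variables from the tutorial (center_vector, vocab_to_ix, ix_to_vocab, and NumPy); the query word below is only an example and should be replaced with a word from your own vocabulary.

def nearest(word, k=5):
    """Return the k words whose centre vectors are closest (by cosine) to `word`."""
    emb = center_vector.numpy()
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb[vocab_to_ix[word]]
    best = np.argsort(-sims)[1:k + 1]   # skip the word itself
    return [(ix_to_vocab[i], float(sims[i])) for i in best]

print(nearest("king"))  # hypothetical query; use any word present in your vocab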

Loss and the gradients

To analyze the loss landscape and study the visualizations, we have subjected both architectures to two corpora: the first is a self-made tiny corpus of random sentences, while the other is a bigger corpus detailing a single topic. Henceforth, we refer to the architectures trained on the bigger corpus as "scaled-up".

Let's look at the loss of the CBOW architecture when trained on the small corpus. Spikes are bound to appear, since each training step works on a single window of words rather than an averaged batch.

-> CBOW loss (small corpus) <-

The CBOW loss when working with the large corpus looks deceptively better than with the smaller one, but closer inspection reveals that the spikes in the loss still exist.

-> CBOW loss (scaled-up) <-

On the small corpus, the Skip-gram architecture performed worse than CBOW, as we will see later in the visualizations; its loss stayed comparatively higher.

-> Skip-gram loss (small corpus) <-

The loss of the scaled-up version of Skip-gram also stays higher than CBOW's, but the twist in the tale lies in the visualizations of the word associations.

-> Skip-gram loss (scaled-up) <-

Visualizations

CBOW (Small Corpus)

A small corpus used with CBOW creates several word associations that make sense, like the grouping of names of individuals or names of sports. The types of movies (Western, Eastern) are also grouped together.

-> CBOW embeddings (small corpus) <-

CBOW (Scaled-up)

The scaled-up visualization for CBOW gives us an unsatisfactory result. The only distinguishable grouping here is that of the years.

-> CBOW embeddings (scaled-up) <-

Skip-Gram (Small Corpus)

Skip-Gram on the small corpus yielded less satisfactory results than CBOW's embeddings. We can see that some names of sports have been grouped together, but apart from that, it is really hard to distinguish the other embeddings and their associated meanings.

-> Skip-Gram embeddings (small corpus) <-

Skip-Gram (Scaled-up)

The scaled-up version of Skip-Gram, however, gives us very pleasing associations. We can clearly see the bulk of the years huddled into a single cluster, while words like a, an, and at have found their own cluster. Further to the left, we see as and about creating their own little cluster. The scaled-up Skip-Gram convincingly beats CBOW in terms of the word associations formed.

-> Skip-Gram embeddings (scaled-up) <-

Appendix

Loss landscape

Here I will derive the gradients of the loss with respect to the parameters of the model. The parameters of the model include $u_{w}$ and $v_{w}$ for each word $w$ in the vocabulary. We initialise the model with random $u_{w}$ and $v_{w}$ and then update them in accordance with the gradient of the objective function $J(\theta)$.

$\frac{\partial J(\theta)}{\partial v_{c}}$

In this section we look at the derivation of the gradient of the objective function with respect to the vector representation of the centre word.

$$
\begin{aligned}
\frac{\partial \log P(u_{o} \mid v_{c})}{\partial v_{c}} &= \frac{\partial}{\partial v_{c}}\log\frac{\exp(u_{o}^{T} v_{c})}{\sum_{i=1}^{V}\exp(u_{i}^{T} v_{c})}\\
&= \frac{\partial}{\partial v_{c}}\log\exp(u_{o}^{T} v_{c}) - \frac{\partial}{\partial v_{c}}\log\sum_{i=1}^{V}\exp(u_{i}^{T} v_{c})\\
&= \frac{\partial}{\partial v_{c}}(u_{o}^{T} v_{c}) - \frac{1}{\sum_{i=1}^{V}\exp(u_{i}^{T} v_{c})}\cdot\frac{\partial}{\partial v_{c}}\sum_{x=1}^{V}\exp(u_{x}^{T} v_{c})\\
&= u_{o} - \frac{1}{\sum_{i=1}^{V}\exp(u_{i}^{T} v_{c})}\cdot\sum_{x=1}^{V}\exp(u_{x}^{T} v_{c})\,u_{x}\\
&= u_{o} - \sum_{x=1}^{V}\frac{\exp(u_{x}^{T} v_{c})}{\sum_{i=1}^{V}\exp(u_{i}^{T} v_{c})}\,u_{x}\\
&= u_{o} - \sum_{x=1}^{V} P(u_{x} \mid v_{c})\,u_{x}
\end{aligned}
$$

$$
\boxed{\frac{\partial J(\theta)}{\partial v_{c}} = -\frac{1}{T}\sum_{t=1}^{T}\sum_{-m\leqslant j\leqslant m;\ j\neq 0}\left[u_{t+j} - \sum_{x=1}^{V} P(u_{x} \mid v_{t})\,u_{x}\right]}
$$

The last equation gives us an intuitive picture of the gradient (slope): we are subtracting the expected context word vector $\sum_{x=1}^{V} P(u_{x} \mid v_{c})\,u_{x}$ from the observed context vector $u_{o}$.

$\frac{\partial J(\theta)}{\partial u_{o}}$

In this section we look at the derivation of the gradient of the objective function with respect to the vector representation of the context word.

$$
\begin{aligned}
\frac{\partial \log P(u_{o} \mid v_{c})}{\partial u_{o}} &= \frac{\partial}{\partial u_{o}}\log\frac{\exp(u_{o}^{T} v_{c})}{\sum_{i=1}^{V}\exp(u_{i}^{T} v_{c})}\\
&= \frac{\partial}{\partial u_{o}}\log\exp(u_{o}^{T} v_{c}) - \frac{\partial}{\partial u_{o}}\log\sum_{i=1}^{V}\exp(u_{i}^{T} v_{c})\\
&= \frac{\partial}{\partial u_{o}}(u_{o}^{T} v_{c}) - \frac{1}{\sum_{i=1}^{V}\exp(u_{i}^{T} v_{c})}\cdot\frac{\partial}{\partial u_{o}}\sum_{x=1}^{V}\exp(u_{x}^{T} v_{c})\\
&= v_{c} - \frac{\exp(u_{o}^{T} v_{c})}{\sum_{i=1}^{V}\exp(u_{i}^{T} v_{c})}\,v_{c}\\
&= v_{c} - P(u_{o} \mid v_{c})\,v_{c}
\end{aligned}
$$

$$
\boxed{\frac{\partial J(\theta)}{\partial u_{t+j}} = -\frac{1}{T}\sum_{t=1}^{T}\sum_{-m\leqslant j\leqslant m;\ j\neq 0}\left[v_{t} - P(u_{t+j} \mid v_{t})\,v_{t}\right]}
$$
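
As a sanity check on the per-window gradients derived above, here is a small numerical experiment with a toy vocabulary and random vectors (nothing from the tutorial): it compares the analytic expressions $u_{o} - \sum_{x} P(u_{x}\mid v_{c})\,u_{x}$ and $v_{c} - P(u_{o}\mid v_{c})\,v_{c}$ against finite differences of $\log P(u_{o}\mid v_{c})$.

import numpy as np

rng = np.random.default_rng(1)
V, d = 8, 5
u = rng.normal(size=(V, d))   # context vectors
v = rng.normal(size=(V, d))   # centre vectors
o, c = 3, 1                   # indices of the context word o and centre word c

def log_prob(u, v):
    """log P(u_o | v_c) = u_o.v_c - log sum_i exp(u_i.v_c)."""
    scores = u @ v[c]
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradients from the derivations above.
p = np.exp(u @ v[c]); p /= p.sum()
grad_vc = u[o] - p @ u            # d log P / d v_c
grad_uo = v[c] - p[o] * v[c]      # d log P / d u_o

# Finite-difference checks, one coordinate at a time.
eps = 1e-6
num_vc, num_uo = np.zeros(d), np.zeros(d)
for i in range(d):
    v_plus = v.copy(); v_plus[c, i] += eps
    num_vc[i] = (log_prob(u, v_plus) - log_prob(u, v)) / eps
    u_plus = u.copy(); u_plus[o, i] += eps
    num_uo[i] = (log_prob(u_plus, v) - log_prob(u, v)) / eps

print(np.allclose(grad_vc, num_vc, atol=1e-4))  # True
print(np.allclose(grad_uo, num_uo, atol=1e-4))  # True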

Conclusion and Further Readings

Link to the second part

This article focused on making the reader believe in the idea of a distributed representation of words. Once the intuition is built, one can easily appreciate its power and apply the concept to anything and everything. Word2Vec is a simple concept but a rather beautiful idea.

One can enrich their understanding of word embeddings by referring to the following links:

Stanford:

Yannick:

Papers:

Tutorials:

Some quick-fire tutorial videos:

I would like to thank Krisha Mehta for her guidance and her review of the appendix. In our next article, we focus on Negative Sampling and GloVe embeddings. :smile:

The authors:

| Name | Twitter | GitHub |
| --- | --- | --- |
| Devjyoti Chakraborty | @Cr0wley_zz | @cr0wley-zz |
| Aritra Roy Gosthipaty | @ariG23498 | @ariG23498 |