Word2Vec: Teaching Neural Networks to Understand Words
This article explores Word2Vec, the first part of our look at learning word embeddings, with clear examples and results tracked using Weights & Biases.
Humans are intelligent because they have a network of other human beings and language as a communication medium. Language is a relatively new concept in the history of humanity; senses, and learning through the senses, came long before it.
With the advent of language, humans set themselves strikingly apart from other animals. Animals communicate, but they cannot transfer their knowledge via communication. Theoretical knowledge transfer is unique to us humans, and the key to it is language.
Which carries more information: a few megabytes of an image, or a sentence describing that image? The question startled me at first. With fast internet, megabytes of images can be transferred in the blink of an eye, while we humans speak and convey ideas very slowly. On the surface, this suggests that language is a slow form of information transfer.
Information theory has a different viewpoint. It states that language is one of the best compression algorithms we have. Used correctly, a few words can describe not only some megabytes of images but also a very complex idea. The internet may be fast, but thanks to the high compression achieved by language, we can safely say that the sentence carries more information than the image.
The most important question to address is, “How do we represent the meaning of words?” Words are symbols, and meaning is an idea. Language is a very powerful construct: it can convey an idea via mere patterns of letters (words).
Now, take a moment and think. Communication through words is essentially an abstract commodity designed by humans. Teaching language to other humans is difficult, but making computers understand this is a different ball game altogether. For entities that only understand 0s and 1s at their core level, understanding complex words and their semantic meaning sounds like an impossible task.
Here's what we'll be covering in this article:
Table of Contents
- How Computers Understand Words
- Word2Vec
- CBOW-Tutorial
- Skip Gram Tutorial
- Loss and the Gradients
- Visualizations
- Appendix
- Conclusion and Further Reading
Let's get going!
How Computers Understand Words
The first attempt by machines to harness the meaning of words was through WordNet, a thesaurus of words and their synonyms. This dictionary helps in finding words that mean the same thing in a given context. It is built manually, which takes a great deal of labor.
This approach does seem to help encode the meaning of words, but it is subjective. The term good can be similar to many words for one person and not for another. We need a better approach to encode the meaning of words.
A discrete, or localist, representation encodes words as one-hot vectors. This makes it possible to feed words to a computer, but the cons of this approach are unending: one-hot vectors cannot represent the meaning of words, and because a corpus can contain many words, the vectors become vast, sparse, and inefficient.
To encode meaning, one could build a lookup table of pairs of similar words (through WordNet) and keep their one-hot representations in the table. Can you calculate the size of that table if there are 50,000 words in the corpus? 😖
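To see why one-hot vectors cannot encode meaning, here is a minimal sketch (the four-word vocabulary is just an illustration): the dot product between any two distinct one-hot vectors is zero, so no notion of similarity survives, and with 50,000 words each vector would be 50,000 dimensions long.

```python
import numpy as np

vocab = ["man", "king", "queen", "teacup"]  # tiny illustrative vocabulary
one_hot = np.eye(len(vocab))                # each row is the one-hot vector of a word

king = one_hot[vocab.index("king")]
queen = one_hot[vocab.index("queen")]
print(king @ queen)  # 0.0 -> "king" and "queen" look completely unrelated
```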
Distributional semantics: Here, the word's meaning is given by the words that frequently appear close to it.
“You shall know a word by the company it keeps” - J.R. Firth.
In this approach, we try to model a word's context by noting the words close to it. We stack these contexts and later generalize to get the meaning of the word. Come to think of it, similar words do indeed share similar contexts. This is called distributional semantics because the meaning of a word depends not only on the word itself but also on the words beside it.
Hence the meaning is distributed. Another take on this is that the vectors built to represent words with this approach are no longer one-hot; they are dense, and the meaning is distributed across every dimension of the vector.
Word2Vec
Let’s discuss distributional semantics some more. Suppose you have a dataset that assigns a negative or a positive sentiment to a set of words. After training, your classifier is bound to group similar words according to their target sentiments. This is a straightforward way of getting a faux classifier that “understands” words.
However, that does not achieve the primary objective of a computer algorithm that can communicate with you. Until 2013, many promising ideas and algorithms were introduced, slowly but steadily raising the bar. In 2013, however, the NLP community was swept off its feet by the introduction of word2vec. Before I get into the intricacies, let us visualize the concept together.

Word embeddings, an intuition
Let us say we have a corpus of four words, namely Man, King, Queen, and Teacup. What if we could create a table, keeping the words in the rows and each column referring to a property that defines those words? Refer to the table above: the words Man and King are given the same value in the gender column, while Queen gets the exact opposite.
This refers to Man and King having the same gender, while Queen has the opposite gender. Teacup gets a neutral value since it is genderless. Similarly, in the alive column, we see that Man, King, and Queen have the value 1, and Teacup has the value -1 since it is a non-living thing. If we imagine the table as a vector space and its columns as the dimensions, we see that each word as a vector has some weight in each of the dimensions.
This ingenious idea is precisely what was envisioned as the concept of word2vec. Your model learns word associations from the corpus, expressing each word as a row of a matrix in which every column is an embedding dimension, a direction in which the word has some weight. Together, these weights constitute the meaning of the word for your model with respect to the given corpus.
However, an example like the table shown above is an ideal-case scenario in which the embeddings are crystal clear and decipherable. In most cases, since the model itself decides the embeddings, it becomes difficult to decode the meaning and reasoning behind each dimension.
So, to sum up, word2vec is a technique that uses neural networks to learn vector representations of words, forming word associations with other words with respect to a given corpus.
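As a toy illustration of this table-as-vector-space idea (the numbers below are hand-made, not learned), we can treat each row of the table as a vector and compare words with cosine similarity:

```python
import numpy as np

# Hand-crafted toy embeddings: the columns stand for the "gender" and "alive" properties.
words = {
    "man":    np.array([ 1.0,  1.0]),
    "king":   np.array([ 1.0,  1.0]),
    "queen":  np.array([-1.0,  1.0]),
    "teacup": np.array([ 0.0, -1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(words["man"], words["king"]))     # high similarity
print(cosine(words["king"], words["teacup"]))  # negative similarity
```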
Likelihood
In this section, we talk about the maths behind the problem statement and the solution. The problem is to find some way to derive meaning from words. The meaning must be encoded in such a way that it is easy for a computer to work with. In a distributed representation, the idea is to understand the meaning of a word from the words beside it (the context words).
The solution can be seen in two complementary ways. In the Continuous Bag of Words (CBOW) approach, the authors predict the probability of a center word given the context words. In the Skip-gram model, the approach predicts the probability of the context words given the center word.
CBOW:
For each position $t = 1, \dots, T$, we predict the center word $w_t$ in a window of fixed size $m$ given the context words $w_{t+j}$, where $-m \le j \le m$ and $j \ne 0$:

$$L(\theta) = \prod_{t=1}^{T} P\left(w_t \mid w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m}; \theta\right)$$
Skip Gram:
For each position $t = 1, \dots, T$, we predict the context words $w_{t+j}$ in a window of fixed size $m$ given the center word $w_t$, where $-m \le j \le m$ and $j \ne 0$:

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P\left(w_{t+j} \mid w_t; \theta\right)$$
The symbols and their meaning:
1. $\theta$: The parameter that is learned. Here $\theta$ refers to the vector embeddings of each word.
2. $t$: The variable that represents each word position in the corpus of text. There are $T$ words in the corpus.
3. $m$: The size of the window of context.
4. $j$: The variable that represents the offset of a context word from the center word.
We can say that the likelihood measures how well our model has done: it tells us how likely our probabilistic model is to predict the right words given a particular context.
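To make the windows in the likelihood concrete, here is a small sketch (the sentence and the window size $m = 2$ are made up) that enumerates the center word and its context words at each position:

```python
sentence = "the quick brown fox jumps over the lazy dog".split()
m = 2  # window size

# Enumerate (center word, context words) pairs, as used by both CBOW and Skip-gram.
for t, center in enumerate(sentence):
    context = [sentence[t + j] for j in range(-m, m + 1)
               if j != 0 and 0 <= t + j < len(sentence)]
    print(center, "->", context)
```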
Objective Function
With the likelihood function in place, we can express the maximization of the likelihood as the minimization of the average negative log-likelihood. This average negative log-likelihood is our objective function (written here for the Skip-gram likelihood; the CBOW case is analogous):

$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P\left(w_{t+j} \mid w_t; \theta\right)$$
- Average: The average gives us the mean error. The scale of things is independent of the size of the corpus.
- Negative: The optimizer should decrease the error
- Log: Products are transformed into summation
By minimizing $J(\theta)$, we are maximizing the likelihood of the model and hence succeeding in modeling the language better.
The Probability Function
We have the likelihood function that needs to be maximized; in practice, we instead minimize the objective function. One missing ingredient of the approach is the probability function itself. We need an equation that can model the probability of a word occurring in a given context.
We keep two vector representations for each word in the vocabulary. For a word $w$, the two vectors are:
1. $v_w$ - when the word is the center word
2. $u_w$ - when the word is a context word
The probability of a context word $o$ given a center word $c$ is then modeled with a softmax:

$$P(o \mid c) = \frac{\exp\left(u_o^{\top} v_c\right)}{\sum_{w \in V} \exp\left(u_w^{\top} v_c\right)}$$

In the numerator, we have the term $u_o^{\top} v_c$, the dot product between the two word vectors, which signifies how close the two words are. Exponentiating the dot product has an excellent effect: it not only amplifies the big values but also diminishes the small ones. Hence the numerator expresses how close (similar) the two words are.
In the denominator, we have the normalization term $\sum_{w \in V} \exp\left(u_w^{\top} v_c\right)$. This normalizes the numerator and gives us a proper probability. The formula directly translates to the probability of the word $o$ occurring when the word $c$ is present.
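To make the formula concrete, here is a minimal NumPy sketch (the vocabulary and the randomly initialized vectors are purely illustrative) that computes $P(o \mid c)$ as a softmax over dot products:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["man", "king", "queen", "teacup"]   # toy vocabulary
V, D = len(vocab), 3                         # vocab size and embedding size

center_vectors = rng.random((V, D))   # v_w: used when a word is the center word
context_vectors = rng.random((V, D))  # u_w: used when a word is a context word

def prob_context_given_center(o, c):
    """P(o | c): softmax over the dot products of every context vector with v_c."""
    scores = context_vectors @ center_vectors[c]  # u_w . v_c for every w in V
    exp_scores = np.exp(scores - scores.max())    # numerically stable softmax
    return exp_scores[o] / exp_scores.sum()

print(prob_context_given_center(vocab.index("queen"), vocab.index("king")))
```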
CBOW-Tutorial
Word2vec comes in two flavors: CBOW (Continuous Bag of Words) and the Skip-Gram model. This section takes the reader on a journey through CBOW in TensorFlow. The sections above should provide the right amount of intuition to understand CBOW and Skip-Gram; both are probabilistic models and are best understood through the objective function discussed above.
In the following section, we will go over a simple CBOW tutorial step by step.
First, let's look at the required imports. We will mainly work with NumPy arrays and tensors while utilizing the gradient tape functionality of TensorFlow. We will use Matplotlib to visualize the embeddings and tqdm to track the progress of our operations.
```python
import numpy as np
import tensorflow as tf
from tqdm import tqdm
import matplotlib.pyplot as plt
```
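The snippets that follow assume a variable data holding the raw corpus as a single string; a minimal stand-in (the text itself is only a placeholder, not the corpus used in our runs) could be:

```python
# Hypothetical stand-in for the corpus; substitute your own text here.
data = (
    "The king and the queen hosted a feast. "
    "The man poured tea into a teacup. "
    "Football and cricket are popular sports."
)
```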
Next, we tokenize our corpus by splitting it into words and removing punctuation and capitalization, making it easier for our architecture to group words. The tokens are then collected into a set of unique words.
```python
# Converts the data into tokens
tokenized_text = tf.keras.preprocessing.text.text_to_word_sequence(data)
tokenized_text_size = len(tokenized_text)

# Creates a vocab of unique words
vocab = sorted(set(tokenized_text))
vocab_size = len(vocab)
print('Vocab Size: {}'.format(len(vocab)))
```
This is a very important part of our code. Here we create a dictionary called vocab_to_ix, which maps each word (key) to its index in the vocab set (value). ix_to_vocab is a NumPy array version of vocab, while text_as_int is a NumPy array of the tokenized text, containing the index of each word instead of the word itself.
```python
# Map the vocab words to individual indices
vocab_to_ix = {c: ix for ix, c in enumerate(vocab)}

# Map the indices to the words in vocab
ix_to_vocab = np.array(vocab)

# Convert the data into numbers
text_as_int = np.array([vocab_to_ix[c] for c in tokenized_text])
```
In the following code block, we define the embedding size, which, as explained above, is the number of dimensions that help define the words of our corpus. We choose a window size of 5, which means we get a single center word for every 4 context words. Here, context_vector and center_vector are essentially our embedding matrices, in which our words acquire meaning with respect to the given corpus.
```python
EMBEDDING_SIZE = 9
WINDOW_SIZE = 5
opt = tf.optimizers.Adam()
iterations = 1000

# Here the word vectors are represented as a row
context_vector = tf.Variable(np.random.rand(vocab_size, EMBEDDING_SIZE))
center_vector = tf.Variable(np.random.rand(vocab_size, EMBEDDING_SIZE))
```
In the most critical step, we define our train_step function, which houses tf.GradientTape. The function takes a list of indices and a list of losses as its parameters. with tf.GradientTape() as tape starts tracking the operations the variables undergo (for the computation of derivatives). u_avg is the average of the context vectors in the window. The output is calculated through matrix multiplication between the center_vector embedding and u_avg, followed by a softmax distribution.
As described in the paper, we obtain a softmax distribution over all center words with respect to the given set of context words. From it, we pick the probability of the target center word and calculate the negative log (softmax) loss, which is then fed to tape.gradient. The tape.gradient method computes the derivatives of the loss with respect to the center_vector and context_vector embeddings, and the optimizer updates both. So, to sum it up, each step nudges the values of our embedding matrices.
```python
def train_step(indices, loss_list):
    """The training step

    Arguments:
        indices (list): The indices of the vocab in the window
    """
    with tf.GradientTape() as tape:
        # Context
        u_avg = 0
        for count, index in enumerate(indices):
            if count != WINDOW_SIZE // 2:
                u_avg += context_vector[index, :]
        u_avg /= WINDOW_SIZE - 1

        # Center
        output = tf.matmul(center_vector, tf.expand_dims(u_avg, 1))
        soft_out = tf.nn.softmax(output, axis=0)
        loss = soft_out[indices[WINDOW_SIZE // 2]]
        log_loss = -tf.math.log(loss)

    loss_list.append(log_loss.numpy())
    grad = tape.gradient(log_loss, [context_vector, center_vector])
    opt.apply_gradients(zip(grad, [context_vector, center_vector]))
```
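The article does not reproduce the driving loop here; one possible sketch (assuming each of the iterations passes slides a window of WINDOW_SIZE over text_as_int and calls train_step once per window) is:

```python
loss_list = []
for _ in tqdm(range(iterations)):
    # Slide a window over the corpus; each window gives one training step.
    for start in range(tokenized_text_size - WINDOW_SIZE + 1):
        indices = text_as_int[start:start + WINDOW_SIZE]
        train_step(indices, loss_list)
```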
Finally, after training this architecture for the desired number of iterations, we can use t-SNE to project the embeddings formed by each matrix onto a 2D plane. This way, we can see which words have formed clusters with each other.
```python
from sklearn.manifold import TSNE

TSNE_embedd = TSNE(n_components=2).fit_transform(center_vector.numpy())
TSNE_decod = TSNE(n_components=2).fit_transform(context_vector.numpy())

cnt = 0
plt.figure(figsize=(25, 5))
for x, y in TSNE_embedd:
    plt.scatter(x, y)
    plt.annotate(ix_to_vocab[cnt], (x, y))
    cnt += 1
plt.show()
```
Here, we show a 2D visualization of a small set of words. The relations are not clear enough, but some clusters of meaning have formed, for example, the years. We will look at this visualization more thoroughly in a later section.

CBOW visualization
A final wrap-up: in CBOW, a given set of input context vectors remains constant while we iterate through all possible center words and calculate the softmax loss over the target center word.

Computation involved in CBOW
Skip Gram Tutorial
In this section, we talk about Skip-Gram. The algorithm is very similar to the one we have seen above; the only difference is the use of the context vectors as the output. Skip-Gram should be seen not as the opposite of CBOW, but as an algorithm that harnesses the same contextual abilities a tad bit better.
Moving on to the skip-gram tutorial, we will similarly import the required packages, as mentioned for CBOW, and perform similar data processing.
The embedding size is again set to 9, while the window size is set to 5.
```python
EMBEDDING_SIZE = 9
WINDOW_SIZE = 5
opt = tf.optimizers.Adam()
iterations = 1000

# Here the word vectors are represented as a row
context_vector = tf.Variable(np.random.rand(vocab_size, EMBEDDING_SIZE))
center_vector = tf.Variable(np.random.rand(vocab_size, EMBEDDING_SIZE))
```
The training step is where the difference between Skip-gram and CBOW becomes clear. In CBOW, our set of context vectors was fixed, while the center vectors were iterated through all possible outputs to calculate the loss. In Skip-gram, our center vector is fixed, and it is scored against the available context vectors one by one.
We have implemented the same in our train step: we take the center word of the given window as the center vector and matrix-multiply it with our context embedding matrix to produce an output over all possible words. After taking the softmax of the output, the loss is calculated over the current set of target context words. The calculated loss is then fed to tape.gradient, and the two embedding matrices are updated according to the derivatives.
```python
def train_step(indices, loss_list):
    """The training step

    Arguments:
        indices (list): The indices of the vocab in the window
    """
    with tf.GradientTape() as tape:
        # Center
        loss = 0
        # Shapes: (vocab_size, EMBEDDING_SIZE) x (EMBEDDING_SIZE, 1) -> (vocab_size, 1)
        v_center = center_vector[indices[WINDOW_SIZE // 2], :]
        output = tf.matmul(context_vector, tf.expand_dims(v_center, 1))
        soft_out = tf.nn.softmax(output, axis=0)

        # Context
        for count, index in enumerate(indices):
            if count != WINDOW_SIZE // 2:
                loss += soft_out[index]
        log_loss = -tf.math.log(loss)

    loss_list.append(np.array(log_loss))
    grad = tape.gradient(log_loss, [context_vector, center_vector])
    opt.apply_gradients(zip(grad, [context_vector, center_vector]))
```
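The same window-sliding loop from the CBOW section can drive this train_step as well. Once training is done, a quick way to inspect the learned associations, even before any t-SNE plot, is to look up nearest neighbours by cosine similarity. A small sketch, assuming the trained center_vector from above and that the query word appears in the vocabulary:

```python
def nearest_neighbors(word, k=5):
    """Return the k words most cosine-similar to `word` using the center embeddings."""
    vectors = center_vector.numpy()
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed[vocab_to_ix[word]]
    best = np.argsort(-sims)[1:k + 1]  # the first entry is the word itself; skip it
    return [(ix_to_vocab[ix], float(sims[ix])) for ix in best]

print(nearest_neighbors("king"))  # hypothetical query word
```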
We use TSNE again to visualize the words on a 2D plane.
The visualization shows a stark improvement over CBOW, with similar words slowly forming clusters. We can see that the years have formed their own cluster. Similarly, words representing different parts of speech have also formed their little cluster.

Skip-gram visualization
To sum up Skip-gram: our center vector embedding V remains constant as it iterates through all possible words from the context vector embedding U, as shown in the GIF. The softmax loss is calculated over the given set of target context words from the softmax distribution over all possible words.

Computations involved in Skip-Gram
Loss and the Gradients
To analyze the loss landscape and study the visualizations, we subjected both architectures to two corpora: the first is a self-made tiny corpus of random sentences, while the other is a bigger corpus detailing a single topic. Henceforth, we refer to the architectures trained on the bigger corpus as "scaled-up".
Let's look at the loss of the CBOW architecture when the corpus used is a small one. The spikes are bound to exist since we are dealing with words and not just mere numbers.
The CBOW loss when working with a large corpus looks deceptively better than with the smaller one, but closer inspection reveals that the spikes in the loss still exist.
In contrast to the small corpus CBOW architecture, the skip-gram architecture performed worse, as we will see later in the visualizations. The loss was comparatively higher than CBOW's.
The loss of the scaled-up version of Skip-Gram also stays higher than CBOW's, but the twist in the tale lies in the visualizations of the word associations.
Visualizations
CBOW (Small Corpus)
A small corpus used for CBOW creates several word associations which make sense, like the grouping of names of individuals or the grouping of names of sports. The types of movies (Western, Eastern) are also grouped together.
CBOW (Scaled-up)
The scaled-up visualization for CBOW gives us an unsatisfactory result. The only distinguishable grouping here is the years.
Skip-Gram (Small Corpus)
Skip-Gram on the small corpus yielded less satisfactory results than CBOW's embeddings. We can see that some names of sports have been grouped together, but apart from that, it's really hard to distinguish the other embeddings and their associated meanings.
Skip-Gram (Scaled-up)
The scaled-up version of Skip-Gram, however, gives us very satisfying associations. We can clearly see the bulk of the years huddled up in a single cluster, while words like a, an, and at have found their own cluster. Further to the left, we see about forming its own little cluster. The scaled-up Skip-Gram convincingly beats CBOW in terms of the word associations formed.
Appendix
Loss landscape
Here I will try to derive the gradients of the loss with respect to the parameters of the model. The parameters of the model include $v_w$ and $u_w$ for each word $w$ in the vocabulary. We initialize the model with random $v_w$ and $u_w$ and then change their values in accordance with the gradient of the objective function $J(\theta)$.
In this section, we will look into the derivation of the gradient of the objective function with respect to the vector representation of the center word. For a single (center, context) pair, the gradient of the log-probability works out to:

$$\frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{w \in V} P(w \mid c)\, u_w$$

The last equation gives us an intuitive model of the gradient (slope): we are subtracting the expected context word vector ($\sum_{w \in V} P(w \mid c)\, u_w$) from the observed context word vector ($u_o$).
In this section, we will look into the derivation of the gradient of the objective function with respect to the vector representation of the context word $u_w$.
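The full derivation is not reproduced here; for completeness, a sketch of the result in the same notation (with $\mathbb{1}[w = o]$ an indicator that the context word $w$ is the observed one) is:

$$\frac{\partial}{\partial u_w} \log P(o \mid c) = \left(\mathbb{1}[w = o] - P(w \mid c)\right) v_c$$

In words: the context vector of the observed word is pulled toward the center vector, while every other context vector is pushed away in proportion to the probability the model currently assigns it.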
Conclusion and Further Reading
Link to the second part
We were focused on making the reader believe in the idea of a distributed representation of words. Once the intuition is built, one can easily understand its power and apply the concept to anything and everything. Word2Vec is a simple concept but a rather beautiful idea.
One can enrich their understanding of word embeddings by referring to the following links:
Stanford:
Yannick:
Papers:
Tutorials:
Some quick-fire tutorial videos:
I would like to thank Krisha Mehta for her guidance and review of the appendix.
In our next article, we focus our aim on Negative Sampling and GloVe embeddings. 😄
The authors:
Name | GitHub | |
---|---|---|
Devjyoti Chakrobarty | @Cr0wley_zz | @cr0wley-zz |
Aritra Roy Gosthipaty | @ariG23498 | @ariG23498 |