This is the second part of word embedding learning. Made by Devjyoti Chakraborty using Weights & Biases

-> XKCD Wasted Time <-

In the previous article, we tried creating the intuition about **Distributional Semantics**. The idea of center and context words made a lot of sense. The loss diagrams and the T-SNE plots made the idea of learning word representations quite luring. We could have stopped where we were.

Now comes the important question: **How efficient is the algorithm?** Come to think of it, CBOW and Skip-Gram both use lots of computations while learning the word embeddings. At each iteration, the algorithms compute gradients of the entire embedding space and then update the parameters of the model. In Distributed Representations of Words and Phrases and their Compositionality by Mikolov et. al. the authors themselves provide the readers with a more efficient algorithm to train Skip-Gram, **negative sampling**. In this article, we not only dive into the algorithm of negative sampling but also take into consideration the algorithm called **GloVe**. GloVe(Global Vectors), works primarily on two main principles; **Local Context window** methods along with **Global co-occurrence** matrix. GloVe heavily outperformed its predecessors being better at representing and also in terms of efficiency.

Let us go back and talk a little about the previous training method mentioned in Skip-gram.

The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document.

We take a window, slide the window on the corpus. For each window, we get a single-center word and multiple context words. We essentially compute the softmax of all the context word vectors and try to maximize this. This has a bad effect on the back-propagation stage. Due to the softmax functionality, we intend to build a rather complex computation graph that involves all of the context vectors. This boils down to the fact that we would need to compute the gradients of the entire context matrix. Can one see the problem now? We would back-propagate into a huge matrix at each window iteration and that leads to a huge amount of computation.

One quick and easy way to go around the problem is to take batches of data that decrease the number of iteration at once. This indeed helps with the time constraints but does not imbibe the representations that well.

Enter **negative sampling**. The idea of negative sampling is to first get hold of positive samples and negative samples from the dataset given a specific center word. If we have the following sentence `I am Aritra. Nice to meet you all.`

with the center word `meet`

, the positive samples would be `to`

and `you`

. By positive the authors mean that these words are indeed the correct context words for the given center word. On the other hand, `I`

, `am`

, `Aritra`

, `Nice`

, and `all`

are the negative samples.

After we have the samples ready we use a simple yet effective mathematical equation to train the model.
$
\log \sigma \left( u^{T}*{a} v*{b}\right) +\sum ^{k}*{i=1} \mathbb{E}*{w_{i} \sim P_{n}( w)}\left[\log \sigma \left( -u^{T}*{i} v*{b}\right)\right]
$
Where the notations are as follows:

- $u_a$: This is the context vector representation for the word a
- $v_b$: This is the center vector representation for the word b
- $P_n(w)$: noise distribution
- $\mathbb{E}
*{w*{i} \sim P_{n}( w)}$: Sample words from the negative context words - $i$: The counter used

The equation can be broken down into two parts

- $\log \sigma \left( u^{T}
*{a} v*{b}\right)$: Here we are focused in getting a logistic regression for the positive class. - $\sum ^{k}
*{i=1} \mathbb{E}*{w_{i} \sim P_{n}( w)}\left[\log \sigma \left( -u^{T}*{i} v*{b}\right)\right]$: While here we are concerned with the**negative**logistic regression for the negative class. Note that here we maximize the term so that the negative samples are repelled.

```
def train_step(indices, loss_list):
"""The training step
Arguments:
indices (list): The indices of the vocab in the window
"""
with tf.GradientTape() as tape:
loss = tf.Variable(0.0, dtype = tf.float64)
v_center = tf.expand_dims(center_vector[ indices[WINDOW_SIZE//2],:],1)
index_context = list(filter(lambda x: x not in indices, list(range(vocab_size))))
rand_neg_indices = np.random.choice(index_context, 10, replace = False)
v_context_negative = tf.gather(context_vector, rand_neg_indices)
#tf.gather selects specific tensors
v_context_positive = tf.gather(context_vector , indices)
output_negative = tf.reduce_sum(tf.math.log(tf.math.sigmoid(tf.math.negative(tf.matmul(v_context_negative, v_center)))))
for i in v_context_positive:
loss = loss - tf.reduce_sum(tf.math.log(tf.math.sigmoid(tf.matmul(tf.transpose(tf.expand_dims(i,1)), v_center)))) - output_negative
loss_list.append(np.array(loss))
wandb.log({"loss":np.array(loss)})
grad = tape.gradient(loss, [context_vector, center_vector])
opt.apply_gradients(zip(grad, [context_vector, center_vector]))
```

Both the parts together serve a single purpose. They together optimize the statistical model such that word embeddings for the center word help in predicting the context word. With this approach, while we tend to learn word embeddings that attract positive samples and repel negative samples, they seem to be learning the context pretty well.

We trained our Negative Sampling implementation on a short piece dedicated to Marie Curie and her achievements. As we can see, the loss comparatively decreases albeit some spikes. However, since this is a small corpus, we will see some overfitting later in the visualization.

At first glance, the TSNE visualization doesn't really make sense. But upon closer inspection, we will see that since the corpus is small, the positive samples have overfitted on the centre words. For example, we have the sentence "Marie Curie was born in 1867", we see that the words "born" and "1867" are very close to each other.

The architectures that we have seen till now show some success in capturing semantic regularities of a corpus. However, we see that the that **frequency** of co-occurrences doesn't have a separate term in our final cost function. In the academic paper GloVe: Global Vectors for Word Representation by Jeffrey Pennington et. al. the authors talk about enforcing certain model properties to better capture and understand semantic and syntactic regularities.

The Local context window method carries over from previous architectures like CBOW and Skip-gram, while the addition of a co-occurrence factor differentiates it from other architectures. For its training phase, instead of continuously iterating over local windows of sequenced data, we use the co-occurrence matrix as a lookup table for words which have appeared in the context of other words, as well as prevent computation for words who have no co-occurrence.

Before moving further, we shall discuss what a co-occurrence matrix is.

-> Create-a-gray-level-co-occurrence-matrix <-

The matrix on the left has unique values ranging from 1-8. The co-occurrence matrix, one of the right, tells us how many times a particular value (the columns) appeared beside another value ( the rows). In simple words, if there are $i$ rows and $j$ columns of a matrix $M$, a co-occurrence matrix will tell us how many times the values in $j$ have appeared in context to a single value of $i$, or as stated in the paper,

The matrix of word-word co-occurrence counts be denoted by X, whose entries ${X_{i}}_{j}$ tabulate the number of times word $j$ occurs in the context of word $i$

The paper states that

The statistics of word occurrences in a corpus is the primary source of information available to all unsupervised methods for learning word representations.

Based on that notion, first we create a co-occurrence matrix $X$ based on the available corpus. So, the notation $X_{ij}$ will refer to the number of times word $j$ appeared beside word $i$. We will calculate the probability of word $j$ given $i$ ($P(j|i)$) as $X_{ij}/X_{i}$, where $X_{i}$ is the summation of the number of times all words appear in the context of word $i$. ( ${\sum\limits *{k} X*{i}}_{k}$ )

**From a Thermodynamics Point of view:**
The relationship of words can be better examined when ratios of probabilities are compared, which is beautifully explained in the paper. For example, let's take a word $k$ = Solid, $i$ = ice and $j$ = steam. Obviously, since ice and solid relate to each other, $P_{ik}$ will be a greater value than ${P_{j}}*{k}$, since Solid and steam won't occur together much, hence ${P*{i}}*{k}/{P*{j}}*{k}$ will have a great value. Similarly, if $k$ = gas, we will have a small value for ${P*{i}}*{k}/{P*{j}}*{k}$ since gas occurs more times in the context of steam than ice. Now, for words like Water, which occurs in the context of both ice and steam, the ${P*{i}}*{k}/{P*{j}}*{k}$ ratio will be close to 1. And finally, for words like Fashion, which doesn't occur in the context of Ice and Steam in general, the ${P*{i}}*{k}/{P*{j}}_{k}$ will also be close to 1. This ratio helps in filtering the fine meaning of words. From a thermodynamic point of view, we would not want a high value for ice and stem corresponding to water or fashion. Both ice and steam are related to water but carry neutral meaning from a thermodynamic standpoint. Both ice and steam are negatively related to the term fashion. We see that the ration model this behavior properly and hence the authors suggest that this is a correct approach to move ahead.

-> The table of probabilities, as shown in the paper. Notice, the ratios change based on the words in question. <-

Hence these ratios showcase more information than raw probabilities. From this information, we can appropriately deduce that the starting form of this model depends on the ratios of co-occurrence probabilities. Since the ratios depend on the words $i$, $j$, $k$ at a given instance, the most general form of our starting function becomes:

$ F(w_{i},w_{j},\tilde{w_{k}})=\frac{P_{ik}}{P_{jk}} $

Where $w$ are word vectors and $\tilde{w}$ are separate context vectors. Which becomes:

$ F(w_{i}-w_{j},\tilde{w_{k}})=\frac{P_{ik}}{P_{jk}} $

Once we take into consideration that vectors are linear structures and to encode them into a vector space, we shall use vector differences. To avoid complicated parameterization, we directly take the dot product of the arguments.

$ F((w_{i}-w_{j})^{T}\tilde{w_{k}})=\frac{P_{ik}}{P_{jk}} $

For mathematical simplification, the authors have equated the following;

$
F\left( w^{T}*{i} \tilde{w*{k}}\ \right) =\ P_{ik} \ =\frac{X_{ik}}{X_{i}}
$

which gives,

$ F((w_{i}-w_{j})^{T}\tilde{w_{k}})=\frac{{F(w_{i}}^{T}\tilde{w_{k}})}{{F(w_{j}}^{T}\tilde{w_{k}})} $

If we take $F = exp$,

$
w^{T}*{i}\tilde{w*{k}}=\log{P_{ik}}=\log{X_{ik}}-\log{X_{i}}
$

Since $\log( X_{i})$ is independent of $k$, we can absorb it into the bias.

$
w^{T}*{i}\tilde{w*{k}}+b_{i}+\tilde{b_{k}}=\log{X_{ik}}
$

After addressing the issues coming into existence due to the possibility of $\log{0}$, the authors came up with a least-squares regression model, which makes the final cost function to be :

$
J=\sum\limits^{V}*{i,\ j\ =1}F( x)(w^{T}*{i}\tilde{w_{k}}+b_{i}+\tilde{b_{j}}-\log{X_{ij}})^{2}
$

To better understand the efficiency of GloVe, let's analyze the final form of the Cost Function.

$
J=\sum\limits^{V}*{i,\ j\ =1}F( x)(w^{T}*{i}\tilde{w_{k}}+b_{i}+\tilde{b_{j}}-\log{X_{ij}})^{2}
$

$X_{ij}$ refers to the values present in the Co-occurrence matrix. $F(x)$ has been defined to be

$ F( x) =\begin{cases} \left(\frac{x}{x_{max}}\right)^{\alpha } & x< x_{max}\ 1 & x\geqslant \ x_{max} \end{cases} $

So, for the values of $X_{ij}$ which are 0, the whole cost function automatically becomes 0, preventing any unrequired computation. This is a major split from previous architectures, where we had to iterate over the complete matrices regardless of zero elements.

For our minimal GloVe implementation, we have written two helper functions. The word_processor method returns important data like complete vocabulary, vocabulary size, and their indexes. These data are then put through our compute_co_occurence_matrix method which computes a co-occurrence matrix depending on the data provided and window size. The window size here determines the number of words before and after a center word which will be considered context words with respect to the center word.

```
def word_processor(data):
# Converts the data into tokens
tokenized_text = tf.keras.preprocessing.text.text_to_word_sequence(data)
tokenized_text_size = len(tokenized_text)
# Creates a vocab of unique words
vocab = sorted(set(tokenized_text))
vocab_size = len(vocab)
# Map the vocab words to individual indices
vocab_to_ix = {c:ix for ix,c in enumerate(vocab)}
# Map the indices to the words in vocab
ix_to_vocab = np.array(vocab)
# Convert the data into numbers
text_as_int = np.array([vocab_to_ix[c] for c in tokenized_text])
return vocab ,vocab_size, vocab_to_ix, ix_to_vocab, text_as_int
def compute_co_occurrence_matrix(corpus, window_size=1):
vocab ,vocab_size, vocab_to_ix, ix_to_vocab, text_as_int = word_processor(corpus)
M = np.zeros((vocab_size, vocab_size))
doc_len = len(text_as_int)
for current_idx in range(doc_len):
left_boudary = max(current_idx-window_size, 0)
right_boundary = min(current_idx+window_size+1, doc_len)
outside_words = np.append(text_as_int[left_boudary:current_idx],text_as_int[current_idx+1:right_boundary])
center_word = text_as_int[current_idx]
for outside_word in outside_words:
M[center_word, outside_word] += 1
return M, vocab_to_ix, vocab_size, ix_to_vocab, text_as_int
```

For our training step, we have made use of the rather efficient function tf.gather, which helps us pinpoint the particular rows of the context embedding matrix we want to pluck out using the indexes. This will help us reduce computations. The cost function is made exactly as mentioned in the paper while we note the mean of the loss generated over a single iteration.

```
def train_step(indices, loss_list):
"""The training step
Arguments:
indices (list): The indices of the vocab in the window
"""
with tf.GradientTape() as tape:
cost = tf.Variable(0.0, dtype=tf.float64)
for i in indices:
center_em = center_vector[i] #a vector (8,)
context_em = tf.gather(context_vector, indices) #a matrix (index,8)
embedding_product = tf.math.reduce_sum(tf.math.multiply(center_em, context_em), 1)
log_cooccurrences = tf.math.log(1+co_occurance_matrix[i, indices])
distance_expr = tf.square(tf.add_n([
embedding_product,
tf.negative(log_cooccurrences)
]))
single_losses = tf.math.multiply(weighted_func(co_occurance_matrix[i, indices],100,(3/4)), distance_expr)
cost = cost+single_losses
loss_list.append(cost.numpy().mean())
wandb.log({"loss (GloVe)":cost.numpy().mean()})
grad = tape.gradient(cost, [context_vector, center_vector])
opt.apply_gradients(zip(grad, [context_vector, center_vector]))
```

Now we shall look into the loss generated for our GloVe implementation. We have used the same data as before. Since we are tracking the mean of the loss generated per iteration, we see that the curve is smooth. In most experiments noted, the curve saturated quickly.

For our GloVe representations, we have shown the combination of the two vector embedding matrices trained (as shown in the paper). Since we have trained on a small dataset, the regularities aren't really distinctive. We see that the years are slowly forming a cluster, as well as the parts of speech being close to each other.

This ends our journey in the search of word representations. We hope that the reader gets to taste the thoughts behind these beautiful ideas. We started with CBOW and Skip-Gram and finally end with Negative Sampling and GloVe. The journey involved uncovering a lot of subtle ideas and implementing the ideas from scratch with `tensorflow2`

. We broke down a number of papers in the span of two articles. We would love to hear from you. Any queries or pull requests to the repository would be highly appreciated.

Reach the authors:

Name |
Twitter |
GitHub |
---|---|---|

Devjyoti Chakrobarty | @Cr0wley_zz | @cr0wley-zz |

Aritra Roy Gosthipaty | @ariG23498 | @ariG23498 |