Sentiment Analysis with Bag of Tricks

A primer on the Bag of Tricks for NLP. Made by Devjyoti Chakraborty using Weights & Biases

Introduction

Check out the Colab for the large dataset

Check out the Colab for the small dataset

The trend of naming papers in a fun, casual way was popularized by the Show and Tell paper. The best thing about titles like these is that they are crisp and direct about the paper's content while staying approachable and not overly formal.
So when you hear the title Bag of Tricks for Efficient Text Classification, what's the first thing that pops into your mind? If you said a magical bag containing efficient methods to classify text, well, that's exactly what the paper is about.
Let's go on a little journey and decipher this simple yet beautiful piece of research, which has served as one of the prerequisites for fastText. Let's dig in.

For starters, we want to stress how prevalent text classification actually is. In fact, the approach to text classification in this paper is something we run into every day in our email inboxes when suspicious messages get filtered into our spam, social, or promotional folders. Our daily web searches rely on text classification too: all the relevant information presented on the browser screen is based on the text we typed. The broad point here is that text classification is an ML problem nearly everyone on earth interacts with, whether they know it or not.
Of course, there are plenty more applications for text classification: sentiment analysis, content ranking, understanding users for targeted marketing, and so on. In the paper Bag of Tricks for Efficient Text Classification by Joulin et al., the authors not only describe a simple way to do efficient text classification but also provide various other techniques that can make our work even more efficient.
"Efficiency"
In this report, we're going to focus on sentiment analysis. Building on the foundation laid by word2vec to come up with a quick and efficient classifier was indeed a major boost to work in this domain. The idea is elegantly simple: take the basic idea of Continuous Bag of Words and add proper labels.

Understanding the history of Word2Vec

As the world of NLP moved on from one-hot encodings, the concept of words as vectors came to the forefront. The idea was simple: since a computer doesn't understand abstractions, the only way to explain something like words to a computer is to represent them with vectors.
This is more or less what NLP visionary Tomas Mikolov proposed: an embedding space in which each word is expressed as a weight along every dimension. Taken together, these weights across dimensions are how a computer learns to differentiate one word from another. This concept is called Word2Vec, and it primarily branched into two flavors: CBOW (Continuous Bag of Words) and Skip-gram. For this report, we will focus on the CBOW architecture.

CBOW:

Given the sentence "I am going to school.", we slide a window of a pre-determined length over it and create context-center word pairs. For example, with a window size of 5 and two context words on either side of the center word, we get
Context: ['I', 'am', 'to', 'school'] for the center word: ['going']. We feed the one-hot encodings of the context words to a projection layer to output the embedding of the center word.
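For concreteness, here is a tiny sketch (our own illustration, not the paper's code) of how such context-center pairs can be generated with two context words on each side:

sentence = "I am going to school".split()
window = 2   # context words on each side of the center word

pairs = []
for idx, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, idx - window), min(len(sentence), idx + window + 1))
               if j != idx]
    pairs.append((context, center))

print(pairs[2])   # (['I', 'am', 'to', 'school'], 'going')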
In simpler terms, say we have an embedding matrix A, where each row refers to a unique word in our vocabulary while the columns are the dimensions (or descriptors) which we can use to describe these words. Something like:
In practice, the meaning of each dimension is normally a bit obscure to human understanding.
So each row essentially embodies the meaning of a word and is how a computer distinguishes it from other words. In CBOW, we take the rows corresponding to the context words and sum them up. We then pass the summed result through another trainable weight matrix to produce our output. The difference between the output and the one-hot encoding of our center word acts as the loss.
However, let's think about this practically. If our predicted output needed to be an exact one-hot encoding, the penultimate matrix producing it would have to be extremely sparse; after all, a one-hot encoding is a whole vector of 0s everywhere except a single 1. For this reason, we stray a bit from a direct one-hot comparison and take the help of the ever-reliable softmax function.
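To make the row-summing idea concrete, here is a small NumPy sketch (an illustration of the mechanics, not the paper's code; all shapes and indices are made up): look up the rows of the context words in A, sum them, project through the second trainable matrix, and apply the softmax.

import numpy as np

vocab_size, embedding_dim = 6, 3
A = np.random.rand(vocab_size, embedding_dim)       # embedding matrix: one row per word
W_out = np.random.rand(embedding_dim, vocab_size)   # trainable matrix mapping back onto the vocabulary

context_ids = [0, 1, 3, 4]   # e.g. 'I', 'am', 'to', 'school'
center_id = 2                # e.g. 'going'

hidden = A[context_ids].sum(axis=0)              # sum the rows of the context words
scores = hidden @ W_out                          # one score per word in the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()    # softmax instead of comparing to a raw one-hot vector
loss = -np.log(probs[center_id])                 # cross-entropy against the center word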
CBOW Architecture
The primary basis of the architecture has been retained by the authors of this paper to perform sentiment analysis.
However, if one wants to learn about Word2Vec in depth, they can always:

Check out the W&B report for word2vec

The Basic Architecture

Let's go over the proposed architecture at its absolute base level. Remember how each sentence was fed into the CBOW architecture without any specific labels? The words in our text corpus themselves acted as the labels each time the window slid over the data.
In the proposed architecture, we instead have a fixed label for each input sentence. Even at a glance, you can see why this works so well: in CBOW there are a great many possible targets, which makes it hard for the loss to decrease steadily, whereas here we only have a few fixed labels, which naturally lets the loss decrease much more steadily.
This basic Bag of Tricks architecture is quite similar to the CBOW architecture. Where CBOW predicts the middle (target) word from its context words, here the label plays the role of the middle word: it is what the prediction has to match. The sentence is broken into words and the words into vector representations. These vectors are fed into the hidden (embedding) space, the resulting outputs are averaged, and this averaged representation is matched against the representation of the label. Using the softmax function, we obtain a probability distribution over the labels and minimize the negative log-likelihood of the correct one, i.e., we minimize the loss so that the prediction matches the target. This leads our loss function to look something like this:
-\frac{1}{N}\sum_{n=1}^{N} y_n \log(f(BAx_n))
where y_n is the label of the n-th example, x_n its normalized bag-of-words feature vector, N the number of examples, f the softmax function, and B and A the weight matrices.
The goal is to match the embedding of the words in the corpus with the embedding of the label; how close the two representations are tells us the similarity between the prediction and the target.
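Written out in code, the loss is easy to follow. Below is a minimal NumPy sketch of one term of the sum (the shapes, word indices, and label index are made up for illustration): build the normalized bag-of-words vector x_n, apply A and B, take the softmax f, and read off the negative log-probability of the true label.

import numpy as np

vocab_size, embedding_dim, num_labels = 10, 4, 3
A = np.random.rand(vocab_size, embedding_dim)    # embedding (lookup) matrix
B = np.random.rand(embedding_dim, num_labels)    # output weight matrix over the labels

word_ids = [1, 4, 7]    # word indices of one example sentence (made up)
label_id = 2            # its label index (made up)

x_n = np.zeros(vocab_size)
x_n[word_ids] = 1.0 / len(word_ids)              # normalized bag-of-words features

scores = x_n @ A @ B                             # the BAx_n term, written row-vector style
probs = np.exp(scores) / np.exp(scores).sum()    # the softmax f(.)
loss_n = -np.log(probs[label_id])                # this example's term, before averaging over N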

Hierarchical SoftMax

To understand this algorithm, let's first recap what SoftMax is.
The softmax is a generalization of logistic regression that normalizes an input vector into an array of values corresponding to a probability distribution (adding up all the values gives 1).
The formula is:
\frac{\exp v_{j}}{\sum ^{W}_{i=1}\exp v_{i}}
...where j is our target class and i ranges over the W available classes.
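As a quick numeric illustration (toy scores of our own choosing):

import numpy as np

v = np.array([2.0, 1.0, 0.1])          # raw scores v_i for W = 3 classes
softmax = np.exp(v) / np.exp(v).sum()  # the formula above
print(softmax, softmax.sum())          # the probabilities add up to 1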
The problem with softmax is its redundant computation and complexity: if our vocabulary size is V, the complexity is O(V). The main motivation for using hierarchical softmax is to reduce this complexity from O(V) to O(\log(V)). This is achieved using a binary tree in which each leaf node represents a word. The words can be arranged in the leaves according to their frequency of occurrence in our text corpus, with the position of each word recorded. Consequently, there is a designated path from the root node to the leaf of any given word.
Each word can be reached by a path from the root through the inner nodes, and each inner node assigns a probability to the branch taken with the help of a sigmoid function (since we either traverse to the left child or the right child). For a given context, we compute the probability of the target word by multiplying the probabilities of the left-or-right decisions made at the inner nodes along its path; since we already know the path from the root to that word's leaf, only those decisions need to be evaluated.
So, for a tree with a root node, two inner nodes, and the accompanying leaf nodes, we would perform only 3 computations to reach our desired word, which cuts the amount of computation enormously.
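As a rough sketch of the idea (an illustration only, not fastText's actual tree code; the node vectors, hidden vector, and path are made up), the probability of a word is simply the product of the sigmoid decisions along its known path from the root:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

inner_node_vectors = np.random.rand(3, 4)   # one learned vector per inner node (3 inner nodes here)
hidden = np.random.rand(4)                  # hidden representation built from the context

# Suppose the target word's leaf is reached by going left at the root (node 0)
# and right at inner node 1; +1 encodes "left", -1 encodes "right".
path = [(0, +1), (1, -1)]

prob = 1.0
for node_id, direction in path:
    prob *= sigmoid(direction * inner_node_vectors[node_id] @ hidden)

print(prob)   # only len(path) sigmoids were evaluated instead of a softmax over the whole vocabulary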

Code (Scaled-down)

Here, we’ll go through salient parts of our implementation of the basic architecture.
For our implementation, we have created our own small dataset. For better results on this small dataset, we remove the stop words from the sentences. Stop words are words like "a", "the", "is", and so on.
corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          'A kings breakfast has sausages, ham, bacon, eggs, toast and beans.',
          'I love green eggs, ham, sausages and bacon!',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!',
          'The dog is loyal to its master.',
          'This dog look likes a fox.',
          'The pasta looks delicious.',
          'The weather is great today.',
          'My friend is having a burger.',
          'I hate eating new dishes.',
          'This kitten is looking so cute.',
          'I love playing football.',
          'Magnus Carlsen is the best chess player in the world.',
          'Sachin Tendulkar is the God of cricket.',
          'Winter is coming.',
          'The eggs are boiled.',
          'The national sports of India is hockey.',
          'An apple a day keeps a doctor away.',
          'The day seems so humid.',
          'The squirrel looks cute but is dangerous when encountered.',
          'Our college has no tennis court.',
          'My favourite dish is biryani.',
          'Lion is the king of the jungle.',
          'Let us all play together.',
          'Off the 28 Olympic medals, India have won 9 gold, 7 silver and 12 bronze.',
          'The aroma of the dish is spreading all around the gallery.']

labels = ['weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather', 'animals', 'animals', 'animals',
          'food', 'weather', 'food', 'food', 'animals', 'sports', 'sports', 'sports', 'weather', 'food',
          'sports', 'food', 'weather', 'animals', 'sports', 'food', 'animals', 'sports', 'sports', 'food']

corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Text': corpus, 'Category': labels})
corpus_df = corpus_df[['Text', 'Category']]
mod_df = corpus_df

for i in range(len(corpus_df['Text'])):
    new_sent = ''
    for j in corpus_df['Text'].iloc[i].split():
        if j not in stop_words:
            new_sent = new_sent + ' ' + j
    mod_df['Text'].iloc[i] = new_sent

mod_df
We use tensorflow’s tokenizer and use the fit_on_texts function to create our vocabulary. We then use the texts_to_sequences function to directly generate our training data and labels.
top_k = 1000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k, oov_token="", filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~')
tokenizer.fit_on_texts(mod_df['Text'])
tokenizer.fit_on_texts(corpus_df['Category'])
train_seqs = tokenizer.texts_to_sequences(mod_df['Text'])
train_labels = tokenizer.texts_to_sequences(corpus_df['Category'])
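A quick way to sanity-check what the tokenizer produced (an inspection snippet of our own, not part of the original pipeline):

print(len(tokenizer.word_index))        # size of the vocabulary built from both the text and the labels
print(train_seqs[0], train_labels[0])   # the first sentence and its label, now as integer index sequences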
Now, refer to the architecture we presented to accomplish our task. We have an Embedding matrix and another matrix used to get our output. We use Adam as our optimizer.
EMBEDDING_SIZE = 2
opt = tf.optimizers.Adam()
iterations = 1000

# Here the word vectors are represented as rows
embedding_matrix = tf.Variable(np.random.rand(len(tokenizer.word_index), EMBEDDING_SIZE))
de_embedding_matrix = tf.Variable(np.random.rand(EMBEDDING_SIZE, len(tokenizer.word_index)))

def one_hot_enc(inp):
    dummy_vector = np.zeros(shape=(1, len(tokenizer.word_index)))
    dummy_vector[0][inp - 1] = 1
    return dummy_vector

def train_step(indices, target, log_list):
    target = target[0]
    with tf.GradientTape() as tape:
        # Average the embeddings of every word in the sentence
        hidden_avg = 0
        for i in indices:
            hidden_avg = hidden_avg + tf.matmul(one_hot_enc(i), embedding_matrix)
        hidden_avg = tf.divide(hidden_avg, len(indices))
        # Project onto the vocabulary and take the softmax probability of the label
        output_over_classes = tf.matmul(hidden_avg, de_embedding_matrix)
        soft_out = tf.nn.softmax(tf.squeeze(output_over_classes), axis=0)
        loss_trgt = soft_out[target - 1]
        log_loss = -tf.math.log(loss_trgt)
        log_loss = (1 / len(indices)) * (target - 1) * log_loss
        log_list.append(log_loss.numpy())
    grad = tape.gradient(log_loss, [embedding_matrix, de_embedding_matrix])
    opt.apply_gradients(zip(grad, [embedding_matrix, de_embedding_matrix]))
The next part is probably the most important part of our implementation. We add a helper function that returns the one-hot encoding of a word index. We then use GradientTape, a truly beautiful tool, to calculate the loss, differentiate it, and backpropagate the gradients to update our two weight matrices.
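For completeness, a driver loop that feeds the data to train_step could look like the minimal sketch below; the epoch structure and the loss_log list are our own choices rather than anything prescribed by the paper:

# Hypothetical driver loop: feeds every (sentence, label) pair to train_step each epoch
loss_log = []
for epoch in range(iterations):
    for indices, target in zip(train_seqs, train_labels):
        train_step(indices, target, loss_log)   # appends each example's loss to loss_log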

Code (Scaled up)

To show the effect of our architecture when handling bigger datasets, we have introduced a few noteworthy changes in our code.
We have used the BBC News training set, which contains several paragraphs, each corresponding to one of five labels:
corpus_df['Category'].unique()

array(['business', 'tech', 'politics', 'sport', 'entertainment'], dtype=object)
We use the same snippet of code from our scaled-down version to remove stop words from the data frame, and we create our dataset with the tokenizer accordingly. We diverge from our old method by introducing the Keras Sequential model API. We set our embedding size to 20 and our batch size to 16. The model begins with our embedding layer, followed by a custom layer that averages its output, and ends with a dense layer. The loss function we use is categorical cross-entropy, since that is objectively the closest to the loss given in the paper.
Our model is going to look something like this:
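As a reference point, a minimal Keras sketch matching that description might look like the following; the exact layers (in particular GlobalAveragePooling1D standing in for the custom averaging layer) and the input dimension are our assumptions, not the exact code we ran:

import tensorflow as tf

EMBEDDING_SIZE = 20
NUM_CLASSES = 5

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=top_k, output_dim=EMBEDDING_SIZE),
    tf.keras.layers.GlobalAveragePooling1D(),                  # averages the word embeddings of a document
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')   # probability distribution over the 5 labels
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()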

Visualizations

Scaled-down

We see that the dip in loss for our small dataset is pretty steep and saturates at a value close to 0.
Here we have plotted the word associations formed on our small dataset. You can select an area to zoom in and see the associations that have formed. One can conclude that our architecture has worked beautifully on a smaller dataset.

Scaled - up

We have used categorical cross-entropy as our loss function for the scaled-up version of sentiment analysis. Our loss decreased quite steadily and saturated within 100 time steps.
Since we used a bigger dataset, we decided to visualize our embedding layer in TensorBoard's embedding projector. As you can see, many similar words have appeared in association with "sad", along with plenty of unrelated ones. This can be attributed to the fact that our dataset wasn't huge, and sentences can contain many words that don't relate directly to the sentiment unless paired with another word. For example, the sentence "I am not happy" has a sad sentiment, but if we consider individual words, the word "happy" also gets associated with it.

Conclusion

Bag of Tricks is a lovely prerequisite paper to fastText that not only explores a simple way of powering up sentiment analysis but also suggests ways to make it more efficient. The world of NLP is moving on to newer concepts like attention, so this paper also serves as an ode to the original word2vec idea, showcasing its raw power of learning the meaning of words through word associations.