Introduction

I have never found recurrent networks very intuitive. Part of me wants to believe that the idea of recurrence is too complex to be made sense of, the other part yells at me for being lazy (import tf.keras.layers.SimpleRNN) and not trying to derive the calculations on my own. This report is the consequence of intolerable yelling.

In this report, we will dive deep into recurrent neural networks. The main motive is to look under the hood of the architecture. The only feasible way for me to jot down the nitty-gritty of RNNs is to build it from scratch in NumPy. We take a sequence modeling task, build an RNN using NumPy, and describe the various aspects of the architecture.

Problem Statement

The best way to build an intuition about sequences is to look at characters. On one hand, if kept jumbled, the sequence of characters does not make sense, and on the other hand, when kept in the right order, it creates beautiful poetry, prose, code, and mathematical equations. We are going to build a text generator that has been modeled upon a given input text. The text generation and language modeling problem can be tackled in various ways. In this report, I would like to show a character level language model.

The input will be a sequence of characters, and the output would be the immediate next character in the sequence. The image below demonstrates the approach. The characters in a particular sequence are H, E, L, L, and the next character is O. Notice that the character O could have been a , or merely a \n. The character that is mainly generated depends on the context of the sequence. A well-trained model would generate characters that fit the context very well.

-> Character level language modeling problem <-

Training the RNN with NumPy

In this section, I will be documenting and explaining my thoughts that went into building the RNN.

Data

The first direction that I headed to was processing the input data. The input data is picked up from any .txt file provided. The file is read, and vocabulary is formed. Vocabulary is the collection of unique characters that are found in the text file. The transformation of characters into numbers was the immediate next step. This is done because the model needs to have numbers to process.

char_to_ix = {c:ix for ix,c in enumerate(vocab)}
ix_to_char = np.array(vocab)
text_as_int = np.array([char_to_ix[c] for c in text])


-> text_as_int stores the input text in the form of numbers. <-

Feedforward

To get into this part, let us look at the formulae governing a recurrent neural network.

-> The recurrence formula <- $h_{t} =\tanh\begin{pmatrix} x_{t}\ h_{t-1} \end{pmatrix}$

The equation and the image are a simple representation of the recurrence formula. The idea to grasp here is that $W$ is a vanilla multi-layer perceptron, we provide input $x_{t}$ and $h_{t-1}$, and the model produces the hidden state $h_{t}$ at every time step. This hidden state $h_{t}$ is a function of the present input $x_{t}$ and the past hidden state $h_{t-1}$.

The idea of recurrence hits harder when we unroll the concept of $h_{t-1}$. The previous hidden state $h_{t-1}$ is a function of the previous input $x_{t-1}$ and the previous to previous hidden state $h_{t-2}$. This goes on till we have reached $h_{-1}$. The theory directly implies the fact that recurrent models take due notice of the past inputs. They change their weights and biases, taking into consideration all of the past events.

Inside the multilayer perceptron: $raw_h_{t} =W_{x}x_{t}+W_{h}h_{t-1}+b_{h}\ \boxed{h_{t} =\tanh raw_h_{t}}\$

With the hidden state $h_{t}$ in hand, we can plug in a projection head and transform the hidden state to our liking. For our problem statement, we need to project the hidden state $h_{t}$ to obtain probabilities for the next character in the sequence. In this way, the next character considers all the previous characters due to the recurrence. $\boxed{y_{t} =W_{y}\times h_{t}+b_{y}}$

We will iterate over a provided sequence and plug in a projection head at the last hidden state to tackle our problem. This will project the probability of the next character provided the input sequence. As shown below, the character's probability should be higher for O and lower for every other character in the vocabulary.

-> Feed forward <-

for t in range(len(inputs)):
xs[t] = np.zeros((vocab_size,1)) #one hot encode
xs[t][inputs[t],0] = 1
hs[t] = np.tanh(np.matmul(Wxh, xs[t]) + np.matmul(Whh, hs[t-1]) + bh)

y = np.matmul(Why, hs[t]) + by #projection
p = np.exp(y) / np.sum(np.exp(y)) #probability


Loss formulation

After projecting the final hidden state $h_{t}$, we have the un-normalized log probabilities for each of the vocabulary characters. These un-normalized log probabilities are the elements in $y_{t}$. $p_{k} =\frac{e^{y_{k}}}{\sum {j} e^{y{j}}}$ Here $p_{k}$ is the normalized probability of the correct class $k$. We then apply a negative $\log$ on this and get the softmax loss of $y_{t}$. $\boxed{\mathcal{L}{t} =-\log p{k}}$ We take this loss and back-propagate through the network.

Backpropagation

In this stage, we have to back-propagate the softmax loss. For a recurrent model, the name given to the backpropagation algorithm is ** backpropagation through time**. In this stage, we have to traverse from the last time step to the first time step of the RNN and back-propagate the gradients. In this section, we will hand traverse through time and derive the gradients as we go.

• Gradient of the loss $\mathcal{L}$ with respect to the projection $y$: A thorough investigation of the gradient can be found in CS231n softmax grad blog. We will try keeping it to the point here. $\boxed{\frac{\partial \mathcal{L}}{\partial y} =\ p_{j} -\mathbb{1}{when\ j=k}}$
dy = np.copy(p)
dy[target] -= 1

• The derivative of the loss $\mathcal{L}$ wrt $W_{y}$: This one is easy; we calculate the gradient $\frac{\partial y}{\partial W_{y}}$ and then use the chain rule with the upstream gradient. $y=W_{y}h_{final}\ \frac{\partial y}{\partial W_{y}} =h_{final}$ $\frac{\partial \mathcal{L}}{\partial W_{y}} =\frac{\partial \mathcal{L}}{\partial y}\frac{\partial y}{\partial W_{y}}\ \boxed{\frac{\partial \mathcal{L}}{\partial W_{y}} =\frac{\partial \mathcal{L}}{\partial y} h_{final}}$
dWhy = np.matmul(dy, hs[final].T)
dby = dy

• The derivative of the loss $\mathcal{L}$ wrt the present hidden state $h_{t}$: Here we have to take two things under consideration. The final hidden state $h_{final}$ has gradients flowing from the projection head. All hidden states other than $h_{final}$ have gradients flowing from the next raw hidden state $h_{raw_t+1}$ $y=W_{y}h_{final}\ \frac{\partial y}{\partial h_{final}} =W_{y}\$ $\frac{\partial \mathcal{L}}{\partial h_{final}} =\frac{\partial \mathcal{L}}{\partial y}\frac{\partial y}{\partial h_{t}}\ \boxed{\frac{\partial \mathcal{L}}{\partial h_{final}} =\frac{\partial \mathcal{L}}{\partial y}W_{y}+\partial{h_{next}}}$ The gradients of every other $h_{t}$ are given as $\frac{\partial \mathcal{L}}{\partial h_{t}} =\frac{\partial \mathcal{L}}{\partial y}\frac{\partial y}{\partial h_{t}}\ \boxed{\frac{\partial \mathcal{L}}{\partial h_{t}} =\partial{h_{next}}}$
dhnext = np.matmul(Why.T, dy)
for t in reversed(range(len(inputs))):
dh = dhnext
.
.
dhnext = np.matmul(Whh.T, dhraw)

• The derivative of $\mathcal{L}$ wrt $h_{raw_t}$: $h_{t} =\tanh h_{raw_t}\ \frac{\partial h_{t}}{\partial h_{raw_t}} =1-(\tanh h_{raw_t})^{2}\ \frac{\partial h_{t}}{\partial h_{raw_t}} =1-h^{2}{t}$ The gradients: $\frac{\partial \mathcal{L}{t}}{\partial h_{raw_t}} =\frac{\partial \mathcal{L}{t}}{\partial h{t}}\frac{\partial h_{t}}{\partial h_{raw_t}}\ \boxed{\frac{\partial \mathcal{L}{t}}{\partial h{raw_t}} =\frac{\partial \mathcal{L}{t}}{\partial h{t}}\left( 1-h^{2}_{t}\right)}$
dhraw = (1 - hs[t] * hs[t]) * dh

• The derivative of the loss $\mathcal{L}$ wrt $W_{x}$: $h_{raw_t}=W_{x}x_{t}+W_{h}h_{t-1}\ \frac{\partial h_{raw_t}}{\partial W_{x}} =x_{t}$ $\frac{\partial \mathcal{L}{t}}{\partial W{x}} =\frac{\partial \mathcal{L}{t}}{\partial h{raw_t}}\frac{\partial h_{raw_t}}{\partial W_{x}}\ \boxed{\frac{\partial \mathcal{L}{t}}{\partial W{x}} =\frac{\partial \mathcal{L}{t}}{\partial h{raw_t}} x_{t}}$
dWxh += np.matmul(dhraw, xs[t].T)

• **The derivative of the loss $\mathcal{L}$ wrt $W_{h}$ **: $h_{raw_t}=W_{x}x_{t}+W_{h}h_{t-1}\ \frac{\partial h_{raw_t}}{\partial W_{h}} =h_{t-1}$ $\frac{\partial \mathcal{L}{t}}{\partial W{h}} =\frac{\partial \mathcal{L}{t}}{\partial h{raw_t}}\frac{\partial h_{raw_t}}{\partial W_{h}}\ \boxed{\frac{\partial \mathcal{L}{t}}{\partial W{h}} =\frac{\partial \mathcal{L}{t}}{\partial h{raw_t}} h_{t-1}}$
dWhh += np.matmul(dhraw, hs[t-1].T)


Section 7

In this section, we will look into the biggest problem of recurrent neural networks. Below is an image of the gradients flowing back through the time step. Notice anything bad for the gradient?

1. $\tanh$ non-linearity: When the gradients of hidden state back-propagates, it goes through a $\tanh$ non-linearity. The derivative $\frac{\partial \tanh x}{\partial x} =1-(\tanh x)^{2}$ signifies that while backpropagating, the gradients tend to converge between [-1,1]. The square term further perturbs the gradients.

2. The $W$ weight matrix: While back-propagating, the gradients are continuously multiplied with the weight $W$ matrix. If the largest singular value > 1, the gradient explodes to infinity, while the gradients vanish to 0 if the largest singular value < 1.

The theory looks nice and intuitive, but does this really happen? To answer this, I made some changes to my code to visualize the gradients flowing from the last time step to the first time step. In my code, the RNN unrolls for 25 time steps. This means while back-propagating, the gradient of the hidden state $\partial{h_{t}}$ needs to flow from time step 24 to time step 0. The way I tried visualizing the gradient is by plotting the histograms at each time step. This will tell us while the range of values is more in the count and so on. In a vanishing gradient problem, we would expect the histograms of time step 25 to be wider than the histogram at time step 0. In an exploding grad problem, we will expect the histograms to diverge. To make the visualizations even better, I have plotted histograms for particular epochs as well. It would help us see what a trained and untrained network does to the gradients. In the visualizations provided, the step denotes the epoch. I have samples epoch 0, epoch 11, epoch 22, epoch 33, epoch 44, and epoch 44.

Visualize the Connectivity

In the article Visualizing memorization in RNNs, the authors propose a great tool to visualize the contextual understanding of a sequence model, connectivity between the desired output and all the input. This means the visualization would say which inputs are the reason that we got our desired output.

-> Source: Visualizing memorization in RNNs <-

To visualize the connectivity, the first step in to see the heat map colors. The heat map I have chosen is shown below. The cold connectivity (not so connected) will be transparent and gradually move from light to dark blue. The hot connectivity (strong connection) will be colored red.

-> The heatmap colors with their intensities of connection <-

For this experiment, I have chosen a sequence of varying lengths and tried inferring the immediate next character. The inferred character will be colored green. The rest of the sequence will be colored as the heat map chosen.

• Sequence length 20:
• Sequence length 40:
• Sequence length 100:

The connectivity reveals two aspects of the recurrent model. Firstly, we can see the green characters make sense where they stand. The second and more critical aspect is that connectivity is on a short term basis. Due to the vanishing gradient problem, the gradient does not backpropagate properly. Hence long-term context cannot be picked up.

Discussion

The problem with the RNN architecture is that it has poor gradient flow. This is something that sequence models try to counter with either a different architecture or a different approach. We could have approached the problem differently. That would have led us to think about a different backpropagation strategy.

One approach could have been projecting the hidden state $h_{t}$ at each time step and evaluating a loss at each time step $\mathcal{L}_{t}$. This would help better gradients flow by providing each hidden state an upstream gradient from each time step's loss. This does seem like a good approach, yet this is unable to make long context understandings.

The only option for us to go to is to improve the architecture. The foundation of Long Short Term Memory was done to counter the problems faced by a vanilla RNN. Do you want an in-depth report on LSTMs too? We would face the same problem statement with an LSTM this time. If this sounds fun, comment below and let me know.

On an ending note, I would love to thank Kyle Goyette for his valuable feedback and suggestions.

Reach out to me on Twitter at @ariG23498