
Under the Hood of Long Short Term Memory (LSTM)

This article explores how LSTMs work, including how to train one with NumPy, how their gradients vanish or explode, and how to visualize their connectivity.
In the previous article on recurrent neural networks (Under the hood of RNNs), we went through the training loop of a vanilla recurrent neural network (RNN). We spent much time understanding the feedforward and the backpropagation algorithm. The critical point was that we built the entire network from scratch with NumPy. The article also covered the cons of training RNNs. The short context understanding and the vanishing/exploding gradient problem were visualized as well.

I hope the reader has gone through the article on RNNs before coming here. The problem statement remains the same as in that article: a character-level text generator.
In this article, we dive deep into the working of Long Short Term Memory (LSTM).

Training the LSTM with NumPy

Data

I began with processing the input data. The input data is picked up from any .txt file provided. The file is read and the vocabulary is formed. The vocabulary is the collection of unique characters found in the text file. The immediate next step is to transform the characters into numbers, because the model needs numbers to process.
import numpy as np

vocab = sorted(set(text))  # `text` holds the raw contents of the .txt file
char_to_ix = {c: ix for ix, c in enumerate(vocab)}
ix_to_char = np.array(vocab)
text_as_int = np.array([char_to_ix[c] for c in text])
text_as_int stores the input text in the form of numbers.
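For completeness, here is a minimal, hedged sketch of how a single character id could be turned into a one-hot column vector before being fed to the network. The helper name one_hot is illustrative and not necessarily the repository's exact code.

def one_hot(ix, vocab_size):
    # encode a character id as a one-hot column vector
    x = np.zeros((vocab_size, 1))
    x[ix] = 1.0
    return x

# example: encode the first character of the text
x_0 = one_hot(text_as_int[0], len(vocab))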

Feedforward

Here is a link to the GitHub Repository. Before diving into LSTMs, let us get one thing straight so that the text below makes sense. In a vanilla RNN, the gradient flow was faulty. As a consequence, an RNN does not learn a longer context. With LSTMs, we want to counter that problem. The architecture of LSTMs provides a better way for the gradients to backpropagate. In the text below, we will put on our detective caps and head off on a journey of intuitive thinking and logic to attain a better gradient flow.

A single LSTM cell

\begin{pmatrix} f\\ i\\ g\\ o \end{pmatrix} = \begin{pmatrix} \sigma\\ \sigma\\ \tanh\\ \sigma \end{pmatrix} W^{l}\begin{pmatrix} h_{t-1}\\ x_{t} \end{pmatrix}

c_{t} = f \odot c_{t-1} + i \odot g\\ h_{t} = o \odot \tanh c_{t}

When I started with LSTMs, the image and formulae were not intuitive to me at all. If you feel lost here, I can assure you things will ease up as we go a little further. At this juncture, I would like to point out that an LSTM cell has two recurrence states: c_{t}, the memory highway, and h_{t}, the hidden state representation.
During feedforward, the inputs h_{t-1} and x_{t} are concatenated into a single matrix z_{t} for simplicity and efficiency of computation. This input z_{t} is then passed through multiple gates. The functions f, i, g, and o are termed the gates of the LSTM architecture. They provide the intuition of how much of a particular piece of data needs to travel through to make a better representation. A gate in the LSTM architecture is a multilayer perceptron with a non-linearity function. The choice of the non-linearity function will make sense as we study the different gates in depth. A full feedforward sketch in NumPy follows the gate descriptions below.

The Gates

  • Forget gate f: This gate is concerned with how much to forget. It takes in the input z_{t} and then decides how much of the previous memory state c_{t-1} should be forgotten. The activation non-linearity is \sigma. This means that the gate's output is in the range 0 to 1: 0 means to forget everything, while 1 means to remember everything. This gate acts as a switch for the memory state circuit.
    f = \sigma W_{f}\begin{pmatrix} h_{t-1}\\ x_{t} \end{pmatrix}
    
  • After forgetting, we have the amount of memory state that we need from the previous step:
    c_{t\_f} = f \odot c_{t-1}
    
  • Input gate i: This gate is used to decide how much of the present input needs to flow. This acts as a switch for the present input circuit. This gate also uses a \sigma non-linearity function.
    i = \sigma W_{i}\begin{pmatrix} h_{t-1}\\ x_{t} \end{pmatrix}
    
  • Gate gate g: This gate closely resembles the recurrence formula of a vanilla RNN. We can say that this gate is the hidden state of the RNN in an LSTM. The resemblance to the RNN formula intensifies upon noticing the non-linearity function. This gate is the only one that uses a \tanh function.
    g = \tanh W_{g}\begin{pmatrix} h_{t-1}\\ x_{t} \end{pmatrix}
    
  • The usage of the input gate and the gate gate will make sense now. The input gate behaves like a switch to the output of the gate gate:
    h_{t\_i} = i \odot g
    
  • Upon pointwise addition of c_{t\_f} and h_{t\_i}, we get the present memory state. The memory state not only holds the past and present information but also holds a definite amount of both, which makes a better representation possible.
    c^{l}_{t} = c_{t\_f} + h_{t\_i}
    
  • Output gate o: This gate is responsible for deciding how much output will flow into making the present hidden state.
    o = \sigma W_{o}\begin{pmatrix} h_{t-1}\\ x_{t} \end{pmatrix}
    
  • Let us pass the memory state through a \tanh first.
    c_{t\_o} = \tanh c_{t}
    
  • Then c_{t\_o} needs to be elementwise multiplied with the output gate to evaluate how much of c_{t\_o} should become a part of the hidden state.
    h_{t} = o \odot c_{t\_o}
    
Feed forward for LSTM
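Putting the gates together, here is a minimal NumPy sketch of a single feedforward step. The function name lstm_step, the separate weight matrices Wf, Wi, Wg, Wo, and the omission of bias terms are simplifications for illustration, not the exact code from the repository.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wg, Wo):
    # one LSTM time step; biases are omitted for brevity
    z_t = np.vstack((h_prev, x_t))      # concatenate h_{t-1} and x_t
    f = sigmoid(np.matmul(Wf, z_t))     # forget gate
    i = sigmoid(np.matmul(Wi, z_t))     # input gate
    g = np.tanh(np.matmul(Wg, z_t))     # gate gate (candidate memory)
    o = sigmoid(np.matmul(Wo, z_t))     # output gate
    c_t = f * c_prev + i * g            # memory state
    h_t = o * np.tanh(c_t)              # hidden state
    return h_t, c_t, (z_t, f, i, g, o)  # cache the intermediates for backprop

The cached intermediates are exactly the values that show up again in the backpropagation section below.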

Loss Formulation

After projecting the final hidden state h_{t}, we have the un-normalized log probabilities for each of the characters in the vocabulary. These un-normalized log probabilities are the elements of y_{t}.
p_{k} = \frac{e^{y_{k}}}{\sum_{j} e^{y_{j}}}

Here p_{k} is the normalized probability of the correct class k. We then apply a negative \log on this and get the softmax loss of y_{t}.
\boxed{\mathcal{L}_{t} = -\log p_{k}}

We take this loss and back-propagate through the network.
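As a quick illustration, here is a hedged NumPy sketch of the projection and the softmax loss. Wy and hs[final] follow the notation of the snippets below; target_ix, the index of the correct next character, is an assumed name.

y = np.matmul(Wy, hs[final])        # un-normalized log probabilities
p = np.exp(y) / np.sum(np.exp(y))   # softmax: normalized probabilities
loss = -np.log(p[target_ix])        # negative log probability of the correct class
dy = np.copy(p)
dy[target_ix] -= 1                  # gradient of the loss wrt y (softmax gradient)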

Backpropagation

I will assume the reader knows how to compute the gradients of a softmax function; if not, refer to the previous article on RNNs.
Visualization of the backpropagation in LSTM
In this stage, we have to back-propagate the softmax loss. We will walk hand in hand through the reversed time steps and watch the gradients flow.

Back Propagation

Link to the GitHub repository

  • The gradient of the loss \mathcal{L} wrt W_{y}: We have projected the last hidden state h_{final} and computed the softmax loss. This means that the weight matrix W_{y} receives gradients only at the final time step.
    y = W_{y} h_{final}\\ \frac{\partial y}{\partial W_{y}} = h_{final}
    
  • The gradient:
    \frac{\partial \mathcal{L}}{\partial W_{y}} = \frac{\partial \mathcal{L}}{\partial y}\frac{\partial y}{\partial W_{y}}\\ \boxed{\frac{\partial \mathcal{L}}{\partial W_{y}} = \frac{\partial \mathcal{L}}{\partial y} h_{final}}
    
dWy = np.matmul(dy, hs[final].T)
dby = dy
  • The gradient of the loss \mathcal{L} wrt the present hidden state h_{t}: Here we have to take two things into consideration. The final hidden state h_{final} has gradients flowing from the projection head, while all hidden states other than h_{final} have gradients flowing from the next raw hidden state h_{raw\_t+1}.
    y = W_{y} h_{final}\\ \frac{\partial y_{t}}{\partial h_{final}} = W_{y}
    
  • The gradient:
    \boxed{\frac{\partial \mathcal{L}}{\partial h_{final}} = \frac{\partial \mathcal{L}}{\partial y} W_{y} + \partial h_{next}}
    
  • On the final time step, \partial h_{next} is taken to be all zeros.
# dhnext is all zeros
dh[final] = np.matmul(Wy.T, dy)+dhnext
For every other time step, the gradient of the loss wrt the hidden state h_{t} is simply the gradient flowing in from the next time step:
\frac{\partial \mathcal{L}}{\partial h_{t}} = \partial h_{next}

dh[t] = dhnext
  • The gradient of the loss \mathcal{L} wrt the memory state c_{t}: Here we need to consider the upstream gradients that are flowing from the time step t+1. This is added to the present gradient that is computed from the gradient of the hidden state \partial h_{t}.
    h_{t} = o_{t} \odot \tanh c_{t}\\ \frac{\partial h_{t}}{\partial c_{t}} = o_{t}\frac{\partial \tanh c_{t}}{\partial c_{t}}\\ \frac{\partial h_{t}}{\partial c_{t}} = o_{t}\left( 1-\tanh^{2} c_{t}\right)
    
  • The gradient:
    \frac{\partial \mathcal{L}}{\partial c_{t}} = \frac{\partial \mathcal{L}}{\partial h_{t}}\frac{\partial h_{t}}{\partial c_{t}} + \partial c_{next}\\ \boxed{\frac{\partial \mathcal{L}}{\partial c_{t}} = \frac{\partial \mathcal{L}}{\partial h_{t}} o_{t}\left( 1-\tanh^{2} c_{t}\right) + \partial c_{next}}
    
dc = dc_next                            # upstream gradient flowing from time step t+1
dc += dh * o[t] * dtanh(tanh(cs[t]))    # dtanh is assumed to compute 1 - x**2, i.e. 1 - tanh(c_t)**2

Backproping the gates

The complicated gradients have been derived above. The gradients for the gates are comparatively easier, so in the list below I will not derive each equation in full. Feel free to derive them in your own time; a combined NumPy sketch of these gate gradients follows the list.
  • The gradient of the loss \mathcal{L} wrt the output gate o_{t}:
    \boxed{\frac{\partial \mathcal{L}}{\partial o_{t}} = \frac{\partial \mathcal{L}}{\partial h_{t}} \tanh c_{t}}
    
  • The gradient of the loss \mathcal{L} wrt o^{'}_{t}, the pre-activation of the output gate (before the \sigma):
    \boxed{\frac{\partial \mathcal{L}}{\partial o^{'}_{t}} = \frac{\partial \mathcal{L}}{\partial o_{t}} o_{t}( 1-o_{t})}
    
  • The gradient of the loss \mathcal{L} wrt the gate gate g_{t}:
    \boxed{\frac{\partial \mathcal{L}}{\partial g_{t}} = \frac{\partial \mathcal{L}}{\partial c_{t}} i_{t}}
    
  • The gradient of the loss \mathcal{L} wrt g^{'}_{t}:
    \boxed{\frac{\partial \mathcal{L}}{\partial g^{'}_{t}} = \frac{\partial \mathcal{L}}{\partial g_{t}}\left( 1-g^{2}_{t}\right)}
    
  • The gradient of the loss \mathcal{L} wrt the input gate i_{t}:
    \boxed{\frac{\partial \mathcal{L}}{\partial i_{t}} = \frac{\partial \mathcal{L}}{\partial c_{t}} g_{t}}
    
  • The gradient of the loss \mathcal{L} wrt i^{'}_{t}:
    \boxed{\frac{\partial \mathcal{L}}{\partial i^{'}_{t}} = \frac{\partial \mathcal{L}}{\partial i_{t}} i_{t}( 1-i_{t})}
    
  • The gradient of the loss \mathcal{L} wrt the forget gate f_{t}:
    \boxed{\frac{\partial \mathcal{L}}{\partial f_{t}} = \frac{\partial \mathcal{L}}{\partial c_{t}} c_{t-1}}
    
  • The gradient of the loss \mathcal{L} wrt f^{'}_{t}:
    \boxed{\frac{\partial \mathcal{L}}{\partial f^{'}_{t}} = \frac{\partial \mathcal{L}}{\partial f_{t}} f_{t}( 1-f_{t})}
    
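Here is the combined NumPy sketch promised above: the gate gradients at a single time step. It assumes the cached feedforward values f, i, g, o, cs (memory states) and the gradients dh, dc computed earlier; the names ending in _raw denote the pre-activation gradients and are my own naming, not necessarily the repository's.

do = dh * np.tanh(cs[t])          # gradient wrt the output gate o_t
do_raw = do * o[t] * (1 - o[t])   # back through the sigmoid
dg = dc * i[t]                    # gradient wrt the gate gate g_t
dg_raw = dg * (1 - g[t] ** 2)     # back through the tanh
di = dc * g[t]                    # gradient wrt the input gate i_t
di_raw = di * i[t] * (1 - i[t])   # back through the sigmoid
df = dc * cs[t - 1]               # gradient wrt the forget gate f_t
df_raw = df * f[t] * (1 - f[t])   # back through the sigmoid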

Backproping the weights of individual gates

Because all the gates are nothing but multilayer perceptrons, they will have gradients for their respective weight matrices. This section shows the gradients of the individual gate weights; a short NumPy sketch follows the list.
  • The gradient of the loss \mathcal{L} wrt the weight of the output gate W_{o}:
    \boxed{\frac{\partial \mathcal{L}}{\partial W_{o}} = \frac{\partial \mathcal{L}}{\partial o^{'}_{t}} z_{t}}
    
  • The gradient of the loss \mathcal{L} wrt the weight of the gate gate W_{g}:
    \boxed{\frac{\partial \mathcal{L}}{\partial W_{g}} = \frac{\partial \mathcal{L}}{\partial g^{'}_{t}} z_{t}}
    
  • The gradient of the loss \mathcal{L} wrt the weight of the input gate W_{i}:
    \boxed{\frac{\partial \mathcal{L}}{\partial W_{i}} = \frac{\partial \mathcal{L}}{\partial i^{'}_{t}} z_{t}}
    
  • The gradient of the loss \mathcal{L} wrt the weight of the forget gate W_{f}:
    \boxed{\frac{\partial \mathcal{L}}{\partial W_{f}} = \frac{\partial \mathcal{L}}{\partial f^{'}_{t}} z_{t}}
    
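A short sketch of these weight gradients in NumPy: each gate weight accumulates the outer product of its pre-activation gradient with the concatenated input z_t (stored here as zs[t]). The accumulation across time steps is an assumption about how the full backward pass sums the per-step gradients.

dWo += np.matmul(do_raw, zs[t].T)   # output gate weights
dWg += np.matmul(dg_raw, zs[t].T)   # gate gate weights
dWi += np.matmul(di_raw, zs[t].T)   # input gate weights
dWf += np.matmul(df_raw, zs[t].T)   # forget gate weights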

Final gradient of input

In this section, we calculate the gradient of the loss with respect to the input. On close observation, we will conclude that this gradient holds the concatenated gradients of h_{t-1} and x_{t}; a short NumPy sketch follows the equation below.
  • The gradient of the loss \mathcal{L} wrt the input z_{t}:
    \boxed{\frac{\partial \mathcal{L}}{\partial z_{t}} = \frac{\partial \mathcal{L}}{\partial f^{'}_{t}} W_{f} + \frac{\partial \mathcal{L}}{\partial g^{'}_{t}} W_{g} + \frac{\partial \mathcal{L}}{\partial i^{'}_{t}} W_{i} + \frac{\partial \mathcal{L}}{\partial o^{'}_{t}} W_{o}}
    
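And a sketch of this input gradient in NumPy: dz collects the contributions from all four gates, and its top rows (the ones corresponding to h_{t-1}) become dhnext for the previous time step. The hidden_size slicing assumes h_{t-1} was stacked on top of x_t when forming z_t.

dz = (np.matmul(Wf.T, df_raw) + np.matmul(Wg.T, dg_raw)
      + np.matmul(Wi.T, di_raw) + np.matmul(Wo.T, do_raw))
dhnext = dz[:hidden_size, :]   # the part of dz that flows back into h_{t-1}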





Vanish/Explode the Gradient

Here we see why LSTMs are preferred over RNNs.
Backprop in RNN

As we had discussed earlier, the problem with the backpropagation in the vanilla RNN is the tanh\tanh non-linearity and the repeated multiplication of the weight matrix WW. This often leads to the vanishing or the exploding gradient problem. The point was further proved by looking at the histograms of gradients of the RNN at various stages of training.


Backprop in LSTM
Upon looking at the picture above, we notice one thing: the gradient of the memory state \partial c_{t} flows without much perturbation. Apart from the element-wise multiplication with the forget gate, the gradient \partial c_{t} flows freely along the circuit provided for the memory state. This is the reason the memory state is involved in the architecture in the first place. The name gradient highway now makes sense, doesn't it? The architecture has a better flow of gradients than the vanilla RNN.
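As a tiny illustration in the notation of the snippets above, the only thing standing between \partial c_{t} and the previous time step is an element-wise multiplication with the forget gate, so the memory-state gradient is passed back with no weight matrix multiplication and no saturating non-linearity:

dc_next = f[t] * dc   # gradient handed to the memory state of the previous step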
To prove the point, I have created GIFs of histograms. There are two rows: one for the gradient of the hidden state \partial h_{t} and the other for the gradient of the memory state \partial c_{t}. The histograms are taken at each time step, from time-step 25 down to time-step 0. To make the visuals more concrete, they are recorded at different epochs while the model trains: epochs 0, 10, 20, 30, and 40.



Some observations to see here:
  • At the very beginning, both the gradients vanish, which is due to the weight initialization.
  • Later when the model trains, the vanishing problem goes away. This means that the weights are tuned so that the gradients are backpropagated properly.
  • The gradient of the memory state does not vanish, as expected.

Visualize the Connectivity

In the article Visualizing memorization in RNNs, the authors propose a great tool to visualize the contextual understanding of a sequence model: the connectivity between the desired output and all of the inputs. This means the visualization shows which inputs are responsible for producing the desired output.
To visualize the connectivity, the first step is to look at the heat map colors. The heat map I have chosen is shown below. Cold connectivity (not so connected) is transparent and gradually moves from light to dark blue. Hot connectivity (a strong connection) is colored red.
The heatmap colors with their intensities of connection
For this experiment, I have chosen sequences of varying lengths and tried inferring the immediate next character. The inferred character is colored green. The rest of the sequence is colored according to the chosen heat map.
  • Sequence length 20:
    
RNN
LSTM

  • Sequence length 40:
    
RNN
LSTM
  • Sequence length 100:
    
RNN
LSTM
The connectivity shows quite evidently that LSTMs can pick up on long contexts. The reason they can and RNNs cannot lies in the better backpropagation of the loss.

Discussion

This was a tough project for me because LSTMs are not that easy to code from scratch. I purposely did not pursue an analysis of the gates, which might be taken care of in a future article. I would be more than happy to handle doubts about my code and article; please feel free to comment below.
As a final note, I would like to thank Kyle Goyette for his valuable feedback and suggestions.
Reach out to me - @ariG23498. 