
Text Generation With Sequence Models

In this article, we explore network architectures and training settings for character-based text generation. We compare RNNs, GRUs, and LSTMs with different depths, layer widths, and dropout.


Project Description

Boris explored network architectures and training settings for character-based text generation. He compared RNNs, GRUs, and LSTMs with different depths, layer widths, and dropout.
He also considered the length of the training text, the sequence length, and the number of sequences per batch.

RNN, LSTM, or GRU?

I ran three experiments to compare different types of recurrent networks: "vanilla" RNN, LSTM, GRU.
LSTM & GRU clearly outperform RNN.
The GRU also seems to learn faster than the LSTM, so it could be the preferred choice.
Training time is also a major factor to consider when comparing these models. Click "edit" on the graphs below and change the x-axis to time: the RNN is much faster than the LSTM and GRU but unfortunately stops learning much earlier. Note that I was running several experiments at the same time, which affected the run time of each individual run.
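For context, here is a minimal sketch of a character-level model in which the recurrent cell type is swappable. The report does not include the model code, so this assumes PyTorch, and the class and argument names are illustrative rather than Boris' actual implementation:

```python
# Minimal character-level model sketch where the recurrent cell type is a
# hyperparameter, so "vanilla" RNN, GRU, and LSTM share the same code path.
import torch.nn as nn

class CharModel(nn.Module):
    def __init__(self, vocab_size, hidden_size=256, n_layers=2, cell="lstm"):
        super().__init__()
        rnn_cls = {"rnn": nn.RNN, "gru": nn.GRU, "lstm": nn.LSTM}[cell]
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = rnn_cls(hidden_size, hidden_size,
                           num_layers=n_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        out, hidden = self.rnn(self.embed(x), hidden)
        return self.head(out), hidden   # logits over the next character
```

Passing cell="rnn", "gru", or "lstm" reproduces the three variants compared above.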


[Charts: Run set 1 (3 runs)]


Number of Layers

These runs were performed to compare how the number of layers affected the model.
While we only see minor differences in the training loss, the effect is visible in the validation loss.
It is interesting to note how the "3 layers" architecture had a longer "warm-up" time before its loss started decreasing, though it may be a coincidence.
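Comparisons like this one over depth can be run systematically as a W&B sweep. The sketch below is only illustrative: the project name and train() function are placeholders, not the configuration actually used for these runs:

```python
import wandb

# Illustrative grid sweep over network depth; the project name and train()
# function are placeholders, not the report's actual setup.
sweep_config = {
    "method": "grid",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {"n_layers": {"values": [1, 2, 3]}},
}
sweep_id = wandb.sweep(sweep_config, project="char-text-generation")
# wandb.agent(sweep_id, function=train)  # train() reads wandb.config.n_layers
```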


[Charts: Run set 1 (3 runs)]



Width of Each Layer

These runs were performed to compare how the width of each layer affected the model.
It seems that the layer width has a much larger effect than the depth of the network.
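One way to see why width can dominate is the parameter count: widening a layer grows its weight matrices quadratically, while an extra layer only adds one more block of the same size. A quick illustration (the sizes here are arbitrary, not taken from the report):

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

base = nn.LSTM(256, 256, num_layers=2)  # reference width and depth
wide = nn.LSTM(512, 512, num_layers=2)  # double the width
deep = nn.LSTM(256, 256, num_layers=3)  # one extra layer
print(n_params(base), n_params(wide), n_params(deep))
# Doubling the width roughly quadruples the parameters;
# the extra layer adds only about 50%.
```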


[Charts: Run set 1 (4 runs)]



The Effect of Dropout

These runs were performed to evaluate the effect of dropout between each RNN layer.
We do not see any over-fitting even without dropout, meaning that the text used is large enough for the size of our network. Therefore, we won't use dropout unless we see evidence of over-fitting.
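For reference, a sketch of how dropout between recurrent layers is typically specified (assuming PyTorch, where the dropout argument of nn.LSTM is applied between stacked layers rather than inside a cell):

```python
import torch.nn as nn

# The dropout argument applies to the output of every recurrent layer except
# the last, i.e. "between" stacked layers; it has no effect when num_layers == 1.
rnn = nn.LSTM(input_size=256, hidden_size=256, num_layers=2,
              dropout=0.1, batch_first=True)
```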


[Charts: Run set 1 (4 runs)]



Looking for the Best Model

Using the previous conclusions, we train for a longer time and try to get the best possible model.
We can see that a wider network performs better than a deeper network and that 2 layers may be enough.
The networks with no dropout may be over-fitting in this case (contrary to the previous section, where we had smaller networks), as the validation loss stops improving after a certain time and even increases in some cases. We retrain the best architectures with some dropout added, which slightly improves both training and validation errors. A value of 0.1 seems sufficient, as too much dropout (0.3) reduces performance.
Note: it is interesting to see the peak in validation loss on one of the runs. This happens on some runs, though rarely. It is difficult to interpret but could be due to the introduction of "mistakes" on words that occur frequently in the validation set.
Because the training loss is averaged over each epoch, any sporadic peaks there are hidden from us. The peak is usually corrected quickly and does not appear again, but some peaks are never resolved and end up indicating over-fitting; this happens especially with larger batch sizes (see the section "Number of Sequences per Batch").
Another theory is that it is due to poor initialization of the hidden states: they are zeroed and may therefore start in the middle of a sentence, leading to a "corrupted memory", as such cases may not have been seen during training yet keep occurring the same way in the validation set.
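To make that theory concrete, here is a sketch of the two hidden-state strategies. The report does not show the training loop, so this reuses the CharModel sketch from the first section with dummy data and illustrative hyperparameters:

```python
import torch
import torch.nn as nn

# Reuses the CharModel sketch from the first section; data and sizes are dummies.
model = CharModel(vocab_size=80, hidden_size=128, n_layers=2, cell="lstm")
criterion = nn.CrossEntropyLoss()
batches = [(torch.randint(0, 80, (8, 100)), torch.randint(0, 80, (8, 100)))]

hidden = None                                    # None -> the LSTM zero-initializes its state
for x, y in batches:
    logits, hidden = model(x, hidden)
    loss = criterion(logits.transpose(1, 2), y)  # cross-entropy per character
    loss.backward()
    hidden = None                                # zeroed each batch: the next chunk starts
                                                 # "mid-sentence" with no memory of its context
    # Alternative: carry the state across consecutive chunks, detached from the graph:
    # hidden = tuple(h.detach() for h in hidden)  # (h, c) for an LSTM
```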


[Charts: Run set 1 (5 runs)]



Number of Characters in Each Sequence

These runs show the influence of the number of consecutive characters to be used in each sequence, prior to backpropagation.
Longer sequences provide better results, probably due to more stability during weight updates. However, sequences of 100 characters are probably enough, as the improvements above that length are only minor and training time increases linearly with the number of characters per sequence.
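Concretely, the text is encoded as integer ids and cut into fixed-length chunks, and weights are updated after each chunk. A sketch of that preprocessing (the encoding details are an assumption, not the report's code):

```python
import torch

def make_sequences(text, seq_len=100):
    """Encode the text as integer ids and cut it into (input, target) chunks of
    seq_len characters, where the target is the input shifted by one character."""
    stoi = {c: i for i, c in enumerate(sorted(set(text)))}
    ids = torch.tensor([stoi[c] for c in text], dtype=torch.long)
    n_chunks = (len(ids) - 1) // seq_len
    x = ids[: n_chunks * seq_len].view(n_chunks, seq_len)
    y = ids[1 : n_chunks * seq_len + 1].view(n_chunks, seq_len)
    return x, y, stoi

# Dummy text; in the experiments this would be the full training text.
x, y, stoi = make_sequences("hello world, " * 200, seq_len=100)
```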



[Charts: Run set 1 (4 runs)]



Number of Sequences per Batch

These runs show the influence of the number of parallel sequences per batch used during training. These runs consider sequences of 100 characters based on the previous section.
While a larger batch size improves the results, we need to add significant dropout above a batch size of 8. With a batch size of 32, a dropout of 0.1 was not enough and we used 0.3 to avoid over-fitting (probably because there is not enough "noise" during training). So it may be better to limit the batch size to 8 (or 16 at most).
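The batch size here is simply how many of these character chunks are processed in parallel. Continuing the preprocessing sketch from the previous section (again hypothetical):

```python
from torch.utils.data import DataLoader, TensorDataset

# Group the (input, target) chunks from the previous sketch into batches of
# parallel sequences; batch_size is the hyperparameter compared in these runs.
loader = DataLoader(TensorDataset(x, y), batch_size=8, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape)   # torch.Size([8, 100]): 8 parallel sequences of 100 characters
```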


[Charts: Run set 1 (4 runs)]


Text Length

These runs let us evaluate how the text length affects our model.
Note: we increase the percentage of validation data in smaller texts to ensure the loss is relevant. Also, we normalize the losses based on the number of unique characters present in the text.
As the text gets shorter, we need to introduce some dropout to prevent over-fitting; a small value may not be enough, but increasing it usually solves the issue. Nevertheless, when the text length varies significantly, it would make more sense to re-tune the entire model (at least the layer width).
Also, we used different texts, and some seem easier to learn than others. For example, even when shortening the Internal Revenue Code (T1) from 24M to 2.6M characters, we obtain a lower validation loss than for The Count of Monte Cristo (also 2.6M characters).
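The report does not spell out how the losses are normalized; one plausible reading, sketched below as an assumption, is to divide the cross-entropy by the log of the number of unique characters, i.e. the loss of a uniform random predictor, so texts with different alphabets become comparable:

```python
import math

def normalized_loss(cross_entropy_nats, n_unique_chars):
    """Assumed normalization: divide by log(vocab size), the loss a uniform
    random predictor would achieve on this alphabet."""
    return cross_entropy_nats / math.log(n_unique_chars)

print(round(normalized_loss(1.4, 80), 3))  # e.g. 1.4 nats on an 80-character alphabet -> ~0.32
```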


[Charts: Run set 1 (6 runs)]
