Featured Report

This report is a saved snapshot of Boris' research. He's published this example so you can see how to use W&B to visualize training and keep track of your work. Feel free to add a visualization, click on graphs and data, and play with features. Your edits won't overwrite his work.

Project Description

Boris explored network architectures and training settings for character-based text generation. He compared RNNs, GRUs, and LSTMs with different depths, layer widths, and dropout. He also considered the training data length, the sequence length, and the number of sequences per batch.

RNN, LSTM, or GRU?

I ran three experiments to compare different types of recurrent networks: "vanilla" RNN, LSTM, GRU.

LSTM & GRU clearly outperform RNN.

GRU also seems to learn faster than LSTM, so it could be the preferred choice.

Time is also a major factor to consider when comparing these models. Click "edit" on the graphs below and change the x-axis to time. You can see that RNN is much faster than LSTM & GRU but unfortunately stops learning much earlier. Note also that I was running several experiments at the same time, which affected the run time of each individual run.
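
Architecturally, the comparison amounts to swapping the recurrent layer class while keeping everything else fixed. Below is a minimal PyTorch sketch of such a character-level model; this is an illustration, not the code behind these runs, and the class name and hyperparameter values are made up for the example.

```python
# Minimal sketch of a character-level model with a configurable cell type.
# Illustrative only -- the actual runs may use a different implementation.
import torch.nn as nn

RNN_TYPES = {"rnn": nn.RNN, "gru": nn.GRU, "lstm": nn.LSTM}

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size=256, n_layers=2,
                 rnn_type="gru", dropout=0.0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = RNN_TYPES[rnn_type](hidden_size, hidden_size,
                                       num_layers=n_layers,
                                       dropout=dropout, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden=None):
        # x: (batch, seq_len) of character indices
        out, hidden = self.rnn(self.embed(x), hidden)
        return self.head(out), hidden  # logits: (batch, seq_len, vocab_size)

# Comparing cell types is then just three runs with different `rnn_type`:
# for rnn_type in ("rnn", "gru", "lstm"):
#     model = CharRNN(vocab_size=80, rnn_type=rnn_type)
#     ...train and log the losses to W&B...
```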

Number of Layers

These runs were performed to compare how the number of layers affected the model.

While we see only minor differences in the training loss, the effect is clearly visible in the validation loss.

It is interesting to note that the "3 layers" architecture had a longer "warm-up" time before its loss started decreasing, though it may be a coincidence.
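
For reference, the depth comparison boils down to one W&B run per value of n_layers, with everything else held constant. The snippet below is a hypothetical launcher (project name, run names, and values are made up), reusing the CharRNN sketch from the first section.

```python
# Hypothetical launcher for the depth comparison: one W&B run per n_layers,
# everything else held constant. Names and values are illustrative.
import wandb

for n_layers in (1, 2, 3):
    run = wandb.init(project="char-rnn", name=f"{n_layers} layers",
                     config={"n_layers": n_layers, "hidden_size": 256,
                             "rnn_type": "gru", "dropout": 0.0},
                     reinit=True)
    model = CharRNN(vocab_size=80, hidden_size=256,
                    n_layers=n_layers, rnn_type="gru")
    # ...training loop, calling wandb.log({"loss": ..., "val_loss": ...})
    #    once per epoch so the curves can be compared in the report...
    run.finish()
```

The width comparison in the next section follows the same pattern, varying hidden_size instead of n_layers.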

Width of Each Layer

These runs were performed to compare how the width of each layer affected the model.

It seems that the width of the layers has a much larger effect than the depth of the network.

The Effect of Dropout

These runs were performed to evaluate the effect of dropout between each RNN layer.

We do not see any over-fitting even without dropout, meaning that the text used is large enough for the size of our network. Therefore, we won't use dropout unless we see evidence of over-fitting.
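
As a point of reference, "dropout between each RNN layer" corresponds to the dropout argument of PyTorch's stacked recurrent layers, which is applied to the output of every layer except the last. A brief sketch (variable names are illustrative):

```python
# Dropout between stacked recurrent layers: the `dropout` argument of
# nn.GRU / nn.LSTM applies to the outputs of every layer except the last.
import torch.nn as nn

rnn = nn.GRU(input_size=256, hidden_size=256, num_layers=3,
             dropout=0.1, batch_first=True)

# Explicit equivalent for two layers (hypothetical names):
layer1 = nn.GRU(256, 256, batch_first=True)
drop = nn.Dropout(p=0.1)
layer2 = nn.GRU(256, 256, batch_first=True)
# forward pass: h1, _ = layer1(x); h2, _ = layer2(drop(h1))
```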

Looking for the Best Model

Using the previous conclusions, we train for a longer time and try to get the best possible model.

We can see that a wider network performs better than a deeper network and that 2 layers may be enough.

The networks with no dropout may be over-fitting in this case (contrary to the previous section, where we had smaller networks), as the validation loss does not improve after a certain time and even increases in some cases. We retrain the best architectures with some dropout, which slightly improves the training and validation errors. A value of 0.1 seems sufficient, as too much dropout (0.3) reduces performance.

Note: It is interesting to see the peak in validation loss on one of the runs. This happens on some runs, though rarely. It is difficult to interpret but could be due to the introduction of "mistakes" on words with a high occurrence in the validation set. The fact that the training loss is averaged over each epoch prevents us from seeing potential sporadic peaks there. The peak is quickly corrected and does not appear again. Some peaks end up not being resolved, showing over-fitting; this happens especially with larger batch sizes (see the section "Number of Sequences per Batch"). Another theory is that it is due to poor initialization of the hidden states: they are zeroed and may then start in the middle of a sentence, leading to a "corrupted memory", as those cases may not have been seen during training and keep occurring the same way in the validation set.
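
The "corrupted memory" theory concerns how the hidden state is initialized for each batch. The sketch below contrasts re-zeroing the state at every batch (so a batch can start mid-sentence with no context) with carrying a detached state across batches; it is illustrative only, not the exact training loop used for these runs.

```python
# Illustrative training loop showing the two hidden-state strategies
# discussed above (not the exact code behind these runs).
import torch

def train_epoch(model, batches, optimizer, criterion, carry_state=False):
    hidden = None  # None => the recurrent layers start from zeroed states
    for x, y in batches:          # x, y: (batch, seq_len) LongTensors
        if not carry_state:
            hidden = None         # re-zero every batch: may start mid-sentence
        logits, hidden = model(x, hidden)
        loss = criterion(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if carry_state:
            # keep context across (contiguous) batches, but cut the graph
            hidden = (tuple(h.detach() for h in hidden)
                      if isinstance(hidden, tuple) else hidden.detach())
    return loss.item()
```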

Number of Characters in Each Sequence

These runs show the influence of the number of consecutive characters used in each sequence before backpropagation is performed.

Longer sequences provide better results, probably due to more stable weight updates. However, sequences of 100 characters are probably enough, as the improvements above that length are only minor and training time increases linearly with the number of characters per sequence.
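
For concreteness, here is a minimal sketch of how the text can be cut into training sequences of seq_len characters, with backpropagation truncated at the sequence boundary (function and variable names are made up for the example):

```python
# Sketch: cut the text into (input, target) sequences of seq_len characters;
# the target is the input shifted by one character. Backpropagation is
# truncated at the end of each sequence.
import torch

def make_sequences(text, char_to_idx, seq_len=100):
    data = torch.tensor([char_to_idx[c] for c in text], dtype=torch.long)
    n_seqs = (len(data) - 1) // seq_len
    data = data[: n_seqs * seq_len + 1]
    x = data[:-1].view(n_seqs, seq_len)   # inputs
    y = data[1:].view(n_seqs, seq_len)    # next-character targets
    return x, y
```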

Number of Sequences per Batch

These runs show the influence of the number of parallel sequences per batch used during training. These runs consider sequences of 100 characters based on the previous section.

While a larger batch size improves the results, we need to add significant dropout above a batch size of 8. With a batch size of 32, a dropout of 0.1 was not enough and we used 0.3 to avoid over-fitting (probably because there is not enough "noise" during training). So it may be better to limit the batch size to 8 (or 16 at most).
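
In the sketches above, the batch size simply controls how many of these sequences are processed in parallel per weight update; a minimal, assumed version:

```python
# Sketch: group sequences into batches of parallel sequences; batch_size is
# the number of sequences consumed per weight update.
from torch.utils.data import DataLoader, TensorDataset

def make_loader(x, y, batch_size=8):
    return DataLoader(TensorDataset(x, y), batch_size=batch_size,
                      shuffle=True, drop_last=True)

# e.g. for inputs, targets in make_loader(x, y, batch_size=8):
#          ...  # inputs, targets: (batch_size, seq_len)
```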

Text Length

These runs let us evaluate how the text length affects our model. They are named as follows:

Note: we increase the percentage of validation data in smaller texts to ensure the loss is relevant. Also, we normalize the losses based on the number of unique characters present in the text.
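
The report does not spell out the normalization formula; one plausible scheme (an assumption on my part) is to divide the per-character cross-entropy by the log of the number of unique characters, so that a uniform random guess scores 1.0 regardless of the alphabet size.

```python
# Assumed normalization (not stated explicitly in the report): divide the
# per-character cross-entropy (in nats) by log(V), where V is the number of
# unique characters, so a uniform random guess scores 1.0 for any alphabet.
import math

def normalized_loss(cross_entropy_nats, n_unique_chars):
    return cross_entropy_nats / math.log(n_unique_chars)
```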

As the text gets shorter, we need to introduce some dropout to prevent over-fitting; a small value may not be enough, but increasing it usually solves the issue. Nevertheless, when the text length varies significantly, it would make more sense to re-tune the entire model (or at least the layer width).

Also, we used different texts, and we can see that some seem easier to learn than others. For example, even when shortening the Internal Revenue Code (T1) from 24M to 2.6M characters, we obtain a lower validation loss than with The Count of Monte Cristo (also 2.6M characters).
