
Gradients and Attention


Introduction

In this report I illustrate how attention mechanisms enable learning long-term dependencies, using the copy task and the denoise task. For an introduction to these tasks, see my previous report. There, I showed the different ways attention is leveraged to solve problems by providing dynamic skip connections across long time spans. Here, I'll show the impact of these connections on the gradient using ablation studies, and I'll provide new dynamic visualizations that help reveal the learned structure.
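
Concretely, attention forms each step's context as a weighted sum over earlier hidden states, so the gradient can flow directly from a target step back to any source step it attends to, rather than through every intermediate recurrent step. Below is a minimal sketch of dot-product attention as such a dynamic skip connection; the function and variable names are illustrative, not taken from the report's codebase.

import torch
import torch.nn.functional as F

def attend(h_t, past_states):
    # h_t:         [d]    current hidden state (the query)
    # past_states: [s, d] hidden states of earlier steps
    scores = past_states @ h_t          # [s] similarity to each past step
    weights = F.softmax(scores, dim=0)  # [s] attention distribution
    context = weights @ past_states     # [d] weighted sum: a learned skip connection
    return context, weights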



Copy Task

In the loss and accuracy plots below, I show a model trained to perform extremely well on the copy task. We'll look at the difference in gradient propagation between the beginning and the end of training.
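
As a quick refresher, the copy task asks the model to reproduce a short sequence of symbols after a long stretch of blanks. A minimal sketch of the data generation follows; the sequence length, blank length, and vocabulary size here are placeholder assumptions, not the exact configuration from the previous report.

import numpy as np

def copy_task_batch(batch_size=32, seq_len=10, blank_len=100, n_symbols=8):
    BLANK, DELIM = n_symbols, n_symbols + 1
    symbols = np.random.randint(0, n_symbols, size=(batch_size, seq_len))
    # Input: the symbols, a long stretch of blanks, a delimiter, then
    # blanks while the model emits its copy.
    inputs = np.concatenate([
        symbols,
        np.full((batch_size, blank_len - 1), BLANK),
        np.full((batch_size, 1), DELIM),
        np.full((batch_size, seq_len), BLANK),
    ], axis=1)
    # Target: blank everywhere except the last seq_len steps, which
    # must reproduce the original symbols.
    targets = np.concatenate([
        np.full((batch_size, seq_len + blank_len), BLANK),
        symbols,
    ], axis=1)
    return inputs, targets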




[Plots: accuracy (0 to 1) and loss (log scale) vs. training step over ~8k steps; run set: 5 runs]


Denoise Task




[Denoise task visualization; run set: 1 run]


Remarks

I believe these new visualizations gave me further insight into how attention is leveraged on tasks with long-term dependencies. The learned structure is not obvious, even in simple cases, and improved visualizations can lead to a better understanding of how these mechanisms work.

These visualizations can be added to your own projects by logging the following information as a wandb.Table:

import wandb

# Log one row per (source_step, target_step) pair.
data = [source_step, target_step, gradient_norm, attention_strength]
data_table = wandb.Table(columns=['source_step', 'target_step', 'grad', 'attn'])
data_table.add_data(*data)
wandb.log({"Grad/Attn Visualization {}".format(step): data_table})

Next, add vega2 to your Weights & Biases profile bio.

Finally, in your workspace, you should now be able to create a vega2 panel.