
Gradients and Attention


Introduction

In this report I illustrate how attention mechanisms enable learning long-term dependencies, using the copy task and the denoise task. For an introduction to these tasks, see my previous report. There, I showed the different ways attention is leveraged to solve problems by providing dynamic skip connections across long time spans. Here, I'll show the impact of these connections on the gradient using ablation studies, and I'll provide new dynamic visualizations that help reveal the learned structure.
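
Concretely, attention forms each step's context as a weighted sum over earlier hidden states, so the gradient can flow directly from a target step back to any source step it attends to, rather than through every intermediate recurrent step. Below is a minimal sketch of dot-product attention as such a dynamic skip connection; the function and variable names are illustrative, not taken from the report's codebase.

import torch
import torch.nn.functional as F

def attend(h_t, past_states):
    # h_t:         [d]    current hidden state (the query)
    # past_states: [s, d] hidden states of earlier steps
    scores = past_states @ h_t          # [s] similarity to each past step
    weights = F.softmax(scores, dim=0)  # [s] attention distribution
    context = weights @ past_states     # [d] weighted sum: a learned skip connection
    return context, weights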



Copy Task

In the loss and accuracy plots below, I show a model trained to perform extremely well on the copy task. We'll look at the difference in gradient propagation between the beginning and the end of training.
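
As a quick refresher, the copy task asks the model to reproduce a short sequence of symbols after a long stretch of blanks. A minimal sketch of the data generation follows; the sequence length, blank length, and vocabulary size here are placeholder assumptions, not the exact configuration from the previous report.

import numpy as np

def copy_task_batch(batch_size=32, seq_len=10, blank_len=100, n_symbols=8):
    BLANK, DELIM = n_symbols, n_symbols + 1
    symbols = np.random.randint(0, n_symbols, size=(batch_size, seq_len))
    # Input: the symbols, a long stretch of blanks, a delimiter, then
    # blanks while the model emits its copy.
    inputs = np.concatenate([
        symbols,
        np.full((batch_size, blank_len - 1), BLANK),
        np.full((batch_size, 1), DELIM),
        np.full((batch_size, seq_len), BLANK),
    ], axis=1)
    # Target: blank everywhere except the last seq_len steps, which
    # must reproduce the original symbols.
    targets = np.concatenate([
        np.full((batch_size, seq_len + blank_len), BLANK),
        symbols,
    ], axis=1)
    return inputs, targets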




[Plots: accuracy (0 to 1) and loss (log scale) vs. training step over ~8k steps; run set: 5 runs]


Denoise Task




[Denoise task visualization; run set: 1 run]


Remarks

I believe these new visualizations gave me further insight into how attention is leveraged on tasks with long-term dependencies. The learned structure is not obvious, even in simple cases, and improved visualizations can lead to a better understanding of how these mechanisms work.

These visualizations can be added to your own projects by logging the following information as a wandb.Table:

import wandb

# Log one row per (source_step, target_step) pair.
data = [source_step, target_step, gradient_norm, attention_strength]
data_table = wandb.Table(columns=['source_step', 'target_step', 'grad', 'attn'])
data_table.add_data(*data)
wandb.log({"Grad/Attn Visualization {}".format(step): data_table})

Next, add vega2 to your Weights & Biases profile bio.

Finally, in your workspace, you should now be able to create a vega2 panel.