Attentive recurrent models can be painful to inspect. I wanted to understand how the structure of a learned attention mechanism interacts with gradient flow in sequential models, so I built the visualization below. Hovering over any point shows the strength of the connection between the selected time step and every other time step. This made it clear how my model leveraged attention to solve the denoise task, and how gradients flowed through the learned structure.
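The per-time-step connection strengths behind such a hover interaction can be sketched as an attention weight matrix. This is a minimal illustration, assuming simple dot-product attention over RNN hidden states; the actual model and scoring function in the visualization may differ.

```python
import numpy as np

def attention_matrix(hidden_states):
    """Pairwise attention weights from dot-product scores.

    hidden_states: (T, d) array of hidden states, one per time step
    (hypothetical setup). Returns a (T, T) matrix whose row t holds the
    softmax-normalized strength of every time step's connection to step t.
    """
    scores = hidden_states @ hidden_states.T          # (T, T) dot-product scores
    scores -= scores.max(axis=1, keepdims=True)       # stabilize the softmax
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

# Toy example: 5 time steps with 4-dimensional hidden states.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))
W = attention_matrix(H)

# Hovering over time step t=2 would highlight row W[2]: the connection
# strength between step 2 and every time step, summing to 1.
hovered = W[2]
```

Each row of `W` is exactly the quantity the hover reveals: a distribution over time steps showing where the model attends from the selected step.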