
Attention Alignment Experiments

Pre-softmax attention alignment for VQA
Created on September 7 | Last edited on September 7

Overview

The t-syngcn-sg-glovegcn-linear-fusion-1 and t-syngat-sg-glovegat-linear-fusion-0 models in the charts below use a naive 2-layer MLP to fuse information from a question-embedding GCN/GAT (with edges defined by dependency-parser outputs) and a ground-truth scene-graph GCN/GAT. The aligned variants use an identical GCN/GAT backbone but add an attention alignment layer: they first compute self-attention over all question node embeddings, and then apply transformer-style KQV attention between the aligned question features and the scene-graph node embeddings, with the aligned question features as the query and the scene-graph embeddings as the keys and values.
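
For reference, a minimal sketch of the baseline linear-fusion head is shown below. The hidden size, pooling choice, and answer-vocabulary size are placeholder assumptions, not the exact experimental configuration.

```python
import torch
import torch.nn as nn

class LinearFusionHead(nn.Module):
    """Naive 2-layer MLP fusion over pooled question-graph and scene-graph embeddings."""
    def __init__(self, q_dim=300, sg_dim=300, hidden=512, num_answers=1000):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(q_dim + sg_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, q_nodes, sg_nodes):
        # q_nodes: (num_question_nodes, q_dim) from the question GCN/GAT
        # sg_nodes: (num_scene_graph_nodes, sg_dim) from the scene-graph GCN/GAT
        q = q_nodes.mean(dim=0)    # pool question node embeddings
        sg = sg_nodes.mean(dim=0)  # pool scene-graph node embeddings
        return self.mlp(torch.cat([q, sg], dim=-1))
```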

The main premise behind the attention alignment layer is that we want to first determine which parts of the question are important via scaled dot-product self-attention, and then determine which scene graph features are relevant to the important question words.
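
A minimal sketch of this alignment step is below: self-attention over the question nodes, then KQV attention with the aligned question features as the query and the scene-graph node embeddings as keys and values. The single-head formulation, projection sizes, and the decision to leave the self-attention step unprojected are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn

class AttentionAlignment(nn.Module):
    """Sketch: self-attention over question nodes, then cross-attention into the scene graph."""
    def __init__(self, dim=300):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = math.sqrt(dim)

    def forward(self, q_nodes, sg_nodes):
        # 1) Scaled dot-product self-attention over question node embeddings:
        #    determines which question words are important.
        attn = torch.softmax(q_nodes @ q_nodes.T / self.scale, dim=-1)
        q_aligned = attn @ q_nodes  # (num_question_nodes, dim)

        # 2) Transformer-style KQV attention: aligned question features as queries,
        #    scene-graph node embeddings as keys and values.
        Q = self.q_proj(q_aligned)
        K = self.k_proj(sg_nodes)
        V = self.v_proj(sg_nodes)
        cross = torch.softmax(Q @ K.T / self.scale, dim=-1)
        return cross @ V  # scene-graph features relevant to the important question words
```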




[Line charts for the selected run set: epoch, train/accuracy, and train/loss.]


Discussion

Interestingly, we see a large improvement for the GCN-based model but only a very slight improvement for the GAT-based model. Since the GAT's convolution layers already attend over their neighborhoods, the extra alignment layer adds little there, while the GCN, which has no built-in attention, benefits substantially. This suggests that the attention alignment layer does not help perform the VQA reasoning steps themselves, but rather lets the model determine which parts of the embedded data are important to the question at hand.

It would be worth experimenting with incorporating a similar attention operation between each graph-convolution layer, perhaps even bi-directionally to allow information to flow between the question and scene graphs, so that the scene-graph embedding can be updated according to the relevant parts of the question. Regular GAT convolution layers would then act as the self-attention step.
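
A rough sketch of one such interleaved block is given below, under stated assumptions: it uses PyTorch Geometric's GATConv for the per-graph convolutions and nn.MultiheadAttention for the bi-directional cross-attention; the dimensions and head counts are placeholders, and this is a proposal rather than a tested architecture.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv

class InterleavedCrossAttentionBlock(nn.Module):
    """One block: GAT convolution on each graph (the self-attention step),
    then bi-directional cross-attention between question and scene-graph nodes."""
    def __init__(self, dim=300, heads=4):
        super().__init__()
        self.q_conv = GATConv(dim, dim)
        self.sg_conv = GATConv(dim, dim)
        self.q_to_sg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sg_to_q = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q_x, q_edge_index, sg_x, sg_edge_index):
        # Per-graph GAT convolutions act as the self-attention step.
        q_x = self.q_conv(q_x, q_edge_index).relu()
        sg_x = self.sg_conv(sg_x, sg_edge_index).relu()

        # Bi-directional cross-attention (batch dim of 1 added for MultiheadAttention).
        q_b, sg_b = q_x.unsqueeze(0), sg_x.unsqueeze(0)
        sg_upd, _ = self.q_to_sg(sg_b, q_b, q_b)  # scene graph attends to question
        q_upd, _ = self.sg_to_q(q_b, sg_b, sg_b)  # question attends to scene graph
        return q_upd.squeeze(0), sg_upd.squeeze(0)
```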