
Dependency Parser GCN

Initial tests of a 3-layer GCN with layer sizes (300, 600, 1200, 1878)
Created on August 23 | Last edited on August 27

Method and Results

All models were trained on GQA-train-balanced and evaluated on GQA-val-balanced. Every model shares the same underlying structure: it takes the dependency graph of each question as input and uses 300-dim GloVe embeddings as the features for each node. No images or image-derived data were used, so we would generally expect poor performance; even so, accuracy falls 8% short of the global prior results reported in Hudson and Manning's GQA paper - it's almost impressive that the model is this bad, given the current successes of dependency parsing in the VQA field.
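
To make the setup concrete, here is a minimal sketch of such a model, assuming PyTorch Geometric. The class name, the mean-pooling readout, and treating the final 1878-dim layer as logits over the GQA answer vocabulary are my assumptions, not details taken from the actual code.

```python
# Hypothetical sketch of the 3-layer GCN described above (assumes PyTorch Geometric).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class QuestionGCN(torch.nn.Module):
    def __init__(self, sizes=(300, 600, 1200, 1878)):
        super().__init__()
        self.conv1 = GCNConv(sizes[0], sizes[1])
        self.conv2 = GCNConv(sizes[1], sizes[2])
        self.conv3 = GCNConv(sizes[2], sizes[3])

    def forward(self, x, edge_index, batch):
        # x: [num_nodes, 300] GloVe vectors for the question tokens
        # edge_index: dependency arcs between tokens
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        x = self.conv3(x, edge_index)
        # Pool node states into one vector per question; read it as answer logits
        return global_mean_pool(x, batch)
```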

Note that the graphs below log loss at each logging step, but accumulate correct/total counts over each epoch, hence the plotted accuracy jumps at the start of each epoch when the counts reset.
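
The logging scheme might look roughly like the sketch below (assumed for illustration, not taken from the actual training script): loss is logged at every step, while the correct/total counts keep accumulating until the epoch ends.

```python
# Illustrative per-epoch training loop with step-wise loss and running accuracy.
import wandb

def train_epoch(model, loader, optimiser, criterion):
    correct, total = 0, 0
    for batch in loader:
        optimiser.zero_grad()
        logits = model(batch.x, batch.edge_index, batch.batch)
        loss = criterion(logits, batch.y)
        loss.backward()
        optimiser.step()

        correct += (logits.argmax(dim=-1) == batch.y).sum().item()
        total += batch.y.size(0)
        # Loss is per-step; accuracy is the running average since the epoch began
        wandb.log({"train/loss": loss.item(), "train/acc": correct / total})
```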




[Line chart: train/loss across the run set]


Discussion

Despite the overall poor performance of the models, I learned the importance of weight decay for GCNs in a text-processing context, which will come in handy when optimising parameters later, as well as a few techniques for parameter optimisation and performance visualisation.

The next obvious steps are to:

  • Investigate the effects of using GATs instead of GCNs (see the sketch after this list)
  • Investigate how the layer structure of GCNs affects performance
  • Try learning embeddings for the GCN
  • Try a different pre-trained dependency parser; the EWT-trained dependency parser may not cover some vocabulary, and inspecting samples manually shows its output is sometimes a bit off.
  • Incorporate image signals into the training data to address the true VQA challenge of multimodal fusion.
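
For the first point, GATConv is close to a drop-in replacement for GCNConv in PyTorch Geometric, so the change would be mostly mechanical. This is a hypothetical sketch: the head count and the concat=False on the final layer (to keep the output at the answer-vocabulary size) are illustrative choices, not settings from the experiments.

```python
# Hypothetical GAT variant of the question model (assumes PyTorch Geometric).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_mean_pool

class QuestionGAT(torch.nn.Module):
    def __init__(self, sizes=(300, 600, 1200, 1878), heads=4):
        super().__init__()
        # With concat=True (the default), each layer's output width is out_channels * heads
        self.conv1 = GATConv(sizes[0], sizes[1], heads=heads)
        self.conv2 = GATConv(sizes[1] * heads, sizes[2], heads=heads)
        self.conv3 = GATConv(sizes[2] * heads, sizes[3], heads=1, concat=False)

    def forward(self, x, edge_index, batch):
        x = F.elu(self.conv1(x, edge_index))
        x = F.elu(self.conv2(x, edge_index))
        x = self.conv3(x, edge_index)
        return global_mean_pool(x, batch)
```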





Experimenting with a larger GCN of size (300, 600, 900, 1200, 1500, 1843), I found that the existing models weren't actually big enough to capture the complexity of the problem at hand. This is clearly shown in the graph below, which compares the best smaller-GCN run with the larger GCN. Note that the two runs were performed on different machines; with that caveat, we can expect the larger GCN to take about 1.6-1.7 times longer to train than the smaller ones.
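
One way to build these deeper variants without hand-writing each layer is to parameterise the model over a list of sizes. This is a hypothetical helper along the lines of the earlier sketch, not the code used for these runs.

```python
# Hypothetical depth-parameterised GCN (assumes PyTorch Geometric).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class DeepQuestionGCN(torch.nn.Module):
    def __init__(self, sizes=(300, 600, 900, 1200, 1500, 1843)):
        super().__init__()
        # One GCNConv per consecutive pair of sizes, e.g. 300->600, 600->900, ...
        self.convs = torch.nn.ModuleList(
            GCNConv(d_in, d_out) for d_in, d_out in zip(sizes[:-1], sizes[1:])
        )

    def forward(self, x, edge_index, batch):
        for conv in self.convs[:-1]:
            x = F.relu(conv(x, edge_index))
        x = self.convs[-1](x, edge_index)
        return global_mean_pool(x, batch)
```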




[Line chart: best smaller GCN run vs. larger GCN]