
Dependency Parser GCN

Initial tests of a 3-layer GCN with layer sizes (300, 600, 1200, 1878)
Created on August 23 | Last edited on August 27

Method and Results

All models were trained on GQA-train-balanced and evaluated on GQA-val-balanced. Every model shares the same underlying structure: it takes the dependency graph of each question as input and uses 300-dim GloVe embeddings as the features for each node. No images or image-derived data were used, so we would generally expect poor performance; even so, accuracy falls 8% short of the global prior results reported in Hudson and Manning's GQA paper - it's almost impressive that the model is this bad, given the current successes of dependency parsing in the VQA field.
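
To make the setup concrete, here is a minimal sketch of such a model, assuming PyTorch Geometric. The class name, the mean-pooling readout, and treating the final 1878-dim layer as logits over the GQA answer vocabulary are my assumptions, not details taken from the actual code.

```python
# Hypothetical sketch of the 3-layer GCN described above (assumes PyTorch Geometric).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class QuestionGCN(torch.nn.Module):
    def __init__(self, sizes=(300, 600, 1200, 1878)):
        super().__init__()
        self.conv1 = GCNConv(sizes[0], sizes[1])
        self.conv2 = GCNConv(sizes[1], sizes[2])
        self.conv3 = GCNConv(sizes[2], sizes[3])

    def forward(self, x, edge_index, batch):
        # x: [num_nodes, 300] GloVe vectors for the question tokens
        # edge_index: dependency arcs between tokens
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        x = self.conv3(x, edge_index)
        # Pool node states into one vector per question; read it as answer logits
        return global_mean_pool(x, batch)
```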

Note that the graphs below log loss at each logging step, but accumulate correct/total counts over each epoch, hence the plotted accuracy jumps at the start of each epoch when the counts reset.
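
The logging scheme might look roughly like the sketch below (assumed for illustration, not taken from the actual training script): loss is logged at every step, while the correct/total counts keep accumulating until the epoch ends.

```python
# Illustrative per-epoch training loop with step-wise loss and running accuracy.
import wandb

def train_epoch(model, loader, optimiser, criterion):
    correct, total = 0, 0
    for batch in loader:
        optimiser.zero_grad()
        logits = model(batch.x, batch.edge_index, batch.batch)
        loss = criterion(logits, batch.y)
        loss.backward()
        optimiser.step()

        correct += (logits.argmax(dim=-1) == batch.y).sum().item()
        total += batch.y.size(0)
        # Loss is per-step; accuracy is the running average since the epoch began
        wandb.log({"train/loss": loss.item(), "train/acc": correct / total})
```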




[Line chart: train/loss across the run set]


Discussion

Despite the overall poor performance of the models, I learned the importance of weight decay for GCNs in a text-processing context, which will come in handy when optimising parameters later, as well as a few techniques for parameter optimisation and performance visualisation.

The next obvious steps are to:

  • Investigate the effects of using GATs instead of GCNs (see the sketch after this list)
  • Investigate how the layer structure of GCNs affects performance
  • Try learning embeddings for the GCN
  • Try a different pre-trained dependency parser; the EWT-trained dependency parser may not cover some vocabulary, and inspecting samples manually shows its output is sometimes a bit off.
  • Incorporate image signals into the training data to address the true VQA challenge of multimodal fusion.
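
For the first point, GATConv is close to a drop-in replacement for GCNConv in PyTorch Geometric, so the change would be mostly mechanical. This is a hypothetical sketch: the head count and the concat=False on the final layer (to keep the output at the answer-vocabulary size) are illustrative choices, not settings from the experiments.

```python
# Hypothetical GAT variant of the question model (assumes PyTorch Geometric).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv, global_mean_pool

class QuestionGAT(torch.nn.Module):
    def __init__(self, sizes=(300, 600, 1200, 1878), heads=4):
        super().__init__()
        # With concat=True (the default), each layer's output width is out_channels * heads
        self.conv1 = GATConv(sizes[0], sizes[1], heads=heads)
        self.conv2 = GATConv(sizes[1] * heads, sizes[2], heads=heads)
        self.conv3 = GATConv(sizes[2] * heads, sizes[3], heads=1, concat=False)

    def forward(self, x, edge_index, batch):
        x = F.elu(self.conv1(x, edge_index))
        x = F.elu(self.conv2(x, edge_index))
        x = self.conv3(x, edge_index)
        return global_mean_pool(x, batch)
```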





Experimenting with a larger GCN of size (300, 600, 900, 1200, 1500, 1843), I found that the existing models weren't actually big enough to capture the complexity of the problem at hand. This is clearly shown in the graph below, which compares the best smaller-GCN run with the larger GCN. Note that the two runs were performed on different machines; with that caveat, we can expect the larger GCN to take about 1.6-1.7 times longer to train than the smaller ones.
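
One way to build these deeper variants without hand-writing each layer is to parameterise the model over a list of sizes. This is a hypothetical helper along the lines of the earlier sketch, not the code used for these runs.

```python
# Hypothetical depth-parameterised GCN (assumes PyTorch Geometric).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class DeepQuestionGCN(torch.nn.Module):
    def __init__(self, sizes=(300, 600, 900, 1200, 1500, 1843)):
        super().__init__()
        # One GCNConv per consecutive pair of sizes, e.g. 300->600, 600->900, ...
        self.convs = torch.nn.ModuleList(
            GCNConv(d_in, d_out) for d_in, d_out in zip(sizes[:-1], sizes[1:])
        )

    def forward(self, x, edge_index, batch):
        for conv in self.convs[:-1]:
            x = F.relu(conv(x, edge_index))
        x = self.convs[-1](x, edge_index)
        return global_mean_pool(x, batch)
```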




[Line chart: best smaller GCN run vs. larger GCN]