Dual GCN Experiments
Object-semantic GCNs
Smaller is Better
Though the runs are not yet complete, these curves clearly show that the model with a smaller object-semantic GCN performs better. The semantic relationships are extracted using a relation proposal network (RelPN) as in the paper "Graph R-CNN for Scene Graph Generation". No bounding boxes are used and no NMS is applied in these experiments; relations are proposed based purely on the class distribution of each object in the scene. These experiments use ground-truth GQA-train and GQA-val scene graphs.
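The class-distribution-only relation proposal can be sketched as follows. This is a minimal numpy sketch, not the actual implementation: the two projection matrices stand in for RelPN's learned subject/object embeddings, and all shapes and the top-k cutoff are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relpn_scores(class_dists, W_subj, W_obj):
    """Score every ordered object pair from class distributions alone,
    in the spirit of Graph R-CNN's RelPN: project each object's class
    distribution with two learned maps and take pairwise dot products.
    No boxes and no NMS are involved."""
    subj = class_dists @ W_subj   # (N, d) subject-role embeddings
    obj = class_dists @ W_obj     # (N, d) object-role embeddings
    return subj @ obj.T           # (N, N) relatedness scores

# hypothetical sizes: 5 objects, 10 classes, embedding dim 8
num_objects, num_classes, d = 5, 10, 8
class_dists = rng.random((num_objects, num_classes))
class_dists /= class_dists.sum(axis=1, keepdims=True)  # normalise to distributions
W_subj = rng.standard_normal((num_classes, d))         # placeholder weights
W_obj = rng.standard_normal((num_classes, d))

scores = relpn_scores(class_dists, W_subj, W_obj)
# keep the top-k scoring pairs per subject as proposed relations
k = 2
proposed = np.argsort(-scores, axis=1)[:, :k]
```

In the real model the projection matrices are trained jointly with the rest of the network; here they are random placeholders so the score computation itself is visible.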
The object-semantic GCN aims to capture statistical priors present in the dataset, specifically those related to the interactions between certain classes of objects, e.g. food is generally related to plate, but less so to car. The worst of the experiments below uses a GCN with four layers, whilst the best has just one layer, capturing only first-order class interactions.
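The first-order-only behaviour of the one-layer variant can be seen in a small sketch. This assumes a standard Kipf & Welling GCN layer (symmetrically normalised adjacency with self-loops); the toy graph, one-hot features, and constant weights are illustrative, not the experiment's actual setup.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN layer: add self-loops, symmetrically normalise the
    adjacency, then apply a linear transform and ReLU. A single layer
    mixes each node only with its first-order neighbours."""
    A_hat = A + np.eye(A.shape[0])                      # self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)

# toy class-interaction graph: food <-> plate related, car isolated
A = np.array([[0., 1., 0.],
              [1., 0., 0.],
              [0., 0., 0.]])
X = np.eye(3)               # one-hot class features
W = np.full((3, 4), 0.5)    # placeholder weights
H = gcn_layer(A, X, W)      # (3, 4) node embeddings
```

With one layer, the isolated "car" node only sees its own feature, while "food" and "plate" mix with each other; stacking four such layers would instead propagate information along paths of length four, diluting these direct class-pair priors.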
Limitations
Naturally, a naive linear fusion layer will not suffice as the primary form of information fusion for multi-modal tasks. Attention is next in the pipeline for information sharing between graphs.
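The contrast between the two fusion styles can be sketched as follows. This is a hypothetical numpy illustration: the function names, feature shapes, and the particular scaled dot-product form of attention are assumptions, not the planned implementation.

```python
import numpy as np

def linear_fusion(h_vis, h_sem, W):
    """Naive fusion: concatenate corresponding node features from the
    two graphs and mix them with a single linear layer."""
    return np.concatenate([h_vis, h_sem], axis=1) @ W

def attention_fusion(h_vis, h_sem):
    """Cross-graph attention sketch: each visual node attends over the
    semantic nodes and receives an attention-weighted sum of them."""
    scores = h_vis @ h_sem.T / np.sqrt(h_sem.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over semantic nodes
    return weights @ h_sem

rng = np.random.default_rng(0)
h_vis = rng.standard_normal((4, 6))   # 4 nodes, 6-d visual features
h_sem = rng.standard_normal((4, 6))   # aligned semantic features
W = rng.standard_normal((12, 6))      # placeholder fusion weights

fused_linear = linear_fusion(h_vis, h_sem, W)   # (4, 6)
fused_attn = attention_fusion(h_vis, h_sem)     # (4, 6)
```

The linear variant treats the two graphs' features symmetrically and statically; the attention variant lets each node decide, per input, how much of the other graph to pull in, which is the motivation for moving to attention.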