Part 2 – Comparing Message-Passing-Based GNN Architectures
In this article, we'll review various message-passing-based graph neural network (GNN) architectures and compare them using Sweeps by Weights & Biases.
Before going through this report, I suggest the readers check out Part 1 – Intro to Graph Neural Networks with GatedGCN, where we cover the motivation behind graph neural networks (GNNs) and the various problem statements for which GNNs can be useful.
The first section of this report describes the training pipeline of a message-passing-based GNN. Next, we review and compare various message-passing-based GNN architectures such as the vanilla GCN [2], GraphSage [3], MoNet [4], GAT [5], and GatedGCN [6]. Finally, we use Sweeps by Weights & Biases to train and compare these architectures on the node classification task.
Table of Contents
- Training Pipeline
  - Input Layer
  - L x GNN Layer
  - Prediction Layer
- Message Passing based GNNs
  - Vanilla Graph Convolutional Network (GCN)
  - GraphSage
  - Graph Attention Network (GAT)
  - MoNet
  - GatedGCN
- Experiments
- Conclusion
- References
Training Pipeline

Figure 1: Summary of the training pipeline[1]
Input Layer
The input layer consists of the following elements:
- Node features $\alpha_i$ for each node $i$. These features are embedded into $d$-dimensional hidden features $h_i^{0}$ via a simple linear projection, $h_i^{0} = U^0\,\alpha_i + u^0$.
- (Optional) Edge features $\beta_{ij}$ for each edge connecting node $i$ and node $j$. Similar to the node features, these are embedded into $d$-dimensional hidden features $e_{ij}^{0} = V^0\,\beta_{ij} + v^0$ (a minimal sketch of this embedding step follows the list).
- Lastly, the graph structure itself which maintains the connections amongst the nodes.
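As a concrete illustration, below is a minimal sketch of the input embedding step in PyTorch. This is not the authors' exact module; the class and argument names are ours, and the point is simply that raw node (and optional edge) features are linearly projected into the $d$-dimensional hidden space.

```python
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Minimal sketch (not the authors' exact code): project raw node features,
    and optionally edge features, to d-dimensional hidden features."""
    def __init__(self, node_in_dim, hidden_dim, edge_in_dim=None):
        super().__init__()
        self.node_proj = nn.Linear(node_in_dim, hidden_dim)
        self.edge_proj = nn.Linear(edge_in_dim, hidden_dim) if edge_in_dim else None

    def forward(self, node_feats, edge_feats=None):
        h = self.node_proj(node_feats)        # h_i^0 = U^0 alpha_i + u^0
        e = None
        if self.edge_proj is not None and edge_feats is not None:
            e = self.edge_proj(edge_feats)    # e_ij^0 = V^0 beta_ij + v^0
        return h, e
```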
L x GNN Layer
As seen in the figure, this segment consists of an L-layer graph neural network and encapsulates the hidden layers of the GNN. An L-layer deep network corresponds to L-hop neighborhood aggregation across the entire graph. Since this is an iterative process, it can be visualized as a message-passing mechanism where each node receives updates from all its neighbors, which in turn receive updates from their neighbors.
The updated feature vector for node $i$ is simply a function of its previous feature vector and the feature vectors of all its neighbors, as shown in the equation below:

$$h_i^{\ell+1} = f\left(h_i^{\ell},\ \left\{ h_j^{\ell} : j \in \mathcal{N}_i \right\}\right)$$
It is interesting to see in Figure 1 how these intermediate layers can themselves be visualized as intermediate graphs, highlighting their ability to capture the graph structure. Since this component forms the crux of the training pipeline, we will look at the detailed GNN layer of the various architectures in the subsequent sections.
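To make the L-hop idea concrete, here is a minimal sketch of the "L x GNN layer" block operating on a dense adjacency matrix. The names (`MessagePassingStack`, `gnn_layer_cls`) are our own, and `gnn_layer_cls` stands for any of the GNN layers discussed in the next section; real implementations typically use sparse graph operations rather than dense matrices.

```python
import torch.nn as nn

class MessagePassingStack(nn.Module):
    """Sketch of the 'L x GNN layer' block: L stacked message-passing layers.
    Each layer aggregates 1-hop neighbors, so L layers cover the L-hop neighborhood."""
    def __init__(self, hidden_dim, n_layers, gnn_layer_cls):
        super().__init__()
        self.layers = nn.ModuleList([gnn_layer_cls(hidden_dim) for _ in range(n_layers)])

    def forward(self, h, adj):            # h: embedded node features from the input layer
        for layer in self.layers:         # h_i^{l+1} = f(h_i^l, {h_j^l : j in N(i)})
            h = layer(h, adj)
        return h                          # node embeddings for the prediction layer
```

For example, `MessagePassingStack(hidden_dim=64, n_layers=4, gnn_layer_cls=VanillaGCNLayer)` would stack four of the GCN layers sketched later in this report.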
Prediction Layer
In the prediction layer, we utilize the GNN-based node embeddings to predict the task-specific outcome. We design the loss function and apply gradient descent to improve the task-specific prediction. This helps the GNN learn node embeddings that are more discriminative for the task at hand.
Following is a summary of the prediction layer across the various tasks; a minimal sketch of the node classification head appears after the list.
- Graph Classification: Apply a Multi-Layer Perceptron (MLP) on the average of all node embeddings. Finally, we minimize the cross-entropy loss between the obtained logits and the ground truth.
- Graph Regression: Similar to graph classification, we apply MLP on the average of all node embeddings. Finally, we minimize the L1-loss between the predicted score and the ground truth score.
- Node Classification: For this task, we independently pass each node’s embeddings to an MLP and minimize the cross-entropy loss between obtained logits and the ground truth.
- Edge Classification: To predict an edge between two nodes, we concatenate the two nodes' embeddings and pass them through an MLP. Finally, we minimize the cross-entropy loss between the logits and the ground truth.
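Since node classification is the task used in the experiments later, here is a minimal sketch of that prediction head. The class name and MLP shape are our own choices, not the exact head used in [1]; the key point is that each node embedding is scored independently and trained with cross-entropy.

```python
import torch.nn as nn
import torch.nn.functional as F

class NodeClassificationHead(nn.Module):
    """Sketch of the node classification prediction layer: an MLP applied
    independently to every node embedding."""
    def __init__(self, hidden_dim, n_classes):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, h):              # h: (num_nodes, hidden_dim)
        return self.mlp(h)             # logits: (num_nodes, n_classes)

def node_classification_loss(logits, labels):
    """Cross-entropy between per-node logits and ground-truth class labels."""
    return F.cross_entropy(logits, labels)
```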
Message Passing based GNNs
In this section, we look at the GNN layer of the various Message Passing based GNNs and contrast these architectures.
Vanilla Graph Convolutional Network (GCN)
The vanilla GCN [2] introduced a semi-supervised learning approach for graph-structured data. It uses shared parameters to perform an efficient layer-wise update of the embeddings. This update is based on a first-order approximation of spectral convolutions on the graph, where spectral convolution refers to operating in the eigenbasis of the graph Laplacian [7]. One of the most important limitations of the vanilla GCN is that it does not incorporate edge features. The update rule is mathematically formulated as follows:
$$h_i^{\ell+1} = \mathrm{ReLU}\left( W^{\ell}\, \frac{1}{d_i} \sum_{j \in \mathcal{N}_i} h_j^{\ell} \right)$$

where $W^{\ell}$ is the weight matrix (bias ignored for clarity) and $d_i$ is the degree of node $i$.
Notice that the update above does not incorporate node $i$'s own features from the previous layer; this can still be enforced using a skip connection.
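The update above can be sketched in a few lines of PyTorch on a dense adjacency matrix. This is our own simplified version, without self-loops or bias, and without the sparse operations a real implementation would use.

```python
import torch
import torch.nn as nn

class VanillaGCNLayer(nn.Module):
    """Sketch of the vanilla GCN update: degree-normalized neighbor aggregation
    followed by a shared linear projection and ReLU."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)

    def forward(self, h, adj):                        # h: (N, d), adj: (N, N)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh_mean = (adj @ h) / deg                  # (1 / d_i) * sum_j h_j
        return torch.relu(self.W(neigh_mean))
```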
GraphSage
GraphSage [3] builds upon the vanilla GCN by incorporating each node's own features from the previous layer during the update. The update equation is as follows:
$$\hat{h}_i^{\ell+1} = \mathrm{ReLU}\left( W^{\ell}\, \mathrm{Concat}\left( h_i^{\ell},\ \frac{1}{d_i} \sum_{j \in \mathcal{N}_i} h_j^{\ell} \right) \right)$$

The updated feature $h_i^{\ell+1}$ is simply the $\ell_2$-normalization of $\hat{h}_i^{\ell+1}$, i.e. $h_i^{\ell+1} = \hat{h}_i^{\ell+1} / \lVert \hat{h}_i^{\ell+1} \rVert_2$.
Hamilton et al. [3] also propose two more advanced neighborhood aggregation methods:
- Max-pooling: $\hat{h}_i^{\ell+1} = \mathrm{ReLU}\left( W^{\ell}\, \mathrm{Concat}\left( h_i^{\ell},\ \max_{j \in \mathcal{N}_i} \mathrm{ReLU}\left( V^{\ell} h_j^{\ell} \right) \right) \right)$
- LSTM: $\hat{h}_i^{\ell+1} = \mathrm{ReLU}\left( W^{\ell}\, \mathrm{Concat}\left( h_i^{\ell},\ \mathrm{LSTM}^{\ell}\left( \left\{ h_j^{\ell} : j \in \mathcal{N}_i \right\} \right) \right) \right)$

Here $W^{\ell}$ and $V^{\ell}$ are learnable weights.
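Below is a minimal sketch of the mean-aggregator variant on a dense adjacency matrix (our own simplification; the max-pooling and LSTM aggregators would replace the neighbor mean).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphSageLayer(nn.Module):
    """Sketch of GraphSage (mean aggregator): concatenate a node's own features
    with the mean of its neighbors', project, apply ReLU, then l2-normalize."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(2 * dim, dim)

    def forward(self, h, adj):                        # h: (N, d), adj: (N, N)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        neigh_mean = (adj @ h) / deg
        h_hat = torch.relu(self.W(torch.cat([h, neigh_mean], dim=1)))
        return F.normalize(h_hat, p=2, dim=1)         # per-node l2 normalization
```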
Graph Attention Network (GAT)
Both the GCN and GraphSage seen so far are examples of isotropic architectures, that is, each neighbor contributes equally towards determining a node's update, and no edge-specific information enters the update equation. The GAT model instead employs an attention mechanism that learns a separate weight for each edge from the features of its two endpoint nodes. Further, it uses multi-headed attention [8], which helps stabilize the learning process.

Update of Graph Attention Network[1]
The update equation is as follows:
$$h_i^{\ell+1} = \mathrm{Concat}_{k=1}^{K}\left( \mathrm{ELU}\left( \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k,\ell}\, W^{k,\ell} h_j^{\ell} \right) \right)$$

where $W^{k,\ell}$ are the parameters of the linear projection heads, and $\alpha_{ij}^{k,\ell}$ are the attention coefficients of each head defined as:

$$\alpha_{ij}^{k,\ell} = \operatorname*{Softmax}_{j \in \mathcal{N}_i}\left( \mathrm{LeakyReLU}\left( \left(a^{k,\ell}\right)^{\top} \mathrm{Concat}\left( W^{k,\ell} h_i^{\ell},\ W^{k,\ell} h_j^{\ell} \right) \right) \right)$$

where $a^{k,\ell}$ is a learnable parameter.
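A single attention head of this update can be sketched as follows on a dense adjacency matrix. This is our own simplification: the full layer concatenates K such heads, and masked positions use a large negative score rather than true minus-infinity for numerical convenience.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadGATLayer(nn.Module):
    """Sketch of one GAT head: per-edge attention scores from the projected
    features of both endpoints, softmax-normalized over each neighborhood."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.a_src = nn.Parameter(torch.randn(dim))   # first half of the attention vector a
        self.a_dst = nn.Parameter(torch.randn(dim))   # second half of a

    def forward(self, h, adj):                        # h: (N, d), adj: (N, N)
        z = self.W(h)
        # e_ij = LeakyReLU(a^T [z_i || z_j]) = LeakyReLU(a_src . z_i + a_dst . z_j)
        scores = F.leaky_relu((z @ self.a_src).unsqueeze(1) + (z @ self.a_dst).unsqueeze(0),
                              negative_slope=0.2)
        scores = scores.masked_fill(adj == 0, -1e9)   # attend only over actual neighbors
        alpha = torch.softmax(scores, dim=1)          # attention coefficients alpha_ij
        return F.elu(alpha @ z)
```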
MoNet
The vanilla GCN presented earlier in this report is derived from spectral convolution. Veličković et al. [5] argue that spectral convolutional techniques depend on the Laplacian eigenbasis, which is domain-dependent, and are therefore not easily transferable to other graphs. The MoNet architecture [4] instead relies on Gaussian Mixture Models (GMMs) to learn directly on the graph.

Update of MoNet[1]
The update equation is as follows:
$$h_i^{\ell+1} = \mathrm{ReLU}\left( \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} e_{ij}^{k,\ell}\, W^{k,\ell} h_j^{\ell} \right)$$

$$e_{ij}^{k,\ell} = \exp\left( -\tfrac{1}{2}\left( u_{ij}^{\ell} - \mu_k^{\ell} \right)^{\top} \left( \Sigma_k^{\ell} \right)^{-1} \left( u_{ij}^{\ell} - \mu_k^{\ell} \right) \right), \qquad u_{ij}^{\ell} = \tanh\left( A^{\ell} \left( \deg_i^{-1/2},\ \deg_j^{-1/2} \right)^{\top} + a^{\ell} \right)$$

where $\mu_k^{\ell}$, $\Sigma_k^{\ell}$, $A^{\ell}$, $a^{\ell}$, and the projection heads $W^{k,\ell}$ are the learnable parameters.
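To make the Gaussian kernel concrete, here is a small sketch of the kernel weights for one edge, assuming the diagonal-covariance parameterization that is common in MoNet implementations (the function and argument names are ours):

```python
import torch

def monet_kernel_weights(u, mu, inv_sigma):
    """Sketch of MoNet's K Gaussian kernels for a single edge.
    u: pseudo-coordinates of the edge, shape (p,)
    mu: kernel means, shape (K, p); inv_sigma: diagonal inverse covariances, shape (K, p).
    Returns one weight per kernel, shape (K,)."""
    diff = u.unsqueeze(0) - mu                        # (K, p)
    return torch.exp(-0.5 * (diff * inv_sigma * diff).sum(dim=1))
```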
GatedGCN
The GatedGCN architecture is an anisotropic message-passing-based GCN. It employs residual connections, batch normalization, and edge gates. Batch normalization helps stabilize the learning process, while residual connections allow deeper networks to be trained. The edge gates act as an attention mechanism. The authors also propose an LSTM-based neighborhood aggregation mechanism; however, deeper LSTM-based networks suffer from the vanishing gradient problem and do not achieve the expected results. This issue is further aggravated in dense graphs.

Update of GatedGCN[1]
$$h_i^{\ell+1} = h_i^{\ell} + \mathrm{ReLU}\left( \mathrm{BN}\left( W_1^{\ell} h_i^{\ell} + \sum_{j \in \mathcal{N}_i} e_{ij}^{\ell} \odot W_2^{\ell} h_j^{\ell} \right) \right)$$

where $W_1^{\ell}, W_2^{\ell}$ are learnable weight matrices, $\odot$ denotes element-wise multiplication, $\mathcal{N}_i$ are the neighbors of node $i$, and $e_{ij}^{\ell}$ are the edge gates defined as follows:

$$e_{ij}^{\ell} = \frac{\sigma\left( \hat{e}_{ij}^{\ell} \right)}{\sum_{j' \in \mathcal{N}_i} \sigma\left( \hat{e}_{ij'}^{\ell} \right) + \varepsilon}, \qquad \hat{e}_{ij}^{\ell} = \hat{e}_{ij}^{\ell-1} + \mathrm{ReLU}\left( \mathrm{BN}\left( A^{\ell} h_i^{\ell-1} + B^{\ell} h_j^{\ell-1} + C^{\ell} \hat{e}_{ij}^{\ell-1} \right) \right)$$

where $\sigma$ is the sigmoid function, $\varepsilon$ is a small fixed constant for numerical stability, and $A^{\ell}, B^{\ell}, C^{\ell}$ are learnable weight matrices.
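The full update can be sketched as below with dense tensors (our own simplification using an (N, N, d) edge tensor; a real implementation works on sparse graphs and is far more memory-efficient):

```python
import torch
import torch.nn as nn

class GatedGCNLayer(nn.Module):
    """Sketch of a GatedGCN layer: residual connections, batch norm,
    and sigmoid edge gates normalized over each neighborhood."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.W1, self.W2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.A, self.B, self.C = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.bn_h, self.bn_e = nn.BatchNorm1d(dim), nn.BatchNorm1d(dim)
        self.eps = eps

    def forward(self, h, e, adj):                     # h: (N, d), e: (N, N, d), adj: (N, N)
        N, d = h.shape
        # edge update: e_ij <- e_ij + ReLU(BN(A h_i + B h_j + C e_ij))
        e_hat = self.A(h).unsqueeze(1) + self.B(h).unsqueeze(0) + self.C(e)
        e_new = e + torch.relu(self.bn_e(e_hat.reshape(-1, d)).reshape(N, N, d))
        # edge gates: sigmoid of edge features, normalized over the neighborhood of i
        gates = torch.sigmoid(e_new) * adj.unsqueeze(-1)          # zero out non-edges
        gates = gates / (gates.sum(dim=1, keepdim=True) + self.eps)
        # node update: residual + ReLU(BN(W1 h_i + sum_j gate_ij * W2 h_j))
        agg = (gates * self.W2(h).unsqueeze(0)).sum(dim=1)
        h_new = h + torch.relu(self.bn_h(self.W1(h) + agg))
        return h_new, e_new
```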
Experiments
We shall be comparing all the presented message-passing-based GNN architectures on the binary node classification task of the PATTERN dataset proposed by Dwivedi et al. [1]. PATTERN is an artificial dataset of graphs generated with Stochastic Block Models (SBMs), which are widely used to controllably model community structure such as that found in social networks. Our train/validation/test split is 10K/2K/2K graphs, with an average of 117 nodes and 4,749 edges per graph.
The experiment aims to replicate the results produced by Dwivedi et al. [1]. For a fair comparison of all the architectures' learning capability, the authors train each network with about 100K parameters. All the architectures have been augmented to leverage modern deep learning techniques such as residual connections and batch normalization.
To provide a non-GNN baseline, as suggested by [1], we also train a simple multi-layer perceptron that predicts each node's class from its features alone. It is oblivious to the graph structure and to the neighbors of the node under consideration. This allows us to contrast the GNN architectures' success against a model that does not capture the semantics of the graph.
We use Weights & Biases Sweeps to run a grid search over all the configurations. The models are coded in PyTorch, and the code is available at this link. The experiments use the Adam optimizer. The initial learning rate follows the configuration files from [1]; it is halved whenever the model does not improve for five epochs, until a minimum learning rate is reached.
The section below presents the contents of the node_classification_pattern_sweep.yaml file. The `program` key specifies which Python script the agent should run with the given arguments. We use the `grid` sweep method, where the agent runs every possible configuration exactly once. The configuration for each model is stored in a separate file, so here we simply pass the names of these files. Since the purpose of this experiment is not to tune the models on a specific dataset but to compare them, we do not perform hyperparameter optimization.
```yaml
program: main_SBMs_node_classification.py
method: grid
metric:
  name: val_loss
  goal: minimize
parameters:
  dataset:
    values: ["SBM_PATTERN"]
  config:
    values: [
      "configs/SBMs_node_clustering_MLP_PATTERN_100k.json",
      "configs/SBMs_node_clustering_GCN_PATTERN_100k.json",
      "configs/SBMs_node_clustering_GraphSage_PATTERN_100k.json",
      "configs/SBMs_node_clustering_GatedGCN_PATTERN_100k.json",
      "configs/SBMs_node_clustering_GAT_PATTERN_100k.json",
      "configs/SBMs_node_clustering_MoNet_PATTERN_100k.json"
    ]
```
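Once the YAML file is in place, the sweep is typically registered with `wandb sweep node_classification_pattern_sweep.yaml`, and the returned sweep ID is passed to `wandb agent <sweep-id>` to start running the configurations.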
The results of the experiment are visualized in the graphs below. We observe that the GNN methods generally perform better than the non-GNN MLP baseline. Thus, we conclude that the GNN architectures indeed learn better features. It can also be observed that the anisotropic mechanisms (MoNet, GAT, and GatedGCN) perform better than the isotropic methods (vanilla GCN and GraphSage). This indicates that weighting each edge differently indeed helps improve graph representation learning.
MoNet and GatedGCN are the best-performing architectures, with a test accuracy of about 85%. This can be attributed to the fact that both are anisotropic architectures. In terms of training time versus accuracy, MoNet performs significantly better: it delivers performance similar to GatedGCN in about a third of the training time. However, note that this was a binary node classification task, which may be particularly well suited to MoNet's GMM-based architecture. Based on the experiments performed by Dwivedi et al. [1], MoNet's performance drops significantly on other tasks, whereas GatedGCN performs consistently well.
Run set (8 runs)
Conclusion
In this report, we presented an overview of the training pipeline of a message-passing-based GNN, reviewed and contrasted the various GNN architectures, broadly divided into isotropic and anisotropic families, and experimentally verified the stronger learning capability of the anisotropic architectures.
The code[1] for running the experiment is available on Colab.
While this report only experiments on the node classification task, I would like to encourage the readers to experiment on other tasks and datasets as well using Weights & Biases Sweep!
The original code written by Dwivedi et al is available here.
References
- Vijay Dwivedi, Chaitanya Joshi, Thomas Laurent, Yoshua Bengio, Xavier Bresson. Benchmarking Graph Neural Networks. arXiv preprint arXiv:2003.00982
- Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017
- Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
- Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
- Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph Attention Networks. In International Conference on Learning Representations (ICLR), 2018.
- Xavier Bresson and Thomas Laurent. Residual gated graph convnets. arXiv preprint arXiv:1711.07553, 2017
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.