Before going through this report, I suggest the readers check out Part 1 – Intro to Graph Neural Networks with GatedGCN where we understand the motivation behind graphical neural networks (GNN), and the various problem statements that GNNs could be useful for.

The first section of this report describes the training pipeline of a message-passing based GNN. Next, we review and compare various message-passing based GNN architectures such as vanilla GCN[2], GraphSage[3], MoNet[4], GAT[5] and GatedGCN[6]. Finally, we use Sweeps by Weights & Biases to train and compare these architectures at the Node classification task.

*Figure 1: Summary of the training pipeline[1]*

The input layer consists of the following elements:

- Node features $\alpha_i \epsilon \mathbb{R}^{a\times1}$. These features are passed as an embedding to d-dimensional hidden features $h_i^{l=0}$ via a simple linear projection.
- (optional)Edge features $\beta_{ij} \epsilon \mathbb{R}^{b\times1}$, for each edge connecting node $i$ and node $j$. Similar to the node features, these features are passed as an embedding to d-dimensional hidden feature $e_{ij}^{l=0}$.
- Lastly, the graph structure itself which maintains the connections amongst the nodes.

As seen in the figure, this segment consists of the $L$ layer neural network. This part of the training pipeline encapsulates the hidden layers of the GNN. The $L$ layer deep network corresponds to L-hop neighborhood aggregation across the entire network. Since this is an iterative process, it can be visualized as a message-passing mechanism where each node receives updates from all its neighbors, which in turn receives updates from their neighbors.

The updated feature vector $h^{l+1}_i$ for node $i$ is simply a function of its previous feature vector $h_i^l$ and the feature vectors of all its neighbors $j$ as shown in the equation below.

$h^{l+1}_i = f(h_i^{l+1}, {h^l_j: j\rightarrow i})$

It is interesting to see in Figure 1 how these intermediate layers themselves can be visualized as intermediate graphs highlighting their ability to capture the graph structure. +Since this component forms the training pipeline's crux, we shall be looking at the detailed GNN layer of various architectures in the subsequent sections.

In the prediction layer, we utilize the GNN based node embeddings to predict the task-based outcome. We design the loss function and apply gradient descent to improve task-based prediction. This helps the GNN learn more task-based discriminative node embeddings.

Following is the summary of the prediction layer across various tasks:

- Graph Classification: Apply a Multi-Layer Perceptron(MLP) on the average of all node embeddings. Finally, we minimize the cross-entropy loss between the obtained logits and the ground truth.
- Graph Regression: Similar to graph classification, we apply MLP on the average of all node embeddings. Finally, we minimize the L1-loss between the predicted score and the ground truth score.
- Node Classification: For this task, we independently pass each node’s embeddings to an MLP and minimize the cross-entropy loss between obtained logits and the ground truth.
- Edge Classification: To predict an edge between two nodes, we concatenate the two nodes' embeddings and pass it through an MLP. Finally, we minimize the cross-entropy loss between the logits and the ground-truth.

In this section, we look at the GNN layer of the various Message Passing based GNNs and contrast these architectures.

The vanilla GCN[2] introduced a semi-supervised learning approach to graph-structured data. It uses shared parameters to perform an efficient layer-wise update of embeddings. This update is based on a first-order approximation of spectral convolutions on the graph. Spectral convolution refers to the eigendecomposition of the Laplacian matrix[7]. One of the most important limitations of the vanilla GCN is that it does not incorporate edge features. The update rule is mathematically formulated as follows:

$h_i^{l+1} = ReLU(U^l \frac{1}{deg_i} \Sigma_{j\epsilon N_i})$

where, $U^l\epsilon\mathbb{R}^{d\times d}$ is the weight matrix(ignores bias for clarity), $deg_i$ is the degree of node $i$.

While in the above equation, we can notice that the update does not incorporate the previous layer's features, we can still enforce this using skip connection.

GraphSage[3] builds upon the vanilla GCN by incorporating each node’s own features from the previous layer during updation. The updation equation is as follows:

$\hat{h}*i^{l+1} = ReLU(U^l \ Concat(h_i^l, Mean*{j\epsilon N_i} h^l_j))$, $U^l \epsilon \mathbb{R}^{d\times2d}$

The updated layer $h^{l+1}_i$ is simply the l2-norm of $\hat{h}_i^{l+1}$.

[3]Hamilton et al also propose two more advanced neighborhood aggregation methods:

- Max-pooling: $\hat{h}
*i^{l+1} = ReLU(U^l \ Concat(h_i^l, Max*{j\epsilon N_i} ReLU(V^l h^l_j)))$ - LSTM: $\hat{h}
*i^{l+1} = ReLU(U^l \ Concat(h_i^l, LSTM*{j\epsilon N_i}^l(h^l_j)))$

Here $V^l\epsilon \mathbb{R}^{d\times d}$ and $LSTM^l$ are learnable weights.

Both GCN and GraphSage seen until now are examples of isotropic architectures, that is, each neighbor contributes equally towards determining the update equation. Furthermore, both these architectures do not incorporate edge features in their update equation. The GAT model employs an edge feature based attention mechanism. Further, it employs a multi-headed attention mechanism[8] that helps stabilize the learning process.

*Update of Graph Attention Network[1][4]*

The update equation is as follows:

$h^{l+1}*i = Concat^K*{k=1}(ELU(\Sigma_{j\epsilon N_i}e_{ij}^{k,l} U^{k,l}h^l_j))$
where $U^{k,l} \epsilon \mathbb{R}^{\frac{d}{K}\times d}$ are the parameters for $K$ linear project heads, and $e_{ij}^{k,l}$ are the attention coefficients of each head defined as:

$e_{ij}^{k,l} = \frac{exp(\hat{e}*{ij}^{k,l})}{\Sigma*{j'\epsilon N_i} exp(\hat{e}*{ij'}^{k,l})}$
$\hat{e}^{k,l}*{i,j} = LeakyReLU(V^{k,l} Concat(U^{k,l}h_i^l, U^{k,l}h_j^l))$
where $V^{k,l}\epsilon \mathbb{R}^{\frac{2d}{K}}$ is a learnable parameter

The techniques presented earlier in this report apply spectral convolution. Velickovi ˇ c et al. argue that spectral convolutional techniques are dependent on the Laplacian eigenbasis, which is domain-dependent. Thus they state that spectral convolutional methods are not easily transferable to other graphs. The Monet architecture relies on Gaussian Mixture Models(GMM) to learn on graphs.

*Update of Monet[1][5]*

The updation equation is as follows:

$h^{l+1}*i = ReLU(\Sigma*{k=1}^K \Sigma_{j\epsilon N_i} e{k,l}*{ij} U^{k,l} h^l_j)$
$e^{k,l}*{ij} = exp(-\frac{1}{2} (u^l_{ij} - \mu^l_k)^T)(\Sigma_k^l)^{-1} (u^l_{ij} - \mu^l_k))$
$u^l_{ij} = Tanh(A^l(deg_i^{-0.5}, deg_j^{-0.5})^T + a^l)$

where $U^{k,l} \epsilon \mathbb{R}^{d\times d}$, $\mu^l_k$, $(\Sigma^l_k)^{-1}$, $a^l \epsilon \mathbb{R}^2$ and $A^l \epsilon \mathbb{R}^{2\times 2}$ are the learnable parameters.

The GatedGCN architecture is an anisotropic message-passing based GCN. It employs residual connections, batch normalization, and edge gates. Batch normalization help stabilize the learning process while residual connections allow for developing deeper networks. The edge gates contribute as an attention mechanism. The authors also propose an LSTM based neighborhood aggregation mechanism; however, deeper LSTM based networks suffer from the vanishing gradient problem and do not achieve the expected results. This issue further aggravates in dense graphs.

*Update of GatedGCN[1][6]*

$h^{l+1}*i = h^l_i + ReLU(BN(U^l h_i^l + \Sigma*{j\epsilon N_i} e^l_{ij}\odot V^l h^l_j )),$$$
where $U^l, V^l \epsilon \mathbb{R}^{d\times d}$, $\odot$ implies element-wise multiplication, $N_i$ are neighbors of node $i$, $e^l_{ij}$ are the edge gates defined as follows:

$e^l_{ij} = \frac{\sigma(\hat{e}^{l}*{ij})}{(\Sigma*{j'\epsilon N_i} \sigma(\hat{e}^{l}*{ij'}) + \varepsilon)}$
$\hat{e}^{l}*{ij} = \hat{e}^{l-1}_{ij} + ReLU(BN(A^l h^{l-1}_i + B^l h^{l-1}*j + C^l \hat{e}^{l-1}*{ij}))$

where $\sigma$ is the signmoid function, $\varepsilon$ is a small fixed constant for numerical stability, $A^l, B^l, C^l \epsilon \mathbb{R}^{d \times d}$

We shall be comparing all the presented message-passing based GNN architectures at the binary node classification task on the PATTERN Dataset proposed by Dwivedi et al. [1]. The PATTERN dataset is a Stochastic Block Model-based artificial graph. SBMs are known to be good at controllably modeling social networks. Our train/validation/test split is 10K/2K/2K graphs, with an average of 117 nodes and 4749 edges.

The experiment aims to replicate the results produced by Dwivedi et al. [1]. For a fair comparison of all the architectures' learning capability, the authors train each network with about 100K parameters. All the architectures have been augmented to leverage modern deep learning techniques such as residual connections and batch normalization.

To provide a non-GNN baseline, as suggested by [1], we also train a simple multi-layer perceptron network that simply predicts the node class based on the node features. It is oblivious of the graph structure and the neighbors of the node under consideration. This allows us to contrast the GNN architectures' success against a model that does not capture the semantics of the graph.

We use the Weights and Biases Sweeps to run a grid search on all the parameters. The models have been coded in PyTorch, and the code is available at this link. The experiments use Adam Optimizer. The initial learning rate is set to $10^{-3}$, and it is reduced by half if the model does not improve for five epochs until the learning rate reaches $10^{-6}$.

The section below presents the code in the node_classification_pattern_sweep.yaml file. The *program* specifies which python program the agent should be running with the determined arguments. We specify the **grid** method of performing sweep, where the agent simply runs all set of possible configurations only once. The configuration for each of the models has been stored in separate files, so here we just pass the name of these files. Since this experiment's purpose is not to tune the models on the specific datasets, but to compare them, we do not perform hyperparameter optimization.

```
program: main_SBMs_node_classification.py
method: grid
metric:
name: val_loss
goal: minimize
parameters:
dataset:
values: ["SBM_PATTERN"]
config:
values: ["configs/SBMs_node_clustering_MLP_PATTERN_100k.json", "configs/SBMs_node_clustering_GCN_PATTERN_100k.json", "configs/SBMs_node_clustering_GraphSage_PATTERN_100k.json", "configs/SBMs_node_clustering_GatedGCN_PATTERN_100k.json", "configs/SBMs_node_clustering_GAT_PATTERN_100k.json", "configs/SBMs_node_clustering_MoNet_PATTERN_100k.json"]
```

The results of the experiment have been visualized in the graphs below. We observe that the GNN methods generally perform better than the non-GNN MLP based baseline. Thus, we conclude that the proposed GNN architectures indeed learn better features with specialized architectures. It can also be observed that the anisotropic mechanisms, MoNet, GAT, and GatedGCN perform better than the isotropic methods vanilla GCN and GraphSage. This indicates that the edge features indeed help in improving the graph representation learning task.

The architectures MoNet and GatedGCN are seen to be the best performing architectures with a test accuracy of about **85%**. It can be attributed to the fact that both are anisotropic architectures. In the time taken to train vs. accuracy aspect, the MoNet performs significantly better. It delivers similar performance to GatedGCN with three times lesser training time. However, it is to be noted that this was a binary node classification task that may be better suited for MoNet's GMM based architecture. Based on the experiments performed by Dwivedi et al., the performance of MoNet drops significantly at other tasks, whereas the GatedGCN consistently performs better.

In this report, we have seen an overview of the training pipeline of a message-passing based GNN. We also review and contrast the various GNN architectures broadly divided into isotropic and anisotropic. We also experimentally verify the learning capability of the anisotropic architectures.

While this report only experiments on the node classification task, I would like to encourage the readers to experiment on other tasks and datasets as well using Weights and Biases Sweep!

- Vijay Dwivedi, Chaitanya Joshi, Thomas Laurent, Yoshua Bengio, Xavier Bresson. Benchmarking Graph Neural Networks. arXiv preprint arXiv:2003.00982
- Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017
- Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
- Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
- Petar Velickovi ˇ c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua ´ Bengio. Graph Attention Networks. International Conference on Learning Representations, 2018
- Xavier Bresson and Thomas Laurent. Residual gated graph convnets. arXiv preprint arXiv:1711.07553, 2017
- https://towardsdatascience.com/spectral-graph-convolution-explained-and-implemented-step-by-step-2e495b57f801
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.