
18. Limitations of Graph Neural Networks


The main idea of GNNs is that we start from a graph data structure, apply convolutions to produce representations of nodes, pass them through several layers, and obtain embeddings of nodes, subgraphs, and entire graphs.

We generate node embeddings based on local neighbourhoods. For every node, we build its computational graph and aggregate messages from its neighbours through that computational graph. The many GNN variants differ in how they aggregate messages, which neighbour features they aggregate, and which neural networks they use.
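As a rough, framework-free sketch of what one round of such neighbour aggregation can look like (the names `aggregate_once`, `adj`, `W_self`, and `W_neigh` are illustrative, not from any particular paper):

```python
import numpy as np

def aggregate_once(adj, h, W_self, W_neigh):
    """One hypothetical round of message passing: every node averages its
    neighbours' embeddings, mixes in its own embedding, and applies a ReLU.
    adj maps node id -> list of neighbour ids; h is an (N, d_in) feature matrix."""
    h_new = np.zeros((h.shape[0], W_self.shape[1]))
    for v, neighbours in adj.items():
        msg = h[neighbours].mean(axis=0) if neighbours else np.zeros(h.shape[1])
        h_new[v] = np.maximum(0.0, h[v] @ W_self + msg @ W_neigh)  # ReLU
    return h_new
```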

  • Node Classification [Kipf, ICLR 2017]
  • Graph Classification [Ying, NeurIPS 2018]
  • Link Prediction [Zhang, NeurIPS 2018]

Graph Structure Limitation

It turns out that conventional GNNs are not able to distinguish some simple graph structures. GCN and GraphSAGE both fail to distinguish the two graphs below despite their different network structure. If we start with uniform input node features but different edge configurations, we want the GNN to capture that structural difference; in this case both models fail to do so. (Figure: two graphs that GCN and GraphSAGE fail to distinguish.)

Noise Limitation

GNNs are not robust to noise in graph data. Adding slight noise to a graph, through node feature perturbation or edge addition/deletion, can have an adversarial effect on the output of a GNN.



Graph Structure Vulnerability

We want to be able to capture structural differences between graphs using a GNN. Consider the two graphs below; they could be whole graphs or parts of a larger graph. This matters even more in a classification scenario. (Figure: two structurally different graphs.)

Essentially this is the graph isomorphism test problem. No polynomial-time algorithm is known for the general case. GNNs may not perfectly distinguish all graphs, but we want to understand how well GNNs can perform the graph isomorphism test. This requires rethinking how the GNNs we have studied so far capture graph structure.

If two graphs with different structures induce computational graphs with different topologies for their nodes, we want to take advantage of that. Below are the computational graphs for two small graphs of 4 nodes each with different structures. We show 2-hop computational graphs, which is what we typically use in practice during training. (Figure: 2-hop computational graphs for the two 4-node graphs.)

Assuming nodes with uniform features, a maximally discriminative GNN, one able to discriminate the two graphs above, will map different subtrees of the computational graphs to different node representations. We call this property injectivity: a function is injective if it maps different elements to different outputs. If a function maps any two different inputs to the same output, it is not injective.

We want the aggregation mechanism over the computational graph to be injective, so that different computational graphs produce different outputs. The entire neighbourhood aggregation is injective if every step of neighbour aggregation is injective.

Until now we thought of neighbourhood aggregation as traversing a graph; now we want to view it as a function and ask whether it is injective. Neighbour aggregation is essentially a function over a multi-set, i.e. a set in which elements can repeat. Examples of multi-sets are [a, a], [b, b], [a, a, b], where a and b are elements of a set. The discriminative power of GNNs can therefore be characterized by that of multi-set functions.

GCN is not injective

GCN uses mean pooling, and therefore cannot distinguish aggregation over, say, the two multi-sets [a, b] and [a, a, b, b]. Mean pooling yields the same average for both multi-sets, hence it is not injective. Because of mean pooling, GCN cannot tell a node receiving messages from two neighbours apart from one receiving messages from four: the structural difference is not captured.
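A quick numeric check of this argument (the feature vectors `a` and `b` are arbitrary choices):

```python
import numpy as np

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # arbitrary node features

small = np.stack([a, b])           # multi-set [a, b]
large = np.stack([a, a, b, b])     # multi-set [a, a, b, b]

# Mean pooling collapses both multi-sets to the same vector, so a GCN-style
# aggregator cannot tell a 2-neighbour node from a 4-neighbour node.
print(small.mean(axis=0), large.mean(axis=0))   # identical: [0.5 0.5] twice
```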

GraphSAGE is not injective

GraphSAGE uses max pooling. Consider the multi-sets [a, b] and [a, a, b]. These two multi-sets produce the same output under element-wise max pooling. GraphSAGE therefore fails to distinguish multi-sets with the same distinct elements but different structure; here the number of nodes connected to the root node differs. Hence GraphSAGE is not injective.
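The same kind of check for max pooling, with sum pooling shown for contrast (again with arbitrary feature vectors):

```python
import numpy as np

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])

m1 = np.stack([a, b])       # multi-set [a, b]
m2 = np.stack([a, a, b])    # multi-set [a, a, b]

# Element-wise max pooling only sees which distinct elements are present,
# so both multi-sets map to the same vector.
print(m1.max(axis=0), m2.max(axis=0))   # identical: [1. 1.] twice

# Sum pooling, by contrast, separates them.
print(m1.sum(axis=0), m2.sum(axis=0))   # [1. 1.] vs [2. 1.]
```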

Solution

We want to design an injective multi-set function using neural networks. Any injective multi-set function can be expressed as $\phi\big(\sum_{x\in S}f(x)\big)$, where $\phi$ and $f$ are non-linear functions. For the multi-set [a, b, b] the function is therefore $\phi\big(f(a) + f(b) + f(b)\big)$. We can model $\phi$ and $f$ using MLPs.
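A minimal sketch of this $\phi\big(\sum_{x\in S}f(x)\big)$ form with both functions modelled as small MLPs; the module name, layer sizes, and depth are arbitrary choices:

```python
import torch
import torch.nn as nn

class InjectiveMultisetPool(nn.Module):
    """phi( sum_x f(x) ) with phi and f modelled as small MLPs."""
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, hidden_dim))
        self.phi = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, out_dim))

    def forward(self, x):              # x: (multiset_size, in_dim)
        return self.phi(self.f(x).sum(dim=0))   # sum over the multi-set, then phi
```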



Graph Isomorphism Network (GIN) [Xu et al., 2019]

We use an MLP + sum pooling as the aggregation mechanism. Not only is this permutation invariant, it is also an injective aggregation function. Sum pooling can likewise give us injective graph-level pooling.
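A sketch of a single GIN-style layer, assuming the standard update $h_v \leftarrow \text{MLP}\big((1+\epsilon)\,h_v + \sum_{u \in N(v)} h_u\big)$ from Xu et al. 2019; the MLP depth and dimensions here are arbitrary:

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """Sketch of one GIN layer: h_v <- MLP((1 + eps) * h_v + sum of neighbour h_u)."""
    def __init__(self, dim, eps=0.0):
        super().__init__()
        self.eps = nn.Parameter(torch.tensor(eps))
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, A, H):
        # A: (N, N) adjacency matrix without self-loops, H: (N, dim) node features.
        return self.mlp((1.0 + self.eps) * H + A @ H)   # sum aggregation, then MLP
```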

The expectation is that GIN will perform better on graph classification tasks than GCN and GraphSAGE.

Now we will see how much discriminative power GIN achieves by using injective neighbour aggregation.

GIN was conceived through its connection to the Weisfeiler-Lehman (WL) graph isomorphism test (1968). The WL algorithm is an efficient isomorphism test for most real-world graphs, with computation time of $O(N)$. No polynomial-time algorithm is known for exact graph isomorphism in general, so WL serves as an efficient approximate test.

Turns out GIN is as discriminative as the WL test.

The WL test essentially builds the rooted subtree for every node in a graph and assigns each node in the rooted subtree a color based on the number of neighbours it has. We make it efficient by only going deep enough to find a mismatch.

In the image below you can see two sample graphs and their respective rooted subtrees. (Figure: rooted subtrees for the two sample graphs.)

For the first graph we have 4 pink nodes with the same subtrees, and for the second graph we have 1 blue, 1 green, and 2 purple nodes with their subtrees. Here we did not even have to color the second level of the subtree nodes to see that the original graphs are different. WL essentially compares the counts of neighbours in the subtrees, one level at a time, until it finds a difference in node coloring.
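A compact sketch of this color-refinement procedure (colors are represented as nested tuples instead of being re-hashed each round, which keeps them comparable across graphs):

```python
from collections import Counter

def wl_color_histogram(adj, rounds=3):
    """1-WL color refinement. adj maps node -> list of neighbours.
    Start with the same color everywhere, then repeatedly re-color each node
    with (its own color, the sorted multi-set of its neighbours' colors).
    If two graphs ever have different color histograms, they are not isomorphic."""
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        colors = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
    return Counter(colors.values())
```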

The WL test and GIN are operationally equivalent: graphs that the WL test can distinguish, GIN can distinguish, so GINs have the same discriminative power as the WL graph isomorphism test. There are corner cases where WL fails. For example, the WL algorithm fails to distinguish $G_{\text{skip}}(11, 2)$ from $G_{\text{skip}}(11, 3)$. A skip graph with arguments $a$, $b$ has $a$ nodes arranged in a circle, and each node has an edge skipping $b$ nodes on either side of it. These two skip graphs have different structure, but to the WL test they look the same, because every node has the same local subtree structure. [Murphy, ICML 2019] resolves such corner cases, but with exponential time complexity, so it may only be useful for smaller graphs.
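Assuming one plausible reading of the construction above, $n$ nodes on a circle where each node also has an edge to the node $b$ steps away, the refinement sketch above reports identical color histograms for both skip graphs, since every node in each graph has the same degree:

```python
def skip_graph(n, b):
    """Hypothetical reading of G_skip(n, b): n nodes on a circle (edges i <-> i+1)
    plus 'skip' edges i <-> i+b (indices mod n)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in (i + 1, i + b):
            adj[i].add(j % n)
            adj[j % n].add(i)
    return {v: sorted(nbrs) for v, nbrs in adj.items()}

# Both graphs are regular with the same degree, so every node gets the same color
# at every round and the histograms match, even though the graphs differ in structure.
print(wl_color_histogram(skip_graph(11, 2)) == wl_color_histogram(skip_graph(11, 3)))  # True
```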

GIN performs much better than GCN and GraphSAGE on graph classification.



Noise Vulnerability in GNN

Deep neural networks are vulnerable to adversarial attacks. For example, in an image recognition task we can add a layer of noise to an image in a way that is not noticeable to humans but fools the neural network into classifying it as some other image.

(Figure: an adversarial perturbation applied to an image.)

These sorts of adversarial attacks are also very common in applications of GNNs such as search engines, recommender systems, and social networks. Adversaries will exploit any exposed vulnerabilities.

Consider a GCN for a semi-supervised node classification task on a graph with adjacency matrix $A \in \{0, 1\}^{N\times N}$ and node attributes $X \in \{0, 1\}^{N\times D}$. Its renormalized adjacency matrix is $\hat{A} \equiv D^{-1/2}(A + I)D^{-1/2}$, where $D$ is the degree matrix of $A + I$.

The classification model with two-step GCN message passing is $\text{softmax}\big(\hat{A}\,\text{ReLU}(\hat{A} X W^{(1)})\,W^{(2)}\big)$.
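A minimal NumPy sketch of this forward pass, with $\hat{A}$ built as defined above; the weights `W1` and `W2` are assumed to be already trained:

```python
import numpy as np

def renormalized_adjacency(A):
    """A_hat = D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * np.outer(d_inv_sqrt, d_inv_sqrt)

def gcn_forward(A, X, W1, W2):
    """Two-step GCN classifier: softmax(A_hat ReLU(A_hat X W1) W2)."""
    A_hat = renormalized_adjacency(A)
    H = np.maximum(0.0, A_hat @ X @ W1)        # first propagation + ReLU
    logits = A_hat @ H @ W2                    # second propagation
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)   # row-wise softmax over classes
```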

We minimize the cross-entropy loss on the labeled data and apply the model to predict labels for the unlabeled nodes.

Consider attack possibilities for node classification, where we want to change the predicted class of a target node $t$ using attacker nodes $S$ that are part of the network. We call it a direct attack if the attacker node is the target node itself ($S = \{t\}$), and an indirect attack if the target node is not among the attacker nodes ($t \notin S$).

Direct attack examples:

  • modify the target's features, e.g. change website content
  • add connections to the target, e.g. buy likes/followers
  • remove connections from the target, e.g. unfollow untrusted users

Indirect attack examples:

  • modify the attackers' features, e.g. hijack friends of the target
  • add or remove connections among the attackers, e.g. create a link/spam farm

Nettack

Zügner et al., Adversarial Attacks on Neural Networks for Graph Data, KDD 2018

We can formalize adversarial attacks on graphs as maximizing the change in the predicted label of the target node, subject to limited noise in the graph.

We have the following objective function to find a modified graph that maximizes the change of predicted labels of a target node.

$$\argmax_{A', X'} \; \max_{c \neq c_{\text{old}}} \; \log Z_{v, c}^* - \log Z_{v, c_{\text{old}}}^*$$

where $Z^* = f_{\theta^*}(A', X') = \text{softmax}\big(\hat{A}\,\text{ReLU}(\hat{A} X' W^{(1)})\,W^{(2)}\big)$, with $\hat{A}$ the renormalized adjacency matrix of $A'$,

with $\theta^* = \argmin_\theta \mathcal{L}(\theta; A', X')$,

such that $(A', X') \approx (A, X)$.

$A'$ and $X'$ are the modified adjacency matrix and node features.

We want to increase the log-likelihood $\log Z_{v, c}^*$ of the target node $v$ being predicted as a new class $c$, and decrease the log-likelihood $\log Z_{v, c_{\text{old}}}^*$ of $v$ being predicted as its old class $c_{\text{old}}$.

The GCN is then trained on the modified graph and used to predict the labels of the target node. At each change of connectivity we have to train a brand-new GCN, and we want to achieve the attack with the smallest possible change to the adjacency matrix.

But we cannot solve this optimization problem exactly: graph modification is discrete, so we cannot use simple gradient descent to optimize it, and the inner loop involves expensive re-training of the GCN.

Some heuristics have been proposed to efficiently obtain an approximate solution: greedily choosing the graph modifications step by step, or simplifying the GCN by removing the ReLU activation so that the objective can be evaluated in closed form. A schematic version of the greedy approach is sketched below.
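Schematically, and not Nettack's actual scoring or candidate filtering, such a greedy loop might look like this, where `score_fn` stands in for the (surrogate) attack objective above:

```python
import itertools

def greedy_structure_attack(A, X, target, attackers, budget, score_fn):
    """Schematic greedy attack (not Nettack's exact procedure): at each step, try
    flipping every edge between an attacker node and any other node, keep the flip
    that most increases the attack objective, and stop once the budget is spent.
    score_fn(A, X, target) stands in for the surrogate objective."""
    A = A.copy()
    n = A.shape[0]
    for _ in range(budget):
        base = score_fn(A, X, target)
        best_gain, best_edge = 0.0, None
        for u, v in itertools.product(attackers, range(n)):
            if u == v:
                continue
            A[u, v] = A[v, u] = 1 - A[u, v]        # tentatively flip the edge
            gain = score_fn(A, X, target) - base
            if gain > best_gain:
                best_gain, best_edge = gain, (u, v)
            A[u, v] = A[v, u] = 1 - A[u, v]        # undo the flip
        if best_edge is None:                      # no flip helps any more
            break
        u, v = best_edge
        A[u, v] = A[v, u] = 1 - A[u, v]            # commit the best flip
    return A
```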

More details in Zügner et al., KDD 2018.

Nettack Experiments

They ran class prediction for a single node in semi-supervised node classification with a GCN. The GCN's prediction was easily manipulated by modifications of the graph structure alone, in a graph with roughly 2k nodes and 5k edges.



Open Questions and Future Directions

Example applications include molecular property prediction tasks in chemistry and protein-protein interaction networks used to predict protein function.

Common challenges include scarcity of labeled data, because labels require expensive experiments, so models overfit to small training datasets. With small datasets, models also underperform on out-of-distribution tasks, where test examples are very different from the training data, as is typical in scientific discovery.

Pre-training of GNNs [Hu et al., 2019]

Pre-train GNNs on relevant, easy-to-obtain graph data, as is done in natural language processing, to obtain representations, and then fine-tune them for downstream tasks.

Other open directions include making GNNs robust to adversarial attacks, making optimization over discrete graph data tractable, and achieving a good trade-off between accuracy and robustness.