8. Graph Neural Networks
Intuition
Given a graph $G(V, E)$, our goal is to map each node $v$ to its own $d$-dimensional embedding, or representation, that captures the node's local graph structure and data (node features, edge features of edges connecting to the node, features of neighbouring nodes weighted by their importance, etc.) such that nodes similar to $v$ are close together in the embedding space.
Our goal is to find an encoder function $ENC(v)$ that maps every node to an embedding $z_v$ such that nodes that are similar, based on their network neighbourhood and features, are close to each other in the embedding space. Once we have an encoder of a node's local graph structure and features, we can use the embeddings for any downstream task, say classifying nodes.
We learned shallow encoding methods in the previous report, where $ENC(v)$ is a simple embedding lookup into an $N \times d$ matrix. We can look at shallow encoders as a single layer of data transformation: one hidden layer that maps node $u$ to embedding $z_u$ via a function $f$, where $z_u = f(z_v, v \in N_R(u))$.
Shallow encoders have several limitations. They need $O(|V|)$ parameters, since the embedding lookup matrix is of size $|V| \times d$; parameters are not shared between nodes, and every node has its own unique embedding. They are also inherently transductive: they cannot generate embeddings for nodes not seen during training. Imagine we have a social network graph and have learned the embedding matrix, and a new node joins the network; we would have to retrain the network using the embedding methods discussed earlier to get a new embedding matrix. Another disadvantage is that they do not incorporate node features; they only make use of the local/global graph structure of a node via random walks.
We want a better encoding methodology that solves all these problems and automatically applies multiple layers of non-linear transformations to each node's graph structure. Graph Neural Networks are exactly that. These deep encoders can be combined with the node similarity functions defined in the previous report, if that is the downstream task, or trained on various classification tasks, letting back-propagation figure out the parameters best suited to the task at hand.
The problem is that networks have arbitrary size and complex topological structure: there is no spatial locality as in image grids, and no fixed node ordering or reference point. They are often dynamic and have multimodal features. For a given node, we should be able to transform the "information" from its neighbours, combine it, and come up with something more useful than the initial node features.
Naive approach
A naive approach would be to concatenate the adjacency matrix and node features and feed them into a deep neural network for some downstream task. This is not applicable to graphs of different sizes, and the same graph has up to $N!$ different adjacency matrix representations, since a graph is invariant to node ordering. If we feed the trained network a different adjacency matrix representing the same graph, it will, intuitively, not work.
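To make the ordering problem concrete, the small sketch below (illustrative, not from the source) relabels the nodes of a toy graph and shows that the flattened adjacency-plus-features input a plain feed-forward network would see changes completely, even though the graph itself does not:

```python
import numpy as np

# Tiny graph: 3 nodes, edges (0-1) and (1-2), one feature per node.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
X = np.array([[1.0], [2.0], [3.0]])

# Relabel the same graph with node order [2, 0, 1] via a permutation matrix.
perm = np.array([2, 0, 1])
P = np.eye(3)[perm]
A_perm, X_perm = P @ A @ P.T, P @ X

# The two flattened inputs differ, so a plain MLP would treat the same
# graph as two different examples.
print(np.concatenate([A.ravel(), X.ravel()]))
print(np.concatenate([A_perm.ravel(), X_perm.ravel()]))
```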
Computational Graph
So we look at each node's neighbourhood as its own computation graph, and propagate information across the graph to compute node embeddings. We aggregate the information passed from a node's neighbours, who were themselves informed by their own neighbours.
In the above image, consider the input graph and a target node $A$; the graph on the right shows the computation graph of node $A$ based on its neighbourhood. Intuitively, $A$ gets messages from its neighbouring nodes $[B, C, D]$ and processes them somehow, and the nodes $[B, C, D]$ in turn process information from their own neighbours. Look at the edge directions to make sense of the computation graph. This is a computation graph of depth 2, where the original node features of nodes $[A, C]$ are passed to node $B$, transformed, and then passed on to node $A$.
The key intuition is that every node has its own computation graph. We use dynamic programming to simultaneously compute the node embeddings of all nodes at each layer of the computation graph and feed them into the next layer.
A deep model can be of arbitrary depth; the model and the corresponding computation graph shown below are of depth 2.
Nodes have embeddings at each layer. The layer-0 embedding of node $u$ is its input feature vector $x_u$, and the layer-$K$ embedding aggregates information from nodes that are $K$ hops away in the computation graph.
The black boxes in the picture are the neural networks, and the arrows are messages from neighbours. Keep in mind that at a given level we use the same neural network, with the same parameters, for the computation graphs of all nodes. At every level, the aggregated messages go through the neural network to give us that level's node embeddings. We also have to make sure to aggregate the messages from all the neighbours in a node-order-agnostic way.
Keep in mind that at every level of the computation graph, the actual graph/network structure doesn't change; the only thing changing/improving is the node embedding. At level 1, node embeddings make use of the node embeddings at level 0 (the original node features), and at level 2, node embeddings make use of the node embeddings at level 1. In general, at level $k$ we compute node embeddings using the level $k-1$ embeddings.
This is the core intuition behind Graph Neural Networks. How we aggregate the messages at each level, and how we compute a node's message using its neighbourhood information, are the variations among Graph Neural Networks.
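As a small illustration of how a node's computation graph is unrolled, here is a sketch that expands a node's depth-$K$ neighbourhood layer by layer. The helper and the example graph are made up for illustration; only $A$'s and $B$'s neighbours follow the description above.

```python
def computation_graph(adj, target, depth):
    """Sketch: unroll the depth-`depth` computation graph of `target`,
    returning the nodes contributing messages at each layer
    (illustrative helper, not from the source)."""
    layers = [[target]]
    for _ in range(depth):
        layers.append([nbr for node in layers[-1] for nbr in adj.get(node, [])])
    return layers


# A small graph loosely consistent with the description above (A's neighbours
# are B, C, D and B's neighbours are A and C; the rest is made up).
adj = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["A", "B", "E", "F"],
       "D": ["A"], "E": ["C"], "F": ["C"]}
print(computation_graph(adj, "A", depth=2))
```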
Graph Convolutional Network (GCN)
The aggregation method we will use here is averaging the neighbour messages. This is how we compute the layer-$k$ embedding of node $v$ given the layer-$(k-1)$ embeddings of its neighbourhood, for a computation graph of depth $K$:
$$h_v^0 = x_v$$

$$h_v^k = \sigma \bigg( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \bigg), \quad \forall k \in \{1, \cdots, K\}$$

$$z_v = h_v^K$$
In the above equations, $z_v$ is the embedding after $K$ layers of neighbourhood aggregation, $\sigma$ is the non-linearity, $h_v^{k-1}$ is the previous-layer embedding of the node itself, and $N(v)$ is the set of neighbouring nodes of node $v$. $W_k$ and $B_k$ are the trainable matrices that we will learn.
We can feed these embeddings into any loss function and run stochastic gradient descent to learn the weight parameters $W_k$ and $B_k$.
The above equation can be written in matrix form as
$$H^{l} = \sigma\big( H^{l-1} W_0^{l-1} + \tilde A H^{l-1} W_1^{l-1} \big)$$

with $\tilde A = D^{-1/2} A D^{-1/2}$ and $H^l = [{h_1^l}^T, \cdots, {h_N^l}^T]^T$.
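A minimal sketch of one such layer in PyTorch, assuming a dense adjacency matrix (the class and variable names are illustrative, not from the source):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Sketch of the matrix-form GCN update above (dense adjacency assumed)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W1 = nn.Linear(in_dim, out_dim, bias=False)  # neighbour transform
        self.W0 = nn.Linear(in_dim, out_dim, bias=False)  # self transform

    def forward(self, h, adj):
        # A_tilde = D^{-1/2} A D^{-1/2} (symmetric normalization)
        deg = adj.sum(dim=1).clamp(min=1)
        d_inv_sqrt = deg.pow(-0.5)
        a_tilde = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)
        # H^l = sigma(H^{l-1} W_0 + A_tilde H^{l-1} W_1)
        return torch.relu(self.W0(h) + a_tilde @ self.W1(h))


# Usage sketch: 5 nodes, 8 input features, 16-dimensional embeddings.
adj = torch.randint(0, 2, (5, 5)).float()
h = torch.randn(5, 8)
layer = GCNLayer(8, 16)
print(layer(h, adj).shape)  # torch.Size([5, 16])
```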
Unsupervised
We can train the above model in an unsupervised manner, using only the graph structure, so that "similar" nodes get similar embeddings. The unsupervised loss function can be based on:
- Random walk methods we discussed in the last report like node2vec, DeepWalk, struc2vec.
- Graph factorization
- Node proximity in the graph.
Supervised
We can also directly train the model for a supervised task, like node classification. For node classification, an example loss function on the node embeddings might look like this:
$$\mathcal{L} = \sum_{v \in V} y_v \log(\sigma(z_v^T\theta)) + (1 - y_v)\log(1 - \sigma(z_v^T\theta))$$

where $y_v$ is the node's class label and $\theta$ are the classification weights.
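A minimal PyTorch sketch of this loss, assuming `z` holds the final node embeddings, `y` the binary node labels, and `theta` a learnable weight vector (all hypothetical names):

```python
import torch
import torch.nn.functional as F

# Hypothetical tensors standing in for the GNN outputs and labels.
num_nodes, d = 100, 64
z = torch.randn(num_nodes, d, requires_grad=True)   # embeddings z_v
y = torch.randint(0, 2, (num_nodes,)).float()        # binary labels y_v
theta = torch.randn(d, requires_grad=True)           # classification weights

logits = z @ theta                                    # z_v^T theta for every node
# binary_cross_entropy_with_logits applies the sigma(.) and log terms above.
loss = F.binary_cross_entropy_with_logits(logits, y)
loss.backward()
print(loss.item())
```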
Overview
- Define a neighbourhood aggregation function that is node-order agnostic; in this case it is the average.
- Define a loss function on the embeddings specific to the task.
- Train on a set of nodes i.e., a batch of compute graphs.
- Generate embeddings as needed, even for nodes we never trained on.
- The same aggregation parameters are shared across all nodes, the number of model parameters is sublinear in $|V|$, and we can generalize to unseen nodes.
- Using inductive node embeddings we can generalize to entirely unseen graphs. For example, train on a protein interaction graph from model organism A and generate embeddings for newly collected data about organism B.
- Many application settings constantly encounter previously unseen nodes, e.g. Reddit or YouTube. We are able to generate node embeddings "on the fly" for new nodes that slightly change the graph structure.
GraphSAGE
So far we have aggregated the neighbour messages by taking their average. We can instead use any differentiable, order-agnostic function that maps the set of vectors in $N(v)$ to a single vector. We can also, for example, apply L2 normalization to each node embedding at every layer.
$$h_v^k = \sigma\big(\big[ A_k \cdot AGG(\{ h_u^{k-1}, \forall u \in N(v) \}),\ B_k h_v^{k-1} \big]\big)$$
GraphSAGE differs from the simple neighbourhood aggregation of the previous section in two key ways: instead of averaging the neighbourhood embeddings, it applies a generalized aggregation, and instead of adding the aggregated messages to the self embedding at every layer, it concatenates the aggregated neighbourhood embedding with the self embedding.
The aggregation function can be a mean/average, a pooling function, or an LSTM.
Mean
Take the average of the neighbour embeddings:
$$AGG = \sum_{u\in N(v)} \frac{h_u^{k-1}}{|N(v)|}$$
Pool
Transform the neighbour embedding vectors and apply a symmetric vector function $\gamma$:
$$AGG = \gamma(\{ Q h_u^{k-1}, \forall u \in N(v)\})$$
LSTM
Apply an LSTM to a random permutation $\pi$ of the neighbourhood embeddings from the previous layer:
$$AGG = LSTM([h_u^{k-1}, \forall u \in \pi(N(v))])$$
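A minimal PyTorch sketch of a GraphSAGE-style layer with mean aggregation and L2 normalization, under the same dense-adjacency assumption as before (names are illustrative):

```python
import torch
import torch.nn as nn

class SAGELayer(nn.Module):
    """GraphSAGE-style layer with mean aggregation and L2 normalization
    (dense adjacency assumed; names are illustrative)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.A = nn.Linear(in_dim, out_dim, bias=False)   # A_k: neighbour transform
        self.B = nn.Linear(in_dim, out_dim, bias=False)   # B_k: self transform

    def forward(self, h, adj):
        # AGG = mean of the neighbours' previous-layer embeddings.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = (adj @ h) / deg
        # Concatenate transformed neighbour aggregate with transformed self embedding.
        out = torch.relu(torch.cat([self.A(agg), self.B(h)], dim=1))
        # Optional L2 normalization of every node embedding at every layer.
        return out / out.norm(dim=1, keepdim=True).clamp(min=1e-12)
```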
Efficient Implementations of Aggregations
Many aggregations can be performed efficiently by sparse matrix operations.
Let $H^{k-1} = [h_1^{k-1},\cdots,h_n^{k-1}]$.
$$h_v^k = \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} \implies H^k = D^{-1}AH^{k-1}$$
Another example implementation is the GCN of Kipf et al. (2017), where

$$H^k = D^{-1/2} A D^{-1/2} H^{k-1}$$
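A small sketch of these two matrix-form aggregations using SciPy sparse matrices (function names are illustrative):

```python
import numpy as np
import scipy.sparse as sp

def mean_aggregate(A, H):
    """H^k = D^{-1} A H^{k-1}: average-neighbour aggregation for all nodes at once."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    D_inv = sp.diags(1.0 / np.maximum(deg, 1.0))
    return D_inv @ A @ H

def symmetric_aggregate(A, H):
    """H^k = D^{-1/2} A D^{-1/2} H^{k-1}: the symmetric (GCN-style) normalization."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(deg, 1.0)))
    return D_inv_sqrt @ A @ D_inv_sqrt @ H

# Usage sketch: a sparse 4-node path graph and random 8-dimensional embeddings.
A = sp.csr_matrix(np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=float))
H = np.random.randn(4, 8)
print(mean_aggregate(A, H).shape, symmetric_aggregate(A, H).shape)
```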
More GNN References
Tutorials and overviews:
- Relational inductive biases and graph networks (Battaglia et al., 2018)
- Representation learning on graphs: Methods and applications (Hamilton et al., 2017)
Attention-based neighborhood aggregation:
- Graph attention networks (Hoshen, 2017; Velickovic et al., 2018; Liu et al., 2018)
Embedding entire graphs:
- Graph neural nets with edge embeddings (Battaglia et al., 2016; Gilmer et al., 2017)
- Embedding entire graphs (Duvenaud et al., 2015; Dai et al., 2016; Li et al., 2018) and graph pooling (Ying et al., 2018, Zhang et al., 2018)
- Graph generation and relational inference (You et al., 2018; Kipf et al., 2018)
- How powerful are graph neural networks (Xu et al., 2017)
Embedding nodes:
- Varying neighborhood: Jumping knowledge networks (Xu et al., 2018), GeniePath (Liu et al., 2018)
- Position-aware GNN (You et al. 2019)
Spectral approaches to graph neural networks:
- Spectral graph CNN & ChebNet (Bruna et al., 2015; Defferrard et al., 2016)
- Geometric deep learning (Bronstein et al., 2017; Monti et al., 2017)
Other GNN techniques:
- Pre-training Graph Neural Networks (Hu et al., 2019)
- GNNExplainer: Generating Explanations for Graph Neural Networks (Ying et al., 2019)
Graph Attention Networks (GAT)
In simple neighbourhood aggregation, we just took the mean of the messages from the neighbouring nodes. But not all edges or neighbours should carry the same weight: we want to be able to weigh messages coming from different nodes according to the importance of each node's message. Let $\alpha_{vu}$ be the weighting factor of node $u$'s message to node $v$. In the simple neighbourhood aggregation above, $\alpha_{vu} = 1/|N(v)|$, because every message is averaged; there, the weight was defined explicitly.
The key idea behind Graph Attention Networks is that we want to assign the weights to messages from different nodes implicitly, and learn them by back-propagation.
Let $\alpha_{vu}$ be computed as a byproduct of an attention mechanism $a$. Let $a$ compute attention coefficients $e_{vu}$ across pairs of nodes $u, v$ based on their messages:

$$e_{vu} = a(W_k h_u^{k-1}, W_k h_v^{k-1})$$

where $e_{vu}$ indicates the importance of node $u$'s message to node $v$.
We normalize the coefficients using the softmax function so that they are comparable across different neighbourhoods:
$$\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in N(v)}\exp(e_{vk})}$$
We then use these weight coefficients to compute the level-$k$ embeddings from the level-$(k-1)$ embeddings as follows:
$$h_v^k = \sigma\bigg(\sum_{u \in N(v)} \alpha_{vu}W_k h_u^{k-1}\bigg)$$
Now let's talk about what kind of attention mechanism $a$ we can use. The approach is agnostic to the choice of $a$; one option is a simple single-layer neural network. $a$ can have parameters of its own, which are trained together with the weight matrices in an end-to-end fashion.
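A minimal single-head sketch of such an attention layer in PyTorch, using a single-layer network for $a$ as in the common GAT formulation (dense adjacency assumed, names illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention sketch; assumes a dense adjacency matrix
    in which every node has at least one neighbour (or a self-loop)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared transform W_k
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # single-layer attention net

    def forward(self, h, adj):
        Wh = self.W(h)                                     # [N, out_dim]
        N = Wh.size(0)
        # e_{vu} = a(W h_v, W h_u) for every ordered pair (v, u).
        pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                           Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs)).squeeze(-1)        # [N, N]
        # Softmax only over actual neighbours: mask out non-edges.
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=1)                    # alpha_{vu}
        # h_v^k = sigma(sum_u alpha_{vu} W_k h_u^{k-1})
        return torch.relu(alpha @ Wh)
```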
Another idea is multi-head attention, which stabilizes the learning process of the attention mechanism [Velickovic et al., ICLR 2018]. Here the attention operations in a given layer are independently replicated $R$ times, each replica with different parameters, and the outputs are aggregated by concatenation or addition.
The key benefit is that it allows implicitly specifying different importance values $\alpha_{vu}$ for different neighbours.
Computation of the attention coefficients can be parallelized across all edges of the graph, and the aggregation may be parallelized across all nodes.
Sparse matrix operations require no more than $O(V+E)$ entries to be stored, and the number of parameters is fixed, irrespective of the graph size. Like the simpler aggregation methods, it retains the inductive capability and does not depend on the global graph structure: we can still embed newly introduced nodes, train on partial graph structures, and so on.
Attention mechanisms can be used with many different neural network models. In many cases attention leads to performance gains.
PinSAGE and other applications
PinSage is a recommendation engine built by Pinterest based on Graph Neural Networks. Pinterest has 300M users, with 4 billion pins and 2 billion boards. The use case is to generate node embeddings on a graph of 2B pins and 1B boards with 20B edges, in order to recommend new pins to boards. The graph is very dynamic, so we need to be able to apply the embeddings to new nodes without retraining the model. Nodes also have rich features, mostly content such as images.
These node embeddings are essential for various downstream tasks like recommendation, classification, clustering, and ranking.
They need to learn with a resolution of 100 vs. 3B, i.e. distinguish the truly relevant items from among roughly 3 billion candidates. They achieved this by training with harder and harder negative samples: negatives that are increasingly close to the positives in embedding space and therefore harder to tell apart.
The dataset is 10,000x larger than in any previous graph neural network application. Key innovations:
- Sub-sample neighbourhoods for efficient GPU batching.
- Producer-consumer CPU-GPU training pipeline.
- Curriculum learning for negative samples.
- MapReduce for efficient inference.
On-the-fly graph convolutions
Sample the neighbourhood around a node and dynamically construct a computation graph, then perform a localized graph convolution around that particular node. This way, the entire graph is not needed during training.
Construct convolutions via random walks
Performing convolutions on full neighbourhoods was infeasible, so they use importance pooling with the help of random walks: simulate random walks from a node and select the neighbours with the highest visit counts.
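A rough sketch of importance-based neighbourhood selection via short random walks; this illustrates the idea only and is not the actual PinSAGE implementation:

```python
import random
from collections import Counter

def important_neighbours(adj, node, num_walks=200, walk_len=2, top_k=10):
    """Select a node's most 'important' neighbours by visit counts of short
    random walks (hypothetical helper, not the actual PinSAGE code)."""
    visits = Counter()
    for _ in range(num_walks):
        cur = node
        for _ in range(walk_len):
            neighbours = adj.get(cur, [])
            if not neighbours:
                break
            cur = random.choice(neighbours)
            if cur != node:
                visits[cur] += 1
    # The top-k most visited nodes form the sampled computation-graph neighbourhood.
    return [n for n, _ in visits.most_common(top_k)]

# Usage sketch on a tiny adjacency-list graph.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 4], 3: [0], 4: [2]}
print(important_neighbours(adj, 0, top_k=3))
```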
Efficient MapReduce inference
Bottom-up aggregation of node embeddings lends itself to MapReduce: decompose each aggregation step across all nodes into three MapReduce operations, map, join, and reduce.
General Tips to Train a GNN
Data preprocessing
- Use renormalization tricks
- Variance-scaled initialization
- Network data whitening
Optimizer
- The Adam optimizer works well; it takes care of decaying the learning rate.
Activation
The ReLU activation function often works well. Don't use an activation function at the output layer.
Bias
Include bias term in every layer.
Layer size
A GCN layer size of 64 or 128 is usually big enough; there is no need to go bigger than that.
Debugging
If the loss/accuracy is not converging during training, one way to test whether your model code has bugs is to try to overfit the training data: accuracy should be essentially 100% (or the error close to 0) on a very small training set. If the neural network cannot overfit even a single data point, something is wrong; scrutinize your loss function, visualizations, etc.
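A minimal sketch of this sanity check, using a generic PyTorch model in place of a GNN:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy overfitting test: a small model should drive the loss to ~0 on a
# handful of examples; if it cannot, the model/loss code likely has a bug.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(4, 16), torch.randint(0, 2, (4, 1)).float()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(500):
    opt.zero_grad()
    loss = F.binary_cross_entropy_with_logits(model(x), y)
    loss.backward()
    opt.step()
print(loss.item())  # should be close to 0
```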