8. Graph Neural Networks
Intuition
Given a graph $G(V, E)$, our goal is to map each node $v$ to its own $d$-dimensional embedding, or representation, that captures the node's local graph structure and data (node features, edge features of edges connecting to the node, features of neighbouring nodes weighted by their importance, etc.) such that nodes similar to $v$ are close together in the embedding space.
Our goal is to find an encoder function $ENC(v)$ that maps every node to an embedding $z_v$ such that nodes that are similar, based on their network neighbourhood and features, are close to each other in the embedding space. Once we have an encoder of a node's local graph structure and features, we can use the embeddings for any downstream task, say classifying nodes.
We learned shallow encoding methods in the previous report, where $ENC(v)$ is a simple embedding lookup into an $N \times d$ matrix. We can look at shallow encoders as a single layer of data transformation: one hidden layer that maps node $u$ to embedding $z_u$ via a function $f$, where $z_u = f(z_v, v \in N_R(u))$.
Shallow encoders have several limitations. They need $O(|V|)$ parameters, since the embedding lookup matrix is of size $|V| \times d$; parameters are not shared between nodes, and every node has its own unique embedding. They are also inherently transductive: they cannot generate embeddings for nodes not seen during training. Imagine we have a social network graph and have learned the embedding matrix, and a new node joins the network; we would have to retrain the network using the embedding methods discussed earlier to get a new embedding matrix. Another disadvantage is that they do not incorporate node features; they only make use of the local/global graph structure of a node via random walks.
We want a better encoding methodology that solves all these problems and automatically applies multiple layers of non-linear transformations to each node's graph structure. Graph Neural Networks are exactly that. These deep encoders can be combined with the node similarity functions defined in the previous report, if that is the downstream task, or trained on various classification tasks, letting back-propagation figure out the parameters best suited to the task at hand.
The problem is that networks have arbitrary size and complex topological structure: there is no spatial locality as in image grids, and no fixed node ordering or reference point. They are often dynamic and have multimodal features. For a given node, we should be able to transform the "information" from its neighbours, combine it, and come up with something more useful than the initial node features.
Naive approach
A naive approach would be to concatenate the adjacency matrix and node features and feed them into a deep neural network for some downstream task. This is not applicable to graphs of different sizes, and the same graph has up to $N!$ different adjacency matrix representations, since a graph is invariant to node ordering. If we feed the trained network a different adjacency matrix representing the same graph, it will, intuitively, not work.
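To make the ordering problem concrete, the small sketch below (illustrative, not from the source) relabels the nodes of a toy graph and shows that the flattened adjacency-plus-features input a plain feed-forward network would see changes completely, even though the graph itself does not:

```python
import numpy as np

# Tiny graph: 3 nodes, edges (0-1) and (1-2), one feature per node.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
X = np.array([[1.0], [2.0], [3.0]])

# Relabel the same graph with node order [2, 0, 1] via a permutation matrix.
perm = np.array([2, 0, 1])
P = np.eye(3)[perm]
A_perm, X_perm = P @ A @ P.T, P @ X

# The two flattened inputs differ, so a plain MLP would treat the same
# graph as two different examples.
print(np.concatenate([A.ravel(), X.ravel()]))
print(np.concatenate([A_perm.ravel(), X_perm.ravel()]))
```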
Computational Graph
So we look at each node's neighbourhood as its own computation graph, and propagate information across the graph to compute node embeddings. We aggregate the information passed from a node's neighbours, who were themselves informed by their own neighbours.
In the above image, consider the input graph and a target node $A$; the graph on the right shows the computation graph of node $A$ based on its neighbourhood. Intuitively, $A$ gets messages from its neighbouring nodes $[B, C, D]$ and processes them somehow, and the nodes $[B, C, D]$ in turn process information from their own neighbours. Look at the edge directions to make sense of the computation graph. This is a computation graph of depth 2, where the original node features of nodes $[A, C]$ are passed to node $B$, transformed, and then passed on to node $A$.
The key intuition is that every node has its own computation graph. We use dynamic programming to simultaneously compute the node embeddings of all nodes at each layer of the computation graph and feed them into the next layer.
A deep model can be of arbitrary depth; the model and the corresponding computation graph shown below are of depth 2.
Nodes have embeddings at each layer. The layer-0 embedding of node $u$ is its input feature vector $x_u$, and the layer-$K$ embedding aggregates information from nodes that are $K$ hops away in the computation graph.
The black boxes in the picture are the neural networks, and the arrows are messages from neighbours. Keep in mind that at a given level we use the same neural network, with the same parameters, for the computation graphs of all nodes. At every level, the aggregated messages go through the neural network to give us that level's node embeddings. We also have to make sure to aggregate the messages from all the neighbours in a node-order-agnostic way.
Keep in mind that at every level of the computation graph, the actual graph/network structure doesn't change; the only thing changing/improving is the node embedding. At level 1, node embeddings make use of the node embeddings at level 0 (the original node features), and at level 2, node embeddings make use of the node embeddings at level 1. In general, at level $k$ we compute node embeddings using the level $k-1$ embeddings.
This is the core intuition behind Graph Neural Networks. How we aggregate the messages at each level, and how we compute a node's message using its neighbourhood information, are the variations among Graph Neural Networks.
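As a small illustration of how a node's computation graph is unrolled, here is a sketch that expands a node's depth-$K$ neighbourhood layer by layer. The helper and the example graph are made up for illustration; only $A$'s and $B$'s neighbours follow the description above.

```python
def computation_graph(adj, target, depth):
    """Sketch: unroll the depth-`depth` computation graph of `target`,
    returning the nodes contributing messages at each layer
    (illustrative helper, not from the source)."""
    layers = [[target]]
    for _ in range(depth):
        layers.append([nbr for node in layers[-1] for nbr in adj.get(node, [])])
    return layers


# A small graph loosely consistent with the description above (A's neighbours
# are B, C, D and B's neighbours are A and C; the rest is made up).
adj = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["A", "B", "E", "F"],
       "D": ["A"], "E": ["C"], "F": ["C"]}
print(computation_graph(adj, "A", depth=2))
```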
Graph Convolutional Network (GCN)
The aggregation method we will use here is averaging the neighbour messages. This is how we compute the layer-$k$ embedding of node $v$ given the layer-$(k-1)$ embeddings of its neighbourhood, for a computation graph of depth $K$:
$$h_v^0 = x_v$$

$$h_v^k = \sigma \bigg( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \bigg), \quad \forall k \in \{1, \cdots, K\}$$

$$z_v = h_v^K$$
In the above equations, $z_v$ is the embedding after $K$ layers of neighbourhood aggregation, $\sigma$ is the non-linearity, $h_v^{k-1}$ is the previous-layer embedding of the node itself, and $N(v)$ is the set of neighbouring nodes of node $v$. $W_k$ and $B_k$ are the trainable matrices that we will learn.
We can feed these embeddings into any loss function and run stochastic gradient descent to learn the weight parameters $W_k$ and $B_k$.
The above equation can be written in matrix form as
$$H^{l} = \sigma\big( H^{l-1} W_0^{l-1} + \tilde A H^{l-1} W_1^{l-1} \big)$$

with $\tilde A = D^{-1/2} A D^{-1/2}$ and $H^l = [{h_1^l}^T, \cdots, {h_N^l}^T]^T$.
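A minimal sketch of one such layer in PyTorch, assuming a dense adjacency matrix (the class and variable names are illustrative, not from the source):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Sketch of the matrix-form GCN update above (dense adjacency assumed)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W1 = nn.Linear(in_dim, out_dim, bias=False)  # neighbour transform
        self.W0 = nn.Linear(in_dim, out_dim, bias=False)  # self transform

    def forward(self, h, adj):
        # A_tilde = D^{-1/2} A D^{-1/2} (symmetric normalization)
        deg = adj.sum(dim=1).clamp(min=1)
        d_inv_sqrt = deg.pow(-0.5)
        a_tilde = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)
        # H^l = sigma(H^{l-1} W_0 + A_tilde H^{l-1} W_1)
        return torch.relu(self.W0(h) + a_tilde @ self.W1(h))


# Usage sketch: 5 nodes, 8 input features, 16-dimensional embeddings.
adj = torch.randint(0, 2, (5, 5)).float()
h = torch.randn(5, 8)
layer = GCNLayer(8, 16)
print(layer(h, adj).shape)  # torch.Size([5, 16])
```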
Unsupervised
We can train the above model in an unsupervised manner, using only the graph structure, so that "similar" nodes get similar embeddings. The unsupervised loss function can be based on:
- Random walk methods we discussed in the last report like node2vec, DeepWalk, struc2vec.
- Graph factorization
- Node proximity in the graph.
Supervised
We can also directly train the model for a supervised task, like node classification. For node classification, an example loss function on the node embeddings might look like this:
$$\mathcal{L} = \sum_{v \in V} y_v \log(\sigma(z_v^T\theta)) + (1 - y_v)\log(1 - \sigma(z_v^T\theta))$$

where $y_v$ is the node's class label and $\theta$ are the classification weights.
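A minimal PyTorch sketch of this loss, assuming `z` holds the final node embeddings, `y` the binary node labels, and `theta` a learnable weight vector (all hypothetical names):

```python
import torch
import torch.nn.functional as F

# Hypothetical tensors standing in for the GNN outputs and labels.
num_nodes, d = 100, 64
z = torch.randn(num_nodes, d, requires_grad=True)   # embeddings z_v
y = torch.randint(0, 2, (num_nodes,)).float()        # binary labels y_v
theta = torch.randn(d, requires_grad=True)           # classification weights

logits = z @ theta                                    # z_v^T theta for every node
# binary_cross_entropy_with_logits applies the sigma(.) and log terms above.
loss = F.binary_cross_entropy_with_logits(logits, y)
loss.backward()
print(loss.item())
```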
Overview
- Define a neighbourhood aggregation function that is node-order agnostic; in this case it is the average.
- Define a loss function on the embeddings specific to the task.
- Train on a set of nodes i.e., a batch of compute graphs.
- Generate embeddings as needed, even for nodes we never trained on.
- The same aggregation parameters are shared across all nodes, the number of model parameters is sublinear in $|V|$, and we can generalize to unseen nodes.
- Using inductive node embeddings we can generalize to entirely unseen graphs. For example, train on a protein interaction graph from model organism A and generate embeddings for newly collected data about organism B.
- Many application settings constantly encounter previously unseen nodes, e.g. Reddit or YouTube. We are able to generate node embeddings "on the fly" for new nodes that slightly change the graph structure.
GraphSAGE
So far we have aggregated the neighbour messages by taking their average. We can instead use any differentiable, order-agnostic function that maps the set of vectors in $N(v)$ to a single vector. We can also, for example, apply L2 normalization to each node embedding at every layer.
$$h_v^k = \sigma\big(\big[ A_k \cdot AGG(\{ h_u^{k-1}, \forall u \in N(v) \}),\ B_k h_v^{k-1} \big]\big)$$
GraphSAGE differs from the simple neighbourhood aggregation of the previous section in two key ways: instead of averaging the neighbourhood embeddings, it applies a generalized aggregation, and instead of adding the aggregated messages to the self embedding at every layer, it concatenates the aggregated neighbourhood embedding with the self embedding.
The aggregation function can be a mean/average, a pooling function, or an LSTM.
Mean
Take the average of the neighbour embeddings:
$$AGG = \sum_{u\in N(v)} \frac{h_u^{k-1}}{|N(v)|}$$
Pool
Transform the neighbour embedding vectors and apply a symmetric vector function $\gamma$:
$$AGG = \gamma(\{ Q h_u^{k-1}, \forall u \in N(v)\})$$
LSTM
Apply an LSTM to a random permutation $\pi$ of the neighbourhood embeddings from the previous layer:
$$AGG = LSTM([h_u^{k-1}, \forall u \in \pi(N(v))])$$
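A minimal PyTorch sketch of a GraphSAGE-style layer with mean aggregation and L2 normalization, under the same dense-adjacency assumption as before (names are illustrative):

```python
import torch
import torch.nn as nn

class SAGELayer(nn.Module):
    """GraphSAGE-style layer with mean aggregation and L2 normalization
    (dense adjacency assumed; names are illustrative)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.A = nn.Linear(in_dim, out_dim, bias=False)   # A_k: neighbour transform
        self.B = nn.Linear(in_dim, out_dim, bias=False)   # B_k: self transform

    def forward(self, h, adj):
        # AGG = mean of the neighbours' previous-layer embeddings.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = (adj @ h) / deg
        # Concatenate transformed neighbour aggregate with transformed self embedding.
        out = torch.relu(torch.cat([self.A(agg), self.B(h)], dim=1))
        # Optional L2 normalization of every node embedding at every layer.
        return out / out.norm(dim=1, keepdim=True).clamp(min=1e-12)
```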
Efficient Implementations of Aggregations
Many aggregations can be performed efficiently by sparse matrix operations.
Let $H^{k-1} = [h_1^{k-1},\cdots,h_n^{k-1}]$.
$$h_v^k = \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} \implies H^k = D^{-1}AH^{k-1}$$
Another example implementation is the GCN of Kipf et al. (2017), where

$$H^k = D^{-1/2} A D^{-1/2} H^{k-1}$$
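A small sketch of these two matrix-form aggregations using SciPy sparse matrices (function names are illustrative):

```python
import numpy as np
import scipy.sparse as sp

def mean_aggregate(A, H):
    """H^k = D^{-1} A H^{k-1}: average-neighbour aggregation for all nodes at once."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    D_inv = sp.diags(1.0 / np.maximum(deg, 1.0))
    return D_inv @ A @ H

def symmetric_aggregate(A, H):
    """H^k = D^{-1/2} A D^{-1/2} H^{k-1}: the symmetric (GCN-style) normalization."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(deg, 1.0)))
    return D_inv_sqrt @ A @ D_inv_sqrt @ H

# Usage sketch: a sparse 4-node path graph and random 8-dimensional embeddings.
A = sp.csr_matrix(np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=float))
H = np.random.randn(4, 8)
print(mean_aggregate(A, H).shape, symmetric_aggregate(A, H).shape)
```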
More GNN References
Tutorials and overviews:
- Relational inductive biases and graph networks (Battaglia et al., 2018)
- Representation learning on graphs: Methods and applications (Hamilton et al., 2017)
Attention-based neighborhood aggregation:
- Graph attention networks (Hoshen, 2017; Velickovic et al., 2018; Liu et al., 2018)
Embedding entire graphs:
- Graph neural nets with edge embeddings (Battaglia et al., 2016; Gilmer et al., 2017)
- Embedding entire graphs (Duvenaud et al., 2015; Dai et al., 2016; Li et al., 2018) and graph pooling (Ying et al., 2018, Zhang et al., 2018)
- Graph generation and relational inference (You et al., 2018; Kipf et al., 2018)
- How powerful are graph neural networks (Xu et al., 2017)
Embedding nodes:
- Varying neighborhood: Jumping knowledge networks (Xu et al., 2018), GeniePath (Liu et al., 2018)
- Position-aware GNN (You et al. 2019)
Spectral approaches to graph neural networks:
- Spectral graph CNN & ChebNet (Bruna et al., 2015; Defferrard et al., 2016)
- Geometric deep learning (Bronstein et al., 2017; Monti et al., 2017)
Other GNN techniques:
- Pre-training Graph Neural Networks (Hu et al., 2019)
- GNNExplainer: Generating Explanations for Graph Neural Networks (Ying et al., 2019)
Graph Attention Networks (GAT)
In simple neighbourhood aggregation, we just took the mean of the messages from the neighbouring nodes. But not all edges or neighbours should carry the same weight: we want to be able to weigh messages coming from different nodes according to the importance of each node's message. Let $\alpha_{vu}$ be the weighting factor of node $u$'s message to node $v$. In the simple neighbourhood aggregation above, $\alpha_{vu} = 1/|N(v)|$, because every message is averaged; there, the weight was defined explicitly.
The key idea behind Graph Attention Networks is that we want to assign the weights to messages from different nodes implicitly, and learn them by back-propagation.
Let $\alpha_{vu}$ be computed as a byproduct of an attention mechanism $a$. Let $a$ compute attention coefficients $e_{vu}$ across pairs of nodes $u, v$ based on their messages:

$$e_{vu} = a(W_k h_u^{k-1}, W_k h_v^{k-1})$$

where $e_{vu}$ indicates the importance of node $u$'s message to node $v$.
We normalize the coefficients using the softmax function so that they are comparable across different neighbourhoods:
$$\alpha_{vu} = \frac{\exp(e_{vu})}{\sum_{k \in N(v)}\exp(e_{vk})}$$
We then use these weight coefficients to compute the level-$k$ embeddings from the level-$(k-1)$ embeddings as follows:
$$h_v^k = \sigma\bigg(\sum_{u \in N(v)} \alpha_{vu}W_k h_u^{k-1}\bigg)$$
Now let's talk about what kind of attention mechanism $a$ we can use. The approach is agnostic to the choice of $a$; one option is a simple single-layer neural network. $a$ can have parameters of its own, which are trained together with the weight matrices in an end-to-end fashion.
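A minimal single-head sketch of such an attention layer in PyTorch, using a single-layer network for $a$ as in the common GAT formulation (dense adjacency assumed, names illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention sketch; assumes a dense adjacency matrix
    in which every node has at least one neighbour (or a self-loop)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared transform W_k
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # single-layer attention net

    def forward(self, h, adj):
        Wh = self.W(h)                                     # [N, out_dim]
        N = Wh.size(0)
        # e_{vu} = a(W h_v, W h_u) for every ordered pair (v, u).
        pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                           Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs)).squeeze(-1)        # [N, N]
        # Softmax only over actual neighbours: mask out non-edges.
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=1)                    # alpha_{vu}
        # h_v^k = sigma(sum_u alpha_{vu} W_k h_u^{k-1})
        return torch.relu(alpha @ Wh)
```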
Another idea is multi-head attention, which stabilizes the learning process of the attention mechanism [Velickovic et al., ICLR 2018]. Here the attention operations in a given layer are independently replicated $R$ times, each replica with different parameters, and the outputs are aggregated by concatenation or addition.
The key benefit is that it allows implicitly specifying different importance values $\alpha_{vu}$ for different neighbours.
Computation of the attention coefficients can be parallelized across all edges of the graph, and the aggregation may be parallelized across all nodes.
Sparse matrix operations require no more than $O(V+E)$ entries to be stored, and the number of parameters is fixed, irrespective of the graph size. Like the simpler aggregation methods, it retains the inductive capability and does not depend on the global graph structure: we can still embed newly introduced nodes, train on partial graph structures, and so on.
Attention mechanisms can be used with many different neural network models. In many cases attention leads to performance gains.
PinSAGE and other applications
PinSage is a recommendation engine built by Pinterest based on Graph Neural Networks. Pinterest has 300M users, with 4 billion pins and 2 billion boards. The use case is to generate node embeddings on a graph of 2B pins and 1B boards with 20B edges, in order to recommend new pins to boards. The graph is very dynamic, so we need to be able to apply the embeddings to new nodes without retraining the model. Nodes also have rich features, mostly content such as images.
These node embeddings are essential for various downstream tasks like recommendation, classification, clustering, and ranking.
They need to learn with a resolution of 100 vs. 3B, i.e. distinguish the truly relevant items from among roughly 3 billion candidates. They achieved this by training with harder and harder negative samples: negatives that are increasingly close to the positives in embedding space and therefore harder to tell apart.
The dataset is 10,000x larger than in any previous graph neural network application. Key innovations:
- Sub-sample neighbourhoods for efficient GPU batching.
- Producer-consumer CPU-GPU training pipeline.
- Curriculum learning for negative samples.
- MapReduce for efficient inference.
On-the-fly graph convolutions
Sample the neighbourhood around a node and dynamically construct a computation graph, then perform a localized graph convolution around that particular node. This way, the entire graph is not needed during training.
Construct convolutions via random walks
Performing convolutions on full neighbourhoods was infeasible, so they use importance pooling with the help of random walks: simulate random walks from a node and select the neighbours with the highest visit counts.
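A rough sketch of importance-based neighbourhood selection via short random walks; this illustrates the idea only and is not the actual PinSAGE implementation:

```python
import random
from collections import Counter

def important_neighbours(adj, node, num_walks=200, walk_len=2, top_k=10):
    """Select a node's most 'important' neighbours by visit counts of short
    random walks (hypothetical helper, not the actual PinSAGE code)."""
    visits = Counter()
    for _ in range(num_walks):
        cur = node
        for _ in range(walk_len):
            neighbours = adj.get(cur, [])
            if not neighbours:
                break
            cur = random.choice(neighbours)
            if cur != node:
                visits[cur] += 1
    # The top-k most visited nodes form the sampled computation-graph neighbourhood.
    return [n for n, _ in visits.most_common(top_k)]

# Usage sketch on a tiny adjacency-list graph.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 4], 3: [0], 4: [2]}
print(important_neighbours(adj, 0, top_k=3))
```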
Efficient MapReduce inference
Bottom-up aggregation of node embeddings lends itself to MapReduce: decompose each aggregation step across all nodes into three MapReduce operations, map, join, and reduce.
General Tips to Train a GNN
Data preprocessing
- Use renormalization tricks
- Variance-scaled initialization
- Network data whitening
Optimizer
- The Adam optimizer works well; it takes care of decaying the learning rate.
Activation
The ReLU activation function often works well. Don't use an activation function at the output layer.
Bias
Include bias term in every layer.
Layer size
A GCN layer size of 64 or 128 is usually big enough; there is no need to go bigger than that.
Debugging
If the loss/accuracy is not converging during training, one way to test whether your model code has bugs is to try to overfit the training data: accuracy should be essentially 100% (or the error close to 0) on a very small training set. If the neural network cannot overfit even a single data point, something is wrong; scrutinize your loss function, visualizations, etc.
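A minimal sketch of this sanity check, using a generic PyTorch model in place of a GNN:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy overfitting test: a small model should drive the loss to ~0 on a
# handful of examples; if it cannot, the model/loss code likely has a bug.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(4, 16), torch.randint(0, 2, (4, 1)).float()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(500):
    opt.zero_grad()
    loss = F.binary_cross_entropy_with_logits(model(x), y)
    loss.backward()
    opt.step()
print(loss.item())  # should be close to 0
```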