7. Graph Representation Learning
Introduction
There are various ML tasks on graphs, such as node classification and link prediction, as shown below.
Node Classification
Link Prediction
The machine learning life cycle includes many steps: we do feature engineering on raw data to get structured data suited for the ML task at hand, and then run some learning algorithm on that structured data to build the final model. Feature engineering used to be the core challenge until neural networks took over in image recognition, NLP, and other ML tasks; now we let the neural network figure out embeddings of the original data instead of engineering them by hand. We want the same to happen for ML tasks on graphs.
Our goal is to do feature learning on nodes/graphs efficiently, independent of the downstream task, be it node classification, clustering, link prediction, etc. We want an effective and efficient representation of nodes in an embedding space, $f: u \rarr \Re^d$, that captures the network structure in a node's neighbourhood.
We want to map each node in a network into a low-dimensional space whose geometry reflects the structure of the network, so that similarity of embeddings between nodes indicates their similarity in the network. In other words, we encode network information to generate a node representation.
We go from an adjacency matrix of size $n \times n$ to a low $d$-dimensional latent space where $d \ll n$, so that we can use these embeddings for downstream ML tasks.
Below is an example embedding of the Zachary karate club network nodes into a two-dimensional space, keeping nodes that are clustered in the network close in the 2-d space.
Finding embeddings for graphs is harder because graphs have complex topological structure: there is no spatial locality as in grids (images), no ordering as in sequences (NLP), and no natural reference point. The same graph of size $n$ can be represented by $n!$ different (isomorphic) adjacency matrices. Graphs are also often dynamic and have multimodal features.
Embedding Nodes
Assume a graph $G$ with vertex set $V$ and adjacency matrix $A$, for now unweighted. Let's first use only the network structure, with no node features or any other extra information.
Our goal is to encode nodes in such a way that similarity in the embedding space (e.g. cosine similarity or dot product) approximates similarity in the original network: $similarity(u, v) \approx z_v^T z_u$.
We need to define the $similarity$ function in the network space, define the encoder function $ENC$, a mapping from nodes to embeddings, and finally optimize the parameters of the encoder so that $similarity(u, v) \approx z_v^T z_u$.
The encoder function maps a node to the low-dimensional embedding space: $ENC(v) = z_v$.
Shallow Encoding
The simplest encoder is just an embedding lookup: $ENC(v) = Zv$, where $Z \in \Re^{d \times |\mathcal{V}|}$ is a matrix whose columns are the node embeddings we have to learn, and $v \in \mathbb{I}^{|\mathcal{V}|}$ is an indicator vector that is all zeros except for a one in the position indicating node $v$.
Each node is assigned a unique embedding vector. There are many methods to learn this embedding lookup, such as DeepWalk, node2vec, and TransE.
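As a minimal sketch (NumPy; the matrix size matches the karate club example above, and the names are mine, not from any particular library), the shallow encoder is just a column lookup:

```python
import numpy as np

rng = np.random.default_rng(0)

num_nodes, d = 34, 2                     # e.g. Zachary karate club embedded in 2 dimensions
Z = rng.normal(size=(d, num_nodes))      # embedding matrix, one column per node (to be learned)

def encode(v: int) -> np.ndarray:
    """Shallow encoder: ENC(v) = Z @ one_hot(v), i.e. just a column lookup."""
    one_hot = np.zeros(num_nodes)
    one_hot[v] = 1.0
    return Z @ one_hot                   # equivalent to Z[:, v]

print(encode(0))                         # embedding of node 0
```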
Node similarity
Another challenge is how to define node similarity: should two nodes be similar if they are connected, if they share neighbours, or if they have similar structural roles? And how do we automatically infer these properties of nodes?
Random Walk Embeddings
The following section is based on two papers: DeepWalk (Perozzi et al., 2014) and node2vec (Grover et al., 2016).
Given a graph and a starting node, we select a neighbor of it at random and move to this neighbor; then we select a neighbor of that node at random, move to it, and continue the process. The random sequence of nodes selected this way is a random walk on the graph.
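A minimal sketch of such a walk, assuming the graph is stored as an adjacency-list dictionary (the toy graph and the function name are illustrative):

```python
import random

def random_walk(graph: dict, start, length: int) -> list:
    """Unbiased random walk: at each step, move to a uniformly random neighbor."""
    walk = [start]
    current = start
    for _ in range(length):
        neighbors = graph[current]
        if not neighbors:                # dead end: stop early
            break
        current = random.choice(neighbors)
        walk.append(current)
    return walk

# toy graph as adjacency lists
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(random_walk(graph, start=0, length=5))
```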
We can now define the similarity of two nodes $u$ and $v$ as the probability that they co-occur on a random walk over the network: the more probable the co-occurrence, the more similar they are.
We estimate the probability of visiting node $v$ on a random walk starting from node $u$ under some random walk strategy $R$, and optimize embeddings to encode these random walk statistics, so that the similarity function encodes random walk similarity: $z_u^T z_v \approx P_R(v|u)$.
We use random walks for their expressivity and efficiency. They are expressive because they give a flexible stochastic definition of node similarity that incorporates both local and higher-order neighbourhood information, and it is easy to parametrize both the locality and the higher-order reach of a walk. They are efficient because we do not have to consider all node pairs when training; we only need to consider pairs that co-occur on random walks.
Unsupervised Feature Learning
We learn node embeddings in $d$ dimensions that preserve similarity, where the similarity of nodes is defined via the neighbourhood $N_R(u)$ of node $u$ obtained by a random walk strategy $R$.
Given a graph G=(V,E)G=(V, E) we need a mapping z:u→ℜdz: u \rarr \Re^d. Using the log-likelihood objective maxz∑u∈VlogP(NR(u)∣zu)\displaystyle \max_z \sum_{u \in V} \log P(N_R(u)|z_u), where NR(u)N_R(u) is neighbourhood of node uu by strategy RR.
Given a node $u$, we want to learn feature representations that are predictive of the nodes in its neighbourhood $N_R(u)$ using the log-likelihood objective above.
Random walk optimization
We run short fixed-length random walks starting from each node of the graph using some strategy $R$. For each node $u$ we collect $N_R(u)$, the multiset of nodes visited on random walks starting from $u$ (a multiset because the same node can be visited multiple times across the walks from a single node). We then optimize embeddings according to the objective $\max_z \sum_{u \in V} \log P(N_R(u) \mid z_u)$: we want to maximize the conditional probability of the neighbourhood of $u$ given its embedding vector $z_u$.
This gives the following loss function, which optimizes embeddings to maximize the likelihood of random walk co-occurrences:
$$\mathcal{L} = \sum_{u \in V} \sum_{v \in N_R(u)} - \log P(v \mid z_u)$$
We parameterize $P(v \mid z_u)$ using a softmax. We want node $v$ to be the most similar to node $u$ out of all nodes $n$, and the softmax captures this because $\sum_i \exp(x_i) \approx \max_i \exp(x_i)$:
$$P(v \mid z_u) = \frac{\exp(z_u^T z_v)}{\sum_{n\in V}\exp(z_u^T z_n)}$$
Replacing $P(v \mid z_u)$ with its parameterized version, we get
$$\mathcal{L} = \sum_{u \in V}\sum_{v \in N_R(u)} - \log\bigg(\frac{\exp(z_u^T z_v)}{\sum_{n\in V}\exp(z_u^T z_n)}\bigg)$$
The loss $\mathcal{L}$ sums the negative log of the predicted probability of $u$ and $v$ co-occurring on a random walk, over all $v$ seen on random walks starting from $u$ and over all $u$ in the network. Optimizing random walk embeddings means finding embeddings $z_u$ that minimize $\mathcal{L}$.
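To make the objective concrete, here is a minimal sketch of the naive full-softmax loss on a toy example (NumPy; the embedding matrix and the walk neighbourhoods are made up for illustration). Note that the inner normalization runs over all nodes, which is exactly what makes this expensive:

```python
import numpy as np

def naive_loss(Z: np.ndarray, walk_neighborhoods: dict) -> float:
    """Full-softmax random-walk loss: sum over u, over v in N_R(u), of -log P(v | z_u)."""
    loss = 0.0
    for u, neighborhood in walk_neighborhoods.items():
        scores = Z @ Z[u]                                    # z_u^T z_n for every node n
        log_probs = scores - np.log(np.exp(scores).sum())    # log softmax over all nodes
        for v in neighborhood:                               # multiset of co-occurring nodes
            loss -= log_probs[v]
    return loss

rng = np.random.default_rng(0)
Z = rng.normal(scale=0.1, size=(4, 8))                       # 4 nodes, 8-dim embeddings
walk_neighborhoods = {0: [1, 2, 1], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(naive_loss(Z, walk_neighborhoods))
```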
But evaluating this naively costs $O(|V|^2)$ because of the normalization term in the softmax, so we use negative sampling to approximate the softmax term, as below:
$$\log\bigg(\frac{\exp(z_u^T z_v)}{\sum_{n \in V}\exp(z_u^T z_n)}\bigg) \approx \log(\sigma(z_u^T z_v)) - \sum_{i=1}^k \log(\sigma(z_u^T z_{n_i})), \quad n_i \sim P_V$$
The sigmoid function turns each term into a probability, and instead of normalizing with respect to all nodes we only normalize against $k$ random "negative samples" $n_i$ drawn from a distribution over all nodes. Even though this is a different objective function, negative sampling is a form of Noise Contrastive Estimation (NCE), which approximately maximizes the log probability of the softmax. The new formulation corresponds to using a logistic regression (sigmoid) to distinguish the target node $v$ from nodes $n_i$ sampled from the background distribution $P_V$. I didn't fully grasp the intuition behind this either, but you can find more on negative sampling here: https://arxiv.org/pdf/1402.3722.pdf
We sample $k$ negative nodes with probability proportional to their degree. Higher $k$ gives more robust estimates, but also corresponds to a higher bias on negative events. In practice $k = 5$ to $20$.
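Here is a minimal sketch of the negative-sampling term for a single $(u, v)$ pair, with degree-proportional sampling of negatives (NumPy; the toy embeddings and degrees are illustrative). I use the standard word2vec form, $\log\sigma(z_u^Tz_v) + \sum_i \log\sigma(-z_u^Tz_{n_i})$ negated as a loss, which is the version usually optimized in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(Z, u, v, degrees, k=5):
    """-log sigma(z_u.z_v) - sum_i log sigma(-z_u.z_{n_i}), with n_i ~ P_V (degree-proportional)."""
    p = degrees / degrees.sum()                        # negative-sampling distribution P_V
    negatives = rng.choice(len(degrees), size=k, p=p)
    loss = -np.log(sigmoid(Z[u] @ Z[v]))               # pull the positive pair together
    for n in negatives:
        loss -= np.log(sigmoid(-Z[u] @ Z[n]))          # push negative pairs apart
    return loss

Z = rng.normal(scale=0.1, size=(4, 8))                 # 4 nodes, 8-dim embeddings
degrees = np.array([2.0, 2.0, 3.0, 1.0])
print(neg_sampling_loss(Z, u=0, v=1, degrees=degrees))
```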
So far, given a random walk strategy, we have described how to optimize embeddings. The simplest strategy is to just run fixed-length, unbiased random walks starting from each node, as described in the DeepWalk paper, but that notion of similarity is too constrained. We need a more general approach.
node2vec: Biased Walks
We need a flexible notion of the network neighbourhood $N_R(u)$ of node $u$ that will eventually lead to rich node embeddings. The strategy of just running fixed-length unbiased walks is too constrained, so we develop a biased $2^{nd}$-order random walk strategy $R$ to generate the network neighbourhood $N_R(u)$ of node $u$.
We want a flexible, biased random walk strategy that can trade off between local and global views of the network: a local, microscopic view via BFS and a global, macroscopic view via DFS. We interpolate between BFS and DFS with two parameters: the return parameter $p$, controlling the probability of returning to the previous node in the walk, and the in-out parameter $q$, deciding whether to move inwards (BFS-like) or outwards (DFS-like) at a given node. Intuitively, $q$ is the ratio of BFS vs. DFS.
These biased $2^{nd}$-order random walks explore network neighbourhoods. At each step the walk has to remember where it came from, in case it needs to step back. In the image below, suppose a random walk traversed the edge $(s_1, w)$ and is now at $w$. There are four options from node $w$: go back to $s_1$, go to $s_2$ (the BFS option with respect to $s_1$), or go to $s_3$ or $s_4$ (the DFS options with respect to $s_1$). With return parameter $p$ and walk-away parameter $q$, the next move is chosen according to the unnormalized probabilities in the image: $1/p$ for returning, $1$ for staying at the same distance from $s_1$, and $1/q$ for moving away. $p$ and $q$ model the transition probabilities and are our hyperparameters.
Following this strategy, we compute the random walk transition probabilities, simulate $r$ random walks of length $l$ starting from each node $u$, and optimize the node2vec objective using stochastic gradient descent.
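A minimal sketch of one step of this biased walk, assuming unweighted adjacency lists; the unnormalized weights ($1/p$ to return, $1$ for a BFS-like move, $1/q$ for a DFS-like move) follow the figure, and the data layout and names are illustrative:

```python
import random

def biased_step(graph, prev, curr, p=1.0, q=1.0):
    """Pick the next node of a node2vec walk given the previous and current nodes."""
    candidates, weights = [], []
    for nxt in graph[curr]:
        if nxt == prev:                     # return to where we came from
            w = 1.0 / p
        elif nxt in graph[prev]:            # stays at distance 1 from prev (BFS-like)
            w = 1.0
        else:                               # moves away from prev (DFS-like)
            w = 1.0 / q
        candidates.append(nxt)
        weights.append(w)
    return random.choices(candidates, weights=weights, k=1)[0]

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
print(biased_step(graph, prev=0, curr=1, p=2.0, q=0.5))
```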
We can use these embeddings to do clustering/community detection, node classification or link prediction.
Clustering and node classification are straightforward: just run k-means on the embeddings, or train a classifier on the embedding vectors.
For link prediction, we predict edge $(i, j)$ based on $f(z_i, z_j)$, where we can concatenate, average, take a per-coordinate product of, or take a distance between the embeddings, as listed below (a small sketch follows the list):
- Concatenation: $f(z_i, z_j) = g([z_i, z_j])$
- Hadamard: $f(z_i, z_j) = g(z_i * z_j)$ (per-coordinate product)
- Sum/avg: $f(z_i, z_j) = g(z_i + z_j)$
- Distance: $f(z_i, z_j) = g(||z_i - z_j||_2)$
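A minimal sketch of these four operators (NumPy; the outer function $g$ is simply omitted here, since in practice it would be a downstream classifier such as logistic regression):

```python
import numpy as np

def edge_features(z_i: np.ndarray, z_j: np.ndarray) -> dict:
    """Combine two node embeddings into candidate edge representations."""
    return {
        "concat":   np.concatenate([z_i, z_j]),      # [z_i, z_j]
        "hadamard": z_i * z_j,                       # per-coordinate product
        "average":  (z_i + z_j) / 2.0,               # sum/avg
        "distance": np.linalg.norm(z_i - z_j),       # ||z_i - z_j||_2 (a scalar)
    }

rng = np.random.default_rng(0)
z_i, z_j = rng.normal(size=8), rng.normal(size=8)
print({name: np.shape(feat) for name, feat in edge_features(z_i, z_j).items()})
```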
There are other notions of similarity: adjacency-based and multi-hop similarity definitions, or random walk approaches. No single method is best in every scenario; we must choose a task-specific similarity. For example, node2vec performs better on node classification, while multi-hop methods perform better on link prediction (Goyal and Ferrara, 2017 survey).
TransE: Translating Embeddings for Multi-relational Data
TransE applies embeddings to knowledge graphs. A knowledge graph is composed of facts/statements about interrelated entities: nodes are entities and edges are relationships between them, and the relationships can be of many types. These knowledge graphs are huge but incomplete, with limited information about how entities are related to each other, which can limit downstream tasks, so predicting the edges/relationships between entities is an important problem. We want a link prediction model that learns from local and global connectivity patterns in the knowledge graph, taking into account entities and relationships of different types at the same time. Relation predictions are then performed by using the learned patterns to generalize the observed relationships between an entity of interest and all other entities.
In TransE, relationships between entities are represented as triplets $(h, l, t)$: head entity $h$, relation $l$, and tail entity $t$. Entities are first embedded in an entity space $\Re^k$ as in the previous methods, and relations are represented as translations: if the relationship $(h, l, t)$ holds, we want $h + l \approx t$, and otherwise $h + l \neq t$.
Entities and relations are initialized uniformly and normalized. We do triplet negative sampling with links that do not appear in the KG and use a contrastive, margin-based loss, which favours low distance values for valid triplets and high distance values for corrupted ones.
$$\mathcal{L} = \sum_{\big((h, l, t), (h', l, t')\big) \in T_{batch}}\big[\gamma + d(h+l, t) - d(h'+l, t')\big]_+$$
In the above equation, $(h, l, t)$ is a positive sample and $(h', l, t')$ a corrupted one, and $[\cdot]_+$ denotes the positive part, so the loss is zero once the margin $\gamma$ is satisfied.
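A minimal sketch of the TransE distance and the margin loss for a single positive/corrupted pair (NumPy; the L2 distance, the margin value, and the toy vectors are illustrative):

```python
import numpy as np

def transe_distance(h, l, t):
    """d(h + l, t): how far the translated head lands from the tail."""
    return np.linalg.norm(h + l - t)

def margin_loss(pos, neg, gamma=1.0):
    """[gamma + d(h + l, t) - d(h' + l, t')]_+  for one (positive, corrupted) pair."""
    (h, l, t), (hc, lc, tc) = pos, neg
    return max(0.0, gamma + transe_distance(h, l, t) - transe_distance(hc, lc, tc))

rng = np.random.default_rng(0)
h, l, t = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
h_corrupt = rng.normal(size=3)                       # corrupted head entity
print(margin_loss((h, l, t), (h_corrupt, l, t)))
```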
Embedding Entire Graphs
We want to be able to embed entire graphs into the latent space, for tasks like graph classification or graph anomaly detection, e.g. identifying toxic vs. non-toxic molecules.
A simple way of doing this is to sum all the node embeddings of a graph to get the graph embedding, as in Duvenaud et al., 2016, where graph embeddings were used to classify molecules:
$$z_G = \sum_{v\in G} z_v$$
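As a one-liner sketch (NumPy; the node embedding matrix is just a stand-in for learned embeddings):

```python
import numpy as np

node_embeddings = np.random.default_rng(0).normal(size=(10, 4))  # 10 nodes, 4-dim embeddings
z_G = node_embeddings.sum(axis=0)                                 # graph embedding z_G
print(z_G)
```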
Alternatively, we can introduce a virtual node to represent the graph (or a subgraph), run a standard node embedding technique, and then use the embedding of that virtual node as the embedding of the graph or subgraph. This was proposed by Li et al., 2016 as a general technique for subgraph embedding.
Anonymous Walk Embeddings
Another approach uses [Anonymous Walk Embeddings](https://arxiv.org/pdf/1805.11921.pdf). Essentially, we do random walks on the graph and anonymize the nodes visited: each state in an anonymous walk corresponds to the index of the first time that node was visited in the walk. Consider the following three walks in a graph with nodes $\{A, B, C, D, E, F, \cdots\}$:
- Random Walk 1: $[A\rarr B\rarr C\rarr B\rarr C]$
- Random Walk 2: $[C\rarr D\rarr B\rarr D\rarr B]$
- Random Walk 3: $[A\rarr B\rarr A\rarr B\rarr D]$
After anonymising the nodes, the first two random walks become identical: in each of them the first three nodes are distinct, the 4th node repeats the 2nd, and the 5th repeats the 3rd. So both random walk 1 and random walk 2 are stored as $[1\rarr 2\rarr 3\rarr 2\rarr 3]$, while the anonymised version of random walk 3 is $[1\rarr 2\rarr 1\rarr 2\rarr 3]$.
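A minimal sketch of the anonymization step, reproducing the three walks above (the helper name is mine):

```python
def anonymize(walk: list) -> list:
    """Replace each node by the index of its first appearance in the walk."""
    first_seen = {}
    anon = []
    for node in walk:
        if node not in first_seen:
            first_seen[node] = len(first_seen) + 1   # 1-based, as in the examples above
        anon.append(first_seen[node])
    return anon

print(anonymize(["A", "B", "C", "B", "C"]))   # [1, 2, 3, 2, 3]
print(anonymize(["C", "D", "B", "D", "B"]))   # [1, 2, 3, 2, 3]  -- same anonymous walk
print(anonymize(["A", "B", "A", "B", "D"]))   # [1, 2, 1, 2, 3]
```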
Essentially, anonymous random walks capture where a walk gets stuck and how often it revisits the same states; the idea is that they capture graph behaviour rather than the behaviour of individual nodes.
The number of anonymous walks grows exponentially with the length of the walk: there are 5 anonymous walks of length 3, but more than 4 million of length 12.
We enumerate all possible anonymous walks $a_i$ of $l$ steps and record their counts, representing the graph as a probability distribution over these walks: $Z_G[i]$ is the probability of anonymous walk $a_i$ in $G$.
Completely counting all anonymous walks in a large graph may be infeasible, so we sample to approximate the true distribution: generate $m$ independent random walks and compute the corresponding empirical distribution of anonymous walks.
The number of random walks $m$ needed so that the distribution has error of more than $\epsilon$ with probability less than $\delta$ is $m = \lceil\frac{2}{\epsilon^2}(\log(2^{\eta} - 2) - \log(\delta))\rceil$, where $\eta$ is the total number of anonymous walks of length $l$. For example, for walks of length $l=7$ there are $\eta=877$ distinct anonymous walks, and with $\epsilon=0.1$ and $\delta = 0.01$ we need to generate $m=122500$ random walks.
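A minimal sketch of this bound, reproducing the $l=7$ example above (reading the outer brackets in the formula as a ceiling, which is an assumption on my part):

```python
import math

def num_walks_needed(eta: int, eps: float, delta: float) -> int:
    """m = ceil( (2 / eps^2) * (log(2^eta - 2) - log(delta)) )"""
    return math.ceil((2.0 / eps**2) * (math.log(2**eta - 2) - math.log(delta)))

print(num_walks_needed(eta=877, eps=0.1, delta=0.01))   # 122500 for walks of length 7
```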
We can also learn an embedding $z_i$ of every anonymous walk $a_i$; the embedding of a graph $G$ is then the sum/avg/concatenation of the walk embeddings $z_i$.
Let $w_t^u$ be the $t$-th random walk starting at node $u$. We embed walks such that the next walk can be predicted: we set $z_i$ to maximize $P(w_t^u \mid w_{t-\Delta}^u, \cdots, w_{t-1}^u) = f(z)$.
We run $T$ different random walks from $u$, each of length $l$: $N_R(u) = [w_1^u,\cdots, w_T^u]$, and let $a_i$ be the anonymous version of walk $w_i$. We learn to predict walks that co-occur in a window of size $\Delta$, estimating the embedding $z_i$ of the anonymous walk $a_i$ of $w_i$ using
$$\max \frac{1}{T}\sum_{t=\Delta}^T \log P(w_t \mid w_{t-\Delta}, \cdots, w_{t-1})$$
The parameterized version of the probability in the above equation is
$$P(w_t \mid w_{t-\Delta},\cdots,w_{t-1}) = \frac{\exp(y(w_t))}{\sum_{i=1}^\eta \exp(y(w_i))}$$
$$y(w_t) = b + U \cdot \bigg(\frac{1}{\Delta}\sum_{i=1}^\Delta z_i\bigg), \quad b\in \Re,\; U \in \Re^d$$
In summary, we represent graphs via the distribution over all the anonymous walks, sample the walks to approximate the distribution, and embed the anonymous walks.
Chapters
- Introduction, Structure of Graphs
- Properties of Networks and Random Graph Models
- Motifs and Structural Roles in Networks
- Community Structure in Networks
- Spectral Clustering
- Message Passing and Node Classification
- Graph Representation Learning
- Graph Neural Networks
- Graph Neural Networks - Pytorch Geometric
- Deep Generative Models for Graphs
- Link Analysis: PageRank
- Network Effects and Cascading Behaviour
- Probabilistic Contagion and Models of Influence
- Influence Maximization in Networks
- Outbreak Detection in Networks
- Network Evolution
- Reasoning over Knowledge Graphs
- Limitations of Graph Neural Networks
- Applications of Graph Neural Networks