7. Graph Representation Learning
Introduction
There are various ML tasks on graphs, such as node classification and link prediction, as shown below.
Node Classification
Link Prediction
The machine learning life cycle includes many steps: we do feature engineering on raw data to get structured data suited for the ML task at hand, and then run some learning algorithm on that structured data to build the final model. Feature engineering used to be the core challenge until neural networks took over in image recognition, NLP, and other ML tasks; now we let the neural network figure out embeddings of the original data instead of engineering them by hand. We want the same to happen for ML tasks on graphs.
Our goal is to do feature learning on nodes/graphs efficiently, independent of the downstream task, be it node classification, clustering, link prediction, etc. We want an effective and efficient representation of nodes in an embedding space, $f: u \rarr \Re^d$, that captures the network structure in a node's neighbourhood.
We want to map each node in a network into a low-dimensional space whose geometry reflects the structure of the network, so that similarity of embeddings between nodes indicates their similarity in the network. In other words, we encode network information to generate a node representation.
We go from an adjacency matrix of size $n \times n$ to a low $d$-dimensional latent space where $d \ll n$, so that we can use these embeddings for downstream ML tasks.
Below is an example embedding of the Zachary karate club network nodes into a two-dimensional space, keeping nodes that are clustered in the network close in the 2-d space.
Finding embeddings for graphs is harder because graphs have complex topological structure: there is no spatial locality as in grids (images), no ordering as in sequences (NLP), and no natural reference point. The same graph of size $n$ can be represented by $n!$ different (isomorphic) adjacency matrices. Graphs are also often dynamic and have multimodal features.
Embedding Nodes
Assume a graph $G$ with vertex set $V$ and adjacency matrix $A$, for now unweighted. Let's first use only the network structure, with no node features or any other extra information.
Our goal is to encode nodes in such a way that similarity in the embedding space (e.g. cosine similarity or dot product) approximates similarity in the original network: $similarity(u, v) \approx z_v^T z_u$.
We need to define the $similarity$ function in the network space, define the encoder function $ENC$, a mapping from nodes to embeddings, and finally optimize the parameters of the encoder so that $similarity(u, v) \approx z_v^T z_u$.
The encoder function maps a node to the low-dimensional embedding space: $ENC(v) = z_v$.
Shallow Encoding
The simplest encoder is just an embedding lookup: $ENC(v) = Zv$, where $Z \in \Re^{d \times |\mathcal{V}|}$ is a matrix whose columns are the node embeddings we have to learn, and $v \in \mathbb{I}^{|\mathcal{V}|}$ is an indicator vector that is all zeros except for a one in the position indicating node $v$.
Each node is assigned a unique embedding vector. There are many methods to learn this embedding lookup, such as DeepWalk, node2vec, and TransE.
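As a minimal sketch (NumPy; the matrix size matches the karate club example above, and the names are mine, not from any particular library), the shallow encoder is just a column lookup:

```python
import numpy as np

rng = np.random.default_rng(0)

num_nodes, d = 34, 2                     # e.g. Zachary karate club embedded in 2 dimensions
Z = rng.normal(size=(d, num_nodes))      # embedding matrix, one column per node (to be learned)

def encode(v: int) -> np.ndarray:
    """Shallow encoder: ENC(v) = Z @ one_hot(v), i.e. just a column lookup."""
    one_hot = np.zeros(num_nodes)
    one_hot[v] = 1.0
    return Z @ one_hot                   # equivalent to Z[:, v]

print(encode(0))                         # embedding of node 0
```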
Node similarity
Another challenge is how to define node similarity: should two nodes be similar if they are connected, if they share neighbours, or if they have similar structural roles? And how do we automatically infer these properties of nodes?
Random Walk Embeddings
The following section is based on two papers: DeepWalk (Perozzi et al., 2014) and node2vec (Grover et al., 2016).
Given a graph and a starting node, we select a neighbor of it at random and move to this neighbor; then we select a neighbor of that node at random, move to it, and continue the process. The random sequence of nodes selected this way is a random walk on the graph.
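A minimal sketch of such a walk, assuming the graph is stored as an adjacency-list dictionary (the toy graph and the function name are illustrative):

```python
import random

def random_walk(graph: dict, start, length: int) -> list:
    """Unbiased random walk: at each step, move to a uniformly random neighbor."""
    walk = [start]
    current = start
    for _ in range(length):
        neighbors = graph[current]
        if not neighbors:                # dead end: stop early
            break
        current = random.choice(neighbors)
        walk.append(current)
    return walk

# toy graph as adjacency lists
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(random_walk(graph, start=0, length=5))
```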
We can now define the similarity of two nodes $u$ and $v$ as the probability that they co-occur on a random walk over the network: the more probable the co-occurrence, the more similar they are.
We estimate the probability of visiting node $v$ on a random walk starting from node $u$ under some random walk strategy $R$, and optimize embeddings to encode these random walk statistics, so that the similarity function encodes random walk similarity: $z_u^T z_v \approx P_R(v|u)$.
We use random walks for their expressivity and efficiency. They are expressive because they give a flexible stochastic definition of node similarity that incorporates both local and higher-order neighbourhood information, and it is easy to parametrize both the locality and the higher-order reach of a walk. They are efficient because we do not have to consider all node pairs when training; we only need to consider pairs that co-occur on random walks.
Unsupervised Feature Learning
We learn node embeddings in $d$ dimensions that preserve similarity, where the similarity of nodes is defined via the neighbourhood $N_R(u)$ of node $u$ obtained by a random walk strategy $R$.
Given a graph G=(V,E)G=(V, E) we need a mapping z:u→ℜdz: u \rarr \Re^d. Using the log-likelihood objective maxz∑u∈VlogP(NR(u)∣zu)\displaystyle \max_z \sum_{u \in V} \log P(N_R(u)|z_u), where NR(u)N_R(u) is neighbourhood of node uu by strategy RR.
Given a node $u$, we want to learn feature representations that are predictive of the nodes in its neighbourhood $N_R(u)$ using the log-likelihood objective above.
Random walk optimization
We run short fixed-length random walks starting from each node of the graph using some strategy $R$. For each node $u$ we collect $N_R(u)$, the multiset of nodes visited on random walks starting from $u$ (a multiset because the same node can be visited multiple times across the walks from a single node). We then optimize embeddings according to the objective $\max_z \sum_{u \in V} \log P(N_R(u) \mid z_u)$: we want to maximize the conditional probability of the neighbourhood of $u$ given its embedding vector $z_u$.
This gives the following loss function, which optimizes embeddings to maximize the likelihood of random walk co-occurrences:
$$\mathcal{L} = \sum_{u \in V} \sum_{v \in N_R(u)} - \log P(v \mid z_u)$$
We parameterize $P(v \mid z_u)$ using a softmax. We want node $v$ to be the most similar to node $u$ out of all nodes $n$, and the softmax captures this because $\sum_i \exp(x_i) \approx \max_i \exp(x_i)$:
$$P(v \mid z_u) = \frac{\exp(z_u^T z_v)}{\sum_{n\in V}\exp(z_u^T z_n)}$$
Replacing $P(v \mid z_u)$ with its parameterized version, we get
$$\mathcal{L} = \sum_{u \in V}\sum_{v \in N_R(u)} - \log\bigg(\frac{\exp(z_u^T z_v)}{\sum_{n\in V}\exp(z_u^T z_n)}\bigg)$$
The loss $\mathcal{L}$ sums the negative log of the predicted probability of $u$ and $v$ co-occurring on a random walk, over all $v$ seen on random walks starting from $u$ and over all $u$ in the network. Optimizing random walk embeddings means finding embeddings $z_u$ that minimize $\mathcal{L}$.
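To make the objective concrete, here is a minimal sketch of the naive full-softmax loss on a toy example (NumPy; the embedding matrix and the walk neighbourhoods are made up for illustration). Note that the inner normalization runs over all nodes, which is exactly what makes this expensive:

```python
import numpy as np

def naive_loss(Z: np.ndarray, walk_neighborhoods: dict) -> float:
    """Full-softmax random-walk loss: sum over u, over v in N_R(u), of -log P(v | z_u)."""
    loss = 0.0
    for u, neighborhood in walk_neighborhoods.items():
        scores = Z @ Z[u]                                    # z_u^T z_n for every node n
        log_probs = scores - np.log(np.exp(scores).sum())    # log softmax over all nodes
        for v in neighborhood:                               # multiset of co-occurring nodes
            loss -= log_probs[v]
    return loss

rng = np.random.default_rng(0)
Z = rng.normal(scale=0.1, size=(4, 8))                       # 4 nodes, 8-dim embeddings
walk_neighborhoods = {0: [1, 2, 1], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(naive_loss(Z, walk_neighborhoods))
```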
But evaluating this naively costs $O(|V|^2)$ because of the normalization term in the softmax, so we use negative sampling to approximate the softmax term, as below:
$$\log\bigg(\frac{\exp(z_u^T z_v)}{\sum_{n \in V}\exp(z_u^T z_n)}\bigg) \approx \log(\sigma(z_u^T z_v)) - \sum_{i=1}^k \log(\sigma(z_u^T z_{n_i})), \quad n_i \sim P_V$$
The sigmoid function turns each term into a probability, and instead of normalizing with respect to all nodes we only normalize against $k$ random "negative samples" $n_i$ drawn from a distribution over all nodes. Even though this is a different objective function, negative sampling is a form of Noise Contrastive Estimation (NCE), which approximately maximizes the log probability of the softmax. The new formulation corresponds to using a logistic regression (sigmoid) to distinguish the target node $v$ from nodes $n_i$ sampled from the background distribution $P_V$. I didn't fully grasp the intuition behind this either, but you can find more on negative sampling here: https://arxiv.org/pdf/1402.3722.pdf
We sample $k$ negative nodes with probability proportional to their degree. Higher $k$ gives more robust estimates, but also corresponds to a higher bias on negative events. In practice $k = 5$ to $20$.
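Here is a minimal sketch of the negative-sampling term for a single $(u, v)$ pair, with degree-proportional sampling of negatives (NumPy; the toy embeddings and degrees are illustrative). I use the standard word2vec form, $\log\sigma(z_u^Tz_v) + \sum_i \log\sigma(-z_u^Tz_{n_i})$ negated as a loss, which is the version usually optimized in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(Z, u, v, degrees, k=5):
    """-log sigma(z_u.z_v) - sum_i log sigma(-z_u.z_{n_i}), with n_i ~ P_V (degree-proportional)."""
    p = degrees / degrees.sum()                        # negative-sampling distribution P_V
    negatives = rng.choice(len(degrees), size=k, p=p)
    loss = -np.log(sigmoid(Z[u] @ Z[v]))               # pull the positive pair together
    for n in negatives:
        loss -= np.log(sigmoid(-Z[u] @ Z[n]))          # push negative pairs apart
    return loss

Z = rng.normal(scale=0.1, size=(4, 8))                 # 4 nodes, 8-dim embeddings
degrees = np.array([2.0, 2.0, 3.0, 1.0])
print(neg_sampling_loss(Z, u=0, v=1, degrees=degrees))
```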
So far, given a random walk strategy, we have described how to optimize embeddings. The simplest strategy is to just run fixed-length, unbiased random walks starting from each node, as described in the DeepWalk paper, but that notion of similarity is too constrained. We need a more general approach.
node2vec: Biased Walks
We need a flexible notion of the network neighbourhood $N_R(u)$ of node $u$ that will eventually lead to rich node embeddings. The strategy of just running fixed-length unbiased walks is too constrained, so we develop a biased $2^{nd}$-order random walk strategy $R$ to generate the network neighbourhood $N_R(u)$ of node $u$.
We want a flexible, biased random walk strategy that can trade off between local and global views of the network: a local, microscopic view via BFS and a global, macroscopic view via DFS. We interpolate between BFS and DFS with two parameters: the return parameter $p$, controlling the probability of returning to the previous node in the walk, and the in-out parameter $q$, deciding whether to move inwards (BFS-like) or outwards (DFS-like) at a given node. Intuitively, $q$ is the ratio of BFS vs. DFS.
These biased $2^{nd}$-order random walks explore network neighbourhoods. At each step the walk has to remember where it came from, in case it needs to step back. In the image below, suppose a random walk traversed the edge $(s_1, w)$ and is now at $w$. There are four options from node $w$: go back to $s_1$, go to $s_2$ (the BFS option with respect to $s_1$), or go to $s_3$ or $s_4$ (the DFS options with respect to $s_1$). With return parameter $p$ and walk-away parameter $q$, the next move is chosen according to the unnormalized probabilities in the image: $1/p$ for returning, $1$ for staying at the same distance from $s_1$, and $1/q$ for moving away. $p$ and $q$ model the transition probabilities and are our hyperparameters.
Following this strategy, we compute the random walk transition probabilities, simulate $r$ random walks of length $l$ starting from each node $u$, and optimize the node2vec objective using stochastic gradient descent.
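A minimal sketch of one step of this biased walk, assuming unweighted adjacency lists; the unnormalized weights ($1/p$ to return, $1$ for a BFS-like move, $1/q$ for a DFS-like move) follow the figure, and the data layout and names are illustrative:

```python
import random

def biased_step(graph, prev, curr, p=1.0, q=1.0):
    """Pick the next node of a node2vec walk given the previous and current nodes."""
    candidates, weights = [], []
    for nxt in graph[curr]:
        if nxt == prev:                     # return to where we came from
            w = 1.0 / p
        elif nxt in graph[prev]:            # stays at distance 1 from prev (BFS-like)
            w = 1.0
        else:                               # moves away from prev (DFS-like)
            w = 1.0 / q
        candidates.append(nxt)
        weights.append(w)
    return random.choices(candidates, weights=weights, k=1)[0]

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
print(biased_step(graph, prev=0, curr=1, p=2.0, q=0.5))
```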
We can use these embeddings to do clustering/community detection, node classification or link prediction.
Clustering and node classification are straightforward: just run k-means on the embeddings, or train a classifier on the embedding vectors.
For link prediction, we predict edge $(i, j)$ based on $f(z_i, z_j)$, where we can concatenate, average, take a per-coordinate product of, or take a distance between the embeddings, as listed below (a small sketch follows the list):
- Concatenation: $f(z_i, z_j) = g([z_i, z_j])$
- Hadamard: $f(z_i, z_j) = g(z_i * z_j)$ (per-coordinate product)
- Sum/avg: $f(z_i, z_j) = g(z_i + z_j)$
- Distance: $f(z_i, z_j) = g(||z_i - z_j||_2)$
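A minimal sketch of these four operators (NumPy; the outer function $g$ is simply omitted here, since in practice it would be a downstream classifier such as logistic regression):

```python
import numpy as np

def edge_features(z_i: np.ndarray, z_j: np.ndarray) -> dict:
    """Combine two node embeddings into candidate edge representations."""
    return {
        "concat":   np.concatenate([z_i, z_j]),      # [z_i, z_j]
        "hadamard": z_i * z_j,                       # per-coordinate product
        "average":  (z_i + z_j) / 2.0,               # sum/avg
        "distance": np.linalg.norm(z_i - z_j),       # ||z_i - z_j||_2 (a scalar)
    }

rng = np.random.default_rng(0)
z_i, z_j = rng.normal(size=8), rng.normal(size=8)
print({name: np.shape(feat) for name, feat in edge_features(z_i, z_j).items()})
```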
There are other notions of similarity: adjacency-based and multi-hop similarity definitions, or random walk approaches. No single method is best in every scenario; we must choose a task-specific similarity. For example, node2vec performs better on node classification, while multi-hop methods perform better on link prediction (Goyal and Ferrara, 2017 survey).
TransE: Translating Embeddings for Multi-relational Data
TransE applies embeddings to knowledge graphs. A knowledge graph is composed of facts/statements about interrelated entities: nodes are entities and edges are relationships between them, and the relationships can be of many types. These knowledge graphs are huge but incomplete, with limited information about how entities are related to each other, which can limit downstream tasks, so predicting the edges/relationships between entities is an important problem. We want a link prediction model that learns from local and global connectivity patterns in the knowledge graph, taking into account entities and relationships of different types at the same time. Relation predictions are then performed by using the learned patterns to generalize the observed relationships between an entity of interest and all other entities.
In TransE, relationships between entities are represented as triplets $(h, l, t)$: head entity $h$, relation $l$, and tail entity $t$. Entities are first embedded in an entity space $\Re^k$ as in the previous methods, and relations are represented as translations: if the relationship $(h, l, t)$ holds, we want $h + l \approx t$, and otherwise $h + l \neq t$.
Entities and relations are initialized uniformly and normalized. We do triplet negative sampling with links that do not appear in the KG and use a contrastive, margin-based loss, which favours low distance values for valid triplets and high distance values for corrupted ones.
$$\mathcal{L} = \sum_{\big((h, l, t), (h', l, t')\big) \in T_{batch}}\big[\gamma + d(h+l, t) - d(h'+l, t')\big]_+$$
In the above equation, $(h, l, t)$ is a positive sample and $(h', l, t')$ a corrupted one, and $[\cdot]_+$ denotes the positive part, so the loss is zero once the margin $\gamma$ is satisfied.
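A minimal sketch of the TransE distance and the margin loss for a single positive/corrupted pair (NumPy; the L2 distance, the margin value, and the toy vectors are illustrative):

```python
import numpy as np

def transe_distance(h, l, t):
    """d(h + l, t): how far the translated head lands from the tail."""
    return np.linalg.norm(h + l - t)

def margin_loss(pos, neg, gamma=1.0):
    """[gamma + d(h + l, t) - d(h' + l, t')]_+  for one (positive, corrupted) pair."""
    (h, l, t), (hc, lc, tc) = pos, neg
    return max(0.0, gamma + transe_distance(h, l, t) - transe_distance(hc, lc, tc))

rng = np.random.default_rng(0)
h, l, t = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
h_corrupt = rng.normal(size=3)                       # corrupted head entity
print(margin_loss((h, l, t), (h_corrupt, l, t)))
```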
Embedding Entire Graphs
We want to be able to embed entire graphs into the latent space, for tasks like graph classification or graph anomaly detection, e.g. identifying toxic vs. non-toxic molecules.
A simple way of doing this is to sum all the node embeddings of a graph to get the graph embedding, as in Duvenaud et al., 2016, where graph embeddings were used to classify molecules:
$$z_G = \sum_{v\in G} z_v$$
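As a one-liner sketch (NumPy; the node embedding matrix is just a stand-in for learned embeddings):

```python
import numpy as np

node_embeddings = np.random.default_rng(0).normal(size=(10, 4))  # 10 nodes, 4-dim embeddings
z_G = node_embeddings.sum(axis=0)                                 # graph embedding z_G
print(z_G)
```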
Alternatively, we can introduce a virtual node to represent the graph (or a subgraph), run a standard node embedding technique, and then use the embedding of that virtual node as the embedding of the graph or subgraph. This was proposed by Li et al., 2016 as a general technique for subgraph embedding.
Anonymous Walk Embeddings
Another approach uses [Anonymous Walk Embeddings](https://arxiv.org/pdf/1805.11921.pdf). Essentially, we do random walks on the graph and anonymize the nodes visited: each state in an anonymous walk corresponds to the index of the first time that node was visited in the walk. Consider the following three walks in a graph with nodes $\{A, B, C, D, E, F, \cdots\}$:
- Random Walk 1: $[A\rarr B\rarr C\rarr B\rarr C]$
- Random Walk 2: $[C\rarr D\rarr B\rarr D\rarr B]$
- Random Walk 3: $[A\rarr B\rarr A\rarr B\rarr D]$
After anonymising the nodes, the first two random walks become identical: in each of them the first three nodes are distinct, the 4th node repeats the 2nd, and the 5th repeats the 3rd. So both random walk 1 and random walk 2 are stored as $[1\rarr 2\rarr 3\rarr 2\rarr 3]$, while the anonymised version of random walk 3 is $[1\rarr 2\rarr 1\rarr 2\rarr 3]$.
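A minimal sketch of the anonymization step, reproducing the three walks above (the helper name is mine):

```python
def anonymize(walk: list) -> list:
    """Replace each node by the index of its first appearance in the walk."""
    first_seen = {}
    anon = []
    for node in walk:
        if node not in first_seen:
            first_seen[node] = len(first_seen) + 1   # 1-based, as in the examples above
        anon.append(first_seen[node])
    return anon

print(anonymize(["A", "B", "C", "B", "C"]))   # [1, 2, 3, 2, 3]
print(anonymize(["C", "D", "B", "D", "B"]))   # [1, 2, 3, 2, 3]  -- same anonymous walk
print(anonymize(["A", "B", "A", "B", "D"]))   # [1, 2, 1, 2, 3]
```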
Essentially, anonymous random walks capture where a walk gets stuck and how often it revisits the same states; the idea is that they capture graph behaviour rather than the behaviour of individual nodes.
The number of anonymous walks grows exponentially with the length of the walk: there are 5 anonymous walks of length 3, but more than 4 million of length 12.
We enumerate all possible anonymous walks $a_i$ of $l$ steps and record their counts, representing the graph as a probability distribution over these walks: $Z_G[i]$ is the probability of anonymous walk $a_i$ in $G$.
Completely counting all anonymous walks in a large graph may be infeasible, so we sample to approximate the true distribution: generate $m$ independent random walks and compute the corresponding empirical distribution of anonymous walks.
The number of random walks $m$ needed so that the distribution has error of more than $\epsilon$ with probability less than $\delta$ is $m = \lceil\frac{2}{\epsilon^2}(\log(2^{\eta} - 2) - \log(\delta))\rceil$, where $\eta$ is the total number of anonymous walks of length $l$. For example, for walks of length $l=7$ there are $\eta=877$ distinct anonymous walks, and with $\epsilon=0.1$ and $\delta = 0.01$ we need to generate $m=122500$ random walks.
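A minimal sketch of this bound, reproducing the $l=7$ example above (reading the outer brackets in the formula as a ceiling, which is an assumption on my part):

```python
import math

def num_walks_needed(eta: int, eps: float, delta: float) -> int:
    """m = ceil( (2 / eps^2) * (log(2^eta - 2) - log(delta)) )"""
    return math.ceil((2.0 / eps**2) * (math.log(2**eta - 2) - math.log(delta)))

print(num_walks_needed(eta=877, eps=0.1, delta=0.01))   # 122500 for walks of length 7
```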
We can also learn an embedding $z_i$ of every anonymous walk $a_i$; the embedding of a graph $G$ is then the sum/avg/concatenation of the walk embeddings $z_i$.
Let $w_t^u$ be the $t$-th random walk starting at node $u$. We embed walks such that the next walk can be predicted: we set $z_i$ to maximize $P(w_t^u \mid w_{t-\Delta}^u, \cdots, w_{t-1}^u) = f(z)$.
We run $T$ different random walks from $u$, each of length $l$: $N_R(u) = [w_1^u,\cdots, w_T^u]$, and let $a_i$ be the anonymous version of walk $w_i$. We learn to predict walks that co-occur in a window of size $\Delta$, estimating the embedding $z_i$ of the anonymous walk $a_i$ of $w_i$ using
$$\max \frac{1}{T}\sum_{t=\Delta}^T \log P(w_t \mid w_{t-\Delta}, \cdots, w_{t-1})$$
The parameterized version of the probability in the above equation is
$$P(w_t \mid w_{t-\Delta},\cdots,w_{t-1}) = \frac{\exp(y(w_t))}{\sum_{i=1}^\eta \exp(y(w_i))}$$
$$y(w_t) = b + U \cdot \bigg(\frac{1}{\Delta}\sum_{i=1}^\Delta z_i\bigg), \quad b\in \Re,\; U \in \Re^d$$
In summary, we represent graphs via the distribution over all the anonymous walks, sample the walks to approximate the distribution, and embed the anonymous walks.
Chapters
- Introduction, Structure of Graphs
- Properties of Networks and Random Graph Models
- Motifs and Structural Roles in Networks
- Community Structure in Networks
- Spectral Clustering
- Message Passing and Node Classification
- Graph Representation Learning
- Graph Neural Networks
- Graph Neural Networks - Pytorch Geometric
- Deep Generative Models for Graphs
- Link Analysis: PageRank
- Network Effects and Cascading Behaviour
- Probabilistic Contagion and Models of Influence
- Influence Maximization in Networks
- Outbreak Detection in Networks
- Network Evolution
- Reasoning over Knowledge Graphs
- Limitations of Graph Neural Networks
- Applications of Graph Neural Networks