
A Brief Introduction to Graph Contrastive Learning

This article provides an overview of "Deep Graph Contrastive Representation Learning" and introduces a general formulation for Contrastive Representation Learning on Graphs using W&B for interactive visualizations. It includes code samples for you to follow!
NOTE: This report is part of a series on Graph Representation Learning; for a brief overview and survey, please refer to the other articles in the series as well.


Introduction

Contrastive learning is one of the key pillars of representation learning and a central framework in self-supervised learning. In contrastive learning, multiple views are generated from a given sample and then "contrasted" with views from other samples. The views derived from the same sample are regarded as positives, while those from other samples are negatives.
The objective is to align or group positive samples together in the latent space and to push negative samples apart. This general framework has worked wonders for Euclidean data such as images, with methods like SimCLR and DINO. But can we take this framework and apply it to graphs as well?
Well, the authors of Deep Graph Contrastive Representation Learning certainly think so! In particular, they propose a novel framework for unsupervised graph representation learning with a contrastive objective at the node level. Let's dive into some details!
You can follow along with this report using the Colab below:



👨‍🏫 Method

Figure 1: The proposed GRACE framework
The GRACE framework is relatively simple to understand. First, given a graph, we corrupt it to generate two graph views. How we corrupt a given sample is crucial to any contrastive learning framework. In GRACE in particular, the authors employ two augmentations, sketched in code below:
  • Removing Edges (RE): In this augmentation we randomly remove some edges from the input graph.
  • Masking Node Features (MF): In this augmentation we randomly set some node features to zero.
NOTE: The authors distinguish these augmentations from Dropout and DropEdge in the paper. In my opinion, while the resulting effect of these augmentations is different, and indeed more beneficial in the GRACE framework, the underlying intuition remains the same.
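To build intuition, here is a minimal sketch of what these two augmentations might look like in PyTorch + PyTorch Geometric. The drop_feature helper is an illustrative re-implementation (the official repository ships its own version), and dropout_adj is the PyTorch Geometric utility also used in the code section further down.

import torch
from torch_geometric.utils import dropout_adj


def drop_feature(x: torch.Tensor, drop_prob: float) -> torch.Tensor:
    # Masking Node Features (MF): zero out randomly chosen feature dimensions
    # for every node. Illustrative sketch; the repository has its own helper.
    drop_mask = torch.rand(x.size(1), device=x.device) < drop_prob
    x = x.clone()
    x[:, drop_mask] = 0.0
    return x


# Toy graph: 4 nodes with 8-dimensional features and a few undirected edges.
x = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3],
                           [1, 0, 2, 1, 3, 2]])

# Removing Edges (RE): dropout_adj randomly drops edges and
# returns (edge_index, edge_attr); we only need the indices.
edge_index_view = dropout_adj(edge_index, p=0.3)[0]

# Masking Node Features (MF).
x_view = drop_feature(x, drop_prob=0.2)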
After we generate the two views by applying the aforementioned augmentations, we compute node embeddings for each view using a common, shared encoder. We then apply a contrastive objective between these embeddings and update the parameters of the encoder accordingly.
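For reference, a minimal shared encoder could be a small two-layer GCN like the sketch below. The layer sizes and activation are illustrative choices, not the authors' reference implementation; see the repository for the exact architecture.

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv


class Encoder(torch.nn.Module):
    # A simple two-layer GCN used as the shared encoder f(X, A) -> node embeddings.
    # Both augmented views are passed through this same module (shared weights).
    def __init__(self, in_channels: int, hidden_channels: int, out_channels: int):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)


# The same encoder produces embeddings for both views:
# z1 = encoder(x_1, edge_index_1)
# z2 = encoder(x_2, edge_index_2)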
Contrary to previous work that learns representations by exploiting local-global relationships, in GRACE the authors learn embeddings by directly maximizing node-level agreement between the two views.
The contrastive objective is responsible for pulling together the embeddings of the same node across the two views and pushing apart the embeddings of different nodes.
Formally, a node embedding $u_i$ from one view (the anchor) has a positive sample $v_i$, the embedding of the same node from the other view, while all other embeddings $(u_k, v_k)$ with $k \neq i$ are regarded as negative samples. Similarity is defined as $\theta(u, v) = s(g(u), g(v))$, where $s$ is a similarity measure (in GRACE, cosine similarity) and $g$ is a non-linear projection function (in GRACE, a simple two-layer MLP).
The authors build the overall loss from a pairwise objective, defined as follows:
$$\ell(u_i, v_i) = \log \frac{e^{\theta(u_i, v_i)/\tau}}{e^{\theta(u_i, v_i)/\tau} + \color{blue}{\sum_{k=1}^{N}\mathbb{1}_{[k \neq i]}\, e^{\theta(u_i, v_k)/\tau}} + \color{green}{\sum_{k=1}^{N}\mathbb{1}_{[k \neq i]}\, e^{\theta(u_i, u_k)/\tau}}}$$


Let's break this formula down into more digestible terms:
  • positive pair: $\text{exp}(\theta(u_i, v_i)/\tau)$. This term appears in the numerator and as the first component of the denominator; the objective pulls positive pairs together.
  • inter-view negative pairs: $\mathbb{1}_{[k \neq i]}\, \text{exp}(\theta(u_i, v_k)/\tau)$. This is the second component of the denominator; the objective pushes the anchor away from the embeddings of different nodes in the other view.
  • intra-view negative pairs: $\mathbb{1}_{[k \neq i]}\, \text{exp}(\theta(u_i, u_k)/\tau)$. This is the third component of the denominator; the objective pushes the anchor away from the embeddings of different nodes within the same view.
Ignoring the log and the temperature, the same formula can be written in words as:
$$\ell(u_i, v_i) = \frac{\text{positive pair}}{\text{positive pair} + \color{blue}{\text{inter-view negative pairs}} + \color{green}{\text{intra-view negative pairs}}}$$

We then use this pairwise objective to define the overall objective as:
$$\mathcal{J} = \frac{1}{2N} \sum_{i=1}^{N} \left[\, \ell(u_i, v_i) + \ell(v_i, u_i)\, \right]$$
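To make the math concrete, here is a rough sketch of how the pairwise objective and the overall objective $\mathcal{J}$ could be implemented in PyTorch. It assumes the projection head $g$ has already been applied to the embeddings and returns the negated objective so that it can be minimized as a loss; the official repository's loss differs in details such as batching.

import torch
import torch.nn.functional as F


def pairwise_objective(u: torch.Tensor, v: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    # u, v: projected node embeddings g(U), g(V) of shape [N, d] for the two views.
    u = F.normalize(u, dim=1)
    v = F.normalize(v, dim=1)

    # exp(theta(u_i, .)/tau) for all pairs, using cosine similarity.
    between = torch.exp(u @ v.t() / tau)  # inter-view similarities, [N, N]
    within = torch.exp(u @ u.t() / tau)   # intra-view similarities, [N, N]

    positives = between.diag()                           # exp(theta(u_i, v_i)/tau)
    inter_negatives = between.sum(dim=1) - positives     # sum over k != i
    intra_negatives = within.sum(dim=1) - within.diag()  # sum over k != i

    # Negative of l(u_i, v_i): minimizing this maximizes the objective in the text.
    return -torch.log(positives / (positives + inter_negatives + intra_negatives))


def grace_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    # Symmetrized overall objective J, averaged over all N nodes.
    return 0.5 * (pairwise_objective(z1, z2, tau) + pairwise_objective(z2, z1, tau)).mean()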


👨‍💻 Code

Let's look at one training step of GRACE, implemented using PyTorch + PyTorch Geometric, in some detail. If you want to read more of the model or training code, please refer to the repository.
class GRACE(torch.nn.Module):
    ...

    def train_step(
        self,
        x: torch.Tensor,
        edge_index: torch.Tensor,
    ) -> torch.Tensor:
        """
        Perform a single training step.

        Args:
            x (torch.Tensor): Node features.
            edge_index (torch.Tensor): Edge indices.

        Returns:
            torch.Tensor: Contrastive loss.
        """
        # Generate graph views

        ## Removing Edges (RE)
        edge_index_1 = dropout_adj(edge_index, p=self.drop_edge_rate_1)[0]
        edge_index_2 = dropout_adj(edge_index, p=self.drop_edge_rate_2)[0]

        ## Masking Node Features (MF)
        x_1 = drop_feature(x, drop_prob=self.drop_feature_rate_1)
        x_2 = drop_feature(x, drop_prob=self.drop_feature_rate_2)

        ## Encoding both views with the shared encoder
        z1 = self.forward(x_1, edge_index_1)
        z2 = self.forward(x_2, edge_index_2)

        # Calculate the contrastive loss between the two sets of embeddings
        loss = self.loss(z1, z2, batch_size=0)

        return loss
I'd encourage you to browse through the following Colab Notebook for a simple overview of the framework.
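For context, a training loop around train_step might look roughly like the sketch below, with Weights & Biases logging the loss every epoch. The hyperparameters, the dataset choice, and the GRACE(...) constructor arguments are placeholders for illustration; see the repository for the actual configuration.

import torch
import wandb
from torch_geometric.datasets import Planetoid

# Illustrative setup: a Planetoid citation graph and made-up hyperparameters.
dataset = Planetoid(root="data/Planetoid", name="Cora")
data = dataset[0]

wandb.init(project="graph-contrastive-learning", config={"epochs": 200, "lr": 5e-4})

model = GRACE(...)  # the module from the snippet above; constructor arguments omitted
optimizer = torch.optim.Adam(model.parameters(), lr=wandb.config.lr)

for epoch in range(wandb.config.epochs):
    model.train()
    optimizer.zero_grad()
    loss = model.train_step(data.x, data.edge_index)  # augment, encode, contrast
    loss.backward()
    optimizer.step()
    wandb.log({"epoch": epoch, "loss": loss.item()})  # training curve in the W&B UI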


📊 Results

I was personally wondering whether the model's performance would change drastically if we changed the projection dimension. Let's experiment by setting the projection dimension to 64, 128, and 256. All other default configurations can be found in the GitHub repository.
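One way to run such a comparison is to launch one W&B run per projection dimension, roughly as sketched below. The config key and the train entry point are placeholders for however the repository wires up its configuration.

import wandb

for proj_dim in [64, 128, 256]:
    # One W&B run per setting, so the three runs can be compared side by side.
    with wandb.init(project="graph-contrastive-learning",
                    config={"proj_dim": proj_dim}, reinit=True):
        train(wandb.config)  # placeholder for the repository's training entry point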

Run set: 3 runs (projection dimensions 64, 128, and 256).
As we can see, the model's performance doesn't change drastically as we vary the projection dimension.

🔗 Summary

In this article, you read a brief overview of a novel contrastive graph representation learning framework from the paper "Deep Graph Contrastive Representation Learning", and saw how Weights & Biases can be used to explore the training process and surface valuable insights.
To see the full suite of W&B features, please check out this short 5-minute guide. If you want more reports covering the math and "from-scratch" code implementations, let us know in the comments down below or on our forum ✨!
Check out these other reports on Fully Connected covering other Geometric Deep Learning topics such as Graph Attention Networks.
