5. Spectral Clustering
Introduction and Problem Definition
Here we study the important class of spectral methods for understanding networks. Spectral clustering means we use the eigenvalues and eigenvectors of matrix representations of a graph.
The spectral clustering algorithms we explore generally consist of three basic stages:
Pre-processing
Construct a matrix representation of the graph
Decomposition
Compute the eigenvalues and eigenvectors of the matrix representation, and map each point to a lower-dimensional representation based on one or more eigenvectors.
Grouping
Assign points to two or more clusters, based on the new representation.
Problem Definition
Given an undirected graph G(V, E), we want to divide its vertices into two disjoint groups A and B.
How can we define a good partition of G? And how can we efficiently identify such a partition?
Until now we used criteria like modularity to identify the best partition within a graph. Now we approach the same problem using a different criterion. Null models and modularity are a physics view of the partitioning problem; this will be a computer science view, where we have approximation guarantees on how well our methods work.
Graph Partitioning Criterion
Maximize the number of within-group connections and minimize the number of between-group connections.
Let's define the graph partitioning objective as a function of the edge cut of the partition. The cut is the set of edges with one endpoint in each group:
\displaystyle cut(A, B) = \sum_{i\in A, j \in B}w_{ij}
If we use the minimum cut \argmin_{A, B} cut(A, B) as the criterion, there are degenerate cases where a dangling node ends up as its own group even though much better partitions of the graph exist. The minimum cut only considers external cluster connections and ignores internal cluster connectivity.
Conductance [Shi-Malik '97]
Conductance is defined as the connectivity between the groups relative to the density of each group:
\displaystyle \phi(A, B) = \frac{cut(A, B)}{min(vol(A), vol(B))}
The volume vol(A) is the total weighted degree of the nodes in group A. We want the conductance to be as small as possible, and it is small when the volumes of both groups are as large as possible; taking the minimum of the two volumes forces both groups to be reasonably big. Intuitively, we want to split the graph into two similarly sized groups such that the cut is small. Conductance produces much more balanced partitions than using the cut alone, but computing the best conductance cut is NP-hard.
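As a concrete illustration, here is a minimal sketch (my own toy example, not from the lecture) of computing the cut and the conductance of a partition with networkx; the barbell graph and the choice of group A are assumptions made purely for illustration.

```python
# Toy example: cut and conductance of a partition of a barbell graph.
import networkx as nx

G = nx.barbell_graph(5, 0)                 # two 5-cliques joined by a single edge
A = set(range(5))                          # hypothetical partition: first clique
B = set(G.nodes()) - A                     # versus everything else

cut = sum(1 for u, v in G.edges() if (u in A) != (v in A))
vol = lambda S: sum(G.degree(n) for n in S)     # volume = total degree of a group
phi = cut / min(vol(A), vol(B))                 # conductance

print(cut, vol(A), vol(B), round(phi, 3))       # 1 21 21 0.048
```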
Spectral Graph Theory
Spectral graph partitioning gives us an approximation guarantee for finding the cut that optimizes conductance.
Let A be the adjacency matrix of an undirected graph G and x a vector in \mathcal{R}^n with components (x_1, \cdots, x_n). We can think of x as assigning a label or value to each node of G: every node i has its own corresponding x_i.
Now consider y = A.x:
\displaystyle \begin{pmatrix} a_{11} & \cdots & a_{1n}\\ \vdots & \ddots & \vdots\\ a_{n1} & \cdots & a_{nn}\end{pmatrix}. \begin{pmatrix} x_1\\ \vdots\\ x_n\end{pmatrix} = \begin{pmatrix} y_1\\ \vdots\\ y_n\end{pmatrix}
\displaystyle y_i = \sum_{j=1}^{n} A_{ij}x_j = \sum_{(i,j) \in E} x_j
When we multiply the adjacency matrix with x, we get another vector y whose element at position i is the sum of the x values of all nodes that have an edge to node i. Entry y_i is the sum of the labels x_j of the neighbors of i.
Eigenvectors and eigenvalues of a matrix A are defined as the solutions of the equation A.x = \lambda.x, where x is an eigenvector and \lambda the corresponding eigenvalue. Multiplying a matrix with a generic vector both scales and rotates it; multiplying a matrix with one of its eigenvectors only scales the vector, by the eigenvalue, and does not rotate it. For any matrix there are some favoured vectors/directions, and when the matrix acts upon these favoured vectors it just scales them with no rotation. If we take the eigenvector corresponding to the maximum eigenvalue of A, no other vector gets stretched by A as much as this eigenvector.
Now consider an eigenvector of our adjacency matrix and its corresponding eigenvalue, with each entry of the eigenvector labeling a node. When we sum the x values of the neighbours of node i we get a scaled version of x_i: y_i is the eigenvalue times x_i, even though it is just the sum of the neighbouring x values of node i.
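A small numpy sketch (my own triangle-graph example, not from the notes) makes both points concrete: multiplying A by a label vector sums the neighbours' labels, and multiplying A by one of its eigenvectors only rescales it.

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)      # triangle graph: every node has degree 2

x = np.array([1.0, 2.0, 3.0])               # arbitrary node labels
print(A @ x)                                # [5. 4. 3.]: each entry sums the neighbours' labels

vals, vecs = np.linalg.eigh(A)              # symmetric matrix -> real spectrum, ascending order
v = vecs[:, -1]                             # eigenvector of the largest eigenvalue (here 2)
print(np.allclose(A @ v, vals[-1] * v))     # True: A only scales v by its eigenvalue
```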
Spectral graph theory is a branch of applied mathematics that analyses the spectrum of a matrix representing G. The spectrum is the set of eigenvectors x^{(i)} of the graph, ordered by the magnitude of their corresponding eigenvalues \Lambda = \lbrace\lambda_1,\lambda_2,\cdots\lambda_n\rbrace where \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n.
Example: d-Regular Graph.
A connected graph G in which every node has degree d is called d-regular.
Consider the vector x = [1,1,\cdots1]. If we multiply the adjacency matrix of a d-regular graph with the vector x we get d.x: since the graph is d-regular, every node has d neighbours, and (A.x)_i is just the sum of the x values of node i's neighbours. Because multiplying A with x gives back x scaled by d, x is an eigenvector of A and d the corresponding eigenvalue.
x = [1,1,\cdots1]\\ A.x = [d,d,\cdots d] = \lambda .x\\ \text{with } \lambda = d
It turns out that for a d-regular graph, d is the largest eigenvalue and [1,1,\cdots1] is the corresponding eigenvector.
TODO: proof of the above statement in pdf
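A quick numerical check on my own toy example (a ring of six nodes, which is 2-regular):

```python
import networkx as nx
import numpy as np

A = nx.to_numpy_array(nx.cycle_graph(6))    # ring graph: every node has degree d = 2
vals, _ = np.linalg.eigh(A)

print(vals[-1])                             # largest eigenvalue: 2.0, i.e. d
print(A @ np.ones(6))                       # the all-ones vector is just scaled by d
```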
Example: Graph on 2 Components
Consider a graph G with two disconnected d-regular components. What are some of its eigenvectors?
Suppose A and B are the components, and consider vectors x that are all 1s on one component and all 0s on the other.
x' = [\underbrace{1,\cdots,1}_{|A|},\underbrace{0,\cdots,0}_{|B|}]^T \text{ then } A.x' = [d,\cdots,d,0,\cdots,0]^T\\ x'' = [0,\cdots,0,1,\cdots,1]^T \text{ then } A.x'' = [0,\cdots,0,d,\cdots,d]^T
Both vectors are eigenvectors with eigenvalue \lambda = d.
Hence for a graph with two disconnected d-regular components \lambda_n = \lambda_{n-1}: the two largest eigenvalues are the same. We found an eigenvalue of multiplicity 2, meaning two eigenvectors correspond to the same eigenvalue. Intuitively, if a graph has multiple connected components, the multiplicity of the largest eigenvalue tells us how many connected components it has.
From this intuition, consider a graph that almost has two separate components: the largest eigenvalue and the second-largest eigenvalue will be very close to each other, \lambda_n - \lambda_{n-1} \approx 0. The eigenvector of the second-largest eigenvalue will tell us which nodes should be in component A and which in component B.
Eigenvectors of a symmetric matrix are orthogonal, hence if the d-regular graph is connected we already know that x_n = [1,\cdots,1] is the eigenvector corresponding to the largest eigenvalue. This implies the entries of x_{n-1} must sum to zero, because x_n.x_{n-1} = 0 \rarr \sum_i x_n[i].x_{n-1}[i] = \sum_i x_{n-1}[i] = 0.
x_{n-1} "splits" the nodes into two groups: nodes with x_{n-1} > 0 and nodes with x_{n-1} < 0. So in principle we can look at the eigenvector of the 2nd-largest eigenvalue and declare nodes with a positive label to be in A and nodes with a negative label to be in B.
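A hedged sketch (toy example of mine): two cliques joined by a single edge are "almost" two components, and the sign of the adjacency eigenvector belonging to the second-largest eigenvalue separates them.

```python
import networkx as nx
import numpy as np

G = nx.barbell_graph(6, 0)                  # two 6-cliques joined by one edge
A = nx.to_numpy_array(G)
vals, vecs = np.linalg.eigh(A)              # eigenvalues in ascending order

print(vals[-1], vals[-2])                   # two largest eigenvalues: close together, far from the rest
print(vecs[:, -2] > 0)                      # sign split: one clique True, the other False
```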
Matrix Representations
Adjacency matrix (A)
A has size n\times n with a_{ij}=1 if there is an edge between i and j. For an undirected graph A is symmetric, has n real eigenvalues, and its eigenvectors are real-valued and orthogonal.
A = \begin{pmatrix} 0 & 1 & 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 0 & 0 & 0\\ 1 & 1 & 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0 & 1 & 1\\ 1 & 0 & 0 & 1 & 0 & 1\\ 0 & 0 & 0 & 1 & 1 & 0\end{pmatrix}
Degree matrix (D)
A diagonal matrix of size n\times n, D = [d_{ii}], where d_{ii} = degree of node i.
D = \begin{pmatrix} 3 & 0 & 0 & 0 & 0 & 0\\ 0 & 2 & 0 & 0 & 0 & 0\\ 0 & 0 & 3 & 0 & 0 & 0\\ 0 & 0 & 0 & 3 & 0 & 0\\ 0 & 0 & 0 & 0 & 3 & 0\\ 0 & 0 & 0 & 0 & 0 & 2\end{pmatrix}
Laplacian matrix (L)
The Laplacian matrix is defined as L = D - A, the degree matrix minus the adjacency matrix: -1 wherever there is an edge and the node degrees on the diagonal. It is an n\times n symmetric matrix.
L = \begin{pmatrix} \phantom{-}3 & -1 & -1 & \phantom{-}0 & -1 & \phantom{-}0\\ -1 & \phantom{-}2 & -1 & \phantom{-}0 & \phantom{-}0 & \phantom{-}0\\ -1 & -1 & \phantom{-}3 & -1 & \phantom{-}0 & \phantom{-}0\\ \phantom{-}0 & \phantom{-}0 & -1 & \phantom{-}3 & -1 & -1\\ -1 & \phantom{-}0 & \phantom{-}0 & -1 & \phantom{-}3 & -1\\ \phantom{-}0 & \phantom{-}0 & \phantom{-}0 & -1 & -1 & \phantom{-}2\end{pmatrix}
The trivial eigenpair of a Laplacian matrix is x = [1,\cdots,1]: then L.x = 0 and so \lambda = \lambda_1 = 0.
So the smallest eigenvalue of L is 0, unlike for the adjacency matrix, where the eigenvalue corresponding to [1,\cdots,1] is the largest.
Also, the eigenvalues of L are non-negative real numbers, and its eigenvectors are real and always orthogonal.
TODO: Proof for the laplacian matrix properties above mentioned is in the pdf.
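A numerical sanity check of these properties on a toy graph (my own example; any connected undirected graph would do):

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
L = np.diag(A.sum(axis=1)) - A                     # L = D - A

vals, vecs = np.linalg.eigh(L)                     # real eigenvalues, ascending order
n = len(G)
print(np.allclose(L @ np.ones(n), 0))              # True: L.x = 0 for x = [1,...,1]
print(round(vals[0], 10), (vals >= -1e-9).all())   # smallest eigenvalue is 0; all non-negative
print(np.allclose(vecs.T @ vecs, np.eye(n)))       # True: eigenvectors are orthonormal
```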
\lambda_2 as an Optimization Problem
For any symmetric matrix M:
\displaystyle \lambda_2 = \min_{x: x^Tw_1 = 0} \frac{x^TMx}{x^Tx} where w_1 is the eigenvector corresponding to \lambda_1
TODO: Proof of this is in the pdf
In our case M is L. To figure out the meaning of \min x^TLx on G:
x^TLx = \sum_{i,j=1}^n L_{ij}x_ix_j \\= \sum_{i,j=1}^n(D_{ij} - A_{ij})x_ix_j \\= \sum_iD_{ii}x_i^2 - \sum_{(i,j) \in E}2x_ix_j \\= \sum_{(i, j) \in E}(x_i^2 + x_j^2 - 2x_ix_j) \\= \sum_{(i,j)\in E}(x_i - x_j)^2
Since eigenvectors are unit vectors, we have \sum_ix_i^2 = 1,
and x is orthogonal to the 1st eigenvector [1,\cdots,1], thus: \sum_ix_i.1 = \sum_ix_i = 0.
From the above 3 equations we have
\displaystyle \lambda_2 = \min_{x: \sum x_i=0} \frac{\sum_{(i, j) \in E}(x_i - x_j)^2}{\sum_ix_i^2} = \min_{x: \sum x_i=0} \sum_{(i, j) \in E}(x_i - x_j)^2
We want to assign values x_i to nodes i such that few edges cross 0, i.e. such that x_i and x_j for connected nodes cancel each other. From the equation for \lambda_2 above, the node labels must sum to zero, so intuitively roughly half of the nodes will be in one cluster and the other half in the other cluster. To minimize the term \sum_{(i, j) \in E}(x_i - x_j)^2, the edges crossing between the clusters have to be as few as possible; edges within a cluster contribute only a small value to \sum_{(i, j) \in E}(x_i - x_j)^2 anyway.
The image below is for intuition: node labels from the second eigenvector are on the x-axis, with the corresponding nodes in red. The edges between the two clusters have to be as few as possible to minimize the term \sum_{(i, j) \in E}(x_i - x_j)^2.
The eigenvector that corresponds to \lambda_2 is our magical vector that puts each node in one cluster or the other.
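The key identity in the derivation, x^TLx = \sum_{(i,j)\in E}(x_i - x_j)^2, is easy to verify numerically (toy example of mine):

```python
import numpy as np
import networkx as nx

G = nx.erdos_renyi_graph(8, 0.4, seed=1)
A = nx.to_numpy_array(G)
L = np.diag(A.sum(axis=1)) - A

x = np.random.default_rng(0).normal(size=len(G))            # arbitrary node labels
quad_form = x @ L @ x
edge_sum = sum((x[i] - x[j]) ** 2 for i, j in G.edges())
print(np.isclose(quad_form, edge_sum))                      # True
```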
Find Optimal Cut [Fiedler '73]
This is an optimal-cut formulation from 1973, where we express the partition of the graph into two components A and B as a vector y as below:
\displaystyle y_i = \begin{cases} +1 & \text{ if } i \in A\\ -1 & \text{ if } i \in B \end{cases}
We also enforce that |A| = |B| \rarr \sum_iy_i = 0, which is equivalent to y being orthogonal to the trivial eigenvector [1,\cdots,1].
Fiedler showed that we can minimize the cut of the partition by finding a vector y that minimizes:
\displaystyle \argmin_{y\in\lbrace -1, +1\rbrace^n}f(y) = \sum_{(i,j) \in E}(y_i - y_j)^2
This is almost the same objective we got when using \lambda_2 for clustering into components. The only difference is that y here can only take the values +1 or -1, unlike in the eigenvalue case, where it can be any real vector satisfying the constraints mentioned above.
Rayleigh Theorem
\displaystyle \min_{y\in \Re^n; \sum_iy_i = 0; \sum_iy_i^2=1}f(y) = \sum_{(i, j)\in E}(y_i - y_j)^2 = y^TLy
Rayleigh Theorem states that
- \lambda_2 = \min_yf(y): the minimum value of f(y) is given by the second-smallest eigenvalue \lambda_2 of the Laplacian matrix L.
- x = \argmin_yf(y): the optimal solution for y is given by the eigenvector x corresponding to \lambda_2, referred to as the Fiedler vector.
- We can use the sign of x_i to determine the cluster assignment of node i.
One can prove that this type of spectral clustering on the graph Laplacian gives a near-optimal solution. There is an approximation guarantee: spectral clustering finds a cut that has at most twice the conductance of the optimal cut of conductance \beta.
Suppose there is a partition of G into A and B where |A| \leq |B|, such that the conductance of the cut (A, B) is \beta; then \lambda_2 \leq \beta.
TODO: The proof for this is in the PDF
Solving for \lambda_2 is essentially minimizing the sum of squared differences between the coordinates (eigenvector labels) of adjacent nodes, under the conditions that the labels sum to zero and their squares sum to one.
Spectral Partitioning Algorithm
Preprocessing
Build the Laplacian matrix L of the graph from its adjacency matrix.
Eigenvalue decomposition
Find the eigenvalues \lambda and eigenvectors x of L. Map each vertex to its corresponding component of x_2.
Grouping
Sort the components of the reduced 1-dimensional vector of node labels and identify clusters by splitting the sorted vector in two. You can choose the splitting point at the median or simply at 0, or you can minimize the normalized cut by sweeping over the ordering of nodes induced by the eigenvector (a small end-to-end sketch follows the figure notes below).
In the above image, after you build the Laplacian matrix of a graph with two obvious clusters, compute x_2, assign its entries to the respective nodes, sort them, and plot the rank of each node on the x-axis against its x_2 value on the y-axis, you can see the clusters emerging.
In the above image, if you do spectral clustering on a graph with 4 obvious clusters and make the same plot of the second eigenvector x_2, you see a 4-step pattern emerge.
In the above image we show the corresponding plots for the 1st and 3rd eigenvector components of the same graph.
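Putting the three steps together, here is a hedged end-to-end sketch on a toy graph with two planted clusters (the graph, parameters and sweep below are my own illustration, not the lecture's code):

```python
import numpy as np
import networkx as nx

G = nx.planted_partition_graph(2, 20, p_in=0.5, p_out=0.05, seed=7)
A = nx.to_numpy_array(G)
L = np.diag(A.sum(axis=1)) - A                     # 1) preprocessing: Laplacian

vals, vecs = np.linalg.eigh(L)                     # 2) decomposition
x2 = vecs[:, 1]                                    # Fiedler vector (second-smallest eigenvalue)

labels_by_sign = x2 > 0                            # 3a) grouping: split at 0

# 3b) grouping by sweep: order nodes by x2 and keep the prefix of lowest conductance.
order = np.argsort(x2)

def conductance(prefix):
    S = set(prefix.tolist())
    cut = sum(1 for u, v in G.edges() if (u in S) != (v in S))
    vol_S = sum(G.degree(n) for n in S)
    vol_rest = sum(d for _, d in G.degree()) - vol_S
    return cut / min(vol_S, vol_rest)

best_r = min(range(1, len(G)), key=lambda r: conductance(order[:r]))
print(sorted(order[:best_r].tolist()))             # should roughly recover one planted block
```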
k-Way Spectral Clustering
Given a graph, how do we make use of the eigen-components of the graph Laplacian to produce k clusters?
There are two approaches:
- Recursive bi-partitioning [Hagen '92]
- Cluster multiple eigenvectors [Shi-Malik '00]
Recursive Bi-partitioning
Recursive bi-partitioning partitions the graph into two components using the second eigenvector x_2, then treats each cluster as its own graph, finds its second eigenvector, and repeats. This is somewhat inefficient and unstable, and the cluster count has to be a power of 2.
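A rough sketch of recursive bi-partitioning under my own assumptions (split by the sign of the Fiedler vector and recurse to a fixed depth, so the cluster count is a power of 2):

```python
import numpy as np
import networkx as nx

def fiedler_split(G):
    nodes = list(G.nodes())
    A = nx.to_numpy_array(G, nodelist=nodes)
    L = np.diag(A.sum(axis=1)) - A
    x2 = np.linalg.eigh(L)[1][:, 1]                       # Fiedler vector of this subgraph
    left = [n for n, v in zip(nodes, x2) if v < 0]
    right = [n for n, v in zip(nodes, x2) if v >= 0]
    return left, right

def recursive_bipartition(G, depth):
    if depth == 0 or G.number_of_nodes() < 2:
        return [list(G.nodes())]
    left, right = fiedler_split(G)
    return (recursive_bipartition(G.subgraph(left), depth - 1) +
            recursive_bipartition(G.subgraph(right), depth - 1))

G = nx.planted_partition_graph(4, 10, p_in=0.6, p_out=0.02, seed=3)
print(recursive_bipartition(G, depth=2))                  # 2**2 = 4 clusters
```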
Cluster multiple eigenvectors
Clustering multiple eigenvectors involves building a reduced space from several eigenvectors: each node is represented by its components in the first k eigenvectors, and we then cluster the nodes based on this k-dimensional representation, e.g. with k-means. This is the recommended approach and more common in recent papers, and it can be used to approximate the optimal k-way normalized cut. It emphasizes cohesive clusters and increases the unevenness in the distribution of the data: associations between similar points are amplified, associations between dissimilar points are attenuated, and the data begins to "approximate a clustering". Multiple eigenvectors also prevent instability due to information loss (a short sketch combining this with the eigengap heuristic follows below).
We make use of the eigengap, the difference between two consecutive eigenvalues, to choose k. The most stable clustering is generally given by the value k that maximizes the eigengap \Delta_k = |\lambda_k - \lambda_{k-1}|: plot the eigenvalues and pick the k at which the plot shows a clear jump.
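A hedged sketch combining the two ideas: embed each node with the first k Laplacian eigenvectors, pick k by the largest eigengap, and cluster with k-means (scikit-learn's KMeans and the toy planted-partition graph are my own assumptions here):

```python
import numpy as np
import networkx as nx
from sklearn.cluster import KMeans

G = nx.planted_partition_graph(4, 15, p_in=0.6, p_out=0.02, seed=11)
A = nx.to_numpy_array(G)
L = np.diag(A.sum(axis=1)) - A

vals, vecs = np.linalg.eigh(L)                     # ascending eigenvalues

gaps = np.diff(vals[:10])                          # eigengaps among the smallest eigenvalues
k = int(np.argmax(gaps)) + 1                       # number of small eigenvalues before the big jump
print("chosen k:", k)                              # expect 4 for this toy graph

embedding = vecs[:, :k]                            # each node -> a k-dimensional point
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
print(labels)
```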
Motif-Based Spectral Clustering
Can we cluster the network based on motifs? We want to detect communities based on network motifs: given a motif M and a graph G, we identify clusters of the network in which M occurs frequently. Different motifs reveal different clustering structures.
We need analogues of cut, volume and conductance for motifs; they are defined as follows:
Motif cut: the number of motif instances that get cut, i.e. that have endpoints on both sides of the partition.
Motif volume vol_M(S): the number of motif endpoints that fall inside the set S.
Motif conductance \phi(S) = \frac{\text{motifs cut}}{vol_M(S)}
In the image above the motif cut is 1 and the volume is 10: counting motif endpoints inside S, the three uncut triangles contribute nine endpoints and the cut motif has one endpoint in S.
Just as finding the optimal cut that minimizes edge conductance is NP-hard, finding a cut that optimizes motif conductance is also NP-hard. But we can use similar spectral clustering techniques with slight changes to get an approximately optimal motif clustering.
Motif Spectral Clustering
Given a motif M and a graph G, find a set of nodes S that minimizes the motif conductance.
Preprocessing:
From the graph G and the motif M we build a new weighted graph W^{(M)}, where W_{ij}^{(M)} is the number of times edge (i, j) participates in an instance of the motif M.
In the above picture, the edge weight between nodes 8 and 9 in the weighted graph is 2, because in the graph G the edge (8, 9) participates in two instances of the motif. The edge weight between nodes 7 and 9 is zero, as there are no instances of the edge (9, 7) participating in the motif M. The weighted graph we build is also an undirected graph.
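A hedged sketch for the triangle motif specifically (my own choice of motif and graph): for triangles, (A^2)_{ij} counts the common neighbours of i and j, so restricting it to existing edges gives W^{(M)}.

```python
import numpy as np
import networkx as nx

G = nx.les_miserables_graph()
nodes = list(G.nodes())
A = (nx.to_numpy_array(G, nodelist=nodes) > 0).astype(float)   # binary adjacency

W = (A @ A) * A              # W_ij = number of triangles that edge (i, j) participates in

# Sanity check: every triangle is counted on each of its three (undirected) edges,
# and W is symmetric, so W.sum() = 6 * (number of triangles).
print(int(W.sum() / 6), sum(nx.triangles(G).values()) // 3)    # the two counts agree
```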
Decomposition
We apply standard spectral clustering to the weighted graph W^{(M)}; this finds clusters of low motif conductance.
Compute the Fiedler vector x associated with \lambda_2 of the Laplacian L^{(M)}, where
L^{(M)} = D^{(M)} - W^{(M)}
Compute: L^{(M)}x = \lambda_2x
Use x to identify communities.
Grouping
We sort the nodes by their values in x, let S_r = \lbrace x_1, \cdots, x_r \rbrace, and compute the motif conductance of each S_r. We select the optimal cut by this sweeping procedure. In the above image we see that S_5 had the optimal conductance, so we finalize the cut to be the nodes \lbrace 1\cdots 5\rbrace.
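A hedged sketch of the whole motif pipeline for the triangle motif (my own toy setup), using the weighted cut and volume of W^{(M)} as a stand-in for motif cut and volume during the sweep:

```python
import numpy as np
import networkx as nx

# Toy graph: two 6-cliques joined by edges (5, 6) and (5, 7), so that even the
# bridging edges participate in a triangle and W^(M) stays connected.
G = nx.barbell_graph(6, 0)
G.add_edge(5, 7)

A = (nx.to_numpy_array(G) > 0).astype(float)
W = (A @ A) * A                                    # triangle-motif edge weights
L_M = np.diag(W.sum(axis=1)) - W                   # motif Laplacian L^(M) = D^(M) - W^(M)

x = np.linalg.eigh(L_M)[1][:, 1]                   # Fiedler vector of L^(M)
order = np.argsort(x)                              # sweep over this ordering

def motif_conductance(prefix):                     # weighted conductance on W^(M)
    S = set(prefix.tolist())
    cut = sum(W[i, j] for i in S for j in range(len(W)) if j not in S)
    vol_S = W[list(S)].sum()
    vol_min = min(vol_S, W.sum() - vol_S)
    return cut / vol_min if vol_min > 0 else np.inf

best_r = min(range(1, len(W)), key=lambda r: motif_conductance(order[:r]))
print(sorted(order[:best_r].tolist()))             # essentially one of the two cliques
```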
We have thus generalized community detection to higher-order structures. Motif conductance admits a motif Cheeger inequality, and the method is simple, fast and scalable.
Other partitioning methods
METIS
Heuristic but works really well in practice.
Graclus
Based on kernel k-means
Louvain
Based on Modularity Optimization
Clique percolation method
For finding overlapping clusters