
2. Properties of Networks and Random Graph Models

How do you characterize properties of networks and what kind of models do we use to generate realistic networks?
Created on December 21 | Last edited on January 13

Key Network Properties

In this section we study various properties of a graph, see what they look like in naturally occurring graphs, and compare them with graphs that we generate using various techniques. We define the following properties:

  • Degree distribution: $P(k)$
  • Path length: $h$
  • Clustering coefficient: $C$
  • Connected components: $s$

Degree Distribution

Degree distribution $P(k)$ is the probability that a randomly chosen node has degree $k$. For a given graph, we plot the histogram of all the node degrees and examine its shape. Below we show the histogram of node degrees of a graph with 10 nodes. On the x-axis we plot node degree $k$ and on the y-axis the frequency of nodes with a given degree.

degree distribution

We can see we have 6 nodes with degree 1 and so on.

For a directed graph we have in-degree and out-degree distributions.
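The degree histogram above can be computed directly from an edge list. Below is a minimal pure-Python sketch; the function name `degree_distribution` and the toy edge list are illustrative choices, not from the lecture:

```python
from collections import Counter

def degree_distribution(edges, n):
    """Return P(k): the fraction of the n nodes that have degree k (undirected)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # Counter returns 0 for nodes that appear in no edge, so isolated nodes count too
    hist = Counter(deg[i] for i in range(n))
    return {k: count / n for k, count in hist.items()}

# Toy graph: a path 0-1-2 plus an isolated node 3
P = degree_distribution([(0, 1), (1, 2)], n=4)
# Degrees are 1, 2, 1, 0, so P(1) = 0.5, P(2) = 0.25, P(0) = 0.25
```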

Paths

A path is a sequence of nodes in which each node is linked to the next one. A path can intersect itself and pass through the same edge multiple times.

$P_n = \{i_0, i_1, i_2, \cdots, i_n\}$

The distance between two nodes is defined as the number of edges along the shortest path connecting them; it is also called the geodesic distance. If the two nodes cannot reach each other, the distance is usually defined as infinite.

In directed graphs, paths must follow the directions of the edges, so distance is not symmetric: $h_{B,C} \neq h_{C,B}$

The diameter of a graph is the maximum shortest-path distance between any pair of nodes; it tells us how "big" the graph is. It is not very useful for real-world networks: when a network has a massive strongly connected cluster plus a very long string of nodes hanging off the connected component, the diameter is dominated by that string and does not give us the full picture of what is happening in the graph.

The average path length for a connected graph (or a strongly connected directed graph) is defined as follows:

$\displaystyle \overline{h} = \frac{1}{2E_{max}}\sum_{i,\, j \neq i} h_{ij}$

Here $h_{ij}$ is the distance from node $i$ to node $j$, and $E_{max} = n(n-1)/2$ is the maximum possible number of edges among $n$ nodes.
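The distances $h_{ij}$ can be found with breadth-first search, and the formula above then averages them over all ordered pairs. A minimal sketch in pure Python (the helper names `bfs_distances` and `average_path_length` are mine):

```python
from collections import deque

def bfs_distances(adj, src):
    """Hop distances from src to every reachable node, via BFS."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def average_path_length(adj):
    """h_bar = (1 / (2 * E_max)) * sum over ordered pairs, 2 * E_max = n(n-1)."""
    n = len(adj)
    total = sum(d for src in adj for d in bfs_distances(adj, src).values())
    return total / (n * (n - 1))

# Path graph 0-1-2: pairwise distances are 1, 2, 1, so h_bar = 4/3
adj = {0: [1], 1: [0, 2], 2: [1]}
```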

Clustering Coefficient

The clustering coefficient measures how connected a given node's neighbours are with each other. If node $i$ has $k_i$ neighbours, the total number of edges possible among those neighbours is $k_i(k_i-1)/2$. If $e_i$ is the actual number of edges in the neighbourhood of node $i$, the clustering coefficient is defined as $C_i = \frac{2e_i}{k_i(k_i - 1)}$, which lies between 0 and 1. For nodes with degree 0 or 1 the coefficient is undefined (or defined to be 0). If the neighbours of node $i$ have no edges among themselves and connect only to node $i$, then $C_i = 0$. For social networks the clustering coefficient tends to be very high, since the friends of a given person tend to be friends with each other as well.

The average clustering coefficient of a graph is defined as $\displaystyle C = \frac{1}{N}\sum_{i=1}^{N} C_i$
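The definitions above translate directly into code: count edges among a node's neighbours, then normalize. A small pure-Python sketch, with illustrative names and a toy graph of my own choosing:

```python
def clustering(adj, i):
    """C_i = 2*e_i / (k_i*(k_i-1)); defined as 0 for nodes of degree < 2."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # e_i: count each neighbour pair (a, b) once, checking if they share an edge
    e = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return 2 * e / (k * (k - 1))

def avg_clustering(adj):
    return sum(clustering(adj, i) for i in adj) / len(adj)

# Triangle 0-1-2 plus a pendant node 3 attached to node 2
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
# C_0 = C_1 = 1, C_2 = 1/3 (one edge among three neighbour pairs), C_3 = 0
```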

Connectivity

Connectivity refers to the size of the largest connected component: the largest set of nodes in which any two vertices are connected by a path. We call this the largest component or the giant component.

We can find out the largest component of a graph using Breadth First Search.

  • Start from a random node, perform BFS, and track all the nodes visited.
  • If all the nodes are visited, the graph is connected. Otherwise repeat the process from an unvisited node.
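The procedure above can be sketched in a few lines of pure Python; the function name `connected_components` and the toy graph are illustrative:

```python
from collections import deque

def connected_components(adj):
    """Repeated BFS from unvisited nodes; returns a list of components (sets)."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, q = {start}, deque([start])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    q.append(v)
        seen |= comp
        comps.append(comp)
    return comps

# Two components: a triangle {0, 1, 2} and an edge {3, 4}
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
giant = max(connected_components(adj), key=len)
```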


Network Properties on MSN Messenger Network

The professor shows a sample real-world network and all of the properties we defined in the previous section. The network he chose is the MSN Messenger activity network: 245 million logged-in users, 180 million users engaged in conversation, more than 30 billion conversations, and more than 255 billion messages exchanged.

The MSN Messenger network consists of 180 million nodes and 1.3 billion edges. Two nodes (users) have an edge between them if they exchanged at least one message.

Degree Distribution on a log-log scale

This is what the log-log degree distribution plot looks like for the MSN Messenger network. The majority of nodes have small degrees such as 2 or 3, and as we go farther along the x-axis we see that the number of nodes with degree over 1000 is close to 1.

log log plot of degree distribution

Clustering Coefficient

The plot of the average clustering coefficient of nodes with degree $k$ against the degree looks like below.

avg clustering coefficient vs degree

On the y-axis is $C_k$, the average clustering coefficient of all the nodes with degree $k$.

$\displaystyle C_k = \frac{1}{N_k}\sum_{i:\, k_i = k} C_i$

For the MSN network, $C_k \propto k^{-0.37}$, and the average clustering coefficient is 0.11.

Connected Components

Here we plot the histogram of the sizes of all the connected components: on the y-axis the count of components of a given size, and on the x-axis the size of the connected component.

connected components plot

We find that there is one component close to 100 million nodes in size; this communication network is well connected. 99.9 percent of the nodes are in the largest connected component, which is about a million times larger than the second largest. There are also lots of small components.

The plot below is a histogram of path distances between nodes in the largest connected component of the MSN network.

path histogram

We can see that 90% of the nodes are within 8 hops of each other, the average path length is 6.6, and the maximum path distance is around 30.

MSN Network Summary

  • The degree distribution $P(k)$ is heavily skewed, with an average node degree of 14.4
  • Average path length of 6.6; most nodes are within 8 hops of each other.
  • Average clustering coefficient of 0.11
  • As for connectivity, 99.9% of the nodes are part of a single huge connected component.

We want to understand how common these key properties are for graphs in general, and where real networks deviate from the "norm". Are these key network properties expected? Are they surprising?



Network properties on PPI Network

In this section we go through the key network properties of a Protein Protein Interaction network which is another real world network.

The nodes of this PPI network are proteins (N = 2018), and the edges are binding interactions between them (E = 2930).

  • Degree distribution: skewed just like in the MSN network, with an average degree of 2.90
  • Path length: the average path length is 5.8
  • Clustering: average clustering coefficient of 0.12
  • Connectivity: 185 components, with the largest component having 1647 (81%) of the nodes

This network comes from a very different domain and is much smaller than the MSN network, yet it has similar properties.



Erdos-Renyi Random Graph Model

Our goal is to model random graphs that have similar node/edge counts to real world networks and see where the network properties differ.

The Erdos-Renyi random graph model is one such model for generating graphs.

There are two variants: $G_{np}$ and $G_{nm}$.

$G_{np}$ is an undirected graph on $n$ nodes where each edge $(u, v)$ appears independently with probability $p$.

$G_{nm}$ is an undirected graph on $n$ nodes with $m$ edges picked uniformly at random.

What kind of networks do these simple models generate?

In a $G_{np}$, $n$ and $p$ do not uniquely determine the network: each time we sample, we get a different network with a different number of edges.
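Sampling a $G_{np}$ is a matter of one biased coin flip per candidate edge. A minimal pure-Python sketch (the function name `gnp` and the parameter choices are mine); each run with a different seed yields a different network:

```python
import random

def gnp(n, p, seed=None):
    """Sample G_{n,p}: include each of the n(n-1)/2 possible edges with prob p."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

g = gnp(1000, 0.01, seed=42)
avg_deg = sum(len(nbrs) for nbrs in g.values()) / len(g)
# The expected average degree is p*(n-1) = 9.99; a sample should land near it
```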

Degree Distribution

The degree distribution of $G_{np}$ is binomial. It looks like a discrete version of a Gaussian.

$\displaystyle P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$

For a node $i$, the probability that it has edges to exactly $k$ of the remaining $n-1$ nodes is given by the equation above, where $p$ is the probability of an edge.

Mean: $\overline{k} = p(n-1)$; variance: $\sigma^2 = p(1-p)(n-1)$.

If we look at how the standard deviation compares to the average node degree, we get

$\displaystyle \frac{\sigma}{\overline{k}} = \left[\frac{1-p}{p} \cdot \frac{1}{n-1}\right]^{1/2} \simeq \frac{1}{(n-1)^{1/2}}$

As the number of nodes in the random graph increases, the relative spread of the degree around its mean decreases, and the distribution narrows around the average node degree. From this we can already see that the degree distribution of $G_{np}$ looks very different from that of real-world networks like the MSN and PPI networks above.
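Plugging numbers into the ratio above makes the narrowing concrete. A quick evaluation (fixing $p$ and growing $n$; the helper name `rel_spread` is mine):

```python
import math

def rel_spread(n, p=0.1):
    """sigma / k_bar = sqrt((1-p)/p * 1/(n-1)) for a G_{n,p}."""
    return math.sqrt((1 - p) / p / (n - 1))

spreads = [rel_spread(n) for n in (10**2, 10**4, 10**6)]
# The ratio shrinks roughly like 1/sqrt(n-1), so the degree distribution
# concentrates ever more tightly around the mean as the graph grows
```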

Clustering Coefficient of GnpG_{np}

We know that $\displaystyle C_i = \frac{2e_i}{k_i(k_i - 1)}$ and that edges in a $G_{np}$ appear with probability $p$.

From the above, the expected number of edges $e_i$ within the neighbourhood of a node $i$ with degree $k_i$ is $\displaystyle E[e_i] = p\,\frac{k_i(k_i - 1)}{2}$: we simply multiply $p$ by the number of edges possible among the $k_i$ neighbourhood nodes.

Hence $\displaystyle E[C_i] = \frac{p \cdot k_i(k_i - 1)}{k_i(k_i - 1)} = p = \frac{\overline{k}}{n-1} \simeq \frac{\overline{k}}{n}$

From this we can see that if we grow the graph while keeping the average node degree fixed, the clustering coefficient $C$ decreases with the graph size $n$.
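The identity $E[C_i] = p$ is easy to verify empirically: sample a $G_{np}$ and measure its average clustering coefficient. A self-contained sketch under assumed parameters ($n = 300$, $p = 0.05$):

```python
import random

def gnp_adj(n, p, rng):
    """Sample a G_{n,p} adjacency structure."""
    adj = {i: set() for i in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def avg_clustering(adj):
    """Average of C_i = 2*e_i / (k_i*(k_i-1)), taking C_i = 0 when k_i < 2."""
    total = 0.0
    for i, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        e = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2 * e / (k * (k - 1))
    return total / len(adj)

rng = random.Random(0)
C = avg_clustering(gnp_adj(300, 0.05, rng))
# E[C_i] = p, so the measured C should come out close to 0.05
```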

Path

For this, let's first define the expansion of a graph. The intuition behind the term is: for any subset of nodes, by what factor does the subset "expand", i.e. how many edges does it have to the rest of the network, compared to the number of nodes in the subset?

Graph $G(V, E)$ has expansion $\alpha$ if $\forall S \subseteq V$: the number of edges leaving $S$ is $\geq \alpha \cdot \min(|S|, |V \setminus S|)$.

$\displaystyle \alpha = \min_{S \subseteq V} \frac{\text{number of edges leaving } S}{\min(|S|, |V \setminus S|)}$

expansion viz

From the definition of expansion: in a graph on $n$ nodes with expansion $\alpha$, every pair of nodes is connected by a path of length $O((\log n)/\alpha)$.

This part was not entirely clear to me: exactly how expansion and average path length are related. But it turns out random graphs have good expansion, so BFS needs only a logarithmic number of steps to visit all the nodes.

So Erdos-Renyi random graphs can grow very large while the nodes stay just a few hops apart: the average path length grows logarithmically with $n$. Below is the plot of average shortest path length as the number of nodes increases, keeping the average degree $k$ constant.

path distance vs num nodes

Connected Components

Here we study how the formation of components is affected as we vary $p$ from 0 to 1 for a network of size $n$. With edge probability $p$, the average degree is $k = p(n-1)$, i.e.

$p = \frac{k}{n-1}$

When the average degree $k = 1$, i.e. at $p = \frac{1}{n-1}$, every node has one expected edge, and a giant component starts to form in the graph. At $k = 1 - \epsilon$ all components have size $O(\log n)$, while at $k = 1 + \epsilon$ a single component of size $\Omega(n)$ appears and the others have size $O(\log n)$. As $k$ moves past 1 and $p$ increases toward 1, the giant component becomes ever more connected, with fewer isolated nodes, eventually becoming a complete graph at $p = 1$.

The image below shows, for a $G_{np}$ with n = 100k, how the fraction of nodes in the largest connected component varies with the average degree (equivalently, with $p$). The giant component starts forming at $k = 1$.

connected graph size vs k
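The phase transition at $k = 1$ is easy to reproduce in a small simulation: sample a $G_{np}$ below and above the threshold and compare the largest component. A self-contained sketch (parameter choices and the name `giant_fraction` are mine; $n$ is kept small so it runs quickly):

```python
import random
from collections import deque

def giant_fraction(n, k_bar, rng):
    """Fraction of nodes in the largest component of G_{n,p}, p = k_bar/(n-1)."""
    p = k_bar / (n - 1)
    adj = {i: [] for i in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    # Find the largest component via repeated BFS
    seen, best = set(), 0
    for s in range(n):
        if s in seen:
            continue
        comp, q = {s}, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    q.append(v)
        seen |= comp
        best = max(best, len(comp))
    return best / n

rng = random.Random(1)
below = giant_fraction(2000, 0.5, rng)  # k_bar < 1: only small components
above = giant_fraction(2000, 3.0, rng)  # k_bar > 1: a giant component appears
```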



Real Networks vs Gnp

We can see that real networks differ from randomly generated networks in two ways. First, the degree distribution of a random graph is binomial, which is not at all the case for real networks. Second, the clustering coefficient of random graphs is far too low: no local structure emerges in them. Both do have giant connected components and comparable average path lengths, though the giant component of a real network does not emerge through a phase transition.

| Property | MSN | $G_{np}$ |
| --- | --- | --- |
| Degree distribution | skewed | binomial |
| Avg. path length | 6.6 | $O(\log n) \approx 8.2$ |
| Avg. clustering coeff. | 0.11 | $\overline{k}/n \approx 8 \cdot 10^{-8}$ |
| Largest conn. component | 99% | exists when $\overline{k} > 1$ |

Obviously, real networks are not random. But we use $G_{np}$ as a reference model: it lets us calculate many quantities that can be compared to real data, and it helps us understand to what degree a particular property is the result of some random process.



The Small-World Model

The motivation here is to generate a random graph with high clustering but a small diameter/average path length. The Erdos-Renyi random graph has a short average path length but very little clustering: the MSN network has clustering seven orders of magnitude larger than the corresponding $G_{np}$.

| Network | $h_{actual}$ | $h_{random}$ | $C_{actual}$ | $C_{random}$ |
| --- | --- | --- | --- | --- |
| Film actors | 3.65 | 2.99 | 0.79 | 0.00027 |
| Power grid | 18.70 | 12.40 | 0.080 | 0.005 |
| C. elegans | 2.65 | 2.25 | 0.28 | 0.05 |

In the table above, random graphs with the same average degree as each real graph were generated and their properties compared. The average path lengths are comparable, but the clustering coefficients differ by orders of magnitude.



Kronecker Graph Model

The intuition behind Kronecker graph models is to generate bigger graphs recursively, making them self-similar, starting from a seed or initiator adjacency matrix: the whole has the same shape as one or more of its parts. The Kronecker product is the tool we use to generate such self-similar matrices.

$K_1 = \begin{pmatrix} 1 & 1 & 0\\ 1 & 1 & 1\\ 0 & 1 & 1 \end{pmatrix}_{3 \times 3}$

$K_2 = K_1 \otimes K_1 = \begin{pmatrix} K_1 & K_1 & 0\\ K_1 & K_1 & K_1\\ 0 & K_1 & K_1 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0\\ 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0\\ 0 & 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0\\ 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 0\\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1\\ 0 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1\\ 0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 & 0\\ 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 1\\ 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 \end{pmatrix}_{9 \times 9}$

The picture below shows the adjacency matrix after taking the Kronecker product of the $9 \times 9$ matrix with itself one more time, giving $K_2 \otimes K_2$, an $81 \times 81$ adjacency matrix.

Kronecker product

The Kronecker product of two matrices A and B is given by the equation below.

$\displaystyle C = A_{n \times m} \otimes B_{k \times l} = \begin{pmatrix} a_{1,1}B & a_{1,2}B & \cdots & a_{1,m}B\\ a_{2,1}B & a_{2,2}B & \cdots & a_{2,m}B\\ \vdots & \vdots & \ddots & \vdots\\ a_{n,1}B & a_{n,2}B & \cdots & a_{n,m}B \end{pmatrix}_{nk \times ml}$

A Kronecker graph is obtained by growing a sequence of graphs, iterating the Kronecker product over the initiator matrix $K_1$:

$K_1^{[m]} = K_m = \underbrace{K_1 \otimes K_1 \otimes \cdots \otimes K_1}_{m\text{ times}} = K_{m-1} \otimes K_1$

Below we can see Kronecker graphs with different initiators $K_1$ after 3 iterations.

Kronecker graphs
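The Kronecker product itself takes only a few lines. Below is a minimal pure-Python sketch for dense matrices stored as lists of lists (the function name `kron` is mine; real libraries offer equivalents):

```python
def kron(A, B):
    """Kronecker product of two dense matrices given as lists of lists."""
    rB, cB = len(B), len(B[0])
    # Entry (i, j) of A (x) B is A[i // rB][j // cB] * B[i % rB][j % cB]
    return [[A[i // rB][j // cB] * B[i % rB][j % cB]
             for j in range(len(A[0]) * cB)]
            for i in range(len(A) * rB)]

K1 = [[1, 1, 0],
      [1, 1, 1],
      [0, 1, 1]]
K2 = kron(K1, K1)  # the 9 x 9 matrix shown above
K3 = kron(K2, K1)  # 27 x 27, and so on
```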

Stochastic Kronecker Graphs

Instead of 1s and 0s we can use a probability matrix to introduce some stochasticity into the generation. We start with an $n \times n$ probability matrix $\Theta_1$ and compute its $k^{th}$ Kronecker power $\Theta_k$. Then, for each entry $p_{uv}$ of $\Theta_k$, we include the edge $(u, v)$ in $K_k$ with probability $p_{uv}$.

$\Theta_1 = \begin{pmatrix} 0.5 & 0.2\\ 0.1 & 0.3 \end{pmatrix} \rightarrow \Theta_2 = \Theta_1 \otimes \Theta_1 = \begin{pmatrix} 0.25 & 0.10 & 0.10 & 0.04\\ 0.05 & 0.15 & 0.02 & 0.06\\ 0.05 & 0.02 & 0.15 & 0.06\\ 0.01 & 0.03 & 0.03 & 0.09 \end{pmatrix}$
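The naive sampling scheme described above (one biased coin per entry of $\Theta_k$) can be sketched directly; the names `stochastic_kronecker` and `kron` are mine, and the seed and $k$ are arbitrary choices:

```python
import random

def kron(A, B):
    """Kronecker product of two probability matrices (lists of lists)."""
    rB, cB = len(B), len(B[0])
    return [[A[i // rB][j // cB] * B[i % rB][j % cB]
             for j in range(len(A[0]) * cB)]
            for i in range(len(A) * rB)]

def stochastic_kronecker(theta1, k, seed=None):
    """Take the k-th Kronecker power of theta1, then flip one biased coin
    per entry p_uv to decide whether edge (u, v) is present."""
    rng = random.Random(seed)
    theta = theta1
    for _ in range(k - 1):
        theta = kron(theta, theta1)
    n = len(theta)
    edges = [(u, v) for u in range(n) for v in range(n)
             if rng.random() < theta[u][v]]
    return n, edges

theta1 = [[0.5, 0.2],
          [0.1, 0.3]]
n, edges = stochastic_kronecker(theta1, 3, seed=7)
# n = 8; the expected edge count equals the sum of all entries of theta_3
```

This flips $n^2$ coins, which is exactly the cost the next section proposes to avoid by exploiting the recursive structure.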

Generation of Kronecker Graphs

After generating the probability matrix $\Theta_m$, obtaining an actual adjacency matrix requires flipping a biased coin $n^2$ times. Instead, we can exploit the recursive structure of the Kronecker graph to sample edges more efficiently.

TODO

Estimation: Epinions (n=76k, m=510k)

Kronecker graphs have network properties very similar to those of real graphs. The image below shows a Kronecker graph with $\Theta_1 = \begin{pmatrix}0.99 & 0.54\\0.49 & 0.13\end{pmatrix}$ and its properties compared to the Epinions graph with n = 76k and m = 510k.

kronecker graph comparison