
2. Properties of Networks and Random Graph Models

How do you characterize properties of networks and what kind of models do we use to generate realistic networks?
Created on December 21 | Last edited on January 13

Key Network Properties

In this section we study various properties of a graph, see what they look like in naturally occurring graphs, and compare them with graphs that we generate using various techniques. We define the following properties:

  • Degree distribution: $P(k)$
  • Path length: $h$
  • Clustering coefficient: $C$
  • Connected components: $s$

Degree Distribution

Degree distribution $P(k)$ is the probability that a randomly chosen node has degree $k$. For a given graph, we plot the histogram of all the node degrees and examine its shape. Below we show the histogram of node degrees of a graph with 10 nodes. On the x-axis we plot node degree $k$ and on the y-axis the frequency of nodes with a given degree.

degree distribution

We can see we have 6 nodes with degree 1 and so on.

For a directed graph we have in-degree and out-degree distributions.
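The degree histogram above can be computed directly from an edge list. Below is a minimal pure-Python sketch; the function name `degree_distribution` and the toy edge list are illustrative choices, not from the lecture:

```python
from collections import Counter

def degree_distribution(edges, n):
    """Return P(k): the fraction of the n nodes that have degree k (undirected)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # Counter returns 0 for nodes that appear in no edge, so isolated nodes count too
    hist = Counter(deg[i] for i in range(n))
    return {k: count / n for k, count in hist.items()}

# Toy graph: a path 0-1-2 plus an isolated node 3
P = degree_distribution([(0, 1), (1, 2)], n=4)
# Degrees are 1, 2, 1, 0, so P(1) = 0.5, P(2) = 0.25, P(0) = 0.25
```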

Paths

A path is a sequence of nodes in which each node is linked to the next one. A path can intersect itself and pass through the same edge multiple times.

$P_n = \{i_0, i_1, i_2, \cdots, i_n\}$

The distance between two nodes is defined as the number of edges along the shortest path connecting them; it is also called the geodesic distance. If the two nodes cannot reach each other, the distance is usually defined as infinite.

In directed graphs, paths must follow the directions of the edges, so distance is not symmetric: $h_{B,C} \neq h_{C,B}$

The diameter of a graph is the maximum shortest-path distance between any pair of nodes; it tells us how "big" the graph is. It is not very useful for real-world networks: when a network has a massive strongly connected cluster plus a very long string of nodes hanging off the connected component, the diameter is dominated by that string and does not give us the full picture of what is happening in the graph.

The average path length for a connected graph (or a strongly connected directed graph) is defined as follows:

$\displaystyle \overline{h} = \frac{1}{2E_{max}}\sum_{i,\, j \neq i} h_{ij}$

Here $h_{ij}$ is the distance from node $i$ to node $j$, and $E_{max} = n(n-1)/2$ is the maximum possible number of edges among $n$ nodes.
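The distances $h_{ij}$ can be found with breadth-first search, and the formula above then averages them over all ordered pairs. A minimal sketch in pure Python (the helper names `bfs_distances` and `average_path_length` are mine):

```python
from collections import deque

def bfs_distances(adj, src):
    """Hop distances from src to every reachable node, via BFS."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def average_path_length(adj):
    """h_bar = (1 / (2 * E_max)) * sum over ordered pairs, 2 * E_max = n(n-1)."""
    n = len(adj)
    total = sum(d for src in adj for d in bfs_distances(adj, src).values())
    return total / (n * (n - 1))

# Path graph 0-1-2: pairwise distances are 1, 2, 1, so h_bar = 4/3
adj = {0: [1], 1: [0, 2], 2: [1]}
```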

Clustering Coefficient

The clustering coefficient measures how connected a given node's neighbours are with each other. If node $i$ has $k_i$ neighbours, the total number of edges possible among those neighbours is $k_i(k_i-1)/2$. If $e_i$ is the actual number of edges in the neighbourhood of node $i$, the clustering coefficient is defined as $C_i = \frac{2e_i}{k_i(k_i - 1)}$, which lies between 0 and 1. For nodes with degree 0 or 1 the coefficient is undefined (or defined to be 0). If the neighbours of node $i$ have no edges among themselves and connect only to node $i$, then $C_i = 0$. For social networks the clustering coefficient tends to be very high, since the friends of a given person tend to be friends with each other as well.

The average clustering coefficient of a graph is defined as $\displaystyle C = \frac{1}{N}\sum_{i=1}^{N} C_i$
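The definitions above translate directly into code: count edges among a node's neighbours, then normalize. A small pure-Python sketch, with illustrative names and a toy graph of my own choosing:

```python
def clustering(adj, i):
    """C_i = 2*e_i / (k_i*(k_i-1)); defined as 0 for nodes of degree < 2."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # e_i: count each neighbour pair (a, b) once, checking if they share an edge
    e = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return 2 * e / (k * (k - 1))

def avg_clustering(adj):
    return sum(clustering(adj, i) for i in adj) / len(adj)

# Triangle 0-1-2 plus a pendant node 3 attached to node 2
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
# C_0 = C_1 = 1, C_2 = 1/3 (one edge among three neighbour pairs), C_3 = 0
```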

Connectivity

Connectivity refers to the size of the largest connected component: the largest set of nodes in which any two vertices are connected by a path. We call this the largest component or the giant component.

We can find out the largest component of a graph using Breadth First Search.

  • Start from a random node, perform BFS, and track all the nodes visited.
  • If all the nodes are visited, the graph is connected. Otherwise repeat the process from an unvisited node.
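The procedure above can be sketched in a few lines of pure Python; the function name `connected_components` and the toy graph are illustrative:

```python
from collections import deque

def connected_components(adj):
    """Repeated BFS from unvisited nodes; returns a list of components (sets)."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, q = {start}, deque([start])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    q.append(v)
        seen |= comp
        comps.append(comp)
    return comps

# Two components: a triangle {0, 1, 2} and an edge {3, 4}
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
giant = max(connected_components(adj), key=len)
```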


Network Properties on MSN Messenger Network

The professor shows a sample real-world network and all of the properties we defined in the previous section. The network he chose is the MSN Messenger activity network: 245 million logged-in users, 180 million users engaged in conversation, more than 30 billion conversations, and more than 255 billion messages exchanged.

The MSN Messenger network consists of 180 million nodes and 1.3 billion edges. Two nodes (users) have an edge between them if they exchanged at least one message.

Degree Distribution on a log-log scale

This is what the log-log degree distribution plot looks like for the MSN Messenger network. The majority of nodes have small degrees such as 2 or 3, and as we go farther along the x-axis we see that the number of nodes with degree over 1000 is close to 1.

log log plot of degree distribution

Clustering Coefficient

The plot of the average clustering coefficient of nodes with degree $k$ against the degree looks like below.

avg clustering coefficient vs degree

On the y-axis is $C_k$, the average clustering coefficient of all the nodes with degree $k$.

$\displaystyle C_k = \frac{1}{N_k}\sum_{i:\, k_i = k} C_i$

For the MSN network, $C_k \propto k^{-0.37}$, and the average clustering coefficient is 0.11.

Connected Components

Here we plot the histogram of the sizes of all the connected components: on the y-axis the count of components of a given size, and on the x-axis the size of the connected component.

connected components plot

We find that there is one component close to 100 million nodes in size; this communication network is well connected. 99.9 percent of the nodes are in the largest connected component, which is about a million times larger than the second largest. There are also lots of small components.

The plot below is a histogram of path distances between nodes in the largest connected component of the MSN network.

path histogram

We can see that 90% of the nodes are within 8 hops of each other, the average path length is 6.6, and the maximum path distance is around 30.

MSN Network Summary

  • The degree distribution $P(k)$ is heavily skewed, with an average node degree of 14.4
  • Average path length of 6.6; most nodes are within 8 hops of each other.
  • Average clustering coefficient of 0.11
  • As for connectivity, 99.9% of the nodes are part of a single huge connected component.

We want to understand how common these key properties are for graphs in general, and where real networks deviate from the "norm". Are these key network properties expected? Are they surprising?



Network properties on PPI Network

In this section we go through the key network properties of a Protein Protein Interaction network which is another real world network.

The nodes of this PPI network are proteins (N = 2018), and the edges are binding interactions between them (E = 2930).

  • Degree distribution: skewed just like in the MSN network, with an average degree of 2.90
  • Path length: the average path length is 5.8
  • Clustering: average clustering coefficient of 0.12
  • Connectivity: 185 components, with the largest component having 1647 (81%) of the nodes

This network comes from a very different domain and is much smaller than the MSN network, yet it has similar properties.



Erdos-Renyi Random Graph Model

Our goal is to model random graphs that have similar node/edge counts to real world networks and see where the network properties differ.

The Erdos-Renyi random graph model is one such model for generating graphs.

There are two variants: $G_{np}$ and $G_{nm}$.

$G_{np}$ is an undirected graph on $n$ nodes where each edge $(u, v)$ appears independently with probability $p$.

$G_{nm}$ is an undirected graph on $n$ nodes with $m$ edges picked uniformly at random.

What kind of networks do these simple models generate?

In a $G_{np}$, $n$ and $p$ do not uniquely determine the network: each time we sample, we get a different network with a different number of edges.
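Sampling a $G_{np}$ is a matter of one biased coin flip per candidate edge. A minimal pure-Python sketch (the function name `gnp` and the parameter choices are mine); each run with a different seed yields a different network:

```python
import random

def gnp(n, p, seed=None):
    """Sample G_{n,p}: include each of the n(n-1)/2 possible edges with prob p."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

g = gnp(1000, 0.01, seed=42)
avg_deg = sum(len(nbrs) for nbrs in g.values()) / len(g)
# The expected average degree is p*(n-1) = 9.99; a sample should land near it
```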

Degree Distribution

The degree distribution of $G_{np}$ is binomial. It looks like a discrete version of a Gaussian.

$\displaystyle P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$

For a node $i$, the probability that it has edges to exactly $k$ of the remaining $n-1$ nodes is given by the equation above, where $p$ is the probability of an edge.

Mean: $\overline{k} = p(n-1)$; variance: $\sigma^2 = p(1-p)(n-1)$.

If we look at how the standard deviation compares to the average node degree, we get

$\displaystyle \frac{\sigma}{\overline{k}} = \left[\frac{1-p}{p} \cdot \frac{1}{n-1}\right]^{1/2} \simeq \frac{1}{(n-1)^{1/2}}$

As the number of nodes in the random graph increases, the relative spread of the degree around its mean decreases, and the distribution narrows around the average node degree. From this we can already see that the degree distribution of $G_{np}$ looks very different from that of real-world networks like the MSN and PPI networks above.
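Plugging numbers into the ratio above makes the narrowing concrete. A quick evaluation (fixing $p$ and growing $n$; the helper name `rel_spread` is mine):

```python
import math

def rel_spread(n, p=0.1):
    """sigma / k_bar = sqrt((1-p)/p * 1/(n-1)) for a G_{n,p}."""
    return math.sqrt((1 - p) / p / (n - 1))

spreads = [rel_spread(n) for n in (10**2, 10**4, 10**6)]
# The ratio shrinks roughly like 1/sqrt(n-1), so the degree distribution
# concentrates ever more tightly around the mean as the graph grows
```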

Clustering Coefficient of GnpG_{np}

We know that $\displaystyle C_i = \frac{2e_i}{k_i(k_i - 1)}$ and that edges in a $G_{np}$ appear with probability $p$.

From the above, the expected number of edges $e_i$ within the neighbourhood of a node $i$ with degree $k_i$ is $\displaystyle E[e_i] = p\,\frac{k_i(k_i - 1)}{2}$: we simply multiply $p$ by the number of edges possible among the $k_i$ neighbourhood nodes.

Hence $\displaystyle E[C_i] = \frac{p \cdot k_i(k_i - 1)}{k_i(k_i - 1)} = p = \frac{\overline{k}}{n-1} \simeq \frac{\overline{k}}{n}$

From this we can see that if we grow the graph while keeping the average node degree fixed, the clustering coefficient $C$ decreases with the graph size $n$.
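The identity $E[C_i] = p$ is easy to verify empirically: sample a $G_{np}$ and measure its average clustering coefficient. A self-contained sketch under assumed parameters ($n = 300$, $p = 0.05$):

```python
import random

def gnp_adj(n, p, rng):
    """Sample a G_{n,p} adjacency structure."""
    adj = {i: set() for i in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def avg_clustering(adj):
    """Average of C_i = 2*e_i / (k_i*(k_i-1)), taking C_i = 0 when k_i < 2."""
    total = 0.0
    for i, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        e = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2 * e / (k * (k - 1))
    return total / len(adj)

rng = random.Random(0)
C = avg_clustering(gnp_adj(300, 0.05, rng))
# E[C_i] = p, so the measured C should come out close to 0.05
```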

Path

For this, let's first define the expansion of a graph. The intuition behind the term is: for any subset of nodes, by what factor does the subset "expand", i.e. how many edges does it have to the rest of the network, compared to the number of nodes in the subset?

Graph $G(V, E)$ has expansion $\alpha$ if $\forall S \subseteq V$: the number of edges leaving $S$ is $\geq \alpha \cdot \min(|S|, |V \setminus S|)$.

$\displaystyle \alpha = \min_{S \subseteq V} \frac{\text{number of edges leaving } S}{\min(|S|, |V \setminus S|)}$

expansion viz

From the definition of expansion: in a graph on $n$ nodes with expansion $\alpha$, every pair of nodes is connected by a path of length $O((\log n)/\alpha)$.

This part was not entirely clear to me: exactly how expansion and average path length are related. But it turns out random graphs have good expansion, so BFS needs only a logarithmic number of steps to visit all the nodes.

So Erdos-Renyi random graphs can grow very large while the nodes stay just a few hops apart: the average path length grows logarithmically with $n$. Below is the plot of average shortest path length as the number of nodes increases, keeping the average degree $k$ constant.

path distance vs num nodes

Connected Components

Here we study how the formation of components is affected as we vary $p$ from 0 to 1 for a network of size $n$. With edge probability $p$, the average degree is $k = p(n-1)$, i.e.

$p = \frac{k}{n-1}$

When the average degree $k = 1$, i.e. at $p = \frac{1}{n-1}$, every node has one expected edge, and a giant component starts to form in the graph. At $k = 1 - \epsilon$ all components have size $O(\log n)$, while at $k = 1 + \epsilon$ a single component of size $\Omega(n)$ appears and the others have size $O(\log n)$. As $k$ moves past 1 and $p$ increases toward 1, the giant component becomes ever more connected, with fewer isolated nodes, eventually becoming a complete graph at $p = 1$.

The image below shows, for a $G_{np}$ with n = 100k, how the fraction of nodes in the largest connected component varies with the average degree (equivalently, with $p$). The giant component starts forming at $k = 1$.

connected graph size vs k
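The phase transition at $k = 1$ is easy to reproduce in a small simulation: sample a $G_{np}$ below and above the threshold and compare the largest component. A self-contained sketch (parameter choices and the name `giant_fraction` are mine; $n$ is kept small so it runs quickly):

```python
import random
from collections import deque

def giant_fraction(n, k_bar, rng):
    """Fraction of nodes in the largest component of G_{n,p}, p = k_bar/(n-1)."""
    p = k_bar / (n - 1)
    adj = {i: [] for i in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    # Find the largest component via repeated BFS
    seen, best = set(), 0
    for s in range(n):
        if s in seen:
            continue
        comp, q = {s}, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    q.append(v)
        seen |= comp
        best = max(best, len(comp))
    return best / n

rng = random.Random(1)
below = giant_fraction(2000, 0.5, rng)  # k_bar < 1: only small components
above = giant_fraction(2000, 3.0, rng)  # k_bar > 1: a giant component appears
```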



Real Networks vs Gnp

We can see that real networks differ from randomly generated networks in two ways. First, the degree distribution of a random graph is binomial, which is not at all the case for real networks. Second, the clustering coefficient of random graphs is far too low: no local structure emerges in them. Both do have giant connected components and comparable average path lengths, though the giant component of a real network does not emerge through a phase transition.

| Property | MSN | $G_{np}$ |
| --- | --- | --- |
| Degree distribution | skewed | binomial |
| Avg. path length | 6.6 | $O(\log n) \approx 8.2$ |
| Avg. clustering coeff. | 0.11 | $\overline{k}/n \approx 8 \cdot 10^{-8}$ |
| Largest conn. component | 99% | exists when $\overline{k} > 1$ |

Obviously, real networks are not random. But we use $G_{np}$ as a reference model: it lets us calculate many quantities that can be compared to real data, and it helps us understand to what degree a particular property is the result of some random process.



The Small-World Model

The motivation here is to generate a random graph with high clustering but a small diameter/average path length. The Erdos-Renyi random graph has a short average path length but very little clustering: the MSN network has clustering seven orders of magnitude larger than the corresponding $G_{np}$.

| Network | $h_{actual}$ | $h_{random}$ | $C_{actual}$ | $C_{random}$ |
| --- | --- | --- | --- | --- |
| Film actors | 3.65 | 2.99 | 0.79 | 0.00027 |
| Power grid | 18.70 | 12.40 | 0.080 | 0.005 |
| C. elegans | 2.65 | 2.25 | 0.28 | 0.05 |

In the table above, random graphs with the same average degree as each real graph were generated and their properties compared. The average path lengths are comparable, but the clustering coefficients differ by orders of magnitude.



Kronecker Graph Model

The intuition behind Kronecker graph models is to generate bigger graphs recursively, making them self-similar, starting from a seed or initiator adjacency matrix: the whole has the same shape as one or more of its parts. The Kronecker product is the tool we use to generate such self-similar matrices.

$K_1 = \begin{pmatrix} 1 & 1 & 0\\ 1 & 1 & 1\\ 0 & 1 & 1 \end{pmatrix}_{3 \times 3}$

$K_2 = K_1 \otimes K_1 = \begin{pmatrix} K_1 & K_1 & 0\\ K_1 & K_1 & K_1\\ 0 & K_1 & K_1 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0\\ 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 0\\ 0 & 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0\\ 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 0\\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1\\ 0 & 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1\\ 0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 & 0\\ 0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 & 1\\ 0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 1 \end{pmatrix}_{9 \times 9}$

The picture below shows the adjacency matrix after taking the Kronecker product of the $9 \times 9$ matrix with itself one more time, giving $K_2 \otimes K_2$, an $81 \times 81$ adjacency matrix.

Kronecker product

The Kronecker product of two matrices A and B is given by the equation below.

$\displaystyle C = A_{n \times m} \otimes B_{k \times l} = \begin{pmatrix} a_{1,1}B & a_{1,2}B & \cdots & a_{1,m}B\\ a_{2,1}B & a_{2,2}B & \cdots & a_{2,m}B\\ \vdots & \vdots & \ddots & \vdots\\ a_{n,1}B & a_{n,2}B & \cdots & a_{n,m}B \end{pmatrix}_{nk \times ml}$

A Kronecker graph is obtained by growing a sequence of graphs, iterating the Kronecker product over the initiator matrix $K_1$:

$K_1^{[m]} = K_m = \underbrace{K_1 \otimes K_1 \otimes \cdots \otimes K_1}_{m\text{ times}} = K_{m-1} \otimes K_1$

Below we can see Kronecker graphs with different initiators $K_1$ after 3 iterations.

Kronecker graphs
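The Kronecker product itself takes only a few lines. Below is a minimal pure-Python sketch for dense matrices stored as lists of lists (the function name `kron` is mine; real libraries offer equivalents):

```python
def kron(A, B):
    """Kronecker product of two dense matrices given as lists of lists."""
    rB, cB = len(B), len(B[0])
    # Entry (i, j) of A (x) B is A[i // rB][j // cB] * B[i % rB][j % cB]
    return [[A[i // rB][j // cB] * B[i % rB][j % cB]
             for j in range(len(A[0]) * cB)]
            for i in range(len(A) * rB)]

K1 = [[1, 1, 0],
      [1, 1, 1],
      [0, 1, 1]]
K2 = kron(K1, K1)  # the 9 x 9 matrix shown above
K3 = kron(K2, K1)  # 27 x 27, and so on
```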

Stochastic Kronecker Graphs

Instead of 1s and 0s we can use a probability matrix to introduce some stochasticity into the generation. We start with an $n \times n$ probability matrix $\Theta_1$ and compute its $k^{th}$ Kronecker power $\Theta_k$. Then, for each entry $p_{uv}$ of $\Theta_k$, we include the edge $(u, v)$ in $K_k$ with probability $p_{uv}$.

$\Theta_1 = \begin{pmatrix} 0.5 & 0.2\\ 0.1 & 0.3 \end{pmatrix} \rightarrow \Theta_2 = \Theta_1 \otimes \Theta_1 = \begin{pmatrix} 0.25 & 0.10 & 0.10 & 0.04\\ 0.05 & 0.15 & 0.02 & 0.06\\ 0.05 & 0.02 & 0.15 & 0.06\\ 0.01 & 0.03 & 0.03 & 0.09 \end{pmatrix}$
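The naive sampling scheme described above (one biased coin per entry of $\Theta_k$) can be sketched directly; the names `stochastic_kronecker` and `kron` are mine, and the seed and $k$ are arbitrary choices:

```python
import random

def kron(A, B):
    """Kronecker product of two probability matrices (lists of lists)."""
    rB, cB = len(B), len(B[0])
    return [[A[i // rB][j // cB] * B[i % rB][j % cB]
             for j in range(len(A[0]) * cB)]
            for i in range(len(A) * rB)]

def stochastic_kronecker(theta1, k, seed=None):
    """Take the k-th Kronecker power of theta1, then flip one biased coin
    per entry p_uv to decide whether edge (u, v) is present."""
    rng = random.Random(seed)
    theta = theta1
    for _ in range(k - 1):
        theta = kron(theta, theta1)
    n = len(theta)
    edges = [(u, v) for u in range(n) for v in range(n)
             if rng.random() < theta[u][v]]
    return n, edges

theta1 = [[0.5, 0.2],
          [0.1, 0.3]]
n, edges = stochastic_kronecker(theta1, 3, seed=7)
# n = 8; the expected edge count equals the sum of all entries of theta_3
```

This flips $n^2$ coins, which is exactly the cost the next section proposes to avoid by exploiting the recursive structure.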

Generation of Kronecker Graphs

After generating the probability matrix $\Theta_m$, obtaining an actual adjacency matrix requires flipping a biased coin $n^2$ times. Instead, we can exploit the recursive structure of the Kronecker graph to sample edges more efficiently.

TODO

Estimation: Epinions (n=76k, m=510k)

Kronecker graphs have network properties very similar to those of real graphs. The image below shows a Kronecker graph with $\Theta_1 = \begin{pmatrix}0.99 & 0.54\\0.49 & 0.13\end{pmatrix}$ and its properties compared to the Epinions graph with n = 76k and m = 510k.

kronecker graph comparison