13. Probabilistic Contagion and Models of Influence.

Created on December 29|Last edited on January 13
Comment
﻿
Previous section is about decision based cascading models, this section is about probabilistic cascading models.
In decision based models nodes deciding whether they want to adopt a certain behaviour they see in their neighbourhood based on the payoff they might get.
In probabilistic based models, is more like the process of epidemic spreading or virality in social media. A contagion process is complex and unobservable, hence in some cases it can be modeled as randomness.
A disease with high contagion probability the disease spreads farther and with low contagion probability the disease dies out. We want to quantify this spread and contagion probability.
Lets consider a simple model where a patient with a virus meets d people, and with probability q > 0 she infects each of them. Now the goal is to find for which values of d and q the epidemic runs forever. Now we can consider that the epidemic runs forever or ends based on the below two conditions.
Run forever:lim⁡h→inf⁡P[node at depth h is infected]>0\text{Run forever:} \lim_{h \rarr \inf} P\bigg[\text{node at depth h is infected}\bigg] > 0Run forever:limh→inf​P[node at depth h is infected]>0
Die out:lim⁡h→inf⁡P[node at depth h is infected]=0\text{Die out:} \lim_{h \rarr \inf} P\bigg[\text{node at depth h is infected}\bigg] = 0Die out:limh→inf​P[node at depth h is infected]=0
If php_hph​ is the probability of a node at depth hhh from the initial patient is infected, we need lim⁡h→inf⁡\lim_{h \rarr \inf}limh→inf​ based on ddd and qqq. This leads to a recurrent formulation because once we get a level out, we are left with an identical problem with depth h−1h-1h−1. Because of this php_hph​ in terms of ph−1p_{h-1}ph−1​ can be written as
ph=1−(1−q.ph−1)dp_h = 1 - (1-q.p_{h-1})^dph​=1−(1−q.ph−1​)d
we get php_hph​ the probability from the probability no node at depth h−1h-1h−1 from the roots child node is infected. We get php_hph​ by iterating the  equation f(x)=1−(1−qx)df(x) = 1 - (1 - qx)^df(x)=1−(1−qx)d starting with x = 1, since p1=1p_1 = 1p1​=1 this leads to x1=f(1);x2=f(x1);⋯x_1 = f(1); x_2 = f(x_1); \cdotsx1​=f(1);x2​=f(x1​);⋯
Intuitively from this at the limit h→inf⁡h \rarr \infh→inf f(x)=xf(x) = xf(x)=x, if its 0 for some values q and d, it implies the epidemic will die out, if its more than 0, the epidemic will not die out. If we plot f(x) against x, the plot of f(x) looks like the image below, it starts with some value less than 1 at x = 1, and eventually goes to zero. For some values of q, d if the plot of f(x) as we keep iterating crosses the line y=xy = xy=x, it means we have a non zero value for php_hph​ at its infinite limit. So the plot y=1−(1−qx)dy = 1 - (1-qx)^dy=1−(1−qx)d should always be below the line y=xy=xy=x
The point where the plot of f(x)f(x)f(x) crosses the line y=xy=xy=x is called the fixed point.
f(0)=0f(1)<1f′(x)=q.d(1−qx)d−1f(0) = 0 \\f(1) < 1\\f'(x) = q.d(1-qx)^{d-1}f(0)=0f(1)<1f′(x)=q.d(1−qx)d−1 
Given q is a probability and d a positive number and from the above equations we know that f′(x)f'(x)f′(x) is always positive for all x between 0 and 1 which means f(x)f(x)f(x) is a monotonic function
And f′(x)f'(x)f′(x) is a decreasing function wrt x. As x increases from 0 to 1, f′(x)f'(x)f′(x) keeps getting smaller. This implies the slope of the function f(x)f(x)f(x) keeps decreasing. So the shape of the plot f(x) will look like the plot shown the above picture, only that it might cross the line y=xy=xy=x or not. And if the derivative f′(x)f'(x)f′(x) at zero is less than 1, it means f(x)f(x)f(x) will never cross the line y=xy=xy=x if its greater than 1 the plot of f(x)f(x)f(x) will cross the line at some non zero probability.
So for the epidemic to die out f′(0)=q.d<1f'(0) = q.d < 1f′(0)=q.d<1, lim⁡h→inf⁡ph=0\lim_{h \rarr \inf} p_h = 0limh→inf​ph​=0 when q.dq.dq.d < 1. This is also called the Reproductive number R0=q.dR_0 = q.dR0​=q.d 
There is an epidemic if R0≥1R_0\geq 1R0​≥1. Intuitively we can see that if the expected number of people every infected person can infect is greater than 1, we have an epidemic, and if its less than 1, the epidemic will die out.
Reproductive number R0R_0R0​ tells us if a disease will die out or spread out. There is an epidemic if R0≥1R_0 \geq 1R0​≥1. Otherwise the disease will die out.
DiseaseR0R_0R0​
HIV2-5
Measles12-18
Ebola1.5-2
When R0R_0R0​ is close to 1, slightly changing q or d can result in epidemics dying out or happening. Quarantining people/nodes reducing d, or encouraging better sanitary practices like washing hands, wearing masks reduces germs spreading hence reducing q.
Estimating R0R_0R0​ from real data.Characterizing social cascades on Flickr
Users are connected to other users via friend links, and a user can like or favourite a photo. The data they considered is 100 day photo likes. With around 2 million users, 34 million likes and 11 million photos.
They considered it a social cascade if a user likes a photo after at least one of her friends liked the photo.And they didn't consider it a social cascade of liking if a user likes a photo and none of his friends have previously liked the photo.
There are two ways of computing R0R_0R0​
We can simply get R0R_0R0​ from the start node of a cascade, by counting the fraction of directly infected nodes and proclaim that to be R0R_0R0​. We call it the empirical value of R0R_0R0​
Given an infected node count the proportion of its neighbours subsequently infected and average it to get qqq. Then R0=q∗d∗avg(di2)(avg(di))2R_0 = q * d * \frac{avg(d_i^2)}{(avg(d_i))^2}R0​=q∗d∗(avg(di​))2avg(di2​)​. We include the correction factor for did_idi​ to handle uneven values for d. We call this the estimated value of R0R_0R0​.
If we plot the values of empirical and estimated R0R_0R0​ values for the top 1000 photo cascades we get the below plot.
From this we can see that both empirical and estimated values of R0R_0R0​ are close to each other. Which means getting it by observing just the parent and the fraction of nodes its infecting is as good a measure as getting R0R_0R0​ by observing the whole network cascade. Even though social networks are very heterogenous, knowing what happens at the first step of propagation gives us a good picture of the whole cascade process in subsequent steps.
The basic reproduction number of popular photos is between 1 and 190. This is much higher than very infectious diseases like measles, indicating that social networks are efficient transmission media and online content can be very infectious. The reason these social epidemics stop and doesn't spread to the entire network is that the reproductive number is not homogenous, it keeps changing with time as the news gets stale. 
﻿
Epidemic modelsSpreading models of virusesWe have the following two virus propagation parameters, β\betaβ, virus birth rate, the probability that an infected neighbour attacks, and δ\deltaδ, virus death rate, the probability that an infected node heals. Each node can go through several phases, S, E I, R, Z Susceptible, exposed, infected, recovered and immune respectively.
SIR Model:Each node goes through phases, Susceptible Infected Recovered. Models chickenpox or plague, where once you heal, you can never get infected again.
Assuming perfect mixing, the network is a complete graph, the model dynamics are
dSdt=−βSI;dRdt=δI;dIdt=βSI−δI\displaystyle \frac{dS}{dt} = -\beta SI ;\quad \frac{dR}{dt} = \delta I ;\quad \frac{dI}{dt} = \beta SI - \delta IdtdS​=−βSI;dtdR​=δI;dtdI​=βSI−δI
SIS ModelEach node goes through phases Susceptible, Infected and Susceptible. This model works for diseases where cured nodes immediately becomes susceptible. And we define virus strength s=β/δs = \beta/\deltas=β/δ
SIS models flu. Susceptible node becomes infected, the node then heals and become susceptible again. Assuming perfect mixing(a complete graph) these are the model dynamics
dSdt=−βSI+δI;dIdt=βSI−δI\displaystyle \frac{dS}{dt} = -\beta SI + \delta I ; \quad \frac{dI}{dt}= \beta SI - \delta IdtdS​=−βSI+δI;dtdI​=βSI−δI
If we simulate this model, from the above plot we can see that a fixed fraction of people are infected and a fixed fraction of people are healed with time.
We define a metric τ\tauτ an epidemic threshold, at value of virus strength threshold the epidemic cannot happen and it eventually dies out. s=β/δ<τs = \beta/\delta < \taus=β/δ<τ
In an SIS model, the epidemic threshold of a given graph is given by β/δ<τ=1/λ1,A\displaystyle \beta/\delta < \tau = 1/\lambda_{1, A}β/δ<τ=1/λ1,A​, where λ1,A\lambda_{1,A}λ1,A​ is the largest eigenvalue of the graph adjacency matrix. The largest eigenvalue alone captures the property of the graph.
Below is the plot of how the infected number of nodes changes with time for various values of β\betaβ and δ\deltaδ. You can see how the infection keeps on going when s>τs > \taus>τ and eventually dies when s<τs < \taus<τ. You can also see how the infected number of nodes varies with a change in initial number of infected nodes.

Article about estimation of R0R_0R0​ of ebola.
SEIZ model for rumor spreading on Twitter.Susceptible S{twitter accounts}
Infected I{Believe news/rumors, and post a tweet}
Exposed E{Be exposed but not yet believe}
Skeptics Z{Skeptical and also do not tweet}
The SEIZ model dynamics are as follows
dSdt=−βSIN−bSZNdEdt=(1−p)βSIN+(1−I)bSZN−ρEIN−ϵEdIdt=pβSIN+ρEIN+ϵEdZdt=lbSZN\displaystyle \frac{dS}{dt} = -\beta S\frac{I}{N} - bS\frac{Z}{N} \\ \frac{dE}{dt}= (1-p)\beta S \frac{I}{N} + (1 - I)bS\frac{Z}{N} - \rho E\frac{I}{N} - \epsilon E \\ \frac{dI}{dt} = p\beta S \frac{I}{N} + \rho E \frac{I}{N} + \epsilon E \\ \frac{dZ}{dt} = lbS\frac{Z}{N}dtdS​=−βSNI​−bSNZ​dtdE​=(1−p)βSNI​+(1−I)bSNZ​−ρENI​−ϵEdtdI​=pβSNI​+ρENI​+ϵEdtdZ​=lbSNZ​
They collected, 8 different viral stories, 4 real and 4 rumours and tried to estimate the model. SEIZ model is fit to each cascade to minimize the difference between the number of rumor tweets and the estimated number of rumor tweets by the model. Using grid-search and find the parameters with minimum error ∣I(t)−tweets(t)∣|I(t) - \text{tweets}(t)|∣I(t)−tweets(t)∣
For these viral stories, SEIZ model is better at modeling the virality and spread of a story better than the SIS model. And parameters obtained by fitting SEIZ model efficiently identifies rumors vs news using a metric RSI=(1−p)β+(1−l)bρ+ϵR_{SI} = \frac{(1-p)\beta + (1-l)b}{\rho + \epsilon}RSI​=ρ+ϵ(1−p)β+(1−l)b​, kind of a flux ratio of effects entering E and to those leaving E.
﻿
Independent Cascade ModelWe are given a network with edges assigned some probability with the intuition that every node spreads to the next connected node with probability corresponding to the edge. When node u becomes infected it activates or infects each out-neighbour v with prob puvp_{uv}puv​ and activations spread through the network. Estimating these probabilities from data is very hard[Goyal et al 2010]. If all edges have the same probability weight, it becomes SIR model which is too simple.
We can make the model more tractable by separating the notion of exposure to the notion of adoption. Any given node gets exposed to something through its neighbours and the node can act on the contagion through adoption.
Exposure curveProbability of adopting new behaviour depends on the total number of friends who have already adopted. Below you can see how the exposure curves look for both probabilistic spreading vs decision making spreading.
Exposure curve examplesExposure curve for movie recommendation

Exposure for livejournal group membership, the probability of joining a group depending on the number of friends already in the group.

Exposure curve for the top 500 twitter hashtags, network data of 60M users, and 3 Billion tweets
We have two characteristics for these exposure curves persistence and stickiness.
Persistence of P is the ratio of the area under the curve P and the area of the rectangle of height max(P)max(P)max(P) and width max(D(P))max(D(P))max(D(P)) where D(P)D(P)D(P) is the domain of P. Persistence measures the decay of exposure curves.
Stickiness of P is max(P)max(P)max(P). Stickiness is the probability of usage at the most effective exposure.
For the twitter data, they plotted stickiness and persistence of various viral hashtags from different categories. They found that Idioms and music have lower persistence than that of a random subset of hashtags of the same size. Where as politics and sports have higher persistence than that of a random subset of hashtags of the same size. When it comes to stickiness, technology and movies have lower stickiness where as music has higher stickiness than that of a random subset of hashtags of the same size.
﻿
Chapters
Introduction, Structure of Graphs
Properties of Networks and Random Graph Models
Motifs and Structural Roles in Networks
Community Structure in Networks
Spectral Clustering
Message Passing and Node Classification
Graph Representation Learning
Graph Neural Networks
Graph Neural Networks - Pytorch Geometric
Deep Generative Models for Graphs
Link Analysis: PageRank
Network Effects and Cascading Behaviour
Probabilistic Contagion and Models of Influence
Influence Maximization in Networks
Outbreak Detection in Networks
Network Evolution
Reasoning over Knowledge Graphs
Limitations of Graph Neural Networks
Applications of Graph Neural Networks
﻿
﻿