Skip to main content

13. Probabilistic Contagion and Models of Influence.

Created on December 29|Last edited on January 13

Previous section is about decision based cascading models, this section is about probabilistic cascading models.

In decision based models nodes deciding whether they want to adopt a certain behaviour they see in their neighbourhood based on the payoff they might get.

In probabilistic based models, is more like the process of epidemic spreading or virality in social media. A contagion process is complex and unobservable, hence in some cases it can be modeled as randomness.

A disease with high contagion probability the disease spreads farther and with low contagion probability the disease dies out. We want to quantify this spread and contagion probability.

Lets consider a simple model where a patient with a virus meets d people, and with probability q > 0 she infects each of them. Now the goal is to find for which values of d and q the epidemic runs forever. Now we can consider that the epidemic runs forever or ends based on the below two conditions.

Run forever:lim⁡h→inf⁡P[node at depth h is infected]>0\text{Run forever:} \lim_{h \rarr \inf} P\bigg[\text{node at depth h is infected}\bigg] > 0

Die out:lim⁡h→inf⁡P[node at depth h is infected]=0\text{Die out:} \lim_{h \rarr \inf} P\bigg[\text{node at depth h is infected}\bigg] = 0

If php_h is the probability of a node at depth hh from the initial patient is infected, we need lim⁡h→inf⁡\lim_{h \rarr \inf} based on dd and qq. This leads to a recurrent formulation because once we get a level out, we are left with an identical problem with depth h−1h-1. Because of this php_h in terms of ph−1p_{h-1} can be written as

ph=1−(1−q.ph−1)dp_h = 1 - (1-q.p_{h-1})^d

we get php_h the probability from the probability no node at depth h−1h-1 from the roots child node is infected. We get php_h by iterating the equation f(x)=1−(1−qx)df(x) = 1 - (1 - qx)^d starting with x = 1, since p1=1p_1 = 1 this leads to x1=f(1);x2=f(x1);⋯x_1 = f(1); x_2 = f(x_1); \cdots

Intuitively from this at the limit h→inf⁡h \rarr \inf f(x)=xf(x) = x, if its 0 for some values q and d, it implies the epidemic will die out, if its more than 0, the epidemic will not die out. If we plot f(x) against x, the plot of f(x) looks like the image below, it starts with some value less than 1 at x = 1, and eventually goes to zero. For some values of q, d if the plot of f(x) as we keep iterating crosses the line y=xy = x, it means we have a non zero value for php_h at its infinite limit. So the plot y=1−(1−qx)dy = 1 - (1-qx)^d should always be below the line y=xy=x

fixed point plot

The point where the plot of f(x)f(x) crosses the line y=xy=x is called the fixed point.

f(0)=0f(1)<1f′(x)=q.d(1−qx)d−1f(0) = 0 \\f(1) < 1\\f'(x) = q.d(1-qx)^{d-1}

Given q is a probability and d a positive number and from the above equations we know that f′(x)f'(x) is always positive for all x between 0 and 1 which means f(x)f(x) is a monotonic function

And f′(x)f'(x) is a decreasing function wrt x. As x increases from 0 to 1, f′(x)f'(x) keeps getting smaller. This implies the slope of the function f(x)f(x) keeps decreasing. So the shape of the plot f(x) will look like the plot shown the above picture, only that it might cross the line y=xy=x or not. And if the derivative f′(x)f'(x) at zero is less than 1, it means f(x)f(x) will never cross the line y=xy=x if its greater than 1 the plot of f(x)f(x) will cross the line at some non zero probability.

So for the epidemic to die out f′(0)=q.d<1f'(0) = q.d < 1, lim⁡h→inf⁡ph=0\lim_{h \rarr \inf} p_h = 0 when q.dq.d < 1. This is also called the Reproductive number R0=q.dR_0 = q.d

There is an epidemic if R0≥1R_0\geq 1. Intuitively we can see that if the expected number of people every infected person can infect is greater than 1, we have an epidemic, and if its less than 1, the epidemic will die out.

Reproductive number R0R_0 tells us if a disease will die out or spread out. There is an epidemic if R0≥1R_0 \geq 1. Otherwise the disease will die out.

DiseaseR0R_0
HIV2-5
Measles12-18
Ebola1.5-2

When R0R_0 is close to 1, slightly changing q or d can result in epidemics dying out or happening. Quarantining people/nodes reducing d, or encouraging better sanitary practices like washing hands, wearing masks reduces germs spreading hence reducing q.

Estimating R0R_0 from real data.

Characterizing social cascades on Flickr

Users are connected to other users via friend links, and a user can like or favourite a photo. The data they considered is 100 day photo likes. With around 2 million users, 34 million likes and 11 million photos.

They considered it a social cascade if a user likes a photo after at least one of her friends liked the photo.And they didn't consider it a social cascade of liking if a user likes a photo and none of his friends have previously liked the photo.

There are two ways of computing R0R_0

We can simply get R0R_0 from the start node of a cascade, by counting the fraction of directly infected nodes and proclaim that to be R0R_0. We call it the empirical value of R0R_0

Given an infected node count the proportion of its neighbours subsequently infected and average it to get qq. Then R0=q∗d∗avg(di2)(avg(di))2R_0 = q * d * \frac{avg(d_i^2)}{(avg(d_i))^2}. We include the correction factor for did_i to handle uneven values for d. We call this the estimated value of R0R_0.

If we plot the values of empirical and estimated R0R_0 values for the top 1000 photo cascades we get the below plot.

R0 plot estimation vs empirical

From this we can see that both empirical and estimated values of R0R_0 are close to each other. Which means getting it by observing just the parent and the fraction of nodes its infecting is as good a measure as getting R0R_0 by observing the whole network cascade. Even though social networks are very heterogenous, knowing what happens at the first step of propagation gives us a good picture of the whole cascade process in subsequent steps.

The basic reproduction number of popular photos is between 1 and 190. This is much higher than very infectious diseases like measles, indicating that social networks are efficient transmission media and online content can be very infectious. The reason these social epidemics stop and doesn't spread to the entire network is that the reproductive number is not homogenous, it keeps changing with time as the news gets stale.



Epidemic models

Spreading models of viruses

We have the following two virus propagation parameters, β\beta, virus birth rate, the probability that an infected neighbour attacks, and δ\delta, virus death rate, the probability that an infected node heals. Each node can go through several phases, S, E I, R, Z Susceptible, exposed, infected, recovered and immune respectively.

SIR Model:

Each node goes through phases, Susceptible Infected Recovered. Models chickenpox or plague, where once you heal, you can never get infected again.

Assuming perfect mixing, the network is a complete graph, the model dynamics are

dSdt=−βSI;dRdt=δI;dIdt=βSI−δI\displaystyle \frac{dS}{dt} = -\beta SI ;\quad \frac{dR}{dt} = \delta I ;\quad \frac{dI}{dt} = \beta SI - \delta I

sir model

SIS Model

Each node goes through phases Susceptible, Infected and Susceptible. This model works for diseases where cured nodes immediately becomes susceptible. And we define virus strength s=β/δs = \beta/\delta

SIS models flu. Susceptible node becomes infected, the node then heals and become susceptible again. Assuming perfect mixing(a complete graph) these are the model dynamics

dSdt=−βSI+δI;dIdt=βSI−δI\displaystyle \frac{dS}{dt} = -\beta SI + \delta I ; \quad \frac{dI}{dt}= \beta SI - \delta I

SIS model

If we simulate this model, from the above plot we can see that a fixed fraction of people are infected and a fixed fraction of people are healed with time.

We define a metric τ\tau an epidemic threshold, at value of virus strength threshold the epidemic cannot happen and it eventually dies out. s=β/δ<τs = \beta/\delta < \tau

In an SIS model, the epidemic threshold of a given graph is given by β/δ<τ=1/λ1,A\displaystyle \beta/\delta < \tau = 1/\lambda_{1, A}, where λ1,A\lambda_{1,A} is the largest eigenvalue of the graph adjacency matrix. The largest eigenvalue alone captures the property of the graph.

Below is the plot of how the infected number of nodes changes with time for various values of β\beta and δ\delta. You can see how the infection keeps on going when s>τs > \tau and eventually dies when s<τs < \tau. You can also see how the infected number of nodes varies with a change in initial number of infected nodes.

tau tau with varied initial infections

Article about estimation of R0R_0 of ebola.

SEIZ model for rumor spreading on Twitter.

  • Susceptible S{twitter accounts}
  • Infected I{Believe news/rumors, and post a tweet}
  • Exposed E{Be exposed but not yet believe}
  • Skeptics Z{Skeptical and also do not tweet}

The SEIZ model dynamics are as follows

dSdt=−βSIN−bSZNdEdt=(1−p)βSIN+(1−I)bSZN−ρEIN−ϵEdIdt=pβSIN+ρEIN+ϵEdZdt=lbSZN\displaystyle \frac{dS}{dt} = -\beta S\frac{I}{N} - bS\frac{Z}{N} \\ \frac{dE}{dt}= (1-p)\beta S \frac{I}{N} + (1 - I)bS\frac{Z}{N} - \rho E\frac{I}{N} - \epsilon E \\ \frac{dI}{dt} = p\beta S \frac{I}{N} + \rho E \frac{I}{N} + \epsilon E \\ \frac{dZ}{dt} = lbS\frac{Z}{N}

They collected, 8 different viral stories, 4 real and 4 rumours and tried to estimate the model. SEIZ model is fit to each cascade to minimize the difference between the number of rumor tweets and the estimated number of rumor tweets by the model. Using grid-search and find the parameters with minimum error ∣I(t)−tweets(t)∣|I(t) - \text{tweets}(t)|

For these viral stories, SEIZ model is better at modeling the virality and spread of a story better than the SIS model. And parameters obtained by fitting SEIZ model efficiently identifies rumors vs news using a metric RSI=(1−p)β+(1−l)bρ+ϵR_{SI} = \frac{(1-p)\beta + (1-l)b}{\rho + \epsilon}, kind of a flux ratio of effects entering E and to those leaving E.



Independent Cascade Model

We are given a network with edges assigned some probability with the intuition that every node spreads to the next connected node with probability corresponding to the edge. When node u becomes infected it activates or infects each out-neighbour v with prob puvp_{uv} and activations spread through the network. Estimating these probabilities from data is very hard[Goyal et al 2010]. If all edges have the same probability weight, it becomes SIR model which is too simple.

We can make the model more tractable by separating the notion of exposure to the notion of adoption. Any given node gets exposed to something through its neighbours and the node can act on the contagion through adoption.

Exposure curve

Probability of adopting new behaviour depends on the total number of friends who have already adopted. Below you can see how the exposure curves look for both probabilistic spreading vs decision making spreading.

exposure curves example

Exposure curve examples

Exposure curve for movie recommendation Exposure curve for Movie recommendations Exposure for livejournal group membership, the probability of joining a group depending on the number of friends already in the group. Exposure curve for LiveJournal group memberships Exposure curve for the top 500 twitter hashtags, network data of 60M users, and 3 Billion tweets tweets exposure curve

We have two characteristics for these exposure curves persistence and stickiness.

Persistence of P is the ratio of the area under the curve P and the area of the rectangle of height max(P)max(P) and width max(D(P))max(D(P)) where D(P)D(P) is the domain of P. Persistence measures the decay of exposure curves.

Stickiness of P is max(P)max(P). Stickiness is the probability of usage at the most effective exposure.

For the twitter data, they plotted stickiness and persistence of various viral hashtags from different categories. They found that Idioms and music have lower persistence than that of a random subset of hashtags of the same size. Where as politics and sports have higher persistence than that of a random subset of hashtags of the same size. When it comes to stickiness, technology and movies have lower stickiness where as music has higher stickiness than that of a random subset of hashtags of the same size.