15. Outbreak Detection in Networks

In this report we talk about outbreak detection, which is in a sense the reverse of influence maximisation: there we try to create a cascade, while here we study how to detect cascades as quickly as possible. We develop an approximation algorithm for this problem, which is again a submodular optimisation problem. We also discuss a new data-dependent bound on solution quality that is valid when optimizing any submodular function; the usual guarantee of greedy hill climbing, that we obtain at least 63% of the optimal solution's value, is data independent.

Real-world use cases for outbreak detection include deciding where to place sensors in a city's water distribution network so that we can detect contamination as quickly as possible, given data on how contamination spreads through the network.

In information networks: which users/news sites should one follow to detect cascades as effectively as possible?

Given a dynamic process spreading over a network we want to select a set of nodes to detect the process effectively and as quickly as possible.

We have two notions to consider: one is the utility of placing sensors, and the other is the cost of placing them.

Some nodes can cause high-impact outbreaks that affect many other nodes; we measure this impact of contamination in terms of utility.

For each subset $S \subseteq V$ we compute the utility $f(S)$, where S is the set of nodes where we place sensors.

We are given a graph G(V, E) and data about how outbreaks spread over G, i.e. for each outbreak $i$ we know the time $T(u, i)$ when outbreak $i$ contaminates node $u$.

In the case of water contamination, we get $T(u, i)$ from simulations of water flow and consumption. In the case of news, we get it from traces of information flow: we collect lots of articles and trace them to obtain data about how information flows from a given news site.

Our goal is to select a subset of nodes S that maximizes the expected reward given below:

$$\max_{S\subseteq V} \sum_i P(i)\,f_i(S) \quad \text{subject to} \quad cost(S) < B$$

where $P(i)$ is the probability of outbreak $i$ occurring and $f_i(S)$ is the reward for detecting outbreak $i$ using sensors $S$. This is the expected reward for detecting an outbreak, taken over all possible outbreaks, subject to our budget constraint B.

Reward can be one of the following

  • Minimize time of detection
  • Maximize number of detected propagations
  • Minimize number of infected people

Cost can be thought of as the effort of monitoring: reading big blogs is more time consuming in the case of information networks, and placing a sensor in a remote location is more expensive in the water network.

Objectives for outbreak detection

We consider a penalty $\pi_i(t)$ for detecting outbreak $i$ at time $t$.

For each of the rewards specified above, the penalty can be defined accordingly:

  • While minimizing time to detection (DT), we take the penalty for detecting at time t to be $\pi_i(t) = t$.
  • While maximizing the number of detected propagations, we consider the penalty in terms of detection likelihood (DL): how many contaminations did we detect? This is a binary outcome, whether we detect the outbreak or not: $\pi_i(t) = 0$ and $\pi_i(\infty) = 1$.
  • While minimizing the number of infected people, we just take the affected population as the penalty: $\pi_i(t) =$ number of infected nodes in outbreak $i$ by time $t$.

In all these cases detecting sooner does not hurt, hence all these penalty functions are monotone.

From here we define $f_i(S)$ as the penalty reduction:

$$f_i(S) = \pi_i(\infty) - \pi_i(T(S, i))$$

where $T(S, i) = \min_{u \in S} T(u, i)$ is the time at which the first sensor in S detects outbreak $i$.
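As a minimal sketch (helper names are ours; assuming the detection times $T(u, i)$ are stored in a nested dict `T[u][i]`), the penalty reduction can be computed directly from this definition:

```python
import math

def detection_time(T, S, i):
    """T(S, i): time at which the first sensor in S detects outbreak i (inf if never)."""
    return min((T[u][i] for u in S if i in T[u]), default=math.inf)

def penalty_reduction(T, S, i, pi):
    """f_i(S) = pi(inf) - pi(T(S, i)) for a monotone penalty function pi."""
    return pi(math.inf) - pi(detection_time(T, S, i))

# Example penalty: detection likelihood (DL), pi(t) = 0 if detected, pi(inf) = 1.
dl = lambda t: 1.0 if math.isinf(t) else 0.0
```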

We formulate the objective this way because $f_i$ is a submodular function: minimizing the penalty $\pi_i$ directly would not give a submodular objective, but maximizing the penalty reduction $f_i$ does.

Adding a new sensor gives diminishing returns, which is what makes f submodular: in terms of f, adding a new sensor to a set helps less than adding the same sensor to a subset of that set.

For all $A \subseteq B \subseteq V$ and a sensor $x \in V \setminus B$:

$$f(A \cup \{x\}) - f(A) \geq f(B \cup \{x\}) - f(B)$$

To prove this we consider 3 cases, based on when A, B, and x detect the outbreak. Note that $T(B, i) \leq T(A, i)$ always holds: since $A \subseteq B$, adding sensors can only make detection earlier.

Case 1

x is the last to detect, i.e. $T(B, i) \leq T(A, i) \leq T(x, i)$, hence neither A nor B benefits from adding x.

So $f_i(A \cup \{x\}) = f_i(A)$ and $f_i(B \cup \{x\}) = f_i(B)$, which gives $f_i(A \cup \{x\}) - f_i(A) = f_i(B \cup \{x\}) - f_i(B) = 0$: the submodularity inequality holds with equality.

Case 2

x detects after B but before A, i.e. $T(B, i) \leq T(x, i) \leq T(A, i)$: x detects sooner than any node in A, but no sooner than the best node in B. So x improves the solution for A but not for B:

$$f_i(A \cup \{x\}) - f_i(A) \geq 0 = f_i(B \cup \{x\}) - f_i(B)$$

Case 3

x detects earlier than both A and B, i.e. $T(x, i) < T(B, i) \leq T(A, i)$. Here the inequality $f_i(A) \leq f_i(B)$ holds because $f_i$ is a non-decreasing function and A is a subset of B. So

$$f_i(A \cup \{x\}) - f_i(A) = [\pi_i(\infty) - \pi_i(T(x, i))] - f_i(A) \geq [\pi_i(\infty) - \pi_i(T(x, i))] - f_i(B) = f_i(B \cup \{x\}) - f_i(B)$$

From all the above cases we conclude that $f_i(\cdot)$ is submodular, which makes $f(\cdot)$ submodular as well, because $f(S) = \sum_i P(i) f_i(S)$ is a nonnegative weighted sum of submodular functions.
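As a quick sanity check (toy numbers of our own, not from the source), we can verify the diminishing-returns inequality on a tiny instance using the `penalty_reduction` helper sketched earlier, with a time-to-detection penalty capped at a finite horizon so that $\pi_i(\infty)$ is finite:

```python
# Detection times of a single outbreak (id 0) at four nodes.
T = {'a': {0: 5.0}, 'b': {0: 3.0}, 'c': {0: 8.0}, 'x': {0: 1.0}}

A = {'c'}          # A is a subset of B
B = {'b', 'c'}

HORIZON = 10.0
dt = lambda t: min(t, HORIZON)  # capped DT penalty

gain_A = penalty_reduction(T, A | {'x'}, 0, dt) - penalty_reduction(T, A, 0, dt)
gain_B = penalty_reduction(T, B | {'x'}, 0, dt) - penalty_reduction(T, B, 0, dt)
assert gain_A >= gain_B  # 7.0 >= 2.0: adding x helps the smaller set more
```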

Greedy hill climbing for optimising submodular functions is near optimal with factor $(1 - 1/e)$, but this only works for the unit-cost case, where each sensor costs the same; for us each sensor has a cost $c(s)$. Hill climbing is also slow: at each iteration we need to re-evaluate the marginal gains of all nodes, giving a runtime of $O(|V| \cdot K)$ function evaluations for placing K sensors, as in the sketch below.
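For concreteness, here is a minimal sketch of naive unit-cost hill climbing, assuming a black-box objective `f` (e.g. the expected penalty reduction) that can evaluate any sensor set; the names are ours:

```python
def greedy(V, f, k):
    """Naive hill climbing: K iterations, each re-evaluating all candidates in V."""
    S = set()
    for _ in range(k):
        # O(|V|) evaluations of f per iteration -> O(|V| * K) overall.
        best = max((u for u in V if u not in S),
                   key=lambda u: f(S | {u}) - f(S))
        S.add(best)
    return S
```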



CELF

CELF is an algorithm for optimizing submodular functions under cost constraints.

A naive way of doing hill climbing in this setting is to ignore the sensor cost and repeatedly select the sensor with the highest marginal gain until the budget is exhausted. This can result in a solution that is arbitrarily far from OPT, the optimal solution.

You can see this by considering one node with reward r that costs the whole budget B, while the rest of the nodes have reward $r - \epsilon$ and cost 1. The greedy algorithm picks the first node, exhausting the whole budget B for only a marginally higher reward, and ends up with a single-node solution S, while the optimal solution could have placed B cheap sensors for a much larger total reward.

Even greedily picking a new sensor location based on the benefit-cost ratio $s_i = \arg\max_{s \in V \setminus A_{i-1}} \frac{f(A_{i-1} \cup \{s\}) - f(A_{i-1})}{c(s)}$ can give an obviously bad solution. Consider two sensors $s_1$ and $s_2$ with costs $c(s_1) = \epsilon$, $c(s_2) = B$ and benefits $f(s_1) = 2\epsilon$, $f(s_2) = B$. The benefit-cost ratios are $f(s_1)/c(s_1) = 2$ and $f(s_2)/c(s_2) = 1$. If we select $s_1$ greedily based on this ratio, we end up with reward $2\epsilon$ instead of B, which is arbitrarily bad as $\epsilon \to 0$. The ratio criterion incentivizes choosing nodes with very low cost, even when slightly more expensive ones lead to much better global results.

Basically, once cost enters the picture, neither of these greedy criteria alone gives us a reliable optimisation metric.

CELF (Cost-Effective Lazy Forward selection) is a two-pass greedy algorithm: we compute a solution $S'$ with the benefit-cost greedy rule and a solution $S''$ with the unit-cost greedy rule, and take the better of the two as our actual solution, $S = \arg\max(f(S'), f(S''))$.

CELF is near optimal with a factor of $\frac{1}{2}(1 - 1/e)$ [Krause & Guestrin, '05]: taking the best of two individually suboptimal solutions is guaranteed to give a near-optimal solution. A sketch of this two-pass structure follows.
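A minimal sketch of the two passes (lazy evaluation, covered in the next section, is omitted here for clarity; `f` and `cost` are assumed black boxes, and the names are ours):

```python
def celf(V, f, cost, budget):
    """Run unit-cost greedy and benefit-cost greedy, return the better solution."""
    def greedy_pass(score):
        S, spent = set(), 0.0
        while True:
            affordable = [u for u in V if u not in S and spent + cost(u) <= budget]
            if not affordable:
                return S
            best = max(affordable, key=lambda u: score(S, u))
            S.add(best)
            spent += cost(best)

    S_unit = greedy_pass(lambda S, u: f(S | {u}) - f(S))               # S''
    S_ratio = greedy_pass(lambda S, u: (f(S | {u}) - f(S)) / cost(u))  # S'
    return max(S_unit, S_ratio, key=f)
```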



Speeding up Hill-Climbing

Suppose that up to step i we have picked the sensors $S_i = \{s_1,\cdots,s_i\}$; we now have to pick $s_{i+1} = \arg\max_u f(S_i \cup \{u\}) - f(S_i)$. We are trying to maximise the marginal gain of adding u to $S_i$; call it $\delta_i(u) = f(S_i \cup \{u\}) - f(S_i)$.

Using the submodularity property we can conclude that $\delta_i(u) \geq \delta_j(u)$ for $i < j$, since $S_i \subset S_j$.

Marginal benefits $\delta_i(u)$ only shrink as i grows.

Lazy Hill Climbing

From the above, we can use $\delta_i$ as an upper bound on $\delta_j$ for all $j > i$.

At every iteration we keep the marginal benefits $\delta_i$ of all nodes from the previous iterations, re-evaluate $\delta_i$ only for the top nodes, then re-order and prune.

We start by computing the marginal benefit of adding each node to S, and add the node a with the largest benefit to S. We keep the remaining marginal gains sorted, and in the next step re-compute them in that order: if the re-computed marginal benefit of the top node is smaller than the stale marginal benefits of the rest, we re-insert it into the sorted list and evaluate the next node, and so on. When we find a node whose freshly computed marginal benefit is the largest among all entries in the sorted list, we add it to S and move on to the next iteration. In the worst case this is only as good as the naive greedy hill-climbing algorithm.

It's called a lazy algorithm because we only re-evaluate a marginal benefit when we have to, as in the heap-based sketch below.
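A minimal sketch of lazy evaluation using a max-heap of stale marginal gains (variable names are ours; assumes a monotone submodular `f` with $f(\emptyset) = 0$):

```python
import heapq

def lazy_greedy(V, f, k):
    """Lazy hill climbing: keep stale upper bounds on marginal gains in a max-heap."""
    S, fS = set(), 0.0
    # Heap entries are (-gain, node, iteration at which the gain was computed).
    heap = [(-f({u}), u, 0) for u in V]
    heapq.heapify(heap)
    for it in range(1, k + 1):
        while True:
            neg_gain, u, stamp = heapq.heappop(heap)
            if stamp == it:             # gain is fresh this iteration: u is the true argmax
                S.add(u)
                fS += -neg_gain
                break
            gain = f(S | {u}) - fS      # stale upper bound: re-evaluate and push back
            heapq.heappush(heap, (-gain, u, it))
    return S
```

By submodularity the stale values are valid upper bounds, so once a freshly re-evaluated node tops the heap it must be the true argmax, and every other node's evaluation is skipped.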

In practice, CELF with lazy evaluation runs about 700 times faster than the naive greedy hill-climbing algorithm in the example below.

CELF in practice



Data Dependent Bound

We know that we have the $(1 - 1/e)$ bound for submodular functions, which is a worst-case, data-independent bound. On some inputs our solution may be much better than 63% of optimal, so we would like a tighter bound based on the input data.

Suppose S is some solution to $f(S)$ with $|S| \leq k$, and let $OPT = \{t_1, \cdots, t_k\}$ be the optimal solution. For each node u, let $\delta(u) = f(S \cup \{u\}) - f(S)$ be the marginal gain in f from adding u to S.

If we order the marginal gains so that $\delta(1) \geq \delta(2) \geq \cdots$, then $f(OPT) \leq f(S) + \sum_{i=1}^k \delta(i)$.

we can prove the above statement as follows

$$\begin{aligned} f(OPT) &\leq f(OPT \cup S) \\ &= f(S) + f(OPT \cup S) - f(S) \\ &\leq f(S) + \sum_{i=1}^k[f(S \cup \{t_i\}) - f(S)] \\ &= f(S) + \sum_{i=1}^k \delta(t_i) \\ &\leq f(S) + \sum_{i=1}^k \delta(i) \end{aligned}$$

The first inequality uses monotonicity of f, the second uses submodularity, and the last holds because the $\delta(i)$ are the k largest marginal gains.

This is a data-dependent bound because $f(S)$ and $\delta$ depend on the input, but it makes no assumptions about how we obtained S. For some inputs this bound can be looser than 63%.
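The bound is cheap to compute once we have a solution; a minimal sketch under the same black-box-`f` assumption:

```python
def data_dependent_bound(V, f, S, k):
    """Upper bound on f(OPT): f(S) plus the k largest marginal gains w.r.t. S."""
    fS = f(S)
    gains = sorted((f(S | {u}) - fS for u in V if u not in S), reverse=True)
    return fS + sum(gains[:k])

# f(S) / data_dependent_bound(V, f, S, k) then certifies, for this particular
# input, what fraction of the optimal value the solution S achieves at least.
```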

Using this inequality, we get a bound, based on our data, on how far we are from the optimal solution.

The image below shows how the greedy algorithm performs under both the original bound and the data-dependent bound; our hill-climbing solution turns out to be much closer to the optimal solution than the worst-case bound suggests. This is modeled on a real metropolitan-area water network with 20k nodes and 25k edges. Using a cluster of 50 machines for a month, the authors simulated 3.6 million epidemic scenarios by randomising locations, days, and times of day.

data dependent bound

We can also see that placement heuristics, based on population, node degree, flow, diameter, etc., perform much worse than CELF.

CELF comparison