15. Outbreak Detection in Networks

In this report we talk about outbreak detection, which is in a sense the reverse of influence maximisation: there we try to create a cascade, while here we study how to detect cascades as quickly as possible. We develop an approximation algorithm for this problem, which is again a submodular optimisation problem. We also discuss a new data-dependent bound on solution quality that is valid when optimizing any submodular function; the usual guarantee of greedy hill climbing, that we obtain at least 63% of the optimal solution's value, is data independent.

Real-world use cases for outbreak detection include deciding where to place sensors in a city's water distribution network so that we can detect contamination as quickly as possible, given data on how contamination spreads through the network.

In information networks: which users/news sites should one follow to detect cascades as effectively as possible?

Given a dynamic process spreading over a network we want to select a set of nodes to detect the process effectively and as quickly as possible.

We have two notions to consider: one is the utility of placing sensors, and the other is the cost of placing them.

Some nodes can cause high-impact outbreaks that affect many other nodes; we measure this impact of contamination in terms of utility.

For each subset $S \subseteq V$ we compute the utility $f(S)$, where S is the set of nodes where we place sensors.

We are given a graph G(V, E) and data about how outbreaks spread over G, i.e. for each outbreak $i$ we know the time $T(u, i)$ when outbreak $i$ contaminates node $u$.

In the case of water contamination, we get $T(u, i)$ from simulations of water flow and consumption. In the case of news, we get it from traces of information flow: we collect lots of articles and trace them to obtain data about how information flows from a given news site.

Our goal is to select a subset of nodes S that maximizes the expected reward given below:

$$\max_{S\subseteq V} \sum_i P(i)\,f_i(S) \quad \text{subject to} \quad cost(S) < B$$

where $P(i)$ is the probability of outbreak $i$ occurring and $f_i(S)$ is the reward for detecting outbreak $i$ using sensors $S$. This is the expected reward for detecting an outbreak, taken over all possible outbreaks, subject to our budget constraint B.

Reward can be one of the following

  • Minimize time of detection
  • Maximize number of detected propagations
  • Minimize number of infected people

Cost can be thought of as the effort of monitoring: reading big blogs is more time consuming in the case of information networks, and placing a sensor in a remote location is more expensive in the water network.

Objectives for outbreak detection

We consider a penalty $\pi_i(t)$ for detecting outbreak $i$ at time $t$.

For each of the rewards specified above, the penalty can be defined accordingly:

  • While minimizing time to detection (DT), we take the penalty for detecting at time t to be $\pi_i(t) = t$.
  • While maximizing the number of detected propagations, we consider the penalty in terms of detection likelihood (DL): how many contaminations did we detect? This is a binary outcome, whether we detect the outbreak or not: $\pi_i(t) = 0$ and $\pi_i(\infty) = 1$.
  • While minimizing the number of infected people, we just take the affected population as the penalty: $\pi_i(t) =$ number of infected nodes in outbreak $i$ by time $t$.

In all these cases detecting sooner does not hurt, hence all these penalty functions are monotone.

From here we define $f_i(S)$ as the penalty reduction:

$$f_i(S) = \pi_i(\infty) - \pi_i(T(S, i))$$

where $T(S, i) = \min_{u \in S} T(u, i)$ is the time at which the first sensor in S detects outbreak $i$.
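As a minimal sketch (helper names are ours; assuming the detection times $T(u, i)$ are stored in a nested dict `T[u][i]`), the penalty reduction can be computed directly from this definition:

```python
import math

def detection_time(T, S, i):
    """T(S, i): time at which the first sensor in S detects outbreak i (inf if never)."""
    return min((T[u][i] for u in S if i in T[u]), default=math.inf)

def penalty_reduction(T, S, i, pi):
    """f_i(S) = pi(inf) - pi(T(S, i)) for a monotone penalty function pi."""
    return pi(math.inf) - pi(detection_time(T, S, i))

# Example penalty: detection likelihood (DL), pi(t) = 0 if detected, pi(inf) = 1.
dl = lambda t: 1.0 if math.isinf(t) else 0.0
```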

We formulate the objective this way because $f_i$ is a submodular function: minimizing the penalty $\pi_i$ directly would not give a submodular objective, but maximizing the penalty reduction $f_i$ does.

Adding a new sensor gives diminishing returns, which is what makes f submodular: in terms of f, adding a new sensor to a set helps less than adding the same sensor to a subset of that set.

For all $A \subseteq B \subseteq V$ and a sensor $x \in V \setminus B$:

$$f(A \cup \{x\}) - f(A) \geq f(B \cup \{x\}) - f(B)$$

To prove this we consider 3 cases, based on when A, B, and x detect the outbreak. Note that $T(B, i) \leq T(A, i)$ always holds: since $A \subseteq B$, adding sensors can only make detection earlier.

Case 1

x is the last to detect, i.e. $T(B, i) \leq T(A, i) \leq T(x, i)$, hence neither A nor B benefits from adding x.

So $f_i(A \cup \{x\}) = f_i(A)$ and $f_i(B \cup \{x\}) = f_i(B)$, which gives $f_i(A \cup \{x\}) - f_i(A) = f_i(B \cup \{x\}) - f_i(B) = 0$: the submodularity inequality holds with equality.

Case 2

x detects after B but before A, i.e. $T(B, i) \leq T(x, i) \leq T(A, i)$: x detects sooner than any node in A, but no sooner than the best node in B. So x improves the solution for A but not for B:

$$f_i(A \cup \{x\}) - f_i(A) \geq 0 = f_i(B \cup \{x\}) - f_i(B)$$

Case 3

x detects earlier than both A and B, i.e. $T(x, i) < T(B, i) \leq T(A, i)$. Here the inequality $f_i(A) \leq f_i(B)$ holds because $f_i$ is a non-decreasing function and A is a subset of B. So

$$f_i(A \cup \{x\}) - f_i(A) = [\pi_i(\infty) - \pi_i(T(x, i))] - f_i(A) \geq [\pi_i(\infty) - \pi_i(T(x, i))] - f_i(B) = f_i(B \cup \{x\}) - f_i(B)$$

From all the above cases we conclude that $f_i(\cdot)$ is submodular, which makes $f(\cdot)$ submodular as well, because $f(S) = \sum_i P(i) f_i(S)$ is a nonnegative weighted sum of submodular functions.
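As a quick sanity check (toy numbers of our own, not from the source), we can verify the diminishing-returns inequality on a tiny instance using the `penalty_reduction` helper sketched earlier, with a time-to-detection penalty capped at a finite horizon so that $\pi_i(\infty)$ is finite:

```python
# Detection times of a single outbreak (id 0) at four nodes.
T = {'a': {0: 5.0}, 'b': {0: 3.0}, 'c': {0: 8.0}, 'x': {0: 1.0}}

A = {'c'}          # A is a subset of B
B = {'b', 'c'}

HORIZON = 10.0
dt = lambda t: min(t, HORIZON)  # capped DT penalty

gain_A = penalty_reduction(T, A | {'x'}, 0, dt) - penalty_reduction(T, A, 0, dt)
gain_B = penalty_reduction(T, B | {'x'}, 0, dt) - penalty_reduction(T, B, 0, dt)
assert gain_A >= gain_B  # 7.0 >= 2.0: adding x helps the smaller set more
```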

Greedy hill climbing for optimising submodular functions is near optimal with factor $(1 - 1/e)$, but this only works for the unit-cost case, where each sensor costs the same; for us each sensor has a cost $c(s)$. Hill climbing is also slow: at each iteration we need to re-evaluate the marginal gains of all nodes, giving a runtime of $O(|V| \cdot K)$ function evaluations for placing K sensors, as in the sketch below.
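For concreteness, here is a minimal sketch of naive unit-cost hill climbing, assuming a black-box objective `f` (e.g. the expected penalty reduction) that can evaluate any sensor set; the names are ours:

```python
def greedy(V, f, k):
    """Naive hill climbing: K iterations, each re-evaluating all candidates in V."""
    S = set()
    for _ in range(k):
        # O(|V|) evaluations of f per iteration -> O(|V| * K) overall.
        best = max((u for u in V if u not in S),
                   key=lambda u: f(S | {u}) - f(S))
        S.add(best)
    return S
```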



CELF

CELF is an algorithm for optimizing submodular functions under cost constraints.

A naive way of doing hill climbing in this setting is to ignore the sensor cost and repeatedly select the sensor with the highest marginal gain until the budget is exhausted. This can result in a solution that is arbitrarily far from OPT, the optimal solution.

You can see this by considering one node with reward r that costs the whole budget B, while the rest of the nodes have reward $r - \epsilon$ and cost 1. The greedy algorithm picks the first node, exhausting the whole budget B for only a marginally higher reward, and ends up with a single-node solution S, while the optimal solution could have placed B cheap sensors for a much larger total reward.

Even greedily picking a new sensor location based on the benefit-cost ratio $s_i = \arg\max_{s \in V \setminus A_{i-1}} \frac{f(A_{i-1} \cup \{s\}) - f(A_{i-1})}{c(s)}$ can give an obviously bad solution. Consider two sensors $s_1$ and $s_2$ with costs $c(s_1) = \epsilon$, $c(s_2) = B$ and benefits $f(s_1) = 2\epsilon$, $f(s_2) = B$. The benefit-cost ratios are $f(s_1)/c(s_1) = 2$ and $f(s_2)/c(s_2) = 1$. If we select $s_1$ greedily based on this ratio, we end up with reward $2\epsilon$ instead of B, which is arbitrarily bad as $\epsilon \to 0$. The ratio criterion incentivizes choosing nodes with very low cost, even when slightly more expensive ones lead to much better global results.

Basically, once cost enters the picture, neither of these greedy criteria alone gives us a reliable optimisation metric.

CELF (Cost-Effective Lazy Forward selection) is a two-pass greedy algorithm: we compute a solution $S'$ with the benefit-cost greedy rule and a solution $S''$ with the unit-cost greedy rule, and take the better of the two as our actual solution, $S = \arg\max(f(S'), f(S''))$.

CELF is near optimal with a factor of $\frac{1}{2}(1 - 1/e)$ [Krause & Guestrin, '05]: taking the best of two individually suboptimal solutions is guaranteed to give a near-optimal solution. A sketch of this two-pass structure follows.
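A minimal sketch of the two passes (lazy evaluation, covered in the next section, is omitted here for clarity; `f` and `cost` are assumed black boxes, and the names are ours):

```python
def celf(V, f, cost, budget):
    """Run unit-cost greedy and benefit-cost greedy, return the better solution."""
    def greedy_pass(score):
        S, spent = set(), 0.0
        while True:
            affordable = [u for u in V if u not in S and spent + cost(u) <= budget]
            if not affordable:
                return S
            best = max(affordable, key=lambda u: score(S, u))
            S.add(best)
            spent += cost(best)

    S_unit = greedy_pass(lambda S, u: f(S | {u}) - f(S))               # S''
    S_ratio = greedy_pass(lambda S, u: (f(S | {u}) - f(S)) / cost(u))  # S'
    return max(S_unit, S_ratio, key=f)
```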



Speeding up Hill-Climbing

Suppose that up to step i we have picked the sensors $S_i = \{s_1,\cdots,s_i\}$; we now have to pick $s_{i+1} = \arg\max_u f(S_i \cup \{u\}) - f(S_i)$. We are trying to maximise the marginal gain of adding u to $S_i$; call it $\delta_i(u) = f(S_i \cup \{u\}) - f(S_i)$.

Using the submodularity property we can conclude that $\delta_i(u) \geq \delta_j(u)$ for $i < j$, since $S_i \subset S_j$.

Marginal benefits $\delta_i(u)$ only shrink as i grows.

Lazy Hill Climbing

From the above, we can use $\delta_i$ as an upper bound on $\delta_j$ for all $j > i$.

At every iteration we keep the marginal benefits $\delta_i$ of all nodes from the previous iterations, re-evaluate $\delta_i$ only for the top nodes, then re-order and prune.

We start by computing the marginal benefit of adding each node to S, and add the node a with the largest benefit to S. We keep the remaining marginal gains sorted, and in the next step re-compute them in that order: if the re-computed marginal benefit of the top node is smaller than the stale marginal benefits of the rest, we re-insert it into the sorted list and evaluate the next node, and so on. When we find a node whose freshly computed marginal benefit is the largest among all entries in the sorted list, we add it to S and move on to the next iteration. In the worst case this is only as good as the naive greedy hill-climbing algorithm.

It's called a lazy algorithm because we only re-evaluate a marginal benefit when we have to, as in the heap-based sketch below.
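A minimal sketch of lazy evaluation using a max-heap of stale marginal gains (variable names are ours; assumes a monotone submodular `f` with $f(\emptyset) = 0$):

```python
import heapq

def lazy_greedy(V, f, k):
    """Lazy hill climbing: keep stale upper bounds on marginal gains in a max-heap."""
    S, fS = set(), 0.0
    # Heap entries are (-gain, node, iteration at which the gain was computed).
    heap = [(-f({u}), u, 0) for u in V]
    heapq.heapify(heap)
    for it in range(1, k + 1):
        while True:
            neg_gain, u, stamp = heapq.heappop(heap)
            if stamp == it:             # gain is fresh this iteration: u is the true argmax
                S.add(u)
                fS += -neg_gain
                break
            gain = f(S | {u}) - fS      # stale upper bound: re-evaluate and push back
            heapq.heappush(heap, (-gain, u, it))
    return S
```

By submodularity the stale values are valid upper bounds, so once a freshly re-evaluated node tops the heap it must be the true argmax, and every other node's evaluation is skipped.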

In practice, CELF with lazy evaluation runs about 700 times faster than the naive greedy hill-climbing algorithm in the example below.

CELF in practice



Data Dependent Bound

We know that we have the $(1 - 1/e)$ bound for submodular functions, which is a worst-case, data-independent bound. On some inputs our solution may be much better than 63% of optimal, so we would like a tighter bound based on the input data.

Suppose S is some solution to $f(S)$ with $|S| \leq k$, and let $OPT = \{t_1, \cdots, t_k\}$ be the optimal solution. For each node u, let $\delta(u) = f(S \cup \{u\}) - f(S)$ be the marginal gain in f from adding u to S.

If we order the marginal gains so that $\delta(1) \geq \delta(2) \geq \cdots$, then $f(OPT) \leq f(S) + \sum_{i=1}^k \delta(i)$.

we can prove the above statement as follows

$$\begin{aligned} f(OPT) &\leq f(OPT \cup S) \\ &= f(S) + f(OPT \cup S) - f(S) \\ &\leq f(S) + \sum_{i=1}^k[f(S \cup \{t_i\}) - f(S)] \\ &= f(S) + \sum_{i=1}^k \delta(t_i) \\ &\leq f(S) + \sum_{i=1}^k \delta(i) \end{aligned}$$

The first inequality uses monotonicity of f, the second uses submodularity, and the last holds because the $\delta(i)$ are the k largest marginal gains.

This is a data-dependent bound because $f(S)$ and $\delta$ depend on the input, but it makes no assumptions about how we obtained S. For some inputs this bound can be looser than 63%.
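The bound is cheap to compute once we have a solution; a minimal sketch under the same black-box-`f` assumption:

```python
def data_dependent_bound(V, f, S, k):
    """Upper bound on f(OPT): f(S) plus the k largest marginal gains w.r.t. S."""
    fS = f(S)
    gains = sorted((f(S | {u}) - fS for u in V if u not in S), reverse=True)
    return fS + sum(gains[:k])

# f(S) / data_dependent_bound(V, f, S, k) then certifies, for this particular
# input, what fraction of the optimal value the solution S achieves at least.
```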

Using this inequality, we get a bound, based on our data, on how far we are from the optimal solution.

The image below shows how the greedy algorithm performs under both the original bound and the data-dependent bound; our hill-climbing solution turns out to be much closer to the optimal solution than the worst-case bound suggests. This is modeled on a real metropolitan-area water network with 20k nodes and 25k edges. Using a cluster of 50 machines for a month, the authors simulated 3.6 million epidemic scenarios by randomising locations, days, and times of day.

data dependent bound

We can also see that placement heuristics, based on population, node degree, flow, diameter, etc., perform much worse than CELF.

CELF comparison