Locating the Epidemic Source in Complex Networks with Sparse Observers

Li, Xiang; Wang, Xiaojie; Zhao, Chengli; Zhang, Xue; Yi, Dongyun

doi:10.3390/app9183644

Open AccessArticle

Locating the Epidemic Source in Complex Networks with Sparse Observers

¹

College of Liberal Arts and Sciences, National University of Defense Technology, Changsha 410073, China

²

State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha 410073, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2019, 9(18), 3644; https://0-doi-org.brum.beds.ac.uk/10.3390/app9183644

Submission received: 16 August 2019 / Revised: 23 August 2019 / Accepted: 1 September 2019 / Published: 4 September 2019

(This article belongs to the Section Applied Physics General)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

Epidemic source localization is one of the most meaningful areas of research in complex networks, which helps solve the problem of infectious disease spread. Limited by incomplete information of nodes and inevitable randomness of the spread process, locating the epidemic source becomes a little difficult. In this paper, we propose an efficient algorithm via Bayesian Estimation to locate the epidemic source and find the initial time in complex networks with sparse observers. By modeling the infected time of observers, we put forward a valid epidemic source localization method for tree network and further extend it to the general network via maximum spanning tree. The numerical analyses in synthetic networks and empirical networks show that our algorithm has a higher source localization accuracy than other comparison algorithms. In particular, when the randomness of the spread path enhances, our algorithm has a better performance. We believe that our method can provide an effective reference for epidemic spread and source localization in complex networks.

Keywords:

epidemic source localization; randomness; bayesian estimation; sparse observers; maximum spanning tree; accuracy

1. Introduction

Epidemic spread [1,2] in complex networks is an important topic with significant value to help us better understand the spread law and influence scope of contagion. In fact, epidemic is a serious threat to human safety, and it is a new one, drawing research attention. If leaders can identify the epidemic source in time, they prefer to make the right decision and avoid greater harm. To protect citizens from epidemic harm, there is an urgent need for us to develop efficient methods to locate the epidemic source with higher accuracy.

In the past ten years, the source localization problem has attracted many scholars’ attention and some good works have been done, but remains still a challenge. Shah et al. [3] first proposed a rumor centrality, as maximum likelihood estimation of single-source, in tree network under SI model. Inspired by Shah, Zhu et al. [4] studied the source localization problem under the SIR model and presented a Jordan centrality. The node with the smallest Jordan centrality is a source, i.e., the largest distance between source to all infectious nodes remains the least. A similar concept, namely distance centrality, means the sum distance between one node and other all infectious nodes. The source always has the smallest distance centrality. Fioriti et al. [5] calculated the dynamical age of each node based on the importance of node dynamical. They considered the eigenvalue drop rate of the adjacency matrix as dynamical age when a node was eliminated from this network and the sources are the nodes with highest dynamical age. Based on the SIR model and incomplete node information, Zang et al. [6] proposed an advanced unbiased betweenness algorithm. They used a reverse propagation algorithm to build an extended infection graph and marked off several infection subgraphs where we can identify the source with the highest unbiased betweenness.

The above methods consider the location information of all nodes or most nodes in complex networks and the location information and time information of sparse observers can also achieve a satisfactory source localization accuracy. The information source localization method based on observer placement was presented by Pinto [7], called the Gaussian-based method. They calculated the likelihood of each node when it is considered to be the source using the data from observers. The larger the likelihood, the higher probability the node became the source with. Shen et al. [8] investigated a universal method, constructing a virtual backward diffusion tree to locate the source, called Time-Reversal Backward Spreading. By calculating the time of each node when observers reach it reversely, they chose the node with smallest variance as source. Furthermore, Huang et al. [9] extended and applied it in temporal network. For multiple source localization, Fu et al. [10] proposed a maximum–minimum strategy based on backward diffusion, which has been further extended by Hu et al. [11] via integer programming.

Moreover, Brockmann et al. [12] presented the equivalent distance concept, changing the disease-spread problem to a plane wave diffusion process. The source is in the middle of plane wave. Nino Antulov-Fantulin et al. [13] proposed a new source localization method-based Monte-Carlo simulation under the SIR model, through a lot of simulations. Based on source identification algorithm, Wang et al. [14] presented label propagation, where they set up the initial label, propagated the label, and chose the nodes with the tallest labels as the sources.

In this paper, we present the Pascal-based Epidemic Source Localization (PESL) algorithm as a simple and efficient framework for locating the epidemic source and finding the initial time in complex networks with sparse observers. We mainly consider the SIR spread model with time delay and provide a probabilistic method to locate the epidemic source via Bayesian Estimation and Maximum Likelihood Estimation. Simply speaking, we derived an optimal solution of the source localization problem in tree networks and extended to general networks. The experiments show that in synthetic networks and empirical networks, our algorithm has a better performance than other comparison algorithms.

2. Model and Method

In this section, we will provide a detailed description for the epidemic source localization algorithm. First, we will give a brief introduction to epidemic source localization problem formulation; for a tree network, we will next put forward PESL algorithm as a Bayesian and Maximum Likelihood estimator that can simultaneously estimate the probability that a node becomes the epidemic source and the initial time; finally, we will give an example in the tree network.

2.1. Problem Definition

Consider the early-warning mechanism of an epidemic, e.g., influenza. When some new flu starts spreading in a crowd, only a few people realize it is flu and chose to go to the hospital, while most citizens think they have a common cold with no need to go to the hospital. After that, hospital can only collect infection information of the few people and submit it to centers for disease control and prevention. When the number of infectious people exceeds the threshold, on the one hand, the centers for disease control and prevention organize social forces to isolate and treat patients; on the other hand, it is necessary for them to locate the epidemic source using the data from all the hospitals so as to curb the spread of the epidemic.

In this paper, we focus on an undirected graph

G = (V, E)

in which the spread occurs, where V denotes the set of nodes and E denotes the set of edges. In particular, the network remains static, which means its topology never changes during the epidemic spread model. The epidemic source

s^{*} \in V

represents the only node where the epidemic originates, and it triggers spread at an unknown initial time

t_{0}

. Furthermore, the spread process associated with a time delay along the edge is considered in our work based on epidemic spread process.

The spread process is modeled as follows. There are three kinds of node states in network, i.e., Susceptible (S), Infectious (I), and Recovered (R) in SIR spread model. The susceptible nodes represent the citizens who are easily infected but now stay healthy, whereas the infectious nodes represent the individuals who have already been infected and have the ability to infect neighbors. The recovered nodes denote the people who have the antibody after infection, or die. At first, there is only one node

s^{*}

in the infectious state and other nodes are all in susceptible state. Let

Γ_{v}

denote the neighbors of node v and assume that v is in the infectious state at time

t_{v}

. The node v will infect susceptible neighbors

Γ_{v}

at the probability of

β

, meanwhile it will recover to the recovered state at the probability of

γ

. Once a node

u \in Γ_{v}

is infected, its state will change from susceptible state to infectious state and node u can infect susceptible neighbors. At this time, the infected time of node u is at

t_{v} + θ_{v u}

, where

θ_{v u}

represents the time delay associated with edge

v u

. Finally, when there is no node in infectious state, the whole spread process terminates.

Let

O = {o_{k}}_{k = 1}^{K}

represent a group of observers whose infected time

T_{O} = {[t_{o_{1}}, t_{o_{2}}, . . ., t_{o_{K}}]}^{T}

can be acquired by us. Then the epidemic source localization problem can be described as follows: given the network topology and sparse observers (partial nodes in network), which can report their infected time, our goal is to find the epidemic source

s_{*}

in the network. Due to the randomness of the spread process, it is difficult to determine what infected the susceptible nodes before so that we cannot conclude actual spread path, which brings a huge challenge to epidemic source localization. Moreover, the initial time of epidemic spread is impossible to acquire, which increases the difficulty of locating the epidemic source.

2.2. Locating the Epidemic Source in Tree Network

Following the relevant definitions in graph theory, tree is a graph where there is only one path between any two vertices or a connected graph without circuit. If a connected subgraph exists in the original network containing all nodes without circuit, this subgraph is a maximum spanning tree of the original network. Next we research the epidemic source localization problem.

2.2.1. Infected Time Modeling

Consider the case of a tree network. Obviously, studying the epidemic spread problem in tree has a huge advantage because there is only one path between the source and each observer. Recall that

O = {o_{k}}_{k = 1}^{K}

is a set of K observers with infected time

T_{O} = {[t_{o_{1}}, t_{o_{2}}, . . ., t_{o_{K}}]}^{T}

. Since the time delay

{θ_{v u}}

along edges are random variables, the infected time

{τ_{s k}}

of observers can also be seen as random variables. Suppose that the value of

θ_{v u}

can be all positive integers

θ_{v u} \in Z_{+}

and we next discuss the distribution of random variable

θ_{v u}

.

\{\begin{matrix} P (θ_{v u} = 1) & = \frac{1}{Z} \cdot β \\ P (θ_{v u} = 2) & = \frac{1}{Z} \cdot β (1 - β) (1 - γ) \\ ⋮ \\ P (θ_{v u} = k) & = \frac{1}{Z} \cdot β {(1 - β)}^{k - 1} {(1 - γ)}^{k - 1} \end{matrix}

(1)

where

β

and

γ

denote infection probability and recovery probability respectively and Z means normalized factor, which can be calculated directly:

\begin{matrix} Z & = β + β (1 - β) (1 - γ) + β {(1 - β)}^{2} {(1 - γ)}^{2} + . . . \\ = \frac{β}{β + γ - β γ} \end{matrix}

(2)

Let

α = β + γ - β γ

, so the distribution of random variable

θ_{v u}

can be simplified to

P (θ_{v u} = k) = α {(1 - α)}^{k - 1}

that is:

θ_{v u}

is a random variable obeying geometric distribution

θ_{v u} \sim G e o (α)

.

Suppose one node s in network and it does not belong to observers

s \in V \ O

. If node s is the epidemic source in network, for one observer

o_{k}

, its infected time

τ_{s k}

should be the sum of the initial time

t_{0}

and all time delay

θ_{v u}

on the path from node s to node

o_{k}

.

τ_{s k} = t_{0} + \sum_{\bar{v u} \in P (s, o_{k})} θ_{v u}

(3)

where

P (s, o_{k}) \subset E

denotes the path from node s to node

o_{k}

. Thanks for the tree we consider, the path is unique.

Due to all random variables

θ_{v u}

with independent identical distribution

θ_{v u} \overset{i . i . d .}{\sim} G e o (α)

, we can find the infected time

τ_{s k}

of node

o_{k}

actually obeys a Pascal distribution after translation transformation, i.e.,

τ_{s k} \sim t_{0} + N B (d_{s k}, α)

, its probability density function shows:

P (τ_{s k} = t_{0} + r_{s k} | s, t_{0}) = C_{r_{s k} - 1}^{d_{s k} - 1} α^{d_{s k}} {(1 - α)}^{r_{s k} - d_{s k}}, r_{s k} \geq d_{s k}

(4)

where

C_{b}^{a} = \frac{b!}{a! (b - a)!}

represents the number of combinations and

d_{s k}

denotes the length of path

P (s, o_{k})

.

2.2.2. Parameter Estimation of S

Let

T_{O} = [τ_{s 1}, τ_{s 2}, . . ., τ_{s K}]

, so the posterior probability of node s as epidemic source shows:

P (s, t_{0} | T_{O} = T_{O}) = \frac{P (T_{O} = T_{O} | s, t_{0}) P (s, t_{0})}{P (T_{O} = T_{O})}

(5)

Suppose that there is no prior knowledge of the epidemic source and the observers, i.e., the probability of any node as the epidemic source in the network keeps equal and the probability that any node becomes an observer is equal too. Then the following derivation can be obtained.

\begin{matrix} P (s, t_{0} | T_{O} = T_{O}) \propto P (T_{O} = T_{O} | s, t_{0}) \approx \prod_{o_{k} \in O} P (τ_{s k} = t_{o_{k}} | s, t_{0}) \end{matrix}

(6)

Let

L (s, t_{0} | T_{O} = T_{O}) = \sum_{o_{k} \in O} log P (τ_{s k} = t_{o_{k}} | s, t_{0})

, so the epidemic source localization problem can be transformed into following optimization problem.

(\hat{s}, \hat{t_{0}}) = \underset{s \in V, t_{0} \in Z}{a r g m a x} L (s, t_{0} | T_{O} = T_{O})

(7)

To solve this optimization problem, we first calculate initial time

\hat{t_{0}} = {a r g m a x}_{t_{0} \in Z} L (s, t_{0} | T_{O} = T_{O})

of each node s and then drag in Equation (7) to get the objective function value

L (s, \hat{t_{0}} | T_{O} = T_{O})

. And the node with the maximum objective function value is the epidemic source in network.

When the network is not a tree but a general network, the first idea is to build a maximum spanning tree and then locate the epidemic source same as before. Considering that the epidemic spread between nodes prefers to occur in a shorter path, we can build a BFS tree randomly and then find the node with maximum objective function as the epidemic source. In fact, from Equation (4), it is even unnecessary to actually construct these BFS trees. If only the distance

d_{s k}

between node s and each observer

o_{k}

is known, parameter estimation and likelihood probability can be calculated. As the core of the whole algorithm is to optimize the Pascal distribution, it is called the Pascal-based Epidemic Source Localization algorithm (PESL).

2.3. Initial Time Range Confirmation

In the PESL algorithm, the key procedure is calculating initial time

\hat{t_{0}} = {a r g m a x}_{t_{0} \in Z} L (s, t_{0} | T_{O} = T_{O})

of one node s. Theoretically, the range of

t_{0}

is all integers

t_{0} \in Z

. We may research all possible

t_{0}

, which is not realistic because of its high computational complexity, however. Therefore, we must shrink the range of

t_{0}

through a certain method.

On the one hand, from Equation (4), for any observer

o_{k}

, in equation

r_{s k} = t_{o_{k}} - t_{0} \geq d_{s k}

sets up all the tie, we next get

t_{0} \leq \underset{o_{k} \in O}{m i n} \{t_{o_{k}} - d_{s k}\}

(8)

On the other hand, from the properties of Pascal distribution, for a random variable

X \sim N B (r, p)

subject to Pascal distribution, when

X = ⌊r + \frac{(1 - p) (r - 1)}{p}⌋ = ⌊\frac{r + p - 1}{p}⌋

, the corresponding probability density function can be maximized.

Therefore, the maximum of probability density function is

r_{s k} = ⌊\frac{d_{s k} + α - 1}{α}⌋

in Equation (4) that is:

t_{0} = t_{o_{k}} - r_{s k} = t_{o_{k}} - ⌊\frac{d_{s k} + α - 1}{α}⌋

. Inspired, narrow the research range further of

t_{0}

t_{0} \geq \underset{o_{k} \in O}{m i n} \{t_{o_{k}} - ⌊\frac{d_{s k} + α - 1}{α}⌋\}

(9)

Let

t_{0}^{l} = {m i n}_{o_{k} \in O} \{t_{o_{k}} - ⌊\frac{d_{s k} + α - 1}{α}⌋\}

,

t_{0}^{r} = {m i n}_{o_{k} \in O} \{t_{o_{k}} - d_{s k}\}

, we just find the optimal initial time

t_{0}

from

[t_{0}^{l}, t_{0}^{l} + 1, . . ., t_{0}^{r}]

that is

\hat{t_{0}} = \underset{t_{0}^{l} \leq t_{0} \leq t_{0}^{r}, t_{0} \in Z}{a r g m a x} L (s, t_{0} | T_{O} = T_{O})

(10)

2.4. An Example in Tree Network

To better describe implementation process of PESL algorithm, we make some tests of epidemic source localization in a tree network of 12 nodes and 11 edges. Figure 1 shows we simulate the epidemic spread in this network under the SIR model, where infection probability is

β = 0.5

, recovery probability is

γ = 0.3

, and thus

α = β + γ - β γ = 0.65

. Assume node h is the epidemic source. When initial time is

t = 0

, only this node is infectious; others are all susceptible. Figure 1a–f represent the state of each node in the network at six successive moments, respectively. The nodes in the orange box denote the three observers

O = [i, e, g]

and their infected time is

T_{O} = [1, 2, 5]

. Without knowing the concrete spread process, we try to locate the epidemic source only by their observers and its infected time.

Now we introduce the calculation process of PESL algorithm, taking node h for example. From Figure 1, it is easy to calculate the distance between node h and each observer:

\{\begin{matrix} d_{h 1} & = d i s t (h, i) = 1 \\ d_{h 2} & = d i s t (h, e) = 2 \\ d_{h 3} & = d i s t (h, g) = 3 \end{matrix}

(11)

From Equations (8) and (9), for node h, we can confirm the research range of initial time

t_{0}

:

\{\begin{matrix} t_{0}^{l} & = m i n \{1 - ⌊\frac{1 + 0.65 - 1}{0.65}⌋, 2 - ⌊\frac{2 + 0.65 - 1}{0.65}⌋, 3 - ⌊\frac{3 + 0.65 - 1}{0.65}⌋\} = - 2 \\ t_{0}^{r} & = m i n \{1 - 1, 2 - 2, 5 - 3\} = 0 \end{matrix}

(12)

Next, using Equation (10) to solve optimal initial time

t_{0}

, when

\hat{t_{0}} = 0

, objective function

L (h, t_{0} | T_{O} = T_{O})

can get the maximum −2.89(

\hat{t_{0}} = - 2

,

L = - 7.18

;

\hat{t_{0}} = - 1

,

L = - 4.84

).

Table 1 shows the epidemic source localization result using the PESL algorithm in the tree network. From this result, when

\hat{t_{0}} = 0

, the objective function value of node h can achieve maximum among all nodes, i.e., this algorithm locates the epidemic source successfully and gets the correct initial time.

3. Experiments and Analysis

In the previous section, the solving process of the epidemic source localization problem under the SIR model has been elaborated in detail. This section comprehensively analyzes the performance of PESL algorithm and other common source localization algorithms, and verifies the performance of the algorithms in synthetic networks and empirical networks, respectively.

To get closer to the epidemic source localization in the real world, the classic SIR model is used in the network to simulate spread process. The infection probability is set as

β

and recovery probability as

γ

. A node is randomly selected as the epidemic source. Once a susceptible node is infected by a neighbor node, it will report its own state at the probability of

δ

, which is called the observer. When the number of observers in the network reaches a threshold K, the algorithm needs to locate the epidemic source backward according to the infected time of these observers. To verify the performance of the algorithm, Distance Centrality (DC) [4], Jordan Centrality (JC) [4], Gaussian-based method (GAU) [7] and Time-Reversal Backward Spreading (TRBS) [8] were selected as comparison algorithms. DC and JC only use the location information of observers, while GAU and TRBS comprehensively use the location information of observers and infection time information to locate the epidemic source.

In this paper, we chose average error distance [15] as the evaluation index, which was described as follows. The average error distance is referred as the mean shortest distance of hops between the accurate source and estimated source found by an algorithm. The shorter average error distance is, the higher the accuracy becomes.

3.1. Source Localization Accuracy in Synthetic Networks

In this section, we mainly consider the source localization performance of the algorithm in two synthetic tree networks, namely ER tree (Erdös-Rényi tree, ERT) and BA tree (Barabási-Albert tree, BAT). These two networks are generated by random network (ER) and scale-free network (BA) respectively, which are all connected networks.

To generate BAT, we first make the initial network contain only one isolated node, then add one node at a time and add an edge according to the preference attachment rule in the BA model. Relatively speaking, it is very difficult to directly use ER model to generate a connected tree network with N nodes. Although

n - 1

edges can be randomly generated among the N nodes, there is no guarantee that the generated network is a connected network. Therefore, a compromise is adopted here: First, generate a ER network with

1.5 N

node and 1 average degree; then, if the giant component in the network has N nodes (through the vast simulations, we do find the giant component), take the maximum spanning tree as the approximation of ERT. The basic topological properties of these two tree networks are shown in Table 2.

Figure 2 shows the source localization performance under various infection probabilities with five various algorithms in ERT and BAT. In ERT, the epidemic source localization accuracy of PESL algorithm is consistently better than other comparison algorithms. In BAT, PESL algorithm can still achieve good performance of source localization, which reflects the advantages of PESL algorithm on the source localization in ERT and BAT. Intuitively speaking, the larger the infection probability

β

under the SIR model is, the earlier the infected time of the nodes near the epidemic source will be, so the infected time of the observers will be closer to the initial time and a better source localization algorithm should have higher accuracy. It can be found from the results in the figure that the localization accuracy of GAU, TRBS, and PESL increases with the increasing of infection probability, while DC and JC, which only use the location information of nodes, do not show such a trend.

The relationship between source localization performance and reporting probability of different source localization algorithms in two kinds of tree networks is shown in Figure 3. It is easy to see that PESL algorithm is still the most competitive source localization algorithm. Especially when the infection probability

β

is very small, the performance of PESL algorithm is significantly better than other comparison algorithms, indicating that PESL algorithm can well deal with the influence of randomness in the spread process. As can be seen from Figure 3c and Figure 3f, when the infection probability

β

is large, the randomness in the spread process decreases significantly and at this time, PESL algorithm cannot well reflect its advantages. On the other hand, with the same infection probability, the larger the reporting probability

δ

is, the closer the observers will be to the epidemic source, the less randomness brought by the observer will be and the gap between PESL algorithm and other algorithms will be further narrowed. In general, the results in Figure 3 indirectly indicate the applicability of PESL algorithm: as reporting randomness of source localization problem increases, the advantages of PESL algorithm will be more obvious compared with other algorithms.

The relationship between the performance of source localization algorithm and the number of observers is shown in Figure 4. In the visual impression of people, the more the number of observers is, the more accurate the source localization accuracy will be. However, the results in the figure are not like this: there is no obvious tendency between source localization accuracy of GAU, TRBS, PESL, and the number of observers; The localization accuracy of DC and JC declines as the number of observers increases. This means the information provided by observers in network has a lot of redundancy, as shown in Figure 1. Because node e reports the infected time after node i, and node g only can be infected by node e, even if we have known the infected time of node g, it has no ability to provide any useful information for source localization, which is called information redundancy. In a tree network, as the number of observers increases, a great source localization algorithm should perform as stably as possible. From this point of view, PESL algorithm has the strongest stability, i.e., it behaves best.

3.2. Parameter Estimation Accuracy

Considering that PESL algorithm can estimate the initial time while locating the epidemic source, we further study MSE of the initial time with different infection probabilities and reporting probabilities in ERT and BAT. Figure 5 shows the mean square error of the initial time

t_{0}

of our algorithm. The lower the mean square error is, the more accurate the estimate will be. It can be seen from the results that the MSE of

t_{0}

is positively correlated with the infection probability

β

and the reporting probability

δ

. When the infection probability

β

or the reporting probability

δ

increases, the estimation accuracy of the initial time will also increase.

3.3. Algorithm Performance under Unknown Parameters

In the previous experiment, in order to use the PESL algorithm to carry out the epidemic source localization, we must suppose that the infection probability

β

and recovery probability

γ

in the SIR model are all known. Only in this way can we calculate the parameters of the PESL algorithm

α = β + γ - β γ

and carry out the subsequent optimization process. However, it is difficult to know in advance the exact value of the infection probability

β

and the recovery probability

γ

for many real-life epidemic source localization problems. Therefore, it is necessary to study the effects of parameters

β

and

γ

on PESL algorithm. The specific results are shown in Figure 6, the vertical dotted line represents the true infection probability in the SIR model

\tilde{β}

and the horizontal coordinate denotes the parameter

β

used in PESL. When these two parameters are equal

β = \tilde{β}

, the PESL algorithm has the best source localization performance. When the parameters

β

and

\tilde{β}

differ greatly, the source localization accuracy of the algorithm will decrease significantly, indicating that the PESL algorithm is very sensitive to parameters. In practical problems, it is necessary to estimate the spread parameters as accurately as possible to better use the PESL algorithm to solve the source localization problem.

3.4. Source Localization Accuracy in Empirical Networks

The problem of epidemic source localization in empirical networks is discussed below. Nine empirical networks from the real world are selected to verify the performance of the algorithm. Hospital [16]: The data comes from the contact between patients and medical staff in one hospital of Lyon between 6 December 2010 and 10 December. In the network, node represents the patient or medical personnel, edge between two nodes denotes the close contact(including chat, ward inspection, operation, etc.). Workplace [17]: The network consists of the communication among staffs in an office building in France from 23 June 2013 to 3 July 2013. When two staff members have face-to-face chat, an edge is built between the corresponding nodes. ACM [18]: The data set is an exchange of participants’ data collected during the 2009 ACM Hypertext conference. All participants who volunteer to participate in the data collection program wear a radio badge that records a data item when two participants get close to each other. School [19]: The original data recorded the contact between all students in a local high school of Marseilles, France in December 2013. During the five-day data collection period, any two students who had face-to-face communication were recorded. Infectious [18]: The human exchange network at the art and science fair in Dublin, Ireland, recorded a data item between INFECTIOUS: STAY AWAY (2009) of participants when they became close to each other. URV [20]: The internal email communication network of Rovira I Virgili university in Tarragona, southern Catalonia, Spain. The nodes in the network include the university’s students and faculty members. Yeast [21,22,23,24]: A contact network of proteins within yeast, where node represents proteins and edge denotes the relationships between the two proteins. OpenFlights [25]: The traffic network consists of all the flight information collected by OpenFlights.org. The node in the network represents the airports and edge denotes the flight between them. Sex [26,27]: The sex trade information is collected by a public forum. The original network is a bipartite graph between sex buyers and sex workers, where the edge denotes the trade between them. In the experiment, we ignore the identity and deal with it as an ordinary network.

In above nine empirical networks, Hospital, Workplace, ACM, School and Infectious networks are all collected by the SocioPatterns project http://www.sociopatterns.org/ about the close contact among people, Sex network is a network of intimate sexual contact among citizens, URV network represents the telecommunication networks, Yeast and OpenFlights networks reflect the information diffusion among the human being. For the research convenience, only the giant component of these networks is preserved, and the multilateral and self-loop are deleted. The basic properties of the network obtained are shown in Table 3.

The relation between epidemic source localization accuracy and infection probability with various source localization algorithms in nine empirical networks is shown in Figure 7. The recovery probability in the SIR model was set as

γ = 0.1

and the reporting probability of nodes was set as

δ = 0.3

. It can be seen from the results that the performance of PESL algorithm in all empirical networks except URV network is consistently better than that of other algorithms. When there is a lower infection probability, the spread process has stronger randomness. Although at this time GAU and TRBS algorithms consider location information and time information of observers, their behavior is not better than DC and JC algorithms only considering the location information especially in Hospital and ACM networks. When the infection probability increases, GAU and TRBS will gradually show their wonderful performance. Figure 8 shows the relationship between the performance of source localization algorithm and the reporting probability. Similar to the previous situation, PESL algorithm still has a great source localization performance.

4. Conclusions

Source localization problem is one of the most important topics in complex networks, including information source estimation, rumor source identification and epidemic source localization. In particular, locating the epidemic source in complex network is a classical application, helping us to deal with epidemic problems in a way. In this paper, the main purpose was to propose a method that can locate the epidemic source in complex networks with sparse observers. First, we did some derivation about the probability distribution of infected time of observers. In consideration of the randomness of the spread process, we modeled infected time using probability and then found that the infected time of two adjacent nodes in the network is independent and obeys the geometric distribution of the same parameter. Thus, the infected time from the source to any observer obeys the Pascal distribution. Next, we proposed an epidemic source localization algorithm based on Pascal distribution. We found that when there is no prior knowledge with epidemic source and observers; the posterior probability of epidemic source was equal to the product of infected time of each observer, so that it located the epidemic source through maximizing the posterior probability. After analysis of the initial time range, the algorithm only needed to optimize and solve in a very small search space, which greatly improved the computational efficiency.

To verify the performance of the algorithm, the epidemic source localization experiments were carried out in the synthetic networks and empirical networks and we chose average error distance as the evaluation index. The results showed that the epidemic source localization accuracy of our algorithm in most cases was better than other several comparison algorithms; meanwhile, when the randomness increased, the performance of our algorithm behaved better. The randomness mainly reflected which node was infected by which node through which path. The results of sensitivity analysis showed that the source localization performance of this algorithm was closely related to the parameters in the SIR model. However, obtaining spread parameters from real data has always been a challenge, especially for a new epidemic, whose spread parameters could only be inferred after a large outbreak in a population.

Although our proposed method provides a new reference for source localization problem in complex networks, more considerable works remain to be done. How to identify the multiple sources in complex networks now is a hot topic and useful in practical applications. Compared with the single-source localization problem, the case of multiple sources localization becomes more difficult because it needs to locate the sources at the same time in most cases. However, there are some works [6,28,29] dealing with multi-source localization via two-step strategies that first divide the network into several communities and then apply the single-source localization method to locate sources in each community. In addition, the network topology we consider in this paper is static and just single layer. In fact, the topology in the real world remains dynamic as time goes by or has multiple layers. As we know, few theoretical and practical works have proposed source localization in multi-layer networks [30] and temporal networks [31].

References

Author Contributions

Conceptualization, C.Z., X.L. and D.Y.; methodology, X.L. and X.W.; software, X.L. and X.W.; validation, X.W. and X.L.; formal analysis, X.W.; investigation, X.Z.; resources, X.W.; data curation, X.W.; writing—original draft preparation, X.L.; writing—review and editing, X.Z.; visualization, X.L. and X.W.; supervision, X.W. and X.L.; project administration, C.Z.; funding acquisition, C.Z. This paper was prepared the contributions of all authors. All authors have read and approved the final manuscript.

Funding

This work is supported by the National Key R&D Program of China (Grant No. 2017YCF1200301).

Acknowledgments

We thank Yangyang Liu for valuable discussions.

Conflicts of Interest

The authors declare no conflict of interest.

References

Prakash, B.A.; Vreeken, J.; Faloutsos, C.T. Spotting Culprits in Epidemics: How Many and Which Ones? In Proceedings of the 2012 IEEE 12th International Conference on Data Mining, Brussels, Belgium, 10–13 December 2012. [Google Scholar]
Luo, W.; Tay, W.P.; Leng, M. Identifying Infection Sources and Regions in Large Networks. IEEE Trans. Signal Process. 2013, 61, 2850–2865. [Google Scholar] [CrossRef] [Green Version]
Shah, D.; Zaman, T. Rumor centrality: A universal source detector. ACM Sigmetrics Perform. Eval. Rev. 2012, 40, 199–210. [Google Scholar] [CrossRef]
Zhu, K.; Ying, L. Information Source Detection in the SIR Model: A Sample Path Based Approach. IEEE/ACM Trans. Netw. 2012, 24, 408–421. [Google Scholar] [CrossRef]
Fioriti, V.; Chinnici, M. Predicting the sources of an outbreak with a spectral technique. arXiv 2012, arXiv:1211.2333. [Google Scholar] [CrossRef]
Zang, W.; Zhang, P.; Zhou, C. Locating multiple sources in social networks under the SIR model: A divide-and-conquer approach. J. Comput. Sci. 2015, 10, 278–287. [Google Scholar] [CrossRef]
Pinto, P.C.; Thiran, P.; Vetterli, M. Locating the Source of Diffusion in Large-Scale Networks. Phys. Rev. Lett. 2012, 109, 068702. [Google Scholar] [CrossRef] [PubMed]
Shen, Z.; Cao, S.; Wang, W.; Di, Z.; Stanley, H.E. Locating the source of diffusion in complex networks by time-reversal backward spreading. Phys. Rev. E 2016, 93, 32301. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Huang, Q.; Zhao, C.; Zhang, X. Locating the source of spreading in temporal networks. Phys. A 2016, 468, 434–444. [Google Scholar] [CrossRef]
Fu, L.; Shen, Z.; Wang, W.; Fan, Y.; Zengru, D.T. Multi-source localization on complex networks with limited observers. EPL 2016, 113, 18006. [Google Scholar] [CrossRef]
Hu, Z.; Shen, Z.; Tang, C.; Xie, B.; Lu, J. Localization of diffusion sources in complex networks with sparse observations. Phys. Lett. A 2018, 382, 931–937. [Google Scholar] [CrossRef]
Brockmann, D.; Helbing, D. The Hidden Geometry of Complex, Network-Driven Contagion Phenomena. Science 2013, 342, 1337–1342. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Antulovfantulin, N.; Lancic, A.; Smuc, T. Identification of Patient Zero in Static and Temporal Networks—Robustness and Limitations. Phys. Rev. A. 2014, 114, 248701. [Google Scholar]
Zheng, W.; Chaokun, W.; Jisheng, P. Multiple Source Detection without Knowing the Underlying Propagation Model. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
Cai, K. Information spreading forensics via sequential dependent snapshots. IEEE/ACM Trans. Netw. 2018, 26, 478–491. [Google Scholar] [CrossRef]
Vanhems, P. Estimating potential infection transmission routes in hospital wards using wearable proximity sensors. PLoS ONE 2013, 8, e73970. [Google Scholar] [CrossRef]
Genois, M. Data on face-to-face contacts in an office building suggest a low-cost vaccination strategy based on community linkers. Netw. Sci. 2015, 3, 326–347. [Google Scholar] [CrossRef] [Green Version]
Isella, L. What’s in a crowd? Analysis of face-to-face behavioral networkse. J. Theor. Biol. 2011, 271, 166–180. [Google Scholar] [CrossRef] [PubMed]
Mastr, R. Contact patterns in a high school: A comparison between data collected using wearable sensors, contact diaries and friendship surveys. PLoS ONE 2015, 10, e0136497. [Google Scholar]
Guimera, R. Self-similar community structure in a network of human interactions. PRE 2003, 68, 065103. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Jeong, H. Lethality and centrality in protein networks. Nature 2001, 411, 41. [Google Scholar] [CrossRef]
Coulomb, S. Gene essentiality and the topology of protein interaction networks. Biol. Sci. 2005, 272, 1721–1725. [Google Scholar] [CrossRef] [Green Version]
Han, J.-D.J. Effect of sampling on topology predictions of protein-protein interaction networks. Nat. Biotechnol. 2005, 23, 839. [Google Scholar] [CrossRef] [PubMed]
Stumpf, M.P. Subnets of scale-free networks are not scalefree: Sampling properties of networks. Proc. Natl. Acad. Sci. USA 2005, 102, 4221–4224. [Google Scholar] [CrossRef] [PubMed]
OpenFlights Network Dataset—KONECT. Available online: http://konect.uni-koblenz.de/networks/openflights (accessed on 2 September 2019).
Sexual Escorts Network Dataset—KONECT. Available online: http://konect.unikoblenz.de/networks/escorts (accessed on 2 September 2019).
Rocha, L.E. Information dynamics shape the sexual networks of Internet-mediated prostitution. Proc. Natl. Acad. Sci. USA 2010, 107, 5706–5711. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Zang, W.; Zhang, P.; Zhou, C.; Guo, L. Discovering Multiple Diffusion Source Nodes in Social Networks. Int. Conf. Concept. Struct. 2014, 29, 443–452. [Google Scholar] [CrossRef] [Green Version]
Tang, W.; Ji, F.; Tay, W.P. Multiple sources identification in networks with partial timestamps. In Proceedings of the 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Montreal, QC, Canada, 14–16 November 2017; pp. 638–642. [Google Scholar]
Kivelä, M.; Arenas, A.; Barthelemy, M.; Gleeson, J.P.; Moreno, Y.; Porter, M.A. Multilayer networks. J. Complex Netw. 2014, 2, 203–271. [Google Scholar] [CrossRef] [Green Version]
Holme, P. Modern temporal network theory: A colloquium. Eur. Phys. J. B 2015, 88, 1–30. [Google Scholar] [CrossRef]

Figure 1. Schematic diagram of SIR spread process in a tree network. The time delay of each node all equals 1. (a–f) respectively represent the states of all nodes in network at six successive moments, among which white represents susceptible, red represents infectious, and purple represents recovered. Node h is the epidemic source and the initial time is

t_{0} = 0

. At this moment, only this node is infectious and other nodes are all susceptible. As time goes by, more and more nodes change from susceptible to infectious, and the infectious nodes also recover successively. The node in the orange box represents the observer. The goal of PESL algorithm is to locate the epidemic source in the network according to the network topology and observers, which report its infected time.

Figure 1. Schematic diagram of SIR spread process in a tree network. The time delay of each node all equals 1. (a–f) respectively represent the states of all nodes in network at six successive moments, among which white represents susceptible, red represents infectious, and purple represents recovered. Node h is the epidemic source and the initial time is

t_{0} = 0

. At this moment, only this node is infectious and other nodes are all susceptible. As time goes by, more and more nodes change from susceptible to infectious, and the infectious nodes also recover successively. The node in the orange box represents the observer. The goal of PESL algorithm is to locate the epidemic source in the network according to the network topology and observers, which report its infected time.

Figure 2. The relationship between epidemic source localization accuracy and infection probability in tree networks. (a–c) represent the relationship between average error distance and infection probability in ERT under the different reporting probability of nodes

δ = 0.3, 0.6, 0.9

respectively. (d–f) represent the same results in BAT. In the figure, the abscissa represents the infection probability

β

of SIR model, the ordinate denotes the distance between the real source and the source found by different source localization algorithms. The recovery probability used in the simulation is

γ = 0.1

and the number of observers is

K = 10

, which are same in Figure 3.

Figure 2. The relationship between epidemic source localization accuracy and infection probability in tree networks. (a–c) represent the relationship between average error distance and infection probability in ERT under the different reporting probability of nodes

δ = 0.3, 0.6, 0.9

respectively. (d–f) represent the same results in BAT. In the figure, the abscissa represents the infection probability

β

of SIR model, the ordinate denotes the distance between the real source and the source found by different source localization algorithms. The recovery probability used in the simulation is

γ = 0.1

and the number of observers is

K = 10

, which are same in Figure 3.

Figure 3. The relationship between epidemic source localization accuracy and reporting probability in tree networks. (a–c) represent the relationship between average error distance and reporting probability in ERT under the different infection probability of nodes

β = 0.2, 0.5, 0.8

respectively. (d–f) represent the same results in BAT. The abscissa represents the reporting probability

δ

of nodes.

Figure 3. The relationship between epidemic source localization accuracy and reporting probability in tree networks. (a–c) represent the relationship between average error distance and reporting probability in ERT under the different infection probability of nodes

β = 0.2, 0.5, 0.8

respectively. (d–f) represent the same results in BAT. The abscissa represents the reporting probability

δ

of nodes.

Figure 4. The relationship between source localization accuracy and the number of observers in tree networks. (a,b) represent the relationship between average error distance and the number of observers in ERT under

β = 0.2, δ = 0.6

and

β = 0.5, δ = 0.3

respectively. (c,d) represent the same results in BAT. The abscissa represents the number of observers K.

Figure 4. The relationship between source localization accuracy and the number of observers in tree networks. (a,b) represent the relationship between average error distance and the number of observers in ERT under

β = 0.2, δ = 0.6

and

β = 0.5, δ = 0.3

respectively. (c,d) represent the same results in BAT. The abscissa represents the number of observers K.

Figure 5. MSE of initial time in tree network. (a,c) represent the relationship between MSE of

t_{0}

and

β

under various

δ = 0.3, 0.6, 0.9

in ERT and BAT respectively. (b,d) represent the relationship between MSE of

t_{0}

and

δ

under various

β = 0.2, 0.5, 0.8

in ERT and BAT respectively. The ordinate in the figure represents the mean square error of the initial time

t_{0}

. The lower the mean square error is, the more accurate the estimation of PESL becomes. In all experiments, the number of observers was

K = 10

and the recovery probability in the SIR model was

γ = 0.1

.

Figure 5. MSE of initial time in tree network. (a,c) represent the relationship between MSE of

t_{0}

and

β

under various

δ = 0.3, 0.6, 0.9

in ERT and BAT respectively. (b,d) represent the relationship between MSE of

t_{0}

and

δ

under various

β = 0.2, 0.5, 0.8

in ERT and BAT respectively. The ordinate in the figure represents the mean square error of the initial time

t_{0}

. The lower the mean square error is, the more accurate the estimation of PESL becomes. In all experiments, the number of observers was

K = 10

and the recovery probability in the SIR model was

γ = 0.1

.

Figure 6. The algorithm performance with unknown infection probability. (a,c) represent the relationship between average error distance and

β

when

\tilde{β} = 0.2

in ERT and BAT respectively. (b,d) represent the relationship between average error distance and

β

when

\tilde{β} = 0.8

in ERT and BAT respectively. The x-axis in the figure represents the parameter

β

used in the PESL algorithm and the vertical dotted line denotes the true infection probability

\tilde{β}

in the SIR model. Reporting probability of nodes is

δ = 0.3, 0.6, 0.9

respectively.

Figure 6. The algorithm performance with unknown infection probability. (a,c) represent the relationship between average error distance and

β

when

\tilde{β} = 0.2

in ERT and BAT respectively. (b,d) represent the relationship between average error distance and

β

when

\tilde{β} = 0.8

in ERT and BAT respectively. The x-axis in the figure represents the parameter

β

used in the PESL algorithm and the vertical dotted line denotes the true infection probability

\tilde{β}

in the SIR model. Reporting probability of nodes is

δ = 0.3, 0.6, 0.9

respectively.

Figure 7. The performance of epidemic source localization accuracy under various infection probability with five algorithms in empirical networks. The number of observers changes due to the scale of network. In (a–e),

K = 10

; in (f–h),

K = 25

; in (i),

K = 50

. The recovery probability used in the simulation is

γ = 0.1

and the reporting probability of nodes is

δ = 0.3

.

Figure 7. The performance of epidemic source localization accuracy under various infection probability with five algorithms in empirical networks. The number of observers changes due to the scale of network. In (a–e),

K = 10

; in (f–h),

K = 25

; in (i),

K = 50

. The recovery probability used in the simulation is

γ = 0.1

and the reporting probability of nodes is

δ = 0.3

.

Figure 8. The performance of epidemic source localization accuracy in empirical networks under various reporting probabilities with five algorithms. (a–i) represent the results in Hospital, Workplace, ACM, School, Infectious, URV, Yeast, OpenFlights and Sex network respectively. In the figure, the abscissa represents the reporting probability of nodes

δ

, the ordinate denotes the distance between the real source and the source found by different source localization algorithms. The number of observers is similar to Figure 7. The infection probability used in the simulation was

β = 0.2

and the recovery probability was

γ = 0.1

.

Figure 8. The performance of epidemic source localization accuracy in empirical networks under various reporting probabilities with five algorithms. (a–i) represent the results in Hospital, Workplace, ACM, School, Infectious, URV, Yeast, OpenFlights and Sex network respectively. In the figure, the abscissa represents the reporting probability of nodes

δ

, the ordinate denotes the distance between the real source and the source found by different source localization algorithms. The number of observers is similar to Figure 7. The infection probability used in the simulation was

β = 0.2

and the recovery probability was

γ = 0.1

.

Table 1. The results of epidemic source localization in network. Where

{\hat{t}}_{0}

represents the initial time obtained from Equation (10) and

L_{s}

is the target function value obtained from Equation (7). The node with the largest objective function value is the epidemic source in the network. For the three observers in the network

(I, e, g)

, use their infected time as the initial time directly.

Table 1. The results of epidemic source localization in network. Where

{\hat{t}}_{0}

represents the initial time obtained from Equation (10) and

L_{s}

is the target function value obtained from Equation (7). The node with the largest objective function value is the epidemic source in the network. For the three observers in the network

(I, e, g)

, use their infected time as the initial time directly.

MLE	a	b	c	d	e	f	g	h	i	j	k	i
${\hat{t}}_{0}$	$- 3$	−2	−3	−1	2	−3	5	0	1	−1	−3	−3
$L_{s}$	−5.69	−5.94	−5.69	−6.84	−8.02	−11.58	−27.90	−2.89	−3.12	−3.67	−4.38	−4.38

Table 2. Basic topological properties of two tree networks. Where

N, M

represent the number of nodes and edges respectively,

〈k〉

and

k_{m a x}

denote the average degree and maximum degree respectively, H means index of degree heterogeneity in network and

〈d〉

is the mean shortest path in network.

Table 2. Basic topological properties of two tree networks. Where

N, M

represent the number of nodes and edges respectively,

〈k〉

and

k_{m a x}

denote the average degree and maximum degree respectively, H means index of degree heterogeneity in network and

〈d〉

is the mean shortest path in network.

Type	Name	N	M	$〈k〉$	$k_{\max}$	H	$〈d〉$
Erdös-Rènyi Tree	ERT	100	99	1.980	5	1.240	9.855
Barabási-Albert Tree	BAT	100	99	1.980	22	2.372	4.945

Table 3. Basic topological properties of nine various networks. Where

N, M

represent the number of nodes and edges,

〈k〉

and

k_{m a x}

denote the average degree and maximum degree, H means index of degree heterogeneity,

〈d〉

is the mean shortest path length and

〈C〉

represents mean local clustering coefficient.

Table 3. Basic topological properties of nine various networks. Where

N, M

represent the number of nodes and edges,

〈k〉

and

k_{m a x}

denote the average degree and maximum degree, H means index of degree heterogeneity,

〈d〉

is the mean shortest path length and

〈C〉

represents mean local clustering coefficient.

Type	Name	N	M	$〈k〉$	$k_{\max}$	H	$〈d〉$	$〈C〉$
Contact	Hospital	75	1139	30.373	61	1.244	1.598	0.640
Contact	Workplace	92	755	16.413	44	1.213	1.964	0.426
Contact	ACM	113	2196	38.867	98	1.223	1.656	0.535
Contact	School	327	5818	35.584	87	1.144	2.159	0.504
Contact	Infectious	410	2765	13.488	50	1.388	3.631	0.456
Communication	URV	1133	5451	9.622	71	1.942	3.606	0.220
Biological	Yeast	1458	1948	2.672	56	2.667	6.812	0.071
Transport	OpenFlights	3397	19,230	11.322	248	5.699	4.103	0.488
Contact	Sex	15,810	38,540	4.875	305	5.828	5.785	0.000

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, X.; Wang, X.; Zhao, C.; Zhang, X.; Yi, D. Locating the Epidemic Source in Complex Networks with Sparse Observers. Appl. Sci. 2019, 9, 3644. https://0-doi-org.brum.beds.ac.uk/10.3390/app9183644

AMA Style

Li X, Wang X, Zhao C, Zhang X, Yi D. Locating the Epidemic Source in Complex Networks with Sparse Observers. Applied Sciences. 2019; 9(18):3644. https://0-doi-org.brum.beds.ac.uk/10.3390/app9183644

Chicago/Turabian Style

Li, Xiang, Xiaojie Wang, Chengli Zhao, Xue Zhang, and Dongyun Yi. 2019. "Locating the Epidemic Source in Complex Networks with Sparse Observers" Applied Sciences 9, no. 18: 3644. https://0-doi-org.brum.beds.ac.uk/10.3390/app9183644

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Locating the Epidemic Source in Complex Networks with Sparse Observers

Abstract

1. Introduction

2. Model and Method

2.1. Problem Definition

2.2. Locating the Epidemic Source in Tree Network

2.2.1. Infected Time Modeling

2.2.2. Parameter Estimation of S

2.3. Initial Time Range Confirmation

2.4. An Example in Tree Network

3. Experiments and Analysis

3.1. Source Localization Accuracy in Synthetic Networks

3.2. Parameter Estimation Accuracy

3.3. Algorithm Performance under Unknown Parameters

3.4. Source Localization Accuracy in Empirical Networks

4. Conclusions

Author Contributions

Funding

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI