{"title": "Expectation Propagation with Stochastic Kinetic Model in Complex Interaction Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 2029, "page_last": 2039, "abstract": "Technological breakthroughs allow us to collect data with increasing spatio-temporal resolution from complex interaction systems. The combination of high-resolution observations, expressive dynamic models, and efficient machine learning algorithms can lead to crucial insights into complex interaction dynamics and the functions of these systems. In this paper, we formulate the dynamics of a complex interacting network as a stochastic process driven by a sequence of events, and develop expectation propagation algorithms to make inferences from noisy observations. To avoid getting stuck at a local optimum, we formulate the problem of minimizing Bethe free energy as a constrained primal problem and take advantage of the concavity of dual problem in the feasible domain of dual variables guaranteed by duality theorem. Our expectation propagation algorithms demonstrate better performance in inferring the interaction dynamics in complex transportation networks than competing models such as particle filter, extended Kalman filter, and deep neural networks.", "full_text": "Expectation Propagation with Stochastic Kinetic\n\nModel in Complex Interaction Systems\n\nLe Fang, Fan Yang, Wen Dong, Tong Guan, and Chunming Qiao\n\nDepartment of Computer Science and Engineering\n\n{lefang, fyang24, wendong, tongguan, qiao}@buffalo.edu\n\nUniversity at Buffalo\n\nAbstract\n\nTechnological breakthroughs allow us to collect data with increasing spatio-\ntemporal resolution from complex interaction systems. The combination of high-\nresolution observations, expressive dynamic models, and ef\ufb01cient machine learning\nalgorithms can lead to crucial insights into complex interaction dynamics and the\nfunctions of these systems. 
In this paper, we formulate the dynamics of a complex interacting network as a stochastic process driven by a sequence of events, and develop expectation propagation algorithms to make inferences from noisy observations. To avoid getting stuck at a local optimum, we formulate the problem of minimizing Bethe free energy as a constrained primal problem and take advantage of the concavity of the dual problem in the feasible domain of dual variables guaranteed by the duality theorem. Our expectation propagation algorithms demonstrate better performance in inferring the interaction dynamics in complex transportation networks than competing models such as the particle filter, the extended Kalman filter, and deep neural networks.

1 Introduction

We live in a complex world, where many collective systems are difficult to interpret. In this paper, we are interested in complex interaction systems, also called complex interaction networks, which are large systems of simple units linked by a network of interactions. Many research topics exemplify complex interaction systems in specific domains, such as neural activity in the brain, the movement of people in an urban system, and epidemic and opinion dynamics in social networks. Modeling and inference for the dynamics of these systems have attracted considerable interest, since they potentially provide valuable new insights, for example about functional areas of the brain and relevant diagnoses [7], about traffic congestion and more efficient use of roads [19], and about where, when, and to what extent people are infected in an epidemic crisis [23]. Agent-based modeling and simulation [22] is a classical way to address complex systems with interacting components and to explore general collective rules and principles, especially in the field of systems biology. However, the actual underlying dynamics of a specific real system are not within its scope.
People are no longer satisfied with only a macroscopic, general description; they aim to track down an evolving system.

Unprecedented opportunities for researchers in these fields have recently emerged due to the rapid growth of social media and sensor tools. For instance, functional magnetic resonance imaging (fMRI) and the electroencephalogram (EEG) can directly measure brain activity, something never possible before. Similarly, signal sensing technologies can now easily track people's movement and interactions [12, 24]. Researchers no longer need to worry about acquiring abundant observation data, and instead are pursuing more powerful theoretical tools to grasp the opportunities afforded by that data. We, in the machine learning community, are interested in the inference problem, that is, recovering the hidden dynamics of a system given certain observations. However, challenges still exist in these efforts, especially when facing systems with a large number of components.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Statistical inference on complex interaction systems has a close relationship with the statistical physics of disordered ensembles, for instance the established equivalence between loopy belief propagation and the Bethe free energy formulation [25]. In the past, the main interaction between statistical physics and statistical inference has focused on building stationary and equilibrium probability distributions over the state of a system. However, temporal dynamics is omitted when only the equilibrium state is pursued. This leads not only to the loss of a significant amount of interesting information, but possibly also to qualitatively wrong conclusions. In terms of learning dynamics, one approach is to solve stochastic differential equations (SDEs) [20]. In each SDE, at least one term belongs to a stochastic process, of which the most common is the Wiener process.
The drift and diffusion terms in these SDEs are what we need to recover from multiple realizations (sample paths) of the stochastic process. Typically, an assumption of constant diffusion and linear drift makes the problem tractable, but realistic dynamics generally cannot be modeled by rigid SDEs with simple assumptions.

Inference on complex interaction systems naturally corresponds to inference on large graphical models, which is a classical topic in machine learning. Exact filtering and smoothing algorithms are impractical due to the exploding computational cost of making inferences about complex systems. The hidden Markov model [17] faces an exponentially exploding size of the state transition kernel. The Kalman filter [15] and its variants, such as the extended Kalman filter [14], solve the linear or nonlinear estimation problem assuming that the latent and observed variables are jointly Gaussian. Its time complexity scales as O(M^3) in the number of components due to matrix operations.

Approximate algorithms to make inferences about complex interaction systems can be divided roughly into sampling-based and optimization-based methods. Among sampling-based methods, the particle filter and smoother [4, 18] use particles to represent the posterior distribution of a stochastic process given noisy observations. However, particle-based methods show weak scalability in a complex system: a large number of particles is needed, even in moderately sized complex systems where the number of components exceeds thousands. A variety of Markov chain Monte Carlo (MCMC) methods have been proposed [6, 5], but these generally converge slowly in high-dimensional systems. Among optimization-based methods, expectation propagation (EP) [16, 13] refers to a family of approximate inference algorithms with local marginal projection.
These methods adopt an iterative approach to approximate each factor of the target distribution within a tractable family. EP methods have been shown to be relatively efficient, faster than sampling in many low-dimensional examples [16, 13]. The equivalence between EP energy minimization and Bethe free energy minimization has been established [16]. Researchers have proposed a "double loop" algorithm to minimize the Bethe free energy [13] in order to handle the non-convex term in the objective. They formulate a saddle point problem in which, strictly speaking, the inner loop must converge before moving to the outer loop. However, the stability of saddle points is an issue in general. There are also ad hoc energy optimization methods for specific network structures, for instance [21] for binary networks, but the generality of these methods is unknown.

In this paper, we present a new formulation of EP and apply it to solve the inference problem in general large complex interaction systems. This paper makes the following contributions. First, we formulated expectation propagation as an optimization problem that maximizes a concave dual function, whose local maximum is also its global maximum and provides a solution to the Bethe free energy minimization problem. To this end, we transformed the concave term in the Bethe free energy into its Legendre dual and added a regularization constraint to the primal problem. Second, we designed gradient ascent and fixed point algorithms to make inferences about complex interaction systems with the stochastic kinetic model. In all the algorithms, we make mean-field inferences about the individual components from observations about them, according to the average interactions of all other components.
Third, we conducted experiments on our transportation network data to demonstrate the performance of our proposed algorithms over state-of-the-art algorithms in inferring complex network dynamics from noisy observations.

The remainder of this paper is organized as follows. In Section 2, we briefly review models that specify complex system dynamics and the issues in minimizing Bethe free energy. In Section 3, we formulate the problem of minimizing Bethe free energy as maximizing a concave dual function satisfying a dual feasibility constraint, and develop gradient-based and fixed-point methods to make tractable inferences with the stochastic kinetic model. In Section 4, we detail empirical results from applying the proposed algorithms to make inferences about transportation network dynamics. Section 5 concludes.

2 Background

In this section, we provide brief background on describing complex system dynamics and typical issues in minimizing Bethe free energy.

2.1 Dynamic Bayesian Network and State-Space Model

A dynamic Bayesian network (DBN) captures the dynamics of a complex interaction system by specifying how the values of the state variables at the current time depend probabilistically on the values at the previous time. Let $x_t = (x_t^{(1)}, \dots, x_t^{(M)})$ be the values of the $M$ state variables at time $t$ and $y_t = (y_t^{(1)}, \dots, y_t^{(M)})$ be the observations made at these state variables. The probability measure of a sample path with observations can be written as $p(x_{1,\dots,T}, y_{1,\dots,T}) = \prod_t p(x_t \mid x_{t-1}) p(y_t \mid x_t)$, where $p(x_t \mid x_{t-1})$ is the state transition model and $p(y_t \mid x_t) = \prod_m p(y_t^{(m)} \mid x_t^{(m)})$ is the observation model. We can factorize the state transition into miniature kernels involving only the variable $x_t^{(m)}$ and its parents $\mathrm{Pa}(x_t^{(m)})$. The DBN inference problem is to infer $p(x_t \mid y_{1,\dots,T})$ for given observations $y_{1,\dots,T}$.

State-space models (SSM) use state variables to describe a system by a set of first-order differential or difference equations. For example, the state evolves as $x_t = F_t x_{t-1} + w_t$ and we make observations with $y_t = H_t x_t + v_t$. Typical filtering and smoothing algorithms estimate the series of $x_t$ from the time series of $y_t$.

Both DBN and SSM face difficulties in directly capturing complex interactions, since these interactions seldom obey simple rigid equations and are too complex to be expressed by a joint transition kernel, even allowing such a kernel to be time-variant. The SKM that follows uses a sequence of events to capture such nonlinear and time-variant dynamics.

2.2 Stochastic Kinetic Model

The stochastic kinetic model (SKM) [9, 23] has been successfully applied in many fields, especially chemistry and systems biology [1, 22, 8]. It describes the dynamics with chemical reactions occurring stochastically at an adaptive rate. By analogy with a chemical reaction system, we consider a complex interaction system involving $M$ system components (species) and $V$ types of events (reactions). Generally, the system forms a Markov jump process [9] with a finite set of discrete events. Each event $v$ can be characterized by a "chemical equation":

$r_v^{(1)} X^{(1)} + \dots + r_v^{(M)} X^{(M)} \rightarrow p_v^{(1)} X^{(1)} + \dots + p_v^{(M)} X^{(M)}$   (1)

where $X^{(m)}$ denotes the $m$-th component, and $r_v^{(m)}$ and $p_v^{(m)}$ count the (relative) quantities of reactants and products. Let $x_t^{(m)}$ be the population count (or continuous concentration) of species $m$ at time $t$; an event changes the populations $(x_t^{(1)}, x_t^{(2)}, \dots, x_t^{(M)})$ by $\Delta_v = (p_v^{(1)} - r_v^{(1)}, p_v^{(2)} - r_v^{(2)}, \dots, p_v^{(M)} - r_v^{(M)})$.
Events occur mutually independently of each other, and each event rate $h_v(x_t, c_v)$ is a function of the current state:

$h_v(x_t, c_v) = c_v \prod_{m=1}^{M} g_v^{(m)}(x_t^{(m)}) = c_v \prod_{m=1}^{M} \binom{x_t^{(m)}}{r_v^{(m)}}$   (2)

where $c_v$ denotes the rate constant and $\prod_{m=1}^{M} \binom{x_t^{(m)}}{r_v^{(m)}}$ counts the number of different ways for the components to meet and trigger an event. When we consider time steps $1, 2, \dots, t, \dots, T$ with a sufficiently small time interval $\tau$, the probability of two or more events happening in one interval is negligible [11]. Consider a sample path $p(x_{1,\dots,T}, v_{2,\dots,T}, y_{1,\dots,T})$ of the system with the sequence of states $x_1, \dots, x_T$, happened events $v_2, \dots, v_T$, and observations $y_1, \dots, y_T$. We can express the event-based state transition kernel $P(x_t, v_t \mid x_{t-1})$ in terms of the event rate $h_v(x_t, c_v)$:

$P(x_t, v_t \mid x_{t-1}) = \mathbb{I}\left(x_t = x_{t-1} + \Delta_{v_t} \text{ and } x_t \in (x_{\min}, x_{\max})\right) \cdot \begin{cases} \tau h_v(x_{t-1}, c_v) & \text{if } v_t = v \\ 1 - \sum_v \tau h_v(x_{t-1}, c_v) & \text{if } v_t = \emptyset \end{cases}$   (3)

where $\emptyset$ represents a null event in which none of the $V$ events happens and the state does not change; $\mathbb{I}(\cdot)$ is the indicator function; and $x_{\min}, x_{\max}$ are respectively lower- and upper-bound vectors, which prohibit "ghost" transitions between out-of-scope $x_{t-1}$ and $x_t$. For instance, we generally need to bound $x_t$ to be non-negative in realistic complex systems.
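As a concrete illustration of the event rate (2) and the discrete-time kernel (3), the following minimal sketch enumerates event probabilities for one time step; the species, rate constant, and stoichiometry are hypothetical toy values, not from the paper's experiments:

```python
from math import comb

def event_rate(x, c_v, r_v):
    # Eq. (2): h_v(x, c_v) = c_v * prod_m C(x[m], r_v[m]); the binomial
    # coefficient counts the ways the reactants can meet and trigger event v.
    rate = c_v
    for x_m, r_m in zip(x, r_v):
        rate *= comb(x_m, r_m)
    return rate

def transition_probs(x, events, tau):
    # Eq. (3): with a sufficiently small step tau, event v fires with
    # probability tau * h_v(x, c_v); the null event keeps the state unchanged.
    probs = {}
    total = 0.0
    for name, (c_v, r_v, delta) in events.items():
        p = tau * event_rate(x, c_v, r_v)
        probs[name] = (p, [x_m + d_m for x_m, d_m in zip(x, delta)])
        total += p
    probs[None] = (1.0 - total, list(x))  # null event: no state change
    return probs

# Hypothetical two-species system with one event X1 + X2 -> 2 X2,
# i.e. r_v = (1, 1) and Delta_v = (-1, +1).
events = {"interact": (0.001, (1, 1), (-1, +1))}
kernel = transition_probs([10, 2], events, tau=0.1)
```

Here the event probability is $\tau c_v \binom{10}{1}\binom{2}{1} = 0.002$, and the null event absorbs the remaining mass, matching the two cases of Eq. (3).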
This natural constraint on $x_t$ leads to a linearly truncated state space in which realistic events lie.

Instead of state transitions possibly from any state to any other, as in a DBN, or state updates through a linear (or nonlinear) transformation, the state in the SKM evolves according to a finite number of events between time steps. The transition kernel depends on the underlying system state and is therefore adaptive in capturing the underlying system dynamics. We can now consider the inference problem of complex interaction systems in the context of a general DBN, with a specific event-based transition kernel from the SKM.

2.3 Bethe Free Energy

In a general DBN, the expectation propagation algorithm for inference aims to minimize the Bethe free energy $F_{\mathrm{Bethe}}$ [16, 25, 13], subject to moment matching constraints. We have a non-convex primal objective, and its trivial dual function with dual variables in the full space is not concave. With the general notation that the potential function is $\psi(x_{t-1,t}) = P(x_t, y_t \mid x_{t-1})$, our optimization problem becomes the following:

minimize $F_{\mathrm{Bethe}} = \sum_t \int dx_{t-1,t}\, \hat{p}_t(x_{t-1,t}) \log \frac{\hat{p}_t(x_{t-1,t})}{\psi(x_{t-1,t})} - \sum_t \int dx_t\, q_t(x_t) \log q_t(x_t)$

subject to: $\langle f(x_t) \rangle_{\hat{p}_t(x_{t-1,t})} = \langle f(x_t) \rangle_{q_t(x_t)} = \langle f(x_t) \rangle_{\hat{p}_{t+1}(x_{t,t+1})}$.

In the above, $\hat{p}_t(x_{t-1,t}) \approx p(x_{t-1,t} \mid y_{1,\dots,T})$ are approximate two-slice probabilities and $q_t(x_t) \approx p(x_t \mid y_{1,\dots,T})$ are approximate one-slice probabilities. The vector-valued function $f(x_t)$ maps the random variable $x_t$ to its statistics. The integrals $\langle f(x_t) \rangle_{\hat{p}_t(x_{t-1,t})} = \int dx_t f(x_t) \int dx_{t-1} \hat{p}_t(x_{t-1,t})$ and so on are the mean parameters to be matched in the optimization. $F_{\mathrm{Bethe}}$ is the relative entropy (or K-L divergence), to be minimized, between the approximate distribution $\prod_t \hat{p}_t(x_{t-1,t}) / \prod_t q_t(x_t)$ and the true distribution $p(x_{1,\dots,T} \mid y_{1,\dots,T}) = \prod_t \psi(x_{t-1,t})$. With the method of Lagrange multipliers, one can find that $\hat{p}_t(x_{t-1,t})$ and $q_t(x_t)$ are distributions in the exponential family, parameterized either by the mean parameters $\langle f(x_t) \rangle_{\hat{p}_t(x_{t-1,t})}$ and $\langle f(x_t) \rangle_{q_t(x_t)}$ or by the natural parameters $\alpha_{t-1}$ and $\beta_t$, and the trivial dual target

$F_{\mathrm{Dual}} = -\sum_t \log \int dx_{t-1,t} \exp(\alpha_{t-1}^\top f(x_{t-1}))\, \psi(x_{t-1,t}) \exp(\beta_t^\top f(x_t)) + \sum_t \log \int dx_t \exp((\alpha_t + \beta_t)^\top f(x_t))$

is the negative log partition of the dynamic Bayesian network.

The problem with minimizing $F_{\mathrm{Bethe}}$ or maximizing $F_{\mathrm{Dual}}$ is that both have multiple local optima, and there is no guarantee of how closely a local optimal solution approximates the true posterior probability of the latent state. In $F_{\mathrm{Bethe}}$, $\int dx_{t-1,t}\, \hat{p}_t(x_{t-1,t}) \log \frac{\hat{p}_t(x_{t-1,t})}{\psi(x_{t-1,t})}$ is a convex term, $-\int dx_t\, q_t(x_t) \log q_t(x_t)$ is concave, and the sum is not guaranteed to be convex. Similarly in $F_{\mathrm{Dual}}$, the minus log partition function of $\hat{p}_t$ (first term) is concave, the log partition function of $q_t$ is convex, and the sum is not guaranteed to be concave.

Another difficulty with expectation propagation is that the approximate probability distribution often needs to satisfy some inequality constraints. For example, when approximating a target probability distribution with a product of normal distributions in Gaussian expectation propagation, we require that all factor normal distributions have positive variance.
So far, the common heuristic is to set the variances to very large numbers once they fall below zero.

3 Methodology

As noted in Subsection 2.3, the difficulty in minimizing Bethe free energy is that both $F_{\mathrm{Primal}}$ and $F_{\mathrm{Dual}}$ have many local optima in the full space. Our formulation starts by transforming the concave term into its Legendre dual and taking the dual variables as additional variables. Thereafter we drop the dependence on $q_t(x_t)$ by using the moment matching constraints, formulate EP as a constrained minimization problem, and derive its dual optimization problem, which is concave under a dual feasible constraint. Our formulation also provides theoretical insight into avoiding negative variances in Gaussian expectation propagation.

We start by minimizing the Bethe free energy over the two-slice probabilities $\hat{p}_t$ and the one-slice probabilities $q_t$:

minimize over $\hat{p}_t(x_{t-1,t}), q_t(x_t)$:

$F_{\mathrm{Bethe}} = \sum_t \int dx_{t-1,t}\, \hat{p}_t(x_{t-1,t}) \log \frac{\hat{p}_t(x_{t-1,t})}{\psi(x_{t-1,t})} - \sum_t \int dx_t\, q_t(x_t) \log q_t(x_t)$   (4)

subject to: $\langle f(x_t) \rangle_{\hat{p}_t(x_{t-1,t})} = \langle f(x_t) \rangle_{q_t(x_t)} = \langle f(x_t) \rangle_{\hat{p}_{t+1}(x_{t,t+1})}$, $\int dx_t\, q_t(x_t) = 1 = \int dx_{t-1,t}\, \hat{p}_t(x_{t-1,t})$.

We introduce the Legendre dual $-\int dx_t\, q_t \log q_t = \min_{\gamma_t} \{ -\gamma_t^\top \langle f(x_t) \rangle_{q_t} + \log \int dx_t \exp(\gamma_t^\top f(x_t)) \}$ and replace $\langle f(x_t) \rangle_{q_t(x_t)}$ in the target with $\langle f(x_t) \rangle_{\hat{p}_t(x_{t-1,t})}$ by using the constraint $\langle f(x_t) \rangle_{\hat{p}_t(x_{t-1,t})} = \langle f(x_t) \rangle_{q_t(x_t)}$. Instead of searching for $\gamma_t$ over the over-complete full space, we add a regularization constraint to bound it:

minimize over $\hat{p}_t(x_{t-1,t}), \gamma_t$:

$F_{\mathrm{Primal}} = \sum_t \int dx_{t-1,t}\, \hat{p}_t \log \frac{\hat{p}_t(x_{t-1,t})}{\psi(x_{t-1,t})} - \sum_t \gamma_t^\top \langle f(x_t) \rangle_{\hat{p}_t} + \sum_t \log \int dx_t \exp(\gamma_t^\top f(x_t))$   (5)

subject to: $\langle f(x_t) \rangle_{\hat{p}_t(x_{t-1,t})} = \langle f(x_t) \rangle_{\hat{p}_{t+1}(x_{t,t+1})}$, $\int dx_{t-1,t}\, \hat{p}_t(x_{t-1,t}) = 1$, $\gamma_t^\top \gamma_t \le \eta_t$.

In the primal problem, $\gamma_t$ is the natural parameter of a probability in the exponential family: $q(x; \gamma_t) = \exp(\gamma_t^\top f(x_t)) / \int dx_t \exp(\gamma_t^\top f(x_t))$. The primal problem (5) is equivalent to the Bethe energy minimization problem.

We solve the primal problem with the Lagrange duality theorem [3]. First, we define the Lagrangian function $L$ by introducing the Lagrange multipliers $\alpha_t$, $\lambda_t$, and $\xi_t$ to incorporate the constraints. Second, we set the derivative over the primal variables to zero. Third, we plug the optimum point back into the Lagrangian. The Lagrange duality theorem implies that $F_{\mathrm{Dual}}(\alpha_t, \lambda_t, \xi_t) = \inf_{\hat{p}_t(x_{t-1,t}), \gamma_t} L(\hat{p}_t(x_{t-1,t}), \gamma_t, \alpha_t, \lambda_t, \xi_t)$. Thus the dual problem is as follows:

maximize over $\alpha_t, \lambda_t \ge 0$ for all $t$:

$F_{\mathrm{Dual}} = -\sum_t \log Z_{t-1,t} + \sum_t \log \int dx_t \exp(\gamma_t^\top f(x_t)) + \sum_t \frac{\lambda_t}{2} \left( \gamma_t^\top \gamma_t - \eta_t \right)$   (6)

where $-\langle f(x_t) \rangle_{\hat{p}_t} + \langle f(x_t) \rangle_{\gamma_t} + \lambda_t \gamma_t = 0$   (7)

$\hat{p}_t(x_{t-1,t}) = \frac{1}{Z_{t-1,t}} \exp(\alpha_{t-1}^\top f(x_{t-1}))\, \psi(x_{t-1,t}) \exp((\gamma_t - \alpha_t)^\top f(x_t))$   (8)

In the dual problem, we drop the dual variable $\xi_t$, since it takes the value that normalizes $\hat{p}_t(x_{t-1,t})$ as a valid primal probability. For any dual variables $\alpha_t, \lambda_t$, we map the primal variables $\hat{p}_t(x_{t-1,t})$ and $\gamma_t$ as implicit functions defined by the extreme point conditions Eqs. (7), (8). We have the following theoretical guarantees, with proofs in the supplementary material. We name $\mathrm{cov}_{\gamma_t}(f(x_t), f(x_t)) + \lambda_t I - \langle f(x_t) f(x_t)^\top \rangle_{\hat{p}_t(x_{t-1,t})} \succ 0$ the dual feasible constraint.

Proposition 1: The Lagrangian function has a positive definite Hessian matrix under the dual feasible constraint.

Proposition 1 ensures that the dual function is the infimum of the Lagrangian function, the pointwise infimum of a family of affine functions of $\alpha_t, \lambda_t, \xi_t$, and is thus concave. Instead of the full space of dual variables $\alpha_t, \lambda_t$, we only consider the domain constrained by the dual feasible constraint.

Proposition 2: Eqs.
(7) and (8) have a unique solution under the dual feasible constraint.

The Lagrange dual problem is a maximization problem over a bounded domain, which can be reduced to an unconstrained problem through a barrier method or by penalizing constraint violations, and solved with a gradient ascent algorithm or a fixed point algorithm. The partial derivatives of the dual function over the dual variables are the following:

$\frac{\partial F_{\mathrm{Dual}}}{\partial \alpha_t} = -\langle f(x_t) \rangle_{\hat{p}_{t+1}(x_{t,t+1})} + \langle f(x_t) \rangle_{\hat{p}_t(x_{t-1,t})}, \qquad \frac{\partial F_{\mathrm{Dual}}}{\partial \lambda_t} = \frac{1}{2} \left( \gamma_t^\top \gamma_t - \eta_t \right)$   (9)

where $\hat{p}_t(x_{t-1,t})$ and $\gamma_t$ are implicit functions defined by Eqs. (7), (8). We can obtain a fixed point iteration by setting the first derivatives to zero.¹ Here $\gamma(\cdot)$ converts mean parameters to natural parameters:

set $\frac{\partial F_{\mathrm{Dual}}}{\partial \alpha_t} = 0 \Rightarrow$ forward: $\alpha_t^{(\mathrm{new})} = \alpha_t^{(\mathrm{old})} + \gamma\left( \langle f(x_t) \rangle_{\hat{p}_t(x_{t-1,t})} \right) - \gamma_t^{(\mathrm{old})}$; backward: $\gamma_t^{(\mathrm{new})} = \gamma\left( \langle f(x_t) \rangle_{\hat{p}_{t+1}(x_{t,t+1})} \right)$.

In terms of Gaussian EP, the primal variables $\hat{p}_t(x_{t-1,t})$ and $\gamma_t$ correspond to multivariate Gaussian distributions, which pose implicit constraints on the primal and dual domains. Let $\Sigma_{\hat{p}_t}$ and $\Sigma_{\gamma_t}$ be the covariance matrices associated with $\hat{p}_t(x_{t-1,t})$ and $\gamma_t$; we require $\Sigma_{\hat{p}_t} \succ 0$ and $\Sigma_{\gamma_t} \succ 0$. The domain of the dual variables is defined by the following constraints:

$\Sigma_{\hat{p}_t} \succ 0, \quad \Sigma_{\gamma_t} \succ 0, \quad \lambda_t \ge 0, \quad \mathrm{cov}_{\gamma_t}(f(x_t), f(x_t)) + \lambda_t I - \langle f(x_t) f(x_t)^\top \rangle_{\hat{p}_t} \succ 0$

where $-\langle f(x_t) \rangle_{\hat{p}_t} + \langle f(x_t) \rangle_{\gamma_t} + \lambda_t \gamma_t = 0$ and $\hat{p}_t(x_{t-1,t}) = \frac{1}{Z_{t-1,t}} \exp(\alpha_{t-1}^\top f(x_{t-1}))\, P(x_t, y_t \mid x_{t-1}) \exp((\gamma_t - \alpha_t)^\top f(x_t))$.

In this case, it is nontrivial to find a starting point for $\alpha_t, \lambda_t$. We develop a phase I stage to find a strictly feasible starting point [3]. For convenience, we denote $\alpha_t, \lambda_t$ as $x$ and rewrite the above constraints as inequality constraints $g_i(x) \le 0$ and equality constraints $g_j(x) = 0$. Starting from a valid $x_0, s$ such that $g_i(x_0) \le s$ and $g_j(x_0) = 0$, we then solve the optimization problem

minimize $s$ subject to $g_j(x) = 0$, $g_i(x) \le s$

over the variables $s$ and $x$. A strictly feasible point $x$ is found once we arrive at $s < 0$.

With the duality framework and the SKM, we can solve the dual optimization problem to make inferences about complex system dynamics from imperfect observations. The latent states (the populations in the SKM) can be formulated as either categorical or Gaussian random variables. In the categorical case, the statistics are $f(x_t) = (\mathbb{I}(x_t^{(1)} = 1), \dots, \mathbb{I}(x_t^{(1)} = x_{\max}^{(1)}), \mathbb{I}(x_t^{(2)} = 1), \dots)$, where $x_{\max}^{(1)}, \dots, x_{\max}^{(M)}$ are the maximum populations and $\mathbb{I}$ is the indicator function. In the Gaussian case, the statistics are $f(x_t) = (x_t^{(1)}, x_t^{(1)\,2}, x_t^{(2)}, x_t^{(2)\,2}, \dots)$, and we force the natural parameters to satisfy the constraint that minus one half of the precision is negative.
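For intuition, the conversion $\gamma(\cdot)$ from Gaussian mean parameters $(\langle x \rangle, \langle x^2 \rangle)$ to natural parameters, together with one forward and one backward fixed-point step, can be sketched for a single univariate Gaussian state. This is a minimal sketch with hypothetical moment values; the actual algorithm runs these updates per species and per time slice:

```python
import numpy as np

def gamma_of(mean_params):
    # Convert Gaussian mean parameters (<x>, <x^2>) into natural
    # parameters (mu / sigma^2, -1 / (2 sigma^2)); this plays the role
    # of the gamma(.) map in the fixed point updates.
    m1, m2 = mean_params
    var = m2 - m1 ** 2
    assert var > 0.0, "a valid Gaussian needs positive variance"
    return np.array([m1 / var, -0.5 / var])

def forward_step(alpha_old, gamma_t_old, moments_phat_t):
    # Forward sweep: alpha_new = alpha_old + gamma(<f(x_t)>_phat_t) - gamma_old.
    return alpha_old + gamma_of(moments_phat_t) - gamma_t_old

def backward_step(moments_phat_next):
    # Backward sweep: gamma_new = gamma(<f(x_t)>_phat_{t+1}).
    return gamma_of(moments_phat_next)

# Hypothetical two-slice moments: <x> = 1.0, <x^2> = 2.0 (mean 1, variance 1).
alpha_new = forward_step(np.zeros(2), np.zeros(2), (1.0, 2.0))
gamma_new = backward_step((0.0, 4.0))
```

The positive-variance assertion is exactly where the Gaussian EP domain constraints of this section bite: when the matched moments imply a non-positive variance, the natural-parameter conversion is undefined.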
The potential $\psi(x_{t-1,t})$ in the distribution $\hat{p}_{t+1}(x_{t,t+1})$ (Eq. (8)) has the specific form $\sum_{v_t} P(x_t, v_t \mid x_{t-1}) P(y_t \mid x_t)$ of Eq. (3), which facilitates a mean-field approximation that evaluates $\langle f(x_t) \rangle_{\hat{p}_{t+1}^{(m)}(x_{t,t+1}^{(m)})} \approx \langle f(x_t) \rangle_{\hat{p}_{t+1}(x_{t,t+1})}$ and $\langle f(x_t) \rangle_{\hat{p}_t^{(m)}(x_{t-1,t}^{(m)})} \approx \langle f(x_t) \rangle_{\hat{p}_t(x_{t-1,t})}$ for each species $m$, where $\hat{p}_{t+1}^{(m)}(x_{t,t+1}^{(m)})$ and $\hat{p}_t^{(m)}(x_{t-1,t}^{(m)})$ are the marginal two-slice distributions for $m$, derived explicitly in the supplementary material. As such, we establish linear complexity in the number of species $m$ and tractable inference for general complex system dynamics.

To summarize, Algorithm 1 gives the mean-field forward-backward algorithm and the gradient ascent algorithm for making inferences with a stochastic kinetic model from noisy observations by minimizing Bethe free energy.

¹ Empirically, the fixed point iteration converges even without the dual feasible constraint ($\lambda_t = 0$); in general, $\lambda_t$ is bounded by the dual feasible constraint and the derivative over $\lambda_t$ is not zero.

Algorithm 1 Make inference of a stochastic kinetic model with expectation propagation.
Input: Discrete time SKM model (Eqs. (1), (2), (3)); observation probabilities $P(y_t \mid x_t)$; and initial values of $\alpha_t, \gamma_t, \lambda_t$ for all populations $m$ and times $t$.
Expectation propagation fixed point: Alternate between forward and backward iterations until convergence.
• For $t = 1, \dots, T$: $\alpha_t^{(\mathrm{new})} = \alpha_t^{(\mathrm{old})} + \gamma\left( \langle f(x_t) \rangle_{\hat{p}_t(x_{t-1,t})} \right) - \gamma_t^{(\mathrm{old})}$.
• For $t = T, \dots, 1$: $\gamma_t^{(\mathrm{new})} = \gamma\left( \langle f(x_t) \rangle_{\hat{p}_{t+1}(x_{t,t+1})} \right)$.
Gradient ascent: Execute the following updates in alternating forward and backward sweeps, where the gradients are defined in Eq. (9), under the dual feasible constraints.
• $\alpha_t^{(\mathrm{new})} \leftarrow \alpha_t + \epsilon \frac{\partial F_{\mathrm{Dual}}}{\partial \alpha_t}$, $\lambda_t^{(\mathrm{new})} \leftarrow \lambda_t + \epsilon \frac{\partial F_{\mathrm{Dual}}}{\partial \lambda_t}$.
Output: Optimal $\hat{p}_t(x_{t-1,t})$ and $\langle f(x_t) \rangle_{\hat{p}_t}$ as in Eqs. (7), (8) for all populations $m$ and times $t$.

4 Experiments on Transportation Dynamics

In this section, we evaluate and benchmark the performance of our proposed algorithms (Algorithm 1) against mainstream state-of-the-art approaches. The SKM gives us the flexibility to specify species, states, and events at different granularities, at either the macroscopic or the microscopic level. Consequently, different levels of inference can be made by feeding in the corresponding observations and model specifications. For example, to track epidemics in a social network, we can define each person as a species and their health state as a hidden state, with infection and recovery as events. Using real-world datasets about epidemic diffusion on a college campus, we efficiently inferred students' health states, validated against ground truth from surveys [23].
In this section, we demonstrate population-level inference in the context of transportation dynamics.²

Transportation Dynamics: A transportation system consists of residents and a network of locations. The macroscopic description is the number of vehicles indexed by location and time, while the microscopic description is the location of each vehicle at each time. Our goal is to infer the macroscopic populations from noisy sensor network observations made at several selected roads. Such inference problems in complex interaction networks are not trivial, for several reasons: the system can be very large and contain a large number of components (residents and locations), so many approaches fail due to resource costs; the interaction between components (i.e., the mobility of residents) is by nature uncertain and time-variant; and multiple variables (populations at different locations) are correlated.

To model transportation dynamics, we classify people at the same location as one species. Let $l \in L$ index the locations and $x_t^{(l)}$ be the number of vehicles at location $l$ at time $t$; these are the latent states we want to identify. The events $v$ that change system states can be generally expressed as the reaction $l_i \rightarrow l_j$, which represents one vehicle moving from location $l_i$ to location $l_j$. It decreases $x_t^{(l_i)}$ by 1, increases $x_t^{(l_j)}$ by 1, and keeps the other $x_t^{(l)}$ the same. The event rate reads $h_v(x_t, c_v) = c_v \prod_{l=1}^{L} g_v^{(l)}(x_t^{(l)}) = c_v x_t^{(l_i)}$, as there are $x_t^{(l_i)}$ different possible vehicles to transit at $l_i$.

² Source code and a general function interface for other domains at both levels are available online.
Experiment Setup: We select a certain proportion of the vehicles, e.g. 20%, as probe vehicles to build the observation model, assuming that the probe vehicles are uniformly sampled from the system. Let x_ttl be the total number of vehicles in the system, x_p the total number of probe vehicles, x_t^(l) the number of vehicles at location l, and y_t^(l) the number of probe vehicles observed at l. A rough point estimate of x_t^(l) is x_t^(l) = x_ttl · y_t^(l) / x_p. More strictly, the likelihood of observing y_t^(l) probe vehicles among the x_t^(l) vehicles at l is hypergeometric: p(y_t^(l) | x_t^(l)) = C(x_t^(l), y_t^(l)) · C(x_ttl − x_t^(l), x_p − y_t^(l)) / C(x_ttl, x_p). Our hidden state x_t^(l) can be represented as either a discrete variable or a univariate Gaussian.
Dataset Description: We implement and benchmark the algorithms on two representative datasets. In the SynthTown dataset, we synthesize a mini road network (Fig. 1(a)) in which virtual residents go to work in the morning and return home in the evening. We synthesize their itineraries with MATSIM, a common multi-agent transportation simulator [2]. The numbers of residents and locations are 2,000 and 25, respectively. In the Berlin dataset, we have a larger real-world road network with 1,539 locations derived from Open Street Map and 9,178 people's itineraries synthesized from MATSIM. Both datasets span a whole day, from midnight to midnight.
Evaluation Metrics: To evaluate the accuracy of a model, we compare the series of inferred populations against the series of ground truths, using three metrics: the "coefficient of determination" (R²), the mean percentage error (MPE), and the mean squared error (MSE). In statistics, R² measures the goodness of fit of a model and is calculated as 1 − Σ_i (y_i − f_i)² / Σ_i (y_i − ȳ)², where the y_i are the ground-truth values, ȳ their mean, and the f_i the inferred values. Typically R² ranges from 0 to 1: the closer it is to 1, the better the inference. The MPE computes the average percentage error by which the f_i differ from the y_i, calculated as (100%/n) Σ_i (y_i − f_i)/y_i; it can be either positive or negative, and the closer it is to 0, the better. The MSE, calculated as (1/n) Σ_i (y_i − f_i)², measures the average squared deviation between y and f; the lower the MSE, the better the inference. We also consider runtime as an important metric for studying the scalability of the different approaches.
Approaches for Benchmark: We implement three algorithms to instantiate the procedures in Algorithm 1: the fixed-point algorithm with a discrete latent state (DFP) or a Gaussian latent state (GFP), and the gradient-ascent algorithm with a discrete latent state (DG). Pseudocode is included in the supplementary material. We also implement several mainstream state-of-the-art approaches. Particle Filter (PF): We implement a sampling importance resampling (SIR) [10] algorithm that recursively approximates the posterior with a weighted set of particles, updates these particles, and resamples to cope with the degeneracy problem. Performance depends on the number of particles, and a sufficient number is needed to achieve good results; we selected the number empirically by increasing it until no obvious accuracy improvement could be detected, ending up with thousands to tens of thousands of particles. Extended Kalman Filter (EKF): We implement the standard EKF procedure with alternating prediction and update steps. Feedforward Neural Network (FNN): The FNN builds only a non-parametric mapping between input and output nodes, without "actually" learning the dynamics of the system. We implement a five-layer FNN: one input layer accepting the inference time point and the observations over a certain preceding period (e.g., one hour), three hidden layers, and one output layer from which we directly read the inferred populations. The FNN and the RNN below are both trained by feeding ground-truth populations for each road into the network structures.
We tune the meta-parameters and train the network with 30 days of synthesized mobility data from MATSIM until optimum performance is obtained. Recurrent Neural Network (RNN): The RNN is capable of recursively exploiting previously inferred hidden states to improve the current estimate. We implement a typical RNN in which each cell takes both the current observations and the inferred population from the previous cell as input, traverses one hidden layer, and then outputs the inferred populations. We train the RNN with 30 days of synthesized mobility data from MATSIM until optimum performance is obtained.
Inference Performance and Scalability: Figure 1 plots the inferred population at several representative locations in Fig. 1(a). The lines above the shaded areas are the ground truths, and we plot the error (i.e., inferred population minus ground truth) at different scales. For GFP, the inference within the μ ± 3σ confidence interval is shown as the colored "belt". Our proposed algorithms generally deviate less from the ground truth than the other approaches do.

Table 1: Performance and time scalability of all algorithms

                  SynthTown                              Berlin
        R²    MPE     MSE   Time            R²     MPE     MSE    Time
DFP    0.85   -3%     181   47 sec         0.66    3%       20    29 min
GFP    0.85   -8%     161   42 sec         0.62    2.5%     27    21 min
DG     0.87   -5%     104   157 sec        0.61    2.8%     26    56 min
PF     0.50   -21%    663   15 sec         0.50   -6%      678    71 min
EKF    0.51   -19%    679   2 sec          0.45   -40%    1046    14 hour
FNN    0.73   11%     526   1 h training   0.31   -14%     540    11 h training
RNN    0.72   -14%    407   8 h training   0.51   -9%      800    28 h training

(a) Road Network  (b) Inference results
Figure 1: Road network and inference results with the SynthTown dataset

Table 1 summarizes the performance under the different metrics (mean values).
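The accuracy metrics reported in Table 1 are straightforward to reproduce from their definitions above; a minimal implementation (the helper names are our own):

```python
def r_squared(y, f):
    """Coefficient of determination: 1 - sum (y_i - f_i)^2 / sum (y_i - ybar)^2."""
    ybar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def mpe(y, f):
    """Mean percentage error: (100 / n) * sum (y_i - f_i) / y_i (signed)."""
    return 100.0 / len(y) * sum((yi - fi) / yi for yi, fi in zip(y, f))

def mse(y, f):
    """Mean squared error: (1 / n) * sum (y_i - f_i)^2."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f)) / len(y)
```

Note that MPE keeps the sign of each error, so over- and under-estimation can cancel; this is why Table 1 reads "closer to 0 is better" for MPE rather than reporting a magnitude.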
There are both a training phase and a running phase in making inferences with neural networks, with the training phase taking longer. The neural-network training times shown in the table range from several hours to around one day, and are quadratic in the number of system components per batch per epoch. The neural-network running times in our experiments are comparable with the EP running times. Theoretically, neural-network running times are quadratic in the number of system components per prediction, while EP running times are linear in the number of system components per propagation of marginal probabilities from one time step to the next (the EP algorithms empirically converge within a few iterations); PF scales quadratically and EKF cubically with the number of locations.
Summary: Generally, our proposed algorithms achieve higher R², "narrower" MPE, and lower MSE, followed by the neural networks, PF, and EKF. The neural networks sometimes provide comparable performance. Our proposed algorithms, especially DFP and GFP, also exhibit milder runtime growth on the larger dataset. Overall, our algorithms generally outperform PF, EKF, FNN, and RNN in both accuracy and scalability to larger datasets.

5 Discussion

In this paper, we have introduced the stochastic kinetic model and developed expectation propagation algorithms to make inferences about the dynamics of complex interacting systems from noisy observations. To avoid getting stuck at a local optimum, we formulate the problem of minimizing the Bethe free energy as a maximization problem over a concave dual function in the feasible domain of dual variables guaranteed by the duality theorem. Our experiments show superior performance over competing models such as the particle filter, the extended Kalman filter, and deep neural networks.

References

[1] Adam Arkin, John Ross, and Harley H McAdams.
Stochastic kinetic analysis of developmental\npathway bifurcation in phage \u03bb-infected escherichia coli cells. Genetics, 149(4):1633\u20131648,\n1998.\n\n[2] Michael Balmer, Marcel Rieser, Konrad Meister, David Charypar, Nicolas Lefebvre, and Kai\nNagel. Matsim-t: Architecture and simulation times. In Multi-agent systems for traf\ufb01c and\ntransportation engineering, pages 57\u201378. IGI Global, 2009.\n\n[3] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press,\n\n2004.\n\n[4] Pierre Del Moral. Non-linear \ufb01ltering: interacting particle resolution. Markov processes and\n\nrelated \ufb01elds, 2(4):555\u2013581, 1996.\n\n[5] Wen Dong, Alex Pentland, and Katherine A Heller. Graph-coupled hmms for modeling the\n\nspread of infection. arXiv preprint arXiv:1210.4864, 2012.\n\n[6] Arnaud Doucet, Nando De Freitas, Kevin Murphy, and Stuart Russell. Rao-blackwellised\nparticle \ufb01ltering for dynamic bayesian networks. In Proceedings of the Sixteenth conference on\nUncertainty in arti\ufb01cial intelligence, pages 176\u2013183. Morgan Kaufmann Publishers Inc., 2000.\n\n[7] Karl Friston. Learning and inference in the brain. Neural Networks, 16(9):1325\u20131352, 2003.\n\n[8] Daniel T Gillespie. Stochastic simulation of chemical kinetics. Annu. Rev. Phys. Chem.,\n\n58:35\u201355, 2007.\n\n[9] Andrew Golightly and Colin S Gillespie. Simulation of stochastic kinetic models. In Silico\n\nSystems Biology, pages 169\u2013187, 2013.\n\n[10] Neil J Gordon, David J Salmond, and Adrian FM Smith. Novel approach to nonlinear/non-\ngaussian bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing),\nvolume 140, pages 107\u2013113. IET, 1993.\n\n[11] Winfried K Grassmann. Transient solutions in markovian queueing systems. Computers &\n\nOperations Research, 4(1):47\u201353, 1977.\n\n[12] Tong Guan, Wen Dong, Dimitrios Koutsonikolas, and Chunming Qiao. Fine-grained location\nextraction and prediction with little known data. 
In Wireless Communications and Networking Conference (WCNC), 2017 IEEE, pages 1–6. IEEE, 2017.

[13] Tom Heskes and Onno Zoeter. Expectation propagation for approximate inference in dynamic bayesian networks. In Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence, pages 216–223. Morgan Kaufmann Publishers Inc., 2002.

[14] Simon J Julier and Jeffrey K Uhlmann. Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3):401–422, 2004.

[15] Rudolph Emil Kalman et al. A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45, 1960.

[16] Thomas P Minka. The ep energy function and minimization schemes. See www.stat.cmu.edu/~minka/papers/learning.html, 2001.

[17] Lawrence R Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

[18] Vinayak Rao and Yee Whye Teh. Fast mcmc sampling for markov jump processes and continuous time bayesian networks. arXiv preprint arXiv:1202.3760, 2012.

[19] Claudia Tebaldi and Mike West. Bayesian inference on network traffic using link count data. Journal of the American Statistical Association, 93(442):557–573, 1998.

[20] Michail D Vrettas, Manfred Opper, and Dan Cornford. Variational mean-field algorithm for efficient inference in large systems of stochastic differential equations. Physical Review E, 91(1):012148, 2015.

[21] Max Welling and Yee Whye Teh. Belief optimization for binary networks: A stable alternative to loopy belief propagation. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 554–561. Morgan Kaufmann Publishers Inc., 2001.

[22] Darren J Wilkinson. Stochastic modelling for systems biology. CRC press, 2011.

[23] Zhen Xu, Wen Dong, and Sargur N Srihari.
Using social dynamics to make individual predictions: Variational inference with stochastic kinetic model. In Advances In Neural Information Processing Systems, pages 2775–2783, 2016.

[24] Fan Yang and Wen Dong. Integrating simulation and signal processing with stochastic social kinetic model. In International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, pages 193–203. Springer, Cham, 2017.

[25] Jonathan S Yedidia, William T Freeman, and Yair Weiss. Understanding belief propagation and its generalizations. Exploring artificial intelligence in the new millennium, 8:236–239, 2003.