{"title": "Wasserstein Distributionally Robust Kalman Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 8474, "page_last": 8483}

Wasserstein Distributionally Robust Kalman Filtering

Soroosh Shafieezadeh-Abadeh, Viet Anh Nguyen, Daniel Kuhn
École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
{soroosh.shafiee, viet-anh.nguyen, daniel.kuhn}@epfl.ch

Peyman Mohajerin Esfahani
Delft Center for Systems and Control, TU Delft, The Netherlands
P.MohajerinEsfahani@tudelft.nl

Abstract

We study a distributionally robust mean square error estimation problem over a nonconvex Wasserstein ambiguity set containing only normal distributions. We show that the optimal estimator and the least favorable distribution form a Nash equilibrium. Despite the non-convex nature of the ambiguity set, we prove that the estimation problem is equivalent to a tractable convex program. We further devise a Frank-Wolfe algorithm for this convex program whose direction-searching subproblem can be solved in a quasi-closed form.
Using these ingredients, we introduce a distributionally robust Kalman filter that hedges against model risk.

1 Introduction

The Kalman filter is the workhorse for the online tracking and estimation of a dynamical system's internal state based on indirect observations [1]. It has been applied with remarkable success in areas as diverse as automatic control, brain-computer interaction, macroeconomics, robotics, signal processing, weather forecasting and many more. The classical Kalman filter critically relies on the availability of an accurate state-space model and is therefore susceptible to model risk. This observation has led to several attempts to robustify the Kalman filter against modeling errors.

The H∞-filter targets situations in which the statistics of the noise process are uncertain and where one aims to minimize the worst case of the estimation error instead of its variance [3, 26]. This filter bounds the H∞-norm of the transfer function that maps the disturbances to the estimation errors. However, in transient operation, the desired H∞-performance is lost, and the filter may diverge unless some (typically restrictive) positivity condition holds in each iteration. In set-valued estimation the disturbance vectors are modeled through bounded sets such as ellipsoids [4, 22]. In this framework, one attempts to construct the smallest ellipsoids around the state estimates that are consistent with the observations and the exogenous disturbance ellipsoids. However, the resulting robust filters ignore any distributional information and thus tend to be over-conservative. A filter that is robust against more general forms of (set-based) model uncertainty was first studied in [19]. This filter iteratively minimizes the worst-case mean square error across all models in the vicinity of a nominal state-space model.
While performing well in the face of large uncertainties, this filter may be too conservative under small uncertainties. A generalized Kalman filter that addresses this shortcoming and strikes a balance between nominal and worst-case performance has been proposed in [25]. A risk-sensitive Kalman filter is obtained by minimizing the moment-generating function instead of the mean of the squared estimation error [24]. This risk-sensitive Kalman filter is equivalent to a distributionally robust filter proposed in [12], which minimizes the worst-case mean square error across all joint state-output distributions in a Kullback-Leibler (KL) ball around a nominal distribution. Extensions to more general τ-divergence balls are investigated in [27].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper we use ideas from distributionally robust optimization to design a Kalman-type filter that is immunized against model risk. Specifically, we assume that the joint distribution of the states and outputs is uncertain but known to reside in a given ambiguity set that contains all distributions in the proximity of the nominal distribution generated by a nominal state-space model. The ambiguity set thus reflects our level of (dis)trust in the nominal model. We then construct the most accurate filter under the least favorable distribution in this set. The hope is that hedging against the worst-case distribution has a regularizing effect and will lead to a filter that performs well under the unknown true distribution. Distributionally robust filters of this type have been studied in [7, 16] using uncertainty sets for the covariance matrix of the state vector and in [12, 27] using ambiguity sets defined via information divergences.
Inspired by recent progress in data-driven distributionally robust optimization [14], we construct here the ambiguity set as a ball around the nominal distribution with respect to the type-2 Wasserstein distance. The Wasserstein distance has seen widespread application in machine learning [2, 6, 18], and an intimate relation between regularization and Wasserstein distributional robustness has been discovered in [21, 20, 23, 15]. Also, the Wasserstein distance is known to be more statistically robust than other information divergences [5].

We summarize our main contributions as follows:
• We introduce a distributionally robust mean square estimation problem over a nonconvex Wasserstein ambiguity set containing normal distributions only, and we demonstrate that the optimal estimator and the least favorable distribution form a Nash equilibrium.
• Leveraging modern reformulation techniques from [15], we prove that this problem is equivalent to a tractable convex program, despite the nonconvex nature of the underlying ambiguity set, and that the optimal estimator is an affine function of the observations.
• We devise an efficient Frank-Wolfe-type first-order method inspired by [10] to solve the resulting convex program. We show that the direction-finding subproblem can be solved in quasi-closed form, and we derive the algorithm's convergence rate.
• We introduce a Wasserstein distributionally robust Kalman filter that hedges against model risk. The filter can be computed efficiently by solving a sequence of robust estimation problems via the proposed Frank-Wolfe algorithm. Its performance is validated on standard test instances.

All proofs are relegated to Appendix A, and additional numerical results are reported in Appendix B.

Notation: For any $A \in \mathbb{R}^{d\times d}$ we use $\mathrm{Tr}[A]$ to denote the trace and $\|A\|$ to denote the spectral norm of $A$. By slight abuse of notation, the Euclidean norm of $v \in \mathbb{R}^d$ is also denoted by $\|v\|$. Moreover, $I_d$ stands for the identity matrix in $\mathbb{R}^{d\times d}$. For any $A, B \in \mathbb{R}^{d\times d}$, we use $\langle A, B\rangle = \mathrm{Tr}[A^\top B]$ to denote the trace inner product. The space of all symmetric matrices in $\mathbb{R}^{d\times d}$ is denoted by $\mathbb{S}^d$. We use $\mathbb{S}^d_+$ ($\mathbb{S}^d_{++}$) to represent the cone of symmetric positive semidefinite (positive definite) matrices in $\mathbb{S}^d$. For any $A, B \in \mathbb{S}^d$, the relation $A \succeq B$ ($A \succ B$) means that $A - B \in \mathbb{S}^d_+$ ($A - B \in \mathbb{S}^d_{++}$). Finally, the set of all normal distributions on $\mathbb{R}^d$ is denoted by $\mathcal{N}_d$.

2 Robust Estimation with Wasserstein Ambiguity Sets

Consider the problem of estimating a signal $x \in \mathbb{R}^n$ from a potentially noisy observation $y \in \mathbb{R}^m$. In practice, the joint distribution of $x$ and $y$ is never directly observable and thus fundamentally uncertain. This distributional uncertainty should be taken into account in the estimation procedure. In this paper, we model distributional uncertainty through an ambiguity set $\mathcal{P}$, that is, a family of distributions on $\mathbb{R}^d$, $d = n + m$, that are sufficiently likely to govern $x$ and $y$ in view of the available data or that are sufficiently close to a prescribed nominal distribution. We then seek a robust estimator that minimizes the worst-case mean square error across all distributions in the ambiguity set. In the following, we propose to use the Wasserstein distance in order to construct ambiguity sets.

Definition 2.1 (Wasserstein distance).
The type-2 Wasserstein distance between two distributions $\mathbb{Q}_1$ and $\mathbb{Q}_2$ on $\mathbb{R}^d$ is defined as

$$W_2(\mathbb{Q}_1, \mathbb{Q}_2) \triangleq \inf_{\pi \in \Pi(\mathbb{Q}_1, \mathbb{Q}_2)} \left( \int_{\mathbb{R}^d \times \mathbb{R}^d} \|z_1 - z_2\|^2 \, \pi(\mathrm{d}z_1, \mathrm{d}z_2) \right)^{\frac{1}{2}}, \quad (1)$$

where $\Pi(\mathbb{Q}_1, \mathbb{Q}_2)$ is the set of all probability distributions on $\mathbb{R}^d \times \mathbb{R}^d$ with marginals $\mathbb{Q}_1$ and $\mathbb{Q}_2$.

Proposition 2.2 ([9, Proposition 7]). The type-2 Wasserstein distance between two normal distributions $\mathbb{Q}_1 = \mathcal{N}_d(\mu_1, \Sigma_1)$ and $\mathbb{Q}_2 = \mathcal{N}_d(\mu_2, \Sigma_2)$ with $\mu_1, \mu_2 \in \mathbb{R}^d$ and $\Sigma_1, \Sigma_2 \in \mathbb{S}^d_+$ equals

$$W_2(\mathbb{Q}_1, \mathbb{Q}_2) = \sqrt{\|\mu_1 - \mu_2\|^2 + \mathrm{Tr}\left[\Sigma_1 + \Sigma_2 - 2\left(\Sigma_2^{\frac{1}{2}} \Sigma_1 \Sigma_2^{\frac{1}{2}}\right)^{\frac{1}{2}}\right]}.$$

Consider now a $d$-dimensional random vector $z = [x^\top, y^\top]^\top$ comprising the signal $x \in \mathbb{R}^n$ and the observation $y \in \mathbb{R}^m$, where $d = n + m$. For a given ambiguity set $\mathcal{P}$, the distributionally robust minimum mean square error estimator of $x$ given $y$ is a solution of the outer minimization problem in

$$\inf_{\psi \in \mathcal{L}} \sup_{\mathbb{Q} \in \mathcal{P}} \mathbb{E}_{\mathbb{Q}}\left[\|x - \psi(y)\|^2\right], \quad (2)$$

where $\mathcal{L}$ denotes the family of all measurable functions from $\mathbb{R}^m$ to $\mathbb{R}^n$. Problem (2) can be viewed as a zero-sum game between a statistician choosing the estimator $\psi$ and a fictitious adversary (or nature) choosing the distribution $\mathbb{Q}$. By construction, the minimax estimator performs best under the worst possible distribution $\mathbb{Q} \in \mathcal{P}$. From now on we assume that $\mathcal{P}$ is the Wasserstein ambiguity set

$$\mathcal{P} = \left\{\mathbb{Q} \in \mathcal{N}_d : W_2(\mathbb{Q}, \mathbb{P}) \le \rho\right\}, \quad (3)$$

which can be interpreted as a ball of radius $\rho \ge 0$ in the space of normal distributions. We will further assume that $\mathcal{P}$ is centered at a normal distribution $\mathbb{P} = \mathcal{N}_d(\mu, \Sigma)$ with covariance matrix $\Sigma \succ 0$. Even though the Wasserstein ambiguity set $\mathcal{P}$ is nonconvex (as mixtures of normal distributions are generically not normal), we can prove a minimax theorem, which ensures that one may interchange the infimum and the supremum in (2) without affecting the problem's optimal value.

Theorem 2.3 (Minimax theorem). If $\mathcal{P}$ is a Wasserstein ambiguity set of the form (3), then

$$\inf_{\psi \in \mathcal{L}} \sup_{\mathbb{Q} \in \mathcal{P}} \mathbb{E}_{\mathbb{Q}}\left[\|x - \psi(y)\|^2\right] = \sup_{\mathbb{Q} \in \mathcal{P}} \inf_{\psi \in \mathcal{L}} \mathbb{E}_{\mathbb{Q}}\left[\|x - \psi(y)\|^2\right]. \quad (4)$$

Remark 2.4 (Connection to Bayesian estimation). The optimal solutions $\psi^\star$ and $\mathbb{Q}^\star$ of the two dual problems in (4) represent the minimax strategies of the statistician and nature, respectively. Theorem 2.3 implies that $(\psi^\star, \mathbb{Q}^\star)$ forms a saddle point (and thus a Nash equilibrium) of the underlying zero-sum game. Hence, the robust estimator $\psi^\star$ is also the optimal Bayesian estimator for the prior $\mathbb{Q}^\star$. For this reason, $\mathbb{Q}^\star$ is often referred to as the least favorable prior [11].

We now demonstrate that the minimax problem (2) is equivalent to a tractable convex program, whose solution allows us to recover both the optimal estimator $\psi^\star$ as well as the least favorable prior $\mathbb{Q}^\star$.

Theorem 2.5 (Tractable reformulation). The minimax problem (2) with the Wasserstein ambiguity set (3) centered at $\mathbb{P} = \mathcal{N}_d(\mu, \Sigma)$, $\underline{\sigma} \triangleq \lambda_{\min}(\Sigma) > 0$, is equivalent to the finite convex program

$$\begin{array}{cl} \sup & \mathrm{Tr}\left[S_{xx} - S_{xy} S_{yy}^{-1} S_{yx}\right] \\ \mathrm{s.t.} & S = \begin{bmatrix} S_{xx} & S_{xy} \\ S_{yx} & S_{yy} \end{bmatrix} \in \mathbb{S}^d_+, \;\; S_{xx} \in \mathbb{S}^n_+, \;\; S_{yy} \in \mathbb{S}^m_+, \;\; S_{xy} = S_{yx}^\top \in \mathbb{R}^{n\times m} \\ & \mathrm{Tr}\left[S + \Sigma - 2\left(\Sigma^{\frac{1}{2}} S \Sigma^{\frac{1}{2}}\right)^{\frac{1}{2}}\right] \le \rho^2, \;\; S \succeq \underline{\sigma} I_d. \end{array} \quad (5)$$

If $S^\star$ with blocks $S^\star_{xx}$, $S^\star_{yy}$ and $S^\star_{xy}$ is optimal in (5) and $\mu = [\mu_x^\top, \mu_y^\top]^\top$ for some $\mu_x \in \mathbb{R}^n$ and $\mu_y \in \mathbb{R}^m$, then the affine function $\psi^\star(y) = S^\star_{xy}(S^\star_{yy})^{-1}(y - \mu_y) + \mu_x$ is the distributionally robust minimum mean square error estimator, and the normal distribution $\mathbb{Q}^\star = \mathcal{N}_d(\mu, S^\star)$ is the least favorable prior.

Theorem 2.5 provides a tractable procedure for constructing a Nash equilibrium $(\psi^\star, S^\star)$ for the statistician's game against nature. Note that if $\rho = 0$, then $S^\star = \Sigma$ is the unique solution to (5). In this case the estimator $\psi^\star$ reduces to the Bayesian estimator corresponding to the nominal distribution $\mathbb{P} = \mathcal{N}_d(\mu, \Sigma)$. We emphasize that the choice of the Wasserstein radius $\rho$ may have a significant impact on the resulting estimator. In fact, this is a key distinguishing feature of the Wasserstein ambiguity set (3) with respect to other popular divergence-based ambiguity sets.

Remark 2.6 (Divergence-based ambiguity sets). As a natural alternative, one could replace the Wasserstein distance in (3) with an information divergence. For example, ambiguity sets defined via τ-divergences, which encapsulate the popular KL divergence as a special case, have been studied in [12, 27]. As shown in [12, Theorem 1] and [27, Theorem 2.1], the optimal estimator corresponding to any τ-divergence ambiguity set always coincides with the Bayesian estimator for the nominal distribution $\mathbb{P} = \mathcal{N}_d(\mu, \Sigma)$ irrespective of $\rho$.
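The closed form in Proposition 2.2 is straightforward to evaluate numerically. A minimal sketch (the function names and interface are ours, not taken from the paper's published code):

```python
import numpy as np

def psd_sqrt(a):
    """Symmetric square root of a PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def gaussian_w2(mu1, sigma1, mu2, sigma2):
    """Type-2 Wasserstein distance between N(mu1, Sigma1) and N(mu2, Sigma2),
    following the closed form of Proposition 2.2."""
    root2 = psd_sqrt(sigma2)
    cross = psd_sqrt(root2 @ sigma1 @ root2)
    gelbrich = np.trace(sigma1 + sigma2 - 2.0 * cross)
    # clip tiny negative values caused by floating-point round-off
    return np.sqrt(np.linalg.norm(mu1 - mu2) ** 2 + max(gelbrich, 0.0))
```

For equal covariance matrices the trace term vanishes and the distance reduces to the Euclidean distance between the means, which is a useful sanity check.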
Thus, in stark contrast to the setting considered here, the size of a τ-divergence ambiguity set has no impact on the corresponding optimal estimator. Moreover, the least favorable prior $\mathbb{Q} = \mathcal{N}_d(\mu, S^\star)$ for a τ-divergence ambiguity set always satisfies

$$S^\star = \begin{bmatrix} S^\star_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix}. \quad (6)$$

Thus, in order to harm the statistician, nature only perturbs the second moments of the signal but sets all second moments of the observation as well as all cross moments to their nominal values.

Example 2.7 (Impact of ρ on the Nash equilibrium). We illustrate the dependence of the saddle point $(\psi^\star, \mathbb{Q}^\star)$ on the size $\rho$ of the ambiguity set in a 2-dimensional example. Suppose that the nominal distribution $\mathbb{P}$ of $[x, y] \in \mathbb{R}^2$ satisfies $\mu_x = \mu_y = 0$, $\Sigma_{xx} = \Sigma_{xy} = 1$ and $\Sigma_{yy} = 1.1$, implying that the noise $w \triangleq y - x$ and the signal $x$ are independent ($\mathbb{E}_{\mathbb{P}}[xw] = \Sigma_{xy} - \Sigma_{xx} = 0$). Figure 1 visualizes the canonical 90% confidence ellipsoids of the least favorable priors as well as the graphs of the optimal estimators for different sizes of the Wasserstein and KL ambiguity sets. As $\rho$ increases, the least favorable prior for the Wasserstein ambiguity set displays the following interesting properties: (i) the signal variance $S^\star_{xx}$ increases, (ii) the measurement variance $S^\star_{yy}$ decreases, (iii) the signal-measurement covariance $S^\star_{xy}$ decreases towards 0, and (iv) the noise variance $\mathbb{E}_{\mathbb{Q}^\star}[w^2] = S^\star_{yy} - 2 S^\star_{xy} + S^\star_{xx}$ increases. Hence, (v) the signal-noise covariance $\mathbb{E}_{\mathbb{Q}^\star}[xw] = S^\star_{xy} - S^\star_{xx}$ decreases and is negative for all $\rho > 0$, and (vi) the optimal estimator $\psi^\star$ tends to the zero function. Note that the optimal estimator and the measurement variance remain constant in $\rho$ when working with a KL ambiguity set.

Figure 1: Least favorable priors (solid ellipsoids) and optimal estimators (dashed lines) for Wasserstein (left) and KL (right) ambiguity sets with different radii ρ. The Wasserstein estimators vary with ρ, while the KL estimators remain unaffected by ρ.

Remark 2.8 (Ambiguity sets with non-normal distributions). Theorem 2.3 can be generalized to Wasserstein ambiguity sets of the form $\mathcal{Q} = \{\mathbb{Q} \in \mathcal{M}(\mathbb{R}^d) : W_2(\mathbb{Q}, \mathbb{P}) \le \rho\}$, where $\mathcal{M}(\mathbb{R}^d)$ denotes the set of all (possibly non-normal) probability distributions on $\mathbb{R}^d$ with finite second moments, and $\mathbb{P} = \mathcal{N}_d(\mu, \Sigma)$. In this case, the minimax result (4) remains valid provided that the set $\mathcal{L}$ of all measurable estimators is restricted to the set $\mathcal{A}$ of all affine estimators. Theorem 2.5 also remains valid under this alternative setting.

3 Efficient Frank-Wolfe Algorithm

The finite convex optimization problem (5) is numerically challenging as it constitutes a nonlinear semi-definite program (SDP). In principle, it would be possible to eliminate all nonlinearities by using Schur complements and to reformulate (5) as a linear SDP, which is formally tractable. However, it is folklore knowledge that general-purpose SDP solvers are yet to be developed that can reliably solve large-scale problem instances. We thus propose a tailored first-order method to solve the nonlinear SDP (5) directly, which exploits a covert structural property of the problem's objective function

$$f(S) \triangleq \mathrm{Tr}\left[S_{xx} - S_{xy} S_{yy}^{-1} S_{yx}\right].$$

Definition 3.1 (Unit total elasticity¹).
We say that a function $\varphi : \mathbb{S}^d_+ \to \mathbb{R}_+$ has unit total elasticity if $\varphi(S) = \langle S, \nabla \varphi(S) \rangle$ for all $S \in \mathbb{S}^d_+$.

It is clear that every linear function has unit total elasticity. Maybe surprisingly, however, the objective function $f(S)$ of problem (5) also enjoys unit total elasticity because

$$\langle S, \nabla f(S) \rangle = \left\langle \begin{bmatrix} S_{xx} & S_{xy} \\ S_{yx} & S_{yy} \end{bmatrix}, \begin{bmatrix} I_n & -S_{xy} S_{yy}^{-1} \\ -S_{yy}^{-1} S_{yx} & S_{yy}^{-1} S_{yx} S_{xy} S_{yy}^{-1} \end{bmatrix} \right\rangle = f(S).$$

Moreover, as will be explained below, it turns out that problem (5) can be solved highly efficiently if its objective function is replaced with a linear approximation. These observations motivate us to solve (5) with a Frank-Wolfe algorithm [8], which starts at $S^{(0)} = \Sigma$ and constructs iterates

$$S^{(k+1)} = \alpha_k F(S^{(k)}) + (1 - \alpha_k) S^{(k)} \quad \forall k \in \mathbb{N} \cup \{0\}, \quad (7a)$$

where $\alpha_k$ represents a judiciously chosen step-size, while the oracle mapping $F : \mathbb{S}^d_+ \to \mathbb{S}^d_+$ returns the unique solution of the direction-finding subproblem

$$F(S) \triangleq \begin{cases} \arg\max_L & \langle L, \nabla f(S) \rangle \\ \quad\;\mathrm{s.t.} & \mathrm{Tr}\left[L + \Sigma - 2\left(\Sigma^{\frac{1}{2}} L \Sigma^{\frac{1}{2}}\right)^{\frac{1}{2}}\right] \le \rho^2, \;\; L \succeq \underline{\sigma} I_d. \end{cases} \quad (7b)$$

In each iteration, the Frank-Wolfe algorithm thus maximizes a linearized objective function over the original feasible set. In contrast to other commonly used first-order methods, the Frank-Wolfe algorithm thus obviates the need for a potentially expensive projection step to recover feasibility. It is easy to convince oneself that any solution of the nonlinear SDP (5) is indeed a fixed point of the operator $F$. To make the Frank-Wolfe algorithm (7) work in practice, however, one needs

(i) an efficient routine for solving the direction-finding subproblem (7b);
(ii) a step-size rule that offers rigorous guarantees on the algorithm's convergence rate.

In the following, we propose an efficient bisection algorithm to address (i). As for (ii), we show that the convergence analysis portrayed in [10] applies to the problem at hand. The procedure for solving (7b) is outlined in Algorithm 1, which involves an auxiliary function $h : \mathbb{R}_+ \to \mathbb{R}$ defined via

$$h(\gamma) \triangleq \rho^2 - \left\langle \Sigma, \left(I_d - \gamma (\gamma I_d - \nabla f(S))^{-1}\right)^2 \right\rangle. \quad (8)$$

Theorem 3.2 (Direction-finding subproblem). For any fixed inputs $\rho, \varepsilon \in \mathbb{R}_{++}$, $\Sigma \in \mathbb{S}^d_{++}$ and $S \in \mathbb{S}^d_+$, Algorithm 1 outputs a feasible and ε-suboptimal solution to (7b).

We emphasize that the most expensive operation in Algorithm 1 is the matrix inversion $(\gamma I_d - D)^{-1}$, which needs to be evaluated repeatedly for different values of $\gamma$. These computations can be accelerated by diagonalizing $D$ only once at the beginning. The repeat loop in Algorithm 1 carries out the actual bisection algorithm, and a suitable initial bisection interval is determined by a pair of a priori bounds $LB$ and $UB$, which are available in closed form (see Appendix A).

The overall structure of the proposed Frank-Wolfe method is summarized in Algorithm 2. We borrow the step-size rule suggested in [10] to establish rigorous convergence guarantees. This is accomplished by showing that the objective function $f$ has a bounded curvature constant. Our convergence result is formalized in the next theorem.

Theorem 3.3 (Convergence analysis). If $\Sigma \succ 0$, $\rho > 0$, $\delta > 0$ and $\alpha_k = 2/(2 + k)$ for any $k \in \mathbb{N}$, then the $k$-th iterate $S^{(k)}$ computed by Algorithm 2 is feasible in (5) and satisfies

$$f(S^\star) - f(S^{(k)}) \le \frac{4 \bar{\sigma}^4}{\underline{\sigma}^3 (k + 2)} (1 + \delta),$$

where $S^\star$ is an optimal solution of (5), $\underline{\sigma}$ is the smallest eigenvalue of $\Sigma$, and $\bar{\sigma} \triangleq (\rho + \sqrt{\mathrm{Tr}[\Sigma]})^2$.

¹Our terminology is inspired by the definition of the elasticity of a univariate function $\varphi(s)$ as $\frac{\mathrm{d}\varphi(s)}{\mathrm{d}s} \frac{s}{\varphi(s)}$.

Algorithm 1 Bisection algorithm to solve (7b)
Input: Covariance matrix $\Sigma \succ 0$, gradient matrix $D \triangleq \nabla f(S) \succeq 0$, Wasserstein radius $\rho > 0$, tolerance $\varepsilon > 0$
  Denote the largest eigenvalue of $D$ by $\lambda_1$, and let $v_1$ be an eigenvector of $\lambda_1$
  Set $LB \leftarrow \lambda_1 (1 + \sqrt{v_1^\top \Sigma v_1}/\rho)$
  Set $UB \leftarrow \lambda_1 (1 + \sqrt{\mathrm{Tr}[\Sigma]}/\rho)$
  repeat
    Set $\gamma \leftarrow (UB + LB)/2$
    Set $L \leftarrow \gamma^2 (\gamma I_d - D)^{-1} \Sigma (\gamma I_d - D)^{-1}$
    Set $\Delta \leftarrow \gamma(\rho^2 - \mathrm{Tr}[\Sigma]) + \gamma^2 \langle (\gamma I_d - D)^{-1}, \Sigma \rangle - \langle L, D \rangle$
    if $h(\gamma) < 0$ then set $LB \leftarrow \gamma$ else set $UB \leftarrow \gamma$ end if
  until $h(\gamma) > 0$ and $\Delta < \varepsilon$
Output: $L$

Algorithm 2 Frank-Wolfe algorithm to solve (5)
Input: Covariance matrix $\Sigma \succ 0$, Wasserstein radius $\rho > 0$, tolerance $\delta > 0$
  Set $\underline{\sigma} \leftarrow \lambda_{\min}(\Sigma)$, $\bar{\sigma} \leftarrow (\rho + \sqrt{\mathrm{Tr}[\Sigma]})^2$
  Set $C \leftarrow 2 \bar{\sigma}^4 / \underline{\sigma}^3$
  Set $S^{(0)} \leftarrow \Sigma$, $k \leftarrow 0$
  while stopping criterion is not met do
    Set $\alpha_k \leftarrow 2/(k + 2)$
    Set $G \leftarrow S^{(k)}_{xy} (S^{(k)}_{yy})^{-1}$
    Compute the gradient $D \leftarrow \nabla f(S^{(k)})$ via $D \leftarrow [I_n, -G]^\top [I_n, -G]$
    Set $\varepsilon \leftarrow \alpha_k \delta C$
    Solve the subproblem (7b) by Algorithm 1: $L \leftarrow \mathrm{Bisection}(\Sigma, D, \rho, \varepsilon)$
    Set $S^{(k+1)} \leftarrow S^{(k)} + \alpha_k (L - S^{(k)})$
    Set $k \leftarrow k + 1$
  end while
Output: $S^{(k)}$

4 The Wasserstein Distributionally Robust Kalman Filter

Consider a discrete-time dynamical system whose (unobservable) state $x_t \in \mathbb{R}^n$ and (observable) output $y_t \in \mathbb{R}^m$ evolve randomly over time. At any time $t \in \mathbb{N}$, we aim to estimate the current state $x_t$ based on the output history $Y_t \triangleq (y_1, \ldots, y_t)$. We assume that the joint state-output process $z_t = [x_t^\top, y_t^\top]^\top$, $t \in \mathbb{N}$, is governed by an unknown Gaussian distribution $\mathbb{Q}$ in the neighborhood of a known nominal distribution $\mathbb{P}^\star$. The distribution $\mathbb{P}^\star$ is determined through the linear state-space model

$$\begin{aligned} x_t &= A_t x_{t-1} + B_t v_t \\ y_t &= C_t x_t + D_t v_t \end{aligned} \qquad \forall t \in \mathbb{N}, \quad (9)$$

where $A_t$, $B_t$, $C_t$, and $D_t$ are given matrices of appropriate dimensions, while $v_t \in \mathbb{R}^d$, $t \in \mathbb{N}$, denotes a Gaussian white noise process independent of $x_0 \sim \mathcal{N}_n(\hat{x}_0, V_0)$. Thus, $v_t \sim \mathcal{N}_d(0, I_d)$ for all $t$, while $v_t$ and $v_{t'}$ are independent for all $t \ne t'$. Note that we may restrict the dimension of $v_t$ to the dimension $d = n + m$ of $z_t$ without loss of generality.
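To make the role of the shared noise vector $v_t$ concrete, here is a minimal simulation sketch of the nominal model (9) with time-invariant matrices (the function name and interface are ours, not the paper's):

```python
import numpy as np

def simulate_nominal(A, B, C, D, x0, T, rng):
    """Sample a trajectory from the nominal state-space model (9).
    The same white-noise vector v_t drives both the state and the output,
    so the state and observation noises are correlated in general."""
    xs, ys = [], []
    x = x0
    for _ in range(T):
        v = rng.standard_normal(B.shape[1])  # v_t ~ N(0, I)
        x = A @ x + B @ v                    # state transition
        ys.append(C @ x + D @ v)             # observation of the new state
        xs.append(x)
    return np.array(xs), np.array(ys)
```

With `B` and `D` set to zero the recursion is deterministic and returns $x_t = A^t x_0$, which is a convenient correctness check.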
Otherwise, all linearly dependent columns of $[B_t^\top, D_t^\top]^\top$ and the corresponding components of $v_t$ can be eliminated systematically.

By the law of total probability and the Markovian nature of the state-space model (9), the nominal distribution $\mathbb{P}^\star$ is uniquely determined by the marginal distribution $\mathbb{P}^\star_{x_0} = \mathcal{N}_n(\hat{x}_0, V_0)$ of the initial state $x_0$ and the conditional distributions

$$\mathbb{P}^\star_{z_t | x_{t-1}} = \mathcal{N}_d\left( \begin{bmatrix} A_t \\ C_t A_t \end{bmatrix} x_{t-1}, \; \begin{bmatrix} B_t \\ C_t B_t + D_t \end{bmatrix} \begin{bmatrix} B_t \\ C_t B_t + D_t \end{bmatrix}^\top \right)$$

of $z_t$ given $x_{t-1}$ for all $t \in \mathbb{N}$.

Unlike $\mathbb{P}^\star$, the true distribution $\mathbb{Q}$ governing $z_t$, $t \in \mathbb{N}$, is unknown, and thus the estimation problem at hand is not well-defined. We will therefore estimate the conditional mean $\hat{x}_t$ and covariance matrix $V_t$ of $x_t$ given $Y_t$ under some worst-case distribution $\mathbb{Q}^\star$ to be constructed recursively. First, we assume that the marginal distribution $\mathbb{Q}^\star_{x_0}$ of $x_0$ under $\mathbb{Q}^\star$ equals $\mathbb{P}^\star_{x_0}$, that is, $\mathbb{Q}^\star_{x_0} = \mathcal{N}_n(\hat{x}_0, V_0)$. Next, fix any $t \in \mathbb{N}$ and assume that the conditional distribution $\mathbb{Q}^\star_{x_{t-1} | Y_{t-1}}$ of $x_{t-1}$ given $Y_{t-1}$ under $\mathbb{Q}^\star$ has already been computed as $\mathbb{Q}^\star_{x_{t-1} | Y_{t-1}} = \mathcal{N}_n(\hat{x}_{t-1}, V_{t-1})$. The construction of $\mathbb{Q}^\star_{x_t | Y_t}$ is then split into a prediction step and an update step. The prediction step combines the previous state estimate $\mathbb{Q}^\star_{x_{t-1} | Y_{t-1}}$ with the nominal transition kernel $\mathbb{P}^\star_{z_t | x_{t-1}}$ to generate a pseudo-nominal distribution $\mathbb{P}_{z_t | Y_{t-1}}$ of $z_t$ conditioned on $Y_{t-1}$, which is defined through

$$\mathbb{P}_{z_t | Y_{t-1}}(B | Y_{t-1}) = \int_{\mathbb{R}^n} \mathbb{P}^\star_{z_t | x_{t-1}}(B | x_{t-1}) \, \mathbb{Q}^\star_{x_{t-1} | Y_{t-1}}(\mathrm{d}x_{t-1} | Y_{t-1})$$

for every Borel set $B \subseteq \mathbb{R}^d$ and observation history $Y_{t-1} \in \mathbb{R}^{m \times (t-1)}$. The well-known formula for the convolution of two multivariate Gaussians reveals that $\mathbb{P}_{z_t | Y_{t-1}} = \mathcal{N}_d(\mu_t, \Sigma_t)$, where

$$\mu_t = \begin{bmatrix} A_t \\ C_t A_t \end{bmatrix} \hat{x}_{t-1} \quad \text{and} \quad \Sigma_t = \begin{bmatrix} A_t \\ C_t A_t \end{bmatrix} V_{t-1} \begin{bmatrix} A_t \\ C_t A_t \end{bmatrix}^\top + \begin{bmatrix} B_t \\ C_t B_t + D_t \end{bmatrix} \begin{bmatrix} B_t \\ C_t B_t + D_t \end{bmatrix}^\top. \quad (10)$$

Note that the construction of $\mathbb{P}_{z_t | Y_{t-1}}$ resembles the prediction step of the classical Kalman filter but uses the least favorable distribution $\mathbb{Q}^\star_{x_{t-1} | Y_{t-1}}$ instead of the nominal distribution $\mathbb{P}^\star_{x_{t-1} | Y_{t-1}}$.

In the update step, the pseudo-nominal a priori estimate $\mathbb{P}_{z_t | Y_{t-1}}$ is updated by the measurement $y_t$ and robustified against model uncertainty to yield a refined a posteriori estimate $\mathbb{Q}^\star_{x_t | Y_t}$. This a posteriori estimate is found by solving the minimax problem

$$\inf_{\psi_t \in \mathcal{L}} \sup_{\mathbb{Q} \in \mathcal{P}_{z_t | Y_{t-1}}} \mathbb{E}_{\mathbb{Q}}\left[\|x_t - \psi_t(y_t)\|^2\right] \quad (11)$$

equipped with the Wasserstein ambiguity set

$$\mathcal{P}_{z_t | Y_{t-1}} = \left\{ \mathbb{Q} \in \mathcal{N}_d : W_2(\mathbb{Q}, \mathbb{P}_{z_t | Y_{t-1}}) \le \rho_t \right\}.$$

Note that the Wasserstein radius $\rho_t$ quantifies our distrust in the pseudo-nominal a priori estimate and can therefore be interpreted as a measure of model uncertainty. Practically, we reformulate (11) as an equivalent finite convex program of the form (5), which is amenable to efficient computational solution via the Frank-Wolfe algorithm detailed in Section 3. By Theorem 2.5, the optimal solution $S^\star_t$ of problem (5) yields the least favorable conditional distribution $\mathbb{Q}^\star_{z_t | Y_{t-1}} = \mathcal{N}_d(\mu_t, S^\star_t)$ of $z_t$ given $Y_{t-1}$. By using the well-known formulas for conditional normal distributions (see, e.g., [17, page 522]), we then obtain the least favorable conditional distribution $\mathbb{Q}^\star_{x_t | Y_t} = \mathcal{N}_n(\hat{x}_t, V_t)$ of $x_t$ given $Y_t$, where

$$\hat{x}_t = S^\star_{t,xy} (S^\star_{t,yy})^{-1} (y_t - \mu_{t,y}) + \mu_{t,x} \quad \text{and} \quad V_t = S^\star_{t,xx} - S^\star_{t,xy} (S^\star_{t,yy})^{-1} S^\star_{t,yx}.$$

Algorithm 3 Robust Kalman filter at time t
Input: Covariance matrix $V_{t-1} \succeq 0$, state estimate $\hat{x}_{t-1}$, Wasserstein radius $\rho_t > 0$, tolerance $\delta > 0$
  Prediction: Form the pseudo-nominal distribution $\mathbb{P}_{z_t | Y_{t-1}} = \mathcal{N}_d(\mu_t, \Sigma_t)$ using (10)
  Observation: Observe the output $y_t$
  Update: Use Algorithm 2 to solve (11): $S^\star_t \leftarrow \text{Frank-Wolfe}(\Sigma_t, \mu_t, \rho_t, \delta)$
    Set $\hat{x}_t = S^\star_{t,xy} (S^\star_{t,yy})^{-1} (y_t - \mu_{t,y}) + \mu_{t,x}$
Output: $\hat{x}_t$ and $V_t = S^\star_{t,xx} - S^\star_{t,xy} (S^\star_{t,yy})^{-1} S^\star_{t,yx}$

Figure 2: Wasserstein ball in the space $\mathbb{S}^2_+$ of covariance matrices centered at $I_2$ with radius 1.

The distributionally robust Kalman filtering approach is summarized in Algorithm 3. Note that the robust update step outlined above reduces to the usual update step of the classical Kalman filter for $\rho \downarrow 0$.

5 Numerical Results

We showcase the performance of the proposed Frank-Wolfe algorithm and the distributionally robust Kalman filter in a suite of synthetic experiments. All optimization problems are implemented in
All optimization problems are implemented in\nMATLAB and run on an Intel XEON CPU with 3.40GHz clock speed and 16GB of RAM, and the\ncorresponding codes are made publicly available at https://github.com/sorooshafiee/WKF.\n\n7\n\n\f(a) d = 10\n\n(b) d = 50\n\n(c) d = 100\n\nFigure 3: Distribution of the difference between the errors of the robust MMSE (Bayesian MMSE)\nand the ideal MMSE(cid:63) estimator.\n5.1 Distributionally Robust Minimum Mean Square Error Estimation\n\nWe \ufb01rst assess the distributionally robust minimum mean square error (robust MMSE) estimator,\nwhich is obtained by solving (2), against the classical Bayesian MMSE estimator, which can be\nviewed as the solution of problem (2) over a singleton ambiguity set that contains only the nominal\ndistribution. Recall from Remark 2.6 that the optimal estimator corresponding to a KL or \u03c4-divergence\nambiguity set of the type studied in [12, 27] coincides with the Bayesian MMSE estimator irrespective\nof \u03c1. Thus, we may restrict attention to Wasserstein ambiguity sets. In order to develop a geometric\nintuition, Figure 2 visualizes the set of all bivariate normal distributions with zero mean that have a\nWasserstein distance of at most 1 from the standard normal distribution\u2014projected to the space of\ncovariance matrices.\nIn the \ufb01rst experiment we aim to predict a signal x \u2208 R4d/5 from an observation y \u2208 Rd/5, where\nthe random vector z = [x(cid:62), y(cid:62)](cid:62) follows a d-variate Gaussian distribution with d \u2208 {10, 50, 100}.\nThe experiment comprises 104 simulation runs. In each run we randomly generate two covariance\nmatrices \u03a3(cid:63) and \u03a3 as follows. First, we draw two matrices A(cid:63) and A from the standard normal\ndistribution on Rd\u00d7d, and we denote by R(cid:63) and R the orthogonal matrices whose columns correspond\nto the orthonormal eigenvectors of A(cid:63) + (A(cid:63))(cid:62) and A + A(cid:62), respectively. 
Then, we define Δ⋆ = R⋆Λ⋆(R⋆)ᵀ and Σ = RΛRᵀ, where Λ⋆ and Λ are diagonal matrices whose main diagonals are sampled uniformly from [0, 1]^d and [0.1, 10]^d, respectively. Finally, we set Σ⋆ = (Σ^{1/2} + (Δ⋆)^{1/2})² and define the normal distributions P⋆ = N_d(0, Σ⋆) and P = N_d(0, Σ). By construction, we have

    W₂(P⋆, P) ≤ ‖(Σ⋆)^{1/2} − Σ^{1/2}‖_F ≤ √d,

where ‖·‖_F stands for the Frobenius norm, and the first inequality follows from [13, Proposition 3]. We assume that P⋆ is the true distribution and P our nominal prior. The robust MMSE estimator is obtained by solving (5) for ρ = √d via the Frank-Wolfe algorithm from Section 3, while the Bayesian MMSE estimator under P is calculated analytically. In order to provide a meaningful comparison between these two approaches, we also compute the Bayesian MMSE estimator under the true distribution P⋆ (denoted by MMSE⋆), which is indeed the best possible estimator. Figure 3 visualizes the distribution of the difference between the mean square errors under P⋆ of the robust MMSE (Bayesian MMSE) and MMSE⋆ estimators. We observe that the robust MMSE estimator produces better results consistently across all experiments, and the effect is more pronounced for larger dimensions d. Figures 4(a) and 4(b) report the iteration complexity and the execution time of the Frank-Wolfe algorithm for d ∈ {10, . . . , 100} when the algorithm is stopped as soon as the relative duality gap ⟨F(S_k) − S_k, ∇f(S_k)⟩/f(S_k) drops below 0.01%. Note that the execution time grows polynomially due to the matrix inversion in the bisection algorithm.
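The sampling recipe above can be sketched in NumPy. Note that Σ^{1/2} + (Δ⋆)^{1/2} is itself symmetric positive semidefinite, so it equals (Σ⋆)^{1/2} exactly, and the Frobenius bound reduces to ‖(Δ⋆)^{1/2}‖_F = (tr Δ⋆)^{1/2} ≤ √d. Function names here are our own:

```python
import numpy as np

def psd_sqrt(M):
    # Symmetric PSD matrix square root via eigendecomposition.
    w, U = np.linalg.eigh(M)
    return (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T

def generate_pair(d, rng):
    """Draw the nominal covariance Sigma and the true covariance
    Sigma_star = (Sigma^{1/2} + Delta_star^{1/2})^2 as in the experiment."""
    def rotation():
        # Orthonormal eigenvectors of a symmetrized standard normal matrix.
        A = rng.standard_normal((d, d))
        return np.linalg.eigh(A + A.T)[1]
    R_star, R = rotation(), rotation()
    Delta_star = (R_star * rng.uniform(0.0, 1.0, d)) @ R_star.T
    Sigma = (R * rng.uniform(0.1, 10.0, d)) @ R.T
    root = psd_sqrt(Sigma) + psd_sqrt(Delta_star)   # equals Sigma_star^{1/2}
    return Sigma, root @ root
```

Checking ‖(Σ⋆)^{1/2} − Σ^{1/2}‖_F against √d on draws from `generate_pair` confirms the bound empirically.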
Figure 4(c) shows the relative duality gap of the current solution as a function of the iteration count.

Figure 4: Convergence behavior of the Frank-Wolfe algorithm; panels show (a) the scaling of the iteration count, (b) the scaling of the execution time, and (c) the convergence for d = 100 (shown are the average (solid line) and the range (shaded area) of the respective performance measures across 100 simulation runs).

5.2 Wasserstein Distributionally Robust Kalman Filtering

We assess the performance of the proposed Wasserstein distributionally robust Kalman filter against that of the classical Kalman filter and the Kalman filter with the KL ambiguity set from [12]. To this end, we borrow the standard test instance from [19, 25, 12] with n = 2 and m = 1. The system matrices satisfy

    A_t = [0.9802, 0.0196 + 0.099Δ_t; 0, 0.9802],   B_tB_tᵀ = [1.9608, 0.0195; 0.0195, 1.9605],   C_t = [1, −1],   D_tD_tᵀ = 1,

and B_tD_tᵀ = 0, where Δ_t represents a scalar uncertainty, and the initial state satisfies x₀ ∼ N₂(0, I₂). In all numerical experiments we simulate the different filters over 1000 periods starting from x̂₀ = 0 and V₀ = I₂. Figure 5 shows the empirical mean square error (1/500) Σ_{j=1}^{500} ‖x_t^j − x̂_t^j‖² across 500 independent simulation runs, where x̂_t^j denotes the state estimate at time t in the jth run.

Figure 5: Empirical mean square estimation error of different filters under (a) small time-invariant, (b) small time-varying, (c) large time-invariant, and (d) large time-varying uncertainty.
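As a point of reference, the classical Kalman filter baseline on this test instance can be sketched as follows. This is a simplified simulation under the assumption that process and measurement noise enter independently (consistent with B_tD_tᵀ = 0); variable and function names are our own, not the paper's:

```python
import numpy as np

def kalman_step(x_hat, P, y, A, C, Q, R):
    """One predict-update cycle of the classical Kalman filter
    with uncorrelated process and measurement noise (B_t D_t^T = 0)."""
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q
    S = C @ P_pred @ C.T + R                # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = (np.eye(x_hat.size) - K @ C) @ P_pred
    return x_new, P_new

def simulate(delta, T, rng):
    """Simulate the true system (scalar uncertainty delta) and track it
    with the classical filter designed for the nominal model (delta = 0).
    Returns the squared estimation error at each of the T periods."""
    A_true = np.array([[0.9802, 0.0196 + 0.099 * delta], [0.0, 0.9802]])
    A_nom = np.array([[0.9802, 0.0196], [0.0, 0.9802]])
    C = np.array([[1.0, -1.0]])
    Q = np.array([[1.9608, 0.0195], [0.0195, 1.9605]])   # B_t B_t^T
    R = np.array([[1.0]])                                # D_t D_t^T
    Lq = np.linalg.cholesky(Q)
    x = rng.standard_normal(2)                           # x_0 ~ N_2(0, I_2)
    x_hat, P = np.zeros(2), np.eye(2)                    # hat{x}_0 = 0, V_0 = I_2
    errors = np.empty(T)
    for t in range(T):
        x = A_true @ x + Lq @ rng.standard_normal(2)     # state transition
        y = C @ x + rng.standard_normal(1)               # noisy observation
        x_hat, P = kalman_step(x_hat, P, y, A_nom, C, Q, R)
        errors[t] = np.sum((x - x_hat) ** 2)
    return errors
```

The empirical curves in Figure 5 correspond to such squared errors averaged across independent runs and converted to decibels.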
We distinguish four different scenarios: time-invariant uncertainty (Δ_t^j = Δ^j sampled uniformly from [−Δ̄, Δ̄] for each j) versus time-varying uncertainty (Δ_t^j sampled uniformly from [−Δ̄, Δ̄] for each t and j), and small uncertainty (Δ̄ = 1) versus large uncertainty (Δ̄ = 10). All results are reported in decibel units (10 log₁₀(·)). As for the filter design, the Wasserstein and KL radii are selected from the search grids {a · 10⁻¹ : a ∈ {1, 1.1, · · · , 2}} and {a · 10⁻⁴ : a ∈ {1, 1.1, · · · , 2}}, respectively. Figure 5 reports the results with minimum steady-state error across all candidate radii.

Under small time-invariant uncertainty (Figure 5(a)), the Wasserstein and KL distributionally robust filters display a similar steady-state performance but outperform the classical Kalman filter. Note that the KL distributionally robust filter starts from a different initial point as we use the delayed implementation from [12]. Under small time-varying uncertainty (Figure 5(b)), both distributionally robust filters display a similar performance as the classical Kalman filter. Figures 5(c) and (d), corresponding to the case of large uncertainty, are similar to Figures 5(a) and (b), respectively. However, the Wasserstein distributionally robust filter now significantly outperforms the classical Kalman filter and, to a lesser extent, the KL distributionally robust filter. Moreover, the Wasserstein distributionally robust filter exhibits the best transient behavior.

Acknowledgments We gratefully acknowledge financial support from the Swiss National Science Foundation under grant BSCGI0_157733.

References

[1] B. D. Anderson and J. B. Moore. Optimal Filtering. Prentice Hall, 1979.

[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN.
arXiv preprint arXiv:1701.07875, 2017.

[3] T. Başar and P. Bernhard. H∞-Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach. Springer, 2008.

[4] D. Bertsekas and I. Rhodes. Recursive state estimation for a set-membership description of uncertainty. IEEE Transactions on Automatic Control, 16(2):117–128, 1971.

[5] Y. Chen, J. Ye, and J. Li. A distance for HMMs based on aggregated Wasserstein metric and state registration. In European Conference on Computer Vision, pages 451–466, 2016.

[6] M. Cuturi and D. Avis. Ground metric learning. The Journal of Machine Learning Research, 15(1):533–564, 2014.

[7] Y. C. Eldar and N. Merhav. A competitive minimax approach to robust estimation of random parameters. IEEE Transactions on Signal Processing, 52(7):1931–1946, 2004.

[8] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics, 3(1-2):95–110, 1956.

[9] C. R. Givens and R. M. Shortt. A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31(2):231–240, 1984.

[10] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pages 427–435, 2013.

[11] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, 2006.

[12] B. C. Levy and R. Nikoukhah. Robust state space filtering under incremental model perturbations subject to a relative entropy tolerance. IEEE Transactions on Automatic Control, 58(3):682–695, 2013.

[13] V. Masarotto, V. M. Panaretos, and Y. Zemel. Procrustes metrics on covariance operators and optimal transportation of Gaussian processes. Preprint at arXiv:1801.01990, 2018.

[14] P. Mohajerin Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations.
Mathematical Programming, 171(1):115–166, 2018.

[15] V. A. Nguyen, D. Kuhn, and P. Mohajerin Esfahani. Distributionally robust inverse covariance estimation: The Wasserstein shrinkage estimator. Optimization Online, 2018.

[16] L. Ning, T. Georgiou, A. Tannenbaum, and S. Boyd. Linear models based on noisy data and the Frisch scheme. SIAM Review, 57(2):167–197, 2015.

[17] C. R. Rao. Linear Statistical Inference and its Applications. Wiley, 1973.

[18] A. Rolet, M. Cuturi, and G. Peyré. Fast dictionary learning with a smoothed Wasserstein loss. In Artificial Intelligence and Statistics, pages 630–638, 2016.

[19] A. H. Sayed. A framework for state-space estimation with uncertain models. IEEE Transactions on Automatic Control, 46(7):998–1013, 2001.

[20] S. Shafieezadeh-Abadeh, D. Kuhn, and P. Mohajerin Esfahani. Regularization via mass transportation. Preprint at arXiv:1710.10016, 2017.

[21] S. Shafieezadeh-Abadeh, P. Mohajerin Esfahani, and D. Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, pages 1576–1584, 2015.

[22] S. Shtern and A. Ben-Tal. A semi-definite programming approach for robust tracking. Mathematical Programming, 156(1-2):615–656, 2016.

[23] A. Sinha, H. Namkoong, and J. Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

[24] J. L. Speyer, C. Fan, and R. N. Banavar. Optimal stochastic estimation with exponential cost criteria. In IEEE Conference on Decision and Control, pages 2293–2299, 1992.

[25] H. Xu and S. Mannor. A Kalman filter design based on the performance/robustness tradeoff. IEEE Transactions on Automatic Control, 54(5):1171–1175, 2009.

[26] K. Zhou, J. C. Doyle, and K. Glover. Robust and Optimal Control. Prentice Hall, 1996.

[27] M. Zorzi.
Robust Kalman filtering under model perturbations. IEEE Transactions on Automatic Control, 62(6):2902–2907, 2017.