{"title": "Kalman Filter, Sensor Fusion, and Constrained Regression: Equivalences and Insights", "book": "Advances in Neural Information Processing Systems", "page_first": 13187, "page_last": 13196, "abstract": "The Kalman filter (KF) is one of the most widely used tools for data assimilation and sequential estimation. In this work, we show that the state estimates from the KF in a standard linear dynamical system setting are equivalent to those given by the KF in a transformed system, with infinite process noise (i.e., a ``flat prior'') and an augmented measurement space. This reformulation---which we refer to as augmented measurement sensor fusion (SF)---is conceptually interesting, because the transformed system here is seemingly static (as there is effectively no process model), but we can still capture the state dynamics inherent to the KF by folding the process model into the measurement space. Further, this reformulation of the KF turns out to be useful in settings in which past states are observed eventually (at some lag). Here, when the measurement noise covariance is estimated by the empirical covariance, we show that the state predictions from SF are equivalent to those from a regression of past states on past measurements, subject to particular linear constraints (reflecting the relationships encoded in the measurement map). This allows us to port standard ideas (say, regularization methods) in regression over to dynamical systems. For example, we can posit multiple candidate process models, fold all of them into the measurement model, transform to the regression perspective, and apply $\\ell_1$ penalization to perform process model selection. 
We give various empirical demonstrations, and focus on an application to nowcasting the weekly incidence of influenza in the US.", "full_text": "Kalman Filter, Sensor Fusion, and Constrained Regression: Equivalences and Insights

Maria Jahja
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213
maria@stat.cmu.edu

Roni Rosenfeld
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA 15213
roni@cs.cmu.edu

David Farrow
Computational Biology Department
Carnegie Mellon University
Pittsburgh, PA 15213
dfarrow0@gmail.com

Ryan J. Tibshirani
Department of Statistics
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA 15213
ryantibs@stat.cmu.edu

Abstract

The Kalman filter (KF) is one of the most widely used tools for data assimilation and sequential estimation. In this work, we show that the state estimates from the KF in a standard linear dynamical system setting are equivalent to those given by the KF in a transformed system, with infinite process noise (i.e., a "flat prior") and an augmented measurement space. This reformulation---which we refer to as augmented measurement sensor fusion (SF)---is conceptually interesting, because the transformed system here is seemingly static (as there is effectively no process model), but we can still capture the state dynamics inherent to the KF by folding the process model into the measurement space. Further, this reformulation of the KF turns out to be useful in settings in which past states are observed eventually (at some lag). 
Here, when the measurement noise covariance is estimated by the empirical covariance, we show that the state predictions from SF are equivalent to those from a regression of past states on past measurements, subject to particular linear constraints (reflecting the relationships encoded in the measurement map). This allows us to port standard ideas (say, regularization methods) in regression over to dynamical systems. For example, we can posit multiple candidate process models, fold all of them into the measurement model, transform to the regression perspective, and apply \ell_1 penalization to perform process model selection. We give various empirical demonstrations, and focus on an application to nowcasting the weekly incidence of influenza in the US.

1 Introduction

Let x_t ∈ R^k, t = 1, 2, 3, ... denote states and z_t ∈ R^d, t = 1, 2, 3, ... denote measurements evolving according to the time-invariant linear dynamical system:

    x_t = F x_{t-1} + \delta_t,    (1)
    z_t = H x_t + \epsilon_t,      (2)

for t = 1, 2, 3, .... We assume the noise terms \delta_t, \epsilon_t have mean zero and covariances Q ∈ R^{k×k} and R ∈ R^{d×d}, respectively, for all t = 1, 2, 3, .... Also, we assume that the initial state x_0 and all noise terms are mutually independent. We call (1) the process model and (2) the measurement model.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Kalman filter. The Kalman filter (KF) [Kalman, 1960] is a method for sequential estimation in the model (1), (2). Given past estimates \hat{x}_1, ..., \hat{x}_t and measurements z_1, ... 
, z_{t+1}, we form an estimate \hat{x}_{t+1} of the state x_{t+1} via

    \bar{x}_{t+1} = F \hat{x}_t,    (3)
    \hat{x}_{t+1} = \bar{x}_{t+1} + K_{t+1} (z_{t+1} - H \bar{x}_{t+1}),    (4)

where K_{t+1} ∈ R^{k×d} is called the Kalman gain (at time t + 1). It is itself updated sequentially, via

    \bar{P}_{t+1} = F P_t F^T + Q,    (5)
    K_{t+1} = \bar{P}_{t+1} H^T (H \bar{P}_{t+1} H^T + R)^{-1},    (6)
    P_{t+1} = (I - K_{t+1} H) \bar{P}_{t+1},    (7)

where P_{t+1} ∈ R^{k×k} denotes the state error covariance (at time t + 1). The step (3) is often called the predict step: we form an intermediate estimate \bar{x}_{t+1} of the state based on the process model and our estimate at the previous time point. The step (4) is often called the update step: we update our estimate \hat{x}_{t+1} based on the measurement model and the measurement z_{t+1}.

Under the data model (1), (2) and the conditions on the noise stated above, the Kalman filter attains the optimal mean squared error E‖\hat{x}_t - x_t‖_2^2 among all linear unbiased filters, at each t = 1, 2, 3, .... When the initial state x_0 and all noise terms are Gaussian, the Kalman filter estimates exactly reduce to the Bayes estimates \hat{x}_t = E(x_t | z_1, ..., z_t), t = 1, 2, 3, .... Numerous important extensions have been proposed, e.g., the ensemble Kalman filter (EnKF) [Evensen, 1994, Houtekamer and Mitchell, 1998], which approximates the noise process covariance Q by a sample covariance in an ensemble of state predictions, as well as the extended Kalman filter (EKF) [Smith et al., 1962] and unscented Kalman filter (UKF) [Julier and Uhlmann, 1997], which both allow for nonlinearities in the process model. Particle filtering (PF) [Gordon et al., 1993] has more recently become a popular approach for modeling complex dynamics. PF adaptively approximates the posterior distribution, and in doing so, avoids the linear and Gaussian assumptions inherent to the KF. 
This flexibility comes at the cost of a greater computational burden.

In this paper, we revisit the standard KF (3), (4) and show that its estimates \hat{x}_{t+1}, t = 0, 1, 2, ... are equivalent to those from the KF applied to a transformed system, with infinite process noise and an augmented measurement space. At first glance, this is perhaps surprising, because the transformed system effectively lacks a process model and is therefore seemingly static; however, it is able to take the state dynamics into account as part of its measurement model. Importantly, this reformulation of the KF leads us to derive a second, key reformulation for problems in which past states are observed (at some lag). This second reformulation is the methodological crux of our paper: it is a constrained regression approach for predicting states from measurements, motivated by (derived from) SF and the KF. We illustrate its effectiveness in an application to nowcasting weekly influenza levels in the US.

Sensor fusion. If we let the noise covariance in the process model diverge to infinity, Q → ∞,^1 then the Kalman filter estimate in (3), (4) simplifies to

    \hat{x}_{t+1} = (H^T R^{-1} H)^{-1} H^T R^{-1} z_{t+1}.    (8)

This can be verified by rewriting the Kalman gain as K_{t+1} = (\bar{P}_{t+1}^{-1} + H^T R^{-1} H)^{-1} H^T R^{-1}, and observing that \bar{P}_{t+1}^{-1} → 0 as Q → ∞. Alternatively, we can verify this by specializing to the case of Gaussian noise: as tr(Q) → ∞, we approach a flat prior, and the Kalman filter (Bayes estimator) just maximizes the likelihood of z_{t+1} | x_{t+1}. From the measurement model (2) (assuming Gaussian noise), this is a weighted regression of z_{t+1} on the measurement map H, precisely as in (8).

We will call (8) the sensor fusion (SF) estimate (at time t + 1).^2 In this setting, we will also refer to the measurements as sensors. 
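The limiting argument above is easy to check numerically. Below is a minimal numpy sketch (the dimensions, seed, and use of the information form of the Kalman gain are our own illustrative choices, not from the paper): it compares the SF estimate (8) against a single KF update whose prior precision \bar{P}_{t+1}^{-1} is set to a tiny multiple of the identity, mimicking Q → ∞.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 6                       # toy state / measurement dimensions (assumptions)
H = rng.standard_normal((d, k))   # measurement map
R = np.eye(d)                     # measurement noise covariance
Rinv = np.linalg.inv(R)
z = rng.standard_normal(d)        # a measurement z_{t+1}

# Sensor fusion estimate (8): weighted least squares of z on H
x_sf = np.linalg.solve(H.T @ Rinv @ H, H.T @ Rinv @ z)

# One KF update with a numerically flat prior: using the information form of
# the gain, K = (Pbar^{-1} + H^T R^{-1} H)^{-1} H^T R^{-1}, and taking
# Pbar^{-1} = eps * I to mimic Q -> infinity
eps = 1e-12
x_bar = rng.standard_normal(k)    # arbitrary intermediate prediction
K = np.linalg.solve(eps * np.eye(k) + H.T @ Rinv @ H, H.T @ Rinv)
x_kf = x_bar + K @ (z - H @ x_bar)
# x_kf should agree with x_sf up to an O(eps) perturbation
```

Note that the update no longer depends on the intermediate prediction x_bar: with a flat prior, the KF gain collapses to the weighted least squares operator in (8).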
As defined, sensor fusion is a special case of the Kalman filter when there is infinite process noise; said differently, it is a special case of the Kalman filter when there is no process model at all. Thus, looking at (8), the state dynamics have apparently been completely lost. Perhaps surprisingly, as we will show shortly, these dynamics can be exactly recovered by augmenting the measurement vector z_{t+1} with the KF intermediate prediction \bar{x}_{t+1} = F \hat{x}_t in (3) (and adjusting the map H and covariance R appropriately). We summarize this and our other contributions next.

^1 To make this unambiguous, we may take, say, Q = aI and let a → ∞.
^2 "Sensor fusion" is typically used as a generic term, similar to "data assimilation"; we use it to specifically describe the estimate in (8) to distinguish it from the KF. This is useful when we describe equivalences, shortly.

Summary of contributions. An outline of our contributions in this paper is as follows.

1. We show in Section 2 that, if we take the KF intermediate prediction \bar{x}_{t+1} in (3), append it to the measurement vector z_{t+1}, and perform SF (8) (with an appropriately adjusted H, R), then the result is exactly the KF estimate (4).

2. We show in Section 3 that, if we are in a problem setting in which past states are observed (at some lag, which is the case in the flu nowcasting application), and we replace the noise covariance R from the measurement model by the empirical covariance on past data, then the sensor fusion estimate (8) can be written as \hat{B}^T z_{t+1}, where \hat{B} ∈ R^{d×k} is a matrix of coefficients that solves a regression problem of the states on the measurements (using past data), subject to the equality constraint H^T \hat{B} = I.

3. 
We demonstrate the effectiveness of our new regression formulation of SF in Section 4 by describing an application of this methodology to nowcasting the incidence of weekly flu in the US. This achieves state-of-the-art performance in this problem.

4. We present in Section 5 some extensions of the regression formulation of SF; they do not have direct equivalences to SF (or the KF), but are intuitive and extend dynamical systems modeling in new directions (e.g., using \ell_1 penalization to perform a kind of process model selection).

We make several remarks. The equivalences described in points 1-3 above are deterministic (they do not require the modeling assumptions (1), (2), or any modeling assumptions whatsoever). Further, even though their proofs are elementary (they are purely linear algebraic) and the setting is a classical one (linear dynamical systems), these equivalences are, as far as we can tell, new results. They deserve to be widely known and may have implications beyond what is explored in this paper.

For example, the regression formulation of SF may still be a useful perspective for problems in which past states are fully unobserved (this being the case in most KF applications). In such problems, we may consider using smoothed estimates of past states, obtained by running a backward version of the KF forward recursions (3)-(7) (see, e.g., Chapter 7 of Anderson and Moore [1979]), for the purposes of the regression formulation. 
As another example, the SF view of the KF may be a useful formulation for the purposes of estimating the covariances R, Q, or the maps F, H, or all of them; in this paper, we assume that F, H, R, Q are known (except for in the regression formulation of SF, in which R is unknown but past states are available); in general, there are well-developed methods for estimating F, H, R, Q, such as subspace identification algorithms (see, e.g., Van Overschee and De Moor [1996]), and it may be interesting to see if the SF perspective offers any advantages here.

Related work. The Kalman filter and its extensions, as previously referenced (EnKF, EKF, UKF), are the de facto standard in state estimation and tracking problems; the literature surrounding them is enormous and we cannot give a thorough treatment. Various authors have pointed out the simple fact that the maximum likelihood estimate in (8), which we call sensor fusion, is the limit of the KF as the noise covariance in the process model approaches infinity (see, e.g., Chapter 5.9 of Brown and Hwang [2012]). We have not, however, seen any authors note that this static model can recover the KF by augmenting the measurement vector with the KF intermediate prediction (Theorem 1).

Along the lines of our second equivalence (Theorem 2), there is older work in the statistical calibration literature that studies the relationships between the regressions of y on x and x on y (for multivariate x, y, see Brown [1982]). This is somewhat related to our result, since we show that a backwards or indirect approach, which models z_{t+1} | x_{t+1}, is actually equivalent to a forwards or direct approach, which predicts x_{t+1} from z_{t+1} via regression. However, the details are quite different.

Finally, our SF methodology in the flu nowcasting application blends together individual predictors in a way that resembles linear stacking [Wolpert, 1992, Breiman, 1996]. 
In fact, one implication of our choice of measurement map H in the flu nowcasting problem, as well as the constraints in our regression formulation of SF, is that all regression weights must sum to 1, which is the standard in linear stacking as well. However, the equality constraints in our regression formulation are quite a bit more complex, and reflect aspects of the sensor hierarchy that linear stacking would not.

2 Equivalence between KF and SF

As already discussed, the sensor fusion estimate (8) is a limiting case of the Kalman filter (3), (4), and initially, it seems, one rather limited in scope: there is effectively no process model (as we have sent the process variance to infinity). However, as we show next, the KF is actually itself a special case of SF, when we augment the measurement vector by the KF intermediate predictions, and appropriately adjust the measurement map H and noise covariance R. The proof is elementary, a consequence of the Woodbury matrix identity and related manipulations. It is given in the supplement.

Theorem 1. At each time t = 0, 1, 2, ..., suppose we augment our measurement vector by defining \tilde{z}_{t+1} = (z_{t+1}, \bar{x}_{t+1}) ∈ R^{d+k}, where \bar{x}_{t+1} = F \hat{x}_t is the KF intermediate prediction at time t + 1. Suppose that we also augment our measurement map by defining \tilde{H} ∈ R^{(d+k)×k} to be the rowwise concatenation of H and the identity matrix I ∈ R^{k×k}. Furthermore, suppose we define an augmented measurement noise covariance

    \tilde{R}_{t+1} = \begin{bmatrix} R & 0 \\ 0 & \bar{P}_{t+1} \end{bmatrix},    (9)

where \bar{P}_{t+1} is the KF intermediate error covariance at time t + 1 (as in (5)). 
Then applying SF to the augmented system produces an estimate at t + 1 that equals the KF estimate,

    (\tilde{H}^T \tilde{R}_{t+1}^{-1} \tilde{H})^{-1} \tilde{H}^T \tilde{R}_{t+1}^{-1} \tilde{z}_{t+1} = \bar{x}_{t+1} + K_{t+1} (z_{t+1} - H \bar{x}_{t+1}),    (10)

where K_{t+1} is the Kalman gain at t + 1 (as in (6)).

Remark 1. We can think of the last state estimate \hat{x}_t in the theorem (which is propagated forward via \bar{x}_{t+1} = F \hat{x}_t) as the previous output from SF itself, when applied to the appropriate augmented system. More precisely, by induction, Theorem 1 says that iteratively applying SF to \tilde{z}_{t+1}, \tilde{H}, \tilde{R}_{t+1} across times t = 0, 1, 2, ..., where each \bar{x}_{t+1} = F \hat{x}_t is the intermediate prediction using the last SF estimate \hat{x}_t, produces a sequence \hat{x}_{t+1}, t = 0, 1, 2, ... that matches the state estimates from the KF.

Remark 2. The result in Theorem 1 can be seen from a Bayesian perspective, as was pointed out by an anonymous reviewer. When the initial state x_0 and all noise terms in (1), (2) are Gaussian, recall the KF reduces to the Bayes estimator. Here the posterior is the product of a Gaussian likelihood and Gaussian prior, and is thus itself Gaussian. (The proof of this standard fact uses similar arguments to the proof of Theorem 1.) Meanwhile, in augmented SF, we can view the Gaussian likelihood being maximized as the product of the Gaussian density of z_{t+1} and that of \bar{x}_{t+1}. This matches the posterior used by the KF, where the density of \bar{x}_{t+1} plays the role of the prior in the KF. Therefore in each case, we are defining our estimate to be the mean of the same Gaussian distribution.

Remark 3. The equivalence between SF and KF can be extended beyond the case of linear process and linear measurement models. 
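Before turning to the nonlinear case, the linear equivalence (10) can be verified numerically on random instances. The following numpy sketch (all dimensions, seeds, and values are illustrative assumptions) runs one KF step (3)-(7) alongside the augmented SF computation from Theorem 1:

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 3, 6
F = rng.standard_normal((k, k))   # process map
H = rng.standard_normal((d, k))   # measurement map
Q, R = np.eye(k), np.eye(d)       # process / measurement noise covariances

x_hat = rng.standard_normal(k)    # previous KF estimate x_t
P = np.eye(k)                     # previous error covariance P_t
z = rng.standard_normal(d)        # measurement z_{t+1}

# Standard KF step (3)-(7)
x_bar = F @ x_hat
P_bar = F @ P @ F.T + Q
K = P_bar @ H.T @ np.linalg.inv(H @ P_bar @ H.T + R)
x_kf = x_bar + K @ (z - H @ x_bar)

# Augmented SF (Theorem 1): z~ = (z, x_bar), H~ = [H; I], R~ = blkdiag(R, P_bar)
z_tilde = np.concatenate([z, x_bar])
H_tilde = np.vstack([H, np.eye(k)])
R_tilde = np.block([[R, np.zeros((d, k))],
                    [np.zeros((k, d)), P_bar]])
Ri = np.linalg.inv(R_tilde)
x_sf = np.linalg.solve(H_tilde.T @ Ri @ H_tilde, H_tilde.T @ Ri @ z_tilde)
# x_sf and x_kf coincide, up to floating point error
```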
Given a nonlinear process map f and a nonlinear measurement map h, suppose we define \bar{x}_{t+1} = f(\hat{x}_t), F_{t+1} = Df(\hat{x}_t) (the Jacobian of f at \hat{x}_t), and H_{t+1} = Dh(\bar{x}_{t+1}) (the Jacobian of h at \bar{x}_{t+1}). Suppose we define the augmented measurement vector as

    \tilde{z}_{t+1} = (z_{t+1} + H_{t+1} \bar{x}_{t+1} - h(\bar{x}_{t+1}), \bar{x}_{t+1}),    (11)

where we have offset the measurement z_{t+1} by the residual H_{t+1} \bar{x}_{t+1} - h(\bar{x}_{t+1}) from linearization. Suppose, as in the theorem, we define the augmented measurement map \tilde{H}_{t+1} ∈ R^{(d+k)×k} to be the rowwise concatenation of H_{t+1} and I ∈ R^{k×k}, and define \tilde{R}_{t+1} ∈ R^{(d+k)×(d+k)} as in (9), for \bar{P}_{t+1} as in (5), but with F_{t+1}, H_{t+1} in place of F, H. In the supplement, we prove that

    (\tilde{H}_{t+1}^T \tilde{R}_{t+1}^{-1} \tilde{H}_{t+1})^{-1} \tilde{H}_{t+1}^T \tilde{R}_{t+1}^{-1} \tilde{z}_{t+1} = \bar{x}_{t+1} + K_{t+1} (z_{t+1} - h(\bar{x}_{t+1})),    (12)

where K_{t+1} is as in (6), but with F_{t+1}, H_{t+1} in place of F, H. The right-hand side above is precisely the extended KF (EKF). The left-hand side is what we might call extended SF (ESF).

3 Equivalence between SF and regression

Suppose that in our linear dynamical system, at each time t, we observe the measurement z_t, make a prediction \hat{x}_t for x_t, then later observe the state x_t itself. (This setup indeed describes the influenza nowcasting problem, a central motivating example that we will describe shortly.) In such problems, we can estimate R using the empirical covariance on past data. When we plug this into (8), it turns out SF reduces to a prediction from a constrained regression of past states on past measurements.

3.1 Equivalent regression problem

In making a prediction at time t + 1, we assume in this section that we observe past states. 
We may assume without loss of generality that we observe the full past x_i, i = 1, ..., t (if this is not the case, and we observe only some subset of the past, then the only changes to make in what follows are notational). Assuming the measurement noise covariance R is unknown, we may use

    \hat{R}_{t+1} = \frac{1}{t} \sum_{i=1}^t (z_i - H x_i)(z_i - H x_i)^T,    (13)

the empirical (uncentered) covariance based on past data, as an estimate. Under this choice, it turns out that sensor fusion (8) is exactly equivalent to a regression of states on measurements, subject to certain equality constraints. The proof is elementary, but requires detailed arguments. It is deferred until the supplement.

Theorem 2. Let \hat{R}_{t+1} be as in (13) (assumed to be invertible). Consider the SF prediction at time t + 1, with \hat{R}_{t+1} in place of R. Denote this by \hat{x}_{t+1} = \hat{B}^T z_{t+1}, where

    \hat{B}^T = (H^T \hat{R}_{t+1}^{-1} H)^{-1} H^T \hat{R}_{t+1}^{-1}

(and H^T \hat{R}_{t+1}^{-1} H is assumed invertible). Each column of \hat{B}, denoted \hat{b}_j ∈ R^d, j = 1, ..., k, solves

    minimize_{b_j ∈ R^d}  \sum_{i=1}^t (x_{ij} - b_j^T z_i)^2   subject to  H^T b_j = e_j,    (14)

where e_j ∈ R^k is the jth standard basis vector (all 0s except for a 1 in the jth component).

Remark 4. As discussed in the introduction, the interpretation of (H^T \hat{R}_{t+1}^{-1} H)^{-1} H^T \hat{R}_{t+1}^{-1} z_{t+1} as the coefficients from regressing z_{t+1} (the response) onto H (the covariates) is more or less immediate. Interpreting the same quantity as \hat{B}^T z_{t+1} = (\hat{b}_1^T z_{t+1}, ..., \hat{b}_k^T z_{t+1}), the predictions from historically regressing x_i, i = 1, ..., t (the response) onto z_i, i = 1, ..., t (the covariates), however, is much less obvious. 
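The equivalence in Theorem 2 is likewise easy to check numerically. The sketch below uses synthetic data (dimensions, seed, and noise scale are our own choices): it forms the empirical covariance (13), computes the SF coefficient matrix, and solves the constrained least squares problem (14) for each column via its KKT system.

```python
import numpy as np

rng = np.random.default_rng(2)
t, k, d = 50, 3, 6                # need t >= d so that R_hat is invertible
H = rng.standard_normal((d, k))   # measurement map
X = rng.standard_normal((t, k))   # past states x_1..x_t (rows)
Z = X @ H.T + 0.3 * rng.standard_normal((t, d))   # past measurements (rows)

# Empirical (uncentered) noise covariance (13)
E = Z - X @ H.T
R_hat = (E.T @ E) / t

# SF coefficient matrix: B^T = (H^T Rhat^{-1} H)^{-1} H^T Rhat^{-1}
Ri = np.linalg.inv(R_hat)
B_sf = Ri @ H @ np.linalg.inv(H.T @ Ri @ H)       # shape (d, k)

# Constrained regression (14), one column per state coordinate, via the
# KKT system  [2 Z^T Z  H] [b ]   [2 Z^T x_j]
#             [H^T      0] [nu] = [e_j      ]
KKT = np.block([[2 * Z.T @ Z, H],
                [H.T, np.zeros((k, k))]])
B_reg = np.zeros((d, k))
for j in range(k):
    rhs = np.concatenate([2 * Z.T @ X[:, j], np.eye(k)[:, j]])
    B_reg[:, j] = np.linalg.solve(KKT, rhs)[:d]
# B_sf and B_reg agree (Theorem 2), and H^T B = I holds by construction
```

The second fact checked here, H^T \hat{B} = I, is the compact form of the equality constraints, and underlies the invariance interpretation discussed in Section 3.3.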
The latter is a forwards or direct regression approach to predicting x_{t+1}, whereas SF was originally defined via the backwards or indirect perspective inherent to the measurement model (2).

3.2 Influenza nowcasting

An example that we will revisit frequently, for the rest of the paper, is the following influenza (or flu) nowcasting problem. The state variable of interest is the weekly percentage of weighted influenza-like illness (wILI), a measure of flu incidence provided by the Centers for Disease Control and Prevention (CDC), in each of the k = 51 US states (including DC). Because it takes time for the CDC to collect and compile this data, they release wILI values with a 1 week delay. Meanwhile, various proxies for the flu (i.e., data sources that are potentially correlated with flu incidence) are available in real time, e.g., web search volume for flu-related terms, site traffic metrics for flu-related pages, pharmaceutical sales for flu-related products, etc. We can hence train (using historical data) sensors to predict wILI, one from each data source, and plug them into sensor fusion (8) in order to "nowcast" the current flu incidence (that would otherwise remain unknown for another week).

Such a sensor fusion system for flu nowcasting, using d = 308 sensors (flu proxies), is described in Chapter 4 of Farrow [2016].^3 In addition to the surveillance sensors described above (search volume for flu terms, site traffic metrics for flu pages, etc.), the measurement vector in this nowcasting system also uses a sensor that is trained to make predictions of wILI using a seasonal autoregression with 3 lags (SAR3). 
From the KF-SF equivalence established in Section 2, we can think of this SAR3 sensor as serving the role of something like a process model, in the underlying dynamical system. While wILI itself is available at the US state level, the data source used to train each sensor may only be available at coarser geographic resolution. Thus, importantly, each sensor outputs a prediction at a different geographic resolution (which reflects the resolution of its corresponding data source).

^3 This is more than just a hypothetical system; it is fully operational, and run by the Carnegie Mellon DELPHI group to provide real-time nowcasts of flu incidence every week, in all US states, plus select regions, cities, and territories. (See https://delphi.midas.cs.cmu.edu).

Figure 1: Simplified version of the flu nowcasting problem, with k = 5 states and d = 8 sensors. We have a 3-level hierarchy, where x_1, x_2, x_3 are part of the first region and x_4, x_5 are part of the second. The national level is at the root. As for the sensors, we have one at each state, one at each region, and one at the national level. Assuming all states have equal populations, the sensor map H is

    H = \begin{bmatrix}
        1 & 0 & 0 & 0 & 0 \\
        0 & 1 & 0 & 0 & 0 \\
        0 & 0 & 1 & 0 & 0 \\
        0 & 0 & 0 & 1 & 0 \\
        0 & 0 & 0 & 0 & 1 \\
        1/3 & 1/3 & 1/3 & 0 & 0 \\
        0 & 0 & 0 & 1/2 & 1/2 \\
        1/5 & 1/5 & 1/5 & 1/5 & 1/5
    \end{bmatrix}.

As an example, the number of visits to flu-related CDC pages is available for each US state separately; so for each US state, we train a separate sensor to predict wILI from CDC site traffic. 
However, counts for Wikipedia page visits are only available nationally; so we train just one sensor to predict national wILI from Wikipedia page visits.

Assuming unbiasedness of all the sensors, we construct the map H in (2) so that its rows reflect the geography of the sensors. For example, if a sensor is trained on data that is available at the ith US state, then its associated row in H is

    (0, ..., 0, 1, 0, ..., 0),

with the 1 in the ith position; and if a sensor is trained on data from the aggregate of the first 3 US states, then its associated row is

    (w_1, w_2, w_3, 0, ..., 0),

for weights w_1, w_2, w_3 > 0 such that w_1 + w_2 + w_3 = 1, based on relative state populations; and so on. Figure 1 illustrates the setup in a simple example.

3.3 Interpreting the constraints

At a high level, the constraints in (14) encode information about the measurement model (2). They also provide some kind of implicit regularization. Interestingly, as we will see later in Section 4, this can still be useful when used in addition to more typical (explicit) regularization.

How can we interpret these constraints? We give three interpretations, the first one specific to the flu forecasting setting, and the next two general.

Flu interpretation. In the flu nowcasting problem, recall, the map H has rows that sum to 1, and they reflect the geographic level at which the corresponding sensors were trained (see Section 3.2). The constraints H^T b_j = e_j, j = 1, ..., k can be seen in this case as a mechanism that accounts for the geographical hierarchy underlying the sensors. As a concrete example, consider the simplified setup in Figure 1, and j = 3. 
The constraint H^T b_3 = e_3 reads:

    b_{31} + 1/3 b_{36} + 1/5 b_{38} = 0,
    b_{32} + 1/3 b_{36} + 1/5 b_{38} = 0,
    b_{33} + 1/3 b_{36} + 1/5 b_{38} = 1,
    b_{34} + 1/3 b_{37} + 1/5 b_{38} = 0,
    b_{35} + 1/3 b_{37} + 1/5 b_{38} = 0.

The third line can be interpreted as follows: an increase of 1 unit in sensor z_3, 1/3 units in z_6, and 1/5 units in z_8, holding all other sensors fixed, should lead to an increase of 1 unit in our prediction for x_3. This is a natural consequence of the hierarchy in the sensor model (2), visualized in Figure 1. The first line can be read as: an increase of 1 unit in sensor z_1, 1/3 units in z_6, and 1/5 in z_8, with all others fixed, should not change our prediction for x_3. This is also natural, following from the hierarchy (i.e., such a change must have been propagated by x_1). The other lines are similar.

Invariance interpretation. The SF prediction (at time t + 1) is \hat{x}_{t+1} = \hat{B}^T z_{t+1}. To denoise (i.e., estimate the mean of) the measurement z_{t+1}, based on the model (2), we could use \hat{z}_{t+1} = H \hat{x}_{t+1}. Given the denoised \hat{z}_{t+1}, we could then refit our state prediction via \tilde{x}_{t+1} = \hat{B}^T \hat{z}_{t+1}. But due to the constraint H^T \hat{B} = I (a compact way of expressing H^T \hat{b}_j = e_j, for j = 1, ..., k), it holds that \tilde{x}_{t+1} = \hat{B}^T H \hat{x}_{t+1} = \hat{x}_{t+1}. This is a kind of invariance property. In other words, we can go from estimating states, to refitting measurements, to refitting states, etc., and in this process, our state estimates will not change.

Generative interpretation. Assume t ≥ k, and fix an arbitrary j = 1, ..., k as well as b_j ∈ R^d. The constraint H^T b_j = e_j implies, by taking an inner product on both sides with x_i, i = 1, ..., k,

    (H x_i)^T b_j = x_{ij},    i = 1, ..., k.

If we assume x_i, i = 1, ..., k are linearly independent, then the above linear equalities are not only implied by H^T b_j = e_j, they are actually equivalent to it. 
Invoking the model (2), we may rewrite the constraint H^T b_j = e_j as

    E(b_j^T z_i | x_i) = x_{ij},    i = 1, ..., k.    (15)

In the context of problem (14), this is a statement about a generative model for the data (as z_i | x_i describes the distribution of the covariates conditional on the response). The representation in (15) shows that (14) constrains the regression estimator to have the correct conditional predictions, on average, on the data we have already seen (x_i, z_i), i = 1, ..., k. (Note here we did not have to use the first k time points; any past k time points would suffice.)

3.4 Modifications and equivalences

In the supplement, we show that two modifications of the basic SF formulation also have equivalences in the regression perspective: namely, shrinking the empirical covariance in (13) towards the identity is equivalent to adding a ridge (squared \ell_2) penalty to the criterion in (14); and also, adding a null sensor at each state (one that always outputs 0) is equivalent to removing the constraints in (14). The latter equivalence here provides indirect but fairly compelling evidence that the constraints in the regression formulation (14) play an important role (under the model (2)): it says that removing them is equivalent to including meaningless null sensors, which intuitively should worsen its predictions.

4 Flu nowcasting application

Experimental setup. We examine the performance of our methods for nowcasting (one-week-ahead prediction of) wILI across 5 flu seasons, from 2013 to 2018 (a total of 140 weeks). Recall the setup described in Section 3.2, with k = 51 states and d = 308 measurements. At week t + 1, we derive an estimate \hat{x}_{t+1} of the current wILI in the 51 US states, based on sensors z_{t+1} (each sensor being the output of an algorithm trained to predict wILI at a different geographic resolution from a given data source), and past wILI and sensor data. 
We consider 7 methods for computing the nowcast \hat{x}_{t+1}: (i) SF, or equivalently, constrained regression (14); (ii) SF as in (14), but with an additional ridge (squared \ell_2) penalty (equivalently, SF with covariance shrinkage); (iii) SF as in (14), but with an additional lasso (\ell_1) penalty; (iv/v) regression as in (14), but without constraints, and using a ridge/lasso penalty; (vi) random forests (RF) [Breiman, 2001], trained on all of the sensors; (vii) RF, but trained on all of the underlying data sources used to fit the sensors.

At prediction week t + 1, we use the last 3 years (weeks t - 155 through t) as the training set for all 7 methods. We do not implement unpenalized regression (as in (14), but without constraints), as it is not well-defined (156 observations and 308 covariates).^4 All ridge and lasso tuning parameters are chosen by optimizing one-week-ahead prediction error over the latest 10 weeks of data (akin to cross-validation, but for a time series context like ours). Python code for this nowcasting experiment is available at http://github.com/mariajahja/kf-sf-flu-nowcasting.

^4 SF is still well-defined, due to the constraint in (14): a nonunique solution only occurs when the (random) null space of the covariate matrix has a nontrivial intersection with the null space of H^T, which essentially never happens.

Figure 2: Top row, from left to right: data sources, sensors, and nowcasts are compared to the underlying wILI values for Pennsylvania during flu season 2017-18. For visualization purposes, the sources are scaled to fit the range of wILI. On the rightmost plot, we display nowcasts using select methods. Bottom row: MAEs (full colors) and MADs (light colors) of nowcasts over 5 flu seasons from 2013-14 to 2017-18.

Missing data. 
Unfortunately, sensors are observed not only at varying geographic resolutions, but also at varying temporal resolutions (since their underlying data sources are), and missing values occur. In our experiments, we choose to compute predictions using the regression perspective, and apply a simple mean imputation approach (using only past sensor data) before fitting all models.

Nowcasting results. The bottom row of Figure 2 displays the mean absolute errors (MAEs) from one-week-ahead predictions by the 7 methods considered, averaged over the 51 US states, for each of the 5 seasons. Also displayed are the mean absolute deviations (MADs), in light colors. We see that SF with ridge regularization is generally the most accurate over the 5 seasons, SF with lasso regularization is a close second, and SF without any regularization is the worst. Thus, clearly, explicit regularization helps. Importantly, we also see that the constraints in the regression problem (14) (which come from its connection to SF) play a key role: in each season, SF with ridge regularization outperforms ridge regression, and SF with lasso regularization outperforms the lasso. Therefore, the constraints provide additional (beneficial) implicit regularization.

RF trained on sensors performs somewhat competitively. RF trained on sources is more variable (in some seasons, much worse than RF on sensors). This indicates that training the sensors is an important step for nowcasting accuracy, as it can be seen as a form of denoising, and suggests a view of all the methods we consider here (except RF on sources) as prediction assimilators (rather than data assimilators). Finally, the top row of Figure 2 visualizes the nowcasts for Pennsylvania in the 2017-18 season.
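The past-only mean imputation described under "Missing data" above can be sketched as follows; this is our own minimal pandas illustration (column names are made up), not the paper's code. The key point is that the imputed value for week $i$ uses only weeks before $i$, so no future data leaks into training.

```python
import numpy as np
import pandas as pd

# Toy sensor matrix: rows are weeks, columns are sensors; NaN = missing.
sensors = pd.DataFrame({
    "s1": [1.0, np.nan, 3.0, np.nan, 5.0],
    "s2": [np.nan, 2.0, 2.0, 8.0, np.nan],
})

# expanding().mean() at row i averages the observed values in rows 0..i
# (NaNs are skipped); shifting by 1 restricts it to rows 0..i-1, i.e.,
# strictly past data. Missing entries are then filled from these past means.
past_means = sensors.expanding().mean().shift(1)
imputed = sensors.fillna(past_means)
print(imputed)
```

Note that a value missing in the very first week has no past to average over and stays missing (one would need a fallback, e.g., a column-wise constant, for that edge case).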
We can see that SF, RF (on sensors), and even ridge regression are noticeably more volatile than SF with ridge regularization.

5 Discussion and extensions

In this paper, we studied connections between the Kalman filter, sensor fusion, and regression. We derived equivalences between the first two and latter two, discussed the general implications of our results, and studied the application of our work to nowcasting the weekly influenza levels in the US. We conclude with some ideas for extending the constrained regression formulation (14) of SF.

Sensor selection. The problem of selecting a small number of relevant sensors (on which to perform sensor fusion) among a possibly large number, which we call sensor selection, is quite difficult; more broadly, measurement selection in the Kalman filter is, as far as we know, an active and relatively open area of research. In regression, on the other hand, variable selection is extremely well-studied, and $\ell_1$ regularization (among many other tools) is now very well-developed (see, e.g., Hastie et al. [2009, 2015]). Starting from the regression formulation for SF in (14), it would be natural to add an $\ell_1$ or lasso penalty [Tibshirani, 1996] to the criterion, in order to select relevant sensors:
\[
\underset{b_j \in \mathbb{R}^d}{\text{minimize}} \;\; \frac{1}{t} \sum_{i=1}^t (x_{ij} - b_j^T z_i)^2 + \lambda_j \|b_j\|_1
\quad \text{subject to} \;\; H^T b_j = e_j, \tag{16}
\]
where $\|b_j\|_1 = \sum_{\ell=1}^d |b_{j\ell}|$, for $j = 1, \ldots, k$.
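As a quick illustration of (16), here is a toy projected-subgradient solver in numpy (our own sketch, not the paper's implementation): we take subgradient steps on the lasso-penalized least squares criterion and, after each step, project back onto the affine set $\{b : H^T b = e_j\}$. A specialized solver would be preferable in practice; this is only meant to show that (16) is a tractable convex problem.

```python
import numpy as np

def solve_16(Z, x, H, e, lam, step=1e-3, iters=5000):
    """Projected subgradient method for
       min_b (1/t)||x - Z b||^2 + lam ||b||_1  subject to  H^T b = e."""
    t, d = Z.shape
    HtH_inv = np.linalg.inv(H.T @ H)
    def project(b):
        # Euclidean projection onto {b : H^T b = e}.
        return b - H @ (HtH_inv @ (H.T @ b - e))
    b = project(np.zeros(d))                       # feasible starting point
    for _ in range(iters):
        grad = (2.0 / t) * Z.T @ (Z @ b - x) + lam * np.sign(b)
        b = project(b - step * grad)               # step, then restore feasibility
    return b

rng = np.random.default_rng(2)
t, d, k = 120, 15, 4
H = rng.standard_normal((d, k))
Z = rng.standard_normal((t, d))
x = rng.standard_normal(t)
e = np.eye(k)[:, 0]
b = solve_16(Z, x, H, e, lam=0.1)
print(np.allclose(H.T @ b, e))                     # iterates stay feasible
```

Because every iterate is projected back onto the constraint set, the returned coefficients satisfy $H^T b_j = e_j$ exactly (up to numerical error), regardless of the penalty level.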
It is not clear (nor likely) that (16) has an equivalent SF formulation, but the exact equivalence when $\lambda_j = 0$ suggests that (16) could be a reasonable tool for sensor selection. (Indeed, without even considering its sensor selection capabilities, this performed respectably for predictive purposes in the experiments in Section 4.) Further, we can perform a kind of process model selection with (16) by augmenting our measurement vector with multiple candidate process models, and penalizing only their coefficients. An example is given in the supplement.

Joint sensor learning. In the flu nowcasting problem, recall, the sensors are outputs of predictive models, each trained individually to predict wILI from a particular data source (flu proxy). Denote by $u_i \in \mathbb{R}^d$, $i = 1, \ldots, t$ the data sources at times 1 through $t$. Instead of learning the sensors (predictive transformations of these sources) individually, we could learn them jointly, by extending (14) into:
\[
\underset{b_j \in \mathbb{R}^d, \, f_j \in \mathcal{F}_j}{\text{minimize}} \;\; \frac{1}{t} \sum_{i=1}^t \big(x_{ij} - b_j^T f_j(u_i)\big)^2 + \lambda_j P_j(f_j)
\quad \text{subject to} \;\; H^T b_j = e_j, \tag{17}
\]
for $j = 1, \ldots, k$. Here, each $\mathcal{F}_j$ is a space of functions from $\mathbb{R}^d$ to $\mathbb{R}^d$ (e.g., diagonal linear maps) and $P_j$ is a penalty to be specified by the modeler (e.g., the Frobenius norm in the linear map case). The key in (17) is that we are simultaneously learning the sensors and assimilating them.

Gradient boosting. Solving (17) is computationally difficult (even in the simple linear map case, it is nonconvex). A more tractable alternative is to proceed iteratively, in a manner inspired by gradient boosting [Friedman, 2001]. For each $j = 1, \ldots, d$, let $A_j$ be an algorithm ("base learner") that we use to fit sensor $j$ from data source $j$. Write $y_i = H x_i$, $i = 1, \ldots, t$, and let $\eta > 0$ be a small fixed learning rate.
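As a rough illustration of this boosting idea, the following is a toy numpy sketch, entirely our own: the base learners $A_j$ are simple per-source least-squares slopes, plain ridge regression stands in for the SF step in (14), and all names and dimensions are illustrative.

```python
import numpy as np

def fit_slope(u, r):
    """Base learner A_j: least-squares slope of residual r on source u."""
    return (u @ r) / (u @ u)

def boost_nowcast(U, X, H, eta=0.1, B=50, lam=1.0):
    """Toy boosting scheme: U is (t+1, d) sources, X is (t, k) observed
    states, H is the (d, k) measurement map; returns a nowcast for time t+1.
    Ridge regression is used in place of the full SF step, for simplicity."""
    t1, d = U.shape
    t, k = X.shape
    Xfit = np.zeros((t1, k))                   # total state fits x^{(b)}
    for _ in range(B):
        # Sensor stage: fit each source to the residual of its target y = H x.
        Y = X @ H.T                            # rows are y_i^T = (H x_i)^T
        Yfit = Xfit[:t] @ H.T
        Z = np.empty((t1, d))                  # intermediate sensors z^{(b)}
        for j in range(d):
            slope = fit_slope(U[:t, j], Y[:, j] - Yfit[:, j])
            Z[:, j] = slope * U[:, j]
        # Fusion stage: regress state residuals on the intermediate sensors.
        Bhat = np.linalg.solve(Z[:t].T @ Z[:t] + lam * np.eye(d),
                               Z[:t].T @ (X - Xfit[:t]))
        Xfit = Xfit + eta * (Z @ Bhat)         # small step toward the fit
    return Xfit[-1]                            # prediction for time t+1

rng = np.random.default_rng(4)
t, d, k = 100, 6, 3
H = rng.standard_normal((d, k))
X = rng.standard_normal((t, k))
U = np.vstack([X, rng.standard_normal((1, k))]) @ H.T  # toy sources ~ H x
x_hat = boost_nowcast(U, X, H)
print(x_hat.shape)
```

The loop mirrors the two stages of the scheme: each source is fit to the residual of its target $y = Hx$, and the fused state fits are then nudged by a small step $\eta$ toward the regression of the state residuals on the intermediate sensors.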
To make a prediction at time $t + 1$, we initialize $x_i^{(0)} = 0$, $i = 1, \ldots, t + 1$ (or initialize at the fits from the usual linear SF), and repeat, for boosting iterations $b = 1, \ldots, B$:

• For $j = 1, \ldots, d$:
  – Let $y_{ij}^{(b-1)} = (H x_i^{(b-1)})_j$, for $i = 1, \ldots, t$.
  – Run $A_j$ with responses $\{y_{ij} - y_{ij}^{(b-1)}\}_{i=1}^t$ and covariates $\{u_{ij}\}_{i=1}^t$, to produce $\bar{f}_j^{(b)}$.
  – Define intermediate sensors $z_{ij}^{(b)} = \bar{f}_j^{(b)}(u_{ij})$, for $i = 1, \ldots, t + 1$.

• For $j = 1, \ldots, k$:
  – Run SF as in (14) (possibly with regularization) with responses $\{x_{ij} - x_{ij}^{(b-1)}\}_{i=1}^t$ and covariates $\{z_i^{(b)}\}_{i=1}^t$, to produce $\hat{b}_j$.
  – Define intermediate state fits $\bar{x}_{ij}^{(b)} = \hat{b}_j^T z_i^{(b)}$, for $i = 1, \ldots, t + 1$.
  – Update total state fits $x_{ij}^{(b)} = x_{ij}^{(b-1)} + \eta \bar{x}_{ij}^{(b)}$, for $i = 1, \ldots, t + 1$.

We return at the end our final prediction $\hat{x}_{t+1} = x_{t+1}^{(B)}$. It would be interesting to pursue this approach in detail, and study the extent to which it can improve on the usual linear SF.

Acknowledgments. We thank Logan Brooks for several helpful conversations and brainstorming sessions. MJ was supported by NSF Graduate Research Fellowship No. DGE-1745016. RR and RJT were supported by DTRA Contract No. HDTRA1-18-C-0008.

References

Brian D. O. Anderson and John B. Moore. Optimal Filtering. Prentice-Hall, 1979.

Leo Breiman. Stacked regressions. Machine Learning, 24(1):49–64, 1996.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

P. J. Brown. Multivariate calibration. Journal of the Royal Statistical Society: Series B, 44(3):287–321, 1982.

Robert Brown and Patrick Hwang. Introduction to Random Signals and Applied Kalman Filtering. Wiley, fourth edition, 2012.

Geir Evensen.
Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. Journal of Geophysical Research, 99(C5):10143–10162, 1994.

David Farrow. Modeling the Past, Present, and Future of Influenza. PhD thesis, Computational Biology Department, Carnegie Mellon University, 2016.

Jerome Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001.

Neil J. Gordon, David J. Salmond, and Adrian F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, Radar and Signal Processing, 140(2):107–113, 1993.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, second edition, 2009.

Trevor Hastie, Robert Tibshirani, and Martin J. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman & Hall, 2015.

P. L. Houtekamer and Herschel L. Mitchell. Data assimilation using an ensemble Kalman filter technique. Monthly Weather Review, 126(3):796–811, 1998.

Simon J. Julier and Jeffrey K. Uhlmann. A new extension of the Kalman filter to nonlinear systems. Signal Processing, Sensor Fusion, and Target Recognition, 1997.

Rudolf E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.

Peter Van Overschee and Bart De Moor. Subspace Identification for Linear Systems. Kluwer Academic, 1996.

Gerald L. Smith, Stanley F. Schmidt, and Leonard A. McGee. Application of statistical filter theory to the optimal estimation of position and velocity on board a circumlunar vehicle. National Aeronautics and Space Administration Tech Report, 1962.

Robert Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society: Series B, 58(1):267–288, 1996.

David Wolpert. Stacked generalization. Neural Networks, 5(2):241–259, 1992.