{"title": "Online Variational Approximations to non-Exponential Family Change Point Models: With Application to Radar Tracking", "book": "Advances in Neural Information Processing Systems", "page_first": 306, "page_last": 314, "abstract": "The Bayesian online change point detection (BOCPD) algorithm provides an efficient way to do exact inference when the parameters of an underlying model may suddenly change over time. BOCPD requires computation of the underlying model's posterior predictives, which can only be computed online in $O(1)$ time and memory for exponential family models. We develop variational approximations to the posterior on change point times (formulated as run lengths) for efficient inference when the underlying model is not in the exponential family, and does not have tractable posterior predictive distributions. In doing so, we develop improvements to online variational inference. We apply our methodology to a tracking problem using radar data with a signal-to-noise feature that is Rice distributed. We also develop a variational method for inferring the parameters of the (non-exponential family) Rice distribution.", "full_text": "Online Variational Approximations to\n\nnon-Exponential Family Change Point Models:\n\nWith Application to Radar Tracking\n\nRyan Turner\n\nNorthrop Grumman Corp.\nryan.turner@ngc.com\n\nSteven Bottone\n\nNorthrop Grumman Corp.\n\nsteven.bottone@ngc.com\n\nClay Stanek\n\nNorthrop Grumman Corp.\nclay.stanek@ngc.com\n\nAbstract\n\nThe Bayesian online change point detection (BOCPD) algorithm provides an ef-\n\ufb01cient way to do exact inference when the parameters of an underlying model\nmay suddenly change over time. BOCPD requires computation of the underly-\ning model\u2019s posterior predictives, which can only be computed online in O(1)\ntime and memory for exponential family models. 
We develop variational approx-\nimations to the posterior on change point times (formulated as run lengths) for\nef\ufb01cient inference when the underlying model is not in the exponential family,\nand does not have tractable posterior predictive distributions. In doing so, we de-\nvelop improvements to online variational inference. We apply our methodology\nto a tracking problem using radar data with a signal-to-noise feature that is Rice\ndistributed. We also develop a variational method for inferring the parameters of\nthe (non-exponential family) Rice distribution.\n\nChange point detection has been applied to many applications [5; 7]. In recent years there have been\ngreat improvements to the Bayesian approaches via the Bayesian online change point detection\nalgorithm (BOCPD) [1; 23; 27]. Likewise, the radar tracking community has been improving in its\nuse of feature-aided tracking [10]: methods that use auxiliary information from radar returns such\nas signal-to-noise ratio (SNR), which depend on radar cross sections (RCS) [21]. Older systems\nwould often \ufb01lter only noisy position (and perhaps Doppler) measurements while newer systems use\nmore information to improve performance. We use BOCPD for modeling the RCS feature. Whereas\nBOCPD inference could be done exactly when \ufb01nding change points in conjugate exponential family\nmodels the physics of RCS measurements often causes them to be distributed in non-exponential\nfamily ways, often following a Rice distribution. To do inference ef\ufb01ciently we call upon variational\nBayes (VB) to \ufb01nd approximate posterior (predictive) distributions. Furthermore, the nature of both\nBOCPD and tracking require the use of online updating. We improve upon the existing and limited\napproaches to online VB [24; 13]. 
This paper produces contributions to, and builds upon background\nfrom, three independent areas: change point detection, variational Bayes, and radar tracking.\nAlthough the emphasis in machine learning is on \ufb01ltering, a substantial part of tracking with radar\ndata involves data association, illustrated in Figure 1. Observations of radar returns contain mea-\nsurements from multiple objects (targets) in the sky. If we knew which radar return corresponded\nto which target we would be presented with NT \u2208 N0 independent \ufb01ltering problems; Kalman\n\ufb01lters [14] (or their nonlinear extensions) are applied to \u201caverage out\u201d the kinematic errors in the\nmeasurements (typically positions) using the measurements associated with each target. The data\nassociation problem is to determine which measurement goes to which track. In the classical setup,\nonce a particular measurement is associated with a certain target, that measurement is plugged into\nthe \ufb01lter for that target as if we knew with certainty it was the correct assignment. The association\nalgorithms, in effect, \ufb01nd the maximum a posteriori (MAP) estimate on the measurement-to-track\nassociation. However, approaches such as the joint probabilistic data association (JPDA) \ufb01lter [2]\nand the probability hypothesis density (PHD) \ufb01lter [16] have deviated from this.\n\n1\n\n\fTo \ufb01nd the MAP estimate a log likelihood of the data under each possible assignment vector a must\nbe computed. These are then used to construct cost matrices that reduce the assignment problem to a\nparticular kind of optimization problem (the details of which are beyond the scope of this paper). The\nmotivation behind feature-aided tracking is that additional features increase the probability that the\nMAP measurement-to-track assignment is correct. Based on physical arguments the RCS feature\n(SNR) is often Rice distributed [21, Ch. 
3]; although, in certain situations RCS is exponential or\ngamma distributed [26]. The parameters of the RCS distribution are determined by factors such as\nthe shape of the aircraft facing the radar sensor. Given that different aircraft have different RCS\ncharacteristics, if one attempts to create a continuous track estimating the path of an aircraft, RCS\nfeatures may help distinguish one aircraft from another if they cross paths or come near one another,\nfor example. RCS also helps distinguish genuine aircraft returns from clutter: a \ufb02ock of birds or\nrandom electrical noise, for example. However, the parameters of the RCS distributions may also\nchange for the same aircraft due to a change in angle or ground conditions. These must be taken into\naccount for accurate association. Providing good predictions in light of a possible sudden change in\nthe parameters of a time series is \u201cright up the alley\u201d of BOCPD and change point methods.\nThe original BOCPD papers [1; 11] studied sudden changes in the parameters of exponential family\nmodels for time series. In this paper, we expand the set of applications of BOCPD to radar SNR\ndata which often has the same change point structure found in other applications, and requires online\npredictions. The BOCPD model is highly modular in that it looks for changes in the parameters of\nany underlying process model (UPM). The UPM merely needs to provide posterior predictive prob-\nabilities, the UPM can otherwise be a \u201cblack box.\u201d The BOCPD queries the UPM for a prediction\nof the next data point under each possible run length, the number of points since the last change\npoint. 
If (and only if by Hipp [12]) the UPM is exponential family (with a conjugate prior) the\nposterior is computed by accumulating the suf\ufb01cient statistics since the last potential change point.\nThis allows for O(1) UPM updates in both computation and memory as the run length increases.\nWe motivate the use of VB for implementing UPMs when the data within a regime is believed to\nfollow a distribution that is not exponential family. The methods presented in this paper can be used\nto \ufb01nd variational run length posteriors for general non-exponential family UPMs in addition to the\nRice distribution. Additionally, the methods for improving online updating in VB (Section 2.2) are\napplicable in areas outside of change point detection.\n\nFigure 1: Illustrative example of a tracking scenario: The black lines (\u2212) show the true tracks while the red\nstars (\u2217) show the state estimates over time for track 2 and the blue stars for track 1. The 95% credible regions\non the states are shown as blue ellipses. The current (+) and previous (\u00d7) measurements are connected to their\nassociated tracks via red lines. The clutter measurements (birds in this case) are shown with black dots (\u00b7). The\ndistributions on the SNR (RCS) for each track (blue and red) and the clutter (black) are shown on the right.\n\nTo our knowledge this paper is the \ufb01rst to demonstrate how to compute Bayesian posterior distri-\nbutions on the parameters of a Rice distribution; the closest work would be Lauwers et al. [15],\nwhich computes a MAP estimate. Other novel factors of this paper include: demonstrating the use-\nfulness (and advantages over existing techniques) of change point detection for RCS estimation and\ntracking; and applying variational inference for UPMs where analytic posterior predictives are not\npossible. This paper provides four main technical contributions: 1) VB inference for inferring the\nparameters of a Rice distribution. 
2) General improvements to online VB (which are then applied to updating the UPM in BOCPD). 3) A VB approximation to the run length posterior when the UPM posterior predictive is intractable. 4) Handling of censored measurements (particularly for a Rice distribution) in VB, which is key for processing missed detections in data association.

1 Background

In this section we briefly review the three areas of background: BOCPD, VB, and tracking.

1.1 Bayesian Online Change Point Detection

We briefly summarize the model setup and notation for the BOCPD algorithm; see [27, Ch. 5] for a detailed description. We assume we have a time series with n observations so far, y1, . . . , yn ∈ Y. In effect, BOCPD performs message passing to do online inference on the run length rn ∈ 0:n − 1, the number of observations since the last change point. Given an underlying predictive model (UPM) and a hazard function h, we can compute an exact posterior over the run length rn. Conditional on a run length, the UPM produces a sequential prediction on the next data point using all the data since the last change point: p(yn|y(r), Θm), where (r) := (n − r):(n − 1). The UPM is a simpler model where the parameters θ change at every change point and are modeled as being sampled from a prior with hyper-parameters Θm. The canonical example of a UPM would be a Gaussian whose mean and variance change at every change point. 
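The run length recursion just described can be made concrete with a minimal numerical sketch (ours, not the authors' implementation): a constant hazard and a conjugate Gaussian UPM with known observation noise stand in for the black-box UPM, and the helper names (`gaussian_upm_logpred`, `bocpd_step`) are hypothetical.

```python
import math

def gaussian_upm_logpred(y, suff, mu0=0.0, v0=10.0, noise=1.0):
    # Posterior predictive log density of y under a Gaussian UPM with known
    # observation variance `noise` and a N(mu0, v0) prior on the mean, given
    # sufficient statistics suff = (count, sum) for the current run.
    n, s = suff
    v_post = 1.0 / (1.0 / v0 + n / noise)      # posterior variance of the mean
    m_post = v_post * (mu0 / v0 + s / noise)   # posterior mean of the mean
    v_pred = v_post + noise                    # predictive variance
    return -0.5 * (math.log(2 * math.pi * v_pred) + (y - m_post) ** 2 / v_pred)

def bocpd_step(log_msg, suffs, y, hazard=0.05):
    # One message update: log_msg[r] = log p(r_n, y_{1:n}); suffs[r] holds the
    # sufficient statistics of the run of length r. The same UPM predictive is
    # used in both the growth and change point terms.
    preds = [gaussian_upm_logpred(y, s) for s in suffs]
    growth = [lm + p + math.log(1 - hazard) for lm, p in zip(log_msg, preds)]
    cp = [lm + p + math.log(hazard) for lm, p in zip(log_msg, preds)]
    m = max(cp)  # log-sum-exp over change point terms gives the r_n = 0 message
    cp0 = m + math.log(sum(math.exp(c - m) for c in cp))
    new_log_msg = [cp0] + growth
    new_suffs = [(0, 0.0)] + [(n + 1, s + y) for n, s in suffs]
    return new_log_msg, new_suffs

# Run on a stream with an obvious mean shift at t = 50.
ys = [0.0] * 50 + [8.0] * 50
log_msg, suffs = [0.0], [(0, 0.0)]
for y in ys:
    log_msg, suffs = bocpd_step(log_msg, suffs, y)
m = max(log_msg)
post = [math.exp(l - m) for l in log_msg]
post = [p / sum(post) for p in post]
map_run = post.index(max(post))
print(map_run)  # MAP run length; expect a value near 49 (change at t = 50)
```

Each step costs O(n) because every run length queries the UPM once; the difficulty addressed later in the paper is obtaining such per-run predictive densities when, unlike here, the UPM has no closed-form posterior predictive.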
The online updates are summarized as:

msg_n := p(r_n, y_{1:n}) = \sum_{r_{n-1}} \underbrace{P(r_n \mid r_{n-1})}_{\text{hazard}} \, \underbrace{p(y_n \mid r_{n-1}, y^{(r)})}_{\text{UPM}} \, \underbrace{p(r_{n-1}, y_{1:n-1})}_{\text{msg}_{n-1}} .   (1)

Unless rn = 0, the sum in (1) only contains one term since the only possibility is that r_{n-1} = r_n − 1. The indexing convention is such that if rn = 0 then yn+1 is the first observation sampled from the new parameters θ. The marginal posterior predictive on the next data point is easily calculated as:

p(y_{n+1} \mid y_{1:n}) = \sum_{r_n} p(y_{n+1} \mid y^{(r)}) \, P(r_n \mid y_{1:n}) .   (2)

Thus, the predictions from BOCPD fully integrate out any uncertainty in θ. The message updates (1) perform exact inference under a model where the number of change points is not known a priori.

BOCPD RCS Model  We show the Rice UPM as an example as it is required for our application. The data within a regime are assumed to be iid Rice observations, with a normal-gamma prior:

y_n \sim \text{Rice}(\nu, \sigma) , \quad \nu \sim \mathcal{N}(\mu_0, \sigma^2/\lambda_0) , \quad \sigma^{-2} =: \tau \sim \text{Gamma}(\alpha_0, \beta_0)   (3)
\implies p(y_n \mid \nu, \sigma) = y_n \tau \exp(-\tau (y_n^2 + \nu^2)/2) \, I_0(y_n \nu \tau) \, \mathbb{I}\{y_n \geq 0\}   (4)

where I0(·) is a modified Bessel function of order zero, which is what excludes the Rice distribution from the exponential family. Although the normal-gamma is not conjugate to a Rice it will enable us to use the VB-EM algorithm. The UPM parameters are the Rice shape¹ ν ∈ R and scale σ ∈ R+, θ := {ν, σ}, and the hyper-parameters are the normal-gamma parameters Θm := {µ0, λ0, α0, β0}. Every change point results in a new value for ν and σ being sampled. A posterior on θ is maintained for each run length, i.e. 
every possible starting point for the current regime, and is updated at each new data point. Therefore, BOCPD maintains n distinct posteriors on θ, and although this can be reduced with pruning, it necessitates posterior updates on θ that are computationally efficient.

Note that the run length updates in (1) require the UPM to provide predictive log likelihoods at all sample sizes rn (including zero). Therefore, UPM implementations using such approximations as plug-in MLE predictions will not work very well. The MLE may not even be defined for run lengths smaller than the number of UPM parameters |θ|. For a Rice UPM, the efficient O(1) updating available in exponential family models, via a conjugate prior and accumulated sufficient statistics, is not possible. This motivates the use of VB methods for approximating the UPM predictions.

1.2 Variational Bayes

We follow the framework of VB: when computation of the exact posterior distribution p(θ|y1:n) is intractable, it is often possible to create a variational approximation q(θ) that is locally optimal in terms of the Kullback-Leibler (KL) divergence KL(q‖p) while constraining q to be in a certain family of distributions Q. In general this is done by optimizing a lower bound L(q) on the evidence log p(y1:n), using either gradient based methods or standard fixed point equations.

¹ The shape ν is usually assumed to be positive (∈ R+); however, there is nothing wrong with using a negative ν as Rice(x|ν, σ) = Rice(x|−ν, σ). It also allows for use of a normal-gamma prior.

The VB-EM Algorithm  In many cases, such as the Rice UPM, the derivation of the VB fixed point equations can be simplified by applying the VB-EM algorithm [3]. VB-EM is applicable to models that are conjugate-exponential (CE) after being augmented with latent variables x1:n. 
A model is CE if: 1) the complete data likelihood p(x1:n, y1:n|θ) is an exponential family distribution; and 2) the prior p(θ) is a conjugate prior for the complete data likelihood p(x1:n, y1:n|θ). We only have to constrain the posterior q(θ, x1:n) = q(θ)q(x1:n) to factorize between the latent variables and the parameters; we do not constrain the posterior to be of any particular parametric form. Requiring the complete likelihood to be CE is a much weaker condition than requiring the marginal on the observed data p(y1:n|θ) to be CE. Consider a mixture of Gaussians: the model becomes CE when augmented with latent variables (class labels). This is also the case for the Rice distribution (Section 2.1).

Like the ordinary EM algorithm [9] the VB-EM algorithm alternates between two steps: 1) Find the posterior of the latent variables treating the expected natural parameters η̄ := E_{q(θ)}[η] as correct: q(xi) ← p(xi|yi, η = η̄). 2) Find the posterior of the parameters using the expected sufficient statistics S̄ := E_{q(x1:n)}[S(x1:n, y1:n)] as if they were the sufficient statistics for the complete data set: q(θ) ← p(θ|S(x1:n, y1:n) = S̄). The posterior will be of the same exponential family as the prior.

1.3 Tracking

In this section we review data association, which along with filtering constitutes tracking. In data association we estimate the association vectors a which map measurements to tracks. At each time step, n ∈ N1, we observe NZ(n) ∈ N0 measurements, Z_n = \{z_{i,n}\}_{i=1}^{N_Z(n)}, which include returns from both real targets and clutter (spurious measurements). Here, zi,n ∈ Z is a vector of kinematic measurements (positions in R³, or R⁴ with a Doppler), augmented with an RCS component R ∈ R+ for the measured SNR, at time tn ∈ R. The assignment vector at time tn is such that an(i) = j if measurement i is associated with track j > 0; an(i) = 0 if measurement i is clutter. The inverse mapping a_n^{-1} maps tracks to measurements: meaning a_n^{-1}(a_n(i)) = i if a_n(i) ≠ 0; and a_n^{-1}(i) = 0 ⇔ a_n(j) ≠ i for all j. For example, if NT = 4 and a = [2 0 0 1 4] then NZ = 5, Nc = 2, and a⁻¹ = [4 1 0 5]. Each track is associated with at most one measurement, and vice-versa.

In ND data association we jointly find the MAP estimate of the association vectors over a sliding window of the last N − 1 time steps. We assume we have NT(n) ∈ N0 total tracks as a known parameter: NT(n) is adjusted over time using various algorithms (see [2, Ch. 3]). In the generative process each track places a probability distribution on the next N − 1 measurements, with both kinematic and RCS components. However, if the random RCS R for a measurement is below R0 then it will not be observed. There are Nc(n) ∈ N0 clutter measurements from a Poisson process with λ := E[Nc(n)] (often with uniform intensity). The ordering of measurements in Zn is assumed to be uniformly random. For 3D data association the model joint p(Zn−1:n, an−1, an|Z1:n−2) is:

\prod_{i=1}^{N_T} p_i(z_{a_n^{-1}(i),n}, z_{a_{n-1}^{-1}(i),n-1}) \times \prod_{i=n-1}^{n} \lambda^{N_c(i)} \exp(-\lambda)/|Z_i|! \prod_{j=1}^{|Z_i|} p_0(z_{j,i})^{\mathbb{I}\{a_i(j)=0\}} ,   (5)

where pi is the probability of the measurement sequence under track i; p0 is the clutter distribution. The probability pi is the product of the RCS component predictions (BOCPD) and the kinematic components (filter); informally, pi(z) = pi(positions) × pi(RCS). If there is a missed detection, i.e. a_n^{-1}(i) = 0, we then use p_i(z_{a_n^{-1}(i),n}) = P(R < R0) under the RCS model for track i with no contribution from the positional (kinematic) component. 
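The index conventions can be checked against the example in the text (NT = 4, a = [2 0 0 1 4]) with a small helper (ours, purely illustrative):

```python
def invert_assignment(a, n_tracks):
    # a[i-1] = j > 0: measurement i is associated with track j; 0 means clutter.
    # Returns a_inv with a_inv[j-1] = i if track j has a measurement, else 0.
    a_inv = [0] * n_tracks
    for i, j in enumerate(a, start=1):
        if j > 0:
            a_inv[j - 1] = i  # each track gets at most one measurement
    return a_inv

a = [2, 0, 0, 1, 4]                       # example from the text: N_T = 4
a_inv = invert_assignment(a, n_tracks=4)
n_clutter = sum(1 for j in a if j == 0)   # N_c
print(len(a), n_clutter, a_inv)           # → 5 2 [4, 1, 0, 5]
```

This reproduces NZ = 5, Nc = 2, and a⁻¹ = [4 1 0 5]; track 3 has no associated measurement, i.e. a missed detection handled by the P(R < R0) term above.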
Just as BOCPD allows any black box probabilistic predictor to be used as a UPM, any black box model of measurement sequences can be used in (5). The estimation of association vectors for the 3D case becomes an optimization problem of the form:

(\hat{a}_{n-1}, \hat{a}_n) = \operatorname*{argmax}_{(a_{n-1}, a_n)} \log P(a_{n-1}, a_n \mid Z_{1:n}) = \operatorname*{argmax}_{(a_{n-1}, a_n)} \log p(Z_{n-1:n}, a_{n-1}, a_n \mid Z_{1:n-2}) ,   (6)

which is effectively optimizing (5) with respect to the assignment vectors. The optimization given in (6) can be cast as a multidimensional assignment (MDA) problem [2], which can be solved efficiently in the 2D case. Higher dimensional assignment problems, however, are NP-hard; approximate, yet typically very accurate, solvers must be used for real-time operation, which is usually required for tracking systems [20].

If a radar scan occurs at each time step and a target is not detected, we assume the SNR has not exceeded the threshold, implying 0 ≤ R < R0. This is a (left) censored measurement and is treated differently than a missing data point. Censoring is accounted for in Section 2.3.

2 Online Variational UPMs

We cover the four technical challenges for implementing non-exponential family UPMs in an efficient and online manner. We drop the index of the data point i when it is clear from context.

2.1 Variational Posterior for a Rice Distribution

The Rice distribution has the property that

x \sim \mathcal{N}(\nu, \sigma^2) , \quad y' \sim \mathcal{N}(0, \sigma^2) \implies R = \sqrt{x^2 + y'^2} \sim \text{Rice}(\nu, \sigma) .   (7)

For simplicity we perform inference using R², as opposed to R, and transform accordingly:

x \sim \mathcal{N}(\nu, \sigma^2) , \quad R^2 - x^2 \sim \text{Gamma}(\tfrac{1}{2}, \tfrac{\tau}{2}) , \quad \tau := 1/\sigma^2 \in \mathbb{R}_+
\implies p(R^2, x) = p(R^2 \mid x)\, p(x) = \text{Gamma}(R^2 - x^2 \mid \tfrac{1}{2}, \tfrac{\tau}{2})\, \mathcal{N}(x \mid \nu, \sigma^2) .   (8)

The complete likelihood (8) is the product of two exponential family models and is exponential family itself, parameterized with base measure h and partition factor g:

\eta = [\nu\tau, \, -\tau/2]^\top , \quad S = [x, \, R^2]^\top , \quad h(R^2, x) = (2\pi \sqrt{R^2 - x^2})^{-1} , \quad g(\nu, \tau) = \tau \exp(-\nu^2 \tau / 2) .

By inspection we see that the natural parameters η and sufficient statistics S are the same as a Gaussian with unknown mean and variance. Therefore, we apply the normal-gamma prior on (ν, τ) as it is the conjugate prior for the complete data likelihood. This allows us to apply the VB-EM algorithm. We use yi := R²ᵢ as the VB observation, not Ri as in (3). In (5), z·,·(end) is the RCS R.

VB M-Step  We derive the posterior updates to the parameters given expected sufficient statistics:

\bar{x} := \sum_{i=1}^n E[x_i]/n , \quad \mu_n = \frac{\lambda_0 \mu_0 + \sum_i E[x_i]}{\lambda_0 + n} , \quad \lambda_n = \lambda_0 + n , \quad \alpha_n = \alpha_0 + n ,   (9)

\beta_n = \beta_0 + \frac{1}{2} \sum_{i=1}^n (E[x_i] - \bar{x})^2 + \frac{1}{2} \frac{n \lambda_0}{\lambda_0 + n} (\bar{x} - \mu_0)^2 + \frac{1}{2} \sum_{i=1}^n \left( R_i^2 - E[x_i]^2 \right) .   (10)

This is the same as an observation from a Gaussian and a gamma that share an (inverse) scale τ.

VB E-Step  We then must find both expected sufficient statistics S̄. The expectation E[R²ᵢ|R²ᵢ] = R²ᵢ trivially, leaving E[xᵢ|R²ᵢ]. Recall that the joint on (x, y′) is a bivariate normal; if we constrain the radius to R, the angle ω will be distributed by a von Mises (VM) distribution. 
Therefore,

\omega := \arccos(x/R) \sim \text{VM}(0, \kappa) , \quad \kappa = R\, E[\nu\tau] \implies E[x] = R\, E[\cos \omega] = R\, I_1(\kappa)/I_0(\kappa) ,   (11)

where computing κ constitutes the VB E-step and we have used the trigonometric moment on ω [18]. This completes the computations required to do the VB updates on the Rice posterior.

Variational Lower Bound  For completeness, and to assess convergence, we derive the VB lower bound L(q). Using the standard formula [4] for L(q) = E_q[log p(y1:n, x1:n, θ)] + H[q] we get:

L(q) = \sum_{i=1}^n \Big( E[\log(\tau/2)] - \tfrac{1}{2} E[\tau] R_i^2 + (E[\nu\tau] - \kappa_i/R_i) E[x_i] - \tfrac{1}{2} E[\nu^2\tau] + \log I_0(\kappa_i) \Big) - \text{KL}(q \,\|\, p) ,   (12)

where p in the KL is the prior on (ν, τ), which is easy to compute as q and p are both normal-gamma. Equivalently, (12) can be optimized directly instead of using the VB-EM updates.

2.2 Online Variational Inference

In Section 2.1 we derived an efficient way to compute the variational posterior for a Rice distribution for a fixed data set. However, as is apparent from (1), we need online predictions from the UPM; we must be able to update the posterior one data point at a time. When the UPM is exponential family and we can compute the posterior exactly, we merely use the posterior from the previous step as the prior. However, since we are only computing a variational approximation to the posterior, using the previous posterior as the prior does not give the exact same answer as re-computing the posterior from batch. 
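The batch computation referred to here, i.e. the VB-EM cycle of Section 2.1, can be sketched as follows (our minimal illustration, not the authors' code; the prior settings and synthetic data are arbitrary, and SciPy's exponentially scaled Bessel functions `i0e`/`i1e` are used for the ratio I1/I0):

```python
import math
import random
from scipy.special import i0e, i1e  # exponentially scaled Bessel functions

def rice_vb(R, mu0=1.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    # Batch VB-EM for Rice(nu, sigma) with a normal-gamma prior on (nu, tau),
    # tau = 1/sigma^2, alternating the E-step (11) and M-step (9)-(10).
    n = len(R)
    Ex = list(R)  # initialize E[x_i]
    for _ in range(iters):
        # M-step (9)-(10): treat the E[x_i] as if they were observed.
        xbar = sum(Ex) / n
        mu_n = (lam0 * mu0 + sum(Ex)) / (lam0 + n)
        lam_n, a_n = lam0 + n, a0 + n
        b_n = (b0 + 0.5 * sum((e - xbar) ** 2 for e in Ex)
                  + 0.5 * n * lam0 / (lam0 + n) * (xbar - mu0) ** 2
                  + 0.5 * sum(r * r - e * e for r, e in zip(R, Ex)))
        # E-step (11): kappa_i = R_i E[nu tau]; E[x_i] = R_i I1(k)/I0(k).
        Enutau = mu_n * a_n / b_n  # E[nu*tau] under the normal-gamma posterior
        Ex = [r * i1e(r * Enutau) / i0e(r * Enutau) for r in R]
    return mu_n, a_n / b_n  # posterior means of nu and tau

random.seed(0)
nu, sigma = 5.0, 1.0
R = [math.hypot(random.gauss(nu, sigma), random.gauss(0, sigma))
     for _ in range(2000)]
nu_hat, tau_hat = rice_vb(R)
print(round(nu_hat, 2), round(tau_hat, 2))  # should land near nu = 5, tau = 1
```

Re-running such a loop from scratch at every new data point is the O(n)-per-update batch option discussed next; reusing the previous posterior as the prior is the cheap but less accurate alternative.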
This gives two obvious options: 1) recompute the posterior from batch at every update, at O(n) cost, or 2) use the previous posterior as the prior, at O(1) cost but reduced accuracy.

The difference between the options is encapsulated by looking at the expected sufficient statistics: S̄ = \sum_{i=1}^n E_{q(x_i|y_{1:n})}[S(x_i, y_i)]. Naive online updating uses old expected sufficient statistics, whose posterior effectively uses S̄ = \sum_{i=1}^n E_{q(x_i|y_{1:i})}[S(x_i, y_i)]. We get the best of both worlds if we adjust those estimates over time. We can in fact do this if we project the expected sufficient statistics into a "feature space" in terms of the expected natural parameters. For some function f,

q(x_i) = p(x_i \mid y_i, \eta = \bar{\eta}) \implies E_{q(x_i|y_{1:n})}[S(x_i, y_i)] = f(y_i, \bar{\eta}) .   (13)

If f is piecewise continuous then we can represent it with an inner product [8, Sec. 2.1.6]:

f(y_i, \bar{\eta}) = \phi(\bar{\eta})^\top \psi(y_i) \implies \bar{S} = \sum_{i=1}^n \phi(\bar{\eta})^\top \psi(y_i) = \phi(\bar{\eta})^\top \sum_{i=1}^n \psi(y_i) ,   (14)

where an infinite dimensional φ and ψ may be required for exact representation, but can be approximated by a finite inner product. In the Rice distribution case we use (11):

f(y_i, \bar{\eta}) = E[x_i] = R_i I'(R_i E[\nu\tau]) = R_i I'((R_i/\mu_0)\, \mu_0 E[\nu\tau]) , \quad I'(\cdot) := I_1(\cdot)/I_0(\cdot) ,   (15)

where recall that yi = R²ᵢ and η̄1 = E[ντ]. We can easily represent f with an inner product if we can represent I′ as an inner product: I′(uv) = φ(u)⊤ψ(v). We use unitless φi(u) = I′(ciu) with c1:G as a log-linear grid from 10⁻² to 10³ and G = 50. We use a lookup table for ψ(v) that was trained to match I′ using non-negative least squares, which left us with a sparse lookup table. 
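The construction above can be illustrated end to end. The sketch below is ours, not the paper's trained lookup table: it uses a smaller grid (G = 12 rather than 50) and a generic non-negative least-squares fit per value of v. The key property it demonstrates is that the per-data-point features ψ(yi) are accumulated once, while φ(η̄) can be re-evaluated as the expected natural parameters drift:

```python
import numpy as np
from scipy.special import i0e, i1e
from scipy.optimize import nnls

def Iratio(z):
    return i1e(z) / i0e(z)  # I'(z) = I1(z)/I0(z); scaling factors cancel

# Basis phi_g(u) = I'(c_g u) on a log-linear grid (G = 12 here for brevity).
c = np.logspace(-2, 3, 12)
def phi(u):
    return Iratio(c * u)

# For each v, fit nonnegative weights psi(v) so that, over a training grid of
# u values, I'(u v) is approximated by phi(u) . psi(v): the lookup-table step.
u_train = np.logspace(-2, 2, 200)
A = np.stack([phi(u) for u in u_train])  # 200 x G design matrix
def psi(v):
    w, _ = nnls(A, Iratio(u_train * v))
    return w

# Online use: accumulate sum_i psi(y_i) once, then re-evaluate the expected
# sufficient statistics for any later eta_bar via phi(eta_bar).
vs = [0.5, 2.0, 7.0]          # stand-ins for the R_i / mu_0 factors
S_psi = sum(psi(v) for v in vs)
for u in [0.1, 1.0, 10.0]:    # stand-ins for mu_0 * E[nu tau]
    exact = sum(Iratio(u * v) for v in vs)
    approx = float(phi(u) @ S_psi)
    print(u, round(exact, 3), round(approx, 3))
```

Because S̄ ≈ φ(η̄)⊤ Σᵢ ψ(yᵢ), the stored sum lets old expected sufficient statistics be refreshed under the current η̄ without revisiting the data.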
Online updating for VB posteriors was also developed in [24; 13]. These methods involve introducing forgetting factors to discount the contributions from old data points that might be detrimental to accuracy. Since the VB predictions are "embedded" in a change point method, they are automatically phased out if the posterior predictions become inaccurate, making forgetting factors unnecessary.

2.3 Censored Data

As mentioned in Section 1.3, we must handle censored RCS observations during a missed detection. In the VB-EM framework we merely have to compute the expected sufficient statistics given the censored measurement: E[S|R < R0]. The expected sufficient statistic from (11) is now:

E[x \mid R < R_0] = \int_0^{R_0} E[x \mid R]\, p(R)\, dR \,\Big/\, \text{RiceCDF}(R_0 \mid \nu, \tau) = \nu \big(1 - Q_2(\tfrac{\nu}{\sigma}, \tfrac{R_0}{\sigma})\big) \big/ \big(1 - Q_1(\tfrac{\nu}{\sigma}, \tfrac{R_0}{\sigma})\big) ,

where QM is the Marcum Q function [17] of order M. Similar updates for E[S|R < R0] are possible for exponential or gamma UPMs, but are not shown as they are relatively easy to derive.

2.4 Variational Run Length Posteriors: Predictive Log Likelihoods

Both updating the BOCPD run length posterior (1) and finding the marginal predictive log likelihood of the next point (2) require calculating the UPM's posterior predictive log likelihood log p(yn+1|rn, y(r)). The marginal posterior predictive from (2) is used in data association (6) and benchmarking BOCPD against other methods. However, the exact posterior predictive distribution obtained by integrating the Rice likelihood against the VB posterior is difficult to compute. We can break the BOCPD update (1) into a time and measurement update. 
The measurement update corresponds to a Bayesian model comparison (BMC) calculation with prior p(rn|y1:n):

p(r_n \mid y_{1:n+1}) \propto p(y_{n+1} \mid r_n, y^{(r)})\, p(r_n \mid y_{1:n}) .   (16)

Using the BMC results in Bishop [4, Sec. 10.1.4] we find a variational posterior on the run length by using the variational lower bound for each run length Li(q) ≤ log p(yn+1|rn = i, y(r)), calculated using (12), as a proxy for the exact UPM posterior predictive in (16). This gives the exact VB posterior if the approximating family Q is of the form:

q(r_n, \theta, x) = q_{\text{UPM}}(\theta, x \mid r_n)\, q(r_n) \implies q(r_n = i) = \exp(L_i(q))\, p(r_n = i \mid y_{1:n}) / \exp(L(q)) ,   (17)

where qUPM contains whatever constraints we used to compute Li(q). The normalizer on q(rn) serves as a joint VB lower bound: L(q) = \log \sum_i \exp(L_i(q))\, p(r_n = i \mid y_{1:n}) \leq \log p(y_{n+1} \mid y_{1:n}). Note that the conditional factorization is different than the typical independence constraint on q. Furthermore, we derive the estimation of the assignment vectors a in (6) as a VB routine. We use a similar conditional constraint on the latent BOCPD variables given the assignment and constrain the assignment posterior to be a point mass. In the 2D assignment case, for example,

q(a_n, X_{1:N_T}) = q(X_{1:N_T} \mid a_n)\, q(a_n) = q(X_{1:N_T} \mid a_n)\, \mathbb{I}\{a_n = \hat{a}_n\} ,   (18)

(a) Online Updating  (b) Exponential RCS  (c) Rice RCS

Figure 2: Left: KL from naive updating (△), Sato's method [24] (□), and improved online VB (◦) to the batch VB posterior vs. sample size n; using a standard normal-gamma prior. Each curve represents a true ν in the generating Rice distribution: ν = 3.16 (red), ν = 10.0 (green), ν = 31.6 (blue) and τ = 1. Middle: The RMSE (dB scale) of the estimate on the mean RCS distribution E[Rn] is plotted for an exponential RCS model. The curves are BOCPD (blue), IMM (black), identity (magenta), α-filter (green), and median filter (red). 
Right: Same as the middle but for the Rice RCS case. The dashed lines are 95% confidence intervals.

where each track's Xi represents all the latent variables used to compute the variational lower bound on log p(zj,n|an(j) = i). In the BOCPD case, Xi := {rn, x, θ}. The resulting VB fixed point equations find the posterior on the latent variables Xi by taking ân as the true assignment and solving the VB problem of (17); the assignment ân is found by using (6) and taking the joint BOCPD lower bound L(q) as a proxy for the BOCPD predictive log likelihood component of log pi in (5).

3 Results

3.1 Improved Online Solution

We first demonstrate the accuracy of the online VB approximation (Section 2.2) on a Rice estimation example; here, we only test the VB posterior as no change point detection is applied. Figure 2(a) compares naive online updating, Sato's method [24], and our improved online updating in KL(online‖batch) of the posteriors for three different true parameters ν as sample size n increases. The performance curves are the KL divergence between these online approximations to the posterior and the batch VB solution (i.e. restarting VB from "scratch" at every new data point) vs. sample size. The error for our method stays around a modest 10⁻² nats while naive updating incurs large errors of 1 to 50 nats [19, Ch. 4]. Sato's method tends to settle in around a 1 nat approximation error. The recommended annealing schedule, i.e. forgetting factors, in [24] performed worse than naive updating. We did a grid search over annealing exponents and show the results for the best performing schedule of n⁻⁰·⁵². By contrast, our method does not require the tuning of an annealing schedule.

3.2 RCS Estimation Benchmarking

We now compare BOCPD with other methods for RCS estimation. 
We use the same experimental example as Slocumb and Klusman III [25], which uses an augmented interacting multiple model (IMM) based method for estimating the RCS; we also compare against the same α-filter and median filter used in [25]. As a reference point, we also consider the "identity filter," which is merely an unbiased filter that uses only yn to estimate the mean RCS E[Rn] at time step n. We extend this example to look at Rice RCS in addition to the exponential RCS case. The bias correction constants in the IMM were adjusted for the Rice distribution case as per [25, Sec. 3.4].

The results on the exponential distributions used in [25] and the Rice distribution case are shown in Figures 2(b) and 2(c). The IMM used in [25] was hard-coded to expect jumps in the SNR of multiples of ±10 dB, which is exactly what is presented in the example (a sequence of 20, 10, 30, and 10 dB). In [25] the authors mention that the IMM reaches an RMSE "floor" at 2 dB, yet BOCPD continues to drop as low as 0.56 dB. The RMSE from BOCPD does not spike nearly as high as the other methods upon a change in E[Rn]. The α-filter and median filter appear worse than both the IMM and BOCPD. The RMSE and confidence intervals are calculated from 5000 runs of the experiment.

(a) SIAP Metrics  (b) Heathrow (LHR)

Figure 3: Left: Average relative improvements (%) for SIAP metrics: position accuracy (red △), velocity accuracy (green □), and spurious tracks (blue ◦) across difficulty levels. Right: LHR: true trajectories shown as black lines (−), estimates using a BOCPD RCS model for association shown as blue stars (∗), and the standard tracker as red circles (◦). 
The standard tracker has spurious tracks over east London and near Ipswich.

Background map data: Google Earth (TerraMetrics, Data SIO, NOAA, U.S. Navy, NGA, GEBCO, Europa Technologies)

3.3 Flightradar24 Tracking Problem

Finally, we used real flight trajectories from flightradar24 and plugged them into our 3D tracking algorithm. We compare tracking performance between using our BOCPD model and the relatively standard constant-probability-of-detection (no RCS) setup [2, Sec. 3.5]. We use the single integrated air picture (SIAP) metrics [6] to demonstrate the improved performance of the tracking. The SIAP metrics are a standard set of metrics used to compare tracking systems. We broke the data into 30 regions during a one-hour period (in Sept. 2012), sampled every 5 s, each within a 200 km by 200 km area centered around the world's 30 busiest airports [22]. Commercial airport traffic is typically very orderly and does not allow aircraft to fly close to one another or cross paths. Feature-aided tracking is most necessary in scenarios with a more chaotic air situation. Therefore, we took random subsets of 10 flight paths and randomly shifted their start times to allow for scenarios of greater interest.

The resulting SIAP metric improvements are shown in Figure 3(a), where we look at performance by a difficulty metric: the number of times in a scenario any two aircraft come within ∼400 m of each other. The biggest improvements are seen for difficulties above three, where positional accuracy increases by 30%. Significant improvements are also seen for velocity accuracy (11%) and the frequency of spurious tracks (6%). Significant performance gains are seen at all difficulty levels considered. The larger improvements at level three over level five are possibly due to some level five scenarios that are not resolvable simply through more sophisticated models.
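To make the difficulty metric concrete, the following sketch counts the time steps at which any pair of aircraft in a scenario come within the ∼400 m threshold. This is our own illustrative reading of the metric; the function name, data layout, and per-time-step counting convention are assumptions, not the paper's implementation.

```python
# Illustrative sketch of the scenario "difficulty" metric: the number of
# time steps at which any two aircraft are within ~400 m of each other.
# Data layout (assumed): one list of (x, y) positions in metres per
# aircraft, all sampled at the same time steps.
import math


def difficulty(tracks, threshold_m=400.0):
    """Count time steps where some pair of aircraft is closer than threshold_m."""
    n_steps = len(tracks[0])
    count = 0
    for t in range(n_steps):
        close = False
        for i in range(len(tracks)):
            for j in range(i + 1, len(tracks)):
                xi, yi = tracks[i][t]
                xj, yj = tracks[j][t]
                if math.hypot(xi - xj, yi - yj) < threshold_m:
                    close = True
        if close:
            count += 1
    return count


# Two aircraft crossing paths: only at the middle step are they within 400 m.
a = [(0.0, 0.0), (500.0, 0.0), (1000.0, 0.0)]
b = [(0.0, 1000.0), (500.0, 300.0), (1000.0, 1000.0)]
print(difficulty([a, b]))  # → 1
```

Under this reading, a scenario's difficulty level grows with the number of close encounters, matching the axis of Figure 3(a).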
We demonstrate how our RCS methods prevent the creation of spurious tracks around London Heathrow in Figure 3(b).

4 Conclusions

We have demonstrated that it is possible to use sophisticated and recent developments in machine learning, such as BOCPD together with the modern inference method of VB, to produce demonstrable improvements in the much more mature field of radar tracking. We first closed a "hole" in the literature in Section 2.1 by deriving variational inference on the parameters of a Rice distribution, with its inherent applicability to radar tracking. In Sections 2.2 and 2.4 we showed that it is possible to use these variational UPMs for non-exponential family models in BOCPD without sacrificing its modular or online nature. The improvements in online VB extend to UPMs besides the Rice distribution and, more generally, beyond change point detection. We can use the variational lower bound from the UPM to obtain a principled variational approximation to the run length posterior. Furthermore, we cast the estimation of the assignment vectors themselves as a VB problem, which is in large contrast to the tracking literature. More algorithms from the tracking literature can possibly be cast in various machine learning frameworks, such as VB, and improved upon from there.

References

[1] Adams, R. P. and MacKay, D. J. (2007). Bayesian online changepoint detection. Technical report, University of Cambridge, Cambridge, UK.

[2] Bar-Shalom, Y., Willett, P., and Tian, X. (2011). Tracking and Data Fusion: A Handbook of Algorithms. YBS Publishing.

[3] Beal, M. and Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. In Bayesian Statistics, volume 7, pages 453–464.

[4] Bishop, C. M. (2007).
Pattern Recognition and Machine Learning. Springer.

[5] Braun, J. V., Braun, R., and Müller, H.-G. (2000). Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika, 87(2):301–314.

[6] Byrd, E. (2003). Single integrated air picture (SIAP) attributes version 2.0. Technical Report 2003-029, DTIC.

[7] Chen, J. and Gupta, A. (1997). Testing and locating variance changepoints with application to stock prices. Journal of the American Statistical Association, 92(438):739–747.

[8] Courant, R. and Hilbert, D. (1953). Methods of Mathematical Physics. Interscience.

[9] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

[10] Ehrman, L. M. and Blair, W. D. (2006). Comparison of methods for using target amplitude to improve measurement-to-track association in multi-target tracking. In Information Fusion, 2006 9th International Conference on, pages 1–8. IEEE.

[11] Fearnhead, P. and Liu, Z. (2007). Online inference for multiple changepoint problems. Journal of the Royal Statistical Society, Series B, 69(4):589–605.

[12] Hipp, C. (1974). Sufficient statistics and exponential families. The Annals of Statistics, 2(6):1283–1292.

[13] Honkela, A. and Valpola, H. (2003). On-line variational Bayesian learning. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation, pages 803–808.

[14] Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME — Journal of Basic Engineering, 82(Series D):35–45.

[15] Lauwers, L., Barbé, K., Van Moer, W., and Pintelon, R. (2009).
Estimating the parameters of a Rice distribution: A Bayesian approach. In Instrumentation and Measurement Technology Conference, 2009. I2MTC'09. IEEE, pages 114–117. IEEE.

[16] Mahler, R. (2003). Multi-target Bayes filtering via first-order multi-target moments. IEEE Trans. AES, 39(4):1152–1178.

[17] Marcum, J. (1950). Table of Q functions. U.S. Air Force RAND Research Memorandum M-339, Rand Corporation, Santa Monica, CA.

[18] Mardia, K. V. and Jupp, P. E. (2000). Directional Statistics. John Wiley & Sons, New York.

[19] Murray, I. (2007). Advances in Markov chain Monte Carlo methods. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, London, UK.

[20] Poore, A. P., Rijavec, N., Barker, T. N., and Munger, M. L. (1993). Data association problems posed as multidimensional assignment problems: algorithm development. In Optical Engineering and Photonics in Aerospace Sensing, pages 172–182. International Society for Optics and Photonics.

[21] Richards, M. A., Scheer, J., and Holm, W. A., editors (2010). Principles of Modern Radar: Basic Principles. SciTech Pub.

[22] Rogers, S. (2012). The world's top 100 airports: listed, ranked and mapped. The Guardian.

[23] Saatçi, Y., Turner, R., and Rasmussen, C. E. (2010). Gaussian process change point models. In 27th International Conference on Machine Learning, pages 927–934, Haifa, Israel. Omnipress.

[24] Sato, M.-A. (2001). Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681.

[25] Slocumb, B. J. and Klusman III, M. E. (2005). A multiple model SNR/RCS likelihood ratio score for radar-based feature-aided tracking. In Optics & Photonics 2005, pages 59131N–59131N. International Society for Optics and Photonics.

[26] Swerling, P. (1954). Probability of detection for fluctuating targets. Technical Report RM-1217, Rand Corporation.

[27] Turner, R.
(2011). Gaussian Processes for State Space Models and Change Point Detection. PhD thesis, University of Cambridge, Cambridge, UK.