{"title": "Doubly Robust Bayesian Inference for Non-Stationary Streaming Data with $\\beta$-Divergences", "book": "Advances in Neural Information Processing Systems", "page_first": 64, "page_last": 75, "abstract": "We present the very first robust Bayesian Online Changepoint Detection algorithm through General Bayesian Inference (GBI) with $\\beta$-divergences. The resulting inference procedure is doubly robust for both the predictive and the changepoint (CP) posterior, with linear time and constant space complexity. We provide a construction for exponential models and demonstrate it on the Bayesian Linear Regression model. In so doing, we make two additional contributions: Firstly, we make GBI scalable using Structural Variational approximations that are exact as $\\beta \\to 0$. Secondly, we give a principled way of choosing the divergence parameter $\\beta$ by minimizing expected predictive loss on-line. Reducing False Discovery Rates of CPs from up to 99\\% to 0\\% on real world data, this offers the state of the art.", "full_text": "Doubly Robust Bayesian Inference for Non-Stationary Streaming Data with \u03b2-Divergences\n\nJeremias Knoblauch\nThe Alan Turing Institute\nDepartment of Statistics\nUniversity of Warwick\nCoventry, CV4 7AL\n\nj.knoblauch@warwick.ac.uk\n\nJack Jewson\n\nDepartment of Statistics\nUniversity of Warwick\nCoventry, CV4 7AL\n\nj.e.jewson@warwick.ac.uk\n\nTheodoros Damoulas\nThe Alan Turing Institute\n\nDepartment of Computer Science & Department of Statistics\n\nUniversity of Warwick\nCoventry, CV4 7AL\n\nt.damoulas@warwick.ac.uk\n\nAbstract\n\nWe present the first robust Bayesian Online Changepoint Detection algorithm\nthrough General Bayesian Inference (GBI) with \u03b2-divergences. The resulting\ninference procedure is doubly robust for both the parameter and the changepoint\n(CP) posterior, with linear time and constant space complexity. 
We provide a\nconstruction for exponential models and demonstrate it on the Bayesian Linear\nRegression model. In so doing, we make two additional contributions: Firstly, we\nmake GBI scalable using Structural Variational approximations that are exact as\n\u03b2 \u2192 0. Secondly, we give a principled way of choosing the divergence parameter\n\u03b2 by minimizing expected predictive loss on-line. Reducing False Discovery Rates\nof CPs from over 90% to 0% on real world data, this offers the state of the art.\n\n1 Introduction\n\nModeling non-stationary time series with changepoints (CPs) is popular [23, 50, 33] and important\nin a wide variety of research fields, including genetics [8, 16, 42], finance [27], oceanography [24],\nbrain imaging and cognition [13, 20], cybersecurity [37] and robotics [2, 26]. For streaming data,\na particularly important subclass are Bayesian On-line Changepoint Detection (BOCPD) methods\nthat can process data sequentially [1, 11, 43, 47, 46, 41, 8, 34, 44, 40, 25] while providing full\nprobabilistic uncertainty quantification. These algorithms declare CPs if the posterior predictive\ncomputed from y1:t at time t has low density for the value of the observation yt+1 at time t + 1.\nNaturally, this leads to a high false CP discovery rate in the presence of outliers and as they run\non-line, pre-processing is not an option. In this work, we provide the first robust on-line CP detection\nmethod that is applicable to multivariate data, works with a class of scalable models and quantifies\nmodel, CP and parameter uncertainty in a principled Bayesian fashion.\nStandard Bayesian inference minimizes the Kullback-Leibler divergence (KLD) between the fitted\nmodel and the Data Generating Mechanism (DGM), but is not robust under outliers or model\nmisspecification due to its strictly increasing influence function. 
We remedy this by instead minimizing\nthe \u03b2-divergence (\u03b2-D) whose influence function has a unique maximum, allowing us to deal with\noutliers effectively.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: A: Influence of yt on inference as function of distance to the posterior expectation\nin Standard Deviations for \u03b2-divergences with different \u03b2s. B: Five jointly modeled Simulated\nAutoregressions (ARs) with true CPs at t = 200, 400; bottom-most AR injected with t4-noise.\nMaximum A Posteriori CPs of robust (standard) BOCPD shown as solid (dashed) vertical lines.\n\nFig. 1 A illustrates this: Under the \u03b2-D, the influence of observations first\nincreases as they move away from the posterior mean, mimicking the KLD. However, once they\nmove far enough, their influence decreases again. This can be interpreted to mean that they are\n(increasingly) treated as outliers. As \u03b2 increases, observations are registered as outliers closer to the\nposterior mean. Conversely, as \u03b2 \u2192 0, one recovers the KLD which cannot treat any observation as an\noutlier. In addressing misspecification and outliers this way, our approach builds on the principles of\nGeneral Bayesian Inference (GBI) [see 6, 21] and robust divergences [e.g. 4, 15]. This paper presents\nthree contributions in separate domains that are also illustrated in Figs. 1 and 3:\n\n(1) Robust BOCPD: We construct the very first robust BOCPD inference. The procedure is\napplicable to a wide class of (multivariate) models and is demonstrated on Bayesian Linear\nRegression (BLR). Unlike standard BOCPD, it discerns outliers and CPs, see Fig. 1 B.\n\n(2) Scalable GBI: Due to intractable posteriors, GBI has received little attention in machine\nlearning so far. 
We remedy this with a Structural Variational approximation which preserves\nparameter dependence and is exact as \u03b2 \u2192 0, providing a near-perfect fit, see Fig. 3.\n\n(3) Choosing \u03b2: While Fig. 1 A shows that \u03b2 regulates the degree of robustness [see also\n21, 15], it is unclear how to set its magnitude. For the first time, we provide a principled way\nof initializing \u03b2. Further, we show how to refine it on-line by minimizing predictive losses.\n\nThe remainder of the paper is structured as follows: In Section 2, we summarize standard BOCPD\nand show how to extend it to robust inference using the \u03b2-D. We quantify the degree of robustness\nand show that inference under the \u03b2-D can be designed so that a single outlier never results in false\ndeclaration of a CP, which is impossible under the KLD. Section 3 motivates efficient Structural\nVariational Inference (SVI) with the \u03b2-D posterior. Within BOCPD, we propose to scale SVI using\nvariance-reduced Stochastic Gradient Descent. Next, Section 4 expands on how \u03b2 can be initialized\nbefore the algorithm is run and then optimized on-line during execution time. Lastly, Section 5\nshowcases the substantial gains in performance of robust BOCPD when compared to its standard\nversion on real world data in terms of both predictive error and CP detection.\n\n2 Using Bayesian On-line Changepoint Detection with \u03b2-Divergences\n\nBOCPD is based on the Product Partition Model [3] and was introduced independently in Adams and\nMacKay [1] and Fearnhead and Liu [11]. Recently, both formulations have been unified in Knoblauch\nand Damoulas [25]. 
The underlying algorithm has extensions ranging from Gaussian Processes [41]\nand on-line hyperparameter optimization [8] to non-exponential families [44, 34].\nTo formulate BOCPD probabilistically, define the run-length rt as the number of observations at time\nt since the most recent CP and mt as the best model in the set M for the observations since that\nCP. Then, given a real-valued multivariate process $\\{y_t\\}_{t=1}^{\\infty}$ of dimension d, a model universe M, a\nrun-length prior h defined over N0 and a model prior q over M, the BOCPD model is\n\n$$r_t \\mid r_{t-1} \\sim H(r_t, r_{t-1}), \\qquad m_t \\mid m_{t-1}, r_t \\sim q(m_t \\mid m_{t-1}, r_t), \\tag{1a}$$\n$$\\theta_m \\mid m_t \\sim \\pi_{m_t}(\\theta_{m_t}), \\qquad y_t \\mid m_t, \\theta_{m_t} \\sim f_{m_t}(y_t \\mid \\theta_{m_t}), \\tag{1b}$$\n\nwhere $q(m_t \\mid m_{t-1}, r_t) = \\mathbb{1}_{m_{t-1}}(m_t)$ for rt > 0 and q(mt) otherwise, and where H is the conditional\nrun-length prior so that H(0, r) = h(r+1), H(r+1, r) = 1\u2212h(r+1) for any r \u2208 N0 and H(r, r\u2032) =\n0 otherwise. For example, Bayesian Linear Regression (BLR) with the d \u00d7 p regressor matrix Xt\nand prior covariance \u03a30 is given by \u03b8m = (\u03c3\u00b2, \u00b5), fm(yt|\u03b8m) = Nd(yt; Xt\u00b5, Id) and \u03c0m(\u03b8m) =\nNd(\u00b5; \u00b50, \u03c3\u00b2\u03a30)IG(\u03c3\u00b2; a0, b0). If the computations of the parameter posterior \u03c0m(\u03b8m|y1:t, rt) and\nthe posterior predictive $f_m(y_t \\mid y_{1:(t-1)}, r_t) = \\int_{\\Theta_m} f_m(y_t \\mid \\theta_m) \\pi_m(\\theta_m \\mid y_{1:(t-1)}, r_t) \\, d\\theta_m$ are efficient\nfor all models m \u2208 M, then so is the recursive computation given by\n\n$$p(y_1, r_1 = 0, m_1) = q(m_1) \\cdot \\int_{\\Theta_{m_1}} f_{m_1}(y_1 \\mid \\theta_{m_1}) \\pi_{m_1}(\\theta_{m_1}) \\, d\\theta_{m_1} = q(m_1) \\cdot f_{m_1}(y_1 \\mid y_0), \\tag{2a}$$\n$$p(y_{1:t}, r_t, m_t) = \\sum_{m_{t-1}, r_{t-1}} \\left\\{ f_{m_t}(y_t \\mid \\mathcal{F}_{t-1}) \\, q(m_t \\mid \\mathcal{F}_{t-1}, m_{t-1}) \\, H(r_t, r_{t-1}) \\, p(y_{1:(t-1)}, r_{t-1}, m_{t-1}) \\right\\}, \\tag{2b}$$\n\nwhere $\\mathcal{F}_{t-1} = \\{y_{1:(t-1)}, r_{t-1}\\}$ and p(y1:t, rt, mt) is the joint density of y1:t, mt and rt.\nThe run-length and model posteriors are then available exactly at time t, as\n$p(r_t, m_t \\mid y_{1:t}) = p(y_{1:t}, r_t, m_t) / \\sum_{m_t, r_t} p(y_{1:t}, r_t, m_t)$. For a full derivation and the resulting inference see [25].\n\n2.1 General Bayesian Inference (GBI) with \u03b2-Divergences (\u03b2-D)\n\nStandard Bayesian inference minimizes the KLD between the Data Generating Mechanism (DGM)\nand its probabilistic model (see Section 2.1 of [6] for a clear illustration). In the M-closed world\nwhere one assumes that the DGM and model coincide, the KLD is the most efficient way of updating\nposterior beliefs. However, this is no longer the case in the M-open world [5] where they match\nonly approximately [21], e.g. in the presence of outliers. GBI [6, 21] generalizes standard Bayesian\nupdating based on the KLD to a family of divergences. In particular, it uses the relationship between\nlosses $\\ell$ and divergences D to deduce for D a corresponding loss $\\ell^{D}$. 
It can then be shown that for\nmodel m, the posterior update optimal for D yields the distribution\n\n$$\\pi^{D}_m(\\theta_m \\mid y_{(t-r_t):t}) \\propto \\pi_m(\\theta_m) \\exp\\left\\{ -\\sum_{i=t-r_t}^{t} \\ell^{D}(\\theta_m \\mid y_i) \\right\\}. \\tag{3}$$\n\nFor parameter inference with the KLD and \u03b2-D, these losses are the log score and the Tsallis score:\n\n$$\\ell^{\\mathrm{KLD}}(\\theta_m \\mid y_t) = -\\log\\left( f_m(y_t \\mid \\theta_m) \\right), \\tag{4}$$\n$$\\ell^{\\beta}(\\theta_m \\mid y_t) = -\\left( \\frac{1}{\\beta_p} f_m(y_t \\mid \\theta_m)^{\\beta_p} - \\frac{1}{1 + \\beta_p} \\int_{\\mathcal{Y}} f_m(z \\mid \\theta_m)^{1 + \\beta_p} \\, dz \\right). \\tag{5}$$\n\nEq. (5) shows why the \u03b2-D excels at robust inference: Similar to tempering, $\\ell^{\\beta}$ exponentially\ndownweights the density, attaching less influence to observations in the tails of the model. This\nphenomenon is depicted with influence functions I(yt) in Figure 1 A. I(yt) is a divergence between\nthe posterior with and without an observation yt [28].\nGBI with the \u03b2-D yields robust inference without the need to specify a heavy-tailed or otherwise\nrobustified model. Hence, one estimates the same model parameters as in standard Bayesian inference\nwhile down-weighting the influence of observations that are overly inconsistent with the model.\nAccordingly, GBI provides robust inference for a much wider class of models and situations than the\nones illustrated here. Though other divergences such as \u03b1-Divergences [e.g. 19] also accommodate\nrobust inference, we restrict ourselves to the \u03b2-D. We do this because unlike other divergences, it\ndoes not require estimation of the DGM\u2019s density. Density estimation increases estimation error,\nis computationally cumbersome and works poorly for small run-lengths (i.e. sample sizes). 
Note\nthat versions of GBI have been proposed before [14, 32, 38, 10], but have framed the procedure as\nan alternative to Variational Bayes instead.\nApart from the computational gains of Section 3.1, we tackle robust inference via the \u03b2-D rather\nthan via Student\u2019s t errors for three reasons: Firstly, robust run-length posteriors need robustness\nin ratios rather than tails (see Section 2.3 and the simulation results for Student\u2019s t errors in the\nAppendix). Secondly, Student\u2019s t errors model outliers as part of the DGM, which compromises\nthe inference target: Consider a BLR with error et = \u03b5t + wt\u03bdt, where wt \u223c Ber(p) for p = 0.01,\n\u03b5t \u223c N (0, \u03c3\u00b2) with outliers \u03bdt \u223c t1(0, \u03b3). Appropriate choices of \u03b2p give most influence to\nthe (1 \u2212 p) \u00b7 100% = 99% of typical observations one can explain well with the BLR model. In\ncontrast, modeling et as Student\u2019s t under the KLD lets \u03bdt dominate parameter inference and lets\n1% of observations inflate the predictive variance substantially. Thirdly, using Student\u2019s t errors is a\ntechnique only applicable to symmetric, continuous models. In contrast, GBI with the \u03b2-D is valid\nfor any setting, e.g. for asymmetric errors as well as point and count processes.\n\nFigure 2: A: Lower bound on the odds of Thm. 1 for priors used for Figure 1 B and h(r) = 1/100.\nB: $\\hat{k}$ for different choices of \u03b2p and output (input) dimensions d (2d) in an autoregressive BLR.\n\n2.2 Robust BOCPD\n\nThe literature on robust on-line CP detection so far is sparse and covers limited settings without\nBayesian uncertainty quantification [e.g. 36, 7, 12]. For example, the method in Fearnhead and\nRigaill [12] only produces point estimates and is limited to fitting a piecewise constant function to\nunivariate data. 
In contrast, BOCPD can be applied to multivariate data and a set of models M while\nquantifying uncertainty about these models, their parameters and potential CPs, but is not robust.\nNoting that for standard BOCPD the posterior expectation is given by\n\n$$\\mathbb{E}\\left( y_t \\mid y_{1:(t-1)} \\right) = \\sum_{r_t, m_t} \\mathbb{E}\\left( y_t \\mid y_{1:(t-1)}, r_{t-1}, m_{t-1} \\right) p(r_{t-1}, m_{t-1} \\mid y_{1:(t-1)}), \\tag{6}$$\n\nthe key observation is that prediction is driven by two probability distributions: The run-length and\nmodel posterior p(rt, mt|y1:t) and parameter posterior distributions \u03c0m(\u03b8m|y1:t). Thus, we make\nBOCPD robust by using \u03b2-D posteriors $p^{\\beta_{\\mathrm{rlm}}}(r_t, m_t \\mid y_{1:t})$, $\\pi^{\\beta_p}_m(\\theta_m \\mid y_{1:t})$ for \u03b2 = (\u03b2rlm, \u03b2p) > 0 (see footnote 1).\n\u03b2rlm prevents abrupt changes in $p^{\\beta_{\\mathrm{rlm}}}(r_t, m_t \\mid y_{1:t})$ caused by a small number of observations, see\nsection 2.3. This form of robustness is easy to implement and retains the closed forms of BOCPD:\nIn Eqs. (2a) and (2b), one simply replaces fmt(yt|y0) and fmt(yt|Ft\u22121) by their \u03b2-D-counterparts\n$\\exp\\{-\\ell^{\\beta_{\\mathrm{rlm}}}(\\theta_{m_t} \\mid y_t)\\}$, where\n\n$$\\ell^{\\beta_{\\mathrm{rlm}}}(\\theta_{m_t} \\mid y_t) = -\\left( \\frac{1}{\\beta_{\\mathrm{rlm}}} f_m(y_t \\mid \\mathcal{F}_{t-1})^{\\beta_{\\mathrm{rlm}}} - \\frac{1}{1 + \\beta_{\\mathrm{rlm}}} \\int_{\\mathcal{Y}} f_m(z \\mid \\mathcal{F}_{t-1})^{1 + \\beta_{\\mathrm{rlm}}} \\, dz \\right). \\tag{7}$$\n\nWhile the posterior $p^{\\beta_{\\mathrm{rlm}}}(r_t, m_t \\mid y_{1:t})$ is only available up to a constant, it is discrete and thus easy\nto normalize. Complementing this, \u03b2p regulates the robustness of $\\pi^{\\beta_p}_m(\\theta \\mid y_{1:t})$ by preventing it\nfrom being dominated by tail events. Section 3.1 overcomes the intractability of $\\pi^{\\beta_p}_m(\\theta \\mid y_{1:t})$ using\nStructural Variational Inference (SVI) that recovers the approximated distribution exactly as \u03b2p \u2192 0.\n\n2.3 Quantifying robustness\n\nThe algorithm of Fearnhead and Rigaill [12] is robust because hyperparameters enforce that a single\noutlier is insufficient for declaring a CP. 
Analogously, we investigate conditions under which a single\n(outlying) observation yt+1 is able to force a CP. An intuitive way of achieving this is by studying\nthe odds of rt+1 \u2208 {0, r + 1} conditional on rt = r:\n\n$$\\frac{p(r_{t+1} = r + 1 \\mid y_{1:t+1}, r_t = r, m_t)}{p(r_{t+1} = 0 \\mid y_{1:t+1}, r_t = r, m_t)} = \\frac{p(y_{1:t}, r_t = r, m_t) \\cdot (1 - H(r_{t+1}, r_t)) \\, f^{D}_{m_t}(y_{t+1} \\mid \\mathcal{F}_t)}{p(y_{1:t}, r_t = r, m_t) \\cdot H(r_{t+1}, r_t) \\, f^{D}_{m_t}(y_{t+1} \\mid y_0)} = \\frac{(1 - H(r_{t+1}, r_t)) \\, f^{D}_{m_t}(y_{t+1} \\mid \\mathcal{F}_t)}{H(r_{t+1}, r_t) \\, f^{D}_{m_t}(y_{t+1} \\mid y_0)}. \\tag{8}$$\n\nFootnote 1: In fact, $\\beta_p = \\beta_p^m$, i.e. the robustness is model-specific, but this is suppressed for readability.\n\nHere, $f^{D}_{m_t}$ denotes the negative exponential of the score under divergence D. In particular,\n$f^{\\mathrm{KLD}}_{m_t}(y_{t+1} \\mid \\mathcal{F}_t) = f_{m_t}(y_{t+1} \\mid \\mathcal{F}_t)$ and $f^{\\beta_{\\mathrm{rlm}}}_{m_t}(y_{t+1} \\mid \\mathcal{F}_t) = \\exp\\{-\\ell^{\\beta_{\\mathrm{rlm}}}(\\theta_m \\mid y_t)\\}$ as in Eq. (7). Taking\na closer look at Eq. (8), if yt+1 is an outlier with low density under $f^{D}_{m_t}(y_{t+1} \\mid \\mathcal{F}_t)$, the\nodds will move in favor of a CP provided that the prior is sufficiently uninformative to make\n$f^{D}_{m_t}(y_{t+1} \\mid y_0) > f^{D}_{m_t}(y_{t+1} \\mid \\mathcal{F}_t)$. In fact, even very small differences have a substantial impact on\nthe odds. This is why using the Student\u2019s t error for the BLR model with standard Bayes will not\nprovide robust run-length posteriors: While an outlying observation yt+1 will have greater density\n$f^{\\mathrm{KLD}}_{m_t}(y_{t+1} \\mid \\mathcal{F}_t)$ under a Student\u2019s t error model than under a normal error model, $f^{\\mathrm{KLD}}_{m_t}(y_{t+1} \\mid y_0)$ (the\ndensity under the prior) will also be larger under the Student\u2019s t error model. As a result, changing\nthe tails of the model only has a very limited effect on the ratio in Eq. (8). In fact, the perhaps\nunintuitive consequence is that Student\u2019s t error models will yield CP inference that very closely\nresembles that of the corresponding normal model. 
A range of numerical examples in the Appendix\nillustrate this surprising fact. In contrast, CP inference robustified via the \u03b2-D does not suffer from\nthis phenomenon. In fact, Theorem 1 provides very mild conditions for the \u03b2-D robustified BLR\nmodel ensuring that the odds never favor a CP after any single outlying observation yt+1.\nTheorem 1. If mt in Eq. (8) is the Bayesian Linear Regression (BLR) model with $\\mu \\in \\mathbb{R}^p$ and priors\na0, b0, \u00b50, \u03a30; and if the posterior predictive\u2019s variance determinant is larger than |V |min > 0, then\none can choose any (\u03b2rlm, H(rt, rt+1)) \u2208 S (p, \u03b2rlm, a0, b0, \u00b50, \u03a30,|V |min) to guarantee that\n\n$$\\frac{(1 - H(r_{t+1}, r_t)) \\, f^{\\beta_{\\mathrm{rlm}}}_{m_t}(y_{t+1} \\mid \\mathcal{F}_t)}{H(r_{t+1}, r_t) \\, f^{\\beta_{\\mathrm{rlm}}}_{m_t}(y_{t+1} \\mid y_0)} \\geq 1, \\tag{9}$$\n\nwhere the set S (p, \u03b2rlm, a0, b0, \u00b50, \u03a30,|V |min) is defined by an inequality given in the Appendix.\nThm. 1 says that one can bound the odds for a CP independently of yt+1. The requirement for a\nlower bound |V |min results from the integral term in Eq. (5), which dominates \u03b2-D-inference if\n|V | is extremely small. In practice, this is not restrictive: E.g. for p = 5, h(r) = 1/\u03bb, a0 = 3, b0 =\n5, \u03a30 = diag(100, 5) used in Fig. 1 B, Thm. 1 holds for (\u03b2rlm, \u03bb) = (0.15, 100) used for inference if\n$|V|_{\\min} \\geq 8.12 \\times 10^{-6}$. Fig. 2 A plots the lower bound (see Appendix) as function of |V |min.\n\nFigure 3: Exemplary contour plots of bivariate marginals for the approximation $\\hat{\\pi}^{\\beta_p}_m(\\theta_m)$ of Eq. (11)\n(dashed) and the target $\\pi^{\\beta_p}_m(\\theta_m \\mid y_{(t-r_t):t})$ (solid) estimated and smoothed from 95,000 Hamiltonian\nMonte Carlo samples for the \u03b2-D posterior of BLR with d = 1, two regressors and \u03b2p = 0.25.\n\n3 On-line General Bayesian Inference (GBI)\n\n3.1 Structural Variational Approximations for Conjugate Exponential Families\n\nWhile there has been a recent surge in theoretical work on GBI [6, 15, 21, 14], applications have\nbeen sparse, in large part due to intractability. While sampling methods have been used successfully\nfor GBI [21, 15], it is not easy to scale these for the robust BOCPD setting. Thus, most work on\nBOCPD has focused on conjugate distributions [1, 43, 11] and approximations [44, 34]. We extend\nthe latter branch of research by deploying Structural Variational Inference (SVI). Unlike mean-field\napproximations, this preserves parameter dependence in the posterior, see Figure 3. While it is\nin principle possible to solve the inference task by sampling, this is computationally burdensome\nand makes the algorithm on-line in name only: Any sampling approach needs to (I) sample from\n$\\pi^{\\beta_p}_m(\\theta_m \\mid y_{(t-r_t):t})$ in Eq. (3), (II) numerically integrate to obtain fm(yt|y1:(t\u22121), rt) and lastly (III)\nsample and numerically integrate the integral in Eq. (7) which no longer has a closed form. Moreover,\nthis has to be performed for each (rt, m) at times t = 1, 2, . . . . On top of this increased computational\ncost, it creates three sources of approximation error propagated forward through time via Eqs. (2a)\nand (2b). Since $\\pi^{\\mathrm{KLD}}_m$ is available in closed form and as \u03b2-D \u2192 KLD as \u03b2 \u2192 0 [4], there is an\nespecially compelling way of doing SVI for conjugate models using the \u03b2-D based on the fact that\n\n$$\\pi^{\\beta_p}_m(\\theta_m \\mid y_{(t-r_t):t}) \\approx \\pi^{\\mathrm{KLD}}_m(\\theta_m \\mid y_{(t-r_t):t}) \\tag{10}$$\n\nis exact as \u03b2 \u2192 0. Thus we approximate the \u03b2-D posterior for model m and run-length rt as\n\n$$\\hat{\\pi}^{\\beta_p}_m(\\theta_m) = \\operatorname*{argmin}_{\\pi^{\\mathrm{KLD}}_m(\\theta_m)} \\left\\{ \\mathrm{KL}\\left( \\pi^{\\mathrm{KLD}}_m(\\theta_m) \\;\\big\\|\\; \\pi^{\\beta_p}_m(\\theta_m \\mid y_{(t-r_t):t}) \\right) \\right\\}. \\tag{11}$$\n\nWhile this ensures that the densities $\\hat{\\pi}^{\\beta_p}_m$ and $\\pi^{\\mathrm{KLD}}_m$ belong to the same family, the variational parameters\ncan be very different from those implied by the KLD-posterior. This approximation mitigates multiple\nissues that would arise with sampling approaches: By forcing $\\pi^{\\beta_p}_m(\\theta_m \\mid y_{1:t})$ into the conjugate closed\nform, steps (II) and (III) are solved analytically. Thus, inference is orders of magnitude faster, while\nthe resulting approximation error remains negligible (see Figs 2B, 3).\nMoreover, for many models, the Evidence Lower Bound (ELBO) associated with the optimization\nin Eq. (11) is available in closed form. As a result, off-the-shelf optimizers are sufficient and no\nblack-box or sampling-based techniques are required to efficiently tackle the problem. Theorem 2\nprovides the conditions for a conjugate exponential family to admit such a closed form ELBO. The\nproof alongside the derivation of the ELBO for BLR can be found in the Appendix.\nTheorem 2. The ELBO objective corresponding to the \u03b2-D posterior approximation in Eq. (11)\nof an exponential family likelihood model $f_m(y; \\theta_m) = \\exp\\left( \\eta(\\theta_m)^T T(y) \\right) g(\\eta(\\theta_m)) A(y)$ with\nconjugate prior $\\pi_0(\\theta_m \\mid \\nu_0, \\mathcal{X}_0) = g(\\eta(\\theta_m))^{\\nu_0} \\exp\\left( \\nu_0 \\eta(\\theta_m)^T \\mathcal{X}_0 \\right) h(\\mathcal{X}_0, \\nu_0)$ and variational posterior\n$\\hat{\\pi}^{\\beta_p}_m(\\theta_m \\mid \\nu_m, \\mathcal{X}_m) = g(\\eta(\\theta_m))^{\\nu_m} \\exp\\left( \\nu_m \\eta(\\theta_m)^T \\mathcal{X}_m \\right) h(\\mathcal{X}_m, \\nu_m)$ within the same conjugate family\nis analytically available iff the following three quantities have closed form:\n\n$$\\mathbb{E}_{\\hat{\\pi}^{\\beta_p}_m}\\left[ \\eta(\\theta_m) \\right], \\qquad \\mathbb{E}_{\\hat{\\pi}^{\\beta_p}_m}\\left[ \\log g(\\eta(\\theta_m)) \\right], \\qquad \\int A(z)^{1 + \\beta_p} \\left[ h\\left( \\frac{(1 + \\beta_p) T(z) + \\nu_m \\mathcal{X}_m}{1 + \\beta_p + \\nu_m}, \\; 1 + \\beta_p + \\nu_m \\right) \\right]^{-1} dz.$$\n\nThe conditions of Theorem 2 are met by many exponential models, e.g. the Normal-Inverse-Gamma,\nthe Exponential-Gamma, and the Gamma-Gamma. For a simulated autoregressive BLR, we assess\nthe quality of $\\hat{\\pi}^{\\beta_p}_m$ following Yao et al. [48], who estimate a difference $\\hat{k}$ between $\\pi^{\\beta_p}_m$ and $\\hat{\\pi}^{\\beta_p}_m$ relative\nto a posterior expectation. We use this on the posterior predictive, which is an expectation relative to\n$\\pi^{\\beta_p}_m$ and drives the CP detection. Yao et al. [48] rate $\\hat{\\pi}^{\\beta_p}_m$ as close to $\\pi^{\\beta_p}_m$ if $\\hat{k}$ < 0.5. Figs 3 and 2 B\nshow that our approximation lies well below this threshold for choices of \u03b2p decreasing reasonably\nfast with the dimension. Note that these are exactly the values of \u03b2p one will want to select for\ninference: As d increases, the magnitude of fmt(yt|Ft\u22121) decreases rapidly. Hence, \u03b2p needs to\ndecrease as d increases to prevent the \u03b2-D inference from being dominated by the integral in Eq. (5)\nand disregarding yt [21]. This is also reflected in our experiments in section 5, for which we initialize\n\u03b2p = 0.05 and \u03b2p = 0.005 for d = 1 and d = 29, respectively. However, as Figs. 3 and 2 B illustrate,\nthe approximation is still excellent for values of \u03b2p that are much larger than that.\n\n3.2 Stochastic Variance Reduced Gradient (SVRG) for BOCPD\n\nWhile highest predictive accuracy within BOCPD is achieved using full optimization of the variational\nparameters at each of T time periods, this has space and time complexity of $O(T)$ and $O(T^2)$. In\ncomparison, Stochastic Gradient Descent (SGD) has space and time complexity of $O(1)$ and $O(T)$,\nbut yields a loss in accuracy, substantially so for small run-lengths. In the BOCPD setting, there is\nan obvious trade-off between accuracy and scalability: Since the posterior predictive distributions\nfmt(yt|y1:(t\u22121), rt) for all run-lengths rt drive CP detection, SGD estimates are insufficiently accurate\nfor small run-lengths rt. On the other hand, once rt is sufficiently large, the variational parameter\nestimates only need minor adjustments and computing an optimum is costly.\n\nStochastic Variance Reduced Gradient (SVRG) inference for BOCPD\n\nInput at time 0: Window & batch sizes $W$, $B^*$, $b^*$; frequency $m$, prior $\\theta_0$, #steps $K$, step size $\\eta$\n    s.t. $W > B^* > b^*$; and $\\sim$ denotes sampling without replacement\nfor next observation $y_t$ at time $t$ do\n  for retained run-lengths $r \\in R(t)$ do\n    if $\\tau_r = 0$ then\n      if $r < W$ then\n        $\\theta_r \\leftarrow \\theta^*_r \\leftarrow \\mathrm{FullOpt}(\\mathrm{ELBO}(y_{t-r:t}))$; $\\tau_r \\leftarrow m$\n      else if $r \\geq W$ then\n        $\\theta^*_r \\leftarrow \\theta_r$; $\\tau_r \\leftarrow \\mathrm{Geom}(B^*/(B^* + b^*))$\n      $B \\leftarrow \\min(B^*, r)$\n      $g^{\\mathrm{anchor}}_r \\leftarrow \\frac{1}{B} \\sum_{i \\in I} \\nabla \\mathrm{ELBO}(\\theta^*_r, y_{t-i})$, where $I \\sim \\mathrm{Unif}\\{0, \\ldots, \\min(r, W)\\}$, $|I| = B$\n    for $j = 1, 2, \\ldots, K$ do\n      $b \\leftarrow \\min(b^*, r)$ and $\\tilde{I} \\sim \\mathrm{Unif}\\{0, \\ldots, \\min(r, W)\\}$ and $|\\tilde{I}| = b$\n      $g^{\\mathrm{old}}_r \\leftarrow \\frac{1}{b} \\sum_{i \\in \\tilde{I}} \\nabla \\mathrm{ELBO}(\\theta^*_r, y_{t-i})$, $g^{\\mathrm{new}}_r \\leftarrow \\frac{1}{b} \\sum_{i \\in \\tilde{I}} \\nabla \\mathrm{ELBO}(\\theta_r, y_{t-i})$\n      $\\theta_r \\leftarrow \\theta_r + \\eta \\cdot \\left( g^{\\mathrm{new}}_r - g^{\\mathrm{old}}_r + g^{\\mathrm{anchor}}_r \\right)$; $\\tau_r \\leftarrow \\tau_r - 1$\n  $r \\leftarrow r + 1$ for all $r \\in R(t)$; $R(t) \\leftarrow R(t) \\cup \\{0\\}$\n\nRecently, a new generation of algorithms interpolating SGD and global optimization have addressed\nthis trade-off. They achieve substantially better convergence rates by anchoring the stochastic gradient\nto a point near an optimum [22, 9, 35, 18, 29]. We propose a memory-efficient two-stage variation of\nthese methods tailored to BOCPD. First, the variational parameters are moved close to their global\noptimum using a variant of [22, 35]. Unlike standard versions, we anchor the gradient estimates to\na (local) optimum by calling a convex optimizer FullOpt every m steps for the first W iterations.\nWhile our implementation uses Python scipy\u2019s L-BFGS-B optimization routine, any convex optimizer\ncould be used for this step. Compared to standard SGD or SVRG, full optimization substantially\ndecreases variance and increases accuracy for small rt. Second, once rt > W we do not perform\nfull optimization anymore. Instead, we anchor optimization to the current value as in standard SVRG,\nby updating the anchor at stochastic time intervals determined by a geometric random variable with\nsuccess probability B\u2217/(B\u2217 + b\u2217). 
Whether the anchor is based on global optimization or not, the\nnext step consists in sampling B = min(rt, B\u2217) observations without replacement from a window\nwith the min(rt, W ) most recent observations to initiate the SVRG procedure. Following this, for the\nnext K observations, we incrementally refine the estimates while keeping their variance low using a\nstochastic-batch variant of [29, 30] by sampling a batch of size b = min(rt, b\u2217) without replacement\nfrom the min(rt, W ) most recent observations. The resulting on-line inference has constant space\nand linear time complexity like SGD, but produces good estimates for small rt and converges faster\n[22, 29, 30]. We provide a detailed complexity analysis of the procedure in the Appendix, where we\nalso demonstrate numerically that it is orders of magnitude faster than MCMC-based inference.\n\n4 Choice of \u03b2\n\nInitializing \u03b2p: The \u03b2-D has been used in a variety of settings [15, 4, 14, 49], but there is no\nprincipled framework for selecting \u03b2. We remedy this by minimizing the expected predictive loss\nwith respect to \u03b2 on-line. As the losses need not be convex in \u03b2p, initial values can matter for\nthe optimization. A priori, we pick \u03b2p maximizing the \u03b2-D influence for a given Mahalanobis\nDistance (MD) x\u2217 under \u03c0(\u03b8m). As Figure 1 A shows, \u03b2p > 0 induces a point of maximum influence\nMD(\u03b2p, \u03c0m(\u03b8m)): Points further in the tails are treated as outliers, while points closer to the mode\nreceive similar influence as under the KLD. A Monte Carlo estimate of MD(\u03b2p, \u03c0m(\u03b8m)) is found via\n$\\widehat{\\mathrm{MD}}(\\beta_p, \\pi_m(\\theta_m)) = \\operatorname{argmax}_{x \\in \\mathbb{R}_+} \\hat{I}(\\beta_p, \\pi_m(\\theta_m))(x)$ [28]. We initialize \u03b2p by solving the inverse\nproblem: For x\u2217, we seek \u03b2p such that $\\widehat{\\mathrm{MD}}(\\beta_p, \\pi_m(\\theta_m)) = x^*$. (The Appendix contains a pictorial\nillustration of this procedure.) 
The k-th standard deviation under the prior is a good choice of x\u2217\nfor low dimensions [see also 12], but not appropriate as delimiter for high density regions even in\nmoderate dimensions d. Thus, we propose $x^* = \\sqrt{d}$ for larger values of d, inspired by the fact that\nunder normality, $\\mathrm{MD} \\to \\sqrt{d}$ as d \u2192 \u221e [17]. One then finds \u03b2p by approximating the gradient of\n$\\widehat{\\mathrm{MD}}(\\beta_p, \\pi_m(\\theta_m))$ with respect to \u03b2p. As \u03b2rlm does not affect $\\pi^{\\beta_p}_m$, its initialization matters less and\ngenerally, initializing \u03b2rlm \u2208 [0, 1] produces reasonable results.\n\nOptimizing \u03b2 on-line: For \u03b2 = (\u03b2rlm, \u03b2p) and prediction $\\hat{y}_t(\\beta)$ of yt obtained as posterior\nexpectation via Eq. (6), define $\\varepsilon_t(\\beta) = y_t - \\hat{y}_t(\\beta)$. For predictive loss $L : \\mathbb{R} \\to \\mathbb{R}_+$, we target\n$\\beta^* = \\operatorname{argmin}_{\\beta} \\{ \\mathbb{E}(L(\\varepsilon_t(\\beta))) \\}$. Replacing expected by empirical loss and deploying SGD, we seek\nto find the partial derivatives of $\\nabla_{\\beta} L(\\varepsilon_t(\\beta))$. Noting that $\\nabla_{\\beta} L(\\varepsilon_t(\\beta)) = L'(\\varepsilon_t(\\beta)) \\cdot \\nabla_{\\beta} \\hat{y}_t(\\beta)$,\nthe issue reduces to finding the partial derivatives $\\nabla_{\\beta_{\\mathrm{rlm}}} \\hat{y}_t(\\beta)$ and $\\nabla_{\\beta_p} \\hat{y}_t(\\beta)$. Remarkably, $\\nabla_{\\beta_{\\mathrm{rlm}}} \\hat{y}_t(\\beta)$\ncan be updated sequentially and efficiently by differentiating the recursion in Eq. (2b). The derivation\nis provided in the Appendix. The gradient $\\nabla_{\\beta_p} \\hat{y}_t(\\beta)$ on the other hand is not available analytically\nand thus is approximated numerically. 
Now, \u03b2 can be updated on-line via\n\n$$\\beta_t = \\beta_{t-1} - \\eta \\cdot \\begin{bmatrix} \\nabla_{\\beta_{\\mathrm{rlm}},t} L\\big(\\varepsilon_t(\\beta_{1:(t-1)})\\big) \\\\ \\nabla_{\\beta_p,t} L\\big(\\varepsilon_t(\\beta_{1:(t-1)})\\big) \\end{bmatrix} \\qquad (12)$$\n\nIn spirit, this procedure resembles existing approaches for model hyperparameter optimization [8]. For robustness, L should be chosen appropriately. In our experiments L is a bounded absolute loss.\n\n5 Results\n\nNext, we illustrate the most important improvements this paper makes to BOCPD. First, we show how robust BOCPD deals with outliers on the well-log data set. Further, we show that standard BOCPD breaks down in the M-open world whilst \u03b2-D yields useful inference by analyzing noisy measurements of Nitrogen Oxide (NOX) levels in London. In both experiments, we use the methods in section 4, on-line hyperparameter optimization [8] and pruning for p(rt, mt|y1:t) [1]. Detailed information is provided in the Appendix. Software and simulation code are available as part of a reproducibility award at https://github.com/alan-turing-institute/rbocpdms/.\n\n5.1 Well-log\n\nThe well-log data set was first studied in Ruanaidh et al. [39] and has become a benchmark data set for univariate CP detection. However, except in Fearnhead and Rigaill [12] its outliers have been removed before CP detection algorithms are run [e.g. 1, 31, 40]. With M containing one BLR model of form yt = \u00b5 + \u03b5t, Figure 4 shows that robust BOCPD deals with outliers on-line. The maximum of the run-length distribution for standard BOCPD is zero 145 times, so declaring CPS based on the run-length distribution\u2019s maximum [see e.g. 41] yields a False Discovery Rate (FDR) > 90%. This problem persists even with non-parametric, Gaussian Process, models [p. 186, 45]. Even using Maximum A Posteriori (MAP) segmentation [11], standard BOCPD mislabels 8 outliers as CPS, making for an FDR > 40%. 
In contrast, the segmentation of the \u03b2-D version does not mislabel any outliers. Moreover and in accordance with Thm. 1, its run-length distribution\u2019s maximum never drops to zero in response to outliers. Further, a natural byproduct of the robust segmentation is a reduction in squared (absolute) prediction error by 10% (6%) compared to the standard version. The robust version has more computational overhead than standard BOCPD, but still needs less than 0.5 seconds per observation using a 3.1 GHz Intel i7 and 16GB RAM.\n\nFigure 4: Maximum A Posteriori (MAP) segmentation and run-length distributions of the well-log data. Robust segmentation depicted using solid lines, CPS additionally declared under standard BOCPD with dashed lines. The corresponding run-length distributions for robust (middle) and standard (bottom) BOCPD are shown in grayscale. The most likely run-lengths are dashed.\n\nNot only does robust BOCPD\u2019s segmentation in Figure 4 match that in Fearnhead and Rigaill [12], but it also offers three additional on-line outputs: Firstly, it produces probabilistic (rather than point) forecasts and parameter inference. Secondly, it self-regulates its robustness via \u03b2. Thirdly, it can compare multiple models and produce model posteriors (see section 5.2). Further, unlike Fearnhead and Rigaill [12], it is not restricted to fitting univariate data with piecewise constant functions.\n\n5.2 Air Pollution\n\nThe example in Fig. 1 B gives an illustration of the importance of robustness in medium-dimensional (BOCPD) problems: It suffices for a single dimension of the problem to be misspecified or outlier-prone for inference to fail. Moreover, the presence of misspecification or outliers in this plot can hardly be spotted \u2013 and this effect will worsen with increasing dimensionality. 
To illustrate this point on a multivariate real world data set, we also analyze Nitrogen Oxide (NOX) levels across 29 stations in London using spatially structured Bayesian Vector Autoregressions [see 25]. Previous robust on-line methods [e.g. 36, 7, 12] cannot be applied to this problem because they assume univariate data or do not allow for dependent observations. As Figure 5 shows, robust BOCPD finds one CP corresponding to the introduction of the congestion charge, while standard BOCPD produces an FDR > 90%. Both methods find a change in dynamics (i.e. models) after the congestion charge introduction, but variance in the model posterior is substantially lower for the robust algorithm. Further, it increases the average one-step-ahead predictive likelihood by 10% compared to standard BOCPD.\n\n6 Conclusion\n\nThis paper has presented the very first robust Bayesian on-line changepoint (CP) detection algorithm and the first ever scalable General Bayesian Inference (GBI) method. While CP detection is a particularly salient example of unaddressed heterogeneity and outliers leading to poor inference, the capabilities of GBI and the Structural Variational approximations presented extend far beyond this setting. With an ever increasing interest in the field of machine learning to efficiently and reliably quantify uncertainty, robust probabilistic inference will only become more relevant. In this paper, we give a particularly striking demonstration of the inferential power that can be unlocked through divergence-based General Bayesian inference.\n\nFigure 5: On-line model posteriors for three different VAR models (solid, dashed, dotted) and run-length distributions in grayscale with most likely run-lengths dashed for standard (top two panels) and robust (bottom two panels) BOCPD. 
Also marked are the congestion charge introduction, 17/02/2003 (solid vertical line) and the MAP segmentations (crosses).\n\nAcknowledgements\n\nWe would like to cordially thank both Jim Smith and Chris Holmes for fruitful discussions and help with some of the theoretical results. JK and JJ are funded by EPSRC grant EP/L016710/1 as part of the Oxford-Warwick Statistics Programme (OXWASP). TD is funded by the Lloyds Register Foundation programme on Data Centric Engineering through the London Air Quality project. This work was supported by The Alan Turing Institute for Data Science and AI under EPSRC grant EP/N510129/1. In collaboration with the Greater London Authority.\n\nReferences\n\n[1] Ryan Prescott Adams and David JC MacKay. Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742, 2007.\n\n[2] Mauricio Alvarez, Jan R Peters, Neil D Lawrence, and Bernhard Sch\u00f6lkopf. Switched latent force models for movement segmentation. In Advances in Neural Information Processing Systems, pages 55\u201363, 2010.\n\n[3] Daniel Barry and John A Hartigan. A Bayesian analysis for change point problems. Journal of the American Statistical Association, 88(421):309\u2013319, 1993.\n\n[4] Ayanendranath Basu, Ian R Harris, Nils L Hjort, and MC Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549\u2013559, 1998.\n\n[5] Jos\u00e9 M Bernardo and Adrian FM Smith. Bayesian theory, 2001.\n\n[6] Pier Giovanni Bissiri, Chris C Holmes, and Stephen G Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103\u20131130, 2016.\n\n[7] Yang Cao and Yao Xie. Robust sequential change-point detection by convex optimization. 
In Information Theory (ISIT), 2017 IEEE International Symposium on, pages 1287\u20131291. IEEE, 2017.\n\n[8] Fran\u00e7ois Caron, Arnaud Doucet, and Raphael Gottardo. On-line changepoint detection and parameter estimation with application to genomic data. Statistics and Computing, 22(2):579\u2013595, 2012.\n\n[9] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646\u20131654, 2014.\n\n[10] Adji Bousso Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David Blei. Variational inference via \u03c7 upper bound minimization. In Advances in Neural Information Processing Systems, pages 2729\u20132738, 2017.\n\n[11] Paul Fearnhead and Zhen Liu. On-line inference for multiple changepoint problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):589\u2013605, 2007.\n\n[12] Paul Fearnhead and Guillem Rigaill. Changepoint detection in the presence of outliers. Journal of the American Statistical Association, (just-accepted), 2017.\n\n[13] Emily Fox and David B Dunson. Multiresolution Gaussian processes. In Advances in Neural Information Processing Systems, pages 737\u2013745, 2012.\n\n[14] Futoshi Futami, Issei Sato, and Masashi Sugiyama. Variational inference based on robust divergences. In Artificial Intelligence and Statistics, 2018.\n\n[15] Abhik Ghosh and Ayanendranath Basu. Robust Bayes estimation using the density power divergence. Annals of the Institute of Statistical Mathematics, 68(2):413\u2013437, 2016.\n\n[16] Marco Grzegorczyk and Dirk Husmeier. Non-stationary continuous dynamic Bayesian networks. In Advances in Neural Information Processing Systems, pages 682\u2013690, 2009.\n\n[17] Peter Hall, JS Marron, and Amnon Neeman. Geometric representation of high dimension, low sample size data. 
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):427\u2013444, 2005.\n\n[18] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Kone\u010dn\u00fd, and Scott Sallinen. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pages 2251\u20132259, 2015.\n\n[19] Jos\u00e9 Miguel Hern\u00e1ndez-Lobato, Yingzhen Li, Mark Rowland, Daniel Hern\u00e1ndez-Lobato, Thang D Bui, and Richard E Turner. Black-box \u03b1-divergence minimization. In Proceedings of the 33rd International Conference on Machine Learning, Volume 48, pages 1511\u20131520. JMLR.org, 2016.\n\n[20] He Huang and Martin Paulus. Learning under uncertainty: a comparison between rw and Bayesian approach. In Advances in Neural Information Processing Systems, pages 2730\u20132738, 2016.\n\n[21] Jack Jewson, Jim Smith, and Chris Holmes. Principles of Bayesian inference using general divergence criteria. Entropy, 20(6):442, 2018.\n\n[22] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315\u2013323, 2013.\n\n[23] Azadeh Khaleghi and Daniil Ryabko. Locating changes in highly dependent data with unknown number of change points. In Advances in Neural Information Processing Systems, pages 3086\u20133094, 2012.\n\n[24] Rebecca Killick, Idris A Eckley, Kevin Ewans, and Philip Jonathan. Detection of changes in variance of oceanographic time-series using changepoint analysis. Ocean Engineering, 37(13):1120\u20131126, 2010.\n\n[25] Jeremias Knoblauch and Theodoros Damoulas. Spatio-temporal Bayesian on-line changepoint detection with model selection. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.\n\n[26] George Konidaris, Scott Kuindersma, Roderic Grupen, and Andrew G Barto. 
Constructing skill trees for reinforcement learning agents from demonstration trajectories. In Advances in Neural Information Processing Systems, pages 1162\u20131170, 2010.\n\n[27] Erich Kummerfeld and David Danks. Tracking time-varying graphical structure. In Advances in Neural Information Processing Systems, pages 1205\u20131213, 2013.\n\n[28] Sebastian Kurtek and Karthik Bharath. Bayesian sensitivity analysis with the Fisher\u2013Rao metric. Biometrika, 102(3):601\u2013616, 2015.\n\n[29] Lihua Lei and Michael Jordan. Less than a single pass: Stochastically controlled stochastic gradient. In Artificial Intelligence and Statistics, pages 148\u2013156, 2017.\n\n[30] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2345\u20132355, 2017.\n\n[31] C\u00e9line Levy-leduc and Za\u00efd Harchaoui. Catching change-points with lasso. In Advances in Neural Information Processing Systems, pages 617\u2013624, 2008.\n\n[32] Yingzhen Li and Richard E Turner. R\u00e9nyi divergence variational inference. In Advances in Neural Information Processing Systems, pages 1073\u20131081, 2016.\n\n[33] Kevin Lin, James L Sharpnack, Alessandro Rinaldo, and Ryan J Tibshirani. A sharp error analysis for the fused Lasso, with application to approximate changepoint screening. In Advances in Neural Information Processing Systems, pages 6887\u20136896, 2017.\n\n[34] Scott Niekum, Sarah Osentoski, Christopher G Atkeson, and Andrew G Barto. CHAMP: Changepoint detection using approximate model parameters. Technical report, (No. CMU-RI-TR-14-10) Carnegie-Mellon University Pittsburgh PA Robotics Institute, 2014.\n\n[35] Atsushi Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, pages 1574\u20131582, 2014.\n\n[36] Moshe Pollak. 
A robust changepoint detection method. Sequential Analysis, 29(2):146\u2013161, 2010.\n\n[37] Aleksey S Polunchenko, Alexander G Tartakovsky, and Nitis Mukhopadhyay. Nearly optimal change-point detection with an application to cybersecurity. Sequential Analysis, 31(3):409\u2013435, 2012.\n\n[38] Rajesh Ranganath, Dustin Tran, Jaan Altosaar, and David Blei. Operator variational inference. In Advances in Neural Information Processing Systems, pages 496\u2013504, 2016.\n\n[39] \u00d3 Ruanaidh, JK Joseph, and William J Fitzgerald. Numerical Bayesian methods applied to signal processing. 1996.\n\n[40] Eric Ruggieri and Marcus Antonellis. An exact approach to Bayesian sequential change point detection. Computational Statistics & Data Analysis, 97:71\u201386, 2016.\n\n[41] Yunus Saat\u00e7i, Ryan D Turner, and Carl E Rasmussen. Gaussian process change point models. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 927\u2013934, 2010.\n\n[42] Florian Stimberg, Manfred Opper, Guido Sanguinetti, and Andreas Ruttor. Inference in continuous-time change-point models. In Advances in Neural Information Processing Systems, pages 2717\u20132725, 2011.\n\n[43] Ryan Turner, Yunus Saatci, and Carl Edward Rasmussen. Adaptive sequential Bayesian change point detection. In Temporal Segmentation Workshop at NIPS, 2009.\n\n[44] Ryan D Turner, Steven Bottone, and Clay J Stanek. Online variational approximations to non-exponential family change point models: with application to radar tracking. In Advances in Neural Information Processing Systems, pages 306\u2013314, 2013.\n\n[45] Ryan Darby Turner. Gaussian processes for state space models and change point detection. PhD thesis, University of Cambridge, 2012.\n\n[46] Robert C Wilson, Matthew R Nassar, and Joshua I Gold. Bayesian online learning of the hazard rate in change-point problems. Neural Computation, 22(9):2452\u20132476, 2010.\n\n[47] Xiang Xuan and Kevin Murphy. 
Modeling changing dependency structure in multivariate time series. In Proceedings of the 24th International Conference on Machine Learning, pages 1055\u20131062. ACM, 2007.\n\n[48] Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman. Yes, but did it work?: Evaluating variational inference. arXiv preprint arXiv:1802.02538, 2018.\n\n[49] Kenan Y Y\u0131lmaz, Ali T Cemgil, and Umut Simsekli. Generalised coupled tensor factorisation. In Advances in Neural Information Processing Systems, pages 2151\u20132159, 2011.\n\n[50] XianXing Zhang, Lawrence Carin, and David B Dunson. Hierarchical topic modeling for analysis of time-evolving personal choices. In Advances in Neural Information Processing Systems, pages 1395\u20131403, 2011.\n", "award": [], "sourceid": 68, "authors": [{"given_name": "Jeremias", "family_name": "Knoblauch", "institution": "Warwick University"}, {"given_name": "Jack", "family_name": "Jewson", "institution": "University of Warwick"}, {"given_name": "Theodoros", "family_name": "Damoulas", "institution": "University of Warwick        The Alan Turing Institute"}]}