{"title": "The Forget-me-not Process", "book": "Advances in Neural Information Processing Systems", "page_first": 3702, "page_last": 3710, "abstract": "We introduce the Forget-me-not Process, an efficient, non-parametric meta-algorithm for online probabilistic sequence prediction for piecewise stationary, repeating sources. Our method works by taking a Bayesian approach to partition a stream of data into postulated task-specific segments, while simultaneously building a model for each task. We provide regret guarantees with respect to piecewise stationary data sources under the logarithmic loss, and validate the method empirically across a range of sequence prediction and task identification problems.", "full_text": "The Forget-me-not Process\n\nKieran Milan\u2020, Joel Veness\u2020, James Kirkpatrick, Demis Hassabis\n\nGoogle DeepMind\n\n{kmilan,aixi,kirkpatrick,demishassabis}@google.com\n\nAnna Koop, Michael Bowling\n\nUniversity of Alberta\n\n{anna,bowling}@cs.ualberta.ca\n\nAbstract\n\nWe introduce the Forget-me-not Process, an ef\ufb01cient, non-parametric meta-\nalgorithm for online probabilistic sequence prediction for piecewise stationary,\nrepeating sources. Our method works by taking a Bayesian approach to partition-\ning a stream of data into postulated task-speci\ufb01c segments, while simultaneously\nbuilding a model for each task. We provide regret guarantees with respect to piece-\nwise stationary data sources under the logarithmic loss, and validate the method\nempirically across a range of sequence prediction and task identi\ufb01cation problems.\n\n1\n\nIntroduction\n\nModeling non-stationary temporal data sources is a fundamental problem in signal processing,\nstatistical data compression, quantitative \ufb01nance and model-based reinforcement learning. One\nwidely-adopted and successful approach has been to design meta-algorithms that automatically\ngeneralize existing stationary learning algorithms to various non-stationary settings. 
In this paper we introduce the Forget-me-not Process, a probabilistic meta-algorithm that provides the ability to model the class of memory-bounded, piecewise-repeating sources given an arbitrary memory-bounded, stationary probabilistic model.\n\nThe most well-studied class of probabilistic meta-algorithms is that for piecewise stationary sources, which model data sequences with abruptly changing statistics. Almost all meta-algorithms for abruptly changing sources work by performing Bayesian model averaging over a class of hypothesized temporal partitions. To the best of our knowledge, the earliest demonstration of this fundamental technique was [21], for the purpose of data compression; closely related techniques have gained popularity within the machine learning community for change point detection [1] and have been proposed by neuroscientists as a mechanism by which humans deal with open-ended environments composed of multiple distinct tasks [4–6]. One of the reasons for the popularity of this approach is that the temporal structure can be exploited to make exact Bayesian inference tractable via dynamic programming; in particular, inference over all possible temporal partitions of n data points results in an algorithm of O(n^2) time complexity and O(n) space complexity [21, 1]. Many variants have been proposed in the literature [20, 11, 10, 17], which trade off predictive accuracy for improved time and space complexity; in particular, the Partition Tree Weighting meta-algorithm [17] has O(n log n) time and O(log n) space complexity, and has been shown empirically to exhibit superior performance versus other low-complexity alternatives on piecewise stationary sources.\n\nA key limitation of these aforementioned techniques is that they can perform poorly when there exist multiple segments of data that are similarly distributed. For example, consider data generated according to the schedule depicted in Figure 1.
For all these methods, once a change-point occurs, the base (stationary) model is invoked from scratch, even if the task repeats, which is clearly undesirable in many situations of interest. Our main contribution in this paper is to introduce the Forget-me-not Process, which has the ability to avoid having to relearn repeated tasks, while still maintaining essentially the same theoretical performance guarantees as Partition Tree Weighting on piecewise stationary sources.\n\n† indicates joint first authorship.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: An example task segmentation. (The plot shows which of tasks 1–3 is active at each time step from 1 to 160.)\n\n2 Preliminaries\n\nWe now introduce some notation and necessary background material.\n\nNotation. A string x_1 x_2 ... x_n of length n over an alphabet X is denoted by x_{1:n} ∈ X^n; the empty string is denoted by ϵ, the set of all finite strings is X^* := {ϵ} ∪ ⋃_{i=1}^∞ X^i, and we write x_{<n} for the prefix x_{1:n−1}. If m > n, we define x_{1:m} := x_{1:n} and x_{m:n} := ϵ. The concatenation of two strings s, r ∈ X^* is denoted by sr. Unless otherwise specified, base 2 is assumed for all logarithms.\n\nA sequential probabilistic data generating source ρ is defined by a sequence of probability mass functions ρ_n : X^n → [0, 1], for all n ∈ N, satisfying the constraint that ρ_n(x_{1:n}) = Σ_{y∈X} ρ_{n+1}(x_{1:n} y) for all x_{1:n} ∈ X^n, with base case ρ_0(ϵ) = 1. From here onwards, whenever the meaning is clear from the argument to ρ, the subscripts on ρ will be dropped. Under this definition, the conditional probability of a symbol x_n given previous data x_{<n} is defined as ρ(x_n | x_{<n}) := ρ(x_{1:n}) / ρ(x_{<n}), provided ρ(x_{<n}) > 0, with the familiar chain rule ρ(x_{i:j} | x_{<i}) = ∏_{k=i}^{j} ρ(x_k | x_{<k}) holding as usual.\n\nBayesian Sequence Prediction. A fundamental technique for constructing algorithms that work well under the logarithmic loss is Bayesian model averaging.
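As a concrete illustration of these definitions and of model averaging (a minimal sketch, not the paper's implementation: the Krichevsky–Trofimov estimator and the uniform model below are stand-ins for arbitrary base models), the mixture marginal ξ(x_{1:n}) = Σ_ρ w_0^ρ ρ(x_{1:n}) and the predictive probability as a ratio of marginals can be computed as:

```python
from fractions import Fraction

def kt_marginal(bits, alpha=Fraction(1, 2)):
    """Krichevsky-Trofimov marginal of a binary string, built via the chain
    rule rho(x_{1:n}) = prod_k rho(x_k | x_{<k})."""
    ones = zeros = 0
    p = Fraction(1)
    for b in bits:
        # Conditional probability of the next symbol given the counts so far.
        p *= (Fraction(ones if b else zeros) + alpha) / (ones + zeros + 2 * alpha)
        ones += b
        zeros += 1 - b
    return p

def mixture_marginal(bits, models, weights):
    """Bayesian mixture marginal: xi(x_{1:n}) = sum_rho w_0^rho * rho(x_{1:n})."""
    return sum(w * m(bits) for m, w in zip(models, weights))

# Two-model class M with a uniform prior.
models = [kt_marginal, lambda bits: Fraction(1, 2 ** len(bits))]
weights = [Fraction(1, 2), Fraction(1, 2)]

x = [1, 1, 0, 1]
# Predictive probability as a ratio of marginals: xi(x_n | x_{<n}) = xi(x_{1:n}) / xi(x_{<n}).
pred = mixture_marginal(x, models, weights) / mixture_marginal(x[:-1], models, weights)
```

Exact rational arithmetic sidesteps underflow in this toy setting; a practical implementation would work in log-space.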
We now provide a short overview sufficient for the purposes of this paper; for more detail, we recommend the work of [12] and [14]. Given a non-empty discrete set of probabilistic data generating sources M := {ρ_1, ρ_2, . . .} and a prior weight w_0^ρ > 0 for each ρ ∈ M such that Σ_{ρ∈M} w_0^ρ = 1, the Bayesian mixture predictor is defined in terms of its marginal by ξ(x_{1:n}) := Σ_{ρ∈M} w_0^ρ ρ(x_{1:n}). The predictive probability is thus given by the ratio of the marginals ξ(x_n | x_{<n}) = ξ(x_{1:n}) / ξ(x_{<n}).\n\nThe base model used for a segment beginning at time a is a uniform Bayesian mixture over the model pool M_a available at that time,\n\nν_a(x_{a:b}) := Σ_{ρ∈M_a} (1/|M_a|) ρ(x_{a:b}),    (5)\n\nand whenever a segment (a, b) closes, the model pool is updated by adding a copy of the best performing model in the mixture, trained further on that segment:\n\nM_t := M_{t−1} ∪ { ρ*[x_{a:b}] | ρ* = argmax_{ρ∈M_a} ρ(x_{a:b}) } for t > 1.    (6)\n\nFinally, substituting Equation 5 in for the base model of PTW yields our Forget-me-not process\n\nFMN_d(x_{1:n}) := Σ_{P∈C_d} 2^{−Γ_d(P)} ∏_{(a,b)∈P_n} ν_a(x_{a:b}).    (7)\n\nAlgorithm. Algorithm 1 describes how to compute the marginal probability FMN_d(x_{1:n}). The r_j variables store the segment start times for the unclosed segments at depth j; the b_j variables implement a dynamic programming caching mechanism to speed up the PTW computation, as explained in Section 3.3 of [17]; the w_j variables hold intermediate results needed to apply Lemma 1. The Most Significant Changed Bit routine MSCB_d(t), invoked at line 4, is used to determine the range of segments ending at the current time t, and is defined for t > 1 as the number of bits to the left of the most significant location at which the d-bit binary representations of t−1 and t−2 differ, with MSCB_d(1) := 0 for all d ∈ N.
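The MSCB definition admits a simple bit-twiddling sketch (the XOR formulation is our own rendering of the definition, not the authors' code): XOR-ing t−1 and t−2 exposes the most significant changed bit, and the number of bits to its left in a d-bit representation then follows from `bit_length`:

```python
def mscb(d, t):
    """Most Significant Changed Bit: the number of bits to the left of the most
    significant position at which the d-bit binary representations of t-1 and
    t-2 differ, with MSCB_d(1) := 0 by convention."""
    if t == 1:
        return 0
    # The highest set bit of (t-1) XOR (t-2) is the most significant changed bit.
    return d - ((t - 1) ^ (t - 2)).bit_length()

print([mscb(3, t) for t in range(1, 9)])  # → [0, 2, 1, 2, 0, 2, 1, 2]
```

With d = 3, the routine returns i = 0 at t = 5, so the loop over j = i+1, ..., d closes exactly three segments ending at time 4, matching the worked example in the text.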
For example, in Figure 3, at t = 5, before processing x_5, we need to deal with the segments (1, 4), (3, 4), (4, 4) finishing.\n\nAlgorithm 1 FORGET-ME-NOT - FMN_d(x_{1:n})\nRequire: A depth parameter d ∈ N, and a base probabilistic model ρ\nRequire: A data sequence x_{1:n} ∈ X^n satisfying n ≤ 2^d\n1: b_j ← 1, w_j ← 1, r_j ← 1, for 0 ≤ j ≤ d\n2: M ← {ρ}\n3: for t = 1 to n do\n4:   i ← MSCB_d(t)\n5:   b_i ← w_{i+1}\n6:   for j = i + 1 to d do\n7:     M ← UPDATEMODELPOOL(ν_{r_j}, x_{r_j:t−1})\n8:     w_j ← 1, b_j ← 1, r_j ← t\n9:   end for\n10:  w_d ← ν_{r_d}(x_{r_d:t})\n11:  for i = d − 1 to 0 do\n12:    w_i ← (1/2) ν_{r_i}(x_{r_i:t}) + (1/2) w_{i+1} b_i\n13:  end for\n14: end for\n15: return w_0\n\nThe method UPDATEMODELPOOL applies Equation 6 to remember the best performing model in the mixture ν_{r_j} on the completed segment (r_j, t − 1). Lines 11 to 13 invoke Lemma 1 bottom-up, to compute the desired marginal probability FMN_d(x_{1:n}) = w_0.\n\n(Space and Time Overhead) Under the assumption that each base model conditional probability can be obtained in O(1) time, the time complexity to process a sequence of length n is O(nk log n), where k is an upper bound on |M|. The n log n factor is due to the number of iterations in the inner loops on Lines 6 to 9 and Lines 11 to 13 being upper bounded by d + 1. The k factor is due to the cost of maintaining the ν terms for the segments which have not yet closed. An upper bound on k can be obtained from inspection of Figure 3: if we set n = 2^d, the number of completed segments is given by Σ_{i=0}^{d} 2^i = 2^{d+1} − 1 = 2n − 1 = O(n); thus the time complexity is O(n^2 log n). The space overhead is O(k log n), due to the O(log n) instances of Equation 5.\n\n(Complexity Reducing Operations) For many applications of interest, a running time of O(n^2 log n) is unacceptable.
A workaround is to fix k in advance and use a model replacement strategy that enforces |M| ≤ k via a modified UPDATEMODELPOOL routine; this reduces the time complexity to O(nk log n). We found the following heuristic scheme to be effective in practice: when a segment (a, b) closes, the best performing model ρ* ∈ M_a for this segment is identified. Now, 1) letting y_{ρ*} denote a uniform sub-sample of the data used to train ρ*, if log ρ*[x_{a:b}](y_{ρ*}) − log ρ*(y_{ρ*}) > α, then replace ρ* with ρ*[x_{a:b}] in M; else 2) if a uniform Bayes mixture ξ over M assigns sufficiently higher probability to a uniform sub-sample s of x_{a:b} than ρ* does, that is, log ξ(s) − log ρ*(s) > β, then leave M unchanged; else 3) add ρ*[x_{a:b}] to M; if |M| > k, remove the oldest model in M. This requires choosing hyperparameters α, β ∈ R and appropriate constant sub-sample sizes. Step 1 avoids adding multiple models for the same task; Step 2 avoids adding a redundant model to the model pool. Note that the per-model and per-segment sub-samples can be efficiently maintained online using reservoir sampling [19]. As a further complexity reducing operation, one can skip calls to UPDATEMODELPOOL unless (b − a + 1) ≥ 2^c for some c < d.\n\n(Strongly Online Prediction) A strongly online FMN process, where one does not need to fix a d in advance, can be defined via FMN(x_{1:n}) := ∏_{i=1}^{n} FMN_{⌈log i⌉}(x_i | x_{<i}).
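The online sub-sampling step mentioned above can be sketched with a minimal version of reservoir sampling (Algorithm R); the reservoir size and the toy stream below are hypothetical choices, not values from the paper:

```python
import random

def reservoir_update(reservoir, k, item, n_seen, rng=random):
    """Maintain a uniform sub-sample of size <= k over a stream.

    n_seen is the number of items observed so far, including `item`; after
    each call, every item seen is in the reservoir with probability k / n_seen.
    """
    if len(reservoir) < k:
        reservoir.append(item)
    else:
        # Keep the new item with probability k / n_seen, evicting a uniform slot.
        j = rng.randrange(n_seen)
        if j < k:
            reservoir[j] = item
    return reservoir

# Maintain a sub-sample of size 8 over a stream of 1000 symbols.
rng = random.Random(0)
sample = []
for n, symbol in enumerate(range(1000), start=1):
    reservoir_update(sample, 8, symbol, n, rng)
```

This keeps the per-model and per-segment sub-samples at a constant size while using O(1) work per observed symbol.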