{"title": "The Forget-me-not Process", "book": "Advances in Neural Information Processing Systems", "page_first": 3702, "page_last": 3710, "abstract": "We introduce the Forget-me-not Process, an efficient, non-parametric meta-algorithm for online probabilistic sequence prediction for piecewise stationary, repeating sources. Our method works by taking a Bayesian approach to partition a stream of data into postulated task-specific segments, while simultaneously building a model for each task. We provide regret guarantees with respect to piecewise stationary data sources under the logarithmic loss, and validate the method empirically across a range of sequence prediction and task identification problems.", "full_text": "The Forget-me-not Process\n\nKieran Milan\u2020, Joel Veness\u2020, James Kirkpatrick, Demis Hassabis\n\nGoogle DeepMind\n\n{kmilan,aixi,kirkpatrick,demishassabis}@google.com\n\nAnna Koop, Michael Bowling\n\nUniversity of Alberta\n\n{anna,bowling}@cs.ualberta.ca\n\nAbstract\n\nWe introduce the Forget-me-not Process, an ef\ufb01cient, non-parametric meta-\nalgorithm for online probabilistic sequence prediction for piecewise stationary,\nrepeating sources. Our method works by taking a Bayesian approach to partition-\ning a stream of data into postulated task-speci\ufb01c segments, while simultaneously\nbuilding a model for each task. We provide regret guarantees with respect to piece-\nwise stationary data sources under the logarithmic loss, and validate the method\nempirically across a range of sequence prediction and task identi\ufb01cation problems.\n\n1\n\nIntroduction\n\nModeling non-stationary temporal data sources is a fundamental problem in signal processing,\nstatistical data compression, quantitative \ufb01nance and model-based reinforcement learning. One\nwidely-adopted and successful approach has been to design meta-algorithms that automatically\ngeneralize existing stationary learning algorithms to various non-stationary settings. 
In this paper\nwe introduce the Forget-me-not Process, a probabilistic meta-algorithm that provides the ability to\nmodel the class of memory bounded, piecewise-repeating sources given an arbitrary, probabilistic\nmemory bounded stationary model.\nThe most well studied class of probabilistic meta-algorithms are those for piecewise stationary\nsources, which model data sequences with abruptly changing statistics. Almost all meta-algorithms for\nabruptly changing sources work by performing Bayesian model averaging over a class of hypothesized\ntemporal partitions. To the best of our knowledge, the earliest demonstration of this fundamental\ntechnique was [21], for the purpose of data compression; closely related techniques have gained\npopularity within the machine learning community for change point detection [1] and have been\nproposed by neuroscientists as a mechanism by which humans deal with open-ended environments\ncomposed of multiple distinct tasks [4\u20136]. One of the reasons for the popularity of this approach is\nthat the temporal structure can be exploited to make exact Bayesian inference tractable via dynamic\nprogramming; in particular inference over all possible temporal partitions of n data points results in\nan algorithm of O(n^2) time complexity and O(n) space complexity [21, 1]. Many variants have been\nproposed in the literature [20, 11, 10, 17], which trade off predictive accuracy for improved time and\nspace complexity; in particular the Partition Tree Weighting meta-algorithm [17] has O(n log n) time\nand O(log n) space complexity, and has been shown empirically to exhibit superior performance\nversus other low-complexity alternatives on piecewise stationary sources.\nA key limitation of these aforementioned techniques is that they can perform poorly when there\nexist multiple segments of data that are similarly distributed. For example, consider data generated\naccording to the schedule depicted in Figure 1. 
For all these methods, once a change-point occurs, the\nbase (stationary) model is invoked from scratch, even if the task repeats, which is clearly undesirable\nin many situations of interest. Our main contribution in this paper is to introduce the Forget-me-not\nProcess, which has the ability to avoid having to relearn repeated tasks, while still maintaining\nessentially the same theoretical performance guarantees as Partition Tree Weighting on piecewise\nstationary sources.\n\nFigure 1: An example task segmentation (horizontal axis: Time, from 1 to 160; vertical axis: Task, from 1 to 3).\n\n\u2020 indicates joint first authorship.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n2 Preliminaries\n\nWe now introduce some notation and necessary background material.\n\nSequential Probabilistic Data Generators. We begin with some terminology for sequential, probabilistic\ndata generating sources. An alphabet is a finite non-empty set of symbols, which we\nwill denote by X. A string x_1 x_2 . . . x_n \u2208 X^n of length n is denoted by x_{1:n}. The prefix x_{1:j} of\nx_{1:n}, where j \u2264 n, is denoted by x_{\u2264j} or x_{<j+1}. The empty string is denoted by \u03b5 and we define\nX^\u2217 = {\u03b5} \u222a \u22c3_{i=1}^{\u221e} X^i. Our notation also generalizes to out of bounds indices; that is, given a string\nx_{1:n} and an integer m > n, we define x_{1:m} := x_{1:n} and x_{m:n} := \u03b5. The concatenation of two strings\ns, r \u2208 X^\u2217 is denoted by sr. 
Unless otherwise specified, base 2 is assumed for all logarithms.\nA sequential probabilistic data generating source \u03c1 is defined by a sequence of probability mass functions\n\u03c1_n : X^n \u2192 [0, 1], for all n \u2208 N, satisfying the constraint that \u03c1_n(x_{1:n}) = \u2211_{y\u2208X} \u03c1_{n+1}(x_{1:n} y)\nfor all x_{1:n} \u2208 X^n, with base case \u03c1_0(\u03b5) = 1. From here onwards, whenever the meaning is clear\nfrom the argument to \u03c1, the subscripts on \u03c1 will be dropped. Under this definition, the conditional\nprobability of a symbol x_n given previous data x_{<n} is defined as \u03c1(x_n | x_{<n}) := \u03c1(x_{1:n})/\u03c1(x_{<n})\nprovided \u03c1(x_{<n}) > 0, with the familiar chain rule \u03c1(x_{i:j} | x_{<i}) = \u220f_{k=i}^{j} \u03c1(x_k | x_{<k}) applying as\nusual. Notice too that a new sequential probabilistic data generating source \u03bd can be obtained from\nan existing source \u03c1 by conditioning on a fixed sequence of input data. More explicitly, given a string\ns \u2208 X^\u2217, one can define \u03bd(x_{1:n}) := \u03c1(x_{1:n} | s) for all n; we will use the notation \u03c1[s] to compactly\ndenote such a derived probabilistic data generating source.\n\nTemporal Partitions, Piecewise Sources and Piecewise-repeating sources. We now introduce\nsome notation to formally describe temporal partitions and piecewise sources. A segment is a tuple\n(a, b) \u2208 N \u00d7 N with a \u2264 b. A segment (a, b) is said to overlap with another segment (c, d) if there\nexists an i \u2208 N such that a \u2264 i \u2264 b and c \u2264 i \u2264 d. A temporal partition P of a set of time\nindices S = {1, 2, . . . , n}, for some n \u2208 N, is a set of non-overlapping segments such that for all\nx \u2208 S, there exists a segment (a, b) \u2208 P such that a \u2264 x \u2264 b. We also use the overloaded notation\nP(a, b) := {(c, d) \u2208 P : a \u2264 c \u2264 d \u2264 b} to denote the set of segments falling inclusively within\nthe range (a, b). Finally, T_n will be used to denote the set of all possible temporal partitions of\n{1, 2, . . . 
, n}.\nWe can now define a piecewise data generating source \u00b5^h_P in terms of a partition P =\n{(a_1, b_1), (a_2, b_2), . . .} and a set of probabilistic data generating sources {\u00b5_1, \u00b5_2, . . .}, such that for\nall n \u2208 N, for all x_{1:n} \u2208 X^n,\n\n\u00b5^h_P(x_{1:n}) := \u220f_{(a,b)\u2208P_n} \u00b5_{h(a)}(x_{a:b}),\n\nwhere P_n := {(a, b) \u2208 P : a \u2264 n} and h : N \u2192 N is a task assignment function that maps\nsegment beginnings to task identifiers.\nA piecewise repeating data generating source is a special case of a piecewise data generating source\nthat satisfies the additional constraint that \u2203a, c \u2208 {x : (x, y) \u2208 P} such that a \u2260 c and h(a) = h(c).\nIn terms of modeling a piecewise repeating source, there are three key unknowns: the partition\nwhich defines the location of the change points, the task assignment function, and the model for each\nindividual task.\n\nBayesian Sequence Prediction. A fundamental technique for constructing algorithms that work\nwell under the logarithmic loss is Bayesian model averaging. We now provide a short overview\nsufficient for the purposes of this paper; for more detail, we recommend the work of [12] and [14].\nGiven a non-empty discrete set of probabilistic data generating sources M := {\u03c1_1, \u03c1_2, . . .} and a\nprior weight w^\u03c1_0 > 0 for each \u03c1 \u2208 M such that \u2211_{\u03c1\u2208M} w^\u03c1_0 = 1, the Bayesian mixture predictor\nis defined in terms of its marginal by \u03be(x_{1:n}) := \u2211_{\u03c1\u2208M} w^\u03c1_0 \u03c1(x_{1:n}). The predictive probability is\nthus given by the ratio of the marginals \u03be(x_n | x_{<n}) = \u03be(x_{1:n}) / \u03be(x_{<n}). The predictive probability\ncan also be expressed in terms of a convex combination of conditional model predictions, with each\nmodel weighted by its posterior probability. 
More explicitly,\n\n\u03be(x_n | x_{<n}) = ( \u2211_{\u03c1\u2208M} w^\u03c1_0 \u03c1(x_{1:n}) ) / ( \u2211_{\u03c1\u2208M} w^\u03c1_0 \u03c1(x_{<n}) ) = \u2211_{\u03c1\u2208M} w^\u03c1_{n\u22121} \u03c1(x_n | x_{<n}), where w^\u03c1_{n\u22121} := w^\u03c1_0 \u03c1(x_{<n}) / ( \u2211_{\u03bd\u2208M} w^\u03bd_0 \u03bd(x_{<n}) ).\n\nA fundamental property of Bayesian mixtures is that if there exists a model \u03c1\u2217 \u2208 M that predicts\nwell, then \u03be will predict well since the cumulative loss satisfies\n\n\u2212 log \u03be(x_{1:n}) = \u2212 log \u2211_{\u03c1\u2208M} w^\u03c1_0 \u03c1(x_{1:n}) \u2264 \u2212 log w^{\u03c1\u2217}_0 \u2212 log \u03c1\u2217(x_{1:n}).   (1)\n\nEquation 1 implies that a constant regret is suffered when using \u03be in place of the best (in hindsight)\nmodel within M.\n\n3 The Forget-me-not Process\n\nWe now introduce the Forget-me-not Process (FMN), a meta-algorithm designed to better model\npiecewise-repeating data generating sources. As FMN is a meta-algorithm, it takes as input a base\nmodel, which we will hereby denote as \u03bd. At a high level, the main idea is to extend the Partition\nTree Weighting [17] algorithm to incorporate a memory of previous model states, which is used\nto improve performance on repeated tasks. More concretely, our construction involves defining a\ntwo-level hierarchical process, with each level performing exact Bayesian model averaging. The first\nlevel will perform model averaging over a set of postulated segmentations of time, using the Partition\nTree Weighting technique. The second level will perform model averaging over a growing set of\nstored base model states. We describe each level in turn before describing how to combine these\nideas into the Forget-me-not Process.\n\nAveraging over Temporal Segmentations. We now define the class of binary temporal partitions,\nwhich will correspond to the set of temporal partitions we perform model averaging over in the first\nlevel of our hierarchical model. 
Although more restrictive than the class of all possible temporal\npartitions, binary temporal partitions possess important computational advantages.\nDefinition 1. Given a depth parameter d \u2208 N and a time t \u2208 N, the set C_d(t) of all binary temporal\npartitions from t is recursively defined by\n\nC_d(t) := { {(t, t + 2^d \u2212 1)} } \u222a { S_1 \u222a S_2 : S_1 \u2208 C_{d\u22121}(t), S_2 \u2208 C_{d\u22121}(t + 2^{d\u22121}) },\n\nwith C_0(t) := { {(t, t)} }. We also define C_d := C_d(1).\n\nFigure 2: The set C_2 represented as a collection of temporal partition trees (one tree for each of the five partitions {(1,4)}, {(1,2),(3,4)}, {(1,1),(2,2),(3,4)}, {(1,2),(3,3),(4,4)} and {(1,1),(2,2),(3,3),(4,4)}).\n\nEach binary temporal partition can be naturally mapped onto a tree structure known as a partition tree;\nfor example, Figure 2 shows the collection of partition trees represented by C_2; the leaves of each\ntree correspond to the segments within each particular partition. There are two important properties\nof binary temporal partition trees. The first is that there always exists a partition P\u2032 \u2208 C_d which is\nclose to any temporal partition P, in the sense that P\u2032 always starts a new segment whenever P does,\nand |P\u2032| \u2264 |P|(\u2308log n\u2309 + 1) [17, Lemma 2]. The second is that exact Bayesian model averaging can\nbe performed efficiently with an appropriate choice of prior. This is somewhat surprising, since the\nnumber of binary temporal partitions |C_d| grows double exponentially in d. The trick is to define,\ngiven a data sequence x_{1:n}, the Bayesian mixture\n\nPTW_d(x_{1:n}) := \u2211_{P\u2208C_d} 2^{\u2212\u0393_d(P)} \u220f_{(a,b)\u2208P} \u03c1(x_{a:b}),   (2)\n\nwhere \u0393_d(P) gives the number of nodes in the partition tree associated with P that have a depth less\nthan d and \u03c1 denotes the base model to the PTW process. 
This prior weighting is identical to how\nthe Context Tree Weighting method [22] weighs over tree structures, and is an application of the\ngeneral technique used by the class of Tree Experts described in Section 5.3 of [3]. It is a valid prior,\nas one can show \u2211_{P\u2208C_d} 2^{\u2212\u0393_d(P)} = 1 for all d \u2208 N. A direct computation of Equation 2 is clearly\nintractable, but we can make use of the tree structured prior to recursively decompose Equation 2\nusing the following lemma.\nLemma 1 (Veness et al. [17]). For any depth d \u2208 N, for all x_{1:n} \u2208 X^n satisfying n \u2264 2^d,\n\nPTW_d(x_{1:n}) = (1/2) \u03c1(x_{1:n}) + (1/2) PTW_{d\u22121}(x_{1:k}) PTW_{d\u22121}(x_{k+1:n}),\n\nwhere k = 2^{d\u22121}.\n\nAveraging over Previous Model States given a Known Temporal Partition. Given a data sequence\nx_{1:n} \u2208 X^n, a base model \u03c1 and a temporal partition P := {(a_1, b_1), . . . , (a_m, b_m)} satisfying\nP \u2208 T_n, consider a sequential probabilistic model defined by\n\n\u03c0_P(x_{1:n}) := \u220f_{i=1}^{|P|} ( \u2211_{\u03c1\u2208M_i} (1/|M_i|) \u03c1(x_{a_i:b_i}) ),\n\nwhere M_1 := {\u03c1} and M_i := M_{i\u22121} \u222a {\u03c1[x_{a_i:b_i}]}_{\u03c1\u2208M_{i\u22121}} for 1 < i \u2264 |P|.\nHere, whenever the ith segment of data is seen, each model in M_i is given the option of either\nignoring or adapting to this segment\u2019s data, which implies |M_i| = 2 |M_{i\u22121}|. 
Using an argument\nsimilar to Equation 1, and letting x^{h(t)}_{<t} denote the subsequence of x_{<t} generated by \u00b5_{h(t)}, we can see\nthat the cumulative loss when the data is generated by a piecewise-repeating source \u00b5^h_P is bounded by\n\n\u2212 log \u03c0_P(x_{1:n}) = \u2212 log \u220f_{i=1}^{|P|} ( \u2211_{\u03c1\u2208M_i} (1/|M_i|) \u03c1(x_{a_i:b_i}) ) = \u2212 log \u220f_{i=1}^{|P|} ( \u2211_{\u03c1\u2208M_i} 2^{\u2212i+1} \u03c1(x_{a_i:b_i}) )\n\u2264 \u2212 log \u220f_{i=1}^{|P|} 2^{\u2212i+1} \u03c1( x_{a_i:b_i} | x^{h(a_i)}_{<a_i} ) = (|P|^2 \u2212 |P|)/2 \u2212 log \u220f_{i=1}^{|P|} \u03c1( x_{a_i:b_i} | x^{h(a_i)}_{<a_i} ).   (3)\n\nRoughly speaking, this bound implies that \u03c0_P(x_{1:n}) will perform almost as well as if we knew\nh(\u00b7) in advance, provided the number of segments grows o(\u221an). The two main drawbacks with\nthis approach are that: a) computing \u03c0_P(x_{1:n}) takes time exponential in |P|; and b) a regret of\n(|P|^2 \u2212 |P|)/2 seems overly large in cases where the source isn\u2019t repeating. 
These problems can be\nrectified with the following modified process,\n\n\u03bd_P(x_{1:n}) := \u220f_{i=1}^{|P|} ( (1/2) \u03c1(x_{a_i:b_i}) + (1/2) \u2211_{\u03c1\u2032\u2208M_i\\{\u03c1}} (1/(|M_i|\u22121)) \u03c1\u2032(x_{a_i:b_i}) ),   (4)\n\nwhere now M_1 := {\u03c1} and M_i := M_{i\u22121} \u222a { \u03c1\u2217[x_{a_i:b_i}] | \u03c1\u2217 = argmax_{\u03c1\u2208M_{i\u22121}} {\u03c1(x_{a_i:b_i})} }.\n\nFigure 3: A graphical depiction of the Forget Me Not process (d = 3) after processing 7 symbols (horizontal axis: Time, from 1 to 8; vertical axis: Depth, from 0 to 3).\n\nWith this modified definition of M_i, where the argmax implements a greedy approximation (ties are\nbroken arbitrarily), |M_i| now grows linearly with the number of segments, and thus the overall time\nto compute \u03bd_P(x_{1:n}) is O(|P| n) assuming the base model runs in linear time. Although heuristic,\nthis approximation is justified provided that \u03c1[\u03b5] assigns the highest probability out of any model in\nM_i whenever a task is seen for the first time, and that a model trained on k segments for a given task\nis always better than a model trained on less than k segments for the same task (or a model trained on\nany number of other tasks). Furthermore, using a similar dominance argument to Equations 1 and 3,\nthe cost of not knowing h(\u00b7) with respect to piecewise non-repeating sources is now |P| vs O(|P|^2).\n\nAveraging over Binary Temporal Segmentations and Previous Model States. This section describes\nhow to hierarchically combine the PTW and \u03bd_P models to give rise to the Forget Me Not\nprocess. Our goal will be to perform model averaging over both binary temporal segmentations and\nprevious model states. 
This can be achieved by instantiating the PTW meta-algorithm with a sequence\nof time dependent base models similar in spirit to \u03bd_P.\nIntuitively, this requires modifying the definition of M_i so that the best performing model state, for\nany completed segment within the PTW process, is available for future predictions. For example,\nFigure 3 provides a graphical depiction of our desired FMN_3 process after processing 7 symbols.\nThe dashed segments ending in unfilled circles describe the segments whose set of base models\nare contributing to the predictive distribution at time 8. The solid-line segments denote previously\ncompleted segments for which we want the best performing model state to be remembered and made\navailable to segments starting at later times. A solid circle indicates a time where a model is added to\nthe pool of available models; note that now multiple models can be added at any particular time.\nWe now formalize the above intuitions. Let B_t := {(a, b) \u2208 C_d : b = t} be the set of segments ending\nat time t \u2264 2^d. Given an arbitrary string s \u2208 X^\u2217, our desired sequence of base models is given by\n\n\u03bd_t(s) := (1/2) \u03c1(s) + (1/2) \u2211_{\u03c1\u2032\u2208M_t\\{\u03c1}} (1/(|M_t|\u22121)) \u03c1\u2032(s),   (5)\n\nwith the model pool defined by M_1 := {\u03c1} and\n\nM_t := M_{t\u22121} \u222a \u22c3_{(a,b)\u2208B_{t\u22121}} { \u03c1\u2217[s_{a:b}] | \u03c1\u2217 = argmax_{\u03c1\u2208M_a} {\u03c1(s_{a:b})} } for t > 1.   (6)\n\nFinally, substituting Equation 5 in for the base model of PTW yields our Forget Me Not process\n\nFMN_d(x_{1:n}) := \u2211_{P\u2208C_d} 2^{\u2212\u0393_d(P)} \u220f_{(a,b)\u2208P_n} \u03bd_a(x_{a:b}).   (7)\n\nAlgorithm. Algorithm 1 describes how to compute the marginal probability FMN_d(x_{1:n}). 
The r_j\nvariables store the segment start times for the unclosed segments at depth j; the b_j variables implement\na dynamic programming caching mechanism to speed up the PTW computation as explained in Section\n3.3 of [17]; the w_j variables hold intermediate results needed to apply Lemma 1. The Most Significant\nChanged Bit routine MSCB_d(t), invoked at line 4, is used to determine the range of segments ending\nat the current time t, and is defined for t > 1 as the number of bits to the left of the most significant\nlocation at which the d-bit binary representations of t \u2212 1 and t \u2212 2 differ, with MSCB_d(1) := 0 for all\nd \u2208 N. For example, in Figure 3, at t = 5, before processing x_5, we need to deal with the segments\n(1, 4), (3, 4), (4, 4) finishing. The method UPDATEMODELPOOL applies Equation 6 to remember\nthe best performing model in the mixture \u03bd_{r_j} on the completed segment (r_j, t \u2212 1). Lines 11 to 13\ninvoke Lemma 1 from bottom-up, to compute the desired marginal probability FMN_d(x_{1:n}) = w_0.\n\nAlgorithm 1 FORGET-ME-NOT - FMN_d(x_{1:n})\nRequire: A depth parameter d \u2208 N, and a base probabilistic model \u03c1\nRequire: A data sequence x_{1:n} \u2208 X^n satisfying n \u2264 2^d\n1: b_j \u2190 1, w_j \u2190 1, r_j \u2190 1, for 0 \u2264 j \u2264 d\n2: M \u2190 {\u03c1}\n3: for t = 1 to n do\n4:     i \u2190 MSCB_d(t)\n5:     b_i \u2190 w_{i+1}\n6:     for j = i + 1 to d do\n7:         M \u2190 UPDATEMODELPOOL(\u03bd_{r_j}, x_{r_j:t\u22121})\n8:         w_j \u2190 1, b_j \u2190 1, r_j \u2190 t\n9:     end for\n10:    w_d \u2190 \u03bd_{r_d}(x_{r_d:t})\n11:    for i = d \u2212 1 to 0 do\n12:        w_i \u2190 (1/2) \u03bd_{r_i}(x_{r_i:t}) + (1/2) w_{i+1} b_i\n13:    end for\n14: end for\n15: return w_0\n\n(Space and Time Overhead) Under the assumption that each base model conditional probability can\nbe obtained in O(1) time, the time complexity to process a sequence of length n is O(nk log n),\nwhere k is an upper bound on |M|. 
The n log n factor is due to the number of iterations in the inner\nloops on Lines 6 to 9 and Lines 11 to 13 being upper bounded by d + 1. The k factor is due to the\ncost of maintaining the \u03bd_t terms for the segments which have not yet closed. An upper bound on k\ncan be obtained from inspection of Figure 3, where if we set n = 2^d, we have that the number of\ncompleted segments is given by \u2211_{i=0}^{d} 2^i = 2^{d+1} \u2212 1 = 2n \u2212 1 = O(n); thus the time complexity is\nO(n^2 log n). The space overhead is O(k log n), due to the O(log n) instances of Equation 5.\n(Complexity Reducing Operations) For many applications of interest, a running time of O(n^2 log n)\nis unacceptable. A workaround is to fix k in advance and use a model replacement strategy that\nenforces |M| \u2264 k via a modified UPDATEMODELPOOL routine; this reduces the time complexity to\nO(nk log n). We found the following heuristic scheme to be effective in practice: when a segment\n(a, b) closes, the best performing model \u03c1\u2217 \u2208 M_a for this segment is identified. Now, 1) letting y_{\u03c1\u2217}\ndenote a uniform sub-sample of the data used to train \u03c1\u2217, if log \u03c1\u2217[x_{a:b}](y_{\u03c1\u2217}) \u2212 log \u03c1\u2217(y_{\u03c1\u2217}) > \u03b1\nthen replace \u03c1\u2217 with \u03c1\u2217[x_{a:b}] in M; else 2) if a uniform Bayes mixture \u03be over M assigns sufficiently\nhigher probability to a uniform sub-sample s of x_{a:b} than \u03c1\u2217 does, that is log \u03be(s) \u2212 log \u03c1\u2217(s) > \u03b2,\nthen leave M unchanged; else 3) add \u03c1\u2217[x_{a:b}] to M; if |M| > k, remove the oldest model in M.\nThis requires choosing hyperparameters \u03b1, \u03b2 \u2208 R and appropriate constant sub-sample sizes. Step\n1 avoids adding multiple models for the same task; Step 2 avoids adding a redundant model to the\nmodel pool. Note that the per model and per segment sub-samples can be efficiently maintained\nonline using reservoir sampling [19]. 
As a further complexity reducing operation, one can skip calls\nto UPDATEMODELPOOL unless (b \u2212 a + 1) \u2265 2^c for some c < d.\n(Strongly Online Prediction) A strongly online FMN process, where one does not need to fix a d in\nadvance such that n \u2264 2^d, can be obtained by defining FMN(x_{1:n}) := \u220f_{i=1}^{n} FMN_{\u2308log i\u2309}(x_i | x_{<i}), and\nefficiently computed in the same manner as for PTW, with a similar loss bound \u2212 log FMN(x_{1:n}) \u2264\n\u2212 log FMN_d(x_{1:n}) + \u2308log n\u2309(log 3 \u2212 1) following trivially from Theorem 2 in [17].\n\nTheoretical properties. We now show that the Forget Me Not process is competitive with any\npiecewise stationary source, provided the base model enjoys sufficiently strong regret guarantees on\nnon-piecewise sources. Note that provided c = 0, Proposition 1 also holds when the complexity\nreducing operations are used. While the following regret bound is of the same asymptotic order as\nPTW for piecewise stationary sources, note that it is no tighter for sources that repeat; we will later\nexplore the advantage of the FMN process on repeating sources experimentally.\nProposition 1. For all n \u2208 N, using FMN with d = \u2308log n\u2309 and a base model \u03c1 whose redundancy\nis upper bounded by a non-negative, monotonically non-decreasing, concave function g : N \u2192 R\nwith g(0) = 0 on some class G of bounded memory data generating sources, the regret\n\nlog ( \u00b5^h_P(x_{1:n}) / FMN_d(x_{1:n}) ) \u2264 2|P_n| (\u2308log n\u2309 + 1) + |P_n| g( \u2308 n / (|P_n|(\u2308log n\u2309 + 1)) \u2309 ) (\u2308log n\u2309 + 1) + |P_n|,\n\nwhere \u00b5 is a piecewise stationary data generating source, and the data in each of the stationary\nregions P \u2208 T_n is distributed according to some source in G.\nProof. 
First observe that for all x_{1:n} \u2208 X^n we can lower bound the probability\n\nFMN_d(x_{1:n}) = \u2211_{P\u2208C_d} 2^{\u2212\u0393_d(P)} \u220f_{(a,b)\u2208P_n} \u03bd_a(x_{a:b}) \u2265 \u2211_{P\u2208C_d} 2^{\u2212\u0393_d(P)} \u220f_{(a,b)\u2208P_n} (1/2) \u03c1(x_{a:b})\n= 2^{\u2212|P_n|} \u2211_{P\u2208C_d} 2^{\u2212\u0393_d(P)} \u220f_{(a,b)\u2208P_n} \u03c1(x_{a:b}) = 2^{\u2212|P_n|} PTW_d(x_{1:n}).\n\nHence we have that \u2212 log FMN_d(x_{1:n}) \u2264 |P_n| \u2212 log PTW_d(x_{1:n}). The proof is completed by using\nTheorem 1 from [17] to upper bound \u2212 log PTW_d(x_{1:n}).\n\n4 Experimental Results\n\nWe now report some experimental results with the FMN algorithm across three test domains. The first\ntwo domains, The Mysterious Bag of Coins and A Fistful of Digits, are repeating sequence prediction\ntasks. The final domain, Continual Atari 2600 Task Identification, is a video stream of game-play\nfrom a collection of Atari games provided by the ALE [2] framework; here we qualitatively assess the\ncapabilities of the FMN process to provide meaningful task labels online from high dimensional input.\n\nDomain Description. (Mysterious Bag of Coins) Our first domain is a sequence prediction game\ninvolving a predictor, an opponent and a bag of m biased coins. Flipping the ith coin involves\nsampling a value from a parametrized Bernoulli distribution B(\u03b8_i), with \u03b8_i \u2208 [0, 1] for 1 \u2264 i \u2264 m.\nThe predictor knows neither how many coins are in the bag, nor the value of the \u03b8_i parameters. The\ndata is generated by having the opponent flip a single coin (the choice of which is hidden from the\npredictor) drawn uniformly from the bag for X \u223c G(0.005) flips, and repeating, where G(\u03b8) denotes\nthe geometric distribution with success probability \u03b8. At each time step t, the predictor outputs a\ndistribution \u03c1_t : {0, 1} \u2192 [0, 1], and suffers an instantaneous loss of \u2113_t(x_t) := \u2212 log \u03c1_t(x_t). 
Here\nwe test whether the FMN process can robustly identify change points, and exploit the knowledge that\nsome segments of data appear to be similarly distributed.\n(A Fistful of Digits) The second test domain uses a similar setup to The Mysterious Bag of Coins,\nexcept that now each observation is a 28 \u00d7 28 binary image taken from the MNIST [15] data set.\nWe partitioned the MNIST data into m = 10 classes, one for each distinct digit, which we used\nto derive ten digit-specific empirical distributions. After picking a digit class, a random number\nY = 200 + X of examples, where X \u223c G(0.01), are sampled (with replacement) from the associated empirical\ndistribution, before repeating the digit selection and generation process. Similar to before, the\npredictor is required to output a distribution \u03c1_t : {0, 1}^{28\u00d728} \u2192 [0, 1] over the possible outcomes,\nsuffering an instantaneous loss of \u2113_t(x_t) := \u2212 log \u03c1_t(x_t) at each time step.\n(Continual Atari 2600 Task Identification) Our third domain consists of a sequence of sampled Atari\n2600 frames. Each frame has been downsampled to a 28 \u00d7 28 resolution and a 3 bit color space for\nreasons of computational efficiency. The sequence of frames is generated by first picking a game\nuniformly at random from a set of 45 Atari games (for which a game-specific DQN [16] policy is\navailable), and then generating a random number Y = 200 + X of frames, where X \u223c G(0.005).\nEach action is chosen by the relevant game specific DQN controller, which uses an epsilon-greedy\npolicy. 
Once Y frames have been generated, the process is then repeated.\n\nAlgorithm      Average Cumulative Regret\nKT             783.86 \u00b1 7.79\nPTW + KT       157.19 \u00b1 0.77\nFMN + KT       148.43 \u00b1 0.75\nFMN\u2217 + KT      147.75 \u00b1 0.74\n\nAlgorithm      Average Per Digit Loss\nMADE           94.08 \u00b1 0.05\nPTW + MADE     94.08 \u00b1 0.05\nFMN + MADE     86.12 \u00b1 0.28\nOracle         82.81 \u00b1 0.06\n\nFigure 4: (Left) Results on the Mysterious Bag of Coins; (Right) Results on a Fistful of Digits.\n\nResults. We now describe our experimental setup and results. The following base models were\nchosen for each test domain: for the Mysterious Bag of Coins (MBOC), we used the KT-estimator\n[13], a beta-binomial model; for A Fistful of Digits (FOD), we used MADE [9], a recently introduced,\ngeneral purpose neural density estimator, with 500 hidden units, trained online using ADAGRAD [8]\nwith a learning rate of 0.1; MADE was also the base model for the Continual Atari task, but here a\nsmaller network consisting of 50 neurons was used for reasons of computational efficiency.\n(Sequence Prediction) For each domain, we compared the performance of the base model, the base\nmodel combined with PTW and the base model combined with the FMN process. We also report\nthe performance relative to a domain specific oracle: for the MBOC domain, the oracle is the true\ndata generating source, which has the (unfair) advantage of knowing the location of all potential\nchange-points and task-specific data generating distributions; for the FOD domain, we trained a\nclass conditional MADE model for each digit offline, and applied the relevant task-specific model to\neach segment. Regret is reported for MBOC since we know the true data generating source, whereas\nloss is reported for FOD. All results are reported in nats. The sequence length and number of\nrepeated runs for MBOC and FOD was 5k/10k and 221/64 respectively. 
For the MBOC experiment\nwe set m = 7 and generated each \u03b8i uniformly at random. Our sequence prediction results for each\ndomain are summarized in Figure 4, with 95% con\ufb01dence intervals provided. Here FMN\u2217 denotes the\nForget-me-not algorithm without the complexity reducing techniques previously described (these\nresults are only feasible to produce on MBOC). For the FMN results, the MBOC hyper-parameters\nare k = 15, \u03b1 = 0, \u03b2 = 0, c = 4 and sub-sample sizes of 100; the FOD hyper-parameters are\nk = 30, \u03b1 = 0.2, \u03b2 = 0.06, c = 4 with sub-sample sizes of 10. Here we see a clear advantage to\nusing the FMN process compared with PTW, and that no signi\ufb01cant performance is lost by using the\nlow complexity version of the algorithm.\nDigging a bit deeper, it is interesting to note the inability of PTW to improve upon the performance of\nthe base model on FOD. This is in contrast to the FMN process, whose ability to remember previous\nmodel states allows it to, over time, develop specialized models across digit speci\ufb01c data from\nmultiple segments, even in the case where the base model can be relatively slow to adapt online.\nThe reverse effect occurs in MBOC, where both FMN and PTW provide a large improvement over the\nperformance of the base model. The advantage of being able to remember is much smaller here due\nto the speed at which the KT base model can learn, although not insigni\ufb01cant. It is also worth noting\nthat a performance improvement is obtained even though each individual observation is by itself not\ninformative; the FMN process is exploiting the statistical similarity of the outcomes across time.\n(Online Task Identi\ufb01cation) A video demonstrating real-time segmentation of Atari frames can be\nfound at: http://tinyurl.com/FMNVideo. Here we see that the (low complexity) FMN\nquickly learns 45 game speci\ufb01c models, and performs an excellent job of routing experience to\nthe appropriate model. 
These results provide evidence that this technique can scale to long, high\ndimensional input sequences using state of the art density models.\n\n5 Conclusion\n\nWe introduced the Forget-me-not Process, an efficient, non-parametric meta-algorithm for online\nprobabilistic sequence prediction and task-segmentation for piecewise stationary, repeating sources.\nWe provided regret guarantees with respect to piecewise stationary data sources under the logarithmic\nloss, and validated the method empirically across a range of sequence prediction and task identification\nproblems. For future work, it would be interesting to see whether a single Multiple Model-based\nReinforcement Learning [7] agent could be constructed using the Forget-me-not process for task\nidentification. Alternatively, the FMN process could be used to augment the conditional state density\nmodels used for value estimation in [18]. Such systems would have the potential to be able to learn to\nsimultaneously play many different Atari games from a single stream of experience, as opposed to\nprevious efforts [16, 18] where game specific controllers were learnt independently.\n\nReferences\n[1] Ryan Prescott Adams and David J.C. MacKay. Bayesian Online Changepoint Detection. In arXiv,\nhttp://arxiv.org/abs/0710.3742, 2007.\n\n[2] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation\nplatform for general agents. Journal of Artificial Intelligence Research, 47:253\u2013279, 06 2013.\n\n[3] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press,\nNew York, NY, USA, 2006. ISBN 0521841089.\n\n[4] Anne Collins and Etienne Koechlin. Reasoning, learning, and creativity: Frontal lobe function and human\ndecision-making. PLoS Biol, 10(3):1\u201316, 03 2012.\n\n[5] Anne G.E. Collins and Michael J. Frank. 
Cognitive Control over Learning: Creating, Clustering and\nGeneralizing Task-Set Structure. Psychological Review, 120(1):190\u2013229, 2013.\n\n[6] Ma\u00ebl Donoso, Anne G. E. Collins, and Etienne Koechlin. Foundations of human reasoning in the prefrontal\ncortex. Science, 344(6191):1481\u20131486, 2014. doi: 10.1126/science.1252254.\n\n[7] Kenji Doya and Kazuyuki Samejima. Multiple model-based reinforcement learning. Neural Computation,\n14:1347\u20131369, 2002.\n\n[8] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and\nStochastic Optimization. Journal of Machine Learning Research (JMLR), 12:2121\u20132159, 07 2011.\n\n[9] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: masked autoencoder for\ndistribution estimation. In Proceedings of the 32nd International Conference on Machine Learning, JMLR\nW&CP, volume 37, pages 881\u2013889, 2015.\n\n[10] A. Gy\u00f6rgy, T. Linder, and G. Lugosi. Efficient tracking of large classes of experts. IEEE Transactions on\nInformation Theory, 58(11):6709\u20136725, 2011.\n\n[11] E. Hazan and C. Seshadhri. Efficient learning algorithms for changing environments. In Proceedings of\nthe 26th Annual International Conference on Machine Learning, pages 393\u2013400. ACM, 2009.\n\n[12] Marcus Hutter. On universal prediction and Bayesian confirmation. Theoretical Computer Science, 384(1):\n33\u201348, 2007.\n\n[13] R. Krichevsky and V. Trofimov. The performance of universal encoding. IEEE Transactions on\nInformation Theory, 27(2):199\u2013207, 1981.\n\n[14] Tor Lattimore, Marcus Hutter, and Peter Sunehag. Concentration and confidence for discrete bayesian\nsequence predictors. In Sanjay Jain, R\u00e9mi Munos, Frank Stephan, and Thomas Zeugmann, editors,\nProceedings of the 24th International Conference on Algorithmic Learning Theory, pages 324\u2013338.\nSpringer, 2013.\n\n[15] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 
Gradient-based learning applied to document recognition.\nProceedings of the IEEE, 86(11):2278\u20132324, Nov 1998.\n\n[16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare,\nAlex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie,\nAmir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis\nHassabis. Human-level control through deep reinforcement learning. Nature, 518, 2015.\n\n[17] J. Veness, M. White, M. Bowling, and A. Gyorgy. Partition tree weighting. In Data Compression\nConference (DCC), pages 321\u2013330, March 2013.\n\n[18] Joel Veness, Marc G. Bellemare, Marcus Hutter, Alvin Chua, and Guillaume Desjardins. Compress and\ncontrol. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30,\n2015, Austin, Texas, USA., pages 3016\u20133023, 2015.\n\n[19] Jeffrey S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37\u201357, March 1985.\nISSN 0098-3500. doi: 10.1145/3147.3165.\n\n[20] F. Willems and M. Krom. Live-and-die coding for binary piecewise i.i.d. sources. In Proceedings of the\n1997 IEEE International Symposium on Information Theory, page 68, 1997.\n\n[21] Frans M. J. Willems. Coding for a binary independent piecewise-identically-distributed source. IEEE\nTransactions on Information Theory, 42:2210\u20132217, 1996.\n\n[22] Frans M.J. Willems, Yuri M. Shtarkov, and Tjalling J. Tjalkens. The Context Tree Weighting Method:\nBasic Properties. 
IEEE Transactions on Information Theory, 41:653\u2013664, 1995.", "award": [], "sourceid": 1836, "authors": [{"given_name": "Kieran", "family_name": "Milan", "institution": "Google DeepMind"}, {"given_name": "Joel", "family_name": "Veness", "institution": "DeepMind"}, {"given_name": "James", "family_name": "Kirkpatrick", "institution": "Google DeepMind"}, {"given_name": "Michael", "family_name": "Bowling", "institution": "University of Alberta"}, {"given_name": "Anna", "family_name": "Koop", "institution": "University of Alberta"}, {"given_name": "Demis", "family_name": "Hassabis", "institution": "DeepMind"}]}