{"title": "Minimax Time Series Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 2557, "page_last": 2565, "abstract": "We consider an adversarial formulation of the problem ofpredicting a time series with square loss. The aim is to predictan arbitrary sequence of vectors almost as well as the bestsmooth comparator sequence in retrospect. Our approach allowsnatural measures of smoothness such as the squared norm ofincrements. More generally, we consider a linear time seriesmodel and penalize the comparator sequence through the energy ofthe implied driving noise terms. We derive the minimax strategyfor all problems of this type and show that it can be implementedefficiently. The optimal predictions are linear in the previousobservations. We obtain an explicit expression for the regret interms of the parameters defining the problem. For typical,simple definitions of smoothness, the computation of the optimalpredictions involves only sparse matrices. In the case ofnorm-constrained data, where the smoothness is defined in termsof the squared norm of the comparator's increments, we show thatthe regret grows as $T/\\sqrt{\\lambda_T}$, where $T$ is the lengthof the game and $\\lambda_T$ is an increasing limit on comparatorsmoothness.", "full_text": "Minimax Time Series Prediction\n\nWouter M. Koolen\n\nCentrum Wiskunde & Informatica\n\nwmkoolen@cwi.nl\n\nAlan Malek\nUC Berkeley\n\nmalek@berkeley.edu\n\nPeter L. Bartlett\n\nUC Berkeley & QUT\n\nbartlett@cs.berkeley.edu\n\nYasin Abbasi-Yadkori\n\nQueensland University of Technology\n\nyasin.abbasiyadkori@qut.edu.au\n\nAbstract\n\nWe consider an adversarial formulation of the problem of predicting a time series\nwith square loss. The aim is to predict an arbitrary sequence of vectors almost\nas well as the best smooth comparator sequence in retrospect. 
Our approach allows natural measures of smoothness such as the squared norm of increments. More generally, we consider a linear time series model and penalize the comparator sequence through the energy of the implied driving noise terms. We derive the minimax strategy for all problems of this type and show that it can be implemented efficiently. The optimal predictions are linear in the previous observations. We obtain an explicit expression for the regret in terms of the parameters defining the problem. For typical, simple definitions of smoothness, the computation of the optimal predictions involves only sparse matrices. In the case of norm-constrained data, where the smoothness is defined in terms of the squared norm of the comparator's increments, we show that the regret grows as $T/\sqrt{\lambda_T}$, where $T$ is the length of the game and $\lambda_T$ is an increasing limit on comparator smoothness.

1 Introduction

In time series prediction, tracking, and filtering problems, a learner sees a stream of (possibly noisy, vector-valued) data and needs to predict the future path. One may think of robot poses, meteorological measurements, stock prices, etc. Popular stochastic models for such tasks include the auto-regressive moving average (ARMA) model in time series analysis, Brownian motion models in finance, and state space models in signal processing.
In this paper, we study the time series prediction problem in the regret framework; instead of making assumptions on the data generating process, we ask: can we predict the data sequence online almost as well as the best offline prediction method in some comparison class (in this case, offline means that the comparator only needs to model the data sequence after seeing all of it)?
Our main contribution is computing the exact minimax strategy for a range of time series prediction problems. As a concrete motivating example, let us pose the simplest nontrivial such minimax problem (with the convention $\hat a_0 = \hat a_{T+1} = 0$):

$$\min_{a_1} \max_{x_1 \in B} \cdots \min_{a_T} \max_{x_T \in B} \Bigg\{ \underbrace{\sum_{t=1}^{T} \|a_t - x_t\|^2}_{\text{Loss of Learner}} - \min_{\hat a_1, \ldots, \hat a_T} \Bigg( \underbrace{\sum_{t=1}^{T} \|\hat a_t - x_t\|^2}_{\text{Loss of Comparator}} + \lambda_T \underbrace{\sum_{t=1}^{T+1} \|\hat a_t - \hat a_{t-1}\|^2}_{\text{Comparator Complexity}} \Bigg) \Bigg\}. \quad (1)$$

This notion of regret is standard in online learning, going back at least to [1] in 2001, which views it as the natural generalization of $L_2$ regularization to deal with non-stationary comparators. We offer two motivations for this regularization. First, one can interpret the complexity term as the magnitude
Notice that it is natural to allow \u03bbT to grow with T , since that penalizes the\ncomparator\u2019s change per round more than the loss per round.\nFor the particular problem (1) we obtain an ef\ufb01cient algorithm using amortized O(d) time per round,\nwhere d is the dimension of the data; there is no nasty dependence on T as often happens with min-\nimax algorithms. Our general minimax analysis extends to more advanced complexity terms. For\nexample, we may regularize instead by higher-order smoothness (magnitude of increments of incre-\nments, etc.), or more generally, we may consider a \ufb01xed linear process and regularize the comparator\nby the energy of its implied driving noise terms (innovations). We also deal with arbitrary sequences\nof rank-one quadratic constraints on the data.\nWe show that the minimax algorithm is of a familiar nature; it is a linear \ufb01lter, with a twist. Its\ncoef\ufb01cients are not time-invariant but instead arise from the intricate interplay between the regular-\nization and the range of the data, combined with shrinkage. Fortunately, they may be computed in\na pre-processing step by a simple recurrence. An unexpected detail of the analysis is the follow-\ning. As we will show, the regret objective in (1) is a convex quadratic function of all data, and the\nsub-problem objectives that arise from the backward induction steps in the minimax analysis remain\nquadratic functions of the past. However, they may be either concave or convex. Changing direction\nof curvature is typically a source of technical dif\ufb01culty: the minimax solution is different in either\ncase. Quite remarkably, we show that one can determine a priori which rounds are convex and which\nare concave and apply the appropriate solution method in each.\nWe also consider what happens when the assumptions we need to make for the minimax analysis to\ngo through are violated. 
We will show that the obtained minimax algorithm is in fact highly robust. Simply applying it unlicensed anyway results in adaptive regret bounds that scale naturally with the realized data magnitude (or, more generally, its energy).

1.1 Related Work

There is a rich history of tracking problems in the expert setting. In this setting, the learner has some finite number of actions to play and must select a distribution over actions to play each round in such a way as to guarantee that the loss is almost as small as the best single action in hindsight. The problem of tracking the best expert forces the learner to compare with sequences of experts (usually with some fixed number of switches). The fixed-share algorithm [2] was an early solution, but there has been more recent work [3, 4, 5, 6]. Tracking experts has been applied to other areas; see e.g. [7] for an application to sequential allocation. An extension to linear combinations of experts where the expert class is penalized by the p-norm of the sequence was considered in [1].
Minimax algorithms for squared Euclidean loss have been studied in several contexts such as Gaussian density estimation [8] and linear regression [9]. In [10], the authors showed that the minimax algorithm for quadratic loss is Follow the Leader (i.e. predicting the previous data mean) when the player is constrained to play in a ball around the previous data mean. Additionally, Moroshko and Crammer [11, 12] propose a weak notion of non-stationarity that allows them to apply the last-step minimax approach to a regression-like framework.
The tracking problem in the regret setting has been considered previously, e.g. 
[1], where the authors studied the best linear predictor with a comparison class of all sequences with bounded smoothness $\sum_t \|a_t - a_{t-1}\|^2$ and proposed a general method for converting regret bounds in the static setting to ones in the shifting setting (where the best expert is allowed to change).

Outline We start by presenting the formal setup in Section 2 and derive the optimal offline predictions. In Section 3 we zoom in to single-shot quadratic games, and solve these both in the convex and concave case. With this in hand, we derive the minimax solution to the time series prediction problem by backward induction in Section 4. In Section 5 we focus on the motivating problem (1) for which we give a faster implementation and tightly sandwich the minimax regret. Section 6 concludes with discussion, conjectures and open problems.

2 Protocol and Offline Problem

The game protocol is described in Figure 1 and is the usual online prediction game with squared Euclidean loss.

Figure 1: Protocol.
For t = 1, 2, . . . , T:
• Learner predicts $a_t \in \mathbb{R}^d$
• Environment reveals $x_t \in \mathbb{R}^d$
• Learner suffers loss $\|a_t - x_t\|^2$.

The goal of the learner is to incur small regret, that is, to predict the data almost as well as the best complexity-penalized sequence $\hat a_1 \cdots \hat a_T$ chosen in hindsight. Our motivating problem (1) gauged complexity by the sum of squared norms of the increments, thus encouraging smoothness. Here we generalize to complexity terms defined by a complexity matrix $K \succeq 0$, and charge the comparator $\hat a_1 \cdots \hat a_T$ by $\sum_{s,t} K_{s,t} \hat a_s^\top \hat a_t$. We recover the smoothness penalty of (1) by taking $K$ to be the $T \times T$ tridiagonal matrix

$$K = \begin{pmatrix} 2 & -1 & & \\ -1 & 2 & -1 & \\ & \ddots & \ddots & \ddots \\ & & -1 & 2 \end{pmatrix}, \quad (2)$$

but we may also regularize by e.g. the sum of squared norms ($K = I$), the sum of norms of higher order increments, or more generally, we may consider a fixed linear process and take $K^{1/2}$ to be the matrix that recovers the driving noise terms from the signal, and then our penalty is exactly the energy of the implied noise for that linear process. We now turn to computing the identity and quality of the best competitor sequence in hindsight.

Theorem 1. For any complexity matrix $K \succeq 0$, regularization scalar $\lambda_T \ge 0$, and $d \times T$ data matrix $X_T = [x_1 \cdots x_T]$ the problem
$$L^* := \min_{\hat a_1, \ldots, \hat a_T} \sum_{t=1}^{T} \|\hat a_t - x_t\|^2 + \lambda_T \sum_{s,t} K_{s,t} \hat a_s^\top \hat a_t$$
has linear minimizer and quadratic value given by
$$[\hat a_1 \cdots \hat a_T] = X_T (I + \lambda_T K)^{-1} \quad \text{and} \quad L^* = \operatorname{tr}\big(X_T (I - (I + \lambda_T K)^{-1}) X_T^\top\big).$$

Proof. Writing $\hat A = [\hat a_1 \cdots \hat a_T]$ we can compactly express the offline problem as
$$L^* = \min_{\hat A} \operatorname{tr}\Big( (\hat A - X_T)^\top (\hat A - X_T) + \lambda_T K \hat A^\top \hat A \Big).$$
The $\hat A$ derivative of the objective is $2(\hat A - X_T) + 2 \lambda_T \hat A K$. Setting this to zero yields the minimizer $\hat A = X_T (I + \lambda_T K)^{-1}$. Back-substitution and simplification result in value $\operatorname{tr}\big(X_T (I - (I + \lambda_T K)^{-1}) X_T^\top\big)$.

Note that for the choice of $K$ in (2) computing the optimal $\hat A$ can be performed in $O(dT)$ time by solving the linear system $\hat A (I + \lambda_T K_T) = X_T$ directly.
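Theorem 1's closed form is easy to check numerically. The following is a minimal NumPy sketch (the helper name is ours, and the dense inverse is for clarity only; as noted in the text, the tridiagonal structure allows an $O(dT)$ solve instead):

```python
import numpy as np

def increment_K(T):
    # Complexity matrix K from (2): 2 on the diagonal, -1 on the off-diagonals.
    return 2 * np.eye(T) - np.eye(T, k=1) - np.eye(T, k=-1)

rng = np.random.default_rng(0)
d, T, lam = 3, 40, 5.0
X = rng.standard_normal((d, T))          # data matrix X_T = [x_1 ... x_T]
M = np.eye(T) + lam * increment_K(T)     # I + lambda * K

A_hat = X @ np.linalg.inv(M)             # offline minimizer [a_hat_1 ... a_hat_T]
L_star = np.trace(X @ (np.eye(T) - np.linalg.inv(M)) @ X.T)

# The closed-form value equals the objective evaluated at the minimizer.
objective = np.sum((A_hat - X) ** 2) + lam * np.trace(increment_K(T) @ A_hat.T @ A_hat)
assert np.isclose(L_star, objective)
```

The assertion confirms that $\operatorname{tr}(X_T(I - (I + \lambda_T K)^{-1})X_T^\top)$ matches the penalized loss of the comparator $\hat A = X_T(I + \lambda_T K)^{-1}$.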
This system decomposes into d (one per dimension) independent tridiagonal systems, each in T (one per time step) variables, which can each be solved in linear time using Gaussian elimination.
This theorem shows that the objective of our minimax problem is a quadratic function of the data. In order to solve a $T$ round minimax problem with quadratic regret objective, we first solve simple single round quadratic games.

3 Minimax Single-shot Squared Loss Games

One crucial tool in the minimax analysis of our tracking problem will be solving particular single-shot min-max games. In such games, the player and adversary play prediction $a$ and data $x$ resulting in payoff given by the following square loss plus a quadratic in $x$:
$$V(a, x) := \|a - x\|^2 + (\alpha - 1)\|x\|^2 + 2 b^\top x. \quad (3)$$
The quadratic and linear terms in $x$ have coefficients $\alpha \in \mathbb{R}$ and $b \in \mathbb{R}^d$. Note that $V(a, x)$ is convex in $a$ and either convex or concave in $x$ as decided by the sign of $\alpha$. The following result, proved in Appendix B.1 and illustrated for $\|b\| = 1$ by the figure to the right, gives the minimax analysis for both cases.

Theorem 2. Let $V(a, x)$ be as in (3). If $\|b\| \le 1$, then the minimax problem
$$V^* := \min_{a \in \mathbb{R}^d} \max_{x \in \mathbb{R}^d : \|x\| \le 1} V(a, x)$$
has value and minimizer
$$V^* = \begin{cases} \frac{\|b\|^2}{1 - \alpha} & \text{if } \alpha \le 0, \\ \|b\|^2 + \alpha & \text{if } \alpha \ge 0, \end{cases} \qquad a = \begin{cases} \frac{b}{1 - \alpha} & \text{if } \alpha \le 0, \\ b & \text{if } \alpha \ge 0. \end{cases} \quad (4)$$

We also want to look at the performance of this strategy when we do not impose the norm bound $\|x\| \le 1$ nor make the assumption $\|b\| \le 1$.
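As a sanity check on Theorem 2, the following sketch plays the claimed minimizer and verifies, by sampling points of the unit ball, that no admissible $x$ beats the claimed value (sampling gives only a one-sided check; the function names are ours):

```python
import numpy as np

def V(a, x, alpha, b):
    # Payoff (3): square loss plus a quadratic in x.
    return np.sum((a - x) ** 2) + (alpha - 1) * np.sum(x ** 2) + 2 * b @ x

def minimax_strategy(alpha, b):
    # Theorem 2: shrink toward 0 when the game is concave in x (alpha <= 0).
    return b / (1 - alpha) if alpha <= 0 else b

def minimax_value(alpha, b):
    return b @ b / (1 - alpha) if alpha <= 0 else b @ b + alpha

rng = np.random.default_rng(1)
d = 4
for alpha in (-1.5, -0.3, 0.2, 0.9):
    b = rng.standard_normal(d)
    b /= max(1.0, np.linalg.norm(b))          # enforce ||b|| <= 1
    a = minimax_strategy(alpha, b)
    xs = rng.standard_normal((2000, d))
    xs /= np.maximum(1.0, np.linalg.norm(xs, axis=1, keepdims=True))
    worst = max(V(a, x, alpha, b) for x in xs)
    # No sampled x in the unit ball should exceed the claimed minimax value.
    assert worst <= minimax_value(alpha, b) + 1e-9
```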
By evaluating (3) we obtain an adaptive expression that scales with the actual norm $\|x\|^2$ of the data.

Theorem 3. Let $a$ be the strategy from (4). Then, for any data $x \in \mathbb{R}^d$ and any $b \in \mathbb{R}^d$,
$$V(a, x) = \frac{\|b\|^2}{1 - \alpha} + \alpha \left\| \frac{b}{1 - \alpha} - x \right\|^2 \le \frac{\|b\|^2}{1 - \alpha} \quad \text{if } \alpha \le 0, \qquad \text{and} \qquad V(a, x) = \|b\|^2 + \alpha \|x\|^2 \quad \text{if } \alpha \ge 0.$$

These two theorems point out that the strategy in (4) is amazingly versatile. The former theorem establishes minimax optimality under data constraint $\|x\| \le 1$ assuming that $\|b\| \le 1$. Yet the latter theorem tells us that, even without constraints and assumptions, this strategy is still an extremely useful heuristic. For its actual regret is bounded by the minimax regret we would have incurred if we would have known the scale of the data $\|x\|$ (and $\|b\|$) in advance. The norm bound we imposed in the derivation induces the complexity measure for the data to which the strategy adapts. This robustness property will extend to the minimax strategy for time series prediction.
Finally, it remains to note that we present the theorems in the canonical case. Problems with a constraint of the form $\|x - c\| \le \beta$ may be canonized by re-parameterizing by $x' = \frac{x - c}{\beta}$ and $a' = \frac{a - c}{\beta}$ and scaling the objective by $\beta^{-2}$. We find

Corollary 4. Fix $\beta \ge 0$ and $c \in \mathbb{R}^d$. Let $V^*(\alpha, b)$ denote the minimax value from (4) with parameters $\alpha, b$. If $\|(\alpha - 1)c + b\| \le \beta$ then
$$\min_a \max_{x : \|x - c\| \le \beta} V(a, x) = \beta^2 V^*\!\left( \alpha, \frac{(\alpha - 1)c + b}{\beta} \right) + 2 b^\top c + (\alpha - 1)\|c\|^2.$$

With this machinery in place, we continue the minimax analysis of time series prediction problems.

4 Minimax Time Series Prediction

In this section, we give the minimax solution to the online prediction problem. Recall that the evaluation criterion, the regret, is defined by
$$R := \sum_{t=1}^{T} \|a_t - x_t\|^2 - \min_{\hat a_1, \ldots, \hat a_T} \left( \sum_{t=1}^{T} \|\hat a_t - x_t\|^2 + \lambda_T \operatorname{tr}\big( K \hat A^\top \hat A \big) \right) \quad (5)$$
where $K \succeq 0$ is a fixed $T \times T$ matrix measuring the complexity of the comparator sequence. Since all the derivations ahead will be for a fixed $T$, we drop the $T$ subscript on the $\lambda$. We study the minimax problem
$$R^* := \min_{a_1} \max_{x_1} \cdots \min_{a_T} \max_{x_T} R \quad (6)$$
under the constraint on the data that $\|X_t v_t\| \le 1$ in each round $t$ for some fixed sequence $v_1, \ldots, v_T$ such that $v_t \in \mathbb{R}^t$. This constraint generalizes the norm bound constraint from the motivating problem (1), which is recovered by taking $v_t = e_t$. This natural generalization allows us to also consider bounded norms of increments, bounded higher order discrete derivative norms etc.
We compute the minimax regret and get an expression for the minimax algorithm.
We show that, at any point in the game, the value is a quadratic function of the past samples and the minimax algorithm is linear: it always predicts with a weighted sum of all past samples.
Most intriguingly, the value function can either be a convex or concave quadratic in the last data point, depending on the regularization. We saw in the previous section that these two cases require a different minimax solution. It is therefore an extremely fortunate fact that the particular case we find ourselves in at each round is not a function of the past data, but just a property of the problem parameters $K$ and $v_t$.
We are going to solve the sequential minimax problem (6) one round at a time. To do so, it is convenient to define the value-to-go of the game from any state $X_t = [x_1 \cdots x_t]$ recursively by
$$V(X_T) := -L^* \quad \text{and} \quad V(X_{t-1}) := \min_{a_t} \max_{x_t : \|X_t v_t\| \le 1} \|a_t - x_t\|^2 + V(X_t).$$
We are interested in the minimax algorithm and minimax regret $R^* = V(X_0)$. We will show that the minimax value and strategy are a quadratic and linear function of the observations. To express the value and strategy and state the necessary condition on the problem, we will need a series of scalars $d_t$ and matrices $R_t \in \mathbb{R}^{t \times t}$ for $t = 1, \ldots, T$, which, as we will explain below, arise naturally from the minimax analysis. The matrices, which depend on the regularization parameter $\lambda$, comparator complexity matrix $K$ and data constraints $v_t$, are defined recursively back-to-front. The base case is $R_T := (I + \lambda K)^{-1}$.
Using the convenient abbreviations
$$v_t = w_t \begin{pmatrix} u_t \\ 1 \end{pmatrix} \quad \text{and} \quad R_t = \begin{pmatrix} A_t & b_t \\ b_t^\top & c_t \end{pmatrix},$$
we then recursively define $R_{t-1}$ and set $d_t$ by
$$R_{t-1} := A_t + (b_t - c_t u_t)(b_t - c_t u_t)^\top - c_t u_t u_t^\top, \qquad d_t := \frac{c_t}{w_t^2} \qquad \text{if } c_t \ge 0, \quad (7a)$$
$$R_{t-1} := A_t + \frac{b_t b_t^\top}{1 - c_t}, \qquad d_t := 0 \qquad \text{if } c_t \le 0. \quad (7b)$$
Using this recursion for $d_t$ and $R_t$, we can perform the exact minimax analysis under a certain condition on the interplay between the data constraint and the regularization. We then show below that the obtained algorithm has a condition-free data-dependent regret bound.

Theorem 5. Assume that $K$ and $v_t$ are such that any data sequence $X_T$ satisfying the constraint $\|X_t v_t\| \le 1$ for all rounds $t \le T$ also satisfies $\big\| X_{t-1} \big( (c_t - 1) u_t - b_t \big) \big\| \le 1/w_t$ for all rounds $t \le T$. Then the minimax value of and strategy for problem (6) are given by
$$V(X_t) = \operatorname{tr}\big( X_t (R_t - I) X_t^\top \big) + \sum_{s=t+1}^{T} d_s \quad \text{and} \quad a_t = X_{t-1} \begin{cases} \frac{b_t}{1 - c_t} & \text{if } c_t \le 0, \\ b_t - c_t u_t & \text{if } c_t \ge 0. \end{cases}$$
In particular, this shows that the minimax regret (6) is given by $R^* = \sum_{t=1}^{T} d_t$.

Proof. By induction. The base case $V(X_T)$ is Theorem 1.
For any $t < T$ we apply the definition of $V(X_{t-1})$ and the induction hypothesis to get
$$V(X_{t-1}) = \min_{a_t} \max_{x_t : \|X_t v_t\| \le 1} \|a_t - x_t\|^2 + \operatorname{tr}\big( X_t (R_t - I) X_t^\top \big) + \sum_{s=t+1}^{T} d_s = \operatorname{tr}\big( X_{t-1} (A_t - I) X_{t-1}^\top \big) + \sum_{s=t+1}^{T} d_s + C,$$
where we abbreviated
$$C := \min_{a_t} \max_{x_t : \|X_t v_t\| \le 1} \|a_t - x_t\|^2 + (c_t - 1) x_t^\top x_t + 2 x_t^\top X_{t-1} b_t.$$
Without loss of generality, assume $w_t > 0$. Now, as $\|X_t v_t\| \le 1$ iff $\|X_{t-1} u_t + x_t\| \le 1/w_t$, application of Corollary 4 with $\alpha = c_t$, $b = X_{t-1} b_t$, $\beta = 1/w_t$ and $c = -X_{t-1} u_t$ followed by Theorem 2 results in optimal strategy
$$a_t = \begin{cases} \frac{X_{t-1} b_t}{1 - c_t} & \text{if } c_t \le 0, \\ -c_t X_{t-1} u_t + X_{t-1} b_t & \text{if } c_t \ge 0, \end{cases}$$
and value
$$C = (c_t - 1) \|X_{t-1} u_t\|^2 - 2 b_t^\top X_{t-1}^\top X_{t-1} u_t + \begin{cases} \big\| X_{t-1} \big( (c_t - 1) u_t - b_t \big) \big\|^2 / (1 - c_t) & \text{if } c_t \le 0, \\ \big\| X_{t-1} \big( (c_t - 1) u_t - b_t \big) \big\|^2 + c_t / w_t^2 & \text{if } c_t \ge 0. \end{cases}$$
Expanding all squares and rearranging (cycling under the trace) completes the proof.

On the one hand, from a technical perspective the condition of Theorem 5 is rather natural. It guarantees that the prediction of the algorithm will fall within the constraint imposed on the data. (If it would not, we could benefit by clipping the prediction. This would be guaranteed to reduce the loss, and it would wreck the backwards induction.) Similar clipping conditions arise in the minimax analyses for linear regression [9] and square loss prediction with Mahalanobis losses [13].
In practice we typically do not have a hard bound on the data.
Still, by running the above minimax algorithm obtained for data complexity bounds $\|X_t v_t\| \le 1$, we get an adaptive regret bound that scales with the actual data complexity $\|X_t v_t\|^2$, as can be derived by replacing the application of Theorem 2 in the proof of Theorem 5 by an invocation of Theorem 3.

Theorem 6. Let $K \succeq 0$ and $v_t$ be arbitrary. The minimax algorithm obtained in Theorem 5 keeps the regret (5) bounded by $R \le \sum_{t=1}^{T} d_t \|X_t v_t\|^2$ for any data sequence $X_T$.

4.1 Computation, sparsity

In the important special case (typical application) where the regularization $K$ and data constraint $v_t$ are encoding some order of smoothness, we find that $K$ is banded diagonal and $v_t$ only has a few tail non-zero entries. It hence is the case that $R_T^{-1} = I + \lambda K$ is sparse. We now argue that the recursive updates (7) preserve sparsity of the inverse $R_t^{-1}$. In Appendix C we derive an update for $R_{t-1}^{-1}$ in terms of $R_t^{-1}$. For computation it hence makes sense to tabulate $R_t^{-1}$ directly. We now argue (proof in Appendix B.2) that all $R_t^{-1}$ are sparse.

Theorem 7. Say the $v_t$ are $V$-sparse (all but their tail $V$ entries are zero). And say that $K$ is $D$-banded (all but the main and $D - 1$ adjacent diagonals to either side are zero). Then each $R_t^{-1}$ is the sum of the $D$-banded matrix $I + \lambda K_{1:t,1:t}$ and a $(D + V - 2)$-blocked matrix (i.e. all but the lower-right block of size $D + V - 2$ is zero).

So what does this sparsity argument buy us? We only need to maintain the original $D$-banded matrix $K$ and the $(D + V - 2)^2$ entries of the block perturbation. These entries can be updated backwards from $t = T, \ldots, 1$ in $O((D + V - 2)^3)$ time per round using block matrix inverses. This means that the run-time of the entire pre-processing step is linear in $T$.
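To make the bandedness payoff concrete, here is a sketch of solving a tridiagonal system such as $(I + \lambda K)\hat a = x$ in $O(T)$ time by the Thomas algorithm, i.e. Gaussian elimination specialized to the banded structure exploited above (the helper name is ours):

```python
import numpy as np

def solve_tridiagonal(lower, diag, upper, rhs):
    # Thomas algorithm: O(T) forward elimination + back substitution.
    n = len(diag)
    d, r = diag.astype(float).copy(), rhs.astype(float).copy()
    for i in range(1, n):                  # forward elimination
        m = lower[i - 1] / d[i - 1]
        d[i] -= m * upper[i - 1]
        r[i] -= m * r[i - 1]
    sol = np.empty(n)
    sol[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):         # back substitution
        sol[i] = (r[i] - upper[i] * sol[i + 1]) / d[i]
    return sol

T, lam = 500, 3.0
rng = np.random.default_rng(2)
x = rng.standard_normal(T)                 # one coordinate of the data sequence

# I + lam*K for the K of (2): diagonal 1+2*lam, off-diagonals -lam.
a_hat = solve_tridiagonal(np.full(T - 1, -lam), np.full(T, 1 + 2 * lam),
                          np.full(T - 1, -lam), x)

# Agrees with a dense solve.
M = (1 + 2 * lam) * np.eye(T) - lam * np.eye(T, k=1) - lam * np.eye(T, k=-1)
assert np.allclose(M @ a_hat, x)
```

Running this once per dimension recovers the $O(dT)$ offline computation mentioned after Theorem 1.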
For updates and prediction we need $c_t$ and $b_t$, which we can compute using Gaussian elimination from $R_t^{-1}$ in $O(t(D + V))$ time. In the next section we will see a special case in which we can update and predict in constant time.

5 Norm-bounded Data with Increment Squared Regularization

We return to our motivating problem (1) with complexity matrix $K = K_T$ given by (2) and norm constrained data, i.e. $v_t = e_t$. We show that the $R_t$ matrices are very simple: their inverse is $I + \lambda K_t$ with its lower-right entry perturbed. Using this, we show that the prediction is a linear combination of the past observations with weights decaying exponentially backward in time. We derive a constant-time update equation for the minimax prediction and tightly sandwich the regret.
Here, we will calculate a few quantities that will be useful throughout this section. The inverse $(I + \lambda K_T)^{-1}$ can be computed in closed form as a direct application of the results in [14]:

Lemma 8. Recall that $\sinh(x) = \frac{e^x - e^{-x}}{2}$ and $\cosh(x) = \frac{e^x + e^{-x}}{2}$. For any $\lambda \ge 0$:
$$(I + \lambda K_T)^{-1}_{i,j} = \frac{\cosh\big( (T + 1 - |i - j|) \nu \big) - \cosh\big( (T + 1 - i - j) \nu \big)}{2 \lambda \sinh(\nu) \sinh\big( (T + 1) \nu \big)},$$
where $\nu = \cosh^{-1}\big( 1 + \frac{1}{2\lambda} \big)$.

We need some control on this inverse. We will use the abbreviations
$$z_t := (I + \lambda K_t)^{-1} e_t, \quad (8)$$
$$h_t := e_t^\top (I + \lambda K_t)^{-1} e_t = e_t^\top z_t, \quad \text{and} \quad (9)$$
$$h := \frac{2}{1 + 2\lambda + \sqrt{1 + 4\lambda}}. \quad (10)$$
We now show that these quantities are easily computable (see Appendix B for proofs).

Lemma 9. Let $\nu$ be as in Lemma 8. Then, we can write
$$h_t = \frac{1 - (\lambda h)^{2t}}{1 - (\lambda h)^{2t+2}} h,$$
and $\lim_{t \to \infty} h_t = h$ from below, exponentially fast.

A direct application of block matrix inversion (Lemma 12) results in

Lemma 10.
We have
$$h_t = \frac{1}{1 + 2\lambda - \lambda^2 h_{t-1}} \quad \text{and} \quad z_t = h_t \begin{pmatrix} \lambda z_{t-1} \\ 1 \end{pmatrix}.$$

Intriguingly, following the optimal algorithm for all $T$ rounds can be done in $O(Td)$ computation and $O(d)$ memory. These resource requirements are surprising as playing weighted averages typically requires $O(T^2 d)$. We found that the weighted averages are similar between rounds and can be updated cheaply.
We are now ready to state the main result of this section, proved in Appendix B.3.

Theorem 11. Let $z_t$ and $h_t$ be as in (8) and $K_t$ as in (2). For the minimax problem (1) we have
$$R_t^{-1} = I + \lambda K_t + \gamma_t e_t e_t^\top, \quad \text{where } \gamma_t = \frac{1}{c_t} - \frac{1}{h_t},$$
and the minimax prediction in round $t$ is given by
$$a_t = \lambda c_t X_{t-1} z_{t-1},$$
where the $c_t$ satisfy the recurrence $c_T = h_T$ and $c_{t-1} = h_{t-1} + \lambda^2 h_{t-1}^2 c_t (1 + c_t)$.

5.1 Implementation

Theorem 11 states that the minimax prediction is $a_t = \lambda c_t X_{t-1} z_{t-1}$. Using Lemma 10, we can derive an incremental update for $a_t$ by defining $a_1 = 0$ and
$$a_{t+1} = \lambda c_{t+1} X_t z_t = \lambda c_{t+1} [X_{t-1}\ x_t]\, h_t \begin{pmatrix} \lambda z_{t-1} \\ 1 \end{pmatrix} = \lambda c_{t+1} h_t \big( X_{t-1} \lambda z_{t-1} + x_t \big) = \lambda c_{t+1} h_t \left( \frac{a_t}{c_t} + x_t \right).$$
This means we can predict in constant time $O(d)$ per round.

5.2 Lower Bound

By Theorem 5, using that $w_t = 1$ so that $d_t = c_t$, the minimax regret equals $\sum_{t=1}^{T} c_t$. For convenience, we define $r_t := 1 - (\lambda_T h)^{2t}$ (and $r_{T+1} = 1$) so that $h_t = h r_t / r_{t+1}$. We can obtain a lower bound on $c_t$ from the expression given in Theorem 11 by ignoring the (positive) $c_t^2$ term to obtain: $c_{t-1} \ge h_{t-1} + \lambda_T^2 h_{t-1}^2 c_t$.
By unpacking this lower bound recursively, we arrive at
$$c_t \ge h \sum_{k=t}^{T} (\lambda_T h)^{2(k-t)} \frac{r_t^2}{r_k r_{k+1}}.$$
Since $r_t^2 / (r_i r_{i+1})$ is a decreasing function in $i$ for every $t$, we can lower bound the double sum $\sum_{t=1}^{T} c_t$ by the integral
$$h \int_0^{T-1} \int_{t+1}^{T} (\lambda_T h)^{2(k-t)} \frac{r_t}{r_{t+1}} \, \mathrm{d}k \, \mathrm{d}t = \Omega\left( -\frac{h T}{2 \log(\lambda_T h)} \right),$$
where we have exploited the fact that the integrand is monotonic and concave in $k$ and monotonic and convex in $t$ to lower bound the sums with an integral. See Claim 14 in the appendix for more details. Since $-\log(\lambda_T h) = O(1/\sqrt{\lambda_T})$ and $h = \Omega(1/\lambda_T)$, we have that $\sum_{t=1}^{T} c_t = \Omega(T/\sqrt{\lambda_T})$, matching the upper bound below.

5.3 Upper Bound

As $h \ge h_t$, the alternative recursion $c'_{T+1} = 0$ and $c'_{t-1} = h + \lambda^2 h^2 c'_t (1 + c'_t)$ satisfies $c'_t \ge c_t$. A simple induction1 shows that $c'_t$ is increasing with decreasing $t$, and it must hence have a limit. This limit is a fixed-point of $c \mapsto h + \lambda^2 h^2 c (1 + c)$. This results in a quadratic equation, which has two solutions. Our starting point $c'_{T+1} = 0$ lies below the half-way point $\frac{1 - \lambda^2 h^2}{2 \lambda^2 h^2} > 0$, so the sought limit is the smaller solution:
$$c = \frac{-\lambda^2 h^2 + 1 - \sqrt{(\lambda^2 h^2 - 1)^2 - 4 \lambda^2 h^3}}{2 \lambda^2 h^2}.$$
This is monotonic in $h$. Plugging in the definition of $h$ gives an explicit closed form for $c$ in terms of $\lambda$ alone; series expansion around $\lambda \to \infty$ results in $c \le (1 + \lambda)^{-1/2}$.
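The recursions behind these bounds are easy to simulate. The sketch below computes $h_t$ forward via Lemma 10 (taking $h_0 = 0$ as base case, an assumption of ours that reproduces $h_1 = 1/(1+2\lambda)$ and is consistent with Lemma 9), computes $c_t$ backward via Theorem 11, checks that every $c_t$ stays below the fixed point $c$, and runs the $O(d)$-per-round update of Section 5.1 against the direct prediction $a_t = \lambda c_t X_{t-1} z_{t-1}$:

```python
import numpy as np

def minimax_coefficients(T, lam):
    # h_t forward via Lemma 10; h_0 = 0 gives h_1 = 1/(1+2*lam).
    h = np.zeros(T + 1)
    for t in range(1, T + 1):
        h[t] = 1.0 / (1 + 2 * lam - lam ** 2 * h[t - 1])
    # c_t backward via Theorem 11: c_T = h_T, c_{t-1} = h_{t-1} + lam^2 h_{t-1}^2 c_t (1+c_t).
    c = np.zeros(T + 1)
    c[T] = h[T]
    for t in range(T, 1, -1):
        c[t - 1] = h[t - 1] + lam ** 2 * h[t - 1] ** 2 * c[t] * (1 + c[t])
    return h, c

T, lam, d = 200, 10.0, 2
h, c = minimax_coefficients(T, lam)

# Every c_t stays below the smaller root of the fixed-point equation (5.3).
h_inf = 2.0 / (1 + 2 * lam + np.sqrt(1 + 4 * lam))     # h from (10)
g = lam ** 2 * h_inf ** 2
c_star = (1 - g - np.sqrt((g - 1) ** 2 - 4 * g * h_inf)) / (2 * g)
assert np.all(c[1:] <= c_star + 1e-9)

# O(d)-per-round filter (5.1) versus the direct formula a_t = lam*c_t*X_{t-1}*z_{t-1}.
rng = np.random.default_rng(3)
X = rng.standard_normal((d, T))
a = np.zeros(d)                     # a_1 = 0
z = np.empty(0)                     # z_0 is empty; z_t = h_t * [lam*z_{t-1}; 1]
for t in range(1, T):
    z = h[t] * np.append(lam * z, 1.0)
    a_direct = lam * c[t + 1] * (X[:, :t] @ z)
    a = lam * c[t + 1] * h[t] * (a / c[t] + X[:, t - 1])
    assert np.allclose(a, a_direct)
```

The incremental update touches only the running prediction, so each round costs $O(d)$ time and memory, as claimed in Section 5.1.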
So all in all, the bound is
$$R^* = O\left( \frac{T}{\sqrt{1 + \lambda_T}} \right),$$
where we have written the explicit $T$ dependence of $\lambda$. As discussed in the introduction, allowing $\lambda_T$ to grow with $T$ is natural and necessary for sub-linear regret. If $\lambda_T$ were constant, the regret term and complexity term would grow with $T$ at the same rate, effectively forcing the learner to compete with sequences that could track the $x_t$ sequence arbitrarily well.

6 Discussion

We looked at obtaining the minimax solution to simple tracking/filtering/time series prediction problems with square loss, square norm regularization and square norm data constraints. We obtained a computational method to get the minimax result. Surprisingly, the problem turns out to be a mixture of per-step quadratic minimax problems that can be either concave or convex. These two problems have different solutions. Since the type of problem that is faced in each round is not a function of the past data, but only of the regularization, the coefficients of the value-to-go function can still be computed recursively. However, extending the analysis beyond quadratic loss and constraints is difficult; the self-dual property of the 2-norm is central to the calculations.
Several open problems arise. The stability of the coefficient recursion is so far elusive. For the case of norm bounded data, we found that the $c_t$ are positive and essentially constant. However, for higher order smoothness constraints on the data (norm bounded increments, increments of increments, ...) the situation is more intricate. We find negative $c_t$ and oscillating $c_t$, both diminishing and increasing.
Understanding the behavior of the minimax regret and algorithm as a function of the regularization $K$ (so that we can tune $\lambda$ appropriately) is an intriguing and elusive open problem.

Acknowledgments

We gratefully acknowledge the support of the NSF through grant CCF-1115788, and of the Australian Research Council through an Australian Laureate Fellowship (FL110100281) and through the ARC Centre of Excellence for Mathematical and Statistical Frontiers. Thanks also to the Simons Institute for the Theory of Computing Spring 2015 Information Theory Program.

¹For the base case, $c'_{T+1} = 0 \le c'_T = h$. Then $c'_{t-1} = h + \lambda^2 h^2 c'_t(1 + c'_t) \ge h + \lambda^2 h^2 c'_{t+1}(1 + c'_{t+1}) = c'_t$.