{"title": "Online Learning for Adversaries with Memory: Price of Past Mistakes", "book": "Advances in Neural Information Processing Systems", "page_first": 784, "page_last": 792, "abstract": "The framework of online learning with memory naturally captures learning problems with temporal effects, and was previously studied for the experts setting. In this work we extend the notion of learning with memory to the general Online Convex Optimization (OCO) framework, and present two algorithms that attain low regret. The first algorithm applies to Lipschitz continuous loss functions, obtaining optimal regret bounds for both convex and strongly convex losses. The second algorithm attains the optimal regret bounds and applies more broadly to convex losses without requiring Lipschitz continuity, yet is more complicated to implement. We complement the theoretic results with two applications: statistical arbitrage in finance, and multi-step ahead prediction in statistics.", "full_text": "Online Learning for Adversaries with Memory:\nPrice of Past Mistakes\n\nOren Anava\nTechnion\nHaifa, Israel\noanava@tx.technion.ac.il\n\nElad Hazan\nPrinceton University\nNew York, USA\nehazan@cs.princeton.edu\n\nShie Mannor\nTechnion\nHaifa, Israel\nshie@ee.technion.ac.il\n\nAbstract\n\nThe framework of online learning with memory naturally captures learning problems\nwith temporal effects, and was previously studied for the experts setting. In\nthis work we extend the notion of learning with memory to the general Online\nConvex Optimization (OCO) framework, and present two algorithms that attain\nlow regret. The first algorithm applies to Lipschitz continuous loss functions,\nobtaining optimal regret bounds for both convex and strongly convex losses. The\nsecond algorithm attains the optimal regret bounds and applies more broadly to\nconvex losses without requiring Lipschitz continuity, yet is more complicated to\nimplement. 
We complement the theoretical results with two applications: statistical\narbitrage in finance, and multi-step ahead prediction in statistics.\n\n1 Introduction\n\nOnline learning is a well-established learning paradigm which has both theoretical and practical\nappeal. The goal in this paradigm is to make sequential decisions, where at each trial the cost\nassociated with previous prediction tasks is given. In recent years, online learning has been widely\napplied to several research fields including game theory, information theory, and optimization. We\nrefer the reader to [1, 2, 3] for a more comprehensive survey.\nOne of the most well-studied frameworks of online learning is Online Convex Optimization (OCO).\nIn this framework, an online player iteratively chooses a decision in a convex set, then a convex loss\nfunction is revealed, and the player suffers loss that is the convex function applied to the decision\nshe chose. It is usually assumed that the loss functions are chosen arbitrarily, possibly by an all-powerful\nadversary. The performance of the online player is measured using the regret criterion,\nwhich compares the accumulated loss of the player with the accumulated loss of the best fixed\ndecision in hindsight.\nThe above notion of regret captures only memoryless adversaries who determine the loss based on\nthe player\u2019s current decision, and fails to cope with bounded-memory adversaries who determine\nthe loss based on the player\u2019s current and previous decisions. However, in many scenarios such\nas coding, compression, portfolio selection and more, the adversary is not completely memoryless\nand the previous decisions of the player affect her current loss. 
We are particularly concerned with\nscenarios in which the memory is relatively short-term and simple, in contrast to state-action models\nfor which reinforcement learning models are more suitable [4].\nAn important aspect of our work is that the memory is not used to relax the adaptiveness of the adversary\n(cf. [5, 6]), but rather to model the feedback received by the player. In particular, throughout\nthis work we assume that the adversary is oblivious, that is, must determine the whole set of loss\nfunctions in advance. In addition, we assume a counterfactual feedback model: the player is aware\nof the loss she would suffer had she played any sequence of m decisions in the previous m rounds.\nThis model is quite common in the online learning literature; see for instance [7, 8].\nOur goal in this work is to extend the notion of learning with memory to one of the most general\nonline learning frameworks, OCO. To this end, we adapt the policy regret$^1$ criterion of [5], and\npropose two different approaches for the extended framework, both of which attain the optimal bounds with\nrespect to this criterion.\n\n1.1 Summary of Results\n\nWe present and analyze two algorithms for the framework of OCO with memory, both of which attain policy\nregret bounds that are optimal in the number of rounds. Our first algorithm utilizes the Lipschitz\nproperty of the loss functions, and, to the best of our knowledge, is the first algorithm for this\nframework that is not based on any blocking technique (this technique is detailed in the related work\nsection below). 
This algorithm attains $O(T^{1/2})$-policy regret for general convex loss functions and\n$O(\\log T)$-policy regret for strongly convex losses.\nFor the case of convex and non-Lipschitz loss functions, our second algorithm attains the nearly optimal\n$\\tilde{O}(T^{1/2})$-policy regret$^2$; its downside is that it is randomized and more difficult to implement.\nA novel result that follows immediately from our analysis is that our second algorithm attains an\nexpected $\\tilde{O}(T^{1/2})$-regret, along with $\\tilde{O}(T^{1/2})$ decision switches in the standard OCO framework.\nA similar result currently exists only for the special case of the experts problem [9]. We note that\nthe two algorithms we present are related in spirit (both designed to cope with bounded-memory\nadversaries), but differ in the techniques and analysis.\n\nFramework | Previous bound | Our first approach | Our second approach\nExperts with memory | $O(T^{1/2})$ | Not applicable | $\\tilde{O}(T^{1/2})$\nOCO with memory (convex losses) | $O(T^{2/3})$ | $O(T^{1/2})$ | $\\tilde{O}(T^{1/2})$\nOCO with memory (strongly convex losses) | $\\tilde{O}(T^{1/3})$ | $O(\\log T)$ | $\\tilde{O}(T^{1/2})$\n\nTable 1: State-of-the-art upper bounds on the policy regret as a function of $T$ (number of rounds)\nfor the framework of OCO with memory. The best known bounds are due to the works of [9], [8],\nand [5], which are detailed in the related work section below.\n\n1.2 Related Work\n\nThe framework of OCO with memory was initially considered in [7] as an extension to the experts\nframework of [10]. Merhav et al. offered a blocking technique that guarantees a policy regret\nbound of $O(T^{2/3})$ against bounded-memory adversaries. Roughly speaking, the proposed technique\ndivides the $T$ rounds into $T^{2/3}$ equal-sized blocks, while employing a constant decision throughout\neach of these blocks. 
The small number of decision switches enables the learning in the extended\nframework, yet the constant block size results in a suboptimal policy regret bound.\nLater, [8] showed that a policy regret bound of $O(T^{1/2})$ can be achieved by simply adapting the\nShrinking Dartboard (SD) algorithm of [9] to the framework considered in [7]. In short, the SD\nalgorithm is aimed at ensuring an expected $O(T^{1/2})$ decision switches in addition to $O(T^{1/2})$-regret.\nThese two properties together enable the learning in the considered framework, and the\nrandomized block size yields an optimal policy regret bound. Note that in both [7] and [8], the\npresented techniques are applicable only to the variant of the experts framework for adversaries with\nmemory, and not to the general OCO framework.\n\n$^1$ The policy regret compares the performance of the online player with the best fixed sequence of actions in\nhindsight, and thus captures the notion of adversaries with memory. A formal definition appears in Section 2.\n$^2$ The notation $\\tilde{O}(\\cdot)$ is a variant of the $O(\\cdot)$ notation that ignores logarithmic factors.\n\nThe framework of online learning against adversaries with memory was studied also in the setting\nof the adversarial multi-armed bandit problem. In this context, [5] showed how to convert an online\nlearning algorithm with regret guarantee of $O(T^q)$ into an online learning algorithm that attains\n$O(T^{1/(2-q)})$-policy regret, also using a blocking technique. This approach is in fact a generalization\nof [7] to the bandit setting, yet the ideas presented are somewhat simpler. 
Despite the original\npresentation of [5] in the bandit setting, their ideas can be easily generalized to the framework of\nOCO with memory, yielding a policy regret bound of $O(T^{2/3})$ for convex losses and $\\tilde{O}(T^{1/3})$-policy\nregret for strongly convex losses.\nAn important concept that is captured by the framework of OCO with memory is switching costs,\nwhich can be seen as a special case where the memory is of length 1. This special case was studied in\nthe works of [11], who studied the relationship between second order regret bounds and switching\ncosts; and [12], who proved that the blocking algorithm of [5] is optimal for the setting of the\nadversarial multi-armed bandit with switching costs.\n\n2 Preliminaries and Model\n\nWe proceed to formally define the notation for both the standard OCO framework and the framework\nof OCO with memory. For the sake of readability, we shall use the notation $g_t$ for memoryless\nloss functions (that correspond to memoryless adversaries), and $f_t$ for loss functions with memory\n(that correspond to bounded-memory adversaries).\n\n2.1 The Standard OCO Framework\n\nIn the standard OCO framework, an online player iteratively chooses a decision $x_t \\in K$, and suffers\nloss that is equal to $g_t(x_t)$. The decision set $K$ is assumed to be a bounded convex subset of $\\mathbb{R}^n$,\nand the loss functions $\\{g_t\\}_{t=1}^T$ are assumed to be convex functions from $K$ to $[0, 1]$. In addition,\nthe set $\\{g_t\\}_{t=1}^T$ is assumed to be chosen in advance, possibly by an all-powerful adversary that has\nfull knowledge of our learning algorithm (see [1], for instance). The performance of the player is\nmeasured using the regret criterion, defined as follows:\n\n$$R_T = \\sum_{t=1}^{T} g_t(x_t) - \\min_{x \\in K} \\sum_{t=1}^{T} g_t(x),$$\n\nwhere $T$ is a predefined integer denoting the total number of rounds played. 
The goal in this framework\nis to design efficient algorithms, whose regret grows sublinearly in $T$, corresponding to an\naverage per-round regret going to zero as $T$ increases.\n\n2.2 The Framework of OCO with Memory\n\nIn this work we consider the framework of OCO with memory, detailed as follows: at each round\n$t$, the online player chooses a decision $x_t \\in K \\subset \\mathbb{R}^n$. Then, a loss function $f_t : K^{m+1} \\to \\mathbb{R}$ is\nrevealed, and the player suffers loss of $f_t(x_{t-m}, \\dots, x_t)$. For simplicity, we assume that $0 \\in K$,\nand that $f_t(x_0, \\dots, x_m) \\in [0, 1]$ for any $x_0, \\dots, x_m \\in K$. Notice that the loss at round $t$ depends\non the previous $m$ decisions of the player, as well as on her current one. We assume that after $f_t$ is\nrevealed, the player is aware of the loss she would suffer had she played any sequence of decisions\n$x_{t-m}, \\dots, x_t$ (this corresponds to the counterfactual feedback model mentioned earlier).\nOur goal in this framework is to minimize the policy regret, as defined in [5]$^3$:\n\n$$R_{T,m} = \\sum_{t=m}^{T} f_t(x_{t-m}, \\dots, x_t) - \\min_{x \\in K} \\sum_{t=m}^{T} f_t(x, \\dots, x).$$\n\nWe define the notion of convexity for the loss functions $\\{f_t\\}_{t=1}^T$ as follows: we say that $f_t$ is a convex\nloss function with memory if $\\tilde{f}_t(x) = f_t(x, \\dots, x)$ is convex in $x$. From now on, we assume that\n$\\{f_t\\}_{t=1}^T$ are convex loss functions with memory. This assumption is necessary in some cases, if efficient\nalgorithms are considered; otherwise, the optimization problem $\\min_{x \\in K} \\sum_{t=m}^{T} f_t(x, \\dots, x)$\nmight not be solvable efficiently.\n\n$^3$ The rounds in which $t < m$ are ignored since we assume that the loss per round is bounded by a constant;\nthis adds at most a constant to the final regret bound.\n\nAlgorithm 1\n1: Input: learning rate $\\eta > 0$, $\\sigma$-strongly convex and smooth regularization function $R(x)$.\n2: Choose $x_0, \\dots, x_m \\in K$ arbitrarily.\n3: for $t = m$ to $T$ do\n4:   Play $x_t$ and suffer loss $f_t(x_{t-m}, \\dots, x_t)$.\n5:   Set $x_{t+1} = \\arg\\min_{x \\in K} \\left\\{ \\eta \\cdot \\sum_{\\tau=m}^{t} \\tilde{f}_\\tau(x) + R(x) \\right\\}$.\n6: end for\n\n3 Policy Regret for Lipschitz Continuous Loss Functions\n\nIn this section we assume that the loss functions $\\{f_t\\}_{t=1}^T$ are Lipschitz continuous for some Lipschitz\nconstant $L$, that is\n\n$$|f_t(x_0, \\dots, x_m) - f_t(y_0, \\dots, y_m)| \\le L \\cdot \\|(x_0, \\dots, x_m) - (y_0, \\dots, y_m)\\|,$$\n\nand adapt the well-known Regularized Follow The Leader (RFTL) algorithm to cope with bounded-memory\nadversaries. In the above and throughout the paper, we use $\\|\\cdot\\|$ to denote the $\\ell_2$-norm.\nDue to space constraints we present here only the algorithm and the main theorem, and defer the\ncomplete analysis to the supplementary material.\nIntuitively, Algorithm 1 relies on the fact that the corresponding functions $\\{\\tilde{f}_t\\}_{t=1}^T$ are memoryless\nand convex. Thus, standard regret minimization techniques are applicable, yielding a regret bound\nof $O(T^{1/2})$ for $\\{\\tilde{f}_t\\}_{t=1}^T$. This, however, is not the policy regret bound we are interested in, but is in\nfact quite close if we use the Lipschitz property of $\\{f_t\\}_{t=1}^T$ and set the learning rate properly. The\nalgorithm requires the following standard definitions of $R$ and $\\lambda$ (see supplementary material for\nmore comprehensive background and exact norm definitions):\n\n$$\\lambda = \\sup_{t \\in \\{1, \\dots, T\\},\\, x, y \\in K} \\left\\{ \\left( \\|\\nabla \\tilde{f}_t(x)\\|_y^* \\right)^2 \\right\\} \\quad \\text{and} \\quad R = \\sup_{x, y \\in K} \\{ R(x) - R(y) \\}. \\tag{1}$$\n\nAdditionally, we denote by $\\sigma$ the strong convexity$^4$ parameter of the regularization function $R(x)$.\nFor Algorithm 1 we can prove the following:\nTheorem 3.1. Let $\\{f_t\\}_{t=1}^T$ be Lipschitz continuous loss functions with memory (from $K^{m+1}$ to\n$[0, 1]$), and let $R$ and $\\lambda$ be as defined in Equation (1). Then, Algorithm 1 generates an online\nsequence $\\{x_t\\}_{t=1}^T$, for which the following holds:\n\n$$R_{T,m} = \\sum_{t=m}^{T} f_t(x_{t-m}, \\dots, x_t) - \\min_{x \\in K} \\sum_{t=m}^{T} f_t(x, \\dots, x) \\le 2T\\lambda\\eta(m + 1)^{3/2} + \\frac{R}{\\eta}.$$\n\nSetting $\\eta = R^{1/2}(TL)^{-1/2}(m+1)^{-3/4}(\\lambda/\\sigma)^{-1/4}$ yields $R_{T,m} \\le 3(TRL)^{1/2}(m+1)^{3/4}(\\lambda/\\sigma)^{1/4}$.\nThe following is an immediate corollary of Theorem 3.1 to $H$-strongly convex losses:\nCorollary 3.2. Let $\\{f_t\\}_{t=1}^T$ be Lipschitz continuous and $H$-strongly convex loss functions with memory\n(from $K^{m+1}$ to $[0, 1]$), and denote $G = \\sup_{t, x \\in K} \\|\\nabla \\tilde{f}_t(x)\\|$. Then, Algorithm 1 generates an\nonline sequence $\\{x_t\\}_{t=1}^T$, for which the following holds:\n\n$$R_{T,m} \\le 2(m + 1)^{3/2} G^2 \\sum_{t=m}^{T} \\eta_t + \\sum_{t=m}^{T} \\|x_t - x^*\\|^2 \\left( \\frac{1}{\\eta_{t+1}} - \\frac{1}{\\eta_t} - H \\right).$$\n\nSetting $\\eta_t = \\frac{1}{Ht}$ yields $R_{T,m} \\le \\frac{2(m+1)^{3/2} G^2}{H} (1 + \\log(T))$.\nThe proof simply requires plugging a time-dependent learning rate into the proof of Theorem 3.1, and\nis thus omitted here.\n\n$^4$ $f(x)$ is $\\sigma$-strongly convex if $\\nabla^2 f(x) \\succeq \\sigma I_{n \\times n}$ for all $x \\in K$. We say that $f_t : K^{m+1} \\to \\mathbb{R}$ is a $\\sigma$-strongly\nconvex loss function with memory if \u02dcft(x) = ft(x, . . . 
, x) is \u03c3-strongly convex in x.\n\nAlgorithm 2\n1: Input: learning parameter $\\eta > 0$.\n2: Initialize $w_1(x) = 1$ for all $x \\in K$, and choose $x_1 \\in K$ arbitrarily.\n3: for $t = 1$ to $T$ do\n4:   Play $x_t$ and suffer loss $g_t(x_t)$.\n5:   Define weights $w_{t+1}(x) = e^{-\\alpha \\sum_{\\tau=1}^{t} \\hat{g}_\\tau(x)}$, where $\\alpha = \\frac{\\eta}{4G^2}$ and $\\hat{g}_t(x) = g_t(x) + \\frac{\\eta}{2}\\|x\\|^2$.\n6:   Set $x_{t+1} = x_t$ with probability $\\frac{w_{t+1}(x_t)}{w_t(x_t)}$.\n7:   Otherwise, sample $x_{t+1}$ from the density function $p_{t+1}(x) = w_{t+1}(x) \\cdot \\left( \\int_K w_{t+1}(x)\\,dx \\right)^{-1}$.\n8: end for\n\n4 Policy Regret with Low Switches\n\nIn this section we present a different approach to the framework of OCO with memory: low\nswitches. This approach was considered before in [8], who adapted the Shrinking Dartboard (SD)\nalgorithm of [9] to cope with limited-delay coding. However, the authors in [9, 8] consider only the\nexperts setting, in which the decision set is the simplex and the loss functions are linear. Here we\nadapt this approach to general decision sets and generally convex loss functions, and obtain optimal\npolicy regret against bounded-memory adversaries.\nDue to space constraints, we present here only the algorithm and main theorem. The complete\nanalysis appears in the supplementary material.\nIntuitively, Algorithm 2 defines a probability distribution over $K$ at each round $t$. By sampling from\nthis probability distribution one can generate an online sequence that has an expected low regret\nguarantee. This, however, is not sufficient in order to cope with bounded-memory adversaries, and\nthus an additional element of choosing $x_{t+1} = x_t$ with high probability is necessary (line 6). Our\nanalysis shows that if this probability is equal to $\\frac{w_{t+1}(x_t)}{w_t(x_t)}$, the regret guarantee remains, and we get\nan additional low switches guarantee.\n\nFor Algorithm 2 we can prove the following:\nTheorem 4.1. 
Let $\\{g_t\\}_{t=1}^T$ be convex functions from $K$ to $[0, 1]$, such that $D = \\sup_{x, y \\in K} \\|x - y\\|$\nand $G = \\sup_{x, t} \\|\\nabla g_t(x)\\|$, and define $\\hat{g}_t(x) = g_t(x) + \\frac{\\eta}{2}\\|x\\|^2$ for $\\eta = \\frac{2G}{D} \\sqrt{\\frac{1 + \\log(T+1)}{T}}$. Then,\nAlgorithm 2 generates an online sequence $\\{x_t\\}_{t=1}^T$, for which it holds that\n\n$$\\mathbb{E}[R_T] = O\\left(\\sqrt{T \\log(T)}\\right) \\quad \\text{and} \\quad \\mathbb{E}[S] = O\\left(\\sqrt{T \\log(T)}\\right),$$\n\nwhere $S$ is the number of decision switches in the sequence $\\{x_t\\}_{t=1}^T$.\nThe exact bounds for $\\mathbb{E}[R_T]$ and $\\mathbb{E}[S]$ are given in the supplementary material. Notice that Algorithm 2\napplies to memoryless loss functions, yet its low switches guarantee implies learning against\nbounded-memory adversaries as stated and proven in Lemma D.5 (see supplementary material).\n\n5 Application to Statistical Arbitrage\n\nOur first application is motivated by financial models that are aimed at creating statistical arbitrage\nopportunities. In the literature, \u201cstatistical arbitrage\u201d refers to statistical mispricing of one or more\nassets based on their expected value. One of the most common trading strategies, known as \u201cpairs\ntrading\u201d, seeks to create a mean reverting portfolio using two assets with the same sectorial belonging\n(typically using both long and short sales). Then, by buying this portfolio below its mean and selling\nit above, one can have an expected positive profit with low risk.\nHere we extend the traditional pairs trading strategy, and present an approach that aims at constructing\na mean reverting portfolio from an arbitrary (yet known in advance) number of assets. Roughly\nspeaking, our goal is to synthetically create a mean reverting portfolio by maintaining weights upon\n$n$ different assets. The main problem that arises in this context is how to quantify the amount of mean\nreversion of a given portfolio. 
Indeed, mean reversion is a somewhat ill-defined concept, and thus\ndifferent proxies are usually defined to capture its notion. We refer the reader to [13, 14, 15], in\nwhich a few of these proxies (such as predictability and zero-crossing) are presented.\nIn this work, we consider a proxy that is aimed at preserving the mean price of the constructed\nportfolio (over the last $m$ trading periods) close to zero, while maximizing its variance. We note that\ndue to the very nature of the problem (weights of one trading period affect future performance), the\nmemory comes unavoidably into the picture.\nWe proceed to formally define the new mean reversion proxy and the use of our new algorithm in\nthis model. Thus, denote by $y_t \\in \\mathbb{R}^n$ the prices of $n$ assets at time $t$, and by $x_t \\in \\mathbb{R}^n$ a distribution\nof weights over these assets. Since short selling is allowed, the norm of $x_t$ can sum up to an arbitrary\nnumber, determined by the loan flexibility. Without loss of generality we assume that $\\|x_t\\|_2 = 1$,\nwhich is also assumed in the works of [14, 15]. Note that since $x_t$ determines the proportion of\nwealth to be invested in each asset and not the actual wealth itself, any other constant would work\nas well. Consequently, define:\n\n$$f_t(x_{t-m}, \\dots, x_t) = \\left( \\sum_{i=0}^{m} x_{t-i}^\\top y_{t-i} \\right)^2 - \\lambda \\cdot \\sum_{i=0}^{m} \\left( x_{t-i}^\\top y_{t-i} \\right)^2, \\tag{2}$$\n\nfor some $\\lambda > 0$. Notice that minimizing $f_t$ iteratively yields a process $\\{x_t^\\top y_t\\}_{t=1}^T$ such that its\nmean is close to zero (due to the expression on the left), and its variance is maximized (due to the\nexpression on the right). We use the regret criterion to measure our performance against the best\ndistribution of weights in hindsight, and wish to generate a series of weights $\\{x_t\\}_{t=1}^T$ such that the\nregret is sublinear. Thus, define the memoryless loss function $\\tilde{f}_t(x) = f_t(x, \\dots, x)$ and denote\n\n$$A_t = \\sum_{i=0}^{m-1} \\sum_{j=0}^{m-1} y_{t-i} y_{t-j}^\\top \\quad \\text{and} \\quad B_t = \\lambda \\cdot \\left( \\sum_{i=0}^{m-1} y_{t-i} y_{t-i}^\\top \\right).$$\n\nNotice we can write $\\tilde{f}_t(x) = x^\\top A_t x - x^\\top B_t x$. Since $\\tilde{f}_t$ is not convex in general, our techniques\nare not straightforwardly applicable here. However, the hidden convexity of the problem allows us\nto bypass this issue by a simple and tight Positive Semi-Definite (PSD) relaxation. Define\n\n$$h_t(X) = X \\circ A_t - X \\circ B_t, \\tag{3}$$\n\nwhere $X$ is a PSD matrix with $\\mathrm{Tr}(X) = 1$, and $X \\circ A$ is defined as $\\sum_{i=1}^{n} \\sum_{j=1}^{n} X(i, j) \\cdot A(i, j)$.\nNow, notice that the problem of minimizing $\\sum_{t=m}^{T} h_t(X)$ is a PSD relaxation to the minimization\nproblem $\\sum_{t=m}^{T} \\tilde{f}_t(x)$, and for the optimal solution it holds that:\n\n$$\\min_{X} \\sum_{t=m}^{T} h_t(X) \\le \\sum_{t=m}^{T} h_t(x^* x^{*\\top}) = \\sum_{t=m}^{T} \\tilde{f}_t(x^*),$$\n\nwhere $x^* = \\arg\\min_{x \\in K} \\sum_{t=m}^{T} \\tilde{f}_t(x)$. Also, we can recover a vector $x$ from the PSD matrix $X$\nusing an eigenvector decomposition as follows: represent $X = \\sum_{i=1}^{n} \\lambda_i v_i v_i^\\top$, where each $v_i$ is\na unit vector and $\\lambda_i$ are non-negative coefficients such that $\\sum_{i=1}^{n} \\lambda_i = 1$. Then, by sampling\nthe eigenvector $x = v_i$ with probability $\\lambda_i$, we get that $\\mathbb{E}[\\tilde{f}_t(x)] = h_t(X)$. Technically, this\ndecomposition is possible due to the fact that $X$ is a PSD matrix with $\\mathrm{Tr}(X) = 1$. Notice that $h_t$\nis linear in $X$, and thus we can apply regret minimization techniques on the loss functions $\\{h_t\\}_{t=1}^T$.\nThis procedure is formally given in Algorithm 3. For this algorithm we can prove the following:\nCorollary 5.1. Let $\\{f_t\\}_{t=1}^T$ be as defined in Equation (2), and $\\{h_t\\}_{t=1}^T$ be the corresponding memoryless\nfunctions, as defined in Equation (3). Then, applying Algorithm 2 to the loss functions\n$\\{h_t\\}_{t=1}^T$ yields an online sequence $\\{X_t\\}_{t=1}^T$, for which the following holds:\n\n$$\\sum_{t=1}^{T} \\mathbb{E}[h_t(X_t)] - \\min_{X \\succeq 0,\\ \\mathrm{Tr}(X) = 1} \\sum_{t=1}^{T} h_t(X) = O\\left(\\sqrt{T \\log(T)}\\right).$$\n\nSampling $x_t \\sim X_t$ using the eigenvector decomposition described above yields:\n\n$$\\mathbb{E}[R_{T,m}] = \\sum_{t=m}^{T} \\mathbb{E}[f_t(x_{t-m}, \\dots, x_t)] - \\min_{\\|x\\| = 1} \\sum_{t=m}^{T} f_t(x, \\dots, x) = O\\left(\\sqrt{T \\log(T)}\\right).$$\n\nAlgorithm 3 Online Statistical Arbitrage (OSA)\n1: Input: Learning rate $\\eta$, memory parameter $m$, regularizer $\\lambda$.\n2: Initialize $X_1 = \\frac{1}{n} I_{n \\times n}$.\n3: for $t = 1$ to $T$ do\n4:   Randomize $x_t \\sim X_t$ using the eigenvector decomposition.\n5:   Observe $f_t$ and define $h_t$ as in Equation (3).\n6:   Apply Algorithm 2 to $h_t(X_t)$ to get $X_{t+1}$.\n7: end for\n\nRemark: We assume here that the prices of the $n$ assets at round $t$ are bounded for all $t$ by a constant\nwhich is independent of $T$.\nThe main novelty of our approach to the task of constructing mean reverting portfolios is the ability\nto maintain the weight distributions online. This is in contrast to the traditional offline approaches\nthat require a training period (to learn a weight distribution), and a trading period (to apply a corresponding\ntrading strategy).\n\n6 Application to Multi-Step Ahead Prediction\n\nOur second application is motivated by statistical models for time series prediction, and in particular\nby statistical models for multi-step ahead AR prediction. Thus, let $\\{X_t\\}_{t=1}^T$ be a time series (that is,\na series of signal observations). 
The traditional AR (short for autoregressive) model, parameterized\nby lag $p$ and coefficient vector $\\alpha \\in \\mathbb{R}^p$, assumes that each observation complies with\n\n$$X_t = \\sum_{k=1}^{p} \\alpha_k X_{t-k} + \\epsilon_t,$$\n\nwhere $\\{\\epsilon_t\\}_{t \\in \\mathbb{Z}}$ is white noise. In words, the model assumes that $X_t$ is a noisy linear combination of\nthe previous $p$ observations. Sometimes, an additional additive term $\\alpha_0$ is included to indicate drift,\nbut we ignore this for simplicity.\nThe online setting for time series prediction is well-established by now, and appears in the works\nof [16, 17]. Here, we adapt this setting to the task of multi-step ahead AR prediction as follows: at\nround $t$, the online player has to predict $X_{t+m}$, while at her disposal are all the previous observations\n$X_1, \\dots, X_{t-1}$ (the parameter $m$ determines the number of steps ahead). Then, $X_t$ is revealed and\nshe suffers loss of $f_t(X_t, \\tilde{X}_t)$, where $\\tilde{X}_t$ denotes her prediction for $X_t$. For simplicity, we consider\nthe squared loss to be our error measure, that is, $f_t(X_t, \\tilde{X}_t) = (X_t - \\tilde{X}_t)^2$.\nIn the statistical literature, a common approach to the problem of multi-step ahead prediction is\nto consider 1-step ahead recursive AR predictors [18, 19]: essentially, this approach makes use of\nstandard methods (e.g., maximum likelihood or least squares estimation) to extract the 1-step ahead\nestimator. For instance, a least squares estimator for $\\alpha$ at round $t$ would be:\n\n$$\\alpha^{LS} = \\arg\\min_{\\alpha} \\left\\{ \\sum_{\\tau=1}^{t-1} \\left( X_\\tau - \\tilde{X}_\\tau^{AR}(\\alpha) \\right)^2 \\right\\} = \\arg\\min_{\\alpha} \\left\\{ \\sum_{\\tau=1}^{t-1} \\left( X_\\tau - \\sum_{k=1}^{p} \\alpha_k X_{\\tau-k} \\right)^2 \\right\\}.$$\n\nThen, $\\alpha^{LS}$ is used to generate a prediction for $X_t$: $\\tilde{X}_t^{AR}(\\alpha^{LS}) = \\sum_{i=1}^{p} \\alpha_i^{LS} X_{t-i}$, which is in turn\nused as a proxy for it in order to predict the value of $X_{t+1}$:\n\n$$\\tilde{X}_{t+1}^{AR}(\\alpha^{LS}) = \\alpha_1^{LS} \\tilde{X}_t^{AR}(\\alpha^{LS}) + \\sum_{k=2}^{p} \\alpha_k^{LS} X_{t-k+1}. \\tag{4}$$\n\nThe values of $X_{t+2}, \\dots, X_{t+m}$ are predicted in the same recursive manner. The most obvious\ndrawback of this approach is that not much can be said on the quality of this predictor even if the\nAR model is well-specified, let alone if it is not (see [18] for further discussion on this issue).\nIn light of this, the motivation to formulate the problem of multi-step ahead prediction in the online\nsetting is quite clear: attaining sublinear regret in this setting would imply that our algorithm\u2019s performance\nis comparable with the best 1-step ahead recursive AR predictor in hindsight (even if the latter is\nmisspecified). Thus, our goal is to minimize the following regret term:\n\n$$R_T = \\sum_{t=1}^{T} \\left( X_t - \\tilde{X}_t \\right)^2 - \\min_{\\alpha \\in K} \\sum_{t=1}^{T} \\left( X_t - \\tilde{X}_t^{AR}(\\alpha) \\right)^2,$$\n\nwhere $K$ denotes the set of all 1-step ahead recursive AR predictors, against which we want to\ncompete. Note that since the feedback is delayed (the AR coefficients chosen at round $t-m$ are used\nto generate the prediction at round $t$), the memory comes unavoidably into the picture. Nevertheless,\nhere too neither of our techniques is straightforwardly applicable due to the non-convex structure\nof the problem: each prediction $\\tilde{X}_t^{AR}(\\alpha)$ contains products of $\\alpha$ coefficients that cause the losses to\nbe non-convex in $\\alpha$.\nTo circumvent this issue, we use non-proper (improper) learning techniques, and let our predictions be of\nthe form $\\tilde{X}_{t+m}^{IP}(w) = \\sum_{k=1}^{p} w_k X_{t-k}$ for a properly chosen set $K_{IP} \\subset \\mathbb{R}^p$ of the $w$ coefficients.\nBasically, the idea is to show that (a) attaining a regret bound with respect to the best predictor in the\nnew family can be done using the techniques we present in this work; and (b) the best predictor in\nthe new family is better than the best 1-step ahead recursive AR predictor. This would imply a regret\nbound with respect to the best 1-step ahead recursive AR predictor in hindsight.\n\nAlgorithm 4 Adaptation of Algorithm 1 to Multi-Step Ahead Prediction\n1: Input: learning rate $\\eta$, regularization function $R(x)$, signal $\\{X_t\\}_{t=1}^T$.\n2: Choose $w_0, \\dots, w_m \\in K_{IP}$ arbitrarily.\n3: for $t = m$ to $T$ do\n4:   Predict $\\tilde{X}_t^{IP}(w_{t-m}) = \\sum_{k=1}^{p} w_k^{t-m} X_{t-m-k}$ and suffer loss $\\left( X_t - \\tilde{X}_t^{IP}(w_{t-m}) \\right)^2$.\n5:   Set $w_{t+1} = \\arg\\min_{w \\in K_{IP}} \\left\\{ \\eta \\sum_{\\tau=m}^{t} \\left( X_\\tau - \\tilde{X}_\\tau^{IP}(w) \\right)^2 + \\|w\\|_2^2 \\right\\}$.\n6: end for\n\nOur formal result is given in the following corollary:\nCorollary 6.1. Let $D = \\sup_{w_1, w_2 \\in K_{IP}} \\|w_1 - w_2\\|_2$ and $G = \\sup_{w, t} \\|\\nabla f_t(X_t, \\tilde{X}_t(w))\\|_2$. Then,\nAlgorithm 4 generates an online sequence $\\{w_t\\}_{t=1}^T$, for which it holds that\n\n$$\\sum_{t=1}^{T} \\left( X_t - \\tilde{X}_t^{IP}(w_{t-m}) \\right)^2 - \\min_{\\alpha \\in K} \\sum_{t=1}^{T} \\left( X_t - \\tilde{X}_t^{AR}(\\alpha) \\right)^2 \\le 3GD\\sqrt{Tm}.$$\n\nRemark: The tighter bound in $m$ ($m^{1/2}$ instead of $m^{3/4}$) follows directly by modifying the proof\nof Theorem 3.1 to this setting ($f_t$ is affected only by $w_{t-m}$ and not by $w_{t-m}, \\dots, w_t$).\nIn the above, the values of $D$ and $G$ are determined by the choice of the set $K$. For instance, if we\nwant to compete against the best $\\alpha \\in K = [-1, 1]^p$ we need to use the restriction $w_k \\le 2^m$ for\nall $k$. In this case, $D \\approx 2^m$ and $G \\approx 1$. 
If we consider $K$ to be the set of all $\\alpha \\in \\mathbb{R}^p$ such that\n$\\alpha_k \\le (1/\\sqrt{2})^k$, we get that $D \\approx \\sqrt{m}$ and $G \\approx 1$.\nThe main novelty of our approach to the task of multi-step ahead prediction is the elimination of\ngenerative assumptions on the data, that is, we allow the time series to be arbitrarily generated. Such\nassumptions are common in the statistical literature, and needed in general to extract ML estimators.\n\n7 Discussion and Conclusion\n\nIn this work we extended the notion of online learning with memory to capture the general OCO\nframework, and proposed two algorithms with tight regret guarantees. We then applied our algorithms\nto two extensively studied problems: construction of mean reverting portfolios, and multi-step\nahead prediction. It remains for future work to further investigate the performance of our algorithms\nin these problems and other problems in which the memory naturally arises.\n\nAcknowledgments\n\nThis work has been supported by the European Community\u2019s Seventh Framework Programme\n(FP7/2007-2013) under grant agreement 306638 (SUPREL).\n\nReferences\n[1] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.\n[2] Elad Hazan. The convex optimization approach to regret minimization. Optimization for machine learning, page 287, 2011.\n[3] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107\u2013194, 2012.\n[4] M.L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 1994.\n[5] Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. 2012.\n[6] Nicolo Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries. 
In Advances in Neural Information Processing Systems, pages 1160\u20131168, 2013.\n[7] Neri Merhav, Erik Ordentlich, Gadiel Seroussi, and Marcelo J. Weinberger. On sequential strategies for loss functions with memory. IEEE Transactions on Information Theory, 48(7):1947\u20131958, 2002.\n[8] Andr\u00e1s Gy\u00f6rgy and Gergely Neu. Near-optimal rates for limited-delay universal lossy source coding. In ISIT, pages 2218\u20132222, 2011.\n[9] Sascha Geulen, Berthold V\u00f6cking, and Melanie Winkler. Regret minimization for online buffering problems using the weighted majority algorithm. In COLT, pages 132\u2013143, 2010.\n[10] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. In FOCS, pages 256\u2013261, 1989.\n[11] Eyal Gofer. Higher-order regret bounds with switching costs. In Proceedings of The 27th Conference on Learning Theory, pages 210\u2013243, 2014.\n[12] Ofer Dekel, Jian Ding, Tomer Koren, and Yuval Peres. Bandits with switching costs: $T^{2/3}$ regret. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 459\u2013467. ACM, 2014.\n[13] Anatoly B. Schmidt. Financial Markets and Trading: An Introduction to Market Microstructure and Trading Strategies (Wiley Finance). Wiley, 1st edition, August 2011.\n[14] Alexandre D\u2019Aspremont. Identifying small mean-reverting portfolios. Quant. Finance, 11(3):351\u2013364, 2011.\n[15] Marco Cuturi and Alexandre D\u2019Aspremont. Mean reversion with a variance threshold. 28(3):271\u2013279, May 2013.\n[16] Oren Anava, Elad Hazan, Shie Mannor, and Ohad Shamir. Online learning for time series prediction. arXiv preprint arXiv:1302.6927, 2013.\n[17] Oren Anava, Elad Hazan, and Assaf Zeevi. Online time series prediction with missing data. In ICML, 2015.\n[18] Michael P Clements and David F Hendry. Multi-step estimation for forecasting. 
Oxford Bulletin of Economics and Statistics, 58(4):657\u2013684, 1996.\n[19] Massimiliano Marcellino, James H Stock, and Mark W Watson. A comparison of direct and iterated multistep AR methods for forecasting macroeconomic time series. Journal of Econometrics, 135(1):499\u2013526, 2006.\n[20] G.S. Maddala and I.M. Kim. Unit Roots, Cointegration, and Structural Change. Themes in Modern Econometrics. Cambridge University Press, 1998.\n[21] Soren Johansen. Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian Vector Autoregressive Models. Econometrica, 59(6):1551\u201380, November 1991.\n[22] Jakub W Jurek and Halla Yang. Dynamic portfolio selection in arbitrage. In EFA 2006 Meetings Paper, 2007.\n[23] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169\u2013192, 2007.\n[24] L\u00e1szl\u00f3 Lov\u00e1sz and Santosh Vempala. Logconcave functions: Geometry and efficient sampling algorithms. In FOCS, pages 640\u2013649. IEEE Computer Society, 2003.\n[25] Hariharan Narayanan and Alexander Rakhlin. Random walk approach to regret minimization. In John D. Lafferty, Christopher K. I. Williams, John Shawe-Taylor, Richard S. Zemel, and Aron Culotta, editors, NIPS, pages 1777\u20131785. Curran Associates, Inc., 2010.\n", "award": [], "sourceid": 517, "authors": [{"given_name": "Oren", "family_name": "Anava", "institution": "Technion"}, {"given_name": "Elad", "family_name": "Hazan", "institution": "Princeton University"}, {"given_name": "Shie", "family_name": "Mannor", "institution": "Technion"}]}