{"title": "Prediction strategies without loss", "book": "Advances in Neural Information Processing Systems", "page_first": 828, "page_last": 836, "abstract": "Consider a sequence of bits where we are trying to predict the next bit from the previous bits. Assume we are allowed to say `predict 0' or `predict 1', and our payoff is $+1$ if the prediction is correct and $-1$ otherwise. We will say that at each point in time the loss of an algorithm is the number of wrong predictions minus the number of right predictions so far. In this paper we are interested in algorithms that have essentially zero (expected) loss over any string at any point in time and yet have small regret with respect to always predicting $0$ or always predicting $1$. For a sequence of length $T$ our algorithm has regret $14\\epsilon T $ and loss $2\\sqrt{T}e^{-\\epsilon^2 T} $ in expectation for all strings. We show that the tradeoff between loss and regret is optimal up to constant factors. Our techniques extend to the general setting of $N$ experts, where the related problem of trading off regret to the best expert for regret to the 'special' expert has been studied by Even-Dar et al. (COLT'07). We obtain essentially zero loss with respect to the special expert and optimal loss/regret tradeoff, improving upon the results of Even-Dar et al (COLT'07) and settling the main question left open in their paper. The strong loss bounds of the algorithm have some surprising consequences. First, we obtain a parameter free algorithm for the experts problem that has optimal regret bounds with respect to $k$-shifting optima, i.e. bounds with respect to the optimum that is allowed to change arms multiple times. 
Moreover, for {\\em any window of size $n$} the regret of our algorithm to any expert never exceeds $O(\\sqrt{n(\\log N+\\log T)})$, where $N$ is the number of experts and $T$ is the time horizon, while maintaining the essentially zero loss property.", "full_text": "Prediction strategies without loss\n\nMichael Kapralov\nStanford University\n\nStanford, CA\n\nkapralov@stanford.edu\n\nRina Panigrahy\n\nMicrosoft Research Silicon Valley\n\nMountain View, CA\n\nrina@microsoft.com\n\nAbstract\n\nConsider a sequence of bits where we are trying to predict the next bit from the\nprevious bits. Assume we are allowed to say \u2018predict 0\u2019 or \u2018predict 1\u2019, and our\npayoff is +1 if the prediction is correct and \u22121 otherwise. We will say that at\neach point in time the loss of an algorithm is the number of wrong predictions\nminus the number of right predictions so far. In this paper we are interested in\nalgorithms that have essentially zero (expected) loss over any string at any point\nin time and yet have small regret with respect to always predicting 0 or always\n\u221a\npredicting 1. For a sequence of length T our algorithm has regret 14\u0001T and loss\nT e\u2212\u00012T in expectation for all strings. We show that the tradeoff between loss\n2\nand regret is optimal up to constant factors.\nOur techniques extend to the general setting of N experts, where the related prob-\nlem of trading off regret to the best expert for regret to the \u2019special\u2019 expert has\nbeen studied by Even-Dar et al. (COLT\u201907). We obtain essentially zero loss with\nrespect to the special expert and optimal loss/regret tradeoff, improving upon the\nresults of Even-Dar et al and settling the main question left open in their paper.\nThe strong loss bounds of the algorithm have some surprising consequences.\nFirst, we obtain a parameter free algorithm for the experts problem that has op-\ntimal regret bounds with respect to k-shifting optima, i.e. 
bounds with respect to the optimum that is allowed to change arms multiple times. Moreover, for any window of size n the regret of our algorithm to any expert never exceeds O(\u221a(n(log N + log T))), where N is the number of experts and T is the time horizon, while maintaining the essentially zero loss property.\n\n1 Introduction\n\nConsider a gambler who is trying to predict the next bit in a sequence of bits. One could think of the bits as indications of whether a stock price goes up or down on a given day, where we assume that the stock always goes up or down by 1 (this is, of course, a very simplified model of the stock market). If the gambler predicts 1 (i.e. that the stock will go up), she buys one stock to sell it the next day, and short sells one stock if her prediction is 0. We will also allow the gambler to bet fractionally by letting her specify a confidence c, where 0 \u2264 c \u2264 1, in her prediction. If the prediction is right the gambler gets a payoff of c, otherwise \u2212c. While the gambler is tempted to make predictions with the prospect of making money, there is also the risk of ending up with a loss. Is there a way to never end up with a loss? Clearly there is the strategy of never predicting (by setting confidence 0) all the time, which never has a loss but also never has a positive payoff. However, if the sequence is very imbalanced and has many more 0\u2019s than 1\u2019s, then this never-predict strategy has a high regret with respect to the strategy that predicts the majority bit. Thus, one is interested in a strategy that has a small regret with respect to predicting the majority bit and incurs no loss at the same time.\nOur main result is that while one cannot always avoid a loss and still have a small regret, this is possible if we allow for an exponentially small loss. 
More precisely, we show that for any \u03b5 > 1/\u221aT there exists an algorithm that achieves regret at most 14\u03b5T and loss at most 2e^{\u2212\u03b5^2 T}\u221aT, where T is the time horizon. Thus, the loss is exponentially small in the length of the sequence.\nThe bit prediction problem can be cast as the experts problem with two experts: S+, that always predicts 1, and S\u2212, that always predicts 0. This problem has been studied extensively, and very efficient algorithms are known. The weighted majority algorithm of [12] is known to give optimal regret guarantees. However, it can be seen that weighted majority may result in a loss of \u03a9(\u221aT).\nThe best known result on bounding loss is the work of Even-Dar et al. [7] on the problem of trading off regret to the best expert for regret to the average expert, which is equivalent to our problem. Stated as a result on bounding loss, they were able to obtain constant loss and regret O(\u221aT log T). Their work left open the question of whether it is possible to get even a regret of O(\u221aT log T) and constant loss. In this paper we give an optimal regret/loss tradeoff, in particular showing that this regret can be achieved even with subconstant loss.\nOur results extend to the general setting of prediction with expert advice when there are multiple experts. In this problem the decision maker iteratively chooses among N available alternatives without knowledge of their payoffs, and gets a payoff based on the chosen alternative. The payoffs of all alternatives are revealed after the decision is made. This process is repeated over T rounds, and the goal of the decision maker is to maximize her cumulative payoff over all time steps t = 1, . . . , T. This problem and its variations have been studied extensively, and efficient algorithms have been obtained (e.g. [5, 12, 6, 2, 1]). 
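As a point of reference, the experts-problem interaction protocol just described can be sketched as follows (a minimal illustration; the function names are ours, not the paper's):

```python
def play_experts(payoff_matrix, choose):
    """Run the experts protocol: at each round pick an alternative via `choose`
    (which sees only past payoffs), then all alternatives' payoffs are revealed."""
    history, total = [], 0.0
    for round_payoffs in payoff_matrix:
        i = choose(history)          # decision made before payoffs are revealed
        total += round_payoffs[i]    # payoff of the chosen alternative
        history.append(round_payoffs)  # payoffs of all alternatives revealed
    return total

# Regret is the payoff of the best fixed alternative in hindsight
# minus the payoff accumulated by `choose`.
```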
The most widely used measure of performance of an online decision making algorithm is regret, which is defined as the difference between the payoff of the best fixed alternative and the payoff of the algorithm. The well-known weighted majority algorithm of [12] obtains regret O(\u221a(T log N)) even when no assumptions are made on the process generating the payoffs. Regret to the best fixed alternative in hindsight is a very natural notion when the payoffs are sampled from an unknown distribution, and in fact such scenarios show that the bound of O(\u221a(T log N)) on regret achieved by the weighted majority algorithm is optimal.\nEven-Dar et al. [7] gave an algorithm that has constant regret to any fixed distribution on the experts at the expense of regret O(\u221aT log N (log T + log log N)) with respect to all other experts\u00b9. We obtain an optimal tradeoff between the two, getting an algorithm with regret O(\u221a(T(log N + log T))) to the best and O((NT)^{\u2212\u03a9(1)}) to the average as a special case. We also note, similarly to [7], that our regret/loss tradeoff cannot be obtained by using standard regret minimization algorithms with a prior that is concentrated on the \u2018special\u2019 expert, since the prior would have to put a significant weight on the \u2018special\u2019 expert, resulting in \u03a9(T) regret to the best expert.\nThe extension to the case of N experts uses the idea of improving one expert\u2019s predictions by those of another. The strong loss bounds of our algorithm allow us to achieve lossless boosting, i.e. we use available experts to continuously improve upon the performance of the base expert whenever possible while essentially never hurting its performance. 
When comparing two experts, we track the difference in the payoffs discounted geometrically over time and apply a transform g(x) on this difference to obtain a weighting that is applied to give a linear combination of the two experts, with a higher weight being applied to the expert with a higher discounted payoff. The shape of g(x) is given by erf(x/(4\u221aT)) e^{x^2/(16T)}, capped at \u00b11. The weighted majority algorithm, on the other hand, uses a transform with the shape of the tanh(x/\u221aT) function and ignores geometric discounting (see Figure 1).\nAn important property of our algorithm is that it does not need a high imbalance between the number of ones and the number of zeros in the whole sequence to have a gain: it is sufficient for the imbalance to be large enough in at least one contiguous time window\u00b2, the size of which is a parameter of the algorithm. This property allows us to easily obtain optimal adaptive regret bounds, i.e. we show that the payoff of our algorithm in any geometric window of size n is at most O(\u221a(n log(NT))) worse than the payoff of the strategy that is best in that window (see Theorem 11).\n\n\u00b9 In fact, [7] provide several algorithms, of which the most relevant for comparison are Phased Aggression, yielding O(\u221aT log N (log T + log log N)) regret to the best, and D-Prod, yielding O(\u221a(T/log N) log T) regret to the best. For the bit prediction problem one would set N = 2 and use the uniform distribution over the \u2018predict 0\u2019 and \u2018predict 1\u2019 strategies as the special distribution. Our algorithm improves on both of them, yielding an optimal tradeoff.\n\u00b2 More precisely, we use an infinite window with geometrically decreasing weighting, so that most of the weight is contained in the window of size O(n), where n is a parameter of the algorithm.\n\n
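The claim in footnote 2, that a geometrically discounted window with \u03c1 = 1 \u2212 1/n concentrates most of its weight on the last O(n) steps, is easy to check numerically (a quick sanity check of ours, not code from the paper):

```python
def window_mass(n, c):
    """Fraction of the total geometric mass sum_k rho^k (rho = 1 - 1/n)
    that falls on the most recent c*n steps."""
    rho = 1 - 1 / n
    total = 1 / (1 - rho)                        # sum of rho^k over all k >= 0
    recent = sum(rho ** k for k in range(c * n))  # mass on the last c*n steps
    return recent / total
```

For example, `window_mass(100, 3)` is about 1 − e^(−3) ≈ 0.95, so a window of size 3n already captures ~95% of the weight.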
In the full version of the paper ([11]) we also obtain bounds against the class of strategies that are allowed to change experts multiple times while maintaining the essentially zero loss property. We note that even though similar bounds (without the essentially zero loss property) have been obtained before ([3, 9, 14] and, more recently, [10]), our approach is very different and arguably simpler.\nIn the full version of the paper, we also show how our algorithm yields regret bounds that depend on the lp norm of the costs, regret bounds dependent on Kolmogorov complexity, as well as applications of our framework to multi-armed bandits with partial information and online convex optimization.\n\n1.1 Related work\n\nThe question of what can be achieved if one would like to have a significantly better guarantee with respect to a fixed arm or a distribution on arms was asked before in [7], as we discussed in the introduction. Tradeoffs between regret and loss were also examined in [13], where the author studied the set of values of a, b for which an algorithm can have payoff aOPT + b log N, where OPT is the payoff of the best arm and a, b are constants. The problem of bit prediction was also considered in [8], where several loss functions are considered. None of them, however, corresponds to our setting, making the results incomparable.\nIn recent work on the NormalHedge algorithm [4] the authors use a potential function which is very similar to our function g(x) (see (2) below), getting strong regret guarantees to the \u03b5-quantile of best experts. However, the use of the function g(x) seems to be quite different from ours, as is the focus of the paper [4].\n\n1.2 Preliminaries\n\nWe start by defining the bit prediction problem formally. Let b_t, t = 1, . . . 
, T be an adversarial sequence of bits. It will be convenient to adopt the convention that b_t \u2208 {\u22121, +1} instead of b_t \u2208 {0, 1}, since it simplifies the formula for the payoff. In fact, in what follows we will only assume that \u22121 \u2264 b_t \u2264 1, allowing b_t to be real numbers. At each time step t = 1, . . . , T the algorithm is required to output a confidence level f_t \u2208 [\u22121, 1], and then the value of b_t is revealed to it. The payoff of the algorithm by time t' is\n\nA_{t'} = \u2211_{t=1}^{t'} f_t b_t.   (1)\n\nFor example, if b_t \u2208 {\u22121, +1}, then this setup is analogous to a prediction process in which a player observes a sequence of bits and at each point in time predicts that the value of the next bit will be sign(f_t) with confidence |f_t|. Predicting f_t \u2261 0 amounts to not playing the game, and incurs no loss, while not bringing any profit. We define the loss of the algorithm on a string b as\n\nloss = \u2212 min_t min{A_t, 0},\n\ni.e. the absolute value of the smallest negative payoff over all time steps.\nIt is easy to see that any algorithm that has a positive expected payoff on some sequence necessarily loses on another sequence. Thus, we are concerned with finding a prediction strategy that has exponentially small loss bounds but also has low regret against a number of given prediction strategies. In the simplest setting we would like to design an algorithm that has low regret against two basic strategies: S+, which always predicts +1, and S\u2212, which always predicts \u22121. Note that the maximum of the payoffs of S+ and S\u2212 is always equal to |\u2211_{t=1}^{T} b_t|. We denote the base random strategy, which predicts with confidence 0, by S0. In what follows we will use the notation A_T for the cumulative payoff of the algorithm by time T as defined above.\nAs we will show in Section 3, our techniques extend easily to give an algorithm that has low regret with respect to the best of any N bit prediction strategies and exponentially small loss. Our techniques work for the general experts problem, where loss corresponds to regret with respect to the \u2018special\u2019 expert S0, and hence we give the proof in this setting. This provides the connection to the work of [7].\nIn Section 2 we give an algorithm for the case of two prediction strategies S+ and S\u2212, and in Section 3 we extend it to the general experts problem, additionally giving the claimed adaptive regret bounds.\n\n2 Main algorithm\n\nThe main result of this section is\n\nTheorem 1 For any \u03b5 \u2265 \u221a(1/T) there exists an algorithm A for which\n\nA_T \u2265 max{ |\u2211_{j=1}^{T} b_j| \u2212 14\u03b5T, 0 } \u2212 2\u221aT e^{\u2212\u03b5^2 T},\n\ni.e. the algorithm has at most 14\u03b5T regret against S+ and S\u2212 as well as an exponentially small loss. By setting \u03b5 so that the loss bound is 2Z\u221aT, we get a regret bound of \u221a(T log(1/Z)).\nWe note that the algorithm is a strict generalization of weighted majority, which can be seen by letting Z = \u0398(1) (this property will also hold for the generalization to N experts in Section 3).\nOur algorithm will have the following form. For a chosen discount factor \u03c1 = 1 \u2212 1/n, 0 \u2264 \u03c1 \u2264 1, the algorithm maintains a discounted deviation x_t = \u2211_{j=1}^{t\u22121} \u03c1^{t\u22121\u2212j} b_j at each time t = 1, . . . 
, T . The value of the prediction at time t is then given by g(x_t) for a function g(\u00b7) to be defined (note that x_t depends only on b_{t'} for t' < t, so this is an online algorithm). The function g as well as the discount factor \u03c1 depend on the desired bound on expected loss and regret against S+ and S\u2212. In particular, we will set \u03c1 = 1 \u2212 1/T for our main result on the regret/loss tradeoff, and will use the freedom to choose different values of \u03c1 to obtain adaptive regret guarantees in Section 3. The algorithm is given by\n\nAlgorithm 1: Bounded loss prediction\n1: x_1 \u2190 0\n2: for t = 1 to T do\n3:   Predict sign(g(x_t)) with confidence |g(x_t)|.\n4:   Set x_{t+1} \u2190 \u03c1 x_t + b_t.\n5: end for\n\nWe start with an informal sketch of the proof, which will be made precise in Lemma 2 and Lemma 3 below. The proof is based on a potential function argument. In particular, we will choose the confidence function g(x) so that\n\n\u03a6_t = \u222b_0^{x_t} g(s) ds\n\nis a potential function, which serves as a repository for guarding our loss (we will choose g(x) to be an odd function, and hence will always have \u03a6_t \u2265 0). In particular, we will choose g(x) so that the change of \u03a6_t lower bounds the payoff of the algorithm. If we let \u03a6_t = G(x_t) (assuming for the sake of clarity that x_t > 0), where G(x) = \u222b_0^x g(s) ds, we have\n\n\u03a6_{t+1} \u2212 \u03a6_t = G(x_{t+1}) \u2212 G(x_t) \u2248 G'(x)\u0394x + G''(x)\u0394x^2/2 \u2248 g(x)[(\u03c1 \u2212 1)x + b_t] + g'(x)/2.\n\nSince the payoff of the algorithm at time step t is g(x_t) b_t, we have\n\n\u0394\u03a6_t \u2212 g(x_t) b_t = \u2212g(x_t)(1 \u2212 \u03c1) x_t + g'(x_t)/2,\n\nso the condition becomes\n\n\u2212g(x_t)(1 \u2212 \u03c1) x_t + g'(x_t)/2 \u2264 Z,\n\nwhere Z is the desired bound on the per-step loss of the algorithm. 
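Algorithm 1 is short enough to render directly in code. The sketch below is ours (not the authors' implementation) and uses the erf-based confidence function that the paper derives next; Z is the per-step loss budget and n the window size:

```python
import math

def g(x, Z, L):
    # Confidence function: sign(x) * min(Z * erf(|x|/(4L)) * e^{x^2/(16L^2)}, 1)
    if x == 0:
        return 0.0
    mag = Z * math.erf(abs(x) / (4 * L)) * math.exp(x * x / (16 * L * L))
    return math.copysign(min(mag, 1.0), x)

def predict(bits, Z, n):
    """Algorithm 1: maintain the discounted deviation x_t and predict
    sign(g(x_t)) with confidence |g(x_t)|. Returns (payoff, loss)."""
    rho = 1.0 - 1.0 / n
    L = math.sqrt(n)           # so that rho_bar = 1/n >= 1/L^2
    x, A, loss = 0.0, 0.0, 0.0
    for b in bits:             # b_t in [-1, 1]
        A += g(x, Z, L) * b    # payoff f_t * b_t with f_t = g(x_t)
        loss = max(loss, -A)   # magnitude of the worst negative payoff so far
        x = rho * x + b        # x_{t+1} = rho * x_t + b_t
    return A, loss
```

On a heavily imbalanced string the deviation x_t grows, the confidence ramps up to 1, and the payoff becomes large; on a balanced string x_t stays small, the confidence stays near zero, and the loss stays tiny.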
Solving this equation yields a function of the form\n\ng(x) = 2Z\u221aT \u00b7 erf(x/(4\u221aT)) e^{x^2/(16T)},\n\nwhere erf(x) = (2/\u221a\u03c0) \u222b_0^x e^{\u2212s^2} ds is the error function (see Figure 1 for the shape of g(x)).\nWe now make this proof sketch precise. For t = 1, . . . , T define \u03a6_t = \u222b_0^{x_t} g(x) dx. The function g(x) will be chosen to be a continuous odd function that is equal to 1 for x > U and to \u22121 when x < \u2212U, for some 0 < U < T . Thus, we will have that |x_t| \u2212 U \u2264 \u03a6_t \u2264 |x_t|. Intuitively, \u03a6_t captures the imbalance between the number of \u22121\u2019s and +1\u2019s in the sequence up to time t.\nWe will use the following parameters. We always have \u03c1 = 1 \u2212 1/n for some n > 1 and use the notation \u00af\u03c1 = 1 \u2212 \u03c1. We will later choose n = T to prove Theorem 1, but we will use different values of n for the adaptive regret guarantees in Section 3.\nWe now prove that if the function g(x) approximately satisfies a certain differential equation, then \u03a6_t defined as above is a potential function. The statement of Lemma 2 involves a function h(x) that will be chosen as a step function that is 1 when x \u2208 [\u2212U, U] and 0 otherwise.\n\nLemma 2 Suppose that the function g(x) used in Algorithm 1 satisfies\n\n(1/2)(\u00af\u03c1 x + 1)^2 \u00b7 max_{s \u2208 [\u03c1x\u22121, \u03c1x+1]} |g'(s)| \u2264 \u00af\u03c1 x g(x) h(x) + Z'\n\nfor a function h(x) with 0 \u2264 h(x) \u2264 1 for all x, and for some Z' > 0. 
Then the payoff of the algorithm is at least\n\n\u2211_{t=1}^{T} \u00af\u03c1 x_t g(x_t)(1 \u2212 h(x_t)) + \u03a6_{T+1} \u2212 Z'T\n\nas long as |b_t| \u2264 1 for all t.\nProof: We will show that at each t\n\n\u03a6_{t+1} \u2212 \u03a6_t \u2264 b_t g(x_t) + Z' \u2212 \u00af\u03c1 x_t g(x_t)(1 \u2212 h(x_t)),\n\ni.e.\n\n\u2211_{t=1}^{T} b_t g(x_t) \u2265 \u2212Z'T + \u2211_{t=1}^{T} \u00af\u03c1 x_t g(x_t)(1 \u2212 h(x_t)) + \u03a6_{T+1} \u2212 \u03a6_1,\n\nthus implying the claim of the lemma since \u03a6_1 = 0.\nWe consider the case x_t > 0. The case x_t < 0 is analogous. In the following derivation we will write [A, B] to denote [min{A, B}, max{A, B}].\n0 \u2264 b_t \u2264 1: We have x_{t+1} = \u03c1 x_t + b_t = x_t \u2212 \u00af\u03c1 x_t + b_t, and the expected payoff of the algorithm is g(x_t) b_t. Then\n\n\u03a6_{t+1} \u2212 \u03a6_t = \u222b_{x_t}^{x_t \u2212 \u00af\u03c1 x_t + b_t} g(s) ds\n\u2264 g(x_t)(b_t \u2212 \u00af\u03c1 x_t) + (1/2)(\u00af\u03c1 x_t + b_t)^2 \u00b7 max_{s \u2208 [x_t, x_t \u2212 \u00af\u03c1 x_t + b_t]} |g'(s)|\n\u2264 g(x_t) b_t + [\u2212g(x_t) \u00af\u03c1 x_t + (1/2)(\u00af\u03c1 x_t + 1)^2 \u00b7 max_{s \u2208 [x_t, x_t \u2212 \u00af\u03c1 x_t + b_t]} |g'(s)|]\n\u2264 g(x_t) b_t + (\u22121 + h(x_t)) \u00af\u03c1 x_t g(x_t) + Z'.\n\n\u22121 \u2264 b_t \u2264 0: This case is analogous.\nWe now define g(x) to satisfy the requirement of Lemma 2. For any Z, L > 0 let\n\ng(x) = sign(x) \u00b7 min{ Z \u00b7 erf(|x|/(4L)) e^{x^2/(16L^2)}, 1 }.   (2)\n\nOne can show that one has g(x) = 1 for |x| \u2265 U for some U \u2264 7L\u221a(log(1/Z)). A plot of the function g(x) is given in Figure 1. We choose\n\nh(x) = { 1, |x| < U; 0, otherwise. }   (3)\n\nThe following lemma shows that the function g(x) satisfies the properties stated in Lemma 2:\n\nFigure 1: The shape of the confidence function g(x) (solid line) and the tanh(x/U) function used by weighted majority (dotted line).\n\nLemma 3 Let L > 0 be such that \u00af\u03c1 = 1/n \u2265 1/L^2. Then for n \u2265 80 log(1/Z) the function g(x) defined by (2) satisfies\n\n(1/2)(\u00af\u03c1 x + 1)^2 \u00b7 max_{s \u2208 [\u03c1x\u22121, \u03c1x+1]} |g'(s)| \u2264 \u00af\u03c1 x g(x) h(x)/2 + 2\u00af\u03c1 L Z,\n\nwhere h(x) is the step function defined above.\nProof: The intuition behind the lemma is very simple. Note that s \u2208 [\u03c1x \u2212 1, \u03c1x + 1] is not much further than 1 away from x, so g'(s) is very close to g'(x) = (x/(8L^2)) g(x) + Z/(2\u221a\u03c0 L). Since \u00af\u03c1 \u2265 1/L^2, we have g'(x) \u2264 \u00af\u03c1 x g(x)/2 + (1/\u221a\u03c0) \u00af\u03c1 L Z. We defer the details of the proof to the full version.\nWe can now lower bound the payoff of Algorithm 1. We will use the notation\n\n|x|^+_\u03b5 = { 0, |x| < \u03b5; |x| \u2212 \u03b5, |x| > \u03b5 }\n\nTheorem 4 Let n be the window size parameter of Algorithm 1. 
Then one has\n\nA_T \u2265 \u2211_{t=1}^{T} \u00af\u03c1 |x_t|^+_U + |x_{T+1}|^+_U \u2212 2ZT/\u221an.\n\nProof: By Lemma 3 we have that the function g(x) satisfies the conditions of Lemma 2, and so from the bounds stated in Lemma 2 the payoff of the algorithm is at least\n\n\u2211_{t=1}^{T} \u00af\u03c1 |x_t|^+_U + \u03a6_{T+1} \u2212 2ZT/\u221an.\n\nBy definition of \u03a6_t, since |g(x)| = 1 for |x| \u2265 U, one has \u03a6_{T+1} \u2265 |x_{T+1}|^+_U, which gives the desired statement.\nNow, setting n = T , we obtain\n\nTheorem 5\n\nA_T \u2265 max{ |\u2211_{j=1}^{T} b_j| \u2212 14\u221a(T log(1/Z)), 0 } \u2212 2Z\u221aT.\n\nProof: In light of Theorem 4 it remains to bound \u2211_{t=1}^{T} \u00af\u03c1 x_t + x_{T+1}. We have\n\n\u00af\u03c1 \u2211_{t=1}^{T} x_t + x_{T+1} = \u00af\u03c1 \u2211_{t=1}^{T} \u2211_{j=1}^{t\u22121} \u03c1^{t\u22121\u2212j} b_j + x_{T+1} = \u2211_{t=1}^{T} b_t (1 \u2212 \u03c1^{T\u2212t}) + \u2211_{t=1}^{T} \u03c1^{T\u2212t} b_t = \u2211_{t=1}^{T} b_t.   (4)\n\nThus, since U \u2264 7\u221a(T log(1/Z)), and we chose \u03c1 = 1 \u2212 1/n = 1 \u2212 1/T , we get the result by combining Theorem 4 and equation (4).\nProof of Theorem 1: Follows by setting log(1/Z) = \u03b5^2 T .\nOur loss/regret tradeoff is optimal up to constant factors (proof deferred to the full version):\n\nTheorem 6 Any algorithm that has regret O(\u221a(T log(1/Z))) incurs loss \u03a9(Z\u221aT) on at least one sequence of bits b_t, t = 1, . . . , T .\n\nNote that if Z = o(1/T), then the payoff of the algorithm is positive whenever the absolute value of the deviation x_t is larger than, say, 8\u221a(n log T) in at least one window of size n.\n\n3 Combining strategies (lossless boosting)\n\nIn the previous section we derived an algorithm for the bit prediction problem with low regret to the S+ and S\u2212 strategies and exponentially small loss. We now show how our techniques yield an algorithm that has low regret to the best of N bit prediction strategies S1, . . . , SN and exponentially small loss. However, since the proof works for the general experts problem, where loss corresponds to regret to a \u2018special\u2019 expert S0, we state it in the general experts setting. In what follows we will refer to regret to S0 as loss. We will also prove optimal bounds on regret that hold in every window of length n at the end of the section. We start by proving\n\nTheorem 7 For any Z < 1/e there exists an algorithm for combining N strategies that has regret O(\u221a(T log(N/Z))) against the best of N strategies and loss at most O(ZN\u221aT) with respect to any strategy S0 fixed a priori. These bounds are optimal up to constant factors.\n\nWe first fix notation. A prediction strategy S, given a bit string b_t, produces a sequence of weights w_{jt} on the set of experts j = 1, . . . , N such that w_{jt} depends only on b_{t'}, t' < t, and \u2211_{j=1}^{N} w_{jt} = 1, w_{jt} \u2265 0 for all t. Thus, using strategy S amounts to using expert j with probability w_{j,t} at time t, for all t = 1, . . . , T . For two strategies S1, S2 we write \u03b1_t S1 + (1 \u2212 \u03b1_t) S2 to denote the strategy whose weights are a convex combination of the weights of S1 and S2 given by coefficients \u03b1_t \u2208 [0, 1]. For a strategy S we denote its payoff at time t by s_t.\nWe start with the case of two strategies S1, S2. 
Our algorithm will consider S1 as the base strategy (corresponding to the null strategy S0 in the previous section) and will use S2 to improve on S1 whenever possible, without introducing significant loss over S1 in the process. We define\n\n\u00afg(x) = { g(x/2), x > 0; 0, otherwise, }\n\ni.e. we are using a one-sided version of g(x). It is easy to see that \u00afg(x) satisfies the conditions of Lemma 2 with h(x) as defined in (3). The intuition behind the algorithm is that since the difference in payoff obtained by using S2 instead of S1 is given by (s_{2,t} \u2212 s_{1,t}), it is sufficient to emulate Algorithm 1 on this sequence. In particular, we set x_t = \u2211_{j=1}^{t\u22121} \u03c1^{t\u22121\u2212j}(s_{2,j} \u2212 s_{1,j}) and predict \u00afg(x_t) (note that since |s_{1,t} \u2212 s_{2,t}| \u2264 2, we need to use g(x/2) in the definition of \u00afg to scale the payoffs). Predicting 0 corresponds to using S1, predicting 1 corresponds to using S2, and fractional values correspond to a convex combination of S1 and S2.\nFormally, the algorithm COMBINE(S1, S2, \u03c1) takes the following form:\n\nAlgorithm 2: COMBINE(S1, S2, \u03c1)\n1: Input: strategies S1, S2\n2: Output: strategy S*\n3: x_1 \u2190 0\n4: for t = 1 to T do\n5:   Set S*_t \u2190 S_{1,t}(1 \u2212 \u00afg(x_t)) + S_{2,t} \u00afg(x_t).\n6:   Set x_{t+1} \u2190 \u03c1 x_t + (s_{2,t} \u2212 s_{1,t}).\n7: end for\n8: return S*\n\nNote that COMBINE(S1, S2, \u03c1) is an online algorithm, since S*_t only depends on s_{1,t'}, s_{2,t'}, t' < t.\n\nLemma 8 There exists an algorithm that given two strategies S1 and S2 gets payoff at least\n\n\u2211_{t=1}^{T} s_{1,t} + max{ \u2211_{t=1}^{T} (s_{2,t} \u2212 s_{1,t}) \u2212 O(\u221a(T log(1/Z))), 0 } \u2212 O(Z\u221aT).\n\nPROOF OUTLINE: Use Algorithm 2 with \u03c1 = 1 \u2212 1/T . 
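In code, the combination step of Algorithm 2 looks as follows (our sketch, with strategies represented by their realized per-step payoffs and the one-sided confidence written out explicitly):

```python
import math

def g_bar(x, Z, L):
    """One-sided confidence: g(x/2) for x > 0, else 0
    (the payoff differences s2 - s1 lie in [-2, 2], hence the halving)."""
    if x <= 0:
        return 0.0
    mag = Z * math.erf((x / 2) / (4 * L)) * math.exp((x / 2) ** 2 / (16 * L * L))
    return min(mag, 1.0)

def combine(payoffs1, payoffs2, rho, Z, L):
    """COMBINE(S1, S2, rho): at each step put weight g_bar(x_t) on S2 and
    1 - g_bar(x_t) on S1, where x_t geometrically discounts s2 - s1."""
    x = 0.0
    weights = []
    for s1, s2 in zip(payoffs1, payoffs2):
        a = g_bar(x, Z, L)
        weights.append((1 - a, a))     # (weight on S1, weight on S2)
        x = rho * x + (s2 - s1)        # discounted payoff difference
    return weights
```

If S2 consistently outperforms S1, the discounted difference grows and the weight shifts entirely onto S2; if S2 lags, the one-sided confidence stays at 0 and the combination falls back to the base strategy S1.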
This amounts to applying Algorithm 1 to the sequence (s_{2,t} \u2212 s_{1,t}), so the guarantees follow by Theorem 4.\nWe emphasize the property that Algorithm 2 combines two strategies S1 and S2, improving on the performance of S1 using S2 whenever possible, essentially without introducing any loss with respect to S1. Thus, this amounts to lossless boosting of one strategy\u2019s performance using another, and we have\nProof of Theorem 7: Use Algorithm 2 repeatedly to combine N strategies S1, . . . , SN by initializing \u00afS_0 \u2190 S0 and setting \u00afS_j \u2190 COMBINE(\u00afS_{j\u22121}, S_j, 1 \u2212 1/T), j = 1, . . . , N, where S0 is the null strategy. The regret and loss guarantees follow by Lemma 8.\n\nCorollary 9 Setting Z = (NT)^{\u22121\u2212\u03b3} for \u03b3 > 0, we get regret O(\u221a(\u03b3T(log N + log T))) to the best of N strategies and loss at most O((NT)^{\u2212\u03b3}) with respect to strategy S0 fixed a priori. These bounds are optimal and improve on the work of [7].\n\nSo far we have used \u03c1 = 1 \u2212 1/T for all results. One can obtain optimal adaptive guarantees by performing boosting over a range of decay parameters \u03c1. In particular, choose \u03c1_j = 1 \u2212 1/n_j, where n_j, j = 1, . . . , W are powers of two between 80 log(NT) and T . Then let\n\nAlgorithm 3: Boosting over different time scales\n1: S_{0,W} \u2190 S0\n2: for j = W downto 1 do\n3:   for k = 1 to N do\n4:     S_{k,j} \u2190 COMBINE(S_{k\u22121,j}, S_k, 1 \u2212 1/n_j)\n5:   end for\n6:   S_{0,j\u22121} \u2190 S_{N,j}\n7: end for\n8: S* \u2190 S_{0,0}\n9: return S*\n\nWe note that it is important that the outer loop in Algorithm 3 goes from large windows down to small windows. Finally, we show another adaptive regret property of Algorithm 3. First, for a sequence w_t, t = 1, . . . 
, T of real numbers and for a parameter \u03c1 = 1 \u2212 1/n \u2208 (0, 1) define\n\n\u02dcw^\u03c1_t = \u2211_{j=1}^{t} \u03c1^{t\u2212j} w_j.\n\nWe will need the following definition:\n\nDefinition 10 A sequence w_j is Z-uniform at scale \u03c1 = 1 \u2212 1/n if one has \u02dcw^\u03c1_t \u2264 c\u221a(n log(1/Z)) for all 1 \u2264 t \u2264 T , for some constant c > 0.\n\nNote that if the input sequence is iid Ber(\u00b11, 1/2), then it is Z-uniform at any scale with probability at least 1 \u2212 Z for any Z > 0. We now prove that the difference between the payoff of our algorithm and the payoff of any expert is Z-uniform, i.e. does not exceed the standard deviation of a uniformly random variable in any sufficiently large window, when the loss is bounded by Z. More precisely,\n\nTheorem 11 The sequences s_{j,t} \u2212 s*_t are Z-uniform for any 1 \u2264 j \u2264 N at any scale \u03c1 \u2265 1 \u2212 1/(80 log(1/Z)) when Z = o((NT)^{\u22122}). Moreover, the loss of the algorithm with respect to the base strategy is at most 2ZN\u221aT.\n\nThe proof of Theorem 11 is given in the full version.\n\nReferences\n[1] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. COLT, 2009.\n[2] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multi-armed bandit problem. SIAM J. Comput., 32:48\u201377, 2002.\n[3] A. Blum and Y. Mansour. From external to internal regret. Journal of Machine Learning Research, pages 1307\u20131324, 2007.\n[4] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. NIPS, 2009.\n[5] T. Cover. Behaviour of sequential predictors of binary sequences. Transactions of the Fourth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, 1965.\n[6] T. Cover. Universal portfolios. Mathematical Finance, 1991.\n[7] E. Even-Dar, M. Kearns, Y. Mansour, and J. Wortman. Regret to the best vs. 
regret to the average. Machine Learning, 72:21\u201337, 2008.\n[8] Y. Freund. Predicting a binary sequence almost as well as the optimal biased coin. COLT, 1996.\n[9] Y. Freund, R. E. Schapire, Y. Singer, and M. K. Warmuth. Using and combining predictors that specialize. STOC, pages 334\u2013343, 1997.\n[10] E. Hazan and C. Seshadhri. Efficient learning algorithms for changing environments (full version available at http://eccc.hpi-web.de/eccc-reports/2007/tr07-088/index.html). ICML, pages 393\u2013400, 2009.\n[11] M. Kapralov and R. Panigrahy. Prediction without loss in multi-armed bandit problems. http://arxiv.org/abs/1008.3672, 2010.\n[12] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. FOCS, 1989.\n[13] V. Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 1998.\n[14] V. Vovk. Derandomizing stochastic prediction strategies. Machine Learning, pages 247\u2013282, 1999.\n", "award": [], "sourceid": 556, "authors": [{"given_name": "Michael", "family_name": "Kapralov", "institution": null}, {"given_name": "Rina", "family_name": "Panigrahy", "institution": null}]}