{"title": "Near Minimax Optimal Players for the Finite-Time 3-Expert Prediction Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 3033, "page_last": 3042, "abstract": "We study minimax strategies for the online prediction problem with expert advice. It has been conjectured that a simple adversary strategy, called COMB, is near optimal in this game for any number of experts. Our results and new insights make progress in this direction by showing that, up to a small additive term, COMB is minimax optimal in the finite-time three expert problem. In addition, we provide for this setting a new near minimax optimal COMB-based learner. Prior to this work, in this problem, learners obtaining the optimal multiplicative constant in their regret rate were known only when $K=2$ or $K\\rightarrow\\infty$. We characterize, when $K=3$, the regret of the game scaling as $\\sqrt{8/(9\\pi)T}\\pm \\log(T)^2$ which gives for the first time the optimal constant in the leading ($\\sqrt{T}$) term of the regret.", "full_text": "Near Minimax Optimal Players for the Finite-Time\n\n3-Expert Prediction Problem\n\nYasin Abbasi-Yadkori\n\nAdobe Research\n\nPeter L. Bartlett\n\nUC Berkeley\n\nVictor Gabillon\n\nQueensland University of Technology\n\nAbstract\n\nWe study minimax strategies for the online prediction problem with expert advice.\nIt has been conjectured that a simple adversary strategy, called COMB, is near\noptimal in this game for any number of experts. Our results and new insights make\nprogress in this direction by showing that, up to a small additive term, COMB is\nminimax optimal in the \ufb01nite-time three expert problem. In addition, we provide\nfor this setting a new near minimax optimal COMB-based learner. Prior to this\nwork, in this problem, learners obtaining the optimal multiplicative constant in\ntheir regret rate were known only when K = 2 or K \u2192 \u221e. 
We characterize, when $K = 3$, the regret of the game scaling as $\sqrt{8/(9\pi)T} \pm \log(T)^2$, which gives for the first time the optimal constant in the leading ($\sqrt{T}$) term of the regret.

1 Introduction

This paper studies the online prediction problem with expert advice. This is a fundamental problem of machine learning that has been studied for decades, going back at least to the work of Hannan [12] (see [4] for a survey). As it studies prediction under adversarial data, the designed algorithms are known to be robust and are commonly used as building blocks of more complicated machine learning algorithms with numerous applications. Thus, elucidating the yet unknown optimal strategies has the potential to significantly improve the performance of these higher-level algorithms, in addition to providing insight into a classic prediction problem. The problem is a repeated two-player zero-sum game between an adversary and a learner. At each of the T rounds, the adversary decides the quality/gain of K experts' advice, while simultaneously the learner decides to follow the advice of one of the experts. The objective of the adversary is to maximize the regret of the learner, defined as the difference between the total gain of the learner and the total gain of the best fixed expert.

Open Problems and our Main Results. Previously this game has been solved asymptotically as both T and K tend to $\infty$: asymptotically, the upper bound on the performance of the state-of-the-art Multiplicative Weights Algorithm (MWA) for the learner matches the optimal multiplicative constant of the asymptotic minimax optimal regret rate $\sqrt{(T/2)\log K}$ [3]. However, for finite K, Cover [5] proved that the value of the game is of order $\sqrt{T/(2\pi)}$ when $K = 2$, meaning that the regret of a MWA learner is 47% larger than that of the optimal learner in this case. Already, this asymptotic quantity actually overestimates the finite-time value of the game. Moreover, Gravin et al. [10] proved a matching lower bound $\sqrt{(T/2)\log K}$ on the regret of the classic version of MWA, additionally showing that the optimal learner does not belong to an extended MWA family. Therefore the question of optimality remains open for non-asymptotic K, which are the typical cases in applications.

In studying a related setting with K = 3, where T is sampled from a geometric distribution with parameter $\delta$, Gravin et al. [9] conjectured that, for any K, a simple adversary strategy, called the COMB adversary, is asymptotically optimal ($T \rightarrow \infty$, or when $\delta \rightarrow 0$), and also remarkably competitive for finite, fixed T. The COMB strategy sorts the experts based on their cumulative gains and, with probability one half, assigns gain one to each expert in an odd position and gain zero to each expert in an even position. With probability one half, the zeros and ones are swapped. The simplicity and elegance of this strategy, combined with its almost optimal performance, makes it very appealing and calls for a more extensive study of its properties.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Our results and new insights make progress in this direction by showing that, for any fixed T and up to small additive terms, COMB is minimax optimal in the finite-time three expert problem. Additionally, and with similar guarantees, we provide for this setting a new near minimax optimal COMB-based learner. For K = 3, the regret of a MWA learner is 39% larger than that of our new optimal learner1.
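As an illustration of the COMB adversary just described, here is a minimal sketch (our own illustration, not the authors' code) that draws one COMB gain vector for an arbitrary number of experts:

```python
import random

def comb_gains(cum_gains):
    """Draw one COMB gain vector. Experts are sorted by cumulative gain;
    odd positions receive gain 1 and even positions gain 0, or, with
    probability one half, the zeros and ones are swapped."""
    ranked = sorted(range(len(cum_gains)), key=lambda k: -cum_gains[k])
    swap = random.random() < 0.5
    g = [0] * len(cum_gains)
    for pos, k in enumerate(ranked):
        odd_position = (pos % 2 == 0)  # pos 0 is the 1st (odd) position
        g[k] = int(odd_position != swap)
    return g
```

For three experts, either the leading and lagging experts both receive a gain of 1 and the middle expert receives 0, or the reverse; this is the distribution written {2}{13} later in the paper.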
In this paper we also characterize, when K = 3, the regret of the game as $\sqrt{8/(9\pi)T} \pm \log(T)^2$, which gives for the first time the optimal constant in the leading ($\sqrt{T}$) term of the regret. Note that the state-of-the-art non-asymptotic lower bound in [15] on the value of this problem is uninformative, as the lower bound for the case of K = 3 is a negative quantity.

Related Works and Challenges. For the case of K = 3, Gravin et al. [9] proved the exact minimax optimality of a COMB-related adversary in the geometrical setting, i.e. where T is not fixed in advance but rather sampled from a geometric distribution with parameter $\delta$. However, the connection between the geometrical setting and the original finite-time setting is not well understood, even asymptotically (possibly due to the large variance of geometric distributions with small $\delta$). Addressing this issue, in Section 7 of [8], Gravin et al. formulate the "Finite vs Geometric Regret" conjecture, which states that the value of the game in the geometrical setting, $V_\alpha$, and the value of the game in the finite-time setting, $V_T$, verify $V_T = \frac{2}{\sqrt{\pi}}\, V_{\alpha=1/T}$. We resolve here the conjecture for K = 3.

Analyzing the finite-time expert problem raises new challenges compared to the geometric setting. In the geometric setting, at any time (round) t of the game, the expected number of remaining rounds before the end of the game is constant (it does not depend on the current time t). This simplifies the problem to the point that, when K = 3, there exists an exactly minimax optimal adversary that ignores the time t and the parameter $\delta$. As noted in [9], and noticeable from solving exactly small instances of the game with a computer, in the finite-time case the exact optimal adversary seems to depend in a complex manner on time and state.
It is therefore natural to settle for a simpler adversary that is optimal up to a small additive error term. Indeed, based on the observation of the restricted computer-based solutions, the additive error term of COMB seems to vanish with larger T. Tightly controlling the errors made by COMB is a new challenge with respect to [9], where the solution to the optimality equations led directly to the exact optimal adversary. The existence of such equations in the geometric setting crucially relies on the fact that the value-to-go of a given policy in a given state does not depend on the current time t (because geometric distributions are memoryless). To control the errors in the finite-time setting, our new approach solves the game by backward induction, showing the approximate greediness of COMB with respect to itself (read Section 2.1 for an overview of our new proof techniques and their organization). We use a novel exchangeability property, new connections to random walks, and a close relation that we develop between COMB and a TWIN-COMB strategy. Additional connections with new related optimal strategies and random walks are used to compute the value of the game (Theorem 2). We discuss in Section 6 how our new techniques have more potential to extend to an arbitrary number of arms than those of [9].

Additionally, we show how the approximate greediness of COMB with respect to itself is key to proving that a learner based directly on the COMB adversary is itself quasi-minimax-optimal. This is the first work to extend to the approximate case the approaches used to design exactly optimal players in related works. In [2] a probability matching learner is proven optimal under the assumption that the adversary is limited to a fixed cumulative loss for the best expert.
In [14] and [1], the optimal learner relies on estimating the value-to-go of the game through rollouts of the optimal adversary's plays. The results in these papers were limited to games where the optimal adversary only plays canonical unit vectors, while our result holds for general gain vectors. Note also that a probability matching learner is optimal in [9].

Notation: Let $[a:b] = \{a, a+1, \dots, b\}$ with $a, b \in \mathbb{N}$, $a \le b$, and $[a] = [1:a]$. For a vector $w \in \mathbb{R}^n$, $n \in \mathbb{N}$, $\|w\|_\infty = \max_{k\in[n]}|w_k|$. A vector indexed by both a time t and a specific element index k is $w_{t,k}$. An undiscounted Markov Decision Process (MDP) [13, 16] $M$ is a 4-tuple $\langle S, A, r, p\rangle$. $S$ is the state space, $A$ is the set of actions, $r : S \times A \rightarrow \mathbb{R}$ is the reward function, and the transition model $p(\cdot|s,a)$ gives the probability distribution over the next state when action $a$ is taken in state $s$. A state is denoted by $s$, or $s_t$ if it is taken at time $t$. An action is denoted by $a$ or $a_t$.

1 [19] also provides an upper bound that is suboptimal when K = 3, even after optimization of its parameters.

2 The Game

We consider a game, composed of T rounds, between two players, called a learner and an adversary. At each time/round t the learner chooses an index $I_t \in [K]$ from a distribution $p_t$ on the K arms. Simultaneously, the adversary assigns a binary gain to each of the arms/experts, possibly at random from a distribution $\dot A_t$, and we denote the vector of these gains by $g_t \in \{0,1\}^K$. The adversary and the learner then observe $I_t$ and $g_t$. For simplicity we use the notation $g_{[t]} = (g_s)_{s=1,\dots,t}$.
The value of one realization of such a game is the cumulative regret defined as

$$R_T = \Big\|\sum_{t=1}^{T} g_t\Big\|_\infty - \sum_{t=1}^{T} g_{t,I_t}.$$

A state $s \in S = (\mathbb{N}\cup\{0\})^K$ is a K-dimensional vector such that the k-th element is the cumulative sum of gains dealt by the adversary on arm k before the current time t. Here the state does not include t, but it is typically denoted for a specific time t as $s_t$ and computed as $s_t = \sum_{t'=1}^{t-1} g_{t'}$. This definition is motivated by the fact that there exist minimax strategies for both players that rely solely on the state and time information, as opposed to the complete history of plays, $g_{[t]} \cup I_{[t]}$. In state s, the set of leading experts, i.e., those with maximum cumulative gain, is $X(s) = \{k \in [K] : s_k = \|s\|_\infty\}$. We use $\pi$ to denote the (possibly non-stationary) strategy/policy used by the adversary, i.e., for any input state s and time t it outputs the gain distribution $\pi(s,t)$ played by the adversary at time t in state s. Similarly we use $\bar p$ to denote the strategy of the learner. As the state depends only on the adversary plays, we can sample a state s at time t from $\pi$.

Given an adversary $\pi$ and a learner $\bar p$, the expected regret of the game, $V^T_{\bar p,\pi}$, is $V^T_{\bar p,\pi} = \mathbb{E}_{g_{[T]}\sim\pi,\ I_{[T]}\sim\bar p}[R_T]$. The learner tries to minimize the expected regret while the adversary tries to maximize it.
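To make the definitions concrete, here is a minimal sketch (ours, not code from the paper) computing the realized regret $R_T$ and the final state from one run of gains and learner choices:

```python
def realized_regret(gains, choices):
    """R_T = ||sum_t g_t||_inf - sum_t g_{t, I_t} for one realization.
    gains[t] is the adversary's gain vector g_t; choices[t] is I_t."""
    K = len(gains[0])
    state = [0] * K                       # cumulative gains, s_{T+1}
    learner_total = 0
    for g, i in zip(gains, choices):
        learner_total += g[i]             # gain of the followed expert
        state = [sk + gk for sk, gk in zip(state, g)]
    return max(state) - learner_total     # best fixed expert minus learner
```

For example, if the best fixed expert accumulates gain 2 over three rounds while the learner collects 1, the realized regret is 1.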
The value of the game is the minimax value $V_T$ defined by

$$V_T = \min_{\bar p}\max_{\pi} V^T_{\bar p,\pi} = \max_{\pi}\min_{\bar p} V^T_{\bar p,\pi}.$$

In this work, we are interested in the search for optimal minimax strategies, which are adversary strategies $\pi^\ast$ such that $V_T = \min_{\bar p} V^T_{\bar p,\pi^\ast}$ and learner strategies $\bar p^\ast$ such that $V_T = \max_{\pi} V^T_{\bar p^\ast,\pi}$.

2.1 Summary of our Approach to Obtain the Near Greediness of COMB

Most of our material is new. First, Section 3 recalls that Gravin et al. [9] have shown that the search for the optimal adversary $\pi^\ast$ can be restricted to the finite family of balanced strategies (defined in the next section). When K = 3, the action space of a balanced adversary is limited to seven stochastic actions (gain distributions), denoted by $\dot B_3 = \{\dot W, \dot C, \dot V, \dot 1, \dot 2, \{\}, \{123\}\}$ (see Section 5.1 for their description). The COMB adversary repeats the gain distribution $\dot C$ at each time and in any state. In Section 4 we provide an explicit formulation of the problem as finding $\pi^\ast$ inside an MDP with a specific reward function. Interestingly, we observe that another adversary, which we call TWIN-COMB and denote by $\pi_W$, which repeats the distribution $\dot W$, has the same value as $\pi_C$ (Section 5.1). To control the errors made by COMB, the proof uses a novel and intriguing exchangeability property (Section 5.2). This exchangeability property holds thanks to the surprising role played by the TWIN-COMB strategy.
For any distribution $\dot A \in \dot B_3$ there exists a distribution $\dot D$, a mixture of $\dot C$ and $\dot W$, such that, for almost all states, playing $\dot A$ and then $\dot D$ is the same as playing $\dot W$ and then $\dot A$, in terms of the expected reward and the probabilities over the next states after these two steps. Using Bellman operators, this can be concisely written as: for any (value) function $f : S \rightarrow \mathbb{R}$, in (almost) any state s, we have that $[T_{\dot A}[T_{\dot D}f]](s) = [T_{\dot W}[T_{\dot A}f]](s)$. We solve the MDP with a backward induction in time from t = T. We show that playing $\dot C$ at time t is almost greedy with respect to playing $\pi_C$ in later rounds $t' > t$. The greedy error is defined as the difference of expected reward between always playing $\pi_C$ and playing the best (greedy) first action before playing COMB. Bounding how these errors accumulate through the rounds relates the value of COMB to the value of $\pi^\ast$ (Lemma 16).

To illustrate the main ideas, let us first make two simplifying (but unrealistic) assumptions at time t: COMB has been proven greedy w.r.t. itself in rounds $t' > t$, and the exchangeability holds in all states. Then we would argue at time t that, by the exchangeability property, instead of optimizing the greedy action w.r.t. COMB as $\max_{\dot A\in\dot B_3} \dot A\,\dot C\cdots\dot C$, we can study the optimizer of $\max_{\dot A\in\dot B_3} \dot W\,\dot A\,\dot C\cdots\dot C$. Then we use the induction property to conclude that $\dot C$ is the solution of the previous optimization problem. Unfortunately, the exchangeability property does not hold in one specific state, denoted by $s_\alpha$.
What saves us, though, is that we can directly compute the error of greedification of any gain distribution with respect to COMB in $s_\alpha$ and show that it diminishes exponentially fast as $T - t$, the number of rounds remaining, increases (Lemma 7). This helps us to control how the errors accumulate during the induction. From one given state $s_t \neq s_\alpha$ at time t, first, we use the exchangeability property once when trying to assess the 'quality' of an action $\dot A$ as a greedy action w.r.t. COMB. This leads us to consider the quality of playing $\dot A$ in possibly several new states $\{s_{t+1}\}$ at time t+1, reached by following TWIN-COMB in $s_t$. We use our exchangeability property repeatedly, starting from the state $s_t$, until a subsequent state reaches $s_\alpha$, say at time $t_\alpha$, where we can substitute the exponentially decreasing greedy error computed at this time $t_\alpha$ in $s_\alpha$. Here the subsequent states are the states reached after having played TWIN-COMB repeatedly starting from the state $s_t$. If $s_\alpha$ is never reached, we use the fact that COMB is an optimal action everywhere else in the last round. The problem is then to determine at which time $t_\alpha$, starting from any state at time t and following a TWIN-COMB strategy, we hit $s_\alpha$ for the first time. This is translated into a classical gambler's ruin problem, which concerns the hitting times of a simple random walk (Section 5.3). Similarly, the value of the game is computed using the study of the expected number of equalizations of a simple random walk (Theorem 5.1).

3 Solving for the Adversary Directly

In this section, we recall the results from [9] that, for arbitrary K, permit us to directly search for the minimax optimal adversary in the restricted set of balanced adversaries while ignoring the learner.

Definition 1.
A gain distribution $\dot A$ is balanced if there exists a constant $c_{\dot A}$, the mean gain of $\dot A$, such that $\forall k \in [K],\ c_{\dot A} = \mathbb{E}_{g|\dot A}[g_k]$. A balanced adversary uses exclusively balanced gain distributions.

Lemma 1 (Claim 5 in [9]). There exists a minimax optimal balanced adversary.

Use $B$ to denote the set of all balanced strategies and $\dot B$ to denote the set of all balanced gain distributions. Interestingly, as demonstrated in [9], a balanced adversary $\pi$ inflicts the same regret on every learner: if $\pi \in B$, then $\exists V^\pi_T \in \mathbb{R} : \forall \bar p,\ V^T_{\bar p,\pi} = V^\pi_T$ (see Lemma 10). Therefore, given an adversary strategy $\pi$, we can define the value-to-go $V^\pi_{t_0}(s)$ associated with $\pi$ from time $t_0$ in state s,

$$V^\pi_{t_0}(s) = \mathbb{E}_{s_{T+1}}\big[\|s_{T+1}\|_\infty\big] - \sum_{t=t_0}^{T}\mathbb{E}_{s_t}\big[c_{\pi(s_t,t)}\big], \qquad s_{t+1} \sim p(\cdot|s_t, \pi(s_t,t)),\ s_{t_0} = s.$$

Another reduction comes from the fact that the set of balanced gain distributions can be seen as convex combinations of a finite set of balanced distributions [9, Claims 2 and 3]. We call this limited set the atomic gain distributions. Therefore the search for $\pi^\ast$ can be limited to this set. The set of convex combinations of the m distributions $\dot A_1, \dots, \dot A_m$ is denoted by $\Delta(\dot A_1, \dots, \dot A_m)$.

4 Reformulation as a Markovian Decision Problem

In this section we formulate, for arbitrary K, the maximization problem over balanced adversaries as an undiscounted MDP problem $\langle S, A, r, p\rangle$. The state space $S$ was defined in Section 2 and the action space is the set of atomic balanced distributions, as discussed in Section 3. The transition model is defined by $p(\cdot|s, \dot D)$, which is a probability distribution over states given the current state s and a balanced distribution over gains $\dot D$.
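Definition 1 is straightforward to check numerically. In the sketch below (our own illustration), a gain distribution is a list of (probability, gain-vector) pairs, and it is balanced when every arm has the same mean gain:

```python
def mean_gains(dist):
    """Per-arm expected gains E[g_k] of a distribution given as
    (probability, gain-vector) pairs."""
    K = len(dist[0][1])
    return [sum(p * g[k] for p, g in dist) for k in range(K)]

def is_balanced(dist, tol=1e-12):
    """Balanced iff all arms share the same mean gain (Definition 1)."""
    m = mean_gains(dist)
    return all(abs(mk - m[0]) <= tol for mk in m)
```

For example, the distribution {2}{13} of Section 5.1, written against a fixed sorted order, is [(0.5, [0, 1, 0]), (0.5, [1, 0, 1])]: every arm has mean gain 1/2, so it is balanced, whereas a deterministic gain to a single arm is not.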
In this model, the transition dynamics are deterministic and entirely controlled by the adversary's action choices. However, the adversary is forced to choose stochastic actions (balanced gain distributions). The maximization problem can therefore also be thought of as designing a balanced random walk on states so as to maximize a sum of rewards (that are yet to be defined). First, we define $P_{\dot A}$, the transition probability operator with respect to a gain distribution $\dot A$. Given a function $f : S \rightarrow \mathbb{R}$, $P_{\dot A}$ returns

$$[P_{\dot A}f](s) = \mathbb{E}[f(s') \mid s' \sim p(\cdot|s,\dot A)] = \mathbb{E}_{g\sim s,\dot A}[f(s+g)].$$

g is sampled in s according to $\dot A$. Given $\dot A$ in s, the per-step regret is denoted by $r_{\dot A}(s)$ and defined as

$$r_{\dot A}(s) = \mathbb{E}_{s'|s,\dot A}\|s'\|_\infty - \|s\|_\infty - c_{\dot A}.$$

Given an adversary strategy $\pi$, starting in s at time $t_0$, the cumulative per-step regret is $\bar V^\pi_{t_0}(s) = \sum_{t=t_0}^{T}\mathbb{E}\big[r_{\pi(\cdot,t)}(s_t) \mid s_{t+1}\sim p(\cdot|s_t,\pi(s_t,t)),\ s_{t_0}=s\big]$. The action-value function of $\pi$ at $(s, \dot D)$ and t is the expected sum of rewards received by starting from s, taking action $\dot D$, and then following $\pi$: $\bar Q^\pi_t(s_t, \dot D) = \mathbb{E}\big[\sum_{t'=t}^{T} r_{\dot A_{t'}}(s_{t'}) \mid \dot A_t = \dot D,\ s_{t'+1}\sim p(\cdot|s_{t'},\dot A_{t'}),\ \dot A_{t'+1} = \pi(s_{t'+1}, t'+1)\big]$. The Bellman operator of $\dot A$, $T_{\dot A}$, is $[T_{\dot A}f](s) = r_{\dot A}(s) + [P_{\dot A}f](s)$, with $[T_{\pi(s,t)}\bar V^\pi_{t+1}](s) = \bar V^\pi_t(s)$.

This per-step regret, $r_{\dot A}(s)$, depends on s and $\dot A$ and not on the time step t. Removing the time from the picture permits a simplified view of the problem that leads to a natural formulation of the exchangeability property that is independent of the time t.
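The per-step regret just defined is easy to evaluate numerically. The sketch below is our own illustration; the gain vectors are written against a fixed arm order, so they correspond to the paper's sorted-position notation only when matched to a sorted state:

```python
def per_step_regret(s, dist):
    """r_A(s) = E||s + g||_inf - ||s||_inf - c_A for a balanced
    distribution dist given as (probability, gain-vector) pairs."""
    c = sum(p * g[0] for p, g in dist)   # mean gain of arm 0; equals c_A
    exp_max = sum(p * max(sk + gk for sk, gk in zip(s, g)) for p, g in dist)
    return exp_max - max(s) - c
```

With the COMB-style distribution [(0.5, [0, 1, 0]), (0.5, [1, 0, 1])], a state with a unique leader such as (3, 1, 0) gives a per-step regret of 0, while the tied state (2, 2, 0) gives 1/2, in line with the properties shown in the next paragraphs.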
Crucially, this decomposition of the regret into per-step regrets is such that maximizing $\bar V^\pi_{t_0}(s)$ over adversaries $\pi$ is equivalent, for all times $t_0$ and states s, to maximizing over adversaries the original value of the game, the regret $V^\pi_{t_0}(s)$ (Lemma 2).

Lemma 2. For any adversary strategy $\pi$ and any state s and time $t_0$, $V^\pi_{t_0}(s) = \bar V^\pi_{t_0}(s) + \|s\|_\infty$.

The proof of Lemma 2 is in Section 8. In the following, our focus will be on maximizing $\bar V^\pi_t(s)$ in any state s. We now show some basic properties of the per-step regret that hold for an arbitrary number of experts K, and discuss their implications. The proofs are in Section 9.

Lemma 3. Let $\dot A \in \dot B$; for all s, t, we have $0 \le r_{\dot A}(s) \le 1$. Furthermore, if $|X(s)| = 1$, then $r_{\dot A}(s) = 0$.

Lemma 3 shows that a state s in which the reward is not zero contains at least two equal leading experts, $|X(s)| > 1$. Therefore the goal of maximizing the reward can be rephrased as finding a policy that visits the states with $|X(s)| > 1$ as often as possible, while still taking into account that the per-step reward increases with $|X(s)|$. The set of states with $|X(s)| > 1$ is called the 'reward wall'.

Lemma 4. In any state s with $|X(s)| = 2$, for any balanced gain distribution $\dot D$ such that with probability one exactly one of the leading experts receives a gain of 1, $r_{\dot D}(s) = \max_{\dot A\in\dot B} r_{\dot A}(s)$.

5 The Case of K = 3

5.1 Notations in the 3-Experts Case, the COMB and the TWIN-COMB Adversaries

First we define the state space in the 3-expert case. The experts are sorted with respect to their cumulative gains and are named, in decreasing order, the leading expert, the middle expert and the lagging expert.
As mentioned in [9], in our search for the minimax optimal adversary it is sufficient, for any K, to describe our state only using the $d_{ij}$ that denote the difference between the cumulative gains of consecutive sorted experts i and j = i+1. Here, i denotes the expert with the i-th largest cumulative gain, and hence $d_{ij} \ge 0$ for all i < j. Therefore one notation for a state, used throughout this section, is $s = (x, y) = (d_{12}, d_{23})$. We distinguish four types of states C1, C2, C3, C4, as detailed below in Figure 1. In the same figure, in the center, the states are represented on a 2d-grid. C4 contains only the state denoted $s_\alpha = (0,0)$.

s ∈ C1: $d_{12} > 0$, $d_{23} > 0$
s ∈ C2: $d_{12} = 0$, $d_{23} > 0$
s ∈ C3: $d_{12} > 0$, $d_{23} = 0$
s ∈ C4: $d_{12} = 0$, $d_{23} = 0$

Atomic $\dot A$    Symbol    $c_{\dot A}$
{1}{23}            $\dot W$   1/2
{2}{13}            $\dot C$   1/2
{3}{12}            $\dot V$   1/2
{1}{2}{3}          $\dot 1$   1/3
{12}{13}{23}       $\dot 2$   2/3

Figure 1: 4 types of states (left), their location on the 2d grid of states (center) and 5 atomic $\dot A$ (right)

Concerning the action space, the gain distributions use brackets. The groups of arms in the same bracket receive gains together, and each group receives gains with equal probability. For instance, {1}{2}{3} exclusively deals a gain to expert 1 (leading expert) with probability 1/3, expert 2 (middle expert) with probability 1/3, and expert 3 (lagging expert) with probability 1/3, whereas {1}{23} means dealing a gain to expert 1 alone with probability 1/2 and to experts 2 and 3 together with probability 1/2. As discussed in Section 3, we are searching for a $\pi^\ast$ using mixtures of atomic balanced distributions. When K = 3 there are seven atomic distributions, denoted by $\dot B_3 = \{\dot V, \dot 1, \dot 2, \dot C, \dot W, \{\}, \{123\}\}$ and described in Figure 1 (right). Moreover, in Figure 2, we report in detail, in a table (left) and an illustration (right) on the 2-D state grid, the properties of the COMB gain distribution $\dot C$:

s     $r_{\dot C}(s)$    Distribution of next state $s' \sim p(\cdot|s, \dot C)$ with $s = (x, y)$
C1    0                  $P(s' = (x-1, y+1)) = P(s' = (x+1, y-1)) = .5$
C2    1/2                $P(s' = (x+1, y)) = P(s' = (x+1, y-1)) = .5$
C3    0                  $P(s' = (x, y+1)) = P(s' = (x-1, y+1)) = .5$
C4    1/2                $P(s' = (x, y+1)) = P(s' = (x+1, y)) = .5$

Figure 2: The per-step regret and transition probabilities of the gain distribution $\dot C$

The remaining atomic distributions are similarly reported in the appendix, in Figures 5 to 8. In the case of three experts, the COMB distribution simply plays {2}{13} in any state. We use $\dot W$ to denote the distribution that plays {1}{23} in any state and refer to it as the TWIN-COMB distribution. The COMB and TWIN-COMB strategies (as opposed to the distributions) repeat their respective gain distributions in any state and at any time. They are respectively denoted $\pi_C$ and $\pi_W$. Lemma 5 shows that the COMB strategy $\pi_C$, the TWIN-COMB strategy $\pi_W$, and therefore any mixture of both, have the same expected cumulative per-step regret. The proof is reported in Section 11.

Lemma 5. For all states s at time t, we have $\bar V^{\pi_C}_t(s) = \bar V^{\pi_W}_t(s)$.

5.2 The Exchangeability Property

Lemma 6. Let $\dot A \in \dot B_3$; there exists $\dot D \in \Delta(\dot C, \dot W)$ such that for any $s \neq s_\alpha$ and for any $f : S \rightarrow \mathbb{R}$,

$$[T_{\dot A}[T_{\dot D}f]](s) = [T_{\dot W}[T_{\dot A}f]](s).$$

Proof. If $\dot A = \dot W$, $\dot A = \{\}$ or $\dot A = \{123\}$, use $\dot D = \dot W$. If $\dot A = \dot C$, use Lemmas 11 and 12.

Case 1.
$\dot A = \dot V$: $\dot V$ is equal to $\dot C$ in $C3 \cup C4$, and if $s' \sim p(\cdot|s, \dot W)$ with $s \in C3$ then $s' \in C3 \cup C4$. So when $s \in C3$ we reuse the case $\dot A = \dot C$ above. When $s \in C1 \cup C2$, we consider two cases.

Case 1.1. $s \neq (0,1)$: We choose $\dot D = \dot W$, which is {1}{23}. If $s' \sim p(\cdot|s, \dot V)$ with $s \in C2$ then $s' \in C2$. Similarly, if $s' \sim p(\cdot|s, \dot V)$ with $s \in C1$ then $s' \in C1 \cup C3$. Moreover, $\dot D$ modifies similarly the coordinates $(d_{12}, d_{23})$ of $s \in C1$ and $s \in C3$. Therefore the effect, in terms of transition probability and reward, of $\dot D$ is the same whether it is applied before or after the actions chosen by $\dot V$. If $s' \sim p(\cdot|s, \dot D)$ with $s \in C1 \cup C2$ then $s' \in C1 \cup C2$. Moreover, $\dot V$ modifies similarly the coordinates $(d_{12}, d_{23})$ of $s \in C1$ and $s \in C2$. Therefore the effect in terms of the transition probability of $\dot V$ is the same whether it is applied before or after the action $\dot D$. In terms of reward, notice that in the states $s \in C1 \cup C2$, $\dot V$ has 0 per-step regret, and using $\dot V$ does not make $s'$ leave or enter the reward wall.

Case 1.2. $s_t = (0,1)$: We can choose $\dot D = \dot W$. One can check from the tables in Figures 7 and 8 that exchangeability holds. Additionally, we provide an illustration of the exchangeability equality on the 2d-grid in Figure 1. The starting state $s = (0,1)$ is represented graphically by a marker. We show on the grid the effect of the gain distribution $\dot V$ (in dashed red) followed (left picture) or preceded (right picture) by the gain distribution $\dot D$ (in plain blue). The illustration shows that $\dot V\cdot\dot D$ and $\dot D\cdot\dot V$ lead to the same final states (marked on the grid) with equal probabilities.
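The identity proved in Case 1 can also be verified numerically. The following sketch (our construction, not the authors' code) models the K = 3 dynamics in the coordinates $s = (d_{12}, d_{23})$, builds the Bellman operator $T_{\dot A}$ from the per-step regret and transition probabilities, and checks $[T_{\dot V}[T_{\dot W}f]](s) = [T_{\dot W}[T_{\dot V}f]](s)$ on a grid of states $s \neq s_\alpha$ for an arbitrary function f:

```python
import math

# A state is (d12, d23); the sorted cumulative gains are (d12 + d23, d23, 0).
# A gain distribution is a list of (probability, group) pairs, where `group`
# is the set of sorted positions in {1, 2, 3} that receive a gain together.
W = [(0.5, {1}), (0.5, {2, 3})]   # TWIN-COMB, {1}{23}
V = [(0.5, {3}), (0.5, {1, 2})]   # the distribution V, {3}{12}

def step(s, group):
    """Deal one gain to the positions in `group`; return the next state
    (re-sorted differences) and the increase of the maximum."""
    x, y = s
    v = [x + y, y, 0]
    w = sorted((vi + (i + 1 in group) for i, vi in enumerate(v)), reverse=True)
    return (w[0] - w[1], w[1] - w[2]), w[0] - v[0]

def bellman(dist, f):
    """The Bellman operator: s -> r_A(s) + E[f(s')], with the per-step
    regret r_A(s) = E[increase of max] - c_A and c_A = E[g_k] for any k."""
    c = sum(p * len(g) for p, g in dist) / 3.0
    def Tf(s):
        total = -c
        for p, g in dist:
            s2, dmax = step(s, g)
            total += p * (dmax + f(s2))
        return total
    return Tf

f = lambda s: math.sin(3 * s[0] + 7 * s[1])       # arbitrary test function
lhs = bellman(V, bellman(W, f))                   # T_V T_W f
rhs = bellman(W, bellman(V, f))                   # T_W T_V f
for x in range(6):
    for y in range(6):
        if (x, y) != (0, 0):                      # s_alpha is excluded
            assert abs(lhs((x, y)) - rhs((x, y))) < 1e-12
```

The equality fails exactly at $s_\alpha = (0,0)$, consistent with its exclusion from Lemma 6.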
The rewards are displayed on top of the pictures. Their color corresponds to the actions, the probabilities are in italic, and the rewards are in roman.

Case 2 & 3. $\dot A = \dot 1$ & $\dot A = \dot 2$: The proof is similar and is reported in Section 12 of the appendix.

5.3 Approximate Greediness of COMB, Minimax Players and Regret

The greedy error of the gain distribution $\dot D$ in state s at time t is

$$\epsilon^{\dot D}_{s,t} = \max_{\dot A\in\dot B_3}\bar Q^{\pi_C}_t(s, \dot A) - \bar Q^{\pi_C}_t(s, \dot D).$$

Let $\epsilon^{\dot D}_t = \max_{s\in S}\epsilon^{\dot D}_{s,t}$ denote the maximum greedy error of the gain distribution $\dot D$ at time t. The COMB greedy error in $s_\alpha$ is controlled by the following lemma, proved in Section 13.1. Missing proofs from this section are in the appendix, in Section 13.2.

Lemma 7. For any $t \in [T]$ and gain distribution $\dot D \in \{\dot W, \dot C, \dot V, \dot 1\}$, $\epsilon^{\dot D}_{s_\alpha,t} \le \frac{1}{6}\left(\frac{1}{2}\right)^{T-t}$.

The following proposition shows how we can index the states in the 2d-grid as a one-dimensional line over which the TWIN-COMB strategy behaves very similarly to a simple random walk. Figure 3 (top) illustrates this random walk on the 2d-grid and the indexing scheme (the yellow stickers).

Proposition 1. Index a state $s = (x,y)$ by $i_s = x + 2y$, irrespective of the time. Then for any state $s \neq s_\alpha$ and $s' \sim p(\cdot|s, \dot W)$ we have that $P(i_{s'} = i_s - 1) = P(i_{s'} = i_s + 1) = \frac{1}{2}$.

Consider a random walk that starts from state $s_0 = s$ and is generated by the TWIN-COMB strategy, $s_{t+1} \sim p(\cdot|s_t, \dot W)$. Define the random variable $T_{\alpha,s} = \min\{t \in \mathbb{N}\cup\{0\} : s_t = s_\alpha\}$. This random variable is the number of steps of the random walk before hitting $s_\alpha$ for the first time.
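Both Proposition 1 and the reduction to hitting times can be checked with a short script (ours, for illustration). It enumerates the TWIN-COMB transitions to confirm that the index $i_s = x + 2y$ performs $\pm 1$ steps, and computes $P(T_{\alpha,s} = t)$ by dynamic programming on the resulting simple random walk:

```python
from fractions import Fraction
from functools import lru_cache

def twin_comb_next(s):
    """The two equally likely successors of s = (d12, d23) under {1}{23}."""
    successors = []
    for group in ({1}, {2, 3}):
        x, y = s
        v = [x + y, y, 0]
        w = sorted((vi + (i + 1 in group) for i, vi in enumerate(v)),
                   reverse=True)
        successors.append((w[0] - w[1], w[1] - w[2]))
    return successors

# Proposition 1: away from s_alpha, the index i_s = x + 2y moves to
# i_s - 1 and i_s + 1, each with probability 1/2.
for x in range(8):
    for y in range(8):
        if (x, y) == (0, 0):
            continue
        idx = sorted(a + 2 * b for a, b in twin_comb_next((x, y)))
        assert idx == [x + 2 * y - 1, x + 2 * y + 1]

@lru_cache(maxsize=None)
def first_passage(i, t):
    """P(first hit of 0 after exactly t steps) for a simple random walk
    started at index i >= 0; 0 is absorbing."""
    if i == 0:
        return Fraction(1) if t == 0 else Fraction(0)
    if t == 0:
        return Fraction(0)
    return (first_passage(i - 1, t - 1) + first_passage(i + 1, t - 1)) / 2
```

first_passage(i_s, t) agrees with the classical gambler's-ruin formula $\frac{i_s}{t}\binom{t}{(t+i_s)/2}2^{-t}$ used below in the proof of Lemma 9.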
Then, let $P_\alpha(s,t)$ be the probability that $s_\alpha$ is reached after t steps: $P_\alpha(s,t) = P(T_{\alpha,s} = t)$. Lemma 8 controls the COMB greedy error in $s_t$ in relation to $P_\alpha(s,t)$. Lemma 9 derives a state-independent upper bound for $P_\alpha(s,t)$.

Lemma 8. For any time $t \in [T]$ and state s,

$$\epsilon^{\dot C}_{s,t} \le \sum_{t'=t}^{T}\frac{1}{6}\left(\frac{1}{2}\right)^{T-t'} P_\alpha(s, t'-t).$$

Figure 3: Numbering TWIN-COMB (top) & $\pi_G$ random walks (bottom)

Proof. If $s = s_\alpha$, this is a direct application of Lemma 7, as $P_\alpha(s_\alpha, t') = 0$ for $t' > 0$. When $s \neq s_\alpha$, the following proof is by induction.

Initialization: Let t = T. At the last round only the last per-step regret matters (for all states s, $\bar Q^{\pi_C}_T(s, \dot D) = r_{\dot D}(s)$). As $s \neq s_\alpha$, s is such that $|X(s)| \le 2$, and then $r_{\dot C}(s) = \max_{\dot A\in\dot B} r_{\dot A}(s)$ because of Lemmas 3 and 4. Therefore the statement holds.

Induction: Let t < T. We assume the statement is true at time t + 1.
We distinguish two cases. For all gain distributions $\dot{D} \in \dot{B}_3$,
$$\begin{aligned}
\bar{Q}^{\pi_C}_t(s, \dot{D})
&\overset{(a)}{=} [T_{\dot{D}}[T_{\dot{E}} \bar{V}^{\pi_C}_{t+2}]](s)
\overset{(b)}{=} [T_{\dot{W}}[T_{\dot{D}} \bar{V}^{\pi_C}_{t+2}]](s)
= [T_{\dot{W}} \bar{Q}^{\pi_C}_{t+1}(\cdot, \dot{D})](s) \\
&\overset{(c)}{\geq} \Big[T_{\dot{W}} \max_{\dot{A} \in \dot{B}_3} \bar{Q}^{\pi_C}_{t+1}(\cdot, \dot{A})\Big](s) - \sum_{t_1=t+1}^{T} \frac{1}{6}\Big(\frac{1}{2}\Big)^{T-t_1} [P_{\dot{W}} P_\alpha(\cdot, t_1 - t - 1)](s) \\
&\overset{(d)}{\geq} \max_{\dot{A} \in \dot{B}_3} [T_{\dot{W}} \bar{Q}^{\pi_C}_{t+1}(\cdot, \dot{A})](s) - \sum_{t_1=t+1}^{T} \frac{1}{6}\Big(\frac{1}{2}\Big)^{T-t_1} [P_{\dot{W}} P_\alpha(\cdot, t_1 - t - 1)](s) \\
&\overset{(b)}{=} \max_{\dot{A} \in \dot{B}_3} \bar{Q}^{\pi_C}_t(s, \dot{A}) - \sum_{t_1=t+1}^{T} \frac{1}{6}\Big(\frac{1}{2}\Big)^{T-t_1} [P_{\dot{W}} P_\alpha(\cdot, t_1 - t - 1)](s) \\
&\overset{(e)}{=} \max_{\dot{A} \in \dot{B}_3} \bar{Q}^{\pi_C}_t(s, \dot{A}) - \sum_{t_1=t}^{T} \frac{1}{6}\Big(\frac{1}{2}\Big)^{T-t_1} P_\alpha(s, t_1 - t),
\end{aligned}$$
where in (a) $\dot{E}$ is any distribution in $\Delta(\dot{C}, \dot{W})$ and this step holds because of Lemma 5; (b) holds because of the exchangeability property of Lemma 6; (c) is true by induction and by monotonicity of the Bellman operator; in (d) the max operators change from being specific to each next state $s'$ at time $t + 1$ to a single max operator that has to choose one optimal gain distribution in state $s$ at time $t$; and (e) holds by definition since for any $t_2$ (the last equality holding because $s \neq s_\alpha$),
$$[P_{\dot{W}} P_\alpha(\cdot, t_2)](s) = \mathbb{E}_{s' \sim p(\cdot|s, \dot{W})}[P_\alpha(s', t_2)] = \mathbb{E}_{s' \sim p(\cdot|s, \dot{W})}[P(T_{\alpha,s'} = t_2)] = P_\alpha(s, t_2 + 1).$$

Lemma 9. For $t > 0$ and any $s$,
$$P_\alpha(s, t) \leq \frac{2}{t}\sqrt{\frac{2}{\pi}}.$$

Proof.
Using the connection between the TWIN-COMB strategy and a simple random walk from Proposition 1, a formula for $P_\alpha(s, t)$ follows from the classical "Gambler's ruin" problem, where one wants the probability that the gambler reaches ruin (here, state $s_\alpha$) at time $t$ given an initial capital in dollars (here $i_s$, as defined in Proposition 1). The gambler has an equal probability to win or lose one dollar at each round and has no upper bound on his capital during the game. Using [7] (Chapter XIV, Equation 4.14) or [18], we have $P_\alpha(s, t) = \frac{i_s}{t}\binom{t}{(t+i_s)/2} 2^{-t}$, where the binomial coefficient is 0 if $t$ and $i_s$ are not of the same parity. The technical Lemma 14 completes the proof.

We now state our main result, connecting the value of the COMB adversary to the value of the game.

Theorem 1. Let $K = 3$. The regret of the COMB strategy against any learner $\bar{p}$, $\min_{\bar{p}} V^T_{\bar{p},\pi_C}$, satisfies
$$\min_{\bar{p}} V^T_{\bar{p},\pi_C} \geq V_T - 12 \log^2(T + 1).$$

We also characterize the minimax regret of the game.

Theorem 2. Let $K = 3$. For even $T$, we have
$$\left|V_T - \binom{T+2}{T/2+1}\frac{T/2+1}{3 \cdot 2^T}\right| \leq 12 \log^2(T + 1), \quad \text{with} \quad \binom{T+2}{T/2+1}\frac{T/2+1}{3 \cdot 2^T} \sim \sqrt{\frac{8T}{9\pi}}.$$

In Figure 4 we introduce a COMB-based learner, denoted $\bar{p}_C$. Here a state is represented by a vector of 3 integers. The three arms/experts are ordered as (1), (2), (3), breaking ties arbitrarily. We connect the value of the COMB-based learner to the value of the game.

Theorem 3.
Let $K = 3$. The regret of the COMB-based learner against any adversary $\pi$, $\max_\pi V^T_{\bar{p}_C,\pi}$, satisfies
$$\max_\pi V^T_{\bar{p}_C,\pi} \leq V_T + 36 \log^2(T + 1).$$

Figure 4: A COMB learner, $\bar{p}_C$:
$$p_{t,(1)}(s) = V^{\pi_C}_{t+1}(s + e_{(1)}) - V^{\pi_C}_t(s), \qquad p_{t,(2)}(s) = V^{\pi_C}_{t+1}(s + e_{(2)}) - V^{\pi_C}_t(s), \qquad p_{t,(3)}(s) = 1 - p_{t,(1)}(s) - p_{t,(2)}(s).$$

Similarly to [2] and [14], this strategy can be efficiently computed using rollouts/simulations from the COMB adversary in order to estimate the value $V^{\pi_C}_t(s)$ of $\pi_C$ in $s$ at time $t$.

6 Discussion and Future Work

The main objective is to generalize our new proof techniques to higher dimensions. In our case, the MDP formulation and all the results in Section 4 already hold for general $K$. Interestingly, Lemmas 3 and 4 show that the COMB distribution is the balanced distribution with the highest per-step regret in all states $s$ such that $|X(s)| \leq 2$, for arbitrary $K$. Then, assuming an ideal exchangeability property that gives $\max_{\dot{A} \in \dot{B}} \dot{A}\dot{C}\cdots\dot{C} = \max_{\dot{A} \in \dot{B}} \dot{C}\dot{C}\cdots\dot{C}\dot{A}$, a distribution would be greedy w.r.t. the COMB strategy at an early round of the game if it maximizes the per-step regret at the last round of the game. The COMB policy tends to visit almost exclusively states with $|X(s)| \leq 2$, precisely the states where COMB itself is the maximizer of the per-step regret (Lemma 3). This would give that COMB is greedy w.r.t. itself and therefore optimal. To obtain this result for larger $K$, we will need to extend the exchangeability property to higher $K$ and therefore understand how the COMB and TWIN-COMB families extend to higher dimensions.
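Before attacking higher $K$, the closed form of Theorem 2 can be checked numerically against its asymptotic equivalent $\sqrt{8T/(9\pi)}$. The following snippet is our own sketch (not from the paper); it evaluates both quantities for even horizons $T$:

```python
import math

def exact_regret_term(T):
    """Closed-form term of Theorem 2 for even T:
    C(T+2, T/2+1) * (T/2 + 1) / (3 * 2**T)."""
    assert T % 2 == 0, "the closed form is stated for even T"
    return math.comb(T + 2, T // 2 + 1) * (T // 2 + 1) / (3 * 2 ** T)

def asymptotic_regret(T):
    """Asymptotic equivalent sqrt(8*T / (9*pi))."""
    return math.sqrt(8 * T / (9 * math.pi))
```

The ratio of the two quantities tends to 1 as $T$ grows; for $T = 100$ it is already within 1%.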
One could also borrow ideas from the link with PDE approaches made in [6].

Acknowledgements

We gratefully acknowledge the support of the NSF through grant IIS-1619362 and of the Australian Research Council through an Australian Laureate Fellowship (FL110100281) and through the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS). We would like to thank Nate Eldredge for pointing us to the results in [18] and Wouter Koolen for pointing us to [19].

References

[1] Jacob Abernethy and Manfred K. Warmuth. Repeated games against budgeted adversaries. In Advances in Neural Information Processing Systems (NIPS), pages 1-9, 2010.

[2] Jacob Abernethy, Manfred K. Warmuth, and Joel Yellin. Optimal strategies from random walks. In 21st Annual Conference on Learning Theory (COLT), pages 437-446, 2008.

[3] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM (JACM), 44(3):427-485, 1997.

[4] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[5] Thomas M. Cover. Behavior of sequential predictors of binary sequences. In 4th Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, pages 263-272, 1965.

[6] Nadeja Drenska. A PDE approach to mixed strategies prediction with expert advice. http://www.gtcenter.org/Downloads/Conf/Drenska2708.pdf. (Extended abstract).

[7] William Feller. An Introduction to Probability Theory and its Applications, volume 2. John Wiley & Sons, 2008.

[8] Nick Gravin, Yuval Peres, and Balasubramanian Sivan. Towards optimal algorithms for prediction with expert advice. arXiv preprint arXiv:1603.04981, 2016.

[9] Nick Gravin, Yuval Peres, and Balasubramanian Sivan.
Towards optimal algorithms for prediction with expert advice. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 528-547, 2016.

[10] Nick Gravin, Yuval Peres, and Balasubramanian Sivan. Tight lower bounds for multiplicative weights algorithmic families. In 44th International Colloquium on Automata, Languages, and Programming (ICALP), volume 80, pages 48:1-48:14, 2017.

[11] Charles Miller Grinstead and James Laurie Snell. Introduction to Probability. American Mathematical Society, 2012.

[12] James Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97-139, 1957.

[13] Ronald A. Howard. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA, 1960.

[14] Haipeng Luo and Robert E. Schapire. Towards minimax online learning with unknown time horizon. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 226-234, 2014.

[15] Francesco Orabona and Dávid Pál. Optimal non-asymptotic lower bound on the minimax regret of learning with expert advice. arXiv preprint arXiv:1511.02176, 2015.

[16] Martin L. Puterman. Markov Decision Processes. Wiley, New York, 1994.

[17] Pantelimon Stanica. Good lower and upper bounds on binomial coefficients. Journal of Inequalities in Pure and Applied Mathematics, 2(3):30, 2001.

[18] Remco van der Hofstad and Michael Keane. An elementary proof of the hitting time theorem. The American Mathematical Monthly, 115(8):753-756, 2008.

[19] Vladimir Vovk. A game of prediction with expert advice.
Journal of Computer and System Sciences (JCSS), 56(2):153-173, 1998.