{"title": "Online Optimization in X-Armed Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 201, "page_last": 208, "abstract": "We consider a generalization of stochastic bandit problems where the set of arms, X, is allowed to be a generic topological space. We constrain the mean-payoff function with a dissimilarity function over X in a way that is more general than Lipschitz. We construct an arm selection policy whose regret improves upon previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally H\u00f6lder with a known exponent, then the expected regret is bounded up to a logarithmic factor by $\\sqrt{n}$, i.e., the rate of growth of the regret is independent of the dimension of the space. Moreover, we prove the minimax optimality of our algorithm for the class of mean-payoff functions we consider.", "full_text": "Online Optimization in X-Armed Bandits\n\nS\u00e9bastien Bubeck\n\nINRIA Lille, SequeL project, France\n\nR\u00e9mi Munos\n\nINRIA Lille, SequeL project, France\n\nsebastien.bubeck@inria.fr\n\nremi.munos@inria.fr\n\nGilles Stoltz\n\n\u00c9cole Normale Sup\u00e9rieure and HEC Paris\n\ngilles.stoltz@ens.fr\n\nCsaba Szepesv\u00e1ri\n\nDepartment of Computing Science, University of Alberta\n\nszepesva@cs.ualberta.ca \u2217\n\nAbstract\n\nWe consider a generalization of stochastic bandit problems where the set of arms, X , is\nallowed to be a generic topological space. We constrain the mean-payoff function with a\ndissimilarity function over X in a way that is more general than Lipschitz. We construct\nan arm selection policy whose regret improves upon previous results for a large class of\nproblems.\n
In particular, our results imply that if X is the unit hypercube in a Euclidean\nspace and the mean-payoff function has a finite number of global maxima around which\nthe behavior of the function is locally H\u00f6lder with a known exponent, then the expected\nregret is bounded up to a logarithmic factor by \u221an, i.e., the rate of growth of the regret\nis independent of the dimension of the space. Moreover, we prove the minimax optimality\nof our algorithm for the class of mean-payoff functions we consider.\n\n1 Introduction and motivation\nBandit problems arise in many settings, including clinical trials, scheduling, on-line parameter tuning of\nalgorithms or optimization of controllers based on simulations. In the classical bandit problem there is a\nfinite number of arms that the decision maker can select at discrete time steps. Selecting an arm results in\na random reward, whose distribution is determined by the identity of the arm selected. The distributions\nassociated with the arms are unknown to the decision maker, whose goal is to maximize the expected sum of\nthe rewards received.\nIn many practical situations the arms belong to a large set. This set could be continuous [1; 6; 3; 2; 7],\nhybrid-continuous, or it could be the space of infinite sequences over a finite alphabet [4]. In this paper we\nconsider stochastic bandit problems where the set of arms, X , is allowed to be an arbitrary topological space.\nWe assume that the decision maker knows a dissimilarity function defined over this space that constrains\nthe shape of the mean-payoff function. In particular, the dissimilarity function is assumed to bound the\nmean-payoff function from below around each maximum.\n
We also assume that the decision maker is\nable to cover the space of arms in a recursive manner, successively refining the regions in the covering such\nthat the diameters of these sets shrink at a known geometric rate when measured with the dissimilarity.\n\n\u2217Csaba Szepesv\u00e1ri is on leave from MTA SZTAKI. He also gratefully acknowledges the support received from the\nAlberta Ingenuity Fund, iCore and NSERC.\n\nOur work generalizes and improves on previous works on continuum-armed bandit problems: Kleinberg [6]\nand Auer et al. [2] focussed on one-dimensional problems. Recently, Kleinberg et al. [7] considered generic\nmetric spaces assuming that the mean-payoff function is Lipschitz with respect to the (known) metric of the\nspace. They proposed an interesting algorithm that achieves essentially the best possible regret in a minimax\nsense with respect to these environments.\nThe goal of this paper is to further these works in a number of ways: (i) we allow the set of arms to be\na generic topological space; (ii) we propose a practical algorithm motivated by the recent very successful\ntree-based optimization algorithms [8; 5; 4] and show that the algorithm is (iii) able to exploit higher order\nsmoothness. In particular, as we shall argue in Section 7, (i) improves upon the results of Auer et al. [2],\nwhile (i), (ii) and (iii) improve upon the work of Kleinberg et al. [7]. Compared to Kleinberg et al. [7], our\nwork represents an improvement in that, just like Auer et al. [2], we make use of the local properties\nof the mean-payoff function around the maxima only, and not a global property, such as Lipschitzness in\nthe whole space. This allows us to obtain a regret which scales as \u00d5(\u221an)\u00b9 when, e.g.,\n
the space is the unit\nhypercube and the mean-payoff function is locally H\u00f6lder with known exponent in the neighborhood of each\nmaximum (of which there are finitely many) and bounded away from the maxima outside of these neighborhoods.\nThus, we get the desirable property that the rate of growth of the regret is independent of the dimensionality\nof the input space. We also prove a minimax lower bound that matches our upper bound up to logarithmic\nfactors, showing that the performance of our algorithm is essentially unimprovable in a minimax sense.\nBesides these theoretical advances, the algorithm is anytime and easy to implement. Since it is based on\nideas that have proved to be efficient, we expect it to perform well in practice and to make a significant\nimpact on how on-line global optimization is performed.\n2 Problem setup, notation\nWe consider a topological space X , whose elements will be referred to as arms. A decision maker \u201cpulls\u201d\nthe arms in X one at a time at discrete time steps. Each pull results in a reward that depends on the arm\nchosen and which the decision maker learns of. The goal of the decision maker is to choose the arms so\nas to maximize the sum of the rewards that he receives. In this paper we are concerned with stochastic\nenvironments. Such an environment M associates to each arm x \u2208 X a distribution Mx on the real line.\nThe support of these distributions is assumed to be uniformly bounded with a known bound. For the sake\nof simplicity, we assume this bound is 1. We denote by f(x) the expectation of Mx, which is assumed\nto be measurable (all measurability concepts are with respect to the Borel \u03c3-algebra over X ). The function\nf : X \u2192 R thus defined is called the mean-payoff function.\n
When in round n the decision maker pulls arm\nXn \u2208 X , he receives a reward Yn drawn from MXn, independently of the past arm choices and rewards.\nA pulling strategy of a decision maker is determined by a sequence \u03d5 = (\u03d5n)n\u22651 of measurable mappings,\nwhere each \u03d5n maps the history space $H_n = (X \\times [0,1])^{n-1}$ to the space of probability measures over X .\nBy convention, \u03d51 does not take any argument. A strategy is deterministic if for every n the range of \u03d5n\ncontains only Dirac distributions.\nAccording to the process that was already informally described, a pulling strategy \u03d5 and an environment M\njointly determine a random process (X1, Y1, X2, Y2, . . .) in the following way: In round one, the decision\nmaker draws an arm X1 at random from \u03d51 and gets a payoff Y1 drawn from MX1. In round n \u2265 2, first,\nXn is drawn at random according to \u03d5n(X1, Y1, . . . , Xn\u22121, Yn\u22121), but otherwise independently of the past.\nThen the decision maker gets a reward Yn drawn from MXn, independently of all other random variables\nin the past given Xn.\nLet $f^* = \\sup_{x \\in X} f(x)$ be the maximal expected payoff. The cumulative regret of a pulling strategy in\nenvironment M is $\\widehat{R}_n = n f^* - \\sum_{t=1}^{n} Y_t$, and the cumulative pseudo-regret is $R_n = n f^* - \\sum_{t=1}^{n} f(X_t)$.\nIn the sequel, we restrict our attention to the expected regret $E[R_n]$, which in fact equals $E[\\widehat{R}_n]$, as can be\nseen by the application of the tower rule.\n\n\u00b9 We write $u_n = \\widetilde{O}(v_n)$ when $u_n = O(v_n)$ up to a logarithmic factor.\n\n3 The Hierarchical Optimistic Optimization (HOO) strategy\n3.1 Trees of coverings\n\nWe first introduce the notion of a tree of coverings. Our algorithm will require such a tree as an input.\nDefinition 1 (Tree of coverings).\n
A tree of coverings is a family of measurable subsets $(P_{h,i})_{1 \\le i \\le 2^h,\\, h \\ge 0}$ of\nX such that for every fixed integer h \u2265 0, the covering $\\cup_{1 \\le i \\le 2^h} P_{h,i} = X$ holds. Moreover, the elements of the\ncovering are obtained recursively: each subset Ph,i is covered by the two subsets Ph+1,2i\u22121 and Ph+1,2i.\nA tree of coverings can be represented, as the name suggests, by a binary tree T . The whole domain\nX = P0,1 corresponds to the root of the tree and Ph,i corresponds to the i\u2013th node of depth h, which will\nbe referred to as node (h, i) in the sequel. The fact that each Ph,i is covered by the two subsets Ph+1,2i\u22121\nand Ph+1,2i corresponds to the parent-child relationship in the tree. Although the definition allows the child-regions\nof a node to cover a larger part of the space, typically the size of the regions shrinks as depth h\nincreases (cf. Assumption 1).\nRemark 1. Our algorithm will instantiate the nodes of the tree on an \u201cas needed\u201d basis, one by one. In\nfact, at any round n it will only need n nodes connected to the root.\n\n3.2 Statement of the HOO strategy\nThe algorithm picks at each round a node in the infinite tree T as follows. In the first round, it chooses the\nroot node (0, 1). Now, consider round n + 1 with n \u2265 1. Let us denote by Tn the set of nodes that have\nbeen picked in previous rounds and by Sn the nodes which are not in Tn but whose parent is. The algorithm\npicks at round n + 1 a node (Hn+1, In+1) \u2208 Sn according to the deterministic rule that will be described\nbelow. After selecting the node, the algorithm further chooses an arm $X_{n+1} \\in P_{H_{n+1}, I_{n+1}}$. This selection\ncan be stochastic or deterministic. We do not put any further restriction on it. The algorithm then gets a\nreward Yn+1 as described above and the procedure goes on: (Hn+1, In+1) is added to Tn to form Tn+1 and\nthe children of (Hn+1, In+1) are added to Sn to give rise to Sn+1.\n
Let us now turn to how (Hn+1, In+1) is\nselected.\nAlong with the nodes the algorithm stores what we call B\u2013values. The node (Hn+1, In+1) \u2208 Sn to expand\nat round n + 1 is picked by following a path from the root to a node in Sn, where at each node along the\npath the child with the larger B\u2013value is selected (ties are broken arbitrarily). In order to define a node\u2019s\nB\u2013value, we need a few quantities. Let C(h, i) be the set that collects (h, i) and its descendants. We let\n\n$$N_{h,i}(n) = \\sum_{t=1}^{n} \\mathbb{I}_{\\{(H_t, I_t) \\in C(h,i)\\}}$$\n\nbe the number of times the node (h, i) was visited. A given node (h, i) is always picked at most once, but\nsince its descendants may be picked afterwards, subsequent paths in the tree can go through it. Consequently,\n1 \u2264 Nh,i(n) \u2264 n for all nodes (h, i) \u2208 Tn. Let $\\widehat{\\mu}_{h,i}(n)$ be the empirical average of the rewards received for\nthe time-points when the path followed by the algorithm went through (h, i):\n\n$$\\widehat{\\mu}_{h,i}(n) = \\frac{1}{N_{h,i}(n)} \\sum_{t=1}^{n} Y_t \\, \\mathbb{I}_{\\{(H_t, I_t) \\in C(h,i)\\}}.$$\n\nThe corresponding upper confidence bound is by definition\n\n$$U_{h,i}(n) = \\widehat{\\mu}_{h,i}(n) + \\sqrt{\\frac{2 \\ln n}{N_{h,i}(n)}} + \\nu_1 \\rho^h,$$\n\nwhere 0 < \u03c1 < 1 and \u03bd1 > 0 are parameters of the algorithm (to be chosen later by the decision maker, see\nAssumption 1). For nodes not in Tn, by convention, Uh,i(n) = +\u221e. Now, for a node (h, i) in Sn, we define\nits B\u2013value to be Bh,i(n) = +\u221e.\n
The B\u2013values for nodes in Tn are given by\n\n$$B_{h,i}(n) = \\min\\bigl\\{ U_{h,i}(n),\\, \\max\\{ B_{h+1,2i-1}(n),\\, B_{h+1,2i}(n) \\} \\bigr\\}.$$\n\nNote that the algorithm is deterministic (apart, maybe, from the arbitrary random choice of Xt in $P_{H_t, I_t}$).\nIts total space requirement is linear in n while total running time at round n is at most quadratic in n, though\nwe conjecture that it is O(n log n) on average.\n\n4 Assumptions made on the model and statement of the main result\nWe suppose that X is equipped with a dissimilarity \u2113, that is, a non-negative mapping \u2113 : X^2 \u2192 R\nsatisfying \u2113(x, x) = 0. The diameter (with respect to \u2113) of a subset A of X is given by $\\mathrm{diam}\\, A = \\sup_{x,y \\in A} \\ell(x, y)$. Given the dissimilarity \u2113, the \u201copen\u201d ball with radius \u03b5 > 0 and center c \u2208 X is\nB(c, \u03b5) = { x \u2208 X : \u2113(c, x) < \u03b5} (we do not require the topology induced by \u2113 to be related to the\ntopology of X ). In what follows when we refer to an (open) ball, we refer to the ball defined with respect to \u2113.\nThe dissimilarity will be used to capture the smoothness of the mean-payoff function. The decision maker\nchooses \u2113 and the tree of coverings. The following assumption relates this choice to the parameters \u03c1 and \u03bd1\nof the algorithm:\nAssumption 1. There exist \u03c1 < 1 and \u03bd1, \u03bd2 > 0 such that for all integers h \u2265 0 and all i = 1, . . . , 2^h, the\ndiameter of Ph,i is bounded by \u03bd1\u03c1^h, and Ph,i contains an open ball $P'_{h,i}$ of radius \u03bd2\u03c1^h. For a given h, the\n$P'_{h,i}$ are disjoint for 1 \u2264 i \u2264 2^h.\nRemark 2.\n
A typical choice for the coverings in a cubic domain is to let the domains be hyper-rectangles.\nThey can be obtained, e.g., in a dyadic manner, by splitting at each step hyper-rectangles in the middle along\ntheir longest side, in an axis parallel manner; if all sides are equal, we split them along the first axis. In\nthis example, if $X = [0,1]^D$ and $\\ell(x, y) = \\|x - y\\|^{\\alpha}$ then we can take $\\rho = 2^{-\\alpha/D}$, $\\nu_1 = (\\sqrt{D}/2)^{\\alpha}$ and\n$\\nu_2 = 1/8^{\\alpha}$.\n\nThe next assumption concerns the environment.\nDefinition 2. We say that f is weakly Lipschitz with respect to \u2113 if for all x, y \u2208 X ,\n\n$$f^* - f(y) \\le f^* - f(x) + \\max\\{ f^* - f(x),\\, \\ell(x, y) \\}. \\qquad (1)$$\n\nNote that weak Lipschitzness is satisfied whenever f is 1\u2013Lipschitz, i.e., for all x, y \u2208 X , one has |f(x) \u2212\nf(y)| \u2264 \u2113(x, y). On the other hand, weak Lipschitzness implies local (one-sided) 1\u2013Lipschitzness at any\nmaximum. Indeed, at an optimal arm x\u2217 (i.e., such that f(x\u2217) = f\u2217), (1) reduces to f(x\u2217) \u2212 f(y) \u2264\n\u2113(x\u2217, y). However, weak Lipschitzness does not constrain the growth of the loss in the vicinity of other\npoints. Further, weak Lipschitzness, unlike Lipschitzness, does not constrain the local decrease of the loss\nat any point. Thus, weak Lipschitzness is a property that lies somewhere between a growth condition on\nthe loss around optimal arms and (one-sided) Lipschitzness. Note that since weak Lipschitzness is defined\nwith respect to a dissimilarity, it can actually capture higher-order smoothness at the optima. For example,\nf(x) = 1 \u2212 x^2 is weakly Lipschitz with the dissimilarity \u2113(x, y) = c(x \u2212 y)^2 for some appropriate constant c.\nAssumption 2. The mean-payoff function f is weakly Lipschitz.\nLet $f^*_{h,i} = \\sup_{x \\in P_{h,i}} f(x)$ and let $\\Delta_{h,i} = f^* - f^*_{h,i}$ be the suboptimality of node (h, i). We say that\na node (h, i) is optimal (respectively, suboptimal) if \u2206h,i = 0 (respectively, \u2206h,i > 0). Let $X_{\\varepsilon} \\stackrel{\\mathrm{def}}{=} \\{ x \\in X : f(x) \\ge f^* - \\varepsilon \\}$ be the set of \u03b5-optimal arms. The following result follows from the definitions;\na proof can be found in the appendix.\n\nLemma 1. Let Assumptions 1 and 2 hold. If the suboptimality \u2206h,i of a region is bounded by c\u03bd1\u03c1^h for some\nc > 0, then all arms in Ph,i are max{2c, c + 1}\u03bd1\u03c1^h-optimal.\nThe last assumption is closely related to Assumption 2 of Auer et al. [2], who observed that the regret of\na continuum-armed bandit algorithm should depend on how fast the volume of the sets of \u03b5-optimal arms\nshrinks as \u03b5 \u2192 0. Here, we capture this by defining a new notion, the near-optimality dimension of the\nmean-payoff function. The connection between these concepts, as well as the zooming dimension defined\nby Kleinberg et al. [7], will be further discussed in Section 7.\nDefine the packing number P(X , \u2113, \u03b5) to be the size of the largest packing of X with disjoint open balls of\nradius \u03b5 with respect to the dissimilarity \u2113.\u00b2 We now define the near-optimality dimension, which characterizes\nthe size of the sets X\u03b5 in terms of \u03b5, and then state our main result.\nDefinition 3. For c > 0 and \u03b50 > 0, the (c, \u03b50)\u2013near-optimality dimension of f with respect to \u2113 equals\n\n$$\\inf\\bigl\\{ d \\in [0, +\\infty) : \\exists\\, C \\text{ s.t. } \\forall \\varepsilon \\le \\varepsilon_0,\\; P(X_{c\\varepsilon}, \\ell, \\varepsilon) \\le C \\varepsilon^{-d} \\bigr\\} \\qquad (2)$$\n\n(with the usual convention that inf \u2205 = +\u221e).\nTheorem 1 (Main result). Let Assumptions 1 and 2 hold and assume that the (4\u03bd1/\u03bd2, \u03bd2)\u2013near-optimality\ndimension of the considered environment is d < +\u221e.\n
Then, for any d\u2032 > d there exists a constant C(d\u2032)\nsuch that for all n \u2265 1,\n\n$$E[R_n] \\le C(d')\\, n^{(d'+1)/(d'+2)} (\\ln n)^{1/(d'+2)}.$$\n\nFurther, if the near-optimality dimension is achieved, i.e., the infimum is achieved in (2), then the result holds\nalso for d\u2032 = d.\nRemark 3. We can relax the weak-Lipschitz property by requiring it to hold only locally around the maxima.\nIn fact, at the price of increased constants, the result continues to hold if there exists \u03b5 > 0 such that (1)\nholds for any x, y \u2208 X\u03b5. To show this we only need to carefully adapt the steps of the proof below. We omit\nthe details from this extended abstract.\n\n5 Analysis of the regret and proof of the main result\n\nWe first state three lemmas, whose proofs can be found in the appendix. The proofs of Lemmas 3 and 4 rely\non concentration-of-measure techniques, while that of Lemma 2 follows from a simple case study. Let us\nfix some path $(0, 1), (1, i^*_1), \\ldots, (h, i^*_h), \\ldots$ of optimal nodes, starting from the root.\nLemma 2. Let (h, i) be a suboptimal node. Let k be the largest depth such that $(k, i^*_k)$ is on the path from\nthe root to (h, i). Then we have\n\n$$E[N_{h,i}(n)] \\le u + \\sum_{t=u+1}^{n} P\\Bigl\\{ N_{h,i}(t) > u \\text{ and } \\bigl[ U_{h,i}(t) > f^* \\text{ or } U_{s,i^*_s}(t) \\le f^* \\text{ for some } s \\in \\{k+1, \\ldots, t-1\\} \\bigr] \\Bigr\\}.$$\n\nLemma 3. Let Assumptions 1 and 2 hold. Then, for all optimal nodes (h, i) and for all integers n \u2265 1,\n$P\\{ U_{h,i}(n) \\le f^* \\} \\le n^{-3}$.\nLemma 4. Let Assumptions 1 and 2 hold. Then, for all integers t \u2264 n, for all suboptimal nodes (h, i)\nsuch that \u2206h,i > \u03bd1\u03c1^h, and for all integers u \u2265 1 such that $u \\ge \\frac{8 \\ln n}{(\\Delta_{h,i} - \\nu_1 \\rho^h)^2}$, one has\n$P\\{ U_{h,i}(t) > f^* \\text{ and } N_{h,i}(t) > u \\} \\le t\\, n^{-4}$.\n\n\u00b2 Note that sometimes packing numbers are defined as the largest packing with disjoint open balls of radius \u03b5/2, or,\n\u03b5-nets.\n\nTaking u as the integer part of (8 ln n)/(\u2206h,i \u2212 \u03bd1\u03c1^h)^2, and combining the results of Lemmas 2, 3, and 4\nwith a union bound leads to the following key result.\nLemma 5. Under Assumptions 1 and 2, for all suboptimal nodes (h, i) such that \u2206h,i > \u03bd1\u03c1^h, we have, for\nall n \u2265 1,\n\n$$E[N_{h,i}(n)] \\le \\frac{8 \\ln n}{(\\Delta_{h,i} - \\nu_1 \\rho^h)^2} + \\frac{2}{n}.$$\n\nWe are now ready to prove Theorem 1.\nProof. For the sake of simplicity we assume that the infimum in the definition of near-optimality is achieved.\nTo obtain the result in the general case one only needs to replace d by d\u2032 > d in the proof below.\nFirst step. For all h = 1, 2, . . ., denote by Ih the nodes at depth h that are 2\u03bd1\u03c1^h\u2013optimal, i.e., the nodes\n(h, i) such that $f^*_{h,i} \\ge f^* - 2 \\nu_1 \\rho^h$. Then, I is the union of these sets of nodes. Further, let J be the set of\nnodes that are not in I but whose parent is in I. We then denote by Jh the nodes in J that are located at\ndepth h in the tree. Lemma 5 bounds the expected number of times each node (h, i) \u2208 Jh is visited. Since\n\u2206h,i > 2\u03bd1\u03c1^h, we get\n\n$$E[N_{h,i}(n)] \\le \\frac{8 \\ln n}{\\nu_1^2 \\rho^{2h}} + \\frac{2}{n}.$$\n\nSecond step. We bound here the cardinality |Ih|, h > 0. If (h, i) \u2208 Ih then since \u2206h,i \u2264 2\u03bd1\u03c1^h, by\nLemma 1 $P_{h,i} \\subset X_{4 \\nu_1 \\rho^h}$.\n
Since by Assumption 1, the sets (Ph,i), for (h, i) \u2208 Ih, contain disjoint balls of\nradius \u03bd2\u03c1^h, we have that\n\n$$|I_h| \\le P\\bigl( \\cup_{(h,i) \\in I_h} P_{h,i},\\, \\ell,\\, \\nu_2 \\rho^h \\bigr) \\le P\\bigl( X_{(4\\nu_1/\\nu_2)\\, \\nu_2 \\rho^h},\\, \\ell,\\, \\nu_2 \\rho^h \\bigr) \\le C \\bigl( \\nu_2 \\rho^h \\bigr)^{-d},$$\n\nwhere we used the assumption that d is the (4\u03bd1/\u03bd2, \u03bd2)\u2013near-optimality dimension of f (and C is the\nconstant introduced in the definition of the near-optimality dimension).\nThird step. Choose \u03b7 > 0 and let H be the smallest integer such that \u03c1^H \u2264 \u03b7. We partition the infinite\ntree T into three sets of nodes, T = T1 \u222a T2 \u222a T3. The set T1 contains nodes of IH and their descendants,\n$T_2 = \\cup_{0 \\le h < H} I_h$, and T3 contains the nodes $\\cup_{1 \\le h \\le H} J_h$ and their descendants. (Note that T1 and T3 are\npotentially infinite, while T2 is finite.)\nWe denote by (Ht, It) the node that was chosen by the forecaster at round t to pick Xt. From the definition\nof the forecaster, no two such random variables are equal, since each node is picked at most once. We\ndecompose the regret according to the element Tj to which the chosen nodes (Ht, It) belong:\n\n$$E[R_n] = E\\Bigl[ \\sum_{t=1}^{n} (f^* - f(X_t)) \\Bigr] = E[R_{n,1}] + E[R_{n,2}] + E[R_{n,3}],$$\n\nwhere for all i = 1, 2, 3,\n\n$$R_{n,i} = \\sum_{t=1}^{n} (f^* - f(X_t))\\, \\mathbb{I}_{\\{(H_t, I_t) \\in T_i\\}}.$$\n\nThe contribution from T1 is easy to bound. By definition any node in IH is 2\u03bd1\u03c1^H-optimal. Hence, by\nLemma 1 the corresponding domain is included in $X_{4 \\nu_1 \\rho^H}$. The domains of these nodes\u2019 descendants are of\ncourse still included in $X_{4 \\nu_1 \\rho^H}$. Therefore, E[Rn,1] \u2264 4n\u03bd1\u03c1^H.\nFor h \u2265 1, consider a node (h, i) \u2208 T2. It belongs to Ih and is therefore 2\u03bd1\u03c1^h\u2013optimal.\n
By Lemma 1, the\ncorresponding domain is included in $X_{4 \\nu_1 \\rho^h}$. By the result of the second step and using that each node is\nplayed at most once, one gets\n\n$$E[R_{n,2}] \\le \\sum_{h=0}^{H-1} 4 \\nu_1 \\rho^h |I_h| \\le 4 \\nu_1 C \\nu_2^{-d} \\sum_{h=0}^{H-1} \\rho^{h(1-d)}.$$\n\nWe finish with the contribution from T3. We first remark that since the parent of any element (h, i) \u2208 Jh\nis in Ih\u22121, by Lemma 1 again, we have that $P_{h,i} \\subset X_{4 \\nu_1 \\rho^{h-1}}$. To each node (Ht, It) played in T3, we\nassociate the element $(H'_t, I'_t)$ of some Jh on the path from the root to (Ht, It). When (Ht, It) is played,\nthe chosen arm Xt belongs also to $P_{H'_t, I'_t}$. Decomposing Rn,3 according to the elements of $\\cup_{1 \\le h \\le H} J_h$, we\nthen bound the regret from T3 as\n\n$$E[R_{n,3}] \\le \\sum_{h=1}^{H} 4 \\nu_1 \\rho^{h-1} \\sum_{i : (h,i) \\in J_h} E[N_{h,i}(n)] \\le \\sum_{h=1}^{H} 4 \\nu_1 \\rho^{h-1} |J_h| \\Bigl( \\frac{8 \\ln n}{\\nu_1^2 \\rho^{2h}} + \\frac{2}{n} \\Bigr),$$\n\nwhere we used the result of the first step. Now, it follows from the fact that the parent of any node of Jh is in Ih\u22121 that\n|Jh| \u2264 2|Ih\u22121|. Substituting this and the bound on |Ih\u22121|, we get\n\n$$E[R_{n,3}] \\le 8 \\nu_1 C \\nu_2^{-d} \\sum_{h=1}^{H} \\rho^{h(1-d)+d-1} \\Bigl( \\frac{8 \\ln n}{\\nu_1^2 \\rho^{2h}} + \\frac{2}{n} \\Bigr).$$\n\nFourth step. Putting things together, we have proved\n\n$$E[R_n] \\le 4 n \\nu_1 \\rho^H + 4 \\nu_1 C \\nu_2^{-d} \\sum_{h=0}^{H-1} \\rho^{h(1-d)} + 8 \\nu_1 C \\nu_2^{-d} \\sum_{h=1}^{H} \\rho^{h(1-d)+d-1} \\Bigl( \\frac{8 \\ln n}{\\nu_1^2 \\rho^{2h}} + \\frac{2}{n} \\Bigr)$$\n$$= O\\Bigl( n \\rho^H + (\\ln n) \\sum_{h=1}^{H} \\rho^{-h(1+d)} \\Bigr) = O\\bigl( n \\rho^H + \\rho^{-H(1+d)} \\ln n \\bigr) = O\\bigl( n^{(d+1)/(d+2)} (\\ln n)^{1/(d+2)} \\bigr),$$\n\nby using first that \u03c1 < 1 and then by optimizing over \u03c1^H (the worst value being $\\rho^H \\sim (n / \\ln n)^{-1/(d+2)}$).\n\n6 Minimax optimality\nThe packing dimension of a set X is the smallest d such that there exists a constant k such that for all\n\u03b5 > 0, $P(X, \\ell, \\varepsilon) \\le k\\, \\varepsilon^{-d}$. For instance, compact subsets of R^d (with non-empty interior) have a packing\ndimension of d whenever \u2113 is a norm. If X has a packing dimension of d, then all environments have a\nnear-optimality dimension of at most d. The proof of the main theorem indicates that the constant C(d) only\ndepends on d, k (of the definition of packing dimension), \u03bd1, \u03bd2, and \u03c1, but not on the environment as long as\nit is weakly Lipschitz. Hence, we can extract from it a distribution-free bound of the form $\\widetilde{O}(n^{(d+1)/(d+2)})$.\nIn fact, this bound can be shown to be optimal as is illustrated by the theorem below, whose assumptions\nare satisfied by, e.g., compact subsets of R^d when \u2113 is some norm of R^d. The proof can be found in the\nappendix.\nTheorem 2.\n
If X is such that there exists c > 0 with $P(X, \\ell, \\varepsilon) \\ge c\\, \\varepsilon^{-d} \\ge 2$ for all \u03b5 \u2264 1/4, then for all\n$n \\ge 4^{d-1} c / \\ln(4/3)$, all strategies \u03d5 are bound to suffer a regret of at least\n\n$$\\sup E[R_n(\\varphi)] \\ge \\frac{1}{4} \\Bigl( \\frac{1}{4} \\sqrt{\\frac{c}{4 \\ln(4/3)}} \\Bigr)^{2/(d+2)} n^{(d+1)/(d+2)},$$\n\nwhere the supremum is taken over all environments with weakly Lipschitz payoff functions.\n\n7 Discussion\nSeveral works [1; 6; 3; 2; 7] have considered continuum-armed bandits in Euclidean or metric spaces and\nprovided upper- and lower-bounds on the regret for given classes of environments. Cope [3] derived a regret\nof \u00d5(\u221an) for compact and convex subsets of R^d and a mean-payoff function with a unique minimum and second\norder smoothness. Kleinberg [6] considered mean-payoff functions f on the real line that are H\u00f6lder with\ndegree 0 < \u03b1 \u2264 1. The derived regret is $\\Theta(n^{(\\alpha+1)/(\\alpha+2)})$. Auer et al. [2] extended the analysis to classes of\nfunctions with only a local H\u00f6lder assumption around the maximum (with possibly higher smoothness degree\n\u03b1 \u2208 [0,\u221e)), and derived the regret $\\Theta(n^{(1+\\alpha-\\alpha\\beta)/(1+2\\alpha-\\alpha\\beta)})$, where \u03b2 is such that the Lebesgue measure of \u03b5-optimal\nstates is $O(\\varepsilon^{\\beta})$. Another setting is that of [7], who considered a metric space (X , \u2113) and assumed that f\nis Lipschitz w.r.t. \u2113. The obtained regret is $\\widetilde{O}(n^{(d+1)/(d+2)})$ where d is the zooming dimension (defined\nsimilarly to our near-optimality dimension, but using covering numbers instead of packing numbers and the\nsets $X_{\\varepsilon} \\setminus X_{\\varepsilon/2}$).\n
When (X , \u2113) is a metric space, covering and packing numbers are equivalent and we may\nprove that the zooming dimension and the near-optimality dimension are equal.\nOur main contribution compared to [7] is that our weak-Lipschitz assumption, which is substantially weaker\nthan the global Lipschitz assumption made in [7], enables our algorithm to work better in some common\nsituations, such as when the mean-payoff function has a local smoothness whose order is larger than\none. In order to relate all these results, let us consider a specific example: Let $X = [0,1]^D$ and assume that\nthe mean-reward function f is locally equivalent to a H\u00f6lder function with degree \u03b1 \u2208 [0,\u221e) around any\nmaximum x\u2217 of f (the number of maxima is assumed to be finite):\n\n$$f(x^*) - f(x) = \\Theta(\\|x - x^*\\|^{\\alpha}) \\text{ as } x \\to x^*. \\qquad (3)$$\n\nThis means that there exist $c_1, c_2, \\varepsilon_0 > 0$ such that for all x with $\\|x - x^*\\| \\le \\varepsilon_0$, $c_1 \\|x - x^*\\|^{\\alpha} \\le f(x^*) - f(x) \\le c_2 \\|x - x^*\\|^{\\alpha}$.\nUnder this assumption, the result of Auer et al. [2] shows that for D = 1, the regret is \u0398(\u221an) (since here\n\u03b2 = 1/\u03b1). Our result allows us to extend the \u221an regret rate to any dimension D. Indeed, if we choose our\ndissimilarity measure to be $\\ell_{\\alpha}(x, y) \\stackrel{\\mathrm{def}}{=} \\|x - y\\|^{\\alpha}$, we may prove that f satisfies a locally weak-Lipschitz\ncondition (as defined in Remark 3) and that the near-optimality dimension is 0. Thus our regret is \u00d5(\u221an),\ni.e., the rate is independent of the dimension D.\nIn comparison, since Kleinberg et al. [7] have to satisfy a global Lipschitz assumption, they cannot use \u2113\u03b1\nwhen \u03b1 > 1. Indeed, a function globally Lipschitz with respect to \u2113\u03b1 is essentially constant. Moreover, \u2113\u03b1\ndoes not define a metric for \u03b1 > 1.\n
If one resorts to the Euclidean metric to fulfill their requirement that f\nbe Lipschitz w.r.t. the metric, then the zooming dimension becomes D(\u03b1 \u2212 1)/\u03b1, while the regret becomes\n$\\widetilde{O}(n^{(D(\\alpha-1)+\\alpha)/(D(\\alpha-1)+2\\alpha)})$, which is strictly worse than \u00d5(\u221an) and in fact becomes close to the slow\nrate $\\widetilde{O}(n^{(D+1)/(D+2)})$ when \u03b1 is large. Nevertheless, in the case of \u03b1 \u2264 1 they get the same regret rate.\nIn contrast, our result shows that under very weak constraints on the mean-payoff function and if the local\nbehavior of the function around its maximum (or finite number of maxima) is known, then global optimization\nsuffers a regret of order \u00d5(\u221an), independent of the space dimension. As an interesting sidenote let us also\nremark that our results allow different smoothness orders along different dimensions, i.e., heterogeneous\nsmoothness spaces.\nReferences\n[1] R. Agrawal. The continuum-armed bandit problem. SIAM J. Control and Optimization, 33:1926\u20131951, 1995.\n[2] P. Auer, R. Ortner, and Cs. Szepesv\u00e1ri. Improved rates for the stochastic continuum-armed bandit problem. 20th\nConference on Learning Theory, pages 454\u2013468, 2007.\n[3] E. Cope. Regret and convergence bounds for immediate-reward reinforcement learning with continuous action\nspaces. Preprint, 2004.\n[4] P.-A. Coquelin and R. Munos. Bandit algorithms for tree search. In Proceedings of 23rd Conference on Uncertainty\nin Artificial Intelligence, 2007.\n[5] S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with patterns in Monte-Carlo go. Technical\nReport RR-6062, INRIA, 2006.\n[6] R. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In 18th Advances in Neural Information\nProcessing Systems, 2004.\n[7] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces.\n
In Proceedings of the 40th ACM\nSymposium on Theory of Computing, 2008.\n[8] L. Kocsis and Cs. Szepesv\u00e1ri. Bandit based Monte-Carlo planning. In Proceedings of the 15th European Conference\non Machine Learning, pages 282\u2013293, 2006.\n", "award": [], "sourceid": 553, "authors": [{"given_name": "S\u00e9bastien", "family_name": "Bubeck", "institution": null}, {"given_name": "Gilles", "family_name": "Stoltz", "institution": null}, {"given_name": "Csaba", "family_name": "Szepesv\u00e1ri", "institution": null}, {"given_name": "R\u00e9mi", "family_name": "Munos", "institution": null}]}