{"title": "Fast Rates for Bandit Optimization with Upper-Confidence Frank-Wolfe", "book": "Advances in Neural Information Processing Systems", "page_first": 2225, "page_last": 2234, "abstract": "We consider the problem of bandit optimization, inspired by stochastic optimization and online learning problems with bandit feedback. In this problem, the objective is to minimize a global loss function of all the actions, not necessarily a cumulative loss. This framework allows us to study a very general class of problems, with applications in statistics, machine learning, and other fields. To solve this problem, we analyze the Upper-Confidence Frank-Wolfe algorithm, inspired by techniques for bandits and convex optimization. We give theoretical guarantees for the performance of this algorithm over various classes of functions, and discuss the optimality of these results.", "full_text": "Fast Rates for Bandit Optimization with\n\nUpper-Con\ufb01dence Frank-Wolfe\n\nQuentin Berthet \u2217\n\nUniversity of Cambridge\n\nq.berthet@statslab.cam.ac.uk\n\nVianney Perchet \u2020\n\nENS Paris-Saclay & Criteo Research, Paris\n\nvianney.perchet@normalesup.org\n\nAbstract\n\nWe consider the problem of bandit optimization, inspired by stochastic optimiza-\ntion and online learning problems with bandit feedback. In this problem, the ob-\njective is to minimize a global loss function of all the actions, not necessarily a\ncumulative loss. This framework allows us to study a very general class of prob-\nlems, with applications in statistics, machine learning, and other \ufb01elds. To solve\nthis problem, we analyze the Upper-Con\ufb01dence Frank-Wolfe algorithm, inspired\nby techniques for bandits and convex optimization. 
We give theoretical guarantees for the performance of this algorithm over various classes of functions, and discuss the optimality of these results.

Introduction

In online optimization problems, a decision maker chooses at each round t >= 1 an action π_t from some given action space, and observes some information through a feedback mechanism, in order to minimize a loss that is a function of the set of actions {π_1, . . . , π_T}. Traditionally, this objective is computed as a cumulative loss of the form Σ_t ℓ_t(π_t) [20, 34], or as a function thereof [2, 3, 16, 32].

Examples include classical multi-armed bandit problems where the action space is finite with K elements, in stochastic or adversarial settings [9]. In these problems, the loss at round t can be written as ℓ_t(e_{π_t}) for a linear form ℓ_t on R^K, and basis vectors e_i. More generally, this also includes bandit problems over a convex body C, where the action at each round consists in picking x_t ∈ C and where the loss is ℓ_t(x_t) for some convex function ℓ_t [see, e.g. 9, 12, 19, 10].

In this work, we consider the online learning problem of bandit optimization. Similarly to other problems of this type, a decision maker chooses at each round an action π_t from a set of size K, and observes information about an unknown convex loss function L. The difference is that the objective is to minimize a global convex loss L((1/T) Σ_{t=1}^T e_{π_t}), not a cumulative one. At each round, choosing the i-th action increases the information about the local dependency of L on its i-th coefficient. This problem can be contrasted with the objective of minimizing the average pseudo-regret in a stochastic bandit problem, i.e. of minimizing (1/T) Σ_{t=1}^T L(e_{π_t}) with observation ℓ_t(e_{π_t}), a noisy estimate of L(e_{π_t}).
At the intersection of these frameworks, when L is a linear form, lies the stochastic multi-armed bandit problem. Our problem is also related to the maximization of known convex objectives [2, 3]. We compare our framework to these settings in Section 1.4.

Bandit optimization shares some similarities with stochastic optimization problems, where the objective is to minimize f(x_T) for an unknown function f, while choosing at each round a variable x_t and observing some noisy information about the function f. Our problem can be seen as a stochastic optimization problem over the simplex, with the caveat that the list of actions π_1, . . . , π_T determines the variable, as x_t = (1/t) Σ_{s=1}^t e_{π_s}, as well as the manner in which additional information about the function can be gathered. This setting allows us to study a more general class of problems than multi-armed bandits, and to cover examples where there is not one optimal action, but rather an optimal global strategy, that is, an optimal mix of actions. We describe several natural problems from machine learning, statistics, or economics that are cases of bandit optimization.

This problem draws inspiration from the world of multi-armed bandit problems and that of stochastic convex optimization, and our solution to it does as well. We analyze the Upper-Confidence Frank-Wolfe algorithm, a modification of the Frank-Wolfe algorithm [17] and of the UCB algorithm for bandits [5].

* Supported by an Isaac Newton Trust Early Career Support Scheme and by The Alan Turing Institute under the EPSRC grant EP/N510129/1.
† Supported by the ANR (grant ANR-13-JS01-0004-01), and the FMJH Program Gaspard Monge in Optimization and operations research (supported in part by EDF) and from the Labex LMH.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
The link with Frank-Wolfe is related to the choice of one action and encourages exploitation, while the link with UCB encourages choosing rarely picked actions in order to increase knowledge about the function, encouraging exploration. This algorithm can be used for all convex functions L, and performs in a near-optimal manner over various classes of functions. Indeed, while it has already been proved that it achieves slow rates of convergence in some cases, i.e., the error decreases as 1/√T, we are able to exhibit fast rates decreasing in 1/T, up to logarithmic terms.

These fast rates are surprising, as they sometimes even hold for non-strongly convex functions, and in many problems with bandit feedback they cannot be reached [23, 35]. As shown in our lower bounds, the main complexity of this problem is statistical and comes from the limited information available about the unknown function L. Usual results in optimization with a known function are not necessarily relevant to our problem. As an example, while linear rates in e^{-cT} are possible in deterministic settings with variants of the Frank-Wolfe algorithm, we are limited to fast rates in 1/T under similar assumptions. Interestingly, while linear functions are one of the settings in which the deterministic Frank-Wolfe algorithm is the most efficient, it is among the most complicated for bandit optimization, and only slow rates are possible (see Theorems 2 and 6).

Our work is organized in the following manner: we describe in Section 1 the problem of bandit optimization. The main algorithm is introduced in Section 2, and its performance in various settings is studied in Sections 3, 4, and 5. All proofs of the main results are in the supplementary material.

Notations: For any positive integer n, denote by [n] the set {1, . . . , n} and, for any positive integer K, by ∆_K := {p ∈ R^K : p_i ≥ 0 and Σ_{i∈[K]} p_i = 1} the unit simplex of R^K.
Finally, e_i stands for the i-th vector of the canonical basis of R^K. Notice that ∆_K is their convex hull.

1 Bandit Optimization

We describe the bandit optimization problem, generalizing multi-armed bandits. This stochastic optimization problem is doubly related to bandits: the decision variable cannot be chosen freely but is tied to the past actions, and information about the function is obtained via a bandit feedback.

1.1 Problem description

At each time step t ≥ 1, a decision maker chooses an action π_t ∈ [K] from K different actions with the objective of minimizing an unknown convex loss function L : ∆_K → R. Unlike in traditional online learning problems, we do not assume that the overall objective of the agent is to minimize a cumulative loss Σ_t L(e_{π_t}), but rather to minimize the global loss L(p_T), where p_t ∈ ∆_K is the vector of proportions of each action (also called occupation measure), i.e.,

p_t = (T_1(t)/t, . . . , T_K(t)/t)   with   T_i(t) = Σ_{s=1}^t 1{π_s = i} .

Alternatively, p_t = (1/t) Σ_{s=1}^t e_{π_s}. As usual in stochastic optimization, the performance of a policy is evaluated by controlling the difference

r(T) := E[L(p_T)] − min_{p∈∆_K} L(p) .

The information available to the policy is a feedback of bandit type: given the choice π_t = i, it is an estimate ĝ_t of ∇L(p_t).
Its precision, with respect to each coefficient i ∈ [K], is specified by a deviation function α_{t,i}, meaning that for all δ ∈ (0, 1), it holds with probability 1 − δ that

|ĝ_{t,i} − ∇_i L(p_t)| ≤ α_{t,i}(T_i(t), δ) .

At each round, it is possible to improve the precision for one of the coefficients of the gradient, but possibly at the cost of increasing the global loss. The most typical case, described in the following section, is α_{t,i}(T_i, δ) = √(2 log(t/δ)/T_i), when the information consists of observations from different distributions. In general, this type of feedback mechanism is indicative of a bandit feedback (and not of a full information setting), as motivated by the following parametric setting.

1.2 Bandit feedback and parametric setting

One of the motivations is the minimization of a loss function L belonging to a known class {L(µ, ·), µ ∈ R^K} with an unknown parameter µ. Choosing the i-th action provides information about µ_i, through an observation from some auxiliary distribution ν_i.

As an example, the classical stochastic multi-armed bandit problem [9] falls within our framework. Denoting by µ_i the expected loss of arm i ∈ [K], the average pseudo-regret R̄ can be expressed as

R̄(t) = (1/t) Σ_{s=1}^t µ_{π_s} − µ_⋆ = Σ_{i=1}^K µ_i T_i(t)/t − µ_⋆ = p_t^⊤ µ − p_⋆^⊤ µ ,   with p_⋆ = e_{i_⋆} .

Hence the choice of L(µ, p) = µ^⊤ p corresponds to the problem of multi-armed bandits. Since ∇L(µ, p) = µ, the feedback mechanism for ĝ_t is induced by having a sample X_t from ν_{π_t} at time step t, taking ĝ_{t,i} = X̄_{t,i}, the empirical mean of the T_i(t) observations from ν_i.
In this case, if ν_i is sub-Gaussian with parameter 1, we have α_{t,i}(T_i, δ) = 2√(2 log(t/δ)/T_i).

More generally, for any parametric model, we can consider the following observation setting: for all i ∈ [K], let ν_i be a sub-Gaussian distribution with mean µ_i and tail parameter σ². At time t, for an action π_t ∈ [K], we observe a realization from ν_{π_t}. We estimate µ_i by the empirical mean µ̂_{t,i} of the T_i(t) draws from ν_i, and take ĝ_t = ∇_p L(µ̂_t, p_t) as an estimate of the gradient of L = L(µ, ·) at p_t. The following bound on α_i under smoothness conditions on the parametric model is a direct application of Hoeffding's inequality.

Proposition 1. Let L = L(µ, ·) for some µ ∈ R^K be µ-gradient-Lipschitz, i.e., such that

|(∇_p L(µ, p))_i − (∇_p L(µ′, p))_i| ≤ |µ_i − µ′_i| ,   ∀p ∈ ∆([K]).

Under the sub-Gaussian observation setting above, ĝ_t = ∇_p L(µ̂_t, p_t) is a valid gradient feedback with deviation bounds α_{t,i}(T_i, δ) = √(2σ² log(t/δ)/T_i).

This Lipschitz condition on the parameter µ gives a motivation for our gradient bandit feedback.

1.3 Examples

Stochastic multi-armed bandit: As noted above, the stochastic multi-armed bandit problem is a special case of our setting for a loss L(p) = µ^⊤ p, and the bandit feedback allows one to construct a proxy ĝ_t for the gradient with deviations α_i decaying in 1/√T_i.
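The deviation bound of Proposition 1 is simple to compute in practice. The following is a minimal numerical sketch of our own (not code from the paper, and with an assumed variance parameter), showing how the confidence width behaves and comparing it against the empirical deviation of a seeded Gaussian sample:

```python
import math
import random


def deviation_bound(t, T_i, delta, sigma2=1.0):
    """alpha_{t,i}(T_i, delta) = sqrt(2 * sigma^2 * log(t/delta) / T_i),
    the confidence width of Proposition 1 for a sub-Gaussian model."""
    return math.sqrt(2.0 * sigma2 * math.log(t / delta) / T_i)


def empirical_mean_deviation(mu, n, seed=0):
    """|empirical mean of n Gaussian(mu, 1) draws - mu|, as a sanity check
    that the bound is respected on a concrete (seeded) sample."""
    rng = random.Random(seed)
    draws = [rng.gauss(mu, 1.0) for _ in range(n)]
    return abs(sum(draws) / n - mu)
```

The width shrinks with the number of pulls T_i of a coordinate and grows only logarithmically with the round t through the log(t/δ) term; with the choice δ_t = 1/t² used throughout the paper, this is the quantity subtracted from ĝ_{t,i} in the algorithm of Section 2.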
The UCB algorithm used to solve this problem inspires our algorithm, which generalizes to any loss function L, as discussed in Section 2.

Online experimental design: In the context of statistical estimation with heterogeneous data sources [8], consider the problem of allocating samples in order to minimize the variance of the final estimate. At time t, it is possible to sample from one of K distributions N(θ_i, σ_i²) for i ∈ [K], the objective being to minimize the average variance of the simple unbiased estimator

E[‖θ̂ − θ‖₂²] = Σ_{i∈[K]} σ_i²/T_i ,   equivalent to   L(p) = Σ_{i∈[K]} σ_i²/p_i .

For unknown σ_i, this problem falls within our framework, and the gradient, with coordinates −σ_i²/p_i², can be estimated by using the T_i draws from N(θ_i, σ_i²) to construct σ̂_i². This function is only defined on the interior of the simplex and is unbounded, matters that we discuss further in Section 4.3. Objective functions other than the expected ℓ₂ norm of the error can also be used, as in [11], who consider the ℓ∞ norm of the actual estimated deviations, not its expectation.

Utility maximization: A classical model for the utility of an agent purchasing x_i units of K different goods is the Cobb-Douglas utility (see e.g. [27]), defined for parameters β_i ∈ (0, 1) by U(x_1, . . .
, x_K) = Π_{i∈[K]} x_i^{β_i} .

Maximizing this utility for unknown β_i under a budget constraint (where each price is assumed to be 1 for ease of notation), by buying one unit of one of the K goods at each round, is therefore equivalent to minimizing in p_i (the proportion of good i in the basket) the loss L(p) = −Σ_{i∈[K]} β_i log(p_i).

Other examples: More generally, the notion of bandit optimization can be applied to any situation where one optimizes a strategy through actions that are taken sequentially, with information gained at each round, and where the objective depends only on the proportions of actions. Other examples include a problem inspired by online Markowitz portfolio optimization, where the goal is to minimize L(p) = p^⊤ Σ p − λ µ^⊤ p, with a known covariance matrix Σ and unknown returns µ, or several generalizations of bandit problems, such as minimizing L(p) = Σ_{i∈[K]} f_i(µ_i) p_i when observations are drawn from a distribution with mean µ_i, for known f_i.

1.4 Comparison with other problems

As mentioned in the introduction, the problem of bandit optimization is different from online learning problems related to regret minimization [21, 1, 10], even in a stochastic setting. While the usual objective is to minimize a cumulative regret related to (1/T) Σ_t ℓ_t(x_t), we focus on L((1/T) Σ_t e_{π_t}).

Problems related to online optimization of global costs or objectives have been studied in similar settings [2, 3, 16, 32]. They are equivalent to minimizing a loss L(p_T^⊤ V), where V is a K × d unknown matrix and L(·) : R^d → R is known. The feedback at stage t is a noisy evaluation of V_{π_t}. In the stochastic case [2, 3], this is close to our setting, even though neither of the two directly subsumes the other.
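Both examples of Section 1.3 have closed-form minimizers over the simplex, which makes them convenient test cases: Lagrange conditions give p_i⋆ ∝ σ_i for the variance loss Σ_i σ_i²/p_i, and p_i⋆ = β_i/Σ_j β_j for the Cobb-Douglas loss −Σ_i β_i log(p_i). A minimal sketch of our own (an illustration, not code from the paper):

```python
import math


def variance_loss(p, sigma2):
    """L(p) = sum_i sigma_i^2 / p_i (online experimental design)."""
    return sum(s / q for s, q in zip(sigma2, p))


def variance_optimal(sigma2):
    """Minimizer over the simplex: p_i* proportional to sigma_i
    (note: proportional to the standard deviation, not the variance)."""
    sig = [math.sqrt(s) for s in sigma2]
    z = sum(sig)
    return [s / z for s in sig]


def cobb_douglas_loss(p, beta):
    """L(p) = -sum_i beta_i * log(p_i) (utility maximization)."""
    return -sum(b * math.log(q) for b, q in zip(beta, p))


def cobb_douglas_optimal(beta):
    """Minimizer over the simplex: p_i* = beta_i / sum_j beta_j."""
    z = sum(beta)
    return [b / z for b in beta]
```

For instance, for σ² = (1, 4, 9) the optimal proportions are (1/6, 2/6, 3/6), with loss (Σ_i σ_i)² = 36, strictly below the value 42 of the uniform allocation.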
Only slow rates of convergence, of order 1/√T, are derived for the variant of Frank-Wolfe, while we aim at fast rates, which are optimal. In contrast, in the adversarial case [16, 32], there are instances of the problem where the average regret cannot decrease to zero [26].

Using the Frank-Wolfe algorithm in a stochastic optimization problem has also already been considered, particularly in [25], where the estimates of the gradients are increasingly precise in t, independently of the actions of the decision maker. This setting, where the action at each round is to pick x_t in the domain in order to minimize f(x_T), is therefore closer to classical stochastic optimization than to online learning problems related to bandits [9, 19, 10].

2 Upper-Confidence Frank-Wolfe algorithm

With linear functions, as in multi-armed bandits, an estimate of the gradient can be established by using the past observations, as well as confidence intervals on each coefficient in 1/√T_i. The UCB algorithm instructs to pick the action with the smallest lower confidence estimate µ_{t,i} for the loss. This is equivalent to making a step of size 1/(t + 1) in the direction of the corner of the simplex e that minimizes e^⊤ µ_t, where µ_t is the vector of these lower confidence estimates.
Following this intuition, we introduce the UCB Frank-Wolfe algorithm, which uses a proxy of the gradient, penalized by the size of confidence intervals.

Algorithm: UCB Frank-Wolfe algorithm
Input: K, p_0 = 1_[K]/K, sequence (δ_t)_{t≥0};
for t ≥ 0 do
    Observe ĝ_t, noisy estimate of ∇L(p_t);
    for i ∈ [K] do
        Û_{t,i} = ĝ_{t,i} − α_{t,i}(T_i(t), δ_t)
    end
    Select π_{t+1} ∈ argmin_{i∈[K]} Û_{t,i};
    Update p_{t+1} = p_t + (1/(t+1)) (e_{π_{t+1}} − p_t)
end

Notice that for any algorithm, the selection of an action π_{t+1} ∈ [K] at time step t + 1 updates the variable p with respect to the following dynamics:

p_{t+1} = (1 − 1/(t+1)) p_t + (1/(t+1)) e_{π_{t+1}} = p_t + (1/(t+1)) (e_{π_{t+1}} − p_t) .   (1)

This is implied by the mechanism of the problem, and does not depend on the choice of an algorithm. If the choice of e_{π_{t+1}} were e_{⋆,t+1}, the minimizer of s^⊤ ∇L(p_t) over all s ∈ ∆_K, this would precisely be the Frank-Wolfe algorithm with step size 1/(t + 1). Inspired by this similarity, our selection rule is driven by the same principle, using a proxy Û_t for ∇L(p_t) based on the information up to time t. Our selection rule is therefore driven by two principles, borrowing from tools in convex optimization (the Frank-Wolfe algorithm) and classical bandit problems (upper confidence bounds). The choice of action π_{t+1} is equivalent to taking e_{π_{t+1}} ∈ argmin_{s∈∆_K} s^⊤ Û_t.
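As an illustration, the following is a minimal sketch of the procedure on the linear (multi-armed bandit) instance L(p) = µ^⊤ p, where ∇L = µ and pulling arm i yields a Gaussian observation of µ_i. The instance, the noise level, and the exact constant used in α_{t,i} are our own choices for the demo, not prescriptions from the paper:

```python
import math
import random


def ucb_frank_wolfe(mu, T, sigma=1.0, seed=0):
    """Sketch of UCB Frank-Wolfe on the bandit instance L(p) = mu^T p:
    the gradient proxy is the vector of per-arm empirical means, penalized
    by the confidence width alpha_{t,i}, and p_t follows the forced
    dynamics (1) with step size 1/t."""
    rng = random.Random(seed)
    K = len(mu)
    counts = [0] * K          # T_i(t): number of pulls of arm i
    sums = [0.0] * K          # running sums of observations per arm
    p = [1.0 / K] * K         # p_0: uniform initialization
    for t in range(1, T + 1):
        delta = 1.0 / t**2    # confidence level delta_t = 1/t^2
        index = []
        for i in range(K):
            if counts[i] == 0:
                index.append(-math.inf)   # force one pull of each arm first
            else:
                g_hat = sums[i] / counts[i]                   # gradient proxy
                alpha = 2.0 * sigma * math.sqrt(
                    math.log(t / delta) / counts[i])          # confidence width
                index.append(g_hat - alpha)                   # U_{t,i}
        a = min(range(K), key=index.__getitem__)  # corner minimizing e^T U_t
        x = rng.gauss(mu[a], sigma)               # bandit feedback for arm a
        counts[a] += 1
        sums[a] += x
        for i in range(K):                        # occupation-measure update
            e_i = 1.0 if i == a else 0.0
            p[i] += (e_i - p[i]) / t
    return p, counts
```

On a gap-one instance, the occupation measure should concentrate on the best arm while the suboptimal arms keep a vanishing (logarithmic) share of the pulls, in line with the behavior described in Sections 3 and 4.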
The computational cost of this procedure is very light: apart from gradient computations, it is linear in K at each iteration, leading to a global cost of order KT.

3 Slow rates

In this section we show that when α_i is of order 1/√T_i, as motivated by the parametric model of Section 1.2, our algorithm has an approximation error of order √(log(T)/T) over the very general class of smooth convex functions. We refer to this as the slow rate. Our analysis is based on the classical study of the Frank-Wolfe algorithm [see, e.g. 22, and references therein]. We consider the case of C-smooth convex functions on the unit simplex, for which we recall the definition.

Definition (Smooth functions). For a set D ⊂ R^n, a function f : D → R is said to be a C-smooth function if it is differentiable and if its gradient is C-Lipschitz continuous, i.e. the following holds:

‖∇f(x) − ∇f(y)‖₂ ≤ C ‖x − y‖₂ ,   ∀x, y ∈ D .

We denote by F_{C,K} the set of C-smooth convex functions. They attain their minimum at a point p_⋆ ∈ ∆_K, and their Hessian is uniformly bounded, ∇²L(p) ⪯ C I_K, if they are twice differentiable. We establish in this general setting a slow rate when α_i decreases like 1/√T_i.

Theorem 2 (Slow rate). Let L be a C-smooth convex function over the unit simplex ∆_K.
For any T ≥ 1, after T steps of the UCB Frank-Wolfe algorithm with a bandit feedback such that α_{t,i}(T_i, δ) = 2√(log(t/δ)/T_i) and the choice δ_t = 1/t², it holds that

E[L(p_T)] − L(p_⋆) ≤ 4√(3K log(T)/T) + C log(eT)/T + (π²/6 + K) (2‖∇L‖∞ + ‖L‖∞)/T .

The proof draws inspiration from the analysis of the Frank-Wolfe algorithm with stepsize 1/(t + 1) and of the UCB algorithm. Notice that our algorithm is adaptive to the gradient Lipschitz constant C, and that the leading term of the error does not depend on it. We also emphasize the fact that the dependency in √K is expected, and optimal, in the bandit setting.

For linear mappings L(p) = p^⊤ µ, our analysis is equivalent to studying the UCB algorithm in multi-armed bandits. The slow rate in Theorem 2 corresponds to a regret of order √(KT log(T)), the distribution-independent (or worst-case) performance of UCB. The extra dependency in √log(T) could be reduced to √log(K), or even optimally to 1, by using confidence intervals more carefully tailored, for instance by replacing the log(t) term appearing in the definition of the estimated gradients by log(T/T_i(t)) or log(T/(K T_i(t))) if the horizon T is known in advance, as in the algorithms MOSS or ETC (see [4, 29, 30]), but at the cost of a more involved analysis. Thus, multi-armed bandits provide a lower bound of order √(K/T) on the approximation error E[L(p_T)] − L(p_⋆) for smooth convex functions. We discuss lower bounds further in Section 5.

For the sake of clarity, we state all our results when α_{t,i}(T_i, δ) = 2√(log(t/δ)/T_i), but our techniques handle more general deviations of the form α_{t,i}(T_i, δ) = (θ log(t/δ)/T_i)^β, where θ ∈ R and β > 0 are some known parameters. More general results can be found in the supplementary material.

4 Fast rates

In this section, we describe situations where the approximation error rate can be improved to a fast rate of order log(T)/T, when we consider various classes of functions, with additional assumptions.

4.1 Stochastic multi-armed bandits and functions minimized on vertices

A very natural and well-known, yet illustrative, example of such a restricted class of functions is simply the case of classical bandits where ∆(i) := µ_i − µ_⋆ is bounded away from 0 for i ≠ ⋆. Our analysis of the algorithm can be adapted to this special case with the following result.

Proposition 3. Let L be the linear function p ↦ p^⊤ µ.
After T steps of the UCB Frank-Wolfe algorithm with a bandit feedback such that α_{t,i}(T_i, δ) = 2√(log(t/δ)/T_i) and the choice δ_t = 1/t², the following holds:

E[L(p_T)] − L(p_⋆) ≤ (48 log(T)/T) Σ_{i≠⋆} 1/∆(i) + (π²/3 + 3 + K) √K ‖µ‖∞ / T .

The constants in this proposition are sub-optimal (for instance, the 48 can be reduced to as little as 2 using a more careful but involved analysis). It is provided here to show that this classical bound on the pseudo-regret in stochastic multi-armed bandits [see e.g. 9, and references therein] can be recovered with Frank-Wolfe-type techniques, illustrating further the links between bandit problems and convex optimization [20, 34]. This result can actually be generalized to any convex function that is minimized on a vertex of the simplex with a gradient whose component-wise differences are bounded away from 0.

Proposition 4. Let L be a convex mapping that attains its minimum on ∆_K at a vertex p_⋆ = e_{i_⋆} and such that ∆(i)(L) := ∇_i L(p_⋆) − ∇_{i_⋆} L(p_⋆) > 0 for all i ≠ i_⋆. Then, after T steps of the UCB Frank-Wolfe algorithm with a bandit feedback such that α_{t,i}(T_i, δ) = 2√(log(t/δ)/T_i) and the choice δ_t = 1/t², the following holds:

E[L(p_T)] − L(p_⋆) ≤ ρ(L) (48 log(T)/T) Σ_{i≠i_⋆} 1/∆(i)(L) + (π²/6 + K) (2‖∇L‖∞ + ‖L‖∞)/T + C log(eT)/T ,

where ρ(L) = 1 + CK/∆_min(L) and ∆_min(L) = min_{i≠i_⋆} ∆(i)(L).

The KKT conditions imply that ∆(i)(L) ≥ 0 whenever p_⋆ is in a corner, but the strict inequality is not always guaranteed.
In particular, this result may not hold if p_⋆ is the global minimum of L over R^K. This type of condition has also been linked with rates of convergence in stochastic optimization problems [15].

The extra multiplicative factor ρ(L) can be large, but it would be of order 1 + o(1) for variants of our algorithm with results that hold only with high probability (typically with confidence bounds of the form 2√(log(1/δ)/T_i)).

4.2 Strongly convex functions

Another classical assumption in convex optimization is strong convexity, as recalled below. We denote by S_{µ,K} the set of µ-strongly convex functions on ∆_K. This assumption usually improves the rates of approximation error in many settings, even in stochastic optimization or some settings of online learning [see, e.g. 31, 14, 33, 6, 18, 19, 7]. Interestingly enough, though, strong convexity cannot be leveraged to improve rates of convergence in online convex optimization [35, 23], where the 1/√T rate of convergence cannot be improved. Moreover, leveraging strong convexity usually requires adapting the step size of gradient descent, or using line search and/or away steps for classical Frank-Wolfe methods. Those techniques cannot be adapted to our setting, where step sizes are fixed.

Definition (Strongly convex functions). For a set D ⊂ R^n, a function f : D → R is said to be µ-strongly convex if for all x, y ∈ D, we have

f(x) ≥ f(y) + ∇f(y)^⊤ (x − y) + (µ/2) ‖x − y‖₂² .

We already covered the case where the convex function is minimized outside the simplex. We will now assume that the minimum lies in its relative interior.

Theorem 5. Let L : ∆_K → R be a C-smooth, µ-strongly convex function such that its minimum p_⋆ satisfies dist(p_⋆, ∂∆_K) ≥ η, for some η ∈ (0, 1/K].
After T steps of the UCB Frank-Wolfe algorithm with a bandit feedback such that α_{t,i}(T_i, δ) = 2√(log(t/δ)/T_i), it holds that, with the choice δ_t = 1/t²,

E[L(p_T)] − L(p_⋆) ≤ c₁ log²(T)/T + c₂ log(T)/T + c₃/T ,

for constants c₁ = 96K/(µη²), c₂ = 24/(µη³) + C and c₃ = 24(20/(µη²))² K + µη²/2 + C.

The proof is based on an improvement in the analysis of the UCB Frank-Wolfe algorithm, relying on a better control of the duality gap, possible in the strongly convex case. It is a consequence of an inequality due to Lacoste-Julien and Jaggi [24, Lemma 2]. In order to get the result, we adapt these ideas to a case of unknown gradient, with bandit feedback. We note that this approach is similar to the one in [25], which focuses on stochastic optimization problems, as discussed in Section 1.4.

Our framework is more complicated in some aspects than typical settings in stochastic optimization, where strong assumptions can usually be made on the noisy gradient feedback. These include stochastic gradients that are independent unbiased estimates of the true gradient, or error terms that are decreasing in t. Here, such properties do not hold: as an example, in a parametric setting, information is only obtained about one of the coefficients, and there are strong dependencies between successive gradient feedbacks. Dealing with these aspects, as well as with the fact that our gradient proxy is penalized by the size of the confidence intervals, are some of the main challenges of the proof.

4.3 Interior-smooth functions

Many interesting examples of bandit optimization are not exactly covered by the case of functions that are C-smooth on the whole unit simplex.
In particular, for several applications, the function diverges at the boundary, as in the examples of Cobb-Douglas utility maximization and variance minimization from Section 1.3. Recall that the loss was defined by

E[‖θ̂ − θ‖₂²] = Σ_{i∈[K]} σ_i²/T_i = (1/T) Σ_{i∈[K]} σ_i²/p_i = (1/T) L(p) .

The gradient Lipschitz constant is infinite, but if we knew for instance bounds 0 < σ_i⁻ ≤ σ_i ≤ σ_i⁺, we could safely sample each arm i first a linear number of times, because p_i⋆ ≥ p_i := σ_i⁻ / Σ_j σ_j⁺. We would then have (p_t)_i ≥ p_i at all stages, and our analysis holds with the constant C = 2 σ²_max (Σ_j σ_j⁺)³ / σ³_min.

Even without knowledge of the σ_i², it is possible to quickly obtain rough estimates, as illustrated by Lemma 2 in the appendix. Only a logarithmic number of samples of each action is needed. Once they are gathered, one can keep sampling each arm a linear number of times, as suggested when the lower/upper bounds are known beforehand. This leads to a Lipschitz constant C = (9 Σ_j σ_j)³ / σ_min, which is, up to a multiplicative factor, the gradient Lipschitz constant at the minimum.

5 Lower bounds

The results shown in Sections 3 and 4 exhibit different theoretical guarantees for our algorithm depending on the class of functions considered. We discuss here the optimality of these results.

5.1 Slow rate lower bound

In Theorem 2, we show a slow rate of order √(K log(T)/T) for the approximation error of our algorithm over the class of C-smooth convex functions of R^K. Up to the logarithmic term, this result is optimal: no algorithm based on the same feedback can significantly improve the rate of approximation. This is a consequence of the following theorem, a direct corollary of a result by [5].

Theorem 6.
For any algorithm based on a bandit feedback such that α_{t,i}(T_i, δ) = √(2 log(t/δ)/T_i) and that outputs p̂_T, we have over the class L_K of linear forms that, for some constant c > 0,

inf_{p̂_T} sup_{L∈L_K} { E[L(p̂_T)] − L(p_⋆) } ≥ c √(K/T) .

This result is established over the class of linear functions over the simplex (for which C = 0), when the feedback consists of a draw from a distribution with mean µ_i. As mentioned in Section 3, the extra logarithmic term in our upper bound comes from our algorithm, which has the same behavior as UCB. Nevertheless, as mentioned before, modifying our algorithm to recover the behavior of MOSS [4], or even ETC [see e.g. 29, 30], would improve the upper bound and remove the logarithmic term.

5.2 Fast rate lower bound

We have shown that in the case of strongly convex smooth functions, there is an approximation error upper bound of order (K/η⁴) log(T)/T for the performance of our algorithm, where η < 1/K. We provide a lower bound over this class of functions in the following theorem.

Theorem 7. For any algorithm with a bandit feedback such that α_{t,i}(T_i, δ) = √(2 log(t/δ)/T_i) and output p̂_T, we have over the class S_{1,K} of 1-strongly convex functions that, for some constant c > 0,

inf_{p̂} sup_{L∈S_{1,K}} { E[L(p̂_T)] − L(p_⋆) } ≥ c K²/T .

The proof relies on the complexity of minimizing the quadratic functions (1/2)‖p − θ‖₂² when observing a draw from a distribution with mean θ_i. Our upper bound is in the best case of order K⁵ log(T)/T, as η ≤ 1/K.
Understanding the optimal rate more precisely is an interesting avenue for future research.

5.3 Mixed feedbacks lower bound

In our analysis of this problem, we have only considered settings where the feedback upon choosing action i gives information about the i-th coefficient of the gradient. The two following cases show that even in simple settings, our upper bounds no longer hold if the relationship between action and feedback is different, i.e. when the feedback corresponds to another coefficient.
Proposition 8. For L in the class of 1-strongly convex functions on Δ₃, we have in the case of a mixed bandit feedback that

inf_{p̂} sup_{L∈S_{1,3}} { E[L(p̂_T)] − L(p⋆) } ≥ c/T^{2/3} .

For strongly convex functions, even with K = 3, there are therefore pathological mixed-feedback settings where the error is at least of order 1/T^{2/3} instead of 1/T. The case of smooth convex functions is covered by the existing lower bounds for the problem of partial monitoring [13], which give a lower bound of order 1/T^{1/3} instead of 1/√T.
Proposition 9. For L in the class of linear forms F₃ on Δ₃, with a mixed bandit feedback we have

inf_{p̂} sup_{L∈F₃} { E[L(p̂_T)] − L(p⋆) } ≥ c/T^{1/3} .

6 Discussion

We study the online minimization of a stochastic global loss with bandit feedback. This is naturally motivated by many applications with a parametric setting and tradeoffs between exploration and exploitation. The UCB Frank-Wolfe algorithm performs optimally in a generic setting.
The fast rates of convergence obtained for some classes of functions are a significant improvement over the slow rates that hold for smooth convex functions.
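As a concrete, hedged illustration of the algorithm's behavior in the simplest linear setting: with the confidence width α_{t,i}(T_i, δ) = √(2 log(t/δ)/T_i) of Theorems 6 and 7, and a Frank-Wolfe step of 1/(t+1) so that p_t is simply the empirical frequency of actions, the vertex choice reduces to an optimistic arm selection. Everything below (the means μ, the Gaussian noise, the exact update) is our own illustrative assumption, not the paper's pseudocode.

```python
import math
import random

# Linear case: L(p) = <mu, p> on the simplex, so the i-th gradient
# coordinate is mu_i, and bandit feedback on arm i is a noisy draw of it.
random.seed(0)
mu = [0.9, 0.5, 0.1]        # illustrative (unknown) gradient coordinates
K, T = len(mu), 5000
delta = 1.0 / T

counts = [1] * K                               # T_i after one initial pull each
sums = [random.gauss(m, 1.0) for m in mu]      # running sums of the feedback

for t in range(K + 1, T + 1):
    # optimistic (lower-confidence) estimate of each gradient coordinate,
    # with width alpha_{t,i}(T_i, delta) = sqrt(2 log(t/delta) / T_i)
    lcb = [sums[a] / counts[a]
           - math.sqrt(2.0 * math.log(t / delta) / counts[a]) for a in range(K)]
    arm = min(range(K), key=lcb.__getitem__)   # Frank-Wolfe vertex e_arm
    sums[arm] += random.gauss(mu[arm], 1.0)    # bandit feedback on that arm
    counts[arm] += 1                           # p_t stays counts / t

p_hat = [c / T for c in counts]
# the empirical play concentrates on the minimizer of <mu, p> (arm 2 here)
```

With the step size fixed to 1/(t+1) by the mechanics of the problem, no strong convexity parameter enters the update, consistent with the discussion above.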
In bandit-type problems similar to ours, it is not always possible to leverage additional assumptions such as strong convexity: this has been proved impossible in the closely related setting of online convex optimization [23, 35]. When it is possible, step sizes must usually depend on the strong convexity parameter, as in gradient descent [28]. This is not the case here, where the step size is fixed by the mechanics of the problem. We have also shown that fast rates are possible without requiring strong convexity, under a gap condition on the gradient at an extreme point, more commonly associated with bandit problems.
We mention that several extensions of our model, motivated by heterogeneous estimation, are quite interesting but out of scope. For instance, assume an experimentalist can choose one of K known covariates X_i in order to estimate an unknown β ∈ R^K, and observes y_t = X_{π_t}^⊤(β + ξ_t), where ξ_t ∼ N(0, Σ). Variants of that problem with covariates or contexts [29] can also be considered: assume for instance that μ_i(·) and σ_i²(·) are regular functions of a covariate ω ∈ R^d, and that the objective is to estimate all the functions μ_i(·).

References

[1] A. Agarwal, D. P. Foster, D. Hsu, S. Kakade, and A. Rakhlin. Stochastic convex optimization with bandit feedback. In Proceedings of the 24th International Conference on Neural Information Processing Systems, 2011.

[2] S. Agrawal and N. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, EC '14, pages 989–1006, New York, NY, USA, 2014. ACM.

[3] S. Agrawal, N. Devanur, and L. Li. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. Proceedings of the Annual Conference on Learning Theory (COLT), 2016.

[4] J.-Y. Audibert and S. Bubeck.
Minimax policies for adversarial and stochastic bandits. Proceedings of the Annual Conference on Learning Theory (COLT), 2009.

[5] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[6] F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Adv. NIPS, 2013.

[7] F. Bach and V. Perchet. Highly-smooth zero-th order online optimization. COLT 2016, 2016.

[8] Q. Berthet and V. Chandrasekaran. Resource allocation for statistical estimation. Proceedings of the IEEE, 104(1):115–125, 2016.

[9] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Machine Learning, 5(1):1–122, 2012.

[10] S. Bubeck, R. Eldan, and Y. T. Lee. Kernel-based methods for bandit convex optimization. CoRR, abs/1607.03084, 2016.

[11] A. Carpentier, A. Lazaric, M. Ghavamzadeh, R. Munos, and A. Antos. Upper-confidence-bound algorithms for active learning in multi-armed bandits. Preprint, 2015.

[12] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[13] N. Cesa-Bianchi, G. Lugosi, and G. Stoltz. Regret minimization under partial monitoring. Math. Oper. Res., 31(3):562–580, August 2006.

[14] J. Dippon. Accelerated randomized stochastic optimization. Ann. Statist., 31(4):1260–1281, 08 2003.

[15] J. Duchi and F. Ruan. Local asymptotics for some stochastic optimization problems: Optimality, constraint identification, and dual averaging. Arxiv Preprint, 2016.

[16] E. Even-Dar, R. Kleinberg, S. Mannor, and Y. Mansour. Online learning for global cost functions. In Proceedings of COLT, 2009.

[17] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Res. Logis. Quart., 3:95–110, 1956.

[18] E. Hazan, T. Koren, and K.
Levy. Logistic regression: Tight bounds for stochastic and online optimization. In Proc. Conference On Learning Theory (COLT), 2014.

[19] E. Hazan and K. Levy. Bandit convex optimization: Towards tight bounds. In Adv. NIPS, 2014.

[20] E. Hazan. The convex optimization approach to regret minimization. Optimization for machine learning, pages 287–303, 2012.

[21] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Mach. Learn., 69(2-3):169–192, 2007.

[22] M. Jaggi. Sparse Convex Optimization Methods for Machine Learning. PhD thesis, ETH Zurich, 2011.

[23] K. Jamieson, R. Nowak, and B. Recht. Query complexity of derivative-free optimization. Advances in Neural Information Processing Systems, 2012.

[24] S. Lacoste-Julien and M. Jaggi. An affine invariant linear convergence analysis for Frank-Wolfe algorithms. NIPS 2013, 2013.

[25] J. Lafond, H.-T. Wai, and E. Moulines. On the online Frank-Wolfe algorithms for convex and non-convex optimizations. Arxiv Preprint, 2015.

[26] S. Mannor, V. Perchet, and G. Stoltz. Approachability in unknown games: Online learning meets multi-objective optimization. In Proceedings of COLT, 2014.

[27] A. Mas-Colell, M. D. Whinston, and J. Green. Microeconomic theory. Oxford University Press, New York, 1995.

[28] Y. Nesterov. Introductory Lectures on Convex Optimization. Springer, 2003.

[29] V. Perchet and P. Rigollet. The multi-armed bandit problem with covariates. Ann. Statist., 41:693–721, 2013.

[30] V. Perchet, P. Rigollet, S. Chassang, and E. Snowberg. Batched bandit problems. Ann. Statist., 44(2):660–681, 04 2016.

[31] B. T. Polyak and A. B. Tsybakov. Optimal order of accuracy of search algorithms in stochastic optimization. Problemy Peredachi Informatsii, 26(2):45–53, 1990.

[32] A. Rakhlin, K. Sridharan, and A. Tewari. Online learning: Beyond regret. In Sham M.
Kakade and Ulrike von Luxburg, editors, Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 559–594, Budapest, Hungary, 09–11 Jun 2011. PMLR.

[33] A. Saha and A. Tewari. Improved regret guarantees for online smooth convex optimization with bandit feedback. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

[34] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.

[35] O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Proc. Conference on Learning Theory, 2013.