{"title": "Optimistic Bandit Convex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2297, "page_last": 2305, "abstract": "We introduce the general and powerful scheme of predicting information re-use in optimization algorithms. This allows us to devise a computationally efficient algorithm for bandit convex optimization with new state-of-the-art guarantees for both Lipschitz loss functions and loss functions with Lipschitz gradients. This is the first algorithm admitting both a polynomial time complexity and a regret that is polynomial in the dimension of the action space that improves upon the original regret bound for Lipschitz loss functions, achieving a regret of $\\widetilde O(T^{11/16}d^{3/8})$. Our algorithm further improves upon the best existing polynomial-in-dimension bound (both computationally and in terms of regret) for loss functions with Lipschitz gradients, achieving a regret of $\\widetilde O(T^{8/13} d^{5/3})$.", "full_text": "Optimistic Bandit Convex Optimization\n\nMehryar Mohri\nCourant Institute and Google\n251 Mercer Street\nNew York, NY 10012\nmohri@cims.nyu.edu\n\nScott Yang\nCourant Institute\n251 Mercer Street\nNew York, NY 10012\nyangs@cims.nyu.edu\n\nAbstract\n\nWe introduce the general and powerful scheme of predicting information re-use in optimization algorithms. This allows us to devise a computationally efficient algorithm for bandit convex optimization with new state-of-the-art guarantees for both Lipschitz loss functions and loss functions with Lipschitz gradients. 
This is the first algorithm admitting both a polynomial time complexity and a regret that is polynomial in the dimension of the action space that improves upon the original regret bound for Lipschitz loss functions, achieving a regret of $\\widetilde O(T^{11/16}d^{3/8})$. Our algorithm further improves upon the best existing polynomial-in-dimension bound (both computationally and in terms of regret) for loss functions with Lipschitz gradients, achieving a regret of $\\widetilde O(T^{8/13}d^{5/3})$.\n\n1 Introduction\n\nBandit convex optimization (BCO) is a key framework for modeling learning problems with sequential data under partial feedback. In the BCO scenario, at each round, the learner selects a point (or action) in a bounded convex set and observes the value at that point of a convex loss function determined by an adversary. The feedback received is limited to that information: no gradient or any other higher-order information about the function is provided to the learner. The learner's objective is to minimize his regret, that is, the difference between his cumulative loss over a finite number of rounds and the loss of the best fixed action in hindsight.\n\nThe limited feedback makes the BCO setup relevant to a number of applications, including online advertising. On the other hand, it also makes the problem notoriously difficult and requires the learner to find a careful trade-off between exploration and exploitation. While it has been the subject of extensive study in recent years, the fundamental BCO problem remains one of the most challenging scenarios in machine learning, where several questions concerning optimality guarantees remain open.\n\nThe original work of Flaxman et al. [2005] showed that a regret of $\\widetilde O(T^{5/6})$ is achievable for bounded loss functions and of $\\widetilde O(T^{3/4})$ for Lipschitz loss functions (the latter bound is also given in [Kleinberg, 2004]), both of which are still the best known results given by explicit algorithms. Agarwal et al. [2010] introduced an algorithm that maintains a regret of $\\widetilde O(T^{2/3})$ for loss functions that are both Lipschitz and strongly convex, which is also still state-of-the-art. For functions that are Lipschitz and also admit Lipschitz gradients, Saha and Tewari [2011] designed an algorithm with a regret of $\\widetilde O(T^{2/3})$, a result that was recently improved to $\\widetilde O(T^{5/8})$ by Dekel et al. [2015].\n\nHere, we further improve upon these bounds both in the Lipschitz and Lipschitz gradient settings. By incorporating the novel and powerful idea of predicting information re-use, we introduce an algorithm with a regret bound of $\\widetilde O(T^{11/16})$ for Lipschitz loss functions. Similarly, our algorithm also achieves the best regret guarantee among computationally tractable algorithms for loss functions with Lipschitz gradients: $\\widetilde O(T^{8/13})$. Both bounds admit a relatively mild dependency on the dimension of the action space.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nWe note that the recent remarkable work by [Bubeck et al., 2015, Bubeck and Eldan, 2015] has proven the existence of algorithms that can attain a regret of $\\widetilde O(T^{1/2})$, which matches the known lower bound $\\Omega(T^{1/2})$ given by Dani et al. Thus, the dependency of our bounds with respect to $T$ is not optimal. Furthermore, two recent unpublished manuscripts, [Hazan and Li, 2016] and [Bubeck et al., 2016], present algorithms achieving regret $\\widetilde O(T^{1/2})$. These results, once verified, would be ground-breaking contributions to the literature. However, unlike our algorithms, the regret bound for both of these algorithms admits a large dependency on the dimension $d$ of the action space: exponential for [Hazan and Li, 2016], $d^{O(9.5)}$ for [Bubeck et al., 2016]. 
One hope is that the novel ideas introduced by Hazan and Li [2016] (the application of the ellipsoid method with a restart button and lower convex envelopes) or those by Bubeck et al. [2016] (which also make use of the restart idea but introduce a very original kernel method) could be combined with those presented in this paper to derive algorithms with the most favorable guarantees with respect to both $T$ and $d$.\n\nWe begin by formally introducing our notation and setup. We then highlight some of the essential ideas in previous work before introducing our new key insight. Next, we give a detailed description of our algorithm, for which we prove theoretical guarantees in several settings.\n\n2 Preliminaries\n\n2.1 BCO scenario\n\nThe scenario of bandit convex optimization, which dates back to [Flaxman et al., 2005], is a sequential prediction problem on a convex compact domain $K \\subset \\mathbb{R}^d$. At each round $t \\in [1, T]$, the learner selects a (possibly) randomized action $x_t \\in K$ and incurs the loss $f_t(x_t)$ based on a convex function $f_t \\colon K \\to \\mathbb{R}$ chosen by the adversary. We assume that the adversary is oblivious, so that the loss functions are independent of the player's actions. The objective of the learner is to minimize his regret with respect to the optimal static action in hindsight, that is, if we denote by $\\mathcal{A}$ the learner's randomized algorithm, the following quantity:\n\n$$\\mathrm{Reg}_T(\\mathcal{A}) = \\mathbb{E}\\Big[\\sum_{t=1}^T f_t(x_t)\\Big] - \\min_{x \\in K} \\sum_{t=1}^T f_t(x). \\quad (1)$$\n\nWe will denote by $D$ the diameter of the action space $K$ in the Euclidean norm: $D = \\sup_{x, y \\in K} \\|x - y\\|_2$. Throughout this paper, we will often use different induced norms. We will denote by $\\|\\cdot\\|_A$ the norm induced by a symmetric positive definite (SPD) matrix $A \\succ 0$, defined for all $x \\in \\mathbb{R}^d$ by $\\|x\\|_A = \\sqrt{x^\\top A x}$. Moreover, we will denote by $\\|\\cdot\\|_{A,*}$ its dual norm, given by $\\|\\cdot\\|_{A^{-1}}$. To simplify the notation, we will write $\\|\\cdot\\|_x$ instead of $\\|\\cdot\\|_{\\nabla^2 R(x)}$ when the convex and twice differentiable function $R \\colon \\mathrm{int}(K) \\to \\mathbb{R}$ is clear from the context. Here, $\\mathrm{int}(K)$ is the interior of the set $K$.\n\nWe will consider different levels of regularity for the functions $f_t$ selected by the adversary. We will always assume that they are uniformly bounded by some constant $C > 0$, that is, $|f_t(x)| \\leq C$ for all $t \\in [1, T]$ and $x \\in K$, and, by shifting the loss functions upwards by at most $C$, we will also assume, without loss of generality, that they are non-negative: $f_t \\geq 0$ for all $t \\in [1, T]$. Moreover, we will always assume that $f_t$ is Lipschitz on $K$ (henceforth denoted $C^{0,1}(K)$):\n\n$$\\forall t \\in [1, T], \\forall x, y \\in K, \\quad |f_t(x) - f_t(y)| \\leq L \\|x - y\\|_2.$$\n\nIn some instances, we will further assume that the functions admit $H$-Lipschitz gradients on the interior of the domain (henceforth denoted $C^{1,1}(\\mathrm{int}(K))$):\n\n$$\\exists H > 0 \\colon \\forall t \\in [1, T], \\forall x, y \\in \\mathrm{int}(K), \\quad \\|\\nabla f_t(x) - \\nabla f_t(y)\\|_2 \\leq H \\|x - y\\|_2.$$\n\nSince $f_t$ is convex, it admits a subgradient at any point in $K$. We denote by $g_t$ one element of the subgradient at the point $x_t \\in K$ selected by the learner at round $t$. When the losses are $C^{1,1}$, the only element of the subgradient is the gradient, and $g_t = \\nabla f_t(x_t)$. We will use the shorthand $v_{1:t} = \\sum_{s=1}^t v_s$ to denote the sum of $t$ vectors $v_1, \\ldots, v_t$. In particular, $g_{1:t}$ will denote the sum of the subgradients $g_s$ for $s \\in [1, t]$.\n\nLastly, we will denote by $B_1(0) = \\{x \\in \\mathbb{R}^d \\colon \\|x\\|_2 \\leq 1\\} \\subset \\mathbb{R}^d$ the $d$-dimensional Euclidean ball of radius one and by $\\partial B_1(0)$ the unit sphere.\n\n2.2 Follow-the-regularized-leader template\n\nA standard algorithm in online learning, both in the bandit and the full-information setting, is the follow-the-regularized-leader (FTRL) algorithm. At each round, the algorithm selects the action that minimizes the cumulative linearized loss augmented with a regularization term $R \\colon K \\to \\mathbb{R}$. 
Thus, the action $x_{t+1}$ is defined as follows:\n\n$$x_{t+1} = \\mathrm{argmin}_{x \\in K} \\; \\eta g_{1:t}^\\top x + R(x),$$\n\nwhere $\\eta > 0$ is a learning rate that determines the trade-off between greedy optimization and regularization.\n\nIf we had access to the subgradients at each round, then FTRL with $R(x) = \\|x\\|_2^2$ and $\\eta = \\frac{1}{\\sqrt{T}}$ would yield a regret of $O(\\sqrt{dT})$, which is known to be optimal. But, since we only have access to the loss function values $f_t(x_t)$ and since the loss functions change at each round, a more refined strategy is needed.\n\n2.2.1 One-point gradient estimates and surrogate losses\n\nOne key insight into the bandit convex optimization problem, due to Flaxman et al. [2005], is that the subgradient of a smoothed version of the loss function can be estimated by sampling and rescaling around the point the algorithm originally intended to play.\n\nLemma 1 ([Flaxman et al., 2005, Saha and Tewari, 2011]). Let $f \\colon K \\to \\mathbb{R}$ be an arbitrary function (not necessarily differentiable) and let $U(\\partial B_1(0))$ denote the uniform distribution over the unit sphere. Then, for any $\\delta > 0$ and any SPD matrix $A \\succ 0$, the function $\\hat f$ defined for all $x \\in K$ by $\\hat f(x) = \\mathbb{E}_{u \\sim U(\\partial B_1(0))}[f(x + \\delta A u)]$ is differentiable over $\\mathrm{int}(K)$ and, for any $x \\in \\mathrm{int}(K)$, $\\hat g = \\frac{d}{\\delta} f(x + \\delta A u) A^{-1} u$ is an unbiased estimate of $\\nabla \\hat f(x)$:\n\n$$\\mathbb{E}_{u \\sim U(\\partial B_1(0))}\\Big[\\frac{d}{\\delta} f(x + \\delta A u) A^{-1} u\\Big] = \\nabla \\hat f(x).$$\n\nThe result shows that if at each round $t$ we sample $u_t \\sim U(\\partial B_1(0))$, define an SPD matrix $A_t$ and play the point $y_t = x_t + \\delta A_t u_t$ (assuming that $y_t \\in K$), then $\\hat g_t = \\frac{d}{\\delta} f_t(y_t) A_t^{-1} u_t$ is an unbiased estimate of the gradient of $\\hat f_t$ at the point $x_t$ originally intended: $\\mathbb{E}[\\hat g_t] = \\nabla \\hat f_t(x_t)$. Thus, we can use FTRL with these smoothed gradient estimates, $x_{t+1} = \\mathrm{argmin}_{x \\in K} \\eta \\hat g_{1:t}^\\top x + R(x)$, at the cost of the approximation error from $f_t$ to $\\hat f_t$. Furthermore, the norm of these gradient estimates can be bounded as follows:\n\nLemma 2. Let $\\delta > 0$, $u_t \\in \\partial B_1(0)$ and $A_t \\succ 0$. Then the norm of $\\hat g_t = \\frac{d}{\\delta} f_t(x_t + \\delta A_t u_t) A_t^{-1} u_t$ can be bounded as follows: $\\|\\hat g_t\\|_{A_t^2}^2 \\leq \\frac{d^2}{\\delta^2} C^2$.\n\nProof. Since $f_t$ is bounded by $C$, we can write $\\|\\hat g_t\\|_{A_t^2}^2 \\leq \\frac{d^2}{\\delta^2} C^2 u_t^\\top A_t^{-1} A_t^2 A_t^{-1} u_t \\leq \\frac{d^2}{\\delta^2} C^2$.\n\nThis gives us a bound on the Lipschitz constant of $\\hat f_t$ in terms of $d$, $\\delta$, and $C$.\n\n2.2.2 Self-concordant barrier as regularization\n\nWhen sampling to derive a gradient estimate, we need to ensure that the point sampled lies within the feasible set $K$. A second key idea in the BCO problem, due to Abernethy et al. [2008], is to design ellipsoids that are always contained in the feasible set. This is done by using tools from the theory of interior-point methods in convex optimization.\n\nDefinition 1 (Definition 2.3.1 [Nesterov and Nemirovskii, 1994]). Let $K \\subset \\mathbb{R}^d$ be closed convex, and let $\\nu \\geq 0$. A $C^3$ function $R \\colon \\mathrm{int}(K) \\to \\mathbb{R}$ is a $\\nu$-self-concordant barrier for $K$ if for any sequence $(z_s)_{s=1}^\\infty$ with $z_s \\to \\partial K$ we have $R(z_s) \\to \\infty$, and if for all $x \\in \\mathrm{int}(K)$ and $y \\in \\mathbb{R}^d$ the following inequalities hold:\n\n$$|\\nabla^3 R(x)[y, y, y]| \\leq 2 \\|y\\|_x^3, \\qquad |\\nabla R(x)^\\top y| \\leq \\nu^{1/2} \\|y\\|_x.$$\n\nSince self-concordant barriers are preserved under translation, we will always assume for convenience that $\\min_{x \\in K} R(x) = 0$.\n\nNesterov and Nemirovskii [1994] show that any $d$-dimensional closed convex set admits an $O(d)$-self-concordant barrier. This allows us to always choose a self-concordant barrier as regularization. We will use several other key properties of self-concordant barriers in this work, all of which are stated precisely in Appendix 7.1.\n\n3 Previous work\n\nThe original paper by Flaxman et al. [2005] sampled indiscriminately around spheres and projected back onto the feasible set at each round. This yielded a regret of $\\widetilde O(T^{3/4})$ for $C^{0,1}$ loss functions. The follow-up work of Saha and Tewari [2011] showed that for $C^{1,1}$ loss functions, one can run FTRL with a self-concordant barrier as regularization and sample around the Dikin ellipsoid to attain an improved regret bound of $\\widetilde O(T^{2/3})$.\n\nMore recently, Dekel et al. [2015] showed that by averaging the smoothed gradient estimates and still using the self-concordant barrier as regularization, one can achieve a regret of $\\widetilde O(T^{5/8})$. Specifically, denote by $\\bar g_t = \\frac{1}{k+1} \\sum_{i=0}^k \\hat g_{t-i}$ the average of the past $k+1$ incurred gradients, where $\\hat g_{t-i} = 0$ for $t - i \\leq 0$. Then we can play FTRL on these averaged smoothed gradient estimates, $x_{t+1} = \\mathrm{argmin}_{x \\in K} \\eta \\bar g_{1:t}^\\top x + R(x)$, to attain the better guarantee.\n\nAbernethy and Rakhlin [2009] derive a generic estimate for FTRL algorithms with self-concordant barriers as regularization:\n\nLemma 3 ([Abernethy and Rakhlin, 2009], Theorems 2.2-2.3). Let $K$ be a closed convex set in $\\mathbb{R}^d$ and let $R$ be a $\\nu$-self-concordant barrier for $K$. Let $\\{g_t\\}_{t=1}^T \\subset \\mathbb{R}^d$ and $\\eta > 0$ be such that $\\eta \\|g_t\\|_{x_t,*} \\leq 1/4$ for all $t \\in [1, T]$. Then, the FTRL update $x_{t+1} = \\mathrm{argmin}_{x \\in K} \\eta g_{1:t}^\\top x + R(x)$ admits the following guarantees:\n\n$$\\|x_t - x_{t+1}\\|_{x_t} \\leq 2 \\eta \\|g_t\\|_{x_t,*}, \\qquad \\forall x \\in K, \\; \\sum_{t=1}^T g_t^\\top (x_t - x) \\leq 2 \\eta \\sum_{t=1}^T \\|g_t\\|_{x_t,*}^2 + \\frac{1}{\\eta} R(x).$$\n\nBy Lemma 2, if we use FTRL with smoothed gradients, then the upper bound in this lemma can be further bounded by\n\n$$2 \\eta \\sum_{t=1}^T \\|\\hat g_t\\|_{x_t,*}^2 + \\frac{1}{\\eta} R(x) \\leq 2 \\eta T \\frac{C^2 d^2}{\\delta^2} + \\frac{1}{\\eta} R(x).$$\n\nFurthermore, the regret is then bounded by the sum of this upper bound and the cost of approximating $f_t$ with $\\hat f_t$. On the other hand, Dekel et al. [2015] showed that if we use FTRL with averaged smoothed gradients instead, then the upper bound in this lemma can be bounded as\n\n$$2 \\eta \\sum_{t=1}^T \\|\\bar g_t\\|_{x_t,*}^2 + \\frac{1}{\\eta} R(x) \\leq 2 \\eta T \\Big( \\frac{32 C^2 d^2}{\\delta^2 (k+1)} + 2 D^2 L^2 \\Big) + \\frac{1}{\\eta} R(x).$$\n\nThe extra factor $(k+1)$ in the denominator, at the cost of now approximating $f_t$ with $\\bar f_t$, is what contributes to their improved regret result.\n\nIn general, finding surrogate losses that can both be approximated accurately and admit only a mild variance is a delicate task, and it is not clear how the constructions presented above can be improved.\n\n4 Algorithm\n\n4.1 Predicting the predictable\n\nRather than designing a newer and better surrogate loss, our strategy will be to exploit the structure of the current state-of-the-art method. Specifically, we draw upon the technique of predictable sequences from [Rakhlin and Sridharan, 2013]. The idea here is to allow the learner to preemptively \u201cguess\u201d the gradient at the next step and optimize for it in the FTRL update. Specifically, if $\\tilde g_{t+1}$ is an estimate of the time $t+1$ gradient $g_{t+1}$ based on information up to time $t$, then the learner should play:\n\n$$x_{t+1} = \\mathrm{argmin}_{x \\in K} \\; \\eta (g_{1:t} + \\tilde g_{t+1})^\\top x + R(x).$$\n\nThis optimistic FTRL algorithm admits the following guarantee:\n\nLemma 4 (Lemma 1 [Rakhlin and Sridharan, 2013]). Let $K$ be a closed convex set in $\\mathbb{R}^d$, and let $R$ be a $\\nu$-self-concordant barrier for $K$. Let $\\{g_t\\}_{t=1}^T \\subset \\mathbb{R}^d$ and $\\eta > 0$ be such that $\\eta \\|g_t - \\tilde g_t\\|_{x_t,*} \\leq 1/4$ for all $t \\in [1, T]$. Then the FTRL update $x_{t+1} = \\mathrm{argmin}_{x \\in K} \\eta (g_{1:t} + \\tilde g_{t+1})^\\top x + R(x)$ admits the following guarantee:\n\n$$\\forall x \\in K, \\quad \\sum_{t=1}^T g_t^\\top (x_t - x) \\leq 2 \\eta \\sum_{t=1}^T \\|g_t - \\tilde g_t\\|_{x_t,*}^2 + \\frac{1}{\\eta} R(x).$$\n\nIn general, it is not clear what would be a good prediction candidate. Indeed, this is why Rakhlin and Sridharan [2013] called this algorithm an \u201coptimistic\u201d FTRL. 
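To make the optimistic update concrete, the following is a minimal sketch of one FTRL step with a gradient hint, on the interval $K = [-1, 1]$ with the log-barrier $R(x) = -\log(1-x) - \log(1+x)$ and a grid search standing in for the exact argmin. The one-dimensional domain, the barrier choice, and the grid-search solver are all illustrative assumptions, not part of the paper's general setting.

```python
import math

def barrier(x):
    """Log-barrier for the interval (-1, 1); a 2-self-concordant barrier."""
    return -math.log(1.0 - x) - math.log(1.0 + x)

def optimistic_ftrl_step(g_cum, hint, eta, grid_size=2001):
    """x_{t+1} = argmin_{x in K} eta * (g_{1:t} + hint) * x + R(x),
    solved approximately by scanning a grid strictly inside (-1, 1)."""
    best_x, best_val = 0.0, float("inf")
    for i in range(1, grid_size - 1):  # skip the endpoints, where R blows up
        x = -1.0 + 2.0 * i / (grid_size - 1)
        val = eta * (g_cum + hint) * x + barrier(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x
```

With a zero cumulative gradient and zero hint the step returns the barrier's minimizer at the center; a positive combined gradient pushes the iterate left, and only the sum `g_cum + hint` matters, exactly as in the update above.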
However, notice that if we elect to play the averaged smoothed losses as in [Dekel et al., 2015], then the update at each time is $\\bar g_t = \\frac{1}{k+1} \\sum_{i=0}^k \\hat g_{t-i}$. This implies that the time $t+1$ gradient is $\\bar g_{t+1} = \\frac{1}{k+1} \\sum_{i=0}^k \\hat g_{t+1-i}$, which includes the smoothed gradients from time $t+1$ down to time $t-(k-1)$. The key insight here is that at time $t$, all but the $(t+1)$-th gradient are known! This means that if we predict\n\n$$\\tilde g_{t+1} = \\frac{1}{k+1} \\sum_{i=1}^k \\hat g_{t+1-i},$$\n\nthen the first term in the bound of Lemma 4 will be in terms of\n\n$$\\bar g_t - \\tilde g_t = \\frac{1}{k+1} \\sum_{i=0}^k \\hat g_{t-i} - \\frac{1}{k+1} \\sum_{i=1}^k \\hat g_{t-i} = \\frac{1}{k+1} \\hat g_t.$$\n\nIn other words, all but the time $t$ smoothed gradient will cancel out. Essentially, we are predicting the predictable portion of the averaged gradient and guaranteeing that the optimism will pay off. Moreover, where we gained a factor of $\\frac{1}{k+1}$ in the averaged loss case, we should expect to gain a factor of $\\frac{1}{(k+1)^2}$ by using this optimistic prediction.\n\nNote that this technique of optimistically predicting the variance reduction is widely applicable. As alluded to with the reference to [Schmidt et al., 2013], many variance reduction-type techniques, particularly in stochastic optimization, use historical information in their estimates (e.g. SVRG [Johnson and Zhang, 2013], SAGA [Defazio et al., 2014]). In these cases, it is possible to \u201cpredict\u201d the information re-use and improve the convergence rates of each algorithm.\n\n4.2 Description and pseudocode\n\nHere, we give a detailed description of our algorithm, OPTIMISTICBCO. At each round $t$, the algorithm uses a sample $u_t$ from the uniform distribution over the unit sphere to define an unbiased estimate of the gradient of $\\hat f_t$, a smoothed version of the loss function $f_t$, as described in Section 2.2.1: $\\hat g_t \\leftarrow \\frac{d}{\\delta} f_t(y_t) (\\nabla^2 R(x_t))^{1/2} u_t$. Next, the trailing average of these unbiased estimates over a fixed window of length $k+1$ is computed: $\\bar g_t = \\frac{1}{k+1} \\sum_{i=0}^k \\hat g_{t-i}$. The remaining steps executed at each round coincide with the Follow-the-Regularized-Leader update with a self-concordant barrier used as a regularizer, augmented with an optimistic prediction of the next round's trailing average. As described in Section 4.1, all but one of the terms in the trailing average are known and we predict their occurrence:\n\n$$\\tilde g_{t+1} = \\frac{1}{k+1} \\sum_{i=1}^k \\hat g_{t+1-i}, \\qquad x_{t+1} = \\mathrm{argmin}_{x \\in K} \\; \\eta (\\bar g_{1:t} + \\tilde g_{t+1})^\\top x + R(x).$$\n\nNote that Theorem 3 implies that the actual point we play, $y_t$, is always a feasible point in $K$. Figure 1 presents the pseudocode of the algorithm.\n\nOPTIMISTICBCO($R, \\delta, \\eta, k, x_1$)\n 1  for $t \\leftarrow 1$ to $T$ do\n 2    $u_t \\leftarrow$ SAMPLE($U(\\partial B_1(0))$)\n 3    $y_t \\leftarrow x_t + \\delta (\\nabla^2 R(x_t))^{-1/2} u_t$\n 4    PLAY($y_t$)\n 5    $f_t(y_t) \\leftarrow$ RECEIVELOSS($y_t$)\n 6    $\\hat g_t \\leftarrow \\frac{d}{\\delta} f_t(y_t) (\\nabla^2 R(x_t))^{1/2} u_t$\n 7    $\\bar g_t \\leftarrow \\frac{1}{k+1} \\sum_{i=0}^k \\hat g_{t-i}$\n 8    $\\tilde g_{t+1} \\leftarrow \\frac{1}{k+1} \\sum_{i=1}^k \\hat g_{t+1-i}$\n 9    $x_{t+1} \\leftarrow \\mathrm{argmin}_{x \\in K} \\; \\eta (\\bar g_{1:t} + \\tilde g_{t+1})^\\top x + R(x)$\n10  return $\\sum_{t=1}^T f_t(y_t)$\n\nFigure 1: Pseudocode of OPTIMISTICBCO, with $R \\colon \\mathrm{int}(K) \\to \\mathbb{R}$, $\\delta \\in (0, 1]$, $\\eta > 0$, $k \\in \\mathbb{Z}$, and $x_1 \\in K$.\n\n5 Regret guarantees\n\nIn this section, we state our main results, which are regret guarantees for OPTIMISTICBCO in the $C^{0,1}$ and $C^{1,1}$ cases. We also highlight the analysis and proofs for each regime.\n\n5.1 Main results\n\nThe following is our main result for the $C^{0,1}$ case.\n\nTheorem 1 ($C^{0,1}$ Regret). Let $K \\subset \\mathbb{R}^d$ be a convex set with diameter $D$ and $(f_t)_{t=1}^T$ a sequence of loss functions with each $f_t \\colon K \\to \\mathbb{R}_+$ $C$-bounded and $L$-Lipschitz. Let $R$ be a $\\nu$-self-concordant barrier for $K$. 
Then, for $\\eta k \\leq \\frac{\\delta}{12 C d}$, the regret of OPTIMISTICBCO can be bounded as follows:\n\n$$\\mathrm{Reg}_T(\\mathrm{O{\\scriptstyle PTIMISTIC}BCO}) \\leq \\epsilon L T + \\delta L D T + \\frac{Ck}{2} + \\frac{2 C d^2 \\eta T}{\\delta^2 (k+1)^2} + \\frac{\\nu}{\\eta} \\log(1/\\epsilon) + L T \\cdot 2 \\eta D \\Big[ \\sqrt{3} L \\delta^{1/2} + \\sqrt{2 D L}\\, k + \\frac{\\sqrt{48}\\, C d \\sqrt{k}}{\\delta} \\Big].$$\n\nIn particular, for $\\eta = T^{-11/16} d^{-3/8}$, $\\delta = T^{-5/16} d^{3/8}$, $k = T^{1/8} d^{1/4}$, the following guarantee holds for the regret of the algorithm:\n\n$$\\mathrm{Reg}_T(\\mathrm{O{\\scriptstyle PTIMISTIC}BCO}) = \\widetilde O\\big(T^{11/16} d^{3/8}\\big).$$\n\nThe above result is the first improvement on the regret of Lipschitz losses in terms of $T$ since the original algorithm of Flaxman et al. [2005] that is realizable from a concrete algorithm as well as polynomial in both dimension and time (both computationally and in terms of regret).\n\nTheorem 2 ($C^{1,1}$ Bound). Let $K \\subset \\mathbb{R}^d$ be a convex set with diameter $D$ and $(f_t)_{t=1}^T$ a sequence of loss functions with each $f_t \\colon K \\to \\mathbb{R}_+$ $C$-bounded, $L$-Lipschitz and $H$-smooth. Let $R$ be a $\\nu$-self-concordant barrier for $K$. Then, for $\\eta k \\leq \\frac{\\delta}{12 d}$, the regret of OPTIMISTICBCO can be bounded as follows:\n\n$$\\mathrm{Reg}_T(\\mathrm{O{\\scriptstyle PTIMISTIC}BCO}) \\leq \\epsilon L T + H \\delta^2 D^2 T + \\frac{\\nu}{\\eta} \\log(1/\\epsilon) + Ck + \\frac{\\eta d^2 T}{\\delta^2 (k+1)^2} + (T L + \\delta D H T) \\cdot 2 \\eta k D \\Big[ \\sqrt{3} L \\delta^{1/2} + \\sqrt{2 D L} + \\frac{\\sqrt{48}\\, d}{\\delta \\sqrt{k}} \\Big].$$\n\nIn particular, for $\\eta = T^{-8/13} d^{-5/6}$, $\\delta = T^{-5/26} d^{1/3}$, $k = T^{1/13} d^{5/3}$, the following guarantee holds for the regret of the algorithm:\n\n$$\\mathrm{Reg}_T(\\mathrm{O{\\scriptstyle PTIMISTIC}BCO}) = \\widetilde O\\big(T^{8/13} d^{5/3}\\big).$$\n\nThis result is currently the best polynomial-in-time regret bound that is also polynomial in the dimension of the action space (both computationally and in terms of regret). It improves upon the work of Saha and Tewari [2011] and Dekel et al. [2015].\n\nWe now explain the analysis of both results, starting with Theorem 1 for $C^{0,1}$ losses.\n\n5.2 $C^{0,1}$ analysis\n\nOur analysis proceeds in two steps. We first modularize the cost of approximating the original losses $f_t(y_t)$ incurred with the averaged smoothed losses that we treat as surrogate losses. 
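For concreteness, the full update loop of Figure 1 can be sketched in one dimension. Here $K = [-1, 1]$, $d = 1$, the barrier is the log-barrier, the FTRL argmin is replaced by a grid search, and the toy loss is quadratic; all of these concrete choices are illustrative assumptions, not the paper's general construction.

```python
import math
import random

def R(x):
    """Log-barrier for (-1, 1)."""
    return -math.log(1.0 - x) - math.log(1.0 + x)

def hess_R(x):
    return 1.0 / (1.0 - x) ** 2 + 1.0 / (1.0 + x) ** 2

def optimistic_bco(f, T, eta, delta, k, seed=0):
    """One-dimensional sketch of the OPTIMISTICBCO loop (Figure 1)."""
    rng = random.Random(seed)
    grid = [-0.999 + 1.998 * i / 2000 for i in range(2001)]
    g_hat, g_bar_cum, x = [], 0.0, 0.0
    ys = []
    for t in range(T):
        u = rng.choice([-1.0, 1.0])                # the unit sphere in R^1
        y = x + delta * hess_R(x) ** -0.5 * u      # sample in the Dikin ellipsoid
        ys.append(y)
        loss = f(y)                                # bandit feedback only
        g_hat.append((1.0 / delta) * loss * hess_R(x) ** 0.5 * u)  # d = 1
        # trailing average g_bar_t (zero-padded window) and hint g_tilde_{t+1}
        g_bar_cum += sum(g_hat[max(0, t - k):t + 1]) / (k + 1)
        hint = sum(g_hat[max(0, t - k + 1):t + 1]) / (k + 1)
        # optimistic FTRL step, argmin approximated on a fixed grid
        x = min(grid, key=lambda z: eta * (g_bar_cum + hint) * z + R(z))
    return ys
```

Running this on, say, `f = lambda y: (y - 0.5) ** 2` keeps every played point $y_t$ strictly inside the domain, since the Dikin-ellipsoid step with $\delta \leq 1$ never leaves $K$.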
Then we show that the algorithm minimizes the regret against the surrogate losses effectively. The proofs of all lemmas in this section are presented in Appendix 7.2.\n\nLemma 5 ($C^{0,1}$ structural bound on true losses in terms of smoothed losses). Let $(f_t)_{t=1}^T$ be a sequence of loss functions, and assume that $f_t \\colon K \\to \\mathbb{R}_+$ is $C$-bounded and $L$-Lipschitz, where $K \\subset \\mathbb{R}^d$. Denote\n\n$$\\hat f_t(x) = \\mathbb{E}_{u \\sim U(\\partial B_1(0))}[f_t(x + \\delta A_t u)], \\qquad \\hat g_t = \\frac{d}{\\delta} f_t(y_t) A_t^{-1} u_t, \\qquad y_t = x_t + \\delta A_t u_t,$$\n\nfor arbitrary $A_t$, $\\delta$, and $u_t$. Let $x^* = \\mathrm{argmin}_{x \\in K} \\sum_{t=1}^T f_t(x)$, and let $x^*_\\epsilon \\in \\mathrm{argmin}_{y \\in K, \\mathrm{dist}(y, \\partial K) > \\epsilon} \\|y - x^*\\|$. Assume that we play $y_t$ at every round. Then the following structural estimate holds:\n\n$$\\mathrm{Reg}_T(\\mathcal{A}) = \\mathbb{E}\\Big[\\sum_{t=1}^T f_t(y_t) - f_t(x^*)\\Big] \\leq \\epsilon L T + 2 \\delta L D T + \\sum_{t=1}^T \\mathbb{E}[\\hat f_t(x_t) - \\hat f_t(x^*_\\epsilon)].$$\n\nThus, at the price of $\\epsilon L T + 2 \\delta L D T$, it suffices to look at the performance of the averaged losses for the algorithm. Notice that the only assumptions we have made so far are that we play points sampled on an ellipsoid around the desired point scaled by $\\delta$ and that the loss functions are Lipschitz.\n\nLemma 6 ($C^{0,1}$ structural bound on smoothed losses in terms of averaged losses). Let $(f_t)_{t=1}^T$ be a sequence of loss functions, and assume that $f_t \\colon K \\to \\mathbb{R}_+$ is $C$-bounded and $L$-Lipschitz, where $K \\subset \\mathbb{R}^d$. Denote\n\n$$\\hat f_t(x) = \\mathbb{E}_{u \\sim U(\\partial B_1(0))}[f_t(x + \\delta A_t u)], \\qquad \\hat g_t = \\frac{d}{\\delta} f_t(y_t) A_t^{-1} u_t, \\qquad y_t = x_t + \\delta A_t u_t,$$\n\nfor arbitrary $A_t$, $\\delta$, and $u_t$. Let $x^* = \\mathrm{argmin}_{x \\in K} \\sum_{t=1}^T f_t(x)$, and let $x^*_\\epsilon \\in \\mathrm{argmin}_{y \\in K, \\mathrm{dist}(y, \\partial K) > \\epsilon} \\|y - x^*\\|$. Furthermore, denote\n\n$$\\bar f_t(x) = \\frac{1}{k+1} \\sum_{i=0}^k \\hat f_{t-i}(x), \\qquad \\bar g_t = \\frac{1}{k+1} \\sum_{i=0}^k \\hat g_{t-i}.$$\n\nAssume that we play $y_t$ at every round. Then we have the structural estimate:\n\n$$\\sum_{t=1}^T \\mathbb{E}\\big[\\hat f_t(x_t) - \\hat f_t(x^*_\\epsilon)\\big] \\leq \\frac{Ck}{2} + L T \\sup_{t \\in [1,T], i \\in [0, k \\wedge t]} \\mathbb{E}[\\|x_{t-i} - x_t\\|_2] + \\sum_{t=1}^T \\mathbb{E}\\big[\\bar g_t^\\top (x_t - x^*_\\epsilon)\\big].$$\n\nWhile we use averaged smoothed losses as in [Dekel et al., 2015], the analysis in this lemma is actually somewhat different. Because Dekel et al. [2015] always assume that the loss functions are in $C^{1,1}$, they elect to use the following decomposition:\n\n$$\\hat f_t(x_t) - \\hat f_t(x^*_\\epsilon) = \\hat f_t(x_t) - \\bar f_t(x_t) + \\bar f_t(x_t) - \\bar f_t(x^*_\\epsilon) + \\bar f_t(x^*_\\epsilon) - \\hat f_t(x^*_\\epsilon).$$\n\nThis is because they can relate $\\nabla \\bar f_t(x) = \\frac{1}{k+1} \\sum_{i=0}^k \\nabla \\hat f_{t-i}(x)$ to $\\bar g_t = \\frac{1}{k+1} \\sum_{i=0}^k \\nabla \\hat f_{t-i}(x_{t-i})$ using the fact that the gradients are Lipschitz. Since the gradients of $C^{0,1}$ functions are not Lipschitz, we cannot use the same analysis. Instead, we use the decomposition\n\n$$\\hat f_t(x_t) - \\hat f_t(x^*_\\epsilon) = \\hat f_t(x_t) - \\hat f_{t-i}(x_{t-i}) + \\hat f_{t-i}(x_{t-i}) - \\bar f_t(x^*_\\epsilon) + \\bar f_t(x^*_\\epsilon) - \\hat f_t(x^*_\\epsilon).$$\n\nThe next lemma affirms that we do indeed get the improved $\\frac{1}{(k+1)^2}$ factor from predicting the predictable component of the average gradient.\n\nLemma 7 ($C^{0,1}$ algorithmic bound on the averaged losses). Let $(f_t)_{t=1}^T$ be a sequence of loss functions, and assume that $f_t \\colon K \\to \\mathbb{R}_+$ is $C$-bounded and $L$-Lipschitz, where $K \\subset \\mathbb{R}^d$. Let $x^* = \\mathrm{argmin}_{x \\in K} \\sum_{t=1}^T f_t(x)$, and let $x^*_\\epsilon \\in \\mathrm{argmin}_{y \\in K, \\mathrm{dist}(y, \\partial K) > \\epsilon} \\|y - x^*\\|$. Assume that we play according to the algorithm with $\\eta k \\leq \\frac{\\delta}{12 C d}$. Then we maintain the following guarantee:\n\n$$\\sum_{t=1}^T \\mathbb{E}\\big[\\bar g_t^\\top (x_t - x^*_\\epsilon)\\big] \\leq \\frac{2 C d^2 \\eta T}{\\delta^2 (k+1)^2} + \\frac{1}{\\eta} R(x^*_\\epsilon).$$\n\nSo far, we have demonstrated a bound on the regret of the form:\n\n$$\\mathrm{Reg}_T(\\mathcal{A}) \\leq \\epsilon L T + 2 \\delta L D T + \\frac{Ck}{2} + L T \\sup_{t \\in [T], i \\in [k \\wedge t]} \\mathbb{E}[\\|x_{t-i} - x_t\\|_2] + \\frac{2 C d^2 \\eta T}{\\delta^2 (k+1)^2} + \\frac{1}{\\eta} R(x_\\epsilon).$$\n\nThus, it remains to find a tight bound on $\\sup_{t \\in [1,T], i \\in [0, k \\wedge t]} \\mathbb{E}[\\|x_{t-i} - x_t\\|_2]$, which measures the stability of the actions across the history that we average over. 
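As a quick numerical check of the cancellation behind this improved $\frac{1}{(k+1)^2}$ factor (Section 4.1), the identity $\bar g_t - \tilde g_t = \frac{1}{k+1}\hat g_t$ can be verified on placeholder vectors; the vectors below are arbitrary stand-ins, not actual gradient estimates.

```python
# Predicting the known part of the trailing average leaves only the
# newest smoothed gradient, scaled by 1/(k+1).
k, d = 3, 2
g_hat = {t: [float(t + j) for j in range(d)] for t in range(1, 9)}  # placeholders

def g_bar(t):    # trailing average: (1/(k+1)) * sum_{i=0..k} g_hat[t-i]
    return [sum(g_hat[t - i][j] for i in range(k + 1)) / (k + 1) for j in range(d)]

def g_tilde(t):  # optimistic hint: (1/(k+1)) * sum_{i=1..k} g_hat[t-i]
    return [sum(g_hat[t - i][j] for i in range(1, k + 1)) / (k + 1) for j in range(d)]

t = 7
residual = [a - b for a, b in zip(g_bar(t), g_tilde(t))]
expected = [c / (k + 1) for c in g_hat[t]]
```

Up to floating-point rounding, `residual` and `expected` coincide, mirroring the term-by-term cancellation used in the bound of Lemma 7.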
This result is similar to that of Dekel et al. [2015], except that we additionally need to account for the optimistic gradient prediction used.\n\nLemma 8 ($C^{0,1}$ algorithmic bound on the stability of actions). Let $(f_t)_{t=1}^T$ be a sequence of loss functions, and assume that $f_t \\colon K \\to \\mathbb{R}_+$ is $C$-bounded and $L$-Lipschitz, where $K \\subset \\mathbb{R}^d$. Assume that we play according to the algorithm with $\\eta k \\leq \\frac{\\delta}{12 C d}$. Then the following estimate holds:\n\n$$\\mathbb{E}[\\|x_{t-i} - x_t\\|_2] \\leq 2 \\eta k D \\Big( \\sqrt{3} L \\delta^{1/2} + \\sqrt{2 D L} + \\frac{\\sqrt{48}\\, C d}{\\delta \\sqrt{k}} \\Big).$$\n\nProof [of Theorem 1]. Putting all the pieces together from Lemmas 5, 6, 7, and 8 shows that\n\n$$\\mathrm{Reg}_T(\\mathcal{A}) \\leq \\epsilon L T + \\delta L D T + \\frac{Ck}{2} + \\frac{2 C d^2 \\eta T}{\\delta^2 (k+1)^2} + \\frac{1}{\\eta} R(x_\\epsilon) + L T \\cdot 2 \\eta D \\Big[ \\sqrt{3} L \\delta^{1/2} + \\sqrt{2 D L}\\, k + \\frac{\\sqrt{48}\\, C d \\sqrt{k}}{\\delta} \\Big].$$\n\nSince $x_\\epsilon$ is at least $\\epsilon$ away from the boundary, it follows from [Abernethy and Rakhlin, 2009] that $R(x_\\epsilon) \\leq \\nu \\log(1/\\epsilon)$. Plugging in the stated quantities for $\\eta$, $\\delta$, and $k$ yields the result.\n\n5.3 $C^{1,1}$ analysis\n\nThe analysis of the $C^{1,1}$ regret bound is similar to the $C^{0,1}$ case. The only difference is that we leverage the higher regularity of the losses to provide a more refined estimate on the cost of approximating $f_t$ with $\\bar f_t$. Apart from that, we reuse the bounds derived in Lemmas 6, 7, and 8. The proof of the following lemma, along with that of Theorem 2, is provided in Appendix 7.3.\n\nLemma 9 ($C^{1,1}$ structural bound on true losses in terms of smoothed losses). Let $(f_t)_{t=1}^T$ be a sequence of loss functions, and assume that $f_t \\colon K \\to \\mathbb{R}_+$ is $C$-bounded, $L$-Lipschitz, and $H$-smooth, where $K \\subset \\mathbb{R}^d$. Denote\n\n$$\\hat f_t(x) = \\mathbb{E}_{u \\sim U(\\partial B_1(0))}[f_t(x + \\delta A_t u)], \\qquad \\hat g_t = \\frac{d}{\\delta} f_t(y_t) A_t^{-1} u_t, \\qquad y_t = x_t + \\delta A_t u_t,$$\n\nfor arbitrary $A_t$, $\\delta$, and $u_t$. Let $x^* = \\mathrm{argmin}_{x \\in K} \\sum_{t=1}^T f_t(x)$, and let $x^*_\\epsilon \\in \\mathrm{argmin}_{y \\in K, \\mathrm{dist}(y, \\partial K) > \\epsilon} \\|y - x^*\\|$. Assume that we play $y_t$ at every round. Then the following structural estimate holds:\n\n$$\\mathrm{Reg}_T(\\mathcal{A}) = \\mathbb{E}\\Big[\\sum_{t=1}^T f_t(y_t) - f_t(x^*)\\Big] \\leq \\epsilon L T + 2 H \\delta^2 D^2 T + \\sum_{t=1}^T \\mathbb{E}[\\hat f_t(x_t) - \\hat f_t(x^*_\\epsilon)].$$\n\n6 Conclusion\n\nWe designed a computationally efficient algorithm for bandit convex optimization admitting state-of-the-art guarantees for $C^{0,1}$ and $C^{1,1}$ loss functions. This was achieved using the general and powerful technique of predicting predictable information re-use. The ideas we describe here are directly applicable to other areas of optimization, in particular stochastic optimization.\n\nAcknowledgements\n\nThis work was partly funded by NSF CCF-1535987 and IIS-1618662 and NSF GRFP DGE-1342536.\n\nReferences\n\nJ. Abernethy and A. Rakhlin. Beating the adaptive bandit with high probability. In COLT, 2009.\nJ. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, pages 263-274, 2008.\nA. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28-40, 2010.\nS. Bubeck and R. Eldan. Multi-scale exploration of convex functions and bandit convex optimization. CoRR, abs/1507.06580, 2015.\nS. Bubeck, O. Dekel, T. Koren, and Y. Peres. Bandit convex optimization: $\\sqrt{T}$ regret in one dimension. CoRR, abs/1502.06398, 2015.\nS. Bubeck, R. Eldan, and Y. T. Lee. Kernel-based methods for bandit convex optimization. CoRR, abs/1607.03084, 2016.\nV. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback.\nA. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646-1654, 2014.\nO. Dekel, R. Eldan, and T. Koren. Bandit smooth convex optimization: Improving the bias-variance tradeoff. In NIPS, pages 2908-2916, 2015.\nA. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In SODA, pages 385-394, 2005.\nE. Hazan and Y. Li. An optimal algorithm for bandit convex optimization. CoRR, abs/1603.04350, 2016.\nR. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315-323, 2013.\nR. D. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In NIPS, pages 697-704, 2004.\nY. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, New York, NY, USA, 2004.\nY. Nesterov and A. Nemirovskii. Interior-point Polynomial Algorithms in Convex Programming. Studies in Applied Mathematics. Society for Industrial and Applied Mathematics, 1994. ISBN 9781611970791.\nA. Rakhlin and K. Sridharan. Online learning with predictable sequences. In COLT, pages 993-1019, 2013.\nA. Saha and A. Tewari. Improved regret guarantees for online smooth convex optimization with bandit feedback. In AISTATS, pages 636-642, 2011.\nM. W. Schmidt, N. L. Roux, and F. R. Bach. Minimizing finite sums with the stochastic average gradient. CoRR, abs/1309.2388, 2013.\n", "award": [], "sourceid": 1191, "authors": [{"given_name": "Scott", "family_name": "Yang", "institution": "New York University"}, {"given_name": "Mehryar", "family_name": "Mohri", "institution": "Courant Institute, NYU & Google"}]}