{"title": "Bandit Smooth Convex Optimization: Improving the Bias-Variance Tradeoff", "book": "Advances in Neural Information Processing Systems", "page_first": 2926, "page_last": 2934, "abstract": "Bandit convex optimization is one of the fundamental problems in the field of online learning. The best algorithm for the general bandit convex optimization problem guarantees a regret of $\\widetilde{O}(T^{5/6})$, while the best known lower bound is $\\Omega(T^{1/2})$. Many attempts have been made to bridge the huge gap between these bounds. A particularly interesting special case of this problem assumes that the loss functions are smooth. In this case, the best known algorithm guarantees a regret of $\\widetilde{O}(T^{2/3})$. We present an efficient algorithm for the bandit smooth convex optimization problem that guarantees a regret of $\\widetilde{O}(T^{5/8})$. Our result rules out an $\\Omega(T^{2/3})$ lower bound and takes a significant step towards the resolution of this open problem.", "full_text": "Bandit Smooth Convex Optimization:\nImproving the Bias-Variance Tradeoff\n\nOfer Dekel\n\nMicrosoft Research\n\nRedmond, WA\n\noferd@microsoft.com\n\nRonen Eldan\n\nWeizmann Institute\n\nRehovot, Israel\n\nroneneldan@gmail.com\n\nTomer Koren\n\nTechnion\nHaifa, Israel\n\ntomerk@technion.ac.il\n\nAbstract\n\nBandit convex optimization is one of the fundamental problems in the field of\nonline learning. The best algorithm for the general bandit convex optimization\nproblem guarantees a regret of $\\widetilde{O}(T^{5/6})$, while the best known lower bound\nis $\\Omega(T^{1/2})$. Many attempts have been made to bridge the huge gap between these\nbounds. A particularly interesting special case of this problem assumes that the\nloss functions are smooth. In this case, the best known algorithm guarantees a\nregret of $\\widetilde{O}(T^{2/3})$. We present an efficient algorithm for the bandit smooth convex\noptimization problem that guarantees a regret of $\\widetilde{O}(T^{5/8})$. 
Our result rules out\nan $\\Omega(T^{2/3})$ lower bound and takes a significant step towards the resolution of this\nopen problem.\n\n1 Introduction\n\nBandit convex optimization [11, 5] is the following online learning problem. First, an adversary\nprivately chooses a sequence of bounded and convex loss functions $f_1, \\ldots, f_T$ defined over a\nconvex domain $K$ in $d$-dimensional Euclidean space. Then, a randomized decision maker iteratively\nchooses a sequence of points $x_1, \\ldots, x_T$, where each $x_t \\in K$. On iteration $t$, after choosing the\npoint $x_t$, the decision maker incurs a loss of $f_t(x_t)$ and receives bandit feedback: he observes the\nvalue of his loss but he does not receive any other information about the function $f_t$. The decision\nmaker uses the feedback to make better choices on subsequent rounds. His goal is to minimize\nregret, which is the difference between his loss and the loss incurred by the best fixed point in $K$. If\nthe regret grows sublinearly with $T$, it indicates that the decision maker\u2019s performance improves as\nthe length of the sequence increases, and therefore we say that he is learning.\nFinding an optimal algorithm for bandit convex optimization is an elusive open problem. The first\nalgorithm for this problem was presented in Flaxman et al. [11] and guarantees a regret of\n$R(T) = \\widetilde{O}(T^{5/6})$ for any sequence of loss functions (here and throughout, the asymptotic $\\widetilde{O}$ notation hides\na polynomial dependence on the dimension $d$ as well as logarithmic factors). Despite the ongoing\neffort to improve on this rate, it remains the state of the art. On the other hand, Dani et al. [9]\nprove that for any algorithm there exists a worst-case sequence of loss functions for which\n$R(T) = \\Omega(T^{1/2})$, and the gap between the upper and lower bounds is huge.\nWhile no progress has been made on the general form of the problem, some progress has been made\nin interesting special cases. 
Specifically, if the bounded convex loss functions are also assumed to be\nLipschitz, Flaxman et al. [11] improve their regret guarantee to $R(T) = \\widetilde{O}(T^{3/4})$. If the loss\nfunctions are smooth (namely, their gradients are Lipschitz), Saha and Tewari [15] present an algorithm\nwith a guaranteed regret of $\\widetilde{O}(T^{2/3})$. Similarly, if the loss functions are bounded, Lipschitz, and\nstrongly convex, the guaranteed regret is $\\widetilde{O}(T^{2/3})$ [3]. If even stronger assumptions are made, an\noptimal regret rate of $\\widetilde{\\Theta}(T^{1/2})$ can be guaranteed; namely, when the loss functions are both smooth\nand strongly-convex [12], when they are Lipschitz and linear [2], and when Lipschitz loss functions\nare not generated adversarially but drawn i.i.d. from a fixed and unknown distribution [4].\nRecently, Bubeck et al. [8] made progress that did not rely on additional assumptions, such as\nLipschitz continuity, smoothness, or strong convexity, but instead considered the general problem in the\none-dimensional case. That result proves that there exists an algorithm with optimal $\\widetilde{\\Theta}(T^{1/2})$ regret\nfor arbitrary univariate convex functions $f_t : [0, 1] \\mapsto [0, 1]$. Subsequently, and after the current\npaper was written, Bubeck and Eldan [7] generalized this result to bandit convex optimization in\ngeneral Euclidean spaces (albeit requiring a Lipschitz assumption). However, the proofs in both\npapers are non-constructive and do not give any hint on how to construct a concrete algorithm, nor\nany indication that an efficient algorithm exists.\nThe current state of the bandit convex optimization problem has given rise to two competing\nconjectures. 
Some believe that there exists an efficient algorithm that matches the current lower bound.\nMeanwhile, others are trying to prove larger lower bounds, in the spirit of [10], even under the\nassumption that the loss functions are smooth; if the $\\Omega(T^{1/2})$ lower bound is loose, a natural guess\nof the true regret rate would be $\\widetilde{\\Theta}(T^{2/3})$.$^1$ In this paper, we take an important step towards the\nresolution of this problem by presenting an algorithm that guarantees a regret of $\\widetilde{O}(T^{5/8})$ against\nany sequence of bounded, convex, smooth loss functions. Compare this result to the previous\nstate-of-the-art result of $\\widetilde{O}(T^{2/3})$ (noting that $2/3 = 0.666\\ldots$ and $5/8 = 0.625$). This result rules out\nthe possibility of proving a lower bound of $\\Omega(T^{2/3})$ with smooth functions. While there remains\na sizable gap with the $T^{1/2}$ lower bound, our result brings us closer to finding the elusive optimal\nalgorithm for bandit convex optimization, at least in the case of smooth functions.\nOur algorithm is a variation on the algorithms presented in [11, 1, 15], with one new idea. These\nalgorithms all follow the same template: on each round, the algorithm computes an estimate of\n$\\nabla f_t(x_t)$, the gradient of the current loss function at the current point, by applying a random\nperturbation to $x_t$. The sequence of gradient estimates is then plugged into a first-order online optimization\ntechnique. The technical challenge in the analysis of these algorithms is to bound the bias and the\nvariance of these gradient estimates. Our idea is to take a window of consecutive gradient estimates\nand average them, producing a new gradient estimate with lower variance and higher bias. 
Overall,\nthe new bias-variance tradeoff works in our favor and allows us to improve the regret upper-bound.\nAveraging uncorrelated random vectors to reduce variance is a well-known technique, but applying\nit in the context of a bandit convex optimization algorithm is easier said than done and requires us to\novercome a number of technical difficulties. For example, the gradient estimates in our window are\ntaken at different points, which introduces a new type of bias. Another example is the difficulty that\narises when the sequence $x_s, \\ldots, x_t$ travels adjacent to the boundary of the convex set $K$ (imagine\ntransitioning from one face of a hypercube to another); the random perturbation applied to $x_s$ and\n$x_t$ could be supported on orthogonal directions, yet we average the resulting gradient estimates\nand expect to get a meaningful low-variance gradient estimate. While the basic idea is simple, our\nnon-trivial technical analysis is not, and may be of independent interest.\n\n2 Preliminaries\n\nWe begin by defining smooth bandit convex optimization more formally, and recalling several basic\nresults from previous work on the problem (Flaxman et al. [11], Abernethy et al. [2], Saha and Tewari\n[15]) that we use in our analysis. We also review the necessary background on self-concordant\nbarrier functions.\n\n2.1 Smooth Bandit Convex Optimization\n\nIn the bandit convex optimization problem, an adversary first chooses a sequence of convex functions\n$f_1, \\ldots, f_T : K \\mapsto [0, 1]$, where $K$ is a closed and convex domain in $\\mathbb{R}^d$. Then, on each round\n$t = 1, \\ldots, T$, a randomized decision maker has to choose a point $x_t \\in K$, and after committing\nto his decision he incurs a loss of $f_t(x_t)$, and observes this loss as feedback. 
The decision maker\u2019s\nexpected loss (where expectation is taken with respect to his random choices) is $\\mathbb{E}[\\sum_{t=1}^T f_t(x_t)]$ and\nhis regret is\n\n$$R(T) = \\mathbb{E}\\Big[\\sum_{t=1}^T f_t(x_t)\\Big] - \\min_{x \\in K} \\sum_{t=1}^T f_t(x) .$$\n\nThroughout, we use the notation $\\mathbb{E}_t[\\cdot]$ to indicate expectations conditioned on all randomness up to\nand including round $t-1$.\nWe make the following assumptions. First, we assume that each of the functions $f_1, \\ldots, f_T$ is\n$L$-Lipschitz with respect to the Euclidean norm $\\|\\cdot\\|_2$, namely that $|f_t(x) - f_t(y)| \\le L \\|x - y\\|_2$ for all\n$x, y \\in K$. We further assume that $f_t$ is $H$-smooth with respect to $\\|\\cdot\\|_2$, which is to say that\n\n$$\\forall\\, x, y \\in K , \\qquad \\|\\nabla f_t(x) - \\nabla f_t(y)\\|_2 \\le H \\|x - y\\|_2 .$$\n\nIn particular, this implies that $f_t$ is continuously differentiable over $K$. Finally, we assume that the\nEuclidean diameter of the decision domain $K$ is bounded by $D > 0$.\n\n$^1$ In fact, we are aware of at least two separate research groups that invested time trying to prove such\nan $\\Omega(T^{2/3})$ lower bound.\n\n2.2 First Order Algorithms with Estimated Gradients\n\nThe online convex optimization problem becomes much easier in the full information setting, where\nthe decision maker\u2019s feedback includes the vector $g_t = \\nabla f_t(x_t)$, the gradient (or subgradient) of\n$f_t$ at the point $x_t$. In this setting, the decision maker can use a first-order online algorithm, such as\nthe projected online gradient descent algorithm [17] or dual averaging [13] (sometimes known as\nfollow the regularized leader [16]), and guarantee a regret of $O(T^{1/2})$. The dual averaging approach\nsets $x_t$ to be the solution to the following optimization problem,\n\n$$x_t = \\arg\\min_{x \\in K} \\Big\\{ x \\cdot \\sum_{s=1}^{t-1} \\alpha_{s,t} g_s + R(x) \\Big\\} , \\qquad (1)$$\n\nwhere $R$ is a suitably chosen regularizer, and for all $t = 1, \\ldots, T$ and $s = 1, \\ldots, t$ we define a\nnon-negative weight $\\alpha_{s,t}$. 
Typically, all of the weights $(\\alpha_{s,t})$ are set to a constant value $\\eta$, called the\nlearning rate parameter.\nHowever, since we are not in the full information setting and the decision maker does not observe $g_t$,\nthe algorithms mentioned above cannot be used directly. The key observation of Flaxman et al. [11],\nwhich is later reused in all of the follow-up work, is that $g_t$ can be estimated by randomly perturbing\nthe point $x_t$. Specifically, on round $t$, the algorithm chooses the point\n\n$$y_t = x_t + \\delta A_t u_t , \\qquad (2)$$\n\ninstead of the original point $x_t$, where $\\delta > 0$ is a parameter that controls the magnitude of the\nperturbation, $A_t$ is a positive definite $d \\times d$ matrix, and $u_t$ is drawn from the uniform distribution\non the unit sphere. In Flaxman et al. [11], $A_t$ is simply set to the identity matrix whereas in Saha\nand Tewari [15], $A_t$ is more carefully tailored to the point $x_t$ (see details below). In any case, care\nshould be taken to ensure that the perturbed point $y_t$ remains in the convex set $K$. The observed\nvalue $f_t(y_t)$ is then used to compute the gradient estimate\n\n$$\\hat{g}_t = \\frac{d}{\\delta} f_t(y_t) A_t^{-1} u_t , \\qquad (3)$$\n\nand this estimate is fed to the first-order optimization algorithm. While $\\hat{g}_t$ is not an unbiased\nestimator of $\\nabla f_t(x_t)$, it is an unbiased estimator for the gradient of a different function, $\\hat{f}_t$, defined\nby\n\n$$\\hat{f}_t(x) = \\mathbb{E}_t[f_t(x + \\delta A_t v)] , \\qquad (4)$$\n\nwhere $v \\in \\mathbb{R}^d$ is uniformly drawn from the unit ball. The function $\\hat{f}_t(x)$ is a smoothed version of\n$f_t$, which plays a key role in our analysis and in many of the previous results on this topic. The main\nproperty of $\\hat{f}_t$ is summarized in the following lemma.\nLemma 1 (Flaxman et al. [11], Saha and Tewari [15, Lemma 5]). For any differentiable function\n$f : \\mathbb{R}^d \\mapsto \\mathbb{R}$, positive definite matrix $A$, $x \\in \\mathbb{R}^d$, and $\\delta \\in (0, 1]$, define $\\hat{g} = (d/\\delta) f(x + \\delta A u) \\cdot A^{-1} u$,\nwhere $u$ is uniform on the unit sphere. 
Also, let $\\hat{f}(x) = \\mathbb{E}[f(x + \\delta A v)]$ where $v$ is uniform on the\nunit ball. Then $\\mathbb{E}[\\hat{g}] = \\nabla \\hat{f}(x)$.\nThe difference between $\\nabla f_t(x_t)$ and $\\nabla \\hat{f}_t(x_t)$ is the bias of the gradient estimator $\\hat{g}_t$. The analysis\nin Flaxman et al. [11], Abernethy et al. [2], Saha and Tewari [15] focuses on bounding the bias and\nthe variance of $\\hat{g}_t$ and their effect on the first-order optimization algorithm.\n\n2.3 Self-Concordant Barriers\n\nFollowing [2, 1, 15], our algorithm and analysis rely on the properties of self-concordant barrier\nfunctions. Intuitively, a barrier is a function defined on the interior of the convex body $K$, which is\nrather flat in most of the interior of $K$ and explodes to $\\infty$ as we approach its boundary. Additionally,\na self-concordant barrier has some technical properties that are useful in our setting. Before\ngiving the formal definition of a self-concordant barrier, we define the local norm that it induces.\nDefinition 2 (Local Norm Induced by a Self-Concordant Barrier [14]). Let $R : \\mathrm{int}(K) \\mapsto \\mathbb{R}$ be a\nself-concordant barrier. The local norm induced by $R$ at the point $x \\in \\mathrm{int}(K)$ is denoted by $\\|z\\|_x$\nand defined as $\\|z\\|_x = \\sqrt{z^\\top \\nabla^2 R(x) z}$. Its dual norm is $\\|z\\|_{x,*} = \\sqrt{z^\\top (\\nabla^2 R(x))^{-1} z}$.\nIn words, the local norm at $x$ is the Mahalanobis norm defined by the Hessian of $R$ at the point $x$,\nnamely, $\\nabla^2 R(x)$. We now give a formal definition of a self-concordant barrier.\nDefinition 3 (Self-Concordant Barrier [14]). Let $K \\subseteq \\mathbb{R}^d$ be a convex body. A function\n$R : \\mathrm{int}(K) \\mapsto \\mathbb{R}$ is a $\\vartheta$-self-concordant barrier for $K$ if (i) $R$ is three times continuously differentiable,\n(ii) $R(x) \\to \\infty$ as $x \\to \\partial K$, and (iii) for all $x \\in \\mathrm{int}(K)$ and $y \\in \\mathbb{R}^d$, $R$ satisfies\n\n$$|\\nabla^3 R(x)[y, y, y]| \\le 2 \\|y\\|_x^3 \\qquad \\text{and} \\qquad |\\nabla R(x) \\cdot y| \\le \\sqrt{\\vartheta}\\, \\|y\\|_x .$$\n\nThis definition is given for completeness, and is not directly used in our analysis. Instead, we rely\non some useful properties of self-concordant barriers. 
First and foremost, there exists an $O(d)$-self-concordant\nbarrier for any convex body [14, 6]. Efficiently-computable self-concordant barriers\nare only known for specific classes of convex bodies, such as polytopes, yet we make the standard\nassumption that we have an efficiently computable $\\vartheta$-self-concordant barrier for the set $K$.\nAnother key feature of a self-concordant barrier is the set of Dikin ellipsoids that it defines. The\nDikin ellipsoid at $x \\in \\mathrm{int}(K)$ is simply the unit ball with respect to the local norm at $x$. A key\nfeature of the Dikin ellipsoid is that it is entirely contained in the convex body $K$, for any $x$ [see 14,\nTheorem 2.1.1]. Another technical property of a self-concordant barrier is that its Hessian changes\nslowly with respect to its local norm.\nTheorem 4 (Nesterov and Nemirovskii [14, Theorem 2.1.1]). Let $K$ be a convex body with\nself-concordant barrier $R$. For any $x \\in \\mathrm{int}(K)$ and $z \\in \\mathbb{R}^d$ such that $\\|z\\|_x < 1$, it holds that\n\n$$(1 - \\|z\\|_x)^2 \\, \\nabla^2 R(x) \\preceq \\nabla^2 R(x + z) \\preceq (1 - \\|z\\|_x)^{-2} \\, \\nabla^2 R(x) .$$\n\nWhile the self-concordant barrier explodes to infinity at the boundary of $K$, it is quite flat at points\nthat are far from the boundary. To make this statement formal, we define an operation that\nmultiplicatively shrinks the set $K$ toward the minimizer of $R$ (called the analytic center of $K$). Let\n$y = \\arg\\min R(x)$ and assume without loss of generality that $R(y) = 0$. For any $\\epsilon \\in (0, 1)$ let $K_{y,\\epsilon}$\ndenote the set $\\{y + (1 - \\epsilon)(x - y) : x \\in K\\}$. The next theorem states that the barrier is flat in $K_{y,\\epsilon}$\nand explodes to $\\infty$ in the thin shell between $K_{y,\\epsilon}$ and $K$.\nTheorem 5 (Nesterov and Nemirovskii [14, Propositions 2.3.2-3]). Let $K$ be a convex body with\n$\\vartheta$-self-concordant barrier $R$, let $y = \\arg\\min R(x)$, and assume that $R(y) = 0$. 
For any $\\epsilon \\in (0, 1]$,\nit holds that\n\n$$\\forall\\, x \\in K_{y,\\epsilon} , \\qquad R(x) \\le \\vartheta \\log \\frac{1}{\\epsilon} .$$\n\nOur assumptions on the loss functions, such as the Lipschitz assumption or the smoothness assumption,\nare stated in terms of the standard Euclidean norm (which we denote by $\\|\\cdot\\|_2$). Therefore, we will\nneed to relate the Euclidean norm to the local norms defined by the self-concordant barrier. This is\naccomplished by the following lemma (whose proof appears in the supplementary material).\nLemma 6. Let $K$ be a convex body with self-concordant barrier $R$ and let $D$ be the (Euclidean)\ndiameter of $K$. For any $x \\in K$, it holds that $D^{-1} \\|z\\|_{x,*} \\le \\|z\\|_2 \\le D \\|z\\|_x$ for all $z \\in \\mathbb{R}^d$.\n\n2.4 Self-Concordant Barrier as a Regularizer\n\nLooking back at the dual averaging strategy defined in Eq. (1), we can now fill in some of the details\nthat were left unspecified: [1, 15] set the regularization $R$ in Eq. (1) to be a $\\vartheta$-self-concordant barrier\nfor the set $K$. We use the following useful lemma from Abernethy and Rakhlin [1] in our analysis.\n\nAlgorithm 1: Bandit Smooth Convex Optimization\nParameters: perturbation parameter $\\delta \\in (0, 1]$, dual averaging weights $(\\alpha_{s,t})$, self-concordant\nbarrier $R : \\mathrm{int}(K) \\mapsto \\mathbb{R}$\nInitialize: $x_1 \\in K$ arbitrarily\nfor $t = 1, \\ldots, T$\n    $A_t \\leftarrow (\\nabla^2 R(x_t))^{-1/2}$\n    draw $u_t$ uniformly from the unit sphere\n    $y_t \\leftarrow x_t + \\delta A_t u_t$\n    choose $y_t$, receive feedback $f_t(y_t)$\n    $\\hat{g}_t \\leftarrow (d/\\delta) f_t(y_t) \\cdot A_t^{-1} u_t$\n    $x_{t+1} \\leftarrow \\arg\\min_{x \\in K} \\{ x \\cdot \\sum_{s=1}^{t} \\alpha_{s,t} \\hat{g}_s + R(x) \\}$\n\nLemma 7 (Abernethy and Rakhlin [1]). Let $K$ be a convex body with $\\vartheta$-self-concordant barrier\n$R$, let $g_1, \\ldots, g_T$ be vectors in $\\mathbb{R}^d$, and let $\\eta > 0$ be such that $\\eta \\|g_t\\|_{x_t,*} \\le \\frac{1}{4}$ for all $t$. Define\n$x_t = \\arg\\min_{x \\in K} \\{ x \\cdot \\sum_{s=1}^{t-1} \\eta g_s + R(x) \\}$. Then,\n(i) for all $t$ it holds that $\\|x_t - x_{t+1}\\|_{x_t} \\le 2 \\eta \\|g_t\\|_{x_t,*}$;\n(ii) for any $x^\\star$ 
$\\in K$ it holds that $\\sum_{t=1}^T g_t \\cdot (x_t - x^\\star) \\le \\frac{1}{\\eta} R(x^\\star) + 2 \\eta \\sum_{t=1}^T \\|g_t\\|_{x_t,*}^2$.\nAlgorithms for bandit convex optimization that use a self-concordant regularizer also use the same\nself-concordant barrier to obtain gradient estimates. Namely, these algorithms perturb the dual\naveraging solution $x_t$ as in Eq. (2), with the perturbation matrix $A_t$ set to $(\\nabla^2 R(x_t))^{-1/2}$, the root of\nthe inverse Hessian of $R$ at the point $x_t$. In other words, the distribution of $y_t$ is supported on the\nDikin ellipsoid centered at $x_t$, scaled by $\\delta$. Since $\\delta \\in (0, 1]$, this form of perturbation guarantees\nthat $y_t \\in K$. Moreover, if $y_t$ is generated in this way and used to construct the gradient estimator\n$\\hat{g}_t$, then the local norm of $\\hat{g}_t$ is bounded, as specified in the following lemma.\nLemma 8 (Saha and Tewari [15, Lemma 5]). Let $K \\subseteq \\mathbb{R}^d$ be a convex body with self-concordant\nbarrier $R$. For any differentiable function $f : K \\mapsto [0, 1]$, $\\delta \\in (0, 1]$, and $x \\in \\mathrm{int}(K)$, define\n$\\hat{g} = (d/\\delta) f(y) \\cdot A^{-1} u$, where $A = (\\nabla^2 R(x))^{-1/2}$, $y = x + \\delta A u$, and $u$ is drawn uniformly from\nthe unit sphere. Then $\\|\\hat{g}\\|_{x,*} \\le d/\\delta$.\n\n3 Main Result\n\nOur algorithm for the bandit smooth convex optimization problem is a variant of the algorithm in\nSaha and Tewari [15], and appears in Algorithm 1. Following Abernethy and Rakhlin [1], Saha and\nTewari [15], we use a self-concordant function as the dual averaging regularizer and we use its Dikin\nellipsoids to perturb the points $x_t$. The difference between our algorithm and previous ones is the\nintroduction of dual averaging weights $(\\alpha_{s,t})$, for $t = 1, \\ldots, T$ and $s = 1, \\ldots, t$, which allow us to\nvary the weight of each gradient in the dual averaging objective function.\nIn addition to the parameters $\\delta$, $\\eta$, and $\\epsilon$, we introduce a new buffering parameter $k$, which takes\nnon-negative integer values. 
We set the dual averaging weights in Algorithm 1 to be\n\n$$\\alpha_{s,t} = \\eta \\ \\text{ if } s \\le t - k , \\qquad \\alpha_{s,t} = \\frac{t - s + 1}{k + 1}\\, \\eta \\ \\text{ if } s > t - k , \\qquad (5)$$\n\nwhere $\\eta > 0$ is a global learning rate parameter. This choice of $(\\alpha_{s,t})$ effectively decreases the\ninfluence of the feedback received on the most recent $k$ rounds. If $k = 0$, all of the $(\\alpha_{s,t})$ become\nequal to $\\eta$ and Algorithm 1 reduces to the algorithm in Saha and Tewari [15]. The surprising result\nis that there exists a different setting of $k > 0$ that gives a better regret bound.\nWe introduce a slight abuse of notation, which helps us simplify the presentation of our regret bound.\nWe will eventually achieve the desired regret bound by setting the parameters $\\eta$, $\\delta$, and $k$ to be some\nfunctions of $T$. Therefore, from now on, we treat the notation $\\eta$, $\\delta$, and $k$ as an abbreviation for the\nfunctional forms $\\eta(T)$, $\\delta(T)$, and $k(T)$ respectively. The benefit is that we can now use asymptotic\nnotation (e.g., $O(\\eta k)$) to sweep meaningless low-order terms under the rug.\n\nWe prove the following regret bound for this algorithm.\nTheorem 9. Let $f_1, \\ldots, f_T$ be a sequence of loss functions where each $f_t : K \\mapsto [0, 1]$ is\ndifferentiable, convex, $H$-smooth and $L$-Lipschitz, and where $K \\subseteq \\mathbb{R}^d$ is a convex body of diameter\n$D > 0$ with $\\vartheta$-self-concordant barrier $R$. For any $\\delta, \\eta \\in (0, 1]$ and $k \\in \\{0, 1, \\ldots, T\\}$ assume that\nAlgorithm 1 is run with these parameters and with the weights defined in Eq. (5) (using $k$ and $\\eta$) to\ngenerate the sequences $x_1, \\ldots, x_T$ and $y_1, \\ldots, y_T$. 
If $12 k \\eta d \\le \\delta$, then for any $\\epsilon \\in (0, 1)$ it holds that\n\n$$R(T) \\le H D^2 \\delta^2 T + \\frac{64 d^2 \\eta T}{\\delta^2 (k + 1)} + \\frac{12 (H D^2 + D L) d \\eta \\sqrt{k}\\, T}{\\delta} + \\frac{\\vartheta \\log \\frac{1}{\\epsilon}}{\\eta} + O\\Big( \\frac{T \\epsilon}{\\delta} + T \\eta k \\Big) .$$\n\nSpecifically, if we set $\\delta = d^{1/4} T^{-3/16}$, $\\eta = d^{-1/2} T^{-5/8}$, $k = d^{1/2} T^{1/8}$, and $\\epsilon = T^{-100}$, we get\nthat $R(T) = O(\\sqrt{d}\\, T^{5/8} \\log T)$.\nNote that if we set $k = 0$ in our theorem, we recover the $\\widetilde{O}(T^{2/3})$ bound in Saha and Tewari [15] up\nto a small numerical constant (namely, the dependence on $L$, $H$, $D$, $\\vartheta$, $d$, and $T$ is the same).\n\n4 Analysis\n\nUsing the notation $x^\\star = \\arg\\min_{x \\in K} \\sum_{t=1}^T f_t(x)$, the decision maker\u2019s regret becomes\n$R(T) = \\mathbb{E}[\\sum_{t=1}^T f_t(y_t) - f_t(x^\\star)]$. Following Flaxman et al. [11], Saha and Tewari [15], we rewrite the\nregret as\n\n$$R(T) = \\underbrace{\\mathbb{E}\\Big[\\sum_{t=1}^T f_t(y_t) - \\hat{f}_t(x_t)\\Big]}_{a} + \\underbrace{\\mathbb{E}\\Big[\\sum_{t=1}^T \\hat{f}_t(x_t) - \\hat{f}_t(x^\\star)\\Big]}_{b} + \\underbrace{\\mathbb{E}\\Big[\\sum_{t=1}^T \\hat{f}_t(x^\\star) - f_t(x^\\star)\\Big]}_{c} . \\qquad (6)$$\n\nThis decomposition essentially adds a layer of hallucination to the analysis: we pretend that the\nloss functions are $\\hat{f}_1, \\ldots, \\hat{f}_T$ instead of $f_1, \\ldots, f_T$ and we also pretend that we chose the points\n$x_1, \\ldots, x_T$ rather than $y_1, \\ldots, y_T$. We then analyze the regret in this pretend world (this regret is\nthe expression in Eq. (6b)). Finally, we tie our analysis back to the real world by bounding the\ndifference between that which we analyzed and the regret of the actual problem (this difference is\nthe sum of Eq. (6a) and Eq. (6c)). The advantage of our pretend world over the real world is that we\nhave unbiased gradient estimates $\\hat{g}_1, \\ldots, \\hat{g}_T$ that we can plug into the dual averaging algorithm.\nThe algorithm in Saha and Tewari [15] sets all of the dual averaging weights $(\\alpha_{s,t})$ equal to the\nconstant learning rate $\\eta > 0$. It decomposes the regret as in Eq. 
(6) and their main technical result is\nthe following bound for the individual terms:\nTheorem 10 (Saha and Tewari [15]). Let $f_1, \\ldots, f_T$ be a sequence of loss functions where each\n$f_t : K \\mapsto [0, 1]$ is differentiable, convex, $H$-smooth and $L$-Lipschitz, and where $K \\subseteq \\mathbb{R}^d$ is a convex\nbody of diameter $D > 0$ and $\\vartheta$-self-concordant barrier $R$. Assume that Algorithm 1 is run with\nperturbation parameter $\\delta \\in (0, 1]$ and generates the sequences $x_1, \\ldots, x_T$ and $y_1, \\ldots, y_T$. Then\nfor any $\\epsilon \\in (0, 1)$ it holds that (6a) + (6c) $\\le (H D^2 \\delta^2 + \\epsilon L) T$. If, additionally, the dual averaging\nweights $(\\alpha_{s,t})$ are all set to the constant learning rate $\\eta$ then (6b) $\\le \\vartheta \\log(1/\\epsilon)\\, \\eta^{-1} + d^2 \\delta^{-2} \\eta T$.\nThe analysis in Saha and Tewari [15] goes on to obtain a regret bound of $\\widetilde{O}(T^{2/3})$ by choosing\noptimal values for the parameters $\\eta$, $\\delta$ and $\\epsilon$ and plugging those values into Theorem 10. Our\nanalysis uses the first part of Theorem 10 to bound (6a) + (6c) and shows that our careful choice of\nthe dual averaging weights $(\\alpha_{s,t})$ results in an improved bound on (6b).\nWe begin our analysis by defining a moving average of the functions $\\hat{f}_1, \\ldots, \\hat{f}_T$, as follows:\n\n$$\\forall\\, t = 1, \\ldots, T , \\qquad \\bar{f}_t(x) = \\frac{1}{k + 1} \\sum_{i=0}^{k} \\hat{f}_{t-i}(x) , \\qquad (7)$$\n\nwhere, for soundness, we let $\\hat{f}_s \\equiv 0$ for $s \\le 0$. Also, define a moving average of gradient estimates:\n\n$$\\forall\\, t = 1, \\ldots, T , \\qquad \\bar{g}_t = \\frac{1}{k + 1} \\sum_{i=0}^{k} \\hat{g}_{t-i} ,$$\n\nagain, with $\\hat{g}_s = 0$ for $s \\le 0$. In Section 4.1 below, we show how each $\\bar{g}_t$ can be used as a biased\nestimate of $\\nabla \\bar{f}_t(x_t)$. Also note that the choice of the dual averaging weights $(\\alpha_{s,t})$ in Eq. (5) is such\nthat $\\sum_{s=1}^{t} \\alpha_{s,t} \\hat{g}_s = \\eta \\sum_{s=1}^{t} \\bar{g}_s$ for all $t$. Therefore, the last step in Algorithm 1 basically performs\ndual averaging with the gradient estimates $\\bar{g}_1, \\ldots, \\bar{g}_T$ uniformly weighted by $\\eta$.\nWe use the functions $\\bar{f}_t$ to rewrite Eq. (6b) as\n\n$$\\underbrace{\\mathbb{E}\\Big[\\sum_{t=1}^T \\hat{f}_t(x_t) - \\bar{f}_t(x_t)\\Big]}_{a} + \\underbrace{\\mathbb{E}\\Big[\\sum_{t=1}^T \\bar{f}_t(x_t) - \\bar{f}_t(x^\\star)\\Big]}_{b} + \\underbrace{\\mathbb{E}\\Big[\\sum_{t=1}^T \\bar{f}_t(x^\\star) - \\hat{f}_t(x^\\star)\\Big]}_{c} . \\qquad (8)$$\n\nThis decomposition essentially adds yet another layer of hallucination to the analysis: we pretend\nthat the loss functions are $\\bar{f}_1, \\ldots, \\bar{f}_T$ instead of $\\hat{f}_1, \\ldots, \\hat{f}_T$ (which are themselves pretend loss\nfunctions, as described above). Eq. (8b) is the regret in our new pretend scenario, while Eq. (8a) +\nEq. (8c) is the difference between this regret and the regret in Eq. (6b).\nThe following lemma bounds each of the terms in Eq. (8) separately, and summarizes the main\ntechnical contribution of our paper.\nLemma 11. Under the conditions of Theorem 9, for any $\\epsilon \\in (0, 1)$ it holds that (8c) $\\le 0$,\n\n$$\\text{(8a)} \\le \\frac{12 D L d \\eta \\sqrt{k}\\, T}{\\delta} + O\\big((1 + \\eta) k\\big) ,$$\n\nand\n\n$$\\text{(8b)} \\le \\frac{64 d^2 \\eta T}{\\delta^2 (k + 1)} + \\frac{12 H D^2 d \\eta \\sqrt{k}\\, T}{\\delta} + \\frac{\\vartheta \\log \\frac{1}{\\epsilon}}{\\eta} + O\\Big( \\frac{T \\epsilon}{\\delta} + T \\eta k \\Big) .$$\n\n4.1 Proof Sketch of Lemma 11\n\nAs mentioned above, the basic intuition of our technique is quite simple: average the gradients to\ndecrease their variance. Yet, applying this idea in the analysis is tricky. We begin by describing the\nmain source of difficulty in proving Lemma 11.\nRecall that our strategy is to pretend that the loss functions are $\\bar{f}_1, \\ldots, \\bar{f}_T$ and to use the random\nvector $\\bar{g}_t$ as a biased estimator of $\\nabla \\bar{f}_t(x_t)$. Naturally, one of our goals is to show that this bias\nis small. Recall that each $\\hat{g}_s$ is an unbiased estimator of $\\nabla \\hat{f}_s(x_s)$ (conditioned on the history up\nto round $t$). Specifically, note that each vector in the sequence $\\hat{g}_{t-k}, \\ldots, \\hat{g}_t$ is a gradient estimate\nat a different point. Yet, we average these vectors and claim that they accurately estimate $\\nabla \\bar{f}_t$ at\nthe current point, $x_t$. 
Luckily, $\\hat{f}_t$ is $H$-smooth, so $\\nabla \\hat{f}_{t-i}(x_{t-i})$ should not be much different than\n$\\nabla \\hat{f}_{t-i}(x_t)$, provided that we show that $x_{t-i}$ and $x_t$ are close to each other in Euclidean distance.\nTo show that $x_{t-i}$ and $x_t$ are close, we exploit the stability of the dual averaging algorithm.\nParticularly, the first claim in Lemma 7 states that $\\|x_s - x_{s+1}\\|_{x_s}$ is controlled by $\\|\\bar{g}_s\\|_{x_s,*}$ for all $s$, so now\nwe need to show that $\\|\\bar{g}_s\\|_{x_s,*}$ is small. However, $\\bar{g}_t$ is the average of $k + 1$ gradient estimates taken\nat different points; each $\\hat{g}_{t-i}$ is designed to have a small norm with respect to its own local norm\n$\\|\\cdot\\|_{x_{t-i},*}$; for all we know, it may be very large with respect to the current local norm $\\|\\cdot\\|_{x_t,*}$. So\nnow we need to show that the local norms at $x_{t-i}$ and $x_t$ are similar. We could prove this if we knew\nthat $x_{t-i}$ and $x_t$ are close to each other, which is exactly what we set out to prove in the beginning.\nThis chicken-and-egg situation complicates our analysis considerably.\nAnother non-trivial component of our proof is the variance reduction analysis. The motivation\nto average $\\hat{g}_{t-k}, \\ldots, \\hat{g}_t$ is to generate new gradient estimates with a smaller variance. While the\nrandom vectors $\\hat{g}_1, \\ldots, \\hat{g}_T$ are not independent, we show that their randomness is uncorrelated.\nTherefore, the variance of $\\bar{g}_t$ is $k + 1$ times smaller than the variance of each $\\hat{g}_t$. However, to make\nthis argument formal, we again require the local norms at $x_{t-i}$ and $x_t$ to be similar.\nTo make things more complicated, there is the recurring need to move back and forth between local\nnorms and the Euclidean norm, since the latter is used in the definition of Lipschitz and smoothness.\nAll of this has to do with bounding Eq. (8b), the regret with respect to the pretend loss functions\n$\\bar{f}_1, \\ldots, \\bar{f}_T$; an additional bias term appears in the analysis of Eq. 
(8a).\nWe conclude the paper by stating our main lemmas and sketching the proof of Lemma 11. The full\ntechnical proofs are all deferred to the supplementary material and replaced with some high level\ncommentary.\n\nTo break the chicken-and-egg situation described above, we begin with a crude bound on $\\|\\bar{g}_t\\|_{x_t,*}$,\nwhich does not benefit at all from the averaging operation. We simultaneously prove that the local\nnorms at $x_{t-i}$ and $x_t$ are similar.\nLemma 12. If the parameters $k$, $\\eta$, and $\\delta$ are chosen such that $12 k \\eta d \\le \\delta$ then for all $t$,\n\n(i) $\\|\\bar{g}_t\\|_{x_t,*} \\le 2d/\\delta$;\n(ii) for any $0 \\le i \\le k$ such that $t - i \\ge 1$ it holds that $\\frac{1}{2} \\|z\\|_{x_{t-i},*} \\le \\|z\\|_{x_t,*} \\le 2 \\|z\\|_{x_{t-i},*}$.\n\nLemma 12 itself has a chicken-and-egg aspect, which we resolve using an inductive proof technique.\nArmed with the knowledge that the local norms at $x_{t-i}$ and $x_t$ are similar, we go on to prove the\nmore refined bound on $\\mathbb{E}_t[\\|\\bar{g}_t\\|_{x_t,*}^2]$, which does benefit from averaging.\nLemma 13. If the parameters $k$, $\\eta$, and $\\delta$ are chosen such that $12 k \\eta d \\le \\delta$ then\n\n$$\\mathbb{E}_t\\big[\\|\\bar{g}_t\\|_{x_t,*}^2\\big] \\le 2 D^2 L^2 + \\frac{32 d^2}{\\delta^2 (k + 1)} .$$\n\nThe proof constructs a martingale difference sequence and uses the fact that its increments are\nuncorrelated. Compare the above to Lemma 8, which proves that $\\|\\hat{g}_{t-i}\\|_{x_{t-i},*}^2 \\le d^2/\\delta^2$ and note the\nextra $k + 1$ in our denominator: all of our hard work was aimed at getting this factor.\nNext, we set out to bound the expected Euclidean distance between $x_{t-i}$ and $x_t$. This bound is later\nneeded to exploit the $L$-Lipschitz and $H$-smooth assumptions. The crude bound on $\\|\\bar{g}_s\\|_{x_s,*}$ from\nLemma 12 is enough to satisfy the conditions of Lemma 7, which then tells us that $\\mathbb{E}[\\|x_s - x_{s+1}\\|_{x_s}]$\nis controlled by $\\mathbb{E}[\\|\\bar{g}_s\\|_{x_s,*}]$. The latter enjoys the improved bound due to Lemma 13. Integrating\nthe resulting bound over time, we obtain the following lemma.\nLemma 14. 
If the parameters $k$, $\\eta$, and $\\delta$ are chosen such that $12 k \\eta d \\le \\delta$ then for all $t$ and any\n$0 \\le i \\le k$ such that $t - i \\ge 1$, we have\n\n$$\\mathbb{E}[\\|x_{t-i} - x_t\\|_2] \\le \\frac{12 D d \\eta \\sqrt{k}}{\\delta} + O(\\eta k) .$$\n\nNotice that $x_{t-i}$ and $x_t$ may be $k$ rounds apart, but the bound scales only with $\\sqrt{k}$. Again, this is\nthe work of the averaging technique.\nFinally, we have all the tools in place to prove our main result, Lemma 11.\nProof sketch. The first term, Eq. (8a), is bounded by rewriting $\\bar{f}_t(x_t) = \\frac{1}{k+1} \\sum_{i=0}^{k} \\hat{f}_{t-i}(x_t)$ and\nthen proving that $\\hat{f}_{t-i}(x_t)$ is not very far from $\\hat{f}_t(x_t)$. This follows from the fact that $\\hat{f}_t$ is\n$L$-Lipschitz and from Lemma 14. To bound the second term, Eq. (8b), we use the convexity of each $\\bar{f}_t$\nto write\n\n$$\\mathbb{E}\\Big[\\sum_{t=1}^T \\bar{f}_t(x_t) - \\bar{f}_t(x^\\star)\\Big] \\le \\mathbb{E}\\Big[\\sum_{t=1}^T \\nabla \\bar{f}_t(x_t) \\cdot (x_t - x^\\star)\\Big] .$$\n\nWe relate the right-hand side above to\n\n$$\\mathbb{E}\\Big[\\sum_{t=1}^T \\bar{g}_t \\cdot (x_t - x^\\star)\\Big] ,$$\n\nusing the fact that $\\hat{f}_t$ is $H$-smooth and again using Lemma 14. Then, we upper bound the above\nusing Lemma 7, Theorem 5, and Lemma 13. $\\square$\n\nAcknowledgments\nWe thank Jian Ding for several critical contributions during the early stages of this research. Parts\nof this work were done while the second and third authors were at Microsoft Research, the support\nof which is gratefully acknowledged.\n\nReferences\n[1] J. Abernethy and A. Rakhlin. Beating the adaptive bandit with high probability. In Information\nTheory and Applications Workshop, 2009, pages 280\u2013289. IEEE, 2009.\n\n[2] J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for\nbandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory\n(COLT), 2008.\n\n[3] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with\nmulti-point bandit feedback. In Proceedings of the 23rd Annual Conference on Learning Theory\n(COLT), 2010.\n\n[4] A. Agarwal, D. P. 
Foster, D. Hsu, S. M. Kakade, and A. Rakhlin. Stochastic convex optimization with bandit feedback. In Advances in Neural Information Processing Systems (NIPS), 2011.\n\n[5] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1\u2013122, 2012.\n\n[6] S. Bubeck and R. Eldan. The entropic barrier: a simple and optimal universal self-concordant barrier. arXiv preprint arXiv:1412.1587, 2015.\n\n[7] S. Bubeck and R. Eldan. Multi-scale exploration of convex functions and bandit convex optimization. arXiv preprint arXiv:1507.06580, 2015.\n\n[8] S. Bubeck, O. Dekel, T. Koren, and Y. Peres. Bandit convex optimization: $\\sqrt{T}$ regret in one dimension. In Proceedings of the 28th Annual Conference on Learning Theory (COLT), 2015.\n\n[9] V. Dani, T. Hayes, and S. M. Kakade. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems (NIPS), 2008.\n\n[10] O. Dekel, J. Ding, T. Koren, and Y. Peres. Bandits with switching costs: $T^{2/3}$ regret. In Proceedings of the 46th Annual Symposium on the Theory of Computing, 2014.\n\n[11] A. D. Flaxman, A. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385\u2013394. Society for Industrial and Applied Mathematics, 2005.\n\n[12] E. Hazan and K. Levy. Bandit convex optimization: Towards tight bounds. In Advances in Neural Information Processing Systems (NIPS), 2014.\n\n[13] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming, 120(1):221\u2013259, 2009.\n\n[14] Y. Nesterov and A. Nemirovskii. Interior-point polynomial algorithms in convex programming, volume 13. SIAM, 1994.\n\n[15] A. Saha and A. Tewari. 
Improved regret guarantees for online smooth convex optimization with bandit feedback. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 636\u2013642, 2011.\n\n[16] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107\u2013194, 2011.\n\n[17] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML\u201903), pages 928\u2013936, 2003.", "award": [], "sourceid": 1664, "authors": [{"given_name": "Ofer", "family_name": "Dekel", "institution": "Microsoft Research"}, {"given_name": "Ronen", "family_name": "Eldan", "institution": null}, {"given_name": "Tomer", "family_name": "Koren", "institution": "Technion"}]}