{"title": "The Price of Bandit Information for Online Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 345, "page_last": 352, "abstract": null, "full_text": "The Price of Bandit Information\n\nfor Online Optimization\n\nVarsha Dani\n\nvarsha@cs.uchicago.edu\n\nDepartment of Computer Science\n\nUniversity of Chicago\n\nChicago, IL 60637\n\nThomas P. Hayes\n\nToyota Technological Institute\n\nChicago, IL 60637\n\nhayest@tti-c.org\n\nSham M. Kakade\n\nToyota Technological Institute\n\nChicago, IL 60637\nsham@tti-c.org\n\nAbstract\n\nIn the online linear optimization problem, a learner must choose, in each round,\na decision from a set D \u2282 Rn in order to minimize an (unknown and chang-\ning) linear cost function. We present sharp rates of convergence (with respect to\nadditive regret) for both the full information setting (where the cost function is\nrevealed at the end of each round) and the bandit setting (where only the scalar\ncost incurred is revealed). In particular, this paper is concerned with the price\nof bandit information, by which we mean the ratio of the best achievable regret\n\u221a\nin the bandit setting to that in the full-information setting. For the full informa-\ntion case, the upper bound on the regret is O\u2217(\nnT ), where n is the ambient\n\u221a\ndimension and T is the time horizon. For the bandit case, we present an algorithm\nwhich achieves O\u2217(n3/2\nT ) regret \u2014 all previous (nontrivial) bounds here were\nO(poly(n)T 2/3) or worse. It is striking that the convergence rate for the bandit\nsetting is only a factor of n worse than in the full information case \u2014 in stark\n\u221a\ncontrast to the K-arm bandit setting, where the gap in the dependence on K is\nT log K). We also present lower bounds showing that\nexponential (\nthis gap is at least\nn, which we conjecture to be the correct order. 
The bandit algorithm we present can be implemented efficiently in special cases of particular interest, such as path planning and Markov Decision Problems.

1 Introduction

In the online linear optimization problem (as in Kalai and Vempala [2005]), at each timestep the learner chooses a decision x_t from a decision space D ⊂ R^n and incurs a cost L_t · x_t, where the loss vector L_t is in R^n. This paper considers the case where the sequence of loss vectors L_1, . . . , L_T is arbitrary; that is, no statistical assumptions are made about the data generation process. The goal of the learner is to minimize her regret, the difference between the incurred loss on the sequence and the loss of the best single decision in hindsight. After playing x_t at time t, the two most natural sources of feedback that the learner receives are either complete information of the loss vector L_t (referred to as the full information case) or only the scalar feedback of the incurred loss L_t · x_t (referred to as the partial feedback or "bandit" case).

The online linear optimization problem has been receiving increasing attention as a paradigm for structured decision making in dynamic environments, with potential applications to network routing,

                  K-Arm                       Linear Optimization
                  Full        Partial         Full        Partial
                                                          I.I.D.    Expectation    High Probability
Lower Bound       √(T ln K)   √(TK)           √(nT)       n√T       n√T            n√T
Upper Bound       √(T ln K)   √(TK)           √(nT)       n√T       n^{3/2}√T      n^{3/2}√T
Efficient Algo    N/A         N/A             Sometimes   Yes       Sometimes      ?

Table 1: Summary of Regret Bounds: Only the leading dependence on n and T is shown (so some log factors are dropped). The results in bold are provided in this paper. The results for the K-arm case are from Freund and Schapire [1997], Auer et al. [1998]. The i.i.d.
column is the stochastic setting (where the loss vectors are drawn from some fixed underlying distribution) and the results are from Dani et al. [2008]. The expectation column refers to the expected regret for an arbitrary sequence of loss vectors (considered in this paper). The high probability column follows from a forthcoming paper, Bartlett et al. [2007]; these results also hold in the adaptive adversary setting, where the loss vectors could change in response to the learner's previous decisions. The Efficient Algo row refers to whether or not there is an efficient implementation: "yes" means there is a polytime algorithm (for the stated upper bound) which only uses access to a certain optimization oracle (as in Kalai and Vempala [2005]) and "sometimes" means the algorithm can be implemented efficiently only in special cases (such as Path Planning). See text for further details.

path planning, job scheduling, etc. This paper focuses on the fundamental regrets achievable for the online linear optimization problem in both the full and partial information feedback settings, as functions of both the dimensionality n and the time horizon T. In particular, this paper is concerned with what might be termed the price of bandit information: how much worse the regret is in the partial information case as compared to the full information case.

In the K-arm case (where D is the set of K choices), much work has gone into obtaining sharp regret bounds. These results are summarized in the left two columns of Table 1. For the full information case, the exponential weights algorithm, Hedge, of Freund and Schapire [1997] provides the regret listed. For the partial information case, there is a long history of sharp regret bounds in various settings (particularly in statistical settings where i.i.d. assumptions are made), dating back to Robbins [1952].
In the (non-statistical) adversarial case, the algorithm of Auer et al. [1998] provides the regret listed in Table 1 for the partial information setting. This case has a convergence rate that is exponentially worse than the full information case (as a function of K).

There are a number of issues that we must address in obtaining sharp convergence for the online linear optimization problem. The first issue is to understand the natural quantities in terms of which to state upper and lower bounds. It is natural to consider the case where the loss is uniformly bounded (say in [0, 1]). Clearly, the dimensionality n and the time horizon T are fundamental quantities. For the full information case, all previous bounds (see, e.g., Kalai and Vempala [2005]) also have dependencies on the diameters of the decision and cost spaces. It turns out that these are extraneous quantities: with the bounded loss assumption, one need not explicitly consider diameters of the decision and cost spaces. Hence, even in the full information case, to obtain a sharp upper bound we need a new argument to get an upper bound that is stated only in terms of n and T (and we do this via a relatively straightforward appeal to Hedge).

The second (and more technically demanding) issue is to obtain a sharp bound for the partial information case. Here, for the K-arm bandit case, the regret is O*(√(KT)). Trivially, we can appeal to this result in the linear optimization case to obtain a √(|D|T) regret by setting K to be the size of D. However, the regret could have a very poor n dependence, as |D| could be exponential in n (or worse). In contrast, note that in the full information case, we could appeal to the K-arm case to obtain O(√(T log |D|)) regret, which in many cases is acceptable (such as when D is exponential in n). The primary motivation for different algorithms in the full information case (e.g.
Kalai and Vempala [2005]) was for computational reasons. In contrast, in the partial information case, we seek a new algorithm in order to just obtain a sharper convergence rate (of course, we are still also interested in efficient implementations). The goal here is to provide a regret that is O*(poly(n) √T).

In fact, the partial information case (for linear optimization) has been receiving increasing interest in the literature [Awerbuch and Kleinberg, 2004, McMahan and Blum, 2004, Dani and Hayes, 2006]. Here, all regrets provided are O(poly(n) T^{2/3}) or worse. We should note that some of the results here [Awerbuch and Kleinberg, 2004, Dani and Hayes, 2006] are stated in terms of only n and T (without referring to the diameters of various spaces). There is only one (non-trivial) special case [Gyorgy et al., 2007] in the literature where an O*(poly(n) √T) regret has been established, and this case assumes significantly more feedback than in the partial information case: their result is for Path Planning (where D is the set of paths on a graph and n is the number of edges) and the feedback model assumes that the learner receives the weight along each edge that is traversed (significantly more information than just the scalar loss). The current paper provides the first O*(poly(n) √T) regret for the general online linear optimization problem with scalar feedback; in particular, our algorithm has an expected regret that is O*(n^{3/2} √T).

The final issue to address here is lower bounds, which are not extant in the literature. This paper provides lower bounds for both the full and partial information case. We believe these lower bounds are tight, up to log factors.

We have attempted to summarize the extant results in the literature (along with the results in this paper) in Table 1. We believe that we have a near complete picture of the achievable rates.
One striking result is that the price of bandit information is relatively small: the upper bound is only a factor of n worse than in the full information case. In fact, the lower bounds suggest the partial feedback case is only worse by a factor of √n. Contrast this to the K-arm case, where the full information case does exponentially better as a function of K.

As we believe that the lower bounds are sharp, we conjecture that the price of bandit information is only √n. Part of our reasoning is due to our previous result [Dani et al., 2008] in the i.i.d. case (where the linear loss functions are sampled from a fixed, time invariant distribution); there, we provided an upper bound on the regret of only O*(n√T). That bound was achieved by a deterministic algorithm which was a generalization of the celebrated algorithm of Lai and Robbins [1985] for the K-arm case (in the i.i.d. setting).

Finally, we should note that this paper primarily focuses on the achievable regrets, not on efficient implementations. In much of the previous work in the literature (for both the full and partial information case), the algorithms can be implemented efficiently provided access to a certain optimization oracle. We are not certain whether our algorithms can be implemented efficiently, in general, with only this oracle access. However, as our algorithms use the Hedge algorithm of Freund and Schapire [1997], for certain important applications efficient implementations do exist, based on dynamic programming. Examples include problems such as Path Planning (for instance, in routing network traffic), and also Markov Decision Problems, one of the fundamental models for long-term planning in AI. This idea has been developed by Takimoto and Warmuth [2003] and also applied by Gyorgy et al.
[2007] (mentioned earlier) for Path Planning; the extension to Markov Decision Problems is relatively straightforward (based on dynamic programming).

The paper is organized as follows. In Section 2, we give a formal description of the problem. Then in Section 3 we present upper bounds for both the full information and bandit settings. Finally, in Section 4 we present lower bounds for both settings. All results in this paper are summarized in Table 1 (along with previous work).

2 Preliminaries

Let D ⊂ R^n denote the decision space. The learner plays the following T-round game against an oblivious adversary. First, the adversary chooses a sequence L_1, . . . , L_T of loss vectors in R^n. We assume that the loss vectors are admissible, meaning they satisfy the boundedness property that for each t and for all x ∈ D, 0 ≤ L_t · x = L_t^† x ≤ 1. On each round t, the learner must choose a decision x_t in D, which results in a loss of ℓ_t = L_t^† x_t. Throughout the paper we represent x ∈ D and L_t as column vectors and use v^† to denote the transpose of a column vector v. In the full information case, L_t is revealed to the learner after time t. In the partial information case, only the incurred loss ℓ_t (and not the vector L_t) is revealed.

If x_1, . . . , x_T are the decisions the learner makes in the game, then the total loss is Σ_{t=1}^T L_t^† x_t. The cumulative regret is defined by

R = Σ_{t=1}^T L_t^† x_t − min_{x∈D} Σ_{t=1}^T L_t^† x

In other words, the learner's loss is compared to the loss of the best single decision in hindsight. The goal of the learner is to make a sequence of decisions that guarantees low regret. For the partial information case, our upper bounds on the regret are only statements that hold in expectation (with respect to the learner's randomness).
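As a concrete check of this definition, here is a minimal sketch (not from the paper; the helper names and the toy numbers are ours) that computes the cumulative regret of a played sequence against the best single decision in hindsight:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cumulative_regret(plays, losses, decision_set):
    # Loss actually incurred by the learner's sequence of decisions.
    incurred = sum(dot(L, x) for L, x in zip(losses, plays))
    # Loss of the best single decision in hindsight, against the summed loss vectors.
    n = len(losses[0])
    total_L = [sum(L[i] for L in losses) for i in range(n)]
    best = min(dot(total_L, x) for x in decision_set)
    return incurred - best

# Toy example in R^2: D is the standard basis, the losses favor the first coordinate.
D = [(1.0, 0.0), (0.0, 1.0)]
losses = [(0.25, 0.75)] * 10
plays = [D[1]] * 10  # the learner stubbornly plays the worse decision
print(cumulative_regret(plays, losses, D))  # 7.5 - 2.5 = 5.0
```

Playing the better decision every round would give regret 0, matching the definition's guarantee that the comparator is the best fixed decision, not the best sequence.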
The lower bounds provided hold with high probability.

This paper also assumes the learner has access to a barycentric spanner (as defined by Awerbuch and Kleinberg [2004]) of the decision region; such a spanner is useful for exploration. This is a subset of n linearly independent vectors of the decision space, such that every vector in the decision space can be expressed as a linear combination of elements of the spanner with coefficients in [−1, 1]. Awerbuch and Kleinberg [2004] showed that any full rank compact set in R^n has a barycentric spanner. Furthermore, an almost barycentric spanner (where the coefficients are in [−2, 2]) can be found efficiently (with certain oracle access). In view of these remarks, we assume, without loss of generality, that D contains the standard basis vectors e_1, . . . , e_n and that D ⊂ [−1, 1]^n. We refer to the set {e_1, . . . , e_n} as the spanner. Note that with this assumption, ‖x‖_2 ≤ √n for all x ∈ D.

3 Upper Bounds

The decision set D may be potentially large or even uncountably infinite. However, for the purposes of designing algorithms with sharp regret bounds, the following lemma shows that we need only concern ourselves with finite decision sets: any decision set may be approximated to sufficiently high accuracy by a suitably small set (which is a 1/√T-net for D).

Lemma 3.1. Let D ⊂ [−1, 1]^n be an arbitrary decision set. Then there is a set D̃ ⊂ D of size at most (4nT)^{n/2} such that for every sequence of admissible loss vectors, the optimal loss for D̃ is within an additive √(nT) of the optimal loss for D.

Proof sketch. For each x ∈ D suppose we truncate each coordinate of x to only the first (1/2) log(nT) bits.
Now from all x ∈ D which result in the same truncated representation, we select a single representative to be included in D̃. This results in a set D̃ of size at most (4nT)^{n/2} which is a 1/√T-net for D. That is, every x ∈ D is at distance at most 1/√T from its nearest neighbor in D̃. Since an admissible loss vector has norm at most √n, summing over the T rounds of the game, we see that the optimal loss for D̃ is within an additive √(nT) of the optimal loss for D.

For implementation purposes, it may be impractical to store the decision set (or the covering net of the decision set) explicitly as a list of points. However, our algorithms only require the ability to sample from a specific distribution over the decision set. Furthermore, in many cases of interest the full decision set is finite and exponential in n, so we can directly work with D (rather than a cover of D). As discussed in the Introduction, in many important cases of interest this can actually be accomplished using time and space which are only logarithmic in |D|; this is because Hedge can be implemented efficiently for these special cases.

3.1 With Full Information

In the full information setting, the algorithm Hedge of Freund and Schapire [1997] guarantees regret at most O(√(T log |D|)). Since we may modify D so that log |D| is O(n log n log T), this gives us regret O*(√(nT)). Note that we are only concerned with the regret here; Hedge may in general be quite inefficient to implement.
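To make the exponential-weights update behind this bound concrete, here is a minimal sketch (not from the paper; the function names and toy tuning are ours) of Hedge over a small finite decision set, where each decision's weight decays exponentially in its cumulative loss:

```python
import math
import random

def hedge(decision_set, loss_vectors, eta):
    # One weight per decision; the loss of decision x at round t is L_t . x.
    weights = [1.0] * len(decision_set)
    total_loss = 0.0
    for L in loss_vectors:
        Z = sum(weights)
        probs = [w / Z for w in weights]
        # Play a decision drawn from the current exponential-weights distribution.
        x = random.choices(decision_set, weights=probs, k=1)[0]
        total_loss += sum(a * b for a, b in zip(L, x))
        # Full information: the whole vector L is revealed, so every weight updates.
        for i, d in enumerate(decision_set):
            weights[i] *= math.exp(-eta * sum(a * b for a, b in zip(L, d)))
    return total_loss

# Toy run: two decisions, the first always incurs zero loss.
random.seed(0)
D = [(1.0, 0.0), (0.0, 1.0)]
losses = [(0.0, 1.0)] * 50
print(hedge(D, losses, eta=0.5))
```

The tuning η ≈ √(ln|D| / T) recovers the O(√(T ln|D|)) guarantee quoted above; here η is fixed arbitrarily for the toy run, under which the weight mass shifts to the better decision within a few rounds.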
However, in many special cases of interest, efficient implementations are in fact possible, as discussed in the Introduction.

We also note that under the relatively minor assumption of the existence of an oracle for offline optimization, the algorithm of Kalai and Vempala [2005] is an efficient algorithm for this setting. However, it appears that their regret is O(n√T) rather than O(√(nT)); their regret bounds are stated in terms of diameters of the decision and cost spaces, but we can bound these in terms of n, which leads to the O(n√T) regret for their algorithm.

3.2 With Bandit Information

We now present the Geometric Hedge algorithm (shown in Algorithm 3.1) that achieves low expected regret for the setting where only the observed loss, ℓ_t = L_t · x_t, is received as feedback. This algorithm is motivated by the algorithms in Auer et al. [1998] (designed for the K-arm case), which use Hedge (with estimated losses) along with a γ probability of exploration.

Algorithm 3.1: GeometricHedge(D, γ, η)
  ∀x ∈ D, p_1(x) ← 1/|D|
  for t ← 1 to T
    ∀x ∈ D, p̂_t(x) = (1 − γ) p_t(x) + (γ/n) 1{x ∈ spanner}
    Sample x_t according to distribution p̂_t
    Incur and observe loss ℓ_t := L_t^† x_t
    C_t := E_{p̂_t}[x x^†]
    L̂_t := ℓ_t C_t^{-1} x_t
    ∀x ∈ D, p_{t+1}(x) ∝ p_t(x) e^{−η L̂_t^† x}

In the Geometric Hedge algorithm, there is a γ probability of exploring with the spanner on each round (motivated by Awerbuch and Kleinberg [2004]). The estimated losses we feed into Hedge are determined by the estimator L̂_t of L_t. Note that the algorithm is well defined, as C_t is always non-singular. The following lemma shows why this estimator is sensible.

Lemma 3.2. On each round t, L̂_t is an unbiased estimator for the true loss vector L_t.

Proof.
L̂_t = ℓ_t C_t^{-1} x_t = (L_t · x_t) C_t^{-1} x_t = C_t^{-1} x_t (x_t^† L_t). Therefore

E[L̂_t] = E[C_t^{-1} x_t (x_t^† L_t)] = C_t^{-1} E[x_t x_t^†] L_t = C_t^{-1} C_t L_t = L_t

where all the expectations are over the random choice of x_t drawn from p̂_t.

In the K-arm case, where n = K and D = {e_1, . . . , e_K}, Algorithm 3.1 specializes to the Exp3 algorithm of Auer et al. [1998].

Note that if |D| is exponential in the dimension n then, in general, maintaining and sampling from the distributions p_t and p̂_t is very expensive in terms of running time. However, in many special cases of interest, this can actually be implemented efficiently.

We now state the main technical result of the paper.

Theorem 3.3. Let γ = n^{3/2}/√T and η = 1/√(nT) in Algorithm 3.1. For any sequence L_1, . . . , L_T of admissible loss vectors, let R denote the regret of Algorithm 3.1 on this sequence. Then

E R ≤ ln|D| √(nT) + 2 n^{3/2} √T

As before, since we may replace D with a set of size O((nT)^{n/2}) for an additional regret of only √(nT), the regret is O*(n^{3/2} √T). Moreover, if |D| ≤ c^n for some constant c, as is the case for the online shortest path problem, then E R = O(n^{3/2} √T).

3.3 Analysis of Algorithm 3.1

In this section, we prove Theorem 3.3. We start by providing the following bound on the sizes of the estimated loss vectors used by Algorithm 3.1.

Lemma 3.4. For each x ∈ D and 1 ≤ t ≤ T, the estimated loss vector L̂_t satisfies

|L̂_t · x| ≤ n²/γ

Proof. First, let us examine C_t. Let λ_1, . . . , λ_n be the eigenvalues of C_t, and v_1, . . . , v_n be the corresponding (orthonormal) eigenvectors.
Since C_t := E_{p̂_t}[x x^†] and λ_i = v_i^† C_t v_i, we have

λ_i = v_i^† E_{p̂_t}[x x^†] v_i = Σ_{x∈D} p̂_t(x) (x · v_i)²    (1)

and so

λ_i = Σ_{x∈D} p̂_t(x) (x · v_i)² ≥ Σ_{x∈spanner} p̂_t(x) (x · v_i)² ≥ (γ/n) Σ_{j=1}^n (e_j · v_i)² = (γ/n) ‖v_i‖² = γ/n.

It follows that the eigenvalues λ_1^{-1}, . . . , λ_n^{-1} of C_t^{-1} are each at most n/γ. Hence, for each x,

|L̂_t · x| = |ℓ_t C_t^{-1} x_t · x| ≤ (n/γ) |ℓ_t| ‖x_t‖_2 ‖x‖_2 ≤ n²/γ

where we have used the upper bound on the eigenvalues and the upper bound of √n on ‖x‖_2 for x ∈ D.

The following proposition is Theorem 3.1 in Auer et al. [1998], restated in our notation (for losses instead of gains). We state it here without proof. Denote Φ_M(η) := (e^{Mη} − 1 − Mη)/M².

Proposition 3.5 (from Auer et al. [1998]). For every x* ∈ D, the sequence of estimated loss vectors L̂_1, . . . , L̂_T and the probability distributions p_1, . . . , p_T satisfy

Σ_{t=1}^T Σ_{x∈D} p_t(x) L̂_t · x ≤ Σ_{t=1}^T L̂_t · x* + ln|D|/η + (Φ_M(η)/η) Σ_{t=1}^T Σ_{x∈D} p_t(x) (L̂_t · x)²

where M = n²/γ is an upper bound on |L̂_t · x|.

Before we are ready to complete the proof, two technical lemmas are useful.

Lemma 3.6. For each x ∈ D and 1 ≤ t ≤ T,

E_{x_t∼p̂_t}[(L̂_t · x)²] ≤ x^† C_t^{-1} x

Proof. Since (L̂_t · x)² = x^† L̂_t L̂_t^† x and L̂_t L̂_t^† = ℓ_t² C_t^{-1} x_t x_t^† C_t^{-1} with ℓ_t² ≤ 1, we have

E[(L̂_t · x)²] = x^† E[ℓ_t² C_t^{-1} x_t x_t^† C_t^{-1}] x ≤ x^† C_t^{-1} E[x_t x_t^†] C_t^{-1} x = x^† C_t^{-1} C_t C_t^{-1} x = x^† C_t^{-1} x

Lemma 3.7.
For each 1 ≤ t ≤ T,

Σ_{x∈D} p̂_t(x) x^† C_t^{-1} x = n

Proof. The spectral decomposition of C_t^{-1} is V B V^† where B is diagonal (with the inverse eigenvalues as the diagonal entries) and V is orthogonal (with the columns being the eigenvectors). This implies that x^† C_t^{-1} x = Σ_{i=1}^n λ_i^{-1} (x · v_i)². Using Equation 1, it follows that

Σ_{x∈D} p̂_t(x) x^† C_t^{-1} x = Σ_{x∈D} p̂_t(x) Σ_{i=1}^n λ_i^{-1} (x · v_i)² = Σ_{i=1}^n λ_i^{-1} Σ_{x∈D} p̂_t(x) (x · v_i)² = Σ_{i=1}^n λ_i^{-1} λ_i = Σ_{i=1}^n 1 = n

We are now ready to complete the proof of Theorem 3.3.

Proof.
We now have, for any x* ∈ D,

Σ_{t,x} p̂_t(x) L̂_t · x = Σ_{t=1}^T Σ_{x∈D} ((1 − γ) p_t(x) + (γ/n) 1{∃j : x = e_j}) L̂_t · x
  ≤ (1 − γ) (Σ_{t=1}^T L̂_t · x* + ln|D|/η + (Φ_M(η)/η) Σ_{t,x} p_t(x) (L̂_t · x)²) + (γ/n) Σ_{t=1}^T Σ_{j=1}^n L̂_t · e_j
  ≤ Σ_{t=1}^T L̂_t · x* + ln|D|/η + (Φ_M(η)/η) Σ_{t,x} p̂_t(x) (L̂_t · x)² + (γ/n) Σ_{t=1}^T Σ_{j=1}^n L̂_t · e_j

where the last step uses (1 − γ) p_t(x) ≤ p̂_t(x). Taking expectations and using the unbiased property,

E[Σ_{t,x} p̂_t(x) L̂_t · x] ≤ Σ_{t=1}^T L_t · x* + ln|D|/η + (Φ_M(η)/η) E[Σ_{t,x} p̂_t(x) (L̂_t · x)²] + (γ/n) Σ_{t=1}^T Σ_{j=1}^n L_t · e_j
  = Σ_{t=1}^T L_t · x* + ln|D|/η + (Φ_M(η)/η) E[Σ_{t,x} p̂_t(x) E_{x_t∼p̂_t}[(L̂_t · x)²]] + (γ/n) Σ_{t=1}^T Σ_{j=1}^n L_t · e_j
  ≤ Σ_{t=1}^T L_t · x* + ln|D|/η + (Φ_M(η)/η) nT + γT

where we have used Lemmas 3.6 and 3.7 in the last step.

Setting γ = n^{3/2}/√T and η = 1/√(nT) gives Mη = n²η/γ ≤ 1, which implies that

Φ_M(η) = (e^{Mη} − 1 − Mη)/M² ≤ M²η²/M² = η²

where the inequality comes from the fact that for α ≤ 1, e^α ≤ 1 + α + α².
With the above, we have

E[Σ_{t,x} p̂_t(x) L̂_t · x] ≤ Σ_{t=1}^T L_t · x* + ln|D| √(nT) + 2 n^{3/2} √T

The proof is completed by noting that

E[Σ_{t,x} p̂_t(x) L̂_t · x] = E[Σ_{t,x} p̂_t(x) E(L̂_t | H_t) · x] = E[Σ_{t,x} p̂_t(x) L_t · x] = E[Σ_t L_t · x_t]

is the expected total loss of the algorithm.

4 Lower Bounds

4.1 With Full Information

We now present a family of distributions which establishes an Ω(√(nT)) lower bound for i.i.d. loss vectors in the full information setting. In the remainder of the paper, we assume for convenience that the incurred losses are in the interval [−1, 1] rather than [0, 1]. (This changes the bounds by at most a factor of 2.)

Example 4.1. For a given S ⊆ {1, . . . , n} and 0 < ε < 1, we define a random loss vector L as follows. Choose i ∈ {1, . . . , n} uniformly at random. Let σ ∈ ±1 be 1 with probability (1 + ε)/2 and −1 otherwise. Set

L = σ e_i if i ∈ S, and L = −σ e_i if i ∉ S

Let D_{S,ε} denote the distribution of L.

Theorem 4.2. Suppose the decision set D is the unit hypercube {−1, 1}^n. For any full-information linear optimization algorithm A, and for any positive integer T, there exists S ⊆ {1, . . . , n} such that for loss vectors L_1, . . . , L_T sampled i.i.d. according to D_{S, √(n/T)}, the expected regret is Ω(√(nT)).

Proof sketch.
Clearly, for each S and ε, the optimal decision vector for loss vectors sampled i.i.d. according to D_{S,ε} is the vector (x_1, . . . , x_n) where x_i = −1 if i ∈ S and 1 otherwise.

Suppose S is chosen uniformly at random. In this case, it is clear that the optimal algorithm chooses decision (x_1, . . . , x_n) where, for each i, the sign of x_i is the same as the minority of past occurrences of loss vectors ±e_i (in case of a tie, the value of x_i doesn't matter).

Note that at every time step when the empirical minority incorrectly predicts the bias for coordinate i, the optimal algorithm incurs expected regret Ω(ε/n). By a standard application of Stirling's estimates, one can show that until coordinate i has been chosen Ω(1/ε²) times, the probability that the empirical majority disagrees with the long-run average is Ω(1). In expectation, this requires Ω(n/ε²) time steps. Summing over the n arms, the overall expected regret is thus at least Ω(n (ε/n) min{T, n/ε²}) = Ω(min{εT, n/ε}). Setting ε = √(n/T) yields the desired bound.

4.2 With Bandit Information

Next we prove that the hypercube decision set {0, 1}^n and the family of distributions D_{S,ε} can be used to establish an Ω(n√T) lower bound in the bandit setting.

Theorem 4.3. Suppose the decision set D is the hypercube {0, 1}^n. For any bandit linear optimization algorithm A, and for any positive integer T, there exists S ⊆ {1, . . . , n} such that for loss functions L_1, . . . , L_T sampled i.i.d. according to D_{S, n/√T}, the expected regret is Ω(n√T).

Proof sketch. Again, for each S and ε, the optimal decision vector for loss vectors sampled i.i.d. according to D_{S,ε} is just the indicator vector for the set S.

Suppose S is chosen uniformly at random.
Unlike the proof of Theorem 4.2, we do not attempt to characterize the optimal algorithm for this setting.

Note that, for every 1 ≤ i ≤ n, every time step when the algorithm incorrectly sets x_i ≠ 1{i ∈ S} contributes Ω(ε/n) to the expected regret. Let us fix i ∈ {1, . . . , n} and prove a lower bound on its expected contribution to the total regret. To simplify matters, let us consider the best algorithm conditioned on the value of S\{i}. It is not hard to see that the problem of guessing the membership of i in S based on t past measurements can be recast as a problem of deciding between two possible means which differ by ε/n, given a sequence of t i.i.d. Bernoulli random variables with one of the two unknown means, where each of the means is a priori equally likely. But for this problem, the error probability is Ω(1) unless t = Ω((n/ε)²). Thus we have shown that the expected contribution of coordinate i to the total regret is Ω(min{T, (n/ε)²} ε/n). Summing over the n arms gives an overall expected regret of Ω(min{εT, n²/ε}). Setting ε = n/√T completes the proof.

References

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: the adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science (1995). IEEE Computer Society Press, Los Alamitos, CA. Extended version, 24pp., dated June 8, 1998; available from R. Schapire's website.

B. Awerbuch and R. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the 36th ACM Symposium on Theory of Computing (STOC), 2004.

P. Bartlett, V. Dani, T. P. Hayes, S. M. Kakade, A. Rakhlin, and A. Tewari. High probability regret bounds for online optimization (working title). Manuscript, 2007.

V. Dani, T. P. Hayes, and S. M. Kakade.
Stochastic linear optimization under bandit feedback. In submission, 2008.

V. Dani and T. P. Hayes. Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In Proceedings of the 17th ACM-SIAM Symposium on Discrete Algorithms (SODA), 2006.

Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

A. Gyorgy, T. Linder, G. Lugosi, and G. Ottucsak. The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8:2369–2403, 2007.

A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–25, 1985.

H. B. McMahan and A. Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In Proceedings of the 17th Annual Conference on Learning Theory (COLT), 2004.

H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 55:527–535, 1952.

E. Takimoto and M. K. Warmuth. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4:773–818, 2003.
", "award": [], "sourceid": 758, "authors": [{"given_name": "Varsha", "family_name": "Dani", "institution": null}, {"given_name": "Sham", "family_name": "Kakade", "institution": null}, {"given_name": "Thomas", "family_name": "Hayes", "institution": null}]}