{"title": "Contextual bandits with surrogate losses: Margin bounds and efficient algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 2621, "page_last": 2632, "abstract": "We use surrogate losses to obtain several new regret bounds and new algorithms for contextual bandit learning. Using the ramp loss, we derive a new margin-based regret bound in terms of standard sequential complexity measures of a benchmark class of real-valued regression functions. Using the hinge loss, we derive an efficient algorithm with a $\\sqrt{dT}$-type mistake bound against benchmark policies induced by $d$-dimensional regressors. Under realizability assumptions, our results also yield classical regret bounds.", "full_text": "Contextual bandits with surrogate losses:\nMargin bounds and efficient algorithms\n\nDylan J. Foster\nCornell University\n\ndjfoster@cs.cornell.edu\n\nAkshay Krishnamurthy\nMicrosoft Research, NYC\nakshay@cs.umass.edu\n\nAbstract\n\nWe use surrogate losses to obtain several new regret bounds and new algorithms for\ncontextual bandit learning. Using the ramp loss, we derive new margin-based regret\nbounds in terms of standard sequential complexity measures of a benchmark class\nof real-valued regression functions. Using the hinge loss, we derive an efficient\nalgorithm with a $\\sqrt{dT}$-type mistake bound against benchmark policies induced by\nd-dimensional regressors. Under realizability assumptions, our results also yield\nclassical regret bounds.\n\n1 Introduction\n\nWe study sequential prediction problems with partial feedback, mathematically modeled as contextual\nbandits [29]. In this formalism, a learner repeatedly (a) observes a context, (b) selects an action, and\n(c) receives a loss for the chosen action. The objective is to learn a policy for selecting actions with\nlow loss, formally measured via regret with respect to a class of benchmark policies. 
Contextual\nbandit algorithms have been successfully deployed in online recommendation systems [5], mobile\nhealth platforms [48], and elsewhere.\nIn this paper, we use surrogate loss functions to derive new margin-based algorithms and regret bounds\nfor contextual bandits. Surrogate loss functions are ubiquitous in supervised learning (cf. [49, 8, 43]).\nComputationally, they are used to replace NP-hard optimization problems with tractable ones, e.g.,\nthe hinge loss makes binary classification amenable to convex programming techniques. Statistically,\nthey also enable sharper generalization analysis for models including boosting, SVMs, and neural\nnetworks [43, 6], by replacing dependence on dimension in VC-type bounds with distribution-dependent\nquantities. For example, to agnostically learn d-dimensional halfspaces the optimal\nrates for excess risk are $\\sqrt{d/n}$ for the 0/1 loss benchmark and $\\frac{1}{\\gamma} \\cdot \\sqrt{1/n}$ for the $\\gamma$-margin loss\nbenchmark [27], meaning the margin bound removes explicit dependence on dimension. Curiously,\nsurrogate losses have seen limited use in partial information settings (some exceptions are discussed\nbelow). This paper demonstrates that these desirable computational and statistical properties indeed\nextend to contextual bandits.\nIn the first part of the paper we focus on statistical issues, namely whether any algorithm can achieve a\ngeneralization of the classical margin bound from statistical learning [10] in the adversarial contextual\nbandit setting. Our aim here is to introduce a theory of learnability for contextual bandits, in analogy\nwith statistical and online learning, and our results provide an information-theoretic benchmark\nfor future algorithm designers. 
We consider benchmark policies induced by a class of real-valued\nregression functions and obtain a regret bound in terms of the class\u2019 sequential metric entropy, a\nstandard complexity measure in online learning [42]. As a consequence, we show that $\\tilde{O}(T^{\\frac{d}{d+1}})$\nregret is achievable for Lipschitz contextual bandits in d-dimensional metric spaces, improving on a\nrecent result of Cesa-Bianchi et al. [14], and that an $\\tilde{O}(T^{2/3})$ mistake bound is achievable for bandit\nmulticlass prediction in smooth Banach spaces, extending Kakade et al. [26].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nTechnically, these results build on the non-constructive minimax analysis of Rakhlin et al. [42], which,\nfor the online adversarial setting, prescribes a recipe for characterizing statistical behavior of arbitrary\nclasses, and thus provides a counterpart to empirical risk minimization in statistical learning. Indeed,\nfor full-information problems, this approach yields regret bounds in terms of sequential analogues\nof standard complexity measures including Rademacher complexity and metric entropy. However,\nsince we work in the contextual bandit setting, we must extend these arguments to incorporate partial\ninformation. To do so, we leverage the adaptive minimax framework of Foster et al. [20] along with a\ncareful \u201cadaptive\u201d chaining argument.\nIn the second part of the paper, we focus on computational issues and derive two new algorithms using\nthe hinge loss as a convex surrogate. The first algorithm, HINGE-LMC, provably runs in polynomial\ntime and achieves a $\\sqrt{dT}$-mistake bound against d-dimensional benchmark regressors with convexity\nproperties. 
HINGE-LMC is the first efficient algorithm with a $\\sqrt{dT}$-mistake bound for bandit multiclass prediction using a surrogate loss without curvature, and so it provides a new resolution to the open problem of Abernethy and Rakhlin [2]. This algorithm is based on the exponential weights update, along with Langevin Monte Carlo for efficient sampling and a careful action selection scheme. The second algorithm is much simpler: in the stochastic setting, Follow-The-Leader with appropriate smoothing matches our information-theoretic results for sufficiently large classes.\n\n1.1 Preliminaries\n\nLet $\\mathcal{X}$ denote a context space and $\\mathcal{A} = \\{1, \\ldots, K\\}$ a discrete action space. In the adversarial contextual bandits problem, for each of $T$ rounds, an adversary chooses a pair $(x_t, \\ell_t)$ where $x_t \\in \\mathcal{X}$ is the context and $\\ell_t \\in [0, 1]^K =: \\mathcal{L}$ is a loss vector. The learner observes the context $x_t$, chooses an action $a_t$, and incurs loss $\\ell_t(a_t) \\in [0, 1]$, which is also observed. The goal of the learner is to minimize the cumulative loss over the $T$ rounds, and, in particular, we would like to design learning algorithms that achieve low regret against a class $\\Pi \\subset (\\mathcal{X} \\to \\mathcal{A})$ of benchmark policies:\n\n$$\\mathrm{Regret}(T, \\Pi) := \\sum_{t=1}^{T} \\mathbb{E}[\\ell_t(a_t)] - \\inf_{\\pi \\in \\Pi} \\sum_{t=1}^{T} \\mathbb{E}[\\ell_t(\\pi(x_t))].$$\n\nIn this paper, we always identify $\\Pi$ with a class of vector-valued regression functions $\\mathcal{F} \\subset (\\mathcal{X} \\to \\mathbb{R}^K_{=0})$, where $\\mathbb{R}^K_{=0} := \\{s \\in \\mathbb{R}^K : \\sum_a s_a = 0\\}$. We use the notation $f(x) \\in \\mathbb{R}^K$ to denote the vector-valued output and $f(x)_a$ to denote the $a$th component. 
Note that we are assuming $\\sum_a f(x)_a = 0$, which is a natural generalization of the standard formulation for binary classification [8] and appears in Pires et al. [34]. Define $B := \\sup_{f \\in \\mathcal{F}} \\sup_{x \\in \\mathcal{X}} \\|f(x)\\|_\\infty$ to be the maximum value predicted by any regressor.\nOur algorithms use importance weighting to form unbiased loss estimates. If at round $t$, the algorithm chooses action $a_t$ by sampling from a distribution $p_t \\in \\Delta(\\mathcal{A})$, the loss estimate is defined as $\\hat{\\ell}_t(a) := \\ell_t(a_t) \\mathbf{1}\\{a_t = a\\} / p_t(a)$. Given $p_t$, we also define a smoothed distribution as $p_t^{\\mu} := (1 - K\\mu) p_t + \\mu$ for some parameter $\\mu \\in [0, 1/K]$.\nWe introduce two surrogate loss functions, the ramp loss and the hinge loss, whose scalar versions are defined as $\\phi^{\\gamma}(s) := \\min(\\max(1 + s/\\gamma, 0), 1)$ and $\\psi^{\\gamma}(s) := \\max(1 + s/\\gamma, 0)$ respectively, for $\\gamma > 0$. For $s \\in \\mathbb{R}^K$, $\\phi^{\\gamma}(s)$ and $\\psi^{\\gamma}(s)$ are defined coordinate-wise. We start with a simple lemma, demonstrating how $\\phi^{\\gamma}$, $\\psi^{\\gamma}$ act as surrogates for cost-sensitive multiclass losses.\nLemma 1 (Surrogate Loss Translation). For $s \\in \\mathbb{R}^K_{=0}$, define $\\pi_{\\mathrm{ramp}}(s), \\pi_{\\mathrm{hinge}}(s) \\in \\Delta(\\mathcal{A})$ by $\\pi_{\\mathrm{ramp}}(s)_a \\propto \\phi^{\\gamma}(s_a)$ and $\\pi_{\\mathrm{hinge}}(s)_a \\propto \\psi^{\\gamma}(s_a)$. For any vector $\\ell \\in \\mathbb{R}^K_{+}$, we have\n\n$$\\langle \\pi_{\\mathrm{ramp}}(s), \\ell \\rangle \\le \\langle \\ell, \\phi^{\\gamma}(s) \\rangle \\le \\sum_{a \\in \\mathcal{A}} \\ell(a) \\mathbf{1}\\{s_a \\ge -\\gamma\\}, \\quad \\text{and} \\quad \\langle \\pi_{\\mathrm{hinge}}(s), \\ell \\rangle \\le K^{-1} \\langle \\ell, \\psi^{\\gamma}(s) \\rangle.$$\n\nBased on this lemma, it will be convenient to define $L_T^{\\gamma}(f) := \\sum_{t=1}^{T} \\sum_{a \\in \\mathcal{A}} \\ell_t(a) \\mathbf{1}\\{f(x_t)_a \\ge -\\gamma\\}$, which is the margin-based cumulative loss for the regressor $f$. $L_T^{\\gamma}$ should be seen as a cost-sensitive multiclass analogue of the classical margin loss in statistical learning [10]. 
We use the term \u201csurrogate loss\u201d here because these quantities upper bound the cost-sensitive loss: $\\ell(\\mathrm{argmax}_a s_a) \\le \\langle \\ell, \\phi^{\\gamma}(s) \\rangle \\le \\langle \\ell, \\psi^{\\gamma}(s) \\rangle$.1 In the sequel, $\\pi_{\\mathrm{ramp}}$ and $\\pi_{\\mathrm{hinge}}$ are used by our algorithms, but do not define the benchmark policy class, since we compare directly to $L_T^{\\gamma}$ or the surrogate loss.\n\n1 On a related note, the information-theoretic results we present are also compatible with the surrogate function $\\theta^{\\gamma}(s)_a := \\max\\{1 + (s_a - \\max_{a'} s_{a'})/\\gamma, 0\\}$, which also satisfies $\\ell(\\mathrm{argmax}_a s_a) \\le \\langle \\ell, \\theta^{\\gamma}(s) \\rangle$. This leads to a perhaps more standard notion of multiclass margin bound but does not lead to efficient algorithms.\n\nRelated work. Contextual bandit learning has been the subject of intense investigation over the past decade. The most natural categorization of these works is between parametric, realizability-based, and agnostic approaches. Parametric methods (e.g., [1, 16]) assume a (generalized) linear relationship between the losses and the contexts/actions. Realizability-based methods generalize parametric ones by assuming the losses are predictable by some abstract regression class [3, 21]. Agnostic approaches (e.g., [7, 29, 4, 38, 46, 47]) avoid realizability assumptions and instead compete with VC-type policy classes for statistical tractability. Our work contributes to all of these directions, as our margin bounds apply to the agnostic adversarial setting and yield true regret bounds under realizability assumptions.\nA special case of contextual bandits is bandit multiclass prediction, where the loss vector is zero for one action and one for all others [26]. Several recent papers obtain surrogate regret bounds and efficient algorithms for this setting when the benchmark regressor class $\\mathcal{F}$ consists of linear functions [26, 24, 9, 22]. 
Our work contributes to this line in two ways: our bounds and algorithms extend beyond linear/parametric classes, and we consider the more general contextual bandit setting.\nOur information-theoretic results on achievability are similar in spirit to those of Daniely and Halbertal [18], who derive tight generic bounds for bandit multiclass prediction in terms of the Littlestone dimension. This result is incomparable to our own: their bounds are on the 0/1 loss regret directly rather than surrogate regret, but the Littlestone dimension is not a tight complexity measure for real-valued function classes in agnostic settings, which is our focus.\nAt a technical level, our work builds on several recent results. To derive achievable regret bounds, we use the adaptive minimax framework of Foster et al. [20], along with a new adaptive chaining argument to control the supremum of a martingale process [42]. Our HINGE-LMC algorithm is based on log-concave sampling [12], and it uses randomized smoothing [19] and the geometric resampling trick of Neu and Bart\u00f3k [33]. We also use several ideas from classification calibration [49, 8], and, in particular, the surrogate hinge loss we work with is studied by Pires et al. [34].\n\n2 Achievable regret bounds\n\nThis section provides generic surrogate regret bounds for contextual bandits in terms of the sequential metric entropy [41] of the regressor class $\\mathcal{F}$. 
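As a small illustration of the ramp and hinge surrogates from Subsection 1.1, the following Python sketch evaluates both losses coordinate-wise, forms the induced action distributions, and numerically checks the three inequalities of Lemma 1 on one score vector. The function names (ramp, hinge, induced_policy, check_lemma_1) and the example vectors are illustrative assumptions and do not come from the paper.

```python
def ramp(s, gamma):
    # Scalar ramp loss: min(max(1 + s / gamma, 0), 1).
    return min(max(1.0 + s / gamma, 0.0), 1.0)


def hinge(s, gamma):
    # Scalar hinge loss: max(1 + s / gamma, 0).
    return max(1.0 + s / gamma, 0.0)


def induced_policy(s, gamma, surrogate):
    # Action distribution proportional to the coordinate-wise surrogate values.
    weights = [surrogate(sa, gamma) for sa in s]
    total = sum(weights)
    return [w / total for w in weights]


def inner(u, v):
    # Plain inner product of two equal-length lists.
    return sum(ui * vi for ui, vi in zip(u, v))


def check_lemma_1(s, loss, gamma, tol=1e-12):
    # s is assumed to sum to zero and loss to be coordinate-wise nonnegative.
    assert abs(sum(s)) < 1e-9
    K = len(s)
    ramp_vec = [ramp(sa, gamma) for sa in s]
    hinge_vec = [hinge(sa, gamma) for sa in s]
    # Margin loss: total loss of the actions whose score is at least -gamma.
    margin = sum(l for sa, l in zip(s, loss) if sa >= -gamma)
    lhs_ramp = inner(induced_policy(s, gamma, ramp), loss)
    lhs_hinge = inner(induced_policy(s, gamma, hinge), loss)
    return (lhs_ramp <= inner(loss, ramp_vec) + tol
            and inner(loss, ramp_vec) <= margin + tol
            and lhs_hinge <= inner(loss, hinge_vec) / K + tol)


scores = [0.5, -0.1, -0.4]   # a score vector in R^K with sum zero
losses = [0.2, 1.0, 0.7]     # a loss vector in [0, 1]^K
print(check_lemma_1(scores, losses, gamma=0.25))
```

The sum-to-zero constraint is what makes the normalizers work: some coordinate of the score vector is nonnegative, so the ramp values sum to at least 1, and the hinge values sum to at least K.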
Notably, our general techniques apply when the ramp loss is used as a surrogate, and so, via Lemma 1, they yield the main result of the section, a margin-based regret guarantee, as a special case.\nTo motivate our approach, consider a well-known reduction from bandits to full information online learning: If a full information algorithm achieves a regret bound in terms of the so-called local norms $\\sum_t \\langle p_t, \\ell_t^2 \\rangle$, then running the full information algorithm on importance-weighted losses $\\hat{\\ell}_t(a)$ yields an expected regret bound for the bandit setting. For example, when $\\Pi$ is finite, EXP4 [7] uses HEDGE [23] as the full information algorithm, and obtains a deterministic regret bound of\n\n$$\\mathrm{Regret}(T, \\Pi) \\le \\frac{\\eta}{2} \\sum_{t=1}^{T} \\mathbb{E}_{\\pi \\sim p_t} \\langle \\pi(x_t), \\hat{\\ell}_t \\rangle^2 + \\frac{\\log(|\\Pi|)}{\\eta}, \\qquad (1)$$\n\nwhere $\\eta > 0$ is the learning rate and $p_t$ is the distribution over policies in $\\Pi$ (inducing an action distribution) for round $t$. Evaluating conditional expectations and optimizing $\\eta$ yields a regret bound of $O(\\sqrt{KT \\log(|\\Pi|)})$, which is optimal for contextual bandits with a finite policy class.\nTo use this reduction beyond the finite class case and with surrogate losses we face two challenges:\n1. Infinite classes. The natural approach of using a pointwise (or sup-norm) cover for $\\mathcal{F}$ is insufficient, not only because there are classes that have infinite pointwise covers yet are online-learnable, but also because it yields sub-optimal rates even when a finite pointwise cover is available. 
Instead, we establish existence of a full-information algorithm for large nonparametric classes that has 1) strong adaptivity to loss scaling as in (1) and 2) regret scaling with the sequential covering number for $\\mathcal{F}$, which is the correct generalization of the empirical covering number in statistical learning to the adversarial online setting. This is achieved via non-constructive methods.\n2. Variance control. With surrogate losses, controlling the variance/local norm term $\\mathbb{E}_{\\pi} \\langle \\pi(x_t), \\hat{\\ell}_t \\rangle^2$ in the reduction from bandit to full information is more challenging, since the surrogate loss of a policy depends on the scale of the underlying regressor, not just the action it selects. To address this, we develop a new sampling scheme tailored to scale-sensitive losses.\n\nFull-information regret bound. We consider the following full information protocol, which in the sequel will be instantiated via reduction from contextual bandits. Let the context space $\\mathcal{X}$ and $\\mathcal{A}$ be fixed as in Subsection 1.1, and consider a function class $\\mathcal{G} \\subset (\\mathcal{X} \\to \\mathcal{S})$, where $\\mathcal{S} \\subseteq \\mathbb{R}^K_{+}$. The reader may think of $\\mathcal{G}$ as representing $\\phi^{\\gamma} \\circ \\mathcal{F}$ or $\\psi^{\\gamma} \\circ \\mathcal{F}$, i.e. the surrogate loss composed with the regressor class, so that $\\mathcal{S}$ (which is not necessarily convex) represents the image of the surrogate loss over $\\mathcal{F}$.\nThe online learning protocol is: For time $t = 1, \\ldots, T$, (1) the learner observes $x_t$ and chooses a distribution $p_t \\in \\Delta(\\mathcal{S})$, (2) the adversary picks a loss vector $\\ell_t \\in \\mathcal{L} \\subset \\mathbb{R}^K_{+}$, (3) the learner samples outcome $s_t \\sim p_t$ and experiences loss $\\langle s_t, \\ell_t \\rangle$. 
Regret against the benchmark class $\\mathcal{G}$ is given by\n\n$$\\sum_{t=1}^{T} \\mathbb{E}_{s_t \\sim p_t} \\langle s_t, \\ell_t \\rangle - \\inf_{g \\in \\mathcal{G}} \\sum_{t=1}^{T} \\langle g(x_t), \\ell_t \\rangle.$$\n\nAs our complexity measure, we use a multi-output generalization of sequential covering numbers introduced by Rakhlin et al. [41]. Define a $\\mathcal{Z}$-valued tree $z$ to be a sequence of mappings $z_t : \\{\\pm 1\\}^{t-1} \\to \\mathcal{Z}$. The tree $z$ is a complete rooted binary tree with nodes labeled by elements of $\\mathcal{Z}$, where for any \u201cpath\u201d $\\epsilon \\in \\{\\pm 1\\}^T$, $z_t(\\epsilon) := z_t(\\epsilon_{1:t-1})$ is the value of the node at level $t$ on the path $\\epsilon$.\nDefinition 1. For a function class $\\mathcal{G} \\subset (\\mathcal{X} \\to \\mathbb{R}^K)$ and $\\mathcal{X}$-valued tree $x$ of length $T$, the $L_\\infty / \\ell_\\infty$ sequential covering number2 for $\\mathcal{G}$ on $x$ at scale $\\varepsilon$, denoted by $\\mathcal{N}_{\\infty,\\infty}(\\varepsilon, \\mathcal{G}, x)$, is the cardinality of the smallest set $V$ of $\\mathbb{R}^K$-valued trees for which\n\n$$\\forall g \\in \\mathcal{G} \\;\\; \\forall \\epsilon \\in \\{\\pm 1\\}^T \\;\\; \\exists v \\in V \\;\\text{s.t.}\\; \\max_{t \\in [T]} \\| g(x_t(\\epsilon)) - v_t(\\epsilon) \\|_\\infty \\le \\varepsilon.$$\n\nDefine $\\mathcal{N}_{\\infty,\\infty}(\\varepsilon, \\mathcal{G}, T) := \\sup_{x : \\mathrm{length}(x) = T} \\mathcal{N}_{\\infty,\\infty}(\\varepsilon, \\mathcal{G}, x)$.\nWe refer to $\\log \\mathcal{N}_{\\infty,\\infty}$ as the sequential metric entropy. 
Note that in the binary case, for learning unit $\\ell_2$ norm linear functions in $d$ dimensions, the pointwise metric entropy is $O(d \\log(1/\\varepsilon))$, whereas the sequential metric entropy is $O(d \\log(1/\\varepsilon) \\wedge \\varepsilon^{-2} \\log(d))$, leading to improved rates in high dimension.\nWith this definition, we can now state our main theorem for full information.\nTheorem 2. Assume3 $\\sup_{\\ell \\in \\mathcal{L}} \\|\\ell\\|_1 \\le R$ and $\\sup_{s \\in \\mathcal{S}} \\|s\\|_\\infty \\le B$. Fix any constants $\\eta \\in (0, 1]$, $\\lambda > 0$, and $\\beta > \\alpha > 0$. Then there exists an algorithm with the following deterministic regret guarantee:\n\n$$\\sum_{t=1}^{T} \\mathbb{E}_{s_t \\sim p_t} \\langle s_t, \\ell_t \\rangle - \\inf_{g \\in \\mathcal{G}} \\sum_{t=1}^{T} \\langle g(x_t), \\ell_t \\rangle \\le \\frac{2\\eta}{RB} \\sum_{t=1}^{T} \\mathbb{E}_{s_t \\sim p_t} \\langle s_t, \\ell_t \\rangle^2 + \\frac{4RB}{\\eta} \\log \\mathcal{N}_{\\infty,\\infty}(\\beta/2, \\mathcal{G}, T) + 3e^2 \\alpha \\sum_{t=1}^{T} \\|\\ell_t\\|_1 + 24e \\left( \\frac{\\lambda}{4R} \\sum_{t=1}^{T} \\|\\ell_t\\|_1^2 + \\frac{R}{\\lambda} \\right) \\int_{\\alpha}^{\\beta} \\sqrt{\\log \\mathcal{N}_{\\infty,\\infty}(\\varepsilon, \\mathcal{G}, T)} \\, d\\varepsilon. \\qquad (2)$$\n\nObserve that the bound involves the variance/local norms $\\mathbb{E}_{s_t \\sim p_t} \\langle s_t, \\ell_t \\rangle^2$, and has a very mild explicit dependence on the loss range $R$; this can be verified by optimizing over $\\eta$ and $\\lambda$. This adaptivity to the loss range is crucial for our bandit reduction. Further observe that the bound contains a Dudley-type entropy integral, which is essential for obtaining sharp rates for complex nonparametric classes.\n\nBandit reduction and variance control. 
To lift Theorem 2 to contextual bandits we use the following reduction: First, initialize the full information algorithm from Theorem 2 with $\\mathcal{G} = \\phi^{\\gamma} \\circ \\mathcal{F}$. For each round $t$, receive $x_t$, and define $P_t(a) := \\mathbb{E}_{s_t \\sim p_t} \\left[ \\frac{s_t(a)}{\\sum_{a' \\in [K]} s_t(a')} \\right]$ where $p_t$ is the full information algorithm\u2019s distribution. Then sample $a_t \\sim P_t^{\\mu}$, observe $\\ell_t(a_t)$, and pass the importance-weighted loss $\\hat{\\ell}_t(a)$ back to the algorithm. For the hinge loss we use the same strategy, but with $\\mathcal{G} = \\psi^{\\gamma} \\circ \\mathcal{F}$.\nThe following lemma shows that this strategy leads to sufficiently small variance in the loss estimates. The definition of the action distribution $P_t^{\\mu}(a)$ in terms of the real-valued predictions is crucial here.\nLemma 3. Define a filtration $\\mathcal{J}_t = \\sigma((x_1, \\ell_1, a_1), \\ldots, (x_{t-1}, \\ell_{t-1}, a_{t-1}), x_t, \\ell_t)$. Then for any $\\mu \\in [0, 1/K]$ the importance weighting strategy above guarantees\n\n$$\\mathbb{E}_{a_t \\sim P_t^{\\mu}} \\left[ \\mathbb{E}_{s_t \\sim p_t} \\langle s_t, \\hat{\\ell}_t \\rangle^2 \\mid \\mathcal{J}_t \\right] \\le \\begin{cases} K, & \\text{for } \\mathcal{S} \\subset \\Delta(\\mathcal{A}). \\\\ K^2, & \\text{for } \\mathcal{S} = \\phi^{\\gamma} \\circ \\mathcal{F}. \\\\ \\left(1 + \\frac{B}{\\gamma}\\right)^2 K^2, & \\text{for } \\mathcal{S} = \\psi^{\\gamma} \\circ \\mathcal{F}. \\end{cases}$$\n\n2 Sequential coverings for $L_p / \\ell_q$ can be defined similarly, but do not appear in the present paper.\n3 Measuring loss in $\\ell_1$ may seem restrictive, but it is natural when working with the 1-sparse importance-weighted losses, and it enables us to cover the output space in $\\ell_\\infty$ norm.\n\nFigure 1: Regret bound exponent as a function of (sequential) metric entropy. The cross marks the point $p = 2$ where the exponent from Theorem 4 changes growth rate. 
\u201cFull information\u201d refers to the optimal rate of $T^{\\frac{1}{2} \\vee \\frac{p-1}{p}}$ for the same setting under full information feedback [41]. \u201cSquare loss\u201d refers to the optimal rate of $T^{\\frac{p+1}{p+2}}$ for Lipschitz contextual bandits over metric spaces of dimension $p$, which have sequential metric entropy $\\varepsilon^{-p}$, under square loss realizability [44].\n\nTheorem 2 and Lemma 3 together imply our central theorem: a chaining-based margin bound for contextual bandits, generalizing classical results in statistical learning (cf. [10]).\nTheorem 4 (Contextual bandit margin bound). For any fixed constants $\\beta > \\alpha > 0$, smoothing parameter $\\mu \\in (0, 1)$ and margin loss parameter $\\gamma > 0$ there exists an adversarial contextual bandit strategy with expected regret against the $\\gamma$-margin benchmark bounded as\n\n$$\\mathbb{E}\\left[ \\sum_{t=1}^{T} \\ell_t(a_t) \\right] \\le \\inf_{f \\in \\mathcal{F}} \\mathbb{E}[L_T^{\\gamma}(f)] + 4\\sqrt{2 K^2 T \\log \\mathcal{N}_{\\infty,\\infty}(\\beta/2, \\mathcal{F}, T)} + \\mu K T + \\frac{8}{\\mu} \\log \\mathcal{N}_{\\infty,\\infty}(\\beta/2, \\mathcal{F}, T) + \\frac{1}{\\gamma} \\left( 3e^2 \\alpha K T + 24e \\sqrt{\\frac{KT}{\\mu}} \\int_{\\alpha}^{\\beta} \\sqrt{\\log \\mathcal{N}_{\\infty,\\infty}(\\varepsilon, \\mathcal{F}, T)} \\, d\\varepsilon \\right). \\qquad (3)$$\n\nWe derive an analogous bound for the hinge loss in Appendix C. The hinge loss bound differs only through stronger dependence on scale parameters.\nBefore showing the implications of Theorem 4 for specific classes $\\mathcal{F}$ we state a coarse upper bound in terms of the growth rate for the sequential metric entropy.\nProposition 5. 
Suppose that $\\mathcal{F}$ has sequential metric entropy growth $\\log \\mathcal{N}_{\\infty,\\infty}(\\varepsilon, \\mathcal{F}, T) \\propto \\varepsilon^{-p}$ for some $p > 0$ (nonparametric case), or that $\\log \\mathcal{N}_{\\infty,\\infty}(\\varepsilon, \\mathcal{F}, T) \\propto d \\log(1/\\varepsilon)$ (parametric case). 
Then there exists a contextual bandit strategy with the following regret guarantee:\n\n$$\\mathbb{E}\\left[ \\sum_{t=1}^{T} \\ell_t(a_t) \\right] \\le \\inf_{f \\in \\mathcal{F}} \\mathbb{E}\\left[ L_T^{\\gamma}(f) \\right] + \\begin{cases} O(K \\sqrt{dT \\log(KT/\\gamma)}), & \\text{parametric case.} \\\\ \\tilde{O}((KT)^{\\frac{p+2}{p+4}} \\gamma^{-\\frac{2p}{p+4}}), & \\text{nonparametric w/ } p \\le 2. \\\\ \\tilde{O}((KT)^{\\frac{p}{p+1}} \\gamma^{-\\frac{p}{p+1}}), & \\text{nonparametric w/ } p \\ge 2. \\end{cases} \\qquad (4)$$\n\nProposition 5 recovers the parametric rate of $\\sqrt{dT}$ seen with e.g., LINUCB [16] but is most interesting for complex classes. The rate exhibits a phase change between the \u201cmoderate complexity\u201d regime of $p \\in (0, 2]$ and the \u201chigh complexity\u201d regime of $p \\ge 2$. This is visualized in Figure 1.\nRemark 1. Under i.i.d. losses and hinge/ramp loss realizability, the standard tools of classification calibration [8] can be used to deduce a proper policy regret bound from (3). However, these realizability assumptions are somewhat non-standard, and moreover if one imposes the stronger assumption of a hard margin it is possible to derive improved rates [18]. See also Appendix B.\nRemark 2. Classical margin bounds typically hold for all values of $\\gamma$ simultaneously, but Theorem 4 requires that $\\gamma$ is chosen in advance. Learning the best value of $\\gamma$ online appears challenging.\n\nRates for specific classes. We now instantiate our results for concrete classes of interest.\nExample 1 (Finite classes). In the finite class case there is an algorithm with $O(K \\sqrt{T \\log |\\mathcal{F}|})$ margin regret. When $\\Pi \\subset (\\mathcal{X} \\to \\mathcal{A})$ is a finite policy class, directly reducing to Theorem 2 yields the optimal $O(\\sqrt{KT \\log |\\Pi|})$ policy regret, hinting at the optimality of our approach.\nExample 2 (Lipschitz CB). The class of Lipschitz functions over $[0, 1]^p$ admits a sequential cover with metric entropy $\\tilde{O}(\\varepsilon^{-p})$, so Proposition 5 implies an $\\tilde{O}(T^{\\frac{p+2}{p+4} \\vee \\frac{p}{p+1}})$ regret bound. Since our proof goes through Lemma 1, it also yields a policy regret bound against the $\\pi_{\\mathrm{ramp}}(\\cdot)$ class. Therefore, this result is directly comparable to the $\\tilde{O}(T^{\\frac{p+1}{p+2}})$ bound of Cesa-Bianchi et al. [14], applied to the $\\pi_{\\mathrm{ramp}}$ policy class. 
Our bound achieves a smaller exponent for all values of $p$ (see Figure 1).\nLearnability in full information online learning is known to be characterized entirely by the sequential Rademacher complexity of the hypothesis class [41], and tight bounds on this quantity are known for standard classes including linear predictors, decision trees, and neural networks. The next example, a corollary of Theorem 4, bounds contextual bandit margin regret in terms of sequential Rademacher complexity, which is defined for any scalar-valued function class $\\mathcal{G} \\subseteq (\\mathcal{X} \\to \\mathbb{R})$ as:\n\n$$\\mathfrak{R}(\\mathcal{G}, T) := \\sup_{x} \\mathbb{E}_{\\epsilon} \\sup_{g \\in \\mathcal{G}} \\sum_{t=1}^{T} \\epsilon_t g(x_t(\\epsilon)).$$\n\nExample 3. Let $\\mathcal{F}|_a := \\{x \\mapsto f(x)_a \\mid f \\in \\mathcal{F}\\}$ be the scalar restriction of $\\mathcal{F}$ to output coordinate $a$ and suppose that $\\max_{a \\in [K]} \\mathfrak{R}(\\mathcal{F}|_a, T) \\ge 1$ and $B \\le 1$.4 Then there exists an adversarial contextual bandit algorithm with margin regret bound $\\tilde{O}(\\max_a K (\\mathfrak{R}(\\mathcal{F}|_a, T)/\\gamma)^{2/3} T^{1/3})$.\nThus, for margin-based contextual bandits, full information learnability is equivalent to bandit learnability. Since the optimal regret in full information is $\\Omega(\\max_a \\mathfrak{R}(\\mathcal{F}|_a, T))$, it further shows that the price of bandit information is at most $\\tilde{O}(\\max_a K (T/\\mathfrak{R}(\\mathcal{F}|_a, T))^{1/3})$. Note that while this bound is fairly user-friendly, it yields worse rates than Proposition 5 when translated to sequential metric entropy, except when $p = 2$ [40]. For comparison, Rakhlin et al. [41] obtain $\\tilde{O}(\\mathfrak{R}(\\mathcal{F}, T)/\\gamma)$ margin regret in full information for binary classification. For partial information, BISTRO [38] has an $O(\\sqrt{KT \\, \\mathfrak{R}(\\Pi, T)})$ policy regret bound, which involves the policy complexity and a worse $T$ dependence than our bound, but our bound (in terms of $\\mathcal{F}$) applies only to the margin regret. 
A similar discussion applies to Theorem 4.4 of Lykouris et al. [31].\nInstantiating Example 3 with linear classes generalizes the $O(T^{2/3})$ dimension-independent guarantee of BANDITRON [26] from Euclidean geometry to arbitrary uniformly convex Banach spaces, essentially the largest linear class for which online learning is possible [45]. The result also generalizes BANDITRON from multiclass to general contextual bandits and strengthens it from hinge loss to ramp loss. Note that many subsequent works [2, 9, 22] obtain dimension-dependent $O(\\sqrt{dT})$ bounds for bandit multiclass prediction, as we will in the next section, but none have explored dimension-independent $O(T^{2/3})$-type rates, which are more appropriate for high-dimensional settings.\nExample 4. Let $\\mathcal{X}$ be the unit ball in a Banach space $(\\mathfrak{B}, \\|\\cdot\\|)$, and let $\\mathcal{F}$ be induced by stacking $K - 1$ linear predictors5 each in the unit ball of the dual space $(\\mathfrak{B}^{\\star}, \\|\\cdot\\|_{\\star})$. Suppose that $\\|\\cdot\\|$ has martingale type 2 [35], which means there exists $\\Psi : \\mathfrak{B} \\to \\mathbb{R}$ such that $\\frac{1}{2} \\|x\\|^2 \\le \\Psi(x)$ and $\\Psi$ is $\\beta$-smooth w.r.t. $\\|\\cdot\\|$.6 Then there exists a contextual bandit strategy with margin regret $O(K (T/\\gamma)^{2/3})$.\nBeyond linear classes, we also obtain $\\tilde{O}(K (T/\\gamma)^{2/3})$ margin regret when each $\\mathcal{F}|_a$ is a class of neural networks with weights in each layer bounded in the $(1, \\infty)$ group norm, or when each $\\mathcal{F}|_a$ is a class of bounded depth decision trees on finitely many decision functions. These results follow by appealing to the existing sequential Rademacher complexity bounds derived in Rakhlin et al. 
[41].\nAs our last example, we consider $\\ell_p$ spaces for $p < 2$. These spaces fail to satisfy martingale type 2 in a dimension-independent fashion, but they do satisfy martingale type $p$ without dimension dependence, and so have sequential metric entropy of order $\\varepsilon^{-\\frac{p}{p-1}}$ [39]. Moreover, in $\\mathbb{R}^d$ the $\\ell_p$ spaces admit a pointwise cover with metric entropy $O(d \\log(1/\\varepsilon))$, leading to the following dichotomy.\nExample 5. Consider the setting of Example 4, with $(\\mathfrak{B}, \\|\\cdot\\|) = (\\mathbb{R}^d, \\|\\cdot\\|_p)$ for $p \\le 2$. Then there exists a contextual bandit strategy with margin regret $\\tilde{O}(K (T/\\gamma)^{\\frac{p}{2p-1}} \\wedge K \\sqrt{dT \\log(KT/\\gamma)})$.\n\n4 This restriction serves only to simplify calculations and can be relaxed.\n5 Only $K - 1$ predictors are needed due to the sum-to-zero constraint of $\\mathbb{R}^K_{=0}$.\n6 Norms that satisfy this property with dimension-independent or logarithmic constants include $\\ell_p$ for all $p \\ge 2$, Schatten $S_p$ norms for $p \\ge 2$ (including the spectral norm), and $(2, p)$ group norms for $p \\ge 2$ [27, 28].\n\n3 Efficient Algorithms\n\nWe derive two new algorithms for contextual bandits using the hinge loss $\\psi^{\\gamma}$. The first algorithm, HINGE-LMC, focuses on the parametric setting; it is based on a continuous version of exponential weights using a log-concave sampler. The second, SMOOTHFTL, is simply Follow-The-Leader with uniform smoothing. SMOOTHFTL applies to the stochastic setting with classes that have \u201chigh complexity\u201d in the sense of Proposition 5.\n\nAlgorithm 1 HINGE-LMC\nInput: Class $\\Theta$, learning rate $\\eta$, margin parameter $\\gamma$.\nDefine $w_0(\\theta) := 1$ for all $\\theta \\in \\Theta$.\nfor $t = 1, \\ldots, T$ do\n  Receive $x_t$, set $\\theta_t \\leftarrow \\mathrm{LMC}(\\eta w_{t-1})$.\n  Set $p_t(\\cdot; \\theta_t) \\propto \\psi^{\\gamma}(f(x_t; \\theta_t))$.\n  Set $p_t^{\\mu}(\\cdot; \\theta_t) := (1 - K\\mu) p_t + \\mu$.\n  Play $a_t \\sim p_t^{\\mu}(\\cdot; \\theta_t)$, observe $\\ell_t(a_t)$.\n  // Geometric resampling.\n  for $m = 1, \\ldots, M$ do\n    $\\tilde{\\theta}_t \\leftarrow \\mathrm{LMC}(\\eta w_{t-1})$.\n    Sample $\\tilde{a}_t \\sim p_t^{\\mu}(\\cdot; \\tilde{\\theta}_t)$, if $\\tilde{a}_t = a_t$, break.\n  end for\n  Set $m_t = m$, and $\\tilde{\\ell}_t(a) := \\ell_t(a_t) \\cdot m_t \\mathbf{1}\\{a_t = a\\}$.\n  Update $w_t(\\theta) \\leftarrow w_{t-1}(\\theta) + \\langle \\tilde{\\ell}_t, \\psi^{\\gamma}(f(x_t; \\theta)) \\rangle$.\nend for\n\nAlgorithm 2 Langevin Monte Carlo (LMC)\nInput: $F(\\cdot)$, parameters $m, u, \\lambda, N, \\alpha$.\n// Parameter choices are in Appendix D.\nSet $\\tilde{\\theta}_0 \\leftarrow 0 \\in \\mathbb{R}^d$\nfor $k = 1, \\ldots, N$ do\n  Draw $z_1, \\ldots, z_m \\overset{\\mathrm{iid}}{\\sim} \\mathcal{N}(0, u^2 I_d)$. 
Define $\\tilde{F}_k(\\theta) := \\frac{1}{m} \\sum_{i=1}^{m} F(\\theta + z_i) + \\frac{\\lambda}{2} \\|\\theta\\|_2^2$.\n  Draw $\\xi_k \\sim \\mathcal{N}(0, I_d)$ and update $\\tilde{\\theta}_k \\leftarrow \\mathcal{P}_{\\Theta}\\left( \\tilde{\\theta}_{k-1} - \\frac{\\alpha}{2} \\nabla \\tilde{F}_k(\\tilde{\\theta}_{k-1}) + \\sqrt{\\alpha} \\, \\xi_k \\right)$.\nend for\nReturn $\\tilde{\\theta}_N$.\n\n3.1 Hinge-LMC\n\nFor this section, we identify $\\mathcal{F}$ with a compact convex set $\\Theta \\subset \\mathbb{R}^d$, using the notation $f(x; \\theta) \\in \\mathbb{R}^K_{=0}$ to describe the parametrized function. We assume that $\\psi^{\\gamma}(f(x; \\theta)_a)$ is convex in $\\theta$ for each $(x, a)$ pair, $\\sup_{x, \\theta} \\|f(x; \\theta)\\|_\\infty \\le B$, $f(x; \\cdot)_a$ is $L$-Lipschitz in $\\theta$ with respect to the $\\ell_2$ norm, and that $\\Theta$ contains the centered Euclidean ball of radius 1 and is contained within a Euclidean ball of radius $R$. These assumptions are satisfied when $\\mathcal{F}$ is a linear class, under appropriate boundedness conditions.\nThe pseudocode for HINGE-LMC is displayed in Algorithm 1, and all parameter settings are given in Appendix D. The algorithm is a continuous variant of exponential weights [7], where at round $t$, we define the exponential weights distribution via its density (w.r.t. the Lebesgue measure over $\\Theta$):\n\n$$P_t(\\theta) \\propto \\exp(-\\eta w_{t-1}(\\theta)), \\qquad w_{t-1}(\\theta) := \\sum_{s=1}^{t-1} \\langle \\tilde{\\ell}_s, \\psi^{\\gamma}(f(x_s; \\theta)) \\rangle,$$\n\nwhere $\\eta$ is a learning rate and $\\tilde{\\ell}_s$ is a loss vector estimate. 
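The update in Algorithm 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Theta is taken to be a centered Euclidean ball (the text only requires a compact convex set), the potential is accessed through a hypothetical (sub)gradient oracle grad_F, and the parameter values in the usage example are arbitrary rather than the settings of Appendix D.

```python
import math
import random


def project_to_ball(theta, radius):
    # Euclidean projection onto the centered ball of the given radius.
    norm = math.sqrt(sum(x * x for x in theta))
    if norm <= radius:
        return list(theta)
    return [x * radius / norm for x in theta]


def projected_lmc(grad_F, d, m, u, lam, N, alpha, radius, rng):
    # grad_F(theta) returns a (sub)gradient of the potential F at theta.
    theta = [0.0] * d
    for _step in range(N):
        # Randomized smoothing: average gradients at m Gaussian perturbations.
        grad = [0.0] * d
        for _draw in range(m):
            z = [rng.gauss(0.0, u) for _i in range(d)]
            g = grad_F([t + zi for t, zi in zip(theta, z)])
            grad = [a + b / m for a, b in zip(grad, g)]
        # Add the gradient of the regularizer (lam / 2) * ||theta||^2.
        grad = [g + lam * t for g, t in zip(grad, theta)]
        # Langevin step: half-step along the negative gradient plus noise.
        xi = [rng.gauss(0.0, 1.0) for _i in range(d)]
        moved = [t - 0.5 * alpha * g + math.sqrt(alpha) * x
                 for t, g, x in zip(theta, grad, xi)]
        theta = project_to_ball(moved, radius)
    return theta


def grad_abs(theta):
    # Subgradient of the non-smooth toy potential sum_i |theta_i - 0.3|.
    return [1.0 if t > 0.3 else -1.0 for t in theta]


sample = projected_lmc(grad_abs, d=2, m=5, u=0.1, lam=0.1,
                       N=200, alpha=0.01, radius=1.0, rng=random.Random(0))
print(sample)
```

In HINGE-LMC the potential passed to the sampler would be the scaled cumulative hinge loss, so grad_F would sum subgradients of the hinge terms over past rounds. The toy potential above only exercises the update rule.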
At a high level, at each iteration the\n\nalgorithm samples \u2713t\u223c Pt, then samples the action at from the induced policy distribution pt(\u22c5; \u2713)=\n\u21e1hinge(f(xt; \u2713t)), appropriately smoothed. The algorithm plays at and constructs a loss estimate\n\u02dc`t\uffff mt\u22c5 `t(a)1{a= at}, where mt is an approximate importance weight computed by repeatedly\na tractable log-concave sampling problem, by using the induced policy distribution \u21e1hinge(\u22c5), we are\n\nsampling from Pt. This vector \u02dc`t is passed to exponential weights to de\ufb01ne the distribution at the next\nround. To sample from Pt we use Projected Langevin Monte Carlo (LMC), displayed in Algorithm 2.\nThe algorithm has many important subtleties. Apart from passing to the hinge surrogate loss to obtain\n\nalso able to control the local norm term in the exponential weights regret bound.7 Then, the analysis\nfor Projected LMC [12] requires a smooth potential function, which we obtain by convolving with the\ngaussian density, also known as randomized smoothing [19]. We also use `2 regularization for strong\nconvexity and to overcome sampling errors introduced by randomized smoothing. Finally, we use the\ngeometric resampling technique [33] to approximate the importance weight by repeated sampling.\nHere, we state the main guarantee and its consequences. A more complete theorem statement, with\nexact parameter speci\ufb01cations and the precise running time is provided in Appendix D as Theorem 18.\n\n7This seems specialized to surrogates that can be expressed as an inner product between the loss vector and (a\ntransformation of) the prediction, so it does not apply to standard loss functions in bandit multiclass prediction.\n\n7\n\n\fE\n\n1\nK\n\nE\n\nTheorem 6 (Informal). 
Under the assumptions of Subsection 3.1, HINGE-LMC with appropriate parameter settings runs in time $\\mathrm{poly}(T, d, B, K, \\frac{1}{\\gamma}, R, L)$ and guarantees\n\n$$\\mathbb{E} \\sum_{t=1}^{T} \\ell_t(a_t) \\le \\inf_{\\theta \\in \\Theta} \\frac{1}{K} \\mathbb{E} \\sum_{t=1}^{T} \\langle \\ell_t, \\psi^{\\gamma}(f(x_t; \\theta)) \\rangle + \\tilde{O}\\Big(\\frac{B}{\\gamma} \\sqrt{dT}\\Big).$$\n\nSince bandit multiclass prediction is a special case of contextual bandits, Theorem 6 immediately implies a $\\sqrt{dT}$-mistake bound for this setting. See Appendix B for more discussion.\n\nCorollary 7 (Bandit multiclass). In the bandit multiclass setting, Algorithm 1 enjoys a mistake bound of $\\tilde{O}((B/\\gamma)\\sqrt{dT})$ against the cost-sensitive $\\gamma$-hinge loss and runs in polynomial time.\n\nAdditionally, under a realizability condition for the hinge loss, we obtain a standard regret bound. For simplicity in defining the condition, assume that for every $(x, \\ell)$ pair, $\\ell$ is a random variable with conditional mean $\\bar{\\ell}$ (chosen by the adversary) and $\\bar{\\ell}$ has a unique action with minimal loss.\n\nCorollary 8 (Realizable bound). In addition to the conditions above, assume that there exists $\\theta^{\\star} \\in \\Theta$ such that for every $(x, \\ell)$ pair and for all $a \\in \\mathcal{A}$, we have $f(x; \\theta^{\\star})_a = K\\gamma \\mathbf{1}\\{\\bar{\\ell}(a) \\le \\min_{a'} \\bar{\\ell}(a')\\} - \\gamma$. Then HINGE-LMC runs in polynomial time and guarantees\n\n$$\\sum_{t=1}^{T} \\mathbb{E}\\, \\bar{\\ell}_t(a_t) \\le \\sum_{t=1}^{T} \\mathbb{E} \\min_{a} \\bar{\\ell}_t(a) + \\tilde{O}\\Big(\\frac{B}{\\gamma} \\sqrt{dT}\\Big).$$\n\nA few comments are in order:\n\n1. The use of LMC for sampling is not strictly necessary. Other log-concave samplers exist for non-smooth potentials [30]; using one would remove the parameters $m$, $u$, and $\\lambda$, significantly simplify the algorithm, and even lead to a better run-time guarantee using current theory. However, we prefer to use LMC due to its success in Bayesian inference and deep learning, and its connections to incremental optimization methods. 
Note that more recent results in slightly different settings [36, 17, 15] suggest that it may be possible to substantially improve upon the LMC analysis that we use and even extend it to non-convex settings. We are hopeful that the LMC approach will lead to a practically useful contextual bandit algorithm and plan to explore this direction further.\n\n2. Corollary 7 provides a new solution to the open problem of Abernethy and Rakhlin [2]. In fact, it is the first efficient $\\sqrt{dT}$-type regret bound against a hinge loss benchmark, although our loss is slightly different from the multiclass hinge loss used by Kakade et al. [26] in their $T^{2/3}$-regret BANDITRON algorithm (which motivated the open problem). All prior $\\sqrt{dT}$-regret algorithms [24, 9, 22] use losses with curvature such as the multiclass logistic loss or the squared hinge loss. See Appendix B for a comparison between cost-sensitive and multiclass hinge losses.\n\n3. In Corollary 8, regret is measured relative to the policy that chooses the best action (in expectation) on every round. As in prior results [1, 3], this is possible because the realizability condition ensures that this policy is in our class. Note that here, a requirement for realizability is that $B \\ge K\\gamma$, and hence the dependence on $K$ is implicit and in fact slightly worse than the optimal rate [16].\n\n4. For Corollary 8, the best points of comparison are methods based on square-loss realizability [3, 21], although our condition is different. Compared with LINUCB and variants [16, 1] specialized to $\\ell_2/\\ell_2$ geometry, our assumptions are somewhat weaker but these methods have slightly better guarantees for linear classes (footnote 8). Compared with Foster et al. [21], which is the only other efficient approach at a comparable level of generality, our assumptions on the regressor class are stronger, but we obtain better guarantees, in particular removing distribution-dependent parameters.\n\nFootnote 8: In the abstract linear setting we take $\\mathcal{F}$ to be the set of linear functions in the ball for some norm $\\|\\cdot\\|$ and contexts to be bounded in the dual norm $\\|\\cdot\\|_{\\star}$. The runtime of HINGE-LMC will degrade (polynomially) with the ratio $\\|\\theta\\| / \\|\\theta\\|_2$, but the regret bound is the same for any such norm pair.\n\nTo summarize, HINGE-LMC is the first efficient $\\sqrt{dT}$-regret algorithm for bandit multiclass prediction using the hinge loss. It also represents a new approach to adversarial contextual bandits, yielding $\\sqrt{dT}$ policy regret under hinge-based realizability. Finally, while we lose the theoretical guarantees, the algorithm easily extends to non-convex classes, which we expect to be practically effective.\n\n3.2 SMOOTHFTL\n\nA drawback of HINGE-LMC is that it only applies in the parametric regime. We now introduce an efficient (in terms of queries to a hinge loss minimization oracle) algorithm with a regret bound similar to Theorem 4, but in the stochastic setting, where $\\{(x_t, \\ell_t)\\}_{t=1}^{T}$ are drawn i.i.d. from some joint distribution $\\mathcal{D}$ over $\\mathcal{X} \\times \\mathbb{R}^K_+$. Here we return to the abstract setting with regression class $\\mathcal{F}$, and for simplicity, we assume $B = 1$.\n\nThe algorithm we analyze is simply Follow-The-Leader with uniform smoothing and epoching, which we refer to as SMOOTHFTL. We use an epoch schedule where the $m$th epoch lasts for $n_m := 2^m$ rounds (starting with $m = 0$). At the beginning of the $m$th epoch, we compute the empirical importance weighted hinge-loss minimizer $\\hat{f}_{m-1}$ using only the data from the previous epoch. That is, we set\n\n$$\\hat{f}_{m-1} := \\operatorname*{argmin}_{f \\in \\mathcal{F}} \\sum_{\\tau = n_{m-1}}^{n_m - 1} \\langle \\hat{\\ell}_{\\tau}, \\psi^{\\gamma}(f(x_{\\tau})) \\rangle.$$\n\nThen, for each round $t$ in the $m$th epoch, we sample $a_t$ from $p_t := (1 - K\\mu)\\, \\pi_{\\mathrm{hinge}}(\\hat{f}_{m-1}(x_t)) + \\mu$. The parameter $\\mu \\in (0, 1/K]$ controls the smoothing. At time $t = 1$ we simply take $p_1$ to be uniform.\n\nTheorem 9. Suppose that $\\mathcal{F}$ satisfies $\\log \\mathcal{N}_{\\infty,\\infty}(\\varepsilon, \\mathcal{F}, T) \\propto \\varepsilon^{-p}$ for some $p > 2$. Then in the stochastic setting, with $\\mu = K^{-1} T^{-1/(p+1)}$, SMOOTHFTL enjoys the following expected regret guarantee (footnote 9):\n\n$$\\sum_{t=1}^{T} \\mathbb{E}\\, \\ell_t(a_t) \\le \\inf_{f \\in \\mathcal{F}} \\frac{T}{K} \\mathbb{E} \\langle \\ell, \\psi^{\\gamma}(f(x)) \\rangle + \\tilde{O}\\big((T/\\gamma)^{\\frac{p}{p+1}}\\big).$$\n\nThis provides an algorithmic counterpart to Proposition 5 in the $p \\ge 2$ regime. The algorithm is quite similar to EPOCH-GREEDY [29], and the main contribution here is to provide a careful analysis for large function classes. We leave obtaining an oracle-efficient algorithm that matches Proposition 5 in the regime $p \\in (0, 2)$ as an open problem.\n\nA similar bound can be obtained for the ramp loss by simply replacing the hinge loss ERM. We analyze the hinge loss version because standard (e.g. linear) classes admit efficient hinge loss minimization oracles. Interestingly, the bound in Theorem 9 actually improves on Proposition 5, in that it is independent of $K$. This is due to the scaling of the hinge loss in Lemma 1.\n\nIn Appendix F, we extend the analysis to the stochastic Lipschitz contextual bandit setting. Here, instead of measuring regret against the benchmark $\\psi^{\\gamma} \\circ \\mathcal{F}$ we compare to the class of all 1-Lipschitz functions from $\\mathcal{X}$ to $\\Delta(\\mathcal{A})$, where $\\mathcal{X}$ is a metric space of bounded covering dimension. We show that SMOOTHFTL achieves $T^{\\frac{p}{p+1}}$ regret with a $p$-dimensional context space and finite action space. This improves on the $T^{\\frac{p+1}{p+2}}$ bound of Cesa-Bianchi et al. [14], as in Example 2, yet the best available lower bound is $T^{\\frac{p-1}{p}}$ [25]. 
Closing this gap remains an intriguing open problem.\n\nFootnote 9: This result is stated in terms of the sequential cover $\\mathcal{N}_{\\infty,\\infty}$ to avoid additional definitions, but can easily be improved to depend on the classical (worst-case) covering number seen in statistical learning.\n\n4 Discussion\n\nThis paper initiates a study of the utility of surrogate losses in contextual bandit learning. We obtain new margin-based regret bounds in terms of sequential complexity notions on the benchmark class, improving on the best known rates for Lipschitz contextual bandits and providing dimension-independent bounds for linear classes. On the algorithmic side, we provide the first solution to the open problem of Abernethy and Rakhlin [2] with a non-curved loss and we also show that Follow-the-Leader with uniform smoothing performs well in nonparametric settings.\n\nYet, several open problems remain. First, our bounds in Section 2 are likely suboptimal in the dependence on $K$, and improving this is a natural direction. Other questions involve deriving stronger lower bounds (e.g., for the non-parametric setting) and adapting to the margin parameter. We also hope to experiment with HINGE-LMC, and develop a better understanding of computational-statistical tradeoffs with surrogate losses. We look forward to studying these questions in future work.\n\nAcknowledgements. We thank Haipeng Luo, Karthik Sridharan, Chen-Yu Wei, and Chicheng Zhang for several helpful discussions. D.F. acknowledges the support of the NDSEG PhD fellowship and Facebook PhD fellowship.\n\nReferences\n\n[1] Yasin Abbasi-Yadkori, D\u00e1vid P\u00e1l, and Csaba Szepesv\u00e1ri. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2011.\n\n[2] Jacob D. Abernethy and Alexander Rakhlin. An efficient bandit algorithm for $O(\\sqrt{T})$-regret in online multiclass prediction? 
In Conference on Learning Theory, 2009.\n\n[3] Alekh Agarwal, Miroslav Dud\u00edk, Satyen Kale, John Langford, and Robert E. Schapire. Contextual bandit learning with predictable rewards. In Artificial Intelligence and Statistics, 2012.\n\n[4] Alekh Agarwal, Daniel J. Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, 2014.\n\n[5] Alekh Agarwal, Sarah Bird, Markus Cozowicz, Luong Hoang, John Langford, Stephen Lee, Jiaji Li, Dan Melamed, Gal Oshri, Oswaldo Ribas, Siddhartha Sen, and Aleksandrs Slivkins. Making contextual decisions with low technical debt. arXiv:1606.03966, 2016.\n\n[6] Martin Anthony and Peter L. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.\n\n[7] Peter Auer, Nicol\u00f2 Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.\n\n[8] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 2006.\n\n[9] Alina Beygelzimer, Francesco Orabona, and Chicheng Zhang. Efficient online bandit multiclass learning with $\\tilde{O}(\\sqrt{T})$ regret. In International Conference on Machine Learning, 2017.\n\n[10] St\u00e9phane Boucheron, Olivier Bousquet, and G\u00e1bor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 2005.\n\n[11] St\u00e9phane Boucheron, G\u00e1bor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.\n\n[12] S\u00e9bastien Bubeck, Ronen Eldan, and Joseph Lehec. Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete and Computational Geometry, 2018.\n\n[13] Nicol\u00f2 Cesa-Bianchi and G\u00e1bor Lugosi. 
Prediction, Learning, and Games. Cambridge University Press, 2006.\n\n[14] Nicol\u00f2 Cesa-Bianchi, Pierre Gaillard, Claudio Gentile, and S\u00e9bastien Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. In Conference on Learning Theory, 2017.\n\n[15] Xiang Cheng, Niladri S. Chatterji, Yasin Abbasi-Yadkori, Peter L. Bartlett, and Michael I. Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv:1805.01648, 2018.\n\n[16] Wei Chu, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, 2011.\n\n[17] Arnak S. Dalalyan and Avetik G. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. arXiv:1710.00095, 2017.\n\n[18] Amit Daniely and Tom Halbertal. The price of bandit information in multiclass online classification. In Conference on Learning Theory, 2013.\n\n[19] John C. Duchi, Peter L. Bartlett, and Martin J. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 2012.\n\n[20] Dylan J. Foster, Alexander Rakhlin, and Karthik Sridharan. Adaptive online learning. In Advances in Neural Information Processing Systems, 2015.\n\n[21] Dylan J. Foster, Alekh Agarwal, Miroslav Dud\u00edk, Haipeng Luo, and Robert E. Schapire. Practical contextual bandits with regression oracles. In International Conference on Machine Learning, 2018.\n\n[22] Dylan J. Foster, Satyen Kale, Haipeng Luo, Mehryar Mohri, and Karthik Sridharan. Logistic regression: The importance of being improper. In Conference on Learning Theory, 2018.\n\n[23] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.\n\n[24] Elad Hazan and Satyen Kale. 
Newtron: an efficient bandit algorithm for online multiclass prediction. In Advances in Neural Information Processing Systems, 2011.\n\n[25] Elad Hazan and Nimrod Megiddo. Online learning with prior knowledge. In Conference on Learning Theory, 2007.\n\n[26] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Efficient bandit algorithms for online multiclass prediction. In International Conference on Machine Learning, 2008.\n\n[27] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, 2009.\n\n[28] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 2012.\n\n[29] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, 2008.\n\n[30] L\u00e1szl\u00f3 Lov\u00e1sz and Santosh Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 2007.\n\n[31] Thodoris Lykouris, Karthik Sridharan, and \u00c9va Tardos. Small-loss bounds for online learning with partial information. In Conference on Learning Theory, 2018.\n\n[32] Hariharan Narayanan and Alexander Rakhlin. Efficient sampling from time-varying log-concave distributions. Journal of Machine Learning Research, 2017.\n\n[33] Gergely Neu and G\u00e1bor Bart\u00f3k. An efficient algorithm for learning with semi-bandit feedback. In International Conference on Algorithmic Learning Theory, 2013.\n\n[34] Bernardo \u00c1vila Pires, Csaba Szepesv\u00e1ri, and Mohammad Ghavamzadeh. Cost-sensitive multiclass classification risk bounds. In International Conference on Machine Learning, 2013.\n\n[35] Gilles Pisier. Martingales with values in uniformly convex spaces. 
Israel Journal of Mathematics, 1975.\n\n[36] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, 2017.\n\n[37] Alexander Rakhlin and Karthik Sridharan. Online nonparametric regression with general loss functions. arXiv:1501.06598, 2015.\n\n[38] Alexander Rakhlin and Karthik Sridharan. BISTRO: An efficient relaxation-based method for contextual bandits. In International Conference on Machine Learning, 2016.\n\n[39] Alexander Rakhlin and Karthik Sridharan. On equivalence of martingale tail bounds and deterministic regret inequalities. In Conference on Learning Theory, 2017.\n\n[40] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In Advances in Neural Information Processing Systems, 2010.\n\n[41] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning via sequential complexities. Journal of Machine Learning Research, 2015.\n\n[42] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 2015.\n\n[43] Robert E. Schapire and Yoav Freund. Boosting: Foundations and algorithms. MIT Press, 2012.\n\n[44] Aleksandrs Slivkins. Contextual bandits with similarity information. In Conference on Learning Theory, 2011.\n\n[45] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. On the universality of online mirror descent. In Advances in Neural Information Processing Systems, 2011.\n\n[46] Vasilis Syrgkanis, Akshay Krishnamurthy, and Robert E. Schapire. Efficient algorithms for adversarial contextual learning. In International Conference on Machine Learning, 2016.\n\n[47] Vasilis Syrgkanis, Haipeng Luo, Akshay Krishnamurthy, and Robert E. Schapire. 
Improved regret bounds for oracle-based adversarial contextual bandits. In Advances in Neural Information Processing Systems, 2016.\n\n[48] Ambuj Tewari and Susan A. Murphy. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, 2017.\n\n[49] Tong Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 2004.", "award": [], "sourceid": 1329, "authors": [{"given_name": "Dylan", "family_name": "Foster", "institution": "Cornell University"}, {"given_name": "Akshay", "family_name": "Krishnamurthy", "institution": "Microsoft"}]}