{"title": "Adaptive Online Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3375, "page_last": 3383, "abstract": "We propose a general framework for studying adaptive regret bounds in the online learning setting, subsuming model selection and data-dependent bounds. Given a data- or model-dependent bound we ask, \u201cDoes there exist some algorithm achieving this bound?\u201d We show that modifications to recently introduced sequential complexity measures can be used to answer this question by providing sufficient conditions under which adaptive rates can be achieved. In particular each adaptive rate induces a set of so-called offset complexity measures, and obtaining small upper bounds on these quantities is sufficient to demonstrate achievability. A cornerstone of our analysis technique is the use of one-sided tail inequalities to bound suprema of offset random processes.Our framework recovers and improves a wide variety of adaptive bounds including quantile bounds, second order data-dependent bounds, and small loss bounds. In addition we derive a new type of adaptive bound for online linear optimization based on the spectral norm, as well as a new online PAC-Bayes theorem.", "full_text": "Adaptive Online Learning\n\nDylan J. Foster\u2217\n\nCornell University\n\nAlexander Rakhlin \u2020\n\nUniversity of Pennsylvania\n\nKarthik Sridharan\u2217\n\nCornell University\n\nAbstract\n\nWe propose a general framework for studying adaptive regret bounds in the online\nlearning setting, subsuming model selection and data-dependent bounds. Given a\ndata- or model-dependent bound we ask, \u201cDoes there exist some algorithm achiev-\ning this bound?\u201d We show that modi\ufb01cations to recently introduced sequential\ncomplexity measures can be used to answer this question by providing suf\ufb01cient\nconditions under which adaptive rates can be achieved. 
In particular, each adaptive rate induces a set of so-called offset complexity measures, and obtaining small upper bounds on these quantities is sufficient to demonstrate achievability. A cornerstone of our analysis technique is the use of one-sided tail inequalities to bound suprema of offset random processes.
Our framework recovers and improves a wide variety of adaptive bounds, including quantile bounds, second-order data-dependent bounds, and small-loss bounds. In addition, we derive a new type of adaptive bound for online linear optimization based on the spectral norm, as well as a new online PAC-Bayes theorem.

1 Introduction

Some of the recent progress on the theoretical foundations of online learning has been motivated by the parallel developments in the realm of statistical learning. In particular, this motivation has led to martingale extensions of empirical process theory, which were shown to be the “right” notions for online learnability. Two topics, however, have remained elusive thus far: obtaining data-dependent bounds and establishing model selection (or, oracle-type) inequalities for online learning problems. In this paper we develop new techniques for addressing both of these questions.
Oracle inequalities and model selection have been topics of intense research in statistics in the last two decades [1, 2, 3]. Given a sequence of models M_1, M_2, . . . whose union is M, one aims to derive a procedure that selects, given an i.i.d. sample of size n, an estimator f̂ from a model M_m̂ that trades off bias and variance. Roughly speaking, the desired oracle bound takes the form

err(f̂) ≤ inf_m { inf_{f∈M_m} err(f) + pen_n(m) },

where pen_n(m) is a penalty for the model m. Such oracle inequalities are attractive because they can be shown to hold even if the overall model M is too large.
A central idea in the proofs of such statements (and an idea that will appear throughout the present paper) is that pen_n(m) should be “slightly larger” than the fluctuations of the empirical process for the model m. It is therefore not surprising that concentration inequalities, and particularly Talagrand’s celebrated inequality for the supremum of the empirical process, have played an important role in attaining oracle bounds. In order to select a good model in a data-driven manner, one establishes non-asymptotic data-dependent bounds on the fluctuations of an empirical process indexed by elements in each model [4].
∗Department of Computer Science
†Department of Statistics

Lifting the ideas of oracle inequalities and data-dependent bounds from statistical to online learning is not an obvious task. For one, there is no concentration inequality available, even for the simple case of sequential Rademacher complexity. (For the reader already familiar with this complexity: a change of the value of one Rademacher variable results in a change of the remaining path, and hence an attempt to use a version of a bounded difference inequality grossly fails.) Luckily, as we show in this paper, the concentration machinery is not needed and one only requires a one-sided tail inequality. This realization is motivated by the recent work of [5, 6, 7]. At a high level, our approach will be to develop one-sided inequalities for the suprema of certain offset processes [7], where the offset is chosen to be “slightly larger” than the complexity of the corresponding model.
We then show\nthat these offset processes determine which data-dependent adaptive rates are achievable for online\nlearning problems, drawing strong connections to the ideas of statistical learning described earlier.\n\n1.1 Framework\n\nn\ufffft=1\n\nE\uffff n\ufffft=1\n\nrates for a variety of loss functions [8, 7].\n\nrandomized algorithm for selecting \u02c6yt such that\n\n`(\u02c6yt, yt)\u2212 inf\nf\u2208F\nn\ufffft=1\n`(f(xt), yt)\uffff\u2264Bn \u2200x1\u2236n, y1\u2236n,\n`(\u02c6yt, yt)\u2212 inf\nf\u2208F\n\nLetX be the set of observations,D the space of decisions, andY the set of outcomes. Let (S)\ndenote the set of distributions on a set S. Let `\u2236D\u00d7Y\u2192 R be a loss function. The online learning\nframework is de\ufb01ned by the following process: For t= 1, . . . , n, Nature provides input instance\nxt\u2208X ; Learner selects prediction distribution qt\u2208 (D); Nature provides label yt\u2208Y, while the\nlearner draws prediction \u02c6yt\u223c qt and suffers loss `(\u02c6yt, yt).\nTwo important settings are supervised learning (Y \u2286 R,D \u2286 R) and online linear optimization\n(X ={0} is a singleton set,Y andD are balls in dual Banach spaces and `(\u02c6y, y)=\uffff\u02c6y, y\uffff). For a\nclassF\u2286DX , we de\ufb01ne the learner\u2019s cumulative regret toF as\nn\ufffft=1\n`(f(xt), yt).\nA uniform regret boundBn is achievable if there exists a randomized algorithm selecting \u02c6yt such that\nwhere a1\u2236n stands for{a1, . . . , an}. Achievable ratesBn depend on complexity of the function class\nF. For example, sequential Rademacher complexity ofF is one of the tightest achievable uniform\nAn adaptive regret bound has the formBn(f ; x1\u2236n, y1\u2236n) and is said to be achievable if there exists a\nWe distinguish three types of adaptive bounds, according to whetherBn(f ; x1\u2236n, y1\u2236n) depends only\non f, only on(x1\u2236n, y1\u2236n), or on both quantities. 
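For concreteness, the protocol of Section 1.1 can be simulated in the experts setting with a standard exponential weights learner. The sketch below is purely illustrative: the loss data, horizon, and learning rate are hypothetical choices, not constructions from this paper.

```python
import math
import random

def exp_weights(losses, eta):
    """Exponential weights over N experts on a loss matrix (rounds x experts).
    The learner's per-round loss is the expectation under its distribution q_t.
    Returns (cumulative learner loss, cumulative loss of the best expert)."""
    n_rounds, n_experts = len(losses), len(losses[0])
    w = [1.0] * n_experts
    learner, cum = 0.0, [0.0] * n_experts
    for t in range(n_rounds):
        total = sum(w)
        q = [wi / total for wi in w]          # prediction distribution q_t
        learner += sum(qi * li for qi, li in zip(q, losses[t]))
        for i in range(n_experts):
            cum[i] += losses[t][i]
            w[i] *= math.exp(-eta * losses[t][i])  # multiplicative update
    return learner, min(cum)

random.seed(0)
n, N = 500, 8
losses = [[random.random() for _ in range(N)] for _ in range(n)]
alg_loss, best_loss = exp_weights(losses, eta=math.sqrt(8 * math.log(N) / n))
regret = alg_loss - best_loss
uniform_rate = math.sqrt(n * math.log(N) / 2)  # classical B_n for [0,1] losses
```

For [0, 1]-valued losses and this learning rate, the regret of this learner is classically guaranteed to be at most √((n/2) ln N), one example of an achievable uniform rate B_n.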
Whenever B_n depends on f, an adaptive regret bound can be viewed as an oracle inequality which penalizes each f according to a measure of its complexity (e.g. the complexity of the smallest model to which it belongs). As in statistical learning, an oracle inequality (2) may be proved for certain functions B_n(f; x_{1:n}, y_{1:n}) even if a uniform bound (1) cannot hold for any nontrivial B_n:

E[ ∑_{t=1}^n ℓ(ŷ_t, y_t) − ∑_{t=1}^n ℓ(f(x_t), y_t) ] ≤ B_n(f; x_{1:n}, y_{1:n})   ∀ x_{1:n}, y_{1:n}, ∀ f ∈ F.   (2)

1.2 Related Work

The case when B_n(f; x_{1:n}, y_{1:n}) = B_n(x_{1:n}, y_{1:n}) does not depend on f has received most of the attention in the literature. The focus is on bounds that can be tighter for “nice sequences,” yet maintain near-optimal worst-case guarantees. An incomplete list of prior work includes [9, 10, 11, 12], couched in the setting of online linear/convex optimization, and [13] in the experts setting.
A bound of type B_n(f) was studied in [14], which presented an algorithm that competes with all experts simultaneously, but with varied regret with respect to each of them depending on the quantile of the expert. Another bound of this type was given by [15], who consider online linear optimization with an unbounded set and provide oracle inequalities with an appropriately chosen function B_n(f).
Finally, the third category of adaptive bounds are those that depend on both the hypothesis f ∈ F and the data. The bounds that depend on the loss of the best function (so-called “small-loss” bounds, [16, Sec. 2.4], [17, 13]) fall in this category trivially, since one may overbound the loss of the best function by the performance of f. We draw attention to the recent result of [18], who show an adaptive bound in terms of both the loss of the comparator and the KL divergence between the comparator and some pre-fixed prior distribution over experts.
An MDL-style bound in terms of the variance of the loss of the comparator (under the distribution induced by the algorithm) was recently given in [19].
Our study was also partly inspired by Cover [20], who characterized necessary and sufficient conditions for achievable bounds in prediction of binary sequences. The methods in [20], however, rely on the structure of the binary prediction problem and do not readily generalize to other settings.
The framework we propose recovers the vast majority of known adaptive rates in the literature, including variance bounds, quantile bounds, localization-based bounds, and fast rates for small losses. It should be noted that while the existing literature on adaptive online learning has focused on simple hypothesis classes such as finite experts and finite-dimensional p-norm balls, our results extend to general hypothesis classes, including large nonparametric ones discussed in [7].

2 Adaptive Rates and Achievability: General Setup

The first step in building a general theory for adaptive online learning is to identify what adaptive regret bounds are possible to achieve. Recall that an adaptive regret bound B_n: F × X^n × Y^n → R is said to be achievable if there exists an online learning algorithm such that (2) holds. In the rest of this work, we use the notation ⟪. . .⟫_{t=1}^n to denote the interleaved application of the operators inside the brackets, repeated over t = 1, . . . , n rounds (see [21]). Achievability of an adaptive rate can be formalized by the following minimax quantity.
Definition 1. Given an adaptive rate B_n, we define the offset minimax value:

A_n(F, B_n) ≜ ⟪ sup_{x_t∈X} inf_{q_t∈Δ(D)} sup_{y_t∈Y} E_{ŷ_t∼q_t} ⟫_{t=1}^n [ ∑_{t=1}^n ℓ(ŷ_t, y_t) − inf_{f∈F} { ∑_{t=1}^n ℓ(f(x_t), y_t) + B_n(f; x_{1:n}, y_{1:n}) } ].

A_n(F, B_n) quantifies how ∑_{t=1}^n ℓ(ŷ_t, y_t) − inf_{f∈F} { ∑_{t=1}^n ℓ(f(x_t), y_t) + B_n(f; x_{1:n}, y_{1:n}) } behaves when the optimal learning algorithm that minimizes this difference is used against Nature trying to maximize it. Directly from this definition,

An adaptive rate B_n is achievable if and only if A_n(F, B_n) ≤ 0.

If B_n is a uniform rate, i.e., B_n(f; x_{1:n}, y_{1:n}) = B_n, achievability reduces to the minimax analysis explored in [8]. The uniform rate B_n is achievable if and only if B_n ≥ V_n(F), where V_n(F) is the minimax value of the online learning game.
We now focus on understanding the minimax value A_n(F, B_n) for general adaptive rates. We first show that the minimax value is bounded by an offset version of the sequential Rademacher complexity studied in [8]. The symmetrization Lemma 1 below provides us with the first step towards a probabilistic analysis of achievable rates. Before stating the lemma, we need to define the notion of a tree and the notion of sequential Rademacher complexity.
Given a set Z, a Z-valued tree z of depth n is a sequence (z_t)_{t=1}^n of functions z_t: {±1}^{t−1} → Z. One may view z as a complete binary tree decorated by elements of Z. Let ε = (ε_t)_{t=1}^n be a sequence of independent Rademacher random variables. Then (z_t(ε)) may be viewed as a predictable process with respect to the filtration S_t = σ(ε_1, . . . , ε_t). For a tree z, the sequential Rademacher complexity of a function class G ⊆ R^Z on z is defined as

R_n(G, z) ≜ E_ε sup_{g∈G} ∑_{t=1}^n ε_t g(z_t(ε))   and   R_n(G) ≜ sup_z R_n(G, z).

Lemma 1. For any lower semi-continuous loss ℓ, and any adaptive rate B_n that only depends on outcomes (i.e. B_n(f; x_{1:n}, y_{1:n}) = B_n(y_{1:n})), we have that

A_n ≤ sup_{x,y} E_ε [ sup_{f∈F} { 2 ∑_{t=1}^n ε_t ℓ(f(x_t(ε)), y_t(ε)) } − B_n(y_{1:n}(ε)) ].   (3)

Further, for any general adaptive rate B_n,

A_n ≤ sup_{x,y,y′} E_ε [ sup_{f∈F} { 2 ∑_{t=1}^n ε_t ℓ(f(x_t(ε)), y_t(ε)) − B_n(f; x_{1:n}(ε), y′_{2:n+1}(ε)) } ].   (4)

Finally, if one considers the supervised learning problem where F: X → R, Y ⊂ R, and ℓ: R × R → R is a loss that is convex and L-Lipschitz in its first argument, then for any adaptive rate B_n,

A_n ≤ sup_{x,y} E_ε [ sup_{f∈F} { 2L ∑_{t=1}^n ε_t f(x_t(ε)) − B_n(f; x_{1:n}(ε), y_{1:n}(ε)) } ].   (5)

The above lemma tells us that to check whether an adaptive rate is achievable, it is sufficient to check that the corresponding adaptive sequential complexity measures are non-positive. We remark that if the above complexities are bounded by some positive quantity of a smaller order, one can form a new achievable rate B′_n by adding the positive quantity to B_n.

3 Probabilistic Tools

As mentioned in the introduction, our technique rests on certain one-sided probabilistic inequalities. We now state the first building block: a rather straightforward maximal inequality.
Proposition 2. Let I = {1, . . . , N}, N ≤ ∞, be a set of indices and let (X_i)_{i∈I} be a sequence of random variables satisfying the following tail condition: for any τ > 0,

P(X_i − B_i > τ) ≤ C_1 exp(−τ²/(2σ_i²)) + C_2 exp(−τ s_i)   (6)

for some positive sequence (B_i), nonnegative sequence (σ_i), and nonnegative sequence (s_i) of numbers, and for constants C_1, C_2 ≥ 0. Then for any σ̄ ≤ σ_1, s̄ ≥ s_1, and

θ_i = max{ (σ_i/B_i) √(2 log(σ_i/σ̄) + 4 log(i)), (B_i s_i)^{−1} log(i²(s̄/s_i)) } + 1,

it holds that

E sup_{i∈I} { X_i − B_i θ_i } ≤ 3 C_1 σ̄ + 2 C_2 (s̄)^{−1}.   (7)

We remark that B_i need not be the expected value of X_i, as we are not interested in two-sided deviations around the mean.
One of the approaches to obtaining oracle-type inequalities is to split a large class into smaller ones according to a “complexity radius” and control a certain stochastic process separately on each subset (also known as the peeling technique). In the applications below, X_i will often stand for the (random) supremum of this process on subset i, and B_i will be an upper bound on its typical size. Given deviation bounds for X_i above B_i, the dilated size B_i θ_i then allows one to pass to maximal inequalities (7) and thus verify achievability in Lemma 1. The same strategy works for obtaining data-dependent bounds, where we first prove tail bounds for the given size of the data-dependent quantity, then appeal to (7).
A simple yet powerful example for the control of the supremum of a stochastic process is an inequality due to Pinelis [22] for the norm (which is a supremum over the dual ball) of a martingale in a 2-smooth Banach space. Here we state a version of this result that can be found in [23, Appendix A].
Lemma 3. Let Z be a unit ball in a separable (2, D)-smooth Banach space H. 
For anyZ-valued\ntree z, and any n> \u2327\uffff4D2\n\n(7)\n\nP\uffff\uffff n\ufffft=1\n\n\u270ftzt(\u270f)\uffff\u2265 \u2327\uffff\u2264 2 exp\uffff\u2212 \u2327 2\n8D2n\uffff\n\nWhen the class of functions is not linear, we may no longer appeal to the above lemma. Instead, we\nmake use of a result from [24] that extends Lemma 3 at a price of a poly-logarithmic factor. Before\nstating this lemma, we brie\ufb02y de\ufb01ne the relevant complexity measures (see [24] for more details).\n\nFirst, a set V of R-valued trees is called an \u21b5-cover ofG\u2286 RZ on z with respect to `p if\nn\ufffft=1(g(zt(\u270f))\u2212 vt(\u270f))p\u2264 n\u21b5p.\n\n\u2200g\u2208G,\u2200\u270f\u2208{\u00b11}n,\u2203v\u2208 V s.t.\n\n4\n\n\f\u2200g\u2208G,\u2200\u270f\u2208{\u00b11},\u2203v\u2208 V s.t.\n\nThe size of the smallest \u21b5-cover is denoted byNp(G,\u21b5, z), andNp(G,\u21b5, n)\uffff supzNp(G,\u21b5, z).\nThe set V is an \u21b5-cover ofG on z with respect to `\u221e if\n\uffffg(zt(\u270f))\u2212 vt(\u270f)\uffff\u2264 \u21b5 \u2200t\u2208[n].\nWe letN\u221e(G,\u21b5, z) be the smallest such cover and setN\u221e(G,\u21b5, n)= supzN\u221e(G,\u21b5, z).\nLemma 4 ([24]). LetG\u2286[\u22121, 1]Z. SupposeRn(G)\uffffn\u2192 0 with n\u2192\u221e and that the following\nmild assumptions hold:Rn(G)\u2265 1\uffffn,N\u221e(G, 2\u22121, n)\u2265 4, and there exists a constant such that\n\u2265\u2211\u221ej=1N\u221e(G, 2\u2212j, n)\u22121. 
Then for any \u2713>\uffff12\uffffn, for anyZ-valued tree z of depth n,\n\u21b5 \ufffflogN\u221e(G,, n)d\uffff\uffff\u2264 2e\u2212 n\u27132\n\n\u270ftg(zt(\u270f))\uffff> 8\uffff1+ \u2713\uffff8n log3(en2)\uffff\u22c5Rn(G)\uffff\ng\u2208G\uffff n\ufffft=1\nP\uffffsup\ng\u2208G\uffff n\ufffft=1\n\u270ftg(zt(\u270f))\uffff> n inf\n\u2264 P\uffffsup\n\n\u21b5>0\uffff4\u21b5+ 6\u2713\uffff 1\n\nThe above lemma yields a one-sided control on the size of the supremum of the sequential Rademacher\nprocess, as required for our oracle-type inequalities.\nNext, we turn our attention to an offset Rademacher process, where the supremum is taken over a\ncollection of negative-mean random variables. The behavior of this offset process was shown to\ngovern the optimal rates of convergence for online nonparametric regression [7]. Such a one-sided\ncontrol of the supremum will be necessary for some of the data-dependent upper bounds we develop.\n\nLemma 5. Let z be aZ-valued tree of depth n, and letG\u2286 RZ. For any \u2265 1\uffffn and \u21b5> 0,\n1\uffffn\uffffn logN2(G,, z)d\u2212 1> \u2327\uffff\nn\ufffft=1\uffff\u270ftg(zt(\u270f))\u2212 2\u21b5g2(zt(\u270f))\uffff\u2212 logN2(G,, z)\nP\uffffsup\ng\u2208G\n\u2264 exp\uffff\u2212 \u2327 2\n22\uffff+ exp\uffff\u2212 \u21b5\u2327\n2 \uffff ,\nwhere \u2265\u2211log2(2n)\nN2(G, 2\u2212j, z)\u22122 and = 12\u222b \n\n\u2212 12\u221a2\uffff \nn\uffffn logN2(G,, z)d.\n\nWe observe that the probability of deviation has both subgaussian and subexponential components.\nUsing the above result and Proposition 2 leads to useful bounds on the quantities in Lemma 1 for\nspeci\ufb01c types of adaptive rates. Given a tree z, we obtain a bound on the expected size of the\nsequential Rademacher process when we subtract off the data-dependent `2-norm of the function on\nthe tree z, adjusted by logarithmic terms.\n\nj=1\n\n4 .\n\n\u21b5\n\n1\n\nE sup\n\ng\u2208G,\uffff\uffff\uffff\uffff\uffff\uffff\uffff\nn\ufffft=1\n\nCorollary 6. 
Suppose G \u2286 [\u22121, 1]Z, and let z be any Z-valued tree of depth n. Assume\nlogN2(G,, n)\u2264 \u2212p for some p< 2. Then\n1\uffffn\uffffn logN2(G,, z)d\uffff\uffff\uffff\uffff\uffff\uffff\uffff\u2264 7+ 2 log n .\nThe next corollary yields slightly faster rates than Corollary 6 when\uffffG\uffff<\u221e.\nCorollary 7. SupposeG\u2286[\u22121, 1]Z with\uffffG\uffff= N, and let z be anyZ-valued tree of depth n. Then\ng2(z(\u270f))+ e\uffff\uffff\uffff\uffff\uffff\uffff\uffff\uffff\u2264 1.\nn\ufffft=1\n\n\u270ftg(zt(\u270f))\u2212 4\uffff\uffff\uffff\uffff2(log n) logN2(G,\uffff2, z)\uffff n\ufffft=1\n\u221224\u221a2 log n\uffff \ng2(z(\u270f))+ e\uffff\uffff\uffff\uffff\uffff32\ufffflog N\n\n\u270ftg(zt(\u270f))\u2212 2 log\ufffflog N\n\ng2(zt(\u270f))+ 1\uffff\n\ng\u2208G\uffff\uffff\uffff\uffff\uffff\uffff\uffff\nn\ufffft=1\n\nn\ufffft=1\n\nE sup\n\n4 Achievable Bounds\n\nIn this section we use Lemma 1 along with the probabilistic tools from the previous section to obtain\nan array of achievable adaptive bounds for various online learning problems. We subdivide the section\ninto one subsection for each category of adaptive bound described in Section 1.1.\n\n5\n\n\f4.1 Adapting to Data\n\nfollowing adaptive rate is achievable:\n\nof the developed tools on the following example.\nExample 4.1 (Online Linear Optimization in Rd). Consider the problem of online linear opti-\n\nHere we consider adaptive rates of the formBn(x1\u2236n, y1\u2236n), uniform over f\u2208F. We show the power\nmization whereF={f\u2208 Rd\u2236\ufffff\uffff2\u2264 1},Y={y\u2236\uffffy\uffff2\u2264 1},X ={0}, and `(\u02c6y, y)=\uffff\u02c6y, y\uffff. The\nwhere\uffff\u22c5\uffff is the spectral norm. Let us deduce this result from Corollary 6. 
First, observe that\nt=1yty\ufffft\uffff1\uffff2\uffff= sup\n\uffff\uffff\u2211n\nt=1`2(f, yt).\nThe linear function classF can be covered point-wise at any scale with(3\uffff)d balls and thus\nN(`\u25cbF, 1\uffff(2n), z)\u2264(6n)d for anyY-valued tree z. We apply Corollary 6 with = 1\uffffn and the\n\nt=1yty\ufffft\uffff1\uffff2\uffff+ 16\u221ad log(n),\nf\u2208F\uffff\u2211n\nf\u2236\ufffff\uffff2\u22641\ufffff\uffff\u2211n\nt=1yty\ufffft f= sup\nf\uffff= sup\n\nBn(y1\u2236n)= 16\u221ad log(n)\uffff\uffff\u2211n\nt=1yty\ufffft\uffff1\uffff2\nf\u2236\ufffff\uffff2\u22641\uffff\uffff\u2211n\n\nintegral term in the corollary vanishes, yielding the claimed statement.\n\n4.2 Model Adaptation\n\nfor absolute constants K1, K2, and de\ufb01ned in Lemma 4.\n\nRn(F(1))\n\nIn this subsection we focus on achievable rates for oracle inequalities and model selection, but\n\nlearning problem with 1-Lipschitz loss `, the following rate is achievable:\n\nwe knew the optimal radius, at the price of a logarithmic factor. This is the price of adaptation.\n\nsequential Rademacher complexity for online learning problems for commonly encountered losses,\n\nwithout dependence on data. The form of the rate is thereforeBn(f). Assume we have a class\nF =\uffffR\u22651F(R), with the property thatF(R)\u2286F(R\u2032) for any R \u2264 R\u2032. If we are told by an\noracle that regret will be measured with respect to those hypotheses f \u2208F with R(f)\uffff inf{R\u2236\nf\u2208F(R)}\u2264 R\u2217, then using the minimax algorithm one can guarantee a regret bound of at most\nthe sequential Rademacher complexityRn(F(R\u2217)). On the other hand, given the optimality of the\nwe can argue that for any f\u2208F chosen in hindsight, one cannot expect a regret better than order\nRn(F(R(f))). In this section we show that simultaneously for all f\u2208F, one can attain an adaptive\nupper bound of O\uffffRn(F(R(f)))\ufffflog(Rn(F(R(f)))) log3\uffff2 n\uffff. 
That is, we may predict as if\nCorollary 8. For any class of predictorsF withF(1) non-empty, if one considers the supervised\nBn(f)= log3\uffff2 n\uffff\uffff\uffffK1Rn(F(2R(f)))\uffff\uffff\uffff1+\uffff\uffff\uffff\ufffflog\uffff log(2R(f))\u22c5Rn(F(2R(f)))\n\uffff\uffff\uffff\uffff+ K2Rn(F(1))\uffff\uffff\uffff ,\nIn fact, this statement is true more generally withF(2R(f)) replaced by `\u25cbF(2R(f)). It is\nthis value. Second, an experts bound yields only a slower\u221an rate.\n\ntempting to attempt to prove the above statement with the exponential weights algorithm running as\nan aggregation procedure over the solutions for each R. In general, this approach will fail for two\nreasons. First, if function values grow with R, the exponential weights bound will scale linearly with\n\nAs a special case of the above lemma, we obtain an online PAC-Bayesian theorem. We postpone\nthis example to the next sub-section where we get a data-dependent version of this result. We now\nprovide a bound for online linear optimization in 2-smooth Banach spaces that automatically adapts\nto the norm of the comparator. To prove it, we use the concentration bound from [22] (Lemma 3)\nwithin the proof of the above corollary to remove the extra logarithmic factors.\n\nExample 4.2 (Unconstrained Linear Optimization). Consider linear optimization withY being\nthe unit ball of some re\ufb02exive Banach space with norm\uffff\u22c5\uffff\u2217. LetF=D be the dual space and the\nloss `(\u02c6y, y)=\uffff\u02c6y, y\uffff (where we are using\uffff\u22c5,\u22c5\uffff to represent the linear functional in the \ufb01rst argument\nto the second argument). De\ufb01neF(R)={f\uffff\ufffff\uffff\u2264 R} where\uffff\u22c5\uffff is the norm dual to\uffff\u22c5\uffff\u2217. 
If the unit\nball ofY is(2, D)-smooth, then the following rate is achievable for all f with\ufffff\uffff\u2265 1:\nB(f)= D\u221an\uffff8\ufffff\uffff\uffff1+\ufffflog(2\ufffff\uffff)+ log log(2\ufffff\uffff)\uffff+ 12\uffff.\n\nFor the case of a Hilbert space, the above bound was achieved by [15].\n\n6\n\n\f4.3 Adapting to Data and Model Simultaneously\n\nBn(f ; x1\u2236n)= inf\n\n+K2 log n\uffff \n\nWe now study achievable bounds that perform online model selection in a data-adaptive way. Of\nspeci\ufb01c interest is our online optimistic PAC-Bayesian bound. This bound should be compared\nto [18, 19], with the reader noting that it is independent of the number of experts, is algorithm-\nindependent, and depends quadratically on the expected loss of the expert we compare against.\nExample 4.3 (Generalized Predictable Sequences (Supervised Learning)). Consider an online\nthat the learner can compute at round t based on information provided so far, including xt (One can\nthink of the predictable sequence Mt as a prior guess for the hypothesis we would compare with in\nhindsight). Then the following adaptive rate is achievable:\n\nsupervised learning problem with a convex 1-Lipschitz loss. Let(Mt)t\u22651 be any predictable sequence\n\nBn(f)= \u02dcO\uffff\uffff\uffff\u2211n\n\nhand, as p gets closer to 2 (i.e. more complex function classes), we do not adapt and get a uniform\n\nof Eq. (5) in Lemma 1, followed by Corollary 6 (one can include any predictable sequence in\n\n \uffff\uffff\uffff\uffff\uffff\uffff\uffffK1\uffff\uffff\uffff\ufffflog n\u22c5 logN2(F,\uffff2, n)\u22c5\uffff n\ufffft=1(f(xt)\u2212 Mt)2+ 1\uffff\n1\uffffn\uffffn logN2(F,, n)d+ 2 log n+7\uffff\uffff\uffff\uffff\uffff\uffff\uffff,\nfor constants K1= 4\u221a2, K2= 24\u221a2 from Corollary 6. The achievability is a direct consequence\nthe Rademacher average part because\u2211t Mt\u270ft is zero mean). 
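The predictable sequence (M_t) above plays the same role as the hint sequence in optimistic methods such as [11]. The following is a minimal sketch of optimistic gradient descent for online linear optimization over the Euclidean unit ball; the step size, hints, and data are assumed purely for illustration, and this is not the algorithm achieving the rate above.

```python
import math

def proj_ball(x):
    """Euclidean projection onto the unit l2 ball."""
    nrm = math.sqrt(sum(v * v for v in x))
    return [v / nrm for v in x] if nrm > 1.0 else list(x)

def optimistic_ogd(grads, hints, eta):
    """Play w_t = Proj(x_t - eta * M_t) using hint M_t, observe gradient g_t,
    then update x_{t+1} = Proj(x_t - eta * g_t). Returns total linear loss."""
    x = [0.0] * len(grads[0])
    total = 0.0
    for g, m in zip(grads, hints):
        w = proj_ball([xi - eta * mi for xi, mi in zip(x, m)])  # anticipate via hint
        total += sum(wi * gi for wi, gi in zip(w, g))
        x = proj_ball([xi - eta * gi for xi, gi in zip(x, g)])  # standard OGD step
    return total

# Perfect hints (M_t = g_t) should do at least as well as zero hints here.
grads = [[1.0, 0.0]] * 50
loss_hint = optimistic_ogd(grads, grads, eta=0.5)
loss_nohint = optimistic_ogd(grads, [[0.0, 0.0]] * 50, eta=0.5)
```

With perfect hints M_t = g_t the played point anticipates the next gradient, mirroring how the ∑_t (f(x_t) − M_t)² term above shrinks when the hints are accurate.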
Particularly, if we assume that the\nsequential covering of classF grows as logN2(F,\u270f, n)\u2264 \u270f\u2212p for some p< 2, we get that\n2\uffff\u221an\uffffp\uffff2\uffff .\nt=1(f(xt)\u2212 Mt)2+ 1\uffff1\u2212 p\nt=1(f(xt)\u2212 Mt)2+ 1. On the other\nAs p gets closer to 0, we get full adaptivity and replace n by\u2211n\nbound in terms of n. For p\u2208(0, 2), we attain a natural interpolation.\nsupervised learning problem with a convex 1-Lipschitz loss and let\uffffF\uffff= N. Let f\uffff\u2208F be a \ufb01xed\nn\ufffft=1(f(xt)\u2212 f\uffff(xt))2+ e\uffff\uffff\uffff\uffff\uffff32\ufffflog N\nn\ufffft=1(f(xt)\u2212 f\uffff(xt))2+ e\uffff+ 2.\nBn(f, x1\u2236n)= 4 log\ufffflog N\nIn particular, against f\uffff we haveBn(f\uffff, x1\u2236n)= O(1), and against an arbitrary expert we have\nBn(f, x1\u2236n)= O\uffff\u221an log N(log(n\u22c5 log N))\uffff. This bound follows from Eq. (5) in Lemma 1 followed\nby Corollary 7. This extends the study of [25] to supervised learning and general class of expertsF.\nloss for each expert on any round is non-negative and bounded by 1. The function classF is the set\nof all distributions over these experts, andX={0}. This setting can be formulated as online linear\noptimization where the loss of mixture f over experts, given instance y, is\ufffff, y\uffff, the expected loss\n\nExample 4.4 (Regret to Fixed Vs Regret to Best (Supervised Learning)). Consider an online\nexpert chosen in advance. The following bound is achievable:\n\nExample 4.5 (Optimistic PAC-Bayes). Assume that we have a countable set of experts and that the\n\nunder the mixture. The following adaptive bound is achievable:\n\nThis adaptive bound is an online PAC-Bayesian bound. 
The rate adapts not only to the KL di-\n\nEi\u223cf\uffffei, yt\uffff2+ 50(KL(f\uffff\u21e1)+ log(n))+ 10.\n\nBn(f ; y1\u2236n)=\uffff\uffff\uffff\uffff50(KL(f\uffff\u21e1)+ log(n)) n\ufffft=1\nvergence of f with \ufb01xed prior \u21e1 but also replaces n with\u2211n\nt=1 Ei\u223cf\uffffei, yt\uffff2\u2264\u2211n\n\u2211n\n\nt=1 Ei\u223cf\uffffei, yt\uffff2. Note that we have\nt=1\ufffff, yt\uffff, yielding the small-loss type bound described earlier. This is an\n\nimprovement over the bound in [18] in that the bound is independent of number of experts, and so\nholds even for countably in\ufb01nite sets of experts. The KL term in our bound may be compared to the\nMDL-style term in the bound of [19]. If we have a large (but \ufb01nite) number of experts and take \u21e1 to\nbe uniform, the above bound provides an improvement over both [14]1 and [18].\nEvaluating the above bound with a distribution f that places all its weight on any one expert appears\nto address the open question posed by [13] of obtaining algorithm-independent oracle-type variance\nbounds for experts. 
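For intuition about achievability, the kind of minimax value introduced in Definition 1 can be computed exactly by backward induction in tiny instances. The toy script below does this for two experts with {0, 1}-valued losses; the short horizon and the finite grid over mixtures are illustrative assumptions, not part of the paper's construction.

```python
from functools import lru_cache
from itertools import product

N_ROUNDS = 4  # tiny horizon so exhaustive backward induction is feasible

@lru_cache(maxsize=None)
def value(t, cum):
    """Conditional minimax regret for 2 experts with losses in {0, 1}:
    V_n(L) = -min_i L_i;  V_t(L) = min_q max_y [ <q, y> + V_{t+1}(L + y) ]."""
    if t == N_ROUNDS:
        return -min(cum)
    best = float("inf")
    for k in range(101):  # grid over mixtures q = (p, 1 - p)
        p = k / 100.0
        worst = max(
            p * y0 + (1 - p) * y1 + value(t + 1, (cum[0] + y0, cum[1] + y1))
            for y0, y1 in product((0, 1), repeat=2)
        )
        best = min(best, worst)
    return best

minimax_regret = value(0, (0, 0))
```

In this toy game a uniform rate B_n is achievable exactly when B_n is at least `minimax_regret`, i.e. when the offset value `minimax_regret - B_n` is non-positive.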
The proof of achievability of the above rate is shown in the appendix because it\nrequires a slight variation on the symmetrization lemma speci\ufb01c to the problem.\n\n1See [18] for a comparison of KL-based bounds and quantile bounds.\n\n7\n\n\f5 Relaxations for Adaptive Learning\n\nTo design algorithms for achievable rates, we extend the framework of online relaxations from [26].\n\nA relaxation Reln\u2236\uffffn\n\nand the recursive condition,\n\nt=0X t\u00d7Y t\u2192 R that satis\ufb01es the initial condition,\nReln(x1\u2236n, y1\u2236n)\u2265\u2212 inf\nf\u2208F\uffff n\ufffft=1\nReln(x1\u2236t\u22121, y1\u2236t\u22121)\u2265 sup\nxt\u2208X\nqt\u2208(D)\n\n`(f(xt), yt)+Bn(f ; x1\u2236n, y1\u2236n)\uffff,\nE\u02c6y\u223cqt[`(\u02c6yt, yt)+ Reln(x1\u2236t, y1\u2236t)],\nyt\u2208Y\n\nsup\n\ninf\n\n(8)\n\n(9)\n\n\n\nn\ufffft=1\n\n`(\u02c6yt, yt)\u2212 inf\nf\u2208F\uffff n\ufffft=1\n\n`(f(xt), yt)+Bn(f ; x1\u2236n, y1\u2236n)\uffff\u2264 Reln(\u22c5) \u2200x1\u2236n, y1\u2236n.\n\nyield admissible relaxations, but solving these relaxations may not be computationally tractable.\nExample 5.1 (Online PAC-Bayes). Consider the experts setting in Example 4.5 with:\n\nis said to be admissible for the adaptive rate Bn. The relaxation\u2019s corresponding strategy is\n\u02c6qt= arg minqt\u2208(D) supyt\u2208Y E\u02c6y\u223cqt[`(\u02c6yt, yt)+ Reln(x1\u2236t, y1\u2236t)], which enjoys the adaptive bound\nIt follows immediately that the strategy achieves the rateBn(f ; x1\u2236n, y1\u2236n)+ Reln(\u22c5). Our goal is\nthen to \ufb01nd relaxations for which the strategy is computationally tractable and Reln(\u22c5)\u2264 0 or at least\nhas smaller order thanBn. 
Similar to [26], conditional versions of the offset minimax valuesAn\nBn(f)= 3\uffff2n max{KL(f\uffff \u21e1), 1}+ 4\u221an.\nt(y) denote the exponential weights distribution with learning rate\uffffR\uffffn:\nLet Ri= 2i\u22121 and let qR\nqR(y1\u2236t)k\u221d \u21e1k exp\uffff\u2212\uffffR\uffffn(\u2211t\ns=1 yt)k\uffff. The following is an admissible relaxation achievingBn:\nexp\uffff\u2212\uffff t\uffffs=1\uffffqRi(y1\u2236s\u22121), ys\uffff+\u221anRi\uffff\uffff\uffff+ 2(n\u2212 t)\uffff.\nReln(y1\u2236t)= inf\n>0\uffff 1\nlog\uffff\uffffi\ns=1\uffffqRi(y1\u2236s\u22121), ys\uffff\u2212\u221anRi\uffff\uffff. We predict by\nLet q\ufffft be a distribution with(q\ufffft)i\u221d exp\uffff\u2212 1\u221an\uffff\u2211t\u22121\ndrawing i according to q\ufffft , then drawing an expert according to qRi(y1\u2236t\u22121).\nn is available for eachF(R), can one obtain an adaptive\nmodel selection algorithm for all ofF?\u201d. To this end for supervised learning problem with convex\nLipschitz loss we delineate a meta approach which utilizes existing relaxations for eachF(R).\nt(y1, . . . , yt\u22121) be the randomized strategy corresponding to RelR\nobserving outcomes y1, . . . , yt\u22121, and let \u2713\u2236 R\u2192 R be nonnegative. The following relaxation is\nn(\u22c5)):\nn(\u22c5)\u2713(RelR\nadmissible for the rateBn(R)= RelR\nAdan(x1\u2236t, y1\u2236t)=\nR\u22651\uffffRelR\nn(\u22c5)\u2713(RelR\ns (y1\u2236t,y\u2032t+1\u2236s\u22121(\u270f))`(\u02c6ys, ys(\u270f))\uffff.\nn(x1\u2236t, y1\u2236t)\u2212 RelR\nn(\u22c5))+ 2\nE\u270ft+1\u2236n sup\nx,y,y\u2032\nPlaying according to the strategy for Adan will guarantee a regret bound ofBn(R)+ Adan(\u22c5), and\nAdan(\u22c5) can be bounded using Proposition 2 when the form of \u2713 is as in that proposition.\nstrategy is optimal. 
We remark that the above strategy is not necessarily obtained by running a high-level experts algorithm over the discretized values of R. It is an interesting question to determine the cases in which such a strategy is optimal. More generally, when the adaptive rate B_n depends on the data, it is not possible to obtain the rates we establish non-constructively in this paper using the exponential weights algorithm with meta-experts, as the required weighting over experts would be data-dependent (and hence is not a prior over experts). Further, the bounds from exponential-weights-type algorithms are akin to having sub-exponential tails in Proposition 2, whereas for many problems we may have sub-Gaussian tails. Obtaining computationally efficient methods from the proposed framework is an interesting research direction. Proposition 2 provides a useful non-constructive tool for establishing achievable adaptive bounds, and a natural question is whether one can obtain a constructive counterpart to this proposition.

References

[1] Lucien Birgé and Pascal Massart. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4(3):329–375, 1998.

[2] Gábor Lugosi and Andrew B. Nobel. Adaptive model selection using empirical complexities. Annals of Statistics, pages 1830–1864, 1999.

[3] Peter L. Bartlett, Stéphane Boucheron, and Gábor Lugosi. Model selection and error estimation. Machine Learning, 48(1-3):85–113, 2002.

[4] Pascal Massart. Concentration Inequalities and Model Selection, volume 10. Springer, 2007.

[5] Shahar Mendelson. Learning without Concentration.
In Conference on Learning Theory, 2014.

[6] Tengyuan Liang, Alexander Rakhlin, and Karthik Sridharan. Learning with square loss: Localization through offset Rademacher complexity. In Proceedings of the 28th Conference on Learning Theory, 2015.

[7] Alexander Rakhlin and Karthik Sridharan. Online nonparametric regression. In Proceedings of the 27th Conference on Learning Theory, 2014.

[8] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In Advances in Neural Information Processing Systems 23, 2010.

[9] Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80(2):165–188, 2010.

[10] Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo Zhu. Online optimization with gradual variations. In COLT, 2012.

[11] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Proceedings of the 26th Annual Conference on Learning Theory (COLT), 2013.

[12] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

[13] Nicolò Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321–352, 2007.

[14] Kamalika Chaudhuri, Yoav Freund, and Daniel J. Hsu. A parameter-free hedging algorithm. In Advances in Neural Information Processing Systems, pages 297–305, 2009.

[15] H. Brendan McMahan and Francesco Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Proceedings of the 27th Conference on Learning Theory, 2014.

[16] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[17] Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems, pages 2199–2207, 2010.

[18] Haipeng Luo and Robert E. Schapire. Achieving all with no parameters: Adaptive NormalHedge. CoRR, abs/1502.05934, 2015.

[19] Wouter M. Koolen and Tim van Erven. Second-order quantile methods for experts and combinatorial games. In Proceedings of the 28th Annual Conference on Learning Theory (COLT), pages 1155–1175, 2015.

[20] Thomas M. Cover. Behavior of sequential predictors of binary sequences. In Transactions of the 4th Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, pages 263–272. Publishing House of the Czechoslovak Academy of Sciences, 1967.

[21] Alexander Rakhlin and Karthik Sridharan. Statistical learning theory and sequential prediction, 2012. Available at http://stat.wharton.upenn.edu/~rakhlin/book_draft.pdf.

[22] Iosif Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, 22(4):1679–1706, 1994.

[23] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Beyond regret. arXiv preprint arXiv:1011.3168, 2010.

[24] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 2014.

[25] Eyal Even-Dar, Michael Kearns, Yishay Mansour, and Jennifer Wortman. Regret to the best vs. regret to the average. Machine Learning, 72(1-2):21–37, 2008.

[26] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Relax and randomize: From value to algorithms. In Advances in Neural Information Processing Systems 25, pages 2150–2158, 2012.