{"title": "Parameter-Free Online Learning via Model Selection", "book": "Advances in Neural Information Processing Systems", "page_first": 6020, "page_last": 6030, "abstract": "We introduce an efficient algorithmic framework for model selection in online learning, also known as parameter-free online learning. Departing from previous work, which has focused on highly structured function classes such as nested balls in Hilbert space, we propose a generic meta-algorithm framework that achieves online model selection oracle inequalities under minimal structural assumptions. We give the first computationally efficient parameter-free algorithms that work in arbitrary Banach spaces under mild smoothness assumptions; previous results applied only to Hilbert spaces. We further derive new oracle inequalities for matrix classes, non-nested convex sets, and $\\mathbb{R}^{d}$ with generic regularizers. Finally, we generalize these results by providing oracle inequalities for arbitrary non-linear classes in the online supervised learning model. These results are all derived through a unified meta-algorithm scheme using a novel \"multi-scale\" algorithm for prediction with expert advice based on random playout, which may be of independent interest.", "full_text": "Parameter-Free Online Learning via Model Selection\n\nDylan J. Foster\nCornell University\n\nSatyen Kale\n\nGoogle Research\n\nMehryar Mohri\n\nNYU and Google Research\n\nKarthik Sridharan\nCornell University\n\nAbstract\n\nWe introduce an ef\ufb01cient algorithmic framework for model selection in online\nlearning, also known as parameter-free online learning. 
Departing from previous work, which has focused on highly structured function classes such as nested balls in Hilbert space, we propose a generic meta-algorithm framework that achieves online model selection oracle inequalities under minimal structural assumptions. We give the first computationally efficient parameter-free algorithms that work in arbitrary Banach spaces under mild smoothness assumptions; previous results applied only to Hilbert spaces. We further derive new oracle inequalities for matrix classes, non-nested convex sets, and ℝ^d with generic regularizers. Finally, we generalize these results by providing oracle inequalities for arbitrary non-linear classes in the online supervised learning model. These results are all derived through a unified meta-algorithm scheme using a novel “multi-scale” algorithm for prediction with expert advice based on random playout, which may be of independent interest.
1 Introduction
A key problem in the design of learning algorithms is the choice of the hypothesis set F. This is known as the model selection problem. The choice of F is driven by inherent trade-offs. In the statistical learning setting, these can be analyzed in terms of the estimation and approximation errors. A richer or more complex F helps better approximate the Bayes predictor (smaller approximation error). On the other hand, a hypothesis set that is too complex may have too large a VC-dimension or unfavorable Rademacher complexity, resulting in looser guarantees on the difference between the loss of a hypothesis and that of the best-in-class (large estimation error).
In the batch setting, this problem has been extensively studied, with the main ideas originating in the seminal work of [40] and [39] and the principle of Structural Risk Minimization (SRM). This is typically formulated as follows: let (F_i)_{i∈ℕ} be an infinite sequence of hypothesis sets (or models); the problem consists of using the training sample to select a hypothesis set F_i with a favorable estimation-approximation trade-off and choosing the best hypothesis f in F_i.
If we had access to a hypothetical oracle informing us of the best choice of i for a given instance, the problem would reduce to the standard one of learning with a fixed hypothesis set. Remarkably though, techniques such as SRM or similar penalty-based model selection methods return a hypothesis f* that enjoys finite-sample learning guarantees almost as favorable as those that would be obtained had an oracle informed us of the index i* of the best-in-class classifier's hypothesis set [39; 13; 36; 21; 4; 24]. Such guarantees are sometimes referred to as oracle inequalities. They can be derived even for data-dependent penalties [21; 4; 3].
Such results naturally raise the following questions in the online setting: can we develop an analogous theory of model selection in online learning? Can we design online algorithms for model selection with solutions benefiting from strong guarantees, analogous to the batch ones? Unlike the statistical setting, in online learning one cannot split samples to first learn the optimal predictor within each subclass and then later learn the optimal subclass choice.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
A series of recent works on online learning provide some positive results along that direction. On the algorithmic side, [25; 27; 30; 31] present solutions that efficiently achieve model selection oracle inequalities for the important special case where F_1, F_2, ... is a sequence of nested balls in a Hilbert space.
On the theoretical side, a different line of work focusing on general hypothesis classes [14] uses martingale-based sequential complexity measures to show that, information-theoretically, one can obtain oracle inequalities in the online setting at a level of generality comparable to that of batch statistical learning. However, this last result is not algorithmic.
The first approach a familiar reader might think of for tackling the online model selection problem is to run, for each i, an online learning algorithm that minimizes regret against F_i, and then aggregate over these algorithms using the multiplicative weights algorithm for prediction with expert advice. This would work if all the losses or “experts” considered were uniformly bounded by a reasonably small quantity. However, in many reasonable problems, particularly those arising in the context of online convex optimization, the losses of predictors or experts for each F_i may grow with i. Simple aggregation would scale our regret with the magnitude of the largest F_i and not the F_{i*} we want to compare against. This is the main technical challenge faced in this context, and one that we fully address in this paper.
Our results are based on a novel multi-scale algorithm for prediction with expert advice. This algorithm works in a situation where the different experts' losses lie in different ranges, and it guarantees that the regret to each individual expert is adapted to the range of its losses. The algorithm can also take advantage of a given prior over the experts reflecting their importance. This general, abstract setting of prediction with expert advice yields online model selection algorithms for a host of applications, detailed below, in a straightforward manner.
First, we give efficient algorithms for model selection for nested linear classes that provide oracle inequalities in terms of the norm of the benchmark to which the algorithm's performance is compared. Our algorithm works for any norm, which considerably generalizes previous work [25; 27; 30; 31] and gives the first polynomial-time online model selection for a number of online linear optimization settings. This includes online oracle inequalities for high-dimensional learning tasks such as online PCA and online matrix prediction. We then generalize these results even further by providing oracle inequalities for arbitrary non-linear classes in the online supervised learning model. This yields algorithms for applications such as online penalized risk minimization and multiple kernel learning.
1.1 Preliminaries
Notation. For a given norm ‖·‖, let ‖·‖⋆ denote the dual norm. Likewise, for any function F, F⋆ will denote its Fenchel conjugate. For a Banach space (B, ‖·‖), the dual is (B⋆, ‖·‖⋆). We use x_{1:n} as shorthand for a sequence of vectors (x_1, ..., x_n). For such sequences, we will use x_t[i] to denote the t-th vector's i-th coordinate. We let e_i denote the i-th standard basis vector. ‖·‖_p denotes the ℓ_p norm, ‖·‖_σ denotes the spectral norm, and ‖·‖_Σ denotes the trace norm. For any p ∈ [1, ∞], let p′ be such that 1/p + 1/p′ = 1.
Setup and goals. We work in two closely related settings: online convex optimization (Protocol 1) and online supervised learning (Protocol 2). In online convex optimization, the learner selects decisions from a convex subset W of some Banach space B. Regret to a comparator w ∈ W in this setting is defined as ∑_{t=1}^{n} f_t(w_t) − ∑_{t=1}^{n} f_t(w).
Suppose W can be decomposed into sets W_1, W_2, .... For a fixed set W_k, the optimal regret, if one tailors the algorithm to compete with W_k, is typically characterized by some measure of intrinsic complexity of the class (such as Littlestone's dimension [5] and sequential Rademacher complexity [33]), denoted Comp_n(W_k). We would like to develop algorithms that predict a sequence (w_t) such that\n\n∑_{t=1}^{n} f_t(w_t) − min_{w∈W_k} ∑_{t=1}^{n} f_t(w) ≤ Comp_n(W_k) + Pen_n(k)  ∀k.  (1)\n\nThis equation is called an oracle inequality and states that the performance of the sequence (w_t) matches that of a comparator that minimizes the bias-variance tradeoff, min_k { min_{w∈W_k} ∑_{t=1}^{n} f_t(w) + Comp_n(W_k) }, up to a penalty Pen_n(k) whose scale ideally matches that of Comp_n(W_k). We shall see shortly that ensuring that the scale of Pen_n(k) does indeed match is the core technical challenge in developing online oracle inequalities for commonly used classes.
Protocol 1 Online Convex Optimization\nfor t = 1, ..., n do\n  Learner selects strategy q_t ∈ Δ(W) for convex decision set W.\n  Nature selects convex loss f_t : W → ℝ.\n  Learner draws w_t ∼ q_t and incurs loss f_t(w_t).\nend for
In the supervised learning setting we measure regret against a benchmark class F = ⋃_{k=1}^{∞} F_k of functions f : X → ℝ, where X is some abstract context space, also called feature space. In this case, the desired oracle inequality has the form:\n\n∑_{t=1}^{n} ℓ(ŷ_t, y_t) − inf_{f∈F_k} ∑_{t=1}^{n} ℓ(f(x_t), y_t) ≤ Comp_n(F_k) + Pen_n(k)  ∀k.  (2)
Protocol 2 Online Supervised Learning\nfor t = 1, ..., n do\n  Nature provides x_t ∈ X.\n  Learner selects randomized strategy q_t ∈ Δ(ℝ).\n  Nature provides outcome y_t ∈ Y.\n  Learner draws ŷ_t ∼ q_t and incurs loss ℓ(ŷ_t, y_t).\nend for
2 Online Model Selection
2.1 The need for multi-scale aggregation
Let us briefly motivate the main technical challenge overcome by the model selection approach we consider. The most widely studied oracle inequality in online learning has the following form:\n\n∑_{t=1}^{n} f_t(w_t) − ∑_{t=1}^{n} f_t(w) ≤ O( (‖w‖_2 + 1)·√(n·log((‖w‖_2 + 1)n)) )  ∀w ∈ ℝ^d.  (3)\n\nIn light of (1), a model selection approach to obtaining this inequality would be to split the set W = ℝ^d into ℓ_2 norm balls of doubling radius, i.e. W_k = {w : ‖w‖_2 ≤ 2^k}. A standard fact [15] is that such a set has Comp_n(W_k) = 2^k·√n if one optimizes over it using Mirror Descent, and so obtaining the oracle inequality (1) is sufficient to recover (3), so long as Pen_n(k) is not too large relative to Comp_n(W_k).
Online model selection is fundamentally a problem of prediction with expert advice [8], where the experts correspond to the different model classes one is choosing from. Our basic meta-algorithm, MULTISCALEFTPL (Algorithm 3), operates in the following setup. The algorithm has access to a finite number, N, of experts. In each round, the algorithm is required to choose one of the N experts. Then the losses of all experts are revealed, and the algorithm incurs the loss of the chosen expert. The twist from the standard setup is that the losses of the experts are not uniformly bounded in the same range. Indeed, for the setup described for the oracle inequality (3), class W_k will produce predictions with norm as large as 2^k. Therefore, here, we assume that expert i incurs losses in the range [−c_i, c_i], for some known parameter c_i ≥ 0.
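To make the doubling-radius decomposition behind (3) concrete, here is a minimal Python sketch (illustrative only; the function names are ours, not the paper's). It finds the smallest ball W_k = {w : ‖w‖_2 ≤ 2^k} containing a given comparator and checks that the Mirror Descent scale Comp_n(W_k) = 2^k·√n is within a constant factor of the ideal ‖w‖_2·√n scale.

```python
import math

def smallest_ball_index(w_norm):
    """Smallest k with ||w||_2 <= 2^k, for the doubling balls W_k = {w : ||w||_2 <= 2^k}."""
    return max(0, math.ceil(math.log2(w_norm))) if w_norm > 0 else 0

def comp(k, n):
    """Comp_n(W_k) = 2^k * sqrt(n): the Mirror Descent regret scale on the ball W_k."""
    return 2 ** k * math.sqrt(n)

n = 10000
for w_norm in [0.5, 3.0, 100.0]:
    k = smallest_ball_index(w_norm)
    # The chosen ball is tight up to a factor of 2, so Comp_n(W_k) matches
    # the (||w||_2 + 1) * sqrt(n) scale appearing in (3) up to constants.
    assert 2 ** k <= 2 * max(w_norm, 1.0)
```

The remaining log((‖w‖_2 + 1)n) factor in (3) is exactly the price Pen_n(k) paid for not knowing the right index k in advance.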
The goal is to design an online learning algorithm whose regret to expert i scales with c_i, rather than max_i c_i, which is what previous algorithms for learning from expert advice (such as the standard multiplicative weights strategy or AdaHedge [12]) would achieve. Indeed, any regret bound scaling in max_i c_i will be far too large to achieve (3), as the term Pen_n(k) will dominate. This new type of scale-sensitive regret bound, achieved by our algorithm MULTISCALEFTPL, is stated below.
Algorithm 3\nprocedure MULTISCALEFTPL(c, π)  ▷ Scale vector c with c_i ≥ 1, prior distribution π.\n  for time t = 1, ..., n do\n    Draw sign vectors σ_{t+1}, ..., σ_n ∈ {±1}^N, each uniformly at random.\n    Compute distribution\n      p_t(σ_{t+1:n}) = argmin_{p∈Δ_N} sup_{g_t : |g_t[i]| ≤ c_i ∀i} [ ⟨p, g_t⟩ + sup_{i∈[N]} ( −∑_{s=1}^{t} ⟨e_i, g_s⟩ + 4·∑_{s=t+1}^{n} σ_s[i]·c_i − B(i) ) ],\n    where B(i) = 5·c_i·√(n·log(4·c_i²·n/π_i)).\n    Play i_t ∼ p_t.\n    Observe loss vector g_t.\n  end for\nend procedure
Theorem 1. Suppose the loss sequence (g_t)_{t≤n} satisfies |g_t[i]| ≤ c_i for a sequence (c_i)_{i∈[N]} with each c_i ≥ 1. Let π ∈ Δ_N be a given prior distribution on the experts. Then, playing the strategy (p_t)_{t≤n} given by Algorithm 3, MULTISCALEFTPL yields the following regret bound:¹\n\nE[ ∑_{t=1}^{n} ⟨e_{i_t}, g_t⟩ − ∑_{t=1}^{n} ⟨e_i, g_t⟩ ] ≤ O( c_i·√(n·log(n·c_i/π_i)) )  ∀i ∈ [N].  (4)
¹ This regret bound holds under expectation over the player's randomization. It is assumed that each g_t is selected before the randomized strategy p_t is revealed, but may adapt to the distribution over p_t. In fact, a slightly stronger version of this bound holds, namely E[ ∑_{t=1}^{n} ⟨e_{i_t}, g_t⟩ − min_{i∈[N]} { ∑_{t=1}^{n} ⟨e_i, g_t⟩ + O(c_i·√(n·log(n·c_i/π_i))) } ] ≤ 0. A similar strengthening applies to all subsequent bounds.
The proof of the theorem is deferred to Appendix A in the supplementary material due to space constraints. Briefly, the proof follows the technique of adaptive relaxations from [14]. It relies on showing that the following function of the first t loss vectors g_{1:t} is an admissible relaxation (see [14] for definitions):\n\nRel(g_{1:t}) ≜ E_{σ_{t+1},...,σ_n ∈ {±1}^N} sup_i [ −∑_{s=1}^{t} ⟨e_i, g_s⟩ + 4·∑_{s=t+1}^{n} σ_s[i]·c_i − B(i) ].\n\nThis implies that if we play the strategy (p_t)_{t≤n} given by Algorithm 3, the regret to the i-th expert is bounded by B(i) + Rel(·), where Rel(·) indicates the Rel function applied to an empty sequence of loss vectors. As a final step, we bound Rel(·) as O(1) using a probabilistic maximal inequality (Lemma 2 in the supplementary material), yielding the bound (4). Compared to related FTPL algorithms [34], the analysis is surprisingly delicate, as additive c_i factors can spoil the desired regret bound (4) if the c_i's differ by orders of magnitude.
The min-max optimization problem in MULTISCALEFTPL can be solved in polynomial time using linear programming; see Appendix A.1 in the supplementary material for a full discussion.
In related work, [7] simultaneously developed a multi-scale experts algorithm which could also be used in our framework. Their regret bound has sub-optimal dependence on the prior distribution over experts, but their algorithm is more efficient and is able to obtain multiplicative regret guarantees.
2.2 Online convex optimization
One can readily apply MULTISCALEFTPL to online optimization problems whenever it is possible to bound the losses of the different experts a priori. One such application is to online convex optimization, where each “expert” is a particular OCO algorithm, and for which such a bound can be obtained via appropriate bounds on the relevant norms of the parameter vectors and the gradients of the loss functions. We detail this application, which yields algorithms for parameter-free online learning and more, below.
All of the algorithms in this section are derived using a unified meta-algorithm strategy MULTISCALEOCO.
The setup is as follows. We have access to N sub-algorithms, denoted ALG_i for i ∈ [N]. In round t, each sub-algorithm ALG_i produces a prediction w_t^i ∈ W_i, where W_i is a set in a vector space V over ℝ containing 0. Our meta-algorithm is then required to choose one of the predictions w_t^i. Then, a loss function f_t : V → ℝ is revealed, whereupon ALG_i incurs loss f_t(w_t^i), and the meta-algorithm suffers the loss of the chosen prediction. We make the following assumption on the sub-algorithms:
Assumption 1. The sub-algorithms satisfy the following conditions:\n• For each i ∈ [N], there is an associated norm ‖·‖_(i) such that sup_{w∈W_i} ‖w‖_(i) ≤ R_i.\n• For each i ∈ [N], the sequence of functions f_t are L_i-Lipschitz on W_i with respect to ‖·‖_(i).\n• For each sub-algorithm ALG_i, the iterates (w_t^i)_{t≤n} enjoy a regret bound ∑_{t=1}^{n} f_t(w_t^i) − inf_{w∈W_i} ∑_{t=1}^{n} f_t(w) ≤ Reg_n(i), where Reg_n(i) may be data- or algorithm-dependent.
Algorithm 4\nprocedure MULTISCALEOCO({ALG_i, R_i, L_i}_{i∈[N]}, π)  ▷ Collection of sub-algorithms, prior π.\n  c ← (R_i·L_i)_{i∈[N]}  ▷ Sub-algorithm scale parameters.\n  for t = 1, ..., n do\n    w_t^i ← ALG_i(f̃_1, ..., f̃_{t−1}) for each i ∈ [N].\n    i_t ← MULTISCALEFTPL[c, π](g_1, ..., g_{t−1}).\n    Play w_t = w_t^{i_t}.\n    Observe loss function f_t and let f̃_t(w) = f_t(w) − f_t(0).\n    g_t ← (f̃_t(w_t^i))_{i∈[N]}.\n  end for\nend procedure
In most applications, W_i will be a convex set and f_t a convex function; this convexity is not necessary to prove a regret bound for the meta-algorithm. We simply need boundedness of the set W_i and Lipschitzness of the functions f_t, as specified in Assumption 1. This assumption implies that for any i, we have |f_t(w) − f_t(0)| ≤ R_i·L_i for any w ∈ W_i. Thus, we can design a meta-algorithm for this setup by using MULTISCALEFTPL with c_i = R_i·L_i, which is precisely what is described in Algorithm 4. The following theorem provides a bound on the regret of MULTISCALEOCO; it is a direct consequence of Theorem 1.
Theorem 2. Without loss of generality, assume that R_i·L_i ≥ 1.² Suppose that the inputs to Algorithm 4 satisfy Assumption 1. Then the iterates (w_t)_{t≤n} returned by Algorithm 4 follow the regret bound\n\nE[ ∑_{t=1}^{n} f_t(w_t) − inf_{w∈W_i} ∑_{t=1}^{n} f_t(w) ] ≤ E[Reg_n(i)] + O( R_i·L_i·√(n·log(R_i·L_i·n/π_i)) )  ∀i ∈ [N].  (5)
² For notational convenience, all Lipschitz bounds are assumed to be at least 1 without loss of generality for the remainder of the paper.
Theorem 2 shows that if we use Algorithm 4 to aggregate the iterates produced by a collection of sub-algorithms (ALG_i)_{i∈[N]}, the regret against any sub-algorithm i will only depend on that algorithm's scale, not the regret of the worst sub-algorithm.
Application 1: Parameter-free online learning in uniformly convex Banach spaces. As the first application of our framework, we give a generalization of the parameter-free online learning bounds found in [25; 27; 30; 31; 10] from Hilbert spaces to arbitrary uniformly convex Banach spaces. Recall that a Banach space (B, ‖·‖) is (2, λ)-uniformly convex if ½‖·‖² is λ-strongly convex with respect to itself [32].
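A concrete toy instantiation of this scheme (our own illustrative code, with the meta-step replaced by a greedy penalized choice rather than the full MULTISCALEFTPL strategy): online gradient descent sub-algorithms on nested ℓ_2 balls of radius e^i, aggregated at the meta level, for linear losses f_t(w) = ⟨g_t, w⟩ (so f_t(0) = 0 and centering is trivial).

```python
import math

class OGDBall:
    """Online gradient descent over {w : ||w||_2 <= R}: a Hilbert-space instance
    of the Mirror Descent sub-algorithms fed to MULTISCALEOCO."""
    def __init__(self, R, L, n, d):
        self.R = R
        self.eta = R / (L * math.sqrt(n))  # standard step size for L-Lipschitz losses
        self.w = [0.0] * d

    def predict(self):
        return list(self.w)

    def update(self, grad):
        self.w = [wi - self.eta * gi for wi, gi in zip(self.w, grad)]
        norm = math.sqrt(sum(wi * wi for wi in self.w))
        if norm > self.R:  # project back onto the ball
            self.w = [wi * self.R / norm for wi in self.w]

def multiscale_oco(grads, L, d, K):
    """Aggregate K OGD instances on balls of radius e^0, ..., e^{K-1}, choosing
    greedily by penalized cumulative loss (a stand-in for the FTPL meta-step)."""
    n = len(grads)
    subs = [OGDBall(math.e ** i, L, n, d) for i in range(K)]
    c = [math.e ** i * L for i in range(K)]                     # scales c_i = R_i * L_i
    B = [ci * math.sqrt(n * math.log(n * ci * K)) for ci in c]  # scale-sensitive penalty
    played = 0.0
    cum = [0.0] * K
    for g in grads:
        preds = [s.predict() for s in subs]
        i_t = min(range(K), key=lambda i: cum[i] + B[i])
        played += sum(gj * wj for gj, wj in zip(g, preds[i_t]))
        for i, s in enumerate(subs):
            cum[i] += sum(gj * wj for gj, wj in zip(g, preds[i]))
            s.update(g)  # linear loss: every sub-algorithm sees the same gradient
    return played

# Constant gradient g = (-1, 0): the loss rewards a large positive first coordinate.
loss = multiscale_oco([[-1.0, 0.0]] * 20, L=1.0, d=2, K=3)
```

Each sub-algorithm pays regret O(R_i·L·√n) on its own ball; Theorem 2 says the meta-level price for competing with ball i is only O(R_i·L·√(n·log(R_i·L·n/π_i))), i.e., it scales with the comparator's ball, not with the largest one.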
Our algorithm obtains a generalization of the oracle inequality (3) for any uniformly convex (B, ‖·‖) by running multiple instances of Mirror Descent, the workhorse of online convex optimization, and aggregating their iterates using MULTISCALEOCO. This strategy is thus efficient whenever Mirror Descent can be implemented efficiently. The collection of sub-algorithms used by MULTISCALEOCO, which was alluded to at the beginning of this section, is as follows: for each 1 ≤ i ≤ N := n + 1, set R_i = e^{i−1}, L_i = L, W_i = {w ∈ B : ‖w‖ ≤ R_i}, η_i = (R_i/L)·√(λ/n), and ALG_i = MIRRORDESCENT(η_i, W_i, ½‖·‖²). Finally, set π = Uniform([n + 1]).
Mirror Descent is reviewed in detail in Appendix A.2 in the supplementary material, but the only feature of its performance of importance to our analysis is that, when configured as described above, the iterates (w_t^i)_{t≤n} produced by ALG_i will satisfy ∑_{t=1}^{n} f_t(w_t^i) − inf_{w∈W_i} ∑_{t=1}^{n} f_t(w) ≤ O(R_i·L·√n) on any sequence of losses that are L-Lipschitz with respect to ‖·‖⋆. Using just this simple fact, combined with the regret bound for MULTISCALEOCO and a few technical details in Appendix A.2, we can deduce the following parameter-free learning oracle inequality:
Theorem 3 (Oracle inequality for uniformly convex Banach spaces). The iterates (w_t)_{t≤n} produced by MULTISCALEOCO on any L-Lipschitz (w.r.t. ‖·‖⋆) sequence of losses (f_t)_{t≤n} satisfy\n\nE[ ∑_{t=1}^{n} f_t(w_t) − ∑_{t=1}^{n} f_t(w) ] ≤ O( L·(‖w‖ + 1)·√(n·log((‖w‖ + 1)·L·n)) )  ∀w ∈ B.  (6)
Note that the above oracle inequality applies for any uniformly convex norm ‖·‖. Previous results only obtain bounds of this form efficiently when ‖·‖ is a Hilbert space norm. As is standard for such oracle inequality results, the bound is weaker than the optimal bound if ‖w‖ were selected in advance, but only by a mild √(log((‖w‖ + 1)·L·n)) factor.
Proposition 1. The algorithm can be implemented in time O(T_MD · poly(n)) per iteration, where T_MD is the time complexity of a single Mirror Descent update.
In the example above, the (2, λ)-uniform convexity condition was mainly chosen for familiarity. The result can easily be generalized to related notions such as q-uniform convexity (see [37]). More generally, the approach can be used to derive oracle inequalities with respect to a general strongly convex regularizer R defined over the space W. Such a bound would have the form O( L·√(n·(R(w) + 1)·log((R(w) + 1)·n)) ) for typical choices of R. This example captures well-known quantile bounds [22] when one takes R to be the KL-divergence and W to be the simplex, or, in the matrix case, takes R to be the quantum relative entropy and W to be the set of density matrices, as in [18].
Application 2: Oracle inequality for many ℓ_p norms. It is instructive to think of MULTISCALEOCO as executing a (scale-sensitive) online analogue of the structural risk minimization principle. We simply specify a set of subclasses and a prior π specifying the importance of each subclass, and we are guaranteed that the algorithm's performance matches that of each subclass, plus a penalty depending on the prior weight placed on that subclass. The advantage of this approach is that the nested structure used in Theorem 3 is completely inessential. This leads to the exciting prospect of developing parameter-free algorithms over new and exotic set systems. One such example is given now: the MULTISCALEOCO framework allows us to obtain an oracle inequality with respect to many ℓ_p norms in ℝ^d simultaneously. To the best of our knowledge, all previous works on parameter-free online learning have only provided oracle inequalities for a single norm.
Theorem 4. Fix ε > 0. Suppose that the loss functions (f_t)_{t≤n} are L_p-Lipschitz w.r.t. ‖·‖_{p′} for each p ∈ [1 + ε, 2]. Then there is a computationally efficient algorithm that guarantees regret\n\nE[ ∑_{t=1}^{n} f_t(w_t) − ∑_{t=1}^{n} f_t(w) ] ≤ O( (‖w‖_p + 1)·L_p·√(n·log((‖w‖_p + 1)·L_p·log(d)·n)/(p − 1)) )  (7)\n\nfor all w ∈ ℝ^d and all p ∈ [1 + ε, 2].
The configuration in the above theorem is described in full in Appendix A.2 in the supplementary material. The strategy can be trivially extended to handle p in the range (2, ∞). The inequality holds for p ≥ 1 + ε rather than for p ≥ 1 because the ℓ_1 norm is not uniformly convex, but this is easily rectified by changing the regularizer at p = 1; we omit this for simplicity of presentation.
We emphasize that the choice of ℓ_p norms for the result above was somewhat arbitrary; any finite collection of norms will also work. For example, the strategy can also be applied to matrix optimization over ℝ^{d×d} by replacing the ℓ_p norm with the Schatten S_p norm. The Schatten S_p norm has strong convexity parameter on the order of p − 1 (which matches the ℓ_p norm up to absolute constants [2]), so the only practical change to the setup in Theorem 4 will be the running time T_MD. Likewise, the approach applies to (p, q)-group norms as used in multi-task learning [20].
Application 3: Adapting to rank for online PCA. For the online PCA task, the learner predicts from a class W_k = {W ∈ ℝ^{d×d} : W ⪰ 0, ‖W‖_σ ≤ 1, ⟨W, I⟩ = k}. For a fixed value of k, such a class is a convex relaxation of the set of all rank-k projection matrices. After producing a prediction W_t, we experience affine loss functions f_t(W_t) = ⟨I − W_t, Y_t⟩, where Y_t ∈ Y := {Y ∈ ℝ^{d×d} : Y ⪰ 0, ‖Y‖_σ ≤ 1}. We leverage an analysis of online PCA due to [29] together with MULTISCALEOCO to derive an algorithm that competes with many values of the rank simultaneously. This gives the following result:
Theorem 5. There is an efficient algorithm for online PCA with regret bound\n\nE[ ∑_{t=1}^{n} ⟨I − W_t, Y_t⟩ − min_{W projection, rank(W)=k} ∑_{t=1}^{n} ⟨I − W, Y_t⟩ ] ≤ Õ(k·√n)  ∀k ∈ [d/2].\n\nFor a fixed value of k, the above bound is already optimal up to log factors, but it holds for all k simultaneously.
Application 4: Adapting to norm for Matrix Multiplicative Weights. In the MATRIX MULTIPLICATIVE WEIGHTS setting [1], we consider hypothesis classes of the form W_r = {W ∈ ℝ^{d×d} : W ⪰ 0, ‖W‖_Σ ≤ r}. Losses are given by f_t(W) = ⟨W, Y_t⟩, where ‖Y_t‖_σ ≤ 1. For a fixed value of r, the well-known MATRIX MULTIPLICATIVE WEIGHTS strategy has regret against W_r bounded by O(r·√(n·log d)). Using this strategy for fixed r as a sub-algorithm for MULTISCALEOCO, we achieve the following oracle inequality efficiently:
Theorem 6.
There is an efficient matrix prediction strategy with regret bound\n\nE[ ∑_{t=1}^{n} ⟨W_t, Y_t⟩ − ∑_{t=1}^{n} ⟨W, Y_t⟩ ] ≤ (‖W‖_Σ + 1)·√(n·log d·log((‖W‖_Σ + 1)·n))  ∀W ⪰ 0.  (8)
A remark on efficiency. All of our algorithms that provide bounds of the form (6) instantiate O(n) experts with MULTISCALEFTPL because, in general, the worst-case w for achieving (6) can have norm as large as e^n. If one has an a priori bound, say B, on the range at which each f_t attains its minimum, then the number of experts can be reduced to O(log(B)).
2.3 Supervised learning
We now consider the online supervised learning setting (Protocol 2), with the goal being to compete with a sequence of hypothesis classes (F_k)_{k∈[N]} simultaneously. Working in this setting makes clear a key feature of the meta-algorithm approach we have adopted: we can efficiently obtain online oracle inequalities for arbitrary nonlinear function classes, so long as we have an efficient algorithm for each F_k.
We obtain a supervised learning meta-algorithm by simply feeding the observed losses ℓ(·, y_t) (which may even be non-convex) to the meta-algorithm MULTISCALEFTPL in the same fashion as MULTISCALEOCO. The resulting strategy, which is described in detail in Appendix A.3 for completeness, is called MULTISCALELEARNING. We make the following assumptions, analogous to Assumption 1, which lead to the performance guarantee for MULTISCALELEARNING given in Theorem 7 below.
Assumption 2.
The sub-algorithms used by MULTISCALELEARNING satisfy the following conditions:\n• For each i ∈ [N], the iterates (ŷ_t^i)_{t≤n} produced by sub-algorithm ALG_i satisfy |ŷ_t^i| ≤ R_i.\n• For each i ∈ [N], the function ℓ(·, y_t) is L_i-Lipschitz on [−R_i, R_i].\n• For each sub-algorithm ALG_i, the iterates (ŷ_t^i)_{t≤n} enjoy a regret bound ∑_{t=1}^{n} ℓ(ŷ_t^i, y_t) − inf_{f∈F_i} ∑_{t=1}^{n} ℓ(f(x_t), y_t) ≤ Reg_n(i), where Reg_n(i) may be data- or algorithm-dependent.
Theorem 7. Suppose that the inputs to Algorithm 5 satisfy Assumption 2. Then the iterates (ŷ_t)_{t≤n} produced by the algorithm enjoy the regret bound\n\nE[ ∑_{t=1}^{n} ℓ(ŷ_t, y_t) − inf_{f∈F_i} ∑_{t=1}^{n} ℓ(f(x_t), y_t) ] ≤ E[Reg_n(i)] + O( R_i·L_i·√(n·log(R_i·L_i·n/π_i)) )  ∀i ∈ [N].  (9)
Online penalized risk minimization. In the statistical learning setting, oracle inequalities for arbitrary sequences of hypothesis classes F_1, ..., F_N are readily available. Such inequalities are typically stated in terms of complexity parameters for the classes (F_k), such as VC dimension or Rademacher complexity. For the online learning setting, it is well known that sequential Rademacher complexity Rad_n(F) provides a sequential counterpart to these complexity measures [33], meaning that it generically characterizes the minimax optimal regret for Lipschitz losses. We will obtain an oracle inequality in terms of this parameter.
Assumption 3.
The sequence of hypothesis classes F_1, ..., F_N is such that:\n1. There is an efficient algorithm ALG_k producing iterates (ŷ_t^k)_{t≤n} satisfying ∑_{t=1}^{n} ℓ(ŷ_t^k, y_t) − inf_{f∈F_k} ∑_{t=1}^{n} ℓ(f(x_t), y_t) ≤ C·L·Rad_n(F_k) for any L-Lipschitz loss, where C is some constant. (An algorithm with this regret is always guaranteed to exist, but may not be efficient.)\n2. Each F_k has output range [−R_k, R_k], where R_k ≥ 1 without loss of generality.\n3. Rad_n(F_k) = Ω(R_k·√n); this is obtained by most non-trivial classes.
Theorem 8 (Online penalized risk minimization). Under Assumption 3, there is an efficient (in N) algorithm that achieves the following regret bound for any L-Lipschitz loss:\n\nE[ ∑_{t=1}^{n} ℓ(ŷ_t, y_t) − inf_{f∈F_k} ∑_{t=1}^{n} ℓ(f(x_t), y_t) ] ≤ O( L·Rad_n(F_k)·√(log(L·Rad_n(F_k)·k)) )  ∀k ∈ [N].  (10)\n\nAs in the previous section, one can derive tighter regret bounds and more efficient (e.g. sublinear in N) algorithms if F_1, F_2, ... are nested.
Application: Multiple kernel learning.
Theorem 9. Let H_1, ..., H_N be reproducing kernel Hilbert spaces for which each H_k has a kernel K_k such that sup_{x∈X} √(K_k(x, x)) ≤ B_k. Then there is an efficient learning algorithm that guarantees\n\nE[ ∑_{t=1}^{n} ℓ(ŷ_t, y_t) − ∑_{t=1}^{n} ℓ(f(x_t), y_t) ] ≤ O( L·B_k·(‖f‖_{H_k} + 1)·√(n·log(L·B_k·k·n·(‖f‖_{H_k} + 1))) )  ∀k, ∀f ∈ H_k,\n\nfor any L-Lipschitz loss, whenever an efficient algorithm is available for the norm ball in each H_k.
3 Discussion and Further Directions
Related work. There are two directions in parameter-free online learning that have been explored extensively. The first considers bounds of the form (3); namely, the Hilbert space version of the more general setting explored in Section 2.2. Beginning with [26], which obtained a slightly looser rate than (3), research has focused on obtaining tighter dependence on ‖w‖_2 and log(n) in this type of bound [25; 27; 30; 31]; all of these algorithms run in linear time per update step.
Recent work [10; 11] has extended these results to the case where the Lipschitz constant is not known in advance. These works give lower bounds for general norms, but only give efficient algorithms for Hilbert spaces. Extending Algorithm 4 to reach the Pareto frontier of regret in the unknown-Lipschitz setting as described in [11] may be an interesting direction for future research.

The second direction concerns so-called "quantile bounds" [9; 22; 23; 31] for the experts setting, where the learner's decision set $\mathcal{W}$ is the simplex $\Delta_N$ and losses are bounded in $\ell_\infty$. The multi-scale machinery developed in the present work is not needed to obtain bounds for this setting because the losses are uniformly bounded across all model classes. Indeed, [14] recovered a basic form of quantile bound using the vanilla multiplicative weights strategy as a meta-algorithm. It is not known whether the more sophisticated data-dependent quantile bounds given in [22; 23] can be recovered in the same fashion.

Losses with curvature. The $O(\sqrt{n})$-type regret bounds provided by Algorithm 3 are appropriate when the sub-algorithms themselves incur $O(\sqrt{n})$ regret bounds. However, assuming certain curvature properties (such as strong convexity, exp-concavity, stochastic mixability, etc. [16; 38]) of the loss functions, it is possible to construct sub-algorithms that admit significantly more favorable regret bounds ($O(\log n)$ or even $O(1)$). These are also referred to as "fast rates" in online learning. A natural direction for further study is to design a meta-algorithm that admits logarithmic or constant regret to each sub-algorithm, assuming that the loss functions of interest satisfy similar curvature properties, with the regret to each individual sub-algorithm adapted to the curvature parameters for that sub-algorithm.
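To make the fast-rate phenomenon above concrete, consider a standard one-dimensional example (not an algorithm from the paper): for the 2-strongly-convex squared loss $\ell_t(w) = (w - z_t)^2$, online gradient descent with step size $1/(2t)$ reduces to predicting the running mean of past targets and achieves $O(\log n)$ regret, in contrast to the generic $O(\sqrt{n})$ rate for merely Lipschitz losses.

```python
import numpy as np

def ogd_squared_loss(z):
    """Online gradient descent on l_t(w) = (w - z_t)^2 with step size
    1/(2t), matching the loss's strong-convexity parameter 2.  The
    update w <- w - (w - z_t)/t makes w the running mean of past
    targets, a classic O(log n)-regret fast-rate example."""
    w, cumulative = 0.0, 0.0
    for t, z_t in enumerate(z, start=1):
        cumulative += (w - z_t) ** 2
        w -= 2.0 * (w - z_t) / (2.0 * t)  # gradient 2(w - z_t), step 1/(2t)
    return cumulative

rng = np.random.default_rng(0)
z = rng.uniform(0.0, 1.0, size=2000)
# Regret against the best fixed prediction in hindsight (the mean of z).
regret = ogd_squared_loss(z) - float(((z - z.mean()) ** 2).sum())
```

After 2000 rounds the regret stays bounded by a small multiple of $\log n$, whereas a step size tuned only for Lipschitz losses would typically leave regret growing like $\sqrt{n}$.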
Perhaps surprisingly, for the special case of the logistic loss, improper prediction and aggregation strategies similar to those proposed in this paper offer a way to circumvent known proper learning lower bounds [17]. This approach will be explored in detail in a forthcoming companion paper.

Computational efficiency. We suspect that a running time of O(n) to obtain inequalities like (6) may be unavoidable through our approach, since we do not make use of the relationship between sub-algorithms beyond using the nested class structure. Whether the runtime of MULTISCALEFTPL can be brought down to match O(n) is an open question. This boils down to whether or not the min-max optimization problem in the algorithm description can simultaneously be solved in 1) linear time in the number of experts and 2) strongly polynomial time in the scales $c_i$.

Acknowledgements

We thank Francesco Orabona and Dávid Pál for inspiring initial discussions. Part of this work was done while DF was an intern at Google Research and while DF and KS were visiting the Simons Institute for the Theory of Computing. DF is supported by the NDSEG fellowship.

References

[1] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

[2] Keith Ball, Eric A. Carlen, and Elliott H. Lieb. Sharp uniform convexity and smoothness inequalities for trace norms. Inventiones mathematicae, 115(1):463–482, 1994.

[3] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2003.

[4] Peter L. Bartlett, Stéphane Boucheron, and Gábor Lugosi. Model selection and error estimation. Machine Learning, 48(1-3):85–113, 2002.

[5] Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning.
In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.

[6] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[7] Sébastien Bubeck, Nikhil Devanur, Zhiyi Huang, and Rad Niazadeh. Online auctions and multi-scale online learning. In The 18th ACM Conference on Economics and Computation (EC 17), 2017.

[8] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[9] Kamalika Chaudhuri, Yoav Freund, and Daniel J. Hsu. A parameter-free hedging algorithm. In Advances in Neural Information Processing Systems, pages 297–305, 2009.

[10] Ashok Cutkosky and Kwabena A. Boahen. Online convex optimization with unconstrained domains and losses. In Advances in Neural Information Processing Systems 29, pages 748–756, 2016.

[11] Ashok Cutkosky and Kwabena A. Boahen. Online learning without prior information. In Proceedings of the 30th Annual Conference on Learning Theory, 2017.

[12] Steven De Rooij, Tim Van Erven, Peter D. Grünwald, and Wouter M. Koolen. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 15(1):1281–1316, 2014.

[13] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

[14] Dylan J. Foster, Alexander Rakhlin, and Karthik Sridharan. Adaptive online learning. In Advances in Neural Information Processing Systems, pages 3375–3383, 2015.

[15] Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.

[16] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization.
Machine Learning, 69(2-3):169–192, 2007.

[17] Elad Hazan, Tomer Koren, and Kfir Y. Levy. Logistic regression: Tight bounds for stochastic and online optimization. In Proceedings of The 27th Conference on Learning Theory, pages 197–209, 2014.

[18] Elad Hazan, Satyen Kale, and Shai Shalev-Shwartz. Near-optimal algorithms for online matrix prediction. SIAM Journal on Computing, 46(2):744–773, 2017.

[19] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems 21, pages 793–800. MIT Press, 2009.

[20] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13(Jun):1865–1890, 2012.

[21] Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.

[22] Wouter M. Koolen and Tim Van Erven. Second-order quantile methods for experts and combinatorial games. In Proceedings of The 28th Conference on Learning Theory, pages 1155–1175, 2015.

[23] Haipeng Luo and Robert E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Conference on Learning Theory, pages 1286–1304, 2015.

[24] Pascal Massart. Concentration inequalities and model selection. Lecture Notes in Mathematics, 1896, 2007.

[25] Brendan McMahan and Jacob Abernethy. Minimax optimal algorithms for unconstrained linear optimization. In Advances in Neural Information Processing Systems, pages 2724–2732, 2013.

[26] Brendan McMahan and Matthew Streeter. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems, pages 2402–2410, 2012.

[27] H. Brendan McMahan and Francesco Orabona.
Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Proceedings of The 27th Conference on Learning Theory, pages 1020–1039, 2014.

[28] Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.

[29] Jiazhong Nie, Wojciech Kotłowski, and Manfred K. Warmuth. Online PCA with optimal regrets. In International Conference on Algorithmic Learning Theory, pages 98–112. Springer, 2013.

[30] Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116–1124, 2014.

[31] Francesco Orabona and Dávid Pál. From coin betting to parameter-free online learning. arXiv preprint arXiv:1602.04128, 2016.

[32] Gilles Pisier. Martingales in Banach spaces (in connection with type and cotype). Course IHP, Feb. 2–8, 2011.

[33] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In Advances in Neural Information Processing Systems 23, pages 1984–1992, 2010.

[34] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Relax and randomize: From value to algorithms. In Advances in Neural Information Processing Systems 25, pages 2150–2158, 2012.

[35] James Renegar. A polynomial-time algorithm, based on Newton's method, for linear programming. Mathematical Programming, 40(1):59–93, 1988.

[36] John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.

[37] Nati Srebro, Karthik Sridharan, and Ambuj Tewari.
On the universality of online mirror descent. In Advances in Neural Information Processing Systems, pages 2645–2653, 2011.

[38] Tim van Erven, Peter D. Grünwald, Nishant A. Mehta, Mark D. Reid, and Robert C. Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793–1861, 2015.

[39] Vladimir Vapnik. Estimation of Dependences Based on Empirical Data, volume 40. Springer-Verlag New York, 1982.

[40] Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.