{"title": "Progressive mixture rules are deviation suboptimal", "book": "Advances in Neural Information Processing Systems", "page_first": 41, "page_last": 48, "abstract": "We consider the learning task consisting in predicting as well as the best function in a finite reference set G up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), it is known that the progressive mixture rule g_n satisfies E R(g_n) < min_{g in G} R(g) + Cst (log|G|)/n where n denotes the size of the training set, E denotes the expectation wrt the training set distribution. This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is only no better than Cst / sqrt{n}, and not the expected Cst / n. It also provides an algorithm which does not suffer from this drawback.", "full_text": "Progressive mixture rules are deviation suboptimal\n\nJean-Yves Audibert\n\nWillow Project - Certis Lab\nParisTech, Ecole des Ponts\n\n77455 Marne-la-Vall\u00b4ee, France\n\naudibert@certis.enpc.fr\n\nAbstract\n\nWe consider the learning task consisting in predicting as well as the best function\nin a \ufb01nite reference set G up to the smallest possible additive term. 
If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), it is known that the progressive mixture rule ĝ satisfies

E R(ĝ) ≤ min_{g∈G} R(g) + Cst (log|G|)/n,   (1)

where n denotes the size of the training set, and E denotes the expectation w.r.t. the training set distribution. This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is no better than Cst/√n: it fails to achieve the expected Cst/n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates.

1 Introduction

Why are we concerned with deviations? The efficiency of an algorithm can be summarized by its expected risk, but this does not specify the fluctuations of its risk. In several application fields of learning algorithms, these fluctuations play a key role: in finance for instance, the bigger the losses can be, the more money the bank needs to freeze in order to alleviate these possible losses. In this case, a "good" algorithm is an algorithm having not only a low expected risk but also small deviations.

Why are we interested in the learning task of doing as well as the best prediction function of a given finite set? First, one way of doing model selection among a finite family of submodels is to cut the training set into two parts, use the first part to learn the best prediction function of each submodel, and use the second part to learn a prediction function which performs as well as the best of the prediction functions learned on the first part of the training set. This scheme is very powerful since it leads to theoretical results which, in most situations, would be very hard to prove without it. 
Our work here is related to the second step of this scheme.

Secondly, assume we want to predict the value of a continuous variable, and that we have many candidates for explaining it. An input point can then be seen as the vector containing the prediction of each candidate. The problem is what to do when the dimensionality d of the input data (equivalently, the number of prediction functions) is much higher than the number of training points n. In this setting, one cannot use linear regression and its variants in order to predict as well as the best candidate up to a small additive term. Besides, (penalized) empirical risk minimization is doomed to be suboptimal (see the second part of Theorem 2 and also [1]).

As far as the expected risk is concerned, the only known correct way of predicting as well as the best prediction function is to use the progressive mixture rule or its variants. These algorithms are introduced in Section 2 and their main good property is given in Theorem 1. In this work we prove that they do not work well as far as risk deviations are concerned (see the second part of Theorem 3). We also provide a new algorithm for this "predict as well as the best" problem (see the end of Section 4).

2 The progressive mixture rule and its variants

We assume that we observe n pairs of input-output denoted Z1 = (X1, Y1), . . . , Zn = (Xn, Yn), and that each pair has been independently drawn from the same unknown distribution denoted P. The input and output spaces are denoted respectively X and Y, so that P is a probability distribution on the product space Z ≜ X × Y. The quality of a (prediction) function g : X → Y is measured by the risk (or generalization error):

R(g) = E_{(X,Y)∼P} ℓ[Y, g(X)],

where ℓ[Y, g(X)] denotes the loss (possibly infinite) incurred by predicting g(X) when the true output is Y. 
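To make the setting concrete, here is a minimal numerical sketch of the quantities just defined: the empirical counterpart of the risk R(g) and the selection of the best function in a finite set G by empirical risk minimization. All function names, the toy data, and the candidate set are illustrative, not from the paper.

```python
import random

def empirical_risk(g, pairs, loss):
    """Empirical counterpart of R(g): average loss of g over (x, y) pairs."""
    return sum(loss(y, g(x)) for x, y in pairs) / len(pairs)

def select_best(candidates, pairs, loss):
    """Empirical risk minimization over a finite reference set G."""
    return min(candidates, key=lambda g: empirical_risk(g, pairs, loss))

# Toy data (hypothetical): y = 0.7 x + noise, outputs clipped to [-1, 1].
random.seed(0)
data = []
for _ in range(200):
    x = random.uniform(-1.0, 1.0)
    y = max(-1.0, min(1.0, 0.7 * x + random.gauss(0.0, 0.1)))
    data.append((x, y))

sq_loss = lambda y, p: (y - p) ** 2          # least square loss
G = [lambda x, c=c: c * x for c in (0.0, 0.5, 0.7, 1.0)]
best = select_best(G, data, sq_loss)
```

With 200 points the candidate matching the true slope wins comfortably; the point is only to fix the objects G, ℓ and R used throughout.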
We work under the following assumptions for the data space and the loss function ℓ : Y × Y → R ∪ {+∞}.

Main assumptions. The input space is assumed to be infinite: |X| = +∞. The output space is a non-trivial (i.e. infinite) interval of R symmetrical w.r.t. some a ∈ R: for any y ∈ Y, we have 2a − y ∈ Y. The loss function is

• uniformly exp-concave: there exists λ > 0 such that for any y ∈ Y, the set {y′ ∈ R : ℓ(y, y′) < +∞} is an interval containing a on which the function y′ ↦ e^{−λℓ(y,y′)} is concave,
• symmetrical: for any y1, y2 ∈ Y, ℓ(y1, y2) = ℓ(2a − y1, 2a − y2),
• admissible: for any y, y′ ∈ Y ∩ ]a; +∞[, ℓ(y, 2a − y′) > ℓ(y, y′),
• well behaved at center: for any y ∈ Y ∩ ]a; +∞[, the function ℓ_y : y′ ↦ ℓ(y, y′) is twice continuously differentiable on a neighborhood of a and ℓ′_y(a) < 0.

These assumptions imply that

• Y necessarily has one of the following forms: ]−∞; +∞[, [a − ζ; a + ζ] or ]a − ζ; a + ζ[ for some ζ > 0,
• for any y ∈ Y, from the exp-concavity assumption, the function ℓ_y : y′ ↦ ℓ(y, y′) is convex on the interval on which it is finite¹. As a consequence, the risk R is also a convex function (on the convex set of prediction functions for which it is finite).

The assumptions were motivated by the fact that they are satisfied in the following settings:

• least square loss with bounded outputs: Y = [y_min; y_max] and ℓ(y1, y2) = (y1 − y2)². Then we have a = (y_min + y_max)/2 and may take λ = 1/[2(y_max − y_min)²].
• entropy loss: Y = [0; 1] and ℓ(y1, y2) = y1 log(y1/y2) + (1 − y1) log((1 − y1)/(1 − y2)). Note that ℓ(0, 1) = ℓ(1, 0) = +∞. Then we have a = 1/2 and may take λ = 1.
• exponential (or AdaBoost) loss: Y = [−y_max; y_max] and ℓ(y1, y2) = e^{−y1 y2}. Then we have a = 0 and may take λ = e^{−y_max²}.
• logit loss: Y = [−y_max; y_max] and ℓ(y1, y2) = log(1 + e^{−y1 y2}). Then we have a = 0 and may take λ = e^{−y_max²}.

Progressive indirect mixture rule. Let G be a finite reference set of prediction functions. Under the previous assumptions, the only known algorithms satisfying (1) are the progressive indirect mixture rules defined below.

For any i ∈ {0, . . . , n}, the cumulative loss suffered by the prediction function g on the first i pairs of input-output is

Σ_i(g) ≜ Σ_{j=1}^{i} ℓ[Y_j, g(X_j)],

where by convention we take Σ_0 ≡ 0. Let π denote the uniform distribution on G. We define the probability distribution π̂_i on G as

π̂_i ∝ e^{−λΣ_i} · π,

equivalently, for any g ∈ G, π̂_i(g) = e^{−λΣ_i(g)} / (Σ_{g′∈G} e^{−λΣ_i(g′)}). This distribution concentrates on functions having a low cumulative loss up to time i. For any i ∈ {0, . . . , n}, let ĥ_i be a prediction function such that

∀ (x, y) ∈ Z,   ℓ[y, ĥ_i(x)] ≤ −(1/λ) log E_{g∼π̂_i} e^{−λℓ[y, g(x)]}.   (2)

The progressive indirect mixture rule produces the prediction function

ĝ_pim = (1/(n+1)) Σ_{i=0}^{n} ĥ_i.

¹ Indeed, if ξ denotes the function e^{−λℓ_y}, from Jensen's inequality, for any probability distribution of Y, E ℓ_y(Y) = E(−(1/λ) log ξ(Y)) ≥ −(1/λ) log E ξ(Y) ≥ −(1/λ) log ξ(EY) = ℓ_y(EY).

From the uniform exp-concavity assumption and Jensen's inequality, ĥ_i does exist, since one may take ĥ_i = E_{g∼π̂_i} g. 
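As a toy illustration of these definitions, the following sketch computes the Gibbs weights π̂_i ∝ e^{−λΣ_i} · π and averages the posterior-mean predictors ĥ_i = E_{g∼π̂_i} g over i = 0, …, n (the particular choice of ĥ_i discussed next in the text). All names are hypothetical.

```python
import math

def progressive_mixture_predict(x, G, pairs, loss, lam):
    """Predict at x by averaging, over i = 0..n, the mean of g(x) under the
    Gibbs distribution pi_i(g) proportional to exp(-lam * Sigma_i(g))."""
    n = len(pairs)
    cum = [0.0] * len(G)          # Sigma_i(g): cumulative loss of each g
    total = 0.0
    for i in range(n + 1):
        m = max(-lam * c for c in cum)                 # stabilize the exponentials
        w = [math.exp(-lam * c - m) for c in cum]      # unnormalized Gibbs weights
        z = sum(w)
        total += sum(wg * g(x) for wg, g in zip(w, G)) / z
        if i < n:                  # update to Sigma_{i+1} with the next pair
            xi, yi = pairs[i]
            for k, g in enumerate(G):
                cum[k] += loss(yi, g(xi))
    return total / (n + 1)

# Toy usage: two constant experts, all observed outputs equal to 1.
sq_loss = lambda y, p: (y - p) ** 2
G = [lambda x: 0.0, lambda x: 1.0]
pairs = [(0.0, 1.0)] * 10
pred = progressive_mixture_predict(0.0, G, pairs, sq_loss, lam=0.5)
```

Note that the prediction always lies in the convex hull of {g(x) : g ∈ G}, which is the property the lower bound of Theorem 3 exploits.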
This particular choice leads to the progressive mixture rule, for which the predicted output for any x ∈ X is

ĝ_pm(x) = Σ_{g∈G} [ (1/(n+1)) Σ_{i=0}^{n} e^{−λΣ_i(g)} / (Σ_{g′∈G} e^{−λΣ_i(g′)}) ] g(x).

Consequently, any result that holds for any progressive indirect mixture rule in particular holds for the progressive mixture rule.

The idea of a progressive mean of estimators was introduced by Barron ([2]) in the context of density estimation with Kullback-Leibler loss. The form ĝ_pm is due to Catoni ([3]). It was also independently proposed in [4]. The study of this procedure was carried out in density estimation and least square regression in [5, 6, 7, 8]. Results for general losses can be found in [9, 10]. The progressive indirect mixture rule is inspired by the work of Vovk, Haussler, Kivinen and Warmuth [11, 12, 13] on sequential prediction, and was studied in the "batch" setting in [10]. Finally, in the upper bounds we state, e.g. Inequality (1), one should notice that there is no constant larger than 1 in front of min_{g∈G} R(g), as opposed to some existing upper bounds (e.g. [14]). This work really studies the behaviour of the excess risk, that is the random variable R(ĝ) − min_{g∈G} R(g).

The largest integer smaller than or equal to the logarithm in base 2 of x is denoted by ⌊log₂ x⌋.

3 Expectation convergence rate

The following theorem, whose proof is omitted, shows that the expectation convergence rate of any progressive indirect mixture rule is (i) at least (log|G|)/n and (ii) cannot be uniformly improved, even when we consider only probability distributions on Z for which the output has almost surely two symmetrical values (e.g. 
{−1;+1} classification with exponential or logit losses).

Theorem 1 Any progressive indirect mixture rule satisfies

E R(ĝ_pim) ≤ min_{g∈G} R(g) + log|G| / (λ(n+1)).

Let y1 ∈ Y − {a} and let d be a positive integer. There exists a set G of d prediction functions such that: for any learning algorithm, there exists a probability distribution generating the data for which

• the output marginal is supported by 2a − y1 and y1: P(Y ∈ {2a − y1; y1}) = 1,
• E R(ĝ) ≥ min_{g∈G} R(g) + e^{−1} κ (1 ∧ ⌊log₂|G|⌋/(n+1)), with κ ≜ sup_{y∈Y} [ℓ(y1, a) − ℓ(y1, y)] > 0.

The second part of Theorem 1 has the same (log|G|)/n rate as the lower bounds obtained in sequential prediction ([12]). From the link between sequential predictions and our "batch" setting with i.i.d. data (see e.g. [10, Lemma 3]), upper bounds for sequential prediction lead to upper bounds for i.i.d. data, and lower bounds for i.i.d. data lead to lower bounds for sequential prediction. The converse of this last assertion is not true, so that the second part of Theorem 1 is not a consequence of the lower bounds of [12].

The following theorem, whose proof is also omitted, shows that for an appropriate set G: (i) the empirical risk minimizer has a √((log|G|)/n) expectation convergence rate, and (ii) any empirical risk minimizer and any of its penalized variants are really poor algorithms in our learning task, since their expectation convergence rate cannot be faster than √((log|G|)/n) (see [5, p.14] and [1] for results of the same spirit). 
This last point explains the interest we have in progressive mixture rules.

Theorem 2 If B ≜ sup_{y,y′,y″∈Y} [ℓ(y, y′) − ℓ(y, y″)] < +∞, then any empirical risk minimizer, which produces a prediction function ĝ_erm in argmin_{g∈G} Σ_n(g), satisfies:

E R(ĝ_erm) ≤ min_{g∈G} R(g) + B √(2 log|G| / n).

Let y1, ỹ1 ∈ Y ∩ ]a; +∞[ and let d be a positive integer. There exists a set G of d prediction functions such that: for any learning algorithm producing a prediction function in G (e.g. ĝ_erm), there exists a probability distribution generating the data for which

• the output marginal is supported by 2a − y1 and y1: P(Y ∈ {2a − y1; y1}) = 1,
• E R(ĝ) ≥ min_{g∈G} R(g) + (δ/8) (√(⌊log₂|G|⌋/n) ∧ 2), with δ ≜ ℓ(y1, 2a − ỹ1) − ℓ(y1, ỹ1) > 0.

The lower bound of Theorem 2 also says that one should not use cross-validation. This holds for the loss functions considered in this work, and not for, e.g., the classification loss: ℓ(y, y′) = 1_{y≠y′}.

4 Deviation convergence rate

The following theorem shows that the deviation convergence rate of any progressive indirect mixture rule is (i) at least 1/√n and (ii) cannot be uniformly improved, even when we consider only probability distributions on Z for which the output has almost surely two symmetrical values (e.g. {−1;+1} classification with exponential or logit losses).

Theorem 3 If B ≜ sup_{y,y′,y″∈Y} [ℓ(y, y′) − ℓ(y, y″)] < +∞, then any progressive indirect mixture rule satisfies: for any ε > 0, with probability at least 1 − ε w.r.t. 
the training set distribution, we have

R(ĝ_pim) ≤ min_{g∈G} R(g) + B √(2 log(2ε⁻¹)/(n+1)) + log|G| / (λ(n+1)).

Let y1 and ỹ1 in Y ∩ ]a; +∞[ be such that ℓ_{y1} is twice continuously differentiable on [a; ỹ1], ℓ′_{y1}(ỹ1) ≤ 0 and ℓ″_{y1}(ỹ1) > 0. Consider the prediction functions g1 ≡ ỹ1 and g2 ≡ 2a − ỹ1. For any training set size n large enough, there exist ε > 0 and a distribution generating the data such that

• the output marginal is supported by y1 and 2a − y1,
• with probability larger than ε, we have

R(ĝ_pim) − min_{g∈{g1,g2}} R(g) ≥ c √(log(eε⁻¹)/n),

where c is a positive constant depending only on the loss function, the symmetry parameter a and the output values y1 and ỹ1.

Proof. See Section 5.

This result is quite surprising since it gives an example of an algorithm which is optimal in terms of expectation convergence rate and for which the deviation convergence rate is (significantly) worse than the expectation convergence rate.

In fact, despite their popularity based on their unique expectation convergence rate, the progressive mixture rules are not good algorithms, since a long argument essentially based on convexity shows that the following algorithm has both expectation and deviation convergence rates of order 1/n. Let ĝ_erm be the minimizer of the empirical risk among functions in G. Let g̃ be the minimizer of the empirical risk in the star Ĝ = ∪_{g∈G} [g; ĝ_erm]. The algorithm producing g̃ satisfies, for some C > 0, for any ε > 0, with probability at least 1 − ε w.r.t. the training set distribution,

R(g̃) ≤ min_{g∈G} R(g) + C log(ε⁻¹|G|)/n.

This algorithm also has the benefit of being parameter-free. 
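A minimal sketch of this two-step "star" procedure: first ERM over G, then empirical risk minimization over the segments [g; ĝ_erm]. Scanning each segment on a finite grid of mixing coefficients is an implementation shortcut of ours, not part of the paper's construction; all names are illustrative.

```python
def star_estimator(G, pairs, loss, grid=101):
    """Two-step procedure: ERM over G, then empirical risk minimization over
    the segments [g; g_erm] joining every g in G to the ERM solution.
    The segments are scanned on a finite alpha-grid (a shortcut; the
    theoretical procedure minimizes over the whole segment)."""
    def emp_risk(f):
        return sum(loss(y, f(x)) for x, y in pairs) / len(pairs)

    g_erm = min(G, key=emp_risk)
    best_f, best_r = g_erm, emp_risk(g_erm)
    for g in G:
        for k in range(grid):
            a = k / (grid - 1)
            f = lambda x, a=a, g=g: a * g(x) + (1.0 - a) * g_erm(x)
            r = emp_risk(f)
            if r < best_r:
                best_f, best_r = f, r
    return best_f

# Toy usage: constant experts 0 and 1, constant target 0.3.
sq_loss = lambda y, p: (y - p) ** 2
G = [lambda x: 0.0, lambda x: 1.0]
pairs = [(0.0, 0.3)] * 5
f_star = star_estimator(G, pairs, sq_loss)
```

On this toy sample the procedure leaves G and lands inside the segment joining the two experts, which is exactly the freedom that ERM restricted to G lacks.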
On the contrary, in practice, one will have recourse to cross-validation to tune the parameter λ of the progressive mixture rule.

To summarize, to predict as well as the best prediction function in a given set G, one should not restrain the algorithm to produce its prediction function in the set G. The progressive mixture rules satisfy this principle since they produce a prediction function in the convex hull of G. This allows one to achieve (log|G|)/n convergence rates in expectation. The proof of the lower bound of Theorem 3 shows that the progressive mixtures overfit the data: the deviations of their excess risk are not PAC bounded by C log(ε⁻¹|G|)/n, while an appropriate algorithm producing prediction functions on the edges of the convex hull achieves the log(ε⁻¹|G|)/n deviation convergence rate. Future work might look at whether one can transpose this algorithm to the sequential prediction setting, in which, up to now, the algorithms to predict as well as the best expert were dominated by algorithms producing a mixture expert inside the convex hull of the set of experts.

5 Proof of Theorem 3

5.1 Proof of the upper bound

Let Z_{n+1} = (X_{n+1}, Y_{n+1}) be an input-output pair independent from the training set Z1, . . . , Zn and with the same distribution P. From the convexity of y′ ↦ ℓ(y, y′), we have

R(ĝ_pim) ≤ (1/(n+1)) Σ_{i=0}^{n} R(ĥ_i).   (3)

Now from [15, Theorem 1] (see also [16, Proposition 1]), for any ε > 0, with probability at least 1 − ε, we have

(1/(n+1)) Σ_{i=0}^{n} R(ĥ_i) ≤ (1/(n+1)) Σ_{i=0}^{n} ℓ(Y_{i+1}, ĥ_i(X_{i+1})) + B √(log(ε⁻¹)/(2(n+1))).   (4)

Using [12, Theorem 3.8] and the exp-concavity assumption, we have

Σ_{i=0}^{n} ℓ(Y_{i+1}, ĥ_i(X_{i+1})) ≤ min_{g∈G} Σ_{i=0}^{n} ℓ(Y_{i+1}, g(X_{i+1})) + log|G|/λ.   (5)

Let g̃ ∈ argmin_G R. By Hoeffding's inequality, with probability at least 1 − ε, we have

(1/(n+1)) Σ_{i=0}^{n} ℓ(Y_{i+1}, g̃(X_{i+1})) ≤ R(g̃) + B √(log(ε⁻¹)/(2(n+1))).   (6)

Merging (3), (4), (5) and (6), with probability at least 1 − 2ε, we get

R(ĝ_pim) ≤ (1/(n+1)) Σ_{i=0}^{n} ℓ(Y_{i+1}, g̃(X_{i+1})) + log|G|/(λ(n+1)) + B √(log(ε⁻¹)/(2(n+1)))
 ≤ R(g̃) + B √(2 log(ε⁻¹)/(n+1)) + log|G|/(λ(n+1)).

5.2 Sketch of the proof of the lower bound

We cannot use standard tools like Assouad's argument (see e.g. [17, Theorem 14.6]), because if it were possible, it would mean that the lower bound would hold for any algorithm, and in particular for g̃, and this is false. To prove that progressive indirect mixture rules have no fast exponential deviation inequalities, we will show that on some event with not too small probability, for most of the i in {0, . . . , n}, π_{−λΣ_i} concentrates on the wrong function.

The proof is organized as follows. First we define the probability distribution for which we will prove that the progressive indirect mixture rules cannot have fast deviation convergence rates. Then we define the event on which the progressive indirect mixture rules do not perform well. We lower bound the probability of this excursion event. Finally we conclude by lower bounding R(ĝ_pim) on the excursion event.

Before starting the proof, note that from the "well behaved at center" and exp-concavity assumptions, for any y ∈ Y ∩ ]a; +∞[, on a neighborhood of a, we have ℓ″_y ≥ λ(ℓ′_y)², and since ℓ′_y(a) < 0, y1 and ỹ1 exist. 
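The inequality ℓ″_y ≥ λ(ℓ′_y)² used here can be checked in one line by differentiating the exp-concavity condition twice (notation as in the main assumptions):

```latex
f = e^{-\lambda \ell_y} \text{ concave} \;\Rightarrow\; 0 \ge f''
  = \lambda\bigl(\lambda (\ell_y')^2 - \ell_y''\bigr)\, e^{-\lambda \ell_y}
  \;\Longleftrightarrow\; \ell_y'' \ge \lambda (\ell_y')^2 .
```

Since ℓ′_y(a) < 0 and ℓ′_y is continuous, the right-hand side is strictly positive near a, so ℓ_y is strictly convex on a neighborhood of a; this is what guarantees that suitable y1 and ỹ1 exist.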
Due to limited space, some technical computations have been removed.

5.2.1 Probability distribution generating the data and first consequences.

Let γ ∈ ]0; 1] be a parameter to be tuned later. We consider a distribution generating the data such that the output distribution satisfies, for any x ∈ X,

P(Y = y1|X = x) = (1 + γ)/2 = 1 − P(Y = y2|X = x),

where y2 = 2a − y1. Let ỹ2 = 2a − ỹ1. From the symmetry and admissibility assumptions, we have ℓ(y2, ỹ2) = ℓ(y1, ỹ1) < ℓ(y1, ỹ2) = ℓ(y2, ỹ1). Introduce

δ ≜ ℓ(y1, ỹ2) − ℓ(y1, ỹ1) > 0.   (7)

We have

R(g2) − R(g1) = ((1+γ)/2) [ℓ(y1, ỹ2) − ℓ(y1, ỹ1)] + ((1−γ)/2) [ℓ(y2, ỹ2) − ℓ(y2, ỹ1)] = γδ.   (8)

Therefore g1 is the best prediction function in {g1, g2} for the distribution we have chosen. Introduce W_j ≜ 1_{Y_j=y1} − 1_{Y_j=y2} and S_i ≜ Σ_{j=1}^{i} W_j. For any i ∈ {1, . . . , n}, we have

Σ_i(g2) − Σ_i(g1) = Σ_{j=1}^{i} [ℓ(Y_j, ỹ2) − ℓ(Y_j, ỹ1)] = Σ_{j=1}^{i} W_j δ = δ S_i.

The weight given by the Gibbs distribution π_{−λΣ_i} to the function g1 is

π_{−λΣ_i}(g1) = e^{−λΣ_i(g1)} / (e^{−λΣ_i(g1)} + e^{−λΣ_i(g2)}) = 1/(1 + e^{λ[Σ_i(g1) − Σ_i(g2)]}) = 1/(1 + e^{−λδS_i}).   (9)

5.2.2 An excursion event on which the progressive indirect mixture rules will not perform well.

Equality (9) leads us to consider the event

E_τ = {∀ i ∈ {τ, . . . , n}, S_i ≤ −τ},

with τ the smallest integer larger than (log n)/(λδ) such that n − τ is even (for convenience). 
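The excursion event E_τ can also be explored numerically. The following Monte Carlo sketch (parameter values arbitrary; an illustration, not part of the proof) estimates P(E_τ) for the shifted walk S_i:

```python
import random

def excursion_probability(n, gamma, tau, trials=2000, seed=1):
    """Monte Carlo estimate of P(E_tau), where E_tau is the event that the
    shifted random walk S_i (steps +1 w.p. (1+gamma)/2, -1 otherwise)
    satisfies S_i <= -tau for every i in {tau, ..., n}."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s, ok = 0, True
        for i in range(1, n + 1):
            s += 1 if rng.random() < (1.0 + gamma) / 2.0 else -1
            if i >= tau and s > -tau:
                ok = False
                break
        hits += ok
    return hits / trials

p = excursion_probability(n=200, gamma=0.05, tau=5)
```

As the analysis below predicts, the event is rare but only polynomially so in n, which is what drives the √(log n / n) deviation lower bound.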
We have

(log n)/(λδ) ≤ τ ≤ (log n)/(λδ) + 2.   (10)

The event E_τ can be seen as an excursion event of the random walk defined through the random variables W_j = 1_{Y_j=y1} − 1_{Y_j=y2}, j ∈ {1, . . . , n}, which are equal to +1 with probability (1 + γ)/2 and to −1 with probability (1 − γ)/2.

From (9), on the event E_τ, for any i ∈ {τ, . . . , n}, we have

π_{−λΣ_i}(g1) ≤ 1/(n+1).   (11)

This means that π_{−λΣ_i} concentrates on the wrong function, i.e. the function g2 having the larger risk (see (8)).

5.2.3 Lower bound of the probability of the excursion event.

This requires looking at the probability that a slightly shifted random walk in the integer space has a very long excursion above a certain threshold. To lower bound this probability, we will first look at the non-shifted random walk. Then we will see that for a small enough shift parameter, the probabilities of shifted random walk events are close to the ones associated with the non-shifted random walk.

Let N be a positive integer. Let σ_1, . . . , σ_N be N independent Rademacher variables: P(σ_i = +1) = P(σ_i = −1) = 1/2. Let s_i ≜ Σ_{j=1}^{i} σ_j be the sum of the first i Rademacher variables. We start with the following lemma for sums of Rademacher variables (proof omitted).

Lemma 1 Let m and t be positive integers. We have

P(max_{1≤k≤N} s_k ≥ t; s_N ≠ t; |s_N − t| ≤ m) = 2 P(t < s_N ≤ t + m).   (12)

Let σ′_1, . . . , σ′_N be N independent shifted Rademacher variables, to the extent that P(σ′_i = +1) = (1 + γ)/2 = 1 − P(σ′_i = −1). These random variables satisfy the following key lemma (proof omitted).

Lemma 2 For any set A ⊂ {(ε_1, . . . , ε_N) ∈ {−1, 1}^N : |Σ_{i=1}^{N} ε_i| ≤ M}, where M is a positive integer, we have

P((σ′_1, . . . , σ′_N) ∈ A) ≥ ((1−γ)/(1+γ))^{M/2} (1 − γ²)^{N/2} P((σ_1, . . . , σ_N) ∈ A).   (13)

We may now lower bound the probability of the excursion event E_τ. Let M be an integer larger than τ. We still use W_j ≜ 1_{Y_j=y1} − 1_{Y_j=y2} for j ∈ {1, . . . , n}. By using Lemma 2 with N = n − 2τ, we obtain

P(E_τ) ≥ P(W_1 = −1, . . . , W_{2τ} = −1; ∀ 2τ < i ≤ n, Σ_{j=2τ+1}^{i} W_j ≤ τ)
 = ((1−γ)/2)^{2τ} P(∀ i ∈ {1, . . . , N}, Σ_{j=1}^{i} σ′_j ≤ τ)
 ≥ ((1−γ)/2)^{2τ} ((1−γ)/(1+γ))^{M/2} (1 − γ²)^{N/2} P(|s_N| ≤ M; ∀ i ∈ {1, . . . , N}, s_i ≤ τ).

By using Lemma 1, since τ ≤ M, the r.h.s. probability can be lower bounded, and after some computations we obtain

P(E_τ) ≥ τ ((1−γ)/2)^{2τ} ((1−γ)/(1+γ))^{M/2} (1 − γ²)^{N/2} [P(s_N = τ) − P(s_N = M)],   (14)

where we recall that τ has the order of log n, that N = n − 2τ has the order of n, and that γ > 0 and M ≥ τ have to be appropriately chosen.

To control the probabilities of the r.h.s., we use Stirling's formula

n^n e^{−n} √(2πn) e^{1/(12n+1)} < n! < n^n e^{−n} √(2πn) e^{1/(12n)},   (15)

and get, for any s ∈ [0; N] such that N − s is even,

P(s_N = s) ≥ √(2/(πN)) (1 − s²/N²)^{−N/2} ((1 − s/N)/(1 + s/N))^{s/2} e^{−1/(6(N+s)) − 1/(6(N−s))},   (16)

and similarly

P(s_N = s) ≤ √(2/(πN)) (1 − s²/N²)^{−N/2} ((1 − s/N)/(1 + s/N))^{s/2} e^{1/(12N)}.   (17)

These computations and (14) lead us to take M as the smallest integer larger than √n such that n − M is even. Indeed, from (10), (16) and (17), we obtain lim_{n→+∞} √n [P(s_N = τ) − P(s_N = M)] = c, where c = √(2/π) (1 − e^{−1/2}) > 0. Therefore for n large enough we have

P(E_τ) ≥ (cτ/(2√n)) ((1−γ)/2)^{2τ} ((1−γ)/(1+γ))^{M/2} (1 − γ²)^{N/2}.   (18)

The last two terms of the r.h.s. of (18) lead us to take γ of order 1/√n, up to possibly a logarithmic term. We obtain the following lower bound on the excursion probability.

Lemma 3 If γ = √(C0 (log n)/n) with C0 a positive constant, then for any large enough n,

P(E_τ) ≥ 1/n^{C0}.

5.2.4 Behavior of the progressive indirect mixture rule on the excursion event.

From now on, we work on the event E_τ. We have ĝ_pim = (Σ_{i=0}^{n} ĥ_i)/(n + 1). We still use δ ≜ ℓ(y1, ỹ2) − ℓ(y1, ỹ1) = ℓ(y2, ỹ1) − ℓ(y2, ỹ2). On the event E_τ, for any x ∈ X and any i ∈ {τ, . . . , n}, by definition of ĥ_i, we have

ℓ[y2, ĥ_i(x)] − ℓ(y2, ỹ2) ≤ −(1/λ) log E_{g∼π_{−λΣ_i}} e^{−λ{ℓ[y2, g(x)] − ℓ(y2, ỹ2)}}
 = −(1/λ) log {e^{−λδ} + (1 − e^{−λδ}) π_{−λΣ_i}(g2)}
 ≤ −(1/λ) log {1 − (1 − e^{−λδ})/(n+1)}.

In particular, for any n large enough, we have ℓ[y2, ĥ_i(x)] − ℓ(y2, ỹ2) ≤ C n^{−1}, with C > 0 independent from γ. From the convexity of the function y ↦ ℓ(y2, y) and by Jensen's inequality, we obtain

ℓ[y2, ĝ_pim(x)] − ℓ(y2, ỹ2) ≤ (1/(n+1)) Σ_{i=0}^{n} {ℓ[y2, ĥ_i(x)] − ℓ(y2, ỹ2)} ≤ τδ/(n+1) + C n^{−1} < C1 (log n)/n

for some constant C1 > 0 independent from γ. Let us now prove that, for n large enough, we have

ỹ2 ≤ ĝ_pim(x) ≤ ỹ2 + C √((log n)/n) ≤ ỹ1,   (19)

with C > 0 independent from γ.

From (19), we obtain

R(ĝ_pim) − R(g1) = ((1+γ)/2) [ℓ(y1, ĝ_pim) − ℓ(y1, ỹ1)] + ((1−γ)/2) [ℓ(y2, ĝ_pim) − ℓ(y2, ỹ1)]
 = ((1+γ)/2) [ℓ_{y1}(ĝ_pim) − ℓ_{y1}(ỹ1)] + ((1−γ)/2) [ℓ_{y1}(2a − ĝ_pim) − ℓ_{y1}(ỹ2)]
 = ((1+γ)/2) [δ + ℓ_{y1}(ĝ_pim) − ℓ_{y1}(ỹ2)] + ((1−γ)/2) [−δ + ℓ_{y1}(2a − ĝ_pim) − ℓ_{y1}(ỹ1)]
 ≥ γδ − (ĝ_pim − ỹ2) |ℓ′_{y1}(ỹ2)|
 ≥ γδ − C2 √((log n)/n),   (20)

with C2 independent from γ. We may take γ = (2C2/δ) √((log n)/n) and obtain: for n large enough, on the event E_τ, we have R(ĝ_pim) − R(g1) ≥ C √((log n)/n). 
From Lemma 3, this inequality holds with probability at least 1/n^{C4} for some C4 > 0. To conclude, for any n large enough, there exists ε > 0 s.t. with probability at least ε, R(ĝ_pim) − R(g1) ≥ c √(log(eε⁻¹)/n), where c is a positive constant depending only on the loss function, the symmetry parameter a and the output values y1 and ỹ1.

References

[1] G. Lecué. Suboptimality of penalized empirical risk minimization in classification. In Proceedings of the 20th annual conference on Computational Learning Theory, 2007.

[2] A. Barron. Are Bayes rules consistent in information? In T.M. Cover and B. Gopinath, editors, Open Problems in Communication and Computation, pages 85-91. Springer, 1987.

[3] O. Catoni. A mixture approach to universal model selection. Preprint LMENS 97-30, available from http://www.dma.ens.fr/edition/preprints/Index.97.html, 1997.

[4] A. Barron and Y. Yang. Information-theoretic determination of minimax rates of convergence. Ann. Stat., 27(5):1564-1599, 1999.

[5] O. Catoni. Universal aggregation rules with exact bias bound. Preprint n.510, http://www.proba.jussieu.fr/mathdoc/preprints/index.html#1999, 1999.

[6] G. Blanchard. The progressive mixture estimator for regression trees. Ann. Inst. Henri Poincaré, Probab. Stat., 35(6):793-820, 1999.

[7] Y. Yang. Combining different procedures for adaptive regression. Journal of Multivariate Analysis, 74:135-161, 2000.

[8] F. Bunea and A. Nobel. Sequential procedures for aggregating arbitrary estimators of a conditional mean. Technical report, 2005.

[9] A. Juditsky, P. Rigollet, and A.B. Tsybakov. Learning by mirror averaging. Preprint n.1034, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 and Paris 7, 2005.

[10] J.-Y. Audibert. A randomized online learning algorithm for better variance control. 
In Proceedings of the 19th annual conference on Computational Learning Theory, pages 392-407, 2006.

[11] V.G. Vovk. Aggregating strategies. In Proceedings of the 3rd annual workshop on Computational Learning Theory, pages 371-386, 1990.

[12] D. Haussler, J. Kivinen, and M.K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Trans. on Information Theory, 44(5):1906-1925, 1998.

[13] V.G. Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, pages 153-173, 1998.

[14] M. Wegkamp. Model selection in nonparametric regression. Ann. Stat., 31(1):252-273, 2003.

[15] T. Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Proceedings of the 18th annual conference on Computational Learning Theory, pages 173-187, 2005.

[16] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.

[17] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.
", "award": [], "sourceid": 472, "authors": [{"given_name": "Jean-yves", "family_name": "Audibert", "institution": null}]}