{"title": "Mixability in Statistical Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1691, "page_last": 1699, "abstract": "Statistical learning and sequential prediction are two different but related formalisms to study the quality of predictions. Mapping out their relations and transferring ideas is an active area of investigation. We provide another piece of the puzzle by showing that an important concept in sequential prediction, the mixability of a loss, has a natural counterpart in the statistical setting, which we call stochastic mixability. Just as ordinary mixability characterizes fast rates for the worst-case regret in sequential prediction, stochastic mixability characterizes fast rates in statistical learning. We show that, in the special case of log-loss, stochastic mixability reduces to a well-known (but usually unnamed) martingale condition, which is used in existing convergence theorems for minimum description length and Bayesian inference. In the case of 0/1-loss, it reduces to the margin condition of Mammen and Tsybakov, and in the case that the model under consideration contains all possible predictors, it is equivalent to ordinary mixability.", "full_text": "Mixability in Statistical Learning

Tim van Erven (Université Paris-Sud, France; tim@timvanerven.nl)
Mark D. Reid (ANU and NICTA, Australia; Mark.Reid@anu.edu.au)
Peter D. Grünwald (CWI and Leiden University, the Netherlands; pdg@cwi.nl)
Robert C. Williamson (ANU and NICTA, Australia; Bob.Williamson@anu.edu.au)

Abstract

Statistical learning and sequential prediction are two different but related formalisms to study the quality of predictions. Mapping out their relations and transferring ideas is an active area of investigation.
We provide another piece of the puzzle by showing that an important concept in sequential prediction, the mixability of a loss, has a natural counterpart in the statistical setting, which we call stochastic mixability. Just as ordinary mixability characterizes fast rates for the worst-case regret in sequential prediction, stochastic mixability characterizes fast rates in statistical learning. We show that, in the special case of log-loss, stochastic mixability reduces to a well-known (but usually unnamed) martingale condition, which is used in existing convergence theorems for minimum description length and Bayesian inference. In the case of 0/1-loss, it reduces to the margin condition of Mammen and Tsybakov, and in the case that the model under consideration contains all possible predictors, it is equivalent to ordinary mixability.

1 Introduction

In statistical learning (also called batch learning) [1] one obtains a random sample (X1, Y1), ..., (Xn, Yn) of independent pairs of observations, which are all distributed according to the same distribution P*. The goal is to select a function f̂ that maps X to a prediction f̂(X) of Y for a new pair (X, Y) from the same P*. The quality of f̂ is measured by its excess risk, which is the expectation of its loss ℓ(Y, f̂(X)) minus the expected loss of the best prediction function f* in a given class of functions F. Analysis in this setting usually involves giving guarantees about the performance of f̂ in the worst case over the choice of the distribution of the data.

In contrast, the setting of sequential prediction (also called online learning) [2] makes no probabilistic assumptions about the source of the data. Instead, pairs of observations (xt, yt) are assumed to become available one at a time, in rounds t = 1, ..., n, and the goal is to select a function f̂t just before round t, which maps xt to a prediction of yt.
The quality of predictions f̂1, ..., f̂n is evaluated by their regret, which is the sum of their losses ℓ(y1, f̂1(x1)), ..., ℓ(yn, f̂n(xn)) on the actual observations minus the total loss of the best fixed prediction function f* in a class of functions F. In sequential prediction the usual analysis involves giving guarantees about the performance of f̂1, ..., f̂n in the worst case over all possible realisations of the data. When stating rates of convergence, we will divide the worst-case regret by n, which makes the rates comparable to rates in the statistical learning setting.

Mapping out the relations between statistical learning and sequential prediction is an active area of investigation, and several connections are known. For example, using any of a variety of online-to-batch conversion techniques [3], any sequential predictions f̂1, ..., f̂n may be converted into a single statistical prediction f̂, and the statistical performance of f̂ is bounded by the sequential prediction performance of f̂1, ..., f̂n. Moreover, a deep understanding of the relation between worst-case rates in both settings is provided by Abernethy, Agarwal, Bartlett and Rakhlin [4]. Amongst others, their results imply that for many loss functions the worst-case rate in sequential prediction exceeds the worst-case rate in statistical learning.

Fast Rates. In sequential prediction with a finite class F, it is known that the worst-case regret can be bounded by a constant if and only if the loss ℓ has the property of being mixable [5, 6] (subject to mild regularity conditions on the loss). Dividing by n, this corresponds to O(1/n) rates, which is fast compared to the usual O(1/√n) rates.

In statistical learning, there are two kinds of conditions that are associated with fast rates.
First, for 0/1-loss, fast rates (faster than O(1/√n)) are associated with Mammen and Tsybakov's margin condition [7, 8], which depends on a parameter κ. In the nicest case, κ = 1, and then O(1/n) rates are possible. Second, for log(arithmic) loss there is a single supermartingale condition that is essential to obtain fast rates in all convergence proofs of two-part minimum description length (MDL) estimators, and in many convergence proofs of Bayesian estimators. This condition, used by e.g. [9, 10, 11, 12, 13, 14], sometimes remains implicit (see Example 1 below) and usually goes unnamed. A special case has been called the 'supermartingale property' by Chernov, Kalnishkan, Zhdanov and Vovk [15]. Audibert [16] also introduced a closely related condition, which does however seem subtly different.

Our Contribution. We define the notion of stochastic mixability of a loss ℓ, set of predictors F, and distribution P*, which we argue to be the natural analogue of mixability for the statistical setting on two grounds: first, we show that it is closely related to both the supermartingale condition and the margin condition, the two properties that are known to be related to fast rates; second, we show that it shares various essential properties with ordinary mixability and in specific cases is even equivalent to ordinary mixability.

To support the first part of our argument, we show the following: (a) for bounded losses (including 0/1-loss), stochastic mixability is equivalent to the best case (κ = 1) of a generalization of the margin condition; other values of κ may be interpreted in terms of a slightly relaxed version of stochastic mixability; (b) for log-loss, stochastic mixability reduces to the supermartingale condition; (c) in general, stochastic mixability allows uniform O(log |Fn|/n) statistical learning rates to be achieved, where |Fn| is the size of a sub-model Fn ⊂ F considered
at sample size n. Finally, (d) if stochastic mixability does not hold, then in general O(log |Fn|/n) statistical learning rates cannot be achieved, at least not for 0/1-loss or for log-loss.

To support the second part of our argument, we show: (e) if the set F is 'full', i.e. it contains all prediction functions for the given loss, then stochastic mixability turns out to be formally equivalent to ordinary mixability (if F is not full, then either condition may hold without the other). We choose to call our property stochastic mixability rather than, say, 'generalized margin condition for κ = 1' or 'generalized supermartingale condition', because (f) we also show that the general condition can be formulated in an alternative way (Theorem 2) that directly indicates a strong relation to ordinary mixability, and (g) just like ordinary mixability, it can be interpreted as the requirement that a set of so-called pseudo-likelihoods is (effectively) convex.

We note that special cases of results (a)-(e) already follow from existing work of many other authors; we provide a detailed comparison in Section 7. Our contributions are to generalize these results, to relate them to each other, to the notion of mixability from sequential prediction, and to the interpretation in terms of convexity of a set of pseudo-likelihoods. This leads to our central conclusion: the concept of stochastic mixability is closely related to mixability and plays a fundamental role in achieving fast rates in the statistical learning setting.

Outline. In §2 we define both ordinary mixability and stochastic mixability. We show that two of the standard ways to express mixability have natural analogues that express stochastic mixability (leading to (f)). In Example 1 we specialize the definition to log-loss and explain its importance in the literature on MDL and Bayesian inference, leading to (b).
A third interpretation of stochastic mixability and standard mixability in terms of sets (g) is described in §3. The equivalence between mixability and stochastic mixability if F is full is presented in §4, where we also show that the equivalence need not hold if F is not full (e). In §5, we turn our attention to a version of the margin condition that does not assume that F contains the Bayes optimal predictor, and we show that (a slightly relaxed version of) stochastic mixability is equivalent to the margin condition, taking care of (a). We show (§6) that if stochastic mixability holds, O(log |Fn|/n) rates can always be achieved (c), and that in some cases in which it does not hold, O(log |Fn|/n) rates cannot be achieved (d). Finally (§7) we connect our results to previous work in the literature. Proofs omitted from the main body of the paper are in the supplementary material.

2 Mixability and Stochastic Mixability

We now introduce the notions of mixability and stochastic mixability, showing two equivalent formulations of the latter.

2.1 Mixability

A loss function ℓ : Y × A → [0, ∞] is a nonnegative function that measures the quality of a prediction a ∈ A when the true outcome is y ∈ Y by ℓ(y, a). We will assume that all spaces come equipped with appropriate σ-algebras, so we may define distributions on them, and that the loss function ℓ is measurable.

Definition 1 (Mixability). For η > 0, a loss ℓ is called η-mixable if for any distribution π on A there exists a single prediction a_π such that

  ℓ(y, a_π) ≤ -(1/η) ln ∫ e^{-ηℓ(y,a)} π(da)   for all y.   (1)

It is called mixable if there exists an η > 0 such that it is η-mixable.

Let A be a random variable with distribution π.
Then (1) may be rewritten as

  E_π[ e^{-ηℓ(y,A)} / e^{-ηℓ(y,a_π)} ] ≤ 1   for all y.   (2)

2.2 Stochastic Mixability

Let F be a set of predictors f : X → A, which are measurable functions that map any input x ∈ X to a prediction f(x). For example, if A = Y = {0, 1} and the loss is the 0/1-loss, ℓ_{0/1}(y, a) = 1{y ≠ a}, then the predictors are classifiers. Let P* be the distribution of a pair of random variables (X, Y) with values in X × Y. Most expectations in the paper are with respect to P*. Whenever this is not the case we will add a subscript to the expectation operator, as in (2).

Definition 2 (Stochastic Mixability). For any η ≥ 0, we say that (ℓ, F, P*) is η-stochastically mixable if there exists an f* ∈ F such that

  E[ e^{-ηℓ(Y,f(X))} / e^{-ηℓ(Y,f*(X))} ] ≤ 1   for all f ∈ F.   (3)

We call (ℓ, F, P*) stochastically mixable if there exists an η > 0 such that it is η-stochastically mixable.

By Jensen's inequality, we see that (3) implies 1 ≥ E[ e^{-ηℓ(Y,f(X))} / e^{-ηℓ(Y,f*(X))} ] ≥ e^{E[η(ℓ(Y,f*(X)) - ℓ(Y,f(X)))]}, so that

  E[ℓ(Y, f*(X))] ≤ E[ℓ(Y, f(X))]   for all f ∈ F,

and hence the definition of stochastic mixability presumes that f* minimizes E[ℓ(Y, f(X))] over all f ∈ F. We will assume throughout the paper that such an f* exists, and that E[ℓ(Y, f*(X))] < ∞. The larger η, the stronger the requirement of η-stochastic mixability:

Proposition 1. Any triple (ℓ, F, P*) is 0-stochastically mixable. And if 0 < ν < η, then η-stochastic mixability implies ν-stochastic mixability.

Example 1 (Log-loss). Let F be a set of conditional probability densities and let ℓ_log be log-loss, i.e. A is the set of densities on Y, f(x)(y) is written, as usual, as f(y | x), and ℓ_log(y, f(x)) := -ln f(y | x).
For log-loss, statistical learning becomes equivalent to conditional density estimation with random design (see, e.g., [14]). Equation 3 now becomes equivalent to

  A_η(f* ‖ f) := E[ ( f(Y | X) / f*(Y | X) )^η ] ≤ 1.   (4)

A_η has been called the generalized Hellinger affinity [12] in the literature. If the model is correct, i.e. it contains the true conditional density p*(y | x), then, because the log-loss is a proper loss [17], we must have f* = p* and then, for η = 1, trivially A_η(f* ‖ f) = 1 for all f ∈ F. Thus if the model F is correct, then the log-loss is η-stochastically mixable for η = 1. In that case, for η = 1/2, A_η turns into the standard definition of Hellinger affinity [10].

Equation 4, which just expresses 1-stochastic mixability for log-loss, is used in all previous convergence theorems for 2-part MDL density estimation [10, 12, 11, 18], and, more implicitly, in various convergence theorems for Bayesian procedures, including the pioneering paper by Doob [9]. All these results assume that the model F is correct, but, if one studies the proofs, one finds that the assumption is only needed to establish that (4) holds for η = 1. For example, as first noted by [12], if F is a convex set of densities, then (4) also holds for η = 1, even if the model is incorrect, and, indeed, two-part MDL converges at fast rates in such cases (see [14] for a precise definition of what this means, as well as a more general treatment of (4)).
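The convexity claim can be checked numerically in a toy case. The following sketch is our own illustration, not part of the paper; the Bernoulli family, the grid, and all names are assumptions. It takes a convex but misspecified model under log-loss and verifies that (4) holds with η = 1 for every element of the model.

```python
import math

# Sketch (assumed setup): Y in {0,1} with no covariates X, log-loss, and the
# convex model F = {Bernoulli(theta) : theta in [0.1, 0.4]}.  The true
# distribution is P* = Bernoulli(0.7), so the model is incorrect.
p_star = 0.7

def log_loss_risk(theta):
    # E[-ln f_theta(Y)] under P*
    return -(p_star * math.log(theta) + (1 - p_star) * math.log(1 - theta))

# The risk minimizer within the model lies at the boundary, theta* = 0.4.
thetas = [0.1 + 0.001 * i for i in range(301)]
theta_star = min(thetas, key=log_loss_risk)
assert abs(theta_star - 0.4) < 1e-6

def affinity(theta, eta=1.0):
    # Generalized Hellinger affinity A_eta(f* || f) = E[(f(Y)/f*(Y))^eta]
    return (p_star * (theta / theta_star) ** eta
            + (1 - p_star) * ((1 - theta) / (1 - theta_star)) ** eta)

# (4) holds with eta = 1 for every f in the convex, misspecified model.
assert all(affinity(t) <= 1 + 1e-9 for t in thetas)
print("A_1(f* || f) <= 1 for all f in F, even though P* is not in F")
```

The bound is tight at f = f* itself, where the affinity equals 1 exactly.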
Kleijn and Van der Vaart [13], in their extensive analysis of Bayesian nonparametric inference when the model is wrong, also use the fact that (4) holds with η = 1 for convex models to show that fast posterior concentration rates hold for such models even if they do not contain the true p*.

The definition of stochastic mixability looks similar to (2), but whereas π is a distribution on predictions, P* is a distribution on outcomes (X, Y). Thus at first sight the resemblance appears to be only superficial. It is therefore quite surprising that stochastic mixability can also be expressed in a way that looks like (1), which provides a first hint that the relation goes deeper.

Theorem 2. Let η > 0. Then (ℓ, F, P*) is η-stochastically mixable if and only if for any distribution π on F there exists a single predictor f* ∈ F such that

  E[ℓ(Y, f*(X))] ≤ E[ -(1/η) ln ∫ e^{-ηℓ(Y,f(X))} π(df) ].   (5)

Notice that, without loss of generality, we can always choose f* to be the minimizer of E[ℓ(Y, f(X))]. Then f* does not depend on π.

3 The Convexity Interpretation

There is a third way to express mixability, as the convexity of a set of so-called pseudo-likelihoods. We will now show that stochastic mixability can also be interpreted as convexity of the corresponding set in the statistical learning setting.

Following Chernov et al. [15], we first note that the essential feature of a loss ℓ with corresponding set of predictions A is the set of achievable losses they induce:

  L = {l : Y → [0, ∞] | ∃a ∈ A : l(y) = ℓ(y, a) for all y ∈ Y}.

If we would reparametrize the loss by a different set of predictions A′, while keeping L the same, then essentially nothing would change.
For example, for 0/1-loss standard ways to parametrize predictions are by A = {0, 1}, by A = {-1, +1} or by A = R with the interpretation that predicting a ≥ 0 maps to the prediction 1 and a < 0 maps to the prediction 0. Of course these are all equivalent, because L is the same.

It will be convenient to consider the set of functions that lie above the achievable losses in L:

  S = S_ℓ = {l : Y → [0, ∞] | ∃l′ ∈ L : l(y) ≥ l′(y) for all y ∈ Y}.

Chernov et al. call this the super prediction set. It plays a role similar to the role of the epigraph of a function in convex analysis. Let η > 0. Then with each element l ∈ S in the super prediction set, we associate a pseudo-likelihood p(y) = e^{-ηl(y)}. Note that 0 ≤ p(y) ≤ 1, but it is generally not the case that ∫ p(y) µ(dy) = 1 for some reference measure µ on Y, so p(y) is not normalized. Let e^{-ηS} = {e^{-ηl} | l ∈ S} denote the set of all such pseudo-likelihoods. By multiplying (1) by -η and exponentiating, it can be shown that η-mixability is exactly equivalent to the requirement that e^{-ηS} is convex [2, 15]. And like for the first two expressions of mixability, there is an analogous convexity interpretation for stochastic mixability.

[Figure 1 (left panel: 'Not stochastically mixable'; right panel: 'Stochastically mixable'): The relation between convexity and stochastic mixability for log-loss, η = 1 and X = {x} a singleton, in which case P* and the elements of P_F(η) can all be interpreted as distributions on Y.]

In order to define pseudo-likelihoods in the statistical setting, we need to take into account that the predictions f(X) of the predictors in F are not deterministic, but depend on X. Hence we define conditional pseudo-likelihoods p(Y | X) = e^{-ηℓ(Y,f(X))}. (See also Example 1.)
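To make the convexity picture concrete, here is a minimal sketch (our own, not from the paper; the two-element model and the truth are assumptions): a mixture of two pseudo-likelihoods can achieve strictly smaller expected loss than either element of the set, and this kind of effective non-convexity as seen from P* is exactly what stochastic mixability rules out.

```python
import math

# Sketch (assumed setup): log-loss with eta = 1, Y in {0,1}, no X, so
# pseudo-likelihoods e^{-eta * loss} are just Bernoulli densities.  Model:
# the two-element set {Bernoulli(0.2), Bernoulli(0.8)}; truth: Bernoulli(0.5).
p_star = 0.5

def expected_log_loss(p):
    # E[-ln p(Y)] under P*, for the pseudo-likelihood with p = Pr(Y = 1)
    return -(p_star * math.log(p) + (1 - p_star) * math.log(1 - p))

best_in_set = min(expected_log_loss(q) for q in (0.2, 0.8))

# The 50/50 convex combination of the two pseudo-likelihoods is the
# Bernoulli(0.5) density, and it strictly beats both elements of the set:
# the set does not look convex from the perspective of P*, so this triple
# is not 1-stochastically mixable.
mixed = expected_log_loss(0.5)
assert mixed < best_in_set
print("hull minimum %.4f < set minimum %.4f" % (mixed, best_in_set))
```

With a convex model (or a correct one) the mixture could not improve on the best single element, which is the situation depicted in the right panel of Figure 1.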
There is no need to introduce a conditional analogue of the super prediction set. Instead, let P_F(η) = {e^{-ηℓ(Y,f(X))} | f ∈ F} denote the set of all conditional pseudo-likelihoods. For λ ∈ [0, 1], a convex combination of any two p0, p1 ∈ P_F(η) can be defined as p_λ(Y | X) = (1 - λ)p0(Y | X) + λp1(Y | X). And consequently, we may speak of the convex hull co P_F(η) = {p_λ | p0, p1 ∈ P_F(η), λ ∈ [0, 1]} of P_F(η).

Corollary 3. Let η > 0. Then η-stochastic mixability of (ℓ, F, P*) is equivalent to the requirement that

  min_{p ∈ P_F(η)} E[ -(1/η) ln p(Y | X) ] = min_{p ∈ co P_F(η)} E[ -(1/η) ln p(Y | X) ].   (6)

Proof. This follows directly from Theorem 2 after rewriting it in terms of conditional pseudo-likelihoods.

Notice that the left-hand side of (6) equals E[ℓ(Y, f*(X))], which does not depend on η. Equation 6 expresses that the convex hull operator has no effect, which means that P_F(η) looks convex from the perspective of P*. See Figure 1 for an illustration for log-loss. Thus we obtain an interpretation of η-stochastic mixability as effective convexity of the set of pseudo-likelihoods P_F(η) with respect to P*.

Figure 1 suggests that f* should be unique if the loss is stochastically mixable, which is almost right. It is in fact the loss ℓ(Y, f*(X)) of f* that is unique (almost surely):

Corollary 4. If (ℓ, F, P*) is stochastically mixable and there exist f*, g* ∈ F such that E[ℓ(Y, f*(X))] = E[ℓ(Y, g*(X))] = min_{f∈F} E[ℓ(Y, f(X))], then ℓ(Y, f*(X)) = ℓ(Y, g*(X)) almost surely.

Proof. Let π(f*) = π(g*) = 1/2.
Then, by Theorem 2 and (strict) convexity of -ln,

  min_{f∈F} E[ℓ(Y, f(X))] ≤ E[ -(1/η) ln( (1/2) e^{-ηℓ(Y,f*(X))} + (1/2) e^{-ηℓ(Y,g*(X))} ) ]
                          ≤ E[ (1/2) ℓ(Y, f*(X)) + (1/2) ℓ(Y, g*(X)) ] = min_{f∈F} E[ℓ(Y, f(X))].

Hence both inequalities must hold with equality. For the second inequality this is only the case if ℓ(Y, f*(X)) = ℓ(Y, g*(X)) almost surely, which was to be shown.

4 When Mixability and Stochastic Mixability Are the Same

Having observed that mixability and stochastic mixability of a loss share several common features, we now show that in specific cases the two concepts even coincide. More specifically, Theorem 5 below shows that a loss ℓ (meeting two requirements) is η-mixable if and only if it is η-stochastically mixable relative to F_full, the set of all functions from X to A, and all distributions P*. To avoid measurability issues, we will assume that X is countable throughout this section.

The two conditions we assume of ℓ are both related to its set of pseudo-likelihoods e^{-ηS}, which was defined in Section 3. The first condition is that e^{-ηS} is closed. When Y is infinite, we mean closed relative to the topology for the supremum norm ‖p‖∞ = sup_{y∈Y} |p(y)|. The second, more technical condition is that e^{-ηS} is pre-supportable. That is, for every pseudo-likelihood p ∈ e^{-ηS}, its pre-image s ∈ S (defined for each y ∈ Y by s(y) := -(1/η) ln p(y)) is supportable. Here, a point s ∈ S is supportable if it is optimal for some distribution P*_Y over Y; that is, if there exists a distribution P*_Y over Y such that E_{P*_Y}[s(Y)] ≤ E_{P*_Y}[t(Y)] for all t ∈ S. This is the case, for example, for all proper losses [17].

We say (ℓ, F) is η-stochastically mixable if (ℓ, F, P*) is η-stochastically mixable for all distributions P* on X × Y.

Theorem 5.
Suppose X is countable. Let η > 0 and suppose ℓ is a loss such that its pseudo-likelihood set e^{-ηS} is closed and pre-supportable. Then (ℓ, F_full) is η-stochastically mixable if and only if ℓ is η-mixable.

This result generalizes Theorem 9 and Lemma 11 by Chernov et al. [15] from finite Y to arbitrary continuous Y, which they raised as an open question. In their setting, there are no explanatory variables x, which may be emulated in our framework by letting X contain only a single element. Their conditions also imply (by their Lemma 10) that the loss ℓ is proper, which implies that e^{-ηS} is closed and pre-supportable. We note that for proper losses η-mixability is especially well understood [19].

The proof of Theorem 5 is broken into two lemmas (the proofs of which are in the supplementary material). The first establishes conditions for when mixability implies stochastic mixability, borrowing from a similar result for log-loss by Li [12].

Lemma 6. Let η > 0. Suppose the Bayes optimal predictor f*_B(x) ∈ arg min_{a∈A} E[ℓ(Y, a) | X = x] is in the model: f*_B = f* ∈ F. If ℓ is η-mixable, then (ℓ, F, P*) is η-stochastically mixable.

The second lemma shows that stochastic mixability implies mixability.

Lemma 7. Suppose the conditions of Theorem 5 are satisfied. If (ℓ, F_full) is η-stochastically mixable, then it is η-mixable.

The above two lemmata are sufficient to prove the equivalence of stochastic and ordinary mixability.

Proof of Theorem 5. In order to show that η-mixability of ℓ implies η-stochastic mixability of (ℓ, F_full), we note that the Bayes-optimal predictor f*_B for any ℓ and P* must be in F_full, and so Lemma 6 implies (ℓ, F_full, P*) is η-stochastically mixable for any distribution P*.
Conversely, that η-stochastic mixability of (ℓ, F_full) implies the η-mixability of ℓ follows immediately from Lemma 7.

Example 2 (if F is not full). In this case, we can have either stochastic mixability without ordinary mixability or the converse. Consider a loss function ℓ that is not mixable in the ordinary sense, e.g. ℓ = ℓ_{0/1}, the 0/1-loss [6], and a set F consisting of just a single predictor. Then clearly ℓ is stochastically mixable relative to F. This is, of course, a trivial case. We do not know whether we can have stochastic mixability without ordinary mixability in nontrivial cases, and plan to investigate this in future work. For the converse, we know that it does hold in nontrivial cases: consider the log-loss ℓ_log, which is 1-mixable in the standard sense (Example 1). Let Y = {0, 1} and let the model F be a set of conditional probability mass functions {f_θ | θ ∈ Θ}, where Θ is the set of all classifiers, i.e. all functions X → Y, and f_θ(y | x) := e^{-ℓ_{0/1}(y,θ(x))}/(1 + e^{-1}), where ℓ_{0/1}(y, ŷ) = 1{y ≠ ŷ} is the 0/1-loss. Then log-loss becomes an affine function of 0/1-loss: for each θ ∈ Θ, ℓ_log(Y, f_θ(X)) = ℓ_{0/1}(Y, θ(X)) + C with C = ln(1 + e^{-1}) [14]. Because 0/1-loss is not standard mixable, by Theorem 5, 0/1-loss is not stochastically mixable relative to Θ.
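The affine relation used in this construction is easy to verify numerically. The sketch below is our own check, not part of the paper; the function names are assumptions.

```python
import math

# Numeric check of the affine relation from Example 2: with
# f_theta(y|x) = exp(-l01(y, theta(x))) / (1 + exp(-1)),
# log-loss equals 0/1-loss plus the constant C = ln(1 + e^{-1}).
C = math.log(1 + math.exp(-1))

def l01(y, yhat):
    # 0/1-loss
    return 1.0 if y != yhat else 0.0

def f_theta(y, pred):
    # conditional probability mass that f_theta assigns to outcome y
    return math.exp(-l01(y, pred)) / (1 + math.exp(-1))

# f_theta(. | x) is a proper probability mass function on {0, 1} ...
for pred in (0, 1):
    assert abs(f_theta(0, pred) + f_theta(1, pred) - 1) < 1e-12
# ... and log-loss is an affine function of 0/1-loss, as claimed.
for y in (0, 1):
    for pred in (0, 1):
        assert abs(-math.log(f_theta(y, pred)) - (l01(y, pred) + C)) < 1e-12
print("l_log(y, f_theta(x)) = l_0/1(y, theta(x)) + ln(1 + e^{-1})")
```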
But then we must also have that log-loss is not stochastically mixable relative to F.

5 Stochastic Mixability and the Margin Condition

The excess risk of any f compared to f* is the mean of the excess loss ℓ(Y, f(X)) - ℓ(Y, f*(X)):

  d(f, f*) = E[ℓ(Y, f(X)) - ℓ(Y, f*(X))].

We also define the expected square of the excess loss, which is closely related to its variance:

  V(f, f*) = E[(ℓ(Y, f(X)) - ℓ(Y, f*(X)))²].

Note that, for 0/1-loss, V(f, f*) = P*(f(X) ≠ f*(X)) is the probability that f and f* disagree. The margin condition, introduced by Mammen and Tsybakov [7, 8] for 0/1-loss, is satisfied with constants κ ≥ 1 and c0 > 0 if

  c0 V(f, f*)^κ ≤ d(f, f*)   for all f ∈ F.   (7)

Unlike Mammen and Tsybakov, we do not assume that F necessarily contains the Bayes predictor, which minimizes the risk over all possible predictors. The same generalization has been used in the context of model selection by Arlot and Bartlett [20].

Remark 1. In some practical cases, the margin condition only holds for a subset of the model such that V(f, f*) ≤ ε0 for some ε0 > 0 [8]. In such cases, the discussion below applies to the same subset.

Stochastic mixability, as we have defined it, is directly related to the margin condition for the case κ = 1. In order to relate it to other values of κ, we need a little more flexibility: for given ε ≥ 0 and (ℓ, F, P*), we define

  F_ε = {f*} ∪ {f ∈ F | d(f, f*) ≥ ε},   (8)

which excludes a band of predictors that approximate the best predictor in the model to within excess risk ε.

Theorem 8. Suppose a loss ℓ takes values in [0, V] for 0 < V < ∞. Fix a model F and distribution P*.
Then the margin condition (7) is satisfied if and only if there exists a constant C > 0 such that, for all ε > 0, (ℓ, F_ε, P*) is η-stochastically mixable for η = Cε^{(κ-1)/κ}. In particular, if the margin condition is satisfied with constants κ and c0, we can take C = min{ V²c0^{1/κ}/(e^V - V - 1), 1/V^{(κ-1)/κ} }.

This theorem gives a new interpretation of the margin condition as the rate at which η has to go to 0 when the model F is approximated by η-stochastically mixable models F_ε. By the following corollary, proved in the additional material, stochastic mixability of the whole model F is equivalent to the best case of the margin condition.

Corollary 9. Suppose ℓ takes values in [0, V] for 0 < V < ∞. Then (ℓ, F, P*) is stochastically mixable if and only if there exists a constant c0 > 0 such that the margin condition (7) is satisfied with κ = 1.

6 Connection to Uniform O(log |Fn|/n) Rates

Let ℓ be a bounded loss function. Assume that, at sample size n, an estimator f̂ (statistical learning algorithm) is used based on a finite model Fn, where we allow the size |Fn| to grow with n. Let, for all n, Pn be any set of distributions on X × Y such that for all P* ∈ Pn, the generalized margin condition (7) holds for κ = 1 and uniform constant c0 not depending on n, with model Fn. In the case of 0/1-loss, the results of e.g. Tsybakov [8] suggest that there exist estimators f̂n : (X × Y)^n → Fn that achieve a convergence rate of O(log |Fn|/n), uniformly for all P* ∈ Pn; that is,

  sup_{P*∈Pn} E_{P*}[d(f̂n, f*)] = O(log |Fn|/n).   (9)

This can indeed be proven, for general loss functions, using Theorem 4.2
of Zhang [21], with f̂n set to Zhang's information-risk-minimization estimator (to see this, at sample size n apply Zhang's result with α set to 0 and a prior π that is uniform on Fn, so that -log π(f) = log |Fn| for any f ∈ Fn). By Theorem 8, this means that, for any bounded loss function ℓ, if, for some η > 0 and all n, we have that (ℓ, Fn, P*) is η-stochastically mixable for all P* ∈ Pn, then Zhang's estimator satisfies (9). Hence, for bounded loss functions, stochastic mixability implies a uniform O(log |Fn|/n) rate.

A connection between stochastic mixability and fast rates is also made by Grünwald [14], who introduces some slack in the definition (allowing the number 1 in (3) to be slightly larger) and uses the convexity interpretation from Section 3 to empirically determine the largest possible value for η. His Theorem 2, applied with the slack set to 0, implies an in-probability version of Zhang's result above.

Example 3. We just explained that, if ℓ is stochastically mixable relative to Fn, then uniform O(log |Fn|/n) rates can be achieved. We now illustrate that if this is not the case, then, at least if ℓ is 0/1-loss or log-loss, uniform O(log |Fn|/n) rates cannot be achieved in general. To see this, let Θn be a finite set of classifiers θ : X → Y, Y = {0, 1}, and let ℓ be 0/1-loss. Let, for each n, f̂n : (X × Y)^n → Θn be some arbitrary estimator. It is known from e.g. the work of Vapnik [22] that for every sequence of estimators f̂1, f̂2, ..., there exist a sequence Θ1, Θ2, ..., all finite, and a sequence P*_1, P*_2, ... such that

  E_{P*_n}[d(f̂n, f*)] / (log |Θn|/n) → ∞.

Clearly then, by Zhang's result above, there cannot be an η such that for all n, (ℓ, Θn, P*_n) is η-stochastically mixable.
This establishes that if stochastic mixability does not hold, then uniform rates of O(log |Fn|/n) are not achievable in general for 0/1-loss. By the construction of Example 2, we can modify Θn into a set of corresponding log-loss predictors Fn so that the log-loss ℓ_log becomes an affine function of the 0/1-loss; thus, there still is no η such that for all n, (ℓ_log, Fn, P*_n) is η-stochastically mixable, and the sequence of estimators still cannot achieve a uniform O(log |Fn|/n) rate with log-loss either.

7 Discussion — Related Work

Let us now return to the summary of our contributions, which we provided as items (a)-(g) in §1. We note that slight variations of our formula (3) for stochastic mixability already appear in [14] (but there no connections to ordinary mixability are made) and [15] (but there it is just a tool for the worst-case sequential setting, and no connections to fast rates in statistical learning are made). Equation 3 looks completely different from the margin condition, yet results connecting the two, somewhat similar to (a), albeit very implicitly, already appear in [23] and [24]. Also, the paper by Grünwald [14] contains a connection to the margin condition somewhat similar to Theorem 8, but involving a significantly weaker version of stochastic mixability in which the inequality (3) only holds with some slack. Result (b) is trivial given Definition 2; (c) is a consequence of Theorem 4.2 of [21] when combined with (a) (see Section 6). Result (e) (Theorem 5) is a significant extension of a similar result by Chernov et al. [15]. Yet, our proof techniques and interpretation are completely different from those in [15]. Result (d), Example 3, is a direct consequence of (a). Result (f) (Theorem 2) is completely new, but the proof is partly based on ideas which already appear in [12] in a log-loss/MDL context, and (g) is a consequence of (f).
Finally, Corollary 3 can be seen as analogous to the results of Lee et al. [25], who showed the role of convexity of F for fast rates in the regression setting with squared loss.

Acknowledgments
This work was supported by the ARC and by NICTA, funded by the Australian Government. It was also supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778, and by NWO Rubicon grant 680-50-1112.

References
[1] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures on Machine Learning, volume 3176 of Lecture Notes in Computer Science, pages 169–207. Springer Berlin / Heidelberg, 2004.
[2] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[3] O. Dekel and Y. Singer. Data-driven online to batch conversions. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18 (NIPS), pages 267–274, Cambridge, MA, 2006. MIT Press.
[4] J. Abernethy, A. Agarwal, P. L. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In Proceedings of the 22nd Conference on Learning Theory (COLT), 2009.
[5] Y. Kalnishkan and M. V. Vyugin. The weak aggregating algorithm and weak mixability. Journal of Computer and System Sciences, 74:1228–1244, 2008.
[6] V. Vovk. A game of prediction with expert advice. In Proceedings of the 8th Conference on Learning Theory (COLT), pages 51–60. ACM, 1995.
[7] E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
[8] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
[9] J. L. Doob.
Application of the theory of martingales. In Le Calcul de Probabilités et ses Applications, Colloques Internationaux du Centre National de la Recherche Scientifique, pages 23–27, Paris, 1949.
[10] A. Barron and T. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4):1034–1054, 1991.
[11] T. Zhang. From ε-entropy to KL entropy: analysis of minimum information complexity density estimation. Annals of Statistics, 34(5):2180–2210, 2006.
[12] J. Li. Estimation of Mixture Models. PhD thesis, Yale University, 1999.
[13] B. Kleijn and A. van der Vaart. Misspecification in infinite-dimensional Bayesian statistics. Annals of Statistics, 34(2), 2006.
[14] P. Grünwald. Safe learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity. In Proceedings of the 24th Conference on Learning Theory (COLT), 2011.
[15] A. Chernov, Y. Kalnishkan, F. Zhdanov, and V. Vovk. Supermartingales in prediction with expert advice. Theoretical Computer Science, 411:2647–2669, 2010.
[16] J.-Y. Audibert. Fast learning rates in statistical inference through aggregation. Annals of Statistics, 37(4):1591–1646, 2009.
[17] E. Vernet, R. C. Williamson, and M. D. Reid. Composite multiclass losses. In Advances in Neural Information Processing Systems 24 (NIPS), 2011.
[18] P. Grünwald. The Minimum Description Length Principle. MIT Press, Cambridge, MA, 2007.
[19] T. van Erven, M. Reid, and R. Williamson. Mixability is Bayes risk curvature relative to log loss. In Proceedings of the 24th Conference on Learning Theory (COLT), 2011.
[20] S. Arlot and P. L. Bartlett. Margin-adaptive model selection in statistical learning. Bernoulli, 17(2):687–713, 2011.
[21] T.
Zhang. Information theoretical upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006.
[22] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[23] J.-Y. Audibert. PAC-Bayesian statistical learning theory. PhD thesis, Université Paris VI, 2004.
[24] O. Catoni. PAC-Bayesian Supervised Classification. Lecture Notes–Monograph Series. IMS, 2007.
[25] W. Lee, P. Bartlett, and R. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974–1980, 1998. Correction, Volume 54(9), 4395 (2008).
[26] A. N. Shiryaev. Probability. Springer-Verlag, 1996.
[27] J.-Y. Audibert. A better variance control for PAC-Bayesian classification. Preprint 905, Laboratoire de Probabilités et Modèles Aléatoires, Universités Paris 6 and Paris 7, 2004.