{"title": "Fast learning rates with heavy-tailed losses", "book": "Advances in Neural Information Processing Systems", "page_first": 505, "page_last": 513, "abstract": "We study fast learning rates when the losses are not necessarily bounded and may have a distribution with heavy tails. To enable such analyses, we introduce two new conditions: (i) the envelope function $\\sup_{f \\in \\mathcal{F}}|\\ell \\circ f|$, where $\\ell$ is the loss function and $\\mathcal{F}$ is the hypothesis class, exists and is $L^r$-integrable, and (ii) $\\ell$ satisfies the multi-scale Bernstein's condition on $\\mathcal{F}$. Under these assumptions, we prove that learning rate faster than $O(n^{-1/2})$ can be obtained and, depending on $r$ and the multi-scale Bernstein's powers, can be arbitrarily close to $O(n^{-1})$. We then verify these assumptions and derive fast learning rates for the problem of vector quantization by $k$-means clustering with heavy-tailed distributions. The analyses enable us to obtain novel learning rates that extend and complement existing results in the literature from both theoretical and practical viewpoints.", "full_text": "Fast learning rates with heavy-tailed losses\n\nVu Dinh1 Lam Si Tung Ho2 Duy Nguyen3 Binh T. Nguyen4\n\n1Program in Computational Biology, Fred Hutchinson Cancer Research Center\n\n2Department of Biostatistics, University of California, Los Angeles\n\n3Department of Statistics, University of Wisconsin-Madison\n\n4Department of Computer Science, University of Science, Vietnam\n\nAbstract\n\nWe study fast learning rates when the losses are not necessarily bounded and may\nhave a distribution with heavy tails. To enable such analyses, we introduce two\nnew conditions: (i) the envelope function supf\u2208F |(cid:96) \u25e6 f|, where (cid:96) is the loss func-\ntion and F is the hypothesis class, exists and is Lr-integrable, and (ii) (cid:96) satis\ufb01es\nthe multi-scale Bernstein\u2019s condition on F. 
Under these assumptions, we prove that a learning rate faster than O(n^{−1/2}) can be obtained and, depending on r and the multi-scale Bernstein's powers, can be arbitrarily close to O(n^{−1}). We then verify these assumptions and derive fast learning rates for the problem of vector quantization by k-means clustering with heavy-tailed distributions. The analyses enable us to obtain novel learning rates that extend and complement existing results in the literature from both theoretical and practical viewpoints.

1 Introduction

The rate with which a learning algorithm converges as more data come in plays a central role in machine learning. Recent progress has refined our theoretical understanding of the settings under which fast learning rates are possible, leading to the development of robust algorithms that can automatically adapt to data with hidden structures and achieve faster rates whenever possible. The literature, however, has mainly focused on bounded losses, and little is known about rates of learning in the unbounded case, especially when the distribution of the loss has heavy tails [van Erven et al., 2015].
Most previous work on learning rates for unbounded losses has been done in the context of density estimation [van Erven et al., 2015, Zhang, 2006a,b], where the proofs of fast rates implicitly employ the central condition [Grünwald, 2012] and cannot be extended to address losses with polynomial tails [van Erven et al., 2015]. Efforts to resolve this issue include Brownlees et al. [2015], which proposes using robust mean estimators in place of empirical means, and Cortes et al. [2013], which derives relative deviation and generalization bounds for unbounded losses under the assumption that the Lr-diameter of the hypothesis class is bounded. However, fast learning rates were not obtained in either approach. 
Fast learning rates are derived in Lecué and Mendelson [2013] for sub-Gaussian losses and in Lecué and Mendelson [2012] for hypothesis classes that have sub-exponential envelope functions. To the best of our knowledge, no previous work in the literature has obtained fast learning rates for heavy-tailed losses.
The goal of this research is to study fast learning rates for the empirical risk minimizer when the losses are not necessarily bounded and may have a distribution with heavy tails. We recall that heavy-tailed distributions are probability distributions whose tails are not exponentially bounded: that is, they have heavier tails than the exponential distribution. To enable the analyses of fast rates with heavy-tailed losses, two new assumptions are introduced. First, we assume the existence and the Lr-integrability of the envelope function F = sup_{f∈F} |f| of the hypothesis class F for

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

some value of r ≥ 2, which enables us to use the results of Lederer and van de Geer [2014] on concentration inequalities for suprema of unbounded empirical processes. Second, we assume that the loss function satisfies the multi-scale Bernstein's condition, a generalization of the standard Bernstein's condition to unbounded losses, which enables the derivation of fast learning rates.
Building upon this framework, we prove that if the loss has finite moments up to an order r large enough and if the hypothesis class satisfies the regularity conditions described above, then a learning rate faster than O(n^{−1/2}) can be obtained. Moreover, depending on r and the multi-scale Bernstein's powers, the learning rate can be arbitrarily close to the optimal rate O(n^{−1}). 
We then verify these assumptions and derive fast learning rates for the k-means clustering algorithm: we prove that if the distribution of observations has finite moments up to order r and satisfies Pollard's regularity conditions, then fast learning rates can be derived. The result can be viewed as an extension of the results of Antos et al. [2005] and Levrard [2013] to the case when the source distribution has unbounded support, and it produces a more favorable convergence rate than that of Telgarsky and Dasgupta [2013] under similar settings.

2 Mathematical framework

Let the hypothesis class F be a class of functions defined on some measurable space X with values in R. Let Z = (X, Y) be a random variable taking values in Z = X × Y with probability distribution P, where Y ⊂ R. The loss ℓ : Z × F → R+ is a non-negative function. For a hypothesis f ∈ F and n iid samples {Z1, Z2, . . . , Zn} of Z, we define

P ℓ(f) = E_{Z∼P}[ℓ(Z, f)]   and   Pn ℓ(f) = (1/n) Σ_{i=1}^n ℓ(Zi, f).

For unsupervised learning frameworks, there is no output (Y = ∅) and the loss has the form ℓ(X, f), depending on the application. Nevertheless, P ℓ(f) and Pn ℓ(f) can be defined in a similar manner. We will abuse notation and denote the loss ℓ(Z, f) by ℓ(f). We also denote by f∗ any optimal hypothesis, that is, any function for which P ℓ(f∗) = inf_{f∈F} P ℓ(f) := P∗, and consider the empirical risk minimizer (ERM) estimator f̂n = argmin_{f∈F} Pn ℓ(f).
We recall that heavy-tailed distributions are probability distributions whose tails are not exponentially bounded. Rigorously, the distribution of a random variable V is said to have a heavy right tail if lim_{v→∞} e^{λv} P[V > v] = ∞ for all λ > 0; the definition of a heavy left tail is similar. 
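To make the framework concrete, the following sketch (our own illustration; the Pareto data, squared loss, and hypothesis grid are hypothetical choices, not from the paper) computes the empirical risks Pn ℓ(f) and the ERM f̂n over a finite hypothesis class when the loss is heavy-tailed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavy-tailed observations: a Pareto distribution has polynomial tails,
# so e^{lambda v} P[V > v] -> infinity for every lambda > 0.
n = 10_000
Z = rng.pareto(2.5, size=n) + 1.0  # finite mean and variance, heavy right tail

# Finite hypothesis class: candidate location estimates on a grid,
# with the squared loss l(z, f) = (z - f)^2.
F = np.linspace(0.0, 5.0, 101)

# Empirical risks P_n l(f) for every f in F, and the ERM \hat f_n.
empirical_risk = ((Z[None, :] - F[:, None]) ** 2).mean(axis=1)
f_hat = F[np.argmin(empirical_risk)]

# P l(f) = E(Z - f)^2 is minimized at f* = E[Z]; the ERM should land on
# the grid point closest to the sample mean.
print(f_hat, Z.mean())
```

Despite the heavy tail, the empirical risk here is still a well-defined sample average; the question studied in the paper is how fast such an ERM converges to the population optimum.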
A learning problem is said to have heavy-tailed loss if the distribution of ℓ(f) has heavy tails for some or all hypotheses f ∈ F.
For a pseudo-metric space (G, d) and ε > 0, we denote by N(ε, G, d) the covering number of (G, d); that is, N(ε, G, d) is the minimal number of balls of radius ε needed to cover G. The universal metric entropy of G is defined by H(ε, G) = sup_Q log N(ε, G, L2(Q)), where the supremum is taken over the set of all probability measures Q concentrated on some finite subset of G. For convenience, we define G = ℓ ◦ F, the class of all functions g such that g = ℓ(f) for some f ∈ F, and denote by Fε a finite subset of F such that G is contained in the union of balls of radius ε with centers in Gε = ℓ ◦ Fε. We refer to Fε and Gε as ε-nets of F and G, respectively.
To enable the analyses of fast rates for learning problems with heavy-tailed losses, throughout the paper, we impose the following regularity conditions on F and ℓ.

Assumption 2.1 (Multi-scale Bernstein's condition). Define F∗ = argmin_{f∈F} P ℓ(f). There exist a finite partition F = ∪_{i∈I} Fi, positive constants B = {Bi}_{i∈I}, constants γ = {γi}_{i∈I} in (0, 1], and f∗ = {f∗_i}_{i∈I} ⊂ F∗ such that E[(ℓ(f) − ℓ(f∗_i))²] ≤ Bi (E[ℓ(f) − ℓ(f∗_i)])^{γi} for all i ∈ I and f ∈ Fi.

Assumption 2.2 (Entropy bounds). The hypothesis class F is separable and there exist C ≥ 1, K ≥ 1 such that for all ε ∈ (0, K], the L2(P)-covering numbers and the universal metric entropies of G are bounded as log N(ε, G, L2(P)) ≤ C log(K/ε) and H(ε, G) ≤ C log(K/ε).

Assumption 2.3 (Integrability of the envelope function). There exist W > 0 and r ≥ C + 1 such that (E sup_{g∈G} |g|^r)^{1/r} ≤ W.

The multi-scale Bernstein's condition is more general than the Bernstein's condition. This entails that the multi-scale Bernstein's condition holds whenever the Bernstein's condition does, thus allowing us to consider a larger class of problems. In other words, our results are also valid under the Bernstein's condition. The multi-scale Bernstein's condition is better suited to the study of unbounded losses, since it can separately consider the behaviors of the risk function on microscopic and macroscopic scales, a distinction that can only be observed in an unbounded setting.
We also recall that if G has finite VC-dimension, then Assumption 2.2 is satisfied [Boucheron et al., 2013, Bousquet et al., 2004]. Both the Bernstein's condition and the assumption of a separable parametric hypothesis class are standard assumptions frequently used to obtain faster learning rates in agnostic settings. A review of the Bernstein's condition and its applications is Mendelson [2008], while fast learning rates for bounded losses on hypothesis classes satisfying Assumption 2.2 were previously studied in Mehta and Williamson [2014] under the stochastic mixability condition. Fast learning rates for hypothesis classes with envelope functions were studied in Lecué and Mendelson [2012], but under the much stronger assumption that the envelope function is sub-exponential.
Under these assumptions, we illustrate that fast rates for heavy-tailed losses can be obtained. Throughout the analyses, two recurrent analytical techniques are worth mentioning. 
The first comes from the simple observation that, in the standard derivation of fast learning rates for bounded losses, the boundedness assumption is used in multiple places only to provide reverse-Hölder-type inequalities, in which the L2-norm is upper bounded in terms of the L1-norm. This use of the boundedness assumption can be relaxed to the assumption that the Lr-norm of the loss is bounded, which implies

‖u‖_{L2} ≤ ‖u‖_{L1}^{(r−2)/(2r−2)} ‖u‖_{Lr}^{r/(2r−2)}.

The second technique relies on the following result of Lederer and van de Geer [2014] on concentration inequalities for suprema of unbounded empirical processes.

Lemma 2.1. If {Vk : k ∈ K} is a countable family of non-negative functions such that

E sup_{k∈K} |Vk|^r ≤ M^r,   σ² = sup_{k∈K} E Vk²,   and   V := sup_{k∈K} Pn Vk,

then for all ζ, x > 0, we have

P[V ≥ (1 + ζ)EV + x] ≤ min_{1≤l≤r} (1/x)^l [(64/ζ + ζ + 7)(l/n)^{1−l/r} M + 4σ√(l/n)]^l.

An important consequence of this result is that the failure probability is polynomial in the deviation x. As we will see later, for a given confidence level δ, this makes the constant in the convergence rate a polynomial function of 1/δ instead of log(1/δ) as in sub-exponential cases. Thus, a more careful examination of the order of the failure probability is required to derive any generalization bound with heavy-tailed losses.

3 Fast learning rates with heavy-tailed losses

The derivation of fast learning rates with heavy-tailed losses proceeds as follows. First, we use the assumption of an integrable envelope function to prove a localization-based result that allows us to reduce the analyses from the separable parametric class F to its finite ε-net Fε. 
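As a sanity check on the interpolation inequality above, the following sketch (our own illustration, with an arbitrary heavy-tailed sample standing in for u) verifies ‖u‖_{L2} ≤ ‖u‖_{L1}^{(r−2)/(2r−2)} ‖u‖_{Lr}^{r/(2r−2)} empirically; the inequality is an instance of Hölder interpolation of L2 between L1 and Lr, so it holds for the empirical measure as well:

```python
import numpy as np

rng = np.random.default_rng(1)

# A non-negative, heavy-tailed sample playing the role of |u|; the
# L^p "norms" below are empirical moments (E|u|^p)^{1/p}.
u = rng.pareto(8.0, size=100_000)  # finite moments up to order < 8

def lp_norm(u, p):
    return np.mean(u ** p) ** (1.0 / p)

for r in [3, 4, 6]:
    lhs = lp_norm(u, 2)
    # The exponents theta = (r-2)/(2r-2) and 1 - theta = r/(2r-2) satisfy
    # theta/1 + (1 - theta)/r = 1/2, the Hoelder interpolation relation.
    rhs = lp_norm(u, 1) ** ((r - 2) / (2 * r - 2)) * lp_norm(u, r) ** (r / (2 * r - 2))
    assert lhs <= rhs + 1e-12
```

The point exploited in the paper is that the right-hand side stays finite under Assumption 2.3, so L2-quantities can still be controlled by L1-quantities without boundedness.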
The multi-scale Bernstein's condition is then employed to derive a fast-rate inequality that helps distinguish the optimal hypothesis from alternative hypotheses in Fε. The two results are then combined to obtain fast learning rates.

3.1 Preliminaries

Throughout this section, let Gε be an ε-net for G in the L2(P)-norm, with ε = n^{−β} for some 1 ≥ β > 0. Denote by π : G → Gε an L2(P)-metric projection from G to Gε. For any g0 ∈ Gε, we denote K(g0) = {|g0 − g| : g ∈ π^{−1}(g0)}. We have

(i) the constant zero function is an element of K(g0);
(ii) E[sup_{u∈K(g0)} |u|^r] ≤ (2W)^r and sup_{u∈K(g0)} ‖u‖_{L2(P)} ≤ ε;
(iii) N(t, K(g0), L2(P)) ≤ (K/t)^C for all t > 0.

Given a sample Z = (Z1, . . . , Zn), we denote by KZ the projection of K(g0) onto the sample Z and by D(KZ) half of the radius of (KZ, ‖·‖2), that is, D(KZ) = sup_{u,v∈KZ} ‖u − v‖/4. We have the following preliminary lemmas, for which the proofs are provided in the Appendix.

Lemma 3.1. ED(KZ) ≤ (ε + E sup_{u∈K(g0)} (Pn − P)u)^{(r−2)/(2(r−1))} (2W)^{r/(2(r−1))} + 2/√n.

Lemma 3.2. Given 0 < ν < 1, there exist constants C1, C2 > 0 depending only on ν such that for all x > 0, if x ≤ ax^ν + b then x ≤ C1 a^{1/(1−ν)} + C2 b.

Lemma 3.3. Define

A(l, r, β, C, α) = max{l²/r − (1 − β)l + βC, [β(1 − α/2) − 1/2] l + βC}.   (3.1)

Assuming that r ≥ 4C and α ≤ 1, if we choose l = r(1 − β)/2 and

0 < β < (1 − 2√(C/r))/(2 − α),   (3.2)

then 1 ≤ l ≤ r and A(l, r, β, C, α) < 0. This also holds if α ≥ 1 and 0 < β < 1 − 2√(C/r).

3.2 Local analysis of the empirical loss

The preliminary lemmas enable us to locally bound E sup_{u∈K(g0)} (Pn − P)u as follows:

Lemma 3.4. If β < (r − 1)/r, there exists c1 > 0 such that E sup_{u∈K(g0)} (Pn − P)u ≤ c1 n^{−β} for all n.

Proof. Without loss of generality, we assume that K(g0) is countable. The arguments to extend the bound from countable classes to separable classes are standard (see, for example, Lemma 12 of Mehta and Williamson [2014]). Denote Z̄ = sup_{u∈K(g0)} (Pn − P)u, let ε = 1/n^β, and let R = (R1, R2, . . . , Rn) be iid Rademacher random variables. Using standard results about symmetrization and chaining of Rademacher processes (see, for example, Corollary 13.2 in Boucheron et al. [2013]), we have

n E sup_{u∈K(g0)} (Pn − P)u ≤ 2E[E_R sup_{u∈K(g0)} Σ_{j=1}^n Rj u(Xj)] ≤ 24E ∫_0^{D(KX)∨ε} √(log N(t, KX, ‖·‖2)) dt ≤ 24E ∫_0^{D(KX)∨ε} √(H(t/√n, K(g0))) dt,

where E_R denotes the expected value with respect to the random variables R1, R2, . . . , Rn. By Assumption 2.2, we deduce that

√n E Z̄ ≤ C0(K, n, σ, C)(ε + ED(KX)),   where C0 = O(√log n).

If we define

x = ε + E Z̄,   b = C0 ε/n = O(√log n/n^{β+1}),   a = C0 n^{−1/2} (2W)^{r/(2(r−1))}/2 = O(√log n/√n),

then by Lemma 3.1 we have x ≤ a x^{(r−2)/(2r−2)} + b + ε. Using Lemma 3.2, we have

x ≤ C1 a^{2(r−1)/r} + C2(b + ε) ≤ C3 n^{−β},

which completes the proof.

Lemma 3.5. 
Assuming that r ≥ 4C, if β < 1 − 2√(C/r), there exist c1, c2 > 0 such that for all n, all g0 ∈ Gε and all δ > 0,

sup_{u∈K(g0)} Pn u ≤ (9c1 + (c2/δ)^{2/[r(1−β)]}) n^{−β}

with probability at least 1 − δ.

Proof. Denote Z = sup_{u∈K(g0)} Pn u and Z̄ = sup_{u∈K(g0)} (Pn − P)u. We have

Z = sup_{u∈K(g0)} Pn u ≤ Z̄ + sup_{u∈K(g0)} P u ≤ Z̄ + sup_{u∈K(g0)} ‖u‖_{L2(P)} = Z̄ + ε.

Applying Lemma 2.1 with ζ = 8 and x = y/n^β to Z̄, using the facts that

E[sup_{u∈K(g0)} |u|^r] ≤ (2W)^r   and   σ = sup_{u∈K(g0)} √(E[u(X)²]) ≤ ε = 1/n^β,

we have

P[Z̄ ≥ 9E Z̄ + y/n^β] ≤ min_{1≤l≤r} y^{−l} (46 (l/n)^{1−l/r} n^β W + 4√(l/n))^l := φ(y, n).

To provide a union bound for all g0 ∈ Gε, we want the total failure probability φ(y, n)(n^β K)^C ≤ δ. This failure probability, as a function of n, is of order A(l, r, β, C, α) (as defined in Lemma 3.3) with α = 2. By choosing l = r(1 − β)/2 and β < 1 − 2√(C/r), we deduce that there exist c2, c3 > 0 such that φ(y, n)(n^β K)^C ≤ c2/(n^{c3} y^l) ≤ c2/y^{r(1−β)/2}. The proof is completed by choosing y = (c2/δ)^{2/[r(1−β)]} and using the fact that E Z̄ ≤ c1/n^β (note that 1 − 2√(C/r) ≤ (r − 1)/r and we can apply Lemma 3.4 to get the bound).

A direct consequence of this lemma is the following localization-based result.

Theorem 3.1 (Local analysis). Under Assumptions 2.1, 2.2 and 2.3, let Gε be a minimal ε-net for G in the L2(P)-norm, with ε = n^{−β} where β < 1 − 2√(C/r). Then there exist c1, c2 > 0 such that for all δ > 0,

Pn g ≥ Pn(π(g)) − (9c1 + (c2/δ)^{2/[r(1−β)]}) n^{−β}   for all g ∈ G

with probability at least 1 − δ.

3.3 Fast learning rates with heavy-tailed losses

Theorem 3.2. Given a0, δ > 0, under the multi-scale (B, γ, I)-Bernstein's condition and the assumption that r ≥ 4C, consider

0 < β < (1 − 2√(C/r))/(2 − γi)   for all i ∈ I.   (3.3)

Then there exists N_{a0,δ,r,B,γ} > 0 such that for all f ∈ Fε and n ≥ N_{a0,δ,r,B,γ},

P ℓ(f) − P∗ ≥ a0/n^β   implies   ∃f∗ ∈ F∗ : Pn ℓ(f) − Pn ℓ(f∗) ≥ a0/(4n^β)

with probability at least 1 − δ.

Proof. Define a = [P ℓ(f) − P∗] n^β. Assuming that f ∈ Fi, applying Lemma 2.1 with ζ = 1/2 and x = a/(4n^β) to a single hypothesis f, we have

P[Pn ℓ(f) − Pn ℓ(f∗_i) ≤ (P ℓ(f) − P ℓ(f∗_i))/4] ≤ h(a, n, i),

where

h(a, n, i) = min_{1≤l≤r} (4/a)^l (50 n^β (l/n)^{1−l/r} W + 4 n^β Bi^{1/2} a^{γi/2} n^{−βγi/2} √(l/n))^l,

using the fact that σ² = E[(ℓ(f) − ℓ(f∗_i))²] ≤ Bi [E(ℓ(f) − ℓ(f∗_i))]^{γi} = Bi a^{γi}/n^{βγi} if f ∈ Fi. Since γi ≤ 1, h(a, n, i) is a non-increasing function in a. Thus,

P[Pn ℓ(f) − Pn ℓ(f∗_i) ≤ (P ℓ(f) − P ℓ(f∗_i))/4] ≤ h(a0, n, i).

To provide a union bound for all f ∈ Fε such that P ℓ(f) − P ℓ(f∗_i) ≥ a0/n^β, we want the total failure probability to be small. 
This is guaranteed if h(a0, n, i)(n^β K)^C ≤ δ. This failure probability, as a function of n, is of order A(l, r, β, C, γi) as defined in equation (3.1). By choosing r, l as in Lemma 3.3 and β as in equation (3.3), we have 1 ≤ l ≤ r and A(l, r, β, C, γi) < 0 for all i. Thus, there exist c4, c5, c6 > 0 such that

h(a0, n, i)(n^β K)^C ≤ c6 a0^{−c5(1−γi/2)} n^{−c4}   for all n, i.

Hence, when n ≥ N_{a0,δ,r,B,γ} = (c6 a0^{−c5(1−γ̃/2)}/δ)^{1/c4}, where γ̃ = max{γ} 1_{a0≥1} + min{γ} 1_{a0<1}, we have: for all f ∈ Fε, P ℓ(f) − P∗ ≥ a0/n^β implies ∃f∗ ∈ F∗, Pn ℓ(f) − Pn ℓ(f∗) ≥ a0/(4n^β) with probability at least 1 − δ.

Theorem 3.3. Under Assumptions 2.1, 2.2 and 2.3, consider β as in equation (3.3) and c1, c2 as in the previous theorems. For all δ > 0, there exists N_{δ,r,B,γ} such that if n ≥ N_{δ,r,B,γ}, then

P ℓ(f̂z) ≤ P ℓ(f∗) + (36c1 + 1 + 4(2c2/δ)^{2/[r(1−β)]}) n^{−β}

with probability at least 1 − δ.

Proof of Theorem 3.3. Let Fε be an ε-net of F with ε = 1/n^β such that f∗ ∈ Fε. We denote the projection of f̂z to Fε by f1 = π(f̂z). For a given δ > 0, define

A1 = {∃f ∈ F : Pn ℓ(f) ≤ Pn ℓ(π(f)) − (9c1 + (c3/δ)^{2/[r(1−β)]}) n^{−β}},
A2 = {∃f ∈ Fε : Pn ℓ(π(f)) − Pn ℓ(f∗) ≤ a0/(4n^β) and P ℓ(π(f)) − P ℓ(f∗) ≥ a0/n^β},

where c1, c2 are defined as in the previous theorems, a0/4 = 9c1 + (c3/δ)^{2/[r(1−β)]} and n ≥ N_{a0,δ,r,γ}. We deduce that A1 and A2 each happen with probability at most δ. On the other hand, under the event that neither A1 nor A2 happens, we have

Pn ℓ(f1) ≤ Pn ℓ(f̂z) + (9c1 + (c3/δ)^{2/[r(1−β)]}) n^{−β} ≤ Pn ℓ(f∗) + a0/(4n^β).

By the definition of Fε, we have P ℓ(f̂z) ≤ P ℓ(f1) + ε ≤ P ℓ(f∗) + (a0 + 1)/n^β.

3.4 Verifying the multi-scale Bernstein's condition

In practice, the most difficult condition to verify for fast learning rates is the multi-scale Bernstein's condition. We derive in this section some approaches to verify it. We first extend a result of Mendelson [2008] to prove that the (standard) Bernstein's condition is automatically satisfied for functions that are relatively far away from f∗ under the integrability condition on the envelope function (proof in the Appendix). We recall that R(f) = Eℓ(f) is referred to as the risk function.

Lemma 3.6. Under Assumption 2.3, define M = W^{r/(r−2)} and γ = (r − 2)/(r − 1). Then, if α > M and R(f) ≥ α/(α − M) R(f∗), then E(ℓ(f) − ℓ(f∗))² ≤ 2α^γ E(ℓ(f) − ℓ(f∗))^γ.

This allows us to derive the following result, for which the proof is provided in the Appendix.

Lemma 3.7. 
If F is a subset of a vector space with metric d, the risk function R(f) = Eℓ(f) has a unique minimizer f∗ on F in the interior of F, and

(i) there exists L > 0 such that E(ℓ(f) − ℓ(g))² ≤ L d(f, g)² for all f, g ∈ F;
(ii) there exist m ≥ 2, c > 0 and a neighborhood U around f∗ such that R(f) − R(f∗) ≥ c d(f, f∗)^m for all f ∈ U;

then the multi-scale Bernstein's condition holds for γ = ((r − 2)/(r − 1), 2/m).

Corollary 3.1. Suppose that (F, d) is a pseudo-metric space, ℓ satisfies condition (i) in Lemma 3.7 and the risk function is strongly convex with respect to d; then the Bernstein's condition holds with γ = 1.

Remark 3.1. If the risk function is analytic at f∗, then condition (ii) in Lemma 3.7 holds. Similarly, if the risk function is continuously differentiable up to order 2 and the Hessian of R(f) is positive definite at f∗, then condition (ii) is valid with m = 2.

Corollary 3.2. If the risk function R(f) = Eℓ(f) has a finite number of global minimizers f1, f2, . . . , fk, ℓ satisfies condition (i) in Lemma 3.7, and there exist mi ≥ 2, ci > 0 and neighborhoods Ui around fi such that R(f) − R(fi) ≥ ci d(f, fi)^{mi} for all f ∈ Ui, i = 1, . . . , k, then the multi-scale Bernstein's condition holds for γ = ((r − 2)/(r − 1), 2/m1, . . . , 2/mk).

3.5 Comparison to related work

Theorem 3.3 dictates that, under our settings, the problem of learning with heavy-tailed losses can attain convergence rates up to order

O(n^{−(1−2√(C/r))/(2−min{γ})}),   (3.4)

where γ is the multi-scale Bernstein's order and r is the degree of integrability of the loss. 
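For intuition about the rate in (3.4), a small computation (our own illustration, not part of the paper) evaluates the exponent (1 − 2√(C/r))/(2 − min{γ}) for a few values of r, showing how it approaches the optimal exponent 1 as more moments of the loss become available:

```python
import math

def rate_exponent(C, r, gamma_min):
    """Best exponent beta in the O(n^{-beta}) rate of (3.4)."""
    return (1 - 2 * math.sqrt(C / r)) / (2 - gamma_min)

# With gamma_min = 1 (as under the standard Bernstein's condition with
# gamma = 1), the exponent approaches 1 as the integrability order r grows.
for r in [16, 100, 10_000, 1_000_000]:
    print(r, round(rate_exponent(C=1, r=r, gamma_min=1.0), 4))
# r = 16 already beats the slow rate 1/2; r -> infinity recovers beta -> 1.
```

A smaller min{γ} (a weaker multi-scale Bernstein's condition) uniformly shrinks the exponent, which is the trade-off the theorem quantifies.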
We recall that a convergence rate of O(n^{−1/(2−γ)}) is obtained in Mehta and Williamson [2014] under the same setting but for bounded losses. (The analysis there was done under the γ-weakly stochastic mixability condition, which is equivalent to the standard γ-Bernstein's condition for bounded losses [van Erven et al., 2015].) We note that if the loss is bounded, then r = ∞ and (3.4) reduces to the convergence rate obtained in Mehta and Williamson [2014].
Fast learning rates for unbounded losses were previously derived in Lecué and Mendelson [2013] for sub-Gaussian losses and in Lecué and Mendelson [2012] for hypothesis classes that have sub-exponential envelope functions. In Lecué and Mendelson [2013], the Bernstein's condition is not directly imposed, but is replaced by condition (ii) of Lemma 3.7 with m = 2 on the whole hypothesis class, while the assumption of a sub-Gaussian hypothesis class validates condition (i). This implies the standard Bernstein's condition with γ = 1 and makes the convergence rate O(n^{−1}) consistent with our result (note that for sub-Gaussian losses, r can be chosen arbitrarily large). The analysis of Lecué and Mendelson [2012] concerns non-exact oracle inequalities (rather than the sharp oracle inequalities we investigate in this paper) and cannot be directly compared with our results.

4 Application: k-means clustering with heavy-tailed source distributions

k-means clustering is a method of vector quantization aiming to partition n observations into k ≥ 2 clusters in which each observation belongs to the cluster with the nearest mean. Formally, let X be a random vector taking values in R^d with distribution P. 
Given a codebook (a set of k cluster centers) C = {yi} ∈ (R^d)^k, the distortion (loss) on an instance x is defined as ℓ(C, x) = min_{yi∈C} ‖x − yi‖², and the k-means clustering method aims at finding a minimizer C∗ of R(C) = P ℓ(C) via minimizing the empirical distortion Pn ℓ(C).
The rate of convergence of k-means clustering has drawn considerable attention in the statistics and machine learning literature [Pollard, 1982, Bartlett et al., 1998, Linder et al., 1994, Ben-David, 2007]. Fast learning rates for k-means clustering (O(1/n)) have been derived by Antos et al. [2005] in the case when the source distribution is supported on a finite set of points, and by Levrard [2013] under the assumptions that the source distribution has bounded support and satisfies the so-called Pollard's regularity condition, which dictates that P has a continuous density with respect to the Lebesgue measure and that the Hessian matrix of the mapping C → R(C) is positive definite at C∗. Little is known about the finite-sample performance of empirically designed quantizers under possibly heavy-tailed distributions. In Telgarsky and Dasgupta [2013], a convergence rate of O(n^{−1/2+2/r}) is derived, where r is the number of moments of X that are assumed to be finite. Brownlees et al. [2015] use robust mean estimators to replace empirical means and derive a convergence rate of O(n^{−1/2}) assuming only that the variance of X is finite.
The results from the previous sections enable us to prove that, with a proper setting, the convergence rate of k-means clustering for heavy-tailed source distributions can be arbitrarily close to O(1/n). Following the framework of Brownlees et al. 
[2015], we consider

G = {ℓ(C, x) = min_{yi∈C} ‖x − yi‖² : C ∈ F = (−ρ, ρ)^{d×k}}

for some ρ > 0, with the regular Euclidean metric. We let C∗, Ĉn be defined as in the previous sections.

Theorem 4.1. If X has finite moments up to order r ≥ 4k(d + 1), P has a continuous density with respect to the Lebesgue measure, the risk function has a finite number of global minimizers, and the Hessian matrix of C → R(C) is positive definite at every optimal C∗ in the interior of F, then for all β that satisfy

0 < β < ((r − 1)/r)(1 − 2√(k(d + 1)/r)),

there exist c1, c2 > 0 such that for all δ > 0, with probability at least 1 − δ, we have

R(Ĉn) − R(C∗) ≤ (c1 + 4(c2/δ)^{2/r}) n^{−β}.

Moreover, when r → ∞, β can be chosen arbitrarily close to 1.

Proof. We have

(E sup_{C∈F} ℓ(C, X)^r)^{1/r} ≤ ((1/2^r) E[‖X‖² + ρ²]^r)^{1/r} ≤ ((1/2) E‖X‖^{2r} + (1/2) ρ^{2r})^{1/r} ≤ W < ∞,

while standard results about the VC-dimension of the k-means clustering hypothesis class guarantee that C ≤ k(d + 1) [Linder et al., 1994]. On the other hand, we can verify that

E[ℓ(C, X) − ℓ(C′, X)]² ≤ Lρ ‖C − C′‖²_2,

which validates condition (i) in Lemma 3.7. The fact that the Hessian matrix of C → R(C) is positive definite at C∗ implies that R(C) − R(C∗) ≥ c‖C − C∗‖² for some c > 0 and all C in a neighborhood U around any optimal codebook C∗. Thus, Corollary 3.2 confirms the multi-scale Bernstein's condition with γ = ((r − 2)/(r − 1), 1, . . . , 1). 
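To make the objects in Theorem 4.1 concrete, the following sketch (hypothetical data and parameters of our own choosing, not the paper's experiment) evaluates the distortion ℓ(C, x) = min_{yi∈C} ‖x − yi‖² and the empirical distortion Pn ℓ(C) on a heavy-tailed sample, and checks that one Lloyd-style update of the codebook does not increase the empirical distortion:

```python
import numpy as np

rng = np.random.default_rng(2)

# Heavy-tailed source in R^d: a Student-t sample with 5 degrees of
# freedom has finite moments only up to order r < 5.
d, k, n = 2, 3, 5_000
X = rng.standard_t(df=5, size=(n, d))

def distortion(C, X):
    """l(C, x) = min_i ||x - y_i||^2 for each row x of X."""
    sq = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (n, k)
    return sq.min(axis=1)

def empirical_distortion(C, X):
    """P_n l(C): the empirical risk minimized by k-means."""
    return distortion(C, X).mean()

# Start from k distinct data points as centers; one Lloyd step
# (reassign, then move centers to cluster means) cannot increase P_n l(C).
C0 = X[rng.choice(n, size=k, replace=False)]
labels = ((X[:, None, :] - C0[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
C1 = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
assert empirical_distortion(C1, X) <= empirical_distortion(C0, X) + 1e-9
```

The theorem concerns the empirical minimizer Ĉn of this objective; the sketch only shows that the objective is well-defined and improvable despite the unbounded support of X.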
The inequality is then obtained from Theorem 3.3.

5 Discussion and future work

We have shown that fast learning rates for heavy-tailed losses can be obtained for hypothesis classes with an integrable envelope when the loss satisfies the multi-scale Bernstein's condition. We then verify those conditions and obtain new convergence rates for k-means clustering with heavy-tailed losses. The analyses extend and complement existing results in the literature from both theoretical and practical points of view. We also introduce a new fast-rate assumption, the multi-scale Bernstein's condition, and provide a clear path to verifying it in practice. We believe that the multi-scale Bernstein's condition is the proper assumption for studying fast rates with unbounded losses, because of its ability to separate the behaviors of the risk function on microscopic and macroscopic scales, a distinction that can only be observed in an unbounded setting.
There are several avenues for improvement. First, we would like to consider hypothesis classes with polynomial entropy bounds. Similarly, the condition of independent and identically distributed observations can be replaced with mixing conditions [Steinwart and Christmann, 2009, Hang and Steinwart, 2014, Dinh et al., 2015]. While the condition of an integrable envelope is an improvement over the condition of a sub-exponential envelope previously investigated in the literature, it would be interesting to see whether the rates are retained under weaker conditions, for example, the assumption that the Lr-diameter of the hypothesis class is bounded [Cortes et al., 2013]. Finally, the recent work of Brownlees et al. 
[2015], Hsu and Sabato [2016] on robust estimators as alternatives to ERM for studying heavy-tailed losses has yielded more favorable learning rates under weaker conditions, and we would like to extend the results in this paper to such estimators.

Acknowledgement

Vu Dinh was supported by DMS-1223057 and CISE-1564137 from the National Science Foundation and U54GM111274 from the National Institutes of Health. Lam Si Tung Ho was supported by NSF grant IIS 1251151.

References

András Antos, László Györfi, and András György. Individual convergence rates in empirical vector quantizer design. IEEE Transactions on Information Theory, 51(11):4013–4022, 2005.

Peter L Bartlett, Tamás Linder, and Gábor Lugosi. The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44(5):1802–1813, 1998.

Shai Ben-David. A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering. Machine Learning, 66(2):243–257, 2007.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. OUP Oxford, 2013.

Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. In Advanced lectures on machine learning, pages 169–207. Springer, 2004.

Christian Brownlees, Emilien Joly, and Gábor Lugosi. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536, 2015.

Corinna Cortes, Spencer Greenberg, and Mehryar Mohri. Relative deviation learning bounds and generalization with unbounded loss functions. arXiv:1310.5796, 2013.

Vu Dinh, Lam Si Tung Ho, Nguyen Viet Cuong, Duy Nguyen, and Binh T Nguyen. Learning from non-iid data: Fast rates for the one-vs-all multiclass plug-in classifiers.
In Theory and Applications of Models of Computation, pages 375–387. Springer, 2015.

Peter Grünwald. The safe Bayesian: learning the learning rate via the mixability gap. In Proceedings of the 23rd international conference on Algorithmic Learning Theory, pages 169–183. Springer-Verlag, 2012.

Hanyuan Hang and Ingo Steinwart. Fast learning from α-mixing observations. Journal of Multivariate Analysis, 127:184–199, 2014.

Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17(18):1–40, 2016.

Guillaume Lecué and Shahar Mendelson. General nonexact oracle inequalities for classes with a sub-exponential envelope. The Annals of Statistics, 40(2):832–860, 2012.

Guillaume Lecué and Shahar Mendelson. Learning sub-Gaussian classes: Upper and minimax bounds. arXiv:1305.4825, 2013.

Johannes Lederer and Sara van de Geer. New concentration inequalities for suprema of empirical processes. Bernoulli, 20(4):2020–2038, 2014.

Clément Levrard. Fast rates for empirical vector quantization. Electronic Journal of Statistics, 7:1716–1746, 2013.

Tamás Linder, Gábor Lugosi, and Kenneth Zeger. Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding. IEEE Transactions on Information Theory, 40(6):1728–1740, 1994.

Nishant A Mehta and Robert C Williamson. From stochastic mixability to fast rates. In Advances in Neural Information Processing Systems, pages 1197–1205, 2014.

Shahar Mendelson. Obtaining fast error rates in nonconvex situations. Journal of Complexity, 24(3):380–397, 2008.

David Pollard. A central limit theorem for k-means clustering. The Annals of Probability, pages 919–926, 1982.

Ingo Steinwart and Andreas Christmann.
Fast learning from non-iid observations. In Advances in Neural Information Processing Systems, pages 1768–1776, 2009.

Matus J Telgarsky and Sanjoy Dasgupta. Moment-based uniform deviation bounds for k-means and friends. In Advances in Neural Information Processing Systems, pages 2940–2948, 2013.

Tim van Erven, Peter D Grünwald, Nishant A Mehta, Mark D Reid, and Robert C Williamson. Fast rates in statistical and online learning. Journal of Machine Learning Research, 16:1793–1861, 2015.

Tong Zhang. From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics, 34(5):2180–2210, 2006a.

Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006b.