{"title": "Tight Sample Complexity of Large-Margin Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2038, "page_last": 2046, "abstract": "We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the gamma-adapted-dimension, which is a simple function of the spectrum of a distribution's covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the gamma-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions.", "full_text": "Tight Sample Complexity of Large-Margin Learning\n\n1 School of Computer Science & Engineering, The Hebrew University, Jerusalem 91904, Israel\n\nSivan Sabato1 Nathan Srebro2 Naftali Tishby1\n\n2 Toyota Technological Institute at Chicago, Chicago, IL 60637, USA\n{sivan sabato,tishby}@cs.huji.ac.il, nati@ttic.edu\n\nAbstract\n\nWe obtain a tight distribution-speci\ufb01c characterization of the sample complex-\nity of large-margin classi\ufb01cation with L2 regularization: We introduce the\n\u03b3-adapted-dimension, which is a simple function of the spectrum of a distribu-\ntion\u2019s covariance matrix, and show distribution-speci\ufb01c upper and lower bounds\non the sample complexity, both governed by the \u03b3-adapted-dimension of the\nsource distribution. We conclude that this new quantity tightly characterizes the\ntrue sample complexity of large-margin classi\ufb01cation. The bounds hold for a rich\nfamily of sub-Gaussian distributions.\n\n1\n\nIntroduction\n\nIn this paper we tackle the problem of obtaining a tight characterization of the sample complexity\nwhich a particular learning rule requires, in order to learn a particular source distribution. 
Specifically, we obtain a tight characterization of the sample complexity required for large (Euclidean) margin learning to obtain low error for a distribution D(X, Y), for X ∈ Rd, Y ∈ {±1}.

Most learning theory work focuses on upper-bounding the sample complexity. That is, on providing a bound m(D, ε) and proving that when using some specific learning rule, if the sample size is at least m(D, ε), an excess error of at most ε (in expectation or with high probability) can be ensured. For instance, for large-margin classification we know that if PD[‖X‖ ≤ B] = 1, then m(D, ε) can be set to O(B²/(γ²ε²)) to get true error of no more than ℓ*γ + ε, where ℓ*γ = min_{‖w‖≤1} PD(Y⟨w, X⟩ ≤ γ) is the optimal margin error at margin γ.

Such upper bounds can be useful for understanding positive aspects of a learning rule. But it is difficult to understand deficiencies of a learning rule, or to compare between different rules, based on upper bounds alone. After all, it is possible, and often the case, that the true sample complexity, i.e. the actual number of samples required to get low error, is much lower than the bound.

Of course, some sample complexity upper bounds are known to be "tight" or to have an almost-matching lower bound. This usually means that the bound is tight as a worst-case upper bound for a specific class of distributions (e.g. all those with PD[‖X‖ ≤ B] = 1). That is, there exists some source distribution for which the bound is tight. In other words, the bound concerns some quantity of the distribution (e.g. the radius of the support), and is the lowest possible bound in terms of this quantity. But this is not to say that for any specific distribution this quantity tightly characterizes the sample complexity. 
For instance, we know that the sample complexity can be much smaller than the radius of the support of X, if the average norm √E[‖X‖²] is small. However, E[‖X‖²] is also not a precise characterization of the sample complexity, for instance in low dimensions.

The goal of this paper is to identify a simple quantity determined by the distribution that does precisely characterize the sample complexity. That is, such that the actual sample complexity for the learning rule on this specific distribution is governed, up to polylogarithmic factors, by this quantity.

In particular, we present the γ-adapted-dimension kγ(D). This measure refines both the dimension and the average norm of X, and it can be easily calculated from the covariance matrix of X. We show that for a rich family of "light tailed" distributions (specifically, sub-Gaussian distributions with independent directions – see Section 2), the number of samples required for learning by minimizing the γ-margin-violations is both lower-bounded and upper-bounded by Θ̃(kγ). More precisely, we show that the sample complexity m(ε, γ, D) required for achieving excess error of no more than ε can be bounded from above and from below by:

Ω(kγ(D)) ≤ m(ε, γ, D) ≤ Õ(kγ(D)/ε²).

As can be seen in this bound, we are not concerned with tightly characterizing the dependence of the sample complexity on the desired error [as done e.g. in 1], nor with obtaining tight bounds for very small error levels. In fact, our results can be interpreted as studying the sample complexity needed to obtain error well below random, but bounded away from zero. This is in contrast to classical statistics asymptotics, which are also typically tight, but are valid only for very small ε. 
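To make the quantity concrete: Section 3 below shows that kγ can be read off the eigenvalues λ1 ≥ · · · ≥ λd of the covariance matrix of X, as kγ = min{k | Σ_{i>k} λi ≤ γ²k}. A minimal NumPy sketch of this computation (the function name is ours, not from the paper):

```python
import numpy as np

def gamma_adapted_dimension(eigvals, gamma):
    """k_gamma = min{k : sum of eigenvalues after the k largest <= gamma^2 * k}.

    eigvals: eigenvalues of the covariance matrix of X, in any order.
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending
    tails = lam.sum() - np.cumsum(lam)                     # tails[k-1] = sum_{i>k} lam_i
    for k in range(1, len(lam) + 1):
        if tails[k - 1] <= gamma**2 * k:
            return k
    return len(lam)

# Isotropic case, lambda_i = 1: k_gamma is about d / (1 + gamma^2).
print(gamma_adapted_dimension([1.0] * 10, 1.0))              # -> 5
# One dominant direction: a single high-variance coordinate suffices.
print(gamma_adapted_dimension([100.0, 1.0, 1.0, 1.0], 2.0))  # -> 1
```

The second call mirrors the spirit of the paper's own example in Section 3, where one coordinate carries almost all the variance and k1 = 1 even though the dimension and the average squared norm are both large.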
As was recently shown by Liang and Srebro [2], the quantities on which the sample complexity depends for very small ε (in the classical statistics asymptotic regime) can be very different from those for moderate error rates, which are more relevant for machine learning.

Our tight characterization, and in particular the distribution-specific lower bound on the sample complexity that we establish, can be used to compare large-margin (L2 regularized) learning to other learning rules. In Section 7 we provide two such examples: we use our lower bound to rigorously establish a sample complexity gap between L1 and L2 regularization previously studied in [3], and to show a large gap between discriminative and generative learning on a Gaussian-mixture distribution.

In this paper we focus only on large L2 margin classification. But in order to obtain the distribution-specific lower bound, we develop novel tools that we believe can be useful for obtaining lower bounds also for other learning rules.

Related work

Most work on "sample complexity lower bounds" is directed at proving that under some set of assumptions, there exists a source distribution for which one needs at least a certain number of examples to learn with required error and confidence [4, 5, 6]. This type of lower bound does not, however, indicate much about the sample complexity of other distributions under the same set of assumptions.

As for distribution-specific lower bounds, the classical analysis of Vapnik [7, Theorem 16.6] provides not only sufficient but also necessary conditions for the learnability of a hypothesis class with respect to a specific distribution. The essential condition is that the ε-entropy of the hypothesis class with respect to the distribution be sub-linear in the limit of an infinite sample size. 
In some sense, this criterion can be seen as providing a "lower bound" on learnability for a specific distribution. However, we are interested in finite-sample convergence rates, and would like those to depend on simple properties of the distribution. The asymptotic arguments involved in Vapnik's general learnability claim do not lend themselves easily to such analysis.

Benedek and Itai [8] show that if the distribution is known to the learner, a specific hypothesis class is learnable if and only if there is a finite ε-cover of this hypothesis class with respect to the distribution. Ben-David et al. [9] consider a similar setting, and prove sample complexity lower bounds for learning with any data distribution, for some binary hypothesis classes on the real line. In both of these works, the lower bounds hold for any algorithm, but only for a worst-case target hypothesis. Vayatis and Azencott [10] provide distribution-specific sample complexity upper bounds for hypothesis classes with a limited VC-dimension, as a function of how balanced the hypotheses are with respect to the considered distributions. These bounds are not tight for all distributions, thus this work also does not provide true distribution-specific sample complexity.

2 Problem setting and definitions

Let D be a distribution over Rd × {±1}. DX will denote the restriction of D to Rd. We are interested in linear separators, parametrized by unit-norm vectors in B^d_1 ≜ {w ∈ Rd | ‖w‖₂ ≤ 1}.

For a predictor w denote its misclassification error with respect to distribution D by ℓ(w, D) ≜ P_{(X,Y)∼D}[Y⟨w, X⟩ ≤ 0]. For γ > 0, denote the γ-margin loss of w with respect to D by ℓγ(w, D) ≜ P_{(X,Y)∼D}[Y⟨w, X⟩ ≤ γ]. The minimal margin loss with respect to D is denoted by ℓ*γ(D) ≜ min_{w∈B^d_1} ℓγ(w, D). For a sample S = {(xi, yi)}_{i=1}^m such that (xi, yi) ∈ Rd × {±1}, the margin loss with respect to S is denoted by ℓ̂γ(w, S) ≜ (1/m)|{i | yi⟨xi, w⟩ ≤ γ}| and the misclassification error is ℓ̂(w, S) ≜ (1/m)|{i | yi⟨xi, w⟩ ≤ 0}|. In this paper we are concerned with learning by minimizing the margin loss. It will be convenient for us to discuss transductive learning algorithms. Since many predictors minimize the margin loss, we define:

Definition 2.1. A margin-error minimization algorithm A is an algorithm whose input is a margin γ, a training sample S = {(xi, yi)}_{i=1}^m and an unlabeled test sample S̃X = {x̃i}_{i=1}^m, which outputs a predictor w̃ ∈ argmin_{w∈B^d_1} ℓ̂γ(w, S). We denote the output of the algorithm by w̃ = Aγ(S, S̃X).

We will be concerned with the expected test loss of the algorithm given a random training sample and a random test sample, each of size m, and define ℓm(Aγ, D) ≜ E_{S,S̃∼Dm}[ℓ̂(Aγ(S, S̃X), S̃)], where S, S̃ ∼ Dm independently. For γ > 0, ε ∈ [0, 1], and a distribution D, we denote the distribution-specific sample complexity by m(ε, γ, D): this is the minimal sample size such that for any margin-error minimization algorithm A, and for any m ≥ m(ε, γ, D), ℓm(Aγ, D) − ℓ*γ(D) ≤ ε.

Sub-Gaussian distributions

We will characterize the distribution-specific sample complexity in terms of the covariance of X ∼ DX. But in order to do so, we must assume that X is not too heavy-tailed. Otherwise, X can have even infinite covariance but still be learnable, for instance if it has a tiny probability of having an exponentially large norm. We will thus restrict ourselves to sub-Gaussian distributions. 
This ensures light tails in all directions, while allowing a sufficiently rich family of distributions, as we presently see. We also require a more restrictive condition – namely that DX can be rotated to a product distribution over the axes of Rd. A distribution can always be rotated so that its coordinates are uncorrelated. Here we further require that they are independent, as of course holds for any multivariate Gaussian distribution.

Definition 2.2 (See e.g. [11, 12]). A random variable X is sub-Gaussian with moment B (or B-sub-Gaussian) for B ≥ 0 if

∀t ∈ R, E[exp(tX)] ≤ exp(B²t²/2).    (1)

We further say that X is sub-Gaussian with relative moment ρ = B/√E[X²].

The sub-Gaussian family is quite extensive: For instance, any bounded, Gaussian, or Gaussian-mixture random variable with mean zero is included in this family.

Definition 2.3. A distribution DX over X ∈ Rd is independently sub-Gaussian with relative moment ρ if there exists some orthonormal basis a1, . . . , ad ∈ Rd, such that the ⟨X, ai⟩ are independent sub-Gaussian random variables, each with relative moment ρ.

We will focus on the family D^sg_ρ of all independently ρ-sub-Gaussian distributions in arbitrary dimension, for a small fixed constant ρ. For instance, the family D^sg_{3/2} includes all Gaussian distributions, all distributions which are uniform over a (hyper)box, and all multi-Bernoulli distributions, in addition to other less structured distributions. 
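The moment condition of Definition 2.2 can be checked numerically for simple distributions whose moment-generating function is known in closed form. The sketch below (helper name is ours) verifies on a grid of t values that a Rademacher (±1) variable satisfies the condition with B = 1 — and hence, since its variance is 1, has relative moment ρ = 1 — while the smaller moment B = 1/2 fails:

```python
import numpy as np

def satisfies_subgaussian_moment(mgf, B, ts):
    """Check E[exp(tX)] <= exp(B^2 t^2 / 2) on a grid of t values."""
    return all(mgf(t) <= np.exp(B**2 * t**2 / 2) + 1e-12 for t in ts)

ts = np.linspace(-10.0, 10.0, 2001)
# Rademacher variable: E[exp(tX)] = cosh(t), and cosh(t) <= exp(t^2/2).
print(satisfies_subgaussian_moment(np.cosh, 1.0, ts))   # -> True
# A smaller moment does not work: e.g. cosh(1) > exp(1/8).
print(satisfies_subgaussian_moment(np.cosh, 0.5, ts))   # -> False
```

This is consistent with the remark in Section 7 that independent Bernoulli (±1) coordinates give a distribution in D^sg_1.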
Our upper bounds and lower bounds will be tight up to quantities which depend on ρ, which we will regard as a constant, but the tightness will not depend on the dimensionality of the space or the variance of the distribution.

3 The γ-adapted-dimension

As mentioned in the introduction, the sample complexity of margin-error minimization can be upper-bounded in terms of the average norm E[‖X‖²] by m(ε, γ, D) ≤ O(E[‖X‖²]/(γ²ε²)) [13]. Alternatively, we can rely only on the dimensionality and conclude m(ε, γ, D) ≤ Õ(d/ε²) [7]. Thus, although both of these bounds are tight in the worst-case sense, i.e. they are the best bounds that rely only on the norm or only on the dimensionality respectively, neither is tight in a distribution-specific sense: If the average norm is unbounded while the dimensionality is small, an arbitrarily large gap is created between the true m(ε, γ, D) and the average-norm upper bound. The converse happens if the dimensionality is arbitrarily high while the average norm is bounded.

Seeking a distribution-specific tight analysis, one simple option to try to tighten these bounds is to consider their minimum, min(d, E[‖X‖²]/γ²)/ε², which, trivially, is also an upper bound on the sample complexity. However, this simple combination is also not tight: Consider a distribution in which there are a few directions with very high variance, but the combined variance in all other directions is small. We will show that in such situations the sample complexity is characterized not by the minimum of dimension and norm, but by the sum of the number of high-variance dimensions and the average norm in the other directions. This behavior is captured by the γ-adapted-dimension:

Definition 3.1. Let b > 0 and let k be a positive integer.

(a). 
A subset X ⊆ Rd is (b, k)-limited if there exists a sub-space V ⊆ Rd of dimension d − k such that X ⊆ {x ∈ Rd | ‖x′P‖² ≤ b}, where P is an orthogonal projection onto V.

(b). A distribution DX over Rd is (b, k)-limited if there exists a sub-space V ⊆ Rd of dimension d − k such that E_{X∼DX}[‖X′P‖²] ≤ b, with P an orthogonal projection onto V.

Definition 3.2. The γ-adapted-dimension of a distribution or a set, denoted by kγ, is the minimum k such that the distribution or set is (γ²k, k)-limited.

It is easy to see that kγ(DX) is upper-bounded by min(d, E[‖X‖²]/γ²). Moreover, it can be much smaller. For example, for X ∈ R^1001 with independent coordinates such that the variance of the first coordinate is 1000, but the variance in each remaining coordinate is 0.001, we have k1 = 1 but d = E[‖X‖²] = 1001. More generally, if λ1 ≥ λ2 ≥ · · · ≥ λd are the eigenvalues of the covariance matrix of X, then kγ = min{k | Σ_{i=k+1}^d λi ≤ γ²k}. A quantity similar to kγ was studied previously in [14]. kγ is different in nature from some other quantities used for providing sample complexity bounds in terms of eigenvalues, as in [15], since it is defined based on the eigenvalues of the distribution and not of the sample. In Section 6 we will see that these can be quite different.

In order to relate our upper and lower bounds, it will be useful to relate the γ-adapted-dimension for different margins. The relationship is established in the following lemma, proved in the appendix:

Lemma 3.3. 
For 0 < α < 1, γ > 0 and a distribution DX, kγ(DX) ≤ kαγ(DX) ≤ 2kγ(DX)/α² + 1.

We proceed to provide a sample complexity upper bound based on the γ-adapted-dimension.

4 A sample complexity upper bound using γ-adapted-dimension

In order to establish an upper bound on the sample complexity, we will bound the fat-shattering dimension of the linear functions over a set in terms of the γ-adapted-dimension of the set. Recall that the fat-shattering dimension is a classic quantity for proving sample complexity upper bounds:

Definition 4.1. Let F be a set of functions f : X → R, and let γ > 0. The set {x1, . . . , xm} ⊆ X is γ-shattered by F if there exist r1, . . . , rm ∈ R such that for all y ∈ {±1}^m there is an f ∈ F such that ∀i ∈ [m], yi(f(xi) − ri) ≥ γ. The γ-fat-shattering dimension of F is the size of the largest set in X that is γ-shattered by F.

The sample complexity of γ-loss minimization is bounded by Õ(d_{γ/8}/ε²), where d_{γ/8} is the γ/8-fat-shattering dimension of the function class [16, Theorem 13.4]. Let W(X) be the class of linear functions restricted to the domain X. For any set we show:

Theorem 4.2. If a set X is (B², k)-limited, then the γ-fat-shattering dimension of W(X) is at most (3/2)(B²/γ² + k + 1). Consequently, it is also at most 3kγ(X) + 1.

Proof. Let X be an m × d matrix whose rows are a set of m points in Rd which is γ-shattered. For any ε > 0 we can augment X with an additional column to form the matrix X̃ of dimensions m × (d + 1), such that for all y ∈ {−γ, +γ}^m there is a wy ∈ B^{d+1}_{1+ε} such that X̃wy = y (the details can be found in the appendix). 
Since X is (B², k)-limited, there is an orthogonal projection matrix P̃ of size (d + 1) × (d + 1) such that ∀i ∈ [m], ‖X̃i′P̃‖² ≤ B², where X̃i is the vector in row i of X̃. Let Ṽ be the sub-space of dimension d − k spanned by the columns of P̃. To bound the size of the shattered set, we show that the projected rows of X̃ on Ṽ are 'shattered' using projected labels. We then proceed similarly to the proof of the norm-only fat-shattering bound [17].

We have X̃ = X̃P̃ + X̃(I − P̃). In addition, X̃wy = y. Thus y − X̃P̃wy = X̃(I − P̃)wy. I − P̃ is a projection onto a (k + 1)-dimensional space, thus the rank of X̃(I − P̃) is at most k + 1. Let T be an m × m orthogonal projection matrix onto a subspace orthogonal to the columns of X̃(I − P̃) of dimension l = m − (k + 1), so that trace(T) = l. Then T(y − X̃P̃wy) = T X̃(I − P̃)wy = 0, and thus T y = T X̃P̃wy for every y ∈ {−γ, +γ}^m.

Denote row i of T by ti and row i of T X̃P̃ by zi. We have ∀i ≤ m, ⟨zi, wy⟩ = ti y = Σ_{j≤m} ti[j]y[j]. Therefore ⟨Σ_i zi y[i], wy⟩ = Σ_{i≤m} Σ_{j≤m} ti[j]y[i]y[j]. Since ‖wy‖ ≤ 1 + ε, for all x ∈ R^{d+1} we have (1 + ε)‖x‖ ≥ ‖x‖‖wy‖ ≥ ⟨x, wy⟩. Thus ∀y ∈ {−γ, +γ}^m, (1 + ε)‖Σ_i zi y[i]‖ ≥ Σ_{i≤m} Σ_{j≤m} ti[j]y[i]y[j]. Taking the expectation over y chosen uniformly at random, we have

(1 + ε) E[‖Σ_i zi y[i]‖] ≥ Σ_{i,j} E[ti[j]y[i]y[j]] = γ² Σ_i ti[i] = γ² trace(T) = γ²l.

In addition, (1/γ²) E[‖Σ_i zi y[i]‖²] = Σ_i ‖zi‖² = trace(P̃′X̃′T²X̃P̃) ≤ trace(P̃′X̃′X̃P̃) ≤ B²m. From the inequality E[Z]² ≤ E[Z²], it follows that l² ≤ (1 + ε)²(B²/γ²)m. Since this holds for any ε > 0, we can set ε = 0 and solve for m. Thus

m ≤ (k + 1) + B²/(2γ²) + √(B⁴/(4γ⁴) + (B²/γ²)(k + 1)) ≤ (k + 1) + B²/γ² + √((B²/γ²)(k + 1)) ≤ (3/2)(B²/γ² + k + 1).

Corollary 4.3. Let D be a distribution over X × {±1}, X ⊆ Rd. Then

m(ε, γ, D) ≤ Õ(k_{γ/8}(X)/ε²).

The corollary above holds only for distributions with bounded support. However, since sub-Gaussian variables have an exponentially decaying tail, we can use this corollary to provide a bound for independently sub-Gaussian distributions as well (see appendix for proof):

Theorem 4.4 (Upper Bound for Distributions in D^sg_ρ). For any distribution D over Rd × {±1} such that DX ∈ D^sg_ρ,

m(ε, γ, D) = Õ(ρ²kγ(DX)/ε²).

This new upper bound is tighter than norm-only and dimension-only upper bounds. But does the γ-adapted-dimension characterize the true sample complexity of the distribution, or is it just another upper bound? To answer this question, we need to be able to derive sample complexity lower bounds as well. We consider this problem in the following section.

5 Sample complexity lower bounds using Gram-matrix eigenvalues

We wish to find a distribution-specific lower bound that depends on the γ-adapted-dimension, and matches our upper bound as closely as possible. To do that, we will link the ability to learn with a margin with properties of the data distribution. The ability to learn is closely related to the probability of a sample to be shattered, as evident from Vapnik's formulations of learnability as a function of the ε-entropy. 
In the preceding section we used the fact that a small fat-shattering dimension implies learnability. For the lower bound we use the converse fact, presented below in Theorem 5.1: If a sample can be fat-shattered with a reasonably high probability, then learning is impossible. We then relate the fat-shattering of a sample to the minimal eigenvalue of its Gram matrix. This allows us to present a lower bound on the sample complexity using a lower bound on the smallest eigenvalue of the Gram-matrix of a sample drawn from the data distribution. We use the term 'γ-shattered at the origin' to indicate that a set is γ-shattered by setting the bias r ∈ Rm (see Def. 4.1) to the zero vector.

Theorem 5.1. Let D be a distribution over Rd × {±1}. If the probability of a sample of size m drawn from D^m_X being γ-shattered at the origin is at least η, then there is a margin-error minimization algorithm A, such that ℓ_{m/2}(Aγ, D) ≥ η/2.

Proof. For a given distribution D, let A be an algorithm which, for every two input samples S and S̃X, labels S̃X using the separator w ∈ argmin_{w∈B^d_1} ℓ̂γ(w, S) that maximizes E_{S̃Y}[ℓ̂γ(w, S̃)]. For every x ∈ Rd there is a label y ∈ {±1} such that P_{(X,Y)∼D}[Y ≠ y | X = x] ≥ 1/2. If the set of examples in SX and S̃X together is γ-shattered at the origin, then A chooses a separator with zero margin loss on S, but loss of at least 1/2 on S̃. Therefore ℓ_{m/2}(Aγ, D) ≥ η/2.

The notion of shattering involves checking the existence of a unit-norm separator w for each label-vector y ∈ {±1}^m. In general, there is no closed form for the minimum-norm separator. However, the following theorem provides an equivalent and simple characterization of fat-shattering:

Theorem 5.2. Let S = (X1, . . . , Xm) be a sample in Rd, and denote by X the m × d matrix whose rows are the elements of S. Then S is 1-shattered iff XX′ is invertible and ∀y ∈ {±1}^m, y′(XX′)⁻¹y ≤ 1.

The proof of this theorem is in the appendix. The main issue in the proof is showing that if a set is shattered, it is also shattered with exact margins, since the set of exact margins {±1}^m lies in the convex hull of any set of non-exact margins that correspond to all the possible labelings. We can now use the minimum eigenvalue of the Gram matrix to obtain a sufficient condition for fat-shattering, after which we present the theorem linking eigenvalues and learnability. For a matrix X, λn(X) denotes the n'th largest eigenvalue of X.

Lemma 5.3. Let S = (X1, . . . , Xm) be a sample in Rd, with X as above. If λm(XX′) ≥ m then S is 1-shattered at the origin.

Proof. If λm(XX′) ≥ m then XX′ is invertible and λ1((XX′)⁻¹) ≤ 1/m. For any y ∈ {±1}^m we have ‖y‖ = √m and y′(XX′)⁻¹y ≤ ‖y‖²λ1((XX′)⁻¹) ≤ m(1/m) = 1. By Theorem 5.2 the sample is 1-shattered at the origin.

Theorem 5.4. Let D be a distribution over Rd × {±1}, let S be an i.i.d. sample of size m drawn from D, and denote by XS the m × d matrix whose rows are the points from S. If P[λm(XS X′S) ≥ mγ²] ≥ η, then there exists a margin-error minimization algorithm A such that ℓ_{m/2}(Aγ, D) ≥ η/2.

Theorem 5.4 follows by scaling XS by γ, applying Lemma 5.3 to establish γ-fat-shattering with probability at least η, then applying Theorem 5.1. Lemma 5.3 generalizes the requirement of linear independence when shattering using hyperplanes with no margin (i.e. no regularization). For unregularized (homogeneous) linear separation, a sample is shattered iff it is linearly independent, i.e. if λm > 0. 
Requiring λm > mγ² is enough for γ-fat-shattering. Theorem 5.4 then generalizes the simple observation that if samples of size m are linearly independent with high probability, there is no hope of generalizing from m/2 points to the other m/2 using unregularized linear predictors. Theorem 5.4 can thus be used to derive a distribution-specific lower bound. Define:

m_γ(D) ≜ (1/2) min{ m | P_{S∼Dm}[λm(XS X′S) ≥ mγ²] < 1/2 }.

Then for any ε < 1/4 − ℓ*γ(D), we can conclude that m(ε, γ, D) ≥ m_γ(D), that is, we cannot learn within reasonable error with less than m_γ examples. Recall that our upper bound on the sample complexity from Section 4 was Õ(kγ). The remaining question is whether we can relate m_γ and kγ, to establish that our lower bound and upper bound tightly specify the sample complexity.

6 A lower bound for independently sub-Gaussian distributions

As discussed in the previous section, to obtain a sample complexity lower bound we require a bound on the value of the smallest eigenvalue of a random Gram-matrix. The distribution of this eigenvalue has been investigated under various assumptions. The cleanest results are in the case where m, d → ∞ and m/d → β < 1, and the coordinates of each example are identically distributed:

Theorem 6.1 (Theorem 5.11 in [18]). Let Xi be a series of mi × di matrices whose entries are i.i.d. random variables with mean zero, variance σ² and finite fourth moments. If lim_{i→∞} mi/di = β < 1, then lim_{i→∞} λ_{mi}((1/di) Xi X′i) = σ²(1 − √β)².

This asymptotic limit can be used to calculate m_γ and thus provide a lower bound on the sample complexity: Let the coordinates of X ∈ Rd be i.i.d. 
with variance σ² and consider a sample of size m. If d, m are large enough, we have by Theorem 6.1:

λm(XX′) ≈ dσ²(1 − √(m/d))² = σ²(√d − √m)².

Solving σ²(√d − √(2m_γ))² = 2m_γγ² we get m_γ ≈ (1/2)·d/(1 + γ/σ)². We can also calculate the γ-adapted-dimension for this distribution to get kγ ≈ d/(1 + γ²/σ²), and conclude that (1/4)kγ ≤ m_γ ≤ (1/2)kγ. In this case, then, we are indeed able to relate the sample complexity lower bound with kγ, the same quantity that controls our upper bound. This conclusion is easy to derive from known results; however, it holds only asymptotically, and only for a highly limited set of distributions. Moreover, since Theorem 6.1 holds asymptotically for each distribution separately, we cannot deduce from it any finite-sample lower bounds for families of distributions.

For our analysis we require finite-sample bounds for the smallest eigenvalue of a random Gram-matrix. Rudelson and Vershynin [19, 20] provide such finite-sample lower bounds for distributions with identically distributed sub-Gaussian coordinates. In the following theorem we generalize results of Rudelson and Vershynin to encompass also non-identically distributed coordinates. The proof of Theorem 6.2 can be found in the appendix. Based on this theorem we conclude with Theorem 6.3, stated below, which constitutes our final sample complexity lower bound.

Theorem 6.2. Let B > 0. 
There is a constant \u03b2 > 0 which depends only on B, such that for any\n\u03b4 \u2208 (0, 1) there exists a number L0, such that for any independently sub-Gaussian distribution with\ncovariance matrix \u03a3 \u2264 I and trace(\u03a3) \u2265 L0, if each of its independent sub-Gaussian coordinates\nhas moment B, then for any m \u2264 \u03b2 \u00b7 trace(\u03a3)\nP[\u03bbm(XmX \u2032\n\nm) \u2265 m] \u2265 1 \u2212 \u03b4,\n\nWhere Xm is an m \u00d7 d matrix whose rows are independent draws from DX.\nTheorem 6.3 (Lower bound for distributions in Dsg\nand an integer L0 such that for any D such that DX \u2208 Dsg\n\u03b3 > 0 and any \u01eb < 1\nm(\u01eb, \u03b3, D) \u2265 \u03b2k\u03b3(DX ).\n\n4 \u2212 \u2113\u2217\n\n\u03b3(D),\n\n\u03c1 ). For any \u03c1 > 0, there are a constant \u03b2 > 0\n\u03c1 and k\u03b3(DX ) > L0, for any margin\n\nProof. The covariance matrix of DX is clearly diagonal. We assume w.l.o.g.\nthat \u03a3 =\ndiag(\u03bb1, . . . , \u03bbd) where \u03bb1 \u2265 . . . \u2265 \u03bbd > 0. Let S be an i.i.d. sample of size m drawn from\nD. Let X be the m \u00d7 d matrix whose rows are the unlabeled examples from S. Let \u03b4 be \ufb01xed, and\nset \u03b2 and L0 as de\ufb01ned in Theorem 6.2 for \u03b4. Assume m \u2264 \u03b2(k\u03b3 \u2212 1).\nWe would like to use Theorem 6.2 to bound the smallest eigenvalue of XX \u2032 with high probability,\nso that we can then apply Theorem 5.4 to get the desired lower bound. However, Theorem 6.2\nholds only if all the coordinate variances are bounded by 1, and it requires that the moment, and not\nthe relative moment, be bounded. Thus we divide the problem to two cases, based on the value of\n\u03bbk\u03b3 +1, and apply Theorem 6.2 separately to each case.\nCase I: Assume \u03bbk\u03b3 +1 \u2265 \u03b32. Then \u2200i \u2208 [k\u03b3], \u03bbi \u2265 \u03b32. Let \u03a31 = diag(1/\u03bb1, . . . , 1/\u03bbk\u03b3 , 0, . . . 
, 0).\nThe random matrix X\u221a\u03a31 is drawn from an independently sub-Gaussian distribution, such that\neach of its coordinates has sub-Gaussian moment \u03c1 and covariance matrix \u03a3 \u00b7 \u03a31 \u2264 Id. In addition,\ntrace(\u03a3 \u00b7 \u03a31) = k\u03b3 \u2265 L0. Therefore Theorem 6.2 holds for X\u221a\u03a31, and P[\u03bbm(X\u03a31X \u2032) \u2265 m] \u2265\n1 \u2212 \u03b4. Clearly, for any X, \u03bbm( 1\nCase II: Assume \u03bbk\u03b3 +1 < \u03b32. Then \u03bbi < \u03b32 for all i \u2208 {k\u03b3 + 1, . . . , d}. Let \u03a32 =\ndiag(0, . . . , 0, 1/\u03b32, . . . , 1/\u03b32), with k\u03b3 zeros on the diagonal. Then the random matrix X\u221a\u03a32\nis drawn from an independently sub-Gaussian distribution with covariance matrix \u03a3\u00b7 \u03a32 \u2264 Id, such\nthat all its coordinates have sub-Gaussian moment \u03c1. In addition, from the properties of k\u03b3 (see\ndiscussion in Section 2), trace(\u03a3 \u00b7 \u03a32) = 1\ni=k\u03b3 +1 \u03bbi \u2265 k\u03b3 \u2212 1 \u2265 L0 \u2212 1. Thus Theorem 6.2\nholds for X\u221a\u03a32, and so P[\u03bbm( 1\n\n\u03b3 2 XX \u2032) \u2265 \u03bbm(X\u03a31X \u2032). Thus P[\u03bbm( 1\n\n\u03b3 2 XX \u2032) \u2265 m] \u2265 1 \u2212 \u03b4.\n\n\u03b3 2 XX \u2032) \u2265 m] \u2265 P[\u03bbm(X\u03a32X \u2032) \u2265 m] \u2265 1 \u2212 \u03b4.\n\n\u03b3 2 Pd\n\n7\n\n\fIn both cases P[\u03bbm( 1\nan algorithm A such that for any m \u2264 \u03b2(k\u03b3 \u2212 1) \u2212 1, \u2113m(A\u03b3, D) \u2265 1\n\u01eb < 1\n\n\u03b3 2 XX \u2032) \u2265 m] \u2265 1 \u2212 \u03b4 for any m \u2264 \u03b2(k\u03b3 \u2212 1). By Theorem 5.4, there exists\n2 \u2212 \u03b4/2. Therefore, for any\n\n\u03b3(D), we have m(\u01eb, \u03b3, D) \u2265 \u03b2(k\u03b3 \u2212 1). 
We get the theorem by setting δ = 1/4.

7 Summary and consequences

Theorem 4.4 and Theorem 6.3 provide an upper bound and a lower bound for the sample complexity of any distribution D whose data distribution is in D^sg_ρ for some fixed ρ > 0. We can thus draw the following bound, which holds for any γ > 0 and ε ∈ (0, 1/4 − ℓ*_γ(D)):

Ω(k_γ(D_X)) ≤ m(ε, γ, D) ≤ Õ(k_γ(D_X)/ε²).    (2)

On both sides of the bound, the hidden constants depend only on the constant ρ. This result shows that the true sample complexity of learning each of these distributions is characterized by the γ-adapted-dimension. An interesting conclusion can be drawn as to the influence of the conditional distribution of labels D_{Y|X}: since Eq. (2) holds for any D_{Y|X}, the effect of the direction of the best separator on the sample complexity is bounded, even for highly non-spherical distributions. We can use Eq. (2) to easily characterize the sample complexity behavior for interesting distributions, and to compare L2 margin minimization to other learning methods.

Gaps between L1 and L2 regularization in the presence of irrelevant features. Ng [3] considers learning a single relevant feature in the presence of many irrelevant features, and compares using L1 regularization and L2 regularization. When ‖X‖_∞ ≤ 1, upper bounds on learning with L1 regularization guarantee a sample complexity of O(log(d)) for an L1-based learning rule [21]. In order to compare this with the sample complexity of L2 regularized learning and establish a gap, one must use a lower bound on the L2 sample complexity. The argument provided by Ng actually assumes scale-invariance of the learning rule, and is therefore valid only for unregularized linear learning.
However, using our results we can easily establish a lower bound of Ω(d) for many specific distributions with ‖X‖_∞ ≤ 1 and Y = X[1] ∈ {±1}. For instance, when each coordinate is an independent Bernoulli variable, the distribution is sub-Gaussian with ρ = 1, and k_1 = ⌈d/2⌉.

Gaps between generative and discriminative learning for a Gaussian mixture. Consider two classes, each drawn from a unit-variance spherical Gaussian in a high dimension R^d, with a large distance 2v ≫ 1 between the class means, such that d ≫ v⁴. Then P_D[X | Y = y] = N(yv · e_1, I_d), where e_1 is a unit vector in R^d. For any v and d, we have D_X ∈ D^sg_1. For large values of v, we have extremely low margin error at γ = v/2, and so we can hope to learn the classes by looking for a large-margin separator. Indeed, we can calculate k_γ = ⌈d/(1 + v²/4)⌉, and conclude that the required sample complexity is Θ̃(d/v²). Now consider a generative approach: fitting a spherical Gaussian model for each class. This amounts to estimating each class center as the empirical average of the points in the class, and classifying based on the nearest estimated class center. It is possible to show that for any constant ε > 0, and for large enough v and d, O(d/v⁴) samples are enough to ensure an error of ε. This establishes a rather large gap of Ω(v²) between the sample complexity of the discriminative approach and that of the generative one.

To summarize, we have shown that the true sample complexity of large-margin learning of a rich family of specific distributions is characterized by the γ-adapted-dimension. This result allows a true comparison between this learning algorithm and other algorithms, and has various applications, such as semi-supervised learning and feature construction.
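As a quick numerical sanity check of the two k_γ values quoted above, the following sketch computes the γ-adapted-dimension from a covariance spectrum. It assumes the Section 2 definition takes the form k_γ = min{k : kγ² ≥ Σ_{i>k} λ_i}, which is consistent with both closed forms here but is not restated in this excerpt; `gamma_adapted_dimension` is our own helper name, not notation from the paper.

```python
import math

def gamma_adapted_dimension(eigs, gamma):
    # Assumed form of the Section 2 definition (not restated in this excerpt):
    # k_gamma = min{ k : k * gamma^2 >= sum_{i > k} lambda_i },
    # where lambda_1 >= lambda_2 >= ... are the covariance eigenvalues.
    eigs = sorted(eigs, reverse=True)
    tail = sum(eigs)                      # sum_{i > k} lambda_i, starting at k = 0
    for k in range(len(eigs) + 1):
        if k * gamma ** 2 >= tail:
            return k
        tail -= eigs[k]                   # drop the next-largest eigenvalue from the tail
    return len(eigs)

d = 100

# Independent +/-1 coordinates (unit variance each): k_1 = ceil(d / 2).
bernoulli_spectrum = [1.0] * d
assert gamma_adapted_dimension(bernoulli_spectrum, gamma=1.0) == math.ceil(d / 2)

# Mixture of N(+v e_1, I) and N(-v e_1, I): spectrum (1 + v^2, 1, ..., 1),
# and at margin gamma = v / 2 one gets k_gamma = ceil(d / (1 + v^2 / 4)).
v = 4.0
mixture_spectrum = [1.0 + v ** 2] + [1.0] * (d - 1)
assert gamma_adapted_dimension(mixture_spectrum, gamma=v / 2) == math.ceil(d / (1 + v ** 2 / 4))
```

Both assertions pass under the assumed definition, matching ⌈d/2⌉ and ⌈d/(1 + v²/4)⌉; this is only an illustration of how k_γ reacts to the spectrum, not an implementation from the paper.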
The challenge of characterizing the true sample complexity extends to any distribution and any learning algorithm. We believe that obtaining answers to these questions is of great importance, both to learning theory and to learning applications.

Acknowledgments

The authors thank Boaz Nadler for many insightful discussions, and Karthik Sridharan for pointing out [14] to us. Sivan Sabato is supported by the Adams Fellowship Program of the Israel Academy of Sciences and Humanities. This work was supported by the NATO SfP grant 982480.

References

[1] I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Annals of Statistics, 35(2):575–607, 2007.
[2] P. Liang and N. Srebro. On the interaction between norm and dimensionality: Multiple regimes in learning. In ICML, 2010.
[3] A.Y. Ng. Feature selection, l1 vs. l2 regularization, and rotational invariance. In ICML, 2004.
[4] A. Antos and G. Lugosi. Strong minimax lower bounds for learning. Mach. Learn., 30(1):31–56, 1998.
[5] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. In Proceedings of the First Annual Workshop on Computational Learning Theory, pages 139–154, August 1988.
[6] C. Gentile and D.P. Helmbold. Improved lower bounds for learning from noisy examples: an information-theoretic approach. In COLT, pages 104–115, 1998.
[7] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[8] G.M. Benedek and A. Itai. Learnability with respect to fixed distributions. Theoretical Computer Science, 86(2):377–389, September 1991.
[9] S. Ben-David, T. Lu, and D. Pál. Does unlabeled data provably help? In Proceedings of the Twenty-First Annual Conference on Computational Learning Theory, pages 33–44, 2008.
[10] N. Vayatis and R. Azencott. Distribution-dependent Vapnik-Chervonenkis bounds. In EuroCOLT '99, pages 230–240, London, UK, 1999. Springer-Verlag.
[11] D.J.H. Garling. Inequalities: A Journey into Linear Analysis. Cambridge University Press, 2007.
[12] V.V. Buldygin and Yu.V. Kozachenko. Metric Characterization of Random Variables and Random Processes. American Mathematical Society, 1998.
[13] P.L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. In COLT 2001, volume 2111, pages 224–240. Springer, Berlin, 2001.
[14] O. Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, Ecole Polytechnique, 2002.
[15] B. Schölkopf, J. Shawe-Taylor, A.J. Smola, and R.C. Williamson. Generalization bounds via eigenvalues of the Gram matrix. Technical Report NC2-TR-1999-035, NeuroCOLT2, 1999.
[16] M. Anthony and P.L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[17] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[18] Z. Bai and J.W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices. Springer, second edition, 2010.
[19] M. Rudelson and R. Vershynin. The smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62:1707–1739, 2009.
[20] M. Rudelson and R. Vershynin. The Littlewood-Offord problem and invertibility of random matrices. Advances in Mathematics, 218(2):600–633, 2008.
[21] T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.
[22] G. Bennett, V. Goodman, and C.M. Newman. Norms of random matrices. Pacific J. Math., 59(2):359–365, 1975.
[23] F.L. Nazarov and A. Podkorytov. Ball, Haagerup, and distribution functions. Operator Theory: Advances and Applications, 113 (Complex analysis, operators, and related topics):247–267, 2000.
[24] R.E.A.C. Paley and A. Zygmund. A note on analytic functions in the unit circle. Proceedings of the Cambridge Philosophical Society, 28:266–272, 1932.