{"title": "Dimensionality Dependent PAC-Bayes Margin Bound", "book": "Advances in Neural Information Processing Systems", "page_first": 1034, "page_last": 1042, "abstract": "Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM whose feature spaces are often of high or infinite dimension. In this paper we address the problem whether such dimensionality independency is intrinsic for the margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when keeping all other factors fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.", "full_text": "Dimensionality Dependent PAC-Bayes Margin Bound\n\nChi Jin\n\nSchool of Physics\nPeking University\n\nLiwei Wang\n\nSchool of EECS\nPeking University\n\nKey Laboratory of Machine Perception, MOE\n\nKey Laboratory of Machine Perception, MOE\n\nchijin06@gmail.com\n\nwanglw@cis.pku.edu.cn\n\nAbstract\n\nMargin is one of the most important concepts in machine learning. Previous mar-\ngin bounds, both for SVM and for boosting, are dimensionality independent. 
A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM, whose feature spaces are often of high or infinite dimension. In this paper we address the problem of whether such dimensionality independency is intrinsic to margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when all other factors are kept fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.\n\n1 Introduction\n\nLinear classifiers, including SVM and boosting, play an important role in machine learning. A central concept in the generalization analysis of linear classifiers is margin. There has been extensive work on bounding the generalization errors of SVM and boosting in terms of margins (with various definitions such as l2, l1, soft, hard, average, minimum, etc.).\nIn the 1970s Vapnik pointed out that a large margin can imply good generalization. Using the fat-shattering dimension, Shawe-Taylor et al. [1] proved a margin bound for linear classifiers. This bound was improved and simplified in a series of works [2, 3, 4, 5] mainly based on the PAC-Bayes theory [6], which was originally developed for stochastic classifiers. (See Section 2 for a brief review of the PAC-Bayes theory and the PAC-Bayes margin bounds.) 
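To fix ideas, the (l2) margin of a labeled example (x, y) under a unit-norm linear classifier w, as used throughout this paper, is the signed projection y<w, x>/(||w|| ||x||). A minimal illustrative sketch (the data and the helper name are ours, not from the paper):

```python
import numpy as np

def l2_margin(w, x, y):
    """Normalized l2 margin y<w,x>/(||w|| ||x||), a value in [-1, 1]."""
    return y * np.dot(w, x) / (np.linalg.norm(w) * np.linalg.norm(x))

w = np.array([3.0, 4.0])    # a linear classifier, ||w|| = 5
x = np.array([1.0, 0.0])    # an example, ||x|| = 1
print(l2_margin(w, x, +1))  # 3/5 = 0.6: correctly classified with margin 0.6
print(l2_margin(w, x, -1))  # -0.6: misclassified (negative margin)
```

A positive margin means correct classification; margin bounds ask that this quantity be large for most training examples.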
All these bounds state that if a linear\nclassi\ufb01er in the feature space induces large margins for most of the training examples, then it has a\nsmall generalization error bound independent of the dimensionality of the feature space.\nThe (l1) margin has also been extensively studied for boosting to explain its generalization ability.\nSchapire et al. [7] proved a margin bound for the generalization error of voting classi\ufb01ers. The bound\nis independent of the number of base classi\ufb01ers combined in the voting classi\ufb01er1. This margin\nbound was greatly improved in [8, 9] using (local) Rademacher complexities. There also exist\nimproved margin bounds for boosting from the viewpoint of PAC-Bayes theory [10], the diversity\nof base classi\ufb01ers [11], and different de\ufb01nition of margins [12, 13].\n\n1The bound depends on the VC dimension of the base hypothesis class. Nevertheless, given the VC dimen-\nsion of the base hypothesis space, the bound does not depend on the number of the base classi\ufb01ers, which can\nbe seen as the dimension of the feature space.\n\n1\n\n\fThe aforementioned margin bounds are all dimensionality independent. That is, the bounds are\nsolely characterized by the margins on the training data and do not depend on the dimension of\nfeature space. A major advantage of such dimensionality independent margin bounds is that they\ncan explain the generalization ability of SVM and boosting whose feature spaces have high or in\ufb01nite\ndimension, in which case the standard VC bound becomes trivial.\nAlthough very successful in bounding the generalization error, a natural question is whether this\ndimensionality independency is intrinsic for margin bounds. In this paper we explore this problem.\nBuilding upon the PAC-Bayes theory, we prove a dimensionality dependent margin bound. 
This\nbound is monotone increasing with respect to the dimension when keeping all other factors \ufb01xed.\nComparing with the PAC-Bayes margin bound of Langford [4], the new bound is strictly sharper\nwhen the feature space is of \ufb01nite dimension; and the two bounds tend to be equal as the dimension\ngoes to in\ufb01nity.\nWe conduct extensive experiments on benchmark datasets. The experimental results show that the\nnew bound is signi\ufb01cantly sharper than the dimensionality independent PAC-Bayes margin bound\nas well as the VC bound for linear classi\ufb01ers on relatively large datasets. The bound is also found\nuseful for model selection.\nThe rest of this paper is organized as follows. Section 2 contains a brief review of the PAC-Bayes\ntheory and the dimensionality independent PAC-Bayes margin bound.\nIn Section 3 we give the\ndimensionality dependent PAC-Bayes margin bound and further improvements. We provide the\nexperimental results in Section 4, and conclude in Section 5. Due to the space limit, all the proofs\nare given in the supplementary material.\n\n2 Background\nLet X be the instance space or generally the feature space. In this paper we always assume X = Rd.\nWe consider binary classi\ufb01cation problems and let Y = {\u22121; 1}. Examples are drawn independently\naccording to an underlying distribution D over X \u00d7 Y. Let PD(A(x; y)) denote the probability of\nevent A when an example (x; y) is chosen according to D. Let S denote a training set of n i.i.d.\nexamples. We denote by PS (A(x; y)) the probability of event A when an example (x; y) is chosen\nat random from S. Similarly we denote by ED and ES the corresponding expectations. If c is\na classi\ufb01er, then we denote by erD(c) = PD(y \u0338= c(x)) the generalization error of c, and let\nerS (c) = PS (y \u0338= c(x)) be the empirical error.\nAn important type of classi\ufb01ers studied in this paper is stochastic classi\ufb01ers. 
Let C be a set of classifiers, and let Q be a probability distribution of classifiers on C. A stochastic classifier defined by Q randomly selects c ∈ C according to Q. When clear from the context, we often denote by erD(Q) and erS(Q) the generalization and empirical error of the stochastic classifier Q respectively. That is,\n\nerD(Q) = Ec∼Q[erD(c)],  erS(Q) = Ec∼Q[erS(c)].\n\nA probability distribution Q of classifiers also defines a deterministic classifier — the voting classifier, which we denote by vQ. For x ∈ X,\n\nvQ(x) = sgn[Ec∼Q c(x)].\n\nIn this paper we always consider homogeneous linear classifiers2, or stochastic classifiers whose distribution is over homogeneous linear classifiers. Let X = Rd. For any w ∈ Rd, the linear classifier cw is defined as cw(·) = sgn[<w, ·>]. When we consider a probability distribution over all homogeneous linear classifiers cw in Rd, we can equivalently consider a distribution of w ∈ Rd.\nThe work in this paper is based on the PAC-Bayes theory. PAC-Bayes theory is a beautiful generalization of the classical PAC theory to the setting of Bayes learning. It gives generalization error bounds for stochastic classifiers. The PAC-Bayes theorem was first proposed by McAllester [6]. The following elegant version is due to Langford [4].\n\n2This does not sacrifice any generality since linear classifiers can be easily transformed to homogeneous linear classifiers by adding a new dimension.\n\nTheorem 2.1. Let P, Q denote probability distributions of classifiers. For any P and any δ ∈ (0, 1), with probability 1 − δ over the random draw of n training examples\n\nkl(erS(Q) || erD(Q)) ≤ [KL(Q||P) + ln((n+1)/δ)] / n    (1)\n\nholds simultaneously for all distributions Q. Here KL(Q||P) is the Kullback-Leibler divergence of distributions Q and P; kl(a||b) for a, b ∈ [0, 1] is the Bernoulli KL divergence defined as kl(a||b) = a log(a/b) + (1 − a) log((1 − a)/(1 − b)).\n\nThe above PAC-Bayes theorem states that if a stochastic classifier, whose distribution Q is close (in the sense of KL divergence) to the fixed prior P, has a small training error, then its generalization error is small.\nPAC-Bayes theory has been improved and generalized in a series of works [5, 14]. For important recent results please refer to [14]. [15] generalizes the KL divergence in the PAC-Bayes theorem to arbitrary convex functions. [15, 16, 17, 18, 19] utilize improved PAC-Bayes bounds to develop learning algorithms and perform model selection.\nVery interestingly, it is shown in [2] that one can derive a margin bound for linear classifiers (including SVM) from the PAC-Bayes theorem quite easily. It is much simpler and slightly tighter than previous margin bounds for SVM [1, 20]. The following simplified and refined version can be found in [4].\nTheorem 2.2 ([4]). Let X = Rd. Let Q(μ, ŵ) (μ > 0, ŵ ∈ Rd, ∥ŵ∥ = 1) denote the distribution of homogeneous linear classifiers cw, where w ∼ N(μŵ, I). For any δ ∈ (0, 1), with probability 1 − δ over the random draw of n training examples\n\nkl(erS(Q(μ, ŵ)) || erD(Q(μ, ŵ))) ≤ [μ²/2 + ln((n+1)/δ)] / n    (2)\n\nholds simultaneously for all μ > 0 and all ŵ ∈ Rd with ∥ŵ∥ = 1. In addition, the empirical error of the stochastic classifier can be written as\n\nerS(Q(μ, ŵ)) = ES Φ̄(μγ(ŵ, x, y)),    (3)\n\nwhere γ(ŵ, x, y) = y<ŵ, x>/∥x∥ is the margin of (x, y) with respect to the unit vector ŵ, and\n\nΦ̄(t) = 1 − Φ(t) = (1/√(2π)) ∫_t^∞ e^(−τ²/2) dτ    (4)\n\nis the probability of the upper tail of the Gaussian distribution.\nAccording to Theorem 2.2, if there is a linear classifier ŵ ∈ Rd inducing large margins for most training examples, i.e., γ(ŵ, x, y) is large for most (x, y), then choosing a relatively small μ would yield a small erS(Q(μ, ŵ)) and in turn a small upper bound for the generalization error of the stochastic classifier Q(μ, ŵ). Note that this bound does not depend on the dimensionality d. In fact almost all previously known margin bounds are dimensionality independent3.\nPAC-Bayes theory only provides bounds for stochastic classifiers. In practice however, users often prefer deterministic classifiers. There is a close relation between the error of a stochastic classifier defined by distribution Q and the error of the deterministic voting classifier vQ. The following simple result is well-known.\nProposition 2.3. Let vQ be the voting classifier defined by distribution Q. That is, vQ(·) = sgn[Ec∼Q c(·)]. Then for any Q,\n\nerD(vQ) ≤ 2 erD(Q).    (5)\n\nCombining Theorem 2.2 and Proposition 2.3, one can upper bound the generalization error of the voting classifier vQ associated with Q(μ, ŵ) given in Theorem 2.2. In fact, it is easy to see that vQ = cŵ, i.e., the voting classifier is exactly the linear classifier ŵ. Thus\n\nerD(cŵ) ≤ 2 erD(Q(μ, ŵ)).    (6)\n\n3There exist dimensionality dependent margin bounds [21]. 
However these bounds grow unboundedly as the dimensionality tends to infinity.\n\nFrom Theorem 2.2, Proposition 2.3 and (6), we have that with probability 1 − δ the following margin bound holds for all classifiers cŵ with ŵ ∈ Rd, ∥ŵ∥ = 1 and all μ > 0:\n\nkl(erS(Q(μ, ŵ)) || erD(cŵ)/2) ≤ [μ²/2 + ln((n+1)/δ)] / n.    (7)\n\nOne disadvantage of the bounds in (5), (6) and (7) is that they involve a multiplicative factor of 2. In general, the factor 2 cannot be improved. However for linear classifiers with large margins there can exist tighter bounds. The following is a slightly refined version of the bounds given in [2, 3].\nProposition 2.4 ([2, 3]). Let Q(μ, ŵ) and vQ = cŵ be defined as above. Let erD,θ(Q(μ, ŵ)) = Ew∼N(μŵ,I) PD(y<w, x>/∥x∥ ≤ θ) be the error of the stochastic classifier with margin θ. Then for all θ ≥ 0,\n\nerD(cŵ) ≤ erD,θ(Q(μ, ŵ)) + Φ̄(θ).    (8)\n\nThe bound states that if the stochastic classifier induces small errors with large margin θ, then the linear (voting) classifier has only a slightly larger generalization error than the stochastic classifier. However sometimes (8) can be larger than (5). The two bounds have different regimes in which they dominate [2]. It is also worth pointing out that the margin y<w, x>/∥x∥ considered in Proposition 2.4 is unnormalized with respect to w. See Section 3 for more discussion.\nTo apply Proposition 2.4, one needs to further bound erD,θ(Q(μ, ŵ)) by its empirical version erS,θ(Q(μ, ŵ)) := Ew∼N(μŵ,I) PS(y<w, x>/∥x∥ ≤ θ) = ES Φ̄(μ y<ŵ, x>/∥x∥ − θ). With slight modifications of Theorem 2.2, one can show that for any θ ≥ 0, with probability 1 − δ the following bound is valid for all μ and ŵ uniformly:\n\nkl(erS,θ(Q(μ, ŵ)) || erD,θ(Q(μ, ŵ))) ≤ [μ²/2 + ln((n+1)/δ)] / n.    (9)\n\nThe following Proposition combines the above results.\nProposition 2.5. For any θ ≥ 0 and any δ > 0, with probability 1 − δ the following bound is valid for all μ and ŵ uniformly:\n\nkl(erS,θ(Q(μ, ŵ)) || erD(cŵ) − Φ̄(θ)) ≤ [μ²/2 + ln((n+1)/δ)] / n.    (10)\n\nNote that this last bound is not uniform for θ; see also [3].\nImproving the multiplicative factor was also studied in [22, 17], in which the variance of the stochastic classifier is also bounded by the PAC-Bayes theorem, so that the Chebyshev inequality can be used.\n\n3 Theoretical Results\n\nIn this section we give the theoretical results. The main result of this paper is Theorem 3.1, which provides a dimensionality dependent PAC-Bayes margin bound.\nTheorem 3.1. Let Q(μ, ŵ) (μ > 0, ŵ ∈ Rd, ∥ŵ∥ = 1) denote the distribution of linear classifiers cw(·) = sgn[<w, ·>], where w ∼ N(μŵ, I). For any δ ∈ (0, 1), with probability 1 − δ over the random draw of n training examples\n\nkl(erS(Q(μ, ŵ)) || erD(Q(μ, ŵ))) ≤ [(d/2) ln(1 + μ²/d) + ln((n+1)/δ)] / n    (11)\n\nholds simultaneously for all μ > 0 and all ŵ ∈ Rd with ∥ŵ∥ = 1. Here erS(Q(μ, ŵ)) = ES Φ̄(μγ(ŵ, x, y)) and γ(ŵ, x, y) = y<ŵ, x>/∥x∥ are the same as in Theorem 2.2.\nComparing Theorem 3.1 with Theorem 2.2, it is easy to see that the following Proposition holds.\nProposition 3.2. The bound (11) is sharper than (2) for any d < ∞, and the two bounds tend to be equivalent as d → ∞.\n\nTheorem 3.1 is the first dimensionality dependent margin bound that remains nontrivial in infinite dimension.\nTheorem 3.1 and Theorem 2.2 are uniform bounds for μ. Thus one can choose an appropriate μ to optimize each bound respectively. Note that erS(Q(μ, ŵ)) in the LHS of the two bounds is monotone decreasing with respect to μ. Compared to Theorem 2.2, Theorem 3.1 has the advantage that its RHS scales only as O(ln μ) rather than O(μ²), and therefore allows choosing a very large μ.\nAs described in (7) in Section 2, we can also obtain a margin bound for the deterministic linear classifier cŵ by combining (11) with erD(cŵ) ≤ 2 erD(Q(μ, ŵ)).\nIn addition, note that the VC dimension of homogeneous linear classifiers in Rd is d. From Theorem 3.1 we can almost recover the VC bound [23]\n\nerD(c) ≤ erS(c) + √( [d(1 + ln(2n/d)) + ln(4/δ)] / n )    (12)\n\nfor homogeneous linear classifiers in Rd under mild conditions. Formally we have the following Corollary.\nCorollary 3.3. Theorem 3.1 implies the following result. Suppose n > 5. 
For any (cid:14) > 2e\nwith probability 1 \u2212 (cid:14) over the random draw of n training examples\n\n\u2212 1\n8 ,\n\n\u2212 d\n\n8 n\n\n(12)\n\n(13)\n\n(14)\n\n\u221a\n\n2n\nd\n\n1 +\n\nd ln\n\n(\n\n(\n))\n(cid:12)(cid:12)(cid:12)(cid:12) \u2264 (ln n)1=2d3=2\n\nn\n\n4n2\n\n)\n\nerD(cw) \u2264 erS (cw) +\n\n((cid:12)(cid:12)(cid:12)(cid:12)y\n\nPD\n\n< w; x >\n\u2225w\u2225\u2225x\u2225\n\n\u221a\n\n+ 1\n\n2 ln 2(n+1)\n\n(cid:14)\n\n+\n\nd + ln n\n\nn\n\n\u221a\n\n\u2264 1\n4\n\nd + ln n\n\nn\n\n:\n\nholds simultaneously for all homogeneous linear classi\ufb01ers cw with w \u2208 Rd satisfying\n\nCondition (14) is easy to satisfy if d \u226a n.\nIn a sense, the dimensionality dependent margin bound in Theorem 3.1 uni\ufb01es the dimensionality\nindependent margin bound and the VC bound for linear classi\ufb01ers.\nAlthough it is not easy to theoretically quantify how much sharper (11) is than (2) and the VC bound\n(12) (because the \ufb01rst two bounds hold uniformly for all (cid:22)), in Section 4 we will demonstrate by\nexperiments that the new bound is usually signi\ufb01cantly better than (2) and (12) on relatively large\ndatasets.\n\n3.1\n\nImproving the Multiplicative Factor\n\nAs we mentioned in Section 2, Proposition 2.3 involves a multiplicative factor of 2 when bounding\nthe error of the deterministic voting classi\ufb01er by the error of the stochastic classi\ufb01er. Note that in\ngeneral erD(c^w) \u2264 2erD(Q((cid:22); ^w)) cannot be improved (consider the case that with probability one\nthe data has zero margin with respect to ^w). Here we study how to improve it for large margin\nclassi\ufb01ers.\nRecall that Proposition 2.4 gives erD(c^w) \u2264 erD;(cid:18)(Q((cid:22); ^w)) + (cid:8)((cid:18)), which bounds the gener-\nalization error of the linear classi\ufb01er in terms of the error of the stochastic classi\ufb01er with mar-\ngin (cid:18) \u2265 0. 
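The objects in Propositions 2.3 and 2.4 are easy to simulate directly: draw w ∼ N(μŵ, I), average the per-draw errors to estimate the stochastic classifier's error, and compare with the error of the deterministic classifier cŵ. A toy Monte Carlo sketch (synthetic data; all names are ours, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, mu = 5, 2000, 10.0
w_hat = np.zeros(d)
w_hat[0] = 1.0                                  # unit-norm direction w_hat

# Synthetic data whose labels are exactly sign(<w_hat, x>)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_hat)

def err(w):
    """Empirical 0-1 error of the linear classifier w on (X, y)."""
    return np.mean(np.sign(X @ w) != y)

# Stochastic classifier Q(mu, w_hat): average error over draws w ~ N(mu*w_hat, I)
draws = rng.normal(size=(200, d)) + mu * w_hat
stoch_err = np.mean([err(w) for w in draws])

det_err = err(w_hat)                            # deterministic classifier c_{w_hat}
print(det_err <= 2 * stoch_err + 1e-12)         # empirical analogue of Proposition 2.3
```

Here the deterministic error is zero by construction, so the factor-2 relation of Proposition 2.3 holds trivially; on noisy data the stochastic error is what the PAC-Bayes theorems actually bound.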
As pointed out in [2], this bound is not always better than Proposition 2.3 (i.e., erD(cŵ) ≤ 2 erD(Q(μ, ŵ))). The two bounds each have a different dominant regime. Our first result in this subsection is the following simple improvement over both Proposition 2.3 and Proposition 2.4.\nProposition 3.4. Using the notions in Proposition 2.4, we have that for all θ ≥ 0,\n\nerD(cŵ) ≤ erD,θ(Q(μ, ŵ)) / Φ(θ),    (15)\n\nwhere Φ(θ) is defined in Theorem 2.2.\n\nIt is easy to see that Proposition 2.3 is a special case of Proposition 3.4: letting θ = 0 in (15), we recover (6). Thus Proposition 3.4 is always sharper than Proposition 2.3. It is also easy to show that (15) is sharper than (8) in Proposition 2.4 whenever the bounds are nontrivial. Formally we have the following proposition.\nProposition 3.5. Suppose the RHS of (8) or the RHS of (15) is smaller than 1, i.e., at least one of the two bounds is nontrivial. Then (15) is sharper than (8).\n\nAs mentioned in Section 2, the margins discussed so far in this subsection are unnormalized with respect to w ∈ Rd. That is, we consider y<w, x>/∥x∥. In the following we will focus on normalized margins y<w, x>/(∥w∥∥x∥). It will soon be clear that this brings additional benefits when combined with the dimensionality dependent margin bound.\nLet erND,θ(Q(μ, ŵ)) = Ew∼N(μŵ,I) PD(y<w, x>/(∥w∥∥x∥) ≤ θ) be the true error of the stochastic classifier Q(μ, ŵ) with normalized margin θ ∈ [−1, 1]. Also let erNS,θ(Q(μ, ŵ)) be its empirical version. We have the following lemma.\nLemma 3.6. For any μ > 0, any ŵ ∈ Rd with ∥ŵ∥ = 1 and any θ ≥ 0,\n\nerD(cŵ) ≤ erND,θ(Q(μ, ŵ)) / Φ(μθ).    (16)\n\nIf erND,θ(Q) is only slightly larger than erD(Q) for a not-too-small θ > 0, then erND,θ(Q)/Φ(μθ) can be much smaller than 2 erD(Q) even with a not too large μ. Also note that setting θ = 0 in (16), we can recover (6).\nThe true margin error erND,θ(Q) can be bounded by its empirical version similarly to Theorem 3.1: for any θ ≥ 0 and any δ > 0, with probability 1 − δ,\n\nkl(erNS,θ(Q(μ, ŵ)) || erND,θ(Q(μ, ŵ))) ≤ [(d/2) ln(1 + μ²/d) + ln((n+1)/δ)] / n    (17)\n\nholds simultaneously for all μ > 0 and ŵ ∈ Rd with ∥ŵ∥ = 1.\nCombining the previous two results, we have a dimensionality dependent margin bound for the linear classifier cŵ.\nProposition 3.7. Let Q(μ, ŵ) be defined as before. For any θ ≥ 0 and any δ > 0, with probability 1 − δ over the random draw of n training examples\n\nkl(erNS,θ(Q(μ, ŵ)) || erD(cŵ) Φ(μθ)) ≤ [(d/2) ln(1 + μ²/d) + ln((n+1)/δ)] / n    (18)\n\nholds simultaneously for all μ > 0 and ŵ ∈ Rd with ∥ŵ∥ = 1.\nTo see how Proposition 3.7 improves the multiplicative factor, let us take a closer look at the bound (18). Observe that as μ gets large, erNS,θ(Q(μ, ŵ)) = Ew∼N(μŵ,I) PS(y<w, x>/(∥w∥∥x∥) ≤ θ) tends to the empirical error of the linear classifier ŵ with margin θ, i.e., PS(y<ŵ, x>/∥x∥ ≤ θ) (recall that ∥ŵ∥ = 1). Also, if μθ > 3 then Φ(μθ) ≈ 1. Taking into consideration that the RHS of (18) scales only as O(ln μ), we can choose a relatively large μ, and (18) gives a dimensionality dependent margin bound whose multiplicative factor can be very close to 1.\n\n4 Experiments\n\nIn this section we conduct a series of experiments on benchmark datasets. The goal is to see to what extent the Dimensionality Dependent margin bound (referred to as the DD-margin bound) is sharper than the Dimensionality Independent margin bound (referred to as the DI-margin bound) as well as the VC bound. More importantly, we want to see from the experiments how useful the DD-margin bound is for model selection.\n\nTable 1: Description of the datasets\n\nDataset        # Examples   # Features\nImage          2310         20\nMagic04        19020        10\nOptdigits      5620         64\nPendigits      10992        16\nBreastCancer   683          9\nPima           768          8\nLetter         20000        16\nMushroom       8124         22\nPageBlock      5473         10\nWaveform       3304         21\nGlass          214          9\nwdbc           569          30\n\nWe use 12 datasets, all from the UCI repository [24]. A description of the datasets is given in Table 1. For each dataset, we use 5-fold cross validation and average the results over 10 runs (for a total of 50 runs). If the dataset is a multiclass problem, we group the data into two classes since we study binary classification problems. In the data preprocessing stage each feature is normalized to [0, 1].\nTo compare the bounds and to do model selection, we use SVM with polynomial kernels K(x, x′) = (a<x, x′> + b)^t and let t vary4. For each t, we train a classifier by libsvm [25]. We plot the values of the three bounds — the DD-margin bound, the DI-margin bound, and the VC bound (12) — as well as the test and training error (see Figure 1 - Figure 12). 
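Bound curves of this kind are obtained by numerically inverting the Bernoulli KL: given the empirical term ê and the right-hand side B, the reported bound is the largest p with kl(ê || p) ≤ B, found by bisection. A sketch of this standard computation (helper names are ours), which also illustrates how slowly the dimensionality dependent complexity term (d/2) ln(1 + μ²/d) grows compared with the dimensionality independent term μ²/2:

```python
import math

def kl_bernoulli(a, b):
    """Bernoulli KL divergence kl(a||b) = a ln(a/b) + (1-a) ln((1-a)/(1-b))."""
    eps = 1e-12
    a = min(max(a, eps), 1 - eps)
    b = min(max(b, eps), 1 - eps)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def kl_inverse_upper(emp_err, rhs, iters=100):
    """Largest p >= emp_err with kl(emp_err || p) <= rhs, by bisection."""
    lo, hi = emp_err, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_bernoulli(emp_err, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# Illustrative values only (not from the paper's experiments)
n, delta, mu, d = 5000, 0.05, 50.0, 20
di_rhs = (mu**2 / 2 + math.log((n + 1) / delta)) / n                        # Theorem 2.2 RHS
dd_rhs = (d / 2 * math.log(1 + mu**2 / d) + math.log((n + 1) / delta)) / n  # Theorem 3.1 RHS
print(dd_rhs < di_rhs)  # True: (d/2) ln(1 + mu^2/d) << mu^2/2 for large mu
print(kl_inverse_upper(0.05, dd_rhs) <= kl_inverse_upper(0.05, di_rhs))  # True: sharper bound
```

Since the inverted bound is monotone in the right-hand side, the smaller dimensionality dependent complexity term translates directly into a smaller upper bound on the generalization error.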
For the two margin bounds, since they hold uniformly for μ > 0, we select the optimal μ to make the bounds as small as possible. For simplicity, we combine Proposition 2.3 with Theorem 3.1 and Theorem 2.2 respectively to obtain the final bound for the generalization error of the deterministic linear classifiers. In each figure, the horizontal axis represents the degree t of the polynomial kernel. All bounds in the figures (including training and test error) are for the deterministic (voting) classifier.\nTo analyze the experimental results, we group the 12 results into two categories as follows.\n\n1. Figure 1 - Figure 8. This category consists of eight datasets, each of which contains at least 2000 examples (relatively large datasets). On all these datasets, the DD-margin bounds are significantly sharper than the DI-margin bounds as well as the VC bounds. More importantly, the DD-margin bounds work well for model selection: we can use this bound to choose the degree of the polynomial kernel. On all the datasets except "Image", the curve of the DD-margin bound is highly correlated with the curve of the test error: when the test error decreases (or increases), the DD-margin bound also decreases (or increases); and when the test error remains unchanged as the degree t grows, the DD-margin bound selects the model with the lowest complexity.\n\n2. Figure 9 - Figure 12. This category consists of four small datasets, each containing fewer than 1000 examples. On these small datasets, the VC bounds often become trivial (larger than 1). The DD-margin bounds are still always, but less significantly, sharper than the DI-margin bounds. 
However, on these small datasets, it is difficult to tell whether the bounds select good models.\n\nIn sum, the experimental results demonstrate that the DD-margin bound is usually significantly sharper than the DI-margin bound as well as the VC bound if the dataset is relatively large. The DD-margin bound is also useful for model selection. However, for small datasets, all three bounds seem not useful for practical purposes.\n\n5 Conclusion\n\nIn this paper we study the problem of whether dimensionality independency is intrinsic to margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. This bound is sharper than a previously well-known dimensionality independent margin bound when the feature space is of finite dimension; and they tend to be equivalent as the dimensionality grows to infinity. Experimental results demonstrate that for relatively large datasets the new bound is often useful for model selection and significantly sharper than the previous margin bound as well as the VC bound.\n\n4For simplicity we fix a and b as constants in all the experiments.\n\nFigure 1: Image\n\nFigure 2: Letter\n\nFigure 3: Magic04\n\nFigure 4: Mushroom\n\nFigure 5: Optdigits\n\nFigure 6: PageBlocks\n\nFigure 7: Pendigits\n\nFigure 8: Waveform\n\nFigure 9: BreastCancer\n\nFigure 10: Glass\n\nFigure 11: Pima\n\nFigure 12: wdbc\n\nOur work is based on the PAC-Bayes theory. One limitation is that it involves a multiplicative factor of 2 when transforming stochastic classifiers to deterministic classifiers. Although we provide two improved bounds (Propositions 3.4, 3.7) over previous results (Propositions 2.3, 2.4), the multiplicative factor is still strictly larger than 1. 
A future work is to study whether there exist dimensionality dependent margin bounds (not necessarily PAC-Bayes) without this multiplicative factor.\n\nAcknowledgments\n\nThis work was supported by NSFC (61222307, 61075003) and a grant from Microsoft Research Asia. We also thank Chicheng Zhang for very helpful discussions.\n\nReferences\n\n[1] John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.\n\n[2] John Langford and John Shawe-Taylor. PAC-Bayes & Margins. In Advances in Neural Information Processing Systems, pages 423–430, 2002.\n\n[3] David A. McAllester. Simplified PAC-Bayesian margin bounds. Learning Theory and Kernel Machines, 2777:203–215, 2003.\n\n[4] John Langford. 
Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273–306, 2005.\n\n[5] Matthias Seeger. PAC-Bayesian generalization error bounds for Gaussian process classification. Journal of Machine Learning Research, 3:233–269, 2002.\n\n[6] David A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.\n\n[7] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651–1686, 1998.\n\n[8] Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30:1–50, 2002.\n\n[9] Vladimir Koltchinskii and Dmitry Panchenko. Complexities of convex combinations and bounding the generalization error in classification. Annals of Statistics, 33:1455–1496, 2005.\n\n[10] John Langford, Matthias Seeger, and Nimrod Megiddo. An improved predictive accuracy bound for averaging classifiers. In International Conference on Machine Learning, pages 290–297, 2001.\n\n[11] Sanjoy Dasgupta and Philip M. Long. Boosting with diverse base classifiers. In Annual Conference on Learning Theory, pages 273–287, 2003.\n\n[12] Leo Breiman. Prediction games and arcing algorithms. Neural Computation, 11:1493–1518, 1999.\n\n[13] Liwei Wang, Masashi Sugiyama, Zhaoxiang Jing, Cheng Yang, Zhi-Hua Zhou, and Jufu Feng. A refined margin analysis for boosting algorithms via equilibrium margin. Journal of Machine Learning Research, 12:1835–1863, 2011.\n\n[14] Olivier Catoni. PAC-Bayesian supervised classification: The thermodynamics of statistical learning. IMS Lecture Notes–Monograph Series, 56, 2007.\n\n[15] Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand. 
PAC-Bayesian learning of linear classifiers. In International Conference on Machine Learning, page 45, 2009.\n\n[16] Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand, and Sara Shanian. From PAC-Bayes bounds to KL regularization. In Advances in Neural Information Processing Systems, pages 603–610, 2009.\n\n[17] Jean-Francis Roy, François Laviolette, and Mario Marchand. From PAC-Bayes bounds to quadratic programs for majority votes. In International Conference on Machine Learning, pages 649–656, 2011.\n\n[18] Amiran Ambroladze, Emilio Parrado-Hernández, and John Shawe-Taylor. Tighter PAC-Bayes bounds. In Advances in Neural Information Processing Systems, pages 9–16, 2006.\n\n[19] John Shawe-Taylor, Emilio Parrado-Hernández, and Amiran Ambroladze. Data dependent priors in PAC-Bayes bounds. In International Conference on Computational Statistics, pages 231–240, 2010.\n\n[20] Peter L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.\n\n[21] Ralf Herbrich and Thore Graepel. A PAC-Bayesian margin bound for linear classifiers. IEEE Transactions on Information Theory, 48(12):3140–3150, 2002.\n\n[22] Alexandre Lacasse, François Laviolette, Mario Marchand, Pascal Germain, and Nicolas Usunier. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In Advances in Neural Information Processing Systems, pages 769–776, 2006.\n\n[23] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.\n\n[24] Andrew Frank and Arthur Asuncion. UCI machine learning repository, 2010.\n\n[25] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. 
ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.", "award": [], "sourceid": 497, "authors": [{"given_name": "Chi", "family_name": "Jin", "institution": null}, {"given_name": "Liwei", "family_name": "Wang", "institution": null}]}