{"title": "Risk Bounds for Randomized Sample Compressed Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 1449, "page_last": 1456, "abstract": "We derive risk bounds for the randomized classifiers in Sample Compression settings where the classifier-specification utilizes two sources of information, viz. the compression set and the message string. By extending the recently proposed Occam\u2019s Hammer principle to the data-dependent settings, we derive point-wise versions of the bounds on the stochastic sample compressed classifiers and also recover the corresponding classical PAC-Bayes bound. We further show how these compare favorably to the existing results.", "full_text": "Risk Bounds for Randomized Sample Compressed Classifiers\n\nMohak Shah\n\nCentre for Intelligent Machines\n\nMcGill University\n\nMontreal, QC, Canada, H3A 2A7\n\nmohak@cim.mcgill.ca\n\nAbstract\n\nWe derive risk bounds for the randomized classifiers in the Sample Compression setting\nwhere the classifier-specification utilizes two sources of information, viz. the\ncompression set and the message string. By extending the recently proposed Occam\u2019s\nHammer principle to the data-dependent settings, we derive point-wise versions\nof the bounds on the stochastic sample compressed classifiers and also recover\nthe corresponding classical PAC-Bayes bound. We further show how these\ncompare favorably to the existing results.\n\n1 Introduction\nThe Sample compression framework [Littlestone and Warmuth, 1986, Floyd and Warmuth, 1995]\nhas resulted in an important class of learning algorithms known as sample compression algorithms.\nThese algorithms have been shown to be competitive with the state-of-the-art algorithms such as\nthe SVM in practice [Marchand and Shawe-Taylor, 2002, Laviolette et al., 2005]. 
Moreover, the\napproach has also resulted in practically realizable bounds and has shown significant promise in using\nthese bounds in model selection.\n\nOn another learning theoretic front, the PAC-Bayes approach [McAllester, 1999] has shown that\nstochastic classifier selection can prove to be more powerful than outputting a deterministic classifier.\nWith regard to the sample compression settings, this was further confirmed in the case of the sample\ncompressed Gibbs classifier by Laviolette and Marchand [2007]. However, the specific classifier\noutput by the algorithm (according to a selected posterior) is generally of immediate interest since\nthis is the classifier whose future performance is of relevance in practice. Diluting such guarantees\ninto the expectation of the risk over the posterior on the classifier space, although it gives\ntighter risk bounds, results in averaged statements about the expected true error.\n\nA significant result in obtaining such guarantees for the specific randomized classifier has appeared\nin the form of Occam\u2019s Hammer [Blanchard and Fleuret, 2007]. It deals with bounding the performance\nof algorithms that result in a set output when given training data. With respect to classifiers,\nthis results in a bound on the true risk of the randomized classifier output by the algorithm in accordance\nwith a posterior over the classifier space learned from training data. 
Blanchard and Fleuret\n[2007] also present a PAC-Bayes bound for the data-independent settings (when the classifier space\nis defined independently of the training data).\n\nMotivated by this result, we derive risk bounds for the randomized sample compressed classifiers.\nNote that the classifier space in the case of sample compression settings, unlike other settings, is\ndata-dependent in the sense that it is defined upon the specification of training data.[1] The rest of\nthe paper is organized as follows: Section 2 provides a background on the sample compressed\nclassifiers and establishes the context; Section 3 then states the Occam\u2019s Hammer for the\ndata-independent settings. We then derive bounds for the randomized sample compressed classifier in\nSection 4 followed by showing how we can recover bounds for the sample compressed Gibbs case\n(classical PAC-Bayes for sample compressed classifiers) in Section 5. We conclude in Section 6.\n\n[1] Note that the classifier space depends on the amount of the training data, as we see further, and not on\nthe training data themselves. Hence, a data-independent prior over the classifier space can still be obtained in\nthis setting, e.g., in the PAC-Bayes case, owing to the independence of the classifier space definition from the\ncontent of the training data.\n\n2 Sample Compressed (SC) Classifiers\nWe consider binary classification problems where the input space X consists of an arbitrary subset\nof R^n and the output space Y = {\u22121, +1}. An example z def= (x, y) is an input-output pair where\nx \u2208 X and y \u2208 Y. Sample Compression learning algorithms are characterized as follows:\nGiven a training set S = {z1, . . . 
, zm} of m examples, the classifier A(S) returned by algorithm\nA is described entirely by two complementary sources of information: a subset $z_{\\mathbf{i}}$ of S, called the\ncompression set, and a message string \u03c3 which represents the additional information needed to\nobtain a classifier from the compression set $z_{\\mathbf{i}}$. Given a training set S, the compression set $z_{\\mathbf{i}}$ is\ndefined by a vector of indices $\\mathbf{i} \\overset{\\text{def}}{=} (i_1, i_2, \\ldots, i_{|\\mathbf{i}|})$ with $i_j \\in \\{1, \\ldots, m\\}$ for all $j$ and $i_1 < i_2 < \\ldots < i_{|\\mathbf{i}|}$,\nwhere $|\\mathbf{i}|$ denotes the number of indices present in $\\mathbf{i}$. Hence, $z_i$ denotes the ith example of S\nwhereas $z_{\\mathbf{i}}$ denotes the subset of examples of S that are pointed to by the vector of indices $\\mathbf{i}$ defined\nabove. We will use $\\bar{\\mathbf{i}}$ to denote the set of indices not present in $\\mathbf{i}$. Hence, we have $S = z_{\\mathbf{i}} \\cup z_{\\bar{\\mathbf{i}}}$ for\nany vector $\\mathbf{i} \\in I$, where I denotes the set of the $2^m$ possible realizations of $\\mathbf{i}$.\nFinally, a learning algorithm is a sample compression learning algorithm (that is identified solely\nby a compression set $z_{\\mathbf{i}}$ and a message string \u03c3) iff there exists a Reconstruction Function $R : (X \\times Y)^{|\\mathbf{i}|} \\times K \\longrightarrow H$\nassociated with A. Here, H is the (data-dependent) classifier space and\n$K \\subset I \\times M$ s.t. $M = \\bigcup_{\\mathbf{i} \\in I} M(\\mathbf{i})$. That is, R outputs a classifier $R(\\sigma, z_{\\mathbf{i}})$ when given an arbitrary\ncompression set $z_{\\mathbf{i}} \\subseteq S$ and message string \u03c3 chosen from the set $M(z_{\\mathbf{i}})$ of all distinct messages\nthat can be supplied to R with the compression set $z_{\\mathbf{i}}$.\nWe seek a tight risk bound for arbitrary reconstruction functions that holds uniformly for all compression\nsets and message strings. For this, we adopt the PAC setting where each example z is drawn\naccording to a fixed, but unknown, probability distribution D on X \u00d7 Y. 
The true risk R(f) of any\nclassifier f is defined as the probability that it misclassifies an example drawn according to D:\n\n$$R(f) \\overset{\\text{def}}{=} \\Pr_{(x,y) \\sim D}\\left(f(x) \\neq y\\right) = \\mathbf{E}_{(x,y) \\sim D}\\, I(f(x) \\neq y)$$\n\nwhere I(a) = 1 if predicate a is true and 0 otherwise. Given a training set S = {z1, . . . , zm} of m\nexamples, the empirical risk $R_S(f)$ on S, of any classifier f, is defined according to:\n\n$$R_S(f) \\overset{\\text{def}}{=} \\frac{1}{m} \\sum_{i=1}^{m} I(f(x_i) \\neq y_i) \\overset{\\text{def}}{=} \\mathbf{E}_{(x,y) \\sim S}\\, I(f(x) \\neq y)$$\n\nLet $Z^m$ denote the collection of m random variables whose instantiation gives a training sample\n$S = z^m = \\{z_1, \\ldots, z_m\\}$. To obtain the tightest possible risk bound, we will fully exploit the\nfact that the distribution of classification errors is binomial. We now discuss the generic Occam\u2019s\nHammer principle (w.r.t. the classification scenario) and then go on to show how it can be applied\nto the sample compression setting.\n\n3 Occam\u2019s Hammer for data independent setting\nIn this section, we briefly detail the Occam\u2019s hammer [Blanchard and Fleuret, 2007] for the\ndata-independent setting. For the sake of simplicity, we retain the key notations of Blanchard and Fleuret\n[2007]. Occam\u2019s hammer works by bounding the probability of a bad event defined as follows. For\nevery classifier h \u2208 H, and a confidence parameter \u03b4 \u2208 [0, 1], the bad event B(h, \u03b4) is defined as\nthe region where the desired property on the classifier h does not hold, with probability at most \u03b4. That is,\n$\\Pr_{S \\sim D^m}[S \\in B(h, \\delta)] \\le \\delta$. Further, it assumes that this region is nondecreasing in \u03b4. Intuitively,\nthis means that with decreasing \u03b4 the bad region shrinks: the bound on the true error of the classifier h\nholds with higher confidence but becomes looser.\n\nWith the above assumption satisfied, let P be a non-negative reference measure on the classifier\nspace H known as the volumic measure. 
Let \u03a0 be a probability distribution on H absolutely continuous\nw.r.t. P such that $\\pi = \\frac{d\\Pi}{dP}$. Let \u0393 be a probability distribution on (0, +\u221e) (the inverse density\nprior). Then Occam\u2019s Hammer [Blanchard and Fleuret, 2007] states that:\n\nTheorem 1 [Blanchard and Fleuret, 2007] Given the above assumption and P, \u03a0, \u0393 defined as\nabove, define the level function\n\n$$\\Delta(h, u) = \\min(\\delta\\, \\pi(h)\\, \\beta(u),\\ 1),$$\n\nwhere $\\beta(x) = \\int_0^x u\\, d\\Gamma(u)$ for x \u2208 (0, +\u221e). Then for any algorithm $S \\mapsto \\theta_S$ returning a probability\ndensity $\\theta_S$ over H with respect to P, and such that $(S, h) \\mapsto \\theta_S(h)$ is jointly measurable in its two\nvariables, it holds that\n\n$$\\Pr_{S \\sim D^m,\\, h \\sim Q}\\left[S \\in B\\left(h, \\Delta(h, \\theta_S(h)^{-1})\\right)\\right] \\le \\delta,$$\n\nwhere Q is the distribution on H such that $\\frac{dQ}{dP} = \\theta_S$.\n\nNote above that Q is the (data-dependent) posterior distribution on H after observing the data sample\nS while P is the data-independent prior on H. The subscript S in $\\theta_S$ denotes this. Moreover, the\ndistribution \u03a0 on the space of classifiers may or may not be data-dependent. As we will see later, in\nthe case of sample compression learning settings we will consider priors over the space of classifiers\nwithout reference to the data (as in the PAC-Bayes case). To this end, we can either opt for a prior \u03a0\nindependent of the data or make it the same as the volume measure P which establishes a distribution\non the classifier space without reference to the data.\n\n4 Bounds for Randomized SC Classifiers\nWe work in the sample compression settings and, as mentioned before, each classifier in this setting\nis denoted in terms of a compression set and a message string. A reconstruction function then\nuses these two information sources to reconstruct the classifier. This essentially means that we deal\nwith a data-dependent hypothesis space. 
This is in contrast with other notions of hypothesis class\ncomplexity measures such as VC dimension. The hypothesis space is defined, in our case, based on\nthe size of the data sample (and not the actual contents of the sample). Hence, we consider the priors\nbuilt on the size of the possible compression sets and associated message strings. More precisely, we\nconsider a prior distribution P with probability density $P(z_{\\mathbf{i}}, \\sigma)$ that is factorizable into a compression\nset dependent component and a message string component (conditioned on a given compression set)\nsuch that:\n\n$$P(z_{\\mathbf{i}}, \\sigma) = P_I(\\mathbf{i})\\, P_{M(z_{\\mathbf{i}})}(\\sigma) \\qquad (1)$$\n\nwith $P_I(\\mathbf{i}) = \\frac{1}{\\binom{m}{|\\mathbf{i}|}}\\, p(|\\mathbf{i}|)$ such that $\\sum_{d=0}^{m} p(d) = 1$. The above choice of the form for $P_I(\\mathbf{i})$ is\nappropriate since we do not have any a priori information to distinguish one compression set from\nanother. However, as we will see later, we should choose p(d) such that we give more weight to smaller\ncompression sets.\nLet $P_K$ be the set of all distributions P on K satisfying Equation 1. Then, we are interested\nin algorithms that output a posterior $Q \\in P_K$ over the space of classifiers with probability density\n$Q(z_{\\mathbf{i}}, \\sigma)$ factorizable as $Q_I(\\mathbf{i})\\, Q_{M(z_{\\mathbf{i}})}(\\sigma)$. A sample compressed classifier is then defined by\nchoosing a classifier $(z_{\\mathbf{i}}, \\sigma)$ according to the posterior $Q(z_{\\mathbf{i}}, \\sigma)$. This is basically the Gibbs classifier\ndefined in the PAC-Bayes settings where the idea is to bound the true risk of this Gibbs classifier\ndefined as $R(G_Q) = \\mathbf{E}_{(z_{\\mathbf{i}}, \\sigma) \\sim Q}\\, R((z_{\\mathbf{i}}, \\sigma))$. On the other hand, we are interested in bounding the true\nrisk of the specific classifier $(z_{\\mathbf{i}}, \\sigma)$ output according to Q. 
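As shown in [Laviolette and Marchand,\n2007], a rescaled posterior $\\bar{Q}$ of the following form can provide tighter guarantees while maintaining\nthe Occam\u2019s principle of parsimony.\n\nDefinition 2 Given a distribution $Q \\in P_K$, we denote by $\\bar{Q}$ the distribution:\n\n$$\\bar{Q}(z_{\\mathbf{i}}, \\sigma) \\overset{\\text{def}}{=} \\frac{Q(z_{\\mathbf{i}}, \\sigma)}{|\\bar{\\mathbf{i}}|\\, \\mathbf{E}_{(z_{\\mathbf{i}}, \\sigma) \\sim Q} \\frac{1}{|\\bar{\\mathbf{i}}|}} = \\frac{Q_I(\\mathbf{i})\\, Q_{M(z_{\\mathbf{i}})}(\\sigma)}{|\\bar{\\mathbf{i}}|\\, \\mathbf{E}_{(z_{\\mathbf{i}}, \\sigma) \\sim Q} \\frac{1}{|\\bar{\\mathbf{i}}|}} = \\bar{Q}_I(\\mathbf{i})\\, Q_{M(z_{\\mathbf{i}})}(\\sigma) \\qquad \\forall (z_{\\mathbf{i}}, \\sigma) \\in K$$\n\nHence, note that the posterior is effectively rescaled for the compression set part. Drawing a\nclassifier $(z_{\\mathbf{i}}, \\sigma) \\sim \\bar{Q}$ is therefore equivalent to drawing $\\mathbf{i} \\sim \\bar{Q}_I$ and $\\sigma \\sim Q_{M(z_{\\mathbf{i}})}$. Further, if we denote by $d_{\\bar{Q}}$ the expected\nvalue of the compression set size over the choice of parameters according to the scaled posterior,\n$d_{\\bar{Q}} \\overset{\\text{def}}{=} \\mathbf{E}_{\\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}} |\\mathbf{i}|$, then,\n\n$$\\mathbf{E}_{(z_{\\mathbf{i}}, \\sigma) \\sim Q} \\frac{1}{|\\bar{\\mathbf{i}}|} = \\frac{1}{\\mathbf{E}_{\\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}} |\\bar{\\mathbf{i}}|} = \\frac{1}{m - d_{\\bar{Q}}}$$\n\nNow, we proceed to derive the bounds for the randomized sample compressed classifiers starting\nwith a PAC-Bayes bound.\n\n4.1 A PAC-Bayes Bound for randomized SC classifier\nWe exploit the fact that the distribution of the errors is binomial and define the following error\nquantities (for a given $\\mathbf{i}$, and hence $z_{\\mathbf{i}}$, over $z_{\\bar{\\mathbf{i}}}$):\n\nDefinition 3 Let $S \\sim D^m$ with D a distribution on X \u00d7 Y, and $(z_{\\mathbf{i}}, \\sigma) \\in K$. We denote by\n$\\mathrm{Bin}_S(\\mathbf{i}, \\sigma)$ the probability that the classifier $R(z_{\\mathbf{i}}, \\sigma)$, of (true) risk $R(z_{\\mathbf{i}}, \\sigma)$, makes $|\\bar{\\mathbf{i}}|\\, R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma)$ or\nfewer errors on $z'_{\\bar{\\mathbf{i}}} \\sim D^{|\\bar{\\mathbf{i}}|}$. 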
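That is,\n\n$$\\mathrm{Bin}_S(\\mathbf{i}, \\sigma) = \\sum_{\\lambda=0}^{|\\bar{\\mathbf{i}}| R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma)} \\binom{|\\bar{\\mathbf{i}}|}{\\lambda} \\left(R(z_{\\mathbf{i}}, \\sigma)\\right)^{\\lambda} \\left(1 - R(z_{\\mathbf{i}}, \\sigma)\\right)^{|\\bar{\\mathbf{i}}| - \\lambda}$$\n\nand by $B_S(\\mathbf{i}, \\sigma)$ the probability that this classifier makes exactly $|\\bar{\\mathbf{i}}|\\, R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma)$ errors on $z'_{\\bar{\\mathbf{i}}} \\sim D^{|\\bar{\\mathbf{i}}|}$.\nThat is,\n\n$$B_S(\\mathbf{i}, \\sigma) = \\binom{|\\bar{\\mathbf{i}}|}{|\\bar{\\mathbf{i}}| R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma)} \\left(R(z_{\\mathbf{i}}, \\sigma)\\right)^{|\\bar{\\mathbf{i}}| R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma)} \\left(1 - R(z_{\\mathbf{i}}, \\sigma)\\right)^{|\\bar{\\mathbf{i}}| - |\\bar{\\mathbf{i}}| R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma)}$$\n\nNow, approximating the binomial by the relative entropy Chernoff bound [Langford, 2005], we have,\nfor a classifier f:\n\n$$\\sum_{j=0}^{m R_S(f)} \\binom{m}{j} (R(f))^{j} (1 - R(f))^{m-j} \\le \\exp\\left(-m \\cdot \\mathrm{kl}(R_S(f) \\,\\|\\, R(f))\\right)$$\n\nfor all $R_S(f) \\le R(f)$, where $\\mathrm{kl}(q \\,\\|\\, p) = q \\ln\\frac{q}{p} + (1-q) \\ln\\frac{1-q}{1-p}$ is the Kullback-Leibler divergence between two\nBernoulli distributions with success probabilities q and p.\n\nAs also shown in [Laviolette and Marchand, 2007], since $\\binom{m}{j} = \\binom{m}{m-j}$ and $\\mathrm{kl}(R_S(f) \\,\\|\\, R(f)) = \\mathrm{kl}(1 - R_S(f) \\,\\|\\, 1 - R(f))$,\nthe above inequality holds true for each term inside the sum on the\nleft hand side. Consequently, in the case of a sample compressed classifier, $\\forall (z_{\\mathbf{i}}, \\sigma) \\in K$ and $\\forall S \\in (X \\times Y)^m$:\n\n$$B_S(\\mathbf{i}, \\sigma) \\le \\exp\\left[-|\\bar{\\mathbf{i}}| \\cdot \\mathrm{kl}(R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma) \\,\\|\\, R(z_{\\mathbf{i}}, \\sigma))\\right] \\qquad (2)$$\n\nBounding this by \u03b4 yields:\n\n$$\\Pr_{S \\sim D^m}\\left(\\mathrm{kl}(R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma) \\,\\|\\, R(z_{\\mathbf{i}}, \\sigma)) \\le \\frac{\\ln \\frac{1}{\\delta}}{|\\bar{\\mathbf{i}}|}\\right) \\ge 1 - \\delta \\qquad (3)$$\n\nNow, consider the complement of the event inside the probability in Equation 3 as the bad event over classifiers defined\nby a compression set $z_{\\mathbf{i}}$ and an associated message string \u03c3. Let $\\psi_{z^m}(\\mathbf{i}, \\sigma)$ be the posterior probability\ndensity of the rescaled data-dependent posterior distribution $\\bar{Q}$ over the classifier space with respect\nto the volume measure P. 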
We can now replace \u03b4 for this bad event by the level function \u2206 of Occam\u2019s\nhammer, for which:\n\n$$\\ln\\left(\\min\\left(\\delta\\, \\pi(h_S)\\, \\beta(\\psi_{z^m}(\\mathbf{i}, \\sigma)^{-1}),\\ 1\\right)^{-1}\\right) = \\ln_+\\left(\\frac{1}{\\delta \\cdot \\pi(h)} \\cdot \\frac{1}{\\min\\left((k+1)^{-1}\\, \\psi_{z^m}(\\mathbf{i}, \\sigma)^{-\\frac{k+1}{k}},\\ 1\\right)}\\right)$$\n\n$$= \\ln_+\\left(\\frac{1}{\\delta \\cdot \\pi(h)} \\cdot \\max\\left((k+1)\\, \\psi_{z^m}(\\mathbf{i}, \\sigma)^{\\frac{k+1}{k}},\\ 1\\right)\\right)$$\n\n$$\\le \\ln_+\\left(\\frac{1}{\\delta \\cdot \\pi(h)} \\cdot (k+1) \\max\\left(\\psi_{z^m}(\\mathbf{i}, \\sigma)^{\\frac{k+1}{k}},\\ 1\\right)\\right)$$\n\n$$\\le \\ln\\left(\\frac{1}{\\delta \\cdot \\pi(h)} \\cdot (k+1)\\right) + \\ln_+\\left(\\psi_{z^m}(\\mathbf{i}, \\sigma)^{\\frac{k+1}{k}}\\right)$$\n\nwhere $\\ln_+$ denotes max(0, ln), the positive part of the logarithm.\n\nHowever, note that we are interested in data-independent priors over the space of classifiers[2], and\nhence, we consider our prior \u03a0 to be the same as the volume measure P over the classifier space\nyielding \u03c0 as unity. That is, our prior gives a distribution over the classifier space without any\nregard to the data. 
Substituting for $\\psi_{z^m}(\\mathbf{i}, \\sigma)$ (the fraction of respective densities; Radon-Nikodym\nderivative)[3], we obtain the following result:\n\nTheorem 4 For any reconstruction function $R : D^m \\times K \\longrightarrow H$ and for any prior distribution\nP over compression sets and message strings, if the sample compression algorithm A(S) returns a\nposterior distribution Q, then, for \u03b4 \u2208 (0, 1] and k > 0, we have:\n\n$$\\Pr_{S \\sim D^m,\\, \\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}}\\left[\\mathrm{kl}(R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma) \\,\\|\\, R(z_{\\mathbf{i}}, \\sigma)) \\le \\frac{1}{m - |\\mathbf{i}|}\\left[\\ln\\left(\\frac{k+1}{\\delta}\\right) + \\left(1 + \\frac{1}{k}\\right) \\ln_+\\left(\\frac{\\bar{Q}(z_{\\mathbf{i}}, \\sigma)}{P(z_{\\mathbf{i}}, \\sigma)}\\right)\\right]\\right] \\ge 1 - \\delta$$\n\nwhere $R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma)$ is the empirical risk of the classifier reconstructed from $(z_{\\mathbf{i}}, \\sigma)$ on the training\nexamples not in the compression set and $R(z_{\\mathbf{i}}, \\sigma)$ is the corresponding true risk.\n\nNote that the bound carries the factor $\\frac{1}{m - |\\mathbf{i}|}$ instead of the factor $\\frac{1}{m - d_{\\bar{Q}}}$ found in the bound\nof Laviolette and Marchand [2007]. This is because the PAC-Bayes bound of Laviolette and Marchand\n[2007] computes the expectation over the kl-divergence of the empirical and true risk of the\nclassifiers chosen according to $\\bar{Q}$. This, as a result of rescaling of Q in preference of smaller compression\nsets, is reflected in the bound. On the other hand, the bound of Theorem 4 is a point-wise\nversion bounding the true error of the specific classifier chosen according to $\\bar{Q}$ and hence concerns\nthe specific compression set utilized by this classifier.\n\n4.2 A Binomial Tail Inversion Bound for randomized SC classifier\nA tighter condition can be imposed on the true risk of the classifier by considering the binomial tail\ninversion over the distribution of errors. 
The binomial tail inversion $\\overline{\\mathrm{Bin}}\\left(\\frac{k}{m}, \\delta\\right)$ is defined as the\nlargest risk value that a classifier can have while still having a probability of at least \u03b4 of observing\nat most k errors out of m examples:\n\n$$\\overline{\\mathrm{Bin}}\\left(\\frac{k}{m}, \\delta\\right) \\overset{\\text{def}}{=} \\sup\\left\\{r : \\mathrm{Bin}\\left(\\frac{k}{m}, r\\right) \\ge \\delta\\right\\}$$\n\nwhere\n\n$$\\mathrm{Bin}\\left(\\frac{k}{m}, r\\right) \\overset{\\text{def}}{=} \\sum_{j=0}^{k} \\binom{m}{j} r^{j} (1 - r)^{m-j}$$\n\nFrom this definition, it follows that $\\overline{\\mathrm{Bin}}(R_S(f), \\delta)$ is the smallest upper bound, which holds with\nprobability at least 1 \u2212 \u03b4, on the true risk of any classifier f with an observed empirical risk $R_S(f)$\non a test set of m examples (test set bound):\n\n$$P_{Z^m}\\left\\{R(f) \\le \\overline{\\mathrm{Bin}}\\left(R_{Z^m}(f), \\delta\\right)\\right\\} \\ge 1 - \\delta \\quad \\forall f \\qquad (4)$$\n\nThis bound can be converted to a training set bound in a standard manner by considering a measure\nover the classifier space (see for instance [Langford, 2005, Theorem 4.1]). Moreover, in the sample\ncompression case, we are interested in the empirical risk of the classifier on the examples not in the\ncompression set (consistent compression set assumption). Now, let $\\delta_r$ be a \u03b4-weighted measure on\nthe classifier space, i.e., on $\\mathbf{i}$ and \u03c3. Then, for the compression sets and associated message strings,\nconsider the following bad event with empirical risk of the classifier measured as $\\mathrm{Bin}_S(\\mathbf{i}, \\sigma)$ for\n$\\mathbf{i} \\sim \\bar{Q}_I$, $\\sigma \\sim Q_{M(z_{\\mathbf{i}})}$:\n\n$$B(h, \\delta) = \\left\\{R(z_{\\mathbf{i}}, \\sigma) > \\overline{\\mathrm{Bin}}\\left(R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma), \\delta_r\\right)\\right\\}$$\n\nNow, we replace $\\delta_r$ with the level function of Occam\u2019s hammer (with the same assumption of \u03a0 = P, \u03c0 = 1):\n\n$$\\min\\left(\\delta\\, \\pi(h_S)\\, \\beta(\\psi_{z^m}(\\mathbf{i}, \\sigma)^{-1}),\\ 1\\right) \\le \\delta \\cdot \\min\\left((k+1)^{-1}\\, \\psi_{z^m}(\\mathbf{i}, \\sigma)^{-\\frac{k+1}{k}},\\ 1\\right)$$\n\n$$\\le \\delta \\cdot \\frac{1}{\\max\\left((k+1)\\, \\psi_{z^m}(\\mathbf{i}, \\sigma)^{\\frac{k+1}{k}},\\ 1\\right)}$$\n\n$$\\le \\delta \\cdot \\frac{1}{(k+1) \\max\\left(\\psi_{z^m}(\\mathbf{i}, \\sigma)^{\\frac{k+1}{k}},\\ 1\\right)}$$\n\n$$\\le \\frac{\\delta}{(k+1)\\, \\psi_{z^m}(\\mathbf{i}, \\sigma)^{\\frac{k+1}{k}}}$$\n\nHence, we have proved the following:\n\nTheorem 5 For any reconstruction function $R : D^m \\times K \\longrightarrow H$ and for any prior distribution P\nover compression sets and message strings, if the sample compression algorithm A(S) returns a\nposterior distribution Q, then, for \u03b4 \u2208 (0, 1] and k > 0, we have:\n\n$$\\Pr_{S \\sim D^m,\\, \\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}}\\left[R(z_{\\mathbf{i}}, \\sigma) \\le \\overline{\\mathrm{Bin}}\\left(R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma),\\ \\frac{\\delta}{(k+1) \\left(\\frac{\\bar{Q}(z_{\\mathbf{i}}, \\sigma)}{P(z_{\\mathbf{i}}, \\sigma)}\\right)^{\\frac{k+1}{k}}}\\right)\\right] \\ge 1 - \\delta$$\n\nWe can obtain a looser bound by approximating the binomial tail inversion bound using [Laviolette\net al., 2005, Lemma 1]:\n\nCorollary 6 Given all our previous definitions, the following holds with probability 1 \u2212 \u03b4 over the\njoint draw of $S \\sim D^m$ and $\\mathbf{i} \\sim \\bar{Q}_I$, $\\sigma \\sim Q_{M(z_{\\mathbf{i}})}$:\n\n$$R(z_{\\mathbf{i}}, \\sigma) \\le 1 - \\exp\\left(\\frac{-1}{m - |\\mathbf{i}| - |\\bar{\\mathbf{i}}| R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma)}\\left[\\ln\\binom{m - |\\mathbf{i}|}{|\\bar{\\mathbf{i}}| R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma)} + \\ln\\left(\\frac{k+1}{\\delta}\\right) + \\left(1 + \\frac{1}{k}\\right) \\ln\\left(\\frac{\\bar{Q}(z_{\\mathbf{i}}, \\sigma)}{P(z_{\\mathbf{i}}, \\sigma)}\\right)\\right]\\right)$$\n\n[2] Hence, the missing S in the subscript of \u03c0(h) in the r.h.s. above.\n\n[3] Alternatively, let $P(z_{\\mathbf{i}}, \\sigma)$ and $\\bar{Q}(z_{\\mathbf{i}}, \\sigma)$ denote the probability densities of the prior distribution P and\nrescaled posterior distribution $\\bar{Q}$ over classifiers such that $d\\bar{Q} = \\bar{Q}(z_{\\mathbf{i}}, \\sigma)\\, d\\mu$ and $dP = P(z_{\\mathbf{i}}, \\sigma)\\, d\\mu$ w.r.t.\nsome measure \u00b5. This too yields $\\frac{d\\bar{Q}}{dP} = \\frac{\\bar{Q}(z_{\\mathbf{i}}, \\sigma)}{P(z_{\\mathbf{i}}, \\sigma)}$. Note that the final expression is independent of the underlying\nmeasure \u00b5.\n\n5 Recovering the PAC-Bayes 
bound for SC Gibbs Classifier\nLet us now see how a bound can be obtained for the Gibbs setting. We follow the general line of\nargument of Blanchard and Fleuret [2007] to recover the PAC-Bayes bound for the Sample Compressed\nGibbs classifier. However, note that we do this for the data-dependent setting here and also\nutilize the rescaled posterior over the space of sample compressed classifiers.\n\nThe PAC-Bayes bound of Theorem 4 basically states that\n\n$$\\mathbf{E}_{S \\sim D^m}\\left[\\Pr_{\\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}}\\left[\\mathrm{kl}(R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma) \\,\\|\\, R(z_{\\mathbf{i}}, \\sigma)) > \\varphi(\\delta)\\right]\\right] \\le \\delta$$\n\nwhere\n\n$$\\varphi(\\delta) = \\frac{1}{m - |\\mathbf{i}|}\\left[\\ln\\left(\\frac{k+1}{\\delta}\\right) + \\left(1 + \\frac{1}{k}\\right) \\ln_+\\left(\\frac{\\bar{Q}(z_{\\mathbf{i}}, \\sigma)}{P(z_{\\mathbf{i}}, \\sigma)}\\right)\\right]$$\n\nConsequently, for any \u03b3 \u2208 (0, 1],\n\n$$\\mathbf{E}_{S \\sim D^m}\\left[\\Pr_{\\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}}\\left[\\mathrm{kl}(R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma) \\,\\|\\, R(z_{\\mathbf{i}}, \\sigma)) > \\varphi(\\delta\\gamma)\\right]\\right] \\le \\delta\\gamma$$\n\nNow, bounding the argument of the expectation above using the Markov inequality, we get:\n\n$$\\Pr_{S \\sim D^m}\\left[\\Pr_{\\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}}\\left[\\mathrm{kl}(R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma) \\,\\|\\, R(z_{\\mathbf{i}}, \\sigma)) > \\varphi(\\delta\\gamma)\\right] > \\gamma\\right] \\le \\delta$$\n\nNow, discretizing the argument over $(\\delta_i, \\gamma_i) = (\\delta 2^{-i}, 2^{-i})$, we obtain\n\n$$\\Pr_{S \\sim D^m}\\left[\\Pr_{\\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}}\\left[\\mathrm{kl}(R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma) \\,\\|\\, R(z_{\\mathbf{i}}, \\sigma)) > \\varphi(\\delta_i \\gamma_i)\\right] > \\gamma_i\\right] \\le \\delta_i$$\n\nTaking the union bound over $\\delta_i$, i \u2265 1 now yields:\n\n$$\\Pr_{S \\sim D^m}\\left[\\forall i \\ge 0 :\\ \\Pr_{\\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}}\\left[\\mathrm{kl}(R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma) \\,\\|\\, R(z_{\\mathbf{i}}, \\sigma)) > \\varphi(\\delta 2^{-2i})\\right] \\le 2^{-i}\\right] > 1 - \\delta$$\n\nNow, let us consider the argument of the above statement for a fixed sample S. 
Then, for all i \u2265 0,\nthe following holds with probability 1 \u2212 \u03b4:\n\n$$\\Pr_{\\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}}\\left[\\mathrm{kl}(R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma) \\,\\|\\, R(z_{\\mathbf{i}}, \\sigma)) > \\frac{1}{m - |\\mathbf{i}|}\\left[\\ln\\left(\\frac{k+1}{\\delta}\\right) + 2i \\ln 2 + \\left(1 + \\frac{1}{k}\\right) \\ln_+\\left(\\frac{\\bar{Q}(z_{\\mathbf{i}}, \\sigma)}{P(z_{\\mathbf{i}}, \\sigma)}\\right)\\right]\\right] \\le 2^{-i}$$\n\nand hence:\n\n$$\\Pr_{\\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}}\\left[\\Phi_S(z_{\\mathbf{i}}, \\sigma) > 2i \\ln 2\\right] \\le 2^{-i}$$\n\nwhere:\n\n$$\\Phi_S(z_{\\mathbf{i}}, \\sigma) = (m - |\\mathbf{i}|)\\, \\mathrm{kl}(R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma) \\,\\|\\, R(z_{\\mathbf{i}}, \\sigma)) - \\ln\\left(\\frac{k+1}{\\delta}\\right) - \\left(1 + \\frac{1}{k}\\right) \\ln_+\\left(\\frac{\\bar{Q}(z_{\\mathbf{i}}, \\sigma)}{P(z_{\\mathbf{i}}, \\sigma)}\\right)$$\n\nWe wish to bound, for the Gibbs classifier, $\\mathbf{E}_{\\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}} \\Phi_S(z_{\\mathbf{i}}, \\sigma)$:\n\n$$\\mathbf{E}_{\\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}}\\left[\\Phi_S(z_{\\mathbf{i}}, \\sigma)\\right] \\le \\int_{2i \\ln 2 > 0} \\Pr_{\\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}}\\left[\\Phi_S(z_{\\mathbf{i}}, \\sigma) \\ge 2i \\ln 2\\right] d(2i \\ln 2) \\le 2 \\ln 2 \\sum_{i \\ge 0} \\Pr_{\\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}}\\left[\\Phi_S(z_{\\mathbf{i}}, \\sigma) \\ge 2i \\ln 2\\right] \\le 3 \\qquad (5)$$\n\nNow, we have:\n\nLemma 7 [Laviolette and Marchand, 2007] For any $f : K \\longrightarrow R^+$, and for any $Q, Q' \\in P_K$\nrelated by\n\n$$Q'(z_{\\mathbf{i}}, \\sigma)\\, f(z_{\\mathbf{i}}, \\sigma) = \\frac{1}{\\mathbf{E}_{(z_{\\mathbf{i}}, \\sigma) \\sim Q} \\frac{1}{f(z_{\\mathbf{i}}, \\sigma)}}\\, Q(z_{\\mathbf{i}}, \\sigma),$$\n\nwe have:\n\n$$\\mathbf{E}_{(z_{\\mathbf{i}}, \\sigma) \\sim Q'}\\left(f(z_{\\mathbf{i}}, \\sigma)\\, \\mathrm{kl}(R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma) \\,\\|\\, R(z_{\\mathbf{i}}, \\sigma))\\right) \\ge \\frac{1}{\\mathbf{E}_{(z_{\\mathbf{i}}, \\sigma) \\sim Q}\\left(\\frac{1}{f(z_{\\mathbf{i}}, \\sigma)}\\right)}\\, \\mathrm{kl}(R_S(G_Q) \\,\\|\\, R(G_Q))$$\n\nwhere $R_S(G_Q)$ and $R(G_Q)$ denote the empirical and true risk of the Gibbs classifier with posterior\nQ respectively.\n\nHence, with $Q' = \\bar{Q}$ and $f(z_{\\mathbf{i}}, \\sigma) = |\\bar{\\mathbf{i}}|$, Lemma 7 yields:\n\n$$\\mathbf{E}_{(z_{\\mathbf{i}}, \\sigma) \\sim \\bar{Q}}\\left(|\\bar{\\mathbf{i}}|\\, \\mathrm{kl}(R_{z_{\\bar{\\mathbf{i}}}}(z_{\\mathbf{i}}, \\sigma) \\,\\|\\, R(z_{\\mathbf{i}}, \\sigma))\\right) \\ge \\frac{1}{\\frac{1}{m - d_{\\bar{Q}}}}\\, \\mathrm{kl}(R_S(G_Q) \\,\\|\\, R(G_Q)) = (m - d_{\\bar{Q}})\\, \\mathrm{kl}(R_S(G_Q) \\,\\|\\, R(G_Q)) \\qquad (6)$$\n\nFurther,\n\n$$\\mathbf{E}_{\\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}}\\left[\\ln_+ \\frac{\\bar{Q}(z_{\\mathbf{i}}, \\sigma)}{P(z_{\\mathbf{i}}, \\sigma)}\\right] = \\mathbf{E}_{\\mathbf{i} \\sim \\bar{Q}_I,\\, \\sigma \\sim Q_{M(z_{\\mathbf{i}})}}\\left[\\ln_+\\left(\\frac{\\bar{Q}(z_{\\mathbf{i}}, \\sigma)}{P_I(\\mathbf{i})\\, P_{M(z_{\\mathbf{i}})}(\\sigma)}\\right)\\right]$$\n\n$$= \\mathbf{E}_{(z_{\\mathbf{i}}, \\sigma) \\sim P}\\left[\\left(\\frac{\\bar{Q}(z_{\\mathbf{i}}, \\sigma)}{P_I(\\mathbf{i})\\, P_{M(z_{\\mathbf{i}})}(\\sigma)}\\right) \\cdot \\ln_+\\left(\\frac{\\bar{Q}(z_{\\mathbf{i}}, \\sigma)}{P_I(\\mathbf{i})\\, P_{M(z_{\\mathbf{i}})}(\\sigma)}\\right)\\right]$$\n\n$$\\le \\mathbf{E}_{(z_{\\mathbf{i}}, \\sigma) \\sim P}\\left[\\left(\\frac{\\bar{Q}(z_{\\mathbf{i}}, \\sigma)}{P_I(\\mathbf{i})\\, P_{M(z_{\\mathbf{i}})}(\\sigma)}\\right) \\cdot \\ln\\left(\\frac{\\bar{Q}(z_{\\mathbf{i}}, \\sigma)}{P_I(\\mathbf{i})\\, P_{M(z_{\\mathbf{i}})}(\\sigma)}\\right)\\right] - \\min_{0 \\le x < 1} x \\ln x$$\n\n$$\\le \\mathrm{KL}(\\bar{Q} \\,\\|\\, P) + 0.5 \\qquad (7)$$\n\nCombining Equations 6 and 7 with Equation 5 and substituting k = m \u2212 1 yields the final result:\n\nTheorem 8 For any reconstruction function $R : D^m \\times K \\longrightarrow H$ and for any prior distribution P\nover compression sets and message strings, for \u03b4 \u2208 (0, 1], we have:\n\n$$\\Pr_{S \\sim D^m}\\left(\\forall Q \\in P_K :\\ \\mathrm{kl}(R_S(G_Q) \\,\\|\\, R(G_Q)) \\le \\frac{1}{m - d_{\\bar{Q}}}\\left[\\left(1 + \\frac{1}{m-1}\\right) \\mathrm{KL}(\\bar{Q} \\,\\|\\, P) + \\frac{1}{2(m-1)} + \\ln\\left(\\frac{m}{\\delta}\\right) + 3.5\\right]\\right) \\ge 1 - \\delta$$\n\nTheorem 8 recovers almost exactly the PAC-Bayes bound for the Sample Compressed Classifiers\nof Laviolette and Marchand [2007]. The key differences are an additional $\\frac{1}{(m - d_{\\bar{Q}})(m-1)}$ weighted\nKL-divergence term, $\\ln(\\frac{m}{\\delta})$ instead of the $\\ln(\\frac{m+1}{\\delta})$, and the additional trailing terms bounded by\n$\\frac{4}{m - d_{\\bar{Q}}}$. Note that the bound of Theorem 8 is derived in a relatively more straightforward manner\nwith the Occam\u2019s Hammer criterion.\n\n6 Conclusion\nIt has been shown that stochastic classifier selection is preferable to deterministic selection by the\nPAC-Bayes principle, resulting in tighter risk bounds over the averaged risk of classifiers according to\nthe learned posterior. Further, this observation resulted in tight bounds in the case of stochastic\nsample compressed classifiers [Laviolette and Marchand, 2007], also showing that sparsity considerations\nare of importance even in this scenario via the rescaled posterior. 
However, of immediate\nrelevance are the guarantees of the specific classifier output by such algorithms according to the\nlearned posterior and hence a point-wise version of this bound is indeed needed. We have derived\nbounds for such randomized sample compressed classifiers by adapting the Occam\u2019s Hammer principle\nto the data-dependent sample compression settings. This has resulted in bounds on the specific classifier\noutput by a sample compression learning algorithm according to the learned data-dependent\nposterior, which is more relevant in practice. Further, we also showed how the classical PAC-Bayes bound\nfor the sample compressed Gibbs classifier can be recovered in a more direct manner and showed that\nthis compares favorably to the existing result of Laviolette and Marchand [2007].\n\nAcknowledgments\n\nThe author would like to thank John Langford for interesting discussions.\n\nReferences\nGilles Blanchard and Fran\u00e7ois Fleuret. Occam\u2019s hammer. In Proceedings of the 20th Annual Conference on Learning Theory (COLT-2007), volume 4539 of Lecture Notes in Computer Science, pages 112\u2013126, 2007.\n\nSally Floyd and Manfred Warmuth. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine Learning, 21(3):269\u2013304, 1995.\n\nJohn Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 3:273\u2013306, 2005.\n\nFran\u00e7ois Laviolette and Mario Marchand. PAC-Bayes risk bounds for stochastic averages and majority votes of sample-compressed classifiers. Journal of Machine Learning Research, 8:1461\u20131487, 2007.\n\nFran\u00e7ois Laviolette, Mario Marchand, and Mohak Shah. Margin-sparsity trade-off for the set covering machine. In Proceedings of the 16th European Conference on Machine Learning, ECML 2005, volume 3720 of Lecture Notes in Artificial Intelligence, pages 206\u2013217. Springer, 2005.\n\nN. Littlestone and M. 
Warmuth. Relating data compression and learnability. Technical report, University of California Santa Cruz, Santa Cruz, CA, 1986.\n\nMario Marchand and John Shawe-Taylor. The Set Covering Machine. Journal of Machine Learning Research, 3:723\u2013746, 2002.\n\nDavid McAllester. Some PAC-Bayesian theorems. Machine Learning, 37:355\u2013363, 1999.\n", "award": [], "sourceid": 777, "authors": [{"given_name": "Mohak", "family_name": "Shah", "institution": null}]}