{"title": "Supervised learning through the lens of compression", "book": "Advances in Neural Information Processing Systems", "page_first": 2784, "page_last": 2792, "abstract": "This work continues the study of the relationship between sample compression schemes and statistical learning, which has been mostly investigated within the framework of binary classification. We first extend the investigation to multiclass categorization: we prove that in this case learnability is equivalent to compression of logarithmic sample size and that the uniform convergence property implies compression of constant size. We use the compressibility-learnability equivalence to show that (i) for multiclass categorization, PAC and agnostic PAC learnability are equivalent, and (ii) to derive a compactness theorem for learnability. We then consider supervised learning under general loss functions: we show that in this case, in order to maintain the compressibility-learnability equivalence, it is necessary to consider an approximate variant of compression. We use it to show that PAC and agnostic PAC are not equivalent, even when the loss function has only three values.", "full_text": "On statistical learning via the lens of compression\n\nO\ufb01r David\n\nDepartment of Mathematics\n\nTechnion - Israel Institute of Technology\n\nofirdav@tx.technion.ac.il\n\nShay Moran\n\nDepartment of Computer Science\n\nTechnion - Israel Institute of Technology\n\nshaymrn@cs.technion.ac.il\n\nAmir Yehudayoff\n\nDepartment of Mathematics\n\nTechnion - Israel Institute of Technology\n\namir.yehudayoff@gmail.com\n\nAbstract\n\nThis work continues the study of the relationship between sample compression\nschemes and statistical learning, which has been mostly investigated within the\nframework of binary classi\ufb01cation. The central theme of this work is establishing\nequivalences between learnability and compressibility, and utilizing these equiv-\nalences in the study of statistical learning theory. 
We begin with the setting of\nmulticlass categorization (zero/one loss). We prove that in this case learnability\nis equivalent to compression of logarithmic sample size, and that uniform conver-\ngence implies compression of constant size. We then consider Vapnik\u2019s general\nlearning setting: we show that in order to extend the compressibility-learnability\nequivalence to this case, it is necessary to consider an approximate variant of com-\npression. Finally, we provide some applications of the compressibility-learnability\nequivalences.\n\n1\n\nIntroduction\n\nThis work studies statistical learning theory using the point of view of compression. The main theme\nin this work is establishing equivalences between learnability and compressibility, and making an\neffective use of these equivalences to study statistical learning theory.\nIn a nutshell, the usefulness of these equivalences stems from that compressibility is a combinatorial\nnotion, while learnability is a statistical notion. These equivalences, therefore, translate statistical\nstatements to combinatorial ones and vice versa. This translation helps to reveal properties that are\notherwise dif\ufb01cult to \ufb01nd, and highlights useful guidelines for designing learning algorithms.\nWe \ufb01rst consider the setting of multiclass categorization, which is used to model supervised learning\nproblems using the zero/one loss function, and then move to Vapnik\u2019s general learning setting [23],\nwhich models many supervised and unsupervised learning problems.\n\nZero/one loss function (Section 3) This is the setting in which sample compression schemes were\nde\ufb01ned by Littlestone and Warmuth [16], as an abstraction of a common property of many learning\nalgorithms. For more background on sample compression schemes, see e.g. 
[16, 8, 9, 22].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fWe use an agnostic version of sample compression schemes, and show that learnability is equivalent\nto some sort of compression. More formally, that any learning algorithm can be transformed to a\ncompression algorithm, compressing a sample of size m to a sub-sample of size roughly log(m), and\nthat such a compression algorithm implies learning. This statement is based on arguments that appear\nin [16, 10, 11]. We conclude this part by describing some applications:\n(i) Equivalence between PAC and agnostic PAC learning from a statistical perspective (i.e. in terms of\nsample complexity). For binary-labelled classes, this equivalence follows from basic arguments in\nVapnik-Chervonenkis (VC) theory, but these arguments do not seem to extend when the number of\nlabels is large.\n(ii) A dichotomy for sample compression - if a non-trivial compression exists (e.g. compressing\na sample of size m to a sub-sample of size m0.99), then a compression to logarithmic size exists\n(i.e. to a sub-sample of size roughly log m). This dichotomy is analogous to the known dichotomy\nconcerning the growth function of binary-labelled classes: the growth function is either polynomial\n(when the VC dimension is \ufb01nite), or exponential (when the VC dimension is in\ufb01nite).\n(iii) Compression to constant size versus uniform convergence - every class with the uniform conver-\ngence property has a compression of constant size. The proof has two parts. The \ufb01rst part, which is\nbased on arguments from [18], shows that \ufb01nite graph dimension (a generalization of VC dimension\nfor multiclass categorization [19]) implies compression of constant size. 
The second part, which uses\nideas from [1, 24, 7], shows that the uniform convergence rate is captured by the graph dimension.\nIn this part we improve upon the previously known bounds.\n(iv) Compactness for learning - if \ufb01nite sub-classes of a given class are learnable, then the class is\nlearnable as well. Again, for binary-labelled classes, such compactness easily follows from known\nproperties of VC dimension. For general multi-labeled classes we derive this statement using a\ncorresponding compactness property for sample compression schemes, based on the work by [2].\n\nGeneral learning setting (Section 4). We continue with investigating general loss functions. This\npart begins with a simple example in the context of linear regression, showing that for general loss\nfunctions, learning is not equivalent to compression. We then consider an approximate variant of\ncompression schemes, which was used by [13, 12] in the context of classi\ufb01cation, and observe that\nlearnability is equivalent to possessing an approximate compression scheme, whose size is roughly\nthe statistical sample complexity. This is in contrast to (standard) sample compression schemes, for\nwhich the existence of such an equivalence (under the zero/one loss) is a long standing open problem,\neven in the case of binary classi\ufb01cation [25]. We conclude the paper by showing that - unlike for\nzero/one loss functions - for general loss functions, PAC learnability and agnostic PAC learnability\nare not equivalent. In fact, this is derived for a loss function that takes just three values. The proof of\nthis non-equivalence uses Ramsey theory for hypergraphs. The combinatorial nature of compression\nschemes allows to clearly identify the place where Ramsey theory is helpful. 
More generally, the\nstudy of statistical learning theory via the lens of compression may shed light on additional useful\nconnections with different fields of mathematics.\nWe begin our investigation by breaking the definition of sample compression schemes into two parts.\nThe first part (which may seem useless at first sight) is about selection schemes. These are learning\nalgorithms whose output hypothesis depends on a selected small sub-sample of the input sample. The\nsecond part of the definition is the sample-consistency guarantee; so, sample compression schemes\nare selection schemes whose output hypothesis is consistent with the input sample. We then show\nthat selection schemes of small size do not overfit in that their empirical risk is close to their true\nrisk. Roughly speaking, this shows that for selection schemes there are no surprises: \u201cwhat you see is\nwhat you get\u201d.\n\n2 Preliminaries\n\nThe definitions we use are based on the textbook [22].\n\nLearnability and uniform convergence\nA learning problem is specified by a set H of hypotheses, a domain Z of examples, and a loss function\n\u2113 : H \u00d7 Z \u2192 R+. To ease the presentation, we shall only discuss loss functions that are bounded\nfrom above by 1, although the results presented here can be extended to more general loss functions.\nA sample S is a finite sequence S = (z1, . . . , zm) \u2208 Z^m.
A learning algorithm is a mapping that\ngets as an input a sample and outputs an hypothesis h.\nIn the context of supervised learning, hypotheses are functions from a domain X to a label set\nY, and the examples domain is the cartesian product Z := X \u00d7 Y.\nIn this context, the loss\n\u2113(h, (x, y)) depends only on h(x) and y, and therefore in this case it is modelled as a function\n\u2113 : Y \u00d7 Y \u2192 R+.\nGiven a distribution D on Z, the risk of an hypothesis h : X \u2192 Y is its expected loss: LD(h) =\nE_{z\u223cD}[\u2113(h, z)]. Given a sample S = (z1, . . . , zm), the empirical risk of an hypothesis h is\n\nLS(h) = (1/m) \u2211_{i=1}^{m} \u2113(h, z_i).\n\nAn hypothesis class H is a set of hypotheses. A distribution D is realizable by H if there exists h \u2208 H\nsuch that LD(h) = 0. A sample S is realizable by H if there exists h \u2208 H such that LS(h) = 0.\nAn hypothesis class H has the uniform convergence property^1 if there exists a rate function d :\n(0, 1)^2 \u2192 N such that for every \u03b5, \u03b4 > 0 and distribution D over Z, if S is a sample of m \u2265 d(\u03b5, \u03b4)\ni.i.d. pairs generated by D, then with probability at least 1 \u2212 \u03b4 we have: \u2200h \u2208 H, |LD(h) \u2212 LS(h)| \u2264 \u03b5.\nThe class H is agnostic PAC learnable if there exists a learner A and a rate function d : (0, 1)^2 \u2192 N\nsuch that for every \u03b5, \u03b4 > 0 and distribution D over Z, if S is a sample of m \u2265 d(\u03b5, \u03b4) i.i.d. pairs\ngenerated by D, then with probability at least 1 \u2212 \u03b4 we have LD(A(S)) \u2264 inf_{h\u2208H} LD(h) + \u03b5. The\nclass H is PAC learnable if this condition holds for every realizable distribution D.
The parameter \u03b5\nis referred to as the error parameter and \u03b4 as the confidence parameter.\nNote that the uniform convergence property implies agnostic PAC learnability with the same rate\nvia any learning algorithm which outputs h \u2208 H that minimizes the empirical risk, and that agnostic\nPAC learnability implies PAC learnability with the same rate.\n\nSelection and compression schemes\n\nThe variants of sample compression schemes that are discussed in this paper are based on the\nfollowing object, which we term a selection scheme. We stress here that unlike sample compression\nschemes, selection schemes are not associated with any hypothesis class.\nA selection scheme is a pair (\u03ba, \u03c1) of maps for which the following holds:\n\n\u2022 \u03ba is called the selection map. It gets as an input a sample S and outputs a pair (S\u2032, b) where\nS\u2032 is a sub-sample^2 of S and b is a finite binary string, which we think of as side information.\n\u2022 \u03c1 is called the reconstruction map. It gets as an input a pair (S\u2032, b) of the same type as the\noutput of \u03ba and outputs an hypothesis h.\n\nThe size of (\u03ba, \u03c1) on a given input sample S is defined to be |S\u2032| + |b| where \u03ba(S) = (S\u2032, b). For an\ninput size m, we denote by k(m) the maximum size of the selection scheme on all inputs S of size at\nmost m. The function k(m) is called the size of the selection scheme. If k(m) is uniformly bounded\nby a constant, which does not depend on m, then we say that the selection scheme has a constant\nsize; otherwise, we say that it has a variable size.\nThe definition of selection schemes is very similar to that of sample compression schemes. The\ndifference is that sample compression schemes are defined relative to a fixed hypothesis class with\nrespect to which they are required to have \u201ccorrect\u201d reconstructions whereas selection schemes do not\nprovide any correctness guarantee.
The distinction between the \u2018selection\u2019 part and the \u2018correctness\u2019\npart is helpful for our presentation, and also provides some more insight into these notions.\nA selection scheme (\u03ba, \u03c1) is a sample compression scheme for H if for every sample S that is\nrealizable by H, LS(\u03c1(\u03ba(S))) = 0. A selection scheme (\u03ba, \u03c1) is an agnostic sample compression\nscheme for H if for every sample S, LS(\u03c1(\u03ba(S))) \u2264 inf_{h\u2208H} LS(h).\nIn the following sections, we will see different manifestations of the statement \u201ccompression \u21d2\nlearning\u201d. An essential part of these statements boils down to a basic property of selection schemes,\nthat as long as k(m) is sufficiently smaller than m, a selection scheme based learner does not overfit\nits training data (the proof appears in the full version of this paper).\n\n1 We omit the dependence on the loss function \u2113 from this and similar definitions, since \u2113 is clear from the context.\n2 That is, if S = (z1, . . . , zm) then S\u2032 is of the form (z_{i_1}, . . . , z_{i_\u2113}) for 1 \u2264 i_1 < . . . < i_\u2113 \u2264 m.\n\nTheorem 2.1 ([22, Theorem 30.2]). Let (\u03ba, \u03c1) be a selection scheme of size k = k(m), and let\nA(S) = \u03c1(\u03ba(S)). Then, for every distribution D on Z, integer m such that k \u2264 m/2, and \u03b4 > 0,\nwe have\n\nPr_{S\u223cD^m}[ |LD(A(S)) \u2212 LS(A(S))| \u2265 \u221a(\u03b5 \u00b7 LS(A(S))) + \u03b5 ] \u2264 \u03b4,\n\nwhere \u03b5 = 50 \u00b7 (k log(m/k) + log(1/\u03b4)) / m.\n\n3 Zero/one loss functions\n\nIn this section we consider the zero/one loss function, which models categorization problems. We\nstudy the relationships between uniform convergence, learnability, and sample compression schemes\nunder this loss. Subsection 3.1 establishes equivalence between learnability and compressibility of\na sublinear size.
In Subsection 3.2 we use this equivalence to study the relationships between the\nproperties of uniform convergence, PAC, and agnostic learnability. In Subsection 3.2.1 we show that\nagnostic learnability is equivalent to PAC learnability. In Subsection 3.2.2 we observe a dichotomy\nconcerning the size of sample compression schemes, and use it to establish a compactness property\nof learnability. Finally, in Subsection 3.2.3 we study an extension of the Littlestone-Floyd-Warmuth\nconjecture concerning an equivalence between learnability and sample compression schemes of fixed\nsize.\n\n3.1 Learning is equivalent to sublinear compressing\nThe following theorem shows that if H has a sample compression scheme of size k = o(m), then it\nis learnable. Its proof appears in the full version of this paper.\nTheorem 3.1 (Compressing implies learning [16]). Let (\u03ba, \u03c1) be a selection scheme of size k, let H\nbe a hypothesis class, and let D be a distribution on Z.\n\n1. If (\u03ba, \u03c1) is a sample compression scheme for H, and m is such that k(m) \u2264 m/2, then\n\nPr_{S\u223cD^m}( LD(\u03c1(\u03ba(S))) > 50 \u00b7 (k log(m/k) + k + log(1/\u03b4)) / m ) < \u03b4.\n\n2. If (\u03ba, \u03c1) is an agnostic sample compression scheme for H, and m is such that k(m) \u2264 m/2,\nthen\n\nPr_{S\u223cD^m}( LD(\u03c1(\u03ba(S))) > inf_{h\u2208H} LD(h) + 100 \u00b7 \u221a((k log(m/k) + k + log(1/\u03b4)) / m) ) < \u03b4.\n\nThe following theorem shows that learning implies compression. We present its proof in the full\nversion of this paper.\nTheorem 3.2 (Learning implies compressing). Let H be an hypothesis class.\n\n1. If H is agnostic PAC learnable with learning rate d(\u03b5, \u03b4), then it is PAC learnable with the\nsame learning rate.\n\n2.
If H is PAC learnable with learning rate d(\u03b5, \u03b4), then it has a sample compression scheme\nof size k(m) = O(d0 log(m) log log(m) + d0 log(m) log(d0)), where d0 = d(1/3, 1/3).\n3. If H has a sample compression scheme of size k(m), then it has an agnostic sample\ncompression scheme of the same size.\n\nRemark. The third part in Theorem 3.2 does not hold when the loss function is general. In Section 4\nwe show that even when the loss function takes only three possible values, there are instances where a\nclass has a sample compression scheme but not an agnostic sample compression scheme.\n\n3.2 Applications\n\n3.2.1 Agnostic and PAC learnability are equivalent\nTheorems 3.1 and 3.2 imply that if H is PAC learnable, then it is agnostic PAC learnable. Indeed, a\nsummary of the implications between learnability and compression given by Theorems 3.1 and 3.2\ngives:\n\n\u2022 An agnostic learner with rate d(\u03b5, \u03b4) implies a PAC learner with rate d(\u03b5, \u03b4).\n\u2022 A PAC learner with rate d(\u03b5, \u03b4) implies a sample compression scheme of size k(m) =\nO(d0 \u00b7 log(m) log(d0 \u00b7 log(m))), where d0 = d(1/3, 1/3).\n\u2022 A sample compression scheme of size k(m) implies an agnostic sample compression\nscheme of size k(m).\n\u2022 An agnostic sample compression scheme of size k(m) implies an agnostic learner with\nerror \u03b5(d, \u03b4) = 100 \u00b7 \u221a((k(d) log(d/k(d)) + k(d) + log(1/\u03b4)) / d).\n\nThus, for multiclass categorization problems, agnostic learnability and PAC learnability are equivalent.\nWhen the size of the label set Y is O(1), this equivalence follows from previous works that studied\nextensions of the VC dimension to multiclass categorization problems [24, 3, 19, 1]. These works\nshow that PAC learnability and agnostic PAC learnability are equivalent to the uniform convergence\nproperty, and therefore any ERM algorithm learns the class.
Recently, [7] separated PAC learnability\nand uniform convergence for large label sets by exhibiting PAC learnable hypothesis classes that do\nnot satisfy the uniform convergence property. In contrast, this shows that the equivalence between\nPAC and agnostic learnability remains valid even when Y is large.\n\n3.2.2 A dichotomy and compactness\nLet H be an hypothesis class. Assume that H has a sample compression scheme of size, say, m/500\nfor some large m. Therefore, by Theorem 3.1, H is weakly PAC learnable with confidence 2/3, error\n1/3, and O(1) examples. Now, Theorem 3.2 implies that H has a sample compression scheme of size\nk(m) \u2264 O(log(m) log log(m)). In other words, the following dichotomy holds: every hypothesis\nclass H either has a sample compression scheme of size k(m) = O(log(m) log log(m)), or any\nsample compression scheme for it has size \u2126(m).\nThis dichotomy implies the following compactness property for learnability under the zero/one loss.\nTheorem 3.3. Let d \u2208 N, and let H be an hypothesis class such that each finite subclass of H\nis learnable with error 1/3, confidence 2/3 and d examples. Then H is learnable with error 1/3,\nconfidence 2/3 and O(d log^2(d) log log(d)) examples.\nWhen Y = {0, 1}, the theorem follows by observing that if every finite subclass of H has VC\ndimension at most d, then the VC dimension of H is at most d. We are not aware of a similar\nargument that applies for a general label set. A related challenge, which was posed by [6], is to find a\n\u201ccombinatorial\u201d parameter, which captures multiclass learnability like the VC dimension captures it\nin the binary-labeled case.\nA proof of Theorem 3.3 appears in the full version of this paper.
It uses an analogous^3 compactness\nproperty for sample compression schemes proven by [2].\n\n3 Ben-David and Litman proved a compactness result for sample compression schemes when Y = {0, 1},\nbut their argument generalizes for a general Y.\n\n3.2.3 Uniform convergence versus compression to constant size\n\nSince the introduction of sample compression schemes by [16], they were mostly studied in the\ncontext of binary-labeled hypothesis classes (the case Y = {0, 1}). In this context, a significant\nnumber of works were dedicated to studying the relationship between VC dimension and the minimal\nsize of a compression scheme (e.g. [8, 14, 9, 2, 15, 4, 21, 20, 17]). Recently, [18] proved that any class\nof VC dimension d has a compression scheme of size exponential in the VC dimension. Establishing\nwhether a compression scheme of size linear (or even polynomial) in the VC dimension exists remains\nopen [9, 25].\nThis question has a natural extension to multiclass categorization: Does every hypothesis class H\nhave a sample compression scheme of size O(d), where d = d_PAC(1/3, 1/3) is the minimal sample\ncomplexity of a weak learner for H? In fact, in the case of multiclass categorization it is open whether\nthere is a sample compression scheme of size depending only on d.\nWe show here that the arguments from [18] generalize to uniform convergence.\nTheorem 3.4. Let H be an hypothesis class with uniform convergence rate d_UC(\u03b5, \u03b4). Then H has a\nsample compression scheme of size exp(d), where d = d_UC(1/3, 1/3).\n\nThe proof of this theorem uses the notion of the graph dimension, which was defined by [19].\nTheorem 3.4 is proved using the following two ingredients. First, the construction in [18] yields a\nsample compression scheme of size exp(dim_G(H)). Second, the graph dimension determines the\nuniform convergence rate, much as the VC dimension does in the binary-labeled case.\nTheorem 3.5.
Let H be an hypothesis class, let d = dim_G(H), and let d_UC(\u03b5, \u03b4) denote the uniform\nconvergence rate of H. Then, there exist constants C1, C2 such that\n\nC1 \u00b7 (d + log(1/\u03b4) \u2212 C1) / \u03b5^2 \u2264 d_UC(\u03b5, \u03b4) \u2264 C2 \u00b7 (d log(1/\u03b5) + log(1/\u03b4)) / \u03b5^2.\n\nParts of this result are well-known and appear in the literature: The upper bound follows from\nTheorem 5 of [7], and the core idea of the argument dates back to the articles of [1] and of [24]. A\nlower bound with a worse dependence on \u03b5 follows from Theorem 9 of [7]. A proof of Theorem 3.5\nappears in the full version of this paper.\n\n4 General loss functions\n\nWe have seen that in the case of the zero/one loss function, the existence of a sublinear sample\ncompression scheme is equivalent to learnability. It is natural to ask whether this phenomenon\nextends to other loss functions. The direction \u201ccompression \u21d2 learning\u201d remains valid for general\nloss functions. In contrast, as will be discussed in this section, the other direction fails for general\nloss functions.\nHowever, a natural adaptation of sample compression schemes, which we term approximate sample\ncompression schemes, allows the extension of the equivalence to arbitrary loss functions. Approximate\ncompression schemes were previously studied in the context of classification (e.g. [13, 12]). In\nSubsection 4.1 we argue that in general sample compression schemes are not equivalent to learnability;\nspecifically, there is no agnostic sample compression scheme for linear regression. In Subsection 4.2\nwe define approximate sample compression schemes and establish their equivalence with learnability.\nFinally, in Subsection 4.3 we use this equivalence to demonstrate classes that are PAC learnable but\nnot agnostic PAC learnable. This manifests a difference with the zero/one loss under which agnostic\nand PAC learning are equivalent (see 3.2.1).
It is worth noting that the loss function we use to break\nthe equivalence takes only three values (compared to the two values of the zero/one loss function).\n\n4.1 No agnostic compression for linear regression\n\nWe next show that in the setup of linear regression, which is known to be agnostic PAC learnable,\nthere is no agnostic sample compression scheme. For convenience, we shall restrict the discussion\nto zero-dimensional linear regression. In this setup^4, the sample consists of m examples S =\n(z1, z2, . . . , zm) \u2208 [0, 1]^m, and the loss function is defined by \u2113(h, z) = (h \u2212 z)^2. The goal is to\nfind h \u2208 R which minimizes LS(h). The empirical risk minimizer (ERM) is exactly the average\nh\u2217 = (1/m) \u2211_i z_i, and for every h \u2260 h\u2217 we have LS(h) > LS(h\u2217). Thus, an agnostic sample\ncompression scheme in this setup should compress S to a subsequence and a binary string of side\ninformation, from which the average of S can be reconstructed. We prove that there is no such\ncompression.\nTheorem 4.1. There is no agnostic sample compression scheme for zero-dimensional linear regression\nwith size k(m) \u2264 m/2.\n\nThe proof appears in the full version of this paper. The idea is to restrict our attention to sets \u2126 \u2286 [0, 1]\nfor which every subset of \u2126 has a distinct average. It follows that any sample compression scheme\nfor samples from \u2126 must perform a compression that is information theoretically impossible.\n\n4 One may think of X as a singleton.\n\n4.2 Approximate sample compression schemes\n\nThe previous example suggests the question of whether one can generalize the definition of compression\nto fit problems where the loss function is not zero/one. Taking cues from PAC and agnostic\nPAC learning, we consider the following definition.
We say that the selection scheme (\u03ba, \u03c1) is an\n\u03b5-approximate sample compression scheme for H if for every sample S that is realizable by H,\nLS(\u03c1(\u03ba(S))) \u2264 \u03b5. It is called an \u03b5-approximate agnostic sample compression scheme for H if for\nevery sample S, LS(\u03c1(\u03ba(S))) \u2264 inf_{h\u2208H} LS(h) + \u03b5.\nLet us start by revisiting the case of zero-dimensional linear regression. Even though it does not have\nan agnostic compression scheme of sublinear size, it does have an \u03b5-approximate agnostic sample\ncompression scheme of size k = O(log(1/\u03b5)/\u03b5) which we now describe.\n\nGiven a sample S = (z1, . . . , zm) \u2208 [0, 1]^m, the average h\u2217 = \u2211_{i=1}^{m} z_i/m is the ERM of S. Let\n\nL\u2217 = L(h\u2217) = \u2211_{i=1}^{m} z_i^2/m \u2212 (\u2211_{i=1}^{m} z_i/m)^2.\n\nIt is enough to show that there exists a sub-sample S\u2032 = (z_{i_1}, . . . , z_{i_\u2113}) of size \u2113 = \u23081/\u03b5\u2309 such that\nLS(\u2211_{j=1}^{\u2113} z_{i_j}/\u2113) \u2264 L\u2217 + \u03b5. It turns out that picking S\u2032 at random suffices. Let Z1, . . . , Z\u2113 be\nindependent random variables that are uniformly distributed over S and let H = (1/\u2113) \u2211_{i=1}^{\u2113} Z_i be their\naverage. Thus, E[H] = h\u2217 and E[LS(H)] = L\u2217 + Var[H] \u2264 L\u2217 + \u03b5. In particular, this means that\nthere exists some sub-sample of size \u2113 whose average has loss at most L\u2217 + \u03b5. Encoding such a\nsub-sample requires O(log(1/\u03b5)/\u03b5) additional bits of side information.\nWe now establish the equivalence between approximate compression and learning (the proof is similar\nto the proof of Theorem 3.1).\nTheorem 4.2 (Approximate compressing implies learning). Let (\u03ba, \u03c1) be a selection scheme of size\nk, let H be an hypothesis class, and let D be a distribution on Z.\n\n1.
If (\u03ba, \u03c1) is an \u03b5-approximate sample compression scheme for H, and m is such that k(m) \u2264\nm/2, then\n\nPr_{S\u223cD^m}( LD(\u03c1(\u03ba(S))) > \u03b5 + 100 \u00b7 \u221a((k log(m/k) + log(1/\u03b4)) / m) ) < \u03b4.\n\n2. If (\u03ba, \u03c1) is an \u03b5-approximate agnostic sample compression scheme for H, and m is such that\nk(m) \u2264 m/2, then\n\nPr_{S\u223cD^m}( LD(\u03c1(\u03ba(S))) > inf_{h\u2208H} LD(h) + \u03b5 + 100 \u00b7 \u221a((k log(m/k) + log(1/\u03b4)) / m) ) < \u03b4.\n\nThe following theorem shows that every learnable class has an approximate sample compression\nscheme. The proof of this theorem is straightforward, in contrast with the proof of the analogous\nstatement in the case of zero/one loss functions and compression schemes without error.\nTheorem 4.3 (Learning implies approximate compressing). Let H be an hypothesis class.\n\n1. If H is PAC learnable with rate d(\u03b5, \u03b4), then it has an \u03b5-approximate sample compression\nscheme of size k \u2264 O(d log(d)) with d = min_{\u03b4<1} d(\u03b5, \u03b4).\n\n2. If H is agnostic PAC learnable with rate d(\u03b5, \u03b4), then it has an \u03b5-approximate agnostic\nsample compression scheme of size k \u2264 O(d log(d)) with d = min_{\u03b4<1} d(\u03b5, \u03b4).\n\nThe proof appears in the full version of this paper.\n\n4.3 A separation between PAC and agnostic learnability\n\nHere we establish a separation between PAC and agnostic PAC learning under loss functions which\ntake more than two values:\nTheorem 4.4. There exist hypothesis classes H \u2286 Y^X and loss function l : Y \u00d7 Y \u2192 {0, 1/2, 1} such\nthat H is PAC learnable and not agnostic PAC learnable.\nThe main challenge in proving this theorem is showing that H is not agnostic PAC learnable. We\ndo this by showing that H does not have an approximate sample compression scheme.
The crux of\nthe argument is an application of Ramsey theory; the combinatorial nature of compression allows to\nidentify the place where Ramsey theory is helpful. The proof appears in the full version of this paper.\n\n5 Discussion and further research\n\nThe compressibility-learnability equivalence is a fundamental link in statistical learning theory. From\na theoretical perspective this link can serve as a guideline for proving both negative/impossibility\nresults, and positive/possibility results.\nFrom the perspective of positive results, just recently, [5] relied on this paper in showing that\nevery learnable problem is learnable with robust generalization guarantees. Another important\nexample appears in the work of boosting weak learners [11] (see Chapter 4.2). These works follow a\nsimilar approach, that may be useful in other scenarios: (i) transform the given learner to a sample\ncompression scheme, and (ii) utilize properties of compression schemes to derive the desired result.\nThe same approach is also used in this paper in Section 3.2.1, where it is shown that PAC learning\nimplies agnostic PAC learning under 0/1 loss; we \ufb01rst transform the PAC learner to a realizable\ncompression scheme, and then use the realizable compression scheme to get an agnostic compression\nscheme that is also an agnostic learner. We note that we are not aware of a proof that directly\ntransforms the PAC learner to an agnostic learner without using compression.\nFrom the perspective of impossibility/hardness results, this link implies that to show that a problem is\nnot learnable, it suf\ufb01ces to show that it is not compressible. 
In Section 4.3, we follow this approach\nwhen showing that PAC and agnostic PAC learnability are not equivalent for general loss functions.\nThis link may also have a practical impact, since it offers a thumb rule for algorithm designers; if a\nproblem is learnable then it can be learned by a compression algorithm, whose design boils down to\nan intuitive principle \u201c\ufb01nd a small insightful subset of the input data.\u201d For example, in geometrical\nproblems, this insightful subset often appears on the boundary of the data points (see e.g. [12]).\n\nReferences\n[1] S. Ben-David, N. Cesa-Bianchi, D. Haussler, and P. M. Long. Characterizations of learnability for classes\n\nof {0,...,n}-valued functions. J. Comput. Syst. Sci., 50(1):74\u201386, 1995. 2, 5, 6\n\n[2] Shai Ben-David and Ami Litman. Combinatorial Variability of Vapnik-Chervonenkis Classes with\n\nApplications to Sample Compression Schemes. Discrete Applied Mathematics, 86(1):3\u201325, 1998. 2, 5\n\n[3] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the\n\nVapnik-Chervonenkis dimension. J. Assoc. Comput. Mach., 36(4):929\u2013965, 1989. 5\n\n[4] A. Chernikov and P. Simon. Externally de\ufb01nable sets and dependent pairs. Israel Journal of Mathematics,\n\n194(1):409\u2013425, 2013. 5\n\n[5] Rachel Cummings, Katrina Ligett, Kobbi Nissim, Aaron Roth, and Zhiwei Steven Wu. Adaptive learning\nwith robust generalization guarantees. In Proceedings of the 29th Conference on Learning Theory, COLT\n2016, New York, USA, June 23-26, 2016, pages 772\u2013814, 2016. 8\n\n[6] A. Daniely and S. Shalev-Shwartz. Optimal learners for multiclass problems. In COLT, volume 35, pages\n\n287\u2013316, 2014. 5\n\n[7] Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the\n\nERM principle. Journal of Machine Learning Research, 16:2377\u20132404, 2015. 2, 5, 6\n\n[8] S. Floyd. 
Space-Bounded Learning and the Vapnik-Chervonenkis Dimension. In COLT, pages 349\u2013364, 1989.\n\n[9] Sally Floyd and Manfred K. Warmuth. Sample Compression, Learnability, and the Vapnik-Chervonenkis Dimension. Machine Learning, 21(3):269\u2013304, 1995.\n\n[10] Yoav Freund. Boosting a weak learning algorithm by majority. Inf. Comput., 121(2):256\u2013285, 1995.\n\n[11] Yoav Freund and Robert E. Schapire. Boosting: Foundations and Algorithms. Adaptive computation and machine learning. MIT Press, 2012.\n\n[12] Lee-Ad Gottlieb, Aryeh Kontorovich, and Pinhas Nisnevitch. Nearly optimal classification for semimetrics. CoRR, abs/1502.06208, 2015.\n\n[13] Thore Graepel, Ralf Herbrich, and John Shawe-Taylor. PAC-Bayesian Compression Bounds on the Prediction Error of Learning Algorithms for Classification. Machine Learning, 59(1-2):55\u201376, 2005.\n\n[14] D. P. Helmbold, R. H. Sloan, and M. K. Warmuth. Learning integer lattices. SIAM J. Comput., 21(2):240\u2013266, 1992.\n\n[15] Dima Kuzmin and Manfred K. Warmuth. Unlabeled compression schemes for maximum classes. Journal of Machine Learning Research, 8:2047\u20132081, 2007.\n\n[16] Nick Littlestone and Manfred Warmuth. Relating data compression and learnability. Unpublished, 1986.\n\n[17] Roi Livni and Pierre Simon. Honest compressions and their application to compression schemes. In COLT, pages 77\u201392, 2013.\n\n[18] Shay Moran and Amir Yehudayoff. Sample compression schemes for VC classes. J. ACM, 63(3):21:1\u201321:10, June 2016.\n\n[19] B. K. Natarajan. On learning sets and functions. Machine Learning, 4:67\u201397, 1989.\n\n[20] B. I. P. Rubinstein and J. H. Rubinstein. A geometric approach to sample compression. Journal of Machine Learning Research, 13:1221\u20131261, 2012.\n\n[21] Benjamin I. P. Rubinstein, Peter L. Bartlett, and J. H. Rubinstein. 
Shifting: One-inclusion mistake bounds and sample compression. J. Comput. Syst. Sci., 75(1):37\u201359, 2009.\n\n[22] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014.\n\n[23] Vladimir Vapnik. Statistical learning theory. Wiley, 1998.\n\n[24] V.N. Vapnik and A.Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16:264\u2013280, 1971.\n\n[25] Manfred K. Warmuth. Compressing to VC dimension many points. In COLT/Kernel, pages 743\u2013744, 2003.", "award": [], "sourceid": 1413, "authors": [{"given_name": "Ofir", "family_name": "David", "institution": "Technion - Israel institute of technology"}, {"given_name": "Shay", "family_name": "Moran", "institution": "Technion - Israel institue of Technology"}, {"given_name": "Amir", "family_name": "Yehudayoff", "institution": "Technion - Israel institue of Technology"}]}