{"title": "Chaining Mutual Information and Tightening Generalization Bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 7234, "page_last": 7243, "abstract": "Bounding the generalization error of learning algorithms has a long history, which yet falls short in explaining various generalization successes including those of deep learning. Two important difficulties are (i) exploiting the dependencies between the hypotheses, (ii) exploiting the dependence between the algorithm\u2019s input and output. Progress on the first point was made with the chaining method, originating from the work of Kolmogorov, and used in the VC-dimension bound. More recently, progress on the second point was made with the mutual information method by Russo and Zou \u201915. Yet, these two methods are currently disjoint. In this paper, we introduce a technique to combine chaining and mutual information methods, to obtain a generalization bound that is both algorithm-dependent and that exploits the dependencies between the hypotheses. We provide an example in which our bound significantly outperforms both the chaining and the mutual information bounds. As a corollary, we tighten Dudley\u2019s inequality when the learning algorithm chooses its output from a small subset of hypotheses with high probability.", "full_text": "Chaining Mutual Information and Tightening\n\nGeneralization Bounds\n\nAmir R. Asadi1\u2217 Emmanuel Abbe1,2\n\nSergio Verd\u00fa\n\n1Princeton University\n\n2EPFL\n\nAbstract\n\nBounding the generalization error of learning algorithms has a long history, which\nyet falls short in explaining various generalization successes including those of deep\nlearning. 
Two important difficulties are (i) exploiting the dependencies between the hypotheses, and (ii) exploiting the dependence between the algorithm's input and output. Progress on the first point was made with the chaining method, originating from the work of Kolmogorov, and used in the VC-dimension bound. More recently, progress on the second point was made with the mutual information method by Russo and Zou '15. Yet, these two methods are currently disjoint. In this paper, we introduce a technique to combine the chaining and mutual information methods, to obtain a generalization bound that is both algorithm-dependent and that exploits the dependencies between the hypotheses. We provide an example in which our bound significantly outperforms both the chaining and the mutual information bounds. As a corollary, we tighten Dudley's inequality when the learning algorithm chooses its output from a small subset of hypotheses with high probability.

1 Introduction

1.1 Motivation

Understanding the generalization phenomenon in machine learning has been a central question for many years, revived in recent years by the success and mystery of deep learning: why do neural nets generalize well, although they operate in a classically overparametrized setting? In particular, classical generalization bounds do not explain this phenomenon (see e.g. [1], [2]). Even simpler instances of successful machine learning problems and algorithms are not explained satisfactorily with current generalization bounds, e.g. [2].
This paper aims at deriving tighter generalization bounds for learning algorithms by combining ideas from information theory and from high-dimensional probability.

Generalization bounds have evolved throughout the years, starting from the basic union bound over the hypothesis set, the refined union bound, Rademacher complexity, chaining and VC-dimension [3], [4]; and algorithm-dependent bounds such as PAC-Bayesian bounds [5], uniform stability [6], compression bounds [7], and recently, the mutual information bound [8].

We highlight two pitfalls among the key limitations of current bounds:

A. Ignoring the dependencies between the hypotheses. Consider the following example (which we refer to as Example I): an algorithm observes G^2 = (G_1, G_2), where G_1 and G_2 are two independent standard normal random variables; the hypothesis set H = {h_t : t ∈ T} consists of the functions h_t(G^2) ≜ ⟨t, G^2⟩, where T ≜ {t ∈ R^2 : ‖t‖_2 = 1}. Suppose the algorithm is designed to choose the hypothesis which achieves max_{t∈T} h_t(G^2). Since h_t(G^2), t ∈ T, are all zero-mean random variables, the expected bias of the algorithm is E[max_{t∈T} h_t(G^2)]. Moreover, since H consists of an uncountable number of hypotheses, the union bound (or equivalently the maximal inequality) over the hypothesis set gives a vacuous bound.

*Corresponding author: aasadi@princeton.edu

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
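For Example I, the expected bias can in fact be computed directly: by the Cauchy-Schwarz inequality, max_{t∈T} ⟨t, G^2⟩ = ‖G^2‖_2, whose mean is √(π/2) ≈ 1.2533. A minimal Monte Carlo sketch (the seed and trial count are arbitrary illustrative choices):

```python
import math, random

# Example I: the algorithm picks the hypothesis maximizing h_t(G^2) = <t, G^2>
# over the unit circle T. By the Cauchy-Schwarz inequality the maximum equals
# ||G^2||_2, so E[max_t h_t(G^2)] = E[||G^2||_2] = sqrt(pi/2): finite and
# small, even though the union bound over the uncountable set T is vacuous.
rng = random.Random(0)
n_trials = 200_000
acc = 0.0
for _ in range(n_trials):
    g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    acc += math.hypot(g1, g2)   # max over the circle = Euclidean norm of G^2
estimate = acc / n_trials

print(estimate, math.sqrt(math.pi / 2))   # both close to 1.2533
```

The two printed values agree closely; the dependence between nearby hypotheses is exactly what the union bound throws away.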
However, the fact is that we are not dealing with an infinite number of independent random variables: the random variables h_t(G^2) and h_s(G^2) are actually quite dependent on each other when t and s are close.

To exploit the dependencies, the powerful technique of chaining has been developed in high-dimensional probability in order to obtain uniform bounds on random processes, and it has proven successful in a variety of problems including statistical learning. More specifically, chaining is the method for proving the tightest generalization bound using VC-dimension [9], [10]. Originating from the work of Kolmogorov in 1934 (see [9, p. 149]) and later developed by Dudley, Fernique, Talagrand and many others [11], the basic idea of chaining is to first describe the dependencies between the hypotheses by a metric d on the set T, then to discretize T and to approximate the maximal value max_{t∈T} h_t(G^2) by the maxima over successively refined finite discretizations, using union bounds at each step, and by introducing the notion of ε-nets and covering numbers [12]. For instance, with this method one can prove the finite upper bound E[max_{t∈T} h_t(G^2)] ≤ 19.0353. Even for many examples of finite hypothesis sets, chaining is known to give far tighter bounds than the union bound [9]. Next we state a fundamental result which is based on the chaining method. For a metric space (T, d), let N(T, d, ε) denote the covering number of (T, d) at scale ε. For the definitions of ε-net and covering number, see Definition 8 in Section C of the supplementary material, and for the definition of separable subgaussian processes see Definitions 1 and 2.

Theorem 1 (Dudley [13]). Assume that {X_t}_{t∈T} is a separable subgaussian process on the bounded metric space (T, d).
Then

E[sup_{t∈T} X_t] ≤ 6 Σ_{k∈Z} 2^{-k} √(log N(T, d, 2^{-k})).   (1)

Note that PAC-Bayesian bounds, compression bounds and bounds based on uniform stability also do not exploit the dependencies between the hypotheses, as they are not based on any metric on the hypothesis set.

B. Ignoring the dependence between the algorithm input (data) and output. Generalization bounds based on Rademacher complexity2 and VC-dimension only depend on the hypothesis set and not on the algorithm, effectively rendering them too pessimistic for practical algorithms. Recent experimental findings in [1] have shown that in the over-parameterized regime of deep neural nets, such complexity measures give vacuous bounds on the generalization error. A possible explanation for that vacuousness is as follows: if H = {h_t : t ∈ T} denotes the hypothesis set, for every t ∈ T, X_t denotes the generalization error of hypothesis h_t, and W denotes the index of the hypothesis chosen by the algorithm, then to upper bound the expected generalization error E[X_W], one uses

E[X_W] ≤ E[sup_{t∈T} X_t],   (2)

and aims at upper bounding E[sup_{t∈T} X_t] with these bounds, hence giving a uniform bound over the generalization errors of the entire hypothesis set. However, all we need to control is the generalization error of the specific hypothesis W selected by the algorithm. That expected generalization error of W can be much smaller than the right side of (2) (see also [14]). In other words, such bounds do not take into account the input-output relation of the algorithm, and uniform bounding seems too stringent for these applications. Consider the following example (which we refer to as Example II): let X_1, X_2, ..., X_n be standard normal random variables and assume that the algorithm output is index W.
Therefore the expected bias of the algorithm is E[X_W] and the goal is to upper bound it. By the maximal inequality (or equivalently the union bound), we have

E[sup_{1≤i≤n} X_i] ≤ √(2 log n),   (3)

where (3) is asymptotically tight if X_i, i = 1, 2, ..., n, are independent (see [12, Chapter 2]). But what if the algorithm is always more likely to choose W from a small subset of {1, 2, ..., n}? Then E[X_W] could be much smaller than the right side of (3), as the chance of having an outlier value is smaller. Or, if the choice of W does not depend on the data, then E[X_W] = E[E[X_W | W]] = 0. Interestingly, to explain this phenomenon and to obtain tighter upper bounds on E[X_W], an important information-theoretic measure appears: the mutual information. This was originally proposed in the key paper of Russo and Zou [8] and then generalized in [16], [17], and in [18] for an infinite number of hypotheses:

Theorem 2 ([8], [18]). Let {X_t}_{t∈T} be a random process and T an arbitrary set. Assume that X_t is σ²-subgaussian and E[X_t] = 0 for every t ∈ T, and let W be a random variable taking values on T. Then

|E[X_W]| ≤ √(2σ² I(W; {X_t}_{t∈T})).   (4)

In Example II, instead of using (2) and (3), one can have the tighter upper bound

E[X_W] ≤ √(2 I(W; X_1, ..., X_n)).   (5)

For example, if the algorithm chooses W among {1, 2, ..., ⌈log n⌉} with probability 1 − o(1), then (5) implies

E[X_W] ≤ √(2((1 − o(1)) log(log n) + o(1) log(n − log n) + 1)) ≪ √(2 log n).   (6)

However, this method does not give a finite bound for Example I, since

I(argmax_{t∈T} h_t(G^2); {h_t(G^2)}_{t∈T}) = ∞.   (7)

Similarly, as discussed in [19], the mutual information bound for perturbed SGD, or for any iterative algorithm which adds degenerate noise in each iteration, blows up, and information-theoretic strategies for analyzing the generalization error of such algorithms have not been reported.

2Here we are referring to the Rademacher average of the entire hypothesis set. There exist other notions of Rademacher averages which are used in algorithm-dependent bounds, such as local Rademacher complexities [15].

1.2 This paper

By combining the ideas of the chaining method and the mutual information method, in this paper we obtain a chained mutual information bound on the expected generalization error which takes into account the dependencies between the hypotheses as well as the dependence between the output and the input of the algorithm. When applied to the two aforementioned simple examples (Examples I and II), our bound recovers the better of the classical chaining and classical mutual information bounds. More importantly, we provide examples for which our bound outperforms both of the previous bounds significantly: in Example 1 we provide a family of cases where the chaining method gives a relatively large constant and the mutual information bound blows up, but our bound tends towards zero.
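Returning to Example II, the gap in (6) is easy to evaluate numerically. A minimal sketch (the concrete values δ = 0.01, standing in for the o(1) probability, and n = 10^6 are illustrative assumptions; logs are natural, as everywhere in this paper):

```python
import math

# Example II, eq. (6): if the algorithm picks W from {1, ..., ceil(log n)} with
# probability 1 - delta, then I(W; X_1, ..., X_n) <= H(W), and by the grouping
# bound H(W) <= h_b(delta) + (1-delta)*log(ceil(log n)) + delta*log(n - ceil(log n)),
# with h_b the binary entropy in nats; the "+1" in (6) upper bounds h_b(delta).
# Illustrative assumptions: delta = 0.01 stands in for the o(1) term, n = 10**6.

def h_b(p):
    # binary entropy in nats
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

n = 10**6
delta = 0.01
m = math.ceil(math.log(n))                 # size of the small subset

union_bound = math.sqrt(2 * math.log(n))   # eq. (3)
mi_bound = math.sqrt(2 * ((1 - delta) * math.log(m)
                          + delta * math.log(n - m)
                          + h_b(delta)))   # eq. (5) via the entropy bound

print(union_bound, mi_bound)               # the second is much smaller
```

Even at this moderate n, the mutual information bound is less than half the union bound, and the gap widens as n grows.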
We also discuss how our new bound suggests a possible direction to explain the phenomenon described in [19] (see Remark 3), and to exploit regularization properties of some algorithms (see Section 4).

1.3 Further related literature

In [20], the mutual information between the input and the output of binary classification learning algorithms is used to obtain high-probability generalization bounds.

PAC-Bayesian bounds are another type of algorithm-dependent bounds, concerned with finding high-probability generalization bounds for randomized classifiers [5]. These bounds define a hierarchy over the hypothesis set by using a prior distribution on that set [4]. As discussed in [20], there is a connection and similarity between PAC-Bayesian bounds and the mutual information bound, both using the variational representation of relative entropy in their proofs. In [21] and [22], the authors combine the ideas of PAC-Bayesian bounds with generic chaining and create high-probability bounds for randomized classifiers. Their use of an auxiliary sample set and of the notion of average distance between partitions makes their bounds conceptually different from our work. However, their bounds have the advantage of exploiting the variance of the hypotheses and of giving high-probability results.

In the probability theory literature, Fernique [23] gives upper and lower bounds on the expected bias of an algorithm (or a selection rule) which chooses its output from a Gaussian process, by using a chaining argument while taking into account the marginal distribution of the algorithm output. We further utilize the dependence between the algorithm input and output and the stochasticity of the algorithm, and we give results for more general processes.
However, we only obtain upper bounds in this paper.

1.4 Notation

In the framework of supervised statistical learning, X is the instances domain, Y is the labels domain and Z = X × Y denotes the examples domain. Furthermore, H = {h_w : w ∈ W} is the hypothesis set, where the hypotheses are indexed by an index set W, and there is a nonnegative loss function ℓ : H × Z → R+. A learning algorithm receives the training set S = (Z_1, Z_2, ..., Z_n) of n examples, with i.i.d. random elements drawn from Z with distribution µ. It then picks an element h_W ∈ H as the output hypothesis according to a random transformation P_{W|S} (thus, we allow randomized algorithms). For any w ∈ W, let

L_µ(w) ≜ E[ℓ(h_w, Z)], Z ∼ µ,   (8)

denote the statistical (or population) risk of hypothesis h_w. For a given training set S, the empirical risk of hypothesis h_w is defined as

L_S(w) ≜ (1/n) Σ_{i=1}^n ℓ(h_w, Z_i),   (9)

and the generalization error of hypothesis h_w (dependent on the training set) is defined as

gen(w) ≜ L_µ(w) − L_S(w).   (10)

Averaging with respect to the joint distribution P_{S,W} = µ^{⊗n} P_{W|S}, we denote the expected generalization error and the expected absolute value of the generalization error by

gen(µ, P_{W|S}) ≜ E[L_µ(W) − L_S(W)],   (11)

and

gen+(µ, P_{W|S}) ≜ E[|L_µ(W) − L_S(W)|],   (12)

respectively. Our purpose is to find upper bounds on gen(µ, P_{W|S}) and gen+(µ, P_{W|S}).

Let X_N ≜ {X_i : i ∈ N} denote a random process indexed by the elements of the set N. Let 0 denote the identically zero function. In this paper, all logarithms are in natural base and all information-theoretic measures are in nats.
H(X) denotes the Shannon entropy of a discrete random variable X, and h(Y) denotes the differential entropy of an absolutely continuous random variable Y.

2 Main results

Assume that {X_t}_{t∈T} is a random process with index set T. In the chaining method, we impose a metric d on T which describes the dependencies between the random variables. The widely used subgaussian processes capture this notion, and they arise in many applications:

Definition 1 (Subgaussian process). The random process {X_t}_{t∈T} on the metric space (T, d) is called subgaussian if E[X_t] = 0 for all t ∈ T and

E[e^{λ(X_t − X_s)}] ≤ e^{λ²d²(t,s)/2} for all t, s ∈ T, λ ≥ 0.   (13)

For example, based on the Azuma-Hoeffding inequality, {gen(w)}_{w∈W} is a subgaussian process with the metric

d(gen(w), gen(v)) ≜ ‖ℓ(h_w, ·) − ℓ(h_v, ·)‖_∞ / √n,   (14)

regardless of the choice of distribution µ on Z.

The following is a technical assumption which holds in almost all cases of interest:

Definition 2 (Separable process). The random process {X_t}_{t∈T} is called separable if there is a countable set T_0 ⊆ T such that X_t ∈ lim_{s→t, s∈T_0} X_s for all t ∈ T a.s., where x ∈ lim_{s→t, s∈T_0} x_s means that there is a sequence (s_n) in T_0 such that s_n → t and x_{s_n} → x.

For example, if t → X_t is continuous a.s., then X_t is a separable process [9].

Our main results rely on the notion of an increasing sequence of ε-partitions of the metric space (T, d):

Definition 3 (Increasing sequence of ε-partitions). We call a partition P = {A_1, A_2, ..., A_m} of the set T an ε-partition of the metric space (T, d) if for all i = 1, 2, ..., m, A_i can be contained within a ball of radius ε.
A sequence of partitions {P_k}_{k=m}^∞ of a set T is called an increasing sequence if for all k ≥ m and each A ∈ P_{k+1}, there exists B ∈ P_k such that A ⊆ B. For any such sequence and any t ∈ T, let [t]_k denote the unique set A ∈ P_k such that t ∈ A.

Assume now that (T, d) is a bounded metric space, and let k_1(T) be an integer such that 2^{−(k_1(T)−1)} ≥ diam(T). We have the following upper bounds on gen(µ, P_{W|S}) and gen+(µ, P_{W|S}) based on the mutual information between the training set S and the discretized output of the learning algorithm, where each of these mutual information terms is multiplied by an exponentially decreasing weight 2^{-k}, in which the exponent measures how finely the output W of the learning algorithm is discretized:

Theorem 3. Assume that {gen(w)}_{w∈W} is a separable subgaussian process on the bounded metric space (W, d). Let {P_k}_{k=k_1(W)}^∞ be an increasing sequence of partitions of W, where for each k ≥ k_1(W), P_k is a 2^{-k}-partition of (W, d).

(a)

gen(µ, P_{W|S}) ≤ 3√2 Σ_{k=k_1(W)}^∞ 2^{-k} √(I([W]_k; S)).   (15)

(b) If 0 ∈ {ℓ(h_w, ·) : w ∈ W}, then

gen+(µ, P_{W|S}) ≤ 3√2 Σ_{k=k_1(W)}^∞ 2^{-k} √(I([W]_k; S) + log 2).   (16)

Remark 1. Based on the general definition of mutual information via partitions ([24, p. 252]), we have I(W; S) = sup_k I([W]_k; S); therefore I([W]_k; S) → I(W; S) as k → ∞.

Theorem 3 is stated in the context of statistical learning. Its more general counterpart in the context of random processes is:

Theorem 4. Assume that {X_t}_{t∈T} is a separable subgaussian process on the bounded metric space (T, d).
Let {P_k}_{k=k_1(T)}^∞ be an increasing sequence of partitions of T, where for each k ≥ k_1(T), P_k is a 2^{-k}-partition of (T, d).

(a)

E[X_W] ≤ 3√2 Σ_{k=k_1(T)}^∞ 2^{-k} √(I([W]_k; X_T)).   (17)

(b) For any arbitrary t_0 ∈ T,

E[|X_W − X_{t_0}|] ≤ 3√2 Σ_{k=k_1(T)}^∞ 2^{-k} √(I([W]_k; X_T) + log 2).   (18)

Note that in Theorem 4, if we let T ≜ W and X_w ≜ gen(w) for all w ∈ W, then for each k ≥ k_1(T), due to the Markov chain

X_T = {gen(w)}_{w∈W} ↔ S ↔ W ↔ [W]_k,   (19)

and the data processing inequality, we have I([W]_k; X_T) ≤ I([W]_k; S). Therefore Theorem 3 follows from Theorem 4. The proof of Theorem 4 and the etymology of "chaining mutual information" are given in Section 3.

Remark 2. For random processes other than subgaussian processes, where the tails of the increments are controlled by a function ψ, similar results can be derived from Theorem 12 in Section D of the supplementary material.

Both Theorem 3 and Theorem 4 capture the dependencies between the hypotheses by utilizing a metric d, and they are algorithm-dependent, as the mutual information between the algorithm's discretized output and its input appears in their bounds. Now, to demonstrate the power of Theorem 4 and to compare it with existing results in the literature, consider the following example:

Example 1. Let T be an arbitrary subset of R^n, and let G^n ≜ (G_1, ..., G_n) ∼ N(0, I_n) be a standard normal random vector in R^n. The canonical Gaussian process is defined as {X_t}_{t∈T}, where

X_t ≜ ⟨t, G^n⟩ for all t ∈ T.   (20)

Note that {X_t}_{t∈T} is a subgaussian process on the metric space (T, d), where d is the Euclidean distance.

Consider a canonical Gaussian process where n = 2 and T ≜ {t ∈ R^2 : ‖t‖_2 = 1}.
The process {X_t}_{t∈T} can be reparameterized according to the phase of each point t ∈ T: the random variable X_t can also be denoted by X_φ, where φ ∈ [0, 2π) is the phase of t, i.e., the unique number in [0, 2π) such that t = (sin φ, cos φ). Henceforth, we will assume the indices are in phase form.

Let the relation between the input X_T of the algorithm and its output W be

W ≜ (argmax_{φ∈[0,2π)} X_φ) ⊕ Z (mod 2π),   (21)

where the noise Z is independent of X_T, has an atom with probability mass ε at 0, and with probability 1 − ε is uniformly distributed on (−π, π). Note that since Z has a singular (degenerate) part, h(Z) = −∞.

Due to symmetry, W has uniform distribution over [0, 2π). But we have

I(W; X_T) = h(W) − h(W | X_T)   (22)
= log 2π − h((argmax_{φ∈[0,2π)} X_φ ⊕ Z) | X_T)   (23)
= log 2π − h(Z | X_T)   (24)
= log 2π − h(Z)   (25)
= ∞.   (26)

Hence the upper bound on E[X_W] from the mutual information method (Theorem 2) blows up:

E[X_W] ≤ √(2 I(W; X_T)) = ∞.   (27)

Note that 2^{−(−2)} ≥ diam(T) = 2. Therefore let k_1(T) ← −1, and for all integers k ≥ −1 define

P_k ≜ { [0, 2π/2^{k+2}), [2π/2^{k+2}, 2 × 2π/2^{k+2}), ..., [(2^{k+2} − 1) 2π/2^{k+2}, 2π) }.   (28)

It is clear that {P_k}_{k=−1}^∞ is an increasing sequence of partitions of T. Furthermore, for each k ≥ −1, the length of the arc of each set in P_k is δ_k ≜ 2π/2^{k+2} < 2^{1−k}. Thus each P_k is a 2^{-k}-partition of (T, d) and |P_k| = 2^{k+2} (see Figure 1).

Now, using the classical chaining method (Theorem 1) to upper bound E[X_W] by upper bounding E[sup_{φ∈[0,2π)} X_φ] and ignoring the algorithm, we get

E[X_W] ≤ E[sup_{φ∈[0,2π)} X_φ]   (29)
≤ 3√2 Σ_{k=−1}^∞ 2^{-k} √(log 2^{k+2})   (30)
= 19.0352...3   (31)

Figure 1: Depiction of T, P_{−1}, P_0 and P_1 in the R^2 plane. (The three partitions are magnified for clarity.)

On the other hand, for every k ≥ −1 we have

I([W]_k; X_T) = H([W]_k) − H([W]_k | X_T)   (32)
= log 2^{k+2} − H([(argmax_{φ∈[0,2π)} X_φ) ⊕ Z]_k | X_T)   (33)
= log 2^{k+2} − H(ε + (1−ε)/2^{k+2}, (1−ε)/2^{k+2}, ..., (1−ε)/2^{k+2}).   (34)

Therefore, based on the chained mutual information method (Theorem 4), we have

E[X_W] ≤ 3√2 Σ_{k=−1}^∞ 2^{-k} √(I([W]_k; X_T))   (35)
= 3√2 Σ_{k=−1}^∞ 2^{-k} √(log 2^{k+2} − H(ε + (1−ε)/2^{k+2}, (1−ε)/2^{k+2}, ..., (1−ε)/2^{k+2})).   (36)

Numerical values of the right side of (36) for different values of ε are given in Table 1 (CMI bound). Note that indeed I([W]_k; X_T) → I(W; X_T) = ∞ as k → ∞. However, the slow rate of that convergence and the presence of the 2^{-k} factor make the sum not only finite, but very small. In fact, as ε → 0, the right side of (36) tends to 0 as well.

It is interesting to notice that for this toy example, the exact values of E[sup_{φ∈[0,2π)} X_φ] and E[X_W] can be computed. Since sup_{φ∈[0,2π)} X_φ has a Rayleigh distribution, we have E[sup_{φ∈[0,2π)} X_φ] = √(π/2) = 1.253... . Since the noise Z is independent of X_T, the effect of its continuous part cancels out, and we have E[X_W] = ε √(π/2). See Table 1.

3The exact value of the bound in Theorem 1 is slightly smaller, since with our partitions we are using a rough approximation for the covering numbers. For example, at scale 2^{−(−1)}, the covering number is 1, while we have used the partition P_{−1} with |P_{−1}| = 2 sets.

Table 1: E[X_W] and its upper bounds

ε                1/20     1/30     1/40     1/50     1/100    1/200    1/400
√(2 I(W; X_T))   ∞        ∞        ∞        ∞        ∞        ∞        ∞
Chaining bound   19.0352  19.0352  19.0352  19.0352  19.0352  19.0352  19.0352
CMI bound        1.1013   0.7507   0.5709   0.4612   0.2364   0.1204   0.0610
E[X_W]           0.0626   0.0417   0.0313   0.0250   0.0125   0.0062   0.0031

Remark 3. Notice that in Example 1 there exists an independent additive noise term Z which has a degenerate part, causing the mutual information bound to blow up. Similarly, as discussed in [19], the mutual information bound for perturbed SGD, or for any iterative algorithm which adds degenerate noise in each iteration, blows up. Example 1 illustrates that combining the mutual information method with the chaining method, as in our bound, could give tight generalization bounds for such algorithms as well.

Remark 4.
It is clear that having degenerate noise is not necessary for the chained mutual information bound to be tighter than the mutual information bound; this is just an extreme case, in which the mutual information bound blows up. For instance, in Example 1, one can replace Z with a sequence of continuous random variables which converge to Z in distribution.

3 Proof outline

Here we provide an outline of the proof of Theorem 4. As noted in Section 2, Theorem 3 follows from Theorem 4.

For an arbitrary k ≥ k_1(T), consider P_k = {A_1, A_2, ..., A_m}. Since P_k is a 2^{-k}-partition of (T, d), by definition there exist a set (or multiset) N_k ≜ {a_1, a_2, ..., a_m} ⊆ T and a mapping π_{N_k} : T → N_k such that π_{N_k}(t) = a_i if t ∈ A_i, and further d(t, π_{N_k}(t)) ≤ 2^{-k}, for all i = 1, 2, ..., m. Therefore N_k is a 2^{-k}-net and π_{N_k} is its associated mapping. It is also clear that for an arbitrary t_0 ∈ T, N_{k_0} ≜ {t_0} is a 2^{−(k_1(T)−1)}-net. Note that for any integer n ≥ k_1(T) we can write

X_W = X_{t_0} + Σ_{k=k_1(T)}^n (X_{π_{N_k}(W)} − X_{π_{N_{k−1}}(W)}) + (X_W − X_{π_{N_n}(W)}).   (37)

Since by the definition of subgaussian processes the process is centered, we have E[X_{t_0}] = 0. Thus

E[X_W] − E[X_W − X_{π_{N_n}(W)}] = Σ_{k=k_1(T)}^n E[X_{π_{N_k}(W)} − X_{π_{N_{k−1}}(W)}].   (38)

For every k ≥ k_1(T), {X_{π_{N_k}(t)} − X_{π_{N_{k−1}}(t)}}_{t∈T} is a subgaussian process with at most |N_k||N_{k−1}| distinct terms, hence a finite process.
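The telescoping identity (37) can be sanity-checked numerically on the circle process of Example 1. A minimal sketch (choosing the nets N_k as the arc midpoints of the partitions in (28), and t_0 as phase 0, are illustrative choices of ours, not the paper's):

```python
import math, random

# Sanity check of the telescoping identity (37) on the circle process.
# pi_k maps a phase to the midpoint of its cell in the 2^{-k}-partition P_k of
# (28); the net N_{k1-1} is the single point t0 (here t0 = phase 0).

def pi_k(phi, k):
    delta = 2 * math.pi / 2 ** (k + 2)              # arc length of cells of P_k
    return (math.floor(phi / delta) + 0.5) * delta  # midpoint of the cell [phi]_k

def X(phi, g):                      # canonical Gaussian process on the circle
    return math.sin(phi) * g[0] + math.cos(phi) * g[1]

rng = random.Random(1)
g = (rng.gauss(0, 1), rng.gauss(0, 1))
W = rng.uniform(0, 2 * math.pi)
k1, n = -1, 20
t0 = 0.0

chain = [t0] + [pi_k(W, k) for k in range(k1, n + 1)]  # t0, pi_{-1}(W), ..., pi_n(W)
total = X(t0, g) \
        + sum(X(b, g) - X(a, g) for a, b in zip(chain, chain[1:])) \
        + (X(W, g) - X(pi_k(W, n), g))

print(abs(total - X(W, g)))         # telescopes to X_W (up to float rounding)

# each pi_k(W) is within chord distance 2^{-k} of W, as the chaining needs
for k in range(k1, n + 1):
    arc = abs(W - pi_k(W, k))       # at most delta_k / 2
    chord = 2 * math.sin(min(arc, 2 * math.pi - arc) / 2)
    assert chord <= 2 ** -k
```

The printed residual is at machine precision for any realization of the process, which is all that (37) asserts.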
Based on the triangle inequality,

d(π_{N_k}(t), π_{N_{k−1}}(t)) ≤ d(t, π_{N_k}(t)) + d(t, π_{N_{k−1}}(t))   (39)
≤ 3 × 2^{-k}.

Note that knowing the value of (π_{N_k}(W), π_{N_{k−1}}(W)) is enough to determine which one of the random variables {X_{π_{N_k}(t)} − X_{π_{N_{k−1}}(t)}}_{t∈T} is chosen according to W. Since X_{π_{N_k}(t)} − X_{π_{N_{k−1}}(t)} is d²(π_{N_k}(t), π_{N_{k−1}}(t))-subgaussian, based on Theorem 2, an application of the data processing inequality and summation over k, we obtain

Σ_{k=k_1(T)}^n E[X_{π_{N_k}(W)} − X_{π_{N_{k−1}}(W)}] ≤ 3√2 Σ_{k=k_1(T)}^n 2^{-k} √(I(π_{N_k}(W), π_{N_{k−1}}(W); X_T)).   (40)

Notice the chain of mutual information terms on the right side of (40). Since {P_k}_{k=k_1(T)}^∞ is an increasing sequence of partitions, for any t ∈ T, knowing π_{N_k}(t) uniquely determines π_{N_{k−1}}(t). Therefore

I(π_{N_k}(W), π_{N_{k−1}}(W); X_T) = I(π_{N_k}(W); X_T)   (41)
= I([W]_k; X_T).   (42)

The rest of the proof follows from the definition of separable processes (Definition 2). For more details, see the proof of Theorem 11 in Section D of the supplementary material.

4 Additional result: small subset property

We adjusted the conservative chaining method of random process theory to learning problems by taking into account information about the algorithm, with the chained mutual information method.
In this section, we state a result in which such information can make the bounds much tighter.

It is known that for linear models, the stochastic gradient descent (SGD) algorithm always converges to a solution with small norm [1]. Inspired by this observation, we tighten Dudley's inequality (Theorem 1) under the following regularization property: with high probability, the output W of the algorithm is a hypothesis from a subset of the hypothesis set with small covering numbers.

Theorem 5 (Small subset property). Assume that {X_t}_{t∈T} is a separable subgaussian process on the bounded metric space (T, d). Let {T_1, T_2} be a partition of T and assume that W is a random variable taking values on T with P[W ∈ T_1] = α. Then we have

E[X_W] ≤ 6 Σ_{k=k_1(T)}^∞ 2^{-k} √(α log N(T_1, d, 2^{-k}) + (1 − α) log N(T_2, d, 2^{-k}) + H(α)),   (43)

where H(α) denotes the binary entropy of α in nats.

The proof of Theorem 5 appears in Section D of the supplementary material. Note that the right side of (43) becomes much smaller than Dudley's bound when α is close to 1 and the covering numbers of T_1 (the small subset) are much smaller than those of T_2.

Remark 5. One can upper bound the right side of (43) by replacing N(T_2, d, 2^{-k}) with N(T, d, 2^{-k}). This is particularly useful when bounding the latter is easier than the former.

5 Conclusion

We combined ideas from information theory and from high-dimensional probability to obtain a generalization bound that takes into account both the dependencies between the hypotheses and the dependence between the input and the output of a learning algorithm. We showed on an example that our chained mutual information bound significantly outperforms previous bounds and gets close to the true generalization error.
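For instance, the Example 1 comparison of Table 1 can be reproduced numerically from (30), (36), and the exact bias E[X_W] = ε √(π/2). A minimal sketch (truncating the sums at K = 60 is an illustrative choice; the 2^{-k} factor makes the remaining tail negligible):

```python
import math

# Reproduce the Example 1 numbers of Table 1: the chaining bound (30)-(31),
# the chained mutual information (CMI) bound (36), and the exact bias
# E[X_W] = eps * sqrt(pi/2). Sums over k are truncated at K = 60.

K = 60

def chaining_bound():
    # (30): 3*sqrt(2) * sum_{k >= -1} 2^{-k} * sqrt(log 2^{k+2})
    return 3 * math.sqrt(2) * sum(
        2.0 ** -k * math.sqrt((k + 2) * math.log(2)) for k in range(-1, K))

def cmi_bound(eps):
    # (36): given X_T, the cell of the argmax has probability eps + (1-eps)/m
    # and each of the other m - 1 cells has (1-eps)/m, with m = 2^{k+2}.
    # We evaluate log m - H(...) as p0*log(m*p0) + (1-p0)*log(1-eps), i.e. the
    # divergence of that pmf from uniform, which is numerically stabler.
    total = 0.0
    for k in range(-1, K):
        m = 2.0 ** (k + 2)
        p0 = eps + (1 - eps) / m
        kl = p0 * math.log(m * p0) + (1 - p0) * math.log(1 - eps)
        total += 2.0 ** -k * math.sqrt(kl)
    return 3 * math.sqrt(2) * total

for eps in (1 / 20, 1 / 100):
    print(chaining_bound(), cmi_bound(eps), eps * math.sqrt(math.pi / 2))
```

The CMI column shrinks toward the exact bias as ε → 0, while the chaining bound stays constant and the unchained mutual information bound is infinite.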
Under a natural regularization property of the learning algorithm, namely that it chooses its output from a small subset of hypotheses with high probability, we provided a corollary of our bound that tightens Dudley's inequality.

6 Acknowledgments

We gratefully acknowledge discussions with Ramon van Handel on the topic of chaining. This work was partly supported by the NSF CAREER Award CCF-1552131.

References

[1] C. Zhang, S. Bengio, M. Hardt, B. Recht and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), Apr. 2017.

[2] M. Belkin, S. Ma, and S. Mandal. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.

[3] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, pages 169–207. Springer, 2004.

[4] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[5] D. A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.

[6] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

[7] N. Littlestone and M. Warmuth. Relating data compression and learnability. Technical report, University of California, Santa Cruz, 1986.

[8] D. Russo and J. Zou. How much does your data exploration overfit? Controlling bias via information usage. arXiv preprint arXiv:1511.05219, 2015.

[9] R. van Handel. Probability in high dimension. [Online]. Available: https://www.princeton.edu/~rvan/APC550.pdf, Dec. 21, 2016.

[10] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. 
Cambridge University Press, 2018.

[11] M. Talagrand. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems, volume 60. Springer Science & Business Media, 2014.

[12] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[13] R. M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.

[14] K. Kawaguchi, L. P. Kaelbling and Y. Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.

[15] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.

[16] J. Jiao, Y. Han and T. Weissman. Dependence measures bounding the exploration bias for general measurements. In Proc. IEEE International Symposium on Information Theory (ISIT), pages 1475–1479, Aachen, Germany, June 2017.

[17] J. Jiao, Y. Han and T. Weissman. Generalizations of maximal inequalities to arbitrary selection rules. arXiv preprint arXiv:1708.09041, 2017.

[18] A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 2524–2533, Dec. 2017.

[19] A. Pensia, V. Jog and P. Loh. Generalization error bounds for noisy, iterative algorithms. arXiv preprint arXiv:1801.04295, 2018.

[20] R. Bassily, S. Moran, I. Nachum, J. Shafer and A. Yehudayoff. Learners that leak little information. arXiv preprint arXiv:1710.05233, 2017.

[21] J. Audibert and O. Bousquet. PAC-Bayesian generic chaining. In Advances in Neural Information Processing Systems (NIPS), pages 1125–1132, 2004.

[22] J. Audibert and O. Bousquet. Combining PAC-Bayesian and generic chaining bounds. 
Journal of Machine Learning Research, 8(Apr):863–889, 2007.

[23] X. Fernique. Évaluations de processus gaussiens composés. In Probability in Banach Spaces, pages 67–83. Springer, 1976.

[24] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.