{"title": "Information-theoretic analysis of generalization capability of learning algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 2524, "page_last": 2533, "abstract": "We derive upper bounds on the generalization error of a learning algorithm in terms of the mutual information between its input and output. The bounds provide an information-theoretic understanding of generalization in learning problems, and give theoretical guidelines for striking the right balance between data fit and generalization by controlling the input-output mutual information.  We propose a number of methods for this purpose, among which are algorithms that regularize the ERM algorithm with relative entropy or with random noise. Our work extends and leads to nontrivial improvements on the recent results of Russo and Zou.", "full_text": "Information-theoretic analysis of generalization capability of learning algorithms\n\nAolin Xu\n\nMaxim Raginsky\n\n{aolinxu2,maxim}@illinois.edu *\n\nAbstract\n\nWe derive upper bounds on the generalization error of a learning algorithm in\nterms of the mutual information between its input and output. The bounds provide\nan information-theoretic understanding of generalization in learning problems,\nand give theoretical guidelines for striking the right balance between data fit and\ngeneralization by controlling the input-output mutual information. We propose a\nnumber of methods for this purpose, among which are algorithms that regularize\nthe ERM algorithm with relative entropy or with random noise. 
Our work extends\nand leads to nontrivial improvements on the recent results of Russo and Zou.\n\n1 Introduction\n\nA learning algorithm can be viewed as a randomized mapping, or a channel in the information-theoretic\nlanguage, which takes a training dataset as input and generates a hypothesis as output.\nThe generalization error is the difference between the population risk of the output hypothesis and\nits empirical risk on the training data. It measures how much the learned hypothesis suffers from\noverfitting. The traditional way of analyzing the generalization error relies either on certain complexity\nmeasures of the hypothesis space, e.g., the VC dimension and the Rademacher complexity [1], or\non certain properties of the learning algorithm, e.g., uniform stability [2]. Recently, motivated\nby improving the accuracy of adaptive data analysis, Russo and Zou [3] showed that the mutual\ninformation between the collection of empirical risks of the available hypotheses and the final output\nof the algorithm can be used effectively to analyze and control the bias in data analysis, which is\nequivalent to the generalization error in learning problems. Compared to the methods of analysis\nbased on differential privacy, e.g., by Dwork et al. [4, 5] and Bassily et al. [6], the method proposed\nin [3] is simpler and can handle unbounded loss functions; moreover, it provides elegant information-theoretic\ninsights into improving the generalization capability of learning algorithms. In a similar\ninformation-theoretic spirit, Alabdulmohsin [7, 8] proposed to bound the generalization error in\nlearning problems using the total-variation information between a random instance in the dataset and\nthe output hypothesis, but the analysis applies only to bounded loss functions.\nIn this paper, we follow the information-theoretic framework proposed by Russo and Zou [3] to\nderive upper bounds on the generalization error of learning algorithms. 
We extend the results in [3]\nto the situation where the hypothesis space is uncountably infinite, and provide improved upper\nbounds on the expected absolute generalization error. We also obtain concentration inequalities for\nthe generalization error, which were not given in [3]. While the main quantity examined in [3] is the\nmutual information between the collection of empirical risks of the hypotheses and the output of the\nalgorithm, we mainly focus on relating the generalization error to the mutual information between the\ninput dataset and the output of the algorithm, which formalizes the intuition that the less information\na learning algorithm can extract from the input dataset, the less it will overfit. This viewpoint\nprovides theoretical guidelines for striking the right balance between data fit and generalization by\ncontrolling the algorithm\u2019s input-output mutual information. For example, we show that regularizing\nthe empirical risk minimization (ERM) algorithm with the input-output mutual information leads to\nthe well-known Gibbs algorithm. As another example, regularizing the ERM algorithm with random\nnoise can also control the input-output mutual information. For both the Gibbs algorithm and the\nnoisy ERM algorithm, we also discuss how to calibrate the regularization in order to incorporate\nany prior knowledge of the population risks of the hypotheses into algorithm design.\n\n* Department of Electrical and Computer Engineering and Coordinated Science Laboratory, University of\nIllinois, Urbana, IL 61801, USA. This work was supported in part by the NSF CAREER award CCF-1254041\nand in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under\ngrant agreement CCF-0939370.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n
Additionally,\nwe discuss adaptive composition of learning algorithms, and show that the generalization capability\nof the overall algorithm can be analyzed by examining the input-output mutual information of the\nconstituent algorithms.\nAnother advantage of relating the generalization error to the input-output mutual information is that\nthe latter quantity depends on all ingredients of the learning problem, including the distribution of\nthe dataset, the hypothesis space, the learning algorithm itself, and potentially the loss function, in\ncontrast to the VC dimension or the uniform stability, which only depend on the hypothesis space or\non the learning algorithm. As the generalization error can strongly depend on the input dataset [9],\nthe input-output mutual information can be more tightly coupled to the generalization error than the\ntraditional generalization-guaranteeing quantities of interest. We hope that our work can provide\nsome information-theoretic understanding of generalization in modern learning problems, which may\nnot be sufficiently addressed by the traditional analysis tools [9].\nFor the rest of this section, we define the quantities that will be used in the paper. In the standard\nframework of statistical learning theory [10], there is an instance space $\\mathcal{Z}$, a hypothesis space $\\mathcal{W}$,\nand a nonnegative loss function $\\ell : \\mathcal{W} \\times \\mathcal{Z} \\to \\mathbb{R}^+$. A learning algorithm characterized by a Markov\nkernel $P_{W|S}$ takes as input a dataset of size $n$, i.e., an $n$-tuple\n\n$$S = (Z_1, \\ldots, Z_n) \\tag{1}$$\n\nof i.i.d. random elements of $\\mathcal{Z}$ with some unknown distribution $\\mu$, and picks a random element $W$ of\n$\\mathcal{W}$ as the output hypothesis according to $P_{W|S}$. The population risk of a hypothesis $w \\in \\mathcal{W}$ on $\\mu$ is\n\n$$L_\\mu(w) \\triangleq \\mathbb{E}[\\ell(w, Z)] = \\int_{\\mathcal{Z}} \\ell(w, z)\\,\\mu(dz). \\tag{2}$$\n\nThe goal of learning is to ensure that the population risk of the output hypothesis $W$ is small, either in\nexpectation or with high probability, under any data-generating distribution $\\mu$. 
The excess risk of $W$\nis the difference $L_\\mu(W) - \\inf_{w \\in \\mathcal{W}} L_\\mu(w)$, and its expected value is denoted as $R_{\\mathrm{excess}}(\\mu, P_{W|S})$.\nSince $\\mu$ is unknown, the learning algorithm cannot directly compute $L_\\mu(w)$ for any $w \\in \\mathcal{W}$, but can\ninstead compute the empirical risk of $w$ on the dataset $S$ as a proxy, defined as\n\n$$L_S(w) \\triangleq \\frac{1}{n} \\sum_{i=1}^{n} \\ell(w, Z_i). \\tag{3}$$\n\nFor a learning algorithm characterized by $P_{W|S}$, the generalization error on $\\mu$ is the difference\n$L_\\mu(W) - L_S(W)$, and its expected value is denoted as\n\n$$\\mathrm{gen}(\\mu, P_{W|S}) \\triangleq \\mathbb{E}[L_\\mu(W) - L_S(W)], \\tag{4}$$\n\nwhere the expectation is taken with respect to the joint distribution $P_{S,W} = \\mu^{\\otimes n} \\otimes P_{W|S}$. The\nexpected population risk can then be decomposed as\n\n$$\\mathbb{E}[L_\\mu(W)] = \\mathbb{E}[L_S(W)] + \\mathrm{gen}(\\mu, P_{W|S}), \\tag{5}$$\n\nwhere the first term reflects how well the output hypothesis fits the dataset, while the second term\nreflects how well the output hypothesis generalizes. To minimize $\\mathbb{E}[L_\\mu(W)]$ we need both terms in\n(5) to be small. However, it is generally impossible to minimize the two terms simultaneously, and\nany learning algorithm faces a trade-off between the empirical risk and the generalization error. In\nwhat follows, we will show how the generalization error can be related to the mutual information\nbetween the input and output of the learning algorithm, and how we can use these relationships to\nguide the algorithm design to reduce the population risk by balancing fitting and generalization.\n\n2 Algorithmic stability in input-output mutual information\n\nAs discussed above, having a small generalization error is crucial for a learning algorithm to produce\nan output hypothesis with a small population risk. It turns out that the generalization error of a learning\nalgorithm can be determined by its stability properties. 
Traditionally, a learning algorithm is said to be\nstable if a small change of the input to the algorithm does not change the output of the algorithm much.\nExamples include uniform stability defined by Bousquet and Elisseeff [2] and on-average stability\ndefined by Shalev-Shwartz et al. [11]. In recent years, information-theoretic stability notions, such as\nthose measured by differential privacy [5], KL divergence [6, 12], total-variation information [7], and\nerasure mutual information [13], have been proposed. All existing notions of stability show that the\ngeneralization capability of a learning algorithm hinges on how sensitive the output of the algorithm is\nto local modifications of the input dataset. It implies that the less dependent the output hypothesis $W$\nis on the input dataset $S$, the better the learning algorithm generalizes. From an information-theoretic\npoint of view, the dependence between $S$ and $W$ can be naturally measured by the mutual information\nbetween them, which prompts the following information-theoretic definition of stability. We say that\na learning algorithm is $(\\varepsilon, \\mu)$-stable in input-output mutual information if, under the data-generating\ndistribution $\\mu$,\n\n$$I(S; W) \\le \\varepsilon. \\tag{6}$$\n\nFurther, we say that a learning algorithm is $\\varepsilon$-stable in input-output mutual information if\n\n$$\\sup_{\\mu} I(S; W) \\le \\varepsilon. \\tag{7}$$\n\nAccording to the definitions in (6) and (7), the less information the output of a learning algorithm can\nprovide about its input dataset, the more stable it is. Interestingly, if we view the learning algorithm\n$P_{W|S}$ as a channel from $\\mathcal{Z}^n$ to $\\mathcal{W}$, the quantity $\\sup_\\mu I(S; W)$ can be viewed as the information\ncapacity of the channel, under the constraint that the input distribution is of a product form. 
The\ndefinition in (7) means that a learning algorithm is more stable if its information capacity is smaller.\nThe advantage of the weaker definition in (6) is that $I(S; W)$ depends on both the algorithm and the\ndistribution of the dataset. Therefore, it can be more tightly coupled with the generalization error,\nwhich itself depends on the dataset. We mainly focus on studying the consequence of this notion of\n$(\\varepsilon, \\mu)$-stability in input-output mutual information for the rest of this paper.\n\n3 Upper-bounding generalization error via $I(S; W)$\n\nIn this section, we derive various generalization guarantees for learning algorithms that are stable in\ninput-output mutual information.\n\n3.1 A decoupling estimate\nWe start with a digression from the statistical learning problem to a more general problem, which may\nbe of independent interest. Consider a pair of random variables $X$ and $Y$ with joint distribution $P_{X,Y}$.\nLet $\\bar{X}$ be an independent copy of $X$, and $\\bar{Y}$ an independent copy of $Y$, such that $P_{\\bar{X},\\bar{Y}} = P_X \\otimes P_Y$.\nFor an arbitrary real-valued function $f : \\mathcal{X} \\times \\mathcal{Y} \\to \\mathbb{R}$, we have the following upper bound on the\nabsolute difference between $\\mathbb{E}[f(X, Y)]$ and $\\mathbb{E}[f(\\bar{X}, \\bar{Y})]$.\nLemma 1 (proved in Appendix A). If $f(\\bar{X}, \\bar{Y})$ is $\\sigma$-subgaussian under $P_{\\bar{X},\\bar{Y}} = P_X \\otimes P_Y$, then\n\n$$\\big|\\mathbb{E}[f(X, Y)] - \\mathbb{E}[f(\\bar{X}, \\bar{Y})]\\big| \\le \\sqrt{2\\sigma^2 I(X; Y)}. \\tag{8}$$\n\n(Recall that a random variable $U$ is $\\sigma$-subgaussian if $\\log \\mathbb{E}[e^{\\lambda(U - \\mathbb{E}U)}] \\le \\lambda^2\\sigma^2/2$ for all $\\lambda \\in \\mathbb{R}$.)\n\n3.2 Upper bound on expected generalization error\nUpper-bounding the generalization error of a learning algorithm $P_{W|S}$ can be cast as a special case of\nthe preceding problem, by setting $X = S$, $Y = W$, and $f(s, w) = \\frac{1}{n}\\sum_{i=1}^{n} \\ell(w, z_i)$. For an arbitrary\n$w \\in \\mathcal{W}$, the empirical risk can be expressed as $L_S(w) = f(S, w)$ and the population risk can be\nexpressed as $L_\\mu(w) = \\mathbb{E}[f(S, w)]$. 
Moreover, the expected generalization error can be written as\n\n$$\\mathrm{gen}(\\mu, P_{W|S}) = \\mathbb{E}[f(\\bar{S}, \\bar{W})] - \\mathbb{E}[f(S, W)], \\tag{9}$$\n\nwhere the joint distribution of $S$ and $W$ is $P_{S,W} = \\mu^{\\otimes n} \\otimes P_{W|S}$. If $\\ell(w, Z)$ is $\\sigma$-subgaussian for all\n$w \\in \\mathcal{W}$, then $f(S, w)$ is $\\sigma/\\sqrt{n}$-subgaussian due to the i.i.d. assumption on the $Z_i$\u2019s, hence $f(\\bar{S}, \\bar{W})$ is\n$\\sigma/\\sqrt{n}$-subgaussian. This, together with Lemma 1, leads to the following theorem.\nTheorem 1. Suppose $\\ell(w, Z)$ is $\\sigma$-subgaussian under $\\mu$ for all $w \\in \\mathcal{W}$, then\n\n$$\\big|\\mathrm{gen}(\\mu, P_{W|S})\\big| \\le \\sqrt{\\frac{2\\sigma^2}{n} I(S; W)}. \\tag{10}$$\n\nTheorem 1 suggests that, by controlling the mutual information between the input and the output\nof a learning algorithm, we can control its generalization error. The theorem allows us to consider\nunbounded loss functions as long as the subgaussian condition is satisfied. For a bounded loss\nfunction $\\ell(\\cdot,\\cdot) \\in [a, b]$, $\\ell(w, Z)$ is guaranteed to be $(b - a)/2$-subgaussian for all $\\mu$ and all $w \\in \\mathcal{W}$.\nRusso and Zou [3] considered the same problem setup with the restriction that the hypothesis space\n$\\mathcal{W}$ is finite, and showed that $|\\mathrm{gen}(\\mu, P_{W|S})|$ can be upper-bounded in terms of $I(\\Lambda_{\\mathcal{W}}(S); W)$, where\n\n$$\\Lambda_{\\mathcal{W}}(S) \\triangleq \\big(L_S(w)\\big)_{w \\in \\mathcal{W}} \\tag{11}$$\n\nis the collection of empirical risks of the hypotheses in $\\mathcal{W}$. Using Lemma 1 by setting $X = \\Lambda_{\\mathcal{W}}(S)$,\n$Y = W$, and $f(\\Lambda_{\\mathcal{W}}(s), w) = L_s(w)$, we immediately recover the result by Russo and Zou even\nwhen $\\mathcal{W}$ is uncountably infinite:\nTheorem 2 (Russo and Zou [3]). 
Suppose $\\ell(w, Z)$ is $\\sigma$-subgaussian under $\\mu$ for all $w \\in \\mathcal{W}$, then\n\n$$\\big|\\mathrm{gen}(\\mu, P_{W|S})\\big| \\le \\sqrt{\\frac{2\\sigma^2}{n} I(\\Lambda_{\\mathcal{W}}(S); W)}. \\tag{12}$$\n\nIt should be noted that Theorem 1 can be obtained as a consequence of Theorem 2 because\n\n$$I(\\Lambda_{\\mathcal{W}}(S); W) \\le I(S; W), \\tag{13}$$\n\nwhich is due to the Markov chain $\\Lambda_{\\mathcal{W}}(S) - S - W$, as for each $w \\in \\mathcal{W}$, $L_S(w)$ is a function of $S$.\nHowever, if the output $W$ depends on $S$ only through the empirical risks $\\Lambda_{\\mathcal{W}}(S)$, in other words,\nwhen the Markov chain $S - \\Lambda_{\\mathcal{W}}(S) - W$ holds, then Theorem 1 and Theorem 2 are equivalent. The\nadvantage of Theorem 1 is that $I(S; W)$ can be much easier to evaluate than $I(\\Lambda_{\\mathcal{W}}(S); W)$, and can\nprovide better insights to guide the algorithm design. We will elaborate on this when we discuss the\nGibbs algorithm and the adaptive composition of learning algorithms.\nTheorem 1 and Theorem 2 only provide upper bounds on the expected generalization error. We are\noften interested in analyzing the absolute generalization error $|L_\\mu(W) - L_S(W)|$, e.g., its expected\nvalue or the probability for it to be small. We need to develop stronger tools to tackle these problems,\nwhich is the subject of the next two subsections.\n\n3.3 A concentration inequality for $|L_\\mu(W) - L_S(W)|$\nFor any fixed $w \\in \\mathcal{W}$, if $\\ell(w, Z)$ is $\\sigma$-subgaussian, the Chernoff-Hoeffding bound gives\n$\\mathbb{P}[|L_\\mu(w) - L_S(w)| > \\alpha] \\le 2e^{-\\alpha^2 n/(2\\sigma^2)}$. It implies that, if $S$ and $W$ are independent, then a sample size of\n\n$$n = \\frac{2\\sigma^2}{\\alpha^2} \\log\\frac{2}{\\beta} \\tag{14}$$\n\nsuffices to guarantee\n\n$$\\mathbb{P}[|L_\\mu(W) - L_S(W)| > \\alpha] \\le \\beta. \\tag{15}$$\n\nThe following results show that, when $W$ is dependent on $S$, as long as $I(S; W)$ is sufficiently small,\na sample complexity polynomial in $1/\\alpha$ and logarithmic in $1/\\beta$ still suffices to guarantee (15), where\nthe probability now is taken with respect to the joint distribution $P_{S,W} = \\mu^{\\otimes n} \\otimes P_{W|S}$.\nTheorem 3 (proved in Appendix B). 
Suppose $\\ell(w, Z)$ is $\\sigma$-subgaussian under $\\mu$ for all $w \\in \\mathcal{W}$. If\na learning algorithm satisfies $I(\\Lambda_{\\mathcal{W}}(S); W) \\le \\varepsilon$, then for any $\\alpha > 0$ and $0 < \\beta \\le 1$, (15) can be\nguaranteed by a sample complexity of\n\n$$n = \\frac{8\\sigma^2}{\\alpha^2} \\left( \\frac{\\varepsilon}{\\beta} + \\log\\frac{2}{\\beta} \\right). \\tag{16}$$\n\nIn view of (13), any learning algorithm that is $(\\varepsilon, \\mu)$-stable in input-output mutual information\nsatisfies the condition $I(\\Lambda_{\\mathcal{W}}(S); W) \\le \\varepsilon$. The proof of Theorem 3 is based on Lemma 1 and an\nadaptation of the \u201cmonitor technique\u201d proposed by Bassily et al. [6]. While the high-probability\nbounds of [4\u20136] based on differential privacy are for bounded loss functions and for functions with\nbounded differences, the result in Theorem 3 only requires $\\ell(w, Z)$ to be subgaussian. We have the\nfollowing corollary of Theorem 3.\nCorollary 1. Under the conditions in Theorem 3, if for some function $g(n) \\ge 1$, $\\varepsilon \\le (g(n) - 1)\\,\\beta \\log\\frac{2}{\\beta}$,\nthen a sample complexity that satisfies $n/g(n) \\ge \\frac{8\\sigma^2}{\\alpha^2} \\log\\frac{2}{\\beta}$ guarantees (15).\n\nFor example, taking $g(n) = 2$, Corollary 1 implies that if $\\varepsilon \\le \\beta \\log(2/\\beta)$, then (15) can be\nguaranteed by a sample complexity of $n = (16\\sigma^2/\\alpha^2) \\log(2/\\beta)$, which is on the same order of\nthe sample complexity when $S$ and $W$ are independent as in (14). As another example, taking\n$g(n) = \\sqrt{n}$, Corollary 1 implies that if $\\varepsilon \\le (\\sqrt{n} - 1)\\,\\beta \\log(2/\\beta)$, then a sample complexity of\n$n = (64\\sigma^4/\\alpha^4) (\\log(2/\\beta))^2$ guarantees (15).\n\n3.4 Upper bound on $\\mathbb{E}|L_\\mu(W) - L_S(W)|$\nA byproduct of the proof of Theorem 3 (setting $m = 1$ in the proof) is an upper bound on the expected\nabsolute generalization error.\nTheorem 4. Suppose $\\ell(w, Z)$ is $\\sigma$-subgaussian under $\\mu$ for all $w \\in \\mathcal{W}$. 
If a learning algorithm\nsatisfies $I(\\Lambda_{\\mathcal{W}}(S); W) \\le \\varepsilon$, then\n\n$$\\mathbb{E}\\big|L_\\mu(W) - L_S(W)\\big| \\le \\sqrt{\\frac{2\\sigma^2}{n} (\\varepsilon + \\log 2)}. \\tag{17}$$\n\nThis result improves [3, Prop. 3.2], which states that $\\mathbb{E}|L_S(W) - L_\\mu(W)| \\le \\sigma/\\sqrt{n} + 36\\sqrt{2\\sigma^2\\varepsilon/n}$.\nTheorem 4 together with Markov\u2019s inequality implies that (15) can be guaranteed by $n = \\frac{2\\sigma^2}{\\alpha^2\\beta^2}(\\varepsilon + \\log 2)$,\nbut it has a worse dependence on $\\beta$ as compared to the sample complexity given by Theorem 3.\n\n4 Learning algorithms with input-output mutual information stability\n\nIn this section, we discuss several learning problems and algorithms from the viewpoint of input-output\nmutual information stability. We first consider two cases where the input-output mutual\ninformation can be upper-bounded via the properties of the hypothesis space. Then we propose\ntwo learning algorithms with controlled input-output mutual information by regularizing the ERM\nalgorithm. 
We also discuss other methods to induce input-output mutual information stability, and\nthe stability of learning algorithms obtained from adaptive composition of constituent algorithms.\n\n4.1 Countable hypothesis space\n\nWhen the hypothesis space is countable, the input-output mutual information can be directly upper-bounded\nby $H(W)$, the entropy of $W$. If $|\\mathcal{W}| = k$, we have $H(W) \\le \\log k$. From Theorem 1, if\n$\\ell(w, Z)$ is $\\sigma$-subgaussian for all $w \\in \\mathcal{W}$, then for any learning algorithm $P_{W|S}$ with countable $\\mathcal{W}$,\n\n$$\\big|\\mathrm{gen}(\\mu, P_{W|S})\\big| \\le \\sqrt{\\frac{2\\sigma^2 H(W)}{n}}. \\tag{18}$$\n\nFor the ERM algorithm, the upper bounds for the expected generalization error also hold for the\nexpected excess risk, since the empirical risk of the ERM algorithm satisfies\n\n$$\\mathbb{E}[L_S(W_{\\mathrm{ERM}})] = \\mathbb{E}\\Big[\\inf_{w \\in \\mathcal{W}} L_S(w)\\Big] \\le \\inf_{w \\in \\mathcal{W}} \\mathbb{E}[L_S(w)] = \\inf_{w \\in \\mathcal{W}} L_\\mu(w). \\tag{19}$$\n\nFor an uncountable hypothesis space, we can always convert it to a finite one by quantizing the output\nhypothesis. For example, if $\\mathcal{W} \\subset \\mathbb{R}^m$, we can define the covering number $N(r, \\mathcal{W})$ as the cardinality\nof the smallest set $\\mathcal{W}' \\subset \\mathbb{R}^m$ such that for all $w \\in \\mathcal{W}$ there is $w' \\in \\mathcal{W}'$ with $\\|w - w'\\| \\le r$, and we\ncan use $\\mathcal{W}'$ as the codebook for quantization. The final output hypothesis $W'$ will be an element of\n$\\mathcal{W}'$. If $\\mathcal{W}$ lies in a $d$-dimensional subspace of $\\mathbb{R}^m$ and $\\max_{w \\in \\mathcal{W}} \\|w\\| = B$, then setting $r = 1/\\sqrt{n}$,\nwe have $N(r, \\mathcal{W}) \\le (2B\\sqrt{dn})^d$, and under the subgaussian condition of $\\ell$,\n\n$$\\big|\\mathrm{gen}(\\mu, P_{W'|S})\\big| \\le \\sqrt{\\frac{2\\sigma^2 d}{n} \\log\\big(2B\\sqrt{dn}\\big)}. \\tag{20}$$\n\n4.2 Binary Classification\nFor the problem of binary classification, $\\mathcal{Z} = \\mathcal{X} \\times \\mathcal{Y}$, $\\mathcal{Y} = \\{0, 1\\}$, $\\mathcal{W}$ is a collection of classifiers\n$w : \\mathcal{X} \\to \\mathcal{Y}$, which could be uncountably infinite, and $\\ell(w, z) = \\mathbf{1}\\{w(x) \\neq y\\}$. Using Theorem 1,\nwe can perform a simple analysis of the following two-stage algorithm [14, 15] that can achieve the\nsame performance as ERM. Given the dataset $S$, split it into $S_1$ and $S_2$ with lengths $n_1$ and $n_2$. First,\npick a subset of hypotheses $\\mathcal{W}_1 \\subset \\mathcal{W}$ based on $S_1$ such that $(w(X_1), \\ldots, w(X_{n_1}))$ for $w \\in \\mathcal{W}_1$ are\nall distinct and $\\{(w(X_1), \\ldots, w(X_{n_1})), w \\in \\mathcal{W}_1\\} = \\{(w(X_1), \\ldots, w(X_{n_1})), w \\in \\mathcal{W}\\}$. In other\nwords, $\\mathcal{W}_1$ forms an empirical cover of $\\mathcal{W}$ with respect to $S_1$. Then pick a hypothesis from $\\mathcal{W}_1$ with\nthe minimal empirical risk on $S_2$, i.e.,\n\n$$W = \\arg\\min_{w \\in \\mathcal{W}_1} L_{S_2}(w). \\tag{21}$$\n\nDenoting the $n$th shatter coefficient and the VC dimension of $\\mathcal{W}$ by $\\mathbb{S}_n$ and $V$, we can upper-bound\nthe expected generalization error of $W$ with respect to $S_2$ as\n\n$$\\mathbb{E}[L_\\mu(W)] - \\mathbb{E}[L_{S_2}(W)] = \\mathbb{E}\\big[\\mathbb{E}[L_\\mu(W) - L_{S_2}(W) \\mid S_1]\\big] \\le \\sqrt{\\frac{V \\log(n_1 + 1)}{2n_2}}, \\tag{22}$$\n\nwhere we have used the fact that $I(S_2; W \\mid S_1 = s_1) \\le H(W \\mid S_1 = s_1) \\le \\log \\mathbb{S}_{n_1} \\le V \\log(n_1 + 1)$,\nby Sauer\u2019s Lemma, and Theorem 1. 
It can also be shown that [14, 15]\n\n$$\\mathbb{E}[L_{S_2}(W)] \\le \\mathbb{E}\\Big[\\inf_{w \\in \\mathcal{W}_1} L_\\mu(w)\\Big] \\le \\inf_{w \\in \\mathcal{W}} L_\\mu(w) + c\\sqrt{\\frac{V}{n_1}}, \\tag{23}$$\n\nwhere the second expectation is taken with respect to $\\mathcal{W}_1$ which depends on $S_1$, and $c$ is a constant.\nCombining (22) and (23) and setting $n_1 = n_2 = n/2$, we have for some constant $c$,\n\n$$\\mathbb{E}[L_\\mu(W)] \\le \\inf_{w \\in \\mathcal{W}} L_\\mu(w) + c\\sqrt{\\frac{V \\log n}{n}}. \\tag{24}$$\n\nFrom an information-theoretic point of view, the above two-stage algorithm effectively controls the\nconditional mutual information $I(S_2; W \\mid S_1)$ by extracting an empirical cover of $\\mathcal{W}$ using $S_1$, while\nmaintaining a small empirical risk using $S_2$.\n\n4.3 Gibbs algorithm\nAs Theorem 1 shows that the generalization error can be upper-bounded in terms of $I(S; W)$, it is\nnatural to consider an algorithm that minimizes the empirical risk regularized by $I(S; W)$:\n\n$$P^\\star_{W|S} = \\arg\\inf_{P_{W|S}} \\Big( \\mathbb{E}[L_S(W)] + \\frac{1}{\\beta} I(S; W) \\Big), \\tag{25}$$\n\nwhere $\\beta > 0$ is a parameter that balances fitting and generalization. To deal with the issue that $\\mu$\nis unknown to the learning algorithm, we can relax the above optimization problem by replacing\n$I(S; W)$ with an upper bound $D(P_{W|S} \\| Q \\,|\\, P_S) = I(S; W) + D(P_W \\| Q)$, where $Q$ is an arbitrary\ndistribution on $\\mathcal{W}$ and $D(P_{W|S} \\| Q \\,|\\, P_S) = \\int_{\\mathcal{Z}^n} D(P_{W|S=s} \\| Q)\\,\\mu^{\\otimes n}(ds)$, so that the solution of the\nrelaxed optimization problem does not depend on $\\mu$. It turns out that the well-known Gibbs algorithm\nsolves the relaxed optimization problem.\nTheorem 5 (proved in Appendix C). 
The solution to the optimization problem\n\n$$P^*_{W|S} = \\arg\\inf_{P_{W|S}} \\Big( \\mathbb{E}[L_S(W)] + \\frac{1}{\\beta} D(P_{W|S} \\| Q \\,|\\, P_S) \\Big) \\tag{26}$$\n\nis the Gibbs algorithm, which satisfies\n\n$$P^*_{W|S=s}(dw) = \\frac{e^{-\\beta L_s(w)} Q(dw)}{\\mathbb{E}_Q[e^{-\\beta L_s(W)}]} \\quad \\text{for each } s \\in \\mathcal{Z}^n. \\tag{27}$$\n\nWe would not have been able to arrive at the Gibbs algorithm had we used $I(\\Lambda_{\\mathcal{W}}(S); W)$\nas the regularization term instead of $I(S; W)$ in (25), even if we upper-bound $I(\\Lambda_{\\mathcal{W}}(S); W)$ by\n$D(P_{W|\\Lambda_{\\mathcal{W}}(S)} \\| Q \\,|\\, P_{\\Lambda_{\\mathcal{W}}(S)})$. Using the fact that the Gibbs algorithm is $(2\\beta/n, 0)$-differentially private\nwhen $\\ell \\in [0, 1]$ [16] and the group property of differential privacy [17], we can upper-bound the\ninput-output mutual information of the Gibbs algorithm as $I(S; W) \\le 2\\beta$. Then from Theorem 1,\nwe know that for $\\ell \\in [0, 1]$, $|\\mathrm{gen}(\\mu, P^*_{W|S})| \\le \\sqrt{\\beta/n}$. Using Hoeffding\u2019s lemma, a tighter upper\nbound on the expected generalization error for the Gibbs algorithm is obtained in [13], which states\nthat if $\\ell \\in [0, 1]$,\n\n$$\\mathrm{gen}(\\mu, P^*_{W|S}) \\le \\frac{\\beta}{2n}. \\tag{28}$$\n\nWith the guarantee on the generalization error, we can analyze the population risk of the Gibbs\nalgorithm. We first present a result for countable hypothesis spaces.\nCorollary 2 (proved in Appendix D). Suppose $\\mathcal{W}$ is countable. Let $W$ denote the output of the\nGibbs algorithm applied on dataset $S$, and let $w_o$ denote the hypothesis that achieves the minimum\npopulation risk among $\\mathcal{W}$. For $\\ell \\in [0, 1]$, the population risk of $W$ satisfies\n\n$$\\mathbb{E}[L_\\mu(W)] \\le \\inf_{w \\in \\mathcal{W}} L_\\mu(w) + \\frac{1}{\\beta} \\log\\frac{1}{Q(w_o)} + \\frac{\\beta}{2n}. \\tag{29}$$\n\nThe distribution $Q$ in the Gibbs algorithm can be used to express our preference, or our prior\nknowledge of the population risks, of the hypotheses in $\\mathcal{W}$, in a way that a higher probability under\n$Q$ is assigned to a hypothesis that we prefer. 
For example, we can order the hypotheses according to\nour prior knowledge of their population risks, and set $Q(w_i) = 6/(\\pi^2 i^2)$ for the $i$th hypothesis in the\norder, then, setting $\\beta = \\sqrt{n}$, (29) becomes\n\n$$\\mathbb{E}[L_\\mu(W)] \\le \\inf_{w \\in \\mathcal{W}} L_\\mu(w) + \\frac{2\\log i_o + 1}{\\sqrt{n}}, \\tag{30}$$\n\nwhere $i_o$ is the index of $w_o$. It means that a better prior knowledge on the population risks leads to a\nsmaller sample complexity to achieve a certain expected excess risk. As another example, if $|\\mathcal{W}| = k$\nand we have no preference on any hypothesis, then taking $Q$ as the uniform distribution on $\\mathcal{W}$ and\nsetting $\\beta = \\sqrt{2n \\log k}$, (29) becomes $\\mathbb{E}[L_\\mu(W)] \\le \\inf_{w \\in \\mathcal{W}} L_\\mu(w) + \\sqrt{(2/n) \\log k}$.\n\nFor uncountable hypothesis spaces, we can do a similar analysis for the population risk under a\nLipschitz assumption on the loss function.\nCorollary 3 (proved in Appendix E). Suppose $\\mathcal{W} = \\mathbb{R}^d$. Let $w_o$ be the hypothesis that achieves the\nminimum population risk among $\\mathcal{W}$. Suppose $\\ell \\in [0, 1]$ and $\\ell(\\cdot, z)$ is $\\rho$-Lipschitz for all $z \\in \\mathcal{Z}$. Let\n$W$ denote the output of the Gibbs algorithm applied on dataset $S$. The population risk of $W$ satisfies\n\n$$\\mathbb{E}[L_\\mu(W)] \\le \\inf_{w \\in \\mathcal{W}} L_\\mu(w) + \\frac{\\beta}{2n} + \\inf_{a > 0} \\Big( a\\rho\\sqrt{d} + \\frac{1}{\\beta} D\\big(\\mathcal{N}(w_o, a^2 I_d) \\,\\|\\, Q\\big) \\Big). \\tag{31}$$\n\nAgain, we can use the distribution $Q$ to express our preference of the hypotheses in $\\mathcal{W}$. For example,\nwe can choose $Q = \\mathcal{N}(w_Q, b^2 I_d)$ with $b = n^{-1/4} d^{-1/4} \\rho^{-1/2}$ and choose $\\beta = n^{3/4} d^{1/4} \\rho^{1/2}$. Then,\nsetting $a = b$ in (31), we have\n\n$$\\mathbb{E}[L_\\mu(W)] \\le \\inf_{w \\in \\mathcal{W}} L_\\mu(w) + \\frac{d^{1/4}\\rho^{1/2}}{2n^{1/4}} \\big( \\|w_Q - w_o\\|^2 + 3 \\big). \\tag{32}$$\n\nThis result essentially has no restriction on $\\mathcal{W}$, which could be unbounded, and only requires the\nLipschitz condition on $\\ell(\\cdot, z)$, which could be non-convex. 
The sample complexity decreases with a\nbetter prior knowledge of the optimal hypothesis.\n\n4.4 Noisy empirical risk minimization\nAnother algorithm with controlled input-output mutual information is the noisy empirical risk\nminimization algorithm, where independent noise $N_w$, $w \\in \\mathcal{W}$, is added to the empirical risk of each\nhypothesis, and the algorithm outputs a hypothesis that minimizes the noisy empirical risks:\n\n$$W = \\arg\\min_{w \\in \\mathcal{W}} \\big( L_S(w) + N_w \\big). \\tag{33}$$\n\nSimilar to the Gibbs algorithm, we can express our preference of the hypotheses by controlling the\namount of noise added to each hypothesis, such that our preferred hypotheses will be more likely\nto be selected when they have similar empirical risks as other hypotheses. The following result\nformalizes this idea.\nCorollary 4 (proved in Appendix F). Suppose $\\mathcal{W}$ is countable and is indexed such that a hypothesis\nwith a lower index is preferred over one with a higher index. Also suppose $\\ell \\in [0, 1]$. For the noisy\nERM algorithm in (33), choosing $N_i$ to be an exponential random variable with mean $b_i$, we have\n\n$$\\mathbb{E}[L_\\mu(W)] \\le \\min_i L_\\mu(w_i) + b_{i_o} + \\sqrt{\\frac{1}{2n} \\sum_{i=1}^{\\infty} \\frac{L_\\mu(w_i)}{b_i}} - \\Big( \\sum_{i=1}^{\\infty} \\frac{1}{b_i} \\Big)^{-1}, \\tag{34}$$\n\nwhere $i_o = \\arg\\min_i L_\\mu(w_i)$. In particular, choosing $b_i = i^{1.1}/n^{1/3}$, we have\n\n$$\\mathbb{E}[L_\\mu(W)] \\le \\min_i L_\\mu(w_i) + \\frac{i_o^{1.1} + 3}{n^{1/3}}. \\tag{35}$$\n\nWithout adding noise, the ERM algorithm applied to the above case when $|\\mathcal{W}| = k$ can achieve\n$\\mathbb{E}[L_\\mu(W_{\\mathrm{ERM}})] \\le \\min_{i \\in [k]} L_\\mu(w_i) + \\sqrt{(1/(2n)) \\log k}$. Compared with (35), we see that performing\nnoisy ERM may be beneficial when we have high-quality prior knowledge of $w_o$ and when $k$ is large.\n\n4.5 Other methods to induce input-output mutual information stability\n\nIn addition to the Gibbs algorithm and the noisy ERM algorithm, many other methods may be\nused to control the input-output mutual information of the learning algorithm. 
One method is to\npreprocess the dataset $S$ to obtain $\\tilde{S}$, and then run a learning algorithm on $\\tilde{S}$. The preprocessing\ncan be adding noise to the data or erasing some of the instances in the dataset, etc. In any case, we\nhave the Markov chain $S - \\tilde{S} - W$, which implies $I(S; W) \\le \\min\\{I(S; \\tilde{S}), I(\\tilde{S}; W)\\}$. Another\nmethod is the postprocessing of the output of a learning algorithm. For example, the weights $\\tilde{W}$\ngenerated by a neural network training algorithm can be quantized or perturbed by noise. This\ngives rise to the Markov chain $S - \\tilde{W} - W$, which implies $I(S; W) \\le \\min\\{I(\\tilde{W}; W), I(S; \\tilde{W})\\}$.\nMoreover, strong data processing inequalities [18] may be used to sharpen these upper bounds\non $I(S; W)$. Preprocessing of the dataset and postprocessing of the output hypothesis are among\nnumerous regularization methods used in the field of deep learning [19, Ch. 7.5]. Other regularization\nmethods may also be interpreted as ways to induce the input-output mutual information stability of a\nlearning algorithm, and this would be an interesting direction of future research.\n\n4.6 Adaptive composition of learning algorithms\n\nBeyond analyzing the generalization error of individual learning algorithms, examining the input-output\nmutual information is also useful for analyzing the generalization capability of complex\nlearning algorithms obtained by adaptively composing simple constituent algorithms. Under a $k$-fold\nadaptive composition, the dataset $S$ is shared by $k$ learning algorithms that are sequentially executed.\nFor $j = 1, \\ldots, k$, the output $W_j$ of the $j$th algorithm may be drawn from a different hypothesis\nspace $\\mathcal{W}_j$ based on $S$ and the outputs $W^{j-1}$ of the previously executed algorithms, according to\n$P_{W_j|S,W^{j-1}}$. An example with $k = 2$ is model selection followed by a learning algorithm using the\nsame dataset. 
Various boosting techniques in machine learning can also be viewed as instances of\nadaptive composition. From the data processing inequality and the chain rule of mutual information,\n\n$$I(S; W_k) \\le I(S; W^k) = \\sum_{j=1}^{k} I(S; W_j \\mid W^{j-1}). \\tag{36}$$\n\nIf the Markov chain $S - \\Lambda_{\\mathcal{W}_j}(S) - W_j$ holds conditional on $W^{j-1}$ for $j = 1, \\ldots, k$, then the\nupper bound in (36) can be sharpened to $\\sum_{j=1}^{k} I(\\Lambda_{\\mathcal{W}_j}(S); W_j \\mid W^{j-1})$. We can thus control the\ngeneralization error of the final output by controlling the conditional mutual information at each step of\nthe composition. This also gives us a way to analyze the generalization error of the composed learning\nalgorithm using the knowledge of local generalization guarantees of the constituent algorithms.\n\nAcknowledgement\n\nWe would like to thank Vitaly Feldman and Vivek Bagaria for pointing out errors in the earlier version\nof this paper. We also would like to thank Peng Guan for helpful discussions.\n\nReferences\n[1] S. Boucheron, O. Bousquet, and G. Lugosi, \u201cTheory of classification: a survey of some recent advances,\u201d ESAIM: Probability and Statistics, vol. 9, pp. 323\u2013375, 2005.\n\n[2] O. Bousquet and A. Elisseeff, \u201cStability and generalization,\u201d J. Machine Learning Res., vol. 2, pp. 499\u2013526, 2002.\n\n[3] D. Russo and J. Zou, \u201cHow much does your data exploration overfit? Controlling bias via information usage,\u201d arXiv preprint, 2016. [Online]. Available: https://arxiv.org/abs/1511.05219\n\n[4] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth, \u201cPreserving statistical validity in adaptive data analysis,\u201d in Proc. of 47th ACM Symposium on Theory of Computing (STOC), 2015.\n\n[5] \u2014\u2014, \u201cGeneralization in adaptive data analysis and holdout reuse,\u201d in 28th Annual Conference on Neural Information Processing Systems (NIPS), 2015.\n\n[6] R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. 
Ullman, \u201cAlgorithmic stability for adaptive data analysis,\u201d in Proceedings of The 48th Annual ACM Symposium on Theory of Computing (STOC), 2016.\n\n[7] I. Alabdulmohsin, \u201cAlgorithmic stability and uniform generalization,\u201d in 28th Annual Conference on Neural Information Processing Systems (NIPS), 2015.\n\n[8] \u2014\u2014, \u201cAn information-theoretic route from generalization in expectation to generalization in probability,\u201d in 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.\n\n[9] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, \u201cUnderstanding deep learning requires rethinking generalization,\u201d in International Conference on Learning Representations (ICLR), 2017.\n\n[10] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.\n\n[11] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan, \u201cLearnability, stability and uniform convergence,\u201d J. Mach. Learn. Res., vol. 11, pp. 2635\u20132670, 2010.\n\n[12] Y.-X. Wang, J. Lei, and S. E. Fienberg, \u201cOn-average KL-privacy and its equivalence to generalization for max-entropy mechanisms,\u201d in Proceedings of the International Conference on Privacy in Statistical Databases, 2016.\n\n[13] M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu, \u201cInformation-theoretic analysis of stability and bias of learning algorithms,\u201d in Proceedings of IEEE Information Theory Workshop, 2016.\n\n[14] K. L. Buescher and P. R. Kumar, \u201cLearning by canonical smooth estimation. I. Simultaneous estimation,\u201d IEEE Transactions on Automatic Control, vol. 41, no. 4, pp. 545\u2013556, Apr 1996.\n\n[15] L. Devroye, L. Gy\u00f6rfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer, 1996.\n\n[16] F. McSherry and K. 
Talwar, \u201cMechanism design via differential privacy,\u201d in Proceedings of 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2007.\n\n[17] C. Dwork and A. Roth, \u201cThe algorithmic foundations of differential privacy,\u201d Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, 2014.\n\n[18] M. Raginsky, \u201cStrong data processing inequalities and \u03a6-Sobolev inequalities for discrete channels,\u201d IEEE Trans. Inform. Theory, vol. 62, no. 6, pp. 3355\u20133389, 2016.\n\n[19] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.\n\n[20] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford Univ. Press, 2013.\n\n[21] T. Zhang, \u201cInformation-theoretic upper and lower bounds for statistical estimation,\u201d IEEE Trans. Inform. Theory, vol. 52, no. 4, pp. 1307\u20131321, 2006.\n\n[22] Y. Polyanskiy and Y. Wu, \u201cLecture Notes on Information Theory,\u201d Lecture Notes for ECE563 (UIUC) and 6.441 (MIT), 2012-2016. [Online]. Available: http://people.lids.mit.edu/yp/homepage/data/itlectures_v4.pdf\n\n[23] S. Verd\u00fa, \u201cThe exponential distribution in information theory,\u201d Problems of Information Transmission, vol. 32, no. 1, pp. 86\u201395, 1996.\n", "award": [], "sourceid": 1462, "authors": [{"given_name": "Aolin", "family_name": "Xu", "institution": "University of Illinois at Urbana-Champaign"}, {"given_name": "Maxim", "family_name": "Raginsky", "institution": "University of Illinois at Urbana-Champaign"}]}