{"title": "Generalization Bounds for Neural Networks via Approximate Description Length", "book": "Advances in Neural Information Processing Systems", "page_first": 13008, "page_last": 13016, "abstract": "We investigate the sample complexity of networks with bounds on the magnitude of their weights. \nIn particular, we consider the class\n\\[\n\\mathcal{N} = \\left\\{W_t\\circ\\rho\\circ W_{t-1}\\circ\\rho\\ldots\\circ \\rho\\circ W_{1} : W_1,\\ldots,W_{t-1}\\in M_{d\\times d}, W_t\\in M_{1,d}  \\right\\}\n\\]\nwhere the spectral norm of each $W_i$ is bounded by $O(1)$, the Frobenius norm is bounded by $R$, and $\\rho$ is the sigmoid function $\\frac{e^x}{1 + e^x}$ or the smoothened ReLU function $ \\ln\\left(1 + e^x\\right)$.\nWe show that for any depth $t$, if the inputs are in $[-1,1]^d$, the sample complexity of $\\mathcal{N}$ is $\\tilde O\\left(\\frac{dR^2}{\\epsilon^2}\\right)$. This bound is optimal up to log-factors, and substantially improves over the previous state of the art of $\\tilde O\\left(\\frac{d^2R^2}{\\epsilon^2}\\right)$, which was established in a recent line of work.\n\nWe furthermore show that this bound remains valid if instead of considering the magnitude of the $W_i$'s, we consider the magnitude of $W_i - W_i^0$, where $W_i^0$ are some reference matrices, with spectral norm of $O(1)$. By taking the $W_i^0$ to be the matrices at the onset of the training process, we get sample complexity bounds that are sub-linear in the number of parameters, in many {\\em typical} regimes of parameters.  \n\nTo establish our results we develop a new technique to analyze the sample complexity of families $\\mathcal{H}$ of predictors. \nWe start by defining a new notion of a randomized approximate description of functions $f:\\mathcal{X}\\to\\mathbb{R}^d$. We then show that if there is a way to approximately describe functions in a class $\\mathcal{H}$ using $d$ bits, then $\\frac{d}{\\epsilon^2}$ examples suffice to guarantee uniform convergence. 
Namely, that the empirical loss of all the functions in the class is $\\epsilon$-close to the true loss. Finally, we develop a set of tools for calculating the approximate description length of classes of functions that can be presented as a composition of linear function classes and non-linear functions.", "full_text": "Generalization Bounds for Neural Networks via Approximate Description Length\n\nAmit Daniely\n\nHebrew University and Google Research Tel-Aviv\n\namit.daniely@mail.huji.ac.il\n\nElad Granot\n\nHebrew University\n\nelad.granot@mail.huji.ac.il\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAbstract\n\nWe investigate the sample complexity of networks with bounds on the magnitude of their weights. In particular, we consider the class\n\n\\[ \\mathcal{N} = \\left\\{ W_t \\circ \\rho \\circ W_{t-1} \\circ \\rho \\ldots \\circ \\rho \\circ W_1 : W_1, \\ldots, W_{t-1} \\in M_{d \\times d}, W_t \\in M_{1,d} \\right\\} \\]\n\nwhere the spectral norm of each $W_i$ is bounded by $O(1)$, the Frobenius norm is bounded by $R$, and $\\rho$ is the sigmoid function $\\frac{e^x}{1+e^x}$ or the smoothened ReLU function $\\ln(1 + e^x)$. We show that for any depth $t$, if the inputs are in $[-1,1]^d$, the sample complexity of $\\mathcal{N}$ is $\\tilde O\\left(\\frac{dR^2}{\\epsilon^2}\\right)$. This bound is optimal up to log-factors, and substantially improves over the previous state of the art of $\\tilde O\\left(\\frac{d^2R^2}{\\epsilon^2}\\right)$, which was established in a recent line of work [9, 4, 7, 5, 2, 8].\n\nWe furthermore show that this bound remains valid if instead of considering the magnitude of the $W_i$'s, we consider the magnitude of $W_i - W_i^0$, where $W_i^0$ are some reference matrices, with spectral norm of $O(1)$. By taking the $W_i^0$ to be the matrices at the onset of the training process, we get sample complexity bounds that are sub-linear in the number of parameters, in many typical regimes of parameters.\n\nTo establish our results we develop a new technique to analyze the sample complexity of families $\\mathcal{H}$ of predictors. We start by defining a new notion of a randomized approximate description of functions $f : \\mathcal{X} \\to \\mathbb{R}^d$. We then show that if there is a way to approximately describe functions in a class $\\mathcal{H}$ using $d$ bits, then $\\frac{d}{\\epsilon^2}$ examples suffice to guarantee uniform convergence, namely, that the empirical loss of all the functions in the class is $\\epsilon$-close to the true loss. Finally, we develop a set of tools for calculating the approximate description length of classes of functions that can be presented as a composition of linear function classes and non-linear functions.\n\n1 Introduction\n\nWe analyze the sample complexity of networks with bounds on the magnitude of their weights. Let us consider a prototypical case, where the input space is $\\mathcal{X} = [-1,1]^d$, the output space is $\\mathbb{R}$, the number of layers is $t$, all hidden layers have $d$ neurons, and the activation function is $\\rho : \\mathbb{R} \\to \\mathbb{R}$. The class of functions computed by such an architecture is\n\n\\[ \\mathcal{N} = \\left\\{ W_t \\circ \\rho \\circ W_{t-1} \\circ \\rho \\ldots \\circ \\rho \\circ W_1 : W_1, \\ldots, W_{t-1} \\in M_{d \\times d}, W_t \\in M_{1,d} \\right\\} \\]\n\nAs the class $\\mathcal{N}$ is defined by $(t-1)d^2 + d = O(d^2)$ parameters, classical results (e.g. [1]) tell us that order of $d^2$ examples are sufficient and necessary in order to learn a function from $\\mathcal{N}$ (in a standard worst case analysis). However, modern networks often succeed in learning with substantially fewer examples. One way to provide alternative results, and a potential explanation of the phenomenon, is to take into account the magnitude of the weights. This approach was a success story in the days of SVM [3] and Boosting [10], provided a nice explanation for generalization with a sub-linear (in the number of parameters) number of examples, and was even the driving force behind algorithmic progress. 
It seems only natural to adopt this approach in the context of modern networks. For instance, it is natural to consider the class\n\n\\[ \\mathcal{N}_R = \\left\\{ W_t \\circ \\rho \\circ W_{t-1} \\circ \\rho \\ldots \\circ \\rho \\circ W_1 : \\forall i, \\|W_i\\|_F \\le R, \\|W_i\\| \\le O(1) \\right\\} \\]\n\nwhere $\\|W\\| = \\max_{\\|x\\|=1} \\|Wx\\|$ is the spectral norm and $\\|W\\|_F = \\sqrt{\\sum_{i,j=1}^d W_{ij}^2}$ is the Frobenius norm. This class has been analyzed in several recent works [9, 4, 7, 5, 2, 8]. Best known results show a sample complexity of $\\tilde O\\left(\\frac{d^2R^2}{\\epsilon^2}\\right)$ (for the sake of simplicity, in the introduction, we ignore the dependence on the depth in the big-O notation). In this paper we prove, for various activations, a stronger bound of $\\tilde O\\left(\\frac{dR^2}{\\epsilon^2}\\right)$, which is optimal, up to log factors, for constant depth networks.\n\nHow good is this bound? Does it finally provide a sub-linear bound in typical regimes of the parameters? To answer this question, we need to ask how large $R$ is. While this question of course doesn't have a definite answer, empirical studies (e.g. [12]) show that it is usually the case that the norm (spectral, Frobenius, and others) of the weight matrices is at the same order of magnitude as the norm of the matrix at the onset of the training process. In most standard training methods, the initial matrices are random matrices with independent (or almost independent) entries, with mean zero and variance of order $\\frac{1}{d}$. The Frobenius norm of such a matrix is of order $\\sqrt{d}$. Hence, the magnitude of $R$ is of order $\\sqrt{d}$. Going back to our $\\tilde O\\left(\\frac{dR^2}{\\epsilon^2}\\right)$ bound, we get a sample complexity of $\\tilde O\\left(\\frac{d^2}{\\epsilon^2}\\right)$, which is unfortunately still linear in the number of parameters.\n\nSince our bound is almost optimal, we can ask whether this is the end of the story. Should we abandon the aforementioned approach to network sample complexity? A more refined examination of the training process suggests another hope for this approach. Indeed, the training process doesn't start from the zero matrix, but rather from a random initialization matrix. Thus, it stands to reason that instead of considering the magnitude of the weight matrices $W_i$, we should consider the magnitude of $W_i - W_i^0$, where $W_i^0$ is the initial weight matrix. Indeed, empirical studies [6] show that the Frobenius norm of $W_i - W_i^0$ is often an order of magnitude smaller than the Frobenius norm of $W_i$. Following this perspective, it is natural to consider the class\n\n\\[ \\mathcal{N}_R(W_1^0, \\ldots, W_t^0) = \\left\\{ W_t \\circ \\rho \\circ W_{t-1} \\circ \\rho \\ldots \\circ \\rho \\circ W_1 : \\|W_i - W_i^0\\| \\le O(1), \\|W_i - W_i^0\\|_F \\le R \\right\\} \\]\n\nfor some fixed matrices $W_1^0, \\ldots, W_t^0$ of spectral norm$^1$ $O(1)$. It is natural to expect that considering balls around the initial $W_i^0$'s instead of zero shouldn't change the sample complexity of the class at hand. In other words, we can expect that the sample complexity of $\\mathcal{N}_R(W_1^0, \\ldots, W_t^0)$ should be approximately the sample complexity of $\\mathcal{N}_R$. Namely, we expect a sample complexity of $\\tilde O\\left(\\frac{dR^2}{\\epsilon^2}\\right)$. Such a bound would finally be sub-linear, as in practice, it is often the case that $R^2 \\ll d$.\n\n$^1$ The bound of $O(1)$ on the spectral norm of the $W_i^0$'s and $W_i - W_i^0$ is again motivated by the practice of neural networks: the spectral norm of $W_i^0$, with standard initializations, is $O(1)$, and empirical studies [6, 12] show that the spectral norm of $W_i - W_i^0$ is usually very small.\n\nThis approach was pioneered by [4] who considered the class\n\n\\[ \\mathcal{N}^{2,1}_R(W_1^0, \\ldots, W_t^0) = \\left\\{ W_t \\circ \\rho \\ldots \\circ \\rho \\circ W_1 : \\|W_i - W_i^0\\| \\le O(1), \\|W_i - W_i^0\\|_{2,1} \\le R \\right\\} \\]\n\nwhere $\\|W\\|_{2,1} = \\sum_{i=1}^d \\sqrt{\\sum_{j=1}^d W_{ij}^2}$. For this class they proved a sample complexity bound of $\\tilde O\\left(\\frac{dR^2}{\\epsilon^2}\\right)$. Since $\\|W\\|_{2,1} \\le \\sqrt{d}\\|W\\|_F$, this implies a sample complexity bound of $\\tilde O\\left(\\frac{d^2R^2}{\\epsilon^2}\\right)$ on $\\mathcal{N}_R(W_1^0, \\ldots, W_t^0)$, which is still not sublinear$^2$. In this paper we finally prove a sub-linear sample complexity bound of $\\tilde O\\left(\\frac{dR^2}{\\epsilon^2}\\right)$ on $\\mathcal{N}_R(W_1^0, \\ldots, W_t^0)$.\n\n$^2$ We note that $\\|W\\|_{2,1} = \\Theta(\\sqrt{d})$ even if $W$ is a random matrix with variance that is calibrated so that $\\|W\\|_F = \\Theta(1)$ (namely, each entry has variance $\\frac{1}{d^2}$).\n\nTo prove our results, we develop a new technique for bounding the sample complexity of function classes. Roughly speaking, we define a notion of approximate description of a function, and count how many bits are required in order to give an approximate description for the functions in the class under study. We then show that this number, called the approximate description length (ADL), gives an upper bound on the sample complexity. The advantage of our method over existing techniques is that it behaves nicely with compositions. That is, once we know the approximate description length of a class $\\mathcal{H}$ of functions from $\\mathcal{X}$ to $\\mathbb{R}^d$, we can also bound the ADL of $\\rho \\circ \\mathcal{H}$, as well as $\\mathcal{L} \\circ \\mathcal{H}$, where $\\mathcal{L}$ is a class of linear functions. 
This allows us to utilize the compositional structure of neural networks.\n\n2 Preliminaries\n\nNotation We denote by $\\mathrm{med}(x_1, \\ldots, x_k)$ the median of $x_1, \\ldots, x_k \\in \\mathbb{R}$. For vectors $x^1, \\ldots, x^k \\in \\mathbb{R}^d$ we denote $\\mathrm{med}(x^1, \\ldots, x^k) = \\left(\\mathrm{med}(x^1_1, \\ldots, x^k_1), \\ldots, \\mathrm{med}(x^1_d, \\ldots, x^k_d)\\right)$. We use $\\log$ to denote $\\log_2$, and $\\ln$ to denote $\\log_e$. An expression of the form $f(n) \\lesssim g(n)$ means that there is a universal constant $c > 0$ for which $f(n) \\le c g(n)$. For a finite set $A$ and $f : A \\to \\mathbb{R}$ we let $\\mathbb{E}_{x \\in A} f = \\mathbb{E}_{x \\in A} f(x) = \\frac{1}{|A|} \\sum_{x \\in A} f(x)$. We denote $B^d_M = \\{x \\in \\mathbb{R}^d : \\|x\\| \\le M\\}$ and $B^d = B^d_1$. Likewise, we denote $S^{d-1} = \\{x \\in \\mathbb{R}^d : \\|x\\| = 1\\}$. We denote the Frobenius norm of a matrix $W$ by $\\|W\\|_F^2 = \\sum_{ij} W_{ij}^2$, while the spectral norm is denoted by $\\|W\\| = \\max_{\\|x\\|=1} \\|Wx\\|$. For a pair of vectors $x, y \\in \\mathbb{R}^d$ we denote by $xy \\in \\mathbb{R}^d$ their point-wise product $xy = (x_1 y_1, \\ldots, x_d y_d)$.\n\nUniform Convergence and Covering Numbers Fix an instance space $\\mathcal{X}$, a label space $\\mathcal{Y}$ and a loss $\\ell : \\mathbb{R}^d \\times \\mathcal{Y} \\to [0, \\infty)$. We say that $\\ell$ is Lipschitz / Bounded / etc. if for any $y \\in \\mathcal{Y}$, $\\ell(\\cdot, y)$ is. Fix a class $\\mathcal{H}$ from $\\mathcal{X}$ to $\\mathbb{R}^d$. For a distribution $\\mathcal{D}$ and a sample $S \\in (\\mathcal{X} \\times \\mathcal{Y})^m$ we define the representativeness of $S$ as\n\n\\[ \\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{H}) = \\sup_{h \\in \\mathcal{H}} \\ell_{\\mathcal{D}}(h) - \\ell_S(h) \\quad \\text{for} \\quad \\ell_{\\mathcal{D}}(h) = \\mathbb{E}_{(x,y) \\sim \\mathcal{D}} \\ell(h(x), y) \\quad \\text{and} \\quad \\ell_S(h) = \\frac{1}{m} \\sum_{i=1}^m \\ell(h(x_i), y_i) \\]\n\nWe note that if $\\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{H}) \\le \\epsilon$ then any algorithm that is guaranteed to return a function $\\hat h \\in \\mathcal{H}$ will enjoy a generalization bound $\\ell_{\\mathcal{D}}(\\hat h) \\le \\ell_S(\\hat h) + \\epsilon$. In particular, the ERM algorithm will return a function whose loss is optimal, up to an additive factor of $\\epsilon$. 
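
As a purely illustrative aside, the representativeness of a sample can be computed exactly in a toy setting. The sketch below assumes a finite class of threshold predictors, the absolute loss, and a distribution that is uniform over a small finite population; all names in it (average_loss, representativeness, threshold, population) are hypothetical and do not come from the paper.

```python
# Toy illustration of representativeness. Assumption: the distribution D is
# uniform over a finite population of (x, y) pairs, so the true loss is exact.

def average_loss(h, loss, points):
    # Mean loss of predictor h over a list of (x, y) pairs.
    return sum(loss(h(x), y) for x, y in points) / len(points)


def representativeness(hypotheses, loss, population, sample):
    # Largest gap, over the class, between true loss and empirical loss.
    return max(
        average_loss(h, loss, population) - average_loss(h, loss, sample)
        for h in hypotheses
    )


def threshold(t):
    # Predictor x -> 1 if x >= t, else 0.
    return lambda x: 1 if x >= t else 0


def absolute_loss(prediction, y):
    return abs(prediction - y)


# Population: x in 0..9 with label 1 exactly when x >= 5.
population = [(x, 1 if x >= 5 else 0) for x in range(10)]
hypotheses = [threshold(t) for t in (0, 3, 5, 8, 10)]

# A two-point sample on which several thresholds look perfect.
sample = [(0, 0), (9, 1)]
rep = representativeness(hypotheses, absolute_loss, population, sample)
print(rep)  # 0.3, attained by the threshold at 8
```

In this toy case the threshold at 8 has zero empirical loss but true loss 0.3, so a learner that returns it would be off by exactly the representativeness of the sample.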
We will focus on bounds on $\\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{H})$ when $S \\sim \\mathcal{D}^m$. To this end, we will rely on the connection between representativeness and the covering numbers of $\\mathcal{H}$.\n\nDefinition 2.1. Fix a class $\\mathcal{H}$ of functions from $\\mathcal{X}$ to $\\mathbb{R}^d$, an integer $m$, $\\epsilon > 0$ and $1 \\le p \\le \\infty$. We define $N_p(\\mathcal{H}, m, \\epsilon)$ as the minimal integer for which the following holds. For every $A \\subset \\mathcal{X}$ of size $\\le m$ there exists $\\tilde{\\mathcal{H}} \\subset \\left(\\mathbb{R}^d\\right)^{\\mathcal{X}}$ such that $|\\tilde{\\mathcal{H}}| \\le N_p(\\mathcal{H}, m, \\epsilon)$ and for any $h \\in \\mathcal{H}$ there is $\\tilde h \\in \\tilde{\\mathcal{H}}$ with $\\left(\\mathbb{E}_{x \\in A} \\|h(x) - \\tilde h(x)\\|_\\infty^p\\right)^{\\frac{1}{p}} \\le \\epsilon$. For $p = 2$, we denote $N(\\mathcal{H}, m, \\epsilon) = N_2(\\mathcal{H}, m, \\epsilon)$.\n\nWe conclude with a lemma, which will be useful in this paper. The proof can be found in the supplementary material.\n\nLemma 2.2. Let $\\ell : \\mathbb{R}^d \\times \\mathcal{Y} \\to \\mathbb{R}$ be $L$-Lipschitz w.r.t. $\\|\\cdot\\|_\\infty$ and $B$-bounded. Assume that for any $0 < \\epsilon \\le 1$, $\\log(N(\\mathcal{H}, m, \\epsilon)) \\le \\frac{n}{\\epsilon^2}$. Then $\\mathbb{E}_{S \\sim \\mathcal{D}^m} \\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{H}) \\lesssim \\frac{(L+B)\\sqrt{n}}{\\sqrt{m}} \\log(m)$. Furthermore, with probability at least $1 - \\delta$, $\\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{H}) \\lesssim \\frac{(L+B)\\sqrt{n}}{\\sqrt{m}} \\log(m) + B\\sqrt{\\frac{2\\ln(2/\\delta)}{m}}$.\n\nA Basic Inequality\n\nLemma 2.3. Let $X_1, \\ldots, X_n$ be independent r.v. that are $\\sigma$-estimators to $\\mu$. Then\n\n\\[ \\Pr\\left(|\\mathrm{med}(X_1, \\ldots, X_n) - \\mu| > k\\sigma\\right) < \\left(\\frac{2}{k}\\right)^n \\]\n\n3 Simplified Approximate Description Length\n\nTo give a soft introduction to our techniques, we first consider a simplified version of them. We next define the approximate description length of a class $\\mathcal{H}$ of functions from $\\mathcal{X}$ to $\\mathbb{R}^d$, which quantifies the number of bits it takes to approximately describe a function from $\\mathcal{H}$. We will use the following notion of approximation.\n\nDefinition 3.1. 
A random vector $X \\in \\mathbb{R}^d$ is a $\\sigma$-estimator to $x \\in \\mathbb{R}^d$ if\n\n\\[ \\mathbb{E} X = x \\quad \\text{and} \\quad \\forall u \\in S^{d-1}, \\ \\mathrm{VAR}(\\langle u, X \\rangle) = \\mathbb{E} \\langle u, X - x \\rangle^2 \\le \\sigma^2 \\]\n\nA random function $\\hat f : \\mathcal{X} \\to \\mathbb{R}^d$ is a $\\sigma$-estimator to $f : \\mathcal{X} \\to \\mathbb{R}^d$ if for any $x \\in \\mathcal{X}$, $\\hat f(x)$ is a $\\sigma$-estimator to $f(x)$.\n\nA $(\\sigma, n)$-compressor $\\mathcal{C}$ for a class $\\mathcal{H}$ takes as input a function $h \\in \\mathcal{H}$, and outputs a (random) function $\\mathcal{C}h$ such that (i) $\\mathcal{C}h$ is a $\\sigma$-estimator of $h$ and (ii) it takes $n$ bits to describe $\\mathcal{C}h$. Formally,\n\nDefinition 3.2. A $(\\sigma, n)$-compressor for $\\mathcal{H}$ is a triplet $(\\mathcal{C}, \\Omega, \\mu)$ where $\\mu$ is a probability measure on $\\Omega$, and $\\mathcal{C}$ is a function $\\mathcal{C} : \\Omega \\times \\mathcal{H} \\to \\left(\\mathbb{R}^d\\right)^{\\mathcal{X}}$ such that\n\n1. For any $h \\in \\mathcal{H}$ and $x \\in \\mathcal{X}$, $(\\mathcal{C}_\\omega h)(x)$, $\\omega \\sim \\mu$ is a $\\sigma$-estimator of $h(x)$.\n\n2. There are functions $E : \\Omega \\times \\mathcal{H} \\to \\{\\pm 1\\}^n$ and $D : \\{\\pm 1\\}^n \\to \\left(\\mathbb{R}^d\\right)^{\\mathcal{X}}$ for which $\\mathcal{C} = D \\circ E$.\n\nDefinition 3.3. We say that a class $\\mathcal{H}$ of functions from $\\mathcal{X}$ to $\\mathbb{R}^d$ has approximate description length $n$ if there exists a $(1, n)$-compressor for $\\mathcal{H}$.\n\nIt is not hard to see that if $(\\mathcal{C}, \\Omega, \\mu)$ is a $(\\sigma, n)$-compressor for $\\mathcal{H}$, then\n\n\\[ (\\mathcal{C}_{\\omega_1, \\ldots, \\omega_k} h)(x) := \\frac{\\sum_{i=1}^k (\\mathcal{C}_{\\omega_i} h)(x)}{k} \\]\n\nis a $\\left(\\frac{\\sigma}{\\sqrt{k}}, kn\\right)$-compressor for $\\mathcal{H}$. Hence, if the approximate description length of $\\mathcal{H}$ is $n$, then for any $1 \\ge \\epsilon > 0$ there exists an $\\left(\\epsilon, n\\lceil \\epsilon^{-2} \\rceil\\right)$-compressor for $\\mathcal{H}$.\n\nWe next connect the approximate description length to covering numbers and representativeness. We separate it into two lemmas, one for $d = 1$ and one for general $d$, as for $d = 1$ we can prove a slightly stronger bound.\n\nLemma 3.4. Fix a class $\\mathcal{H}$ of functions from $\\mathcal{X}$ to $\\mathbb{R}$ with approximate description length $n$. Then, $\\log(N(\\mathcal{H}, m, \\epsilon)) \\le n\\lceil \\epsilon^{-2} \\rceil$. Hence, if $\\ell : \\mathbb{R}^d \\times \\mathcal{Y} \\to \\mathbb{R}$ is $L$-Lipschitz and $B$-bounded, then for any distribution $\\mathcal{D}$ on $\\mathcal{X} \\times \\mathcal{Y}$, $\\mathbb{E}_{S \\sim \\mathcal{D}^m} \\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{H}) \\lesssim \\frac{(L+B)\\sqrt{n}}{\\sqrt{m}} \\log(m)$. Furthermore, with probability at least $1 - \\delta$, $\\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{H}) \\lesssim \\frac{(L+B)\\sqrt{n}}{\\sqrt{m}} \\log(m) + B\\sqrt{\\frac{2\\ln(2/\\delta)}{m}}$.\n\nLemma 3.5. Fix a class $\\mathcal{H}$ of functions from $\\mathcal{X}$ to $\\mathbb{R}^d$ with approximate description length $n$. Then,\n\n\\[ \\log(N(\\mathcal{H}, m, \\epsilon)) \\le \\log(N_\\infty(\\mathcal{H}, m, \\epsilon)) \\le n \\lceil 16\\epsilon^{-2} \\rceil \\lceil \\log(dm) \\rceil \\]\n\nHence, if $\\ell : \\mathbb{R}^d \\times \\mathcal{Y} \\to \\mathbb{R}$ is $L$-Lipschitz w.r.t. $\\|\\cdot\\|_\\infty$ and $B$-bounded, then for any distribution $\\mathcal{D}$ on $\\mathcal{X} \\times \\mathcal{Y}$, $\\mathbb{E}_{S \\sim \\mathcal{D}^m} \\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{H}) \\lesssim \\frac{(L+B)\\sqrt{n \\log(dm)}}{\\sqrt{m}} \\log(m)$. Furthermore, with probability at least $1 - \\delta$, $\\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{H}) \\lesssim \\frac{(L+B)\\sqrt{n \\log(dm)}}{\\sqrt{m}} \\log(m) + B\\sqrt{\\frac{2\\ln(2/\\delta)}{m}}$.\n\n3.1 Linear Functions\n\nWe next bound the approximate description length of linear functions with bounded Frobenius norm.\n\nTheorem 3.6. The class $\\mathcal{L}_{d_1,d_2,M} = \\left\\{x \\in B^{d_1} \\mapsto Wx : W \\text{ is a } d_2 \\times d_1 \\text{ matrix with } \\|W\\|_F \\le M\\right\\}$ has approximate description length\n\n\\[ n \\le \\left\\lceil \\frac{1}{4} + 2M^2 \\right\\rceil 2 \\left\\lceil \\log(2 d_1 d_2 (M+1)) \\right\\rceil \\]\n\nHence, if $\\ell : \\mathbb{R}^{d_2} \\times \\mathcal{Y} \\to \\mathbb{R}$ is $L$-Lipschitz w.r.t. $\\|\\cdot\\|_\\infty$ and $B$-bounded, then for any distribution $\\mathcal{D}$ on $\\mathcal{X} \\times \\mathcal{Y}$\n\n\\[ \\mathbb{E}_{S \\sim \\mathcal{D}^m} \\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{L}_{d_1,d_2,M}) \\lesssim \\frac{(L+B)\\sqrt{M^2 \\log(d_1 d_2 M) \\log(d_2 m)}}{\\sqrt{m}} \\log(m) \\]\n\nFurthermore, with probability at least $1 - \\delta$,\n\n\\[ \\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{L}_{d_1,d_2,M}) \\lesssim \\frac{(L+B)\\sqrt{M^2 \\log(d_1 d_2 M) \\log(d_2 m)}}{\\sqrt{m}} \\log(m) + B\\sqrt{\\frac{2\\ln(2/\\delta)}{m}} \\]\n\nWe remark that the above bounds on the representativeness coincide with standard bounds ([11] for instance), up to log factors. 
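
To get a feel for the size of the description-length bound in Theorem 3.6, it can simply be evaluated numerically. The following is a small sketch, not part of the paper; the function name linear_adl_bound and the sample values of d1, d2 and M are illustrative assumptions, and log is taken base 2 as in the notation section.

```python
import math


def linear_adl_bound(d1, d2, M):
    # Evaluates ceil(1/4 + 2 M^2) * 2 * ceil(log2(2 d1 d2 (M + 1))),
    # the upper bound on the approximate description length in Theorem 3.6.
    first_factor = math.ceil(0.25 + 2 * M * M)
    second_factor = 2 * math.ceil(math.log2(2 * d1 * d2 * (M + 1)))
    return first_factor * second_factor


# A 100 x 100 matrix with Frobenius norm at most 3 has 10000 entries,
# yet the bound is only a few hundred bits.
print(linear_adl_bound(100, 100, 3))    # 646
print(linear_adl_bound(1000, 1000, 1))  # 132
```

The bound grows with the square of the Frobenius norm bound and only logarithmically with the dimensions, which is what makes it useful when the norm is small relative to the number of parameters.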
The advantage of the bounds in Theorem 3.6 is that they remain valid for any output dimension $d_2$.\n\nIn order to prove Theorem 3.6 we will use a randomized sketch of a matrix.\n\nDefinition 3.7. Let $w \\in \\mathbb{R}^d$ be a vector. A random sketch of $w$ is a random vector $\\hat w$ that is sampled as follows. Choose $i$ w.p. $p_i = \\frac{w_i^2}{2\\|w\\|^2} + \\frac{1}{2d}$. Then, w.p. $\\frac{w_i}{p_i} - \\left\\lfloor \\frac{w_i}{p_i} \\right\\rfloor$ let $b = 1$ and otherwise $b = 0$. Finally, let $\\hat w = \\left(\\left\\lfloor \\frac{w_i}{p_i} \\right\\rfloor + b\\right) e_i$. A random $k$-sketch of $w$ is an average of $k$ independent random sketches of $w$. A random sketch and a random $k$-sketch of a matrix are defined similarly, with the standard matrix basis instead of the standard vector basis.\n\nThe following useful lemma shows that a sketch of $w$ is a $\\sqrt{\\frac{1}{4} + 2\\|w\\|^2}$-estimator of $w$.\n\nLemma 3.8. Let $\\hat w$ be a random sketch of $w \\in \\mathbb{R}^d$. Then, (1) $\\mathbb{E} \\hat w = w$ and (2) for any $u \\in S^{d-1}$, $\\mathbb{E}\\left(\\langle u, \\hat w \\rangle - \\langle u, w \\rangle\\right)^2 \\le \\mathbb{E} \\langle u, \\hat w \\rangle^2 \\le \\frac{1}{4} + 2\\|w\\|^2$.\n\nProof. (of Theorem 3.6) We construct a compressor for $\\mathcal{L}_{d_1,d_2,M}$ as follows. Given $W$, we will sample a $k$-sketch $\\hat W$ of $W$ for $k = \\left\\lceil \\frac{1}{4} + 2M^2 \\right\\rceil$, and will return the function $x \\mapsto \\hat W x$. We claim that $W \\mapsto \\hat W$ is a $(1, 2k\\lceil \\log(2 d_1 d_2 (M+1)) \\rceil)$-compressor for $\\mathcal{L}_{d_1,d_2,M}$. Indeed, to specify a sketch of $W$ we need $\\lceil \\log(d_1 d_2) \\rceil$ bits to describe the chosen index, as well as $\\log(2 d_1 d_2 M + 2)$ bits to describe the value in that index. Hence, $2k\\lceil \\log(2 d_1 d_2 (M+1)) \\rceil$ bits suffice to specify a $k$-sketch. It remains to show that for $x \\in B^{d_1}$, $\\hat W x$ is a 1-estimator of $Wx$. Indeed, by Lemma 3.8, $\\mathbb{E} \\hat W = W$ and therefore $\\mathbb{E} \\hat W x = Wx$. Likewise, for $u \\in S^{d_2-1}$ we have\n\n\\[ \\mathbb{E}\\left(\\langle u, \\hat W x \\rangle - \\langle u, Wx \\rangle\\right)^2 = \\mathbb{E}\\left(\\langle \\hat W, x u^T \\rangle - \\langle W, x u^T \\rangle\\right)^2 \\le \\frac{\\frac{1}{4} + 2M^2}{k} \\le 1 \\]\n\n3.2 Simplified Depth 2 Networks\n\nTo demonstrate our techniques, we consider the following class of functions. We let the domain $\\mathcal{X}$ be $B^d$. We fix an activation function $\\rho : \\mathbb{R} \\to \\mathbb{R}$ that is assumed to be a polynomial $\\rho(x) = \\sum_{i=0}^k a_i x^i$ with $\\sum_{n=1}^k |a_n| = 1$. For any $W \\in M_{d,d}$ we define $h_W(x) = \\frac{1}{\\sqrt{d}} \\sum_{i=1}^d \\rho(\\langle w_i, x \\rangle)$. Finally, we let $\\mathcal{H} = \\left\\{h_W : \\forall i, \\|w_i\\| \\le \\frac{1}{2}\\right\\}$. In order to build compressors for classes of networks, we will utilize the compositional structure of the classes. Specifically, we have that $\\mathcal{H} = \\Lambda \\circ \\rho \\circ \\mathcal{F}$ where $\\mathcal{F} = \\left\\{x \\mapsto Wx : W \\text{ is a } d \\times d \\text{ matrix with } \\|w_i\\| \\le \\frac{1}{2} \\text{ for all } i\\right\\}$ and $\\Lambda(x) = \\frac{1}{\\sqrt{d}} \\sum_{i=1}^d x_i$. As $\\mathcal{F}$ is a subset of $\\mathcal{L}_{d,d,\\sqrt{d}}$, we know that there exists a $(1, O(d\\log(d)))$-compressor for it. We will use this compressor to build a compressor to $\\rho \\circ \\mathcal{F}$, and then to $\\Lambda \\circ \\rho \\circ \\mathcal{F}$. We will start with the latter, linear case, which is simpler.\n\nLemma 3.9. Let $X$ be a $\\sigma$-estimator to $x \\in \\mathbb{R}^{d_1}$. Let $A \\in M_{d_2,d_1}$ be a matrix of spectral norm $\\le r$. Then, $AX$ is a $(r\\sigma)$-estimator to $Ax$. In particular, if $\\mathcal{C}$ is a $(1, n)$-compressor to a class $\\mathcal{H}$ of functions from $\\mathcal{X}$ to $\\mathbb{R}^d$, then\n\n\\[ \\mathcal{C}'_\\omega(\\Lambda \\circ h) = \\Lambda \\circ \\mathcal{C}_\\omega h \\]\n\nis a $(1, n)$-compressor to $\\Lambda \\circ \\mathcal{H}$.\n\nWe next consider the composition of $\\mathcal{F}$ with the non-linear $\\rho$. As opposed to composition with a linear function, we cannot just generate a compressed version using $\\mathcal{F}$'s compressor and then compose with $\\rho$. 
Indeed, if $X$ is a $\\sigma$-estimator to $x$, it is not true in general that $\\rho(X)$ is an estimator of $\\rho(x)$. For instance, consider the case that $\\rho(x) = x^2$, and $X = (X_1, \\ldots, X_d)$ is a vector of independent standard Gaussians. $X$ is a 1-estimator of $0 \\in \\mathbb{R}^d$. On the other hand, $\\rho(X) = (X_1^2, \\ldots, X_d^2)$ is not an estimator of $0 = \\rho(0)$. We will therefore take a different approach. Given $f \\in \\mathcal{F}$, we will sample $k$ independent estimators $\\{\\mathcal{C}_{\\omega_i} f\\}_{i=1}^k$ from $\\mathcal{F}$'s compressor, and define the compressed version of $\\rho \\circ f$ as $\\mathcal{C}'_{\\omega_1, \\ldots, \\omega_k} f = \\sum_{i=0}^k a_i \\prod_{j=1}^i \\mathcal{C}_{\\omega_j} f$. This construction is analyzed in the following lemma.\n\nLemma 3.10. If $\\mathcal{C}$ is a $\\left(\\frac{1}{2}, n\\right)$-compressor of a class $\\mathcal{H}$ of functions from $\\mathcal{X}$ to $\\left[-\\frac{1}{2}, \\frac{1}{2}\\right]^d$, then $\\mathcal{C}'$ is a $(1, n)$-compressor of $\\rho \\circ \\mathcal{H}$.\n\nCombining Theorem 3.6 and Lemmas 3.9, 3.10 we have:\n\nTheorem 3.11. $\\mathcal{H}$ has approximate description length $\\lesssim d\\log(d)$. Hence, if $\\ell : \\mathbb{R} \\times \\mathcal{Y} \\to \\mathbb{R}$ is $L$-Lipschitz and $B$-bounded, then for any distribution $\\mathcal{D}$ on $\\mathcal{X} \\times \\mathcal{Y}$\n\n\\[ \\mathbb{E}_{S \\sim \\mathcal{D}^m} \\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{H}) \\lesssim \\frac{(L+B)\\sqrt{d\\log(d)}}{\\sqrt{m}} \\log(m) \\]\n\nFurthermore, with probability at least $1 - \\delta$,\n\n\\[ \\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{H}) \\lesssim \\frac{(L+B)\\sqrt{d\\log(d)}}{\\sqrt{m}} \\log(m) + B\\sqrt{\\frac{2\\ln(2/\\delta)}{m}} \\]\n\nLemma 3.10 is implied by the following useful lemma:\n\nLemma 3.12.\n\n1. If $X$ is a $\\sigma$-estimator of $x$ then $aX$ is a $(|a|\\sigma)$-estimator of $ax$.\n\n2. Suppose that for $n = 1, 2, 3, \\ldots$, $X_n$ is a $\\sigma_n$-estimator of $x_n \\in \\mathbb{R}^d$. Assume furthermore that $\\sum_{n=1}^\\infty x_n$ and $\\sum_{n=1}^\\infty \\sigma_n$ converge to $x \\in \\mathbb{R}^d$ and $\\sigma \\in [0, \\infty)$. Then, $\\sum_{n=1}^\\infty X_n$ is a $\\sigma$-estimator of $x$.\n\n3. Suppose that $\\{X_i\\}_{i=1}^k$ are independent $\\sigma_i$-estimators of $x_i \\in \\mathbb{R}^d$. Then $\\prod_{i=1}^k X_i$ is a $\\sigma'$-estimator of $\\prod_{i=1}^k x_i$ for $\\sigma'^2 = \\prod_{i=1}^k \\left(\\sigma_i^2 + \\|x_i\\|_\\infty^2\\right) - \\prod_{i=1}^k \\|x_i\\|_\\infty^2$.\n\nWe note that the bounds in the above lemma are all tight.\n\n4 Approximate Description Length\n\nIn this section we refine the definition of approximate description length that was given in section 3. We start with the encoding of the compressed version of the functions. Instead of standard strings, we will use what we call bracketed strings. The reason is that often, in order to create a compressed version of a function, we concatenate compressed versions of other functions. This results in strings with a nested structure. For instance, consider the case that a function $h$ is encoded by the concatenation of $h_1$ and $h_2$. Furthermore, assume that $h_1$ is encoded by the string 01, while $h_2$ is encoded by the concatenation of $h_3$, $h_4$ and $h_5$ that are in turn encoded by the strings 101, 0101 and 1110. The encoding of $h$ will then be [[01][[101][0101][1110]]]. We note that in section 3 we could avoid this issue since the length of the strings and the recursive structure were fixed, and did not depend on the function we try to compress. Formally, we define\n\nDefinition 4.1. A bracketed string is a rooted tree $S$, such that (i) the children of each node are ordered, (ii) there are no nodes with a single child, and (iii) the leaves are labeled by $\\{0, 1\\}$. The length, $\\mathrm{len}(S)$, of $S$ is the number of its leaves.\n\nLet $S$ be a bracketed string. There is a linear order on its leaves that is defined as follows. Fix a pair of leaves, $v_1$ and $v_2$, and let $u$ be their LCA. Let $u_1$ (resp. $u_2$) be the child of $u$ that lies on the path to $v_1$ (resp. $v_2$). We define $v_1 < v_2$ if $u_1 < u_2$ and $v_1 > v_2$ otherwise (note that necessarily $u_1 \\ne u_2$). Let $v_1, \\ldots, v_n$ be the leaves of $S$, ordered according to the above order, and let $b_1, \\ldots, b_n$ be the corresponding bits. 
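
The leaf order just described can be made concrete with nested lists. The sketch below is only an illustration under an assumed representation (a leaf is the integer 0 or 1, an internal node is a Python list of children); the helper names bits, leaves and render are hypothetical. It rebuilds the example encoding of h from the text and reads off its leaves in order.

```python
def bits(text):
    # Turn a plain string such as '101' into a node whose children are bit leaves.
    return [int(ch) for ch in text]


def leaves(node):
    # Leaves in the linear order of the tree: children are visited left to right,
    # so two leaves are compared through the children of their lowest common ancestor.
    if isinstance(node, int):
        return [node]
    ordered = []
    for child in node:
        ordered.extend(leaves(child))
    return ordered


def render(node):
    # Write the tree with one pair of brackets per internal node.
    if isinstance(node, int):
        return str(node)
    return '[' + ''.join(render(child) for child in node) + ']'


h1 = bits('01')
h2 = [bits('101'), bits('0101'), bits('1110')]
h = [h1, h2]

print(render(h))                           # [[01][[101][0101][1110]]]
print(len(leaves(h)))                      # 13, the length of the bracketed string
print(''.join(str(b) for b in leaves(h)))  # 0110101011110
```

Dropping the brackets loses the nesting but keeps the bits in the same order, which is why the length counts leaves only.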
The string associated with $S$ is $s = b_1 \\ldots b_n$. We denote by $\\mathcal{S}_n$ the collection of bracketed strings of length $\\le n$, and by $\\mathcal{S} = \\cup_{n=1}^\\infty \\mathcal{S}_n$ the collection of all bracketed strings.\n\nThe following lemma shows that in log-scale, the number of bracketed strings of length $\\le n$ differs from the number of standard strings of length $\\le n$ by only a constant factor.\n\nLemma 4.2. $|\\mathcal{S}_n| \\le 32^n$\n\nWe next revisit the definition of a compressor for a class $\\mathcal{H}$. The definition of compressor will now have a third parameter, $n_s$, in addition to $\\sigma$ and $n$. We will make three changes in the definition. The first, which is only for the sake of convenience, is that we will use bracketed strings rather than standard strings. The second change is that the length of the encoding string will be bounded only in expectation. The final change is that the compressor can now output a seed. That is, given a function $h \\in \\mathcal{H}$ that we want to compress, the compressor can generate both a non-random seed $E_s(h) \\in \\mathcal{S}_{n_s}$ and a random encoding $E(\\omega, h) \\in \\mathcal{S}$ with $\\mathbb{E}_{\\omega \\sim \\mu} \\mathrm{len}(E(\\omega, h)) \\le n$. Together, $E_s(h)$ and $E(\\omega, h)$ encode a $\\sigma$-estimator. Namely, there is a function $D : \\mathcal{S}_{n_s} \\times \\mathcal{S} \\to \\left(\\mathbb{R}^d\\right)^{\\mathcal{X}}$ such that $D(E_s(h), E(\\omega, h))$, $\\omega \\sim \\mu$ is a $\\sigma$-estimator of $h$. The advantage of using seeds is that it will allow us to generate many independent estimators, at a lower cost. In the case that $n \\ll n_s$, the cost of generating $k$ independent estimators of $h \\in \\mathcal{H}$ is $n_s + kn$ bits (in expectation) instead of $k(n_s + n)$ bits. Indeed, we can encode $k$ estimators by a single seed $E_s(h)$ and $k$ independent \u201cregular\u201d encodings $E(\\omega_1, h), \\ldots, E(\\omega_k, h)$. The formal definition is given next.\n\nDefinition 4.3. A $(\\sigma, n_s, n)$-compressor for $\\mathcal{H}$ is a 5-tuple $\\mathcal{C} = (E_s, E, D, \\Omega, \\mu)$ where $\\mu$ is a probability measure on $\\Omega$, and $E_s, E, D$ are functions $E_s : \\mathcal{H} \\to \\mathcal{S}_{n_s}$, $E : \\Omega \\times \\mathcal{H} \\to \\mathcal{S}$, and $D : \\mathcal{S}_{n_s} \\times \\mathcal{S} \\to \\left(\\mathbb{R}^d\\right)^{\\mathcal{X}}$ such that for any $h \\in \\mathcal{H}$ and $x \\in \\mathcal{X}$ (1) $D(E_s(h), E(\\omega, h))$, $\\omega \\sim \\mu$ is a $\\sigma$-estimator of $h$ and (2) $\\mathbb{E}_{\\omega \\sim \\mu} \\mathrm{len}(E(\\omega, h)) \\le n$.\n\nWe finally revisit the definition of approximate description length. We will add an additional parameter, to accommodate the use of seeds. Likewise, the approximate description length will now be a function of $m$: we will say that $\\mathcal{H}$ has approximate description length $(n_s(m), n(m))$ if there is a $(1, n_s(m), n(m))$-compressor for the restriction of $\\mathcal{H}$ to any set $A \\subset \\mathcal{X}$ of size at most $m$. Formally:\n\nDefinition 4.4. We say that a class $\\mathcal{H}$ of functions from $\\mathcal{X}$ to $\\mathbb{R}^d$ has approximate description length $(n_s(m), n(m))$ if for any set $A \\subset \\mathcal{X}$ of size $\\le m$ there exists a $(1, n_s(m), n(m))$-compressor for $\\mathcal{H}|_A$.\n\nIt is not hard to see that if $\\mathcal{H}$ has approximate description length $(n_s(m), n(m))$, then for any $1 \\ge \\epsilon > 0$ and a set $A \\subset \\mathcal{X}$ of size $\\le m$, there exists an $\\left(\\epsilon, n_s(m), n(m)\\lceil \\epsilon^{-2} \\rceil\\right)$-compressor for $\\mathcal{H}|_A$.\n\nWe next connect the approximate description length to covering numbers and representativeness. The proofs are similar to the proofs of Lemmas 3.4 and 3.5.\n\nLemma 4.5. Fix a class $\\mathcal{H}$ of functions from $\\mathcal{X}$ to $\\mathbb{R}$ with approximate description length $(n_s(m), n(m))$. Then, $\\log(N(\\mathcal{H}, m, \\epsilon)) \\lesssim n_s(m) + \\frac{n(m)}{\\epsilon^2}$. Hence, if $\\ell : \\mathbb{R}^d \\times \\mathcal{Y} \\to \\mathbb{R}$ is $L$-Lipschitz and $B$-bounded, then for any distribution $\\mathcal{D}$ on $\\mathcal{X} \\times \\mathcal{Y}$\n\n\\[ \\mathbb{E}_{S \\sim \\mathcal{D}^m} \\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{H}) \\lesssim \\frac{(L+B)\\sqrt{n_s(m) + n(m)}}{\\sqrt{m}} \\log(m) \\]\n\nFurthermore, with probability at least $1 - \\delta$,\n\n\\[ \\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{H}) \\lesssim \\frac{(L+B)\\sqrt{n_s(m) + n(m)}}{\\sqrt{m}} \\log(m) + B\\sqrt{\\frac{2\\ln(2/\\delta)}{m}} \\]\n\nLemma 4.6. 
Fix a class $\\mathcal{H}$ of functions from $\\mathcal{X}$ to $\\mathbb{R}^d$ with approximate description length $(n_s(m), n(m))$. Then, $\\log(N(\\mathcal{H}, m, \\epsilon)) \\le \\log(N_\\infty(\\mathcal{H}, m, \\epsilon)) \\lesssim n_s(m) + \\frac{n(m)\\log(dm)}{\\epsilon^2}$. Hence, if $\\ell : \\mathbb{R}^d \\times \\mathcal{Y} \\to \\mathbb{R}$ is $L$-Lipschitz w.r.t. $\\|\\cdot\\|_\\infty$ and $B$-bounded, then for any distribution $\\mathcal{D}$ on $\\mathcal{X} \\times \\mathcal{Y}$\n\n\\[ \\mathbb{E}_{S \\sim \\mathcal{D}^m} \\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{H}) \\lesssim \\frac{(L+B)\\sqrt{n_s(m) + n(m)\\log(dm)}}{\\sqrt{m}} \\log(m) \\]\n\nFurthermore, with probability at least $1 - \\delta$,\n\n\\[ \\mathrm{rep}_{\\mathcal{D}}(S, \\mathcal{H}) \\lesssim \\frac{(L+B)\\sqrt{n_s(m) + n(m)\\log(dm)}}{\\sqrt{m}} \\log(m) + B\\sqrt{\\frac{2\\ln(2/\\delta)}{m}} \\]\n\nWe next analyze the behavior of the approximate description length under various operations.\n\nLemma 4.7. Let $\\mathcal{H}_1, \\mathcal{H}_2$ be classes of functions from $\\mathcal{X}$ to $\\mathbb{R}^d$ with approximate description length of $(n_s^1(m), n^1(m))$ and $(n_s^2(m), n^2(m))$. Then $\\mathcal{H}_1 + \\mathcal{H}_2$ has approximate description length of $(n_s^1(m) + n_s^2(m), 2n^1(m) + 2n^2(m))$.\n\nLemma 4.8. Let $\\mathcal{H}$ be a class of functions from $\\mathcal{X}$ to $\\mathbb{R}^{d_1}$ with approximate description length of $(n_s(m), n(m))$. Let $A$ be a $d_2 \\times d_1$ matrix. Then $A \\circ \\mathcal{H}$ has approximate description length $\\left(n_s(m), \\lceil \\|A\\|^2 \\rceil n(m)\\right)$.\n\nFigure 1: The functions $\\ln(1 + e^x)$ and $\\frac{e^x}{1+e^x}$\n\nDefinition 4.9. Denote by $\\mathcal{L}_{d_1,d_2,r,R}$ the class of all $d_2 \\times d_1$ matrices of spectral norm at most $r$ and Frobenius norm at most $R$.\n\nLemma 4.10. Let $\\mathcal{H}$ be a class of functions from $\\mathcal{X}$ to $\\mathbb{R}^{d_1}$ with approximate description length $(n_s(m), n(m))$. Assume furthermore that for any $x \\in \\mathcal{X}$ and $h \\in \\mathcal{H}$ we have that $\\|h(x)\\| \\le B$. Then, $\\mathcal{L}_{d_1,d_2,r,R} \\circ \\mathcal{H}$ has approximate description length\n\n\\[ \\left(n_s(m), \\ n(m) O(r^2 + 1) + O\\left((d_1 + B^2)(R^2 + 1)\\log(R d_1 d_2 + 1)\\right)\\right) \\]\n\nDefinition 4.11. 
A function $f : \\mathbb{R} \\to \\mathbb{R}$ is $B$-strongly-bounded if for all $n \\ge 1$, $\\|f^{(n)}\\|_\\infty \\le n! B^n$. Likewise, $f$ is strongly-bounded if it is $B$-strongly-bounded for some $B$.\n\nWe note that\n\nLemma 4.12. If $f$ is $B$-strongly-bounded then $f$ is analytic and its Taylor coefficients around any point are bounded by $B^n$.\n\nThe following lemma gives an example of a strongly-bounded sigmoid function, as well as a strongly-bounded smoothened version of the ReLU (see Figure 1).\n\nLemma 4.13. The functions $\\ln(1 + e^x)$ and $\\frac{e^x}{1 + e^x}$ are strongly-bounded.\n\nLemma 4.14. Let $\\mathcal{H}$ be a class of functions from $\\mathcal{X}$ to $\\mathbb{R}^d$ with approximate description length $(n_s(m), n(m))$. Let $\\rho : \\mathbb{R} \\to \\mathbb{R}$ be $B$-strongly-bounded. Then, $\\rho \\circ \\mathcal{H}$ has approximate description length\n\\[ \\left(n_s(m) + O\\left(n(m)B^2\\log(md)\\right),\\; O\\left(n(m)B^2\\log(d)\\right)\\right) \\]\n\n5 Sample Complexity of Neural Networks\n\nFix the instance space $\\mathcal{X}$ to be the ball of radius $\\sqrt{d}$ in $\\mathbb{R}^d$ (in particular $[-1, 1]^d \\subset \\mathcal{X}$) and a $B$-strongly-bounded activation $\\rho$. Fix matrices $W_i^0 \\in M_{d_i, d_{i-1}}$, $i = 1, \\ldots, t$. Consider the following class of depth-$t$ networks\n\\[ \\mathcal{N}_{r,R}(W_1^0, \\ldots, W_t^0) = \\left\\{ W_t \\circ \\rho \\circ W_{t-1} \\circ \\rho \\ldots \\circ \\rho \\circ W_1 : \\|W_i - W_i^0\\| \\le r, \\|W_i - W_i^0\\|_F \\le R \\right\\} \\]\nWe note that\n\\[ \\mathcal{N}_{r,R}(W_1^0, \\ldots, W_t^0) = \\mathcal{N}_{r,R}(W_t^0) \\circ \\ldots \\circ \\mathcal{N}_{r,R}(W_1^0) \\]\nThe following lemma analyzes the cost, in terms of approximate description length, when moving from a class $\\mathcal{H}$ to $\\mathcal{N}_{r,R}(W^0) \\circ \\mathcal{H}$.\n\nLemma 5.1. Let $\\mathcal{H}$ be a class of functions from $\\mathcal{X}$ to $\\mathbb{R}^{d_1}$ with approximate description length $(n_s(m), n(m))$ and $\\|h(x)\\| \\le M$ for any $x \\in \\mathcal{X}$ and $h \\in \\mathcal{H}$. Fix $W^0 \\in M_{d_2, d_1}$. Then, $\\mathcal{N}_{r,R}(W^0) \\circ \\mathcal{H}$ has approximate description length of\n\\[ \\left(n_s(m) + n'(m)B^2\\log(md_2),\\; n'(m)B^2\\log(d_2)\\right) \\]\nfor\n\\[ n'(m) = n(m)\\,O(r^2 + \\|W^0\\|^2 + 1) + O\\left((d_1 + M^2)(R^2 + 1)\\log(Rd_1d_2 + 1)\\right) \\]\n\nThe lemma follows by combining Lemmas 4.7, 4.8, 4.10 and 4.14. We note that in the case that $d_1, d_2 \\le d$, $M = O(\\sqrt{d_1})$, $B, r, \\|W^0\\| = O(1)$ (and hence $R = O(\\sqrt{d})$) and $R \\ge 1$ we get that $\\mathcal{N}_{r,R}(W^0) \\circ \\mathcal{H}$ has approximate description length of\n\\[ \\left(n_s(m) + O\\left(n(m)\\log(md)\\right),\\; O\\left(n(m)\\log(d)\\right) + O\\left(d_1R^2\\log^2(d)\\right)\\right) \\]\nBy induction, the approximate description length of $\\mathcal{N}_{r,R}(W_1^0, \\ldots, W_t^0)$ is\n\\[ \\left(dR^2\\,O(\\log(d))^t\\log(md),\\; dR^2\\,O(\\log(d))^{t+1}\\right) \\]\n\nReferences\n\n[1] Martin Anthony and Peter Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.\n\n[2] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In ICML, 2018.\n\n[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463\u2013482, 2002.\n\n[4] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240\u20136249, 2017.\n\n[5] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In COLT, 2018.\n\n[6] Vaishnavh Nagarajan and J Zico Kolter. Generalization in deep networks: The role of distance from initialization. arXiv preprint arXiv:1901.01672, 2019.\n\n[7] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. 
In ICLR, 2018.\n\n[8] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. The role of over-parametrization in generalization of neural networks. In ICLR, 2019.\n\n[9] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376\u20131401, 2015.\n\n[10] R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 322\u2013330, 1997. To appear, The Annals of Statistics.\n\n[11] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.\n\n[12] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139\u20131147, 2013.", "award": [], "sourceid": 7117, "authors": [{"given_name": "Amit", "family_name": "Daniely", "institution": "Hebrew University and Google Research"}, {"given_name": "Elad", "family_name": "Granot", "institution": "Hebrew University"}]}