{"title": "Random deep neural networks are biased towards simple functions", "book": "Advances in Neural Information Processing Systems", "page_first": 1964, "page_last": 1976, "abstract": "We prove that the binary classifiers of bit strings generated by random wide deep neural networks with ReLU activation function are biased towards simple functions. The simplicity is captured by the following two properties. For any given input bit string, the average Hamming distance of the closest input bit string with a different classification is at least sqrt(n / (2\u03c0 log n)), where n is the length of the string. Moreover, if the bits of the initial string are flipped randomly, the average number of flips required to change the classification grows linearly with n. These results are confirmed by numerical experiments on deep neural networks with two hidden layers, and settle the conjecture stating that random deep neural networks are biased towards simple functions. This conjecture was proposed and numerically explored in [Valle P\u00e9rez et al., ICLR 2019] to explain the unreasonably good generalization properties of deep learning algorithms. The probability distribution of the functions generated by random deep neural networks is a good choice for the prior probability distribution in the PAC-Bayesian generalization bounds. Our results constitute a fundamental step forward in the characterization of this distribution, therefore contributing to the understanding of the generalization properties of deep learning algorithms.", "full_text": "Random deep neural networks are biased towards\n\nsimple functions\n\nBobak T. 
Kiani
MechE & RLE, MIT
Cambridge MA 02139, USA
bkiani@mit.edu

Seth Lloyd
MechE, Physics & RLE, MIT
Cambridge MA 02139, USA
slloyd@mit.edu

Giacomo De Palma
MechE & RLE, MIT
Cambridge MA 02139, USA
gdepalma@mit.edu

Abstract

We prove that the binary classifiers of bit strings generated by random wide deep neural networks with ReLU activation function are biased towards simple functions. The simplicity is captured by the following two properties. For any given input bit string, the average Hamming distance of the closest input bit string with a different classification is at least √(n/(2π ln n)), where n is the length of the string. Moreover, if the bits of the initial string are flipped randomly, the average number of flips required to change the classification grows linearly with n. These results are confirmed by numerical experiments on deep neural networks with two hidden layers, and settle the conjecture stating that random deep neural networks are biased towards simple functions. This conjecture was proposed and numerically explored in [Valle Pérez et al., ICLR 2019] to explain the unreasonably good generalization properties of deep learning algorithms. The probability distribution of the functions generated by random deep neural networks is a good choice for the prior probability distribution in the PAC-Bayesian generalization bounds. Our results constitute a fundamental step forward in the characterization of this distribution, therefore contributing to the understanding of the generalization properties of deep learning algorithms.

1 Introduction

The field of deep learning provides a broad family of algorithms to fit an unknown target function via a deep neural network and is having an enormous success in the fields of computer vision, machine learning and artificial intelligence [1-5].
The input of a deep learning algorithm is a training set, which is a set of inputs of the target function together with the corresponding outputs. The goal of the learning algorithm is to determine the parameters of the deep neural network that best reproduces the training set.

Deep learning algorithms generalize well when trained on real-world data [6]: the deep neural networks that they generate usually reproduce the target function even for inputs that are not part of the training set and do not suffer from over-fitting even if the number of parameters of the network is larger than the number of elements of the training set [7-10]. A thorough theoretical understanding of this unreasonable effectiveness is still lacking. The bounds to the generalization error of learning algorithms are proven in the probably approximately correct (PAC) learning framework [11]. Most of these bounds depend on complexity measures such as the Vapnik-Chervonenkis dimension [12, 13] or the Rademacher complexity [14, 15], which are based on the worst-case analysis and are not sufficient to explain the observed effectiveness, since they become void when the number of parameters is larger than the number of training samples [10, 16-21]. A complementary approach is provided by the PAC-Bayesian generalization bounds [19, 22-25], which apply to nondeterministic learning algorithms. These bounds depend on the Kullback-Leibler divergence [26] between the probability distribution of the function generated by the learning algorithm given the training set and an arbitrary prior probability distribution that is not allowed to depend on the training set: the smaller the divergence, the better the generalization properties of the algorithm.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
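For concreteness, one standard bound of this type (a common form of McAllester's PAC-Bayesian theorem, quoted here from the general PAC-Bayes literature rather than from the present paper; the exact constants differ between versions) states that, with probability at least 1 − δ over a training set of m samples, simultaneously for every posterior distribution Q over functions,

E_{f∼Q} L(f) ≤ E_{f∼Q} L̂(f) + √( (KL(Q‖P) + ln(2√m/δ)) / (2m) ),

where L and L̂ denote the true and the empirical error, and P is the prior distribution, which must be chosen before seeing the training set.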
Making the right choice for the\nprior distribution is fundamental to obtain a nontrivial generalization bound.\nA good choice for the prior distribution is the probability distribution of the functions generated\nby deep neural networks with randomly initialized weights [27]. Understanding this distribution is\ntherefore necessary to understand the generalization properties of deep learning algorithms. PAC-\nBayesian generalization bounds with this prior distribution led to the proposal that the unreasonable\neffectiveness of deep learning algorithms arises from the fact that the functions generated by a random\ndeep neural network are biased towards simple functions [27\u201329]. Since real-world functions are\nusually simple [30, 31], among all the functions that are compatible with a training set made of\nreal-world data, the simple ones are more likely to be close to the target function. The conjectured\nbias towards simple functions has been numerically explored in [27], which considered binary\nclassi\ufb01cations of bit strings and showed that binary classi\ufb01ers with a small Lempel-Ziv complexity\n[32] are more likely to be generated by a random deep neural network than binary classi\ufb01ers with a\nlarge Lempel-Ziv complexity. However, a rigorous proof of this bias is still lacking.\n\n1.1 Our contribution\n\nWe prove that random deep neural networks are biased towards simple functions, in the sense that\na typical function generated is insensitive to large changes in the input. We consider random deep\nneural networks with Recti\ufb01ed Linear Unit (ReLU) activation function and weights and biases drawn\nfrom independent Gaussian probability distributions, and we employ such networks to implement\nbinary classi\ufb01ers of bit strings. 
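As a concrete illustration, a random network of this kind and its induced binary classifier can be sampled in a few lines. The following is a minimal sketch for intuition only: the layer widths, weight variances and seed are arbitrary placeholder choices, not the experimental setup of this paper.

```python
import numpy as np

def sample_network(n, widths=(512, 512), sigma_w=1.0, sigma_b=1.0, rng=None):
    """Sample a random ReLU network with input dimension n and scalar output.

    Entries of W^(l) are i.i.d. N(0, sigma_w^2 / n_{l-1}) and entries of
    b^(l) are i.i.d. N(0, sigma_b^2).
    """
    rng = np.random.default_rng(rng)
    sizes = [n, *widths, 1]
    return [(rng.normal(0.0, sigma_w / np.sqrt(m), size=(k, m)),
             rng.normal(0.0, sigma_b, size=k))
            for m, k in zip(sizes[:-1], sizes[1:])]

def phi(params, x):
    """Network output phi(x); ReLU is applied after every layer but the last."""
    h = np.asarray(x, dtype=float)
    for i, (W, b) in enumerate(params):
        h = W @ h + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h[0]

def classify(params, x):
    """Binary classifier psi(x) = sign(phi(x)), valued in {-1, +1}."""
    return 1 if phi(params, x) >= 0 else -1

# Classify a random bit string in {-1, 1}^n.
rng = np.random.default_rng(0)
n = 100
net = sample_network(n, rng=rng)
x = rng.choice([-1, 1], size=n)
print(classify(net, x))  # prints -1 or +1
```

The 1/√fan-in scaling of the weight variance keeps the output of order one as the width grows, which is the regime in which the Gaussian process approximation discussed in section 2 applies.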
Our main results are the following:

• We prove that for n ≫ 1, where n is the length of the string, for any given input bit string the average Hamming distance of the closest bit string with a different classification is at least √(n/(2π ln n)) (Theorem 1), where the Hamming distance between two bit strings is the number of different bits.

• We prove that, if the bits of the initial string are randomly flipped, the average number of bit flips required to change the classification grows linearly with n (Theorem 2). From a heuristic argument, we find that the average required number of bit flips is at least n/4 (subsection 3.3), and simulations on deep neural networks with two hidden layers indicate a scaling of approximately n/3.

By contrast, for a random binary classifier drawn from the uniform distribution over all the possible binary classifiers of strings of n ≫ 1 bits, the average Hamming distance of the closest bit string with a different classification is one, and the average number of random bit flips required to change the classification is two. Therefore, our result identifies a fundamental qualitative difference between a typical binary classifier generated by a random deep neural network and a uniformly random binary classifier.

The result proves that the binary classifiers generated by random deep neural networks are simple and identifies the classifiers that are likely to be generated as the ones with the property that a large number of bits need to be flipped in order to change the classification. While all the classifiers with this property have a low Kolmogorov complexity¹, the converse is not true.
For example, the parity function has a small Kolmogorov complexity, but it is sufficient to flip just one bit of the input to change the classification; hence our result implies that it occurs with a probability exponentially small in n. Similarly, our results explain why [27] found that the look-up tables of the functions generated by random deep networks are typically highly compressible with the LZW algorithm [35], which identifies statistical regularities; however, not all functions with highly compressible look-up tables are likely to be generated.

¹The Kolmogorov complexity of a function is the length of the shortest program that implements the function on a Turing machine [26, 33, 34].

The proofs of Theorems 1 and 2 are based on the approximation of random deep neural networks as Gaussian processes, which becomes exact in the limit of infinite width [36-47]. The crucial property of random deep neural networks captured by this approximation is that the outputs generated by inputs whose Hamming distance grows sub-linearly with n become perfectly correlated in the limit n → ∞. These strong correlations are the reason why a large number of input bits need to be flipped in order to change the classification. The proof of Theorem 2 also exploits the theory of stochastic processes, and in particular the Kolmogorov continuity theorem [48]. We stress that for activation functions other than the ReLU, the scaling with n of both the Hamming distance of the closest bit string with a different classification and the number of random bit flips necessary to change the classification remains the same. However, the prefactor can change and can be exponentially small in the number of hidden layers.

We validate all the theoretical results with numerical experiments on deep neural networks with ReLU activation function and two hidden layers.
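A toy version of such an experiment, estimating how many random bit flips are needed before the classification of a random network changes, can be sketched as follows (an illustrative sketch only: the width, variances and sample counts below are arbitrary choices, not the configuration used for the figures of this paper):

```python
import numpy as np

def random_relu_net(n, width=256, rng=None):
    """Random ReLU network with two hidden layers and N(0, 1/fan_in) weights."""
    rng = np.random.default_rng(rng)
    sizes = [n, width, width, 1]
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(k, m)),
             rng.normal(0.0, 1.0, size=k))
            for m, k in zip(sizes[:-1], sizes[1:])]

def output(net, x):
    h = np.asarray(x, dtype=float)
    for i, (W, b) in enumerate(net):
        h = W @ h + b
        if i < len(net) - 1:
            h = np.maximum(h, 0.0)
    return h[0]

def flips_to_change(net, x, rng):
    """Flip distinct random bits of x until the classification changes."""
    s0 = np.sign(output(net, x))
    y = np.array(x, dtype=float)
    for steps, i in enumerate(rng.permutation(len(x)), start=1):
        y[i] = -y[i]                      # each bit is flipped at most once
        if np.sign(output(net, y)) != s0:
            return steps
    return len(x)

rng = np.random.default_rng(1)
n = 200
counts = [flips_to_change(random_relu_net(n, rng=rng),
                          rng.choice([-1, 1], size=n), rng)
          for _ in range(50)]
print(np.mean(counts) / n)  # fraction of bits flipped before the sign changes
```

Theorem 2 predicts that this average number of flips stays proportional to n, whereas a uniformly random classifier would need about two flips regardless of n.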
The experiments confirm the scalings Θ(√(n/ln n)) and Θ(n) for the Hamming distance of the closest string with a different classification and for the average number of random flips required to change the classification, respectively. The theoretical pre-factor 1/√(2π) for the closest string with a different classification is confirmed within an extremely small error of 1.5%. The heuristic argument that the pre-factor for the random flips is greater than 1/4 is confirmed by numerics, which indicate that the pre-factor is approximately 0.33. Moreover, we explore the Hamming distance to the closest bit string with a different classification on deep neural networks trained on the MNIST database [49] of hand-written digits. The experiments show that the scaling Θ(√(n/ln n)) survives after the training of the network and that the distance of a training or test picture from the closest classification boundary is strongly correlated with its classification accuracy, i.e., the correctly classified pictures are further from the boundary than the incorrectly classified ones.

1.2 Further related works

The properties of deep neural networks with randomly initialized weights have been the subject of intensive studies [38-42, 50-52]. The relation between generalization and simplicity for Boolean functions was explored in [53], where the authors provide numerical evidence that the generalization error is correlated with a complexity measure that they define. Ref. [10] explores the generalization properties of deep neural networks trained on partially random data, and finds that the generalization error correlates with the amount of randomness in the data. Based on this result, Refs.
[28, 54] proposed that the stochastic gradient descent employed to train the network is more likely to find the simpler functions that match the training set rather than the more complex ones. However, further studies [29] suggested that stochastic gradient descent is not sufficient to justify the observed generalization. The idea of a bias towards simple patterns has been applied to learning theory through the concepts of minimum description length [55], Blumer algorithms [56, 57] and universal induction [34]. Ref. [58] proved that the generalization error grows with the Kolmogorov complexity of the target function if the learning algorithm returns the function that has the lowest Kolmogorov complexity among all the functions compatible with the training set. The relation between generalization and complexity has been further investigated in [30, 59]. The complexity of the functions generated by a deep neural network has also been studied from the perspective of the number of linear regions [60-62] and of the curvature of the classification boundaries [41]. We note that the results proved here — viz., that the functions generated by random deep networks are insensitive to large changes in their inputs — imply that such functions should be simple with respect to all the measures of complexity above, but the converse is not true: not all simple functions are likely to be generated by random deep networks.

2 Setup and Gaussian process approximation

We consider a feed-forward deep neural network with L hidden layers, activation function τ, input in R^n and output in R. The most common choice for τ is the ReLU activation function τ(x) = max(0, x). We stress that Theorems 1 and 2 do not rely on this assumption and hold for any activation function. For any x ∈ R^n and l = 2, …
, L + 1, the network is recursively defined by

φ^(1)(x) = W^(1) x + b^(1),    φ^(l)(x) = W^(l) τ(φ^(l−1)(x)) + b^(l),    (1)

where φ^(l)(x), b^(l) ∈ R^{n_l}, W^(l) is an n_l × n_{l−1} real matrix, n_0 = n and n_{L+1} = 1. We put for simplicity φ = φ^(L+1), and we define ψ(x) = sign(φ(x)) for any x ∈ R^n. The function ψ is a binary classifier on the set of the strings of n bits identified with the set {−1, 1}^n ⊂ R^n, where the classification of the string x ∈ {−1, 1}^n is ψ(x) ∈ {−1, 1}. We choose this representation of the bit strings since any x ∈ {−1, 1}^n has ‖x‖² = n, and the covariance of the Gaussian process approximating the deep neural network has a significantly simpler expression if all the inputs have the same norm. Moreover, having the inputs lying on a sphere is a common assumption in the machine learning literature [63].

We draw each entry of each W^(l) and of each b^(l) from independent Gaussian distributions with zero mean and variances σ_w²/n_{l−1} and σ_b², respectively. We employ the Gaussian process approximation of [41, 42], which consists in assuming that for any l and any x, y ∈ R^n, the joint probability distribution of φ^(l)(x) and φ^(l)(y) is Gaussian, and φ^(l)_i(x) is independent from φ^(l)_j(y) for any i ≠ j. This approximation is exact for l = 1 and holds for any l in the limit n_1, …, n_L → ∞ [39]. Indeed, φ^(l)_i(x) is the sum of b^(l)_i, which has a Gaussian distribution, with the n_{l−1} terms {W^(l)_{ij} τ(φ^(l−1)_j(x))}_{j=1,…,n_{l−1}}, which are iid from the inductive hypothesis. Therefore, if n_{l−1} ≫ 1, from the central limit theorem φ^(l)_i(x) has a Gaussian distribution. We notice that for finite width, the outputs of the intermediate layers have a sub-Weibull distribution [64]. Our experiments in section 4 show agreement with the Gaussian approximation starting from n ≳ 100.

In the Gaussian process approximation, for any x, y with ‖x‖² = ‖y‖² = n, the joint probability distribution of φ(x) and φ(y) is Gaussian with zero mean and covariance that depends on x, y and n only through x·y/n:

E(φ(x)) = 0,    E(φ(x) φ(y)) = Q F(x·y/n),    ∀ x, y : ‖x‖² = ‖y‖² = n.    (2)

Analogously, φ(x) is a Gaussian process with zero average and covariance given by the kernel K(x, y) = Q F(x·y/n). Here Q > 0 is a suitable constant and F : [−1, 1] → R is a suitable function that depends on τ, L, σ_w and σ_b, but not on n, x nor y. We have introduced the constant Q because it will be useful to have F satisfy F(1) = 1. We provide the expression of Q and F in terms of τ, L, σ_w and σ_b in the supplementary material, where we also prove that for the ReLU activation function t ≤ F(t) ≤ 1.

The correlations between outputs of the network generated by close inputs are captured by the behavior of F(t) for t → 1. If F(t) stays close to 1 as t departs from 1, then the outputs generated by close inputs are almost perfectly correlated and have the same classification with probability close to one. On the contrary, if F(t) drops quickly, the correlations decay and there is a nonzero probability that close inputs have different classifications. In the supplementary material we prove that for the ReLU activation function we have 0 < F′(1) ≤ 1 and for t → 1,

F(t) = 1 − F′(1) (1 − t) + O((1 − t)^{3/2}),    (3)

implying strong short-distance correlations.

3 Theoretical results

3.1 Closest bit string with a different classification

Our first main result is the following Theorem 1, which states that for n ≫ 1, for any given input bit string of a random deep neural network as in section 2 the average Hamming distance of the closest input bit string with a different classification is √(n/(2π F′(1) ln n)). The proof is in the supplementary material.

Theorem 1 (closest string with a different classification). For any n ∈ N, let φ : {−1, 1}^n → R be the output of a random deep neural network as in section 2. Let a > 0 and let h_n = ⌊a √(n/ln n)⌋, where ⌊t⌋ denotes the integer part of t ≥ 0. Let us fix x ∈ {−1, 1}^n and z > 0, and let N_n(a, z) be the average number of input bit strings y ∈ {−1, 1}^n with Hamming distance h_n from x and with a different classification from x, conditioned on φ(x) = √Q z:

N_n(a, z) = E( #{y ∈ {−1, 1}^n : h(x, y) = h_n, φ(y) < 0} | φ(x) = √Q z ).    (4)

Here h(x, y) is the Hamming distance between x and y and we recall that Q = E(φ(x)²). Then, for n → ∞

ln N_n(a, z) = (a/2) √(n ln n) ( 1 − z²/(4 F′(1) a²) + ln(ln n / a²)/ln n + O(1/⁴√(n ln n)) ).    (5)

In particular,

lim_{n→∞} N_n(a, z) = 0 for a < z/(2 √F′(1)),    lim_{n→∞} N_n(a, z) = ∞ for a ≥ z/(2 √F′(1)).    (6)

Theorem 1 tells us that, if n ≫ 1, for any input bit string x ∈ {−1, 1}^n, with very high probability all the input bit strings y ∈ {−1, 1}^n with Hamming distance from x lower than

h*_n(x) = |φ(x)|/(2 √(Q F′(1))) · √(n/ln n)    (7)

have the same classification as x, i.e., φ(y) has the same sign as φ(x). Moreover, the number of input bit strings y with Hamming distance from x higher than h*_n(x) and with a different classification than x is exponentially large in n. Therefore, with very high probability the Hamming distance from x of the closest bit string with a different classification is approximately h*_n(x). Since E(|φ(x)|) = √(2Q/π), the average Hamming distance of the closest string with a different classification is

E(h*_n(x)) = √(n/(2π F′(1) ln n)) ≥ √(n/(2π ln n)),    (8)

where the last inequality holds for the ReLU activation function and follows since in this case F′(1) ≤ 1.

Remark 1. While Theorem 1 holds for any activation function, the property F′(1) ≤ 1 may not hold for activation functions different from the ReLU. For example, in the case of tanh there are values of σ_w and σ_b such that F′(1) grows exponentially with L [41]. In this case, the Hamming distance of the closest string with a different classification still scales as √(n/ln n), but the prefactor can be exponentially small in L. Therefore with the tanh activation function, for finite values of L and n, F′(1) may become comparable with √(n/ln n) and significantly affect the Hamming distance.

3.2 Random bit flips

Let us now consider the average number of bits that need to be flipped in order to change the classification of a given bit string. We consider a random sequence of input bit strings {x^(0), …, x^(n)} ⊂ {−1, 1}^n, where at the i-th step x^(i) is generated by flipping a random bit of x^(i−1) that has not already been flipped in the previous steps. Any sequence as above is geodesic, i.e., h(x^(i), x^(j)) = |i − j| for any i, j = 0, …, n. The following Theorem 2 states that the average Hamming distance from x^(0) of the closest string of the sequence with a different classification is proportional to n. The proof is in the supplementary material.

Theorem 2 (random bit flips). For any n ∈ N, let φ : {−1, 1}^n → R be the output of a random deep neural network as in section 2, and let {x^(0), …, x^(n)} ⊂ {−1, 1}^n be a geodesic sequence of bit strings. Let h_n be the expected value of the minimum number of steps required to reach a bit string with a different classification from x^(0):

h_n = E( min{ min{ 1 ≤ i ≤ n : φ(x^(0)) φ(x^(i)) < 0 }, n } ).    (9)

Then, there exists a constant t_0 > 0 which depends only on F such that h_n ≥ n t_0 for any n ∈ N.

Remark 2. Since the entry of the kernel (2) associated to two inputs lying on the sphere is a function of their squared Euclidean distance, which coincides with the Hamming distance in the case of bit strings, Theorems 1 and 2 may be generalized to continuous inputs on the sphere by replacing the Hamming distance with the squared Euclidean distance.

3.3 Heuristic argument

For a better understanding of Theorems 1 and 2, we provide a simple heuristic argument for their validity. The crucial observation is that, if one bit of the input is flipped, the change in φ is Θ(1/√n). Indeed, let x, y ∈ {−1, 1}^n with h(x, y) = 1. From (2), φ(y) − φ(x) is a Gaussian random variable with zero average and variance

E((φ(y) − φ(x))²) = 2Q (1 − F(1 − 2/n)) ≅ 4Q F′(1)/n.    (10)

Figure 1: (a) Average Hamming distance to the nearest differently classified input string versus the number of input neurons for the neural network. The Hamming distance to the nearest differently classified string scales as √(n/(2π ln n)) with respect to the number of input neurons. Left: the results of the simulations clearly show the importance of the ln n term in the scaling. Right: the empirically calculated value 0.405 for the pre-factor a is close to the theoretically predicted value of 1/√(2π). Each data point is the average of 1000 different calculations of the Hamming distance for randomly sampled bit strings. Each calculation was performed on a randomly generated neural network. Further technical details for the design of the neural networks are given in subsection 4.4. (b) The linear relationship between |φ(x)| and h*_n(x) is consistent across neural networks of different sizes. To calculate the average distance at values of |φ(x)| within an interval, data was averaged across equally spaced bins of 0.25 for values of |φ(x)|. Averages for each bin are plotted at the midpoint of the bin. Points are only shown if there are at least 10 samples within the bin.

For any i, at the i-th step of the sequence of bit strings of subsection 3.2, φ changes by the Gaussian random variable φ(x^(i)) − φ(x^(i+1)), which from (10) has zero mean and variance 4Q F′(1)/n. Assuming that the changes are independent, after h steps φ changes by a Gaussian random variable with zero mean and variance 4h Q F′(1)/n. Recalling that E(φ(x^(0))²) = Q and that F′(1) ≤ 1 for the ReLU activation function, approximately h ≈ n/(4F′(1)) ≥ n/4 steps are needed in order to flip the sign of φ and hence the classification.

Let us now consider the problem of the closest bit string with a different classification from a given bit string x. For any bit string y at Hamming distance one from x, φ(y) − φ(x) is a Gaussian random variable with zero mean and variance 4Q F′(1)/n. We assume that these random variables are independent, and recall that the minimum among n iid normal Gaussian random variables scales as −√(2 ln n) [65]. There are n bit strings y at Hamming distance one from x, therefore the minimum over these y of φ(y) − φ(x) is approximately −√(8Q F′(1) ln n/n). This is the maximum amount by which we can decrease φ by flipping one bit of the input. Iterating the procedure, the maximum amount by which we can decrease φ by flipping h bits is h √(8Q F′(1) ln n/n). Since E(φ(x^(0))²) = Q, the minimum number of bit flips required to flip the sign of φ is approximately h ≈ √(n/(8F′(1) ln n)) ≥ √(n/(8 ln n)), where the last inequality holds for the ReLU activation function. The pre-factor 1/√8 ≅ 0.354 obtained with the heuristic proof above is very close to the exact pre-factor 1/√(2π) ≅ 0.399 obtained with the formal proof in (8).

4 Experiments

4.1 Closest bit string with a different classification

To confirm experimentally the findings of Theorem 1, Hamming distances to the closest bit string with a different classification were calculated for randomly generated neural networks with parameters sampled from normal distributions (see subsection 4.4). This distance was calculated using a greedy search algorithm (Figure 1a). In this algorithm, the search for a differently classified bit string progressed in steps, where in each step, the most significant bit was flipped.
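A minimal sketch of such a greedy search (illustrative only: the paper's subsection 4.4 specifies the actual experimental configuration, while the width and variances below are placeholder choices) is:

```python
import numpy as np

def random_relu_net(n, width=256, rng=None):
    """Random ReLU network with two hidden layers and N(0, 1/fan_in) weights."""
    rng = np.random.default_rng(rng)
    sizes = [n, width, width, 1]
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(k, m)),
             rng.normal(0.0, 1.0, size=k))
            for m, k in zip(sizes[:-1], sizes[1:])]

def output(net, x):
    h = np.asarray(x, dtype=float)
    for i, (W, b) in enumerate(net):
        h = W @ h + b
        if i < len(net) - 1:
            h = np.maximum(h, 0.0)
    return h[0]

def greedy_boundary_distance(net, x):
    """Greedy estimate of the Hamming distance to the classification boundary.

    At every step, flip the single bit that moves the output phi furthest
    towards (and past) zero; stop as soon as the sign of phi changes.
    """
    y = np.array(x, dtype=float)
    s0 = np.sign(output(net, y))
    for step in range(1, len(y) + 1):
        vals = np.empty(len(y))
        for i in range(len(y)):           # evaluate phi for each single-bit flip
            y[i] = -y[i]
            vals[i] = output(net, y)
            y[i] = -y[i]
        best = int(np.argmin(s0 * vals))  # the "most significant" bit this step
        y[best] = -y[best]
        if np.sign(output(net, y)) != s0:
            return step
    return len(y)

rng = np.random.default_rng(2)
n = 100
net = random_relu_net(n, rng=rng)
x = rng.choice([-1, 1], size=n)
print(greedy_boundary_distance(net, x))
```

Since the greedy path need not be optimal, the returned value is an upper bound on the true distance; as described next, the paper checks the greedy results against an exhaustive search on smaller networks.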
At each step, the flipped bit was chosen as the one that produced the largest change towards zero in the value of the output neuron when flipped. To ensure that this algorithm accurately calculated Hamming distances, we compared the results of the greedy search algorithm to those from an exact search which exhaustively examined all bit strings at specified Hamming distances, for smaller networks where this exact search method was computationally feasible. Comparisons between the two algorithms in Table 1 of the supplementary material show that outcomes from the greedy search algorithm were consistent with those from the exact search algorithm. The results from the greedy search method confirm the √(n/ln n) scaling of the average Hamming distance starting from n ≳ 100. The value of the pre-factor 1/√(2π) is also confirmed with the high precision of 1.5%. Figure 1b empirically validates the linear relationship between the value of the output neuron |φ(x)| and the Hamming distance to bit strings with a different classification h*_n(x) expressed by (7).
This linear relationship was consistent with all neural networks empirically tested in our analysis. Intuitively, |φ(x)| is an indication of the confidence in the classification. The linear relationship shown here implies that as the value of |φ(x)| grows, the confidence of the classification of an input strengthens, increasing the distance from that input to the boundaries of different classifications.

4.2 Random bit flips

Figure 2 confirms the findings of Theorem 2, namely that the expected number of random bit flips required to reach a bit string with a different classification scales linearly with the number of input neurons. The pre-factor found by simulation is 0.33, slightly above the lower bound of 0.25 estimated from the heuristic argument. Our results show that, though the Hamming distance to the nearest classification boundary scales on average at a rate of √(n/ln n), the distance to a random boundary scales linearly and more rapidly.

Figure 2: The average number of random bit flips required to reach a bit string with a different classification scales linearly with the number of input neurons. Each point is averaged across a sample of 1000 neural networks, where the Hamming distances to differently classified bit strings for each network are tested at a single random input bit string.

4.3 Analysis of MNIST data

Our theoretical results hold for random, untrained deep neural networks. It is an interesting question whether trained deep neural networks exhibit similar properties for the Hamming distances to classification boundaries. Clearly some trained networks will not: a network that has been trained to return as output the final bit of the input string has Hamming distance one to the nearest classification boundary.
For networks that are trained to classify noisy data, however, we expect the trained networks to exhibit relatively large Hamming distances to the nearest classification boundary. Moreover, if a 'typical' network can perform the noisy classification task, then we expect training to guide the weights to a nearby typical network that does the job, for the simple reason that networks that exhibit a Θ(√(n/ln n)) distance to the nearest boundary and an average distance of Θ(n) to a boundary under random bit flips have much higher prior probabilities than atypical networks.

To determine whether our results hold for models trained on real-world data, we trained 2-layer fully-connected neural networks to classify whether hand-drawn digits taken from the MNIST database [66] are even or odd. Images of hand-drawn digits were converted from their 2-dimensional format (28 by 28 pixels) into a 1-dimensional vector of 784 binary inputs: the original 8-bit pixel values were converted to binary format by thresholding each pixel value at 25. All networks followed the design described in subsection 4.4; 400 networks were trained for 20 epochs using the Adam optimizer [67], achieving an average test set accuracy of 98.8%.

Figure 3: (a) Average Hamming distance to the nearest differently classified input bit string for MNIST-trained models, calculated using the greedy search method. The average distance calculated for random bits is close to the expected value of approximately 4.33. Further technical details for the design of the neural networks are given in subsection 4.4.
(b) The linear relationship between |φ(x)| and h∗n(x) is consistent for networks trained on MNIST data. To calculate the average distance at values of |φ(x)| within an interval, data were averaged across equally spaced bins of width 2.5 in |φ(x)|. Averages for each bin are plotted at the midpoint of the bin. Points are shown only if there are at least 25 samples within the bin.

For these trained networks, Hamming distances to the nearest bit string with a different classification were calculated using the greedy search method outlined in subsection 4.1. These Hamming distances were evaluated for three types of bit strings: bit strings taken from the training set, bit strings taken from the test set, and randomly sampled bit strings where each bit has equal probability of being 0 or 1. For the randomly sampled bit strings, the average minimum Hamming distance to a differently classified bit string is very close to the expected theoretical value of √(n/(2π ln n)) (Figure 3a).
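This theoretical expectation can be checked directly for the 784-bit MNIST inputs by evaluating √(n/(2π ln n)):

```python
import math

def expected_min_distance(n):
    """Average Hamming distance to the nearest differently classified
    bit string for n-bit inputs: sqrt(n / (2*pi*ln n))."""
    return math.sqrt(n / (2.0 * math.pi * math.log(n)))

print(round(expected_min_distance(784), 2))  # 28*28 = 784 binary inputs -> 4.33
```

This reproduces the value of approximately 4.33 quoted in Figure 3a.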
By contrast, for bit strings taken from the test and training sets, the minimum Hamming distances to a classification boundary were on average much higher than those for random bits, as should be expected: training increases the distance from the data points to the boundary of their respective classification regions, making the network more robust to errors when classifying real-world data than when classifying random bit strings.
Furthermore, even for trained networks, a linear relationship is still observed between the absolute value of the output neuron (prior to normalization by a sigmoid activation) and the average Hamming distance to the nearest differently classified bit string (Figure 3b). Here, the slope of the linear relationship is larger for test and training set data, consistent with the expectation that training should extend the Hamming distance to classification boundaries for patterns of data found in the training set.
Finally, we have explored the correlation between the distance of a training or test picture from the closest classification boundary and its classification accuracy. Figure 4 shows that incorrectly classified pictures tend to be significantly closer to the classification boundary than correctly classified ones: the average distances are 1.42 and 10.61, respectively, for the training set, and 2.30 and 10.47, respectively, for the test set. Therefore, our results show that the distance to the closest classification boundary is empirically correlated with the classification accuracy and with the generalization properties of the deep neural network.

4.4 Experimental apparatus and structure of neural networks

Weights for all neural networks are initialized according to a normal distribution with zero mean and variance equal to 2/n_in, where n_in is the number of input units in the weight tensor. No bias term is included in the neural networks.
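This initialization (a He-style scheme with σ_w² = 2 and no bias terms) can be sketched in a few lines; `init_weights` is a hypothetical helper name, not the authors' code:

```python
import numpy as np

def init_weights(n_in, n_out, rng):
    """Zero-mean normal weights with variance 2/n_in (sigma_w^2 = 2);
    the networks in this section use no bias terms."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
W = init_weights(784, 784, rng)
print(abs(W.var() - 2.0 / 784) < 1e-4)  # empirical variance close to 2/784
```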
All networks consist of two fully connected hidden layers, each with n neurons (equal to the number of input neurons) and with activation function set to the commonly used Rectified Linear Unit (ReLU). All networks contain a single output neuron with no activation function. In the notation of section 2, this choice corresponds to σ_w² = 2, σ_b² = 0, n0 = n1 = n2 = n and n3 = 1, and implies F′(1) = 1. Simulations were run using the Python package Keras with a TensorFlow backend [68].

Figure 4: Histogram counting instances of correctly and incorrectly classified MNIST pictures shows that trained neural networks are far more likely to misclassify points closer to a classification boundary, for both the training and test sets. Results are aggregated across 20 different neural networks trained to classify whether digits are even or odd. Networks are trained for 10 epochs using the Adam optimizer.

5 Conclusions

We have proved that the binary classifiers of strings of n ≫ 1 bits generated by wide random deep neural networks with ReLU activation function are simple. The simplicity is captured by the following two properties. First, for any given input bit string, the average Hamming distance of the closest input bit string with a different classification is at least √(n/(2π ln n)). Second, if the bits of the original string are randomly flipped, the average number of bit flips needed to change the classification is at least n/4.
For activation functions other than the ReLU, both scalings remain the same, but the prefactor can change and can be exponentially small in the number of hidden layers.
The striking consequence of our result is that the binary classifiers of strings of n ≫ 1 bits generated by a random deep neural network lie with very high probability in a subset which is an exponentially small fraction of all the possible binary classifiers. Indeed, for a uniformly random binary classifier, the average Hamming distance of the closest input bit string with a different classification is one, and the average number of bit flips required to change the classification is two. Our result constitutes a fundamental step forward in the characterization of the probability distribution of the functions generated by random deep neural networks, which is employed as the prior distribution in the PAC-Bayesian generalization bounds. Therefore, our result can contribute to the understanding of the generalization properties of deep learning algorithms.
Our analysis of the MNIST data suggests that, for certain types of problems, the property that many bits need to be flipped in order to change the classification survives after training the network. Both our theoretical results and our experiments are completely consistent with the empirical findings in the context of adversarial perturbations [69–74], where the existence of inputs that are close to a correctly classified input but have the wrong classification is explored.
As expected, our results show that as the size of the input grows, the average number of bits that need to be flipped to change the classification increases in absolute terms but decreases as a percentage of the total number of bits. An extension of our theoretical results to trained deep neural networks would provide a fundamental robustness result for deep neural networks with respect to adversarial perturbations, and will be the subject of future work.
Moreover, our experiments on MNIST show that the distance of a picture to the closest classification boundary is correlated with its classification accuracy and thus with the generalization properties of deep neural networks, and confirm that exploring the properties of this distance is a promising route towards proving the unreasonably good generalization properties of deep neural networks.
Finally, the simplicity bias proven in this paper might shed new light on the unexpected empirical property of deep learning algorithms that the optimization over the network parameters does not suffer from bad local minima, despite the huge number of parameters and the non-convexity of the function to be optimized [75–79].

Acknowledgements

GdP thanks the Research Laboratory of Electronics of the Massachusetts Institute of Technology for the kind hospitality in Cambridge, and Dario Trevisan for useful discussions.
GdP acknowledges financial support from the European Research Council (ERC Grant Agreements Nos. 337603 and 321029), the Danish Council for Independent Research (Sapere Aude), VILLUM FONDEN via the QMATH Centre of Excellence (Grant No. 10059), and AFOSR and ARO under the Blue Sky program.
SL and BTK were supported by IARPA, NSF, BMW under the MIT Energy Initiative, and ARO under the Blue Sky program.

References

[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[3] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[4] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.

[5] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Adaptive computation and machine learning. MIT Press, 2016.

[6] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

[7] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.

[8] Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.

[9] Roman Novak, Yasaman Bahri, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. arXiv preprint arXiv:1802.08760, 2018.

[10] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

[11] V.N. Vapnik. The Nature of Statistical Learning Theory.
Springer New York, 2013.

[12] Eric B Baum and David Haussler. What size net gives valid generalization? In Advances in Neural Information Processing Systems, pages 81–90, 1989.

[13] Peter L Bartlett, Nick Harvey, Chris Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv preprint arXiv:1703.02930, 2017.

[14] Shizhao Sun, Wei Chen, Liwei Wang, Xiaoguang Liu, and Tie-Yan Liu. On the depth of deep neural networks: A theoretical view. In AAAI, pages 2066–2072, 2016.

[15] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.

[16] Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.

[17] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.

[18] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.

[19] Gintare Karolina Dziugaite and Daniel M Roy. Data-dependent PAC-Bayes priors via differential privacy. arXiv preprint arXiv:1802.09583, 2018.

[20] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.

[21] Ari S Morcos, David GT Barrett, Neil C Rabinowitz, and Matthew Botvinick. On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959, 2018.

[22] David A McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.

[23] O.
Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics lecture notes-monograph series. Institute of Mathematical Statistics, 2007.

[24] Guy Lever, François Laviolette, and John Shawe-Taylor. Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science, 473:4–28, 2013.

[25] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.

[26] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. A Wiley-Interscience publication. Wiley, 2012.

[27] Guillermo Valle-Perez, Chico Q. Camargo, and Ard A. Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions. In International Conference on Learning Representations, 2019.

[28] Devansh Arpit, Stanisław Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In International Conference on Machine Learning, pages 233–242, 2017.

[29] Lei Wu, Zhanxing Zhu, and E Weinan. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.

[30] Jürgen Schmidhuber. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857–873, 1997.

[31] Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, 2017.

[32] Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Transactions on Information Theory, 22(1):75–81, 1976.

[33] Andrei N Kolmogorov. On tables of random numbers.
Theoretical Computer Science, 207(2):387–395, 1998.

[34] M. Li and P. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications. Monographs in Computer Science. Springer New York, 2013.

[35] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.

[36] Radford M Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29–53. Springer, 1996.

[37] Christopher KI Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pages 295–301, 1997.

[38] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.

[39] Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.

[40] Adrià Garriga-Alonso, Laurence Aitchison, and Carl Edward Rasmussen. Deep convolutional networks as shallow Gaussian processes. arXiv preprint arXiv:1808.05587, 2018.

[41] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pages 3360–3368, 2016.

[42] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. arXiv preprint arXiv:1611.01232, 2016.

[43] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. arXiv preprint arXiv:1806.05393, 2018.

[44] Arthur Jacot, Franck Gabriel, and Clément Hongler.
Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.

[45] Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. In International Conference on Learning Representations, 2019.

[46] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.

[47] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.

[48] Daniel W. Stroock and S. R. Srinivasa Varadhan. Multidimensional Diffusion Processes. Classics in Mathematics. Springer Berlin Heidelberg, 2007.

[49] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[50] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336, 2016.

[51] Raja Giryes, Guillermo Sapiro, and Alexander M Bronstein. Deep neural networks with random Gaussian weights: A universal classification strategy? IEEE Trans. Signal Processing, 64(13):3444–3457, 2016.

[52] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics, pages 1924–1932, 2018.

[53] Leonardo Franco. Generalization ability of Boolean functions implemented in feedforward neural networks.
Neurocomputing, 70(1-3):351–361, 2006.

[54] Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.

[55] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.

[56] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Occam's razor. Information Processing Letters, 24(6):377–380, 1987.

[57] David H Wolpert. The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework. In The Mathematics of Generalization, pages 117–214. CRC Press, 2018.

[58] Tor Lattimore and Marcus Hutter. No free lunch versus Occam's razor in supervised learning. In Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence, pages 223–235. Springer, 2013.

[59] Kamaludin Dingle, Chico Q Camargo, and Ard A Louis. Input–output maps are strongly biased towards simple outputs. Nature Communications, 9(1):761, 2018.

[60] Razvan Pascanu, Guido Montufar, and Yoshua Bengio. On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098, 2013.

[61] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.

[62] Peter Hinz and Sara van de Geer. A framework for the construction of upper bounds on the number of affine linear regions of ReLU feed-forward neural networks. arXiv preprint arXiv:1806.01918, 2018.

[63] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity.
In Advances in Neural Information Processing Systems, pages 2253–2261, 2016.

[64] Mariia Vladimirova, Jakob Verbeek, Pablo Mesejo, and Julyan Arbel. Understanding priors in Bayesian neural networks at the unit level. In International Conference on Machine Learning, pages 6458–6467, 2019.

[65] Anton Bovier. Extreme values of random processes. Lecture Notes, Technische Universität Berlin, 2005.

[66] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist/, 1998.

[67] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[68] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[69] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[70] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[71] Jonathan Peck, Joris Roels, Bart Goossens, and Yvan Saeys. Lower bounds on the robustness to adversarial perturbations.
In Advances in Neural Information Processing Systems, pages 804–813, 2017.

[72] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[73] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. stat, 1050:11, 2018.

[74] Preetum Nakkiran. Adversarial robustness may be at odds with simplicity. arXiv preprint arXiv:1901.00532, 2019.

[75] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.

[76] Anna Choromanska, Yann LeCun, and Gérard Ben Arous. Open problem: The landscape of the loss surfaces of multilayer networks. In Conference on Learning Theory, pages 1756–1760, 2015.

[77] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.

[78] Marco Baity-Jesi, Levent Sagun, Mario Geiger, Stefano Spigler, G Ben Arous, Chiara Cammarota, Yann LeCun, Matthieu Wyart, and Giulio Biroli. Comparing dynamics: Deep neural networks versus glassy systems. arXiv preprint arXiv:1803.06969, 2018.

[79] Dhagash Mehta, Xiaojun Zhao, Edgar A Bernal, and David J Wales. Loss surface of XOR artificial neural networks. Physical Review E, 97(5):052307, 2018.