{"title": "Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity", "book": "Advances in Neural Information Processing Systems", "page_first": 2253, "page_last": 2261, "abstract": "We develop a general duality between neural networks and compositional kernel Hilbert spaces. We introduce the notion of a computation skeleton, an acyclic graph that succinctly describes both a family of neural networks and a kernel space. Random neural networks are generated from a skeleton through node replication followed by sampling from a normal distribution to assign weights. The kernel space consists of functions that arise by compositions, averaging, and non-linear transformations governed by the skeleton's graph topology and activation functions. We prove that random networks induce representations which approximate the kernel space. In particular, it follows that random weight initialization often yields a favorable starting point for optimization despite the worst-case intractability of training neural networks.", "full_text": "Toward Deeper Understanding of Neural Networks: The Power\n\nof Initialization and a Dual View on Expressivity\n\nAmit Daniely\nGoogle Brain\n\nRoy Frostig\u2217\nGoogle Brain\n\nYoram Singer\nGoogle Brain\n\nAbstract\n\nWe develop a general duality between neural networks and compositional kernel\nHilbert spaces. We introduce the notion of a computation skeleton, an acyclic\ngraph that succinctly describes both a family of neural networks and a kernel space.\nRandom neural networks are generated from a skeleton through node replication\nfollowed by sampling from a normal distribution to assign weights. The kernel\nspace consists of functions that arise by compositions, averaging, and non-linear\ntransformations governed by the skeleton\u2019s graph topology and activation functions.\nWe prove that random networks induce representations which approximate the\nkernel space. 
In particular, it follows that random weight initialization often yields\na favorable starting point for optimization despite the worst-case intractability of\ntraining neural networks.\n\n1 Introduction\n\nNeural network (NN) learning has underpinned state-of-the-art empirical results in numerous applied\nmachine learning tasks; see, for instance, [25, 26]. Nonetheless, theoretical analyses of neural network\nlearning are still lacking in several regards. Notably, it remains unclear why training algorithms find\ngood weights and how learning is impacted by network architecture and its activation functions.\nThis work analyzes the representation power of neural networks within the vicinity of random\ninitialization. We show that for regimes of practical interest, randomly initialized neural networks\nwell-approximate a rich family of hypotheses. Thus, despite worst-case intractability of training\nneural networks, commonly used initialization procedures constitute a favorable starting point for\ntraining.\nConcretely, we define a computation skeleton that is a succinct description of feed-forward networks.\nA skeleton induces a family of network architectures as well as a hypothesis class H of functions\nobtained by non-linear compositions mandated by the skeleton\u2019s structure. We then analyze the set of\nfunctions that can be expressed by varying the weights of the last layer, a simple region of the training\ndomain over which the objective is convex. We show that with high probability over the choice of\ninitial network weights, any function in H can be approximated by selecting the final layer\u2019s weights.\nBefore delving into technical detail, we position our results in the context of previous research.\n\nCurrent theoretical understanding of NN learning. 
Standard results from complexity theory [22]\nimply that all efficiently computable functions can be expressed by a network of moderate size.\nBarron\u2019s theorem [7] states that even two-layer networks can express a very rich set of functions. The\ngeneralization ability of algorithms for training neural networks is also fairly well studied. Indeed,\nboth classical [3, 9, 10] and more recent [18, 33] results from statistical learning theory show that, as\nthe number of examples grows in comparison to the size of the network, the empirical risk approaches\nthe population risk. In contrast, it remains puzzling why and when efficient algorithms, such as\nstochastic gradient methods, yield solutions that perform well. While learning algorithms succeed in\npractice, theoretical analyses are overly pessimistic. For example, hardness results suggest that, in\nthe worst case, even very simple 2-layer networks are intractable to learn. Concretely, it is hard to\nconstruct a hypothesis which predicts marginally better than random [15, 23, 24].\nIn the meantime, recent empirical successes of neural networks prompted a surge of theoretical\nresults on NN learning. For instance, we refer the reader to [1, 4, 12, 14, 16, 28, 32, 38, 42] and the\nreferences therein.\n\n\u2217Most of this work was performed while the author was at Stanford University.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nCompositional kernels and connections to networks. The idea of composing kernels has repeatedly\nappeared in the machine learning literature. See, for instance, the early work by Grauman and\nDarrell [17] and Sch\u00f6lkopf et al. [41]. Inspired by deep networks\u2019 success, researchers considered deep\ncomposition of kernels [11, 13, 29]. 
For fully connected two-layer networks, the correspondence\nbetween kernels and neural networks with random weights has been examined in [31, 36, 37, 45].\nNotably, Rahimi and Recht [37] proved a formal connection, in a similar sense to ours, for the\nRBF kernel. Their work was extended to include polynomial kernels [21, 35] as well as other\nkernels [5, 6]. Several authors have further explored ways to extend this line of research to deeper\nnetworks, either fully-connected [13] or convolutional [2, 20, 29].\n\nThis work establishes a common foundation for the above research and expands the ideas therein. We\nextend the scope from fully-connected and convolutional networks to a broad family of architectures.\nIn addition, we prove approximation guarantees between a network and its corresponding kernel in\nour general setting. We thus generalize previous analyses which are only applicable to fully connected\ntwo-layer networks.\n\n2 Setting\n\nNotation. We denote vectors by bold-face letters (e.g. x), and matrices by upper case Greek letters\n(e.g. \u03a3). The 2-norm of x \u2208 R^d is denoted by \u2016x\u2016. For functions \u03c3 : R \u2192 R we let\n\n\u2016\u03c3\u2016 := \u221a(E_{X\u223cN(0,1)} \u03c3^2(X)) = \u221a((1/\u221a(2\u03c0)) \u222b_{\u2212\u221e}^{\u221e} \u03c3^2(x) e^{\u2212x^2/2} dx) .\n\nLet G = (V, E) be a directed acyclic graph. The set of neighbors incoming to a vertex v is denoted\n\nin(v) := {u \u2208 V | uv \u2208 E} .\n\nThe (d \u2212 1)-dimensional sphere is denoted S^{d\u22121} = {x \u2208 R^d | \u2016x\u2016 = 1}. We provide a brief overview\nof reproducing kernel Hilbert spaces in the sequel and merely introduce notation here. In a Hilbert\nspace H, we use a slightly non-standard notation H^B for the ball of radius B, {x \u2208 H | \u2016x\u2016_H \u2264 B}.\nWe use [x]+ to denote max(x, 0) and 1[b] to denote the indicator function of a binary variable b.\n\nInput space. 
Throughout the paper we assume that each example is a sequence of n elements,\neach of which is represented as a unit vector. Namely, we fix n and take the input space to be\n\nX = X_{n,d} = (S^{d\u22121})^n. Each input example is denoted\n\nx = (x^1, . . . , x^n), where x^i \u2208 S^{d\u22121} .   (1)\n\nWe refer to each vector x^i as the input\u2019s ith coordinate, and use x^i_j to denote its jth scalar entry.\nThough this notation is slightly non-standard, it unifies input types seen in various domains. For\nexample, binary features can be encoded by taking d = 1, in which case X = {\u00b11}^n. Meanwhile,\nimages and audio signals are often represented as bounded and continuous numerical values; we can\nassume in full generality that these values lie in [\u22121, 1]. To match the setup above, we embed [\u22121, 1]\ninto the circle S^1, e.g. through the map\n\nx \u21a6 (sin(\u03c0x/2), cos(\u03c0x/2)) .\n\nWhen each coordinate is categorical, taking one of d values, one can represent the category j \u2208 [d]\nby the unit vector e_j \u2208 S^{d\u22121}. When d is very large or the basic units exhibit some structure (such as\nwhen the input is a sequence of words), a more concise encoding may be useful, e.g. using unit vectors\nin a low dimension space S^{d\u2032} where d\u2032 \u226a d (see for instance Levy and Goldberg [27], Mikolov et al.\n[30]).\n\nSupervised learning. The goal in supervised learning is to devise a mapping from the input space\nX to an output space Y based on a sample S = {(x_1, y_1), . . . , (x_m, y_m)}, where (x_i, y_i) \u2208 X \u00d7 Y,\ndrawn i.i.d. from a distribution D over X \u00d7 Y. A supervised learning problem is further specified\nby an output length k and a loss function \u2113 : R^k \u00d7 Y \u2192 [0,\u221e), and the goal is to find a predictor\nh : X \u2192 R^k whose loss,\n\nL_D(h) := E_{(x,y)\u223cD} \u2113(h(x), y) ,\n\nis small. The empirical loss\n\nL_S(h) := (1/m) \u2211_{i=1}^{m} \u2113(h(x_i), y_i)\n\nis commonly used as a proxy for the loss L_D. Regression problems correspond to Y = R and, for\ninstance, the squared loss \u2113(\u0177, y) = (\u0177 \u2212 y)^2. Binary classification is captured by Y = {\u00b11} and,\nsay, the zero-one loss \u2113(\u0177, y) = 1[\u0177y \u2264 0] or the hinge loss \u2113(\u0177, y) = [1 \u2212 \u0177y]+, with standard\nextensions to the multiclass case. A loss \u2113 is L-Lipschitz if |\u2113(y_1, y) \u2212 \u2113(y_2, y)| \u2264 L|y_1 \u2212 y_2| for all\ny_1, y_2 \u2208 R^k, y \u2208 Y, and it is convex if \u2113(\u00b7, y) is convex for every y \u2208 Y.\n\nNeural network learning. We define a neural network N to be a directed acyclic graph (DAG)\nwhose nodes are denoted V (N ) and edges E(N ). Each of its internal units, i.e. nodes with both\nincoming and outgoing edges, is associated with an activation function \u03c3_v : R \u2192 R. In this paper\u2019s\ncontext, an activation can be any function that is square integrable with respect to the Gaussian\nmeasure on R. We say that \u03c3 is normalized if \u2016\u03c3\u2016 = 1. Nodes having only incoming\nedges are called the output nodes. To match the setup of a supervised learning problem, a network N\nhas nd input nodes and k output nodes, denoted o_1, . . . , o_k. A network N together with a weight\nvector w = {w_{uv} | uv \u2208 E} defines a predictor h_{N,w} : X \u2192 R^k whose prediction is given by\n\u201cpropagating\u201d x forward through the network. Formally, we define h_{v,w}(\u00b7) to be the output of the\nsubgraph of the node v as follows: for an input node v, h_{v,w} is the identity function, and for all other\nnodes, we define h_{v,w} recursively as\n\nh_{v,w}(x) = \u03c3_v(\u2211_{u\u2208in(v)} w_{uv} h_{u,w}(x)) .\n\nFinally, we let h_{N,w}(x) = (h_{o_1,w}(x), . . . , h_{o_k,w}(x)). We also refer to internal nodes as hidden\nunits. 
The output layer of N is the sub-network consisting of all output neurons of N along with\ntheir incoming edges. The representation induced by a network N is the network rep(N ) obtained\nfrom N by removing the output layer. The representation function induced by the weights w is\nR_{N,w} := h_{rep(N),w}. Given a sample S, a learning algorithm searches for weights w having small\nempirical loss L_S(w) = (1/m) \u2211_{i=1}^{m} \u2113(h_{N,w}(x_i), y_i). A popular approach is to randomly initialize the\nweights and then use a variant of the stochastic gradient method to improve these weights in the\ndirection of lower empirical loss.\n\nKernel learning. A function \u03ba : X \u00d7 X \u2192 R is a reproducing kernel, or simply a kernel, if for\nevery x_1, . . . , x_r \u2208 X , the r \u00d7 r matrix \u0393 with entries \u0393_{i,j} = \u03ba(x_i, x_j) is positive semi-definite. Each kernel\ninduces a Hilbert space H_\u03ba of functions from X to R with a corresponding norm \u2016\u00b7\u2016_{H_\u03ba}. A kernel and\nits corresponding space are normalized if \u2200x \u2208 X , \u03ba(x, x) = 1. Given a convex loss function \u2113, a\nsample S, and a kernel \u03ba, a kernel learning algorithm finds a function f = (f_1, . . . , f_k) \u2208 H_\u03ba^k whose\nempirical loss, L_S(f) = (1/m) \u2211_i \u2113(f(x_i), y_i), is minimal among all functions with \u2211_i \u2016f_i\u2016_\u03ba^2 \u2264 R^2\nfor some R > 0. Alternatively, kernel algorithms minimize the regularized loss,\n\nL_S^R(f) = (1/m) \u2211_{i=1}^{m} \u2113(f(x_i), y_i) + (1/R^2) \u2211_{i=1}^{k} \u2016f_i\u2016_\u03ba^2 ,\n\na convex objective that often can be efficiently minimized.\n\n3 Computation skeletons\n\nIn this section we define a simple structure that we term a computation skeleton. 
The purpose of a\ncomputation skeleton is to compactly describe feed-forward computation from an input to an output.\nA single skeleton encompasses a family of neural networks that share the same skeletal structure.\nLikewise, it defines a corresponding kernel space.\n\n[Figure 1: Examples of computation skeletons (panels S1, S2, S3, S4).]\n\nDefinition. A computation skeleton S is a DAG whose non-input nodes are labeled by activations.\nThough the formal definitions of neural networks and skeletons appear identical, we make a conceptual\ndistinction between them as their roles in our analysis are rather different. Accompanied by a set of\nweights, a neural network describes a concrete function, whereas the skeleton stands for a topology\ncommon to several networks as well as for a kernel. To further underscore the differences we note\nthat skeletons are naturally more compact than networks. In particular, all examples of skeletons in\nthis paper are irreducible, meaning that for every two nodes v, u \u2208 V (S), in(v) \u2260 in(u). We further\nrestrict our attention to skeletons with a single output node, showing later that single-output skeletons\ncan capture supervised problems with outputs in R^k. We denote by |S| the number of non-input\nnodes of S.\nFigure 1 shows four example skeletons, omitting the designation of the activation functions. The\nskeleton S1 is rather basic as it aggregates all the inputs in a single step. Such topology can be\nuseful in the absence of any prior knowledge of how the output label may be computed from an input\nexample, and it is commonly used in natural language processing where the input is represented as a\nbag-of-words [19]. The only structure in S1 is a single fully connected layer:\nTerminology (Fully connected layer of a skeleton). An induced subgraph of a skeleton with r + 1\nnodes, u1, . . . , ur, v, is called a fully connected layer if its edges are u1v, . . . 
, urv.\nThe skeleton S2 is slightly more involved: it first processes consecutive (overlapping) parts of the\ninput, and the next layer aggregates the partial results. Altogether, it corresponds to networks with a\nsingle one-dimensional convolutional layer, followed by a fully connected layer. The two-dimensional\n(and deeper) counterparts of such skeletons correspond to networks that are common in visual object\nrecognition.\nTerminology (Convolution layer of a skeleton). Let s, w, q be positive integers and denote n =\ns(q \u2212 1) + w. A subgraph of a skeleton is a one-dimensional convolution layer of width w and stride s\nif it has n + q nodes, u_1, . . . , u_n, v_1, . . . , v_q, and qw edges, u_{s(i\u22121)+j} v_i, for 1 \u2264 i \u2264 q, 1 \u2264 j \u2264 w.\nThe skeleton S3 is a somewhat more sophisticated version of S2: the local computations are first\naggregated, then reconsidered with the aggregate, and finally aggregated again. The last skeleton,\nS4, corresponds to the networks that arise in learning sequence-to-sequence mappings as used in\ntranslation, speech recognition, and OCR tasks (see for example Sutskever et al. [44]).\n\n3.1 From computation skeletons to neural networks\nThe following definition shows how a skeleton, accompanied by a replication parameter r \u2265 1 and\na number of output nodes k, induces a neural network architecture. Recall that inputs are ordered sets\nof vectors in S^{d\u22121}.\n\n[Figure 2: A 5-fold realization of the computation skeleton S with d = 1 (panels: S and N(S, 5)).]\n\nDefinition (Realization of a skeleton). Let S be a computation skeleton and consider input coordinates\nin S^{d\u22121} as in (1). For r, k \u2265 1 we define the following neural network N = N (S, r, k). For\neach input node in S, N has d corresponding input neurons. For each internal node v \u2208 S labeled\nby an activation \u03c3, N has r neurons v_1, . . . , v_r, each with an activation \u03c3. 
In addition, N has k\noutput neurons o_1, . . . , o_k with the identity activation \u03c3(x) = x. There is an edge u_i v_j \u2208 E(N )\nwhenever uv \u2208 E(S). For every output node v in S, each neuron v_j is connected to all output\nneurons o_1, . . . , o_k. We term N the (r, k)-fold realization of S. We also define the r-fold realization\nof S as N (S, r) = rep (N (S, r, 1)).\u00b2\nNote that the notion of the replication parameter r corresponds, in the terminology of convolutional\nnetworks, to the number of channels taken in a convolutional layer and to the number of hidden units\ntaken in a fully-connected layer.\nFigure 2 illustrates a 5-fold realization of a skeleton with coordinate dimension d = 1. The realization is a\nnetwork with a single (one-dimensional) convolutional layer having 5 channels, stride of 1, and width\nof 2, followed by two fully-connected layers. The global replication parameter r in a realization\nis used for brevity; it is straightforward to extend results when the different nodes in S are each\nreplicated to a different extent.\nWe next define a scheme for random initialization of the weights of a neural network that is similar\nto what is often done in practice. We employ the definition throughout the paper whenever we refer\nto random weights.\nDefinition (Random weights). A random initialization of a neural network N is a multivariate\nGaussian w = (w_{uv})_{uv\u2208E(N )} such that each weight w_{uv} is sampled independently from a normal\ndistribution with mean 0 and variance 1/(\u2016\u03c3_u\u2016^2 |in(v)|).\n\nArchitectures such as convolutional nets have weights that are shared across different edges. 
Again, it\nis straightforward to extend our results to these cases and for simplicity we assume no weight sharing.\n\n3.2 From computation skeletons to reproducing kernels\nIn addition to networks\u2019 architectures, a computation skeleton S also defines a normalized kernel\n\u03ba_S : X \u00d7 X \u2192 [\u22121, 1] and a corresponding norm \u2016 \u00b7 \u2016_S on functions f : X \u2192 R. This norm has\nthe property that \u2016f\u2016_S is small if and only if f can be obtained by certain simple compositions of\nfunctions according to the structure of S. To define the kernel, we introduce a dual activation and\ndual kernel. For \u03c1 \u2208 [\u22121, 1], we denote by N_\u03c1 the multivariate Gaussian distribution on R^2 with\nmean 0 and covariance matrix [[1, \u03c1], [\u03c1, 1]].\nDefinition (Dual activation and kernel). The dual activation of an activation \u03c3 is the function\n\u03c3\u0302 : [\u22121, 1] \u2192 R defined as\n\n\u03c3\u0302(\u03c1) = E_{(X,Y)\u223cN_\u03c1} \u03c3(X)\u03c3(Y ) .\n\nThe dual kernel w.r.t. a Hilbert space H is the kernel \u03ba_\u03c3 : H^1 \u00d7 H^1 \u2192 R defined as\n\n\u03ba_\u03c3(x, y) = \u03c3\u0302(\u27e8x, y\u27e9_H) .\n\n\u00b2Note that for every k, rep (N (S, r, 1)) = rep (N (S, r, k)).\n\nActivation | \u03c3(x) | Dual Activation | Kernel | Ref\nIdentity | x | \u03c1 | linear |\n2nd Hermite | (x^2 \u2212 1)/\u221a2 | \u03c1^2 | poly |\nReLU | \u221a2 [x]+ | 1/\u03c0 + \u03c1/2 + \u03c1^2/(2\u03c0) + \u03c1^4/(24\u03c0) + . . . = (\u221a(1 \u2212 \u03c1^2) + (\u03c0 \u2212 cos^{\u22121}(\u03c1)) \u03c1)/\u03c0 | arccos_1 | [13]\nStep | \u221a2 1[x \u2265 0] | 1/2 + \u03c1/\u03c0 + \u03c1^3/(6\u03c0) + 3\u03c1^5/(40\u03c0) + . . . = (\u03c0 \u2212 cos^{\u22121}(\u03c1))/\u03c0 | arccos_0 | [13]\nExponential | e^{x\u22121} | 1/e + \u03c1/e + \u03c1^2/(2e) + \u03c1^3/(6e) + . . . = e^{\u03c1\u22121} | RBF | [29]\n\nTable 1: Activation functions and their duals.\n\nWe show in the supplementary material that \u03ba_\u03c3 is indeed a kernel for every activation \u03c3 that adheres\nto the square-integrability requirement. In fact, any continuous \u03bc : [\u22121, 1] \u2192 R, such that\n(x, y) \u21a6 \u03bc(\u27e8x, y\u27e9_H) is a kernel for all H, is the dual of some activation. Note that \u03ba_\u03c3 is normalized\niff \u03c3 is normalized. We show in the supplementary material that dual activations are closely related\nto Hermite polynomial expansions, and that these can be used to calculate the duals of activation\nfunctions analytically. Table 1 lists a few examples of normalized activations and their corresponding\nduals (corresponding derivations are in the supplementary material). The following definition gives the\nkernel corresponding to a skeleton having normalized activations.\u00b3\nDefinition (Compositional kernels). Let S be a computation skeleton with normalized activations\nand (single) output node o. For every node v, inductively define a kernel \u03ba_v : X \u00d7 X \u2192 R as follows.\nFor an input node v corresponding to the ith coordinate, define \u03ba_v(x, y) = \u27e8x^i, y^i\u27e9. For a non-input\nnode v, define\n\n\u03ba_v(x, y) = \u03c3\u0302_v((1/|in(v)|) \u2211_{u\u2208in(v)} \u03ba_u(x, y)) .\n\nThe final kernel \u03ba_S is \u03ba_o, the kernel associated with the output node o. The resulting Hilbert space\nand norm are denoted H_S and \u2016 \u00b7 \u2016_S respectively, and H_v and \u2016 \u00b7 \u2016_v denote the space and norm\nwhen formed at node v.\nAs we show later, \u03ba_S is indeed a (normalized) kernel for every skeleton S. To understand the\nkernel in the context of learning, we need to examine which functions can be expressed as moderate\nnorm functions in H_S. 
As we show in the supplementary material, these are the functions obtained\nby certain simple compositions according to the feed-forward structure of S. For intuition, we\ncontrast two commonly used skeletons. For simplicity, we restrict to the case\nX = X_{n,1} = {\u00b11}^n, and omit the details of derivations.\nExample 1 (Fully connected skeletons). Let S be a skeleton consisting of l fully connected layers,\nwhere the i\u2019th layer is associated with the activation \u03c3_i. We have \u03ba_S(x, x\u2032) = \u03c3\u0302_l \u25e6 . . . \u25e6 \u03c3\u0302_1(\u27e8x, x\u2032\u27e9/n).\nFor such kernels, any moderate norm function in H_S is well approximated by a low degree polynomial.\nFor example, if \u2016f\u2016_S \u2264 n, then there is a second degree polynomial p such that \u2016f \u2212 p\u2016_2 \u2264 O(1/\u221an).\nWe next argue that convolutional skeletons define kernel spaces that are quite different from kernel\nspaces defined by fully connected skeletons. Concretely, suppose f : X \u2192 R is of the form\nf = \u2211_{i=1}^{m} f_i where each f_i depends only on q adjacent coordinates. We call f a q-local function. In\nExample 1 we stated that for fully-connected skeletons, any function in the induced space of norm\nless than n is well approximated by second degree polynomials. In contrast, the following example\nunderscores that for convolutional skeletons, we can find functions that are more complex, provided\nthat they are local.\nExample 2 (Convolutional skeletons). Let S be a skeleton consisting of a convolutional layer of\nstride 1 and width q, followed by a single fully connected layer. (The skeleton S2 from Figure 1 is a\nconcrete example with q = 2 and n = 4.) To simplify the argument, we assume that all activations\nare \u03c3(x) = e^x and q is a constant. 
For any q-local function f : X \u2192 R we have\n\n\u2016f\u2016_S \u2264 C \u00b7 \u221an \u00b7 \u2016f\u2016_2 .\n\nHere, C > 0 is a constant depending only on q. Hence, for example, any average of functions\nfrom X to [\u22121, 1], each of which depends on q adjacent coordinates, is in H_S and has norm of\nmerely O(\u221an).\n\n\u00b3For a skeleton S with unnormalized activations, the corresponding kernel is the kernel of the skeleton S\u2032\nobtained by normalizing the activations of S.\n\n4 Main results\n\nWe review our main results. Proofs can be found in the supplementary material. Let us fix a\ncomputation skeleton S. There are a few upshots to underscore upfront. First, our analysis implies\nthat a representation generated by a random initialization of N = N (S, r, k) approximates the kernel\n\u03ba_S. The sense in which the result holds is twofold. First, with the proper rescaling we show that\n\u27e8R_{N,w}(x), R_{N,w}(x\u2032)\u27e9 \u2248 \u03ba_S(x, x\u2032). Then, we also show that the functions obtained by composing\nbounded linear functions with R_{N,w} are approximately the bounded-norm functions in H_S. In other\nwords, the functions expressed by N under varying the weights of the final layer are approximately\nbounded-norm functions in H_S. For simplicity, we restrict the analysis to the case k = 1. We also\nconfine the analysis to either bounded activations, with bounded first and second derivatives, or the\nReLU activation. Extending the results to a broader family of activations is left for future work.\nThroughout this and the remaining sections we use \u2273 and \u2272 to hide universal constants.\nDefinition. An activation \u03c3 : R \u2192 R is C-bounded if it is twice continuously differentiable and\n\u2016\u03c3\u2016_\u221e, \u2016\u03c3\u2032\u2016_\u221e, \u2016\u03c3\u2033\u2016_\u221e \u2264 \u2016\u03c3\u2016C.\nNote that many activations are C-bounded for some constant C > 0. 
In particular, most of the popular\nsigmoid-like functions such as 1/(1 + e^{\u2212x}), erf(x), x/\u221a(1 + x^2), tanh(x), and tan^{\u22121}(x) satisfy the\nboundedness requirements. We next introduce terminology that parallels the representation layer\nof N with a kernel space. Concretely, let N be a network whose representation part has q output\nneurons. Given weights w, the normalized representation \u03a8_w is obtained from the representation\nR_{N,w} by dividing each output neuron v by \u2016\u03c3_v\u2016\u221aq. The empirical kernel corresponding to w is\ndefined as \u03ba_w(x, x\u2032) = \u27e8\u03a8_w(x), \u03a8_w(x\u2032)\u27e9. We also define the empirical kernel space corresponding\nto w as H_w = H_{\u03ba_w}. Concretely,\n\nH_w = {h_v(x) = \u27e8v, \u03a8_w(x)\u27e9 | v \u2208 R^q} ,\n\nand the norm of H_w is defined as \u2016h\u2016_w = inf{\u2016v\u2016 | h = h_v}. Our first result shows that the\nempirical kernel approximates the kernel \u03ba_S.\nTheorem 3. Let S be a skeleton with C-bounded activations. Let w be a random initialization of\nN = N (S, r) with\n\nr \u2265 (4C^4)^{depth(S)+1} log(8|S|/\u03b4) / \u03b5^2 .\n\nThen, for all x, x\u2032 \u2208 X , with probability of at least 1 \u2212 \u03b4,\n\n|\u03ba_w(x, x\u2032) \u2212 \u03ba_S(x, x\u2032)| \u2264 \u03b5 .\n\nWe note that if we fix the activation and assume that the depth of S is logarithmic, then the required\nbound on r is polynomial. For the ReLU activation we get a stronger bound with only quadratic\ndependence on the depth. However, it requires that \u03b5 \u2264 1/depth(S).\nTheorem 4. Let S be a skeleton with ReLU activations. 
Let w be a random initialization of N (S, r)\nwith\n\nr \u2273 depth^2(S) log(|S|/\u03b4) / \u03b5^2 .\n\nThen, for all x, x\u2032 \u2208 X and \u03b5 \u2272 1/depth(S), with probability of at least 1 \u2212 \u03b4,\n\n|\u03ba_w(x, x\u2032) \u2212 \u03ba_S(x, x\u2032)| \u2264 \u03b5 .\n\nFor the remaining theorems, we fix an L-Lipschitz loss \u2113 : R \u00d7 Y \u2192 [0,\u221e). For a distribution D on\nX \u00d7 Y we denote by \u2016D\u2016_0 the cardinality of the support of the distribution. We note that log(\u2016D\u2016_0)\nis bounded by, for instance, the number of bits used to represent an element in X \u00d7 Y. We use the\nfollowing notion of approximation.\nDefinition. Let D be a distribution on X \u00d7 Y. A space H_1 \u2282 R^X \u03b5-approximates the space H_2 \u2282 R^X\nw.r.t. D if for every h_2 \u2208 H_2 there is h_1 \u2208 H_1 such that L_D(h_1) \u2264 L_D(h_2) + \u03b5.\n\nTheorem 5. Let S be a skeleton with C-bounded activations and let R > 0. Let w be a random\ninitialization of N (S, r) with\n\nr \u2273 L^4 R^4 (4C^4)^{depth(S)+1} log(LRC|S|/(\u03b5\u03b4)) / \u03b5^4 .\n\nThen, with probability of at least 1 \u2212 \u03b4 over the choices of w we have that, for any data distribution,\nH_w^{\u221a2 R} \u03b5-approximates H_S^R and H_S^{\u221a2 R} \u03b5-approximates H_w^R.\nTheorem 6. Let S be a skeleton with ReLU activations, \u03b5 \u2272 1/depth(S), and R > 0. Let w be a\nrandom initialization of N (S, r) with\n\nr \u2273 L^4 R^4 depth^2(S) log(\u2016D\u2016_0 |S|/\u03b4) / \u03b5^4 .\n\nThen, with probability of at least 1 \u2212 \u03b4 over the choices of w we have that, for any data distribution,\nH_w^{\u221a2 R} \u03b5-approximates H_S^R and H_S^{\u221a2 R} \u03b5-approximates H_w^R.\nAs in Theorems 3 and 4, for a fixed C-bounded activation and logarithmically deep S, the required\nbounds on r are polynomial. Analogously, for the ReLU activation the bound is polynomial even\nwithout restricting the depth. However, the polynomial growth in Theorems 5 and 6 is rather large.\nImproving the bounds, or proving their optimality, is left to future work.\n\nAcknowledgments\n\nWe would like to thank Percy Liang and Ben Recht for fruitful discussions, comments, and suggestions.\n\nReferences\n[1] A. Andoni, R. Panigrahy, G. Valiant, and L. Zhang. Learning polynomials with neural networks. In\nProceedings of the 31st International Conference on Machine Learning, pages 1908\u20131916, 2014.\n\n[2] F. Anselmi, L. Rosasco, C. Tan, and T. Poggio. Deep convolutional networks are hierarchical kernel\nmachines. arXiv:1508.01084, 2015.\n\n[3] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University\nPress, 1999.\n\n[4] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. In\nProceedings of the 31st International Conference on Machine Learning, pages 584\u2013592, 2014.\n\n[5] F. Bach. Breaking the curse of dimensionality with convex neural networks. arXiv:1412.8690, 2014.\n\n[6] F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. 2015.\n\n[7] A.R. Barron. Universal approximation bounds for superposition of a sigmoidal function. IEEE Transactions\non Information Theory, 39(3):930\u2013945, 1993.\n\n[8] P.L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural\nresults. Journal of Machine Learning Research, 3:463\u2013482, 2002.\n\n[9] P.L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights\nis more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525\u2013536,\nMarch 1998.\n\n[10] E.B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1(1):151\u2013160,\n1989.\n\n[11] L. Bo, K. Lai, X. Ren, and D. Fox. Object recognition with hierarchical kernel descriptors. 
In Computer\nVision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1729\u20131736. IEEE, 2011.\n\n[12] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis\nand Machine Intelligence, 35(8):1872\u20131886, 2013.\n\n[13] Y. Cho and L.K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing\nSystems, pages 342\u2013350, 2009.\n\n[14] A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun. The loss surfaces of multilayer\nnetworks. In AISTATS, pages 192\u2013204, 2015.\n\n[15] A. Daniely and S. Shalev-Shwartz. Complexity theoretic limitations on learning DNFs. In COLT, 2016.\n\n[16] R. Giryes, G. Sapiro, and A.M. Bronstein. Deep neural networks with random Gaussian weights: A\nuniversal classification strategy? arXiv preprint arXiv:1504.08291, 2015.\n\n[17] K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image\nfeatures. In Tenth IEEE International Conference on Computer Vision, volume 2, pages 1458\u20131465, 2005.\n\n[18] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent.\narXiv:1509.01240, 2015.\n\n[19] Z.S. Harris. Distributional structure. Word, 1954.\n\n[20] T. Hazan and T. Jaakkola. Steps toward deep kernel methods from infinite neural networks.\narXiv:1508.05133, 2015.\n\n[21] P. Kar and H. Karnick. Random feature maps for dot product kernels. arXiv:1201.6530, 2012.\n\n[22] R.M. Karp and R.J. Lipton. Some connections between nonuniform and uniform complexity classes. In\nProceedings of the Twelfth Annual ACM Symposium on Theory of Computing, pages 302\u2013309. ACM, 1980.\n\n[23] M. Kearns and L.G. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata.\nIn STOC, pages 433\u2013444, May 1989.\n\n[24] A.R. Klivans and A.A. Sherstov. 
Cryptographic hardness for learning intersections of halfspaces. In FOCS, 2006.\n[25] A. Krizhevsky, I. Sutskever, and G.E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097\u20131105, 2012.\n[26] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436\u2013444, 2015.\n[27] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177\u20132185, 2014.\n[28] R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855\u2013863, 2014.\n[29] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems, pages 2627\u20132635, 2014.\n[30] T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111\u20133119, 2013.\n[31] R.M. Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.\n[32] B. Neyshabur, R.R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2413\u20132421, 2015.\n[33] B. Neyshabur, N. Srebro, and R. Tomioka. Norm-based capacity control in neural networks. In COLT, 2015.\n[34] R. O\u2019Donnell. Analysis of Boolean functions. Cambridge University Press, 2014.\n[35] J. Pennington, F. Yu, and S. Kumar. Spherical random features for polynomial kernels. In Advances in Neural Information Processing Systems, pages 1837\u20131845, 2015.\n[36] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, pages 1177\u20131184, 2007.\n[37] A. Rahimi and B. Recht.
Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pages 1313\u20131320, 2009.\n[38] I. Safran and O. Shamir. On the quality of the initial basin in overspecified neural networks. arXiv:1511.04210, 2015.\n[39] S. Saitoh. Theory of reproducing kernels and its applications. Longman Scientific & Technical England, 1988.\n[40] I.J. Schoenberg et al. Positive definite functions on spheres. Duke Mathematical Journal, 9(1):96\u2013108, 1942.\n[41] B. Sch\u00f6lkopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector kernels. In Advances in Neural Information Processing Systems 10, pages 640\u2013646. MIT Press, 1998.\n[42] H. Sedghi and A. Anandkumar. Provable methods for training neural networks with sparse connectivity. arXiv:1412.2693, 2014.\n[43] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.\n[44] I. Sutskever, O. Vinyals, and Q.V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104\u20133112, 2014.\n[45] C.K.I. Williams. Computation with infinite neural networks. pages 295\u2013301, 1997.", "award": [], "sourceid": 1162, "authors": [{"given_name": "Amit", "family_name": "Daniely", "institution": "Google Brain"}, {"given_name": "Roy", "family_name": "Frostig", "institution": "Stanford University"}, {"given_name": "Yoram", "family_name": "Singer", "institution": "Google"}]}