{"title": "On Exact Computation with an Infinitely Wide Neural Net", "book": "Advances in Neural Information Processing Systems", "page_first": 8141, "page_last": 8150, "abstract": "How well does a classic deep net architecture like AlexNet or VGG19 classify on a standard dataset such as CIFAR-10 when its \u201cwidth\u201d\u2014 namely, number of channels in convolutional layers, and number of nodes in fully-connected internal layers \u2014 is allowed to increase to infinity? Such questions have come to the forefront in the quest to theoretically understand deep learning and its mysteries about optimization and generalization. They also connect deep learning to notions such as Gaussian processes and kernels. A recent paper [Jacot et al., 2018] introduced the Neural Tangent Kernel (NTK) which captures the behavior of fully-connected deep nets in the infinite width limit trained by gradient descent; this object was implicit in some other recent papers. An attraction of such ideas is that a pure kernel-based method is used to capture the power of a fully-trained deep net of infinite width.\n\nThe current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm. This results in a significant new benchmark for performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019], and only 6% lower than the performance of the corresponding finite deep net architecture (once batch normalization etc. are turned off). Theoretically, we also give the first non-asymptotic proof showing that a fully-trained sufficiently wide net is indeed equivalent to the kernel regression predictor using NTK.", "full_text": "On Exact Computation with an In\ufb01nitely Wide\n\nNeural Net\u21e4\n\nSanjeev Arora\u2020\n\nSimon S. 
Du\u2021\n\nWei Hu\u00a7\n\nZhiyuan Li\u00b6\n\nRuslan Salakhutdinovk\n\nRuosong Wang\u21e4\u21e4\n\nAbstract\n\nHow well does a classic deep net architecture like AlexNet or VGG19 classify on a\nstandard dataset such as CIFAR-10 when its \u201cwidth\u201d\u2014 namely, number of channels\nin convolutional layers, and number of nodes in fully-connected internal layers \u2014\nis allowed to increase to in\ufb01nity? Such questions have come to the forefront in the\nquest to theoretically understand deep learning and its mysteries about optimization\nand generalization. They also connect deep learning to notions such as Gaussian\nprocesses and kernels. A recent paper [Jacot et al., 2018] introduced the Neural\nTangent Kernel (NTK) which captures the behavior of fully-connected deep nets in\nthe in\ufb01nite width limit trained by gradient descent; this object was implicit in some\nother recent papers. An attraction of such ideas is that a pure kernel-based method\nis used to capture the power of a fully-trained deep net of in\ufb01nite width.\nThe current paper gives the \ufb01rst ef\ufb01cient exact algorithm for computing the ex-\ntension of NTK to convolutional neural nets, which we call Convolutional NTK\n(CNTK), as well as an ef\ufb01cient GPU implementation of this algorithm. This results\nin a signi\ufb01cant new benchmark for performance of a pure kernel-based method on\nCIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019],\nand only 6% lower than the performance of the corresponding \ufb01nite deep net\narchitecture (once batch normalization etc. are turned off). 
Theoretically, we also\ngive the \ufb01rst non-asymptotic proof showing that a fully-trained suf\ufb01ciently wide\nnet is indeed equivalent to the kernel regression predictor using NTK.\n\n1\n\nIntroduction\n\nHow well does a classic deep net architecture like AlexNet or VGG19 perform on a standard dataset\nsuch as CIFAR-10 when its \u201cwidth\u201d\u2014 namely, number of channels in convolutional layers, and\nnumber of nodes in fully-connected internal layers \u2014 is allowed to increase to in\ufb01nity? Questions\nabout these \u201cin\ufb01nite limits\u201d of deep nets have naturally emerged in the ongoing effort to understand\nthe power of deep learning. In mathematics it is often easier to study objects in the in\ufb01nite limit. Fur-\nthermore, the in\ufb01nite limit could conceivably make sense in deep learning, since over-parametrization\nseems to help optimization a lot and doesn\u2019t hurt generalization much [Zhang et al., 2017]: deep\nneural nets with millions of parameters work well even for datasets with 50k training examples. So\nwhy not imagine nets whose width goes to in\ufb01nity?\n\n\u21e4The latest full version of this paper can be found at https://arxiv.org/abs/1904.11955.\n\u2020Princeton University and Institute for Advanced Study. Email: arora@cs.princeton.edu\n\u2021Institute for Advanced Study. Email: ssdu@ias.edu\n\u00a7Princeton University. Email: huwei@cs.princeton.edu\n\u00b6Princeton University. Email: zhiyuanli@cs.princeton.edu\nkCarnegie Mellon University. Email:rsalakhu@cs.cmu.edu\n\u21e4\u21e4Carnegie Mellon University. Email: ruosongw@andrew.cmu.edu\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fAllowing width to go to in\ufb01nity also connects deep learning in an interesting way with other areas of\nmachine learning. A single hidden-layer neural network with i.i.d. 
random parameters, in the limit of infinite width, is a function drawn from a Gaussian process (GP) [Neal, 1996]. This model as well as analogous ones with multiple layers [Lee et al., 2018, Matthews et al., 2018], convolutional filters [Novak et al., 2019, Garriga-Alonso et al., 2019] and other architectures [Yang, 2019] make up the GP view of deep learning. These correspond to infinitely wide deep nets all of whose parameters are chosen randomly (with careful scaling), and only the top (classification) layer is optimized.
From now on we will use weakly-trained nets to refer to nets whose layers receive random initialization and only the top layer is trained by gradient descent. We use fully-trained to refer to nets all of whose parameters are trained by gradient descent. It has long been known that weakly-trained convolutional nets have reasonable performance on MNIST and CIFAR-10. Weakly-trained nets that are fully-connected instead of convolutional can also be thought of as "multi-layer random kitchen sinks," which also have a long history.
Weakly-trained nets, whether of finite or infinite width, also define interesting kernels. Specifically, if f(θ, x) ∈ R denotes the output of the network on input x, where θ denotes the parameters in the network and W is an initialization distribution over θ (usually Gaussian), then training just the top layer with an ℓ2 loss is equivalent to kernel regression for the following kernel:

ker(x, x′) = E_{θ∼W}[f(θ, x) · f(θ, x′)],   (1)

where x, x′ are two inputs. This kernel method makes sense when the width goes to infinity.
The objects of interest in this paper are not weakly-trained nets, but fully-trained nets. 
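Before moving on, note that the kernel in Equation (1) is directly estimable at finite width by Monte Carlo over initializations. Below is a minimal sketch; the two-layer ReLU architecture and all function names are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def f(theta, x):
    """Toy two-layer ReLU net: f(theta, x) = a . relu(W x) / sqrt(m)."""
    W, a = theta
    return a @ np.maximum(W @ x, 0.0) / np.sqrt(len(a))

def weak_kernel(x, xp, d, m=1024, n_draws=200, seed=0):
    """Monte Carlo estimate of ker(x, x') = E_theta[f(theta, x) f(theta, x')]
    over random Gaussian initializations theta, as in Equation (1)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_draws):
        theta = (rng.standard_normal((m, d)), rng.standard_normal(m))
        total += f(theta, x) * f(theta, xp)
    return total / n_draws

d = 5
rng = np.random.default_rng(1)
x, xp = rng.standard_normal(d), rng.standard_normal(d)
k_xx, k_xxp = weak_kernel(x, x, d), weak_kernel(x, xp, d)
```

The estimate is symmetric in its two arguments (the same initializations are reused via the seed) and k(x, x) is positive; as m and the number of draws grow, the estimate approaches the infinite-width kernel.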
In the \ufb01nite\ncase, analysis of optimization and generalization of fully-trained nets is of course an open problem.\nOne may also ask:\n\nCan we understand the power of fully-trained nets whose width goes to in\ufb01nity?\n\nA priori this question doesn\u2019t seem any easier than the \ufb01nite case, and empirical evaluation seems\ncomputationally infeasible due to the in\ufb01nite limit. They also do not correspond to a kernel method\nin any obvious way.\nRecent papers suggest that neural nets whose width greatly exceeds the number of training data points\ncan rapidly reduce training error to 0 via gradient descent, and under some conditions, the trained\nnet also exhibits good generalization [Du et al., 2019, 2018b, Li and Liang, 2018, Allen-Zhu et al.,\n2018a,b, Zou et al., 2018, Arora et al., 2019, Cao and Gu, 2019]. Extra-wideness plays a crucial\nrole in the proof: it is shown that as width increases, training causes increasingly smaller changes\n(in a proportionate sense) in the parameters. This raises the possibility that as one increases the\nwidth to in\ufb01nity, a certain limiting behavior can emerge even in the fully-trained net. A recent paper\nby Jacot et al. [2018] isolated a notion implicit in the above papers, which they called the Neural\nTangent Kernel (NTK). They suggested \u2014 via a proof that is slightly heuristic \u2014 that this \ufb01xed kernel\ncharacterizes the behavior of fully-connected in\ufb01nite width neural networks whose layers have been\ntrained by gradient descent. 
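The tangent kernel isolated by Jacot et al., defined formally in Equation (2) below, is the Gram matrix of output gradients with respect to the parameters, and at finite width it can be computed directly at one random initialization (the expectation over θ is dropped). A minimal sketch with a toy two-layer ReLU net; the architecture and names are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def grad_f(theta, x):
    """Gradient of f(theta, x) = a . relu(W x) / sqrt(m) with respect to all
    parameters, flattened into one vector (toy two-layer net)."""
    W, a = theta
    m = len(a)
    h = W @ x                                     # pre-activations
    dW = np.outer(a * (h > 0), x) / np.sqrt(m)    # d f / d W (ReLU mask)
    da = np.maximum(h, 0.0) / np.sqrt(m)          # d f / d a
    return np.concatenate([dW.ravel(), da])

def empirical_ntk(theta, X):
    """Entry (i, j) is <df/dtheta(x_i), df/dtheta(x_j)> at this theta:
    the finite-width analogue of Equation (2) without the expectation."""
    G = np.stack([grad_f(theta, x) for x in X])
    return G @ G.T

rng = np.random.default_rng(0)
d, m, n = 3, 2000, 4
X = rng.standard_normal((n, d))
theta = (rng.standard_normal((m, d)), rng.standard_normal(m))
H = empirical_ntk(theta, X)
```

By construction H is a Gram matrix of gradient features, hence symmetric and positive semidefinite; as the width m grows, it concentrates around the deterministic NTK.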
The NTK is different from the Gaussian process kernels discussed earlier, and is defined using the gradient of the output of the randomly initialized net with respect to its parameters, i.e.,

ker(x, x′) = E_{θ∼W}[⟨∂f(θ, x)/∂θ, ∂f(θ, x′)/∂θ⟩].   (2)

Here, the gradient ∂f(θ, x)/∂θ appears from considering gradient descent, as will be explained in Section 3. One may also generalize the NTK to convolutional neural nets, and we call the corresponding kernel Convolutional Neural Tangent Kernel (CNTK).
Though NTK and CNTK are defined by an infinite limit, a recent paper [Lee et al., 2019] attempted to understand their properties via a finite approximation of the infinite limit kernel by Monte Carlo methods. However, as will be shown in Section B, using random features generated from practically sized nets can degrade the performance a lot. It was still open what is the full power of the exact CNTK on modern datasets. This is a challenging question especially for CNTK with pooling operations: when convolution with pooling is involved, it was believed that exact computation of kernels (for either the convolutional Gaussian process kernel or the CNTK) is infeasible for large datasets like CIFAR-10 [Novak et al., 2019].

Our contributions. We give an exact and efficient dynamic programming algorithm to compute CNTKs for the ReLU activation (namely, to compute ker(x, x′) given x and x′). Using this algorithm, as well as implementation tricks for GPUs, we can settle the question of the performance of fully-trained infinitely wide nets with a variety of architectures. For instance, we find that their performance on CIFAR-10 is within 5% of the performance of the same architectures in the finite case (note that the proper comparison in the finite case involves turning off batch norm, data augmentation, etc., in the optimization). 
In particular, the CNTK corresponding to an 11-layer convolutional net with global average pooling achieves 77% classification accuracy. This is 10% higher than the best reported performance of a Gaussian process with a fixed kernel on CIFAR-10 [Novak et al., 2019].8
Furthermore, we give a more rigorous, non-asymptotic proof that the NTK captures the behavior of a fully-trained wide neural net under weaker conditions than previous proofs. We also experimentally show that the random feature methods for approximating the CNTK in earlier work do not compute good approximations, which is clear from their much worse performance on CIFAR.

1.1 Notation

We use bold-faced letters for vectors, matrices and tensors. For a vector a, let [a]_i be its i-th entry; for a matrix A, let [A]_{i,j} be its (i, j)-th entry; for a 4th-order tensor T, let [T]_{ij,i′j′} be its (i, j, i′, j′)-th entry. Let I be the identity matrix, and [n] = {1, 2, . . . , n}. Let e_i be an indicator vector with i-th entry being 1 and other entries being 0, and let 1 denote the all-one vector. We use ⊙ to denote the entry-wise product and ⊗ to denote the tensor product. We use ⟨·, ·⟩ to denote the standard inner product. We use diag(·) to transform a vector to a diagonal matrix. We use σ(·) to denote the activation function, such as the rectified linear unit (ReLU) function σ(z) = max{z, 0}, and σ̇(·) to denote the derivative of σ(·). Denote by N(µ, Σ) the Gaussian distribution with mean µ and covariance Σ.

2 Related Work

From a Gaussian process (GP) viewpoint, the correspondence between infinite neural networks and kernel machines was first noted by Neal [1996]. Follow-up work extended this correspondence to more general shallow neural networks [Williams, 1997, Roux and Bengio, 2007, Hazan and Jaakkola, 2015]. 
More recently, this was extended to deep and convolutional neural networks [Lee et al., 2018, Matthews et al., 2018, Novak et al., 2019, Garriga-Alonso et al., 2019] and a variety of other architectures [Yang, 2019]. However, these kernels, as we discussed in Section 1, represent weakly-trained nets, instead of fully-trained nets.
Beyond GPs, the connection between neural networks and kernels is also studied in the compositional kernel literature. Cho and Saul [2009] derived a closed-form kernel formula for rectified polynomial activations, which include ReLU as a special case. Daniely et al. [2016] proposed a general framework to transform a neural network to a compositional kernel, and later Daniely [2017] showed that for sufficiently wide neural networks, stochastic gradient descent can learn functions that lie in the corresponding reproducing kernel Hilbert space. However, the kernels studied in these works still correspond to weakly-trained neural networks.
This paper is inspired by a line of recent work on over-parameterized neural networks [Du et al., 2019, 2018b, Du and Hu, 2019, Li and Liang, 2018, Allen-Zhu et al., 2018b,a, Zou et al., 2018, Cao and Gu, 2019]. These papers established that for (convolutional) neural networks with large but finite width, (stochastic) gradient descent can achieve zero training error. A key component in these papers is showing that the weight matrix at each layer is close to its initialization. This observation implies that the kernel defined in Equation (2) is still close to its initialization. Arora et al. [2019] explicitly used this observation to derive generalization bounds for two-layer over-parameterized neural networks. Chizat and Bach [2018] argued that these results in the kernel regime may be too simple to be able to explain the success of deep learning, while on the other hand, our results show that the CNTK is at least able to perform well on tasks like CIFAR-10 classification. 
Also see the survey Fan et al. [2019] for recent advances in deep learning theory.

8 We only consider fixed kernels defined without using the training data. We do not compare to methods that tune the kernels using training data [Van der Wilk et al., 2017] or use a neural network to extract features and then apply a kernel method on top of them [Mairal et al., 2014].

Jacot et al. [2018] derived the exact same kernel from kernel gradient descent. They showed that if the number of neurons per layer goes to infinity in a sequential order, then the kernel remains unchanged for a finite training time. They termed the derived kernel the Neural Tangent Kernel (NTK). We follow the same naming convention and name its convolutional extension the Convolutional Neural Tangent Kernel (CNTK). Later, Yang [2019] derived a formula for the CNTK as well as a mechanistic way to derive the NTK for different architectures. Compared with [Yang, 2019], our CNTK formula has a more explicit convolutional structure and results in an efficient GPU-friendly computation method. Recently, Lee et al. [2019] tried to empirically verify the theory in [Jacot et al., 2018] by studying the linearization of neural nets. They observed that in the first few iterations, the linearization is close to the actual neural net. However, as will be shown in Section B, such linearization can decrease the classification accuracy by 5% even on a "CIFAR-2" (airplane vs. car) dataset. Therefore, exact kernel evaluation is important to study the power of NTK and CNTK.

3 Neural Tangent Kernel

In this section we describe the fully-connected deep neural net architecture and its infinite width limit, and how training it with respect to the ℓ2 loss gives rise to a kernel regression problem involving the neural tangent kernel (NTK). 
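The reduction developed in this section can be previewed numerically: if the kernel matrix H stays fixed during training, the training-set outputs follow a linear ODE (Equation (4) below) and converge to the targets, which is exactly kernel regression under gradient flow. The sketch below checks this with a stand-in positive-definite H, not an actual NTK.

```python
import numpy as np

# Gradient-flow dynamics on the training outputs u(t):
#   du/dt = -H (u - y)  for a fixed PSD kernel matrix H,
# so u(t) -> y whenever H is positive definite.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
H = A @ A.T + np.eye(n)          # stand-in positive-definite kernel matrix
y = rng.standard_normal(n)

u = np.zeros(n)                   # u(0) = 0
dt = 1e-3
for _ in range(20000):
    u = u - dt * H @ (u - y)      # Euler discretization of the gradient flow
```

After enough steps the residual u - y is driven to (numerical) zero at rate governed by the smallest eigenvalue of H, mirroring the role the least eigenvalue of the NTK matrix plays later in the paper.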
We denote by f(θ, x) ∈ R the output of a neural network, where θ ∈ R^N is all the parameters in the network and x ∈ R^d is the input.9 Given a training dataset {(x_i, y_i)}_{i=1}^n ⊂ R^d × R, consider training the neural network by minimizing the squared loss over training data: ℓ(θ) = (1/2) Σ_{i=1}^n (f(θ, x_i) − y_i)². The proof of the following lemma uses simple differentiation and appears in Section C.

9 For simplicity, we only consider a single output here. The generalization to multiple outputs is straightforward.

Lemma 3.1. Consider minimizing the squared loss ℓ(θ) by gradient descent with infinitesimally small learning rate: dθ(t)/dt = −∇ℓ(θ(t)). Let u(t) = (f(θ(t), x_i))_{i∈[n]} ∈ R^n be the network outputs on all x_i's at time t, and y = (y_i)_{i∈[n]} be the desired outputs. Then u(t) follows the evolution

du(t)/dt = −H(t) · (u(t) − y),   (3)

where H(t) is an n × n positive semidefinite matrix whose (i, j)-th entry is ⟨∂f(θ(t), x_i)/∂θ, ∂f(θ(t), x_j)/∂θ⟩.

The statement of Lemma 3.1 involves a matrix H(t). Below we define a deep net architecture whose width is allowed to go to infinity, while fixing the training data as above. In the limit, it can be shown that the matrix H(t) remains constant during training, i.e., equal to H(0). Moreover, under a random initialization of parameters, the random matrix H(0) converges in probability to a certain deterministic kernel matrix H* as the width goes to infinity, which is the Neural Tangent Kernel ker(·, ·) (Equation (2)) evaluated on the training data. If H(t) = H* for all t, then Equation (3) becomes

du(t)/dt = −H* · (u(t) − y).   (4)

Note that the above dynamics is identical to the dynamics of kernel regression under gradient flow, for which at time t → ∞ the final prediction function is (assuming u(0) = 0)

f*(x) = (ker(x, x_1), . . . , ker(x, x_n)) · (H*)^{−1} y.   (5)

In Theorem 3.2, we rigorously prove that a fully-trained sufficiently wide ReLU neural network is equivalent to the kernel regression predictor (5) on any given data point.

Fully-connected deep neural net and its infinite width limit. Now we define a fully-connected neural net formally. Let x ∈ R^d be the input, and denote g^(0)(x) = x and d_0 = d for notational convenience. We define an L-hidden-layer fully-connected neural network recursively:

f^(h)(x) = W^(h) g^(h−1)(x) ∈ R^{d_h},   g^(h)(x) = √(c_σ/d_h) σ(f^(h)(x)) ∈ R^{d_h},   h = 1, 2, . . . , L,   (6)

where W^(h) ∈ R^{d_h × d_{h−1}} is the weight matrix in the h-th layer (h ∈ [L]), σ : R → R is a coordinate-wise activation function, and c_σ = (E_{z∼N(0,1)}[σ(z)²])^{−1}. The last layer of the neural network is

f(θ, x) = f^(L+1)(x) = W^(L+1) · g^(L)(x) = W^(L+1) · √(c_σ/d_L) σ(W^(L) · √(c_σ/d_{L−1}) σ(W^(L−1) · · · √(c_σ/d_1) σ(W^(1) x))),

where W^(L+1) ∈ R^{1×d_L} is the weights in the final layer, and θ = (W^(1), . . . , W^(L+1)) represents all the parameters in the network.
We initialize all the weights to be i.i.d. N(0, 1) random variables, and consider the limit of large hidden widths: d_1, d_2, . . . , d_L → ∞. The scaling factor √(c_σ/d_h) in Equation (6) ensures that the norm of g^(h)(x) for each h ∈ [L] is approximately preserved at initialization (see [Du et al., 2018b]). In particular, for ReLU activation, we have E[‖g^(h)(x)‖²] = ‖x‖² (∀h ∈ [L]).
Recall from [Lee et al., 2018] that in the infinite width limit, the pre-activations f^(h)(x) at every hidden layer h ∈ [L] have all their coordinates tending to i.i.d. centered Gaussian processes of covariance Σ^(h−1) : R^d × R^d → R, defined recursively as: for h ∈ [L],

Σ^(0)(x, x′) = x^⊤ x′,

Λ^(h)(x, x′) = [ Σ^(h−1)(x, x)   Σ^(h−1)(x, x′) ;  Σ^(h−1)(x′, x)   Σ^(h−1)(x′, x′) ] ∈ R^{2×2},

Σ^(h)(x, x′) = c_σ E_{(u,v)∼N(0, Λ^(h))}[σ(u) σ(v)].   (7)

To give the formula of the NTK, we also need to define a derivative covariance:

Σ̇^(h)(x, x′) = c_σ E_{(u,v)∼N(0, Λ^(h))}[σ̇(u) σ̇(v)].   (8)

The final NTK expression for the fully-connected neural network is

Θ^(L)(x, x′) = Σ_{h=1}^{L+1} ( Σ^(h−1)(x, x′) · Π_{h′=h}^{L+1} Σ̇^(h′)(x, x′) ),   (9)

where we let Σ̇^(L+1)(x, x′) = 1 for convenience. We refer readers to Section D for the derivation of this formula. Rigorously, for ReLU activation, we have the following theorem that gives a concrete bound on the hidden widths that is sufficient for convergence to the NTK at initialization:

Theorem 3.1 (Convergence to the NTK at initialization). Fix ε > 0 and δ ∈ (0, 1). Suppose σ(z) = max(0, z) and min_{h∈[L]} d_h ≥ Ω(L^{14}/ε^4 · log(L/δ)). Then for any inputs x, x′ ∈ R^{d_0} such that ‖x‖ ≤ 1, ‖x′‖ ≤ 1, with probability at least 1 − δ we have:

|⟨∂f(θ, x)/∂θ, ∂f(θ, x′)/∂θ⟩ − Θ^(L)(x, x′)| ≤ ε.

The proof of Theorem 3.1 is given in Section E. Theorem 3.1 improves upon previous results [Jacot et al., 2018, Yang, 2019] that also established similar convergence in the following sense:
1. Previous results are asymptotic, i.e., they require the widths to go to infinity, while Theorem 3.1 gives a non-asymptotic bound on the required layer widths.
2. Jacot et al. [2018] required a sequential limit, i.e., d_1, . . . , d_L go to infinity one by one, and Yang [2019] let d_1, . . . , d_L go to infinity at the same rate. 
On the other hand, Theorem 3.1 only requires min_{h∈[L]} d_h to be sufficiently large, which is the weakest notion of limit.

Equivalence between wide neural net and kernel regression with NTK. Built on Theorem 3.1, we can further incorporate the training process and show the equivalence between a fully-trained sufficiently wide neural net and the kernel regression solution using the NTK, as described in Lemma 3.1 and the discussion after it.
Recall that the training data are {(x_i, y_i)}_{i=1}^n ⊂ R^d × R, and H* ∈ R^{n×n} is the NTK evaluated on these training data, i.e., [H*]_{i,j} = Θ^(L)(x_i, x_j). Denote λ_0 = λ_min(H*). For a test point x_te ∈ R^d, we let ker_ntk(x_te, X) ∈ R^n be the kernel evaluated between the test point and the n training points, i.e., [ker_ntk(x_te, X)]_i = Θ^(L)(x_te, x_i). The prediction of kernel regression using the NTK on this test point is f_ntk(x_te) = (ker_ntk(x_te, X))^⊤ (H*)^{−1} y.
Since the above solution corresponds to the linear dynamics in Equation (4) with zero initialization, in order to establish equivalence between neural network and kernel regression, we would like the initial output of the neural network to be small. Therefore, we apply a small multiplier κ > 0, and let the final output of the neural network be f_nn(θ, x) = κ f(θ, x). We let f_nn(x_te) = lim_{t→∞} f_nn(θ(t), x_te) be the prediction of the neural network at the end of training.
The following theorem establishes the equivalence between the fully-trained wide neural network f_nn and the kernel regression predictor f_ntk using the NTK.

Theorem 3.2 (Equivalence between trained net and kernel regression). Suppose σ(z) = max(0, z), 1/κ = poly(1/ε, log(n/δ)) and d_1 = d_2 = · · · = d_L = m with m ≥ poly(1/κ, L, 1/λ_0, n, log(1/δ)). Then for any x_te ∈ R^d with ‖x_te‖ = 1, with probability at least 1 − δ over the random initialization, we have

|f_nn(x_te) − f_ntk(x_te)| ≤ ε.

The proof of Theorem 3.2 is given in Section F. We remark that one can generalize our proof to more advanced architectures, such as convolutional neural networks, ResNet, etc.
Theorem 3.2 is, to our knowledge, the first result that rigorously shows the equivalence between a fully-trained neural net and a deterministic kernel predictor. Compared with similar results by [Jacot et al., 2018, Lee et al., 2019], our bound is non-asymptotic whereas theirs are asymptotic. Compared with [Arora et al., 2019, Allen-Zhu et al., 2018b,a, Du et al., 2019, 2018b, Li and Liang, 2018, Zou et al., 2018], our theorem is a more precise characterization of the learned neural network. That is, the prediction is essentially a kernel predictor. Therefore, to study the properties of over-parameterized nets, such as their generalization power, it is sufficient to study the corresponding NTK.
While this theorem only gives a guarantee for a single point, using a union bound, we can show that the guarantee holds for (exponentially many) finite test points. Combining this with the standard analysis of a hold-out validation set, we can conclude that a fully-trained wide neural net enjoys the same generalization ability as its corresponding NTK.

4 Convolutional Neural Tangent Kernel

In this section we study convolutional neural nets (CNNs) and their corresponding CNTKs. We study two architectures, vanilla CNN and CNN with global average pooling (GAP). In this section we define the vanilla CNN and present its corresponding CNTK formula; the derivation of this formula is deferred to Section G. We present the definition of CNN with GAP and its CNTK in Section H.
To formally define CNNs, we first introduce some notation. We let P be the width and Q be the height of the image. 
We use q ∈ Z_+ to denote the filter size. In practice, q = 1, 3, or 5. We use standard zero padding and set the stride size to be 1 to make sure the input of each layer has the same size. For a convolutional filter w ∈ R^{q×q} and an image x ∈ R^{P×Q}, the convolution operator is defined as

[w * x]_{ij} = Σ_{a=−(q−1)/2}^{(q−1)/2} Σ_{b=−(q−1)/2}^{(q−1)/2} [w]_{a+(q+1)/2, b+(q+1)/2} [x]_{a+i, b+j}   for i ∈ [P], j ∈ [Q].   (10)

Equation (10) shows that the patch [w * x]_{ij} depends on [x]_{i−(q−1)/2 : i+(q−1)/2, j−(q−1)/2 : j+(q−1)/2}. Our CNTK formula also relies on this dependency. For (i, j, i′, j′) ∈ [P] × [Q] × [P] × [Q], define

D_{ij,i′j′} = {(i + a, j + b, i′ + a′, j′ + b′) ∈ [P] × [Q] × [P] × [Q] | −(q − 1)/2 ≤ a, b, a′, b′ ≤ (q − 1)/2}.

Lastly, for a tensor T ∈ R^{P×Q×P×Q}, we denote by [T]_{D_{ij,i′j′}} ∈ R^{q×q×q×q} the corresponding sub-tensor, and we let tr(T) = Σ_{i,j} T_{ij,ij}.

A vanilla CNN consisting of L convolution layers and one fully-connected layer is formally defined as follows:
• Let x^(0) = x ∈ R^{P×Q×C^(0)} be the input image, where C^(0) is the number of channels.
• For h = 1, . . . , L and β = 1, . . . , C^(h), the intermediate outputs are defined as

x̃^(h)_(β) = Σ_{α=1}^{C^(h−1)} W^(h)_(α),(β) * x^(h−1)_(α),   x^(h)_(β) = √(c_σ / (C^(h) × q × q)) · σ(x̃^(h)_(β)),

where each W^(h)_(α),(β) ∈ R^{q×q} is a filter with standard Gaussian initialization.
• The final output is defined as f(θ, x) = Σ_{α=1}^{C^(L)} ⟨W^(L+1)_(α), x^(L)_(α)⟩, where W^(L+1)_(α) ∈ R^{P×Q} is a weight matrix with standard Gaussian initialization.

For this architecture, using the same reasoning as in Section D, we obtain the following convolutional neural tangent kernel formula. The details are provided in Section G.

CNTK formula. 
We let x, x′ be two input images.
• For α = 1, . . . , C^(0) and (i, j, i′, j′) ∈ [P] × [Q] × [P] × [Q], define K^(0)_(α)(x, x′) = x_(α) ⊗ x′_(α) and

[Σ^(0)(x, x′)]_{ij,i′j′} = Σ_{α=1}^{C^(0)} tr([K^(0)_(α)(x, x′)]_{D_{ij,i′j′}}).   (11)

• For h ∈ [L]:
  – For (i, j, i′, j′) ∈ [P] × [Q] × [P] × [Q], define

Λ^(h)_{ij,i′j′}(x, x′) = [ [Σ^(h−1)(x, x)]_{ij,ij}   [Σ^(h−1)(x, x′)]_{ij,i′j′} ;  [Σ^(h−1)(x′, x)]_{i′j′,ij}   [Σ^(h−1)(x′, x′)]_{i′j′,i′j′} ] ∈ R^{2×2}.   (12)

  – Define K^(h)(x, x′), K̇^(h)(x, x′) ∈ R^{P×Q×P×Q}: for (i, j, i′, j′) ∈ [P] × [Q] × [P] × [Q],

[K^(h)(x, x′)]_{ij,i′j′} = (c_σ/q²) · E_{(u,v)∼N(0, Λ^(h)_{ij,i′j′}(x,x′))}[σ(u) σ(v)],

[K̇^(h)(x, x′)]_{ij,i′j′} = (c_σ/q²) · E_{(u,v)∼N(0, Λ^(h)_{ij,i′j′}(x,x′))}[σ̇(u) σ̇(v)].

  – Define Σ^(h)(x, x′) ∈ R^{P×Q×P×Q}: for (i, j, i′, j′) ∈ [P] × [Q] × [P] × [Q],

[Σ^(h)(x, x′)]_{ij,i′j′} = tr([K^(h)(x, x′)]_{D_{ij,i′j′}}).

Note that Σ(x, x′) and Σ̇(x, x′) share similar structures with their NTK counterparts in Equations (7) and (8). The only difference is that we have one more step, taking the trace over patches. This step represents the convolution operation in the corresponding CNN. Next, we can use a recursion to compute the CNTK:
1. First, we define Θ^(0)(x, x′) = Σ^(0)(x, x′).
2. For h = 1, . . . , L − 1 and (i, j, i′, j′) ∈ [P] × [Q] × [P] × [Q], we define

[Θ^(h)(x, x′)]_{ij,i′j′} = tr([K̇^(h)(x, x′) ⊙ Θ^(h−1)(x, x′) + K^(h)(x, x′)]_{D_{ij,i′j′}}).

3. For h = L, we define Θ^(L)(x, x′) = K̇^(L)(x, x′) ⊙ Θ^(L−1)(x, x′) + K^(L)(x, x′).
4. 
The final CNTK value is defined as tr(Θ^(L)(x, x′)).
In Section H we give the CNTK formula for CNNs with GAP, which is similar to that of vanilla CNNs. To compute the CNTK matrix corresponding to a CNN with GAP that has L convolution layers and one fully-connected layer on n samples, the time complexity is O(n²P²Q²L). Previous work assumed that directly computing a convolutional kernel (with pooling) exactly is computationally infeasible, and thus resorted to approximations like Monte Carlo sampling [Novak et al., 2019]. We are able to scale the exact CNTK computation to the full CIFAR-10 dataset and a 20-layer CNN with GAP. We present our efficient computation approach in Section I.

Depth   CNN-V    CNTK-V   CNTK-V-2K   CNN-GAP   CNTK-GAP   CNTK-GAP-2K
3       59.97%   64.47%   40.94%      63.81%    70.47%     49.71%
4       60.20%   65.52%   42.54%      80.93%    75.93%     51.06%
6       64.11%   66.03%   43.43%      83.75%    76.73%     51.73%
11      69.48%   65.90%   43.42%      82.92%    77.43%     51.92%
21      75.57%   64.09%   42.53%      83.30%    77.08%     52.22%

Table 1: Classification accuracies of CNNs and CNTKs on the CIFAR-10 dataset. CNN-V represents vanilla CNN and CNTK-V represents the kernel corresponding to CNN-V. CNN-GAP represents CNN with GAP and CNTK-GAP represents the kernel corresponding to CNN-GAP. CNTK-V-2K and CNTK-GAP-2K represent training CNTKs with only 2,000 training data.

5 Experiments

We evaluate the performances of CNNs and their corresponding CNTKs on the CIFAR-10 dataset. The implementation details are in Section A. We also compare the performances between CNTKs and their corresponding random features. Due to the space limit, we defer these results on random features to Section B.

Results. We test two types of architectures, vanilla CNN and CNN with global average pooling (GAP), as described in Sections 4 and H. 
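The kernel values reported in these experiments are computed exactly rather than estimated by Monte Carlo. The key enabler is that, for ReLU, the Gaussian expectations in the kernel recursions have well-known closed forms (the arc-cosine formulas). As a simplified illustration, here is a sketch of the fully-connected NTK of Equations (7)-(9) with c_σ = 2; the CNTK adds the trace-over-patches steps of Section 4 on top of the same closed forms. This is an illustrative re-derivation, not the paper's GPU implementation.

```python
import numpy as np

def relu_ntk(x, xp, L):
    """NTK value Theta^(L)(x, x') of an L-hidden-layer fully-connected ReLU
    net, via the closed-form Gaussian expectations for ReLU (c_sigma = 2)
    plugged into the recursion of Equations (7)-(9)."""
    # Diagonal entries Sigma^(h)(x, x) = ||x||^2 are preserved at every layer
    # for ReLU with c_sigma = 2, so they need no update inside the loop.
    s_xx, s_pp = x @ x, xp @ xp
    s_xp = x @ xp                      # Sigma^(0)(x, x')
    theta = s_xp                       # Theta^(0) = Sigma^(0)
    for _ in range(L):
        norm = np.sqrt(s_xx * s_pp)
        rho = np.clip(s_xp / norm, -1.0, 1.0)
        gamma = np.arccos(rho)
        # Closed forms of c_sigma * E[sigma(u) sigma(v)] and
        # c_sigma * E[sigma_dot(u) sigma_dot(v)] for (u, v) ~ N(0, Lambda):
        s_xp = norm * (np.sqrt(1.0 - rho**2) + (np.pi - gamma) * rho) / np.pi
        s_dot = (np.pi - gamma) / np.pi
        theta = theta * s_dot + s_xp   # Theta^(h) = Theta^(h-1) * Sigma_dot^(h) + Sigma^(h)
    return theta

x, xp = np.array([0.6, 0.8]), np.array([1.0, 0.0])
```

On the diagonal (x′ = x with ‖x‖ = 1), every Σ^(h) equals 1 and every Σ̇^(h) equals 1, so Θ^(L)(x, x) = L + 1, a handy sanity check on any implementation.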
We also test CNTKs with only 2,000 training data to see whether their performances are consistent with those of CNTKs and CNNs using the full training set. The results are summarized in Table 1. Notice that in Table 1, depth is the total number of layers (including both convolution layers and fully-connected layers).
Several comments are in order. First, CNTKs are very powerful kernels. The best kernel, the 11-layer CNTK with GAP, achieves 77.43% classification accuracy on CIFAR-10. This results in a significant new benchmark for the performance of a pure kernel-based method on CIFAR-10, being 10% higher than the methods reported in [Novak et al., 2019].
Second, we find that for both CNNs and CNTKs, depth can affect the classification accuracy. This observation demonstrates that depth not only matters in deep neural networks but can also affect the performance of CNTKs.
Third, the global average pooling operation can significantly increase the classification accuracy, by 8% to 10% for both CNN and CNTK. Based on this finding, we expect that many techniques that improve the performance of neural networks are in some sense universal, i.e., these techniques can also benefit kernel methods.
Fourth, we find that there is still a 5% to 6% performance gap between CNTKs and CNNs. Since CNTKs exactly correspond to infinitely wide CNNs, this performance gap implies that finite width has its benefits. Therefore, it is likely that recent theoretical work on over-parameterization that operates in the NTK regime cannot fully explain the success of neural networks yet, and we believe it is an interesting open problem to characterize this gap.

Potential application in neural architecture search. Finally, we find that the performances of CNTK-V-2Ks and CNTK-GAP-2Ks are highly correlated with those of their CNN-V, CNTK-V, CNN-GAP and CNTK-GAP counterparts. Again we see CNTK-GAP-2Ks outperform CNTK-V-2Ks by a large margin (about 8% to 9%). 
One potential application of this observation is to guide neural architecture search. We can compute the kernel on a small training set, test it on a validation set, and choose neural network architectures based on the performance of this small kernel on the validation set. We leave large-scale experiments on this idea to future work.

6 Conclusion

By giving the first practical algorithm for computing CNTKs exactly, this paper allows investigation of the behavior of infinitely wide (hence infinitely over-parametrized) deep nets, which turns out to be not much worse than that of their finite counterparts. We also give a fully rigorous proof that a sufficiently wide net is approximately equivalent to the kernel regression predictor, thus yielding a powerful new off-the-shelf kernel. We leave it as an open problem to understand the behavior of infinitely wide nets with features such as batch normalization or residual layers. Of course, one can also hope that the analysis of infinite nets provides rigorous insight into finite ones.

Acknowledgments

We thank Jason D. Lee, Haochuan Li and Xiyu Zhai for useful discussions. S. Arora, W. Hu and Z. Li are supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC. R. Salakhutdinov and R. Wang are supported in part by NSF IIS-1763562, Office of Naval Research grant N000141812861, and Nvidia NVAIL award. We thank Amazon Web Services for providing compute time for the experiments in this paper.

References

Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018a.

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. 
arXiv preprint arXiv:1811.03962, 2018b.

Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Yuan Cao and Quanquan Gu. A generalization theory of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint arXiv:1902.01384, 2019.

Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 2018.

Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342-350, 2009.

Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422-2430, 2017.

Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pages 2253-2261, 2016.

Simon S. Du and Wei Hu. Width provably matters in optimization for deep linear neural networks. arXiv preprint arXiv:1901.08572, 2019.

Simon S. Du, Wei Hu, and Jason D. Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. In Advances in Neural Information Processing Systems 31, pages 382-393, 2018a.

Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018b.

Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. 
In International Conference on Learning Representations, 2019.

Jianqing Fan, Cong Ma, and Yiqiao Zhong. A selective overview of deep learning. arXiv preprint arXiv:1904.05526, 2019.

Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow Gaussian processes. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bklfsi0cKm.

Tamir Hazan and Tommi Jaakkola. Steps toward deep kernel methods from infinite neural networks. arXiv preprint arXiv:1508.05133, 2015.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.

Jaehoon Lee, Jascha Sohl-Dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z.

Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204, 2018.

Julien Mairal, Piotr Koniusz, Zaid Harchaoui, and Cordelia Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems, pages 2627-2635, 2014.

Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.

Radford M. Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29-53. 
Springer, 1996.

Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1g30j0qF7.

Nicolas Le Roux and Yoshua Bengio. Continuous neural networks. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, volume 2 of Proceedings of Machine Learning Research, pages 404-411, San Juan, Puerto Rico, 2007.

Mark Van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional Gaussian processes. In Advances in Neural Information Processing Systems, pages 2849-2858, 2017.

Christopher K. I. Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pages 295-301, 1997.

Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. 
arXiv preprint arXiv:1811.08888, 2018.