{"title": "MetaInit: Initializing learning by learning to initialize", "book": "Advances in Neural Information Processing Systems", "page_first": 12645, "page_last": 12657, "abstract": "Deep learning models frequently trade handcrafted features for deep features learned with much less human intervention using gradient descent. While this paradigm has been enormously successful, deep networks are often difficult to train and performance can depend crucially on the initial choice of parameters. In this work, we introduce an algorithm called MetaInit as a step towards automating the search for good initializations using meta-learning. Our approach is based on a hypothesis that good initializations make gradient descent easier by starting in regions that look locally linear with minimal second order effects. We formalize this notion via a quantity that we call the gradient quotient, which can be computed with any architecture or dataset. MetaInit minimizes this quantity efficiently by using gradient descent to tune the norms of the initial weight matrices. We conduct experiments on plain and residual networks and show that the algorithm can automatically recover from a class of bad initializations. MetaInit allows us to train networks and achieve performance competitive with the state-of-the-art without batch normalization or residual connections. In particular, we find that this approach outperforms normalization for networks without skip connections on CIFAR-10 and can scale to Resnet-50 models on Imagenet.", "full_text": "MetaInit: Initializing learning by learning to initialize\n\nYann N. Dauphin\n\nGoogle AI\n\nynd@google.com\n\nSamuel S. Schoenholz\n\nGoogle AI\n\nschsam@google.com\n\nAbstract\n\nDeep learning models frequently trade handcrafted features for deep features\nlearned with much less human intervention using gradient descent. 
While this paradigm has been enormously successful, deep networks are often difficult to train and performance can depend crucially on the initial choice of parameters. In this work, we introduce an algorithm called MetaInit as a step towards automating the search for good initializations using meta-learning. Our approach is based on a hypothesis that good initializations make gradient descent easier by starting in regions that look locally linear with minimal second order effects. We formalize this notion via a quantity that we call the gradient quotient, which can be computed with any architecture or dataset. MetaInit minimizes this quantity efficiently by using gradient descent to tune the norms of the initial weight matrices. We conduct experiments on plain and residual networks and show that the algorithm can automatically recover from a class of bad initializations. MetaInit allows us to train networks and achieve performance competitive with the state-of-the-art without batch normalization or residual connections. In particular, we find that this approach outperforms normalization for networks without skip connections on CIFAR-10 and can scale to Resnet-50 models on Imagenet.\n\n1 Introduction\n\nDeep learning has led to significant advances across a wide range of domains including translation [55], computer vision [24], and medicine [2]. This progress has frequently come alongside architectural innovations such as convolutions [33], skip-connections [26, 22] and normalization methods [27, 4]. These components allow for the replacement of shallow models with hand-engineered features by deeper, larger, and more expressive neural networks that learn to extract salient features from raw data [43, 8]. While building structure into neural networks has led to state-of-the-art results across a myriad of tasks, there are significant hindrances to this approach. 
Indeed, these larger and more complicated models are often challenging to train and there are few guiding principles that can be used to consistently train novel architectures. As such, neural network training frequently involves large, mostly brute force, hyperparameter searches that are a significant computational burden and obfuscate scientific approaches to deep learning. Indeed, it is often unclear whether architectural additions, such as batch normalization or skip connections, are responsible for improved network performance or whether they simply ameliorate training.\n\nThere are many ways in which training a neural network can fail. Gradients can vanish or explode, which makes the network either insensitive or overly sensitive to updates during stochastic gradient descent [25]. Even if the gradients are well-behaved at initialization, curvature can cause gradients to become poorly conditioned after some time, which can derail training. This has led researchers to consider natural gradient [3] or conjugate gradient [38] techniques. While some methods like KFAC [37] are tractable, second order methods have found limited success due to the implementation challenges and computational overhead. However, quasi-second order techniques such as Adam [30] have become ubiquitous.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe choice of initial parameters, \u03b8\u2080, is intimately related to the initial conditioning of the gradients and therefore plays a crucial role in the success or failure of neural network training [32]. Consequently, there is a long line of research studying initialization schemes for neural networks including early seminal work by Glorot et al. [20] showing that the norms of the weights and biases in a fully-connected network control whether gradients explode or vanish on average. Subsequent work by Saxe et al. 
[51] showed that the gradient fluctuations could additionally be controlled in deep linear networks. More recent contributions have included initialization schemes for fully-connected networks [52, 46], residual networks [23, 58, 63], convolutional networks [57], recurrent networks with gating [9, 19], and batch normalization [59]. While this research has found success, the analysis is often sophisticated, requiring significant expertise, and depends crucially on the architecture and choice of activation functions. An automated approach to initialization would reduce the amount of expertise necessary to train deep networks successfully and would be applicable to novel architectures. However, no objective has been identified that works for a broad range of architectures. For example, orthogonal initialization schemes identified in [51] fail in combination with ReLU nonlinearities [46] and LSUV [41] is not compatible with pre-activation residual networks [63].\n\nIn this work, we propose a strategy to automatically identify good initial parameters of machine learning models. To do this we first propose a quantity, called the gradient quotient, that measures the change in the gradient of a function after a single step of gradient descent. We argue that a low gradient quotient correlates with a number of recently identified predictors of trainability including the conditioning of the Hessian [18], the Fisher Information [3], and the neural tangent kernel [28, 17, 34]. We then introduce the MetaInit (Meta Initialize) algorithm that minimizes the gradient quotient using gradient descent to tune the norms of the initial weight matrices. We show that for two key architecture families (vanilla CNNs and Resnets), MetaInit can automatically correct several bad initializations. Moreover, we show that by initializing with MetaInit we can train deep networks that reach state-of-the-art results without normalization layers (e.g. 
batch normalization) and near state-of-the-art without residual connections. Finally, we show that MetaInit is efficient enough that it can be applied to large-scale benchmarks such as Imagenet [12].\n\n2 MetaInit: Initializing by searching for less curvy starting regions\n\nIn this section, we propose an algorithm called MetaInit that adjusts the norms of the parameters at initialization so they are favorable to learning. To do so we must first identify a principle for good initialization that can be formalized into an objective function. This objective function should have other crucial properties such as being efficient to compute and easily amenable to minimization by gradient descent. These requirements rule out well-known quantities such as the condition number of the Hessian and lead to the development of a novel criterion.\n\nFigure 1: Illustration of the gradient quotient for different initial trajectories.\n\nRecall that gradient descent is a first order algorithm that does not take the curvature of the function into account at each step. As discussed above, a longstanding goal in deep learning is to develop tractable optimization frameworks to try to take into account second-order information. Absent such methods, we hypothesize that a favorable inductive bias for initialization is to start learning in a region where the gradient is less affected by curvature. In this region, the magnitude and direction of the gradient should not change too abruptly due to second order effects. This hypothesis is motivated by Pennington et al. [46], who observed better gradient conditioning and trainability as networks become more linear, Balduzzi et al. [5], who proposed a successful \u201clooks-linear\u201d initialization for rectified linear layers, and Philipp et al. 
[47] who showed correlation between generalization and \u201cnonlinearity\u201d.\n\n[Figure 1 panels: g(\u03b8)=[1, 1], g(\u03b8 \u2212 g(\u03b8))=[1, 1], GQ \u2248 0.00; g(\u03b8)=[1, 1], g(\u03b8 \u2212 g(\u03b8))=[2, 1], GQ \u2248 0.49; g(\u03b8)=[0.5, 0.5], g(\u03b8 \u2212 g(\u03b8))=[2, 1], GQ \u2248 1.99; g(\u03b8)=[1, 1], g(\u03b8 \u2212 g(\u03b8))=[1, \u22121], GQ \u2248 1]\n\nAccordingly, consider parameters \u03b8 \u2208 R^N for a network, along with a loss function \u2113(x; \u03b8). We can compute the average loss over a batch of examples, L(\u03b8) = E_x[\u2113(x; \u03b8)], along with the gradient g(\u03b8) = \u2207L(\u03b8) and Hessian H(\u03b8) = \u2207\u00b2L(\u03b8). We would like to construct a quantity that measures the effect of curvature near \u03b8 without the considerable expense of computing the full Hessian. To that end, we introduce the gradient quotient,\n\nGQ(L, \u03b8) = (1/N) \u2016 (g(\u03b8) \u2212 H(\u03b8)g(\u03b8)) / (g(\u03b8) + \u03b5) \u2212 1 \u2016\u2081 \u2248 (1/N) \u2016 g(\u03b8 \u2212 g(\u03b8)) / (g(\u03b8) + \u03b5) \u2212 1 \u2016\u2081,  (1)\n\nwhere the division is elementwise, \u03b5 = \u03b5\u2080(2\u00b71[g(\u03b8) \u2265 0] \u2212 1) computes a damping factor with the right sign for each element, \u03b5\u2080 is a small constant and \u2016\u00b7\u2016\u2081 is the L1 vector norm. As its name suggests, the gradient quotient is the relative per-parameter change in the gradient after a single step of gradient descent. We find that the step-size has virtually no effect on the gradient quotient aside from a trivial scaling factor and so we set it to 1 without loss of generality. Parameters that cause the gradient to change explosively have a large gradient quotient, while parameters that cause vanishing gradients do not minimize this criterion since g(\u03b8) = g(\u03b8 \u2212 g(\u03b8)) = 0 \u21d2 GQ(L, \u03b8) = 1 if \u03b5 > 0. 
By contrast, it is clear that the optimal gradient quotient of 0 is approached when L(\u03b8) is nearly a linear function so that H(\u03b8) \u2248 0.\n\nRelationship to favorable learning dynamics The gradient quotient is intimately related to several quantities that have recently been shown to correlate with learning. Letting \u03bb_i be the eigenvalues of H(\u03b8) along with associated eigenvectors, v_i, it follows that g(\u03b8) = \u03a3_i c_i v_i for some choice of c_i. Furthermore, neglecting \u03b5 the objective simplifies to\n\nGQ(L, \u03b8) = (1/N) \u2016 H(\u03b8)g(\u03b8) / g(\u03b8) \u2016\u2081 = (1/N) \u03a3_j | (\u03a3_i \u03bb_i c_i e_j^T v_i) / (\u03a3_i c_i e_j^T v_i) |,  (2)\n\nwhere the e_j are standard basis vectors. This reveals the gradient quotient is intimately related to the spectrum of the Hessian. Moreover, the gradient quotient can be minimized by either:\n\n1. Improving the conditioning of the Hessian by concentrating its eigenvalues, \u03bb_i, near 0.\n2. Encouraging the gradient to point in the flat directions of H, that is to say c_i should be large when \u03bb_i is close to zero.\n\nThere is significant evidence that improving conditioning in the above sense can lead to large improvements in learning. Notice that the Hessian, the Fisher Information, and the Neural Tangent Kernel all share approximately the same nonzero eigenvalues. From the perspective of the Hessian, the relationship between conditioning and first-order optimization is a classic topic of study in optimization theory. Most recently, it has been observed [51, 46, 18] that Hessian conditioning is intimately related to favorable learning dynamics in deep networks. 
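Both mechanisms can be seen by evaluating the gradient quotient by hand on a toy quadratic loss. The following is a minimal NumPy sketch of our own (the diagonal Hessians are hypothetical, chosen only to contrast small and large curvature); it is not the paper's implementation:

```python
import numpy as np

def gradient_quotient(H, theta, eps0=1e-5):
    # Quadratic loss L(t) = 0.5 * t^T H t, so g(t) = H t, and one unit
    # gradient step gives g(t - g(t)) = H (t - H t), as in Eq. (1).
    g = H @ theta
    g_after = H @ (theta - g)
    eps = eps0 * np.where(g >= 0, 1.0, -1.0)  # sign-matched damping
    return np.mean(np.abs(g_after / (g + eps) - 1.0))

theta = np.ones(2)
flat = np.diag([0.01, 0.02])   # eigenvalues near zero: locally almost linear
sharp = np.diag([2.0, 0.5])    # large curvature along the gradient
print(gradient_quotient(flat, theta))   # small (about 0.016)
print(gradient_quotient(sharp, theta))  # large (about 1.25)
```

For a quadratic the per-parameter quotient reduces to |λ_i|, so concentrating eigenvalues near zero (condition 1 above) directly drives GQ toward its optimum of 0.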
In addition, it has been shown that a signature of failure in deep learning models is when the gradient concentrates on the large eigenvalues of the Hessian [21, 18]. This is precisely what condition 2 avoids. Likewise, experiments on natural gradient methods [3, 37] have shown that by taking into account the conditioning of the Fisher, one can significantly improve training dynamics. Finally, the neural tangent kernel has been shown to determine early learning dynamics [17, 34] and its conditioning is a strong predictor of trainability. In the appendix we present numerical experiments showing that this qualitative picture accurately describes improvements to a WideResnet during optimization of the gradient quotient.\n\nEfficiency Computing the gradient quotient is on the same order of complexity as computing the gradient.\n\nIn addition to using the gradient quotient to measure the quality of an initial choice of parameters, it will be used as a meta-objective to learn a good initialization from a poor one as follows\n\nMetaInit(L, \u03b8) = argmin_\u03b8 GQ(L, \u03b8).  (3)\n\nRobustness as search objective Using gradient descent on a meta-learning objective to recover from bad initialization typically would make things more difficult [36]. If the gradient vanishes for gradient descent on L, then it likely vanishes for a meta-learning objective that involves multiple steps on L like MAML [15]. The gradient quotient avoids this problem because it is sensitive to parameters even in the presence of gradient vanishing by depending on the per-parameter values of the gradient explicitly. As such there is a \u201cshort path\u201d between each parameter and the objective under backpropagation. 
In other words, pre-training using the gradient quotient as an objective makes training more robust.\n\nTask agnosticity We solve the over-fitting problems associated with meta-learning by following previous initialization approaches [52, 57] in using input data drawn completely randomly, such as x \u223c N(0, 1), during meta-initialization. We find the gradient quotient is still informative with random data and as a result the meta-initialization is largely task independent. This is not possible for other meta-learning methods, which typically require large amounts of true data to compute the end-to-end training objective. It may be surprising that the gradient quotient is a good indicator even with random data, but this is consistent with previous initialization work that used random data in its analysis [20, 51] or found that good initialization was relatively dataset agnostic [52]. We dub the use of such objectives task agnostic meta-learning (TAML). We note that the task-agnostic nature of MetaInit implies that once a model has been initialized properly it can be used for a number of tasks.\n\nimport torch\n\ndef gradient_quotient(loss, params, eps=1e-5):\n    grad = torch.autograd.grad(loss, params,\n                               retain_graph=True, create_graph=True)\n    prod = torch.autograd.grad(sum([(g**2).sum() / 2 for g in grad]),\n                               params, retain_graph=True, create_graph=True)\n    out = sum([((g - p) / (g + eps * (2*(g >= 0).float() - 1).detach())\n                - 1).abs().sum() for g, p in zip(grad, prod)])\n    return out / sum([p.data.nelement() for p in params])\n\ndef metainit(model, criterion, x_size, y_size, lr=0.1,\n             momentum=0.9, steps=500, eps=1e-5):\n    model.eval()\n    params = [p for p in model.parameters()\n              if p.requires_grad and len(p.size()) >= 2]\n    memory = [0] * len(params)\n    for i in range(steps):\n        input = torch.Tensor(*x_size).normal_(0, 1).cuda()\n        target = torch.randint(0, y_size, (x_size[0],)).cuda()\n        loss = criterion(model(input), target)\n        gq = gradient_quotient(loss, list(model.parameters()), eps)\n        grad = torch.autograd.grad(gq, params)\n        for j, (p, g_all) in enumerate(zip(params, grad)):\n            norm = p.data.norm().item()\n            g = torch.sign((p.data * g_all).sum() / norm)\n            memory[j] = momentum * memory[j] - lr * g.item()\n            new_norm = norm + memory[j]\n            p.data.mul_(new_norm / norm)\n        print(\"%d/GQ = %.2f\" % (i, gq.item()))\n\nFigure 2: Basic PyTorch code for the MetaInit algorithm.\n\n3 Implementation\n\nThe proposed meta-algorithm minimizes Equation 1 using gradient descent. This requires computing the gradient of an expression that involves gradients and a Hessian-gradient product. However, gradients of the GQ can easily be obtained automatically using a framework that supports higher order automatic differentiation such as PyTorch [45], TensorFlow [1], or JAX [16]. Automatic differentiation can compute the Hessian-vector product without explicitly computing the Hessian by using the identity \u2207\u00b2\u2113 v = \u2207(\u2207\u2113 \u00b7 v). The gradient_quotient function in Figure 2 provides the PyTorch code to compute Equation 1.\n\nLike previous initialization methods [20], we find experimentally that it suffices to tune only the scale of the initial weight matrices when a reasonable random weight distribution is used, such as Gaussian or Orthogonal matrices. We can obtain the gradient with respect to the norm of a parameter w using the identity (w/\u2016w\u2016) \u00b7 \u2207_w \u2113 as in [50]. The biases are initialized to zero and are not tuned by MetaInit.\n\nAs discussed above, the gradient quotient objective is designed to ameliorate issues with vanishing and exploding gradients. Nonetheless, for the most pathological initialization schemes, more help is needed. To that end we optimize Equation 1 using signSGD [7], which performs gradient descent with the sign of the gradient elements. 
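The per-parameter norm update that this produces can be isolated in a few lines. Below is a NumPy paraphrase of our own of the inner loop of the Figure 2 listing, with hypothetical inputs (the real code operates in place on torch tensors):

```python
import numpy as np

def norm_step(W, G, memory, lr=0.1, momentum=0.9):
    # Gradient of the meta-objective w.r.t. the norm of W is sum(W * G) / ||W||;
    # signSGD keeps only its sign, and momentum smooths the sequence of signs.
    norm = np.linalg.norm(W)
    sign = np.sign((W * G).sum() / norm)
    memory = momentum * memory - lr * sign
    new_norm = norm + memory
    return W * (new_norm / norm), memory

W = np.ones((2, 2))          # ||W|| = 2
G = np.ones((2, 2))          # meta-gradient pushing the norm up
W, memory = norm_step(W, G, memory=0.0)
print(np.linalg.norm(W))     # 1.9: the norm is nudged down by exactly lr = 0.1
```

Because only the sign of the norm-gradient enters, the step size is the same whether the underlying gradient is tiny or huge, which is the robustness property discussed above.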
We find that using only the sign of the gradient prevents a single large gradient from derailing optimization and also guarantees that vanishing gradients still result in non-negligible steps. The metainit function in Figure 2 provides the PyTorch code to perform Equation 3.\n\nWe find that successfully running the meta-algorithm for a Resnet-50 model on Imagenet takes 11 minutes on 8 Nvidia V100 GPUs. This represents a little less than 1% of the training time for that model using our training setup, though we expect that recovering from initializations worse than those we consider in Section 4 could require more meta-training time.\n\n4 Experiments\n\nIn this section, we examine the behavior of the MetaInit algorithm across different architectures. We consider plain CNNs, WideResnets [60], and Resnet-50 [24]. Here plain networks refer to networks without skip connections that are otherwise the same as WideResnet. In order to isolate the effect of initialization, unless explicitly noted, we consider networks without normalization (e.g. batch normalization), which has been shown to make networks less sensitive to the initial parameters. To remedy the fact that, without normalization, layers have slightly fewer parameters, we introduce a scalar multiplier initialized at 1 every two layers as in [63]. The networks without normalization do not have biases in the convolutional layers. Unless otherwise noted, we use Figure 2 with the default hyper-parameters.\n\n4.1 Minimizing the gradient quotient corrects bad initial parameter norms\n\nIn this section, we evaluate the ability of MetaInit to correct bad initializations. For each architecture, we evaluate two bad initializations: one where the magnitudes of the initial parameters are too small and one where they are too big. We then tune the norms of the initial parameters with the meta-algorithm. 
We perform experiments with 28-layer deep linear networks so as to remove the confounding factor of the activation function. The loss surface is still non-linear due to the cross-entropy loss. We use the default meta-hyper-parameters except for the number of steps, which is set to 1000, and the momentum, which is set to 0.5. As discussed above, we use randomly generated data composed of 128 \u00d7 3 \u00d7 32 \u00d7 32 input matrices and 10-dimensional multinomial targets. We evaluate the method by comparing the norms of the weights at initialization and after meta-optimization with a reference initialization that is known to perform well for that architecture.\n\nWe plot the norm of the weight matrices before and after MetaInit as a function of layer for each initialization protocol outlined above in Figure 3. In these experiments Gaussian(0, \u03c3\u00b2) refers to sampling the weight matrices from a Gaussian with fixed standard deviation \u03c3, and Fixup (Nonzero) refers to a Fixup initialization [63] where none of the parameters are initialized to zero. Gaussian(0, \u03c3\u00b2) is a bad initialization that has nonetheless been used in influential papers like [31]. We see that MetaInit adjusts the norms of the initial parameters close to a known, good initialization for both the plain and residual architectures considered.\n\nThis is surprising because MetaInit does not specifically try to replicate any particular known initialization and simply ensures that we start in a region with small curvature parallel to the gradient. Though automatic initialization learns to match known initializations for certain models, we observe that it tends to differ when non-linear activation functions are used. This is expected since the aim is for the method to find new initializations when existing approaches aren\u2019t appropriate. 
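The sense in which a fixed-σ Gaussian initialization is "bad" for such a deep linear network can be seen from a quick forward pass. This is an illustrative sketch of our own (the width of 64 is hypothetical), not the experimental code:

```python
import numpy as np

def output_rms(depth=28, width=64, sigma=None, seed=0):
    # Push a random vector through `depth` linear layers and report its RMS.
    # Xavier-style scaling sigma = 1/sqrt(width) keeps the signal size roughly
    # constant, while a fixed small sigma makes it collapse geometrically.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=width)
    s = 1.0 / np.sqrt(width) if sigma is None else sigma
    for _ in range(depth):
        x = rng.normal(0.0, s, size=(width, width)) @ x
    return np.sqrt(np.mean(x ** 2))

print(output_rms())            # O(1): the signal survives 28 layers
print(output_rms(sigma=0.01))  # vanishingly small, like Gaussian(0, 0.01) in Fig. 3
```

With σ = 0.01 and width 64 the per-layer gain is roughly σ√width = 0.08, so after 28 layers the signal (and by symmetry the gradient) is smaller by tens of orders of magnitude, which is exactly the regime MetaInit must correct.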
For these types of networks, we will evaluate the method through training in the next section.\n\n(a) Plain with small initialization  (b) Plain with large initialization  (c) Resnet with small initialization  (d) Resnet with large initialization\n\nFigure 3: Norm of the weight matrices of a 28 layer linear network for a bad initialization (red), MetaInit applied to correct the bad initialization (purple) and a reference good initialization (blue). Note that the norm increases with the index because the number of channels in the weights increases. These results report the average and standard error over 10 trials for each random initialization. We observe that MetaInit learns norms close to known good initializations even when starting from a bad initialization.\n\nModel             Method                        Gradient Quotient   Test Error (%)\nPlain 28-10       Batchnorm                     -                   6.0\n                  LSUV                          -                   3.7\n                  DeltaOrthogonal (Untuned)     1.00                90.0\n                  DeltaOrthogonal \u2192 MetaInit    0.54                3.7\nWideResnet 202-4  Batchnorm                     -                   3.4\n                  LSUV                          -                   6.9\n                  DeltaOrthogonal (Untuned)     2.72                6.7\n                  DeltaOrthogonal \u2192 MetaInit    0.53                3.8\nWideResnet 28-10  Batchnorm                     -                   2.8\n                  LSUV                          -                   4.8\n                  DeltaOrthogonal (Untuned)     0.87                3.2\n                  DeltaOrthogonal \u2192 MetaInit    0.45                2.9\n\nTable 1: Test error on CIFAR-10 with the different methods. The gradient quotient reported here is computed before training. Using MetaInit to improve the initialization allows training networks that are competitive with networks with Batchnorm.\n\n4.2 Improving bad initialization with MetaInit helps training\n\nIn this section, we evaluate networks trained from our meta-initialization on challenging benchmark problems. Many works in deep learning, such as [24, 10, 49, 53], have tended to treat initializations like Kaiming [23] and Orthogonal [51] as standards that can be used with little tuning to the architecture. 
To illustrate the need to tune the initialization based on architecture, we will compare with an untuned DeltaOrthogonal initialization [57], which is a state-of-the-art extension of the Orthogonal initialization [51] to convolutional networks. Since we hope to show that MetaInit can automatically find good initializations, we do not multiply the initial parameter values by a scaling factor that is derived using expert knowledge based on the architecture.\n\nIt should be noted that the tuning done by MetaInit could also be derived manually, but this would require careful expert work. These experiments demonstrate that automating this process as described here can be effective and comparatively easier.\n\nCIFAR We use the \u03b2_t-Swish(x) = \u03c3(\u03b2_t x)x activation function [49], with \u03b2_0 initialized to 0 to make the activation linear at initialization [5]. We use Mixup [62] with \u03b1 = 1 to regularize all models, combined with Dropout with rate 0.2 for residual networks. For plain networks without normalization, we use gradient norm clipping with the maximum norm set to 1 [11]. We use a cosine learning rate schedule [35] with a single cycle and follow the setup described in that paper. We chose this learning rate schedule because it reliably produces state-of-the-art results, while removing hyper-parameters compared to the stepwise schedule. All methods use an initial learning rate of 0.1, except LSUV, which required lower learning rates of 0.01 and 0.001 for WideResnet 28-10 and WideResnet 202-4 respectively. 
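For reference, the single-cycle cosine schedule used above can be written in a few lines. This is our own sketch of the schedule from [35], omitting the warm restarts of the original:

```python
import math

def cosine_lr(step, total_steps, lr0=0.1):
    # Anneal from lr0 down to 0 over one cosine half-period.
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * step / total_steps))

print(cosine_lr(0, 100))    # 0.1  (initial learning rate)
print(cosine_lr(50, 100))   # 0.05 (halfway through training)
```

The schedule has no milestones to tune, which is the hyper-parameter saving mentioned above relative to a stepwise schedule.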
LSUV also uses DeltaOrthogonal initialization in convolutional layers for fairness, since it is an improvement over Orthogonal initialization. The batch size used for the meta-algorithm is 32. The number of meta-algorithm steps was reduced to 200 for WideResnet 202-4. Apart from this, we use the default meta-hyper-parameters.\n\nTable 1 shows the results of training with the various methods on CIFAR-10. We observe that without tuning, DeltaOrthogonal initialization does not generalize well to different architectures. The key issue in the plain architecture case is that the effect of the activation function was not taken into account. In our training setup, \u03b2_0-Swish(x) = \u03c3(0)x = 0.5x results in a multiplicative scaling factor of 1/2 at every layer due to the initialization of \u03b2_0 at 0. This results in vanishing gradients in the plain architecture case. Surprisingly, while this gain is bad for plain networks, it helps training for residual networks. As explained by [58, 63], downscaling is necessary for residual networks to prevent explosion. However, typically the scaling factor should be inversely proportional to the depth of the network. By contrast, the naive initialization here uses a constant gain factor. Accordingly, we observe that DeltaOrthogonal with this setup does not work well when we increase the number of layers in the network, with the accuracy decreasing by 3%. The failure of the untuned DeltaOrthogonal initialization in this setup demonstrates that the initialization must change with the architecture of the network. 
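The vanishing effect of the constant 1/2 gain is easy to quantify with a back-of-the-envelope sketch (28 matches the depth of the plain networks above; this is our own illustration):

```python
depth = 28
gain = 0.5  # beta_0-Swish at initialization: sigma(0) * x = 0.5 * x per layer
# The forward signal, and hence the gradient, shrinks geometrically
# in a plain network of this depth.
print(gain ** depth)  # ~3.7e-09
```

In a residual network the same per-branch factor instead tames the accumulated skip-connection sum, which is why the identical gain hurts plain networks but helps residual ones.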
By contrast, using MetaInit we are able to recover strong results independent of architecture.\n\nOur results also show that adapting the DeltaOrthogonal initialization using MetaInit leads to accuracies that exceed or are competitive with applying Batchnorm to the architectures considered here. The gap between the meta-algorithm and Batchnorm for plain networks further corroborates the theoretical results of [59], which showed that Batchnorm is not well-suited for networks without residual connections. This suggests that in general Batchnorm should not be relied upon as much as it has been to correct mistakes in initialization. As a case in point, our results demonstrate that plain deep networks can be much more competitive with Resnets than is commonly assumed when a proper initialization and training setup is used. By comparison, the network with Batchnorm reaches an accuracy that is at least 2% lower for this setup. Aside from proper initialization, the key components to achieving this result for plain networks are the use of the \u03b2-Swish activation and clipping. As a reference, when we combine BatchNorm and MetaInit for the WideResnet 28-10, we obtain the same performance as BatchNorm by itself (2.8%). This is not unexpected since BatchNorm makes the network more robust to initialization.\n\nLSUV [41] is a data-dependent initialization method that tries to mimic BatchNorm by normalizing the layers at initialization. Our results show that this approach improves results for plain networks, but LSUV actually makes results worse than the naive initialization for WideResnet. This failure is consistent with the fact that LSUV cannot scale the residual layers by depth, which was shown to be crucial for stability as in [58, 63]. As a result, LSUV requires using lower learning rates than the other methods discussed here to prevent divergence.\n\nImagenet We use the Resnet-50 architecture with scalar bias before and after each convolution following [63]. 
In order to showcase the importance of adapting the initialization to the architecture we will consider two activation functions: Swish(x) = \u03c3(x)x and ReLU. We use the same training setup and hyper-parameters as [63], except for the initialization, which is set to DeltaOrthogonal. The application of MetaInit was much less straightforward than in the previous case due to the complexity of the model considered. In order to obtain a good estimate of the gradient quotient, we had to use a batch size of 4096 examples. This required using smaller random inputs, of size 32 \u00d7 32 compared with the size of Imagenet images, while meta-initializing. Furthermore, it was necessary to use cross-validation to select the momentum parameter for the meta-algorithm between the values of 0.5 and 0.9.\n\nActivation   BatchNorm*   GroupNorm*   Fixup*   DeltaOrthogonal (Untuned)   MetaInit\nReLU         23.3         23.9         24.0     24.3                        24.6\nSwish        -            -            -        99.9                        24.0\n\nTable 2: Top-1 Test error on Imagenet ILSVRC2012 for a Resnet-50 model with different methods. * The columns for BatchNorm, GroupNorm and Fixup are reference results taken directly from [63]. They are not initialized in the same way and so are not as directly comparable. MetaInit produces more consistent results than untuned DeltaOrthogonal initialization as we vary the activation functions.\n\nTable 2 shows that using MetaInit leads to more consistent results as we change the architecture compared to untuned DeltaOrthogonal initialization. In this case, the change in architecture is driven by the choice of activation function. As first noted by [23], ReLU activation layers downscale the standard deviation of pre-activations by 1/\u221a2; by contrast the Swish activation leads to a reduction in the standard deviation of about 1/2 at each layer. 
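The two downscaling factors quoted above can be checked by Monte Carlo. The sketch below is our own (we measure the root-mean-square of the post-activations for unit-Gaussian pre-activations, the second-moment scaling used in [23]):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
rms = lambda v: np.sqrt(np.mean(v ** 2))

relu = np.maximum(x, 0.0)
swish = x / (1.0 + np.exp(-x))   # sigma(x) * x

print(rms(relu) / rms(x))    # ~0.707 = 1/sqrt(2), matching [23]
print(rms(swish) / rms(x))   # smaller still: Swish shrinks the signal more
```

Because the Swish factor is smaller than the ReLU factor, the same untuned weight scale that happens to balance a ReLU network leaves a Swish network under-scaled, which is the architecture dependence MetaInit corrects.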
Coincidentally, the downscaling provided by the ReLU works well for this specific architecture, while the one implied by the Swish is too large and prevents learning. However, with MetaInit, we observe training in both cases. Moreover, our results are competitive with BatchNorm and GroupNorm. We believe that the gap in performance with BatchNorm is mainly due to the regularization properties of BatchNorm. We view it as a success of our method that we are able to disentangle trainability and generalization and quantify the regularizing role of BatchNorm. Our results are also competitive with the Fixup [63] initialization method, which was developed for residual networks with positively homogeneous activations, like ReLU but unlike Swish.

4.3 Ablation

Figure 4: Meta-learning curves for MetaInit using the L2 norm in Equation 1 instead of the L1 norm. The results reported are averaged over 10 trials. Using the L1 norm leads to faster minimization.

Figure 5: Learning curve of a WideResnet 16-4 on CIFAR-10 comparing SGD with MetaInit initialization to SGD and signSGD with untuned initialization during supervised training. MetaInit leads to faster convergence.

We evaluate the importance of using the L1 norm in Equation 1 during the meta-optimization phase. We consider a linear plain network with 28 layers and width 1 with a bad initialization sampled from Gaussian(0, 0.1). The random inputs have size 128 × 3 × 32 × 32 with 10 classes, and the momentum hyper-parameter is set to 0.5. Figure 4 demonstrates the importance of the L1 norm.
Next, we evaluate how the proposed meta-initialization compares to using signSGD directly during regular training to mitigate bad initializations. Figure 5 shows results on CIFAR-10 training for 5 supervised epochs with a cosine learning rate schedule. SGD and MetaInit were both trained using a learning rate of 0.1, while the learning rate for signSGD had to be reduced to 0.001 to avoid divergence.
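For concreteness, the quantity being ablated can be sketched in code. Equation 1 appears in an earlier section; the stdlib-only snippet below uses one simplified reading of the gradient quotient (the relative, per-coordinate change of the gradient after a full gradient step, aggregated with either an L1 or L2 norm) rather than the paper's exact expression. It illustrates the underlying hypothesis: a locally linear loss yields a quotient near 0, while strong curvature yields a large one.

```python
import math

EPS = 1e-5  # guards the elementwise division when a gradient entry is near zero

def gradient_quotient(grad, theta, norm="l1"):
    """Relative change of the gradient after one full gradient step,
    averaged over coordinates with an L1 or L2 norm.
    `grad` maps a parameter list to its gradient list."""
    g = grad(theta)
    g2 = grad([t - gi for t, gi in zip(theta, g)])
    q = [b / (a + EPS) - 1.0 for a, b in zip(g, g2)]
    if norm == "l1":
        return sum(abs(x) for x in q) / len(q)
    return math.sqrt(sum(x * x for x in q) / len(q))

theta = [0.3, -1.2, 0.7]

# A locally linear loss L = sum(c_i * theta_i): the gradient never changes,
# so the quotient is ~0 -- the regime hypothesized to be easy for gradient descent.
linear_grad = lambda th: [2.0, -1.5, 0.5]
print(gradient_quotient(linear_grad, theta))      # ~0

# A curved loss L = 0.5 * ||theta||^2: one gradient step lands at the origin,
# the gradient changes completely, and the quotient is ~1.
quad_grad = lambda th: list(th)
print(gradient_quotient(quad_grad, theta, "l1"))  # ~1
```

In MetaInit proper, this scalar is minimized with respect to the initial weight norms; the L1-vs-L2 choice above only changes how the per-coordinate quotients are aggregated into that scalar.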
Unlike signSGD, MetaInit discovers a good initialization without using any supervised data. signSGD must recover from the bad initialization using supervised updates, which can and does slow convergence in this case.

5 Limitations

The algorithm proposed in this paper has limitations, which we hope will be addressed in future work:

1. In some cases, applying MetaInit successfully requires tuning the meta-hyper-parameters. However, the number of hyper-parameters added is small compared to directly cross-validating over different initial norms for each parameter.

2. The proposed algorithm uses gradient descent to help gradient descent. However, the gradient descent process on the meta objective can and does fail for very large scale models and input dimensions. Getting good estimates of the gradient and Hessian-vector product can require very large batch sizes for big models.

3. Improving initialization does not necessarily address all numerical instability problems. In particular, training at lower precision without normalization remains particularly unstable.

4. We learn the norms of the parameters, not the full parameters. This mirrors many influential initializations, but it is possible to imagine that certain architectures might require adapting all initial parameters. Thus, for the moment, selecting the initial distribution of parameters still requires expert intervention.

6 Additional Related Work

There are a number of research directions that are related to MetaInit aside from the work on optimizers, normalization methods, and initialization schemes already discussed above. Maclaurin et al.
[36] performed gradient descent on hyper-parameters by propagating gradients through the entire training procedure. While this approach can be used to tune the parameter norms, it has a significant computational cost, can lead to overfitting, and gradients taken through optimization can be very poorly conditioned [40]. MAML [15] is a related meta-algorithm that searches for weight initializations that are beneficial to few-shot learning. While it also produces a weight initialization, it is not mutually exclusive with the proposed approach due to their different focus. For example, MetaInit could be used as the initialization for MAML.
As discussed briefly above, there has been a long line of significant work to improve optimizers to be robust to poor initialization. Several examples of this include momentum [48, 42], RMSProp, ADAM and ADAMAX [30], ADADELTA [61], and NADAM [13]. Optimizers that exploit curvature information, such as [14], have been proposed, but they can negatively affect generalization [54]. Furthermore, recovering from a bad initialization using regular supervised training steps could still negatively affect the generalization of the model. A form of meta-learning can also be used to improve optimization, as in [6], which tunes the learning rate using hypergradient descent. More recently, a natural extension of this work has focused on learning the structure of the optimizer itself using meta-learning [39]. Finally, a series of papers has used population based training [29] to identify training schedules. These powerful approaches still require a reasonable starting point for learning and, as with MAML, could be paired well with MetaInit.
As described earlier, a very successful approach to improving training robustness is adding normalization. The most successful of these approaches is arguably BatchNorm [27].
However, as first explained in [59] and supported in Section 4.2, BatchNorm does not apply well to all architectures. Other normalization methods such as LayerNorm [4] or GroupNorm [56] likewise appear to ameliorate training in some circumstances but have a deleterious effect in other cases. Finally, techniques like gradient clipping [44] are extremely useful, but require a reasonable starting point for learning. As discussed above, we observe that gradient clipping works well in combination with MetaInit.

7 Conclusion

We have proposed a novel method to automatically tune the initial parameter norms under the hypothesis that good initializations reduce second order effects. Our results demonstrate that this approach is useful in practice and can automatically recover from a class of bad initializations across several architectures.

Acknowledgements

We would like to thank David Grangier, Ben Poole, and Jascha Sohl-Dickstein for helpful discussions.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

[2] Michael David Abràmoff, Yiyue Lou, Ali Erginay, Warren Clarida, Ryan Amelon, James C Folk, and Meindert Niemeijer. Improved automated detection of diabetic retinopathy on a publicly available dataset through integration of deep learning. Investigative Ophthalmology & Visual Science, 57(13):5200–5206, 2016.

[3] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.
arXiv preprint arXiv:1607.06450, 2016.

[5] David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. The shattered gradients problem: If resnets are the answer, then what is the question? In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 342–350. JMLR.org, 2017.

[6] Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. arXiv preprint arXiv:1703.04782, 2017.

[7] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.

[8] Shan Carter, Zan Armstrong, Ludwig Schubert, Ian Johnson, and Chris Olah. Activation atlas. Distill, 2019. https://distill.pub/2019/activation-atlas.

[9] Minmin Chen, Jeffrey Pennington, and Samuel S Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. arXiv preprint arXiv:1806.05394, 2018.

[10] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

[11] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 933–941. JMLR.org, 2017.

[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[13] Timothy Dozat. Incorporating Nesterov momentum into Adam. http://cs229.stanford.edu/proj2015/054_report.pdf, 2016.

[14] John Duchi, Elad Hazan, and Yoram Singer.
Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[15] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1126–1135. JMLR.org, 2017.

[16] Roy Frostig, Peter Hawkins, Matthew Johnson, Chris Leary, and Dougal Maclaurin. JAX: Autograd and XLA. www.github.com/google/jax, 2018.

[17] Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d'Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. arXiv preprint arXiv:1901.01608, 2019.

[18] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via Hessian eigenvalue density. arXiv preprint arXiv:1901.10159, 2019.

[19] Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S Schoenholz, Ed H Chi, and Jeffrey Pennington. Dynamical isometry and a mean field theory of LSTMs and GRUs. arXiv preprint arXiv:1901.08987, 2019.

[20] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[21] Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.

[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.
In Proceedings of the IEEE international conference\non computer vision, pages 1026\u20131034, 2015.\n\n[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks.\n\nIn European conference on computer vision, pages 630\u2013645. Springer, 2016.\n\n[25] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, J\u00fcrgen Schmidhuber, et al. Gradient \ufb02ow in recurrent\n\nnets: the dif\ufb01culty of learning long-term dependencies.\n\n[26] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735\u20131780,\n\n1997.\n\n[27] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing\n\ninternal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\n[28] Arthur Jacot, Franck Gabriel, and Cl\u00e9ment Hongler. Neural tangent kernel: Convergence and generalization\n\nin neural networks. In Advances in neural information processing systems, pages 8571\u20138580, 2018.\n\n[29] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi,\nOriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando, and Koray Kavukcuoglu.\nPopulation based training of neural networks. CoRR, abs/1711.09846, 2017.\n\n[30] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\n\narXiv:1412.6980, 2014.\n\n[31] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi\ufb01cation with deep convolutional\n\nneural networks. In Advances in neural information processing systems, pages 1097\u20131105, 2012.\n\n[32] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015.\n\n[33] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to\n\ndocument recognition. 
Proceedings of the IEEE, 86(11):2278–2324, 1998.

[34] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.

[35] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

[36] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.

[37] James Martens and Roger B. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. CoRR, abs/1503.05671, 2015.

[38] James Martens and Ilya Sutskever. Training deep and recurrent networks with Hessian-free optimization. In Neural Networks: Tricks of the Trade, pages 479–535. Springer, 2012.

[39] Luke Metz, Niru Maheswaranathan, Brian Cheung, and Jascha Sohl-Dickstein. Meta-learning update rules for unsupervised representation learning. 2019.

[40] Luke Metz, Niru Maheswaranathan, Jeremy Nixon, Daniel Freeman, and Jascha Sohl-Dickstein. Understanding and correcting pathologies in the training of learned optimizers. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4556–4565, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

[41] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.

[42] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). In Doklady AN USSR, volume 269, pages 543–547, 1983.

[43] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization.
Distill, 2017.\n\nhttps://distill.pub/2017/feature-visualization.\n\n[44] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the dif\ufb01culty of training recurrent neural\n\nnetworks, 2012.\n\n[45] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming\n\nLin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.\n\n[46] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning\nthrough dynamical isometry: theory and practice. In Advances in neural information processing systems,\npages 4785\u20134795, 2017.\n\n[47] George Philipp and Jaime G Carbonell. The nonlinearity coef\ufb01cient-predicting generalization in deep\n\nneural networks. arXiv preprint arXiv:1806.00179, 2018.\n\n[48] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145\u2013\n\n151, 1999.\n\n[49] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint\n\narXiv:1710.05941, 2017.\n\n[50] Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate\ntraining of deep neural networks. In Advances in Neural Information Processing Systems, pages 901\u2013909,\n2016.\n\n[51] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of\n\nlearning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.\n\n[52] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information\n\npropagation. arXiv preprint arXiv:1611.01232, 2016.\n\n[53] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz\nKaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing\nsystems, pages 5998\u20136008, 2017.\n\n[54] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. 
The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148–4158, 2017.

[55] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

[56] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.

[57] Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. arXiv preprint arXiv:1806.05393, 2018.

[58] Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in Neural Information Processing Systems, pages 7103–7114, 2017.

[59] Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S Schoenholz. A mean field theory of batch normalization. arXiv preprint arXiv:1902.08129, 2019.

[60] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[61] Matthew D. Zeiler. Adadelta: An adaptive learning rate method, 2012.

[62] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

[63] Hongyi Zhang, Yann N Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization. arXiv preprint arXiv:1901.09321, 2019.