{"title": "Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model", "book": "Advances in Neural Information Processing Systems", "page_first": 8196, "page_last": 8207, "abstract": "Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns. In this work, we study how the critical batch size changes based on properties of the optimization algorithm, including acceleration and preconditioning, through two different lenses: large scale experiments and analysis using a simple noisy quadratic model (NQM). We experimentally demonstrate that optimization algorithms that employ preconditioning, specifically Adam and K-FAC, result in much larger critical batch sizes than stochastic gradient descent with momentum. We also demonstrate that the NQM captures many of the essential features of real neural network training, despite being drastically simpler to work with. The NQM predicts our results with preconditioned optimizers, previous results with accelerated gradient descent, and other results around optimal learning rates and large batch training, making it a useful tool to generate testable predictions about neural network optimization.\nWe demonstrate empirically that the simple noisy quadratic model (NQM) displays many similarities to neural networks in terms of large-batch training. We prove analytical convergence results for the NQM model that predict such behavior and hence provide possible explanations and a better understanding for many large-batch training phenomena.", "full_text": "Which Algorithmic Choices Matter at Which Batch\n\nSizes? Insights From a Noisy Quadratic Model\n\nGuodong Zhang1,2,3\u21e4, Lala Li3, Zachary Nado3, James Martens4,\n\nSushant Sachdeva1, George E. Dahl3, Christopher J. 
Shallue3, Roger Grosse1,2\n1University of Toronto, 2Vector Institute, 3Google Research, Brain Team, 4DeepMind\n\nAbstract\n\nIncreasing the batch size is a popular way to speed up neural network training,\nbut beyond some critical batch size, larger batch sizes yield diminishing returns.\nIn this work, we study how the critical batch size changes based on properties of\nthe optimization algorithm, including acceleration, preconditioning and averaging,\nthrough two different lenses: large scale experiments, and analysis of a simple\nnoisy quadratic model (NQM). We experimentally demonstrate that optimization\nalgorithms that employ preconditioning, speci\ufb01cally Adam and K-FAC, result in\nmuch larger critical batch sizes than stochastic gradient descent with momentum.\nWe also demonstrate that the NQM captures many of the essential features of\nreal neural network training, despite being drastically simpler to work with. The\nNQM predicts our results with preconditioned optimizers and exponential moving\naverage, previous results with accelerated gradient descent, and other results around\noptimal learning rates and large batch training, making it a useful tool to generate\ntestable predictions about neural network optimization.\n\n1\n\nIntroduction\n\nIncreasing the batch size is one of the most appealing ways to accelerate neural network training\non data parallel hardware. Larger batch sizes yield better gradient estimates and, up to a point,\nreduce the number of steps required for training, which reduces the training time. The importance of\nunderstanding the bene\ufb01ts of modern parallel hardware has motivated a lot of recent work on training\nneural networks with larger batch sizes [Goyal et al., 2017, Osawa et al., 2018, McCandlish et al.,\n2018, Shallue et al., 2018]. To date, the most comprehensive empirical study of the effects of batch\nsize on neural network training is Shallue et al. 
[2018], who con\ufb01rmed that increasing the batch size\ninitially achieves perfect scaling (i.e. doubling the batch size halves the number of steps needed) up\nto a problem-dependent critical batch size, beyond which it yields diminishing returns [Balles et al.,\n2017, Goyal et al., 2017, Jastrz\u02dbebski et al., 2018, McCandlish et al., 2018]. Shallue et al. [2018] also\nprovided experimental evidence that the critical batch size depends on the optimization algorithm,\nthe network architecture, and the data set. However, their experiments only covered plain SGD,\nSGD with (heavy-ball) momentum, and SGD with Nesterov momentum, leaving open the enticing\npossibility that other optimizers might extend perfect scaling to even larger batch sizes.\nEmpirical scaling curves like those in Shallue et al. [2018] are essential for understanding the effects\nof batch size, but generating such curves, even for a single optimizer on a single task, can be very\nexpensive. On the other hand, existing theoretical analyses that attempt to analytically derive critical\nbatch sizes (e.g. Ma et al. [2018], Yin et al. [2018], Jain et al. [2018]) do not answer our questions\nabout which optimizers scale the best with batch size. They tend to make strong assumptions, produce\nparameter-dependent results that are dif\ufb01cult to apply, or are restricted to plain SGD. It would be\n\n\u21e4Work done as part of the Google Student Researcher Program. Email: gdzhang@cs.toronto.edu\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fideal to \ufb01nd a middle ground between a purely empirical investigation and theoretical analysis by\nbuilding a model of neural network optimization problems that captures the essential behavior we\nsee in real neural networks, while still being easy to understand. 
Additionally, we need to study\noptimizers beyond momentum SGD since they might provide us an approach to exploit speedups\nfrom the very largest batch sizes. In this work, we make the following contributions:\n\n1. We show that a simple noisy quadratic model (NQM) is remarkably consistent with the batch\nsize effects observed in real neural networks, while allowing us to run experiments in seconds,\nmaking it a great tool to generate testable predictions about neural network optimization.\n\n2. We show that the NQM successfully predicts that momentum should speed up training relative\n\nto plain SGD at larger batch sizes, but have no bene\ufb01t at small batch sizes.\n\n3. Through large scale experiments with Adam [Kingma and Ba, 2014] and K-FAC [Martens and\nGrosse, 2015], we con\ufb01rm that, as predicted by the NQM, preconditioning extends perfect batch\nsize scaling to larger batch sizes than are possible with momentum SGD alone. Furthermore,\nunlike momentum, preconditioning can help at small batch sizes as well.\n\n4. Lastly, we show that, as predicted by the NQM, exponential moving averages reduce the number\nof steps required for a speci\ufb01c batch size and can achieve the same acceleration with smaller\nbatch sizes, thereby saving computation.\n\n2 Related Work\n\nIn a classic paper, Bottou and Bousquet [2008] studied the asymptotics of stochastic optimization\nalgorithms and found SGD to be competitive with fancier approaches. They showed that stochastic\noptimization involves fundamentally different tradeoffs from full-batch optimization. More recently,\nseveral studies have investigated the relationship between batch size and training time for neural\nnetworks. Chen et al. [2018] studied the effect of network width on the critical batch size, and showed\nexperimentally that it depends on both the data set and network architecture. Golmant et al. 
[2018] studied how various heuristics for adjusting the learning rate as a function of batch size affect the relationship between batch size and training time. Shallue et al. [2018] conducted a comprehensive empirical study on the relationship between batch size and training time with different neural network architectures and data sets using plain SGD, heavy-ball momentum, and Nesterov momentum. Finally, McCandlish et al. [2018] used the average gradient noise over training to predict the critical batch size. All of these studies described a basic relationship between batch size and training steps to a fixed error goal, which is comprised of three regions: perfect scaling initially, then diminishing returns, and finally no benefit for all batch sizes greater than the critical batch size.

Other studies have attempted to characterize the critical batch size analytically in stochastic optimization. Under varying assumptions, Ma et al. [2018], Yin et al. [2018], Jain et al. [2018] all derived analytical notions of critical batch size, but to our knowledge, all for SGD.

Additionally, previous studies have shown that SGD and momentum SGD are equivalent for small learning rates (after appropriate rescaling), both in the continuous limit [Leen and Orr, 1994] and in discrete settings [Yuan et al., 2016]. However, they do not explain why momentum SGD (including heavy-ball and Nesterov momentum) sometimes outperforms plain SGD in mini-batch training (as observed by Kidambi et al. [2018] and Shallue et al. [2018]). Concurrently, Smith et al. [2019] showed that momentum outperforms plain SGD at large batch sizes.

Finally, a few works study averaging the iterates rather than using only the last iterate. This is a classical idea in optimization, where it is known to provide improved convergence [Polyak and Juditsky, 1992, Bach and Moulines, 2013, Dieuleveut and Bach, 2016]. However, most of these works focus on tail averaging, which requires deciding ahead of time the iteration at which to start accumulating the running average. More commonly (especially in deep learning), an exponential moving average [Martens, 2014] is preferred for its simplicity and its ability to handle non-convex landscapes. However, to our knowledge, no such analysis exists for the mini-batch setting.

3 Analysis of the Noisy Quadratic Model (NQM)

In this section, we work with a noisy quadratic model (NQM), a stochastic optimization problem whose dynamics can be simulated analytically, in order to reason about various phenomena encountered in training neural networks. In this highly simplified model, we first assume the loss function being optimized is a convex quadratic, with noisy observations of the gradient. For analytic tractability, we further assume the noise covariance is codiagonalizable with the Hessian. Because we are not interested in modeling overfitting effects, we focus on the online training setting, where the observations are drawn i.i.d. in every training iteration. Under these assumptions, we derive an analytic expression for the risk after any number of steps of SGD with a fixed step size, as well as a dynamic programming method to compute the risk following a given step size schedule.

Convex quadratics may appear an odd model for a complicated nonconvex optimization landscape. However, one obtains a convex quadratic objective by linearizing the network's function around a given weight vector and taking the second-order Taylor approximation to the loss function (assuming it is smooth and convex). Indeed, recent theoretical works [Jacot et al., 2018, Du et al., 2019, Zhang et al., 2019a] show that for wide enough networks, the weights stay close enough to the initialization for the linearized approximation to remain accurate.
Empirically, linearized approximations closely match a variety of training phenomena for large but realistic networks [Lee et al., 2019].

3.1 Problem Setup

We now introduce the noisy quadratic model [Schaul et al., 2013, Martens, 2014, Wu et al., 2018], where the true function being optimized is a convex quadratic. Because we analyze rotation-invariant and translation-invariant optimizers such as SGD and heavy-ball momentum, we assume without loss of generality that the quadratic form is diagonal, and that the optimum is at the origin. Hence, our exact cost function decomposes as a sum of scalar quadratic functions for each coordinate:

L(\theta) = \frac{1}{2}\theta^\top H\theta = \sum_{i=1}^{d} \frac{1}{2} h_i \theta_i^2 = \sum_{i=1}^{d} \ell(\theta_i).   (1)

Figure 1: Cartoon of the evolution of risk for different coordinates with and without learning rate decay.

Without loss of generality, we assume h_1 \ge h_2 \ge \dots \ge h_d. We consider a single gradient query to have the form g(\theta) = H\theta + \epsilon, where E[\epsilon] = 0 and Cov(\epsilon) = C. To reduce the variance of gradient estimation, we can average over multiple independent queries, which corresponds to "mini-batch training" in neural network optimization. We denote the averaged gradient as g_B(\theta) and its covariance Cov(g_B(\theta)) = C/B, where B is the number of queries (mini-batch size).

For analytical tractability, we make the nontrivial assumption that H and C are codiagonalizable. (Since H is diagonal, this implies that C = diag(c_1, ..., c_d).) See Section 3.5 for justification of this assumption. Under gradient descent with fixed step size \alpha, each dimension evolves independently as

\theta_i(t+1) = (1 - \alpha h_i)\,\theta_i(t) + \alpha\sqrt{c_i/B}\,\epsilon_i,   (2)

where \alpha is the learning rate and \epsilon_i is zero-mean, unit-variance i.i.d. noise. By treating \theta_i as a random variable, we immediately obtain the dynamics of its mean and variance:

E[\theta_i(t+1)] = (1 - \alpha h_i)\,E[\theta_i(t)], \qquad V[\theta_i(t+1)] = (1 - \alpha h_i)^2\,V[\theta_i(t)] + \frac{\alpha^2 c_i}{B}.   (3)

Based on eqn. (3), the expected risk after t steps in a given dimension i is

E[\ell(\theta_i(t))] = \underbrace{(1 - \alpha h_i)^{2t}}_{\text{convergence rate}} E[\ell(\theta_i(0))] + \left(1 - (1 - \alpha h_i)^{2t}\right) \underbrace{\frac{\alpha c_i}{2B(2 - \alpha h_i)}}_{\text{steady state risk}},   (4)

where we have assumed that \alpha h_i \le 2. (Note that this can be seen as a special case of the convergence result derived for convex quadratics in Martens [2014].)

Remarkably, each dimension converges exponentially to a steady state risk. Unfortunately, there is a trade-off in the sense that higher learning rates (up to 1/h_i) give faster convergence to the steady state risk, but also produce higher values of the steady state risk. The steady state risk also decreases proportionally to increases in batch size; this is important to note because in the following subsections, we will show that traditional acceleration techniques (e.g., momentum and preconditioning) help improve the convergence rate at the expense of increasing the steady state risk.
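Eqns. (3) and (4) are easy to check numerically. The sketch below is a minimal illustration in plain Python (the function names and parameter values are ours, not from the paper):

```python
def risk_closed_form(h, c, alpha, B, risk0, t):
    """Closed-form expected risk for one NQM coordinate after t SGD steps (eqn. 4).

    h: curvature h_i; c: gradient-noise variance c_i; alpha: learning rate;
    B: batch size; risk0: initial expected risk E[l(theta_i(0))].
    """
    decay = (1.0 - alpha * h) ** (2 * t)                # convergence-rate term
    steady = alpha * c / (2.0 * B * (2.0 - alpha * h))  # steady state risk
    return decay * risk0 + (1.0 - decay) * steady


def risk_by_recursion(h, c, alpha, B, mean0, var0, t):
    """Same quantity via the mean/variance recursions of eqn. (3)."""
    m, v = mean0, var0
    for _ in range(t):
        m = (1.0 - alpha * h) * m
        v = (1.0 - alpha * h) ** 2 * v + alpha ** 2 * c / B
    return 0.5 * h * (m ** 2 + v)  # E[l(theta_i)] = (h/2) * (E[theta]^2 + V[theta])
```

Running both with the same parameters gives identical values, and as t grows the risk approaches the steady state term \alpha c / (2B(2 - \alpha h)), which falls as B increases.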
Therefore, the NQM implies that momentum and preconditioning would benefit more from large-batch training compared to plain SGD, as shown in later sections.

3.2 Momentum Accelerates Training at Large Batch Sizes

Applied to the same noisy quadratic model as before, the update equations for momentum SGD are:

m_i(t+1) = \beta m_i(t) + h_i \theta_i(t) + \sqrt{c_i/B}\,\epsilon_i, \qquad \theta_i(t+1) = \theta_i(t) - \alpha m_i(t+1).

We show in the following theorem (see Appendix C for proof) that momentum SGD performs similarly to plain SGD in the regime of small batch sizes but helps in the large-batch regime, which can be viewed as a near-deterministic optimization problem.

Theorem 1. Given a dimension index i, and 0 \le \beta < 1 with \beta \ne (1 - \sqrt{\alpha h_i})^2, the expected risk at time t associated with that dimension satisfies the upper bound

E[\ell(\theta_i(t))] \le \left( \frac{(r_1^{t+1} - r_2^{t+1}) - \beta (r_1^t - r_2^t)}{r_1 - r_2} \right)^2 E[\ell(\theta_i(0))] + \frac{(1+\beta)\,\alpha c_i}{2B(2 + 2\beta - \alpha h_i)(1 - \beta)},   (5)

where r_1 and r_2 (with r_1 \ge r_2) are the two roots of the quadratic equation

x^2 - (1 - \alpha h_i + \beta)x + \beta = 0.   (6)

As with plain SGD (c.f. eqn. (4)), the loss associated with each dimension can be expressed as the sum of two terms, where the first one decays exponentially and corresponds to the behavior of the deterministic version of the algorithm, and the second remains constant.

Following the existing treatment of the deterministic version of the algorithm [Chiang, 1974, Qian, 1999, Yang et al., 2018, Goh, 2017], we divide our analysis into two cases: overdamping and underdamping. In the case of overdamping, where \beta < (1 - \sqrt{\alpha h_i})^2, both roots r_1 and r_2 are real, and therefore the convergence rate is determined by the larger one (i.e. r_1), which has the value

r_1 = \frac{1 - \alpha h_i + \beta + \sqrt{(1 - \beta)^2 - 2(1 + \beta)\alpha h_i + \alpha^2 h_i^2}}{2}.   (7)

With a fixed learning rate, the steady state risk will be constant, and the best achievable expected risk will be lower bounded by it. Thus, to achieve a certain target loss we must either drive the learning rate down, or the batch size up. Assuming a small batch size and a low target risk, we are forced to pick a small learning rate, in which case one can show^2 that r_1 \approx 1 - \alpha h / (1 - \beta). In Figure 2 we plot the convergence rate as a function of \beta, and we indeed observe that the convergence rate closely matches 1 - \alpha h / (1 - \beta), assuming a relatively small learning rate. We further note that the convergence rate and steady state risk in this regime are the same as the ones in plain SGD (eqn. (4)), except that they use an "effective learning rate" of \alpha / (1 - \beta). To help validate these predictions, in Appendix E.3 we provide a comparison of momentum SGD with plain SGD using the effective learning rate.

In the case of underdamping, where \beta > (1 - \sqrt{\alpha h_i})^2, both r_1 and r_2 will be complex and have norm \sqrt{\beta}. We note that the optimal \beta should be equal to or smaller than (1 - \sqrt{\alpha h_d})^2, since otherwise all dimensions are underdamped, and we can easily improve the convergence rate and steady state risk by reducing \beta.

Next we observe that the convergence of the total loss will eventually be dominated by the slowest converging dimension (which corresponds to the smallest curvature h_d), and this will be in the overdamping regime as argued above. By our analysis of the overdamping case, we can achieve the same convergence rate for this dimension by simply replacing the learning rate \alpha in the bound for plain SGD (eqn. (4)) with the effective learning rate \alpha / (1 - \beta).

So while momentum gives no long-term training acceleration for very low fixed learning rates (which we are forced to use when the batch size is small), we note that it can help in large-batch training. With \beta > 0, the steady state risk is roughly amplified by a factor of 1/(1 - \beta), and we note that the steady state risk also decreases proportionally to increases in batch size. Therefore, we expect momentum SGD to exhibit perfect scaling up to larger batch sizes than plain SGD.

Figure 2: Convergence rate and steady state risk (SSK) as a function of momentum \beta for a single dimension with \alpha h = 0.0005 and batch size B = 1.

^2 To see this, note that the term inside the square root of eqn. (7) can be written as ((1 - \beta) - (1 + \beta)\alpha h_i / (1 - \beta))^2 + O(\alpha^2 h_i^2). Dropping the O(\alpha^2 h_i^2) term and simplifying gives the claimed expression for r_1.

3.3 Preconditioning Further Extends Perfect Scaling to Larger Batch Sizes

Many optimizers, such as Adam and K-FAC, can be viewed as preconditioned gradient descent methods. In each update, the gradient is rescaled by a PSD matrix P^{-1}, called the preconditioner:

\theta(t+1) = \theta(t) - \alpha P^{-1} [H\theta + \epsilon].   (8)

In lieu of trying to construct noisy quadratic analogues of particular optimizers, we analyze preconditioners of the form P = H^p with 0 \le p \le 1. Note that P remains fixed throughout training since the Hessian H is constant in the NQM. We can recover standard SGD by setting p = 0.

Conveniently, for our NQM, the dynamics of preconditioned SGD are equivalent to the SGD dynamics in an NQM with Hessian \tilde{H} = P^{-1/2} H P^{-1/2} and gradient covariance \tilde{C} = P^{-1/2} C P^{-1/2}. Hence, the dynamics can be simulated using eqn.
(4), exactly like the non-preconditioned case. We immediately obtain the following bound on the risk:

E[L(\theta(t))] \le \sum_{i=1}^{d} (1 - \alpha h_i^{1-p})^{2t}\, E[\ell(\theta_i(0))] + \sum_{i=1}^{d} \frac{\alpha c_i h_i^{-p}}{2B(2 - \alpha h_i^{1-p})}.   (9)

To qualitatively understand the effect of preconditioning, first consider the first term in eqn. (9). The convergence of this term resembles that of gradient descent on a deterministic quadratic, which (with optimal \alpha \approx 2/\tilde{h}_1) converges exponentially at a rate of approximately 2/\tilde{\kappa}, where \tilde{\kappa} = \tilde{h}_1/\tilde{h}_d is the condition number of the transformed problem. Since \tilde{\kappa} = \kappa^{1-p}, this implies a factor of \kappa^p improvement in the rate of convergence. Hence, for near-deterministic objectives where the first term dominates, values of p closer to 1 correspond to better preconditioners, and result in much faster convergence. Unfortunately, there is no free lunch, as larger values of p will also increase the second term (steady state risk). Assuming an ill-conditioned loss surface (\kappa \gg 1), the steady state risk of each dimension becomes

\frac{1}{2B} \cdot \frac{\alpha c_i h_i^{-p}}{2 - \alpha h_i^{1-p}} \approx \frac{c_i}{2B h_1} \cdot \frac{(h_i/h_1)^{-p}}{1 - (h_i/h_1)^{1-p}},   (10)

which is a monotonically increasing function with respect to p. Even without this amplification effect, the steady state risk will eventually become the limiting factor in the minimization of the expected risk. One way to reduce the steady state risk, apart from using Polyak averaging [Polyak and Juditsky, 1992] or decreasing the learning rate (which will harm the rate of convergence), is to increase the batch size.
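Since preconditioned SGD on the NQM is just SGD on a transformed NQM, the per-coordinate bound of eqn. (9) can be simulated in a few lines. The sketch below is illustrative only: the spectrum h_i = c_i = 1/i follows Section 3.5, but the values of d, \alpha, and the target risk are our arbitrary choices.

```python
import numpy as np

def steps_to_target(p, B, d=100, alpha=0.5, target=0.05, max_steps=10**5):
    """Steps for preconditioned SGD (P = H^p) to bring the eqn.-(9) risk bound
    below `target`, on an NQM with h_i = c_i = 1/i and theta(0) ~ N(0, I)
    (so E[l_i(0)] = h_i / 2)."""
    h = 1.0 / np.arange(1, d + 1)
    c = h.copy()
    init_risk = 0.5 * h
    decay_base = (1.0 - alpha * h ** (1.0 - p)) ** 2     # per-step contraction
    steady = alpha * c * h ** (-p) / (2.0 * B * (2.0 - alpha * h ** (1.0 - p)))
    decay = np.ones_like(h)
    for t in range(1, max_steps + 1):
        decay *= decay_base
        risk = np.sum(decay * init_risk + (1.0 - decay) * steady)
        if risk <= target:
            return t
    return max_steps
```

With a large batch (e.g. B = 10^4), a stronger preconditioner reaches the target in far fewer steps (steps_to_target(1.0, 10_000) is much smaller than steps_to_target(0.0, 10_000)), while at small B the steady state term caps what any fixed p can achieve.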
This suggests that the benefits of using stronger preconditioners will be more clearly observed for larger batch sizes, an effect that we empirically demonstrate in later sections.

3.4 Exponential Moving Average Reduces Steady State Risk

Following the same procedure as in the previous two sections, we analyze exponential moving averages (EMA) on our NQM. The update rule of EMA can be written as

\theta(t+1) = \theta(t) - \alpha [H\theta + \epsilon], \qquad \tilde{\theta}(t+1) = \gamma \tilde{\theta}(t) + (1 - \gamma)\theta(t+1).   (11)

The averaged iterate \tilde{\theta} is used at test time. The computational overhead is minimal (storing an additional copy of the parameters, plus some cheap arithmetic operations). We now show that EMA outperforms plain SGD by reducing the steady state risk term.

Theorem 2. Given a dimension index i, and 0 \le \gamma < 1, the expected risk at time t associated with that dimension satisfies the upper bound

E[\ell(\tilde{\theta}_i(t))] \le \left( \frac{(r_1^{t+1} - r_2^{t+1}) - \gamma(1 - \alpha h_i)(r_1^t - r_2^t)}{r_1 - r_2} \right)^2 E[\ell(\theta_i(0))] + \frac{\alpha c_i}{2B(2 - \alpha h_i)} \cdot \frac{(1 - \gamma)(1 + \gamma(1 - \alpha h_i))}{(1 + \gamma)(1 - \gamma(1 - \alpha h_i))},   (12)

where r_1 = 1 - \alpha h_i and r_2 = \gamma.

By properly choosing an averaging coefficient \gamma < 1 - \alpha h_d such that r_1 > r_2, one can show that EMA reduces the steady state risk without sacrificing the convergence rate. To see this, we note that the final factor of eqn. (12) is strictly less than 1 given the fact that 1 - \alpha h_i < 1, while the remaining factor is exactly the steady state risk of plain SGD.

Figure 3: (a) Momentum and preconditioning: steps required to reach the target loss as a function of batch size under different preconditioning powers. Solid lines are momentum SGD while dashed lines are plain SGD; the black dashed line is the information theoretic lower bound. (b) Fixed LR vs. schedules: effect of learning rate decay. The solid lines use the optimized piecewise constant schedules, which are shown in (c) for power 0; the dashed curves in (b) are plain SGD for comparison. We observe that learning rate schedules close most of the gap between the fixed learning rate performance and the information theoretic lower bound. (c) Optimized LR schedules.

3.5 Choice of H and C

We've found that the qualitative behavior of optimizers in our NQM depends on the choices of H and C. Therefore, we choose matrices motivated by theoretical and empirical considerations about neural net training. First, we set the diagonal entries of H to be {1/i}_{i=1}^{d} for some integer d, giving a condition number of d. This closely matches the estimated eigenspectrum of the Hessian of a convolutional network (see Figure 9 and Appendix E.4), and is also consistent with recent work finding heavy tailed eigenspectra of neural network Hessians [Ubaru et al., 2017, Ghorbani et al., 2019]. We choose d = 10^4, which approximately matches the condition number of the K-FAC Hessian approximation for ResNet8. (Qualitative behaviors were consistent for a wide range of d.)

We also set C = H (a nontrivial assumption). This was motivated by theoretical arguments that, under the assumption that the implicit conditional distribution over the network's output is close to the conditional distribution of targets from the training distribution, the Hessian closely matches the gradient covariance in neural network training [Martens, 2014]. Empirically, this relationship appears to hold tightly for a convolutional network and moderately well for a transformer (see Appendix E.2).

3.6 Information Theoretic Lower Bound

Since our NQM assumes the infinite data (online optimization) setting, it's instructive to compare the performance of optimizers against an information theoretic lower bound.
Specifically, under the assumption that H = C, the NQM is equivalent to maximum likelihood estimation of the mean vector for a multivariate Gaussian distribution with covariance H^{-1}. Hence, the risk obtained by any optimizer can be bounded below by the risk of the maximum likelihood estimator for the Gaussian, which is d/(2N), where d is the dimension and N is the total number of training examples visited. We indicate this bound with a dashed black line in our plots.

3.7 Noisy Quadratic Experiments

In this section, we simulate noisy quadratic optimization using the closed-form dynamics. Our aim is to formulate hypotheses for how different optimizers would behave for neural network optimization. Our main metric is the number of steps required to achieve a target risk. For efficiency, rather than explicitly representing all the eigenvalues of H, we quantize them into 100 bins and count the number of eigenvalues in each bin. Unless otherwise specified, we initialize \theta(0) \sim N(0, I) and use a target risk of 0.01. (The results don't seem to be sensitive to either the initial variance or the target risk; some results with varying target risk thresholds are shown in Appendix E.5.)

3.7.1 Effect of Momentum, Preconditioning and Exponential Moving Average

We first experiment with momentum and varying preconditioner powers on our NQM. We treat both the (fixed) learning rate \alpha and the momentum decay parameter \beta as hyperparameters, which we tune using a fine-grained grid search.

Consistent with the empirical results of Shallue et al.
[2018], each optimizer shows two distinct regimes: a small-batch (stochastic) regime with perfect linear scaling, and a large-batch (deterministic) regime insensitive to batch size. We call the phase transition between these regimes the critical batch size. Consistent with the analysis of Section 3.2 and the observations of Smith et al. [2018], Shallue et al. [2018], Kidambi et al. [2018], the performance of momentum-based optimizers matches that of the plain SGD methods in the small-batch regime, but momentum increases the critical batch size and gives substantial speedups in the large batch regime. Preconditioning also increases the critical batch size and gives substantial speedups in the large batch regime, but interestingly, also improves performance by a small constant factor even for very small batches. Combining momentum with preconditioning extends both of these trends.

We next experiment with EMA and varying preconditioning powers on our NQM. Following the same procedure as before, we tune both the learning rate \alpha and the averaging coefficient \gamma using grid search. As expected, EMA reduces the number of steps required, especially for plain SGD with preconditioning power 0. Another interesting observation is that EMA becomes redundant in the large batch (near-deterministic) regime, since the main effect of EMA is reducing the steady-state risk, which can also be done by increasing the batch size. This implies that EMA would reduce the critical batch size and therefore achieve the same amount of acceleration with less computation.

Figure 4: Effects of exponential moving average (EMA). Solid lines are SGD with EMA while dashed lines are plain SGD.

Data Set | Size   | Model                 | LR           | Remarks
MNIST    | 55,000 | Simple CNN            | Constant     | Same as Shallue et al. [2018] except without dropout regularization.
FMNIST   | 55,000 | Simple CNN            | Constant     | Same as Shallue et al. [2018] except without dropout regularization.
CIFAR10  | 45,000 | ResNet8 without BN    | Constant     | Same as Shallue et al. [2018].
CIFAR10  | 45,000 | ResNet32 with BN      | Linear Decay | Ghost batch norm is used.
CIFAR10  | 45,000 | VGG11 with BN         | Linear Decay | Ghost batch norm is used.
LM1B     | ≈30M   | Two-layer Transformer | Constant     | Shallow model in Shallue et al. [2018].
Table 1: Data sets and models used in our experiments. See Appendix F.2 for full details.

3.7.2 Optimal Learning Rate and Decay Scheme

In the NQM, we can calculate the optimal constant learning rate given a specific batch size. Figure 14 shows the optimal learning rate as a function of batch size for a target risk of 0.01. Notably, the optimal learning rate of plain (preconditioned) SGD (Figure 14a) scales linearly with batch size before it hits the critical batch size, matching the scheme used in Goyal et al. [2017]. The linear scaling also holds for the effective learning rate of momentum SGD. In the small batch regime, the optimal effective learning rate for momentum SGD matches the optimal plain SGD learning rate, suggesting that momentum and learning rate are interchangeable in the small batch regime.

While a fixed learning rate often works well for simple problems, good performance on the ImageNet benchmark [Russakovsky et al., 2015] requires a carefully tuned schedule. Here we explicitly optimize a piecewise constant learning rate schedule for SGD (with 50 pieces), in terms of the number of steps to reach the loss threshold.^3 In Figure 3b, we show that optimized learning rate schedules help significantly in the small batch regime, consistent with the analysis in Wu et al. [2018].
We observe the same linear scaling as with fixed-learning-rate SGD, but with a better constant factor. In fact, optimized schedules nearly achieve the information theoretic optimum. However, learning rate schedules do not improve at all over fixed learning rates in the large batch regime. Figure 3c shows optimized schedules for different batch sizes; interestingly, they maintain a large learning rate throughout training followed by a roughly exponential decay, consistent with commonly used neural network training schedules. Additionally, even though the different batch sizes start with the same learning rate, their final learning rates at the end of training scale linearly with batch size (see Figure 15 in Appendix E.7).

^3 For a given schedule and number of time steps, we obtain the exact risk using dynamic programming with eqn. (3). For stability, the learning rates are constrained to be at most 2/h_1. For a fixed number of time steps, we minimize this risk using BFGS. We determine the optimal number of time steps using binary search.

Figure 5: Empirical relationship between batch size and steps to result, for (a) Simple CNN on MNIST, (b) Simple CNN on Fashion MNIST, (c) ResNet8 on CIFAR10, (d) VGG11 on CIFAR10, (e) ResNet32 on CIFAR10, and (f) Transformer on LM1B. Key observations: 1) momentum SGD has no benefit over plain SGD at small batch sizes, but extends the perfect scaling to larger batch sizes; 2) preconditioning also extends perfect scaling to larger batch sizes, i.e. K-FAC > Adam > momentum SGD. This is most noticeable in the Transformer model; 3) preconditioning (particularly K-FAC) reduces the number of steps needed to reach the target even for small batch sizes.
All of these agree with the predictions of the NQM.

4 Neural Network Experiments

We investigated whether the predictions made by the NQM hold in practice by running experiments with five neural network architectures across three image classification tasks and one language modeling task (see Table 1). For each model and task, we compared a range of optimizers: SGD, momentum SGD, Adam (with and without momentum), and K-FAC (with and without momentum). For K-FAC, preconditioning is applied before momentum. See Appendix F for more details.

The primary quantity we measured is the number of steps required to reach a target accuracy (for image classification tasks) or cross entropy (for language modeling). Unless otherwise specified, we measured steps to target on the validation set. We chose the target metric values based on an initial set of experiments with practical computational budgets. For each model, task, optimizer, and batch size, we independently tuned the learning rate α, the parameters governing the learning rate schedule (where applicable), and optimizer-specific metaparameters (see Appendix F.4). We manually chose the search spaces based on our initial experiments, and we verified after each experiment that the optimal metaparameter values were far from the search space boundaries. We used quasi-random search [Bousquet et al., 2017] to tune the metaparameters with fixed budgets of non-divergent4 trials (100 for Simple CNN, ResNet8, and Transformer, and 200 for ResNet32 and VGG11). We chose the trial that reached the target metric value in the fewest steps.

4.1 Critical Batch Size Depends on the Optimizer

Figure 5 shows the relationship between batch size and steps to target for each model, task, and optimizer.
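Steps-to-target curves of this kind are often summarized by a single critical batch size: the point where the curve departs from perfect inverse scaling. Here is a minimal sketch of one way to extract such an estimate; the closed-form scaling law, the parameter values, and the slack factor of 2 are our own illustrative assumptions, not the paper's procedure:

```python
import numpy as np

def steps_to_target(batch_size, s_min=1000.0, b_crit=512.0):
    """Toy steps-to-target curve (illustrative form): inverse scaling
    for small batches, leveling off at s_min for very large ones."""
    return s_min * (1.0 + b_crit / batch_size)

def estimate_critical_batch(batches, steps, slack=2.0):
    """Smallest batch size whose steps-to-target exceeds `slack` times
    what perfect halving from the smallest batch would predict."""
    perfect = steps[0] * batches[0] / batches  # ideal inverse scaling
    for b, s, p in zip(batches, steps, perfect):
        if s > slack * p:
            return b
    return batches[-1]

batches = np.array([2.0 ** k for k in range(4, 15)])  # 16 .. 16384
steps = steps_to_target(batches)
b_crit = estimate_critical_batch(batches, steps)
```

With this slack factor, the estimate recovers the true knee of the toy curve up to a factor of about two, which is why any reported "critical batch size" depends on the exact departure criterion chosen.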
In each case, as the batch size grows, there is an initial period of perfect scaling where\ndoubling the batch size halves the steps to target, but once the batch size exceeds a problem-dependent\ncritical batch size, there are rapidly diminishing returns, matching the results of [Goyal et al., 2017,\nMcCandlish et al., 2018, Shallue et al., 2018]. K-FAC has the largest critical batch size in all\ncases, highlighting the usefulness of preconditioning. Momentum SGD extends perfect scaling to\nlarger batch sizes than plain SGD, but for batch sizes smaller than the plain SGD critical batch size,\nmomentum SGD requires as many steps as plain SGD to reach the target. This is consistent with\nboth the empirical results of Shallue et al. [2018] and our NQM simulations. By contrast, Adam and\nK-FAC can reduce the number of steps needed to reach the target compared to plain SGD even for the\nsmallest batch sizes, although neither optimizer does so in all cases. Finally, we see some evidence\nthat the bene\ufb01t of momentum diminishes with preconditioning (Figures 5a and 5b), as predicted by\nour NQM simulations, although we do not see this in all cases (e.g. 
Figure 5c and 5f).

4 We discarded trials with a divergent training loss, which occurred when the learning rate was too high.

4.2 Exponential Moving Average Improves Convergence with Minimal Computation Cost

To verify the NQM's predictions about the exponential moving average (EMA), we ran experiments comparing EMA with plain SGD, following the same protocol as Figure 5, and report the results in Figure 6. As expected, the results on real neural networks closely match our predictions based on the NQM analysis. In particular, SGD with EMA reaches the same target in fewer steps than plain SGD at small batch sizes, though the benefit of EMA diminishes at large batch sizes. We also note that EMA leads to smaller critical batch sizes and achieves the same acceleration with less computation.

Figure 6: Steps to training accuracy versus batch size. Left: ResNet8 on CIFAR10; Right: Simple CNN on MNIST.

4.3 Optimal Learning Rate

The NQM predicts that the optimal constant learning rate for plain SGD (or effective learning rate for momentum SGD) scales linearly with batch size initially, and then levels off after a certain batch size. Figure 7 shows the empirical optimal (effective) learning rate as a function of batch size for the simple CNN on MNIST and ResNet8 on CIFAR10. For small batch sizes, the optimal learning rate of plain SGD appears to match the optimal effective learning rate of momentum SGD. However, after a certain batch size, the optimal learning rate for plain SGD saturates while the optimal effective learning rate of momentum SGD keeps increasing. Interestingly, plain SGD and momentum SGD appear to deviate at the same batch size in both the optimal effective learning rate and steps-to-target plots (Figures 5 and 7).

Figure 7: Optimal learning rates for plain SGD and momentum SGD. Left: Simple CNN on MNIST; Right: ResNet8 on CIFAR10.

4.4 Steps to Target on the Training Set

Figure 8 shows the empirical relationship between batch size and steps to target, measured on the training set, for ResNet8 and ResNet32 on CIFAR10. For ResNet8, the curves are almost identical to those using validation accuracy (Figure 5c), but for ResNet32, the gaps between different optimizers are much smaller than in Figure 5e, and the effects of momentum and preconditioning appear less significant. Nevertheless, the qualitative differences between optimizers are consistent with the validation set measurements.

Figure 8: Steps to training accuracy versus batch size on CIFAR10. Left: ResNet8; Right: ResNet32.

5 Conclusion

In this work, we analyzed the interactions between the batch size and the optimization algorithm from two perspectives: experiments with real neural networks, and a noisy quadratic model with parameters chosen based on empirical observations about neural networks. Despite its simplicity, the noisy quadratic model agrees remarkably well with a variety of neural network training phenomena, including learning rate scaling, critical batch sizes, and the effects of momentum, preconditioning, and averaging. More importantly, the noisy quadratic model allows us to run experiments in seconds, while it can take weeks, or even months, to conduct careful large-scale experiments with real neural networks.
Therefore, the noisy quadratic model is a convenient and powerful way to quickly formulate testable predictions about neural network optimization.

Acknowledgements

RG acknowledges support from the CIFAR Canadian AI Chairs program and the Ontario MRIS Early Researcher Award.

References

Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using Kronecker-factored approximations. In International Conference on Learning Representations, 2017.

Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773–781, 2013.

Juhan Bae, Guodong Zhang, and Roger Grosse. Eigenvalue corrected noisy natural gradient. In Workshop on Bayesian Deep Learning, Advances in Neural Information Processing Systems, 2018.

Lukas Balles, Javier Romero, and Philipp Hennig. Coupling adaptive batch sizes with learning rates. In Conference on Uncertainty in Artificial Intelligence (UAI) 2017. AUAI Press, 2017.

Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.

Olivier Bousquet, Sylvain Gelly, Karol Kurach, Olivier Teytaud, and Damien Vincent. Critical hyper-parameters: No random, no cry.
arXiv preprint arXiv:1706.03200, 2017.

Lingjiao Chen, Hongyi Wang, Jinman Zhao, Dimitris Papailiopoulos, and Paraschos Koutris. The effect of network width on the performance of large-batch training. In Advances in Neural Information Processing Systems, pages 9302–9309, 2018.

A. C. Chiang. Fundamental Methods of Mathematical Economics. International student edition. McGraw-Hill, 1974. ISBN 9780070107809.

Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016.

Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1eK3i09YQ.

Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. Fast approximate natural gradient descent in a Kronecker-factored eigenbasis. In Advances in Neural Information Processing Systems, pages 9550–9560, 2018.

Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via Hessian eigenvalue density. In Proceedings of the 36th International Conference on Machine Learning, pages 2232–2241, 2019.

Gabriel Goh. Why momentum really works. Distill, 2(4):e6, 2017.

Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W. Mahoney, and Joseph Gonzalez. On the computational inefficiency of large batch sizes for stochastic gradient descent. arXiv preprint arXiv:1811.12941, 2018.

I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour.
arXiv preprint arXiv:1706.02677, 2017.

Roger Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, pages 573–582, 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731–1741, 2017.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.

Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification. Journal of Machine Learning Research, 18(223):1–42, 2018.

Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in SGD. In International Conference on Artificial Neural Networks, 2018.

Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham Kakade. On the insufficiency of existing momentum schemes for stochastic optimization. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.
In International Conference on Learning Representations, 2015.

Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.

Todd K. Leen and Genevieve B. Orr. Optimal stochastic search and adaptive momentum. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 477–484. Morgan-Kaufmann, 1994. URL http://papers.nips.cc/paper/772-optimal-stochastic-search-and-adaptive-momentum.pdf.

Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In International Conference on Machine Learning, pages 3331–3340, 2018.

James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.

Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162, 2018.

Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Second-order optimization method for large mini-batch: Training ResNet-50 on ImageNet in 35 epochs. arXiv preprint arXiv:1811.12019, 2018.

Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

Ning Qian. On the momentum term in gradient descent learning algorithms.
Neural Networks, 12(1):145–151, 1999.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the Hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.

Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. In International Conference on Machine Learning, pages 343–351, 2013.

Nicol N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002.

Christopher J. Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don't decay the learning rate, increase the batch size. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1Yy1BxCZ.

Samuel L. Smith, Erich Elsen, and Soham De. Momentum enables large batch training. In Theoretical Physics for Deep Learning Workshop, International Conference on Machine Learning, 2019.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

Shashanka Ubaru, Jie Chen, and Yousef Saad. Fast estimation of tr(f(A)) via stochastic Lanczos quadrature.
SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099, 2017.

Twan van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. EigenDamage: Structured pruning in the Kronecker-factored eigenbasis. In Proceedings of the 36th International Conference on Machine Learning, pages 6566–6575, 2019.

Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse. Understanding short-horizon bias in stochastic meta-optimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1MczcgR-.

Lin Yang, Raman Arora, Tuo Zhao, et al. The physical systems behind optimization algorithms. In Advances in Neural Information Processing Systems, pages 4372–4381, 2018.

Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. Gradient diversity: a key ingredient for scalable distributed learning. In International Conference on Artificial Intelligence and Statistics, pages 1998–2007, 2018.

Kun Yuan, Bicheng Ying, and Ali H. Sayed. On the influence of momentum acceleration on online learning. Journal of Machine Learning Research, 17(192):1–66, 2016. URL http://jmlr.org/papers/v17/16-157.html.

Guodong Zhang, James Martens, and Roger Grosse. Fast convergence of natural gradient descent for overparameterized neural networks. arXiv preprint arXiv:1905.10961, 2019a.

Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization. In International Conference on Learning Representations, 2019b.
URL https://openreview.net/forum?id=B1lz-3Rct7.