L4: Practical loss-based stepsize adaptation for deep learning

Michal Rolínek and Georg Martius

Max-Planck-Institute for Intelligent Systems, Tübingen, Germany
michal.rolinek@tuebingen.mpg.de and georg.martius@tuebingen.mpg.de

Published in Advances in Neural Information Processing Systems, pages 6433–6443.

Abstract

We propose a stepsize adaptation scheme for stochastic gradient descent. It operates directly with the loss function and rescales the gradient in order to make fixed predicted progress on the loss. We demonstrate its capabilities by conclusively improving the performance of the Adam and Momentum optimizers. The enhanced optimizers with default hyperparameters consistently outperform their constant-stepsize counterparts, even the best ones, without a measurable increase in computational cost. The performance is validated on multiple architectures including dense nets, CNNs, ResNets, and the recurrent Differential Neural Computer, on classical datasets MNIST, Fashion MNIST, CIFAR-10, and others.

1 Introduction

Stochastic gradient methods are the driving force behind the recent boom of deep learning.
As a result, the demand for practical efficiency as well as for theoretical understanding has never been stronger. Naturally, this has inspired a lot of research and has given rise to new and currently very popular optimization methods such as Adam [9], AdaGrad [5], or RMSProp [22], which serve as competitive alternatives to classical stochastic gradient descent (SGD).

However, the current situation still causes huge overhead in implementations. In order to extract the best performance, one is expected to choose the right optimizer, finely tune its hyperparameters (sometimes multiple), often also handcraft a specific stepsize adaptation scheme, and finally combine all of this with a suitable regularization strategy. All of this is based mostly on intuition and experience.

If we put aside the regularization aspects, the holy grail for resolving the optimization issues would be a widely applicable automatic stepsize adaptation for stochastic gradients. This idea has been floating around the community for years, and different strategies have been proposed. One line of work casts the learning rate as another parameter that can be trained with gradient descent (see [2], also for a survey). Another approach makes use of (an approximation of) second-order information (see [3] and [19] as examples). Also, an interesting Bayesian approach for probabilistic line search has been proposed in [13]. Finally, another related research branch is based on the "Learning to learn" paradigm [1] (possibly using reinforcement learning, such as in [12]).

Although some of the mentioned papers claim to "effectively remove the need for learning rate tuning", this has not been observed in practice. Whether this is due to conservatism on the implementors' side or due to a lack of solid experimental evidence, we leave aside. In any case, we also take up the challenge.

Our strategy is performance oriented.
Admittedly, this also means that while our stepsize adaptation scheme makes sense intuitively (and is related to sound methods), we do not provide or claim any theoretical guarantees. Instead, we focus on strong, reproducible performance against optimized baselines across multiple different architectures, on a minimal need for tuning, and on releasing a prototype implementation that is easy to use in practice.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Our adaptation method is called Linearized Loss-based optimaL Learning-rate (L4) and it has two main features. First, it operates directly with the (currently observed) value of the loss. This eventually allows for almost independent stepsize computation in consecutive updates and consequently enables very rapid learning rate changes. Second, we separate the two roles a gradient vector typically has: it provides both a local linear approximation and the actual vector of the update step. We allow using a different gradient method for each of the two tasks.

The scheme itself is a meta-algorithm and can be combined with any stochastic gradient method. We report our results for the L4 adaptation of the Adam and Momentum SGD optimizers.

2 Method

2.1 Motivation

Stochasticity poses a severe challenge for stepsize adaptation methods. Any changes in the learning rate based on one or a few noisy loss estimates are likely to be inaccurate. In a setting where any overestimation of the learning rate can be very punishing, this leaves little maneuvering space.

The approach we take is different. We do not maintain any running value of the stepsize. Instead, at every iteration, we compute it anew with the intention of making the maximum possible progress on the (linearized) loss. This is inspired by the classical iterative Newton's method for finding roots of one-dimensional functions.
At every step, this method computes the linearization of the function at the current point and proceeds to the root of this linearization. We use analogous updates to locate the root (minimum) of the loss function.

The idea of using a linear approximation for line search is, of course, not novel, as witnessed for example by the Armijo-Wolfe line search [15]. Also, and more notably, our motivation is identical to that of Polyak's update rule [16], where the loss linearization (Eq. 2) is already proposed in a deterministic setting, as is the idea of approximating the minimum loss.

Therefore, our scheme should be thought of as an adaptation of these classical methods to the practical needs of deep learning. Also, the ideological proximity to provably correct methods is reassuring.

2.2 Algorithm

In the following section, we describe how the stepsize is chosen for a gradient update proposed by an underlying optimizer (e.g. SGD, Adam, momentum SGD). We begin with a simplified core version.

Let L(θ) be the loss function (on the current batch) depending on the parameters θ and let v be the update step provided by some standard optimizer; e.g. in the case of SGD this would be ∇θL. Throughout the paper, the loss L will be considered to be non-negative.

For now, let us assume the minimum attainable loss is L_min (see Section 2.4 for details). We consider the stepsize η needed to reach L_min (under idealized assumptions) by satisfying

    L(θ − ηv) != L_min .                                         (1)

We linearize L (around θ) and then, after denoting g = ∇θL, we solve for η:

    L(θ) − η g⊤v != L_min   ⟹   η = (L(θ) − L_min) / (g⊤v) .     (2)

First of all, note the clear separation between g, the estimator of the gradient of L, and v, the proposed update step.
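The stepsize of Eq. (2) is a one-line computation. As a minimal NumPy sketch (our illustration, not the authors' released implementation), it recovers Newton's root step on a one-dimensional quadratic:

```python
import numpy as np

def l4_stepsize(loss, l_min, g, v):
    """Stepsize of Eq. (2): eta = (L(theta) - L_min) / (g^T v).

    g is the gradient estimate used for the linearization, v the proposed
    update direction; the two may come from different gradient methods.
    """
    return (loss - l_min) / np.dot(g, v)

# Sanity check on a 1-d quadratic L(x) = x^2 with L_min = 0 and g = v = 2x:
# eta * v = x^2 / (2x)^2 * 2x = x / 2, i.e. exactly Newton's root step.
x = 4.0
g = v = np.array([2.0 * x])
eta = l4_stepsize(x**2, 0.0, g, v)
step = eta * v  # moves half-way toward the minimum at 0, as Fig. 1 depicts
```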
Moreover, it is easily seen that the final update step ηv is independent of the magnitude of v. In other words, the adaptation method only takes into account the "direction" of the proposed update. This decomposition into the gradient estimate and the update direction is the core principle behind the method and is also vital for its performance.

Figure 1: Illustration of stepsize calculation for one parameter. Given a minimum loss, the stepsize is such that the linearized loss would be minimal after one step. In practice a fraction of that stepsize is used, see Sec. 2.4.

Figure 2: Training performance on the badly conditioned regression task with κ(A) = 10^10. The mean (in log-space) training loss over 5 restarts is shown. The areas between minimal and maximal loss (after log-space smoothing) are shaded. For all algorithms the best stepsize was selected (4·10⁻⁵ and 10⁻³ for SGD and Adam respectively, and α = 0.25 for both L4 optimizers), except for the default settings without the "*". Note the logarithmic scale of the loss.

The update rule is illustrated in Fig. 1 for a quadratic (or other convex) loss. Here, we see (deceptively) that the proposed stepsize is, in fact, still conservative. However, in the multidimensional case, the minimum will not necessarily lie on the line given by the gradient. That is why in real-world problems this stepsize is far too aggressive and prone to divergence. In addition, there are the following reasons to be more conservative: the problems in deep learning are (often strongly) non-convex, and actually minimizing the currently seen batch loss is very likely not to generalize to the whole dataset.

For these reasons, we introduce a hyperparameter α which captures the fixed fraction of the stepsize (2) we take at each step.
Then the update rule becomes:

    Δθ = −ηv = −α (L(θ) − L_min) / (g⊤v) · v .                   (3)

Even though a few more hyperparameters will appear later as stability measures and regularizers, α is the main hyperparameter to consider. We observed in experiments that the relevant range is consistently α ∈ (0.10, 0.30). In comparison, for SGD the range of stable learning rates varies over multiple orders of magnitude. We chose the slightly conservative value α = 0.15 as the default setting. We report its performance on all the tasks in Section 3.

2.3 Invariance to affine transforms of the loss

Here, we offer a partial explanation of why the value of α stays in the same small relevant range even for very different problems and datasets. Interestingly, the new update equation (3) is invariant to affine loss transformations of the type

    L′ = aL + b                                                  (4)

with a, b > 0. Let us briefly verify this. The gradient of L′ will be g′ = ag and we will assume that the underlying optimizer offers the same update direction v in both cases (we have already established that its magnitude does not matter). Then we can simply write

    −α (L′(θ) − L′_min) / (g′⊤v′) · v′ = −α (aL(θ) + b − aL_min − b) / (a g⊤v) · v = −α (L(θ) − L_min) / (g⊤v) · v

and we see that the updates are the same in both cases.
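This invariance is easy to verify numerically; a small sketch (the concrete values are arbitrary, any a, b > 0 would do):

```python
import numpy as np

# Numerical check of the affine invariance of update rule (3):
# L' = a*L + b leaves the update unchanged.
rng = np.random.default_rng(0)
g = rng.normal(size=5)               # gradient estimate
v = g + 0.1 * rng.normal(size=5)     # update direction, g^T v > 0
loss, l_min, alpha = 2.0, 0.5, 0.15

def update(loss, l_min, g, v):
    return -alpha * (loss - l_min) / np.dot(g, v) * v

a, b = 3.7, 1.2                      # any a, b > 0
orig = update(loss, l_min, g, v)
scaled = update(a * loss + b, a * l_min + b, a * g, v)  # |v| is irrelevant
assert np.allclose(orig, scaled)
```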
On top of being a good sanity check for any loss-based method, we additionally believe that this simplifies problem-to-problem adaptation (also in terms of hyperparameters).

It should be noted, though, that we lose this precise invariance once we introduce some heuristic and regularization steps in the following paragraphs.

2.4 Stability measures and heuristics

L_min adaptation: We still owe an explanation of how L_min is maintained during training. We base its value on the minimal loss seen so far. Naturally, some mini-batches will have a lower loss and will be used as a reference for the others. By itself, this comes with some disadvantages. In case of small variance across batches, this L_min estimate would be very pessimistic. Also, the "new best" mini-batches would have zero stepsize.

Therefore, we introduce a factor γ which captures the fraction of the lowest seen loss that is still believed to be achievable. Similarly, to correct for possibly strong effects of a few outlier batches, we let L_min slowly increase with a timescale τ. This reactiveness of L_min slightly shifts its interpretation from "globally minimum loss" to "minimum currently achievable loss". This reflects the fact that in practical settings it is unrealistic to aim for the global minimum in each update. All in all, when a new value L of the loss comes, we set

    L_min ← min(L_min, L),

then we use γ L_min for the gradient update and apply the "forgetting"

    L_min ← (1 + 1/τ) L_min.                                     (5)

The value of L_min is initialized by a fixed fraction of the first seen loss L, that is, L_min ← γ₀ L. We set γ = 0.9, τ = 1000, and γ₀ = 0.75 as default settings and we use these values in all our experiments.
Even though we cannot exclude that tuning these values could lead to enhanced performance, we have not observed such effects and do not feel the necessity to modify them.

Numerical stability: Another unresolved issue is the division by an inner product in Eq. (3). Our solution to potential numerical instabilities is two-fold. First, we require compatibility of g and v in the sense that the angle between the vectors does not exceed 90°; in other words, we insist on g⊤v ≥ 0. For L4Adam and L4Mom this is the case, see Section 2.5, Eq. (7). Second, we add a tiny ε as a regularizer to the denominator. The final form of the update rule is then

    Δθ = −α (L(θ) − γ L_min) / (g⊤v + ε) · v ,                   (6)

with the default value ε = 10⁻¹².

2.5 Putting it together: L4Mom and L4Adam

The algorithm is called Linearized Loss-based optimaL Learning-rate (L4) and it works on top of a provided gradient estimator (producing g) and an update direction algorithm (producing v); see Algorithm 1 in the Supplementary for the pseudocode. For compactness of presentation, we introduce a notation for exponential moving averages ⟨·⟩_τ with timescale τ, using bias correction just as in [9] (see Algorithm 2 in the Supplementary).

In this paper, we introduce two variants of L4 leading to two optimizers: (1) with momentum gradient descent, denoted by L4Mom, and (2) with Adam [9], denoted by L4Adam. We choose the update directions for L4Mom and L4Adam, respectively, as

    v = V_Mom(L, θ) = ⟨∇θL(θ)⟩_τm ,    v = V_Adam(L, θ) = ⟨∇θL(θ)⟩_τm / √⟨|∇θL(θ)|²⟩_τs ,     (7)

with τm = 10 and τs = 1000 being the timescales for momentum and (in the case of L4Adam) second moment averaging.
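Putting Eqs. (5)–(7) together for the momentum variant, the whole scheme fits in a few lines. The following is our simplified sketch, not the authors' Algorithm 1: it omits the bias correction of ⟨·⟩_τ and other implementation details.

```python
import numpy as np

class L4Mom:
    """Sketch of the L4 update (Eqs. 5-7) on top of momentum SGD.

    Simplified for illustration: no bias correction of the moving
    average and none of the remaining details of Algorithm 1.
    """

    def __init__(self, alpha=0.15, gamma=0.9, gamma0=0.75,
                 tau=1000.0, tau_m=10.0, eps=1e-12):
        self.alpha, self.gamma, self.gamma0 = alpha, gamma, gamma0
        self.tau, self.tau_m, self.eps = tau, tau_m, eps
        self.l_min = None
        self.v = None

    def step(self, theta, loss, grad):
        if self.l_min is None:
            self.l_min = self.gamma0 * loss           # L_min <- gamma_0 * L
            self.v = np.zeros_like(grad)
        self.l_min = min(self.l_min, loss)            # first part of Eq. (5)
        beta = 1.0 - 1.0 / self.tau_m
        self.v = beta * self.v + (1.0 - beta) * grad  # momentum direction, Eq. (7)
        g = self.v                                    # g = V_Mom, so g^T v >= 0
        eta = self.alpha * (loss - self.gamma * self.l_min) / (g @ self.v + self.eps)
        self.l_min *= 1.0 + 1.0 / self.tau            # "forgetting", Eq. (5)
        return theta - eta * self.v                   # Eq. (6)
```

On a toy quadratic loss L(θ) = ½‖θ‖², repeated calls to `step` drive the loss down without any learning rate ever being supplied.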
In both cases, the choice of g = G(L, θ) = V_Mom(L, θ) ensures g⊤v ≥ 0, as mentioned in Section 2.4. An additional reason is that the averaged local gradient is in practice often a more accurate estimator of the gradient of the global loss.

3 Results

We evaluate the proposed method on five different setups, spanning different architectures, datasets, and loss functions. We compare to the de facto standard methods: stochastic gradient descent (SGD), momentum SGD (Mom), and Adam [9].

For each of the methods, the performance is evaluated for the best setting of the stepsize/learning rate parameter (found via a fine grid search with multiple restarts). All other parameters are as follows: for momentum SGD we used a timescale of 10 steps (β = 0.9); for Adam: β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁴. The (non-default) value of ε was selected in accordance with the TensorFlow documentation to decrease the instability of Adam.

In all experiments, the performance of the standard methods heavily depends on the stepsize parameter. However, in the case of the proposed method, the default setting showed remarkable consistency. Across the experiments, it outperforms even the best constant learning rates for the respective gradient-based update rules. In addition, the performance of these default settings is also comparable with handcrafted optimization policies on more complicated architectures. We consider this to be the main strength of the L4 method.

We present results for L4Mom and L4Adam, see Tab. 1 for an overview. In all experiments we strictly followed the out-of-the-box policy: we simply cloned an official repository, changed the optimizer, and left everything else intact.
Also, throughout the experiments we observed neither a runtime increase nor additional memory requirements arising from the adaptation.

As a general nomenclature, a method is marked with a "*" if an optimized stepsize was used; otherwise (in the case of the L4 optimizers), the default settings are in place. The Fashion MNIST [24] experiment can be found in the supplementary material, as well as additional experiments with varying batch sizes, as hinted in Tab. 1 by the values in brackets.

Running time: Neither of the L4 optimizers slows down network training in practical settings. By inspection of Equations (6) and (7), we can see that the only additional computation (compared to Adam or momentum SGD) is calculating the inner product g⊤v. This introduces two additional operations per weight (multiplication and addition). In any realistic scenario, these have negligible runtimes when compared to the matrix multiplications (convolutions), which are required in both the forward and backward pass.

Table 1: Overview of experiments. The experiments span classical datasets, traditional and modern architectures, as well as different batch sizes. The tested learning rates are denoted by α and marked with * if chosen optimally via grid search. The optimal learning rates for the baselines vary, while the L4 optimizers can keep a fixed setting and still outperform in terms of training and test loss.

    Dataset         Arch          Batch size      α*_Mom/SGD   α*_Adam   α_L4Mom       α_L4Adam
    Synthetic       2-Layer MLP   -               0.0005       0.001     0.15          0.15
    MNIST           3-Layer MLP   64 [8,16,32]    0.05         0.001     0.15 [0.25]   0.15 [0.25]
    CIFAR-10        ResNet        128             0.004        0.0002    0.15          0.15
    Recurrent       DNC           16 [8,32,64]    1.2          0.01      0.15 [0.25]   0.15 [0.25]
    Fashion MNIST   ConvNet       100             0.01         0.0003    0.15          0.15

Figure 3: Training progress of multilayer neural networks on MNIST, see Section 3.2 for details. (a) Learning curve for the 2-hidden-layer network: average (in log-space) training loss with respect to five restarts with a shaded area between minimum and maximum loss (after log-space smoothing). (b) Effective learning rates η for a single run. The bold curves are averages taken in log-space.

3.1 Badly conditioned regression

The first task we investigate is a linear regression with a badly conditioned input/output relationship. It has recently been brought into the spotlight by Ali Rahimi in his NIPS 2017 talk, see [18], as an example of a problem "resistant" to standard stochastic gradient optimization methods. For our experiments, we used the corresponding code by Ben Recht [17].

The network has two weight matrices W1, W2 and the loss function is given by

    L(W1, W2) = E_{x∼N(0,I)} ‖W1 W2 x − y‖²   s.t.   y = Ax,     (8)

where A is a badly conditioned matrix, i.e. κ(A) = σ_max/σ_min ≫ 1, with σ_max and σ_min the largest and the smallest singular values of A, respectively. Note that this is in disguise a (realizable) matrix factorization problem: L = ‖W1W2 − A‖²_F. Also, it is not a stochastic optimization problem but a deterministic one.

Figure 2 shows the results for x ∈ R^6, W1 ∈ R^{10×6}, W2 ∈ R^{6×6}, y ∈ R^{10} (the default configuration of [17]) and condition number κ(A) = 10^10. The statistics are given for 5 independent runs (with randomly generated matrices A) and a fixed dataset of 1000 samples. We can confirm that standard optimizers indeed have great difficulty reaching convergence. Only a fine grid search discovered settings behaving reasonably well (divergence or too-early plateaus are very common).
The proposed stepsize adaptation method apparently overcomes this issue (see Fig. 2).

3.2 MNIST digit recognition

The second task is a classical multilayer neural network trained for digit recognition on the MNIST [11] dataset. We use the standard architecture with two hidden layers containing 300 and 100 units and ReLU activation functions, followed by a logistic regression output layer for the 10 digit classes. The batch size in use is 64.

Figure 3 shows the learning curves and the effective learning rates. The effective learning rate is given by η in (3). Note how after 22 epochs the effective learning rate of L4Adam becomes very small and actually reaches 0 around 30 epochs. This is simply because by then the loss is 0 (within machine precision) on every batch and thus η = 0; a global optimum was found. The very high learning rates that precede this can be attributed to a "plateau" character of the obtained minimum. The gradients are so small in magnitude that a very high stepsize is necessary to make any progress. This is, perhaps, unexpected, since in optimization theory convergence is typically linked to a decrease in the learning rate, rather than an increase.

Generally, we see that the effective learning rate shows highly nontrivial behavior. We can observe sharp increases as well as sharp decreases. Also, even in a short time period it fully spans 2 or more orders of magnitude, as highlighted by the shaded area in Fig. 3(b). None of this causes instabilities in the training itself.

Even though the ability to generalize and compatibility with various regularization methods are not our main focus in this work, we still report in Tab. 2 the development of test accuracy during training. We see that the test performance of all optimizers is comparable. This does not come as a surprise, as the used architecture has no regularization.
Also, it can be seen that the L4 optimizers reach near-final accuracies faster, already after around 10 epochs.

Comparison to other work: The list of papers reporting improved performance over SGD on MNIST is long (examples include [19, 13, 1, 2, 14]). Unfortunately, there are no widely recognized benchmarks to use for comparison. There is a lot of variety in choosing the baseline optimizer (often only the default setting for SGD) and in the number of training steps reported (often fewer than one epoch). In this situation, it is difficult to make any substantiated claims. However, to our knowledge, previous work does not achieve such rapid convergence as can be seen in Fig. 3.

Figure 4: Effective learning rates η for CIFAR-10. The adaptive stepsize of L4Mom roughly matches the hand-coded decay schedule (grey line) until 150 epochs. Both use the same gradient type.

3.3 ResNets for CIFAR-10 image classification

In the next two tasks, we target finely tuned, publicly available implementations of well-known architectures and compare their performance to our default setting. We begin with the deep residual network architecture for CIFAR-10 [10] taken from the official TensorFlow repository [21]. Deep residual networks [8], or ResNets for short, provided the breakthrough idea of identity mappings in order to enable training of very deep convolutional neural networks. The provided architecture has 32 layers and uses batch normalization with batches of size 128. The loss is given by cross-entropy with L2 regularization.

The deployed optimization policy is momentum gradient with a manually crafted piece-wise constant stepsize schedule.
We simply replace it with the default settings of L4Mom and L4Adam.

The first surprise comes when we look at Fig. 4, which compares the effective learning rates. Clearly, the adaptive learning rates are much more conservative in behavior compared to MNIST, possibly signaling a different nature of the datasets. Also, the L4Mom learning rate approximately matches the manually designed schedule (also for momentum gradient) during the decisive first 150 epochs.

Comparing performance against optimized constant learning rates is favorable for the L4 optimizers both in terms of loss and test accuracy (see Fig. 5). Note also that the two L4 optimizers perform almost indistinguishably. However, competing with the default policy has another surprising outcome. While the default policy is inferior in loss minimization (more strongly at the beginning than at the end), in terms of test accuracy it eventually dominates. By careful inspection of Fig. 5, we see that the decisive gain happens right after the first drop in the hardcoded learning rate. This, in itself, is very intriguing, since both the default policy and L4Mom use the same type of gradients of similar magnitudes. It also explains the original authors' choice of a piece-wise constant learning rate schedule.

To our knowledge, there is no satisfying answer as to why piece-wise constant learning rates lead to good generalization. Yet practitioners use them frequently, perhaps precisely for this reason.

Table 2: Test accuracy after a certain number of epochs of (unregularized) MNIST training.
The results are reported over 5 restarts.

                 Test accuracy in %
                 Adam         mSGD         L4Adam       L4Adam*      L4Mom        L4Mom*
    1 epoch      95.7 ± 0.3   96.4 ± 0.5   95.9 ± 0.7   96.8 ± 0.2   95.4 ± 0.5   96.3 ± 0.4
    10 epochs    97.9 ± 0.3   98.0 ± 0.1   98.4 ± 0.1   98.3 ± 0.0   98.3 ± 0.1   98.3 ± 0.1
    30 epochs    98.0 ± 0.4   98.5 ± 0.1   98.4 ± 0.1   98.3 ± 0.1   98.4 ± 0.1   98.4 ± 0.1

Figure 5: Training and test performance on the ResNet architecture for CIFAR-10: (a) training loss, (b) test accuracy. Mean loss and accuracy are shown with respect to three restarts. The default settings of the L4 optimizers perform better in loss minimization, but become inferior in test accuracy after the first drop of the baseline's learning rate schedule (see also Fig. 4). For Adam and mSGD, the best performing constant stepsizes 2·10⁻⁴ and 0.004 were evaluated.

3.4 Differential neural computer

As the last task, we chose a somewhat exotic one: the recurrent architecture of Google DeepMind's Differential Neural Computer (DNC) [7]. Again, we compare with the performance from the official repository [4]. The DNC is the culmination of a line of work developing LSTM-like architectures with differentiable memory management, e.g. [6, 20], and is in itself very complex. The targeted tasks typically have a very structured flavor (e.g. shortest path, question answering).

The task implemented in [4] is to learn a REPEAT-COPY algorithm. In a nutshell, the input specifies a sequence of bits a_n and a number of repeats k, while the expected output is a sequence b_n consisting of k repeats of a_n.
The loss function is the negative log-probability of outputting the correct sequence. Since the ground truth is a known algorithm, the training data can be generated on the fly, and there is no separate test regime. This time, the optimizer in place is RMSProp [22] with gradient clipping. We found, however, that the constant learning rate 10⁻³ provided in [4] can be further tuned, and we compare our results against the improved value 0.005. We also used the best performing constant learning rates 0.01 for Adam and 1.2 for momentum SGD (both with the suggested gradient clipping) as baselines. The L4 optimizers did not use gradient clipping.

Again, we can see in Fig. 6 that L4Adam and L4Mom performed almost the same on average, even though L4Mom was more prone to instabilities, as can be seen from the volume of the orange-shaded regions. More importantly, they both performed better than or on par with the optimized baselines.

We end this experimental section with a short discussion of Fig. 6(b), since it illustrates multiple features of the adaptation all at once. In this figure, we compare the effective learning rates of L4Adam and plain Adam. We immediately notice the dramatic evolution of the L4 learning rate, jumping across multiple orders of magnitude, until finally settling around 10³. This behavior, however, results in a much more stable optimization process (see again Fig. 6), unlike in the case of the plain Adam optimizer (note the volume of the green-shaded regions).

The intuitive explanation is two-fold. For one, high gradients only need a small learning rate to make the expected progress. This lowers the danger of divergence and, in this sense, plays the role of gradient clipping. And second, plateau regions with small gradients will force very high learning rates in order to leave them. This beneficial rapid adaptation is due to the almost independent stepsize computation for every batch.
Only L_min and possibly (depending on the underlying gradient methods) some gradient history is reused. This is a fundamental difference to methods that at each step make a small update to the previous learning rate, and is in agreement with [23], where the phenomenon was discussed in more depth.

Figure 6: Training progress of the DNC: (a) training loss, (b) effective learning rate. (a) Training loss (equals test loss) on the Differential Neural Computer architecture. See Fig. 2 for details. The L4 optimizers use default settings, whereas RMSProp and Adam use the best performing learning rates 0.005 and 0.01, respectively. We see high stochasticity in training, particularly with Adam. Both L4 optimizers match or beat RMSProp in performance. (b) Effective learning rate η of L4Adam and plain Adam. L4Adam displays a huge variance in the selected stepsize. This, however, has a stabilizing effect on the training progress.

4 Discussion

We propose a stepsize adaptation scheme, L4, compatible with the currently most prominent gradient methods. The two arising optimizers were tested on a multitude of datasets, spanning different batch sizes, loss functions, and network structures. The results validate the stepsize adaptation in itself, as the adaptive optimizers consistently outperform their non-adaptive counterparts, even when the adaptive optimizers use the default setting and the non-adaptive ones were finely tuned. This default setting also performs well when compared to hand-tuned optimization policies from official repositories of modern high-performing architectures.
Although we cannot give guarantees, this is a promising step towards practical "no-tuning-necessary" stochastic optimization.

The core design feature, the ability to change the stepsize dramatically from batch to batch while occasionally reaching extremely high stepsizes, was also validated. This idea does not seem widespread in the community and we would like to inspire further work.

The ability of the proposed method to actually drive the loss to convergence creates an opportunity to better evaluate regularization strategies and to develop new ones. This can potentially convert the superiority in training to enhanced test performance, as discussed in the Fashion MNIST experiments.

Finally, Ali Rahimi and Benjamin Recht suggested in their NIPS 2017 talk (and the corresponding blog post) [17, 18] that the failure to drive the loss to zero within machine precision might be an actual bottleneck of deep learning (using exactly the ill-conditioned regression task). We show on this example and on MNIST that our method can break this "optimization floor".

5 Acknowledgement

We would like to thank Alex Kolesnikov, Friedrich Solowjow, and Anna Levina for helping to improve the manuscript.

References

[1] Marcin Andrychowicz, Misha Denil, Sergio Gómez, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems 29, pages 3981–3989. Curran Associates, Inc., 2016.

[2] Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank D. Wood. Online learning rate adaptation with hypergradient descent. CoRR, abs/1703.04782, 2017.

[3] R. H. Byrd, S. L.
Hansen, Jorge Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.

[4] Google DeepMind. Official implementation of the Differential Neural Computer, 2017. https://github.com/deepmind/dnc, commit a4debae.

[5] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011.

[6] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines, 2014.

[7] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, October 2016.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.

[9] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of ICLR, 2015. arXiv preprint https://arxiv.org/abs/1412.6980.

[10] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research), 2009.

[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[12] Ke Li and Jitendra Malik. Learning to optimize. CoRR, abs/1606.01885, 2016.

[13] M. Mahsereci and P. Hennig. Probabilistic line searches for stochastic optimization. In Advances in Neural Information Processing Systems 28, pages 181–189.
Curran Associates, Inc., 2015.

[14] Franziska Meier, Daniel Kappler, and Stefan Schaal. Online learning of a memory for learning rates. arXiv preprint https://arxiv.org/abs/1709.06709, 2017.

[15] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, 2nd edition, 2006.

[16] B. T. Polyak. Introduction to Optimization. Translations Series in Mathematics and Engineering, Optimization Software, 1987.

[17] Benjamin Recht. Gradient descent doesn't find a local minimum, 2017. https://github.com/benjamin-recht/shallow-linear-net, commit d192d96.

[18] Benjamin Recht and Ali Rahimi. Reflections on random kitchen sinks, 2017. http://www.argmin.net/2017/12/05/kitchen-sinks, Dec. 5, 2017.

[19] Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28/3 of Proceedings of Machine Learning Research, pages 343–351, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.

[20] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448. Curran Associates, Inc., 2015.

[21] TensorFlow GitHub Repository. TensorFlow implementation of ResNets, 2016. Commit 1f34fcaf.

[22] T. Tieleman and G. Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

[23] Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse. Understanding short-horizon bias in stochastic meta-optimization. In International Conference on Learning Representations, 2018.

[24] Han Xiao, Kashif Rasul, and Roland Vollgraf.
Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.