{"title": "The Marginal Value of Adaptive Gradient Methods in Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 4148, "page_last": 4158, "abstract": "Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks.  Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD).  We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half.  We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.", "full_text": "The Marginal Value of Adaptive Gradient Methods\n\nin Machine Learning\n\nAshia C. Wilson], Rebecca Roelofs], Mitchell Stern], Nathan Srebro\u2020, and Benjamin Recht]\n{ashia,roelofs,mitchell}@berkeley.edu, nati@ttic.edu, brecht@berkeley.edu\n\n]University of California, Berkeley\n\n\u2020Toyota Technological Institute at Chicago\n\nAbstract\n\nAdaptive optimization methods, which perform local optimization with a metric\nconstructed from the history of iterates, are becoming increasingly popular for\ntraining deep neural networks. 
Examples include AdaGrad, RMSProp, and Adam.\nWe show that for simple overparameterized problems, adaptive methods often \ufb01nd\ndrastically different solutions than gradient descent (GD) or stochastic gradient\ndescent (SGD). We construct an illustrative binary classi\ufb01cation problem where\nthe data is linearly separable, GD and SGD achieve zero test error, and AdaGrad,\nAdam, and RMSProp attain test errors arbitrarily close to half. We additionally\nstudy the empirical generalization capability of adaptive methods on several state-\nof-the-art deep learning models. We observe that the solutions found by adaptive\nmethods generalize worse (often signi\ufb01cantly worse) than SGD, even when these\nsolutions have better training performance. These results suggest that practitioners\nshould reconsider the use of adaptive methods to train neural networks.\n\n1\n\nIntroduction\n\nAn increasing share of deep learning researchers are training their models with adaptive gradient\nmethods [3, 12] due to their rapid training time [6]. Adam [8] in particular has become the default\nalgorithm used across many deep learning frameworks. However, the generalization and out-of-\nsample behavior of such adaptive gradient methods remains poorly understood. Given that many\npasses over the data are needed to minimize the training objective, typical regret guarantees do not\nnecessarily ensure that the found solutions will generalize [17].\nNotably, when the number of parameters exceeds the number of data points, it is possible that the\nchoice of algorithm can dramatically in\ufb02uence which model is learned [15]. Given two different\nminimizers of some optimization problem, what can we say about their relative ability to generalize?\nIn this paper, we show that adaptive and non-adaptive optimization methods indeed \ufb01nd very different\nsolutions with very different generalization properties. 
We provide a simple generative model for\nbinary classi\ufb01cation where the population is linearly separable (i.e., there exists a solution with large\nmargin), but AdaGrad [3], RMSProp [21], and Adam converge to a solution that incorrectly classi\ufb01es\nnew data with probability arbitrarily close to half. On this same example, SGD \ufb01nds a solution with\nzero error on new data. Our construction suggests that adaptive methods tend to give undue in\ufb02uence\nto spurious features that have no effect on out-of-sample generalization.\nWe additionally present numerical experiments demonstrating that adaptive methods generalize worse\nthan their non-adaptive counterparts. Our experiments reveal three primary \ufb01ndings. First, with\nthe same amount of hyperparameter tuning, SGD and SGD with momentum outperform adaptive\nmethods on the development/test set across all evaluated models and tasks. This is true even when\nthe adaptive methods achieve the same training loss or lower than non-adaptive methods. Second,\nadaptive methods often display faster initial progress on the training set, but their performance quickly\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fplateaus on the development/test set. Third, the same amount of tuning was required for all methods,\nincluding adaptive methods. This challenges the conventional wisdom that adaptive methods require\nless tuning. Moreover, as a useful guide to future practice, we propose a simple scheme for tuning\nlearning rates and decays that performs well on all deep learning tasks we studied.\n\n2 Background\n\nThe canonical optimization algorithms used to minimize risk are either stochastic gradient methods\nor stochastic momentum methods. 
Stochastic gradient methods can generally be written\n\n$$w_{k+1} = w_k - \\alpha_k \\tilde{\\nabla} f(w_k), \\tag{2.1}$$\n\nwhere $\\tilde{\\nabla} f(w_k) := \\nabla f(w_k; x_{i_k})$ is the gradient of some loss function $f$ computed on a batch of data\n$x_{i_k}$.\nStochastic momentum methods are a second family of techniques that have been used to accelerate\ntraining. These methods can generally be written as\n\n$$w_{k+1} = w_k - \\alpha_k \\tilde{\\nabla} f(w_k + \\gamma_k (w_k - w_{k-1})) + \\beta_k (w_k - w_{k-1}). \\tag{2.2}$$\n\nThe sequence of iterates (2.2) includes Polyak\u2019s heavy-ball method (HB) with $\\gamma_k = 0$, and Nesterov\u2019s\nAccelerated Gradient method (NAG) [19] with $\\gamma_k = \\beta_k$.\nNotable exceptions to the general formulations (2.1) and (2.2) are adaptive gradient and adaptive\nmomentum methods, which choose a local distance measure constructed using the entire sequence of\niterates $(w_1, \\cdots, w_k)$. These methods (including AdaGrad [3], RMSProp [21], and Adam [8]) can\ngenerally be written as\n\n$$w_{k+1} = w_k - \\alpha_k H_k^{-1} \\tilde{\\nabla} f(w_k + \\gamma_k (w_k - w_{k-1})) + \\beta_k H_k^{-1} H_{k-1} (w_k - w_{k-1}), \\tag{2.3}$$\n\nwhere $H_k := H(w_1, \\cdots, w_k)$ is a positive definite matrix. Though not necessary, the matrix $H_k$ is\nusually defined as\n\n$$H_k = \\mathrm{diag}\\left( \\Big( \\sum_{i=1}^{k} \\eta_i \\, g_i \\circ g_i \\Big)^{1/2} \\right), \\tag{2.4}$$\n\nwhere \u201c$\\circ$\u201d denotes the entry-wise or Hadamard product, $g_k = \\tilde{\\nabla} f(w_k + \\gamma_k (w_k - w_{k-1}))$, and $\\eta_k$ is\nsome set of coefficients specified for each algorithm. That is, $H_k$ is a diagonal matrix whose entries\nare the square roots of a linear combination of squares of past gradient components. We will use the\nfact that the $H_k$ are defined in this fashion in the sequel. For the specific settings of the parameters for\nmany of the algorithms used in deep learning, see Table 1. Adaptive methods attempt to adjust an\nalgorithm to the geometry of the data. 
In contrast, stochastic gradient descent and related variants use\nthe $\\ell_2$ geometry inherent to the parameter space, and are equivalent to setting $H_k = I$ in the adaptive\nmethods.\n\n| | SGD | HB | NAG | AdaGrad | RMSProp | Adam |\n| $G_k$ | $I$ | $I$ | $I$ | $G_{k-1} + D_k$ | $\\beta_2 G_{k-1} + (1 - \\beta_2) D_k$ | $\\frac{\\beta_2}{1 - \\beta_2^k} G_{k-1} + \\frac{1 - \\beta_2}{1 - \\beta_2^k} D_k$ |\n| $\\alpha_k$ | $\\alpha$ | $\\alpha$ | $\\alpha$ | $\\alpha$ | $\\alpha$ | $\\alpha \\frac{1 - \\beta_1}{1 - \\beta_1^k}$ |\n| $\\beta_k$ | 0 | $\\beta$ | $\\beta$ | 0 | 0 | $\\frac{\\beta_1 (1 - \\beta_1^{k-1})}{1 - \\beta_1^k}$ |\n| $\\gamma_k$ | 0 | 0 | $\\beta$ | 0 | 0 | 0 |\n\nTable 1: Parameter settings of algorithms used in deep learning. Here, $D_k = \\mathrm{diag}(g_k \\circ g_k)$ and\n$G_k := H_k \\circ H_k$. We omit the additional $\\epsilon$ added to the adaptive methods, which is only needed to ensure\nnon-singularity of the matrices $H_k$.\n\nIn this context, generalization refers to the performance of a solution $w$ on a broader population.\nPerformance is often defined in terms of a different loss function than the function $f$ used in training.\nFor example, in classification tasks, we typically define generalization in terms of classification error\nrather than cross-entropy.\n\n2.1 Related Work\n\nUnderstanding how optimization relates to generalization is a very active area of current machine\nlearning research. Most of the seminal work in this area has focused on understanding how early\nstopping can act as implicit regularization [22]. In a similar vein, Ma and Belkin [10] have shown\nthat gradient methods may not be able to find complex solutions at all in any reasonable amount of\ntime. Hardt et al. [17] show that SGD is uniformly stable, and therefore solutions with low training\nerror found quickly will generalize well. Similarly, using a stability argument, Raginsky et al. [16]\nhave shown that Langevin dynamics can find solutions that generalize better than ordinary SGD\nin non-convex settings. Neyshabur, Srebro, and Tomioka [15] discuss how algorithmic choices can\nact as an implicit regularizer. 
In a similar vein, Neyshabur, Salakhutdinov, and Srebro [14] show that a\ndifferent algorithm, one which performs descent using a metric that is invariant to re-scaling of the\nparameters, can lead to solutions which sometimes generalize better than SGD. Our work supports\nthe work of [14] by drawing connections between the metric used to perform local optimization and\nthe ability of the training algorithm to \ufb01nd solutions that generalize. However, we focus primarily on\nthe different generalization properties of adaptive and non-adaptive methods.\nA similar line of inquiry has been pursued by Keskar et al. [7]. Hochreiter and Schmidhuber [4]\nshowed that \u201csharp\u201d minimizers generalize poorly, whereas \u201c\ufb02at\u201d minimizers generalize well. Keskar\net al. empirically show that Adam converges to sharper minimizers when the batch size is increased.\nHowever, they observe that even with small batches, Adam does not \ufb01nd solutions whose performance\nmatches state-of-the-art. 
In the current work, we aim to show that the choice of Adam as an optimizer\nitself strongly influences the set of minimizers that any batch size will ever see, and help explain why\nthey were unable to find solutions that generalized particularly well.\n\n3 The potential perils of adaptivity\n\nThe goal of this section is to illustrate the following observation: when a problem has multiple global\nminima, different algorithms can find entirely different solutions when initialized from the same point.\nIn addition, we construct an example where adaptive gradient methods find a solution which has\nworse out-of-sample error than SGD.\nTo simplify the presentation, let us restrict our attention to the binary least-squares classification\nproblem, where we can easily compute the closed-form solution found by different methods.\nIn least-squares classification, we aim to solve\n\n$$\\text{minimize}_w \\; R_S[w] := \\tfrac{1}{2} \\|Xw - y\\|_2^2. \\tag{3.1}$$\n\nHere $X$ is an $n \\times d$ matrix of features and $y$ is an $n$-dimensional vector of labels in $\\{-1, 1\\}$. We\naim to find the best linear classifier $w$. Note that when $d > n$, if there is a minimizer with loss 0\nthen there is an infinite number of global minimizers. The question remains: what solution does an\nalgorithm find and how well does it perform on unseen data?\n\n3.1 Non-adaptive methods\n\nMost common non-adaptive methods will find the same solution for the least squares objective (3.1).\nAny gradient or stochastic gradient of $R_S$ must lie in the span of the rows of $X$. Therefore, any\nmethod that is initialized in the row span of $X$ (say, for instance at $w = 0$) and uses only linear\ncombinations of gradients, stochastic gradients, and previous iterates must also lie in the row span\nof $X$. The unique solution that lies in the row span of $X$ also happens to be the solution with\nminimum Euclidean norm. We thus denote $w^{\\mathrm{SGD}} = X^T (X X^T)^{-1} y$. 
Almost all non-adaptive\nmethods like SGD, SGD with momentum, mini-batch SGD, gradient descent, Nesterov\u2019s method,\nand the conjugate gradient method will converge to this minimum norm solution. The minimum norm\nsolutions have the largest margin out of all solutions of the equation $Xw = y$. Maximizing margin\nhas a long and fruitful history in machine learning, and thus it is a pleasant surprise that gradient\ndescent naturally finds a max-margin solution.\n\n3.2 Adaptive methods\n\nNext, we consider adaptive methods where $H_k$ is diagonal. While it is difficult to derive the general\nform of the solution, we can analyze special cases. Indeed, we can construct a variety of instances\nwhere adaptive methods converge to solutions with low $\\ell_\\infty$ norm rather than low $\\ell_2$ norm.\nFor a vector $x \\in \\mathbb{R}^q$, let $\\mathrm{sign}(x)$ denote the function that maps each component of $x$ to its sign.\nLemma 3.1 Suppose there exists a scalar $c$ such that $X \\, \\mathrm{sign}(X^T y) = c y$. Then, when initialized at\n$w_0 = 0$, AdaGrad, Adam, and RMSProp all converge to the unique solution $w \\propto \\mathrm{sign}(X^T y)$.\nIn other words, whenever there exists a solution of $Xw = y$ that is proportional to $\\mathrm{sign}(X^T y)$, this\nis precisely the solution to which all of the adaptive gradient methods converge.\nProof We prove this lemma by showing that the entire trajectory of the algorithm consists of iterates\nwhose components have constant magnitude. In particular, we will show that\n\n$$w_k = \\lambda_k \\, \\mathrm{sign}(X^T y),$$\n\nfor some scalar $\\lambda_k$. The initial point $w_0 = 0$ satisfies the assertion with $\\lambda_0 = 0$.\nNow, assume the assertion holds for all $k \\le t$. Observe that\n\n$$\\nabla R_S(w_k + \\gamma_k (w_k - w_{k-1})) = X^T \\left( X (w_k + \\gamma_k (w_k - w_{k-1})) - y \\right)$$\n$$= X^T \\left\\{ (\\lambda_k + \\gamma_k (\\lambda_k - \\lambda_{k-1})) X \\, \\mathrm{sign}(X^T y) - y \\right\\}$$\n$$= \\left\\{ (\\lambda_k + \\gamma_k (\\lambda_k - \\lambda_{k-1})) c - 1 \\right\\} X^T y$$\n$$= \\mu_k X^T y,$$\n\nwhere the last equation defines $\\mu_k$. 
Hence, letting $g_k = \\nabla R_S(w_k + \\gamma_k (w_k - w_{k-1}))$, we also have\n\n$$H_k = \\mathrm{diag}\\left( \\Big( \\sum_{s=1}^{k} \\eta_s \\, g_s \\circ g_s \\Big)^{1/2} \\right) = \\mathrm{diag}\\left( \\Big( \\sum_{s=1}^{k} \\eta_s \\mu_s^2 \\Big)^{1/2} |X^T y| \\right) = \\nu_k \\, \\mathrm{diag}\\left( |X^T y| \\right),$$\n\nwhere $|u|$ denotes the component-wise absolute value of a vector and the last equation defines $\\nu_k$.\nIn sum,\n\n$$w_{k+1} = w_k - \\alpha_k H_k^{-1} \\nabla R_S(w_k + \\gamma_k (w_k - w_{k-1})) + \\beta_k H_k^{-1} H_{k-1} (w_k - w_{k-1})$$\n$$= \\left\\{ \\lambda_k - \\frac{\\alpha_k \\mu_k}{\\nu_k} + \\frac{\\beta_k \\nu_{k-1}}{\\nu_k} (\\lambda_k - \\lambda_{k-1}) \\right\\} \\mathrm{sign}(X^T y),$$\n\nproving the claim. (In the event that $X^T y$ has a component equal to 0, we define $0/0 = 0$ so that the update is well-defined.)\n\nThis solution is far simpler than the one obtained by gradient methods, and it would be surprising if\nsuch a simple solution would perform particularly well. We now turn to showing that such solutions\ncan indeed generalize arbitrarily poorly.\n\n3.3 Adaptivity can overfit\n\nLemma 3.1 allows us to construct a particularly pernicious generative model where AdaGrad fails\nto find a solution that generalizes. This example uses infinite dimensions to simplify bookkeeping,\nbut one could take the dimensionality to be $6n$. Note that in deep learning, we often have a number\nof parameters equal to $25n$ or more [20], so this is not a particularly high dimensional example by\ncontemporary standards. For $i = 1, \\ldots, n$, sample the label $y_i$ to be $1$ with probability $p$ and $-1$ with\nprobability $1 - p$ for some $p > 1/2$. Let $x_i$ be an infinite dimensional vector with entries\n\n$$x_{ij} = \\begin{cases} y_i & j = 1 \\\\\\\\ 1 & j = 2, 3 \\\\\\\\ 1 & j = 4 + 5(i-1), \\ldots, 4 + 5(i-1) + 2(1 - y_i) \\\\\\\\ 0 & \\text{otherwise.} \\end{cases}$$\n\nIn other words, the first feature of $x_i$ is the class label. The next 2 features are always equal to 1.\nAfter this, there is a set of features unique to $x_i$ that are equal to 1. If the class label is $1$, then there\nis 1 such unique feature. If the class label is $-1$, then there are 5 such features. 
Note that the only\ndiscriminative feature useful for classifying data outside the training set is the first one! Indeed,\none can perform perfect classification using only the first feature. The other features are all useless.\nFeatures 2 and 3 are constant, and each of the remaining features appears for only one example in the\ndata set. However, as we will see, algorithms without such a priori knowledge may not be able to\nlearn these distinctions.\nTake $n$ samples and consider the AdaGrad solution for minimizing $\\tfrac{1}{2} \\|Xw - y\\|^2$. First we show that\nthe conditions of Lemma 3.1 hold. Let $b = \\sum_{i=1}^{n} y_i$ and assume for the sake of simplicity that $b > 0$.\nThis will happen with arbitrarily high probability for large enough $n$. Define $u = X^T y$ and observe\nthat\n\n$$u_j = \\begin{cases} n & j = 1 \\\\\\\\ b & j = 2, 3 \\\\\\\\ y_{\\lfloor (j+1)/5 \\rfloor} & \\text{if } j > 3 \\text{ and } x_{\\lfloor (j+1)/5 \\rfloor, j} = 1 \\\\\\\\ 0 & \\text{otherwise} \\end{cases} \\quad \\text{and} \\quad \\mathrm{sign}(u_j) = \\begin{cases} 1 & j = 1 \\\\\\\\ 1 & j = 2, 3 \\\\\\\\ y_{\\lfloor (j+1)/5 \\rfloor} & \\text{if } j > 3 \\text{ and } x_{\\lfloor (j+1)/5 \\rfloor, j} = 1 \\\\\\\\ 0 & \\text{otherwise.} \\end{cases}$$\n\nThus we have $\\langle \\mathrm{sign}(u), x_i \\rangle = y_i + 2 + y_i (3 - 2 y_i) = 4 y_i$ as desired. Hence, the AdaGrad solution\n$w^{\\mathrm{ada}} \\propto \\mathrm{sign}(u)$. In particular, $w^{\\mathrm{ada}}$ has all of its components equal to $\\pm\\tau$ for some positive constant\n$\\tau$. Now since $w^{\\mathrm{ada}}$ has the same sign pattern as $u$, the first three components of $w^{\\mathrm{ada}}$ are equal to\neach other. But for a new data point, $x^{\\mathrm{test}}$, the only features that are nonzero in both $x^{\\mathrm{test}}$ and $w^{\\mathrm{ada}}$\nare the first three. In particular, we have\n\n$$\\langle w^{\\mathrm{ada}}, x^{\\mathrm{test}} \\rangle = \\tau (y^{(\\mathrm{test})} + 2) > 0.$$\n\nTherefore, the AdaGrad solution will label all unseen data as a positive example!\nNow, we turn to the minimum 2-norm solution. Let $P$ and $N$ denote the set of positive and negative\nexamples respectively. Let $n_+ = |P|$ and $n_- = |N|$. Assuming $\\alpha_i = \\alpha_+$ when $y_i = 1$ and $\\alpha_i = \\alpha_-$\nwhen $y_i = -1$, we have that the minimum norm solution will have the form $w^{\\mathrm{SGD}} = X^T \\alpha =$\n$\\sum_{i \\in P} \\alpha_+ x_i + \\sum_{j \\in N} \\alpha_- x_j$. These scalars can be found by solving $X X^T \\alpha = y$. 
In closed form we\nhave\n\n$$\\alpha_+ = \\frac{4 n_- + 3}{9 n_+ + 3 n_- + 8 n_+ n_- + 3} \\quad \\text{and} \\quad \\alpha_- = -\\frac{4 n_+ + 1}{9 n_+ + 3 n_- + 8 n_+ n_- + 3}. \\tag{3.2}$$\n\nThe algebra required to compute these coefficients can be found in the Appendix. For a new data\npoint, $x^{\\mathrm{test}}$, again the only features that are nonzero in both $x^{\\mathrm{test}}$ and $w^{\\mathrm{SGD}}$ are the first three. Thus\nwe have\n\n$$\\langle w^{\\mathrm{SGD}}, x^{\\mathrm{test}} \\rangle = y^{\\mathrm{test}} (n_+ \\alpha_+ - n_- \\alpha_-) + 2 (n_+ \\alpha_+ + n_- \\alpha_-).$$\n\nUsing (3.2), we see that whenever $n_+ > n_-/3$, the SGD solution makes no errors.\nA formal construction of this example using a data-generating distribution can be found in Appendix C.\nThough this generative model was chosen to illustrate extreme behavior, it shares salient features\nwith many common machine learning instances. There are a few frequent features, where some\npredictor based on them is a good predictor, though these might not be easy to identify from first\ninspection. Additionally, there are many other features which are sparse. On finite training data\nit looks like such features are good for prediction, since each such feature is discriminatory for a\nparticular training example, but this is over-fitting and an artifact of having fewer training examples\nthan features. Moreover, we will see shortly that adaptive methods typically generalize worse than\ntheir non-adaptive counterparts on real datasets.\n\n4 Deep Learning Experiments\n\nHaving established that adaptive and non-adaptive methods can find different solutions in the convex\nsetting, we now turn to an empirical study of deep neural networks to see whether we observe a\nsimilar discrepancy in generalization. We compare two non-adaptive methods \u2013 SGD and the heavy\nball method (HB) \u2013 to three popular adaptive methods \u2013 AdaGrad, RMSProp and Adam. We study\n
We study\nperformance on four deep learning problems: (C1) the CIFAR-10 image classi\ufb01cation task, (L1)\n\n5\n\n\fName\nC1\nL1\nL2\nL3\n\nNetwork type\n\nDeep Convolutional\n\nDataset\nArchitecture\nCIFAR-10\ncifar.torch\ntorch-rnn War & Peace\n2-Layer LSTM + Feedforward span-parser Penn Treebank\n\n2-Layer LSTM\n\nFramework\n\nTorch\nTorch\nDyNet\n\n3-Layer LSTM\n\nemnlp2016\n\nPenn Treebank Tensor\ufb02ow\n\nTable 2: Summaries of the models we use for our experiments.2\n\ncharacter-level language modeling on the novel War and Peace, and (L2) discriminative parsing\nand (L3) generative parsing on Penn Treebank. In the interest of reproducibility, we use a network\narchitecture for each problem that is either easily found online (C1, L1, L2, and L3) or produces\nstate-of-the-art results (L2 and L3). Table 2 summarizes the setup for each application. We take care\nto make minimal changes to the architectures and their data pre-processing pipelines in order to best\nisolate the effect of each optimization algorithm.\nWe conduct each experiment 5 times from randomly initialized starting points, using the initialization\nscheme speci\ufb01ed in each code repository. We allocate a pre-speci\ufb01ed budget on the number of epochs\nused for training each model. When a development set was available, we chose the settings that\nachieved the best peak performance on the development set by the end of the \ufb01xed epoch budget.\nCIFAR-10 did not have an explicit development set, so we chose the settings that achieved the lowest\ntraining loss at the end of the \ufb01xed epoch budget.\nOur experiments show the following primary \ufb01ndings: (i) Adaptive methods \ufb01nd solutions that gener-\nalize worse than those found by non-adaptive methods. (ii) Even when the adaptive methods achieve\nthe same training loss or lower than non-adaptive methods, the development or test performance\nis worse. 
(iii) Adaptive methods often display faster initial progress on the training set, but their\nperformance quickly plateaus on the development set. (iv) Though conventional wisdom suggests\nthat Adam does not require tuning, we \ufb01nd that tuning the initial learning rate and decay scheme for\nAdam yields signi\ufb01cant improvements over its default settings in all cases.\n\n4.1 Hyperparameter Tuning\n\nOptimization hyperparameters have a large in\ufb02uence on the quality of solutions found by optimization\nalgorithms for deep neural networks. The algorithms under consideration have many hyperparameters:\nthe initial step size \u21b50, the step decay scheme, the momentum value 0, the momentum schedule\nk, the smoothing term \u270f, the initialization scheme for the gradient accumulator, and the parameter\ncontrolling how to combine gradient outer products, to name a few. A grid search on a large space\nof hyperparameters is infeasible even with substantial industrial resources, and we found that the\nparameters that impacted performance the most were the initial step size and the step decay scheme.\nWe left the remaining parameters with their default settings. We describe the differences between the\ndefault settings of Torch, DyNet, and Tensor\ufb02ow in Appendix B for completeness.\nTo tune the step sizes, we evaluated a logarithmically-spaced grid of \ufb01ve step sizes. If the best\nperformance was ever at one of the extremes of the grid, we would try new grid points so that the\nbest performance was contained in the middle of the parameters. For example, if we initially tried\nstep sizes 2, 1, 0.5, 0.25, and 0.125 and found that 2 was the best performing, we would have tried\nthe step size 4 to see if performance was improved. If performance improved, we would have tried 8\nand so on. 
We list the initial step sizes we tried in Appendix D.\nFor step size decay, we explored two separate schemes, a development-based decay scheme (dev-decay)\nand a fixed frequency decay scheme (fixed-decay). For dev-decay, we keep track of the best\nvalidation performance so far, and at each epoch decay the learning rate by a constant factor $\\delta$ if the\nmodel does not attain a new best value. For fixed-decay, we decay the learning rate by a constant\nfactor $\\delta$ every $k$ epochs. We recommend the dev-decay scheme when a development set is available;\nnot only does it have fewer hyperparameters than the fixed frequency scheme, but our experiments\nalso show that it produces results comparable to, or better than, the fixed-decay scheme.\n\nFootnote 2: Architectures can be found at the following links: (1) cifar.torch: https://github.com/szagoruyko/cifar.torch; (2) torch-rnn: https://github.com/jcjohnson/torch-rnn; (3) span-parser: https://github.com/jhcross/span-parser; (4) emnlp2016: https://github.com/cdg720/emnlp2016.\n\n[Figure 1 panels: (a) CIFAR-10 (Train); (b) CIFAR-10 (Test)]\n\nFigure 1: Training (left) and top-1 test error (right) on CIFAR-10. The annotations indicate where the\nbest performance is attained for each method. The shading represents \u00b1 one standard deviation computed\nacross five runs from random initial starting points. In all cases, adaptive methods are performing worse on\nboth train and test than non-adaptive methods.\n\n4.2 Convolutional Neural Network\n\nWe used the VGG+BN+Dropout network for CIFAR-10 from the Torch blog [23], which in prior\nwork achieves a baseline test error of 7.55%. Figure 1 shows the learning curve for each algorithm\non both the training and test dataset.\nWe observe that the solutions found by SGD and HB do indeed generalize better than those found\nby adaptive methods. 
The best overall test error found by a non-adaptive algorithm, SGD, was\n7.65 \u00b1 0.14%, whereas the best adaptive method, RMSProp, achieved a test error of 9.60 \u00b1 0.19%.\nEarly on in training, the adaptive methods appear to be performing better than the non-adaptive\nmethods, but starting at epoch 50, even though the training error of the adaptive methods is still lower,\nSGD and HB begin to outperform adaptive methods on the test error. By epoch 100, the performance\nof SGD and HB surpass all adaptive methods on both train and test. Among all adaptive methods,\nAdaGrad\u2019s rate of improvement \ufb02atlines the earliest. We also found that by increasing the step size,\nwe could drive the performance of the adaptive methods down in the \ufb01rst 50 or so epochs, but the\naggressive step size made the \ufb02atlining behavior worse, and no step decay scheme could \ufb01x the\nbehavior.\n\n4.3 Character-Level Language Modeling\n\nUsing the torch-rnn library, we train a character-level language model on the text of the novel War\nand Peace, running for a \ufb01xed budget of 200 epochs. Our results are shown in Figures 2(a) and 2(b).\nUnder the \ufb01xed-decay scheme, the best con\ufb01guration for all algorithms except AdaGrad was to decay\nrelatively late with regards to the total number of epochs, either 60 or 80% through the total number\nof epochs and by a large amount, dividing the step size by 10. The dev-decay scheme paralleled\n(within the same standard deviation) the results of the exhaustive search over the decay frequency\nand amount; we report the curves from the \ufb01xed policy.\nOverall, SGD achieved the lowest test loss at 1.212 \u00b1 0.001. AdaGrad has fast initial progress, but\n\ufb02atlines. The adaptive methods appear more sensitive to the initialization scheme than non-adaptive\nmethods, displaying a higher variance on both train and test. 
Surprisingly, RMSProp closely trails\nSGD on test loss, confirming that it is not impossible for adaptive methods to find solutions that\ngeneralize well. We note that there are step configurations for RMSProp that drive the training loss\nbelow that of SGD, but these configurations cause erratic behavior on test, driving the test error of\nRMSProp above that of Adam.\n\n4.4 Constituency Parsing\n\nA constituency parser is used to predict the hierarchical structure of a sentence, breaking it down into\nnested clause-level, phrase-level, and word-level units. We carry out experiments using two\nstate-of-the-art parsers: the stand-alone discriminative parser of Cross and Huang [2], and the generative\nreranking parser of Choe and Charniak [1]. In both cases, we use the dev-decay scheme with $\\delta = 0.9$\nfor learning rate decay.\n\nDiscriminative Model. Cross and Huang [2] develop a transition-based framework that reduces\nconstituency parsing to a sequence prediction problem, giving a one-to-one correspondence between\nparse trees and sequences of structural and labeling actions. Using their code with the default settings,\nwe trained for 50 epochs on the Penn Treebank [11], comparing labeled F1 scores on the training and\ndevelopment data over time. RMSProp was not implemented in the version of DyNet we used, so we\nomit it from our experiments. Results are shown in Figures 2(c) and 2(d).\nWe find that SGD obtained the best overall performance on the development set, followed closely\nby HB and Adam, with AdaGrad trailing far behind. 
The default con\ufb01guration of Adam without\nlearning rate decay actually achieved the best overall training performance by the end of the run, but\nwas notably worse than tuned Adam on the development set.\nInterestingly, Adam achieved its best development F1 of 91.11 quite early, after just 6 epochs,\nwhereas SGD took 18 epochs to reach this value and didn\u2019t reach its best F1 of 91.24 until epoch 31.\nOn the other hand, Adam continued to improve on the training set well after its best development\nperformance was obtained, while the peaks for SGD were more closely aligned.\n\nGenerative Model. Choe and Charniak [1] show that constituency parsing can be cast as a language\nmodeling problem, with trees being represented by their depth-\ufb01rst traversals. This formulation\nrequires a separate base system to produce candidate parse trees, which are then rescored by the\ngenerative model. Using an adapted version of their code base,3 we retrained their model for 100\nepochs on the Penn Treebank. However, to reduce computational costs, we made two minor changes:\n(a) we used a smaller LSTM hidden dimension of 500 instead of 1500, \ufb01nding that performance\ndecreased only slightly; and (b) we accordingly lowered the dropout ratio from 0.7 to 0.5. Since they\ndemonstrated a high correlation between perplexity (the exponential of the average loss) and labeled\nF1 on the development set, we explored the relation between training and development perplexity to\navoid any con\ufb02ation with the performance of a base parser.\nOur results are shown in Figures 2(e) and 2(f). On development set performance, SGD and HB\nobtained the best perplexities, with SGD slightly ahead. 
Despite having one of the best performance\ncurves on the training dataset, Adam achieves the worst development perplexities.\n\nFootnote 3: While the code of Choe and Charniak treats the entire corpus as a single long example, relying on the\nnetwork to reset itself upon encountering an end-of-sentence token, we use the more conventional approach of\nresetting the network for each example. This reduces training efficiency slightly when batches contain examples\nof different lengths, but removes a potential confounding factor from our experiments.\n\n5 Conclusion\n\nDespite the fact that our experimental evidence demonstrates that adaptive methods are not\nadvantageous for machine learning, the Adam algorithm remains incredibly popular. We are not sure\nexactly why, but hope that our step-size tuning suggestions make it easier for practitioners to use\nstandard stochastic gradient methods in their research. In our conversations with other researchers,\nwe have surmised that adaptive gradient methods are particularly popular for training GANs [18, 5]\nand Q-learning with function approximation [13, 9]. Both of these applications stand out because\nthey are not solving optimization problems. It is possible that the dynamics of Adam are accidentally\nwell matched to these sorts of optimization-free iterative search procedures. It is also possible that\ncarefully tuned stochastic gradient methods may work as well or better in both of these applications.\nIt is an exciting direction of future work to determine which of these possibilities is true and to\nbetter understand why.\n\nAcknowledgements\n\nThe authors would like to thank Pieter Abbeel, Moritz Hardt, Tomer Koren, Sergey Levine, Henry\nMilner, Yoram Singer, and Shivaram Venkataraman for many helpful comments and suggestions.\nRR is generously supported by DOE award AC02-05CH11231. MS and AW are supported by\nNSF Graduate Research Fellowships. 
NS is partially supported by NSF-IIS-13-02662 and NSF-IIS-15-46500,\nan Inter ICRI-RI award and a Google Faculty Award. BR is generously supported by\nNSF award CCF-1359814, ONR awards N00014-14-1-0024 and N00014-17-1-2191, the DARPA\nFundamental Limits of Learning (Fun LoL) Program, a Sloan Research Fellowship, and a Google\nFaculty Award.\n\nFigure 2: Performance curves on the training data (left) and the development/test data (right) for three\nexperiments on natural language tasks. The annotations indicate where the best performance is attained for\neach method. The shading represents one standard deviation computed across five runs from random initial\nstarting points. Panels: (a) War and Peace (Training Set); (b) War and Peace (Test Set);\n(c) Discriminative Parsing (Training Set); (d) Discriminative Parsing (Development Set);\n(e) Generative Parsing (Training Set); (f) Generative Parsing (Development Set).\n\nReferences\n\n[1] Do Kook Choe and Eugene Charniak. Parsing as language modeling. In Jian Su, Xavier\nCarreras, and Kevin Duh, editors, Proceedings of the 2016 Conference on Empirical Methods in\nNatural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages\n2331\u20132336. The Association for Computational Linguistics, 2016.\n\n[2] James Cross and Liang Huang. Span-based constituency parsing with a structure-label system\nand provably optimal dynamic oracles. In Jian Su, Xavier Carreras, and Kevin Duh, editors,\nProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing,\nAustin, Texas, pages 1\u201311. The Association for Computational Linguistics, 2016.\n\n[3] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online\nlearning and stochastic optimization. Journal of Machine Learning Research, 12:2121\u20132159,\n2011.\n\n[4] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Flat minima. 
Neural Computation, 9(1):1\u201342, 1997.\n\n[5] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with\nconditional adversarial networks. arXiv:1611.07004, 2016.\n\n[6] Andrej Karpathy. A peek at trends in machine learning.\nhttps://medium.com/@karpathy/a-peek-at-trends-in-machine-learning-ab8a1085a106. Accessed: 2017-05-17.\n\n[7] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping\nTak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima.\nIn The International Conference on Learning Representations (ICLR), 2017.\n\n[8] D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. The International\nConference on Learning Representations (ICLR), 2015.\n\n[9] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,\nDavid Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In\nInternational Conference on Learning Representations (ICLR), 2016.\n\n[10] Siyuan Ma and Mikhail Belkin. Diving into the shallows: a computational perspective on\nlarge-scale shallow learning. arXiv:1703.10622, 2017.\n\n[11] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated\ncorpus of English: The Penn Treebank. Computational Linguistics, 19(2):313\u2013330,\n1993.\n\n[12] H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex\noptimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.\n\n[13] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap,\nTim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep\nreinforcement learning. In International Conference on Machine Learning (ICML), 2016.\n\n[14] Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-SGD: Path-normalized\noptimization in deep neural networks. 
In Neural Information Processing Systems (NIPS), 2015.\n\n[15] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias:\nOn the role of implicit regularization in deep learning. In International Conference on Learning\nRepresentations (ICLR), 2015.\n\n[16] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic\ngradient Langevin dynamics: a nonasymptotic analysis. arXiv:1702.03849, 2017.\n\n[17] Benjamin Recht, Moritz Hardt, and Yoram Singer. Train faster, generalize better: Stability\nof stochastic gradient descent. In Proceedings of the International Conference on Machine\nLearning (ICML), 2016.\n\n[18] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak\nLee. Generative adversarial text to image synthesis. In Proceedings of The International\nConference on Machine Learning (ICML), 2016.\n\n[19] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of\ninitialization and momentum in deep learning. In Proceedings of the International Conference\non Machine Learning (ICML), 2013.\n\n[20] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna.\nRethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference\non Computer Vision and Pattern Recognition (CVPR), 2016.\n\n[21] T. Tieleman and G. Hinton. Lecture 6.5\u2014RmsProp: Divide the gradient by a running average\nof its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.\n\n[22] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent\nlearning. Constructive Approximation, 26(2):289\u2013315, 2007.\n\n[23] Sergey Zagoruyko. Torch blog. 
http://torch.ch/blog/2015/07/30/cifar.html, 2015.\n", "award": [], "sourceid": 2186, "authors": [{"given_name": "Ashia", "family_name": "Wilson", "institution": "UC Berkeley"}, {"given_name": "Rebecca", "family_name": "Roelofs", "institution": "UC Berkeley"}, {"given_name": "Mitchell", "family_name": "Stern", "institution": "UC Berkeley"}, {"given_name": "Nati", "family_name": "Srebro", "institution": "TTI-Chicago"}, {"given_name": "Benjamin", "family_name": "Recht", "institution": "UC Berkeley"}]}