{"title": "Equilibrated adaptive learning rates for non-convex optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1504, "page_last": 1512, "abstract": "Parameter-specific adaptive learning rate methods are computationally efficient ways to reduce the ill-conditioning problems encountered when training large deep networks. Following recent work that strongly suggests that most of thecritical points encountered when training such networks are saddle points, we find how considering the presence of negative eigenvalues of the Hessian could help us design better suited adaptive learning rate schemes. We show that the popular Jacobi preconditioner has undesirable behavior in the presence of both positive and negative curvature, and present theoretical and empirical evidence that the so-called equilibration preconditioner is comparatively better suited to non-convex problems. We introduce a novel adaptive learning rate scheme, called ESGD, based on the equilibration preconditioner. Our experiments demonstrate that both schemes yield very similar step directions but that ESGD sometimes surpasses RMSProp in terms of convergence speed, always clearly improving over plain stochastic gradient descent.", "full_text": "Equilibrated adaptive learning rates for non-convex\n\noptimization\n\nYann N. Dauphin1\nUniversit\u00b4e de Montr\u00b4eal\n\ndauphiya@iro.umontreal.ca\n\nHarm de Vries1\n\nUniversit\u00b4e de Montr\u00b4eal\n\ndevries@iro.umontreal.ca\n\nYoshua Bengio\n\nUniversit\u00b4e de Montr\u00b4eal\n\nyoshua.bengio@umontreal.ca\n\nAbstract\n\nParameter-speci\ufb01c adaptive learning rate methods are computationally ef\ufb01cient\nways to reduce the ill-conditioning problems encountered when training large\ndeep networks. 
Following recent work that strongly suggests that most of the\ncritical points encountered when training such networks are saddle points, we \ufb01nd\nhow considering the presence of negative eigenvalues of the Hessian could help\nus design better suited adaptive learning rate schemes. We show that the popular\nJacobi preconditioner has undesirable behavior in the presence of both positive\nand negative curvature, and present theoretical and empirical evidence that the so-\ncalled equilibration preconditioner is comparatively better suited to non-convex\nproblems. We introduce a novel adaptive learning rate scheme, called ESGD,\nbased on the equilibration preconditioner. Our experiments show that ESGD per-\nforms as well or better than RMSProp in terms of convergence speed, always\nclearly improving over plain stochastic gradient descent.\n\n1\n\nIntroduction\n\nOne of the challenging aspects of deep learning is the optimization of the training criterion over mil-\nlions of parameters: the dif\ufb01culty comes from both the size of these neural networks and because the\ntraining objective is non-convex in the parameters. Stochastic gradient descent (SGD) has remained\nthe method of choice for most practitioners of neural networks since the 80\u2019s, in spite of a rich lit-\nerature in numerical optimization. Although it is well-known that \ufb01rst-order methods considerably\nslow down when the objective function is ill-conditioned, it remains unclear how to best exploit\nsecond-order structure when training deep networks. Because of the large number of parameters,\nstoring the full Hessian or even a low-rank approximation is not practical, making parameter speci\ufb01c\nlearning rates, i.e diagonal preconditioners, one of the viable alternatives. One of the open questions\nis how to set the learning rate for SGD adaptively, both over time and for different parameters, and\nseveral methods have been proposed (see e.g. Schaul et al. 
(2013) and references therein).\nOn the other hand, recent work (Dauphin et al., 2014; Choromanska et al., 2014) has brought theo-\nretical and empirical evidence suggesting that local minima are with high probability not the main\nobstacle to optimizing large and deep neural networks, contrary to what was previously believed:\ninstead, saddle points are the most prevalent critical points on the optimization path (except when\nwe approach the value of the global minimum). These saddle points can considerably slow down\ntraining, mostly because the objective function tends to be ill-conditioned in the neighborhood of\n\n1Denotes \ufb01rst authors\n\n1\n\n\f(a) Original\n\n(b) Preconditioned\n\nFigure 1: Contour lines of a saddle point (black point) problem for (a) original function and (b) trans-\nformed function (by equilibration preconditioner). Gradient descent slowly escapes the saddle point\nin (a) because it oscillates along the high positive curvature direction. For the better conditioned\nfunction (b) these oscillations are reduced, and gradient descent makes faster progress.\n\nthese saddle points. This raises the question: can we take advantage of the saddle structure to design\ngood and computationally ef\ufb01cient preconditioners?\nIn this paper, we bring these threads together. We \ufb01rst study diagonal preconditioners for saddle\npoint problems, and \ufb01nd that the popular Jacobi preconditioner has unsuitable behavior in the pres-\nence of both positive and negative curvature. Instead, we propose to use the so-called equilibration\npreconditioner and provide new theoretical justi\ufb01cations for its use in Section 4. We provide speci\ufb01c\narguments why equilibration is better suited to non-convex optimization problems than the Jacobi\npreconditioner and empirically demonstrate this for small neural networks in Section 5. 
Using this new insight, we propose a new adaptive learning rate schedule for SGD, called ESGD, that is based on the equilibration preconditioner. In Section 7 we evaluate the proposed method on two deep auto-encoder benchmarks. The results, presented in Section 8, confirm that ESGD performs as well as or better than RMSProp. In addition, we empirically find that the update directions of RMSProp are very similar to the equilibrated update directions, which might explain its success in training deep neural networks.

2 Preconditioning

It is well known that gradient descent makes slow progress when the curvature of the loss function is very different in separate directions. The negative gradient will mostly point in directions of high curvature, and a small enough learning rate has to be chosen in order to avoid divergence in the direction of largest positive curvature. As a consequence, the gradient step makes very little progress in small curvature directions, leading to the slow convergence often observed with first-order methods.

Preconditioning can be thought of as a geometric solution to the problem of pathological curvature. It aims to locally transform the optimization landscape so that its curvature is equal in all directions. This is illustrated in Figure 1 for a two-dimensional saddle point problem using the equilibration preconditioner (Section 4). Gradient descent slowly escapes the saddle point due to the typical oscillations along the high positive curvature direction. By transforming the function to be more equally curved, it is possible for gradient descent to move much faster.

More formally, we are interested in minimizing a function f with parameters θ ∈ R^N. We introduce preconditioning by a linear change of variables θ̂ = D^{1/2} θ with a non-singular matrix D^{1/2}. We use this change of variables to define a new function f̂, parameterized by θ̂, that is equivalent to the original function f:

f̂(θ̂) = f(D^{-1/2} θ̂) = f(θ)    (1)

The gradient and the Hessian of this new function f̂ are (by the chain rule):

∇f̂(θ̂) = D^{-1/2} ∇f(θ)    (2)

∇²f̂(θ̂) = D^{-1/2⊤} H D^{-1/2} with H = ∇²f(θ)    (3)

A gradient descent iteration θ̂_t = θ̂_{t-1} − η ∇f̂(θ̂) for the transformed function corresponds to

θ_t = θ_{t-1} − η D^{-1} ∇f(θ)    (4)

for the original parameter θ. In other words, by left-multiplying the original gradient with a positive definite matrix D^{-1}, we effectively apply gradient descent to the problem after a change of variables θ̂ = D^{1/2} θ. The curvature of this transformed function is given by the Hessian D^{-1/2⊤} H D^{-1/2}, and we aim to seek a preconditioning matrix D such that the new Hessian has equal curvature in all directions. One way to assess the success of D in doing so is to compute the relative difference between the biggest and smallest curvature directions, which is measured by the condition number of the Hessian:

κ(H) = σ_max(H) / σ_min(H)    (5)

where σ_max(H) and σ_min(H) denote respectively the biggest and smallest singular values of H (which are the absolute values of the eigenvalues). It is important to stress that the condition number is defined for both definite and indefinite matrices.

The famous Newton step corresponds to the change of variables D^{1/2} = H^{1/2}, which makes the new Hessian perfectly conditioned. However, this change of variables only exists² when the Hessian H is positive semi-definite.
This is a problem for non-convex loss surfaces where the Hessian might be indefinite. In fact, recent studies (Dauphin et al., 2014; Choromanska et al., 2014) have shown that saddle points dominate the optimization landscape of deep neural networks, implying that the Hessian is most likely indefinite. In such a setting, H^{-1} is not a valid preconditioner, and applying Newton's step without modification would move the parameters towards the saddle point. Nevertheless, it is important to realize that the concept of preconditioning extends to non-convex problems, and reducing ill-conditioning around saddle points will often speed up gradient descent.

At this point, it is natural to ask whether there exists a valid preconditioning matrix that always perfectly conditions the new Hessian. The answer is yes, and the corresponding preconditioning matrix is the inverse of the absolute Hessian

|H| = Σ_j |λ_j| q_j q_j^⊤,    (6)

which is obtained by an eigendecomposition of H and taking the absolute values of the eigenvalues. See Proposition 1 in Appendix A for a proof that |H|^{-1} is the only (up to a scalar³) symmetric positive definite preconditioning matrix that perfectly reduces the condition number.

Practically, there are several computational drawbacks to using |H|^{-1} as a preconditioner. Neural networks typically have millions of parameters, rendering it infeasible to store the Hessian (O(N²)), perform an eigendecomposition (O(N³)) and invert the matrix (O(N³)). Except for the eigendecomposition, other full-rank preconditioners face the same computational issues. We therefore look for more computationally affordable preconditioners that maintain efficiency in reducing the condition number of indefinite matrices. In this paper, we focus on diagonal preconditioners, which can be stored, inverted and multiplied by a vector in linear time.
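The perfect conditioning achieved by |H|^{-1} is easy to check numerically. The following sketch (an illustrative toy example, not code from the paper) builds |H| from an eigendecomposition of a small symmetric matrix and verifies that the transformed Hessian has condition number 1:

```python
import numpy as np

# Toy check of Eq. 6: |H| = sum_j |lam_j| q_j q_j^T, and preconditioning
# an (almost surely) indefinite H with |H|^{-1} yields condition number 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = (A + A.T) / 2                        # small symmetric toy "Hessian"

lam, Q = np.linalg.eigh(H)               # H = Q diag(lam) Q^T
H_abs = Q @ np.diag(np.abs(lam)) @ Q.T   # absolute Hessian |H|

# Transformed Hessian |H|^{-1/2} H |H|^{-1/2}; np.linalg.cond uses the
# ratio of extreme singular values, matching Eq. 5.
D_inv_sqrt = Q @ np.diag(np.abs(lam) ** -0.5) @ Q.T
H_hat = D_inv_sqrt @ H @ D_inv_sqrt

print(np.linalg.cond(H))
print(np.linalg.cond(H_hat))             # 1 up to round-off
```

The same construction makes the O(N³) cost concrete: both `eigh` and the matrix products are cubic in N, which is what rules this preconditioner out for networks with millions of parameters.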
When diagonal precondi-\ntioners are applied in an online optimization setting (i.e. in conjunction with SGD), they are often\nreferred to as adaptive learning rates in the neural network literature.\n\n3 Related work\n\nThe Jacobi preconditioner is one of the most well-known preconditioners. It is given by the diagonal\nof the Hessian DJ = |diag(H)| where | \u00b7 | is element-wise absolute value. LeCun et al. (1998)\nproposes an ef\ufb01cient approximation of the Jacobi preconditioner using the Gauss-Newton matrix.\nThe Gauss-Newton has been shown to approximate the Hessian under certain conditions (Pascanu\n& Bengio, 2014). The merit of this approach is that it is ef\ufb01cient but it is not clear what is lost\nby the Gauss-Newton approximation. What\u2019s more the Jacobi preconditioner has not be found to\nbe competitive for inde\ufb01nite matrices (Bradley & Murray, 2011). This will be further explored for\nneural networks in Section 5.\n\n2A real square root H\n3can be incorporated into the learning rate\n\n1\n2 only exists when H is positive semi-de\ufb01nite.\n\n3\n\n\flearning rate. This gives us the diagonal preconditioning matrix DA = ((cid:80)\n\nA recent revival of interest in adaptive learning rates has been started by AdaGrad (Duchi et al.,\n2011). Adagrad collects information from the gradients across several parameter updates to tune the\n(t))\u22121/2 which relies\non the sum of gradients \u2207f(t) at each timestep t. Duchi et al. (2011) relies strongly on convexity to\njustify this method. This makes the application to neural networks dif\ufb01cult from a theoretical per-\nspective. RMSProp (Tieleman & Hinton, 2012) and AdaDelta (Zeiler, 2012) were follow-up meth-\nods introduced to be practical adaptive learning methods to train large neural networks. Although\nRMSProp has been shown to work very well (Schaul et al., 2013), there is not much understanding\nfor its success in practice. 
Preconditioning might be a good framework to get a better understanding\nof such adaptive learning rate methods.\n\nt \u2207f 2\n\n4 Equilibration\n\nii = (cid:107)Hi,\u00b7(cid:107)2,\nDE\n\nEquilibration is a preconditioning technique developed in the numerical mathematics commu-\nnity (Sluis, 1969). When solving a linear system Ax = b with Gaussian Elimination, signi\ufb01cant\nround-off errors can be introduced when small numbers are added to big numbers (Datta, 2010).\nTo circumvent this issue, it is advised to properly scale the rows of the matrix before starting the\nelimination process. This step is often referred to as row equilibration, which formally scales the\nrows of A to unit magnitude in some p-norm. Throughout the following we consider 2-norm. Row\nequilibration is equivalent to multiplying A from the left by the matrix D\u22121\n. Instead of\nsolving the original system, we now solve the equivalent left preconditioned system \u02c6Ax = \u02c6b with\n\u02c6A = D\u22121A and \u02c6b = D\u22121\ni b.\nIn this paper, we apply the equilibration preconditioner in the context of large scale non-convex\noptimization. However, it is not straightforward how to apply the preconditioner. By choosing the\npreconditioning matrix\n\nii = 1(cid:107)Ai,\u00b7(cid:107) 2\n\n2 HD\u2212 1\n\n2(cid:62)H(DE)\u2212 1\n\n2(cid:62)H(DE)\u2212 1\n\n(7)\nthe Hessian of the transformed function (DE)\u2212 1\n2 (see Section 2) does not have equi-\nlibrated rows. Nevertheless, its spectrum (i.e. eigenvalues) is equal to the spectrum of the row\nequilibrated Hessian (DE)\u22121H and column equilibrated Hessian H(DE)\u22121. Consequently, if row\nequilibration succesfully reduces the condition number, then the condition number of the trans-\nformed Hessian (DE)\u2212 1\n2 will be reduced by the same amount. 
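As a concrete illustration of row equilibration in the linear-system setting above (an assumed toy example, not from the paper), the sketch below scales the rows of a badly row-scaled matrix to unit 2-norm, which is exactly multiplication by D^{-1} with D_ii = ‖A_{i,·}‖₂:

```python
import numpy as np

# Toy row equilibration: A = diag(scales) @ Q has rows scaled 1 ... 10^4,
# so its singular values are exactly those scales and cond(A) = 10^4.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))   # orthogonal factor
A = np.diag(np.logspace(0, 4, 6)) @ Q

row_norms = np.linalg.norm(A, axis=1)              # D_ii = ||A_{i,.}||_2
A_eq = A / row_norms[:, None]                      # D^{-1} A: unit-norm rows

print(np.linalg.cond(A))      # 1e4
print(np.linalg.cond(A_eq))   # 1: equilibration removes the row scaling
```

This example is deliberately constructed so that all ill-conditioning comes from row scaling; for general matrices equilibration reduces, but need not eliminate, the condition number.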
The proof is given by Proposition 2.

From the above observation, it seems more natural to seek a diagonal preconditioning matrix D such that D^{-1/2⊤} H D^{-1/2} is row and column equilibrated. In Bradley & Murray (2011) an iterative stochastic procedure is proposed for finding such a matrix. However, we did not find it to work very well in an online optimization setting, and therefore stick to the original equilibration matrix D^E.

Although the original motivation for row equilibration is to prevent round-off errors, our interest is in how well it is able to reduce the condition number. Intuitively, ill-conditioning can be a result of matrix elements that are of completely different order. Scaling the rows to have equal norm could therefore significantly reduce the condition number. Although we are not aware of any proofs that row equilibration improves the condition number, there are theoretical results that motivate its use. In Sluis (1969) it is shown that the condition number of a row equilibrated matrix is at most a factor √N worse than that obtained with the diagonal preconditioning matrix that optimally reduces the condition number. Note that this bound grows sublinearly in the dimension of the matrix, and can be quite loose for the extremely large matrices we consider. In this paper, we provide an alternative justification using the following upper bound on the condition number from Guggenheimer et al. (1995):

κ(H) < (2 / |det H|) (‖H‖_F / √N)^N    (8)

The proof in Guggenheimer et al. (1995) provides useful insight into when we can expect this upper bound to be tight: when all singular values, except for the smallest, are roughly equal. We prove by Proposition 4 that row equilibration improves this upper bound by a factor (‖H‖_F / √N)^N / det(D^E). It is easy to see that the bound is reduced more when the norms of the rows are more varied. Note that the proof can easily be extended to column equilibration, and to row and column equilibration. In contrast, we cannot prove that the Jacobi preconditioner improves the upper bound, which provides another justification for using the equilibration preconditioner.

Figure 2: Histogram of the condition number reduction (lower is better) for random Hessians in a (a) convex and (b) non-convex setting. Equilibration clearly outperforms the other methods in the non-convex case.

A deterministic implementation to calculate the 2-norms of all matrix rows needs to access all matrix elements. This is prohibitive for very large Hessians that cannot even be stored. We therefore resort to a matrix-free estimator of the equilibration matrix that only uses matrix-vector multiplications of the form (Hv)², where the square is element-wise and v_i ∼ N(0, 1)⁴. As shown by Bradley & Murray (2011), this estimator is unbiased, i.e.

‖H_{i,·}‖₂² = E[(Hv)²]_i.    (9)

Since multiplying the Hessian by a vector can be done efficiently without ever computing the Hessian, this method can be used in the context of neural networks via the R-operator (Schraudolph, 2002). The R-operator computation only uses gradient-like computations and costs about the same as two backpropagations.

5 Equilibrated learning rates are well suited to non-convex problems

In this section, we demonstrate that equilibrated learning rates are well suited to non-convex optimization, particularly compared to the Jacobi preconditioner. First, the diagonal equilibration matrix can be seen as an approximation to the diagonal of the absolute Hessian.
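The matrix-free estimator of Equation 9 can be sketched as follows. Here H is a small explicit symmetric matrix only so the Monte Carlo estimate can be checked against the exact row norms; in a neural network the product Hv would come from the R-operator instead (this is an illustrative toy, not the paper's code):

```python
import numpy as np

# Monte Carlo check of Eq. 9: E[(Hv)^2] = diag(H^2) = squared row norms.
rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8))
H = (A + A.T) / 2                          # symmetric toy "Hessian"

K = 100_000
V = rng.standard_normal((K, 8))            # rows are probe vectors v
est = np.mean((V @ H) ** 2, axis=0)        # rows of V @ H are (Hv)^T

exact = np.linalg.norm(H, axis=1) ** 2     # ||H_{i,.}||_2^2 = diag(H^2)
rel_err = np.max(np.abs(est - exact) / exact)
print(rel_err)                             # a few percent at most
```

The relative error shrinks like 1/√K, which is why the online algorithm below can get away with accumulating a single (Hv)² sample per update into a running average.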
Reformulating the equilibration matrix as

D^E_ii = ‖H_{i,·}‖₂ = √(diag(H²))_i    (10)

reveals an interesting connection. Changing the order of the square root and the diagonal would give us the diagonal of |H|. In other words, the equilibration preconditioner can be thought of as the Jacobi preconditioner of the absolute Hessian.

Recall that the inverse of the absolute Hessian |H|^{-1} is the only symmetric positive definite matrix that reduces the condition number to 1 (the proof of which can be found in Proposition 1 in the Appendix). It can be considered as the gold standard, if we do not take computational costs into account. For indefinite matrices, the diagonal of the Hessian H and the diagonal of the absolute Hessian |H| will be very different, and therefore the behavior of the Jacobi and equilibration preconditioners will also be very different.

In fact, we argue that the Jacobi preconditioner can cause divergence because it underestimates curvature. We can measure the amount of curvature in a given direction with the Rayleigh quotient

R(H, v) = (v^⊤ H v) / (v^⊤ v).    (11)

This quotient is large when there is a lot of curvature in the direction v. The Rayleigh quotient can be decomposed into R(H, v) = Σ_j^N λ_j v^⊤ q_j q_j^⊤ v, where λ_j and q_j are the eigenvalues and eigenvectors of H. It is easy to show that each element of the Jacobi matrix is given by D^J_ii = |R(H, I_{·,i})| = |Σ_j^N λ_j q²_{j,i}|, so the corresponding step size (D^J_ii)^{-1} is the inverse of the absolute value of a sum of eigenvalues λ_j. Negative eigenvalues will reduce the total sum and make the step much larger than it should be. Specifically, imagine a diagonal element where there are large positive and negative curvature eigendirections. The contributions of these directions will cancel each other and a large step will be taken in that direction. However, the function will probably also change fast in that direction (because of the high curvature), and the step is too large for the local quadratic approximation we have considered.

Algorithm 1 Equilibrated Gradient Descent
Require: function f(θ) to minimize, learning rate ε and damping factor λ
  D ← 0
  for i = 1 → K do
    v ∼ N(0, 1)
    D ← D + (Hv)²
    θ ← θ − ε ∇f(θ) / (√(D/i) + λ)
  end for

⁴Any random variable v_i with zero mean and unit variance can be used.

Equilibration methods never diverge this way because they will not underestimate curvature. In equilibration, the curvature information is given by the Rayleigh quotient of the squared Hessian: the equilibrated step size is (D^E_ii)^{-1} = (R(H², I_{·,i}))^{-1/2} = (Σ_j λ²_j q²_{j,i})^{-1/2}. Note that all the terms in this sum are positive and so will not cancel. Jensen's inequality then gives us the upper bound

(D^E_ii)^{-1} ≤ (|H|_ii)^{-1},    (12)

which ensures that the equilibrated adaptive learning rate will in fact be more conservative than the Jacobi preconditioner of the absolute Hessian (see Proposition 2 for proof).

This strengthens the links between equilibration and the absolute Hessian and may explain why equilibration has been found to work well for indefinite matrices (Bradley & Murray, 2011). We have verified this claim experimentally for random neural networks. The neural networks have 1 hidden layer of 100 sigmoid units with zero-mean, unit-variance Gaussian distributed inputs, weights and biases. The output layer is a softmax with the targets generated randomly. We also give results for similarly sampled logistic regressions. We compare the reductions of the condition number between the methods. Figure 2 gives the histograms of the condition number reductions. We obtained these graphs by sampling a hundred networks and computing the ratio of the condition number before and after preconditioning.
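Algorithm 1 (ESGD) can be sketched on a toy quadratic with an indefinite Hessian. The explicit H, the loop count K, and the values of ε and λ below are assumptions for the demo, not the paper's settings; in a network, the gradient and the product Hv would come from backprop and the R-operator:

```python
import numpy as np

# Illustrative ESGD sketch on f(x) = 0.5 x^T H x with a saddle Hessian.
rng = np.random.default_rng(3)
H = np.diag([100.0, 1.0, -1.0])           # mixed-sign curvature: a saddle
grad = lambda x: H @ x

eps, lam, K = 0.1, 1e-4, 500              # hand-picked toy hyper-parameters
x = np.array([1.0, 1.0, 1.0])
D = np.zeros_like(x)
for i in range(1, K + 1):
    v = rng.standard_normal(3)
    D += (H @ v) ** 2                     # running sum of (Hv)^2, Eq. 9
    x = x - eps * grad(x) / (np.sqrt(D / i) + lam)

# sqrt(D/i) estimates the equilibration diagonal (100, 1, 1): the two
# positive-curvature coordinates shrink, while the negative-curvature
# coordinate grows, i.e. the iterate escapes the saddle instead of stalling.
print(np.sqrt(D / K))
print(x)
```

Note how the step along the stiff first coordinate is divided by roughly 100 while the flat and negative-curvature coordinates get full-size steps, which is exactly the equal-curvature behavior preconditioning is meant to produce.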
On the left we have the convex case, and on the right the non-convex case. We\nclearly observe that the Jacobi and equilibration method are closely matched for the convex case.\nHowever, in the non-convex case equilibration signi\ufb01cantly outperforms the other methods. Note\nthat the poor performance of the Gauss-Newton diagonal only means that its success in optimization\nis not due to preconditioning. As we will see in Section 8 these results extend to practical high-\ndimensional problems.\n\n6\n\nImplementation\n\nThis method will estimate the same curvature information(cid:112)diag(H2) with the unbiased estimator\n\nWe propose to build a scalable algorithm for preconditioning neural networks using equilibration.\n\ndescribed in Equation 9.\nIt is prohibitive to compute the full expectation at each learning step.\nInstead we will simply update our running average at each learning step much like RMSProp. The\npseudo-code is given in Algorithm 1. The additional costs are one product with the Hessian, which is\nroughly the cost of two additional gradient calculations, and the sampling a random Gaussian vector.\nIn practice we greatly amortize the cost by only performing the update every 20 iterations. This\nbrings the cost of equilibration very close to that of regular SGD. The only added hyper-parameter\nis the damping \u03bb. We \ufb01nd that a good setting for that hyper-parameter is \u03bb = 10\u22124 and it is robust\nover the tasks we considered.\n\n6\n\n\f(a) MNIST\n\n(b) CURVES\n\nFigure 3: Learning curves for deep auto-encoders on a) MNIST and b) CURVES comparing the\ndifferent preconditioned SGD methods.\n\nIn the interest of comparison, we will evaluate SGD preconditioned with the Jacobi preconditioner.\nThis will allow us to verify the claims that the equilibration preconditioner is better suited for non-\nconvex problems. Bekas et al. 
(2007) show that the diagonal of a matrix can be recovered by the expression

diag(H) = E[v ⊙ Hv]    (13)

where v are random vectors with entries ±1 and ⊙ is the element-wise product. We use this estimator to precondition SGD in the same fashion as described in Algorithm 1. The variance of this estimator for an element i is Σ_j H²_{ji} − H²_{ii}, while the method in Martens et al. (2012) has H²_{ii}. Therefore, the optimal method depends on the situation. The computational complexity is the same as ESGD.

7 Experimental setup

We consider the challenging optimization benchmark of training very deep neural networks. Following Martens (2010); Sutskever et al. (2013); Vinyals & Povey (2011), we train deep auto-encoders which have to reconstruct their input under the constraint that one layer is very low-dimensional. The networks have up to 11 layers of sigmoidal hidden units and on the order of a million parameters. We use the standard network architectures described in Martens (2010) for the MNIST and CURVES datasets. Both datasets have 784 input dimensions, with 60,000 and 20,000 examples respectively.

We tune the hyper-parameters of the optimization methods with random search. We sampled the learning rate from a logarithmic scale over [0.01, 0.1] for stochastic gradient descent (SGD) and equilibrated SGD (ESGD). The learning rates for RMSProp and the Jacobi preconditioner are sampled from [0.0001, 0.001]. The damping factor λ used before dividing the gradient is taken from {10^{-4}, 10^{-5}, 10^{-6}}, while the exponential decay rate of RMSProp is taken from {0.9, 0.95}. The networks are initialized using the sparse initialization described in Martens (2010). The minibatch size for all methods is 200. We do not make use of momentum in these experiments in order to evaluate the strength of each preconditioning method on its own.
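The estimator of Equation 13 is straightforward to check numerically. In the sketch below (a toy example, not the paper's code), H is explicit only so the estimate can be compared with the true diagonal; in practice Hv would again come from the R-operator:

```python
import numpy as np

# Monte Carlo check of Eq. 13: diag(H) = E[v ⊙ Hv] with ±1 probe vectors.
rng = np.random.default_rng(4)
A = rng.standard_normal((8, 8))
H = (A + A.T) / 2                          # symmetric toy matrix

K = 100_000
V = rng.choice([-1.0, 1.0], size=(K, 8))   # rows are Rademacher probes v
est = np.mean(V * (V @ H), axis=0)         # average of v ⊙ Hv

print(np.max(np.abs(est - np.diag(H))))    # small absolute error
```

With ±1 entries the diagonal term v_i² H_ii contributes exactly H_ii every sample, so all the Monte Carlo noise comes from the off-diagonal terms, matching the variance expression Σ_j H²_{ji} − H²_{ii} quoted above.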
Similarly we do not\nuse any regularization because we are only concerned with optimization performance. For these\nreasons, we report training error in our graphs. The networks and algorithms were implemented\nusing Theano Bastien et al. (2012), simplifying the use of the R-operator in Jacobi and equilibrated\nSGD. All experiments were run on GPU\u2019s.\n\n8 Results\n8.1 Comparison of preconditioned SGD methods\n\nWe compare the different adaptive learning rates for training deep auto-encoders in Figure 3. We\ndon\u2019t use momentum to better isolate the performance of each method. We believe this is important\nbecause RMSProp has been found not to mix well with momentum (Tieleman & Hinton, 2012).\nThus the results presented are not state-of-the-art, but they do reach state of the art when momentum\nis used.\n\n7\n\n\f(a) MNIST\n\n(b) CURVES\n\nFigure 4: Cosine distance between the diagonals estimated by each method during the training of\na deep auto-encoder trained on a) MNIST and b) CURVES. We can see that RMSProp estimates a\nquantity close to the equilibration matrix.\n\nOur results on MNIST show that the proposed ESGD method signi\ufb01cantly outperforms both RM-\nSProp and Jacobi SGD. The difference in performance becomes especially notable after 250 epochs.\nSutskever et al. (2013) reported a performance of 2.1 of training MSE for SGD without momentum\nand we can see all adaptive learning rates improve on this result, with equilibration reaching 0.86.\nWe observe a convergence speed that is approximately three times faster then our baseline SGD.\nESGD also performs best for CURVES, although the difference with RMSProp and Jacobi SGD is\nnot as signi\ufb01cant as for MNIST. 
We show in the next section that the smaller gap in performance is\ndue to the different preconditioners behaving the same way on this dataset.\n\n8.2 Measuring the similarity of the methods\n\nDE = (cid:112)diag(H2) and Jacobi matrix DJ = (cid:112)diag(H)2 using 100 samples of the unbiased esti-\n\nWe train deep autoencoders with RMSProp and measure every 10 epochs the equilibration matrix\n\nmators described in Equations 9, respectively. We then measure the pairwise differences between\nthese quantities in terms of the cosine distance cosine(u, v) = 1\u2212 u\u00b7v\n(cid:107)u(cid:107)(cid:107)v(cid:107), which measures the angle\nbetween two vectors and ignores their norms.\nFigure 4 shows the resulting cosine distances over training on MNIST and CURVES. For the latter\ndataset we observe that RMSProp remains remarkably close (around 0.05) to equilibration, while it\nis signi\ufb01cantly different from Jacobi (in the order of 0.2). The same order of difference is observed\nwhen we compare equilibration and Jacobi, con\ufb01rming the observations of Section 5 that both quan-\ntities are rather different in practice. For the MNIST dataset we see that RMSProp fairly well esti-\n\nmates(cid:112)diag(H)2 in the beginning of training, but then quickly diverges. After 1000 epochs this\n\ndifference has exceeded the difference between Jacobi and equilibration, and RMSProp no longer\nmatches equilibration. Interestingly, at the same time that RMSProp starts diverging, we observe in\nFigure 3 that also the performance of the optimizer drops in comparison to ESGD. This may suggests\nthat the success of RMSProp as a optimizer is tied to its similarity to the equilibration matrix.\n9 Conclusion\n\nWe have studied diagonal preconditioners for saddle point problems i.e. inde\ufb01nite matrices. We have\nshown by theoretical and empirical arguments that the equilibration preconditioner is comparatively\nbetter suited to this kind of problems than the Jacobi preconditioner. 
Using this insight, we have pro-\nposed a novel adaptive learning rate schedule for non-convex optimization problems, called ESGD,\nwhich empirically outperformed RMSProp on two competitive deep autoencoder benchmark. In-\nterestingly, we have found that the update direction of RMSProp was in practice very similar to\nthe equilibrated update direction, which might provide more insight into why RMSProp has been\nso successfull in training deep neural networks. More research is required to con\ufb01rm these results.\nHowever, we hope that our \ufb01ndings will contribute to a better understanding of SGD\u2019s adaptive\nlearning rate schedule for large scale, non-convex optimization problems.\n\n8\n\n\fReferences\nBastien, Fr\u00b4ed\u00b4eric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian J., Bergeron,\nArnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements.\nDeep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.\n\nBekas, Costas, Kokiopoulou, Effrosyni, and Saad, Yousef. An estimator for the diagonal of a matrix.\n\nApplied numerical mathematics, 57(11):1214\u20131229, 2007.\n\nBradley, Andrew M and Murray, Walter. Matrix-free approximate equilibration. arXiv preprint\n\narXiv:1110.2805, 2011.\n\nChoromanska, Anna, Henaff, Mikael, Mathieu, Michael, Arous, Grard Ben, and LeCun, Yann. The\n\nloss surface of multilayer networks, 2014.\n\nDatta, Biswa Nath. Numerical Linear Algebra and Applications, Second Edition. SIAM, 2nd edition,\n\n2010. ISBN 0898716853, 9780898716856.\n\nDauphin, Yann, Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, Ganguli, Surya, and Bengio,\nIdentifying and attacking the saddle point problem in high-dimensional non-convex\n\nYoshua.\noptimization. In NIPS\u20192014, 2014.\n\nDuchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning\n\nand stochastic optimization. 
Journal of Machine Learning Research, 2011.\n\nGuggenheimer, Heinrich W., Edelman, Alan S., and Johnson, Charles R. A simple estimate of the\ncondition number of a linear system. The College Mathematics Journal, 26(1):pp. 2\u20135, 1995.\nISSN 07468342. URL http://www.jstor.org/stable/2687283.\n\nLeCun, Yann, Bottou, L\u00b4eon, Orr, Genevieve B., and M\u00a8uller, Klaus-Robert. Ef\ufb01cient backprop. In\nNeural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524. Springer\nVerlag, 1998.\n\nMartens, J. Deep learning via Hessian-free optimization. In ICML\u20192010, pp. 735\u2013742, 2010.\nMartens, James, Sutskever, Ilya, and Swersky, Kevin. Estimating the hessian by back-propagating\n\ncurvature. arXiv preprint arXiv:1206.6464, 2012.\n\nPascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. In Interna-\n\ntional Conference on Learning Representations 2014(Conference Track), April 2014.\n\nSchaul, Tom, Antonoglou, Ioannis, and Silver, David. Unit tests for stochastic optimization. arXiv\n\npreprint arXiv:1312.6055, 2013.\n\nSchraudolph, Nicol N. Fast curvature matrix-vector products for second-order gradient descent.\n\nNeural Computation, 14(7):1723\u20131738, 2002.\n\nSluis, AVD. Condition numbers and equilibration of matrices. Numerische Mathematik, 14(1):\n\n14\u201323, 1969.\n\nSutskever, Ilya, Martens, James, Dahl, George, and Hinton, Geoffrey. On the importance of initial-\n\nization and momentum in deep learning. In ICML, 2013.\n\nTieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5-rmsprop: Divide the gradient by a running\naverage of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.\nVinyals, Oriol and Povey, Daniel. Krylov subspace descent for deep learning. arXiv preprint\n\narXiv:1111.4259, 2011.\n\nZeiler, Matthew D. ADADELTA: an adaptive learning rate method. Technical report, arXiv\n\n1212.5701, 2012. 
URL http://arxiv.org/abs/1212.5701.\n\n9\n\n\f", "award": [], "sourceid": 921, "authors": [{"given_name": "Yann", "family_name": "Dauphin", "institution": "Facebook AI Research"}, {"given_name": "Harm", "family_name": "de Vries", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": "U. Montreal"}]}