{"title": "Adaptive Methods for Nonconvex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 9793, "page_last": 9803, "abstract": "Adaptive gradient methods that rely on scaling gradients down by the square root of exponential moving averages of past squared gradients, such RMSProp, Adam, Adadelta have found wide application in optimizing the nonconvex problems that arise in deep learning. However, it has been recently demonstrated that such methods can fail to converge even in simple convex optimization settings. In this work, we provide a new analysis of such methods applied to nonconvex stochastic optimization problems, characterizing the effect of increasing minibatch size. Our analysis shows that under this scenario such methods do converge to stationarity up to the statistical limit of variance in the stochastic gradients (scaled by a constant factor). In particular, our result implies that increasing minibatch sizes enables convergence,  thus providing a way to circumvent the non-convergence issues. Furthermore, we provide a new adaptive optimization algorithm, Yogi, which controls the increase in effective learning rate,  leading to even better performance with similar theoretical guarantees on convergence. Extensive experiments show that Yogi with very little hyperparameter tuning outperforms methods such as Adam in several challenging machine learning tasks.", "full_text": "Adaptive Methods for Nonconvex Optimization\n\nManzil Zaheer \u2217\nGoogle Research\n\nmanzilzaheer@google.com\n\nSashank J. 
Reddi \u2217\nGoogle Research\n\nsashank@google.com\n\nDevendra Sachan\n\nCarnegie Mellon University\ndsachan@andrew.cmu.edu\n\nSatyen Kale\n\nGoogle Research\n\nsatyenkale@google.com\n\nSanjiv Kumar\nGoogle Research\n\nsanjivk@google.com\n\nAbstract\n\nAdaptive gradient methods that rely on scaling gradients down by the square root of\nexponential moving averages of past squared gradients, such as RMSPROP, ADAM, and\nADADELTA, have found wide application in optimizing the nonconvex problems\nthat arise in deep learning. However, it has been recently demonstrated that such\nmethods can fail to converge even in simple convex optimization settings. In this\nwork, we provide a new analysis of such methods applied to nonconvex stochastic\noptimization problems, characterizing the effect of increasing minibatch size. Our\nanalysis shows that under this scenario such methods do converge to stationarity up\nto the statistical limit of variance in the stochastic gradients (scaled by a constant\nfactor). In particular, our result implies that increasing minibatch sizes enables\nconvergence, thus providing a way to circumvent the nonconvergence issues.\nFurthermore, we provide a new adaptive optimization algorithm, YOGI, which controls\nthe increase in effective learning rate, leading to even better performance with\nsimilar theoretical guarantees on convergence. Extensive experiments show that\nYOGI with very little hyperparameter tuning outperforms methods such as ADAM\nin several challenging machine learning tasks.\n\n1 Introduction\n\nWe study nonconvex stochastic optimization problems of the form\n\nmin_{x \u2208 R^d} f(x) := E_{s\u223cP}[\u2113(x, s)],    (1)\n\nwhere \u2113 is a smooth (possibly nonconvex) function and P is a probability distribution on the domain\nS \u2282 R^k. Optimization problems of this form arise naturally in machine learning, where x are the model\nparameters, \u2113 is the loss function and P is an unknown data distribution. 
Stochastic gradient descent\n(SGD) is the dominant method for solving such optimization problems, especially in nonconvex\nsettings. SGD [29] iteratively updates the parameters of the model by moving them in the direction of\nthe negative gradient computed on a minibatch, scaled by a step length typically referred to as the learning\nrate. One has to decay this learning rate as the algorithm proceeds in order to control the variance in\nthe stochastic gradients computed over a minibatch and thereby ensure convergence. Hand tuning\nthe learning rate decay in SGD is often painstakingly hard. To tackle this issue, several methods that\nautomatically decay the learning rate have been proposed. The first prominent algorithm in this\nline of research is ADAGRAD [7, 22], which uses a per-dimension learning rate based on squared\npast gradients. ADAGRAD achieved significant performance gains in comparison to SGD when the\ngradients are sparse.\n\n\u2217Equal Contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nAlthough ADAGRAD has been demonstrated to work well in sparse settings, it has been observed that\nits performance, unfortunately, degrades in dense and nonconvex settings. This degraded performance\nis often attributed to the rapid decay in the learning rate when gradients are dense, which is often\nthe case in many machine learning applications. Several methods have been proposed in the deep\nlearning literature to alleviate this issue. One such popular approach is to use gradients scaled down\nby square roots of exponential moving averages of squared past gradients instead of the cumulative sum\nof squared gradients used in ADAGRAD. The basic intuition behind these approaches is to adaptively tune\nthe learning rate based on only the recent gradients, thereby limiting the reliance of the update to\nonly the past few gradients. 
RMSPROP, ADAM, and ADADELTA are just a few of the many methods based on\nthis update mechanism [34, 16, 40].\nExponential moving average (EMA) based adaptive methods are very popular in the deep learning\ncommunity. These methods have been successfully employed in a plethora of applications. ADAM\nand RMSPROP, in particular, have been instrumental in achieving state-of-the-art results in many\napplications. At the same time, there have also been concerns about their convergence and\ngeneralization properties, indicating that despite their widespread use, our understanding of these algorithms\nis still very limited. Recently, [25] showed that EMA based adaptive methods may not converge to\nthe optimal solution even in simple convex settings when a constant minibatch size is used. Their\nanalysis relies on the fact that the effective learning rate (i.e., the learning rate parameter divided\nby the square root of an exponential moving average of squared past gradients) of EMA methods can\npotentially increase over time in a fairly quick manner, and for convergence it is important to have\nthe learning rate decrease over iterations, or at least have a controlled increase. This issue persists even\nif the learning rate parameter is decreased over iterations.\nDespite the problem of non-convergence demonstrated by [25], their work does not rule out\nconvergence in case the minibatch size is increased with time, thereby decreasing the variance of\nthe stochastic gradients. Increasing minibatch size has been shown to help convergence in a few\noptimization algorithms that are not based on EMA methods [10, 3].\n\nContributions. In light of this background, we state the main contributions of our paper.\n\n\u2022 We develop a convergence analysis for ADAM under certain useful parameter settings, showing\nconvergence to a stationary point up to the statistical limit of variance in the stochastic gradients\n(scaled by a constant factor) even for nonconvex problems. 
Our analysis implies that increasing\nbatch size will lead to convergence, as increasing batch size decreases variance linearly. Our\nwork thus provides a principled means to circumvent the non-convergence results of [25].\n\n\u2022 Inspired by our analysis of ADAM and the intuition that controlled increase of effective learning\nrate is essential for good convergence, we also propose a new algorithm (YOGI) for achieving\nadaptivity in SGD. Similar to the results in ADAM, we show convergence results with increasing\nminibatch size. Our analysis also highlights the interplay between level of \u201cadaptivity\u201d and\nconvergence of the algorithm.\n\n\u2022 We provide extensive empirical experiments for YOGI and show that it performs better than\nADAM in many state-of-the-art machine learning models. We also demonstrate that YOGI\nachieves similar, or better, results to best performance reported on these models without much\nhyperparameter tuning.\n\nRelated Work. The literature in stochastic optimization is vast; so we summarize a few very closely\nrelated works. SGD and its accelerated variants for smooth nonconvex problems are analyzed in\n[8]. Stochastic methods have also been employed in nonconvex \ufb01nite-sum problems where stronger\nresults can be shown [41, 24, 26, 27]. However, none of these methods are adaptive and can exhibit\npoor performance in ill-conditioned problems. All the aforementioned works show convergence\nto a stationary point. Recently, several \ufb01rst-order and second-order methods have been proposed\nthat are guaranteed to converge to local minima under certain conditions [1, 4, 14, 28]. However,\nthese methods are computationally expensive or exhibit slow convergence in practice, making them\nunsuitable for large-scale settings. Adaptive methods, that include ADAGRAD, RMSPROP, ADAM,\nADADELTA have been mostly studied in the convex setting but their analysis in the nonconvex setting\nis largely missing [7, 22, 40, 34, 16]. 
Normalized variants of SGD have been studied recently in\nnonconvex settings [10, 3].\nNotation. For any vectors a, b \u2208 R^d, we use \u221aa for element-wise square root, a^2 for element-wise\nsquare, and a/b to denote element-wise division. For any vector \u03b8_i \u2208 R^d, either \u03b8_{i,j} or [\u03b8_i]_j is used to\ndenote its jth coordinate, where j \u2208 [d].\n\n2 Preliminaries\n\nWe now formally state the definitions and assumptions used in this paper. We assume the function \u2113 is\nL-smooth, i.e., there exists a constant L such that\n\n\u2016\u2207\u2113(x, s) \u2212 \u2207\u2113(y, s)\u2016 \u2264 L\u2016x \u2212 y\u2016,    \u2200 x, y \u2208 R^d and s \u2208 S.\n\nFurthermore, we also assume that the function \u2113 has bounded gradient, i.e., \u2016\u2207[\u2113(x, s)]_i\u2016 \u2264 G for all\nx \u2208 R^d, s \u2208 S and i \u2208 [d]. Note that these assumptions trivially imply that the expected loss f defined in\n(1) is L-smooth, i.e., \u2016\u2207f(x) \u2212 \u2207f(y)\u2016 \u2264 L\u2016x \u2212 y\u2016 for all x, y \u2208 R^d. We also assume the following\nbound on the variance in stochastic gradients: E\u2016\u2207\u2113(x, s) \u2212 \u2207f(x)\u2016^2 \u2264 \u03c3^2 for all x \u2208 R^d. Such\nassumptions are typical in the analysis of stochastic first-order methods (cf. [8, 9]).\nWe analyze convergence rates of some popular adaptive methods for the above classes of functions.\nFollowing several previous works on nonconvex optimization [23, 8], we use \u2016\u2207f(x)\u2016^2 \u2264 \u03b4\nto measure the \u201cstationarity\u201d of the iterate x; we refer to such a solution as a \u03b4-accurate solution\u00b2.\nIn contrast, algorithms in the convex setting are typically analyzed with the suboptimality gap,\nf(x) \u2212 f(x\u2217), where x\u2217 is an optimal point, as the convergence criterion. 
However, it is not possible\nto provide meaningful guarantees for such criteria for general nonconvex problems due to the hardness\nof the problem. We also note that adaptive methods have historically been studied in the online convex\noptimization framework, where the notion of regret is used as a measure of convergence. This naturally\ngives convergence rates for the stochastic convex setting too. In this work, we focus on the stochastic\nnonconvex optimization setting since that is precisely the right model for risk minimization in machine\nlearning problems.\nTo simplify the exposition of results in the paper, we define the following measure of efficiency for a\nstochastic optimization algorithm:\nDefinition 1. Stochastic first-order (SFO) complexity of an algorithm is defined as the number of\ngradient evaluations of the function \u2113 with respect to its first argument made by the algorithm.\n\nSince our paper only deals with first-order methods, the efficiency of the algorithms can be measured\nin terms of SFO complexity to achieve a \u03b4-accurate solution. Throughout this paper, we hide the\ndependence of SFO complexity on L, G, \u2016x_0 \u2212 x\u2217\u2016^2 and f(x_0) \u2212 f(x\u2217) for a clean comparison.\nStochastic gradient descent (SGD) is one of the simplest algorithms for solving (1). The update at the\ntth iteration of SGD is of the following form:\n\nx_{t+1} = x_t \u2212 \u03b7_t g_t,    (SGD)\n\nwhere g_t = \u2207\u2113(x_t, s_t) and s_t is a random sample drawn from the distribution P. When the learning\nrate is decayed as \u03b7_t = 1/\u221at, one can obtain the following well-known result [8]:\nCorollary 1. 
The SFO complexity of SGD to obtain a \u03b4-accurate solution is O(1/\u03b4^2).\n\nIn practice, it is often tedious to tune the learning rate of SGD because rapid decay in learning rate\nlike \u03b7_t = 1/\u221at typically hurts the empirical performance in nonconvex settings. 
In the next section,\nwe investigate adaptive methods which partially circumvent this issue.\n\n\u00b2 Here we use \u03b4 instead of the standard \u03b5 in optimization and machine learning literature since the \u03b5 symbol is\nreserved for the description of some popular adaptive methods like ADAM.\n\n3 Algorithms\n\nIn this section, we discuss adaptive methods and analyze their convergence behavior in the nonconvex\nsetting. In particular, we focus on two algorithms: ADAM (3.1) and the proposed method, YOGI (3.2).\n\n3.1 Adam\n\nADAM is an adaptive method based on EMA, which is popular among the deep learning community\n[16]. EMA based adaptive methods were initially inspired by ADAGRAD and were proposed to\naddress the problem of rapid decay of learning rate in ADAGRAD. These methods scale down the\ngradient by the square roots of EMA of past squared gradients.\n\nAlgorithm 1 ADAM\nInput: x_1 \u2208 R^d, learning rate {\u03b7_t}_{t=1}^T, decay parameters 0 \u2264 \u03b2_1, \u03b2_2 \u2264 1, \u03b5 > 0\nSet m_0 = 0, v_0 = 0\nfor t = 1 to T do\n  Draw a sample s_t from P.\n  Compute g_t = \u2207\u2113(x_t, s_t).\n  m_t = \u03b2_1 m_{t\u22121} + (1 \u2212 \u03b2_1) g_t\n  v_t = v_{t\u22121} \u2212 (1 \u2212 \u03b2_2)(v_{t\u22121} \u2212 g_t^2)\n  x_{t+1} = x_t \u2212 \u03b7_t m_t/(\u221av_t + \u03b5)\nend for\n\nAlgorithm 2 YOGI\nInput: x_1 \u2208 R^d, learning rate {\u03b7_t}_{t=1}^T, parameters 0 < \u03b2_1, \u03b2_2 < 1, \u03b5 > 0\nSet m_0 = 0, v_0 = 0\nfor t = 1 to T do\n  Draw a sample s_t from P.\n  Compute g_t = \u2207\u2113(x_t, s_t).\n  m_t = \u03b2_1 m_{t\u22121} + (1 \u2212 \u03b2_1) g_t\n  v_t = v_{t\u22121} \u2212 (1 \u2212 \u03b2_2) sign(v_{t\u22121} \u2212 g_t^2) g_t^2\n  x_{t+1} = x_t \u2212 \u03b7_t m_t/(\u221av_t + \u03b5)\nend for\n\nThe pseudocode for ADAM is provided in Algorithm 1. The terms m_t and v_t in Algorithm 1 are\nEMA of the gradients and squared gradients respectively. 
Note that here, for the sake of clarity, the\ndebiasing step used in the original paper by [16] is removed but our results also 
apply to the debiased\nversion. A value of \u03b2_1 = 0.9, \u03b2_2 = 0.999 and \u03b5 = 10^{\u22128} is typically recommended in practice [16].\nThe \u03b5 parameter, which was initially designed to avoid precision issues in practical implementations,\nis often overlooked. However, it has been observed that very small \u03b5 in some applications has also\nresulted in performance issues, indicating that it has a role to play in the convergence of the algorithm.\nIntuitively, \u03b5 captures the amount of \u201cadaptivity\u201d in ADAM: larger values of \u03b5 imply weaker adaptivity\nsince \u03b5 dominates v_t in this case.\nVery recently, [25] has shown non-convergence of ADAM in simple online convex settings, assuming\nconstant minibatch sizes. These results naturally apply to the nonconvex setting too. It is, however,\ninteresting to consider the case of ADAM in the nonconvex setting with increasing batch sizes.\nTo this end, we prove the following convergence result for the nonconvex setting. In this paper, for the\nsake of simplicity, we analyze the case where \u03b2_1 = 0, which is typically referred to as RMSPROP.\nHowever, our analysis should extend to the general case as well.\nTheorem 1. Let \u03b7_t = \u03b7 for all t \u2208 [T]. Furthermore, assume that \u03b5, \u03b2_2 and \u03b7 are chosen such that\nthe following conditions are satisfied: \u03b7 \u2264 \u03b5/(2L) and 1 \u2212 \u03b2_2 \u2264 \u03b5^2/(16G^2). Then for x_t generated using ADAM\n(Algorithm 1), we have the following bound\n\nE\u2016\u2207f(x_a)\u2016^2 \u2264 O( (f(x_1) \u2212 f(x\u2217))/(\u03b7T) + \u03c3^2 ),\n\nwhere x\u2217 is an optimal solution to the problem in (1) and x_a is an iterate uniformly randomly chosen\nfrom {x_1, \u00b7\u00b7\u00b7, x_T}.\nAll the proofs are relegated to the Appendix due to space constraints. 
The above result shows\nthat ADAM achieves convergence to stationarity to within O(\u03c3^2) for constant\nlearning rate \u03b7, which is similar to the result for SGD with constant learning rate [8]. An immediate\nconsequence of this result is that increasing minibatch size can improve convergence. Specifically,\nthe above result assumes a minibatch size of 1. Suppose instead we use a minibatch size of b, and in\neach iteration of ADAM we average b stochastic gradients computed at the b samples in the minibatch.\nSince the samples in the minibatch are independent, the variance of the averaged stochastic gradient\nis at most \u03c3^2/b, a factor b lower than that of a single stochastic gradient. Plugging this variance bound into\nthe bound of Theorem 1, we conclude that increasing the minibatch size decreases the limiting\nexpected stationarity by a factor of b. Specifically, we have the following result, which is an immediate\nconsequence of Theorem 1 with fixed batch size b and constant learning rate.\nCorollary 2. For x_t generated using ADAM with constant \u03b7 (and parameters from Theorem 1), we\nhave\n\nE[\u2016\u2207f(x_a)\u2016^2] \u2264 O( 1/T + 1/b ),\n\nwhere x_a is an iterate uniformly randomly chosen from {x_1, \u00b7\u00b7\u00b7, x_T}.\nThe above result shows that ADAM obtains a point that has bounded stationarity in expectation, i.e.,\nE[\u2016\u2207f(x_a)\u2016^2] \u2264 O(1/b) as T \u2192 \u221e. Note that this does not necessarily imply that x_a is close\nto a stationary point, but a small bound is typically sufficient for many machine learning applications.\nTo ensure good SFO complexity, we need b = \u0398(T), which yields the following important corollary.\nCorollary 3. 
For ADAM with b = \u0398(T) and constant \u03b7 (and parameters from Theorem 1), we obtain\nE[\u2016\u2207f(x_a)\u2016^2] \u2264 O(1/T) and the SFO complexity for achieving a \u03b4-accurate solution is O(1/\u03b4^2).\n\nThe result simply follows by using batch size b = \u0398(T) and constant \u03b7 in Theorem 1. Note that this\nresult can be achieved using a constant learning rate and \u03b2_2. We defer further discussion of these\nbounds to the end of the section.\n\n3.2 Yogi\n\nThe key element behind ADAM is to use an adaptive gradient while ensuring that the learning rate\ndoes not decay quickly. To achieve this, ADAM uses an EMA which is, by nature, multiplicative. This\nleads to a situation where the past gradients are forgotten in a fairly fast manner. As pointed out in [25],\nthis can especially be problematic in sparse settings where gradients are rarely nonzero. An alternate\napproach to attain the same goal as ADAM is through additive updates. To this end, we propose\na simple additive adaptive method, YOGI, for optimizing the stochastic nonconvex optimization\nproblem of our interest. (Name derived from the Sanskrit word yuj meaning to add.)\nAlgorithm 2 provides the pseudocode for YOGI. Note that the update looks very similar to ADAGRAD\nexcept for the use of sign(v_{t\u22121} \u2212 g_t^2) in YOGI. Similar to ADAM, \u03b5 controls the amount of adaptivity\nin the method. The difference with ADAM is in the update of v_t. To gain more intuition for YOGI, let\nus compare its update rule with that of ADAM. The quantity v_t \u2212 v_{t\u22121} is \u2212(1 \u2212 \u03b2_2) sign(v_{t\u22121} \u2212 g_t^2) g_t^2\nin YOGI as opposed to \u2212(1 \u2212 \u03b2_2)(v_{t\u22121} \u2212 g_t^2) in ADAM. An important property of YOGI, which is\ncommon with ADAM, is that the difference of v_t and v_{t\u22121} depends only on v_{t\u22121} and g_t^2. 
However,\nunlike ADAM, the magnitude of this difference in YOGI only depends on g_t^2 as opposed to dependence\non both v_{t\u22121} and g_t^2 in ADAM. Note that when v_{t\u22121} is much larger than g_t^2, ADAM and YOGI increase\nthe effective learning rate. However, in this case it can be seen that ADAM can rapidly increase the\neffective learning rate while YOGI does it in a controlled fashion. As we shall see shortly, we often\nobserved improved empirical performance by adopting such a controlled increase in effective learning\nrate. Even in cases where rapid change in learning rate is desired, one can use YOGI with a smaller\nvalue of \u03b2_2 to mirror that behavior. Also, note that YOGI has the same O(d) computational and memory\nrequirements as ADAM, and is hence efficient to implement.\nSimilar to ADAM, we provide the following convergence result for YOGI in the nonconvex setting.\nTheorem 2. Let \u03b7_t = \u03b7 for all t \u2208 [T]. Furthermore, assume that \u03b5, \u03b2_2 and \u03b7 are chosen such that\nthe following conditions are satisfied: 1 \u2212 \u03b2_2 \u2264 \u03b5^2/(16G^2) and \u03b7 \u2264 \u03b5\u221a\u03b2_2/(2L). Then for x_t generated using YOGI\n(Algorithm 2), we have the following bound\n\nE\u2016\u2207f(x_a)\u2016^2 \u2264 O( (f(x_1) \u2212 f(x\u2217))/(\u03b7T) + \u03c3^2 ),\n\nwhere x\u2217 is an optimal solution to the problem in (1) and x_a is an iterate uniformly randomly chosen\nfrom {x_1, \u00b7\u00b7\u00b7, x_T}.\nThe convergence result is very similar to the result in Theorem 1. As before, the following results on\nbounded gradient norm with increasing batch size can be obtained as a simple corollary of Theorem 2.\nCorollary 4. For x_t generated using YOGI with constant \u03b7 (and parameters from Theorem 2), we\nhave\n\nE[\u2016\u2207f(x_a)\u2016^2] \u2264 O( 1/T + 1/b ),\n\nwhere x_a is an iterate uniformly randomly chosen from {x_1, \u00b7\u00b7\u00b7, x_T}.\nCorollary 5. 
YOGI with b = \u0398(T) and constant \u03b7 (and parameters from Theorem 2) has SFO\ncomplexity O(1/\u03b4^2) for achieving a \u03b4-accurate solution.\n\nFigure 1: Effect of adaptivity level \u03b5 on training MNIST deep autoencoder. We observe aggressive\nadaptivity leads to poor performance while a more controlled variation like in YOGI leads to a much\nbetter final performance.\n\nDiscussion about Theoretical Results. We would like to emphasize that the SFO complexity\nobtained here for ADAM or YOGI with large batch size is similar to that of SGD (see Corollary 1).\nWhile we stated our theoretical results with batch size b = \u0398(T) for the sake of simplicity, similar\nresults can be obtained for increasing minibatches b_t = \u0398(t). In practice, we believe a much weaker\nincrease in batch size is sufficient. In fact, when the variance is not large, our analysis shows that\na reasonably large batch size can work well. Before we proceed further, note that these are upper\nbounds and may not be completely reflective of the performance in practice. It is, however, instructive\nto note the relationship between different quantities of these algorithms in our results. In particular,\nthe amount of adaptivity that can be tolerated depends on the parameter \u03b2_2. This convergence analysis\nis useful when \u03b5/G is large when compared to 1 \u2212 \u03b2_2, i.e., the adaptivity level is moderate\u00b3. Recall that \u03b5\nhere is only a parameter of the algorithm and is not associated with accuracy of the solution. Typically,\nit is often desirable to have small \u03b5 in adaptive methods; however, as we shall see in the experiment\nsection, limiting the adaptivity level to a certain extent almost always improves the performance (e.g.,\nsee Tables 1 and 2, and Figure 1). 
For this reason, we also set the adaptivity level to a moderate value\nof \u03b5 = 10^{\u22123} for YOGI across all our experiments.\n\n4 Experiments\n\nBased on the insights gained from our theoretical analysis, we now present empirical results\nshowcasing three aspects of our framework: (i) the value gained by controlled variation in learning rate using\nYOGI, (ii) fast optimization, and (iii) wide applicability on several large-scale problems ranging from\nnatural language processing to computer vision. We run our experiments on a commodity machine\nwith an Intel\u00ae Xeon\u00ae E5-2630 v4 CPU, 256GB RAM, and 8 Nvidia\u00ae Titan Xp GPUs.\nExperimental Setup. We compare YOGI with highly tuned ADAM and the reported\nstate-of-the-art results for the setup. Typically, obtaining state-of-the-art results requires extensive\nhyperparameter tuning and carefully designed learning rate schedules (see e.g. Transformers\n[35][Equation 8]). However, for YOGI, we restrict ourselves to tuning just two scalar hyperparameters:\nthe learning rate and \u03b2_2. In all our experiments, YOGI and ADAM are always initialized from the\nsame point. Initialization of m_t and v_t is also important for YOGI and ADAM. These are often\ninitialized with 0 in conjunction with debiasing strategies [16]. Even with additional debiasing, such\nan initialization can result in very large learning rates at the start of the training procedure, thereby\nleading to convergence and performance issues. Thus, for YOGI, we propose to initialize v_t\nbased on the squared gradient evaluated at the initial point, averaged over a (reasonably large) mini-batch.\nDecreasing the learning rate is typically necessary for superior performance. To this end, we chose the\nsimple learning rate schedule of reducing the learning rate by a constant factor when the performance\nmetric plateaus on the validation/test set (commonly known as ReduceLRonPlateau). 
Inspired by\nour theoretical analysis, we set a moderate value of \u03b5 = 10^{\u22123} in YOGI for all the experiments in\norder to control the adaptivity. We will see that under such a simple setup, YOGI still achieves similar\nor better performance compared to highly task-specific tuned setups. Due to space constraints, a few\nexperiments are relegated to the Appendix (Section D).\n\nDeep AutoEncoder. For our first task, we train two deep autoencoder models from [12] called\n\u201cCURVES\u201d and \u201cMNIST\u201d, typically used as standard benchmarks for neural network optimization (see\ne.g. [21, 32, 36, 20]). The \u201cMNIST\u201d autoencoder consists of an encoder with layers of size\n(28\u00d728)-1000-500-250-30 and a symmetric decoder, totaling 2.8M parameters. The thirty units in the code\nlayer were linear and all the other units were logistic. The data set contains images of handwritten\ndigits 0\u20139. The pixel intensities were normalized to lie between 0 and 1. The \u201cCURVES\u201d autoencoder\nconsists of an encoder with layers of size (28\u00d728)-400-200-100-50-25-6 and a symmetric decoder\ntotaling 0.85M parameters. The six units in the code layer were linear and all the other units were\nlogistic. On the \u201cMNIST\u201d autoencoder, we perform significantly better than all prior results including\nADAM with a specially tuned learning rate. We also observed similar gains on the \u201cCURVES\u201d autoencoder\nwith a smaller \u03b2_2.\n\n\u00b3 Note that here, we have assumed the same bound |[\u2207\u2113(x, s)]_i| \u2264 G across all coordinates i \u2208 [d] for simplicity,\nbut our analysis can easily incorporate non-uniform bounds on gradients across coordinates.\n\n[Figure 1: two panels, Train Loss and Test Loss versus Iterations; curves: Adam 1e-3, Adam 1e-8, Yogi]\n\nTable 1: Train and test loss comparison for Deep AutoEncoders. Standard errors with 2\u03c3\nover 6 runs are shown. 
All our experiments were run for 5000 epochs utilizing the ReduceLRonPlateau\nschedule with patience of 20 epochs and decay factor of 0.5 with a batch size of 128.\n\nMethod | LR | \u03b2_1 | \u03b2_2 | \u03b5 | MNIST Train Loss | MNIST Test Loss\nPT + NCG [20] | - | - | - | - | 2.31 | 2.72\nRAND+HF [20] | - | - | - | - | 1.64 | 2.78\nPT + HF [20] | - | - | - | - | 1.63 | 2.46\nKSD [36] | - | - | - | - | 1.8 | 2.5\nHF [36] | - | - | - | - | 1.7 | 2.7\nADAM (Default) | 10^{\u22123} | 0.9 | 0.999 | 10^{\u22128} | 1.85 \u00b1 0.19 | 4.36 \u00b1 0.33\nADAM (Modified) | 10^{\u22123} | 0.9 | 0.999 | 10^{\u22123} | 0.91 \u00b1 0.04 | 1.88 \u00b1 0.07\nYOGI (Ours) | 10^{\u22122} | 0.9 | 0.9 | 10^{\u22123} | 0.78 \u00b1 0.02 | 1.70 \u00b1 0.03\nYOGI (Ours) | 10^{\u22122} | 0.9 | 0.999 | 10^{\u22123} | 0.88 \u00b1 0.02 | 1.36 \u00b1 0.01\n\nTable 2: Test BLEU score comparison for Base Transformer model [35]. The experiment of En-Vi\nwas run for 30 epochs using a batch size of 3K words/batch for source and target sentences, while\nEn-De was run for 20 epochs using a batch size of 32K words/batch for source and target sentences.\nWe utilized the ReduceLRonPlateau schedule with a patience of 5 epochs and a decay factor of 0.7.\nThe results reported here for our experiments are without checkpoint averaging; Adam+Noam: refers\nto the special custom learning rate schedule used in [35]; Avg: refers to checkpoint averaging used\nfor the reporting of En-De BLEU scores.\n\nMethod | LR | \u03b2_1 | \u03b2_2 | \u03b5 | BLEU En-Vi | BLEU En-De\nAdam+Noam+Avg [35] | - | - | - | - | - | 27.3\nAdam+Noam (tensor2tensor) [30] | - | - | - | - | 28.1 | -\nSGD+Mom [6] | - | - | - | - | 28.9 | -\nADAM (Default) | 10^{\u22124} | 0.9 | 0.999 | 10^{\u22128} | 27.92 \u00b1 0.22 | -\nADAM (Modified) | 10^{\u22124} | 0.9 | 0.999 | 10^{\u22123} | 28.28 \u00b1 0.29 | -\nADAM+Noam | - | 0.9 | 0.997 | 10^{\u22129} | 28.84 \u00b1 0.24 | -\nYOGI (Ours) | 10^{\u22123} | 0.9 | 0.99 | 10^{\u22123} | 
29.27 \u00b1 0.07 | 27.2\n\nNeural Machine Translation. 
As a large-scale experiment, we use the Transformer (TF) model\n[35] for machine translation, which has recently gained a lot of attention. In TF, both encoder\nand decoder consists of only self-attention and feed-forward layers. TF networks are known to be\nnotoriously hard to optimize. The original paper proposed a method based on linearly increasing the\nlearning rate for a speci\ufb01ed number of optimization steps followed by inverse square root decay. In\nour experiments, we use the same 6-layer 8-head TF network described in the original paper: The\nposition-wise feed-forward networks have 512 dimensional input/output with a 2048 dimensional\nhidden layer and ReLU activation. Layer normalization [2] is applied at the input and residual\nconnections are added at the output of each sublayer. Word embeddings between encoder and decoder\nare shared and the softmax layers are tied. We perform experiments on the IWSLT\u201915 En-Vi [18] and\nWMT\u201914 En-De datasets with the standard train, validation and test splits. These datasets consist of\n133K and 4.5M sentences respectively. In both the experiments, we encode the sentences using 32K\nmerge operations using byte-pair encoding [31]. Due to very large-scale nature of the En-De dataset,\nwe only report the performance of YOGI on it. As seen in Table 2, with very little parameter tuning,\nYOGI is able to obtain much better BLEU score over previously reported results on En-Vi dataset\nand is directly competitive on En-De, without using any ensembling techniques such as checkpoint\naveraging.\nResNets and DenseNets. For our next experiment, we use YOGI to train ResNets [11] and\nDenseNets [13], which are very popular architectures, producing state-of-the-art results across\nmany computer vision tasks. Training these networks typically requires careful selection of learning\nrates. 
It is widely believed that adaptive methods yield inferior performance for these types of networks [15]. We attempt to tackle this challenging problem on the CIFAR-10 dataset.\n\nTable 3: Test accuracy for ResNets on CIFAR-10. Standard errors with 2\u03c3 over 6 runs are shown. All our experiments were run for 500 epochs utilizing the ReduceLRonPlateau schedule with patience of 20 epochs and decay factor of 0.5 with a batch size of 128. We also report numbers from the original paper for reference, which employs a highly tuned learning rate schedule.\n\nMethod | LR | \u03b21 | \u03b22 | \u03b5 | Test Accuracy (%) ResNet20 | Test Accuracy (%) ResNet50\nSGD+Mom [11] | - | - | - | - | 91.25 | 93.03\nADAM (Default) | 10^-3 | 0.9 | 0.999 | 10^-8 | 90.37 \u00b1 0.24 | 92.59 \u00b1 0.23\nADAM (Default) | 10^-2 | 0.9 | 0.999 | 10^-8 | 89.11 \u00b1 0.22 | 88.82 \u00b1 0.33\nADAM (Modified) | 10^-3 | 0.9 | 0.999 | 10^-3 | 89.99 \u00b1 0.30 | 91.74 \u00b1 0.33\nADAM (Modified) | 10^-2 | 0.9 | 0.999 | 10^-3 | 92.56 \u00b1 0.14 | 93.42 \u00b1 0.16\nYOGI (Ours) | 10^-2 | 0.9 | 0.999 | 10^-3 | 92.62 \u00b1 0.17 | 93.90 \u00b1 0.21\n\nTable 4: Test accuracy for DenseNet on CIFAR-10. Standard errors with 2\u03c3 over 6 runs are shown. All our experiments were run for 300 epochs utilizing the ReduceLRonPlateau schedule with patience of 20 epochs and decay factor of 0.5 with a batch size of 64. We also report numbers from the original paper for reference, which employs a highly tuned learning rate schedule.\n\nMethod | LR | \u03b21 | \u03b22 | \u03b5 | Test Accuracy (%)\nSGD+Mom [13] | - | - | - | - | 94.76\nADAM (Default) | 10^-3 | 0.9 | 0.999 | 10^-8 | 92.53 \u00b1 0.20\nADAM (Modified) | 10^-3 | 0.9 | 0.999 | 10^-3 | 93.35 \u00b1 0.21\nYOGI (Ours) | 10^-2 | 0.9 | 0.999 | 10^-3 | 94.38 \u00b1 0.26\n\n
For the ResNet experiment, we select a small ResNet with 20 layers and a medium-sized ResNet with 50 layers (the same as those used in the original ResNet paper [11] for CIFAR-10). For the DenseNet experiment, we use a DenseNet with 40 layers and growth rate k = 12 without bottleneck, channel reduction, or dropout. We adopt a standard, widely used data augmentation scheme (mirroring/shifting). As seen from Tables 3 and 4, without any tuning, our default parameter setting achieves state-of-the-art results for these networks.\n\nDeepSets. We also evaluate our approach on the task of classification of point clouds. We use DeepSets [39] to classify point-cloud representations on the benchmark ModelNet40 dataset [38]. We use the same network described in the DeepSets paper: the network consists of 3 permutation-equivariant layers with 256 channels followed by max-pooling over the set structure. The resulting vector representation of the set is then fed to a fully connected layer with 256 units followed by a 40-way softmax unit. We use tanh activation at all layers and apply dropout on the layers after set-max-pooling (two dropout operations) with a 50% dropout rate. Consistent with our previous experiments, we observe that YOGI outperforms ADAM and obtains better results than those reported in the original paper (Table 6).\n\nNamed Entity Recognition (NER). Finally, we test YOGI on a sequence labeling task in NLP involving recurrent neural networks. We use the popular CNN-LSTM-CRF model [5], [19] for the NER task. In this model, multiple CNN filters of different widths are used to encode morphological and lexical information observed in characters. A word-level bidirectional LSTM layer models the long-range dependence structure in texts. A linear-chain CRF layer models the sequence-level data likelihood, while inference is performed using the Viterbi algorithm. In our experiments, we use the BC5CDR biomedical data [17] (footnote 4). 
The CNN-LSTM model comprises 1400 CNN filters of widths [1-7], 300-dimensional pre-trained word embeddings, a single-layer 256-dimensional bidirectional LSTM, and a dropout probability of 0.5 applied to the word embeddings and the LSTM output layer. We use exact match to evaluate the F1 score of our approach. The results are presented in Table 7 (Appendix). YOGI, again, performs better than ADAM and achieves a better F1 score than the one previously reported.\n\n4 http://www.biocreative.org/tasks/biocreative-v/track-3-cdr/\n\nFigure 2: Comparison of highly tuned RMSProp optimizer with YOGI for Inception-Resnet-v2 on Imagenet. The first plot shows the mini-batch estimate of the loss during training, while the remaining two plots show top-1 and top-5 error rates on the held-out Imagenet validation set. We observe that YOGI default (learning rate of 10^-2, \u03b21 = 0.9, \u03b22 = 0.999, \u03b5 = 10^-3) not only achieves better generalization but also reaches a similar accuracy level to the tuned optimizer more than 2x faster.\n\n5 Discussion\n\nWe discuss our observations about the empirical results and suggest some practical guidelines for using YOGI. As we have shown, YOGI tends to outperform ADAM and highly hand-tuned SGD across various architectures and datasets. It is also interesting to note that YOGI typically has a smaller variance across different runs when compared to ADAM. Furthermore, very little hyperparameter tuning was used for YOGI to remain faithful to our goal. Inspired by our theoretical analysis, we fixed \u03b5 = 10^-3 and \u03b21 = 0.9 in all experiments. We also observed that an initial learning rate \u03b7 in the range [10^-3, 10^-2] and \u03b22 \u2208 {0.9, 0.99, 0.999} appear to work well in most settings. As a general remark, the learning rate for YOGI seems to be around 5 to 10 times higher than that of ADAM. 
For decreasing the learning rate, we used the standard heuristic of ReduceLRonPlateau (footnote 5; see Section 4 for more details). In general, the patience of the ReduceLRonPlateau scheduler will depend on data size, but in most experiments we saw that a patience of 5,000-10,000 gradient steps works well with a decay factor of 0.5.\n\nFinally, we would like to emphasize that other popular learning rate decay schedules different from ReduceLRonPlateau can also be used for YOGI. Also, it is easy to deploy YOGI in existing pipelines to provide good performance with minimal tuning. To illustrate these points, we conducted an experiment on Inception-Resnet-v2 [33], a high-performing model on the ImageNet 2012 classification challenge involving 1.28M images and 1,000 classes. In order to achieve the best results, [33] employed a heavily tuned RMSPROP optimizer with learning rate of 0.045, momentum decay of 0.9, \u03b2 = 0.9, and \u03b5 = 1. The learning rate of RMSPROP is decayed every 3 epochs by a factor of 0.946. With this setting of RMSPROP and batch size of 320, we obtained a top-1 and top-5 error of 21.63% and 5.79% respectively, under single crop evaluation on the Imagenet validation set comprising 50K images (slightly lower than the published results). However, using YOGI with essentially the default parameters (i.e. learning rate of 10^-2, \u03b21 = 0.9, \u03b22 = 0.999, \u03b5 = 10^-3) and an alternate decay schedule of reducing the learning rate by 0.7 every 3 epochs achieves better performance (top-1 and top-5 error of 21.14% and 5.46% respectively). Furthermore, it reaches a similar accuracy level to the tuned optimizer more than 2x faster, as shown in Figure 2, demonstrating the efficacy of YOGI.\n\nReferences\n\n[1] Naman Agarwal, Zeyuan Allen Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima for nonconvex optimization in linear time. CoRR, abs/1611.01146, 2016.\n\n[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 
Layer normalization. stat, 1050:21, 2016.\n\n[3] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD: Compressed optimisation for non-convex problems. 2018. arXiv:1802.04434.\n\n[4] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for non-convex optimization. CoRR, abs/1611.00756, 2016.\n\n[5] Jason PC Chiu and Eric Nichols. Named entity recognition with bidirectional lstm-cnns. arXiv preprint arXiv:1511.08308, 2015.\n\n5 https://keras.io/callbacks\n6 The code for Inception-Resnet-v2 is available at https://github.com/tensorflow/models/blob/master/research/slim/train_image_classifier.py.\n\n[6] Kevin Clark. Semi-supervised learning for nlp. http://web.stanford.edu/class/cs224n/lectures/lecture17.pdf, 2018.\n\n[7] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121\u20132159, 2011.\n\n[8] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341\u20132368, 2013.\n\n[9] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267\u2013305, 2014.\n\n[10] Elad Hazan, Kfir Levy, and Shai Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems, pages 1585\u20131593, 2015.\n\n[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770\u2013778, 2016.\n\n[12] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504\u2013507, 2006.\n\n[13] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, volume 1, page 3, 2017.\n\n[14] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. CoRR, abs/1703.00887, 2017.\n\n[15] Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from adam to sgd. arXiv preprint arXiv:1712.07628, 2017.\n\n[16] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.\n\n[17] Xingguo Li, Tuo Zhao, Raman Arora, Han Liu, and Jarvis Haupt. Stochastic variance reduced optimization for nonconvex sparse learning. In ICML, 2016. arXiv:1605.02711.\n\n[18] Minh-Thang Luong and Christopher D. Manning. Stanford neural machine translation systems for spoken language domain. In International Workshop on Spoken Language Translation, Da Nang, Vietnam, 2015.\n\n[19] Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:1603.01354, 2016.\n\n[20] James Martens. Deep learning via hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 735\u2013742, 2010.\n\n[21] James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408\u20132417, 2015.\n\n[22] H. Brendan McMahan and Matthew J. Streeter. 
Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference On Learning Theory, pages 244\u2013256, 2010.\n\n[23] Yurii Nesterov. Introductory Lectures On Convex Optimization: A Basic Course. Springer, 2003.\n\n[24] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnab\u00e1s P\u00f3czos, and Alexander J. Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 314\u2013323, 2016.\n\n[25] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the Convergence of Adam & Beyond. In Proceedings of the 6th International Conference on Learning Representations, 2018.\n\n[26] Sashank J. Reddi, Suvrit Sra, Barnab\u00e1s P\u00f3czos, and Alexander J. Smola. Fast incremental method for nonconvex optimization. CoRR, abs/1603.06159, 2016.\n\n[27] Sashank J. Reddi, Suvrit Sra, Barnab\u00e1s P\u00f3czos, and Alexander J. Smola. Fast stochastic methods for nonsmooth nonconvex optimization. CoRR, abs/1605.06900, 2016.\n\n[28] Sashank J. Reddi, Manzil Zaheer, Suvrit Sra, Barnab\u00e1s P\u00f3czos, Francis Bach, Ruslan Salakhutdinov, and Alexander J. Smola. A generic approach for escaping saddle points. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 1233\u20131242, 2018.\n\n[29] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400\u2013407, 1951.\n\n[30] Stefan Schweter. Neural machine translation system for english to vietnamese. https://github.com/stefan-it/nmt-en-vi, 2018.\n\n[31] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.\n\n[32] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 
On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139\u20131147, 2013.\n\n[33] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.\n\n[34] T. Tieleman and G. Hinton. RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.\n\n[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000\u20136010, 2017.\n\n[36] Oriol Vinyals and Daniel Povey. Krylov subspace descent for deep learning. In AISTATS, pages 1261\u20131268, 2012.\n\n[37] Xuan Wang, Yu Zhang, Xiang Ren, Yuhao Zhang, Marinka Zitnik, Jingbo Shang, Curtis P. Langlotz, and Jiawei Han. Cross-type biomedical named entity recognition with deep multi-task learning. CoRR, abs/1801.09851, 2018.\n\n[38] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912\u20131920, 2015.\n\n[39] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3394\u20133404, 2017.\n\n[40] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. CoRR, abs/1212.5701, 2012.\n\n[41] Zeyuan Allen Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. CoRR, abs/1603.05643, 2016.", "award": [], "sourceid": 6407, "authors": [{"given_name": "Manzil", "family_name": "Zaheer", "institution": "Google"}, {"given_name": "Sashank", "family_name": "Reddi", "institution": "Google"}, {"given_name": "Devendra", "family_name": "Sachan", "institution": "Carnegie Mellon University"}, {"given_name": "Satyen", "family_name": "Kale", "institution": "Google"}, {"given_name": "Sanjiv", "family_name": "Kumar", "institution": "Google Research"}]}