Training Deep Networks without Learning Rates Through Coin Betting

Advances in Neural Information Processing Systems (NIPS 2017), pages 2160-2170

Francesco Orabona*
Department of Computer Science
Stony Brook University, Stony Brook, NY
francesco@orabona.com

Tatiana Tommasi*
Department of Computer, Control, and Management Engineering
Sapienza, Rome University, Italy
tommasi@dis.uniroma1.it

Abstract

Deep learning methods achieve state-of-the-art performance in many application scenarios. Yet, these methods require a significant amount of hyperparameter tuning in order to achieve the best results. In particular, tuning the learning rates in the stochastic optimization process is still one of the main bottlenecks. In this paper, we propose a new stochastic gradient descent procedure for deep networks that does not require any learning rate setting.
Contrary to previous methods, we do not adapt the learning rates, nor do we make use of the assumed curvature of the objective function. Instead, we reduce the optimization process to a game of betting on a coin and propose a learning-rate-free optimal algorithm for this scenario. Theoretical convergence is proven for convex and quasi-convex functions, and empirical evidence shows the advantage of our algorithm over popular stochastic gradient algorithms.

1 Introduction

In the last years deep learning has demonstrated great success in a large number of fields and has attracted the attention of various research communities, with the consequent development of multiple coding frameworks (e.g., Caffe [Jia et al., 2014], TensorFlow [Abadi et al., 2015]) and the diffusion of blogs, online tutorials, books, and dedicated courses. Besides reaching out to scientists with different backgrounds, the need for all these supportive tools also originates from the nature of deep learning: it is a methodology that involves many structural details as well as several hyperparameters whose importance has been growing with the recent trend of designing deeper and multi-branch networks. Some of the hyperparameters define the model itself (e.g., number of hidden layers, regularization coefficients, kernel size for convolutional layers), while others are related to the model training procedure. In both cases, hyperparameter tuning is a critical step to realize the full potential of deep learning, and most of the knowledge in this area comes from living practice, years of experimentation, and, to some extent, mathematical justification [Bengio, 2012].

With respect to the optimization process, stochastic gradient descent (SGD) has proved itself to be a key component of the deep learning success, but its effectiveness strictly depends on the choice of the initial learning rate and of the learning rate schedule.
This has primed a line of research on algorithms to reduce the hyperparameter dependence in SGD; see Section 2 for an overview of the related literature. However, all previous algorithms either resort to adapting the learning rates, rather than removing them, or rely on assumptions about the shape of the objective function.

In this paper we aim at removing at least one of the hyperparameters of deep learning models. We leverage recent advancements in the stochastic optimization literature to design a backpropagation procedure that does not have a learning rate at all, yet is as simple as vanilla SGD. Specifically, we reduce the SGD problem to the game of betting on a coin (Section 4). In Section 5, we present a novel strategy to bet on a coin that extends previous ones in a data-dependent way, proving an optimal convergence rate in the convex and quasi-convex settings (defined in Section 3). Furthermore, we propose a variant of our algorithm for deep networks (Section 6). Finally, we show how our algorithm outperforms popular optimization methods in the deep learning literature on a variety of architectures and benchmarks (Section 7).

*The authors contributed equally.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Related Work

Stochastic gradient descent offers several challenges in terms of convergence speed. Hence, the topic of learning rate setting has been largely investigated. Some of the existing solutions are based on the use of carefully tuned momentum terms [LeCun et al., 1998b, Sutskever et al., 2013, Kingma and Ba, 2015]. It has been demonstrated that these terms can speed up convergence for convex smooth functions [Nesterov, 1983].
Other strategies propose scale-invariant learning rate updates to deal with gradients whose magnitude changes in each layer of the network [Duchi et al., 2011, Tieleman and Hinton, 2012, Zeiler, 2012, Kingma and Ba, 2015]. Indeed, scale-invariance is a well-known important feature that has also received attention outside of the deep learning community [Ross et al., 2013, Orabona et al., 2015, Orabona and Pal, 2015]. Yet, neither of these approaches avoids the use of a learning rate.

A large family of algorithms exploits a second-order approximation of the cost function to better capture its local geometry and avoid the manual choice of a learning rate. The step size is automatically adapted to the cost function, with larger/shorter steps in case of shallow/steep curvature. Quasi-Newton methods [Wright and Nocedal, 1999] as well as the natural gradient method [Amari, 1998] belong to this family. Although effective in general, they have a spatial and computational complexity that is quadratic in the number of parameters with respect to first-order methods, which makes these approaches unfeasible in modern deep learning architectures. Hence, the required matrices are typically approximated with diagonal ones [LeCun et al., 1998b, Schaul et al., 2013]. Nevertheless, even assuming the use of the full information, it is currently unclear whether the objective functions in deep learning have enough curvature to guarantee any gain.

There exists a line of work on unconstrained stochastic gradient descent without learning rates [Streeter and McMahan, 2012, Orabona, 2013, McMahan and Orabona, 2014, Orabona, 2014, Cutkosky and Boahen, 2016, 2017]. The latest advancement in this direction is the strategy of reducing stochastic subgradient descent to coin betting, proposed by Orabona and Pal [2016].
However, their proposed betting strategy is worst-case with respect to the gradients received and cannot take advantage, for example, of sparse gradients.

3 Definitions

We now introduce the basic notions of convex analysis that are used in the paper; see, e.g., Bauschke and Combettes [2011]. We denote by $\|\cdot\|_1$ the 1-norm in $\mathbb{R}^d$. Let $f : \mathbb{R}^d \to \mathbb{R} \cup \{\pm\infty\}$; the Fenchel conjugate of $f$ is $f^* : \mathbb{R}^d \to \mathbb{R} \cup \{\pm\infty\}$ with $f^*(\theta) = \sup_{x \in \mathbb{R}^d} \theta^\top x - f(x)$.

A vector $x$ is a subgradient of a convex function $f$ at $v$ if $f(v) - f(u) \le (v - u)^\top x$ for any $u$ in the domain of $f$. The differential set of $f$ at $v$, denoted by $\partial f(v)$, is the set of all the subgradients of $f$ at $v$. If $f$ is also differentiable at $v$, then $\partial f(v)$ contains a single vector, denoted by $\nabla f(v)$, which is the gradient of $f$ at $v$.

We go beyond convexity using the definition of weak quasi-convexity in Hardt et al. [2016]. This definition is relevant for us because Hardt et al. [2016] proved that $\tau$-weakly-quasi-convex objective functions arise in the training of linear recurrent networks. A function $f : \mathbb{R}^d \to \mathbb{R}$ is $\tau$-weakly-quasi-convex over a domain $B \subseteq \mathbb{R}^d$ with respect to the global minimum $v^*$ if there is a positive constant $\tau > 0$ such that for all $v \in B$, $f(v) - f(v^*) \le \tau \, (v - v^*)^\top \nabla f(v)$. From the definition, it follows that differentiable convex functions are also 1-weakly-quasi-convex.

Betting on a coin. We will reduce the stochastic subgradient descent procedure to betting on a number of coins. Hence, here we introduce the betting scenario and its notation. We consider a gambler making repeated bets on the outcomes of adversarial coin flips. The gambler starts with initial money $\epsilon > 0$.
In each round $t$, he bets on the outcome of a coin flip $g_t \in \{-1, 1\}$, where $+1$ denotes heads and $-1$ denotes tails. We do not make any assumption on how $g_t$ is generated.

The gambler can bet any amount on either heads or tails. However, he is not allowed to borrow any additional money. If he loses, he loses the betted amount; if he wins, he gets the betted amount back and, in addition to that, he gets the same amount as a reward. We encode the gambler's bet in round $t$ by a single number $w_t$. The sign of $w_t$ encodes whether he is betting on heads or tails. The absolute value encodes the betted amount. We define $\mathrm{Wealth}_t$ as the gambler's wealth at the end of round $t$ and $\mathrm{Reward}_t$ as the gambler's net reward (the difference between the wealth and the initial money), that is

$$\mathrm{Wealth}_t = \epsilon + \sum_{i=1}^{t} w_i g_i \qquad \text{and} \qquad \mathrm{Reward}_t = \mathrm{Wealth}_t - \epsilon = \sum_{i=1}^{t} w_i g_i. \tag{1}$$

In the following, we will also refer to a bet with $\beta_t$, where $\beta_t$ is such that

$$w_t = \beta_t \, \mathrm{Wealth}_{t-1}. \tag{2}$$

The absolute value of $\beta_t$ is the fraction of the current wealth to bet, and its sign encodes whether he is betting on heads or tails. The constraint that the gambler cannot borrow money implies that $\beta_t \in [-1, 1]$. We also slightly generalize the problem by allowing the outcome of the coin flip $g_t$ to be any real number in $[-1, 1]$, that is, a continuous coin; wealth and reward in (1) remain the same.

4 Subgradient Descent through Coin Betting

In this section, following Orabona and Pal [2016], we briefly explain how to reduce subgradient descent to the gambling scenario of betting on a coin. Consider as an example the function $F(x) := |x - 10|$ and the optimization problem $\min_x F(x)$. This function does not have any curvature; in fact, it is not even differentiable, thus no second-order optimization algorithm could reliably be used on it.
We set the outcome of the coin flip $g_t$ to be equal to the negative subgradient of $F$ in $w_t$, that is, $g_t \in \partial[-F(w_t)]$, where we remind that $w_t$ is the amount of money we bet. Given our choice of $F(x)$, its negative subgradients are in $\{-1, 1\}$. In the first iteration we do not bet, hence $w_1 = 0$, and our initial money is \$1. Let's also assume that there exists a function $H(\cdot)$ such that our betting strategy will guarantee that the wealth after $T$ rounds will be at least $H(\sum_{t=1}^{T} g_t)$ for any arbitrary sequence $g_1, \cdots, g_T$.

We claim that the average of the bets, $\frac{1}{T}\sum_{t=1}^{T} w_t$, converges to the solution of our optimization problem, and the rate depends on how good our betting strategy is. Let's see how. Denoting by $x^*$ the minimizer of $F(x)$, the following holds:

$$F\!\left(\frac{1}{T}\sum_{t=1}^{T} w_t\right) - F(x^*) \le \frac{1}{T}\sum_{t=1}^{T} F(w_t) - F(x^*) \le \frac{1}{T}\sum_{t=1}^{T} g_t x^* - \frac{1}{T}\sum_{t=1}^{T} g_t w_t$$
$$\le \frac{1}{T}\sum_{t=1}^{T} g_t x^* - \frac{1}{T} H\!\left(\sum_{t=1}^{T} g_t\right) + \frac{1}{T} \le \frac{1}{T}\max_v \left(v x^* - H(v)\right) + \frac{1}{T} = \frac{H^*(x^*) + 1}{T},$$

where in the first inequality we used Jensen's inequality, in the second the definition of subgradients, in the third our assumption on $H$, and in the last equality the definition of the Fenchel conjugate of $H$.

In words, we used a gambling algorithm to find the minimizer of a non-smooth objective function by accessing its subgradients. All we need is a good gambling strategy. Note that this is just a very simple one-dimensional example, but the outlined approach works in any dimension and for any convex objective function, even if we just have access to stochastic subgradients [Orabona and Pal, 2016].
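The one-dimensional reduction just described can be sketched in a few lines of Python. This is our own illustrative sketch (function and variable names are ours, not the authors' code), using the simple betting fraction $\beta_t = (\sum_{i=1}^{t-1} g_i)/t$ of Orabona and Pal [2016]:

```python
# Sketch (ours): minimize F(x) = |x - 10| through coin betting,
# with the fraction beta_t = (g_1 + ... + g_{t-1}) / t and initial money 1.

def coin_betting_average(T=50000, target=10.0):
    wealth, theta, avg = 1.0, 0.0, 0.0
    for t in range(1, T + 1):
        beta = theta / t                  # fraction of current wealth to bet
        w = beta * wealth                 # signed bet = current iterate w_t
        g = (w < target) - (w > target)   # coin outcome: negative subgradient of F
        wealth += w * g                   # win or lose the betted amount
        theta += g                        # running sum of the outcomes
        avg += (w - avg) / t              # running average (1/T) sum_t w_t
    return avg
```

As the derivation above predicts, the average of the bets approaches the minimizer $x^* = 10$ at the rate $(H^*(x^*)+1)/T$.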
In particular, if the gradients are bounded in a range, the same reduction works using a continuous coin.

Orabona and Pal [2016] showed that the simple betting strategy $\beta_t = \frac{\sum_{i=1}^{t-1} g_i}{t}$ gives an optimal growth rate of the wealth and optimal worst-case convergence rates. However, it is not data-dependent, so it does not adapt to the sparsity of the gradients. In the next section, we will show an actual betting strategy that guarantees optimal convergence rate and adaptivity to the gradients.

Algorithm 1 COntinuous COin Betting - COCOB
1: Input: $L_i > 0$, $i = 1, \cdots, d$; $w_1 \in \mathbb{R}^d$ (initial parameters); $T$ (maximum number of iterations); $F$ (function to minimize)
2: Initialize: $G_{0,i} \leftarrow L_i$, $\mathrm{Reward}_{0,i} \leftarrow 0$, $\theta_{0,i} \leftarrow 0$, $i = 1, \cdots, d$
3: for $t = 1, 2, \ldots, T$ do
4:   Get a (negative) stochastic subgradient $g_t$ such that $\mathbb{E}[g_t] \in \partial[-F(w_t)]$
5:   for $i = 1, 2, \ldots, d$ do
6:     Update the sum of the absolute values of the subgradients: $G_{t,i} \leftarrow G_{t-1,i} + |g_{t,i}|$
7:     Update the reward: $\mathrm{Reward}_{t,i} \leftarrow \mathrm{Reward}_{t-1,i} + (w_{t,i} - w_{1,i}) g_{t,i}$
8:     Update the sum of the gradients: $\theta_{t,i} \leftarrow \theta_{t-1,i} + g_{t,i}$
9:     Calculate the fraction to bet: $\beta_{t,i} = \frac{1}{L_i}\left(2\sigma\!\left(\frac{2\theta_{t,i}}{G_{t,i}+L_i}\right) - 1\right)$, where $\sigma(x) = \frac{1}{1+\exp(-x)}$
10:    Calculate the parameters: $w_{t+1,i} \leftarrow w_{1,i} + \beta_{t,i}\,(L_i + \mathrm{Reward}_{t,i})$
11:   end for
12: end for
13: Return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$, or $w_I$ where $I$ is chosen uniformly between 1 and $T$

5 The COCOB Algorithm

We now introduce our novel algorithm for stochastic subgradient descent, COntinuous COin Betting (COCOB), summarized in Algorithm 1.
COCOB generalizes the reasoning outlined in the previous section to the optimization of a function $F : \mathbb{R}^d \to \mathbb{R}$ with bounded subgradients, reducing the optimization to betting on $d$ coins.

Similarly to the construction in the previous section, the outcomes of the coins are linked to the stochastic gradients. In particular, each $g_{t,i} \in [-L_i, L_i]$, for $i = 1, \cdots, d$, is equal to coordinate $i$ of the negative stochastic gradient $g_t$ of $F$ in $w_t$. With the notation of the algorithm, COCOB is based on the strategy of betting a signed fraction of the current wealth equal to $\frac{1}{L_i}\left(2\sigma\!\left(\frac{2\theta_{t,i}}{G_{t,i}+L_i}\right) - 1\right)$, where $\sigma(x) = \frac{1}{1+\exp(-x)}$ (lines 9 and 10). Intuitively, if $\theta_{t,i}$ is big in absolute value, it means that we received a sequence of equal outcomes, i.e., gradients, hence we should increase our bets, i.e., the absolute value of $w_{t,i}$. Note that this strategy assures that $|w_{t,i} g_{t,i}| < \mathrm{Wealth}_{t-1,i}$, so the wealth of the gambler is always positive. Also, it is easy to verify that the algorithm is scale-free, because multiplying all the subgradients and the $L_i$ by any positive constant would result in the same sequence of iterates $w_{t,i}$.

Note that the update in line 10 is carefully defined: the algorithm does not use the previous $w_{t,i}$ in the update. Indeed, this algorithm belongs to the family of Dual Averaging algorithms, where the iterate is a function of the average of the past gradients [Nesterov, 2009].

Denoting by $w^*$ a minimizer of $F$, COCOB satisfies the following convergence guarantee.

Theorem 1. Let $F : \mathbb{R}^d \to \mathbb{R}$ be a $\tau$-weakly-quasi-convex function and assume that the $g_t$ satisfy $|g_{t,i}| \le L_i$.
Then, running COCOB for $T$ iterations guarantees, with the notation in Algorithm 1,

$$\mathbb{E}[F(w_I)] - F(w^*) \le \sum_{i=1}^{d} \frac{L_i + |w^*_i - w_{1,i}|\sqrt{\mathbb{E}\!\left[L_i (G_{T,i}+L_i) \ln\!\left(1 + \frac{(G_{T,i}+L_i)^2 (w^*_i - w_{1,i})^2}{L_i^2}\right)\right]}}{\tau T},$$

where the expectation is with respect to the noise in the subgradients and the choice of $I$. Moreover, if $F$ is convex, the same guarantee with $\tau = 1$ also holds for $\bar{w}_T$.

The proof, in the Appendix, shows through induction that betting a fraction of money equal to $\beta_{t,i}$ in line 9 on the outcomes $g_{t,i}$, with an initial money of $L_i$, guarantees that the wealth after $T$ rounds is at least $L_i \exp\!\left(\frac{\theta_{T,i}^2}{2 L_i (G_{T,i}+L_i)} - \frac{1}{2}\ln\frac{G_{T,i}}{L_i}\right)$. Then, as sketched in Section 4, it is enough to calculate the Fenchel conjugate of the wealth and use the standard construction for the per-coordinate updates [Streeter and McMahan, 2010]. We note in passing that the proof technique is also novel, because the one introduced in Orabona and Pal [2016] does not allow data-dependent bounds.

Figure 1: Behaviour of COCOB (left) and gradient descent with various learning rates and the same number of steps (center) in minimizing the function $y = |x - 10|$. (right) The effective learning rates of COCOB. Figures best viewed in colors.

When $|g_{t,i}| = 1$, we have $\beta_{t,i} \approx \frac{\sum_{j=1}^{t-1} g_{j,i}}{t}$, which recovers the betting strategy in Orabona and Pal [2016]. In other words, we substitute the time variable with the data-dependent quantity $G_{t,i}$.
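For concreteness, Algorithm 1 transcribes almost line by line into code. The following is our own sketch (the function and variable names are ours), returning the average of the iterates as in line 13:

```python
import numpy as np

def cocob(grad_neg, w1, L, T):
    """Sketch (ours) of Algorithm 1 (COCOB). grad_neg(w) returns the NEGATIVE
    stochastic subgradient of F at w; L bounds the gradients, |g_{t,i}| <= L_i."""
    w1, L = np.asarray(w1, float), np.asarray(L, float)
    G = L.copy()                          # G_{0,i} = L_i
    reward = np.zeros_like(w1)            # Reward_{0,i} = 0
    theta = np.zeros_like(w1)             # theta_{0,i} = 0
    w, avg = w1.copy(), np.zeros_like(w1)
    for t in range(1, T + 1):
        g = grad_neg(w)                   # line 4
        G += np.abs(g)                    # line 6: sum of |g|
        reward += (w - w1) * g            # line 7: net reward
        theta += g                        # line 8: sum of gradients
        avg += (w - avg) / t              # running average of w_1 .. w_T
        beta = (2.0 / (1.0 + np.exp(-2.0 * theta / (G + L))) - 1.0) / L  # line 9
        w = w1 + beta * (L + reward)      # line 10
    return avg                            # line 13: average of the iterates
```

On a toy non-smooth convex objective such as $F(w) = \sum_i |w_i - c_i|$ (with $L_i = 1$), the averaged iterate approaches the minimizer without any learning rate, as Theorem 1 guarantees.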
In fact, our bound depends on the terms $G_{T,i}$, while the similar one in Orabona and Pal [2016] simply depends on $L_i T$. Hence, as in AdaGrad [Duchi et al., 2011], COCOB's bound is tighter because it takes advantage of sparse gradients.

COCOB converges at a rate of $\tilde{O}\!\left(\frac{\|w^*\|_1}{\sqrt{T}}\right)$ without any learning rate to tune. This has to be compared to the bound of AdaGrad, which is² $O\!\left(\frac{1}{\sqrt{T}}\sum_{i=1}^{d}\left(\frac{(w^*_i)^2}{\eta_i} + \eta_i\right)\right)$, where the $\eta_i$ are the initial learning rates for each coordinate. Usually all the $\eta_i$ are set to the same value, but from the bound we see that the optimal setting would require a different value for each of them. This effectively means that the optimal $\eta_i$ for AdaGrad are problem-dependent and typically unknown. Using the optimal $\eta_i$ would give a convergence rate of $O\!\left(\frac{\|w^*\|_1}{\sqrt{T}}\right)$, which is exactly equal to our bound up to polylogarithmic terms. Indeed, the logarithmic term in the square root of our bound is the price to pay to be adaptive to any $w^*$ without tuning hyperparameters. This logarithmic term is unavoidable for any algorithm that wants to be adaptive to $w^*$, hence our bound is optimal [Streeter and McMahan, 2012, Orabona, 2013].

To gain a better understanding of the differences between COCOB and other subgradient descent algorithms, it is helpful to compare their behaviour on the simple one-dimensional function $F(x) = |x - 10|$ already used in Section 4. In Figure 1 (left), COCOB starts from 0 and over time increases the iterate $w_t$ exponentially, until it meets a gradient of opposing sign. From the gambling perspective this is obvious: the wealth increases exponentially because there is a sequence of identical outcomes, which in turn gives an increasing wealth and a sequence of increasing bets.

On the other hand, in Figure 1 (center), gradient descent shows a different behaviour depending on its learning rate.
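This learning-rate sensitivity is easy to reproduce numerically. Below is a small sketch (ours; the two step sizes are arbitrary illustrative picks, not values from the paper) running constant-step subgradient descent on $F(x) = |x - 10|$ from $x = 0$:

```python
def subgradient_descent(eta, steps, target=10.0):
    """Constant-step subgradient descent on F(x) = |x - target|, from x = 0."""
    x = 0.0
    for _ in range(steps):
        x -= eta * ((x > target) - (x < target))  # eta * sign(x - target)
    return x

x_small = subgradient_descent(eta=0.01, steps=200)  # creeps toward 10
x_large = subgradient_descent(eta=0.8, steps=200)   # overshoots, then oscillates
```

After 200 steps the small step has only reached $x = 2$, while the large one keeps bouncing within $0.8$ of the minimum and never settles: no single constant step handles both phases well.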
If the learning rate is constant and too small (black line), it will take a huge number of steps to reach the vicinity of the minimum. If the learning rate is constant and too large (red line), it will keep oscillating around the minimum, unless some form of averaging is used [Zhang, 2004]. If the learning rate decreases as $\frac{\eta}{\sqrt{t}}$, as in AdaGrad [Duchi et al., 2011], it will slow down over time, but depending on the choice of the initial learning rate $\eta$ it might take an arbitrarily large number of steps to reach the minimum.

Also, notice that in this case the time to reach the vicinity of the minimum for gradient descent is not influenced in any way by momentum terms or by learning rates that adapt to the norm of the past gradients, because the gradients are all the same. The same holds for second-order methods: the function in the figure lacks any curvature, so these methods could not be used. Even approaches based on the reduction of the variance in the gradients, e.g., [Johnson and Zhang, 2013], do not give any advantage here, because the subgradients are deterministic.

Figure 1 (right) shows the "effective learning rate" of COCOB, that is, $\tilde{\eta}_t := \frac{w_t}{\sqrt{\sum_{i=1}^{t} g_i^2}}$. This is the learning rate we should use in AdaGrad to obtain the same behaviour as COCOB. We see a very

²The AdaGrad variant used in deep learning does not have a convergence guarantee, because no projections are used.
Hence, we report the oracle bound in the case that projections are used inside the hypercube with dimensions $|w^*_i|$.

interesting effect: the learning rate is neither constant nor monotonically increasing or decreasing. Rather, it is big when we are far from the optimum and small when close to it. However, we would like to stress that this behaviour has not been coded into the algorithm; rather, it is a side effect of having the optimal convergence rate. We will show in Section 7 that this theoretical gain is confirmed in the empirical results.

Algorithm 2 COCOB-Backprop
1: Input: $\alpha > 0$ (default value = 100); $w_1 \in \mathbb{R}^d$ (initial parameters); $T$ (maximum number of iterations); $F$ (function to minimize)
2: Initialize: $L_{0,i} \leftarrow 0$, $G_{0,i} \leftarrow 0$, $\mathrm{Reward}_{0,i} \leftarrow 0$, $\theta_{0,i} \leftarrow 0$, $i = 1, \cdots,$ number of parameters
3: for $t = 1, 2, \ldots, T$ do
4:   Get a (negative) stochastic subgradient $g_t$ such that $\mathbb{E}[g_t] \in \partial[-F(w_t)]$
5:   for each $i$-th parameter in the network do
6:     Update the maximum observed scale: $L_{t,i} \leftarrow \max(L_{t-1,i}, |g_{t,i}|)$
7:     Update the sum of the absolute values of the subgradients: $G_{t,i} \leftarrow G_{t-1,i} + |g_{t,i}|$
8:     Update the reward: $\mathrm{Reward}_{t,i} \leftarrow \max(\mathrm{Reward}_{t-1,i} + (w_{t,i} - w_{1,i}) g_{t,i},\ 0)$
9:     Update the sum of the gradients: $\theta_{t,i} \leftarrow \theta_{t-1,i} + g_{t,i}$
10:    Calculate the parameters: $w_{t+1,i} \leftarrow w_{1,i} + \frac{\theta_{t,i}}{L_{t,i}\max(G_{t,i}+L_{t,i},\ \alpha L_{t,i})}\left(L_{t,i} + \mathrm{Reward}_{t,i}\right)$
11:   end for
12: end for
13: Return $w_T$

6 Backprop and Coin Betting

The algorithm described in the previous section is guaranteed to converge at the optimal convergence rate for non-smooth functions and does not require a learning rate. However, it still needs to know the maximum range of the gradients on each coordinate.
Note that, due to the effect of vanishing gradients, each layer will have a different range of the gradients [Hochreiter, 1991]. Also, the weights of the network can grow over time, increasing the value of the gradients too. Hence, it would be impossible to know the range of each gradient beforehand and use any strategy based on betting.

Following the previous literature, e.g., [Kingma and Ba, 2015], we propose a variant of COCOB better suited to optimizing deep networks. We name it COCOB-Backprop, and its pseudocode is in Algorithm 2. Although this version lacks the backing of a theoretical guarantee, it is still effective in practice, as we will show experimentally in Section 7.

There are a few differences between COCOB and COCOB-Backprop. First, we want to be adaptive to the maximum component-wise range of the gradients. Hence, in line 6 we constantly update the values $L_{t,i}$ for each variable. Next, since $L_{t-1,i}$ is no longer assured to be an upper bound on $g_{t,i}$, we do not have any guarantee that $\mathrm{Reward}_{t,i}$ is non-negative. Thus, we enforce the positivity of the reward in line 8 of Algorithm 2.

We also modify the fraction to bet in line 10 by removing the sigmoidal function, because $2\sigma(2x) - 1 \approx x$ for $x \in [-1, 1]$. This choice simplifies the code and always improved the results in our experiments. Moreover, we change the denominator of the fraction to bet so that it is at least $\alpha L_{t,i}$. This has the effect of restricting the value of the parameters in the first iterations of the algorithm. To better understand this change, consider that, for example, in AdaGrad and Adam with learning rate $\eta$ the first update is $w_{2,i} = w_{1,i} - \eta\,\mathrm{sgn}(g_{1,i})$. Hence, $\eta$ should have a value smaller than $w_{1,i}$ in order not to "forget" the initial point too fast.
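Putting these modifications together, the Algorithm 2 loop can be sketched as follows (our own condensation; the function name, variable names, and the tiny floor on $L_{t,i}$ are ours, not the authors' reference code):

```python
import numpy as np

def cocob_backprop(grad_neg, w1, T, alpha=100.0):
    """Sketch (ours) of COCOB-Backprop (Algorithm 2). grad_neg(w) returns the
    NEGATIVE stochastic gradient of F at w; the last iterate is returned."""
    w1 = np.asarray(w1, float)
    w = w1.copy()
    L = np.full_like(w1, 1e-12)   # tiny floor to avoid 0/0 (our addition)
    G = np.zeros_like(w1)
    reward = np.zeros_like(w1)
    theta = np.zeros_like(w1)
    for _ in range(T):
        g = grad_neg(w)
        L = np.maximum(L, np.abs(g))                      # line 6: max observed scale
        G += np.abs(g)                                    # line 7: sum of |g|
        reward = np.maximum(reward + (w - w1) * g, 0.0)   # line 8: clipped reward
        theta += g                                        # line 9: sum of gradients
        # line 10: sigmoid dropped; denominator at least alpha * L
        w = w1 + theta / (L * np.maximum(G + L, alpha * L)) * (L + reward)
    return w                                              # line 13: last iterate
```

Note how the $\alpha L_{t,i}$ term caps the magnitude of the very first updates, protecting the initialization.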
In fact, the initialization is critical to obtain good results, and moving too far away from it destroys the generalization ability of deep networks. Here, the first update becomes $w_{2,i} = w_{1,i} - \frac{1}{\alpha}\,\mathrm{sgn}(g_{1,i})$, so $\frac{1}{\alpha}$ should also be small compared to $w_{1,i}$. Finally, as in previous algorithms, we do not return the average or a random iterate, but just the last one (line 13 in Algorithm 2).

Figure 2: Training cost (cross-entropy) (left) and testing error rate (0/1 loss) (right) vs. the number of epochs with two different architectures on MNIST, as indicated in the figure titles. The y-axis is logarithmic in the left plots. Figures best viewed in colors.

7 Empirical Results and Future Work

We run experiments on various datasets and architectures, comparing COCOB with some popular stochastic gradient learning algorithms: AdaGrad [Duchi et al., 2011], RMSProp [Tieleman and Hinton, 2012], Adadelta [Zeiler, 2012], and Adam [Kingma and Ba, 2015]. For all the algorithms but COCOB, we select the learning rate as the one that gives the best training cost a posteriori, using a very fine grid of values³. We implemented⁴ COCOB (following Algorithm 2) in TensorFlow [Abadi et al., 2015], and we used the implementations of the other algorithms provided by this deep learning framework. The best value of the learning rate for each algorithm and experiment is reported in the legend.

We report both the training cost and the test error but, as in previous work, e.g., [Kingma and Ba, 2015], we focus our empirical evaluation on the former. Indeed, given a large enough neural network, it is always possible to overfit the training set, obtaining a very low performance on the test set. Hence, test errors do not depend only on the optimization algorithm.

Digits Recognition. As a first test, we tackle handwritten digit recognition using the MNIST dataset [LeCun et al., 1998a].
It contains 28 × 28 grayscale images, with 60k training and 10k test samples. We consider two different architectures: a fully connected 2-layer network and a Convolutional Neural Network (CNN). In both cases we study the different optimizers on the standard cross-entropy objective function for classifying the 10 digits. For the first network we reproduce the structure described in the multi-layer experiment of [Kingma and Ba, 2015]: it has two fully connected hidden layers with 1000 hidden units each and ReLU activations, with a mini-batch size of 100. The weights are initialized with a centered truncated normal distribution with standard deviation 0.1; the same small value 0.1 is also used as initialization for the biases. The CNN architecture follows the TensorFlow tutorial⁵: two alternating stages of 5 × 5 convolutional filters and 2 × 2 max pooling are followed by a fully connected layer of 1024 rectified linear units (ReLU). To reduce overfitting, 50% dropout noise is used during training.

³[0.00001, 0.000025, 0.00005, 0.000075, 0.0001, 0.00025, 0.0005, 0.00075, 0.001, 0.0025, 0.005, 0.0075, 0.01, 0.02, 0.05, 0.075, 0.1]
⁴https://github.com/bremen79/cocob
⁵https://www.tensorflow.org/get_started/mnist/pros

Figure 3: Training cost (cross-entropy) (left) and testing error rate (0/1 loss) (right) vs. the number of epochs on CIFAR-10. The y-axis is logarithmic in the left plots. Figures best viewed in colors.

Figure 4: Training cost (left) and test cost (right), measured as average per-word perplexity, vs. the number of epochs on the PTB word-level language modeling task. Figures best viewed in colors.

Training cost and test error rate as functions of the number of training epochs are reported in Figure 2. With both architectures, the training cost of COCOB decreases at the same rate as that of the best tuned competitor algorithms.
The training performance of COCOB is also reflected in its associated test error, which appears better than or on par with the other algorithms.

Object Classification. We use the popular CIFAR-10 dataset [Krizhevsky, 2009] to classify 32 × 32 RGB images across 10 object categories. The dataset has 60k images in total, split into training/test sets of 50k/10k samples. For this task we used the network defined in the TensorFlow CNN tutorial⁶. It starts with two convolutional layers with 64 kernels of dimension 5 × 5 × 3, each followed by a 3 × 3 × 3 max pooling with stride 2 and by local response normalization as in Krizhevsky et al. [2012]. Two more fully connected layers, of respectively 384 and 192 rectified linear units, complete the architecture, which ends with a standard softmax cross-entropy classifier. We use a batch size of 128, and the input images are simply pre-processed by whitening. Differently from the TensorFlow tutorial, we do not apply random image distortion for data augmentation.

The obtained results are shown in Figure 3. Here, with respect to the training cost, our learning-rate-free COCOB performs on par with the best competitors. For all the algorithms, there is a good correlation between the test performance and the training cost. COCOB and its best competitor, AdaDelta, show similar classification results, differing on average by ∼0.008 in error rate.

Word-level Prediction with RNN. Here we train a Recurrent Neural Network (RNN) on a language modeling task. Specifically, we conduct word-level prediction experiments on the Penn Tree Bank (PTB) dataset [Marcus et al., 1993], using its 929k training words and 73k validation words. We adopted the medium LSTM [Hochreiter and Schmidhuber, 1997] network architecture described in Zaremba et al.
[2014]: it has 2 layers with 650 units per layer and parameters initialized uniformly in [−0.05, 0.05]; a dropout of 50% is applied on the non-recurrent connections, and the norm of the gradients (normalized by the mini-batch size of 20) is clipped at 5.

⁶https://www.tensorflow.org/tutorials/deep_cnn

[Figure 4 legend: AdaGrad 0.25, RMSprop 0.001, Adadelta 2.5, Adam 0.00075, COCOB.]

We show the obtained results in terms of average per-word perplexity in Figure 4. In this task COCOB performs as well as AdaGrad and Adam with respect to the training cost, and much better than the other algorithms. In terms of test performance, COCOB, Adam, and AdaGrad all show an overfitting behaviour, indicated by the perplexity slowly growing after having reached its minimum. AdaGrad is the least affected by this issue and presents the best results, followed by COCOB, which outperforms all the other methods. We stress again that the test performance does not depend only on the optimization algorithm used in training, and that early stopping may mitigate the overfitting effect.

Summary of the Empirical Evaluation and Future Work. Overall, COCOB has a training performance that is on par with or better than state-of-the-art algorithms with perfectly tuned learning rates. The test error appears to depend on other factors too, with equal training errors corresponding to different test errors. We would also like to stress that in these experiments, contrary to some previously reported empirical results on similar datasets and networks, the difference between the competitor algorithms is minimal or non-existent when they are tuned on a very fine grid of learning rate values.
Indeed, the very similar performance of these methods seems to indicate that all the algorithms are inherently doing the same thing, despite their different internal structures and motivations. Future, more detailed empirical work will focus on unveiling the common structure of these algorithms that gives rise to this behavior.

In the future, we also plan to extend the theory of COCOB beyond τ-weakly-quasi-convex functions, characterizing the non-convexity present in deep networks. It would also be interesting to evaluate a possible integration of the betting framework with second-order methods.

Acknowledgments

The authors thank the Stony Brook Research Computing and Cyberinfrastructure, and the Institute for Advanced Computational Science at Stony Brook University for access to the high-performance SeaWulf computing system, which was made possible by a $1.4M National Science Foundation grant (#1531492). The authors also thank Akshay Verma for the help with the TensorFlow implementation and Matej Kristan for reporting a bug in the pseudocode in the previous version of the paper. T.T. was supported by the ERC grant 637076 - RoboExNovo. F.O. is partly supported by a Google Research Award.

References

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.

S.-I. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

H. H.
Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.

Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In G. Montavon, G. B. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of the Trade: Second Edition, pages 437–478. Springer, Berlin, Heidelberg, 2012.

A. Cutkosky and K. Boahen. Online learning without prior information. In Conference on Learning Theory (COLT), pages 643–677, 2017.

A. Cutkosky and K. A. Boahen. Online convex optimization with unconstrained domains and losses. In Advances in Neural Information Processing Systems (NIPS), pages 748–756, 2016.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

M. Hardt, T. Ma, and B. Recht. Gradient descent learns linear dynamical systems. arXiv preprint arXiv:1609.05191, 2016.

S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NIPS), pages 315–323, 2013.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

A. Krizhevsky. Learning multiple layers of features from tiny images.
Master's thesis, Department of Computer Science, University of Toronto, 2009.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998a. URL http://yann.lecun.com/exdb/mnist/.

Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 1998b.

M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

H. B. McMahan and F. Orabona. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Conference on Learning Theory (COLT), pages 1020–1039, 2014.

Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376, 1983.

Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.

F. Orabona. Dimension-free exponentiated gradient. In Advances in Neural Information Processing Systems (NIPS), pages 1806–1814, 2013.

F. Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems (NIPS), pages 1116–1124, 2014.

F. Orabona and D. Pal. Scale-free algorithms for online linear optimization. In International Conference on Algorithmic Learning Theory (ALT), pages 287–301. Springer, 2015.

F. Orabona and D. Pal. Coin betting and parameter-free online learning.
In Advances in Neural Information Processing Systems (NIPS), pages 577–585, 2016.

F. Orabona, K. Crammer, and N. Cesa-Bianchi. A generalized online mirror descent with applications to classification and regression. Machine Learning, 99(3):411–435, 2015.

S. Ross, P. Mineiro, and J. Langford. Normalized online learning. In Proc. of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI), 2013.

T. Schaul, S. Zhang, and Y. LeCun. No more pesky learning rates. In International Conference on Machine Learning (ICML), pages 343–351, 2013.

M. Streeter and H. B. McMahan. Less regret via online conditioning. arXiv preprint arXiv:1002.4862, 2010.

M. Streeter and H. B. McMahan. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems (NIPS), pages 2402–2410, 2012.

I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning (ICML), pages 1139–1147, 2013.

T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

S. Wright and J. Nocedal. Numerical Optimization. Springer, 1999.

W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.

M. D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

T. Zhang.
Solving large scale linear prediction problems using stochastic gradient descent algorithms. In International Conference on Machine Learning (ICML), pages 919–926, 2004.