{"title": "Online to Offline Conversions, Universality and Adaptive Minibatch Sizes", "book": "Advances in Neural Information Processing Systems", "page_first": 1613, "page_last": 1622, "abstract": "We present an approach towards convex optimization that relies on a novel scheme which converts adaptive online algorithms into offline methods. In the offline optimization setting, our derived methods are shown to obtain favourable adaptive guarantees which depend on the harmonic sum of the queried gradients. We further show that our methods implicitly adapt to the objective's structure: in the smooth case fast convergence rates are ensured without any prior knowledge of the smoothness parameter, while still maintaining guarantees in the non-smooth setting. Our approach has a natural extension to the stochastic setting, resulting in a lazy version of SGD (stochastic GD), where minibathces are chosen adaptively depending on the magnitude of the gradients. Thus providing a principled approach towards choosing minibatch sizes.", "full_text": "Online to Of\ufb02ine Conversions, Universality and\n\nAdaptive Minibatch Sizes\n\nK\ufb01r Y. Levy\n\nDepartment of Computer Science, ETH Z\u00fcrich.\n\nyehuda.levy@inf.ethz.ch\n\nAbstract\n\nWe present an approach towards convex optimization that relies on a novel scheme\nwhich converts adaptive online algorithms into of\ufb02ine methods. In the of\ufb02ine\noptimization setting, our derived methods are shown to obtain favourable adaptive\nguarantees which depend on the harmonic sum of the queried gradients. We\nfurther show that our methods implicitly adapt to the objective\u2019s structure: in the\nsmooth case fast convergence rates are ensured without any prior knowledge of\nthe smoothness parameter, while still maintaining guarantees in the non-smooth\nsetting. 
Our approach has a natural extension to the stochastic setting, resulting in a lazy version of SGD (stochastic GD), where minibatches are chosen adaptively depending on the magnitude of the gradients, thus providing a principled approach towards choosing minibatch sizes.\n\n1 Introduction\n\nOver the past years, data adaptiveness has proven to be crucial to the success of learning algorithms. The objective function underlying \u201cbig data\u201d applications often demonstrates intricate structure: the scale and smoothness are often unknown and may change substantially between different regions/directions, [1]. Learning methods that acclimatize to these changes may exhibit superior performance compared to non-adaptive procedures.\nState-of-the-art first order methods like AdaGrad, [1], and Adam, [2], adapt the learning rate on the fly according to the feedback (i.e., gradients) received during the optimization process. AdaGrad and Adam are guaranteed to work well in the online convex optimization setting, where loss functions may be chosen adversarially and change between rounds. Nevertheless, this setting is harder than the stochastic/offline settings, which may better depict practical applications. Interestingly, even in the offline convex optimization setting it can be shown that in several scenarios very simple schemes may substantially outperform the output of AdaGrad/Adam. An example of such a simple scheme is choosing the point with the smallest gradient norm among all rounds. In the first part of this work we address this issue and design adaptive methods for the offline convex optimization setting. At the heart of our derivations is a novel scheme which converts adaptive online algorithms into offline methods with favourable guarantees1. Our scheme is inspired by standard online-to-batch conversions, [3].\nA seemingly different issue is choosing the minibatch size, b, in the stochastic setting. 
Stochastic optimization algorithms that can access a noisy gradient oracle may choose to invoke the oracle b times at every query point, subsequently employing an averaged gradient estimate. Theory for stochastic convex optimization suggests using a minibatch of b = 1, and predicts a degradation by a √b factor upon using larger minibatch sizes2. Nevertheless, in practice larger minibatch sizes are usually found to be effective. In the second part of this work we design stochastic optimization methods in which minibatch sizes are chosen adaptively, without any theoretical degradation. These are natural extensions of the offline methods presented in the first part.\n\n1For concreteness we concentrate in this work on converting AdaGrad, [1]. Note that our conversion scheme applies more widely to other adaptive online methods.\n2A degradation by a √b factor in the general case and by a b factor in the strongly-convex case.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nOur contributions:\nOffline setting: We present two (families of) algorithms, AdaNGD (Alg. 2) and SC-AdaNGD (Alg. 3), for the convex/strongly-convex settings, which achieve favourable adaptive guarantees (Thms. 2.1, 2.2, 3.1, 3.2). The latter theorems also establish their universality, i.e., their ability to implicitly take advantage of the objective's smoothness and attain rates as fast as GD would have achieved had the smoothness parameter been known. In contrast to other universal approaches such as line-search GD, [4], and universal gradient methods, [5], we do so without any line search procedure.\nConcretely, without knowledge of the smoothness parameter, our algorithm ensures an O(1/√T) rate in the general convex case and an O(1/T) rate if the objective is also smooth (Thms. 2.1, 2.2). In the strongly-convex case our algorithm ensures an O(1/T) rate in general and an O(exp(-T/κ)) rate if the objective is also smooth (Thm. 
3.2), where κ = β/H is the condition number.\nStochastic setting: We present Lazy-SGD (Algorithm 4), which is an extension of our offline algorithms. Lazy-SGD employs larger minibatch sizes at points with smaller gradients, which selectively reduces the variance in the \u201cmore important\u201d query points. Lazy-SGD's guarantees are comparable with SGD's in the convex/strongly-convex settings (Thms. 4.2, 4.3).\nOn the technical side, our online-to-offline conversion schemes employ three simultaneous mechanisms: an adaptive online algorithm used in conjunction with gradient normalization and with a respective importance weighting. To the best of our knowledge the combination of the above techniques is novel, and we believe it might also find use in other scenarios.\nThis paper is organized as follows. In Sections 2 and 3 we present our methods for the offline convex/strongly-convex settings. Section 4 describes our methods for the stochastic setting, and Section 5 concludes. Extensions and a preliminary experimental study appear in the Appendix.\n\n1.1 Related Work\nThe authors of [1], simultaneously with [6], were the first to suggest AdaGrad, an adaptive gradient-based method, and prove its efficiency in tackling online convex problems. AdaGrad was subsequently adjusted to the deep-learning setting to yield the RMSprop, [7], and Adadelta, [8], heuristics. Adam, [2], is a popular adaptive algorithm which is often the method of choice in deep-learning applications. It combines ideas from AdaGrad together with momentum machinery, [9].\nAn optimization procedure is called universal if it implicitly adapts to the objective's smoothness. In [5], universal gradient methods are devised for the general convex setting. Concretely, without knowledge of the smoothness parameter, these methods attain the standard O(1/T) and accelerated O(1/T²) rates for smooth objectives, and an O(1/√T) rate in the non-smooth case. 
The core\ntechnique in this work is a line search procedure which estimates the smoothness parameter in\nevery iteration. For strongly-convex and smooth objectives, line search techniques, [4], ensure\nlinear convergence rate, without the knowledge of the smoothness parameter. However, line search\nis not \u201cfully universal\", in the sense that it holds no guarantees in the non-smooth case. For the\nlatter setting we present a method which is \u201cfully universal\" (Thm. 3.2), nevertheless it requires the\nstrong-convexity parameter.\nThe usefulness of employing normalized gradients was demonstrated in several non-convex scenarios.\nIn the context of quasi-convex optimization, [10], and [11], established convergence guarantees for\nthe of\ufb02ine/stochastic settings. More recently, it was shown in [12], that normalized gradient descent\nis more appropriate than GD for saddle-evasion scenarios.\nIn the context of stochastic optimization, the effect of minibatch size was extensively investigated\nthroughout the past years, [13, 14, 15, 16, 17, 18]. Yet, all of these studies: (i) assume a smooth\nexpected loss, (ii) discuss \ufb01xed minibatch sizes. Conversely, our work discusses adaptive minibatch\nsizes, and applies to both smooth/non-smooth expected losses.\n\n1.2 Preliminaries\nNotation: k\u00b7k denotes the `2 norm, G denotes a bound on the norm of the objective\u2019s gradients, and\n[T ] := {1, . . . , T}. For a set K2 Rd its diameter is de\ufb01ned as D = supx,y2K kx yk. Next we\n\n2\n\n\fAlgorithm 1 Adaptive Gradient Descent (AdaGrad)\n\nInput: #Iterations T , x1 2 Rd, set K\nSet: Q0 = 0\nfor t = 1 . . . 
T do\nCalculate: g_t = ∇f_t(x_t)\nUpdate: Q_t = Q_{t-1} + ||g_t||²\nSet: η_t = D/√(2Q_t)\nUpdate: x_{t+1} = Π_K(x_t - η_t g_t)\nend for\n\ndefine H-strongly-convex/β-smooth functions,\n\nf(y) ≥ f(x) + ∇f(x)^T(y - x) + (H/2)||x - y||², ∀x, y ∈ K (H-strong-convexity)\nf(y) ≤ f(x) + ∇f(x)^T(y - x) + (β/2)||x - y||², ∀x, y ∈ K (β-smoothness)\n\n1.2.1 AdaGrad\n\nThe methods presented in this paper lean on AdaGrad (Alg. 1), an online optimization method which employs an adaptive learning rate. The following theorem states AdaGrad's guarantees, [1]:\nTheorem 1.1. Let K be a convex set with diameter D. Let {f_t}_{t=1}^T be an arbitrary sequence of convex loss functions. Then Algorithm 1 guarantees the following regret:\n\nΣ_{t=1}^T f_t(x_t) - min_{x∈K} Σ_{t=1}^T f_t(x) ≤ √(2D² Σ_{t=1}^T ||g_t||²).\n\n2 Adaptive Normalized Gradient Descent (AdaNGD)\n\nIn this section we discuss the convex optimization setting and introduce our AdaNGDk algorithm, which depends on a parameter k ∈ R. We first derive a general convergence rate which holds for a general k. Subsequently, we elaborate on the k = 1, 2 cases, which exhibit universality as well as adaptive guarantees that may be substantially better compared to standard methods.\nOur method AdaNGDk is depicted in Alg. 2. This algorithm can be thought of as an online-to-offline conversion scheme which utilizes AdaGrad (Alg. 1) as a black box and eventually outputs a weighted sum of the online queries. Indeed, for a fixed k ∈ R, it is not hard to notice that AdaNGDk is equivalent to invoking AdaGrad with the loss sequence {f̃_t(x) := g_t^T x/||g_t||^k}_{t=1}^T, and eventually weighting each query point inversely proportionally to the k'th power of its gradient norm.\nThe reason behind this scheme is that in offline optimization it makes sense to dramatically reduce the learning rate upon encountering a point with a very small gradient. 
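For concreteness, the scalar-step AdaGrad of Algorithm 1 can be sketched in a few lines. This is a minimal illustration, not the author's code; the `project` argument (a Euclidean projection onto K) is an added assumption and defaults to the identity, i.e. the unconstrained case.

```python
import numpy as np

def adagrad(grad, x1, D, T, project=lambda x: x):
    """Algorithm 1 sketch: eta_t = D / sqrt(2 * Q_t), where Q_t
    accumulates the squared norms of all gradients seen so far."""
    x = np.asarray(x1, dtype=float)
    Q = 0.0
    iterates = [x.copy()]
    for _ in range(T):
        g = grad(x)
        Q += float(np.dot(g, g))
        eta = D / np.sqrt(2.0 * Q) if Q > 0 else 0.0
        x = project(x - eta * g)
        iterates.append(x.copy())
    return iterates
```

On f(x) = ||x||², for example, the step sizes shrink as large gradients accumulate, in line with the regret bound of Theorem 1.1.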
For k 1, this is achieved by\ninvoking AdaGrad with gradients normalized by their k\u2019th power norm. Since we discuss constrained\noptimization, we use the projection operator de\ufb01ned as, \u21e7K(y) := minx2K kx yk . The next lemma\nstates the guarantee of AdaNGD for a general k:\nLemma 2.1. Let k 2 R, K be a convex set with diameter D, and f be a convex function; Also let \u00afxT\nbe the output of AdaNGDk (Algorithm 2), then the following holds:\n\nf (\u00afxT ) min\nx2K\n\nf (x) \uf8ff q2D2PT\nPT\n\nt=1 1/kgtk2(k1)\nt=1 1/kgtkk\n\nProof sketch. Notice that the AdaNGDk algorithm is equivalent to applying AdaGrad to the following\nloss sequence: { \u02dcft(x) := g>t x/kgtkk}T\nt=1. Thus, applying Theorem 1.1, and using the de\ufb01nition of\n\u00afxT together with Jensen\u2019s inequality the lemma follows.\n\n3\n\n\fAlgorithm 2 Adaptive Normalized Gradient Descent (AdaNGDk)\n\nInput: #Iterations T , x1 2 Rd, set K , parameter k\nSet: Q0 = 0\nfor t = 1 . . . T 1 do\n\nCalculate: gt = rf (xt), \u02c6gt = gt/kgtkk\nUpdate:\nSet \u2318t = D/p2Qt\nUpdate: xt+1 =\u21e7 K (xt \u2318t\u02c6gt)\nReturn: \u00afxT =PT\n\n1/kgtkk\n\u2327 =1 1/kg\u2327kk xt\nPT\n\nend for\n\nt=1\n\nQt = Qt1 + 1/kgtk2(k1)\n\nFor k = 0, Algorithm 2 becomes AdaGrad (Alg. 1). Next we focus on the cases where k = 1, 2,\nshowing improved adaptive rates and universality compared to GD/AdaGrad. These improved rates\nare attained thanks to the adaptivity of the learning rate: when query points with small gradients are\nencountered, AdaNGDk (with k 1) reduces the learning rate, thus focusing on the region around\nthese points. The hindsight weighting further emphasizes points with smaller gradients.\n\n2.1 AdaNGD1\nHere we show that AdaNGD1 enjoys a rate of O(1/pT ) in the non-smooth convex setting, and a\nfast rate of O(1/T ) in the smooth setting. 
We emphasize that the same algorithm enjoys these rates\nsimultaneously, without any prior knowledge of the smoothness or of the gradient norms.\nFrom Algorithm 2 it can be noted that for k = 1 the learning rate becomes independent of the\ngradients, i.e. \u2318t = D/p2t, the update is made according to the direction of the gradients, and the\nweighting is inversely proportional to the norm of the gradients. The following Theorem establishes\nthe guarantees of AdaNGD1,\nTheorem 2.1. Let k = 1, K be a convex set with diameter D, and f be a convex function; Also let\n\u00afxT be the outputs of AdaNGD1 (Alg. 2), then the following holds:\n\nProof sketch. The data dependent bound is a direct corollary of Lemma 2.1. The general case bound\n\nt=1 1/kgtk \u2326(T 3/2), which concludes the proof.\n\nholds by using kgtk \uf8ff G. The bound for the smooth case is proven by showingPT\nThis translates to a lower boundPT\nThe data dependent bound in Theorem 2.1 may be substantially better compared to the bound of\nthe GD/AdaGrad. As an example, assume that half of the gradients encountered during the run\nof the algorithm are of O(1) norms, and the other gradient norms decay proportionally to O(1/t).\nIn this case the guarantee of GD/AdaGrad is O(1/pT ), whereas AdaNGD1 guarantees a bound\nthat behaves like O(1/T 3/2). Note that the above example presumes that all algorithms encounter\nthe same gradient magnitudes, which might be untrue. Nevertheless in the smooth case AdaNGD1\nprovably bene\ufb01ts due to its adaptivity.\n\nt=1 kgtk \uf8ff O(pT ).\n\n2.2 AdaNGD2\nHere we show that AdaNGD2 enjoys comparable guarantees to AdaNGD1 in the general/smooth\ncase. Similarly to AdaNGD1 the same algorithm enjoys these rates simultaneously, without any\nprior knowledge of the smoothness or of the gradient norms. 
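To make the conversion concrete, AdaNGDk (Alg. 2) can be sketched as below: AdaGrad is fed the normalized gradients g_t/||g_t||^k, and the iterates are averaged in hindsight with weights proportional to 1/||g_t||^k. This is an illustration only; `project` (projection onto K) and the `eps` guard against exactly-zero gradients are assumptions not present in the listing.

```python
import numpy as np

def adangd(grad, x1, D, T, k=1.0, project=lambda x: x, eps=1e-12):
    """AdaNGD_k sketch: Q_t = sum_{tau<=t} 1/||g_tau||^(2(k-1)),
    eta_t = D/sqrt(2*Q_t), a step along g_t/||g_t||^k, and a final
    hindsight-weighted average with weights 1/||g_t||^k."""
    x = np.asarray(x1, dtype=float)
    Q, points, weights = 0.0, [], []
    for _ in range(T):
        g = grad(x)
        norm = max(float(np.linalg.norm(g)), eps)  # guard against g = 0
        points.append(x.copy())
        weights.append(1.0 / norm ** k)
        Q += 1.0 / norm ** (2 * (k - 1))
        eta = D / np.sqrt(2.0 * Q)
        x = project(x - eta * g / norm ** k)
    w = np.array(weights) / np.sum(weights)
    return np.sum(w[:, None] * np.array(points), axis=0)
```

With k = 1 the step size reduces to D/√(2t), independent of the gradient magnitudes, matching the observation above.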
The following Theorem establishes the\nguarantees of AdaNGD2,\n\n4\n\nMoreover, if f is also -smooth and the global minimum x\u21e4 = arg minx2Rn f (x) belongs to K, then:\n\nf (\u00afxT ) min\nx2K\n\nf (x) \uf8ff\n\nf (\u00afxT ) min\nx2K\n\nf (x) \uf8ff\n\np2D2T\nt=1 1/kgtk \uf8ff\nPT\nDpT\nt=1 1/kgtk \uf8ff\nPT\n\np2GD\npT\n\n.\n\n4D 2\n\n.\n\nT\n\n\fAlgorithm 3 Strongly-Convex AdaNGD (SC-AdaNGDk)\n\nInput: #Iterations T , x1 2 Rd, set K, strong-convexity H, parameter k\nSet: Q0 = 0\nfor t = 1 . . . T 1 do\n\nCalculate: gt = rf (xt), \u02c6gt = gt/kgtkk\nUpdate:\n\nQt = Qt1 + 1/kgtkk\n\nend for\n\nSet \u2318t = 1/HQt\nUpdate: xt+1 =\u21e7 K (xt \u2318t\u02c6gt)\nReturn: \u00afxT =PT\nTheorem 2.2. Let k = 2, K be a convex set with diameter D, and f be a convex function; Also let\n\u00afxT be the outputs of AdaNGD2 (Alg. 2), then the following holds:\n\n1/kgtkk\n\u2327 =1 1/kg\u2327kk xt\nPT\n\nt=1\n\nMoreover, if f is also -smooth and the global minimum x\u21e4 = arg minx2Rn f (x) belongs to K, then:\n\nf (\u00afxT ) min\nx2K\n\nf (x) \uf8ff\n\nf (\u00afxT ) min\nx2K\n\nf (x) \uf8ff\n\np2GD\npT\n\n4D 2\n\nT\n\n.\n\n.\n\np2D2\nqPT\nt=1 1/kgtk2 \uf8ff\np2D2\nqPT\nt=1 1/kgtk2 \uf8ff\nT PT\n\n1\nt=1 1/at\n\n1\n\nIt is interesting to note that AdaNGD2 will have always performed better than AdaGrad, had both\nalgorithms encountered the same gradient norms. This is due to the well known inequality between\nt=1 \u21e2 R+ , which directly\narithmetic and harmonic means, [19], 1\n\nt=1 at \n\n8{at}T\n\n,\n\nimplies,\n\n1\n\nt=1 1/kgtk2 \uf8ff 1\n\npPT\n\nT PT\n\nt=1 kgtk2 .\n\nTqPT\n\n3 Adaptive NGD for Strongly Convex Functions\n\nHere we discuss the of\ufb02ine optimization setting of strongly convex objectives. We introduce our\nSC-AdaNGDk algorithm, and present convergence rates for general k 2 R. 
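The arithmetic/harmonic mean comparison at the end of Section 2.2 is easy to verify numerically. The helper below (an illustration, not from the paper) compares the quantity governing AdaNGD2's bound with the AdaGrad-style one, for any sequence of gradient norms:

```python
import numpy as np

def compare_bound_quantities(grad_norms):
    """Returns (AdaNGD2-style, AdaGrad-style) quantities:
    1/sqrt(sum 1/a_t^2) vs sqrt(sum a_t^2)/T; the first is never
    larger, by the arithmetic-harmonic mean inequality."""
    a = np.asarray(grad_norms, dtype=float)
    adangd2 = 1.0 / np.sqrt(np.sum(1.0 / a ** 2))
    adagrad = np.sqrt(np.sum(a ** 2)) / len(a)
    return adangd2, adagrad
```

Equality holds exactly when all norms are equal; a single small gradient norm already makes the harmonic quantity much smaller.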
Subsequently, we\nelaborate on the k = 1, 2 cases which exhibit universality as well as adaptive guarantees that may be\nsubstantially better compared to standard methods.\nOur SC-AdaNGDk algorithm is depicted in Algorithm 3. Similarly to its non strongly-convex\ncounterpart, SC-AdaNGDk can be thought of as an online to of\ufb02ine conversion scheme which utilizes\nan online algorithm which we denote SC-AdaGrad (we elaborate on the latter in the appendix). The\nnext Lemma states its guarantees,\nLemma 3.1. Let k 2 R, and K be a convex set. Let f be an H-strongly-convex function; Also let \u00afxT\nbe the outputs of SC-AdaNGDk (Alg. 3), then the following holds:\n\nf (\u00afxT ) min\nx2K\n\nf (x) \uf8ff\n\n1\nt=1 kgtkk\n\nTXt=1\n\n2HPT\n\nkgtk2(k1)\nPt\n\u2327 =1 kg\u2327kk\n\n.\n\nProof sketch. In the appendix we present and analyze SC-AdaGrad. This is an online \ufb01rst order algo-\n\u2327 =1 H\u2327 ,\nwhere H\u2327 is the strong-convexity parameter of the loss function at time \u2327. Then we show that\nSC-AdaNGDk is equivalent to applying SC-AdaGrad to the following loss sequence:\n\nrithm for strongly-convex functions in which the learning rate decays according to \u2318t = 1/Pt\n\n\u21e2 \u02dcft(x) =\n\n1\n\nkgtkk g>t x +\n\nH\n\n2kgtkk kx xtk2T\n\nt=1\n\n.\n\nThe lemma follows by combining the regret bound of SC-AdaGrad together with the de\ufb01nition of \u00afxT\nand with Jensen\u2019s inequality.\n\n5\n\n\fFor k = 0, SC-AdaNGD becomes the standard GD algorithm which uses learning rate of \u2318t = 1/Ht.\nNext we focus on the cases where k = 1, 2.\n\n3.1 SC-AdaNGD1\n\nHere we show that SC-AdaNGD1 enjoys a rate of \u02dcO(1/T ) for strongly-convex objectives, and a\nfaster rate of \u02dcO(1/T 2) assuming that the objective is also smooth. We emphasize that the same\nalgorithm enjoys these rates simultaneously, without any prior knowledge of the smoothness or of the\ngradient norms. 
The following theorem establishes the guarantees of SC-AdaNGD1,\nTheorem 3.1. Let k = 1, and K be a convex set. Let f be a G-Lipschitz and H-strongly-convex\nfunction; Also let \u00afxT be the outputs of SC-AdaNGD1 (Alg. 3), then the following holds:\n\nf (\u00afxT ) min\nx2K\n\nf (x) \uf8ff\n\nG\u21e31 + log\u21e3PT\n2HPT\n\nt=1\n\nt=1\n\n1\nkgtk\n\nG\n\nkgtk\u2318\u2318\n\nG2(1 + log T )\n\n2HT\n\n.\n\n\uf8ff\n\nMoreover, if f is also -smooth and the global minimum x\u21e4 = arg minx2Rn f (x) belongs to K, then,\n\nf (\u00afxT ) min\nx2K\n\nf (x) \uf8ff\n\n(/H)G2 (1 + log T )2\n\nHT 2\n\n.\n\n3.2 SC-AdaNGD2\n\nHere we show that SC-AdaNGD2 enjoys the standard \u02dcO(1/T ) rate for strongly-convex objectives,\nand a linear rate assuming that the objective is also smooth. We emphasize that the same algorithm\nenjoys these rates simultaneously, without any prior knowledge of the smoothness or of the gradient\nnorms. In the case where k = 2 the guarantee of SC-AdaNGD is as follows,\nTheorem 3.2. Let k = 2, K be a convex set, and f be a G-Lipschitz and H-strongly-convex function;\nAlso let \u00afxT be the outputs of SC-AdaNGD2 (Alg. 3), then the following holds:\n\nf (\u00afxT ) min\nx2K\n\nf (x) \uf8ff\n\n1 + log(G2PT\n2HPT\n\nt=1 kgtk2)\n\nt=1 kgtk2\n\nG2(1 + log T )\n\n2HT\n\n.\n\n\uf8ff\n\nMoreover, if f is also -smooth and the global minimum x\u21e4 = arg minx2Rn f (x) belongs to K, then,\n\nf (\u00afxT ) min\nx2K\n\nf (x) \uf8ff\n\n3G2\n2H\n\ne H\n\n T\u27131 +\n\nH\n\n\nT\u25c6 .\n\nIntuition: For strongly-convex objectives the appropriate GD algorithm utilizes two very extreme\nlearning rates of \u2318t / 1/t vs. \u2318t = 1/ for the general/smooth settings respectively. 
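Reading off Algorithm 3 with k = 2, the step applied to g_t has effective size 1/(H ||g_t||² Σ_{τ≤t} 1/||g_τ||²). The snippet below (a standalone numerical illustration, with hypothetical gradient-norm sequences) shows the interpolation between the two extreme rates just mentioned: for constant norms the step behaves like 1/(Ht), while for geometrically decaying norms it tends to a constant.

```python
import numpy as np

def effective_steps(grad_norms, H=1.0):
    """Effective SC-AdaNGD2 step size on g_t, read off Algorithm 3 with
    k = 2: eta_t / ||g_t||^2 = 1/(H * ||g_t||^2 * sum 1/||g_tau||^2)."""
    inv_sq = 1.0 / np.asarray(grad_norms, dtype=float) ** 2
    return inv_sq / (H * np.cumsum(inv_sq))

# Constant gradient norms: step decays like 1/(H t), as in GD for
# strongly-convex objectives.
const_steps = effective_steps(np.ones(50))
# Geometric decay ||g_t|| ~ q^t: step tends to the constant 1 - q^2.
geo_steps = effective_steps(0.9 ** np.arange(1, 60))
```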
A possible\nexplanation to the universality of SCAdaNGD2 is that it implicitly interpolate between these rates.\nkgtk2\nIndeed the update rule of our algorithm can be written as follows, xt+1 = xt 1\n\u2327 =1 kg\u2327k2 gt .\nPt\nThus, ignoring the hindsight weighting, SCAdaNGD2 is equivalent to GD with an adaptive learning\nrate \u02dc\u2318t := kgtk2/HPt\n\u2327 =1 kg\u2327k2. Now, when all gradient norms are of the same magnitude, then\n\u02dc\u2318t / 1/t, which boils down to the standard GD for strongly-convex objectives. Conversely, assume\nthat the gradients are exponentially decaying, i.e., that kgtk / qt for some q < 1. In this case \u02dc\u2318t is\napproximately constant. We believe that the latter applies for strongly-convex & smooth case.\n\nH\n\n4 Adaptive NGD for Stochastic Optimization\n\nHere we show that using data-dependent minibatch sizes, we can adapt our (SC-)AdaNGD2 algo-\nrithms (Algs. 2, 3 with k = 2) to the stochastic setting, and achieve the well know convergence rates\nfor the convex/strongly-convex settings. Next we introduce the stochastic optimization setting, and\nthen we present and discuss our Lazy SGD algorithm.\nSetup: We consider the problem of minimizing a convex/strongly-convex function f : K 7! R,\nwhere K2 Rd is a convex set. We assume that optimization lasts for T rounds; on each round\n\n6\n\n\fAlgorithm 4 Lazy Stochastic Gradient Descent (LazySGD)\n\nInput: #Oracle Queries T , x1 2 Rd, set K, \u23180, p\nSet: t = 0, s = 0\nwhile t \uf8ff T do\n\nAlgorithm 5 Adaptive Estimate (AE)\n\nni\n\ni=1\n\ni=1 ni = T )\n\nend while\n\nT xi . (Note thatPs\n\nUpdate: s = s + 1\nSet G = GradOracle(xs), i.e., G generates i.i.d. 
noisy samples of rf (xs)\nGet: (\u02dcgs, ns) = AE(G, T t) % Adaptive Minibatch\nUpdate: t = t + ns\nCalculate: \u02c6gs = ns\u02dcgs\nSet: \u2318s = \u23180/tp\nUpdate: xs+1 =\u21e7 K (xs \u2318s\u02c6gs)\nReturn: \u00afxT =Ps\nInput: random vectors generator G, sample budget Tmax, sample factor m0\nSet: i = 0, N = 0, \u02dcg0 = 0\nwhile N < Tmax do\nTake \u2327i = min{2i, Tmax N} samples from G\nSet N N + \u2327i\nUpdate: \u02dcgN Average of N samples received so far from G\nIf k\u02dcgNk > 3m0/pN then return (\u02dcgN , N )\nUpdate i i + 1\nend while\nReturn: (\u02dcgN , N )\n\nt = 1, . . . , T , we may query a point xt 2K , and receive a feedback. After the last round, we choose\n\u00afxT 2K , and our performance measure is the expected excess loss, de\ufb01ned as,\n\nE[f (\u00afxT )] min\nx2K\n\nf (x) .\n\nHere we assume that our feedback is a \ufb01rst order noisy oracle G : K 7! Rd such that upon\nquerying G with a point xt 2K , we receive a bounded and unbiased gradient estimate, G(xt),\nsuch E[G(xt)|xt] = rf (xt); kG(xt)k \uf8ff G. We also assume that the that the internal coin tosses\n(randomizations) of the oracle are independent. It is well known that variants of Stochastic Gradient\nDescent (SGD) are ensured to output an estimate \u00afxT such that the excess loss is bounded by\nO(1/pT )/O(1/T ) for the setups of convex/strongly-convex stochastic optimization, [20], [21].\nNotation: In this section we make a clear distinction between the number of queries to the gradient\noracle, denoted henceforth by T ; and between the number of iterations in the algorithm, denoted\nhenceforth by S. We care about the dependence of the excess loss in T .\n\n4.1 Lazy Stochastic Gradient Descent\n\nData Dependent Minibatch sizes: The Lazy SGD (Alg. 4) algorithm that we present in this section,\nuses a minibatch size that changes in between query points. Given a query point xs, Lazy SGD\ninvokes the noisy gradient oracle \u02dcO(1/kgsk2) times, where gs := rf (xs) 3. 
Thus, in contrast to\nSGD which utilizes a \ufb01xed number of oracle calls per query point, our algorithm tends to stall in\npoints with smaller gradients, hence the name Lazy SGD.\nHere we give some intuition regarding our adaptive minibatch size rule: Consider the stochastic\noptimization setting. However, imagine that instead of the noisy gradient oracle G, we may access an\nimproved (imaginary) oracle which provides us with unbiased estimates, \u02dcg(x), that are accurate up\nto some multiplicative factor, e.g., E[\u02dcg(x)|x] = rf (x), and 1\n2krf (x)k \uf8ff k\u02dcg(x)k \uf8ff 2krf (x)k .\nThen intuitively we could have used these estimates instead of the exact normalized gradients inside\nour (SC-)AdaNGD2 algorithms (Algs. 2, 3 with k = 2), and still get similar (in expectation) data\n\n3Note that the gradient norm, kgsk, is unknown to the algorithm. Nevertheless it is estimated on the \ufb02y.\n\n7\n\n\fdependent bounds. Quite nicely, we may use our original noisy oracle G to generate estimates\nfrom this imaginary oracle. This can be done by invoking G for \u02dcO(1/kgsk2) times at each query\npoint. Using this minibatch rule, the total number of calls to G (along all iterations) is equal to\nT =PS\ns=1 1/kgsk2. Plugging this into the data dependent bounds of (SC-)AdaNGD2 (Thms. 2.2,\n3.2), we get the well known \u02dcO(1/pT )/ \u02dcO(1/T ) rates for the stochastic convex settings.\nThe imaginary oracle: The construction of the imaginary oracle from the original oracle appears in\nAlgorithm 5 (AE procedure) . It receives as an input, G, a generator of independent random vectors\nwith an (unknown) expected value g 2 Rd. The algorithm outputs two variables: N which is an\nestimate of 1/kgk2, and \u02dcgN an average of N random vectors from G. Thus, it is natural to think of\nN \u02dcgN as an estimate for g/kgk2. Moreover, it can be shown that E[N (\u02dcgN g)] = 0. Thus in a sense\nwe receive an unbiased estimate. 
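Following the listing above, the AE procedure (Algorithm 5) can be sketched as follows. This is an illustrative simplification, not the paper's code: `gen` stands for a single call to the noisy oracle, and the doubling schedule together with the ||g̃_N|| > 3·m0/√N stopping rule mirror the pseudo-code.

```python
import numpy as np

def adaptive_estimate(gen, t_max, m0):
    """AE sketch (Algorithm 5): draw samples in doubling batches and
    stop once the running mean is large relative to 3*m0/sqrt(N),
    so that N scales roughly like 1/||g||^2 (capped at t_max)."""
    samples, i = [], 0
    while len(samples) < t_max:
        tau = min(2 ** i, t_max - len(samples))
        samples.extend(gen() for _ in range(tau))
        g_mean = np.mean(samples, axis=0)
        if np.linalg.norm(g_mean) > 3.0 * m0 / np.sqrt(len(samples)):
            return g_mean, len(samples)
        i += 1
    return np.mean(samples, axis=0), len(samples)
```

With a noiseless generator returning a fixed vector g, the rule stops at the first N on the doubling grid with √N > 3·m0/||g||, i.e. N grows like 1/||g||² as in Lemma 4.1.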
The guarantees of Algorithm 5 appear below,\nLemma 4.1 (Informal). Let Tmax 1, 2 (0, 1). Suppose an oracle G : K 7! Rd that generates\nG-bounded i.i.d. random vectors with an (unknown) expected value g 2 Rd. Then w.p. 1 ,\ninvoking AE (Algorithm 5), with m0 =\u21e5( G log(1/)), it is ensured that:\n\nN =\u21e5(min {m0/kgk2, Tmax}), and E[N (\u02dcgN g)] = 0 .\n\nLazy SGD: Now, plugging the output of the AE algorithm into our of\ufb02ine algorithms (SC-)AdaNGD2,\nwe get their stochastic variants which appears in Algorithm 4 (Lazy SGD). This algorithm is equivalent\nto the of\ufb02ine version of (SC-)AdaNGD2, with the difference that we use ns instead of 1/krf (xs)k2\nand ns\u02dcgs instead of rf (xs)/krf (xs)k2.\nLet T be a bound on the total number of queries to the the \ufb01rst order oracle G, and be the con\ufb01dence\nparameter used to set m0 in the AE procedure. Next we present the guarantees of LazySGD,\nLemma 4.2. Let = O(T 3/2); let K be a convex set with diameter D, and f be a convex function;\nand assume kG(x)k \uf8ff G w.p.1. Then using LazySGD with \u23180 = D/p2G, p = 1/2, ensures:\n\nE[f (\u00afxT )] min\nx2K\n\nf (x) \uf8ff O\u2713 GD log(T )\npT\n\n\u25c6 .\n\nLemma 4.3. Let = O(T 2), let K be a convex set, and f be an H-strongly-convex convex function;\nand assume kG(x)k \uf8ff G w.p.1. Then using LazySGD with \u23180 = 1/H, p = 1, ensures:\n\nE[f (\u00afxT )] min\nx2K\n\nf (x) \uf8ff O\u2713 G2 log2(T )\n\nHT\n\n\u25c6 .\n\nNote that LazySGD uses minibatch sizes that are adapted to the magnitude of the gradients, and still\nmaintains the optimal O(1/pT )/O(1/T ) rates. In contrast using a \ufb01xed minibatch size b for SGD\nmight degrade the convergence rates, yielding O(pb/pT )/O(b/T ) guarantees. 
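Putting the pieces together, a self-contained sketch of LazySGD (Algorithm 4) with the AE rule inlined and budget-capped might look as follows. It is illustrative only: `project`, the parameter values in the test, and the deterministic oracle are assumptions, and the confidence-based choice of m0 from Lemma 4.1 is omitted.

```python
import numpy as np

def lazy_sgd(oracle, x1, total_queries, eta0, p, m0, project=lambda x: x):
    """LazySGD sketch: at each query point, draw doubling minibatches
    until the averaged gradient g~ is large relative to 3*m0/sqrt(n)
    (the AE rule), then step with eta0/t^p times n*g~, and return the
    n_s/T - weighted average of the query points."""
    x = np.asarray(x1, dtype=float)
    t, points, weights = 0, [], []
    while t < total_queries:
        samples, i = [], 0
        while len(samples) < total_queries - t:  # adaptive minibatch
            tau = min(2 ** i, total_queries - t - len(samples))
            samples.extend(oracle(x) for _ in range(tau))
            g_mean = np.mean(samples, axis=0)
            if np.linalg.norm(g_mean) > 3.0 * m0 / np.sqrt(len(samples)):
                break
            i += 1
        n = len(samples)
        t += n
        points.append(x.copy())
        weights.append(n)
        x = project(x - (eta0 / t ** p) * n * g_mean)
    w = np.array(weights, dtype=float) / t
    return np.sum(w[:, None] * np.array(points), axis=0)
```

Note how the loop stalls at points with small gradients: as the gradient shrinks, the AE threshold takes longer to trigger and the minibatch (and the hindsight weight n_s/T) grows.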
This property of LazySGD may be beneficial when considering distributed computations (see [13]).\n\n5 Discussion\n\nWe have presented a new approach based on a conversion scheme, which exhibits universality and new adaptive bounds in the offline convex optimization setting, and provides a principled approach towards minibatch size selection in the stochastic setting. Among the many questions that remain open is whether we can devise \u201caccelerated\u201d universal methods. Furthermore, our universality results only apply when the global minimum is inside the constraints; it is therefore natural to seek methods that ensure universality when this assumption is violated. Moreover, our algorithms depend on a parameter k ∈ R, but only the cases k ∈ {0, 1, 2} are well understood. Investigating a wider spectrum of k values is intriguing. Lastly, it is interesting to modify and test our methods in non-convex scenarios, especially in the context of deep-learning applications.\n\nAcknowledgments\nI would like to thank Elad Hazan and Shai Shalev-Shwartz for fruitful discussions during the early stages of this work.\nThis work was supported by the ETH Zürich Postdoctoral Fellowship and Marie Curie Actions for People COFUND program.\n\nReferences\n[1] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.\n[2] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n[3] Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.\n[4] Stephen Wright and Jorge Nocedal. Numerical optimization. Springer Science, 35:67–68, 1999.\n[5] Yu Nesterov. Universal gradient methods for convex optimization problems. 
Mathematical\n\nProgramming, 152(1-2):381\u2013404, 2015.\n\n[6] H Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex\n\noptimization. COLT 2010, page 244, 2010.\n\n[7] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running\naverage of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2),\n2012.\n\n[8] Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701,\n\n2012.\n\n[9] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of\n\nconvergence o (1/k2). In Doklady an SSSR, volume 269, pages 543\u2013547, 1983.\n\n[10] Yu E Nesterov. Minimization methods for nonsmooth convex and quasiconvex functions.\n\nMatekon, 29:519\u2013531, 1984.\n\n[11] Elad Hazan, K\ufb01r Levy, and Shai Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex\noptimization. In Advances in Neural Information Processing Systems, pages 1594\u20131602, 2015.\n\n[12] K\ufb01r Y Levy. The power of normalization: Faster evasion of saddle points. arXiv preprint\n\narXiv:1611.04831, 2016.\n\n[13] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online\nprediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165\u2013202, 2012.\n\n[14] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms\nvia accelerated gradient methods. In Advances in neural information processing systems, pages\n1647\u20131655, 2011.\n\n[15] Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent.\n\nIn Advances in Neural Information Processing Systems, pages 378\u2013385, 2013.\n\n[16] Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Ef\ufb01cient mini-batch training for\nstochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on\nKnowledge discovery and data mining, pages 661\u2013670. 
ACM, 2014.\n\n[17] Martin Tak\u00e1\u02c7c, Peter Richt\u00e1rik, and Nathan Srebro. Distributed mini-batch sdca. arXiv preprint\n\narXiv:1507.08322, 2015.\n\n[18] Prateek Jain, Sham M Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Par-\nallelizing stochastic approximation through mini-batching and tail-averaging. arXiv preprint\narXiv:1610.03774, 2016.\n\n[19] Peter S Bullen, Dragoslav S Mitrinovic, and M Vasic. Means and their Inequalities, volume 31.\n\nSpringer Science & Business Media, 2013.\n\n[20] Arkadii Nemirovskii, David Borisovich Yudin, and ER Dawson. Problem complexity and\n\nmethod ef\ufb01ciency in optimization. 1983.\n\n9\n\n\f[21] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex\n\noptimization. Machine Learning, 69(2-3):169\u2013192, 2007.\n\n[22] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for \ufb01rst-order optimiza-\n\ntion. In Advances in Neural Information Processing Systems, pages 3384\u20133392, 2015.\n\n[23] Elad Hazan and Tomer Koren. Linear regression with limited observation. In Proceedings of\n\nthe 29th International Conference on Machine Learning (ICML-12), pages 807\u2013814, 2012.\n\n[24] Kenneth L Clarkson, Elad Hazan, and David P Woodruff. Sublinear optimization for machine\n\nlearning. Journal of the ACM (JACM), 59(5):23, 2012.\n\n[25] Sham Kakade. Lecture notes in multivariate analysis, dimensionality reduction, and spectral\nmethods. http://stat.wharton.upenn.edu/~skakade/courses/stat991_\nmult/lectures/MatrixConcen.pdf, April 2010.\n\n[26] Anatoli B Juditsky and Arkadi S Nemirovski. Large deviations of vector-valued martingales in\n\n2-smooth normed spaces. arXiv preprint arXiv:0809.0813, 2008.\n\n[27] David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. 
Markov chains and mixing times.\n\nAmerican Mathematical Soc., 2009.\n\n10\n\n\f", "award": [], "sourceid": 1023, "authors": [{"given_name": "Kfir", "family_name": "Levy", "institution": "ETH"}]}