{"title": "New Insight into Hybrid Stochastic Gradient Descent: Beyond With-Replacement Sampling and Convexity", "book": "Advances in Neural Information Processing Systems", "page_first": 1234, "page_last": 1243, "abstract": "As an incremental-gradient algorithm, the hybrid stochastic gradient descent (HSGD) enjoys merits of both stochastic and full gradient methods for the finite-sum minimization problem. However, the existing rate-of-convergence analysis for HSGD is made under with-replacement sampling (WRS) and is restricted to convex problems. It is not clear whether HSGD still carries these advantages under the common practice of without-replacement sampling (WoRS) for non-convex problems. In this paper, we affirmatively answer this open question by showing that under WoRS and for both convex and non-convex problems, it is still possible for HSGD (with constant step-size) to match full gradient descent in rate of convergence, while maintaining sample-size-independent incremental first-order oracle complexity comparable to stochastic gradient descent. For a special class of finite-sum problems with linear prediction models, our convergence results can be further improved in some cases. Extensive numerical results confirm our theoretical affirmation and demonstrate the favorable efficiency of WoRS-based HSGD.", "full_text": "New Insight into Hybrid Stochastic Gradient Descent:
Beyond With-Replacement Sampling and Convexity

Pan Zhou*    Xiao-Tong Yuan†    Jiashi Feng*

* Learning & Vision Lab, National University of Singapore, Singapore
† B-DAT Lab, Nanjing University of Information Science & Technology, Nanjing, China

pzhou@u.nus.edu    xtyuan@nuist.edu.cn    elefjia@nus.edu.sg

Abstract

As an incremental-gradient algorithm, the hybrid stochastic gradient descent (HSGD) enjoys merits of both stochastic and full gradient methods for finite-sum problem optimization. 
However, the existing rate-of-convergence analysis for HSGD is made under with-replacement sampling (WRS) and is restricted to convex problems. It is not clear whether HSGD still carries these advantages under the common practice of without-replacement sampling (WoRS) for non-convex problems. In this paper, we affirmatively answer this open question by showing that under WoRS and for both convex and non-convex problems, it is still possible for HSGD (with constant step-size) to match full gradient descent in rate of convergence, while maintaining sample-size-independent incremental first-order oracle complexity comparable to stochastic gradient descent. For a special class of finite-sum problems with linear prediction models, our convergence results can be further improved in some cases. Extensive numerical results confirm our theoretical findings and demonstrate the favorable efficiency of WoRS-based HSGD.

1 Introduction

We consider the following finite-sum minimization problem:

min_{x \in \mathcal{X}} f(x) := (1/n) \sum_{i=1}^{n} f_i(x),   (1)

where each individual f_i(x) is \ell-smooth and the feasible set \mathcal{X} \subseteq R^d is convex. In the field of machine learning, formulation (1) encapsulates a large body of optimization problems including least square regression, logistic regression and deep neural network training, to name a few. Such a problem can be solved by various algorithms, e.g. full gradient descent (FGD) [1], stochastic GD (SGD) [2], hybrid SGD [3], SDCA [4] and SVRG [5].

In this paper, we are particularly interested in Hybrid SGD (HSGD) [3, 6, 7], which is an inexact gradient method that iteratively samples an evolving mini-batch of the terms in (1) for gradient estimation. 
The iteration of HSGD is given by

x_{k+1} = \Phi_{\mathcal{X}}(x_k - \eta_k g_k), with g_k = (1/s_k) \sum_{i_k \in S_k} \nabla f_{i_k}(x_k),

where \Phi_{\mathcal{X}}(\cdot) denotes the Euclidean projection onto \mathcal{X}, \eta_k is the learning rate, and S_k denotes the set of the s_k selected samples at the k-th iteration. In early iterations, HSGD selects a few samples to compute the full gradient approximately; as the iterations proceed, s_k is increased gradually, leading to an increasingly accurate full gradient estimate. Such a mechanism allows HSGD to simultaneously enjoy the merits of both SGD and FGD, i.e. the rapid initial progress of SGD and a constant learning rate \eta_k, without sacrificing the convergence rate of FGD [6].

Motivation. Though HSGD has been shown, both in theory and practice, to smoothly bridge the gap between full and stochastic gradient descent methods, its rate-of-convergence analysis remains restrictive in several aspects.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Comparison of WoRS-based HSGD. (a) WoRS vs. WRS in HSGD: optimizing a softmax regression model with a single full pass over the dataset letter. (b) Comparison among randomized algorithms for optimizing a feedforward neural network with 50 full passes over the dataset sensorless. HSGD-exp and HSGD-lin respectively denote WoRS-based HSGD with exponentially and linearly increasing mini-batch sizes (cf. Sections 3.2 and 3.4). See more results in the supplement.

First, the convergence behavior of HSGD under without-replacement sampling (WoRS) is not clear. In the existing analysis [6], the stochastic gradient is assumed to be computed under with-replacement sampling (WRS). But for stochastic optimization, it is a more common practice to use WoRS, i.e., to pass the loss functions f_i(x) sequentially, after random shuffling, without revisiting any of them [8, 9]. 
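The difference between the two sampling schemes can be sketched in a few lines. The snippet below is an illustration (not the authors' code, and the function names are ours): wrs_batches draws every mini-batch independently from all n indices, while wors_batches consumes a single random shuffle, so no sample is revisited within one epoch.

```python
import random

def wrs_batches(n, sizes, seed=0):
    """With-replacement sampling: each mini-batch is drawn from all n indices."""
    rng = random.Random(seed)
    return [[rng.randrange(n) for _ in range(s)] for s in sizes]

def wors_batches(n, sizes, seed=0):
    """Without-replacement sampling: mini-batches partition a random shuffle,
    so no sample is revisited until the epoch (one full pass) ends."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    batches, start = [], 0
    for s in sizes:
        batches.append(order[start:start + s])
        start += s
    return batches

# Evolving mini-batch sizes as in HSGD: s_k grows over the iterations.
sizes = [2, 4, 8, 16]
wors = wors_batches(n=30, sizes=sizes)
```

Under WoRS the batches are disjoint across iterations, which is exactly the statistical dependence the later analysis has to account for.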
This creates a significant discrepancy between the theoretical guarantees and the practical implementation. As shown in Figure 1 (a), WoRS tends to provide better performance than WRS in actual implementations.

Second, the convergence behavior of HSGD for non-convex problems is not clear. Prior convergence guarantees on HSGD are limited to convex problems. Bertsekas [3] established linear convergence of HSGD for least square problems. Friedlander et al. [6] proved that HSGD converges linearly for strongly convex problems with exponentially increasing s_k, and sub-linearly for arbitrary convex problems with polynomially increasing s_k. Unfortunately, a non-convex convergence guarantee for HSGD is still absent, though it is highly desirable in machine learning applications and has been extensively studied for other stochastic algorithms, e.g. SVRG [10, 11]. In Figure 1 (b), HSGD shows sharper convergence behavior than several state-of-the-art SGD methods in training neural networks.

Third, the Incremental First-order Oracle (IFO) complexity (i.e. the number of stochastic gradient computations; see Definition 2) of HSGD is largely left unknown. Although Friedlander et al. [6] showed that HSGD maintains the steady convergence rate of FGD, its IFO complexity is not explicitly analyzed, making it less clear where HSGD should be positioned w.r.t. existing stochastic gradient approaches in overall computational complexity.

Summary of contributions. In this work, we address the aforementioned three limitations in the existing analysis of HSGD. We analyze the rate of convergence of HSGD under WoRS in a wide problem spectrum including strongly convex, non-strongly convex and non-convex problems. Table 1 summarizes our main results on the IFO complexity of HSGD (WoRS) and compares them against state-of-the-art WoRS-oriented results for (stochastic) gradient methods. 
These results are divided into two groups: for general problems and for a special class of problems with linear prediction loss f_i(x) = h(a_i^T x). As shown in the bottom row of Table 1, we contribute several new theoretical insights into HSGD, which are elaborated in the following paragraphs.

The bounds highlighted in green: For both general and certain specially structured strongly convex problems, HSGD is n times faster than FGD. Compared to the results for SAGA and AVRG [12], the IFO complexity of HSGD does not rely on the sample size n but depends on 1/\epsilon. This suggests that HSGD will converge faster when n dominates 1/\epsilon. Finally, compared to the results for SGD on linear prediction problems [13], ours removes the dependency on the logarithmic term log(\kappa/\epsilon).

The bounds highlighted in red: To the best of our knowledge, these new results for the first time establish guarantees on WoRS-based stochastic approaches for non-strongly convex and non-convex problems.

The bounds highlighted in blue: If the loss function h(a_i^T x) in the linearly structured problem is strongly convex in terms of a_i^T x (but f(x) may still be non-strongly convex), HSGD has O(1/\epsilon) IFO complexity. The least square regression and logistic regression (with a bounded feasible set) models have such a linear prediction structure.

The bounds highlighted in brown: When the specially structured problem is non-strongly convex, HSGD converges to the minimum of problem (1), while SGD can only be shown to converge to a sub-optimum up to some statistical error (see footnote 2 below Table 1).

Related work. Understanding randomized algorithms under WoRS and random reshuffling has gained considerable attention in recent years. By focusing on least squares problems, Recht et al. 
[14] utilized an arithmetic-mean inequality on matrices to show that for randomized algorithms, WoRS is always faster than WRS if the data are randomly generated from a certain distribution.

Table 1: Comparison of IFO complexity for randomized algorithms under WoRS. \kappa = \ell/\rho denotes the condition number in the \ell-smooth and \rho-strongly convex case of problem (1). The three general-problem columns are strongly convex, non-strongly convex and non-convex; the three columns for the specially structured problem with f_i(x) = h(a_i^T x) are: f(.) strongly convex, f(.) non-strongly convex, and h(.) strongly convex. Best viewed in color.

Metric: E||x_a - x*||_2^2 <= \epsilon for strongly convex, E[f(x_a) - f(x*)] <= \epsilon for non-strongly convex, E||\nabla f(x_a)||_2^2 <= \epsilon for non-convex.
| FGD [9]   | O(n\kappa^2/\epsilon)          | --                 | --               | O(n\kappa^2/\epsilon)          | --                 | --             |
| SAGA [12] | O(n\kappa^2 log(1/\epsilon))   | --                 | --               | O(n\kappa^2 log(1/\epsilon))   | --                 | --             |
| AVRG [12] | O(n\kappa^2 log(1/\epsilon))   | --                 | --               | O(n\kappa^2 log(1/\epsilon))   | --                 | --             |
| HSGD      | O(\kappa^2/\epsilon)           | O(1/\epsilon^3)^1  | O(1/\epsilon^2)  | O(\kappa^2/\epsilon)           | O(1/\epsilon^3)^1  | O(1/\epsilon)  |

Metric: E[f(x_a) - f(x*)] <= \epsilon for both strongly and non-strongly convex, E||\nabla f(x_a)||_2^2 <= \epsilon for non-convex.
| SGD [13]  | --                 | --               | --               | O((\kappa/\epsilon) log(\kappa/\epsilon)) | O(1/\epsilon^2)^2  | O(1/\epsilon^2) |
| HSGD      | O(\kappa/\epsilon) | O(1/\epsilon^3)  | O(1/\epsilon^2)  | O(\kappa/\epsilon)                        | O(1/\epsilon^3)    | O(1/\epsilon)   |

^1 Our IFO complexity for arbitrary convex cases appears higher than the non-convex ones, as we use the sub-optimality metric E[f(x_a) - f(x*)] <= \epsilon for convex cases while E||\nabla f(x_a)||_2^2 <= \epsilon for non-convex cases.
^2 Corollary 1 in [13] provides E[f(x_a) - f(x*)] <= R_T/k + 2(12 + \sqrt{2} D)/\sqrt{n}, where D denotes the diameter of the domain \mathcal{X}, k is the iteration number and R_T ~ O(D \ell \sqrt{k}) is the regret bound of SGD for (1). The term 2(12 + \sqrt{2} D)/\sqrt{n} is a statistical error which is an artifact of the regret analysis approach.

For more general smooth and strongly convex problems, Gürbüzbalaban et al. 
[9] proved that gradient descent based on random reshuffling enjoys an O(1/k^2) rate of convergence after k epochs, as opposed to O(1/k) under WRS. But this analysis does not explicitly explain why WoRS works well after a few (or even just one) passes over the data. To answer such a central question, by leveraging regret analysis, Shamir et al. [13] proved that for a special class of loss functions f_i(x) = h(a_i^T x), SGD and SVRG using WoRS can achieve IFO complexity competitive with their WRS counterparts. More recently, Ying et al. [12] proved that for strongly convex problems, both SAGA [15] and their proposed AVRG algorithm achieve a linear convergence rate with WoRS. Recently, Zhou et al. [7] applied the HSGD algorithm to solving sparsity- or rank-constrained problems and proved its linear convergence rate under restricted strong convexity and smoothness conditions. Our work differs from these prior works: 1) For the first time, we provide a WoRS-based theoretical analysis for HSGD. 2) Our analysis covers non-strongly convex and non-convex cases which are not covered by the current WoRS analysis of stochastic gradient methods.

2 Preliminaries

We first introduce the concepts of strong convexity and Lipschitz smoothness which are commonly used in analyzing stochastic gradient methods [4, 5, 16, 17, 18, 19].

Definition 1 (Strong convexity and Lipschitz smoothness). We say a function g(x) is \rho-strongly convex if there exists a positive constant \rho such that for all x_1, x_2 \in \mathcal{X}, g(x_1) >= g(x_2) + <\nabla g(x_2), x_1 - x_2> + (\rho/2)||x_1 - x_2||_2^2. Moreover, we say g(x) is \ell-smooth if there exists a positive constant \ell such that ||\nabla g(x_1) - \nabla g(x_2)||_2 <= \ell ||x_1 - x_2||_2.

In all our analysis, we impose the basic Assumption 1 to bound the stochastic gradient variance.

Assumption 1 (Bounded gradient). 
For each loss f_i(x), the distance between its gradient \nabla f_i(x) and the full gradient \nabla f(x) is upper bounded as max_i ||\nabla f_i(x) - \nabla f(x)||_2 <= G.

If f_i(x) is \ell-smooth and the domain of interest \mathcal{X} is bounded, then the bounded gradient assumption follows naturally. We explicitly write out this assumption for the sake of notational simplicity. Following [5, 20, 21], we also employ the incremental first-order oracle (IFO) complexity as the computational complexity metric for solving the finite-sum minimization problem (1).

Definition 2. An IFO takes an index i \in [n] and a point x \in \mathcal{X}, and returns the pair (f_i(x), \nabla f_i(x)).

The IFO complexity can accurately reflect the overall computational performance of a first-order algorithm, as objective value and gradient evaluations usually dominate the per-iteration complexity.

Algorithm 1 Hybrid SGD under WoRS
Input: Initial point x_0, sample index set S = {1, ..., n}, learning rates {\eta_k}, mini-batch sizes {s_k}.
for k = 0 to T - 1 do
  Select s_k samples S_k by WoRS from S - \cup_{i=0}^{k-1} S_i.
  Compute the gradient g_k = (1/s_k) \sum_{i_k \in S_k} \nabla f_{i_k}(x_k).
  Update x_{k+1} = \Phi_{\mathcal{X}}(x_k - \eta_k g_k).
end for
Output: x_a sampled uniformly from {x_k}_{k=0}^{T-1} for strongly convex and linearly structured problems, or from {x_k}_{k=floor(0.5T)}^{T-1} for non-strongly convex and non-convex problems.

3 General Analysis for HSGD under WoRS

The WoRS-based HSGD algorithm is outlined in Algorithm 1. Here we systematically analyze its convergence performance for strongly convex, non-strongly convex and non-convex problems. Similar to [13], we focus our analysis on the scenario where a single pass (or less) over the data is of interest, which occurs, e.g., in streaming data analysis. 
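Algorithm 1 can be sketched compactly. The following is an illustrative implementation (ours, not the authors' code) on a toy unconstrained scalar least-squares problem, so the projection \Phi_{\mathcal{X}} reduces to the identity; the step size and batch schedule are hypothetical choices rather than the theorem-prescribed constants.

```python
import random

def hsgd_wors(grad_i, n, x0, eta, batch_schedule, seed=0):
    """Sketch of Algorithm 1 (Hybrid SGD under WoRS), unconstrained case.

    grad_i(i, x): gradient of the i-th loss at x; batch_schedule: the s_k values.
    Each mini-batch is drawn without replacement from the not-yet-visited indices,
    i.e. the batches partition one random shuffle of {0, ..., n-1}.
    """
    rng = random.Random(seed)
    remaining = list(range(n))
    rng.shuffle(remaining)
    x = x0
    for s_k in batch_schedule:
        batch, remaining = remaining[:s_k], remaining[s_k:]
        if not batch:       # one full pass over the data is exhausted
            break
        g = sum(grad_i(i, x) for i in batch) / len(batch)
        x = x - eta * g     # projection omitted: X = R here
    return x

# Toy problem: f_i(x) = 0.5 * (x - a_i)^2, whose minimizer is the mean of a_i.
a = [float(i) for i in range(100)]
x_star = sum(a) / len(a)
x_out = hsgd_wors(lambda i, x: x - a[i], n=100, x0=0.0,
                  eta=0.5, batch_schedule=[2, 4, 8, 16, 32, 64])
```

The growing batch sizes make the later gradient estimates progressively closer to the full gradient, which is the mechanism the analysis below quantifies.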
According to our empirical study (see, e.g., Figure 3), running Algorithm 1 for a single pass over the data can provide satisfactory accuracy in many cases.

3.1 A key lemma

It is well understood that unbiased gradient estimation with gradually vanishing variance is important for accelerating randomized algorithms [5, 15]. This is because an increasingly accurate estimate of the full gradient allows the algorithm to move ahead with a more aggressive step-size to decrease the objective value. However, under WoRS the mini-batch terms selected at each iteration are no longer statistically independent, leading to a biased gradient estimate g_k, i.e.

E[g_k] = E[(1/s_k) \sum_{i_k \in S_k} \nabla f_{i_k}(x_k)] \neq \nabla f(x_k).

Such a biased estimate g_k makes it challenging to bound its variance E||g_k - \nabla f(x_k)||_2^2 with common techniques such as the Bernstein inequality [22] and the existing bounds on E||g_k - \nabla f(x_k)||_2^2 under WRS [23]. To tackle this challenge, we introduce the following sequence of random variables:

z_k = \bar{\mu}_k - \nabla f(x_k) and z_0 = 0,

where \bar{\mu}_k := (1/s'_k) \sum_{i_k \in S'_k} \nabla f_{i_k}(x_k), S'_k = S - \cup_{i=0}^{k-1} S_i and s'_k = n - \sum_{i=0}^{k-1} s_i, in which S = {1, 2, ..., n} denotes the index set of all samples. We can prove that the z_k's form a martingale, i.e. E[z_k | z_{k-1}, ..., z_0] = z_{k-1}. Moreover, we can show that its squared Euclidean norm is bounded by

E[||z_k||_2^2 | z_{k-1}, ..., z_0] <= (4G^2/(n - b_k)) [1 - ((n - b_k)^2 - b_k)/(n(n - b_k))],

where b_k = \sum_{i=0}^{k-1} s_i. Similarly, we define a sequence \bar{z}_i for the process of without-replacement sampling a subset \hat{S}_i of size \hat{s}_i from S'_k of size s'_k:

\bar{z}_i = (1/\hat{s}_i) \sum_{i_k \in \hat{S}_i} \nabla f_{i_k}(x_k) - \bar{\mu}_k and \bar{z}_0 = 0.

Also, we can prove that \bar{z}_i is a martingale with bounded norm:

E[||\bar{z}_i||_2^2 | \bar{z}_{i-1}, ..., \bar{z}_0] <= (4G^2/\hat{s}_i) [1 - (\hat{s}_i - 1)/s'_k].

Based on the above arguments, we formulate the k-th WoRS step as a stochastic process consisting of two phases. In the first phase, we are given the s'_k samples indexed by S'_k = S - \cup_{i=0}^{k-1} S_i after k - 1 rounds of WoRS over all the data; the sampling outcome is recorded by z_k. In the second phase, we sample s_k data points without replacement from the remaining s'_k samples indexed by S'_k, which corresponds to \bar{z}_i. Based on such a WoRS process, we have

E||g_k - \nabla f(x_k)||_2^2 <= 2E[||\bar{\mu}_k - \nabla f(x_k)||_2^2 + ||g_k - \bar{\mu}_k||_2^2]
= 2E[||z_k||_2^2 | z_{k-1}, ..., z_0] + 2E[||\bar{z}_{s_k}||_2^2 | \bar{z}_{s_k - 1}, ..., \bar{z}_0; z_{k-1}, ..., z_0]
<= (8G^2/(n - b_k)) [1 - ((n - b_k)^2 - b_k)/(n(n - b_k))] + (8G^2/s_k) [1 - (s_k - 1)/(n - b_k)].

The above claim leads to the following Lemma 1, which is key to our WoRS-based convergence analysis in the sections to follow. Note that the above bounds on the gradient variance are intermediate results for proving Lemma 1, whose full proof is deferred to Appendix A.

Lemma 1. 
The gradient g_k estimated by WoRS in Algorithm 1 satisfies E||g_k - \nabla f(x_k)||_2^2 <= 24G^2/s_k.

From Lemma 1, we find that the gradient variance E||g_k - \nabla f(x_k)||_2^2 in Algorithm 1 is controlled by 1/s_k. Accordingly, the estimated gradient becomes increasingly more accurate and stable. This means that by gradually increasing the mini-batch size, HSGD under WoRS can reduce variance, similarly to SVRG and SAGA, but without needing to integrate historical gradients or the full gradient of a snapshot point into the current gradient estimate. In the following sections, we will extensively use Lemma 1 to analyze HSGD under WoRS.

By applying the Bernstein inequality, Friedlander et al. [6] showed that E||g_k - \nabla f(x_k)||_2^2 = O((n - s_k)/(n s_k)) if the s_k samples selected at iteration k are distinct but are sampled from the entire data set. In contrast, our considered WoRS strategy assumes the s_k distinct samples are drawn from the remaining set S - \cup_{i=0}^{k-1} S_i, and thus needs to take into account the statistical dependence among iterations to bound the stochastic gradient variance.

3.2 Strongly convex functions

We analyze the convergence behavior of both the computed solution x and the objective f(x) under the strongly convex setting. Our convergence result on the computed solution is stated in Theorem 1.

Theorem 1. Suppose f(x) is \rho-strongly convex and each f_i(x) is \ell-smooth. With learning rate \eta_k = \rho/\ell^2 and mini-batch size s_k = \tau/\zeta^k, where \zeta = 1 - \rho^2/(18\ell^2) and \tau >= (G^2/||x_0 - x*||_2^2) max(324/\rho^2, 432/\ell^2), we have

E||x_a - x*||_2^2 <= (1 - \rho^2/(18\ell^2))^T ||x_0 - x*||_2^2,

where x_a is the output solution of Algorithm 1 and T is the number of iterations.

A proof of this result is given in Appendix B.1. By Theorem 1, if the mini-batch size is increased at the exponential rate 1/(1 - \gamma) with \gamma = \rho^2/(18\ell^2), then HSGD converges linearly at the rate O((1 - \gamma)^k) for strongly convex problems. This implies that HSGD enjoys the merits of both SGD and FGD. Specifically, similarly to SGD, the per-iteration computation of HSGD is cheap as it is free of computing the full gradient \nabla f(x). Meanwhile, it uses a constant learning rate and enjoys the steady convergence rate of FGD. As the condition number \kappa = \ell/\rho is usually large in realistic problems, the exponential rate 1/(1 - \gamma) is actually only slightly above one. This means that even a moderate-scale dataset allows plenty of HSGD iterations within one epoch to decrease the objective value sufficiently, as illustrated in Figures 2 and 3. Friedlander et al. [6] proved that HSGD has a linear convergence rate under WRS; Theorem 1 generalizes this result to WoRS. We can then derive the IFO complexity of HSGD for strongly convex problems in the following corollary, whose proof is given in Appendix B.2.

Corollary 1. Suppose the assumptions in Theorem 1 hold. 
To achieve E||x_a - x*||_2^2 <= \epsilon, the IFO complexity of HSGD is O(\kappa^2 G^2/\epsilon), where \kappa = \ell/\rho denotes the condition number of the objective f(x).

From Corollary 1, the IFO complexity of HSGD for strongly convex problems is of the order O(\kappa^2/\epsilon), which does not rely on the sample size n. So when n dominates 1/\epsilon, HSGD can be superior to algorithms whose complexity depends linearly on n, such as SVRG and SAGA.

Gürbüzbalaban et al. [9] showed that by processing each individual f_i(x) with random shuffling at each iteration and adopting a diminishing learning rate \eta_k = O(1/k^\beta) with \beta \in (1/2, 1), the IFO complexity of FGD is O(\kappa^2 n/\epsilon) for achieving E||x_a - x*||_2^2 <= \epsilon. So HSGD is n times faster than FGD. This is because at each iteration, unlike FGD which requires access to all the data, HSGD only samples a mini-batch for gradient estimation without sacrificing the convergence rate. Ying et al. [12] proved that under WoRS, both SAGA and AVRG converge linearly and have IFO complexity O(n\kappa^2 log(1/\epsilon)). Hence, HSGD will outperform SAGA and AVRG if n dominates 1/\epsilon, which is usually the case when the data scale is huge while the desired accuracy \epsilon is moderately small (e.g. \epsilon ~ 1/\sqrt{n}).

Shamir [13] proved that for linearly structured problems, SGD under WoRS has IFO complexity O((\kappa/\epsilon) log(\kappa/\epsilon)) by measuring the objective (see Section 4). Here we can also establish the sharper convergence behavior of the objective value. The result is presented in Theorem 2 with proof provided in Appendix B.3.

Theorem 2. Assume f(x) is \rho-strongly convex and each f_i(x) is \ell-smooth. Let the learning rate be \eta_k = 1/\ell and the mini-batch size be s_k = \tau/\zeta^k with \zeta = 1 - \rho/(2\ell) and \tau >= 6G^2/(\rho [f(x_0) - f(x*)]). Then the output x_a of Algorithm 1 satisfies

E[f(x_a) - f(x*)] <= (1 - \rho/(2\ell))^T (f(x_0) - f(x*)).

Moreover, to achieve E[f(x_a) - f(x*)] <= \epsilon, the IFO complexity of HSGD is O(\kappa G^2/\epsilon), where \kappa = \ell/\rho.

Theorem 2 shows that HSGD also enjoys a linear convergence rate on the objective by using an exponentially increasing mini-batch size. But it has the lower complexity O(\kappa/\epsilon) under the measurement E[f(x_a) - f(x*)] <= \epsilon, in contrast to the complexity O(\kappa^2/\epsilon) for achieving E||x_a - x*||_2^2 <= \epsilon. This is because the objective analysis allows the use of the more aggressive step-size 1/\ell, while the analysis on the solution requires the smaller learning rate \rho/\ell^2. In this way, HSGD with the larger step-size converges faster.

3.3 Non-strongly convex functions

We proceed to analyze the convergence performance of HSGD for non-strongly convex problems. Our result for this case is summarized in Theorem 3. To the best of our knowledge, this is the first convergence guarantee for WoRS-based methods on non-strongly convex problems.

Theorem 3. Suppose f(x) is convex and each f_i(x) is \ell-smooth. Assume that ||x_1 - x_2||_2 <= D holds for all x_1, x_2 \in \mathcal{X}. Then with learning rate \eta_k = 1/(2\ell) and mini-batch size s_k = (k + 1)^2, we have

E[f(x_a) - f(x*)] <= (4\ell D^2 + 24GD)/T + 48G^2/(\ell T^2),

where x_a denotes the output solution of Algorithm 1 and T is the number of iterations.

A proof of this result is given in Appendix B.4. 
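The theorems above prescribe different mini-batch growth schedules: geometric growth s_k = \tau/\zeta^k for the strongly convex case and polynomial growth s_k = (k+1)^2 for the non-strongly convex case. The sketch below (illustrative; the values of tau and zeta and the truncation to a single pass are our assumptions, not the paper's constants) shows how many iterations each schedule yields within one epoch over n samples.

```python
import math

def exponential_schedule(n, tau=1.0, zeta=0.9):
    """s_k = tau / zeta**k (Theorems 1-2 style), truncated to one pass over n samples."""
    sizes, used, k = [], 0, 0
    while used < n:
        s = min(max(1, math.ceil(tau / zeta ** k)), n - used)
        sizes.append(s)
        used += s
        k += 1
    return sizes

def quadratic_schedule(n):
    """s_k = (k + 1)**2 (Theorem 3 style), truncated to one pass over n samples."""
    sizes, used, k = [], 0, 0
    while used < n:
        s = min((k + 1) ** 2, n - used)
        sizes.append(s)
        used += s
        k += 1
    return sizes

exp_sched = exponential_schedule(n=10000, tau=1.0, zeta=0.9)
quad_sched = quadratic_schedule(n=10000)
```

Because the geometric ratio 1/zeta is only slightly above one when the condition number is large, even a moderate n admits many iterations per epoch, matching the discussion after Theorem 1.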
Theorem 3 shows that if one expands the mini-batch size at the rate O(k^2), then the convergence rate of HSGD under WoRS is O(1/T). In [13], a sub-linear rate was established for WoRS-based SGD in a special class of convex problems with f_i(x) = h_i(<a_i, x>). A detailed comparison between their result and ours for such a structured formulation will be discussed in Section 4. Under the assumption \sum_{k=0}^{+\infty} ||g_k - \nabla f(x_k)||_2 < +\infty, Friedlander et al. [6] showed that WRS-based HSGD outputs f(x_a) - f(x*) = O(1/T). However, such an assumption holds only if HSGD selects at least O(k^2) samples at the k-th iteration, due to E||g_k - \nabla f(x_k)||_2^2 = O((n - s_k)/(n s_k)). In this way, their result under WRS is of the same order as ours under WoRS. The following corollary gives the corresponding IFO complexity. A proof of this result is given in Appendix B.5.

Corollary 2. Suppose the assumptions in Theorem 3 hold. To achieve E[f(x_a) - f(x*)] <= \epsilon, the IFO complexity of HSGD is O((6GD + \ell D^2)^3/\epsilon^3).

3.4 Non-convex functions

Now we analyze HSGD for non-convex problems, which to our knowledge has not yet been addressed elsewhere in the literature. The result is stated in Theorem 4 with proof provided in Appendix B.6.

Theorem 4. Suppose each f_i(x) is \ell-smooth and ||x_1 - x_2||_2 <= D for all x_1, x_2 \in \mathcal{X}. With learning rate \eta_k = 1/(2\ell) and mini-batch size s_k = k + 1, the output x_a of Algorithm 1 with T iterations satisfies

E||\nabla f(x_a)||_2^2 <= (4\ell^2 D^2 + 35G^2)/T.

Theorem 4 guarantees that for non-convex problems, HSGD exhibits an O(1/T) rate of convergence by linearly expanding the mini-batch size at each iteration. Here we follow the convention in [10, 11, 23] of adopting the value ||\nabla f(x_a)||_2^2 as a measure of quality for approximate stationary solutions. We then derive the IFO complexity of HSGD in the following corollary, with proof in Appendix B.7.

Corollary 3. Suppose the assumptions in Theorem 4 hold. To achieve E||\nabla f(x_a)||_2^2 <= \epsilon, the IFO complexity of the HSGD in Algorithm 1 is O((4\ell^2 D^2 + 35G^2)^2/\epsilon^2).

The IFO complexity for non-convex problems looks lower than that for non-strongly convex ones in Corollary 2. This is because we use E[f(x_a) - f(x*)] <= \epsilon as the sub-optimality measure for arbitrary convex problems and E||\nabla f(x_a)||_2^2 <= \epsilon for non-convex problems.

4 Analysis for Linearly Structured Problems

We further consider a special case of problem (1) where each f_i(x) has a linear prediction structure:

f(x) := (1/n) \sum_{i=1}^{n} f_i(x), where f_i(x) = h(<a_i, x>).   (2)

Here a_i denotes the i-th sample vector and h(.) denotes a convex loss function. Such a formulation covers several common problems in machine learning, such as f_i(x) = (1/2)(b_i - a_i^T x)^2 for least square regression and f_i(x) = log(1 + exp(-b_i a_i^T x)) for logistic regression, where b_i is the real-valued or binary target output. Such a special problem setting has been considered in [13] for analyzing SGD under WoRS. To make a comprehensive comparison, we specialize our strongly convex analysis to (2), and improve our non-strongly-convex results when the surrogate loss h(.) is strongly convex.

Strongly convex case. In this case, according to Theorem 2, HSGD converges linearly and its IFO complexity is O(\kappa/\epsilon). 
By comparison, SGD under WoRS in [13] converges at $O\big(\frac{\kappa}{T}\log\big(\frac{1}{\epsilon}\big)\big)$ and has IFO complexity $O\big(\frac{\kappa}{\epsilon}\log\big(\frac{\kappa}{\epsilon}\big)\big)$, slightly higher than ours due to the presence of the factor $\log\big(\frac{\kappa}{\epsilon}\big)$. Moreover, HSGD is allowed to use a constant step-size, which is required to be shrinking in [13]. On this special problem, other results on general strongly convex problems can also be applied. As discussed in Section 3.2, HSGD is $n$ times faster than FGD [9], and is superior to SAGA [12] and AVRG [12] when $n$ dominates $\frac{1}{\epsilon}$. Shamir [13] showed that SVRG [5] under WoRS has IFO complexity $O\big(\big(n + \kappa\log\big(\frac{1}{\epsilon}\big)\big)\log\big(\frac{1}{\epsilon}\big)\big)$ in ridge regression with the measurement $\mathbb{E}[f(x) - f(x^*)]$. Comparatively, such an IFO complexity is still higher than that of HSGD when the sample size $n$ is large and the desired accuracy is moderately small.

Non-strongly convex case with strongly convex $h(\cdot)$. When the loss $f(x)$ in (2) is non-strongly convex but the surrogate loss $h(\cdot)$ is strongly convex, we show in Theorem 5 an improved convergence rate over that of Theorem 3 for general cases. See the proof of Theorem 5 in Appendix C.1.

Theorem 5. Suppose $f_i(x) = h(a_i^\top x)$ is $\ell$-smooth and $h(\cdot)$ is $\alpha$-strongly convex. Let $\sigma(A)$ denote the smallest non-zero singular value of the matrix $A = [a_1^\top; a_2^\top; \ldots; a_n^\top]$ and $\mu = \alpha\sigma(A)$. If the learning rate $\eta_k = \frac{1}{2\ell}$ and the mini-batch size $s_k = \frac{\tau}{\zeta^k}$ with $\tau \ge \frac{24G^2}{\mu[f(x_0) - f(x^*)]}$ and $\zeta = 1 - \frac{\mu}{2\ell}$, we have

$$\mathbb{E}[f(x_a) - f(x^*)] \le \Big(1 - \frac{\mu}{2\ell}\Big)^T \big(f(x_0) - f(x^*)\big),$$

where $x_a$ denotes the output solution of Algorithm 1 and $T$ is the number of iterations.

Theorem 5 shows that if the function $h(a_i^\top x)$ is strongly convex in terms of the linear prediction $a_i^\top x$, then by exponentially sampling the data at each iteration, HSGD converges linearly even though $f(x)$ might be non-strongly convex. Based on Theorem 5, we further derive the IFO complexity of Algorithm 1 for such a special problem, as summarized in Corollary 4 with proof in Appendix C.2.

Corollary 4. Suppose the assumptions in Theorem 5 hold. To achieve $\mathbb{E}[f(x_a) - f(x^*)] \le \epsilon$ for the special problem, the IFO complexity of the proposed algorithm is $O\big(\frac{\ell G^2}{\mu^2 \epsilon}\big)$.

It is interesting to compare Theorem 5 and Corollary 4 with the existing results for SGD. In particular, it was shown by Shamir [13] that $\mathbb{E}[f(x_a) - f(x^*)] \le R_T/T + 2(12 + \sqrt{2}D)/\sqrt{n}$ for SGD, where $R_T$ is the regret bound of SGD on problem (2), at the order of $O(D\ell\sqrt{T})$. This gives a convergence rate of $O(1/\sqrt{T})$ and an IFO complexity of $O(1/\epsilon^2)$. However, there exists an accuracy barrier $O(1/\sqrt{n})$ due to the statistical error term $2(12 + \sqrt{2}D)/\sqrt{n}$, which is an artifact brought by analyzing the regret. In sharp contrast, our result in Theorem 5 guarantees that HSGD converges to the global optimum of problem (2). More importantly, provided that $h(\cdot)$ is strongly convex, HSGD has a superior IFO complexity of $O\big(\frac{1}{\epsilon}\big)$ to the SGD complexity $O\big(\frac{1}{\epsilon^2}\big)$ given in [13].

5 Experiments

We compare HSGD with several state-of-the-art algorithms, including SGD [2], SVRG [5], SAGA [15], AVRG [12] and SCGC [23], under WoRS for all.
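The HSGD procedure being compared can be sketched as follows. This is a minimal re-implementation under our own assumptions (a user-supplied gradient oracle `grad_f_i`, the unconstrained case, and the exponential batch schedule $s_k = \lceil \tau/\zeta^k \rceil$ from the theory); it is an illustration, not the authors' code:

```python
import numpy as np

def hsgd(grad_f_i, x0, n, eta, tau=1.0, zeta=0.9, T=100, seed=0):
    """Sketch of HSGD: constant step-size eta, mini-batch size growing as
    s_k = ceil(tau / zeta**k), with batches drawn without replacement (WoRS)
    from one shuffled ordering of the n component functions."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)                  # WoRS: a single shuffled pass
    pos, x = 0, np.array(x0, dtype=float)
    for k in range(T):
        s_k = min(n - pos, int(np.ceil(tau / zeta ** k)))
        if s_k == 0:                           # the pass over the data is exhausted
            break
        batch, pos = perm[pos:pos + s_k], pos + s_k
        g = sum(grad_f_i(x, i) for i in batch) / s_k   # mini-batch gradient estimate
        x = x - eta * g
    return x
```

For least squares, for instance, one would pass `grad_f_i = lambda x, i: (A[i] @ x - b[i]) * A[i]` and a step-size on the order of $1/(2\ell)$.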
We consider two sets of learning tasks. The first contains two convex problems: $\ell_2$-regularized logistic regression $\min_x \frac{1}{n}\sum_{i=1}^{n}\big[\log(1 + \exp(-b_i a_i^\top x)) + \frac{\lambda}{2}\|x\|_2^2\big]$ and $k$-class softmax regression $\min_x \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{k}\big[\frac{\lambda}{2k}\|x_j\|_2^2 - \mathbf{1}\{b_i = j\}\log\frac{\exp(a_i^\top x_j)}{\sum_{l=1}^{k}\exp(a_i^\top x_l)}\big]$, where $b_i$ is the target output of $a_i$. The other is a non-convex problem of training multi-layer neural networks. We run simulations on 10 datasets (see Appendix D). Hyper-parameters of all the algorithms are tuned to their best.

(a) Logistic regression. From left to right: ijcnn, A09, w08 and rcv11.

(b) Softmax regression. From left to right: protein, satimage, sensorless and mnist.

Figure 2: Single-epoch processing: comparison of randomized algorithms for a single pass over data.

5.1 Convex problems

As the first set of problems are strongly convex, we follow Theorem 2 to exponentially expand the mini-batch size $s_k$ in HSGD with $\tau = 1$. We run FGD until the gradient satisfies $\|\nabla f(x)\|_2 \le 10^{-10}$, and then use its output as the optimal value $f^*$ for sub-optimality estimation in Figures 1 (a), 2 and 3.

Single-epoch processing in well-conditioned problems. We first consider the case where the optimization problem is well-conditioned with strong regularization, such that good results can be obtained after only one epoch of data passes. Single-epoch learning is common in online learning. For the two problems, we respectively set their regularization parameters to $\lambda = 0.01$ and $\lambda = 0.1$.

Figure 2 summarizes the numerical results. On the simulated well-conditioned tasks most algorithms achieve high accuracy after one epoch, while HSGD (WoRS) converges much faster.
This confirms Corollary 1 that HSGD is cheaper in IFO complexity ($O\big(\frac{\kappa^2}{\epsilon}\big)$) than the other considered variance-reduced algorithms ($O\big(n\kappa^2\log\big(\frac{1}{\epsilon}\big)\big)$) when the desired accuracy is moderately low and the data size is large.

Multi-epoch processing in ill-conditioned problems. To solve more challenging problems, a method usually needs multiple cycles of data processing to reach a high-accuracy solution. Thus we develop a practical implementation of HSGD for multi-epoch processing. After visiting all data in one full pass, it continues to increase the mini-batch size, allowing possible with-replacement sampling, until $s_k > n$. After that, HSGD degenerates to standard FGD. But this does not happen in our testing cases, since we set the exponential rate sufficiently small.
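The multi-epoch batch schedule just described can be sketched as a small index generator. This is our reading of the practical implementation under stated assumptions (the function name, the reshuffle-at-epoch-boundary choice, and the $s_k = \lceil \tau/\zeta^k \rceil$ form are ours); reshuffling means later batches may revisit earlier samples, which is the "possible with-replacement sampling" across the run:

```python
import numpy as np

def multi_epoch_batches(n, tau=1.0, zeta=0.99, T=1000, seed=0):
    """Yield mini-batch index sets: WoRS batches from a shuffled pass, a
    reshuffle whenever the pass is used up, a growing size s_k, and the
    full index set (i.e. FGD) once s_k exceeds n."""
    rng = np.random.default_rng(seed)
    perm, pos = rng.permutation(n), 0
    for k in range(T):
        s_k = int(np.ceil(tau / zeta ** k))
        if s_k >= n:                       # degenerates to full gradient descent
            yield np.arange(n)
            continue
        if pos + s_k > n:                  # epoch boundary: reshuffle and restart
            perm, pos = rng.permutation(n), 0
        yield perm[pos:pos + s_k]
        pos += s_k
```

Within each yielded batch the indices are distinct (a contiguous slice of a permutation), which preserves the WoRS character of each individual step.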
To generate more challenging optimization tasks, we reset the regularization strength parameter in softmax regression to $\lambda = 0.001$. Figure 3 shows that HSGD under WoRS outperforms all compared algorithms. These observations align well with those in Figure 2, implying that HSGD has sharper convergence behavior when the sample size $n$ is large and the desired accuracy is moderate. The convergence curves of HSGD also confirm the effectiveness of our practical implementation in continuously decreasing the objective value.

Figure 3: Multi-epoch processing: comparison of randomized algorithms for multiple passes over data (Softmax regression. From left to right: protein, satimage, sensorless and letter).

5.2 Non-convex problems

Here we evaluate HSGD for optimizing a three-layer feedforward neural network with a logistic loss on ijcnn1 and covtype, and a softmax loss on sensorless (see Figure 1 (b)). For both cases we set $\lambda = 0.01$. The network has an architecture of $d$-$30$-$c$, where $d$ and $c$ respectively denote the input and output dimensions and 30 is the number of neurons in the hidden layer. We test two versions of HSGD, namely HSGD-lin and HSGD-exp, with linearly and exponentially increasing mini-batch size from $s_0 = 1$, respectively. We use the same initialization for all algorithms.

From Figure 4, HSGD-exp exhibits similar convergence behavior as above: it decreases the loss very quickly. Comparatively, HSGD-lin outputs more accurate solutions with its linearly increasing batch size, which is consistent with Theorem 4. We note that HSGD-lin behaves differently in Figure 4 (a) and (b).
In Figure 4 (a), it converges relatively slowly at the beginning, whereas in Figure 4 (b) it converges much faster; we attribute this difference to the different characteristics of the data.

Figure 4: Non-convex results: comparison of randomized algorithms on feedforward neural networks. (a) ijcnn1. (b) covtype.

6 Conclusion

We analyzed the rate of convergence of HSGD under WoRS for strongly convex, arbitrarily convex and non-convex problems. We proved that under WoRS, HSGD with constant step-size can match full gradient descent in convergence rate, while maintaining comparable sample-size-independent IFO complexity to SGD. Compared to variance-reduced SGD methods such as SVRG and SAGA, HSGD tends to gain better efficiency and scalability in the setting where the sample size is large while the required optimization accuracy is moderately small. Numerical results confirmed our theoretical findings.

Acknowledgements

Jiashi Feng was partially supported by NUS startup R-263-000-C08-133, MOE Tier-I R-263-000-C21-112, NUS IDS R-263-000-C67-646, ECRA R-263-000-C87-133 and MOE Tier-II R-263-000-D17-112.
Xiao-Tong Yuan was supported in part by the Natural Science Foundation of China (NSFC) under Grant 61522308 and Grant 61876090, and in part by the Tencent AI Lab Rhino-Bird Joint Research Program No. JR201801.

References

[1] M. A. Cauchy. Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes rendus des séances de l'Académie des sciences de Paris, 25:536–538, 1847.

[2] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

[3] D. P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913–926, 1997.

[4] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. J. of Machine Learning Research, 14(Feb):567–599, 2013.

[5] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Proc. Conf. Neural Information Processing Systems, pages 315–323, 2013.

[6] M. P. Friedlander and M. Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.

[7] P. Zhou, X. Yuan, and J. Feng. Efficient stochastic gradient hard thresholding. In Proc. Conf. Neural Information Processing Systems, 2018.

[8] L. Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. In Proc. Symposium on Learning and Data Science, Paris, 2009.

[9] M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. Why random reshuffling beats stochastic gradient descent. arXiv preprint arXiv:1510.08560, 2015.

[10] S. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In Proc. Int'l Conf. Machine Learning, pages 314–323, 2016.

[11] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In Proc. Int'l Conf. Machine Learning, pages 699–707, 2016.

[12] B. Ying, K. Yuan, and A. H. Sayed. Convergence of variance-reduced stochastic learning under random reshuffling. arXiv preprint arXiv:1708.01383, 2017.

[13] O. Shamir. Without-replacement sampling for stochastic gradient methods. In Proc. Conf. Neural Information Processing Systems, pages 46–54, 2016.

[14] B. Recht and C. Ré. Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. In Conf. on Learning Theory, pages 1–11, 2012.

[15] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Proc. Conf. Neural Information Processing Systems, pages 1646–1654, 2014.

[16] N. L. Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Proc. Conf. Neural Information Processing Systems, pages 2663–2671, 2012.

[17] A. Defazio, J. Domke, and T. S. Caetano. Finito: A faster, permutable incremental gradient method for big data problems. In Proc. Int'l Conf. Machine Learning, pages 1125–1133, 2014.

[18] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Proc. Conf. Neural Information Processing Systems, pages 3384–3392, 2015.

[19] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In ACM SIGACT Symposium on Theory of Computing, pages 1200–1205, 2017.

[20] Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proc. Int'l Conf. Machine Learning, pages 353–361, 2015.

[21] Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method. In Proc. Conf. Neural Information Processing Systems, pages 3059–3067, 2014.

[22] J. M. Kohler and A. Lucchi. Sub-sampled cubic regularization for non-convex optimization. In Proc. Int'l Conf. Machine Learning, 2017.

[23] L. Lei and M. Jordan. Less than a single pass: Stochastically controlled stochastic gradient. In Artificial Intelligence and Statistics, pages 148–156, 2017.