{"title": "A Simple Proximal Stochastic Gradient Method for Nonsmooth Nonconvex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 5564, "page_last": 5574, "abstract": "We analyze stochastic gradient algorithms for optimizing nonconvex, nonsmooth finite-sum problems. In particular, the objective function is given by the summation of a differentiable (possibly nonconvex) component, together with a possibly non-differentiable but convex component.\nWe propose a proximal stochastic gradient algorithm based on variance reduction, called ProxSVRG+.\nOur main contribution lies in the analysis of ProxSVRG+.\nIt recovers several existing convergence results and improves/generalizes them (in terms of the number of stochastic gradient oracle calls and proximal oracle calls).\nIn particular, ProxSVRG+ generalizes the best results given by the SCSG algorithm, recently proposed by [Lei et al., NIPS'17] for the smooth nonconvex case.\nProxSVRG+ is also more straightforward than SCSG and yields simpler analysis.\nMoreover, ProxSVRG+ outperforms the deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, which partially solves an open problem proposed in [Reddi et al., NIPS'16].\nAlso, ProxSVRG+ uses much less proximal oracle calls than ProxSVRG [Reddi et al., NIPS'16].\nMoreover, for nonconvex functions satisfied Polyak-\\L{}ojasiewicz condition, we prove that ProxSVRG+ achieves a global linear convergence rate without restart unlike ProxSVRG.\nThus, it can \\emph{automatically} switch to the faster linear convergence in some regions as long as the objective function satisfies the PL condition locally in these regions.\nFinally, we conduct several experiments and the experimental results are consistent with the theoretical results.", "full_text": "A Simple Proximal Stochastic Gradient Method for\n\nNonsmooth Nonconvex Optimization\n\nZhize Li\n\nIIIS, Tsinghua University\n\nJian Li\n\nIIIS, Tsinghua 
University

zz-li14@mails.tsinghua.edu.cn

lijian83@mail.tsinghua.edu.cn

Abstract

We analyze stochastic gradient algorithms for optimizing nonconvex, nonsmooth finite-sum problems. In particular, the objective function is given by the summation of a differentiable (possibly nonconvex) component and a possibly non-differentiable but convex component. We propose a proximal stochastic gradient algorithm based on variance reduction, called ProxSVRG+. Our main contribution lies in the analysis of ProxSVRG+. It recovers several existing convergence results and improves/generalizes them (in terms of the number of stochastic gradient oracle calls and proximal oracle calls). In particular, ProxSVRG+ generalizes the best results given by the SCSG algorithm, recently proposed by [Lei et al., 2017] for the smooth nonconvex case. ProxSVRG+ is also more straightforward than SCSG and yields a simpler analysis. Moreover, ProxSVRG+ outperforms the deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, which partially solves an open problem posed in [Reddi et al., 2016]. ProxSVRG+ also makes far fewer proximal oracle calls than ProxSVRG [Reddi et al., 2016]. Moreover, for nonconvex functions satisfying the Polyak-Łojasiewicz (PL) condition, we prove that ProxSVRG+ achieves a global linear convergence rate without restart, unlike ProxSVRG. Thus, it can automatically switch to the faster linear convergence in some regions as long as the objective function satisfies the PL condition locally in these regions.
Finally, we conduct several experiments, and the experimental results are consistent with the theoretical results.

1 Introduction

In this paper, we consider nonsmooth nonconvex finite-sum optimization problems of the form

min_x Φ(x) := f(x) + h(x),   (1)

where f(x) := (1/n)∑_{i=1}^n f_i(x) and each f_i(x) is possibly nonconvex with a Lipschitz continuous gradient, while h(x) is nonsmooth but convex (e.g., the l1 norm ‖x‖_1 or the indicator function I_C(x) of some convex set C). We assume that the proximal operator of h(x) can be computed efficiently. The above optimization problem is fundamental to many machine learning problems, ranging from convex optimization such as Lasso and SVM to highly nonconvex problems such as optimizing deep neural networks. There has been extensive research for the case where f(x) is convex (see e.g., [25, 7, 15, 1]). In particular, if the f_i's are strongly convex, Xiao and Zhang [25] proposed the Prox-SVRG algorithm, which achieves a linear convergence rate, based on the well-known variance reduction technique SVRG developed in [12]. In recent years, due to the increasing popularity of deep learning, the nonconvex case has attracted significant attention. See e.g., [9, 3, 23, 17] for results on the smooth nonconvex case (i.e., h(x) ≡ 0). Very recently, Zhou et al. [27] proposed an algorithm with stochastic gradient complexity Õ(1/ε^{3/2} ∧ n^{1/2}/ε), improving the previous results O(1/ε^{5/3}) [17] and O(n^{2/3}/ε) [3]. For the more general nonsmooth nonconvex case, the research is still somewhat limited.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Recently, for the nonsmooth nonconvex case, Reddi et al. [24] provided two algorithms called ProxSVRG and ProxSAGA, which are based on the well-known variance reduction techniques SVRG and SAGA [12, 7].
Also, we would like to mention that Aravkin and Davis [5] considered the case where h can be nonconvex, in the more general context of robust optimization. Before that, Ghadimi et al. [10] analyzed the deterministic proximal gradient method (i.e., computing the full gradient in every iteration) for nonconvex nonsmooth problems; here we denote it as ProxGD. Ghadimi et al. [10] also considered the stochastic case (here we denote it as ProxSGD). However, ProxSGD requires the batch sizes to be large (i.e., Ω(1/ε)) or to increase with the iteration number t. Note that ProxSGD may reduce to the deterministic ProxGD after some iterations due to the increasing batch sizes. From the perspectives of both computational efficiency and statistical generalization, always computing the full gradient (GD or ProxGD) may not be desirable for large-scale machine learning problems. A reasonable minibatch size is also desirable in practice, since the computation of minibatch stochastic gradients can be implemented in parallel. In fact, practitioners typically use moderate minibatch sizes, often ranging from 16 or 32 to a few hundred (sometimes a few thousand, see e.g., [11]).¹ Hence, it is important to study convergence in the moderate and constant minibatch size regime.
Reddi et al. [24] provided the first non-asymptotic convergence rates for ProxSVRG with minibatch size at most O(n^{2/3}) for nonsmooth nonconvex problems. However, their convergence bounds (using constant or moderate-size minibatches) are worse than those of the deterministic ProxGD in terms of the number of proximal oracle calls. Their algorithms (i.e., ProxSVRG/SAGA) outperform ProxGD only if they use a quite large minibatch size b = O(n^{2/3}). Note that in a typical application the number of training data points is n = 10^6 ~ 10^9, so n^{2/3} = 10^4 ~ 10^6; hence O(n^{2/3}) is a quite large minibatch size.
Finally, they posed an important open problem: developing stochastic methods with provably better performance than ProxGD with constant minibatch size.
Our Contribution: In this paper, we propose a very straightforward algorithm called ProxSVRG+ to solve the nonsmooth nonconvex problem (1). Our main technical contribution lies in the new convergence analysis of ProxSVRG+, which differs notably from that of ProxSVRG [24]. We list our results in Tables 1–3 and Figures 1–2. Our convergence results are stated in terms of the number of stochastic first-order oracle (SFO) calls and proximal oracle (PO) calls (see Definition 2). We would like to highlight the following results yielded by our new analysis:
1) ProxSVRG+ is √b (resp. √b·εn) times faster than ProxGD in terms of #SFO when b ≤ n^{2/3} (resp. b ≤ 1/ε^{2/3}), and n/b times faster than ProxGD when b > n^{2/3} (resp. b > 1/ε^{2/3}). Note that #PO = O(1/ε) for both ProxSVRG+ and ProxGD. Obviously, for any super-constant b, ProxSVRG+ is strictly better than ProxGD. Hence, we partially answer the open question (i.e., developing stochastic methods with provably better performance than ProxGD with constant minibatch size b) posed in [24]. Also, ProxSVRG+ matches the best result achieved by ProxSVRG at b = n^{2/3}, and ProxSVRG+ is strictly better for smaller b (using fewer PO calls). See Figure 1 for an overview.
2) Assuming that the variance of the stochastic gradient is bounded, i.e., in the online/stochastic setting, ProxSVRG+ generalizes the best result achieved by SCSG, recently proposed by Lei et al. [17] for the smooth nonconvex case, i.e., h(x) ≡ 0 in (1) (see Table 1, the 5th row). ProxSVRG+ is more straightforward than SCSG and yields a simpler proof. Our results also match the results of Natasha1.5, proposed by Allen-Zhu [2] very recently, in terms of #SFO if there is no additional assumption (see Footnote 2 for details).
In terms of #PO, our algorithm outperforms Natasha1.5. We also note that SCSG [17] and ProxSVRG [24] achieve their best convergence results with b = 1 and b = n^{2/3}, respectively, while ProxSVRG+ achieves its best result with b = 1/ε^{2/3} (see Figure 1), which is a moderate minibatch size (not too small for parallelism and not too large for good generalization). In our experiments, the best b for ProxSVRG and ProxSVRG+ on MNIST is 4096 and 256, respectively (see the second row of Figure 4).
3) For nonconvex functions satisfying the Polyak-Łojasiewicz (PL) condition [22], we prove that ProxSVRG+ achieves a global linear convergence rate without restart, while Reddi et al. [24] used PL-SVRG to restart ProxSVRG O(log(1/ε)) times to obtain a linear convergence rate. Moreover, ProxSVRG+ also improves upon ProxGD and ProxSVRG/SAGA, and generalizes the results of SCSG in this case (see Table 3). Also see the remarks after Theorem 2 for more details.
¹In fact, some studies have argued that smaller minibatch sizes in SGD are very useful for generalization (e.g., [14]).
Although generalization is not the focus of the present paper, it provides further motivation for studying the moderate minibatch size regime.

Table 1: Comparison of the SFO and PO complexity. The ∧ denotes the minimum and b denotes the minibatch size. The definitions of SFO and PO are given in Definition 2; σ (in the last column) is defined in Assumption 1.
- ProxGD [10] (full gradient): SFO O(n/ε); PO O(1/ε); no additional condition.
- ProxSGD [10]: SFO O(b/ε); PO O(1/ε); condition σ = O(1), b ≥ 1/ε.
- ProxSVRG/SAGA [24]: SFO O(n/(√b·ε) + n); PO O(n/(ε·b^{3/2})); condition b ≤ n^{2/3}.
- SCSG [17] (smooth nonconvex, i.e., h(x) ≡ 0 in (1)): SFO O((b^{1/3}/ε)·(n ∧ 1/ε)^{2/3} + b/ε); PO NA; condition σ = O(1).
- Natasha1.5 [2]: SFO O(1/ε^{5/3});² PO O(1/ε^{5/3}); condition σ = O(1).
- ProxSVRG+ (this paper): SFO O(n/(√b·ε) + b/ε); PO O(1/ε); no additional condition.
- ProxSVRG+ (this paper): SFO O((n ∧ 1/ε)/(√b·ε) + b/ε); PO O(1/ε); condition σ = O(1).

Table 2: Some recommended minibatch sizes b for ProxSVRG+
- b = 1: SFO O(n/ε); PO O(1/ε); no condition; same as ProxGD.
- b = 1: SFO O(1/ε²); PO O(1/ε); condition σ = O(1); same as ProxSGD.
- b = 1/ε^{2/3}: SFO O(n/ε^{2/3} + 1/ε^{5/3}); PO O(1/ε); no condition; better than ProxGD, does not need σ = O(1).
- b = 1/ε^{2/3}: SFO O(1/ε^{5/3}); PO O(1/ε); condition σ = O(1), n > 1/ε; better than ProxGD and ProxSVRG/SAGA, same as SCSG (in SFO).
- b = n^{2/3}: SFO O(n^{2/3}/ε); PO O(1/ε); no condition; same as ProxSVRG/SAGA.
- b = n: SFO O(n/ε); PO O(1/ε); no condition; same as ProxGD.

Figure 1: SFO complexity in terms of minibatch b.³ Figure 2: PO complexity in terms of minibatch b.

²Natasha1.5 uses an additional parameter, called the strongly nonconvex parameter L̃ (L̃ ≤ L), and #SFO in [2] is O(1/ε^{3/2} + L̃^{1/3}/ε^{5/3}). If L̃ is much smaller than L, the bound is better. Without any additional assumption, the default value of L̃ is L. The result listed in the table is the L̃ = L case. Besides, one can verify that #PO of Natasha1.5 is the same as its #SFO.
³Note that the curve of ProxSGD overlaps with ProxSVRG+ for b ≥ 1/ε, and the curve of ProxSVRG/SAGA overlaps with ProxSVRG+ for b ≤ n^{2/3} in Figure 1.
We did not plot Natasha1.5 since it did not consider the minibatch case, i.e., b ≡ 1 in Natasha1.5.

2 Preliminaries

We assume that each f_i(x) in (1) has an L-Lipschitz continuous gradient for all i ∈ [n], i.e., there is a constant L such that

‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖,   (2)

where ‖·‖ denotes the Euclidean norm ‖·‖_2. Note that f_i(x) does not need to be convex. We also assume that the nonsmooth convex function h(x) in (1) is well structured, i.e., the following proximal operator on h can be computed efficiently:

prox_{ηh}(x) := argmin_{y∈R^d} ( h(y) + (1/(2η))‖y − x‖² ).   (3)

For convex problems, one typically uses the optimality gap Φ(x) − Φ(x*) as the convergence criterion (see e.g., [21]). But for general nonconvex problems, one typically uses the gradient norm as the convergence criterion. E.g., for smooth nonconvex problems (i.e., h(x) ≡ 0), Ghadimi and Lan [9], Reddi et al. [23] and Lei et al. [17] used ‖∇Φ(x)‖² (i.e., ‖∇f(x)‖²) to measure the convergence results.
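For instance, the two examples of h mentioned in the introduction (the l1 norm and the indicator function of a convex set) admit cheap closed-form proximal operators. A minimal NumPy sketch (function names are ours, not from the paper):

```python
import numpy as np

# prox of h(x) = lam * ||x||_1 with step eta: entrywise soft-thresholding.
def prox_l1(x, eta, lam=1.0):
    return np.sign(x) * np.maximum(np.abs(x) - eta * lam, 0.0)

# prox of the indicator I_C for C = {x : ||x||_2 <= 1}: the Euclidean
# projection onto the unit ball (independent of eta).
def prox_ball(x, eta=None):
    nrm = np.linalg.norm(x)
    return x if nrm <= 1.0 else x / nrm
```

For example, `prox_l1` shrinks every coordinate toward zero by eta·lam, and `prox_ball` leaves points inside the ball untouched while radially rescaling points outside it.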
In order to analyze the convergence results for nonsmooth nonconvex problems, we need to define the gradient mapping as follows (as in [10, 24]):

G_η(x) := (1/η)( x − prox_{ηh}(x − η∇f(x)) ).   (4)

We often use an equivalent but useful form of prox_{ηh}(x − η∇f(x)) as follows:

prox_{ηh}(x − η∇f(x)) = argmin_{y∈R^d} ( h(y) + (1/(2η))‖y − x‖² + ⟨∇f(x), y⟩ ).   (5)

Note that if h(x) is a constant function (in particular, zero), this gradient mapping reduces to the ordinary gradient: G_η(x) = ∇Φ(x) = ∇f(x). In this paper, we use the gradient mapping G_η(x) as the convergence criterion (same as [10, 24]).

Definition 1 x̂ is called an ε-accurate solution for problem (1) if E[‖G_η(x̂)‖²] ≤ ε, where x̂ denotes the point returned by a stochastic algorithm.

Note that the metric G_η(x) already normalizes the step size η, i.e., it is independent of the particular algorithm. It is also indeed a convergence metric for Φ(x) = f(x) + h(x). Let x⁺ := prox_{ηh}(x − η∇f(x)). If ‖G_η(x)‖ = (1/η)‖x − x⁺‖ = ‖∇f(x) + ∂h(x⁺)‖ ≤ ε, then

‖∂Φ(x⁺)‖ = ‖∇f(x⁺) + ∂h(x⁺)‖ ≤ L‖x − x⁺‖ + ‖∇f(x) + ∂h(x⁺)‖ ≤ Lηε + ε = O(ε).

Thus the next iteration point x⁺ is an ε-approximate stationary solution for the objective function Φ(x) = f(x) + h(x).
To measure the efficiency of a stochastic algorithm, we use the following oracle complexities.

Definition 2 (1) Stochastic first-order oracle (SFO): given a point x, the SFO outputs a stochastic gradient ∇f_i(x) such that E_{i∼[n]}[∇f_i(x)] = ∇f(x). (2) Proximal oracle (PO): given a point x, the PO outputs the result of the proximal projection prox_{ηh}(x) (see (3)).

Sometimes, the following assumption on the variance of the stochastic gradients is needed (see the last column "additional condition" in Table 1). Such an assumption is necessary if one wants the convergence result to be independent of n. This case is also known as the online/stochastic setting, in which the full gradient is not available (see e.g., [2, 16]).

Assumption 1 For all x, E[‖∇f_i(x) − ∇f(x)‖²] ≤ σ², where σ > 0 is a constant and ∇f_i(x) is a stochastic gradient.

3 Nonconvex ProxSVRG+ Algorithm

In this section, we propose a proximal stochastic gradient algorithm called ProxSVRG+, which is very straightforward (similar to nonconvex ProxSVRG [24] and convex Prox-SVRG [25]). The details are described in Algorithm 1.
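To make the update concrete, here is a minimal NumPy sketch of the ProxSVRG+ loop of Algorithm 1; `grad_batch` and `prox` are problem-supplied callables, and all names are ours:

```python
import numpy as np

def prox_svrg_plus(grad_batch, prox, x0, n, B, b, m, S, eta, seed=0):
    """Sketch of ProxSVRG+: grad_batch(x, idx) averages the stochastic
    gradients over index set idx; prox(x, eta) computes prox_{eta*h}(x)."""
    rng = np.random.default_rng(seed)
    x_tilde = np.asarray(x0, dtype=float)
    for s in range(S):
        x = x_tilde.copy()
        IB = rng.choice(n, size=min(B, n), replace=False)
        g = grad_batch(x_tilde, IB)            # snapshot (batch) gradient
        for t in range(m):
            Ib = rng.choice(n, size=b)
            # variance-reduced gradient estimator
            v = grad_batch(x, Ib) - grad_batch(x_tilde, Ib) + g
            x = prox(x - eta * v, eta)         # one PO call per iteration
        x_tilde = x                            # epoch ends at its last iterate
    # The general analysis outputs a uniformly random iterate; under the
    # PL condition (Section 5) the final iterate suffices, as returned here.
    return x_tilde
```

For example, with f_i(x) = ½‖x − a_i‖², h ≡ 0 (so `prox` is the identity) and B = n, the iterates converge to the mean of the a_i.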
We call B the batch size and b the minibatch size.

Algorithm 1 Nonconvex ProxSVRG+
Input: initial point x0, batch size B, minibatch size b, epoch length m, step size η
1: x̃⁰ = x0
2: for s = 1, 2, ..., S do
3:   x^s_0 = x̃^{s−1}
4:   g^s = (1/B) ∑_{j∈I_B} ∇f_j(x̃^{s−1})⁴
5:   for t = 1, 2, ..., m do
6:     v^s_{t−1} = (1/b) ∑_{i∈I_b} ( ∇f_i(x^s_{t−1}) − ∇f_i(x̃^{s−1}) ) + g^s
7:     x^s_t = prox_{ηh}(x^s_{t−1} − η v^s_{t−1})  (call PO once)
8:   end for
9:   x̃^s = x^s_m
10: end for
Output: x̂ chosen uniformly from {x^s_{t−1}}_{t∈[m], s∈[S]}

Compared with Prox-SVRG [25], which analyzed only convex functions, ProxSVRG [24] analyzed nonconvex functions. The major difference of our ProxSVRG+ is that we avoid the computation of the full gradient at the beginning of each epoch, i.e., B need not equal n (see Line 4 of Algorithm 1), while ProxSVRG and Prox-SVRG used B = n. Note that even if we choose B = n, our analysis is stronger than that of ProxSVRG [24]. Also, our ProxSVRG+ shows that the "stochastically controlled" trick of SCSG [17] (i.e., making the length of each epoch a geometrically distributed random variable) is not really necessary for achieving the desired bound.⁵ As a result, our straightforward ProxSVRG+ generalizes the result of SCSG to the more general nonsmooth nonconvex case and yields a simpler analysis.

4 Convergence Results

Now, we present the main theorem for our ProxSVRG+, which corresponds to the last two rows in Table 1, and give some remarks.

Theorem 1 Let step size η = 1/(6L) and epoch length m = √b, where b denotes the minibatch size. Then x̂ returned by Algorithm 1 is an ε-accurate solution for problem (1) (i.e., E[‖G_η(x̂)‖²] ≤ ε).
We distinguish the following two cases:

1) We let batch size B = n. The number of SFO calls is at most

36L(Φ(x0) − Φ(x*)) ( B/(√b·ε) + b/ε ) = O( n/(√b·ε) + b/ε ).

2) Under Assumption 1, we let batch size B = min{6σ²/ε, n}. The number of SFO calls is at most

36L(Φ(x0) − Φ(x*)) ( B/(√b·ε) + b/ε ) = O( (n ∧ 1/ε)/(√b·ε) + b/ε ),

where ∧ denotes the minimum.

In both cases, the number of PO calls equals the total number of iterations T, which is at most

(36L/ε)(Φ(x0) − Φ(x*)) = O(1/ε).

Remark: The proof of Theorem 1 is notably different from that of ProxSVRG [24]. Reddi et al. [24] used a Lyapunov function R^{s+1}_t = Φ(x^{s+1}_t) + c_t‖x^{s+1}_t − x̃^s‖² and showed that R^s decreases by the accumulated gradient mapping ∑_{t=1}^m ‖G_η(x^s_t)‖² in epoch s. In our proof, we directly show that Φ(x^s) decreases by ∑_{t=1}^m ‖G_η(x^s_t)‖², using a different analysis.

⁴If B = n, ProxSVRG+ is almost the same as ProxSVRG (i.e., g^s = (1/n)∑_{j=1}^n ∇f_j(x̃^{s−1}) = ∇f(x̃^{s−1})) except for some detailed parameter settings (e.g., step size, epoch length).
⁵A similar observation was also made in Natasha1.5 [2]. However, Natasha1.5 divides each epoch into multiple sub-epochs and randomly chooses the iterate at the end of each sub-epoch. In our ProxSVRG+, the length of an epoch is deterministic and it directly uses the last iterate at the end of each epoch.
This is made possible by tightening the inequalities using Young's inequality and Lemma 2 (which gives the relation between the variance of the stochastic gradient estimator and the inner product of the gradient difference and the point difference). Also, our convergence result holds for any minibatch size b ∈ [1, n], unlike ProxSVRG, which requires b ≤ n^{2/3} (see Fig. 1). Moreover, ProxSVRG+ makes far fewer proximal oracle calls than ProxSVRG (see Fig. 2).
For the online/stochastic Case 2), we avoid the computation of the full gradient at the beginning of each epoch, i.e., B ≠ n. We then use an idea similar to SCSG [17] to bound the variance term, but we do not need the "stochastically controlled" trick of SCSG (as discussed in Section 3) to achieve the desired convergence bound, which yields a much simpler analysis for our ProxSVRG+.
We defer the proof of Theorem 1 to Appendix A.1. Also, similar convergence results for other choices of the epoch length m ≠ √b are provided in Appendix A.2.

5 Convergence Under PL Condition

In this section, we provide a global linear convergence rate for nonconvex functions under the Polyak-Łojasiewicz (PL) condition [22]. The original form of the PL condition is

∃µ > 0 such that ‖∇f(x)‖² ≥ 2µ(f(x) − f*), ∀x,   (6)

where f* denotes the (global) optimal function value. It is worth noting that f satisfies the PL condition when f is µ-strongly convex. Moreover, Karimi et al. [13] showed that the PL condition is weaker than many other conditions (e.g., strong convexity (SC), restricted strong convexity (RSC) and weak strong convexity (WSC) [20]). Also, if f is convex, the PL condition is equivalent to the error bounds (EB) and quadratic growth (QG) conditions [19, 4].
Note that the PL condition implies that every stationary point is a global minimum, but unlike strong convexity it does not imply that the minimum is unique.
Due to the nonsmooth term h(x) in problem (1), we use the gradient mapping (see (4)) to define a more general form of the PL condition as follows:

∃µ > 0 such that ‖G_η(x)‖² ≥ 2µ(Φ(x) − Φ*), ∀x.   (7)

Recall that if h(x) is a constant function, the gradient mapping reduces to G_η(x) = ∇Φ(x) = ∇f(x). Our PL condition is different from the one used in [13, 24]; see Remark (3) after Theorem 2.
Further Motivation: In many cases, although the loss function is generally nonconvex, the local region near a local minimum may satisfy the PL condition. In fact, some recent studies show strong convexity in the neighborhood of the ground-truth solution in some simple neural networks [26, 8]. Such results provide further motivation for studying the PL condition. Moreover, we argue that our ProxSVRG+ is particularly desirable in this case, since it first converges sublinearly at rate O(1/ε) (according to Theorem 1) and then automatically converges linearly at rate O(log(1/ε)) (according to Theorem 2) in the regions where the loss function satisfies the PL condition locally.
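As a concrete illustration of (6), the scalar function f(x) = x² + 3 sin²(x) used by Karimi et al. [13] is nonconvex (its second derivative 2 + 6 cos(2x) changes sign) yet satisfies the PL condition with µ = 1/32 and f* = 0, attained only at x = 0. A quick numeric sanity check (the grid range is our choice):

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 200001)
x = x[np.abs(x) > 1e-6]            # exclude the minimizer x* = 0 (0/0 ratio)
f = x**2 + 3.0 * np.sin(x)**2      # f* = 0
grad = 2.0 * x + 3.0 * np.sin(2.0 * x)
# PL: ||f'(x)||^2 >= 2*mu*(f(x) - f*), i.e. the ratio below stays >= mu
mu_hat = (grad**2 / (2.0 * f)).min()
assert mu_hat >= 1.0 / 32.0        # PL holds on this grid with mu = 1/32
```

On this grid the smallest ratio is well above 1/32, while the function is clearly not convex, which is exactly the gap between PL and strong convexity discussed above.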
We list the convergence results in Table 3 (also see the remarks after Theorem 2).

Table 3: Convergence under the PL condition with parameter µ. The notation ∧ denotes the minimum. Similar to Table 2, ProxSVRG+ is better than ProxGD and ProxSVRG/SAGA, and generalizes SCSG by choosing different minibatch sizes b.
- ProxGD [13] (full gradient): SFO O((n/µ)·log(1/ε)); PO O((1/µ)·log(1/ε)); no additional condition.
- ProxSVRG/SAGA [24]: SFO O((n/(√b·µ) + n)·log(1/ε)); PO O((n/(µ·b^{3/2}))·log(1/ε)); condition b ≤ n^{2/3}.
- SCSG [17] (smooth nonconvex, i.e., h(x) ≡ 0): SFO O(((b^{1/3}/µ)·(n ∧ 1/(µε))^{2/3} + (n ∧ 1/(µε)))·log(1/ε)); PO NA; condition σ = O(1).
- ProxSVRG+ (this paper): SFO O((n/(√b·µ) + b/µ)·log(1/ε)); PO O((1/µ)·log(1/ε)); no additional condition.
- ProxSVRG+ (this paper): SFO O(((n ∧ 1/(µε))/(√b·µ) + b/µ)·log(1/ε)); PO O((1/µ)·log(1/ε)); condition σ = O(1).

Similar to Theorem 1, we provide the convergence result of ProxSVRG+ (Algorithm 1) under the PL condition in the following Theorem 2. Note that under the PL condition (i.e., when (7) holds), ProxSVRG+ can directly use the final iterate x̃^S as the output point instead of the randomly chosen x̂. Similar to [24], we assume the condition number satisfies L/µ > √n for simplicity. Otherwise, one can choose a different step size η, similar to the treatment of other choices of the epoch length m (see Appendix A.2).

Theorem 2 Let step size η = 1/(6L) and let b denote the minibatch size.
Then the final iterate x̃^S of Algorithm 1 satisfies E[Φ(x̃^S) − Φ*] ≤ ε under the PL condition. We distinguish the following two cases:

1) We let batch size B = n. The number of SFO calls is bounded by

O( (n/(√b·µ) + b/µ)·log(1/ε) ).

2) Under Assumption 1, we let batch size B = min{6σ²/(µε), n}. The number of SFO calls is bounded by

O( ((n ∧ 1/(µε))/(√b·µ) + b/µ)·log(1/ε) ),

where ∧ denotes the minimum.

In both cases, the number of PO calls equals the total number of iterations T, which is bounded by

O( (1/µ)·log(1/ε) ).

Remark:
(1) We show, via a nontrivial proof, that ProxSVRG+ directly obtains a global linear convergence rate without restart. Note that Reddi et al. [24] used PL-SVRG/SAGA to restart ProxSVRG/SAGA O(log(1/ε)) times to obtain the linear convergence rate under the PL condition. Moreover, similar to Table 2, if we choose b = 1 or b = n for ProxSVRG+, its convergence result is O((n/µ)·log(1/ε)), the same as ProxGD [13]. If we choose b = n^{2/3}, the convergence result is O((n^{2/3}/µ)·log(1/ε)), the same as the best result achieved by ProxSVRG/SAGA [24]. If we choose b = 1/(µε)^{2/3} (assuming 1/(µε) < n), its convergence result is O((1/(µ^{5/3}·ε^{2/3}))·log(1/ε)), which generalizes the best result of SCSG [17] to the more general nonsmooth nonconvex case and is better than ProxGD and ProxSVRG/SAGA.
Also note that our ProxSVRG+ makes far fewer proximal oracle calls than ProxSVRG/SAGA if b < n^{2/3}.

(2) Another benefit of ProxSVRG+ is that it can automatically switch to the faster linear convergence rate in regions where the loss function satisfies the PL condition locally. This is impossible for ProxSVRG [24], since it needs to be restarted many times.

(3) We want to point out that [13, 24] used the following form of the PL condition:

∃µ > 0 such that D_h(x, α) ≥ 2µ(Φ(x) − Φ*), ∀x,   (8)

where D_h(x, α) := −2α·min_y { ⟨∇f(x), y − x⟩ + (α/2)‖y − x‖² + h(y) − h(x) }. Our PL condition is arguably more natural. In fact, one can show that if α = 1/η, our new PL condition (7) implies (8). For a direct comparison with prior results, we also provide a proof of the same result as Theorem 2 using the previous PL condition (8) in the appendix.

The proofs of Theorem 2 under the PL forms (7) and (8) are provided in Appendix B.1 and B.2, respectively. Recently, Csiba and Richtárik [6] proposed a novel weakly PL condition. The (strong) PL condition (7) or (8) serves as a generalization of strong convexity, as discussed at the beginning of this section, and one can achieve linear convergence under (7) or (8). The weakly PL condition [6], however, may be considered a generalization of (weak) convexity. Although one achieves only sublinear convergence under this condition, it would still be interesting to establish similar (sublinear) convergence (for ProxSVRG+, ProxSVRG, etc.) under their weakly PL condition.

6 Experiments

In this section, we present the experimental results. We compare the nonconvex ProxSVRG+ with nonconvex ProxGD, ProxSGD [10] and ProxSVRG [24].
We conduct the experiments using the non-negative principal component analysis (NN-PCA) problem (same as [24]). In general, NN-PCA is NP-hard. Specifically, the optimization problem for a given set of samples {z_i}_{i=1}^n is:

min_{‖x‖≤1, x≥0}  −(1/2)·x^T( ∑_{i=1}^n z_i z_i^T )x.   (9)

Note that (9) can be written in the form (1), where f(x) = ∑_{i=1}^n f_i(x) = ∑_{i=1}^n −(1/2)(x^T z_i)² and h(x) = I_C(x) with C = {x ∈ R^d : ‖x‖ ≤ 1, x ≥ 0}. We conduct the experiments on the standard MNIST and 'a9a' datasets.⁶ The experimental results on the two datasets (corresponding to the first and second rows in Figures 3–5) are almost the same.
The samples from each dataset are normalized, i.e., ‖z_i‖ = 1 for all i ∈ [n]. The parameters of the algorithms are chosen as follows: L can be precomputed from the data samples {z_i}_{i=1}^n in the same way as in [18]. The step sizes η for the different algorithms are set to the values used in their convergence results: for ProxGD, η = 1/L (see Corollary 1 in [10]); for ProxSGD, η = 1/(2L) (see Corollary 3 in [10]); for ProxSVRG, η = b^{3/2}/(3Ln) (see Theorem 6 in [24]); and for our ProxSVRG+, η = 1/(6L) (see our Theorem 1). We did not further tune the step sizes. The batch size B (in Line 4 of Algorithm 1) is set to n/5 (i.e., 20% of the data samples). We also tried B = n/10; the relative performance of the algorithms is similar to the case B = n/5. In practice, one can tune the step size η and the parameter B.
To compare these algorithms, we use the number of SFO calls (see Definition 2) to evaluate them, since their numbers of PO calls are the same except for ProxSVRG (as already clearly demonstrated by Figure 2).
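For NN-PCA, both oracles in Definition 2 are cheap to implement. A sketch (here we assume, as is standard for the intersection of the nonnegative orthant with a centered ball, that the projection can be computed by clipping negatives and then rescaling into the unit ball, since the ball projection is a radial scaling that preserves nonnegativity; the function names are ours):

```python
import numpy as np

def grad_batch(x, Z, idx):
    """Average of the stochastic gradients -(x^T z_i) z_i over idx."""
    Zi = Z[idx]                        # rows are (normalized) samples z_i
    return -Zi.T @ (Zi @ x) / len(idx)

def proj_C(x):
    """Projection onto C = {x : ||x|| <= 1, x >= 0}, i.e. the PO for (9)."""
    y = np.maximum(x, 0.0)             # project onto the nonnegative orthant
    nrm = np.linalg.norm(y)
    return y if nrm <= 1.0 else y / nrm
```

Since h is an indicator function here, the prox step does not depend on the step size η, and one PO call is just one such projection.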
Note that we amortize the batch size (n or B in Line 4 of Algorithm 1) over the inner loops so that the curves in the figures are smoother, i.e., the number of SFO calls for each iteration (inner loop) of ProxSVRG and ProxSVRG+ is counted as b + n/m and b + B/m, respectively. Note that ProxGD uses n SFO calls (a full gradient) in each iteration.

Figure 3: Comparison among the algorithms with different minibatch sizes b. [Panels: function value vs. #SFO/n on a9a (first row) and MNIST (second row) for b = 4, 64, 256, comparing ProxGD, ProxSGD, ProxSVRG, and ProxSVRG+.]

In Figure 3, we compare the performance of these four algorithms as we vary the minibatch size b. In particular, the first column (b = 4) shows that ProxSVRG+ and ProxSVRG perform similarly to ProxSGD and ProxGD, respectively, which is quite consistent with the theoretical results (Figure 1). Then, ProxSVRG+ and ProxSVRG both improve as b increases. Note that our ProxSVRG+ performs better than ProxGD, ProxSGD and ProxSVRG.

6The datasets can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Figure 4: ProxSVRG+ and ProxSVRG under different b. [Panels: function value vs. #SFO/n on a9a and MNIST for b ranging from 1 to 16384.]

Figure 5: Under the best b.

Figure 4 demonstrates that our ProxSVRG+ prefers smaller minibatch sizes than ProxSVRG (see the curves with dots).
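The amortized SFO accounting described above can be sketched as follows (a small illustrative helper with hypothetical names, not from the paper's code): each inner iteration is charged its minibatch gradient plus its share of the outer batch gradient.

```python
def amortized_sfo_per_inner_iter(b, batch, m):
    # ProxSVRG charges b + n/m per inner iteration (batch = n);
    # ProxSVRG+ charges b + B/m per inner iteration (batch = B).
    return b + batch / m

def sfo_axis_value(t, b, batch, m, n):
    # x-axis value #SFO/n after t inner iterations
    return t * amortized_sfo_per_inner_iter(b, batch, m) / n
```

For instance, with n = 1000, B = n/5 = 200, b = 4, and m = 50 inner iterations, each inner iteration of ProxSVRG+ is charged 4 + 200/50 = 8 amortized SFO calls.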
Then, in Figure 5, we compare the algorithms under their corresponding best minibatch sizes b.

In conclusion, the experimental results are quite consistent with the theoretical results, i.e., different algorithms favor different minibatch sizes (see Figure 1). Concretely, our ProxSVRG+ achieves its best performance with a moderate minibatch size b = 256, unlike ProxSVRG with b = 2048 or 4096. Moreover, comparing the second and last columns of Figure 3 shows that b = 64 is already good enough for ProxSVRG+, whereas ProxSVRG is only as good as ProxSGD with such a minibatch size. ProxSVRG+ also uses far fewer proximal oracle calls than ProxSVRG if $b < n^{2/3}$ (see Figure 2). Note that a small minibatch size also usually provides better generalization in practice. Thus, we argue that our ProxSVRG+ may be more attractive in certain applications due to its moderate minibatch size.

7 Conclusion

In this paper, we propose a simple proximal stochastic method called ProxSVRG+ for nonsmooth nonconvex optimization. We prove that ProxSVRG+ improves/generalizes several well-known convergence results (e.g., for ProxGD, ProxSGD, ProxSVRG/SAGA and SCSG) by choosing proper minibatch sizes. In particular, ProxSVRG+ is $\sqrt{b}$ (or $\sqrt{b\epsilon n}$ if $n > 1/\epsilon$) times faster than ProxGD, which partially answers the open problem (i.e., developing stochastic methods with provably better performance than ProxGD with a constant minibatch size b) proposed in [24]. Also, ProxSVRG+ generalizes the results of SCSG [17] to this nonsmooth nonconvex case, and it is more straightforward than SCSG and yields a simpler analysis. Moreover, for nonconvex functions satisfying the Polyak-Łojasiewicz condition, we prove that ProxSVRG+ achieves a global linear convergence rate without restart.
As a result, ProxSVRG+ can automatically switch to the faster linear convergence rate (i.e., $O(\log 1/\epsilon)$) from the sublinear convergence rate (i.e., $O(1/\epsilon)$) in some regions (e.g., the neighborhood of a local minimum) as long as the objective function satisfies the PL condition locally in these regions. This is impossible for ProxSVRG [24] since it needs to be restarted $O(\log 1/\epsilon)$ times.

Acknowledgments

This research is supported in part by the National Basic Research Program of China Grant 2015CB358700, the National Natural Science Foundation of China Grants 61772297, 61632016 and 61761146003, and a grant from Microsoft Research Asia. The authors would like to thank Rong Ge (Duke), Xiangliang Zhang (KAUST) and the anonymous reviewers for their useful suggestions.

[Figures 4-5 panels: function value vs. #SFO/n on a9a and MNIST; Figure 4 sweeps b over 1, 16, 64, 256, 512, 1024, 2048, 4096, 8192, 16384 for ProxSVRG+ and ProxSVRG; Figure 5 compares ProxGD, ProxSGD, ProxSVRG (b = 2048 on a9a, b = 4096 on MNIST), and ProxSVRG+ (b = 256).]

References

[1] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200-1205. ACM, 2017.

[2] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. arXiv preprint arXiv:1708.08694, 2017.

[3] Zeyuan Allen-Zhu and Elad Hazan.
Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699-707, 2016.

[4] Mihai Anitescu. Degenerate nonlinear programming with a quadratic growth condition. SIAM Journal on Optimization, 10(4):1116-1135, 2000.

[5] Aleksandr Aravkin and Damek Davis. A smart stochastic algorithm for nonconvex optimization with applications to robust machine learning. arXiv preprint arXiv:1610.01101, 2016.

[6] Dominik Csiba and Peter Richtárik. Global convergence of arbitrary-block gradient methods for generalized Polyak-Łojasiewicz functions. arXiv preprint arXiv:1709.03014, 2017.

[7] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646-1654, 2014.

[8] Haoyu Fu, Yuejie Chi, and Yingbin Liang. Local geometry of one-hidden-layer neural networks for logistic regression. arXiv preprint arXiv:1802.06463, 2018.

[9] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.

[10] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267-305, 2016.

[11] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

[12] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323, 2013.

[13] Hamed Karimi, Julie Nutini, and Mark Schmidt.
Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795-811. Springer, 2016.

[14] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

[15] Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. arXiv preprint arXiv:1507.02000, 2015.

[16] Guanghui Lan and Yi Zhou. Random gradient extrapolation for distributed and stochastic optimization. SIAM Journal on Optimization, 28(4):2753-2782, 2018.

[17] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I. Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2345-2355, 2017.

[18] Qunwei Li, Yi Zhou, Yingbin Liang, and Pramod K. Varshney. Convergence analysis of proximal gradient with momentum for nonconvex optimization. In International Conference on Machine Learning, pages 2111-2119, 2017.

[19] Zhi-Quan Luo and Paul Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach. Annals of Operations Research, 46(1):157-178, 1993.

[20] Ion Necoara, Yurii Nesterov, and Francois Glineur. Linear convergence of first order methods for non-strongly convex optimization. arXiv preprint arXiv:1504.06298, 2015.

[21] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.

[22] Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643-653, 1963.

[23] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization.
In International Conference on Machine Learning, pages 314-323, 2016.

[24] Sashank J. Reddi, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, pages 1145-1153, 2016.

[25] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.

[26] Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017.

[27] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization. arXiv preprint arXiv:1806.07811, 2018.