{"title": "Non-asymptotic Analysis of Stochastic Methods for Non-Smooth Non-Convex Regularized Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 2630, "page_last": 2640, "abstract": "Stochastic Proximal Gradient (SPG) methods have been widely used for solving optimization problems with a simple (possibly non-smooth) regularizer in machine learning and statistics. However, to the best of our knowledge no non-asymptotic convergence analysis of SPG exists for non-convex optimization with a non-smooth and non-convex regularizer. All existing non-asymptotic analysis of SPG for solving non-smooth non-convex problems require the non-smooth regularizer to be a convex function, and hence are not applicable to a non-smooth non-convex regularized problem. This work initiates the analysis to bridge this gap and opens the door to non-asymptotic convergence analysis of non-smooth non-convex regularized problems. We analyze several variants of mini-batch SPG methods for minimizing a non-convex objective that consists of a smooth non-convex loss and a non-smooth non-convex regularizer. Our contributions are two-fold: (i) we show that they enjoy the same complexities as their counterparts for solving convex regularized non-convex problems in terms of finding an approximate stationary point; (ii) we develop more practical variants using dynamic mini-batch size instead of a fixed mini-batch size without requiring the target accuracy level of solution. The significance of our results is that they improve upon the-state-of-art results for solving non-smooth non-convex regularized problems. We also empirically demonstrate the effectiveness of the considered SPG methods in comparison with other peer stochastic methods.", "full_text": "Non-asymptotic Analysis of Stochastic Methods for\nNon-Smooth Non-Convex Regularized Problems\n\nYi Xu1, Rong Jin2, Tianbao Yang1\n\n1. 
Department of Computer Science, The University of Iowa, Iowa City, IA 52246, USA
2. Machine Intelligence Technology, Alibaba Group, Bellevue, WA 98004, USA
{yi-xu, tianbao-yang}@uiowa.edu, jinrong.jr@alibaba-inc.com

Abstract

Stochastic Proximal Gradient (SPG) methods have been widely used for solving optimization problems with a simple (possibly non-smooth) regularizer in machine learning and statistics. However, to the best of our knowledge no non-asymptotic convergence analysis of SPG exists for non-convex optimization with a non-smooth and non-convex regularizer. All existing non-asymptotic analyses of SPG for solving non-smooth non-convex problems require the non-smooth regularizer to be a convex function, and hence are not applicable to a non-smooth non-convex regularized problem. This work initiates the analysis to bridge this gap and opens the door to non-asymptotic convergence analysis of non-smooth non-convex regularized problems. We analyze several variants of mini-batch SPG methods for minimizing a non-convex objective that consists of a smooth non-convex loss and a non-smooth non-convex regularizer. Our contributions are two-fold: (i) we show that they enjoy the same complexities as their counterparts for solving convex regularized non-convex problems in terms of finding an approximate stationary point; (ii) we develop more practical variants using dynamic mini-batch size instead of a fixed mini-batch size without requiring the target accuracy level of solution. 
The significance of our results is that they improve upon the state-of-the-art results for solving non-smooth non-convex regularized problems. We also empirically demonstrate the effectiveness of the considered SPG methods in comparison with other peer stochastic methods.

1 Introduction

In this work, we consider the following stochastic non-smooth non-convex optimization problem:

    min_{x ∈ R^d} F(x) := E_ξ[f(x; ξ)] + r(x),    (1)

where ξ is a random variable, f(x) := E_ξ[f(x; ξ)] is a smooth non-convex function, and r(x): R^d → R is a proper non-smooth non-convex lower-semicontinuous function. A special case of problem (1) in machine learning is of the following finite-sum form:

    min_{x ∈ R^d} F(x) := (1/n) Σ_{i=1}^n f_i(x) + r(x),    (2)

where n is the number of data samples. In the sequel, we refer to the problem (1) with a finite-sum structure as in the finite-sum setting and otherwise as in the online setting [29, 43]. The family of optimization problems with a non-convex smooth loss and a non-convex non-smooth regularizer is important and broad in machine learning and statistics. Examples of smooth non-convex losses include the non-linear square loss for classification [20], the truncated square loss for regression [44], and the cross-entropy loss for learning a neural network with a smooth activation function. Examples of non-smooth non-convex regularizers include the ℓ_p (0 ≤ p < 1) norm, the smoothly clipped absolute deviation (SCAD) [17], the log-sum penalty (LSP) [9], the minimax concave penalty (MCP) [48], and an indicator function of a non-convex constraint as well (e.g., ‖x‖_0 ≤ k).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Table 1: Summary of complexities for finding an ε-stationary point of (1). LC denotes a Lipschitz continuous function; FV means finite-valued over R^d; PM denotes that the proximal mapping exists and can be obtained efficiently. Õ(·) suppresses a logarithmic factor in terms of ε⁻¹.

Problem    | Algorithm                 | Complexity          | r(x)
Online     | MBSGA [29], SSDC-SPG [43] | O(ε⁻⁵)              | PM, LC
Online     | SSDC-SPG [43]             | O(ε⁻⁶)              | PM, FV
Online     | MB-SPG (this work)        | O(ε⁻⁴)              | PM
Online     | SPGR (this work)          | O(ε⁻³)              | PM
Finite-sum | VRSGA [29]                | O(n^{2/3} ε⁻³)      | PM, LC
Finite-sum | SSDC-SVRG [43]            | Õ(n ε⁻³)            | PM, LC
Finite-sum | SSDC-SVRG [43]            | Õ(n ε⁻⁴)            | PM, FV
Finite-sum | SPGR (this work)          | O(n^{1/2} ε⁻² + n)  | PM

Although non-convex minimization with a non-smooth convex regularizer has been extensively studied in both the online setting [19, 14, 41, 35] and the finite-sum setting [16, 38, 1, 34, 26, 12, 41, 35], stochastic optimization for the considered problem with a non-smooth non-convex regularizer is still under-explored. The presence of the non-smooth non-convex function r makes the analysis more challenging, which renders previous analyses that hinge on the convexity of r inapplicable. A special case of non-convex r that can be written as a DC (difference of convex) function, i.e., r(x) = r1(x) − r2(x) with r1 and r2 being convex, has recently been tackled by several studies with stochastic algorithms [43, 33, 40]. 
In this paper, we focus on first-order stochastic algorithms for solving problem (1) with a general non-smooth non-convex regularizer and study their non-asymptotic convergence rates.

Although there are plenty of studies devoted to non-smooth non-convex regularized problems [3, 5, 49, 23, 25, 47, 6, 2, 46, 27], they are restricted to deterministic algorithms and asymptotic or local convergence analysis. There are few studies concerned with the non-asymptotic convergence analysis of stochastic algorithms for problem (1). To the best of our knowledge, [43] is the first work that presents stochastic algorithms with non-asymptotic convergence results for finding an approximate critical point of a non-convex problem with a non-convex non-smooth regularizer. Indeed, they considered a more general problem in which f is a DC function and assumed that the second component of the DC decomposition of f has a Hölder-continuous gradient. Their convergence results are the state of the art for stochastic optimization of problem (1) in the online setting. Later, [29] presented two algorithms, namely the mini-batch stochastic gradient algorithm (MBSGA) and the variance reduced stochastic gradient algorithm (VRSGA), for solving (1) and (2) with an improved complexity for the finite-sum setting. To tackle the non-smooth non-convex regularizer, both of these works use a Moreau envelope of r to approximate r, which inevitably introduces approximation error and hence worsens the convergence rates.

A simple idea for tackling a non-smooth regularizer is to use proximal gradient methods, which have been studied extensively in the literature for a convex regularizer [19, 14, 16, 38, 1, 34, 26, 12, 41, 35]. 
A natural question is whether stochastic proximal gradient (SPG) methods still enjoy similar convergence guarantees for solving a non-smooth non-convex regularized problem as their counterparts for convex regularized non-convex minimization problems. In this paper, we provide an affirmative answer to this question. Our contributions are summarized below:

• We establish the first convergence rate of standard mini-batch SPG (MB-SPG) for solving (1) in terms of finding an approximate stationary point, which is the same as that of its counterpart for solving a non-convex minimization problem with a convex regularizer [19].

• Furthermore, we analyze improved variants of mini-batch SPG that use a recursive stochastic gradient estimator (SARAH [32, 31] or SPIDER [18, 41]), referred to as SPGR, and achieve new state-of-the-art convergence results for both the online setting and the finite-sum setting.

• Moreover, we propose more practical variants of MB-SPG and SPGR that use dynamic mini-batch sizes instead of a fixed mini-batch size to remove the requirement of knowing the target accuracy level of the solution for running the algorithms.

The complexity results of our algorithms and of other works for finding an ε-stationary solution of the considered problem are summarized in Table 1. It is notable that the complexity result of SPGR for the finite-sum setting is optimal, matching an existing lower bound [18]. 
Before ending this section, it is worth mentioning the differences between this work and [15], which provides the first convergence analysis of SPG to critical points of a non-smooth non-convex minimization problem: (i) their convergence analysis is asymptotic and hence provides no convergence rate; (ii) their analysis applies to non-smooth f but requires stronger assumptions on r (e.g., local Lipschitz continuity) that preclude the ℓ0 norm regularizer or an indicator function of a non-convex constraint; (iii) their analyzed SPG imposes no requirement on the mini-batch size.

2 Preliminaries

In this section, we present some preliminaries and notations. Let ‖x‖ denote the Euclidean norm of a vector x ∈ R^d. Denote by S = {ξ1, . . . , ξm} a set of random variables, let |S| be the number of elements in the set S, and let f_S(x) = (1/|S|) Σ_{ξi ∈ S} f(x; ξi). We denote by dist(x, S) the distance between the vector x and a set S. Denote by ∂̂h(x) the Fréchet subgradient and ∂h(x) the limiting subgradient of a non-convex function h(x): R^d → R, i.e.,

    ∂̂h(x̄) = { v ∈ R^d : liminf_{x → x̄} [h(x) − h(x̄) − v^⊤(x − x̄)] / ‖x − x̄‖ ≥ 0 },
    ∂h(x̄) = { v ∈ R^d : ∃ x_k →_h x̄ with v_k ∈ ∂̂h(x_k) and v_k → v },

where x →_h x̄ means x → x̄ and h(x) → h(x̄). We aim to find an ε-stationary point of problem (1), i.e., to find a solution x such that dist(0, ∂̂F(x)) ≤ ε. Since f is differentiable, we have ∂̂F(x) = ∂̂(f + r)(x) = ∇f(x) + ∂̂r(x) (see Exercise 8.8, [39]). Thus, it is equivalent to find a solution x satisfying

    dist(0, ∇f(x) + ∂̂r(x)) ≤ ε.    (3)

For problem (1), we make the following basic assumptions, which are standard in the literature on stochastic gradient methods for non-convex optimization [19, 29].

Assumption 1. Assume the following conditions hold:
(i) E_ξ[∇f(x; ξ)] = ∇f(x), and there exists a constant σ > 0 such that E_ξ[‖∇f(x; ξ) − ∇f(x)‖²] ≤ σ².
(ii) Given an initial point x0, there exists Δ < ∞ such that F(x0) − F(x*) ≤ Δ, where x* denotes the global minimum of (1).
(iii) f(x) is smooth with an L-Lipschitz continuous gradient, i.e., it is differentiable and there exists a constant L > 0 such that ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, ∀x, y.

In addition, we assume r(x) is simple enough such that its proximal mapping exists and can be obtained efficiently:

    prox_{ηr}[x] = arg min_{y ∈ R^d} { (1/(2η)) ‖y − x‖² + r(y) }.

This assumption is standard for proximal algorithms for non-convex functions [3, 8, 24]. The notation arg min denotes the set of minimizers. Closed-form proximal mappings for non-convex regularizers include hard thresholding for the ℓ0 regularizer [3], and ℓp thresholding for the ℓ1/2 regularizer [45] and the ℓ2/3 regularizer [10].

An immediate difficulty in solving problem (1) is the presence of non-smoothness and non-convexity in the regularizer r(x). 
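For concreteness, two of the closed-form proximal mappings just mentioned can be sketched as follows: hard thresholding for the ℓ0 regularizer r(x) = λ‖x‖_0, and the Euclidean projection implementing the prox of the indicator of {x : ‖x‖_0 ≤ k}. This is a minimal illustrative sketch (function names are ours); at the tie |v_i| = √(2ηλ) both 0 and v_i minimize the prox objective, and we pick 0.

```python
import numpy as np

def prox_l0(v, eta, lam):
    """Prox of r(x) = lam * ||x||_0 with step size eta (hard thresholding).

    Coordinate-wise, keeping v_i costs lam while zeroing it costs
    v_i^2 / (2*eta), so v_i survives only when |v_i| > sqrt(2*eta*lam)."""
    out = v.copy()
    out[np.abs(v) <= np.sqrt(2.0 * eta * lam)] = 0.0
    return out

def prox_l0_ball(v, k):
    """Prox of the indicator of {x : ||x||_0 <= k} (= projection onto the
    l0 ball): keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]   # indices of the k largest |v_i|
    out[idx] = v[idx]
    return out
```

Both maps are set-valued in general (ties), so the sketch returns one valid element of the prox set, which is all the algorithms below require.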
To deal with this issue, [43, 29] use the Moreau envelope of r to approximate r, which is defined as

    r_μ(x) = min_{y ∈ R^d} { (1/(2μ)) ‖y − x‖² + r(y) },

where μ > 0 is an approximation parameter. It is easy to see that the Moreau envelope of r(x) is a DC function:

    r_μ(x) = (1/(2μ)) ‖x‖² − max_{y ∈ R^d} { (1/μ) y^⊤x − (1/(2μ)) ‖y‖² − r(y) } =: (1/(2μ)) ‖x‖² − R_μ(x),

where R_μ(x) is convex since it is the maximum of convex functions in terms of x [7]. Instead of solving problem (1) directly, their idea is to solve the following approximated problem:

    min_{x ∈ R^d} F_μ(x) := f(x) + (1/(2μ)) ‖x‖² − R_μ(x).    (4)

However, this is a bad idea because it introduces approximation error on one hand and slows down convergence on the other hand. For example, [29] considers algorithms that update the solution based on a smooth function constructed by linearizing the term R_μ(x). As a result, the smoothness constant of the resulting function is proportional to 1/μ. In order to maintain a small approximation error, μ has to be small, which amplifies the smoothness constant dramatically. In this paper, we consider a direct approach that updates the solution simply by a stochastic proximal gradient update, i.e., x_{t+1} ∈ prox_{ηr}[x_t − η g_t], where g_t is a stochastic estimate of ∇f(x_t) with well-controlled variance, and η is a step size.

2.1 Warm-up: Proximal Gradient Descent Method

As a warm-up, we first present the analysis of the deterministic proximal gradient descent (PGD) method (also known as forward-backward splitting, FBS), which updates the solutions for t = 0, . . .
, T − 1 iteratively, given an initial solution x0:

    x_{t+1} ∈ prox_{ηr}[x_t − η∇f(x_t)] = arg min_{x ∈ R^d} { r(x) + ⟨∇f(x_t), x − x_t⟩ + (1/(2η)) ‖x − x_t‖² },    (5)

where η is a step size. To our knowledge, a non-asymptotic analysis of PGD for non-convex r(x) is not available, though an asymptotic analysis of PGD was provided in [3]. We summarize the non-asymptotic convergence result of PGD in the following theorem, and provide a proof sketch to highlight the key steps. The detailed proofs are provided in the supplement.

Theorem 1. Suppose Assumption 1 (ii) and (iii) hold, run (5) with η = c/L (0 < c < 1) and T = 4(η²L² + 1)Δ / (η(1 − ηL)ε²) = O(1/ε²) iterations; with R being uniformly sampled from {1, . . . , T}, we have E[dist(0, ∂̂F(x_R))] ≤ ε.

Remark: It is notable that this complexity result is optimal according to [11] for smooth non-convex optimization, and is the same as that for solving problem (1) when r(x) is convex [30].

Proof Sketch. For the update (5), we can only leverage its optimality condition (e.g., by Exercise 8.8 and Theorem 10.1 of [39]):

    −∇f(x_t) − (1/η)(x_{t+1} − x_t) ∈ ∂̂r(x_{t+1}),
    r(x_{t+1}) + ⟨∇f(x_t), x_{t+1} − x_t⟩ + (1/(2η)) ‖x_{t+1} − x_t‖² ≤ r(x_t),

where the first implies that ∇f(x_{t+1}) − ∇f(x_t) − (1/η)(x_{t+1} − x_t) ∈ ∂̂F(x_{t+1}). Combining the second inequality with the smoothness of f(x), i.e., f(x_{t+1}) ≤ f(x_t) + ⟨∇f(x_t), x_{t+1} − x_t⟩ + (L/2) ‖x_{t+1} − x_t‖², we get (1/2)(1/η − L) ‖x_{t+1} − x_t‖² ≤ F(x_t) − F(x_{t+1}). 
By telescoping the above inequality and connecting ∂̂F(x_{t+1}) with ‖x_{t+1} − x_t‖, we can finish the proof.

3 Mini-Batch Stochastic Proximal Gradient Methods

In this and the next section, we analyze mini-batch stochastic proximal gradient methods that use a stochastic gradient g_t for updating the solution. The key idea of the two methods is to control the variance of the stochastic gradient properly.

We present the detailed updates of the first algorithm (named MB-SPG) in Algorithm 1, which updates the solution based on a mini-batched stochastic gradient of f(x) at the t-th iteration and the proximal mapping of r(x). We first present a general convergence result of Algorithm 1.

Theorem 2. Suppose Assumption 1 holds, run Algorithm 1 with η = c/L (0 < c < 1/2), then the output x_R of Algorithm 1 satisfies E[dist(0, ∂̂F(x_R))²] ≤ (c1/T) Σ_{t=0}^{T−1} E[‖g_t − ∇f(x_t)‖²] + c2Δ/(ηT), where c1 = (2c(1 − 2c) + 2)/(c(1 − 2c)) and c2 = (6 − 4c)/(1 − 2c) are two positive constants.

Algorithm 1 Mini-Batch Stochastic Proximal Gradient: MB-SPG
1: Initialize: x0 ∈ R^d, η = c/L with 0 < c < 1/2.
2: for t = 0, 1, . . . , T − 1 do
3:   Draw samples S_t = {ξ1, . . . , ξ_{m_t}}, let g_t = (1/m_t) Σ_{i_t=1}^{m_t} ∇f(x_t; ξ_{i_t})
4:   x_{t+1} ∈ prox_{ηr}[x_t − η g_t]
5: end for
6: Output: x_R, where R is uniformly sampled from {1, . . . , T}.

Algorithm 2 Stochastic Proximal Gradient using SPIDER/SARAH: SPGR
1: Initialize: x0 ∈ R^d, η = c/L with 0 < c < 1/3.
2: for t = 0, 1, . . . , T − 1 do
3:   if mod(t, q) == 0 then
4:     Draw samples S1, let g_t = ∇f_{S1}(x_t)   // For the finite-sum setting, |S1| = n
5:   else
6:     Draw samples S2, let g_t = ∇f_{S2}(x_t) − ∇f_{S2}(x_{t−1}) + g_{t−1}
7:   end if
8:   x_{t+1} ∈ prox_{ηr}[x_t − η g_t]
9: end for
10: Output: x_R, where R is uniformly sampled from {1, . . . , T}.

Next, we present two corollaries, using a fixed mini-batch size and increasing mini-batch sizes, respectively.

Corollary 3 (Fixed mini-batch size). Suppose Assumption 1 holds, run MB-SPG (Algorithm 1) with η = c/L (0 < c < 1/2), T = 2c2Δ/(ηε²) and a fixed mini-batch size m_t = 2c1σ²/ε² for t = 0, . . . , T − 1, then the output x_R of Algorithm 1 satisfies E[dist(0, ∂̂F(x_R))²] ≤ ε², where c1, c2 are the two positive constants in Theorem 2.

Corollary 4 (Increasing mini-batch sizes). Suppose Assumption 1 holds, run MB-SPG (Algorithm 1) with η = c/L (0 < c < 1/2) and a sequence of mini-batch sizes m_t = b(t + 1) for t = 0, . . . , T − 1, where b > 0 is a constant, then the output x_R of Algorithm 1 satisfies E[dist(0, ∂̂F(x_R))²] ≤ c1σ²(log(T) + 1)/(bT) + c2Δ/(ηT), where c1, c2 are the constants in Theorem 2. In particular, in order to have E[dist(0, ∂̂F(x_R))] ≤ ε, it suffices to set T = Õ(1/ε²). The total complexity is Õ(1/ε⁴).

Remark: Although using increasing mini-batch sizes incurs an additional logarithmic factor in the complexity compared with using a fixed mini-batch size, it is more practical and user-friendly because it does not require knowing the target accuracy ε to run the algorithm.

4 Stochastic Proximal Gradient Methods with Recursive Stochastic Gradient Estimator

In this section, we leverage the novel recursive stochastic gradient estimator (SARAH/SPIDER) to achieve a better complexity. 
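To make the recursive estimator concrete, here is a minimal sketch of the SPGR loop (Algorithm 2) in the finite-sum setting, instantiated on a least-squares loss f(x) = ‖Ax − b‖²/(2n) with an ℓ0 regularizer. The instance, step-size choices, and function names are illustrative, not taken from the paper.

```python
import numpy as np

def prox_hard(v, eta, lam):
    """Hard thresholding: one valid element of prox_{eta * lam * ||.||_0}."""
    out = v.copy()
    out[np.abs(v) <= np.sqrt(2.0 * eta * lam)] = 0.0
    return out

def spgr(A, b, lam, eta, q, batch, T, seed=0):
    """Sketch of SPGR on f(x) = ||Ax - b||^2 / (2n) + lam * ||x||_0.

    Every q iterations g_t is refreshed with the full gradient (|S1| = n);
    in between it is updated recursively (SARAH/SPIDER style):
        g_t = grad_S2(x_t) - grad_S2(x_{t-1}) + g_{t-1}."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x_prev = np.zeros(d)
    x = np.zeros(d)
    g = np.zeros(d)
    for t in range(T):
        if t % q == 0:
            g = A.T @ (A @ x - b) / n                      # full-batch refresh
        else:
            S = rng.choice(n, size=batch, replace=False)   # mini-batch S2
            g_cur = A[S].T @ (A[S] @ x - b[S]) / batch
            g_old = A[S].T @ (A[S] @ x_prev - b[S]) / batch
            g = g_cur - g_old + g                          # recursive update
        x_prev, x = x, prox_hard(x - eta * g, eta, lam)    # proximal step
    return x
```

The correction term g_cur − g_old shrinks with ‖x_t − x_{t−1}‖ under Assumption 2, which is exactly what keeps the estimator's variance controlled between full-batch refreshes.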
We present the detailed updates of the proposed algorithm, referred to as SPGR, in Algorithm 2, where the stochastic gradient estimate g_t is periodically updated by adding the current stochastic gradient ∇f_{S2}(x_t) to g_{t−1} and subtracting the past stochastic gradient ∇f_{S2}(x_{t−1}). To our knowledge, this framework was first introduced in SARAH [32, 31] for solving convex/non-convex smooth finite-sum problems with r(x) = 0. Another algorithm, called SPIDER, with the same recursive framework was proposed in [18] for solving non-convex smooth problems with r(x) = 0 in both the finite-sum and online settings. One difference is that SPIDER uses a normalized gradient update with step size η = O(ε/L). Recently, [41] and [35] respectively extended SPIDER and SARAH to their proximal versions for solving non-convex smooth problems with a convex non-smooth regularizer r(x). By contrast, we consider more challenging problems in this paper, i.e., non-convex non-smooth regularized non-convex minimization problems. In order to use the SARAH/SPIDER technique to construct a variance-reduced stochastic gradient of f, we need an additional assumption, which is also used in previous studies [31, 18, 41, 35].

Assumption 2. Assume that every random function f(x; ξ) is smooth with an L-Lipschitz continuous gradient, i.e., it is differentiable and there exists a constant L > 0 such that ‖∇f(x; ξ) − ∇f(y; ξ)‖ ≤ L‖x − y‖, ∀x, y.

Algorithm 3 SPGR with Increasing Mini-Batch Sizes: SPGR-imb
1: Initialize: x0 ∈ R^d, η = c/L with 0 < c < 1/6, b ≥ 1
2: Set: t = 0, x_{−1} = x0
3: for s = 1, . . . , S do
4:   Draw samples S_{1,s}, let g_t = ∇f_{S_{1,s}}(x_t)   // |S_{1,s}| = b²s²
5:   x_{t+1} ∈ prox_{ηr}[x_t − η g_t], t = t + 1
6:   for q = 1, . . . , bs do
7:     Draw samples S_{2,s}, let g_t = ∇f_{S_{2,s}}(x_t) − ∇f_{S_{2,s}}(x_{t−1}) + g_{t−1}   // |S_{2,s}| = bs
8:     x_{t+1} ∈ prox_{ηr}[x_t − η g_t], t = t + 1
9:   end for
10: end for
11: Output: x_R, where R is uniformly sampled from {1, . . . , T}.

First, we present a general non-asymptotic convergence result of SPGR, which is summarized below.

Theorem 5. Suppose Assumptions 1 and 2 hold, run Algorithm 2 with η = c/L (0 < c < 1/3) and q = |S2|, then the output x_R of Algorithm 2 satisfies E[dist(0, ∂̂F(x_R))²] ≤ (2θΔ + γηΔ)/(ηθT) + (γ + 4θL)σ²/(2θL|S1|) for the online setting and E[dist(0, ∂̂F(x_R))²] ≤ (2θΔ + γηΔ)/(ηθT) for the finite-sum setting, where γ = 4L² + 1/η² + 2L/η and θ = (1 − 3ηL)/(2η) are two positive constants.

Although the SARAH/SPIDER update used in Algorithm 2 is similar to that used in [41, 35] for handling convex regularizers, our analysis has some key differences from that in [41, 35]. In particular, the analysis in [41, 35] heavily relies on the convexity of the regularizer. In addition, they proved the convergence of the proximal gradient defined as G_η(x) = (1/η)(x − prox_{ηr}(x − η∇f(x))), while we directly prove the convergence of the subgradient ∂̂F(x). The convergence of the proximal gradient only implies a weak convergence of the subgradient (i.e., a solution x which satisfies ‖G_η(x)‖ ≤ ε indicates that it is close to a solution x⁺ = prox_{ηr}(x − η∇f(x)) such that ‖∂̂F(x⁺)‖ ≤ O(ε) when η = Θ(1/L)). The following corollary summarizes the results in the two settings.

Corollary 6. 
Under the same conditions and notations as in Theorem 5, in order to have E[dist(0, ∂̂F(x_R))] ≤ ε, we can set:

• (Online setting) q = |S2| = √|S1|, |S1| = (γ + 4θL)σ²/(θLε²), and T = 2(2θ + γη)Δ/(ηθε²), giving a total complexity of O(ε⁻³).

• (Finite-sum setting) q = |S2| = √n, |S1| = n, and T = (2θ + γη)Δ/(ηθε²), leading to a total complexity of O(√n ε⁻² + n).

Remark: It is notable that the above complexity result is near-optimal according to [18, 50] for the finite-sum setting. For some special cases of r(x), similar complexities have been established when r(x) = 0 [18, 51] or when r(x) is convex [41, 35].

4.1 SPGR with Increasing Mini-Batch Sizes

One limitation of SPGR for the online setting is that it requires knowing the target accuracy level ε in order to set q and the sizes of S1 and S2, which makes it impractical. A user would need to worry about the right value of ε for running the algorithm, as a small ε may waste a lot of computation and a relatively large ε may not lead to an accurate solution. To address this issue, we propose a practical variant of SPGR, namely SPGR-imb, which uses increasing mini-batch sizes. The detailed updates are presented in Algorithm 3. The key idea is that we divide the whole process into S stages, and for each stage s ∈ [S], the mini-batch sizes |S1| and |S2| are set to be proportional to s² and s, respectively. The insight of this design is similar to that of Algorithm 1 with increasing mini-batch sizes, i.e., at earlier stages when the solution is far from a stationary solution, we can tolerate a large variance in the stochastic gradient estimator and hence allow for a smaller mini-batch size. 
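The stage-wise schedule can be sketched as a small helper that enumerates the batch sizes of Algorithm 3. The sizes follow the comments in Algorithm 3 (|S_{1,s}| = b²s², |S_{2,s}| = bs, with bs recursive steps per stage); counting only the recursive steps reproduces the relation bS(S + 1)/2 = T used in Theorem 7. Names are ours and the sketch is illustrative.

```python
def spgr_imb_schedule(b, S):
    """Batch schedule of SPGR-imb (Algorithm 3).

    Stage s uses a refresh batch of size b^2 * s^2, a correction batch of
    size b * s, and runs b * s recursive proximal steps, so the total
    recursive-step count is T = b * S * (S + 1) / 2."""
    stages = [{"s": s, "S1": b * b * s * s, "S2": b * s, "steps": b * s}
              for s in range(1, S + 1)]
    total = sum(st["steps"] for st in stages)
    return stages, total
```

For example, b = 2 and S = 3 gives refresh batches of sizes 4, 16, 36 and a total of T = 12 recursive steps, illustrating how the variance budget tightens as the stages progress.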
We summarize the non-asymptotic convergence result of SPGR-imb in the following theorem.

Theorem 7. Suppose Assumptions 1 and 2 hold, run Algorithm 3 with η = c/L (0 < c < 1/3) and S satisfying bS(S + 1)/2 = T, then the output x_R of Algorithm 3 satisfies E[dist(0, ∂̂F(x_R))²] ≤ (2θ + γη)Δ/(θηT) + (4θL + γ)σ²(log(2T/b) + 2)/(4bθLT) for the online setting and E[dist(0, ∂̂F(x_R))²] ≤ (2θ + γη)Δ/(θηT) for the finite-sum setting, where γ = 4L² + 1/η² + 2L/η and θ = (1 − 3ηL)/(2η) are two positive constants. In particular, in order to have E[dist(0, ∂̂F(x_R))] ≤ ε, it suffices to set T = Õ(1/ε²). The total complexity is Õ(1/ε³).

Remark: Compared to the result in Corollary 6, the complexity result of Theorem 7 is only worse by a logarithmic factor.

5 Experiments

Regularized loss minimization. First, we compare MB-SPG and SPGR with MBSGA, VRSGA, SSDC-SPG and SSDC-SVRG for solving the regularized non-linear least square (NLLS) classification problem (1/n) Σ_{i=1}^n (b_i − σ(x^⊤a_i))² + r(x) with a sigmoid function σ(s) = 1/(1 + e^{−s}) for classification, and the regularized truncated least square (TLS) loss function (1/(2n)) Σ_{i=1}^n α log(1 + (y_i − w^⊤x_i)²/α) + r(x) for regression [44]. Two data sets (covtype and a9a) are used for classification, and two data sets, E2006 and triazines, are used for regression. These data sets are downloaded from the libsvm website. We use three different non-smooth non-convex regularizers, i.e., the ℓ0 regularizer r(x) = λ‖x‖0, the ℓ0.5 regularizer r(x) = λ‖x‖0.5, and the indicator function of the ℓ0 constraint I_{‖x‖0 ≤ κ}(x). The truncation value α is set to √(10n) following [44]. 
The value of the regularization parameter λ is fixed as 10⁻⁴ and the value of κ is fixed as 0.2d, where d is the dimension of the data. For all algorithms, we use the theoretical values of the parameters for the sake of fairness in comparison. All algorithms start from the same initial solution with all-zero entries. We implement the increasing mini-batch versions of MB-SPG and SPGR (online setting) with b = 1. The unknown parameter σ in MBSGA is estimated following [29]. The objective value (in log scale) versus the number of gradient computations for the different tasks is plotted in Figure 1. The solid lines correspond to algorithms running in the online setting and the dashed lines correspond to algorithms running in the finite-sum setting. By comparing the algorithms running in the online setting, including MB-SPG, SPGR, MBSGA and SSDC-SPG, we can see that the proposed algorithms (MB-SPG and SPGR) are faster across different tasks. In addition, SPGR is faster than MB-SPG. These results are consistent with our theory. By comparing the algorithms running in the finite-sum setting, including VRSGA, SSDC-SVRG and SPGR, we can see that the proposed SPGR is much faster, which also corroborates our theory.

Learning with Quantization. Second, we consider the problem of learning a quantized model, where the model parameter is represented by a small number of bits (e.g., 2 bits that can encode 1 or −1). It has received tremendous attention in deep learning for model compression [21, 42, 36]. One idea to formulate the problem is to consider a constrained optimization problem: min_{x ∈ Ω} f(x), where Ω denotes a discrete set including the values that can be represented by a small number of bits. However, finding a stationary point for this problem is meaningless. This is because, for a discrete set Ω, the subgradient of its indicator function I_Ω(x) is the whole space [13, 22]. 
Hence, we have 0 ∈ ∂̂(f(x) + I_Ω(x)) for any x ∈ Ω. To avoid this issue, we consider a different formulation by using a penalization of the constraint: min_{x ∈ R^d} f(x) + (λ/2) ‖x − P_Ω(x)‖², where P_Ω(x) is a projection onto the set Ω and λ > 0 is a penalization parameter. This penalization-based approach is one standard way to handle complicated constraints [4, 28]. It is notable that in general the penalization term is a non-smooth non-convex function of x for a non-convex set Ω, though its local smoothness has been proved under some regularity conditions on Ω [37]. The proximal mapping of the penalization term has a closed-form solution as long as P_Ω(x) can be easily computed [24], which corresponds to quantization for our considered problem.

In the experiment, we use the NLLS loss similar to regularized loss minimization for learning a quantized non-linear model, and focus on a comparison of algorithms running in the online setting, including MBSGA, SSDC-SPG, MB-SPG and SPGR. We also implement a popular heuristic SGD approach in deep learning for learning a quantized model [36], which updates the solution simply by x_{t+1} = x_t − η_t ∇f(x̂_t; ξ_t), where x̂_t = P_Ω(x_t) is the quantized model. We conduct the experiments on four data sets, mnist, news20, rcv1 and w8a, where the last three data sets are downloaded from the libsvm website. We compare the testing accuracy of the learned quantized model versus the number of iterations, and the results are plotted in Figure 2, where q denotes the number of bits for quantization. We fix λ = 1, and decrease the step size by half every 100 iterations for heuristic SGD, MBSGA and MB-SPG. This is helpful for generalization purposes. 
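As a concrete illustration of the penalized quantization formulation above, the sketch below implements a coordinate-wise nearest-level projection P_Ω and one valid closed form of the prox of r(x) = (λ/2)‖x − P_Ω(x)‖²: since the penalty separates over coordinates and the Voronoi cell of each level is an interval, each coordinate is pulled toward its nearest level as y = (v + ηλ z)/(1 + ηλ) with z = P_Ω(v). This is our own derivation under these grid assumptions, not the authors' code; the grid `levels` is an illustrative choice.

```python
import numpy as np

def project_grid(v, levels):
    """P_Omega: map each coordinate to its nearest quantization level."""
    idx = np.argmin(np.abs(v[:, None] - levels[None, :]), axis=1)
    return levels[idx]

def prox_quant_penalty(v, eta, lam, levels):
    """One element of prox_{eta * r} for r(x) = (lam/2) * ||x - P_Omega(x)||^2
    with a coordinate-wise grid Omega: a convex combination of v and its
    quantized value, y = (v + eta*lam*z) / (1 + eta*lam), z = P_Omega(v)."""
    z = project_grid(v, levels)
    return (v + eta * lam * z) / (1.0 + eta * lam)
```

Note that as λ → ∞ the prox approaches the projection itself, recovering hard quantization, while small λ only nudges the iterate toward the grid.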
We can see that the proposed SPGR algorithm achieves better testing accuracy in most cases, and the proposed MB-SPG performs comparably to, if not better than, the other baselines.

Figure 1: Comparisons of different algorithms for regularized loss minimization (objective value in log scale versus number of gradient computations, for NLLS+ℓ0, NLLS+ℓ0.5 and NLLS+ℓ0-constraint tasks on covtype and a9a, and TLS+ℓ0, TLS+ℓ0.5 and TLS+ℓ0-constraint tasks on E2006 and triazines).

Figure 2: Comparisons of different algorithms for learning with quantization (testing accuracy versus number of gradient computations for q = 2, 4, 8 bits on mnist, news20, w8a and rcv1.binary).

6 Conclusions
In this paper, we have presented the first non-asymptotic convergence analysis of stochastic proximal gradient methods for solving a non-convex optimization problem with a smooth loss function and a non-smooth non-convex regularizer. The proposed algorithms enjoy complexities that improve upon the state-of-the-art results for the same problems, and also match the existing complexity results for solving non-convex minimization problems with a smooth loss and a non-smooth convex regularizer.

Acknowledgements
The authors thank the anonymous reviewers for their helpful comments. Y. Xu and T. Yang are partially supported by the National Science Foundation (IIS-1545995).
References
[1] Z. Allen-Zhu. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In International Conference on Machine Learning, pages 89–97, 2017.
[2] N. T. An and N. M. Nam. Convergence analysis of a proximal point algorithm for minimizing differences of functions. Optimization, 66(1):129–147, 2017.
[3] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1):91–129, 2013.
[4] D. P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, 2014.
[5] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494, 2014.
[6] R. I. Bot, E. R. Csetnek, and S. C. László. An inertial forward–backward algorithm for the minimization of the sum of two nonconvex functions. EURO Journal on Computational Optimization, 4(1):3–25, 2016.
[7] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[8] K. Bredies, D. A. Lorenz, and S. Reiterer. Minimization of non-smooth, non-convex functionals by iterative thresholding. Journal of Optimization Theory and Applications, 165(1):78–112, 2015.
[9] E. J. Candès, M. B. Wakin, and S. P.
Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.
[10] W. Cao, J. Sun, and Z. Xu. Fast image deconvolution using closed-form thresholding formulas of ℓq (q = 1/2, 2/3) regularization. Journal of Visual Communication and Image Representation, 24(1):31–41, 2013.
[11] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. arXiv preprint arXiv:1710.11606, 2017.
[12] Z. Chen and T. Yang. A variance reduction method for non-convex optimization with improved convergence under large condition number. arXiv preprint arXiv:1809.06754, 2018.
[13] F. H. Clarke. Optimization and Nonsmooth Analysis, volume 5. SIAM, 1990.
[14] D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.
[15] D. Davis, D. Drusvyatskiy, S. Kakade, and J. D. Lee. Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics, pages 1–36, 2018.
[16] A. Defazio, F. R. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
[17] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
[18] C. Fang, C. J. Li, Z. Lin, and T. Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 687–697, 2018.
[19] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
[20] I.
Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[21] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[22] A. Y. Kruger. On Fréchet subdifferentials. Journal of Mathematical Sciences, 116(3):3325–3358, 2003.
[23] G. Li and T. K. Pong. Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4):2434–2460, 2015.
[24] G. Li and T. K. Pong. Douglas–Rachford splitting for nonconvex optimization with application to nonconvex feasibility problems. Mathematical Programming, 159(1-2):371–401, 2016.
[25] H. Li and Z. Lin. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems, pages 379–387, 2015.
[26] Z. Li and J. Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems, pages 5569–5579, 2018.
[27] T. Liu, T. K. Pong, and A. Takeda. A successive difference-of-convex approximation method for a class of nonconvex nonsmooth optimization problems. Mathematical Programming, pages 1–29, 2017.
[28] D. G. Luenberger and Y. Ye. Linear and Nonlinear Programming, volume 228. Springer, 2015.
[29] M. R. Metel and A. Takeda. Stochastic gradient methods for non-smooth non-convex regularized optimization. arXiv preprint arXiv:1901.08369, 2019.
[30] Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
[31] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv preprint arXiv:1705.07261, 2017.
[32] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč.
SARAH: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613–2621, 2017.
[33] A. Nitanda and T. Suzuki. Stochastic difference of convex algorithm and its application to training deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 470–478, 2017.
[34] C. Paquette, H. Lin, D. Drusvyatskiy, J. Mairal, and Z. Harchaoui. Catalyst for gradient-based nonconvex optimization. In International Conference on Artificial Intelligence and Statistics, pages 1–10, 2018.
[35] N. H. Pham, L. M. Nguyen, D. T. Phan, and Q. Tran-Dinh. ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization. arXiv preprint arXiv:1902.05679, 2019.
[36] A. Polino, R. Pascanu, and D. Alistarh. Model compression via distillation and quantization. In International Conference on Learning Representations, 2018.
[37] R. A. Poliquin, R. T. Rockafellar, and L. Thibault. Local differentiability of distance functions. Transactions of the American Mathematical Society, 352:5231–5249, 2000.
[38] S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, pages 1145–1153, 2016.
[39] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis. Springer, 1998.
[40] H. A. Le Thi, H. M. Le, D. N. Phan, and B. Tran. Stochastic DCA for the large-sum of non-convex functions problem and its application to group variable selection in classification. In International Conference on Machine Learning, pages 3394–3403, 2017.
[41] Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690, 2018.
[42] J.
Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016.
[43] Y. Xu, Q. Qi, Q. Lin, R. Jin, and T. Yang. Stochastic optimization for DC functions and non-smooth non-convex regularizers with non-asymptotic convergence. arXiv preprint arXiv:1811.11829, 2018.
[44] Y. Xu, S. Zhu, S. Yang, C. Zhang, R. Jin, and T. Yang. Learning with non-convex truncated losses by SGD. arXiv preprint arXiv:1805.07880, 2018.
[45] Z. Xu, X. Chang, F. Xu, and H. Zhang. ℓ1/2 regularization: A thresholding representation theory and a fast solver. IEEE Transactions on Neural Networks and Learning Systems, 23(7):1013–1027, 2012.
[46] L. Yang. Proximal gradient method with extrapolation and line search for a class of nonconvex and nonsmooth problems. arXiv preprint arXiv:1711.06831, 2018.
[47] Y. Yu, X. Zheng, M. Marchetti-Bowick, and E. P. Xing. Minimizing nonconvex non-separable functions. In International Conference on Artificial Intelligence and Statistics, pages 1107–1115, 2015.
[48] C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38:894–942, 2010.
[49] W. Zhong and J. T. Kwok. Gradient descent with proximal average for nonconvex and composite regularization. In AAAI Conference on Artificial Intelligence, pages 2206–2212, 2014.
[50] D. Zhou and Q. Gu. Lower bounds for smooth nonconvex finite-sum optimization. arXiv preprint arXiv:1901.11224, 2019.
[51] D. Zhou, P. Xu, and Q. Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization.
In Advances in Neural Information Processing Systems, pages 3925–3936, 2018.