{"title": "Escaping Saddle Points in Constrained Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 3629, "page_last": 3639, "abstract": "In this paper, we study the problem of escaping from saddle points in smooth\nnonconvex optimization problems subject to a convex set $\\mathcal{C}$. We propose a generic framework that yields convergence to a second-order stationary point of the problem, if the convex set $\\mathcal{C}$ is simple for a quadratic objective function. Specifically, our results hold if one can find a $\\rho$-approximate solution of a quadratic program subject to $\\mathcal{C}$ in polynomial time, where $\\rho<1$ is a positive constant that depends on the structure of the set $\\mathcal{C}$. Under this condition, we show that the sequence of iterates generated by the proposed framework reaches an $(\\epsilon,\\gamma)$-second order stationary point (SOSP) in at most $\\mathcal{O}(\\max\\{\\epsilon^{-2},\\rho^{-3}\\gamma^{-3}\\})$ iterations. We further characterize the overall complexity of reaching an SOSP when the convex set $\\mathcal{C}$ can be written as a set of quadratic constraints and the objective function Hessian\nhas a specific structure over the convex set $\\mathcal{C}$. Finally, we extend our results to the stochastic setting and characterize the number of stochastic gradient and Hessian evaluations to reach an $(\\epsilon,\\gamma)$-SOSP.", "full_text": "Escaping Saddle Points in Constrained Optimization\n\nAryan Mokhtari\nMIT\nCambridge, MA 02139\naryanm@mit.edu\n\nAsuman Ozdaglar\nMIT\nCambridge, MA 02139\nasuman@mit.edu\n\nAli Jadbabaie\nMIT\nCambridge, MA 02139\njadbabai@mit.edu\n\nAbstract\n\nIn this paper, we study the problem of escaping from saddle points in smooth\nnonconvex optimization problems subject to a convex set $\\mathcal{C}$. 
We propose a generic\nframework that yields convergence to a second-order stationary point of the problem,\nif the convex set $\\mathcal{C}$ is simple for a quadratic objective function. Specifically,\nour results hold if one can find a $\\rho$-approximate solution of a quadratic program\nsubject to $\\mathcal{C}$ in polynomial time, where $\\rho < 1$ is a positive constant that depends\non the structure of the set $\\mathcal{C}$. Under this condition, we show that the sequence\nof iterates generated by the proposed framework reaches an $(\\epsilon,\\gamma)$-second order\nstationary point (SOSP) in at most $\\mathcal{O}(\\max\\{\\epsilon^{-2},\\rho^{-3}\\gamma^{-3}\\})$ iterations. We further\ncharacterize the overall complexity of reaching an SOSP when the convex set $\\mathcal{C}$\ncan be written as a set of quadratic constraints and the objective function Hessian\nhas a specific structure over the convex set $\\mathcal{C}$. Finally, we extend our results to the\nstochastic setting and characterize the number of stochastic gradient and Hessian\nevaluations to reach an $(\\epsilon,\\gamma)$-SOSP.\n\n1 Introduction\n\nThere has been a recent revival of interest in non-convex optimization, due to obvious applications\nin machine learning. While the modern history of the subject goes back six or seven decades, the\nrecent attention to the topic stems from new applications as well as availability of modern analytical\nand computational tools, providing a new perspective on classical problems. Following this trend, in\nthis paper we focus on the problem of minimizing a smooth nonconvex function over a convex set as\nfollows:\n\n$$\\text{minimize } f(x) \\quad \\text{subject to } x \\in \\mathcal{C}, \\tag{1}$$\n\nwhere $x \\in \\mathbb{R}^d$ is the decision variable, $\\mathcal{C} \\subset \\mathbb{R}^d$ is a closed convex set, and $f : \\mathbb{R}^d \\to \\mathbb{R}$ is a twice\ncontinuously differentiable function over $\\mathcal{C}$. It is well known that finding a global minimum of\nProblem (1) is hard. Equally well-known is the fact that for certain nonconvex problems, all local\nminimizers are global. 
These include, for example, matrix completion [24], phase retrieval [42], and\ndictionary learning [43]. For such problems, finding a global minimum of Problem (1) reduces to the\nproblem of finding one of its local minima.\nGiven the well-known hardness results in finding stationary points, recent focus has shifted to\ncharacterizing approximate stationary points. When the objective function $f$ is convex, finding an\n$\\epsilon$-first-order stationary point is often sufficient since it leads to finding an approximate local (and\nhence global) minimum. However, in the nonconvex setting, even when the problem is unconstrained,\ni.e., $\\mathcal{C} = \\mathbb{R}^d$, convergence to a first-order stationary point (FOSP) is not enough as the critical point\nto which convergence is established might be a saddle point. It is therefore natural to look at higher\norder derivatives and search for second-order stationary points. Indeed, under the assumption that\nall the saddle points are strict (formally defined later), in both unconstrained and constrained settings,\nconvergence to a second-order stationary point (SOSP) implies convergence to a local minimum.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nWhile convergence to an SOSP has been thoroughly investigated in the recent literature for the\nunconstrained setting, the overall complexity for the constrained setting has not been studied yet.\nContributions. Our main contribution is to propose a generic framework which generates a sequence\nof iterates converging to an approximate second-order stationary point for the constrained nonconvex\nproblem in (1), when the convex set $\\mathcal{C}$ has a specific structure that allows for approximate minimization\nof a quadratic loss over the feasible set. 
The proposed framework consists of two main stages: First, it\nutilizes first-order information to reach a first-order stationary point; next, it incorporates second-order\ninformation to escape from a stationary point if it is a local maximizer or a strict saddle point. We show\nthat the proposed approach leads to an $(\\epsilon,\\gamma)$-second-order stationary point (SOSP) for Problem (1)\n(see Definition 1). The proposed approach utilizes advances in constant-factor optimization of\nnonconvex quadratic programs [46, 22, 44] that find a $\\rho$-approximate solution over $\\mathcal{C}$ in polynomial\ntime, where $\\rho$ is a positive constant smaller than 1 that depends on the structure of $\\mathcal{C}$. When such\nan approximate solution exists, the sequence of iterates generated by the proposed framework reaches\nan $(\\epsilon,\\gamma)$-SOSP of Problem (1) in at most $\\mathcal{O}(\\max\\{\\epsilon^{-2},\\rho^{-3}\\gamma^{-3}\\})$ iterations.\nWe show that quadratic constraints satisfy the required condition for the convex set $\\mathcal{C}$ if the objective\nfunction Hessian $\\nabla^2 f$ has a specific structure over the convex set $\\mathcal{C}$ (formally described later). For this\ncase, we show that it is possible to achieve an $(\\epsilon,\\gamma)$-SOSP after at most $\\mathcal{O}(\\max\\{\\tau\\epsilon^{-2}, d^3 m^7 \\gamma^{-3}\\})$\narithmetic operations, where $d$ is the dimension of the problem and $\\tau$ is the number of required\narithmetic operations to solve a linear program over $\\mathcal{C}$ or to project a point onto $\\mathcal{C}$. We further extend\nour results to the stochastic setting and show that we can reach an $(\\epsilon,\\gamma)$-SOSP after computing at\nmost $\\mathcal{O}(\\max\\{\\epsilon^{-4}, \\epsilon^{-2}\\rho^{-4}\\gamma^{-4}, \\rho^{-7}\\gamma^{-7}\\})$ stochastic gradients and $\\mathcal{O}(\\max\\{\\epsilon^{-2}\\rho^{-3}\\gamma^{-3}, \\rho^{-5}\\gamma^{-5}\\})$\nstochastic Hessians.\n\n1.1 Related work\n\nUnconstrained case. The rich literature on nonconvex optimization provides a plethora of algorithms\nfor reaching stationary points of a smooth unconstrained minimization problem. 
Convergence to\nfirst-order stationary points (FOSP) has been widely studied for both deterministic [35, 1, 7\u201310] and\nstochastic settings [39, 38, 3, 32]. Stronger results which indicate convergence to an SOSP are also\nestablished. Numerical optimization methods such as trust-region methods [13, 19, 33] and cubic\nregularization algorithms [36, 11, 12] can reach an approximate second-order stationary point in a\nfinite number of iterations; however, typically the computational complexity of each iteration could\nbe relatively large due to the cost of solving trust-region or regularized cubic subproblems. Recently,\na new line of research has emerged that focuses on the overall computational cost to achieve an\nSOSP. These results build on the idea of escaping from strict saddle points by perturbing the iterates\nwith properly chosen noise [23, 29, 30], or by updating the iterates using the eigenvector\ncorresponding to the smallest eigenvalue of the Hessian [7, 2, 45, 41, 1, 40, 37].\nConstrained case. Asymptotic convergence to first-order and second-order stationary points for the\nconstrained optimization problem in (1) has been studied in the numerical optimization community\n[6, 18, 21, 20]. Recently, finite-time analysis for convergence to an FOSP of the generic smooth\nconstrained problem in (1) has received a lot of attention. In particular, [31] shows that the sequence\nof iterates generated by the update of Frank-Wolfe converges to an $\\epsilon$-FOSP after $\\mathcal{O}(\\epsilon^{-2})$ iterations.\nThe authors of [26] consider the norm of the gradient mapping as a measure of non-stationarity and show that\nthe projected gradient method has the same complexity of $\\mathcal{O}(\\epsilon^{-2})$. A similar result for the accelerated\nprojected gradient method is shown in [25]. Adaptive cubic regularization methods in [14\u201316]\nimprove these results using second-order information and obtain an $\\epsilon$-FOSP of Problem (1) after at\nmost $\\mathcal{O}(\\epsilon^{-3/2})$ iterations. 
Finite-time analysis for convergence to an SOSP has also been studied\nfor linear constraints. In particular, [5] studies convergence to an SOSP of (1) when the set $\\mathcal{C}$ is a\nlinear constraint of the form $x \\geq 0$ and proposes a trust region interior point method that obtains\nan $(\\epsilon,\\sqrt{\\epsilon})$-SOSP in $\\mathcal{O}(\\epsilon^{-3/2})$ iterations. The work in [27] extends their results to the case that the\nobjective function is potentially not differentiable or not twice differentiable on the boundary of the\nfeasible region. The authors in [17] focus on the general convex constraint case and introduce a trust\nregion algorithm that requires $\\mathcal{O}(\\epsilon^{-3})$ iterations to obtain an SOSP; however, each iteration of their\nproposed method requires access to the exact solution of a nonconvex quadratic program (finding\nits global minimum) which, in general, could be computationally prohibitive. To the best of our\nknowledge, our paper provides the first finite-time overall computational complexity analysis for\nreaching an SOSP of Problem (1).\n\n2 Preliminaries and Definitions\n\nIn the case of unconstrained minimization of the objective function $f$, the first-order and second-order\nnecessary conditions for a point $x^*$ to be a local minimum of $f$ are $\\nabla f(x^*) = 0_d$\nand $\\nabla^2 f(x^*) \\succeq 0_{d \\times d}$, respectively. If a point satisfies these conditions it is called a second-order\nstationary point (SOSP). If the second condition becomes strict, i.e., $\\nabla^2 f(x^*) \\succ 0$, then we recover\nthe sufficient conditions for a local minimum. However, to derive finite time convergence bounds\nfor achieving an SOSP, these conditions should be relaxed. In other words, the goal should be to\nfind an approximate SOSP where the approximation error can be arbitrarily small. 
For the case of\nunconstrained minimization, a point $x^*$ is called an $(\\epsilon,\\gamma)$-second-order stationary point if it satisfies\n$\\|\\nabla f(x^*)\\| \\leq \\epsilon$ and $\\nabla^2 f(x^*) \\succeq -\\gamma I_d$, where $\\epsilon$ and $\\gamma$ are arbitrary positive constants. To study the\nconstrained setting, we first state the necessary conditions for a local minimum of Problem (1).\nProposition 1 ([4]). If $x^* \\in \\mathcal{C}$ is a local minimum of the function $f$ over the convex set $\\mathcal{C}$, then\n\n$$\\nabla f(x^*)^\\top (x - x^*) \\geq 0, \\quad \\text{for all } x \\in \\mathcal{C}, \\tag{2}$$\n$$(x - x^*)^\\top \\nabla^2 f(x^*) (x - x^*) \\geq 0, \\quad \\text{for all } x \\in \\mathcal{C} \\text{ s.t. } \\nabla f(x^*)^\\top (x - x^*) = 0. \\tag{3}$$\n\nThe conditions in (2) and (3) are the first-order and second-order necessary optimality conditions,\nrespectively. By making the inequality in (3) strict, i.e., $(x - x^*)^\\top \\nabla^2 f(x^*) (x - x^*) > 0$, we recover\nthe sufficient conditions for a local minimum when $\\mathcal{C}$ is polyhedral [4]. Further, if the inequality\nin (3) is replaced by $(x - x^*)^\\top \\nabla^2 f(x^*) (x - x^*) \\geq \\delta \\|x - x^*\\|^2$ for some $\\delta > 0$, we obtain the\nsufficient conditions for a local minimum of Problem (1) for any convex constraint $\\mathcal{C}$; see [4]. If a\npoint $x^*$ satisfies the conditions in (2) and (3) it is an SOSP of Problem (1).\nAs in the unconstrained setting, the first-order and second-order optimality conditions may not be\nsatisfied in a finite number of iterations, and we focus on finding an approximate SOSP.\nDefinition 1. Recall the twice continuously differentiable function $f : \\mathbb{R}^d \\to \\mathbb{R}$ and the convex\nclosed set $\\mathcal{C} \\subset \\mathbb{R}^d$ introduced in Problem (1). 
We call $x^* \\in \\mathcal{C}$ an $(\\epsilon,\\gamma)$-second order stationary point\nof Problem (1) if the following conditions are satisfied.\n\n$$\\nabla f(x^*)^\\top (x - x^*) \\geq -\\epsilon, \\quad \\text{for all } x \\in \\mathcal{C}, \\tag{4}$$\n$$(x - x^*)^\\top \\nabla^2 f(x^*) (x - x^*) \\geq -\\gamma, \\quad \\text{for all } x \\in \\mathcal{C} \\text{ s.t. } \\nabla f(x^*)^\\top (x - x^*) = 0. \\tag{5}$$\n\nIf a point only satisfies the first condition, we call it an $\\epsilon$-first order stationary point.\n\nWe further formally define strict saddle points for the constrained optimization problem in (1).\nDefinition 2. A point $x^* \\in \\mathcal{C}$ is a $\\delta$-strict saddle point of Problem (1) if (i) for all $x \\in \\mathcal{C}$ the condition\n$\\nabla f(x^*)^\\top (x - x^*) \\geq 0$ holds, and (ii) there exists a point $y$ such that\n\n$$(y - x^*)^\\top \\nabla^2 f(x^*) (y - x^*) < -\\delta, \\quad y \\in \\mathcal{C} \\text{ and } \\nabla f(x^*)^\\top (y - x^*) = 0. \\tag{6}$$\n\nAccording to Definitions 1 and 2, if all saddle points are $\\delta$-strict and $\\gamma \\leq \\delta$, any $(\\epsilon,\\gamma)$-SOSP of\nProblem (1) is an approximate local minimum.\nWe emphasize that in this paper we do not assume that all saddles are strict to prove convergence to an\nSOSP. We formally defined strict saddles just to clarify that if all the saddles are strict then convergence\nto an approximate SOSP is equivalent to convergence to an approximate local minimum.\nOur goal throughout the rest of the paper is to design an algorithm which finds an $(\\epsilon,\\gamma)$-SOSP of\nProblem (1). To do so, we first assume the following conditions are satisfied.\nAssumption 1. The gradients $\\nabla f$ are $L$-Lipschitz continuous over the set $\\mathcal{C}$, i.e., for any $x, \\tilde{x} \\in \\mathcal{C}$,\n\n$$\\|\\nabla f(x) - \\nabla f(\\tilde{x})\\| \\leq L \\|x - \\tilde{x}\\|. \\tag{7}$$\n\nAssumption 2. The Hessians $\\nabla^2 f$ are $M$-Lipschitz continuous over the set $\\mathcal{C}$, i.e., for any $x, \\tilde{x} \\in \\mathcal{C}$,\n\n$$\\|\\nabla^2 f(x) - \\nabla^2 f(\\tilde{x})\\| \\leq M \\|x - \\tilde{x}\\|. \\tag{8}$$\n\nAssumption 3. The diameter of the compact convex set $\\mathcal{C}$ is upper bounded by a constant $D$, i.e.,\n\n$$\\max_{x, \\tilde{x} \\in \\mathcal{C}} \\{\\|x - \\tilde{x}\\|\\} \\leq D. \\tag{9}$$\n\n3 Main Result\n\nIn this section, we introduce a generic framework to reach an $(\\epsilon,\\gamma)$-SOSP of the non-convex function\n$f$ over the convex set $\\mathcal{C}$, when $\\mathcal{C}$ has a specific structure as we describe below. 
In particular, we focus\non the case when we can solve a quadratic program (QP) of the form\n\n$$\\text{minimize } x^\\top A x + b^\\top x + c \\quad \\text{subject to } x \\in \\mathcal{C}, \\tag{10}$$\n\nup to a constant factor $\\rho \\leq 1$ in a finite number of arithmetic operations. Here, $A \\in \\mathbb{R}^{d \\times d}$ is a symmetric\nmatrix, $b \\in \\mathbb{R}^d$ is a vector, and $c \\in \\mathbb{R}$ is a scalar. To clarify the notion of solving a problem up to a\nconstant factor $\\rho$, consider $x^*$ as a global minimizer of (10). Then, we say Problem (10) is solved up\nto a constant factor $\\rho \\in (0, 1]$ if we have found a feasible solution $\\tilde{x} \\in \\mathcal{C}$ such that\n\n$$x^{*\\top} A x^* + b^\\top x^* + c \\leq \\tilde{x}^\\top A \\tilde{x} + b^\\top \\tilde{x} + c \\leq \\rho (x^{*\\top} A x^* + b^\\top x^* + c). \\tag{11}$$\n\nNote that here w.l.o.g. we have assumed that the optimal objective function value $x^{*\\top} A x^* + b^\\top x^* + c$\nis non-positive. A larger constant $\\rho$ implies that the approximate solution is more accurate. If $\\tilde{x}$ satisfies\nthe condition in (11), we call it a $\\rho$-approximate solution of Problem (10). Indeed, if $\\rho = 1$ then $\\tilde{x}$ is\na global minimizer of Problem (10).\nIn Algorithm 1, we introduce a generic framework that achieves an $(\\epsilon,\\gamma)$-SOSP of Problem (1) whose\nrunning time is polynomial in $\\epsilon^{-1}$, $\\gamma^{-1}$, $\\rho^{-1}$ and $d$, when we can find a $\\rho$-approximate solution of a\nquadratic problem of the form (10) in a time that is polynomial in $d$. The proposed scheme consists\nof two major stages. In the first phase, as mentioned in Steps 2-4, we use a first-order update, i.e.,\na gradient-based update, to find an $\\epsilon$-FOSP, i.e., we update the decision variable $x$ according to a\nfirst-order update until we reach a point $x_t$ that satisfies the condition\n\n$$\\nabla f(x_t)^\\top (x - x_t) \\geq -\\epsilon, \\quad \\text{for all } x \\in \\mathcal{C}. \\tag{12}$$\n\nIn Section 4, we study in detail projected gradient descent and conditional gradient algorithms for the\nfirst order phase of the proposed framework. 
Interestingly, both of these algorithms require at most\n$\\mathcal{O}(\\epsilon^{-2})$ iterations to reach an $\\epsilon$-first order stationary point.\nThe second stage of the proposed scheme uses second-order information of the objective function $f$\nto escape from the stationary point if it is a local maximum or a strict saddle point. To be more\nprecise, if we assume that $x_t$ is a feasible point satisfying the condition (12), we then aim to find a\ndescent direction by solving the following quadratic program\n\n$$\\text{minimize } q(u) := (u - x_t)^\\top \\nabla^2 f(x_t) (u - x_t) \\quad \\text{subject to } u \\in \\mathcal{C}, \\quad \\nabla f(x_t)^\\top (u - x_t) = 0, \\tag{13}$$\n\nup to a constant factor $\\rho$ where $\\rho \\in (0, 1]$. To be more specific, if we define $q(u^*)$ as the optimal\nobjective function value of the program in (13), we focus on the cases that we can obtain a feasible\npoint $u_t$ which is a $\\rho$-approximate solution of Problem (13), i.e., $u_t \\in \\mathcal{C}$ satisfies the constraints\nin (13) and\n\n$$q(u^*) \\leq q(u_t) \\leq \\rho q(u^*). \\tag{14}$$\n\nThe problem formulation in (13) can be transformed into the quadratic program in (10); see Section 5\nfor more details. Note that the constant $\\rho$ is independent of $\\epsilon$, $\\gamma$, and $d$ and only depends on the\nstructure of the convex set $\\mathcal{C}$. For instance, if $\\mathcal{C}$ is defined in terms of $m$ quadratic constraints one can\nfind a $\\rho = m^{-2}$ approximate solution of (13) after at most $\\tilde{\\mathcal{O}}(md^3)$ arithmetic operations (Section 5).\nAfter computing a feasible point $u_t$ satisfying the condition in (14), we check the quadratic objective\nfunction value at the point $u_t$, and if the inequality $q(u_t) < -\\rho\\gamma$ holds, we follow the update\n\n$$x_{t+1} = (1 - \\sigma) x_t + \\sigma u_t, \\tag{15}$$\n\nwhere $\\sigma$ is a positive stepsize. Otherwise, we stop the process and return $x_t$ as an $(\\epsilon,\\gamma)$-second\norder stationary point of Problem (1). 
To check this claim, note that Algorithm 1 stops if we reach a\npoint $x_t$ that satisfies the first-order stationary condition $\\nabla f(x_t)^\\top (x - x_t) \\geq -\\epsilon$, and the objective\nfunction value for the $\\rho$-approximate solution of the quadratic subproblem is not smaller than $-\\rho\\gamma$, i.e.,\n$q(u_t) \\geq -\\rho\\gamma$. The second condition together with the fact that $q(u_t)$ satisfies (14) implies that\n$q(u^*) \\geq -\\gamma$. Therefore, for any $x \\in \\mathcal{C}$ with $\\nabla f(x_t)^\\top (x - x_t) = 0$, it holds that\n\n$$(x - x_t)^\\top \\nabla^2 f(x_t) (x - x_t) \\geq -\\gamma. \\tag{16}$$\n\nAlgorithm 1 Generic framework for escaping saddles in constrained optimization\nRequire: Stepsize $\\sigma > 0$. Initialize $x_0 \\in \\mathcal{C}$\n1: for $t = 1, 2, \\ldots$ do\n2:   if $x_t$ is not an $\\epsilon$-first order stationary point then\n3:     Compute $x_{t+1}$ using first-order information (Frank-Wolfe or projected gradient descent)\n4:   else\n5:     Find $u_t$: a $\\rho$-approximate solution of (13)\n6:     if $q(u_t) < -\\rho\\gamma$ then\n7:       Compute the updated variable $x_{t+1} = (1 - \\sigma) x_t + \\sigma u_t$;\n8:     else\n9:       Return $x_t$ and stop.\n10:     end if\n11:   end if\n12: end for\n\nThese two observations show that the outcome of the proposed framework in Algorithm 1 is an\n$(\\epsilon,\\gamma)$-SOSP of Problem (1). Now it remains to characterize the number of iterations that Algorithm 1\nneeds to perform before reaching an $(\\epsilon,\\gamma)$-SOSP which we formally state in the following theorem.\nTheorem 1. Consider the optimization problem defined in (1). 
Suppose that the conditions in\nAssumptions 1-3 are satisfied. If in the first-order stage, i.e., Steps 2-4, we use the update of\nFrank-Wolfe or projected gradient descent, the generic framework proposed in Algorithm 1 finds an\n$(\\epsilon,\\gamma)$-second-order stationary point of Problem (1) after at most $\\mathcal{O}(\\max\\{\\epsilon^{-2},\\rho^{-3}\\gamma^{-3}\\})$ iterations.\nThe result in Theorem 1 shows that if the convex constraint $\\mathcal{C}$ is such that one can solve the quadratic\nsubproblem in (13) $\\rho$-approximately, then the proposed generic framework finds an $(\\epsilon,\\gamma)$-SOSP\nof Problem (1) after at most $\\mathcal{O}(\\epsilon^{-2})$ first-order and $\\mathcal{O}(\\rho^{-3}\\gamma^{-3})$ second-order updates.\nTo prove the claim in Theorem 1, we first review first-order conditional gradient and projected gradient\nalgorithms and show that if the current iterate is not a first-order stationary point, by following either\nof these updates the objective function value decreases by a constant of $\\mathcal{O}(\\epsilon^2)$ (Section 4). We then\nfocus on the second stage of Algorithm 1 which corresponds to the case that the current iterate is\nan $\\epsilon$-FOSP and we need to solve the quadratic program in (13) approximately (Section 5). In this\ncase, we show that if the iterate is not an $(\\epsilon,\\gamma)$-SOSP, by following the update in (15) the objective\nfunction value decreases at least by a constant of $\\mathcal{O}(\\rho^3\\gamma^3)$. Finally, by combining these two results it\ncan be shown that Algorithm 1 finds an $(\\epsilon,\\gamma)$-SOSP after at most $\\mathcal{O}(\\max\\{\\epsilon^{-2},\\rho^{-3}\\gamma^{-3}\\})$ iterations.\n\n4 First-Order Step: Convergence to a First-Order Stationary Point\n\nIn this section, we study two different first-order methods for the first stage of Algorithm 1. The result\nin this section can also be independently used for convergence to an FOSP of Problem (1) satisfying\n\n$$\\nabla f(x^*)^\\top (x - x^*) \\geq -\\epsilon, \\quad \\text{for all } x \\in \\mathcal{C}, \\tag{17}$$\n\nwhere $\\epsilon > 0$ is a positive constant. 
Although for Algorithm 1 we assume that $\\mathcal{C}$ has a specific\nstructure as mentioned in (10), the results in this section hold for any closed and compact convex\nset $\\mathcal{C}$. To keep our result as general as possible, in this section, we study both conditional gradient and\nprojection-based methods when they are used in the first stage of the proposed generic framework.\n\n4.1 Conditional gradient update\nThe conditional gradient (Frank-Wolfe) update has two steps. We first solve the linear program\n\n$$v_t = \\operatorname{argmax}_{v \\in \\mathcal{C}} \\{-\\nabla f(x_t)^\\top v\\}. \\tag{18}$$\n\nThen, we compute the updated variable $x_{t+1}$ according to the update\n\n$$x_{t+1} = (1 - \\eta) x_t + \\eta v_t, \\tag{19}$$\n\nwhere $\\eta$ is a stepsize. In the following proposition, we show that if the current iterate is not an $\\epsilon$-first\norder stationary point, then by updating the variable according to (18)-(19) the objective function\nvalue decreases. The proof of the following proposition is adapted from [31].\n\nProposition 2. Consider the optimization problem in (1). Suppose Assumptions 1 and 3 hold. Set\nthe stepsize in (19) to $\\eta = \\epsilon/(D^2 L)$. Then, if the iterate $x_t$ at step $t$ is not an $\\epsilon$-first order stationary\npoint, the objective function value at the updated variable $x_{t+1}$ satisfies the inequality\n\n$$f(x_{t+1}) \\leq f(x_t) - \\frac{\\epsilon^2}{2 D^2 L}. \\tag{20}$$\n\nThe result in Proposition 2 shows that by following the update of the conditional gradient method the\nobjective function value decreases by $\\mathcal{O}(\\epsilon^2)$, if an $\\epsilon$-FOSP is not achieved.\nRemark 1. In step 2 of Algorithm 1 we first check if $x_t$ is an $\\epsilon$-FOSP. This can be done by evaluating\n\n$$\\min_{x \\in \\mathcal{C}} \\{\\nabla f(x_t)^\\top (x - x_t)\\} = -\\Big( \\max_{x \\in \\mathcal{C}} \\{-\\nabla f(x_t)^\\top x\\} + \\nabla f(x_t)^\\top x_t \\Big) \\tag{21}$$\n\nand comparing the optimal value with $-\\epsilon$. Note that the linear program in (21) is the same as the one\nin (18). 
Therefore, by checking the first-order optimality condition of $x_t$, the variable $v_t$ is already\ncomputed, and we need to solve only one linear program per iteration.\n\n4.2 Projected gradient update\nThe projected gradient descent (PGD) update consists of two steps: (i) descending through the\ngradient direction and (ii) projecting the updated variable onto the convex constraint set. These two\nsteps can be combined together and the update can be explicitly written as\n\n$$x_{t+1} = \\pi_{\\mathcal{C}}\\{x_t - \\eta \\nabla f(x_t)\\}, \\tag{22}$$\n\nwhere $\\pi_{\\mathcal{C}}(\\cdot)$ is the Euclidean projection onto the convex set $\\mathcal{C}$ and $\\eta$ is a positive stepsize. In the\nfollowing proposition, we first show that by following the update of PGD the objective function value\ndecreases by a constant until we reach an $\\epsilon$-FOSP. Further, we show that the number of required\niterations for PGD to reach an $\\epsilon$-FOSP is $\\mathcal{O}(\\epsilon^{-2})$.\nProposition 3. Consider Problem (1). Suppose Assumptions 1 and 3 are satisfied. Further, assume\nthat the gradients $\\nabla f(x)$ are uniformly bounded by $K$ for all $x \\in \\mathcal{C}$. If the stepsize of the projected\ngradient descent method defined in (22) is set to $\\eta = 1/L$ the objective function value decreases by\n\n$$f(x_{t+1}) \\leq f(x_t) - \\frac{\\epsilon^2 L}{2 (K + LD)^2}. \\tag{23}$$\n\nMoreover, iterates reach a first-order stationary point satisfying (17) after at most $\\mathcal{O}(\\epsilon^{-2})$ iterations.\nProposition 3 shows that by following the update of PGD the function value decreases by $\\mathcal{O}(\\epsilon^2)$\nuntil we reach an $\\epsilon$-FOSP. It further shows PGD obtains an $\\epsilon$-FOSP satisfying (17) after at most\n$\\mathcal{O}(\\epsilon^{-2})$ iterations. To the best of our knowledge, this result is also novel, since the only convergence\nguarantee for PGD in [26] is in terms of number of iterations to reach a point with a gradient mapping\nnorm less than $\\epsilon$, while our result characterizes number of iterations to satisfy (17).\nRemark 2. 
To use the PGD update in the first stage of Algorithm 1 one needs to define a\ncriterion to check if $x_t$ is an $\\epsilon$-FOSP or not. However, in PGD we do not solve the linear\nprogram $\\min_{x \\in \\mathcal{C}} \\{\\nabla f(x_t)^\\top (x - x_t)\\}$. This issue can be resolved by checking the condition\n$\\|x_t - x_{t+1}\\| \\leq \\epsilon/(K + LD)$ which is a sufficient condition for the condition in (17). In other\nwords, if this condition holds we stop and $x_t$ is an $\\epsilon$-FOSP; otherwise, the result in (23) holds and\nthe function value decreases. For more details please check the proof of Proposition 3.\n\n5 Second-Order Step: Escape from Saddle Points\n\nIn this section, we study the second stage of the framework in Algorithm 1 which corresponds to\nthe case that the current iterate is an $\\epsilon$-FOSP. Note that when we reach a critical point the goal is to\nfind a feasible point $u \\in \\mathcal{C}$ in the tangent space $\\nabla f(x_t)^\\top (u - x_t) = 0$ that makes the inner product\n$(u - x_t)^\\top \\nabla^2 f(x_t) (u - x_t)$ smaller than $-\\gamma$. To achieve this goal we need to check the minimum\nvalue of this inner product over the constraints, i.e., we need to solve the quadratic program in (13)\nup to a constant factor $\\rho \\in (0, 1]$. In the following proposition, we show that the updated variable\naccording to (15) decreases the objective function value if the condition $q(u_t) < -\\rho\\gamma$ holds.\n\nProposition 4. Consider the quadratic program in (13). Let $u_t$ be a $\\rho$-approximate solution for\nthe quadratic subproblem in (13). Suppose that Assumptions 2 and 3 hold. Further, set the stepsize\n$\\sigma = \\rho\\gamma/(M D^3)$. If the quadratic objective function value $q$ evaluated at $u_t$ satisfies the condition\n$q(u_t) < -\\rho\\gamma$, then the updated variable according to (15) satisfies the inequality\n\n$$f(x_{t+1}) \\leq f(x_t) - \\frac{\\rho^3 \\gamma^3}{3 M^2 D^6}. \\tag{24}$$\n\nThe only unanswered question is how to solve the quadratic subproblem in (13) up to a constant\nfactor $\\rho \\in (0, 1]$. 
For general $\\mathcal{C}$, the quadratic subproblem could be NP-hard [34]; however, for some\nspecial choices of the convex constraint $\\mathcal{C}$, this quadratic program (QP) can be solved either exactly or\napproximately up to a constant factor. In the following section, we focus on the quadratic constraint\ncase, but indeed there are other classes of constraints that satisfy our required condition.\n\n5.1 Quadratic constraints case\nIn this section, we focus on the case where the constraint set $\\mathcal{C}$ is defined as the intersection of $m$\nellipsoids centered at the origin.$^1$ In particular, assume that the set $\\mathcal{C}$ is given by\n\n$$\\mathcal{C} := \\{x \\in \\mathbb{R}^d \\mid x^\\top Q_i x \\leq 1, \\text{ for all } i = 1, \\ldots, m\\}, \\tag{25}$$\n\nwhere $Q_i \\in \\mathbb{S}^d_+$. Under this assumption, the QP in (13) can be written as\n\n$$\\min_u (u - x_t)^\\top \\nabla^2 f(x_t) (u - x_t) \\quad \\text{s.t. } u^\\top Q_i u \\leq 1 \\text{ for } i = 1, \\ldots, m, \\text{ and } \\nabla f(x_t)^\\top (u - x_t) = 0. \\tag{26}$$\n\nNote that the equality constraint $\\nabla f(x_t)^\\top (u - x_t) = 0$ does not change the hardness of the problem\nand can be easily eliminated. To do so, first define a new optimization variable $z := u - x_t$ to obtain\n\n$$\\min_z z^\\top \\nabla^2 f(x_t) z \\quad \\text{s.t. } (z + x_t)^\\top Q_i (z + x_t) \\leq 1 \\text{ for } i = 1, \\ldots, m, \\text{ and } \\nabla f(x_t)^\\top z = 0. \\tag{27}$$\n\nThen, find a basis for the tangent space $\\nabla f(x_t)^\\top z = 0$. Indeed, using the Gram-Schmidt procedure,\nwe can find an orthonormal basis for the space $\\mathbb{R}^d$ of the form $\\{v_1, \\ldots, v_{d-1}, \\nabla f(x_t)/\\|\\nabla f(x_t)\\|\\}$ at the\ncomplexity of $\\mathcal{O}(d^3)$. If we define $A = [v_1; \\ldots; v_{d-1}] \\in \\mathbb{R}^{d \\times (d-1)}$ as the concatenation of the\nvectors $\\{v_1, \\ldots, v_{d-1}\\}$, then any vector $z$ satisfying $\\nabla f(x_t)^\\top z = 0$ can be written as $z = Ay$\nwhere $y \\in \\mathbb{R}^{d-1}$. Hence, (27) is equivalent to\n\n$$\\min_y y^\\top A^\\top \\nabla^2 f(x_t) A y \\quad \\text{s.t. } (Ay + x_t)^\\top Q_i (Ay + x_t) \\leq 1 \\text{ for } i = 1, \\ldots, m. \\tag{28}$$\n\nThis procedure reduces the dimension of the problem from $d$ to $d - 1$. It is not hard to check that the\ncenter of ellipsoids in (28) is $-A^\\top x_t$. 
By a simple change of variable $A\\hat{y} := Ay + x_t$ we obtain\n\n$$\\min_{\\hat{y}} \\hat{y}^\\top A^\\top \\nabla^2 f(x_t) A \\hat{y} - 2 x_t^\\top \\nabla^2 f(x_t) A \\hat{y} + x_t^\\top \\nabla^2 f(x_t) x_t \\quad \\text{s.t. } \\hat{y}^\\top A^\\top Q_i A \\hat{y} \\leq 1 \\text{ for } i = 1, \\ldots, m. \\tag{29}$$\n\nDefine the matrices $\\tilde{Q}_i := A^\\top Q_i A$ and $B_t := A^\\top \\nabla^2 f(x_t) A$, the vector $s_t := -2 A^\\top \\nabla^2 f(x_t) x_t$,\nand the scalar $c_t := x_t^\\top \\nabla^2 f(x_t) x_t$. Using these definitions the problem reduces to\n\n$$\\min_{\\hat{y}} q(\\hat{y}) := \\hat{y}^\\top B_t \\hat{y} + s_t^\\top \\hat{y} + c_t \\quad \\text{s.t. } \\hat{y}^\\top \\tilde{Q}_i \\hat{y} \\leq 1 \\text{ for } i = 1, \\ldots, m. \\tag{30}$$\n\nNote that the matrices $\\tilde{Q}_i \\in \\mathbb{S}^{d-1}_+$ are positive semidefinite, while the matrix $B_t \\in \\mathbb{S}^{d-1}$ might be\nindefinite. Indeed, the optimal objective function value of the program in (30) is equal to the optimal\nobjective function value of (26). Further, note that if we find a $\\rho$-approximate solution $\\hat{y}^*$ for (30),\nwe can recover a $\\rho$-approximate solution $u^*$ for (26) using the transformation $u^* = A\\hat{y}^*$.\n\n$^1$ To simplify the constant factor approximation $\\rho$ we assume ellipsoids are centered at the origin. If we drop\nthis assumption then $\\rho$ will depend on the maximum distance between the origin and the boundary of each of the\nellipsoids, e.g., see equation (6) in [44].\n\nThe program in (30) is a specific Quadratic Constraint Quadratic Program (QCQP), where all the\nconstraints are centered at 0. For the specific case of $m = 1$, the duality gap of this problem is\nzero and simply by transferring the problem to the dual domain one can solve Problem (30) exactly.\nIn the following proposition, we focus on the general case of $m \\geq 1$ and explain how to find a\n$\\rho$-approximate solution for (30).\nProposition 5. Consider Problem (30) and define $q_{\\min}$ as the minimum objective value of the\nproblem. 
Based on the result in [22], there exists a polynomial time method that obtains a point $\\hat{y}^*$ satisfying\n\n$$q(\\hat{y}^*) \\leq \\frac{1 - \\zeta}{m^2 (1 + \\zeta)^2} q_{\\min} + \\left(1 - \\frac{1 - \\zeta}{m^2 (1 + \\zeta)^2}\\right) x_t^\\top \\nabla^2 f(x_t) x_t \\tag{31}$$\n\nafter at most $\\mathcal{O}(d^3 (m \\log(1/\\delta) + \\log(1/\\zeta) + \\log d))$ arithmetic operations, where $\\delta$ is the ratio of the\nradius of the largest inscribing sphere over that of the smallest circumscribing sphere of the feasible\nset. Further, based on [44], using an SDP relaxation of (30) one can find a point $\\hat{y}^*$ such that\n\n$$q(\\hat{y}^*) \\leq \\frac{1}{m} q_{\\min} + \\left(1 - \\frac{1}{m}\\right) x_t^\\top \\nabla^2 f(x_t) x_t. \\tag{32}$$\n\nProof. If we define the function $\\tilde{q}$ as $\\tilde{q}(x) := q(x) - c_t$, using the approaches in [22] and [44], we\ncan find a $\\rho$-approximate solution for $\\min_{\\hat{y}} \\tilde{q}(\\hat{y})$ subject to $\\hat{y}^\\top \\tilde{Q}_i \\hat{y} \\leq 1$ for $i = 1, \\ldots, m$. In other\nwords, we can find a point $\\hat{y}^*$ such that $\\tilde{q}(\\hat{y}^*) \\leq \\rho \\tilde{q}_{\\min}$ where $0 < \\rho < 1$ and $\\tilde{q}_{\\min}$ is the minimum\nobjective function value of $\\tilde{q}$ over the constraint set which satisfies $\\tilde{q}_{\\min} = q_{\\min} - c_t$. Replacing\n$\\tilde{q}(\\hat{y}^*)$ and $\\tilde{q}_{\\min}$ by their definitions and regrouping the terms implies that $\\hat{y}^*$ satisfies the condition\n$q(\\hat{y}^*) \\leq \\rho q_{\\min} + (1 - \\rho) c_t$. Replacing $\\rho$ by $\\frac{1 - \\zeta}{m^2 (1 + \\zeta)^2}$ (which is the constant factor approximation\nshown in [22]) leads to the claim in (31), and substituting $\\rho$ by $1/m$ (which is the approximation\nbound in [44]) implies the result in (32).\n\nThe result in Proposition 5 indicates that if $x_t^\\top \\nabla^2 f(x_t) x_t$ is non-positive, then one can find a\n$\\rho$-approximate solution for Problem (30) and consequently Problem (26). This condition is satisfied if\nwe assume that $\\max_{x \\in \\mathcal{C}} x^\\top \\nabla^2 f(x) x \\leq 0$. For instance, for a concave minimization problem over\nthe convex set $\\mathcal{C}$ this condition is satisfied. 
In fact, it can be shown that our analysis still stands even if $\\max_{x \\in \\mathcal{C}} x^\\top \\nabla^2 f(x) x$ is at most $\\mathcal{O}(\\gamma)$. Note that this condition is significantly weaker than requiring the function to be concave when restricted to the feasible set. The condition essentially implies that the quadratic term in the Taylor expansion of the function evaluated at the origin should be negative (or not too positive).\n\nCorollary 1. Consider a convex set $\\mathcal{C}$ which is defined as the intersection of $m \\ge 1$ ellipsoids centered at the origin. Further, assume that the objective function Hessian $\\nabla^2 f$ satisfies the condition $\\max_{x \\in \\mathcal{C}} x^\\top \\nabla^2 f(x) x \\le 0$. Then, for $\\rho = 1/m$ and $\\rho = 1/m^2$, it is possible to find a $\\rho$-approximate solution of Problem (13) in time polynomial in $m$ and $d$.\n\nBy using the approach in [22], we can solve the QCQP in (29) with the approximation factor $\\rho \\approx 1/m^2$ for $m \\ge 1$ at the overall complexity of $\\tilde{\\mathcal{O}}(m d^3)$ when the constraint $\\mathcal{C}$ is defined as $m$ convex quadratic constraints. As the total number of calls to the second-order stage is at most $\\mathcal{O}(\\rho^{-3}\\gamma^{-3}) = \\mathcal{O}(m^6 \\gamma^{-3})$, we obtain that the total number of arithmetic operations for the second-order stage is at most $\\tilde{\\mathcal{O}}(m^7 d^3 \\gamma^{-3})$. The constant factor can be improved to $1/m$ if we solve the SDP relaxation problem suggested in [44].\n\n6 Stochastic Extension\n\nIn this section, we focus on stochastic constrained minimization problems. Consider the optimization problem in (1) when the objective function $f$ is defined as an expectation of a set of stochastic functions $F : \\mathbb{R}^d \\times \\mathbb{R}^r \\to \\mathbb{R}$ with inputs $x \\in \\mathbb{R}^d$ and $\\Theta \\in \\mathbb{R}^r$, where $\\Theta$ is a random variable with probability distribution $\\mathcal{P}$. To be more precise, we consider the optimization problem\n\n$$\\text{minimize} \\;\\; f(x) := \\mathbb{E}[F(x, \\Theta)], \\quad \\text{subject to} \\;\\; x \\in \\mathcal{C}. \\qquad (33)$$\n\nOur goal is to find a point which satisfies the necessary optimality conditions with high probability.\n\nAlgorithm 2\nRequire: Stepsize $\\sigma_t > 0$. 
Initialize $x_0 \\in \\mathcal{C}$\n 1: for $t = 1, 2, \\dots$ do\n 2:   Compute $v_t = \\operatorname{argmax}_{v \\in \\mathcal{C}} \\{-d_t^\\top v\\}$\n 3:   if $d_t^\\top (v_t - x_t) < -\\epsilon/2$ then\n 4:     Compute $x_{t+1} = (1 - \\eta) x_t + \\eta v_t$\n 5:   else\n 6:     Find $u_t$: a $\\rho$-approximate solution of $\\min_u \\; q(u) := (u - x_t)^\\top H_t (u - x_t)$ s.t. $u \\in \\mathcal{C}$, $d_t^\\top (u - x_t) \\le r$\n 7:     if $q(u_t) < -\\rho\\gamma/2$ then\n 8:       Compute the updated variable $x_{t+1} = (1 - \\sigma) x_t + \\sigma u_t$\n 9:     else\n10:       Return $x_t$ and stop.\n11:     end if\n12:   end if\n13: end for\n\nConsider the vector $d_t = (1/b_g) \\sum_{i=1}^{b_g} \\nabla F(x_t, \\theta_i)$ and the matrix $H_t = (1/b_H) \\sum_{i=1}^{b_H} \\nabla^2 F(x_t, \\theta_i)$ as stochastic approximations of the gradient $\\nabla f(x_t)$ and Hessian $\\nabla^2 f(x_t)$, respectively. Here $b_g$ and $b_H$ are the gradient and Hessian batch sizes, respectively, and the vectors $\\theta_i$ are the realizations of the random variable $\\Theta$. In Algorithm 2, we present the stochastic variant of our proposed scheme for finding an $(\\epsilon, \\gamma)$-SOSP of Problem (33). Algorithm 2 differs from Algorithm 1 in using the stochastic gradients $d_t$ and stochastic Hessians $H_t$ in lieu of the exact gradients $\\nabla f(x_t)$ and exact Hessians $\\nabla^2 f(x_t)$. The second major difference is the inequality constraint in step 6. Here, instead of using the constraint $d_t^\\top (u - x_t) = 0$, we need to use $d_t^\\top (u - x_t) \\le r$, where $r > 0$ is a properly chosen constant. This modification is needed to ensure that if a point satisfies this constraint with high probability it also satisfies the constraint $\\nabla f(x_t)^\\top (u - x_t) = 0$. This modification implies that we need to handle a linear inequality constraint instead of the linear equality constraint, which is computationally manageable for some constraints, including the case that $\\mathcal{C}$ is a single ball constraint [28]. To prove our main result we assume that the following conditions also hold.\n\nAssumption 4. 
The variances of the stochastic gradients and Hessians are uniformly bounded by constants $\\nu^2$ and $\\xi^2$, respectively, i.e., for any $x \\in \\mathcal{C}$ and $\\theta$ we can write\n\n$$\\mathbb{E}\\left[\\|\\nabla F(x, \\theta) - \\nabla f(x)\\|^2\\right] \\le \\nu^2, \\qquad \\mathbb{E}\\left[\\|\\nabla^2 F(x, \\theta) - \\nabla^2 f(x)\\|^2\\right] \\le \\xi^2. \\qquad (34)$$\n\nThe conditions in Assumption 4 ensure that the variances of the stochastic gradients and Hessians are uniformly bounded above, which is customary in stochastic optimization.\n\nIn the following theorem, we characterize the iteration complexity of Algorithm 2 to reach an $(\\epsilon, \\gamma)$-SOSP of Problem (33) with high probability.\n\nTheorem 2. Consider the optimization problem in (33). Suppose the conditions in Assumptions 1-4 are satisfied. If the batch sizes are $b_g = \\mathcal{O}(\\max\\{\\rho^{-4}\\gamma^{-4}, \\epsilon^{-2}\\})$ and $b_H = \\mathcal{O}(\\rho^{-2}\\gamma^{-2})$ and we set the parameter $r = \\mathcal{O}(\\rho^{2}\\gamma^{2})$, then the outcome of the proposed framework outlined in Algorithm 2 is an $(\\epsilon, \\gamma)$-second-order stationary point of Problem (33) with high probability. Further, the total number of iterations to reach such a point is at most $\\mathcal{O}(\\max\\{\\epsilon^{-2}, \\rho^{-3}\\gamma^{-3}\\})$ with high probability.\n\nThe result in Theorem 2 indicates that the total number of iterations to reach an $(\\epsilon, \\gamma)$-SOSP is at most $\\mathcal{O}(\\max\\{\\epsilon^{-2}, \\rho^{-3}\\gamma^{-3}\\})$. As each iteration requires at most $\\mathcal{O}(\\max\\{\\rho^{-4}\\gamma^{-4}, \\epsilon^{-2}\\})$ stochastic gradient and $\\mathcal{O}(\\rho^{-2}\\gamma^{-2})$ stochastic Hessian evaluations, the total number of stochastic gradient and Hessian computations to reach an $(\\epsilon, \\gamma)$-SOSP is of $\\mathcal{O}(\\max\\{\\epsilon^{-2}\\rho^{-4}\\gamma^{-4}, \\epsilon^{-4}, \\rho^{-7}\\gamma^{-7}\\})$ and $\\mathcal{O}(\\max\\{\\epsilon^{-2}\\rho^{-3}\\gamma^{-3}, \\rho^{-5}\\gamma^{-5}\\})$, respectively.\n\nAcknowledgment\nThis work was supported by DARPA Lagrange and ONR BRC Program. The authors would like to thank Yue Sun for pointing out a missing condition in the first draft of the paper.\n\nReferences\n\n[1] N. Agarwal, Z. Allen Zhu, B. Bullins, E. Hazan, and T. Ma. Finding approximate local minima faster than gradient descent. In STOC, pages 1195\u20131199, 2017.\n\n[2] Z. 
Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. CoRR, abs/1708.08694, 2017.\n\n[3] Z. Allen Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In ICML, pages 699\u2013707, 2016.\n\n[4] D. P. Bertsekas. Nonlinear programming. Athena Scientific, Belmont, 1999.\n\n[5] W. Bian, X. Chen, and Y. Ye. Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Math. Program., 149(1-2):301\u2013327, 2015.\n\n[6] J. V. Burke, J. J. Mor\u00e9, and G. Toraldo. Convergence properties of trust region methods for linear and convex constraints. Math. Program., 47(1-3):305\u2013336, 1990.\n\n[7] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Accelerated methods for non-convex optimization. CoRR, abs/1611.00756, 2016.\n\n[8] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. \u201cConvex until proven guilty\u201d: Dimension-free acceleration of gradient descent on non-convex functions. In ICML, pages 654\u2013663, 2017.\n\n[9] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. arXiv preprint arXiv:1710.11606, 2017.\n\n[10] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points II: First-order methods. arXiv preprint arXiv:1711.00841, 2017.\n\n[11] C. Cartis, N. Gould, and P. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program., 127(2):245\u2013295, 2011.\n\n[12] C. Cartis, N. Gould, and P. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Math. Program., 130(2):295\u2013319, 2011.\n\n[13] C. Cartis, N. Gould, and P. Toint. Complexity bounds for second-order optimality in unconstrained optimization. J. Complexity, 28(1):93\u2013108, 2012.\n\n[14] C. Cartis, N. Gould, and P. Toint. 
An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its function-evaluation complexity. IMA Journal of Numerical Analysis, 32(4):1662\u20131695, 2012.\n\n[15] C. Cartis, N. Gould, and P. Toint. On the evaluation complexity of cubic regularization methods for potentially rank-deficient nonlinear least-squares problems and its relevance to constrained nonlinear optimization. SIAM J. Opt., 23(3):1553\u20131574, 2013.\n\n[16] C. Cartis, N. Gould, and P. Toint. On the evaluation complexity of constrained nonlinear least-squares and general constrained nonlinear optimization using second-order methods. SIAM Journal on Numerical Analysis, 53(2):836\u2013851, 2015.\n\n[17] C. Cartis, N. Gould, and P. Toint. Second-order optimality and beyond: Characterization and evaluation complexity in convexly constrained nonlinear optimization. Foundations of Computational Mathematics, pages 1\u201335, 2017.\n\n[18] A. R. Conn, N. Gould, A. Sartenaer, and P. Toint. Global convergence of a class of trust region algorithms for optimization using inexact projections on convex constraints. SIAM J. on Opt., 3(1):164\u2013221, 1993.\n\n[19] F. E. Curtis, D. P. Robinson, and M. Samadi. A trust region algorithm with a worst-case iteration complexity of $\\mathcal{O}(\\epsilon^{-3/2})$ for nonconvex optimization. Math. Program., 162(1-2):1\u201332, 2017.\n\n[20] G. Di Pillo, S. Lucidi, and L. Palagi. Convergence to second-order stationary points of a primal-dual algorithm model for nonlinear programming. Mathematics of Operations Research, 30(4):897\u2013915, 2005.\n\n[21] F. Facchinei and S. Lucidi. Convergence to second order stationary points in inequality constrained optimization. Mathematics of Operations Research, 23(3):746\u2013766, 1998.\n\n[22] M. Fu, Z.-Q. Luo, and Y. Ye. Approximation algorithms for quadratic programming. Journal of Combinatorial Optimization, 2(1):29\u201350, 1998.\n\n[23] R. Ge, F. 
Huang, C. Jin, and Y. Yuan. Escaping from saddle points - online stochastic gradient for tensor decomposition. In COLT, pages 797\u2013842, 2015.\n\n[24] R. Ge, J. Lee, and T. Ma. Matrix completion has no spurious local minimum. In NIPS, pages 2973\u20132981, 2016.\n\n[25] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program., 156(1-2):59\u201399, 2016.\n\n[26] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program., 155(1-2):267\u2013305, 2016.\n\n[27] G. Haeser, H. Liu, and Y. Ye. Optimality condition and complexity analysis for linearly-constrained optimization without differentiability on the boundary. Math. Program., pages 1\u201337, 2017.\n\n[28] V. Jeyakumar and G. Li. Trust-region problems with linear inequality constraints: exact SDP relaxation, global optimality and robust optimization. Mathematical Programming, 147(1-2):171\u2013206, 2014.\n\n[29] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In ICML, pages 1724\u20131732, 2017.\n\n[30] C. Jin, P. Netrapalli, and M. I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. CoRR, abs/1711.10456, 2017.\n\n[31] S. Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.\n\n[32] L. Lei, C. Ju, J. Chen, and M. I. Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems 30, pages 2345\u20132355, 2017.\n\n[33] J. M. Mart\u00ednez and M. Raydan. Cubic-regularization counterpart of a variable-norm trust-region method for unconstrained minimization. J. Global Optimization, 68(2):367\u2013385, 2017.\n\n[34] K. G. Murty and S. N. Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Math. 
Program., 39(2):117\u2013129, 1987.\n\n[35] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.\n\n[36] Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Math. Program., 108(1):177\u2013205, 2006.\n\n[37] S. Paternain, A. Mokhtari, and A. Ribeiro. A second order method for nonconvex optimization. arXiv preprint arXiv:1707.08028, 2017.\n\n[38] S. J. Reddi, A. Hefny, S. Sra, B. P\u00f3czos, and A. J. Smola. Stochastic variance reduction for nonconvex optimization. In ICML, pages 314\u2013323, 2016.\n\n[39] S. J. Reddi, S. Sra, B. P\u00f3czos, and A. J. Smola. Fast incremental method for smooth nonconvex optimization. In IEEE Conference on Decision and Control, CDC, pages 1971\u20131977, 2016.\n\n[40] S. J. Reddi, M. Zaheer, S. Sra, B. P\u00f3czos, F. Bach, R. Salakhutdinov, and A. J. Smola. A generic approach for escaping saddle points. In AISTATS, pages 1233\u20131242, 2018.\n\n[41] C. W. Royer and S. J. Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. arXiv preprint arXiv:1706.03131, 2017.\n\n[42] J. Sun, Q. Qu, and J. Wright. A geometric analysis of phase retrieval. In IEEE International Symposium on Information Theory, ISIT 2016, pages 2379\u20132383, 2016.\n\n[43] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere I: overview and the geometric picture. IEEE Trans. Information Theory, 63(2):853\u2013884, 2017.\n\n[44] P. Tseng. Further results on approximating nonconvex quadratic optimization by semidefinite programming relaxation. SIAM Journal on Optimization, 14(1):268\u2013283, 2003.\n\n[45] Y. Xu and T. Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. arXiv preprint arXiv:1711.01944, 2017.\n\n[46] Y. Ye. On affine scaling algorithms for nonconvex quadratic programming. Math. 
Program., 56(1-3):285\u2013300, 1992.", "award": [], "sourceid": 1830, "authors": [{"given_name": "Aryan", "family_name": "Mokhtari", "institution": "MIT"}, {"given_name": "Asuman", "family_name": "Ozdaglar", "institution": "Massachusetts Institute of Technology"}, {"given_name": "Ali", "family_name": "Jadbabaie", "institution": "MIT"}]}