{"title": "Smoothed analysis of the low-rank approach for smooth semidefinite programs", "book": "Advances in Neural Information Processing Systems", "page_first": 2281, "page_last": 2290, "abstract": "We consider semidefinite programs (SDPs) of size $n$ with equality constraints. In order to overcome scalability issues, Burer and Monteiro proposed a factorized approach based on optimizing over a matrix $Y$ of size $n\\times k$ such that $X=YY^*$ is the SDP variable. The advantages of such formulation are twofold: the dimension of the optimization variable is reduced, and positive semidefiniteness is naturally enforced. However, optimization in $Y$ is non-convex. In prior work, it has been shown that, when the constraints on the factorized variable regularly define a smooth manifold, provided $k$ is large enough, for almost all cost matrices, all second-order stationary points (SOSPs) are optimal. Importantly, in practice, one can only compute points which approximately satisfy necessary optimality conditions, leading to the question: are such points also approximately optimal? To this end, and under similar assumptions, we use smoothed analysis to show that approximate SOSPs for a randomly perturbed objective function are approximate global optima, with $k$ scaling like the square root of the number of constraints (up to log factors). We particularize our results to an SDP relaxation of phase retrieval.", "full_text": "Smoothed analysis of the low-rank approach\n\nfor smooth semide\ufb01nite programs\n\nThomas Pumir\u2217\nORFE Department\nPrinceton University\n\nSamy Jelassi\u2217\n\nORFE Department\nPrinceton University\n\ntpumir@princeton.edu\n\nsjelassi@princeton.edu\n\nNicolas Boumal\n\nDepartment of Mathematics\n\nPrinceton University\n\nnboumal@math.princeton.edu\n\nAbstract\n\nWe consider semide\ufb01nite programs (SDPs) of size n with equality constraints. 
In order to overcome scalability issues, Burer and Monteiro proposed a factorized approach based on optimizing over a matrix $Y$ of size $n\times k$ such that $X=YY^*$ is the SDP variable. The advantages of such formulation are twofold: the dimension of the optimization variable is reduced, and positive semidefiniteness is naturally enforced. However, optimization in $Y$ is non-convex. In prior work, it has been shown that, when the constraints on the factorized variable regularly define a smooth manifold, provided $k$ is large enough, for almost all cost matrices, all second-order stationary points (SOSPs) are optimal. Importantly, in practice, one can only compute points which approximately satisfy necessary optimality conditions, leading to the question: are such points also approximately optimal? To answer it, under similar assumptions, we use smoothed analysis to show that approximate SOSPs for a randomly perturbed objective function are approximate global optima, with $k$ scaling like the square root of the number of constraints (up to log factors). Moreover, we bound the optimality gap at the approximate solution of the perturbed problem with respect to the original problem. We particularize our results to an SDP relaxation of phase retrieval.

1 Introduction

We consider semidefinite programs (SDP) over $K = \mathbb{R}$ or $\mathbb{C}$ of the form:

$$\min_{X \in S^{n\times n}} \langle C, X \rangle \quad \text{subject to } \mathcal{A}(X) = b, \; X \succeq 0, \tag{SDP}$$

with $\langle A, B \rangle = \mathrm{Re}[\mathrm{Tr}(A^*B)]$ the Frobenius inner product ($A^*$ is the conjugate-transpose of $A$), $S^{n\times n}$ the set of self-adjoint matrices of size $n$ (real symmetric for $\mathbb{R}$, or Hermitian for $\mathbb{C}$), $C \in S^{n\times n}$ the cost matrix, and $\mathcal{A} : S^{n\times n} \to \mathbb{R}^m$ a linear operator capturing $m$ equality constraints with right-hand side $b \in \mathbb{R}^m$: for each $i$, $\mathcal{A}(X)_i = \langle A_i, X \rangle = b_i$ for given matrices $A_1, \dots, A_m \in S^{n\times n}$.
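As a concrete sketch of this setup (all names here, such as `constraint_map`, are our own illustrative choices, not from the paper), the linear operator $\mathcal{A}$, its adjoint $\mathcal{A}^*(\mu) = \sum_i \mu_i A_i$, and the adjoint identity $\langle \mathcal{A}(X), \mu \rangle = \langle X, \mathcal{A}^*(\mu) \rangle$ can be coded as:

```python
import numpy as np

# Illustrative sketch (not from the paper): encode the m equality constraints
# A(X)_i = <A_i, X> = b_i as a linear map on self-adjoint matrices (K = C here).
rng = np.random.default_rng(0)
n, m = 5, 3

# Random Hermitian constraint matrices A_1, ..., A_m.
A_mats = []
for _ in range(m):
    B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    A_mats.append((B + B.conj().T) / 2)

def inner(A, B):
    # Frobenius inner product <A, B> = Re[Tr(A* B)].
    return np.real(np.trace(A.conj().T @ B))

def constraint_map(X):
    # A(X) in R^m: stacks <A_i, X>.
    return np.array([inner(Ai, X) for Ai in A_mats])

def adjoint_map(mu):
    # A*(mu) = sum_i mu_i A_i, a Hermitian matrix.
    return sum(mu_i * Ai for mu_i, Ai in zip(mu, A_mats))

# A feasible-looking PSD matrix X = YY* and its constraint values b = A(X).
Y = rng.standard_normal((n, 2)) + 1j * rng.standard_normal((n, 2))
X = Y @ Y.conj().T
b = constraint_map(X)

# Adjoint identity: <A(X), mu> = <X, A*(mu)> for any mu in R^m.
mu = rng.standard_normal(m)
assert np.isclose(constraint_map(X) @ mu, inner(X, adjoint_map(mu)))
```

The same two maps reappear throughout the paper, e.g. in the projector of Lemma 2.2 and the matrix $S$ of Section 2.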
The optimization variable $X$ is positive semidefinite. We let $\mathcal{C}$ be the feasible set of (SDP):

$$\mathcal{C} = \left\{ X \in S^{n\times n} : \mathcal{A}(X) = b \text{ and } X \succeq 0 \right\}. \tag{1}$$

*Equal contribution

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Large-scale SDPs have been proposed for machine learning applications including matrix completion [Candès and Recht, 2009], community detection [Abbé, 2018] and kernel learning [Lanckriet et al., 2004] for $K = \mathbb{R}$, and in angular synchronization [Singer, 2011] and phase retrieval [Waldspurger et al., 2015] for $K = \mathbb{C}$. Unfortunately, traditional methods to solve (SDP) do not scale (due to memory and computational requirements), hence the need for alternatives.

In order to address such scalability issues, Burer and Monteiro [2003, 2005] restrict the search to the set of matrices of rank at most $k$ by factorizing $X$ as $X = YY^*$, with $Y \in K^{n\times k}$. It has been shown that if the search space $\mathcal{C}$ (1) is compact, then (SDP) admits a global optimum of rank at most $r$, where $\dim S^{r\times r} \le m$ [Barvinok, 1995, Pataki, 1998], with $\dim S^{r\times r} = r(r+1)/2$ for $K = \mathbb{R}$ and $\dim S^{r\times r} = r^2$ for $K = \mathbb{C}$. In other words, restricting $\mathcal{C}$ to the space of matrices with rank at most $k$ with $\dim S^{k\times k} > m$ does not change the optimal value. This factorization leads to a quadratically constrained quadratic program:

$$\min_{Y \in K^{n\times k}} \langle C, YY^* \rangle \quad \text{subject to } \mathcal{A}(YY^*) = b. \tag{P}$$

Although (P) is in general non-convex because its feasible set

$$\mathcal{M} = \mathcal{M}_k = \left\{ Y \in K^{n\times k} : \mathcal{A}(YY^*) = b \right\} \tag{2}$$

is non-convex, considering (P) instead of the original SDP presents significant advantages: the number of variables is reduced from $O(n^2)$ to $O(nk)$, and the positive semidefiniteness of the matrix is naturally enforced.
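A minimal numpy illustration (our own sketch, with illustrative names) of these two advantages: the cost $\langle C, YY^* \rangle = \langle CY, Y \rangle$ can be evaluated without ever forming $X$, and $X = YY^*$ is positive semidefinite by construction:

```python
import numpy as np

# Sketch of the Burer-Monteiro substitution X = YY^T (real case, K = R).
# Storage drops from O(n^2) to O(nk), and PSD-ness holds automatically.
rng = np.random.default_rng(1)
n, k = 200, 10

B = rng.standard_normal((n, n))
C = (B + B.T) / 2                        # symmetric cost matrix
Y = rng.standard_normal((n, k))          # factorized variable
X = Y @ Y.T                              # lifted SDP variable

# Same objective value, with or without forming X:
cost_lifted = np.trace(C.T @ X)          # <C, X>
cost_factored = np.trace((C @ Y).T @ Y)  # <CY, Y>
assert np.isclose(cost_lifted, cost_factored)

# Positive semidefiniteness is enforced by construction.
assert np.linalg.eigvalsh(X).min() >= -1e-8

# Variable count: n*k entries instead of n*(n+1)/2.
num_factored, num_lifted = Y.size, n * (n + 1) // 2
assert num_factored < num_lifted
```

The catch, as the text explains next, is that the feasible set in $Y$ is non-convex.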
Solving (P) using local optimization methods is known as the Burer\u2013Monteiro\nmethod and yields good results in practice: Kulis et al. [2007] underlined the practical success of\nsuch low-rank approaches in particular for maximum variance unfolding and for k-means clustering\n(see also [Carson et al., 2017]). Their approach is signi\ufb01cantly faster and more scalable. However,\nthe non-convexity of (P) means further analysis is needed to determine whether it can be solved to\nglobal optimality reliably.\nFor K = R, in the case where M is a compact, smooth manifold (see Assumption 1 below for a\nprecise condition), it has been shown recently that, up to a zero-measure set of cost matrices, second-\norder stationary points (SOSPs) of (P) are globally optimal provided dim Sk\u00d7k > m [Boumal et al.,\n2016, 2018b]. Algorithms such as the Riemannian trust-regions method (RTR) converge globally to\nSOSPs, but unfortunately they can only guarantee approximate satisfaction of second-order optimality\nconditions in a \ufb01nite number of iterations [Boumal et al., 2018a].\nThe aforementioned papers close with a question, crucial in practice: when is it the case that\napproximate SOSPs, which we now call ASOSPs, are approximately optimal? Building on recent\nproof techniques by Bhojanapalli et al. [2018], we provide some answers here.\n\nContributions\n\nThis paper formulates approximate global optimality conditions holding for (P) and, consequently,\nfor (SDP). Our results rely on the following core assumption as set in [Boumal et al., 2016].\nAssumption 1 (Smooth manifold). For all values of k up to n such that Mk is non-empty, the\nconstraints on (P) de\ufb01ned by A1, . . . , Am \u2208 Sn\u00d7n and b \u2208 Rm satisfy at least one of the following:\n\n1. {A1Y, . . . , AmY } are linearly independent in Kn\u00d7k for all Y \u2208 Mk; or\n2. {A1Y, . . . 
, A_mY\}$ span a subspace of constant dimension in $K^{n\times k}$ for all $Y$ in an open neighborhood of $\mathcal{M}_k$ in $K^{n\times k}$.

In [Boumal et al., 2018b], it is shown that (a) if the assumption above is verified for $k = n$, then it automatically holds for all values of $k \le n$ such that $\mathcal{M}_k$ is non-empty; and (b) for those values of $k$, the dimension of the subspace spanned by $\{A_1Y, \dots, A_mY\}$ is independent of $k$: we call it $m'$.

When Assumption 1 holds, we refer to problems of the form (SDP) as smooth SDPs because $\mathcal{M}$ is then a smooth manifold. Examples of smooth SDPs for $K = \mathbb{R}$ are given in [Boumal et al., 2018b]. For $K = \mathbb{C}$, we detail an example in Section 4. Our main theorem is a smoothed analysis result (cf. Theorem 3.1 for a more formal statement). An ASOSP is an approximate SOSP (a precise definition follows).

Theorem 1.1 (Informal). Let Assumption 1 hold and assume $\mathcal{C}$ is compact. Randomly perturb the cost matrix $C$. With high probability, if $k = \tilde{\Omega}(\sqrt{m})$, any ASOSP $Y \in K^{n\times k}$ for (P) is an approximate global optimum, and $X = YY^*$ is an approximate global optimum for (SDP) (with the perturbed $C$).

The high probability proviso is with respect to the perturbation only: if the perturbation is "good", then all ASOSPs are as described in the statement. If $\mathcal{C}$ is compact, then so is $\mathcal{M}$ and known algorithms for optimization on manifolds produce an ASOSP in finite time (with explicit bounds). Theorem 1.1 ensures that, for $k$ large enough and for any cost matrix $C$, with high probability upon a random perturbation of $C$, such algorithms produce an approximate global optimum of (P).

Theorem 1.1 is a corollary of two intermediate arguments, developed in Lemmas 3.1 and 3.2:

1. 
Probabilistic argument (Lemma 3.1): By perturbing the cost matrix in the objective function of (P) with a Gaussian Wigner matrix, with high probability, any approximate first-order stationary point $Y$ of the perturbed problem (P) is almost column-rank deficient.

2. Deterministic argument (Lemma 3.2): If an approximate second-order stationary point $Y$ for (P) is also almost column-rank deficient, then it is an approximate global optimum and $X = YY^*$ is an approximate global optimum for (SDP).

The first argument is motivated by smoothed analysis [Spielman and Teng, 2004] and draws heavily on a recent paper by Bhojanapalli et al. [2018]. The latter work introduces smoothed analysis to analyze the performance of the Burer–Monteiro factorization, but it analyzes a quadratically penalized version of the SDP: its solutions do not satisfy constraints exactly. This affords more generality, but, for the special class of smooth SDPs, the present work has the advantage of analyzing an exact formulation. The second argument is a smoothed extension of well-known on-off results [Burer and Monteiro, 2003, 2005, Journée et al., 2010]. Implications of this theorem for a particular SDP are derived in Section 4, with applications to phase retrieval and angular synchronization.

Thus, for smooth SDPs, our results improve upon [Bhojanapalli et al., 2018] in that we address exact-feasibility formulations of the SDP. Our results also improve upon [Boumal et al., 2016] by providing approximate optimality results for approximate second-order points with relaxation rank $k$ scaling only as $\tilde{\Omega}(\sqrt{m})$, whereas the latter reference establishes such results only for $k = n + 1$. Finally, we aim for more generality by covering both real and complex SDPs, and we illustrate the relevance of complex SDPs in Section 4.

Related work

A number of recent works focus on large-scale SDP solvers. 
Among the direct approaches (which proceed in the convex domain directly), Hazan [2008] introduced a Frank–Wolfe type method for a restricted class of SDPs. Here, the key is that each iteration increases the rank of the solution only by one, so that if only a few iterations are required to reach satisfactory accuracy, then only low dimensional objects need to be manipulated. This line of work was later improved by Laue [2012], Garber [2016] and Garber and Hazan [2016] through hybrid methods. Still, if high accuracy solutions are desired, a large number of iterations will be required, eventually leading to large-rank iterates. In order to overcome this issue, Yurtsever et al. [2017] recently proposed to combine conditional gradient and sketching techniques in order to maintain a low-rank representation of the iterates.

Among the low-rank approaches, our work is closest to (and indeed largely builds upon) recent results of Bhojanapalli et al. [2018]. For the real case, they consider a penalized version of problem (SDP) (which we here refer to as (P-SDP)) and its related penalized Burer–Monteiro formulation, here called (P-P). With high probability upon random perturbation of the cost matrix, they show approximate global optimality of ASOSPs for (P-P), assuming $k$ grows with $\sqrt{m}$ and either the SDP is compact or its cost matrix is positive definite. Given that there is a zero-measure set of SDPs where SOSPs may be suboptimal, there can be a small-measure set of SDPs where ASOSPs are not approximately optimal [Bhojanapalli et al., 2018]. In this context, the authors resort to smoothed analysis, in the same way that we do here. One drawback of that work is that the final result does not hold for the original SDP, but for a non-equivalent penalized version of it. 
This is one of the points we improve here, by focusing on smooth SDPs as defined in [Boumal et al., 2016].

Notation

We use $K$ to refer to $\mathbb{R}$ or $\mathbb{C}$ when results hold for both fields. For matrices $A, B$ of the same size, we use the inner product $\langle A, B \rangle = \mathrm{Re}[\mathrm{Tr}(A^*B)]$, which reduces to $\langle A, B \rangle = \mathrm{Tr}(A^TB)$ in the real case. The associated Frobenius norm is defined as $\|A\| = \sqrt{\langle A, A \rangle}$. For a linear map $f$ between matrix spaces, this yields a subordinate operator norm as $\|f\|_{\mathrm{op}} = \sup_{A \neq 0} \|f(A)\| / \|A\|$. The set of self-adjoint matrices of size $n$ over $K$, $S^{n\times n}$, is the set of symmetric matrices for $K = \mathbb{R}$ or the set of Hermitian matrices for $K = \mathbb{C}$. We also write $H^{n\times n}$ to denote $S^{n\times n}$ for $K = \mathbb{C}$. A self-adjoint matrix $X$ is positive semidefinite ($X \succeq 0$) if and only if $u^*Xu \ge 0$ for all $u \in K^n$. Furthermore, $\mathcal{I}$ is the identity operator and $I_n$ is the identity matrix of size $n$. The integer $m'$ is defined after Assumption 1.

2 Geometric framework and near-optimality conditions

In this section, we present properties of the smooth geometry of (P) and approximate global optimality conditions for this problem. In covering these preliminaries, we largely parallel developments in [Boumal et al., 2016]. As argued in that reference, Assumption 1 implies that the search space $\mathcal{M}$ of (P) is a submanifold in $K^{n\times k}$ of codimension $m'$. We can associate tangent spaces to a submanifold. Intuitively, the tangent space $T_Y\mathcal{M}$ to the submanifold $\mathcal{M}$ at a point $Y \in \mathcal{M}$ is a subspace that best approximates $\mathcal{M}$ around $Y$, when the subspace origin is translated to $Y$. It is obtained by linearizing the equality constraints.

Lemma 2.1 (Boumal et al. [2018b, Lemma 2.1]). 
Under Assumption 1, the tangent space at $Y$ to $\mathcal{M}$ (2), denoted by $T_Y\mathcal{M}$, is:

$$T_Y\mathcal{M} = \left\{ \dot{Y} \in K^{n\times k} : \mathcal{A}(\dot{Y}Y^* + Y\dot{Y}^*) = 0 \right\} = \left\{ \dot{Y} \in K^{n\times k} : \langle A_iY, \dot{Y} \rangle = 0 \text{ for } i = 1, \dots, m \right\}. \tag{3}$$

By equipping each tangent space with a restriction of the inner product $\langle \cdot, \cdot \rangle$, we turn $\mathcal{M}$ into a Riemannian submanifold of $K^{n\times k}$. We also introduce the orthogonal projector $\mathrm{Proj}_Y : K^{n\times k} \to T_Y\mathcal{M}$ which, given a matrix $Z \in K^{n\times k}$, projects it to the tangent space $T_Y\mathcal{M}$:

$$\mathrm{Proj}_Y Z := \operatorname{argmin}_{\dot{Y} \in T_Y\mathcal{M}} \|\dot{Y} - Z\|. \tag{4}$$

This projector will be useful to phrase optimality conditions. It is characterized as follows.

Lemma 2.2 (Boumal et al. [2018b, Lemma 2.2]). Under Assumption 1, the orthogonal projector admits the closed form

$$\mathrm{Proj}_Y Z = Z - \mathcal{A}^*\!\left( G^\dagger \mathcal{A}(ZY^*) \right) Y,$$

where $\mathcal{A}^* : \mathbb{R}^m \to S^{n\times n}$ is the adjoint of $\mathcal{A}$, $G$ is a Gram matrix defined by $G_{ij} = \langle A_iY, A_jY \rangle$ (it is a function of $Y$), and $G^\dagger$ denotes the Moore–Penrose pseudo-inverse of $G$ (differentiable in $Y$).

(See a proof in Appendix A.) To properly state the approximate first- and second-order necessary optimality conditions for (P), we further need the notions of Riemannian gradient and Riemannian Hessian on the manifold $\mathcal{M}$. We recall that (P) minimizes the function $g$, defined by

$$g(Y) = \langle CY, Y \rangle, \tag{5}$$

on the manifold $\mathcal{M}$. The Riemannian gradient of $g$ at $Y$, $\mathrm{grad}\, g(Y)$, is the unique tangent vector at $Y$ such that, for all tangent $\dot{Y}$, $\langle \mathrm{grad}\, g(Y), \dot{Y} \rangle = \langle \nabla g(Y), \dot{Y} \rangle$, with $\nabla g(Y) = 2CY$ the Euclidean (classical) gradient of $g$ evaluated at $Y$. 
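To make the closed-form projector of Lemma 2.2 concrete, here is a small numpy sketch (our own, with illustrative names) for one smooth SDP used later in Section 4: the constraints $\mathrm{diag}(YY^*) = \mathbf{1}$, for which $A_i = e_ie_i^*$ and $G(Y) = I_m$ on the manifold, so the projector simply removes the radial component of each row of $Z$:

```python
import numpy as np

# Sketch of Lemma 2.2's projector for the constraints diag(YY*) = 1
# (rows of Y on unit spheres). With A_i = e_i e_i^*, G(Y) = I_m on the
# manifold and Proj_Y Z = Z - A*(A(ZY*)) Y acts row by row.
rng = np.random.default_rng(2)
n, k = 6, 3

Y = rng.standard_normal((n, k)) + 1j * rng.standard_normal((n, k))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)    # feasible: diag(YY*) = 1

def proj(Y, Z):
    # A(ZY*)_i = Re[(ZY*)_ii]; A*(mu) Y scales row i of Y by mu_i.
    mu = np.real(np.sum(Z * Y.conj(), axis=1))   # diagonal of ZY*
    return Z - mu[:, None] * Y

B = rng.standard_normal((n, n))
C = (B + B.T) / 2                                # Hermitian (real symmetric) cost
egrad = 2 * C @ Y                                # Euclidean gradient of g
rgrad = proj(Y, egrad)                           # Riemannian gradient, as in (6)

# Tangency check from (3): <A_i Y, Ydot> = Re<y_i, ydot_i> = 0 for each row.
assert np.allclose(np.real(np.sum(rgrad * Y.conj(), axis=1)), 0)

# An orthogonal projector must be idempotent.
assert np.allclose(proj(Y, rgrad), rgrad)
```

The general case replaces the row-wise scaling by $\mathcal{A}^*(G^\dagger \mathcal{A}(ZY^*))Y$ with a genuine Gram pseudo-inverse.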
Intuitively, $\mathrm{grad}\, g(Y)$ is the tangent vector at $Y$ that points in the steepest ascent direction for $g$ as seen from the manifold's perspective. A classical result states that, for Riemannian submanifolds, the Riemannian gradient is given by the projection of the classical gradient to the tangent space [Absil et al., 2008, eq. (3.37)]:

$$\mathrm{grad}\, g(Y) = \mathrm{Proj}_Y(\nabla g(Y)) = 2\left( C - \mathcal{A}^*\!\left( G^\dagger \mathcal{A}(CYY^*) \right) \right) Y. \tag{6}$$

This leads us to define the matrix $S \in S^{n\times n}$ which plays a key role to guarantee approximate global optimality for problem (P), as discussed in Section 3:

$$S = S(Y) = C - \mathcal{A}^*(\mu) = C - \sum_{i=1}^{m} \mu_i A_i, \tag{7}$$

where $\mu = \mu(Y) = G^\dagger \mathcal{A}(CYY^*)$. We can write the Riemannian gradient of $g$ evaluated at $Y$ as

$$\mathrm{grad}\, g(Y) = 2SY. \tag{8}$$

The Riemannian gradient enables us to define an approximate first-order necessary optimality condition below. To define the approximate second-order necessary optimality condition, we need to introduce the notion of Riemannian Hessian. The Riemannian Hessian of $g$ at $Y$ is a self-adjoint operator on the tangent space at $Y$ obtained as the projection of the derivative of the Riemannian gradient vector field [Absil et al., 2008, eq. (5.15)]. Boumal et al. [2018b] give a closed form expression for the Riemannian Hessian of $g$ at $Y$:

$$\mathrm{Hess}\, g(Y)[\dot{Y}] = 2 \cdot \mathrm{Proj}_Y(S\dot{Y}). \tag{9}$$

We can now formally define the approximate necessary optimality conditions for problem (P).

Definition 2.1 ($\varepsilon_g$-FOSP). $Y \in \mathcal{M}$ is an $\varepsilon_g$–first-order stationary point for (P) if the norm of the Riemannian gradient of $g$ at $Y$ almost vanishes, specifically,

$$\|\mathrm{grad}\, g(Y)\| = \|2SY\| \le \varepsilon_g,$$

where $S$ is defined as in equation (7).

Definition 2.2 ($(\varepsilon_g, \varepsilon_H)$-SOSP). $Y \in \mathcal{M}$ is an $(\varepsilon_g, \varepsilon_H)$–second-order stationary point for (P) if it is an $\varepsilon_g$–first-order stationary point and the Riemannian Hessian of $g$ at $Y$ is almost positive semidefinite, specifically,

$$\forall \dot{Y} \in T_Y\mathcal{M}, \quad \frac{1}{2}\left\langle \dot{Y}, \mathrm{Hess}\, g(Y)[\dot{Y}] \right\rangle = \langle \dot{Y}, S\dot{Y} \rangle \ge -\varepsilon_H \|\dot{Y}\|^2.$$

From these definitions, it is clear that $S$ encapsulates the approximate optimality conditions of problem (P).

3 Approximate second-order points and smoothed analysis

We state our main results formally in this section. As announced, following [Bhojanapalli et al., 2018], we resort to smoothed analysis [Spielman and Teng, 2004]. To this end, we consider perturbations of the cost matrix $C$ of (SDP) by a Gaussian Wigner matrix. Intuitively, smoothed analysis tells us how large the variance of the perturbation should be in order to obtain a new SDP which, with high probability, is sufficiently distant from any pathological case. We start by formally defining the notion of Gaussian Wigner matrix, following [Ben Arous and Guionnet, 2010].

Definition 3.1 (Gaussian Wigner matrix). The random matrix $W = W^*$ in $S^{n\times n}$ is a Gaussian Wigner matrix with variance $\sigma_W^2$ if its entries on and above the diagonal are independent, zero-mean Gaussian variables (real or complex depending on context) with variance $\sigma_W^2$.

Besides Assumption 1, another important assumption for our results is that the search space $\mathcal{C}$ (1) of (SDP) is compact. In that scenario, there exists a finite constant $R$ such that

$$\forall X \in \mathcal{C}, \quad \mathrm{Tr}(X) \le R. \tag{10}$$

Thus, for all $Y \in \mathcal{M}$, $\|Y\|^2 = \mathrm{Tr}(YY^*) \le R$. 
Another consequence of compactness of $\mathcal{C}$ is that the operator $\mathcal{A}^* \circ G^\dagger \circ \mathcal{A}$ is uniformly bounded, that is, there exists a finite constant $K$ such that

$$\forall Y \in \mathcal{M}, \quad \|\mathcal{A}^* \circ G^\dagger \circ \mathcal{A}\|_{\mathrm{op}} \le K, \tag{11}$$

where $G^\dagger$ is a continuous function of $Y$ as in Lemma 2.2. We give explicit expressions for the constants $R$ and $K$ for the case of phase retrieval in Section 4.

We now state the main theorem, whose proof is in Appendix E.

Theorem 3.1. Let Assumption 1 hold for (SDP) with cost matrix $C \in S^{n\times n}$ and $m$ constraints. Assume $\mathcal{C}$ (1) is compact, and let $R$ and $K$ be as in (10) and (11). Let $W$ be a Gaussian Wigner matrix with variance $\sigma_W^2$ and let $\delta \in (0, 1)$ be any tolerance. Define $\kappa$ as:

$$\kappa = \kappa(R, K, C, n, \sigma_W) = RK\left( \|C\|_{\mathrm{op}} + 3\sigma_W\sqrt{n} \right). \tag{12}$$

There exists a universal constant $c_0$ such that, if the rank $k$ for the low-rank problem (P) satisfies

$$k \ge 3\left[ \log(n) + \sqrt{\log(1/\delta)} + \sqrt{m \cdot \log\left( 1 + \frac{6\kappa\sqrt{c_0 n}}{\sigma_W} \right)} \right], \tag{13}$$

then, with probability at least $1 - \delta - e^{-n/2}$ on the random matrix $W$, any $(\varepsilon_g, \varepsilon_H)$-SOSP $Y \in K^{n\times k}$ of (P) with perturbed cost matrix $C + W$ has bounded optimality gap:

$$0 \le g(Y) - f^\star \le (\varepsilon_H + \varepsilon_g^2\eta)R + \frac{\varepsilon_g}{2}\sqrt{R}, \tag{14}$$

with $g$ the cost function of (P), $f^\star$ the optimal value of (SDP) (both perturbed), and

$$\eta = \eta(R, K, C, n, m, \sigma_W) = \frac{c_0 n K (2 + KR)^2 \left( \|C\|_{\mathrm{op}} + 3\sigma_W\sqrt{n} \right)}{9 m \sigma_W^2 \log\left( 1 + \frac{6\kappa\sqrt{c_0 n}}{\sigma_W} \right)}. \tag{15}$$

This result indicates that, as long as the rank $k$ is on the order of $\sqrt{m}$ (up to logarithmic factors), the optimality gap in the perturbed problem is small if a sufficiently good approximate second-order point is computed. Since (SDP) may admit a unique solution of rank as large as $\Theta(\sqrt{m})$ (see for example [Laurent and Poljak, 1996, Thm. 3.1(ii)] for the Max-Cut SDP), we conclude that the scaling of $k$ with respect to $m$ in Theorem 3.1 is essentially optimal.

There is an incentive to pick $\sigma_W$ small, since the optimality gap is phrased in terms of the perturbed problem. As expected though, taking $\sigma_W$ small comes at a price. Specifically, the required rank $k$ scales with $\sqrt{\log(1/\sigma_W)}$, so that a smaller $\sigma_W$ may require $k$ to be a larger multiple of $\sqrt{m}$. Furthermore, the optimality gap is bounded in terms of $\eta$ with a dependence in $\varepsilon_g^2/\sigma_W^2$; this may force us to compute more accurate approximate second-order points (smaller $\varepsilon_g$) for a similar guarantee when $\sigma_W$ is smaller: see also Corollary 3.1 below.

As announced, the theorem rests on two arguments which we now present, a probabilistic one and a deterministic one:

1. Probabilistic argument: In the smoothed analysis framework, we show, for $k$ large enough, that $\varepsilon_g$-FOSPs of (P) have their smallest singular value near zero, with high probability upon perturbation of $C$. This implies that such points are almost column-rank deficient.

2. Deterministic argument: If $Y$ is an $(\varepsilon_g, \varepsilon_H)$-SOSP of (P) and it is almost column-rank deficient, then the matrix $S(Y)$ defined in equation (7) is almost positive semidefinite. From there, we can derive a bound on the optimality gap.

Formal statements for both follow, building on the notation in Theorem 3.1. Proofs are in Appendices C and D, with supporting lemmas in Appendix B.

Lemma 3.1. Let Assumption 1 hold for (SDP). Assume $\mathcal{C}$ (1) is compact. Let $W$ be a Gaussian Wigner matrix with variance $\sigma_W^2$ and let $\delta \in (0, 1)$ be any tolerance. There exists a universal constant $c_0$ such that, if the rank $k$ for the low-rank problem (P) is lower bounded as in (13), then, with probability at least $1 - \delta - e^{-n/2}$ on the random matrix $W$, we have

$$\|W\|_{\mathrm{op}} \le 3\sigma_W\sqrt{n},$$

and furthermore: any $\varepsilon_g$-FOSP $Y \in K^{n\times k}$ of (P) with perturbed cost matrix $C + W$ satisfies

$$\sigma_k(Y) \le \frac{\varepsilon_g}{\sigma_W} \cdot \frac{\sqrt{c_0 n}}{k},$$

where $\sigma_k(Y)$ is the $k$th singular value of the matrix $Y$.

Lemma 3.2. Let Assumption 1 hold for (SDP) with cost matrix $C$. Assume $\mathcal{C}$ is compact. Let $Y \in K^{n\times k}$ be an $(\varepsilon_g, \varepsilon_H)$-SOSP of (P) (for any $k$). Then, the smallest eigenvalue of $S = S(Y)$ (7) is bounded below as

$$\lambda_{\min}(S) \ge -\varepsilon_H - \zeta\|C\|_{\mathrm{op}} \cdot \sigma_k^2(Y),$$

where $\zeta = K(2 + KR)^2$ with $R, K$ as in (10) and (11), and $\sigma_k(Y)$ is the $k$th singular value of $Y$ (it is zero if $k > n$). This holds deterministically for any cost matrix $C$.

Combining the two above lemmas, the key step in the proof of Theorem 3.1 is to deduce a bound on the optimality gap from a bound on the smallest eigenvalue of $S$: see Appendix E.

We have shown in Theorem 3.1 that a perturbed version of (P) can be approximately solved to global optimality, with high probability on the perturbation. In the corollary below, we further bound the optimality gap at the approximate solution of the perturbed problem with respect to the original, unperturbed problem. The proof is in Appendix F.

Corollary 3.1. Assume $\mathcal{C}$ is compact and let $R$ be as defined in (10). Let $X \in \mathcal{C}$ be an approximate solution for (SDP) with perturbed cost matrix $C + W$, so that the optimality gap in the perturbed problem is bounded by $\varepsilon_f$. Let $f^\star$ denote the optimal value of the unperturbed problem (SDP), with cost matrix $C$. 
Then, the optimality gap for $X$ with respect to the unperturbed problem is bounded as:

$$0 \le \langle C, X \rangle - f^\star \le \varepsilon_f + 2\|W\|_{\mathrm{op}}R.$$

Under the conditions of Theorem 3.1, with the prescribed probability, $\varepsilon_f$ and $\|W\|_{\mathrm{op}}$ can be bounded so that for an $(\varepsilon_g, \varepsilon_H)$-SOSP $Y$ of the perturbed problem (P) we have:

$$0 \le \langle CY, Y \rangle - f^\star \le (\varepsilon_H + \varepsilon_g^2\eta)R + \frac{\varepsilon_g}{2}\sqrt{R} + 6\sigma_W\sqrt{n}\,R,$$

where $\eta$ is as defined in (15) and $\sigma_W^2$ is the variance of the Wigner perturbation $W$.

4 Applications

The approximate global optimality results established in the previous section can be applied to deduce guarantees on the quality of ASOSPs of the low-rank factorization for a number of SDPs that appear in machine learning problems. Of particular interest, we focus on the phase retrieval problem. This problem consists in retrieving a signal $z \in \mathbb{C}^d$ from $n$ amplitude measurements $b = |Az| \in \mathbb{R}^n_+$ (the absolute value of the vector $Az$ is taken entry-wise). If we can recover the complex phases of $Az$, then $z$ can be estimated through linear least-squares. Following this approach, Waldspurger et al. [2015] argue that this task can be modeled as the following non-convex problem:

$$\min_{u \in \mathbb{C}^n} u^*Cu \quad \text{subject to } |u_i| = 1, \text{ for } i = 1, \dots, n, \tag{PR}$$

where $C = \mathrm{diag}(b)(I - AA^\dagger)\mathrm{diag}(b)$ and $\mathrm{diag} : \mathbb{R}^n \to H^{n\times n}$ maps a vector to the corresponding diagonal matrix. The classical relaxation is to rewrite the above in terms of $X = uu^*$ (lifting) without enforcing $\mathrm{rank}(X) = 1$, leading to a complex SDP which Waldspurger et al. 
[2015] call PhaseCut:

$$\min_{X \in H^{n\times n}} \langle C, X \rangle \quad \text{subject to } \mathrm{diag}(X) = \mathbf{1}, \; X \succeq 0. \tag{PhaseCut}$$

The same SDP relaxation also applies to a problem called angular synchronization [Singer, 2011]. The Burer–Monteiro factorization of (PhaseCut) is an optimization problem over a matrix $Y \in \mathbb{C}^{n\times k}$ as follows:

$$\min_{Y \in \mathbb{C}^{n\times k}} \langle CY, Y \rangle \quad \text{subject to } \mathrm{diag}(YY^*) = \mathbf{1}. \tag{PhaseCut-BM}$$

For a feasible $Y$, each row has unit norm: the search space is a Cartesian product of spheres in $\mathbb{C}^k$, which is a smooth manifold. We can check that Assumption 1 holds for all $k \ge 1$. Furthermore, the feasible space of the SDP is compact. Therefore, Theorem 3.1 applies.

In this setting, $\mathrm{Tr}(X) = n$ for all feasible $X$, and $\|\mathcal{A}^* \circ G^\dagger \circ \mathcal{A}\|_{\mathrm{op}} = 1$ for all feasible $Y$ (because $G = G(Y) = I_m$ for all feasible $Y$, see Lemma 2.2, and $\mathcal{A}^* \circ \mathcal{A}$ is an orthogonal projector from Hermitian matrices to diagonal matrices). For this reason, the constants defined in (10) and (11) can be set to $R = n$ and $K = 1$.

Figure 1: Computation time of the dedicated interior-point method (IPM) and of the Burer–Monteiro approach (BM) to solve (PhaseCut). For increasing values of n (horizontal axis), we display the computation time averaged over four independent realizations of the problem (vertical axis). The smallest and largest observed computation times are represented with dashed lines. At n = 3000, BM is about 40 times faster than IPM. For the largest value of n, IPM runs out of memory.

As a comparison, Mei et al. [2017] also provide an optimality gap for ASOSPs of (PhaseCut) without perturbation. Their result is more general in the sense that it holds for all possible values of $k$. However, when $k$ is large, it does not accurately capture the fact that SOSPs are optimal, thus incurring a larger bound on the optimality gap of ASOSPs. 
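A minimal numerical sketch of (PhaseCut-BM) (our own illustration, not the Manopt/RTR implementation used in the experiments): Riemannian gradient descent on the product of complex spheres, using row normalization as a simple retraction and a backtracking step so the cost is non-increasing. All names are illustrative.

```python
import numpy as np

# Sketch: minimize g(Y) = <CY, Y> over Y in C^{n x k} with unit-norm rows.
rng = np.random.default_rng(4)
n, k = 40, 7                                  # k ~ sqrt(n), rounded up

B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
C = (B + B.conj().T) / 2                      # Hermitian cost matrix

def cost(Y):
    return np.real(np.trace((C @ Y).conj().T @ Y))       # g(Y) = <CY, Y>

def rgrad(Y):
    G = 2 * C @ Y                                        # Euclidean gradient
    mu = np.real(np.sum(G * Y.conj(), axis=1))           # diag of G Y*
    return G - mu[:, None] * Y                           # tangent projection

def retract(Y):
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)  # back to the manifold

Y = retract(rng.standard_normal((n, k)) + 1j * rng.standard_normal((n, k)))
initial_cost = cost(Y)

for _ in range(300):
    g = rgrad(Y)
    t = 1.0 / (2 * np.linalg.norm(C, 2))                 # conservative step
    # Backtracking line search: never accept a cost increase.
    while cost(retract(Y - t * g)) > cost(Y) - 0.5 * t * np.linalg.norm(g)**2 and t > 1e-12:
        t /= 2
    Y = retract(Y - t * g)

assert np.allclose(np.linalg.norm(Y, axis=1), 1.0)       # feasibility
assert cost(Y) <= initial_cost + 1e-9                    # monotone decrease
```

The paper's experiments instead use a Riemannian trust-region method, which comes with guarantees of reaching an ASOSP in a bounded number of iterations; plain gradient descent is shown here only to make the geometry tangible.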
In contrast, our bounds do show that for $k$ large enough, as $\varepsilon_g, \varepsilon_H$ go to zero, the optimality gap goes to zero, with the trade-off that they do so for a perturbed problem (though see Corollary 3.1), with high probability.

Numerical Experiments

We present the empirical performance of the low-rank approach in the case of (PhaseCut). We compare it with a dedicated interior-point method (IPM) implemented by Helmberg et al. [1996] for real SDPs and adapted to phase retrieval as done by Waldspurger et al. [2015]. This adaptation involves splitting the real and the imaginary parts of the variables in (PhaseCut) and forming an equivalent real SDP with double the dimension. The Burer–Monteiro approach (BM) is implemented in complex form directly using Manopt, a toolbox for optimization on manifolds [Boumal et al., 2014]. In particular, a Riemannian Trust-Region method (RTR) is used [Absil et al., 2007]. Theory supports that these methods can return an ASOSP in a finite number of iterations [Boumal et al., 2018a]. We stress that the SDP is not perturbed in these experiments: the role of the perturbation in the analysis is to understand why the low-rank approach is so successful in practice despite the existence of pathological cases. In practice, we do not expect to encounter pathological cases.

Our numerical experiment setup is as follows. We seek to recover a signal of dimension $d$, $z \in \mathbb{C}^d$, from $n$ measurements encoded in the vector $b \in \mathbb{R}^n_+$ such that $b = |Az| + \varepsilon$, where $A \in \mathbb{C}^{n\times d}$ is the sensing matrix and $\varepsilon \sim \mathcal{N}(0, I_n)$ is standard Gaussian noise. For the numerical experiments, we generate the vectors $z$ as complex random vectors with i.i.d. standard Gaussian entries, and we randomly generate the complex sensing matrices $A$ also with i.i.d. standard Gaussian entries. 
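This data-generating process can be sketched as follows (a hedged illustration with our own variable names), together with the PhaseCut cost matrix $C = \mathrm{diag}(b)(I - AA^\dagger)\mathrm{diag}(b)$ from Section 4; the lower cap at 0.01 follows the description of the experiments:

```python
import numpy as np

# Sketch of the experimental setup: signal z in C^d, Gaussian sensing
# matrix A in C^{n x d} with n = 10d, noisy magnitudes b = |Az| + eps
# capped below at 0.01, and the PhaseCut cost matrix.
rng = np.random.default_rng(5)
d = 20
n = 10 * d

z = rng.standard_normal(d) + 1j * rng.standard_normal(d)
A = rng.standard_normal((n, d)) + 1j * rng.standard_normal((n, d))

eps = rng.standard_normal(n)                  # standard Gaussian noise
b = np.maximum(np.abs(A @ z) + eps, 0.01)     # cap magnitudes from below

P = np.eye(n) - A @ np.linalg.pinv(A)         # I - A A^+, an orthogonal projector
C = np.diag(b) @ P @ np.diag(b)               # PhaseCut cost matrix

# C is Hermitian and positive semidefinite, since I - A A^+ is a projector.
assert np.allclose(C, C.conj().T)
assert np.linalg.eigvalsh(C).min() >= -1e-8
```

This matrix $C$ is then the input to (PhaseCut-BM), solved with RTR in the experiments reported here.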
We do so for values of d ranging from 10 to 1000, and always with n = 10d (that is, 10 magnitude measurements per unknown complex coefficient, an oversampling factor of 5). Lastly, we generate the measurement vectors b as described above and cap their entries from below at 0.01 in order to avoid small (or even negative) magnitude measurements.
For n up to 3000, both methods solve the same problem, and indeed produce the same answer up to small discrepancies. The BM approach is more accurate, at least in satisfying the constraints, and, for n = 3000, it is also about 40 times faster than IPM. BM is run with k = √n (rounded up), which is expected to be generically sufficient to include the global optimum of the SDP (as confirmed in practice). For larger values of n, IPM ran into memory issues and we had to abort the process.

5 Conclusion

We considered the low-rank (or Burer–Monteiro) approach to solve equality-constrained SDPs. Our key assumptions are that (a) the search space of the SDP is compact, and (b) the search space of its low-rank version is smooth (the actual condition is slightly stronger). Under these assumptions, we proved using smoothed analysis that, provided k = Ω̃(√m) where m is the number of constraints, if the cost matrix is perturbed randomly, then, with high probability, approximate second-order stationary points of the perturbed low-rank problem map to approximately optimal solutions of the perturbed SDP. We also related optimality gaps in the perturbed SDP to optimality gaps in the original SDP. Finally, we applied this result to an SDP relaxation of phase retrieval (also applicable to angular synchronization).

Acknowledgments

NB is partially supported by NSF award DMS-1719558.

References

E. Abbé. Community Detection and Stochastic Block Models, volume 14. Now Publishers Inc., 2018. doi:10.1561/0100000067.

P.-A. Absil, C.
Baker, and K. Gallivan. Trust-region methods on Riemannian manifolds. Foundations\n\nof Computational Mathematics, 7(3):303\u2013330, 2007. doi:10.1007/s10208-005-0179-9.\n\nP.-A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton\n\nUniversity Press, 2008.\n\nA. Bandeira, N. Boumal, and A. Singer. Tightness of the maximum likelihood semide\ufb01nite\nrelaxation for angular synchronization. Mathematical Programming, 163(1):145\u2013167, 2017.\ndoi:10.1007/s10107-016-1059-6.\n\nA. Barvinok. Problems of distance geometry and convex properties of quadratic maps. Discrete &\n\nComputational Geometry, 13(1):189\u2013202, 1995. doi:10.1007/BF02574037.\n\nG. Ben Arous and A. Guionnet. Wigner matrices. In Handbook on Random Matrices, pages 433\u2013451.\n\nOxford University Press, 2010.\n\nR. Bhatia. Positive de\ufb01nite matrices. Princeton University Press, 2007.\n\nS. Bhojanapalli, N. Boumal, P. Jain, and P. Netrapalli. Smoothed analysis for low-rank solutions to\nsemide\ufb01nite programs in quadratic penalty form. In S. Bubeck, V. Perchet, and P. Rigollet, editors,\nProceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine\nLearning Research, pages 3243\u20133270. PMLR, 06\u201309 Jul 2018. URL http://proceedings.mlr.\npress/v75/bhojanapalli18a.html.\n\nN. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization\non manifolds. Journal of Machine Learning Research, 15:1455\u20131459, 2014. URL https:\n//www.manopt.org.\n\nN. Boumal, V. Voroninski, and A. Bandeira. The non-convex Burer-Monteiro approach works on\nsmooth semide\ufb01nite programs. In Advances in Neural Information Processing Systems, pages\n2757\u20132765, 2016.\n\nN. Boumal, P.-A. Absil, and C. Cartis. Global rates of convergence for nonconvex optimization on\n\nmanifolds. IMA Journal of Numerical Analysis, 2018a. doi:10.1093/imanum/drx080.\n\nN. Boumal, V. Voroninski, and A. Bandeira. 
Deterministic guarantees for Burer–Monteiro factorizations of smooth semidefinite programs. To appear in Communications on Pure and Applied Mathematics, 2018b.

S. Burer and R. D. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.

S. Burer and R. D. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427–444, 2005.

E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

T. Carson, D. G. Mixon, and S. Villar. Manifold optimization for k-means clustering. In 2017 International Conference on Sampling Theory and Applications (SampTA), pages 73–77, July 2017. doi:10.1109/SAMPTA.2017.8024388.

D. Garber. Faster projection-free convex optimization over the spectrahedron. Advances in Neural Information Processing Systems, pages 874–882, 2016.

D. Garber and E. Hazan. Sublinear time algorithms for approximate semidefinite programming. Mathematical Programming: Series A and B, 158:329–361, 2016.

G. H. Golub and V. Pereyra. The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate. SIAM Journal on Numerical Analysis, 10(2):413–432, 1973. doi:10.1137/0710036.

E. Hazan. Sparse approximate solutions to semidefinite programs. LATIN 2008: Theoretical Informatics, pages 306–316, 2008.

C. Helmberg, F. Rendl, R. J. Vanderbei, and H. Wolkowicz. An interior-point method for semidefinite programming. SIAM Journal on Optimization, 6(2):342–361, 1996.

M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.

B. Kulis, A. C. Surendran, and J.
C. Platt. Fast low-rank semidefinite programming for embedding and clustering. In Artificial Intelligence and Statistics, pages 235–242, 2007.

G. R. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5(Jan):27–72, 2004.

S. Laue. A hybrid algorithm for convex semidefinite optimization. ICML, pages 177–184, 2012.

M. Laurent and S. Poljak. On the facial structure of the set of correlation matrices. SIAM Journal on Matrix Analysis and Applications, 17(3):530–547, 1996. doi:10.1137/0617031.

S. Mei, T. Misiakiewicz, A. Montanari, and R. I. Oliveira. Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, pages 1476–1515, 2017. URL http://proceedings.mlr.press/v65/mei17a.html.

H. H. Nguyen. Random matrices: Overcrowding estimates for the spectrum. Journal of Functional Analysis, 275(8):2197–2224, 2018. doi:10.1016/j.jfa.2018.06.010.

G. Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research, 23(2):339–358, 1998. doi:10.1287/moor.23.2.339.

P. Rigollet and J.-C. Hütter. High dimensional statistics, 2017. URL http://www-math.mit.edu/~rigollet/PDFs/RigNotes17.pdf.

A. Singer. Angular synchronization by eigenvectors and semidefinite programming. Applied and Computational Harmonic Analysis, 30(1):20–36, 2011.

D. Spielman and S.-H. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3):385–463, May 2004. doi:10.1145/990308.990310.

I. Waldspurger, A. d’Aspremont, and S. Mallat.
Phase recovery, MaxCut and complex semidefinite programming. Mathematical Programming, 149(1-2):47–81, 2015.

A. Yurtsever, M. Udell, J. A. Tropp, and V. Cevher. Sketchy decisions: Convex low-rank matrix optimization with optimal storage. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 54:1188–1196, 2017.