{"title": "A Nonconvex Optimization Framework for Low Rank Matrix Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 559, "page_last": 567, "abstract": "We study the estimation of low rank matrices via nonconvex optimization. Compared with convex relaxation, nonconvex optimization exhibits superior empirical performance for large scale instances of low rank matrix estimation. However, the understanding of its theoretical guarantees are limited. In this paper, we define the notion of projected oracle divergence based on which we establish sufficient conditions for the success of nonconvex optimization. We illustrate the consequences of this general framework for matrix sensing and completion. In particular, we prove that a broad class of nonconvex optimization algorithms, including alternating minimization and gradient-type methods, geometrically converge to the global optimum and exactly recover the true low rank matrices under standard conditions.", "full_text": "A Nonconvex Optimization Framework for Low Rank\n\nMatrix Estimation\u21e4\n\nTuo Zhao\n\nJohns Hopkins University\n\nZhaoran Wang\n\nHan Liu\n\nPrinceton University\n\nAbstract\n\nWe study the estimation of low rank matrices via nonconvex optimization. Com-\npared with convex relaxation, nonconvex optimization exhibits superior empirical\nperformance for large scale instances of low rank matrix estimation. However, the\nunderstanding of its theoretical guarantees are limited. In this paper, we de\ufb01ne the\nnotion of projected oracle divergence based on which we establish suf\ufb01cient condi-\ntions for the success of nonconvex optimization. We illustrate the consequences\nof this general framework for matrix sensing. 
In particular, we prove that a broad\nclass of nonconvex optimization algorithms, including alternating minimization\nand gradient-type methods, geometrically converge to the global optimum and\nexactly recover the true low rank matrices under standard conditions.\n\n1 Introduction\n\nLet M^* \u2208 \u211d^{m\u00d7n} be a rank k matrix with k much smaller than m and n. Our goal is to estimate\nM^* based on partial observations of its entries. For example, matrix sensing is based on linear\nmeasurements \u27e8A_i, M^*\u27e9, where i \u2208 {1, . . . , d} with d much smaller than mn and A_i is the sensing\nmatrix. In the past decade, significant progress has been made on the recovery of low rank matrices\n[4, 5, 23, 18, 15, 16, 12, 22, 7, 25, 19, 6, 14, 11, 13, 8, 9, 10, 27]. Among these existing works, most\nare based upon convex relaxation with nuclear norm constraint or regularization. Nevertheless, solving\nthese convex optimization problems can be computationally prohibitive in high dimensional regimes\nwith large m and n [27]. A computationally more efficient alternative is nonconvex optimization. In\nparticular, we reparameterize the m \u00d7 n matrix variable M in the optimization problem as U V^\u22a4\nwith U \u2208 \u211d^{m\u00d7k} and V \u2208 \u211d^{n\u00d7k}, and optimize over U and V. Such a reparametrization automatically\nenforces the low rank structure and leads to low computational cost per iteration. For this reason,\nthe nonconvex approach is widely used in large scale applications such as recommendation systems\n[17].\nDespite the superior empirical performance of the nonconvex approach, the understanding of its\ntheoretical guarantees is relatively limited in comparison with the convex relaxation approach. Only\nrecently has there been progress on coordinate descent-type nonconvex optimization methods,\nwhich are known as alternating minimization [14, 8, 9, 10]. 
They show that, provided a suitable\ninitialization, the alternating minimization algorithm converges at a geometric rate to U^* \u2208 \u211d^{m\u00d7k}\nand V^* \u2208 \u211d^{n\u00d7k}, which satisfy M^* = U^* V^{*\u22a4}. Meanwhile, [15, 16] establish the convergence of\ngradient-type methods, and [27] further establish the convergence of a broad class of nonconvex\nalgorithms including both gradient-type and coordinate descent-type methods. However, [15, 16, 27]\nonly establish asymptotic convergence for an infinite number of iterations, rather than an explicit\nrate of convergence. Besides these works, [18, 12, 13] consider projected gradient-type methods,\nwhich optimize over the matrix variable M \u2208 \u211d^{m\u00d7n} rather than U \u2208 \u211d^{m\u00d7k} and V \u2208 \u211d^{n\u00d7k}. These\nmethods involve calculating the top k singular vectors of an m \u00d7 n matrix at each iteration. For\nk much smaller than m and n, they incur much higher computational cost per iteration than the\naforementioned methods that optimize over U and V. All these works, except [27], focus on specific\nalgorithms, while [27] do not establish an explicit rate of convergence.\n\n*Research supported by NSF IIS1116730, NSF IIS1332109, NSF IIS1408910, NSF IIS1546482-BIGDATA,\nNSF DMS1454377-CAREER, NIH R01GM083084, NIH R01HG06841, NIH R01MH102339, and FDA\nHHSF223201000072C.\n\nIn this paper, we propose a general framework that unifies a broad class of nonconvex algorithms\nfor low rank matrix estimation. At the core of this framework is a quantity named projected oracle\ndivergence, which sharply captures the evolution of generic optimization algorithms in the presence\nof nonconvexity. Based on the projected oracle divergence, we establish sufficient conditions under\nwhich the iteration sequences geometrically converge to the global optima. 
For matrix sensing, a direct\nconsequence of this general framework is that, a broad family of nonconvex algorithms, including\ngradient descent, coordinate gradient descent and coordinate descent, converge at a geometric rate\nto the true low rank matrices U\u21e4 and V \u21e4. In particular, our general framework covers alternating\nminimization as a special case and recovers the results of [14, 8, 9, 10] under standard conditions.\nMeanwhile, our framework covers gradient-type methods, which are also widely used in practice\n[28, 24]. To the best of our knowledge, our framework is the \ufb01rst one that establishes exact recovery\nguarantees and geometric rates of convergence for a broad family of nonconvex matrix sensing\nalgorithms.\nTo achieve maximum generality, our uni\ufb01ed analytic framework signi\ufb01cantly differs from previous\nworks. In detail, [14, 8, 9, 10] view alternating minimization as a perturbed version of the power\nmethod. However, their point of view relies on the closed form solution of each iteration of alternating\nminimization, which makes it hard to generalize to other algorithms, e.g., gradient-type methods.\nMeanwhile, [27] take a geometric point of view. In detail, they show that the global optimum of the\noptimization problem is the unique stationary point within its neighborhood and thus a broad class of\nalgorithms succeed. However, such geometric analysis of the objective function does not characterize\nthe convergence rate of speci\ufb01c algorithms towards the stationary point. Unlike existing analytic\nframeworks, we analyze nonconvex optimization algorithms as perturbed versions of their convex\ncounterparts. For example, under our framework we view alternating minimization as a perturbed\nversion of coordinate descent on convex objective functions. We use the key quantity, projected oracle\ndivergence, to characterize such a perturbation effect, which results from the local nonconvexity\nat intermediate solutions. 
This framework allows us to establish explicit rates of convergence in a\nway analogous to existing convex optimization analysis.\nNotation: For a vector v = (v_1, . . . , v_d)^\u22a4 \u2208 \u211d^d, let the vector \u2113_q norm be \u2016v\u2016_q^q = \u2211_j v_j^q. For a\nmatrix A \u2208 \u211d^{m\u00d7n}, we use A_{*j} = (A_{1j}, ..., A_{mj})^\u22a4 to denote the j-th column of A, and A_{i*} =\n(A_{i1}, ..., A_{in})^\u22a4 to denote the i-th row of A. Let \u03c3_max(A) and \u03c3_min(A) be the largest and smallest\nnonzero singular values of A. We define the following matrix norms: \u2016A\u2016_F^2 = \u2211_j \u2016A_{*j}\u2016_2^2, \u2016A\u2016_2 =\n\u03c3_max(A). Moreover, we define \u2016A\u2016_* to be the sum of all singular values of A. Given another matrix\nB \u2208 \u211d^{m\u00d7n}, we define the inner product as \u27e8A, B\u27e9 = \u2211_{i,j} A_{ij} B_{ij}. We define e_i as an indicator\nvector, where the i-th entry is one, and all other entries are zero. For a bivariate function f(u, v), we\ndefine \u2207_u f(u, v) to be the gradient with respect to u. Moreover, we use the common notations of\n\u03a9(\u00b7), O(\u00b7), and o(\u00b7) to characterize the asymptotics of two real sequences.\n\n2 Problem Formulation and Algorithms\n\nLet M^* \u2208 \u211d^{m\u00d7n} be the unknown low rank matrix of interest. We have d sensing matrices A_i \u2208\n\u211d^{m\u00d7n} with i \u2208 {1, . . . , d}. Our goal is to estimate M^* based on b_i = \u27e8A_i, M^*\u27e9 in the high\ndimensional regime with d much smaller than mn. Under such a regime, a common assumption\nis rank(M^*) = k \u226a min{d, m, n}. Existing approaches generally recover M^* by solving the\nfollowing convex optimization problem\n\nmin_{M \u2208 \u211d^{m\u00d7n}} \u2016M\u2016_* subject to b = A(M), (2.1)\n\nwhere b = [b_1, ..., b_d]^\u22a4 \u2208 \u211d^d, and A(M) : \u211d^{m\u00d7n} \u2192 \u211d^d is an operator defined as\n\nA(M) = [\u27e8A_1, M\u27e9, ..., \u27e8A_d, M\u27e9]^\u22a4 \u2208 \u211d^d. (2.2)\n\nExisting convex optimization algorithms for solving (2.1) are computationally inefficient, in the sense\nthat they incur high per-iteration computational cost, and only attain sublinear rates of convergence to\nthe global optimum [14]. 
Instead, in large scale settings we usually consider the following nonconvex\noptimization problem\n\nmin_{U \u2208 \u211d^{m\u00d7k}, V \u2208 \u211d^{n\u00d7k}} F(U, V), where F(U, V) = 1/2 \u00b7 \u2016b \u2212 A(U V^\u22a4)\u2016_2^2. (2.3)\n\nThe reparametrization of M = U V^\u22a4, though making the optimization problem in (2.3) nonconvex,\nsignificantly improves the computational efficiency. Existing literature [17, 28, 21, 24] has established\nconvincing empirical evidence that (2.3) can be effectively solved by a broad variety of gradient-based\nnonconvex optimization algorithms, including gradient descent, alternating exact minimization (i.e.,\nalternating least squares or coordinate descent), as well as alternating gradient descent (i.e., coordinate\ngradient descent), which are shown in Algorithm 1.\nIt is worth noting that the QR decomposition and rank k singular value decomposition in Algorithm\n1 can be accomplished efficiently. In particular, the QR decomposition can be accomplished in\nO(k^2 max{m, n}) operations, while the rank k singular value decomposition can be accomplished\nin O(kmn) operations. In fact, the QR decomposition is not necessary for particular update schemes,\ne.g., [14] prove that the alternating exact minimization update schemes with or without the QR\ndecomposition are equivalent.\nAlgorithm 1 A family of nonconvex optimization algorithms for matrix sensing. Here (U\u0304, D, V\u0304) \u2190\nKSVD(M) is the rank k singular value decomposition of M. Here D is a diagonal matrix containing\nthe top k singular values of M in decreasing order, and U\u0304 and V\u0304 contain the corresponding top k left\nand right singular vectors of M. 
Here (V\u0304, R_{V\u0304}) \u2190 QR(V) is the QR decomposition, where V\u0304 is the\ncorresponding orthonormal matrix and R_{V\u0304} is the corresponding upper triangular matrix.\nInput: {b_i}_{i=1}^d, {A_i}_{i=1}^d\nParameter: Step size \u03b7, Total number of iterations T\n(U\u0304^{(0)}, D^{(0)}, V\u0304^{(0)}) \u2190 KSVD(\u2211_{i=1}^d b_i A_i), V^{(0)} \u2190 V\u0304^{(0)} D^{(0)}, U^{(0)} \u2190 U\u0304^{(0)} D^{(0)}\nFor: t = 0, ..., T \u2212 1\nUpdating V:\nAlternating Exact Minimization: V^{(t+0.5)} \u2190 argmin_V F(U\u0304^{(t)}, V)\n(V\u0304^{(t+1)}, R_{V\u0304}^{(t+0.5)}) \u2190 QR(V^{(t+0.5)})\nAlternating Gradient Descent: V^{(t+0.5)} \u2190 V^{(t)} \u2212 \u03b7 \u2207_V F(U\u0304^{(t)}, V^{(t)})\n(V\u0304^{(t+1)}, R_{V\u0304}^{(t+0.5)}) \u2190 QR(V^{(t+0.5)}), U^{(t)} \u2190 U\u0304^{(t)} R_{V\u0304}^{(t+0.5)\u22a4}\nGradient Descent: V^{(t+0.5)} \u2190 V^{(t)} \u2212 \u03b7 \u2207_V F(U\u0304^{(t)}, V^{(t)})\n(V\u0304^{(t+1)}, R_{V\u0304}^{(t+0.5)}) \u2190 QR(V^{(t+0.5)}), U^{(t+1)} \u2190 U\u0304^{(t)} R_{V\u0304}^{(t+0.5)\u22a4}\nUpdating U:\nAlternating Exact Minimization: U^{(t+0.5)} \u2190 argmin_U F(U, V\u0304^{(t+1)})\n(U\u0304^{(t+1)}, R_{U\u0304}^{(t+0.5)}) \u2190 QR(U^{(t+0.5)})\nAlternating Gradient Descent: U^{(t+0.5)} \u2190 U^{(t)} \u2212 \u03b7 \u2207_U F(U^{(t)}, V\u0304^{(t+1)})\n(U\u0304^{(t+1)}, R_{U\u0304}^{(t+0.5)}) \u2190 QR(U^{(t+0.5)}), V^{(t+1)} \u2190 V\u0304^{(t+1)} R_{U\u0304}^{(t+0.5)\u22a4}\nGradient Descent: U^{(t+0.5)} \u2190 U^{(t)} \u2212 \u03b7 \u2207_U F(U^{(t)}, V\u0304^{(t)})\n(U\u0304^{(t+1)}, R_{U\u0304}^{(t+0.5)}) \u2190 QR(U^{(t+0.5)}), V^{(t+1)} \u2190 V\u0304^{(t)} R_{U\u0304}^{(t+0.5)\u22a4}\nEnd for\nOutput: M^{(T)} \u2190 U^{(T\u22120.5)} V\u0304^{(T)\u22a4} (for gradient descent we use U\u0304^{(T)} V^{(T)\u22a4})\n3 Theoretical Analysis\nWe analyze the convergence properties of the general family of nonconvex optimization algorithms\nillustrated in \u00a72. Before we present the main results, we first introduce a unified analytic framework\nbased on a key quantity named projected oracle divergence. Such a unified framework equips our\ntheory with the maximum generality. Without loss of generality, we assume m \u2264 n throughout the\nrest of this paper.\n\n3.1 Projected Oracle Divergence\nWe first provide an intuitive explanation for the success of nonconvex optimization algorithms, which\nforms the basis of our later proof for the main results. 
Recall that (2.3) is a special instance of the\nfollowing optimization problem,\n\nmin_{U \u2208 \u211d^{m\u00d7k}, V \u2208 \u211d^{n\u00d7k}} f(U, V). (3.1)\n\nA key observation is that, given fixed U, f(U, \u00b7) is strongly convex and smooth in V under suitable\nconditions, and the same also holds for U given fixed V correspondingly. For the convenience of\ndiscussion, we summarize this observation in the following technical condition, which will be later\nverified for matrix sensing under suitable conditions.\n\nCondition 3.1 (Strong Biconvexity and Bismoothness). There exist universal constants \u00b5_+ > 0 and\n\u00b5_\u2212 > 0 such that\n\n\u00b5_\u2212/2 \u00b7 \u2016U' \u2212 U\u2016_F^2 \u2264 f(U', V) \u2212 f(U, V) \u2212 \u27e8U' \u2212 U, \u2207_U f(U, V)\u27e9 \u2264 \u00b5_+/2 \u00b7 \u2016U' \u2212 U\u2016_F^2 for all U, U',\n\n\u00b5_\u2212/2 \u00b7 \u2016V' \u2212 V\u2016_F^2 \u2264 f(U, V') \u2212 f(U, V) \u2212 \u27e8V' \u2212 V, \u2207_V f(U, V)\u27e9 \u2264 \u00b5_+/2 \u00b7 \u2016V' \u2212 V\u2016_F^2 for all V, V'.\n\nFor the simplicity of discussion, for now we assume U^* and V^* are the unique global minimizers to\nthe generic optimization problem in (3.1). Assuming U^* is given, we can obtain V^* by\n\nV^* = argmin_{V \u2208 \u211d^{n\u00d7k}} f(U^*, V). (3.2)\n\nCondition 3.1 implies the objective function in (3.2) is strongly convex and smooth. Hence, we can\nchoose any gradient-based algorithm to obtain V^*. For example, we can directly solve for V^* in\n\n\u2207_V f(U^*, V) = 0, (3.3)\n\nor iteratively solve for V^* using gradient descent, i.e.,\n\nV^{(t)} = V^{(t\u22121)} \u2212 \u03b7 \u2207_V f(U^*, V^{(t\u22121)}), (3.4)\n\nwhere \u03b7 is the step size. For the simplicity of discussion, we put aside the renormalization issue for\nnow. In the example of gradient descent, by invoking classical convex optimization results [20], it is\neasy to prove that\n\n\u2016V^{(t)} \u2212 V^*\u2016_F \u2264 \u03ba \u2016V^{(t\u22121)} \u2212 V^*\u2016_F for all t = 0, 1, 2, . . . ,\n\n
where \u03ba \u2208 (0, 1) is a contraction coefficient, which depends on \u00b5_+ and \u00b5_\u2212 in Condition 3.1.\nHowever, the first-order oracle \u2207_V f(U^*, V^{(t\u22121)}) is not accessible in practice, since we do not know\nU^*. Instead, we only have access to \u2207_V f(U, V^{(t\u22121)}), where U is arbitrary. To characterize the\ndivergence between the ideal first-order oracle \u2207_V f(U^*, V^{(t\u22121)}) and the accessible first-order oracle\n\u2207_V f(U, V^{(t\u22121)}), we define a key quantity named projected oracle divergence, which takes the form\n\nD(V, V', U) = \u27e8\u2207_V f(U^*, V') \u2212 \u2207_V f(U, V'), (V \u2212 V^*)/\u2016V \u2212 V^*\u2016_F\u27e9, (3.5)\n\nwhere V' is the point for evaluating the gradient. In the above example, it holds for V' = V^{(t\u22121)}.\nLater we will illustrate that the projection of the difference of first-order oracles onto a specific one\ndimensional space, i.e., the direction of V \u2212 V^*, is critical to our analysis. In the above example of\ngradient descent, we will prove later that for V^{(t)} = V^{(t\u22121)} \u2212 \u03b7 \u2207_V f(U, V^{(t\u22121)}), we have\n\n\u2016V^{(t)} \u2212 V^*\u2016_F \u2264 \u03ba \u2016V^{(t\u22121)} \u2212 V^*\u2016_F + 2/\u00b5_+ \u00b7 D(V^{(t)}, V^{(t\u22121)}, U). (3.6)\n\nIn other words, the projection of the divergence of first-order oracles onto the direction of V^{(t)} \u2212 V^*\ncaptures the perturbation effect of employing the accessible first-order oracle \u2207_V f(U, V^{(t\u22121)}) instead\nof the ideal \u2207_V f(U^*, V^{(t\u22121)}). For V^{(t+1)} = argmin_V f(U, V), we will prove that\n\n\u2016V^{(t)} \u2212 V^*\u2016_F \u2264 1/\u00b5_\u2212 \u00b7 D(V^{(t)}, V^{(t)}, U). (3.7)\n\nAccording to the update schemes shown in Algorithm 1, for alternating exact minimization, we set\nU = U^{(t)} in (3.7), while for gradient descent or alternating gradient descent, we set U = U^{(t\u22121)} or\nU = U^{(t)} in (3.6) respectively. Correspondingly, similar results hold for \u2016U^{(t)} \u2212 U^*\u2016_F.\nTo establish the geometric rate of convergence towards the global minima U^* and V^*, it remains to\nestablish upper bounds for the projected oracle divergence. 
In the example of gradient descent we will\nprove that for some \u03b1 \u2208 (0, 1 \u2212 \u03ba),\n\n2/\u00b5_+ \u00b7 D(V^{(t)}, V^{(t\u22121)}, U^{(t\u22121)}) \u2264 \u03b1 \u2016U^{(t\u22121)} \u2212 U^*\u2016_F,\n\nwhich together with (3.6) (where we take U = U^{(t\u22121)}) implies\n\n\u2016V^{(t)} \u2212 V^*\u2016_F \u2264 \u03ba \u2016V^{(t\u22121)} \u2212 V^*\u2016_F + \u03b1 \u2016U^{(t\u22121)} \u2212 U^*\u2016_F. (3.8)\n\nCorrespondingly, similar results hold for \u2016U^{(t)} \u2212 U^*\u2016_F, i.e.,\n\n\u2016U^{(t)} \u2212 U^*\u2016_F \u2264 \u03ba \u2016U^{(t\u22121)} \u2212 U^*\u2016_F + \u03b1 \u2016V^{(t\u22121)} \u2212 V^*\u2016_F. (3.9)\n\nCombining (3.8) and (3.9) we then establish the contraction\n\nmax{\u2016V^{(t)} \u2212 V^*\u2016_F, \u2016U^{(t)} \u2212 U^*\u2016_F} \u2264 (\u03b1 + \u03ba) \u00b7 max{\u2016V^{(t\u22121)} \u2212 V^*\u2016_F, \u2016U^{(t\u22121)} \u2212 U^*\u2016_F},\n\nwhich further implies the geometric convergence, since \u03b1 \u2208 (0, 1 \u2212 \u03ba). Similarly, we can establish\nsimilar results for alternating exact minimization and alternating gradient descent. Based upon such a\nunified analytic framework, we now simultaneously establish the main results.\nRemark 3.2. Our proposed projected oracle divergence is inspired by previous work [3, 2, 1],\nwhich analyzes the Wirtinger Flow algorithm for phase retrieval, the expectation maximization (EM)\nalgorithm for latent variable models, and the gradient descent algorithm for sparse coding. Though\ntheir analysis exploits similar nonconvex structures, they work on completely different problems, and\nthe delivered technical results are also fundamentally different.\n\n3.2 Matrix Sensing\nBefore we present our main results, we first introduce an assumption known as the restricted isometry\nproperty (RIP). Recall that k is the rank of the target low rank matrix M^*.\nAssumption 3.3. The linear operator A(\u00b7) : \u211d^{m\u00d7n} \u2192 
\u211d^d defined in (2.2) satisfies 2k-RIP with\nparameter \u03b4_{2k} \u2208 (0, 1), i.e., for all \u0394 \u2208 \u211d^{m\u00d7n} such that rank(\u0394) \u2264 2k, it holds that\n\n(1 \u2212 \u03b4_{2k}) \u2016\u0394\u2016_F^2 \u2264 \u2016A(\u0394)\u2016_2^2 \u2264 (1 + \u03b4_{2k}) \u2016\u0394\u2016_F^2.\n\nSeveral random matrix ensembles satisfy k-RIP for a sufficiently large d with high probability. For\nexample, if each entry of A_i is independently drawn from a sub-Gaussian distribution, then\nA(\u00b7) satisfies 2k-RIP with parameter \u03b4_{2k} with high probability for d = \u03a9(\u03b4_{2k}^{\u22122} k n log n).\nThe following theorem establishes the geometric rate of convergence of the nonconvex optimization\nalgorithms summarized in Algorithm 1.\nTheorem 3.4. Assume there exists a sufficiently small constant C_1 such that A(\u00b7) satisfies 2k-RIP\nwith \u03b4_{2k} \u2264 C_1/k, and the largest and smallest nonzero singular values of M^* are constants, which do\nnot scale with (d, m, n, k). For any pre-specified precision \u03b5, there exist an \u03b7 and universal constants\nC_2 and C_3 such that for all T \u2265 C_2 log(C_3/\u03b5), we have \u2016M^{(T)} \u2212 M^*\u2016_F \u2264 \u03b5.\nThe proof of Theorem 3.4 is provided in Section 4.1 and Appendices A.1 and A.2. Theorem 3.4 implies that all\nthree nonconvex optimization algorithms geometrically converge to the global optimum. Moreover,\nassuming that each entry of A_i is independently drawn from a sub-Gaussian distribution with mean\nzero and variance proxy one, our result further suggests that, to achieve exact low rank matrix recovery,\nour algorithm requires the number of measurements d to satisfy\n\nd = \u03a9(k^3 n log n), (3.10)\n\nsince we assume that \u03b4_{2k} \u2264 C_1/k. This sample complexity result matches the state-of-the-art result\nfor nonconvex optimization methods, which is established by [14]. 
In comparison with their result,\nwhich only covers the alternating exact minimization algorithm, our result holds for a broader variety\nof nonconvex optimization algorithms.\nNote that the sample complexity in (3.10) depends on a polynomial of \u03c3_max(M^*)/\u03c3_min(M^*), which\nis treated as a constant in our paper. If we allow \u03c3_max(M^*)/\u03c3_min(M^*) to increase with the dimension,\nwe can plug the nonconvex optimization algorithms into the multi-stage framework proposed by [14].\nFollowing similar lines to the proof of Theorem 3.4, we can derive a new sample complexity, which\nis independent of \u03c3_max(M^*)/\u03c3_min(M^*). See more details in [14].\n\n4 Proof of Main Results\nDue to space limitations, we only sketch the proof of Theorem 3.4 for alternating exact minimization.\nThe proof of Theorem 3.4 for alternating gradient descent and gradient descent, and related lemmas\nare provided in the appendix. For notational simplicity, let \u03c3_1 = \u03c3_max(M^*) and \u03c3_k = \u03c3_min(M^*).\nBefore we proceed with the main proof, we first introduce the following lemma, which verifies\nCondition 3.1.\nLemma 4.1. Suppose that A(\u00b7) satisfies 2k-RIP with parameter \u03b4_{2k}. Given an arbitrary orthonormal\nmatrix U \u2208 \u211d^{m\u00d7k}, for any V, V' \u2208 \u211d^{n\u00d7k}, we have\n\n(1 + \u03b4_{2k})/2 \u00b7 \u2016V' \u2212 V\u2016_F^2 \u2265 F(U, V') \u2212 F(U, V) \u2212 \u27e8\u2207_V F(U, V), V' \u2212 V\u27e9 \u2265 (1 \u2212 \u03b4_{2k})/2 \u00b7 \u2016V' \u2212 V\u2016_F^2.\n\nThe proof of Lemma 4.1 is provided in Appendix B.1. Lemma 4.1 implies that F(U, \u00b7) is strongly\nconvex and smooth in V given a fixed orthonormal matrix U, as specified in Condition 3.1. Equipped\nwith Lemma 4.1, we now lay out the proof for each update scheme in Algorithm 1.\n\nV \u21e4(t)>, where U\u21e4(t) is orthonormal. We choose the projected oracle divergence as\nV (t+0.5)V \u21e4(t)\n\n4.1 Proof of Theorem 3.4 (Alternating Exact Minimization)\nProof. Throughout the proof of alternating exact minimization, we define a constant \u03be \u2208 (1, \u221e)\nfor notational simplicity. 
We assume that at the t-th iteration, there exists a matrix factorization of\nM\u21e4 = U\u21e4(t)\nkV (t+0.5)V \u21e4(t)kF.\nD(V (t+0.5), V (t+0.5), U\nRemark 4.2. Note that the matrix factorization is not necessarily unique. Because given a factoriza-\ntion of M\u21e4 = U V >, we can always obtain a new factorization of M\u21e4 = eUeV >, where eU = U O and\neV = V O for an arbitrary unitary matrix O 2 Rk\u21e5k. However, this is not a issue to our convergence\n\nanalysis. As will be shown later, we can prove that there always exists a factorization of M\u21e4 satisfying\nthe desired computational properties for each iteration (See Lemma 4.5, Corollaries 4.7 and 4.8).\n\n)=\u2327rV F(U\u21e4(t)\n\n, V (t+0.5))rV F(U\n\n, V (t+0.5)),\n\n(t)\n\n(t)\n\n2k \uf8ff\n\nand kU\n) \uf8ff\n\nThe following lemma establishes an upper bound for the projected oracle divergence.\nLemma 4.3. Suppose that 2k and U\np2(1  2k)2k\n\n(t) satisfy\n\n.\n\n(t)  U\u21e4(t)kF \uf8ff\n\n(1  2k)k\n4\u21e0(1 + 2k)1\n\n(t)\n\n2\u21e0\n\nkU\n\n(1  2k)k\n\n(t)  U\u21e4(t)kF.\n\n4\u21e0k(1 + 2k)1\nThen we have D(V (t+0.5), V (t+0.5), U\nThe proof of Lemma 4.3 is provided in Appendix B.2. Lemma 4.3 shows that the projected oracle di-\n(t).The following lemma quanti\ufb01es\nvergence for updating V diminishes with the estimation error of U\nthe progress of an exact minimization step using the projected oracle divergence.\nLemma 4.4. We have kV (t+0.5)  V \u21e4(t)kF \uf8ff 1/(1  2k) \u00b7 D(V (t+0.5), V (t+0.5), U\nThe proof of Lemma 4.4 is provided in Appendix B.3. Lemma 4.4 illustrates that the estimation error\nof V (t+0.5) diminishes with the projected oracle divergence. The following lemma characterizes the\neffect of the renormalization step using QR decomposition.\nLemma 4.5. 
Suppose that V (t+0.5) satis\ufb01es\n\n).\n\n(t)\n\n(4.1)\n\n(0).\n\nkV (t+0.5)  V \u21e4(t)kF \uf8ff k/4.\n(t+1)  V \u21e4(t+1)kF \uf8ff 2/k \u00b7k V (t+0.5)  V \u21e4(t)kF.\n\n(4.2)\nThen there exists a factorization of M\u21e4 = U\u21e4(t+1)V \u21e4(t+1) such that V \u21e4(t+0.5) 2 Rn\u21e5k is an\northonormal matrix, and satis\ufb01es kV\nThe proof of Lemma 4.5 is provided in Appendix B.4. The next lemma quanti\ufb01es the accuracy of the\ninitialization U\nLemma 4.6. Suppose that 2k satis\ufb01es\n2k \uf8ff\n\n(1  2k)24\n(4.3)\nV \u21e4(0)> such that U\u21e4(0) 2 Rm\u21e5k is an orthonormal\n.\n\nThen there exists a factorization of M\u21e4 = U\u21e4(0)\n(0)  U\u21e4kF \uf8ff (12k)k\nmatrix, and satis\ufb01es kU\nThe proof of Lemma 4.6 is provided in Appendix B.5. Lemma 4.6 implies that the initial solution\nU\n\n(0) attains a suf\ufb01ciently small estimation error.\n\n192\u21e02k(1 + 2k)24\n1\n\n4\u21e0(1+2k)1\n\nk\n\n.\n\nCombining the above Lemmas, we obtain the next corollary for a complete iteration of updating V .\nCorollary 4.7. Suppose that 2k and U\n(1  2k)24\n\n(t) satisfy\n\nk\n\nand kU\n\n(t)  U\u21e4(t)kF \uf8ff\n\n(1  2k)k\n4\u21e0(1 + 2k)1\n\n192\u21e02k(1 + 2k)24\n1\n\n2k \uf8ff\n\n.\n\n(4.4)\n(t+1)  V \u21e4(t+1)kF \uf8ff\n\n1\n\nWe then have kV\n\u21e0kU\n\n(t+1)  V \u21e4(t+1)kF \uf8ff (12k)k\n(t)  U\u21e4(t)kF and kV (t+0.5)  V \u21e4(t)kF \uf8ff k\n6\n\n4\u21e0(1+2k)1\n\n. Moreover, we also have kV\n2\u21e0 kU\n\n(t)  U\u21e4(t)kF.\n\n\fk\n\n2k \uf8ff\n\n(1  2k)24\n\n192\u21e02k(1 + 2k)24\n1\n\nThe proof of Corollary 4.7 is provided in Appendix B.6. Since the alternating exact minimization\nalgorithm updates U and V in a symmetric manner, we can establish similar results for a complete\niteration of updating U in the next corollary.\nCorollary 4.8. Suppose that 2k and V\n\n(t+1)  V \u21e4(t+1)kF \uf8ff\nV \u21e4(t+1)> such U\u21e4(t+1) is an orthonormal matrix,\n(t+1)  U\u21e4(t+1)kF \uf8ff\n. 
Moreover, we also have kU\n\n(t+1) satisfy\nand\nkV\nThen there exists a factorization of M\u21e4 = U\u21e4(t+1)\n(t+1)  U\u21e4(t+1)kF \uf8ff (12k)k\nand satis\ufb01es kU\n(t+1)  V \u21e4(t+1)kF.\n\u21e0kV\nThe proof of Corollary 4.8 directly follows Appendix B.6, and is therefore omitted.\nWe then proceed with the proof of Theorem 3.4 for alternating exact minimization. Lemma 4.6\n(0). Then Corollary 4.7 ensures that (4.5) of Corollary\nensures that (4.4) of Corollary 4.7 holds for U\n(1). By induction, Corollaries 4.7 and 4.8 can be applied recursively for all T iterations.\n4.8 holds for V\nThus we obtain\nkV\n\n(t+1)  V \u21e4(t+1)kF and kU (t+0.5)  U\u21e4(t+1)kF \uf8ff k\n\n(T )  V \u21e4(T )kF \uf8ff\n\n(1  2k)k\n4\u21e0(1 + 2k)1\n\n2\u21e0 kV\n\n4\u21e0(1+2k)1\n\n(4.5)\n\n1\n\n.\n\n(T1)  V \u21e4(T1)kF\n(1  2k)k\n\n4\u21e02T (1 + 2k)1\n\n,\n\n(4.6)\n\nwhere the last inequality comes from Lemma 4.6. Therefore, for a pre-speci\ufb01ed accuracy \u270f, we need\n\n1\n\n1\n\u21e0kU\n\n\uf8ff\u00b7\u00b7\u00b7\uf8ff\n\n(T1)  U\u21e4(T1)kF \uf8ff\n\u21e02T1kU\n\n1\n\u21e02kV\n(0)  U\u21e4(0)kF \uf8ff\n2\u270f(1+2k)1\u2318 log1 \u21e0m iterations such that\n\nkV\n\n(T )  V \u21e4(T )kF \uf8ff\n\n(1  2k)k\n\n4\u21e02T (1 + 2k)1 \uf8ff\n\n\u270f\n2\n\n.\n\nat most T =l1/2 \u00b7 log\u21e3 (12k)k\n\nMoreover, Corollary 4.8 implies\n\n,\n\n(4.7)\n\n(4.8)\n\nwhere the last inequality comes from (4.6). 
Therefore, we need at most\n\n(1  2k)2\n\nk\n\n8\u21e02T +1(1 + 2k)1\n\nkU (T0.5)  U\u21e4(T )kF \uf8ff\n\nk\n2\u21e0 kV\n\n(T )  V \u21e4(T )kF \uf8ff\n4\u21e0\u270f(1 + 2k)\u25c6 log1 \u21e0\u21e1\n\nk\n\nT =\u21e01/2 \u00b7 log\u2713 (1  2k)2\nkU (T0.5)  U\u21e4kF \uf8ff\n\n(1  2k)2\n\n8\u21e02T +1(1 + 2k)1 \uf8ff\n\nk\n\n\u270f\n21\n\n.\n\niterations such that\n\nThen combining (4.7) and (4.8), we obtain\n\nkM (T )  M\u21e4k = kU (T0.5)V\n\n(T )>  U\u21e4(T )V \u21e4(T )>kF\n\nwhere the last inequality is from kV\n1 (since U\u21e4(T )V \u21e4(T )> = M\u21e4 and V \u21e4(T ) is orthonormal). Thus we complete the proof.\n\n(T )k2 = 1 (since V\n\n\uf8ff kV\n\n(T )k2kU (T0.5)  U\u21e4(T )kF + kU\u21e4(T )k2kV\n\n(T )  V \u21e4(T )kF \uf8ff \u270f,\n\n(4.9)\n(T ) is orthonormal) and kU\u21e4k2 = kM\u21e4k2 =\n\n5 Extension to Matrix Completion\nUnder the same setting as matrix sensing, we observe a subset of the entries of M\u21e4, namely, W\u2713\n{1, . . . , m}\u21e5{ 1, . . . , n}. We assume that W is drawn uniformly at random, i.e., M\u21e4i,j is observed\nindependently with probability \u00af\u21e2 2 (0, 1]. To exactly recover M\u21e4, a common assumption is the\nincoherence of M\u21e4, which will be speci\ufb01ed later. A popular approach for recovering M\u21e4 is to solve\nthe following convex optimization problem\n\n(5.1)\nwhere PW (M ) : Rm\u21e5n ! Rm\u21e5n is an operator de\ufb01ned as [PW (M )]ij = Mij if (i, j) 2W , and\n0 otherwise. Similar to matrix sensing, existing algorithms for solving (5.1) are computationally\n\nsubject to PW (M\u21e4) = PW (M ),\n\nM2Rm\u21e5n kMk\u21e4\n\nmin\n\n7\n\n\finef\ufb01cient. Hence, in practice we usually consider the following nonconvex optimization problem\n\nmin\n\nU2Rm\u21e5k,V 2Rn\u21e5k FW (U, V ), where FW (U, V ) = 1/2 \u00b7 kPW (M\u21e4) P W (U V >)k2\nF.\n\nSimilar to matrix sensing, (5.2) can also be ef\ufb01ciently solved by gradient-based algorithms. 
Due to\nspace limitations, we present these matrix completion algorithms in Algorithm 2 of Appendix D. For\nthe convenience of later convergence analysis, we partition the observation set W into 2T + 1 subsets\nW_0, ..., W_{2T} using Algorithm 4 in Appendix D. However, in practice we do not need the partition\nscheme, i.e., we simply set W_0 = \u00b7\u00b7\u00b7 = W_{2T} = W.\nBefore we present the main results, we introduce an assumption known as the incoherence property.\nAssumption 5.1. The target rank k matrix M^* is incoherent with parameter \u00b5, i.e., given the rank k\nsingular value decomposition of M^* = U^* \u03a3^* V^{*\u22a4}, we have\n\nmax_i \u2016U^*_{i*}\u2016_2 \u2264 \u00b5 \u221a(k/m) and max_j \u2016V^*_{j*}\u2016_2 \u2264 \u00b5 \u221a(k/n).\n\nThe incoherence assumption guarantees that M^* is far from a sparse matrix, which makes it feasible\nto complete M^* when its entries are missing uniformly at random. The following theorem establishes\nthe iteration complexity and the estimation error under the Frobenius norm.\nTheorem 5.2. Suppose that there exists a universal constant C_4 such that \u03c1\u0304 satisfies\n\n\u03c1\u0304 \u2265 C_4 \u00b5^2 k^3 log n log(1/\u03b5)/m, (5.3)\n\nwhere \u03b5 is the pre-specified precision. Then there exist an \u03b7 and universal constants C_5 and C_6 such\nthat for any T \u2265 C_5 log(C_6/\u03b5), we have \u2016M^{(T)} \u2212 M^*\u2016_F \u2264 \u03b5 with high probability.\nDue to space limitations, we defer the proof of Theorem 5.2 to the longer version of this paper. Theorem\n5.2 implies that all three nonconvex optimization algorithms converge to the global optimum at a\ngeometric rate. Furthermore, our results indicate that the completion of the true low rank matrix M^*\nup to \u03b5-accuracy requires the entry observation probability \u03c1\u0304 to satisfy\n\n\u03c1\u0304 = \u03a9(\u00b5^2 k^3 log n log(1/\u03b5)/m). (5.4)\n\nThis result matches the result established by [8], which is the state-of-the-art result for alternating\nminimization. 
Moreover, our analysis covers three nonconvex optimization algorithms.\n\n6 Experiments\nWe present numerical experiments for matrix sensing to support our theoretical analysis. We choose\nm = 30, n = 40, and k = 5, and vary d from 300 to 900. Each entry of the A_i\u2019s is independently sampled\nfrom N(0, 1). We then generate M = U\u0303 V\u0303^\u22a4, where U\u0303 \u2208 \u211d^{m\u00d7k} and V\u0303 \u2208 \u211d^{n\u00d7k} are two matrices\nwith all their entries independently sampled from N(0, 1/k). We then generate d measurements by\nb_i = \u27e8A_i, M\u27e9 for i = 1, ..., d. Figure 1 illustrates the empirical performance of the alternating exact\nminimization and alternating gradient descent algorithms for a single realization. The step size for the\nalternating gradient descent algorithm is determined by the backtracking line search procedure. We\nsee that both algorithms attain a linear rate of convergence for d = 600 and d = 900. Both algorithms\nfail for d = 300, because d = 300 is below the minimum requirement of sample complexity for\nexact matrix recovery.\n\n[Figure 1 plots: (a) Alternating Exact Minimization; (b) Alternating Gradient Descent. Each panel shows Estimation Error (vertical axis, from 10^0 down to 10^-5) against Number of Iterations (horizontal axis, 0 to 40) for d = 300, d = 600, and d = 900.]\n\nFigure 1: Two illustrative examples for matrix sensing. The vertical axis corresponds to estimation\nerror \u2016M^{(t)} \u2212 M\u2016_F. The horizontal axis corresponds to the number of iterations. Both the alternating\nexact minimization and alternating gradient descent algorithms attain a linear rate of convergence for\nd = 600 and d = 900. But both algorithms fail for d = 300, because d = 300 is below the minimum\nrequirement of sample complexity for exact matrix recovery.\n\nReferences\n[1] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. 
Simple, efficient, and neural algorithms for sparse coding. arXiv preprint arXiv:1503.00778, 2015.\n[2] Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv preprint arXiv:1408.2156, 2014.\n[3] Emmanuel J Candès, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.\n[4] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.\n[5] Emmanuel J Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.\n[6] Yudong Chen. Incoherence-optimal matrix completion. arXiv preprint arXiv:1310.0154, 2013.\n[7] David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011.\n[8] Moritz Hardt. Understanding alternating minimization for matrix completion. In Symposium on Foundations of Computer Science, pages 651–660, 2014.\n[9] Moritz Hardt, Raghu Meka, Prasad Raghavendra, and Benjamin Weitz. Computational limits for matrix completion. arXiv preprint arXiv:1402.2331, 2014.\n[10] Moritz Hardt and Mary Wootters. Fast matrix completion without the condition number. arXiv preprint arXiv:1407.4070, 2014.\n[11] Trevor Hastie, Rahul Mazumder, Jason Lee, and Reza Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. arXiv preprint arXiv:1410.2596, 2014.\n[12] Prateek Jain, Raghu Meka, and Inderjit S Dhillon. Guaranteed rank minimization via singular value projection. In Advances in Neural Information Processing Systems, pages 937–945, 2010.\n[13] Prateek Jain and Praneeth Netrapalli. 
Fast exact matrix completion with finite samples. arXiv preprint arXiv:1411.1087, 2014.\n[14] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Symposium on Theory of Computing, pages 665–674, 2013.\n[15] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.\n[16] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11:2057–2078, 2010.\n[17] Yehuda Koren. The BellKor solution to the Netflix grand prize. Netflix Prize Documentation, 81, 2009.\n[18] Kiryung Lee and Yoram Bresler. ADMiRA: Atomic decomposition for minimum rank approximation. IEEE Transactions on Information Theory, 56(9):4402–4416, 2010.\n[19] Sahand Negahban and Martin J Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.\n[20] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer, 2004.\n[21] Arkadiusz Paterek. Improving regularized singular value decomposition for collaborative filtering. In Proceedings of KDD Cup and workshop, volume 2007, pages 5–8, 2007.\n[22] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–3430, 2011.\n[23] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.\n[24] Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, 2013.\n[25] Angelika Rohde and Alexandre B Tsybakov. 
Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2):887–930, 2011.\n[26] Gilbert W Stewart, Ji-guang Sun, and Harcourt B Jovanovich. Matrix perturbation theory, volume 175. Academic Press, New York, 1990.\n[27] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. arXiv preprint arXiv:1411.8003, 2014.\n[28] Gábor Takács, István Pilászy, Bottyán Németh, and Domonkos Tikk. Major components of the gravity recommendation system. ACM SIGKDD Explorations Newsletter, 9(2):80–83, 2007.", "award": [], "sourceid": 386, "authors": [{"given_name": "Tuo", "family_name": "Zhao", "institution": null}, {"given_name": "Zhaoran", "family_name": "Wang", "institution": "Princeton University"}, {"given_name": "Han", "family_name": "Liu", "institution": "Princeton University"}]}