{"title": "A Non-convex One-Pass Framework for Generalized Factorization Machine and Rank-One Matrix Sensing", "book": "Advances in Neural Information Processing Systems", "page_first": 1633, "page_last": 1641, "abstract": "We develop an efficient alternating framework for learning a generalized version of Factorization Machine (gFM) on streaming data with provable guarantees. When the instances are sampled from $d$-dimensional random Gaussian vectors and the target second order coefficient matrix in gFM is of rank $k$, our algorithm converges linearly, achieves $O(\\epsilon)$ recovery error after retrieving $O(k^{3}d\\log(1/\\epsilon))$ training instances, consumes $O(kd)$ memory in one pass over the dataset and only requires matrix-vector product operations in each iteration. The key ingredient of our framework is a construction of an estimation sequence endowed with a so-called Conditionally Independent RIP condition (CI-RIP). As special cases of gFM, our framework can be applied to symmetric or asymmetric rank-one matrix sensing problems, such as inductive matrix completion and phase retrieval.", "full_text": "A Non-convex One-Pass Framework for Generalized\nFactorization Machine and Rank-One Matrix Sensing\n\nMing Lin\nUniversity of Michigan\nlinmin@umich.edu\n\nJieping Ye\nUniversity of Michigan\njpye@umich.edu\n\nAbstract\n\nWe develop an efficient alternating framework for learning a generalized version of\nFactorization Machine (gFM) on streaming data with provable guarantees. When\nthe instances are sampled from $d$-dimensional random Gaussian vectors and the\ntarget second order coefficient matrix in gFM is of rank $k$, our algorithm converges\nlinearly, achieves $O(\\epsilon)$ recovery error after retrieving $O(k^{3}d\\log(1/\\epsilon))$ training\ninstances, consumes $O(kd)$ memory in one pass over the dataset and only requires\nmatrix-vector product operations in each iteration.
The key ingredient of our framework is\na construction of an estimation sequence endowed with a so-called Conditionally\nIndependent RIP condition (CI-RIP). As special cases of gFM, our framework can\nbe applied to symmetric or asymmetric rank-one matrix sensing problems, such as\ninductive matrix completion and phase retrieval.\n\n1 Introduction\n\nLinear models are one of the foundations of modern machine learning due to their strong learning\nguarantees and efficient solvers [Koltchinskii, 2011]. Conventionally, linear models only consider the\nfirst order information of the input features, which limits their capacity in non-linear problems. Among\nvarious efforts extending linear models to the non-linear domain, the Factorization Machine [Rendle,\n2010] (FM) captures the second order information by modeling the pairwise feature interaction in\nregression under low-rank constraints. FMs have been found successful in many applications, such as\nrecommendation systems [Rendle et al., 2011] and text retrieval [Hong et al., 2013]. In this paper, we\nconsider a generalized version of FM called gFM which removes several redundant constraints in\nthe original FM, such as the positive semi-definite and zero-diagonal constraints, leading to a more\ngeneral model without sacrificing its learning ability. From the theoretical side, the gFM includes rank-one matrix\nsensing [Zhong et al., 2015, Chen et al., 2015, Cai and Zhang, 2015, Kueng et al., 2014] as a special\ncase, which has been widely studied in contexts such as inductive matrix completion\n[Jain and Dhillon, 2013] and phase retrieval [Candes et al., 2011].\nDespite the popularity of FMs in industry, there are few theoretical studies of learning guarantees for\nFMs. One of the main challenges in developing a provable FM algorithm is to handle its symmetric\nrank-one matrix sensing operator.
For conventional matrix sensing problems where the matrix sensing\noperator is RIP, there are several alternating methods with provable guarantees [Hardt, 2013, Jain\net al., 2013, Hardt and Wootters, 2014, Zhao et al., 2015a,b]. However, for a symmetric rank-one\nmatrix sensing operator, the RIP condition does not hold trivially, which turns out to be the main\ndifficulty in designing efficient provable FM solvers.\nIn rank-one matrix sensing, when the sensing operator is asymmetric, the problem is also known as\ninductive matrix completion, which can be solved via alternating minimization with a global linear\nconvergence rate [Jain and Dhillon, 2013, Zhong et al., 2015]. For symmetric rank-one matrix sensing\noperators, we are not aware of any efficient solver at the time of writing this paper. In the special case\nwhere the target matrix is of rank one, the problem is called \u201cphase retrieval\u201d, whose convex solver\nwas first proposed by Candes et al. [2011], followed by alternating methods [Lee et al., 2013,\nNetrapalli et al., 2013]. When the target matrix is of rank k > 1, only convex methods minimizing\nthe trace norm have been proposed recently, which are computationally expensive [Kueng et al., 2014,\nCai and Zhang, 2015, Chen et al., 2015, Davenport and Romberg, 2016].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nBeyond the above fundamental challenges, extending rank-one matrix sensing algorithms to gFM\nis itself difficult. Please refer to Section 2.1 for an in-depth discussion. The main difficulty is due to\nthe first order term in the gFM formulation, which cannot be trivially converted to a standard matrix\nsensing problem.\nIn this paper, we develop a unified theoretical framework and an efficient solver for generalized\nFactorization Machine and its special cases such as rank-one matrix sensing, either symmetric or\nasymmetric.
The key ingredient is to show that the sensing operator in gFM satisfies a so-called\nConditionally Independent RIP condition (CI-RIP, see Definition 3). Then we can construct an\nestimation sequence via noisy power iteration [Hardt and Price, 2013]. Unlike previous approaches,\nour method does not require alternating minimization or choosing the step-size as in alternating\ngradient descent. The proposed method works on streaming data, converges linearly and has $O(kd)$\nspace complexity for a $d$-dimensional rank-$k$ gFM model. The solver achieves $O(\\epsilon)$ recovery error\nafter retrieving $O(k^{3}d\\log(1/\\epsilon))$ training instances.\nThe remainder of this paper is organized as follows. In Section 2, we introduce necessary notation\nand background of gFM. Subsection 2.1 investigates several fundamental challenges in depth. Section\n3 presents our algorithm, called One-Pass gFM, followed by its theoretical guarantees. Our analysis\nframework is presented in Section 4. Section 5 concludes this paper.\n\n2 Generalized Factorization Machine (gFM)\n\nIn this section, we first introduce necessary notation and background of FM and its generalized\nversion gFM. Then in Subsection 2.1, we reveal the connection between gFM and rank-one matrix\nsensing followed by several fundamental challenges encountered when applying frameworks of\nrank-one matrix sensing to gFM.\nThe FM predicts the labels of instances by not only their features but also high order interactions\nbetween features. In the following, we focus on the second order FM due to its popularity. Suppose\nwe are given $N$ training instances $x_i \\in \\mathbb{R}^d$ sampled independently and identically (i.i.d.) from\nthe standard Gaussian distribution, together with their associated labels $y_i \\in \\mathbb{R}$. Denote the feature\nmatrix $X = [x_1, x_2, \\cdots, x_n] \\in \\mathbb{R}^{d \\times n}$ and the label vector $y = [y_1, y_2, \\cdots, y_n]^\\top \\in \\mathbb{R}^n$.
In second\norder FM, $y_i$ is assumed to be generated from a target vector $w^* \\in \\mathbb{R}^d$ and a target rank-$k$ matrix\n$M^* \\in \\mathbb{R}^{d \\times d}$ satisfying\n\n$y_i = x_i^\\top w^* + x_i^\\top M^* x_i + \\xi_i$    (1)\n\nwhere $\\xi_i$ is a random subgaussian noise with proxy variance $\\xi^2$. It is often more convenient\nto write Eq. (1) in matrix form. Denote the linear operator $\\mathcal{A} : \\mathbb{R}^{d \\times d} \\to \\mathbb{R}^n$ as\n$\\mathcal{A}(M) \\triangleq [\\langle A_1, M \\rangle, \\langle A_2, M \\rangle, \\cdots, \\langle A_n, M \\rangle]^\\top$ where $A_i = x_i x_i^\\top$. Then Eq. (1) has a compact form:\n\n$y = X^\\top w^* + \\mathcal{A}(M^*) + \\xi$ .    (2)\n\nThe FM model given by Eq. (2) consists of two components: the first order component $X^\\top w^*$ and\nthe second order component $\\mathcal{A}(M^*)$. The component $\\mathcal{A}(M^*)$ is a symmetric rank-one Gaussian\nmeasurement since $\\mathcal{A}_i(M) = x_i^\\top M x_i$ where the left/right design vectors ($x_i$ and $x_i^\\top$) are identical.\nThe original FM requires that $M^*$ should be positive semi-definite and the diagonal elements of $M^*$\nshould be zero. However, our analysis shows that both constraints are redundant for learning Eq. (2).\nTherefore in this paper we consider a generalized version of FM which we call gFM where $M^*$ is\nonly required to be symmetric and low rank. To make the recovery of $M^*$ well defined, it is necessary\nto assume $M^*$ to be symmetric. Indeed for any asymmetric matrix $M^*$, there is always a symmetric\nmatrix $M^*_{\\mathrm{sym}} = (M^* + M^{*\\top})/2$ such that $\\mathcal{A}(M^*) = \\mathcal{A}(M^*_{\\mathrm{sym}})$, thus the symmetric constraint does\nnot affect the model. Another standard assumption in rank-one matrix sensing is that the rank of $M^*$\nshould be no more than $k$ for $k \\ll d$. When $w^* = 0$, gFM is equal to the symmetric rank-one matrix\nsensing problem. Recent research has proposed several convex programming methods based on\ntrace norm minimization to recover $M^*$ with a sampling complexity on the order of $O(k^{3}d)$ [Candes\net al., 2011, Cai and Zhang, 2015, Kueng et al., 2014, Chen et al., 2015, Zhong et al., 2015]. Some\nauthors also call gFM the second order polynomial network [Blondel et al., 2016].\nWhen $d$ is much larger than $k$, the convex programming on the trace norm or nuclear norm of $M^*$\nbecomes difficult since $M^*$ can be a $d \\times d$ dense matrix. Although modern convex solvers can scale\nto large $d$ with reasonable computational cost, a more popular strategy to efficiently estimate $w^*$\nand $M^*$ is to decompose $M^*$ as $U V^\\top$ for some $U, V \\in \\mathbb{R}^{d \\times k}$, then alternately update $w, U, V$ to\nminimize the empirical loss function\n\n$\\min_{w,U,V} \\frac{1}{2N} \\| y - X^\\top w - \\mathcal{A}(U V^\\top) \\|_2^2$ .    (3)\n\nThe loss function in Eq. (3) is non-convex. It is even unclear whether an estimator of the optimal\nsolution $\\{w^*, M^*\\}$ of Eq. (3) with a polynomial time complexity exists or not.\nIn our analysis, we denote $M + O(\\epsilon)$ as a matrix $M$ plus a perturbation matrix whose spectral\nnorm is bounded by $\\epsilon$. We use $\\|\\cdot\\|_2$, $\\|\\cdot\\|_F$, $\\|\\cdot\\|_*$ to denote the matrix spectral norm, Frobenius\nnorm and nuclear norm respectively. To abbreviate the high probability bound, we denote\n$C = \\mathrm{polylog}(d, n, T, 1/\\eta)$ to be a constant that is polylogarithmic in $\\{d, n, T, 1/\\eta\\}$. The eigenvalue\ndecomposition of $M^*$ is $M^* = U^* \\Lambda^* U^{*\\top}$ where the columns of $U^* \\in \\mathbb{R}^{d \\times k}$ are the top-$k$ eigenvectors of $M^*$\nand $\\Lambda^* = \\mathrm{diag}(\\lambda_1^*, \\lambda_2^*, \\cdots, \\lambda_k^*)$ are the corresponding eigenvalues sorted by $|\\lambda_i^*| \\ge |\\lambda_{i+1}^*|$. Let\n$\\sigma_i^* = |\\lambda_i^*|$ denote the singular value of $M^*$ and $\\sigma_i\\{M\\}$ be the $i$-th largest singular value of $M$. $U^*_\\perp$\ndenotes a matrix whose columns are an orthogonal basis of the complementary subspace of $U^*$.\n\n2.1 gFM and Rank-One Matrix Sensing\n\nWhen $w^* = 0$ in Eq. (1), the gFM becomes the symmetric rank-one matrix sensing problem.\nWhile the recovery ability of rank-one matrix sensing has recently been established, despite its\ncomputational issues, this is not the case for gFM. It is therefore important to discuss the differences\nbetween gFM and rank-one matrix sensing to give us a better understanding of the fundamental\nbarriers in developing provable gFM algorithms.\nIn the rank-one matrix sensing problem, a relaxed setting is to assume that the sensing operator is\nasymmetric, which is defined by $\\mathcal{A}^{\\mathrm{asy}}_i(M) = u_i^\\top M v_i$ where $u_i$ and $v_i$ are independent random\nvectors. Under this setting, the recovery ability of alternating methods is provable [Jain and Dhillon,\n2013]. However, existing analyses cannot be generalized to their symmetric counterpart, since $u_i$\nand $v_i$ are not allowed to be dependent in these frameworks. For example, the sensing operator\n$\\mathcal{A}^{\\mathrm{asy}}(\\cdot)$ is unbiased ($\\mathbb{E}\\mathcal{A}^{\\mathrm{asy}}(\\cdot) = 0$) but the symmetric sensing operator is clearly not [Cai and\nZhang, 2015]. Therefore, the asymmetric setting oversimplifies the problem and loses important\nstructure information which is critical to gFM.\nAs for the symmetric rank-one matrix sensing operator, the state-of-the-art estimator is based on the\ntrace norm convex optimization [Tropp, 2014, Chen et al., 2015, Cai and Zhang, 2015], which is\ncomputationally expensive. When $w^* \\neq 0$, the gFM has an extra perturbation term $X^\\top w^*$.
This\n\ufb01rst order perturbation term turns out to be a fundamental challenge in theoretical analysis. One might\nattempt to merge w\u2217 into M\u2217 in order to convert gFM as a rank (k + 1) matrix sensing problem. For\nexample, one may extend the feature \u02c6xi (cid:44) [xi, 1](cid:62) and the matrix \u02c6M\u2217 = [M\u2217; w\u2217(cid:62)] \u2208 R(d+1)\u00d7d.\nHowever, after this simple extension, the sensing operator becomes \u02c6A(M\u2217) = \u02c6xi\n(cid:62) \u02c6M\u2217xi. It is no\nlonger symmetric. The left/right design vector is neither independent nor identical. Especially, not\nall dimensions of \u02c6xi are random variables. According to the above discussion, the conditions to\nguarantee the success of rank-one matrix sensing do not hold after feature extension and all the\nmentioned analyses cannot be directly applied.\n\n3 One-Pass gFM\n\nIn this section, we present the proposed algorithm, called One-Pass gFM followed by its theoretical\nguarantees. We will focus on the intuition of our algorithm. A rigorous theoretical analysis is\npresented in the next section.\nThe One-Pass gFM is a mini-batch algorithm. In each mini-batch, it processes n training instances\nand then alternatively updates parameters. The iteration will continue until T mini-batch updates.\n\n3\n\n\fAlgorithm 1 One-Pass gFM\nRequire: The mini-batch size n, number of total mini-batch update T , training instances X =\n\n[x1, x2,\u00b7\u00b7\u00b7 xnT}, y = [y1, y2,\u00b7\u00b7\u00b7 , ynT ](cid:62), desired rank k \u2265 1.\n\nh(t)\n2\n\n(cid:44) 1\n\nn 1(cid:62)(y \u2212 A(M (t)) \u2212 X (t)(cid:62)w(t)) , h(t)\n\nEnsure: w(T ), U (T ), V (T ).\n1: De\ufb01ne M (t) (cid:44) (U (t)V (t)(cid:62) + V (t)U (t)(cid:62))/2 , H (t)\n(cid:44) 1\n1 \u2212 1\n2: Initialize: w(0) = 0, V (0) = 0. 
U (0) = SVD(H (0)\n3: for t = 1, 2,\u00b7\u00b7\u00b7 , T do\n4:\n\nvectors.\n\n3\n\n1\n\n(cid:44) 1\n\n2nA(cid:48)(y \u2212 A(M (t)) \u2212 X (t)(cid:62)w(t)) ,\n\nn X (t)(y \u2212 A(M (t)) \u2212 X (t)(cid:62)w(t)) .\n\n2 h(0)\n\n2 I, k), that is, the top-k left singular\n\nRetrieve n training instances X (t) = [x(t\u22121)n+1,\u00b7\u00b7\u00b7 , x(t\u22121)n+n] . De\ufb01ne A(M ) (cid:44)\n(cid:62)M X (t)\n[X (t)\ni=1.\n]n\n\u02c6U (t) = (H (t\u22121)\n\u2212 1\nOrthogonalize \u02c6U (t) via QR decomposition: U (t) = QR\n\nI + M (t\u22121)(cid:62))U (t\u22121) .\n\n(cid:16) \u02c6U (t)(cid:17)\n\n2 h(t\u22121)\n\n.\n\n1\n\ni\n\ni\n\n2\n\n5:\n6:\n7: w(t) = h(t\u22121)\n+ w(t\u22121) .\nV (t) = (H (t\u22121)\n2 h(t\u22121)\n\u2212 1\n8:\n9: end for\n10: Output: w(T ), U (T ), V (T ) .\n\n2\n\n1\n\n3\n\nI + M (t\u22121))U (t)\n\n2nA(cid:48)(y) \u2212 1\n\nSince gFM deals with a non-convex learning problem, the conventional gradient descent framework\nhardly works to show the global convergence. Instead, our method is based on a construction\nIntuitively, when w\u2217 = 0, we will show in the next section that\nof an estimation sequence.\nnA(cid:48)A(M ) \u2248 2M + tr(M )I and tr(M ) \u2248 1\nn 1(cid:62)A(M ). Since y \u2248 A(M\u2217), we can estimate\n1\nn 1(cid:62)yI. But this simple construction cannot generate a convergent estimation\nM\u2217 via 1\nsequence since the perturbation terms in the above approximate equalities cannot be reduced along\niterations. To overcome this problem, we replace A(M\u2217) with A(M\u2217 \u2212 M (t)) in our construction.\nThen the perturbation terms will be on order of O((cid:107)M\u2217 \u2212 M (t)(cid:107)2). When w\u2217 (cid:54)= 0, we can apply a\nsimilar trick to construct its estimation sequence via the second and the third order moments of X.\nAlgorithm 1 gives a step-by-step description of our algorithm1.\nIn Algorithm 1, we only need to store w(t) \u2208 Rd, U (t), V (t) \u2208 Rd\u00d7k. 
Therefore the space com-\nplexity is O(d + kd). The auxiliary variables M (t), H (t)\ncan be implicitly presented\nby w(t), U (t), V (t). In each mini-batch updating, we only need matrix-vector product operations\nwhich can be ef\ufb01ciently implemented on many computation architectures. We use truncated SVD\nto initialize gFM, a standard initialization step in matrix sensing. We do not require this step to\nbe computed exactly but up to an accuracy of O(\u03b4) where \u03b4 is the RIP constant. The QR step on\nline 6 requires O(k2d) operations. Compared with SVD which requires O(kd2) operations, the QR\nstep is much more ef\ufb01cient when d (cid:29) k. Algorithm 1 retrieves instances streamingly, a favorable\nbehavior on systems with high speed cache. Finally, we export w(T ), U (T ), V (T ) as our estimation\nof w\u2217 \u2248 w(T ) and M\u2217 \u2248 U (T )V (T )(cid:62).\nOur main theoretical result is presented in the following theorem, which gives the convergence rate\nof recovery and sampling complexity of gFM when M\u2217 is low rank and the noise \u03be = 0.\nTheorem 1. Suppose xi\u2019s are independently sampled from the standard Gaussian distribution. M\u2217\nis a rank k matrix. The noise \u03be = 0. Then with a probability at least 1 \u2212 \u03b7, there exists a constant C\nand a constant \u03b4 < 1 such that\n\n2 , h(t)\n\n1 , h(t)\n\n3\n\nprovided n \u2265 C(4\n\n\u221a\n\n(cid:107)w\u2217 \u2212 w(t)(cid:107)2 + (cid:107)M\u2217 \u2212 M (t)(cid:107)2 \u2264\u03b4t((cid:107)w\u2217(cid:107)2 + (cid:107)M\u2217(cid:107)2)\n1/\u03c3\u2217\n5\u03c3\u2217\n\nk + 3)2k3d/\u03b42, \u03b4 \u2264\n\nk+3)\u03c3\u2217\n\u221a\n\n\u221a\n5\u03c3\u2217\n(4\n1 +3\u03c3\u2217\n5\u03c3\u2217\n\n1 /\u03c3\u2217\nk+4\n\n5(cid:107)w\u2217(cid:107)2\n\n\u221a\n4\n\n.\n\nk\n\n2\n\nTheorem 1 shows that {w(t), M (t)} will converge to {w\u2217, M\u2217} linearly. The convergence rate is\nn). 
A small \u03b4 will result in a fast convergence rate\ncontrolled by \u03b4, whose value is on order of O(1/\n\n\u221a\n\n1Implementation is available from https://minglin-home.github.io/\n\n4\n\n\fbut a large sampling complexity. To reduce the sampling complexity, a large \u03b4 is preferred. The largest\nallowed \u03b4 is bounded by O(1/((cid:107)M\u2217(cid:107)2 + (cid:107)w\u2217(cid:107)2)). The sampling complexity is O((\u03c3\u2217\nk)2k3d).\nIf M\u2217 is not well conditioned, it is possible to remove (\u03c3\u2217\nk)2 in the sampling complexity by a\nprocedure called \u201csoft-de\ufb02ation\u201d [Jain et al., 2013, Hardt and Wootters, 2014]. By theorem 1, gFM\nachieves \u0001 recovery error after retrieving nT = O(k3d log (((cid:107)w\u2217(cid:107)2 + (cid:107)M\u2217(cid:107)2)/\u0001)) instances.\nThe noisy case where M\u2217 is not exactly low rank and \u03be > 0 is more intricate therefore we postpone\nit to Subsection 4.1. The main conclusion is similar to the noise-free case Theorem 1 under a small\nnoise assumption.\n\n1/\u03c3\u2217\n\n1/\u03c3\u2217\n\n4 Theoretical Analysis\n\nIn this section, we give the sketch of our proof of Theorem 1. Omitted details are postponed to\nappendix.\n\nFrom high level, our proof constructs an estimation sequence {(cid:101)w(t),(cid:102)M (t), \u0001t} such that \u0001t \u2192 0 and\n(cid:107)w\u2217 \u2212 (cid:101)w(t)(cid:107)2 + (cid:107)M\u2217 \u2212(cid:102)M (t)(cid:107)2 \u2264 \u0001t . In conventional matrix sensing, this construction is possible\n\nwhen the sensing matrix satis\ufb01es the Restricted Isometric Property (RIP) [Cand\u00e8s and Recht, 2009]:\nDe\ufb01nition 2 ((cid:96)2-norm RIP). 
A sensing operator A is (cid:96)2-norm \u03b4k-RIP if for any rank k matrix M,\n\n(1 \u2212 \u03b4k)(cid:107)M(cid:107)2\n\nF \u2264 1\nn\n\n(cid:107)A(M )(cid:107)2\n\n2 \u2264 (1 + \u03b4k)(cid:107)M(cid:107)2\nF .\n\nWhen A is (cid:96)2-norm \u03b4k-RIP for any rank k matrix M, A(cid:48)A is nearly isometric [Jain et al., 2012], which\nimplies (cid:107)M \u2212 A(cid:48)A(M )/n(cid:107)2 \u2264 \u03b4. Then we can construct our estimation sequence as following:\n\n(cid:102)M (t) =\n\n1\nn\n\nA(cid:48)A(M\u2217 \u2212(cid:102)M (t\u22121)) +(cid:102)M (t\u22121) , (cid:101)w(t) = (I \u2212 1\n\nXX(cid:62))(w\u2217 \u2212 (cid:101)w(t\u22121)) + (cid:101)w(t\u22121) .\n\nn\n\nHowever, in gFM and symmetric rank-one matrix sensing, the (cid:96)2-norm RIP condition cannot be\nsatis\ufb01ed with high probability [Cai and Zhang, 2015]. To establish an RIP-like condition for rank-one\nmatrix sensing, several variants have been proposed, such as the (cid:96)2/(cid:96)1-RIP condition [Cai and Zhang,\n2015, Chen et al., 2015]. The essential idea of these variants is to replace the (cid:96)2-norm (cid:107)A(M )(cid:107)2 with\n(cid:96)1-norm (cid:107)A(M )(cid:107)1 then a similar norm inequality can be established for all low rank matrix again.\nHowever, even using these (cid:96)1-norm RIP variants, we are still unable to design an ef\ufb01cient alternating\nalgorithm. All these (cid:96)1-norm RIP variants have to deal with trace norm programming problems. In\nfact, it is impossible to construct an estimation sequence based on (cid:96)1-norm RIP because we require\n(cid:96)2-norm bound on A(cid:48)A during the construction.\nA key ingredient of our framework is to propose a novel (cid:96)2-norm RIP condition to overcome the\nabove dif\ufb01culty. The main technique reason for the failure of conventional (cid:96)2-norm RIP is that it\ntries to bound A(cid:48)A(M ) over all rank k matrices. This is too aggressive to be successful in rank-one\nmatrix sensing. 
Regarding to our estimation sequence, what we really need is to make the RIP hold\nfor current low rank matrix M (t). Once we update our estimation M (t+1), we can regenerate a new\nsensing operator independent of M (t) to avoid bounding A(cid:48)A over all rank k matrices. To this end,\nwe propose the Conditionally Independent RIP (CI-RIP) condition.\nDe\ufb01nition 3 (CI-RIP). A matrix sensing operator A is Conditionally Independent RIP with constant\n\u03b4k, if for a \ufb01xed rank k matrix M, A is sampled independently regarding to M and satis\ufb01es\n\n(cid:107)(I \u2212 1\nn\n\nA(cid:48)A)M(cid:107)2\n\n2 \u2264 \u03b4k .\n\n(4)\n\nAn (cid:96)2-norm or (cid:96)1-norm RIP sensing operator is naturally CI-RIP but the reverse is not true. In CI-RIP,\nA is no longer a \ufb01xed but random sensing operator independent of M. In one-pass algorithm, this is\nachievable if we always retrieve new instances to construct A in one mini-batch updating. Usually\nEq. (4) doesn\u2019t hold in a batch method since M (t+1) depends on A(M (t)).\nAn asymmetric rank-one matrix sensing operator is clearly CI-RIP due to the independency between\nleft/right design vectors. But a symmetric rank-one matrix sensing operator is not CI-RIP. In fact it is\na biased estimator since E(x(cid:62)M x) = tr(M ) . To this end, we propose a shifted version of CI-RIP\nfor symmetric rank-one matrix sensing operator in the following theorem. This theorem is the key\ntool in our analysis.\n\n5\n\n\fTheorem 4 (Shifted CI-RIP). Suppose xi are independent standard random Gaussian vectors, M is\na \ufb01xed symmetric rank k matrix independent of xi and w is a \ufb01xed vector. Then with a probability at\nleast 1 \u2212 \u03b7, provided n \u2265 Ck3d/\u03b42 ,\n\n(cid:107) 1\n2n\n\nA(cid:48)A(M ) \u2212 1\n2\n\ntr(M )I \u2212 M(cid:107)2 \u2264 \u03b4(cid:107)M(cid:107)2 .\n\nRIP constant \u03b4 = O((cid:112)k3d/n) . 
In gFM, we choose M = M\u2217 \u2212 M (t) therefore M is of rank 3k .\n\n2nA(cid:48)A(M ) is nearly isometric after shifting by its expectation 1\n\nTheorem 4 shows that 1\n\n2 tr(M )I. The\n\nn 1(cid:62)A(M )) \u2212 tr(M )| \u2264 \u03b4(cid:107)M(cid:107)2 provided n \u2265 Ck/\u03b42 .\nn 1(cid:62)X(cid:62)w| \u2264 (cid:107)w(cid:107)2\u03b4 provided n \u2265 C/\u03b42 .\nnA(cid:48)(X(cid:62)w)(cid:107)2 \u2264 (cid:107)w(cid:107)2\u03b4 provided n \u2265 Cd/\u03b42 .\nn X(cid:62)A(M )(cid:107)2 \u2264 (cid:107)M(cid:107)2\u03b4 provided n \u2265 Ck2d/\u03b42 .\n\nUnder the same settings of Theorem 4, suppose that d \u2265 C then the following lemmas hold true with\na probability at least 1 \u2212 \u03b7 for \ufb01xed w and M .\nLemma 5. | 1\nLemma 6. | 1\nLemma 7. (cid:107) 1\nLemma 8. (cid:107) 1\nLemma 9. (cid:107)I \u2212 1\nEquipping with the above lemmas, we construct our estimation sequence as following.\nLemma 10. Let M (t), H (t)\n(cid:107)M\u2217 \u2212 M (t)(cid:107)2 . Then with a probability at least 1 \u2212 \u03b7, provided n \u2265 Ck3d/\u03b42 ,\n\n3 be de\ufb01ned as in Algorithm 1. De\ufb01ne \u0001t = (cid:107)w\u2217 \u2212 w(t)(cid:107)2 +\n\nn XX(cid:62)(cid:107)2 \u2264 \u03b4 provided n \u2265 Cd/\u03b42 .\n\n2 , h(t)\n\n1 , h(t)\n\n1 =M\u2217 \u2212 M (t) + tr(M\u2217 \u2212 M (t))I + O(\u03b4\u0001t) , h(t)\nH (t)\n3 =w\u2217 \u2212 w(t) + O(\u03b4\u0001t) .\nh(t)\n\n2 = tr(M\u2217 \u2212 M (t)) + O(\u03b4\u0001t)\n\n1 \u2212h(t)\n\n2 I +M (t) \u2192 M\u2217 and h(t)\n\nSuppose by construction, \u0001t \u2192 0 when t \u2192 \u221e. Then H (t)\n3 +w(t) \u2192\nw\u2217 and then the proof of Theorem 1 is completed. In the following we only need to show that Lemma\n10 constructs an estimation sequence with \u0001t = O(\u03b4t) \u2192 0. To this end, we need a few things from\nmatrix perturbation theory.\nBy Theorem 1, U (t) will converge to U\u2217 up to column order perturbation. 
We use the largest\ncanonical angle to measure the subspace distance spanned by U (t) and U\u2217, which is denoted as\n\u03b8t = \u03b8(U (t), U\u2217). For any matrix U, it is well known [Zhu and Knyazev, 2013] that\nsin \u03b8(U, U\u2217) = (cid:107)U\u2217\n(cid:62)U (U\u2217(cid:62)U )\u22121(cid:107)2 .\n\u22a5\nThe last tangent equality allows us to bound the canonical angle after QR decomposition. Suppose\nU (t)R = \u02c6U (t) in the QR step of Algorithm 1, we have\n\n(cid:62)U(cid:107)2, cos \u03b8(U, U\u2217) = \u03c3k{U\u2217(cid:62)U}, tan \u03b8(U, U\u2217) = (cid:107)U\u2217\n\u22a5\n\ntan \u03b8( \u02c6U (t), U\u2217) = (cid:107)U\u2217\n\u22a5\n= (cid:107)U\u2217\n\u22a5\n\n(cid:62) \u02c6U (t)(U\u2217(cid:62) \u02c6U (t))\u22121(cid:107)2 = (cid:107)U\u2217\n\u22a5\n(cid:62)U (t)(U\u2217(cid:62)U (t))\u22121(cid:107)2 = tan \u03b8(U (t), U\u2217) .\n\n(cid:62)U (t)R(U\u2217(cid:62)U (t)R)\u22121(cid:107)2\n\nTherefore, it is more convenient to measure the subspace distance by tangent function.\nTo show \u0001t \u2192 0, we recursively de\ufb01ne the following variables:\n\n\u03b1t (cid:44) tan \u03b8t, \u03b2t (cid:44) (cid:107)w\u2217 \u2212 w(t)(cid:107)2, \u03b3t (cid:44) (cid:107)M\u2217 \u2212 M (t)(cid:107)2, \u0001t (cid:44) \u03b2t + \u03b3t .\n\nThe following lemma derives the recursive inequalities regarding to {\u03b1t, \u03b2t, \u03b3t} .\n\u221a\nLemma 11. Under the same settings of Theorem 1, suppose \u03b1t \u2264 2, \u03b4\u0001t \u2264 4\n5\u03c3\u2217\n\nk, then\n\n\u221a\n\n6\n\n\u221a\n\n\u03b1t+1 \u2264 4\n\n5\u03b4\u03c3\u2217\u22121\n\nk\n\n(\u03b2t + \u03b3t), \u03b2t+1 \u2264 \u03b4(\u03b2t + \u03b3t), \u03b3t+1 \u2264 \u03b1t+1(cid:107)M\u2217(cid:107)2 + 2\u03b4(\u03b2t + \u03b3t) .\n\nn) is small enough, {\u03b1t, \u03b2t, \u03b3t} will converge\nIn Lemma 11, when we choose n such that \u03b4 = O(1/\nto zero. The only question is the initial value {\u03b10, \u03b20, \u03b30}. 
According to the initialization step of\ngFM, $\\beta_0 \\le \\|w^*\\|_2$ and $\\gamma_0 \\le \\|M^*\\|_2$. To bound $\\alpha_0$, we need the following lemma, which directly\nfollows from Weyl's and Wedin's theorems [Stewart and Sun, 1990].\nLemma 12. Denote $U$ and $\\widetilde{U}$ as the top-$k$ left singular vectors of $M$ and $\\widetilde{M} = M + O(\\epsilon)$ respectively.\nThe $i$-th singular value of $M$ is $\\sigma_i$. Suppose that $\\epsilon \\le (\\sigma_k - \\sigma_{k+1})/4$. Then the largest canonical angle\nbetween $U$ and $\\widetilde{U}$, denoted as $\\theta(U, \\widetilde{U})$, is bounded by $\\sin\\theta(U, \\widetilde{U}) \\le 2\\epsilon/(\\sigma_k - \\sigma_{k+1})$.\nAccording to Lemma 12, when $2\\delta(\\|w^*\\|_2 + \\|M^*\\|_2) \\le \\sigma_k^*/4$, we have $\\sin\\theta_0 \\le 4\\delta(\\|w^*\\|_2 + \\|M^*\\|_2)/\\sigma_k^*$.\nTherefore, $\\alpha_0 \\le 2$ provided $\\delta \\le \\sigma_k^*/[8(\\|w^*\\|_2 + \\|M^*\\|_2)]$.\nProof of Theorem 1. Suppose that at step $t$, $\\alpha_t \\le 2$ and $\\delta\\epsilon_t \\le 4\\sqrt{5}\\sigma_k^*$. From Lemma 11,\n\n$\\beta_{t+1} + \\gamma_{t+1} \\le \\beta_{t+1} + \\alpha_{t+1}\\|M^*\\|_2 + 2\\delta(\\beta_t + \\gamma_t) \\le \\delta\\epsilon_t + 4\\sqrt{5}\\delta\\sigma_k^{*-1}\\epsilon_t\\|M^*\\|_2 + 2\\delta\\epsilon_t = (4\\sqrt{5}\\sigma_1^*/\\sigma_k^* + 3)\\delta\\epsilon_t$ .\n\nTherefore,\n\n$\\epsilon_t = \\beta_t + \\gamma_t \\le [(4\\sqrt{5}\\sigma_1^*/\\sigma_k^* + 3)\\delta]^t(\\beta_0 + \\gamma_0)$ ,\n$\\alpha_{t+1} \\le 4\\sqrt{5}\\delta\\sigma_k^{*-1}(\\beta_t + \\gamma_t) \\le 4\\sqrt{5}\\delta\\sigma_k^{*-1}[(4\\sqrt{5}\\sigma_1^*/\\sigma_k^* + 3)\\delta]^t(\\beta_0 + \\gamma_0)$ .\n\nClearly we need $(4\\sqrt{5}\\sigma_1^*/\\sigma_k^* + 3)\\delta < 1$ to ensure convergence, which is guaranteed by\n$\\delta < \\sigma_k^*/(4\\sqrt{5}\\sigma_1^* + 3\\sigma_k^*)$. To ensure the recursive inequality holds for any $t$, we require $\\alpha_{t+1} \\le 2$, which is\nguaranteed by\n\n$4\\sqrt{5}(\\beta_0 + \\gamma_0)\\delta/\\sigma_k^* \\le 2 \\Leftrightarrow \\delta \\le \\frac{\\sigma_k^*}{2\\sqrt{5}(\\sigma_1^* + \\beta_0)}$ .\n\nTo ensure the condition $\\delta\\epsilon_t \\le 4\\sqrt{5}\\sigma_k^*$,\n\n$\\delta \\le 4\\sqrt{5}\\sigma_k^*/\\epsilon_0 = 4\\sqrt{5}\\sigma_k^*/(\\sigma_1^* + \\beta_0) \\Rightarrow \\delta \\le 4\\sqrt{5}\\sigma_k^*/\\epsilon_t$ .\n\nIn summary, when\n\n$\\delta \\le \\min\\left\\{ \\frac{\\sigma_k^*}{4\\sqrt{5}\\sigma_1^* + 3\\sigma_k^*}, \\frac{\\sigma_k^*}{2\\sqrt{5}(\\sigma_1^* + \\beta_0)}, \\frac{\\sigma_k^*}{8(\\sigma_1^* + \\beta_0)}, \\frac{4\\sqrt{5}\\sigma_k^*}{\\sigma_1^* + \\beta_0} \\right\\} \\Leftarrow \\delta \\le \\frac{\\sigma_k^*}{4\\sqrt{5}\\sigma_1^* + 3\\sigma_k^* + 4\\sqrt{5}\\beta_0}$ ,\n\nwe have\n\n$\\epsilon_t \\le [(4\\sqrt{5}\\sigma_1^*/\\sigma_k^* + 3)\\delta]^t(\\sigma_1^* + \\beta_0)$ .\n\nTo simplify the result, replace $\\delta$ with $\\delta_1 = (4\\sqrt{5}\\sigma_1^*/\\sigma_k^* + 3)\\delta$. The proof is completed.\n\n4.1 Noisy Case\n\nIn this subsection, we analyze the performance of gFM under the noisy setting. Suppose that $M^*$ is no\nlonger low rank, $M^* = U^*\\Lambda^*U^{*\\top} + U^*_\\perp\\Lambda^*_\\perp U_\\perp^{*\\top}$ where $\\Lambda^*_\\perp = \\mathrm{diag}(\\lambda_{k+1}, \\cdots, \\lambda_d)$ is the residual\nspectrum. Denote $M^*_k = U^*\\Lambda^*U^{*\\top}$ to be the best rank-$k$ approximation of $M^*$ and $M^*_\\perp = M^* - M^*_k$.\nThe additive noise terms $\\xi_i$ are independently sampled from a subgaussian distribution with proxy variance $\\xi^2$.\nFirst we generalize the above theorems and lemmas to the noisy case.\nLemma 13. Suppose that in Eq.
(1) the $x_i$'s are independent standard random Gaussian vectors, $M$ is a fixed rank-$k$ matrix, $M_\\perp^* \\neq 0$ and $\\xi > 0$. Then provided $n \\ge Ck^3d/\\delta^2$, with probability at least $1 - \\eta$,\n\n$$\\Big\\| \\frac{1}{2n}A'A(M^* - M) - \\frac{1}{2}\\mathrm{tr}(M_k^* - M)I - (M_k^* - M) \\Big\\|_2 \\le \\delta\\|M_k^* - M\\|_2 + C\\sigma_{k+1}^*d^2/\\sqrt{n} \\quad (5)$$\n\n$$\\Big| \\frac{1}{n}\\mathbf{1}^\\top A(M^* - M) - \\mathrm{tr}(M_k^* - M) \\Big| \\le \\delta\\|M_k^* - M\\|_2 + C\\sigma_{k+1}^*d^2/\\sqrt{n} \\quad (6)$$\n\n$$\\Big\\| \\frac{1}{n}X^\\top A(M^* - M) \\Big\\|_2 \\le \\delta\\|M_k^* - M\\|_2 + C\\sigma_{k+1}^*d^2/\\sqrt{n} \\quad (7)$$\n\n$$\\Big\\| \\frac{1}{n}A'(X^\\top w) \\Big\\|_2 \\le \\delta\\|w\\|_2 , \\quad \\Big\\| \\frac{1}{n}\\mathbf{1}^\\top X^\\top w \\Big\\|_2 \\le \\delta\\|w\\|_2 . \\quad (8)$$\n\nDefine $\\gamma_t = \\|M_k^* - M^{(t)}\\|_2$ similar to the noise-free case. According to Lemma 13, when $\\xi = 0$, for $n \\ge Ck^3d/\\delta^2$,\n\n$$H_1^{(t)} = M_k^* - M^{(t)} + \\frac{1}{2}\\mathrm{tr}(M_k^* - M^{(t)})I + O(\\delta\\epsilon_t + C\\sigma_{k+1}^*d^2/\\sqrt{n})$$\n\n$$h_2^{(t)} = \\mathrm{tr}(M^* - M^{(t)}) + O(\\delta\\epsilon_t + C\\sigma_{k+1}^*d^2/\\sqrt{n})$$\n\n$$h_3^{(t)} = w^* - w^{(t)} + O(\\delta\\epsilon_t + C\\sigma_{k+1}^*d^2/\\sqrt{n}) .$$\n\nDefine $r = C\\sigma_{k+1}^*d^2/\\sqrt{n}$. If $\\xi > 0$, it is easy to check that the perturbation becomes $\\hat{r} = r + O(\\xi/\\sqrt{n})$. Therefore we uniformly use $r$ to denote the perturbation term. The recursive inequalities for the recovery error are given in Lemma 14.\n\nLemma 14. Under the same settings as Lemma 13, define $\\rho \\triangleq 2\\sigma_{k+1}^*/(\\sigma_k^* + \\sigma_{k+1}^*)$. Suppose that at any step $i$, $0 \\le i \\le t$, $\\alpha_i \\le 2$. Provided $4\\sqrt{5}(\\delta\\epsilon_t + r) \\le \\sigma_k^* - \\sigma_{k+1}^*$,\n\n$$\\alpha_{t+1} \\le \\rho\\alpha_t + \\frac{4\\sqrt{5}}{\\sigma_k^* + \\sigma_{k+1}^*}\\delta\\epsilon_t + \\frac{4\\sqrt{5}}{\\sigma_k^* + \\sigma_{k+1}^*}r , \\quad \\beta_{t+1} \\le \\delta\\epsilon_t + r , \\quad \\gamma_{t+1} \\le \\alpha_{t+1}\\|M^*\\|_2 + 2\\delta\\epsilon_t + 2r .$$\n\nThe solution to the recursive inequalities in Lemma 14 is non-trivial. Compared with the inequalities in Lemma 11, $\\alpha_{t+1}$ is now also bounded in terms of $\\alpha_t$ in the noisy case. Therefore, if we simply follow Lemma 11 to construct a recursive inequality about $\\epsilon_t$, we will quickly be overloaded by recursive expansion terms. The key construction of our solution is to bound the term $\\alpha_t + 8\\sqrt{5}/(\\sigma_k^* + \\sigma_{k+1}^*)\\,\\delta\\epsilon_t$. The solution is given in the following theorem.\n\nTheorem 15. Define constants\n\n$$c = 4\\sqrt{5}/(\\sigma_k^* + \\sigma_{k+1}^*) , \\quad b = 3 + 4\\sqrt{5}\\sigma_1^*/(\\sigma_k^* + \\sigma_{k+1}^*) , \\quad q = (1 + \\rho)/2 .$$\n\nThen for any $t \\ge 0$,\n\n$$\\alpha_t + 2c\\delta\\epsilon_t \\le q^t\\Big( 2 - \\frac{(1 + \\rho)cr}{1 - q} \\Big) + \\frac{(1 + \\rho)cr}{1 - q} , \\quad (9)$$\n\nprovided\n\n$$\\delta \\le \\min\\Big\\{ \\frac{1 - \\rho}{4\\rho\\sigma_1^*c} , \\frac{\\rho}{2b} \\Big\\} , \\quad (2 + c(\\sigma_k^* - \\sigma_{k+1}^*))\\delta\\epsilon_0 + r \\le (\\sigma_k^* - \\sigma_{k+1}^*) ,$$\n\n$$4\\sqrt{5}\\big( 4 + 2c(\\sigma_k^* - \\sigma_{k+1}^*) \\big)\\delta\\epsilon_0 + 4\\sqrt{5}\\big( 4 + (\\sigma_k^* - \\sigma_{k+1}^*) \\big)r \\le (\\sigma_k^* - \\sigma_{k+1}^*)^2 . \\quad (10)$$\n\nTheorem 15 gives the convergence rate of gFM in the noisy setting. We bound $\\alpha_t + 2c\\delta\\epsilon_t$ as the index of recovery error, whose convergence rate is linear. The convergence rate is controlled by $q$, a constant that depends on the eigen gap $\\sigma_{k+1}^*/\\sigma_k^*$. The final recovery error is bounded by $O(r/(1 - q))$. Eq. (10) is the small noise condition that ensures noisy recovery is possible. Generally speaking, learning a $d \\times d$ matrix with $O(d)$ samples is an ill-conditioned problem when the target matrix is full rank. The small noise condition given by Eq. (10) essentially says that $M^*$ may deviate slightly from the low-rank manifold and that the noise should not be so large that it blurs the spectrum of $M^*$. When the noise is large, Eq. (10) will be satisfied with $n = O(d^2)$, which is the information-theoretic lower bound for recovering a full rank matrix.\n\n5 Conclusion\n\nIn this paper, we propose a provably efficient algorithm to solve generalized Factorization Machine (gFM) and rank-one matrix sensing. 
Our method is based on a one-pass alternating updating framework. The proposed algorithm is able to learn gFM within $O(kd)$ memory on streaming data, has a linear convergence rate and only requires matrix-vector product operations. The algorithm takes no more than $O(k^3d\\log(1/\\epsilon))$ instances to achieve $O(\\epsilon)$ recovery error.\n\nAcknowledgments\n\nThis work was supported in part by research grants from NIH (RF1AG051710) and NSF (III-1421057 and III-1421100).\n\nReferences\n\nMathieu Blondel, Masakazu Ishihata, Akinori Fujino, and Naonori Ueda. Polynomial Networks and Factorization Machines: New Insights and Efficient Training Algorithms. pages 850\u2013858, 2016.\n\nT. Tony Cai and Anru Zhang. ROP: Matrix recovery via rank-one projections. The Annals of Statistics, 43(1):102\u2013138, 2015.\n\nEmmanuel J. Cand\u00e8s and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717\u2013772, 2009.\n\nEmmanuel J. Cand\u00e8s, Yonina Eldar, Thomas Strohmer, and Vlad Voroninski. Phase Retrieval via Matrix Completion. arXiv:1109.0573, 2011.\n\nYuxin Chen, Yuejie Chi, and Andrea J. Goldsmith. Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Transactions on Information Theory, 61(7):4034\u20134059, 2015.\n\nMark A. Davenport and Justin Romberg. An overview of low-rank matrix recovery from incomplete observations. arXiv:1601.06422, 2016.\n\nMoritz Hardt. Understanding Alternating Minimization for Matrix Completion. arXiv:1312.0925, 2013.\n\nMoritz Hardt and Eric Price. The Noisy Power Method: A Meta Algorithm with Applications. arXiv:1311.2495, 2013.\n\nMoritz Hardt and Mary Wootters. Fast matrix completion without the condition number. arXiv:1407.4070, 2014.\n\nLiangjie Hong, Aziz S. Doumith, and Brian D. Davison. Co-factorization Machines: Modeling User Interests and Predicting Individual Decisions in Twitter. In WSDM, pages 557\u2013566, 2013.\n\nPrateek Jain and Inderjit S. Dhillon. Provable inductive matrix completion. arXiv:1306.0626, 2013.\n\nPrateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank Matrix Completion using Alternating Minimization. arXiv:1212.0467, 2012.\n\nPrateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank Matrix Completion Using Alternating Minimization. In STOC, pages 665\u2013674, 2013.\n\nV. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, volume 2033. Springer, 2011.\n\nRichard Kueng, Holger Rauhut, and Ulrich Terstiege. Low rank matrix recovery from rank one measurements. arXiv:1410.6913, 2014.\n\nKiryung Lee, Yihong Wu, and Yoram Bresler. Near Optimal Compressed Sensing of Sparse Rank-One Matrices via Sparse Power Factorization. arXiv:1312.0525, 2013.\n\nPraneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase Retrieval using Alternating Minimization. arXiv:1306.0160, 2013.\n\nSteffen Rendle. Factorization machines. In ICDM, pages 995\u20131000, 2010.\n\nSteffen Rendle, Zeno Gantner, Christoph Freudenthaler, and Lars Schmidt-Thieme. Fast Context-aware Recommendations with Factorization Machines. In SIGIR, pages 635\u2013644, 2011.\n\nG. W. Stewart and Ji-guang Sun. Matrix Perturbation Theory. Academic Press, 1990.\n\nJoel A. Tropp. Convex recovery of a structured signal from independent random linear measurements. arXiv:1405.1102, 2014.\n\nJoel A. Tropp. An Introduction to Matrix Concentration Inequalities. arXiv:1501.01571, 2015.\n\nTuo Zhao, Zhaoran Wang, and Han Liu. Nonconvex Low Rank Matrix Factorization via Inexact First Order Oracle. 2015a.\n\nTuo Zhao, Zhaoran Wang, and Han Liu. A Nonconvex Optimization Framework for Low Rank Matrix Estimation. In NIPS, pages 559\u2013567, 2015b.\n\nKai Zhong, Prateek Jain, and Inderjit S. Dhillon. Efficient matrix sensing using rank-1 gaussian measurements. In Algorithmic Learning Theory, pages 3\u201318, 2015.\n\nPeizhen Zhu and Andrew V. Knyazev. Angles between subspaces and their tangents. Journal of Numerical Mathematics, 21(4), 2013.\n", "award": [], "sourceid": 888, "authors": [{"given_name": "Ming", "family_name": "Lin", "institution": "University of Michigan"}, {"given_name": "Jieping", "family_name": "Ye", "institution": "University of Michigan"}]}