{"title": "Calibrated Elastic Regularization in Matrix Completion", "book": "Advances in Neural Information Processing Systems", "page_first": 863, "page_last": 871, "abstract": "This paper concerns the problem of matrix completion, which is to estimate a matrix from observations in a small subset of indices. We propose a calibrated spectrum elastic net method with a sum of the nuclear and Frobenius penalties and develop an iterative algorithm to solve the convex minimization problem. The iterative algorithm alternates between imputing the missing entries in the incomplete matrix by the current guess and estimating the matrix by a scaled soft-thresholding singular value decomposition of the imputed matrix until the resulting matrix converges. A calibration step follows to correct the bias caused by the Frobenius penalty. Under proper coherence conditions and for suitable penalties levels, we prove that the proposed estimator achieves an error bound of nearly optimal order and in proportion to the noise level. This provides a unified analysis of the noisy and noiseless matrix completion problems. Simulation results are presented to compare our proposal with previous ones.", "full_text": "Calibrated Elastic Regularization in Matrix\n\nCompletion\n\nTingni Sun\n\nCun-Hui Zhang\n\nStatistics Department, The Wharton School\n\nDepartment of Statistics and Biostatistics\n\nUniversity of Pennsylvania\n\nPhiladelphia, Pennsylvania 19104\ntingni@wharton.upenn.edu\n\nRutgers University\n\nPiscataway, New Jersey 08854\n\nczhang@stat.rutgers.edu\n\nAbstract\n\nThis paper concerns the problem of matrix completion, which is to estimate a\nmatrix from observations in a small subset of indices. We propose a calibrated\nspectrum elastic net method with a sum of the nuclear and Frobenius penalties and\ndevelop an iterative algorithm to solve the convex minimization problem. 
The iterative algorithm alternates between imputing the missing entries in the incomplete matrix by the current guess and estimating the matrix by a scaled soft-thresholding singular value decomposition of the imputed matrix until the resulting matrix converges. A calibration step follows to correct the bias caused by the Frobenius penalty. Under proper coherence conditions and for suitable penalty levels, we prove that the proposed estimator achieves an error bound of nearly optimal order and in proportion to the noise level. This provides a unified analysis of the noisy and noiseless matrix completion problems. Simulation results are presented to compare our proposal with previous ones.\n\n1 Introduction\n\nLet $\\Theta \\in \\mathbb{R}^{d_1 \\times d_2}$ be a matrix of interest and $\\Omega^* = \\{1, \\ldots, d_1\\} \\times \\{1, \\ldots, d_2\\}$. Suppose we observe vectors $(\\omega_i, y_i)$,\n\n$$y_i = \\Theta_{\\omega_i} + \\varepsilon_i, \\quad i = 1, \\ldots, n, \\tag{1}$$\n\nwhere $\\omega_i \\in \\Omega^*$ and $\\varepsilon_i$ are random errors. We are interested in estimating $\\Theta$ when $n$ is a small fraction of $d_1 d_2$. A well-known application of matrix completion is the Netflix problem where $y_i$ is the rating of movie $b_j$ by user $a_i$ for $\\omega = (a_i, b_j) \\in \\Omega^*$ [1]. In such applications, the proportion of the observed entries is typically very small, so that the estimation or recovery of $\\Theta$ is impossible without a structural assumption on $\\Theta$. In this paper, we assume that $\\Theta$ is of low rank.\nA focus of recent studies of matrix completion has been on a simpler formulation, also known as exact recovery, where the observations are assumed to be uncorrupted, i.e. $\\varepsilon_i = 0$. A direct approach is to minimize $\\mathrm{rank}(M)$ subject to $M_{\\omega_i} = y_i$. An iterative algorithm was proposed in [5] to project a trimmed SVD of the incomplete data matrix to the space of matrices of a fixed rank $r$. 
The nuclear norm was proposed as a surrogate for the rank, leading to the following convex minimization problem in a linear space [2]:\n\n$$\\widehat{\\Theta}^{(CR)} = \\arg\\min_M \\Big\\{ \\|M\\|_{(N)} : M_{\\omega_i} = y_i \\ \\forall\\, i \\le n \\Big\\}.$$\n\nWe denote the nuclear norm by $\\|\\cdot\\|_{(N)}$ here and throughout this paper. This procedure, analyzed in [2, 3, 4, 11] among others, is parallel to the replacement of the $\\ell_0$ penalty by the $\\ell_1$ penalty in solving the sparse recovery problem in a linear space.\nIn this paper, we focus on the problem of matrix completion with noisy observations (1) and take the exact recovery as a special case. Since the exact constraint is no longer appropriate in the presence of noise, the penalized squared error $\\sum_{i=1}^n (M_{\\omega_i} - y_i)^2$ is considered. By reformulating the problem in Lagrange form, [8] proposed the spectrum Lasso\n\n$$\\widehat{\\Theta}^{(MHT)} = \\arg\\min_M \\Big\\{ \\sum_{i=1}^n M_{\\omega_i}^2/2 - \\sum_{i=1}^n y_i M_{\\omega_i} + \\lambda \\|M\\|_{(N)} \\Big\\} \\tag{2}$$\n\nalong with an iterative convex minimization algorithm. However, (2) is difficult to analyze when the sample fraction $\\pi_0 = n/(d_1 d_2)$ is small, due to the ill-posedness of the quadratic term $\\sum_{i=1}^n M_{\\omega_i}^2$. This has led to two alternatives in [7] and [9]. While [9] proposed to minimize (2) under an additional $\\ell_\\infty$ constraint on $M$, [7] modified (2) by replacing the quadratic term $\\sum_{i=1}^n M_{\\omega_i}^2$ with $\\pi_0 \\|M\\|_{(F)}^2$. Both [7, 9] provided nearly optimal error bounds when the noise level is of no smaller order than the $\\ell_\\infty$ norm of the target matrix $\\Theta$, but not of smaller order, especially not for exact recovery. In a different approach, [6] proposed a non-convex recursive algorithm and provided error bounds in proportion to the noise level. 
However, the procedure requires the knowledge of the rank $r$ of the unknown $\\Theta$ and the error bound is optimal only when $d_1$ and $d_2$ are of the same order.\nOur goal is to develop an algorithm for matrix completion that can be as easily computed as the spectrum Lasso (2) and enjoys a nearly optimal error bound proportional to the noise level to continuously cover both the noisy and noiseless cases. We propose to use an elastic penalty, a linear combination of the nuclear and Frobenius norms, which leads to the estimator\n\n$$\\widetilde{\\Theta} = \\arg\\min_M \\Big\\{ \\sum_{i=1}^n M_{\\omega_i}^2/2 - \\sum_{i=1}^n y_i M_{\\omega_i} + \\lambda_1 \\|M\\|_{(N)} + (\\lambda_2/2) \\|M\\|_{(F)}^2 \\Big\\}, \\tag{3}$$\n\nwhere $\\|\\cdot\\|_{(N)}$ and $\\|\\cdot\\|_{(F)}$ are the nuclear and Frobenius norms, respectively. We call (3) spectrum elastic net (E-net) since it is parallel to the E-net in linear regression, the least squares estimator with a sum of the $\\ell_1$ and $\\ell_2$ penalties, introduced in [15]. Here the nuclear penalty provides the sparsity in the spectrum, while the Frobenius penalty regularizes the inversion of the quadratic term. Meanwhile, since the Frobenius penalty roughly shrinks the estimator by a factor $\\pi_0/(\\pi_0 + \\lambda_2)$, we correct this bias by a calibration step,\n\n$$\\widehat{\\Theta} = (1 + \\lambda_2/\\pi_0) \\widetilde{\\Theta}. \\tag{4}$$\n\nWe call this estimator calibrated spectrum E-net.\nMotivated by [8], we develop an EM algorithm to solve (3) for matrix completion. The algorithm iteratively replaces the missing entries with those obtained from a scaled soft-thresholding singular value decomposition (SVD) until the resulting matrix converges. 
This EM algorithm is guaranteed to converge to the solution of (3).\nUnder proper coherence conditions, we prove that for suitable penalty levels $\\lambda_1$ and $\\lambda_2$, the calibrated spectrum E-net (4) achieves a desired error bound in the Frobenius norm. Our error bound is of nearly optimal order and in proportion to the noise level. This provides a sharper result than those of [7, 9] when the noise level is of smaller order than the $\\ell_\\infty$ norm of $\\Theta$, and than that of [6] when $d_2/d_1$ is large. Our simulation results support the use of the calibrated spectrum E-net. They illustrate that (4) performs comparably to (2) and outperforms the modified method of [7].\nOur analysis of the calibrated spectrum E-net uses an inequality similar to a dual certificate bound in [3]. The bound in [3] requires sample size $n \\asymp \\min\\{(r \\log d)^2, r(\\log d)^6\\} d \\log d$, where $d = d_1 + d_2$. We use the method of moments to remove a $\\log d$ factor in the first component of their sample size requirement. This leads to a sample size requirement of $n \\asymp r^2 d \\log d$, with an extra $r$ in comparison to the ideal $n \\asymp r d \\log d$. Since the extra $r$ does not appear in our error bound, its appearance in the sample size requirement seems to be a technicality.\nThe rest of the paper is organized as follows. In Section 2, we describe an iterative algorithm for the computation of the spectrum E-net and study its convergence. In Section 3, we derive error bounds for the calibrated spectrum E-net. Some simulation results are presented in Section 4. Section 5 provides the proof of our main result.\nWe use the following notation throughout this paper. 
For matrices $M \\in \\mathbb{R}^{d_1 \\times d_2}$, $\\|M\\|_{(N)}$ is the nuclear norm (the sum of all singular values of $M$), $\\|M\\|_{(S)}$ is the spectrum norm (the largest singular value), $\\|M\\|_{(F)}$ is the Frobenius norm (the $\\ell_2$ norm of vectorized $M$), and $\\|M\\|_\\infty = \\max_{jk} |M_{jk}|$. Linear mappings from $\\mathbb{R}^{d_1 \\times d_2}$ to $\\mathbb{R}^{d_1 \\times d_2}$ are denoted by the calligraphic letters. For a linear mapping $\\mathcal{Q}$, the operator norm is $\\|\\mathcal{Q}\\|_{(op)} = \\sup_{\\|M\\|_{(F)} = 1} \\|\\mathcal{Q}M\\|_{(F)}$. We equip $\\mathbb{R}^{d_1 \\times d_2}$ with the inner product $\\langle M_1, M_2 \\rangle = \\mathrm{trace}(M_1^\\top M_2)$ so that $\\langle M, M \\rangle = \\|M\\|_{(F)}^2$. For projections $\\mathcal{P}$, $\\mathcal{P}^\\perp = \\mathcal{I} - \\mathcal{P}$ with $\\mathcal{I}$ being the identity. We denote by $E_\\omega$ the unit matrix with 1 at $\\omega \\in \\{1, \\ldots, d_1\\} \\times \\{1, \\ldots, d_2\\}$, and by $\\mathcal{P}_\\omega$ the projection to $E_\\omega$: $M \\to M_\\omega E_\\omega = \\langle E_\\omega, M \\rangle E_\\omega$.\n\n2 An algorithm for spectrum elastic regularization\n\nWe first present a lemma for the M-step of our iterative algorithm.\n\nLemma 1 Suppose the matrix $W$ has rank $r$. The solution to the optimization problem\n\n$$\\arg\\min_Z \\Big\\{ \\|Z - W\\|_{(F)}^2/2 + \\lambda_1 \\|Z\\|_{(N)} + \\lambda_2 \\|Z\\|_{(F)}^2/2 \\Big\\}$$\n\nis given by $S(W; \\lambda_1, \\lambda_2) = U D_{\\lambda_1,\\lambda_2} V'$ with $D_{\\lambda_1,\\lambda_2} = \\mathrm{diag}\\{(d_1 - \\lambda_1)_+, \\ldots, (d_r - \\lambda_1)_+\\}/(1 + \\lambda_2)$, where $U D V'$ is the SVD of $W$, $D = \\mathrm{diag}\\{d_1, \\ldots, d_r\\}$ and $t_+ = \\max(t, 0)$.\nThe minimization problem in Lemma 1 is solved by a scaled soft-thresholding SVD. This is parallel to Lemma 1 in [8] and justified by Remark 1 there. We use Lemma 1 to solve the M-step of the EM algorithm for the spectrum E-net (3).\nWe still need an E-step to impute a complete matrix given the observed data {yi, \u03c9i : i = 1, . . . 
, n}.\nSince $\\omega_i$ are allowed to have ties, we need the following notation. Let $m_\\omega = \\#\\{i : \\omega_i = \\omega, i \\le n\\}$ be the multiplicity of observations at $\\omega \\in \\Omega^*$ and $m^* = \\max_\\omega m_\\omega$ be the maximum multiplicity. Suppose that the complete data is composed of $m_*$ observations at each $\\omega$ for a certain integer $m_*$. Let $Y^{(com)}_\\omega$ be the sample mean of the complete data at $\\omega$ and $Y^{(com)}$ be the matrix with components $Y^{(com)}_\\omega$. If the complete data are available, (3) is equivalent to\n\n$$\\arg\\min_M \\Big\\{ (m_*/2) \\|Y^{(com)} - M\\|_{(F)}^2 + \\lambda_1 \\|M\\|_{(N)} + (\\lambda_2/2) \\|M\\|_{(F)}^2 \\Big\\}.$$\n\nLet $Y^{(obs)}_\\omega = m_\\omega^{-1} \\sum_{\\omega_i = \\omega} y_i$ be the sample mean of the observations at $\\omega$ and $Y^{(obs)} = (Y^{(obs)}_\\omega)_{d_1 \\times d_2}$. In the white noise model, the conditional expectation of $Y^{(com)}_\\omega$ given $Y^{(obs)}$ is $(m_\\omega/m_*) Y^{(obs)}_\\omega + (1 - m_\\omega/m_*) \\Theta_\\omega$ for $m_\\omega \\le m_*$. This leads to a generalized E-step:\n\n$$Y^{(imp)} = (Y^{(imp)}_\\omega)_{d_1 \\times d_2}, \\quad Y^{(imp)}_\\omega = \\min\\{1, (m_\\omega/m_*)\\} Y^{(obs)}_\\omega + (1 - m_\\omega/m_*)_+ Z^{(old)}_\\omega, \\tag{5}$$\n\nwhere $Z^{(old)}$ is the estimate of $\\Theta$ in the previous iteration. This is a genuine E-step when $m_* = m^*$ but also allows a smaller $m_*$ to reduce the proportion of missing data.\nWe now present the EM-algorithm for the computation of the spectrum E-net $\\widetilde{\\Theta}$ in (3).\n\nAlgorithm 1 Initialize with $Z^{(0)}$ and $k = 0$. 
Repeat the following steps:\n\n\u2022 E-step: Compute $Y^{(imp)}$ in (5) with $Z^{(old)} = Z^{(k)}$ and assign $k \\leftarrow k + 1$,\n\u2022 M-step: Compute $Z^{(k)} = S(Y^{(imp)}; \\lambda_1/m_*, \\lambda_2/m_*)$,\n\nuntil $\\|Z^{(k)} - Z^{(k-1)}\\|_{(F)}^2 / \\|Z^{(k)}\\|_{(F)}^2 \\le \\epsilon$. Then, return $Z^{(k)}$.\n\nThe following theorem states the convergence of Algorithm 1.\nTheorem 1 As $k \\to \\infty$, $Z^{(k)}$ converges to a limit $Z^{(\\infty)}$ as a function of the data and $(\\lambda_1, \\lambda_2, m_*)$, and $Z^{(\\infty)} = \\widetilde{\\Theta}$ for $m_* \\ge m^*$.\nTheorem 1 is a variation of a parallel result in [8] and follows from the same proof there. As [8] pointed out, a main advantage of Algorithm 1 is the speed of each iteration. When the maximum multiplicity $m^*$ is small, we simply use $Z^{(0)} = Y^{(obs)}$ and $m_* = m^*$; otherwise, we may first run the EM-algorithm for an $m_*$ smaller than $m^*$ and use the output as the initialization $Z^{(0)}$ for a second run of the EM-algorithm with $m_* = m^*$.\n\n3 Analysis of estimation accuracy\n\nIn this section, we derive error bounds for the calibrated spectrum E-net. We need the following notation. Let $r = \\mathrm{rank}(\\Theta)$, $U D V^\\top$ be the SVD of $\\Theta$, and $s_1 \\ge \\ldots \\ge s_r$ be the nonzero singular values of $\\Theta$. Let $T$ be the tangent space with respect to $U V^\\top$, the space of all matrices of the form $U U^\\top M_1 + M_2 V V^\\top$. The orthogonal projection to $T$ is given by\n\n$$\\mathcal{P}_T M = U U^\\top M + M V V^\\top - U U^\\top M V V^\\top.$$\n\nTheorem 2 Let $\\xi = 1 + \\lambda_2/\\pi_0$ and $\\mathcal{H} = \\sum_{i=1}^n \\mathcal{P}_{\\omega_i}$. Define\n\n$$\\mathcal{R} = (\\mathcal{H} - \\pi_0)\\mathcal{P}_T/(\\pi_0 + \\lambda_2), \\quad \\Delta = \\mathcal{R}(\\lambda_2 \\Theta + \\lambda_1 U V^\\top), \\quad \\mathcal{Q} = \\mathcal{I} - \\mathcal{H}(\\mathcal{P}_T \\mathcal{H} \\mathcal{P}_T + \\lambda_2 \\mathcal{P}_T)^{-1}\\mathcal{P}_T. \\tag{6}$$\n\nLet $\\varepsilon = \\sum_{i=1}^n \\varepsilon_i E_{\\omega_i}$. 
Suppose\n\n$$\\|\\mathcal{P}_T \\mathcal{R}\\|_{(op)} \\le 1/2, \\quad s_r \\ge 5\\lambda_1/\\lambda_2, \\tag{7}$$\n\n$$\\|\\mathcal{P}_T \\Delta\\|_{(F)} \\le \\sqrt{r}\\lambda_1/8, \\quad \\|\\Delta - \\mathcal{R}(\\mathcal{P}_T \\mathcal{R} + \\mathcal{P}_T)^{-1}\\mathcal{P}_T \\Delta\\|_{(S)} \\le \\lambda_1/4, \\tag{8}$$\n\n$$\\|\\mathcal{P}_T \\varepsilon\\|_{(F)} \\le \\sqrt{r}\\lambda_1/8, \\quad \\|\\mathcal{Q}\\varepsilon\\|_{(S)} \\le 3\\lambda_1/4, \\quad \\|\\mathcal{P}_T^\\perp \\varepsilon\\|_{(S)} \\le \\lambda_1. \\tag{9}$$\n\nThen the calibrated spectrum E-net (4) satisfies\n\n$$\\|\\widehat{\\Theta} - \\Theta\\|_{(F)} \\le 2\\sqrt{r}\\lambda_1/\\pi_0. \\tag{10}$$\n\nThe proof of Theorem 2 is provided in Section 5. When $\\omega_i$ are random entries in $\\Omega^*$, $E\\mathcal{H} = \\pi_0 \\mathcal{I}$, so that (8) and the first inequality of (7) are expected to hold under proper conditions. Since the rank of $\\mathcal{P}_T \\varepsilon$ is no greater than $2r$, (9) essentially requires $\\|\\varepsilon\\|_{(S)} \\asymp \\lambda_1$. Our analysis allows $\\lambda_2$ to lie in a certain range $[\\lambda_*, \\lambda^*]$, and $\\lambda^*/\\lambda_*$ is large under proper conditions. Still, the choice of $\\lambda_2$ is constrained by (7) and (8) since $\\Delta$ is linear in $\\lambda_2$. When $\\lambda_2/\\pi_0$ diverges to infinity, the calibrated spectrum E-net (4) becomes the modified spectrum Lasso of [7].\nTheorem 2 provides sufficient conditions on the target matrix and the noise for achieving a certain level of estimation error. Intuitively, these conditions on the target matrix $\\Theta$ must imply a certain level of coherence (or flatness) of the unknown matrix since it is impossible to distinguish the unknown from zero when the observations are completely outside its support. In [2, 3, 4, 11], coherence conditions are imposed on\n\n$$\\mu_0 = \\max\\{(d_1/r)\\|U U^\\top\\|_\\infty, (d_2/r)\\|V V^\\top\\|_\\infty\\}, \\quad \\mu_1 = \\sqrt{d_1 d_2/r}\\, \\|U V^\\top\\|_\\infty, \\tag{11}$$\n\nwhere $U$ and $V$ are matrices of singular vectors of $\\Theta$. 
[9] considered a more general notion of spikiness of a matrix $M$, defined as the ratio between the $\\ell_\\infty$ and dimension-normalized $\\ell_2$ norms,\n\n$$\\alpha_{sp}(M) = \\|M\\|_\\infty \\sqrt{d_1 d_2} / \\|M\\|_{(F)}. \\tag{12}$$\n\nSuppose in the rest of the section that $\\omega_i$ are iid points uniformly distributed in $\\Omega^*$ and $\\varepsilon_i$ are iid $N(0, \\sigma^2)$ variables independent of $\\{\\omega_i\\}$. The following theorem asserts that under certain coherence conditions on the matrices $\\Theta$, $U U^\\top$, $V V^\\top$ and $U V^\\top$, all conditions of Theorem 2 hold with large probability when the sample size $n$ is of the order $r^2 d \\log d$.\n\nTheorem 3 Let $d = d_1 + d_2$. Consider $\\lambda_1$ and $\\lambda_2$ satisfying\n\n$$\\lambda_1 = \\sigma \\sqrt{8 \\pi_0 d \\log d}, \\quad 1 \\le \\frac{\\lambda_2 \\|\\Theta\\|_{(F)}}{\\lambda_1 \\{n/(d \\log d)\\}^{1/4}} \\le 2. \\tag{13}$$\n\nThen, there exists a constant $C$ such that\n\n$$n \\ge C \\max\\Big\\{ \\mu_0^2 r^2 d \\log d, \\ (\\mu_1 + r)\\mu_1 r d \\log d, \\ (\\alpha_{sp}^{4/3} \\vee \\kappa_*^4) r^2 d \\log d \\Big\\} \\tag{14}$$\n\nimplies\n\n$$\\|\\widehat{\\Theta} - \\Theta\\|_{(F)}^2/(d_1 d_2) \\le 32 (\\sigma^2 r d \\log d)/n$$\n\nwith probability at least $1 - 1/d^2$, where $\\mu_0$ and $\\mu_1$ are the coherence constants in (11), $\\alpha_{sp} = \\alpha_{sp}(\\Theta)$ is the spikiness of $\\Theta$ and $\\kappa_* = \\|\\Theta\\|_{(F)}/(r^{1/2} s_r)$.\nWe require the knowledge of the noise level $\\sigma$ to determine the penalty level, which is usually considered as a tuning parameter in practice. The Frobenius norm $\\|\\Theta\\|_{(F)}$ in (13) can be replaced by an estimate of the same magnitude in Theorem 3. In our simulation experiment, we use $\\lambda_2 = \\lambda_1 \\{n/(d \\log d)\\}^{1/4}/\\widehat{F}$ with $\\widehat{F} = (\\sum_{i=1}^n y_i^2/\\pi_0)^{1/2}$. 
The Chebyshev inequality provides $\\widehat{F}/\\|\\Theta\\|_{(F)} \\to 1$ when $\\alpha_{sp} = O(1)$ and $\\sigma^2 \\ll \\|\\Theta\\|_\\infty^2$.\nA key element in our analysis is to find a probabilistic bound for the second inequality of (8), or equivalently an upper bound for\n\n$$P\\Big\\{ \\|\\mathcal{R}(\\mathcal{P}_T \\mathcal{R} + \\mathcal{P}_T)^{-1}(\\lambda_2 \\Theta + \\lambda_1 U V^\\top)\\|_{(S)} > \\lambda_1/4 \\Big\\}. \\tag{15}$$\n\nThis guarantees the existence of a primal dual certificate for the spectrum E-net penalty [14]. For $\\lambda_2 = 0$, a similar inequality was proved in [3], where the sample size requirement is $n \\ge C_0 \\min\\{\\mu^2 r^2 (\\log d)^2 d, \\mu^2 r (\\log d)^6 d\\}$ for a certain coherence factor $\\mu$. We remove a log factor in the first bound, resulting in the sample size requirement in (14), which is optimal when $r = O(1)$. For exact recovery in the noiseless case, the sample size $n \\asymp r d (\\log d)^2$ is sufficient if a golfing scheme is used to construct an approximate dual certificate [4, 11]. We use the following lemma to bound (15).\n\nLemma 2 Let $\\mathcal{H} = \\sum_{i=1}^n \\mathcal{P}_{\\omega_i}$ where $\\omega_i$ are iid points uniformly distributed in $\\Omega^*$. Let $\\mathcal{R} = (\\mathcal{H} - \\pi_0)\\mathcal{P}_T/(\\pi_0 + \\lambda_2)$ and $\\xi = 1 + \\lambda_2/\\pi_0$. Let $M$ be a deterministic matrix. Then, there exists a numerical constant $C$ such that, for all $k \\ge 1$ and $m \\ge 1$,\n\n$$\\xi^{2km} E\\|\\mathcal{R}^k M\\|_{(S)}^{2m} \\le \\Big\\{ C \\mu_0^2 r^2 d k m/n \\Big\\}^{km} \\Big( \\mu_0^{-2} (\\sqrt{d_1 d_2}/r) \\|M\\|_\\infty \\Big)^{2m}. \\tag{16}$$\n\nWe use a different graphical approach than those in [3] to bound $E\\, \\mathrm{trace}(\\{(\\mathcal{R}^k M)^\\top (\\mathcal{R}^k M)\\}^m)$ in the proof of Lemma 2. The rest of the proof of Theorem 3 can be outlined as follows. Assume that all coherence factors are $O(1)$. Let $M = \\lambda_2 \\Theta + \\lambda_1 U V^\\top$ and write $\\mathcal{R}(\\mathcal{P}_T \\mathcal{R} + \\mathcal{P}_T)^{-1} M = \\mathcal{R}M - \\mathcal{R}^2 M + \\cdots + (-1)^{k^* - 1} \\mathcal{R}^{k^*} M + \\mathrm{Rem}$. 
By (16) with $km \\asymp \\log d$ for $k \\ge 2$ and an even simpler bound for $k = 1$ and Rem, (15) holds when $(\\sqrt{d_1 d_2}/r)\\|M\\|_\\infty \\asymp \\lambda_1 \\eta$, where $\\eta \\asymp r^2 d (\\log d)/n$. Since $\\alpha_{sp} + \\mu_1 + \\|\\Theta\\|_{(F)}^2/(r s_r^2) = O(1)$, this is equivalent to $\\eta(s_r \\lambda_2/\\lambda_1 + 1) \\lesssim 1$. Finally, we use matrix exponential inequalities [10, 12] to verify other conditions of Theorem 2. We omit technical details of the proof of Lemma 2 and Theorem 3. We would like to point out that if the $r^2$ in (16) can be replaced by $r(\\log d)^\\gamma$, e.g. $\\gamma = 5$ in view of [3], the rest of the proof of Theorem 3 is intact with $\\eta \\asymp r d (\\log d)^{1+\\gamma}/n$ and a proper adjustment of $\\lambda_2$ in (13).\nCompared with [7] and [9], the main advantage of Theorem 3 is the proportionality of its error bound to the noise level. In [7], the quadratic term $\\sum_{i=1}^n M_{\\omega_i}^2$ in (2) is replaced by its expectation $\\pi_0 \\|M\\|_{(F)}^2$ and the resulting minimizer is proved to satisfy\n\n$$\\|\\widehat{\\Theta}^{(KLT)} - \\Theta\\|_{(F)}^2/(d_1 d_2) \\le C \\max(\\sigma^2, \\|\\Theta\\|_\\infty^2)\\, r d (\\log d)/n \\tag{17}$$\n\nwith large probability, where $C$ is a numerical constant. This error bound achieves the squared error rate $\\sigma^2 r d (\\log d)/n$ as in Theorem 3 when the noise level $\\sigma$ is of no smaller order than $\\|\\Theta\\|_\\infty$, but not of smaller order. In particular, (17) does not imply exact recovery when $\\sigma = 0$. In Theorem 3, the error bound converges to zero as the noise level diminishes, implying exact recovery in the noiseless case. In [9], a constrained spectrum Lasso was proposed that minimizes (2) subject to $\\|M\\|_\\infty \\le \\alpha^*/\\sqrt{d_1 d_2}$. 
For $\\|\\Theta\\|_{(F)} \\le 1$ and $\\alpha_{sp}(\\Theta) \\le \\alpha^*$, [9] proved\n\n$$\\|\\widehat{\\Theta}^{(NW)} - \\Theta\\|_{(F)}^2 \\le C \\max(d_1 d_2 \\sigma^2, 1)(\\alpha^*)^2 r d (\\log d)/n$$\n\nwith large probability. Scale change from the above error bound yields\n\n$$\\|\\widehat{\\Theta}^{(NW)} - \\Theta\\|_{(F)}^2/(d_1 d_2) \\le C \\max\\{\\sigma^2, \\|\\Theta\\|_{(F)}^2/(d_1 d_2)\\}(\\alpha^*)^2 r d (\\log d)/n. \\tag{18}$$\n\nSince $\\alpha^* \\ge 1$ and $\\alpha^* \\|\\Theta\\|_{(F)}/\\sqrt{d_1 d_2} \\ge \\|\\Theta\\|_\\infty$, the right-hand side of (18) is of no smaller order than that of (17). We shall point out that (17) and (18) only require sample size $n \\asymp r d \\log d$. In addition, [9] allows more practical weighted sampling models.\nCompared with [6], the main advantage of Theorem 3 is the independence of its sample size requirement on the aspect ratio $d_2/d_1$, where $d_2 \\ge d_1$ is assumed without loss of generality by symmetry. The error bound in [6] implies\n\n$$\\|\\widehat{\\Theta}^{(KMO)} - \\Theta\\|_{(F)}^2/(d_1 d_2) \\le C_0 (s_1/s_r)^4 \\sigma^2 r d (\\log d)/n \\tag{19}$$\n\nfor sample size $n \\ge C_1^* r d \\log d + C_2^* r^2 d \\sqrt{d_2/d_1}$, where $\\{C_1^*, C_2^*\\}$ are constants depending on the same set of coherence factors as in (14) and $s_1 > \\cdots > s_r$ are the singular values of $\\Theta$. Therefore, Theorem 3 effectively replaces the root aspect ratio $\\sqrt{d_2/d_1}$ in the sample size requirement of (19) with a log factor, and removes the coherence factor $(s_1/s_r)^4$ on the right-hand side of (19). We note that $s_1/s_r$ is a larger coherence factor than $\\|\\Theta\\|_{(F)}/(r^{1/2} s_r)$ in the sample size requirement in Theorem 3. 
The root aspect ratio can be removed from the sample size requirement for (19) if $\\Theta$ can be divided into square blocks uniformly satisfying the coherence conditions.\n\n4 Simulation study\n\nThis experiment has the same setting as in Section 9 of [8]. We provide the description of the simulation settings in our notation as follows: The target matrix is $\\Theta = U V^\\top$, where $U_{d_1 \\times r}$ and $V_{d_2 \\times r}$ are random matrices with independent standard normal entries. The sampling points $\\omega_i$ have no tie and $\\Omega = \\{\\omega_i : i = 1, \\ldots, n\\}$ is a uniformly distributed random subset of $\\{1, \\ldots, d_1\\} \\times \\{1, \\ldots, d_2\\}$, where $n$ is fixed. The errors $\\varepsilon$ are iid $N(0, \\sigma^2)$ variables. Thus, the observed matrix is $Y = \\mathcal{P}_\\Omega(\\Theta + \\varepsilon)$ with $\\mathcal{P}_\\Omega = \\mathcal{H} = \\sum_{i=1}^n \\mathcal{P}_{\\omega_i}$ being a projection. The signal to noise ratio (SNR) is defined as $\\mathrm{SNR} = \\sqrt{r}/\\sigma$.\nWe compare the calibrated spectrum E-net (4) with the spectrum Lasso (2) and its modification $\\widehat{\\Theta}^{(KLT)}$ of [7]. For all methods, we compute a series of estimators with 100 different penalty levels, where the smallest penalty level corresponds to a full-rank solution and the largest penalty level corresponds to a zero solution. For the calibrated spectrum E-net, we always use $\\lambda_2 = \\lambda_1 \\{n/(d \\log d)\\}^{1/4}/\\widehat{F}$, where $\\widehat{F} = (\\sum_{i=1}^n y_i^2/\\pi_0)^{1/2}$ is an estimator for $\\|\\Theta\\|_{(F)}$. We plot the training errors and test errors as functions of estimated ranks, where the training and test errors are defined as\n\n$$\\text{Training error} = \\frac{\\|\\mathcal{P}_\\Omega(\\widehat{\\Theta} - Y)\\|_{(F)}^2}{\\|\\mathcal{P}_\\Omega Y\\|_{(F)}^2}, \\quad \\text{Test error} = \\frac{\\|\\mathcal{P}_\\Omega^\\perp(\\widehat{\\Theta} - \\Theta)\\|_{(F)}^2}{\\|\\mathcal{P}_\\Omega^\\perp \\Theta\\|_{(F)}^2}.$$\n\nIn Figure 1, we report the estimation performance of three methods. 
The rank of $\\Theta$ is 10 but $\\{\\Theta, \\Omega, \\varepsilon\\}$ are regenerated in each replication. Different noise levels and proportions of the observed entries are considered. All the results are averaged over 50 replications. In this experiment, the calibrated spectrum E-net and the spectrum Lasso estimator have very close testing and training errors, and both of them significantly outperform the modified Lasso. Figure 1 also illustrates that in most cases, the calibrated spectrum E-net and spectrum Lasso achieve the optimal test error when the estimated rank is around the true rank.\nWe note that the constrained spectrum Lasso estimator $\\widehat{\\Theta}^{(NW)}$ would have the same performance as the spectrum Lasso when the constraint $\\alpha_{sp}(\\widehat{\\Theta}) \\le \\alpha^*$ is set with a sufficiently high $\\alpha^*$. However, analytic properties of the spectrum Lasso are unclear without constraint or modification.\n\nFigure 1: Plots of training and testing errors against the estimated rank: testing error with solid lines; training error with dashed lines; spectrum Lasso in blue, calibrated spectrum E-net in red; modified spectrum Lasso in black; $d_1 = d_2 = 100$, $\\mathrm{rank}(\\Theta) = 10$.\n\n5 Proof of Theorem 2\n\nThe proof of Theorem 2 requires the following proposition that controls the approximation error of the Taylor expansion of the nuclear norm with subdifferentiation. The result, closely related to those in [13], is used to control the variation of the tangent space of the spectrum E-net estimator. We omit its proof.\n\nProposition 1 Let $\\Theta = U D V^\\top$ be the SVD and $M$ be another matrix. Then,\n\n$$0 \\le \\|M\\|_{(N)} - \\|\\Theta\\|_{(N)} - \\|\\mathcal{P}_T^\\perp M\\|_{(N)} - \\langle U V^\\top, M - \\Theta \\rangle \\le \\|(\\mathcal{P}_T M - \\Theta) V D^{-1/2}\\|_{(F)}^2 + \\|D^{-1/2} U^\\top (\\mathcal{P}_T M - \\Theta)\\|_{(F)}^2.$$\n\nProof of Theorem 2. 
De\ufb01ne\n\nSince(cid:98)\u0398 = \u03be(cid:101)\u0398 and \u03be\u0398 \u2212 \u0398 = \u2212(\u03bb1/\u03c00)U V (cid:62),\n\n\u0398\u2217 = (PTHPT + \u03bb2PT )\u22121(PT \u03b5 + PTH\u0398 \u2212 \u03bb1U V (cid:62)),\n\u0398 = (\u03c00 + \u03bb2)\u22121(\u03c00\u0398 \u2212 \u03bb1U V (cid:62)),\n\n\u2206 = (cid:101)\u0398 \u2212 \u0398\u2217, \u2206\u2217 = \u0398\u2217 \u2212 \u0398, \u2206\u2217 = (cid:101)\u0398 \u2212 \u0398.\n(cid:107)(cid:98)\u0398 \u2212 \u0398(cid:107)(F ) \u2264 \u03be(cid:107)\u2206\u2217(cid:107)(F ) + (cid:107)\u03be\u0398 \u2212 \u0398(cid:107)(F )\n\n\u221a\n\n= \u03be(cid:107)\u2206\u2217(cid:107)(F ) +\nr\u03bb1/\u03c00\n\u2264 \u03be(cid:107)\u2206(cid:107)(F ) + \u03be(cid:107)\u2206\u2217(cid:107)(F ) +\n\n\u221a\n\nr\u03bb1/\u03c00.\n\n(20)\n(21)\n\n(22)\n\nWe consider two cases by comparing \u03bb2 and \u03c00.\nCase 1: \u03bb2 \u2264 \u03c00. By algebra \u03be\u2206\u2217 = \u03c0\u22121\n\n\u03be(cid:107)\u2206\u2217(cid:107)(F ) \u2264 \u03c0\u22121\n\n(cid:107)\u2206(cid:107)(F ). Let Y =(cid:80)n\n(cid:101)\u0398 = arg min\n\n0 (PTR + PT )\u22121PT (\u03b5 + \u2206), so that\n\n0 (cid:107)(PTR + PT )\u22121(cid:107)(op)(cid:107)PT \u2206 + PT \u03b5(cid:107)(F ) \u2264 \u221a\n(cid:110)(cid:104)HM, M(cid:105)/2 \u2212 (cid:104)Y, M(cid:105) + \u03bb1(cid:107)M(cid:107)(N ) + (\u03bb2/2)(cid:107)M(cid:107)2\n\ni=1 yiE\u03c9i. We write the spectrum E-net estimator (3) as\n\n(cid:111)\n\n.\n\n(F )\n\nr\u03bb1/(2\u03c00).\n\nThe last inequality above follows from the \ufb01rst inequalities in (7), (8) and (9). 
It remains to bound\n\nM\n\n[Figure 1 panels: $\\pi_0 = 0.2, 0.5, 0.8$, each with SNR = 1 and SNR = 10; horizontal axis: Rank; vertical axis: Error]\n\nIt follows that for a certain member $\\widehat{G}$ in the sub-differential of $\\|M\\|_{(N)}$ at $M = \\widetilde{\\Theta}$,\n\n$$0 = \\partial L_{\\lambda_1,\\lambda_2}(\\widetilde{\\Theta}) = \\mathcal{H}\\widetilde{\\Theta} - Y + \\lambda_2 \\widetilde{\\Theta} + \\lambda_1 \\widehat{G} = (\\mathcal{H} + \\lambda_2)\\Delta + (\\mathcal{H} + \\lambda_2)\\Theta^* - Y + \\lambda_1 \\widehat{G}.$$\n\nLet $\\mathrm{Rem}_1 = \\|\\Theta^*\\|_{(N)} - \\langle U V^\\top, \\Theta^* \\rangle$. Since $\\|\\Theta^*\\|_{(N)} - \\|\\widetilde{\\Theta}\\|_{(N)} \\ge -\\langle \\Delta, \\widehat{G} \\rangle$, we have\n\n$$\\langle (\\mathcal{H} + \\lambda_2)\\Delta, \\Delta \\rangle \\le \\langle \\mathcal{H}\\Theta + \\varepsilon - (\\mathcal{H} + \\lambda_2)\\Theta^*, \\Delta \\rangle + \\lambda_1 \\|\\Theta^*\\|_{(N)} - \\lambda_1 \\|\\widetilde{\\Theta}\\|_{(N)}$$\n\n$$= \\langle \\mathcal{H}(\\Theta - \\Theta^*) + \\varepsilon - \\lambda_2 \\Theta^*, \\Delta \\rangle + \\lambda_1 \\mathrm{Rem}_1 + \\lambda_1 \\langle U V^\\top, \\Theta^* \\rangle - \\lambda_1 \\|\\widetilde{\\Theta}\\|_{(N)}$$\n\n$$\\le \\lambda_1 \\mathrm{Rem}_1 + \\langle \\varepsilon + \\mathcal{H}(\\Theta - \\Theta^*) - \\lambda_2 \\Theta^* - \\lambda_1 U V^\\top, \\Delta \\rangle - \\lambda_1 \\|\\mathcal{P}_T^\\perp \\Delta\\|_{(N)}$$\n\n$$= \\lambda_1 \\mathrm{Rem}_1 + \\langle \\varepsilon + \\mathcal{H}(\\Theta - \\Theta^*), \\mathcal{P}_T^\\perp \\Delta \\rangle - \\lambda_1 \\|\\mathcal{P}_T^\\perp \\Delta\\|_{(N)}. \\tag{23}$$\n\nThe second inequality in (23) is due to $\\|\\widetilde{\\Theta}\\|_{(N)} \\ge \\|\\mathcal{P}_T^\\perp \\widetilde{\\Theta}\\|_{(N)} + \\langle U V^\\top, \\widetilde{\\Theta} \\rangle$ and $\\mathcal{P}_T^\\perp \\widetilde{\\Theta} = \\mathcal{P}_T^\\perp \\Delta$. 
The\nlast equality in (23) follows from the de\ufb01nition of \u0398\u2217 \u2208 T , since it gives PT \u03b5 + PTH(\u0398 \u2212 \u0398\u2217) \u2212\n\u03bb2\u0398\u2217 \u2212 \u03bb1U V (cid:62) = \u2212(PTHPT + \u03bb2PT )\u0398\u2217 + PT \u03b5 + PTH\u0398 \u2212 \u03bb1U V (cid:62) = 0. By the de\ufb01nitions\nof Q, \u0398\u2217 and \u2206, \u03b5 + H(\u0398 \u2212 \u0398\u2217) = Q\u03b5 + H(\u0398 \u2212 \u0398) \u2212 H(PTHPT + \u03bb2PT )\u22121PT \u2206. Since\nP\u22a5\nT HPT = P\u22a5\n\nT R(\u03c00 + \u03bb2) and (H \u2212 \u03c00)(\u0398 \u2212 \u0398) = \u2206, we \ufb01nd\n\nT (H \u2212 \u03c00)PT = P\u22a5\n\n(23)\n\n(cid:104)\u03b5 + H(\u0398 \u2212 \u0398\u2217),P\u22a5\n\nT \u2206(cid:105)\n\n= (cid:104)Q\u03b5 + (H \u2212 \u03c00){\u0398 \u2212 \u0398 \u2212 (PTHPT + \u03bb2PT )\u22121PT \u2206},P\u22a5\n= (cid:104)Q\u03b5 + \u2206 \u2212 R(PTR + PT )\u22121PT \u2206,P\u22a5\n\nT \u2206(cid:105).\n\nT \u2206(cid:105)\n\nThus, by the second inequalities of (8) and (9),\n(cid:104)\u03b5 + H(\u0398 \u2212 \u0398\u2217),P\u22a5\n\nT \u2206(cid:105) \u2264 \u03bb1(cid:107)P\u22a5\n\nT \u2206(cid:107)(N ).\n\n(24)\nSince \u0398\u2217 = \u2206\u2217 \u2212 \u0398 \u2208 T and the singular values of \u0398 is no smaller than (\u03c00sr \u2212 \u03bb1)/(\u03c00 + \u03bb2) \u2265\n(sr \u2212 \u03bb1/\u03bb2)/\u03be \u2265 4\u03bb1/(\u03bb2\u03be) by the second inequality in (7), Proposition 1 and (22) imply\n\nRem1 \u2264 2(cid:107)\u0398\u2217 \u2212 \u0398(cid:107)2\n\n(F )/{(\u03c00sr \u2212 \u03bb1)/(\u03c00 + \u03bb2)} \u2264 r(\u03bb1/\u03c00)2/(8\u03be\u03bb1/\u03bb2).\n\nIt follows from (23), (24) and (25) that\n\n\u03be2(cid:107)\u2206(cid:107)2\n\n(F ) \u2264 \u03be2(cid:104)(H + \u03bb2)\u2206, \u2206(cid:105)/\u03bb2 \u2264 \u03be2(\u03bb1/\u03bb2)Rem1 \u2264 r\u03bb2\n\n1/(4\u03c02\n\n0).\n\n(25)\n\n(26)\n\nTherefore, the error bound (10) follows from (21), (22) and (26).\nCase 2: \u03bb2 \u2265 \u03c00. 
By applying the derivation of (23) to \u0398\u0304 instead of \u0398\u2217, we find\n\n\u27e8(H + \u03bb_2)\u2206\u2217, \u2206\u2217\u27e9 + \u03bb_1\u2016P_T^\u22a5 \u2206\u2217\u2016_(N) \u2264 \u03bb_1(\u2016\u0398\u0304\u2016_(N) \u2212 \u27e8UV^\u22a4, \u0398\u0304\u27e9) + \u27e8\u03b5 + H(\u0398 \u2212 \u0398\u0304) \u2212 \u03bb_2 \u0398\u0304 \u2212 \u03bb_1 UV^\u22a4, \u2206\u2217\u27e9.\n\nBy the definitions of \u2206, R, and \u0398\u0304, \u2206 = (H \u2212 \u03c0_0)(\u0398 \u2212 \u0398\u0304) = H(\u0398 \u2212 \u0398\u0304) \u2212 \u03bb_2 \u0398\u0304 \u2212 \u03bb_1 UV^\u22a4. This and \u2016\u0398\u0304\u2016_(N) = \u27e8UV^\u22a4, \u0398\u0304\u27e9 give\n\n\u27e8(H + \u03bb_2)\u2206\u2217, \u2206\u2217\u27e9 + \u03bb_1\u2016P_T^\u22a5 \u2206\u2217\u2016_(N) \u2264 \u27e8\u03b5 + \u2206, \u2206\u2217\u27e9.   (27)\n\nSince \u2016P_T^\u22a5 (\u03b5 + \u2206)\u2016_(S) = \u2016P_T^\u22a5 \u03b5\u2016_(S) \u2264 \u03bb_1 by the third inequality in (9), we have\n\n\u27e8P_T^\u22a5 (\u03b5 + \u2206), \u2206\u2217\u27e9 \u2264 \u03bb_1\u2016P_T^\u22a5 \u2206\u2217\u2016_(N).   (28)\n\nIt follows from (27), (28) and the first inequalities of (8) and (9) that\n\n\u03bb_2\u2016\u2206\u2217\u2016_(F)^2 \u2264 \u27e8P_T (\u03b5 + \u2206), \u2206\u2217\u27e9 \u2264 {\u2016P_T \u03b5\u2016_(F) + \u2016P_T \u2206\u2016_(F)}\u2016\u2206\u2217\u2016_(F) \u2264 \u221ar \u03bb_1\u2016\u2206\u2217\u2016_(F)/2.\n\nThus, due to \u03bb_2 \u2265 \u03c0_0,\n\n\u03be\u2016\u2206\u2217\u2016_(F) \u2264 (\u03be/\u03bb_2)\u221ar \u03bb_1/2 \u2264 \u221ar \u03bb_1/\u03c0_0.   (29)\n\nTherefore, the error bound (10) follows from (20) and (29). \u25a1\n\nAcknowledgments\n\nThis research is partially supported by the NSF Grants DMS 0906420, DMS-11-06753 and DMS-12-09014, and NSA Grant H98230-11-1-0205.\n\nReferences\n\n[1] ACM SIGKDD and Netflix. 
Proceedings of KDD Cup and workshop. 2007.\n\n[2] E. Cand\u00e8s and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9:717\u2013772, 2009.\n\n[3] E. J. Cand\u00e8s and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theory, 56(5):2053\u20132080, 2009.\n\n[4] D. Gross. Recovering low-rank matrices from few coefficients in any basis. CoRR, abs/0910.1879, 2009.\n\n[5] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Transactions on Information Theory, 56(6):2980\u20132998, 2010.\n\n[6] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11:2057\u20132078, 2010.\n\n[7] V. Koltchinskii, K. Lounici, and A. B. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39:2302\u20132329, 2011.\n\n[8] R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287\u20132322, 2010.\n\n[9] S. Negahban and M. J. Wainwright. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. 2010.\n\n[10] R. I. Oliveira. Concentration of the adjacency matrix and of the laplacian in random graphs with independent edges. Technical Report arXiv:0911.0600, arXiv, 2010.\n\n[11] B. Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413\u20133430, 2011.\n\n[12] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., doi:10.1007/s10208-011-9099-z, 2011.\n\n[13] P.-A. Wedin. Perturbation bounds in connection with singular value decomposition. BIT, 12:99\u2013111, 1972.\n\n[14] C.-H. Zhang and T. Zhang. A general framework of dual certificate analysis for structured sparse recovery problems. 
Technical report, arXiv:1201.3302v1, 2012.\n\n[15] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. J. R. Statist. Soc. B, 67:301\u2013320, 2005.", "award": [], "sourceid": 397, "authors": [{"given_name": "Tingni", "family_name": "Sun", "institution": null}, {"given_name": "Cun-hui", "family_name": "Zhang", "institution": null}]}