{"title": "Regularization-Free Estimation in Trace Regression with Symmetric Positive Semidefinite Matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 2782, "page_last": 2790, "abstract": "Trace regression models have received considerable attention in the context of matrix completion, quantum state tomography, and compressed sensing. Estimation of the underlying matrix from regularization-based approaches promoting low-rankedness, notably nuclear norm regularization, have enjoyed great popularity. In this paper, we argue that such regularization may no longer be necessary if the underlying matrix is symmetric positive semidefinite (spd) and the design satisfies certain conditions. In this situation, simple least squares estimation subject to an spd constraint may perform as well as regularization-based approaches with a proper choice of regularization parameter, which entails knowledge of the noise level and/or tuning. By contrast, constrained least squaresestimation comes without any tuning parameter and may hence be preferred due to its simplicity.", "full_text": "Regularization-Free Estimation in Trace Regression with\n\nSymmetric Positive Semide\ufb01nite Matrices\n\nMartin Slawski\nPing Li\nDepartment of Statistics & Biostatistics\nDepartment of Computer Science\nRutgers University\nPiscataway, NJ 08854, USA\n{martin.slawski@rutgers.edu,\npingli@stat.rutgers.edu}\n\nMatthias Hein\nDepartment of Computer Science\nDepartment of Mathematics\nSaarland University\nSaarbr\u00a8ucken, Germany\nhein@cs.uni-saarland.de\n\nAbstract\n\nTrace regression models have received considerable attention in the context of\nmatrix completion, quantum state tomography, and compressed sensing. Esti-\nmation of the underlying matrix from regularization-based approaches promoting\nlow-rankedness, notably nuclear norm regularization, have enjoyed great popular-\nity. 
In this paper, we argue that such regularization may no longer be necessary if the underlying matrix is symmetric positive semidefinite (spd) and the design satisfies certain conditions. In this situation, simple least squares estimation subject to an spd constraint may perform as well as regularization-based approaches with a proper choice of regularization parameter, which entails knowledge of the noise level and/or tuning. By contrast, constrained least squares estimation comes without any tuning parameter and may hence be preferred due to its simplicity.

1 Introduction

Trace regression models of the form

  y_i = tr(X_i^⊤ Σ*) + ε_i,  i = 1, ..., n,  (1)

where Σ* ∈ R^{m1×m2} is the parameter of interest to be estimated given measurement matrices X_i ∈ R^{m1×m2} and observations y_i contaminated by errors ε_i, i = 1, ..., n, have attracted considerable interest in high-dimensional statistical inference, machine learning and signal processing over the past few years. Research in these areas has focused on a setting with few measurements n ≪ m1·m2 and Σ* being (approximately) of low rank r ≪ min{m1, m2}. Such a setting is relevant to problems such as matrix completion [6, 23], compressed sensing [5, 17], quantum state tomography [11] and phase retrieval [7]. A common thread in these works is the use of the nuclear norm of a matrix as a convex surrogate for its rank [18] in regularized estimation amenable to modern optimization techniques. This approach can be seen as a natural generalization of ℓ1-norm (aka lasso) regularization for the linear regression model [24], which arises as a special case of model (1) in which both Σ* and the {X_i}^n_{i=1} are diagonal. It is inarguable that in general regularization is essential if n < m1·m2.

The situation is less clear if Σ* is known to satisfy additional constraints that can be incorporated in estimation. Specifically, in the present paper we consider the case in which m1 = m2 = m and Σ* is known to be symmetric positive semidefinite (spd), i.e. Σ* ∈ S^m_+, with S^m_+ denoting the positive semidefinite cone in the space of symmetric real m × m matrices S^m. The set S^m_+ deserves specific interest as it includes covariance matrices and Gram matrices in kernel-based learning [20]. It is rather common for these matrices to be of low rank (at least approximately), given the widespread use of principal components analysis and low-rank kernel approximations [28]. In the present paper, we focus on the usefulness of the spd constraint for estimation. We argue that if Σ* is spd and the measurement matrices {X_i}^n_{i=1} obey certain conditions, constrained least squares estimation

  min_{Σ ∈ S^m_+} (1/(2n)) ∑_{i=1}^n (y_i − tr(X_i^⊤ Σ))²  (2)

may perform similarly well in prediction and parameter estimation as approaches employing nuclear norm regularization with a proper choice of the regularization parameter, including the interesting regime n < δ_m, where δ_m = dim(S^m) = m(m+1)/2. Note that the objective in (2) only consists of a data fitting term and is hence convenient to work with in practice, since there is no free parameter. Our findings can be seen as a non-commutative extension of recent results on non-negative least squares estimation for linear regression [16, 21].

Related work. Model (1) with Σ* ∈ S^m_+ has been studied in several recent papers. A good deal of these papers consider the setup of compressed sensing in which the {X_i}^n_{i=1} can be chosen by the user, with the goal to minimize the number of observations required to (approximately) recover Σ*.
For example, in [27], recovery of a low-rank Σ* from noiseless observations (ε_i = 0, i = 1, ..., n) by solving a feasibility problem over S^m_+ is considered, which is equivalent to the constrained least squares problem (2) in a noiseless setting.

In [3, 8], recovery from rank-one measurements is considered, i.e., for {x_i}^n_{i=1} ⊂ R^m,

  y_i = x_i^⊤ Σ* x_i + ε_i = tr(X_i^⊤ Σ*) + ε_i, with X_i = x_i x_i^⊤,  i = 1, ..., n.  (3)

As opposed to [3, 8], where estimation based on nuclear norm regularization is proposed, the present work is devoted to regularization-free estimation. While rank-one measurements as in (3) are also in the center of interest herein, our framework is not limited to this case. In [3], an application of (3) to covariance matrix estimation given only one-dimensional projections {x_i^⊤ z_i}^n_{i=1} of the data points is discussed, where the {z_i}^n_{i=1} are i.i.d. from a distribution with zero mean and covariance matrix Σ*. In fact, this fits the model under study with observations

  y_i = (x_i^⊤ z_i)² = x_i^⊤ z_i z_i^⊤ x_i = x_i^⊤ Σ* x_i + ε_i,  ε_i = x_i^⊤ {z_i z_i^⊤ − Σ*} x_i,  i = 1, ..., n.  (4)

Specializing (3) to the case in which Σ* = σ*(σ*)^⊤, one obtains the quadratic model

  y_i = |x_i^⊤ σ*|² + ε_i,  (5)

which (with complex-valued σ*) is relevant to the problem of phase retrieval [14]. The approach of [7] treats (5) as an instance of (1) and uses nuclear norm regularization to enforce rank-one solutions. In follow-up work [4], the authors show a refined recovery result stating that imposing an spd constraint − without regularization − suffices. A similar result has been proven independently by [10]. However, the results in [4] and [10] only concern model (5).
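As an aside, the error decomposition in (4) is a purely algebraic identity in x_i and z_i; the following minimal numpy sketch (our illustration, not part of the original development) confirms it on a random draw:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4
A = rng.standard_normal((m, m))
Sigma_star = A @ A.T                         # a covariance matrix Sigma*
x = rng.standard_normal(m)                   # projection direction x_i
# z_i drawn with covariance Sigma* (small jitter keeps the Cholesky factor well-defined)
z = np.linalg.cholesky(Sigma_star + 1e-9 * np.eye(m)) @ rng.standard_normal(m)

y = (x @ z) ** 2                             # observation y_i = (x_i' z_i)^2
eps = x @ (np.outer(z, z) - Sigma_star) @ x  # error term eps_i from (4)
# identity (4): y_i = x_i' Sigma* x_i + eps_i holds exactly, for any x_i and z_i
assert np.isclose(y, x @ Sigma_star @ x + eps)
```

Note that the identity holds pointwise for any z_i; the distributional assumption on z_i only ensures that ε_i has zero mean.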
After posting an extended version [22] of the present paper, a generalization of [4, 10] to general low-rank spd matrices has been achieved in [13]. Since [4, 10, 13] consider bounded noise, whereas the analysis herein assumes Gaussian noise, our results are not directly comparable to those in [4, 10, 13].

Notation. M_d denotes the space of real d × d matrices with inner product ⟨M, M′⟩ := tr(M^⊤ M′). The subspace of symmetric matrices S^d has dimension δ_d := d(d+1)/2. M ∈ S^d has an eigen-decomposition M = UΛU^⊤ = ∑_{j=1}^d λ_j(M) u_j u_j^⊤, where λ_1(M) = λ_max(M) ≥ λ_2(M) ≥ ... ≥ λ_d(M) = λ_min(M), Λ = diag(λ_1(M), ..., λ_d(M)), and U = [u_1 ... u_d]. For q ∈ [1, ∞) and M ∈ S^d, ‖M‖_q := (∑_{j=1}^d |λ_j(M)|^q)^{1/q} denotes the Schatten-q-norm (q = 1: nuclear norm; q = 2: Frobenius norm ‖M‖_F; q = ∞: spectral norm ‖M‖_∞ := max_{1≤j≤d} |λ_j(M)|). Let S_1(d) = {M ∈ S^d : ‖M‖_1 = 1} and S^+_1(d) = S_1(d) ∩ S^d_+. The symbols ≽, ≼, ≻, ≺ refer to the semidefinite ordering. For a set A and α ∈ R, αA := {αa : a ∈ A}.

It is convenient to re-write model (1) as y = X(Σ*) + ε, where y = (y_i)^n_{i=1}, ε = (ε_i)^n_{i=1}, and X : M_m → R^n is a linear map defined by (X(M))_i = tr(X_i^⊤ M), i = 1, ..., n, referred to as the sampling operator. Its adjoint X* : R^n → M_m is given by the map v ↦ ∑_{i=1}^n v_i X_i.

Supplement. The appendix contains all proofs, additional experiments and figures.

2 Analysis

Preliminaries. Throughout this section, we consider a special instance of model (1) in which

  y_i = tr(X_i Σ*) + ε_i, where Σ* ∈ S^m_+, X_i ∈ S^m, and ε_i i.i.d. ∼ N(0, σ²),  i = 1, ..., n.  (6)

The assumption that the errors {ε_i}^n_{i=1} are Gaussian is made for convenience, as it simplifies the stochastic part of our analysis, which could be extended to sub-Gaussian errors. Note that w.l.o.g. we may assume that {X_i}^n_{i=1} ⊂ S^m. In fact, since Σ* ∈ S^m, for any M ∈ M_m we have that tr(M Σ*) = tr(M^sym Σ*), where M^sym = (M + M^⊤)/2.

In the sequel, we study the statistical performance of the constrained least squares estimator

  Σ̂ ∈ argmin_{Σ ∈ S^m_+} (1/(2n)) ‖y − X(Σ)‖²_2  (7)

under model (6). More specifically, under certain conditions on X, we shall derive bounds on

  (a) (1/n) ‖X(Σ*) − X(Σ̂)‖²_2,  and  (b) ‖Σ̂ − Σ*‖_1,  (8)

where (a) will be referred to as "prediction error" below. The most basic method for estimating Σ* is ordinary least squares (ols) estimation

  Σ̂_ols ∈ argmin_{Σ ∈ S^m} (1/(2n)) ‖y − X(Σ)‖²_2,  (9)

which is computationally simpler than (7). While (7) requires convex programming, (9) boils down to solving a linear system of equations in δ_m = m(m+1)/2 variables. On the other hand, the prediction error of ols scales as O_P(dim(range(X))/n), where dim(range(X)) can be as large as min{n, δ_m}, in which case the prediction error vanishes only if δ_m/n → 0 as n → ∞. Moreover, the estimation error ‖Σ̂_ols − Σ*‖_1 is unbounded unless n ≥ δ_m. Research conducted over the past few years has thus focused on methods dealing successfully with the case n < δ_m as long as the target Σ* has additional structure, notably low-rankedness. Indeed, if Σ* has rank r ≪ m, the intrinsic dimension of the problem becomes (roughly) mr ≪ δ_m.
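To make the computational contrast between (7) and (9) concrete, here is a self-contained numpy sketch (our illustration, not the authors' code): (9) is solved as a linear system in the δ_m = m(m+1)/2 free upper-triangular entries of a symmetric matrix, while (7) is approximated by projected gradient descent, projecting onto S^m_+ by zeroing out negative eigenvalues. Function names, step size, and iteration count are our illustrative choices.

```python
import numpy as np

def ols_estimator(X_ops, y):
    """Ordinary least squares (9): a linear system in the
    delta_m = m(m+1)/2 upper-triangular entries of Sigma."""
    n, m, _ = X_ops.shape
    iu = np.triu_indices(m)
    # tr(X_i Sigma) = sum_j X_i[j,j] S[j,j] + 2 sum_{j<k} X_i[j,k] S[j,k] for symmetric X_i
    w = np.where(iu[0] == iu[1], 1.0, 2.0)
    A = X_ops[:, iu[0], iu[1]] * w
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    S = np.zeros((m, m))
    S[iu] = coef
    return S + S.T - np.diag(np.diag(S))

def project_psd(S):
    """Projection onto S^m_+ : clip negative eigenvalues at zero."""
    w, V = np.linalg.eigh((S + S.T) / 2)
    return (V * np.clip(w, 0.0, None)) @ V.T

def constrained_ls(X_ops, y, n_iter=2000):
    """Projected gradient descent for the spd-constrained problem (7)."""
    n, m, _ = X_ops.shape
    # Lipschitz constant of the gradient of (1/2n)||y - X(Sigma)||^2
    L = np.linalg.norm(X_ops.reshape(n, -1), 2) ** 2 / n
    Sigma = np.zeros((m, m))
    for _ in range(n_iter):
        resid = np.einsum('ijk,jk->i', X_ops, Sigma) - y  # X(Sigma) - y
        grad = np.einsum('i,ijk->jk', resid, X_ops) / n   # (1/n) X*(resid)
        Sigma = project_psd(Sigma - grad / L)
    return Sigma
```

On a noiseless, well-conditioned instance with n ≥ δ_m, both recover Σ*; the point of the analysis below is their different behaviour when n < δ_m.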
In a large body of work, nuclear norm regularization, which serves as a convex surrogate of rank regularization, is considered as a computationally convenient alternative for which a series of adaptivity properties to underlying low-rankedness has been established, e.g. [5, 15, 17, 18, 19]. Complementing (9) with nuclear norm regularization yields the estimator

  Σ̂_1 ∈ argmin_{Σ ∈ S^m} (1/(2n)) ‖y − X(Σ)‖²_2 + λ‖Σ‖_1,  (10)

where λ > 0 is a regularization parameter. In case an spd constraint is imposed, (10) becomes

  Σ̂_{1+} ∈ argmin_{Σ ∈ S^m_+} (1/(2n)) ‖y − X(Σ)‖²_2 + λ tr(Σ).  (11)

Our analysis aims at elucidating potential advantages of the spd constraint in the constrained least squares problem (7) from a statistical point of view. It turns out that depending on properties of X, the behaviour of Σ̂ can range from a performance similar to the least squares estimator Σ̂_ols on the one hand to a performance similar to the nuclear norm regularized estimator Σ̂_{1+} with properly chosen/tuned λ on the other hand. The latter case appears to be remarkable: Σ̂ may enjoy similar adaptivity properties as nuclear norm regularized estimators even though Σ̂ is obtained from a pure data fitting problem without explicit regularization.

2.1 Negative result

We first discuss a negative example of X for which the spd-constrained estimator Σ̂ does not improve (substantially) over the unconstrained estimator Σ̂_ols. At the same time, this example provides clues on conditions to be imposed on X to achieve substantially better performance.

Random Gaussian design. Consider the Gaussian orthogonal ensemble (GOE)

  GOE(m) = {X = (x_jk) : {x_jj}^m_{j=1} i.i.d. ∼ N(0, 1), {x_jk = x_kj}_{1≤j<k≤m} i.i.d. ∼ N(0, 1/2)}.

Proposition 1. Consider X_i i.i.d. ∼ GOE(m), i = 1, ..., n. For any ε > 0, if n ≤ (1 − ε)δ_m/2, then with probability at least 1 − 32 exp(−ε²δ_m), there exists Δ ∈ S^m_+, Δ ≠ 0, such that X(Δ) = 0.

Proposition 1 implies that if the number of measurements drops below 1/2 of the ambient dimension δ_m, estimating Σ* based on (7) becomes ill-posed; the estimation error ‖Σ̂ − Σ*‖_1 is unbounded, irrespective of the rank of Σ*. Geometrically, the consequence of Proposition 1 is that the convex cone C_X = {z ∈ R^n : z = X(Δ), Δ ∈ S^m_+} contains 0. Unless 0 is contained in the boundary of C_X (we conjecture that this event has measure zero), this means that C_X = R^n, i.e. the spd constraint becomes vacuous.

2.2 Slow Rate Bound on the Prediction Error

We present a positive result on the spd-constrained least squares estimator Σ̂ under an additional condition on the sampling operator X. Specifically, the prediction error will be bounded as

  (1/n) ‖X(Σ*) − X(Σ̂)‖²_2 = O(λ_0 ‖Σ*‖_1 + λ²_0),  where λ_0 = (1/n) ‖X*(ε)‖_∞,  (12)

with λ_0 typically being of the order O(√(m/n)) (up to log factors). The rate in (12) can be a significant improvement over what is achieved by Σ̂_ols if ‖Σ*‖_1 = tr(Σ*) is small. If λ_0 = o(‖Σ*‖_1), that rate coincides with those of the nuclear norm regularized estimators (10), (11) with regularization parameter λ ≥ λ_0, cf. Theorem 1 in [19].
For nuclear norm regularized estimators, the rate O(λ_0 ‖Σ*‖_1) is achieved for any choice of X and is slow in the sense that the squared prediction error only decays at the rate n^{−1/2} instead of n^{−1}.

Condition on X. In order to arrive at a suitable condition to be imposed on X so that (12) can be achieved, it makes sense to re-consider the negative example of Proposition 1, which states that as long as n is bounded away from δ_m/2 from above, there is a non-trivial Δ ∈ S^m_+ such that X(Δ) = 0. Equivalently, dist(P_X, 0) = min_{Δ ∈ S^+_1(m)} ‖X(Δ)‖_2 = 0, where

  P_X := {z ∈ R^n : z = X(Δ), Δ ∈ S^+_1(m)},  and  S^+_1(m) := {Δ ∈ S^m_+ : tr(Δ) = 1}.

In this situation, it is impossible to derive a non-trivial bound on the prediction error, as dist(P_X, 0) = 0 may imply C_X = R^n so that ‖X(Σ*) − X(Σ̂)‖²_2 = ‖ε‖²_2. To rule this out, the condition dist(P_X, 0) > 0 is natural. More strongly, one may ask for the following:

  There exists a constant τ > 0 such that τ²_0(X) := min_{Δ ∈ S^+_1(m)} (1/n) ‖X(Δ)‖²_2 ≥ τ².  (13)

An analogous condition is sufficient for a slow rate bound in the vector case, cf. [21]. However, the condition for the slow rate bound in Theorem 1 below is somewhat stronger than (13).

Condition 1. There exist constants R_* > 1, τ_* > 0 s.t. τ²(X, R_*) ≥ τ²_*, where for R ∈ R,

  τ²(X, R) = dist²(R·P_X, P_X)/n = min_{A ∈ R·S^+_1(m), B ∈ S^+_1(m)} (1/n) ‖X(A) − X(B)‖²_2.

The following condition is sufficient for Condition 1 and in some cases much easier to check.

Proposition 2. Suppose there exist a ∈ R^n with ‖a‖_2 ≤ 1 and constants 0 < φ_min ≤ φ_max s.t.

  λ_min(n^{−1/2} X*(a)) ≥ φ_min,  and  λ_max(n^{−1/2} X*(a)) ≤ φ_max.

Then for any ζ > 1, X satisfies Condition 1 with R_* = ζ(φ_max/φ_min) and τ²_* = (ζ − 1)² φ²_max.

The condition of Proposition 2 can be phrased as having a positive definite matrix in the image of the unit ball under X*, which, after scaling by 1/√n, has its smallest eigenvalue bounded away from zero and a bounded condition number. As a simple example, suppose that X_1 = √n·I. Invoking Proposition 2 with a = (1, 0, ..., 0)^⊤ and ζ = 2, we find that Condition 1 is satisfied with R_* = 2 and τ²_* = 1. A more interesting example is random design where the {X_i}^n_{i=1} are (sample) covariance matrices, where the underlying random vectors satisfy appropriate tail or moment conditions.

Corollary 1. Let π_m be a probability distribution on R^m with second moment matrix Γ := E_{z∼π_m}[z z^⊤] satisfying λ_min(Γ) > 0. Consider the random matrix ensemble

  M(π_m, q) = {(1/q) ∑_{k=1}^q z_k z_k^⊤ : {z_k}^q_{k=1} i.i.d. ∼ π_m}.  (14)

Suppose that {X_i}^n_{i=1} i.i.d. ∼ M(π_m, q) and let Γ̂_n := (1/n) ∑_{i=1}^n X_i and 0 < ǫ_n < λ_min(Γ). Under the event {‖Γ − Γ̂_n‖_∞ ≤ ǫ_n}, X satisfies Condition 1 with

  R_* = 2(λ_max(Γ) + ǫ_n)/(λ_min(Γ) − ǫ_n)  and  τ²_* = (λ_max(Γ) + ǫ_n)².

It is instructive to spell out Corollary 1 with π_m as the standard Gaussian distribution on R^m. The matrix Γ̂_n equals the sample covariance matrix computed from N = n·q samples.
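For the rank-one ensemble (q = 1), Proposition 2 can be probed numerically: choosing a = n^{−1/2}(1, ..., 1)^⊤ (so that ‖a‖_2 = 1) gives n^{−1/2} X*(a) = (1/n) ∑_i X_i = Γ̂_n, so the condition reduces to the sample second-moment matrix being well-conditioned. A numpy sketch (our illustration; the sizes and ζ = 2 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 20, 4000
Z = rng.standard_normal((n, m))   # z_i ~ N(0, I); measurements X_i = z_i z_i^T, cf. (3)
Gamma_hat = Z.T @ Z / n           # (1/n) sum_i X_i = n^{-1/2} X*(a) for a = (1,...,1)/sqrt(n)

eigs = np.linalg.eigvalsh(Gamma_hat)
phi_min, phi_max = eigs[0], eigs[-1]    # candidate constants of Proposition 2

zeta = 2.0                               # any zeta > 1 works in Proposition 2
R_star = zeta * phi_max / phi_min        # R_* = zeta * (phi_max / phi_min)
tau_star_sq = (zeta - 1.0) ** 2 * phi_max ** 2   # tau_*^2 = (zeta - 1)^2 phi_max^2
```

With m/n small, φ_min and φ_max are close to 1, so R_* stays moderate, in line with Corollary 1 and the concentration discussion that follows.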
It is well-known [9] that for m, N large, λ_max(Γ̂_n) and λ_min(Γ̂_n) concentrate sharply around (1 + η_n)² and (1 − η_n)², respectively, where η_n = √(m/N). Hence, for any γ > 0, there exists C_γ > 1 so that if N ≥ C_γ m, it holds that R_* ≤ 2 + γ. Similar though weaker concentration results for ‖Γ − Γ̂_n‖_∞ exist for the broader class of distributions π_m with finite fourth moments [26]. Specialized to q = 1, Corollary 1 yields a statement about X made up from random rank-one measurements X_i = z_i z_i^⊤, i = 1, ..., n, cf. (3). The preceding discussion indicates that Condition 1 tends to be satisfied in this case.

Theorem 1. Suppose that model (6) holds with X satisfying Condition 1 with constants R_* and τ²_*. We then have

  (1/n) ‖X(Σ*) − X(Σ̂)‖²_2 ≤ max{2(1 + R_*) λ_0 ‖Σ*‖_1, 2λ_0 ‖Σ*‖_1 + 8(λ_0 R_*/τ_*)²},

where, for any µ ≥ 0, with probability at least 1 − (2m)^{−µ},

  λ_0 ≤ σ √((1 + µ) 2 log(2m) V²_n / n),  where V²_n = ‖(1/n) ∑_{i=1}^n X²_i‖_∞.

Remark: Under the scalings R_* = O(1) and τ²_* = Ω(1), the bound of Theorem 1 is of the order O(λ_0 ‖Σ*‖_1 + λ²_0), as announced at the beginning of this section. For given X, the quantity τ²(X, R) can be evaluated by solving a least squares problem with spd constraints. Hence it is feasible to check in practice whether Condition 1 holds. For later reference, we evaluate the term V²_n for M(π_m, q) with π_m as standard Gaussian distribution.
As shown in the supplement, with high probability, V²_n = O(m log n) holds as long as m = O(nq).

2.3 Bound on the Estimation Error

In the previous subsection, we did not make any assumptions about Σ* apart from Σ* ∈ S^m_+. Henceforth, we suppose that Σ* is of low rank 1 ≤ r ≪ m and study the performance of the constrained least squares estimator (7) for prediction and estimation in such a setting.

Preliminaries. Let Σ* = UΛU^⊤ be the eigenvalue decomposition of Σ*, where U = [U_k U_⊥] with U_k of size m × r and U_⊥ of size m × (m − r), and Λ has the block form

  Λ = [ Λ_r, 0_{r×(m−r)}; 0_{(m−r)×r}, 0_{(m−r)×(m−r)} ],

where Λ_r is diagonal with positive diagonal entries. Consider the linear subspace

  T^⊥ = {M ∈ S^m : M = U_⊥ A U_⊥^⊤, A ∈ S^{m−r}}.

From U_⊥^⊤ Σ* U_⊥ = 0, it follows that Σ* is contained in the orthogonal complement

  T = {M ∈ S^m : M = U_k B + B^⊤ U_k^⊤, B ∈ R^{r×m}},

of dimension mr − r(r − 1)/2 ≪ δ_m if r ≪ m. The image of T under X is denoted by 𝒯.

Conditions on X. We introduce the key quantities the bound in this subsection depends on.

Separability constant.

  τ²(T) = (1/n) dist²(𝒯, P_X) = min_{Θ ∈ T, Λ ∈ S^+_1(m) ∩ T^⊥} (1/n) ‖X(Θ) − X(Λ)‖²_2,
  where P_X := {z ∈ R^n : z = X(Δ), Δ ∈ T^⊥ ∩ S^+_1(m)}.

Restricted eigenvalue.

  φ²(T) = min_{0 ≠ Δ ∈ T} (‖X(Δ)‖²_2 / n) / ‖Δ‖²_1.

As indicated by the following statement concerning the noiseless case, for bounding ‖Σ̂ − Σ*‖_1 it is inevitable to have lower bounds on the above two quantities.

Proposition 3. Consider the trace regression model (1) with ε_i = 0, i = 1, ..., n.
Then

  argmin_{Σ ∈ S^m_+} (1/(2n)) ‖X(Σ*) − X(Σ)‖²_2 = {Σ*} for all Σ* ∈ T ∩ S^m_+

if and only if it holds that τ²(T) > 0 and φ²(T) > 0.

Correlation constant. Moreover, we make use of the following quantity. It is not clear to us if it is intrinsically required, or if its appearance in our bound is for merely technical reasons.

  µ(T) = max{(1/n) ⟨X(Δ), X(Δ′)⟩ : ‖Δ‖_1 ≤ 1, Δ ∈ T, Δ′ ∈ S^+_1(m) ∩ T^⊥}.

We are now in a position to provide a bound on ‖Σ̂ − Σ*‖_1.

Theorem 2. Suppose that model (6) holds with Σ* as considered throughout this subsection and let λ_0 be defined as in Theorem 1. We then have

  ‖Σ̂ − Σ*‖_1 ≤ max{ 8λ_0 (µ(T)/(τ²(T) φ²(T))) (3 + 2µ(T)/φ²(T)),
                     8λ_0 (µ(T)/φ²(T)) (1 + µ(T)/φ²(T)) + 4λ_0 (1/φ²(T) + 1/τ²(T)),
                     8λ_0/τ²(T) }.

Remark. Given Theorem 2, an improved bound on the prediction error scaling with λ²_0 in place of λ_0 can be derived, cf. (26) in Appendix D.

The quality of the bound of Theorem 2 depends on how the quantities τ²(T), φ²(T) and µ(T) scale with n, m and r, which is design-dependent. Accordingly, the estimation error in nuclear norm can be non-finite in the worst case and O(λ_0 r) in the best case, which matches existing bounds for nuclear norm regularization (cf. Theorem 2 in [19]).

• The quantity τ²(T) is specific to the geometry of the constrained least squares problem (7) and hence of critical importance. For instance, it follows from Proposition 1 that for standard Gaussian measurements, τ²(T) = 0 with high probability once n < δ_m/2.
The situation can be much better for random spd measurements (14), as exemplified for measurements X_i = z_i z_i^⊤ with z_i i.i.d. ∼ N(0, I) in Appendix H. Specifically, it turns out that τ²(T) = Ω(1/r) as long as n = Ω(m·r).

• It is not restrictive to assume that φ²(T) is positive. Indeed, without that assumption, even an oracle estimator based on knowledge of the subspace T would fail. Reasonable sampling operators X have rank min{n, δ_m}, so that the nullspace of X only has a trivial intersection with the subspace T as long as n ≥ dim(T) = mr − r(r − 1)/2.

• For fixed T, computing µ(T) entails solving a biconvex (albeit non-convex) optimization problem in Δ ∈ T and Δ′ ∈ S^+_1(m) ∩ T^⊥. Block coordinate descent is a practical approach to such optimization problems, for which a globally optimal solution is out of reach. In this manner we explore the scaling of µ(T) numerically, as done for τ²(T). We find that µ(T) = O(δ_m/n), so that µ(T) = O(1) apart from the regime n/δ_m → 0, without ruling out the possibility of undersampling, i.e. n < δ_m.

3 Numerical Results

In this section, we empirically study properties of the estimator Σ̂. In particular, its performance relative to regularization-based methods is explored. We also present an application to spiked covariance estimation for the CBCL face image data set and stock prices from NASDAQ.

Comparison with regularization-based approaches. We here empirically evaluate ‖Σ̂ − Σ*‖_1 relative to well-known regularization-based methods.

Setup. We consider rank-one Wishart measurement matrices X_i = z_i z_i^⊤, z_i i.i.d. ∼ N(0, I), i = 1, ..., n, fix m = 50 and let n ∈ {0.24, 0.26, ..., 0.36, 0.4, ..., 0.56}·m² and r ∈ {1, 2, ..., 10} vary. Each configuration of (n, r) is run with 50 replications.
In each of these, we generate data

  y_i = tr(X_i Σ*) + σ ε_i,  σ = 0.1,  i = 1, ..., n,  (15)

where Σ* is generated randomly as a rank-r Wishart matrix and the {ε_i}^n_{i=1} are i.i.d. N(0, 1).

[Figure 1 appears here: one panel per r ∈ {1, 2, 4, 6, 8, 10}, plotting ‖Σ̂ − Σ*‖_1 against n for constrained LS, regularized LS, Chen et al. (each with and without oracular tuning "#"), and the oracle.]

Figure 1: Average estimation error (over 50 replications) in nuclear norm for fixed m = 50 and certain choices of n and r. In the legend, "LS" is used as a shortcut for "least squares". Chen et al. refers to (16). "#" indicates an oracular choice of the tuning parameter. "oracle" refers to the ideal error σr√(m/n). Best seen in color.

Regularization-based approaches. We compare Σ̂ to the corresponding nuclear norm regularized estimator in (11). Regarding the choice of the regularization parameter λ, we consider the grid λ*·{0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 4, 8, 16}, where λ* = σ√(m/n) as recommended in [17], and pick λ so that the prediction error on a separate validation data set of size n generated from (15) is minimized. Note that in general, neither σ is known nor an extra validation data set is available. Our goal here is to ensure that the regularization parameter is properly tuned. In addition, we consider an oracular choice of λ, where λ is picked from the above grid such that the performance measure of interest (the distance to the target in the nuclear norm) is minimized. We also compare to the constrained nuclear norm minimization approach of [8]:

  min_Σ tr(Σ) subject to Σ ≽ 0, and ‖y − X(Σ)‖_1 ≤ λ.  (16)

For λ, we consider the grid nσ√(2/π)·{0.2, 0.3, ..., 1, 1.25}.
This specific choice is motivated by the fact that E[‖y − X(Σ*)‖_1] = E[‖ε‖_1] = nσ√(2/π). Apart from that, tuning of λ is performed as for the nuclear norm regularized estimator. In addition, we have assessed the performance of the approach in [3], which does not impose an spd constraint but adds another constraint to (16). That additional constraint significantly complicates optimization and yields a second tuning parameter. Thus, instead of doing a 2D grid search, we use the fixed values given in [3] for known σ. The results are similar to or worse than those of (16) (note in particular that positive semidefiniteness is not taken advantage of in [3]) and are hence not reported.

Discussion of the results. We conclude from Figure 1 that in most cases, the performance of the constrained least squares estimator does not differ much from that of the regularization-based methods with careful tuning. For larger values of r, the constrained least squares estimator seems to require slightly more measurements to achieve competitive performance.

Real Data Examples. We now present an application to recovery of spiked covariance matrices, which are of the form Σ* = ∑_{j=1}^r λ_j u_j u_j^⊤ + σ²I, where r ≪ m and λ_j ≫ σ² > 0, j = 1, ..., r. This model appears frequently in connection with principal components analysis (PCA).

Extension to the spiked case. So far, we have assumed that Σ* is of low rank, but it is straightforward to extend the proposed approach to the case in which Σ* is spiked, as long as σ² is known or an estimate is available.
A constrained least squares estimator of \u03a3\u2217 takes the formb\u03a3 + \u03c32I, where\n\n1\n2nky \u2212 X (\u03a3 + \u03c32I)k2\n2.\n\n(17)\n\nThe case of \u03c32 unknown or general (unknown) diagonal perturbation is left for future research.\n\nb\u03a3 \u2208 argmin\n\n\u03a3\u2208Sm\n+\n\n7\n\n\f)\n\nF\n\ni\n\n \n\n|\n*\na\nm\ng\nS\n\u2212\na\nm\ng\nS\n\n \n\ni\n\n|\n(\n0\n1\ng\no\n\nl\n\n0.6\n\n0.4\n\n0.2\n\n0\n\n\u22120.2\n\n\u22120.4\n\n\u22120.6\n\n\u22120.8\n\n\u22121\n\n\u22121.2\n\n\u22121.4\n\n\u03b2 = 1/N (1 sample)\n\n\u03b2 = 0.008\n\n\u03b2 = 0.08\n\n\u03b2 = 0.4\n\nCBCL\n\n\u03b2 = 1 (all samples)\n\noracle\n\nF\n\n)\n\ni\n\n \n\n|\n*\na\nm\ng\nS\n\u2212\n \na\nm\ng\nS\n\ni\n\n|\n(\n0\n1\ng\no\n\nl\n\n2\n\n1.5\n\n1\n\n0.5\n\n0\n\n\u22120.5\n\n\u03b2 = 1/N (1 sample)\n\n\u03b2 = 0.008\n\n\u03b2 = 0.08\n\n\u03b2 = 0.4\n\nNASDAQ\n\noracle\n\n\u03b2 = 1 (all samples)\n\n0\n\n2\n\n4\n\n6\n\nn / (m * r)\n\n8\n\n10\n\n12\n\n0\n\n1\n\n2\n\n3\n\nn / (m * r)\n\n4\n\n5\n\n6\n\nFigure 2: Average reconstruction errors log10kb\u03a3 \u2212 \u03a3\u2217kF in dependence of n/(mr) and the param-\n\neter \u03b2. \u201coracle\u201d refers to the best rank r-approximation \u03a3r.\n\n(i) The CBCL facial image data set [1] consist of N = 2429 images of 19 \u00d7 19 pixels\nData sets.\n(i.e., m = 361). We take \u03a3\u2217 as the sample covariance matrix of this data set. It turns out that\n\u03a3\u2217 can be well approximated by \u03a3r, r = 50, where \u03a3r is the best rank r approximation to \u03a3\u2217\nobtained from computing its eigendecomposition and setting to zero all but the top r eigenvalues.\n(ii) We construct a second data set from the daily end prices of m = 252 stocks from the technology\nsector in NASDAQ, starting from the beginning of the year 2000 to the end of the year 2014 (in\ntotal N = 3773 days, retrieved from finance.yahoo.com). 
We take Σ* as the resulting sample correlation matrix and choose r = 100.

Experimental setup. As in the preceding experiments, we consider n random Wishart measurements for the operator X, where n = C(mr) and C ranges from 0.25 to 12. Since ||Σ_r − Σ*||_F / ||Σ*||_F ≈ 10^{-3} for both data sets, we work with σ² = 0 in (17) for simplicity. To make recovery of Σ* more difficult, we make the problem noisy by using observations

y_i = tr(X_i S_i),  i = 1, . . . , n,   (18)

where S_i is an approximation to Σ* obtained from the sample covariance (respectively, sample correlation) matrix of βN data points sampled randomly with replacement from the entire data set, i = 1, . . . , n. Here β ranges from 0.4 down to 1/N, in which case S_i is computed from a single data point. For each choice of n and β, the reported results are averages over 20 replications.

Results. For the CBCL data set, as shown in Figure 2, Σ̂ accurately approximates Σ* once the number of measurements crosses 2mr. Performance degrades once additional noise is introduced into the problem by using the measurements (18). Even under significant perturbations (β = 0.08), reasonable reconstruction of Σ* remains possible, albeit the number of required measurements increases accordingly. In the extreme case β = 1/N, the error still decreases with n, but millions of samples seem to be required to achieve a reasonable reconstruction error.

The general picture is similar for the NASDAQ data set, but the differences between measurements based on the full sample correlation matrix on the one hand and approximations based on random subsampling (18) on the other hand are more pronounced.

4 Conclusion

We have investigated trace regression in the situation that the underlying matrix is symmetric positive semidefinite.
Under restrictions on the design, constrained least squares enjoys statistical properties similar to those of methods employing nuclear norm regularization. This may come as a surprise, as regularization is widely regarded as necessary in small-sample settings.

Acknowledgments

The work of Martin Slawski and Ping Li is partially supported by NSF-DMS-1444124, NSF-III-1360971, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137.

References

[1] CBCL face dataset. http://cbcl.mit.edu/software-datasets/FaceData2.html.

[2] D. Amelunxen, M. Lotz, M. McCoy, and J. Tropp. Living on the edge: phase transitions in convex programs with random data. Information and Inference, 3:224–294, 2014.

[3] T. Cai and A. Zhang. ROP: Matrix recovery via rank-one projections. The Annals of Statistics, 43:102–138, 2015.

[4] E. Candes and X. Li. Solving quadratic equations via PhaseLift when there are about as many equations as unknowns. Foundations of Computational Mathematics, 14:1017–1026, 2014.

[5] E. Candes and Y. Plan. Tight oracle bounds for low-rank matrix recovery from a minimal number of noisy measurements. IEEE Transactions on Information Theory, 57:2342–2359, 2011.

[6] E. Candes and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9:717–772, 2009.

[7] E. Candes, T. Strohmer, and V. Voroninski. PhaseLift: exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66:1241–1274, 2012.

[8] Y. Chen, Y. Chi, and A. Goldsmith. Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Transactions on Information Theory, 61:4034–4059, 2015.

[9] K. Davidson and S. Szarek. Local operator theory, random matrices and Banach spaces. In Handbook of the Geometry of Banach Spaces, volume 1, pages 317–366. 2001.

[10] L. Demanet and P. Hand. Stable optimizationless recovery from phaseless measurements. Journal of Fourier Analysis and Applications, 20:199–221, 2014.

[11] D. Gross, Y.-K. Liu, S. Flammia, S. Becker, and J. Eisert. Quantum state tomography via compressed sensing. Physical Review Letters, 105:150401, 2010.

[12] R. Horn and C. Johnson. Matrix Analysis. Cambridge University Press, 1985.

[13] M. Kabanava, R. Kueng, H. Rauhut, and U. Terstiege. Stable low-rank matrix recovery via null space properties. arXiv:1507.07184, 2015.

[14] M. Klibanov, P. Sacks, and A. Tikhonravov. The phase retrieval problem. Inverse Problems, 11:1–28, 1995.

[15] V. Koltchinskii, K. Lounici, and A. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39:2302–2329, 2011.

[16] N. Meinshausen. Sign-constrained least squares estimation for high-dimensional regression. The Electronic Journal of Statistics, 7:1607–1631, 2013.

[17] S. Negahban and M. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39:1069–1097, 2011.

[18] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52:471–501, 2010.

[19] A. Rohde and A. Tsybakov. Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39:887–930, 2011.

[20] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, Massachusetts, 2002.

[21] M. Slawski and M. Hein. Non-negative least squares for high-dimensional linear models: consistency and sparse recovery without regularization. The Electronic Journal of Statistics, 7:3004–3056, 2013.

[22] M. Slawski, P. Li, and M. Hein. Regularization-free estimation in trace regression with positive semidefinite matrices. arXiv:1504.06305, 2015.

[23] N. Srebro, J. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, pages 1329–1336, 2005.

[24] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.

[25] J. Tropp. User-friendly tools for random matrices: an introduction. 2014. http://users.cms.caltech.edu/~jtropp/.

[26] R. Vershynin. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25:655–686, 2012.

[27] M. Wang, W. Xu, and A. Tang. A unique 'nonnegative' solution to an underdetermined system: from vectors to matrices. IEEE Transactions on Signal Processing, 59:1007–1016, 2011.

[28] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 14, pages 682–688, 2001.