{"title": "Efficient Structured Matrix Rank Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1350, "page_last": 1358, "abstract": "We study the problem of finding structured low-rank matrices using nuclear norm regularization where the structure is encoded by a linear map. In contrast to most known approaches for linearly structured rank minimization, we do not (a) use the full SVD; nor (b) resort to augmented Lagrangian techniques; nor (c) solve linear systems per iteration. Instead, we formulate the problem differently so that it is amenable to a generalized conditional gradient method, which results in a practical improvement with low per iteration computational cost. Numerical results show that our approach significantly outperforms state-of-the-art competitors in terms of running time, while effectively recovering low rank solutions in stochastic system realization and spectral compressed sensing problems.", "full_text": "Ef\ufb01cient Structured Matrix Rank Minimization\n\nAdams Wei Yu\u2020, Wanli Ma\u2020, Yaoliang Yu\u2020, Jaime G. Carbonell\u2020, Suvrit Sra\u2021\n\n{weiyu, mawanli, yaoliang, jgc}@cs.cmu.edu, suvrit@tuebingen.mpg.de\n\nSchool of Computer Science, Carnegie Mellon University\u2020\n\nMax Planck Institute for Intelligent Systems\u2021\n\nAbstract\n\nWe study the problem of \ufb01nding structured low-rank matrices using nuclear norm\nregularization where the structure is encoded by a linear map. In contrast to most\nknown approaches for linearly structured rank minimization, we do not (a) use the\nfull SVD; nor (b) resort to augmented Lagrangian techniques; nor (c) solve linear\nsystems per iteration. Instead, we formulate the problem differently so that it is\namenable to a generalized conditional gradient method, which results in a practical\nimprovement with low per iteration computational cost. 
Numerical results show that our approach significantly outperforms state-of-the-art competitors in terms of running time, while effectively recovering low-rank solutions in stochastic system realization and spectral compressed sensing problems.

1 Introduction

Many practical tasks involve finding models that are both simple and capable of explaining noisy observations. The model complexity is sometimes encoded by the rank of a parameter matrix, whereas physical and system-level constraints could be encoded by a specific matrix structure. Thus, rank minimization subject to structural constraints has become important to many applications in machine learning, control theory, and signal processing [10, 22]. Applications include collaborative filtering [23], system identification and realization [19, 21], and multi-task learning [28], among others.

The focus of this paper is on problems where, in addition to being low-rank, the parameter matrix must satisfy additional linear structure. Typically, this structure involves Hankel, Toeplitz, Sylvester, Hessenberg or circulant matrices [4, 11, 19]. The linear structure describes interdependencies between the entries of the estimated matrix and helps substantially reduce the degrees of freedom.

As a concrete example, consider a linear time-invariant (LTI) system where we are estimating the parameters of an autoregressive moving-average (ARMA) model. The order of this LTI system, i.e., the dimension of the latent state space, is equal to the rank of a Hankel matrix constructed from the process covariance [20]. A system of lower order, which is easier to design and analyze, is usually more desirable. The problem of minimum-order system approximation is essentially a structured matrix rank minimization problem. There are several other applications where such linear structure is of great importance; see, e.g., [11] and references therein.
Furthermore, since (enhanced) structured matrix completion also falls into the category of rank minimization problems, the results in our paper can as well be applied to specific problems in spectral compressed sensing [6], natural language processing [1], computer vision [8] and medical imaging [24].

Formally, we study the following (block) structured rank minimization problem:

min_y  (1/2)‖A(y) − b‖_F² + μ · rank(Q_{m,n,j,k}(y)).   (1)

Here, y = (y_1, ..., y_{j+k−1}) is an m × n(j+k−1) matrix with y_t ∈ R^{m×n} for t = 1, ..., j+k−1, A : R^{m×n(j+k−1)} → R^p is a linear map, b ∈ R^p, Q_{m,n,j,k}(y) ∈ R^{mj×nk} is a structured matrix whose elements are linear functions of the y_t's, and μ > 0 controls the regularization. Throughout this paper, we will use M = mj and N = nk to denote the number of rows and columns of Q_{m,n,j,k}(y).

Problem (1) is in general NP-hard [21] due to the presence of the rank function. A popular approach to address this issue is to use the nuclear norm ‖·‖_*, i.e., the sum of singular values, as a convex surrogate for matrix rank [22]. Doing so turns (1) into a convex optimization problem:

min_y  (1/2)‖A(y) − b‖_F² + μ‖Q_{m,n,j,k}(y)‖_*.   (2)

Such a relaxation has been combined with various convex optimization procedures in previous work, e.g., interior-point approaches [17, 18] and first-order alternating direction method of multipliers (ADMM) approaches [11]. However, such algorithms are computationally expensive. The cost per iteration of an interior-point method is no less than O(M²N²), and that of typical proximal and ADMM-style first-order methods in [11] is O(min(N²M, NM²)); this high cost arises from each iteration requiring a full singular value decomposition (SVD). The heavy computational cost of these methods prevents them from scaling to large problems.

Contributions.
In view of the efficiency and scalability limitations of current algorithms, the key contributions of our paper are as follows.

• We formulate the structured rank minimization problem differently, so that we still find low-rank solutions consistent with the observations, but substantially more scalably.

• We customize the generalized conditional gradient (GCG) approach of Zhang et al. [27] to our new formulation. Compared with previous first-order methods, the cost per iteration is O(MN) (linear in the data size), which is substantially lower than methods that require full SVDs.

• Our approach maintains a convergence rate of O(1/ε) and thus achieves an overall complexity of O(MN/ε), which is by far the lowest in terms of the dependence on M or N for general structured rank minimization problems. It also empirically proves to be a state-of-the-art method for (but clearly not limited to) stochastic system realization and spectral compressed sensing.

We note that following a GCG scheme has another practical benefit: the rank of the intermediate solutions starts from a small value and then gradually increases, whereas the iterates produced by existing first-order methods are always of high rank. Therefore, GCG is likely to find a low-rank solution faster, especially for large problems.

Related work. Liu and Vandenberghe [17] adopt an interior-point method on a reformulation of (2), where the nuclear norm is represented via a semidefinite program. The cost of each iteration in [17] is no less than O(M²N²). Ishteva et al. [15] propose a local optimization method to solve the weighted structured rank minimization problem, which still has complexity as high as O(N³Mr²) per iteration, where r is the rank. This high computational cost prevents [17] and [15] from handling large-scale problems. In another recent work, Fazel et al.
[11] propose a framework to solve (2). They derive several primal and dual reformulations of the problem, and propose corresponding first-order methods such as ADMM, proximal point, and accelerated projected gradient. However, each iteration of these algorithms involves a full SVD of complexity O(min(M²N, N²M)), making it hard to scale them to large problems. Signoretto et al. [25] reformulate the problem to avoid full SVDs by solving an equivalent nonconvex optimization problem via ADMM. However, their method requires subroutines that solve linear equations in each iteration, which can be time-consuming for large problems. Besides, there is no guarantee that their method will converge to the global optimum.

The conditional gradient (CG) (a.k.a. Frank-Wolfe) method was proposed by Frank and Wolfe [12] to solve constrained problems. At each iteration, it first solves a subproblem that minimizes a linearized objective over a compact constraint set and then moves toward the minimizer of the cost function. CG is efficient as long as the linearized subproblem is easy to solve. Due to its simplicity and scalability, CG has recently witnessed a great surge of interest in the machine learning and optimization community [16]. In another recent strand of work, CG was extended to certain regularized (non-smooth) problems as well [3, 13, 27]. In the following, we will show how a generalized CG method can be adapted to solve the structured matrix rank minimization problem.

2 Problem Formulation and Approach

In this section we reformulate the structured rank minimization problem in a way that enables us to apply the generalized conditional gradient method, which we subsequently show to be much more efficient than existing approaches, both theoretically and experimentally.
Our starting point is that in most applications, we are interested in finding a "simple" model that is consistent with the observations; the problem formulation itself, such as (2), is only an intermediate means, hence it need not be fixed. In fact, when formulating our problem we can, and should, take computational concerns into account. We will demonstrate this point first.

2.1 Problem Reformulation

The major computational difficulty in problem (2) comes from the linear transformation Q_{m,n,j,k}(·) inside the trace norm regularizer. To begin with, we introduce a new matrix variable X ∈ R^{mj×nk} and remove the linear transformation by introducing the following linear constraint:

Q_{m,n,j,k}(y) = X.   (3)

For later use, we partition the matrix X into the block form

X := [ x_{11} x_{12} ··· x_{1k} ; x_{21} x_{22} ··· x_{2k} ; ... ; x_{j1} x_{j2} ··· x_{jk} ]  with x_{il} ∈ R^{m×n} for i = 1, ..., j and l = 1, ..., k.   (4)

We denote by x := vec(X) ∈ R^{mjk×n} the vector obtained by stacking the columns of X blockwise, and by X := mat(x) ∈ R^{mj×nk} the reverse operation. Since x and X are merely different reorderings of the same object, we will use them interchangeably to refer to the same object.

We observe that any linear (or, slightly more generally, affine) structure encoded by the linear transformation Q_{m,n,j,k}(·) translates to linear constraints on the elements of X (such as the sub-blocks in (4) satisfying, say, x_{12} = x_{21}), which can be represented as linear equations Bx = 0, with an appropriate matrix B that encodes the structure of Q. Similarly, the linear constraint in (3) that relates y and X, or equivalently x, can also be written as the linear constraint y = Cx for a suitable recovery matrix C. Details on constructing the matrices B and C can be found in the appendix.
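The paper defers the explicit construction of B and C to its appendix. As an illustration only, the following sketch builds them for the simplest scalar (m = n = 1) Hankel case, under our own assumptions of a column-major vec(·) and an averaging recovery map C; the paper's general block construction may differ in detail:

```python
import numpy as np

def hankel_from_y(y, j, k):
    """Scalar Hankel map Q_{1,1,j,k}: H[i, l] = y[i + l]."""
    return np.array([[y[i + l] for l in range(k)] for i in range(j)])

def structure_matrices(j, k):
    """Build B (structure constraints B x = 0) and C (recovery y = C x)
    for the scalar Hankel case, with x = vec(X) stacked column-wise."""
    idx = lambda i, l: l * j + i               # column-major position of X[i, l]
    rows = []
    for t in range(j + k - 1):                 # anti-diagonal t holds y_t
        cells = [(i, t - i) for i in range(j) if 0 <= t - i < k]
        for a, c in zip(cells, cells[1:]):     # chain equalities along the anti-diagonal
            r = np.zeros(j * k)
            r[idx(*a)], r[idx(*c)] = 1.0, -1.0
            rows.append(r)
    B = np.array(rows)
    C = np.zeros((j + k - 1, j * k))
    for t in range(j + k - 1):
        cells = [(i, t - i) for i in range(j) if 0 <= t - i < k]
        for c in cells:                        # average the entries on anti-diagonal t
            C[t, idx(*c)] = 1.0 / len(cells)
    return B, C
```

For any exactly Hankel X, vec(X) then satisfies B vec(X) = 0 and C vec(X) recovers y.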
Thus, we reformulate (2) as

min_{x ∈ R^{mjk×n}}  (1/2)‖A(Cx) − b‖_F² + μ‖X‖_*   (5)
s.t.  Bx = 0.   (6)

The new formulation (5) is still computationally inconvenient due to the linear constraint (6). We resolve this difficulty by applying the penalty method, i.e., by placing the linear constraint into the objective function after composing it with a penalty function such as the squared Frobenius norm:

min_{x ∈ R^{mjk×n}}  (1/2)‖A(Cx) − b‖_F² + (λ/2)‖Bx‖_F² + μ‖X‖_*.   (7)

Here λ > 0 is a penalty parameter that controls the inexactness of the linear constraint. In essence, we turn (5) into an unconstrained problem by giving up on satisfying the linear constraint exactly. We argue that this is a worthwhile trade-off because: (i) by letting λ → ∞ and following a homotopy scheme, the constraint can be satisfied asymptotically; (ii) if exactness of the linear constraint is truly desired, we can always post-process each iterate by projecting to the constraint manifold using C_proj (see appendix); (iii) as we will show shortly, the potential computational gains are significant, enabling us to solve problems at a scale that was not previously achievable. Therefore, in the sequel we focus on solving (7). After obtaining a solution x, we recover the original variable y through the linear relation y = Cx. As shown in our empirical studies (see Section 3), the resulting solution Q_{m,n,j,k}(y) indeed enjoys the desired low-rank property even with a moderate penalty parameter λ.
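Point (i) above can be checked on a toy problem. In the sketch below (sizes and data are arbitrary, and the nuclear-norm term is dropped so the minimizer of the remaining smooth terms has a closed form), the constraint residual ‖Bx‖ shrinks as the penalty parameter λ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, q = 20, 12, 6                        # illustrative sizes: observations, variables, constraints
A, C = rng.standard_normal((p, d)), np.eye(d)
B = rng.standard_normal((q, d))            # stands in for the structure constraints B x = 0
b = rng.standard_normal(p)

def smooth_minimizer(lam):
    """Minimize 0.5*||A C x - b||^2 + (lam/2)*||B x||^2 (nuclear norm dropped);
    the stationarity condition gives a linear system in x."""
    M = (A @ C).T @ (A @ C) + lam * B.T @ B
    return np.linalg.solve(M, (A @ C).T @ b)

# ||B x(lam)|| decreases as lam grows: the structure constraint is
# satisfied asymptotically under a homotopy scheme on lam.
residuals = [np.linalg.norm(B @ smooth_minimizer(lam)) for lam in (1e-2, 1e2, 1e6)]
```

Note that this toy solver uses a linear system purely for illustration; the algorithm of the paper deliberately avoids solving linear systems per iteration.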
We next present an efficient algorithm for solving (7).

2.2 The Generalized Conditional Gradient Algorithm

Observing that the first two terms in (7) are both continuously differentiable, we absorb them into a common term f and rewrite (7) in the more familiar compact form:

min_{X ∈ R^{mj×nk}}  φ(X) := f(X) + μ‖X‖_*,   (8)

which readily fits into the framework of the generalized conditional gradient (GCG) [3, 13, 27]. In short, at each iteration GCG successively linearizes the smooth function f, finds a descent direction by solving the (convex) subproblem

Z_k ∈ argmin_{‖Z‖_* ≤ 1} ⟨Z, ∇f(X_{k−1})⟩,   (9)

and then takes the convex combination X_k = (1 − η_k)X_{k−1} + η_k(α_k Z_k) with a suitable step size η_k and scaling factor α_k. Clearly, the efficiency of GCG heavily hinges on the efficacy of solving the subproblem (9). In our case, the minimal objective value is simply the (negated) matrix spectral norm of ∇f(X_{k−1}), and the minimizer can be chosen as the outer product of the top singular vector pair. Both can be computed in essentially linear time O(MN) using the Lanczos algorithm [7].

Algorithm 1 Generalized Conditional Gradient for Structured Matrix Rank Minimization
1: Initialize U_0, V_0;
2: for k = 1, 2, ... do
3:   (u_k, v_k) ← top singular vector pair of ∇f(U_{k−1}V_{k−1});
4:   set η_k ← 2/(k + 1), and θ_k by (13);
5:   U_init ← (√(1 − η_k) U_{k−1}, √θ_k u_k);  V_init ← (√(1 − η_k) V_{k−1}, √θ_k v_k);
6:   (U_k, V_k) ← argmin ψ(U, V) using initializer (U_init, V_init);
7: end for

To further accelerate the algorithm, we adopt the local search idea in [27], which is based on the variational form of the trace norm [26]:

‖X‖_* = (1/2) min { ‖U‖_F² + ‖V‖_F² : X = UV }.   (10)

The crucial observation is that (10) is separable and smooth in the factor matrices U and V, although not jointly convex.
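The subproblem (9) only needs the leading singular pair of the gradient. The paper uses the Lanczos algorithm [7]; as a simpler stand-in with the same O(MN) cost per matrix-vector product, a plain power-iteration sketch:

```python
import numpy as np

def top_singular_pair(G, iters=200, seed=0):
    """Power iteration for the leading singular pair of G. Each step costs
    O(MN) (two matrix-vector products), matching the per-iteration cost
    of the GCG subproblem (9)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(G.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = G @ v
        u /= np.linalg.norm(u)
        v = G.T @ u
        v /= np.linalg.norm(v)
    return u, v, u @ G @ v                 # left/right vectors and sigma_1

# The minimizer of <Z, grad f> over ||Z||_* <= 1 is the rank-1 atom
# Z = -u v^T, attaining the value -sigma_1 (minus the spectral norm).
```

With G standing for ∇f(X_{k−1}), the rank-1 atom −u vᵀ is exactly the descent direction Z_k used by Algorithm 1.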
We alternate between the GCG algorithm and the following nonconvex auxiliary problem, trying to get the best of both worlds:

min_{U,V} ψ(U, V),  where ψ(U, V) = f(UV) + (μ/2)(‖U‖_F² + ‖V‖_F²).   (11)

Since our smooth function f is quadratic, it is easy to carry out a line search for finding an appropriate α_k in the convex combination X_{k+1} = (1 − η_k)X_k + η_k(α_k Z_k) =: (1 − η_k)X_k + θ_k Z_k, where

θ_k = argmin_{θ ≥ 0} h_k(θ)   (12)

is the minimizer of the function (on θ ≥ 0)

h_k(θ) := f((1 − η_k)X_k + θZ_k) + μ(1 − η_k)‖X_k‖_* + μθ.   (13)

In fact, h_k(θ) upper bounds the objective function at (1 − η_k)X_k + θZ_k. Indeed, using convexity,

φ((1 − η_k)X_k + θZ_k) = f((1 − η_k)X_k + θZ_k) + μ‖(1 − η_k)X_k + θZ_k‖_*
  ≤ f((1 − η_k)X_k + θZ_k) + μ(1 − η_k)‖X_k‖_* + μθ‖Z_k‖_*
  ≤ f((1 − η_k)X_k + θZ_k) + μ(1 − η_k)‖X_k‖_* + μθ   (as ‖Z_k‖_* ≤ 1)
  = h_k(θ).

The reason for using the upper bound h_k(θ), instead of the true objective φ((1 − η_k)X_k + θZ_k), is to avoid evaluating the trace norm, which can be quite expensive. More generally, if f is not quadratic, we can use the quadratic upper bound suggested by the Taylor expansion. It is clear that θ_k in (12) can be computed in closed form.

We summarize our procedure in Algorithm 1. Importantly, the algorithm explicitly maintains a low-rank factorization X = UV throughout the iterations. In fact, we never need to form the product X, which is crucial for reducing the memory footprint in large applications. The maintained low-rank factorization also allows us to evaluate the gradient and its spectral norm more efficiently, by carefully arranging the multiplication order.
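For a quadratic smooth term, the closed form for θ_k in (12) follows from one directional derivative and one curvature term. A sketch for the simplest case f(X) = ½‖A vec(X) − b‖², where A is a dense matrix standing in for the composed linear map and X0 plays the role of the already-damped iterate (1 − η_k)X_k (these names are ours, for illustration):

```python
import numpy as np

def theta_closed_form(Aop, b, X0, Z, mu):
    """Closed-form minimizer over theta >= 0 of
        h(theta) = 0.5 * ||Aop @ vec(X0 + theta*Z) - b||^2 + mu * theta + const,
    i.e. (13) with a quadratic smooth term. Aop acts on row-major vec(X)."""
    x0, z = X0.ravel(), Z.ravel()
    r = Aop @ x0 - b                       # current residual
    g = (Aop @ z) @ r                      # directional derivative <grad f, Z>
    c = np.linalg.norm(Aop @ z) ** 2       # curvature along the direction Z
    return max(0.0, -(g + mu) / c)         # minimizer of a 1-D quadratic, clipped at 0
```

Since h_k is a one-dimensional convex quadratic plus a linear term, the clipped stationary point is its exact minimizer over θ ≥ 0.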
Finally, we remark that we need not wait until the auxiliary problem (11) is fully solved; we can abort this local procedure whenever the gained improvement does not match the devoted computation. For the convergence guarantee we establish in Theorem 1 below, only the descent property ψ(U_k V_k) ≤ ψ(U_{k−1} V_{k−1}) is needed. This requirement can be easily verified by evaluating ψ, which, unlike the original objective φ, is computationally cheap.

2.3 Convergence analysis

Having presented the generalized conditional gradient algorithm for our structured rank minimization problem, we now analyze its convergence. We need the following standard assumption.

Assumption 1 There exist some norm ‖·‖ and some constant L > 0 such that for all A, B ∈ R^{N×M} and η ∈ (0, 1), we have

f((1 − η)A + ηB) ≤ f(A) + η⟨B − A, ∇f(A)⟩ + (Lη²/2)‖B − A‖².

Most standard loss functions, such as the quadratic loss we use in this paper, satisfy Assumption 1. We are ready to state the convergence property of Algorithm 1 in the following theorem. To make the paper self-contained, we also reproduce the proof in the appendix.

Theorem 1 Let Assumption 1 hold, let X̄ be arbitrary, and let X_k be the k-th iterate of Algorithm 1 applied to problem (7). Then we have

φ(X_k) − φ(X̄) ≤ 2C/(k + 1),   (14)

where C is some problem-dependent absolute constant.

Thus for any given accuracy ε > 0, Algorithm 1 will output an ε-approximate (in the sense of function value) solution in at most O(1/ε) steps.

2.4 Comparison with existing approaches

We briefly compare the efficiency of Algorithm 1 with the state-of-the-art approaches; more thorough experimental comparisons are conducted in Section 3 below. The per-step complexity of our algorithm is dominated by the subproblem (9), which requires only the leading singular vector pair of the gradient.
Using the Lanczos algorithm, this costs O(MN) arithmetic operations [16], which is significantly cheaper than the O(min(M²N, N²M)) complexity of [11] (due to their need for a full SVD). Other approaches, such as [25] and [17], are even more costly.

3 Experiments

In this section, we present empirical results for our algorithms. Without loss of generality, we focus on two concrete structured rank minimization problems: (i) stochastic system realization (SSR); and (ii) 2-D spectral compressed sensing (SCS). The two problems involve minimizing the rank of two different structured matrices. For SSR, we compare different first-order methods to show the speedups offered by our algorithm. In the SCS problem, we show that our formulation generalizes to more complicated linear structures and effectively recovers unobserved signals.

3.1 Stochastic System Realization

Model. The SSR problem aims to find a minimal-order autoregressive moving-average (ARMA) model, given observations of noisy system output [11]. As a discrete linear time-invariant (LTI) system, an ARMA process can be represented by the state-space model

s_{t+1} = D s_t + E u_t,  z_t = F s_t + u_t,  t = 1, 2, ..., T,   (15)

where s_t ∈ R^r is the hidden state variable, u_t ∈ R^n is driving white noise with covariance matrix G, and z_t ∈ R^n is the system output observable at time t. It has been shown in [20] that the system order r equals the rank of the block-Hankel matrix (see the appendix for the definition) constructed from the exact process covariances y_i = E(z_t z_{t+i}^T), provided that the number of blocks per column, j, is larger than the actual system order. Determining the rank r is the key to the whole problem; once it is known, the parameters D, E, F, G can be computed easily [17, 20]. Therefore, finding a low-order system is equivalent to minimizing the rank of the Hankel matrix above, while remaining consistent with the observations.

Setup.
The meaning of the following parameters can be found in the text after Eq. (1). We follow the experimental setup of [11]. Here, m = n, p = n × n(j + k − 1), while v = (v_1, v_2, ..., v_{j+k−1}) denotes the empirical process covariance, calculated as v_i = (1/T) Σ_{t=1}^{T−i} z_{t+i} z_t^T for 1 ≤ i ≤ k and 0 otherwise. Let w = (w_1, w_2, ..., w_{j+k−1}) be the observation matrix, where the w_i are all 1's for 1 ≤ i ≤ k, indicating that the whole block v_i is observed, and all 0's otherwise (for unobserved blocks). Finally, A(y) = vec(w ∘ y), b = vec(w ∘ v), and Q(y) = H_{n,n,j,k}(y), where ∘ is the element-wise product and H_{n,n,j,k}(·) is the Hankel matrix (see the appendix for the corresponding B and C).

Data generation. Each entry of the matrices D ∈ R^{r×r}, E ∈ R^{r×n}, F ∈ R^{n×r} is sampled from a Gaussian distribution N(0, 1); the matrices are then normalized to have unit nuclear norm. The initial state vector s_0 is drawn from N(0, I_r) and the input white noise u_t from N(0, I_n). The measurement noise is modeled by adding a σξ term to the output z_t, so the actual observation is ẑ_t = z_t + σξ, where each entry of ξ ∈ R^n is standard Gaussian noise and σ is the noise level. Throughout this experiment, we set T = 1000, σ = 0.05, the maximum iteration limit as 100, and the stopping criterion as ‖x_{k+1} − x_k‖_F < 10^{−3} or |φ_{k+1} − φ_k| / |min(φ_{k+1}, φ_k)| < 10^{−3}. The initial iterate is a matrix of all ones.

Algorithms. We compare our approach with the state-of-the-art competitors, i.e., the first-order methods proposed in [11]. Other methods, such as those in [15, 17, 25], suffer heavier computational cost per iteration and are thus omitted from the comparison. Fazel et al. [11] solve either the primal or dual form of problem (2), using primal ADMM (PADMM), a variant of primal ADMM (PADMM2), a variant of dual ADMM (DADMM2), and a dual proximal point algorithm (DPPA). As for solving (7), we implemented the generalized conditional gradient (GCG) and its local search variant (GCGLS). We also implemented accelerated projected gradient with singular value thresholding (APG-SVT) to solve (8) by adopting the FISTA [2] scheme. To fairly compare both lines of methods for the different formulations, in each iteration we track their objective values, the squared loss (1/2)‖A(Cx) − b‖_F² (or (1/2)‖A(y) − b‖_F²), and the rank of the Hankel matrix H_{m,n,j,k}(y). Since the squared loss measures how well the model fits the observations, and the Hankel matrix rank approximates the system order, comparing these quantities across methods is meaningful.

Result 1: Efficiency and Scalability. We compare the performance of the different methods on two problem sizes, with results shown in Figure 2. The most important observation is that our approaches GCGLS/GCG significantly outperform the remaining competitors in terms of running time. It is easy to see from Figures 2(a) and 2(b) that both the objective value and the squared loss of GCGLS/GCG drop drastically within a few seconds, at least one order of magnitude faster than the runner-up competitor (DPPA) in reaching a stable stage. The rest of the baseline methods cannot even approach the minimum values achieved by GCGLS/GCG within the iteration limit. Figures 2(d) and 2(e) show that this advantage is amplified as the size increases, which is consistent with the theoretical finding. Not surprisingly, the competitors become even slower as the problem size continues to grow. Hence, we only test the scalability of our approach on larger problems, with running times reported in Figure 1. We can see that the running time of GCGLS grows linearly w.r.t. the size MN, again consistent with the previous analysis.

Result 2: Rank of solution.
We also report the rank of H_{n,n,j,k}(y) versus the running time in Figures 2(c) and 2(f), where y = Cx if we solve (2), or y comes directly from the solution of (7). The rank is computed as the number of singular values larger than 10^{−3}. For GCGLS/GCG, the iterate starts from a low-rank estimate and then gradually approaches the true one. For the other competitors, however, the iterate first jumps to a full-rank matrix and the rank of later iterates drops gradually. Given that the solution is intrinsically of low rank, GCGLS/GCG will probably find the desired one more efficiently. In view of this, the working memory of GCGLS is usually much smaller than that of the competitors, as it uses two low-rank matrices U, V to represent, but never materialize, the solution until necessary.

Figure 1: Scalability of GCGLS and GCG. The size (M, N) is labeled, from (2050, 10000) up to (8200, 40000); run time grows linearly with the matrix size MN.

3.2 Spectral Compressed Sensing

In this part we apply our formulation and algorithm to another application, spectral compressed sensing (SCS), a technique that has by now been widely used in digital signal processing applications [6, 9, 29]. We show in particular that our reformulation (7) can effectively and rapidly recover partially observed signals.

Figure 2: Stochastic System Realization problem with j = 21, k = 100, r = 10, μ = 1.5 for formulation (2) and μ = 0.1 for (7). Panels: (a) objective vs. time, (b) squared loss vs. time, (c) rank of Hankel(y) vs. time for the case M = 420, N = 2000, n = m = 20; (d)-(f) the same quantities for the case M = 840, N = 4000, n = m = 40.

Model. The problem of spectral compressed sensing aims to recover a frequency-sparse signal from a small number of observations. The 2-D signal Y(k, l), 0 < k ≤ n_1, 0 < l ≤ n_2, is supposed to be the superposition of r 2-D sinusoids of arbitrary frequencies, i.e.
(in the DFT form)

Y(k, l) = Σ_{i=1}^{r} d_i e^{j2π(k f_{1i} + l f_{2i})} = Σ_{i=1}^{r} d_i (e^{j2π f_{1i}})^k (e^{j2π f_{2i}})^l,   (16)

where d_i is the amplitude of the i-th sinusoid and (f_{1i}, f_{2i}) is its frequency.

Inspired by the conventional matrix pencil method [14] for estimating the frequencies of sinusoidal signals or complex sinusoidal (damped) signals, the authors in [6] propose to arrange the observed data into a 2-fold Hankel matrix whose rank is bounded above by r, and formulate the 2-D spectral compressed sensing problem as a rank minimization problem with respect to the 2-fold Hankel structure. This 2-fold structure is also a linear structure, as we explain in the appendix. Given limited observations, this problem can be viewed as a matrix completion problem that recovers a low-rank matrix from partially observed entries while preserving the predefined linear structure. The trace norm heuristic for rank(·) is again used here, as it is proved in [5] to be an exact method for matrix completion provided that the number of observed entries satisfies the corresponding information-theoretic bound.

Setup. Given a partially observed signal Y with Ω as the observation index set, we adopt the formulation (7) and thus aim to solve the following problem:

min_{X ∈ R^{M×N}}  (1/2)‖P_Ω(mat(Cx)) − P_Ω(Y)‖_F² + (λ/2)‖Bx‖_F² + μ‖X‖_*,   (17)

where x = vec(X) and mat(·) is the inverse of the vectorization operator on Y. In this context, as before, A = P_Ω and b = P_Ω(Y), where P_Ω(Y) keeps only the entries of Y in the index set Ω and zeroes out the others; Q(Y) = H^{(2)}_{k1,k2}(Y) is the two-fold Hankel matrix, and the corresponding B and C can be found in the appendix, encoding H^{(2)}_{k1,k2}(Y) = X. Further, the matrix size here is M = k_1 k_2, N = (n_1 − k_1 + 1)(n_2 − k_2 + 1).

Algorithm.
We apply our generalized conditional gradient method with local search (GCGLS) to solve the spectral compressed sensing problem, using the reformulation discussed above. Following the experimental setup of [6], we generate a ground-truth data matrix Y ∈ R^{101×101} through a superposition of r = 6 2-D sinusoids, randomly reveal 20% of the entries, and add i.i.d. Gaussian noise with amplitude signal-to-noise ratio 10.

Figure 3: Spectral Compressed Sensing problem with parameters n_1 = n_2 = 101, r = 6, solved with our GCGLS algorithm using k_1 = k_2 = 8, μ = 0.1. Panels: (a) true 2-D sinusoidal signal, (b) observed entries, (c) recovered signal; (d) observed signal on column 1, (e) recovered signal on column 1. The 2-D signals in the first row are colored by the jet colormap; the second row shows the 1-D signal extracted from the first column of the data matrix.

Result. The results on the SCS problem are shown in Figure 3. The generated true 2-D signal Y is shown in Figure 3(a) using the jet colormap. The 20% observed entries of Y are shown in Figure 3(b), where the white entries are unobserved. The signal recovered by our GCGLS algorithm is shown in Figure 3(c). Comparing with the true signal in Figure 3(a), we can see that the result of our GCGLS algorithm is very close to the truth.
To demonstrate the result more clearly, we extract a single column as a 1-D signal for further inspection. Figure 3(d) plots the original signal (blue line) as well as the observed entries (red dots), both from the first column of the 2-D signal. In Figure 3(e), the recovered signal is represented by the red dashed curve. It matches the original signal over a significantly large portion, showing the success of our method in recovering partially observed 2-D signals under noise. Since the 2-fold structure used in this experiment is more complicated than that of the previous SSR task, this experiment further validates our algorithm on more complicated problems.

4 Conclusion

In this paper, we address the structured matrix rank minimization problem. We first reformulate the problem so that it is amenable to the generalized conditional gradient method. By doing so, we are able to achieve a per-iteration complexity of O(MN) with a convergence rate of O(1/ε); the resulting overall complexity of O(MN/ε) is by far the lowest among state-of-the-art methods for the structured matrix rank minimization problem. Our empirical studies on stochastic system realization and spectral compressed sensing further confirm the efficiency of the algorithm and the effectiveness of our reformulation.

References

[1] B. Balle and M. Mohri. Spectral learning of general weighted automata via constrained matrix completion. In NIPS, pages 2168–2176, 2012.

[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.

[3] K. Bredies, D. A. Lorenz, and P. Maass. A generalized conditional gradient method and its connection to an iterative shrinkage method. Computational Optimization and Applications, 42(2):173–193, 2009.

[4] J. A. Cadzow. Signal enhancement: A composite property mapping algorithm.
IEEE Transactions on Acoustics, Speech and Signal Processing, pages 39–62, 1988.

[5] E. J. Candès and T. Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.

[6] Y. Chen and Y. Chi. Spectral compressed sensing via structured matrix completion. In ICML, pages 414–422, 2013.

[7] J. K. Cullum and R. A. Willoughby. Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Vol. 1. Elsevier, 2002.

[8] T. Ding, M. Sznaier, and O. I. Camps. A rank minimization approach to video inpainting. In ICCV, pages 1–8, 2007.

[9] M. F. Duarte and R. G. Baraniuk. Spectral compressive sensing. Applied and Computational Harmonic Analysis, 35(1):111–129, 2013.

[10] M. Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.

[11] M. Fazel, T. K. Pong, D. Sun, and P. Tseng. Hankel matrix rank minimization with applications to system identification and realization. SIAM J. Matrix Analysis Applications, 34(3):946–977, 2013.

[12] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.

[13] Z. Harchaoui, A. Juditsky, and A. Nemirovski. Conditional gradient algorithms for machine learning. In NIPS Workshop on Optimization for ML, 2012.

[14] Y. Hua. Estimating two-dimensional frequencies by matrix enhancement and matrix pencil. IEEE Transactions on Signal Processing, 40(9):2267–2280, 1992.

[15] M. Ishteva, K. Usevich, and I. Markovsky. Factorization approach to structured low-rank approximation with applications. SIAM J. Matrix Analysis Applications, 35(3):1180–1204, 2014.

[16] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, pages 427–435, 2013.

[17] Z. Liu and L. Vandenberghe. Semidefinite programming methods for system realization and identification. In CDC, pages 4676–4681, 2009.

[18] Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM J. Matrix Analysis Applications, 31(3):1235–1256, 2009.

[19] Z. Liu, A. Hansson, and L. Vandenberghe. Nuclear norm system identification with missing inputs and outputs. Systems & Control Letters, 62(8):605–612, 2013.

[20] J. Mari, P. Stoica, and T. McKelvey. Vector ARMA estimation: a reliable subspace approach. IEEE Transactions on Signal Processing, 48(7):2092–2104, 2000.

[21] I. Markovsky. Structured low-rank approximation and its applications. Automatica, 44(4):891–909, 2008.

[22] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

[23] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, pages 713–719, 2005.

[24] P. J. Shin, P. E. Larson, M. A. Ohliger, M. Elad, J. M. Pauly, D. B. Vigneron, and M. Lustig. Calibrationless parallel imaging reconstruction based on structured low-rank matrix completion. Magnetic Resonance in Medicine, 2013.

[25] M. Signoretto, V. Cevher, and J. A. Suykens. An SVD-free approach to a class of structured low rank matrix optimization problems with application to system identification. Technical report 13-44, K.U. Leuven, 2013.

[26] N. Srebro, J. D. M. Rennie, and T. Jaakkola. Maximum-margin matrix factorization. In NIPS, 2004.

[27] X. Zhang, Y. Yu, and D. Schuurmans. Accelerated training for matrix-norm regularization: A boosting approach. In NIPS, pages 2915–2923, 2012.

[28] J. Zhou, J. Chen, and J. Ye. Multi-task learning: theory, algorithms, and applications. SIAM Data Mining Tutorial, 2012.

[29] X. Zhu and M. Rabbat. Graph spectral compressed sensing. Technical report, McGill University, 2011.