{"title": "Orthogonal NMF through Subspace Exploration", "book": "Advances in Neural Information Processing Systems", "page_first": 343, "page_last": 351, "abstract": "Orthogonal Nonnegative Matrix Factorization {(ONMF)} aims to approximate a nonnegative matrix as the product of two $k$-dimensional nonnegative factors, one of which has orthonormal columns. It yields potentially useful data representations as superposition of disjoint parts, while it has been shown to work well for clustering tasks where traditional methods underperform. Existing algorithms rely mostly on heuristics, which despite their good empirical performance, lack provable performance guarantees.We present a new ONMF algorithm with provable approximation guarantees.For any constant dimension~$k$, we obtain an additive EPTAS without any assumptions on the input. Our algorithm relies on a novel approximation to the related Nonnegative Principal Component Analysis (NNPCA) problem; given an arbitrary data matrix, NNPCA seeks $k$ nonnegative components that jointly capture most of the variance. Our NNPCA algorithm is of independent interest and generalizes previous work that could only obtain guarantees for a single component. We evaluate our algorithms on several real and synthetic datasets and show that their performance matches or outperforms the state of the art.", "full_text": "Orthogonal NMF through Subspace Exploration\n\nMegasthenis Asteris\n\nThe University of Texas at Austin\n\nmegas@utexas.edu\n\nDimitris Papailiopoulos\n\nUniversity of California, Berkeley\ndimitrisp@berkeley.edu\n\nAlexandros G. 
Dimakis
The University of Texas at Austin
dimakis@austin.utexas.edu

Abstract

Orthogonal Nonnegative Matrix Factorization (ONMF) aims to approximate a nonnegative matrix as the product of two k-dimensional nonnegative factors, one of which has orthonormal columns. It yields potentially useful data representations as superpositions of disjoint parts, while it has been shown to work well for clustering tasks where traditional methods underperform. Existing algorithms rely mostly on heuristics, which despite their good empirical performance, lack provable performance guarantees. We present a new ONMF algorithm with provable approximation guarantees. For any constant dimension k, we obtain an additive EPTAS without any assumptions on the input. Our algorithm relies on a novel approximation to the related Nonnegative Principal Component Analysis (NNPCA) problem; given an arbitrary data matrix, NNPCA seeks k nonnegative components that jointly capture most of the variance.
Our NNPCA algorithm is of independent interest and generalizes previous work that could only obtain guarantees for a single component. We evaluate our algorithms on several real and synthetic datasets and show that their performance matches or outperforms the state of the art.

1 Introduction

Orthogonal NMF  The success of Nonnegative Matrix Factorization (NMF) in a range of disciplines spanning data mining, chemometrics, signal processing and more, has driven an extensive practical and theoretical study [1, 2, 3, 4, 5, 6, 7, 8]. Its power lies in its potential to generate meaningful decompositions of data into non-subtractive combinations of a few nonnegative parts.

Orthogonal NMF (ONMF) [9] is a variant of NMF with an additional orthogonality constraint: given a real nonnegative m × n matrix M and a target dimension k, typically much smaller than m and n, we seek to approximate M by the product of an m × k nonnegative matrix W with orthogonal (w.l.o.g., orthonormal) columns, and an n × k nonnegative matrix H. In the form of an optimization,

(ONMF)    $E_\star \triangleq \min_{W \ge 0,\, H \ge 0,\, W^\top W = I_k} \|M - WH^\top\|_F^2$.    (1)

Since W is nonnegative, its columns are orthogonal if and only if they have disjoint supports. In turn, each row of M is approximated by a scaled version of a single (transposed) column of H.

Despite the admittedly limited representational power compared to NMF, ONMF yields sparser part-based representations that are potentially easier to interpret, while it naturally lends itself to certain applications. In a clustering setting, for example, W serves as a cluster membership matrix and the columns of H correspond to k cluster centroids [9, 10, 11]. Empirical evidence shows that ONMF performs remarkably well in certain clustering tasks, such as document classification [6, 11, 12, 13, 14, 15].
In the analysis of textual data, where M is a words-by-documents matrix, the orthogonal columns of W can be interpreted as topics defined by disjoint subsets of words. In the case of an image dataset, with each column of M corresponding to an image evaluated on multiple pixels, each of the orthogonal base vectors highlights a disjoint segment of the image area.

Nonnegative PCA  For any given factor W ≥ 0 with orthonormal columns, the second ONMF factor H is readily determined: H = M⊤W ≥ 0. This follows from the fact that M is by assumption nonnegative. Based on the above, it can be shown that the ONMF problem (1) is equivalent to

(NNPCA)    $V_\star \triangleq \max_{W \in \mathcal{W}_k} \|M^\top W\|_F^2$,  where  $\mathcal{W}_k \triangleq \{W \in \mathbb{R}^{m \times k} : W \ge 0,\ W^\top W = I_k\}$.    (2)

For arbitrary (i.e., not necessarily nonnegative) matrices M, the non-convex maximization (2) coincides with the Nonnegative Principal Component Analysis (NNPCA) problem [16]. Similarly to vanilla PCA, NNPCA seeks k orthogonal components that jointly capture most of the variance of the (centered) data in M. The nonzero entries of the extracted components, however, must be positive, which renders the problem NP-hard even in the case of a single component (k = 1) [17].

Our Contributions  We present a novel algorithm for NNPCA. Our algorithm approximates the solution to (2) for any real input matrix and is accompanied by global approximation guarantees. Using the above as a building block, we develop an algorithm to approximately solve the ONMF problem (1) on any nonnegative matrix. Our algorithm outputs a solution that strictly satisfies both the nonnegativity and the orthogonality constraints. Our main results are as follows:

Theorem 1.
(NNPCA) For any m × n matrix M, desired number of components k, and accuracy parameter ε ∈ (0, 1), our NNPCA algorithm computes W ∈ W_k such that

$\|M^\top W\|_F^2 \;\ge\; (1 - \epsilon) \cdot V_\star \;-\; k \cdot \sigma_{r+1}^2(M)$,

where σ_{r+1}(M) is the (r + 1)-th singular value of M, in time $T_{\mathrm{SVD}}(r) + O\big((\tfrac{1}{\epsilon})^{r \cdot k} \cdot k \cdot m\big)$.

Here, T_SVD(r) denotes the time required to compute a rank-r approximation M̄ of the input M using the truncated singular value decomposition (SVD). Our NNPCA algorithm operates on the low-rank matrix M̄. The parameter r controls a natural trade-off; higher values of r lead to tighter guarantees, but impact the running time of our algorithm. Finally, note that despite the exponential dependence on r and k, the complexity scales polynomially in the ambient dimension of the input.

If the input matrix M is nonnegative, as in any instance of the ONMF problem, we can compute an approximate orthogonal nonnegative factorization in two steps: first obtain an orthogonal factor W by (approximately) solving the NNPCA problem on M, and subsequently set H = M⊤W.

Theorem 2. (ONMF) For any m × n nonnegative matrix M, target dimension k, and desired accuracy ε ∈ (0, 1), our ONMF algorithm computes an ONMF pair W, H, such that

$\|M - WH^\top\|_F^2 \;\le\; E_\star + \epsilon \cdot \|M\|_F^2$,

in time $T_{\mathrm{SVD}}(\tfrac{k}{\epsilon}) + O\big((\tfrac{1}{\epsilon})^{k^2/\epsilon} \cdot k \cdot m\big)$.

For any constant dimension k, Theorem 2 implies an additive EPTAS for the relative ONMF approximation error. This is, to the best of our knowledge, the first general ONMF approximation guarantee, since we impose no assumptions on M beyond nonnegativity.

We evaluate our NNPCA and ONMF algorithms on synthetic and real datasets.
As we discuss in Section 4, for several cases we show improvements compared to the previous state of the art.

Related Work  ONMF as a variant of NMF first appeared implicitly in [18]. The formulation in (1) was introduced in [9]. Several algorithms in a subsequent line of work [12, 13, 19, 20, 21, 22] approximately solve variants of that optimization problem. Most rely on modifying approaches for NMF to accommodate the orthogonality constraint; either exploiting the additional structural properties in the objective [13], introducing a penalization term [9], or updating the current estimate in suitable directions [12], they typically reduce to a multiplicative update rule which attains orthogonality only in a limit sense. In [11], the authors suggest two alternative approaches: an EM algorithm motivated by connections to spherical k-means, and an augmented Lagrangian formulation that explicitly enforces orthogonality, but only achieves nonnegativity in the limit. Despite their good performance in practice, existing methods only guarantee local convergence.

A significant body of work [23, 24, 25, 26] has focused on Separable NMF, a variant of NMF partially related to ONMF. Sep. NMF seeks to decompose M into the product of two nonnegative matrices W and H⊤, where W contains a permutation of the k × k identity matrix. Intuitively, the geometric picture of Sep. NMF should be quite different from that of ONMF: in the former, the rows of H⊤ are the extreme rays of a convex cone enclosing all rows of M, while in the latter they should be scattered in the interior of that cone so that each row of M has one representative in small angular distance. Algebraically, ONMF factors approximately satisfy the structural requirement of Sep. NMF, but the converse is not true: a Sep. NMF solution is not a valid ONMF solution (Fig. 1).

Figure 1: ONMF and Separable NMF, upon appropriate permutation of the rows of M. In the first case, each row of M is approximated by a single row of H⊤, while in the second, by a nonnegative combination of all k rows of H⊤.

On the NNPCA front, nonnegativity as a constraint on PCA first appeared in [16], which proposed a coordinate-descent scheme on a penalized version of (2) to compute a set of nonnegative components. In [27], the authors developed a framework stemming from Expectation-Maximization (EM) on a generative model of PCA to compute a nonnegative (and optionally sparse) component. In [17], the authors proposed an algorithm based on sampling points from a low-dimensional subspace of the data covariance and projecting them on the nonnegative orthant. [27] and [17] focus on the single-component problem; multiple components can be computed sequentially, employing a heuristic deflation step. Our main theoretical result is a generalization of the analysis of [17] to multiple components. Finally, note that despite the connection between the two problems, existing algorithms for ONMF are not suitable for NNPCA, as they only operate on nonnegative matrices.

2 Algorithms and Guarantees

2.1 Overview

We first develop an algorithm to approximately solve the NNPCA problem (2) on any arbitrary (i.e., not necessarily nonnegative) m × n matrix M. The core idea is to solve the NNPCA problem not directly on M, but on a rank-r approximation M̄ instead. Our main technical contribution is a procedure that approximates the solution to the constrained maximization (2) on a rank-r matrix within a multiplicative factor arbitrarily close to 1, in time exponential in r, but polynomial in the dimensions of the input.
Our Low Rank NNPCA algorithm relies on generating a large number of candidate solutions, one of which provably achieves an objective value close to optimal.

The k nonnegative components W ∈ W_k returned by our Low Rank NNPCA algorithm on the sketch M̄ are used as a surrogate for the desired components of the original input M. Intuitively, the performance of the extracted nonnegative components depends on how well M is approximated by the low-rank sketch M̄; a higher-rank approximation leads to better results. However, the complexity of our low-rank solver depends exponentially on the rank of its input. A natural trade-off arises between the quality of the extracted components and the running time of our NNPCA algorithm.

Using our NNPCA algorithm as a building block, we propose a novel algorithm for the ONMF problem (1). In an ONMF instance, we are given an m × n nonnegative matrix M and a target dimension k < m, n, and seek to approximate M with a product WH⊤ of two nonnegative matrices, where W additionally has orthonormal columns. Computing such a factorization is equivalent to solving the NNPCA problem on the nonnegative matrix M. (See Appendix A.1 for a formal argument.) Once a nonnegative orthogonal factor W is obtained, the second ONMF factor is readily determined: H = M⊤W minimizes the Frobenius approximation error in (1) for a given W.
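Since H = M⊤W is the error-minimizing second factor for any fixed orthogonal W, the recovery step is a few lines of linear algebra. The sketch below is illustrative (the function name and toy data are our own, not from the paper); it also exploits the identity ‖M − WH⊤‖²_F = ‖M‖²_F − ‖M⊤W‖²_F, which holds whenever W⊤W = I_k:

```python
import numpy as np

def onmf_from_w(M, W):
    """Second ONMF factor for a fixed orthogonal factor W: H = M^T W.

    H is automatically nonnegative whenever M >= 0 and W >= 0."""
    H = M.T @ W
    err = np.linalg.norm(M - W @ H.T, "fro") ** 2
    return H, err

# Toy example: W has orthonormal columns with disjoint supports.
rng = np.random.default_rng(0)
M = rng.random((6, 4))            # nonnegative input
W = np.zeros((6, 2))
W[:3, 0] = 1 / np.sqrt(3)         # column 0 supported on rows {0,1,2}
W[3:, 1] = 1 / np.sqrt(3)         # column 1 supported on rows {3,4,5}
H, err = onmf_from_w(M, W)
```

With W fixed, the error decomposes as ‖M‖²_F − ‖M⊤W‖²_F, which is why minimizing the ONMF error over W is the same as maximizing the NNPCA objective (2).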
Under an appropriate configuration of the accuracy parameters, for any nonnegative m × n input M and constant target dimension k, our algorithm yields an additive EPTAS for the relative approximation error, without any additional assumptions on the input data.

2.2 Main Results

Low Rank NNPCA  We develop an algorithm to approximately solve the NNPCA problem on an m × n real rank-r matrix M̄:

$W_\star \triangleq \arg\max_{W \in \mathcal{W}_k} \|\bar{M}^\top W\|_F$.    (3)

Algorithm 1 LowRankNNPCA
input: real m × n rank-r matrix M̄, k, ε ∈ (0, 1)
output: W ∈ W_k ⊂ R^{m×k}
1: C ← {}      {Candidate solutions}
2: [U, Σ, V] ← SVD(M̄, r)      {Trunc. SVD}
3: for each C ∈ N_{ε/2}^{⊗k}(S_2^{r−1}) do      {See Lemma 1}
4:   A ← UΣC      {A ∈ R^{m×k}}
5:   Ŵ ← LocalOptW(A)      {Alg. 3}
6:   C ← C ∪ {Ŵ}
7: end for
8: W ← arg max_{W ∈ C} ‖M̄⊤W‖²_F

The procedure, which lies at the core of our subsequent developments, is encoded in Alg. 1. We describe it in detail in Section 3. The key observation is that irrespective of the dimensions of the input, the maximization in (3) can be reduced to k · r unknowns. The algorithm generates a large number of k-tuples of r-dimensional points; the collection of tuples is denoted by N_{ε/2}^{⊗k}(S_2^{r−1}), the kth Cartesian power of an ε/2-net of the r-dimensional unit sphere. Using these points, we effectively sample the column-space of the input M̄. Each tuple yields a feasible solution W ∈ W_k through a computationally efficient subroutine (Alg. 3). The best among those candidate solutions is provably close to the optimal W_⋆ with respect to the objective in (2). The approximation guarantees are formally established in the following lemma.

Lemma 1.
For any real m × n matrix M̄ with rank r, desired number of components k, and accuracy parameter ε ∈ (0, 1), Algorithm 1 outputs W ∈ W_k such that

$\|\bar{M}^\top W\|_F^2 \;\ge\; (1 - \epsilon) \cdot \|\bar{M}^\top W_\star\|_F^2$,

where W_⋆ is the optimal solution defined in (3), in time $T_{\mathrm{SVD}}(r) + O\big((\tfrac{2}{\epsilon})^{r \cdot k} \cdot k \cdot m\big)$.

Proof. (See Appendix A.2.)

Nonnegative PCA  Given an arbitrary real m × n matrix M, we can generate a rank-r sketch M̄ and solve the low-rank NNPCA problem on M̄ using Algorithm 1. The output W ∈ W_k of the low-rank problem can be used as a surrogate for the desired components of the original input M. For simplicity, here we consider the case where M̄ is the rank-r approximation of M obtained by the truncated SVD. Intuitively, the performance of the extracted components on the original data matrix M will depend on how well the latter is approximated by M̄, and in turn on the spectral decay of the input data. For example, if M exhibits a sharp spectral decay, which is frequently the case in real data, a moderate value of r suffices to obtain a good approximation. This leads to our first main theorem, which formally establishes the guarantees of our NNPCA algorithm.

Theorem 1. For any real m × n matrix M, let M̄ be its best rank-r approximation. Algorithm 1 with input M̄, and parameters k and ε ∈ (0, 1), outputs W ∈ W_k such that

$\|M^\top W\|_F^2 \;\ge\; (1 - \epsilon) \cdot \|M^\top W_\star\|_F^2 \;-\; k \cdot \|M - \bar{M}\|_2^2$,

where $W_\star \triangleq \arg\max_{W \in \mathcal{W}_k} \|M^\top W\|_F^2$, in time $T_{\mathrm{SVD}}(r) + O\big((\tfrac{1}{\epsilon})^{r \cdot k} \cdot k \cdot m\big)$.

Proof. The proof follows from Lemma 1.
It is formally provided in Appendix A.3.

Theorem 1 establishes a trade-off between the computational complexity of the proposed NNPCA approach and the tightness of the approximation guarantees; higher values of r imply a smaller ‖M − M̄‖²₂ and in turn a tighter bound (assuming that the singular values of M decay), but have an exponential impact on the running time. Despite the exponential dependence on r and k, our approach is polynomial in the dimensions of the input M, dominated by the truncated SVD.

In practice, Algorithm 1 can be terminated early, returning the best computed result at the time of termination, sacrificing the theoretical approximation guarantees. In Section 4 we empirically evaluate our algorithm on real datasets and demonstrate that even for small values of r, our NNPCA algorithm significantly outperforms existing approaches.

Orthogonal NMF  The NNPCA algorithm straightforwardly yields an algorithm for the ONMF problem (1). In an ONMF instance, the input matrix M is by assumption nonnegative. Given any m × k orthogonal nonnegative factor W, the optimal choice for the second factor is H = M⊤W. Hence, it suffices to determine W, which can be obtained by solving the NNPCA problem on M.

The proposed ONMF algorithm is outlined in Alg. 2. Given a nonnegative m × n matrix M, we first obtain a rank-r approximation M̄ via the truncated SVD, where r is an accuracy parameter. Using Alg. 1 on M̄, we compute an orthogonal nonnegative factor W ∈ W_k that approximately maximizes (3) within a desired accuracy. The second ONMF factor H is readily determined as described earlier.

Algorithm 2 ONMFS
input: m × n real M ≥ 0, r, k, ε ∈ (0, 1)
1: M̄ ← SVD(M, r)
2: W ← LowRankNNPCA(M̄, k, ε)      {Alg. 1}
3: H ← M⊤W
output: W, H

The accuracy parameter r once again controls a trade-off between the quality of the ONMF factors and the complexity of the algorithm. We note, however, that for any target dimension k and desired accuracy parameter ε, setting r = ⌈k/ε⌉ suffices to achieve an additive ε error on the relative approximation error of the ONMF problem. More formally,

Theorem 2. For any m × n real nonnegative matrix M, target dimension k, and desired accuracy ε ∈ (0, 1), Algorithm 2 with parameter r = ⌈k/ε⌉ outputs an ONMF pair W, H, such that

$\|M - WH^\top\|_F^2 \;\le\; E_\star + \epsilon \cdot \|M\|_F^2$,

in time $T_{\mathrm{SVD}}(\tfrac{k}{\epsilon}) + O\big((\tfrac{1}{\epsilon})^{k^2/\epsilon} \cdot k \cdot m\big)$.

Proof. (See Appendix A.4.)

Theorem 2 implies an additive EPTAS¹ for the relative approximation error in the ONMF problem for any constant target dimension k; Algorithm 2 runs in time polynomial in the dimensions of the input M. Finally, note that it does not require any assumption on M beyond nonnegativity.

3 The Low Rank NNPCA Algorithm

In this section, we revisit Alg. 1, which plays a central role in our developments, as it is the key piece of our NNPCA and in turn our ONMF algorithm. Alg. 1 approximately solves the NNPCA problem (3) on a rank-r, m × n matrix M̄. It operates by producing a large but tractable number of candidate solutions W ∈ W_k, and returns the one that maximizes the objective value in (2). In the sequel, we provide a brief description of the ideas behind the algorithm.

We are interested in approximately solving the low-rank NNPCA problem (3). Let M̄ = UΣV⊤ denote the truncated SVD of M̄.
For any W ∈ R^{m×k},

$\|\bar{M}^\top W\|_F^2 \;=\; \|\Sigma U^\top W\|_F^2 \;=\; \sum_{j=1}^{k} \|\Sigma U^\top w_j\|_2^2 \;=\; \sum_{j=1}^{k} \max_{c_j \in S_2^{r-1}} \langle w_j, U\Sigma c_j \rangle^2$,    (4)

where S_2^{r−1} denotes the r-dimensional ℓ2-unit sphere. Let C denote the r × k variable formed by stacking the unit-norm vectors c_j, j = 1, . . . , k. The key observation is that for a given C, we can efficiently compute a W ∈ W_k that maximizes the right-hand side of (4). The procedure for that task is outlined in Alg. 3. Hence, the NNPCA problem (3) is reduced to determining the optimal value of the low-dimensional variable C. But first, let us provide a brief description of Alg. 3.

¹ Additive EPTAS (Efficient Polynomial Time Approximation Scheme [28, 29]) refers to an algorithm that can approximate the solution of an optimization problem within an arbitrarily small additive error ε and has complexity that scales polynomially in the input size n, but possibly exponentially in 1/ε. An EPTAS is more efficient than a PTAS because it enforces a polynomial dependency on n for any ε, i.e., a running time f(1/ε) · p(n), where p(n) is polynomial. For example, a running time of O(n^{1/ε}) is considered a PTAS, but not an EPTAS.

For a fixed r × k matrix C, Algorithm 3 computes

$\widehat{W} \;\triangleq\; \arg\max_{W \in \mathcal{W}_k} \sum_{j=1}^{k} \langle w_j, a_j \rangle^2$,    (5)

where A ≜ UΣC.

Algorithm 3 LocalOptW
input: real m × k matrix A
output: Ŵ = arg max_{W ∈ W_k} Σ_{j=1}^{k} ⟨w_j, a_j⟩²
1: C_W ← {}
2: for each s ∈ {±1}^k do
3:   A′ ← A · diag(s)
4:   I_j ← {},  j = 1, . . . , k
5:   for i = 1, . . . , m do
6:     j⋆ ← arg max_j A′_{ij}
7:     if A′_{ij⋆} ≥ 0 then
8:       I_{j⋆} ← I_{j⋆} ∪ {i}
9:     end if
10:   end for
11:   W ← 0_{m×k}
12:   for j = 1, . . . , k do
13:     [w_j]_{I_j} ← [a′_j]_{I_j} / ‖[a′_j]_{I_j}‖
14:   end for
15:   C_W ← C_W ∪ {W}
16: end for
17: Ŵ ← arg max_{W ∈ C_W} Σ_{j=1}^{k} ⟨w_j, a_j⟩²

The challenge is to determine the support of the optimal solution Ŵ; if an oracle revealed the optimal supports I_j, j = 1, . . . , k, of its columns, then the exact value of the nonzero entries would be determined by the Cauchy-Schwarz inequality, and the contribution of the jth summand in (5) would be equal to Σ_{i∈I_j} A²_{ij}. Due to the nonnegativity constraints in W_k, the optimal support I_j of the jth column must contain indices corresponding to only nonnegative or only nonpositive entries of a_j, but not a combination of both. Algorithm 3 considers all 2^k possible sign combinations for the support sets implicitly, by solving (5) on all 2^k matrices A′ = A · diag(s), s ∈ {±1}^k. Hence, we may assume without loss of generality that all support sets correspond to nonnegative entries of A. Moreover, if index i ∈ [m] is assigned to I_j, then the contribution of the entire ith row of A to the objective is equal to A²_{ij}. Based on the above, Algorithm 3 constructs the collection of the support sets by assigning index i to I_j if and only if A_{ij} is nonnegative and the largest among the entries of the ith row of A. The algorithm runs in time² O(2^k · k · m) and guarantees that the output is the optimal solution to (5). A more formal analysis of Alg. 3 is provided in Appendix A.5.

Thus far, we have seen that any given value of C can be associated with a feasible solution W ∈ W_k via the maximization (5) and Alg. 3. If we could efficiently consider all possible values in the (continuous) domain of C, we would be able to recover the pair that maximizes (4) and, in turn, the optimal solution of (3). However, that is not possible.
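The support-assignment rule just described is simple to sketch in code. The snippet below is an illustrative reimplementation of the idea behind Alg. 3 (function name and toy usage are our own, not the authors' code); columns whose support comes out empty are simply left at zero:

```python
import itertools
import numpy as np

def local_opt_w(A):
    """Sketch of the LocalOptW idea: given A (m x k), find a nonnegative W
    with unit-norm, disjointly supported columns maximizing
    sum_j <w_j, a_j>^2 by trying all 2^k column-sign patterns."""
    m, k = A.shape
    best_W, best_val = np.zeros((m, k)), -np.inf
    for s in itertools.product([1, -1], repeat=k):
        Ap = A * np.asarray(s)              # flip column signs
        j_star = Ap.argmax(axis=1)          # assign each row to its largest entry
        W = np.zeros((m, k))
        for j in range(k):
            idx = np.where((j_star == j) & (Ap[:, j] >= 0))[0]
            if idx.size == 0:
                continue                    # empty support: column stays zero
            col = Ap[idx, j]
            nrm = np.linalg.norm(col)
            if nrm > 0:
                W[idx, j] = col / nrm       # Cauchy-Schwarz-optimal entries
        val = ((W * A).sum(axis=0) ** 2).sum()   # sum_j <w_j, a_j>^2
        if val > best_val:
            best_val, best_W = val, W
    return best_W
```

Because each row contributes to exactly one column, the resulting columns have disjoint supports, so any two nonzero columns are automatically orthogonal.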
Instead, we consider a fine discretization of the domain of C and settle for an approximate solution. In particular, let N_ε(S_2^{r−1}) denote a finite ε-net of the r-dimensional ℓ2-unit sphere; for any point in S_2^{r−1}, the net contains a point within distance ε from the former. (See Appendix C for the construction of such a net.) Further, let [N_ε(S_2^{r−1})]^{⊗k} denote the kth Cartesian power of the previous net; the latter is a collection of r × k matrices C. Alg. 1 operates on this collection: for each C it identifies a candidate solution W ∈ W_k via the maximization (5) using Algorithm 3. By the properties of the ε-nets, it can be shown that at least one of the computed candidate solutions must attain an objective value close to the optimal of (3).

The guarantees of Alg. 1 are formally established in Lemma 1. A detailed analysis of the algorithm is provided in the corresponding proof in Appendix A.2. This completes the description of our algorithmic developments.

4 Experimental Evaluation

NNPCA  We compare our NNPCA algorithm against three existing approaches: NSPCA [16], EM [27] and NNSPAN [17] on real datasets. NSPCA computes multiple nonnegative, but not necessarily orthogonal, components; a parameter α penalizes the overlap among their supports. We set a high penalty (α = 1e10) to promote orthogonality. EM and NNSPAN compute only a single nonnegative component. Multiple components are computed consecutively, interleaving an appropriate deflation step. To ensure orthogonality, the deflation step effectively zeroes out the variables used in previously extracted components. Finally, note that both the EM and NSPCA algorithms are randomly initialized. All depicted values are the best results over multiple random restarts. For our algorithm, we use a sketch of rank r = 4 of the (centered) input data.
Further, we apply an early termination criterion: execution is terminated if no improvement is observed over a number of consecutive iterations (samples). This can only hurt the performance of our algorithm.

² When used as a subroutine in Alg. 1, Alg. 3 can be simplified into an O(k · m) procedure (lines 4-14).

Figure 2: Cumulative variance captured by k nonnegative components; CBCL dataset [30]. In Fig. 2(a), we set k = 8 and plot the cumulative variance versus the number of components. EM and NNSPAN extract components greedily; first components achieve high value, but subsequent ones contribute less to the objective. Our algorithm jointly optimizes the k = 8 components, achieving a 59.95% improvement over the second best method. Fig. 2(b) depicts the cumulative variance for various values of k. We note the percentage improvement of our algorithm over the second best method.

CBCL Dataset. The CBCL dataset [30] contains 2429 gray-scale face images of 19 × 19 pixels each. It has been used in the evaluation of all three methods [16, 17, 27]. We extract k orthogonal nonnegative components using all methods and compare the total explained variance, i.e., the objective in (2). We note that the input data has been centered and is hence not nonnegative.

Fig.
2(a) depicts the cumulative explained variance versus the number of components for k = 8. EM and NNSPAN extract components greedily with a deflation step; the first component achieves high value, but subsequent ones contribute less to the total variance. On the contrary, our algorithm jointly optimizes the k = 8 components, achieving an approximately 60% increase in the total variance compared to the second best method. We repeat the experiment for k = 2, . . . , 8. Fig. 2(b) depicts the total variance captured by each method for each value of k. Our algorithm significantly outperforms the existing approaches.

Additional Datasets. We solve the NNPCA problem on various datasets obtained from [31]. We arbitrarily set the target number of components to k = 5 and configure our algorithm to use a rank-4 sketch of the input. Table 1 lists the total variance captured by the extracted components for each method. Our algorithm consistently outperforms the other approaches.

ONMF  We compare our algorithm with several state-of-the-art ONMF algorithms: i) the O-PNMF algorithm of [13] (for 1000 iterations), and ii) the more recent ONP-MF and iii) EM-ONMF algorithms of [11, 32] (for 1000 iterations). We also compare to clustering methods (namely, vanilla and spherical k-means), since such algorithms also yield an approximate ONMF.

Table 1: Total variance captured by k = 5 nonnegative components on various datasets [31]. For each dataset, we list (#samples × #variables) and the variance captured by each method; higher values are better. Our algorithm (labeled ONMFS) operates on a rank-4 sketch in all cases, and consistently achieves the best results. We note the percentage improvement over the second best method.

Dataset (#samples × #variables) | NSPCA | EM | NNSPAN | ONMFS
AMZN COM. REV (1500 × 10000) | 5.44e+01 | 7.32e+03 | 7.32e+03 | 7.86e+03 (+7.37%)
ARCENCE TRAIN (100 × 10000) | 4.96e+04 | 3.01e+07 | 3.00e+07 | 3.80e+07 (+26.7%)
ISOLET-5 (1559 × 617) | 5.83e−01 | 3.54e+01 | 3.55e+01 | 4.55e+01 (+28.03%)
LEUKEMIA (72 × 12582) | 3.02e+07 | 7.94e+09 | 8.02e+09 | 1.04e+10 (+29.57%)
MFEAT PIX (2000 × 240) | 2.00e+01 | 3.20e+02 | 3.25e+02 | 5.24e+02 (+61.17%)
LOW RES. SPEC. (531 × 100) | 3.98e+06 | 2.29e+08 | 2.29e+08 | 2.41e+08 (+5.34%)
BOW:KOS (3430 × 6906) | 4.96e−02 | 2.96e+01 | 3.00e+01 | 4.59e+01 (+52.95%)

Synthetic data. We generate a synthetic dataset as follows. We select five base vectors c_j, j = 1, . . . , 5, randomly and independently from the unit hypercube in 100 dimensions. Then, we generate data points x_i = a_i · c_j + p · n_i, for some j ∈ {1, . . . , 5}, where a_i ∼ U([0.1, 1]), n_i ∼ N(0, I), and p is a parameter controlling the noise variance. Any negative entries of x_i are set to zero.

We vary p in [10⁻², 1]. For each p value, we compute an approximate ONMF on 10 randomly generated datasets and measure the relative Frobenius approximation error. For the methods that involve random initialization, we run 10 averaging iterations per Monte-Carlo trial. Our algorithm is configured to operate on a rank-5 sketch. Figure 3 depicts the relative error achieved by each method (averaged over the random trials) versus the noise variance p.
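The generative process just described is easy to reproduce. The sketch below is our own illustration (the function name, sample count, and seed are our choices, not specified in the paper): it draws the base vectors from the unit hypercube, scales each sample, adds Gaussian noise, and clips negative entries to zero:

```python
import numpy as np

def make_synthetic(n_points=500, dim=100, k=5, p=0.1, seed=0):
    """Synthetic ONMF data: each sample is a random nonnegative scaling of
    one of k base vectors plus Gaussian noise, clipped at zero."""
    rng = np.random.default_rng(seed)
    C = rng.random((k, dim))                  # base vectors c_j from the unit hypercube
    labels = rng.integers(0, k, size=n_points)
    a = rng.uniform(0.1, 1.0, size=n_points)  # scalings a_i ~ U([0.1, 1])
    X = a[:, None] * C[labels] + p * rng.standard_normal((n_points, dim))
    return np.maximum(X, 0.0), labels         # clip negatives; keep cluster labels
```

Sweeping the noise parameter p over [1e-2, 1] with this generator reproduces the experimental setup described above.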
Our algorithm, labeled ONMFS, achieves competitive or higher accuracy for most values of p in this range.

Figure 3: Relative Frobenius approximation error on synthetic data, for K-means, spherical K-means, O-PNMF, ONP-MF, EM-ONMF, and our ONMFS. Data points (samples) are generated by randomly scaling and adding noise to one of five base points randomly selected from the unit hypercube in 100 dimensions. We run all ONMF methods with target dimension k = 5.

Real Datasets. We apply the ONMF algorithms to various nonnegative datasets obtained from [31]. We arbitrarily set the target number of components to k = 6. Table 2 lists the relative Frobenius approximation error achieved by each algorithm. We note that on the text datasets (e.g., Bag of Words [31]) we run the algorithms on the uncentered word-by-document matrix. Our algorithm performs competitively compared to the other methods.

5 Conclusions

We presented a novel algorithm for approximately solving the ONMF problem on a nonnegative matrix. Our algorithm relies on a new method for solving the NNPCA problem, which jointly optimizes multiple orthogonal nonnegative components and provably achieves an objective value close to optimal. Our ONMF algorithm is the first to be equipped with theoretical approximation guarantees; for a constant target dimension k, it yields an additive EPTAS for the relative approximation error. Empirical evaluation on synthetic and real datasets demonstrates that our algorithms outperform or match existing approaches on both problems.

Acknowledgments DP is generously supported by NSF awards CCF-1217058 and CCF-1116404 and MURI AFOSR grant 556016.
This research has been supported by NSF Grants CCF 1344179, 1344364, 1407278, 1422549 and ARO YIP W911NF-14-1-0258.

                                      K-MEANS   O-PNMF   ONP-MF   EM-ONMF   ONMFS
AMZN COM. REV   (10000×1500)          0.0547    0.1153   0.1153   0.0467    0.0462 (5)
ARCENCE TRAIN   (100×10000)           0.0837    -        0.1250   0.0856    0.0788 (4)
MFEAT PIX       (2000×240)            0.2489    0.2974   0.3074   0.2447    0.2615 (4)
PEMS TRAIN      (267×138672)          0.1441    0.1439   0.1380   0.1278    0.1283 (5)
BOW:KOS         (3430×6906)           0.8193    0.7692   0.7671   0.7671    0.7609 (4)
BOW:ENRON       (28102×39861)         0.9946    -        0.6728   0.7148    0.6540 (4)
BOW:NIPS        (1500×12419)          0.8137    0.7277   0.7277   0.7375    0.7252 (5)
BOW:NYTIMES     (102660×3·10^5)       -         -        0.9199   0.9238    0.9199 (5)

Table 2: ONMF approximation error on nonnegative datasets [31]. For each dataset, we list the size (#samples×#variables) and the relative Frobenius approximation error achieved by each method; lower values are better. We arbitrarily set the target dimension k = 6. Dashes (-) denote an invalid solution or non-convergence. For our method, we note in parentheses the approximation rank r used.

References

[1] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556-562, 2000.

[2] Gershon Buchsbaum and Orin Bloch. Color categories revealed by non-negative matrix factorization of Munsell color spectra. Vision Research, 42(5):559-563, 2002.

[3] Farial Shahnaz, Michael W. Berry, V. Paul Pauca, and Robert J. Plemmons. Document clustering using nonnegative matrix factorization. Information Processing & Management, 42(2):373-386, 2006.

[4] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756-2779, 2007.

[5] Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shun-ichi Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. John Wiley & Sons, 2009.

[6] Victor Bittorf, Benjamin Recht, Christopher Re, and Joel A. Tropp. Factoring nonnegative matrices with linear programs. In Advances in Neural Information Processing Systems, 25:1223-1231, 2012.

[7] Nicolas Gillis and Stephen A. Vavasis. Fast and robust recursive algorithms for separable nonnegative matrix factorization. arXiv preprint arXiv:1208.1237, 2012.

[8] K. Huang, N. D. Sidiropoulos, and A. Swami. NMF revisited: New uniqueness results and algorithms. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4524-4528. IEEE, 2013.

[9] Chris Ding, Tao Li, Wei Peng, and Haesun Park. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 126-135. ACM, 2006.

[10] Tao Li and Chris Ding. The relationships among various nonnegative matrix factorization methods for clustering. In Sixth IEEE International Conference on Data Mining (ICDM '06), pages 362-371. IEEE, 2006.

[11] Filippo Pompili, Nicolas Gillis, P.-A. Absil, and François Glineur. Two algorithms for orthogonal nonnegative matrix factorization with application to clustering. arXiv preprint arXiv:1201.0901, 2012.

[12] Seungjin Choi. Algorithms for orthogonal nonnegative matrix factorization. In IEEE International Joint Conference on Neural Networks (IJCNN 2008), pages 1828-1832. IEEE, 2008.

[13] Zhirong Yang and Erkki Oja. Linear and nonlinear projective nonnegative matrix factorization. IEEE Transactions on Neural Networks, 21(5):734-749, 2010.

[14] Da Kuang, Haesun Park, and Chris H. Q. Ding. Symmetric nonnegative matrix factorization for graph clustering. In SDM, volume 12, pages 106-117. SIAM, 2012.

[15] Zhirong Yang, Tele Hao, Onur Dikmen, Xi Chen, and Erkki Oja. Clustering by nonnegative matrix factorization using graph random walk. In Advances in Neural Information Processing Systems, pages 1088-1096, 2012.

[16] Ron Zass and Amnon Shashua. Nonnegative sparse PCA. In Advances in Neural Information Processing Systems 19, pages 1561-1568. MIT Press, Cambridge, MA, 2007.

[17] Megasthenis Asteris, Dimitris Papailiopoulos, and Alexandros Dimakis. Nonnegative sparse PCA with provable guarantees. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1728-1736, 2014.

[18] Zhijian Yuan and Erkki Oja. Projective nonnegative matrix factorization for image compression and feature extraction. In Image Analysis, pages 333-342. Springer, 2005.

[19] Hualiang Li, Tülay Adali, Wei Wang, Darren Emge, and Andrzej Cichocki. Non-negative matrix factorization with orthogonality constraints and its application to Raman spectroscopy. The Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 48(1-2):83-97, 2007.

[20] Bin Cao, Dou Shen, Jian-Tao Sun, Xuanhui Wang, Qiang Yang, and Zheng Chen. Detect and track latent factors with online nonnegative matrix factorization. In IJCAI, volume 7, pages 2689-2694, 2007.

[21] Xin Li, William K. W. Cheung, Jiming Liu, and Zhili Wu. A novel orthogonal NMF-based belief compression for POMDPs. In Proceedings of the 24th International Conference on Machine Learning, pages 537-544. ACM, 2007.

[22] Gang Chen, Fei Wang, and Changshui Zhang. Collaborative filtering using orthogonal nonnegative matrix tri-factorization. Information Processing & Management, 45(3):368-379, 2009.

[23] Nicolas Gillis and Stephen A. Vavasis. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.

[24] Sanjeev Arora, Rong Ge, Yoni Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. A practical algorithm for topic modeling with provable guarantees. arXiv preprint arXiv:1212.4777, 2012.

[25] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization - provably. In Proceedings of the 44th Symposium on Theory of Computing, pages 145-162, 2012.

[26] Abhishek Kumar, Vikas Sindhwani, and Prabhanjan Kambadur. Fast conical hull algorithms for near-separable non-negative matrix factorization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 231-239, 2013.

[27] Christian D. Sigg and Joachim M. Buhmann. Expectation-maximization for sparse and non-negative PCA. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), pages 960-967. ACM, New York, NY, USA, 2008.

[28] Liming Cai, Michael Fellows, David Juedes, and Frances Rosamond. On efficient polynomial-time approximation schemes for problems on planar structures. Journal of Computer and System Sciences, 2003.

[29] Marco Cesati and Luca Trevisan. On the efficiency of polynomial time approximation schemes. Information Processing Letters, 64(4):165-171, 1997.

[30] Kah-Kay Sung. Learning and Example Selection for Object and Pattern Recognition. PhD thesis, MIT, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Cambridge, MA, 1996.

[31] M. Lichman. UCI Machine Learning Repository, 2013.

[32] Filippo Pompili, Nicolas Gillis, Pierre-Antoine Absil, and François Glineur. ONP-MF: An orthogonal nonnegative matrix factorization algorithm with application to clustering. In ESANN 2013, 2013.

[33] K. Bache and M. Lichman. UCI Machine Learning Repository, 2013.