{"title": "Factoring nonnegative matrices with linear programs", "book": "Advances in Neural Information Processing Systems", "page_first": 1214, "page_last": 1222, "abstract": "This paper describes a new approach for computing nonnegative matrix factorizations (NMFs) with linear programming. The key idea is a data-driven model for the factorization, in which the most salient features in the data are used to express the remaining features.  More precisely, given a data matrix X, the algorithm identifies a matrix C that satisfies X = CX and some linear constraints.  The matrix C selects features, which are then used to compute a low-rank NMF of X.  A theoretical analysis demonstrates that this approach has the same type of guarantees as the recent NMF algorithm of Arora et al.~(2012).  In contrast with this earlier work, the proposed method has (1) better noise tolerance, (2) extends to more general noise models, and (3) leads to efficient, scalable algorithms.  Experiments with synthetic and real datasets provide evidence that the new approach is also superior in practice.  An optimized C++ implementation of the new algorithm can factor a multi-Gigabyte matrix in a matter of minutes.", "full_text": "Factoring nonnegative matrices with linear programs\n\nVictor Bittorf\n\nbittorf@cs.wisc.edu\n\nBenjamin Recht\n\nbrecht@cs.wisc.edu\n\nComputer Sciences\n\nUniversity of Wisconsin\n\nChristopher R\u00b4e\n\nchrisre@cs.wisc.edu\n\nJoel A. Tropp\n\nComputing and Mathematical Sciences\n\nCalifornia Institute of Technology\ntropp@cms.caltech.edu\n\nAbstract\n\nThis paper describes a new approach, based on linear programming, for com-\nputing nonnegative matrix factorizations (NMFs). The key idea is a data-driven\nmodel for the factorization where the most salient features in the data are used to\nexpress the remaining features. 
More precisely, given a data matrix X, the algorithm\nidentifies a matrix C that satisfies X \u2248 CX and some linear constraints.\nThe constraints are chosen to ensure that the matrix C selects features; these features\ncan then be used to find a low-rank NMF of X. A theoretical analysis\ndemonstrates that this approach has guarantees similar to those of the recent NMF\nalgorithm of Arora et al. (2012). In contrast with this earlier work, the proposed\nmethod extends to more general noise models and leads to efficient, scalable algorithms.\nExperiments with synthetic and real datasets provide evidence that the\nnew approach is also superior in practice. An optimized C++ implementation can\nfactor a multigigabyte matrix in a matter of minutes.\n\n1 Introduction\n\nNonnegative matrix factorization (NMF) is a popular approach for selecting features in data [16\u201318,\n23]. Many machine-learning and data-mining software packages (including Matlab [3], R [12], and\nOracle Data Mining [1]) now include heuristic computational methods for NMF. Nevertheless, we\nstill have limited theoretical understanding of when these heuristics are correct.\nThe difficulty in developing rigorous methods for NMF stems from the fact that the problem is\ncomputationally challenging. Indeed, Vavasis has shown that NMF is NP-Hard [27]; see [4] for\nfurther worst-case hardness results. As a consequence, we must impose additional assumptions on\nthe data if we hope to compute nonnegative matrix factorizations in practice.\nIn this spirit, Arora, Ge, Kannan, and Moitra (AGKM) have exhibited a polynomial-time algorithm\nfor NMF that is provably correct, provided that the data is drawn from an appropriate model, based\non ideas from [8]. The AGKM result describes one circumstance where we can be sure that NMF\nalgorithms are capable of producing meaningful answers. 
This work has the potential to make an\nimpact in machine learning because proper feature selection is an important preprocessing step for\nmany other techniques. Even so, the actual impact is dampened by the fact that the AGKM algorithm\nis too computationally expensive for large-scale problems and is not tolerant of departures from the\nmodeling assumptions. Thus, for NMF, there remains a gap between the theoretical exercise and the\nactual practice of machine learning.\nThis paper presents a scalable, robust algorithm that can successfully solve the NMF problem\nunder appropriate hypotheses. Our first contribution is a new formulation of the nonnegative feature\nselection problem that only requires the solution of a single linear program. Second, we provide\na theoretical analysis of this algorithm. This argument shows that our method succeeds under the\nsame modeling assumptions as the AGKM algorithm with an additional margin constraint that is\ncommon in machine learning. We prove that if there exists a unique, well-defined model, then we\ncan recover this model accurately; our error bound improves substantially on the error bound for\nthe AGKM algorithm in the high SNR regime. One may argue that NMF only \u201cmakes sense\u201d (i.e.,\nis well posed) when a unique solution exists, and so we believe our result has independent interest.\nFurthermore, our algorithm can be adapted for a wide class of noise models.\nIn addition to these theoretical contributions, our work also includes a major algorithmic and experimental\ncomponent. Our formulation of NMF allows us to exploit methods from operations research\nand database systems to design solvers that scale to extremely large datasets. We develop an efficient\nstochastic gradient descent (SGD) algorithm that is (at least) two orders of magnitude faster than the\napproach of AGKM when both are implemented in Matlab. 
We describe a parallel implementation\nof our SGD algorithm that can robustly factor matrices with 10^5 features and 10^6 examples in a few\nminutes on a multicore workstation.\nOur formulation of NMF uses a data-driven modeling approach to simplify the factorization problem.\nMore precisely, we search for a small collection of rows from the data matrix that can be\nused to express the other rows. This type of approach appears in a number of other factorization\nproblems, including rank-revealing QR [15], interpolative decomposition [20], subspace clustering\n[10, 24], dictionary learning [11], and others. Our computational techniques can be adapted to\naddress large-scale instances of these problems as well.\n\n2 Separable Nonnegative Matrix Factorizations and Hott Topics\n\nNotation. For a matrix M and indices i and j, we write Mi\u00b7 for the ith row of M and M\u00b7j for the\njth column of M. We write Mij for the (i, j) entry.\nLet Y be a nonnegative f \u00d7 n data matrix with columns indexing examples and rows indexing\nfeatures. Exact NMF seeks a factorization Y = F W where the feature matrix F is f \u00d7 r, where\nthe weight matrix W is r \u00d7 n, and both factors are nonnegative. Typically, r \u226a min{f, n}.\nUnless stated otherwise, we assume that each row of the data matrix Y is normalized so it sums to\none. Under this hypothesis, we may also assume that each row of F and of W also sums to one [4].\nIt is notoriously difficult to solve the NMF problem. Vavasis showed that it is NP-complete to decide\nwhether a matrix admits a rank-r nonnegative factorization [27]. AGKM proved that an exact NMF\nalgorithm can be used to solve 3-SAT in subexponential time [4].\nThe literature contains some mathematical analysis of NMF that can be used to motivate algorithmic\ndevelopment. Thomas [25] developed a necessary and sufficient condition for the existence of a\nrank-r NMF. 
More recently, Donoho and Stodden [8] obtained a related sufficient condition for\nuniqueness. AGKM exhibited an algorithm that can produce a nonnegative matrix factorization\nunder a weaker sufficient condition. To state their results, we need a definition.\n\nDefinition 2.1 A set of vectors {v1, . . . , vr} \u2282 R^d is simplicial if no vector vi lies in the convex\nhull of {vj : j \u2260 i}. The set of vectors is \u03b1-robust simplicial if, for each i, the \u21131 distance from vi\nto the convex hull of {vj : j \u2260 i} is at least \u03b1. Figure 1 illustrates these concepts.\n\nThese ideas support the uniqueness results of Donoho and Stodden and the AGKM algorithm. Indeed,\nwe can find an NMF of Y efficiently if Y contains a set of r rows that is simplicial and whose\nconvex hull contains the remaining rows.\n\nDefinition 2.2 An NMF Y = F W is called separable if the rows of W are simplicial and there is\na permutation matrix \u03a0 such that\n\n\u03a0F = [I_r; M].    (1)\n\nAlgorithm 1: AGKM: Approximably Separable Nonnegative Matrix Factorization [4]\n1: Initialize R = \u2205.\n2: Compute the f \u00d7 f matrix D with Dij = \u2016Xi\u00b7 \u2212 Xj\u00b7\u20161.\n3: for k = 1, . . . , f do\n4:   Find the set Nk of rows that are at least 5\u03b5/\u03b1 + 2\u03b5 away from Xk\u00b7.\n5:   Compute the distance \u03b4k of Xk\u00b7 from conv({Xj\u00b7 : j \u2208 Nk}).\n6:   if \u03b4k > 2\u03b5, add k to the set R.\n7: end for\n8: Cluster the rows in R as follows: j and k are in the same cluster if Djk \u2264 10\u03b5/\u03b1 + 6\u03b5.\n9: Choose one element from each cluster to yield W .\n10: F = arg min_{Z \u2208 R^{f\u00d7r}} \u2016X \u2212 ZW\u2016\u221e,1\n\nFigure 1: Numbered circles are hott topics. Their\nconvex hull (orange) contains the other topics (small\ncircles), so the data admits a separable NMF. 
The arrow\nd1 marks the \u21131 distance from hott topic (1) to the\nconvex hull of the other two hott topics; definitions of\nd2 and d3 are similar. The hott topics are \u03b1-robustly\nsimplicial when each di \u2265 \u03b1.\n\nTo compute a separable factorization of Y , we must first identify a simplicial set of rows from Y .\nAfterward, we compute weights that express the remaining rows as convex combinations of this\ndistinguished set. We call the simplicial rows hott and the corresponding features hott topics.\nThis model allows us to express all the features for a particular instance if we know the values of\nthe instance at the simplicial rows. This assumption can be justified in a variety of applications. For\nexample, in text, knowledge of a few keywords may be sufficient to reconstruct counts of the other\nwords in a document. In vision, localized features can be used to predict gestures. In audio data, a\nfew bins of the spectrogram may allow us to reconstruct the remaining bins.\nWhile a nonnegative matrix one encounters in practice might not admit a separable factorization, it\nmay be well-approximated by a nonnegative matrix with a separable factorization. AGKM derived an\nalgorithm for nonnegative matrix factorization of a matrix that is well-approximated by a separable\nfactorization. To state their result, we introduce a norm on f \u00d7 n matrices:\n\n\u2016\u2206\u2016\u221e,1 := max_{1\u2264i\u2264f} \u2211_{j=1}^{n} |\u2206ij| .\n\nTheorem 2.3 (AGKM [4]) Let \u03b5 and \u03b1 be nonnegative constants satisfying \u03b5 \u2264 \u03b1\u00b2/(20+13\u03b1). Let X be\na nonnegative data matrix. Assume X = Y + \u2206 where Y is a nonnegative matrix whose rows\nhave unit \u21131 norm, where Y = F W is a rank-r separable factorization in which the rows of W\nare \u03b1-robust simplicial, and where \u2016\u2206\u2016\u221e,1 \u2264 \u03b5. 
Then Algorithm 1 finds a rank-r nonnegative\nfactorization \u02c6F \u02c6W that satisfies the error bound\n\n\u2016X \u2212 \u02c6F \u02c6W\u2016\u221e,1 \u2264 10\u03b5/\u03b1 + 7\u03b5.\n\nIn particular, the AGKM algorithm computes the factorization exactly when \u03b5 = 0. Although\nthis method is guaranteed to run in polynomial time, it has many undesirable features. First, the\nalgorithm requires a priori knowledge of the parameters \u03b1 and \u03b5. It may be possible to calculate\n\u03b5, but we can only estimate \u03b1 if we know which rows are hott. Second, the algorithm computes\nall \u21131 distances between rows at a cost of O(f^2 n). Third, for every row in the matrix, we must\ndetermine its distance to the convex hull of the rows that lie at a sufficient distance; this step requires\nus to solve a linear program for each row of the matrix at a cost of \u2126(f n). Finally, this method is\nintimately linked to the choice of the error norm \u2016\u00b7\u2016\u221e,1. It is not obvious how to adapt the algorithm\nfor other noise models. We present a new approach, based on linear programming, that overcomes\nthese drawbacks.\n\n3 Main Theoretical Results: NMF by Linear Programming\n\nThis paper shows that we can factor an approximately separable nonnegative matrix by solving a\nlinear program. 
A major advantage of this formulation is that it scales to very large data sets.\n\nAlgorithm 2 Separable Nonnegative Matrix Factorization by Linear Programming\nRequire: An f \u00d7 n nonnegative matrix Y with a rank-r separable NMF.\nEnsure: An f \u00d7 r matrix F and r \u00d7 n matrix W with F \u2265 0, W \u2265 0, and Y = F W .\n1: Find the unique C \u2208 \u03a6(Y ) to minimize p^T diag(C) where p is any vector with distinct values.\n2: Let I = {i : Cii = 1} and set W = YI\u00b7 and F = C\u00b7I.\n\nHere is the key observation: Suppose that Y is any f \u00d7 n nonnegative matrix that admits a rank-r\nseparable factorization Y = F W . If we pad F with zeros to form an f \u00d7 f matrix, we have\n\nY = \u03a0^T [I_r 0; M 0] \u03a0Y =: CY\n\nWe call the matrix C factorization localizing. Note that any factorization localizing matrix C is an\nelement of the polyhedral set\n\n\u03a6(Y ) := {C \u2265 0 : CY = Y , Tr(C) = r, Cjj \u2264 1 \u2200j, Cij \u2264 Cjj \u2200i, j}.\n\nThus, to find an exact NMF of Y , it suffices to find a feasible element C \u2208 \u03a6(Y ) whose\ndiagonal is integral. This task can be accomplished by linear programming. Once we have such\na C, we construct W by extracting the rows of Y that correspond to the indices i where Cii =\n1. We construct the feature matrix F by extracting the nonzero columns of C. This approach is\nsummarized in Algorithm 2. In turn, we can prove the following result.\n\nTheorem 3.1 Suppose Y is a nonnegative matrix with a rank-r separable factorization Y = F W .\nThen Algorithm 2 constructs a rank-r nonnegative matrix factorization of Y .\n\nAs the theorem suggests, we can isolate the rows of Y that yield a simplicial factorization by solving\na single linear program. The factor F can be found by extracting columns of C.\n\n3.1 Robustness to Noise\n\nSuppose we observe a nonnegative matrix X whose rows sum to one. 
Assume that X = Y + \u2206\nwhere Y is a nonnegative matrix whose rows sum to one, which has a rank-r separable factorization\nY = F W such that the rows of W are \u03b1-robust simplicial, and where \u2016\u2206\u2016\u221e,1 \u2264 \u03b5. Define the\npolyhedral set\n\n\u03a6_\u03c4(X) := {C \u2265 0 : \u2016CX \u2212 X\u2016\u221e,1 \u2264 \u03c4, Tr(C) = r, Cjj \u2264 1 \u2200j, Cij \u2264 Cjj \u2200i, j}.\n\nThe set \u03a6_\u03c4(X) consists of matrices C that approximately locate a factorization of X. We can prove\nthe following result.\n\nTheorem 3.2 Suppose that X satisfies the assumptions stated in the previous paragraph. Furthermore,\nassume that for every row Yj,\u00b7 that is not hott, we have the margin constraint \u2016Yj,\u00b7 \u2212 Yi,\u00b7\u2016 \u2265 d0\nfor all hott rows i. Then we can find a nonnegative factorization satisfying \u2016X \u2212 \u02c6F \u02c6W\u2016\u221e,1 \u2264 2\u03b5\nprovided that \u03b5 < min{\u03b1d0, \u03b1\u00b2}/(9(r+1)). Furthermore, this factorization correctly identifies the hott topics\nappearing in the separable factorization of Y .\n\nAlgorithm 3 requires the solution of two linear programs. The first minimizes a cost vector over\n\u03a6_{2\u03b5}(X). This lets us find \u02c6W . Afterward, the matrix \u02c6F can be found by setting\n\n\u02c6F = arg min_{Z\u22650} \u2016X \u2212 Z \u02c6W\u2016\u221e,1.    (2)\n\nOur robustness result requires a margin-type constraint assuming that the original configuration\nconsists either of duplicate hott topics, or topics that are reasonably far away from the hott topics. On\nthe other hand, under such a margin constraint, we can construct a considerably better approximation\nthan that guaranteed by the AGKM algorithm. 
Moreover, unlike AGKM, our algorithm does not need to\nknow the parameter \u03b1.\n\nAlgorithm 3 Approximably Separable Nonnegative Matrix Factorization by Linear Programming\nRequire: An f \u00d7 n nonnegative matrix X that satisfies the hypotheses of Theorem 3.2.\nEnsure: An f \u00d7 r matrix F and r \u00d7 n matrix W with F \u2265 0, W \u2265 0, and \u2016X \u2212 F W\u2016\u221e,1 \u2264 2\u03b5.\n1: Find C \u2208 \u03a6_{2\u03b5}(X) that minimizes p^T diag(C) where p is any vector with distinct values.\n2: Let I = {i : Cii = 1} and set W = XI\u00b7.\n3: Set F = arg min_{Z \u2208 R^{f\u00d7r}} \u2016X \u2212 ZW\u2016\u221e,1\n\nThe proofs of Theorems 3.1 and 3.2 can be found in the extended version of this paper [6]. The main idea\nis to show that we can only represent a hott topic efficiently using the hott topic itself. Some earlier\nversions of this paper contained incomplete arguments, which we have remedied. For a significantly\nstronger robustness analysis of Algorithm 3, see the recent paper [13].\nHaving established these theoretical guarantees, it now remains to develop an algorithm to solve\nthe LP. Off-the-shelf LP solvers may suffice for moderate-size problems, but for large-scale matrix\nfactorization problems, their running time is prohibitive, as we show in Section 5. In Section 4, we\ndescribe how to carry out Algorithm 3 efficiently for large data sets.\n\n3.2 Related Work\n\nLocalizing factorizations via column or row subset selection is a popular alternative to direct factorization\nmethods such as the SVD. Interpolative decompositions such as Rank-Revealing QR [15]\nand CUR [20] have favorable efficiency properties as compared to factorizations (such as SVD) that\nare not based on exemplars. Factorization localization has been used in subspace clustering and has\nbeen shown to be robust to outliers [10, 24].\nIn recent work on dictionary learning, Esser et al. and Elhamifar et al. 
have proposed a factorization\nlocalization solution to nonnegative matrix factorization using group sparsity techniques [9, 11].\nEsser et al. prove asymptotic exact recovery in a restricted noise model, but this result requires\npreprocessing to remove duplicate or near-duplicate rows. Elhamifar et al. show exact representative\nrecovery in the noiseless setting assuming no hott topics are duplicated. Our work here improves\nupon this work in several aspects, enabling finite sample error bounds, the elimination of any need\nto preprocess the data, and algorithmic implementations that scale to very large data sets.\n\n4 Incremental Gradient Algorithms for NMF\n\nThe rudiments of our fast implementation rely on two standard optimization techniques: dual decomposition\nand incremental gradient descent. Both techniques are described in depth in Chapters\n3.4 and 7.8 of Bertsekas and Tsitsiklis [5].\nWe aim to minimize p^T diag(C) subject to C \u2208 \u03a6_\u03c4(X). To proceed, form the Lagrangian\n\nL(C, \u03b2, w) = p^T diag(C) + \u03b2(Tr(C) \u2212 r) + \u2211_{i=1}^{f} wi (\u2016Xi\u00b7 \u2212 [CX]i\u00b7\u20161 \u2212 \u03c4 )\n\nwith multipliers \u03b2 and w \u2265 0. Note that we do not dualize out all of the constraints. The remaining\nones appear in the constraint set \u03a6_0 = {C : C \u2265 0, diag(C) \u2264 1, and Cij \u2264 Cjj for all i, j}.\nDual subgradient ascent solves this problem by alternating between minimizing the Lagrangian over\nthe constraint set \u03a6_0 and taking a subgradient step with respect to the dual variables\n\nwi \u2190 wi + s (\u2016Xi\u00b7 \u2212 [C\u22c6X]i\u00b7\u20161 \u2212 \u03c4 )   and   \u03b2 \u2190 \u03b2 + s(Tr(C\u22c6) \u2212 r)\n\nwhere C\u22c6 is the minimizer of the Lagrangian over \u03a6_0. The update of wi makes very little difference\nin the solution quality, so we typically only update \u03b2.\nWe minimize the Lagrangian using projected incremental gradient descent. 
Note that we can rewrite\nthe Lagrangian as\n\nL(C, \u03b2, w) = \u2212\u03c4 1^T w \u2212 \u03b2r + \u2211_{k=1}^{n} ( \u2211_{j \u2208 supp(X\u00b7k)} wj \u2016Xjk \u2212 [CX]jk\u20161 + \u00b5j(pj + \u03b2)Cjj ).\n\nAlgorithm 4 HOTTOPIXX: Approximate Separable NMF by Incremental Gradient Descent\nRequire: An f \u00d7 n nonnegative matrix X. Primal and dual stepsizes sp and sd.\nEnsure: An f \u00d7 r matrix F and r \u00d7 n matrix W with F \u2265 0, W \u2265 0, and \u2016X \u2212 F W\u2016\u221e,1 \u2264 2\u03b5.\n1: Pick a cost p with distinct entries.\n2: Initialize C = 0, \u03b2 = 0\n3: for t = 1, . . . , Nepochs do\n4:   for i = 1, . . . , n do\n5:     Choose k uniformly at random from [n].\n6:     C \u2190 C + sp \u00b7 sign(X\u00b7k \u2212 CX\u00b7k) X\u00b7k^T \u2212 sp diag(\u00b5 \u25e6 (\u03b21 \u2212 p)).\n7:   end for\n8:   Project C onto \u03a6_0.\n9:   \u03b2 \u2190 \u03b2 + sd(Tr(C) \u2212 r)\n10: end for\n11: Let I = {i : Cii = 1} and set W = XI\u00b7.\n12: Set F = arg min_{Z \u2208 R^{f\u00d7r}} \u2016X \u2212 ZW\u2016\u221e,1\n\nHere, supp(x) is the set indexing the entries where x is nonzero, and \u00b5j is the number of nonzeros\nin row j divided by n. The incremental gradient method chooses one of the n summands at random\nand follows its subgradient. We then project the iterate onto the constraint set \u03a6_0. The projection\nonto \u03a6_0 can be performed in the time required to sort the individual columns of C plus a linear-time\noperation. The full procedure is described in the extended version of this paper [6]. In the case\nwhere we expect a unique solution, we can drop the constraint Cij \u2264 Cjj, resulting in a simple\nclipping procedure: set all negative items to zero and set any diagonal entry exceeding one to one.\nIn practice, we perform a tradeoff. 
Since the constraint Cij \u2264 Cjj is used solely for symmetry\nbreaking, we have found empirically that we only need to project onto \u03a6_0 every n iterations or so.\nThis incremental iteration is repeated n times in a phase called an epoch. After each epoch, we\nupdate the dual variables and quit after we believe we have identified the large elements of the\ndiagonal of C. Just as before, once we have identified the hott rows, we can form W by selecting\nthese rows of X. We can find F just as before, by solving (2). Note that this minimization can\nalso be computed by incremental subgradient descent. The full procedure, called HOTTOPIXX, is\ndescribed in Algorithm 4.\n\n4.1 Sparsity and Computational Enhancements for Large Scale.\n\nFor small-scale problems, HOTTOPIXX can be implemented in a few lines of Matlab code. But for\nthe very large data sets studied in Section 5, we take advantage of natural parallelism and a host\nof low-level optimizations that are also enabled by our formulation. As in any numerical program,\nmemory layout and cache behavior can be critical factors for performance. We use standard techniques:\nin-memory clustering to increase prefetching opportunities, padded data structures for better\ncache alignment, and compiler directives to allow the Intel compiler to apply vectorization.\nNote that the incremental gradient step (step 6 in Algorithm 4) only modifies the entries of C where\nX\u00b7k is nonzero. Thus, we can parallelize the algorithm with respect to updating either the rows\nor the columns of C. We store X in large contiguous blocks of memory to encourage hardware\nprefetching. In contrast, we choose a dense representation of our localizing matrix C; this choice\ntrades space for runtime performance.\nEach worker thread is assigned a number of rows of C so that all rows fit in the shared L3 cache.\nThen, each worker thread repeatedly scans X while marking updates to multiple rows of C. 
We\nrepeat this process until all rows of C are scanned, similar to the classical block-nested loop join in\nrelational databases [22].\n\n5 Experiments\n\nExcept for the speedup curves, all of the experiments were run on an identical configuration: a dual\nXeon X650 (6 cores each) machine with 128GB of RAM. The kernel is Linux 2.6.32-131.\n\nFigure 2: Performance profiles for synthetic data. (a) (\u221e, 1)-norm error for 40 \u00d7 400 sized instances and\n(b) all instances. (c) is the performance profile for running time on all instances. RMSE performance profiles\nfor the (d) small scale and (e) medium scale experiments. (f) (\u221e, 1)-norm error for \u03b7 \u2265 1. In the noisy\nexamples, even 4 epochs of HOTTOPIXX are sufficient to obtain competitive reconstruction error.\n\nIn small-scale, synthetic experiments, we compared HOTTOPIXX to the AGKM algorithm and the\nlinear programming formulation of Algorithm 3 implemented in Matlab. Both AGKM and Algorithm 3\nwere run using CVX [14] coupled to the SDPT3 solver [26]. We ran HOTTOPIXX for 50\nepochs with primal stepsize 1e-1 and dual stepsize 1e-2. Once the hott topics were identified, we fit\nF using two cleaning epochs of incremental gradient descent for all three algorithms.\nTo generate our instances, we sampled r hott topics uniformly from the unit simplex in R^n. These\ntopics were duplicated d times. We generated the remaining f \u2212 r(d + 1) rows to be random convex\ncombinations of the hott topics, with the combinations selected uniformly at random. We then\nadded noise with (\u221e, 1)-norm error bounded by \u03b7 \u00b7 \u03b1\u00b2/(20+13\u03b1). Recall that the AGKM algorithm is only\nguaranteed to work for \u03b7 < 1. We ran with f \u2208 {40, 80, 160}, n \u2208 {400, 800, 1600}, r \u2208 {3, 5, 10},\nd \u2208 {0, 1, 2}, and \u03b7 \u2208 {0.25, 0.95, 4, 10, 100}. 
Each experiment was repeated 5 times.\nBecause we ran over 2000 experiments with 405 different parameter settings, it is convenient to use\nperformance profiles to compare the performance of the different algorithms [7]. Let P be the\nset of experiments and A denote the set of different algorithms we are comparing. Let Qa(p) be\nthe value of some performance metric of the experiment p \u2208 P for algorithm a \u2208 A. Then the\nperformance profile at \u03c4 for a particular algorithm is the fraction of the experiments where the value\nof Qa(p) lies within a factor of \u03c4 of the minimal value min_{b\u2208A} Qb(p). That is,\n\nPa(\u03c4 ) = #{p \u2208 P : Qa(p) \u2264 \u03c4 min_{a\u2032\u2208A} Qa\u2032(p)} / #(P).\n\nIn a performance profile, the higher a curve corresponding to an algorithm, the more often it outperforms\nthe other algorithms. This gives a convenient way to contrast algorithms visually.\nOur performance profiles are shown in Figure 2. The first two figures correspond to experiments\nwith f = 40 and n = 400. The third figure is for the synthetic experiments with all other values\nof f and n. In terms of (\u221e, 1)-norm error, the linear programming solver typically achieves the\nlowest error. However, using SDPT3, it is prohibitively slow to factor larger matrices. On the other\nhand, HOTTOPIXX achieves better noise performance than the AGKM algorithm in much less time.\nMoreover, the AGKM algorithm must be fed the values of \u03b5 and \u03b1 in order to run. HOTTOPIXX does\nnot require this information and still achieves about the same error performance.\nWe also display a graph for running only four epochs (hott (fast)). This algorithm is by far the fastest\nalgorithm, but does not achieve as good a noise performance. 
For very high levels of noise,\nhowever, it achieves a lower reconstruction error than the AGKM algorithm, whose performance\ndegrades once \u03b7 approaches or exceeds 1 (Figure 2(f)). We also provide performance profiles for\nthe root-mean-square error of the nonnegative matrix factorizations (Figure 2 (d) and (e)). The\nperformance is qualitatively similar to that for the (\u221e, 1)-norm.\n\ndata set   features   documents   nonzeros   size (GB)   time (s)\njumbo      1600       64000       1.02e8     2.7         338\nclueweb    44739      351849      1.94e7     0.27        478\nRCV1       47153      781265      5.92e7     1.14        430\n\nTable 1: Description of the large data sets. Time is to find 100 hott topics on the 12 core machines.\n\nFigure 3: (left) The speedup over a serial implementation for HOTTOPIXX on the jumbo and clueweb data\nsets. Note the superlinear speedup for up to 20 threads. (middle) The RMSE for the clueweb data set. (right)\nThe test error on RCV1 CCAT class versus the number of hott topics. The horizontal line indicates the test\nerror achieved using all of the features.\n\nWe also coded HOTTOPIXX in C++, using the design principles described in Section 4.1, and ran on\nthree large data sets. We generated a large synthetic example (jumbo) as above with r = 100. We\ngenerated a co-occurrence matrix of people and places from the ClueWeb09 Dataset [2], normalized\nby TFIDF. We also used HOTTOPIXX to select features from the RCV1 data set to recognize the\nclass CCAT [19]. 
The statistics for these data sets can be found in Table 1.\nIn Figure 3 (left), we plot the speed-up over a serial implementation. In contrast to other parallel\nmethods that exhibit memory contention [21], we see superlinear speed-ups for up to 20 threads\ndue to hardware prefetching and cache effects. All three of our large data sets can be factored in\nminutes, showing that we can scale HOTTOPIXX on both synthetic and real data. Our algorithm is\nable to correctly identify the hott topics on the jumbo set. For clueweb, we plot the RMSE in Figure 3\n(middle). This curve rolls off quickly for the first few hundred topics, demonstrating that our algorithm\nmay be useful for dimensionality reduction in Natural Language Processing applications. For\nRCV1, we trained an SVM on the set of features extracted by HOTTOPIXX and plot the misclassification\nerror versus the number of topics in Figure 3 (right). With 1500 hott topics, we achieve 7%\nmisclassification error as compared to 5.5% with the entire set of features.\n\n6 Discussion\n\nThis paper provides an algorithmic and theoretical framework for analyzing and deploying any factorization\nproblem that can be posed as a linear (or convex) factorization localizing program. Future\nwork should investigate the applicability of HOTTOPIXX to other factorization localizing algorithms,\nsuch as subspace clustering, and should revisit earlier theoretical bounds on such prior art.\n\nAcknowledgments\n\nThe authors would like to thank Sanjeev Arora, Michael Ferris, Rong Ge, Nicolas Gillis, Ankur\nMoitra, and Stephen Wright for helpful suggestions. BR is generously supported by ONR award\nN00014-11-1-0723, NSF award CCF-1139953, and a Sloan Research Fellowship. CR is generously\nsupported by NSF CAREER award under IIS-1054009, ONR award N000141210041, and gifts or\nresearch awards from American Family Insurance, Google, Greenplum, and Oracle. 
JAT is generously supported by ONR award N00014-11-1002, AFOSR award FA9550-09-1-0643, and a Sloan Research Fellowship.\n\n[Figure 3 plots: (left) speedup versus threads for jumbo and clueweb; (middle) RMSE versus number of topics; (right) class error versus number of topics.]\n\nReferences\n[1] docs.oracle.com/cd/B28359_01/datamine.111/b28129/algo_nmf.htm.\n[2] lemurproject.org/clueweb09/.\n[3] www.mathworks.com/help/toolbox/stats/nnmf.html.\n[4] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization \u2013 provably. To appear in STOC 2012. Preprint available at arxiv.org/abs/1111.0952, 2011.\n[5] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, Belmont, MA, 1997.\n[6] V. Bittorf, B. Recht, C. R\u00e9, and J. A. Tropp. Factoring nonnegative matrices with linear programs. Technical Report. Available at arxiv.org/1206.1270, 2012.\n[7] E. D. Dolan and J. J. Mor\u00e9. Benchmarking optimization software with performance profiles. Mathematical Programming, Series A, 91:201\u2013213, 2002.\n[8] D. Donoho and V. Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in Neural Information Processing Systems, 2003.\n[9] E. Elhamifar, G. Sapiro, and R. Vidal. See all by looking at a few: Sparse modeling for finding representative objects. In Proceedings of CVPR, 2012.\n[10] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.\n[11] E. Esser, M. M\u00f6ller, S. Osher, G. Sapiro, and J. Xin. A convex model for non-negative matrix factorization and dimensionality reduction on physical space. IEEE Transactions on Image Processing, 2012. To appear. Preprint available at arxiv.org/abs/1102.0844.\n[12] R. Gaujoux and C. Seoighe. 
NMF: A flexible R package for nonnegative matrix factorization. BMC Bioinformatics, 11:367, 2010. doi:10.1186/1471-2105-11-367.\n[13] N. Gillis. Robustness analysis of hotttopixx, a linear programming model for factoring nonnegative matrices. arxiv.org/1211.6687, 2012.\n[14] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx, May 2010.\n[15] M. Gu and S. C. Eisenstat. Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM Journal on Scientific Computing, 17:848\u2013869, 1996.\n[16] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International SIGIR Conference on Research and Development in Information Retrieval, 1999.\n[17] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788\u2013791, 1999.\n[18] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, 2001.\n[19] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361\u2013397, 2004.\n[20] M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106:697\u2013702, 2009.\n[21] F. Niu, B. Recht, C. R\u00e9, and S. J. Wright. HOGWILD!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, 2011.\n[22] L. D. Shapiro. Join processing in database systems with large main memories. ACM Transactions on Database Systems, 11(3):239\u2013264, 1986.\n[23] P. Smaragdis. Non-negative matrix factorization for polyphonic music transcription. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 177\u2013180, 2003.\n[24] M. 
Soltanolkotabi and E. J. Cand\u00e8s. A geometric analysis of subspace clustering with outliers. Preprint available at arxiv.org/abs/1112.4258, 2011.\n[25] L. B. Thomas. Problem 73-14, rank factorization of nonnegative matrices. SIAM Review, 16(3):393\u2013394, 1974.\n[26] K. C. Toh, M. Todd, and R. H. T\u00fct\u00fcnc\u00fc. SDPT3: A MATLAB software package for semidefinite-quadratic-linear programming. Available from http://www.math.nus.edu.sg/~mattohkc/sdpt3.html.\n[27] S. A. Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364\u20131377, 2009.\n", "award": [], "sourceid": 597, "authors": [{"given_name": "Ben", "family_name": "Recht", "institution": null}, {"given_name": "Christopher", "family_name": "Re", "institution": null}, {"given_name": "Joel", "family_name": "Tropp", "institution": null}, {"given_name": "Victor", "family_name": "Bittorf", "institution": null}]}