{"title": "Robust Spectral Inference for Joint Stochastic Matrix Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 2710, "page_last": 2718, "abstract": "Spectral inference provides fast algorithms and provable optimality for latent topic analysis. But for real data these algorithms require additional ad-hoc heuristics, and even then often produce unusable results. We explain this poor performance by casting the problem of topic inference in the framework of Joint Stochastic Matrix Factorization (JSMF) and showing that previous methods violate the theoretical conditions necessary for a good solution to exist. We then propose a novel rectification method that learns high quality topics and their interactions even on small, noisy data. This method achieves results comparable to probabilistic techniques in several domains while maintaining scalability and provable optimality.", "full_text": "Robust Spectral Inference for Joint Stochastic Matrix\n\nFactorization\n\nMoontae Lee, David Bindel\nDept. of Computer Science\n\nCornell University\nIthaca, NY 14850\n\n{moontae,bindel}@cs.cornell.edu\n\nDavid Mimno\n\nDept. of Information Science\n\nCornell University\nIthaca, NY 14850\n\nmimno@cornell.edu\n\nAbstract\n\nSpectral inference provides fast algorithms and provable optimality for latent topic\nanalysis. But for real data these algorithms require additional ad-hoc heuristics,\nand even then often produce unusable results. We explain this poor performance\nby casting the problem of topic inference in the framework of Joint Stochastic\nMatrix Factorization (JSMF) and showing that previous methods violate the theo-\nretical conditions necessary for a good solution to exist. We then propose a novel\nrecti\ufb01cation method that learns high quality topics and their interactions even on\nsmall, noisy data. 
This method achieves results comparable to probabilistic techniques in several domains while maintaining scalability and provable optimality.

1 Introduction

Summarizing large data sets using pairwise co-occurrence frequencies is a powerful tool for data mining. Objects can often be better described by their relationships than their inherent characteristics. Communities can be discovered from friendships [1], song genres can be identified from co-occurrence in playlists [2], and neural word embeddings are factorizations of pairwise co-occurrence information [3, 4]. Recent Anchor Word algorithms [5, 6] perform spectral inference on co-occurrence statistics for inferring topic models [7, 8]. Co-occurrence statistics can be calculated using a single parallel pass through a training corpus. While these algorithms are fast, deterministic, and provably guaranteed, they are sensitive to observation noise and small samples, often producing effectively useless results on real documents that present no problems for probabilistic algorithms.

We cast this general problem of learning overlapping latent clusters as Joint-Stochastic Matrix Factorization (JSMF), a subset of non-negative matrix factorization that contains topic modeling as a special case. We explore the conditions necessary for inference from co-occurrence statistics and show that the Anchor Words algorithms necessarily violate such conditions.
Then we propose a rectified algorithm that matches the performance of probabilistic inference, even on small and noisy datasets, without losing efficiency and provable guarantees. Validating on both real and synthetic data, we demonstrate that our rectification not only produces better clusters, but also, unlike previous work, learns meaningful cluster interactions.

Figure 1: 2D visualizations show the low-quality convex hull found by Anchor Words [6] (left) and a better convex hull (middle) found by discovering anchor words on a rectified space (right).

Let the matrix C represent the co-occurrence of pairs drawn from N objects: C_ij is the joint probability p(X1 = i, X2 = j) for a pair of objects i and j. Our goal is to discover K latent clusters by approximately decomposing C ≈ BAB^T. B is the object-cluster matrix, in which each column corresponds to a cluster and B_ik = p(X = i|Z = k) is the probability of drawing an object i conditioned on the object belonging to the cluster k; and A is the cluster-cluster matrix, in which A_kl = p(Z1 = k, Z2 = l) represents the joint probability of pairs of clusters. We call the matrices C and A joint-stochastic (i.e., C ∈ JS_N, A ∈ JS_K) due to their correspondence to joint distributions; B is column-stochastic. Example applications are shown in Table 1.

Table 1: JSMF applications, with anchor-word equivalents.

Domain      | Object | Cluster     | Basis
Document    | Word   | Topic       | Anchor Word
Image       | Pixel  | Segment     | Pure Pixel
Network     | User   | Community   | Representative
Legislature | Member | Party/Group | Partisan
Playlist    | Song   | Genre       | Signature Song

Anchor Word algorithms [5, 6] solve JSMF problems using a separability assumption: each topic contains at least one "anchor" word that has non-negligible probability exclusively in that topic.
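The stochasticity constraints above can be checked numerically. A minimal sketch with toy random matrices (the sizes and names here are illustrative, not from the paper): a column-stochastic B and a joint-stochastic A always yield a joint-stochastic, low-rank C = B A B^T.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 3

# Column-stochastic object-cluster matrix B: each column sums to 1.
B = rng.random((N, K))
B /= B.sum(axis=0, keepdims=True)

# Joint-stochastic cluster-cluster matrix A: symmetric, entries sum to 1.
A = rng.random((K, K))
A = (A + A.T) / 2
A /= A.sum()

# The model co-occurrence C = B A B^T is then joint-stochastic with rank <= K.
C = B @ A @ B.T
assert np.isclose(C.sum(), 1.0)        # entries form a joint distribution
assert np.allclose(C, C.T)             # symmetric
assert np.all(C >= 0)                  # entrywise non-negative
assert np.linalg.matrix_rank(C) <= K   # rank at most K
```

The total-sum check follows because each column of B sums to one, so 1^T B A B^T 1 = 1^T A 1 = 1.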
The algorithm uses the co-occurrence patterns of the anchor words as a summary basis for the co-occurrence patterns of all other words. The initial algorithm [5] is theoretically sound but unable to produce a column-stochastic word-topic matrix B due to unstable matrix inversions. A subsequent algorithm [6] fixes negative entries in B, but still produces large negative entries in the estimated topic-topic matrix A. As shown in Figure 3, the proposed algorithm infers valid topic-topic interactions.

2 Requirements for Factorization

In this section we review the probabilistic and statistical structures of JSMF and then define geometric structures of co-occurrence matrices required for successful factorization. C ∈ R^{N×N} is a joint-stochastic matrix constructed from M training examples, each of which contains some subset of N objects. We wish to find K ≪ N latent clusters by factorizing C into a column-stochastic matrix B ∈ R^{N×K} and a joint-stochastic matrix A ∈ R^{K×K}, satisfying C ≈ BAB^T.

Probabilistic structure. Figure 2 shows the event space of our model. The distribution A over pairs of clusters is generated first from a stochastic process with a hyperparameter α. If the m-th training example contains a total of n_m objects, our model views the example as consisting of all possible n_m(n_m − 1) pairs of objects.1 For each of these pairs, cluster assignments are sampled from the selected distribution ((z1, z2) ∼ A). Then an actual object pair is drawn with respect to the corresponding cluster assignments (x1 ∼ B_{z1}, x2 ∼ B_{z2}).
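This pair-generation view can be simulated directly. A toy sketch of our own (arbitrary A and B, and collapsing the per-example step by sampling cluster pairs straight from A): the empirical pair frequencies approach B A B^T.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, num_pairs = 6, 2, 200_000

B = rng.random((N, K)); B /= B.sum(axis=0, keepdims=True)   # column-stochastic
A = rng.random((K, K)); A = (A + A.T) / 2; A /= A.sum()     # joint-stochastic

# Draw cluster pairs (z1, z2) ~ A, then objects x1 ~ B[:, z1], x2 ~ B[:, z2].
flat = rng.choice(K * K, size=num_pairs, p=A.ravel())
z1, z2 = np.unravel_index(flat, (K, K))

C_emp = np.zeros((N, N))
for k in range(K):
    for l in range(K):
        n_kl = int(np.sum((z1 == k) & (z2 == l)))
        if n_kl == 0:
            continue
        x1 = rng.choice(N, size=n_kl, p=B[:, k])
        x2 = rng.choice(N, size=n_kl, p=B[:, l])
        np.add.at(C_emp, (x1, x2), 1.0)   # accumulate pair counts
C_emp /= num_pairs

# Empirical pair frequencies approach the model co-occurrence B A B^T.
assert np.isclose(C_emp.sum(), 1.0)
assert np.abs(C_emp - B @ A @ B.T).max() < 0.01
```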
Note that this process does not explain how each training example is generated from a model, but shows how our model understands the objects in the training examples.

Following [5, 6], our model views B as a set of parameters rather than random variables.2 The primary learning task is to estimate B; we then estimate A to recover the hyperparameter α. Due to the conditional independence X1 ⊥ X2 | (Z1 or Z2), the factorization C ≈ BAB^T is equivalent to

p(X1, X2|A; B) = Σ_{z1} Σ_{z2} p(X1|Z1; B) p(Z1, Z2|A) p(X2|Z2; B).

Figure 2: The JSMF event space differs from LDA's. JSMF deals only with pairwise co-occurrence events and does not generate observations/documents.

Under the separability assumption, each cluster k has a basis object s_k such that p(X = s_k|Z = k) > 0 and p(X = s_k|Z ≠ k) = 0. In matrix terms, we assume the submatrix of B comprised of the rows with indices S = {s_1, . . . , s_K} is diagonal. As these rows form a non-negative basis for the row space of B, the assumption implies rank+(B) = K = rank(B).3 Providing identifiability to the factorization, this assumption becomes crucial for inference of both B and A. Note that JSMF factorization is unique up to column permutation, meaning that no specific ordering exists among the discovered clusters, equivalent to probabilistic topic models (see the Appendix).

1 Due to the bag-of-words assumption, every object can pair with any other object in that example, except itself. One implication of our work is better understanding the self-co-occurrences, the diagonal entries in the co-occurrence matrix.
2 In LDA, each column of B is generated from a known distribution B_k ∼ Dir(β).

Statistical structure. Let f(α) be a (known) distribution of distributions from which a cluster distribution is sampled for each training example.
Saying W_m ∼ f(α), we have M i.i.d. samples {W_1, . . . , W_M} which are not directly observable. Defining the posterior cluster-cluster matrix A*_M = (1/M) Σ_{m=1}^M W_m W_m^T and the expectation A* = E[W_m W_m^T], Lemma 2.2 in [5] showed that4

A*_M → A*  as M → ∞.    (1)

Denote the posterior co-occurrence for the m-th training example by C*_m and for all examples by C*. Then C*_m = B W_m W_m^T B^T, and C* = (1/M) Σ_{m=1}^M C*_m. Thus

C* = B ((1/M) Σ_{m=1}^M W_m W_m^T) B^T = B A*_M B^T.    (2)

Denote the noisy observation for the m-th training example by C_m, and for all examples by C. Let W = [W_1|...|W_M] be a matrix of topics. We will construct C_m so that E[C|W] is an unbiased estimator of C*. Thus as M → ∞,

C → E[C] = C* = B A*_M B^T → B A* B^T.    (3)

Geometric structure. Though the separability assumption allows us to identify B even from the noisy observation C, we need to thoroughly investigate the structure of cluster interactions. This is because it will eventually be related to how much useful information the co-occurrence between corresponding anchor bases contains, enabling us to best use our training data. Say DNN_n is the set of n×n doubly non-negative matrices: entrywise non-negative and positive semidefinite (PSD).

Claim. A*_M, A* ∈ DNN_K and C*_M, C* ∈ DNN_N.

Proof. Take any vector y ∈ R^K. As A*_M is defined as a sum of outer products,

y^T A*_M y = (1/M) Σ_{m=1}^M y^T W_m W_m^T y = (1/M) Σ_{m=1}^M (W_m^T y)^T (W_m^T y) ≥ 0.    (4)

Thus A*_M ∈ PSD_K. In addition, (A*_M)_kl = p(Z1 = k, Z2 = l) ≥ 0 for all k, l.
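The claim can be sanity-checked numerically. A small sketch, assuming for illustration that f(α) is a symmetric Dirichlet (the paper only requires some known distribution over the simplex):

```python
import numpy as np

rng = np.random.default_rng(2)
K, M = 4, 500

# M i.i.d. cluster distributions W_m, each a point on the K-simplex.
W = rng.dirichlet(np.ones(K), size=M)      # shape (M, K)

# Posterior cluster-cluster matrix A*_M = (1/M) sum_m W_m W_m^T.
A_star_M = (W.T @ W) / M

# Doubly non-negative: entrywise non-negative and positive semidefinite.
assert np.all(A_star_M >= 0)
assert np.linalg.eigvalsh(A_star_M).min() >= -1e-12

# Joint-stochastic: each W_m sums to 1, so the entries of A*_M sum to 1.
assert np.isclose(A_star_M.sum(), 1.0)
```

Positive semidefiniteness holds by construction (a Gram matrix), matching the outer-product argument in the proof above.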
Proving A* ∈ DNN_K is analogous by the linearity of expectation. Relying on double non-negativity of A*_M, Equation (3) implies not only the low-rank structure of C*, but also double non-negativity of C* by a similar proof (see the Appendix).

The Anchor Word algorithms in [5, 6] consider neither double non-negativity of cluster interactions nor its implication on co-occurrence statistics. Indeed, the empirical co-occurrence matrices collected from limited data are generally indefinite and full-rank, whereas the posterior co-occurrences must be positive semidefinite and low-rank. Our new approach will efficiently enforce double non-negativity and low-rankness of the co-occurrence matrix C based on the geometric property of its posterior behavior. We will later clarify how this process substantially improves the quality of the clusters and their interactions by eliminating noise and restoring missing information.

3 Rectified Anchor Words Algorithm

In this section, we describe how to estimate the co-occurrence matrix C from the training data, and how to rectify C so that it is low-rank and doubly non-negative. We then decompose the rectified C′ in a way that preserves the doubly non-negative structure in the cluster interaction matrix.

3 rank+(B) means the non-negative rank of the matrix B, whereas rank(B) means the usual rank.
4 This convergence is not trivial, while (1/M) Σ_{m=1}^M W_m → E[W_m] as M → ∞ by the Central Limit Theorem.

Generating co-occurrence C. Let H_m be the vector of object counts for the m-th training example, and let p_m = BW_m where W_m is the document's latent topic distribution. Then H_m is assumed to be a sample from a multinomial distribution H_m ∼ Multi(n_m, p_m) where n_m = Σ_{i=1}^N H_m^(i).
As in [6], we recall that E[H_m] = n_m p_m = n_m B W_m and Cov(H_m) = n_m (diag(p_m) − p_m p_m^T), and generate the co-occurrence for the m-th example by

C_m = (H_m H_m^T − diag(H_m)) / (n_m(n_m − 1)).    (5)

The diagonal penalty in Eq. (5) cancels out the diagonal matrix term in the variance-covariance matrix, making the estimator unbiased. Putting d_m = n_m(n_m − 1), that is,

E[C_m|W_m] = (1/d_m)(E[H_m H_m^T] − diag(E[H_m])) = (1/d_m)(E[H_m]E[H_m]^T + Cov(H_m) − diag(E[H_m])) = B(W_m W_m^T)B^T ≡ C*_m.

Thus E[C|W] = C* by the linearity of expectation.

Rectifying co-occurrence C. While C is an unbiased estimator for C* in our model, in reality the two matrices often differ due to a mismatch between our model assumptions and the data5 or due to error in estimation from limited data. The computed C is generally full-rank with many negative eigenvalues, causing a large approximation error. As the posterior co-occurrence C* must be low-rank, doubly non-negative, and joint-stochastic, we propose two rectification methods: Diagonal Completion (DC) and Alternating Projection (AP). DC modifies only diagonal entries so that C becomes low-rank, non-negative, and joint-stochastic, while AP modifies every entry and enforces the same properties as well as positive semi-definiteness. As our empirical results strongly favor alternating projection, we defer the details of diagonal completion to the Appendix.

Based on the desired property of the posterior co-occurrence C*, we seek to project our estimator C onto the set of joint-stochastic, doubly non-negative, low rank matrices. Alternating projection methods like Dykstra's algorithm [9] allow us to project onto an intersection of finitely many convex sets using projections onto each individual set in turn.
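The unbiasedness of the estimator in Eq. (5) can be verified empirically. A toy simulation of our own (arbitrary sizes, Dirichlet topic draws): averaging C_m over many examples approaches C* = B A*_M B^T for the same topic draws W.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, M, n_m = 6, 2, 20_000, 50

B = rng.random((N, K)); B /= B.sum(axis=0, keepdims=True)  # column-stochastic
W = rng.dirichlet(np.ones(K), size=M)      # per-example topic distributions

C = np.zeros((N, N))
for m in range(M):
    p_m = B @ W[m]                          # object distribution, sums to 1
    H_m = rng.multinomial(n_m, p_m)         # object counts for example m
    # Eq. (5): subtracting diag(H_m) removes self-pairs and debiases.
    C += (np.outer(H_m, H_m) - np.diag(H_m)) / (n_m * (n_m - 1))
C /= M

# E[C | W] equals C* = B (1/M sum_m W_m W_m^T) B^T; check it empirically.
C_star = B @ ((W.T @ W) / M) @ B.T
assert np.isclose(C.sum(), 1.0)
assert np.abs(C - C_star).max() < 0.01
```

Note that each C_m already sums exactly to one, since the entries of H_m H_m^T sum to n_m^2 and the subtracted diagonal sums to n_m.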
In our setting, we consider the intersection of three sets of symmetric N×N matrices: the elementwise non-negative matrices NN_N, the normalized matrices NOR_N whose entry sum is equal to 1, and the positive semi-definite matrices with rank K, PSD_NK. We project onto these three sets as follows:

Π_{PSD_NK}(C) = U Λ+_K U^T,   Π_{NOR_N}(C) = C + ((1 − Σ_{i,j} C_ij) / N²) 11^T,   Π_{NN_N}(C) = max{C, 0},

where C = U Λ U^T is an eigendecomposition and Λ+_K is the matrix Λ modified so that all negative eigenvalues and all but the K largest positive eigenvalues are set to zero. Truncated eigendecompositions can be computed efficiently, and the other projections are likewise efficient. While NN_N and NOR_N are convex, PSD_NK is not. However, [10] show that alternating projection with a non-convex set still works under certain conditions, guaranteeing local convergence. Thus iterating the three projections in turn until convergence rectifies C to lie in the desired space. We will show how to satisfy such conditions and the convergence behavior in Section 5.

Selecting basis S. The first step of the factorization is to select the subset S of objects that satisfy the separability assumption. We want the K best rows of the row-normalized co-occurrence matrix C so that all other rows lie nearly in the convex hull of the selected rows. [6] use the Gram-Schmidt process to select anchors, which computes a pivoted QR decomposition, but does not utilize the sparsity of C. To scale beyond small vocabularies, they use random projections that approximately preserve ℓ2 distances between rows of C. For all experiments we use a new pivoted QR algorithm (see the Appendix) that exploits sparsity instead of using random projections, and thus preserves deterministic inference.6

Recovering object-cluster B.
After finding the set of basis objects S, we can infer each entry of B by Bayes' rule as in [6]. Let {p(Z1 = k|X1 = i)}_{k=1}^K be the coefficients that reconstruct the i-th row of C in terms of the basis rows corresponding to S. Since B_ik = p(X1 = i|Z1 = k), we can use the corpus frequencies p(X1 = i) = Σ_j C_ij to estimate B_ik ∝ p(Z1 = k|X1 = i) p(X1 = i). Thus the main task for this step is to solve simplex-constrained QPs to infer a set of such coefficients for each object. We use an exponentiated gradient algorithm to solve the problem similar to [6]. Note that this step can be efficiently done in parallel for each object.

5 There is no reason to expect real data to be generated from topics, much less exactly K latent topics.
6 To effectively use random projections, it is necessary to either find proper dimensions based on multiple trials or perform low-dimensional random projection multiple times [25] and merge the resulting anchors.

Recovering cluster-cluster A. [6] recovered A by minimizing ‖C − BAB^T‖_F; but the inferred A generally has many negative entries, failing to model the probabilistic interaction between topics. While we can further project A onto the joint-stochastic matrices, this produces a large approximation error. We consider an alternate recovery method that again leverages the separability assumption. Let C_SS be the submatrix whose rows and columns correspond to the selected objects S, and let D be the diagonal submatrix B_{S*} of rows of B corresponding to S. Then

Figure 3: The algorithm of [6] (first panel) produces negative cluster co-occurrence probabilities. A probabilistic reconstruction alone (this paper & [5], second panel) removes negative entries but has no off-diagonals and does not sum to one.
The same reconstruction after rectification (this paper, third panel) produces a valid joint-stochastic matrix.

C_SS = D A D^T = D A D  ⟹  A = D^{-1} C_SS D^{-1}.    (6)

This approach efficiently recovers a cluster-cluster matrix A mostly based on the co-occurrence information between the corresponding anchor bases, and produces no negative entries due to the stability of diagonal matrix inversion. Note that the principal submatrices of a PSD matrix are also PSD; hence, if C ∈ PSD_N then C_SS, A ∈ PSD_K. Thus, not only is the recovered A an unbiased estimator for A*_M, but it is also doubly non-negative, as A*_M ∈ DNN_K after the rectification.7

4 Experimental Results

Our Rectified Anchor Words algorithm with alternating projection fixes many problems in the baseline Anchor Words algorithm [6] while matching the performance of Gibbs sampling [11] and maintaining spectral inference's determinism and independence from corpus size. We evaluate direct measurement of matrix quality as well as indicators of topic utility. We use two text datasets: NIPS full papers and New York Times news articles.8 We eliminate a minimal list of 347 English stop words, prune rare words based on tf-idf scores, and remove documents with fewer than five tokens after vocabulary curation. We also prepare two non-textual item-selection datasets: users' movie reviews from the Movielens 10M Dataset,9 and music playlists from the complete Yes.com dataset.10 We perform similar vocabulary curation and document tailoring, with the exception of frequent stop-object elimination. Playlists often contain the same songs multiple times, but users are unlikely to review the same movies more than once, so we augment the movie dataset so that each review contains 2 × (stars) movies based on the half-scaled rating information that varies from 0.5 stars to 5 stars.
Statistics of our datasets are shown in Table 2. We run DC 30 times for each experiment, randomly permuting the order of objects and using the median results to minimize the effect of different orderings. We also run 150 iterations of AP, alternating PSD_NK, NOR_N, and NN_N in turn. For probabilistic Gibbs sampling, we use Mallet with the standard option, doing 1,000 iterations. All metrics are evaluated against the original C, not against the rectified C′, whereas we use B and A inferred from the rectified C′.

Table 2: Statistics of four datasets.

Dataset  | M       | N   | Avg. Len
NIPS     | 1,348   | 5k  | 380.5
NYTimes  | 269,325 | 15k | 204.9
Movies   | 63,041  | 10k | 142.8
Songs    | 14,653  | 10k | 119.2

7 We later realized that essentially the same approach was previously tried in [5], but it was not able to generate a valid topic-topic matrix, as shown in the middle panel of Figure 3.
8 https://archive.ics.uci.edu/ml/datasets/Bag+of+Words
9 http://grouplens.org/datasets/movielens
10 http://www.cs.cornell.edu/~shuochen/lme

Qualitative results. Although [6] report comparable results to probabilistic algorithms for LDA, the algorithm fails under many circumstances. The algorithm prefers rare and unusual anchor words that form a poor basis, so topic clusters consist of the same high-frequency terms repeatedly, as shown in the upper third of Table 3. In contrast, our algorithm with AP rectification successfully learns themes similar to the probabilistic algorithm.
One can also verify that the cluster interactions given in the third panel of Figure 3 explain how the five topics correlate with each other.

Similar to [12], we visualize the five anchor words in the co-occurrence space after 2D PCA of C. Each panel in Figure 1 shows a 2D embedding of the NIPS vocabulary as blue dots and five selected anchor words in red. The first plot shows standard anchor words and the original co-occurrence space. The second plot shows anchor words selected from the rectified space overlaid on the original co-occurrence space. The third plot shows the same anchor words as the second plot overlaid on the AP-rectified space. The rectified anchor words provide better coverage on both spaces, explaining why we are able to achieve reasonable topics even with K = 5.

Rectification also produces better clusters in the non-textual movie dataset. Each cluster is notably more genre-coherent and year-coherent than the clusters from the original algorithm. When K = 15, for example, we verify a cluster of Walt Disney 2D Animations mostly from the 1990s and a cluster of Fantasy movies represented by Lord of the Rings films, similar to clusters found by probabilistic Gibbs sampling. The Baseline algorithm [6] repeats Pulp Fiction and Silence of the Lambs 15 times.

Table 3: Each line is a topic from NIPS (K = 5). Previous work simply repeats the most frequent words in the corpus five times.

Arora et al.
2013 (Baseline)
neuron layer hidden recognition signal cell noise
neuron layer hidden cell signal representation noise
neuron layer cell hidden signal noise dynamic
neuron layer cell hidden control signal noise
neuron layer hidden cell signal recognition noise
This paper (AP)
neuron circuit cell synaptic signal layer activity
control action dynamic optimal policy controller reinforcement
recognition layer hidden word speech image net
cell field visual direction image motion object orientation
gaussian noise hidden approximation matrix bound examples
Probabilistic LDA (Gibbs)
neuron cell visual signal response field activity
control action policy optimal reinforcement dynamic robot
recognition image object feature word speech features
hidden net layer dynamic neuron recurrent noise
gaussian approximation matrix bound component variables

Quantitative results. We measure the intrinsic quality of inference and summarization with respect to the JSMF objectives as well as the extrinsic quality of resulting topics. Lines correspond to four methods: ◦ Baseline for the algorithm in the previous work [6] without any rectification, △ DC for Diagonal Completion, □ AP for Alternating Projection, and ◆ Gibbs for Gibbs sampling.

Anchor objects should form a good basis for the remaining objects. We measure Recovery error ((1/N) Σ_i ‖C_i − Σ_k p(Z1 = k|X1 = i) C_{S_k}‖_2) with respect to the original C matrix, not the rectified matrix. AP reduces error in almost all cases and is more effective than DC. Although we expect error to decrease as we increase the number of clusters K, reducing recovery error for a fixed K by choosing better anchors is extremely difficult: no other subset selection algorithm [13] decreased error by more than 0.001. A good matrix factorization should have a small elementwise Approximation error (‖C − BAB^T‖_F). DC and AP preserve more of the information in the original matrix C than the Baseline method, especially when K is small.11 We expect non-trivial interactions between clusters, even when we do not explicitly model them as in [14]. Greater diagonal Dominancy ((1/K) Σ_k p(Z2 = k|Z1 = k)) indicates lower correlation between clusters.12 Specificity ((1/K) Σ_k KL(p(X|Z = k) ‖ p(X))) measures how much each cluster is distinct from the corpus distribution. When anchors produce a poor basis, the conditional distribution of clusters given objects becomes uniform, making p(X|Z) similar to p(X). Inter-topic Dissimilarity counts the average number of objects in each cluster that do not occur in any other cluster's top 20 objects.

AP and Gibbs results are similar. We do not report held-out probability because we find that relative results are determined by user-defined smoothing parameters [12, 24].

11 In the NYTimes corpus, 10^{-2} is a large error: each element is around 10^{-9} due to the number of normalized entries.
12 Dominancy in the Songs corpus lacks Baseline results at K > 10 because dominancy is undefined if an algorithm picks a song that occurs at most once in each playlist as a basis object. In this case, the original construction of C_SS, and hence of A, has a zero diagonal element, making dominancy NaN.

Figure 4: Experimental results on real datasets. The x-axis indicates log K, where K varies by 5 up to 25 topics and by 25 up to 100 or 150 topics. Whereas the Baseline algorithm largely fails with small K and does not infer quality B and A even with large K, Alternating Projection (AP) not only finds better basis vectors (Recovery), but also shows stable behavior comparable to probabilistic inference (Gibbs) in every metric.
Our experiments validate that AP and Gibbs yield comparably specific and distinct topics, while Baseline and DC simply repeat the corpus distribution as in Table 3. Coherence ((1/K) Σ_k Σ_{x1 ≠ x2 ∈ Top_k} log((D2(x1, x2) + ε) / D1(x2))) penalizes topics that assign high probability (rank > 20) to words that do not occur together frequently. AP produces results close to Gibbs sampling, and far from the Baseline and DC. While this metric correlates with human evaluation of clusters [15], "worse" coherence can actually be better because the metric does not penalize repetition [12].

In semi-synthetic experiments [6] AP matches Gibbs sampling and outperforms the Baseline, but the discrepancies in topic quality metrics are smaller than in the real experiments (see Appendix). We speculate that semi-synthetic data is more "well-behaved" than real data, explaining why these issues were not recognized previously.

5 Analysis of Algorithm

Why does AP work? Before rectification, diagonals of the empirical C matrix may be far from correct. Bursty objects yield diagonal entries that are too large; extremely rare objects that occur at most once per document yield zero diagonals. Rare objects are problematic in general: the corresponding rows in the C matrix are sparse and noisy, and these rows are likely to be selected by the pivoted QR. Because rare objects are likely to be anchors, the matrix C_SS is likely to be highly diagonally dominant, and provides an uninformative picture of topic correlations. These problems are exacerbated when K is small relative to the effective rank of C, so that an early choice of a poor anchor precludes a better choice later on; and when the number of documents M is small, in which case the empirical C is relatively sparse and is strongly affected by noise.
To mitigate this issue, [24] run exhaustive grid search to find document frequency cutoffs that yield informative anchors. As model performance is inconsistent for different cutoffs and the search requires cross-validation for each case, it is nearly impossible to find good heuristics for each dataset and number of topics.

Fortunately, a low-rank PSD matrix cannot have too many diagonally dominant rows, since this violates the low-rank property. Nor can it have diagonal entries that are small relative to off-diagonals, since this violates positive semi-definiteness.
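The three projections from Section 3 can be sketched in a few lines. A minimal illustration of our own (toy data, a dense eigendecomposition in place of a truncated one, and simple cyclic alternation rather than Dykstra-style corrections):

```python
import numpy as np

def proj_psd_rank_k(C, K):
    """Keep only the K largest eigenvalues, clipped at zero."""
    vals, vecs = np.linalg.eigh(C)          # ascending eigenvalues
    keep = np.zeros_like(vals)
    keep[-K:] = np.maximum(vals[-K:], 0.0)
    return (vecs * keep) @ vecs.T

def proj_normalized(C):
    """Shift all entries equally so that they sum to one."""
    N = C.shape[0]
    return C + (1.0 - C.sum()) / (N * N)

def proj_nonnegative(C):
    """Clip negative entries at zero."""
    return np.maximum(C, 0.0)

def rectify(C, K, iters=150):
    """Cycle through the three projections until (local) convergence."""
    for _ in range(iters):
        C = proj_psd_rank_k(proj_nonnegative(proj_normalized(C)), K)
    return C

# Toy demo: a noisy, generally indefinite "co-occurrence" estimate.
rng = np.random.default_rng(4)
X = rng.random((8, 8))
C = (X + X.T) / 2
C /= C.sum()
C += 0.005 * rng.standard_normal((8, 8))
C = (C + C.T) / 2

C_rect = rectify(C, K=3)
# The final PSD projection makes the output exactly PSD with rank <= K;
# non-negativity and normalization are approached as the alternation proceeds.
assert np.linalg.eigvalsh(C_rect).min() >= -1e-8
assert np.linalg.matrix_rank(C_rect) <= 3
assert np.allclose(C_rect, C_rect.T)
```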
Because the anchor word assumption implies that non-negative rank and ordinary rank are the same, the AP algorithm ideally does not remove the information we wish to learn; rather, 1) the low-rank projection in AP suppresses the influence of a small number of noisy rows associated with rare words which may not be well correlated with the others, and 2) the PSD projection in AP recovers missing information in diagonals. (As illustrated in the Dominancy panel of the Songs corpus in Figure 4, AP shows valid dominancies even for K > 10, in contrast to the Baseline algorithm.)

Why does AP converge? AP enjoys local linear convergence [10] if 1) the initial C is near the convergence point C′, 2) PSD_NK is super-regular at C′, and 3) strong regularity holds at C′. For the first condition, recall that we rectified C′ by pushing C toward C*, which is the ideal convergence point inside the intersection. Since C → C* as shown in (5), C is close to C′ as desired. The prox-regular sets13 are subsets of super-regular sets, so prox-regularity of PSD_NK at C′ is sufficient for the second condition. For a permutation-invariant M ⊂ R^N, the spectral set of symmetric matrices is defined as λ^{-1}(M) = {X ∈ S_N : (λ1(X), . . . , λN(X)) ∈ M}, and λ^{-1}(M) is prox-regular if and only if M is prox-regular [16, Th. 2.4]. Let M be {x ∈ R^n_+ : |supp(x)| = K}. Since each element in M has exactly K positive components and all others are zero, λ^{-1}(M) = PSD_NK. By the definition of M and K < N, P_M is locally unique almost everywhere, satisfying the second condition almost surely.
(As the intersection of the convex set PSD_N and the smooth manifold of rank-K matrices, PSD_N^K is a smooth manifold almost everywhere.)

Checking the third condition a priori is challenging, but we expect noise in the empirical C to prevent an irregular solution, following the argument of Numerical Example 9 in [10]. We thus expect AP to converge locally linearly, and we can verify local convergence of AP in practice. Empirically, the ratio of average distances between two iterations is always ≤ 0.9794 on the NYTimes dataset (see the Appendix), and the other datasets behave similarly. Note again that our rectified C′ is the result of pushing the empirical C toward the ideal C*. Because the approximation factors of [6] are all computed in terms of how far C and its co-occurrence shape can be from those of C*, all provable guarantees of [6] hold better with our rectified C′.

6 Related and Future Work

JSMF is a specific structure-preserving Non-negative Matrix Factorization (NMF) that performs spectral inference. [17, 18] exploit a similar separable structure for NMF problems. To tackle hyperspectral unmixing problems, [19, 20] assume pure pixels, a separability equivalent in computer vision. For more general NMF without such structures, RESCAL [21] studies a tensorial extension of a similar factorization, and SymNMF [22] infers BB^T rather than BAB^T. For topic modeling, [23] performs spectral inference on the third-moment tensor, assuming topics are uncorrelated.

As the core of our algorithm is to rectify the input co-occurrence matrix, it can be combined with several recent developments. [24] proposes two regularization methods for recovering a better B. [12] nonlinearly projects the co-occurrence matrix to a low-dimensional space via t-SNE and achieves better anchors by finding the exact anchors in that space.
[25] performs multiple random projections to low-dimensional spaces and recovers approximate anchors efficiently by a divide-and-conquer strategy. In addition, our work opens several promising research directions. How exactly do anchors found in the rectified C′ form better bases than those found in the original space of C? Since the topic-topic matrix A is again doubly non-negative and joint-stochastic, can we learn super-topics in a multi-layered hierarchical model by recursively applying JSMF to the topic-topic co-occurrence A?

Acknowledgments

This research is supported by NSF grant HCC:Large-0910664. We thank Adrian Lewis for valuable discussions on AP convergence.

¹³A set M is prox-regular if P_M is locally unique.

References

[1] Alan Mislove, Bimal Viswanath, Krishna P. Gummadi, and Peter Druschel. You are who you know: Inferring user profiles in online social networks. In Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (WSDM'10), New York, NY, February 2010.

[2] Shuo Chen, J. Moore, D. Turnbull, and T. Joachims. Playlist prediction via metric embedding. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 714–722, 2012.

[3] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.

[4] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In NIPS, 2014.

[5] S. Arora, R. Ge, and A. Moitra. Learning topic models – going beyond SVD. In FOCS, 2012.

[6] Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. A practical algorithm for topic modeling with provable guarantees. In ICML, 2013.

[7] T. Hofmann. Probabilistic latent semantic analysis. In UAI, pages 289–296, 1999.

[8] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation.
Journal of Machine Learning Research, pages 993–1022, 2003. Preliminary version in NIPS 2001.

[9] James P. Boyle and Richard L. Dykstra. A method for finding projections onto the intersection of convex sets in Hilbert spaces. In Advances in Order Restricted Statistical Inference, volume 37 of Lecture Notes in Statistics, pages 28–47. Springer New York, 1986.

[10] Adrian S. Lewis, D. R. Luke, and Jérôme Malick. Local linear convergence for alternating and averaged nonconvex projections. Foundations of Computational Mathematics, 9:485–513, 2009.

[11] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.

[12] Moontae Lee and David Mimno. Low-dimensional embeddings for interpretable anchor-based topic inference. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1319–1328. Association for Computational Linguistics, 2014.

[13] Mary E. Broadbent, Martin Brown, Kevin Penner, I. Ipsen, and R. Rehman. Subset selection algorithms: Randomized vs. deterministic. SIAM Undergraduate Research Online, 3:50–71, 2010.

[14] D. Blei and J. Lafferty. A correlated topic model of science. Annals of Applied Statistics, pages 17–35, 2007.

[15] David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. Optimizing semantic coherence in topic models. In EMNLP, 2011.

[16] A. Daniilidis, A. S. Lewis, J. Malick, and H. Sendov. Prox-regularity of spectral functions and spectral sets. Journal of Convex Analysis, 15(3):547–560, 2008.

[17] Christian Thurau, Kristian Kersting, and Christian Bauckhage. Yes we can: Simplex volume maximization for descriptive web-scale matrix factorization. In CIKM'10, pages 1785–1788, 2010.

[18] Abhishek Kumar, Vikas Sindhwani, and Prabhanjan Kambadur.
Fast conical hull algorithms for near-separable non-negative matrix factorization. CoRR, 2012.

[19] José M. P. Nascimento and José M. Bioucas-Dias. Vertex component analysis: A fast algorithm to unmix hyperspectral data. IEEE Transactions on Geoscience and Remote Sensing, pages 898–910, 2005.

[20] Cécile Gomez, H. Le Borgne, Pascal Allemand, Christophe Delacourt, and Patrick Ledru. N-FindR method versus independent component analysis for lithological identification in hyperspectral imagery. International Journal of Remote Sensing, 28(23):5315–5338, 2007.

[21] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 809–816. ACM, 2011.

[22] Da Kuang, Haesun Park, and Chris H. Q. Ding. Symmetric nonnegative matrix factorization for graph clustering. In SDM. SIAM / Omnipress, 2012.

[23] Anima Anandkumar, Dean P. Foster, Daniel Hsu, Sham Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 25, pages 926–934, 2012.

[24] Thang Nguyen, Yuening Hu, and Jordan Boyd-Graber. Anchors regularized: Adding robustness and extensibility to scalable topic-modeling algorithms. In Association for Computational Linguistics, 2014.

[25] Tianyi Zhou, Jeff A. Bilmes, and Carlos Guestrin. Divide-and-conquer learning by anchoring a conical hull. In Advances in Neural Information Processing Systems 27, pages 1242–1250.
2014.