{"title": "When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity", "book": "Advances in Neural Information Processing Systems", "page_first": 1986, "page_last": 1994, "abstract": "Overcomplete latent representations have been very popular for unsupervised feature learning in recent years. In this paper, we specify which overcomplete models can be identified given observable moments of a certain order. We consider probabilistic admixture or topic models in the overcomplete regime, where the number of latent topics can greatly exceed the size of the observed word vocabulary. While general overcomplete topic models are not identifiable, we establish {\\em generic} identifiability under a constraint, referred to as {\\em topic persistence}. Our sufficient conditions for identifiability involve a novel set of ``higher order'' expansion conditions on the {\\em topic-word matrix} or the {\\em population structure} of the model. This set of higher-order expansion conditions allows for overcomplete models, and requires the existence of a perfect matching from latent topics to higher order observed words. We establish that random structured topic models are identifiable w.h.p. in the overcomplete regime. Our identifiability results allow for general (non-degenerate) distributions for modeling the topic proportions, and thus, we can handle arbitrarily correlated topics in our framework. 
Our identifiability results imply uniqueness of a class of tensor decompositions with structured sparsity which is contained in the class of {\\em Tucker} decompositions, but is more general than the {\\em Candecomp/Parafac} (CP) decomposition.\"", "full_text": "When Are Overcomplete Topic Models Identi\ufb01able?\n\nUniqueness of Tensor Tucker Decompositions\n\nwith Structured Sparsity\n\nAnimashree Anandkumar\n\nUniversity of California\n\nIrvine, CA\n\nDaniel Hsu\n\nColumbia University\n\nNew York, NY\n\na.anandkumar@uci.edu\n\ndjhsu@cs.columbia.edu\n\nMajid Janzamin\n\nUniversity of California\n\nIrvine, CA\n\nSham Kakade\n\nMicrosoft Research\n\nCambridge, MA\n\nmjanzami@uci.edu\n\nskakade@microsoft.com\n\nAbstract\n\nOvercomplete latent representations have been very popular for unsupervised fea-\nture learning in recent years. In this paper, we specify which overcomplete mod-\nels can be identi\ufb01ed given observable moments of a certain order. We consider\nprobabilistic admixture or topic models in the overcomplete regime, where the\nnumber of latent topics can greatly exceed the size of the observed word vocabu-\nlary. While general overcomplete topic models are not identi\ufb01able, we establish\ngeneric identi\ufb01ability under a constraint, referred to as topic persistence. Our suf-\n\ufb01cient conditions for identi\ufb01ability involve a novel set of \u201chigher order\u201d expan-\nsion conditions on the topic-word matrix or the population structure of the model.\nThis set of higher-order expansion conditions allow for overcomplete models, and\nrequire the existence of a perfect matching from latent topics to higher order ob-\nserved words. We establish that random structured topic models are identi\ufb01able\nw.h.p. in the overcomplete regime. Our identi\ufb01ability results allow for general\n(non-degenerate) distributions for modeling the topic proportions, and thus, we\ncan handle arbitrarily correlated topics in our framework. 
Our identi\ufb01ability re-\nsults imply uniqueness of a class of tensor decompositions with structured sparsity\nwhich is contained in the class of Tucker decompositions, but is more general than\nthe Candecomp/Parafac (CP) decomposition.\n\nKeywords: Overcomplete representation, admixture models, generic identi\ufb01ability, tensor decom-\nposition.\n\n1 Introduction\n\nA probabilistic framework for incorporating features posits latent or hidden variables that can pro-\nvide a good explanation to the observed data. Overcomplete probabilistic models can incorporate a\nmuch larger number of latent variables compared to the observed dimensionality. In this paper, we\ncharacterize the conditions under which overcomplete latent variable models can be identi\ufb01ed from\ntheir observed moments.\n\nFor any parametric statistical model, identi\ufb01ability is a fundamental question of whether the model\nparameters can be uniquely recovered given the observed statistics. Identi\ufb01ability is crucial in a\nnumber of applications where the latent variables are the quantities of interest, e.g. inferring diseases\n\n1\n\n\f(latent variables) through symptoms (observations), inferring communities (latent variables) via the\ninteractions among the actors in a social network (observations), and so on. Moreover, identi\ufb01ability\ncan be relevant even in predictive settings, where feature learning is employed for some higher\nlevel task such as classi\ufb01cation. For instance, non-identi\ufb01ability can lead to the presence of non-\nisolated local optima for optimization-based learning methods, and this can affect their convergence\nproperties, e.g. see [1].\n\nIn this paper, we characterize identi\ufb01ability for a popular class of latent variable models, known\nas the admixture or topic models [2, 3]. 
These are hierarchical mixture models, which incorporate\nthe presence of multiple latent states (i.e.\ntopics) in documents consisting of a tuple of observed\nvariables (i.e. words). In this paper, we characterize conditions under which the topic models are\nidenti\ufb01ed through their observed moments in the overcomplete regime. To this end, we introduce\nan additional constraint on the model, referred to as topic persistence. Intuitively, this captures the\n\u201clocality\u201d effect among the observed words, and goes beyond the usual \u201cbag-of-words\u201d or exchange-\nable topic models. Such local dependencies among observations abound in applications such as text,\nimages and speech, and can lead to more faithful representation. In addition, we establish that the\npresence of topic persistence is central to obtaining model identi\ufb01ability in the overcomplete regime,\nand we provide an in-depth analysis of this phenomenon in this paper.\n\n1.1 Summary of Results\n\nIn this paper, we provide conditions for generic1 model identi\ufb01ability of overcomplete topic models\ngiven observable moments of a certain order (i.e., a certain number of words in each document). We\nintroduce a novel constraint, referred to as topic persistence, and analyze its effect on identi\ufb01ability.\nWe establish identi\ufb01ability in the presence of a novel combinatorial object, named as perfect n-gram\nmatching, in the bipartite graph from topics to words (observed variables). Finally, we prove that\nrandom models satisfy these criteria, and are thus identi\ufb01able in the overcomplete regime.\n\nPersistent Topic Model: We \ufb01rst introduce the n-persistent topic model, where the parameter n\ndetermines the so-called persistence level of a common topic in a sequence of n successive words, as\nseen in Figure 1. The n-persistent model reduces to the popular \u201cbag-of-words\u201d model, when n = 1,\nand to the single topic model (i.e. 
only one topic in each document) when n → ∞. Intuitively, topic persistence aids identifiability since we have multiple views of the common hidden topic generating a sequence of successive words. We establish that the bag-of-words model (with n = 1) is too non-informative about the topics to be identifiable in the overcomplete regime. On the other hand, n-persistent overcomplete topic models with n ≥ 2 are generically identifiable, and we provide a set of transparent conditions for identifiability.

Figure 1: Hierarchical structure of the n-persistent topic model. 2rn words (views) are shown for some integer r ≥ 1. A single topic yj, j ∈ [2r], is chosen for each n successive views {x(j−1)n+1, . . . , x(j−1)n+n}.

Deterministic Conditions for Identifiability: Our sufficient conditions for identifiability are in the form of expansion conditions from the latent topic space to the observed word space. In the overcomplete regime, there are more topics than words, and thus it is impossible to have expansion from topics to words. Instead, we impose a novel expansion constraint from topics to "higher order" words, which allows us to handle overcomplete models. We establish that this condition translates to the presence of a novel combinatorial object, referred to as perfect n-gram matching, on the bipartite graph from topics to words, which encodes the sparsity pattern of the topic-word matrix. Intuitively, this condition implies "diversity" of the word support for different topics, which leads to identifiability. In addition, we present tradeoffs between the topic and word space dimensionality, the topic persistence level, the order of the observed moments at hand, the maximum degree of any topic in the bipartite graph, and the Kruskal rank [4] of the topic-word matrix, for identifiability to hold. We also discuss how an ℓ1-based optimization program can recover the model under additional constraints.

1A model is generically identifiable, if all the parameters in the parameter space are identifiable, almost surely. Refer to Definition 1 for more discussion.

Identifiability of Random Structured Topic Models: We explicitly characterize the regime of identifiability for the random setting, where each topic i is randomly supported on a set of di words, i.e. the bipartite graph is a random graph. For this random model with q topics, p-dimensional word vocabulary, and topic persistence level n, when q = O(p^n) and Θ(log p) ≤ di ≤ Θ(p^(1/n)), for all topics i, the topic-word matrix is identifiable from (2n)-th order observed moments with high probability. Furthermore, we establish that the size condition q = O(p^n) for identifiability is tight.

Implications on Uniqueness of Overcomplete Tucker and CP Tensor Decompositions: We establish that identifiability of an overcomplete topic model is equivalent to uniqueness of the decomposition of the observed moment tensor (of a certain order). Our identifiability results for persistent topic models imply uniqueness of a structured class of tensor decompositions, which is contained in the class of Tucker decompositions, but is more general than the Candecomp/Parafac (CP) decomposition [5]. This sub-class of Tucker decompositions involves structured sparsity and symmetry constraints on the core tensor, and sparsity constraints on the inverse factors of the decomposition.

A detailed overview of techniques and related work is provided in the long version [12].

2 Model

Notations: The set {1, 2, . . . 
, n} is denoted by [n] := {1, 2, . . . , n}. The cardinality of set S is denoted by |S|. For any vector u (or matrix U), the support, denoted by Supp(u), corresponds to the location of its non-zero entries. For a vector u ∈ Rq, Diag(u) ∈ Rq×q is the diagonal matrix with u on its main diagonal. The column space of a matrix A is denoted by Col(A). Operators "⊗" and "⊙" refer to the Kronecker and Khatri-Rao products [6], respectively.

2.1 Persistent topic model

An admixture model specifies a q-dimensional vector of topic proportions h ∈ ∆q−1 := {u ∈ Rq : ui ≥ 0, Σi∈[q] ui = 1} which generates the observed variables xl ∈ Rp through vectors a1, . . . , aq ∈ Rp. This collection of vectors ai, i ∈ [q], is referred to as the population structure or topic-word matrix [7]. For instance, ai represents the conditional distribution of words given topic i. The latent variable h is a q-dimensional random vector h := [h1, . . . , hq]⊤ known as the proportion vector. A prior distribution P(h) over the probability simplex ∆q−1 characterizes the prior joint distribution over the latent variables (topics) hi, i ∈ [q].

The n-persistent topic model has the three-level multi-view hierarchy shown in Figure 1. In this model, a common hidden topic is persistent for a sequence of n words {x(j−1)n+1, . . . , x(j−1)n+n}, j ∈ [2r]. Note that the random observed variables (words) are exchangeable within groups of size n, where n is the persistence level, but are not globally exchangeable.

We now describe a linear representation for the n-persistent topic model, along the lines of [9], but with extensions to incorporate persistence. 
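The two products in the notation above differ sharply in their column dimension, which becomes important later for overcomplete identifiability. A minimal numpy sketch (the `khatri_rao` helper is our own illustration, not part of the paper):

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: column j of A ⊙ B is A[:, j] ⊗ B[:, j]."""
    pA, q = A.shape
    pB, qB = B.shape
    assert q == qB, "Khatri-Rao requires matching column counts"
    return np.einsum('ij,kj->ikj', A, B).reshape(pA * pB, q)

p, q = 4, 6
A = np.random.rand(p, q)

print(np.kron(A, A).shape)     # Kronecker A ⊗ A: (p^2, q^2) = (16, 36)
print(khatri_rao(A, A).shape)  # Khatri-Rao A ⊙ A: (p^2, q)  = (16, 6)
```

The Kronecker product squares both dimensions, while the Khatri-Rao product grows only the rows and keeps one column per column of A.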
Each random variable yj, j ∈ [2r], is a discrete-valued q-dimensional random variable encoded by the basis vectors ei, i ∈ [q], where ei is the i-th basis vector in Rq with the i-th entry equal to 1 and all the others equal to zero. When yj = ei ∈ Rq, then the topic of the j-th group of words is i. Given the proportion vector h ∈ Rq, topics yj, j ∈ [2r], are independently drawn according to the conditional expectation E[yj | h] = h, j ∈ [2r], or equivalently Pr[yj = ei | h] = hi, j ∈ [2r], i ∈ [q].

Each observed variable xl, for l ∈ [2rn], is a discrete-valued p-dimensional random variable (word), where p is the size of the vocabulary. Again, we assume that the variables xl are encoded by the basis vectors ek, k ∈ [p], such that xl = ek ∈ Rp when the l-th word in the document is k. Given the corresponding topic yj, j ∈ [2r], words xl, l ∈ [2rn], are independently drawn according to the conditional expectation

E[x(j−1)n+k | yj = ei] = ai, i ∈ [q], j ∈ [2r], k ∈ [n],

where vectors ai ∈ Rp, i ∈ [q], are the conditional probability distribution vectors. 
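As an illustration of this generative process, here is a small sampler, our own sketch with hypothetical names: each topic draw follows Pr[yj = ei | h] = hi, and the same topic persists over the next n words.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_document(A, h, n, r):
    """Draw 2rn words from the n-persistent topic model.

    A : p x q topic-word matrix, column i is the word distribution a_i.
    h : topic proportion vector on the simplex.
    n : persistence level; 2r : number of topic draws.
    """
    p, q = A.shape
    words = []
    for _ in range(2 * r):                  # topics y_1, ..., y_2r
        i = rng.choice(q, p=h)              # Pr[y_j = e_i | h] = h_i
        # the common topic i generates the next n successive words
        words.extend(rng.choice(p, size=n, p=A[:, i]))
    return np.array(words)

A = rng.dirichlet(np.ones(5), size=3).T     # p = 5 words, q = 3 topics
h = np.array([0.2, 0.3, 0.5])
doc = sample_document(A, h, n=2, r=1)       # 2 * 1 * 2 = 4 words
print(doc)
```

Within each group of n words the draws are exchangeable, but the document is not globally exchangeable, matching the description above.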
The matrix A = [a1|a2| · · · |aq] ∈ Rp×q collecting these vectors is called the population structure or topic-word matrix.

The (2rn)-th order moment of observed variables xl, l ∈ [2rn], for some integer r ≥ 1, is defined as (in the matrix form)

M2rn(x) := E[(x1 ⊗ x2 ⊗ · · · ⊗ xrn)(xrn+1 ⊗ xrn+2 ⊗ · · · ⊗ x2rn)⊤] ∈ R^(p^rn × p^rn).   (1)

For the n-persistent topic model with 2rn observations (words) xl, l ∈ [2rn], the corresponding moment is denoted by M(n)2rn(x).

In this paper, we consider the problem of identifiability when exact moments are available. Given M(n)2rn(x), what are the sufficient conditions under which the population structure A = [a1|a2| · · · |aq] ∈ Rp×q is identifiable? This is answered in Section 3.

3 Sufficient Conditions for Generic Identifiability

In this section, the identifiability result for the n-persistent topic model with access to the (2n)-th order observed moment is provided. First, sufficient deterministic conditions on the population structure A are provided for identifiability in Theorem 1. Next, the deterministic analysis is specialized to a random structured model in Theorem 2.

We now make the notion of identifiability precise. As defined in the literature, (strict) identifiability means that the population structure A can be uniquely recovered up to permutation and scaling for all A ∈ Rp×q. Instead, we consider a more relaxed notion of identifiability, known as generic identifiability.

Definition 1 (Generic identifiability). We refer to a matrix A ∈ Rp×q as generic, with a fixed sparsity pattern, when the nonzero entries of A are drawn from a distribution which is absolutely continuous with respect to the Lebesgue measure.2 
For a given sparsity pattern, the class of population structure matrices is said to be generically identifiable [10], if all the non-identifiable matrices form a set of Lebesgue measure zero.

The (2r)-th order moment of hidden variables h ∈ Rq, denoted by M2r(h), is defined as

M2r(h) := E[(h ⊗ · · · ⊗ h)(h ⊗ · · · ⊗ h)⊤] ∈ R^(q^r × q^r),   (2)

where h appears r times in each Kronecker product.

Condition 1 (Non-degeneracy). The (2r)-th order moment of hidden variables h ∈ Rq, defined in equation (2), is full rank (non-degeneracy of hidden nodes).

Note that there is no hope of distinguishing distinct hidden nodes without this non-degeneracy assumption. We do not impose any other assumption on the hidden variables and can incorporate arbitrarily correlated topics.

Furthermore, we can only hope to identify the population structure A up to scaling and permutation. Therefore, we identify A up to a canonical form, defined as:

Definition 2 (Canonical form). Population structure A is said to be in canonical form if all of its columns have unit norm.

3.1 Deterministic Conditions for Generic Identifiability

Before providing the main result, a generalized notion of (perfect) matching for bipartite graphs is defined. We subsequently impose these conditions on the bipartite graph from topics to words, which encodes the sparsity pattern of the population structure A.

2As an equivalent definition, if the non-zero entries of an arbitrary sparse matrix are independently perturbed with noise drawn from a continuous distribution to generate A, then A is called generic.

Generalized matching for bipartite graphs: A bipartite graph with two disjoint vertex sets Y and X and an edge set E between them is denoted by G(Y, X; E). 
Given the bi-adjacency matrix A, the notation G(Y, X; A) is also used to denote a bipartite graph. Here, the rows and columns of matrix A ∈ R|X|×|Y| are respectively indexed by the X and Y vertex sets. Furthermore, for any subset S ⊆ Y, the set of neighbors of vertices in S with respect to A is denoted by NA(S).

Definition 3 ((Perfect) n-gram matching). An n-gram matching M for a bipartite graph G(Y, X; E) is a subset of edges M ⊆ E which satisfies the following conditions. First, for any j ∈ Y, we have |NM(j)| ≤ n. Second, for any j1, j2 ∈ Y, j1 ≠ j2, we have min{|NM(j1)|, |NM(j2)|} > |NM(j1) ∩ NM(j2)|.

A perfect n-gram matching or Y-saturating n-gram matching for the bipartite graph G(Y, X; E) is an n-gram matching M in which each vertex in Y is the end-point of exactly n edges in M.

In words, in an n-gram matching M, each vertex j ∈ Y is the end-point of at most n edges in M, and for any pair of distinct vertices j1, j2 ∈ Y, there exists at least one non-common neighbor in set X for each of them.

Note that a 1-gram matching is the same as a regular matching for bipartite graphs.

Remark 1 (Necessary size bound). Consider a bipartite graph G(Y, X; E) with |Y| = q and |X| = p which has a perfect n-gram matching. Note that there are (p choose n) n-combinations on the X side, and each combination can have at most one neighbor (a node in Y which is connected to all nodes in the combination) through the matching; therefore we necessarily have q ≤ (p choose n).

Identifiability conditions based on the existence of a perfect n-gram matching in the topic-word graph: Now, we are ready to propose the identifiability conditions and result.

Condition 2 (Perfect n-gram matching on A). 
The bipartite graph G(Vh, Vo; A) between hidden\nand observed variables, has a perfect n-gram matching.\n\nThe above condition implies that the sparsity pattern of matrix A is appropriately scattered in the\nmapping from hidden to observed variables to be identi\ufb01able. Intuitively, it means that every hidden\nnode can be distinguished from another hidden node by its unique set of neighbors under the corre-\nsponding n-gram matching.\nFurthermore, condition 2 is the key to be able to propose identi\ufb01ability in the overcomplete regime.\nAs stated in the size bound in Remark 1, for n \u2265 2, the number of hidden variables can be more\nthan the number of observed variables and we can still have perfect n-gram matching.\nDe\ufb01nition 4 (Kruskal rank, [11]). The Kruskal rank or the krank of matrix A is de\ufb01ned as the\nmaximum number k such that every subset of k columns of A is linearly independent.\nCondition 3 (Krank condition on A). The Kruskal rank of matrix A satis\ufb01es the bound krank(A) \u2265\ndmax(A)n, where dmax(A) is the maximum node degree of any column of A.\n\nIn the overcomplete regime, it is not possible for A to be full column rank and krank(A) < |Vh| = q.\nHowever, note that a large enough krank ensures that appropriate sized subsets of columns of A are\nlinearly independent. For instance, when krank(A) > 1, any two columns cannot be collinear and\nthe above condition rules out the collinear case for identi\ufb01ability. In the above condition, we see\nthat a larger krank can incorporate denser connections between topics and words.\n\nThe main identi\ufb01ability result under a \ufb01xed graph structure is stated in the following theorem for\nn \u2265 2, where n is the topic persistence level.\nTheorem 1 (Generic identi\ufb01ability under deterministic topic-word graph structure). Let M (n)\n2rn(x)\nin equation (1) be the (2rn)-th order observed moment of the n-persistent topic model, for some\ninteger r \u2265 1. 
If the model satisfies conditions 1, 2 and 3, then, for any n ≥ 2, all the columns of population structure A are generically identifiable from M(n)2rn(x). Furthermore, the (2r)-th order moment of the hidden variables, denoted by M2r(h), is also generically identifiable.

The theorem is proved in Appendix A of the long version [12]. It is seen that the population structure A is identifiable, given any observed moment of order at least 2n. Increasing the order of the observed moment results in identifying higher order moments of the hidden variables.

The above theorem does not cover the case of n = 1. This is the usual bag-of-words admixture model. Identifiability of this model has been studied earlier [13], and we recall it below.

Remark 2 (Bag-of-words admixture model, [13]). Given (2r)-th order observed moments with r ≥ 1, the structure of the popular bag-of-words admixture model and the (2r)-th order moment of hidden variables are identifiable, when A is full column rank and the following expansion condition holds [13]:

|NA(S)| ≥ |S| + dmax(A), ∀S ⊆ Vh, |S| ≥ 2.   (3)

Our result for n ≥ 2 in Theorem 1 provides identifiability in the overcomplete regime with the weaker matching condition 2 and krank condition 3. The matching condition 2 is weaker than the above expansion condition, which is based on the perfect matching and hence does not allow overcomplete models. Furthermore, the above result for the bag-of-words admixture model requires full column rank of A, which is more stringent than our krank condition 3.

Remark 3 (Recovery using ℓ1 optimization). It turns out that our conditions for identifiability imply that the columns of the n-gram matrix3 A⊙n are the sparsest vectors in Col(M(n)2n(x)) having a tensor rank of one. See Appendix A in the long version [12]. This implies recovery of the columns of A through exhaustive search, which is not efficient. Efficient ℓ1-based recovery algorithms have been analyzed in [13, 14] for the undercomplete case (n = 1). They can be employed here for recovery from higher order moments as well. Exploiting additional structure present in A⊙n for n > 1, such as the rank-1 test devices proposed in [15], is an interesting avenue for future investigation.

3.2 Analysis Under Random Topic-Word Graph Structures

In this section, we specialize the identifiability result to the random case. This result is based on more transparent conditions on the size and the degree of the random bipartite graph G(Vh, Vo; A). We consider the random model where, in the bipartite graph G(Vh, Vo; A), each node i ∈ Vh is randomly connected to di different nodes in the set Vo. Note that this is a heterogeneous degree model.

Condition 4 (Size condition). The random bipartite graph G(Vh, Vo; A) with |Vh| = q, |Vo| = p, and A ∈ Rp×q, satisfies the size condition q ≤ (cp/n)^n for some constant 0 < c < 1.

This size condition is required to establish that the random bipartite graph has a perfect n-gram matching (and hence satisfies the deterministic condition 2). It is shown that the necessary size constraint q = O(p^n) stated in Remark 1 is achieved in the random case. Thus, the above constraint allows for the overcomplete regime, where q ≫ p for n ≥ 2, and is tight.

Condition 5 (Degree condition). 
In the random bipartite graph G(Vh, Vo; A) with |Vh| = q, |Vo| = p, and A ∈ Rp×q, the degree di of nodes i ∈ Vh satisfies the lower and upper bounds dmin ≥ max{1 + β log p, α log p} for some constants β > (n − 1)/log(1/c), α > max{2n^2(β log(1/c) + 1), 2βn}, and dmax ≤ (cp)^(1/n).

Intuitively, the lower bound on the degree is required to show that the corresponding bipartite graph G(Vh, Vo; A) has a sufficient number of random edges to ensure that it has a perfect n-gram matching with high probability. The upper bound on the degree is mainly required to satisfy the krank condition 3, where dmax(A)^n ≤ krank(A).

It is important to see that, for n ≥ 2, the above condition on the degree covers a range of models from sparse to intermediate regimes, and it is reasonable in a number of applications that each topic does not generate a very large number of words.

Probability rate constants: The probability rate of success in the following random identifiability result is specified by constants β′ > 0 and γ = γ1 + γ2 > 0 as

β′ = −β log c − n + 1,   (4)

γ1 = e^(n−1) ( c/n^(n−1) + e^2/((1 − δ1) n^(β′+1)) ),   (5)

γ2 = c^(n−1) e^2 / (n^n (1 − δ2)),   (6)

where δ1 and δ2 are some constants satisfying e^2 (p/n)^(−β log 1/c) < δ1 < 1 and (c^(n−1) e^2 / n^n) p^(−β′) < δ2 < 1.

3A⊙n := A ⊙ · · · ⊙ A (n times).

Figure 2: Hierarchical structure of (a) the single topic model (infinite-persistent topic model) and (b) the bag-of-words admixture model (1-persistent topic model), shown for 2m words (views).

Theorem 2 (Random identifiability). Let M(n)2rn(x) in equation (1) be the (2rn)-th order observed moment of the n-persistent topic model for some integer r ≥ 1. If the model with random population structure A satisfies conditions 1, 4 and 5, then whp (with probability at least 1 − γp^(−β′) for constants β′ > 0 and γ > 0, specified in (4)-(6)), for any n ≥ 2, all the columns of population structure A are identifiable from M(n)2rn(x). Furthermore, the (2r)-th order moment of hidden variables, denoted by M2r(h), is also identifiable, whp.

The theorem is proved in Appendix B of the long version [12]. Similar to the deterministic analysis, it is seen that the population structure A is identifiable given any observed moment of order at least 2n. Increasing the order of the observed moment results in identifying higher order moments of the hidden variables.

The above identifiability theorem only covers n ≥ 2; the n = 1 case is addressed in the following remark.

Remark 4 (Bag-of-words admixture model). The identifiability result for the random bag-of-words admixture model is comparable to the result in [14], which considers exact recovery of sparsely-used dictionaries. They assume that Y = DX is given for some unknown arbitrary dictionary D ∈ Rq×q and unknown random sparse coefficient matrix X ∈ Rq×p. They establish that if D ∈ Rq×q is full rank and the random sparse coefficient matrix X ∈ Rq×p follows the Bernoulli-subgaussian model with size constraint p > Cq log q and degree constraint O(log q) < E[d] < O(q log q), then the model is identifiable, whp. 
Comparing the size and degree constraints, our identifiability result for n ≥ 2 requires a more stringent upper bound on the degree (d = O(p^(1/n))), but a more relaxed condition on the size (q = O(p^n)), which allows for identifiability in the overcomplete regime.

Remark 5 (The size condition is tight). The size bound q = O(p^n) in the above theorem achieves the necessary condition q ≤ (p choose n) = O(p^n) (see Remark 1), and is therefore tight. The sufficiency is argued in Theorem 3 of the long version [12], where we show that the matching condition 2 holds under the above size and degree conditions 4 and 5.

4 Why Does Persistence Help in the Identifiability of Overcomplete Models?

In this section, we provide the moment characterization of the 2-persistent topic model. Then, we provide a discussion and comparison on why persistence helps in providing identifiability in the overcomplete regime. The moment characterization and detailed tensor analysis are provided in the long version [12].

The single topic model (n → ∞) is shown in Figure 2a and the bag-of-words admixture model (n = 1) is shown in Figure 2b. In order to have a fair comparison among these different models, we fix the number of observed variables to 4 (the case m = 2) and vary the persistence level. Consider three different models: the 2-persistent topic model, the single topic model, and the bag-of-words admixture model (1-persistent topic model). 
From the moment characterization results provided in the long version [12], we have the following moment forms for each of these models.

For the 2-persistent topic model with 4 words (r = 1, n = 2), we have

M(2)4(x) = (A ⊙ A) E[hh⊤] (A ⊙ A)⊤.   (7)

For the single topic model with 4 words, we have

M(∞)4(x) = (A ⊙ A) Diag(E[h]) (A ⊙ A)⊤.   (8)

And for the bag-of-words admixture model with 4 words (r = 2, n = 1), we have

M(1)4(x) = (A ⊗ A) E[(h ⊗ h)(h ⊗ h)⊤] (A ⊗ A)⊤.   (9)

Note that for the single topic model in (8), the Khatri-Rao product matrix A ⊙ A ∈ R^(p^2 × q) has the same number of columns (i.e. the latent dimensionality) as the original matrix A, while the number of rows (i.e. the observed dimensionality) is increased. Thus, the Khatri-Rao product "expands" the effect of hidden variables to higher order observed variables, which is the key towards identifying overcomplete models. In other words, the original overcomplete representation becomes determined due to the "expansion effect" of the Khatri-Rao product structure of the higher order observed moments.

On the other hand, in the bag-of-words admixture model in (9), this interesting "expansion property" does not occur, and we have the Kronecker product A ⊗ A ∈ R^(p^2 × q^2) in place of the Khatri-Rao products. The Kronecker product operation increases both the number of columns (i.e. the latent dimensionality) and the number of rows (i.e. the observed dimensionality), which implies that higher order moments do not help in identifying overcomplete models.

Finally, contrasting equation (7) with (8) and (9), we find that the 2-persistent model retains the desirable property of possessing Khatri-Rao products, while being more general than the form for the single topic model in (8). 
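The expansion effect contrasted in equations (7)-(9) can be checked numerically: for an overcomplete A with q > p, the Khatri-Rao columns ai ⊗ ai generically remain linearly independent, even though A itself can have rank at most p. A minimal sketch (the `khatri_rao` helper is our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def khatri_rao(A, B):
    # Column-wise Kronecker product: (A ⊙ B)[:, j] = A[:, j] ⊗ B[:, j]
    return np.einsum('ij,kj->ikj', A, B).reshape(A.shape[0] * B.shape[0], A.shape[1])

p, q = 3, 5                    # overcomplete: q > p
A = rng.random((p, q))

print(np.linalg.matrix_rank(A))                  # capped at p = 3
print(np.linalg.matrix_rank(khatri_rao(A, A)))   # 5: A ⊙ A is full column rank generically
```

By contrast, the Kronecker product A ⊗ A has q^2 = 25 columns in R^9 and so can never be full column rank here, mirroring the non-identifiability of the bag-of-words model.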
This key property enables us to establish identifiability of topic models with finite persistence levels.

Remark 6 (Relationship to tensor decompositions). In the long version of this work [12], we establish that the tensor representation of our model is a special case of the Tucker representation, but more general than the symmetric CP representation [6]. Therefore, our identifiability results also imply uniqueness of a class of tensor decompositions with structured sparsity which is contained in the class of Tucker decompositions, but is more general than the Candecomp/Parafac (CP) decomposition.

5 Proof sketch

The moment of the n-persistent topic model with 2n words is derived as

    M_2n^(n)(x) = (A^⊙n) E[hh^⊤] (A^⊙n)^⊤;

see [12]. When the hidden variables are non-degenerate and A^⊙n has full column rank, we have Col(M_2n^(n)(x)) = Col(A^⊙n). Therefore, the problem of recovering A from M_2n^(n)(x) reduces to finding A^⊙n in Col(A^⊙n). Then, under the expansion condition^4

    |N_(A_Rest.^⊙n)(S)| ≥ |S| + d_max(A_Rest.^⊙n),    ∀ S ⊆ V_h, |S| > krank(A),

we establish that the matrix A is identifiable from Col(A^⊙n). This identifiability claim is proved by showing that the columns of A^⊙n are the sparsest and rank-1 (in the tensor form) vectors in Col(A^⊙n) under the sufficient expansion and genericity conditions.

Then, it is established that sufficient combinatorial conditions on the matrix A (conditions 2 and 3) ensure that the expansion and rank conditions on A^⊙n are satisfied.
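As a concrete (hypothetical) illustration of an expansion condition of this kind, the sketch below brute-forces the check |N(S)| ≥ |S| + d_max over column subsets of a small binary support matrix. The helper `expansion_holds`, the example matrix, and the threshold `smin` are all ours, chosen only to illustrate the combinatorial structure, not taken from the paper.

```python
import numpy as np
from itertools import combinations

def expansion_holds(A, smin):
    """Brute-force check of an expansion condition of the form
    |N(S)| >= |S| + d_max for every column subset S with |S| > smin,
    where N(S) is the set of rows touched by the columns in S and
    d_max is the maximum column degree of the support of A."""
    support = (A != 0)
    d_max = int(support.sum(axis=0).max())
    q = support.shape[1]
    for size in range(smin + 1, q + 1):
        for S in combinations(range(q), size):
            if support[:, list(S)].any(axis=1).sum() < size + d_max:
                return False
    return True

# Hypothetical 8 x 4 support matrix with disjoint column supports (d_max = 2):
# column k is supported on rows {2k, 2k+1}, so |N(S)| = 2|S| >= |S| + 2
# whenever |S| >= 2, and the condition holds with smin = 1.
A = np.zeros((8, 4))
for k in range(4):
    A[2 * k, k] = A[2 * k + 1, k] = 1.0
print(expansion_holds(A, smin=1))  # True
```

The exhaustive subset enumeration is exponential in q and is only viable for toy examples; the point is to show what the neighborhood-expansion requirement asks of the support graph.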
This is shown by proving that the existence of a perfect n-gram matching on A results in the existence of a perfect matching on A^⊙n. For further discussion on proof techniques, see the long version [12].

Acknowledgments

The authors acknowledge useful discussions with Sina Jafarpour, Adel Javanmard, Alex Dimakis, Moses Charikar, Sanjeev Arora, Ankur Moitra and Kamalika Chaudhuri. A. Anandkumar is supported in part by a Microsoft Faculty Fellowship, NSF Career award CCF-1254106, NSF Award CCF-1219234, ARO Award W911NF-12-1-0404, and ARO YIP Award W911NF-13-1-0084. M. Janzamin is supported by NSF Award CCF-1219234, ARO Award W911NF-12-1-0404 and ARO YIP Award W911NF-13-1-0084.

^4 A_Rest.^⊙n is the restricted version of the n-gram matrix A^⊙n, in which the redundant rows of A^⊙n are removed.

References

[1] André Uschmajew. Local convergence of the alternating least squares algorithm for canonical tensor approximation. SIAM Journal on Matrix Analysis and Applications, 33(2):639–652, 2012.

[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[3] J. K. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155:945–959, 2000.

[4] J. B. Kruskal. More factors than subjects, tests and treatments: an indeterminacy theorem for canonical decomposition and individual differences scaling. Psychometrika, 41(3):281–293, 1976.

[5] Tamara Kolda and Brett Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[6] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, Maryland, 2012.

[7] XuanLong Nguyen. Posterior contraction of the population polytope in finite admixture models. arXiv preprint arXiv:1206.0068, 2012.

[8] T. 
Austin et al. On exchangeable random variables and the statistics of large graphs and hypergraphs. Probability Surveys, 5:80–145, 2008.

[9] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor methods for learning latent variable models. Under review, Journal of Machine Learning Research. Available at arXiv:1210.7559, Oct. 2012.

[10] Elizabeth S. Allman, John A. Rhodes, and Amelia Taylor. A semialgebraic description of the general Markov model on phylogenetic trees. arXiv preprint arXiv:1212.1200, Dec. 2012.

[11] J. B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95–138, 1977.

[12] A. Anandkumar, D. Hsu, M. Janzamin, and S. Kakade. When are overcomplete topic models identifiable? Uniqueness of tensor Tucker decompositions with structured sparsity. Preprint available on arXiv:1308.2853, Aug. 2013.

[13] A. Anandkumar, D. Hsu, A. Javanmard, and S. M. Kakade. Learning linear Bayesian networks with latent variables. arXiv e-prints, September 2012.

[14] Daniel A. Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. arXiv preprint, abs/1206.5882, 2012.

[15] L. De Lathauwer, J. Castaing, and J.-F. Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, 55:2965–2973, June 2007.
", "award": [], "sourceid": 1009, "authors": [{"given_name": "Anima", "family_name": "Anandkumar", "institution": "UC Irvine"}, {"given_name": "Daniel", "family_name": "Hsu", "institution": "Columbia University"}, {"given_name": "Majid", "family_name": "Janzamin", "institution": "UC Irvine"}, {"given_name": "Sham", "family_name": "Kakade", "institution": "Microsoft Research"}]}