{"title": "Rethinking LDA: Moment Matching for Discrete ICA", "book": "Advances in Neural Information Processing Systems", "page_first": 514, "page_last": 522, "abstract": "We consider moment matching techniques for estimation in Latent Dirichlet Allocation (LDA). By drawing explicit links between LDA and discrete versions of independent component analysis (ICA), we first derive a new set of cumulant-based tensors, with an improved sample complexity. Moreover, we reuse standard ICA techniques such as joint diagonalization of tensors to improve over existing methods based on the tensor power method. In an extensive set of experiments on both synthetic and real datasets, we show that our new combination of tensors and orthogonal joint diagonalization techniques outperforms existing moment matching methods.", "full_text": "Rethinking LDA: Moment Matching for Discrete ICA\n\nAnastasia Podosinnikova\n\nINRIA - École normale supérieure Paris\n\nFrancis Bach\n\nSimon Lacoste-Julien\n\nAbstract\n\nWe consider moment matching techniques for estimation in latent Dirichlet allocation (LDA). By drawing explicit links between LDA and discrete versions of independent component analysis (ICA), we first derive a new set of cumulant-based tensors, with an improved sample complexity. Moreover, we reuse standard ICA techniques such as joint diagonalization of tensors to improve over existing methods based on the tensor power method. 
In an extensive set of experiments on both synthetic and real datasets, we show that our new combination of tensors and orthogonal joint diagonalization techniques outperforms existing moment matching methods.\n\n1 Introduction\n\nTopic models have emerged as flexible and important tools for modeling text corpora. While early work focused on graphical-model approximate inference techniques such as variational inference [1] or Gibbs sampling [2], tensor-based moment matching techniques have recently emerged as strong competitors due to their computational speed and theoretical guarantees [3, 4]. In this paper, we draw explicit links with the independent component analysis (ICA) literature (e.g., [5] and references therein) by showing a strong relationship between latent Dirichlet allocation (LDA) [1] and ICA [6, 7, 8]. We can then reuse standard ICA techniques and results, and derive new tensors with better sample complexity and new algorithms based on joint diagonalization.\n\n2 Is LDA discrete PCA or discrete ICA?\n\nNotation. Following the text modeling terminology, we define a corpus X = {x_1, ..., x_N} as a collection of N documents. Each document is a collection {w_{n1}, ..., w_{nL_n}} of L_n tokens. It is convenient to represent the ℓ-th token of the n-th document as a 1-of-M encoding with an indicator vector w_{nℓ} ∈ {0, 1}^M with only one non-zero entry, where M is the vocabulary size, and each document as the count vector x_n := Σ_ℓ w_{nℓ} ∈ R^M. In this representation, the length L_n of the n-th document is L_n = Σ_m x_{nm}. We will always use index k ∈ {1, ..., K} to refer to topics, index n ∈ {1, ..., N} to refer to documents, index m ∈ {1, ..., M} to refer to words from the vocabulary, and index ℓ ∈ {1, ..., L_n} to refer to tokens of the n-th document. 
The plate diagrams of the models from this section are presented in Appendix A.\n\nLatent Dirichlet allocation [1] is a generative probabilistic model for discrete data such as text corpora. According to this model, the n-th document is modeled as an admixture over the vocabulary of M words with K latent topics. Specifically, the latent variable θ_n, which is sampled from the Dirichlet distribution, represents the topic mixture proportion over K topics for the n-th document. Given θ_n, the topic choice z_{nℓ}|θ_n for the ℓ-th token is sampled from the multinomial distribution with the probability vector θ_n. The token w_{nℓ}|z_{nℓ}, θ_n is then sampled from the multinomial distribution with the probability vector d_{z_{nℓ}}, or d_k if k is the index of the non-zero element in z_{nℓ}. This vector d_k is the k-th topic, that is, a vector of probabilities over the words from the vocabulary subject to the simplex constraint, i.e., d_k ∈ Δ_M, where Δ_M := {d ∈ R^M : d ⪰ 0, Σ_m d_m = 1}.\n\nThis generative process of a document (the index n is omitted for simplicity) can be summarized as\n\nθ ∼ Dirichlet(c),\nz_ℓ|θ ∼ Multinomial(1, θ),\nw_ℓ|z_ℓ, θ ∼ Multinomial(1, d_{z_ℓ}).   (1)\n\nOne can think of the latent variables z_ℓ as auxiliary variables which were introduced for convenience of inference, but can in fact be marginalized out [9], which leads to the following model\n\nθ ∼ Dirichlet(c),\nx|θ ∼ Multinomial(L, Dθ),   LDA model (2)\n\nwhere D ∈ R^{M×K} is the topic matrix with the k-th column equal to the k-th topic d_k, and c ∈ R^K_{++} is the vector of parameters for the Dirichlet distribution. While a document is represented as a set of tokens w_ℓ in the formulation (1), the formulation (2) instead compactly represents a document as the count vector x. 
Although the two representations are equivalent, we focus on the second one in this paper and therefore refer to it as the LDA model.\n\nImportantly, the LDA model does not model the length of documents. Indeed, although the original paper [1] proposes to model the document length as L|λ ∼ Poisson(λ), this is never used in practice and, in particular, the parameter λ is not learned. Therefore, in the way that the LDA model is typically used, it does not provide a complete generative process of a document as there is no rule to sample L|λ. In this paper, this fact is important, as we need to model the document length in order to make the link with discrete ICA.\n\nDiscrete PCA. The LDA model (2) can be seen as a discretization of principal component analysis (PCA) via replacement of the normal likelihood with the multinomial one and adjusting the prior [9] in the following probabilistic PCA model [10, 11]: θ ∼ Normal(0, I_K) and x|θ ∼ Normal(Dθ, σ² I_M), where D ∈ R^{M×K} is a transformation matrix and σ is a parameter.\n\nDiscrete ICA (DICA). Interestingly, a small extension of the LDA model allows its interpretation as a discrete independent component analysis model. The extension naturally arises when the document length for the LDA model is modeled as a random variable from the gamma-Poisson mixture (which is equivalent to a negative binomial random variable), i.e., L|λ ∼ Poisson(λ) and λ ∼ Gamma(c_0, b), where c_0 := Σ_k c_k is the shape parameter and b > 0 is the rate parameter. The LDA model (2) with such a document length is equivalent (see Appendix B.1) to\n\nα_k ∼ Gamma(c_k, b),\nx_m|α ∼ Poisson([Dα]_m),   GP model (3)\n\nwhere all α_1, α_2, . . . 
, α_K are mutually independent, the parameters c_k coincide with the ones of the LDA model in (2), and the free parameter b can be seen (see Appendix B.2) as a scaling parameter for the document length when c_0 is already prescribed.\n\nThis model was introduced by Canny [12] and later named a discrete ICA model [13]. It is more natural, however, to name model (3) the gamma-Poisson (GP) model and the model\n\nα_1, ..., α_K ∼ mutually independent,\nx_m|α ∼ Poisson([Dα]_m)   DICA model (4)\n\nthe discrete ICA (DICA) model. The only difference between (4) and the standard ICA model [6, 7, 8] (without additive noise) is the presence of the Poisson noise, which enforces discrete, instead of continuous, values of x_m. Note also that (a) the discrete ICA model is a semi-parametric model that can adapt to any distribution on the topic intensities α_k and that (b) the GP model (3) is a particular case of both the LDA model (2) and the DICA model (4).\n\nThanks to this close connection between LDA and ICA, we can reuse standard ICA techniques to derive new efficient algorithms for topic modeling.\n\n3 Moment matching for topic modeling\n\nThe method of moments estimates latent parameters of a probabilistic model by matching theoretical expressions of its moments with their sample estimates. Recently [3, 4], the method of moments was applied to different latent variable models including LDA, resulting in computationally fast learning algorithms with theoretical guarantees. For LDA, the authors of [3, 4] (a) construct LDA moments with a particular diagonal structure and (b) develop algorithms for estimating the parameters of the model by exploiting this diagonal structure. In this paper, we introduce novel GP/DICA cumulants with a structure similar to that of the LDA moments. This structure allows us to reapply the algorithms of [3, 4] for the estimation of the model parameters, with the same theoretical guarantees. 
We also consider another algorithm applicable to both the LDA moments and the GP/DICA cumulants.\n\n3.1 Cumulants of the GP and DICA models\n\nIn this section, we derive and analyze the novel cumulants of the DICA model. As the GP model is a particular case of the DICA model, all results of this section extend to the GP model.\n\nThe first three cumulant tensors for the random vector x can be defined as follows\n\ncum(x) := E(x),   (5)\ncum(x, x) := cov(x, x) = E[(x − E(x))(x − E(x))ᵀ],   (6)\ncum(x, x, x) := E[(x − E(x)) ⊗ (x − E(x)) ⊗ (x − E(x))],   (7)\n\nwhere ⊗ denotes the tensor product (see some properties of cumulants in Appendix C.1). The essential property of the cumulants (which does not hold for moments) that we use in this paper is that the cumulant tensor for a random vector with independent components is diagonal.\n\nLet y = Dα; then for the Poisson random variable x_m|y_m ∼ Poisson(y_m), the expectation is E(x_m|y_m) = y_m. Hence, by the law of total expectation and the linearity of expectation, the expectation in (5) has the following form\n\nE(x) = E(E(x|y)) = E(y) = D E(α).   (8)\n\nFurther, the variance of the Poisson random variable x_m is var(x_m|y_m) = y_m and, as x_1, x_2, ..., x_M are conditionally independent given y, their covariance matrix is diagonal, i.e., cov(x, x|y) = diag(y). Therefore, by the law of total covariance, the covariance in (6) has the form\n\ncov(x, x) = E[cov(x, x|y)] + cov[E(x|y), E(x|y)]\n = diag[E(y)] + cov(y, y) = diag[E(x)] + D cov(α, α) Dᵀ,   (9)\n\nwhere the last equality follows by the multilinearity property of cumulants (see Appendix C.1). Moving the first term from the RHS of (9) to the LHS, we define\n\nS := cov(x, x) − diag[E(x)].   DICA S-cum. (10)\n\nFrom (9) and by the independence of α_1, . . . 
, α_K (see Appendix C.3), S has the following diagonal structure\n\nS = Σ_k var(α_k) d_k d_kᵀ = D diag[var(α)] Dᵀ.   (11)\n\nBy analogy with the second order case, using the law of total cumulance, the multilinearity property of cumulants, and the independence of α_1, ..., α_K, we derive in Appendix C.2 expression (24), similar to (9), for the third cumulant (7). Moving the terms in this expression, we define a tensor T with the following element\n\n[T]_{m1 m2 m3} := cum(x_{m1}, x_{m2}, x_{m3}) + 2 δ(m1, m2, m3) E(x_{m1})\n − δ(m2, m3) cov(x_{m1}, x_{m2}) − δ(m1, m3) cov(x_{m1}, x_{m2}) − δ(m1, m2) cov(x_{m1}, x_{m3}),   DICA T-cum. (12)\n\nwhere δ is the Kronecker delta. By analogy with (11) (Appendix C.3), the tensor T has the diagonal structure\n\nT = Σ_k cum(α_k, α_k, α_k) d_k ⊗ d_k ⊗ d_k.   (13)\n\nIn Appendix E.1, we recall (in our notation) the matrix S (39) and the tensor T (40) for the LDA model [3], which are analogues of the matrix S (10) and the tensor T (12) for the GP/DICA models. Slightly abusing terminology, we refer to the matrix S (39) and the tensor T (40) as the LDA moments and to the matrix S (10) and the tensor T (12) as the GP/DICA cumulants. The diagonal structure (41) & (42) of the LDA moments is similar to the diagonal structure (11) & (13) of the GP/DICA cumulants, though arising through a slightly different argument, as discussed at the end of Appendix E.1. Importantly, due to this similarity, the algorithmic frameworks for both the GP/DICA cumulants and the LDA moments coincide.\n\nThe following sample complexity results apply to the sample estimates of the GP cumulants:¹\n\nProposition 3.1. Under the GP model, the expected error for the sample estimator Ŝ (29) for the GP cumulant S (10) is:\n\nE[‖Ŝ − S‖_F] ≤ sqrt(E[‖Ŝ − S‖_F²]) ≤ O((1/√N) max[Δ L̄², c̄_0 L̄]),   (14)\n\nwhere Δ := max_k ‖d_k‖_2², c̄_0 := min(1, c_0) and L̄ := E(L).\n\nA high probability bound could be derived using concentration inequalities for Poisson random variables [14]; but the expectation already gives the right order of magnitude for the error (for example via Markov's inequality). The expression (29) for an unbiased finite sample estimate Ŝ of S and the expression (30) for an unbiased finite sample estimate T̂ of T are defined² in Appendix C.4. A sketch of a proof for Proposition 3.1 can be found in Appendix D.\n\nBy following a similar analysis as in [15], we can rephrase the topic recovery error in terms of the error on the GP cumulant. Importantly, the whitening transformation (introduced in Section 4) divides the error on S (14) by L̄², which is the scale of S (see Appendix D.5 for details). This means that the contribution from Ŝ to the recovery error will scale as O((1/√N) max{Δ, c̄_0/L̄}), where both Δ and c̄_0/L̄ are smaller than 1 and can be very small. We do not present the exact expression for the expected squared error for the estimator of T, but due to a similar structure in the derivation, we expect the analogous bound of E[‖T̂ − T‖_F] ≤ (1/√N) max{Δ^{3/2} L̄³, c̄_0^{3/2} L̄^{3/2}}.\n\nCurrent sample complexity results of the LDA moments [3] can be summarized as O(1/√N). However, the proof (which can be found in the supplementary material [15]) analyzes only the case when finite sample estimates of the LDA moments are constructed from one triple per document, i.e., w_1 ⊗ w_2 ⊗ w_3 only, and not from the U-statistics that average multiple (dependent) triples per document as in the practical expressions (43) and (44). Moreover, one has to be careful when comparing upper bounds. Nevertheless, comparing the bound (14) with the current theoretical results for the LDA moments, we see that the GP/DICA cumulants sample complexity contains the ℓ2-norm of the columns of the topic matrix D in the numerator, as opposed to the O(1) coefficient for the LDA moments. This norm can be significantly smaller than 1 for vectors in the simplex (e.g., Δ = O(1/‖d_k‖_0) for sparse topics). This suggests that the GP/DICA cumulants may have better finite sample convergence properties than the LDA moments, and our experimental results in Section 5.2 are indeed consistent with this statement.\n\nThe GP/DICA cumulants have a somewhat more intuitive derivation than the LDA moments as they are expressed via the count vectors x (which are the sufficient statistics for the model) and not the tokens w_ℓ. Note also that the construction of the LDA moments depends on the unknown parameter c_0. Given that we are in an unsupervised setting and that moreover the evaluation of LDA is a difficult task [16], setting this parameter is non-trivial. In Appendix G.4, we observe experimentally that the LDA moments are somewhat sensitive to the choice of c_0.\n\n4 Diagonalization algorithms\n\nHow is the diagonal structure (11) of S and (13) of T going to be helpful for the estimation of the model parameters? This question has already been thoroughly investigated in the signal processing (see, e.g., [17, 18, 19, 20, 21, 5] and references therein) and machine learning (see [3, 4] and references therein) literature. We review the approach in this section. 
Due to the similar diagonal structure, the algorithms of this section apply to both the LDA moments and the GP/DICA cumulants.\n\nFor simplicity, let us rewrite expressions (11) and (13) for S and T as follows\n\nS = Σ_k s_k d_k d_kᵀ,   T = Σ_k t_k d_k ⊗ d_k ⊗ d_k,   (15)\n\nwhere s_k := var(α_k) and t_k := cum(α_k, α_k, α_k). Introducing the rescaled topics d̃_k := √(s_k) d_k, we can also rewrite S = D̃ D̃ᵀ. Following the same assumption from [3] that the topic vectors are linearly independent (D̃ is full rank), we can compute a whitening matrix W ∈ R^{K×M} of S, i.e., a matrix such that W S Wᵀ = I_K, where I_K is the K-by-K identity matrix (see Appendix F.1 for more details). As a result, the vectors z_k := W d̃_k form an orthonormal set of vectors.\n\n¹ Note that the expected squared error for the DICA cumulants is similar, but the expressions are less compact and, in general, depend on the prior on α_k.\n\n² For completeness, we also present the finite sample estimates Ŝ (43) and T̂ (44) of S (39) and T (40) for the LDA moments (which are consistent with the ones suggested in [4]) in Appendix F.4.\n\nFurther, let us define a projection 𝒯(u) ∈ R^{K×K} of a tensor 𝒯 ∈ R^{K×K×K} onto a vector u ∈ R^K:\n\n𝒯(u)_{k1 k2} := Σ_{k3} 𝒯_{k1 k2 k3} u_{k3}.   (16)\n\nApplying the multilinear transformation (see, e.g., [4] for the definition) with Wᵀ to the tensor T from (15) and projecting the resulting tensor 𝒯 := T(Wᵀ, Wᵀ, Wᵀ) onto some vector u ∈ R^K, we obtain\n\n𝒯(u) = Σ_k t̃_k ⟨z_k, u⟩ z_k z_kᵀ,   (17)\n\nwhere t̃_k := t_k / s_k^{3/2} is due to the rescaling of topics and ⟨·,·⟩ stands for the inner product. As the vectors z_k are orthonormal, the pairs z_k and λ_k := t̃_k ⟨z_k, u⟩ are eigenpairs of the matrix 𝒯(u), which are uniquely defined if the eigenvalues λ_k are all different. If they are unique, we can recover the GP/DICA (as well as LDA) model parameters via d̃_k = W† z_k and t̃_k = λ_k / ⟨z_k, u⟩.\n\nThis procedure was referred to as the spectral algorithm for LDA [3] and the fourth-order³ blind identification algorithm for ICA [17, 18]. Indeed, one can expect that the finite sample estimates Ŝ (29) and T̂ (30) possess approximately the diagonal structure (11) and (13) and, therefore, the reasoning from above can be applied, assuming that the effect of the sampling error is controlled.\n\nThis spectral algorithm, however, is known to be quite unstable in practice (see, e.g., [22]). To overcome this problem, other algorithms were proposed. For ICA, the most notable ones are probably the FastICA algorithm [20] and the JADE algorithm [21]. The FastICA algorithm, with an appropriate choice of a contrast function, estimates the topics iteratively, making use of the orthonormal structure (17), and performs the deflation procedure at every step. The recently introduced tensor power method (TPM) for the LDA model [4] is close to the FastICA algorithm. Alternatively, the JADE algorithm modifies the spectral algorithm by performing multiple projections for (17) and then jointly diagonalizing the resulting matrices with an orthogonal matrix. The spectral algorithm is a special case of this orthogonal joint diagonalization algorithm when only one projection is chosen. Importantly, a fast implementation [23] of the orthogonal joint diagonalization algorithm from [24] was proposed, which is based on closed-form iterative Jacobi updates (see, e.g., [25] for the latter).\n\nIn practice, the orthogonal joint diagonalization (JD) algorithm is more robust than FastICA (see, e.g., [26, p. 30]) or the spectral algorithm. Moreover, although the application of the JD algorithm for the learning of topic models was mentioned in the literature [4, 27], it was never implemented in practice. 
In this paper, we apply the JD algorithm for the diagonalization of the GP/DICA cumulants as well as the LDA moments, which is described in Algorithm 1. Note that the choice of a projection vector v_p ∈ R^M obtained as v_p = Ŵᵀ u_p for some vector u_p ∈ R^K is important and corresponds to the multilinear transformation of T̂ with Ŵᵀ along the third mode. Importantly, in Algorithm 1, the joint diagonalization routine is performed over (P + 1) matrices of size K×K, where the number of topics K is usually not too big. This makes the algorithm computationally fast (see Appendix G.1). The same is true for the spectral algorithm, but not for TPM.\n\nIn Section 5.1, we compare experimentally the performance of the spectral, JD, and TPM algorithms for the estimation of the parameters of the GP/DICA as well as LDA models. We are not aware of any experimental comparison of these algorithms in the LDA context. While we were already working on this manuscript, the JD algorithm was also independently analyzed by [27] in the context of tensor factorization for general latent variable models. However, [27] focused mostly on the comparison of approaches for tensor factorization and their stability properties, with brief experiments using a latent variable model related but not equivalent to LDA for community detection. In contrast, we provide a detailed experimental comparison in the context of LDA in this paper, as well as propose a novel cumulant-based estimator. 
Due to space restrictions, the estimation of the topic matrix D and the (gamma/Dirichlet) parameter c is moved to Appendix F.6.\n\n³ See Appendix C.5 for a discussion on the orders.\n\nAlgorithm 1 Joint diagonalization (JD) algorithm for GP/DICA cumulants (or LDA moments)\n1: Input: X ∈ R^{M×N}, K, P (number of random projections); (and c_0 for LDA moments)\n2: Compute sample estimate Ŝ ∈ R^{M×M} ((29) for GP/DICA / (43) for LDA in Appendix F)\n3: Estimate whitening matrix Ŵ ∈ R^{K×M} of Ŝ (see Appendix F.1)\n   option (a): Choose vectors {u_1, u_2, ..., u_P} ⊆ R^K uniformly at random from the unit ℓ2-sphere and set v_p = Ŵᵀ u_p ∈ R^M for all p = 1, ..., P (P = 1 yields the spectral algorithm)\n   option (b): Choose vectors {u_1, u_2, ..., u_P} ⊆ R^K as the canonical basis e_1, e_2, ..., e_K of R^K and set v_p = Ŵᵀ u_p ∈ R^M for all p = 1, ..., K\n4: For all p, compute B_p = Ŵ T̂(v_p) Ŵᵀ ∈ R^{K×K} ((52) for GP/DICA / (54) for LDA; Appendix F)\n5: Perform orthogonal joint diagonalization of matrices {Ŵ Ŝ Ŵᵀ = I_K, B_p, p = 1, ..., P} (see [24] and [23]) to find an orthogonal matrix V ∈ R^{K×K} and vectors {a_1, a_2, ..., a_P} ⊂ R^K such that\n   V Ŵ Ŝ Ŵᵀ Vᵀ = I_K, and V B_p Vᵀ ≈ diag(a_p), p = 1, ..., P\n6: Estimate joint diagonalization matrix A = V Ŵ and values a_p, p = 1, ..., P\n7: Output: Estimate of D and c as described in Appendix F.6\n\n5 Experiments\n\nIn this section, we compare experimentally (a) the GP/DICA cumulants with the LDA moments and (b) the spectral algorithm [3], the tensor power method [4] (TPM), the joint diagonalization (JD) algorithm from Algorithm 1, and variational inference for LDA [1].\n\nReal data: the Associated Press (AP) dataset, from D. 
Blei's web page,⁴ with N = 2,243 documents and M = 10,473 vocabulary words and the average document length L̂ = 194; the NIPS papers dataset⁵ [28] of 2,483 NIPS papers and 14,036 words, and L̂ = 1,321; the KOS dataset,⁶ from the UCI Repository, with 3,430 documents and 6,906 words, and L̂ = 136.\n\nSemi-synthetic data are constructed by analogy with [29]: (1) the LDA parameters D and c are learned from the real datasets with variational inference and (2) toy data are sampled from a model of interest with the given parameters D and c. This provides the ground truth parameters D and c. For each setting, data are sampled 5 times and the results are averaged. We plot error bars that are the minimum and maximum values. For the AP data, K ∈ {10, 50} topics are learned and, for the NIPS data, K ∈ {10, 90} topics are learned. For larger K, the obtained topic matrix is ill-conditioned, which violates the identifiability condition for topic recovery using moment matching techniques [3]. All the documents with fewer than 3 tokens are resampled.\n\nSampling techniques. All the sampling models have the parameter c, which is set to c = c_0 c̄/‖c̄‖_1, where c̄ is the learned c from the real dataset with variational LDA, and c_0 is a parameter that we can vary. The GP data are sampled from the gamma-Poisson model (3) with b = c_0/L̂ so that the expected document length is L̂ (see Appendix B.2). The LDA-fix(L) data are sampled from the LDA model (2) with the document length being fixed to a given L. The LDA-fix2(γ,L1,L2) data are sampled as follows: a (1 − γ)-portion of the documents is sampled from the LDA-fix(L1) model with a given document length L1 and a γ-portion of the documents is sampled from the LDA-fix(L2) model with a given document length L2.\n\nEvaluation. 
Evaluation of topic recovery for semi-synthetic data is performed with the ℓ1-error between the recovered D̂ and true D topic matrices with the best permutation of columns: err_ℓ1(D̂, D) := min_{π ∈ PERM} (1/(2K)) Σ_k ‖d̂_{π_k} − d_k‖_1 ∈ [0, 1]. The minimization is over the possible permutations π ∈ PERM of the columns of D̂ and can be efficiently obtained with the Hungarian algorithm for bipartite matching. For the evaluation of topic recovery in the real data case, we use an approximation of the log-likelihood for held out documents as the metric [16]. See Appendix G.6 for more details.\n\n⁴ http://www.cs.columbia.edu/~blei/lda-c\n⁵ http://ai.stanford.edu/~gal/data\n⁶ https://archive.ics.uci.edu/ml/datasets/Bag+of+Words\n\n[Figure 1 plots: two panels (left, right) showing the ℓ1-error against the number of documents in 1000s for JD, JD(k), JD(f), Spec, and TPM.]\n\nFigure 1: Comparison of the diagonalization algorithms. The topic matrix D and Dirichlet parameter c are learned for K = 50 from AP; c is scaled to sum up to 0.5 and b is set to fit the expected document length L̂ = 200. The semi-synthetic dataset is sampled from GP; number of documents N varies from 1,000 to 50,000. Left: GP/DICA moments. Right: LDA moments. Note: a smaller value of the ℓ1-error is better.\n\nWe use our Matlab implementation of the GP/DICA cumulants, the LDA moments, and the diagonalization algorithms. The datasets and the code for reproducing our experiments are available online.⁷ In Appendix G.1, we discuss implementation and complexity of the algorithms. 
We explain how we initialize the parameter c_0 for the LDA moments in Appendix G.3.\n\n5.1 Comparison of the diagonalization algorithms\n\nIn Figure 1, we compare the diagonalization algorithms on the semi-synthetic AP dataset for K = 50 using the GP sampling. We compare the tensor power method (TPM) [4], the spectral algorithm (Spec), and the orthogonal joint diagonalization algorithm (JD) described in Algorithm 1 with different options to choose the random projections: JD(k) takes P = K vectors u_p sampled uniformly from the unit ℓ2-sphere in R^K and selects v_p = Wᵀ u_p (option (a) in Algorithm 1); JD selects the full basis e_1, ..., e_K in R^K and sets v_p = Wᵀ e_p (as JADE [21]) (option (b) in Algorithm 1); JD(f) chooses the full canonical basis of R^M as the projection vectors (computationally expensive).\n\nBoth the GP/DICA cumulants and LDA moments are well-specified in this setup. However, the LDA moments have a slower finite sample convergence and, hence, a larger estimation error for the same value N. As expected, the spectral algorithm is always slightly inferior to the joint diagonalization algorithms. With the GP/DICA cumulants, where the estimation error is low, all algorithms demonstrate good performance, which also fulfills our expectations. However, although TPM shows almost perfect performance in the case of the GP/DICA cumulants (left), it significantly deteriorates for the LDA moments (right), which can be explained by the larger estimation error of the LDA moments and lack of robustness of TPM. The running times are discussed in Appendix G.2. Overall, the orthogonal joint diagonalization algorithm with initialization of random projections as Wᵀ multiplied with the canonical basis in R^K (JD) is both computationally efficient and fast.\n\n5.2 Comparison of the GP/DICA cumulants and the LDA moments\n\nIn Figure 2, when sampling from the GP model (top, left), both the GP/DICA cumulants and LDA moments are well specified, which implies that the approximation error (i.e., the error for the infinite number of documents) is low for both. The GP/DICA cumulants achieve low values of the estimation error already for N = 10,000 documents independently of the number of topics, while the convergence is slower for the LDA moments. When sampling from the LDA-fix(200) model (top, right), the GP/DICA cumulants are mis-specified and their approximation error is high, although the estimation error is low due to the faster finite sample convergence. One reason for the poor performance of the GP/DICA cumulants, in this case, is the absence of variance in document length. Indeed, if documents with two different lengths are mixed by sampling from the LDA-fix2(0.5,20,200) model (bottom, left), the performance of the GP/DICA cumulants improves. Moreover, the experiment with a changing fraction γ of documents (bottom, right) shows that a non-zero variance on the length improves the performance of the GP/DICA cumulants. 
As in practice real corpora usually have a non-zero variance for the document length, this bad scenario for the GP/DICA cumulants is not likely to happen.\n\n⁷ https://github.com/anastasia-podosinnikova/dica\n\n[Figure 2 plots: four panels showing the ℓ1-error for JD-GP(10), JD-LDA(10), JD-GP(90), and JD-LDA(90); three panels plot it against the number of documents in 1000s and one against the fraction of doc lengths γ.]\n\nFigure 2: Comparison of the GP/DICA cumulants and LDA moments. Two topic matrices and parameters c_1 and c_2 are learned from the NIPS dataset for K = 10 and 90; c_1 and c_2 are scaled to sum up to c_0 = 1. Four corpora of different sizes N from 1,000 to 50,000: top, left: b is set to fit the expected document length L̂ = 1300; sampling from the GP model; top, right: sampling from the LDA-fix(200) model; bottom, left: sampling from the LDA-fix2(0.5,20,200) model. Bottom, right: the number of documents here is fixed to N = 20,000; sampling from the LDA-fix2(γ,20,200) model varying the values of the fraction γ from 0 to 1 with the step 0.1. Note: a smaller value of the ℓ1-error is better.\n\n[Figure 3 plots: two panels showing the log-likelihood (in bits) against the number of topics K (10, 50, 100, 150) for JD-GP, JD-LDA, Spec-GP, Spec-LDA, VI, and VI-JD.]\n\nFigure 3: Experiments with real data. Left: the AP dataset. Right: the KOS dataset. Note: a higher value of the log-likelihood is better.\n\n5.3 Real data experiments\n\nIn Figure 3, JD-GP, Spec-GP, JD-LDA, and Spec-LDA are compared with variational inference (VI) and with variational inference initialized with the output of JD-GP (VI-JD). We measure held out log-likelihood per token (see Appendix G.7 for details on the experimental setup). The orthogonal joint diagonalization algorithm with the GP/DICA cumulants (JD-GP) demonstrates promising performance. In particular, the GP/DICA cumulants significantly outperform the LDA moments. Moreover, although variational inference performs better than the JD-GP algorithm, restarting variational inference with the output of the JD-GP algorithm systematically leads to better results. Similar behavior has already been observed (see, e.g., [30]).\n\n6 Conclusion\n\nIn this paper, we have proposed a new set of tensors for a discrete ICA model related to LDA, where word counts are directly modeled. These moments make fewer assumptions regarding distributions, and are theoretically and empirically more robust than previously proposed tensors for LDA, both on synthetic and real data. Following the ICA literature, we showed that our joint diagonalization procedure is also more robust. Once the topic matrix has been estimated in a semi-parametric way where topic intensities are left unspecified, it would be interesting to learn the unknown distributions of the independent topic intensities.\n\nAcknowledgments. This work was partially supported by the MSR-Inria Joint Center. The authors would like to thank Christophe Dupuy for helpful discussions.\n\nReferences\n\n[1] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.\n[2] T. Griffiths. Gibbs sampling in the generative model of latent Dirichlet allocation. Technical report, Stanford University, 2002.\n[3] A. Anandkumar, D.P. Foster, D. Hsu, S.M. Kakade, and Y.-K. 
Liu. A spectral algorithm for latent Dirichlet allocation. In NIPS, 2012.\n[4] A. Anandkumar, R. Ge, D. Hsu, S.M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. J. Mach. Learn. Res., 15:2773\u20132832, 2014.\n[5] P. Comon and C. Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, 2010.\n[6] C. Jutten. Calcul neuromim\u00e9tique et traitement du signal: analyse en composantes ind\u00e9pendantes. PhD thesis, INP-USM Grenoble, 1987.\n[7] C. Jutten and J. H\u00e9rault. Blind separation of sources, part I: an adaptive algorithm based on neuromimetic architecture. Signal Process., 24:1\u201310, 1991.\n[8] P. Comon. Independent component analysis, a new concept? Signal Process., 36:287\u2013314, 1994.\n[9] W.L. Buntine. Variational extensions to EM and multinomial PCA. In ECML, 2002.\n[10] M.E. Tipping and C.M. Bishop. Probabilistic principal component analysis. J. R. Stat. Soc., 61:611\u2013622, 1999.\n[11] S. Roweis. EM algorithms for PCA and SPCA. In NIPS, 1998.\n[12] J. Canny. GaP: a factor model for discrete data. In SIGIR, 2004.\n[13] W.L. Buntine and A. Jakulin. Applying discrete PCA in data analysis. In UAI, 2004.\n[14] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.\n[15] A. Anandkumar, D.P. Foster, D. Hsu, S.M. Kakade, and Y.-K. Liu. A spectral algorithm for latent Dirichlet allocation. CoRR, abs:1204.6703, 2013.\n[16] H.M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In ICML, 2009.\n[17] J.-F. Cardoso. Source separation using higher order moments. In ICASSP, 1989.\n[18] J.-F. Cardoso. Eigen-structure of the fourth-order cumulant tensor with application to the blind source separation problem. In ICASSP, 1990.\n[19] J.-F. Cardoso and P. Comon. 
Independent component analysis, a survey of some algebraic methods. In ISCAS, 1996.\n[20] A. Hyv\u00e4rinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Netw., 10(3):626\u2013634, 1999.\n[21] J.-F. Cardoso and A. Souloumiac. Blind beamforming for non Gaussian signals. In IEE Proceedings-F, 1993.\n[22] J.-F. Cardoso. High-order contrasts for independent component analysis. Neural Comput., 11:157\u2013192, 1999.\n[23] J.-F. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM J. Matrix Anal. Appl., 17(1):161\u2013164, 1996.\n[24] A. Bunse-Gerstner, R. Byers, and V. Mehrmann. Numerical methods for simultaneous diagonalization. SIAM J. Matrix Anal. Appl., 14(4):927\u2013949, 1993.\n[25] J. Nocedal and S.J. Wright. Numerical Optimization. Springer, 2nd edition, 2006.\n[26] F.R. Bach and M.I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1\u201348, 2002.\n[27] V. Kuleshov, A.T. Chaganty, and P. Liang. Tensor factorization via matrix factorization. In AISTATS, 2015.\n[28] A. Globerson, G. Chechik, F. Pereira, and N. Tishby. Euclidean embedding of co-occurrence data. J. Mach. Learn. Res., 8:2265\u20132295, 2007.\n[29] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu. A practical algorithm for topic modeling with provable guarantees. In ICML, 2013.\n[30] S. Cohen and M. Collins. A provably correct learning algorithm for latent-variable PCFGs. In ACL, 2014.", "award": [], "sourceid": 365, "authors": [{"given_name": "Anastasia", "family_name": "Podosinnikova", "institution": "INRIA/ENS"}, {"given_name": "Francis", "family_name": "Bach", "institution": "INRIA - ENS"}, {"given_name": "Simon", "family_name": "Lacoste-Julien", "institution": "INRIA"}]}