{"title": "Complexity of Inference in Latent Dirichlet Allocation", "book": "Advances in Neural Information Processing Systems", "page_first": 1008, "page_last": 1016, "abstract": "We consider the computational complexity of probabilistic inference in Latent Dirichlet Allocation (LDA). First, we study the problem of finding the maximum a posteriori (MAP) assignment of topics to words, where the document's topic distribution is integrated out. We show that, when the effective number of topics per document is small, exact inference takes polynomial time. In contrast, we show that, when a document has a large number of topics, finding the MAP assignment of topics to words in LDA is NP-hard. Next, we consider the problem of finding the MAP topic distribution for a document, where the topic-word assignments are integrated out. We show that this problem is also NP-hard. Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question.", "full_text": "Complexity of Inference in Latent Dirichlet\n\nAllocation\n\nDavid Sontag\n\nNew York University\u21e4\n\nDaniel M. Roy\n\nUniversity of Cambridge\n\nAbstract\n\nWe consider the computational complexity of probabilistic inference in La-\ntent Dirichlet Allocation (LDA). First, we study the problem of \ufb01nding\nthe maximum a posteriori (MAP) assignment of topics to words, where\nthe document\u2019s topic distribution is integrated out. We show that, when\nthe e\u21b5ective number of topics per document is small, exact inference takes\npolynomial time. In contrast, we show that, when a document has a large\nnumber of topics, \ufb01nding the MAP assignment of topics to words in LDA\nis NP-hard. Next, we consider the problem of \ufb01nding the MAP topic dis-\ntribution for a document, where the topic-word assignments are integrated\nout. We show that this problem is also NP-hard. 
Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question.\n\n1 Introduction\n\nProbabilistic models of text and topics, known as topic models, are powerful tools for exploring large data sets and for making inferences about the content of documents. Topic models are frequently used for deriving low-dimensional representations of documents that are then used for information retrieval, document summarization, and classification [Blei and McAuliffe, 2008; Lacoste-Julien et al., 2009]. In this paper, we consider the computational complexity of inference in topic models, beginning with one of the simplest and most popular models, Latent Dirichlet Allocation (LDA) [Blei et al., 2003]. The LDA model is arguably one of the most important probabilistic models in widespread use today.\n\nAlmost all uses of topic models require probabilistic inference. For example, unsupervised learning of topic models using Expectation Maximization requires the repeated computation of marginal probabilities of what topics are present in the documents. For applications in information retrieval and classification, each new document necessitates inference to determine what topics are present.\n\nAlthough there is a wealth of literature on approximate inference algorithms for topic models, such as Gibbs sampling and variational inference [Blei et al., 2003; Griffiths and Steyvers, 2004; Mukherjee and Blei, 2009; Porteous et al., 2008; Teh et al., 2007], little is known about the computational complexity of exact inference. Furthermore, the existing inference algorithms, although well-motivated, do not provide guarantees of optimality. We choose to study LDA because we believe that it captures the essence of what makes inference easy or hard in topic models. 
We believe that a careful analysis of the complexity of popular probabilistic models like LDA will ultimately help us build a methodology for spanning the gap between theory and practice in probabilistic AI.\n\nOur hope is that our results will motivate discussion of the following questions, guiding research of both new topic models and the design of new approximate inference and learning algorithms. First, what is the structure of real-world LDA inference problems? Might there be structure in \u201cnatural\u201d problem instances that makes them different from hard instances (e.g., those used in our reductions)? Second, how strongly does the prior distribution bias the results of inference? How do the hyperparameters affect the structure of the posterior and the hardness of inference?\n\n* This work was partially carried out while D.S. was at Microsoft Research New England.\n\nWe study the complexity of finding assignments of topics to words with high posterior probability and the complexity of summarizing the posterior distributions on topics in a document by either its expectation or points with high posterior density. In the former case, we show that the number of topics in the maximum a posteriori assignment determines the hardness. In the latter case, we quantify the sense in which the Dirichlet prior can be seen to enforce sparsity and use this result to show hardness via a reduction from set cover.\n\n2 MAP inference of word assignments\n\nWe will consider the inference problem for a single document. The LDA model states that the document, represented as a collection of words w = (w_1, w_2, ..., w_N), is generated as follows: a distribution over the T topics is sampled from a Dirichlet distribution, \u03b8 ~ Dir(\u03b1); then, for i \u2208 [N] := {1, ..., N}, we sample a topic z_i ~ Multinomial(\u03b8) and word w_i ~ \u03b2_{z_i}, where \u03b2_t, t \u2208 [T] are distributions on a dictionary of words. 
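The generative process just described can be sketched in code. This is a minimal illustrative sketch, not from the paper: the hyperparameters, topic-word distributions, and function name are made up, and a standard Gamma-normalization trick stands in for a Dirichlet sampler.

```python
import random

def generate_document(alpha, beta, N, seed=0):
    """Sample one document of N words from the LDA generative process.

    alpha: list of T Dirichlet hyperparameters.
    beta:  T x V matrix (list of lists); beta[t] is topic t's word distribution.
    Returns (theta, z, w): topic distribution, topic assignments, words.
    """
    rng = random.Random(seed)
    # theta ~ Dir(alpha), via normalized Gamma(alpha_t, 1) draws
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    theta = [x / sum(g) for x in g]
    T, V = len(alpha), len(beta[0])
    z = rng.choices(range(T), weights=theta, k=N)  # z_i ~ Multinomial(theta)
    w = [rng.choices(range(V), weights=beta[t], k=1)[0] for t in z]  # w_i ~ beta_{z_i}
    return theta, z, w

# Example with T = 2 topics over a V = 4 word dictionary (made-up numbers):
alpha = [0.5, 0.5]
beta = [[0.7, 0.1, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.7]]
theta, z, w = generate_document(alpha, beta, N=10)
```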
Assume that the word distributions \u03b2_t are fixed (e.g., they have been previously estimated), and let l_it = log Pr(w_i|z_i = t) be the log probability of the ith word being generated from topic t. After integrating out the topic distribution vector, the joint distribution of the topic assignments conditioned on the words w is given by\n\nPr(z_1, ..., z_N | w) \u221d (\u0393(\u03a3_t \u03b1_t) / \u03a0_t \u0393(\u03b1_t)) \u00b7 (\u03a0_t \u0393(n_t + \u03b1_t) / \u0393(\u03a3_t \u03b1_t + N)) \u00b7 \u03a0_{i=1}^N Pr(w_i|z_i),   (1)\n\nwhere n_t is the total number of words assigned to topic t.\n\nIn this section, we focus on the inference problem of finding the most likely assignment of topics to words, i.e. the maximum a posteriori (MAP) assignment. This has many possible applications. For example, it can be used to cluster the words of a document, or as part of a larger system such as part-of-speech tagging [Li and McCallum, 2005]. More broadly, for many classification tasks involving topic models it may be useful to have word-level features for whether a particular word was assigned to a given topic. From both an algorithm design and complexity analysis point of view, this MAP problem has the additional advantage of involving only discrete random variables.\n\nTaking the logarithm of Eq. 1 and ignoring constants, finding the MAP assignment is seen to be equivalent to the following combinatorial optimization problem, whose optimal value we denote by \u039b:\n\n\u039b = max_{x_it \u2208 {0,1}, n_t} \u03a3_t log \u0393(n_t + \u03b1_t) + \u03a3_{i,t} x_it l_it, subject to \u03a3_t x_it = 1 for all i and \u03a3_i x_it = n_t for all t,   (2)\n\nwhere the indicator variable x_it = I[z_i = t] denotes the assignment of word i to topic t.\n\n2.1 Exact maximization for small number of topics\n\nSuppose a document only uses \u03c4 \u226a T topics. That is, T could be large, but we are guaranteed that the MAP assignment for a document uses at most \u03c4 different topics. In this section, we show how we can use this knowledge to efficiently find a maximizing assignment of words to topics. 
It is important to note that we only restrict the maximum number of topics per document, letting the Dirichlet prior and the likelihood guide the choice of the actual number of topics present.\n\nWe first observe that, if we knew the number of words assigned to each topic, finding the MAP assignment is easy. For t \u2208 [T], let n*_t be the number of words assigned to topic t in the MAP assignment. Then, the MAP assignment x is found by solving the following optimization problem:\n\nmax_{x_it \u2208 {0,1}} \u03a3_{i,t} x_it l_it, subject to \u03a3_t x_it = 1, \u03a3_i x_it = n*_t,   (3)\n\nwhich is equivalent to weighted b-matching in a bipartite graph (the words are on one side, the topics on the other) and can be optimally solved in time O(b m^3), where b = max_t n*_t = O(N) and m = N + T [Schrijver, 2003].\n\nFigure 1: (Left) An LDA instance derived from a k-set packing instance, with topics t_1, ..., t_5 on top and the document's words w_1, ..., w_6 on the bottom. (Center) Plot of F(n_t) = log \u0393(n_t + \u03b1) for various values of \u03b1. The x-axis varies n_t, the number of words assigned to topic t, and the y-axis shows F(n_t). (Right) Behavior of log \u0393(n_t + \u03b1) as \u03b1 \u2192 0. The function is stable everywhere but at zero, where the reward for sparsity increases without bound.\n\nWe call (n_1, ..., n_T) a valid partition when n_t \u2265 0 and \u03a3_t n_t = N. Using weighted b-matching, we can find a MAP assignment of words to topics by trying all (T choose \u03c4) = \u0398(T^\u03c4) choices of \u03c4 topics and all possible valid partitions with at most \u03c4 non-zeros:\n\nfor all subsets A \u2286 [T] such that |A| = \u03c4 do\n  for all valid partitions n = (n_1, n_2, ..., n_T) such that n_t = 0 for t \u2209 A do\n    \u039b_{A,n} \u2190 Weighted-B-Matching(A, n, l) + \u03a3_t log \u0393(n_t + \u03b1_t)\n  end for\nend for\nreturn argmax_{A,n} \u039b_{A,n}\n\nThere are at most N^{\u03c4-1} valid partitions with \u03c4 non-zero counts. For each of these, we solve the b-matching problem to find the most likely assignment of words to topics that satisfies the cardinality constraints. Thus, the total running time is O((NT)^\u03c4 (N + \u03c4)^3). This is polynomial when the number of topics \u03c4 appearing in a document is a constant.\n\n2.2 Inference is NP-hard for large numbers of topics\n\nIn this section, we show that probabilistic inference is NP-hard in the general setting where a document may have a large number of topics in its MAP assignment. Let WORD-LDA(\u03b1) denote the decision problem of whether \u039b > V (see Eq. 2) for some V \u2208 R, where the hyperparameters \u03b1_t = \u03b1 for all topics. We consider both \u03b1 < 1 and \u03b1 \u2265 1 because, as shown in Figure 1, the optimization problem is qualitatively different in these two cases.\n\nTheorem 1. WORD-LDA(\u03b1) is NP-hard for all \u03b1 > 0.\n\nProof. Our proof is a straightforward generalization of the approach used by Halperin and Karp [2005] to show that the minimum entropy set cover problem is hard to approximate. The proof is done by reduction from k-set packing (k-SP), for k \u2265 3. In k-SP, we are given a collection of k-element sets over some universe of elements \u03a3 with |\u03a3| = n. The goal is to find the largest collection of disjoint sets. There exists a constant c < 1 such that it is NP-hard to decide whether a k-SP instance has (i) a solution with n/k disjoint sets covering all elements (called a perfect matching), or (ii) at most cn/k disjoint sets (called a (cn/k)-matching).\n\nWe now describe how to construct an LDA inference problem from a k-SP instance. This requires specifying the words in the document, the number of topics, and the word log probabilities l_it. 
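The construction described next can be sketched in code. This is an illustrative sketch (the instance below is made up): each element becomes a word, each set a topic, with l_it = log(1/k) for elements in the set and -inf (probability zero) otherwise.

```python
import math

def ksp_to_lda(sets, n, k):
    """Word log-probabilities l_it for the reduction from k-set packing.

    sets: list of k-element sets over the universe {0, ..., n-1}; one topic
          per set, one word per element. Returns an n x T nested list with
          l[i][t] = log(1/k) if element i is in set t, and -inf otherwise.
    """
    T = len(sets)
    l = [[-math.inf] * T for _ in range(n)]
    for t, s in enumerate(sets):
        for i in s:
            l[i][t] = math.log(1.0 / k)  # Pr(w_i | z_i = t) = 1/k
    return l

# Toy 3-SP instance over 6 elements (made up): {0,1,2} and {3,4,5} form a
# perfect matching, so the derived WORD-LDA instance has a high MAP value.
l = ksp_to_lda([{0, 1, 2}, {3, 4, 5}, {1, 2, 3}], n=6, k=3)
```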
Let each element i \u2208 \u03a3 correspond to a word w_i, and let each set correspond to one topic. The document consists of all of the words (i.e., \u03a3). We assign uniform probability to the words in each topic, so that Pr(w_i|z_i = t) = 1/k for i \u2208 t, and 0 otherwise. Figure 1 illustrates the resulting LDA model. The topics are on the top, and the words from the document are on the bottom. An edge is drawn between a topic (set) and a word (element) if the corresponding set contains that element.\n\nWhat remains is to show that we can solve some k-SP problem by using this reduction and solving a WORD-LDA(\u03b1) problem. For technical reasons involving \u03b1 > 1, we require that k is sufficiently large. We will use the following result (we omit the proof due to space limitations).\n\nLemma 2. Let P be a k-SP instance for k > (1 + \u03b1)^2, and let P' be the derived WORD-LDA(\u03b1) instance. There exist constants C_U and C_L < C_U such that, if there is a perfect matching in P, then \u039b \u2265 C_U. If, on the other hand, there is at most a (cn/k)-matching in P, then \u039b < C_L.\n\nLet P be a k-SP instance for k > (3 + \u03b1)^2, P' be the derived WORD-LDA(\u03b1) instance, and C_U and C_L < C_U be as in Lemma 2. Then, by testing \u039b < C_L and \u039b \u2265 C_U we can decide whether P has a perfect matching or at best a (cn/k)-matching. Hence k-SP reduces to WORD-LDA(\u03b1).\n\nThe bold lines in Figure 1 indicate the MAP assignment, which for this example corresponds to a perfect matching for the original k-set packing instance. More realistic documents would have significantly more words than topics used. Although this is not possible while keeping k = 3, since the MAP assignment always has \u03c4 \u2265 N/k, we can instead reduce from a k-set packing problem with k \u226b 3. Lemma 2 shows that this is hard as well.\n\n3 MAP inference of the topic distribution\n\nIn this section we consider the task of finding the mode of Pr(\u03b8|w). 
This MAP problem involves integrating out the topic assignments, z_i, as opposed to the previously considered MAP problem of integrating out the topic distribution \u03b8. We will see that the MAP topic distribution is not always well-defined, which will lead us to define and study alternative formulations. In particular, we give a precise characterization of the MAP problem as one of finding sparse topic distributions, and use this fact to give hardness results for several settings. We also show settings for which MAP inference is tractable.\n\nThere are many potential applications of MAP inference of the document's topic distribution. For example, the distribution may be used for topic-based information retrieval or as the feature vector for classification. As we will make clear later, this type of inference results in sparse solutions. Thus, the MAP topic distribution provides a compact summary of the document that could be useful for document summarization.\n\nLet \u03b8 = (\u03b8_1, ..., \u03b8_T). A straightforward application of Bayes' rule allows us to write the posterior density of \u03b8 given w as\n\nPr(\u03b8|w) \u221d (\u03a0_{t=1}^T \u03b8_t^{\u03b1_t - 1}) (\u03a0_{i=1}^N \u03a3_{t=1}^T \u03b8_t \u03b3_it),   (4)\n\nwhere \u03b3_it = Pr(w_i|z_i = t). Taking the logarithm of the posterior and ignoring constants, we obtain\n\n\u03a6(\u03b8) = \u03a3_{t=1}^T (\u03b1_t - 1) log(\u03b8_t) + \u03a3_{i=1}^N log(\u03a3_{t=1}^T \u03b8_t \u03b3_it).   (5)\n\nWe will use the shorthand \u03a6(\u03b8) = \u03a6_P(\u03b8) + \u03a6_L(\u03b8), where \u03a6_P(\u03b8) = \u03a3_{t=1}^T (\u03b1_t - 1) log(\u03b8_t) and \u03a6_L(\u03b8) = \u03a3_{i=1}^N log(\u03a3_{t=1}^T \u03b3_it \u03b8_t).\n\nTo find the MAP \u03b8, we maximize (5) subject to the constraint that \u03a3_{t=1}^T \u03b8_t = 1 and \u03b8_t \u2265 0. Unfortunately, this maximization problem can be degenerate. In particular, note that if \u03b8_t = 0 for \u03b1_t < 1, then the corresponding term in \u03a6_P(\u03b8) will take the value \u221e, overwhelming the likelihood term. Thus, any feasible solution with the above property could be considered 'optimal'.\n\nA similar problem arises during the maximum-likelihood estimation of a normal mixture model, where the likelihood diverges to infinity as the variance of a mixture component with a single data point approaches zero [Biernacki and Chr\u00e9tien, 2003; Kiefer and Wolfowitz, 1956]. In practice, one can enforce a lower bound on the variance or penalize such configurations. Here we consider a similar tactic.\n\nFor \u03b5 > 0, let TOPIC-LDA(\u03b5) denote the optimization problem\n\nmax_\u03b8 \u03a6(\u03b8) subject to \u03a3_t \u03b8_t = 1, \u03b5 \u2264 \u03b8_t \u2264 1.   (6)\n\nFor \u03b5 = 0, we will denote the corresponding optimization problem by TOPIC-LDA. When \u03b1_t = \u03b1, i.e. the prior distribution on the topic distribution is a symmetric Dirichlet, we write TOPIC-LDA(\u03b5, \u03b1) for the corresponding optimization problem. In the following sections we will study the structure and hardness of TOPIC-LDA, TOPIC-LDA(\u03b5) and TOPIC-LDA(\u03b5, \u03b1).\n\n3.1 Polynomial-time inference for large hyperparameters (\u03b1_t \u2265 1)\n\nWhen \u03b1_t \u2265 1, Eq. 5 is a concave function of \u03b8. As a result, we can efficiently find \u03b8* using a number of techniques from convex optimization. Note that this is in contrast to the MAP inference problem discussed in Section 2, which we showed was hard for all choices of \u03b1.\n\nSince we are optimizing over the simplex (\u03b8 must be non-negative and sum to 1), we can apply the exponentiated gradient method [Kivinen and Warmuth, 1995]. Initializing \u03b8^0 to be the uniform vector, the update for time s is given by\n\n\u03b8_t^{s+1} = \u03b8_t^s exp(\u03b7 \u2207_t^s) / \u03a3_{t'} \u03b8_{t'}^s exp(\u03b7 \u2207_{t'}^s),  where  \u2207_t^s = (\u03b1_t - 1)/\u03b8_t^s + \u03a3_{i=1}^N \u03b3_it / (\u03a3_{t'=1}^T \u03b8_{t'}^s \u03b3_{it'}),   (7)\n\nand \u03b7 is the step size and \u2207^s is the gradient.\n\nWhen \u03b1 = 1 the prior disappears altogether and this algorithm simply corresponds to optimizing the likelihood term. When \u03b1 > 1, the prior corresponds to a bias toward a particular topic distribution.\n\n3.2 Small hyperparameters encourage sparsity (\u03b1 < 1)\n\nOn the other hand, when \u03b1_t < 1, the first term in Eq. 5 is convex whereas the second term is concave. This setting, of \u03b1 much smaller than 1, occurs frequently in practice. For example, learning an LDA model on a large corpus of NIPS abstracts with T = 200 topics, we find that the hyperparameters found range from \u03b1_t = 0.0009 to 0.135, with the median being 0.01. Although in this setting it is difficult to find the global optimum (we will make this precise in Theorem 6), one possibility for finding a local maximum is the Concave-Convex Procedure [Yuille and Rangarajan, 2003].\n\nIn this section we prove structural results about the TOPIC-LDA(\u03b5, \u03b1) solution space for when \u03b1 < 1. These results illustrate that the Dirichlet prior encourages sparse MAP solutions: the topic distribution will be large on as few topics as necessary to explain every word of the document, and otherwise will be close to zero.\n\nThe following lemma shows that in any optimal solution to TOPIC-LDA(\u03b5, \u03b1), for every word, there is at least one topic that both has large probability and gives non-trivial probability to this word. We use K(\u03b1, T, N) = e^{-3/\u03b1} N^{-1} T^{-1/\u03b1} to refer to the lower bound on the topic's probability.\n\nLemma 3. Let \u03b1 < 1. All optimal solutions \u03b8* to TOPIC-LDA(\u03b5, \u03b1) have the following property: for every word i, \u03b8*_t\u0302 \u2265 K(\u03b1, T, N), where t\u0302 = argmax_t \u03b3_it \u03b8*_t.\n\nProof sketch. If \u03b5 \u2265 K(\u03b1, T, N) the claim trivially holds. 
Assume for the purpose of contradiction that there exists a word i\u0302 such that \u03b8*_t\u0302 < K(\u03b1, T, N), where t\u0302 = argmax_t \u03b3_i\u0302t \u03b8*_t. Let Y denote the set of topics t \u2260 t\u0302 such that \u03b8*_t \u2265 2\u03b5. Let \u0394_1 = \u03a3_{t\u2208Y} \u03b8*_t and \u0394_2 = \u03a3_{t\u2209Y, t\u2260t\u0302} \u03b8*_t. Note that \u0394_2 < 2T\u03b5. Consider\n\n\u03b8\u0302_t\u0302 = 1/N,  \u03b8\u0302_t = ((1 - \u0394_2 - 1/N)/\u0394_1) \u03b8*_t for t \u2208 Y,  \u03b8\u0302_t = \u03b8*_t for t \u2209 Y, t \u2260 t\u0302.   (8)\n\nIt is easy to show that \u03b8\u0302_t \u2265 \u03b5 for all t, and \u03a3_t \u03b8\u0302_t = 1. Finally, we show that \u03a6(\u03b8\u0302) > \u03a6(\u03b8*), contradicting the optimality of \u03b8*. The full proof is given in the supplementary material.\n\nNext, we show that if a topic is not sufficiently \u201cused\u201d then it will be given a probability very close to zero. By used, we mean that for at least one word, the topic is close in probability to that of the largest contributor to the likelihood of the word. To do this, we need to define the notion of the dynamic range of a word, given as \u03ba_i = max_{t,t': \u03b3_it > 0, \u03b3_it' > 0} \u03b3_it/\u03b3_it'. We let the maximum dynamic range be \u03ba = max_i \u03ba_i. Note that \u03ba \u2265 1 and, for most applications, it is reasonable to expect \u03ba to be small (e.g., less than 1000).\n\nLemma 4. Let \u03b1 < 1, and let \u03b8* be any optimal solution to TOPIC-LDA(\u03b5, \u03b1). Suppose topic t\u0302 has \u03b8*_t\u0302 < (\u03baN)^{-1} K(\u03b1, T, N). Then, \u03b8*_t\u0302 \u2264 e^{1/(1-\u03b1)+2} \u03b5.\n\nProof. Suppose for the purpose of contradiction that \u03b8*_t\u0302 > e^{1/(1-\u03b1)+2} \u03b5. Consider \u03b8\u0302 defined as follows: \u03b8\u0302_t\u0302 = \u03b5, and \u03b8\u0302_t = ((1 - \u03b5)/(1 - \u03b8*_t\u0302)) \u03b8*_t for t \u2260 t\u0302. We have:\n\n\u03a6(\u03b8\u0302) - \u03a6(\u03b8*) = (1 - \u03b1) log(\u03b8*_t\u0302/\u03b5) + (T - 1)(1 - \u03b1) log((1 - \u03b8*_t\u0302)/(1 - \u03b5)) + \u03a3_{i=1}^N log(\u03a3_t \u03b8\u0302_t \u03b3_it / \u03a3_t \u03b8*_t \u03b3_it).   (9)\n\nUsing the fact that log(1 - z) \u2265 -2z for z \u2208 [0, 1/2], it follows that\n\n(T - 1)(1 - \u03b1) log((1 - \u03b8*_t\u0302)/(1 - \u03b5)) \u2265 (T - 1)(1 - \u03b1) log(1 - \u03b8*_t\u0302) \u2265 2(T - 1)(\u03b1 - 1) \u03b8*_t\u0302 \u2265 2(T - 1)(\u03b1 - 1)(\u03baN)^{-1} K(\u03b1, T, N) \u2265 2(\u03b1 - 1).   (10)\n\nWe have \u03b8\u0302_t \u2265 \u03b8*_t for t \u2260 t\u0302, and so\n\n\u03a3_t \u03b8\u0302_t \u03b3_it / \u03a3_t \u03b8*_t \u03b3_it = (\u03a3_{t\u2260t\u0302} \u03b8\u0302_t \u03b3_it + \u03b5 \u03b3_it\u0302) / (\u03a3_{t\u2260t\u0302} \u03b8*_t \u03b3_it + \u03b8*_t\u0302 \u03b3_it\u0302) \u2265 \u03a3_{t\u2260t\u0302} \u03b8*_t \u03b3_it / (\u03a3_{t\u2260t\u0302} \u03b8*_t \u03b3_it + \u03b8*_t\u0302 \u03b3_it\u0302).   (11)\n\nRecall from Lemma 3 that, for each word i and t\u0303 = argmax_t \u03b3_it \u03b8*_t, we have \u03b8*_t\u0303 \u2265 K(\u03b1, T, N). Necessarily t\u0303 \u2260 t\u0302. Therefore, using the fact that log(1/(1+z)) \u2265 -z,\n\nlog(\u03a3_{t\u2260t\u0302} \u03b8*_t \u03b3_it / (\u03a3_{t\u2260t\u0302} \u03b8*_t \u03b3_it + \u03b8*_t\u0302 \u03b3_it\u0302)) \u2265 -\u03b8*_t\u0302 \u03b3_it\u0302 / \u03a3_{t\u2260t\u0302} \u03b8*_t \u03b3_it \u2265 -(\u03baN)^{-1} K(\u03b1, T, N) \u03b3_it\u0302 / (K(\u03b1, T, N) \u03b3_it\u0303) \u2265 -1/N.   (12)\n\nThus, \u03a6(\u03b8\u0302) - \u03a6(\u03b8*) > (1 - \u03b1) log(e^{1/(1-\u03b1)+2}) + 2(\u03b1 - 1) - 1 = 0, completing the proof.\n\nFinally, putting together what we showed in the previous two lemmas, we conclude that all optimal solutions to TOPIC-LDA(\u03b5, \u03b1) either have \u03b8_t large or small, but not in between (that is, we have demonstrated a gap). We have the immediate corollary:\n\nTheorem 5. 
For \u03b1 < 1, all optimal solutions to TOPIC-LDA(\u03b5, \u03b1) have \u03b8_t \u2264 e^{1/(1-\u03b1)+2} \u03b5 or \u03b8_t \u2265 \u03ba^{-1} e^{-3/\u03b1} N^{-2} T^{-1/\u03b1}.\n\n3.3 Inference is NP-hard for small hyperparameters (\u03b1 < 1)\n\nThe previous results characterize optimal solutions to TOPIC-LDA(\u03b5, \u03b1) and highlight the fact that optimal solutions are sparse. In this section we show that these same properties can be the source of computational hardness during inference. In particular, it is possible to encode set cover instances as TOPIC-LDA(\u03b5, \u03b1) instances, where the set cover corresponds to those topics assigned appreciable probability.\n\nTheorem 6. TOPIC-LDA(\u03b5, \u03b1) is NP-hard for \u03b5 \u2264 K(\u03b1, T, N)^{T/(1-\u03b1)} T^{-N/(1-\u03b1)} and \u03b1 < 1.\n\nProof. Consider a set cover instance consisting of a universe of elements and a family of sets, where we assume for convenience that the minimum cover is neither a singleton, all but one of the family of sets, nor the entire family of sets, and that there are at least two elements in the universe. As with our previous reduction, we have one topic per set and one word in the document for each element. We let Pr(w_i|z_i = t) = 0 when element w_i is not in set t, and a constant otherwise (we make every topic have the uniform distribution over the same number of words, some of which may be dummy words not appearing in the document). Let S_i \u2286 [T] denote the set of topics to which word i belongs. Then, up to additive constants, we have \u03a6_P(\u03b8) = (\u03b1 - 1) \u03a3_{t=1}^T log(\u03b8_t) and \u03a6_L(\u03b8) = \u03a3_{i=1}^N log(\u03a3_{t\u2208S_i} \u03b8_t).\n\nLet C_\u03b8* \u2286 [T] be those topics t \u2208 [T] such that \u03b8*_t \u2265 K(\u03b1, T, N), where \u03b8* is an optimal solution to TOPIC-LDA(\u03b5, \u03b1). It immediately follows from Lemma 3 that C_\u03b8* is a cover. Suppose for the purpose of contradiction that C_\u03b8* is not a minimal cover. 
Let C\u0303 be a minimal cover, and let\n\n\u03b8\u0303_t = \u03b5 for t \u2209 C\u0303,  \u03b8\u0303_t = (1 - \u03b5(T - |C\u0303|))/|C\u0303| for t \u2208 C\u0303.   (13)\n\nWe will show that \u03a6(\u03b8\u0303) > \u03a6(\u03b8*), contradicting the optimality of \u03b8*, and thus proving that C_\u03b8* is in fact minimal. This suffices to show that TOPIC-LDA(\u03b5, \u03b1) is NP-hard in this regime.\n\nFor all \u03b8 in the simplex, we have \u03a3_i log(max_{t\u2208S_i} \u03b8_t) \u2264 \u03a6_L(\u03b8) \u2264 0, and \u03b8\u0303_t > 1/T for t \u2208 C\u0303. Thus it follows that \u03a6_L(\u03b8*) - \u03a6_L(\u03b8\u0303) \u2264 N log T. Likewise, using the assumption that T \u2265 |C\u0303| + 1, we have\n\n\u03a6_P(\u03b8\u0303) - \u03a6_P(\u03b8*) \u2265 (1 - \u03b1)((T - |C\u0303|) log(1/\u03b5) + (|C\u0303| + 1) log K(\u03b1, T, N) + (T - |C\u0303| - 1) log \u03b5) \u2265 (1 - \u03b1)(log(1/\u03b5) + T log K(\u03b1, T, N)),   (14)\n\nwhere we have conservatively only included the terms t \u2209 C\u0303 for \u03a6_P(\u03b8\u0303) and taken \u03b8*_t \u2208 {\u03b5, K(\u03b1, T, N)} with |C\u0303| + 1 terms taking the latter value. It follows that\n\n\u03a6_P(\u03b8\u0303) + \u03a6_L(\u03b8\u0303) - \u03a6_P(\u03b8*) - \u03a6_L(\u03b8*) > (1 - \u03b1) log(1/\u03b5) + T log K(\u03b1, T, N) - N log T.   (15)\n\nThis is greater than 0 precisely when (1 - \u03b1) log(1/\u03b5) > log(T^N K(\u03b1, T, N)^{-T}), which holds by our choice of \u03b5.\n\nNote that although \u03b5 is exponentially small in N and T, the size of its representation in binary is polynomial in N and T, and thus polynomial in the size of the set cover instance.\n\nIt can be shown that as \u03b5 \u2192 0, the solutions to TOPIC-LDA(\u03b5, \u03b1) become degenerate, concentrating their support on the minimal set of topics C \u2286 [T] such that for all i there exists t \u2208 C with \u03b3_it > 0. A generalization of this result holds for TOPIC-LDA(\u03b5) and suggests that, while it may be possible to give a more sensible definition of TOPIC-LDA as the set of solutions for TOPIC-LDA(\u03b5) as \u03b5 \u2192 
0, these solutions are unlikely to be of any practical use.\n\n4 Sampling from the posterior\n\nThe previous sections of the paper focused on MAP inference problems. In this section, we study the problem of marginal inference in LDA.\n\nTheorem 7. For \u03b1 \u2265 1, one can approximately sample from Pr(\u03b8 | w) in polynomial time.\n\nProof sketch. The density given in Eq. 4 is log-concave when \u03b1 \u2265 1. The algorithm given in Lovasz and Vempala [2006] can be used to approximately sample from the posterior.\n\nAlthough polynomial, it is not clear whether the algorithm given in Lovasz and Vempala [2006], based on random walks, is of practical interest (e.g., the running time bound has a constant of 10^30). However, we believe our observation provides insight into the complexity of sampling when \u03b1 is not too small, and may be a starting point towards explaining the empirical success of using Markov chain Monte Carlo to do inference in LDA.\n\nNext, we show that when \u03b1 is extremely small, it is NP-hard to sample from the posterior. We again reduce from set cover. The intuition behind the proof is that, when \u03b1 is small enough, an appreciable amount of the probability mass corresponds to the sparsest possible \u03b8 vectors where the supported topics together cover all of the words. As a result, we could directly read off the minimal set cover from the posterior marginals E[\u03b8_t | w].\n\nTheorem 8. When \u03b1 < (4N + 4)^{-TN} N^{-1}, it is NP-hard to approximately sample from Pr(\u03b8 | w), under randomized reductions.\n\nThe full proof can be found in the supplementary material. Note that it is likely that one would need an extremely large and unusual corpus to learn an \u03b1 so small. 
Our results illustrate a large gap in our knowledge about the complexity of sampling as a function of \u03b1. We feel that tightening this gap is a particularly exciting open problem.\n\n5 Discussion\n\nIn this paper, we have shown that the complexity of MAP inference in LDA strongly depends on the effective number of topics per document. When a document is generated from a small number of topics (regardless of the number of topics in the model), WORD-LDA can be solved in polynomial time. We believe this is representative of many real-world applications. On the other hand, if a document can use an arbitrary number of topics, WORD-LDA is NP-hard. The choice of hyperparameters for the Dirichlet does not affect these results.\n\nWe have also studied the problem of computing MAP estimates and expectations of the topic distribution. In the former case, the Dirichlet prior enforces sparsity in a sense that we make precise. In the latter case, we show that extreme parameterizations can similarly cause the posterior to concentrate on sparse solutions. In both cases, this sparsity is shown to be a source of computational hardness.\n\nIn related work, Sepp\u00e4nen et al. [2003] suggest a heuristic for inference that is also applicable to LDA: if there exists a word that can only be generated with high probability from one of the topics, then the corresponding topic must appear in the MAP assignment whenever that word appears in a document. Miettinen et al. [2008] give a hardness reduction and greedy algorithm for learning topic models. Although the models they consider are very different from LDA, some of the ideas may still be applicable. More broadly, it would be interesting to consider the complexity of learning the per-topic word distributions \u03b2_t.\n\nOur paper suggests a number of directions for future study. 
First, our exact algorithms can be used to evaluate the accuracy of approximate inference algorithms, for example by comparing to the MAP of the variational posterior. On the algorithmic side, it would be interesting to improve the running time of the exact algorithm from Section 2.1. Also, note that we did not give an analogous exact algorithm for the MAP topic distribution when the posterior has support on only a small number of topics. In this setting, it may be possible to find this set of topics by trying all S \u2286 [T] of small cardinality and then doing a (non-uniform) grid search over the topic distribution restricted to support S.\n\nFinally, our structural results on the sparsity induced by the Dirichlet prior draw connections between inference in topic models and sparse signal recovery. We proved that the MAP topic distribution has, for each topic t, either \u03b8_t \u2248 \u03b5 or \u03b8_t bounded below by some value (much larger than \u03b5). Because of this gap, we can approximately view the MAP problem as searching for a set corresponding to the support of \u03b8. Our work motivates the study of greedy algorithms for MAP inference in topic models, analogous to those used for set cover. One could even consider learning algorithms that use this greedy algorithm within the inner loop [Krause and Cevher, 2010].\n\nAcknowledgments. D.M.R. is supported by a Newton International Fellowship. We thank Tommi Jaakkola and anonymous reviewers for helpful comments.\n\nReferences\n\nC. Biernacki and S. Chr\u00e9tien. Degeneracy in the maximum likelihood estimation of univariate Gaussian mixtures with EM. Statist. Probab. Lett., 61(4):373\u2013382, 2003. ISSN 0167-7152.\n\nD. Blei and J. McAuliffe. Supervised topic models. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Adv. in Neural Inform. Processing Syst. 20, pages 121\u2013128. MIT Press, Cambridge, MA, 2008.\n\nD. M. Blei, A. Y. Ng, and M. I. Jordan. 
Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993\u20131022, 2003. ISSN 1532-4435.\n\nT. L. Griffiths and M. Steyvers. Finding scientific topics. Proc. Natl. Acad. Sci. USA, 101(Suppl 1):5228\u20135235, 2004. doi: 10.1073/pnas.0307752101.\n\nE. Halperin and R. M. Karp. The minimum-entropy set cover problem. Theor. Comput. Sci., 348(2):240\u2013250, 2005. ISSN 0304-3975. doi: 10.1016/j.tcs.2005.09.015.\n\nJ. Kiefer and J. Wolfowitz. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist., 27:887\u2013906, 1956. ISSN 0003-4851.\n\nJ. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Inform. and Comput., 132, 1995.\n\nA. Krause and V. Cevher. Submodular dictionary selection for sparse representation. In Proc. Int. Conf. on Machine Learning (ICML), 2010.\n\nS. Lacoste-Julien, F. Sha, and M. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Adv. in Neural Inform. Processing Syst. 21, pages 897\u2013904. 2009.\n\nW. Li and A. McCallum. Semi-supervised sequence modeling with syntactic topic models. In Proc. of the 20th Nat. Conf. on Artificial Intelligence, volume 2, pages 813\u2013818. AAAI Press, 2005.\n\nL. Lovasz and S. Vempala. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In Proc. of the 47th Ann. IEEE Symp. on Foundations of Comput. Sci., pages 57\u201368. IEEE Computer Society, 2006. ISBN 0-7695-2720-5.\n\nP. Miettinen, T. Mielik\u00e4inen, A. Gionis, G. Das, and H. Mannila. The discrete basis problem. IEEE Trans. Knowl. Data Eng., 20(10):1348\u20131362, 2008.\n\nI. Mukherjee and D. M. Blei. Relative performance guarantees for approximate inference in latent Dirichlet allocation. In D. Koller, D. Schuurmans, Y. Bengio, and L. 
Bottou, editors, Adv. in Neural Inform. Processing Syst. 21, pages 1129\u20131136. 2009.\n\nI. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In Proc. of the 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 569\u2013577, New York, NY, USA, 2008. ACM.\n\nA. Schrijver. Combinatorial optimization. Polyhedra and efficiency. Vol. A, volume 24 of Algorithms and Combinatorics. Springer-Verlag, Berlin, 2003. ISBN 3-540-44389-4. Paths, flows, matchings, Chapters 1\u201338.\n\nJ. K. Sepp\u00e4nen, E. Bingham, and H. Mannila. A simple algorithm for topic identification in 0-1 data. In Proc. of the 7th European Conf. on Principles and Practice of Knowledge Discovery in Databases, pages 423\u2013434. Springer-Verlag, 2003.\n\nY. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Adv. in Neural Inform. Processing Syst. 19, volume 19, 2007.\n\nA. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Comput., 15:915\u2013936, April 2003. ISSN 0899-7667.", "award": [], "sourceid": 622, "authors": [{"given_name": "David", "family_name": "Sontag", "institution": null}, {"given_name": "Dan", "family_name": "Roy", "institution": null}]}