{"title": "Fixed-Length Poisson MRF: Adding Dependencies to the Multinomial", "book": "Advances in Neural Information Processing Systems", "page_first": 3213, "page_last": 3221, "abstract": "We propose a novel distribution that generalizes the Multinomial distribution to enable dependencies between dimensions. Our novel distribution is based on the parametric form of the Poisson MRF model [Yang et al., 2012] but is fundamentally different because of the domain restriction to a fixed-length vector like in a Multinomial where the number of trials is fixed or known. Thus, we propose the Fixed-Length Poisson MRF (LPMRF) distribution. We develop methods to estimate the likelihood and log partition function (i.e. the log normalizing constant), which was not developed for the Poisson MRF model. In addition, we propose novel mixture and topic models that use LPMRF as a base distribution and discuss the similarities and differences with previous topic models such as the recently proposed Admixture of Poisson MRFs [Inouye et al., 2014]. We show the effectiveness of our LPMRF distribution over Multinomial models by evaluating the test set perplexity on a dataset of abstracts and Wikipedia. Qualitatively, we show that the positive dependencies discovered by LPMRF are interesting and intuitive. Finally, we show that our algorithms are fast and have good scaling (code available online).", "full_text": "Fixed-Length Poisson MRF:\n\nAdding Dependencies to the Multinomial\n\nDavid I. Inouye\n\nPradeep Ravikumar\n\nInderjit S. Dhillon\n\nDepartment of Computer Science\n\n{dinouye,pradeepr,inderjit}@cs.utexas.edu\n\nUniversity of Texas at Austin\n\nAbstract\n\nWe propose a novel distribution that generalizes the Multinomial distribution to\nenable dependencies between dimensions. 
Our novel distribution is based on the\nparametric form of the Poisson MRF model [1] but is fundamentally different be-\ncause of the domain restriction to a \ufb01xed-length vector like in a Multinomial where\nthe number of trials is \ufb01xed or known. Thus, we propose the Fixed-Length Pois-\nson MRF (LPMRF) distribution. We develop AIS sampling methods to estimate\nthe likelihood and log partition function (i.e. the log normalizing constant), which\nwas not developed for the Poisson MRF model. In addition, we propose novel\nmixture and topic models that use LPMRF as a base distribution and discuss the\nsimilarities and differences with previous topic models such as the recently pro-\nposed Admixture of Poisson MRFs [2]. We show the effectiveness of our LPMRF\ndistribution over Multinomial models by evaluating the test set perplexity on a\ndataset of abstracts and Wikipedia. Qualitatively, we show that the positive de-\npendencies discovered by LPMRF are interesting and intuitive. Finally, we show\nthat our algorithms are fast and have good scaling (code available online).\n\n1\n\nIntroduction & Related Work\n\nThe Multinomial distribution seems to be a natural distribution for modeling count-valued data\nsuch as text documents.\nIndeed, most topic models such as PLSA [3], LDA [4] and numerous\nextensions\u2014see [5] for a survey of probabilistic topic models\u2014use the Multinomial as the funda-\nmental base distribution while adding complexity using other latent variables. This is most likely\ndue to the extreme simplicity of Multinomial parameter estimation\u2014simple frequency counts\u2014that\nis usually smoothed by the simple Dirichlet conjugate prior. In addition, because the Multinomial\nrequires the length of a document to be \ufb01xed or pre-speci\ufb01ed, usually a Poisson distribution on doc-\nument length is assumed. 
This yields a Poisson-Multinomial distribution\u2014which by well-known results is merely an independent Poisson model.1 However, the Multinomial assumes independence between the words because the Multinomial is merely the sum of independent categorical variables. This restriction does not seem to fit real-world text. For example, words like \u201cneural\u201d and \u201cnetwork\u201d tend to co-occur quite frequently in NIPS papers. Thus, we seek to relax the word-independence assumption of the Multinomial.\n\nThe Poisson MRF distribution (PMRF) [1] seems to be a potential replacement for the Poisson-Multinomial because it allows some dependencies between words. The Poisson MRF is developed by assuming that every conditional distribution is 1D Poisson. However, the original formulation in [1] only allowed for negative dependencies. Thus, several modifications were proposed in [6] to allow for positive dependencies. One proposal, the Truncated Poisson MRF (TPMRF), simply truncates the PMRF by setting a max count for every word. While this formulation may provide interesting parameter estimates, a TPMRF with positive dependencies may be almost entirely concentrated at the corners of the joint distribution because of the quadratic term in the log probability (see the bottom left of Fig. 1). In addition, the log partition function of the TPMRF is intractable to estimate even for a small number of dimensions because the sum is over an exponential number of terms.\n\n1The assumption of Poisson document length is not important for most topic models [4].\n\nThus, we seek a different distribution than a TPMRF that allows positive dependencies but is more appropriately normalized. We observe that the Multinomial is proportional to an independent Poisson model with the domain restricted to a fixed length L. 
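The observation above can be checked numerically: conditioning independent Poissons on a fixed total L recovers a Multinomial with p_s proportional to the Poisson rates. A minimal sketch, our own illustration rather than code from the paper:

```python
import math

def poisson_pmf(x, lam):
    # pmf of a 1D Poisson with rate lam
    return math.exp(-lam) * lam**x / math.factorial(x)

def multinomial_pmf(x, p):
    # pmf of a Multinomial(L, p) at count vector x, L = sum(x)
    L = sum(x)
    coef = math.factorial(L)
    for xs in x:
        coef //= math.factorial(xs)
    out = float(coef)
    for xs, ps in zip(x, p):
        out *= ps**xs
    return out

# Independent Poissons with rates lam, conditioned on ||x||_1 == L,
# should match a Multinomial(L, p) with p_s = lam_s / sum(lam).
lam = [2.0, 1.0, 0.5]
L = 4
# enumerate all count vectors with sum L (here p = 3 words)
support = [(a, b, L - a - b) for a in range(L + 1) for b in range(L + 1 - a)]
joint = {x: math.prod(poisson_pmf(xs, ls) for xs, ls in zip(x, lam))
         for x in support}
Z = sum(joint.values())  # normalizer from the domain restriction
p = [ls / sum(lam) for ls in lam]
for x in support:
    assert abs(joint[x] / Z - multinomial_pmf(x, p)) < 1e-9
```

The same domain restriction applied to a PMRF instead of independent Poissons is exactly what yields the LPMRF defined in Sec. 2.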
Thus, in a similar way, we propose a Fixed-Length Poisson MRF (LPMRF) that is proportional to a PMRF but is restricted to a domain with a fixed vector length, i.e. where ||x||_1 = L. This distribution is quite different from previous PMRF variants because the normalization is very different, as will be described in later sections. For a motivating example, in Fig. 1, we show the marginal distributions of the empirical distribution and fitted models using only three words from the Classic3 dataset, which contains documents regarding library sciences and aerospace engineering (see Sec. 4). Clearly, real-world text has positive dependencies, as evidenced by the empirical marginals of \u201cboundary\u201d and \u201clayer\u201d (i.e. referring to the boundary layer in fluid dynamics), and the LPMRF does the best at fitting this empirical distribution. In addition, the log partition function\u2014and hence the likelihood\u2014for the LPMRF can be approximated using sampling, as described in later sections. Under the PMRF or TPMRF models, both the log partition function and the likelihood were computationally intractable to compute exactly.2 Thus, approximating the log partition function of an LPMRF opens the door for likelihood-based hyperparameter estimation and model evaluation that was not possible with the PMRF.\n\n(a) Empirical Marginal Distributions\n\n(b) Multinomial \u00d7 Poisson (Ind. 
Poissons)\n\n(c) Truncated Poisson MRF\n\n(d) Fixed-Length PMRF \u00d7 Poisson\n\nFigure 1: Marginal Distributions from Classic3 Dataset (Top Left) Empirical Distribution, (Top\nRight) Estimated Multinomial \u00d7 Poisson joint distribution\u2014i.e.\nindependent Poissons, (Bottom\nLeft) Truncated Poisson MRF, (Bottom Right) Fixed-Length PMRF \u00d7 Poisson joint distribution.\nThe simple empirical distribution clearly shows a strong dependency between \u201cboundary\u201d and\n\u201clayer\u201d but strong negative dependency of \u201cboundary\u201d with \u201clibrary\u201d. Clearly, the word-independent\nMultinomial-Poisson distribution under\ufb01ts the data. While the Truncated PMRF can model depen-\ndencies, it obviously has normalization problems because the normalization is dominated by the\nedge case. The LPMRF-Poisson distribution much more appropriately \ufb01ts the empirical data.\n\nIn the topic modeling literature, many researchers have realized the issue with using the Multino-\nmial distribution as the base distribution. For example, the interpretability of a Multinomial can be\ndif\ufb01cult since it only gives an ordering of words. Thus, multiple metrics have been proposed to eval-\nuate topic models based on the perceived dependencies between words within a topic [7, 8, 9, 10].\nIn particular, [11] showed that the Multinomial assumption was often violated in real world data.\nIn another paper [12], the LDA topic assignments for each word are used to train a separate Ising\nmodel\u2014i.e. a Bernoulli MRF\u2014for each topic in a heuristic two-stage procedure. Instead of model-\ning dependencies a posteriori, we formulate a generalization of topic models that allows the LPMRF\ndistribution to directly replace the Multinomial. This allows us to compute a topic model and word\ndependencies jointly under a uni\ufb01ed model as opposed to the two-stage heuristic procedure in [12].\n\n2The example in Fig. 
1 was computed by exhaustively computing the log partition function.\n\nThis model has some connection to the Admixture of Poisson MRFs model (APM) [2], which was the first topic model to consider word dependencies. However, the LPMRF topic model directly relaxes the LDA word-independence assumption (i.e. the independent case is the same as LDA), whereas APM is only an indirect relaxation of LDA because APM mixes in the exponential-family canonical parameter space while LDA mixes in the standard Multinomial parameter space. Another difference from APM is that our proposed LPMRF topic model can actually produce topic assignments for each word, similar to LDA with Gibbs sampling [13]. Finally, the LPMRF topic model does not fall into the same generalization of topic models as APM because the instance-specific distribution is not an LPMRF, as described more fully in later sections. The follow-up APM paper [14] gives a fast algorithm for estimating the PMRF parameters. We use this algorithm as the basis for estimating the topic LPMRF parameters. For estimating the topic vectors for each document, we give a simple coordinate descent algorithm for estimation of the LPMRF topic model. This estimation of topic vectors can be seen as a direct relaxation of LDA and could even provide a different estimation algorithm for LDA.\n\n2 Fixed-Length Poisson MRF\n\nNotation Let p, n and k denote the number of words, documents and topics respectively. We will generally use uppercase letters for matrices (e.g. \u03a6, X), boldface lowercase letters or indices of matrices for column vectors (i.e. x_i, \u03b8, \u03a6_s) and lowercase letters for scalar values (i.e. 
x_i, \u03b8_s).\n\nPoisson MRF Definition First, we briefly describe the Poisson MRF distribution and refer the reader to [1, 6] for more details. A PMRF can be parameterized by a node vector \u03b8 and an edge matrix \u03a6 whose non-zeros encode the direct dependencies between words:\n\nPr_PMRF(x | \u03b8, \u03a6) = exp(\u03b8^T x + x^T \u03a6 x \u2212 \u2211_{s=1}^p log(x_s!) \u2212 A(\u03b8, \u03a6)),\n\nwhere A(\u03b8, \u03a6) is the log partition function needed for normalization. Note that without loss of generality, we can assume \u03a6 is symmetric because it only shows up in the symmetric quadratic term. The conditional distribution of one word given all the others\u2014i.e. Pr(x_s | x_\u2212s)\u2014is a 1D Poisson distribution with natural parameter \u03b7_s = \u03b8_s + x_\u2212s^T \u03a6_s by construction. One primary issue with the PMRF is that the log partition function A(\u03b8, \u03a6) is a log-sum over all vectors in Z_+^p, and thus with even one positive dependency the log partition function is infinite because of the quadratic term in the formulation. Yang et al. [6] tried to address this issue but, as illustrated in the introduction, their proposed modifications to the PMRF can yield unusual models for real-world data.\n\nFigure 2: LPMRF distribution for L = 10 (left) and L = 20 (right) with negative, zero and positive dependencies. The distribution of the LPMRF can be quite different from a Multinomial (zero dependency) and thus provides a much more flexible parametric distribution for count data.\n\nLPMRF Definition The Fixed-Length Poisson MRF (LPMRF) distribution is a simple yet fundamentally different distribution than the PMRF. Letting L \u2261 ||x||_1 be the length of the document, we define the LPMRF distribution as follows:\n\nPr_LPMRF(x | \u03b8, \u03a6, L) = exp(\u03b8^T x + x^T \u03a6 x \u2212 \u2211_s log(x_s!) \u2212 A_L(\u03b8, \u03a6))   (1)\n\nA_L(\u03b8, \u03a6) = log \u2211_{x \u2208 X_L} exp(\u03b8^T x + x^T \u03a6 x \u2212 \u2211_s log(x_s!))   (2)\n\nX_L = {x : x \u2208 Z_+^p, ||x||_1 = L}.   (3)\n\nThe only difference from the PMRF parametric form is the log partition function A_L(\u03b8, \u03a6), which is conditioned on the set X_L (unlike the unbounded set for the PMRF). This domain restriction is critical to formulating a tractable and reasonable distribution. Combined with a Poisson distribution on vector length L = ||x||_1, the LPMRF distribution can be a much more suitable distribution for documents than a Multinomial. The LPMRF distribution reduces to the standard Multinomial if there are no dependencies. However, if there are dependencies, then the distribution can be quite different from a Multinomial, as illustrated in Fig. 2 for an LPMRF with p = 2 and L fixed at either 10 or 20 words. After the original submission, we realized that for p = 2 the LPMRF model is the same as the multiplicative binomial generalization in [15]. Thus, the LPMRF model can be seen as a multinomial generalization (p \u2265 2) of the multiplicative binomial in [15].\n\nLPMRF Parameter Estimation Because the parametric form of the LPMRF model is the same as the form of the PMRF model and we primarily care about finding the correct dependencies, we use the PMRF estimation algorithm described in [14] to estimate \u03b8 and \u03a6. 
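For very small p, the log partition function A_L in Eq. (2) can be computed exactly by enumerating X_L, which is useful for sanity checks. A brute-force sketch (our own illustration; names and parameter values are hypothetical):

```python
import math

def support(p, L):
    # enumerate X_L = {x in Z_+^p : ||x||_1 = L}
    if p == 1:
        yield (L,)
        return
    for h in range(L + 1):
        for rest in support(p - 1, L - h):
            yield (h,) + rest

def log_unnorm(x, theta, Phi):
    # theta^T x + x^T Phi x - sum_s log(x_s!)
    p = len(x)
    val = sum(t * xs for t, xs in zip(theta, x))
    val += sum(Phi[s][t] * x[s] * x[t] for s in range(p) for t in range(p))
    return val - sum(math.lgamma(xs + 1) for xs in x)

def log_partition(theta, Phi, L):
    # A_L(theta, Phi) via brute-force log-sum-exp (tiny p only)
    vals = [log_unnorm(x, theta, Phi) for x in support(len(theta), L)]
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

theta, L = [0.3, -0.2, 0.1], 5
Phi = [[0.0, 0.05, -0.02], [0.05, 0.0, 0.0], [-0.02, 0.0, 0.0]]
A = log_partition(theta, Phi, L)
pmf = {x: math.exp(log_unnorm(x, theta, Phi) - A) for x in support(3, L)}
assert abs(sum(pmf.values()) - 1.0) < 1e-9  # properly normalized on X_L

# With Phi = 0 the LPMRF reduces to a Multinomial with p_s proportional
# to exp(theta_s), as noted in the text.
Phi0 = [[0.0] * 3 for _ in range(3)]
A0 = log_partition(theta, Phi0, L)
ps = [math.exp(t) for t in theta]
ps = [v / sum(ps) for v in ps]
for x in support(3, L):
    mult = math.exp(math.lgamma(L + 1) - sum(math.lgamma(xs + 1) for xs in x)) \
           * math.prod(q ** xs for q, xs in zip(ps, x))
    assert abs(math.exp(log_unnorm(x, theta, Phi0) - A0) - mult) < 1e-9
```

For realistic p, |X_L| grows combinatorially, which is why the paper turns to annealed importance sampling in Sec. 2.1.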
The algorithm in [14] approximates the likelihood with the pseudo-likelihood, performing \u21131-regularized nodewise Poisson regressions. The \u21131 regularization is important both for the sparsity of the dependencies and the computational efficiency of the algorithm. While the PMRF and LPMRF are different distributions, the pseudo-likelihood approximation for estimation provides good results, as shown in the results section. We present timing results to show the scalability of this algorithm in Sec. 5. Other parameter estimation methods would be an interesting area of future work.\n\n2.1 Likelihood and Log Partition Estimation\n\nUnlike previous work on the PMRF or TPMRF distributions, we develop a tractable approximation to the LPMRF log partition function (Eq. 2) so that we can compute approximate likelihood values. The likelihood of a model can be fundamentally important for hyperparameter optimization and model evaluation.\n\nLPMRF Annealed Importance Sampling First, we develop an LPMRF Gibbs sampler by considering the most common form of Multinomial sampling, namely taking the sum of a sequence of L Categorical variables. From this intuition, we sample one word at a time while holding all other words fixed. The probability that word w_\u2113 in the sequence takes value s, given all the other words, is proportional to exp(\u03b8_s + 2\u03a6_s^T x_\u2212\u2113), where x_\u2212\u2113 is the count vector of all other words. See the Appendix for the details of Gibbs sampling. Then, we derive an annealed importance sampler [16] from this Gibbs sampler by scaling the \u03a6 matrix for each successive distribution by a linear sequence starting at 0 and ending at 1 (i.e. \u03b3 = 0, . . . , 1). 
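The word-at-a-time Gibbs step just described can be sketched as follows (a simplified illustration assuming dense Python lists for \u03b8 and \u03a6; the paper's implementation is in C++ and the appendix gives the full details):

```python
import math
import random

def gibbs_sweep(words, theta, Phi, rng):
    # words: length-L list of word indices; resample each position in turn.
    p = len(theta)
    for l in range(len(words)):
        # counts of all other words, x_{-l}
        counts = [0] * p
        for j, w in enumerate(words):
            if j != l:
                counts[w] += 1
        # Pr(w_l = s | rest) proportional to exp(theta[s] + 2 * Phi[s].counts)
        logits = [theta[s] + 2 * sum(Phi[s][t] * counts[t] for t in range(p))
                  for s in range(p)]
        m = max(logits)  # subtract max for numerical stability
        ws = [math.exp(v - m) for v in logits]
        u, acc = rng.random() * sum(ws), 0.0
        for s, w in enumerate(ws):
            acc += w
            if u <= acc:
                words[l] = s
                break
    return words

rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
Phi = [[0.0, 0.3, 0.0],   # positive dependency between words 0 and 1
       [0.3, 0.0, 0.0],
       [0.0, 0.0, 0.0]]
words = [rng.randrange(3) for _ in range(10)]  # L = 10
for _ in range(200):
    gibbs_sweep(words, theta, Phi, rng)
```

In the annealed importance sampler, the same sweep is run under \u03b3\u03a6 for each \u03b3 in the annealing schedule, starting from an exact Multinomial draw at \u03b3 = 0.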
Thus, we start with a simple Multinomial sample from Pr(x | \u03b8, 0 \u00b7 \u03a6, L) = Pr_Mult(x | \u03b8, L) and then Gibbs sample from each successive distribution Pr_LPMRF(x | \u03b8, \u03b3\u03a6, L), updating the sample weight as defined in [16], until we reach the final distribution at \u03b3 = 1. From these weighted samples, we can compute an estimate of the log partition function [16].\n\nUpper Bound Using H\u00f6lder's inequality, a simple convex relaxation and the partition function of a Multinomial, an upper bound for the log partition function can be computed:\n\nA_L(\u03b8, \u03a6) \u2264 L^2 \u03bb_{\u03a6,1} + L log(\u2211_s exp \u03b8_s) \u2212 log(L!),\n\nwhere \u03bb_{\u03a6,1} is the maximum eigenvalue of \u03a6. See the Appendix for the full derivation. We simplify this upper bound by subtracting log(\u2211_s exp \u03b8_s) from \u03b8 (which does not change the distribution) so that the second term becomes 0. Then, neglecting the constant term \u2212log(L!) that does not interact with the parameters (\u03b8, \u03a6), the log partition function is upper bounded by a simple quadratic function w.r.t. L.\n\nWeighting \u03a6 for Different L For datasets in which L is observed for every sample but is not uniform\u2014such as document collections\u2014the log partition function will grow quadratically in L if there are any positive dependencies, as suggested by the upper bound. This causes long documents to have extremely small likelihood. Thus, we must modify \u03a6 as L gets larger to counteract this effect. We propose a simple modification that scales \u03a6 for each L: \u02dc\u03a6_L = \u03c9(L)\u03a6. In particular, we propose a sigmoidal weighting based on the Log-Logistic cumulative distribution function (CDF): \u03c9(L) = 1 \u2212 LogLogisticCDF(L | \u03b1_LL, \u03b2_LL). We set the \u03b2_LL parameter to 2 so that the tail is O(1/L^2), which eventually causes the upper bound to approach a constant. Letting \u00afL = (1/n) \u2211_i L_i be the mean instance length, we choose \u03b1_LL = c\u00afL for some small constant c. This choice of \u03b1_LL helps the weighting function scale appropriately for corpora of different average lengths.\n\nFinal Approximation Method for All L For our experiments, we approximate the log partition function value for all L in the range of the corpus. We use 100 AIS samples for 50 different test values of L linearly spaced between 0.5\u00afL and 3\u00afL so that we cover both small and large values of L. This gives a total of 5,000 annealed importance samples. We use the quadratic form of the upper bound U_a(L) = \u03c9(L) L^2 a (ignoring constants with respect to \u03a6) and find a constant a that upper bounds all 50 estimates:\n\na_max = max_L [\u03c9(L) L^2]^{\u22121} (\u02c6A_L(\u03b8, \u03a6) \u2212 L log(\u2211_s exp \u03b8_s) + log(L!)),\n\nwhere \u02c6A_L is an AIS estimate of the log partition function for the 50 test values of L. This gives a smooth approximation for all L that is greater than or equal to all individual estimates (figure of an example approximation in the Appendix).\n\nMixtures of LPMRF With an approximation to the likelihood, we can easily formulate an estimation algorithm for a mixture of LPMRFs using a simple alternating, EM-like procedure. First, given the cluster assignments, the LPMRF parameters can be estimated as explained above. Then, the best cluster assignments can be computed by assigning each instance to the highest-likelihood cluster. Extending the LPMRF to topic models requires more careful analysis, as described next.\n\n3 Generalizing Topic Models using Fixed-Length Distributions\n\nIn standard topic models like LDA, the distribution contains a unique topic variable for every word in the corpus. 
Essentially, this means that every word is actually drawn from a categorical distribution. However, this does not allow us to capture dependencies between words because only one word is being drawn at a time. Therefore, we need to reformulate LDA in a way that the words from a topic are sampled jointly from a Multinomial. From this reformulation, we can then simply replace the Multinomial with an LPMRF to obtain a topic model with LPMRF as the base distribution. Our reformulation of LDA groups the topic indicator variables for each word into k vectors corresponding to the k different topics. These k \u201ctopic indicator\u201d vectors z^j are then assumed to be drawn from a Multinomial with fixed length L = ||z^j||. This grouping of topic vectors yields an equivalent distribution because the topic indicators are exchangeable and independent of one another given the observed word and the document-topic distribution. This leads to the following generalization of topic models in which an observation x_i is the summation of k hidden variables z_i^j:\n\nGeneric Topic Model:\nw_i \u223c SimplexPrior(\u03b1)\nL_i \u223c LengthDistribution(\u00afL)\nm_i \u223c PartitionDistribution(w_i, L_i)\nz_i^j \u223c FixedLengthDist(\u03c6_j; ||z_i^j|| = m_i^j)\nx_i = \u2211_{j=1}^k z_i^j\n\nNovel LPMRF Topic Model:\nw_i \u223c Dirichlet(\u03b1)\nL_i \u223c Poisson(\u03bb = \u00afL)\nm_i \u223c Multinomial(p = w_i; N = L_i)\nz_i^j \u223c LPMRF(\u03b8^j, \u03a6^j; L = m_i^j)\nx_i = \u2211_{j=1}^k z_i^j.\n\nNote that this generalization of topic models does not require the partition distribution and the fixed-length distribution to be the same. In addition, other distributions could be substituted for the Dirichlet prior on document-topic distributions, like the logistic normal prior. 
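The generative process of the LPMRF topic model above can be sketched end to end. In this illustration (our own; all names are hypothetical) the per-topic draw z_i^j uses a Multinomial stand-in, which matches the LPMRF exactly only when \u03a6^j = 0; a true LPMRF draw would use the Gibbs sampler of Sec. 2.1:

```python
import math
import random

def sample_dirichlet(alpha, rng):
    # Dirichlet via normalized Gamma draws
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

def sample_multinomial(n, probs, rng):
    counts = [0] * len(probs)
    for _ in range(n):
        u, acc = rng.random(), 0.0
        chosen = len(probs) - 1  # fallback guards float rounding
        for s, ps in enumerate(probs):
            acc += ps
            if u <= acc:
                chosen = s
                break
        counts[chosen] += 1
    return counts

def generate_document(alpha, Lbar, topic_word_probs, rng):
    # w_i ~ Dirichlet(alpha); L_i ~ Poisson(Lbar);
    # m_i ~ Multinomial(w_i, L_i); z_i^j ~ fixed-length draw of length m_i^j
    # (Multinomial stand-in for the LPMRF draw); x_i = sum_j z_i^j.
    k, p = len(alpha), len(topic_word_probs[0])
    w = sample_dirichlet(alpha, rng)
    L, t = 0, math.exp(-Lbar)          # Poisson draw by inversion
    u, cum = rng.random(), math.exp(-Lbar)
    while u > cum:
        L += 1
        t *= Lbar / L
        cum += t
    m = sample_multinomial(L, w, rng)  # partition L among k topics
    x = [0] * p
    for j in range(k):
        z_j = sample_multinomial(m[j], topic_word_probs[j], rng)
        x = [a + b for a, b in zip(x, z_j)]
    return x, sum(x) == L

rng = random.Random(1)
topics = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
x, ok = generate_document([0.5, 0.5], 8.0, topics, rng)
assert ok  # the topic draws always partition the document length exactly
```

Note how the partition distribution (Multinomial here) and the fixed-length base distribution are separate components, as the generalization requires.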
Finally, this generalization allows for real-valued topic models for other types of data, although exploration of this is outside the scope of this paper.\n\nThis generalization is distinct from the topic model generalization termed \u201cadmixtures\u201d in [2]. Admixtures assume that each observation is drawn from an instance-specific base distribution whose parameters are a convex combination of the per-topic parameters. Thus, an admixture of LPMRFs could be formulated by assuming that each document, given the document-topic weights w_i, is drawn from an LPMRF(\u00af\u03b8_i = \u2211_j w_ij \u03b8^j, \u00af\u03a6_i = \u2211_j w_ij \u03a6^j; L = ||x_i||_1). Though this may be an interesting model in its own right and useful for further exploration in future work, this is not the same as the above proposed model because the distribution of x_i is not an LPMRF but rather a sum of independent LPMRFs. One case\u2014possibly the only case\u2014where these two generalizations of topic models intersect is when the distribution is a Multinomial (i.e. an LPMRF with \u03a6 = 0). As another distinction from APM, the LPMRF topic model directly generalizes LDA because the LPMRF in the above model reduces to a Multinomial if \u03a6 = 0. Fully exploring the differences between this topic model generalization and the admixture generalization is quite interesting but outside the scope of this paper.\n\nWith this formulation of LPMRF topic models, we can create a joint optimization problem to solve for the topic matrix Z_i = [z_i^1, z_i^2, . . . , z_i^k] for each document and for the shared LPMRF parameters \u03b8^{1...k}, \u03a6^{1...k}. The optimization is based on minimizing the negative log posterior:\n\narg min_{Z_{1...n}, \u03b8^{1...k}, \u03a6^{1...k}}  \u2212(1/n) \u2211_{i=1}^n \u2211_{j=1}^k log Pr_LPMRF(z_i^j | \u03b8^j, \u03a6^j, m_i^j) \u2212 \u2211_{i=1}^n log Pr_prior(m_i^{1...k}) \u2212 \u2211_{j=1}^k log Pr_prior(\u03b8^j, \u03a6^j)\n\ns.t. Z_i e = x_i, Z_i \u2208 Z_+^{k\u00d7p},\n\nwhere e is the all-ones vector. Notice that the observations x_i only show up in the constraints. The prior distribution on m_i^{1...k} can be related to the Dirichlet distribution as in LDA by taking Pr_prior(m_i^{1...k}) = Pr_Dir(m_i^j / \u2211_\u2113 m_i^\u2113 | \u03b1). Also, notice that the documents are all independent if the LPMRF parameters are known, so this optimization can be trivially parallelized.\n\nConnection to Collapsed Gibbs Sampling This optimization is very similar to collapsed Gibbs sampling for LDA [13]. Essentially, the key part of estimating the topic model is estimating the topic indicators for each word in the corpus. The model parameters can then be estimated directly from these topic indicators. In the case of LDA, the Multinomial parameters are trivial to estimate by merely keeping track of counts, and thus the parameters can be updated in constant time for every topic resampled. This also suggests that an interesting area of future work would be to understand the connections between collapsed Gibbs sampling and this optimization problem. It may be possible to use this optimization problem to speed up Gibbs sampling convergence or to provide a MAP phase after Gibbs sampling to get non-random estimates.\n\nEstimating Topic Matrices Z_{1...n} For LPMRF topic models, the estimation of the LPMRF parameters given the topic assignments requires solving another complex optimization problem. Thus, we pursue an alternating EM-like scheme as in LPMRF mixtures. First, we estimate the LPMRF parameters with the PMRF algorithm from [14], and then we optimize the topic matrix Z_i for each document. 
Because of the constraints on Z_i, we pursue a simple dual coordinate descent procedure. We select two coordinates in row r of Z_i and determine whether the optimization problem can be improved by moving a words from topic \u2113 to topic q. Thus, we only need to solve a series of simple univariate problems. Each univariate problem has only x_is possible solutions, and thus if the max count of words in a document is bounded by a constant, the univariate subproblems can be solved efficiently. More formally, we are seeking a step size a such that \u02c6Z_i = Z_i + a e_r e_\u2113^T \u2212 a e_r e_q^T gives a better optimization value than Z_i. If we remove constant terms w.r.t. a, we arrive at the following univariate optimization problem (suppressing dependence on i because each of the n subproblems is independent):\n\narg min_{\u2212z_r^\u2113 \u2264 a \u2264 z_r^q}  \u2212a[\u03b8_r^\u2113 \u2212 \u03b8_r^q + 2 z_\u2113^T \u03a6_r^\u2113 \u2212 2 z_q^T \u03a6_r^q] + [log((z_r^\u2113 + a)!) + log((z_r^q \u2212 a)!)] + A_{m^\u2113+a}(\u03b8^\u2113, \u03a6^\u2113) + A_{m^q\u2212a}(\u03b8^q, \u03a6^q) \u2212 log Pr_prior(\u02dcm^{1...k}),\n\nwhere \u02dcm is the new distribution of lengths based on the step size a. The first term contains the linear and quadratic terms from the sufficient statistics. The second term is the change in base measure if a word is moved. The third term is the difference in log partition function if the lengths of the topic vectors change. Note that the log partition function can be precomputed, so it merely costs a table lookup. The prior also only requires a simple calculation to update. Thus, the main computation comes in the inner product z_\u2113^T \u03a6_r^\u2113. However, this inner product can be maintained and updated efficiently so that it does not significantly affect the running time.\n\n4 Perplexity Experiments\n\nWe evaluated our novel LPMRF model using perplexity on a held-out test set of documents from a corpus composed of research paper abstracts3 denoted Classic3 and a collection of Wikipedia documents. The Classic3 dataset has three distinct topic areas: medical (Medline, 1033 documents), library information sciences (CISI, 1460) and aerospace engineering (CRAN, 1400).\n\n3http://ir.dcs.gla.ac.uk/resources/test_collections/\n\nExperimental Setup We train all the models using a 90% training split of the documents and compute the held-out perplexity on the remaining 10%, where perplexity is equal to exp(\u2212L(X_test | \u03b8^{1...k}, \u03a6^{1...k}) / N_test), where L is the log likelihood and N_test is the total number of words in the test set. We evaluate single, mixture and topic models with both the Multinomial and the LPMRF as the base distribution at k = {1, 3, 10, 20}. The topic indicator matrices Z_i for the test set are estimated by fitting a MAP-based estimate while holding the topic parameters \u03b8^{1...k}, \u03a6^{1...k} fixed.4 For a single Multinomial or LPMRF, we set the smoothing parameter \u03b2 to 10^\u22124. 5 We select the LPMRF models using all combinations of 20 log-spaced \u03bb between 1 and 10^\u22123, and 5 linearly spaced weighting function constants c between 1 and 2 for the weighting function described in Sec. 2.1. In order to compare our algorithms with LDA, we also provide perplexity results using an LDA Gibbs sampler [13] for MATLAB 6 to estimate the model parameters. For LDA, we used 2000 iterations and optimized the hyperparameters \u03b1 and \u03b2 using the likelihood of a tuning set. 
We do not seek to compare with many other topic models because many of them use the Multinomial as a base distribution, which could be replaced by an LPMRF; rather, we simply focus on simple representative models.7\n\nResults The perplexity results for all models can be seen in Fig. 3. Clearly, a single LPMRF significantly outperforms a single Multinomial on the test dataset for both the Classic3 and Wikipedia datasets. The LPMRF model outperforms the simple Multinomial mixtures and topic models in all cases. This suggests that the LPMRF model could be an interesting replacement for the Multinomial in more complex models. For a small number of topics, LPMRF topic models also outperform Gibbs-sampling LDA but do not perform as well for larger numbers of topics. This is likely due to the well-developed sampling methods for learning LDA. Exploring the possibility of incorporating sampling into the fitting of the LPMRF topic model is an excellent area of future work. We believe the LPMRF shows significant promise for replacing the Multinomial in various probabilistic models.\n\nFigure 3: (Left) The LPMRF model quite significantly outperforms the Multinomial for both datasets. (Right) The LPMRF model outperforms the simple Multinomial model in all cases. For a small number of topics, LPMRF topic models also outperform Gibbs-sampling LDA but do not perform as well for larger numbers of topics.\n\nQualitative Analysis of LPMRF Parameters In addition to the perplexity analysis, we present the top words, top positive dependencies and top negative dependencies for the LPMRF topic model in Table 1. Notice that in LDA, only the top words are available for analysis, but an LPMRF topic model can produce intuitive dependencies. 
For example, the positive dependency \u201clanguage+natural\u201d is composed of two words that often co-occur in the library sciences, but each word independently does not occur very often in comparison to \u201cinformation\u201d and \u201clibrary\u201d. The positive dependency \u201cstress+reaction\u201d suggests that some of the documents in the Medline dataset likely refer to inducing stress on a subject and measuring the reaction. Likewise, in the aerospace topic, the positive dependency \u201cnon+linear\u201d suggests that non-linear equations are important in aerospace. Notice that these concepts could not be discovered with a standard Multinomial-based topic model.\n\n4For topic models, the likelihood computation is intractable if averaging over all possible Z_i. Thus, we use a MAP simplification primarily for computational reasons to compare models without computationally expensive likelihood estimation.\n\n5For the LPMRF, this merely means adding 10^\u22124 to the y-values of the nodewise Poisson regressions.\n\n6http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm\n\n7We could not compare to APM [2, 14] because it is not computationally tractable to calculate the likelihood of a test instance in APM, and thus we cannot compute perplexity.\n\n[Figure 3 data: single-model test perplexity. Classic3: Mult 47, LPMRF 34. Wikipedia: Mult 9.3, LPMRF 7.9.]\n\nTable 1: Top Words and Dependencies for LPMRF Topic Model\n\n5 Timing and Scalability\n\nFinally, we explore the practical performance of our algorithms. In C++, we implemented the three core algorithms: fitting p Poisson regressions, fitting the n topic matrices for each document, and sampling 5,000 AIS samples. The timing for each of these components respectively can be seen in Fig. 4 for the Wikipedia dataset. 
We set λ = 1 in the first two experiments, which yields roughly 20,000 non-zeros, and varied λ for the third experiment. Each of the components is trivially parallelized using OpenMP (http://openmp.org/). All timing experiments were conducted on the TACC Maverick system with Intel Xeon E5-2680 v2 Ivy Bridge CPUs (2.80 GHz), 20 CPUs per node, and 12.8 GB memory per CPU (https://www.tacc.utexas.edu/). The scaling is generally linear in the parameters except for fitting the topic matrices, which is O(k^2). For the AIS sampling, the scaling is linear in the number of non-zeros in Φ irrespective of p. Overall, we believe our implementations provide both good scaling and practical performance (code available online).

Figure 4: (Left) The timing for fitting p Poisson regressions shows an empirical scaling of O(np). (Middle) The timing for fitting topic matrices empirically shows scaling that is O(npk^2). (Right) The timing for AIS sampling is approximately linear in the number of non-zeros in Φ irrespective of p.

6 Conclusion
We motivated the need for a distribution more flexible than the Multinomial, such as the Poisson MRF. However, the PMRF distribution has several complications due to its normalization that hinder it from being a general-purpose model for count data. We overcome these difficulties by restricting the domain to a fixed length as in a Multinomial while retaining the parametric form of the Poisson MRF. By parameterizing by the length of the document, we can then efficiently compute sampling-based estimates of the log partition function and hence the likelihood, which were not tractable to compute under the PMRF model. We extend the LPMRF distribution to both mixtures and topic models by generalizing topic models using fixed-length distributions, and we develop parameter estimation methods using dual coordinate descent.
We evaluate the perplexity of the proposed LPMRF models on the Classic3 and Wikipedia datasets and show that they offer good performance compared to Multinomial-based models. Finally, we show that our algorithms are fast and have good scaling. Potential new areas could be explored, such as the relation between the topic matrix optimization method and Gibbs sampling. It may be possible to develop sampling-based methods for the LPMRF topic model similar to Gibbs sampling for LDA. In general, we suggest that the LPMRF model could open up new avenues of research where the Multinomial distribution is currently used.

Acknowledgments

This work was supported by NSF (DGE-1110007, IIS-1149803, IIS-1447574, DMS-1264033, CCF-1320746) and ARO (W911NF-12-1-0390).

Table 1:

Topic 1
Top words: information, library, research, system, libraries, book, systems, data, use, scientific
Top Pos. Edges: states+united, point+view, test+tests, primary+secondary, recall+precision, dissemination+sdi, direct+access, language+natural, years+five, term+long
Top Neg. Edges: paper-book, libraries-retrieval, library-chemical, libraries-language, system-published, information-citations, information-citation, chemical-document, library-scientists, library-scientific

Topic 2
Top words: patients, cases, normal, cells, treatment, children, found, results, blood, disease
Top Pos. Edges: term+long, positive+negative, cooling+hypothermi, system+central, atmosphere+height, function+functions, methods+suitable, stress+reaction, low+rates, case+report
Top Neg. Edges: cells-patient, patients-animals, patients-rats, hormone-protein, growth-parathyroid, patients-lens, patients-mice, patients-dogs, hormone-tumor, patients-child

Topic 3
Top words: flow, pressure, boundary, results, theory, method, layer, given, number, presented
Top Pos. Edges: supported+simply, account+taken, agreement+good, moment+pitching, non+linear, lower+upper, tunnel+wind, time+dependent, level+noise, purpose+note
Top Neg. Edges: flow-shells, number-numbers, flow-shell, wing-hypersonic, solutions-turbulent, mach-reynolds, flow-stresses, theoretical-drag, general-buckling, made-conducted
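For reference, the annealed importance sampling estimator of the log partition function (following Neal [16]) can be sketched generically. The Gaussian base distribution, linear temperature schedule, and Metropolis kernel below are illustrative choices of ours, not the paper's implementation, which anneals toward the LPMRF.

```python
import numpy as np

def ais_log_partition(log_f, d, n_samples=300, n_temps=60, step=0.4, seed=0):
    """Estimate log Z = log ∫ exp(log_f(x)) dx by annealed importance
    sampling (Neal, 2001): anneal from a standard Gaussian base, whose
    normalizer is known, toward the unnormalized target log_f.
    log_f maps an (n, d) array of points to an (n,) array of log values."""
    rng = np.random.default_rng(seed)
    log_base = lambda x: -0.5 * np.sum(x * x, axis=1)  # unnormalized N(0, I)
    log_z_base = 0.5 * d * np.log(2.0 * np.pi)         # its known normalizer
    betas = np.linspace(0.0, 1.0, n_temps)
    x = rng.standard_normal((n_samples, d))
    log_w = np.zeros(n_samples)
    for b_prev, b in zip(betas[:-1], betas[1:]):
        # importance weight increment between adjacent temperatures
        log_w += (b - b_prev) * (log_f(x) - log_base(x))
        # one Metropolis step targeting the intermediate distribution
        lp = (1 - b) * log_base(x) + b * log_f(x)
        prop = x + step * rng.standard_normal(x.shape)
        lp_prop = (1 - b) * log_base(prop) + b * log_f(prop)
        accept = np.log(rng.random(n_samples)) < lp_prop - lp
        x[accept] = prop[accept]
    # average the weights in log space (numerically stable logsumexp)
    m = log_w.max()
    return log_z_base + m + np.log(np.mean(np.exp(log_w - m)))
```

A sketch like this can be sanity-checked on a toy target whose normalizer is known in closed form, e.g. an unnormalized Gaussian, before pointing it at a distribution such as the LPMRF.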
References

[1] E. Yang, P. Ravikumar, G. I. Allen, and Z. Liu, “Graphical models via generalized linear models,” in NIPS, pp. 1367–1375, 2012.
[2] D. I. Inouye, P. Ravikumar, and I. S. Dhillon, “Admixture of Poisson MRFs: A Topic Model with Word Dependencies,” in International Conference on Machine Learning (ICML), pp. 683–691, 2014.
[3] T. Hofmann, “Probabilistic latent semantic analysis,” in Uncertainty in Artificial Intelligence (UAI), pp. 289–296, Morgan Kaufmann Publishers Inc., 1999.
[4] D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet allocation,” JMLR, vol. 3, pp. 993–1022, 2003.
[5] D. Blei, “Probabilistic topic models,” Communications of the ACM, vol. 55, pp. 77–84, Nov. 2012.
[6] E. Yang, P. Ravikumar, G. Allen, and Z. Liu, “On Poisson graphical models,” in NIPS, pp. 1718–1726, 2013.
[7] J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, and D. Blei, “Reading tea leaves: How humans interpret topic models,” in NIPS, 2009.
[8] D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum, “Optimizing semantic coherence in topic models,” in EMNLP, pp. 262–272, 2011.
[9] D. Newman, Y. Noh, E. Talley, S. Karimi, and T. Baldwin, “Evaluating topic models for digital libraries,” in ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 215–224, 2010.
[10] N. Aletras and M. Stevenson, “Evaluating Topic Coherence Using Distributional Semantics,” in International Conference on Computational Semantics (IWCS 2013) – Long Papers, pp. 13–22, 2013.
[11] D. Mimno and D. Blei, “Bayesian Checking for Topic Models,” in EMNLP, pp. 227–237, 2011.
[12] R. Nallapati, A. Ahmed, W. Cohen, and E. Xing, “Sparse word graphs: A scalable algorithm for capturing word correlations in topic models,” in ICDM, pp. 343–348, 2007.
[13] M. Steyvers and T. Griffiths, “Probabilistic topic models,” in Latent Semantic Analysis: A Road to Meaning, pp. 424–440, 2007.
[14] D. I. Inouye, P. K. Ravikumar, and I. S. Dhillon, “Capturing Semantically Meaningful Word Dependencies with an Admixture of Poisson MRFs,” in NIPS, pp. 3158–3166, 2014.
[15] P. M. E. Altham, “Two Generalizations of the Binomial Distribution,” Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 27, no. 2, pp. 162–167, 1978.
[16] R. M. Neal, “Annealed importance sampling,” Statistics and Computing, vol. 11, no. 2, pp. 125–139, 2001.