{"title": "Precision-Recall Balanced Topic Modelling", "book": "Advances in Neural Information Processing Systems", "page_first": 6750, "page_last": 6759, "abstract": "Topic models are becoming increasingly relevant probabilistic models for dimensionality reduction of text data, inferring topics that capture meaningful themes of frequently co-occurring terms. We formulate topic modelling as an information retrieval task, where the goal is, based on the latent topic representation, to capture relevant term co-occurrence patterns. We evaluate performance for this task rigorously with regard to two types of errors, false negatives and positives, based on the well-known precision-recall trade-off and provide a statistical model that allows the user to balance between the contributions of the different error types. When the user focuses solely on the contribution of false negatives ignoring false positives altogether our proposed model reduces to a standard topic model. Extensive experiments demonstrate the proposed approach is effective and infers more coherent topics than existing related approaches.", "full_text": "Precision-Recall Balanced Topic Modelling\n\nSeppo Virtanen\n\nUniversity of Cambridge\n\nsjv35@cam.ac.uk\n\nUniversity of Cambridge and The Alan Turing Institute\n\nMark Girolami\n\nmag92@cam.ac.uk\n\nAbstract\n\nTopic models are becoming increasingly relevant probabilistic models for dimen-\nsionality reduction of text data, inferring topics that capture meaningful themes of\nfrequently co-occurring terms. We formulate topic modelling as an information\nretrieval task, where the goal is, based on the latent topic representation, to cap-\nture relevant term co-occurrence patterns. 
We evaluate performance for this task rigorously with regard to two types of errors, false negatives and false positives, based on the well-known precision-recall trade-off, and provide a statistical model that allows the user to balance the contributions of the two error types. When the user focuses solely on the contribution of false negatives, ignoring false positives altogether, our proposed model reduces to a standard topic model. Extensive experiments demonstrate that the proposed approach is effective and infers more coherent topics than existing related approaches.\n\n1 Introduction\n\nTopic models are ubiquitous probabilistic models for text data, suitable for corpus exploration and summarisation as well as for predictive tasks (Blei et al., 2003). The inferred topics are deemed to be useful and meaningful for human interpretation. Accordingly, there is a strong need to develop inexpensive quantitative evaluation methods that assess the quality of the inferred topics efficiently and accurately, because human-based evaluations are slow and laborious.\n\nMimno et al. (2011) present a useful data-based quantitative criterion for measuring the quality of topics. The measure relies on pairwise word co-occurrence statistics computed efficiently over the corpus and agrees well with human-based evaluations of topical quality. Wallach et al. (2009) present evaluation methods based on predictive performance (held-out data likelihood). However, Chang et al. (2009) demonstrate with large-scale human-based evaluations that predictive likelihood may not be a useful criterion; models with better predictive ability may infer less semantically meaningful topics. This finding undermines the core modelling assumptions, complicating the development of human-interpretable models. 
Even though many authors (Arora et al., 2012; AlSumait et al., 2009; Griffiths et al., 2004; Minka and Lafferty, 2002; Teh and Jordan, 2010) have proposed particular topic model variants based on different modelling assumptions, empirically reporting improved topic coherences, these assumptions cannot all be interpreted or justified within a robust quantitative evaluation framework.\n\nIn this work, we formulate topic modelling as a novel information retrieval task, where the goal is to retrieve recurring word co-occurrence patterns based on the latent topic representation. We quantify task performance in terms of two types of errors, false negatives (referred to as misses) and false positives, measured via the concepts of recall and precision, respectively. We present a novel topic model that allows the user to trade off the contributions of the two error types efficiently, and show that taking precision into account also significantly improves topic quality. We show that standard topic models emphasise recall, penalising only misses, at the expense of discarding precision altogether, not taking false positives into account.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe paper is structured as follows. Section 2 provides sufficient background on topic modelling and shows that standard topic models emphasise the contribution of misses. In Section 3, we first formulate topic modelling as an information retrieval task and present formulations of recall and precision suitable for the task. We then present a novel model that is able to balance between recall and precision, accompanied by an efficient inference algorithm. Section 3.3 discusses relevant related work, Section 4 contains experiments and results, and Section 5 concludes the paper.\n\nNotation: Consider M documents y_m, where m = 1, ..., M, such that y_{m,d}, where d = 1, ...
, D, denotes the frequency of the dth term in the vocabulary for the mth document. We denote the N_m individual words of the mth document as w_{m,n} ∈ {1, ..., D}, where n = 1, ..., N_m, and note that y_{m,d} = Σ_{n=1}^{N_m} I[w_{m,n} = d], where I[·] denotes the indicator function, taking value one if the argument is true and zero otherwise.\n\n2 Topic models are recall-biased\n\nStandard topic models, prominently Latent Dirichlet Allocation (LDA; Blei et al., 2003), assume the multinomial likelihood\n\nL_m = Π_{d=1}^{D} q_{m,d}^{y_{m,d}},\n\nwhere q_m ∈ ∆^D denotes an unknown expectation parameter of the multinomial distribution, satisfying q_{m,d} ≥ 0 and Σ_{d=1}^{D} q_{m,d} = 1. The goal of topic modelling is to infer, based on the corpus, a set of K topics capturing a lower dimensional representation suitable for summarisation and prediction tasks. Topic models assume the expectations q_m decompose linearly as\n\nq_m = Σ_k η_k θ_{m,k},\n\nwhere η_k ∈ ∆^D, for k = 1, ..., K, correspond to the topics and θ_m ∈ ∆^K to the topic proportions.\n\nWe define an empirical word occurrence distribution over the vocabulary for the mth document,\n\np_{m,d} = y_{m,d}/N_m,\n\nnoting that q_m should be similar to p_m, for m = 1, ..., M. Because the decomposition is unidentifiable, similarities need to be computed between p_m and q_m. Naturally, inferring q_m closer to p_m leads to more accurate topics.\n\nThe mean multinomial log likelihood,\n\n(1/N_m) log L_m = (1/N_m) Σ_d y_{m,d} log q_{m,d} = Σ_d p_{m,d} log q_{m,d},\n\nrelates to the KL-divergence between the empirical and latent word distributions,\n\nKL(p_m, q_m) = Σ_d (p_{m,d} log p_{m,d} − p_{m,d} log q_{m,d}) = H_m − (1/N_m) log L_m,   (1)\n\nwhere H_m is the negative entropy of p_m. 
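Identity (1) is easy to check numerically; a minimal stdlib-only Python sketch, with helper names of our own choosing:

```python
import math

def kl(p, q):
    """Directed KL divergence KL(p, q), summed over the support of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_log_likelihood(y, q):
    """Mean multinomial log-likelihood (1/N_m) log L_m = (1/N_m) sum_d y_d log q_d."""
    n = sum(y)
    return sum(yd * math.log(qd) for yd, qd in zip(y, q) if yd > 0) / n

# Toy document: term counts y_m and a latent multinomial q_m.
y = [3, 2, 1, 0]
q = [0.4, 0.3, 0.2, 0.1]
p = [yd / sum(y) for yd in y]                     # empirical distribution p_m

H = sum(pi * math.log(pi) for pi in p if pi > 0)  # negative entropy H_m
# Identity (1): KL(p_m, q_m) = H_m - (1/N_m) log L_m
assert abs(kl(p, q) - (H - mean_log_likelihood(y, q))) < 1e-12
```

Maximising the mean likelihood over q_m is therefore equivalent to minimising KL(p_m, q_m).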
The asymmetric KL-divergence (1) provides a similarity measure between the empirical and latent distributions that is sensitive to the contribution of misses, corresponding to terms for which p_m is large but the corresponding q_m is small, and thus closely relates to the concept of recall. The model may infer dense and spurious topics, because q_m must be non-zero for all p_m > 0, proportionally to the actual counts. Even though these topics emphasise recall, they may have very low precision, containing intruder terms that capture false similarities.\n\n3 Information retrieval aspect\n\nWe formulate topic modelling as an information retrieval task: based on the retrieval model q_m, the goal is to retrieve co-occurring terms. Here, the p_m represent relevances (that is, empirical co-occurrences) and q_m should be similar to p_m, avoiding errors. We characterise two classes of errors, misses and false positives: terms for which p_m is large but q_m is small correspond to misses, and terms for which q_m is large but p_m is small correspond to false positives. Naturally, the concepts of recall and precision may be quantified with the directed KL-divergences, because KL(p_m, q_m) emphasises misses and the reversed divergence KL(q_m, p_m) emphasises false positives. Both measures, as divergences in general, are non-negative and lower bounded by zero, with equality if and only if the arguments are equal. Mean divergences over documents may be used to assess performance for the corpus.\n\n3.1 Connections to precision and recall for binary relevances\n\nIt is useful to consider maximum entropy distributions for p and q to further illustrate connections between the KL-divergences and standard recall and precision, which are suitable for binary relevances (a term is or is not relevant). 
The maximum entropy distributions, denoted p* and q*, take uniform values over the supports of the distributions, denoted P and Q, respectively, whereas the remaining values are (arbitrarily close to) zero. In the following, we denote these zero-probabilities with a very small positive number ε, 1 ≫ ε ≈ 0, noting that log ε ≪ 0.\n\nKL(p*, q*) consists of the negative entropy of p*, Σ_i p*_i log p*_i = −log |P|, where |·| denotes set cardinality, and the cross-divergence, Σ_i p*_i log q*_i, which further decomposes into |P ∩ Q| true positives with weight (1/|P|) log(1/|Q|), |(P ∪ Q)^c| true negatives¹ with weight ε log ε, |Q| − |P ∩ Q| false positives with weight ε log(1/|Q|), and |P| − |P ∩ Q| misses with weight (1/|P|) log ε. Because ε ≈ 0, the divergence is dominated by the misses, terms that are in P (relevant) but not in Q (retrieved). Thus,\n\nKL(p*, q*) = C + (|P ∩ Q|/|P|) log ε,   (2)\n\nwhere C contains the remaining expressions, is proportional to standard recall, the proportion of relevant terms that are retrieved.\n\nOn the other hand, KL(q*, p*) consists of the negative entropy of q*, −log |Q|, and the cross-divergence, Σ_i q*_i log p*_i, which decomposes into |P ∩ Q| true positives with weight (1/|Q|) log(1/|P|), |(P ∪ Q)^c| true negatives with weight ε log ε, |Q| − |P ∩ Q| false positives with weight (1/|Q|) log ε, and |P| − |P ∩ Q| misses with weight ε log(1/|P|). Following similar reasoning as above, the divergence is dominated by the false positives, terms that are in Q (retrieved) but not in P (relevant). 
Thus,\n\nKL(q*, p*) = C + (|P ∩ Q|/|Q|) log ε,   (3)\n\nis proportional to standard precision, the proportion of retrieved terms that are relevant.\n\nBecause of the connections (2) and (3), we may interpret the directed divergences as generalisations of the concepts of recall and precision to continuously-valued grades of relevance.\n\n3.2 Precision-recall balanced topic model\n\nFollowing the well-known precision-recall trade-off, we present a new model that is able to compromise between the contributions of misses and false positives, both capturing recurring word co-occurrence patterns and avoiding false similarities. We generalise over standard topic models, which are only able to account for misses.\n\nOur model is based on the K-divergence (Lin, 1991),\n\nK(p_m, q_m) = Σ_d p_{m,d} log [ p_{m,d} / ((1 − λ) q_{m,d} + λ p_{m,d}) ],   (4)\n\nwhere 0 < λ < 1 is a user-defined parameter. In the following, we show that λ intuitively trades off the balance between recall and precision; Section 4 further establishes experimental evidence supporting this property. We also show that inference can be carried out efficiently for this divergence.\n\nThe K-divergence equals zero if and only if p_m = q_m and is both lower and upper bounded,\n\n0 ≤ K(p_m, q_m) ≤ −log λ.\n\nThe K-divergence is always well-defined for all values of q_m ∈ ∆; this is especially relevant at the boundaries of ∆. 
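The boundedness of (4) can be verified directly; a small stdlib-only Python sketch (helper names are ours), using the extreme case where q places no mass on the support of p:

```python
import math

def k_divergence(p, q, lam):
    """K-divergence (4): sum_d p_d log( p_d / ((1 - lam) q_d + lam p_d) )."""
    return sum(pd * math.log(pd / ((1 - lam) * qd + lam * pd))
               for pd, qd in zip(p, q) if pd > 0)

lam = 0.1
p = [0.5, 0.5, 0.0, 0.0]   # empirical distribution
q = [0.0, 0.0, 0.5, 0.5]   # worst case: disjoint support from p

# KL(p, q) would be infinite here, but K(p, q) attains its upper bound -log(lam).
kd = k_divergence(p, q, lam)
assert abs(kd - (-math.log(lam))) < 1e-12
assert k_divergence(p, p, lam) == 0            # zero iff p = q
assert 0 <= kd <= -math.log(lam) + 1e-12
```

The disjoint-support case illustrates why the penalty for a miss stays finite under the K-divergence.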
Consequently, the K-divergence (4) is not as sensitive to misses as KL(p_m, q_m), which approaches infinity close to the boundaries, imposing an infinite penalty for misses, essentially acting as a barrier function.\n\nWe note that for the maximum entropy distributions, as considered in Section 3.1, the K-divergence becomes, for ε → 0,\n\nK(p*, q*) = −(|P ∩ Q|/|P|) log( 1 + λ̂ (|P ∩ Q|/|Q|) (|P ∩ Q|/|P|)^{−1} ) − log λ,\n\nwhere λ̂ = (1 − λ)/λ.\n\n¹Upper index (·)^c stands for set complement.\n\nApplying the logarithmic inequality, x/(x + 1) < log(1 + x) < x for x > −1 and x ≠ 0, we further notice that the first expression on the right hand side of the divergence is bounded between the weighted harmonic mean of precision and recall and the weighted precision,\n\n(1 − λ) / [ λ (|P ∩ Q|/|Q|)^{−1} + (1 − λ) (|P ∩ Q|/|P|)^{−1} ] < (|P ∩ Q|/|P|) log( 1 + λ̂ (|P ∩ Q|/|Q|) (|P ∩ Q|/|P|)^{−1} ) < λ̂ (|P ∩ Q|/|Q|).\n\nFor λ close to zero, the divergence emphasises recall, whereas for increasing λ it also takes precision into account.\n\nWe complement the topic model with a mixture of the q_m and a document-specific distribution b_m ∈ ∆^D. For b_m = p_m, the corresponding likelihood for y_m is\n\nL^K_m = Π_d ((1 − λ) q_{m,d} + λ b_{m,d})^{y_{m,d}},\n\nand we note that the likelihood is connected to the K-divergence,\n\n(1/N_m) log L^K_m = H_m − K(p_m, q_m).\n\nIn order to retain the properties of the K-divergence suitable for the information retrieval setting considered, we assume b_m = p_m, estimating b_m based on the observed counts. 
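The likelihood-divergence connection above can also be checked numerically; a stdlib-only Python sketch under the b_m = p_m assumption, with our own helper names:

```python
import math

def k_divergence(p, q, lam):
    """K-divergence (4) over the support of p."""
    return sum(pd * math.log(pd / ((1 - lam) * qd + lam * pd))
               for pd, qd in zip(p, q) if pd > 0)

def mean_log_lik_K(y, q, lam):
    """(1/N_m) log L^K_m with b_m = p_m:
    (1/N_m) sum_d y_d log((1 - lam) q_d + lam p_d)."""
    n = sum(y)
    p = [yd / n for yd in y]
    return sum(yd * math.log((1 - lam) * qd + lam * pd)
               for yd, qd, pd in zip(y, q, p) if yd > 0) / n

y = [4, 2, 0, 1]
q = [0.1, 0.2, 0.3, 0.4]
lam = 0.1
p = [yd / sum(y) for yd in y]
H = sum(pd * math.log(pd) for pd in p if pd > 0)   # negative entropy H_m

# (1/N_m) log L^K_m = H_m - K(p_m, q_m)
assert abs(mean_log_lik_K(y, q, lam) - (H - k_divergence(p, q, lam))) < 1e-12
```

Maximising the mixture likelihood is thus equivalent to minimising the K-divergence, mirroring the LDA case in Section 2.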
We emphasise that even though we may not generate data from the prior distribution, we may use the predictive and posterior distributions as usual. There is little or no need in practice to generate data from the prior distribution, and all inferences condition on the observed data.\n\nTo carry out inference, we apply an MCMC framework in an empirical Bayesian setting, following Casella (2001), employing the empirical distributions as well as introducing prior distributions for the topics and the topic proportions. We prefer MCMC over approximate VB or EP approaches, which fail to address the true posterior distribution.\n\nWe introduce i) word-specific binary assignment variables\n\nx_{m,n} ∼ Bernoulli(λ),\n\nfor m = 1, ..., M and n = 1, ..., N_m, to indicate whether w_{m,n} is explained by q_m or p_m, and ii) categorical topic assignment variables c_{m,n} ∈ {1, ..., K} for words that are generated based on q_m. When x_{m,n} = 0, which happens with probability 1 − λ, the word is explained by q_m as in standard topic models. Given the word assignment\n\nc_{m,n} ∼ Categorical(θ_m),\n\nthe word is generated from the c_{m,n}th topic,\n\nw_{m,n} ∼ Categorical(η_{c_{m,n}}).\n\nTo complete the model description, we assume\n\nη_k ∼ Dirichlet(γ1),   θ_m ∼ Dirichlet(α),\n\nwhere γ and α_k, for k = 1, ..., K, denote parameters of the Dirichlet distributions.\n\nWe present a collapsed Gibbs sampling algorithm building on Griffiths and Steyvers (2004) to carry out posterior computations efficiently. We jointly sample the two types of assignment variables. 
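For illustration only, the word-level mixture above can be sketched as a forward sampler; note that in the actual model b_m = p_m is estimated from the data, so this is not a proper generative prior. A stdlib-only Python sketch with hypothetical toy parameters:

```python
import random

random.seed(0)

def generate_document(eta, theta, p_emp, lam, n_words):
    """Sample words: with probability lam a word comes from the empirical
    distribution p_m (x = 1); otherwise a topic c ~ Categorical(theta) is
    drawn and the word from eta[c] (x = 0)."""
    terms = range(len(p_emp))
    words = []
    for _ in range(n_words):
        if random.random() < lam:                               # x_{m,n} = 1
            words.append(random.choices(terms, weights=p_emp)[0])
        else:                                                   # x_{m,n} = 0
            k = random.choices(range(len(theta)), weights=theta)[0]
            words.append(random.choices(terms, weights=eta[k])[0])
    return words

# Hypothetical toy parameters (K = 2 topics, D = 4 terms).
eta = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]  # topics eta_k
theta = [0.5, 0.5]                                   # proportions theta_m
p_emp = [0.25, 0.25, 0.25, 0.25]                     # stands in for p_m
w = generate_document(eta, theta, p_emp, lam=0.1, n_words=100)
assert len(w) == 100 and set(w) <= {0, 1, 2, 3}
```

Setting lam=0 recovers the standard LDA generative step for each word.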
The probability that w_{m,n} = d is assigned to the kth topic is\n\np(c_{m,n} = k, x_{m,n} = 0) ∝ [ (G^{−(w_{m,n})}_{k,d} + γ) / (Σ_{d'} G^{−(w_{m,n})}_{k,d'} + γD) ] × [ (N^{−(w_{m,n})}_{k,m} + α_k) / (Σ_{k'} N^{−(w_{m,n})}_{k',m} + Σ_{k'} α_{k'}) ],\n\nand the probability that the term is explained by the empirical distribution is\n\np(x_{m,n} = 1) ∝ [λ/(1 − λ)] p_{m,d}.\n\nHere the upper index (·)^{−(w_{m,n})} denotes discarding the contribution of the current word from the topic-document and topic-term count matrices, denoted N_{k,m} and G_{k,d}, respectively. Each sampling step updates all the assignments. The algorithm has little additional computational load compared to collapsed Gibbs sampling for LDA, obtained when setting λ = 0, because the empirical distributions may be cached.\n\nSpatio-temporal extension: In addition to text data, we also demonstrate the model on crime data. Here, the terms correspond to crime occurrences within an area, and documents collect occurrences in non-overlapping time windows. Accordingly, to impose smoothness, we modify only the priors for θ_m and η_k; the topics may be interpreted as crime maps, where more mass is assigned to areas with higher crime rates. We introduce β_k ∼ Normal(0, Q^{−1}), for k = 1, ..., K, and use η_k ∝ exp(β_k)². The off-diagonal elements of Q take value −δ for two neighbouring areas and zero otherwise, and the diagonal contains the total number of neighbours for each area, plus an additive constant κ > 0, multiplied by δ. We use θ_{m,k} ∝ exp(α_{m,k}), where α_{m,k} ∼ Normal(α_{m−1,k}, τ^{−1}), for m > 1, and α_{1,k} ∼ Normal(0, 10). 
We fix κ to a small value (10⁻²) and infer α_{m,k} and β_{k,d} using slice sampling; we employ Gibbs sampling for δ and Metropolis-Hastings for τ, with Gamma(1, 10⁻³) priors.\n\n3.3 Related work\n\nChemudugunta et al. (2006) present a related topic model that is able to infer, in addition to topics that are shared by all documents, document-specific distributions that explain document-specific words. Following our model notation, the model introduces λ_m for each document and infers b_m based on the data, employing symmetric and weakly informative Beta and Dirichlet priors for λ_m and b_m, respectively. For this model, almost surely b_m ≠ p_m, meaning that the model has no connection to the K-divergence and, importantly, to the information retrieval setting, and is unable to balance between precision and recall as considered in this work. In other words, b_m biases the latent representation q_m. Interestingly, we show that our model may be interpreted as a limiting case when adopting strongly informative and asymmetric priors, as follows. Assume\n\nλ_m ∼ Beta((1 − λ)v, λv) and b_m ∼ Dirichlet(p_m v + ε1),\n\nwhere v denotes the strength of the prior and ε ≈ 0. When v → ∞, the priors reduce to point distributions and the model becomes equivalent to ours. Both computationally and conceptually, our model is simpler; in practice, tuning the prior strengths is not straightforward. This tuning can be expensive and is, further, data-set dependent.\n\nStochastic Neighbour Embedding (SNE; Hinton and Roweis, 2002) is a statistical model for normalised similarity data suitable for nonlinear dimensionality reduction. 
The model applies KL-divergences between the observed similarities (distributions) and latent distributions as likelihoods. Peltonen and Kaski (2011) propose a variant of SNE that applies the K-divergence instead, although the authors do not cite the original work of Lin (1991), and show that the model provides improved visualisation performance compared to the original SNE.\n\n4 Results\n\nWe compare our model against LDA and, as discussed in Section 3.3, against the closely related model by Chemudugunta et al. (2006), referred to as the SW model. For all the models, based on text data, we employ collapsed Gibbs sampling for inference.\n\nWe quantitatively evaluate topic semantic coherences (Mimno et al., 2011) and entropies, directed KL-divergences corresponding to the concepts of precision and recall, standard recall and precision for binarised relevances, as well as (metric) variational (ℓ1) distances and the adjusted Rand index (ARI) for document clustering (when category information is available), for various data collections and for a wide range of values for λ.\n\nWe compute the divergences and distances for held-out (test) data not used for inferring the topics. We sample 1/5 of the documents of each data collection to create a test set containing M̂ documents. We estimate the latent test distribution as\n\nq̂_m = (1/S) Σ_s Σ_k η^{(s)}_k θ̂^{(s)}_{m,k},\n\naveraging over S posterior samples³. To ease the presentation of results, we denote the mean divergences for the held-out data as\n\n(mean) recall = −(1/M̂) Σ_m KL(p̂_m, q̂_m) and (mean) precision = −(1/M̂) Σ_m KL(q̂_m, p̂_m),\n\nwhere p̂_m denotes the test empirical distributions.\n\n²For identifiability, we fix β_{k,1} ≈ 0 by setting the corresponding variance to an arbitrarily small value.\n\n
We also evaluate (mean) standard recall and precision; here P_m contains all the terms that occur at least once in the mth test document and Q_m contains the top-J retrieved terms based on q̂_m, correspondingly. We note that for both measures higher values indicate better performance. When computing the test divergence KL(q̂_m, p̂_m), we smooth p̂_m by adding a very small constant to the counts before normalisation in order to prevent numerical problems; the cost of false positives should be large but finite. We also compute (mean) ℓ1 distances as\n\n(1/M̂) Σ_m Σ_d |q̂_{m,d} − p̂_{m,d}|.\n\nComputation of the topic coherences requires specifying a threshold for sorting the T most probable terms of each topic in decreasing order. The measure penalises intruder and random terms corresponding to false similarities. We show results for T ∈ {5, 10, 15, 20}. For the entropies, we note that topics with low entropy focus the probability mass on few terms, indicating sparsity, a highly desirable property for improving interpretability. We average the values of coherences and entropies over the topics. When category information is available, we cluster documents according to the most active topic for each document, based on θ^{(s)}_m, and compute the adjusted Rand index (ARI) to measure similarity between the inferred clusterings and the available category information. We do not assume the number of clusters to be known; the number of potential clusters is constrained by the number of topics.\n\nWe show model performance for three subsets of publicly available data collections, NYTIMES⁴, movie reviews⁵ and 20newsgroup⁶, as well as for textual product descriptions combined with categorical information that we employ for further evaluations. Category information is also available for 20newsgroup. 
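The held-out scores defined above can be sketched compactly; a stdlib-only Python example (function and variable names are ours, not from the paper's code):

```python
import math

def directed_kl(p, q):
    """KL(p, q) over the support of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def heldout_scores(P_emp, Q_lat, smooth=1e-10):
    """Mean recall = -mean_m KL(p_m, q_m), mean precision = -mean_m KL(q_m, p_m)
    (with p_m smoothed to keep the false-positive cost finite), and mean l1
    distance, over held-out documents."""
    M = len(P_emp)
    recall = -sum(directed_kl(p, q) for p, q in zip(P_emp, Q_lat)) / M
    P_s = [[(pi + smooth) / (1 + smooth * len(p)) for pi in p] for p in P_emp]
    precision = -sum(directed_kl(q, p) for q, p in zip(Q_lat, P_s)) / M
    l1 = sum(sum(abs(qi - pi) for qi, pi in zip(q, p))
             for q, p in zip(Q_lat, P_emp)) / M
    return recall, precision, l1

# Two toy held-out documents: empirical p_m and inferred latent q_m.
P = [[0.5, 0.5, 0.0], [0.0, 0.2, 0.8]]
Q = [[0.4, 0.4, 0.2], [0.1, 0.3, 0.6]]
r, pr, l1 = heldout_scores(P, Q)
assert r <= 0 and pr <= 0 and 0 <= l1 <= 2
```

Both scores are non-positive by construction, and higher (closer to zero) values indicate better performance, matching the convention used in Table 2.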
Table 1 shows relevant statistics for each collection.\n\nTable 1: Data statistics.\n\nData set    | M     | D     | Σ_m N_m\nNYTIMES     | 6800  | 19908 | 2.00 × 10⁶\nPRODUCTS    | 7743  | 14237 | 1.29 × 10⁶\nMOVIES      | 4997  | 25884 | 0.80 × 10⁶\n20NEWSGROUP | 18307 | 28794 | 2.03 × 10⁶\n\nWe initialise the assignments randomly, set α_k = 0.1 and γ = 0.01, corresponding to weakly informative priors, and use 5 × 10³ sampling steps as burn-in. After the burn-in we collect posterior averages over S = 200 samples. We find the number of burn-in steps sufficient for convergence by monitoring the log likelihood. We infer the models for K = 200 topics and for 21 equi-spaced values of λ in (0, 0.2), noting that λ = 0 corresponds to the standard topic model (LDA).\n\nTable 2 collects results for λ = 0.1 for our model, LDA and SW. Unsurprisingly, recall is always best for LDA (λ = 0). For our model recall naturally decreases, because the model also takes precision into account; precision is best for our model. The standard recall (R@J) and precision (P@J) measures for J = 10 (in percentages) show that standard precision is always best for our model, and similarly for standard recall, except on the 20NG data set. The coherences (coh@T) are consistently best for our model, except on the PROD data set for T = 20, showing that models focusing solely on recall do not obtain high coherences. This observation is in agreement with Chang et al. (2009), who find that models with better predictive performance (i.e., mean recall) may infer less semantically meaningful topics. Our model also attains smaller (better) mean ℓ1 distances, which evaluate the metric distance between p̂ and q̂. Further, the topics inferred by our model are more sparse, as measured via topic entropies (ent). In addition, our model attains the best ARI\n\n³For each sample η^{(s)}_k, k = 1, ...
, K, we sample θ̂^{(s)}_m.\n⁴https://archive.ics.uci.edu/ml/datasets/Bag+of+Words\n⁵http://www.cs.cornell.edu/people/pabo/movie-review-data/\n⁶http://qwone.com/~jason/20Newsgroups/\n\nTable 2: Quantitative results for our model for λ = 0.1, LDA and SW for various data sets. Bolding indicates best results that are statistically significant (p < 0.01). For the recall and precision measures and distances we use the paired one-sided Wilcoxon test over the test documents, and for the coherences, entropies and ARI the unpaired one-sided Wilcoxon test over the S samples.\n\nNYT  | recall | precision | R@J[%] | P@J[%] | coh@5 | coh@10 | coh@15 | coh@20 | ℓ1   | ent  | ARI\nOur  | -3.61  | -72.7     | 5.97   | 63.6   | -16   | -77.6  | -190   | -358   | 1.63 | 4.78 | -\nLDA  | -2.97  | -78.8     | 5.22   | 55.6   | -23.1 | -114   | -275   | -511   | 1.74 | 5.85 | -\nSW   | -3.14  | -79.3     | 5.28   | 56.6   | -16.4 | -80.8  | -200   | -378   | 1.75 | 5.82 | -\n\n20NG | recall | precision | R@J[%] | P@J[%] | coh@5 | coh@10 | coh@15 | coh@20 | ℓ1   | ent  | ARI\nOur  | -5.54  | -83.7     | 8.92   | 32.3   | -18.8 | -99.2  | -257   | -521   | 1.84 | 4.49 | 0.209\nLDA  | -4.22  | -86.4     | 9.85   | 30.2   | -21   | -116   | -306   | -606   | 1.9  | 5.81 | 0.15\nSW   | -4.35  | -86.3     | 9.76   | 30.4   | -20.4 | -106   | -277   | -538   | 1.9  | 5.74 | 0.166\n\nPROD | recall | precision | R@J[%] | P@J[%] | coh@5 | coh@10 | coh@15 | coh@20 | ℓ1   | ent  | ARI\nOur  | -3.62  | -70.5     | 11.7   | 69.1   | -16.9 | -87.1  | -226   | -462   | 1.56 | 3.6  | 0.15\nLDA  | -2.81  | -77.8     | 9.44   | 55.8   | -23.8 | -121   | -304   | -584   | 1.7  | 5    | 0.127\nSW   | -2.98  | -77.8     | 9.94   | 58.7   | -17.9 | -90.2  | -230   | -447   | 1.7  | 4.91 | 0.133\n\nMOV  | recall | precision | R@J[%] | P@J[%] | coh@5 | coh@10 | coh@15 | coh@20 | ℓ1   | ent  | ARI\nOur  | -4.41  | -73.7     | 8.53   | 57.2   | -16.1 | -88.6  | -237   | -493   | 1.67 | 4.89 | -\nLDA  | -3.61  | -82.6     | 7.91   | 51.5   | -25.3 | -143   | -390   | -787   | 1.82 | 5.87 | -\nSW   | -3.66  | -82.1     | 8.17   | 52.8   | -17.9 | -95.5  | -255   | -522   | 1.81 | 5.73 | -\n\nvalues, showing that topics inferred 
by our model are in closer agreement with the external category information, providing further quantitative evidence of better performance for our model. We note that the conclusions based on Table 2 are similar for λ ∈ (0.07, 0.11), showing that obtaining good results is not sensitive to the particular λ. We also experimented with a variant of the SW model that additionally includes a shared background distribution (referred to as SWB). The results for SWB are marginally worse than or similar to SW, suggesting that including a common background distribution is not effective for improving performance.\n\nFigure 1: Various performance measures for different values of λ for the NYT data set. We note that LDA corresponds to λ = 0. The curves are similar for the other data sets.\n\nFigure 1 shows results for a wider range of λ for the NYT data set. For the divergences and distances we plot mean values versus λ, and for the entropies and coherences we use boxplots, respectively. The performance curves, as shown in Figure 1, are smooth for a wide range of values of λ, demonstrating stable computations. We see that recall is always best for LDA (λ = 0), decreasing for increasing values of λ, while precision increases with λ. Even small deviations from λ = 0 are sufficient to shift the focus from recall to a compromise between recall and precision. For increasing λ the model also attains smaller (better) mean ℓ1 distances. The saturating distance curves also indicate an effective range of λ values; for λ ≫ 0.2 (not shown), the computations eventually become more unstable, because more and more terms are assigned to the empirical distributions and the topics become too sparse, complicating posterior inference. In particular, the entropies show how the topics become (on average) more sparse for increasing λ. 
The coherences are best for our model for intermediate values of λ. The coherence curves follow a similar trend for other values of the threshold, 2 ≤ T ≤ 20. However, the measure is sensitive to topic sparsity; if the support of the topic distribution is smaller than T, the measure becomes noisier and less meaningful. We verified that the supports of the topics are larger than T = 20 for the data collections for λ ≤ 0.2.\n\nTo summarise, we observe a general trend showing that recall is negatively correlated with the other performance measures; as recall decreases, the other measures improve. On the other hand, we observe that precision is positively correlated with the other performance measures, excluding recall. Higher precision implies that the model infers i) latent distributions that have smaller ℓ1 distance to the test empirical distributions, and ii) topics that are more sparse and more semantically meaningful.\n\n[Figure 1: curves of recall, precision, coherence@5, ℓ1 distance and topic entropy as functions of λ.]\n\nWe emphasise that increased sparsity alone does not indicate improved precision; sparse topic models that aim to infer sparse topics focus solely on recall. For instance, Wang and Blei (2009) report improved predictive performance (that is, recall) for sparse models.\n\nTuning of the LDA hyperparameters is ineffective for trading off recall and precision. Intuitively, prior tuning is unable to overcome problems of the likelihood function, here the sensitivity of KL(p, q) to misses. We fix this issue by modifying the likelihood function directly. 
We empirically varied both α and γ in the range {10⁻³, 10⁻², 0.1, 1} for LDA for all data sets and found that γ ≤ 0.01 (inducing sparsity) is preferred for the topics; larger values produced useless results regardless of α. Too small a γ, however, may increase computational complexity and the risk of getting stuck in a locally optimal mode. For a sparse topic prior, increasing α (i) decreases topic entropies (inferring sparser topics), (ii) improves coherences marginally or leaves them unchanged, and (iii) decreases both recall and precision. Recall and precision curves for different γ as a function of α have similar forms, peaking at α ∈ {0.01, 0.1} and γ ∈ {10⁻³, 10⁻²}, verifying that the adopted setting for LDA is competitive. Despite the tuning, the precisions and coherences remain worse than for our model. We also repeated the experiment with an asymmetric topic prior, proportional to overall term occurrences, matching the prior strength to that of the symmetric variant for varying γ. 
The results for the asymmetric prior are very similar to those for the symmetric prior, showing
that the asymmetric prior is ineffective at boosting precision.

Table 3: Illustration of top topics with top words inferred based on the NYT data.

Our model
T1 point game team shot half minutes play lead season left rebound games guard coach *laker win quarter night played ball

T2 game team playoff season titan games *nfl *jacksonville *miami dolphin play quarterback win *tennessee jaguar *super-bowl *dan-marino played yard won

T3 team player game games season play coach played basketball sport fan win playing championship winning guy won record league football

T4 *al-gore *bill-bradley *bradley campaign *iowa president democratic *new-hampshire health vice care voter debate caucuses support presidential candidates poll vote administration

T5 tablespoon cup minutes add oil pepper large garlic medium serve onion sauce serving bowl fresh pound chopped taste butter chicken
LDA
T1 guy right look thought hard talk tell getting put feel bad remember told trying happen kind give real ago sure

T2 asked question statement called told saying public interview conference meeting comment reporter added issue took decision member matter plan clear

T3 win won record winning victory lost beat past early loss road finished final season home losing start lead close need

T4 need feel help problem kind find try getting job able success important step experience level look right start hard hope

T5 company companies business industry customer market part high product technology executive firm president competition executives line competitor big chief system
SW
T1 company companies business industry million firm customer executive largest executives market billion part analyst chief businesses employees services sales president

T2 team season playoff game *nfl games quarterback coach *super-bowl football player *jacksonville titan *miami
dolphin *tennessee played play record *ram

T3 win lead lost won final beat victory point season record loss early home put winning start gave right consecutive losing

T4 guy real right big put look pretty talk course happen tell getting bad mean kid talking wrong hear question head

T5 election presidential candidates campaign voter democratic candidate republican vote political *republican primary president race party democrat *party support poll win

Table 4: Illustration of top terms explained by the empirical distributions (or document-specific
distributions) for the NYT data.

Our model
million percent home plan team right system company
problem part need game official point early money
american president run play business public record talk
high head set government told place night show big
country season decision control deal half return found
look line left find help called family group newspaper

SW model
*mccain percent *governor-bush *john-mccain
*bill-bradley *george-bush *bradley women *bush drug
*clinton *al-gore *internet fund *bleated-nato abortion
union *ram *party test *black children card
*harvard-pilgrim gun *steve-forbes bill *army *gore game
cancer *cowboy *buc firm companies *republican *russia

Table 3 shows the top-5 topics for the NYT data collection, based on one posterior sample and
sorted by decreasing topic size, another useful measure of topic quality (Mimno et al., 2011),
for our model with λ = 0.1, LDA and the SW model. Named entities are marked with a *-symbol.
The topics for our model are semantically meaningful, capturing certain intuitive themes, as
desired. On the other hand, the LDA topics capture frequently occurring words but are not as
meaningful and do not correspond to any evident themes.
Inspection of such poor-quality topics, thought to be the most representative, undermines users'
confidence in the inferred model. The SW model falls between our model and LDA, retaining topics
similar to LDA that are not meaningful.
Table 4 shows the top words assigned to the empirical distributions (or document-specific
distributions) bm for our and SW models. For our model these terms correspond to frequently
occurring terms over the collection that pollute the latent representation of LDA. Our model is
able to explain these terms via the empirical distributions, leading to more meaningful and
sparser topics, and also inferring more accurate latent representations, as verified in Figure 1.
The terms for the SW model are document-specific, capturing names of persons or places; most of
the terms correspond to named entities. Thus the model still needs to explain the frequently
occurring terms over the whole collection, similarly to LDA, inferring poor-quality topics that
are dense, as verified by the large topic entropies. Also, the bias introduced by the bm leads to
inaccurate latent representations as measured in terms of ℓ1 distances and the divergences.

Figure 2: Various performance measures for different values of λ for crime prediction.

Spatio-temporal extension: We use publicly available crime data for London7 for crime prediction.
We discretise the data both in space and time, resulting in M = 71 months (finest resolution
available), D = 3.06 × 10^4 and ∑m Nm = 2.68 × 10^5. For the spatial discretisation and computing
Q, we use R-INLA8. Our task is to predict 'hot spots', a collection of mesh points where crime
occurs. Following Flaxman et al. (2018), we use predictive efficiency and accuracy indices (PEI
and PAI, respectively) for evaluation (the higher, the better).
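Following Flaxman et al. (2018), PAI divides the hit rate by the fraction of the study area flagged as hotspots, and PEI divides the crimes captured by the best capture achievable over an equally sized area. A minimal Python sketch over gridded test counts; the function and variable names are our own illustration, not the authors' code:

```python
def pai_pei(crime_counts, hotspot_mask):
    """Predictive accuracy (PAI) and efficiency (PEI) indices for hotspot prediction.

    crime_counts: per-area crime counts over the test period.
    hotspot_mask: booleans flagging the predicted hotspot areas (same length).
    """
    n = sum(c for c, hot in zip(crime_counts, hotspot_mask) if hot)  # crimes caught
    N = sum(crime_counts)                                            # total crimes
    a, A = sum(hotspot_mask), len(hotspot_mask)                      # flagged / total areas
    pai = (n / N) / (a / A)            # hit rate penalised by predicted area fraction
    # best achievable capture: the a areas with the most crime
    n_star = sum(sorted(crime_counts, reverse=True)[:a])
    pei = n / n_star
    return pai, pei
```

For example, flagging the single busiest of four equal-sized areas yields PEI = 1 (no better choice of one area exists), while PAI rewards capturing many crimes with a small flagged area.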
PAI penalises the hit rate by the predicted area size, giving large values for methods that capture
crime hotspots using the smallest area. PEI computes the ratio between the number of crimes that
occurred in the predicted hotspots and the maximum number of crimes that could have occurred in an
area of the same size. In general, PAI and PEI may be interpreted as generalisations of precision
and recall, respectively, for spatial crime hotspot prediction. We compute the number of hot spots
by simulation based on q̂m, taking the mean of the number of non-zero areas (that is, the support
of the distribution, |Q̂m|) and setting the top-|Q̂m| areas as hot spots. Again, we remove 1/5 as
test data and estimate q̂m by simulation from the posterior. For K = 5, Figure 2 shows: i) the
recall-precision trade-off, ii) better PEI and PAI for increasing values of λ and iii) smaller ℓ1
distances for intermediate λ. For 0.3 ≤ λ ≤ 0.5, the results are statistically significant compared
to the LDA variant (λ = 0)9. The conclusions are similar for K = {4, 5, . . . , 10} and the
performance does not improve significantly for K > 5.

5 Discussion

In this work, we present new insights into topic modelling from an information retrieval
perspective and propose a novel statistical topic model, combined with an efficient inference
algorithm, that allows the user to balance the contributions of precision and recall, inferring
more coherent and meaningful topics. Extensive experiments over various data collections and
settings demonstrate that the proposed approach is effective and useful.

Acknowledgements

The authors were supported by the EPSRC grant EP/P020720/1, Inference COmputation and Numerics
for Insights into Cities (ICONIC), https://iconicmath.org/.

References

Loulwah AlSumait, Daniel Barbará, James Gentle, and Carlotta Domeniconi. Topic significance
ranking of LDA generative models.
In ECML, 2009.

7 https://data.police.uk/data/; we collect crimes in the public order category.
8 http://www.r-inla.org; we use a 2D mesh function with a 100m cut-off distance between any two
mesh points. Areas are parameterised by the mesh points, following the idea of Voronoi
tessellation.
9 Paired one-sided Wilcoxon; p < 5 × 10^-4.

Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models–going beyond SVD. In 2012 IEEE
53rd Annual Symposium on Foundations of Computer Science, 2012.

David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine
Learning Research, 3:993–1022, 2003.

George Casella. Empirical Bayes Gibbs sampling. Biostatistics, 2(4):485–500, 2001.

Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L Boyd-Graber, and David M Blei. Reading tea
leaves: How humans interpret topic models. In NIPS, 2009.

Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. Modeling general and specific aspects
of documents with a probabilistic topic model. In NIPS, 2006.

Seth Flaxman, Michael Chirico, Pau Pereira, and Charles Loeffler. Scalable high-resolution
forecasting of sparse spatiotemporal events with kernel methods: a winning solution to the NIJ
"Real-Time Crime Forecasting Challenge". arXiv preprint arXiv:1801.02858, 2018.

Thomas L Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National
Academy of Sciences, 101(suppl 1):5228–5235, 2004.

Thomas L Griffiths, Michael I Jordan, Joshua B Tenenbaum, and David M Blei. Hierarchical topic
models and the nested Chinese restaurant process. In NIPS, 2004.

Geoffrey E Hinton and Sam T Roweis. Stochastic neighbor embedding.
In NIPS, 2002.

Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information
Theory, 37(1):145–151, 1991.

David Mimno, Hanna M Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. Optimizing
semantic coherence in topic models. In EMNLP, 2011.

Thomas Minka and John Lafferty. Expectation-propagation for the generative aspect model. In UAI,
2002.

Jaakko Peltonen and Samuel Kaski. Generative modeling for maximizing precision and recall in
information visualization. In AISTATS, 2011.

Yee Whye Teh and Michael I Jordan. Hierarchical Bayesian nonparametric models with applications.
Bayesian nonparametrics, 1:158–207, 2010.

Hanna M Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation methods for topic
models. In ICML, 2009.

Chong Wang and David M Blei. Decoupling sparsity and smoothness in the discrete hierarchical
Dirichlet process. In NIPS, 2009.