{"title": "Dirichlet belief networks for topic structure learning", "book": "Advances in Neural Information Processing Systems", "page_first": 7955, "page_last": 7966, "abstract": "Recently, considerable research effort has been devoted to developing deep architectures for topic models to learn topic structures. Although several deep models have been proposed to learn better topic proportions of documents, how to leverage the benefits of deep structures for learning word distributions of topics has not yet been rigorously studied. Here we propose a new multi-layer generative process on word distributions of topics, where each layer consists of a set of topics and each topic is drawn from a mixture of the topics of the layer above. As the topics in all layers can be directly interpreted by words, the proposed model is able to discover interpretable topic hierarchies. As a self-contained module, our model can be flexibly adapted to different kinds of topic models to improve their modelling accuracy and interpretability. Extensive experiments on text corpora demonstrate the advantages of the proposed model.", "full_text": "Dirichlet belief networks for topic structure learning\n\nHe Zhao1, Lan Du1\u2217, Wray Buntine1, and Mingyuan Zhou2\u2217\n1Faculty of Information Technology, Monash University, Australia\n\n2McCombs School of Business, The University of Texas at Austin, USA\n\nAbstract\n\nRecently, considerable research effort has been devoted to developing deep archi-\ntectures for topic models to learn topic structures. Although several deep models\nhave been proposed to learn better topic proportions of documents, how to leverage\nthe bene\ufb01ts of deep structures for learning word distributions of topics has not yet\nbeen rigorously studied. Here we propose a new multi-layer generative process\non word distributions of topics, where each layer consists of a set of topics and\neach topic is drawn from a mixture of the topics of the layer above. 
As the topics\nin all layers can be directly interpreted by words, the proposed model is able to\ndiscover interpretable topic hierarchies. As a self-contained module, our model can\nbe \ufb02exibly adapted to different kinds of topic models to improve their modelling\naccuracy and interpretability. Extensive experiments on text corpora demonstrate\nthe advantages of the proposed model.\n\n1\n\nIntroduction\n\nUnderstanding text has been an important task in machine learning, natural language processing, and\ndata mining. Text is discrete, unstructured, and often highly sparse. A popular way of analysing texts\nis to represent them as a set of latent factors via topic modelling or matrix factorisation. With great\nsuccess in modelling text, probabilistic topic models discover a set of latent topics from a collection\nof documents. Those topics, as latent factors, can be interpreted by distributions over words and used\nto derive low dimensional representations of the documents. Speci\ufb01cally, most existing topic models\nare built on top of the following generative process: Each topic is a distribution over the words (i.e.,\nword distribution, WD) in the vocabulary; each document is associated with a topic proportion (TP)\nvector; and a word in a document is generated by \ufb01rst drawing a topic according to the document\u2019s\nTP, then sampling the word according to the topic\u2019s WD.\nIn a Bayesian setting, TPs and WDs are both imposed on prior distributions. For example, one\ncommonly-used prior for TP and WD is a Dirichlet distribution, as in Latent Dirichlet Allocation\n(LDA) (Blei et al., 2003). Recently, deep hierarchical priors, especially imposed on TPs, have been\ndeveloped to generate hierarchical document representations as well as discover interpretable topic\nhierarchies. 
For example, there are hierarchical tree-structured constructions based on the Dirichlet Process (DP) or Chinese Restaurant Process (CRP), such as the nested CRP (nCRP) (Blei et al., 2010) and the nested hierarchical DP (Paisley et al., 2015); deep constructions based on restricted Boltzmann machines and neural networks, such as the Replicated Softmax Model (RSM) (Hinton and Salakhutdinov, 2009), the Neural Autoregressive Density Estimator (NADE) (Larochelle and Lauly, 2012), and the Over-replicated Softmax Model (OSM) (Srivastava et al., 2013); and models based on variational autoencoders (VAE), including Srivastava and Sutton (2017); Miao et al. (2017); Zhang et al. (2018). Recently, models that generalise the sigmoid belief network (Hinton et al., 2006) have been proposed, such as Deep Poisson Factor Analysis (DPFA) (Gan et al., 2015), Deep Exponential Families (DEF) (Ranganath et al., 2015), Deep Poisson Factor Modelling (DPFM) (Henao et al., 2015), and Gamma Belief Networks (GBNs) (Zhou et al., 2016).

*Corresponding authors

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Compared with the considerable interest in deep models on TPs, to our knowledge, the counterparts on WDs have not been fully investigated. In this paper, we propose a new multi-layer generative process on WDs, as a self-contained module and an alternative to the single-layer Dirichlet prior. In the proposed model, the WDs are the output units of the bottom layer of a DBN whose hidden layers are parameterised by Dirichlet-distributed hidden units connected with gamma-distributed weights. Specifically, each Dirichlet unit in a hidden layer is a probability distribution over the words in the vocabulary and can be viewed as a "hidden" topic. In each layer, the Dirichlet prior of a topic is a mixture of the topics in the layer above. 
As the hidden units are drawn from Dirichlet distributions, the proposed model is named the Dirichlet Belief Network, hereafter referred to as DirBN2.

Compared with existing related deep models, DirBN has the following appealing properties: 1) Interpretability of hidden units: every hidden unit in every layer of DirBN is a probability distribution over the words, making them real topics that can be directly interpreted. 2) Discovering topic hierarchies: the mixture structure of DirBN gives the model a straightforward way of discovering semantic correlations between topics in two adjacent layers, which further form topic hierarchies through the multi-layer construction of the model. Due to the intrinsic abstraction effect of a DBN, the topics in the higher layers are more abstract and can be treated as generalisations of the ones in the lower layers. 3) Better modelling accuracy: it is known that TPs are local variables (specific to individual documents), while WDs are global variables over the target corpus. Unlike many hierarchical counterparts on TP, DirBN imposes a deep structure on WD, which "absorbs the information" from the entire corpus. This enables DirBN to achieve better modelling accuracy, especially on sparse texts such as tweets and news abstracts, where the context information of an individual document is not enough to learn a good model with existing approaches. 4) Adaptability: as many sophisticated models on TPs use a simple Dirichlet prior on WDs, including well-known ones such as the Supervised Topic Model (Mcauliffe and Blei, 2008) and the Author Topic Model (Rosen-Zvi et al., 2004), our DirBN can be easily adapted to them to further improve modelling accuracy and interpretability.

In conclusion, the contributions of this paper include: 1) We propose DirBN, a deep structure that can be used as an advanced alternative to the Dirichlet prior on WDs with better modelling performance and interpretability. 
2) We demonstrate our model's adaptability by combining DirBN with several well-developed models, including Poisson Factor Analysis (PFA) (Zhou et al., 2012), MetaLDA (Zhao et al., 2017a), and GBN (Zhou et al., 2016). 3) With proper data augmentation and marginalisation techniques, DirBN enjoys full local conjugacy, which facilitates the derivation of a simple and effective inference algorithm.

2 The proposed DirBN

In this section, we introduce the details of the generative and inference processes of DirBN.

2.1 Generative process

We first define the essential notation and review the basic framework of topic modelling, followed by the details of the proposed DirBN. Assume that the bag of words of document d, in a corpus with N documents and V unique words in the vocabulary, is stored in a count vector x_d ∈ N_0^V. A topic model with K topics is composed of the TP vector θ_d ∈ R_+^K for each document d and the WD vector φ_k ∈ R_+^V for each topic k (k ∈ {1, ···, K}). To generate a word in document d, one can first sample a topic according to its TP, and then sample the word type according to the topic's WD. Given this framework, many prior constructions of TPs have been proposed, such as the Dirichlet distribution in LDA, logistic normal distributions for modelling topic correlations in the Correlated Topic Model (CTM) (Lafferty and Blei, 2006), nonparametric priors like the Hierarchical Dirichlet Process (Teh et al., 2012), and recently-proposed deep models like DPFA (Gan et al., 2015), DPFM (Henao et al., 2015), and GBN (Zhou et al., 2016). Unlike the extensive choices for constructing TP, the symmetric Dirichlet distribution on WDs still dominates in many advanced topic models. 
Here DirBN is a new hierarchical approach to constructing WDs, detailed as follows.

2Code available at https://github.com/ethanhezhao/DirBN

Figure 1: Demonstration of the generative process of DirBN with three layers.

A DirBN with T layers leaves the TPs of the basic framework untouched and draws φ_k according to the following generative process:

φ^(T)_{kT} ∼ Dir_V(η),
···
φ^(t)_{kt} ∼ Dir_V(ψ^(t)_{kt}), ψ^(t)_{kt} = Σ_{kt+1=1}^{Kt+1} φ^(t+1)_{kt+1} β^(t)_{kt+1,kt}, β^(t)_{kt+1,kt} ∼ Ga(γ^(t)_{kt+1}, 1/c^(t)),
···
φ^(1)_{k1} ∼ Dir_V(ψ^(1)_{k1}), ψ^(1)_{k1} = Σ_{k2=1}^{K2} φ^(2)_{k2} β^(1)_{k2,k1}, β^(1)_{k2,k1} ∼ Ga(γ^(1)_{k2}, 1/c^(1)),   (1)

where 1) Ga(−, −) is the gamma distribution with shape and scale parameters and Dir_V(−) is the Dirichlet distribution3; 2) the bracketed superscript over a variable indicates which layer it belongs to, and kt ∈ {1, ···, Kt} is the topic index in the t-th layer; 3) the output of DirBN is φ^(1)_{k1}, which corresponds to φ_k in the basic framework; hereafter, we use φ^(1)_{k1} instead; 4) we further impose gamma priors on the following variables: η ∼ Ga(a0, 1/b0), γ^(t)_{kt+1} ∼ Ga(γ^(t)_0/Kt, 1/c^(t)_0), γ^(t)_0 ∼ Ga(e0, f0), c^(t)_0 ∼ Ga(g0, 1/h0), and c^(t) ∼ Ga(g0, 1/h0). The generative process of a topic model equipped with DirBN is demonstrated in Figure 1.

3− can be a vector (a set of asymmetric parameters) or a scalar (a symmetric parameter) of the Dirichlet.

The idea of our DirBN can be summarised as follows:

1. From a bottom-up view, DirBN is a multi-layer matrix factorisation, which factorises the matrix of the WDs in the t-th layer as Φ^(t) ∼ Dir(Φ^(t+1)B^(t)). Here we define Φ^(t) ∈ R_+^{V×Kt} (φ^(t)_{kt} is the kt-th column) and B^(t) ∈ R_+^{Kt+1×Kt} (β^(t)_{kt} is the kt-th column). From a top-down view, the model can be considered a stochastic feedforward network (Tang and Salakhutdinov, 2013), where the input matrix is Φ^(T), the output matrix is Φ^(1), and the stochastic units are drawn from the Dirichlet distribution.

2. As DirBN is a Bayesian probabilistic model, consider a DirBN with only two layers as an example: each first-layer topic φ^(1)_{k1} is drawn from a Dirichlet with the topic-specific asymmetric parameter ψ^(1)_{k1}, which is a mixture of the second-layer topics. So statistical strength is shared via the mixture, which plays an important role in handling sparse texts.

3. In DirBN, not only in the bottom layer but also in any other layer t, each hidden unit is a distribution over the vocabulary and can be viewed as a real topic directly interpreted by words. Although the bottom layer serves as the actual WDs for generating the words, the topics in the higher layers are involved in the belief propagation in the network.

4. The weight β^(t)_{kt+1,kt} is drawn from a hierarchical gamma prior (i.e., the shape parameter γ^(t)_{kt+1} of the gamma prior on β^(t)_{kt+1,kt} is itself drawn from a gamma). It allows topics in the (t+1)-th layer to contribute differently to those in the t-th layer. In addition, the hierarchical structure on β^(t)_{kt+1,kt} is similar to the one in Zhou (2015), which provides an intrinsic shrinkage mechanism on β^(t)_{kt}. In other words, each kt is expected to be sparsely connected by a subset of kt+1. 
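To make Eq. (1) concrete, here is a minimal numpy sketch of one draw from a two-layer DirBN prior; the layer widths, hyper-parameter values, and variable names are illustrative choices of ours, not settings or code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

V, K2, K1 = 200, 10, 20         # vocabulary size and layer widths (illustrative)
eta, gamma0, c = 0.3, 1.0, 1.0  # illustrative hyper-parameter values

# Top layer: each of the K2 topics is a Dirichlet draw over the vocabulary.
phi2 = rng.dirichlet(np.full(V, eta), size=K2).T             # shape (V, K2)

# Gamma-distributed connection weights between the two layers.
beta1 = rng.gamma(shape=gamma0, scale=1.0 / c, size=(K2, K1))

# The Dirichlet parameter of each bottom-layer topic is a mixture of the
# topics in the layer above: psi^(1)_{k1} = sum_{k2} phi^(2)_{k2} beta^(1)_{k2,k1}.
psi1 = phi2 @ beta1                                          # shape (V, K1)

# Bottom layer: the word distributions actually used to generate words.
phi1 = np.stack([rng.dirichlet(psi1[:, k]) for k in range(K1)], axis=1)
```

Each column of `phi1` is a word distribution whose asymmetric Dirichlet prior is shaped by the higher-layer topics, which is the statistical-strength sharing discussed in point 2 above.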
We will demonstrate the shrinkage effect of DirBN in the experiments.

2.2 Inference process

The learning of DirBN can be done by the inference of its latent variables, i.e., Φ^(t) and B^(t) for all t. With several data augmentation techniques, we are able to derive a layer-wise Gibbs sampling algorithm facilitated by local conjugacy. Given θ and φ (regardless of their constructions), a topic model usually samples the topic assignment of each word in the corpus. After that, each topic k1 is associated with a vector of word counts, denoted as x^(1)_{k1} = [x^(1)_{1k1}, ···, x^(1)_{Vk1}], which encodes the semantic information of topic k1 and is one of the input count vectors of DirBN in the inference process. Given the input vectors, the inference of DirBN involves two key steps: 1) propagating the semantic information of the input vectors up to the top layer via latent counts; 2) updating Φ^(t) and B^(t) down to the bottom given the latent counts. Without loss of generality, we illustrate the inference details with a two-layer DirBN as follows4:

Propagating the latent counts from the bottom up  By integrating φ^(1)_{k1} out from its multinomial likelihood, we can get the likelihood of ψ^(1)_{k1} as:

L(ψ^(1)_{k1}) ∝ [Γ(ψ^(1)_{·k1}) / Γ(ψ^(1)_{·k1} + x^(1)_{·k1})] Π_{v=1}^V [Γ(ψ^(1)_{vk1} + x^(1)_{vk1}) / Γ(ψ^(1)_{vk1})],   (2)

where Γ(−) is the gamma function, ψ^(1)_{·k1} = Σ_v ψ^(1)_{vk1}, and x^(1)_{·k1} = Σ_v x^(1)_{vk1}. By integrating φ^(1)_{k1} out and introducing two auxiliary variables q^(1)_{k1} and y^(1)_{vk1}, Eq. (2) can be augmented as (Zhao et al., 2017a):

L(ψ^(1)_{k1}, q^(1)_{k1}, y^(1)_{vk1}) ∝ Π_{v=1}^V (q^(1)_{k1})^{ψ^(1)_{vk1}} (ψ^(1)_{vk1})^{y^(1)_{vk1}},   (3)

where q^(1)_{k1} ∼ Beta(ψ^(1)_{·k1}, x^(1)_{·k1}) and y^(1)_{vk1} ∼ CRT(x^(1)_{vk1}, ψ^(1)_{vk1}). Here CRT stands for the Chinese Restaurant Table distribution (Zhou and Carin, 2015; Zhao et al., 2017b). Now we can define y^(1)_{k1} = [y^(1)_{1k1}, ···, y^(1)_{Vk1}], the latent count vector derived from the input count vector x^(1)_{k1}.

With ψ^(1)_{vk1} = Σ_{k2=1}^{K2} φ^(2)_{vk2} β^(1)_{k2,k1}, we can then distribute the latent count y^(1)_{vk1} on ψ^(1)_{vk1} to each second-layer topic k2 by:

(z^(1)_{v1k1}, ···, z^(1)_{vK2k1}) ∼ Mult(y^(1)_{vk1}; φ^(2)_{v1}β^(1)_{1,k1}/ψ^(1)_{vk1}, ···, φ^(2)_{vK2}β^(1)_{K2,k1}/ψ^(1)_{vk1}),   (4)

where z^(1)_{vk2k1} is the latent count allocated to k2 and Σ_{k2=1}^{K2} z^(1)_{vk2k1} = y^(1)_{vk1}. We now note x^(2)_{k2} = [x^(2)_{1k2}, ···, x^(2)_{Vk2}], where x^(2)_{vk2} = Σ_{k1=1}^{K1} z^(1)_{vk2k1} can be viewed as one of the output count vectors of the first layer and also the input count vector of the second-layer topic k2.

In conclusion, to propagate the semantic information from the first to the second layer, we first derive y^(1)_{vk1} from x^(1)_{vk1}, then distribute y^(1)_{vk1} to all the second-layer topics (i.e., z^(1)_{vk2k1}), and finally aggregate z^(1)_{vk2k1} into x^(2)_{vk2}.

Updating the latent variables from the top down  After the latent counts are propagated, we start updating the latent variables from the top layer (i.e., the second layer here). Given x^(2)_{k2}, φ^(2)_{k2} is easy to sample from its Dirichlet posterior. 
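As a rough illustration of the upward pass in Eqs. (2)–(4), the following numpy sketch draws the CRT counts via the standard sum-of-Bernoullis construction and then allocates them with a multinomial; `sample_crt` and `propagate_up` are our own names and conventions, not those of the released code:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_crt(n, r):
    """Draw y ~ CRT(n, r): the number of occupied tables when n customers
    enter a Chinese restaurant with concentration r (sum of Bernoulli draws)."""
    if n == 0:
        return 0
    return int((rng.random(n) < r / (r + np.arange(n))).sum())

def propagate_up(x1, phi2, beta1):
    """Propagate input counts x1 (V x K1) to second-layer counts x2 (V x K2)."""
    V, K1 = x1.shape
    K2 = phi2.shape[1]
    psi1 = phi2 @ beta1                      # Dirichlet parameters of layer 1
    x2 = np.zeros((V, K2), dtype=int)
    for k1 in range(K1):
        for v in np.nonzero(x1[:, k1])[0]:
            # Latent count y^(1)_{v k1} from the CRT augmentation (Eq. (3)).
            y = sample_crt(int(x1[v, k1]), psi1[v, k1])
            # Distribute y over the second-layer topics (Eq. (4)).
            p = phi2[v, :] * beta1[:, k1] / psi1[v, k1]
            x2[v, :] += rng.multinomial(y, p / p.sum())
    return x2
```

With `x2` in hand, φ^(2) can be resampled from its Dirichlet posterior and β^(1) from the gamma posterior implied by Eq. (5), completing one layer-wise Gibbs sweep.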
With z^(1)_{vk2k1} and Σ_{v=1}^V φ^(2)_{vk2} = 1, we can sample β^(1)_{k2,k1} from its gamma posterior given the following likelihood:

L(β^(1)_{k2,k1}) ∝ e^{−β^(1)_{k2,k1}(−log q^(1)_{k1})} (β^(1)_{k2,k1})^{z^(1)_{·k2k1}},   (5)

where z^(1)_{·k2k1} = Σ_{v=1}^V z^(1)_{vk2k1}. Given the newly sampled φ^(2)_{k2} and β^(1)_{k2,k1}, we can recompute ψ^(1)_{k1} and sample φ^(1)_{k1} from its Dirichlet posterior. Now the inference of a two-layer DirBN is done.

4Omitted details of inference as well as the overall algorithm are given in the supplementary materials.

3 Using DirBN in topic modelling

DirBN is a self-contained module on φ, leaving θ untouched. Therefore, it can be used as an alternative to the simple Dirichlet prior on φ in many existing models. The adaptability of DirBN enables us to easily apply it to advanced models so that those models can benefit from the advantages of DirBN. To demonstrate this, we adapt the proposed DirBN structure to the following models:

PFA+DirBN  Poisson Factor Analysis (PFA) is a popular framework for topic analysis (DPFA (Gan et al., 2015), DPFM (Henao et al., 2015), and GBN (Zhou et al., 2016) can be viewed as deep extensions of PFA). Specifically, we use the Bayesian nonparametric version of PFA named BGGPFA (Zhou et al., 2012), where θ_d is constructed from a negative binomial process and φ_k is drawn from a Dirichlet distribution. Note that there are close relationships between PFA and LDA, and between BGGPFA and HDP (Teh et al., 2012), analysed in Zhou (2018). Here we replace the Dirichlet construction on φ with DirBN, yielding a model named PFA+DirBN.

MetaLDA+DirBN  MetaLDA (Zhao et al., 2017a, 2018a) is a supervised topic model that is able to incorporate document labels to inform the learning of θ_d. 
Keeping the structure on θ untouched, we replace MetaLDA's structure on φ with our DirBN to get a combined model that discovers topic hierarchies informed by the document labels. The proposed model is able to discover the correlations between labels and topic hierarchies.

GBN+DirBN  Recall that GBN (Zhou et al., 2015, 2016) imposes a hierarchical structure on θ, which is able to learn multi-layer document representations and topic hierarchies. Here we combine DirBN and GBN to yield a "dual" deep model, where the GBN part is on θ and the DirBN part is on φ. Both parts discover topic hierarchies, and the bottom-layer topics are shared by the two parts/hierarchies. It would be interesting to see how the two deep structures interact with each other.

4 Related work

As the proposed model introduces a hierarchical architecture on WDs (i.e., φ) in topic models, we first review various priors on φ, starting with the ones on sampling/optimising the Dirichlet parameters in topic models. The Dirichlet parameters in topic models were studied comprehensively in Wallach et al. (2009), which showed that a Dirichlet with a symmetric parameter sampled from an uninformative gamma is the best choice. Actually, our DirBN reduces to this choice if T = 1 (i.e., DirBN-1, with one layer only). However, unlike the sampling/optimising approaches used in Wallach et al. (2009), DirBN-1 uses a negative binomial augmentation, shown in Eq. (3), which leads to a simpler inference scheme. Recently, models like Zhao et al. 
(2017a,c, 2018b) construct informative and asymmetric Dirichlet priors by taking into account external knowledge such as word embeddings, whereas DirBN learns the asymmetric priors purely from the context of the target corpus. Instead of Dirichlet, the Pitman-Yor process (PYP) has been used on WDs to model the power-law distribution of words, as in Sato and Nakagawa (2010); Buntine and Mishra (2014). Chen et al. (2015) used a transformed PYP prior on φ to model multiple document collections. Lindsey et al. (2012) imposed a hierarchical PYP prior on φ to discover word phrases. Besides PYP, the Indian Buffet Process (IBP) has been used as a prior on φ to introduce word focusing on topics, as in Archambeau et al. (2015). In general, existing models use different priors on φ for modelling various linguistic phenomena, which serve different purposes from DirBN. The deep structures induced by DirBN on WDs have not yet been rigorously studied.

Figure 2: (a): Histograms of the normalised (latent) word counts. (b): B^(1).

To our knowledge, most existing models explore the structure of topics by imposing a deep/hierarchical prior on θ. For example, hierarchical PYPs were used for domain adaptation in language models (Wood and Teh, 2009) and topic models (Du et al., 2012). nCRP (Blei et al., 2010) models topic hierarchies by introducing a tree-structured prior. Paisley et al. (2015); Kim et al. (2012); Ahmed et al. (2013) extended nCRP by either softening its constraints or applying it to different problems. Li and McCallum (2006) proposed the Pachinko Allocation model (PAM), which 
In DPFM and GBN, the higher-layer topics are not distributions over\nwords but distributions over the topics in the layer below (they are called \u201cmeta-topics\u201d in DPFM).\nTo interpret those meta-topics, one needs to project them all the way down to the bottom-layer\ntopics with matrix multiplication. Whereas in our model, the topics on all the layers are directly\ninterpretable.\n\n5 Experiments\n\nThe experiments were conducted on three real-world datasets, detailed as follows: 1) Web Snippets\n(WS), containing 12,237 web search snippets labelled with 8 categories. The vocabulary contains\n10,052 word types. 2) Tag My News (TMN), consisting of 32,597 RSS news labelled with 7 categories.\nEach document contains a title and a description. There are 13,370 word types in the vocabulary. 3)\nTwitter, extracted in 2011 and 2012 microblog tracks at Text REtrieval Conference (TREC)5. It has\n11,109 tweets in total. The vocabulary size is 6,344.\nWith the framework of PFA, we compared three options of constructing \u03c6: (1) The default setting\nof PFA, where \u03c6 is drawn from a symmetric Dirichlet distribution with parameter 0.05, i.e., \u03c6k \u223c\nDirV (0.05); (2) PFA+Mallet, where \u03c6k \u223c DirV (\u03b10) and \u03b10 is sampled by Mallet 6; (3) PFA+DirBN,\nthe proposed model, where \u03c6k is drawn from an asymmetric Dirichlet distribution speci\ufb01c to k,\nthe parameter of which is constructed with the higher-layer topics. Note that Wallach et al. (2009)\ntested the option using speci\ufb01c asymmetric Dirichlet parameter, i.e., \u03c6k \u223c DirV ([\u03b11,\u00b7\u00b7\u00b7 , \u03b1V ]), but\nthe performance is not as good as the symmetric parameter (the second one above). In addition,\nfollowing a similar routine, we compared MetaLDA (Zhao et al., 2017a), and GBN (Zhou et al.,\n2016) with/without DirBN. 
Note that PFA is a widely used Bayesian topic model, MetaLDA is a state-of-the-art topic model capable of handling sparse texts, and GBN is reported in Cong et al. (2017) to outperform many other deep models, including DPFA (Gan et al., 2015), DPFM (Henao et al., 2015), nHDP (Paisley et al., 2015), and RSM (Hinton and Salakhutdinov, 2009).

For all the models, we ran 3,000 MCMC iterations with 1,500 burn-in. For DirBN, we set a0 = b0 = g0 = h0 = 1.0 and e0 = f0 = 0.01. For PFA, MetaLDA, and GBN, we used their original implementations and settings, except that φ is drawn from DirBN in the combined models. For all the models, the number of topics in each layer of DirBN was set to 100, i.e., KT = ··· = K1 = 100. For GBN and GBN+DirBN, we set the number of topics in each layer of GBN to 100 as well. Due to the shrinkage mechanisms of PFA, GBN, and DirBN, the number of active topics is adjusted according to the data. In all the experiments, we varied the number of layers of DirBN, T, from 1 to 3. For GBN+DirBN, the dual deep model, we fixed the number of layers of GBN at 3.

5http://trec.nist.gov/data/microblog.html
6http://mallet.cs.umass.edu

Figure 3: Perplexity (the vertical axis) with varied proportion (the horizontal axis) of the words for training in the training documents. (a-c): Results of the models based on PFA on WS, TMN, Twitter. (d-f): Results of the models based on GBN on WS, TMN, Twitter. (g,h): Results of the models based on MetaLDA on WS and TMN. The error bars indicate the standard deviations of five runs. The number after a model indicates the number of layers used in DirBN. 
The results of MetaLDA and document classification on Twitter are not reported due to the unavailability of labels.

Demonstration of DirBN's shrinkage effect  As previously discussed, DirBN has an intrinsic shrinkage mechanism that is able to automatically learn the number of active topics in each layer (i.e., the network width). We empirically demonstrate the shrinkage effect in Figure 2, with the results of PFA+DirBN-3 on the TMN dataset. Figure 2a plots the histograms of the normalised (latent) word counts x^(t)_{·kt} / Σ_{kt} x^(t)_{·kt} for all kt, where x^(t)_{·kt} = Σ_v x^(t)_{vkt} is the word count for topic kt. The blue and red bars are for the first- (t = 1) and second-layer (t = 2) topics, respectively. The histogram indicates the number of topics (the vertical axis) that have a specific word count (the horizontal axis). A topic with a larger word count is more important. The shrinkage effect is that a large proportion of the topics have very small word counts, indicating that the number of effective topics is less than the truncation (i.e., Kt = 100). This is more obvious in the second layer. Moreover, we display log B^(1) as an image in Figure 2b. The vertical and horizontal axes are for the second- and first-layer topics, respectively. We ranked the first- and second-layer topics by their word counts. The sparsity of B^(1) indicates that the first- and second-layer topics are sparsely connected. This also demonstrates the shrinkage effect of the model.

Quantitative results  We report the per-held-out-word perplexity and topic coherence results. To compute perplexity, we randomly selected 80% of the documents in each dataset to train the models and 20% for testing. For each testing document, we randomly used one half of its words to infer its TP, and the other half to calculate perplexity. 
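The perplexity protocol just described can be sketched as follows; this is a schematic under our own naming (`theta` holds the TPs inferred from the first halves of the test documents, `phi` the learned WDs), not the authors' evaluation script:

```python
import numpy as np

def heldout_perplexity(heldout_counts, theta, phi):
    """heldout_counts: (D, V) counts of the held-out halves of the test documents;
    theta: (D, K) inferred topic proportions; phi: (V, K) word distributions."""
    total_ll, total_words = 0.0, 0
    for d in range(heldout_counts.shape[0]):
        x = heldout_counts[d]
        # Predictive word distribution for document d: p(v) = sum_k phi_vk theta_dk.
        p = phi @ (theta[d] / theta[d].sum())
        idx = x > 0
        total_ll += float((x[idx] * np.log(p[idx])).sum())
        total_words += int(x.sum())
    # Per-held-out-word perplexity: exp of the negative average log-likelihood.
    return float(np.exp(-total_ll / total_words))
```

As a sanity check, a uniform `phi` over a vocabulary of size V yields a perplexity of exactly V; better-fitting models drive the value below that.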
Topic coherence measures the semantic coherence in\nthe most signi\ufb01cant words (top words) of a topic. Here we used the Normalized Pointwise Mutual\nInformation (NPMI) (Aletras and Stevenson, 2013; Lau et al., 2014) to calculate topic coherence\n\n7\n\n20%40%100%60080010001200140016001800PFAPFA+MalletPFA+DirBN-1PFA+DirBN-2PFA+DirBN-320%40%100%10001500200025003000PFAPFA+MalletPFA+DirBN-1PFA+DirBN-2PFA+DirBN-320%40%100%300400500600700PFAPFA+MalletPFA+DirBN-1PFA+DirBN-2PFA+DirBN-320%40%100%60080010001200140016001800GBNGBN+DirBN-1GBN+DirBN-320%40%100%10001500200025003000GBNGBN+DirBN-1GBN+DirBN-320%40%100%300400500600700800GBNGBN+DirBN-1GBN+DirBN-320%40%100%60080010001200140016001800MetaLDAMetaLDA+DirBN-1MetaLDA+DirBN-320%40%100%1500200025003000MetaLDAMetaLDA+DirBN-1MetaLDA+DirBN-3\fTable 1: Topic coherence with varied proportion of the words for training in the training documents.\n\u00b1 indicates the standard deviation of \ufb01ve runs. The best result in each column is in boldface.\n\nTraining 
words\n\nPFA\n\nPFA+Mallet\nPFA+DirBN-1\nPFA+DirBN-3\n\nGBN\n\nGBN+DirBN-1\nGBN+DirBN-3\n\n20%\n-0.070\u00b10.010\n0.008\u00b10.004\n0.013\u00b10.003\n0.021\u00b10.005\n-0.072\u00b10.013\n0.015\u00b10.005\n0.018\u00b10.006\n\nWS\n\n40%\n0.008\u00b10.002\n0.049\u00b10.005\n0.052\u00b10.004\n0.059\u00b10.002\n0.007\u00b10.005\n0.057\u00b10.002\n0.061\u00b10.004\n\n100%\n0.062\u00b10.011\n0.063\u00b10.003\n0.060\u00b10.006\n0.068\u00b10.004\n0.069\u00b10.009\n0.069\u00b10.005\n0.075\u00b10.002\n\n20%\n-0.059\u00b10.008\n0.035\u00b10.006\n0.031\u00b10.003\n0.046\u00b10.003\n-0.065\u00b10.008\n0.032\u00b10.002\n0.048\u00b10.003\n\nTMN\n\n40%\n0.064\u00b10.009\n0.083\u00b10.005\n0.080\u00b10.001\n0.090\u00b10.003\n0.063\u00b10.006\n0.086\u00b10.002\n0.094\u00b10.004\n\n100%\n0.103\u00b10.006\n0.108\u00b10.005\n0.108\u00b10.008\n0.111\u00b10.004\n0.106\u00b10.004\n0.112\u00b10.007\n0.113\u00b10.004\n\n20%\n-0.003\u00b10.003\n0.022\u00b10.003\n0.019\u00b10.004\n0.024\u00b10.001\n-0.005\u00b10.005\n0.021\u00b10.004\n0.025\u00b10.003\n\nTwitter\n\n40%\n0.031\u00b10.003\n0.037\u00b10.002\n0.037\u00b10.004\n0.038\u00b10.002\n0.032\u00b10.002\n0.040\u00b10.005\n0.040\u00b10.002\n\n100%\n0.046\u00b10.002\n0.045\u00b10.003\n0.049\u00b10.007\n0.049\u00b10.002\n0.047\u00b10.00\n0.050\u00b10.005\n0.051\u00b10.003\n\nTable 2: Topic hierarchy comparison in GBN+DirBN. Each row in boldface is the top 10 words in a\n\ufb01rst-layer topic. Each of these topics is associated with three most correlated topics in the second\nlayer of DirBN (left) and GBN (right), respectively. 
Second-layer topics are listed in decreasing order of their (normalised) link weight to the first-layer topic.

First-layer topic: police arrested man charged woman authorities death year found accused
  DirBN: case charges accused trial court attorney investigation judge allegations criminal | police official killing attack dead death army security man family | woman men drug suicide girl sexual death found human york
  GBN: police arrested man charged year accused found charges woman death | police prison man china years arrested charges charged year chinese | china police chinese bomb fire people blast city artist officials

First-layer topic: heat miami james lebron game nba finals celtics bulls wade
  DirBN: season team game play run night star series fans career | nba playoffs court brink seeds defeated berth seed opponent semifinals | win victory beat lead winning top fourth loss straight beating
  GBN: heat miami james game nba finals lebron bulls mavericks dallas | trial rajaratnam insider trading fund hedge raj anthony galleon case | music album lady gaga justin star pop band rock tour

First-layer topic: facebook google internet social twitter online web media site search
  DirBN: phone plan video technology mobile devices computer tech ceo content | company million buy billion corp industry sales companies consumers products | government report country nation pressure official state move released public
  GBN: facebook social internet google online twitter chief executive media web | court lawsuit case facebook judge social federal internet google online | facebook social internet google online twitter world web media site

First-layer topic: study cancer drug risk heart patients women researchers disease people
  DirBN: rising percent high higher economic increase low growth strong recovery | reactions periods technique method declared important realized treatment peril scores | study experience finding recent security kids challenges millions report special
  GBN: study cancer drug risk researchers heart people patients health women | world war years family oil dies year energy women american | facebook social internet google online twitter world web media site

First-layer topic: nuclear japan plant power radiation crisis japanese fukushima crippled tokyo
  DirBN: government united states officials state report country group official agency | safety water nearby land found caused sea believed center parts | work plans part future system rules program bring offers decision
  GBN: nuclear japan plant power radiation crisis japanese fukushima earthquake tokyo | nuclear japan plant power radiation crisis japanese fukushima water tokyo | theater review broadway play york musical stage life time love

score from the top 10 words of each topic and reported scores averaged over the top 50 topics with the highest NPMI, where "rubbish" topics are eliminated, following Yang et al. (2015)7. In the training documents, we further varied the proportion of the words used in training to mimic the case of sparse texts. All models were run five times with different random seeds, and we report averaged values with standard deviations.
The results of perplexity and topic coherence are shown in Figure 3 and Table 1, respectively. We make the following remarks on the results: (1) In general, the models with DirBN significantly improve over their counterparts without DirBN, especially in terms of perplexity and topic coherence at low proportions of training words. (2) In terms of all measures, DirBN-2/3 always obtains better results than DirBN-1. In contrast, GBN's perplexity is worse than PFA's, especially for sparse texts.
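As an aside on the evaluation metric: the NPMI coherence reported in Table 1 can be sketched as below. This is a minimal re-implementation that estimates word probabilities by document co-occurrence over a toy corpus (the experiments above instead use the Palmetto package with a large Wikipedia dump as the reference corpus):

```python
import itertools
import math

def topic_npmi(top_words, docs):
    """Average NPMI over all pairs of a topic's top words.

    P(w) and P(wi, wj) are estimated as document (co-)occurrence
    frequencies over a reference corpus `docs`, a list of token sets.
    (A full implementation would also guard the p_ij == 1.0 case.)
    """
    n_docs = len(docs)

    def prob(*words):
        return sum(1 for d in docs if all(w in d for w in words)) / n_docs

    scores = []
    for wi, wj in itertools.combinations(top_words, 2):
        p_ij = prob(wi, wj)
        if p_ij == 0.0:
            scores.append(-1.0)  # words never co-occur: NPMI is -1 by convention
            continue
        pmi = math.log(p_ij / (prob(wi) * prob(wj)))
        scores.append(pmi / -math.log(p_ij))
    return sum(scores) / len(scores)

# toy reference corpus of tokenised documents
docs = [{"nuclear", "japan", "plant"},
        {"nuclear", "japan", "radiation"},
        {"game", "team", "season"}]
print(topic_npmi(["nuclear", "japan"], docs))  # ~1.0: the pair always co-occurs
```

Averaging this score over the top 10 words of each topic, and then over the 50 highest-scoring topics, gives the values in Table 1.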
This demonstrates that hierarchical structures on θ (i.e., GBN) may not perform as well as hierarchical structures on φ (i.e., DirBN) on sparse texts. (3) Although PFA+DirBN-1 and PFA+Mallet both impose a symmetric Dirichlet prior on φ, the former usually achieves better perplexity. (4) The dual deep model (GBN+DirBN-3) usually performs best on topic coherence, which demonstrates the benefits of the deep structures.

7We used the Palmetto package (http://palmetto.aksw.org) with a large Wikipedia dump.
8More visualisations of topic hierarchies are shown in the supplementary material.

Figure 4: Topic hierarchies discovered by MetaLDA+DirBN. The topics in the yellow and blue rectangles are the second- and first-layer topics in DirBN, and the labels correlated with the first-layer topics are shown at the bottom of each panel. Thicker arrows indicate stronger correlations.

Qualitative analysis on topic hierarchies.8 GBN+DirBN is a dual deep model that discovers two sets of hierarchies, one induced by GBN on θ and the other induced by DirBN on φ. The topics in the first layer of DirBN connect the two sets of hierarchies. In Table 2, we show the first-layer topics and the correlated second-layer topics in the two hierarchies. It is interesting to see that the second-layer topics of DirBN are more abstract. For example, the second topic is about teams and players in the NBA, while its correlated second-layer topics are more general words for sports.
Moreover, DirBN is able to discover layer-wise, semantically meaningful topic correlations with fewer overlapping top words. This is because GBN combines the words in the first-layer topics to form the second-layer topics, whereas DirBN decomposes the first-layer topics into the second-layer ones.
In MetaLDA+DirBN, the MetaLDA part uses document labels to construct TPs (Zhao et al., 2017a) by learning a correlation matrix between the labels and topics, while the DirBN part learns the topic hierarchy. The first-layer topics of DirBN link the correlation matrix and the topic hierarchy together. Figure 4 shows sample linkages between topic hierarchies and labels on TMN, where the documents are labelled with 7 categories: 1 sport, 2 business, 3 us, 4 entertainment, 5 world, 6 health, 7 sci-tech. One can observe a clear correspondence between the topic hierarchies and the labels.

6 Conclusions

We have presented DirBN, a multi-layer process generating the word distributions of topics. With real, word-interpretable topics in each layer, DirBN is able to discover interpretable topic hierarchies. As a flexible module, DirBN can be adapted to other advanced topic models to improve their performance and interpretability, especially on sparse texts. We have demonstrated DirBN's advantages by equipping PFA, MetaLDA, and GBN with DirBN.
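To recap the construction in code, below is a toy, pure-Python sketch of the two-layer sampling process summarised above, where each first-layer topic is drawn from a Dirichlet whose base measure mixes the second-layer topics through non-negative link weights. The dimensions and gamma hyperparameters here are illustrative assumptions, not the paper's settings:

```python
import random

random.seed(0)
V, K2, K1 = 30, 4, 8  # toy vocabulary size and layer widths (illustrative)

def dirichlet(alpha):
    """Sample from Dirichlet(alpha) by normalising independent gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

# second (top) layer: K2 word distributions over the V-word vocabulary
phi2 = [dirichlet([0.5] * V) for _ in range(K2)]

# non-negative link weights from each second-layer topic to each
# first-layer topic (gamma-distributed here purely for illustration)
beta = [[random.gammavariate(1.0, 1.0) for _ in range(K1)] for _ in range(K2)]

# each first-layer topic is drawn from a Dirichlet whose base measure
# mixes the second-layer topics by its link weights:
#   phi1_k ~ Dirichlet(sum_j beta[j][k] * phi2[j])
phi1 = []
for k in range(K1):
    base = [sum(beta[j][k] * phi2[j][v] for j in range(K2)) for v in range(V)]
    phi1.append(dirichlet(base))

# every first-layer topic is a proper distribution over words
assert all(abs(sum(topic) - 1.0) < 1e-9 for topic in phi1)
```

Deeper hierarchies repeat the same step layer by layer; the link weights beta are what induce the interpretable topic correlations shown in Table 2.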
With the help of data augmentation, inference in DirBN can be performed by layer-wise Gibbs sampling, as the model is fully conjugate.
Future directions include deriving alternative inference algorithms, such as variational inference (Hoffman et al., 2013), conditional density filtering (Guhaniyogi et al., 2018), and stochastic gradient-based approaches (Chen et al., 2014; Ding et al., 2014; Welling and Teh, 2011).

Acknowledgments

M. Zhou acknowledges the support of Award IIS-1812699 from the U.S. National Science Foundation.

References

D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," JMLR, vol. 3, pp. 993–1022, 2003.

D. M. Blei, T. L. Griffiths, and M. I. Jordan, "The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies," Journal of the ACM, vol. 57, no. 2, p. 7, 2010.

J. Paisley, C. Wang, D. M. Blei, and M. I. Jordan, "Nested hierarchical Dirichlet processes," TPAMI, vol. 37, no. 2, pp. 256–270, 2015.

G. E. Hinton and R. R.
Salakhutdinov, "Replicated softmax: An undirected topic model," in NIPS, 2009, pp. 1607–1614.

H. Larochelle and S. Lauly, "A neural autoregressive topic model," in NIPS, 2012, pp. 2708–2716.

N. Srivastava, R. Salakhutdinov, and G. Hinton, "Modeling documents with a deep Boltzmann machine," in UAI, 2013, pp. 616–624.

A. Srivastava and C. Sutton, "Autoencoding variational inference for topic models," in ICLR, 2017.

Y. Miao, E. Grefenstette, and P. Blunsom, "Discovering discrete latent topics with neural variational inference," in ICML, 2017, pp. 2410–2419.

H. Zhang, B. Chen, D. Guo, and M. Zhou, "WHAI: Weibull hybrid autoencoding inference for deep topic modeling," in ICLR, 2018.

G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

Z. Gan, C. Chen, R. Henao, D. Carlson, and L. Carin, "Scalable deep Poisson factor analysis for topic modeling," in ICML, 2015, pp. 1823–1832.

R. Ranganath, L. Tang, L. Charlin, and D. Blei, "Deep exponential families," in AISTATS, 2015, pp. 762–771.

R. Henao, Z. Gan, J. Lu, and L. Carin, "Deep Poisson factor modeling," in NIPS, 2015, pp. 2800–2808.

M. Zhou, Y. Cong, and B. Chen, "Augmentable gamma belief networks," JMLR, vol. 17, no. 163, pp. 1–44, 2016.

J. D. McAuliffe and D. M. Blei, "Supervised topic models," in NIPS, 2008, pp. 121–128.

M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth, "The author-topic model for authors and documents," in UAI, 2004, pp. 487–494.

M. Zhou, L. Hannah, D. B. Dunson, and L. Carin, "Beta-negative binomial process and Poisson factor analysis," in AISTATS, 2012, pp. 1462–1471.

H. Zhao, L. Du, W. Buntine, and G. Liu, "MetaLDA: A topic model that efficiently incorporates meta information," in ICDM, 2017, pp. 635–644.

J. D. Lafferty and D. M. Blei, "Correlated topic models," in NIPS, 2006, pp. 147–154.

Y. Teh, M. Jordan, M. Beal, and D. Blei, "Hierarchical Dirichlet processes," Journal of the American Statistical Association, vol. 101, no. 476, pp. 1566–1581, 2006.

Y. Tang and R. R. Salakhutdinov, "Learning stochastic feedforward neural networks," in NIPS, 2013, pp. 530–538.

M. Zhou, "Infinite edge partition models for overlapping community detection and link prediction," in AISTATS, 2015, pp. 1135–1143.

M. Zhou and L. Carin, "Negative binomial process count and mixture modeling," TPAMI, vol. 37, no. 2, pp. 307–320, 2015.

H. Zhao, L. Du, and W. Buntine, "Leveraging node attributes for incomplete relational data," in ICML, 2017, pp. 4072–4081.

M. Zhou, "Nonparametric Bayesian negative binomial factor analysis," Bayesian Analysis, 2018.

H. Zhao, L. Du, W. Buntine, and G. Liu, "Leveraging external information in topic modelling," KAIS, pp. 1–33, 2018.

M. Zhou, Y. Cong, and B. Chen, "The Poisson gamma belief network," in NIPS, 2015, pp. 3043–3051.

H. M. Wallach, D. M. Mimno, and A. McCallum, "Rethinking LDA: Why priors matter," in NIPS, 2009, pp. 1973–1981.

H. Zhao, L. Du, and W. Buntine, "A word embeddings informed focused topic model," in ACML, 2017, pp. 423–438.

H. Zhao, L. Du, W. Buntine, and M. Zhou, "Inter and intra topic structure learning with word embeddings," in ICML, 2018, pp. 5887–5896.

I. Sato and H. Nakagawa, "Topic models with power-law using Pitman-Yor process," in SIGKDD, 2010, pp. 673–682.

W. L. Buntine and S. Mishra, "Experiments with non-parametric topic models," in SIGKDD, 2014, pp. 881–890.

C. Chen, W. Buntine, N. Ding, L. Xie, and L. Du, "Differential topic models," TPAMI, vol. 37, no. 2, pp. 230–242, 2015.

R. V. Lindsey, W. P. Headden III, and M. J. Stipicevic, "A phrase-discovering topic model using hierarchical Pitman-Yor processes," in EMNLP, 2012, pp. 214–222.

C. Archambeau, B. Lakshminarayanan, and G. Bouchard, "Latent IBP compound Dirichlet allocation," TPAMI, vol. 37, no. 2, pp. 321–333, 2015.

F. Wood and Y. W. Teh, "A hierarchical nonparametric Bayesian approach to statistical language model domain adaptation," in AISTATS, 2009, pp. 607–614.

L. Du, W. Buntine, and H. Jin, "Modelling sequential text with an adaptive topic model," in EMNLP, 2012, pp. 535–545.

J. H. Kim, D. Kim, S. Kim, and A. Oh, "Modeling topic hierarchies with the recursive Chinese restaurant process," in CIKM, 2012, pp. 783–792.

A. Ahmed, L. Hong, and A. Smola, "Nested Chinese restaurant franchise process: Applications to user tracking and document modeling," in ICML, 2013, pp. 1426–1434.

W. Li and A. McCallum, "Pachinko allocation: DAG-structured mixture models of topic correlations," in ICML, 2006, pp. 577–584.

Y. Cong, B. Chen, H. Liu, and M. Zhou, "Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC," in ICML, 2017, pp. 864–873.

N. Aletras and M. Stevenson, "Evaluating topic coherence using distributional semantics," in Proc. of the 10th International Conference on Computational Semantics, 2013, pp. 13–22.

J. H. Lau, D. Newman, and T. Baldwin, "Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality," in EACL, 2014, pp. 530–539.

Y. Yang, D. Downey, and J. Boyd-Graber, "Efficient methods for incorporating knowledge into topic models," in EMNLP, 2015, pp. 308–317.

M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, "Stochastic variational inference," JMLR, vol. 14, no. 1, pp. 1303–1347, 2013.

R. Guhaniyogi, S. Qamar, and D. B. Dunson, "Bayesian conditional density filtering," Journal of Computational and Graphical Statistics, 2018.

T. Chen, E. Fox, and C. Guestrin, "Stochastic gradient Hamiltonian Monte Carlo," in ICML, 2014, pp. 1683–1691.

N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven, "Bayesian sampling using stochastic gradient thermostats," in NIPS, 2014, pp. 3203–3211.

M. Welling and Y. W. Teh, "Bayesian learning via stochastic gradient Langevin dynamics," in ICML, 2011, pp. 681–688.