{"title": "Discriminative Topic Modeling with Logistic LDA", "book": "Advances in Neural Information Processing Systems", "page_first": 6770, "page_last": 6780, "abstract": "Despite many years of research into latent Dirichlet allocation (LDA), applying LDA to collections of non-categorical items is still challenging for practitioners. Yet many problems with much richer data share a similar structure and could benefit from the vast literature on LDA. We propose logistic LDA, a novel discriminative variant of latent Dirichlet allocation which is easy to apply to arbitrary inputs. In particular, our model can easily be applied to groups of images, arbitrary text embeddings, or integrate deep neural networks. Although it is a discriminative model, we show that logistic LDA can learn from unlabeled data in an unsupervised manner by exploiting the group structure present in the data. In contrast to other recent topic models designed to handle arbitrary inputs, our model does not sacrifice the interpretability and principled motivation of LDA.", "full_text": "Discriminative Topic Modeling with Logistic LDA\n\nIryna Korshunova\nGhent University\n\niryna.korshunova@ugent.be\n\nMateusz Fedoryszak\n\nTwitter\n\nmfedoryszak@twitter.com\n\nHanchen Xiong\n\nTwitter\n\nhxiong@twitter.com\n\nLucas Theis\n\nTwitter\n\nltheis@twitter.com\n\nAbstract\n\nDespite many years of research into latent Dirichlet allocation (LDA), applying\nLDA to collections of non-categorical items is still challenging. Yet many problems\nwith much richer data share a similar structure and could bene\ufb01t from the vast\nliterature on LDA. We propose logistic LDA, a novel discriminative variant of latent\nDirichlet allocation which is easy to apply to arbitrary inputs. In particular, our\nmodel can easily be applied to groups of images, arbitrary text embeddings, and\nintegrates well with deep neural networks. 
Although it is a discriminative model,\nwe show that logistic LDA can learn from unlabeled data in an unsupervised manner\nby exploiting the group structure present in the data. In contrast to other recent\ntopic models designed to handle arbitrary inputs, our model does not sacri\ufb01ce the\ninterpretability and principled motivation of LDA.\n\n1\n\nIntroduction\n\nProbabilistic topic models are powerful tools for discovering themes in large collections of items.\nTypically, these collections are assumed to be documents and the models assign topics to individual\nwords. However, a growing number of real-world problems require assignment of topics to much\nricher sets of items. For example, we may want to assign topics to the tweets of an author on Twitter\nwhich contain multiple sentences as well as images, or to the images and websites stored in a board\non Pinterest, or to the videos uploaded by a user on YouTube. These problems have in common that\ngrouped items are likely to be thematically similar. We would like to exploit this dependency instead\nof categorizing items based on their content alone. Topic models provide a natural way to achieve\nthis.\nThe most widely used topic model is latent Dirichlet allocation (LDA) [6]. With a few exceptions [7,\n31], LDA and its variants, including supervised models [5, 19], are generative. They generally assume\na multinomial distribution over words given topics, which limits their applicability to discrete tokens.\nWhile it is conceptually easy to extend LDA to continuous inputs [4], modeling the distribution of\ncomplex data such as images can be a dif\ufb01cult task on its own. Achieving low perplexity on images,\nfor example, would require us to model many dependencies between pixels which are of little use for\ntopic inference and would lead to inef\ufb01cient models. 
On the other hand, a lot of progress has been made in accurately and efficiently assigning categories to images using discriminative models such as convolutional neural networks [18, 35].

In this work, our goal is to build a class of discriminative topic models capable of handling much richer items than words. At the same time, we would like to preserve LDA's extensibility and interpretability. In particular, group-level topic distributions and items should be independent given the item's topics, and topics and topic distributions should interact in an intuitive way. Our model achieves these goals by discarding the generative part of LDA while maintaining the factorization of the conditional distribution over latent variables. By using neural networks to represent one of the factors, the model can deal with arbitrary input types. We call this model logistic LDA, as its connection to LDA is analogous to the relationship between logistic regression and naive Bayes \u2013 textbook examples of discriminative and generative approaches [30].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

A desirable property of generative models is that they can be trained in an unsupervised manner. In Section 6, we show that the grouping of items provides enough supervision to train logistic LDA in an otherwise unsupervised manner. We provide two approaches for training our model. In Section 5.1, we describe mean-field variational inference, which can be used to train our model in an unsupervised, semi-supervised, or supervised manner. In Section 5.2, we further describe an empirical risk minimization approach which can be used to optimize an arbitrary loss when labels are available.

When topic models are applied to documents, the topics associated with individual words are usually of little interest.
In contrast, the topics of tweets on Twitter, pins on Pinterest, or videos on YouTube are of great interest. Therefore, we additionally introduce a new annotated dataset of tweets which allows us to explore the model's ability to infer the topics of items. Our code and datasets are available at github.com/lucastheis/logistic_lda.

2 Related work

2.1 Latent Dirichlet allocation

LDA [6] is a latent variable model which relates observed words xdn \u2208 {1, . . . , V} of document d to latent topics kdn \u2208 {1, . . . , K} and a distribution over topics \u03c0d. It specifies the following generative process for a document:

1. Draw topic proportions \u03c0d \u223c Dir(\u03b1)
2. For each word xdn:
   (a) Draw a topic assignment kdn \u223c Cat(\u03c0d)
   (b) Draw a word xdn \u223c Cat(\u03b2\u22a4kdn)

Here, we assume that topics and words are represented as vectors using a one-hot encoding, and \u03b2 is a K \u00d7 V matrix where each row corresponds to a topic and parametrizes a categorical distribution over words. The matrix \u03b2 is either considered a parameter of the model or, more commonly, a latent variable with a Dirichlet prior over rows, i.e., \u03b2k \u223c Dir(\u03b7) [6]. A graphical model corresponding to LDA is provided in Figure 1a.

Blei et al. [6] used mean-field variational inference to approximate the intractable posterior over latent variables \u03c0d and kdn, resulting in closed-form coordinate ascent updates on a variational lower bound of the model's log-likelihood. Many other methods of inference have been explored, including Gibbs sampling [12], expectation propagation [28], and stochastic variants of variational inference [14].

It is worth noting that while LDA is most frequently used to model words, it can also be applied to collections of other items.
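As a concrete illustration, the two-step sampling procedure above can be sketched with numpy; the constants K, V, N and the Dirichlet concentration parameters below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, N = 3, 10, 5                             # topics, vocabulary size, words per document
alpha = np.full(K, 0.1)                        # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.full(V, 0.1), size=K)  # K x V matrix of topic-word distributions

# 1. Draw topic proportions pi_d ~ Dir(alpha)
pi_d = rng.dirichlet(alpha)

# 2. For each word: draw a topic assignment, then a word from that topic's distribution
topics = rng.choice(K, size=N, p=pi_d)              # k_dn ~ Cat(pi_d)
words = [rng.choice(V, p=beta[k]) for k in topics]  # x_dn ~ Cat(beta^T k_dn)
```

With a one-hot k_dn, the product \u03b2\u22a4kdn simply selects the row beta[k], which is what the list comprehension does.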
For example, images can be viewed as collections of image patches, and by assigning each image patch to a discrete code word one could directly apply the model above [15, 39]. However, while clustering, for example, is a simple way to assign image patches to code words, it is unclear how to choose between different preprocessing approaches in a principled manner.

2.2 A zoo of topic models

Many topic models have built upon the ideas of LDA and extended it in various ways. One group of methods modifies LDA's assumptions regarding the forms of p(\u03c0d), p(kdn|\u03c0d) and p(xdn|kdn) such that the model becomes more expressive. For example, Jo and Oh [16] modeled sentences instead of words. Blei and Jordan [4] applied LDA to images by extracting low-level features such as color and texture and using Gaussian distributions instead of categorical distributions. Other examples include correlated topic models [3], which replace the Dirichlet distribution over topics p(\u03c0d) with a logistic normal distribution, and hierarchical Dirichlet processes [37], which enable an unbounded number of topics. Srivastava and Sutton [36] pointed out that a major challenge with this approach is the need to rederive an inference algorithm with every change to the modeling assumptions. Thus, several papers proposed to use neural variational inference but only tested their approaches on simple categorical items such as words [24, 25, 36].

Figure 1: Graphical models for (a) LDA [6], (b) supervised LDA [5], and (c) logistic LDA. Gray circles indicate variables which are typically observed during training.

Another direction of extending LDA is to incorporate extra information such as authorship [34], time [40], annotations [4], class labels or other features [27]. In this work, we are mainly interested in the inclusion of class labels as it covers a wide range of practical applications. In our model, in contrast to sLDA [5] and DiscLDA [19], labels interact with topic proportions instead of topics, and unlike in L-LDA [33], labels do not impose hard constraints on topic proportions.

A related area of research models documents without an explicit representation of topics, instead using more generic latent variables. These models are commonly implemented using neural networks and are sometimes referred to as document models, to distinguish them from topic models which represent topics explicitly [24]. Examples of document models include Replicated Softmax [13], TopicRNN [10], NVDM [25], the Sigmoid Belief Document Model [29], and DocNADE [22]. Finally, non-probabilistic approaches to topic modeling employ heuristically designed loss functions. For example, Cao et al. [7] used a ranking loss to train an LDA-inspired neural topic model.

3 An alternative view of LDA

In this section, we provide an alternative derivation of LDA as a special case of a broader class of models. Our goal is to derive a class of models which makes it easy to handle a variety of data modalities but which keeps the desirable inductive biases of LDA.
In particular, topic distributions \u03c0d and items xdn should be independent given the items' topics kdn, and topics and topic distributions should interact in an intuitive way.

Instead of specifying a generative model as a directed network, we assume the factorization in Figure 1c and make the following three assumptions about the complete conditionals:

p(\u03c0d | kd) = Dir(\u03c0d; \u03b1 + \u2211n kdn),   (1)
p(kdn | xdn, \u03c0d, \u03b8) = kdn\u22a4 softmax(g(xdn, \u03b8) + ln \u03c0d),   (2)
p(\u03b8 | x, k) \u221d exp(r(\u03b8) + \u2211dn kdn\u22a4 g(xdn, \u03b8)).   (3)

The first condition requires that the topic distribution is conditionally Dirichlet distributed, as in LDA. The second condition expresses how we would like to integrate information from \u03c0d and xdn to calculate beliefs over kdn. The function g might be a neural network, in which case ln \u03c0d simply acts as an additional bias which is shared between grouped items. Finally, the third condition expresses what inference would look like if we knew the topics of all words. This inference step is akin to a classification problem with labels kdn, where exp r(\u03b8) acts as a prior and the remaining factors act as a likelihood.

In general, for an arbitrary set of conditional distributions, there is no guarantee that a corresponding joint distribution exists. The conditional distributions might be inconsistent, in which case no joint distribution can satisfy all of them [1].
However, whenever a positive joint distribution exists we can use Brook's lemma [1] to find a form for the joint distribution, which in our case yields (Appendix B):

p(\u03c0, k, \u03b8 | x) \u221d exp((\u03b1 \u2212 1)\u22a4 \u2211d ln \u03c0d + \u2211dn kdn\u22a4 ln \u03c0d + \u2211dn kdn\u22a4 g(xdn, \u03b8) + r(\u03b8)).   (4)

It is easy to verify that this distribution satisfies the constraints given by Equations 1 to 3. Furthermore, one can show that the posterior induced by LDA is a special case of Eq. 4, where

r(\u03b2) = (\u03b7 \u2212 1)\u22a4 \u2211k ln \u03b2k,   g(xdn, \u03b2) = ln \u03b2 xdn,   (5)

and \u03b2 is constrained such that \u2211j \u03b2kj = 1 for all k (Appendix A). However, Eq. 4 describes a larger class of models which share a very similar form of the posterior distribution.

The risk of relaxing assumptions is that it may prevent us from generalizing from data. An interesting question is therefore whether there are other choices of g and r which lead to useful models. In particular, does g have to be a normalized log-likelihood as in LDA or can we lift this constraint? In the following, we answer this question positively.

3.1 Supervised extension

In many practical settings we have access to labels associated with the documents. It is not difficult to extend the model given by Equations 1 to 3 to the supervised case. However, there are multiple ways to do so. For instance, sLDA [5] assumes that a class variable cd arises from the empirical frequencies of topic assignments kdn within a document, as in Figure 1b. An alternative would be to have the class labels influence the topic proportions \u03c0d instead. As a motivating example for the latter, consider the case where authors of documents belong to certain communities, each with a tendency to talk about different topics.
Thus, even before observing any words of a new document, knowing the community provides us with information about the topic distribution \u03c0d. In sLDA, on the other hand, beliefs over topic distributions can only be influenced by labels once words xdn have been observed.

Our proposed supervised extension therefore assumes Equations 2 and 3 together with the following conditionals:

p(cd | \u03c0d) = softmax(\u03bb cd\u22a4 ln \u03c0d),   p(\u03c0d | kd, cd) = Dir(\u03c0d; \u03b1 + \u2211n kdn + \u03bbcd).   (6)

Appendix B provides a derivation of the corresponding joint distribution, p(\u03c0, k, c, \u03b8 | x). Here, we assumed that the document label cd is a 1 \u00d7 K one-hot vector and \u03bb is an extra scalar hyperparameter. Future work may want to explore the case where the number of classes is different from K and \u03bb is replaced by a learnable matrix of weights.

4 Logistic LDA

Let us return to the question regarding the possible choices for g and r in Eq. 4. An interesting alternative to LDA is to require \u2211k \u03b2kj = 1. Instead of distributions over words, \u03b2 in this case encodes a distribution over topics for each word, and Eq. 3 turns into the posterior of a discriminative classifier rather than the posterior associated with a generative model over words. More generally, we can normalize g such that it corresponds to a discriminative log-likelihood,

g(xdn, \u03b8) = ln softmax f(xdn, \u03b8),   (7)

where f outputs, for example, the K-dimensional logits of a neural network with parameters \u03b8. Note that the conditional distribution over topics in Eq. 2 remains unchanged by this normalization.

Similar to how logistic regression and naive Bayes both implement linear classifiers but only naive Bayes makes assumptions about the distribution of inputs [30], our revised model shares the same conditional distribution over topics as LDA, but no longer makes assumptions about the distribution of inputs xdn. We therefore refer to LDA-type models whose g takes the form of Eq. 7 as logistic LDA.

Discriminative models typically require labels for training. But unlike other discriminative models, logistic LDA already receives a weak form of supervision through the partitioning of the dataset, which encourages grouped items to be mapped to the same topics. Unfortunately, the assumptions of logistic LDA are still slightly too weak to produce useful beliefs. In particular, assigning all topics kdn to the same value has high probability (Eq. 4). However, we found the following regularizer to be enough to encourage the use of all topics and to allow unsupervised training:

r(\u03b8, x) = \u03b3 \u00b7 1\u22a4 ln \u2211dn exp g(xdn, \u03b8).   (8)

Here, we allow the regularizer to depend on the observed data, which otherwise does not affect the math in Section 3, and \u03b3 controls the strength of the regularization. The regularizer effectively computes the average distribution of the items' topics as predicted by g across the whole dataset and compares it to the uniform distribution.
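Under the choice g = ln softmax f of Eq. 7, exp g(xdn, \u03b8) is simply the per-item topic distribution, so the regularizer can be computed by summing topic probabilities over the dataset. A minimal numpy sketch, in which random logits stand in for a trained network f and the value of \u03b3 is an arbitrary illustrative setting:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                                # number of topics
logits = rng.normal(size=(100, K))   # stand-in for f(x_dn, theta) over 100 items

# g = ln softmax(f), so exp(g) is each item's topic distribution
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Eq. 8: r(theta, x) = gamma * sum_k ln sum_dn exp g_k(x_dn, theta)
gamma = 1.0
r = gamma * np.log(probs.sum(axis=0)).sum()
```

Since the per-topic sums probs.sum(axis=0) add up to the number of items, the sum of their logarithms is largest when every topic receives the same aggregate probability, which is how the regularizer encourages the use of all topics.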
The proposed regularizer allows us to discover meaningful topics with logistic LDA in an unsupervised manner, although the particular form of the regularizer may not be crucial.

To make the regularizer more amenable to stochastic approximation, we lower-bound it as follows:

r(\u03b8, x) \u2265 \u03b3 \u2211dn \u2211k rdnk ln(exp gk(xdn, \u03b8) / rdnk),   rdnk = exp gk(xdn, \u03b8) / \u2211dn exp gk(xdn, \u03b8).   (9)

For fixed rdnk evaluated at \u03b8, the lower bound has the same gradient as r(\u03b8, x). In practice, we are further approximating the denominator of rdnk with a running average, yielding an estimate \u02c6rdnk (see Appendix D for details).

5 Training and inference

5.1 Mean-field variational inference

We approximate the intractable posterior (Eq. 4) with a factorial distribution via mean-field variational inference, that is, by minimizing the Kullback-Leibler (KL) divergence

DKL( q(\u03b8) (\u220fd q(cd)) (\u220fd q(\u03c0d)) (\u220fdn q(kdn)) || p(\u03c0, k, c, \u03b8 | x) )   (10)

with respect to the distributions q. Assuming for now that the distribution over \u03b8 is concentrated on a point estimate, i.e., q(\u03b8; \u02c6\u03b8) = \u03b4(\u03b8 \u2212 \u02c6\u03b8), we can derive the following coordinate descent updates for the variational parameters (see Appendix C for more details):

q(cd) = cd\u22a4 \u02c6pd,   \u02c6pd = softmax(\u03bb\u03c8(\u02c6\u03b1d)),   (11)
q(\u03c0d) = Dir(\u03c0d; \u02c6\u03b1d),   \u02c6\u03b1d = \u03b1 + \u2211n \u02c6pdn + \u03bb\u02c6pd,   (12)
q(kdn) = kdn\u22a4 \u02c6pdn,   \u02c6pdn = softmax(f(xdn, \u02c6\u03b8) + \u03c8(\u02c6\u03b1d)).   (13)

Here, \u03c8 is the digamma function and f are the logits of a neural network with parameters \u02c6\u03b8. From Eq. 13, we see that topic predictions for a word xdn are computed based on biased logits. The bias \u03c8(\u02c6\u03b1d) aggregates information across all items of a group (e.g., words of a document), thus providing context for individual predictions.

Iterating Equations 11 to 13 in arbitrary order implements a valid inference algorithm for any fixed \u02c6\u03b8. Note that inference does not depend on the regularizer. To optimize the neural network's weights \u02c6\u03b8, we fix the values of the variational parameters \u02c6pdn and regularization terms \u02c6rdn. We then optimize the KL divergence in Eq. 10 with respect to \u02c6\u03b8, which amounts to minimizing the following cross-entropy loss:

\u2113(\u02c6\u03b8) \u2248 \u2212\u2211dn (\u02c6pdn + \u03b3 \u00b7 \u02c6rdn)\u22a4 g(xdn, \u02c6\u03b8).   (14)

This corresponds to a classification problem with soft labels \u02c6pdn + \u03b3 \u00b7 \u02c6rdn. Intuitively, \u02c6pdn tries to align predictions for grouped items, while \u02c6rdn tries to ensure that each topic is predicted at least some of the time.

Thus far, we presented a general way of training and inference in logistic LDA, where we assumed cd to be a latent variable. If class labels are observed for some or all of the documents, we can replace \u02c6pd with cd during training. This makes the method suitable for unsupervised, semi-supervised and supervised learning. For supervised training with labels, we further developed the discriminative training procedure below.

5.2 Discriminative training

Decision tasks are associated with loss functions, e.g., in classification we often care about accuracy. Variational inference, however, only approximates general properties of a posterior while ignoring the task in which the approximation is going to be used, leading to suboptimal results [20].
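The coordinate updates of Equations 11 to 13 can be sketched for a single document as follows; random logits stand in for the network f, and the settings of K, N, \u03b1, \u03bb, and the iteration count are illustrative, not values from the paper:

```python
import numpy as np
from scipy.special import digamma, softmax

rng = np.random.default_rng(0)
K, N = 4, 6                       # topics, items in the document
alpha, lam = 0.1, 1.0             # Dirichlet prior and label strength (illustrative)
logits = rng.normal(size=(N, K))  # stand-in for f(x_dn, theta_hat) per item

alpha_hat = np.full(K, alpha)
p_d = np.full(K, 1.0 / K)         # uniform initial beliefs over the document label

for _ in range(10):
    # Eq. 13: per-item topic beliefs, biased by the shared digamma term
    p_dn = softmax(logits + digamma(alpha_hat), axis=1)
    # Eq. 12: update the Dirichlet parameters from item beliefs and label beliefs
    alpha_hat = alpha + p_dn.sum(axis=0) + lam * p_d
    # Eq. 11: update beliefs over the document label
    p_d = softmax(lam * digamma(alpha_hat))
```

The shared bias digamma(alpha_hat) is what couples the items of a document: each pass through the loop lets every item's belief influence all the others.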
When enough labels are available and classification is the goal, we therefore propose to directly optimize parameters \u02c6\u03b8 with respect to an empirical loss instead of the KL divergence above, e.g., a cross-entropy loss:

\u2113(\u02c6\u03b8) = \u2212\u2211d cd\u22a4 ln \u02c6pd.   (15)

To see why this is possible, note that each update in Equations 11 to 13 leading to \u02c6pd is a differentiable operation. In effect, we are unrolling the mean-field updates and treating them like layers of a sophisticated neural network. This strategy has been successfully used before, for example, to improve the performance of CRFs in semantic segmentation [41]. Unrolling mean-field updates leads to the training procedure given in Algorithm 1. The algorithm reveals that training and inference can be implemented easily even when the derivations needed to arrive at this algorithm may have seemed complex.

Algorithm 1 requires processing of all words of a document in each iteration. In Appendix D, we discuss highly scalable implementations of variational training and inference which only require looking at a single item at a time. This is useful in settings with many items or where items are more complex than words.

Algorithm 1 Single step of discriminative training for a collection {xdn}, n = 1, . . . , Nd, with class label cd.
Require: {xdn}, n = 1, . . . , Nd, and cd
  \u02c6\u03b1d \u2190 \u03b1
  \u02c6pd \u2190 1/K   % uniform initial beliefs over K classes
  for i \u2190 1 to Niter do
    for n \u2190 1 to Nd do
      \u02c6pdn \u2190 softmax(f(xdn, \u02c6\u03b8) + \u03c8(\u02c6\u03b1d))   % Eq. 13; f outputs K logits of a neural net
    end for
    \u02c6\u03b1d \u2190 \u03b1 + \u2211n \u02c6pdn + \u03bb\u02c6pd   % Eq. 12
    \u02c6pd \u2190 softmax(\u03bb\u03c8(\u02c6\u03b1d))   % Eq. 11
  end for
  \u02c6\u03b8 \u2190 \u02c6\u03b8 \u2212 \u03b5\u2207\u03b8 cross_entropy(cd, \u02c6pd)

6 Experiments

While a lot of research has been done on models related to LDA, benchmarks have almost exclusively focused on either document classification or on a generative model's perplexity. However, here we are not only interested in logistic LDA's ability to discover the topics of documents but also those of individual items, as well as its ability to handle arbitrary types of inputs. We therefore explore two new benchmarks. First, we are going to look at a model's ability to discover the topics of tweets. Second, we are going to evaluate a model's ability to predict the categories of boards on Pinterest based on images. To connect with the literature on topic models and document classifiers, we are going to show that logistic LDA can also work well when applied to the task of document classification. Finally, we demonstrate that logistic LDA can recover meaningful topics from Pinterest and 20-Newsgroups in an unsupervised manner.

6.1 Topic classification on Twitter

We collected two sets of tweets. The first set contains 1.45 million tweets of 66,455 authors. Authors were clustered based on their follower graph, assigning each author to one or more communities. The clusters were subsequently manually annotated based on the content of typical tweets in the community. The community label thus provides us with an idea of the content an author is likely to produce. The second dataset contains 3.14 million tweets of 18,765 authors but no author labels. Instead, 18,864 tweets were manually annotated with one out of 300 topics from the same taxonomy used to annotate communities. We split the first dataset into training (70%), validation (10%), and test (20%) sets such that tweets of each author were only contained in one of the sets. The second dataset was only used for evaluation.
Due to the smaller size of the second dataset, we used 10-fold cross-validation to estimate the performance of all models.

During training, we used community labels to influence the distribution over topic weights via cd. Where authors belonged to multiple communities, a label was chosen at random (i.e., the label is noisy). Labels were not used during inference but only during training. Tweets were embedded by averaging 300-dimensional skip-gram embeddings of words [26]. Logistic LDA applied a shallow MLP on top of these embeddings and was trained using a stochastic approximation to mean-field variational inference (Section 5.1). As baselines, we tried LDA as well as training an MLP to predict the community labels directly. To predict the community of authors with an MLP, we used majority voting across the predictions for their tweets. The main difference between majority voting and logistic LDA is that the latter is able to reevaluate predictions for tweets based on other tweets of an author. For LDA, we extended the open source implementation of Theis and Hoffman [38] to depend on the label in the same manner as logistic LDA. That is, the label biases topic proportions as in Figure 1c. The words in all tweets of an author were combined to form a document, and the 100,000 most frequent words of the corpus formed the vocabulary of LDA. To predict the topics of tweets, we averaged LDA's beliefs over the topics of words contained in the tweet, \u2211m q(kdm)/M.

Table 1 shows that logistic LDA is able to improve the predictions of a purely discriminatively trained neural network for both author- and tweet-level categories. More principled inference allows it to improve the accuracy of predictions of the communities of authors, while integrating information from other tweets allows it to improve the prediction of a tweet's topic. LDA performed worse on both tasks. We note that labels of the dataset are noisy and difficult to predict even for humans, hence the relatively low accuracy numbers.

Table 1: Accuracy of prediction of annotations at the author and tweet level. Authors were annotated with communities, tweets with topics. LDA here refers to a supervised generative model.

Model            | Author | Tweet
MLP (individual) | 26.6%  | 32.4%
MLP (majority)   | 35.0%  | n/a
LDA              | 33.1%  | 25.4%
Logistic LDA     | 38.7%  | 35.6%

Table 2: Accuracy on 20-Newsgroups.

Model          | Test accuracy
SVM [8]        | 82.9%
LSTM [9]       | 82.0%
SA-LSTM [9]    | 84.4%
oh-2LSTMp [17] | 86.5%
Logistic LDA   | 84.4%

6.2 Image categorization on Pinterest

To illustrate how logistic LDA can be used with images, we apply it to Pinterest data of boards and pins. In LDA's terms, every board would correspond to a document and every pin \u2013 an image pinned to a board \u2013 to a word or an item. For our purpose, we used a subset of the Pinterest dataset of Geng et al. [11], which we describe in Appendix F. It should be noted, however, that the dataset contains only board labels. Thus, without pin labels, we are not able to perform the same in-depth analysis as in the previous section.

As in the case of the Twitter data, we trained logistic LDA with our stochastic variational inference procedure. For comparison, we trained an MLP to predict the labels of individual pins, where each pin was labeled with the category of its board. For both models, we used image embeddings from a MobileNetV2 [35] as inputs, and tuned the hyperparameters on a validation set.

Figure 2: Top-3 images assigned to three different topics discovered by logistic LDA in an unsupervised manner (dogs, fashion, architecture).

Test accuracy when predicting board labels for logistic LDA and MLP was 82.5% and 81.3%, respectively.
For the MLP, this score was obtained using majority voting across pins to compute board predictions. We further trained logistic LDA in an unsupervised manner, with image embeddings subsequently mapped to topics using the trained neural network. We find that logistic LDA is able to learn coherent topics in an unsupervised manner. Example topics are visualized in Figure 2. Further details and more results are provided in Appendices G and H.

6.3 Document classification

We apply logistic LDA with discriminative training (Section 5.2) to the standard benchmark problem of document classification on the 20-Newsgroups dataset [21]. 20-Newsgroups comprises around 18,000 posts partitioned almost evenly among 20 topics. While various versions of this dataset exist, we used the preprocessed version of Cardoso-Cachopo [8] so our results can be compared to the ones from LSTM-based classifiers [9, 17]. More details on this dataset are given in Appendix F.

We trained logistic LDA with words represented as 300-dimensional GloVe embeddings [32]. The hyperparameters were selected based on a 15% split from the training data and are listed in Appendix E. Results of these experiments are given in Table 2. As a baseline, we include an SVM model trained on tf-idf document vectors [8]. We also compare logistic LDA to an LSTM model for document classification [9], which owes its poorer performance to unstable training and difficulties when dealing with long texts. These issues can be overcome by starting from a pretrained model or by using more intricate architectures. SA-LSTM [9] adopts the former approach, while oh-2LSTMp [17] implements the latter. To our knowledge, oh-2LSTMp holds the state-of-the-art result on 20-Newsgroups. While logistic LDA does not surpass oh-2LSTMp on this task, its performance compares favourably to other more complex models.
Remarkably, it achieves the same results as SA-LSTM \u2013 an LSTM classifier initialized with a pretrained sequence autoencoder [9]. It is worth noting that logistic LDA uses generic word embeddings, is a lightweight model which requires hours instead of days to train, and provides explicit representations of topics.

In this accuracy-driven benchmark, it is interesting to look at the performance of a supervised logistic LDA trained with the loss-insensitive objective for \u02c6\u03b8 described in Section 5.1. Our best accuracy with this method was 82.2% \u2013 a significantly worse result compared to the 84.4% achieved with logistic LDA that used the cross-entropy loss of Eq. 15 when optimizing \u02c6\u03b8. This confirms the usefulness of optimizing inference for the task at hand [20].

The benefit of mean-field variational inference (Section 5.1) is that it allows us to train logistic LDA in an unsupervised manner. In this case, we find that the model is able to discover topics such as the ones given in Table 3.
Qualitative comparisons with Srivastava and Sutton [36] together with NPMI topic coherence scores [23] of multiple models can be found in Appendix I.

Table 3: Examples of topics discovered by unsupervised logistic LDA, represented by their top-10 words

1  bmw, motor, car, honda, motorcycle, auto, mg, engine, ford, bike
2  christianity, prophet, atheist, religion, holy, scripture, biblical, catholic, homosexual, religious
3  spacecraft, orbit, probe, ship, satellite, rocket, surface, shipping, moon, launch
4  user, computer, microsoft, monitor, programmer, electronic, processing, data, app, systems
5  congress, administration, economic, accord, trade, criminal, seriously, fight, responsible, future

7 Discussion and conclusion

We presented logistic LDA, a neural topic model that preserves most of LDA's inductive biases while giving up its generative component in favour of a discriminative approach, making it easier to apply to a wide range of data modalities.
In this paper we only scratched the surface of what may be possible with discriminative variants of LDA. Many inference techniques have been developed for LDA and could be applied to logistic LDA. For example, mean-field variational inference is known to be prone to local optima, but trust-region methods are able to get around them [38]. We only trained fairly simple neural networks on precomputed embeddings. An interesting question is whether much deeper neural networks can be trained using only weak supervision in the form of grouped items.
Interestingly, logistic LDA would not be considered a discriminative model if we follow the definition of Bishop and Lasserre [2]. According to this definition, a discriminative model's joint distribution over inputs x, labels c, and model parameters θ factorizes as p(c | x, θ)p(θ)p(x). Logistic LDA, on the other hand, only admits the factorization p(c, θ | x)p(x).
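For clarity, the two factorizations just discussed can be set side by side (this simply restates the expressions above in display form):

```latex
% Bishop-Lasserre discriminative factorization vs. logistic LDA
\begin{align*}
\text{discriminative [2]:} \quad & p(x, c, \theta) = p(c \mid x, \theta)\, p(\theta)\, p(x) \\
\text{logistic LDA:}       \quad & p(x, c, \theta) = p(c, \theta \mid x)\, p(x)
\end{align*}
```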
Both have in common that the choice of marginal p(x) has no influence on inference, unlike in a generative model.

Acknowledgments

This work was done while Iryna Korshunova was an intern at Twitter, London. We thank Ferenc Huszár, Jonas Degrave, Guy Hugot-Derville, Francisco Ruiz, and Dawen Liang for helpful discussions and feedback on the manuscript.

References

[1] Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):192–236.

[2] Bishop, C. and Lasserre, J. (2007). Generative or discriminative? Getting the best of both worlds. Bayesian Statistics, 8:3–23.

[3] Blei, D. and Lafferty, J. (2007). A correlated topic model of science. Annals of Applied Statistics, 1:17–35.

[4] Blei, D. M. and Jordan, M. I. (2003). Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127–134.

[5] Blei, D. M. and McAuliffe, J. D. (2008). Supervised topic models. In Advances in Neural Information Processing Systems 20.

[6] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

[7] Cao, Z., Li, S., Liu, Y., Li, W., and Ji, H. (2015). A novel neural topic model and its supervised extension. In AAAI.

[8] Cardoso-Cachopo, A. (2007). Improving Methods for Single-label Text Categorization. PhD thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa.

[9] Dai, A. M. and Le, Q. V. (2015). Semi-supervised sequence learning. In Advances in Neural Information Processing Systems 28.

[10] Dieng, A. B., Wang, C., Gao, J., and Paisley, J. W. (2017). TopicRNN: A recurrent neural network with long-range semantic dependency.
In International Conference on Learning Representations.

[11] Geng, X., Zhang, H., Bian, J., and Chua, T. (2015). Learning image and user features for recommendation in social networks. In 2015 IEEE International Conference on Computer Vision, pages 4274–4282.

[12] Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1):5228–5235.

[13] Hinton, G. E. and Salakhutdinov, R. R. (2009). Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems 22.

[14] Hoffman, M. D., Blei, D. M., and Bach, F. (2010). Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 23.

[15] Hu, D. J. (2009). Latent Dirichlet allocation for text, images, and music.

[16] Jo, Y. and Oh, A. H. (2011). Aspect and sentiment unification model for online review analysis. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, pages 815–824.

[17] Johnson, R. and Zhang, T. (2016). Supervised and semi-supervised text categorization using LSTM for region embeddings. In Proceedings of the 33rd International Conference on Machine Learning, pages 526–534.

[18] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Commun. ACM, 60(6):84–90.

[19] Lacoste-Julien, S., Sha, F., and Jordan, M. I. (2009). DiscLDA: Discriminative learning for dimensionality reduction and classification. In Advances in Neural Information Processing Systems 21.

[20] Lacoste-Julien, S., Huszár, F., and Ghahramani, Z. (2011). Approximate inference for the loss-calibrated Bayesian. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 416–424.

[21] Lang, K.
(1995). NewsWeeder: Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning.

[22] Larochelle, H. and Lauly, S. (2012). A neural autoregressive topic model. In Advances in Neural Information Processing Systems 25.

[23] Lau, J. H., Newman, D., and Baldwin, T. (2014). Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics.

[24] Miao, Y., Grefenstette, E., and Blunsom, P. (2017). Discovering discrete latent topics with neural variational inference. In Proceedings of the 34th International Conference on Machine Learning, pages 2410–2419.

[25] Miao, Y., Yu, L., and Blunsom, P. (2016). Neural variational inference for text processing. In Proceedings of the 33rd International Conference on Machine Learning, pages 1727–1736.

[26] Mikolov, T., Chen, K., Corrado, G. S., and Dean, J. (2013). Efficient estimation of word representations in vector space. In International Conference on Learning Representations.

[27] Mimno, D. M. and McCallum, A. (2008). Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In Uncertainty in Artificial Intelligence.

[28] Minka, T. and Lafferty, J. (2002). Expectation-propagation for the generative aspect model. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 352–359.

[29] Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning.

[30] Ng, A. Y. and Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems 14. MIT Press.

[31] Pandey, G.
and Dukkipati, A. (2017). Discriminative neural topic models. arXiv preprint, abs/1701.06796.

[32] Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

[33] Ramage, D., Hall, D., Nallapati, R., and Manning, C. D. (2009). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 248–256.

[34] Rosen-Zvi, M., Griffiths, T., Steyvers, M., and Smyth, P. (2004). The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487–494.

[35] Sandler, M. B., Howard, A. G., Zhu, M., Zhmoginov, A., and Chen, L.-C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4510–4520.

[36] Srivastava, A. and Sutton, C. A. (2016). Neural variational inference for topic models. In Bayesian Deep Learning Workshop.

[37] Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

[38] Theis, L. and Hoffman, M. D. (2015). A trust-region method for stochastic variational inference with applications to streaming data. https://github.com/lucastheis/trlda.

[39] Wang, C., Blei, D., and Fei-Fei, L. (2009). Simultaneous image classification and annotation. In IEEE Conference on Computer Vision and Pattern Recognition.

[40] Wang, X. and McCallum, A. (2006). Topics over time: A non-Markov continuous-time model of topical trends.
In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 424–433.

[41] Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., and Torr, P. H. S. (2015). Conditional Random Fields as Recurrent Neural Networks. In IEEE International Conference on Computer Vision (ICCV), pages 1529–1537.