{"title": "End-to-end Learning of LDA by Mirror-Descent Back Propagation over a Deep Architecture", "book": "Advances in Neural Information Processing Systems", "page_first": 1765, "page_last": 1773, "abstract": "We develop a fully discriminative learning approach for supervised Latent Dirichlet Allocation (LDA) model using Back Propagation (i.e., BP-sLDA), which maximizes the posterior probability of the prediction variable given the input document. Different from traditional variational learning or Gibbs sampling approaches, the proposed learning method applies (i) the mirror descent algorithm for maximum a posterior inference and (ii) back propagation over a deep architecture together with stochastic gradient/mirror descent for model parameter estimation, leading to scalable and end-to-end discriminative learning of the model.  As a byproduct, we also apply this technique to develop a new learning method for the traditional unsupervised LDA model (i.e., BP-LDA). Experimental results on three real-world regression and classification tasks show that the proposed methods significantly outperform the previous supervised topic models, neural networks, and is on par with deep neural networks.", "full_text": "End-to-end Learning of LDA by Mirror-Descent Back\n\nPropagation over a Deep Architecture\n\nJianshu Chen\u21e4, Ji He\u2020, Yelong Shen\u21e4, Lin Xiao\u21e4, Xiaodong He\u21e4, Jianfeng Gao\u21e4,\n\nXinying Song\u21e4 and Li Deng\u21e4\n\n\u21e4Microsoft Research, Redmond, WA 98052, USA,\n\n{jianshuc,yeshen,lin.xiao,xiaohe,jfgao,xinson,deng}@microsoft.com\n\u2020Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA,\n\njvking@uw.edu\n\nAbstract\n\nWe develop a fully discriminative learning approach for supervised Latent Dirich-\nlet Allocation (LDA) model using Back Propagation (i.e., BP-sLDA), which max-\nimizes the posterior probability of the prediction variable given the input doc-\nument. 
Different from traditional variational learning or Gibbs sampling ap-\nproaches, the proposed learning method applies (i) the mirror descent algorithm\nfor maximum a posterior inference and (ii) back propagation over a deep architec-\nture together with stochastic gradient/mirror descent for model parameter estima-\ntion, leading to scalable and end-to-end discriminative learning of the model. As\na byproduct, we also apply this technique to develop a new learning method for\nthe traditional unsupervised LDA model (i.e., BP-LDA). Experimental results on\nthree real-world regression and classi\ufb01cation tasks show that the proposed meth-\nods signi\ufb01cantly outperform the previous supervised topic models, neural net-\nworks, and is on par with deep neural networks.\n\n1\n\nIntroduction\n\nLatent Dirichlet Allocation (LDA) [5], among various forms of topic models, is an important prob-\nabilistic generative model for analyzing large collections of text corpora. In LDA, each document is\nmodeled as a collection of words, where each word is assumed to be generated from a certain topic\ndrawn from a topic distribution. The topic distribution can be viewed as a latent representation of\nthe document, which can be used as a feature for prediction purpose (e.g., sentiment analysis). In\nparticular, the inferred topic distribution is fed into a separate classi\ufb01er or regression model (e.g.,\nlogistic regression or linear regression) to perform prediction. Such a separate learning structure\nusually signi\ufb01cantly restricts the performance of the algorithm. For this purpose, various supervised\ntopic models have been proposed to model the documents jointly with the label information. In\n[4], variational methods was applied to learn a supervised LDA (sLDA) model by maximizing the\nlower bound of the joint probability of the input data and the labels. 
The DiscLDA method developed in [15] learns the transformation matrix from the latent topic representation to the output in a discriminative manner, while learning the topic to word distribution in a generative manner similar to the standard LDA. In [26], max margin supervised topic models are developed for classification and regression, which are trained by optimizing the sum of the variational bound for the log marginal likelihood and an additional term that characterizes the prediction margin. These methods successfully incorporate the information from both the input data and the labels, and showed better performance in prediction compared to the vanilla LDA model.\n\nFigure 1: Graphical representation of the supervised LDA model. Shaded nodes are observables.\n\nOne challenge in LDA is that the exact inference is intractable, i.e., the posterior distribution of the topics given the input document cannot be evaluated explicitly. For this reason, various approximate inference methods are proposed, such as variational learning [4, 5, 26] and Gibbs sampling [9, 27], for computing the approximate posterior distribution of the topics. In this paper, we will show that, although the full posterior probability of the topic distribution is difficult, its maximum a posteriori (MAP) inference, as a simplified problem, is a convex optimization problem when the Dirichlet parameter satisfies certain conditions, which can be solved efficiently by the mirror descent algorithm (MDA) [2, 18, 21]. Indeed, Sontag and Roy [19] pointed out that the MAP inference problem of LDA in this situation is polynomial-time and can be solved by an exponentiated gradient method, which shares the same form as our mirror-descent algorithm with constant step-size. 
Nevertheless,\ndifferent from [19], which studied the inference problem alone, our focus in this paper is to in-\ntegrate back propagation with mirror-descent algorithm to perform fully discriminative training of\nsupervised topic models, as we proceed to explain below.\nAmong the aforementioned methods, one training objective of the supervised LDA model is to max-\nimize the joint likelihood of the input and the output variables [4]. Another variant is to maximize\nthe sum of the log likelihood (or its variable bound) and a prediction margin [26, 27]. Moreover,\nthe DiscLDA optimizes part of the model parameters by maximizing the marginal likelihood of the\ninput variables, and optimizes the other part of the model parameters by maximizing the condi-\ntional likelihood. For this reason, DiscLDA is not a fully discriminative training of all the model\nparameters. In this paper, we propose a fully discriminative training of all the model parameters by\nmaximizing the posterior probability of the output given the input document. We will show that the\ndiscriminative training can be performed in a principled manner by naturally integrating the back-\npropagation with the MDA-based exact MAP inference. To our best knowledge, this paper is the\n\ufb01rst work to perform a fully end-to-end discriminative training of supervised topic models. Dis-\ncriminative training of generative model is widely used and usually outperforms standard generative\ntraining in prediction tasks [3, 7, 12, 14, 25]. As pointed out in [3], discriminative training increases\nthe robustness against the mismatch between the generative model and the real data. 
Experimental results on three real-world tasks also show the superior performance of discriminative training.\n\nIn addition to the aforementioned related studies on topic models [4, 15, 26, 27], there has been another stream of work that applies empirical risk minimization to graphical models such as Markov Random Field and nonnegative matrix factorization [10, 20]. Specifically, in [20], an approximate inference algorithm, belief propagation, is used to compute the belief of the output variables, which is further fed into a decoder to produce the prediction. The approximate inference and the decoder are treated as an entire black-box decision rule, which is tuned jointly via back propagation. Our work is different from the above studies in that we use an MAP inference based on optimization theory to motivate the discriminative training from a principled probabilistic framework.\n\n2 Smoothed Supervised LDA Model\n\nWe consider the smoothed supervised LDA model in Figure 1. Let $K$ be the number of topics, $N$ be the number of words in each document, $V$ be the vocabulary size, and $D$ be the number of documents in the corpus. The generative process of the model in Figure 1 can be described as:\n\n1. For each document $d$, choose the topic proportions according to a Dirichlet distribution: $\\theta_d \\sim p(\\theta_d|\\alpha) = \\mathrm{Dir}(\\alpha)$, where $\\alpha$ is a $K \\times 1$ vector consisting of nonnegative components.\n2. Draw each column $\\phi_k$ of a $V \\times K$ matrix $\\Phi$ independently from an exchangeable Dirichlet distribution: $\\phi_k \\sim \\mathrm{Dir}(\\beta)$ (i.e., $\\Phi \\sim p(\\Phi|\\beta)$), where $\\beta > 0$ is the smoothing parameter.\n3. To generate each word $w_{d,n}$:\n(a) Choose a topic $z_{d,n} \\sim p(z_{d,n}|\\theta_d) = \\mathrm{Multinomial}(\\theta_d)$ (see footnote 1).\n(b) Choose a word $w_{d,n} \\sim p(w_{d,n}|z_{d,n}, \\Phi) = \\mathrm{Multinomial}(\\phi_{z_{d,n}})$.\n4. Choose the $C \\times 1$ response vector: $y_d \\sim p(y_d|\\theta_d, U, \\gamma)$.\n(a) In regression, $p(y_d|\\theta_d, U, \\gamma) = N(U\\theta_d, \\gamma^{-1})$, where $U$ is a $C \\times K$ matrix consisting of regression coefficients.\n(b) In multi-class classification, $p(y_d|\\theta_d, U, \\gamma) = \\mathrm{Multinomial}(\\mathrm{Softmax}(\\gamma U\\theta_d))$, where the softmax function is defined as $\\mathrm{Softmax}(x)_c = e^{x_c} / \\sum_{c'=1}^{C} e^{x_{c'}}$, $c = 1, \\ldots, C$.\n\nFootnote 1: We will represent all the multinomial variables by a one-hot vector that has a single component equal to one at the position determined by the multinomial variable and all other components being zero.\n\nTherefore, the entire model can be described by the following joint probability\n\n$p(\\Phi|\\beta) \\prod_{d=1}^{D} \\underbrace{\\big[ p(y_d|\\theta_d, U, \\gamma) \\cdot p(\\theta_d|\\alpha) \\cdot p(w_{d,1:N}|z_{d,1:N}, \\Phi) \\cdot p(z_{d,1:N}|\\theta_d) \\big]}_{\\triangleq\\, p(y_d, \\theta_d, w_{d,1:N}, z_{d,1:N}|\\Phi, U, \\alpha, \\gamma)} \\qquad (1)$\n\nwhere $w_{d,1:N}$ and $z_{d,1:N}$ denote all the words and the associated topics, respectively, in the $d$-th document. Note that the model in Figure 1 is slightly different from the one proposed in [4]: the response variable $y_d$ in Figure 1 is coupled with $\\theta_d$ instead of $z_{d,1:N}$ as in [4]. Blei and McAuliffe also pointed out this choice as an alternative in [4]. This modification will lead to a differentiable end-to-end cost trainable by back propagation with superior prediction performance.\n\nTo develop a fully discriminative training method for the model parameters $\\Phi$ and $U$, we follow the argument in [3], which states that the discriminative training is also equivalent to maximizing the joint likelihood of a new model family with an additional set of parameters:\n\n$\\arg\\max_{\\Phi, U, \\tilde{\\Phi}} \\; p(\\Phi|\\beta)\\, p(\\tilde{\\Phi}|\\beta) \\prod_{d=1}^{D} p(y_d|w_{d,1:N}, \\Phi, U, \\alpha, \\gamma) \\prod_{d=1}^{D} p(w_{d,1:N}|\\tilde{\\Phi}, \\alpha) \\qquad (2)$\n\nwhere $p(w_{d,1:N}|\\tilde{\\Phi}, \\alpha)$ is obtained by marginalizing $p(y_d, \\theta_d, w_{d,1:N}, z_{d,1:N}|\\Phi, U, \\alpha, \\gamma)$ in (1) and replacing $\\Phi$ with $\\tilde{\\Phi}$. The above problem (2) decouples into\n\n$\\arg\\max_{\\Phi, U} \\Big[ \\ln p(\\Phi|\\beta) + \\sum_{d=1}^{D} \\ln p(y_d|w_{d,1:N}, \\Phi, U, \\alpha, \\gamma) \\Big] \\qquad (3)$\n\n$\\arg\\max_{\\tilde{\\Phi}} \\Big[ \\ln p(\\tilde{\\Phi}|\\beta) + \\sum_{d=1}^{D} \\ln p(w_{d,1:N}|\\tilde{\\Phi}, \\alpha) \\Big] \\qquad (4)$\n\nwhich are the discriminative learning problem of supervised LDA (Eq. (3)), and the unsupervised learning problem of LDA (Eq. (4)), respectively. We will show that both problems can be solved in a unified manner using a new MAP inference and back propagation.\n\n3 Maximum A Posteriori (MAP) Inference\n\nWe first consider the inference problem in the smoothed LDA model. For the supervised case, the main objective is to infer $y_d$ given the words $w_{d,1:N}$ in each document $d$, i.e., computing\n\n$p(y_d|w_{d,1:N}, \\Phi, U, \\alpha, \\gamma) = \\int_{\\theta_d} p(y_d|\\theta_d, U, \\gamma)\\, p(\\theta_d|w_{d,1:N}, \\Phi, \\alpha)\\, d\\theta_d \\qquad (5)$\n\nwhere the probability $p(y_d|\\theta_d, U, \\gamma)$ is known (e.g., multinomial or Gaussian for classification and regression problems; see Section 2). The main challenge is to evaluate $p(\\theta_d|w_{d,1:N}, \\Phi, \\alpha)$, i.e., infer the topic proportion given each document, which is also the important inference problem in the unsupervised LDA model. However, it is well known that the exact evaluation of the posterior probability $p(\\theta_d|w_{d,1:N}, \\Phi, \\alpha)$ is intractable [4, 5, 9, 15, 26, 27]. For this reason, various approximate inference methods, such as variational inference [4, 5, 15, 26] and Gibbs sampling [9, 27], have been proposed to compute the approximate posterior probability. In this paper, we take an alternative approach for inference; given each document $d$, we only seek a point (MAP) estimate of $\\theta_d$, instead of its full (approximate) posterior probability. The major motivation is that, although the full posterior probability of $\\theta_d$ is difficult, its MAP estimate, as a simplified problem, is more tractable (and it is a convex problem under certain conditions). 
Furthermore, with the MAP estimate of $\\theta_d$, we can infer the prediction variable $y_d$ according to the following approximation from (5):\n\n$p(y_d|w_{d,1:N}, \\Phi, U, \\alpha, \\gamma) = E_{\\theta_d|w_{d,1:N}}[p(y_d|\\theta_d, U, \\gamma)] \\approx p(y_d|\\hat{\\theta}_{d|w_{d,1:N}}, U, \\gamma) \\qquad (6)$\n\nwhere $E_{\\theta_d|w_{d,1:N}}$ denotes the conditional expectation with respect to $\\theta_d$ given $w_{d,1:N}$, and the expectation is sampled by the MAP estimate, $\\hat{\\theta}_{d|w_{d,1:N}}$, of $\\theta_d$ given $w_{d,1:N}$, defined as\n\n$\\hat{\\theta}_{d|w_{d,1:N}} = \\arg\\max_{\\theta_d} p(\\theta_d|w_{d,1:N}, \\Phi, \\alpha, \\beta) \\qquad (7)$\n\nThe approximation gets more precise when $p(\\theta_d|w_{d,1:N}, \\Phi, \\alpha, \\beta)$ becomes more concentrated around $\\hat{\\theta}_{d|w_{d,1:N}}$. Experimental results on several real datasets (Section 5) show that the approximation (6) provides excellent prediction performance.\n\nUsing Bayes' rule $p(\\theta_d|w_{d,1:N}, \\Phi, \\alpha) = p(\\theta_d|\\alpha)\\, p(w_{d,1:N}|\\theta_d, \\Phi) / p(w_{d,1:N}|\\Phi, \\alpha)$ and the fact that $p(w_{d,1:N}|\\Phi, \\alpha)$ is independent of $\\theta_d$, we obtain the equivalent form of (7) as\n\n$\\hat{\\theta}_{d|w_{d,1:N}} = \\arg\\max_{\\theta_d \\in P_K} \\big[ \\ln p(\\theta_d|\\alpha) + \\ln p(w_{d,1:N}|\\theta_d, \\Phi) \\big] \\qquad (8)$\n\nwhere $P_K = \\{\\theta \\in \\mathbb{R}^K : \\theta_j \\ge 0, \\sum_{j=1}^{K} \\theta_j = 1\\}$ denotes the $(K-1)$-dimensional probability simplex, $p(\\theta_d|\\alpha)$ is the Dirichlet distribution, and $p(w_{d,1:N}|\\theta_d, \\Phi)$ can be computed by integrating $p(w_{d,1:N}, z_{d,1:N}|\\theta_d, \\Phi) = \\prod_{n=1}^{N} p(w_{d,n}|z_{d,n}, \\Phi)\\, p(z_{d,n}|\\theta_d)$ over $z_{d,1:N}$, which leads to (derived in Section A of the supplementary material)\n\n$p(w_{d,1:N}|\\theta_d, \\Phi) = \\prod_{v=1}^{V} \\Big( \\sum_{j=1}^{K} \\theta_{d,j} \\Phi_{vj} \\Big)^{x_{d,v}} = p(x_d|\\theta_d, \\Phi) \\qquad (9)$\n\nwhere $x_{d,v}$ denotes the term frequency of the $v$-th word (in vocabulary) inside the $d$-th document, and $x_d$ denotes the $V$-dimensional bag-of-words (BoW) vector of the $d$-th document. Note that $p(w_{d,1:N}|\\theta_d, \\Phi)$ depends on $w_{d,1:N}$ only via the BoW vector $x_d$, which is the sufficient statistic. Therefore, we use $p(x_d|\\theta_d, \\Phi)$ and $p(w_{d,1:N}|\\theta_d, \\Phi)$ interchangeably from now on. Substituting the expression of the Dirichlet distribution and (9) into (8), we get\n\n$\\hat{\\theta}_{d|w_{d,1:N}} = \\arg\\max_{\\theta_d \\in P_K} \\big[ x_d^T \\ln(\\Phi\\theta_d) + (\\alpha - \\mathbf{1})^T \\ln \\theta_d \\big] = \\arg\\min_{\\theta_d \\in P_K} \\big[ -x_d^T \\ln(\\Phi\\theta_d) - (\\alpha - \\mathbf{1})^T \\ln \\theta_d \\big] \\qquad (10)$\n\nwhere we dropped the terms independent of $\\theta_d$, and $\\mathbf{1}$ denotes an all-one vector. Note that when $\\alpha \\ge 1$ ($\\alpha > 1$), the optimization problem (10) is (strictly) convex and is non-convex otherwise.\n\n3.1 Mirror Descent Algorithm for MAP Inference\n\nAn efficient approach to solving the constrained optimization problem (10) is the mirror descent algorithm (MDA) with Bregman divergence chosen to be generalized Kullback-Leibler divergence [2, 18, 21]. Specifically, let $f(\\theta_d)$ denote the cost function in (10), then the MDA updates the MAP estimate of $\\theta_d$ iteratively according to:\n\n$\\theta_{d,\\ell} = \\arg\\min_{\\theta_d \\in P_K} \\Big\\{ f(\\theta_{d,\\ell-1}) + [\\nabla_{\\theta_d} f(\\theta_{d,\\ell-1})]^T (\\theta_d - \\theta_{d,\\ell-1}) + \\frac{1}{T_{d,\\ell}} \\Psi(\\theta_d, \\theta_{d,\\ell-1}) \\Big\\} \\qquad (11)$\n\nwhere $\\theta_{d,\\ell}$ denotes the estimate of $\\theta_d$ at the $\\ell$-th iteration, $T_{d,\\ell}$ denotes the step-size of MDA, and $\\Psi(x, y)$ is the Bregman divergence chosen to be $\\Psi(x, y) = x^T \\ln(x/y) - \\mathbf{1}^T x + \\mathbf{1}^T y$. The argmin in (11) can be solved in closed-form (see Section B of the supplementary material) as\n\n$\\theta_{d,\\ell} = \\frac{1}{C_\\theta} \\cdot \\theta_{d,\\ell-1} \\odot \\exp\\Big( T_{d,\\ell} \\Big[ \\Phi^T \\frac{x_d}{\\Phi\\theta_{d,\\ell-1}} + \\frac{\\alpha - \\mathbf{1}}{\\theta_{d,\\ell-1}} \\Big] \\Big), \\quad \\ell = 1, \\ldots, L, \\quad \\theta_{d,0} = \\frac{1}{K}\\mathbf{1} \\qquad (12)$\n\nFigure 2: Layered deep architecture for computing $p(y_d|w_{d,1:N}, \\Phi, U, \\alpha, \\gamma)$, where ()/() denotes element-wise division, $\\odot$ denotes Hadamard product, and exp() denotes element-wise exponential.\n\nwhere $C_\\theta$ is a normalization factor such that $\\theta_{d,\\ell}$ adds up to one, $\\odot$ denotes Hadamard product, $L$ is the number of MDA iterations, and the divisions in (12) are element-wise operations. Note that the recursion (12) naturally enforces each $\\theta_{d,\\ell}$ to be on the probability simplex. The MDA step-size $T_{d,\\ell}$ can be either constant, i.e., $T_{d,\\ell} = T$, or adaptive over iterations and samples, determined by line search (see Section C of the supplementary material). The computation complexity in (12) is low since most computations are sparse matrix operations. For example, although by itself $\\Phi\\theta_{d,\\ell-1}$ in (12) is a dense matrix multiplication, we only need to evaluate the elements of $\\Phi\\theta_{d,\\ell-1}$ at the positions where the corresponding elements of $x_d$ are nonzero, because all other elements of $x_d/(\\Phi\\theta_{d,\\ell-1})$ are known to be zero. Overall, the computation complexity in each iteration of (12) is $O(n_{Tok} \\cdot K)$, where $n_{Tok}$ denotes the number of unique tokens in the document. In practice, we only use a small number of iterations, $L$, in (12) and use $\\theta_{d,L}$ to approximate $\\hat{\\theta}_{d|w_{d,1:N}}$ so that (6) becomes\n\n$p(y_d|w_{d,1:N}, \\Phi, U, \\alpha, \\gamma) \\approx p(y_d|\\theta_{d,L}, U, \\gamma) \\qquad (13)$\n\nIn summary, the inference of $\\theta_d$ and $y_d$ can be implemented by the layered architecture in Figure 2, where the top layer infers $y_d$ using (13) and the MDA layers infer $\\theta_d$ iteratively using (12). Figure 2 also implies that the MDA layers act as a feature extractor by generating the MAP estimate $\\theta_{d,L}$ for the output layer. 
Our end-to-end learning strategy developed in the next section jointly learns the model parameter $U$ at the output layer and the model parameter $\\Phi$ at the feature extractor layers to maximize the posterior of the prediction variable given the input document.\n\n4 Learning by Mirror-Descent Back Propagation\n\nWe now consider the supervised learning problem (3) and the unsupervised learning problem (4), respectively, using the developed MDA-based MAP inference. We first consider the supervised learning problem. With (13), the discriminative learning problem (3) can be approximated by\n\n$\\arg\\min_{\\Phi, U} \\Big[ -\\ln p(\\Phi|\\beta) - \\sum_{d=1}^{D} \\ln p(y_d|\\theta_{d,L}, U, \\gamma) \\Big] \\qquad (14)$\n\nwhich can be solved by stochastic mirror descent (SMD). Note that the cost function in (14) depends on $U$ explicitly through $p(y_d|\\theta_{d,L}, U, \\gamma)$, which can be computed directly from its definition in Section 2. On the other hand, the cost function in (14) depends on $\\Phi$ implicitly through $\\theta_{d,L}$. From Figure 2, we observe that $\\theta_{d,L}$ not only depends on $\\Phi$ explicitly (as indicated in the MDA block on the right-hand side of Figure 2) but also depends on $\\Phi$ implicitly via $\\theta_{d,L-1}$, which in turn depends on $\\Phi$ both explicitly and implicitly (through $\\theta_{d,L-2}$) and so on. That is, the dependency of the cost function on $\\Phi$ is in a layered manner. Therefore, we devise a back propagation procedure to efficiently compute its gradient with respect to $\\Phi$ according to the mirror-descent graph in Figure 2, which back propagates the error signal through the MDA blocks at different layers. The gradient formula and the implementation details of the learning algorithm can be found in Sections C\u2013D in the supplementary material.\n\nFor the unsupervised learning problem (4), the gradient of $\\ln p(\\tilde{\\Phi}|\\beta)$ with respect to $\\tilde{\\Phi}$ assumes the same form as that of $\\ln p(\\Phi|\\beta)$. Moreover, it can be shown that the gradient of $\\ln p(w_{d,1:N}|\\tilde{\\Phi}, \\alpha)$ with respect to $\\tilde{\\Phi}$ can be expressed as (see Section E of the supplementary material):\n\n$\\frac{\\partial \\ln p(w_{d,1:N}|\\tilde{\\Phi}, \\alpha)}{\\partial \\tilde{\\Phi}} = E_{\\theta_d|x_d}\\Big\\{ \\frac{\\partial}{\\partial \\tilde{\\Phi}} \\ln p(x_d|\\theta_d, \\tilde{\\Phi}) \\Big\\} \\overset{(a)}{\\approx} \\frac{\\partial}{\\partial \\tilde{\\Phi}} \\ln p(x_d|\\theta_{d,L}, \\tilde{\\Phi}) \\qquad (15)$\n\nwhere $p(x_d|\\theta_d, \\tilde{\\Phi})$ assumes the same form as (9) except $\\Phi$ is replaced by $\\tilde{\\Phi}$. The expectation is evaluated with respect to the posterior probability $p(\\theta_d|w_{d,1:N}, \\tilde{\\Phi}, \\alpha)$, and is sampled by the MAP estimate of $\\theta_d$ in step (a). $\\theta_{d,L}$ is an approximation of $\\hat{\\theta}_{d|w_{d,1:N}}$ computed via (12) and Figure 2.\n\n5 Experiments\n\n5.1 Description of Datasets and Baselines\n\nWe evaluated our proposed supervised learning (denoted as BP-sLDA) and unsupervised learning (denoted as BP-LDA) methods on three real-world datasets. The first dataset we use is a large-scale dataset built on Amazon movie reviews (AMR) [16]. The data set consists of 7.9 million movie reviews (1.48 billion words) from Amazon, written by 889,176 users, on a total of 253,059 movies. For text preprocessing we removed punctuation and lowercased capital letters. A vocabulary of size 5,000 is built by selecting the most frequent words. (In another setup, we keep the full vocabulary of 701K.) Same as [24], we shifted the review scores so that they have zero mean. The task is formulated as a regression problem, where we seek to predict the rating score using the text of the review. Second, we consider a multi-domain sentiment (MultiSent) classification task [6], which contains a total of 342,104 reviews on 25 types of products, such as apparel, electronics, kitchen and housewares. 
The task is formulated as a binary classification problem to predict the polarity (positive or negative) of each review. Likewise, we preprocessed the text by removing punctuation and lowercasing capital letters, and built a vocabulary of size 1,000 from the most frequent words. In addition, we also conducted a second binary text classification experiment on a large-scale proprietary dataset for business-centric applications (1.2M documents and vocabulary size of 128K).\n\nThe baseline algorithms we considered include Gibbs sampling (Gibbs-LDA) [17], logistic/linear regression on bag-of-words, supervised-LDA (sLDA) [4], and MedLDA [26], which are implemented either in C++ or Java. Our proposed algorithms are implemented in C# (see footnote 2). For BP-LDA and Gibbs-LDA, we first train the models in an unsupervised manner, and then generate per-document topic proportion $\\theta_d$ as their features in the inference steps, on top of which we train a linear (logistic) regression model on the regression (classification) tasks.\n\nFootnote 2: A third-party code is available online at https://github.com/jvking/bp-lda.\n\n5.2 Prediction Performance\n\nWe first evaluate the prediction performance of our models and compare them with the traditional (supervised) topic models. Since the training of the baseline topic models takes much longer than BP-sLDA and BP-LDA (see Figure 5), we compare their performance on two smaller datasets, namely a subset (79K documents) of AMR (randomly sampled from the 7.9 million reviews) and the MultiSent dataset (342K documents), which are all evaluated with 5-fold cross validation. For AMR regression, we use the predictive $R^2$ to measure the prediction performance, defined as: $pR^2 = 1 - (\\sum_d (y_d^o - y_d)^2) / (\\sum_d (y_d^o - \\bar{y}_d^o)^2)$, where $y_d^o$ denotes the label of the $d$-th document in the heldout (out-of-fold) set during the 5-fold cross validation, $\\bar{y}_d^o$ is the mean of all $y_d^o$ in the heldout set, and $y_d$ is the predicted value. The $pR^2$ scores of different models with varying number of topics are shown in Figure 3(a). Note that the BP-sLDA model outperforms the other baselines by a large margin. Moreover, the unsupervised BP-LDA model outperforms the unsupervised LDA model trained by Gibbs sampling (Gibbs-LDA). Second, on the MultiSent binary classification task, we use the area-under-the-curve (AUC) of the operating curve of probability of correct positive versus probability of false positive as our performance metric, which is shown in Figure 3(b). It also shows that BP-sLDA outperforms other methods and that BP-LDA outperforms the Gibbs-LDA model.\n\nFigure 3: Prediction performance on AMR regression task (measured in $pR^2$) and MultiSent classification task (measured in AUC). Higher score is better for both, with perfect value being one. Panels: (a) AMR regression task (79K); (b) MultiSent classification task; (c) MultiSent task (zoom in). Horizontal axis: number of topics. Vertical axis: $pR^2$ in (a), AUC (%) in (b) and (c). Curves: BP-sLDA, Linear/Logistic regression, MedLDA, sLDA, BP-LDA, Gibbs-LDA.\n\nTable 1: $pR^2$ (in percentage) on full AMR data (7.9M documents). The standard deviations in the parentheses are obtained from 5-fold cross validation.\n\nNumber of topics | 5 | 10 | 20 | 50 | 100 | 200\nLinear Regression (voc5K) | 38.4 (0.1) (single value)\nNeural Network (voc5K) | 59.0 (0.1) | 61.0 (0.1) | 62.3 (0.4) | 63.5 (0.7) | 63.1 (0.8) | 63.5 (0.4)\nBP-sLDA ($\\alpha = 1.001$, voc5K) | 61.4 (0.1) | 65.3 (0.3) | 69.1 (0.2) | 74.7 (0.3) | 74.3 (2.4) | 78.3 (1.1)\nBP-sLDA ($\\alpha = 0.5$, voc5K) | 54.7 (0.1) | 54.5 (1.2) | 57.0 (0.2) | 61.3 (0.3) | 67.1 (0.1) | 74.5 (0.2)\nBP-sLDA ($\\alpha = 0.1$, voc5K) | 53.3 (2.8) | 56.1 (0.1) | 58.4 (0.1) | 64.1 (0.1) | 70.6 (0.3) | 75.7 (0.2)\nLinear Regression (voc701K) | 41.5 (0.2) (single value)\nBP-sLDA ($\\alpha = 1.001$, voc701K) | 69.8 (0.2) | 74.3 (0.3) | 78.5 (0.2) | 83.6 (0.6) | 80.1 (0.9) | 84.7 (2.8)\n\nNext, we compare our BP-sLDA model with other strong discriminative models (such as neural networks) by conducting two large-scale experiments: (i) regression task on AMR full dataset (7.9M documents) and (ii) binary classification task on the proprietary business-centric dataset (1.2M documents). For the large-scale AMR regression, we can see that the $pR^2$ of BP-sLDA improves significantly compared to the best results on the 79K dataset shown in Figure 3(a), and that BP-sLDA also significantly outperforms the neural network models with the same number of model parameters. Moreover, the best deep neural network ($200 \\times 200$ in hidden layers) gives $pR^2$ of 76.2% (\u00b10.6%), which is worse than 78.3% of BP-sLDA. In addition, BP-sLDA also significantly outperforms Gibbs-sLDA [27], Spectral-sLDA [24], and the Hybrid method (Gibbs-sLDA initialized with Spectral-sLDA) [24], whose $pR^2$ scores (reported in [24]) are between 10% and 20% for 5 to 10 topics (and deteriorate when further increasing the topic number). The results therein are obtained under the same setting as this paper. To further demonstrate the superior performance of BP-sLDA on the large vocabulary scenario, we trained BP-sLDA on full vocabulary (701K) AMR and show the results in Table 1, which are even better than the 5K vocabulary case. 
Finally, for the binary text classification task on the proprietary dataset, the AUCs are given in Table 2, where BP-sLDA (200 topics) achieves 31% and 18% relative improvements over logistic regression and neural network, respectively. Moreover, on this task, BP-sLDA is also on par with the best DNN (a larger model consisting of $200 \\times 200$ hidden units with dropout), which achieves an AUC of 93.60.\n\n5.3 Analysis and Discussion\n\nWe now analyze the influence of different hyper parameters on the prediction performance. Note from Figure 3(a) that, when we increase the number of topics, the $pR^2$ score of BP-sLDA first improves and then slightly deteriorates after it goes beyond 20 topics. This is most likely caused by overfitting on the small dataset (79K documents), because the BP-sLDA models trained on the full 7.9M dataset produce much higher $pR^2$ scores (Table 1) than those on the 79K dataset and keep improving as the model size (number of topics) increases. To understand the influence of the mirror descent steps on the prediction performance, we plot in Figure 4(a) the $pR^2$ scores of BP-sLDA on the 7.9M AMR dataset for different values of mirror-descent steps $L$. When $L$ increases, for small models ($K = 5$ and $K = 20$), the $pR^2$ score remains the same, and, for a larger model ($K = 100$), the $pR^2$ score first improves and then remains the same. One explanation for this phenomenon is that larger $K$ implies that the inference problem (10) becomes an optimization problem of higher dimension, which requires more mirror descent iterations. Moreover, the mirror-descent back propagation, as an end-to-end training of the prediction output, would compensate for the imperfection caused by the limited number of inference steps, which makes the performance insensitive to $L$ once it is large enough. 
In Figure 4(b), we plot the percentage of the dominant topics (which add up to 90% probability) on AMR, which shows that BP-sLDA learns sparse topic distribution even when $\\alpha = 1.001$ and obtains sparser topic distribution with smaller $\\alpha$ (i.e., 0.5 and 0.1). In Figure 4(c), we evaluate the per-word log-likelihoods of the unsupervised models on AMR dataset using the method in [23]. The per-word log-likelihood of BP-LDA with $\\alpha = 1.001$ is worse than the case of $\\alpha = 0.5$ and $\\alpha = 0.1$ for Gibbs-LDA, although its prediction performance is better. This suggests the importance of the Dirichlet prior in text modeling [1, 22] and a potential tradeoff between the text modeling performance and the prediction performance.\n\nTable 2: AUC (in percentage) on the business-centric proprietary data (1.2M documents, 128K vocabulary). The standard deviations in the parentheses are obtained from five random initializations.\n\nNumber of topics | 5 | 10 | 20 | 50 | 100 | 200\nLogistic Regression | 90.56 (0.00) (single value)\nNeural Network | 90.95 (0.07) | 91.25 (0.05) | 91.32 (0.23) | 91.54 (0.11) | 91.90 (0.05) | 91.98 (0.05)\nBP-sLDA | 92.02 (0.02) | 92.21 (0.03) | 92.35 (0.07) | 92.58 (0.03) | 92.82 (0.07) | 93.50 (0.06)\n\nFigure 4: Analysis of the behaviors of BP-sLDA and BP-LDA models. Panels: (a) Influence of MDA iterations $L$ ($pR^2$ versus number of mirror descent iterations (layers), for 5, 20, and 100 topics); (b) Sparsity of the topic distribution (percentage of dominant topics versus number of topics); (c) Per-word log-likelihoods (negative per-word log-likelihood versus number of topics).\n\n5.4 Efficiency in Computation Time\n\nTo compare the efficiency of the algorithms, we show the training time of different models on the AMR dataset (79K and 7.9M) in Figure 5, which shows that our algorithm scales well with respect to increasing model size (number of topics) and increasing number of data samples.\n\n[Figure 5 plot: training time in hours (log scale) versus number of topics, for sLDA (79K), BP-sLDA (79K), MedLDA (79K), and BP-sLDA (7.9M).]\n\n6 Conclusion\n\nWe have developed novel learning approaches for supervised LDA models, using MAP inference and mirror-descent back propagation, which leads to an end-to-end discriminative training. We evaluate the prediction performance of the model on three real-world regression and classification tasks. The results show that the discriminative training significantly improves the performance of the supervised LDA model relative to previous learning methods. Future works include (i) exploring faster algorithms for the MAP inference (e.g., accelerated mirror descent), (ii) developing semi-supervised learning of LDA using the framework from [3], and (iii) learning $\\alpha$ from data. Finally, also note that the layered architecture in Figure 2 could be viewed as a deep feedforward neural network [11] with structures designed from the topic model in Figure 1. 
This opens up a new direction of combining the strengths of both generative models and neural\nnetworks to develop new deep learning models that are scalable, interpretable, and have high\nprediction performance for text understanding and information retrieval [13].\n\nFigure 5: Training time on the AMR dataset. (Tested on Intel Xeon E5-2680 2.80GHz.)\n\nReferences\n[1] A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On smoothing and inference for topic models. In Proc. UAI, pages 27\u201334, 2009.\n[2] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167\u2013175, 2003.\n[3] C. M. Bishop and J. Lasserre. Generative or discriminative? Getting the best of both worlds. Bayesian Statistics, 8:3\u201324, 2007.\n[4] D. M. Blei and J. D. McAuliffe. Supervised topic models. In Proc. NIPS, pages 121\u2013128, 2007.\n[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3:993\u20131022, 2003.\n[6] J. Blitzer, M. Dredze, and F. Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proc. ACL, volume 7, pages 440\u2013447, 2007.\n[7] G. Bouchard and B. Triggs. The tradeoff between generative and discriminative classifiers. In Proc. COMPSTAT, pages 721\u2013728, 2004.\n[8] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121\u20132159, 2011.\n[9] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc. of the National Academy of Sciences, pages 5228\u20135235, 2004.\n[10] J. R. Hershey, J. L. Roux, and F. Weninger. Deep unfolding: Model-based inspiration of novel deep architectures. arXiv:1409.2574, 2014.\n[11] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. 
Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82\u201397, 2012.\n[12] A. Holub and P. Perona. A discriminative framework for modelling object classes. In Proc. IEEE CVPR, volume 1, pages 664\u2013671, 2005.\n[13] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In Proc. CIKM, pages 2333\u20132338, 2013.\n[14] S. Kapadia. Discriminative Training of Hidden Markov Models. PhD thesis, University of Cambridge, 1998.\n[15] S. Lacoste-Julien, F. Sha, and M. I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Proc. NIPS, pages 897\u2013904, 2008.\n[16] J. J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In Proc. WWW, pages 897\u2013908, 2013.\n[17] A. K. McCallum. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu, 2002.\n[18] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.\n[19] D. Sontag and D. Roy. Complexity of inference in latent Dirichlet allocation. In Proc. NIPS, pages 1008\u20131016, 2011.\n[20] V. Stoyanov, A. Ropson, and J. Eisner. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In Proc. AISTATS, pages 725\u2013733, 2011.\n[21] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. SIAM Journal on Optimization, 2008.\n[22] H. M. Wallach, D. M. Mimno, and A. McCallum. Rethinking LDA: Why priors matter. In Proc. NIPS, pages 1973\u20131981, 2009.\n[23] H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In Proc. 
ICML, pages 1105\u20131112, 2009.\n[24] Y. Wang and J. Zhu. Spectral methods for supervised topic models. In Proc. NIPS, pages 1511\u20131519, 2014.\n[25] O. Yakhnenko, A. Silvescu, and V. Honavar. Discriminatively trained Markov model for sequence classification. In Proc. IEEE ICDM, 2005.\n[26] J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: maximum margin supervised topic models. JMLR, 13(1):2237\u20132278, 2012.\n[27] J. Zhu, N. Chen, H. Perkins, and B. Zhang. Gibbs max-margin topic models with data augmentation. JMLR, 15(1):1073\u20131110, 2014.", "award": [], "sourceid": 1056, "authors": [{"given_name": "Jianshu", "family_name": "Chen", "institution": "Microsoft Research, Redmond, W"}, {"given_name": "Ji", "family_name": "He", "institution": "University Washington"}, {"given_name": "Yelong", "family_name": "Shen", "institution": "Microsoft Research, Redmond, WA"}, {"given_name": "Lin", "family_name": "Xiao", "institution": "Microsoft"}, {"given_name": "Xiaodong", "family_name": "He", "institution": "Microsoft Research, Redmond, WA"}, {"given_name": "Jianfeng", "family_name": "Gao", "institution": "Microsoft Research, Redmond, WA"}, {"given_name": "Xinying", "family_name": "Song", "institution": "Microsoft Research, Redmond, WA"}, {"given_name": "Li", "family_name": "Deng", "institution": "MSR"}]}