Export Reviews, Discussions, Author Feedback and Meta-Reviews

Paper ID:	1056
Title:	End-to-end Learning of LDA by Mirror-Descent Back Propagation over a Deep Architecture

Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

=== Update after rebuttal ===

I thank the authors for a comprehensive rebuttal and extra experiments. It has addressed most of my concerns, and I have updated my score. The authors should make sure to properly tone down the claims about improved training for LDA (vs. sLDA), as well as mention the perplexity experiment. It seems to me that we do not really understand very well what is happening in these models at this stage; this perplexity experiment is just scratching the surface (and should be presented as such). I am also a bit puzzled by the use of alpha = 1.001 (vs. exactly 1, for example). As a prior, there is almost no difference between the two; but I can see that it changes the MAP a lot. In particular, using alpha = 1 (uniform distribution), the MAP is just maximum likelihood and could have zero entries; whereas any alpha > 1 creates a barrier in the objective function and prevents any theta_k from being too close to zero.
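The barrier effect can be read off the standard MAP objective for theta under a symmetric Dirichlet prior with the topics held fixed. The block below is a sketch under assumed notation: the indexing Phi_{w_n,k} (probability of word w_n under topic k) and the document length N are introduced here for illustration and are not taken from the paper.

```latex
% MAP estimate of the topic proportions of one document, topics \Phi fixed
\hat{\theta} = \arg\max_{\theta \in \Delta_K}
  \sum_{n=1}^{N} \log\Big( \sum_{k=1}^{K} \theta_k \, \Phi_{w_n,k} \Big)
  + (\alpha - 1) \sum_{k=1}^{K} \log \theta_k

% \alpha = 1: the second term vanishes, leaving maximum likelihood
%             (zero entries in \hat{\theta} are possible).
% \alpha > 1: (\alpha - 1)\log\theta_k \to -\infty as \theta_k \to 0,
%             i.e. a log-barrier keeping every \theta_k strictly positive.
```

With alpha = 1.001 the barrier has weight 0.001, so it only becomes noticeable once some theta_k is already very small.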

This being said, I think this paper makes interesting contributions, and so even though more careful experiments are needed to understand better what is happening, this can be left as future work and people can already build on the ideas introduced here for topic modeling.

== additional comments

[# of dominant topics experiment] The "# of dominant topics" experiment is interesting and should be included in the final version. As a note, a prior with alpha = 1.001 and K=100 gives on average about 60 dominant topics (defined as the number of topics to include to get 90% of the mass when sorted in decreasing order). This becomes 45 for alpha = 0.5 and 15 for alpha = 0.1 (which shows that these alpha are still too big to be used when K = 100). This experiment indicates that, given the topics learned, the likelihood is still enforcing somewhat sparse topic proportions (I think that for an alpha so close to 1, the MAP should not be too different from the maximum likelihood solution). Though at K=100, one can already see a big difference in sparsity between BP-LDA alpha = 1.001 and Gibbs-LDA alpha=0.5.
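The prior-only numbers above can be checked with a small simulation. The following is a minimal sketch, assuming a symmetric Dirichlet prior and the 90%-mass definition of "dominant topics" given above; the function names, sample count and seed are illustrative choices, not something taken from the paper or the rebuttal.

```python
import random


def sample_dirichlet(alpha, K, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over K topics
    by normalising independent Gamma(alpha, 1) variates."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    total = sum(g)
    return [x / total for x in g]


def num_dominant_topics(theta, mass=0.9):
    """Smallest number of topics, taken in decreasing order of weight,
    whose weights sum to at least `mass`."""
    running = 0.0
    for count, weight in enumerate(sorted(theta, reverse=True), start=1):
        running += weight
        if running >= mass:
            return count
    return len(theta)


def mean_dominant_topics(alpha, K=100, n_samples=500, seed=0):
    """Average number of dominant topics over draws from the prior alone."""
    rng = random.Random(seed)
    counts = [
        num_dominant_topics(sample_dirichlet(alpha, K, rng))
        for _ in range(n_samples)
    ]
    return sum(counts) / len(counts)


if __name__ == "__main__":
    for alpha in (1.001, 0.5, 0.1):
        print(alpha, mean_dominant_topics(alpha))
```

This measures sparsity under the prior only; the number of dominant topics after inference also depends on the likelihood term.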

[perplexity experiment] The authors should also report the perplexity for BP-LDA with alpha=0.5 and alpha=0.1 to control for the difference in alpha. They should also try more topics. The reason that the perplexity is becoming worse with more topics is that alpha should change according to the number of topics. In my experience, properly estimating alpha yields a perplexity more robust to the number of topics (and can even get a better fit with more topics). To give a ballpark figure, the alpha's learned by maximum likelihood using variational inference on standard text document datasets are much smaller than these values. For example, on the NIPS dataset from http://ai.stanford.edu/~gal/data.html, they are of the order of 1e-3 for a K=200 topic model. Finally, as I said in my previous review, it would be interesting in future work to propagate the backprop to updating the alpha's as well...

=== end of update ===

This paper makes an interesting contribution to the supervised topic modeling literature. The idea of using backpropagation by unrolling a few steps of an iterative procedure (which computes a fixed point, maximizes some objective, or does some message passing) has been used before in several areas (as in the cited [16]), but as far as I know, it has not been used for supervised topic modeling, so this is a fresh outlook, and I like it. It is also interesting that they are able to get significantly better results than linear regression (I hope this is ridge regression!), as supervised topic models usually struggle to improve over standard discriminative approaches. Their results in Figure 2b), where they basically do the same as logistic regression (within the variation), are more typical (though the other supervised topic model approaches do not do well on this one for some reason).

* Missing experiment:

The reason that I am not giving a higher rating is that I think an important experiment is missing, given that the authors claim that their approximate MAP method to learn the parameters of the unsupervised LDA model is "outperforming previous learning methods". The classification / regression experiment is not meaningful for comparing unsupervised LDA learning techniques (it is great for sLDA, and the results are impressive; but unsupervised LDA is meant for an *unsupervised* evaluation!). The authors should thus also report the average test set per-word log-likelihood for LDA with the topic parameters learned by BP-LDA vs. Gibbs-LDA on their dataset. Note that they are definitely not allowed to use the PLSA-like technique of maximizing the posterior over theta on the test data to evaluate its likelihood (this is cheating!) -- they should just use the usual marginalization over theta for LDA. I suggest they use the code from http://homepages.inf.ed.ac.uk/imurray2/pub/09etm/ "Evaluation Methods for Topic Models", Wallach et al. ICML 2009, with some of their best methods. The authors should report these results in their rebuttal -- I am quite curious to see whether the superior prediction performance came at the cost of worse text modeling accuracy. If the authors can report on these results, I am willing to increase my rating (irrespective of whether their perplexity results are good or not).
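The difference between the two evaluations can be written out explicitly. The notation below is assumed for illustration (test document w_d with N_d tokens, topic matrix Phi, symmetric Dirichlet parameter alpha) and is not copied from the paper.

```latex
% Proper LDA evaluation: marginalise over the topic proportions \theta
p(w_d \mid \Phi, \alpha)
  = \int \mathrm{Dir}(\theta \mid \alpha)
    \prod_{n=1}^{N_d} \Big( \sum_{k=1}^{K} \theta_k \, \Phi_{w_{d,n},k} \Big) \, d\theta

% Reported quantities over the test set
\text{per-word log-likelihood}
  = \frac{\sum_d \log p(w_d \mid \Phi, \alpha)}{\sum_d N_d},
\qquad
\text{perplexity} = \exp\big( -\,\text{per-word log-likelihood} \big)

% PLSA-like plug-in: fit \theta to the test document itself
\max_{\theta \in \Delta_K} \prod_{n=1}^{N_d} \Big( \sum_{k} \theta_k \, \Phi_{w_{d,n},k} \Big)
  \;\ge\; p(w_d \mid \Phi, \alpha)
```

The inequality holds because the marginal is an average of the plug-in likelihood under a probability density over theta, so the plug-in score is systematically optimistic. The integral has no closed form, which is why estimators such as those in Wallach et al. are needed.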

* Novelty claim correction:

Using convex optimization to do MAP over theta for LDA is not new: this was already proposed in "Complexity of Inference in Latent Dirichlet Allocation", Sontag & Roy, NIPS 2011 (see Section 3.1). They had also proposed using the exponentiated gradient algorithm to do the convex optimization, which is equivalent to mirror descent with KL divergence, i.e. it yields exactly the same update as (12). The authors should properly correct this novelty claim and mention this prior work.
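For reference, the generic exponentiated gradient update (mirror descent on the simplex with the KL divergence as Bregman divergence) is a multiplicative step followed by renormalization. The code below is a minimal sketch of that generic update applied to the MAP objective for theta with fixed topics; the names `phi`, `counts`, `step` and `n_steps`, the toy numbers, and the constant step size are assumptions made for illustration, and the sketch is not claimed to reproduce the paper's equation (12) in every detail.

```python
import math


def neg_log_posterior(theta, counts, phi, alpha):
    """Negative log posterior of theta for one document, up to a constant.
    counts[v] is the count of word v, phi[v][k] = p(word v | topic k)."""
    value = 0.0
    for v, x in enumerate(counts):
        if x > 0:
            mix = sum(phi[v][k] * theta[k] for k in range(len(theta)))
            value -= x * math.log(mix)
    value -= (alpha - 1.0) * sum(math.log(t) for t in theta)
    return value


def eg_map_theta(counts, phi, alpha, n_steps=100, step=0.1):
    """Run n_steps of exponentiated gradient from the uniform distribution."""
    K = len(phi[0])
    theta = [1.0 / K] * K
    for _ in range(n_steps):
        # Gradient of the negative log posterior with respect to theta.
        grad = [-(alpha - 1.0) / theta[k] for k in range(K)]
        for v, x in enumerate(counts):
            if x > 0:
                mix = sum(phi[v][k] * theta[k] for k in range(K))
                for k in range(K):
                    grad[k] -= x * phi[v][k] / mix
        # Multiplicative update; subtracting min(grad) only rescales all
        # entries by a common factor, which the normalisation removes.
        shift = min(grad)
        unnorm = [theta[k] * math.exp(-step * (grad[k] - shift)) for k in range(K)]
        z = sum(unnorm)
        theta = [u / z for u in unnorm]
    return theta


# Toy problem: 3 words, 2 topics; the document is dominated by word 0.
PHI = [[0.7, 0.1], [0.2, 0.2], [0.1, 0.7]]
COUNTS = [5, 1, 0]

if __name__ == "__main__":
    print(eg_map_theta(COUNTS, PHI, alpha=1.001))
```

Because the update is multiplicative, a finite number of steps from a strictly positive starting point keeps every theta_k strictly positive.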

* Using alpha > 1 is not a good text model:

I also do not think that using alpha > 1 is a good idea for proper text modeling. It might be fine for getting good classification / regression performance (it gives denser features to the regressor); but if one also evaluated the generative likelihood for the document, it would not be as good. All the work on topic modeling that I know of has found that the individual hyperparameters of the Dirichlet prior over theta should be < 1 to give better perplexity (especially if a large number of topics is used). This was also mentioned in Section 3.2 of the [Sontag & Roy 2011] paper; see also Section 3 of "On Smoothing and Inference for Topic Models", Asuncion et al. UAI 2009. I am willing to bet that if the authors do the perplexity experiment, they could get better results by using alpha < 1 (at least for the Gibbs sampler).

Also, note that it has been reported before that estimating the hyperparameters of the Dirichlet prior over theta (with asymmetric components) makes a big difference for text modeling; see "Rethinking LDA: Why Priors Matter", Wallach et al., NIPS 2009. This was for unsupervised LDA. I am not sure whether it could also make a difference for sLDA -- but given that the authors now have an efficient backpropagation framework to learn all the parameters in a discriminative fashion, it might be interesting to also backpropagate the gradient to estimate the alpha_k hyperparameters.

Using alpha > 1 means that a sampled theta is not sparse: this means that a specific document contains all topics with some probability, which certainly seems like a bad generative modeling assumption! I understand that the authors made this assumption because they wanted to claim that their MAP inference over theta was convex. On the other hand, given that they only do a small number L of iterations (and so they certainly do not optimize to convergence), and that the objective is not convex anyway in U and Phi, I do not think that having the inner MAP inference non-convex would be such a big problem. Using a finite number of iterations L is a way to lower bound the smallest value that theta could take in any case (as was suggested in Section 3.2 of [Sontag & Roy 2011]), and so the problem of negative infinities would operationally be avoided. They also seem to report better results for smaller alpha, so I would be curious to see Figure 4 also report results for alpha < 1.

=== other comments ===

- Correction line 076: "this paper is the first work to perform a fully end-to-end discriminative training of LDA" -- it should be "of sLDA". They certainly do not do discriminative training of LDA: for LDA, they do an approximate MAP for Phi by removing the marginalization over theta that would normally be required by the model and replacing it with an approximate MAP over theta. This is still generative training, and so should not be called discriminative. In some sense, by doing MAP for theta, they are doing a kind of regularized PLSA model (probabilistic latent semantic analysis), which was the non-Bayesian precursor to LDA. In PLSA, there is a fixed theta per document, found by maximum likelihood. Here, they do regularized maximum likelihood for theta with the Dirichlet prior acting as a regularizer. It is presented as an approximation to LDA; but really, I would see it more as just regularized (potentially supervised) PLSA.

- Lines 120-122: the authors should clarify that Blei and McAuliffe had argued that using bar{z} instead of theta to tie the words to the variable to predict y was a better idea given their previous experience with two-signals modeling. I understand that the authors' framework could not handle bar{z} as the input variable for the distribution on y, as it is discrete, justifying their modification, but it is worthwhile to clarify this point. In their experiments, when they talk about "sLDA", do they mean the approach from [3] which used bar{z} as the input variable and also variational inference?

- Figure 3 a): are the error bars coming from different folds? What about the variation arising from just different random initializations for this non-convex problem? (Same thing for the sampling approach.) Also, what about using an l2-regularizer on U to avoid overfitting for their method?

- Lines 396-399: do they use the same number of steps L when computing the test features, or do they run full MAP inference for them?