NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Reviewer 1
ORIGINALITY The idea of having a discriminative version of LDA, analogous to logistic regression, is interesting. This idea is carried out quite well with logistic LDA, its inference algorithm, and classification results on various datasets.

QUALITY One concern I have is the comparison with supervised LDA models, such as sLDA, discLDA, or LLDA. I realize these are mentioned at the beginning of the paper, and the authors may have felt they are less relevant because they are not discriminative models, but I feel that readers would naturally wonder about this, and the authors should compare against them, not necessarily empirically (though that would be helpful). Another question I had was about the coherence of the topics. The paper (and the supplementary PDF) shows example topics, both for images and text, but the more accepted evaluation is to actually compute topic coherence, which perhaps cannot be done for images but certainly can for documents. The paper says "We find that logistic LDA is able to learn coherent topics in an unsupervised manner.", but I feel this is not supported with sufficient evidence.

CLARITY This paper is very well written, and it is mostly very clear. However, I had trouble understanding the following few things: LDA discovers, for each document, a distribution over topics. Does logistic LDA also assign multiple topics (and a distribution over them) to each item? If not, I think the paper should make that clear and discuss this limitation. Perhaps related to this point, for the evaluation on tweets, the paper says that when an author belongs to multiple communities, one is chosen at random. What would this mean in terms of the author classification results? Lastly, in the tweet classification, I did not fully understand what is meant by the sentence "For LDA, we extended the open source implementation of Theis and Hoffman [36] to depend on the label in the same manner as logistic LDA."
I am pretty familiar with the topic modeling literature, and I think this would need more explanation.

Miscellaneous question/comment -- The authors mention that one of their contributions is a new dataset of annotated tweets. As far as I know, Twitter does not allow researchers to distribute the tweets they collect. Please make sure to describe exactly how these data will be distributed.

** Post-author response ** I appreciate the authors responding to the questions and concerns in my review. I am happy with the response and have raised my score accordingly.
Reviewer 2
I have mixed feelings about this paper. On the bright side, I like the idea of relaxing the (sometimes strict) assumptions underlying topic models such as LDA. The differentiable g functions act as a comparator between the topic distribution over words p(w|z) and the vector representation of w. It reminds me of some recent work that combines topic models and word embeddings (see [1]). It is an interesting way to embed specific preprocessing, such as convolutional layers for dealing with images.

On the other side, I'm not convinced that this model is just "another view" of the classic LDA. Let's take an example: the authors use a Dirichlet prior to "mimic" the behavior of LDA when generating \pi_d (what I usually call \theta_d), but it's not really motivated here. This prior is usually chosen for computational purposes, exploiting the conjugacy between distributions. Why follow the same path here? Generally speaking, I have difficulty fully understanding the analogy with LDA. It *looks* similar, but I still think we lose the beauty and foundations of probabilistic models. The paper is probably too short, with (too) many appendices, and it is hard to follow the multiple derivations (e.g., parameter inference).

The authors chose to add an extension that lets their model deal with observed classes. However, there is a huge literature on integrating supervision into LDA. sLDA isn't the only model (see for instance labeled LDA [2]). Besides, the assumption is that there is a one-to-one relation between topics and classes. I'm not fully convinced by the experiments that it is a fruitful assumption, which is annoying given the title chosen by the authors. Therefore I suggest removing this part of the contribution (and finding another title), or submitting to a journal to have room for a clear presentation of the work.

Finally, I see pros and cons for the experimental section.
It's definitely a good idea to vary the kind of datasets, but it also invites objections here and there. For instance:
- Several topic models have been proposed to deal with short texts, in particular those posted on Twitter. See for instance the Biterm topic model [3].
- The authors use two different evaluation measures for classification (see Tables 1 and 2). Why?
- I'm highly surprised that we can write that learning a model for 20NewsGroups (~20,000 instances) in 11 hours is fast! I'm highly confident that we can train classification models faster with competitive results.
- I'm hardly convinced by the top words given in Table 3 (and in the appendices). Recent papers use topic coherence measures, such as the ones based on NPMI (see [4]).

I spotted some typos, such as:
- "Table 11" (line 239)
- "embeddigns" (line 269)

=== UPDATE: I've carefully read the other reviews and the authors' response. It wasn't enough to fully convince me on a couple of points (e.g., time complexity, topic coherence that "will be included"). However, I've changed my overall score from 6 (weak accept) to 7 (accept). ===

References
[1] Das, R., Zaheer, M., & Dyer, C. (2015, July). Gaussian LDA for topic models with word embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 795-804).
[2] Ramage, D., Hall, D., Nallapati, R., & Manning, C. D. (2009, August). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 (pp. 248-256). Association for Computational Linguistics.
[3] Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013, May). A biterm topic model for short texts. In Proceedings of the 22nd International Conference on World Wide Web (pp. 1445-1456). ACM.
[4] Röder, M., Both, A., & Hinneburg, A. (2015, February). Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (pp. 399-408). ACM.
Reviewer 3
The paper is expertly written and introduces an interesting discriminative variant of LDA. I think this work will be a nice addition to the crowded literature on topic modeling. Here are some of my additional thoughts:
- One of the many advantages of LDA is that it is a building block for many topic modeling extensions, as described in Section 2.2 “A zoo of topic models” in the paper. I wonder how easy or difficult it is to modify logistic LDA, as a discriminative variant, to achieve these extensions.
- It would also be helpful if the paper discussed the scalability of the two algorithms proposed to train logistic LDA.