Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Originality
* This paper's main contribution, a recall-precision balanced topic model, is quite original: as far as I know, no other topic model tries to balance recall and precision, even though these are widely used and sensible metrics.
* The model itself, derived from the KL divergence between the predicted and empirical distributions, and its relationship with precision and recall are simple and elegant.
* This paper made me think about sparse topic models, and I am glad they are mentioned in the paper. However, I do not think the authors go far enough: the fact that sparse topic models have been evaluated only from the perspective of maximizing recall does not automatically mean that they would do poorly on the precision dimension. I would have liked to see an empirical comparison with a sparse topic model, especially given that more advanced sparse models exist, such as Zhang et al. (WWW 2013).
Quality
* The experiments are done well, comparing the three models using a variety of metrics, including recall/precision (KL-based and conventional), topic coherence, adjusted Rand index on classification, and topic entropy. Some of the non-conventional metrics are explained well.
* I do have one question about the classification results on the datasets that have class labels: why do you not report precision/recall/F1 scores for classification?
Clarity
* This paper is quite clear for the most part, though I do not fully understand the short section about the crime rate. Since this application is quite different from text modeling, a friendlier description, rather than a pointer to a reference paper, would be helpful (e.g., what PEI and PAI mean).
Significance
* I do have one concern about this paper. As important as topic modeling is as a subfield within ML, I keep wondering how significant this paper will be. Will it be cited and used for further research in latent semantics of text?
Given that many text modeling tasks are now done with neural network based models (often in combination with latent variables), it would be helpful for the authors to explain and emphasize the significance of this research.
** Post-author response ** I appreciate the authors addressing the questions and concerns in my review.
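The connection this review draws between KL divergence and recall/precision can be illustrated with a toy sketch. The pairing of the two KL directions with recall-style and precision-style penalties follows the review's description of the paper's framing; the function and variable names here are mine, not the paper's:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions over the same vocabulary."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy 4-word vocabulary: empirical word distribution of one document
# vs. a model's predicted distribution for that document.
empirical = [0.5, 0.5, 0.0, 0.0]
predicted = [0.4, 0.3, 0.2, 0.1]

# One reading of the recall/precision balance: KL(empirical || predicted)
# grows when the model misses words the document actually uses (a
# recall-style penalty), while KL(predicted || empirical) grows when the
# model spends mass on words the document never uses (a precision-style
# penalty).
recall_penalty = kl(empirical, predicted)
precision_penalty = kl(predicted, empirical)
```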
The authors motivate the problem quite effectively and go into detail on the recall bias (that is, the propensity of inference methods to penalize topic distributions significantly for missing document terms) of standard topic models. This is a useful observation that can motivate researchers to address this inherent bias. The authors propose a new topic model to balance precision and recall. This is done by introducing a Bernoulli distribution that controls whether a word token is generated through the topics or through a document-specific word distribution. The authors note that this is similar to the SW model by Chemudugunta et al. The differences are that the SW model uses a weak symmetric prior for the Bernoulli distribution (\lambda) and a biased estimate for the document-specific word distribution. Experimental results measuring precision, recall, coherence, etc. demonstrate that the proposed model is significantly better on all metrics except recall (as one would expect). This is a significant result. I have some questions that I would like the authors to respond to/address:
(i) In my opinion, the differences between the proposed model and the SW model are not significant. For example, it is straightforward to convert a weak symmetric prior to a strong asymmetric beta prior in the current setting. Perhaps the novelty lies in the way the document-specific word distributions are generated and in the theoretical connection of the proposed model to the KL divergence.
(ii) Based on the reasoning in lines 268-273, one other disadvantage of the SW model appears to be its inability to explain generic words in the corpus. However, the same paper also introduces the SWB model to address this issue. It would be useful to compare your model to the SWB version.
(iii) It would be useful to see whether the results in Table 2 are sensitive to \lambda. My understanding is that all of them use \lambda = 0.1.
The paper is quite well-written and the theoretical motivation for proposing the model is compelling.
Thanks to the authors for the helpful response and the additional experiments.
Original Review: This paper argues that standard LDA is biased in favor of recall, and it presents a method that can remove the bias. In experiments, the new method performs well on a precision metric and on topic coherence.
This paper seems to be making an interesting insight. However, I had a hard time understanding the arguments, and I think the paper's analysis and experiments do not sufficiently evaluate how much the recall focus of LDA depends on specific choices of hyperparameters or symmetric priors. The derivation around Equation 1 holds for any model trained to maximize likelihood, so when the paper declares there that Equation 1 is "sensitive to misses," it is hard to understand why. The explanation comes only later, only in terms of a uniform model, and the argument is made informally. It would help if the paper stated its results more formally, as a conjecture or theorem with proof, so that I can better understand what is being claimed and to what extent the result relies on assumptions such as uniformity. The paper would also be stronger if its mathematical illustrations used more realistic settings (e.g., Zipfian distributions for word frequency).
The algorithm ends up being a fairly simple variant of LDA that mixes a document-specific distribution (estimated from the ground-truth document counts) with the standard topic model. As the paper says, this means the model is not a complete generative model of documents, because its likelihood requires knowing the counts of words in each document (b_m) in advance.
In the experiments, the paper does not consider altering the LDA hyperparameters.
This is a concern because the fact that precision and recall are monotonic in topic entropy (Fig. 1) suggests that simply tuning the LDA hyperparameters toward lower-entropy topics might boost precision at the cost of recall. In particular, allowing an asymmetric gamma that reflects corpus-wide word frequency would, if I am not mistaken, start to look somewhat similar to this paper's model, and I think the paper needs to experiment with that approach.
Minor: The empirical results state that in their experiment "LDA topics capture frequently occurring words but the topics are not as meaningful and do not correspond to any evident themes". We know that LDA very often does infer meaningful topics, so investigating and explaining why it failed here would be helpful.
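The topic-entropy quantity this review keeps returning to can be made concrete with a small sketch. This uses plain Shannon entropy averaged over topics, which may differ from the paper's exact definition; the toy topic matrices are mine:

```python
import numpy as np

def topic_entropy(phi, eps=1e-12):
    """Mean Shannon entropy (in nats) of the rows of a topic-word matrix.
    Low entropy = peaked, 'precise' topics; high entropy = diffuse topics."""
    p = np.asarray(phi, float) + eps
    return float(np.mean(-(p * np.log(p)).sum(axis=1)))

# Two toy topic sets over a 4-word vocabulary: one peaked, one diffuse.
peaked  = [[0.97, 0.01, 0.01, 0.01],
           [0.01, 0.97, 0.01, 0.01]]
diffuse = [[0.25, 0.25, 0.25, 0.25],
           [0.25, 0.25, 0.25, 0.25]]

# The peaked topics have lower entropy; the review's concern is that
# hyperparameter tuning alone might push LDA toward this regime,
# trading recall for precision without the proposed model.
assert topic_entropy(peaked) < topic_entropy(diffuse)
```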