Paper ID: 1009
Title: When Are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper gives a sufficient condition for unique recovery of the topic-word matrix of "n-persistent" overcomplete topic models by a tensor decomposition of moments. In the overcomplete regime, the number of topics exceeds the number of words, so it is in general impossible to uniquely recover the relation between topics and words. This paper overcomes this difficulty using the notion of a "perfect n-gram matching", from which sufficient conditions for identifiability are derived for a wide range of situations. The result shows that as n, the persistence length of a topic, grows, it becomes easier to recover the topic-word matrix uniquely.

The theoretical analysis given in this paper is quite novel. It applies to a wide range of situations (from sparse to dense models), and the conditions are intuitively meaningful.
The discussion of why persistence helps identifiability is instructive.

My concern is about readability. The intuition behind some of the theoretical material is at times unclear to non-experts. Moreover, some mathematical definitions are missing.

For example, it is hard for non-experts to grasp the theoretical essence of why even-order (2rn-th) moments are required and why the 2rn words must be split into two groups of rn words to obtain a matrix form of the moments. Some intuitive explanation of this point would be helpful.

Readability aside, this is an interesting paper.

== After the author feedback
I appreciate the feedback. Explanations such as those given in the rebuttal would improve readability, and as a result the significance of the results could be highlighted more effectively.


Minor comments:
The Khatri-Rao product is defined only in the supplementary material. The Khatri-Rao product itself depends on how the matrices are partitioned, so the partitioning pattern used in the analysis should be defined in the main body.
Q2: Please summarize your review in 1-2 sentences
A condition for unique recovery of the topic-word matrix of n-persistent topic models from moments is derived. More readable descriptions would be welcome.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Identifiability of a model typically entails that the mapping from model parameters to the set of distributions is injective. In this paper, the authors consider the mapping from model parameters to a set of moments for the class of topic (or probabilistic admixture) models, and derive sufficient conditions for identifiability of such topic models. They do so by relating identifiability of the model parameters to the existence of a perfect "n-gram matching" in a bipartite graph. For the specific setting where each topic is randomly supported on a bounded subset of the set of words, they show that their identifiability condition is also partly necessary.

This is a well-written paper, with interesting results.

CONS:

The title of the paper suggests that identifiability conditions are derived for overcomplete latent variable models in general, but the technical development is very strongly tied to topic models. Perhaps the title could be made more specific.
Q2: Please summarize your review in 1-2 sentences
Well-written paper that provides identifiability conditions for topic models.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper is concerned with identifiability of latent variable topic models, characterizing when the latent structure is identifiable from observable moments in these models. The authors introduce a multi-view type of model called the n-persistent topic model, in which each set of n consecutive words shares the same topic, and study how n affects identifiability.

The first theorem is a deterministic result providing sufficient conditions for identifying the population structure (i.e., the topic distributions) from the (2rn)-th observable moment. The three conditions involve non-degeneracy of the prior distribution over topic proportions, a non-overlapping condition on the sparsity patterns of the columns of the population structure matrix, and a condition relating the Kruskal rank of the population matrix to the sparsity of each column.

The paper then analyzes population structures with random sparsity patterns and characterizes parameter settings (sparsity, degree of overcompleteness) under which these models are identifiable w.h.p. Here, the authors show that if the number of topics is O(p^n), where p is the vocabulary size and n is the persistence parameter, and the support size of each topic distribution is O(p^{1/n}), then these models are identifiable w.h.p.

While the paper works with matrix representations throughout, the authors comment that working with tensor representations would yield uniqueness conditions for a range of tensor decompositions (depending on the persistence parameter). This is an interesting result that is presented as an afterthought in the paper but should be emphasized. Working with tensor representations may also improve readability.

The results of the paper are novel and interesting. The paper, however, suffers in terms of readability; more intuition about the conditions and the main theorem would significantly improve its clarity. For example, conditions 2 and 3 together imply that the ideal matrices are highly sparse while still containing a perfect n-gram matching, but it is not obvious why these matrices are the ideal ones. Moreover, the awkward matrix notation makes the paper significantly harder to read.

Questions:
1. For condition 1 to be satisfied, it is essential that the topic proportions are a random quantity, but this is not always assumed in analyses of latent variable models (if h is fixed then M_{2r}(h) is always rank one). Why is this condition required? The previous analysis [9] seems not to require a prior distribution on the topic proportions while still guaranteeing identifiability.
Q2: Please summarize your review in 1-2 sentences
The paper studies a problem that has received a lot of attention in the past few years and makes a nice contribution with interesting results and proof techniques. Deeper intuitions about the results could significantly improve the readability of the paper.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Reviewer 4:

* Why split the 2rn words into 2 groups of rn words? Isn't it sufficient to consider just rn words and their moment, E[x_1 \otimes x_2 \otimes ... \otimes x_{rn}], as in [9]?

Response: In our model, the decomposition of the tensor E[x_1 \otimes ... \otimes x_{2rn}] is overcomplete, so the techniques of [9] do not apply here. We use alternative techniques, and for the purpose of analysis we reshape the tensor E[x_1 \otimes ... \otimes x_{2rn}] into a matrix indexed by rn words on each side, where the first group of words is {x_1, ..., x_{rn}} and the second group is {x_{rn+1}, ..., x_{2rn}}. Note that each group consists of blocks of n successive words, and we assume that the topic persistence level is at least n. The order of the words is important when the persistence level n > 1, since we assume that a common topic persists over each sequence of n words.
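As a rough numpy sketch of this reshaping step (the sizes p, r, n and the tensor T below are placeholders, not the authors' code), grouping the first rn modes as rows and the last rn modes as columns unfolds the (2rn)-th order moment tensor into a p^{rn}-by-p^{rn} matrix:

```python
import numpy as np

# Placeholder sizes: vocabulary size p; r and n as in the rebuttal (persistence level n).
p, r, n = 4, 1, 2
order = 2 * r * n  # order of the moment tensor E[x_1 (x) ... (x) x_{2rn}]

# Stand-in for the (estimated) moment tensor: one mode of dimension p per word.
T = np.random.rand(*([p] * order))

# Unfold: the first rn modes index the rows, the last rn modes index the columns.
M = T.reshape(p ** (r * n), p ** (r * n))
print(M.shape)  # (p^{rn}, p^{rn})
```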

With regard to the question of why 2rn words are required and not just rn: note that even for n = r = 1, which is the case considered in [12], the second-order moment (2rn = 2) is required and the first-order moment (rn = 1) is not sufficient. Similarly, in [9] with n = r = 1, the third-order moment is required and the first-order moment is not sufficient. We establish that the 2n-th moment is necessary for estimating overcomplete models with q = Theta(p^n) and persistence level n.

* Only 2rn words are available and the moments are estimated in a one-shot way. However, estimating moments requires a sufficient number of repetitions of words. Even with a sufficient number of words, there is still the question of where to split the words into groups of rn words. The problem setting (how the moments are estimated and samples are constructed) is seriously unclear.

Response: Within a single document, 2rn is the minimum number of words required for estimating the moment. We estimate the moment tensor E[x_1 \otimes ... \otimes x_{2rn}] by considering the first 2rn words of each document and averaging over the documents in the corpus to obtain a consistent estimate. Note that this is not a bag-of-words model (when the persistence level > 1), and the order of the words is important for estimating the moment. Once the tensor is computed, it can be appropriately reshaped into a matrix to obtain M_{2rn}(x).
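A minimal sketch of this estimation step (assuming a hypothetical corpus `docs` given as lists of word indices into a vocabulary of size p; illustrative only, not the authors' implementation):

```python
import numpy as np

def empirical_moment(docs, p, order):
    """Average the ordered outer products of one-hot encodings of the first
    `order` (= 2rn) words of each document.  Word order is preserved, since
    this is not a bag-of-words model when the persistence level > 1."""
    T = np.zeros([p] * order)
    eye = np.eye(p)
    for doc in docs:
        outer = eye[doc[0]]
        for w in doc[1:order]:
            outer = np.multiply.outer(outer, eye[w])  # x_1 (x) ... (x) x_{2rn}
        T += outer
    return T / len(docs)

# Toy usage: p = 3 words, 2rn = 2, two short documents.
print(empirical_moment([[0, 1, 2], [2, 0, 1]], p=3, order=2))
```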

Reviewer 5:

* The title is too broad.

Response: Revised title is: When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity

Reviewer 6:

* Why use the matrix form and not the tensor representation?

Response: We have provided the tensor form of the moment in the appendix, but since we felt that part of the NIPS audience may not be familiar with tensor notation, we retained only the matrix form in the main text. Our identifiability results imply uniqueness of a structured class of tensor Tucker decompositions, and we will attempt to emphasize this in the main text of the revised version.

* More intuition on the degree and krank conditions 2 and 3

Response: For identifiability (in the random setting), the topic-word matrix must have sparsity ranging from the sparse (log p) to the intermediate (p^{1/n}) regime. On one hand, models that are too sparse do not have enough edges in the bipartite graph from topics to words, and therefore the different topics cannot be distinguished. On the other hand, if the bipartite graph is too dense, then there is not enough diversity in the word supports among different topics. Thus, we cannot have identifiability when the degrees are too small or too large, and we provide a range of degrees for which identifiability holds; condition 2 ensures this through the requirement of a perfect n-gram matching. Regarding the relationship between the krank and the degrees in condition 3: intuitively, a larger krank leads to better distinguishability among the topics, and therefore we can tolerate larger degrees when the krank is correspondingly larger. We will add this discussion to the revised version.
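To make the krank in condition 3 concrete: the Kruskal rank of a matrix is the largest k such that every set of k columns is linearly independent. A brute-force check on a small placeholder topic-word matrix might look like the following sketch (illustration only, feasible just for tiny matrices; this is not the paper's method):

```python
import numpy as np
from itertools import combinations

def kruskal_rank(A, tol=1e-10):
    """Largest k such that every set of k columns of A is linearly independent."""
    q = A.shape[1]
    krank = 0
    for k in range(1, q + 1):
        if all(np.linalg.matrix_rank(A[:, list(cols)], tol=tol) == k
               for cols in combinations(range(q), k)):
            krank = k
        else:
            break
    return krank

# Toy overcomplete example: 3 words, 4 topics; every 3 columns are independent.
A = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 1., 1.]])
print(kruskal_rank(A))  # 3
```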

* For condition 1 to be satisfied, it is essential that the topic proportions are a random quantity. The previous analysis [9] seems not to require a prior distribution on the topic proportions while still guaranteeing identifiability.

Response: This is not true. [9] is restrictive and requires the topic proportions either to be drawn from a Dirichlet distribution or to place only a single topic in each document. Assuming these distributions implies that M_{2r}(h) is full rank. Our condition 1 is the natural non-degeneracy condition that no topic can be expressed as a combination of the other topics: if that were the case, we would not have identifiability. Thus condition 1 is necessary for identifiability. Condition 1 does not require a random model on h; all we require is that the topics be distinguishable from one another. Note that this condition is also required in [12], as well as in the works of Arora et al. on learning topic models via NMF techniques.