Review for NeurIPS paper: Consistent Estimation of Identifiable Nonparametric Mixture Models from Grouped Observations

NeurIPS 2020

Consistent Estimation of Identifiable Nonparametric Mixture Models from Grouped Observations

Review 1

Summary and Contributions: This paper studies the problem of estimating nonparamatric mixture models from grouped observations (i.e. pairs of data (x_1, x_2) where x_1 and x_2 are sampled from the same mixture component). The authors provided the first consistency result for this problem using a KDE-type estimator. ---------------------------- Post-rebuttal update: I would like to thank the authors for their response. After reading the rebuttal and discussing with other reviewers, I've decided to raise my score to 6. I agree with R3 that the problem considered in this paper is important, and would love to see more work in this area in the future. It would be great if the authors can include a paragraph about the main technical barriers to finite-sample rate (just like in the rebuttal, but preferably longer) in the camera ready version.

Strengths: This paper provided the first consistency result for this problem (estimating nonparamatric mixture models from grouped observations).

Weaknesses: The theoretical results are a little weaker than I would like to see, in particular: - Lacking finite-sample guarantees: the convergence of the estimator only holds in the infinity-sample limit (n -> \infty). The finite-sample convergence rate is not established. - Computationally inefficient estimator: the optimization problem (7) is non-convex, hence cannot be solved in poly-time.

Correctness: I have verified that the mathematical claims are correct. The methodology of the experiments look correct to me, but I haven't checked the details of implementation.

Clarity: The paper is mostly well-written. I have made a few minor suggestions for improvement, see detailed comments below.

Relation to Prior Work: The discussion on the prior work is mostly clear, but I hope this part can be expanded a little (see the detailed comments on presentation below).

Reproducibility: Yes

Additional Feedback: My evaluation to this paper is on the borderline - that being said, I am happy to change my score if the authors are able address the following questions: - What's the main obstacle to stronger theoretical guarantees? In particular, (1) Statistically: Is it possible to achieve a finite-sample convergence rate? If the answer is 'no', what makes it difficult? The classical works on KDE have showed that we can achieve a rate of convergence for density estimation under standard assumptions like the Holder smoothness of the density, can we get some results with a similar flavor here? (2) Computationally: Is it possible to achieve an efficient algorithm for this problem? If the answer is 'no', what makes it difficult? The less significant ones: - In the synthetic experiment, the results presented here showed that NDIGO (the algorithm introduced by this work) outperforms other methods on the training sample, but not for out-of-sample setting (where NPMIX performs the best). Can you comment on the possible reasons for this discrepancy? I would like to mention a few possible improvements to the paper below: Presentation: - I think this paper will benefit from a discussion section, especially on why this problem is important (the authors provided two examples, but I hope this part can be more detailed in the future version) and what are the main technical barriers (especially, comparing to non-grouped version of the problem, and density estimation). - In the current version, it was not immediately clear to me why (6) and (7) are equivalent, and why the first term is an empirical quantity. I was able to figure it out on my own, but I hope the authors can provide a brief comment on this (e.g. provide a pointer to supplementary S2.1). Experiments: It would be nice to see a little more experiments on synthetic data (to get better visualization of the clustering induced by the algorithm). For example, the ones used in [10]. However, I fully understand that the focus of this paper is mostly theoretical, so this is just a minor suggestion for the authors.

Review 2

Summary and Contributions: The paper presents a novel algorithm based on weighted kernel density estimators to estimates arbitrary identifiable mixture model from grouped observations. The experimental results show that the approach outperforms baseline methods, especially when mixture components overlap significantly.

Strengths: Authors propose a weighted kernel density estimator for learning mixture models with theoretical guarantees.

Weaknesses: N/A

Correctness: Claims in three theorems look correct without further checking.

Clarity: The paper is easy to follow.

Relation to Prior Work: Authors do not clearly discuss the differences between the proposed work and previous contributions.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: An important limitation of semi-supervised learning is that unlabeled data is not that useful unless certain conditions are satisfied. A strong condition is identifiability of the mixture distribution according to which the unlabeled samples are distributed. If the mixture is identifiable, one could conceptually estimate the components and class probabilities using unlabeled data with an appropriate estimator and use labeled data to label the components of the mixture. Identifiability from iid samples is a strong requirement, traditionally limited to certain - but not all - families of parametric distributions. Recent results show that if one can obtain groups of samples from each of the underlying distribution, then the mixture becomes identifiable under broad conditions. This paper proposes consistent estimators of the mixture distribution using grouped observations, proves consistency results including results for convergence of estimators of the mixture component and corresponding mixing parameters, describes optimization procedures for the estimation, and shows results on a few datasets.

Strengths: The paper appears to be correct, although I did not attempt to derive the results from the hints in the main paper - I just read the proofs. The work appears to be theoretically sound. The experimental results - which actually relate to the optimization procedure rather than to the general asymptotic result 0 are promising and show that the method outperforms (unsurprisingly) clustering methods when identifying overlapping distributions and actually outperforms a close competitor that also attempts to identify mixture components using grouped data. Results on text date (Russian troll Twitter dataset) are promising although it is not obvious to this reviewer that the paired samples are indeed from the same distribution even if they come from the same tweet (people tend to be scattered on twitter). The optimization method also appears to scale well

Weaknesses: The main limitation of the work is that it applies to cases where grouped samples are available,which is not an incredibly common occurrence.

Correctness: The paper appears to be correct. I strongly believe that the main contribution of the paper is of theoretical nature and I am not that concerned with the experimental results. The methodology of the experiments appears to be correct

Clarity: The paper reads extremely well, notation is clear and well chosen, one could read the paper in a single pass.

Relation to Prior Work: yes, the main results are the estimator consistency theorems and the optimization procedure.

Reproducibility: Yes

Additional Feedback: I did not proofread the paper, but line 77 should read "but they do not" rather than "but they does not"

Review 4

Summary and Contributions: This paper proposes a kernel-based estimator of a nonparametric mixture model in which the user has access to paired observations of the true model. The authors prove that the proposed estimator is consistent, propose a practical algorithm for computing it, apply it to simulated data, and finally consider two interesting applications to nuclear source detection and Twitter data (NLP). #### POST-REBUTTAL I appreciate the authors' response to my questions and addressing many of my minor concerns. After discussion, I have decided to increase my score to a 7, since the authors have clarified some of the fundamental road blocks to some of the theoretical issues regarding time/sample complexity. I appreciate the authors' willingness to discuss applications more seriously. One minor point: In asking "which of the algorithms compared make use of the paired observations, and which do not", my point was that the paper only mentions NPMIX in this regard, but none of the other algorithms. It would be helpful for readers if this was clearly spelled out for every algorithm tested in one place.

Strengths: * The paper is well-written and the approach is quite natural * The consistency results are novel, and the theoretical results are presented coherently with the appropriate amount of detail * The experiments are reasonably complete * The biggest strength in my opinion are the two applications, which help to motivate the model, but I would like to see more detail on these applications (see below)

Weaknesses: * The identifiability of the model has been proved previously in [3], so the main contribution is an estimator * This estimator is based on nonconvex program, and it is not clear that this can be solved to optimality using the author's approach * The results are asymptotic, and no there is no discussion of finite sample complexity (esp. wrt to other methods in the literature) * The applications are a significant strong point, however, they are not given enough attention: More detail in describing how the applications maps to this model would be useful, along with more details in the experiments as well (there is ample room to include more text, e.g. by removing Section 3 and moving proof sketches to the supplement) Given the limitations of the theoretical results mentioned above, much of the merit of this paper rests on its practical significance. On my first reading, I did not pay enough attention to the applications since they have been glossed over by the authors. I was thus concerned about the motivation behind the model itself. But after thinking about this, I am convinced the applications are sufficient to motivate the model, however, I do feel much more attention needs to paid attention to this aspect. Furthermore, in addition to applications, I would appreciate more detail regarding computational aspects. How practical is this approach on large datasets? What guarantees can the authors give regarding the optimization scheme? What tradeoffs and limitations exist in practice, esp. compared to related algorithms? It would be interesting to illustrate (or at least describe) some examples from the proposed method fails, but existing algorithms succeed.

Correctness: I found no major issues with the claims in this paper, but there are some untrue statements that should be corrected (see Q8).

Clarity: Overall, the paper is well-written. I have made some minor suggestions in Q8 below regarding presentation.

Relation to Prior Work: The related work section looks reasonably complete and well-referenced.

Reproducibility: Yes

Additional Feedback: Major comments * L134: The assumption that M is known should be mentioned up front and discussed early on (somewhere in Section 1 as well as beginning of Section 4, right now it is hidden at the end of Section 4) * Limitations of this assumptions should also be discussed in detail * Can this be relaxed? * Section 7: The authors claim an approach for "solving (6)", which is a nonconvex problem, and I could not find a proof of the solvability of this program * What guarantees do the authors have? * Will this approach converge to a stationary point, or a global minimum? There is some discussion at L218-219, but more discussion is needed. * In the experiments, which of the algorithms compared make use of the paired observations, and which do not? It should be mentioned that not all algorithms are on equal footing in this regard. Minor comments * When a statistical model is identifiable, under very weak conditions there is guaranteed to be a consistent estimator (e.g. Le Cam and Schwartz, "A necessary and sufficient condition for the existence of consistent estimates", or the book by Ibragimov and Hasminskii) -- what this work proposes is a *practical* estimator, which is much more useful * L18-20: It would be helpful for the reader to point out some applications for this model here (e.g. authors can simply refer to the two good applications they consider later in the paper) * L63: Authors should mention the drawbacks of clustering using mixture models, e.g. identifiability, sample complexity, and the fact that MMs can lead to undesirable or misspecified clusterings * L77-78: The claim about the paper [10] seems incorrect; my understanding based on their Theorem 4.3 is that they do prove recovery of the underlying components * L138: What does it mean for p to be identifiable? Presumably the authors mean the underlying parameters are identifiable. This should be clarified (or appropriately defined somewhere) * L187: This sentence could be phrased better -- Theorems 1 and 2 imply consistency in estimating q, which is of course implies identifiability of q, but this does not necessarily imply identifiability of p: It would read better to clarify this distinction more carefully (e.g. it took me a second to parse what was meant here) * Obviously a matter of personal taste, but I found "NoMM" to be an odd choice of acronym (even confusing in some places); either NPMM or NMM seems more natural Typos * L132: "in the supplemental." * L233: "concision"