NeurIPS 2020

Learning Structured Distributions From Untrusted Batches: Faster and Simpler


Review 1

Summary and Contributions: We are given N batches of samples, of which an epsilon fraction are corrupted, while the remaining batches are drawn from distributions that are omega-close to a ground-truth distribution mu. The goal is to estimate mu in total variation distance. This paper considers the special case of structured distributions and designs algorithms that run in polynomial time and require a number of batches within a quadratic factor of the best known lower bound. The paper combines two prior works: [JO], which provides a simple polynomial-time algorithm with near-optimal sample complexity, and [CLM], which requires a superpolynomial number of batches but also considers more general classes of structured distributions. The authors study the structured distribution estimation problem considered in CLM and apply the tools of JO to obtain the sample complexity bounds.

— Post Author Response — I am happy with the authors' clear and honest response. I have revised my score.

Strengths: The paper proposes algorithms that are polynomial in both running time and data requirement, unlike previous works on structured distribution estimation from untrusted batches. Extending JO to the case of structured distributions requires technical insight and is definitely non-obvious.

Weaknesses: While the result requires some technical insight, it is unclear whether any additional technical insights are needed beyond combining the methods of CLM and JO. In particular, the technical difficulties in directly combining the two methods should be clarified better.

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: This paper considers the setting of learning discrete distributions from untrusted batches that was recently introduced by Qiao and Valiant. Here there are N batches, each consisting of k i.i.d. samples drawn from a distribution \mu over the discrete domain [n], and an \eps fraction of the batches can be corrupted arbitrarily. The goal is to learn the distribution \mu efficiently. This is related to work on robust estimation, except that here one needs closeness in the L_1 norm (as opposed to L_2). Recently, Jain and Orlitsky'19 and Chen-Li-Moitra'19 designed efficient algorithms for this problem. The main setting considered in this paper is when the distribution \mu is structured, i.e., it is close to a piecewise polynomial function with few pieces, each of small degree. The main contribution is a new polynomial-time algorithm, based on an adaptation of the filtering approach of [JO19], that also works in this structured setting. The sample complexity dependence on many of the parameters is tight: the bounds depend mostly on the number of pieces and the degree, with only a logarithmic dependence on the dimension. The paper also demonstrates the effectiveness of the algorithm on simulated data, where an \eps fraction of the batches are drawn from a different distribution (which moves mass from the high-probability items to the low-probability items).
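For concreteness, here is a minimal sketch of the simulated-data setup described above. The parameter values and the particular choices of mu and of the adversarial distribution are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, k, eps = 64, 1000, 50, 0.2   # domain size, #batches, batch size, corruption fraction

# Ground-truth distribution mu over [n]: a simple decaying (hence "structured") profile.
mu = np.arange(n, 0, -1, dtype=float)
mu /= mu.sum()

# Adversarial distribution: reverse mu, moving mass from high- to low-probability items.
adv = mu[::-1].copy()

def batch_freqs(p):
    """One batch of k i.i.d. draws from p, summarized as an empirical frequency vector."""
    return rng.multinomial(k, p) / k

n_good = int((1 - eps) * N)
batches = np.array([batch_freqs(mu) for _ in range(n_good)] +
                   [batch_freqs(adv) for _ in range(N - n_good)])
# batches has shape (N, n) and is the input a robust/filtering estimator would see.
```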

Strengths: The theoretical contribution is solid: the paper gives improved guarantees in the structured setting explored in Chen-Li-Moitra'19. This is an improvement from quasi-polynomial to polynomial running time in terms of the structure parameters (number of pieces, degree), as well as an improvement in the sample complexity bounds. There is also a technical contribution in extending the filtering framework from the recent robust estimation literature to what they call soft filtering, which avoids the need to round the SDP in each step.

Weaknesses: I think the main drawback is that the paper is quite incremental in terms of techniques. It combines the filtering approach of Jain and Orlitsky'19 with the idea from CLM'19 of using the Haar wavelet basis, in which the representation is sparse (a brief illustration of this sparsity appears below). Putting these together is still quite technical, but perhaps not surprising. The model of Qiao and Valiant is interesting, but the notion of "structured distribution" in terms of piecewise polynomial approximations considered here is less convincing; it seems to be mostly of theoretical interest. In this context, it would have been more interesting to see experiments on real data (where there is some potential structure), as opposed to simulated data from a fixed number of pieces. Finally, the experiments are run on small values of n (n ≈ 64 and n ≈ 128). The SDP-based algorithm does not seem practical (it takes several minutes for n ≈ 128), and for such small n it is not clear that the sample complexity savings in terms of n are significant. But this is less important in my opinion, since the main contribution is theoretical.
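As a small aside on why the Haar basis helps: a piecewise-constant vector with few pieces has only O(pieces · log n) nonzero Haar coefficients. A minimal numpy sketch with illustrative values (not taken from the paper):

```python
import numpy as np

def haar_transform(x):
    """Orthonormal Haar transform of a length-2^m vector via recursive averaging/differencing."""
    x = np.asarray(x, dtype=float)
    coeffs = []
    while len(x) > 1:
        coeffs.append((x[0::2] - x[1::2]) / np.sqrt(2))  # detail coefficients at this scale
        x = (x[0::2] + x[1::2]) / np.sqrt(2)             # averages passed to the next scale
    coeffs.append(x)                                     # overall scaling coefficient
    return np.concatenate(coeffs[::-1])

# A piecewise-constant vector over [n] with 3 pieces (n = 64):
x = np.concatenate([np.full(16, 0.3), np.full(32, 1.0), np.full(16, -0.5)])
h = haar_transform(x)
print(int(np.sum(np.abs(h) > 1e-9)))  # 4 nonzeros here; O(pieces * log n) in general
```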

Correctness: The claims seem correct, as much as I could verify.

Clarity: The paper is dense and very hard to read, apart from the introduction. For example, the algorithm (in particular, the SDP) is not fully specified in the main body, and there are incorrect references to equations. The paper as written seems to be aimed at experts.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Overall I am on the fence about this paper.

------ Post Rebuttal -----------

The rebuttal addressed some of my concerns, particularly regarding the new ideas needed to extend the JO19 approach to this setting. I am happy to see the paper accepted.

Comments:

1. It may be good to look for more practical algorithms, either by avoiding the use of an SDP solver or by using algorithms based on multiplicative weights. I would also recommend that the authors use a better SDP solver like MOSEK, which can often solve instances with several hundred, or even thousands, of variables and constraints in the same time. (A sketch of such a setup follows these comments.)

2. About the writing: the semidefinite program is not specified in the main body; the only reference is in Definition 2.3, where it is mentioned in passing. There are also various references that appear before they are defined (e.g., constraints 3 and 5 in Section 2). I assume this is referring to the definition in Section 2.1 of the Appendix?

3. In the soft filtering framework, does avoiding the rounding of the SDP in each step give significant running time savings in the unstructured setting compared to [JO19], or is it primarily of technical interest? If it does reduce the running time, it would be good to emphasize that.
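To make comment 1 concrete, here is a minimal sketch of solving an SDP of the general flavor that filtering algorithms use (maximize <A, X> over PSD X with unit trace), trying MOSEK via cvxpy and falling back to the open-source SCS solver. The matrix A and the size m are placeholders; this is a generic template, not the paper's actual program:

```python
import numpy as np
import cvxpy as cp

# Placeholder data: a random symmetric matrix of the experiments' scale.
m = 128
A = np.random.default_rng(0).standard_normal((m, m))
A = (A + A.T) / 2

X = cp.Variable((m, m), PSD=True)
prob = cp.Problem(cp.Maximize(cp.trace(A @ X)), [cp.trace(X) == 1])

try:
    prob.solve(solver=cp.MOSEK)  # commercial; typically much faster at this size (license required)
except cp.error.SolverError:
    prob.solve(solver=cp.SCS)    # open-source fallback
print(prob.value)                # optimal value equals the largest eigenvalue of A
```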


Review 3

Summary and Contributions: The paper solves an important problem in machine learning: determining a distribution on {1,...,n} given a number of minibatches coming from a close distribution together with a number of corrupted minibatches. The main contribution of the paper is an algorithm that scales sublinearly in n while remaining competitive with the state of the art in statistical accuracy.

Strengths: The paper answers a key question in the field concerning the algorithmic complexity of robustly learning discrete distributions. This contribution is very relevant to the ML community and, more specifically, to the community working on robust statistics.

Weaknesses: Practical relevance beyond the theoretical achievements should perhaps be given some more consideration.

Correctness: The proofs I have reviewed are correct.

Clarity: In my opinion, the paper is very clearly written and a pleasure to read.

Relation to Prior Work: Relation to prior work is clearly stated and the contribution of the paper as compared to the current state of the art is illuminating.

Reproducibility: Yes

Additional Feedback: