NIPS 2017
Mon Dec 4th through Sat the 9th, 2017 at Long Beach Convention Center
Paper ID: 2950 Learning Populations of Parameters

### Reviewer 1

The paper considers a statistical framework in which many observations are available but only a few concern each individual parameter of interest. Instead of estimating each parameter individually, following an idea of Stein, the idea is to estimate "population parameters", that is, to provide an estimate of the empirical distribution of these parameters. Theoretically, an estimator of the empirical distribution based on its first moments is proposed, and a bound on the risk of this estimator with respect to the Wasserstein distance is proved. Then an algorithm for the practical implementation of this estimator is proposed, and the performance of the estimator is illustrated on both synthetic and real data sets. The interest of the statistical problem is shown at the end of the paper through applications to political tendencies and sports performance. I think that the statistical problem, the way it is analysed, and the moment-based solution are interesting, and I strongly recommend the paper for publication. Nevertheless, I think that more references and related work could be mentioned. For example, it seems that the estimator would provide a relevant prior for an empirical Bayes procedure; the connection with latent variables in mixed-effects models should be discussed, as well as the connection of this procedure with "hidden" variable models such as HMMs. Finally, a moment procedure was proposed recently to estimate the "unobserved random environment" of a Markov chain. Comparisons with these related works could improve the paper.

### Reviewer 2

This paper establishes an interesting new problem setting, that of estimating the distribution of parameters for a group of binomial observations, gives an algorithm significantly better (in earth-mover distance) than the obvious one, and shows both a bound on its performance and an asymptotically matching lower bound. It also gives empirical examples on several datasets, showing the algorithm's practical advantages. I did not closely verify the proofs, but they seem reasonable. Presentation-wise, they are somewhat disorganized: it would be helpful to label the sections (e.g. "Proof of Theorem 3"), to provide a formal statement of the proposition being proved in the noise-free case, etc. Unfortunately, the algorithm does involve solving a fairly large linear program. Explicitly characterizing the effects of m on computational complexity and accuracy would be helpful. It would also be much nicer not to need a hard cutoff on the moment estimators s. Perhaps you could weight the loss function by the standard error of the estimators, or something similar? The estimators should be asymptotically normal, and you could e.g. maximize the likelihood under that distribution (though their asymptotic correlations might be difficult to characterize). Though it's an interesting problem setting, it also doesn't seem like one likely to take the world by storm; this paper is probably of primarily theoretical interest. Some interesting potential applications are suggested, however. A question about the proof: in lines 400-401, I think some more detail is needed in the proof of the variance bound. Choosing the t/k independent sets would give the stated bound, but of course there are \binom{t}{k} - t/k additional sets correlated with those t/k. It seems likely that averaging over these extra sets decreases the variance, but this is not immediate: for example, if all of the other sets were identical, the variance would be increased.
I think some work is needed here to prove this, though maybe there's an obvious reason that I'm missing.

Minor notes: Line 47: should be "Stein's phenomenon." Line 143, "there exists only one unbiased estimator": if you add N(0, 1) noise to any unbiased estimator, it's still unbiased. Or you could weight different samples differently, or use any of various other schemes. Your proof in Appendix E (which should be referenced by name in the main text, to make it easier to find) shows rather that there is a unique unbiased estimator which is a deterministic function of the total number of 1s seen. There is of course probably no reason to use any of those other estimators, and yours is the MVUE, but the statement should be corrected. Lines 320-322: (6) would be clearer if you established explicitly that t is scaled by c. The y-axis limits for the CDF figures (2, 3, 4) should be clamped to [0, 1]. The lower y-axis bound in Figure 1a should probably be 0 as well.
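
The variance question and the note on Line 143 can both be checked numerically on small cases. The sketch below is only a sanity check, not a proof, and it rests on an assumption about the paper: that the k-th moment estimator for a single coin with X ones out of t flips is \binom{X}{k} / \binom{t}{k}, i.e. the fraction of size-k subsets of flips that are all ones. The function names and the values of t, k and p are illustrative. Using exact rational arithmetic, it verifies that this estimator is unbiased for p^k and compares its variance with that of the average over only the t/k disjoint blocks.

```python
from fractions import Fraction
from math import comb


def binom_pmf(t, x, p):
    """P(X = x) for X ~ Binomial(t, p); exact when p is a Fraction."""
    return comb(t, x) * p**x * (1 - p) ** (t - x)


def moment_estimator(x, t, k):
    """Assumed estimator: fraction of size-k subsets of the t flips that are all ones."""
    return Fraction(comb(x, k), comb(t, k))


def mean_and_variance(t, k, p):
    """Exact mean and variance of the all-subsets estimator under Binomial(t, p)."""
    mean = sum(binom_pmf(t, x, p) * moment_estimator(x, t, k) for x in range(t + 1))
    second = sum(
        binom_pmf(t, x, p) * moment_estimator(x, t, k) ** 2 for x in range(t + 1)
    )
    return mean, second - mean**2


def block_average_variance(t, k, p):
    """Variance of the average of floor(t/k) independent all-ones block indicators."""
    blocks = t // k
    q = p**k  # probability that one block of k flips is all ones
    return q * (1 - q) / blocks


# Illustrative values, not taken from the paper.
t, k, p = 12, 3, Fraction(3, 10)
mean, var_all = mean_and_variance(t, k, p)
var_blocks = block_average_variance(t, k, p)

print("unbiased for p^k:", mean == p**k)
print("variance, all subsets:    ", float(var_all))
print("variance, disjoint blocks:", float(var_blocks))
```

If the all-subsets variance is no larger than the disjoint-blocks variance across a range of (t, k, p), that supports the stated bound, but a written argument in the proof would still be needed.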