NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:3656
Title:Learning Sparse Distributions using Iterative Hard Thresholding

Reviewer 1

Post-rebuttal: I have downgraded my overall score to 7. I am troubled by the lack of motivation (and that, in the rebuttal, the authors defer more discussion of model compression to future work). I would also have liked to see more details in the rebuttal about the "more comprehensive discussion" regarding alternative algorithms.

ORIGINALITY
===========
Motivated by previous work on modeling priors for functional neuroimaging data as sparse distributions, this paper studies the problem of learning a sparse distribution that minimizes a convex loss function. Apparently, this is the first work that studies this problem for general loss functions. The goal of the work is to adapt the well-known IHT algorithm, a form of projected gradient descent, to this problem. Again, as far as I can see, this approach is original to this work. The mathematical techniques are based on standard approaches and are not very novel in my opinion.

QUALITY & CLARITY
=================
The paper is mostly a pleasure to read. It tackles head-on the stated goal of investigating the performance of IHT, identifies a computational barrier to efficient implementation, and then gives general conditions (strong convexity, smoothness, Lipschitzness) that enable a greedy heuristic to be correct. The proofs are solid but not very complicated.

The experimental section is brief and a bit unsatisfactory, as the comparisons are against simple baselines. Why not try some others?

* Analogously to basic thresholding, first solve the problem without the sparsity constraint, then take the heaviest k coordinates, and solve the problem restricted to distributions with support on just those coordinates.
* Consider an alternate projection algorithm in IHT, where you take the heaviest k coordinates of q and find a distribution supported on those k coordinates that is closest to q.

Also, it would be interesting to check whether the support set changes during the iterations in the experiments. Why not try fixing the support set after the first (or first few) iterations of IHT and doing the exact projection onto that set thereafter?

SIGNIFICANCE
============
To the extent that optimization over sparse distributions is an important problem, the contributions in this paper are very significant and relevant to the NeurIPS community. However, I feel the authors should do a better job with the motivation. Is it just [5] that shows the utility of modeling distributions as sparse? Why do they not discuss the model compression problem that is used for the experiments?
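The alternate projection suggested in this review can be sketched concretely. Below is a minimal NumPy sketch, assuming "closest" means closest in Euclidean distance (the function names `project_to_simplex` and `restricted_projection` are hypothetical, not from the paper):

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex
    (the standard sort-and-threshold algorithm)."""
    u = np.sort(v)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1)        # shift that makes entries sum to 1
    return np.maximum(v - theta, 0.0)

def restricted_projection(q, k):
    """Keep the k heaviest coordinates of q, then return the closest
    distribution (in Euclidean distance) supported on those coordinates."""
    support = np.argsort(q)[-k:]              # indices of the k heaviest entries
    p = np.zeros_like(q)
    p[support] = project_to_simplex(q[support])
    return p
```

For example, for q = [0.5, 0.3, 0.1, 0.1] and k = 2, the sketch keeps the two heaviest coordinates and redistributes the remaining mass over them, yielding a valid 2-sparse distribution.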

Reviewer 2

The paper studies Iterative Hard Thresholding (IHT) in distribution space. IHT algorithms have been studied before. This work aims at lifting the solutions provided by IHT to distribution space, i.e., to distributions with (usually many) hard zeros on the discrete space they are defined on. The overall approach is defined and investigated for relatively general functionals F[.]. The definition of the general framework is an achievement by itself […] 'minGreedy' runs. A large variance would then be an advantage. To turn the argument around: could one make the IHT more stochastic? And is it true that there is no variance in the 20 runs because the algorithm is deterministic? What about different starting conditions?

After rebuttal: I do not have the feeling that all my questions were understood correctly, but many were.

Reviewer 3

The main contribution is to provide an algorithm to learn sparse distributions. The paper would benefit from improving the way the methods are explained and from identifying its contributions more clearly. In particular:

- Some statements could be commented on more generally, for instance "Interestingly, $D_k \subset D_{k'} \subset P$ in general" (l.79), in particular if they are novel. Explain what the "QM-AM inequality" is.
- The resulting algorithm provides further evidence that greedy algorithms are quite powerful at solving NP-hard problems, yet this statement is not completely falsifiable (are there alternatives to using greedy methods?), and the novelty of this particular application to distributions should be justified. Indeed, it seems that most theoretical results come from extending vector-sparsity results to this functional setting (see supp. l.408 for instance). As such, clarify the novelty of your results.
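For reference, the QM-AM (quadratic mean vs. arithmetic mean) inequality that this review asks the authors to explain states that, for nonnegative reals $x_1, \dots, x_n$, the quadratic mean dominates the arithmetic mean:

```latex
\sqrt{\frac{x_1^2 + x_2^2 + \cdots + x_n^2}{n}}
\;\ge\;
\frac{x_1 + x_2 + \cdots + x_n}{n},
```

with equality if and only if $x_1 = x_2 = \cdots = x_n$. It is a special case of the power-mean inequality (and follows from Cauchy-Schwarz).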