NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:3656
Title:Learning Sparse Distributions using Iterative Hard Thresholding

Reviewer 1

Post-rebuttal: I have downgraded my overall score to 7. I am troubled by the lack of motivation (and that, in the rebuttal, the authors defer more discussion of model compression to future work). I would also have liked to see more details in the rebuttal about the "more comprehensive discussion" regarding alternative algorithms.

ORIGINALITY
===========
Motivated by previous work on modeling priors for functional neuroimaging data as sparse distributions, this paper studies the problem of learning a sparse distribution that minimizes a convex loss function. Apparently, this is the first work that studies this problem for general loss functions. The goal of the work is to adapt the well-known IHT algorithm, a form of projected gradient descent, to this problem. Again, as far as I can see, this approach is original to this work. The mathematical techniques are based on standard approaches and are not very novel in my opinion.

QUALITY & CLARITY
=================
The paper is mostly a pleasure to read. It tackles head-on the stated goal of investigating the performance of IHT, identifies a computational barrier to efficient implementation, and then gives general conditions (strong convexity, smoothness, Lipschitzness) that enable a greedy heuristic to be correct. The proofs are solid but not very complicated.

The experimental section is brief and a bit unsatisfactory, as the comparisons are against simple baselines. Why not try some others?

* Analogously to basic thresholding, first solve the problem without the sparsity constraint, then take the heaviest k coordinates, and solve the problem restricted to distributions with support on just those coordinates.
* Consider an alternate projection algorithm in IHT, where you take the heaviest k coordinates of q and find a distribution supported on those k coordinates that is closest to q.

Also, it would be interesting to check whether the support set changes during the iterations in the experiments. Why not try fixing the support set after the first (or first few) iterations of IHT and doing the exact projection onto that set thereafter?

SIGNIFICANCE
============
To the extent that optimization over sparse distributions is an important problem, the contributions in this paper are very significant and relevant to the NeurIPS community. However, I feel the authors should do a better job with the motivation. Is it just [5] that shows the utility of modeling distributions as sparse? Why do they not discuss the model compression problem that is used for the experiments?
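The alternate projection suggested in this review can be sketched concretely. Below is a minimal NumPy sketch, assuming "closest" means closest in Euclidean distance (the function names `project_to_simplex` and `restricted_projection` are hypothetical, not from the paper):

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex
    (the standard sort-and-threshold algorithm)."""
    u = np.sort(v)[::-1]                      # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1)        # shift that makes entries sum to 1
    return np.maximum(v - theta, 0.0)

def restricted_projection(q, k):
    """Keep the k heaviest coordinates of q, then return the closest
    distribution (in Euclidean distance) supported on those coordinates."""
    support = np.argsort(q)[-k:]              # indices of the k heaviest entries
    p = np.zeros_like(q)
    p[support] = project_to_simplex(q[support])
    return p
```

For example, for q = [0.5, 0.3, 0.1, 0.1] and k = 2, the sketch keeps the two heaviest coordinates and redistributes the remaining mass over them, yielding a valid 2-sparse distribution.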

Reviewer 2

The paper studies Iterative Hard Thresholding (IHT) in distribution space. IHT algorithms have been studied before. This work aims at lifting the solutions provided by IHT to distribution space, i.e., to distributions with (usually many) hard zeros on the discrete space they are defined on. The overall approach is defined and investigated for relatively general functionals F[.]. The definition of the general framework is an achievement by itself […] 'minGreedy' runs. A large variance would then be an advantage. To turn the argument around: could one make the IHT more stochastic? And is it true that there is no variance in the 20 runs because the algorithm is deterministic? What about different starting conditions?

After rebuttal: I do not have the feeling that all my questions were understood correctly, but many were.

Reviewer 3

The main contribution is to provide an algorithm to learn sparse distributions. The paper would benefit from improving the way the methods are explained and from identifying its contributions more clearly. In particular:

- Some statements could be commented on more generally, for instance "Interestingly, $D_k \subset D_{k'} \subset P$ in general" (l.79), in particular if they are novel. Explain what the "QM-AM inequality" is.
- The resulting algorithm provides further evidence that greedy algorithms are quite powerful at solving NP-hard problems, yet this statement is not completely falsifiable (are there alternatives to using greedy methods?), and the novelty of this particular application to distributions should be justified. Indeed, it seems that most theoretical results come from extending vector-sparsity results to this functional setting (see supp. l.408 for instance). As such, clarify the novelty of your results.
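For reference, the QM-AM (quadratic mean vs. arithmetic mean) inequality that this review asks the authors to explain states that, for nonnegative reals $x_1, \dots, x_n$, the quadratic mean dominates the arithmetic mean:

```latex
\sqrt{\frac{x_1^2 + x_2^2 + \cdots + x_n^2}{n}}
\;\ge\;
\frac{x_1 + x_2 + \cdots + x_n}{n},
```

with equality if and only if $x_1 = x_2 = \cdots = x_n$. It is a special case of the power-mean inequality (and follows from Cauchy-Schwarz).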