NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:5000
Title:Algorithmic Analysis and Statistical Estimation of SLOPE via Approximate Message Passing

Reviewer 1

The main contribution of this work is to show that the approximate message passing (AMP) algorithm, which has originally been developed in the context of losses with separable penalties, can be adapted for a specific non-separable penalty. It moreover focus on proving that the resulting algorithm successfully converges to the correct solution. It is important to mention that the focus here is on providing an AMP-based solution to the problem, as similar results for this penalty had already been obtained using different techniques in [20]. The mathematical analysis performed seems sound, and while the framework used has already been introduced in [7], new techniques were employed in order to deal with the possibly non-strongly convex loss. While AMP algorithms have gained prominence in the last decade, there remains much work to be done before they become practical, mostly on providing convergence guarantees for arbitrarily distributed data. Adapting AMP to specific losses is important, but does not necessarily translate into a big impact for the machine learning community. As such, I believe this work belongs to more specialized venues rather than at NeurIPS. Minor points follow: - the notation $|b|_{(i)}$ is not clear; order statistics should be defined to the unfamiliar reader - typo on caption for Fig/Table 1: "hasi.i.d." - on page 2: emphasize experiments are on Gaussian i.i.d. data # Post-rebuttal addendum I believe the authors have put a great deal of effort on the rebuttal, answering most of the points raised in the reviews. Moreover, I am impressed they were able to come up with a SLOPE variant of VAMP in such a short notice. This new algorithm, together with remaining results presented, addresses my concerns regarding the usefulness of the proposed approach. I have thus decided to increase my score to 7, in spite of my belief that the paper is perhaps a bit too specific. I recommend the authors update the manuscript with the information contained in the rebuttal, in particular regarding the possibility of having alternative message-passing algorithms that are more stable in practice.

Reviewer 2

This is a solid paper in the line of work that provides sharp results for linear estimation problems with random iid matrices/designs. It extends the approximate message passing algorithm and the proof of its state evolution to the non-separable penalty called SLOPE. At the same time as the author note, AMP has already been extended to non-separable penalties, e.g. [7], and the performance of the SLOPE penalty was already analysed using other methods recently (ref. [20].). This together with the results on [12] that show limitations of SLOPE common to all convex penalties, limits somewhat the originality and significance of the work. Suggestions: ** It could be useful to the readers if the authors discuss in related work the main results of ref. [12] to highlight the gap that is obtained between optimal estimators and any convex penalty. ** The discussion of AMP applied to non-separable penalties could be extended, including for instance Manoel, Krzakala, Varoquaux, Thirion, Zdeborova, "Approximate message-passing for convex optimization with non-separable penalties" --------- Post-feedback: I have read the feedback and the other reviews. I maintain my score. The paper presents theoretical contributions of interest towards analysis of AMP with non-separable convex penalties. But I do not consider the theoretical contribution groundbreaking and I consider the algorithm limited in applicability because of the convexity of the penalty and the assumptions on the matrix.

Reviewer 3

The submission is original and very clearly written. The contribution is significant if one is a theoretician and okay with large iid Gaussian designs and iid ground truth "beta" with known distribution "B". Indeed, the work is mathematically deep compared to typical NeurIPS papers. The numerical simulations show that AMP is very fast relative to other typical algorithms, at least when the measurement matrix is sufficiently large and iid Gaussian, giving some practical motivation to the effort. But, from a more practical perspective, there are serious limitations. First, if the matrix "X" is non-iid or non-zero-mean, AMP may diverge, despite the fact that the optimization objective (1.2) is convex. Second, to use AMP, we must translate the lambda parameters in (1.2) to alpha parameters somehow, and the solution proposed in (2.9) requires that the user knows the distribution of the ground truth "B". Both are unlikely to occur in practice. Now, if we accept the large iid X assumption and the AMP framework, then there are various other approaches that one could consider, and it would be good to understand how these compare to SLOPE. For example, if one knows the distribution of the ground truth, then one could use the MMSE denoiser and presumably arrive at a much better solution than SLOPE (in the sense of MSE). How much better would be interesting to know. And if the distribution of the ground truth was unknown, one could still assume a parametric distribution and tune its parameters using EM- or SURE-based AMP strategies. These two would be valid competitors of SLOPE in large iid Gaussian regime. ====================== UPDATE AFTER REBUTTAL ====================== I thought that the authors did a good job preparing the rebuttal. As they say, the main motivation of SLOPE is to perform variable selection while controlling FDR; the goal is not to minimize MSE. But my main concern, which still stands, has to do with the possibility for AMP to diverge in practice, in the case that the matrix is not drawn iid sub-Gaussian. This certainly happens in medical imaging applications and it may very well happen in some applications of interest to the statistics community. Demonstrating that it works for one application (e.g., GWAS) does not guarantee that it will work for all. One small comment is that the two non-iid-Gaussian examples in the rebuttal seem to be in the class of iid-sub-Gaussian matrices (certainly the Bernoulli example is) where AMP is already proven to work. So, those examples are not very convincing regarding universality.