NeurIPS 2020

Spike and slab variational Bayes for high dimensional logistic regression


Review 1

Summary and Contributions: This paper establishes optimal bounds for variational Bayes (VB) in a high-dimensional sparse logistic regression model and proposes a VB algorithm that the authors empirically show to be an appealing alternative to existing procedures.

Strengths: Disclaimer first: Bayesian inference is not my area, so my evaluation is limited by my unfamiliarity with the large literature along this very interesting line of research. To me, the theoretical bounds established in this paper are meaningful and interesting, and if they are new, they add value. One potential concern is how much more work had to be done given what is already known for linear regression, where I believe much is understood; however, since I have not worked much on VB, I will leave this point to the other referees and the AC to comment on. A new VB algorithm is also proposed and empirically shown to perform better than existing ones. I am not sure how much novelty there is, though; the algorithm looks standard to me.

Weaknesses: It seems that the authors did not analyze the convergence of their algorithm. I would suggest that the authors comment on the theoretical validity of their algorithm.

Correctness: The results read as meaningful to me.

Clarity: The paper is well written, in my view.

Relation to Prior Work: I think so; but again, VB is not in my area.

Reproducibility: Yes

Additional Feedback: Restricted to the studied problem, I would love to see more comments on the advantage of VB over frequentist approaches using, say, penalized MLE. It is my understanding that the main advantage of VB is not in estimation/prediction but in inference (e.g., establishing confidence intervals)? If so, would establishing the validity of the confidence intervals derived by VB (i.e., Bernstein-von Mises type results) be more interesting? [Update after rebuttal] I really appreciate the authors' comments on my questions. They are exceedingly clear to me and, combined with the other referees' comments on novelty, made me raise my score further. Speaking of Bernstein-von Mises type results, in case the authors missed it, V. Spokoiny has made some very exciting progress extending them to high dimensions in a general M-estimation framework; cf. https://projecteuclid.org/euclid.ba/1422884986. Of course, I believe those results are still millions of miles away from being applicable to studying VB, but they may be useful for strengthening the results of Wang and Blei (?). I sincerely hope that the authors continue their success in this rather exciting line of research!!!
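(For concreteness, the textbook Bernstein-von Mises statement I have in mind, written in my own notation rather than the paper's: under regularity conditions,

\Pi(\cdot \mid Y_1, \dots, Y_n) \approx \mathcal{N}\big( \hat\theta_{\mathrm{MLE}}, \tfrac{1}{n} I(\theta_0)^{-1} \big) \quad \text{in total variation as } n \to \infty,

so that posterior credible sets are asymptotically valid frequentist confidence sets.)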


Review 2

Summary and Contributions: The paper aims to provide statistical guarantees for the variational Bayes (VB) method when a high-dimensional logistic regression model is under consideration. The authors show that, under an appropriate prior (spike and slab with Laplace slabs), VB can achieve the minimax rate under both the \ell_2 and mean-squared prediction losses. A coordinate-ascent variational inference algorithm is introduced to compute the VB posterior. Several numerical results are presented to verify the theoretical results. In particular, both the \ell_2 and mean-squared prediction losses are controlled, and the VB posterior also controls the FDR in variable selection.
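(For reference, a standard way of writing such a prior, in notation I am assuming rather than quoting from the paper:

\pi(\theta) = \prod_{j=1}^{p} \Big[ w \, \tfrac{\lambda}{2} e^{-\lambda |\theta_j|} + (1 - w) \, \delta_0(\theta_j) \Big],

where w \in (0,1) is the prior inclusion probability, \lambda > 0 is the Laplace scale, and \delta_0 is a point mass at zero.)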

Strengths: The main contribution of the paper is to derive the minimax concentration rate for the VB posterior when a high-dimensional logistic regression model is considered. Theoretical guarantees for VB have drawn a lot of attention in recent years. The paper relies on recent breakthrough work on this topic but also makes its own contributions.

Weaknesses: I think several key reference papers are not cited. These include Skinny Gibbs (Narisetty et al., 2019, JASA), which shows strong variable selection consistency under a spike and slab prior for high-dimensional logistic regression models and proposes an efficient sampling algorithm, and the paper by Ran Wei and Subhashis Ghosal (2019), which obtains minimax posterior contraction rates for high-dimensional logistic regression models under a wide class of shrinkage priors. In the simulation part, I think it would be more convincing if the authors also included a comparison with Skinny Gibbs.

Correctness: I did not check the details of the proof. It seems that the outline is correct.

Clarity: The paper is well written and easy to follow.

Relation to Prior Work: As mentioned before, I think several key reference papers are missing.

Reproducibility: Yes

Additional Feedback: ------- after reading the authors' feedback ------- The reason I say Wei and Ghosal's paper is missing is that it is not mentioned when the authors review the relevant work; it is only mentioned to explain technical assumptions. I also hope the authors can add the comparison with Skinny Gibbs. My score remains the same as before.


Review 3

Summary and Contributions: [Update after rebuttal] The rebuttal addressed my concern about the use of the surrogate KL well and, to some extent, my comment on practical relevance; hence, I have raised my score to 7. The paper establishes theoretical guarantees for variable selection with a spike and slab prior and a mean-field VB approximation. This is a highly relevant topic given the recent renewed interest in this area.

Strengths: The paper has a solid theoretical grounding. I only skimmed the proofs, but they appear correct and carefully written.

Weaknesses: The main limitation is practical relevance. Since the CAVI algorithm for minimizing the original KL divergence is challenging to derive, an alternative is developed that minimizes a surrogate KL in Equation (10). Some explanation and motivation for using this surrogate should be provided; it would also be great if the authors could comment on the efficiency of this surrogate target.
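To make the surrogate issue concrete for other readers, here is a minimal sketch of what such a CAVI scheme can look like, assuming the surrogate is the standard Jaakkola-Jordan quadratic lower bound on the logistic likelihood and substituting Gaussian slabs for the paper's Laplace slabs so that every update is closed form; the function names and these simplifications are mine, not the authors'.

```python
# Hypothetical CAVI sketch for spike-and-slab VB logistic regression, using the
# Jaakkola-Jordan quadratic bound as a surrogate objective. Simplification
# (mine, not the paper's): Gaussian N(0, tau2) slabs instead of Laplace slabs.
import numpy as np

def jj_lambda(xi):
    """lambda(xi) = tanh(xi/2) / (4*xi); the xi -> 0 limit is 1/8."""
    xi = np.maximum(np.abs(xi), 1e-8)
    return np.tanh(xi / 2.0) / (4.0 * xi)

def cavi_spike_slab_logistic(X, y, tau2=1.0, w=0.1, n_iter=100):
    n, p = X.shape
    kappa = y - 0.5                  # (y_i - 1/2) coefficient from the JJ bound
    gamma = np.full(p, w)            # q(z_j = 1): inclusion probabilities
    mu = np.zeros(p)                 # slab means
    sigma2 = np.full(p, tau2)        # slab variances
    xi = np.ones(n)                  # tightness parameters of the JJ bound
    logit_w = np.log(w / (1.0 - w))
    for _ in range(n_iter):
        lam = jj_lambda(xi)
        m = gamma * mu               # E_q[theta_j]
        for j in range(p):
            xj = X[:, j]
            # The quadratic surrogate makes the theta_j update conjugate:
            a = 2.0 * np.sum(lam * xj ** 2) + 1.0 / tau2
            resid = X @ m - xj * m[j]    # sum_{k != j} x_{ik} E[theta_k]
            b = np.sum(kappa * xj) - 2.0 * np.sum(lam * xj * resid)
            sigma2[j] = 1.0 / a
            mu[j] = b / a
            logit_g = (logit_w + 0.5 * np.log(sigma2[j] / tau2)
                       + 0.5 * mu[j] ** 2 / sigma2[j])
            gamma[j] = 1.0 / (1.0 + np.exp(-logit_g))
            m[j] = gamma[j] * mu[j]
        # The optimal xi_i satisfies xi_i^2 = E_q[(x_i^T theta)^2]
        second = gamma * (mu ** 2 + sigma2)          # E_q[theta_j^2]
        xi = np.sqrt((X @ m) ** 2 + (X ** 2) @ (second - m ** 2))
    return gamma, mu, sigma2
```

The appeal of the surrogate is visible here: bounding the logistic term by a quadratic in theta makes each coordinate update conjugate, at the price of optimizing a bound on the KL objective rather than the KL itself.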

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: line 25 "explainability and interpretability": what is the difference between these two terms?


Review 4

Summary and Contributions: The authors establish non-asymptotic theoretical guarantees for the variational Bayes algorithm and illustrate its improved performance relative to existing sparse VB approaches. The results highlight that the variational approximation using Laplace slabs outperforms the VB method with Gaussian slabs available in the literature.

Strengths: The strength of the article is the theoretical guarantees obtained for optimal concentration of the VB posterior. The VB approach is shown to be faster than other approaches, though the gain in accuracy is not at the same scale.

Weaknesses: It is claimed that their approach can be used in high-dimensional models where other approaches based on the EM algorithm or MCMC are not computable. Though I agree with this, I think it would be better to support the claim with an application, by applying the methods to large models and carrying out a model assessment. Also, a discussion of how sensitive the results in Table 1 are to the settings of the hyperparameters would enhance the quality of the results.

Correctness: The results seem correct, to the best of my knowledge.

Clarity: Yes, the paper is well written, though some clarifications could be added on the experimental settings.

Relation to Prior Work: The paper seems to fairly acknowledge the prior work available in the literature.

Reproducibility: Yes

Additional Feedback: