NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID 7343: Variational Bayes under Model Misspecification

### Reviewer 1

The paper characterizes the asymptotic behavior of the variational Bayes approximation under model misspecification. I found none of the results particularly surprising, since they are intuitive and expected, especially in light of the results of [17] & [29]. (The proof also seems to be a straightforward extension.) I also find that the authors make too big a deal of the model misspecification error asymptotically dominating the variational approximation error, making this old observation look like a new contribution. (As an approximation becomes more accurate asymptotically, the model misspecification is obviously going to dominate.) In particular, I find Section 2.2 verbose and repetitious in explaining the obvious intuition. The authors should be more explicit about the limitations of the result discussed in Section 2.3. I believe the local asymptotic normality assumption (Assumption 5) often fails for the type of models considered in Section 2.3 (e.g., the hyperparameters of Gaussian process regression do not concentrate under in-fill asymptotics). All that said, the result is a welcome addition to the literature on the theoretical properties of the variational Bayes approximation under model misspecification. I also appreciate that the authors kept the intuitions very clear (without making things unnecessarily complicated), and the paper is generally very easy to read.

Minor comments:

- Assumptions 4 & 5 in the supplement are not just analogous but essentially identical to Assumptions 2 & 3. In that case, why not state the assumptions for Section 2.3 more clearly in the main manuscript?
- Line 127, tests $\phi_n$: a test is undefined. I suppose it is a compactly supported smooth function, but it is certainly not in the standard vocabulary of a stats/ML audience.
- Line 166, limiting exact posterior: this terminology threw me off and was very confusing, because "limiting" and "exact" are contradictory. I suggest calling it just the limiting posterior.
- Line 166, $\theta$ vs $\tilde{\theta}$: does the parameter with a tilde play a different role? If not, it is just confusing.
- Line 172, $\mathcal{Q}^d$: is this the same as $\mathcal{Q}$? I suppose it is meant to emphasize the dependence on $d$, but that dependence was always there, and the sudden change of notation is just confusing.
- Line 289, "simulation corroborates... the limiting VB posterior coincide with the limiting exact posterior": I don't think this claim is established. Looking at RMSE alone does *not* show that the two distributions (VB and MCMC) are close.

Response to author feedback: Straightening out the main contributions and clarifying the limitations will certainly make the paper more worthwhile to readers. With a successful revision, the paper would deserve a score of 7 (though there is unfortunately no second round of review).
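The RMSE point in the Line 289 comment is easy to illustrate with a toy example (entirely of my own construction, not from the paper): two "posteriors" with the same mean, and hence essentially the same RMSE for the posterior-mean estimate, can have wildly different spread.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.0

# Two hypothetical posteriors with the same location but very different
# spread: an overconfident VB approximation vs. a wide MCMC posterior.
vb_draws = rng.normal(loc=0.1, scale=0.05, size=100_000)
mcmc_draws = rng.normal(loc=0.1, scale=1.0, size=100_000)

# The RMSE of the posterior-mean point estimate is essentially identical...
rmse_vb = abs(vb_draws.mean() - theta_true)
rmse_mcmc = abs(mcmc_draws.mean() - theta_true)

# ...yet the two distributions are far apart, e.g. in their 95% intervals.
ci_vb = np.quantile(vb_draws, [0.025, 0.975])
ci_mcmc = np.quantile(mcmc_draws, [0.025, 0.975])
```

Matching RMSE only constrains the point estimates; establishing that the VB and MCMC posteriors coincide would require comparing the full distributions (e.g., credible-interval widths or a distributional distance).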

### Reviewer 2

The paper is clearly written and very well presented. Although inherently technical, the results are explained both precisely and in plain language, with proof sketches to convey intuitive understanding. This paper is a great model of clear communication of technical results. The results are novel to my knowledge and well situated in terms of previous literature. I found no obvious technical errors, although I wasn't able to closely check the proofs in the supplement. My impression is that the results themselves don't involve significant new technical ideas and are more or less straightforward extensions of previous theorems. Nonetheless, actually doing this technical work is a valuable contribution.

My main concern about significance, which largely applies to Bernstein-von Mises theorems more generally, is that by focusing on the asymptotic regime the work assumes away essentially all of the practically relevant structure in Bayesian inference problems. Behind all the technical machinery, the intuition behind these proofs (which, to its credit, the paper does a good job of conveying) is that for identifiable models in the iid asymptotic regime, the likelihood dominates the prior and the posterior concentrates as a normal shrinking to a point mass, so we can ignore the prior and mostly ignore posterior uncertainty. But if you really believe you're in this regime, why not save yourself the trouble of VB and just fit an MLE? The argument that the MLE minimizes the KL divergence between the true data distribution and a misspecified model is so trivial that it's more of an observation (the non-constant part of the KL is just the expected model log likelihood) than an argument. This work dresses up that argument with substantially more mathematical machinery, but not (as far as I can tell) much more insight. It tells us that if you run VB in a setting where there is no uncertainty to quantify, it preserves the properties of a point estimate.
This is well and good -- it's always possible that something could have gone wrong, and there's some pedantic value in checking that it doesn't -- but it's also kind of not the point of VB. Practical Bayesian inference involves quantifying uncertainty; without that, why are we here? We only get to do so much with our wild and precious lives, and it's not my place to question the authors' choices, but I can't help but view this as something of an example of math for math's sake with limited takeaways for the broader field. All that said, theoretical papers are in scope for NeurIPS, and this one is well done within (as far as I'm qualified to judge) the standards of the community.
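For the record, the "trivial" observation about the MLE and KL can be spelled out in two lines (writing $p^*$ for the true data distribution and $p_\theta$ for the misspecified model; the notation here is mine, not the paper's):

```latex
\mathrm{KL}(p^* \,\|\, p_\theta)
  = \underbrace{\mathbb{E}_{p^*}\!\left[\log p^*(x)\right]}_{\text{constant in }\theta}
  \;-\; \mathbb{E}_{p^*}\!\left[\log p_\theta(x)\right],
\qquad\text{so}\qquad
\operatorname*{arg\,min}_{\theta} \mathrm{KL}(p^* \,\|\, p_\theta)
  = \operatorname*{arg\,max}_{\theta} \mathbb{E}_{p^*}\!\left[\log p_\theta(x)\right].
```

Since $\tfrac{1}{n}\sum_{i} \log p_\theta(x_i) \to \mathbb{E}_{p^*}[\log p_\theta(x)]$ by the law of large numbers, the MLE converges to the KL minimizer of the misspecified family.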