NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:6102
Title:Sparse Variational Inference: Bayesian Coresets from Scratch

Reviewer 1

I thank the authors for the their clear response which solidified my view that this is a good contribution to NeurIPS In future work, I think it would be worth it to not only compare their method to normal coreset approaches but also to cheap-and-basic variational approximations like the Laplace approximation, if only to show me that I am wrong in suggesting that this could provide an alternative Best ----------------------------------------- The author(s) tackle the problem of Bayesian inference on large datasets. Coresets provide one approach to such solutions: simply sub-sample and reweight in a clever fashion the data so that inference on the sub-sampled data is almost the same as inference on the full dataset. They propose a variational inference approach which naturally builds a coreset-type approximation of the posterior. More precisely, they observe that the posterior is always in a (very high-dimensional) exponential family, with sufficient statistics the log-prior and the log-likelihoods of each datapoint. Thus, choosing an approximation which reweights each log-likelihood corresponds to seeking an approximation inside an exponential family, thus yielding computational tractability of the variational optimization of the KL divergence. In contrast, existing methods instead use Hilbert distances between the log-likelihoods. Their algorithm iteratively constructs a subset of the data by iterating the following two steps: 1. Select a new point to add to the subset. This is achieved by finding the minimizer among a random subset of the remaining datapoints of a stochastic approximation of the gradient of the KL divergence against the weights. 2. Recompute the weights of the log-likelihoods for all datapoints in the active subset. They propose a theoretical analysis of their algorithm and show that the exact version (no stochastic approximation of the gradient and no subsetting of the data in step 1 above) as a geometric approximation (Prop.1 and 2, line 200-208). They perform two tests of their method. The first is done on a Gaussian example, yielding closed-form updates. Their method is very good in this situation. The second is done on logistic and Poisson regression. This second comparison appears inconclusive to me and could be highly misleading due to the fact that it mixes together six different datasets. It is unclear to me whether the situation considered in this second experiment would be highly favorable to their method or whether it represents an honest test. I believe that this article is a strong accept but I am not completely certain in this opinion, due to my review time being limited. While the system flagged this submission as potentially similar to submission 1914, I am confident that there is zero overlap between the two submissions. Major remarks: - The algorithm has two stochastic steps: we’re sampling from the current approximation of the posterior, and we’re sub-sampling the potentials that we are willing to examine at each step. It would be useful to discuss the influence of the number of samples from the posterior-approximation and number of potentials on the precision of the algorithm. How should I choose these values in practice? What are their impact? Am I correct in stating that the guarantees of the algorithm only hold on the version of the algorithm which does not sub-sample the potentials and which examines all of them? - I have not understood where you discuss the impact of the speed parameter \gamma_t which appears in algorithm 1. Can you please discuss further how you choose its value and how its impact is felt? - Fig.2 does not feel very informative due to the fact that it mixes so many different sub-experiments. - This is more of a direction for future work, but one thing I’m curious about is how this method compares to Laplace approximations. In a large dataset, the Laplace approximation of the posterior distribution should be decent and should be easy to compute (or possibly at least easier to compute than the method you propose here). No minor remarks

Reviewer 2

# Strengths - Clear motivation and structure of the derivations. The paper and the introduced method are clearly motivated. The structure of the paper and the individual derivations is well structured and presented. - Some theoretic results from an information geometric viewpoint. The authors are able to show how their approach can be not only motivated by itself (Section 3), but also show how their approach can be nicels interpreted from an information gometry standpoint where they are able to show the nice result that the method optimizes a bound on the a symmetric version of the KL divergence. # Weaknesses - Seems more incremental than novel While the proposed model gives a new, clean approach on how to construct coresets, that is well motivated if we regard it in isolation, that benefit becomes a lot less clear if we take prior work on coresets into consideration. In that case it looks rather incremental and is lacking a strong motivation as to why it should be used compared to prior work in the area of coreset construction. - A weak experimental evaluation The method is only evaluated on a small set of datasets that are either synthetic or very low dimensional (D=10). Yet even there the authors can only demonstrate roughly similar performance to prior work, however at greatly increased cost. # Minor Typo In Equation (24), it should be \Sigma_w instead of \Sigma_p in the \mu_w computation unless I am mistaken. # Recommendation The work has potential, but there are still some weaknesses that should be addressed during the rebuttal, before I can recommend a clear accept POST-REBUTTAL: Having read the author response, I am convinced further about the significance of the contribution. Hence, I increase my score to 7.

Reviewer 3

Summary: -------- The paper proposes a new perspective on constructing Bayesian coresets. It is shown that the coresets are in fact in the exponential family. Based on that observation, the construction is formulated as an exponential family variational inference problem with a sparsity constraint. The optimization problem can be solved via greedy optimization and an interesting information-geometric view is provided. Review: ------- First of all, the paper is well written and the math seems to be sound. The derivations are easy to follow and authors do a good job communicating the main results. I find the exponential family view on the Bayesian coreset construction quite interesting. The greedy optimization method seems to be a "natural" way to solve the problem. As a second interesting contribution, I enjoyed the information geometric interpretation which gives a unifying view on coreset construction algorithms. I think that these theoretical insights are from interest to the community and could inspire new research in that field. However, the empirical evaluation seems to be quite limited. Based on the provided experiments it is hard to tell how significant this work is in practical terms. My concerns are: - The considered problems seem to be rather simple. They are more of a toy experiment nature (Gaussian toy data, logistic regression, Poisson regression). The significance of the experiments could be increased by considering more challenging (real-world) problems. - I have a couple of concerns about the plots in figure 3. - First of all, the plot (a) is hardly readable (too many lines in one plot). Also, there is no legend provided, which makes a direct comparison of the methods for one model impossible. - A more serious concern is that in plot (a), the GIGA optimization curves don't seem to be converged. Therefore, the claim that the proposed method achieves a better KL divergence using a smaller coreset size is not really justified. - In plot (c), why does the proposed method sometimes achieve very bad KL (points on the top)? - Additional to the KL plots, an experiment with a downstream task (i.e. MC sampling using the different coresets) would be interesting. Minor Comments: --------------- - Eq. 5: \pi_1 is used before it's actually defined - Is the fact that the KL divergence here is a Bregman div anywhere used? Conclusion: ----------- The proposed method is an interesting contribution and provides new insights on the construction of Bayesian coresets. However, in my opinion, the experiments don't show that the proposed method is superior in practice. First, the considered problems are quite simple. Second, the experiments seem to be immature and the significance is limited. I think the claims about the method are not justified by the experiments. By addressing the issues, the paper would have a much higher impact and would be suitable for acceptance at NeurIPS. I encourage the authors to improve the experiments and think the paper would then be a strong contribution. But in the current state, I tend to reject it.